adist
Approximate String Distances
Description
Compute the approximate string distance between character vectors. The distance is a generalized Levenshtein (edit) distance, giving the minimal possibly weighted number of insertions, deletions and substitutions needed to transform one string into another.
Usage
adist(x, y = NULL, costs = NULL, counts = FALSE, fixed = TRUE, partial = !fixed, ignore.case = FALSE, useBytes = FALSE)
Arguments
x | a character vector. Long vectors are not supported. |
y | a character vector, or |
costs | a numeric vector or list with names partially matching insertions, deletions and substitutions giving the respective costs for computing the Levenshtein distance, or |
counts | a logical indicating whether to optionally return the transformation counts (numbers of insertions, deletions and substitutions) as the |
fixed | a logical. If |
partial | a logical indicating whether the transformed |
ignore.case | a logical. If |
useBytes | a logical. If |
Details
The (generalized) Levenshtein (or edit) distance between two strings s and t is the minimal possibly weighted number of insertions, deletions and substitutions needed to transform s into t (so that the transformation exactly matches t). This distance is computed for partial = FALSE
, currently using a dynamic programming algorithm (see, e.g., https://en.wikipedia.org/wiki/Levenshtein_distance) with space and time complexity O(mn), where m and n are the lengths of s and t, respectively. Additionally computing the transformation sequence and counts is O(\max(m, n)).
The generalized Levenshtein distance can also be used for approximate (fuzzy) string matching, in which case one finds the substring of t with minimal distance to the pattern s (which could be taken as a regular expression, in which case the principle of using the leftmost and longest match applies), see, e.g., https://en.wikipedia.org/wiki/Approximate_string_matching. This distance is computed for partial = TRUE
using tre by Ville Laurikari (https://github.com/laurikari/tre) and corresponds to the distance used by agrep
. In this case, the given cost values are coerced to integer.
Note that the costs for insertions and deletions can be different, in which case the distance between s and t can be different from the distance between t and s.
Value
A matrix with the approximate string distances of the elements of x
and y
, with rows and columns corresponding to x
and y
, respectively.
If counts
is TRUE
, the transformation counts are returned as the "counts"
attribute of this matrix, as a 3-dimensional array with dimensions corresponding to the elements of x
, the elements of y
, and the type of transformation (insertions, deletions and substitutions), respectively. Additionally, if partial = FALSE
, the transformation sequences are returned as the "trafos"
attribute of the return value, as character strings with elements M, I, D and S indicating a match, insertion, deletion and substitution, respectively. If partial = TRUE
, the offsets (positions of the first and last element) of the matched substrings are returned as the "offsets"
attribute of the return value (with both offsets -1 in case of no match).
See Also
agrep
for approximate string matching (fuzzy matching) using the generalized Levenshtein distance.
Examples
## Cf. https://en.wikipedia.org/wiki/Levenshtein_distance adist("kitten", "sitting") ## To see the transformation counts for the Levenshtein distance: drop(attr(adist("kitten", "sitting", counts = TRUE), "counts")) ## To see the transformation sequences: attr(adist(c("kitten", "sitting"), counts = TRUE), "trafos") ## Cf. the examples for agrep: adist("lasy", "1 lazy 2") ## For a "partial approximate match" (as used for agrep): adist("lasy", "1 lazy 2", partial = TRUE)
Copyright (©) 1999–2012 R Foundation for Statistical Computing.
Licensed under the GNU General Public License.