The
Algorithms used in the Fuzzy Matching Process:
The Dice Coefficient
- Similarity metric:
The method used to feed the string to this algorithm
is the linear q-gram (or n-gram) model:
Q
or N would be the length of the gram.
Q-gram: monograms, bigrams, trigrams, etc.
Example bigram for “Nelson”:
%N Ne el ls so on n#
The Dice Coefficient is used as the threshold
algorithm: D is Dice
Coefficient
SB is Shared Bigrams
TBg1 is total number of bigrams in Qgram1
TBg2 is total number of bigrams in Qgram2
D = (2SB)/(TBg1+TBg2)
Double the number of shared
bigrams and divide by total number of bigrams
in each string. A good Dice Coefficient value
would be a value greater than 0.2. If the Dice
value is less than 0.2, the comparisons are rejected
immediately and not analyzed by the other algorithms.
Levenshtein Edit Distance:
The Levenshtein Distance algorithm is named after
the Russian scientist Vladimir Levenshtein, who
devised the algorithm in 1965. Levenshtein Edit
Distance is the number of insertions, deletions,
or replacements of single characters that are
required to convert one string to the other.
Character transposition is
detected.
Transposition is given a cost of 1.
Transposition detection is
taken from:
Berghel, Hal ; Roach, David : "An Extension
of Ukkonen's Enhanced Dynamic Programming ASM
Algorithm"
A Levenshtein Distance of 2, 3, or more characters
is meaningless without evaluating the lengths
of the two comparison strings as well.
Longest Common Subsequence:
The LCS is calculated by the length of the longest
common, not necessarily contiguous, sub-sequence
of characters divided by the average character
lengths of both strings. A good value for LCS
is a value greater than or equal to 0.8.
Double Metaphone:
The Double Metaphone algorithm, developed by Lawrence
Phillips and published in the June 2000 issue
of C/C++ Users Journal, is part of a class of
algorithms known as "phonetic matching"
or "phonetic encoding" algorithms.
He wrote it as a replacement for SOUNDEX.
These algorithms attempt to
detect phonetic ("sounds-like") relationships
between words. For example, a phonetic matching
algorithm should detect a strong phonetic relationship
between "Nelson" and "Nilsen",
and no phonetic relationship between "Adam"
and "Nelson."
Double Metaphone is
designed primarily to encode American English
names (though it also encodes most English words
well) while taking into account the fact that
such words can have more than one acceptable pronunciation.
Double Metaphone can compute a primary and a secondary
encoding for a given word or name to indicate
both the most likely pronunciation as well as
an optional alternative pronunciation (hence the
"double" in the name).
Back to the top:
|