FUZZY STRING SEARCHING
'Fuzzy string search' is the name that is used for a category of techniques for
finding strings that approximately match some given pattern string. It may also be known as 'approximate' or inexact matching.
Approximate string searching has two different flavors:
:finding one or more matching substrings of a text, and
:finding similar strings in a string set, often referred to as a dictionary.
Approximate string searching has many application areas including
information retrieval, spellchecking and computational biology .
The cornerstone of any approximate searching method is a 'similarity function' or 'string metric'. Among the most commonly used similarity functions are Levenshtein distance (a type of edit distance) and n-gram distance. The latter is based on counting of the number of common n-grams, and is used mostly for filtering. In contrast to n-gram distance, Levenshtein distance is a de-facto standard similarity function. It has several extensions. One well known extension is Damerau-Levenshtein distance that counts transposition as a single edit operation. Another extension is the so-called generalized or weighted Levenshtein distance. It assigns different costs to elementary edit operations. Ukkonen described even more sophisticated similarity function where edit operations go beyond single-character insertions, deletions and substitutions and include substitutions of arbitrary-length strings.
Traditionally, approximate string matching algorithms
are classified into two categories: on-line and off-line. With on-line algorithms
the pattern can be preprocessed before searching but the text cannot.
In other words, on-line techniques do searching without indexation. Early
algorithms for on-line approximate matching were suggested by Wagner and
Fisher and by Sellers. Both algorithms are based on
dynamic programming but solve different problems. Sellers' algorithm
searches approximately for a substring in a text while the algorithm of Wagner
and Fisher calculates Levenshtein distance, being appropriate for dictionary fuzzy search only.
On-line searching techniques were repeatedly improved. Perhaps the most
famous improvement is the bitap algorithm (also known as the shift-or and shift-and algorithm), which is very efficient for relatively short pattern strings. Bitap algorithm is the heart of Unix searching utility agrep. An excellent review of on-line searching algorithms was done by G. Navarro.
Although very fast on-line techniques exist, their
performance on large data is unacceptable.
Text preprocessing or indexing makes searching dramatically faster.
Today, a variety of indexing algorithms are presented. Among them are suffix trees, metric trees and n-gram methods.
For a detailed list of indexing techniques see the paper of Navarro ''et al.''
★ Soundex
★ Agrep
★ Spellchecker
★ String searching algorithm
★ Wildcard character
★ Levenshtein distance
★ Computer-assisted translation
★ R. Baeza-Yates and G. Navarro. A faster algorithm for approximate string matching. In Dan Hirchsberg and Gene Myers, editors, Combinatorial Pattern Matching (CPM'96), LNCS 1075, pages 1--23, Irvine, CA, Jun 1996.
★ D. Gusfield. Algorithms on strings, trees, and sequences: computer science and computational biology. Cambridge University Press, New York, NY, USA, 1997.
★ R. Baeza-Yates and G. Navarro. Fast Approximate String Matching in a Dictionary.Proc. SPIRE'98. IEEE CS Press, pages 14-22.
★ G. Myers. A fast bit-vector algorithm for approximate string matching based on dynamic programming, Journal of the ACM (JACM) 46 (3), May 1999, 395 - 415.
★ G. Navarro. A guided tour to approximate string matching. ACM Computing Surveys (CSUR) archive 33(1), pp 31-88, 2001.
★ G. Navarro, Ricardo Baeza-Yates, E. Sutinen and J. Tarhio. Indexing Methods for Approximate String Matching.IEEE Data Engineering Bulletin 24(4):19-27, 2001.
★ P.H. Sellers. The Theory and Computation of Evolutionary Distances: Pattern Recognition. Journal of Algorithms, 1:359-373, 1980.
★ E. Ukkonen, Algorithms for approximate string matching. Information and Control 64, 100-118. 1985.
★ R. Wagner and M. Fisher, The string-to-string correction problem, Journal of the association for computing machinery, vol. 21, pp. 168 173, 1974.
★ J. Zobel, P. Dart. Finding approximate matches in large lexicons. Software-Practice & Experience 25(3), pp 331-345, 1995.
★ Efficient POSIX compliant regexp matching library with support for approximate matching
★ Site devoted to fuzzy searching and information retrieval
★ The description of Levenshtein algorithm
★ Project Dedupe
★ The Fuzzy Gazetteer: A fuzzy string search engine for place names worldwide
★ Source code for n-gram based matching
★ Siderite's Sift2: An empiric, but fairly accurate and very fast edit-distance algorithm
★ Taxonomic nomenclature checker
finding strings that approximately match some given pattern string. It may also be known as 'approximate' or inexact matching.
Approximate string searching has two different flavors:
:finding one or more matching substrings of a text, and
:finding similar strings in a string set, often referred to as a dictionary.
Approximate string searching has many application areas including
information retrieval, spellchecking and computational biology .
| Contents |
| Similarity functions |
| On-line vs. off-line |
| See also |
| References |
| External links |
Similarity functions
The cornerstone of any approximate searching method is a 'similarity function' or 'string metric'. Among the most commonly used similarity functions are Levenshtein distance (a type of edit distance) and n-gram distance. The latter is based on counting of the number of common n-grams, and is used mostly for filtering. In contrast to n-gram distance, Levenshtein distance is a de-facto standard similarity function. It has several extensions. One well known extension is Damerau-Levenshtein distance that counts transposition as a single edit operation. Another extension is the so-called generalized or weighted Levenshtein distance. It assigns different costs to elementary edit operations. Ukkonen described even more sophisticated similarity function where edit operations go beyond single-character insertions, deletions and substitutions and include substitutions of arbitrary-length strings.
On-line vs. off-line
Traditionally, approximate string matching algorithms
are classified into two categories: on-line and off-line. With on-line algorithms
the pattern can be preprocessed before searching but the text cannot.
In other words, on-line techniques do searching without indexation. Early
algorithms for on-line approximate matching were suggested by Wagner and
Fisher and by Sellers. Both algorithms are based on
dynamic programming but solve different problems. Sellers' algorithm
searches approximately for a substring in a text while the algorithm of Wagner
and Fisher calculates Levenshtein distance, being appropriate for dictionary fuzzy search only.
On-line searching techniques were repeatedly improved. Perhaps the most
famous improvement is the bitap algorithm (also known as the shift-or and shift-and algorithm), which is very efficient for relatively short pattern strings. Bitap algorithm is the heart of Unix searching utility agrep. An excellent review of on-line searching algorithms was done by G. Navarro.
Although very fast on-line techniques exist, their
performance on large data is unacceptable.
Text preprocessing or indexing makes searching dramatically faster.
Today, a variety of indexing algorithms are presented. Among them are suffix trees, metric trees and n-gram methods.
For a detailed list of indexing techniques see the paper of Navarro ''et al.''
See also
★ Soundex
★ Agrep
★ Spellchecker
★ String searching algorithm
★ Wildcard character
★ Levenshtein distance
★ Computer-assisted translation
References
★ R. Baeza-Yates and G. Navarro. A faster algorithm for approximate string matching. In Dan Hirchsberg and Gene Myers, editors, Combinatorial Pattern Matching (CPM'96), LNCS 1075, pages 1--23, Irvine, CA, Jun 1996.
★ D. Gusfield. Algorithms on strings, trees, and sequences: computer science and computational biology. Cambridge University Press, New York, NY, USA, 1997.
★ R. Baeza-Yates and G. Navarro. Fast Approximate String Matching in a Dictionary.Proc. SPIRE'98. IEEE CS Press, pages 14-22.
★ G. Myers. A fast bit-vector algorithm for approximate string matching based on dynamic programming, Journal of the ACM (JACM) 46 (3), May 1999, 395 - 415.
★ G. Navarro. A guided tour to approximate string matching. ACM Computing Surveys (CSUR) archive 33(1), pp 31-88, 2001.
★ G. Navarro, Ricardo Baeza-Yates, E. Sutinen and J. Tarhio. Indexing Methods for Approximate String Matching.IEEE Data Engineering Bulletin 24(4):19-27, 2001.
★ P.H. Sellers. The Theory and Computation of Evolutionary Distances: Pattern Recognition. Journal of Algorithms, 1:359-373, 1980.
★ E. Ukkonen, Algorithms for approximate string matching. Information and Control 64, 100-118. 1985.
★ R. Wagner and M. Fisher, The string-to-string correction problem, Journal of the association for computing machinery, vol. 21, pp. 168 173, 1974.
★ J. Zobel, P. Dart. Finding approximate matches in large lexicons. Software-Practice & Experience 25(3), pp 331-345, 1995.
External links
★ Efficient POSIX compliant regexp matching library with support for approximate matching
★ Site devoted to fuzzy searching and information retrieval
★ The description of Levenshtein algorithm
★ Project Dedupe
★ The Fuzzy Gazetteer: A fuzzy string search engine for place names worldwide
★ Source code for n-gram based matching
★ Siderite's Sift2: An empiric, but fairly accurate and very fast edit-distance algorithm
★ Taxonomic nomenclature checker
This article provided by Wikipedia. To edit the contents of this article, click here for original source.
psst.. try this: add to faves

العربية
中国
Français
Deutsch
Ελληνική
हिन्दी
Italiano
日本語
Português
Русский
Español