Help on fuzzy search

Overview of the fuzzy search algorithm used in quick search by name functions

Terms that are often misspelled can be a problem for the users of the application. Names, for example, are variable length, can have strange spellings, and they are not unique. Words can be misspelled or have multiple spellings, especially across different cultures or national sources.

To solve this problem, we need phonetic algorithms which can find similar sounding terms and names. Just such a family of algorithms exist and are called SoundExes, after the first patented version. A Soundex search algorithm takes a word, such as a person's name, as input and produces a character string which identifies a set of words that are (roughly) phonetically alike. It is very handy for searching large databases when the user has incomplete data.

The original Soundex algorithm was patented by Margaret O'Dell and Robert C. Russell in 1918. The method is based on the six phonetic classifications of human speech sounds (bilabial, labiodental, dental, alveolar, velar, and glottal), which in turn are based on where you put your lips and tongue to make the sounds. The algorithm is fairly straight forward to code and requires no backtracking or multiple passes over the input word.

The Algorithm

  1. Capitalize all letters in the word and drop all punctuation marks. Pad the word with rightmost blanks as needed during each procedure step.
  2. Retain the first letter of the word.
  3. Change all occurrence of the following letters to '0' (zero): 'A', E', 'I', 'O', 'U', 'H', 'W', 'Y'.
  4. Change letters from the following sets into the digit given:
    • 1 = 'B', 'F', 'P', 'V'
    • 2 = 'C', 'G', 'J', 'K', 'Q', 'S', 'X', 'Z'
    • 3 = 'D','T'
    • 4 = 'L'
    • 5 = 'M','N'
    • 6 = 'R'
  5. Remove all pairs of digits which occur beside each other from the string that resulted after step (4).
  6. Remove all zeros from the string that results from step 5.0 (placed there in step 3)
  7. Pad the string that resulted from step (6) with trailing zeros and return only the first four positions, which will be of the form [uppercase letter] [digit] [digit] [digit].

EUNIS Database approach

In EUNIS Database application we implemented a variation of the Soundex algorithm for the 'Quick search by name' feature. It is based on the implementation of the SOUNDEX function in MySQL and a custom developed function which takes a step-by-step intelligent approach in matching the string(s) used in search with a database of pre-validated terms. If the initial query does not return any results., EUNIS Database will try to to find 'similar' terms in the database and, as Google does, will suggest the one that is closest, using the SOUNDEX variation implemented by the EUNIS team.

European Environment Agency (EEA)
Kongens Nytorv 6
1050 Copenhagen K
Denmark
Phone: +45 3336 7100