| The NameThesaurus is a technology for finding surname and forename variants. |
| What is a name variant ? |
| Why do we need another name matching system ? |
| What is wrong with Soundex and Metaphone ? |
| What algorithms does the NameThesaurus use ? |
| Why are weighted pairs important ? |
| What do you mean by Precision and Recall ? |
| How do I work with the NameThesaurus ? |
| Is the NameThesaurus just data ? |
| Is the NameThesaurus a Thesaurus generation Service ? |
| How fast is the NameThesaurus ? |
| Who are Image Partners ? |
What is a name variant ?
Variations can be introduced into database applications as a result of typographical errors where letters are interchanged (e.g. Nobel and Noble), substituted (e.g. Stevens and Stephens), added (e.g. Colins and Collins) or removed (e.g. Clarke and Clark). Variations can also be introduced by alternate spellings of names with the same or very similar pronunciation (e.g. Cavenagh and Kavenagh, Ewal and Yule, or Sean and Shawn). This is a particular problem where names are transcribed from the spoken word. Forename abbreviations introduce further problems, and in many cases there is little or no phonetic link (e.g. William with Wm, Billy, Bill, etc. or Elizabeth with Lizzie, E'beth, Elz, Beth, etc.). In addition, we do not want to match known masculine names with known feminine variants and vice versa (e.g. Alexander should match with Alex and Alexandar but not with Alexandria).
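The four kinds of typographical variation described above can be illustrated with a short sketch. This is not the NameThesaurus method; the function name `classify_single_edit` is hypothetical, and the sketch only recognises names that differ by a single character edit or one swap of adjacent letters.

```python
from typing import Optional


def classify_single_edit(a: str, b: str) -> Optional[str]:
    """Describe how b differs from a if they are one simple edit apart.

    Returns "transposition", "substitution", "insertion" or "deletion",
    or None when the names are identical or more than one edit apart.
    """
    a, b = a.lower(), b.lower()
    if a == b:
        return None

    if len(a) == len(b):
        # Positions at which the two names disagree.
        diffs = [i for i in range(len(a)) if a[i] != b[i]]
        if len(diffs) == 1:
            return "substitution"
        if (len(diffs) == 2 and diffs[1] == diffs[0] + 1
                and a[diffs[0]] == b[diffs[1]]
                and a[diffs[1]] == b[diffs[0]]):
            return "transposition"
        return None

    if abs(len(a) - len(b)) == 1:
        shorter, longer = (a, b) if len(a) < len(b) else (b, a)
        # Try removing each letter of the longer name in turn.
        for i in range(len(longer)):
            if longer[:i] + longer[i + 1:] == shorter:
                return "insertion" if len(b) > len(a) else "deletion"
    return None


print(classify_single_edit("Nobel", "Noble"))       # transposition
print(classify_single_edit("Colins", "Collins"))    # insertion
print(classify_single_edit("Clarke", "Clark"))      # deletion
print(classify_single_edit("Stevens", "Stephens"))  # None
```

Stevens and Stephens return None because replacing "v" with "ph" is more than one character edit. Phonetic variants (Ewal and Yule) and abbreviations (William and Wm) are also beyond a simple edit check of this kind.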
Why do we need another name matching system ?
What is wrong with Soundex and Metaphone ?
Metaphone is another algorithm that attempts to normalise words by removing vowels after the first letter and mapping common consonant variations. For example, Kavenagh gets the code KFNK, which matches Kavnaugh, Kavenough and Cavenagh. Unfortunately the same code also matches many unrelated names such as Gavnik and Kaffanke. In summary, Metaphone is better suited to name matching than Soundex but is still far from ideal. Neither Soundex nor Metaphone includes any of the following matches for Kavenagh: Kavena, Cavena, Kavanha, Kavan. Also, neither algorithm offers any assistance with the problem of forename abbreviations.
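For readers who want to reproduce the Soundex side of this comparison, a minimal sketch follows. It assumes the commonly used American Soundex rules (keep the first letter, code the remaining consonants as digits, collapse repeated codes, pad to four characters); other Soundex variants exist and may give different codes. The function name `soundex` is our own.

```python
# Digit codes for consonants under the (assumed) American Soundex rules.
CODES = {}
for digit, letters in enumerate(
        ["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for ch in letters:
        CODES[ch] = str(digit)


def soundex(name: str) -> str:
    """Return a four-character Soundex code, or "" for an empty name."""
    letters = [c for c in name.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    result = first.upper()
    prev = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "hw":
            # h and w are skipped and do not separate repeated codes.
            continue
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        # Vowels reset prev, so a repeated code after a vowel is kept.
        prev = code
    return (result + "000")[:4]


print(soundex("Kavenagh"))  # K152
print(soundex("Cavenagh"))  # C152
print(soundex("Kavena"))    # K150
print(soundex("Kavan"))     # K150
```

Under these rules Kavenagh (K152) does not share a code with Kavena or Kavan (both K150), which is consistent with the missed matches listed above. Because the first letter is kept, Kavenagh and Cavenagh also receive different codes.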
What algorithms does the NameThesaurus use ?
All surnames are restricted to an alphabet of 27 characters (a-z and the apostrophe " ' "). Double-barrelled names are split into their component parts and treated independently. All accented characters are converted to their closest matching letter (e.g. à, â, ä and å are all mapped to " a "). This mapping simplifies the thesaurus and ensures that similar names are located without regard to the use of accented characters. Once a potential match has been identified a weighting is computed as follows:
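The normalisation steps described in this section (the 27-character alphabet, splitting of double-barrelled names and accent mapping) might be sketched as below. This covers only the normalisation, not the weighting computation. It is an illustration under our own assumptions rather than the NameThesaurus code: the function name `normalise_surname` is hypothetical, double-barrelled names are assumed to be joined by a hyphen, and accents are removed using Unicode decomposition.

```python
import re
import unicodedata
from typing import List


def normalise_surname(raw: str) -> List[str]:
    """Return the normalised component parts of a surname."""
    # Decompose accented letters (e.g. "ä" becomes "a" plus a combining
    # mark) and then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", raw)
    unaccented = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch))

    # Assumption: double-barrelled names are joined with a hyphen.
    parts = unaccented.lower().split("-")

    # Keep only a-z and the apostrophe. Letters with no decomposition
    # (such as "ø") are simply dropped in this sketch.
    cleaned = [re.sub(r"[^a-z']", "", part) for part in parts]
    return [part for part in cleaned if part]


print(normalise_surname("àâäå"))          # ['aaaa']
print(normalise_surname("Smith-Jones"))   # ['smith', 'jones']
print(normalise_surname("O'Brien"))       # ["o'brien"]
```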
Why are weighted pairs important ?
The NameThesaurus assigns a match score to all name variants so that each
candidate is offered as a weighted pair and all variants for a given name can
be supplied as a ranked set. For example, the NameThesaurus has generated more
than 120 matches for the surname Galbraith including:
By assigning a weight to each name variant the application has the option of
limiting a particular search to the better matches thus improving precision at
the expense of recall.
What do you mean by Precision and Recall ?
Precision measures the percentage of names in the match list that are correct, with 100% indicating that the match list contains no incorrect names. Recall measures the percentage of all possible correct names appearing in the match list, with 100% indicating that the match list contains every correct name. In an ideal world it would be possible to achieve 100% precision with 100% recall, but in practice this is not possible for large volumes of data. In general, higher precision leads to lower recall and vice versa.
The above diagram shows how precision drops off as recall is increased by lowering the NameThesaurus match score threshold. The diagram also shows how Soundex and Metaphone compare to the NameThesaurus, with both providing reasonable levels of recall but poor precision. Whilst the performance of Soundex and Metaphone is fixed for any given name, the range indicated in the diagram shows the expected performance across a spread of names. When weighted term pairs are available it is possible to tune precision and recall dynamically. For example, point 'a' on the diagram represents relatively high precision with lower recall and would be achieved by setting a high match score threshold. In contrast, point 'b' represents relatively high recall with lower precision and would be achieved by selecting a low match score threshold. In all cases the NameThesaurus provides better precision than either Soundex or Metaphone.
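The effect of the match score threshold on precision and recall can be shown with a small sketch. The scores below are invented purely for illustration and are not NameThesaurus output; the 0-100 score scale, the set of "correct" names and the function name `precision_recall` are likewise assumptions.

```python
from typing import Dict, Set, Tuple


def precision_recall(weighted: Dict[str, int],
                     correct: Set[str],
                     threshold: int) -> Tuple[float, float]:
    """Precision and recall of the candidates scoring >= threshold."""
    returned = {name for name, score in weighted.items()
                if score >= threshold}
    hits = returned & correct
    # Convention used here: an empty match list has precision 1.0.
    precision = len(hits) / len(returned) if returned else 1.0
    recall = len(hits) / len(correct) if correct else 1.0
    return precision, recall


# Hypothetical weighted pairs for the surname Kavenagh (invented scores).
candidates = {
    "Cavenagh": 95, "Kavnaugh": 90, "Kavena": 80,
    "Kavan": 70, "Gavnik": 40, "Kaffanke": 30,
}
# Assumed ground truth: the genuine variants among the candidates.
true_variants = {"Cavenagh", "Kavnaugh", "Kavena", "Kavan"}

print(precision_recall(candidates, true_variants, 85))  # (1.0, 0.5)
print(precision_recall(candidates, true_variants, 35))  # (0.8, 1.0)
```

The high threshold behaves like point 'a' (every returned name is correct, but half the variants are missed), while the low threshold behaves like point 'b' (every variant is found, at the cost of one unrelated name).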
How do I work with the NameThesaurus ?
Is the NameThesaurus just data ?
The names in the standard thesaurus come from all over the world.
Is the NameThesaurus a Thesaurus generation Service ?
How fast is the NameThesaurus ?
Who are Image Partners ?
John Challis, who founded the company in 1996, has been designing software
products for the management of unstructured information since 1987.