The Name Thesaurus is a technology for finding Surname and Forename variants
 
What is a name variant ?
Why do we need another name matching system ?
What is wrong with Soundex and Metaphone ?
What algorithms does the Name Thesaurus use ?
Why are weighted pairs important ?
What do you mean by Precision and Recall ?
How do I work with the Name Thesaurus ?
Is the Name Thesaurus just data ?
Is the Name Thesaurus a Thesaurus generation Service ?
How fast is the Name Thesaurus ?
Who are Image Partners ?


What is a name variant ?
Database applications that involve the storing and searching for personal data often need to allow for a degree of variation in the spelling of both surnames and forenames.

Variations can be introduced into database applications as a result of typographical errors where letters are interchanged (e.g. Nobel and Noble), letters are substituted (e.g. Stevens and Stephens), letters are added (e.g. Colins and Collins) or removed (e.g. Clarke and Clark).

Variations can also be introduced as a result of alternative spellings for names with the same or very similar pronunciation (e.g. Cavenagh and Kavenagh or Ewal and Yule or Sean and Shawn). This is a particular problem where the name is transcribed from the spoken word.

Forename abbreviations introduce further problems, and in many cases there is little or no phonetic link (e.g. William with Wm, Billy, Bill, etc. or Elizabeth with Lizzie, E'beth, Elz, Beth, etc.).

In addition, we do not want to match known masculine names with known feminine variants and vice versa (e.g. Alexander should match with Alex and Alexandar but not with Alexandria).

Why do we need another name matching system ?
Many techniques have been used to assist with the important problem of matching variant names. However, most of these techniques were developed for general word matching and as a result they are not optimised for name matching. The Name Thesaurus was designed specifically for name matching and achieves significantly higher precision with higher recall than the alternatives.

What is wrong with Soundex and Metaphone ?
Soundex is a simple algorithm that transforms any word into a code comprising a leading letter followed by up to three digits. For example, Kavenagh gets the code K152 and this may be matched to other names with the same code such as Kavnaugh and Kavenough. Unfortunately the same code also matches with a huge number of unrelated names such as Kyppins and Koppensteiner. This code also fails to match any name with a different leading letter such as Cavenagh. In summary, Soundex offers a reasonable level of recall but with low precision when matching names.
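
The Soundex behaviour described above can be reproduced with a short sketch. This is the widely used American Soundex variant, which pads the code to three digits with zeros; whether the comparison above used exactly this variant is an assumption, and the function name soundex is chosen here purely for illustration.

```python
def soundex(name: str) -> str:
    """Return the American Soundex code: a letter followed by three digits."""
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            codes[ch] = digit

    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""

    result = letters[0].upper()
    prev = codes.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        digit = codes.get(ch, "")  # vowels and y have no code
        if digit and digit != prev:
            result += digit
        prev = digit  # a vowel resets prev, so a repeated code counts again
    return (result + "000")[:4]  # pad or truncate to letter + three digits


for surname in ("Kavenagh", "Kavnaugh", "Kyppins", "Koppensteiner", "Cavenagh"):
    print(surname, soundex(surname))
```

The first four names share the code K152, while Cavenagh receives C152 and is therefore missed, which is the all-or-nothing behaviour criticised above.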

Metaphone is another algorithm that attempts to normalise words by removing vowels after the first letter and mapping common consonant variations. For example, Kavenagh gets the code KFNK and this matches with Kavnaugh, Kavenough and Cavenagh. Unfortunately the same code also matches many unrelated names such as Gavnik and Kaffanke. In summary, Metaphone is better suited to name matching than Soundex but is still far from ideal.

Neither Soundex nor Metaphone includes any of the following matches for Kavenagh: Kavena, Cavena, Kavanha, Kavan. Also, neither algorithm offers any assistance with the problem of forename abbreviations.

What algorithms does the Name Thesaurus use ?
The Name Thesaurus uses a combination of phonetic and other techniques for name variant identification.

All surnames are restricted to an alphabet of 27 characters (a-z and the apostrophe " ' "). Double-barrelled names are split into their component parts and treated independently. All accented characters are converted to their closest matching letter (e.g. the accented forms of a are all mapped to " a "). This mapping simplifies the thesaurus and ensures that similar names are located without regard to the use of accented characters.
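
A minimal sketch of this kind of normalisation is shown below. It is not the Name Thesaurus implementation: the function name normalise_surname is hypothetical, splitting on whitespace as well as hyphens is an assumption, and accents are removed by Unicode decomposition, which only approximates a "closest matching letter" table.

```python
import re
import unicodedata

# The 27-character alphabet described above: a-z plus the apostrophe.
ALLOWED = set("abcdefghijklmnopqrstuvwxyz'")


def normalise_surname(raw: str) -> list[str]:
    """Return the normalised component parts of a surname."""
    # Decompose accented letters, then drop the combining accent marks.
    decomposed = unicodedata.normalize("NFKD", raw)
    unaccented = "".join(c for c in decomposed if not unicodedata.combining(c))

    # Split double-barrelled names on hyphens (and, by assumption, spaces).
    parts = re.split(r"[-\s]+", unaccented.lower())

    # Keep only the allowed characters. Letters with no decomposition
    # (for example a slashed o) are simply dropped in this sketch.
    cleaned = ("".join(c for c in part if c in ALLOWED) for part in parts)
    return [part for part in cleaned if part]


print(normalise_surname("Müller"))
print(normalise_surname("Smythe-Jones"))
print(normalise_surname("O'Brien"))
```

A production mapping would presumably use an explicit table for letters that have no Unicode decomposition, rather than discarding them.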

Once a potential match has been identified a weighting is computed as follows:
1. All surnames are converted to a phonetic encoding.
2. The degree of similarity for potential matches is computed using two different matching algorithms. Both the original and the phonetic versions are compared using the two matching algorithms, producing a total of four match scores.
3. The Soundex and Metaphone codes are also computed.
4. The Match Score is computed from the weighted average of the four match scores combined with the results of the Soundex and Metaphone encodings.
5. For forenames, the Name Thesaurus uses a knowledge of gender associations so that close phonetic matches between the sexes can be avoided (e.g. John and Joan or Alexander and Alexandria).
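
Steps 1 to 4 above can be sketched as follows. The text does not name the two matching algorithms, the phonetic encoding, the weights, or how the Soundex and Metaphone results are combined, so everything here is a stand-in: difflib's ratio and a bigram Dice coefficient play the two matchers, crude_phonetic is a toy encoding, the weights are equal, and agreement of the Soundex and Metaphone codes is passed in as a count (code_agreements) that earns a small, arbitrary bonus.

```python
from difflib import SequenceMatcher


def ratio_similarity(a: str, b: str) -> float:
    """Stand-in matcher 1: difflib's similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def bigram_similarity(a: str, b: str) -> float:
    """Stand-in matcher 2: Dice coefficient over sets of letter pairs."""
    def bigrams(s: str) -> set:
        return {s[i:i + 2] for i in range(len(s) - 1)}

    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0 if a == b else 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def crude_phonetic(name: str) -> str:
    """Toy phonetic key: keep the first letter, drop later vowels,
    and collapse immediately repeated letters."""
    name = name.lower()
    key = name[:1]
    for ch in name[1:]:
        if ch in "aeiouy":
            continue
        if ch != key[-1]:
            key += ch
    return key


def match_score(a: str, b: str, code_agreements: int = 0,
                weights=(0.25, 0.25, 0.25, 0.25),
                code_bonus: float = 0.05) -> int:
    """Combine four similarity scores into a 0-100 match score."""
    a, b = a.lower(), b.lower()
    pa, pb = crude_phonetic(a), crude_phonetic(b)
    scores = (
        ratio_similarity(a, b), bigram_similarity(a, b),     # original spelling
        ratio_similarity(pa, pb), bigram_similarity(pa, pb),  # phonetic form
    )
    average = sum(w * s for w, s in zip(weights, scores)) / sum(weights)
    # Assumed combination rule: a small bonus per agreeing phonetic code.
    combined = min(1.0, average + code_bonus * code_agreements)
    return round(100 * combined)


print(match_score("Galbraith", "Gailbriath", code_agreements=2))
print(match_score("Galbraith", "Smith"))
```

The scores this sketch produces will not reproduce the published Name Thesaurus scores; it only illustrates the shape of the computation, in which spelling and phonetic similarity both contribute to a single ranked weight.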

Why are weighted pairs important ?
Neither Soundex nor Metaphone is able to rank its matching names so that the closer matches can be identified. Therefore an application that uses these techniques faces an all-or-nothing decision regarding the inclusion of name variants.

The Name Thesaurus assigns a match score to all name variants so that each candidate is offered as a weighted pair and all variants for a given name can be supplied as a ranked set. For example, the Name Thesaurus has generated more than 150 matches for the surname Galbraith including:
  - Gailbriath - with a match score of 99 
  - Gallbreath - with a match score of 97 
  - Galbrealth - with a match score of 85
  - Galbruth   - with a match score of 83 
  - Goilbreath - with a match score of 76

By assigning a weight to each name variant the application has the option of limiting a particular search to the better matches thus improving precision at the expense of recall.
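
As an illustration, the five Galbraith variants listed above can be held as weighted pairs and filtered by a threshold. The function name variants_above and the dictionary layout are assumptions made for this sketch.

```python
# Match scores for variants of Galbraith, taken from the list above.
GALBRAITH_VARIANTS = {
    "Gailbriath": 99,
    "Gallbreath": 97,
    "Galbrealth": 85,
    "Galbruth": 83,
    "Goilbreath": 76,
}


def variants_above(variants: dict, threshold: int) -> list:
    """Return variant names scoring at or above threshold, best first."""
    ranked = sorted(variants.items(), key=lambda item: item[1], reverse=True)
    return [name for name, score in ranked if score >= threshold]


print(variants_above(GALBRAITH_VARIANTS, 90))  # only the closest matches
print(variants_above(GALBRAITH_VARIANTS, 80))  # a wider, less precise set
```

Raising the threshold shortens the list to the closest matches, and lowering it admits more distant variants.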

What do you mean by Precision and Recall ?
Precision measures the percentage of correct names in a match list with 100% indicating that the match list does not contain any invalid names.

Recall measures the percentage of all possible correct names appearing in the match list with 100% indicating that the match list contains every correct name.
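
The two definitions above can be written as a small function. The name precision_recall is hypothetical, and the example lists are illustrative only: they reuse the Kavenagh names from the Soundex discussion and assume which of them count as correct.

```python
def precision_recall(match_list, correct_names):
    """Return (precision, recall) as percentages."""
    matched = set(match_list)
    correct = set(correct_names)
    hits = len(matched & correct)
    # Convention assumed here: an empty list contains no invalid names (100%).
    precision = 100 * hits / len(matched) if matched else 100.0
    recall = 100 * hits / len(correct) if correct else 100.0
    return precision, recall


# A Soundex-style match list for Kavenagh: two valid variants, two unrelated.
soundex_matches = ["Kavnaugh", "Kavenough", "Kyppins", "Koppensteiner"]
# Assumed full set of correct variants for this illustration.
true_variants = ["Kavnaugh", "Kavenough", "Cavenagh"]

precision, recall = precision_recall(soundex_matches, true_variants)
print(f"precision {precision:.0f}%, recall {recall:.0f}%")
```

Here half of the returned names are valid, and two of the three correct variants are found.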

In an ideal world it would be possible to achieve 100% precision with 100% recall but in practice this is not possible for large volumes of data. In general, higher precision leads to lower recall and vice versa.

[Diagram: Precision vs Recall]

The above diagram shows how precision drops off as recall increases when the Name Thesaurus match score threshold is lowered. The diagram also shows how Soundex and Metaphone compare to the Name Thesaurus, with both providing reasonable levels of recall but poor precision. Whilst the performance of Soundex and Metaphone is fixed for any given name, the range indicated in the diagram shows the expected performance across a spread of names.

When weighted term pairs are available it is possible to tune precision and recall dynamically. For example, point 'a' on the diagram above represents relatively high precision with lower recall and would be achieved by setting a high match score threshold. In contrast, point 'b' on the diagram represents relatively high recall but with lower precision and would be achieved by selecting a low match score threshold.

In all cases the Name Thesaurus provides better Precision than either Soundex or Metaphone.

How do I work with the Name Thesaurus ?
The Name Thesaurus data is organised as a thesaurus of name pairs with weights and would normally be held in a relational database. A database application that wishes to use the Name Thesaurus simply adds a sub-select to its query that pulls in the matching names from the Thesaurus whose match score is above the selected threshold.
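
The sub-select approach might look like the following sketch, which uses Python's built-in sqlite3 module and an in-memory database. The table and column names (people, surname_thesaurus, name, variant, score) are hypothetical and not the actual Name Thesaurus schema; the scores are the Galbraith examples quoted earlier.

```python
import sqlite3

# Search for a surname plus every thesaurus variant at or above a threshold.
QUERY = """
SELECT surname
FROM people
WHERE surname = :name
   OR surname IN (
        SELECT variant
        FROM surname_thesaurus
        WHERE name = :name AND score >= :threshold
   )
ORDER BY surname
"""


def build_demo_db() -> sqlite3.Connection:
    """Create a small in-memory database with hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE people (surname TEXT)")
    conn.execute(
        "CREATE TABLE surname_thesaurus (name TEXT, variant TEXT, score INTEGER)"
    )
    # An index on (name, score) lets the sub-select avoid a full table scan.
    conn.execute(
        "CREATE INDEX thesaurus_lookup ON surname_thesaurus (name, score)"
    )
    conn.executemany(
        "INSERT INTO people VALUES (?)",
        [("galbraith",), ("gailbriath",), ("gallbreath",),
         ("goilbreath",), ("smith",)],
    )
    conn.executemany(
        "INSERT INTO surname_thesaurus VALUES (?, ?, ?)",
        [("galbraith", "gailbriath", 99),
         ("galbraith", "gallbreath", 97),
         ("galbraith", "goilbreath", 76)],
    )
    return conn


def find_surnames(conn: sqlite3.Connection, name: str, threshold: int) -> list:
    """Return stored surnames matching name or one of its weighted variants."""
    rows = conn.execute(QUERY, {"name": name, "threshold": threshold})
    return [row[0] for row in rows]


conn = build_demo_db()
print(find_surnames(conn, "galbraith", 90))  # high threshold: closest variants
print(find_surnames(conn, "galbraith", 70))  # lower threshold: wider search
```

Because the threshold is an ordinary query parameter, an application can tighten or widen each search without changing the stored thesaurus.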

Is the Name Thesaurus just data ? 
Yes - the Name Thesaurus is available as a standard thesaurus containing 385 million variants for 5.9 million distinct Surnames and 32 million variants for 1.4 million distinct Forenames.

The names in the standard thesaurus come from all over the world.

Is the Name Thesaurus a Thesaurus generation Service ? 
Yes - we can build custom thesauri for specific name collections. 

How fast is the Name Thesaurus ? 
The performance of the Name Thesaurus is dependent upon the speed of the relational database used to hold the Thesauri. Since the sub-select can easily be controlled by a clustered index the overhead of fetching the additional variants is normally a small percentage of the overall search time. Our customers use the Name Thesaurus with databases that hold hundreds of millions of names.  

Who are Image Partners ?
Image Partners Limited is the company that designed and developed the Name Thesaurus.

John Challis, who founded the company in 1996, has been designing software products for the management of unstructured information since 1987.