What's in a domain name? NIST has an answer
NIST computer scientist devises algorithm to measure visual similarity between domain names.
Everyone knows how frustrating (and embarrassing)
it can be to mistype a URL into your browser. (Remember the
snickering you used to hear if you went to
'whitehouse.com' instead of
'whitehouse.gov'? The .com address is now a political
news site, by the way.) The Internet Corp. for Assigned Names and
Numbers (ICANN) plans to launch a new round of proposals later this
year for generic top-level Internet domains and is looking for a
way to help avoid confusion and fraud as the number of domains
increases.
To help this effort, Paul Black, a computer scientist at the
National Institute of Standards and Technology,
has come up with an algorithm to measure the amount of visual
similarity between domain names. The tool
scores the similarities between a proposed domain and an existing
one. For instance, a domain such as '.c0m' (with a
zero) scores 88 percent when compared with '.com' and
probably would not be approved.
Generic top-level domains are the strings of letters and numbers
that appear after the rightmost '.' or dot, and before the first
'/' or slash, in a URL. According to ICANN, there are 21
generic top-level domains now approved for use, from .aero
(reserved for members of the air transport industry) to .travel
(reserved for the travel industry), as well as the more familiar
.com, .edu, .gov and .mil.
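As a rough illustration of that definition, the short Python sketch below pulls the top-level domain out of a URL. The function name top_level_domain is hypothetical, and the sketch assumes an ordinary host name rather than a bare IP address.

```python
from urllib.parse import urlsplit


def top_level_domain(url: str) -> str:
    """Return the label after the rightmost dot of the host name ('' if none)."""
    # urlsplit only recognizes the host when the URL has a scheme or starts
    # with "//", so bare input such as "whitehouse.gov" gets a "//" prefix.
    if "//" not in url:
        url = "//" + url
    host = (urlsplit(url).hostname or "").rstrip(".")
    return host.rsplit(".", 1)[-1] if "." in host else ""


if __name__ == "__main__":
    print(top_level_domain("http://www.whitehouse.gov/briefing"))
    print(top_level_domain("whitehouse.com"))
```

Everything from the first slash onward is ignored, so only the host name decides the result.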
According to NIST, Black's algorithm rates the degree of
similarity between pairs of alphanumeric characters, such as the
numeral '1' and the lowercase letter 'l,'
which in some fonts are dead ringers and would receive the highest
score. Other pairs, such as 'h' and 'n,'
are less alike and get lower scores. The algorithm also takes into
consideration combinations of letters, such as 'cl,'
which can look like 'd.' Putting everything together,
the algorithm then computes the 'cost' of transforming
one string into another based on visual similarity and expresses
that in a percentage score.
NIST says ICANN is considering future enhancements to the
algorithm, including checks for confusing similarities between
domains in other alphabets or scripts such as Cyrillic.