DNS - Dooot .com Domain News
Domain Name Registration and Web Hosting


Chris Weber: Unicode attacks and test cases - Visual Spoofing, IDN homograph attacks, and the Mixed Script Confusables

Source: www.idnnews.com

Let’s face it, playing tricks that mess with people’s perception can be fun.  With Unicode, there are lots of fun tricks to be had.  What’s to stop someone from believing the following is what it appears to be:

www.?mazon.com

Looks like amazon.com of course, but it’s not.  The first ‘a’ is the Cyrillic small letter а (U+0430), not the Latin small letter ‘a’ (U+0061).  They look identical, but they come from two different scripts.  Confused?  Good.  Now hover your mouse over the link above, but don’t click it, because I don’t know where it goes and it probably isn’t nice.  In your browser’s status bar you should see the Punycode-encoded version of the domain name:

http://www.xn--mazon-3ve.com/

Because DNS does not support Unicode (only a subset of ASCII characters is allowed), we have IDN (Internationalized Domain Name) standards, which define how domain names with Unicode characters should be encoded.  Punycode is the name of the encoding mechanism, and an encoded label carries the ‘xn--’ prefix, as in the example above.
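To make the encoding step concrete, here is a minimal sketch using Python’s built-in ‘idna’ codec.  The helper name to_ascii is my own, and the codec implements the older IDNA 2003 rules, so treat this as an illustration rather than what any particular browser does.

```python
def to_ascii(hostname: str) -> str:
    """Encode a hostname to its ASCII form with Python's built-in 'idna' codec."""
    return hostname.encode("idna").decode("ascii")


# U+0430 is CYRILLIC SMALL LETTER A; it is written as an escape so the
# difference from the Latin 'a' is visible in the source.
spoofed = "www.\u0430mazon.com"
genuine = "www.amazon.com"

print(to_ascii(spoofed))   # www.xn--mazon-3ve.com
print(to_ascii(genuine))   # www.amazon.com
print(spoofed == genuine)  # False
```

Comparing the ASCII forms rather than the rendered text is what exposes the spoof.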

The above is often referred to as an IDN homograph attack.  Aside from spoofing with lookalike characters from completely different alphabets, we can do a bunch of spoofing just within our own alphabet.  For example, certain fonts make some combinations of characters hard to tell apart.  The letters ‘r’ and ‘n’ together can look like the letter ‘m’ (rn == m), zeros can look like the letter ‘O’, and the number 1 can look like a lower-case ‘l’.  So you wind up with lots of clever visual attacks:

  • www.rnu11ets.com looks a lot like www.mullets.com

I originally listed this same text in several different fonts, because in some fonts you can’t tell the two words apart visually.  The visual appearance of a character has a lot to do with the font used to display the glyph, not just the alphabet.
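One rough way to catch this kind of same-alphabet spoofing is to fold known lookalike sequences into a common ‘skeleton’ and compare the results.  The sketch below is only an illustration: the LOOKALIKES table holds just the three substitutions mentioned above, and both it and the skeleton function are hypothetical names of my own.

```python
# Hypothetical lookalike table, limited to the substitutions discussed above.
LOOKALIKES = [("rn", "m"), ("0", "o"), ("1", "l")]


def skeleton(label: str) -> str:
    """Fold visually similar ASCII sequences into one canonical form."""
    folded = label.lower()
    for lookalike, canonical in LOOKALIKES:
        folded = folded.replace(lookalike, canonical)
    return folded


print(skeleton("www.rnu11ets.com"))                                  # www.mullets.com
print(skeleton("www.rnu11ets.com") == skeleton("www.mullets.com"))   # True
```

A fold this crude over-matches: it also turns ‘corn’ into ‘com’, so it is only useful for flagging names for a closer look, not for rejecting them outright.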

The Confusables

These types of visual attacks are attributed to what are known as ‘the confusables’ and have been documented in Unicode Technical Report #36 (TR36) and TR39.  ‘Confusables’ is the name given to characters and strings that essentially look like each other.  The Unicode Consortium has defined three main classes of confusable strings:

  1. Single-script
  2. Mixed-script
  3. Whole-script

I want to investigate each one in turn.  Because I’m simplifying things here, I may not be accurate in my use of the terms script, alphabet, letter, and so on.  Linguistics people get it better than I do, but for the rest of us, the term ‘script’ refers to:

A collection of letters and other written signs used to represent textual information in one or more writing systems. For example, Russian is written with a subset of the Cyrillic script; Ukrainian is written with a different subset. The Japanese writing system uses several scripts.
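To see which scripts a string draws on, here is a rough sketch in Python.  The standard unicodedata module does not expose the Unicode Script property, so this takes the first word of each character’s name (‘LATIN’, ‘CYRILLIC’ and so on) as a stand-in.  That is a heuristic of my own, not the real property, and rough_scripts is a made-up name.

```python
import unicodedata


def rough_scripts(text: str) -> set:
    """Approximate the scripts used in text via the first word of each
    letter's Unicode character name (a heuristic, not the Script property)."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


print(rough_scripts("amazon"))        # {'LATIN'}
print(rough_scripts("\u0430mazon"))   # contains both 'CYRILLIC' and 'LATIN'
```

A label that reports more than one script is a candidate for the mixed-script class described above.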

Single-script confusables

These occur when letters from the same alphabet, or script, are used to give the same visual appearance.  This definition should be extended to say that these occur when letters from either the same script, inherited script, or common script, are used together.   For example, the following two combinations of Latin letters look identical:

  • so̷s
  • søs

If you take these apart, there’s a big difference.  While the letter ‘s’ is the same in each, the ‘o̷’ and ‘ø’ are different.  The first uses the Basic Latin ‘o’ with a combining diacritical mark named COMBINING SHORT SOLIDUS OVERLAY, which is considered an inherited script.  To put it a different way, we have two atomic Unicode code points here, which together give the effect of a single character or letter.  The second uses the atomic character LATIN SMALL LETTER O WITH STROKE.  Let’s take these apart and look at the Unicode code point values for each.

  • so̷s == \u0073\u006F\u0337\u0073
  • søs == \u0073\u00F8\u0073

As you can see, the first ‘o̷’ is formed from two Unicode code points, U+006F and U+0337.  If you copy and paste that word into a text editor that supports Unicode (e.g. Notepad) and press backspace, you’ll see the first backspace removes the combining diacritical mark, and the second removes the ‘o’.  Continuing with the example, the second ‘ø’ is made of a single Unicode code point, U+00F8, part of the Latin-1 Supplement Unicode block.  At a lower level, because we’re using different code points and bytes to achieve the same visual effect, we have a case of the confusables.
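The same comparison can be done in a few lines of Python.  This is just a sketch using the standard unicodedata module, and the variable names are mine.  It also shows that NFC normalization does not make the two strings equal, because U+00F8 has no canonical decomposition into ‘o’ plus an overlay.

```python
import unicodedata

combined = "s\u006F\u0337s"    # 'o' followed by COMBINING SHORT SOLIDUS OVERLAY
precomposed = "s\u00F8s"       # LATIN SMALL LETTER O WITH STROKE

for label, word in (("combined", combined), ("precomposed", precomposed)):
    print(label, len(word), "code points")
    for ch in word:
        print(f"  U+{ord(ch):04X} {unicodedata.name(ch)}")

# The strings differ as code point sequences ...
print(combined == precomposed)  # False

# ... and canonical normalization does not unify them either.
nfc_equal = unicodedata.normalize("NFC", combined) == unicodedata.normalize("NFC", precomposed)
print(nfc_equal)  # False
```

So a plain string comparison, even after normalization, will treat these two lookalike words as different.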

Let’s take a closer look at what qualifies as a single-script confusable for the Latin lower-case letter ‘a’ - taken from the confusables table at http://unicode.org/reports/tr39/data/confusables.txt.

FF21 ; 0041 ; SA # ( Ａ → A ) FULLWIDTH LATIN CAPITAL LETTER A → LATIN CAPITAL LETTER A
1D400 ; 0041 ; SA # (
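Rows in this shape are easy to read programmatically.  Here is a minimal sketch that assumes the three-field layout shown above (source code points ; target code points ; type, then a ‘#’ comment).  ConfusableEntry and parse_confusable_line are names of my own, and the actual file may carry details this ignores.

```python
from typing import NamedTuple, Optional


class ConfusableEntry(NamedTuple):
    source: str   # the confusable character(s)
    target: str   # what it can be mistaken for
    kind: str     # type tag, e.g. 'SA' in the rows above


def parse_confusable_line(line: str) -> Optional[ConfusableEntry]:
    """Parse one 'source ; target ; type # comment' row, or return None."""
    data = line.split("#", 1)[0].strip()   # drop the trailing comment
    if not data:
        return None
    fields = [field.strip() for field in data.split(";")]
    if len(fields) < 3:
        return None
    # Each of the first two fields is a space-separated list of hex code points.
    source = "".join(chr(int(cp, 16)) for cp in fields[0].split())
    target = "".join(chr(int(cp, 16)) for cp in fields[1].split())
    return ConfusableEntry(source, target, fields[2])


entry = parse_confusable_line("FF21 ; 0041 ; SA # FULLWIDTH LATIN CAPITAL LETTER A")
print(entry.target, entry.kind)   # A SA
```

Only the data before the ‘#’ is used, so the comment text does not affect the result.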

Last changed: Dec 10 2008 at 10:20 AM
www.dooot.com (c) Copyright 2000 - 2012 Irist IST Member of the IST Group. All rights reserved. , istanco.com