IETF Standards Summary
In domain names, the Domain Name System (DNS) only recognizes the
ASCII letters A-Z (without regard to case), the digits 0-9 and the
hyphen ('-'). This limits the number of characters
that can be utilized to build domain names to 37 of the more than 40,000
characters identified within Unicode. To create domain names from the
wider range of Unicode characters, a character-encoding scheme that
uniquely maps Unicode code points to an ASCII representation must be
used and standardized.
The Internet Engineering Task Force (IETF)
has led the effort in standardizing the way that non-ASCII characters
are to be represented within and handled by DNS. The IETF published
three standards related to Internationalized Domain Names (IDN):
- Encoding scheme for IDNs
- Name preparation
- IDNs in applications
Encoding Scheme
The encoding scheme for IDNs will be an ASCII
Compatible Encoding (ACE) that will encode the local language characters
of an IDN into ASCII characters such that DNS can accurately answer
a request for an address record. There are several types of ACE. In
order to select an ACE as the standard, IETF must consider the difficult
balance between compression and implementation. The preferred ACE will
allow the greatest number of characters (code points) to be represented
and will not be difficult to deploy. The IETF has chosen an ACE known
as Punycode to be the standard.
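To make the encoding concrete, here is a minimal sketch using the `punycode` codec that ships with Python's standard library. The helper names `to_ace_label` and `from_ace_label` are hypothetical, chosen for illustration. The `xn--` prefix is the ACE prefix that the IDNA specification places in front of a Punycode-encoded label. The sketch skips name preparation entirely, so it assumes the label has already been prepared.

```python
ACE_PREFIX = "xn--"  # ACE prefix used by IDNA to mark an encoded label


def to_ace_label(label: str) -> str:
    """Encode one already-prepared label into its ASCII-compatible form."""
    if label.isascii():
        # Plain ASCII labels need no encoding.
        return label
    return ACE_PREFIX + label.encode("punycode").decode("ascii")


def from_ace_label(label: str) -> str:
    """Decode one ACE label back into Unicode (illustrative helper)."""
    if label.startswith(ACE_PREFIX):
        return label[len(ACE_PREFIX):].encode("ascii").decode("punycode")
    return label


if __name__ == "__main__":
    print(to_ace_label("bücher"))           # xn--bcher-kva
    print(from_ace_label("xn--bcher-kva"))  # bücher
```

The encoded form uses only letters, digits and hyphens, so DNS can store and answer queries for it like any other name.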
Name Preparation
The name preparation standard will provide
the rules that will ensure uniqueness in registering Unicode code points.
The rules outline the criteria through which a set of non-ASCII characters
will be refined to ensure that there is no ambiguity within the registrations
of a specific name space. These rules are Mapping, Normalization and
Prohibition.
- Mapping
Characters may be mapped to nothing, to a single character or to multiple
characters, depending on whether they are useful only within text or differ
only in case. An example of the former: the soft hyphen (U+00AD) is
discretionary, has use only within text and is invisible or ignored, so it
is mapped to nothing. The more common example
is the mapping of a capital letter to a small letter such as 'B' (U+0042)
to 'b' (U+0062). This ensures that a registration such as ibm.com
does not conflict with other registrations such as IBM.com or
iBm.com.
There are cases where a single character will map to multiple characters.
The small letter sharp s or 'ß' (U+00DF) has an upper case representation
of 'SS' (U+0053, U+0053). This is also the same upper case representation
for 'ss' (U+0073, U+0073). Therefore, 'ß' maps to 'ss'.
- Normalization
Once a set of characters has been mapped, the set is normalized. Some
input method editors (IMEs) enter characters that look exactly like another
character, but have different code points. For example, '１' is a fullwidth
digit one (U+FF11) and will normalize into the digit one '1' (U+0031).
Normalization also ensures predictable results through ordering where
characters have a number of combining diacritics.
- Prohibition
After normalization, the mapped and normalized set of characters is
checked against a table of prohibited characters. These characters are
prohibited for a variety of reasons but the most common are spaces that
could lead to confusion and control characters that cannot be displayed.
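The three rules above can be sketched with Python's standard `stringprep` and `unicodedata` modules. This is an illustration under stated assumptions, not a complete name preparation implementation: the function name `nameprep_sketch` is hypothetical, NFKC is assumed as the normalization form, the set of prohibition tables checked is an assumption modeled on the stringprep tables, and the additional checks for bidirectional text are left out.

```python
import stringprep
import unicodedata


def nameprep_sketch(label: str) -> str:
    """Apply mapping, normalization and prohibition to one label."""
    # Step 1: mapping. Characters in table B.1 (such as the soft hyphen)
    # map to nothing; table B.2 folds case and expands characters like
    # sharp s into 'ss'.
    mapped = []
    for ch in label:
        if stringprep.in_table_b1(ch):
            continue
        mapped.append(stringprep.map_table_b2(ch))

    # Step 2: normalization. NFKC turns look-alike compatibility characters
    # (for example the fullwidth digit one) into their plain equivalents.
    normalized = unicodedata.normalize("NFKC", "".join(mapped))

    # Step 3: prohibition. Reject non-ASCII spaces, non-ASCII control
    # characters and the other prohibited ranges (tables C.3 to C.9).
    prohibited_checks = (
        stringprep.in_table_c12, stringprep.in_table_c22,
        stringprep.in_table_c3, stringprep.in_table_c4,
        stringprep.in_table_c5, stringprep.in_table_c6,
        stringprep.in_table_c7, stringprep.in_table_c8,
        stringprep.in_table_c9,
    )
    for ch in normalized:
        if any(check(ch) for check in prohibited_checks):
            raise ValueError(f"prohibited character U+{ord(ch):04X}")
    return normalized


if __name__ == "__main__":
    print(nameprep_sketch("iBm"))      # ibm
    print(nameprep_sketch("Straße"))   # strasse
    print(nameprep_sketch("\uff11"))   # 1
```

The order matters: a character is only tested against the prohibited tables after mapping and normalization have had the chance to replace it with an acceptable equivalent.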
IDNs in Applications
The IDN in applications standard focuses on
the location where the Unicode to ASCII mapping will take place. The
IETF's approach makes the applications that exchange traffic
with DNS (browsers, e-mail clients, etc.) encode and decode the Unicode
characters.
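As a rough sketch of what this looks like on the application side, the example below uses the `idna` codec from Python's standard library, which prepares and encodes each label of a host name. The helper names and the host name `bücher.example` are hypothetical, and a real application would also need to handle encoding errors and decide which form to display to the user.

```python
def hostname_to_ascii(hostname: str) -> str:
    """Convert a Unicode host name to the ASCII form sent to DNS."""
    return hostname.encode("idna").decode("ascii")


def hostname_to_unicode(hostname: str) -> str:
    """Convert an ASCII host name received from DNS back for display."""
    return hostname.encode("ascii").decode("idna")


if __name__ == "__main__":
    wire_name = hostname_to_ascii("bücher.example")
    print(wire_name)                       # xn--bcher-kva.example
    print(hostname_to_unicode(wire_name))  # bücher.example
```

Because the conversion happens in the application, the DNS servers themselves only ever see the ASCII form and need no changes.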
The Bottom Line
All of these issues are currently outlined
in the IETF Requests for Comments (RFCs).
In summary, enhancing the current DNS to include
more than just English characters is not a simple undertaking.
VeriSign is committed to following the IETF
standards and supporting rapid deployment of this new technology.