Unicode codepoints

4/4/2023

UTF-16 (16-bit Unicode Transformation Format) is a character encoding capable of encoding all 1,112,064 valid code points of Unicode (in fact, this number of code points is dictated by the design of UTF-16). The encoding is variable-length, as code points are encoded with one or two 16-bit code units. UTF-16 arose from an earlier, now obsolete fixed-width 16-bit encoding known as UCS-2 (for 2-byte Universal Character Set), once it became clear that more than 2^16 (65,536) code points were needed.

UTF-16 is used by systems such as the Microsoft Windows API, the Java programming language and JavaScript/ECMAScript. It is also sometimes used for plain text and word-processing data files on Microsoft Windows. It is rarely used for files on Unix-like systems. It is used by SMS (the SMS standard specifies UCS-2, but almost all users actually implement UTF-16 so that emojis work).

UTF-16 is the only web encoding incompatible with ASCII and never gained popularity on the web, where it is declared by under 0.002% (little over 1 thousandth of 1 percent) of web pages (and many of these are actually UTF-8 because of "contradictory character encoding specifications" and/or "incorrect character encoding defined"). UTF-8, by comparison, accounts for 98% of all web pages. The Web Hypertext Application Technology Working Group (WHATWG) considers UTF-8 "the mandatory encoding for all " and holds that, for security reasons, browser applications should not use UTF-16.

In the late 1980s, work began on developing a uniform encoding for a "Universal Character Set" (UCS) that would replace earlier language-specific encodings with one coordinated system. The goal was to include all required characters from most of the world's languages, as well as symbols from technical domains such as science, mathematics, and music. The original idea was to replace the typical 256-character encodings, which required 1 byte per character, with an encoding using 65,536 (2^16) values, which would require 2 bytes (16 bits) per character. Two groups worked on this in parallel: ISO/IEC JTC 1/SC 2 and the Unicode Consortium, the latter representing mostly manufacturers of computing equipment. The two groups attempted to synchronize their character assignments so that the developing encodings would be mutually compatible. The early 2-byte encoding was originally called "Unicode", but is now called "UCS-2".

When it became increasingly clear that 2^16 characters would not suffice, IEEE introduced a larger 31-bit space and an encoding (UCS-4) that would require 4 bytes per character. This was resisted by the Unicode Consortium, both because 4 bytes per character wasted a lot of memory and disk space, and because some manufacturers were already heavily invested in 2-byte-per-character technology. The UTF-16 encoding scheme was developed as a compromise and introduced with version 2.0 of the Unicode standard in July 1996. It is fully specified in RFC 2781, published in 2000 by the IETF.

In the UTF-16 encoding, code points less than 2^16 are encoded with a single 16-bit code unit equal to the numerical value of the code point, as in the older UCS-2. The newer code points greater than or equal to 2^16 are encoded by a compound value using two 16-bit code units.
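As a rough illustration of the one-or-two code unit scheme described above, here is a minimal Python sketch. The function names are made up for this example. The surrogate ranges it uses (0xD800 to 0xDBFF for the first unit, 0xDC00 to 0xDFFF for the second) are not spelled out in the text above; they are taken from the Unicode standard's definition of UTF-16. The sketch also shows where the figure of 1,112,064 valid code points comes from: the full range up to 0x10FFFF minus the 2,048 values reserved for surrogates.

```python
from typing import List


def encode_utf16_units(code_point: int) -> List[int]:
    """Return the 16-bit code units encoding one code point (illustrative helper)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        # Surrogate values are reserved for UTF-16 itself and are not valid code points.
        raise ValueError("surrogate values cannot be encoded")
    if code_point < 0x10000:
        # Below 2^16: a single code unit equal to the code point, as in UCS-2.
        return [code_point]
    # At or above 2^16: split the 20-bit offset across two code units.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return [high, low]


def valid_code_point_count() -> int:
    """Code points reachable by UTF-16, excluding the 2,048 surrogate values."""
    return 0x110000 - (0xDFFF - 0xD800 + 1)


if __name__ == "__main__":
    print([hex(u) for u in encode_utf16_units(0x41)])     # single code unit
    print([hex(u) for u in encode_utf16_units(0x1F600)])  # two code units
    print(valid_code_point_count())
```

For a cross-check, Python's built-in codec should agree: "\U0001F600".encode("utf-16-be") yields the same two code units in big-endian byte order.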
