
UTF-8

Article snapshot taken from Wikipedia under the Creative Commons Attribution-ShareAlike license. Give it a read and then ask your questions in the chat. We can research this topic together.

UTF-8 is a character encoding standard used for electronic communication. Defined by the Unicode Standard, the name is derived from Unicode Transformation Format – 8-bit. Almost every webpage is stored in UTF-8.


62-413: UTF-8 is capable of encoding all 1,112,064 valid Unicode scalar values using a variable-width encoding of one to four one-byte (8-bit) code units. Code points with lower numerical values, which tend to occur more frequently, are encoded using fewer bytes. It was designed for backward compatibility with ASCII: the first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using
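
As a quick illustration of the variable width described above (a minimal Python sketch using only the standard library; the sample code points are arbitrary), encoding characters of increasing code point value shows the one- to four-byte lengths, with ASCII staying a single, identical byte:

    for cp in (0x41, 0x00E9, 0x20AC, 0x1F600):      # 'A', 'é', '€', '😀'
        encoded = chr(cp).encode("utf-8")
        print(f"U+{cp:04X} -> {encoded.hex()} ({len(encoded)} byte(s))")
    # U+0041 -> 41 (1 byte): identical to the ASCII encoding of 'A'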

124-504: A Private Use Area. In either approach, the byte value is encoded in the low eight bits of the output code point. These encodings are needed if invalid UTF-8 is to survive translation to and then back from the UTF-16 used internally by Python, and as Unix filenames can contain invalid UTF-8 it is necessary for this to work. The official name for the encoding is UTF-8, the spelling used in all Unicode Consortium documents. The hyphen-minus

186-509: A BOM (a change from Windows 7 Notepad), bringing it into line with most other text editors. Some system files on Windows 11 require UTF-8 with no requirement for a BOM, and almost all files on macOS and Linux are required to be UTF-8 without a BOM. Programming languages that default to UTF-8 for I/O include Ruby 3.0, R 4.2.2, Raku and Java 18. Although the current version of Python requires an option to open() to read/write UTF-8, plans exist to make UTF-8 I/O
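
For the Python point just mentioned, a minimal sketch of passing the encoding explicitly to open() (the file name example.txt is only a placeholder), since the default text encoding is otherwise platform-dependent unless UTF-8 mode is enabled:

    with open("example.txt", "w", encoding="utf-8") as f:   # explicit, portable
        f.write("héllo wörld\n")
    with open("example.txt", encoding="utf-8") as f:
        print(f.read())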

248-472: A byte with the high bit set cannot be alone; and in a truly random string a byte with a high bit set has only a 1⁄15 chance of starting a valid UTF-8 character. This has the (possibly unintended) consequence of making it easy to detect if a legacy text encoding is accidentally used instead of UTF-8, making conversion of a system to UTF-8 easier and avoiding the need to require a Byte Order Mark or any other metadata. Since RFC 3629 (November 2003),
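
One way to see this detection property in practice is a rough heuristic sketch in Python (the function name looks_like_utf8 is made up for illustration): strict decoding almost always fails quickly on text that was actually written in a legacy 8-bit encoding.

    def looks_like_utf8(data: bytes) -> bool:
        try:
            data.decode("utf-8")       # strict decoding
            return True
        except UnicodeDecodeError:
            return False

    print(looks_like_utf8("Grüße".encode("utf-8")))    # True
    print(looks_like_utf8("Grüße".encode("latin-1")))  # False: FC DF is not a valid UTF-8 sequence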

310-447: A change would break compatibility with existing systems and therefore might not be feasible at all. Since the aim of a multibyte encoding system is to minimise changes to existing application software, some characters must retain their pre-existing single-unit codes, even while other characters have multiple units in their codes. The result is that there are three sorts of units in a variable-width encoding: singletons, which consist of

372-623: A future version of Python is planned to store strings as UTF-8 by default. Modern versions of Microsoft Visual Studio use UTF-8 internally. Microsoft's SQL Server 2019 added support for UTF-8, and using it results in a 35% speed increase, and "nearly 50% reduction in storage requirements." Java internally uses Modified UTF-8 (MUTF-8), in which the null character U+0000 uses the two-byte overlong encoding 0xC0, 0x80, instead of just 0x00. Modified UTF-8 strings never contain any actual null bytes but can contain all Unicode code points including U+0000, which allows such strings (with

434-471: A null byte appended) to be processed by traditional null-terminated string functions. Java reads and writes normal UTF-8 to files and streams, but it uses Modified UTF-8 for object serialization, for the Java Native Interface, and for embedding constant strings in class files. The dex format defined by Dalvik also uses the same modified UTF-8 to represent string values. Tcl also uses
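
A hypothetical helper (not a standard Python codec) can illustrate the overlong-NUL part of Modified UTF-8 described above; note that real MUTF-8 additionally encodes non-BMP characters as CESU-8-style surrogate pairs, which this sketch ignores:

    def to_modified_utf8_nul(s: str) -> bytes:
        out = bytearray()
        for ch in s:
            if ch == "\x00":
                out += b"\xc0\x80"           # two-byte overlong encoding of U+0000
            else:
                out += ch.encode("utf-8")    # real MUTF-8 also special-cases non-BMP characters
        return bytes(out)

    print(to_modified_utf8_nul("a\x00b").hex())   # 61c08062 – no 0x00 byte appears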

496-604: A row in the above table to encode a code point less than "First code point" (thus using more bytes than necessary) is termed an overlong encoding. These are a security problem because they allow the same code point to be encoded in multiple ways. Overlong encodings (of ../ for example) have been used to bypass security validations in high-profile products including Microsoft's IIS web server and Apache's Tomcat servlet container. Overlong encodings should therefore be considered an error and never decoded. Modified UTF-8 allows an overlong encoding of U+0000. The chart below gives
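
As a concrete example of the overlong problem (a minimal Python sketch, separate from the chart the article refers to): C0 AF is an overlong two-byte form of '/' (U+002F), and a conforming decoder such as Python's strict UTF-8 codec must reject it rather than silently return the slash.

    overlong_slash = b"\xc0\xaf"      # overlong encoding of '/'
    try:
        overlong_slash.decode("utf-8")
    except UnicodeDecodeError as exc:
        print("rejected:", exc)       # conforming decoders refuse overlong forms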

558-440: A search for the two-unit sequence DF E0 can yield a false positive in the sequence DE DF E0 E1, which consists of two consecutive two-unit sequences. There is also the danger that a single corrupted or lost unit may render the whole interpretation of a large run of multiunit sequences incorrect. In a variable-width encoding where all three types of units are disjunct, string searching always works without false positives, and (provided

620-401: A single byte with the same binary value as ASCII, so that a UTF-8-encoded file using only those characters is identical to an ASCII file. Most software designed for any extended ASCII can read and write UTF-8 (including on Microsoft Windows) and this results in fewer internationalization issues than any alternative text encoding. UTF-8 is dominant for all countries/languages on the internet,

682-417: A single unit, lead units, which come first in a multiunit sequence, and trail units, which come afterwards in a multiunit sequence. Input and display software obviously needs to know about the structure of the multibyte encoding scheme, but other software generally doesn't need to know if a pair of bytes represent two separate characters or just one character. For example, the four-character string "I♥NY"


744-508: A variable-width encoding called UTF-1, in which singletons had the range 00–9F, lead units the range A0–FF and trail units the ranges A0–FF and 21–7E. Because of this bad design, similar to Shift JIS and Big5 in its overlap of values, the inventors of the Plan 9 operating system, the first to implement Unicode throughout, abandoned it and replaced it with a much better designed variable-width encoding for Unicode: UTF-8, in which singletons have

806-485: Is a type of character encoding scheme in which codes of differing lengths are used to encode a character set (a repertoire of symbols) for representation, usually in a computer. Most common variable-width encodings are multibyte encodings (aka MBCS – multi-byte character set), which use varying numbers of bytes (octets) to encode different characters. (Some authors, notably in Microsoft documentation, use

868-534: Is an alphabet that uses letters of the Latin script. The 21-letter archaic Latin alphabet and the 23-letter classical Latin alphabet belong to the oldest of this group. The 26-letter modern Latin alphabet is the newest of this group. The 26-letter ISO basic Latin alphabet (adopted from the earlier ASCII) contains the 26 letters of the English alphabet. To handle the many other alphabets also derived from

930-496: Is assumed to be UTF-8 to be losslessly transformed to UTF-16 or UTF-32, by translating the 128 possible error bytes to reserved code points, and transforming those code points back to error bytes to output UTF-8. The most common approach is to translate the codes to U+DC80...U+DCFF which are low (trailing) surrogate values and thus "invalid" UTF-16, as used by Python's PEP 383 (or "surrogateescape") approach. Another encoding called MirBSD OPTU-8/16 converts them to U+EF80...U+EFFF in
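
A short sketch of the surrogateescape behaviour in Python (the byte string here is just an example): each invalid byte round-trips through str as a lone surrogate in U+DC80...U+DCFF.

    raw = b"ok\xffend"                                       # 0xFF can never appear in UTF-8
    text = raw.decode("utf-8", errors="surrogateescape")     # 'ok\udcffend'
    assert text.encode("utf-8", errors="surrogateescape") == raw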

992-521: Is called CESU-8. If the Unicode byte-order mark U+FEFF is at the start of a UTF-8 file, the first three bytes will be 0xEF, 0xBB, 0xBF. The Unicode Standard neither requires nor recommends the use of the BOM for UTF-8, but warns that it may be encountered at the start of a file trans-coded from another encoding. While ASCII text encoded using UTF-8 is backward compatible with ASCII, this
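
A minimal sketch of detecting and stripping the three BOM bytes just mentioned, using Python's standard codecs module (the helper name strip_utf8_bom is illustrative):

    import codecs

    def strip_utf8_bom(data: bytes) -> bytes:
        return data[len(codecs.BOM_UTF8):] if data.startswith(codecs.BOM_UTF8) else data

    print(codecs.BOM_UTF8.hex())                   # efbbbf
    print(strip_utf8_bom(b"\xef\xbb\xbfhello"))    # b'hello'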

1054-594: Is done; for this you can use utf8-c8". That UTF-8 Clean-8 variant, implemented by Raku, is an encoder/decoder that preserves bytes as is (even illegal UTF-8 sequences) and allows for Normal Form Grapheme synthetics. Version 3 of the Python programming language treats each byte of an invalid UTF-8 bytestream as an error (see also changes with new UTF-8 mode in Python 3.7); this gives 128 different possible errors. Extensions have been created to allow any byte sequence that

1116-577: Is encoded in UTF-8 like this (shown as hexadecimal byte values): 49 E2 99 A5 4E 59. Of the six units in that sequence, 49, 4E, and 59 are singletons (for I, N, and Y), E2 is a lead unit and 99 and A5 are trail units. The heart symbol is represented by the combination of the lead unit and the two trail units. UTF-8 makes it easy for a program to identify the three sorts of units, since they fall into separate value ranges. Older variable-width encodings are typically not as well-designed, since
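
Those value ranges can be checked mechanically; a rough Python sketch (which classifies purely by range and ignores which lead bytes are actually legal) reproduces the byte sequence above:

    def unit_kind(byte: int) -> str:
        if byte < 0x80:
            return "singleton"
        if byte <= 0xBF:
            return "trail unit"
        return "lead unit"

    for b in "I♥NY".encode("utf-8"):     # 49 E2 99 A5 4E 59
        print(f"{b:02X}: {unit_kind(b)}")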

1178-417: Is not true when Unicode Standard recommendations are ignored and a BOM is added. A BOM can confuse software that isn't prepared for it but can otherwise accept UTF-8, e.g. programming languages that permit non-ASCII bytes in string literals but not at the start of the file. Nevertheless, there was and still is software that always inserts a BOM when writing UTF-8, and refuses to correctly interpret UTF-8 unless

1240-455: Is required and no spaces are allowed. Some other names used are: There are several current definitions of UTF-8 in various standards documents: They supersede the definitions given in the following obsolete works: They are all the same in their general mechanics, with the main differences being on issues such as allowed range of code point values and safe handling of invalid input. Variable-width encoding A variable-width encoding

1302-418: Is used in most standards, often the only allowed encoding, and is supported by all modern operating systems and programming languages. The International Organization for Standardization (ISO) set out to compose a universal multi-byte character set in 1989. The draft ISO 10646 standard contained a non-required annex called UTF-1 that provided a byte stream encoding of its 32-bit code points. This encoding


1364-584: The Unix path directory separator. In July 1992, the X/Open committee XoJIG was looking for a better encoding. Dave Prosser of Unix System Laboratories submitted a proposal for one that had faster implementation characteristics and introduced the improvement that 7-bit ASCII characters would only represent themselves; multi-byte sequences would only include bytes with the high bit set. The name File System Safe UCS Transformation Format (FSS-UTF) and most of

1426-442: The replacement character "�" (U+FFFD) and continue decoding. Some decoders consider the sequence E1,A0,20 (a truncated 3-byte code followed by a space) as a single error. This is not a good idea as a search for a space character would find the one hidden in the error. Since Unicode 6 (October 2010) the standard (chapter 3) has recommended a "best practice" where the error is either one continuation byte, or ends at
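
A small Python sketch of decoding such input with the "replace" policy (how the bytes E1 A0 are grouped into one or two U+FFFD characters depends on the decoder's error-segmentation rules, so the exact output is not guaranteed; the following space always survives as its own character):

    broken = b"abc\xe1\xa0 def"          # truncated 3-byte sequence, then a space
    print(broken.decode("utf-8", errors="replace"))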

1488-566: The French é and the German ö are not listed separately in their respective alphabet sequences. With some alphabets, some altered letters are considered distinct while others are not; for instance, in Spanish, ñ (which indicates a unique phoneme) is listed separately, while á, é, í, ó, ú, and ü (which do not; the first five of these indicate a nonstandard stress-accent placement, while the last forces

1550-543: The ISO 2022 7-bit encodings were replaced by a set of 8-bit encoding schemes, the Extended Unix Code: EUC-JP, EUC-CN and EUC-KR. Instead of distinguishing between the multiunit sequences and the singletons with escape sequences, which made the encodings stateful, multiunit sequences were marked by having the most significant bit set, that is, being in the range 80–FF (hexadecimal), while the singletons were in

1612-473: The PC (DOS and Microsoft Windows platforms), two encodings became established for Japanese and Traditional Chinese in which all of singletons, lead units and trail units overlapped: Shift-JIS and Big5 respectively. In Shift-JIS, lead units had the range 81–9F and E0–FC, trail units had the range 40–7E and 80–FC, and singletons had the range 21–7E and A1–DF. In Big5, lead units had the range A1–FE, trail units had

1674-490: The characters u to z are replaced by the bits of the code point, from the positions U+uvwxyz: The first 128 code points (ASCII) need 1 byte. The next 1,920 code points need two bytes to encode, which covers the remainder of almost all Latin-script alphabets, and also IPA extensions, Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac, Thaana and N'Ko alphabets, as well as Combining Diacritical Marks. Three bytes are needed for
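
As a worked sketch of the two-byte case just mentioned, the pattern 110xxxxx 10xxxxxx splits U+00E9 ('é') into the bytes 0xC3 0xA9 (checked here against Python's own encoder):

    cp = 0x00E9
    lead  = 0xC0 | (cp >> 6)       # 110 00011 -> 0xC3
    trail = 0x80 | (cp & 0x3F)     # 10 101001 -> 0xA9
    assert bytes([lead, trail]) == "é".encode("utf-8")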

1736-576: The classical Latin one, ISO and other telecommunications groups "extended" the ISO basic Latin multiple times in the late 20th century. More recent international standards (e.g. Unicode) include those that achieved ISO adoption. Apart from alphabets for modern spoken languages, there exist phonetic alphabets and spelling alphabets in use derived from Latin script letters. Historical languages may also have used (or are now studied using) alphabets that are derived but still distinct from those of classical Latin and their modern forms (if any). The Latin script

1798-471: The constraints of the UTF-16 character encoding: explicitly prohibiting code points corresponding to the high and low surrogate characters removed more than 3% of the three-byte sequences, and ending at U+10FFFF removed more than 48% of the four-byte sequences and all five- and six-byte sequences. UTF-8 encodes code points in one to four bytes, depending on the value of the code point. In the following table,

1860-467: The decoder is well written) the corruption or loss of one unit corrupts only one character. The first use of multibyte encodings was for the encoding of Chinese, Japanese and Korean, which have large character sets well in excess of 256 characters. At first the encoding was constrained to the limit of 7 bits. The ISO-2022-JP, ISO-2022-CN and ISO-2022-KR encodings used the range 21–7E (hexadecimal) for both lead units and trail units, and marked them off from

1922-506: The default encoding in XML and HTML (and not just using UTF-8, also declaring it in metadata), "even when all characters are in the ASCII range ... Using non-UTF-8 encodings can have unexpected results". Lots of software has the ability to read/write UTF-8. It may, though, require the user to change options from the normal settings, or may require a BOM (byte-order mark) as the first character to read


1984-519: The default in Python 3.15. C++23 adopts UTF-8 as the only portable source code file format (surprisingly there was none before). Backwards compatibility is a serious impediment to changing code and APIs using UTF-16 to use UTF-8, but this is happening. As of May 2019, Microsoft added the capability for an application to set UTF-8 as the "code page" for the Windows API, removing

2046-492: The detailed meaning of each byte in a stream encoded in UTF-8. Not all sequences of bytes are valid UTF-8. A UTF-8 decoder should be prepared for: Many of the first UTF-8 decoders would decode these, ignoring incorrect bits. Carefully crafted invalid UTF-8 could make them either skip or create ASCII characters such as NUL, slash, or quotes, leading to security vulnerabilities. It is also common to throw an exception or truncate

2108-451: The early days of Unicode there were no characters greater than U+FFFF and combining characters were rarely used, so the 16-bit encoding was fixed-size. This made processing of text more efficient, though the gains are nowhere as great as novice programmers may imagine. All such advantages were lost as soon as UTF-16 became variable width as well. The code points U+0800 – U+FFFF take 3 bytes in UTF-8 but only 2 in UTF-16. This led to

2170-420: The file. Examples of software supporting UTF-8 include Microsoft Word, Microsoft Excel (2016 and later), Google Drive, LibreOffice and most databases. Software that "defaults" to UTF-8 (meaning it writes it without the user changing settings, and it reads it without a BOM) has become more common since 2010. Windows Notepad, in all currently supported versions of Windows, defaults to writing UTF-8 without

2232-412: The first byte that is disallowed, so E1,A0,20 is a two-byte error followed by a space. This means an error is no more than three bytes long and never contains the start of a valid character, and there are 21,952 different possible errors. Technically this makes UTF-8 no longer a prefix code (you have to read one byte past some errors to figure out they are an error), but searching still works if

2294-536: The first character is a BOM (or the file only contains ASCII). For a long time there was considerable argument as to whether it was better to process text in UTF-16 or in UTF-8. The primary advantage of UTF-16 is that the Windows API required it to be used to get access to all Unicode characters (only recently has this been fixed). This caused several libraries such as Qt to also use UTF-16 strings, which propagates this requirement to non-Windows platforms. In

2356-634: The high and low surrogates used by UTF-16 (U+D800 through U+DFFF) are not legal Unicode values, and their UTF-8 encodings must be treated as an invalid byte sequence. These encodings all start with 0xED followed by 0xA0 or higher. This rule is often ignored as surrogates are allowed in Windows filenames and this means there must be a way to store them in a string. UTF-8 that allows these surrogate halves has been (informally) called WTF-8, while another variation that also encodes all non-BMP characters as two surrogates (6 bytes instead of 4)
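
A short Python sketch of the difference: strict UTF-8 refuses a lone surrogate, while the "surrogatepass" error handler produces exactly the 0xED 0xA0-style bytes described above.

    lone = "\ud800"                     # a lone high surrogate, not a valid scalar value
    try:
        lone.encode("utf-8")
    except UnicodeEncodeError:
        print("strict UTF-8 rejects lone surrogates")
    print(lone.encode("utf-8", errors="surrogatepass").hex())   # eda080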

2418-527: The idea that text in Chinese and other languages would take more space in UTF-8. However, text is only larger if there are more of these code points than 1-byte ASCII code points, and this rarely happens in real-world documents due to spaces, newlines, digits, punctuation, English words, and HTML markup. UTF-8 has the advantages of being trivial to retrofit to any system that could handle an extended ASCII, not having byte-order problems, and taking about 1/2

2480-441: The last byte of a code point to decode it. Unlike many earlier multi-byte text encodings such as Shift-JIS, it is self-synchronizing, so searches for short strings or characters are possible, and the start of a code point can be found from a random position by backing up at most 3 bytes. The values chosen for the lead bytes mean that sorting a list of UTF-8 strings puts them in the same order as sorting UTF-32 strings. Using
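
A minimal sketch of that self-synchronization in Python (the helper name codepoint_start is made up): starting from an arbitrary byte index, skipping backwards over 10xxxxxx continuation bytes reaches the lead byte within at most three steps in valid UTF-8.

    def codepoint_start(buf: bytes, i: int) -> int:
        while i > 0 and 0x80 <= buf[i] <= 0xBF:   # continuation bytes are 10xxxxxx
            i -= 1
        return i

    data = "I♥NY".encode("utf-8")     # 49 E2 99 A5 4E 59
    print(codepoint_start(data, 3))   # 1: index of the lead byte E2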

2542-474: The need to use UTF-16; and more recently has recommended programmers use UTF-8, and even states "UTF-16 [...] is a unique burden that Windows places on code that targets multiple platforms". The default string primitive in Go, Julia, Rust, Swift (since version 5), and PyPy uses UTF-8 internally in all cases. Python (since version 3.3) uses UTF-8 internally for Python C API extensions and sometimes for strings and


2604-792: The original Unicode (1.x) without breaking compatibility with the 16-bit encoding. In UTF-16, singletons have the range 0000–D7FF (55,296 code points) and E000–FFFF (8192 code points, 63,488 in total), lead units the range D800–DBFF (1024 code points) and trail units the range DC00–DFFF (1024 code points, 2048 in total). The lead and trail units, called high surrogates and low surrogates, respectively, in Unicode terminology, map 1024×1024 or 1,048,576 supplementary characters, making 1,112,064 (63,488 BMP code points + 1,048,576 code points represented by high and low surrogate pairs) encodable code points, or scalar values in Unicode parlance (surrogates are not encodable). Latin-script alphabet A Latin-script alphabet (Latin alphabet or Roman alphabet)
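
The surrogate-pair arithmetic just described can be shown in a few lines of Python (the code point chosen is arbitrary; the result is checked against the built-in UTF-16 encoder):

    cp = 0x1F600                      # a supplementary code point, outside the BMP
    v = cp - 0x10000                  # 20 bits to distribute over the pair
    high = 0xD800 + (v >> 10)         # high (lead) surrogate
    low  = 0xDC00 + (v & 0x3FF)       # low (trail) surrogate
    assert chr(cp).encode("utf-16-be") == bytes.fromhex(f"{high:04x}{low:04x}")
    # 1024 × 1024 pairs cover the 1,048,576 supplementary code points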

2666-414: The previous proposal. It also abandoned the use of biases that prevented overlong encodings . Thompson's design was outlined on September 2, 1992, on a placemat in a New Jersey diner with Rob Pike . In the following days, Pike and Thompson implemented it and updated Plan 9 to use it throughout, and then communicated their success back to X/Open, which accepted it as the specification for FSS-UTF. UTF-8

2728-458: The pronunciation of a normally-silent letter) are not. Digraphs in some languages may be separately included in the collation sequence (e.g. Hungarian CS, Welsh RH). New letters must be separately included unless collation is not practised. Coverage of the letters of the ISO basic Latin alphabet can be and additional letters can be Most alphabets have the letters of the ISO basic Latin alphabet in

2790-435: The range 00–7F alone. The lead units and trail units were in the range A1 to FE (hexadecimal), that is, the same as their range in the ISO 2022 encodings, but with the high bit set to 1. These encodings were reasonably easy to work with provided all your delimiters were ASCII characters and you avoided truncating strings to fixed lengths, but a break in the middle of a multibyte character could still cause major corruption. On

2852-406: The range 00–7F, lead units have the range C0–FD (now actually C2–F4, to avoid overlong sequences and to maintain synchronism with the encoding capacity of UTF-16; see the UTF-8 article), and trail units have the range 80–BF. The lead unit also tells how many trail units follow: one after C2–DF, two after E0–EF and three after F0–F4. UTF-16 was devised to break free of the 65,536-character limit of
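
The lead-unit ranges given above translate directly into code; a small Python sketch (the function name trail_count is illustrative, and values outside C2–F4 are simply reported as zero):

    def trail_count(lead: int) -> int:
        if 0xC2 <= lead <= 0xDF:
            return 1
        if 0xE0 <= lead <= 0xEF:
            return 2
        if 0xF0 <= lead <= 0xF4:
            return 3
        return 0   # ASCII singleton, trail byte, or invalid lead (C0, C1, F5–FF)

    print(trail_count(0xE2))   # 2, as in the three-byte heart symbol above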

2914-532: The range 40–7E and A1–FE, and singletons had the range 21–7E (all values in hexadecimal). This overlap again made processing tricky, though at least most of the symbols had unique byte values (though strangely the backslash does not). The Unicode standard has two variable-width encodings: UTF-8 and UTF-16 (it also has a fixed-width encoding, UTF-32). Originally, both the Unicode and ISO 10646 standards were meant to be fixed-width, with Unicode being 16-bit and ISO 10646 being 32-bit. ISO 10646 provided

2976-466: The ranges may overlap. A text processing application that deals with the variable-width encoding must then scan the text from the beginning of all definitive sequences in order to identify the various units and interpret the text correctly. In such encodings, one is liable to encounter false positives when searching for a string in the middle of the text. For example, if the hexadecimal values DE, DF, E0, and E1 can all be either lead units or trail units, then

3038-494: The remaining 61,440 codepoints of the Basic Multilingual Plane (BMP), including most Chinese, Japanese and Korean characters. Four bytes are needed for the 1,048,576 codepoints in the other planes of Unicode, which include emoji (pictographic symbols), less common CJK characters, various historic scripts, and mathematical symbols. This is a prefix code and it is unnecessary to read past

3100-436: The result of a need to increase the number of characters which can be encoded without breaking backward compatibility with an existing constraint. For example, with one byte (8 bits) per character, one can encode 256 possible characters; in order to encode more than 256 characters, the obvious choice would be to use two or more bytes per encoding unit, two bytes (16 bits) would allow 65,536 possible characters, but such

3162-589: The same modified UTF-8 as Java for internal representation of Unicode data, but uses strict CESU-8 for external data. All known Modified UTF-8 implementations also treat the surrogate pairs as in CESU-8. The Raku programming language (formerly Perl 6) uses utf-8 encoding by default for I/O (Perl 5 also supports it); though that choice in Raku also implies "normalization into Unicode NFC (normalization form canonical). In some cases you may want to ensure no normalization


3224-537: The same order as that alphabet. Some alphabets regard digraphs as distinct letters, e.g. the Spanish alphabet from 1803 to 1994 had CH and LL sorted apart from C and L. Some alphabets sort letters that have diacritics or are ligatures at the end of the alphabet. Examples are the Scandinavian Danish, Norwegian, Swedish, and Finnish alphabets. Icelandic sorts a new letter form and a ligature at

3286-464: The searched-for string does not contain any errors. Making each byte be an error, in which case E1,A0,20 is two errors followed by a space, also still allows searching for a valid string. This means there are only 128 different errors which makes it practical to store the errors in the output string, or replace them with characters from a legacy encoding. Only a small subset of possible byte strings are error-free UTF-8: several bytes cannot appear;

3348-462: The singletons by using ISO 2022 escape sequences to switch between single-byte and multibyte mode. A total of 8,836 (94×94) characters could be encoded at first, and further sets of 94×94 characters with switching. The ISO 2022 encoding schemes for CJK are still in use on the Internet. The stateful nature of these encodings and the large overlap make them very awkward to process. On Unix platforms,

3410-496: The space for any language using mostly Latin letters. UTF-8 has been the most common encoding for the World Wide Web since 2008. As of November 2024, UTF-8 is used by 98.4% of surveyed web sites. Although many pages only use ASCII characters to display content, very few websites now declare their encoding to only be ASCII instead of UTF-8. Virtually all countries and languages have 95% or more use of UTF-8 encodings on

3472-642: The string at an error but this turns what would otherwise be harmless errors (i.e. "file not found") into a denial of service; for instance, early versions of Python 3.0 would exit immediately if the command line or environment variables contained invalid UTF-8. RFC 3629 states "Implementations of the decoding algorithm MUST protect against decoding invalid sequences." The Unicode Standard requires decoders to: "... treat any ill-formed code unit sequence as an error condition. This guarantees that it will neither interpret nor emit an ill-formed code unit sequence." The standard now recommends replacing each error with

3534-572: The term multibyte character set, which is a misnomer, because representation size is an attribute of the encoding, not of the character set.) Early variable-width encodings using less than a byte per character were sometimes used to pack English text into fewer bytes in adventure games for early microcomputers. However, disks (which, unlike tapes, allowed random access, letting text be loaded on demand), increases in computer memory and general-purpose compression algorithms have rendered such tricks largely obsolete. Multibyte encodings are usually

3596-424: The text of this proposal were later preserved in the final specification. In August 1992, this proposal was circulated by an IBM X/Open representative to interested parties. A modification by Ken Thompson of the Plan 9 operating system group at Bell Labs made it self-synchronizing , letting a reader start anywhere and immediately detect character boundaries, at the cost of being somewhat less bit-efficient than

3658-590: The web. Many standards only support UTF-8, e.g. JSON exchange requires it (without a byte-order mark (BOM)). UTF-8 is also the recommendation from the WHATWG for HTML and DOM specifications, which state that "UTF-8 encoding is the most appropriate encoding for interchange of Unicode", and the Internet Mail Consortium recommends that all e‑mail programs be able to display and create mail using UTF-8. The World Wide Web Consortium recommends UTF-8 as

3720-647: Was first officially presented at the USENIX conference in San Diego, from January 25 to 29, 1993. The Internet Engineering Task Force adopted UTF-8 in its Policy on Character Sets and Languages in RFC 2277 (BCP 18) for future internet standards work in January 1998, replacing Single Byte Character Sets such as Latin-1 in older RFCs. In November 2003, UTF-8 was restricted by RFC 3629 to match

3782-457: Was not satisfactory on performance grounds, among other problems, and the biggest problem was probably that it did not have a clear separation between ASCII and non-ASCII: new UTF-1 tools would be backward compatible with ASCII-encoded text, but UTF-1-encoded text could confuse existing code expecting ASCII (or extended ASCII), because it could contain continuation bytes in the range 0x21–0x7E that meant something else in ASCII, e.g., 0x2F for /,


3844-485: Was typically slightly altered to function as an alphabet for each different language (or other use), although the main letters are largely the same. A few general classes of alteration cover many particular cases: These often were given a place in the alphabet by defining an alphabetical order or collation sequence, which can vary between languages. Some of the results, especially from just adding diacritics, were not considered distinct letters for this purpose; for example,
