Misplaced Pages

Latin-1 Supplement

Article snapshot taken from Wikipedia with creative commons attribution-sharealike license. Give it a read and then ask your questions in the chat. We can research this topic together.

The Latin-1 Supplement (also called C1 Controls and Latin-1 Supplement ) is the second Unicode block in the Unicode standard. It encodes the upper range of ISO 8859-1 : 80 (U+0080) - FF (U+00FF). C1 Controls (0080–009F) are not graphic. This block ranges from U+0080 to U+00FF, contains 128 characters and includes the C1 controls , Latin-1 punctuation and symbols , 30 pairs of majuscule and minuscule accented Latin characters and 2 mathematical operators.

#152847

78-684: The C1 Controls and Latin-1 Supplement block has been included in its present form, with the same character repertoire since version 1.0 of the Unicode Standard . Its block name in Unicode 1.0 was simply Latin1 . The C1 Controls and Latin-1 Supplement block has four subheadings within its character collection: C1 controls, Latin-1 Punctuation and Symbols, Letters, and Mathematical operator(s). The C1 controls subheading contains 32 supplementary control codes inherited from ISO/IEC 8859-1 and many other 8-bit character standards. The alias names for

156-506: A Private Use Area . In either approach, the byte value is encoded in the low eight bits of the output code point. These encodings are needed if invalid UTF-8 is to survive translation to and then back from the UTF-16 used internally by Python, and as Unix filenames can contain invalid UTF-8 it is necessary for this to work. The official name for the encoding is UTF-8 , the spelling used in all Unicode Consortium documents. The hyphen-minus

234-444: A variable-width encoding of one to four one- byte (8-bit) code units. Code points with lower numerical values, which tend to occur more frequently, are encoded using fewer bytes. It was designed for backward compatibility with ASCII : the first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single byte with the same binary value as ASCII, so that a UTF-8-encoded file using only those characters

312-400: A BOM (a change from Windows 7 Notepad ), bringing it into line with most other text editors. Some system files on Windows 11 require UTF-8 with no requirement for a BOM, and almost all files on macOS and Linux are required to be UTF-8 without a BOM. Programming languages that default to UTF-8 for I/O include Ruby  3.0, R  4.2.2, Raku and Java  18. Although

390-498: A block is always a multiple of 16, and is often a multiple of 128, but is otherwise arbitrary. Characters required for a given script may be spread out over several different, potentially disjunct blocks within the codespace. Each code point is assigned a classification, listed as the code point's General Category property. Here, at the uppermost level code points are categorized as one of Letter, Mark, Number, Punctuation, Symbol, Separator, or Other. Under each category, each code point

468-472: A byte with the high bit set cannot be alone; and in a truly random string a byte with a high bit set has only a 1 ⁄ 15 chance of starting a valid UTF-8 character. This has the (possibly unintended) consequence of making it easy to detect if a legacy text encoding is accidentally used instead of UTF-8, making conversion of a system to UTF-8 easier and avoiding the need to require a Byte Order Mark or any other metadata. Since RFC 3629 (November 2003),

546-727: A calendar year and with rare cases where the scheduled release had to be postponed. For instance, in April 2020, a month after version 13.0 was published, the Unicode Consortium announced they had changed the intended release date for version 14.0, pushing it back six months to September 2021 due to the COVID-19 pandemic . Unicode 16.0, the latest version, was released on 10 September 2024. It added 5,185 characters and seven new scripts: Garay , Gurung Khema , Kirat Rai , Ol Onal , Sunuwar , Todhri , and Tulu-Tigalari . Thus far,

624-432: A comprehensive catalog of character properties, including those needed for supporting bidirectional text , as well as visual charts and reference data sets to aid implementers. Previously, The Unicode Standard was sold as a print volume containing the complete core specification, standard annexes, and code charts. However, version 5.0, published in 2006, was the last version printed this way. Starting with version 5.2, only

702-520: A full semantic duplicate of the Latin alphabet, because legacy CJK encodings contained both "fullwidth" (matching the width of CJK characters) and "halfwidth" (matching ordinary Latin script) characters. The Unicode Bulldog Award is given to people deemed to be influential in Unicode's development, with recipients including Tatsuo Kobayashi , Thomas Milo, Roozbeh Pournader , Ken Lunde , and Michael Everson . The origins of Unicode can be traced back to

780-629: A future version of Python is planned to store strings as UTF-8 by default. Modern versions of Microsoft Visual Studio use UTF-8 internally. Microsoft's SQL Server 2019 added support for UTF-8, and using it results in a 35% speed increase, and "nearly 50% reduction in storage requirements." Java internally uses Modified UTF-8 (MUTF-8), in which the null character U+0000 uses the two-byte overlong encoding 0xC0 ,  0x80 , instead of just 0x00 . Modified UTF-8 strings never contain any actual null bytes but can contain all Unicode code points including U+0000, which allows such strings (with

858-442: A large number of scripts, and not with all of the scripts supported being treated in a consistent manner. The philosophy that underpins Unicode seeks to encode the underlying characters— graphemes and grapheme-like units—rather than graphical distinctions considered mere variant glyphs thereof, that are instead best handled by the typeface , through the use of markup , or by some other means. In particularly complex cases, such as

SECTION 10

#1732766265153

936-481: A low-surrogate code point forms a surrogate pair in UTF-16 in order to represent code points greater than U+FFFF . In principle, these code points cannot otherwise be used, though in practice this rule is often ignored, especially when not using UTF-16. A small set of code points are guaranteed never to be assigned to characters, although third-parties may make independent use of them at their discretion. There are 66 of these noncharacters : U+FDD0 – U+FDEF and

1014-477: A null byte appended) to be processed by traditional null-terminated string functions. Java reads and writes normal UTF-8 to files and streams, but it uses Modified UTF-8 for object serialization , for the Java Native Interface , and for embedding constant strings in class files . The dex format defined by Dalvik also uses the same modified UTF-8 to represent string values. Tcl also uses

1092-421: A part of the standard. Moreover, the widespread adoption of Unicode was in large part responsible for the initial popularization of emoji outside of Japan. Unicode is ultimately capable of encoding more than 1.1 million characters. Unicode has largely supplanted the previous environment of a myriad of incompatible character sets , each used within different locales and on different computer architectures. Unicode

1170-535: A project run by Deborah Anderson at the University of California, Berkeley was founded in 2002 with the goal of funding proposals for scripts not yet encoded in the standard. The project has become a major source of proposed additions to the standard in recent years. The Unicode Consortium together with the ISO have developed a shared repertoire following the initial publication of The Unicode Standard : Unicode and

1248-399: A properly engineered design, 16 bits per character are more than sufficient for this purpose. This design decision was made based on the assumption that only scripts and characters in "modern" use would require encoding: Unicode gives higher priority to ensuring utility for the future than to preserving past antiquities. Unicode aims in the first instance at the characters published in

1326-501: A reader start anywhere and immediately detect character boundaries, at the cost of being somewhat less bit-efficient than the previous proposal. It also abandoned the use of biases that prevented overlong encodings . Thompson's design was outlined on September 2, 1992, on a placemat in a New Jersey diner with Rob Pike . In the following days, Pike and Thompson implemented it and updated Plan 9 to use it throughout, and then communicated their success back to X/Open, which accepted it as

1404-606: A row in the above table to encode a code point less than "First code point" (thus using more bytes than necessary) is termed an overlong encoding . These are a security problem because they allow the same code point to be encoded in multiple ways. Overlong encodings (of ../ for example) have been used to bypass security validations in high-profile products including Microsoft's IIS web server and Apache's Tomcat servlet container. Overlong encodings should therefore be considered an error and never decoded. Modified UTF-8 allows an overlong encoding of U+0000 . The chart below gives

1482-557: A text. The exclusion of surrogates and noncharacters leaves 1 111 998 code points available for use. UTF-8 UTF-8 is a character encoding standard used for electronic communication. Defined by the Unicode Standard, the name is derived from Unicode Transformation Format – 8-bit . Almost every webpage is stored in UTF-8. UTF-8 is capable of encoding all 1,112,064 valid Unicode scalar values using

1560-558: A total of 168 scripts are included in the latest version of Unicode (covering alphabets , abugidas and syllabaries ), although there are still scripts that are not yet encoded, particularly those mainly used in historical, liturgical, and academic contexts. Further additions of characters to the already encoded scripts, as well as symbols, in particular for mathematics and music (in the form of notes and rhythmic symbols), also occur. The Unicode Roadmap Committee ( Michael Everson , Rick McGowan, Ken Whistler, V.S. Umamaheswaran) maintain

1638-654: A universal encoding than the original Unicode architecture envisioned. Version 1.0 of Microsoft's TrueType specification, published in 1992, used the name "Apple Unicode" instead of "Unicode" for the Platform ID in the naming table. The Unicode Consortium is a nonprofit organization that coordinates Unicode's development. Full members include most of the main computer software and hardware companies (and few others) with any interest in text-processing standards, including Adobe , Apple , Google , IBM , Meta (previously as Facebook), Microsoft , Netflix , and SAP . Over

SECTION 20

#1732766265153

1716-582: A universal multi-byte character set in 1989. The draft ISO 10646 standard contained a non-required annex called UTF-1 that provided a byte stream encoding of its 32-bit code points. This encoding was not satisfactory on performance grounds, among other problems, and the biggest problem was probably that it did not have a clear separation between ASCII and non-ASCII: new UTF-1 tools would be backward compatible with ASCII-encoded text, but UTF-1-encoded text could confuse existing code expecting ASCII (or extended ASCII ), because it could contain continuation bytes in

1794-497: Is assumed to be UTF-8 to be losslessly transformed to UTF-16 or UTF-32, by translating the 128 possible error bytes to reserved code points, and transforming those code points back to error bytes to output UTF-8. The most common approach is to translate the codes to U+DC80...U+DCFF which are low (trailing) surrogate values and thus "invalid" UTF-16, as used by Python 's PEP 383 (or "surrogateescape") approach. Another encoding called MirBSD OPTU-8/16 converts them to U+EF80...U+EFFF in

1872-467: Is called CESU-8 . If the Unicode byte-order mark U+FEFF is at the start of a UTF-8 file, the first three bytes will be 0xEF , 0xBB , 0xBF . The Unicode Standard neither requires nor recommends the use of the BOM for UTF-8, but warns that it may be encountered at the start of a file trans-coded from another encoding. While ASCII text encoded using UTF-8 is backward compatible with ASCII, this

1950-597: Is done; for this you can use utf8-c8 ". That UTF-8 Clean-8 variant, implemented by Raku, is an encoder/decoder that preserves bytes as is (even illegal UTF-8 sequences) and allows for Normal Form Grapheme synthetics. Version 3 of the Python programming language treats each byte of an invalid UTF-8 bytestream as an error (see also changes with new UTF-8 mode in Python 3.7 ); this gives 128 different possible errors. Extensions have been created to allow any byte sequence that

2028-456: Is either one continuation byte, or ends at the first byte that is disallowed, so E1,A0,20 is a two-byte error followed by a space. This means an error is no more than three bytes long and never contains the start of a valid character, and there are 21,952  different possible errors. Technically this makes UTF-8 no longer a prefix code (you have to read one byte past some errors to figure out they are an error), but searching still works if

2106-511: Is identical to an ASCII file. Most software designed for any extended ASCII can read and write UTF-8 (including on Microsoft Windows ) and this results in fewer internationalization issues than any alternative text encoding. UTF-8 is dominant for all countries/languages on the internet, is used in most standards, often the only allowed encoding, and is supported by all modern operating systems and programming languages. The International Organization for Standardization (ISO) set out to compose

2184-413: Is intended to suggest a unique, unified, universal encoding". In this document, entitled Unicode 88 , Becker outlined a scheme using 16-bit characters: Unicode is intended to address the need for a workable, reliable world text encoding. Unicode could be roughly described as "wide-body ASCII " that has been stretched to 16 bits to encompass the characters of all the world's living languages. In

2262-457: Is not padded. There are a total of 2 + (2 − 2 ) = 1 112 064 valid code points within the codespace. (This number arises from the limitations of the UTF-16 character encoding, which can encode the 2 code points in the range U+0000 through U+FFFF except for the 2 code points in the range U+D800 through U+DFFF , which are used as surrogate pairs to encode the 2 code points in

2340-417: Is not true when Unicode Standard recommendations are ignored and a BOM is added. A BOM can confuse software that isn't prepared for it but can otherwise accept UTF-8, e.g. programming languages that permit non-ASCII bytes in string literals but not at the start of the file. Nevertheless, there was and still is software that always inserts a BOM when writing UTF-8, and refuses to correctly interpret UTF-8 unless

2418-480: Is projected to include 4301 new unified CJK characters . The Unicode Standard defines a codespace : a sequence of integers called code points in the range from 0 to 1 114 111 , notated according to the standard as U+0000 – U+10FFFF . The codespace is a systematic, architecture-independent representation of The Unicode Standard ; actual text is processed as binary data via one of several Unicode encodings, such as UTF-8 . In this normative notation,

Latin-1 Supplement - Misplaced Pages Continue

2496-400: Is then further subcategorized. In most cases, other properties must be used to adequately describe all the characteristics of any given code point. The 1024 points in the range U+D800 – U+DBFF are known as high-surrogate code points, and code points in the range U+DC00 – U+DFFF ( 1024 code points) are known as low-surrogate code points. A high-surrogate code point followed by

2574-507: Is used to encode the vast majority of text on the Internet, including most web pages , and relevant Unicode support has become a common consideration in contemporary software development. The Unicode character repertoire is synchronized with ISO/IEC 10646 , each being code-for-code identical with one another. However, The Unicode Standard is more than just a repertoire within which characters are assigned. To aid developers and designers,

2652-399: The replacement character "�" (U+FFFD) and continue decoding. Some decoders consider the sequence E1,A0,20 (a truncated 3-byte code followed by a space) as a single error. This is not a good idea as a search for a space character would find the one hidden in the error. Since Unicode 6 (October 2010) the standard (chapter 3) has recommended a "best practice" where the error

2730-578: The 1980s, to a group of individuals with connections to Xerox 's Character Code Standard (XCCS). In 1987, Xerox employee Joe Becker , along with Apple employees Lee Collins and Mark Davis , started investigating the practicalities of creating a universal character set. With additional input from Peter Fenwick and Dave Opstad , Becker published a draft proposal for an "international/multilingual text character encoding system in August 1988, tentatively called Unicode". He explained that "the name 'Unicode'

2808-609: The C0 and C1 control codes are taken from ISO/IEC 6429:1992 . The Latin-1 Punctuation and Symbols subheading contains 32 characters of common international punctuation characters, such as the inverted question and exclamation marks, a middle dot, and symbols such as currency signs, spacing diacritic marks, vulgar fractions, and superscript numbers. The Letters subheading contains 30 pairs of majuscule and minuscule accented or novel Latin characters for western European languages, and two extra minuscule characters ( ß and ÿ ) not commonly used as

2886-567: The ISO's Universal Coded Character Set (UCS) use identical character names and code points. However, the Unicode versions do differ from their ISO equivalents in two significant ways. While the UCS is a simple character map, Unicode specifies the rules, algorithms, and properties necessary to achieve interoperability between different platforms and languages. Thus, The Unicode Standard includes more information, covering in-depth topics such as bitwise encoding, collation , and rendering. It also provides

2964-640: The capability for an application to set UTF-8 as the "code page" for the Windows API, removing the need to use UTF-16; and more recently has recommended programmers use UTF-8, and even states "UTF-16 [...] is a unique burden that Windows places on code that targets multiple platforms". The default string primitive in Go , Julia , Rust , Swift (since version 5), and PyPy uses UTF-8 internally in all cases. Python (since version 3.3) uses UTF-8 internally for Python C API extensions and sometimes for strings and

3042-496: The core specification, published as a print-on-demand paperback, may be purchased. The full text, on the other hand, is published as a free PDF on the Unicode website. A practical reason for this publication method highlights the second significant difference between the UCS and Unicode—the frequency with which updated versions are released and new characters added. The Unicode Standard has regularly released annual expanded versions, occasionally with more than one version released in

3120-482: The current version of Python requires an option to open() to read/write UTF-8, plans exist to make UTF-8 I/O the default in Python ;3.15. C++23 adopts UTF-8 as the only portable source code file format (surprisingly there was none before). Backwards compatibility is a serious impediment to changing code and APIs using UTF-16 to use UTF-8, but this is happening. As of May 2019 , Microsoft added

3198-507: The default encoding in XML and HTML (and not just using UTF-8, also declaring it in metadata), "even when all characters are in the ASCII range ... Using non-UTF-8 encodings can have unexpected results". Lots of software has the ability to read/write UTF-8. It may though require the user to change options from the normal settings, or may require a BOM (byte-order mark) as the first character to read

Latin-1 Supplement - Misplaced Pages Continue

3276-440: The detailed meaning of each byte in a stream encoded in UTF-8. Not all sequences of bytes are valid UTF-8. A UTF-8 decoder should be prepared for: Many of the first UTF-8 decoders would decode these, ignoring incorrect bits. Carefully crafted invalid UTF-8 could make them either skip or create ASCII characters such as NUL , slash, or quotes, leading to security vulnerabilities. It is also common to throw an exception or truncate

3354-475: The discretion of the software actually rendering the text, such as a web browser or word processor . However, partially with the intent of encouraging rapid adoption, the simplicity of this original model has become somewhat more elaborate over time, and various pragmatic concessions have been made over the course of the standard's development. The first 256 code points mirror the ISO/IEC 8859-1 standard, with

3432-451: The early days of Unicode there were no characters greater than U+FFFF and combining characters were rarely used, so the 16-bit encoding was fixed-size. This made processing of text more efficient, though the gains are nowhere as great as novice programmers may imagine. All such advantages were lost as soon as UTF-16 became variable width as well. The code points U+0800 – U+FFFF take 3 bytes in UTF-8 but only 2 in UTF-16. This led to

3510-426: The file. Examples of software supporting UTF-8 include Microsoft Word , Microsoft Excel (2016 and later), Google Drive , LibreOffice and most databases. Software that "defaults" to UTF-8 (meaning it writes it without the user changing settings, and it reads it without a BOM) has become more common since 2010. Windows Notepad , in all currently supported versions of Windows, defaults to writing UTF-8 without

3588-537: The first character is a BOM (or the file only contains ASCII). For a long time there was considerable argument as to whether it was better to process text in UTF-16 or in UTF-8. The primary advantage of UTF-16 is that the Windows API required it to be used to get access to all Unicode characters (only recently has this been fixed). This caused several libraries such as Qt to also use UTF-16 strings which propagates this requirement to non-Windows platforms. In

3666-526: The first letter of words. The Mathematical operator subheading is used for the multiplication and division signs. The table below shows the number of letters, symbols and control codes in each of the subheadings in the C1 Controls and Latin-1 Supplement block. The Latin-1 Supplement block contains two emoji : U+00A9 and U+00AE. The block has four standardized variants defined to specify emoji-style (U+FE0F VS16) or text presentation (U+FE0E VS15) for

3744-401: The following versions of The Unicode Standard have been published. Update versions, which do not include any changes to character repertoire, are signified by the third number (e.g., "version 4.0.1") and are omitted in the table below. The Unicode Consortium normally releases a new version of The Unicode Standard once a year. Version 17.0, the next major version,

3822-526: The group. By the end of 1990, most of the work of remapping existing standards had been completed, and a final review draft of Unicode was ready. The Unicode Consortium was incorporated in California on 3 January 1991, and the first volume of The Unicode Standard was published that October. The second volume, now adding Han ideographs, was published in June 1992. In 1996, a surrogate character mechanism

3900-637: The high and low surrogates used by UTF-16 ( U+D800 through U+DFFF ) are not legal Unicode values, and their UTF-8 encodings must be treated as an invalid byte sequence. These encodings all start with 0xED followed by 0xA0 or higher. This rule is often ignored as surrogates are allowed in Windows filenames and this means there must be a way to store them in a string. UTF-8 that allows these surrogate halves has been (informally) called WTF-8 , while another variation that also encodes all non-BMP characters as two surrogates (6 bytes instead of 4)

3978-455: The high bit was set. The name File System Safe UCS Transformation Format ( FSS-UTF ) and most of the text of this proposal were later preserved in the final specification. In August 1992, this proposal was circulated by an IBM X/Open representative to interested parties. A modification by Ken Thompson of the Plan 9 operating system group at Bell Labs made it self-synchronizing , letting

SECTION 50

#1732766265153

4056-527: The idea that text in Chinese and other languages would take more space in UTF-8. However, text is only larger if there are more of these code points than 1-byte ASCII code points, and this rarely happens in the real-world documents due to spaces, newlines, digits, punctuation, English words, and HTML markup. UTF-8 has the advantages of being trivial to retrofit to any system that could handle an extended ASCII , not having byte-order problems, and taking about 1/2

4134-562: The intent of trivializing the conversion of text already written in Western European scripts. To preserve the distinctions made by different legacy encodings, therefore allowing for conversion between them and Unicode without any loss of information, many characters nearly identical to others , in both appearance and intended function, were given distinct code points. For example, the Halfwidth and Fullwidth Forms block encompasses

4212-441: The last byte of a code point to decode it. Unlike many earlier multi-byte text encodings such as Shift-JIS , it is self-synchronizing so searches for short strings or characters are possible and that the start of a code point can be found from a random position by backing up at most 3 bytes. The values chosen for the lead bytes means sorting a list of UTF-8 strings puts them in the same order as sorting UTF-32 strings. Using

4290-403: The last two code points in each of the 17 planes (e.g. U+FFFE , U+FFFF , U+1FFFE , U+1FFFF , ..., U+10FFFE , U+10FFFF ). The set of noncharacters is stable, and no new noncharacters will ever be defined. Like surrogates, the rule that these cannot be used is often ignored, although the operation of the byte order mark assumes that U+FFFE will never be the first code point in

4368-705: The list of scripts that are candidates or potential candidates for encoding and their tentative code block assignments on the Unicode Roadmap page of the Unicode Consortium website. For some scripts on the Roadmap, such as Jurchen and Khitan large script , encoding proposals have been made and they are working their way through the approval process. For other scripts, such as Numidian and Rongorongo , no proposal has yet been made, and they await agreement on character repertoire and other details from

4446-675: The modern text (e.g. in the union of all newspapers and magazines printed in the world in 1988), whose number is undoubtedly far below 2 = 16,384. Beyond those modern-use characters, all others may be defined to be obsolete or rare; these are better candidates for private-use registration than for congesting the public list of generally useful Unicode. In early 1989, the Unicode working group expanded to include Ken Whistler and Mike Kernaghan of Metaphor, Karen Smith-Yoshimura and Joan Aliprand of Research Libraries Group , and Glenn Wright of Sun Microsystems . In 1990, Michel Suignard and Asmus Freytag of Microsoft and NeXT 's Rick McGowan had also joined

4524-776: The range U+10000 through U+10FFFF .) The Unicode codespace is divided into 17 planes , numbered 0 to 16. Plane 0 is the Basic Multilingual Plane (BMP), and contains the most commonly used characters. All code points in the BMP are accessed as a single code unit in UTF-16 encoding and can be encoded in one, two or three bytes in UTF-8. Code points in planes 1 through 16 (the supplementary planes ) are accessed as surrogate pairs in UTF-16 and encoded in four bytes in UTF-8 . Within each plane, characters are allocated within named blocks of related characters. The size of

4602-562: The range 0x21–0x7E that meant something else in ASCII, e.g., 0x2F for / , the Unix path directory separator. In July 1992, the X/Open committee XoJIG was looking for a better encoding. Dave Prosser of Unix System Laboratories submitted a proposal for one that had faster implementation characteristics and introduced the improvement that 7-bit ASCII characters would only represent themselves; multi-byte sequences would only include bytes with

4680-418: The remaining 61,440 codepoints of the Basic Multilingual Plane (BMP), including most Chinese, Japanese and Korean characters . Four bytes are needed for the 1,048,576 codepoints in the other planes of Unicode , which include emoji (pictographic symbols), less common CJK characters , various historic scripts, and mathematical symbols . This is a prefix code and it is unnecessary to read past

4758-590: The same modified UTF-8 as Java for internal representation of Unicode data, but uses strict CESU-8 for external data. All known Modified UTF-8 implementations also treat the surrogate pairs as in CESU-8 . Raku programming language (formerly Perl 6) uses utf-8 encoding by default for I/O ( Perl 5 also supports it); though that choice in Raku also implies "normalization into Unicode NFC (normalization form canonical) . In some cases you may want to ensure no normalization

SECTION 60

#1732766265153

4836-465: The searched-for string does not contain any errors. Making each byte be an error, in which case E1,A0,20 is two errors followed by a space, also still allows searching for a valid string. This means there are only 128 different errors which makes it practical to store the errors in the output string, or replace them with characters from a legacy encoding. Only a small subset of possible byte strings are error-free UTF-8: several bytes cannot appear;

4914-500: The space for any language using mostly Latin letters. UTF-8 has been the most common encoding for the World Wide Web since 2008. As of November 2024 , UTF-8 is used by 98.4% of surveyed web sites. Although many pages only use ASCII characters to display content, very few websites now declare their encoding to only be ASCII instead of UTF-8. Virtually all countries and languages have 95% or more use of UTF-8 encodings on

4992-643: The specification for FSS-UTF. UTF-8 was first officially presented at the USENIX conference in San Diego , from January 25 to 29, 1993. The Internet Engineering Task Force adopted UTF-8 in its Policy on Character Sets and Languages in RFC ;2277 ( BCP 18) for future internet standards work in January 1998, replacing Single Byte Character Sets such as Latin-1 in older RFCs. In November 2003, UTF-8

5070-464: The standard defines 154 998 characters and 168 scripts used in various ordinary, literary, academic, and technical contexts. Many common characters, including numerals, punctuation, and other symbols, are unified within the standard and are not treated as specific to any given writing system. Unicode encodes 3790 emoji , with the continued development thereof conducted by the Consortium as

5148-429: The standard also provides charts and reference data, as well as annexes explaining concepts germane to various scripts, providing guidance for their implementation. Topics covered by these annexes include character normalization , character composition and decomposition, collation , and directionality . Unicode text is processed and stored as binary data using one of several encodings , which define how to translate

5226-453: The standard's abstracted codes for characters into sequences of bytes. The Unicode Standard itself defines three encodings: UTF-8 , UTF-16 , and UTF-32 , though several others exist. Of these, UTF-8 is the most widely used by a large margin, in part due to its backwards-compatibility with ASCII . Unicode was originally designed with the intent of transcending limitations present in all text encodings designed up to that point: each encoding

5304-645: The string at an error but this turns what would otherwise be harmless errors (i.e. "file not found") into a denial of service , for instance early versions of Python 3.0 would exit immediately if the command line or environment variables contained invalid UTF-8. RFC 3629 states "Implementations of the decoding algorithm MUST protect against decoding invalid sequences." The Unicode Standard requires decoders to: "... treat any ill-formed code unit sequence as an error condition. This guarantees that it will neither interpret nor emit an ill-formed code unit sequence." The standard now recommends replacing each error with

5382-462: The treatment of orthographical variants in Han characters , there is considerable disagreement regarding which differences justify their own encodings, and which are only graphical variants of other characters. At the most abstract level, Unicode assigns a unique number called a code point to each character. Many issues of visual representation—including size, shape, and style—are intended to be up to

5460-530: The two emoji, both of which default to a text presentation. The following Unicode-related documents record the purpose and process of defining specific characters in the Latin-1 Supplement block: Unicode Standard Unicode , formally The Unicode Standard , is a text encoding standard maintained by the Unicode Consortium designed to support the use of text in all of the world's writing systems that can be digitized. Version 16.0 of

5538-418: The two-character prefix U+ always precedes a written code point, and the code points themselves are written as hexadecimal numbers. At least four hexadecimal digits are always written, with leading zeros prepended as needed. For example, the code point U+00F7 ÷ DIVISION SIGN is padded with two leading zeros, but U+13254 𓉔 EGYPTIAN HIEROGLYPH O004 ( [REDACTED] )

5616-618: The user communities involved. Some modern invented scripts which have not yet been included in Unicode (e.g., Tengwar ) or which do not qualify for inclusion in Unicode due to lack of real-world use (e.g., Klingon ) are listed in the ConScript Unicode Registry , along with unofficial but widely used Private Use Areas code assignments. There is also a Medieval Unicode Font Initiative focused on special Latin medieval characters. Part of these proposals has been already included in Unicode. The Script Encoding Initiative,

5694-543: The value of the code point. In the following table, the characters u to z are replaced by the bits of the code point, from the positions U+uvwxyz : The first 128 code points (ASCII) need 1 byte. The next 1,920 code points need two bytes to encode, which covers the remainder of almost all Latin-script alphabets , and also IPA extensions , Greek , Cyrillic , Coptic , Armenian , Hebrew , Arabic , Syriac , Thaana and N'Ko alphabets, as well as Combining Diacritical Marks . Three bytes are needed for

5772-594: The web. Many standards only support UTF-8, e.g. JSON exchange requires it (without a byte-order mark (BOM)). UTF-8 is also the recommendation from the WHATWG for HTML and DOM specifications, and stating "UTF-8 encoding is the most appropriate encoding for interchange of Unicode " and the Internet Mail Consortium recommends that all e‑mail programs be able to display and create mail using UTF-8. The World Wide Web Consortium recommends UTF-8 as

5850-640: The years several countries or government agencies have been members of the Unicode Consortium. Presently only the Ministry of Endowments and Religious Affairs (Oman) is a full member with voting rights. The Consortium has the ambitious goal of eventually replacing existing character encoding schemes with Unicode and its standard Unicode Transformation Format (UTF) schemes, as many of the existing schemes are limited in size and scope and are incompatible with multilingual environments. Unicode currently covers most major writing systems in use today. As of 2024 ,

5928-491: Was implemented in Unicode 2.0, so that Unicode was no longer restricted to 16 bits. This increased the Unicode codespace to over a million code points, which allowed for the encoding of many historic scripts, such as Egyptian hieroglyphs , and thousands of rarely used or obsolete characters that had not been anticipated for inclusion in the standard. Among these characters are various rarely used CJK characters—many mainly being used in proper names, making them far more necessary for

6006-439: Was relied upon for use in its own context, but with no particular expectation of compatibility with any other. Indeed, any two encodings chosen were often totally unworkable when used together, with text encoded in one interpreted as garbage characters by the other. Most encodings had only been designed to facilitate interoperation between a handful of scripts—often primarily between a given script and Latin characters —not between

6084-413: Was restricted by RFC   3629 to match the constraints of the UTF-16 character encoding: explicitly prohibiting code points corresponding to the high and low surrogate characters removed more than 3% of the three-byte sequences, and ending at U+10FFFF removed more than 48% of the four-byte sequences and all five- and six-byte sequences. UTF-8 encodes code points in one to four bytes, depending on

#152847