Misplaced Pages

Extended Unix Code

Article snapshot taken from Wikipedia under the Creative Commons Attribution-ShareAlike license.

Extended Unix Code (EUC) is a multibyte character encoding system used primarily for Japanese, Korean, and simplified Chinese characters.


66-475: The most commonly used EUC codes are variable-length encodings with a character belonging to an ISO/IEC 646 compliant coded character set (such as ASCII ) taking one byte, and a character belonging to a 94×94 coded character set (such as GB 2312 ) represented in two bytes. The EUC-CN form of GB 2312 and EUC-KR are examples of such two-byte EUC codes. EUC-JP includes characters represented by up to three bytes, including an initial shift code , whereas

132-510: A computer . Most common variable-width encodings are multibyte encodings (aka MBCS – multi-byte character set ), which use varying numbers of bytes ( octets ) to encode different characters. (Some authors, notably in Microsoft documentation, use the term multibyte character set, which is a misnomer , because representation size is an attribute of the encoding, not of the character set.) Early variable-width encodings using less than

198-555: A yen sign in EUC-JP (see below) and a won sign in EUC-KR. The other code sets are invoked over GR (i.e. with the most significant bit set). Hence, to get the EUC form of a character, the most significant bit of each coding byte is set (equivalent to adding 128 to each 7-bit coding byte, or adding 160 to each number in the kuten code); this allows the software to easily distinguish whether
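The kuten-to-EUC rule described above (add 160 to each kuten number, or equivalently set the high bit of each 7-bit byte) can be sketched in Python as follows. This is a minimal illustration only; the function name `kuten_to_euc` is hypothetical and not part of any standard library.

```python
def kuten_to_euc(row: int, cell: int) -> bytes:
    """Map a kuten (row, cell) pair, each in 1..94, to its two-byte EUC form.

    Hypothetical helper for illustration; it covers only code set 1
    (a 94x94 set invoked over GR).
    """
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must each be in 1..94")
    # 7-bit (GL) form: add 0x20 to each number, giving bytes 0x21..0x7E.
    gl = (row + 0x20, cell + 0x20)
    # EUC (GR) form: set the most significant bit, i.e. add 128 to each byte.
    # Overall this adds 160 (0xA0) to each kuten number.
    return bytes(b | 0x80 for b in gl)


if __name__ == "__main__":
    print(kuten_to_euc(1, 1).hex())    # first cell of the first row
    print(kuten_to_euc(94, 94).hex())  # last cell of the last row
```

Because every resulting byte is at least 0xA1, software can tell an ISO 646 byte from a code set 1 byte by testing the high bit alone.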

264-409: A byte per character were sometimes used to pack English text into fewer bytes in adventure games for early microcomputers. However, disks (which, unlike tapes, allowed random access, so that text could be loaded on demand), increases in computer memory and general-purpose compression algorithms have rendered such tricks largely obsolete. Multibyte encodings are usually the result of a need to increase

330-447: A change would break compatibility with existing systems and therefore might not be feasible at all. Since the aim of a multibyte encoding system is to minimise changes to existing application software, some characters must retain their pre-existing single-unit codes, even while other characters have multiple units in their codes. The result is that there are three sorts of units in a variable-width encoding: singletons , which consist of

396-435: A double-byte DBCS-Host mode using shifting sequences (where 0x29 switches to single-byte mode and 0x28 switches to double-byte mode). Also similarly to KEIS, JIS X 0208 codes are represented the same as in EUC-JP. The lead byte range is extended back to 0x41, with 0x80–0xA0 designated for user definition; lead bytes 0x41–0x7F are assigned row numbers 101 through 163 for kuten purposes, although row 162 (lead byte 0x7E)

462-504: A fixed-length transformation format called the EUC complete two-byte format . This represents: Initial bytes of 0x00 and 0x80 are used in cases where the code set uses only one byte. There is also a four-byte fixed-length format. These fixed-length encoding formats are suited to internal processing and are not usually encountered in interchange. EUC-JP is registered with the IANA in both formats,

528-472: A particular byte in a character string belongs to the ISO 646 code or the extended code. Characters in code sets 2 and 3 are prefixed with the control codes SS2 (0x8E) and SS3 (0x8F) respectively, and invoked over GR. Besides the initial shift code, any byte outside of the range 0xA0–0xFF appearing in a character from code sets 1 through 3 is not a valid EUC code. The EUC code itself does not make use of
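The single-shift structure described above can be sketched as a small splitter for packed-format EUC-JP. This is an illustrative sketch under stated assumptions, not a reference decoder: the function name `split_euc_jp` is hypothetical, it assumes code set 2 is SS2 plus one byte and code set 3 is SS3 plus two bytes (as in EUC-JP), and it restricts non-shift bytes to 0xA1–0xFE, the range used by 94-character sets.

```python
SS2, SS3 = 0x8E, 0x8F  # single-shift control codes named in the text


def split_euc_jp(data: bytes) -> list:
    """Split packed-format EUC-JP bytes into per-character byte strings.

    Hypothetical helper; raises ValueError on malformed input.
    """
    chars = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            n = 1                  # code set 0: one ISO 646 byte
        elif b == SS2:
            n = 2                  # code set 2: SS2 + one GR byte
        elif b == SS3:
            n = 3                  # code set 3: SS3 + two GR bytes
        elif 0xA1 <= b <= 0xFE:
            n = 2                  # code set 1: two GR bytes
        else:
            raise ValueError(f"invalid lead byte 0x{b:02X} at offset {i}")
        chunk = data[i:i + n]
        # Every byte after the first must be a GR byte.
        if len(chunk) < n or any(not 0xA1 <= t <= 0xFE for t in chunk[1:]):
            raise ValueError(f"malformed sequence at offset {i}")
        chars.append(chunk)
        i += n
    return chars


if __name__ == "__main__":
    sample = b"A\xb0\xa1\x8e\xb1\x8f\xb0\xa1"
    print([c.hex() for c in split_euc_jp(sample)])
```

The splitter never needs to look backwards: the first byte of each character alone determines how many bytes follow.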

594-554: A patent policy. This spurred a renewed attempt to allow the W3C and the WHATWG to work together on specifications. In 2019, the W3C and WHATWG agreed to a memorandum of understanding where development of HTML and DOM specifications would be done principally in the WHATWG. The editor has significant control over the specification, but the community can influence the decisions of the editor. In one case, editor Ian Hickson proposed replacing

660-440: A search for the two-unit sequence DF E0 can yield a false positive in the sequence DE DF E0 E1, which consists of two consecutive two-unit sequences. There is also the danger that a single corrupted or lost unit may render the whole interpretation of a large run of multiunit sequences incorrect. In a variable-width encoding where all three types of units are disjunct, string searching always works without false positives, and (provided
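The DE DF E0 E1 false positive described above can be reproduced with a short Python sketch. It assumes a hypothetical two-byte encoding in which every byte of 0x80 or above starts a two-byte character and lead and trail ranges overlap; the helper names `char_starts` and `find_on_boundary` are illustrative only.

```python
def char_starts(data: bytes) -> list:
    """Offsets at which a character begins, scanning from the start.

    Assumes a hypothetical encoding: bytes below 0x80 are singletons and
    any byte of 0x80 or above begins a two-byte character.
    """
    starts, i = [], 0
    while i < len(data):
        starts.append(i)
        i += 2 if data[i] >= 0x80 else 1
    return starts


def find_on_boundary(data: bytes, needle: bytes) -> int:
    """Return the first character-aligned offset of needle, or -1."""
    for i in char_starts(data):
        if data.startswith(needle, i):
            return i
    return -1


if __name__ == "__main__":
    text = bytes([0xDE, 0xDF, 0xE0, 0xE1])  # two characters: DE DF, E0 E1
    needle = bytes([0xDF, 0xE0])            # one character: DF E0
    print(text.find(needle))                # naive byte search
    print(find_on_boundary(text, needle))   # boundary-aware search
```

The naive search reports a match that straddles two characters, while the boundary-aware search rejects it at the cost of scanning from the start of the text.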

726-459: A sequence of the 94 7-bit bytes 0x21–7E, or alternatively 0xA1–FE if an eighth bit is available. This allows for sets of 94 graphical characters, or 8836 (94²) characters, or 830584 (94³) characters. Although initially 0x20 and 0x7F were always the space and delete character and 0xA0 and 0xFF were unused, later editions of ISO/IEC 2022 allowed the use of the bytes 0xA0 and 0xFF (or 0x20 and 0x7F) within sets under certain circumstances, allowing


792-569: A single character in EUC-TW can take up to four bytes. Modern applications are more likely to use UTF-8 , which supports all of the glyphs of the EUC codes, and more, and is generally more portable with fewer vendor deviations and errors. EUC is however still very popular, especially EUC-KR for South Korea. The structure of EUC is based on the ISO/IEC 2022 standard, which specifies a system of graphical character sets that can be represented with

858-417: A single unit, lead units , which come first in a multiunit sequence, and trail units , which come afterwards in a multiunit sequence. Input and display software obviously needs to know about the structure of the multibyte encoding scheme, but other software generally doesn't need to know if a pair of bytes represent two separate characters or just one character. For example, the four character string " I♥NY "

924-598: A subset include the Mac OS Korean script (known as Code page 10003 or x-mac-korean ), which was used by HangulTalk (MacOS-KH), the Korean localization of the classic Mac OS . It was developed by Elex Computer ( 일렉스 ), who were at the time the authorised distributor of Apple Macintosh computers in South Korea. HangulTalk adds extension characters with lead bytes between 0xA1 and 0xAD, both in unused space within

990-508: A variable-width encoding called UTF-1 , in which singletons had the range 00–9F, lead units the range A0–FF and trail units the ranges A0–FF and 21–7E. Because of this bad design, similar to Shift JIS and Big5 in its overlap of values, the inventors of the Plan 9 operating system, the first to implement Unicode throughout, abandoned it and replaced it with a much better designed variable-width encoding for Unicode: UTF-8, in which singletons have

1056-536: A variant form called HZ (which delimits GB 2312 text with ASCII sequences) was sometimes used on USENET . An ASCII character is represented in its usual encoding. A character from GB 2312 is represented by two bytes, both from the range 0xA1–0xFE. An encoding related to EUC-CN is the "748" code used in the WITS typesetting system developed by Beijing's Founder Technology (now obsoleted by its newer FITS typesetting system). The 748 code contains all of GB 2312 , but
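The EUC-CN byte layout described above (ASCII in one byte, GB 2312 characters in two bytes from 0xA1–0xFE) can be checked with Python's built-in `gb2312` codec. This is a small sketch; the helper `euc_cn_lengths` is a hypothetical name, and it assumes well-formed input.

```python
def euc_cn_lengths(data: bytes) -> list:
    """Byte length of each character in well-formed EUC-CN data.

    Hypothetical helper: a byte in 0xA1..0xFE starts a two-byte
    GB 2312 character; anything else is treated as a single byte.
    """
    lengths, i = [], 0
    while i < len(data):
        n = 2 if 0xA1 <= data[i] <= 0xFE else 1
        lengths.append(n)
        i += n
    return lengths


if __name__ == "__main__":
    text = "中文A"
    encoded = text.encode("gb2312")  # Python's codec for EUC-CN
    print(encoded.hex(" "))
    print(euc_cn_lengths(encoded))
```

Each hanzi occupies two bytes with the high bit set, while the trailing ASCII letter keeps its usual one-byte encoding.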

1122-411: Is a variable-length encoding used to represent the elements of three Japanese character set standards, namely JIS X 0208, JIS X 0212, and JIS X 0201. Other names for this encoding include Unixized JIS (or UJIS) and AT&T JIS. Since September 2022, 0.1% of all web pages have used EUC-JP, while 2.6% of websites written in Japanese use this second-most popular (for Japanese) encoding (which

1188-405: Is a variable-length encoding which may use up to four bytes per character, due to an even larger encoding space being required. Being an extension of GBK, it is a superset of EUC-CN but is not itself a true EUC code. Being a Unicode encoding, its repertoire is identical to that of other Unicode transformation formats such as UTF-8 . Other EUC-CN variants deviating from the EUC mechanism include

1254-811: Is a community of people interested in evolving HTML and related technologies. The WHATWG was founded in 2004 by individuals from Apple Inc., the Mozilla Foundation and Opera Software, leading web browser vendors. WHATWG is responsible for maintaining multiple web-related technical standards, including the specifications for the HyperText Markup Language (HTML) and the Document Object Model (DOM). The central organizational membership and control of WHATWG – its "Steering Group" – consists of Apple, Mozilla, Google, and Microsoft. WHATWG community members work with

1320-566: Is a different, unrelated, EUC-KR extension. Unified Hangul Code extends EUC-KR by using codes that do not conform to the EUC structure to incorporate additional syllable blocks, completing the coverage of the composed syllable blocks available in Johab and Unicode. The W3C / WHATWG Encoding Standard used by HTML5 incorporates the Unified Hangul Code extensions into its definition of EUC-KR. Other encodings incorporating EUC-KR as

1386-490: Is a variant of Shift JIS . HP-16 encodes JIS X 0208 using the same bytes as in EUC-JP, but does not use the single shift codes (thus omitting code sets 2 and 3), and adds three user-defined regions which do not follow the packed-format EUC structure: The IKIS (Interactive Kanji Information System) encoding used by Data General resembles EUC-JP without single shifts, i.e. with only code sets 0 and 1. Half-width katakana are instead included in row 8 of JIS X 0208 (colliding with


1452-577: Is encoded in UTF-8 like this (shown as hexadecimal byte values): 49 E2 99 A5 4E 59 . Of the six units in that sequence, 49, 4E, and 59 are singletons (for I, N, and Y ), E2 is a lead unit and 99 and A5 are trail units. The heart symbol is represented by the combination of the lead unit and the two trail units. UTF-8 makes it easy for a program to identify the three sorts of units, since they fall into separate value ranges. Older variable-width encodings are typically not as well-designed, since
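The unit classification of the "I♥NY" example above can be reproduced in Python. This is a minimal sketch; the function name `classify_utf8_unit` is hypothetical, and the value ranges used are those of current UTF-8 (singletons 00–7F, trail units 80–BF, lead units C2–F4).

```python
def classify_utf8_unit(b: int) -> str:
    """Classify one UTF-8 byte by its value range (hypothetical helper)."""
    if b <= 0x7F:
        return "singleton"
    if b <= 0xBF:
        return "trail"
    if 0xC2 <= b <= 0xF4:
        return "lead"
    return "invalid"  # C0, C1 and F5..FF never occur in valid UTF-8


if __name__ == "__main__":
    units = "I♥NY".encode("utf-8")
    for b in units:
        print(f"{b:02X} {classify_utf8_unit(b)}")
```

Because the three ranges are disjoint, a program can classify any byte without looking at its neighbours.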

1518-429: Is extended back to 0x59, out of which the lead bytes 0x81–A0 are designated for user-defined characters, and the remainder are used for corporate-defined characters, including both kanji and non-kanji. JEF (Japanese-processing Extended Feature) is an EBCDIC encoding used on Fujitsu FACOM mainframes, contrasting with FMR (a variant of Shift JIS) used on Fujitsu PCs. Like KEIS, JEF is a stateful encoding, switching to

1584-1123: Is more than for Shift JIS; both are much less used than UTF-8). It is called Code page 954 by IBM. Microsoft has two code page numbers for this encoding (51932 and 20932). This encoding scheme allows the easy mixing of 7-bit ASCII and 8-bit Japanese without the need for the escape characters employed by ISO-2022-JP, which is based on the same character set standards, and without ASCII bytes appearing as trail bytes (unlike Shift JIS). A related and partially compatible encoding, called EUC-JISx0213 or EUC-JIS-2004, encodes JIS X 0201 and JIS X 0213 (similarly to Shift_JISx0213, its Shift_JIS-based counterpart). Compared to EUC-CN or EUC-KR, EUC-JP did not become as widely adopted on PC and Macintosh systems in Japan, which used Shift JIS or its extensions (Windows code page 932 on Microsoft Windows, and MacJapanese on classic Mac OS), although it became heavily used by Unix or Unix-like operating systems (except for HP-UX). Therefore, whether Japanese websites use EUC-JP or Shift_JIS often depends on what OS

1650-524: Is not ISO 2022 –compliant and therefore not a true EUC code. (It uses an 8-bit lead byte but distinguishes between a second byte with its most significant bit set and one with its most significant bit cleared, and is, therefore, more similar in structure to Big5 and other non–ISO 2022–compliant DBCS encoding systems.) The non-GB2312 portion of the 748 code contains traditional and Hong Kong characters and other glyphs used in newspaper typesetting. IBM code page 1381 ( CCSID 1381) comprises

1716-428: Is not required to be left-padded with null bytes (similarly to the packed format). JIS X 0208 is, as usual, used for code set 1; code set 2 (half-width katakana) is absent; code set 3 is encoded like the two-byte fixed width format (i.e. without a shift byte and with only the first high bit set), but used for two-byte user defined characters rather than being specified for JIS X 0212. In the basic "DEC Kanji" encoding, only

1782-479: Is now preferred for new use, solving problems with consistency between platforms and vendors. A common extension of EUC-KR is the Unified Hangul Code ( 통합형 한글 코드 ; Tonghabhyeong Hangeul Kodeu , or 통합 완성형 ; Tonghab Wansunghyung ), which is the default Korean codepage on Microsoft Windows. It is given the code page number 949 by Microsoft, and 1261 or 1363 by IBM. IBM's code page 949

1848-420: Is the inclusion of two extensions to the basic GB 2312-80 set in rows 6 and 8. These are considered "standard extensions to GB 2312", neither of which is proprietary to Apple: the row 8 extension was taken from GB 6345.1 , both extensions are included by GB/T 12345 (the traditional Chinese variant of GB 2312), and both extensions are included by GB 18030 (the successor to GB 2312). EUC-JP

1914-399: Is unused. Rows 101 through 148 are used for extended kanji, while rows 149 through 163 are used for extended non-kanji. EUC-KR is a variable-length encoding to represent Korean text using two coded character sets, KS X 1001 (formerly KS C 5601) and either ISO 646 :KR ( KS X 1003 , formerly KS C 5636 ) or ASCII , depending on variant. KS X 2901 (formerly KS C 5861 ) stipulates

1980-412: The <time> tag with a more generic <data> tag, but the community disagreed and the change was reverted. Initially, the name Web Hypertext Application Technology Task Force was also used, along with variant abbreviations including WHAT Working Group , WHAT Task Force and WHATTF . After some time using both the whattf.org and whatwg.org domain names , the name WHATWG

2046-557: The ISO/IEC 8859 series technically conform to the EUC structure, they are rarely labeled as EUC. However, eucTH is used on Solaris as a label for TIS-620 . EUC-TW is a variable-length encoding that supports ASCII and 16 planes of CNS 11643 , each of which is 94×94. It is a rarely used encoding for traditional Chinese characters as used in Taiwan . Variants of Big5 are much more common than EUC-TW, although Big5 only encodes


2112-478: The Mac OS Chinese Simplified script (known as Code page 10008 or x-mac-chinesesimp ). It uses the bytes 0x80, 0x81, 0x82, 0xA0, 0xFD, 0xFE, and 0xFF for the U with umlaut (ü), two special font metric characters, the non-breaking space , the copyright sign (©), the trademark sign (™) and the ellipsis (...) respectively. This differs in what is regarded as a single-byte character versus

2178-477: The EUC scheme. The G0 set is set to an ISO/IEC 646 compliant coded character set such as ASCII , ISO 646:KR ( KS X 1003 ) or ISO 646:JP (the lower half of JIS X 0201 ) and invoked over GL (i.e. 0x21–0x7E, with the most significant bit cleared). If ASCII is used, this makes the code an extended ASCII encoding; the most common deviation from ASCII is that 0x5C ( backslash in ASCII) is often used to represent

2244-571: The EUC-KR GR plane (trail bytes 0xA1–0xFE), and using non-EUC codes outside of it (trail bytes 0x41–0xA0). Some of these characters are font-style-independent stylized dingbats . Many of these characters do not have exact Unicode mappings, and Apple software maps these cases variously to combining sequences , to approximate mappings with an appended private-use character as a modifier for round-trip purposes, or to private-use characters. Apple also uses certain single-byte codes outside of

2310-399: The EUC-KR plane for additional characters: 0x80 for a required space , 0x81 for a won sign (₩), 0x82 for an en dash (–), 0x83 for a copyright sign (©), 0x84 for a wide underscore (_) and 0xFF for an ellipsis (...). Although none of these additional single-byte codes are within the lead byte range of plain EUC-KR (unlike Apple's extensions to EUC-CN, see above ), some are within

2376-598: The IBM-selected and user-defined characters. GBK is an extension to GB 2312 . It defines an extended form of the EUC-CN encoding capable of representing a larger array of CJK characters sourced largely from Unicode 1.1 , including traditional Chinese characters and characters used only in Japanese . It is not, however, a true EUC code, because ASCII bytes may appear as trail bytes (and C1 bytes , not limited to

2442-543: The ISO 2022 7-bit encodings were replaced by a set of 8-bit encoding schemes, the Extended Unix Code: EUC-JP, EUC-CN and EUC-KR. Instead of distinguishing between the multiunit sequences and the singletons with escape sequences, which made the encodings stateful, multiunit sequences were marked by having the most significant bit set, that is, being in the range 80–FF (hexadecimal), while the singletons were in

2508-473: The PC ( DOS and Microsoft Windows platforms), two encodings became established for Japanese and Traditional Chinese in which all of singletons, lead units and trail units overlapped: Shift-JIS and Big5 respectively. In Shift-JIS, lead units had the range 81–9F and E0–FC, trail units had the range 40–7E and 80–FC, and singletons had the range 21–7E and A1–DF. In Big5, lead units had the range A1–FE, trail units had

2574-531: The W3C members at the W3C Workshop on Web Applications and Compound Documents. On 10 April 2007, the Mozilla Foundation, Apple, and Opera Software proposed that the new HTML working group of the W3C adopt the WHATWG's HTML5 as the starting point of its work and name its future deliverable as "HTML5" (though the WHATWG specification was later renamed HTML Living Standard ). On 9 May 2007,

2640-441: The announcement and designation sequences from ISO 2022 . However, the code specification is equivalent to the following sequence of four ISO 2022 announcement sequences, with meanings breaking down as follows. The ISO-2022-based variable-length encoding described above is sometimes referred to as the EUC packed format , which is the encoding format usually labeled as EUC. However, internal processing of EUC data may make use of

2706-508: The author uses. Characters are encoded as follows: Vendor extensions to EUC-JP (from, for example, the Open Software Foundation , IBM or NEC ) were often allocated within the individual code sets, as opposed to using invalid EUC sequences (as in popular extensions of EUC-CN and EUC-KR). However, some vendor-specific encodings are partially compatible with EUC-JP, due to encoding JIS X 0208 over GR, but do not follow


2772-415: The box-drawing characters added to the standard in 1983). JIS X 0208 rows 9 through 12 are used for user-defined characters. KEIS (Kanji-processing Extended Information System) is an EBCDIC encoding used by Hitachi , with double-byte characters (a DBCS-Host encoding) included using shifting sequences, making it a stateful encoding. Specifically, the sequence 0x0A 0x41 switches to single-byte mode and

2838-467: The decoder is well written) the corruption or loss of one unit corrupts only one character. The first use of multibyte encodings was for the encoding of Chinese, Japanese and Korean, which have large character sets well in excess of 256 characters. At first the encoding was constrained to the limit of 7 bits. The ISO-2022-JP, ISO-2022-CN and ISO-2022-KR encodings used the range 21–7E (hexadecimal) for both lead units and trail units, and marked them off from

2904-685: The double-byte component as Code page 971, and to EUC-KR with ASCII as Code page 970. It is implemented as Code page 20949 ("Korean Wansung") and Code page 51949 ("EUC Korean") by Microsoft. As of April 2024, less than 0.08% of all web pages globally use EUC-KR, but 4.6% of South Korean web pages do. Including extensions, it is the most widely used legacy character encoding in Korea on all three major platforms (macOS, other Unix-like OSes, and Windows), but its use has been very slowly shifting to UTF-8 as it gains popularity, especially on Linux and macOS. As with most other encodings, UTF-8

2970-404: The editor of the specifications to ensure correct implementation. The WHATWG was formed in response to the slow development of World Wide Web Consortium (W3C) Web standards and W3C's decision to abandon HTML in favor of XML -based technologies. The WHATWG mailing list was announced on 4 June 2004, two days after the initiatives of a joint Opera–Mozilla position paper had been voted down by

3036-567: The encoding and RFC 1557 dubbed it as EUC-KR. A character drawn from KS X 1001 (G1, code set 1) is encoded as two bytes in GR (0xA1–0xFE) and a character from KS X 1003 or ASCII (G0, code set 0) takes one byte in GL (0x21–0x7E). It is usually referred to as Wansung (Korean: 완성; RR: Wanseong; lit. precomposed) in the Republic of Korea. IBM refers to

3102-648: The first 31 rows of code set 3 are used for user-defined characters: rows 32 through 94 are reserved, similarly to the unused rows in code set 1. The "Super DEC Kanji" encoding accepts codes both from the "DEC Kanji" encoding and from packed-format EUC, for a total of five code-sets. It also allows the entire user defined code set, and the unused rows at the ends of the JIS X 0208 and JIS X 0212 code sets (rows 85–94 and 78–94 respectively), to be used for user-defined characters. Hewlett-Packard defines an encoding referred to as "HP-16". This accompanies their "HP-15" encoding, which

3168-402: The first byte of a two-byte character from both EUC (where, of those, 0xFD and 0xFE are defined as lead bytes) and GBK (where, of those, 0x81, 0x82, 0xFD and 0xFE are defined as lead bytes). This use of 0xA0, 0xFD, 0xFE and 0xFF matches Apple's Shift_JIS variant . Besides these changes to the lead byte range, the other distinctive feature of the double-byte portion of Mac OS Chinese Simplified

3234-412: The first two planes of CNS 11643 hanzi , while UTF-8 is becoming more common. Note that plane 1 of CNS 11643 is encoded twice as code set 1 and a part of code set 2. Variable-width encoding A variable-width encoding is a type of character encoding scheme in which codes of differing lengths are used to encode a character set (a repertoire of symbols) for representation, usually in

3300-406: The inclusion of 96-character sets. The ranges 0x00–1F and 0x80–9F are used for C0 and C1 control codes . EUC is a family of 8-bit profiles of ISO/IEC 2022 , as opposed to 7-bit profiles such as ISO-2022-JP . As such, only ISO 2022 compliant character sets can have EUC forms. Up to four coded character sets (referred to as G0, G1, G2, and G3 or as code sets 0, 1, 2, and 3) can be represented with

3366-510: The lead byte range of Unified Hangul Code (specifically, 0x81, 0x82, 0x83 and 0x84). Similarly to KS X 1001, the North Korean KPS 9566 standard is typically used in EUC form; in these contexts, it is sometimes referred to as EUC-KP. More recent editions of the standard extend the EUC representation with characters using non-EUC two-byte codes, in a similar manner to Unified Hangul Code. Although certain single-byte encodings such as


3432-471: The new HTML working group of the W3C resolved to do that. An Internet Explorer platform architect from Microsoft was invited but did not join, citing the lack of a patent policy to ensure all specifications can be implemented on a royalty-free basis. Since then, the W3C and the WHATWG had been developing HTML independently, at times causing specifications to diverge. In 2017, the WHATWG established an intellectual property rights agreement that includes

3498-403: The number of characters which can be encoded without breaking backward compatibility with an existing constraint. For example, with one byte (8 bits) per character, one can encode 256 possible characters. To encode more than 256 characters, the obvious choice would be to use two or more bytes per encoding unit; two bytes (16 bits) would allow 65,536 possible characters, but such

3564-835: The original Unicode (1.x) without breaking compatibility with the 16-bit encoding. In UTF-16, singletons have the range 0000–D7FF (55,296 code points) and E000–FFFF (8192 code points, 63,488 in total), lead units the range D800–DBFF (1024 code points) and trail units the range DC00–DFFF (1024 code points, 2048 in total). The lead and trail units, called high surrogates and low surrogates , respectively, in Unicode terminology, map 1024×1024 or 1,048,576 supplementary characters, making 1,112,064 (63,488 BMP code points + 1,048,576 code points represented by high and low surrogate pairs) encodable code points, or scalar values in Unicode parlance (surrogates are not encodable). WHATWG The Web Hypertext Application Technology Working Group ( WHATWG )
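The UTF-16 surrogate arithmetic described in the fragment above (1024 high surrogates times 1024 low surrogates covering 1,048,576 supplementary code points) can be sketched in Python. The function name `to_surrogates` is hypothetical; the constants are the surrogate ranges quoted in the text.

```python
def to_surrogates(cp: int) -> tuple:
    """Split a supplementary code point into (lead, trail) UTF-16 units.

    Hypothetical helper for illustration.
    """
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    v = cp - 0x10000                   # 20-bit offset, 0..0xFFFFF
    lead = 0xD800 + (v >> 10)          # top 10 bits -> D800..DBFF
    trail = 0xDC00 + (v & 0x3FF)       # low 10 bits -> DC00..DFFF
    return lead, trail


# Totals quoted in the text: singletons plus surrogate-pair code points.
SINGLETONS = (0xD7FF - 0x0000 + 1) + (0xFFFF - 0xE000 + 1)
SUPPLEMENTARY = 1024 * 1024

if __name__ == "__main__":
    lead, trail = to_surrogates(0x1F600)
    print(f"{lead:04X} {trail:04X}")
    print(SINGLETONS, SUPPLEMENTARY, SINGLETONS + SUPPLEMENTARY)
```

The result can be cross-checked against Python's own encoder with `chr(cp).encode("utf-16-be")`.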

3630-486: The packed EUC structure. Often, these do not include use of the single shifts from EUC-JP, and are thus not straight extensions of EUC-JP, with the exception of Super DEC Kanji. Digital Equipment Corporation defines two variants of EUC-JP only partly conforming to the EUC packed format, but also bearing some resemblance to the complete two-byte format. The overall format of the "DEC Kanji" encoding mostly corresponds to fixed-length (complete two-byte) EUC; however, code set 0

3696-553: The packed format as "EUC-JP" or "csEUCPkdFmtJapanese" and the fixed width format as "csEUCFixWidJapanese". Only the packed format is included in the WHATWG Encoding Standard used by HTML5 . EUC-CN is the usual encoded form of the GB 2312 standard for simplified Chinese characters . Unlike the case of Japanese JIS X 0208 and ISO-2022-JP , GB 2312 is not normally used in a 7-bit ISO 2022 code version, although

3762-435: The range 00–7F alone. The lead units and trail units were in the range A1 to FE (hexadecimal), that is, the same as their range in the ISO 2022 encodings, but with the high bit set to 1. These encodings were reasonably easy to work with provided all your delimiters were ASCII characters and you avoided truncating strings to fixed lengths, but a break in the middle of a multibyte character could still cause major corruption. On

3828-406: The range 00–7F, lead units have the range C0–FD (now actually C2–F4, to avoid overlong sequences and to maintain synchronism with the encoding capacity of UTF-16; see the UTF-8 article), and trail units have the range 80–BF. The lead unit also tells how many trail units follow: one after C2–DF, two after E0–EF and three after F0–F4. UTF-16 was devised to break free of the 65,536-character limit of
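The rule above that a UTF-8 lead unit announces how many trail units follow can be sketched as a lookup on the lead byte. The function name `utf8_trail_count` is hypothetical; the ranges are the ones quoted in the text.

```python
def utf8_trail_count(lead: int) -> int:
    """Number of trail units that follow a given first byte in UTF-8.

    Hypothetical helper; raises ValueError for bytes that cannot
    start a UTF-8 sequence.
    """
    if lead <= 0x7F:
        return 0                       # singleton
    if 0xC2 <= lead <= 0xDF:
        return 1
    if 0xE0 <= lead <= 0xEF:
        return 2
    if 0xF0 <= lead <= 0xF4:
        return 3
    raise ValueError(f"0x{lead:02X} cannot start a UTF-8 sequence")


if __name__ == "__main__":
    for ch in "Aé♥😀":
        encoded = ch.encode("utf-8")
        print(ch, encoded.hex(" "), utf8_trail_count(encoded[0]))
```

A decoder can therefore skip to the next character after reading one byte, and a trail byte (80–BF) found where a lead is expected signals that the reader is out of step.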

3894-532: The range 40–7E and A1–FE, and singletons had the range 21–7E (all values in hexadecimal). This overlap again made processing tricky, though at least most of the symbols had unique byte values (though strangely the backslash does not). The Unicode standard has two variable-width encodings: UTF-8 and UTF-16 (it also has a fixed-width encoding, UTF-32 ). Originally, both the Unicode and ISO 10646 standards were meant to be fixed-width, with Unicode being 16-bit and ISO 10646 being 32-bit. ISO 10646 provided

3960-466: The ranges may overlap. A text processing application that deals with the variable-width encoding must then scan the text from the beginning of all definitive sequences in order to identify the various units and interpret the text correctly. In such encodings, one is liable to encounter false positives when searching for a string in the middle of the text. For example, if the hexadecimal values DE, DF, E0, and E1 can all be either lead units or trail units, then

4026-503: The sequence 0x0A 0x42 switches to double-byte mode. However, JIS X 0208 characters are encoded using the same byte sequences used to encode them in EUC-JP. This results in duplicate encodings for the ideographic space —0x4040 per the DBCS-Host code structure, and 0xA1A1 as in EUC-JP. This differs from IBM's DBCS-Host encoding for Japanese, the layout of which builds on versions which predate JIS X 0208 altogether. The lead byte range


4092-485: The single shifts, may appear as lead or trail bytes), due to a larger encoding space being required. Variants of GBK are implemented by Windows code page 936 (the Microsoft Windows code page for simplified Chinese), and by IBM's code page 1386. The Unicode-based GB 18030 character encoding defines an extension of GBK capable of encoding the entirety of Unicode . However, Unicode encoded as GB 18030

4158-431: The single-byte code page 1115 (CPGID 1115 as CCSID 1115) and the double-byte code page 1380 (CPGID 1380 as CCSID 1380), which encodes GB 2312 the same way as EUC-CN, but deviates from the EUC structure by extending the lead byte range back to 0x8C, adding 31 IBM-selected characters in 0x8CE0 through 0x8CFE and adding 1880 user-defined characters with lead bytes 0x8D through 0xA0. IBM code page 1383 (CCSID 1383) comprises

4224-460: The single-byte code page 367 and the double-byte code page 1382 (CPGID 1382 as CCSID 1382), which differs by conforming to the EUC structure, adding the 31 IBM-selected characters in 0xFEE0 through 0xFEFE instead, and including only 1360 user-defined characters, interspersed in the positions not used by GB 2312. The alternative CCSID 5479 is used for the pure EUC-CN code page: it uses CCSID 9574 as its double-byte set, which uses CPGID 1382 but excludes

4290-462: The singletons by using ISO 2022 escape sequences to switch between single-byte and multibyte mode. A total of 8,836 (94×94) characters could be encoded at first, and further sets of 94×94 characters with switching. The ISO 2022 encoding schemes for CJK are still in use on the Internet. The stateful nature of these encodings and the large overlap make them very awkward to process. On Unix platforms,

4356-564: Was eventually standardized on. The namespace URI http://whattf.org/datatype-draft remains in use for the HTML validator's data type library. On 28 May 2019, the W3C announced that WHATWG would be the sole publisher of the HTML and DOM standards. The W3C and WHATWG had been publishing competing standards since 2012. While the W3C standard was identical to the WHATWG standard in 2007, the standards have since progressively diverged due to different design decisions. The WHATWG "Living Standard" had been
