
CJK Unified Ideographs

Article snapshot taken from Wikipedia under the Creative Commons Attribution-ShareAlike license.

UTC sources


The Chinese, Japanese and Korean (CJK) scripts share a common background, collectively known as CJK characters. During the process called Han unification, the common (shared) characters were identified and named CJK Unified Ideographs. As of Unicode 16.0, Unicode defines a total of 97,680 CJK unified ideographs. The term ideographs is a misnomer, as the Chinese script is not ideographic but rather logographic. Until

A consolidated set of characters to ISO/IEC JTC 1/SC 2 Working Group 2 (WG2) and the Unicode Technical Committee (UTC) for consideration for inclusion in the ISO/IEC 10646 and Unicode standards. The following IRG member bodies have been involved in the standardization of CJK unified ideographs: The ideographs submitted by the UTC and the United Kingdom are not specific to any particular region, but are characters which have been suggested for encoding by individual experts. The ideographs submitted by SAT are required for

A selected glyph could depend on the particular font being used. However, the source separation rule states that characters encoded separately in an earlier character set would remain separate in the new Unicode encoding. Using variation selectors, it is possible to specify certain variant CJK ideograms within Unicode. The Adobe-Japan1 character set, which has 14,684 ideographic variation sequences,
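As a rough illustration of how a variation sequence is put together, the sketch below appends a selector from the Variation Selectors Supplement block (U+E0100 through U+E01EF) to a base ideograph. The helper names are hypothetical, the base character U+845B is chosen only for illustration, and the code does not check whether a given sequence is actually registered in any collection such as Adobe-Japan1.

```python
# Supplementary variation selectors (VS17..VS256) occupy U+E0100..U+E01EF.
IVS_FIRST = 0xE0100
IVS_LAST = 0xE01EF


def make_variation_sequence(base: str, selector_index: int) -> str:
    """Return base followed by the selector_index-th supplementary selector (0-based).

    Hypothetical helper: it only builds the code point sequence and does not
    consult any registry of valid sequences.
    """
    if len(base) != 1:
        raise ValueError("base must be a single code point")
    code_point = IVS_FIRST + selector_index
    if not IVS_FIRST <= code_point <= IVS_LAST:
        raise ValueError("selector index out of range")
    return base + chr(code_point)


def split_variation_sequence(text: str):
    """Return (base, selector code point) for a two-code-point sequence, else None."""
    if len(text) == 2 and IVS_FIRST <= ord(text[1]) <= IVS_LAST:
        return text[0], ord(text[1])
    return None


# U+845B is used purely as an illustrative base character.
sequence = make_variation_sequence("\u845b", 0)
print([f"U+{ord(ch):04X}" for ch in sequence])
```

Whether a font renders a different glyph for such a sequence depends on the font and on the sequence being registered; the sketch only shows the structure of the sequence itself.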

Is an extreme example of the use of variation selectors. 4E00-62FF, 6300-77FF, 7800-8CFF, 8D00-9FFF. Note: Most characters appear in multiple sources, so the sum of individual character counts (108,480) is far greater than the number of encoded characters (20,992). In Unicode 4.1, 14 HKSCS-2004 characters and 8 GB 18030 characters were assigned to code points between U+9FA6 and U+9FBB. Since then, further characters have been added to this block for various reasons, all summarized in

Is greater than the number of encoded characters (4,154). The block named CJK Unified Ideographs Extension D (2B740–2B81F) contains 222 characters in the range U+2B740 through U+2B81D that were added in Unicode 6.0 (2010). 2B740–2B81F. Note: Some characters appear in more than one source, so the sum of individual character counts (239) is greater than the number of encoded characters (222). The block named CJK Unified Ideographs Extension E (2B820–2CEAF) contains 5,762 characters in

Is greater than the number of encoded characters (4,939). A block named CJK Unified Ideographs Extension H was added as part of Unicode 15.0 to the Tertiary Ideographic Plane in the range U+31350 through U+323AF, containing 4,192 characters. 31350–323AF. Note: Some characters appear in more than one source, so the sum of individual character counts (4,309) is greater than the number of encoded characters (4,192). A block named CJK Unified Ideographs Extension I

Is increasingly rare, although idiosyncratic use of Chinese characters in proper names requires knowledge (and therefore availability) of many more characters. Even today, however, South Korean students are taught 1,800 characters. Other scripts used for these languages, such as bopomofo and the Latin-based pinyin for Chinese, hiragana and katakana for Japanese, and hangul for Korean, are not strictly "CJK characters", although CJK character sets almost invariably include them as necessary for full coverage of

The Kangxi Dictionary ordering of radicals. In this system the characters written with the fewest strokes are listed first. The remaining characters were added later, and so are not in radical order. The block is the result of Han unification, which was somewhat controversial within East Asia. Since Chinese, Japanese and Korean characters were coded in the same location, the appearance of

The SAT Daizōkyō text database. The table below gives the numbers of encoded CJK unified ideographs for each IRG source for Unicode 16.0. The total number of characters (260,840) far exceeds the number of encoded CJK unified ideographs (97,680) as many characters have more than one source. The majority of characters submitted by the UTC to the IRG are derived from Unicode Technical Committee (UTC) documents. Other sources include: The basic block named CJK Unified Ideographs (4E00–9FFF) contains 20,992 basic Chinese characters in
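The figure of 97,680 can be cross-checked against the per-block character counts quoted in this article, together with the twelve unified ideographs that sit in the CJK Compatibility Ideographs block. The short sketch below simply adds those numbers up; the dictionary name and its labels are chosen here for illustration.

```python
# Per-block counts as quoted in this article for Unicode 16.0.
UNIFIED_COUNTS = {
    "CJK Unified Ideographs": 20_992,
    "Extension A": 6_592,
    "Extension B": 42_720,
    "Extension C": 4_154,
    "Extension D": 222,
    "Extension E": 5_762,
    "Extension F": 7_473,
    "Extension G": 4_939,
    "Extension H": 4_192,
    "Extension I": 622,
    # Only 12 characters of the compatibility block carry the
    # "Unified Ideograph" property.
    "CJK Compatibility Ideographs (unified subset)": 12,
}

total_unified = sum(UNIFIED_COUNTS.values())
print(total_unified)  # 97680
```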

The version history section below. The block named CJK Unified Ideographs Extension A (3400–4DBF) contains 6,592 additional characters in the range U+3400 through U+4DBF. 3400-4DBF. Note: Most characters appear in more than one source, so the sum of individual character counts (23,954) is far greater than the number of encoded characters (6,592). The block named CJK Unified Ideographs Extension B (20000–2A6DF) contains 42,720 characters in

The "Unified Ideograph" property: U+FA0E 﨎, U+FA0F 﨏, U+FA11 﨑, U+FA13 﨓, U+FA14 﨔, U+FA1F 﨟, U+FA21 﨡, U+FA23 﨣, U+FA24 﨤, U+FA27 﨧, U+FA28 﨨, and U+FA29 﨩. None of the other characters in this and other "Compatibility" blocks relate to CJK unification. While 龜 and 亀 are not considered unifiable, U+FA20 蘒 CJK COMPATIBILITY IDEOGRAPH-FA20 is considered a duplicate of U+8612 蘒 CJK UNIFIED IDEOGRAPH-8612. F900–FAFF. Note: All characters appear in more than one source, so
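One way to see the difference between the twelve unified ideographs and an ordinary compatibility ideograph is to look at normalization. The sketch below hard-codes the twelve code points listed above and uses Python's standard `unicodedata` module; the helper names are hypothetical, and the exact behaviour depends on the Unicode data shipped with the interpreter, although these particular mappings are long established.

```python
import unicodedata

# The twelve code points in F900..FAFF that have the "Unified Ideograph"
# property, as listed in the text above.
UNIFIED_IN_COMPAT_BLOCK = frozenset({
    0xFA0E, 0xFA0F, 0xFA11, 0xFA13, 0xFA14, 0xFA1F,
    0xFA21, 0xFA23, 0xFA24, 0xFA27, 0xFA28, 0xFA29,
})


def is_unified_in_compat_block(ch: str) -> bool:
    """True if ch is one of the twelve unified ideographs in the compatibility block."""
    return ord(ch) in UNIFIED_IN_COMPAT_BLOCK


def changes_under_nfc(ch: str) -> bool:
    """True if NFC normalization replaces ch with a different character."""
    return unicodedata.normalize("NFC", ch) != ch


# U+FA20 is a duplicate of U+8612 and is replaced by it under normalization,
# whereas U+FA0E is a unified ideograph in its own right and is left alone.
print(changes_under_nfc("\ufa20"), changes_under_nfc("\ufa0e"))
```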


The Chinese-origin logographic script formerly used for the Vietnamese language, or CJKVZ to also include Sawndip, used to write the Zhuang languages. Standard Mandarin Chinese and Standard Cantonese are written almost exclusively in Chinese characters. Over 3,000 characters are required for general literacy, with up to 40,000 characters for reasonably complete coverage. Japanese uses fewer characters: general literacy in Japanese can be expected with 2,136 characters. The use of Chinese characters in Korea

The Latin-based Vietnamese alphabet. The number of characters required for complete coverage of all these languages' needs cannot fit in the 256-character code space of 8-bit character encodings, requiring at least a 16-bit fixed width encoding or multi-byte variable-length encodings. The 16-bit fixed width encodings, such as those from Unicode up to and including version 2.0, are now deprecated due to

The character sets in a process known as Han unification. CJK character encodings should consist minimally of Han characters plus language-specific phonetic scripts such as pinyin, bopomofo, hiragana, katakana and hangul. CJK character encodings include: The CJK character sets take up the bulk of the assigned Unicode code space. There is much controversy among Japanese experts of Chinese characters about

The desirability and technical merit of the Han unification process used to map multiple Chinese and Japanese character sets into a single set of unified characters. All three languages can be written both left-to-right and top-to-bottom (right-to-left and top-to-bottom in ancient documents), but are usually considered left-to-right scripts when discussing encoding issues. Libraries cooperated on encoding standards for JACKPHY characters in

The early 1980s. According to Ken Lunde, the abbreviation "CJK" was a registered trademark of Research Libraries Group (which merged with OCLC in 2006). The trademark owned by OCLC between 1987 and 2009 has now expired.

Block | Plane | Code point range | Characters | Unification | Script(s)
CJK Unified Ideographs | 0 BMP | 4E00–9FFF | 20,992 | Unified | Han
CJK Unified Ideographs Extension A | 0 BMP | 3400–4DBF | 6,592 | Unified | Han
CJK Unified Ideographs Extension B | 2 SIP | 20000–2A6DF | 42,720 | Unified | Han
CJK Unified Ideographs Extension C | 2 SIP | 2A700–2B73F | 4,154 | Unified | Han
CJK Unified Ideographs Extension D | 2 SIP | 2B740–2B81F | 222 | Unified | Han
CJK Unified Ideographs Extension E | 2 SIP | 2B820–2CEAF | 5,762 | Unified | Han
CJK Unified Ideographs Extension F | 2 SIP | 2CEB0–2EBEF | 7,473 | Unified | Han
CJK Unified Ideographs Extension G | 3 TIP | 30000–3134F | 4,939 | Unified | Han
CJK Unified Ideographs Extension H | 3 TIP | 31350–323AF | 4,192 | Unified | Han
CJK Unified Ideographs Extension I | 2 SIP | 2EBF0–2EE5F | 622 | Unified | Han
CJK Radicals Supplement | 0 BMP | 2E80–2EFF | 115 | Not unified | Han
Kangxi Radicals | 0 BMP | 2F00–2FDF | 214 | Not unified | Han
Ideographic Description Characters | 0 BMP | 2FF0–2FFF | 16 | Not unified | Common
CJK Symbols and Punctuation | 0 BMP | 3000–303F | 64 | Not unified | Han, Hangul, Common, Inherited
CJK Strokes | 0 BMP | 31C0–31EF | 39 | Not unified | Common
Enclosed CJK Letters and Months | 0 BMP | 3200–32FF | 255 | Not unified | Hangul, Katakana, Common
CJK Compatibility | 0 BMP | 3300–33FF | 256 | Not unified | Katakana, Common
CJK Compatibility Ideographs | 0 BMP | F900–FAFF | 472 | 12 are unified | Han
CJK Compatibility Forms | 0 BMP | FE30–FE4F | 32 | Not unified | Common
Enclosed Ideographic Supplement | 1 SMP | 1F200–1F2FF | 64 | Not unified | Hiragana, Common
CJK Compatibility Ideographs Supplement | 2 SIP | 2F800–2FA1F | 542 | Not unified | Han
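The block ranges listed above can be turned into a small lookup. The following is a minimal sketch, assuming only the ten "Unified" block ranges given in this article; the names `UNIFIED_BLOCKS` and `unified_block` are hypothetical. Note that a block range also covers code points that are not yet assigned, so a match does not guarantee an encoded character.

```python
from typing import Optional

# (first code point, last code point, block name) for the ten unified blocks.
UNIFIED_BLOCKS = [
    (0x3400, 0x4DBF, "CJK Unified Ideographs Extension A"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0x20000, 0x2A6DF, "CJK Unified Ideographs Extension B"),
    (0x2A700, 0x2B73F, "CJK Unified Ideographs Extension C"),
    (0x2B740, 0x2B81F, "CJK Unified Ideographs Extension D"),
    (0x2B820, 0x2CEAF, "CJK Unified Ideographs Extension E"),
    (0x2CEB0, 0x2EBEF, "CJK Unified Ideographs Extension F"),
    (0x2EBF0, 0x2EE5F, "CJK Unified Ideographs Extension I"),
    (0x30000, 0x3134F, "CJK Unified Ideographs Extension G"),
    (0x31350, 0x323AF, "CJK Unified Ideographs Extension H"),
]


def unified_block(ch: str) -> Optional[str]:
    """Return the name of the unified block whose range contains ch, or None."""
    code_point = ord(ch)
    for first, last, name in UNIFIED_BLOCKS:
        if first <= code_point <= last:
            return name
    return None


for sample in ("\u4e00", "\U00020000", "A"):
    print(f"U+{ord(sample):04X}", unified_block(sample))
```

A linear scan is enough for ten ranges; a binary search over the sorted start points would be the natural refinement for a larger table.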

The early 20th century, Vietnam also used Chinese characters (Chữ Nôm), so sometimes the abbreviation CJKV is used. The Ideographic Research Group (IRG) is responsible for developing extensions to the encoded repertoires of CJK unified ideographs. IRG processes proposals for new CJK unified ideographs submitted by its member bodies, and after several rounds of expert review, IRG submits

The range U+20000 through U+2A6DF. These include most of the characters used in the Kangxi Dictionary that are not in the basic CJK Unified Ideographs block, as well as many Hán-Nôm characters that were formerly used to write Vietnamese. 20000-215FF, 21600-230FF, 23100-245FF, 24600-260FF, 26100-275FF, 27600-290FF, 29100-2A6DF. Note: Many characters appear in more than one source, so

The range U+2B820 through U+2CEA1 that were added in Unicode 8.0 (2015). 2B820–2CEAF. Note: Some characters appear in more than one source, so the sum of individual character counts (5,919) is greater than the number of encoded characters (5,762). The block named CJK Unified Ideographs Extension F (2CEB0–2EBEF) contains 7,473 characters in the range U+2CEB0 through U+2EBE0 that were added in Unicode 10.0 (2017). It includes more than 1,000 Sawndip characters for Zhuang. 2CEB0–2EBEF. Note: Some characters appear in more than one source, so

The range U+4E00 through U+9FFF. The block not only includes characters used in the Chinese writing system but also kanji used in the Japanese writing system, hanja in Korea, and chữ Nôm characters in Vietnamese. Many characters in this block are used in all three writing systems, while others are in only one or two of the three. The first 20,902 characters in the block are arranged according to

The requirement to encode more characters than a 16-bit encoding can accommodate (Unicode 5.0 has some 70,000 Han characters) and the requirement by the Chinese government that software in China support the GB 18030 character set. Although CJK encodings have common character sets, the encodings often used to represent them have been developed separately by different East Asian governments and software companies, and are mutually incompatible. Unicode has attempted, with some controversy, to unify
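The limit of a single 16-bit unit can be made concrete by comparing a character from the basic block with one from Extension B. The sketch below uses codecs from Python's standard library and a hypothetical helper `encoded_lengths`; the two sample characters are chosen here for illustration.

```python
def encoded_lengths(ch: str) -> dict:
    """Return the number of bytes ch occupies in a few encodings."""
    return {name: len(ch.encode(name)) for name in ("utf-8", "utf-16-be", "gb18030")}


bmp_char = "\u4e00"       # first code point of the basic CJK Unified Ideographs block
sip_char = "\U00020000"   # first code point of Extension B, outside the BMP

print(encoded_lengths(bmp_char))
print(encoded_lengths(sip_char))

# Outside the BMP, UTF-16 needs a surrogate pair: two 16-bit units.
print(sip_char.encode("utf-16-be").hex())  # d840dc00
```

The Extension B character needs four bytes in each of the three encodings, which is why a fixed-width 16-bit encoding cannot represent it.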


The same character has inadvertently been encoded twice) and two semi-duplicates (where the CJK-B character represents a de facto disunification of two glyph forms unified in the corresponding BMP character) were encoded by mistake:

CJK characters

In internationalization, CJK characters is a collective term for graphemes used in the Chinese, Japanese, and Korean writing systems, which each include Chinese characters. It can also go by CJKV to include Chữ Nôm,

The sum of individual character counts (40) is greater than the number of encoded characters (12). The character U+4039 (䀹) was a unification of two different characters (one with jiā 夾 phonetic and one with shǎn 㚒 phonetic) until Unicode 5.0. However, they were lexically different characters that should not have been unified; they have different pronunciations and different meanings. The proposal of disunification of U+4039

The sum of individual character counts (7,775) is greater than the number of encoded characters (7,473). A block named CJK Unified Ideographs Extension G was added as part of Unicode 13.0 to the Tertiary Ideographic Plane in the range U+30000 through U+3134F, containing 4,939 characters. 30000–3134F. Note: Some characters appear in more than one source, so the sum of individual character counts (5,081)

The sum of individual character counts (99,784) is far greater than the number of encoded characters (42,720). The block named CJK Unified Ideographs Extension C (2A700–2B73F) contains 4,154 characters in the range U+2A700 through U+2B739. It was initially added in Unicode 5.2 (2009). 2A700-2B73F. Note: Some characters appear in more than one source, so the sum of individual character counts (4,634)

The target languages. The sinologist Carl Leban (1971) produced an early survey of CJK encoding systems. Until the early 20th century, Classical Chinese was the written language of government and scholarship in Vietnam. Popular literature in Vietnamese was written in the chữ Nôm script, consisting of Chinese characters with many characters created locally. Since the 1920s, the script used for recording literature has been

Was accepted for Unicode 5.1, encoding a new character at U+9FC3 (鿃) to represent shǎn. In CJK Unified Ideographs Extension B, some characters are incorrectly unified with others. These characters include U+2017B (𠅻), U+204AF (𠒯) and U+24CB2 (𤲲). The first two wrongly unify the Chinese Mainland and Vietnamese source glyphs, while the last one wrongly unifies the Chinese Mainland and Taiwanese ones. Also in CJK Unified Ideographs Extension B, hundreds of glyph variants were encoded by mistake. Additionally, an ISO/IEC JTC 1/SC 2 report has found that six exact duplicates (where

Was added as part of Unicode 15.1 to the Supplementary Ideographic Plane in the range U+2EBF0 through U+2EE5F, containing 622 characters. 2EBF0–2EE5F. Note: Some characters appear in more than one source, making the sum of individual character counts (625) more than the number of encoded characters (622). The block named CJK Compatibility Ideographs (F900–FAFF) was created to retain round-trip compatibility with other standards. However, twelve characters in this block actually have
