Character encoding is the process of assigning numbers to graphical characters, especially the written characters of human language, allowing them to be stored, transmitted, and transformed using computers. The numerical values that make up a character encoding are known as code points and collectively comprise a code space, a code page, or a character map. Early character codes associated with
a byte order mark or escape sequences; compressing schemes try to minimize the number of bytes used per code unit (such as SCSU and BOCU). Although UTF-32BE and UTF-32LE are simpler CESes, most systems working with Unicode use either UTF-8, which is backward compatible with fixed-length ASCII and maps Unicode code points to variable-length sequences of octets, or UTF-16, which
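As a rough illustration of how a compound scheme's byte order mark works in practice, the following C sketch recognizes a Unicode stream from its first bytes. The function name detect_bom and its return convention are illustrative assumptions, not part of any standard library:

    #include <stddef.h>
    #include <string.h>

    /* Hypothetical helper: identify a Unicode encoding scheme from a leading
       byte order mark (BOM). Returns a label, or NULL when no BOM is present.
       The 4-byte UTF-32 patterns must be tested before the 2-byte UTF-16
       ones, because the UTF-32LE BOM begins with the UTF-16LE BOM. */
    static const char *detect_bom(const unsigned char *buf, size_t len) {
        if (len >= 4 && memcmp(buf, "\x00\x00\xFE\xFF", 4) == 0) return "UTF-32BE";
        if (len >= 4 && memcmp(buf, "\xFF\xFE\x00\x00", 4) == 0) return "UTF-32LE";
        if (len >= 3 && memcmp(buf, "\xEF\xBB\xBF", 3) == 0)     return "UTF-8";
        if (len >= 2 && memcmp(buf, "\xFE\xFF", 2) == 0)         return "UTF-16BE";
        if (len >= 2 && memcmp(buf, "\xFF\xFE", 2) == 0)         return "UTF-16LE";
        return NULL; /* no BOM: the encoding must be known from other metadata */
    }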
a char on most systems, so more than one is used for some of them, as in the variable-length encoding UTF-8, where each code point takes 1 to 4 bytes. Furthermore, a "character" may require more than one code point (for instance with combining characters), depending on what is meant by the word "character". The fact that a character was historically stored in a single byte led to the two terms ("char" and "character") being used interchangeably in most documentation. This often makes
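To make the variable width concrete, here is a minimal C sketch, assuming a C11 compiler in which u8"..." string literals are encoded as UTF-8; it prints how many char-sized code units a single code point can occupy:

    #include <stdio.h>

    int main(void) {
        /* sizeof includes the terminating NUL, hence the "- 1". */
        printf("U+0041 (A):        %zu byte(s)\n", sizeof(u8"A") - 1);           /* 1 */
        printf("U+00E9 (e-acute):  %zu byte(s)\n", sizeof(u8"\u00E9") - 1);      /* 2 */
        printf("U+20AC (euro):     %zu byte(s)\n", sizeof(u8"\u20AC") - 1);      /* 3 */
        printf("U+10400 (Deseret): %zu byte(s)\n", sizeof(u8"\U00010400") - 1);  /* 4 */
        return 0;
    }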
a character array (rather than a byte array). Unicode can also be stored in strings made up of code units that are larger than char. These are called "wide characters". The original C type was called wchar_t. Due to some platforms defining wchar_t as 16 bits and others defining it as 32 bits, recent versions of the language have added char16_t and char32_t. Even then the objects being stored might not be characters, for instance
a string of the letters "ab̲c𐐀"—that is, a string containing a Unicode combining character (U+0332 ̲ COMBINING LOW LINE) as well as a supplementary character (U+10400 𐐀 DESERET CAPITAL LETTER LONG I). This string has several Unicode representations which are logically equivalent, yet each is suited to a different set of circumstances or range of requirements. Note in particular that 𐐀
a "character". Computers and communication equipment represent characters using a character encoding that assigns each character to something (typically an integer quantity represented by a sequence of digits) that can be stored or transmitted through a network. Two examples of common encodings are ASCII and the UTF-8 encoding for Unicode. While most character encodings map characters to numbers and/or bit sequences, Morse code instead represents characters using
a fifth company, the Computing-Tabulating-Recording Company (CTR). Under the presidency of Thomas J. Watson, CTR was renamed International Business Machines Corporation (IBM) in 1924. By 1933, the Tabulating Machine Company name had disappeared as subsidiary companies were subsumed by IBM. Herman Hollerith died November 17, 1929. Hollerith is buried at Oak Hill Cemetery in the Georgetown neighborhood of Washington, D.C. Hollerith cards were named after Herman Hollerith, as were Hollerith strings and Hollerith constants. His great-grandson,
a home on 29th Street and a business building at 31st Street and the Chesapeake and Ohio Canal, where today there is a commemorative plaque installed by IBM. He died of a heart attack in Washington, D.C., at age 69. At the suggestion of John Shaw Billings, Hollerith developed a mechanism using electrical connections to increment a counter, recording information. A key idea was that a datum could be recorded by
540-420: A particular sequence of bits. Instead, characters would first be mapped to a universal intermediate representation in the form of abstract numbers called code points . Code points would then be represented in a variety of ways and with various default numbers of bits per character (code units) depending on context. To encode code points higher than the length of the code unit, such as above 256 for eight-bit units,
600-401: A process known as transcoding . Some of these are cited below. Cross-platform : Windows : The most used character encoding on the web is UTF-8 , used in 98.2% of surveyed web sites, as of May 2024. In application programs and operating system tasks, both UTF-8 and UTF-16 are popular options. Character (computing) In computing and telecommunications , a character
660-405: A series of electrical impulses of varying length. Historically, the term character has been widely used by industry professionals to refer to an encoded character , often as defined by the programming language or API . Likewise, character set has been widely used to refer to a specific repertoire of characters that have been mapped to specific bit sequences or numerical codes. The term glyph
#1732787994859720-453: A set of elements used for the organization, control, or representation of data". Unicode's definition supplements this with explanatory notes that encourage the reader to differentiate between characters, graphemes, and glyphs, among other things. Such differentiation is an instance of the wider theme of the separation of presentation and content . For example, the Hebrew letter aleph ("א")
780-461: A single glyph . The former simplifies the text handling system, but the latter allows any letter/diacritic combination to be used in text. Ligatures pose similar problems. Exactly how to handle glyph variants is a choice that must be made when constructing a particular character encoding. Some writing systems, such as Arabic and Hebrew, need to accommodate things like graphemes that are joined in different ways in different contexts, but represent
840-548: A single character per code unit. However, due to the emergence of more sophisticated character encodings, the distinction between these terms has become important. "Code page" is a historical name for a coded character set. Originally, a code page referred to a specific page number in the IBM standard character set manual, which would define a particular character encoding. Other vendors, including Microsoft , SAP , and Oracle Corporation , also published their own sets of code pages;
900-567: A specific relation to each other and to a standard, and then counting or tallying such statistical items separately or in combination by means of mechanical counters operated by electro-magnets the circuits through which are controlled by the perforated sheets, substantially as and for the purpose set forth. Hollerith had left teaching and began working for the United States Census Bureau in the year he filed his first patent application. Titled "Art of Compiling Statistics", it
960-431: A stream of octets (bytes). The purpose of this decomposition is to establish a universal set of characters that can be encoded in a variety of ways. To describe this model precisely, Unicode uses its own set of terminology to describe its process: An abstract character repertoire (ACR) is the full set of abstract characters that a system supports. Unicode has an open repertoire, meaning that new characters will be added to
1020-505: A well-defined and extensible encoding system, has replaced most earlier character encodings, but the path of code development to the present is fairly well known. The Baudot code, a five- bit encoding, was created by Émile Baudot in 1870, patented in 1874, modified by Donald Murray in 1901, and standardized by CCITT as International Telegraph Alphabet No. 2 (ITA2) in 1930. The name baudot has been erroneously applied to ITA2 and its many variants. ITA2 suffered from many shortcomings and
1080-518: Is UTF-8 , which is used in 98.2% of surveyed web sites, as of May 2024. In application programs and operating system tasks, both UTF-8 and UTF-16 are popular options. The history of character codes illustrates the evolving need for machine-mediated character-based symbolic information over a distance, using once-novel electrical means. The earliest codes were based upon manual and hand-written encoding and cyphering systems, such as Bacon's cipher , Braille , international maritime signal flags , and
1140-478: Is backward compatible with fixed-length UCS-2BE and maps Unicode code points to variable-length sequences of 16-bit words. See comparison of Unicode encodings for a detailed discussion. Finally, there may be a higher-level protocol which supplies additional information to select the particular variant of a Unicode character, particularly where there are regional variants that have been 'unified' in Unicode as
1200-459: Is a unit of information that roughly corresponds to a grapheme , grapheme-like unit, or symbol , such as in an alphabet or syllabary in the written form of a natural language . Examples of characters include letters , numerical digits , common punctuation marks (such as "." or "-"), and whitespace . The concept also includes control characters , which do not correspond to visible symbols but rather to instructions to format or process
1260-442: Is defined by a CEF. A character encoding scheme (CES) is the mapping of code units to a sequence of octets to facilitate storage on an octet-based file system or transmission over an octet-based network. Simple character encoding schemes include UTF-8 , UTF-16BE , UTF-32BE , UTF-16LE , and UTF-32LE ; compound character encoding schemes, such as UTF-16 , UTF-32 and ISO/IEC 2022 , switch between several simple schemes by using
#17327879948591320-444: Is defined by the encoding. Thus, the number of code units required to represent a code point depends on the encoding: Exactly what constitutes a character varies between character encodings. For example, for letters with diacritics , there are two distinct approaches that can be taken to encode them: they can be encoded either as a single unified character (known as a precomposed character), or as separate characters that combine into
1380-487: Is defined to be large enough to contain any member of the "basic execution character set". The exact number of bits can be checked via CHAR_BIT macro. By far the most common size is 8 bits, and the POSIX standard requires it to be 8 bits. In newer C standards char is required to hold UTF-8 code units which requires a minimum size of 8 bits. A Unicode code point may require as many as 21 bits. This will not fit in
1440-692: Is often used by mathematicians to denote certain kinds of infinity (ℵ), but it is also used in ordinary Hebrew text. In Unicode, these two uses are considered different characters, and have two different Unicode numerical identifiers (" code points "), though they may be rendered identically. Conversely, the Chinese logogram for water ("水") may have a slightly different appearance in Japanese texts than it does in Chinese texts, and local typefaces may reflect this. But nonetheless in Unicode they are considered
1500-429: Is preferred, usually in the larger context of locales. IBM's Character Data Representation Architecture (CDRA) designates entities with coded character set identifiers ( CCSIDs ), each of which is variously called a "charset", "character set", "code page", or "CHARMAP". The code unit size is equivalent to the bit measurement for the particular encoding: A code point is represented by a sequence of code units. The mapping
1560-492: Is represented with either one 32-bit value (UTF-32), two 16-bit values (UTF-16), or four 8-bit values (UTF-8). Although each of those forms uses the same total number of bits (32) to represent the glyph, it is not obvious how the actual numeric byte values are related. As a result of having many character encoding methods in use (and the need for backward compatibility with archived data), many computer programs have been developed to translate data between character encoding schemes,
1620-508: Is used to describe a particular visual appearance of a character. Many computer fonts consist of glyphs that are indexed by the numerical code of the corresponding character. With the advent and widespread acceptance of Unicode and bit-agnostic coded character sets , a character is increasingly being seen as a unit of information , independent of any particular visual manifestation. The ISO/IEC 10646 (Unicode) International Standard defines character , or abstract character as "a member of
1680-405: The 6-bit character code were once popular, and the 5-bit Baudot code has been used in the past as well. The term has even been applied to 4 bits with only 16 possible values. All modern systems use a varying-size sequence of these fixed-sized pieces, for instance UTF-8 uses a varying number of 8-bit code units to define a " code point " and Unicode uses varying number of those to define
1740-672: The City College of New York in 1875, graduated from the Columbia School of Mines with an Engineer of Mines degree in 1879 at age 19, and, in 1890, earned a Doctor of Philosophy based on his development of the tabulating system. In 1882, Hollerith joined the Massachusetts Institute of Technology where he taught mechanical engineering and conducted his first experiments with punched cards. He eventually moved to Washington, D.C., living in Georgetown with
1800-594: The 1880 census: the larger population, the data items to be collected, the Census Bureau headcount, the scheduled publications, and the use of Hollerith's electromechanical tabulators, reduced the time required to process the census from eight years for the 1880 census to six years for the 1890 census. In 1896, Hollerith founded the Tabulating Machine Company (in 1905 renamed The Tabulating Machine Company). Many major census bureaus around
1860-486: The 1980s faced the dilemma that, on the one hand, it seemed necessary to add more bits to accommodate additional characters, but on the other hand, for the users of the relatively small character set of the Latin alphabet (who still constituted the majority of computer users), those additional bits were a colossal waste of then-scarce and expensive computing resources (as they would always be zeroed out for such users). In 1985,
1920-532: The 4-digit encoding of Chinese characters for a Chinese telegraph code ( Hans Schjellerup , 1869). With the adoption of electrical and electro-mechanical techniques these earliest codes were adapted to the new capabilities and limitations of the early machines. The earliest well-known electrically transmitted character code, Morse code , introduced in the 1840s, used a system of four "symbols" (short signal, long signal, short space, long space) to generate codes of variable length. Though some commercial use of Morse code
1980-519: The Unicode standard is U+0000 to U+10FFFF, inclusive, divided in 17 planes , identified by the numbers 0 to 16. Characters in the range U+0000 to U+FFFF are in plane 0, called the Basic Multilingual Plane (BMP). This plane contains the most commonly-used characters. Characters in the range U+10000 to U+10FFFF in the other planes are called supplementary characters . The following table shows examples of code point values: Consider
2040-462: The average personal computer user's hard disk drive could store only about 10 megabytes, and it cost approximately US$ 250 on the wholesale market (and much higher if purchased separately at retail), so it was very important at the time to make every bit count. The compromise solution that was eventually found and developed into Unicode was to break the assumption (dating back to telegraph codes) that each character should always directly correspond to
2100-558: The documentation confusing or misleading when multibyte encodings such as UTF-8 are used, and has led to inefficient and incorrect implementations of string manipulation functions (such as computing the "length" of a string as a count of code units rather than bytes). Modern POSIX documentation attempts to fix this, defining "character" as a sequence of one or more bytes representing a single graphic symbol or control code, and attempts to use "byte" when referring to char data. However it still contains errors such as defining an array of char as
2160-543: The era had their own character codes, often six-bit, but usually had the ability to read tapes produced on IBM equipment. These BCD encodings were the precursors of IBM's Extended Binary-Coded Decimal Interchange Code (usually abbreviated as EBCDIC), an eight-bit encoding scheme developed in 1963 for the IBM System/360 that featured a larger character set, including lower case letters. In trying to develop universally interchangeable character encodings, researchers in
2220-460: The largest and most successful companies of the 20th century. Hollerith is regarded as one of the seminal figures in the development of data processing. Herman Hollerith was born in Buffalo, New York , in 1860, where he also spent his early childhood. His parents were German immigrants; his father, Georg Hollerith, was a school teacher from Großfischlingen , Rhineland-Palatinate . He entered
2280-429: The middle character of the word 'naïve' either as a single character 'ï' or as a combination of the character 'i ' with the combining diaeresis: (U+0069 LATIN SMALL LETTER I + U+0308 COMBINING DIAERESIS); this is also rendered as 'ï ' . These are considered canonically equivalent by the Unicode standard. A char in the C programming language is a data type with the size of exactly one byte , which in turn
2340-479: The most well-known code page suites are " Windows " (based on Windows-1252) and "IBM"/"DOS" (based on code page 437 ). Despite no longer referring to specific page numbers in a standard, many character encodings are still referred to by their code page number; likewise, the term "code page" is often still used to refer to character encodings in general. The term "code page" is not used in Unix or Linux, where "charmap"
2400-465: The optical or electrical telegraph could only represent a subset of the characters used in written languages , sometimes restricted to upper case letters , numerals and some punctuation only. The advent of digital computer systems allows more elaborate encodings codes (such as Unicode ) to support hundreds of written languages. The most popular character encoding on the World Wide Web
2460-434: The presence or absence of a hole at a specific location on a card. For example, if a specific hole location indicates marital status , then a hole there can indicate married while not having a hole indicates single . Hollerith determined that data in specified locations on a card, arranged in rows and columns, could be counted or sorted electromechanically. A description of this system, An Electric Tabulating System (1889) ,
2520-412: The punched card code then in use only allowed digits, upper-case English letters and a few special characters, six bits were sufficient. These BCD encodings extended existing simple four-bit numeric encoding to include alphabetic and special characters, mapping them easily to punch-card encoding which was already in widespread use. IBM's codes were used primarily with IBM equipment; other computer vendors of
2580-457: The punched card tabulating machine, patented in 1884, marks the beginning of the era of mechanized binary code and semiautomatic data processing systems, and his concept dominated that landscape for nearly a century. Hollerith founded a company that was amalgamated in 1911 with several other companies to form the Computing-Tabulating-Recording Company . In 1924, the company was renamed "International Business Machines" ( IBM ) and became one of
2640-460: The repertoire over time. A coded character set (CCS) is a function that maps characters to code points (each code point represents one character). For example, in a given repertoire, the capital letter "A" in the Latin alphabet might be represented by the code point 65, the character "B" by 66, and so on. Multiple coded character sets may share the same character repertoire; for example ISO/IEC 8859-1 and IBM code pages 037 and 500 all cover
2700-413: The same character, and share the same code point. The Unicode standard also differentiates between these abstract characters and coded characters or encoded characters that have been paired with numeric codes that facilitate their representation in computers. The combining character is also addressed by Unicode. For instance, Unicode allocates a code point to each of This makes it possible to code
2760-496: The same character. An example is the XML attribute xml:lang. The Unicode model uses the term "character map" for other systems which directly assign a sequence of characters to a sequence of bytes, covering all of the CCS, CEF and CES layers. In Unicode, a character can be referred to as 'U+' followed by its codepoint value in hexadecimal. The range of valid code points (the codespace) for
2820-537: The same repertoire but map them to different code points. A character encoding form (CEF) is the mapping of code points to code units to facilitate storage in a system that represents numbers as bit sequences of fixed length (i.e. practically any computer system). For example, a system that stores numeric information in 16-bit units can only directly represent code points 0 to 65,535 in each unit, but larger code points (say, 65,536 to 1.4 million) could be represented by using multiple 16-bit units. This correspondence
2880-527: The same semantic character. Unicode and its parallel standard, the ISO/IEC 10646 Universal Character Set , together constitute a unified standard for character encoding. Rather than mapping characters directly to bytes , Unicode separately defines a coded character set that maps characters to unique natural numbers ( code points ), how those code points are mapped to a series of fixed-size natural numbers (code units), and finally how those units are encoded as
2940-432: The solution was to implement variable-length encodings where an escape sequence would signal that subsequent bits should be parsed as a higher code point. Informally, the terms "character encoding", "character map", "character set" and "code page" are often used interchangeably. Historically, the same standard would specify a repertoire of characters and how they were to be encoded into a stream of code units — usually with
3000-428: The text. Examples of control characters include carriage return and tab as well as other instructions to printers or other devices that display or otherwise process text. Characters are typically combined into strings . Historically, the term character was used to denote a specific number of contiguous bits . While a character is most commonly assumed to refer to 8 bits (one byte ) today, other options like
3060-548: The variable-length UTF-16 is often stored in arrays of char16_t . Other languages also have a char type. Some such as C++ use at least 8 bits like C. Others such as Java use 16 bits for char in order to represent UTF-16 values. Herman Hollerith Herman Hollerith (February 29, 1860 – November 17, 1929) was a German-American statistician, inventor, and businessman who developed an electromechanical tabulating machine for punched cards to assist in summarizing information and, later, in accounting. His invention of
#17327879948593120-526: The world leased his equipment and purchased his cards, as did major insurance companies. Hollerith's machines were used for censuses in England & Wales , Italy , Germany , Russia , Austria , Canada , France , Norway , Puerto Rico , Cuba , and the Philippines , and again in the 1900 U.S. census . He invented the first automatic card-feed mechanism and the first keypunch . The 1890 Tabulator
3180-493: Was hardwired to operate on 1890 Census cards. A control panel in his 1906 Type I Tabulator simplified rewiring for different jobs. The 1920s removable control panel supported prewiring and near instant job changing. These inventions were among the foundations of the data processing industry, and Hollerith's punched cards (later used for computer input/output ) continued in use for almost a century. In 1911, four corporations, including Hollerith's firm, were amalgamated to form
3240-510: Was adopted fairly widely. ASCII67's American-centric nature was somewhat addressed in the European ECMA-6 standard. Herman Hollerith invented punch card data encoding in the late 19th century to analyze census data. Initially, each hole position represented a different data element, but later, numeric information was encoded by numbering the lower rows 0 to 9, with a punch in a column representing its row number. Later alphabetic data
3300-669: Was encoded by allowing more than one punch per column. Electromechanical tabulating machines represented date internally by the timing of pulses relative to the motion of the cards through the machine. When IBM went to electronic processing, starting with the IBM 603 Electronic Multiplier, it used a variety of binary encoding schemes that were tied to the punch card code. IBM used several Binary Coded Decimal ( BCD ) six-bit character encoding schemes, starting as early as 1953 in its 702 and 704 computers, and in its later 7000 Series and 1400 series , as well as in associated peripherals. Since
3360-470: Was filed on September 23, 1884; U.S. Patent 395,782 was granted on January 8, 1889. Hollerith initially did business under his own name, as The Hollerith Electric Tabulating System , specializing in punched card data processing equipment . He provided tabulators and other machines under contract for the Census Office, which used them for the 1890 census . The net effect of the many changes from
3420-409: Was often improved by many equipment manufacturers, sometimes creating compatibility issues. In 1959 the U.S. military defined its Fieldata code, a six-or seven-bit code, introduced by the U.S. Army Signal Corps. While Fieldata addressed many of the then-modern issues (e.g. letter and digit codes arranged for machine collation), it fell short of its goals and was short-lived. In 1963 the first ASCII code
3480-591: Was released (X3.4-1963) by the ASCII committee (which contained at least one member of the Fieldata committee, W. F. Leubbert), which addressed most of the shortcomings of Fieldata, using a simpler code. Many of the changes were subtle, such as collatable character sets within certain numeric ranges. ASCII63 was a success, widely adopted by industry, and with the follow-up issue of the 1967 ASCII code (which added lower-case letters and fixed some "control code" issues) ASCII67
3540-566: Was submitted by Hollerith to Columbia University as his doctoral thesis, and is reprinted in Brian Randell 's 1982 The Origins of Digital Computers, Selected Papers . On January 8, 1889, Hollerith was issued U.S. Patent 395,782, claim 2 of which reads: The herein-described method of compiling statistics, which consists in recording separate statistical items pertaining to the individual by holes or combinations of holes punched in sheets of electrically non-conducting material, and bearing
3600-589: Was via machinery, it was often used as a manual code, generated by hand on a telegraph key and decipherable by ear, and persists in amateur radio and aeronautical use. Most codes are of fixed per-character length or variable-length sequences of fixed-length codes (e.g. Unicode ). Common examples of character encoding systems include Morse code, the Baudot code , the American Standard Code for Information Interchange (ASCII) and Unicode. Unicode,