
Multinational Character Set

Article snapshot taken from Wikipedia, available under the Creative Commons Attribution-ShareAlike license.

The Multinational Character Set (DMCS or MCS) is a character encoding created in 1983 by Digital Equipment Corporation (DEC) for use in the popular VT220 terminal. It was an 8-bit extension of ASCII that added accented characters, currency symbols, and other character glyphs missing from 7-bit ASCII. It is only one of the code pages implemented for the VT220 National Replacement Character Set (NRCS).
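To make the 7-bit-versus-8-bit distinction concrete, here is a minimal Python sketch. Python ships no DEC-MCS codec, so Latin-1 (ISO 8859-1, a descendant of MCS discussed below) stands in; note that a few MCS code positions differ from Latin-1, so this is an illustration, not an exact MCS decoder.

```python
# Sketch: 7-bit ASCII rejects bytes >= 0x80, while an 8-bit extension
# assigns them characters. Latin-1 stands in for DEC MCS here, since
# Python has no built-in DEC-MCS codec (the two differ in a few spots).
data = bytes([0x48, 0xE9, 0x6C])   # 0xE9 is beyond 7-bit ASCII

try:
    data.decode("ascii")
except UnicodeDecodeError as exc:
    print("ASCII cannot decode it:", exc)

print(data.decode("latin-1"))       # 'Hél' -- 0xE9 maps to 'é'
```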

A byte or word, is referred to, it is usually specified by a number from 0 upwards corresponding to its position within the byte or word. However, 0 can refer to either the most or least significant bit depending on the context. As with torque and energy in physics, information-theoretic information and data storage size have the same dimensionality of units of measurement, but there

A byte order mark or escape sequences; compressing schemes try to minimize the number of bytes used per code unit (such as SCSU and BOCU). Although UTF-32BE and UTF-32LE are simpler CESes, most systems working with Unicode use either UTF-8, which is backward compatible with fixed-length ASCII and maps Unicode code points to variable-length sequences of octets, or UTF-16BE, which

A string of the letters "ab̲c𐐀"—that is, a string containing a Unicode combining character (U+0332 COMBINING LOW LINE) as well as a supplementary character (U+10400 𐐀 DESERET CAPITAL LETTER LONG I). This string has several Unicode representations which are logically equivalent, yet each is suited to a different set of circumstances or range of requirements. Note in particular that 𐐀

A character encoding are known as code points and collectively comprise a code space, a code page, or character map. Early character codes associated with the optical or electrical telegraph could only represent a subset of the characters used in written languages, sometimes restricted to upper case letters, numerals and some punctuation only. The advent of digital computer systems allows more elaborate encodings (such as Unicode) to support hundreds of written languages. The most popular character encoding on

A conducting path at a certain point of a circuit. In optical discs, a bit is encoded as the presence or absence of a microscopic pit on a reflective surface. In one-dimensional bar codes, bits are encoded as the thickness of alternating black and white lines. The bit is not defined in the International System of Units (SI). However, the International Electrotechnical Commission issued standard IEC 60027, which specifies that

A number of bytes which is a low power of two. A string of four bits is usually a nibble. In information theory, one bit is the information entropy of a random binary variable that is 0 or 1 with equal probability, or the information that is gained when the value of such a variable becomes known. As a unit of information, the bit is also known as a shannon, named after Claude E. Shannon. The symbol for
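A short Python sketch of this definition (binary_entropy is a throwaway helper, not a standard function): the entropy of a binary variable peaks at exactly one bit, one shannon, when both outcomes are equally likely.

```python
import math

def binary_entropy(p: float) -> float:
    """Information entropy, in bits, of a binary variable that is 1 with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(binary_entropy(0.5))  # 1.0: a fair coin yields exactly one bit (one shannon)
print(binary_entropy(0.9))  # ~0.47: a biased coin yields less information
```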

A particular sequence of bits. Instead, characters would first be mapped to a universal intermediate representation in the form of abstract numbers called code points. Code points would then be represented in a variety of ways and with various default numbers of bits per character (code units) depending on context. To encode code points higher than the length of the code unit, such as above 256 for eight-bit units,
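A small Python illustration of this idea, using UTF-8's 8-bit code units: code points beyond the reach of a single unit occupy progressively longer byte sequences.

```python
# Code points above what one 8-bit unit can hold need multi-byte sequences.
for ch in ("A", "é", "€", "𐐀"):            # U+0041, U+00E9, U+20AC, U+10400
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")
```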

A process known as transcoding. The most used character encoding on the web is UTF-8, used in 98.2% of surveyed web sites, as of May 2024. In application programs and operating system tasks, both UTF-8 and UTF-16 are popular options.

Bit

The bit is the most basic unit of information in computing and digital communication. The name

A single glyph. The former simplifies the text handling system, but the latter allows any letter/diacritic combination to be used in text. Ligatures pose similar problems. Exactly how to handle glyph variants is a choice that must be made when constructing a particular character encoding. Some writing systems, such as Arabic and Hebrew, need to accommodate things like graphemes that are joined in different ways in different contexts, but represent

A single character per code unit. However, due to the emergence of more sophisticated character encodings, the distinction between these terms has become important. "Code page" is a historical name for a coded character set. Originally, a code page referred to a specific page number in the IBM standard character set manual, which would define a particular character encoding. Other vendors, including Microsoft, SAP, and Oracle Corporation, also published their own sets of code pages;


A stream of octets (bytes). The purpose of this decomposition is to establish a universal set of characters that can be encoded in a variety of ways. To describe this model precisely, Unicode uses its own set of terminology: An abstract character repertoire (ACR) is the full set of abstract characters that a system supports. Unicode has an open repertoire, meaning that new characters will be added to

A well-defined and extensible encoding system, has replaced most earlier character encodings, but the path of code development to the present is fairly well known. The Baudot code, a five-bit encoding, was created by Émile Baudot in 1870, patented in 1874, modified by Donald Murray in 1901, and standardized by CCITT as International Telegraph Alphabet No. 2 (ITA2) in 1930. The name "baudot" has been erroneously applied to ITA2 and its many variants. ITA2 suffered from many shortcomings and

Is backward compatible with fixed-length UCS-2BE and maps Unicode code points to variable-length sequences of 16-bit words. See comparison of Unicode encodings for a detailed discussion. Finally, there may be a higher-level protocol which supplies additional information to select the particular variant of a Unicode character, particularly where there are regional variants that have been 'unified' in Unicode as

Is a portmanteau of binary digit. The bit represents a logical state with one of two possible values. These values are most commonly represented as either "1" or "0", but other representations such as true/false, yes/no, on/off, or +/− are also widely used. The relation between these values and the physical states of the underlying storage or device is a matter of convention, and different assignments may be used even within

Is defined by a CEF. A character encoding scheme (CES) is the mapping of code units to a sequence of octets to facilitate storage on an octet-based file system or transmission over an octet-based network. Simple character encoding schemes include UTF-8, UTF-16BE, UTF-32BE, UTF-16LE, and UTF-32LE; compound character encoding schemes, such as UTF-16, UTF-32 and ISO/IEC 2022, switch between several simple schemes by using
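A brief Python sketch of the simple-versus-compound distinction: the compound UTF-16 scheme prefixes a byte order mark, while UTF-16BE and UTF-16LE fix the octet order and omit it.

```python
s = "A"
print(s.encode("utf-16").hex(" "))     # BOM plus code unit, e.g. 'ff fe 41 00'
                                       # (byte order follows the platform)
print(s.encode("utf-16-be").hex(" "))  # '00 41' -- no BOM, big-endian octets
print(s.encode("utf-16-le").hex(" "))  # '41 00' -- no BOM, little-endian octets
```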

Is defined by the encoding. Thus, the number of code units required to represent a code point depends on the encoding. Exactly what constitutes a character varies between character encodings. For example, for letters with diacritics, there are two distinct approaches that can be taken to encode them: they can be encoded either as a single unified character (known as a precomposed character), or as separate characters that combine into
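The two approaches can be compared directly in Python with the standard unicodedata module; Unicode normalization converts between the precomposed and combining forms.

```python
import unicodedata

precomposed = "\u00e9"      # 'é' as one precomposed character
combining = "e\u0301"       # 'e' followed by U+0301 COMBINING ACUTE ACCENT

print(precomposed == combining)                                 # False
print(unicodedata.normalize("NFC", combining) == precomposed)   # True
print(unicodedata.normalize("NFD", precomposed) == combining)   # True
```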

Is in general no meaning to adding, subtracting or otherwise combining the units mathematically, although one may act as a bound on the other. Units of information used in information theory include the shannon (Sh), the natural unit of information (nat) and the hartley (Hart). One shannon is the maximum amount of information needed to specify the state of one bit of storage. These are related by 1 Sh ≈ 0.693 nat ≈ 0.301 Hart. Some authors also define
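The quoted conversion factors follow directly from the logarithm bases, since 1 Sh = ln 2 nat = log₁₀ 2 Hart; a one-line check in Python:

```python
import math

print(math.log(2))    # 0.693... nat per shannon (natural log)
print(math.log10(2))  # 0.301... hartley per shannon (base-10 log)
```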

Is more compressed—the same bucket can hold more. For example, it is estimated that the combined technological capacity of the world to store information provides 1,300 exabytes of hardware digits. However, when this storage space is filled and the corresponding content is optimally compressed, this only represents 295 exabytes of information. When optimally compressed, the resulting carrying capacity approaches Shannon information or information entropy. Certain bitwise computer processor instructions (such as bit set) operate at

Is preferred, usually in the larger context of locales. IBM's Character Data Representation Architecture (CDRA) designates entities with coded character set identifiers (CCSIDs), each of which is variously called a "charset", "character set", "code page", or "CHARMAP". The code unit size is equivalent to the bit measurement for the particular encoding. A code point is represented by a sequence of code units. The mapping

MCS is registered as IBM code page/CCSID 1100 (Multinational Emulation) since 1992. Depending on the associated sorting, Oracle calls it WE8DEC, N8DEC, DK8DEC, S8DEC, or SF8DEC. Such "extended ASCII" sets were common (the National Replacement Character Set provided sets for more than a dozen European languages), but MCS has the distinction of being the ancestor of ECMA-94 in 1985 and ISO 8859-1 in 1987. The code chart of MCS with ECMA-94, ISO 8859-1 and


Is represented with either one 32-bit value (UTF-32), two 16-bit values (UTF-16), or four 8-bit values (UTF-8). Although each of those forms uses the same total number of bits (32) to represent the glyph, it is not obvious how the actual numeric byte values are related. As a result of having many character encoding methods in use (and the need for backward compatibility with archived data), many computer programs have been developed to translate data between character encoding schemes,
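The three encodings of 𐐀 (U+10400) can be printed directly in Python, making the differing byte values visible:

```python
ch = "\U00010400"                       # 𐐀 DESERET CAPITAL LETTER LONG I
print(ch.encode("utf-32-be").hex(" "))  # '00 01 04 00': one 32-bit value
print(ch.encode("utf-16-be").hex(" "))  # 'd8 01 dc 00': two 16-bit values (surrogates)
print(ch.encode("utf-8").hex(" "))      # 'f0 90 90 80': four 8-bit values
```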

Is the unit byte, coined by Werner Buchholz in June 1956, which historically was used to represent the group of bits used to encode a single character of text (until UTF-8 multibyte encoding took over) in a computer and for this reason it was used as the basic addressable element in many computer architectures. The trend in hardware design converged on the most common implementation of using eight bits per byte, as it

Is widely used today. However, because of the ambiguity of relying on the underlying hardware design, the unit octet was defined to explicitly denote a sequence of eight bits. Computers usually manipulate bits in groups of a fixed size, conventionally named "words". Like the byte, the number of bits in a word also varies with the hardware design, and is typically between 8 and 80 bits, or even more in some specialized computers. In

The World Wide Web is UTF-8, which is used in 98.2% of surveyed web sites, as of May 2024. In application programs and operating system tasks, both UTF-8 and UTF-16 are popular options. The history of character codes illustrates the evolving need for machine-mediated character-based symbolic information over a distance, using once-novel electrical means. The earliest codes were based upon manual and hand-written encoding and cyphering systems, such as Bacon's cipher, Braille, international maritime signal flags, and

The yottabit (Ybit). When the information capacity of a storage system or a communication channel is presented in bits or bits per second, this often refers to binary digits, which is a computer hardware capacity to store binary data (0 or 1, up or down, current or not, etc.). Information capacity of a storage system is only an upper bound to the quantity of information stored therein. If

The 1950s and 1960s, these methods were largely supplanted by magnetic storage devices such as magnetic-core memory, magnetic tapes, drums, and disks, where a bit was represented by the polarity of magnetization of a certain area of a ferromagnetic film, or by a change in polarity from one direction to the other. The same principle was later used in the magnetic bubble memory developed in

The 1980s faced the dilemma that, on the one hand, it seemed necessary to add more bits to accommodate additional characters, but on the other hand, for the users of the relatively small character set of the Latin alphabet (who still constituted the majority of computer users), those additional bits were a colossal waste of then-scarce and expensive computing resources (as they would always be zeroed out for such users). In 1985,

The 1980s, and is still found in various magnetic strip items such as metro tickets and some credit cards. In modern semiconductor memory, such as dynamic random-access memory, the two values of a bit may be represented by two levels of electric charge stored in a capacitor. In certain types of programmable logic arrays and read-only memory, a bit may be represented by the presence or absence of

The 4-digit encoding of Chinese characters for a Chinese telegraph code (Hans Schjellerup, 1869). With the adoption of electrical and electro-mechanical techniques these earliest codes were adapted to the new capabilities and limitations of the early machines. The earliest well-known electrically transmitted character code, Morse code, introduced in the 1840s, used a system of four "symbols" (short signal, long signal, short space, long space) to generate codes of variable length. Though some commercial use of Morse code

The Unicode standard is U+0000 to U+10FFFF, inclusive, divided into 17 planes, identified by the numbers 0 to 16. Characters in the range U+0000 to U+FFFF are in plane 0, called the Basic Multilingual Plane (BMP). This plane contains the most commonly used characters. Characters in the range U+10000 to U+10FFFF in the other planes are called supplementary characters. Consider
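Because each plane holds 0x10000 code points, the plane number is simply the code point shifted right by 16 bits. A minimal sketch (plane is a throwaway helper written for this example):

```python
def plane(codepoint: int) -> int:
    """Plane number of a Unicode code point (each plane spans 0x10000 code points)."""
    assert 0 <= codepoint <= 0x10FFFF
    return codepoint >> 16

print(plane(0x0041))    # 0: Basic Multilingual Plane
print(plane(0x10400))   # 1: a supplementary plane
```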


The average personal computer user's hard disk drive could store only about 10 megabytes, and it cost approximately US$250 on the wholesale market (and much higher if purchased separately at retail), so it was very important at the time to make every bit count. The compromise solution that was eventually found and developed into Unicode was to break the assumption (dating back to telegraph codes) that each character should always directly correspond to

The average. This principle is the basis of data compression technology. Using an analogy, the hardware binary digits refer to the amount of storage space available (like the number of buckets available to store things), and the information content is the filling, which comes in different levels of granularity (fine or coarse, that is, compressed or uncompressed information). When the granularity is finer—when information

The binary digit is either "bit", per the IEC 80000-13:2008 standard, or the lowercase character "b", per the IEEE 1541-2002 standard. Use of the latter may create confusion with the capital "B" which is the international standard symbol for the byte. The encoding of data by discrete bits was used in the punched cards invented by Basile Bouchon and Jean-Baptiste Falcon (1732), developed by Joseph Marie Jacquard (1804), and later adopted by Semyon Korsakov, Charles Babbage, Herman Hollerith, and early computer manufacturers like IBM. A variant of that idea

The early 21st century, retail personal or server computers have a word size of 32 or 64 bits. The International System of Units defines a series of decimal prefixes for multiples of standardized units which are commonly also used with the bit and the byte. The prefixes kilo (10³) through yotta (10²⁴) increment by multiples of one thousand, and the corresponding units are the kilobit (kbit) through

The electrical state of a flip-flop circuit. For devices using positive logic, a digit value of 1 (or a logical value of true) is represented by a more positive voltage relative to the representation of 0. Different logic families require different voltages, and variations are allowed to account for component aging and noise immunity. For example, in transistor–transistor logic (TTL) and compatible circuits, digit values 0 and 1 at

The era had their own character codes, often six-bit, but usually had the ability to read tapes produced on IBM equipment. These BCD encodings were the precursors of IBM's Extended Binary-Coded Decimal Interchange Code (usually abbreviated as EBCDIC), an eight-bit encoding scheme developed in 1963 for the IBM System/360 that featured a larger character set, including lower case letters. In trying to develop universally interchangeable character encodings, researchers in

The first 256 code points of Unicode have many more similarities than differences. In addition to unused code points, there are a number of differences from ISO 8859-1.

Character encoding

Character encoding is the process of assigning numbers to graphical characters, especially the written characters of human language, allowing them to be stored, transmitted, and transformed using computers. The numerical values that make up
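The overlap with ISO 8859-1 noted above is easy to verify in Python: decoding any byte as Latin-1 yields the Unicode code point with the same numeric value.

```python
# Every ISO 8859-1 byte value equals its Unicode code point.
assert all(ord(bytes([b]).decode("latin-1")) == b for b in range(256))
```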

The level of manipulating bits rather than manipulating data interpreted as an aggregate of bits. In the 1980s, when bitmapped computer displays became popular, some computers provided specialized bit block transfer instructions to set or copy the bits that corresponded to a given rectangular area on the screen. In most computers and programming languages, when a bit within a group of bits, such as
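In software, the same set/test/clear operations are expressed with shifts and masks. A minimal Python sketch, numbering bit positions from 0 at the least significant end:

```python
word = 0b0000
word |= 1 << 2                 # bit set: turn on bit 2
print(bin(word))               # 0b100
print(bool(word & (1 << 2)))   # bit test: True
word &= ~(1 << 2)              # bit clear: turn bit 2 off again
print(bin(word))               # 0b0
```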

The most well-known code page suites are "Windows" (based on Windows-1252) and "IBM"/"DOS" (based on code page 437). Despite no longer referring to specific page numbers in a standard, many character encodings are still referred to by their code page number; likewise, the term "code page" is often still used to refer to character encodings in general. The term "code page" is not used in Unix or Linux, where "charmap"

The output of a device are represented by no higher than 0.4 V and no lower than 2.6 V, respectively; while TTL inputs are specified to recognize 0.8 V or below as 0 and 2.2 V or above as 1. Bits are transmitted one at a time in serial transmission, and by a multiple number of bits in parallel transmission. A bitwise operation optionally processes bits one at a time. Data transfer rates are usually measured in decimal SI multiples of


The punched card code then in use only allowed digits, upper-case English letters and a few special characters, six bits were sufficient. These BCD encodings extended existing simple four-bit numeric encoding to include alphabetic and special characters, mapping them easily to punch-card encoding which was already in widespread use. IBM's codes were used primarily with IBM equipment; other computer vendors of

The repertoire over time. A coded character set (CCS) is a function that maps characters to code points (each code point represents one character). For example, in a given repertoire, the capital letter "A" in the Latin alphabet might be represented by the code point 65, the character "B" by 66, and so on. Multiple coded character sets may share the same character repertoire; for example ISO/IEC 8859-1 and IBM code pages 037 and 500 all cover
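Python's built-in ord and chr expose exactly this mapping between characters and code points:

```python
print(ord("A"), ord("B"))    # 65 66
print(chr(65), chr(66))      # A B
print(f"U+{ord('A'):04X}")   # U+0041, the conventional notation
```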

The same character. An example is the XML attribute xml:lang. The Unicode model uses the term "character map" for other systems which directly assign a sequence of characters to a sequence of bytes, covering all of the CCS, CEF and CES layers. In Unicode, a character can be referred to as 'U+' followed by its codepoint value in hexadecimal. The range of valid code points (the codespace) for

The same device or program. It may be physically implemented with a two-state device. A contiguous group of binary digits is commonly called a bit string, a bit vector, or a single-dimensional (or multi-dimensional) bit array. A group of eight bits is called one byte, but historically the size of the byte is not strictly defined. Frequently, half, full, double and quadruple words consist of

The same repertoire but map them to different code points. A character encoding form (CEF) is the mapping of code points to code units to facilitate storage in a system that represents numbers as bit sequences of fixed length (i.e. practically any computer system). For example, a system that stores numeric information in 16-bit units can only directly represent code points 0 to 65,535 in each unit, but larger code points (say, 65,536 to 1.4 million) could be represented by using multiple 16-bit units. This correspondence
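A sketch of this encoding form as UTF-16 defines it: code points beyond 65,535 are split across two 16-bit units (a surrogate pair) by the arithmetic below (utf16_code_units is a helper written for this example).

```python
def utf16_code_units(cp: int) -> list[int]:
    """UTF-16 encoding form: one unit inside the BMP, else a surrogate pair."""
    if cp < 0x10000:
        return [cp]
    cp -= 0x10000
    return [0xD800 | (cp >> 10), 0xDC00 | (cp & 0x3FF)]

print([hex(u) for u in utf16_code_units(0x0041)])   # ['0x41']
print([hex(u) for u in utf16_code_units(0x10400)])  # ['0xd801', '0xdc00']
```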

The same semantic character. Unicode and its parallel standard, the ISO/IEC 10646 Universal Character Set, together constitute a unified standard for character encoding. Rather than mapping characters directly to bytes, Unicode separately defines a coded character set that maps characters to unique natural numbers (code points), how those code points are mapped to a series of fixed-size natural numbers (code units), and finally how those units are encoded as

The solution was to implement variable-length encodings where an escape sequence would signal that subsequent bits should be parsed as a higher code point. Informally, the terms "character encoding", "character map", "character set" and "code page" are often used interchangeably. Historically, the same standard would specify a repertoire of characters and how they were to be encoded into a stream of code units — usually with

The states of electrical relays which could be either "open" or "closed". When relays were replaced by vacuum tubes, starting in the 1940s, computer builders experimented with a variety of storage methods, such as pressure pulses traveling down a mercury delay line, charges stored on the inside surface of a cathode-ray tube, or opaque spots printed on glass discs by photolithographic techniques. In

The symbol for binary digit should be 'bit', and this should be used in all multiples, such as 'kbit', for kilobit. However, the lower-case letter 'b' is widely used as well and was recommended by the IEEE 1541 Standard (2002). In contrast, the upper case letter 'B' is the standard and customary symbol for byte. Multiple bits may be expressed and represented in several ways. For convenience of representing commonly reoccurring groups of bits in information technology, several units of information have traditionally been used. The most common

The two possible values of one bit of storage are not equally likely, that bit of storage contains less than one bit of information. If the value is completely predictable, then the reading of that value provides no information at all (zero entropic bits, because no resolution of uncertainty occurs and therefore no information is available). If a computer file that uses n bits of storage contains only m < n bits of information, then that information can in principle be encoded in about m bits, at least on


The two stable states of a flip-flop, two positions of an electrical switch, two distinct voltage or current levels allowed by a circuit, two distinct levels of light intensity, two directions of magnetization or polarization, the orientation of reversible double stranded DNA, etc. Bits can be implemented in several forms. In most modern computing devices, a bit is usually represented by an electrical voltage or current pulse, or by

The unit bit per second (bit/s), such as kbit/s. In the earliest non-electronic information processing devices, such as Jacquard's loom or Babbage's Analytical Engine, a bit was often stored as the position of a mechanical lever or gear, or the presence or absence of a hole at a specific point of a paper card or tape. The first electrical devices for discrete logic (such as elevator and traffic light control circuits, telephone switches, and Konrad Zuse's computer) represented bits as

The use of a logarithmic measure of information in 1928. Claude E. Shannon first used the word "bit" in his seminal 1948 paper "A Mathematical Theory of Communication". He attributed its origin to John W. Tukey, who had written a Bell Labs memo on 9 January 1947 in which he contracted "binary information digit" to simply "bit". A bit can be stored by a digital device or other physical system that exists in either of two possible distinct states. These may be

Was adopted fairly widely. ASCII67's American-centric nature was somewhat addressed in the European ECMA-6 standard. Herman Hollerith invented punch card data encoding in the late 19th century to analyze census data. Initially, each hole position represented a different data element, but later, numeric information was encoded by numbering the lower rows 0 to 9, with a punch in a column representing its row number. Later alphabetic data

Was encoded by allowing more than one punch per column. Electromechanical tabulating machines represented data internally by the timing of pulses relative to the motion of the cards through the machine. When IBM went to electronic processing, starting with the IBM 603 Electronic Multiplier, it used a variety of binary encoding schemes that were tied to the punch card code. IBM used several Binary Coded Decimal (BCD) six-bit character encoding schemes, starting as early as 1953 in its 702 and 704 computers, and in its later 7000 Series and 1400 series, as well as in associated peripherals. Since

Was often improved by many equipment manufacturers, sometimes creating compatibility issues. In 1959 the U.S. military defined its Fieldata code, a six- or seven-bit code, introduced by the U.S. Army Signal Corps. While Fieldata addressed many of the then-modern issues (e.g. letter and digit codes arranged for machine collation), it fell short of its goals and was short-lived. In 1963 the first ASCII code

Was released (X3.4-1963) by the ASCII committee (which contained at least one member of the Fieldata committee, W. F. Leubbert), which addressed most of the shortcomings of Fieldata, using a simpler code. Many of the changes were subtle, such as collatable character sets within certain numeric ranges. ASCII63 was a success, widely adopted by industry, and with the follow-up issue of the 1967 ASCII code (which added lower-case letters and fixed some "control code" issues) ASCII67

Was the perforated paper tape. In all those systems, the medium (card or tape) conceptually carried an array of hole positions; each position could be either punched through or not, thus carrying one bit of information. The encoding of text by bits was also used in Morse code (1844) and early digital communications machines such as teletypes and stock ticker machines (1870). Ralph Hartley suggested

Was via machinery, it was often used as a manual code, generated by hand on a telegraph key and decipherable by ear, and persists in amateur radio and aeronautical use. Most codes are of fixed per-character length or variable-length sequences of fixed-length codes (e.g. Unicode). Common examples of character encoding systems include Morse code, the Baudot code, the American Standard Code for Information Interchange (ASCII) and Unicode. Unicode,
