Misplaced Pages

Shift JIS

Article snapshot taken from Wikipedia with creative commons attribution-sharealike license. Give it a read and then ask your questions in the chat. We can research this topic together.

Shift JIS (also SJIS, MIME name Shift_JIS, known as PCK in Solaris contexts) is a character encoding for the Japanese language, originally developed by the Japanese company ASCII Corporation in conjunction with Microsoft and standardized as JIS X 0208 Appendix 1.

Shift JIS is based on character sets defined within JIS standards JIS X 0201:1997 (for the single-byte characters) and JIS X 0208:1997 (for the double-byte characters). As of November 2024, 0.1% of surveyed web pages used Shift JIS (actually decoded as its superset, the Windows-31J encoding), a decline from 1.3% in July 2014. Shift JIS is the third-most declared character encoding for Japanese websites, used by 2.1% of sites in

116-563: A Yen sign for JIS X 0201 compatibility. It includes several extensions, namely " NEC special characters (Row 13), NEC selection of IBM extensions (Rows 89 to 92), and IBM extensions (Rows 115 to 119)", in addition to setting some encoding space aside for end user definition . Windows codepage 932 is the version used in the W3C / WHATWG encoding standard used by HTML5 , which includes the "formerly proprietary extensions from IBM and NEC" from Windows-31J in its table for JIS X 0208, and also treats

a closing parenthesis followed by the eos id followed by a quote), and finally, to terminate the string, a closing parenthesis, the eos id, and a quote are required. The simplest case of such a literal has empty content and an empty eos id: R"()" . The eos id may itself contain quotes: R""(I asked, "Can you hear me?")"" is a valid literal (the eos id is " here). Escape sequences don't work in raw string literals. D supports

232-409: A double quote itself by the literal """ as the second quote is interpreted as the end of the string literal, not as the value of the string, and similarly one cannot write "This is "in quotes", but invalid." as the middle quoted portion is instead interpreted as outside of quotes. There are various solutions, the most general-purpose of which is using escape sequences, such as "\"" or "This

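a double-byte JIS X 0208 sequence $j_{1}j_{2}$, the transformation to the corresponding Shift JIS bytes $s_{1}s_{2}$ is:

$$s_{1} = \begin{cases} \left\lfloor \frac{j_{1}+1}{2} \right\rfloor + 112 & \text{if } 33 \leq j_{1} \leq 94 \\ \left\lfloor \frac{j_{1}+1}{2} \right\rfloor + 176 & \text{if } 95 \leq j_{1} \leq 126 \end{cases}$$

$$s_{2} = \begin{cases} j_{2} + 31 + \left\lfloor \frac{j_{2}}{96} \right\rfloor & \text{if } j_{1} \text{ is odd} \\ j_{2} + 126 & \text{if } j_{1} \text{ is even} \end{cases}$$

The competing 8-bit format EUC-JP, which does not support single-byte halfwidth katakana, allows for a cleaner and more direct conversion to and from JIS X 0208 code points, as all high-bit-set bytes are parts of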
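The transformation from a JIS X 0208 byte pair to Shift JIS bytes can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name `jis_to_sjis` is hypothetical, and the arithmetic assumes the standard mapping in which each Shift JIS lead byte covers two JIS X 0208 rows.

```python
def jis_to_sjis(j1: int, j2: int) -> bytes:
    """Convert one JIS X 0208 byte pair (each in 0x21-0x7E) to Shift JIS."""
    if not (0x21 <= j1 <= 0x7E and 0x21 <= j2 <= 0x7E):
        raise ValueError("JIS X 0208 bytes must be in 0x21-0x7E")
    # Two JIS rows share one lead byte; the larger offset for rows >= 95
    # jumps over the halfwidth katakana range 0xA1-0xDF.
    s1 = (j1 + 1) // 2 + (112 if j1 <= 94 else 176)
    if j1 % 2:
        # Odd row: trail bytes 0x40-0x9E, stepping over 0x7F.
        s2 = j2 + 31 + j2 // 96
    else:
        # Even row: trail bytes 0x9F-0xFC.
        s2 = j2 + 126
    return bytes([s1, s2])


# JIS 0x2422 is hiragana "あ"; JIS 0x253D is katakana "ソ".
print(jis_to_sjis(0x24, 0x22).hex())  # 82a0
print(jis_to_sjis(0x25, 0x3D).hex())  # 835c
```

Because EUC-JP encodes JIS X 0208 by simply setting the high bit of both bytes, the result can be cross-checked against Python's built-in `euc_jp` and `shift_jis` codecs.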

348-656: A double-byte character and all codes from ASCII range represent single-byte characters. HTML written in Shift JIS can still be interpreted to some extent when incorrectly tagged as ASCII, and when the charset tag is in the top of the document itself, since the important start and end of HTML tags and fields ( < , > , / , " , & , ; ) are encoded as the same bytes as in ASCII, and those bytes do not appear in two-byte sequences. Shift JIS can be used in string literals in programming languages such as C , but

406-402: A few quoting delimiters, with such strings starting with q" plus an opening delimiter and ending with the respective closing delimiter and " . Available delimiter pairs are () , <> , {} , and [] ; an unpaired non-identifier delimiter is its own closing delimiter. The paired delimiters nest, so that q"(A pair "()" of parens in quotes)" is a valid literal; an example with

464-493: A few things must be taken into consideration. Firstly, that the escape character 0x5C, normally backslash , is the half-width yen sign (¥) in Shift JIS. If the programmer is aware of this, it would be possible to use printf("ハローワールド¥n"); (where ハローワールド is Hello, world and ¥n is an escape sequence), assuming the I/O system supports Shift JIS output. Secondly, the 0x5C byte will cause problems when it appears as second byte of

522-412: A general technique for representing characters that are otherwise difficult to represent directly, including delimiters, nonprinting characters (such as backspaces), newlines, and whitespace characters (which are otherwise impossible to distinguish visually), and have a long history. They are accordingly widely used in string literals, and adding an escape sequence (either to a single character or throughout

580-531: A limited form of multiple quoting, particularly to allow nesting of long comments or embedded strings. Normally one uses [[ and ]] to delimit literal strings (initial newline stripped, otherwise raw), but the opening brackets can include any number of equal signs, and only closing brackets with the same number of signs close the string. For example: Multiple quoting is particularly useful with regular expressions that contain usual delimiters such as quotes, as this avoids needing to escape them. An early example

a previous version of the standard) or Shift_JIS-2004. It is a superset of standard Shift JIS. In order to represent the allocated rows on both planes of JIS X 0213, Shift_JIS-2004 uses the following method of mapping codepoints. In the above, $s_{1}s_{2}$ is a two-byte Shift_JIS-2004 sequence, $m$

696-462: A similar facility is available via sprintf and the %c "character" format specifier, though in the presence of other workarounds this is generally not used: These constructor functions can also be used to represent nonprinting characters, though escape sequences are generally used instead. A similar technique can be used in C++ with the std::string stringification operator. Escape sequences are

754-465: A string literal would be used in other languages, and is often preferred to C-style strings for its greater flexibility and safety. But it comes with a performance penalty for string literals, as std::string usually allocates memory dynamically, and must copy the C-style string literal to it at run time. Before C++11, there was no literal for C++ strings (C++11 allows "this is a C++ string"s with

812-486: A string) is known as escaping . One character is chosen as a prefix to give encodings for characters that are difficult or impossible to include directly. Most commonly this is backslash ; in addition to other characters, a key point is that backslash itself can be encoded as a double backslash \\ and for delimited strings the delimiter itself can be encoded by escaping, say by \" for ". A regular expression for such escaped strings can be given as follows, as found in

870-408: A string. There are many alternate notations for specifying string literals especially in complicated cases. The exact notation depends on the programming language in question. Nevertheless, there are general guidelines that most modern programming languages follow. Most modern programming languages use bracket delimiters (also balanced delimiters ) to specify string literals. Double quotations are

928-490: A two-byte character, because it will be interpreted as an escape sequence, which will mess up the interpretation, unless followed by another 0x5C. Many different versions of Shift JIS exist. There are two areas for expansion: Firstly, JIS X 0208 does not fill the whole 94×94 space encoded for it in Shift JIS, therefore there is room for more characters here—these are really extensions to JIS X 0208 rather than to Shift JIS itself. Secondly, Shift JIS has more encoding space than

986-400: Is sed , where in the substitution command s/ regex / replacement / the default slash / delimiters can be replaced by another character, as in s, regex , replacement , . Another option, which is rarely used in modern languages, is to use a function to construct a string, rather than representing it via a literal. This is generally not used in modern languages because the computation

1044-507: Is \"in quotes\" and properly escaped." , but there are many other solutions. Paired quotes, such as braces in Tcl, allow nested strings, such as {foo {bar} zork} but do not otherwise solve the problem of delimiter collision, since an unbalanced closing delimiter cannot simply be included, as in {}} . A number of languages, including Pascal , BASIC , DCL , Smalltalk , SQL , J , and Fortran , avoid delimiter collision by doubling up on

1102-417: Is a literal for a string value in the source code of a computer program. Modern programming languages commonly use a quoted sequence of characters, formally "bracketed delimiters", as in x = "foo" , where , "foo" is a string literal with value foo . Methods such as escape sequences can be used to avoid the problem of delimiter collision (issues with brackets) and allow the delimiters to be embedded in

1160-564: Is a special editor which encodes Shift JIS this way. The chart below gives the detailed meaning of each byte in a stream encoded in standard Shift JIS (conforming to JIS X 0208:1997 ). Some of the bytes which are not used for single-byte codes or initial bytes in JIS X 0208:1997 are used by certain extensions, resulting in the layout detailed in the chart below. Japanese Industrial Standards Japanese Industrial Standards ( JIS ) ( 日本産業規格 , Nihon Sangyō Kikaku , formerly 日本工業規格 Nihon Kōgyō Kikaku until June 30, 2019) are

1218-577: Is a two-byte JIS sequence referencing a given plane. The same set of characters can be represented by EUC-JIS-2004 , the EUC-JP based counterpart. Some of the additions collide with popular Shift JIS extensions, including Windows codepage 932 which is used in web standards (see above ). For example, compare plane 1 row 89 in JIS X 0213 (beginning 硃, 硎, 硏...) to row 89 in the JIS X 0208 variant defined in web standards (beginning 纊, 褜, 鍈...). In addition, some of

1276-473: Is done at run time, rather than at parse time. For example, early forms of BASIC did not include escape sequences or any other workarounds listed here, and thus one instead was required to use the CHR$ function, which returns a string containing the character corresponding to its argument. In ASCII the quotation mark has the value 34, so to represent a string with quotes on an ASCII system one would write In C,

is much scope for confusion if the extensions are used. One variant must be used when embedding Shift JIS in source code strings of C and similar programming languages. This variant doubles the byte 0x5C if it appears as the second byte of a two-byte character, but not if it appears as a single "¥" (ASCII: "\") character, because 0x5C is the beginning of an escape sequence. The best way of handling this
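The doubling rule for this source-code variant can be sketched as below. It is only an illustration under stated assumptions: the function name `double_trailing_0x5c` is hypothetical, the input is assumed to be well-formed Shift JIS, and the lead-byte ranges are the standard 0x81 to 0x9F and 0xE0 to 0xEF (extensions with further lead bytes are not handled).

```python
def double_trailing_0x5c(data: bytes) -> bytes:
    """Double 0x5C only where it is the second byte of a two-byte character."""
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        is_lead = 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xEF
        if is_lead and i + 1 < len(data):
            trail = data[i + 1]
            out += bytes([b, trail])
            if trail == 0x5C:
                # The compiler would read this byte as an escape character,
                # so emit it twice to get a single 0x5C back.
                out.append(0x5C)
            i += 2
        else:
            # Single-byte character, including a real escape such as \n.
            out.append(b)
            i += 1
    return bytes(out)


# "ソ" encodes as 0x83 0x5C; the following 0x5C 0x6E is a genuine "\n" escape.
source = "ソ".encode("shift_jis") + b"\\n"
print(double_trailing_0x5c(source).hex())  # 835c5c5c6e
```

Only the 0x5C inside "ソ" is doubled; the single-byte escape character that follows is left as written.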

is needed for JIS X 0201 and JIS X 0208 (see § Shift JIS byte map below), and this space can be, and is, used for yet more characters (as either single-byte or double-byte characters). The most popular extension is Windows code page 932 (a CCSID also used for IBM's extension to Shift JIS), which is registered with the IANA as "Windows-31J", separately from Shift JIS. This was popularized by Microsoft, although Microsoft itself does not recognize

1450-539: Is the paired double quotations which can be used in Visual Basic .NET ). Unpaired marks are preferred for compatibility, as they are easier to type on a wide range of keyboards, and so even in languages where they are permitted, many projects forbid their use for source code. String literals might be ended by newlines. One example is MediaWiki template parameters. There might be special syntax for multi-line strings. In YAML , string literals may be specified by

is the plane (面, men, surface) number (1 or 2), $k$ is the row (区, ku, ward) number (1-94) and $t$ is the cell (点, ten, point) number (1-94). The ku and ten numbers are equivalent to $j_{1}-32$ and $j_{2}-32$ respectively, where $j_{1}j_{2}$

is the start of the escape sequence". Every escape sequence specifies one character which is to be placed directly into the string. The actual number of characters required in an escape sequence varies. The escape character is on the top/left of the keyboard, but the editor will translate it, so it cannot be typed directly into a string. The backslash is used to represent the escape character in

1624-487: Is the use of multiple quoting , which allows the author to choose which characters should specify the bounds of a string literal. For example, in Perl : all produce the desired result. Although this notation is more flexible, few languages support it; other than Perl, Ruby (influenced by Perl) and C++11 also support these. A variant of multiple quoting is the use of here document -style strings. Lua (as of 5.1) provides

1682-532: Is used as an opener and a closer), which is a hangover from the typewriter technology which was the precursor of the earliest computer input and output devices. In terms of regular expressions , a basic quoted string literal is given as: This means that a string literal is written as: a quote, followed by zero, one, or more non-quote characters, followed by a quote . In practice this is often complicated by escaping, other delimiters, and excluding newlines. A number of languages provide for paired delimiters, where

1740-411: The s at the end of the literal), so the normal constructor syntax was used, for example: all of which have the same interpretation. Since C++11, there is also new constructor syntax: When using quoting, if one wishes to represent the delimiter itself in a string literal, one runs into the problem of delimiter collision . For example, if the delimiter is a double quote, one cannot simply represent

1798-479: The ANSI C specification: meaning "a quote; followed by zero or more of either an escaped character (backslash followed by something, possibly backslash or quote), or a non-escape, non-quote character; ending in a quote" – the only issue is distinguishing the terminating quote from a quote preceded by a backslash, which may itself be escaped. Multiple characters can follow the backslash, such as \uFFFF , depending on
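The difference between a basic quoted-string pattern and one that honors backslash escapes can be illustrated with Python's `re` module. The two patterns below are assumptions chosen to match the verbal descriptions in the text ("a quote, zero or more non-quote characters, a quote" and "a quote, zero or more escaped or plain characters, a quote"); they are not quoted from any specification.

```python
import re

# Basic form: quote, any run of non-quote characters, quote.
BASIC = re.compile(r'"[^"]*"')

# Escape-aware form: each item is either a backslash plus one character,
# or a single character that is neither a backslash nor a quote.
ESCAPED = re.compile(r'"(\\.|[^\\"])*"')

src = r'x = "say \"hi\" now";'

# The basic pattern stops at the first quote, even though it is escaped.
print(BASIC.search(src).group())    # "say \"
# The escape-aware pattern runs to the real terminating quote.
print(ESCAPED.search(src).group())  # "say \"hi\" now"
```

A matched literal would still need a second pass to convert each escape sequence into the character it stands for.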

1856-529: The backslash to 0x80 (corresponding to 0x5C in US-ASCII), the non-breaking space to 0xA0, the copyright sign to 0xFD, the trademark symbol to 0xFE and the half-width horizontal ellipsis to 0xFF. It also added extended double byte characters; including 53 vertical presentation forms in the Shift_JIS range 0xEB41–0xED96, at 84 JIS rows down from their canonical forms, and 260 special characters in

1914-921: The standards used for industrial activities in Japan , coordinated by the Japanese Industrial Standards Committee (JISC) and published by the Japanese Standards Association (JSA). The JISC is composed of many nationwide committees and plays a vital role in standardizing activities across Japan. In the Meiji era , private enterprises were responsible for making standards, although the Japanese government too had standards and specification documents for procurement purposes for certain articles, such as munitions. These were summarized to form an official standard,

the .jp domain, while UTF-8 is used by 98% of Japanese websites. Shift JIS is also sometimes used in QR codes (a Japanese invention that also allows UTF-8, which may be the preferred choice). Shift JIS is an extension of the single-byte encoding JIS X 0201:1997 that uses unassigned code points in JIS X 0201 to encode the double-byte JIS X 0208:1997 character set. The lead bytes for

2030-545: The Japanese Engineering Standard, in 1921. During World War II , simplified standards were established to increase matériel output. The present Japanese Standards Association was established in 1946, a year after Japan's defeat in World War II. The Japanese Industrial Standards Committee regulations were promulgated in 1946, and new standards were formed. The Industrial Standardization Law

2088-618: The Shift_JIS range 0x8540–0x886D. This variant was introduced in KanjiTalk version 7. However, certain Mac OS typefaces used other variants. Sai Mincho and Chu Gothic use a " PostScript " variant of MacJapanese, which included additional vertical presentation forms and a different set of extended special characters, based on the NEC special characters , some of which were only available in

2146-650: The Windows-31J name and instead calls that variation "shift_jis". IBM's code page 943 includes the same double-byte codes as Microsoft's code page 932, while IBM's code page 932 includes fewer extensions (excluding those which Microsoft incorporates from NEC), and retains the character order from the 1978 edition of JIS X 0208, rather than implementing the character variant swaps from the 1983 standard. Windows-31J assigns 0x5C to U+005C REVERSE SOLIDUS (the backslash ), and 0x7E to U+007E TILDE , following US-ASCII . However, most localised fonts on Windows display U+005C as

2204-427: The character that normally terminates the string constant, plus there must be some way to specify the escape character itself. Escape sequences are not always pretty or easy to use, so many compilers also offer other means of solving the common problems. Escape sequences, however, solve every delimiter problem and most compilers interpret escape sequences. When an escape character is inside a string literal, it means "this

2262-620: The characters map to Unicode characters beyond the BMP. The space with lead bytes 0xF5 to 0xF9 (beyond the region used for JIS X 0208) is used by Japanese mobile phone operators for pictographs for use in E-mail . KDDI goes further and defines hundreds more in the space with lead bytes 0xF3 and 0xF4. Beyond even this, there have been numerous minor variations made on Shift JIS, with individual characters here and there altered. Most of these extensions and variants have no IANA registration, so there

2320-504: The double-byte characters are "shifted" around the 64 halfwidth katakana characters in the single-byte range 0xA1 to 0xDF . The single-byte characters 0x 00 to 0x7F match the ASCII encoding, except for a yen sign (U+00A5) at 0x5C and an overline (U+203E) at 0x7E in place of the ASCII character set's backslash and tilde respectively (these deviations from ASCII align with JIS X 0201 ). The single-byte characters from 0xA1 to 0xDF map to

the ending delimiter. Tcl allows both quotes (for interpolated strings) and braces (for raw strings), as in "The quick brown fox" or {The quick {brown fox}} ; this derives from the single quotations in Unix shells and the use of braces in C for compound statements, since blocks of code are syntactically the same thing as string literals in Tcl; that the delimiters are paired is essential for making this feasible. The Unicode character set includes paired (separate opening and closing) versions of both single and double quotations: These, however, are rarely used, as many programming languages will not recognize them (one exception

2436-402: The escaping scheme. An escaped string must then itself be lexically analyzed , converting the escaped string into the unescaped string that it represents. This is done during the evaluation phase of the overall lexing of the computer language: the evaluator of the lexer of the overall language executes its own lexer for escaped string literals. Among other things, it must be possible to encode

2494-424: The first byte of two-byte characters will be high-bit-set (0x80–0xFF); the value of the second byte can be either high or low. The appearance of byte values 0x40–0x7E as second bytes of code words makes reliable Shift JIS detection difficult, because the same codes are used for ASCII characters. Since the same byte value can be either first or second byte, string searches are difficult, since simple searches can match

2552-408: The following two lines of Perl are equivalent: In the original FORTRAN programming language (for example), string literals were written in so-called Hollerith notation , where a decimal count of the number of characters was followed by the letter H, and then the characters of the string: This declarative notation style is contrasted with bracketed delimiter quoting, because it does not require

the half-width katakana characters found in JIS X 0201. For double-byte characters, the first byte is always in the range 0x81 to 0x9F or the range 0xE0 to 0xEF (these ranges are unassigned in JIS X 0201). Each first byte covers two JIS X 0208 rows: if the row is odd, the second byte is in the range 0x40 to 0x9E (but cannot be 0x7F); if the row is even, the second byte is in the range 0x9F to 0xFC. Shift JIS only guarantees that
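The byte ranges described above are enough to split a Shift JIS byte stream into characters. The following is a rough sketch, not a full decoder: the names `is_lead_byte`, `is_trail_byte`, and `split_sjis` are hypothetical, and only the standard ranges are accepted (vendor extensions with lead bytes above 0xEF are rejected).

```python
def is_lead_byte(b: int) -> bool:
    # First byte of a double-byte character.
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xEF


def is_trail_byte(b: int) -> bool:
    # Second byte: 0x40-0x9E (odd row, never 0x7F) or 0x9F-0xFC (even row).
    return 0x40 <= b <= 0xFC and b != 0x7F


def split_sjis(data: bytes) -> list[bytes]:
    """Split Shift JIS bytes into one-byte and two-byte units."""
    units = []
    i = 0
    while i < len(data):
        b = data[i]
        if is_lead_byte(b):
            if i + 1 >= len(data) or not is_trail_byte(data[i + 1]):
                raise ValueError(f"invalid or missing trail byte at offset {i + 1}")
            units.append(data[i:i + 2])
            i += 2
        elif b <= 0x7F or 0xA1 <= b <= 0xDF:
            # ASCII-range byte or single-byte halfwidth katakana.
            units.append(data[i:i + 1])
            i += 1
        else:
            raise ValueError(f"byte 0x{b:02X} at offset {i} is unassigned")
    return units


# Halfwidth katakana, an ASCII letter, and a kanji whose trail byte is 0x5C.
print([u.hex() for u in split_sjis("ｱA表".encode("shift_jis"))])
# ['b1', '41', '955c']
```

Note that the kanji's second byte, 0x5C, lies in the ASCII range, which is why a decoder must track lead bytes rather than inspect bytes in isolation.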

2668-514: The label "shift_jis" interchangeably with "windows-31j" with the intent of being "compatible with deployed content". The version of Shift-JIS originating from the classic Mac OS (known as x-mac-japanese , Code page 10001 or MacJapanese) assigned the tilde to 0x7E (following US-ASCII , not JIS X 0201 which assigns the overline here), but the Yen sign to 0x5C (as in JIS X 0201 and standard Shift JIS ). It also extended JIS X 0201 by assigning

2726-446: The most common quoting delimiters used: An empty string is literally written by a pair of quotes with no character at all in between: Some languages either allow or mandate the use of single quotations instead of double quotations (the string must begin and end with the same kind of quotation mark and the type of quotation mark may or may not give slightly different semantics): These quotation marks are unpaired (the same character

2784-525: The non-nesting / character is q"/I asked, "Can you hear me?"/" . Similar to C++11, D allows here-document-style literals with end-of-string ids: In D, the end-of-string-id must be an identifier (alphanumeric characters). In some programming languages, such as sh and Perl , there are different delimiters that are treated differently, such as doing string interpolation or not, and thus care must be taken when choosing which delimiter to use; see different kinds of strings , below. A further extension

2842-416: The opening and closing delimiters are different. These also often allow nested strings, so delimiters can be embedded, so long as they are paired, but still result in delimiter collision for embedding an unpaired closing delimiter. Examples include PostScript , which uses parentheses, as in (The quick (brown fox)) and m4 , which uses the backtick (`) as the starting delimiter, and the apostrophe (') as

2900-431: The other. This does not allow having a single literal with both delimiters in it, however. This can be worked around by using several literals and using string concatenation : Python has string literal concatenation , so consecutive string literals are concatenated even without an operator, so this can be reduced to: C++11 introduced so-called raw string literals . They consist, essentially of that is, after R"
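The dual-quoting workaround and Python's adjacent-literal concatenation mentioned above can be shown in a few lines. The sample sentence is made up for illustration.

```python
# A single literal cannot contain both delimiters unescaped, so the text is
# split into pieces that each use whichever quote the piece does not contain.
with_operator = 'I said, "This is ' + "John's" + ' apple."'

# Adjacent string literals are joined by the parser, so "+" can be dropped.
adjacent = 'I said, "This is ' "John's" ' apple."'

# Both spellings denote the same value as the escaped form.
escaped = "I said, \"This is John's apple.\""
print(with_operator == adjacent == escaped)  # True
```

The adjacent form is resolved when the source is parsed, whereas the `+` form is an ordinary expression.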

2958-589: The printer versions of the fonts. Older versions of Maru Gothic and Hon Mincho from System 7.1 encoded vertical presentation forms at 10 (not 84) JIS rows down from their canonical forms, and did not include the special character extensions, this was subsequently changed. The typical variant used with KanjiTalk version 6 placed the vertical presentation forms 10 rows down, and also used the NEC extension layout for row 13. The newer JIS X 0213 standard defines an extended variant of Shift_JIS referred to as Shift_JISx0213 (in

the programmer can enter up to 16 characters except whitespace characters, parentheses, or backslash, which form the end-of-string-id (its purpose is to be repeated to signal the end of the string, eos id for short), then an opening parenthesis (to denote the end of the eos id) is required. Then follows the actual content of the literal: any sequence of characters may be used (except that it may not contain

3074-425: The quotation marks that are intended to be part of the string literal itself: Some languages, such as Fortran , Modula-2 , JavaScript , Python , and PHP allow more than one quoting delimiter; in the case of two possible delimiters, this is known as dual quoting . Typically, this consists of allowing the programmer to use either single quotations or double quotations interchangeably – each literal must use one or

3132-402: The relative positioning of whitespace and indentation. Some programming languages, such as Perl and PHP, allow string literals without any delimiters in some contexts. In the following Perl program, for example, red , green , and blue are string literals, but are unquoted: Perl treats non-reserved sequences of alphanumeric characters as string literals in most contexts. For example,

3190-492: The second byte of a character and the first byte of the next, which is not a valid Shift JIS character. String-searching algorithms must be tailor-made for Shift JIS . Shift JIS is fully backwards compatible with the JIS X 0201 single-byte encoding , meaning that any valid JIS X 0201 string is also a valid Shift JIS string. Double-byte characters in JIS X 0208 need to be transformed in order to be encoded in Shift JIS. For
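The false match described above, where a search hits the second byte of one character and the first byte of the next, can be reproduced with Python's `shift_jis` codec. The helper `find_aligned` is a hypothetical sketch that assumes well-formed input, a needle made of whole characters, and the standard lead-byte ranges 0x81 to 0x9F and 0xE0 to 0xEF.

```python
def find_aligned(haystack: bytes, needle: bytes) -> int:
    """Return the offset of needle in haystack, testing only character starts."""
    i = 0
    while i <= len(haystack) - len(needle):
        if haystack.startswith(needle, i):
            return i
        lead = haystack[i]
        # Step over a whole character so matches never begin on a trail byte.
        i += 2 if (0x81 <= lead <= 0x9F or 0xE0 <= lead <= 0xEF) else 1
    return -1


text = "モあ".encode("shift_jis")   # bytes 83 82 82 A0
needle = "ｂ".encode("shift_jis")   # bytes 82 82 (fullwidth letter b)

# A plain byte search matches across the boundary between the two characters.
print(text.find(needle))            # 1
# The character-aligned search correctly reports that "ｂ" is absent.
print(find_aligned(text, needle))   # -1
```

Decoding to a Unicode string before searching avoids the problem entirely, at the cost of a conversion pass.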

3248-477: The use of balanced "bracketed" characters on either side of the string. Advantages: Drawbacks: This is however not a drawback when the prefix is generated by an algorithm as is most likely the case. C++ has two styles of string, one inherited from C (delimited by " ), and the safer std::string in the C++ Standard Library. The std::string class is frequently used in the same way

3306-491: Was able to use the new JIS mark. Therefore all JIS-certified Japanese products manufactured since October 1, 2008, have had the new JIS mark. Standards are named in the format "JIS X 0208:1997", where X denotes area division, followed by four digits designating the area (five digits for ISO -corresponding standards), and four final digits designating the revision year. Divisions of JIS and significant standards are: String literal A string literal or anonymous string

3364-409: Was enacted in 1949, which forms the legal foundation for the present Japanese Industrial Standards. The Industrial Standardization Law was revised in 2004 and the JIS product certification mark was changed; since October 1, 2005, the new JIS mark has been used upon re-certification. Use of the old mark was allowed during a three-year transition period ending on September 30, 2008, and every manufacturer
