Misplaced Pages

Base32

Article snapshot taken from Wikipedia with creative commons attribution-sharealike license.

Base32 is an encoding method based on the base-32 numeral system. It uses an alphabet of 32 digits, each of which represents a different combination of 5 bits (2^5 = 32). Since base32 is not very widely adopted, the question of notation (which characters to use to represent the 32 digits) is not as settled as in the case of more well-known numeral systems (such as hexadecimal), though RFCs and unofficial and de-facto standards exist. One way to represent Base32 numbers in human-readable form is using digits 0–9 followed by the twenty-two upper-case letters A–V. However, many other variations are used in different contexts. Historically, Baudot code could be considered a modified (stateful) base32 code.


57-463: This article focuses on the use of Base32 for representing byte strings rather than unsigned integer numbers, similar to the way Base64 works. The October 2006 proposed Internet standard RFC   4648 documents base16 , base32 and base64 encodings. It includes two schemes for base32, but recommends one over the other. It further recommends that regardless of precedent, only the alphabet it defines in its section 6 actually be called base32, and that

114-436: A , and n are stored as the byte values 77, 97, and 110, which are the 8-bit binary values 01001101, 01100001, and 01101110. These three values are joined together into a 24-bit string, producing 010011010110000101101110. Groups of 6 bits (6 bits have a maximum of 2^6 = 64 different binary values) are converted into individual numbers from start to end (in this case, there are four numbers in

171-404: A 24-bit string), which are then converted into their corresponding Base64 character values. As this example illustrates, Base64 encoding converts three octets into four encoded characters. = padding characters might be added to make the last encoded block contain four Base64 characters. Hexadecimal to octal transformation is useful to convert between binary and Base64. Such conversion

228-411: A 64-character alphabet consisting of upper- and lower-case Roman letters ( A – Z , a – z ), the numerals ( 0 – 9 ), and the + and / symbols. The = symbol is also used as a padding suffix. The original specification, RFC   989 , additionally used the * symbol to delimit encoded but unencrypted data within the output stream. To convert data to PEM printable encoding,

285-547: A base larger than 10 (such as 16 or 32) is specified. It also retains hexadecimal's property of preserving bitwise sort order of the represented data, unlike RFC 4648's §6 base32, or base64. Unlike many other base 32 notation systems, base32hex digits beyond 9 are contiguous. However, its set of digits includes characters that may visually conflict. With the right font it is possible to visually distinguish between 0, O and 1, I, but other fonts may be unsuitable, as those letters could be hard for humans to tell apart, especially when

342-420: A compliant decoder, although most implementations use a CR/LF newline pair to delimit encoded lines. Thus, the actual length of MIME-compliant Base64-encoded binary data is usually about 137% of the original data length ( 4 ⁄ 3 × 78 ⁄ 76 ), though for very short messages the overhead can be much higher due to the overhead of the headers. Very roughly, the final size of Base64-encoded binary data
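The 137% figure can be checked with a little arithmetic. The sketch below assumes, as the text suggests, that each 76-character line is followed by a two-byte CR/LF pair and ignores header overhead; the variable names are illustrative only.

```python
from fractions import Fraction

# Base64 emits 4 characters for every 3 input bytes.
encoding_ratio = Fraction(4, 3)

# Assumption: every 76-character line carries 2 extra bytes (CR and LF).
line_break_ratio = Fraction(78, 76)

overhead = encoding_ratio * line_break_ratio
print(overhead, float(overhead))
```

The product simplifies to 26/19, which rounds to 1.37.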

399-505: A more natural way: Its lower half is identical with hexadecimal, and beyond that, base32hex simply continues the alphabet through to the letter V. This scheme was first proposed by Christian Lanctot, a programmer working at Sage software , in a letter to Dr. Dobb's magazine in March 1999 as part of a suggested solution for the Y2K bug . Lanctot referred to it as "Double Hex". The same alphabet

456-461: A multiple of 5 bits. The closely related Base64 system, in contrast, uses a set of 64 symbols (or 65 symbols when padding is used). Base32 implementations in C/C++, Perl, Java, JavaScript, Python, Go and Ruby are available. Base64 In computer programming, Base64 is a group of binary-to-text encoding schemes that transforms binary data into a sequence of printable characters, limited to

513-470: A number of advantages over Base64 : Base32 has advantages over hexadecimal / Base16 : Compared with 8-bit-based encodings, 5-bit systems might also have advantages when used for character transmission: Base32 representation takes roughly 20% more space than Base64 . Also, because it encodes five 8-bit bytes (40 bits) to eight 5-bit base32 characters rather than three 8-bit bytes (24 bits) to four 6-bit base64 characters, padding to an 8-character boundary

570-473: A set of 64 unique characters. More specifically, the source binary data is taken 6 bits at a time, then this group of 6 bits is mapped to one of 64 unique characters. As with all binary-to-text encoding schemes, Base64 is designed to carry data stored in binary formats across channels that only reliably support text content. Base64 is particularly prevalent on the World Wide Web where one of its uses

627-501: A symbol set made up of at least 32 different characters (sometimes a 33rd for padding), as well as an algorithm for encoding arbitrary sequences of 8-bit bytes into a Base32 alphabet. Because more than one 5-bit Base32 character is needed to represent each 8-bit input byte, if the input is not a multiple of 5 bytes (40 bits), then it doesn't fit exactly in 5-bit Base32 characters. In that case, some specifications require padding characters to be added while some require extra zero bits to make
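As a rough illustration of the 40-bit grouping and padding described here, the following sketch encodes bytes by hand. It assumes the RFC 4648 section 6 alphabet (A–Z, 2–7) and "=" padding; the function name `b32encode_sketch` is hypothetical, and the standard library's `base64.b32encode` is used only as a cross-check.

```python
import base64

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"  # assumed: RFC 4648 section 6


def b32encode_sketch(data: bytes) -> str:
    out = []
    for i in range(0, len(data), 5):
        chunk = data[i:i + 5]
        # Left-align the chunk in a 40-bit group, zero-filling missing bytes.
        bits = int.from_bytes(chunk.ljust(5, b"\0"), "big")
        # Characters that carry real data: ceil(8 * len(chunk) / 5).
        n_chars = -(-8 * len(chunk) // 5)
        chars = [ALPHABET[(bits >> (35 - 5 * k)) & 31] for k in range(n_chars)]
        # Pad each group to 8 characters.
        out.append("".join(chars).ljust(8, "="))
    return "".join(out)


print(b32encode_sketch(b"M"))
print(b32encode_sketch(b"Many!"))
```

A single input byte fills only two characters, so six padding characters follow; a five-byte input needs no padding at all.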


684-555: A system include Mario Is Missing! , Mario's Time Machine , Tetris Blast , and The Lord of the Rings (Super NES) . The word-safe Base32 alphabet is an extension of the Open Location Code Base20 alphabet. That alphabet uses 8 numeric digits and 12 case-sensitive letter digits chosen to avoid accidentally forming words. Treating the alphabet as case-sensitive produces a 32 (8+12+12) digit set. Base32 has

741-689: A way that is convenient for inclusion in URLs, including in hidden web form fields, and Base64 is a convenient encoding to render them in a compact way. Using standard Base64 in URL requires encoding of ' + ', ' / ' and ' = ' characters into special percent-encoded hexadecimal sequences (' + ' becomes ' %2B ', ' / ' becomes ' %2F ' and ' = ' becomes ' %3D '), which makes the string unnecessarily longer. For this reason, modified Base64 for URL variants exist (such as base64url in RFC   4648 ), where

798-552: Is a greater burden on short messages (which may be a reason to elide padding, which is an option in RFC 4648). Even if Base32 takes roughly 20% less space than hexadecimal, Base32 is much less used. Hexadecimal can easily be mapped to bytes because two hexadecimal digits make up one byte. Base32 does not map to individual bytes. However, two Base32 digits correspond to ten bits, which can encode (32 × 32 =) 1,024 values, with obvious applications for orders of magnitude of multiple-byte units in terms of powers of 1,024. Hexadecimal

855-603: Is a member of the development team of ZRTP and the BLAKE2 cryptographic hash function. Zooko's triangle is named after Wilcox-O'Hearn, who described the schema that relates three desirable properties of identifiers in 2001. Wilcox-O'Hearn was founder and CEO of Least Authority Enterprises in Boulder, Colorado where he is now an advisor. Zooko was a developer of the MojoNation P2P system and lead developer of

912-590: Is a variant of the Base64 encoding used in MIME. The "Modified Base64" alphabet consists of the MIME Base64 alphabet, but does not use the " = " padding character. UTF-7 is intended for use in mail headers (defined in RFC   2047 ), and the " = " character is reserved in that context as the escape character for "quoted-printable" encoding. Modified Base64 simply omits the padding and ends immediately after

969-494: Is available for both advanced calculators and programming languages. For example, the hexadecimal representation of the 24 bits above is 4D616E. The octal representation is 23260556. Those 8 octal digits can be split into pairs ( 23 26 05 56 ), and each pair is converted to decimal to yield 19 22 05 46 . Using those four decimal numbers as indices for the Base64 alphabet, the corresponding ASCII characters are TWFu . If there are only two significant input octets (e.g., 'Ma'), or when
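The hexadecimal-to-octal route described above can be reproduced in a few lines. This is a minimal sketch; the `B64` table is built here from the standard alphabet and the variable names are illustrative.

```python
import base64
import string

# Standard Base64 alphabet: A-Z, a-z, 0-9, '+', '/'.
B64 = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

bits = int("4D616E", 16)            # the 24 bits of "Man"
octal = format(bits, "08o")         # eight octal digits, 3 bits each
pairs = [octal[i:i + 2] for i in range(0, 8, 2)]
indices = [int(pair, 8) for pair in pairs]   # each pair is one 6-bit value
encoded = "".join(B64[i] for i in indices)

print(octal, pairs, indices, encoded)
```

Each pair of octal digits covers exactly six bits, which is why the pairs line up with Base64 characters.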

1026-473: Is created by Douglas Crockford , who proposes using additional characters for a mod-37 checksum. It excludes the letters I, L, and O to avoid confusion with digits. It also excludes the letter U to reduce the likelihood of accidental obscenity. Libraries to encode binary data in Crockford's Base32 are available in a variety of languages. An earlier form of base 32 notation was used by programmers working on
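A small sketch of Crockford's alphabet applied to non-negative integers follows. The function names are hypothetical, the optional mod-37 check symbol is deliberately left out, and decoding here only upper-cases the input rather than handling every look-alike substitution.

```python
# Digits 0-9 followed by the letters A-Z without I, L, O and U.
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def crockford_encode(n: int) -> str:
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    if n == 0:
        return "0"
    digits = []
    while n:
        n, remainder = divmod(n, 32)
        digits.append(CROCKFORD[remainder])
    return "".join(reversed(digits))


def crockford_decode(text: str) -> int:
    n = 0
    for ch in text.upper():
        n = n * 32 + CROCKFORD.index(ch)
    return n


print(crockford_encode(31), crockford_encode(32), crockford_encode(1234567))
```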

1083-482: Is defined in RFC 4648 §6 and the earlier RFC   3548 (2003). The scheme was originally designed in 2000 by John Myers for SASL / GSSAPI . It uses an alphabet of A – Z , followed by 2 – 7 . The digits 0 , 1 and 8 are skipped due to their similarity with the letters O , I and B (thus "2" has a decimal value of 26 ). In some circumstances padding is not required or used (the padding can be inferred from
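The symbol values can be tabulated directly from the description of the alphabet. In this sketch the lookup table `VALUE_OF` is an illustrative name; `base64.b32encode` from the Python standard library serves as a cross-check.

```python
import base64
import string

# Alphabet as described in the text: A-Z, then 2-7.
ALPHABET = string.ascii_uppercase + "234567"
VALUE_OF = {symbol: value for value, symbol in enumerate(ALPHABET)}

print(VALUE_OF["A"], VALUE_OF["Z"], VALUE_OF["2"], VALUE_OF["7"])

# Five 0xFF bytes form eight all-ones 5-bit groups, i.e. eight copies of symbol 31.
print(base64.b32encode(b"\xff" * 5))
```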

1140-430: Is easier to learn and remember, since that only entails memorising the numerical values of six additional symbols (A–F), and even if those are not instantly recalled, it is easier to count through just over a handful of values. Base32 programs are suitable for encoding arbitrary byte data using a restricted set of symbols that can both be conveniently used by humans and processed by computers. Base32 implementations use

1197-412: Is equal to 1.37 times the original data size + 814 bytes (for headers). The size of the decoded data can be approximated with this formula: UTF-7 , described first in RFC   1642 , which was later superseded by RFC   2152 , introduced a system called modified Base64 . This data encoding scheme is used to encode UTF-16 as ASCII characters for use in 7-bit transports such as SMTP . It


1254-445: Is not a typical use case, as it can already be safely transferred across all systems that can handle Base64. The more typical use is to encode binary data (such as an image); the resulting Base64 data will only contain 64 different ASCII characters, all of which can reliably be transferred across systems that may corrupt the raw source bytes. Here is a well-known idiom from distributed computing : Many hands make light work. When

1311-439: Is not possible, because a single Base64 character only contains 6 bits, and 8 bits are required to create a byte, so a minimum of two Base64 characters are required: The first character contributes 6 bits, and the second character contributes its first 2 bits. For example: Decoding without padding is not performed consistently among decoders. In addition, allowing padless decoding by definition allows multiple strings to decode into

1368-418: Is the ability to embed image files or other binary assets inside textual assets such as HTML and CSS files. Base64 is also widely used for sending e-mail attachments, because SMTP  – in its original form – was designed to transport 7-bit ASCII characters only. Encoding an attachment as Base64 before sending, and then decoding when received, assures older SMTP servers will not interfere with

1425-406: Is then encoded with the same Base64 algorithm and, prefixed by the " = " symbol as the separator, appended to the encoded output data. RFC   3548 , entitled The Base16, Base32, and Base64 Data Encodings , is an informational (non-normative) memo that attempts to unify the RFC   1421 and RFC   2045 specifications of Base64 encodings, alternative-alphabet encodings, and

1482-468: The Electrologica X1 to represent machine addresses. The "digits" were represented as decimal numbers from 0 to 31. For example, 12-16 would represent the machine address 400 (= 12 × 32 + 16). See Geohash algorithm , used to represent latitude and longitude values in one (bit-interlaced) positive integer. The base32 representation of Geohash uses all decimal digits (0–9) and almost all of

1539-424: The ' + ' and ' / ' characters of standard Base64 are respectively replaced by ' - ' and ' _ ', so that using URL encoders/decoders is no longer necessary and has no effect on the length of the encoded value, leaving the same encoded form intact for use in relational databases, web forms, and object identifiers in general. A popular site to make use of such is YouTube . Some variants allow or require omitting
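To see the difference in practice, the sketch below encodes three bytes chosen so that the standard output contains only '+' and '/' characters. It uses Python's `base64` and `urllib.parse` modules; the sample bytes are arbitrary.

```python
import base64
from urllib.parse import quote

raw = bytes([0xFB, 0xFF, 0xFE])   # sextets 62, 63, 63, 62

standard = base64.b64encode(raw).decode("ascii")
url_safe = base64.urlsafe_b64encode(raw).decode("ascii")

# Percent-encoding the standard form triples the length of each special character.
escaped = quote(standard, safe="")

print(standard, escaped, url_safe)
```

The URL-safe form stays four characters long, while the percent-encoded standard form grows to twelve.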

1596-466: The Base32 (which is seldom used) and Base16 encodings. Unless implementations are written to a specification that refers to RFC   3548 and specifically requires otherwise, RFC 3548 forbids implementations from generating messages containing characters outside the encoding alphabet or without padding, and it also declares that decoder implementations must reject data that contain characters outside

1653-807: The Electric Coin Company (ECC), a for-profit company leading the development of Zcash . He is known for the Tahoe Least-Authority File Store (or Tahoe-LAFS), a secure, decentralized, fault-tolerant filesystem released under GPL and the TGPPL licenses. He is the creator of the Transitive Grace Period Public Licence (TGPPL). Wilcox-O'Hearn is the designer of multiple network protocols that incorporate concepts such as self-contained economies and secure reputation systems . He

1710-572: The alphabet so that the easier characters are the ones that occur more frequently. It compactly encodes bitstrings whose length in bits is not a multiple of 8 and omits trailing padding characters. z-base-32 was used in the Mnet open source project, and is currently used in Phil Zimmermann 's ZRTP protocol, and in the Tahoe-LAFS open source project. Another alternative design for Base32

1767-443: The attachment. Base64 encoding causes an overhead of 33–37% relative to the size of the original binary data (33% by the encoding itself; up to 4% more by the inserted line breaks). The particular set of 64 characters chosen to represent the 64-digit values for the base varies between implementations. The general strategy is to choose 64 characters that are common to most encodings and that are also printable. This combination leaves


1824-558: The context English usually provides is not present in a notation system that is only expressing numbers. The choice of font is not controlled by notation or encoding, yet base32hex makes no attempt to compensate for the shortcomings of affected fonts. Changing the Base32 alphabet, all alternative standards have similar combinations of alphanumeric symbols. z-base-32 is a Base32 encoding designed by Zooko Wilcox-O'Hearn to be easier for human use and more compact. It includes 1 , 8 and 9 but excludes l , v , 0 and 2 . It also permutes

1881-478: The data unlikely to be modified in transit through information systems, such as email, that were traditionally not 8-bit clean . For example, MIME 's Base64 implementation uses A – Z , a – z , and 0 – 9 for the first 62 values. Other variations share this property but differ in the symbols chosen for the last two values; an example is UTF-7 . The earliest instances of this type of encoding were created for dial-up communication between systems running

1938-555: The encoding alphabet. RFC   4648 obsoletes RFC   3548 and focuses on Base64/32/16: Base64 encoding can be helpful when fairly lengthy identifying information is used in an HTTP environment. For example, a database persistence framework for Java objects might use Base64 encoding to encode a relatively large unique id (generally 128-bit UUIDs ) into a string for use as an HTTP parameter in HTTP forms or HTTP GET URLs . Also, many applications need to encode binary data in

1995-551: The first = and another 2 trailing bits for the other = . In this instance, we would get 6 bits from the d , and another 6 bits from the w for a bit string of length 12, but since we remove 2 bits for each = (for a total of 4 bits), the dw== ends up producing 8 bits (1 byte) when decoded. Without padding, after normal decoding of four characters to three bytes over and over again, fewer than four encoded characters may remain. In this situation, only two or three characters can remain. A single remaining encoded character
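The "discard two bits per padding character" reading can be written out as a short decoder. This is a sketch, not a hardened implementation: the function name `decode_by_discarding` is hypothetical, input validation is omitted, and unpadded input is assumed to consist of whole four-character groups.

```python
import base64
import string

B64 = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def decode_by_discarding(encoded: str) -> bytes:
    body = encoded.rstrip("=")
    n_pad = len(encoded) - len(body)
    # Each character contributes its 6-bit value.
    bits = "".join(format(B64.index(ch), "06b") for ch in body)
    # Drop 2 trailing bits for every '=' that was present.
    if n_pad:
        bits = bits[:-2 * n_pad]
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


print(decode_by_discarding("dw=="))
print(decode_by_discarding("bGlnaHQgdw=="))
```

For `dw==`, the two characters give 12 bits, four are discarded, and the remaining 8 bits form a single byte.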

2052-445: The first byte is placed in the most significant eight bits of a 24-bit buffer , the next in the middle eight, and the third in the least significant eight bits. If there are fewer than three bytes left to encode (or in total), the remaining buffer bits will be zero. The buffer is then used, six bits at a time, most significant first, as indices into the string: " ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/ ", and

2109-506: The first two Base64 digits (12 bits); the four least significant bits of the last content-bearing 6-bit block will turn out to be zero, and discarded on decoding (along with the succeeding two = padding characters): Because Base64 is a six-bit encoding, and because the decoded values are divided into 8-bit octets, every four characters of Base64-encoded text (4 sextets = 4 × 6 = 24 bits) represents three octets of unencoded text or data (3 octets = 3 × 8 = 24 bits). This means that when

2166-646: The follow-on Mnet network , and a developer at SimpleGeo. Wilcox-O'Hearn worked on the first cryptocurrency, DigiCash , with David Chaum in 1996. He is a member of the founding team of the anonymous cryptocurrency Zcash , which launched in 2016. He currently serves as the CEO of the affiliated Electric Coin Company . Wilcox later commissioned the Rand Corporation to study whether anonymous coins were disproportionately represented in criminal transactions;

2223-516: The four characters will decode to only two bytes, while == indicates that the four characters will decode to only a single byte. For example: Another way to interpret the padding character is to consider it as an instruction to discard 2 trailing bits from the bit string each time a = is encountered. For example, when ` bGlnaHQg dw== ` is decoded, we convert each character (except the trailing occurrences of = ) into their corresponding 6-bit representation, and then discard 2 trailing bits for

2280-450: The indicated character is output. The process is repeated on the remaining data until fewer than four octets remain. If three octets remain, they are processed normally. If fewer than three octets (24 bits) are remaining to encode, the input data is right-padded with zero bits to form an integral multiple of six bits. After encoding the non-padded data, if two octets of the 24-bit buffer are padded-zeros, two = characters are appended to

2337-533: The input of the atob() method. Base64 can be used in a variety of contexts: Some applications use a Base64 alphabet that is significantly different from the alphabets used in the most common Base64 variants (see Variants summary table above). Zooko Wilcox-O'Hearn (born Bryce Wilcox; 13 May 1974 in Phoenix, Arizona), is an American Colorado-based computer security specialist, self-proclaimed cypherpunk, and ex-CEO of


2394-402: The last Base64 digit containing useful bits leaving up to three unused bits in the last Base64 digit. OpenPGP , described in RFC   4880 , describes Radix-64 encoding, also known as " ASCII armor ". Radix-64 is identical to the "Base64" encoding described by MIME, with the addition of an optional 24-bit CRC . The checksum is calculated on the input data before encoding; the checksum

2451-442: The last input group contains only two octets, all 16 bits will be captured in the first three Base64 digits (18 bits); the two least significant bits of the last content-bearing 6-bit block will turn out to be zero, and discarded on decoding (along with the succeeding = padding character): If there is only one significant input octet (e.g., 'M'), or when the last input group contains only one octet, all 8 bits will be captured in

2508-416: The last line, which may contain fewer printable characters. Lines are delimited by whitespace characters according to local (platform-specific) conventions. The MIME (Multipurpose Internet Mail Extensions) specification lists Base64 as one of two binary-to-text encoding schemes (the other being quoted-printable ). MIME's Base64 encoding is based on that of the RFC   1421 version of PEM: it uses

2565-725: The length of the string modulo 8). RFC 4648 states that padding must be used unless the specification of the standard (referring to the RFC) explicitly states otherwise. Excluding padding is useful when using Base32 encoded data in URL tokens or file names where the padding character could pose a problem. This is an example of a Base32 representation using the previously described 32-character set ( IPFS CIDv1 in Base32 upper-case encoding): BAFYBEICZSSCDSBS7FFQZ55ASQDF3SMV6KLCW3GOFSZVWLYARCI47BGF354 "Extended hex" base 32 or base32hex , another scheme for base 32 per RFC 4648 §7 , extends hexadecimal in
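Omitting and restoring padding might look like the following sketch. The helper names are hypothetical; only `base64.b32encode` and `base64.b32decode` come from the Python standard library, and the missing padding is inferred from the string length modulo 8.

```python
import base64


def b32encode_nopad(data: bytes) -> str:
    # Encode normally, then strip the trailing '=' characters.
    return base64.b32encode(data).decode("ascii").rstrip("=")


def b32decode_nopad(text: str) -> bytes:
    # Restore padding so the length is a multiple of 8 again.
    padding = "=" * (-len(text) % 8)
    return base64.b32decode(text + padding)


token = b32encode_nopad(b"hi")
print(token, b32decode_nopad(token))
```

The unpadded token contains only letters and digits, which is what makes it convenient in URL tokens and file names.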

2622-418: The length of the unencoded input is not a multiple of three, the encoded output must have padding added so that its length is a multiple of four. The padding character is = , which indicates that no further bits are needed to fully encode the input. (This is different from A , which means that the remaining bits are all zeros.) The example below illustrates how truncating the input of the above quote changes

2679-521: The lower case alphabet, except letters "a", "i", "l", "o", as shown by the following character map: Before NVRAM became universal, several video games for Nintendo platforms used base 31 numbers for passwords . These systems omit vowels (except Y) to prevent the game from accidentally giving a profane password. Thus, the characters are generally some minor variation of the following set: 0–9, B, C, D, F, G, H, J, K, L, M, N, P, Q, R, S, T, V, W, X, Y, Z, and some punctuation marks. Games known to use such

2736-441: The other similar alphabet in its section 7 instead be called base32hex. Agreement with those recommendations is not universal. Care needs to be taken when using systems that are called base32, as those systems could be base32 per RFC 4648 §6, or per §7 (possibly disregarding that RFC's deprecation of the simpler name for the latter), or they could be yet another encoding variant, see further below. The most widely used base32 alphabet

2793-546: The output padding: The padding character is not essential for decoding, since the number of missing bytes can be inferred from the length of the encoded text. In some implementations, the padding character is mandatory, while for others it is not used. An exception in which padding characters are required is when multiple Base64 encoded files have been concatenated. When decoding Base64 text, four characters are typically converted back to three bytes. The only exceptions are when padding characters exist. A single = indicates that

2850-402: The output; if one octet of the 24-bit buffer is filled with padded-zeros, one = character is appended. This signals the decoder that the zero bits added due to padding should be excluded from the reconstructed data. This also guarantees that the encoded output length is a multiple of 4 bytes. PEM requires that all encoded lines consist of exactly 64 printable characters, with the exception of
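The 24-bit buffer procedure can be sketched as below. The function name `b64encode_sketch` is hypothetical and line wrapping is not handled; the result is compared against the standard library's `base64.b64encode`.

```python
import base64
import string

B64 = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def b64encode_sketch(data: bytes) -> str:
    out = []
    for i in range(0, len(data), 3):
        chunk = data[i:i + 3]
        missing = 3 - len(chunk)
        # Fill the 24-bit buffer, most significant byte first; absent bytes are zero.
        buffer = int.from_bytes(chunk + b"\0" * missing, "big")
        # Read the buffer six bits at a time, most significant first.
        chars = [B64[(buffer >> shift) & 0x3F] for shift in (18, 12, 6, 0)]
        # One '=' per zero-filled byte replaces the trailing characters.
        if missing:
            chars[-missing:] = "=" * missing
        out.append("".join(chars))
    return "".join(out)


for sample in (b"Man", b"Ma", b"M"):
    print(sample, b64encode_sketch(sample))
```

Because every group yields exactly four characters, the output length is always a multiple of 4.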

2907-573: The padding ' = ' signs to avoid them being confused with field separators, or require that any such padding be percent-encoded. Some libraries will encode ' = ' to ' . ', potentially exposing applications to relative path attacks when a folder name is encoded from user data. The atob() and btoa() JavaScript methods, defined in the HTML5 draft specification, provide Base64 encoding and decoding functionality to web pages. The btoa() method outputs padding characters, but these are optional in


2964-509: The quote (without trailing whitespace) is encoded into Base64, it is represented as a byte sequence of 8-bit-padded ASCII characters encoded in MIME 's Base64 scheme as follows (newlines and white spaces may be present anywhere but are to be ignored on decoding): TWFueSBoYW5kcyBtYWtlIGxpZ2h0IHdvcmsu In the above quote, the encoded value of Man is TWFu . Encoded in ASCII, the characters M ,
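The encoded quote can be reproduced with Python's standard `base64` module. This is only a quick check of the example, with the quote stored as ASCII bytes.

```python
import base64

quote = b"Many hands make light work."
encoded = base64.b64encode(quote).decode("ascii")

print(encoded)
print(encoded[:4])   # the first three bytes, "Man", become the first four characters
```

The quote is 27 bytes long, a multiple of three, so the 36-character result needs no padding.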

3021-630: The same OS – for example, uuencode for UNIX and BinHex for the TRS-80 (later adapted for the Macintosh ) – and could therefore make more assumptions about what characters were safe to use. For instance, uuencode uses uppercase letters, digits, and many punctuation characters, but no lowercase. This is the Base64 alphabet defined in RFC 4648 §4 . See also § Variants summary table . The example below uses ASCII text for simplicity, but this

3078-431: The same 64-character alphabet and encoding mechanism as PEM and uses the = symbol for output padding in the same way, as described at RFC 2045. MIME does not specify a fixed length for Base64-encoded lines, but it does specify a maximum line length of 76 characters. Additionally, it specifies that any character outside the standard set of 64 encoding characters (for example, CRLF sequences) must be ignored by
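Python's `base64.encodebytes` gives a feel for the line-length rule. Note one difference from typical mail software: it separates lines with a bare "\n" rather than CR/LF. The sample data is arbitrary.

```python
import base64

data = bytes(range(100))

# encodebytes wraps its output so that no line exceeds 76 characters.
mime_text = base64.encodebytes(data)
lines = mime_text.splitlines()
print([len(line) for line in lines])

# With default settings, b64decode skips characters outside the alphabet,
# so the embedded line breaks do not disturb decoding.
decoded = base64.b64decode(mime_text)
print(decoded == data)
```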

3135-415: The same set of bytes, which can be a security risk. Implementations may have some constraints on the alphabet used for representing some bit patterns. This notably concerns the last two characters used in the alphabet at positions 62 and 63, and the character used for padding (which may be mandatory in some protocols or removed in others). The table below summarizes these known variants and provides links to

3192-590: The subsections below. The first known standardized use of the encoding now called MIME Base64 was in the Privacy-enhanced Electronic Mail (PEM) protocol, proposed by RFC   989 in 1987. PEM defines a "printable encoding" scheme that uses Base64 encoding to transform an arbitrary sequence of octets to a format that can be expressed in short lines of 6-bit characters, as required by transfer protocols such as SMTP . The current version of PEM (specified in RFC   1421 ) uses

3249-581: Was described in 2000 in RFC   2938 under the name "Base-32". RFC 4648, while acknowledging existing use of this version in NSEC3 , refers to it as base32hex and discourages referring to it as only "base32". Since this notation uses digits 0-9 followed by consecutive letters of the alphabet, it matches the digits used by the JavaScript parseInt() function and the Python int() constructor when
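Python's built-in `int()` accepts this digit set when called with base 32. In the sketch below, the helper `to_base32hex` is a hypothetical name for the reverse direction, and it handles non-negative integers only.

```python
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUV"   # 0-9 followed by A-V


def to_base32hex(n: int) -> str:
    if n == 0:
        return "0"
    out = ""
    while n:
        n, remainder = divmod(n, 32)
        out = DIGITS[remainder] + out
    return out


print(int("V", 32), int("10", 32), int("vv", 32))
print(to_base32hex(400), int(to_base32hex(400), 32))
```

`int()` is case-insensitive for the letter digits, so "vv" and "VV" both parse to 1023.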
