Lempel–Ziv–Storer–Szymanski (LZSS) is a lossless data compression algorithm, a derivative of LZ77, that was created in 1982 by James A. Storer and Thomas Szymanski. LZSS was described in the article "Data compression via textual substitution", published in the Journal of the ACM (1982, pp. 928–951).
LZSS is a dictionary coding technique. It attempts to replace a string of symbols with a reference to a dictionary location of the same string. The main difference between LZ77 and LZSS is that in LZ77 the dictionary reference could actually be longer than the string it was replacing. In LZSS, such references are omitted if the length is less than the "break-even" point. Furthermore, LZSS uses one-bit flags to indicate whether the next chunk of data is a literal or an offset/length reference.
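As a rough illustration of that decision, here is a minimal Python sketch of the encoder's token emission, assuming a naive window search, the two-byte break-even point used in the example below, and tokens kept as a simple list rather than a packed bit stream (all names here are illustrative, not from Storer and Szymanski):

def find_longest_match(data, pos, window_size):
    # Naive scan of the sliding window for the longest match starting at pos.
    best_len, best_off = 0, 0
    for cand in range(max(0, pos - window_size), pos):
        n = 0
        while pos + n < len(data) and data[cand + n] == data[pos + n]:
            n += 1
        if n > best_len:
            best_len, best_off = n, pos - cand
    return best_len, best_off

def lzss_encode(data, window_size=4096, break_even=2):
    tokens = []
    pos = 0
    while pos < len(data):
        length, offset = find_longest_match(data, pos, window_size)
        if length >= break_even:
            tokens.append((1, (offset, length)))  # flag 1: offset/length pair
            pos += length
        else:
            tokens.append((0, data[pos]))         # flag 0: literal byte
            pos += 1
    return tokens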
A circular buffer called the "sliding window" holds the last N bytes of data processed. This window serves as the dictionary, effectively storing every substring that has appeared in the past N bytes as dictionary entries. Instead of a single index identifying a dictionary entry, two values are needed: the length, indicating the length of the matched text, and the offset (also called the distance), indicating that the match is found in the sliding window starting offset bytes before the current position.
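A companion sketch of decoding under the same assumed token format; the growing output buffer itself acts as the sliding window, and a byte-at-a-time copy handles matches that overlap the current position:

def lzss_decode(tokens):
    out = bytearray()
    for flag, payload in tokens:
        if flag == 0:
            out.append(payload)        # literal byte
        else:
            offset, length = payload
            for _ in range(length):    # copy from offset bytes back; may overlap
                out.append(out[-offset])
    return bytes(out)

A round trip such as lzss_decode(lzss_encode(b"GREEN EGGS AND HAM")) == b"GREEN EGGS AND HAM" is a quick sanity check of the pair.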
A book in the limited storage space of a PDA generally builds a static dictionary from a concordance of the text and then uses that dictionary to compress the verses. This scheme of using Huffman coding to represent indices into a concordance has been called "Huffword". In a related and more general method, a dictionary is built from redundancy extracted from a data environment (various input streams), which dictionary
A break-even point of 2 bytes (and thus 2-byte offset/length pairs), and one-byte newlines, this text compressed with LZSS becomes 95 bytes long: Note: this does not include the 11 bytes of flags indicating whether the next chunk of text is a pointer or a literal. Adding them, the text becomes 106 bytes long, which is still shorter than the original 177 bytes (about 60% of the original size). Many popular archivers such as ARJ, RAR, ZOO, and LHarc use LZSS rather than LZ77 as
A limited time period rather than over infinite time). A high-level view of the decoding algorithm is shown here: The decoding algorithm works by reading a value from the encoded input and outputting the corresponding string from the dictionary. However, the full dictionary is not needed, only the initial dictionary that contains single-character strings (and that is usually hard-coded in the program, instead of being sent with the encoded data).
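A minimal Python sketch of that decoding loop, assuming the codes are plain integers, the initial alphabet is passed in as a string, and clear/stop codes and variable widths are ignored:

def lzw_decode(codes, alphabet):
    # Start from the hard-coded single-character dictionary only.
    dictionary = {i: ch for i, ch in enumerate(alphabet)}
    w = dictionary[codes[0]]
    result = [w]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        else:
            entry = w + w[0]   # code not yet known: the cScSc case explained below
        result.append(entry)
        # Rebuild the next dictionary entry: previous string + first char of this one.
        dictionary[len(dictionary)] = w + entry[0]
        w = entry
    return "".join(result)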
A match is not found (or if the end of data), then only the dictionary code is output. This creates a potential issue since the decoder output is one step behind the dictionary. Refer to LZW for how this is handled. Enhancements to LZW include handling symbol sizes other than 8 bits and having reserved codes to reset the dictionary and to indicate end of data.

LZW

Lempel–Ziv–Welch (LZW)
A match, it substitutes a reference to the string's position in the data structure. Some dictionary coders use a 'static dictionary', one whose full set of strings is determined before coding begins and does not change during the coding process. This approach is most often used when the message or set of messages to be encoded is fixed and large; for instance, an application that stores the contents of
A sequence until the next character would make a sequence with no code yet in the dictionary. The code for the sequence (without that character) is added to the output, and a new code (for the sequence with that character) is added to the dictionary. The idea was quickly adapted to other situations. In an image based on a color table, for example, the natural character alphabet is the set of color table indexes, and in
A sequence ω until ω + next character is not in the dictionary. Emit the code for ω, and add ω + next character to the dictionary. Start buffering again with the next character. (The string to be encoded is "TOBEORNOTTOBEORTOBEORNOT#".) Using LZW has saved 29 bits out of 125, reducing the message by more than 23%. If the message were longer, then the dictionary words would begin to represent longer and longer sections of text, sending repeated words very compactly. To decode an LZW-compressed archive, one needs to know in advance the initial dictionary used.
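To check that arithmetic: the 25-symbol message at 5 bits per symbol occupies 25 × 5 = 125 bits unencoded. Encoded as in this example, the first 6 codes go out at 5 bits each and the remaining 11 at 6 bits each (after the width change described below), for 30 + 66 = 96 bits, a saving of 29 bits, or 29/125 ≈ 23.2%.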
is a universal lossless data compression algorithm created by Abraham Lempel, Jacob Ziv, and Terry Welch. It was published by Welch in 1984 as an improved implementation of the LZ78 algorithm published by Lempel and Ziv in 1978. The algorithm is simple to implement and has the potential for very high throughput in hardware implementations. It is the algorithm of the Unix file compression utility compress and
is initialized with these 27 values. As the dictionary grows, the codes must grow in width to accommodate the additional entries. A 5-bit code gives 2^5 = 32 possible combinations of bits, so when the 33rd dictionary word is created, the algorithm must switch at that point from 5-bit strings to 6-bit strings (for all code values, including those previously output with only five bits). Note that since
is not in the dictionary. When such a string is found, the index for the string without the last character (i.e., the longest substring that is in the dictionary) is retrieved from the dictionary and sent to output, and the new string (including the last character) is added to the dictionary with the next available code. The last input character is then used as the next starting point to scan for substrings. In this way, successively longer strings are registered in
is the same as the first character of the current string), and updates the dictionary with the new string. The decoder then proceeds to the next input (which was already read in the previous iteration) and processes it as before, and so on until it has exhausted the input stream. If variable-width codes are being used, the encoder and decoder must be careful to change the width at the same points in
is then used statically to compress a further input stream. For example, a dictionary might be built from old English texts and then used to compress a book. More common are methods where the dictionary starts in some predetermined state but the contents change during the encoding process, based on the data that has already been encoded. Both the LZ77 and LZ78 algorithms work on this principle. In LZ77,
is used in the GIF image format. The scenario described by Welch's 1984 paper encodes sequences of 8-bit data as fixed-length 12-bit codes. The codes from 0 to 255 represent 1-character sequences consisting of the corresponding 8-bit character, and the codes 256 through 4095 are created in a dictionary for sequences encountered in the data as it is encoded. At each stage in compression, input bytes are gathered into a sequence until the next character would make a sequence with no code yet in the dictionary.
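A sketch of that fixed 12-bit table setup, assuming byte-string keys (the names here are illustrative, not from Welch's paper):

MAX_CODE = 4095                                    # 12-bit codes span 0..4095
dictionary = {bytes([i]): i for i in range(256)}   # codes 0-255: the single bytes
next_code = 256                                    # codes 256-4095: discovered sequences

def add_entry(dictionary, seq, next_code):
    # Register a newly seen sequence while 12-bit code space remains.
    if next_code <= MAX_CODE:
        dictionary[seq] = next_code
        next_code += 1
    return next_code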
is used to represent a stop code: a code outside the plaintext alphabet that triggers special handling. We arbitrarily assign these the values 1 through 26 for the letters, and 0 for the stop code '#'. (Most flavors of LZW would put the stop code after the data alphabet, but nothing in the basic algorithm requires that. The encoder and decoder only have to agree what value it has.) A computer renders these as strings of bits. Five-bit codes are needed to give sufficient combinations to encompass this set of 27 values. The dictionary
the United States and other countries for LZW and similar algorithms. LZ78 was covered by U.S. patent 4,464,650 by Lempel, Ziv, Cohn, and Eastman, assigned to Sperry Corporation, later Unisys Corporation, filed on August 10, 1981. Two US patents were issued for the LZW algorithm: U.S. patent 4,814,746 by Victor S. Miller and Mark N. Wegman, assigned to IBM, originally filed on June 1, 1983, and U.S. patent 4,558,302 by Welch, assigned to Sperry Corporation, later Unisys Corporation, filed on June 20, 1983. In addition to
the 1980s, many images had small color tables (on the order of 16 colors). For such a reduced alphabet, the full 12-bit codes yielded poor compression unless the image was large, so the idea of a variable-width code was introduced: codes typically start one bit wider than the symbols being encoded, and as each code size is used up, the code width increases by 1 bit, up to some prescribed maximum (typically 12 bits). When
the LZ77-based DEFLATE algorithm, but as of 2008 at least FreeBSD includes both compress and uncompress as a part of the distribution. Several other popular compression utilities also used LZW or closely related methods. LZW became very widely used when it became part of the GIF image format in 1987. It may also (optionally) be used in TIFF and PDF files. (Although LZW is available in Adobe Acrobat software, Acrobat by default uses DEFLATE for most text and color-table-based image data in PDF files.) Various patents have been issued in
the above patents, Welch's 1983 patent also includes citations to several other patents that influenced it, including two 1980 Japanese patents (JP9343880A and JP17790880A) from NEC's Jun Kanatsu, U.S. patent 4,021,782 (1974) from John S. Hoerning, U.S. patent 4,366,551 (1977) from Klaus E. Holtz, and a 1981 German patent (DE19813118676) from Karl Eckhart Heinz. In 1993–94, and again in 1999, Unisys Corporation received widespread condemnation when it attempted to enforce licensing fees for LZW in GIF images. The 1993–1994 Unisys-CompuServe controversy (CompuServe being
the all-zero code 00000 is used, and is labeled "0", the 33rd dictionary entry is labeled 32. (Previously generated output is not affected by the code-width change, but once a 6-bit value is generated in the dictionary, it could conceivably be the next code emitted, so the width for subsequent output shifts to 6 bits to accommodate that.) The initial dictionary, then, consists of the following entries: '#' = 0, 'A' = 1, 'B' = 2, and so on through 'Z' = 26. Buffer input characters in a sequence ω until ω + next character is not in the dictionary.
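One way to track the growing width is a small sketch like the following, using the convention just described (the early-change variant discussed later shifts this boundary by one code):

def code_width(next_code, initial_width=5):
    # Bits needed to emit codes while the next free dictionary slot is next_code.
    width = initial_width
    while (1 << width) <= next_code:   # next_code no longer fits: widen by one bit
        width += 1
    return width

assert code_width(31) == 5   # entries 0-31 still fit in five bits
assert code_width(32) == 6   # creating entry 32 forces six-bit codes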
the clear code. Since the codes emitted typically do not fall on byte boundaries, the encoder and decoder must agree on how codes are packed into bytes. The two common methods are LSB-first ("least significant bit first") and MSB-first ("most significant bit first"). In LSB-first packing, the first code is aligned so that the least significant bit of the code falls in the least significant bit of the first stream byte.
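A minimal sketch of LSB-first packing over (code, width) pairs, the order GIF uses (names are illustrative):

def pack_lsb_first(codes):
    out = bytearray()
    bit_buffer = 0   # pending bits, least significant first
    bit_count = 0
    for code, width in codes:
        bit_buffer |= code << bit_count   # new code lands above the queued bits
        bit_count += width
        while bit_count >= 8:             # flush each completed byte
            out.append(bit_buffer & 0xFF)
            bit_buffer >>= 8
            bit_count -= 8
    if bit_count:                         # zero-pad the final partial byte
        out.append(bit_buffer & 0xFF)
    return bytes(out)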
the code for ω at width p (since that code does not require p + 1 bits), and then increases the code width so that the next code emitted is p + 1 bits wide. The decoder is always one code behind the encoder in building the table, so when it sees the code for ω, it generates an entry for code 2^p − 1. Since this is the point where the encoder increases the code width,
the compression methods for kernel code.

Dictionary coding

A dictionary coder, also sometimes known as a substitution coder, is a class of lossless data compression algorithms which operate by searching for matches between the text to be compressed and a set of strings contained in a data structure (called the 'dictionary') maintained by the encoder. When the encoder finds such
the creator of the GIF format) prompted a Usenet comp.graphics discussion, "Thoughts on a GIF-replacement file format", which in turn fostered an email exchange that eventually culminated in the creation of the patent-unencumbered Portable Network Graphics (PNG) file format in 1995. Unisys's US patent on the LZW algorithm expired on June 20, 2003, 20 years after it had been filed. Patents that had been filed in
the data, the decoder mimics building the table as it sees the resulting codes. It is critical that the encoder and decoder agree on the variety of LZW used: the size of the alphabet, the maximum table size (and code width), whether variable-width encoding is used, initial code size, and whether to use the clear and stop codes (and what values they have). Most formats that employ LZW build this information into
the data. This example has been constructed to give reasonable compression on a very short message. In real text data, repetition is generally less pronounced, so longer input streams are typically necessary before the compression builds up efficiency. The plaintext to be encoded (from an alphabet using only the capital letters) is TOBEORNOTTOBEORTOBEORNOT#. There are 26 symbols in the plaintext alphabet (the capital letters A through Z). #
the decoder must increase the width here as well, at the point where it generates the largest code that fits in p bits. Unfortunately, some early implementations of the encoding algorithm increase the code width and then emit ω at the new width instead of the old width, so that to the decoder it looks like the width changes one code too early. This is called "early change"; it caused so much confusion that Adobe now allows both versions in PDF files, but includes an explicit flag in
the decoder receives a code Z that is not yet in its dictionary? Since the decoder is always just one code behind the encoder, Z can be in the encoder's dictionary only if the encoder just generated it, when emitting the previous code X for χ. Thus Z codes some ω that is χ + ?, and the decoder can determine the unknown character as follows: This situation occurs whenever the encoder encounters input of
the dictionary and available for subsequent encoding as single output values. The algorithm works best on data with repeated patterns, so the initial parts of a message see little compression. As the message grows, however, the compression ratio tends asymptotically to the maximum (i.e., the compression factor or ratio improves on an increasing curve, and not linearly, approaching a theoretical maximum inside
the dictionary and output to the compressed stream. If there is a match, then the working index is updated to the matching index, and nothing is output. LZW is similar to LZ78, but the dictionary is initialized to all possible symbols. The typical implementation works with 8-bit symbols, so the dictionary "codes" for hex 00 to hex FF (decimal 255) are pre-defined. Dictionary entries would be added starting with code value hex 100. Unlike LZ78, if
the encoded data so they don't disagree on boundaries between individual codes in the stream. In the standard version, the encoder increases the width from p to p + 1 when a sequence ω + s is encountered that is not in the table (so that a code must be added for it) but the next available code in the table is 2^p (the first code requiring p + 1 bits). The encoder emits
the encoded data). Instead, the full dictionary is rebuilt during the decoding process the following way: after decoding a value and outputting a string, the decoder concatenates it with the first character of the next decoded string (or the first character of the current string, if the next one can't be decoded; since if the next value is unknown, then it must be the value added to the dictionary in this iteration, and so its first character
the encoder goes ahead and adds it. But what is the missing letter? It is the first letter in the sequence coded by the next code Z that the decoder receives. So the decoder looks up Z, decodes it into the sequence ω and takes the first letter z and tacks it onto the end of χ as the next dictionary entry. This works as long as the codes received are in the decoder's dictionary, so that they can be decoded into sequences. What happens if
the end of data (a "stop code", typically one greater than the clear code). The clear code lets the table be reinitialized after it fills up, which lets the encoding adapt to changing patterns in the input data. Smart encoders can monitor the compression efficiency and clear the table whenever the existing table no longer matches the input well. Since codes are added in a manner determined by
the first stream byte, and if the code has more than 8 bits, the high-order bits left over are aligned with the least significant bits of the next byte; further codes are packed with LSB going into the least significant bit not yet used in the current stream byte, proceeding into further bytes as necessary. MSB-first packing aligns the first code so that its most significant bit falls in the MSB of
the first stream byte, with overflow aligned with the MSB of the next byte; further codes are written with MSB going into the most significant bit not yet used in the current stream byte. GIF files use LSB-first packing order. TIFF files and PDF files use MSB-first packing order. The following example illustrates the LZW algorithm in action, showing the status of the output and the dictionary at every stage, both in encoding and decoding
the first widely used universal data compression method on computers. A large English text file can typically be compressed via LZW to about half its original size. LZW was used in the public-domain program compress, which became a more or less standard utility in Unix systems around 1986. It has since disappeared from many distributions, both because it infringed the LZW patent and because gzip produced better compression ratios using
the form cScSc, where c is a single character, S is a string, and cS is already in the dictionary, but cSc is not. The encoder emits the code for cS, putting a new code for cSc into the dictionary. Next it sees cSc in the input (starting at the second c of cScSc) and emits the new code it just inserted. The argument above shows that whenever the decoder receives a code not in its dictionary, the situation must look like this.
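A concrete instance over the two-letter alphabet {A, B}, with the illustrative code assignment A = 0 and B = 1: encoding ABABABA emits 0 ("A", adding AB = 2), then 1 ("B", adding BA = 3), then 2 ("AB", adding ABA = 4), and finally 4 for the trailing "ABA", a code created only one step earlier. The decoder, on reaching 4, has only entries 0 through 3, so it reconstructs the entry as χ plus the first character of χ, that is "AB" + "A" = "ABA", which is exactly what the encoder registered.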
the format specification or provide explicit fields for them in a compression header for the data. A high-level view of the encoding algorithm is shown here: A dictionary is initialized to contain the single-character strings corresponding to all the possible input characters (and nothing else except the clear and stop codes if they're being used). The algorithm works by scanning through the input string for successively longer substrings until it finds one that is not in the dictionary.
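That loop, as a minimal Python sketch, assuming string input, no clear/stop handling, and codes left as plain integers rather than packed bits:

def lzw_encode(text, alphabet):
    dictionary = {ch: i for i, ch in enumerate(alphabet)}
    w = ""
    out = []
    for c in text:
        if w + c in dictionary:
            w += c                                 # keep extending the match
        else:
            out.append(dictionary[w])              # emit the longest known string
            dictionary[w + c] = len(dictionary)    # register the one-longer string
            w = c                                  # rescan from the last character
    if w:
        out.append(dictionary[w])                  # flush whatever is still buffered
    return out

With the alphabet ordered "#" then "A" through "Z" (matching the code assignment used in the example above), calling lzw_encode("TOBEORNOTTOBEORTOBEORNOT#", "#" + string.ascii_uppercase) after import string yields the 17 codes of this example.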
the header of each LZW-compressed stream to indicate whether early change is being used. Of the graphics file formats that support LZW compression, TIFF uses early change, while GIF and most others don't. When the table is cleared in response to a clear code, both encoder and decoder change the code width after the clear code back to the initial code width, starting with the code immediately following
the initial dictionary used, but additional entries can be reconstructed as they are always simply concatenations of previous entries. At each stage, the decoder receives a code X; it looks X up in the table and outputs the sequence χ it codes, and it conjectures χ + ? as the entry the encoder just added – because the encoder emitted X for χ precisely because χ + ? was not in the table, and
the match is found in the sliding window starting offset bytes before the current text. LZ78 uses a more explicit dictionary structure; at the beginning of the encoding process, the dictionary is empty. An index value of zero is used to represent the end of a string, so the first index of the dictionary is one. At each step of the encoding process, if there is no match, then the last matching index (or zero) and character are both added to the dictionary and output to the compressed stream.
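A minimal sketch of that LZ78 step, assuming the output is a list of (index, character) pairs:

def lz78_encode(text):
    dictionary = {}   # string -> index; empty at the start, first entry gets index 1
    out = []
    w = ""
    for c in text:
        if w + c in dictionary:
            w += c                                    # match: just extend, emit nothing
        else:
            out.append((dictionary.get(w, 0), c))     # index 0 means "no prefix"
            dictionary[w + c] = len(dictionary) + 1   # register the new string
            w = ""
    if w:
        out.append((dictionary[w], ""))               # flush a trailing match
    return out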
the maximum code value is reached, encoding proceeds using the existing table, but new codes are not generated for addition to the table. Further refinements include reserving a code to indicate that the code table should be cleared and restored to its initial state (a "clear code", typically the first value immediately after the values for the individual alphabet characters), and a code to indicate
the next chunk of data is a literal (byte) or a reference to an offset/length pair. Here is the beginning of Dr. Seuss's Green Eggs and Ham, with character numbers at the beginning of lines for convenience. Green Eggs and Ham is a good example to illustrate LZSS compression because the book itself only contains 50 unique words, despite having a word count of 170. Thus words are repeated, though not in succession. This text takes 177 bytes in uncompressed form. Assuming
the primary compression algorithm; the encoding of literal characters and of length-distance pairs varies, with the most common option being Huffman coding. Most implementations stem from public-domain 1989 code by Haruhiko Okumura. Version 4 of the Allegro library can encode and decode an LZSS format, but the feature was cut from version 5. The Game Boy Advance BIOS can decode a slightly modified LZSS format. Apple's macOS uses LZSS as one of
the sequence of output symbols. Some package the coded stream as printable characters using some form of binary-to-text encoding; this increases the encoded length and decreases the compression rate. Conversely, increased compression can often be achieved with an adaptive entropy encoder. Such a coder estimates the probability distribution for the value of the next symbol, based on the observed frequencies of values so far. A standard entropy encoding such as Huffman coding or arithmetic coding then uses shorter codes for values with higher probabilities. LZW compression became
the situation must look like this. Although input of form cScSc might seem unlikely, this pattern is fairly common when the input stream is characterized by significant repetition. In particular, long strings of a single character (which are common in the kinds of images LZW is often used to encode) repeatedly generate patterns of this sort. The simple scheme described above focuses on the LZW algorithm itself. Many applications apply further encoding to