PAQ

Article snapshot taken from Wikipedia under the Creative Commons Attribution-ShareAlike license. Give it a read and then ask your questions in the chat. We can research this topic together.

PAQ is a series of lossless data compression archivers that have gone through collaborative development to top rankings on several benchmarks measuring compression ratio (although at the expense of speed and memory usage). Specialized versions of PAQ have won the Hutter Prize and the Calgary Challenge. PAQ is free software distributed under the GNU General Public License.

PAQ uses a context mixing algorithm. Context mixing is related to prediction by partial matching (PPM) in that the compressor is divided into a predictor and an arithmetic coder, but differs in that the next-symbol prediction is computed using a weighted combination of probability estimates from a large number of models conditioned on different contexts. Unlike PPM, a context doesn't need to be contiguous. Most PAQ versions collect next-symbol statistics for
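
To make the predictor/coder split concrete, here is a minimal sketch of the bit-by-bit loop described above. The `models`, `mixer`, and `coder` objects and their methods are hypothetical stand-ins, not PAQ's actual interfaces.

```python
# Minimal sketch (not PAQ's actual code) of the predict/code/update loop.
def compress(data: bytes, models, mixer, coder):
    """Compress one bit at a time: predict, code, then update every component."""
    for byte in data:
        for i in range(7, -1, -1):                    # most significant bit first
            bit = (byte >> i) & 1
            # each model predicts P(next bit = 1) from its own context
            predictions = [m.predict() for m in models]
            p1 = mixer.mix(predictions)               # weighted combination (context mixing)
            coder.encode(bit, p1)                     # arithmetic coding of the actual bit
            mixer.update(bit)                         # all components learn from the real bit
            for m in models:
                m.update(bit)
```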

A 100 MB English and XML data set derived from Wikipedia's source. The PAQ8HP series was forked from PAQ8H. The programs include text preprocessing dictionaries and models tuned specifically to the benchmark. All non-text models were removed. The dictionaries were organized to group syntactically and semantically related words and to group words by common suffix. The former strategy improves compression because related words (which are likely to appear in similar context) can be modeled on

A copy. This has allowed other authors to fork the PAQ compression engine and add new features such as a graphical user interface or better speed (at the expense of compression ratio). Notable PAQ derivatives include:

Context mixing

Context mixing is a type of data compression algorithm in which the next-symbol predictions of two or more statistical models are combined to yield

A list of file compression benchmarks. The following lists the major enhancements to the PAQ algorithm. In addition, there have been a large number of incremental improvements, which are omitted. The series PAQ8HP1 through PAQ8HP8 were released by Alexander Ratushnyak from August 21, 2006, through January 18, 2007, as Hutter Prize submissions. The Hutter Prize is a text compression contest using

A prediction (instead of a pair of counts). These predictions are averaged in the logistic domain:

x_i = stretch(P_i(1))
P(1) = squash(Σ_i w_i x_i)

where P(1) is the probability that the next bit will be a 1, P_i(1) is the probability estimated by the i-th model, stretch(x) = ln(x/(1 − x)), and squash(x) = 1/(1 + e^(−x)) is its inverse. After each prediction, the model is updated by adjusting the weights to minimize coding cost:

w_i ← w_i + η x_i (y − P(1))

where η is the learning rate (typically 0.002 to 0.01), y is the predicted bit, and (y − P(1))
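
A small sketch of this logistic mixing and weight update, assuming the standard stretch/squash pair given above. Real PAQ works in fixed-point arithmetic and selects among weight sets by context, which is omitted here; input probabilities are assumed to lie strictly between 0 and 1.

```python
import math

def stretch(p: float) -> float:
    """ln(p / (1 - p)): map a probability to the logistic domain."""
    return math.log(p / (1.0 - p))

def squash(x: float) -> float:
    """1 / (1 + e^-x): inverse of stretch."""
    return 1.0 / (1.0 + math.exp(-x))

class LogisticMixer:
    def __init__(self, n_models: int, learning_rate: float = 0.005):
        self.w = [0.0] * n_models      # mixing weights (start at zero in this sketch)
        self.eta = learning_rate       # typically 0.002 to 0.01
        self.x = [0.0] * n_models      # stretched inputs from the last mix
        self.p1 = 0.5

    def mix(self, predictions: list[float]) -> float:
        """P(1) = squash(sum_i w_i * stretch(P_i(1)))."""
        self.x = [stretch(p) for p in predictions]
        self.p1 = squash(sum(w * x for w, x in zip(self.w, self.x)))
        return self.p1

    def update(self, bit: int) -> None:
        """w_i <- w_i + eta * x_i * (y - P(1)), minimizing coding cost."""
        err = bit - self.p1
        self.w = [w + self.eta * x * err for w, x in zip(self.w, self.x)]
```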

A prediction that is often more accurate than any of the individual predictions. For example, one simple method (not necessarily the best) is to average the probabilities assigned by each model. A random forest is another method: it outputs the prediction that is the mode of the predictions output by the individual models. Combining models is an active area of research in machine learning. The PAQ series of data compression programs use context mixing to assign probabilities to individual bits of

A range of techniques. A learning rate schedule changes the learning rate during learning, most often between epochs/iterations. This is mainly done with two parameters: decay and momentum. There are many different learning rate schedules, but the most common are time-based, step-based and exponential. Decay serves to settle the learning in a nice place and avoid oscillations,

A situation that may arise when a too high constant learning rate makes the learning jump back and forth over a minimum, and is controlled by a hyperparameter. Momentum is analogous to a ball rolling down a hill; we want the ball to settle at the lowest point of the hill (corresponding to the lowest error). Momentum both speeds up the learning (increasing the learning rate) when the error cost gradient

A small context are used to look up a new prediction in a table. After the bit is encoded, the table entry is adjusted to reduce the prediction error. SSE stages can be pipelined with different contexts or computed in parallel with the outputs averaged. A string s is compressed to the shortest byte string representing a base-256 big-endian number x in the range [0, 1] such that P(r < s) ≤ x < P(r ≤ s), where P(r < s)
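
A hedged sketch of an SSE stage as described above: a table indexed by a small context and the quantized input prediction, nudged toward each observed bit. Real PAQ interpolates between neighboring table entries and uses different table sizes and update rates; the numbers here are only illustrative.

```python
class SSE:
    """Simplified secondary symbol estimation (SSE) stage."""

    def __init__(self, n_contexts: int, n_buckets: int = 33, rate: float = 0.02):
        # start each entry at the probability its bucket represents
        self.table = [[(b + 0.5) / n_buckets for b in range(n_buckets)]
                      for _ in range(n_contexts)]
        self.n_buckets = n_buckets
        self.rate = rate
        self.last = (0, 0)

    def refine(self, ctx: int, p1: float) -> float:
        """Look up a new prediction from (context, quantized input prediction)."""
        bucket = min(int(p1 * self.n_buckets), self.n_buckets - 1)
        self.last = (ctx, bucket)
        return self.table[ctx][bucket]

    def update(self, bit: int) -> None:
        """After the bit is coded, adjust the used entry to reduce the error."""
        ctx, bucket = self.last
        p = self.table[ctx][bucket]
        self.table[ctx][bucket] = p + self.rate * (bit - p)
```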

A special character followed by the lowercase letter. In the PAQ8HP series, the dictionary is organized by grouping syntactically and semantically related words together. This allows models to use just the most significant bits of the dictionary codes as context. The following table is a sample from the Large Text Compression Benchmark by Matt Mahoney, which consists of a file of 10⁹ bytes (1 GB, or 0.931 GiB) of English Wikipedia text. See Lossless compression benchmarks for
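
The sketch below illustrates the kind of dictionary preprocessing described here. The escape byte, code layout, and treatment of unknown words are invented for the example; the real PAQ8HP transform is different and must also escape raw bytes that collide with code values.

```python
UPPERCASE_ESCAPE = b"\x01"        # hypothetical "next letter was uppercase" marker

def build_codes(dictionary: list[str]) -> dict[str, bytes]:
    """Assign each word a 1- or 2-byte code based on its dictionary position."""
    codes = {}
    for i, word in enumerate(dictionary):
        if i < 64:
            codes[word] = bytes([0xC0 + i])            # 1-byte code
        else:
            hi, lo = divmod(i - 64, 256)
            codes[word] = bytes([0x80 + hi, lo])       # 2-byte code
    return codes

def preprocess(text: str, codes: dict[str, bytes]) -> bytes:
    """Replace known words with short codes; mark capitalization with an escape."""
    out = bytearray()
    for token in text.split(" "):
        if token[:1].isupper():                        # encode capitalization separately
            out += UPPERCASE_ESCAPE
            token = token[0].lower() + token[1:]
        out += codes.get(token, token.encode("utf-8")) # code if known, else raw text
        out += b" "
    return bytes(out)
```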

A weighted average of the predictions weighted by evidence. In this example, P(X|B) gets more weight than P(X|A) because P(X|B) is based on a greater number of tests. Older versions of PAQ used this approach. Newer versions use logistic (or neural network) mixing by first transforming

Is better balanced. In PAQ6, whenever one of the bit counts is incremented, the part of the other count that exceeds 2 is halved. For example, after the sequence 000000001, the counts would go from (n_0, n_1) = (8, 0) to (5, 1). Let P_i(1) be the prediction by the i-th model that the next bit will be a 1. Then the final prediction P(1) is calculated:

x_i = stretch(P_i(1))
P(1) = squash(Σ_i w_i x_i)

where P(1) is the probability that

Is commonly referred to as gain. In setting a learning rate, there is a trade-off between the rate of convergence and overshooting. While the descent direction is usually determined from the gradient of the loss function, the learning rate determines how big a step is taken in that direction. A too high learning rate will make the learning jump over minima, but a too low learning rate will either take too long to converge or get stuck in an undesirable local minimum. In order to achieve faster convergence, prevent oscillations and getting stuck in undesirable local minima,

Is heading in the same direction for a long time and also avoids local minima by 'rolling over' small bumps. Momentum is controlled by a hyperparameter analogous to a ball's mass which must be chosen manually: too high and the ball will roll over minima which we wish to find, too low and it will not fulfil its purpose. The formula for factoring in the momentum is more complex than for decay but is most often built in with deep learning libraries such as Keras. Time-based learning schedules alter
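
As a minimal illustration of the momentum mechanism described in the preceding lines, one gradient-descent step with classical momentum might look like this (hyperparameter values are only illustrative):

```python
def sgd_momentum_step(w, grad, velocity, lr=0.01, momentum=0.9):
    """One gradient-descent step with classical momentum.

    `velocity` accumulates past gradients, so steps grow while the gradient
    keeps pointing the same way and can carry the parameters over small bumps
    (shallow local minima); `momentum` plays the role of the ball's mass.
    """
    velocity = [momentum * v - lr * g for v, g in zip(velocity, grad)]
    w = [wi + vi for wi, vi in zip(w, velocity)]
    return w, velocity
```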

Is insufficient information for probability theory to give a result. In fact, it is possible to construct scenarios in which the result could be anything at all. But intuitively, we would expect the result to be some kind of average of the two. The problem is important for data compression. In this application, A and B are contexts, X

Is split as with compression. The portion containing x becomes the new range, and the corresponding bit is appended to s. In PAQ, the lower and upper bounds of the range are represented in three parts. The most significant base-256 digits are identical, so they can be written as the leading bytes of x. The next 4 bytes are kept in memory, such that the leading byte is different. The trailing bits are assumed to be all zeros for

Is the event that the next bit or symbol of the data to be compressed has a particular value, and P(X|A) and P(X|B) are the probability estimates by two independent models. The compression ratio depends on how closely the estimated probability approaches

Is the learning rate at iteration n, η_0 is the initial learning rate, d is how much the learning rate should change at each drop (0.5 corresponds to a halving), and r corresponds to the drop rate, or how often

Is the prediction error. The weight update algorithm differs from backpropagation in that the terms P(1)P(0) are dropped. This is because the goal of the neural network is to minimize coding cost, not root mean square error. Most versions of PAQ use a small context to select among sets of weights for the neural network. Some versions use multiple networks whose outputs are combined with one more network prior to

Is the probability that a random string r with the same length as s will be lexicographically less than s. It is always possible to find an x such that the length of x is at most one byte longer than the Shannon limit, −log₂ P(r = s) bits. The length of s is stored in the archive header. The arithmetic coder in PAQ is implemented by maintaining for each prediction a lower and upper bound on x, initially [0, 1]. After each prediction,

The learning rate is a tuning parameter in an optimization algorithm that determines the step size at each iteration while moving toward a minimum of a loss function. Since it influences to what extent newly acquired information overrides old information, it metaphorically represents the speed at which a machine learning model "learns". In the adaptive control literature, the learning rate

The 0 and 1 counts: The weights w_i are initially equal and always sum to 1. Under the initial conditions, each model is weighted in proportion to evidence. The weights are then adjusted to favor the more accurate models. Suppose we are given that the actual bit being predicted is y (0 or 1). Then the weight adjustment is: Compression can be improved by bounding n_i so that the model weighting

The 0 and 1 counts: where w_i is the weight of the i-th model. Through PAQ3, the weights were fixed and set in an ad-hoc manner. (Order-n contexts had a weight of n.) Beginning with PAQ4, the weights were adjusted adaptively in the direction that would reduce future errors in the same context set. If the bit to be coded is y, then the weight adjustment is: Beginning with PAQ7, each model outputs
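
As a rough illustration of combining per-model bit counts with weights (the exact formulas used by PAQ, including the adaptive weight adjustment, are not quoted in this excerpt), a simplified evidence-weighted mix might look like this:

```python
def mix_counts(counts, weights, eps=0.5):
    """Simplified weighted addition of (n0, n1) bit counts.

    counts  -- list of (n0_i, n1_i) pairs, one per model
    weights -- list of w_i
    Returns P(1), the mixed probability that the next bit is a 1.
    """
    s0 = eps + sum(w * n0 for w, (n0, _) in zip(weights, counts))
    s1 = eps + sum(w * n1 for w, (_, n1) in zip(weights, counts))
    return s1 / (s0 + s1)
```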

The SSE stages. Furthermore, for each input prediction there may be several inputs which are nonlinear functions of P_i(1) in addition to stretch(P(1)). Each model partitions the known bits of s into a set of contexts and maps each context to a bit history represented by an 8-bit state. In versions through PAQ6, the state represents a pair of counters (n_0, n_1). In PAQ7 and later versions under certain conditions,

The count over 2 is discarded when the opposite bit is observed. For example, if the current state associated with a context is (n_0, n_1) = (12, 3) and a 1 is observed, then the counts are updated to (7, 4). A bit is arithmetically coded with space proportional to its probability, either P(1) or P(0) = 1 − P(1). The probabilities are computed by weighted addition of
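
A small sketch of the count update rule just illustrated, in which half of the opposite count's excess over 2 is discarded when a bit is observed:

```python
def update_counts(n0: int, n1: int, bit: int) -> tuple[int, int]:
    """Update an (n0, n1) pair for an observed bit, discarding half of the
    opposite count's excess over 2."""
    if bit == 1:
        n1 += 1
        if n0 > 2:
            n0 = 2 + (n0 - 2) // 2
    else:
        n0 += 1
        if n1 > 2:
            n1 = 2 + (n1 - 2) // 2
    return n0, n1

# update_counts(12, 3, 1) -> (7, 4); update_counts(8, 0, 1) -> (5, 1)
```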

The current range is split into two parts in proportion to P(0) and P(1), the probability that the next bit of s will be a 0 or 1 respectively, given the previous bits of s. The next bit is then encoded by selecting the corresponding subrange to be the new range. The number x is decompressed back to string s by making an identical series of bit predictions (since the previous bits of s are known). The range
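
A toy range coder showing the bit-by-bit range splitting described above. It is only a sketch: there is no underflow handling, and real PAQ represents the bounds and finalizes the output differently.

```python
class ToyBinaryArithmeticEncoder:
    """Split the current range in proportion to P(0) and P(1), keep the
    subrange of the actual bit, and write out leading bytes once both
    bounds agree on them. Not a production coder."""

    def __init__(self):
        self.low, self.high = 0, 0xFFFFFFFF
        self.out = bytearray()

    def encode(self, bit: int, p1: float) -> None:
        span = self.high - self.low
        split = int(span * (1.0 - p1))           # size of the subrange coding a 0
        split = max(0, min(span - 1, split))     # keep both subranges non-empty
        mid = self.low + split
        if bit:
            self.low = mid + 1                   # upper part codes a 1
        else:
            self.high = mid                      # lower part codes a 0
        # emit any leading byte on which both bounds already agree
        while (self.low ^ self.high) & 0xFF000000 == 0:
            self.out.append(self.low >> 24)
            self.low = (self.low << 8) & 0xFFFFFFFF
            self.high = ((self.high << 8) | 0xFF) & 0xFFFFFFFF

    def flush(self) -> bytes:
        self.out.append(self.low >> 24)          # one more byte from the lower bound
        return bytes(self.out)
```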

The decay is: η_n = η_0 e^(−d·n), where d is a decay parameter. The issue with learning rate schedules is that they all depend on hyperparameters that must be manually chosen for each given learning session and may vary greatly depending on
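
The three schedules quoted here, written out directly as small functions (parameter names follow the formulas in the surrounding text):

```python
import math

def time_based(eta: float, d: float, n: int) -> float:
    """eta_{n+1} = eta_n / (1 + d*n): applied iteratively to the current rate."""
    return eta / (1.0 + d * n)

def step_based(eta0: float, d: float, r: float, n: int) -> float:
    """eta_n = eta0 * d**floor((1 + n) / r): drop by a factor d every r iterations."""
    return eta0 * d ** math.floor((1 + n) / r)

def exponential(eta0: float, d: float, n: int) -> float:
    """eta_n = eta0 * exp(-d * n): smooth exponential decay."""
    return eta0 * math.exp(-d * n)
```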

The following contexts: All PAQ versions predict and compress one bit at a time, but differ in the details of the models and how the predictions are combined and postprocessed. Once the next-bit probability is determined, it is encoded by arithmetic coding. There are three methods for combining predictions, depending on the version: PAQ1SSE and later versions postprocess the prediction using secondary symbol estimation (SSE). The combined prediction and

The high order bits of their dictionary codes. The latter strategy makes the dictionary easier to compress. The size of the decompression program and compressed dictionary is included in the contest ranking. On October 27, 2006, it was announced that PAQ8HP5 won a Hutter Prize for Lossless Compression of Human Knowledge of €3,416. On June 30, 2007, Ratushnyak's PAQ8HP12 was awarded a second Hutter prize of €1,732, improving upon his previous record by 3.46%. Being free software, PAQ can be modified and redistributed by anyone who has

The input. Suppose that we are given two conditional probabilities, P(X|A) and P(X|B), and we wish to estimate P(X|A,B), the probability of event X given both conditions A and B. There

The last linefeed occurred 72 characters ago (context B). Suppose that a linefeed previously occurred after 1 of the last 5 periods (P(X|A) = 0.2) and in 5 out of the last 10 lines at column 72 (P(X|B) = 0.5). How should these predictions be combined? Two general approaches have been used, linear and logistic mixing. Linear mixing uses

The learning rate depending on the learning rate of the previous time iteration. Factoring in the decay, the mathematical formula for the learning rate is: η_{n+1} = η_n / (1 + d·n), where η is

The learning rate is often varied during training either in accordance with a learning rate schedule or by using an adaptive learning rate. The learning rate and its adjustments may also differ per parameter, in which case it is a diagonal matrix that can be interpreted as an approximation to the inverse of the Hessian matrix in Newton's method. The learning rate is related to the step length determined by inexact line search in quasi-Newton methods and related optimization algorithms. Initial rate can be left as system default or can be selected using

The learning rate, d is a decay parameter and n is the iteration step. Step-based learning schedules change the learning rate according to some predefined steps. The decay application formula is here defined as: η_n = η_0 d^⌊(1 + n)/r⌋, where η_n

The lower bound and all ones for the upper bound. Compression is terminated by writing one more byte from the lower bound. In PAQ versions through PAQ6, each model maps a set of distinct contexts to a pair of counts, n_0, a count of zero bits, and n_1, a count of 1 bits. In order to favor recent history, half of

The next bit will be a 1, P_i(1) is the probability estimated by the i-th model, stretch(x) = ln(x/(1 − x)), and squash(x) = 1/(1 + e^(−x)) is its inverse. After each prediction, the model is updated by adjusting the weights to minimize coding cost:

w_i ← w_i + η x_i (y − P(1))

where η is the learning rate (typically 0.002 to 0.01), y is the predicted bit, and (y − P(1)) is the prediction error. All versions below use logistic mixing unless otherwise indicated.

Learning rate

In machine learning and statistics,

The oldest versions of PAQ use adaptive weighting. Most context mixing compressors predict one bit of input at a time. The output probability is simply the probability that the next bit will be a 1. We are given a set of predictions P_i(1) = n_1i/n_i, where n_i = n_0i + n_1i, and n_0i and n_1i are the counts of 0 and 1 bits respectively for the i-th model. The probabilities are computed by weighted addition of

The predictions into the logistic domain, log(p/(1 − p)), before averaging. This effectively gives greater weight to predictions near 0 or 1, in this case P(X|A). In both cases, additional weights may be given to each of the input models and adapted to favor the models that have given the most accurate predictions in the past. All but
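
Using the linefeed example from this article (P(X|A) = 0.2 backed by 5 observations, P(X|B) = 0.5 backed by 10), a short sketch contrasting evidence-weighted linear mixing with unweighted logistic mixing:

```python
import math

def linear_mix(probs, evidence):
    """Average the predictions, weighting each by the amount of evidence behind it."""
    return sum(p * e for p, e in zip(probs, evidence)) / sum(evidence)

def logistic_mix(probs):
    """Average in the logistic domain log(p/(1-p)), then map back to a probability."""
    s = sum(math.log(p / (1 - p)) for p in probs) / len(probs)
    return 1.0 / (1.0 + math.exp(-s))

print(linear_mix([0.2, 0.5], [5, 10]))  # 0.4   -- P(X|B) gets more weight (more evidence)
print(logistic_mix([0.2, 0.5]))         # ~0.33 -- pulled toward the estimate nearer 0
```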

The rate should be dropped (10 corresponds to a drop every 10 iterations). The floor function (⌊…⌋) here drops the value of its input to 0 for all values smaller than 1. Exponential learning schedules are similar to step-based, but instead of steps, a decreasing exponential function is used. The mathematical formula for factoring in

The representable counts: (41, 0), (40, 1), (12, 2), (5, 3), (4, 4), (3, 5), (2, 12), (1, 40), (0, 41). If a count exceeds this limit, then the next state is one chosen to have a similar ratio of n_0 to n_1. Thus, if the current state is (n_0 = 4, n_1 = 4, last bit = 0) and a 1 is observed, then the new state is not (n_0 = 4, n_1 = 5, last bit = 1). Rather, it is (n_0 = 3, n_1 = 4, last bit = 1). Most context models are implemented as hash tables. Some small contexts are implemented as direct lookup tables. Some versions of PAQ, in particular PAsQDa, PAQAR (both PAQ6 derivatives), and PAQ8HP1 through PAQ8HP8 (PAQ8 derivatives and Hutter Prize recipients) preprocess text files by looking up words in an external dictionary and replacing them with 1- to 3-byte codes. In addition, uppercase letters are encoded with

The state also represents the value of the last bit or the entire sequence. The states are mapped to probabilities using a 256-entry table for each model. After a prediction by the model, the table entry is adjusted slightly (typically by 0.4%) to reduce the prediction error. In all PAQ8 versions, the representable states are as follows: To keep the number of states to 256, the following limits are placed on

The true but unknown probability of event X. It is often the case that contexts A and B have occurred often enough to accurately estimate P(X|A) and P(X|B) by counting occurrences of X in each context, but

The two contexts either have not occurred together frequently, or there are insufficient computing resources (time and memory) to collect statistics for the combined case. For example, suppose that we are compressing a text file. We wish to predict whether the next character will be a linefeed, given that the previous character was a period (context A) and that
