Ray Solomonoff (July 25, 1926 – December 7, 2009) was an American mathematician who invented algorithmic probability , his General Theory of Inductive Inference (also known as Universal Inductive Inference), and was a founder of algorithmic information theory . He was an originator of the branch of artificial intelligence based on machine learning , prediction and probability . He circulated the first report on non-semantic machine learning in 1956.
100-475: Solomonoff first described algorithmic probability in 1960, publishing the theorem that launched Kolmogorov complexity and algorithmic information theory . He first described these results at a conference at Caltech in 1960, and in a report, Feb. 1960, "A Preliminary Report on a General Theory of Inductive Inference." He clarified these ideas more fully in his 1964 publications, "A Formal Theory of Inductive Inference," Part I and Part II. Algorithmic probability
200-410: A lower bound for each text's Kolmogorov complexity can return a value essentially larger than P 's own length (see section § Chaitin's incompleteness theorem ); hence no single program can compute the exact Kolmogorov complexity for infinitely many texts. Kolmogorov complexity is the length of the ultimately compressed version of a file (i.e., anything which can be put in a computer). Formally, it
300-645: A computable function mapping finite binary strings to binary strings. It is a universal function if, and only if, for any computable f : 2 ∗ → 2 ∗ {\displaystyle f:2^{*}\to 2^{*}} , we can encode the function in a "program" s f {\displaystyle s_{f}} , such that ∀ x ∈ 2 ∗ , U ( s f x ) = f ( x ) {\displaystyle \forall x\in 2^{*},U(s_{f}x)=f(x)} . We can think of U {\displaystyle U} as
400-645: A computable function mapping finite binary strings to binary strings. It is a universal function if, and only if, for any computable f : 2 ∗ → 2 ∗ {\displaystyle f:2^{*}\to 2^{*}} , we can encode the function in a "program" s f {\displaystyle s_{f}} , such that ∀ x ∈ 2 ∗ , U ( s f x ) = f ( x ) {\displaystyle \forall x\in 2^{*},U(s_{f}x)=f(x)} . We can think of U {\displaystyle U} as
500-399: A description in the optimal language by first describing L as a computer program P (part 1), and then using the original description D as input to that program (part 2). The total length of this new description D′ is (approximately): The length of P is a constant that doesn't depend on D . So, there is at most a constant overhead, regardless of the object described. Therefore,
600-399: A description in the optimal language by first describing L as a computer program P (part 1), and then using the original description D as input to that program (part 2). The total length of this new description D′ is (approximately): The length of P is a constant that doesn't depend on D . So, there is at most a constant overhead, regardless of the object described. Therefore,
700-431: A description language can be based on any computer programming language, such as Lisp , Pascal , or Java . If P is a program which outputs a string x , then P is a description of x . The length of the description is just the length of P as a character string, multiplied by the number of bits in a character (e.g., 7 for ASCII ). We could, alternatively, choose an encoding for Turing machines , where an encoding
800-404: A few bytes larger than the length of the string itself. Strings like the abab example above, whose Kolmogorov complexity is small relative to the string's size, are not considered to be complex. The Kolmogorov complexity can be defined for any mathematical object, but for simplicity the scope of this article is restricted to strings. We must first specify a description language for strings. Such
900-881: A more complete description in his 1964 publications, "A Formal Theory of Inductive Inference," Part 1 and Part 2 in Information and Control . Andrey Kolmogorov later independently published this theorem in Problems Inform. Transmission in 1965. Gregory Chaitin also presents this theorem in J. ACM – Chaitin's paper was submitted October 1966 and revised in December 1968, and cites both Solomonoff's and Kolmogorov's papers. The theorem says that, among algorithms that decode strings from their descriptions (codes), there exists an optimal one. This algorithm, for all strings, allows codes as short as allowed by any other algorithm up to an additive constant that depends on
1000-702: A more complete description in his 1964 publications, "A Formal Theory of Inductive Inference," Part 1 and Part 2 in Information and Control . Andrey Kolmogorov later independently published this theorem in Problems Inform. Transmission in 1965. Gregory Chaitin also presents this theorem in J. ACM – Chaitin's paper was submitted October 1966 and revised in December 1968, and cites both Solomonoff's and Kolmogorov's papers. The theorem says that, among algorithms that decode strings from their descriptions (codes), there exists an optimal one. This algorithm, for all strings, allows codes as short as allowed by any other algorithm up to an additive constant that depends on
1100-564: A number of reports leading up to the publications in 1964. The 1964 papers give a more detailed description of Algorithmic Probability, and Solomonoff Induction, presenting five different models, including the model popularly called the Universal Distribution. Other scientists who had been at the 1956 Dartmouth Summer Conference (such as Newell and Simon ) were developing the branch of Artificial Intelligence that used machines governed by if-then rules, fact based. Solomonoff
SECTION 10
#17327984288151200-399: A program x {\displaystyle x} may represent a binary number up to log 2 | x | {\displaystyle \log _{2}|x|} , simply by its own length. Stated in another way, it is as if we are using a termination symbol to denote where a word ends, and so we are not using 2 symbols, but 3. To fix this defect, we introduce
1300-399: A program x {\displaystyle x} may represent a binary number up to log 2 | x | {\displaystyle \log _{2}|x|} , simply by its own length. Stated in another way, it is as if we are using a termination symbol to denote where a word ends, and so we are not using 2 symbols, but 3. To fix this defect, we introduce
1400-434: A program interpreter, which takes in an initial segment describing the program, followed by data that the program should process. One problem with plain complexity is that C ( x y ) ≮ C ( x ) + C ( y ) {\displaystyle C(xy)\not <C(x)+C(y)} , because intuitively speaking, there is no general way to tell where to divide an output string just by looking at
1500-434: A program interpreter, which takes in an initial segment describing the program, followed by data that the program should process. One problem with plain complexity is that C ( x y ) ≮ C ( x ) + C ( y ) {\displaystyle C(xy)\not <C(x)+C(y)} , because intuitively speaking, there is no general way to tell where to divide an output string just by looking at
1600-435: A proof for the efficacy of Algorithmic Probability, but mainly because of lack of general interest at that time, did not publish it until 10 years later. In his report, he published the proof for the convergence theorem. In the years following his discovery of Algorithmic Probability he focused on how to use this probability and Solomonoff Induction in actual prediction and problem solving for A.I. He also wanted to understand
1700-507: A reliable method so that they continue to be acceptable. It utilizes search time in a very efficient way. In addition to probability estimates, Algorithmic Probability "has for AI another important value: its multiplicity of models gives us many different ways to understand our data; A description of Solomonoff's life and work prior to 1997 is in "The Discovery of Algorithmic Probability", Journal of Computer and System Sciences, Vol 55, No. 1, pp 73–88, August 1997. The paper, as well as most of
1800-477: A sequence. Algorithmic Probability and Universal (Solomonoff) Induction became associated with Solomonoff, who was focused on prediction — the extrapolation of a sequence. Later in the same 1960 publication Solomonoff describes his extension of the single-shortest-code theory. This is Algorithmic Probability. He states: "It would seem that if there are several different methods of describing a sequence, each of these methods should be given some weight in determining
1900-510: A term like O ( min ( ln x , ln y ) ) {\displaystyle O(\min(\ln x,\ln y))} on one side, whereas the same inequalities with prefix-free complexity have only O ( 1 ) {\displaystyle O(1)} . The main problem with plain complexity is that there is something extra sneaked into a program. A program not only represents for something with its code, but also represents its own length. In particular,
2000-510: A term like O ( min ( ln x , ln y ) ) {\displaystyle O(\min(\ln x,\ln y))} on one side, whereas the same inequalities with prefix-free complexity have only O ( 1 ) {\displaystyle O(1)} . The main problem with plain complexity is that there is something extra sneaked into a program. A program not only represents for something with its code, but also represents its own length. In particular,
2100-451: A value essentially larger than P 's own length (see section § Chaitin's incompleteness theorem ); hence no single program can compute the exact Kolmogorov complexity for infinitely many texts. Kolmogorov complexity is the length of the ultimately compressed version of a file (i.e., anything which can be put in a computer). Formally, it is the length of a shortest program from which the file can be reconstructed. While Kolmogorov complexity
SECTION 20
#17327984288152200-448: A version of his findings in 1957. These were the first papers to be written on probabilistic machine learning. In the late 1950s, he invented probabilistic languages and their associated grammars. A probabilistic language assigns a probability value to every possible string. Generalizing the concept of probabilistic grammars led him to his discovery in 1960 of Algorithmic Probability and General Theory of Inductive Inference. Prior to
2300-710: A write tape, but it cannot move its read-head anymore. This gives us the following formal way to describe K . Note that some universal Turing machines may not be programmable with prefix codes. We must pick only a prefix-free universal Turing machine. The prefix-free complexity of a string x {\displaystyle x} is the shortest prefix code that makes the machine output x {\displaystyle x} : K ( x ) := min { | c | : c ∈ S , U ( c ) = x } {\displaystyle K(x):=\min\{|c|:c\in S,U(c)=x\}} There are some description languages which are optimal, in
2400-710: A write tape, but it cannot move its read-head anymore. This gives us the following formal way to describe K . Note that some universal Turing machines may not be programmable with prefix codes. We must pick only a prefix-free universal Turing machine. The prefix-free complexity of a string x {\displaystyle x} is the shortest prefix code that makes the machine output x {\displaystyle x} : K ( x ) := min { | c | : c ∈ S , U ( c ) = x } {\displaystyle K(x):=\min\{|c|:c\in S,U(c)=x\}} There are some description languages which are optimal, in
2500-405: Is incomplete . There will always be descriptions outside that system's search space, which will never be acknowledged or considered, even in an infinite amount of time. Computable prediction models hide this fact by ignoring such algorithms. In many of his papers he described how to search for solutions to problems and in the 1970s and early 1980s developed what he felt was the best way to update
2600-408: Is a function which associates to each Turing Machine M a bitstring < M >. If M is a Turing Machine which, on input w , outputs string x , then the concatenated string < M > w is a description of x . For theoretical analysis, this approach is more suited for constructing detailed formal proofs and is generally preferred in the research literature. In this article, an informal approach
2700-493: Is a mathematically formalized combination of Occam's razor , and the Principle of Multiple Explanations. It is a machine independent method of assigning a probability value to each hypothesis (algorithm/program) that explains a given observation, with the simplest hypothesis (the shortest program) having the highest probability and the increasingly complex hypotheses receiving increasingly small probabilities. Solomonoff founded
2800-440: Is a program in the language L 1 which acts as an interpreter for L 2 : where p is a program in L 2 . The interpreter is characterized by the following property: Thus, if P is a program in L 2 which is a minimal description of s , then InterpretLanguage ( P ) returns the string s . The length of this description of s is the sum of This proves the desired upper bound. Algorithmic information theory
2900-440: Is a program in the language L 1 which acts as an interpreter for L 2 : where p is a program in L 2 . The interpreter is characterized by the following property: Thus, if P is a program in L 2 which is a minimal description of s , then InterpretLanguage ( P ) returns the string s . The length of this description of s is the sum of This proves the desired upper bound. Algorithmic information theory
3000-410: Is based on. Theorem. C ( x ) ≤ K ( x ) + 2 log 2 C ( x ) {\displaystyle C(x)\leq K(x)+2\log _{2}C(x)} Proof. Take any program for the universal Turing machine used to define plain complexity, and convert it to a prefix-free program by first coding the length of the program in binary, then convert
3100-410: Is based on. Theorem. C ( x ) ≤ K ( x ) + 2 log 2 C ( x ) {\displaystyle C(x)\leq K(x)+2\log _{2}C(x)} Proof. Take any program for the universal Turing machine used to define plain complexity, and convert it to a prefix-free program by first coding the length of the program in binary, then convert
Ray Solomonoff - Misplaced Pages Continue
3200-422: Is discussed. Any string s has at least one description. For example, the second string above is output by the pseudo-code : whereas the first string is output by the (much shorter) pseudo-code: If a description d ( s ) of a string s is of minimal length (i.e., using the fewest bits), it is called a minimal description of s , and the length of d ( s ) (i.e. the number of bits in the minimal description)
3300-412: Is limited by available time or computation cost rather than by cutting out search space as is done in some other prediction methods, such as Minimum Description Length. Throughout his career Solomonoff was concerned with the potential benefits and dangers of A.I., discussing it in many of his published reports. In 1985 he analyzed a likely evolution of A.I., giving a formula predicting when it would reach
3400-427: Is more suited for constructing detailed formal proofs and is generally preferred in the research literature. In this article, an informal approach is discussed. Any string s has at least one description. For example, the second string above is output by the pseudo-code : whereas the first string is output by the (much shorter) pseudo-code: If a description d ( s ) of a string s is of minimal length (i.e., using
3500-439: Is named after Andrey Kolmogorov , who first published on the subject in 1963 and is a generalization of classical information theory. The notion of Kolmogorov complexity can be used to state and prove impossibility results akin to Cantor's diagonal argument , Gödel's incompleteness theorem , and Turing's halting problem . In particular, no program P computing a lower bound for each text's Kolmogorov complexity can return
3600-506: Is often called Kolmogorov complexity. The computer scientist Ming Li considers this an example of the Matthew effect : "...to everyone who has, more will be given..." There are several other variants of Kolmogorov complexity or algorithmic information. The most widely used one is based on self-delimiting programs , and is mainly due to Leonid Levin (1974). An axiomatic approach to Kolmogorov complexity based on Blum axioms (Blum 1967)
3700-452: Is often called Kolmogorov complexity. The computer scientist Ming Li considers this an example of the Matthew effect : "...to everyone who has, more will be given..." There are several other variants of Kolmogorov complexity or algorithmic information. The most widely used one is based on self-delimiting programs , and is mainly due to Leonid Levin (1974). An axiomatic approach to Kolmogorov complexity based on Blum axioms (Blum 1967)
3800-546: Is the Kolmogorov complexity of s , written K ( s ). Symbolically, The length of the shortest description will depend on the choice of description language; but the effect of changing languages is bounded (a result called the invariance theorem ). There are two definitions of Kolmogorov complexity: plain and prefix-free . The plain complexity is the minimal description length of any program, and denoted C ( x ) {\displaystyle C(x)} while
3900-428: Is the area of computer science that studies Kolmogorov complexity and other complexity measures on strings (or other data structures ). The concept and theory of Kolmogorov Complexity is based on a crucial theorem first discovered by Ray Solomonoff , who published it in 1960, describing it in "A Preliminary Report on a General Theory of Inductive Inference" as part of his invention of algorithmic probability . He gave
4000-428: Is the area of computer science that studies Kolmogorov complexity and other complexity measures on strings (or other data structures ). The concept and theory of Kolmogorov Complexity is based on a crucial theorem first discovered by Ray Solomonoff , who published it in 1960, describing it in "A Preliminary Report on a General Theory of Inductive Inference" as part of his invention of algorithmic probability . He gave
4100-430: Is the length of a shortest program from which the file can be reconstructed. While Kolmogorov complexity is uncomputable, various approaches have been proposed and reviewed. Consider the following two strings of 32 lowercase letters and digits: The first string has a short English-language description, namely "write ab 16 times", which consists of 17 characters. The second one has no obvious simple description (using
Ray Solomonoff - Misplaced Pages Continue
4200-1084: Is the minimal description length of any program, and denoted C ( x ) {\displaystyle C(x)} while the prefix-free complexity is the minimal description length of any program encoded in a prefix-free code, and denoted K ( x ) {\displaystyle K(x)} . The plain complexity is more intuitive, but the prefix-free complexity is easier to study. By default, all equations hold only up to an additive constant. For example, f ( x ) = g ( x ) {\displaystyle f(x)=g(x)} really means that f ( x ) = g ( x ) + O ( 1 ) {\displaystyle f(x)=g(x)+O(1)} , that is, ∃ c , ∀ x , | f ( x ) − g ( x ) | ≤ c {\displaystyle \exists c,\forall x,|f(x)-g(x)|\leq c} . Let U : 2 ∗ → 2 ∗ {\displaystyle U:2^{*}\to 2^{*}} be
4300-405: Is the only probability system known to be complete in this way. As a necessary consequence of its completeness it is incomputable . The incomputability is because some algorithms—a subset of those that are partially recursive—can never be evaluated fully because it would take too long. But these programs will at least be recognized as possible solutions. On the other hand, any computable system
4400-483: Is the time needed to test the trial and P i {\displaystyle P_{i}} is the probability of success of that trial. He called this the "Conceptual Jump Size" of the problem. Levin's search technique approximates this order, and so Solomonoff, who had studied Levin's work, called this search technique Lsearch. In other papers he explored how to limit the time needed to search for solutions, writing on resource bounded search. The search space
4500-471: Is uncomputable, various approaches have been proposed and reviewed. Consider the following two strings of 32 lowercase letters and digits: The first string has a short English-language description, namely "write ab 16 times", which consists of 17 characters. The second one has no obvious simple description (using the same character set) other than writing down the string itself, i.e., "write 4c1j5b2p0cv4w1x8rx2y39umgw5q85s7" which has 38 characters. Hence
4600-445: The Kolmogorov complexity of an object, such as a piece of text, is the length of a shortest computer program (in a predetermined programming language ) that produces the object as output. It is a measure of the computational resources needed to specify the object, and is also known as algorithmic complexity , Solomonoff–Kolmogorov–Chaitin complexity , program-size complexity , descriptive complexity , or algorithmic entropy . It
4700-836: The "Infinity Point". This work is part of the history of thought about a possible technological singularity . Originally algorithmic induction methods extrapolated ordered sequences of strings. Methods were needed for dealing with other kinds of data. A 1999 report, generalizes the Universal Distribution and associated convergence theorems to unordered sets of strings and a 2008 report, to unordered pairs of strings. In 1997, 2003 and 2006 he showed that incomputability and subjectivity are both necessary and desirable characteristics of any high performance induction system. In 1970 he formed his own one man company, Oxbridge Research, and continued his research there except for periods at other institutions such as MIT, University of Saarland in Germany and
4800-441: The 'current paradigm' no longer fits the current data". Kolmogorov complexity In algorithmic information theory (a subfield of computer science and mathematics ), the Kolmogorov complexity of an object, such as a piece of text, is the length of a shortest computer program (in a predetermined programming language ) that produces the object as output. It is a measure of the computational resources needed to specify
4900-462: The 1960s, the usual method of calculating probability was based on frequency: taking the ratio of favorable results to the total number of trials. In his 1960 publication, and, more completely, in his 1964 publications, Solomonoff seriously revised this definition of probability. He called this new form of probability "Algorithmic Probability" and showed how to use it for prediction in his theory of inductive inference. As part of this work, he produced
5000-571: The American Association for Artificial Intelligence (AAAI), it was decided that probability was in no way relevant to A.I. A protest group formed, and the next year there was a workshop at the AAAI meeting devoted to "Probability and Uncertainty in AI." This yearly workshop has continued to the present day. As part of the protest at the first workshop, Solomonoff gave a paper on how to apply
5100-721: The Dalle Molle Institute for Artificial Intelligence in Lugano, Switzerland. In 2003 he was the first recipient of the Kolmogorov Award by The Computer Learning Research Center at the Royal Holloway, University of London, where he gave the inaugural Kolmogorov Lecture. Solomonoff was most recently a visiting professor at the CLRC. In 2006 he spoke at AI@50 , "Dartmouth Artificial Intelligence Conference:
SECTION 50
#17327984288155200-912: The Next Fifty Years" commemorating the fiftieth anniversary of the original Dartmouth summer study group. Solomonoff was one of five original participants to attend. In Feb. 2008, he gave the keynote address at the Conference "Current Trends in the Theory and Application of Computer Science" (CTTACS), held at Notre Dame University in Lebanon. He followed this with a short series of lectures, and began research on new applications of Algorithmic Probability. Algorithmic Probability and Solomonoff Induction have many advantages for Artificial Intelligence. Algorithmic Probability gives extremely accurate probability estimates. These estimates can be revised by
5300-592: The Russian mathematician Kolmogorov independently published similar ideas. When he became aware of Solomonoff's work, he acknowledged Solomonoff, and for several years, Solomonoff's work was better known in the Soviet Union than in the Western World. The general consensus in the scientific community, however, was to associate this type of complexity with Kolmogorov, who was more concerned with randomness of
5400-492: The algorithms, but not on the strings themselves. Solomonoff used this algorithm and the code lengths it allows to define a "universal probability" of a string on which inductive inference of the subsequent digits of the string can be based. Kolmogorov used this theorem to define several functions of strings, including complexity, randomness, and information. When Kolmogorov became aware of Solomonoff's work, he acknowledged Solomonoff's priority. For several years, Solomonoff's work
5500-492: The algorithms, but not on the strings themselves. Solomonoff used this algorithm and the code lengths it allows to define a "universal probability" of a string on which inductive inference of the subsequent digits of the string can be based. Kolmogorov used this theorem to define several functions of strings, including complexity, randomness, and information. When Kolmogorov became aware of Solomonoff's work, he acknowledged Solomonoff's priority. For several years, Solomonoff's work
5600-484: The choice of description language is discussed below). It can be shown that the Kolmogorov complexity of any string cannot be more than a few bytes larger than the length of the string itself. Strings like the abab example above, whose Kolmogorov complexity is small relative to the string's size, are not considered to be complex. The Kolmogorov complexity can be defined for any mathematical object, but for simplicity
5700-715: The concatenated string. We can divide it by specifying the length of x {\displaystyle x} or y {\displaystyle y} , but that would take O ( min ( ln x , ln y ) ) {\displaystyle O(\min(\ln x,\ln y))} extra symbols. Indeed, for any c > 0 {\displaystyle c>0} there exists x , y {\displaystyle x,y} such that C ( x y ) ≥ C ( x ) + C ( y ) + c {\displaystyle C(xy)\geq C(x)+C(y)+c} . Typically, inequalities with plain complexity have
5800-715: The concatenated string. We can divide it by specifying the length of x {\displaystyle x} or y {\displaystyle y} , but that would take O ( min ( ln x , ln y ) ) {\displaystyle O(\min(\ln x,\ln y))} extra symbols. Indeed, for any c > 0 {\displaystyle c>0} there exists x , y {\displaystyle x,y} such that C ( x y ) ≥ C ( x ) + C ( y ) + c {\displaystyle C(xy)\geq C(x)+C(y)+c} . Typically, inequalities with plain complexity have
5900-424: The deeper implications of this probability system. One important aspect of Algorithmic Probability is that it is complete and incomputable. In the 1968 report he shows that Algorithmic Probability is complete ; that is, if there is any describable regularity in a body of data, Algorithmic Probability will eventually discover that regularity, requiring a relatively small sample of that data. Algorithmic Probability
6000-469: The earliest statistical analysis of networks. He was one of the 10 attendees at the 1956 Dartmouth Summer Research Project on Artificial Intelligence . He wrote and circulated a report among the attendees: "An Inductive Inference Machine". It viewed machine learning as probabilistic, with an emphasis on the importance of training sequences, and on the use of parts of previous solutions to problems in constructing trial solutions for new problems. He published
6100-499: The fewest bits), it is called a minimal description of s , and the length of d ( s ) (i.e. the number of bits in the minimal description) is the Kolmogorov complexity of s , written K ( s ). Symbolically, The length of the shortest description will depend on the choice of description language; but the effect of changing languages is bounded (a result called the invariance theorem ). There are two definitions of Kolmogorov complexity: plain and prefix-free . The plain complexity
SECTION 60
#17327984288156200-470: The first part of a description is a computer program (specifically: a compiler for the object's language, written in the description language), with the second part being the input to that computer program which produces the object as output. The invariance theorem follows: Given any description language L , the optimal description language is at least as efficient as L , with some constant overhead. Proof: Any description D in L can be converted into
6300-470: The first part of a description is a computer program (specifically: a compiler for the object's language, written in the description language), with the second part being the input to that computer program which produces the object as output. The invariance theorem follows: Given any description language L , the optimal description language is at least as efficient as L , with some constant overhead. Proof: Any description D in L can be converted into
6400-415: The following sense: given any description of an object in a description language, said description may be used in the optimal description language with a constant overhead. The constant depends only on the languages involved, not on the description of the object, nor the object being described. Here is an example of an optimal description language. A description will have two parts: In more technical terms,
6500-415: The following sense: given any description of an object in a description language, said description may be used in the optimal description language with a constant overhead. The constant depends only on the languages involved, not on the description of the object, nor the object being described. Here is an example of an optimal description language. A description will have two parts: In more technical terms,
6600-464: The last symbol of the word, it knows that the word is finished, and does not need to backtrack or a termination symbol. Define a prefix-free Turing machine to be a Turing machine that comes with a prefix-free code, such that the Turing machine can read any string from the code in one direction, and stop reading as soon as it reads the last symbol. Afterwards, it may compute on a work tape and write to
6700-411: The last symbol of the word, it knows that the word is finished, and does not need to backtrack or a termination symbol. Define a prefix-free Turing machine to be a Turing machine that comes with a prefix-free code, such that the Turing machine can read any string from the code in one direction, and stop reading as soon as it reads the last symbol. Afterwards, it may compute on a work tape and write to
6800-462: The length to prefix-free coding. For example, suppose the program has length 9, then we can convert it as follows: 9 ↦ 1001 ↦ 11 − 00 − 00 − 11 − 01 {\displaystyle 9\mapsto 1001\mapsto 11-00-00-11-\color {red}{01}} where we double each digit, then add a termination code. The prefix-free universal Turing machine can then read in any program for
6900-462: The length to prefix-free coding. For example, suppose the program has length 9, then we can convert it as follows: 9 ↦ 1001 ↦ 11 − 00 − 00 − 11 − 01 {\displaystyle 9\mapsto 1001\mapsto 11-00-00-11-\color {red}{01}} where we double each digit, then add a termination code. The prefix-free universal Turing machine can then read in any program for
7000-621: The machine. The use of probability in A.I., however, did not have a completely smooth path. In the early years of A.I., the relevance of probability was problematic. Many in the A.I. community felt probability was not usable in their work. The area of pattern recognition did use a form of probability, but because there was no broadly based theory of how to incorporate probability in any A.I. field, most fields did not use it at all. There were, however, researchers such as Pearl and Peter Cheeseman who argued that probability could be used in artificial intelligence. About 1984, at an annual meeting of
7100-409: The most likely next event in a series of events, and how likely it will be. Although he is best known for algorithmic probability and his general theory of inductive inference , he made many other important discoveries throughout his life, most of them directed toward his goal in artificial intelligence: to develop a machine that could solve hard problems using probabilistic methods. Ray Solomonoff
7200-406: The number of bits in a character (e.g., 7 for ASCII ). We could, alternatively, choose an encoding for Turing machines , where an encoding is a function which associates to each Turing Machine M a bitstring < M >. If M is a Turing Machine which, on input w , outputs string x , then the concatenated string < M > w is a description of x . For theoretical analysis, this approach
7300-559: The object, and is also known as algorithmic complexity , Solomonoff–Kolmogorov–Chaitin complexity , program-size complexity , descriptive complexity , or algorithmic entropy . It is named after Andrey Kolmogorov , who first published on the subject in 1963 and is a generalization of classical information theory. The notion of Kolmogorov complexity can be used to state and prove impossibility results akin to Cantor's diagonal argument , Gödel's incompleteness theorem , and Turing's halting problem . In particular, no program P computing
7400-442: The operation of writing the first string can be said to have "less complexity" than writing the second. More formally, the complexity of a string is the length of the shortest possible description of the string in some fixed universal description language (the sensitivity of complexity relative to the choice of description language is discussed below). It can be shown that the Kolmogorov complexity of any string cannot be more than
7500-453: The optimal language is universal up to this additive constant. Theorem : If K 1 and K 2 are the complexity functions relative to Turing complete description languages L 1 and L 2 , then there is a constant c – which depends only on the languages L 1 and L 2 chosen – such that Proof : By symmetry, it suffices to prove that there is some constant c such that for all strings s Now, suppose there
7600-453: The optimal language is universal up to this additive constant. Theorem : If K 1 and K 2 are the complexity functions relative to Turing complete description languages L 1 and L 2 , then there is a constant c – which depends only on the languages L 1 and L 2 chosen – such that Proof : By symmetry, it suffices to prove that there is some constant c such that for all strings s Now, suppose there
7700-1451: The other machine as follows: [ code for simulating the other machine ] [ coded length of the program ] [ the program ] {\displaystyle [{\text{code for simulating the other machine}}][{\text{coded length of the program}}][{\text{the program}}]} The first part programs the machine to simulate the other machine, and is a constant overhead O ( 1 ) {\displaystyle O(1)} . The second part has length ≤ 2 log 2 C ( x ) + 3 {\displaystyle \leq 2\log _{2}C(x)+3} . The third part has length C ( x ) {\displaystyle C(x)} . Theorem : There exists c {\displaystyle c} such that ∀ x , C ( x ) ≤ | x | + c {\displaystyle \forall x,C(x)\leq |x|+c} . More succinctly, C ( x ) ≤ | x | {\displaystyle C(x)\leq |x|} . Similarly, K ( x ) ≤ | x | + 2 log 2 | x | {\displaystyle K(x)\leq |x|+2\log _{2}|x|} , and K ( x | | x | ) ≤ | x | {\displaystyle K(x||x|)\leq |x|} . Proof. For
7800-1451: The other machine as follows: [ code for simulating the other machine ] [ coded length of the program ] [ the program ] {\displaystyle [{\text{code for simulating the other machine}}][{\text{coded length of the program}}][{\text{the program}}]} The first part programs the machine to simulate the other machine, and is a constant overhead O ( 1 ) {\displaystyle O(1)} . The second part has length ≤ 2 log 2 C ( x ) + 3 {\displaystyle \leq 2\log _{2}C(x)+3} . The third part has length C ( x ) {\displaystyle C(x)} . Theorem : There exists c {\displaystyle c} such that ∀ x , C ( x ) ≤ | x | + c {\displaystyle \forall x,C(x)\leq |x|+c} . More succinctly, C ( x ) ≤ | x | {\displaystyle C(x)\leq |x|} . Similarly, K ( x ) ≤ | x | + 2 log 2 | x | {\displaystyle K(x)\leq |x|+2\log _{2}|x|} , and K ( x | | x | ) ≤ | x | {\displaystyle K(x||x|)\leq |x|} . Proof. For
7900-470: The others mentioned here, are available on his website at the publications page . In an article published the year of his death, a journal article said of Solomonoff: "A very conventional scientist understands his science using a single 'current paradigm'—the way of understanding that is most in vogue at the present time. A more creative scientist understands his science in very many ways, and can more easily create new theories, new ways of understanding, when
8000-508: The philosophical foundation for the use of Bayes rule of causation for prediction. The basic theorem of what was later called Kolmogorov Complexity was part of his General Theory. Writing in 1960, he begins: "Consider a very long sequence of symbols ... We shall consider such a sequence of symbols to be 'simple' and have a high a priori probability, if there exists a very brief description of this sequence – using, of course, some sort of stipulated description method. More exactly, if we use only
8100-702: The plain complexity, just write a program that simply copies the input to the output. For the prefix-free complexity, we need to first describe the length of the string, before writing out the string itself. Theorem. (extra information bounds, subadditivity) Note that there is no way to compare K ( x y ) {\displaystyle K(xy)} and K ( x | y ) {\displaystyle K(x|y)} or K ( x ) {\displaystyle K(x)} or K ( y | x ) {\displaystyle K(y|x)} or K ( y ) {\displaystyle K(y)} . There are strings such that
8200-702: The plain complexity, just write a program that simply copies the input to the output. For the prefix-free complexity, we need to first describe the length of the string, before writing out the string itself. Theorem. (extra information bounds, subadditivity) Note that there is no way to compare K ( x y ) {\displaystyle K(xy)} and K ( x | y ) {\displaystyle K(x|y)} or K ( x ) {\displaystyle K(x)} or K ( y | x ) {\displaystyle K(y|x)} or K ( y ) {\displaystyle K(y)} . There are strings such that
8300-436: The prefix-free Kolmogorov complexity. A prefix-free code is a subset of 2 ∗ {\displaystyle 2^{*}} such that given any two different words x , y {\displaystyle x,y} in the set, neither is a prefix of the other. The benefit of a prefix-free code is that we can build a machine that reads words from the code forward in one direction, and as soon as it reads
8400-436: The prefix-free Kolmogorov complexity. A prefix-free code is a subset of 2 ∗ {\displaystyle 2^{*}} such that given any two different words x , y {\displaystyle x,y} in the set, neither is a prefix of the other. The benefit of a prefix-free code is that we can build a machine that reads words from the code forward in one direction, and as soon as it reads
8500-952: The prefix-free complexity is the minimal description length of any program encoded in a prefix-free code, and denoted K ( x ) {\displaystyle K(x)} . The plain complexity is more intuitive, but the prefix-free complexity is easier to study. By default, all equations hold only up to an additive constant. For example, f ( x ) = g ( x ) {\displaystyle f(x)=g(x)} really means that f ( x ) = g ( x ) + O ( 1 ) {\displaystyle f(x)=g(x)+O(1)} , that is, ∃ c , ∀ x , | f ( x ) − g ( x ) | ≤ c {\displaystyle \exists c,\forall x,|f(x)-g(x)|\leq c} . Let U : 2 ∗ → 2 ∗ {\displaystyle U:2^{*}\to 2^{*}} be
8600-543: The probability of that sequence." He then shows how this idea can be used to generate the universal a priori probability distribution and how it enables the use of Bayes rule in inductive inference. Inductive inference, by adding up the predictions of all models describing a particular sequence, using suitable weights based on the lengths of those models, gets the probability distribution for the extension of that sequence. This method of prediction has since become known as Solomonoff induction . He enlarged his theory, publishing
8700-452: The same character set) other than writing down the string itself, i.e., "write 4c1j5b2p0cv4w1x8rx2y39umgw5q85s7" which has 38 characters. Hence the operation of writing the first string can be said to have "less complexity" than writing the second. More formally, the complexity of a string is the length of the shortest possible description of the string in some fixed universal description language (the sensitivity of complexity relative to
8800-402: The scope of this article is restricted to strings. We must first specify a description language for strings. Such a description language can be based on any computer programming language, such as Lisp , Pascal , or Java . If P is a program which outputs a string x , then P is a description of x . The length of the description is just the length of P as a character string, multiplied by
8900-463: The symbols 0 and 1 to express our description, we will assign the probability 2 to a sequence of symbols if its shortest possible binary description contains N digits." The probability is with reference to a particular universal Turing machine . Solomonoff showed and in 1964 proved that the choice of machine, while it could add a constant factor would not change the probability ratios very much. These probabilities are machine independent. In 1965,
9000-425: The theory of universal inductive inference , which is based on solid philosophical foundations and has its root in Kolmogorov complexity and algorithmic information theory . The theory uses algorithmic probability in a Bayesian framework. The universal prior is taken over the class of all computable measures; no hypothesis will have a zero probability. This enables Bayes' rule (of causation) to be used to predict
9100-409: The universal distribution to problems in A.I. This was an early version of the system he has been developing since that time. In that report, he described the search technique he had developed. In search problems, the best order of search, is time T i / P i {\displaystyle T_{i}/P_{i}} , where T i {\displaystyle T_{i}}
9200-522: The whole string x y {\displaystyle xy} is easy to describe, but its substrings are very hard to describe. Theorem. (symmetry of information) K ( x , y ) = K ( x | y , K ( y ) ) + K ( y ) = K ( y , x ) {\displaystyle K(x,y)=K(x|y,K(y))+K(y)=K(y,x)} . Kolmogorov complexity In algorithmic information theory (a subfield of computer science and mathematics ),
9300-461: Was better known in the Soviet Union than in the Western World. The general consensus in the scientific community, however, was to associate this type of complexity with Kolmogorov, who was concerned with randomness of a sequence, while Algorithmic Probability became associated with Solomonoff, who focused on prediction using his invention of the universal prior probability distribution. The broader area encompassing descriptional complexity and probability
9400-461: Was better known in the Soviet Union than in the Western World. The general consensus in the scientific community, however, was to associate this type of complexity with Kolmogorov, who was concerned with randomness of a sequence, while Algorithmic Probability became associated with Solomonoff, who focused on prediction using his invention of the universal prior probability distribution. The broader area encompassing descriptional complexity and probability
9500-761: Was born on July 25, 1926, in Cleveland, Ohio , the son of Jewish Russian immigrants Phillip Julius and Sarah Mashman Solomonoff. He attended Glenville High School, graduating in 1944. In 1944 he joined the United States Navy as Instructor in Electronics. From 1947–1951 he attended the University of Chicago , studying under Professors such as Rudolf Carnap and Enrico Fermi , and graduated with an M.S. in Physics in 1951. From his earliest years he
9600-494: Was developing the branch of Artificial Intelligence that focussed on probability and prediction; his specific view of A.I. described machines that were governed by the Algorithmic Probability distribution. The machine generates theories together with their associated probabilities, to solve problems, and as new problems and theories develop, updates the probability distribution on the theories. In 1968 he found
9700-518: Was introduced by Mark Burgin in the paper presented for publication by Andrey Kolmogorov. We write K ( x , y ) {\displaystyle K(x,y)} to be K ( ( x , y ) ) {\displaystyle K((x,y))} , where ( x , y ) {\displaystyle (x,y)} means some fixed way to code for a tuple of strings x and y. We omit additive factors of O ( 1 ) {\displaystyle O(1)} . This section
9800-518: Was introduced by Mark Burgin in the paper presented for publication by Andrey Kolmogorov. We write K ( x , y ) {\displaystyle K(x,y)} to be K ( ( x , y ) ) {\displaystyle K((x,y))} , where ( x , y ) {\displaystyle (x,y)} means some fixed way to code for a tuple of strings x and y. We omit additive factors of O ( 1 ) {\displaystyle O(1)} . This section
9900-593: Was motivated by the pure joy of mathematical discovery and by the desire to explore where no one had gone before. At the age of 16, in 1942, he began to search for a general method to solve mathematical problems. In 1952 he met Marvin Minsky , John McCarthy and others interested in machine intelligence. In 1956 Minsky and McCarthy and others organized the Dartmouth Summer Research Conference on Artificial Intelligence , where Solomonoff
10000-516: Was one of the original 10 invitees—he, McCarthy, and Minsky were the only ones to stay all summer. It was for this group that Artificial Intelligence was first named as a science. Computers at the time could solve very specific mathematical problems, but not much else. Solomonoff wanted to pursue a bigger question, how to make machines more generally intelligent, and how computers could use probability for this purpose. He wrote three papers, two with Anatol Rapoport , in 1950–52, that are regarded as
#814185