Reinforcement learning (RL) is an interdisciplinary area of machine learning and optimal control concerned with how an intelligent agent should take actions in a dynamic environment in order to maximize a reward signal. Reinforcement learning is one of the three basic machine learning paradigms, alongside supervised learning and unsupervised learning.
OpenAI is an American artificial intelligence (AI) research organization founded in December 2015 and headquartered in San Francisco, California. Its stated mission is to develop "safe and beneficial" artificial general intelligence (AGI), which it defines as "highly autonomous systems that outperform humans at most economically valuable work". As a leading organization in the ongoing AI boom, OpenAI
$a$ and a policy $\pi$, the action-value of the pair $(s,a)$ under $\pi$ is defined by $Q^{\pi}(s,a)=\mathbb{E}[G\mid s,a,\pi]$, where $G$ now stands for the random discounted return associated with first taking action
$a$ in state $s$ and following $\pi$ thereafter. The theory of Markov decision processes states that if $\pi^{*}$ is an optimal policy, we act optimally (take the optimal action) by choosing the action from $Q^{\pi^{*}}(s,\cdot)$ with
$a$ when in state $s$. There are also deterministic policies. The state-value function $V_{\pi}(s)$ is defined as the expected discounted return starting with state $s$, i.e. $S_{0}=s$, and successively following policy $\pi$. Hence, roughly speaking,
$(s,a)$ are obtained by linearly combining the components of $\phi(s,a)$ with some weights $\theta$: $Q(s,a)=\sum_{i=1}^{d}\theta_{i}\phi_{i}(s,a)$. The algorithms then adjust the weights, instead of adjusting the values associated with the individual state-action pairs. Methods based on ideas from nonparametric statistics (which can be seen to construct their own features) have been explored. Value iteration can also be used as
$\pi(s,a)=\Pr(A_{t}=a\mid S_{t}=s)$ that maximizes the expected cumulative reward. Formulating the problem as a Markov decision process assumes the agent directly observes the current environmental state; in this case, the problem is said to have full observability. If
A loss function. Variants of gradient descent are commonly used to train neural networks. Another type of local search is evolutionary computation, which aims to iteratively improve a set of candidate solutions by "mutating" and "recombining" them, selecting only the fittest to survive each generation. Distributed search processes can coordinate via swarm intelligence algorithms. Two popular swarm algorithms used in search are particle swarm optimization (inspired by bird flocking) and ant colony optimization (inspired by ant trails). Formal logic
A "degree of truth" between 0 and 1. It can therefore handle propositions that are vague and partially true. Non-monotonic logics, including logic programming with negation as failure, are designed to handle default reasoning. Other specialized versions of logic have been developed to describe many complex domains. Many problems in AI (including in reasoning, planning, learning, perception, and robotics) require
A 49% stake in the company. The investment is believed to be a part of Microsoft's efforts to integrate OpenAI's ChatGPT into the Bing search engine. Google announced a similar AI application (Bard), after ChatGPT was launched, fearing that ChatGPT could threaten Google's place as a go-to source for information. On February 7, 2023, Microsoft announced that it was building AI technology based on
A Reddit advertising partner. On May 22, 2024, OpenAI entered into an agreement with News Corp to integrate news content from The Wall Street Journal, the New York Post, The Times, and The Sunday Times into its AI platform. Meanwhile, other publications like The New York Times chose to sue OpenAI and Microsoft for copyright infringement over the use of their content to train AI models. On May 29, 2024, Axios reported that OpenAI had signed deals with Vox Media and The Atlantic to share content to enhance
A certain capability threshold, suggesting that relatively weak AI systems on the other side should not be overly regulated. They also call for more technical safety research for superintelligences, and ask for more coordination, for example through governments launching a joint project which "many current efforts become part of". In July 2023, OpenAI launched the superalignment project, aiming to find within 4 years how to align future superintelligences by automating alignment research using AI. In August 2023, it
A contradiction from premises that include the negation of the problem to be solved. Inference in both Horn clause logic and first-order logic is undecidable, and therefore intractable. However, backward reasoning with Horn clauses, which underpins computation in the logic programming language Prolog, is Turing complete. Moreover, its efficiency is competitive with computation in other symbolic programming languages. Fuzzy logic assigns
A desire to focus more deeply on AI alignment research as his reason for the move. Additionally, OpenAI's president and co-founder, Greg Brockman, is taking an extended leave until the end of the year. In September 2024, OpenAI's global affairs chief, Anna Makanju, expressed support for the UK's approach to AI regulation during her testimony to the House of Lords Communications and Digital Committee, stating that
A function of the parameter vector $\theta$. If the gradient of $\rho$ was known, one could use gradient ascent. Since an analytic expression for the gradient is not available, only a noisy estimate is available. Such an estimate can be constructed in many ways, giving rise to algorithms such as Williams' REINFORCE method (which
A mapping from a finite-dimensional (parameter) space to the space of policies: given the parameter vector $\theta$, let $\pi_{\theta}$ denote the policy associated to $\theta$. Defining the performance function by $\rho(\theta)=\rho^{\pi_{\theta}}$, under mild conditions this function will be differentiable as
A model capable of generating sample transitions is required, rather than a full specification of transition probabilities, which is necessary for dynamic programming methods. Monte Carlo methods apply to episodic tasks, where experience is divided into episodes that eventually terminate. Policy and value function updates occur only after the completion of an episode, making these methods incremental on an episode-by-episode basis, though not on
A month later on December 13. On January 16, 2024, in response to intense scrutiny from regulators around the world, OpenAI announced the formation of a new Collective Alignment team that would aim to implement ideas from the public for ensuring its models would "align to the values of humanity." The move was from its public program launched in May 2023. The company explained that the program would be separate from its commercial endeavors. On January 18, 2024, OpenAI announced
A nonprofit is difficult, but stated "I disagree with the notion that a nonprofit can't compete" and pointed to successful low-budget projects by OpenAI and others. "If bigger and better funded was always better, then IBM would still be number one." The nonprofit, OpenAI, Inc., is the sole controlling shareholder of OpenAI Global, LLC, which, despite being a for-profit company, retains a formal fiduciary responsibility to OpenAI, Inc.'s nonprofit charter. A majority of OpenAI, Inc.'s board
A partnership with Arizona State University that would give it complete access to ChatGPT Enterprise. ASU plans to incorporate the technology into various aspects of its operations, including courses, tutoring and research. It is OpenAI's first partnership with an educational institution. In February 2024, the U.S. Securities and Exchange Commission was reportedly investigating OpenAI over whether internal company communications made by Altman were used to mislead investors; and an investigation of Altman's statements, opened by
A path to a target goal, a process called means-ends analysis. Simple exhaustive searches are rarely sufficient for most real-world problems: the search space (the number of places to search) quickly grows to astronomical numbers. The result is a search that is too slow or never completes. "Heuristics" or "rules of thumb" can help prioritize choices that are more likely to reach a goal. Adversarial search
A policy that maximizes the discounted return by maintaining a set of estimates of expected discounted returns $\mathbb{E}[G]$ for some policy (usually either the "current" [on-policy] or the optimal [off-policy] one). These methods rely on the theory of Markov decision processes, where optimality is defined in
A safe way." OpenAI co-founder Wojciech Zaremba stated that he turned down "borderline crazy" offers of two to three times his market value to join OpenAI instead. In April 2016, OpenAI released a public beta of "OpenAI Gym", its platform for reinforcement learning research. Nvidia gifted its first DGX-1 supercomputer to OpenAI in August 2016 to help it train larger and more complex AI models with
A schedule (making the agent explore progressively less), or adaptively based on heuristics. Even if the issue of exploration is disregarded and even if the state were observable (assumed hereafter), the problem remains to use past experience to find out which actions lead to higher cumulative rewards. The agent's action selection is modeled as a map called policy: $\pi(s,a)=\Pr(A_{t}=a\mid S_{t}=s)$. The policy map gives the probability of taking action
A sense stronger than the one above: a policy is optimal if it achieves the best expected discounted return from any initial state (i.e., initial distributions play no role in this definition). Again, an optimal policy can always be found among stationary policies. To define optimality in a formal manner, define the state-value of a policy $\pi$ by $V^{\pi}(s)=\mathbb{E}[G\mid s,\pi]$, where $G$ stands for
A specialized deep learning model adept at generating complex digital images from textual descriptions, utilizing a variant of the GPT-3 architecture. In December 2022, OpenAI received widespread media coverage after launching a free preview of ChatGPT, its new AI chatbot based on GPT-3.5. According to OpenAI, the preview received over a million signups within the first five days. According to anonymous sources cited by Reuters in December 2022, OpenAI Global, LLC
A standalone Microsoft Copilot app released for Android and one released for iOS thereafter. In October 2023, Sam Altman and Peng Xiao, CEO of the Emirati AI firm G42, announced OpenAI would let G42 deploy OpenAI technology. On November 6, 2023, OpenAI launched GPTs, allowing individuals to create customized versions of ChatGPT for specific purposes, further expanding the possibilities of AI applications across various industries. On November 14, 2023, OpenAI announced they temporarily suspended new sign-ups for ChatGPT Plus due to high demand. Access for newer subscribers re-opened
A starting point, giving rise to the Q-learning algorithm and its many variants, including deep Q-learning methods, in which a neural network is used to represent Q, with various applications in stochastic search problems. The problem with using action-values is that they may need highly precise estimates of the competing action values that can be hard to obtain when the returns are noisy, though this problem
A step-by-step (online) basis. The term “Monte Carlo” generally refers to any method involving random sampling; however, in this context, it specifically refers to methods that compute averages from complete returns, rather than partial returns. These methods function similarly to the bandit algorithms, in which returns are averaged for each state-action pair. The key difference is that actions taken in one state affect
A tool that can be used for reasoning (using the Bayesian inference algorithm), learning (using the expectation–maximization algorithm), planning (using decision networks) and perception (using dynamic Bayesian networks). Probabilistic algorithms can also be used for filtering, prediction, smoothing, and finding explanations for streams of data, thus helping perception systems analyze processes that occur over time (e.g., hidden Markov models or Kalman filters). The simplest AI applications can be divided into two types: classifiers (e.g., "if shiny then diamond"), on one hand, and controllers (e.g., "if diamond then pick up"), on
A waitlist) and as a feature of ChatGPT Plus. On May 22, 2023, Sam Altman, Greg Brockman and Ilya Sutskever posted recommendations for the governance of superintelligence. They consider that superintelligence could happen within the next 10 years, allowing a "dramatically more prosperous future" and that "given the possibility of existential risk, we can't just be reactive". They propose creating an international watchdog organization similar to IAEA to oversee AI systems above
A wide range of techniques, including search and mathematical optimization, formal logic, artificial neural networks, and methods based on statistics, operations research, and economics. AI also draws upon psychology, linguistics, philosophy, neuroscience, and other fields. Artificial intelligence was founded as an academic discipline in 1956, and the field went through multiple cycles of optimism, followed by periods of disappointment and loss of funding, known as AI winter. Funding and interest vastly increased after 2012 when deep learning outperformed previous AI techniques. This growth accelerated further after 2017 with
OpenAI - Misplaced Pages Continue
A wide variety of techniques to accomplish the goals above. AI can solve many problems by intelligently searching through many possible solutions. There are two very different kinds of search used in AI: state space search and local search. State space search searches through a tree of possible states to try to find a goal state. For example, planning algorithms search through trees of goals and subgoals, attempting to find
Is intelligence exhibited by machines, particularly computer systems. It is a field of research in computer science that develops and studies methods and software that enable machines to perceive their environment and use learning and intelligence to take actions that maximize their chances of achieving defined goals. Such machines may be called AIs. Some high-profile applications of AI include advanced web search engines (e.g., Google Search); recommendation systems (used by YouTube, Amazon, and Netflix); interacting via human speech (e.g., Google Assistant, Siri, and Alexa); autonomous vehicles (e.g., Waymo); generative and creative tools (e.g., ChatGPT and AI art); and superhuman play and analysis in strategy games (e.g., chess and Go). However, many AI applications are not perceived as AI: "A lot of cutting edge AI has filtered into general applications, often without being called AI because once something becomes useful enough and common enough it's not labeled AI anymore." The various subfields of AI research are centered around particular goals and
Is stationary if the action-distribution returned by it depends only on the last state visited (from the agent's observation history). The search can be further restricted to deterministic stationary policies. A deterministic stationary policy deterministically selects actions based on the current state. Since any such policy can be identified with a mapping from the set of states to the set of actions, these policies can be identified with such mappings with no loss of generality. The brute force approach entails two steps: (1) for each possible policy, sample returns while following it; (2) choose the policy with the largest expected discounted return. One problem with this
Is a body of knowledge represented in a form that can be used by a program. An ontology is the set of objects, relations, concepts, and properties used by a particular domain of knowledge. Knowledge bases need to represent things such as objects, properties, categories, and relations between objects; situations, events, states, and time; causes and effects; knowledge about knowledge (what we know about what other people know); default reasoning (things that humans assume are true until they are told differently and will remain true even when other facts are changing); and many other aspects and domains of knowledge. Among
Is a state randomly sampled from the distribution $\mu$ of initial states (so $\mu(s)=\Pr(S_{0}=s)$). Although state-values suffice to define optimality, it is useful to define action-values. Given a state $s$, an action
Is aimed at answering questions in natural language, but it can also translate between languages and coherently generate improvised text. OpenAI also announced that an associated API, named simply "the API", would form the heart of its first commercial product. Eleven employees left OpenAI, mostly between December 2020 and January 2021, in order to establish Anthropic. In 2021, OpenAI introduced DALL-E,
Is allowed to change, $V^{*}(s)=\max_{\pi}V^{\pi}(s)$. A policy that achieves these optimal state-values in each state is called optimal. Clearly, a policy that is optimal in this sense is also optimal in the sense that it maximizes the expected discounted return, since $V^{*}(s)=\max_{\pi}\mathbb{E}[G\mid s,\pi]$, where $s$
Is an input, at least one hidden layer of nodes and an output. Each node applies a function and once the weight crosses its specified threshold, the data is transmitted to the next layer. A network is typically called a deep neural network if it has at least 2 hidden layers. Learning algorithms for neural networks use local search to choose the weights that will get the right output for each input during training. The most common training technique
Is an interdisciplinary umbrella that comprises systems that recognize, interpret, process, or simulate human feeling, emotion, and mood. For example, some virtual assistants are programmed to speak conversationally or even to banter humorously; it makes them appear more sensitive to the emotional dynamics of human interaction, or to otherwise facilitate human–computer interaction. However, this tends to give naïve users an unrealistic conception of
Is an unsolved problem. Knowledge representation and knowledge engineering allow AI programs to answer questions intelligently and make deductions about real-world facts. Formal knowledge representations are used in content-based indexing and retrieval, scene interpretation, clinical decision support, knowledge discovery (mining "interesting" and actionable inferences from large databases), and other areas. A knowledge base
Is anything that perceives and takes actions in the world. A rational agent has goals or preferences and takes actions to make them happen. In automated planning, the agent has a specific goal. In automated decision-making, the agent has preferences: there are some situations it would prefer to be in, and some situations it is trying to avoid. The decision-making agent assigns a number to each situation (called
Is barred from having financial stakes in OpenAI Global, LLC. In addition, minority members with a stake in OpenAI Global, LLC are barred from certain votes due to conflict of interest. Some researchers have argued that OpenAI Global, LLC's switch to for-profit status is inconsistent with OpenAI's claims to be "democratizing" AI. In 2020, OpenAI announced GPT-3, a language model trained on large internet datasets. GPT-3
Is chosen, and the agent chooses the action that it believes has the best long-term effect (ties between actions are broken uniformly at random). Alternatively, with probability $\varepsilon$, exploration is chosen, and the action is chosen uniformly at random. $\varepsilon$ is usually a fixed parameter but can be adjusted either according to
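The exploitation/exploration rule described in this fragment can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name `epsilon_greedy`, the list-of-floats representation of the action-value estimates, and the optional `rng` argument are all assumptions made for the example.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index given estimated action values (hypothetical helper).

    q_values: list of estimated long-term values, one per action (assumed layout).
    epsilon:  probability of exploring instead of exploiting.
    rng:      object providing random(), randrange() and choice().
    """
    if rng.random() < epsilon:
        # Exploration: pick any action uniformly at random.
        return rng.randrange(len(q_values))
    # Exploitation: pick a best-looking action, breaking ties uniformly at random.
    best = max(q_values)
    best_actions = [a for a, q in enumerate(q_values) if q == best]
    return rng.choice(best_actions)
```

With `epsilon=0.0` the rule is purely greedy, and with `epsilon=1.0` it is purely random; a schedule that shrinks `epsilon` over time would make the agent explore progressively less.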
Is classified based on previous experience. There are many kinds of classifiers in use. The decision tree is the simplest and most widely used symbolic machine learning algorithm. The k-nearest neighbor algorithm was the most widely used analogical AI until the mid-1990s, and kernel methods such as the support vector machine (SVM) displaced k-nearest neighbor in the 1990s. The naive Bayes classifier
Is known for the GPT family of large language models, the DALL-E series of text-to-image models, and a text-to-video model named Sora. Its release of ChatGPT in November 2022 has been credited with catalyzing widespread interest in generative AI. The organization consists of the non-profit OpenAI, Inc., registered in Delaware, and its for-profit subsidiary introduced in 2019, OpenAI Global, LLC. Microsoft owns roughly 49% of OpenAI's equity, having invested US$13 billion. It also provides computing resources to OpenAI through its cloud platform, Microsoft Azure. In 2023 and 2024, OpenAI faced multiple lawsuits for alleged copyright infringement against authors and media companies whose work
Is labelled by a solution of the problem and whose leaf nodes are labelled by premises or axioms. In the case of Horn clauses, problem-solving search can be performed by reasoning forwards from the premises or backwards from the problem. In the more general case of the clausal form of first-order logic, resolution is a single, axiom-free rule of inference, in which a problem is solved by proving
Is mitigated to some extent by temporal difference methods. Using the so-called compatible function approximation method compromises generality and efficiency. An alternative method is to search directly in (some subset of) the policy space, in which case the problem becomes a case of stochastic optimization. The two approaches available are gradient-based and gradient-free methods. Gradient-based methods (policy gradient methods) start with
Is reportedly the "most widely used learner" at Google, due in part to its scalability. Neural networks are also used as classifiers. An artificial neural network is based on a collection of nodes also known as artificial neurons, which loosely model the neurons in a biological brain. It is trained to recognise patterns; once trained, it can recognise those patterns in fresh data. There
Is that the latter do not assume knowledge of an exact mathematical model of the Markov decision process, and they target large MDPs where exact methods become infeasible. Due to its generality, reinforcement learning is studied in many disciplines, such as game theory, control theory, operations research, information theory, simulation-based optimization, multi-agent systems, swarm intelligence, and statistics. In
Is that the number of policies can be large, or even infinite. Another is that the variance of the returns may be large, which requires many samples to accurately estimate the discounted return of each policy. These problems can be ameliorated if we assume some structure and allow samples generated from one policy to influence the estimates made for others. The two main approaches for achieving this are value function estimation and direct policy search. Value function approaches attempt to find
Is the backpropagation algorithm. Neural networks learn to model complex relationships between inputs and outputs and find patterns in data. In theory, a neural network can learn any function. In reinforcement learning, Q-learning at its simplest stores data in tables. This approach becomes infeasible as the number of states/actions increases (e.g., if the state space or action space were continuous), as
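To make the table-based form of Q-learning concrete, here is a small sketch of a single update on a dictionary-backed table. It uses the standard one-step Q-learning rule; the function name `q_learning_update`, the `(state, action)` key layout, and the default step size `alpha` and discount `gamma` are assumptions chosen for illustration and do not come from the text.

```python
from collections import defaultdict


def q_learning_update(q_table, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update and return the new estimate.

    q_table: mapping from (state, action) to an estimated value (assumed layout).
    actions: the actions available in next_state.
    alpha:   step size (illustrative default).
    gamma:   discount rate (illustrative default).
    """
    # Value of the best action in the next state under the current estimates.
    best_next = max(q_table[(next_state, a)] for a in actions)
    # Move the stored estimate a fraction alpha toward the one-step target.
    td_target = reward + gamma * best_next
    q_table[(state, action)] += alpha * (td_target - q_table[(state, action)])
    return q_table[(state, action)]


# Unseen (state, action) pairs start at 0.0.
q_table = defaultdict(float)
```

Because the table holds one entry per state-action pair, its size grows with the product of the number of states and actions, which is the scaling problem the fragment points to.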
Is the discount rate. $\gamma$ is less than 1, so rewards in the distant future are weighted less than rewards in the immediate future. The algorithm must find a policy with maximum expected discounted return. From the theory of Markov decision processes it is known that, without loss of generality, the search can be restricted to the set of so-called stationary policies. A policy
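As a small worked illustration of how a discount rate below 1 down-weights later rewards, the sketch below computes the discounted return of a finite reward sequence. The function name `discounted_return` and the indexing convention (the first reward in the list is undiscounted, the next is multiplied by $\gamma$, and so on) are assumptions made for this example.

```python
def discounted_return(rewards, gamma):
    """Sum rewards[k] * gamma**k over a finite episode (hypothetical helper)."""
    g = 0.0
    # Work backwards so each step applies one extra factor of gamma
    # to everything that comes after it.
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

For example, three rewards of 1 with `gamma=0.5` give 1 + 0.5 + 0.25 = 1.75, whereas the same rewards with `gamma=1.0` would simply sum to 3.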
Is the process of proving a new statement (conclusion) from other statements that are given and assumed to be true (the premises). Proofs can be structured as proof trees, in which nodes are labelled by sentences, and children nodes are connected to parent nodes by inference rules. Given a problem and a set of premises, problem-solving reduces to searching for a proof tree whose root node
Is used for game-playing programs, such as chess or Go. It searches through a tree of possible moves and counter-moves, looking for a winning position. Local search uses mathematical optimization to find a solution to a problem. It begins with some form of guess and refines it incrementally. Gradient descent is a type of local search that optimizes a set of numerical parameters by incrementally adjusting them to minimize
Is used for reasoning and knowledge representation. Formal logic comes in two main forms: propositional logic (which operates on statements that are true or false and uses logical connectives such as "and", "or", "not" and "implies") and predicate logic (which also operates on objects, predicates and relations and uses quantifiers such as "Every X is a Y" and "There are some Xs that are Ys"). Deductive reasoning in logic
Is used in AI programs that make decisions that involve other agents. Machine learning is the study of programs that can improve their performance on a given task automatically. It has been a part of AI from the beginning. There are several kinds of machine learning. Unsupervised learning analyzes a stream of data and finds patterns and makes predictions without any other guidance. Supervised learning requires labeling
Is when the knowledge gained from one problem is applied to a new problem. Deep learning is a type of machine learning that runs inputs through biologically inspired artificial neural networks for all of these types of learning. Computational learning theory can assess learners by computational complexity, by sample complexity (how much data is required), or by other notions of optimization. Natural language processing (NLP) allows programs to read, write and communicate in human languages such as English. Specific problems include speech recognition, speech synthesis, machine translation, information extraction, information retrieval and question answering. Early work, based on Noam Chomsky's generative grammar and semantic networks, had difficulty with word-sense disambiguation unless restricted to small domains called "micro-worlds" (due to
The Southern New York U.S. Attorney's Office the previous November, was ongoing. On February 15, 2024, OpenAI announced a text-to-video model named Sora, which it plans to release to the public at an unspecified date. It is currently available for red teams for managing critical harms and risks. On February 29, 2024, OpenAI and CEO Sam Altman were sued by Elon Musk, who accused them of prioritizing profits over public good, contrary to OpenAI's original mission of developing AI for humanity's benefit. The lawsuit cited OpenAI's policy shift after partnering with Microsoft, questioning its open-source commitment and stirring
The bar exam, SAT test, GRE test, and many other real-world applications. Machine perception is the ability to use input from sensors (such as cameras, microphones, wireless signals, active lidar, sonar, radar, and tactile sensors) to deduce aspects of the world. Computer vision is the ability to analyze visual input. The field includes speech recognition, image classification, facial recognition, object recognition, object tracking, and robotic perception. Affective computing
The multi-armed bandit problem and for finite state space Markov decision processes in Burnetas and Katehakis (1997). Reinforcement learning requires clever exploration mechanisms; randomly selecting actions, without reference to an estimated probability distribution, shows poor performance. The case of (small) finite Markov decision processes is relatively well understood. However, due to the lack of algorithms that scale well with
The transformer architecture, and by the early 2020s hundreds of billions of dollars were being invested in AI (known as the "AI boom"). The widespread use of AI in the 21st century exposed several unintended consequences and harms in the present and raised concerns about its risks and long-term effects in the future, prompting discussions about regulatory policies to ensure the safety and benefits of
The "utility") that measures how much the agent prefers it. For each possible action, it can calculate the "expected utility": the utility of all possible outcomes of the action, weighted by the probability that the outcome will occur. It can then choose the action with the maximum expected utility. In classical planning, the agent knows exactly what the effect of any action will be. In most real-world problems, however,
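The expected-utility calculation described in this fragment can be sketched in a few lines. The data layout is an assumption made for illustration: each action maps to a list of `(probability, utility)` pairs, and the names `expected_utility` and `best_action` are hypothetical.

```python
def expected_utility(outcomes):
    """Probability-weighted sum of utilities over (probability, utility) pairs."""
    return sum(p * u for p, u in outcomes)


def best_action(action_outcomes):
    """Return the action whose outcome list has the highest expected utility."""
    return max(action_outcomes,
               key=lambda a: expected_utility(action_outcomes[a]))


# Illustrative numbers only: a certain payoff of 5 versus a coin flip on 12.
actions = {
    "safe": [(1.0, 5.0)],
    "risky": [(0.5, 12.0), (0.5, 0.0)],
}
chosen = best_action(actions)
```

Here "safe" has expected utility 5 and "risky" has expected utility 6, so an agent that maximizes expected utility picks "risky".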
The $1 billion "within five years, and possibly much faster." Altman has stated that even a billion dollars may turn out to be insufficient, and that the lab may ultimately need "more capital than any non-profit has ever raised" to achieve artificial general intelligence. The transition from a nonprofit to a capped-profit company was viewed with skepticism by Oren Etzioni of the nonprofit Allen Institute for AI, who agreed that wooing top researchers to
The AI ethics-vs.-profit debate. In a blog post, OpenAI stated that "Elon understood the mission did not imply open-sourcing AGI." In a staff memo, they also denied being a de facto Microsoft subsidiary. In a March 11, 2024, court filing, OpenAI said it was "doing just fine without Elon Musk" after he left the company in 2018. They also responded to Musk's lawsuit, calling the billionaire's claims "incoherent", "frivolous", "extraordinary" and "a fiction". On June 11, 2024, Musk unexpectedly withdrew
The Bellman equations. This can be effective in palliating this issue. In order to address the fifth issue, function approximation methods are used. Linear function approximation starts with a mapping $\phi$ that assigns a finite-dimensional vector to each state-action pair. Then, the action values of a state-action pair $(s,a)$
The absence of a mathematical model of the environment). Basic reinforcement learning is modeled as a Markov decision process. The purpose of reinforcement learning is for the agent to learn an optimal (or near-optimal) policy that maximizes the reward function or other user-provided reinforcement signal that accumulates from immediate rewards. This is similar to processes that appear to occur in animal psychology. For example, biological brains are hardwired to interpret signals such as pain and hunger as negative reinforcements, and interpret pleasure and food intake as positive reinforcements. In some circumstances, animals learn to adopt behaviors that optimize these rewards. This suggests that animals are capable of reinforcement learning. A basic reinforcement learning agent interacts with its environment in discrete time steps. At each time step $t$,
8228-447: The accuracy of AI models like ChatGPT by incorporating reliable news sources, addressing concerns about AI misinformation. Concerns were expressed about the decision by journalists, including those working for the publications, as well as the publications' unions. The Vox Union stated, "As both journalists and workers, we have serious concerns about this partnership, which we believe could adversely impact members of our union, not to mention
8349-421: The agent can seek information to improve its preferences. Information value theory can be used to weigh the value of exploratory or experimental actions. The space of possible future actions and situations is typically intractably large, so the agents must take actions and evaluate situations while being uncertain of what the outcome will be. A Markov decision process has a transition model that describes
8470-510: The agent may not be certain about the situation it is in (it is "unknown" or "unobservable") and it may not know for certain what will happen after each possible action (it is not "deterministic"). It must choose an action by making a probabilistic guess and then reassess the situation to see if the action worked. In some problems, the agent's preferences may be uncertain, especially if there are other agents or humans involved. These can be learned (e.g., with inverse reinforcement learning), or
8591-406: The agent only has access to a subset of states, or if the observed states are corrupted by noise, the agent is said to have partial observability , and formally the problem must be formulated as a partially observable Markov decision process . In both cases, the set of actions available to the agent can be restricted. For example, the state of an account balance could be restricted to be positive; if
8712-461: The agent receives the current state $S_t$ and reward $R_t$. It then chooses an action $A_t$ from the set of available actions, which is subsequently sent to the environment. The environment moves to a new state $S_{t+1}$ and
8833-529: The agent to operate with incomplete or uncertain information. AI researchers have devised a number of tools to solve these problems using methods from probability theory and economics. Precise mathematical tools have been developed that analyze how an agent can make choices and plan, using decision theory , decision analysis , and information value theory . These tools include models such as Markov decision processes , dynamic decision networks , game theory and mechanism design . Bayesian networks are
8954-442: The batch). Batch methods, such as the least-squares temporal difference method, may use the information in the samples better, while incremental methods are the only choice when batch methods are infeasible due to their high computational or memory complexity. Some methods try to combine the two approaches. Methods based on temporal differences also overcome the fourth issue. Another problem specific to TD comes from their reliance on
9075-457: The board rejected. Musk subsequently left OpenAI. In February 2019, GPT-2 was announced, which gained attention for its ability to generate human-like text. In 2019, OpenAI transitioned from non-profit to "capped" for-profit, with the profit being capped at 100 times any investment. According to OpenAI, the capped-profit model allows OpenAI Global, LLC to legally attract investment from venture funds and, in addition, to grant employees stakes in
9196-450: The capability of reducing processing time from six days to two hours. In December 2016, OpenAI released "Universe", a software platform for measuring and training an AI's general intelligence across the world's supply of games, websites, and other applications. In 2017, OpenAI spent $ 7.9 million, or a quarter of its functional expenses, on cloud computing alone. In comparison, DeepMind 's total expenses in 2017 were $ 442 million. In
9317-566: The co-chairs. A total of $ 1 billion in capital was pledged by Sam Altman, Greg Brockman, Elon Musk, Reid Hoffman , Jessica Livingston , Peter Thiel , Amazon Web Services (AWS), Infosys , and YC Research . The actual collected total amount of contributions was only $ 130 million until 2019. According to an investigation led by TechCrunch , Musk was its largest donor while YC Research did not contribute anything at all. The organization stated it would "freely collaborate" with other institutions and researchers by making its patents and research open to
9438-648: The common sense knowledge problem ). Margaret Masterman believed that it was meaning and not grammar that was the key to understanding languages, and that thesauri and not dictionaries should be the basis of computational language structure. Modern deep learning techniques for NLP include word embedding (representing words, typically as vectors encoding their meaning), transformers (a deep learning architecture using an attention mechanism), and others. In 2019, generative pre-trained transformer (or "GPT") language models began to generate coherent text, and by 2023, these models were able to get human-level scores on
9559-495: The company at $ 157 billion and solidifying its status as one of the most valuable private firms globally. The funding attracted returning venture capital firms like Thrive Capital and Khosla Ventures , along with major backer Microsoft and new investors Nvidia and Softbank . OpenAI's CFO , Sarah Friar, informed employees that a tender offer for share buybacks would follow the funding, although specifics were yet to be determined. Thrive Capital invested around $ 1.2 billion, with
9680-537: The company favors "smart regulation" and sees the UK's AI white paper as a positive step towards responsible AI development. On September 25, OpenAI's Chief Technology Officer (CTO) Mira Murati announced her departure from the company to "create the time and space to do my own exploration". It had previously been reported Murati was among those who expressed concerns to the Board about Altman. In October 2024, OpenAI raised $ 6.6 billion from investors, potentially valuing
9801-618: The company. Many top researchers work for Google Brain , DeepMind, or Facebook , which offer stock options that a nonprofit would be unable to. Before the transition, public disclosure of the compensation of top employees at OpenAI was legally required. The company then distributed equity to its employees and partnered with Microsoft, announcing an investment package of $ 1 billion into the company. Since then, OpenAI systems have run on an Azure -based supercomputing platform from Microsoft. OpenAI Global, LLC then announced its intention to commercially license its technologies. It planned to spend
9922-436: The cost of a top AI researcher exceeds the cost of a top NFL quarterback prospect. OpenAI's potential and mission drew these researchers to the firm; a Google employee said he was willing to leave Google for OpenAI "partly because of the very strong group of people and, to a very large extent, because of its mission." Brockman stated that "the best thing that I could imagine doing was moving humanity closer to building real AI in
10043-416: The current value of the state is 3 and the state transition attempts to reduce the value by 4, the transition will not be allowed. When the agent's performance is compared to that of an agent that acts optimally, the difference in performance yields the notion of regret . In order to act near optimally, the agent must reason about long-term consequences of its actions (i.e., maximize future rewards), although
10164-459: The discounted return associated with following $\pi$ from the initial state $s$. Defining $V^{*}(s)$ as the maximum possible state-value of $V^{\pi}(s)$, where $\pi$
10285-428: The environment’s dynamics, Monte Carlo methods rely solely on actual or simulated experience—sequences of states, actions, and rewards obtained from interaction with an environment. This makes them applicable in situations where the complete dynamics are unknown. Learning from actual experience does not require prior knowledge of the environment and can still lead to optimal behavior. When using simulated experience, only
10406-455: The goal of maximizing the cumulative reward (the feedback of which might be incomplete or delayed). The search for this balance is known as the exploration-exploitation dilemma . The environment is typically stated in the form of a Markov decision process (MDP), as many reinforcement learning algorithms use dynamic programming techniques. The main difference between classical dynamic programming methods and reinforcement learning algorithms
10527-913: The group, led OpenAI to absorb the team's work into other research areas, and officially shut down the superalignment group. According to sources interviewed by Fortune , OpenAI's promise of allocating 20% of its computing capabilities to the superalignment project had not been fulfilled. On May 19, 2024, Reddit and OpenAI announced a partnership to integrate Reddit's content into OpenAI products, including ChatGPT . This collaboration allows OpenAI to access Reddit's Data API , providing real-time, structured content to enhance AI tools and user engagement with Reddit communities. Additionally, Reddit plans to develop new AI-powered features for users and moderators using OpenAI's platform. The partnership aligns with Reddit's commitment to privacy, adhering to its Public Content Policy and existing Data API Terms, which restrict commercial use without approval. OpenAI will also serve as
10648-415: The highest action-value at each state, $s$. The action-value function of such an optimal policy ($Q^{\pi^{*}}$) is called the optimal action-value function and is commonly denoted by $Q^{*}$. In summary, the knowledge of
10769-454: The immediate reward associated with this might be negative. Thus, reinforcement learning is particularly well-suited to problems that include a long-term versus short-term reward trade-off. It has been applied successfully to various problems, including energy storage, robot control, photovoltaic generators, backgammon, checkers, Go (AlphaGo), and autonomous driving systems. Two elements make reinforcement learning powerful:
10890-440: The intelligence of existing computer agents. Moderate successes related to affective computing include textual sentiment analysis and, more recently, multimodal sentiment analysis , wherein AI classifies the affects displayed by a videotaped subject. A machine with artificial general intelligence should be able to solve a wide variety of problems with breadth and versatility similar to human intelligence . AI research uses
11011-537: The late 1980s and 1990s, methods were developed for dealing with uncertain or incomplete information, employing concepts from probability and economics . Many of these algorithms are insufficient for solving large reasoning problems because they experience a "combinatorial explosion": They become exponentially slower as the problems grow. Even humans rarely use the step-by-step deduction that early AI research could model. They solve most of their problems using fast, intuitive judgments. Accurate and efficient reasoning
11132-580: The lawsuit. On August 5, 2024, Musk reopened the lawsuit against Altman and others, alleging that Altman claimed that OpenAI was going to be founded as a non-profit organization. On May 15, 2024, Ilya Sutskever resigned from OpenAI and was replaced with Jakub Pachocki to be the Chief Scientist. Hours later, Jan Leike , the other co-leader of the superalignment team, announced his departure, citing an erosion of safety and trust in OpenAI's leadership. Their departures along with several researchers leaving
11253-596: The long-standing domain Chat.com and redirected it to ChatGPT's main site. Moreover, Greg Brockman rejoined OpenAI after a three-month leave from his role as president. An OpenAI spokesperson confirmed his return, highlighting that Brockman would collaborate with Altman on tackling key technical challenges. His return followed a wave of high-profile departures, including Mira Murati and Ilya Sutskever, who had since launched their own AI ventures. Artificial intelligence Artificial intelligence ( AI ), in its broadest sense,
11374-457: The most difficult problems in knowledge representation are the breadth of commonsense knowledge (the set of atomic facts that the average person knows is enormous); and the sub-symbolic form of most commonsense knowledge (much of what people know is not represented as "facts" or "statements" that they could express verbally). There is also the difficulty of knowledge acquisition , the problem of obtaining knowledge for AI applications. An "agent"
11495-511: The number of states (or scale to problems with infinite state spaces), simple exploration methods are the most practical. One such method is $\varepsilon$-greedy, where $0<\varepsilon<1$ is a parameter controlling the amount of exploration vs. exploitation. With probability $1-\varepsilon$, exploitation
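As a rough illustration of the ε-greedy rule, the following minimal sketch exploits (picks the action with the highest estimated value) with probability 1 − ε and otherwise explores by choosing uniformly at random. The function name `epsilon_greedy`, the list-of-floats representation of action values, and the `rng` parameter are assumptions made for this example only.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index chosen by the epsilon-greedy rule.

    q_values: list of estimated action values for the current state
              (hypothetical representation chosen for this sketch).
    epsilon:  exploration probability, strictly between 0 and 1.
    rng:      any object with random() and randrange(), e.g. random.Random.
    """
    if not 0.0 < epsilon < 1.0:
        raise ValueError("epsilon must lie strictly between 0 and 1")
    if rng.random() < epsilon:
        # Explore: pick uniformly at random among all actions.
        return rng.randrange(len(q_values))
    # Exploit: pick the action with the highest estimated value.
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

A larger ε spends more steps exploring; a smaller ε relies more heavily on the current value estimates.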
11616-454: The operations research and control literature, RL is called approximate dynamic programming , or neuro-dynamic programming. The problems of interest in RL have also been studied in the theory of optimal control , which is concerned mostly with the existence and characterization of optimal solutions, and algorithms for their exact computation, and less with learning or approximation (particularly in
11737-625: The optimal action-value function alone suffices to know how to act optimally. Assuming full knowledge of the Markov decision process, the two basic approaches to compute the optimal action-value function are value iteration and policy iteration. Both algorithms compute a sequence of functions $Q_k$ ($k=0,1,2,\ldots$) that converge to $Q^{*}$. Computing these functions involves computing expectations over
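To make the idea of a sequence $Q_k$ converging to $Q^{*}$ concrete, here is a minimal sketch of value iteration on action values for a small, fully known MDP. The data layout (`P[s][a]` as a list of `(probability, next_state)` pairs and `R[s][a]` as the expected immediate reward) and the fixed iteration count are assumptions for illustration, not a prescribed interface.

```python
def q_value_iteration(P, R, gamma, n_iters=200):
    """Approximate the optimal action-value function of a known finite MDP.

    P[s][a]: list of (probability, next_state) pairs (assumed layout).
    R[s][a]: expected immediate reward for taking action a in state s.
    gamma:   discount rate, 0 <= gamma < 1.
    """
    n_states = len(P)
    # Q_0 is all zeros; each sweep below produces Q_{k+1} from Q_k.
    Q = [[0.0 for _ in P[s]] for s in range(n_states)]
    for _ in range(n_iters):
        Q = [
            [
                # Bellman optimality backup: reward plus the discounted
                # expected value of acting greedily in the next state.
                R[s][a] + gamma * sum(p * max(Q[s2]) for p, s2 in P[s][a])
                for a in range(len(P[s]))
            ]
            for s in range(n_states)
        ]
    return Q


# Toy two-state MDP with deterministic transitions (hypothetical example).
P = [
    [[(1.0, 0)], [(1.0, 1)]],  # state 0: action 0 stays, action 1 moves to 1
    [[(1.0, 1)], [(1.0, 0)]],  # state 1: action 0 stays, action 1 moves to 0
]
R = [[0.0, 1.0], [2.0, 0.0]]
Q_star = q_value_iteration(P, R, gamma=0.5)
```

The inner sum is the expectation over next states, which is exactly the step that becomes impractical when the state space is large.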
11858-534: The option for an additional $ 1 billion if revenue goals were met. Apple, despite initial interest, did not participate in this funding round. Also in October 2024, The Intercept revealed that OpenAI's tools were considered "essential" for AFRICOM 's mission and included in an "Exception to Fair Opportunity" contractual agreement between the Department of Defense and Microsoft. In November 2024, OpenAI acquired
11979-405: The other hand. Classifiers are functions that use pattern matching to determine the closest match. They can be fine-tuned based on chosen examples using supervised learning . Each pattern (also called an " observation ") is labeled with a certain predefined class. All the observations combined with their class labels are known as a data set . When a new observation is received, that observation
12100-606: The prediction problem and then extending to policy improvement and control, all based on sampled experience. The first problem is corrected by allowing the procedure to change the policy (at some or all states) before the values settle. This too may be problematic as it might prevent convergence. Most current algorithms do this, giving rise to the class of generalized policy iteration algorithms. Many actor-critic methods belong to this category. The second issue can be corrected by allowing trajectories to contribute to any state-action pair in them. This may also help to some extent with
12221-422: The probability of the agent visiting a particular state and performing a particular action diminishes. Reinforcement learning differs from supervised learning in not needing labelled input-output pairs to be presented, and in not needing sub-optimal actions to be explicitly corrected. Instead, the focus is on finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge) with
12342-411: The probability that a particular action will change the state in a particular way and a reward function that supplies the utility of each state and the cost of each action. A policy associates a decision with each possible state. The policy could be calculated (e.g., by iteration ), be heuristic , or it can be learned. Game theory describes the rational behavior of multiple interacting agents and
12463-744: The public. OpenAI was initially run from Brockman's living room. It was later headquartered at the Pioneer Building in the Mission District, San Francisco . According to Wired , Brockman met with Yoshua Bengio , one of the "founding fathers" of deep learning , and drew up a list of the "best researchers in the field". Brockman was able to hire nine of them as the first employees in December 2015. In 2016, OpenAI paid corporate-level (rather than nonprofit-level) salaries, but did not pay AI researchers salaries comparable to those of Facebook or Google . Microsoft's Peter Lee stated that
12584-407: The recursive Bellman equation. Most TD methods have a so-called $\lambda$ parameter $(0\leq \lambda \leq 1)$ that can continuously interpolate between Monte Carlo methods that do not rely on the Bellman equations and the basic TD methods that rely entirely on
12705-550: The returns of subsequent states within the same episode, making the problem non-stationary . To address this non-stationarity, Monte Carlo methods use the framework of general policy iteration (GPI). While dynamic programming computes value functions using full knowledge of the Markov decision process (MDP), Monte Carlo methods learn these functions through sample returns. The value functions and policies interact similarly to dynamic programming to achieve optimality , first addressing
12826-577: The reward $R_{t+1}$ associated with the transition $(S_t, A_t, S_{t+1})$ is determined. The goal of a reinforcement learning agent is to learn a policy: $\pi : \mathcal{S}\times\mathcal{A}\rightarrow [0,1]$, π ( s ,
12947-465: The same foundation as ChatGPT into Microsoft Bing , Edge , Microsoft 365 and other products. On March 3, 2023, Reid Hoffman resigned from his board seat, citing a desire to avoid conflicts of interest with his investments in AI companies via Greylock Partners , and his co-founding of the AI startup Inflection AI . Hoffman remained on the board of Microsoft, a major investor in OpenAI. On March 14, 2023, OpenAI released GPT-4 , both as an API (with
13068-490: The summer of 2018, simply training OpenAI's Dota 2 bots required renting 128,000 CPUs and 256 GPUs from Google for multiple weeks. In 2018, Musk resigned from his Board of Directors seat, citing "a potential future conflict [of interest]" with his role as CEO of Tesla due to Tesla's AI development for self-driving cars. Sam Altman claims that Musk believed that OpenAI had fallen behind other players like Google and that Musk proposed instead to take over OpenAI himself, which
13189-471: The technology . The general problem of simulating (or creating) intelligence has been broken into subproblems. These consist of particular traits or capabilities that researchers expect an intelligent system to display. The traits described below have received the most attention and cover the scope of AI research. Early researchers developed algorithms that imitated step-by-step reasoning that humans use when they solve puzzles or make logical deductions . By
13310-449: The third problem, although a better solution when returns have high variance is Sutton's temporal difference (TD) methods that are based on the recursive Bellman equation . The computation in TD methods can be incremental (when after each transition the memory is changed and the transition is thrown away), or batch (when the transitions are batched and the estimates are computed once based on
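The incremental style of TD computation, where each transition updates the stored estimate and is then discarded, might look like the following tabular TD(0) sketch for state values. The function name `td0_update`, the dictionary of values, the step size `alpha`, and the `terminal` flag are all assumptions introduced for this example.

```python
def td0_update(values, state, reward, next_state, alpha, gamma, terminal=False):
    """Apply one incremental TD(0) update and return the new estimate.

    values:     dict mapping state -> estimated value (missing states count as 0).
    alpha:      step size (an assumed hyperparameter, 0 < alpha <= 1).
    gamma:      discount rate.
    terminal:   True if next_state ends the episode, so it has no future value.
    """
    bootstrap = 0.0 if terminal else gamma * values.get(next_state, 0.0)
    # TD target: observed reward plus discounted estimate of the next state,
    # following the recursive Bellman equation.
    target = reward + bootstrap
    old = values.get(state, 0.0)
    # Move the old estimate a fraction alpha toward the target.
    values[state] = old + alpha * (target - old)
    return values[state]
```

Only the value table is kept in memory; a batch method would instead store the transitions and fit the estimates to all of them at once.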
13431-451: The training data with the expected answers, and comes in two main varieties: classification (where the program must learn to predict what category the input belongs in) and regression (where the program must deduce a numeric function based on numeric input). In reinforcement learning , the agent is rewarded for good responses and punished for bad ones. The agent learns to choose responses that are classified as "good". Transfer learning
13552-420: The use of particular tools. The traditional goals of AI research include reasoning , knowledge representation , planning , learning , natural language processing , perception, and support for robotics . General intelligence —the ability to complete any task performable by a human on an at least equal level—is among the field's long-term goals. To reach these goals, AI researchers have adapted and integrated
13673-586: The use of samples to optimize performance, and the use of function approximation to deal with large environments. Thanks to these two key components, RL can be used in large environments in the following situations: The first two of these problems could be considered planning problems (since some form of model is available), while the last one could be considered to be a genuine learning problem. However, reinforcement learning converts both planning problems to machine learning problems. The exploration vs. exploitation trade-off has been most thoroughly studied through
13794-596: The value function estimates "how good" it is to be in a given state: $V_{\pi}(s)=\mathbb{E}[G\mid S_{0}=s]$, where the random variable $G$ denotes the discounted return, and is defined as the sum of future discounted rewards: $G=\sum_{t=0}^{\infty}\gamma^{t}R_{t+1}$, where $R_{t+1}$ is the reward for transitioning from state $S_t$ to $S_{t+1}$, $0\leq \gamma <1$
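For a finite reward sequence, the discounted return can be computed with a single backward pass. The sketch below assumes the rewards are given as a list ordered in time; the function name `discounted_return` is hypothetical.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards, each discounted by gamma per elapsed time step.

    rewards: list ordered in time, first reward first (finite episode assumed).
    gamma:   discount rate, 0 <= gamma < 1.
    """
    g = 0.0
    # Work backwards so each step applies the recursion G = r + gamma * G_next.
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

For example, rewards `[1, 1, 1]` with `gamma = 0.5` give 1 + 0.5 + 0.25 = 1.75.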
13915-446: The well-documented ethical and environmental concerns surrounding the use of generative AI." A group of nine current and former OpenAI employees has accused the company of prioritizing profits over safety, using restrictive agreements to silence concerns, and moving too quickly with inadequate risk management. They call for greater transparency, whistleblower protections, and legislative regulation of AI development. On June 10, 2024, it
14036-472: The whole state-space, which is impractical for all but the smallest (finite) Markov decision processes. In reinforcement learning methods, expectations are approximated by averaging over samples and using function approximation techniques to cope with the need to represent value functions over large state-action spaces. Monte Carlo methods are used to solve reinforcement learning problems by averaging sample returns. Unlike methods that require full knowledge of
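Averaging sample returns can be sketched as first-visit Monte Carlo prediction: for every episode, the return following the first visit to each state is recorded, and the value estimate is the mean of those returns. The episode format (a list of `(state, reward)` pairs, with the reward received on leaving the state) and the function name are assumptions for this illustration.

```python
from collections import defaultdict


def mc_state_values(episodes, gamma):
    """Estimate state values by averaging first-visit sample returns.

    episodes: list of episodes, each a list of (state, reward) pairs where
              reward is received after leaving state (assumed format).
    gamma:    discount rate.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for episode in episodes:
        # Backward pass: compute the discounted return from every time step.
        g = 0.0
        returns = []
        for state, reward in reversed(episode):
            g = reward + gamma * g
            returns.append((state, g))
        # Forward pass: keep only the return from the first visit to each state.
        seen = set()
        for state, g_from_here in reversed(returns):
            if state in seen:
                continue
            seen.add(state)
            totals[state] += g_from_here
            counts[state] += 1
    return {state: totals[state] / counts[state] for state in totals}
```

No transition probabilities appear anywhere in the code, which is why this approach applies when the environment's dynamics are unknown.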
14157-526: Was announced at WWDC 2024 that OpenAI had partnered with Apple Inc. to bring ChatGPT features to Apple Intelligence and iPhone . On June 13, 2024, OpenAI announced that Paul Nakasone , the former head of the NSA was joining the company's board of directors. Nakasone also joined the company's security subcommittee. On June 24, 2024, OpenAI acquired Multi, a startup running a collaboration platform based on Zoom . In July 2024, Reuters reported that OpenAI
14278-648: Was announced that OpenAI had acquired the New York-based start-up Global Illumination, a company that deploys AI to develop digital infrastructure and creative tools. On September 21, 2023, Microsoft began rebranding all variants of its Copilot to Microsoft Copilot, including the former Bing Chat and the Microsoft 365 Copilot. This strategy was followed in December 2023 by adding the MS-Copilot to many installations of Windows 11 and Windows 10 as well as
14399-520: Was projecting $ 200 million of revenue in 2023 and $ 1 billion in revenue in 2024. In January 2023, OpenAI Global, LLC was in talks for funding that would value the company at $ 29 billion, double its 2021 value. On January 23, 2023, Microsoft announced a new US$ 10 billion investment in OpenAI Global, LLC over multiple years, partially needed to use Microsoft's cloud-computing service Azure . Rumors of this deal suggested that Microsoft may receive 75% of OpenAI's profits until it secures its investment return and
14520-539: Was used to train some of OpenAI's products. In November 2023, OpenAI's board removed Sam Altman as CEO, citing a lack of confidence in him, but reinstated him five days later after negotiations resulting in a reconstructed board. Many AI safety researchers left OpenAI in 2024. In December 2015, OpenAI was founded by Sam Altman , Elon Musk , Ilya Sutskever , Greg Brockman , Trevor Blackwell , Vicki Cheung, Andrej Karpathy , Durk Kingma, John Schulman, Pamela Vagata, and Wojciech Zaremba , with Sam Altman and Elon Musk as
14641-457: Was working on a project code-named "Strawberry" (previously known as Q*) aiming to enhance AI reasoning capabilities. The project reportedly seeks to enable AI to plan ahead, navigate the internet autonomously, and conduct "deep research". The project was officially released on September 12 and named o1 . On August 5, TechCrunch reported that OpenAI's cofounder John Schulman has left the company to join rival AI startup Anthropic . Schulman cited