In machine learning, diffusion models, also known as diffusion probabilistic models or score-based generative models, are a class of latent variable generative models. A diffusion model consists of three major components: the forward process, the reverse process, and the sampling procedure. The goal of diffusion models is to learn a diffusion process for a given dataset, such that the process can generate new elements distributed similarly to the original dataset. A diffusion model models data as generated by a diffusion process, whereby a new datum performs a random walk with drift through the space of all possible data. A trained diffusion model can be sampled in many ways, with different efficiency and quality.
DALL-E, DALL-E 2, and DALL-E 3 (stylised DALL·E, and pronounced DOLL-E) are text-to-image models developed by OpenAI using deep learning methodologies to generate digital images from natural language descriptions known as "prompts". The first version of DALL-E was announced in January 2021. In the following year, its successor DALL-E 2 was released. DALL-E 3 was released natively into ChatGPT for ChatGPT Plus and ChatGPT Enterprise customers in October 2023, with availability via OpenAI's API and "Labs" platform provided in early November. Microsoft implemented
A Transformer architecture. The first iteration, GPT-1, was scaled up to produce GPT-2 in 2019; in 2020, it was scaled up again to produce GPT-3, with 175 billion parameters. DALL-E has three components: a discrete VAE, an autoregressive decoder-only Transformer (12 billion parameters) similar to GPT-3, and a CLIP pair of image encoder and text encoder. The discrete VAE can convert an image to
A language model, which transforms the input text into a latent representation, and a generative image model, which produces an image conditioned on that representation. The most effective models have generally been trained on massive amounts of image and text data scraped from the web. Before the rise of deep learning, attempts to build text-to-image models were limited to collages made by arranging existing component images, such as from
a large language model trained separately on a text-only corpus (with its weights subsequently frozen), a departure from the theretofore standard approach. Training a text-to-image model requires a dataset of images paired with text captions. One dataset commonly used for this purpose is the COCO dataset. Released by Microsoft in 2014, COCO consists of around 123,000 images depicting a diversity of objects, with five captions per image generated by human annotators. Oxford-102 Flowers and CUB-200 Birds are smaller datasets of around 10,000 images each, restricted to flowers and birds, respectively. It
a beta phase, with invitations sent to 1 million waitlisted individuals; users could generate a certain number of images for free every month and could purchase more. Access had previously been restricted to pre-selected users for a research preview due to concerns about ethics and safety. On 28 September 2022, DALL-E 2 was opened to everyone and the waitlist requirement was removed. In September 2023, OpenAI announced their latest image model, DALL-E 3, capable of understanding "significantly more nuance and detail" than previous iterations. In early November 2022, OpenAI released DALL-E 2 as an API, allowing developers to integrate
a broad understanding of visual and design trends. DALL-E can produce images for a wide variety of arbitrary descriptions from various viewpoints with only rare failures. Mark Riedl, an associate professor at the Georgia Tech School of Interactive Computing, found that DALL-E could blend concepts (described as a key element of human creativity). Its visual reasoning ability is sufficient to solve Raven's Matrices (visual tests often administered to humans to measure intelligence). DALL-E 3 follows complex prompts with more accuracy and detail than its predecessors, and
a certain probability distribution $\gamma$ over $[0,\infty)$, then the score-matching loss function is defined as the expected Fisher divergence: $L(\theta) = E_{t\sim\gamma, x_t\sim\rho_t}\left[\|f_\theta(x_t,t)\|^2 + 2\nabla\cdot f_\theta(x_t,t)\right]$. After training, $f_\theta(x_t,t) \approx \nabla\ln\rho_t$, so we can perform
a change of variables, $L_{simple,t} = E_{x_0,x_t\sim q}\left[\left\|\epsilon_\theta(x_t,t) - \frac{x_t - \sqrt{\bar\alpha_t}\,x_0}{\sigma_t}\right\|^2\right] = E_{x_t\sim q,\, x_0\sim q(\cdot|x_t)}\left[\left\|\epsilon_\theta(x_t,t) - \frac{x_t - \sqrt{\bar\alpha_t}\,x_0}{\sigma_t}\right\|^2\right]$ and
a daikon radish blowing its nose, sipping a latte, or riding a unicycle, DALL-E often draws the handkerchief, hands, and feet in plausible locations." DALL-E showed the ability to "fill in the blanks" to infer appropriate details without specific prompts, such as adding Christmas imagery to prompts commonly associated with the celebration, and appropriately placed shadows in images that did not mention them. Furthermore, DALL-E exhibits
a database of clip art. The inverse task, image captioning, was more tractable, and a number of image-captioning deep learning models preceded the first text-to-image models. The first modern text-to-image model, alignDRAW, was introduced in 2015 by researchers from the University of Toronto. alignDRAW extended the previously introduced DRAW architecture (which used a recurrent variational autoencoder with an attention mechanism) to be conditioned on text sequences. Images generated by alignDRAW were of low resolution (32×32 pixels, obtained by resizing) and were considered 'low in diversity'. The model
a denoising network can be used for score-based diffusion. In DDPM, the sequence of numbers $0 = \sigma_0 < \sigma_1 < \cdots < \sigma_T < 1$ is called a (discrete-time) noise schedule. In general, consider
a density $q$, we wish to learn a score function approximation $f_\theta \approx \nabla\ln q$. This is score matching. Typically, score matching is formalized as minimizing the Fisher divergence $E_q[\|f_\theta(x) - \nabla\ln q(x)\|^2]$. By expanding
a distinct thick, rounded bill". A model trained on the more diverse COCO (Common Objects in Context) dataset produced images which were "from a distance... encouraging", but which lacked coherence in their details. Later systems include VQGAN-CLIP, XMC-GAN, and GauGAN2. One of the first text-to-image models to capture widespread public attention was OpenAI's DALL-E, a transformer system announced in January 2021. A successor capable of generating more complex and realistic images, DALL-E 2,
a given prompt. For example, this can be used to insert a new subject into an image, or expand an image beyond its original borders. According to OpenAI, "Outpainting takes into account the image's existing visual elements — including shadows, reflections, and textures — to maintain the context of the original image." DALL-E 2's language understanding has limits. It is sometimes unable to distinguish "A yellow book and
a long enough diffusion process, we end up with some $x_T$ that is very close to $N(0,I)$, with all traces of the original $x_0 \sim q$ gone. For example, since $x_t|x_0 \sim N\left(\sqrt{\bar\alpha_t}\,x_0, \sigma_t^2 I\right)$, we can sample $x_t|x_0$ directly "in one step", instead of going through all
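As a concrete illustration of this closed-form "one step" sampling, here is a minimal sketch in PyTorch; the linear schedule and the tensor shapes are illustrative assumptions, not taken from the source.

```python
import torch

def sample_xt_given_x0(x0, alphas_bar, t):
    """Closed-form forward sample: x_t = sqrt(alpha_bar_t) * x0 + sigma_t * z,
    with sigma_t^2 = 1 - alpha_bar_t, so no iteration over steps is needed."""
    a_bar = alphas_bar[t]
    z = torch.randn_like(x0)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * z

# Illustrative schedule: alpha_bar_t accumulated from a linear beta schedule.
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)
x_500 = sample_xt_given_x0(torch.randn(16, 3, 32, 32), alphas_bar, 500)
```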
a loss function, also known as the Hyvärinen scoring rule, that can be minimized by stochastic gradient descent. Suppose we need to model the distribution of images, and we want $x_0 \sim N(0,I)$, a white-noise image. Now, most white-noise images do not look like real images, so $q(x_0) \approx 0$ for large swaths of $x_0 \sim N(0,I)$. This presents
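The Hyvärinen objective above can be minimized directly; below is a minimal sketch for a low-dimensional toy problem, computing the divergence term exactly with autograd. The network and data here are assumptions for illustration, and the exact-divergence loop is only practical in low dimension.

```python
import torch

def hyvarinen_loss(score_net, x):
    """Implicit score-matching loss E[||f(x)||^2 + 2 div f(x)].
    The divergence is the trace of the Jacobian of f, computed one
    coordinate at a time via autograd."""
    x = x.clone().requires_grad_(True)
    s = score_net(x)                               # (batch, dim) score estimate
    div = torch.zeros(x.shape[0])
    for i in range(x.shape[1]):
        div = div + torch.autograd.grad(s[:, i].sum(), x, create_graph=True)[0][:, i]
    return (s.pow(2).sum(dim=1) + 2.0 * div).mean()

# Toy usage: fit a small network to the score of 2-D data.
net = torch.nn.Sequential(torch.nn.Linear(2, 64), torch.nn.Tanh(), torch.nn.Linear(64, 2))
loss = hyvarinen_loss(net, torch.randn(256, 2))
loss.backward()
```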
a method to learn a model that can sample from a highly complex probability distribution. They used techniques from non-equilibrium thermodynamics, especially diffusion. Consider, for example, how one might model the distribution of all naturally occurring photos. Each image is a point in the space of all images, and the distribution of naturally occurring photos is a "cloud" in that space, which, by repeatedly adding noise to
a model to output a high-resolution image conditioned on a text embedding, a popular technique is to train a model to generate low-resolution images, and use one or more auxiliary deep learning models to upscale them, filling in finer details. Text-to-image models are trained on large datasets of (text, image) pairs, often scraped from the web. With their 2022 Imagen model, Google Brain reported positive results from using
a name change was requested by OpenAI in June 2022) is an AI model based on the original DALL-E that was trained on unfiltered data from the Internet. It attracted substantial media attention in mid-2022 after its release, due to its capacity for producing humorous imagery.

Text-to-image model

A text-to-image model is a machine learning model which takes an input natural language description and produces an image matching that description. Text-to-image models began to be developed in
a noise prediction network is trained, it can be used for generating data points in the original distribution in a loop as follows: Score-based generative modelling is another formulation of diffusion modelling; such models are also called noise conditional score networks (NCSN), or score matching with Langevin dynamics (SMLD). Consider the problem of image generation. Let $x$ represent an image, and let $q(x)$ be
a particle: $dx_t = \nabla_{x_t}\ln q(x_t)\,dt + dW_t$. To deal with this problem, we perform annealing. If $q$ is too different from a white-noise distribution, then progressively add noise until it
a potential energy function $U(x) = -\ln q(x)$, and a lot of particles in the potential well, then the distribution at thermodynamic equilibrium is the Boltzmann distribution $q_U(x) \propto e^{-U(x)/k_B T} = q(x)^{1/k_B T}$. At temperature $k_B T = 1$,
a problem for learning the score function, because if there are no samples around a certain point, then we can't learn the score function at that point. If we do not know the score function $\nabla_{x_t}\ln q(x_t)$ at that point, then we cannot impose the time-evolution equation on
a red vase" from "A red book and a yellow vase" or "A panda making latte art" from "Latte art of a panda". It generates images of "an astronaut riding a horse" when presented with the prompt "a horse riding an astronaut". It also fails to generate the correct images in a variety of circumstances. Requesting more than three objects, negation, numbers, and connected sentences may result in mistakes, and object features may appear on
a score-based network can be used for denoising diffusion. Conversely, the continuous limit $x_{t-1} = x_{t-dt}$, $\beta_t = \beta(t)\,dt$, $z_t\sqrt{dt} = dW_t$ of
a sequence of noises $\sigma_t := \sigma(\lambda_t)$, which then derives the other quantities $\beta_t = 1 - \frac{1-\sigma_t^2}{1-\sigma_{t-1}^2}$. In order to use arbitrary noise schedules, instead of training
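As a sketch of this machinery, the snippet below builds a sigmoid-based noise schedule and derives $\beta_t$ exactly as in the formula above, using the relation $\sigma_t^2 = 1 - \bar\alpha_t$ implied by the forward process; the endpoints of the $\lambda$ grid are illustrative assumptions.

```python
import numpy as np

def sigmoid_noise_schedule(lambdas):
    """Given an increasing sequence lambda_1 < ... < lambda_T, set
    sigma_t = sigmoid(lambda_t) and derive
    beta_t = 1 - (1 - sigma_t^2) / (1 - sigma_{t-1}^2)."""
    sigmas = 1.0 / (1.0 + np.exp(-np.asarray(lambdas)))  # strictly increasing, in (0, 1)
    alphas_bar = 1.0 - sigmas**2                          # since sigma_t^2 = 1 - alpha_bar_t
    betas = 1.0 - alphas_bar[1:] / alphas_bar[:-1]
    return sigmas, betas

sigmas, betas = sigmoid_noise_schedule(np.linspace(-6.0, 6.0, 1000))
```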
a sequence of tokens, and conversely, convert a sequence of tokens back to an image. This is necessary because the Transformer does not directly process image data. The input to the Transformer model is the tokenized image caption followed by tokenized image patches. The image caption is in English, tokenized by byte pair encoding (vocabulary size 16384), and can be up to 256 tokens long. Each image
a similar output. For example, the word "blood" is filtered, but "ketchup" and "red liquid" are not. Another concern about DALL-E 2 and similar models is that they could cause technological unemployment for artists, photographers, and graphic designers due to their accuracy and popularity. DALL-E 3 is designed to block users from generating art in the style of currently living artists. In 2023, Microsoft pitched
a smaller number than its predecessor. Instead of an autoregressive Transformer, DALL-E 2 uses a diffusion model conditioned on CLIP image embeddings, which, during inference, are generated from CLIP text embeddings by a prior model. This is the same architecture as that of Stable Diffusion, released a few months later. DALL-E can generate imagery in multiple styles, including photorealistic imagery, paintings, and emoji. It can "manipulate and rearrange" objects in its images, and can correctly place design elements in novel compositions without explicit instruction. Thom Dunn, writing for BoingBoing, remarked that "For example, when asked to draw
a strictly increasing monotonic function $\sigma$ of type $\mathbb{R}\to(0,1)$, such as the sigmoid function. In that case, a noise schedule is a sequence of real numbers $\lambda_1 < \lambda_2 < \cdots < \lambda_T$. It then defines
a sum of pure randomness (like a Brownian walker) and gradient descent down the potential well. The randomness is necessary: if the particles were to undergo only gradient descent, they would all fall to the origin, collapsing the distribution. The 2020 paper proposed the Denoising Diffusion Probabilistic Model (DDPM), which improves upon the previous method by variational inference. To present
a variety of architectures. The text encoding step may be performed with a recurrent neural network such as a long short-term memory (LSTM) network, though transformer models have since become a more popular option. For the image generation step, conditional generative adversarial networks (GANs) have been commonly used, with diffusion models also becoming a popular option in recent years. Rather than directly training
is a Wiener process (multidimensional Brownian motion). Now, the equation is exactly a special case of the overdamped Langevin equation $dx_t = -\frac{D}{k_B T}(\nabla_x U)\,dt + \sqrt{2D}\,dW_t$, where $D$
is a 256×256 RGB image, divided into a 32×32 grid of patches, each covering 8×8 pixels. Each patch is then converted by a discrete variational autoencoder to a token (vocabulary size 8192). DALL-E was developed and announced to the public in conjunction with CLIP (Contrastive Language-Image Pre-training). CLIP is a separate model based on contrastive learning that was trained on 400 million pairs of images with text captions scraped from
is a gaussian with mean zero and variance one. To find the second one, we complete the rotational matrix: $\begin{bmatrix} z'' \\ z''' \end{bmatrix} = \begin{bmatrix} \frac{\sqrt{\alpha_t - \bar\alpha_t}}{\sigma_t} & \frac{\sqrt{\beta_t}}{\sigma_t} \\ ? & ? \end{bmatrix} \begin{bmatrix} z \\ z' \end{bmatrix}$. Since rotational matrices are all of
is a normalization constant and often omitted. In particular, we note that $x_{1:T}|x_0$ is a gaussian process, which affords us considerable freedom in reparameterization. For example, by standard manipulation with the gaussian process, $x_t|x_0 \sim N\left(\sqrt{\bar\alpha_t}\,x_0, \sigma_t^2 I\right)$ and $x_{t-1}|x_t,x_0 \sim N(\tilde\mu_t(x_t,x_0), \tilde\sigma_t^2 I)$. In particular, notice that for large $t$,
is able to generate more coherent and accurate text. DALL-E 3 is integrated into ChatGPT Plus. Given an existing image, DALL-E 2 can produce "variations" of the image as individual outputs based on the original, as well as edit the image to modify or expand upon it. DALL-E 2's "inpainting" and "outpainting" use context from an image to fill in missing areas using a medium consistent with the original, following
is another gaussian. We also know that these are independent. Thus we can perform a reparameterization: $x_{t-1} = \sqrt{\bar\alpha_{t-1}}\,x_0 + \sqrt{1-\bar\alpha_{t-1}}\,z$ and $x_t = \sqrt{\alpha_t}\,x_{t-1} + \sqrt{1-\alpha_t}\,z'$, where $z, z'$ are IID gaussians. There are 5 variables $x_0, x_{t-1}, x_t, z, z'$ and two linear equations. The two sources of randomness are $z, z'$, which can be reparameterized by rotation, since
is compared to its immediate neighbors — e.g., how much more likely is an image of a cat compared to some small variants of it? Is it more likely if the image contains two whiskers, or three, or with some Gaussian noise added? Consequently, we are actually quite uninterested in $q(x)$ itself, but rather in $\nabla_x \ln q(x)$. This has two major effects: Let
is considered less difficult to train a high-quality text-to-image model with these datasets because of their narrow range of subject matter. Evaluating and comparing the quality of text-to-image models is a problem involving assessing multiple desirable properties. A desideratum specific to text-to-image models is that generated images semantically align with the text captions used to generate them. A number of schemes have been devised for assessing these qualities, some automated and others based on human judgement. A common algorithmic metric for assessing image quality and diversity
is degrading and undermines the time and skill that goes into their art. AI-driven image generation tools have been heavily criticized by artists because they are trained on human-made art scraped from the web." The second is the trouble with copyright law and the data text-to-image models are trained on. OpenAI has not released information about what dataset(s) were used to train DALL-E 2, inciting concern from some that
is designed so that for any starting distribution of $x_0$, we have $\lim_t x_t|x_0$ converging to $N(0,I)$. The entire diffusion process then satisfies $q(x_{0:T}) = q(x_0)\,q(x_1|x_0)\cdots q(x_T|x_{T-1}) = q(x_0)\,N(x_1|\sqrt{\alpha_1}\,x_0, \beta_1 I)\cdots N(x_T|\sqrt{\alpha_T}\,x_{T-1}, \beta_T I)$, or $\ln q(x_{0:T}) = \ln q(x_0) - \sum_{t=1}^T \frac{1}{2\beta_t}\|x_t - \sqrt{1-\beta_t}\,x_{t-1}\|^2 + C$, where $C$
is the diffusion tensor, $T$ is the temperature, and $U$ is the potential energy field. If we substitute in $D = \frac{1}{2}\beta(t)I$, $k_B T = 1$, $U = \frac{1}{2}\|x\|^2$, we recover
is explained thus: DDPM and score-based generative models are equivalent. This means that a network trained using DDPM can be used as an NCSN, and vice versa. We know that $x_t|x_0 \sim N\left(\sqrt{\bar\alpha_t}\,x_0, \sigma_t^2 I\right)$, so by Tweedie's formula, we have $\nabla_{x_t}\ln q(x_t) = \frac{1}{\sigma_t^2}\left(-x_t + \sqrt{\bar\alpha_t}\,E_q[x_0|x_t]\right)$. As described previously,
is indistinguishable from one. That is, we perform a forward diffusion, then learn the score function, then use the score function to perform a backward diffusion. Consider again the forward diffusion process, but this time in continuous time: $x_t = \sqrt{1-\beta_t}\,x_{t-1} + \sqrt{\beta_t}\,z_t$. By taking
is just the Maxwell–Boltzmann distribution of particles in a potential well $V(x) = \frac{1}{2}\|x\|^2$ at temperature 1. The initial distribution, being very much out of equilibrium, would diffuse towards the equilibrium distribution, making biased random steps that are
is not in equilibrium, unlike the final distribution. The equilibrium distribution is the Gaussian distribution $N(0,I)$, with pdf $\rho(x) \propto e^{-\frac{1}{2}\|x\|^2}$. This
is some unknown gaussian noise. Now we see that estimating $x_0$ is equivalent to estimating $z$. Therefore, let the network output a noise vector $\epsilon_\theta(x_t,t)$, and let it predict $\mu_\theta(x_t,t) = \tilde\mu_t\left(x_t, \frac{x_t - \sigma_t\epsilon_\theta(x_t,t)}{\sqrt{\bar\alpha_t}}\right) = \frac{x_t - \epsilon_\theta(x_t,t)\,\beta_t/\sigma_t}{\sqrt{\alpha_t}}$. It remains to design $\Sigma_\theta(x_t,t)$. The DDPM paper suggested not learning it (since it resulted in "unstable training and poorer sample quality"), but fixing it at some value $\Sigma_\theta(x_t,t) = \zeta_t^2 I$, where either $\zeta_t^2 = \beta_t$ or $\tilde\sigma_t^2$ yielded similar performance. With this,
is that they could be used to propagate deepfakes and other forms of misinformation. As an attempt to mitigate this, the software rejects prompts involving public figures and uploads containing human faces. Prompts containing potentially objectionable content are blocked, and uploaded images are analyzed to detect offensive material. A disadvantage of prompt-based filtering is that it is easy to bypass using alternative phrases that result in
is the Inception Score (IS), which is based on the distribution of labels predicted by a pretrained Inception v3 image classification model when applied to a sample of images generated by the text-to-image model. The score increases when the image classification model predicts a single label with high probability, a scheme intended to favour "distinct" generated images. Another popular metric
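A minimal sketch of computing the Inception Score from classifier outputs, assuming the Inception v3 softmax probabilities have already been collected (loading the classifier itself is omitted):

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """Inception Score from predicted label distributions of shape
    (num_images, num_classes): IS = exp( mean_x KL( p(y|x) || p(y) ) )."""
    marginal = probs.mean(axis=0)                                        # p(y)
    kl = (probs * (np.log(probs + eps) - np.log(marginal + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))
```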
is the dimension of space, and $\Delta$ is the Laplace operator. If we have solved $\rho_t$ for time $t\in[0,T]$, then we can exactly reverse the evolution of the cloud. Suppose we start with another cloud of particles with density $\nu_0 = \rho_T$, and let
is the related Fréchet inception distance, which compares the distribution of generated images and real training images according to features extracted by one of the final layers of a pretrained image classification model.

Diffusion model

There are various equivalent formalisms, including Markov chains, denoising diffusion probabilistic models, noise conditioned score networks, and stochastic differential equations. They are typically trained using variational inference. The model responsible for denoising
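Picking up the Fréchet inception distance described above, here is a minimal sketch that computes it from two sets of already-extracted features; the feature extraction step itself is assumed to have happened elsewhere:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real, feats_fake):
    """FID between two feature sets of shape (num_images, feature_dim):
    ||mu_r - mu_f||^2 + Tr(C_r + C_f - 2 (C_r C_f)^{1/2})."""
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    c_r = np.cov(feats_real, rowvar=False)
    c_f = np.cov(feats_fake, rowvar=False)
    covmean = sqrtm(c_r @ c_f)
    if np.iscomplexobj(covmean):      # numerical noise can leave tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(c_r + c_f - 2.0 * covmean))
```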
is to learn the parameters such that $p_\theta(x_0)$ is as close to $q(x_0)$ as possible. To do that, we use maximum likelihood estimation with variational inference. The ELBO inequality states that $\ln p_\theta(x_0) \ge E_{x_{1:T}\sim q(\cdot|x_0)}[\ln p_\theta(x_{0:T}) - \ln q(x_{1:T}|x_0)]$, and taking one more expectation, we get $E_{x_0\sim q}[\ln p_\theta(x_0)] \ge E_{x_{0:T}\sim q}[\ln p_\theta(x_{0:T}) - \ln q(x_{1:T}|x_0)]$. We see that maximizing
is to use a neural network parametrized by $\theta$. The network takes in two arguments $x_t, t$, and outputs a vector $\mu_\theta(x_t,t)$ and a matrix $\Sigma_\theta(x_t,t)$, such that each step in
is trained to reverse the process of adding noise to an image. After training to convergence, it can be used for image generation by starting with an image composed of random noise and applying the network iteratively to denoise the image. Diffusion-based image generators have seen widespread commercial interest, such as Stable Diffusion and DALL-E. These models typically combine diffusion models with other models, such as text encoders and cross-attention modules, to allow text-conditioned generation. Other than computer vision, diffusion models have also found applications in natural language processing such as text generation and summarization, sound generation, and reinforcement learning. Diffusion models were introduced in 2015 as
is typically called its "backbone". The backbone may be of any kind, but it is typically a U-net or a transformer. As of 2024, diffusion models are mainly used for computer vision tasks, including image denoising, inpainting, super-resolution, image generation, and video generation. These typically involve training a neural network to sequentially denoise images blurred with Gaussian noise. The model
the $\beta_t \to \beta(t)\,dt$, $\sqrt{dt}\,z_t \to dW_t$ limit, we obtain a continuous diffusion process, in the form of a stochastic differential equation: $dx_t = -\frac{1}{2}\beta(t)\,x_t\,dt + \sqrt{\beta(t)}\,dW_t$, where $W_t$
the United States Department of Defense on using DALL-E models to train battlefield management systems. In January 2024, OpenAI removed its blanket ban on military and warfare use from its usage policies. Most coverage of DALL-E focuses on a small subset of "surreal" or "quirky" outputs. DALL-E's output for "an illustration of a baby daikon radish in a tutu walking a dog" was mentioned in pieces from Input, NBC, Nature, and other publications. Its output for "an armchair in
the score function be $s(x) := \nabla_x \ln q(x)$; then consider what we can do with $s(x)$. As it turns out, $s(x)$ allows us to sample from $q(x)$ using thermodynamics. Specifically, if we have
the Boltzmann distribution is exactly $q(x)$. Therefore, to model $q(x)$, we may start with a particle sampled at any convenient distribution (such as the standard gaussian distribution), then simulate the motion of the particle forwards according to the Langevin equation $dx_t = -\nabla_{x_t}U(x_t)\,dt + dW_t$ and
the Boltzmann distribution is, by the Fokker–Planck equation, the unique thermodynamic equilibrium. So no matter what distribution $x_0$ has, the distribution of $x_t$ converges in distribution to $q$ as $t\to\infty$. Given
the DDPM loss function is $\sum_t L_{simple,t}$ with $L_{simple,t} = E_{x_0\sim q;\, z\sim N(0,I)}\left[\left\|\epsilon_\theta(x_t,t) - z\right\|^2\right]$, where $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sigma_t z$. By
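A minimal sketch of one training step for this loss, in PyTorch; `eps_model(x_t, t)` is an assumed signature for the noise-prediction network, not something fixed by the source:

```python
import torch

def ddpm_loss_step(eps_model, x0, alphas_bar):
    """One Monte Carlo estimate of L_simple: draw t and z, form
    x_t = sqrt(alpha_bar_t) * x0 + sigma_t * z with sigma_t^2 = 1 - alpha_bar_t,
    and regress the predicted noise onto z."""
    t = torch.randint(0, len(alphas_bar), (x0.shape[0],))
    a_bar = alphas_bar[t].view(-1, *([1] * (x0.dim() - 1)))
    z = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * z
    return (eps_model(x_t, t) - z).pow(2).mean()
```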
the Fokker–Planck equation, we find that $\partial_t\rho_{T-t} = \partial_t\nu_t$. Thus this cloud of points is the original cloud, evolving backwards. At the continuous limit, $\bar\alpha_t = (1-\beta_1)\cdots(1-\beta_t) = e^{\sum_i \ln(1-\beta_i)} \to e^{-\int_0^t\beta(t)\,dt}$, and so $x_t|x_0 \sim N\left(e^{-\frac{1}{2}\int_0^t\beta(t)\,dt}\,x_0, \left(1 - e^{-\int_0^t\beta(t)\,dt}\right)I\right)$. In particular, we see that we can directly sample from any point in
the IID gaussian distribution is rotationally symmetric. By plugging in the equations, we can solve for the first reparameterization: $x_t = \sqrt{\bar\alpha_t}\,x_0 + \underbrace{\sqrt{\alpha_t - \bar\alpha_t}\,z + \sqrt{1-\alpha_t}\,z'}_{=\sigma_t z''}$, where $z''$
the Internet. Its role is to "understand and rank" DALL-E's output by predicting which caption from a list of 32,768 captions randomly selected from the dataset (of which one was the correct answer) is most appropriate for an image. A trained CLIP pair is used to filter a larger initial list of images generated by DALL-E to select the image closest to the text prompt. DALL-E 2 uses 3.5 billion parameters,
the above equation. This explains why the phrase "Langevin dynamics" is sometimes used in diffusion models. Now, the above equation is for the stochastic motion of a single particle. Suppose we have a cloud of particles distributed according to $q$ at time $t=0$; then, after a long time, the cloud of particles would settle into
the backward equation $x_{t-1} = \frac{x_t}{\sqrt{\alpha_t}} - \frac{\beta_t}{\sigma_t\sqrt{\alpha_t}}\,\epsilon_\theta(x_t,t) + \sqrt{\beta_t}\,z_t; \quad z_t\sim N(0,I)$ gives us precisely
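The backward equation above translates directly into an ancestral sampling loop; this sketch assumes a trained `eps_model` and 1-D schedule tensors, with all names being illustrative assumptions:

```python
import torch

@torch.no_grad()
def ddpm_sample(eps_model, shape, alphas, alphas_bar, betas):
    """Ancestral sampling: start from x_T ~ N(0, I) and iterate
    x_{t-1} = (x_t - beta_t / sigma_t * eps_theta(x_t, t)) / sqrt(alpha_t)
              + sqrt(beta_t) * z_t, with sigma_t^2 = 1 - alpha_bar_t."""
    x = torch.randn(shape)
    for t in reversed(range(len(betas))):
        sigma_t = (1.0 - alphas_bar[t]).sqrt()
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)  # no noise at the last step
        t_batch = torch.full((shape[0],), t)
        x = (x - betas[t] / sigma_t * eps_model(x, t_batch)) / alphas[t].sqrt() \
            + betas[t].sqrt() * z
    return x
```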
the backwards diffusion process by first sampling $x_T\sim N(0,I)$, then integrating the SDE from $t=T$ to $t=0$: $x_{t-dt} = x_t + \frac{1}{2}\beta(t)\,x_t\,dt + \beta(t)\,f_\theta(x_t,t)\,dt + \sqrt{\beta(t)}\,dW_t$. This may be done by any SDE integration method, such as the Euler–Maruyama method. The name "noise conditional score network"
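For instance, a minimal Euler–Maruyama integration of this reverse SDE might look as follows; `score_model` and `beta` are assumed callables standing in for $f_\theta$ and $\beta(t)$, and the step count is an illustrative choice:

```python
import torch

@torch.no_grad()
def reverse_sde_euler_maruyama(score_model, shape, beta, T=1.0, n_steps=1000):
    """Integrate x_{t-dt} = x_t + (beta(t)/2) x_t dt + beta(t) f(x_t, t) dt
    + sqrt(beta(t)) dW_t from t = T down to t = 0, with dW ~ N(0, dt I)."""
    dt = T / n_steps
    x = torch.randn(shape)                       # x_T ~ N(0, I)
    for i in range(n_steps):
        t = T - i * dt
        b = beta(t)
        drift = 0.5 * b * x + b * score_model(x, t)
        x = x + drift * dt + (b * dt) ** 0.5 * torch.randn_like(x)
    return x
```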
the continuous diffusion process without going through the intermediate steps, by first sampling $x_0\sim q,\ z\sim N(0,I)$, then computing $x_t = e^{-\frac{1}{2}\int_0^t\beta(t)\,dt}\,x_0 + \sqrt{1 - e^{-\int_0^t\beta(t)\,dt}}\,z$ (the coefficient of $z$ is the standard deviation, i.e. the square root of the variance given above). That is, we can quickly sample $x_t\sim\rho_t$ for any $t\ge 0$. Now, define
the form $\begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix}$, we know the matrix must be $\begin{bmatrix} z'' \\ z''' \end{bmatrix} = \begin{bmatrix} \frac{\sqrt{\alpha_t - \bar\alpha_t}}{\sigma_t} & \frac{\sqrt{\beta_t}}{\sigma_t} \\ -\frac{\sqrt{\beta_t}}{\sigma_t} & \frac{\sqrt{\alpha_t - \bar\alpha_t}}{\sigma_t} \end{bmatrix} \begin{bmatrix} z \\ z' \end{bmatrix}$, and since
the forward diffusion process can be approximately undone by $x_{t-1}\sim N(\mu_\theta(x_t,t), \Sigma_\theta(x_t,t))$. This then gives us a backward diffusion process $p_\theta$ defined by $p_\theta(x_T) = N(x_T|0,I)$ and $p_\theta(x_{t-1}|x_t) = N(x_{t-1}|\mu_\theta(x_t,t), \Sigma_\theta(x_t,t))$. The goal now
the goal is to minimize the loss by stochastic gradient descent. The expression may be simplified to $L(\theta) = \sum_{t=1}^T E_{x_{t-1},x_t\sim q}[-\ln p_\theta(x_{t-1}|x_t)] + E_{x_0\sim q}[D_{KL}(q(x_T|x_0)\,\|\,p_\theta(x_T))] + C$, where $C$
the goal is to somehow reverse the process, so that we can start at the end and diffuse back to the beginning. By the Fokker–Planck equation, the density of the cloud evolves according to $\partial_t\ln\rho_t = \frac{1}{2}\beta(t)\left(n + (x+\nabla\ln\rho_t)\cdot\nabla\ln\rho_t + \Delta\ln\rho_t\right)$, where $n$
the images, diffuses out to the rest of the image space, until the cloud becomes all but indistinguishable from a Gaussian distribution $N(0,I)$. A model that can approximately undo the diffusion can then be used to sample from the original distribution. This is studied in "non-equilibrium" thermodynamics, as the starting distribution
the integral, and performing an integration by parts, $E_q[\|f_\theta(x) - \nabla\ln q(x)\|^2] = E_q[\|f_\theta\|^2 + 2\nabla\cdot f_\theta] + C$, giving us
the intermediate steps $x_1, x_2, \ldots, x_{t-1}$. We know $x_{t-1}|x_0$ is a gaussian, and $x_t|x_{t-1}$
the inverse of a rotational matrix is its transpose, $\begin{bmatrix} z \\ z' \end{bmatrix} = \begin{bmatrix} \frac{\sqrt{\alpha_t - \bar\alpha_t}}{\sigma_t} & -\frac{\sqrt{\beta_t}}{\sigma_t} \\ \frac{\sqrt{\beta_t}}{\sigma_t} & \frac{\sqrt{\alpha_t - \bar\alpha_t}}{\sigma_t} \end{bmatrix} \begin{bmatrix} z'' \\ z''' \end{bmatrix}$. Plugging back and simplifying, we have $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sigma_t z''$ and $x_{t-1} = \tilde\mu_t(x_t,x_0) - \tilde\sigma_t z'''$. The key idea of DDPM
the launch of DALL-E 2 and ChatGPT, received an additional $10 billion in funding from Microsoft. Japan's anime community has had a negative reaction to DALL-E 2 and similar models. Two arguments are typically presented by artists against the software. The first is that AI art is not art because it is not created by a human with intent. "The juxtaposition of AI-generated images with their own work
the loss simplifies to $L_t = \frac{\beta_t^2}{2\alpha_t\sigma_t^2\zeta_t^2}\,E_{x_0\sim q;\, z\sim N(0,I)}\left[\left\|\epsilon_\theta(x_t,t) - z\right\|^2\right] + C$, which may be minimized by stochastic gradient descent. The paper noted empirically that an even simpler loss function $L_{simple,t} = E_{x_0\sim q;\, z\sim N(0,I)}\left[\left\|\epsilon_\theta(x_t,t) - z\right\|^2\right]$ resulted in better models. After
the mid-2010s during the beginnings of the AI boom, as a result of advances in deep neural networks. In 2022, the output of state-of-the-art text-to-image models—such as OpenAI's DALL-E 2, Google Brain's Imagen, Stability AI's Stable Diffusion, and Midjourney—began to be considered to approach the quality of real photographs and human-drawn art. Text-to-image models are generally latent diffusion models, which combine
the model in Bing's Image Creator tool and plans to implement it into their Designer app. DALL-E was revealed by OpenAI in a blog post on 5 January 2021, and uses a version of GPT-3 modified to generate images. On 6 April 2022, OpenAI announced DALL-E 2, a successor designed to generate more realistic images at higher resolutions that "can combine concepts, attributes, and styles". On 20 July 2022, DALL-E 2 entered into
the model into their own applications. Microsoft unveiled their implementation of DALL-E 2 in their Designer app and Image Creator tool included in Bing and Microsoft Edge. The API operates on a cost-per-image basis, with prices varying depending on image resolution. Volume discounts are available to companies working with OpenAI's enterprise team. The software's name is a portmanteau of
the model, we need some notation. A forward diffusion process starts at some starting point $x_0\sim q$, where $q$ is the probability distribution to be learned, then repeatedly adds noise to it by $x_t = \sqrt{1-\beta_t}\,x_{t-1} + \sqrt{\beta_t}\,z_t$, where $z_1,\ldots,z_T$ are IID samples from $N(0,I)$. This
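A minimal sketch of iterating this update to produce a whole forward trajectory; the linear $\beta$ schedule and the toy data shape are illustrative assumptions:

```python
import torch

def forward_diffusion(x0, betas):
    """Iterate x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * z_t and
    return the trajectory [x_0, x_1, ..., x_T]."""
    xs = [x0]
    for beta in betas:
        z = torch.randn_like(x0)
        xs.append((1.0 - beta).sqrt() * xs[-1] + beta.sqrt() * z)
    return xs

trajectory = forward_diffusion(torch.randn(16, 2), torch.linspace(1e-4, 0.02, 1000))
```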
the names of the animated robot Pixar character WALL-E and the Catalan surrealist artist Salvador Dalí. In February 2024, OpenAI began adding watermarks to DALL-E generated images, containing metadata in the C2PA (Coalition for Content Provenance and Authenticity) standard promoted by the Content Authenticity Initiative. The first generative pre-trained transformer (GPT) model was initially developed by OpenAI in 2018, using
the network does not have access to $x_0$, and so it has to estimate it instead. Now, since $x_t|x_0 \sim N\left(\sqrt{\bar\alpha_t}\,x_0, \sigma_t^2 I\right)$, we may write $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sigma_t z$, where $z$
the parameter, and thus can be ignored. Since $p_\theta(x_T) = N(x_T|0,I)$ also does not depend on the parameter, the term $E_{x_0\sim q}[D_{KL}(q(x_T|x_0)\,\|\,p_\theta(x_T))]$ can also be ignored. This leaves just $L(\theta) = \sum_{t=1}^T L_t$ with $L_t = E_{x_{t-1},x_t\sim q}[-\ln p_\theta(x_{t-1}|x_t)]$ to be minimized. Since $x_{t-1}|x_t,x_0 \sim N(\tilde\mu_t(x_t,x_0), \tilde\sigma_t^2 I)$, this suggests that we should use $\mu_\theta(x_t,t) = \tilde\mu_t(x_t,x_0)$; however,
the particles in the cloud evolve according to $dy_t = \frac{1}{2}\beta(T-t)\,y_t\,dt + \beta(T-t)\underbrace{\nabla_{y_t}\ln\rho_{T-t}(y_t)}_{\text{score function}}\,dt + \sqrt{\beta(T-t)}\,dW_t$, then by plugging into
the probability distribution over all possible images. If we have $q(x)$ itself, then we can say for certain how likely a certain image is. However, this is intractable in general. Most often, we are uninterested in knowing the absolute probability of a certain image. Instead, we are usually only interested in knowing how likely a certain image
the quantity on the right would give us a lower bound on the likelihood of the observed data. This allows us to perform variational inference. Define the loss function $L(\theta) := -E_{x_{0:T}\sim q}[\ln p_\theta(x_{0:T}) - \ln q(x_{1:T}|x_0)]$, and now
the same equation as score-based diffusion: $x_{t-dt} = x_t(1 + \beta(t)\,dt/2) + \beta(t)\,\nabla_{x_t}\ln q(x_t)\,dt + \sqrt{\beta(t)}\,dW_t$. Thus,
the shape of an avocado" was also widely covered. ExtremeTech stated "you can ask DALL-E for a picture of a phone or vacuum cleaner from a specified period of time, and it understands how those objects have changed". Engadget also noted its unusual capacity for "understanding how telephones and other objects change over time". According to MIT Technology Review, one of OpenAI's objectives
the stable distribution of $N(0,I)$. Let $\rho_t$ be the density of the cloud of particles at time $t$; then we have $\rho_0 = q$ and $\rho_T \approx N(0,I)$, and
the term inside becomes a least squares regression, so if the network actually reaches the global minimum of loss, then we have $\epsilon_\theta(x_t,t) = \frac{x_t - \sqrt{\bar\alpha_t}\,E_q[x_0|x_t]}{\sigma_t} = -\sigma_t\,\nabla_{x_t}\ln q(x_t)$. Thus,
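The identity above gives a two-line conversion between the two parameterizations; a sketch, with function names that are assumptions for illustration:

```python
def score_from_eps(eps, sigma_t):
    """Score estimate from a noise prediction: grad log q(x_t) ≈ -eps / sigma_t."""
    return -eps / sigma_t

def eps_from_score(score, sigma_t):
    """Noise prediction from a score estimate: eps ≈ -sigma_t * score."""
    return -sigma_t * score
```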
the variable $x_t|x_0 \sim N\left(\sqrt{\bar\alpha_t}\,x_0, \sigma_t^2 I\right)$ converges to $N(0,I)$. That is, after
the work of artists has been used for training without permission. Copyright laws surrounding these topics are inconclusive at the moment. After integrating DALL-E 3 into Bing Chat and ChatGPT, Microsoft and OpenAI faced criticism for excessive content filtering, with critics saying DALL-E had been "lobotomized." The flagging of images generated by prompts such as "man breaks server rack with sledgehammer"
the wrong object. Additional limitations include handling text — which, even with legible lettering, almost invariably results in dream-like gibberish — and its limited capacity to address scientific information, such as astronomy or medical imagery. DALL-E 2's reliance on public datasets influences its results and leads to algorithmic bias in some cases, such as generating higher numbers of men than women for requests that do not mention gender. DALL-E 2's training data
was able to generalize to objects not represented in the training data (such as a red school bus) and appropriately handled novel prompts such as "a stop sign is flying in blue skies", exhibiting output suggesting it was not merely "memorizing" data from the training set. In 2016, Reed, Akata, Yan et al. became the first to use generative adversarial networks for the text-to-image task. With models trained on narrow, domain-specific datasets, they were able to generate "visually plausible" images of birds and flowers from text captions like "an all black bird with
was cited as evidence. Over the first days of its launch, filtering was reportedly increased to the point where images generated by some of Bing's own suggested prompts were being blocked. TechRadar argued that leaning too heavily on the side of caution could limit DALL-E's value as a creative tool. Since OpenAI has not released source code for any of the three models, there have been several attempts to create open-source models offering similar capabilities. Released in 2022 on Hugging Face's Spaces platform, Craiyon (formerly DALL-E Mini until
was filtered to remove violent and sexual imagery, but this was found to increase bias in some cases, such as reducing the frequency of women being generated. OpenAI hypothesized that this may be because women were more likely to be sexualized in the training data, which caused the filter to influence results. In September 2022, OpenAI confirmed to The Verge that DALL-E invisibly inserts phrases into user prompts to address bias in results; for instance, "black man" and "Asian woman" are inserted into prompts that do not specify gender or race. A concern about DALL-E 2 and similar image generation models
was to "give language models a better grasp of the everyday concepts that humans use to make sense of things". Wall Street investors have had a positive reception of DALL-E 2, with some firms thinking it could represent a turning point for a future multi-trillion-dollar industry. By mid-2019, OpenAI had already received over $1 billion in funding from Microsoft and Khosla Ventures, and in January 2023, following
was unveiled in April 2022, followed by Stable Diffusion, which was publicly released in August 2022. In August 2022, text-to-image personalization was introduced, which allows the model to be taught a new concept using a small set of images of a new object that was not included in the training set of the text-to-image foundation model. This is achieved by textual inversion, namely, finding a new text term that corresponds to these images. Following other text-to-image models, language-model-powered text-to-video platforms such as Runway, Make-A-Video, Imagen Video, Midjourney, and Phenaki can generate video from text and/or text/image prompts. Text-to-image models have been built using