In computer vision and image processing, motion estimation is the process of determining motion vectors that describe the transformation from one 2D image to another, usually between adjacent frames in a video sequence. It is an ill-posed problem because the motion happens in three dimensions (3D) but the images are a projection of the 3D scene onto a 2D plane. The motion vectors may relate to the whole image (global motion estimation) or to specific parts, such as rectangular blocks, arbitrarily shaped patches or even individual pixels. The motion vectors may be represented by a translational model or by many other models that can approximate the motion of a real video camera, such as rotation and translation in all three dimensions and zoom.
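As a rough illustration of the translational, per-block case, the sketch below runs an exhaustive block-matching search using the sum of absolute differences (SAD) as the matching cost. The function names, the toy frames and the search range are assumptions made for this example; real encoders use faster search patterns and larger blocks.

```python
def sad(cur, ref, cy, cx, ry, rx, size):
    """Sum of absolute differences between a block in cur and a block in ref."""
    return sum(
        abs(cur[cy + i][cx + j] - ref[ry + i][rx + j])
        for i in range(size)
        for j in range(size)
    )


def estimate_motion(cur, ref, cy, cx, size, search):
    """Exhaustive search: return ((dy, dx), cost) of the best match in ref.

    (dy, dx) is the offset from the block position in the current frame
    to the matching block position in the reference frame.
    """
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            ry, rx = cy + dy, cx + dx
            # Skip candidate blocks that fall outside the reference frame.
            if ry < 0 or rx < 0 or ry + size > height or rx + size > width:
                continue
            cost = sad(cur, ref, cy, cx, ry, rx, size)
            if best is None or cost < best[1]:
                best = ((dy, dx), cost)
    return best


# Toy frames: a 2x2 bright patch moves one pixel right and one pixel down.
ref = [[0] * 6 for _ in range(6)]
cur = [[0] * 6 for _ in range(6)]
for i in range(2):
    for j in range(2):
        ref[1 + i][1 + j] = 200
        cur[2 + i][2 + j] = 200

vector, cost = estimate_motion(cur, ref, cy=2, cx=2, size=2, search=2)
print(vector, cost)  # (-1, -1) 0
```

The vector points from the current block back to where its content sits in the reference frame, which is why a patch that moved down and right yields a negative offset.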
The sum of absolute transformed differences (SATD) is a block matching criterion widely used in fractional motion estimation for video compression. It works by taking a frequency transform, usually a Hadamard transform, of the differences between the pixels in the original block and the corresponding pixels in the block being used for comparison. The transform itself is often of
A block of equal size in the reference frame. The blocks are not transformed in any way apart from being shifted to the position of the predicted block. This shift is represented by a motion vector. To exploit the redundancy between neighboring block vectors (e.g. for a single moving object covered by multiple blocks), it is common to encode only the difference between the current and previous motion vector in
A fast DCT algorithm with C.H. Smith and S.C. Fralick. In 1979, Anil K. Jain and Jaswant R. Jain further developed motion-compensated DCT video compression, also called block motion compensation. This led to Chen developing a practical video compression algorithm, called motion-compensated DCT or adaptive scene coding, in 1981. Motion-compensated DCT later became the standard coding technique for video compression from
A fixed block size, while newer ones such as H.263, MPEG-4 Part 2, H.264/MPEG-4 AVC, and VC-1 give the encoder the ability to dynamically choose what block size will be used to represent the motion. Overlapped block motion compensation (OBMC) is a good solution to these problems because it not only increases prediction accuracy but also avoids blocking artifacts. When using OBMC, blocks are typically twice as big in each dimension and overlap quadrant-wise with all 8 neighbouring blocks. Thus, each pixel belongs to 4 blocks. In such
A frame are not sufficiently represented by global motion compensation. Thus, local motion estimation is also needed. Block motion compensation (BMC), also known as motion-compensated discrete cosine transform (MC DCT), is the most widely used motion compensation technique. In BMC, the frames are partitioned into blocks of pixels (e.g. macroblocks of 16×16 pixels in MPEG). Each block is predicted from
A local image region again. Some matching criteria can exclude points that do not actually correspond to each other despite producing a good matching score; others cannot, but they are still matching criteria. Affine motion estimation is a technique used in computer vision and image processing to estimate the motion between two images or frames. It assumes that
A scheme, there are 4 predictions for each pixel, which are combined into a weighted mean. For this purpose, blocks are associated with a window function that has the property that the sum of 4 overlapped windows is equal to 1 everywhere. Studies of methods for reducing the complexity of OBMC have shown that the contribution to the window function is smallest for the diagonally-adjacent block. Reducing
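The sum-to-one property of the OBMC window can be checked numerically. The sketch below assumes a separable triangular window, which is only one window with this property and is not taken from any particular standard; the function names are hypothetical. Exact fractions are used so the check does not depend on floating-point rounding.

```python
from fractions import Fraction


def window_1d(block):
    """Triangular weights of length 2*block; copies shifted by `block` sum to 1."""
    size = 2 * block
    return [Fraction(2 * min(n, size - 1 - n) + 1, size) for n in range(size)]


def window_2d(block):
    """Separable 2D window built as the outer product of the 1D window."""
    w = window_1d(block)
    return [[wy * wx for wx in w] for wy in w]


def overlap_sums(block):
    """Total weight each pixel of one block-sized cell receives from the
    4 overlapped windows (each 2*block wide, placed at a stride of block)."""
    w = window_2d(block)
    return [
        [
            w[y][x] + w[y][x + block] + w[y + block][x] + w[y + block][x + block]
            for x in range(block)
        ]
        for y in range(block)
    ]


sums = overlap_sums(4)
print(all(s == 1 for row in sums for s in row))  # True
```

Because the 2D window is a product of two 1D windows that each sum to 1 under a half-window shift, the four overlapped weights at any pixel also sum to 1.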
A small block rather than the entire macroblock. For example, in x264, a series of 4×4 blocks are transformed rather than doing the more processor-intensive 16×16 transform. SATD is slower than the sum of absolute differences (SAD), both due to its increased complexity and the fact that SAD-specific MMX and SSE2 instructions exist, while there are no such instructions for SATD. However, SATD can still be optimized considerably with SIMD instructions on most modern CPUs. The benefit of SATD
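To make the comparison between SAD and SATD concrete, here is a minimal sketch of both metrics on a single 4×4 block, using an unnormalized 4×4 Hadamard matrix. The function names and sample blocks are assumptions for illustration; production encoders typically apply a scaling factor and use butterfly or SIMD implementations rather than plain matrix products.

```python
# Unnormalized 4x4 Hadamard matrix (symmetric, so it equals its transpose).
H4 = [
    [1, 1, 1, 1],
    [1, -1, 1, -1],
    [1, 1, -1, -1],
    [1, -1, -1, 1],
]


def matmul4(a, b):
    """Plain 4x4 matrix product."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
        for i in range(4)
    ]


def sad_4x4(orig, pred):
    """Sum of absolute pixel differences."""
    return sum(abs(o - p) for ro, rp in zip(orig, pred) for o, p in zip(ro, rp))


def satd_4x4(orig, pred):
    """Sum of absolute values of the Hadamard-transformed difference block."""
    diff = [[o - p for o, p in zip(ro, rp)] for ro, rp in zip(orig, pred)]
    transformed = matmul4(matmul4(H4, diff), H4)
    return sum(abs(v) for row in transformed for v in row)


pred = [[8] * 4 for _ in range(4)]
flat = [[10] * 4 for _ in range(4)]   # every pixel differs by 2
spike = [row[:] for row in pred]
spike[0][0] = 10                      # a single pixel differs by 2

print(sad_4x4(flat, pred), satd_4x4(flat, pred))    # 32 32
print(sad_4x4(spike, pred), satd_4x4(spike, pred))  # 2 32
```

A flat difference collapses into one transform coefficient, while a single-pixel spike spreads across all sixteen, so SATD penalizes the spike far more heavily than SAD does.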
A way of exploiting temporal redundancy, motion estimation and compensation are key parts of video compression. Almost all video coding standards, such as the MPEG series including HEVC, use block-based motion estimation and compensation. In simultaneous localization and mapping, a 3D model of a scene is reconstructed using images from a moving camera. Motion compensation Motion compensation in computing
Is a hybrid coding algorithm, which combines two key data compression techniques: discrete cosine transform (DCT) coding in the spatial dimension, and predictive motion compensation in the temporal dimension. DCT coding is a lossy block compression transform coding technique that was first proposed by Nasir Ahmed, who initially intended it for image compression, in 1972. In 1974, Ali Habibi at
Is an algorithmic technique used to predict a frame in a video given the previous and/or future frames by accounting for motion of the camera and/or objects in the video. It is employed in the encoding of video data for video compression, for example in the generation of MPEG-2 files. Motion compensation describes a picture in terms of the transformation of a reference picture to the current picture. The reference picture may be previous in time or even from
Is called sub-pixel precision. The in-between pixels are generated by interpolating neighboring pixels. Commonly, half-pixel or quarter-pixel precision (Qpel, used by H.264 and MPEG-4/ASP) is used. The computational expense of sub-pixel precision is much higher due to the extra processing required for interpolation and, on the encoder side, the much greater number of potential source blocks to be evaluated. The main disadvantage of block motion compensation
Is that it introduces discontinuities at the block borders (blocking artifacts). These artifacts appear in the form of sharp horizontal and vertical edges which are easily spotted by the human eye and produce false edges and ringing effects (large coefficients in high frequency sub-bands) due to quantization of coefficients of the Fourier-related transform used for transform coding of the residual frames. Block motion compensation divides up
Is that it more accurately models the number of bits required to transmit the residual error signal. As such, it is often used in video compressors, either as a way to drive and estimate rate explicitly, such as in the Theora encoder (since 1.1 alpha2), as an optional metric used in wide motion searches, such as in the Microsoft VC-1 encoder, or as a metric used in sub-pixel refinement, such as in x264. Motion estimation More often than not,
Is used, the delta images mostly fight against the image smearing out. The delta image can also be encoded as wavelets, so that the borders of the adaptive blocks match. 2D+Delta encoding techniques utilize H.264 and MPEG-2 compatible coding and can use motion compensation to compress between stereoscopic images. A precursor to the concept of motion compensation dates back to 1929, when R.D. Kell in Britain proposed
Is used to represent a macroblock in a picture based on the position of this macroblock (or a similar one) in another picture, called the reference picture. The H.264/MPEG-4 AVC standard defines a motion vector as "a two-dimensional vector used for inter prediction that provides an offset from the coordinates in the decoded picture to the coordinates in a reference picture." The methods for finding motion vectors can be categorised into pixel-based methods ("direct") and feature-based methods ("indirect"). A famous debate resulted in two papers from
The University of Southern California introduced hybrid coding, which combines predictive coding with transform coding. However, his algorithm was initially limited to intra-frame coding in the spatial dimension. In 1975, John A. Roese and Guner S. Robinson extended Habibi's hybrid coding algorithm to the temporal dimension, using transform coding in the spatial dimension and predictive coding in
The current frame into non-overlapping blocks, and the motion compensation vector tells where those blocks come from (a common misconception is that the previous frame is divided up into non-overlapping blocks, and the motion compensation vectors tell where those blocks move to). The source blocks typically overlap in the source frame. Some video compression algorithms assemble the current frame out of pieces of several different previously transmitted frames. Frames can also be predicted from future frames. The future frames then need to be encoded before
The bit-stream. The result of this differencing process is mathematically equivalent to a global motion compensation capable of panning. Further down the encoding pipeline, an entropy coder will take advantage of the resulting statistical distribution of the motion vectors around the zero vector to reduce the output size. It is possible to shift a block by a non-integer number of pixels, which
The concept of transmitting only the portions of an analog video scene that changed from frame to frame. In 1959, the concept of inter-frame motion compensation was proposed by NHK researchers Y. Taki, M. Hatori and S. Tanaka, who proposed predictive inter-frame video coding in the temporal dimension. Practical motion-compensated video compression emerged with the development of motion-compensated DCT (MC DCT) coding, also called block motion compensation (BMC) or DCT motion compensation. This
The fact that, often, for many frames of a movie, the only difference between one frame and another is the result of either the camera moving or an object in the frame moving. In reference to a video file, this means much of the information that represents one frame will be the same as the information used in the next frame. Using motion compensation, a video stream will contain some full (reference) frames; then
The frame is tiled with triangles, and the next frame is generated by performing an affine transformation on these triangles. Only the affine transformations are recorded/transmitted. This is capable of dealing with zooming, rotation, translation etc. Variable block-size motion compensation (VBSMC) is the use of BMC with the ability for the encoder to dynamically select the size of the blocks. When coding video,
The future. When images can be accurately synthesized from previously transmitted/stored images, the compression efficiency can be improved. Motion compensation is one of the two key video compression techniques used in video coding standards, along with the discrete cosine transform (DCT). Most video coding standards, such as the H.26x and MPEG formats, typically use motion-compensated DCT hybrid coding, known as block motion compensation (BMC) or motion-compensated DCT (MC DCT). Motion compensation exploits
The image sequence must be transmitted and stored out of order so that the future frame is available to generate the B frames. After predicting frames using motion compensation, the coder finds the residual, which is then compressed and transmitted. In global motion compensation, the motion model basically reflects camera motions such as translation, rotation and zoom. It works best for still scenes without moving objects. MPEG-4 ASP supports global motion compensation with three reference points, although some implementations can only make use of one. A single reference point only allows for translational motion, which, given its relatively large performance cost, provides little advantage over block-based motion compensation. Moving objects within
The late 1980s onwards. The first digital video coding standard was H.120, developed by the CCITT (now ITU-T) in 1984. H.120 used motion-compensated DPCM coding, which was inefficient for video coding, and H.120 was thus impractical due to low performance. The H.261 standard was developed in 1988 based on motion-compensated DCT compression, and it was the first practical video coding standard. Since then, motion-compensated DCT compression has been adopted by all
The matching criteria. The difference is only whether you summarise over a local image region first and then compare the summarisation (as in feature-based methods), or you compare each pixel first (for example by squaring the difference) and then summarise over a local image region (block-based motion and filter-based motion). An emerging type of matching criterion summarises a local image region first for every pixel location (through some feature transform such as the Laplacian transform), compares each summarised pixel and summarises over
The motion can be modeled as an affine transformation (translation + rotation + zooming), which is a linear transformation followed by a translation. Applying the motion vectors to an image to synthesize the transformation to the next image is called motion compensation. It is most easily applied to discrete cosine transform (DCT) based video coding standards, because the coding is performed in blocks. As
The only information stored for the frames in between would be the information needed to transform the previous frame into the next frame. The following is a simplistic illustrated explanation of how motion compensation works. Two successive frames were captured from the movie Elephants Dream. As can be seen from the images, the bottom (motion compensated) difference between two frames contains significantly less detail than
The opposing factions being produced to try to establish a conclusion. Indirect methods use features, such as corner detection, and match corresponding features between frames, usually with a statistical function applied over a local or global area. The purpose of the statistical function is to remove matches that do not correspond to the actual motion. Statistical functions that have been successfully used include RANSAC. It can be argued that almost all methods require some kind of definition of
The predicted frames and thus, the encoding order does not necessarily match the real frame order. Such frames are usually predicted from two directions, i.e. from the I- or P-frames that immediately precede or follow the predicted frame. These bidirectionally predicted frames are called B-frames. A coding scheme could, for instance, be IBBPBBPBBPBB. Further, the use of triangular tiles has also been proposed for motion compensation. Under this scheme,
The prior images, and thus compresses much better than the rest. Thus the information that is required to encode the compensated frame will be much smaller than with the difference frame. This also means that it is possible to encode the information using the difference image alone, at the cost of lower compression efficiency but with reduced coding complexity; in fact, motion-compensated coding (motion estimation together with motion compensation) accounts for more than 90% of encoding complexity. In MPEG, images are predicted from previous frames (P frames) or bidirectionally from previous and future frames (B frames). B frames are more complex because
The same point in that scene or on that object. Before we do motion estimation, we must define our measurement of correspondence, i.e., the matching metric, which is a measurement of how similar two image points are. There is no right or wrong here; the choice of matching metric is usually related to what the final estimated motion is used for as well as the optimisation strategy in the estimation process. Each motion vector
The temporal dimension, developing inter-frame motion-compensated hybrid coding. For the spatial transform coding, they experimented with the DCT and the fast Fourier transform (FFT), developing inter-frame hybrid coders for both, and found that the DCT is the more efficient of the two due to its reduced complexity, capable of compressing image data down to 0.25 bit per pixel for a videotelephone scene with image quality comparable to an intra-frame coder requiring 2 bits per pixel. In 1977, Wen-Hsiung Chen developed
The term motion estimation and the term optical flow are used interchangeably. It is also related in concept to image registration and stereo correspondence. In fact, all of these terms refer to the process of finding corresponding points between two images or video frames. The points that correspond to each other in two views (images or frames) of a real scene or object are "usually"
The use of larger blocks can reduce the number of bits needed to represent the motion vectors, while the use of smaller blocks can result in a smaller amount of prediction residual information to encode. Other areas of work have examined the use of variable-shape feature metrics, beyond block boundaries, from which interframe vectors can be calculated. Older designs such as H.261 and MPEG-1 video typically use
The vectors and full-samples, the sub-samples can be calculated by using bicubic or bilinear 2-D filtering. See subclause 8.4.2.2 "Fractional sample interpolation process" of the H.264 standard. Motion compensation is utilized in stereoscopic video coding. In video, time is often considered as the third dimension. Still, image coding techniques can be expanded to an extra dimension. JPEG 2000 uses wavelets, and these can also be used to encode motion without gaps between blocks in an adaptive way. Fractional pixel affine transformations lead to bleeding between adjacent pixels. If no higher internal resolution
The weight for this contribution to zero and increasing the other weights by an equal amount leads to a substantial reduction in complexity without a large penalty in quality. In such a scheme, each pixel then belongs to 3 blocks rather than 4, and rather than using 8 neighboring blocks, only 4 are used for each block to be compensated. Such a scheme is found in the H.263 Annex F Advanced Prediction mode. In motion compensation, quarter or half samples are actually interpolated sub-samples caused by fractional motion vectors. Based on