In machine learning, support vector machines (SVMs, also support vector networks) are supervised max-margin models with associated learning algorithms that analyze data for classification and regression analysis. Developed at AT&T Bell Laboratories, SVMs are one of the most studied models, being based on the statistical learning frameworks of VC theory proposed by Vapnik (1982, 1995) and Chervonenkis (1974).
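The basic workflow of an SVM classifier — fit a max-margin model on labeled data, then predict classes for new points — can be illustrated with a minimal sketch. This assumes scikit-learn is available; the synthetic data, RBF kernel, and C value are illustrative choices, not prescribed by the text.

    # Minimal SVM classification sketch (assumes scikit-learn is installed).
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    # Illustrative synthetic two-class data.
    X, y = make_classification(n_samples=200, n_features=5, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    clf = SVC(kernel="rbf", C=1.0)   # soft-margin SVM with a Gaussian kernel
    clf.fit(X_train, y_train)
    print("test accuracy:", clf.score(X_test, y_test))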
In addition to performing linear classification, SVMs can efficiently perform non-linear classification using the kernel trick, representing the data only through a set of pairwise similarity comparisons between the original data points using a kernel function, which transforms them into coordinates in a higher-dimensional feature space. Thus, SVMs use the kernel trick to implicitly map their inputs into high-dimensional feature spaces, where linear classification can be performed. Being max-margin models, SVMs are resilient to noisy data (e.g., misclassified examples). SVMs can also be used for regression tasks, where
a {\displaystyle (p-1)}-dimensional hyperplane. This is called a linear classifier. There are many hyperplanes that might classify the data. One reasonable choice as the best hyperplane is the one that represents the largest separation, or margin, between the two classes. So we choose the hyperplane so that the distance from it to the nearest data point on each side
a high-dimensional input space with a hyperplane: all points on one side of the hyperplane are classified as "yes", while the others are classified as "no". A linear classifier is often used in situations where the speed of classification is an issue, since it is often the fastest classifier, especially when {\displaystyle {\vec {x}}} is sparse. Also, linear classifiers often work very well when
a linear classifier makes a classification decision for each object based on a linear combination of its features. Such classifiers work well for practical problems such as document classification, and more generally for problems with many variables (features), reaching accuracy levels comparable to non-linear classifiers while taking less time to train and use. If the input feature vector to
a certain threshold to the first class and all other values to the second class; e.g., {\displaystyle f({\vec {w}}\cdot {\vec {x}})={\begin{cases}1&{\text{if }}{\vec {w}}^{\mathsf {T}}{\vec {x}}>\theta ,\\0&{\text{otherwise.}}\end{cases}}} The superscript T indicates the transpose and {\displaystyle \theta } is a scalar threshold. A more complex f might give the probability that an item belongs to a certain class. For a two-class classification problem, one can visualize the operation of a linear classifier as splitting
a different input space {\displaystyle \varphi ({\vec {x}})}, using the kernel trick. Discriminative training of linear classifiers usually proceeds in a supervised way, by means of an optimization algorithm that is given a training set with desired outputs and a loss function that measures the discrepancy between
a finite-dimensional space, it often happens that the sets to discriminate are not linearly separable in that space. For this reason, it was proposed that the original finite-dimensional space be mapped into a much higher-dimensional space, presumably making the separation easier in that space. To keep the computational load reasonable, the mappings used by SVM schemes are designed to ensure that dot products of pairs of input data vectors may be computed easily in terms of
a good separation is achieved by the hyperplane that has the largest distance to the nearest training-data point of any class (the so-called functional margin), since in general the larger the margin, the lower the generalization error of the classifier. A lower generalization error means that the implementer is less likely to experience overfitting. Whereas the original problem may be stated in
a higher-dimensional feature space increases the generalization error of support vector machines, although given enough samples the algorithm still performs well. Some common kernels include the polynomial, Gaussian radial basis function (RBF), and sigmoid kernels. The kernel is related to the transform {\displaystyle \varphi (\mathbf {x} _{i})} by the equation {\displaystyle k(\mathbf {x} _{i},\mathbf {x} _{j})=\varphi (\mathbf {x} _{i})\cdot \varphi (\mathbf {x} _{j})}. The value w
a hyperplane. The vectors defining the hyperplanes can be chosen to be linear combinations with parameters {\displaystyle \alpha _{i}} of images of feature vectors {\displaystyle x_{i}} that occur in the database. With this choice of a hyperplane, the points {\displaystyle x} in
a linear combination of the support vectors. The offset, {\displaystyle b}, can be recovered by finding an {\displaystyle \mathbf {x} _{i}} on the margin's boundary and solving {\displaystyle y_{i}(\mathbf {w} ^{\mathsf {T}}\mathbf {x} _{i}-b)=1\iff b=\mathbf {w} ^{\mathsf {T}}\mathbf {x} _{i}-y_{i}.} (Note that {\displaystyle y_{i}^{-1}=y_{i}} since {\displaystyle y_{i}=\pm 1}.) Suppose now that we would like to learn
a nonlinear classification rule which corresponds to a linear classification rule for the transformed data points {\displaystyle \varphi (\mathbf {x} _{i}).} Moreover, we are given a kernel function {\displaystyle k} which satisfies {\displaystyle k(\mathbf {x} _{i},\mathbf {x} _{j})=\varphi (\mathbf {x} _{i})\cdot \varphi (\mathbf {x} _{j})}. We know
a variable {\displaystyle \zeta _{i}=\max \left(0,1-y_{i}(\mathbf {w} ^{\mathsf {T}}\mathbf {x} _{i}-b)\right)}. Note that {\displaystyle \zeta _{i}}
a way to create nonlinear classifiers by applying the kernel trick to maximum-margin hyperplanes. The "soft margin" incarnation, as is commonly used in software packages, was proposed by Corinna Cortes and Vapnik in 1993 and published in 1995. We are given a training dataset of {\displaystyle n} points of the form {\displaystyle (\mathbf {x} _{1},y_{1}),\ldots ,(\mathbf {x} _{n},y_{n}),} where
is a one-form or linear functional mapping {\displaystyle {\vec {x}}} onto R.) The weight vector {\displaystyle {\vec {w}}} is learned from a set of labeled training samples. Often f is a threshold function, which maps all values of {\displaystyle {\vec {w}}\cdot {\vec {x}}} above
is a supervised learning algorithm that utilizes the labels of the data, while PCA is an unsupervised learning algorithm that ignores the labels. To summarize, the name is a historical artifact. Discriminative training often yields higher accuracy than modeling the conditional density functions. However, handling missing data is often easier with conditional density models. All of the linear classifier algorithms listed above can be converted into non-linear algorithms operating on
is a common task in machine learning. Suppose some given data points each belong to one of two classes, and the goal is to decide which class a new data point will be in. In the case of support vector machines, a data point is viewed as a {\displaystyle p}-dimensional vector (a list of {\displaystyle p} numbers), and we want to know whether we can separate such points with
is also in the transformed space, with {\textstyle \mathbf {w} =\sum _{i}\alpha _{i}y_{i}\varphi (\mathbf {x} _{i})}. Dot products with w for classification can again be computed by the kernel trick, i.e. {\textstyle \mathbf {w} \cdot \varphi (\mathbf {x} )=\sum _{i}\alpha _{i}y_{i}k(\mathbf {x} _{i},\mathbf {x} )}. Computing
is called the dual problem. Since the dual maximization problem is a quadratic function of the {\displaystyle c_{i}} subject to linear constraints, it is efficiently solvable by quadratic programming algorithms. Here, the variables {\displaystyle c_{i}} are defined such that {\displaystyle \mathbf {w} =\sum _{i=1}^{n}c_{i}y_{i}\mathbf {x} _{i}.} Moreover, {\displaystyle c_{i}=0} exactly when {\displaystyle \mathbf {x} _{i}} lies on
is called the primal problem. By solving for the Lagrangian dual of the above problem, one obtains the simplified problem {\displaystyle {\begin{aligned}&{\text{maximize}}\,\,f(c_{1}\ldots c_{n})=\sum _{i=1}^{n}c_{i}-{\frac {1}{2}}\sum _{i=1}^{n}\sum _{j=1}^{n}y_{i}c_{i}(\mathbf {x} _{i}^{\mathsf {T}}\mathbf {x} _{j})y_{j}c_{j},\\&{\text{subject to }}\sum _{i=1}^{n}c_{i}y_{i}=0,\,{\text{and }}0\leq c_{i}\leq {\frac {1}{2n\lambda }}\;{\text{for all }}i.\end{aligned}}} This
is called the "margin", and the maximum-margin hyperplane is the hyperplane that lies halfway between them. With a normalized or standardized dataset, these hyperplanes can be described by the equations {\displaystyle \mathbf {w} ^{\mathsf {T}}\mathbf {x} -b=1} and {\displaystyle \mathbf {w} ^{\mathsf {T}}\mathbf {x} -b=-1.} Geometrically, the distance between these two hyperplanes is {\displaystyle {\tfrac {2}{\|\mathbf {w} \|}}}, so to maximize
is detailed below. Then, more recent approaches such as sub-gradient descent and coordinate descent will be discussed. Minimizing (2) can be rewritten as a constrained optimization problem with a differentiable objective function in the following way. For each {\displaystyle i\in \{1,\,\ldots ,\,n\}} we introduce
is easily derived in the dual representation of the SVM problem. This allows the algorithm to fit the maximum-margin hyperplane in a transformed feature space. The transformation may be nonlinear and the transformed space high-dimensional; although the classifier is a hyperplane in the transformed feature space, it may be nonlinear in the original input space. It is noteworthy that working in
is helpful: {\displaystyle \max \left(0,1-y_{i}(\mathbf {w} ^{\mathsf {T}}\mathbf {x} _{i}-b)\right).} Note that {\displaystyle y_{i}} is the i-th target (i.e., in this case, 1 or −1), and {\displaystyle \mathbf {w} ^{\mathsf {T}}\mathbf {x} _{i}-b}
is maximized. Any hyperplane can be written as the set of points {\displaystyle \mathbf {x} } satisfying {\displaystyle \mathbf {w} ^{\mathsf {T}}\mathbf {x} -b=0,} where {\displaystyle \mathbf {w} } is the (not necessarily normalized) normal vector to
is maximized. If such a hyperplane exists, it is known as the maximum-margin hyperplane and the linear classifier it defines is known as a maximum-margin classifier; or equivalently, the perceptron of optimal stability. More formally, a support vector machine constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression, or other tasks such as outlier detection. Intuitively,
is that the max-margin hyperplane is completely determined by those {\displaystyle \mathbf {x} _{i}} that lie nearest to it (explained below). These {\displaystyle \mathbf {x} _{i}} are called support vectors. To extend SVM to cases in which the data are not linearly separable, the hinge loss function
is the i-th output. This function is zero if the constraint in (1) is satisfied, in other words, if {\displaystyle \mathbf {x} _{i}} lies on the correct side of the margin. For data on the wrong side of the margin, the function's value is proportional to the distance from the margin. The goal of the optimization then is to minimize {\displaystyle \lVert \mathbf {w} \rVert ^{2}+C\left[{\frac {1}{n}}\sum _{i=1}^{n}\max \left(0,1-y_{i}(\mathbf {w} ^{\mathsf {T}}\mathbf {x} _{i}-b)\right)\right],} where
is the smallest nonnegative number satisfying {\displaystyle y_{i}(\mathbf {w} ^{\mathsf {T}}\mathbf {x} _{i}-b)\geq 1-\zeta _{i}.} Thus we can rewrite the optimization problem as follows: {\displaystyle {\begin{aligned}&{\text{minimize }}{\frac {1}{n}}\sum _{i=1}^{n}\zeta _{i}+\lambda \|\mathbf {w} \|^{2}\\[0.5ex]&{\text{subject to }}y_{i}\left(\mathbf {w} ^{\mathsf {T}}\mathbf {x} _{i}-b\right)\geq 1-\zeta _{i}\,{\text{ and }}\,\zeta _{i}\geq 0,\,{\text{for all }}i.\end{aligned}}} This
the {\displaystyle y_{i}} are either 1 or −1, each indicating the class to which the point {\displaystyle \mathbf {x} _{i}} belongs. Each {\displaystyle \mathbf {x} _{i}} is a {\displaystyle p}-dimensional real vector. We want to find the "maximum-margin hyperplane" that divides
the feature space that are mapped into the hyperplane are defined by the relation {\displaystyle \textstyle \sum _{i}\alpha _{i}k(x_{i},x)={\text{constant}}.} Note that if {\displaystyle k(x,y)} becomes small as {\displaystyle y} grows further away from {\displaystyle x}, each term in
the (soft-margin) SVM classifier amounts to minimizing an expression of the form {\displaystyle \left[{\frac {1}{n}}\sum _{i=1}^{n}\max \left(0,1-y_{i}(\mathbf {w} ^{\mathsf {T}}\mathbf {x} _{i}-b)\right)\right]+\lambda \|\mathbf {w} \|^{2}.\qquad (2)} We focus on the soft-margin classifier since, as noted above, choosing a sufficiently small value for {\displaystyle \lambda } yields the hard-margin classifier for linearly classifiable input data. The classical approach, which involves reducing (2) to a quadratic programming problem,
the classification vector {\displaystyle \mathbf {w} } in the transformed space satisfies {\displaystyle \mathbf {w} =\sum _{i=1}^{n}c_{i}y_{i}\varphi (\mathbf {x} _{i}),} Linear classifier In machine learning,
the classifier is a real vector {\displaystyle {\vec {x}}}, then the output score is {\displaystyle y=f({\vec {w}}\cdot {\vec {x}})=f\left(\sum _{j}w_{j}x_{j}\right),} where {\displaystyle {\vec {w}}} is a real vector of weights and f is a function that converts the dot product of the two vectors into the desired output. (In other words, {\displaystyle {\vec {w}}}
the classifier's outputs and the desired outputs. Thus, the learning algorithm solves an optimization problem of the form {\displaystyle {\underset {{\vec {w}}}{\arg \min }}\;R({\vec {w}})+C\sum _{i=1}^{N}L(y_{i},{\vec {w}}\cdot {\vec {x}}_{i}),} where the sum runs over the N training examples, R is a regularization function, L is a loss function, and C is a constant controlling their trade-off. Popular loss functions include the hinge loss (for linear SVMs) and the log loss (for linear logistic regression). If the regularization function R is convex, then the above is a convex problem. Many algorithms exist for solving such problems; popular ones for linear classification include (stochastic) gradient descent, L-BFGS, coordinate descent and Newton methods.
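As a concrete illustration of this optimization view, the sketch below minimizes the regularized hinge loss of a linear SVM by sub-gradient descent using NumPy. The function name, the assumption that labels are in {−1, +1}, and the step size, regularization strength, and epoch count are illustrative choices made for the example, not values taken from the text.

    import numpy as np

    def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
        """Sub-gradient descent on (1/n) * sum_i max(0, 1 - y_i (w.x_i - b)) + lam * ||w||^2.
        Assumes y holds labels +1 / -1; lam, lr and epochs are illustrative."""
        n, p = X.shape
        w, b = np.zeros(p), 0.0
        for _ in range(epochs):
            margins = y * (X @ w - b)
            active = margins < 1                 # points that violate the margin
            # sub-gradient of the hinge term plus gradient of the L2 penalty
            grad_w = -(y[active][:, None] * X[active]).sum(axis=0) / n + 2 * lam * w
            grad_b = y[active].sum() / n
            w -= lr * grad_w
            b -= lr * grad_b
        return w, b

    # Prediction for new points: sign(X_new @ w - b)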
the correct side of the margin, and {\displaystyle 0<c_{i}<(2n\lambda )^{-1}} when {\displaystyle \mathbf {x} _{i}} lies on the margin's boundary. It follows that {\displaystyle \mathbf {w} } can be written as
the correct side of the margin. This can be rewritten as {\displaystyle y_{i}(\mathbf {w} ^{\mathsf {T}}\mathbf {x} _{i}-b)\geq 1,\quad {\text{for all }}1\leq i\leq n.\qquad (1)} We can put this together to get the optimization problem: {\displaystyle {\begin{aligned}&{\underset {\mathbf {w} ,\;b}{\operatorname {minimize} }}&&{\frac {1}{2}}\|\mathbf {w} \|^{2}\\&{\text{subject to}}&&y_{i}(\mathbf {w} ^{\top }\mathbf {x} _{i}-b)\geq 1\quad \forall i\in \{1,\dots ,n\}\end{aligned}}} The {\displaystyle \mathbf {w} } and {\displaystyle b} that solve this problem determine
the data into groups, and then to map new data according to these clusters. The popularity of SVMs is likely due to their amenability to theoretical analysis, and their flexibility in being applied to a wide variety of tasks, including structured prediction problems. It is not clear that SVMs have better predictive performance than other linear models, such as logistic regression and linear regression. Classifying data
the distance between the planes we want to minimize {\displaystyle \|\mathbf {w} \|}. The distance is computed using the distance from a point to a plane equation. To prevent data points from falling into the margin, we add the following constraint: for each {\displaystyle i}, either {\displaystyle \mathbf {w} ^{\mathsf {T}}\mathbf {x} _{i}-b\geq 1\,,{\text{ if }}y_{i}=1,} or {\displaystyle \mathbf {w} ^{\mathsf {T}}\mathbf {x} _{i}-b\leq -1\,,{\text{ if }}y_{i}=-1.} These constraints state that each data point must lie on
the final classifier, {\displaystyle \mathbf {x} \mapsto \operatorname {sgn}(\mathbf {w} ^{\mathsf {T}}\mathbf {x} -b)}, where {\displaystyle \operatorname {sgn}(\cdot )} is the sign function. An important consequence of this geometric description
the following: {\displaystyle {\begin{aligned}&{\underset {\mathbf {w} ,\;b,\;\mathbf {\zeta } }{\operatorname {minimize} }}&&\|\mathbf {w} \|_{2}^{2}+C\sum _{i=1}^{n}\zeta _{i}\\&{\text{subject to}}&&y_{i}(\mathbf {w} ^{\top }\mathbf {x} _{i}-b)\geq 1-\zeta _{i},\quad \zeta _{i}\geq 0\quad \forall i\in \{1,\dots ,n\}\end{aligned}}} Thus, for large values of {\displaystyle C}, it will behave similarly to
the group of points {\displaystyle \mathbf {x} _{i}} for which {\displaystyle y_{i}=1} from the group of points for which {\displaystyle y_{i}=-1}, which is defined so that the distance between the hyperplane and the nearest point {\displaystyle \mathbf {x} _{i}} from either group
the hard-margin SVM, if the input data are linearly classifiable, but will still learn whether a classification rule is viable or not. The original maximum-margin hyperplane algorithm proposed by Vapnik in 1963 constructed a linear classifier. However, in 1992, Bernhard Boser, Isabelle Guyon and Vladimir Vapnik suggested a way to create nonlinear classifiers by applying the kernel trick (originally proposed by Aizerman et al.) to maximum-margin hyperplanes. The kernel trick, where dot products are replaced by kernels,
the hyperplane. This is much like Hesse normal form, except that {\displaystyle \mathbf {w} } is not necessarily a unit vector. The parameter {\displaystyle {\tfrac {b}{\|\mathbf {w} \|}}} determines the offset of the hyperplane from the origin along the normal vector {\displaystyle \mathbf {w} }. Warning: most of
the literature on the subject defines the bias so that {\displaystyle \mathbf {w} ^{\mathsf {T}}\mathbf {x} +b=0.} If the training data is linearly separable, we can select two parallel hyperplanes that separate the two classes of data, so that the distance between them is as large as possible. The region bounded by these two hyperplanes
the number of dimensions in {\displaystyle {\vec {x}}} is large, as in document classification, where each element in {\displaystyle {\vec {x}}} is typically the number of occurrences of a word in a document (see document-term matrix). In such cases, the classifier should be well-regularized. There are two broad classes of methods for determining
the objective becomes {\displaystyle \epsilon }-sensitive. The support vector clustering algorithm, created by Hava Siegelmann and Vladimir Vapnik, applies the statistics of support vectors, developed in the support vector machines algorithm, to categorize unlabeled data. These data sets require unsupervised learning approaches, which attempt to find natural clustering of
the parameter {\displaystyle C>0} determines the trade-off between increasing the margin size and ensuring that the {\displaystyle \mathbf {x} _{i}} lie on the correct side of the margin (note that we can add a weight to either term in the equation above). By deconstructing the hinge loss, this optimization problem can be massaged into
the parameters of a linear classifier {\displaystyle {\vec {w}}}. These can be generative or discriminative models. Methods of the former model the joint probability distribution, whereas methods of the latter model conditional density functions {\displaystyle P({\rm {class}}|{\vec {x}})}. Examples of such algorithms include: The second set of methods includes discriminative models, which attempt to maximize
the quality of the output on a training set. Additional terms in the training cost function can easily perform regularization of the final model. Examples of discriminative training of linear classifiers include: Note: Despite its name, LDA does not belong to the class of discriminative models in this taxonomy. However, its name makes sense when we compare LDA to the other main linear dimensionality reduction algorithm: principal components analysis (PCA). LDA
the set of points {\displaystyle x} mapped into any hyperplane can be quite convoluted as a result, allowing much more complex discrimination between sets that are not convex at all in the original space. SVMs can be used to solve various real-world problems: The original SVM algorithm was invented by Vladimir N. Vapnik and Alexey Ya. Chervonenkis in 1964. In 1992, Bernhard Boser, Isabelle Guyon and Vladimir Vapnik suggested
the sum measures the degree of closeness of the test point {\displaystyle x} to the corresponding database point {\displaystyle x_{i}}. In this way, the sum of kernels above can be used to measure the relative nearness of each test point to the data points originating in one or the other of the sets to be discriminated. Note that
the variables in the original space, by defining them in terms of a kernel function {\displaystyle k(x,y)} selected to suit the problem. The hyperplanes in the higher-dimensional space are defined as the set of points whose dot product with a vector in that space is constant, where such a set of vectors is an orthogonal (and thus minimal) set of vectors that defines