Linear Separator Algorithms
What is the SMO (SVM) classifier? - Sequential Minimal Optimization, or SMO. Training a support vector machine requires the solution of a very large quadratic programming (QP) optimization problem. SMO breaks this large QP problem into a series of the smallest possible QP problems. These small QP problems are solved analytically, which avoids using a time-consuming numerical QP optimization as an inner loop. The amount of memory required for SMO is linear in the training set size, which allows SMO to handle very large training sets. Because matrix computation is avoided, SMO scales somewhere between linear and quadratic in the training set size for various test problems, while the standard chunking SVM algorithm scales somewhere between linear and cubic in the training set size. SMO’s computation time is dominated by SVM evaluation, hence SMO is fastest for linear SVMs and sparse data sets. On real-world sparse data sets, SMO can be more than 1000 times faster than the chunking algorithm.
- For Optimal 2-class classifier.
- Extended for regression and clustering problems (1 class).
- maps feature vectors into a higher-dimensional space using a kernel function
- builds an optimal linear discriminating function in this space (linear?) or an optimal hyper-plane (RBF?) that fits the training data
- In case of SVM, the kernel is not defined explicitly.
- A distance needs to be defined between any 2 points in the hyper-space.
- The solution is optimal, the margin is maximal. between the separating hyper-plane and the nearest feature vectors
- The feature vectors that are the closest to the hyper-plane are called support vectors, which means that the position of other vectors does not affect the hyper-plane (the decision function).
- The model produced by support vector classification (as described above) depends only on a subset of the training data, because the cost function for building the model does not care about training points that lie beyond the margin.
Math of SVM on youtube:
- Linear - maximize the margin, optimal solution, only a few close points are really needed the others are zeroes by the alphas (alpha says “pay attention to this variable”) in the quadratic programming equation. XtX is a similarity function (pairs of points that relate to each other in output labels and how similar to one another, Xi’s point in the same direction) y1y2 are the labels. Therefore further points are not needed. But the similarity is important here(?)
- Non-linear - e.g. circle inside a circle, needs to map to a higher plane, a measure of similarity as XtX is important. We use this similarity idea to map into a higher plane, but we choose the higher plane for the purpose of a final function that behaves likes a known function, such as (A+B)^2. It turns out that (q1,q2,root(2)q1q2) is engineered with that root(2) thing for the purpose of making the multiplication of X^tY, which turns out to be (X^tY)^2. We can substitute this formula (X^tY)^2 instead of the X^tX in the quadratic equation to do that for us.This is the kernel trick that maps the inner class to one side and the outer circle class to the other and passes a plane in between them.
- Similarity is defined intuitively as all the points in one class vs the other.. I think
- A general kernel K=(X^tY + C)^p is a polynomial kernel that can define the above function and others.
- Quadratic eq with possible kernels including the polynomial.
- Most importantly the kernel function is our domain knowledge. (?) IMO we should choose a kernel that fits our feature data.
- The output of K is a number(?)
- Infinite dimensions - possible as well.
- Mercer condition - it acts like a distance\similar so that is the “rule” of which a kernel needs to follow.
- Super good lecture on MIT OPEN COURSE WARE - expands on the quadratic equations that were introduced in the previous course above.
- (basically punishment for overfitting and raising the non- linear class points higher and lower)
- The method of SVM can be extended to solve regression problems.
- Similar to SVM, the model produced by Support Vector Regression depends only on a subset of the training data, because the cost function for building the model ignores any training data close to the model prediction.
- using many kernel transforms to turn a non-linear problem into a linear problem beforehand.
From the link above, it seems like liblinear is very much the same thing, without those kernel transforms. So, as they say, in cases where the kernel transforms are not needed (they mention document classification), it will be faster.
- libsvm (SMO) implementation
- kernel (n^2)
- Linear SVM (n^3)
- liblinear - optimized to deal with linear classification without kernels
- Complexity O(n)
- does not support kernel SVMs.
- Scores higher
n is the number of samples in the training dataset.
Conclusion: In practice libsvm becomes painfully slow at 10k samples. Hence for medium to large scale datasets use liblinear and forget about libsvm (or maybe have a look at approximate kernel SVM solvers such as LaSVM, which saves training time and memory usage for large scale datasets).
- allows us to do certain calculations faster which otherwise would involve computations in higher dimensional space.
- K(x, y) = <f(x), f(y)>. Here K is the kernel function, x, y are n dimensional inputs. f is a map from n-dimension to m-dimension space. < x,y> denotes the dot product. usually m is much larger than n.
- normally calculating <f(x), f(y)> requires us to calculate f(x), f(y) first, and then do the dot product. These two computation steps can be quite expensive as they involve manipulations in m dimensional space, where m can be a large number.
- Result is ONLY a scalar, i..e., 1-dim space.
- We don’t need to do that calc if we use a clever kernel.
Simple Example: x = (x1, x2, x3); y = (y1, y2, y3). Then for the function f(x) = (x1x1, x1x2, x1x3, x2x1, x2x2, x2x3, x3x1, x3x2, x3x3), the kernel is K(x, y ) = (<x, y>)^2.
Let's plug in some numbers to make this more intuitive: suppose x = (1, 2, 3); y = (4, 5, 6). Then:
f(x) = (1, 2, 3, 2, 4, 6, 3, 6, 9) and f(y) = (16, 20, 24, 20, 25, 30, 24, 30, 36)
<f(x), f(y)> = 16 + 40 + 72 + 40 + 100+ 180 + 72 + 180 + 324 = 1024 i.e., 1*16+2*20+*3*24..
A lot of algebra. Mainly because f is a mapping from 3-dimensional to 9 dimensional space.
With a kernel its faster.
K(x, y) = (4 + 10 + 18 ) ^2 = 32^2 = 1024
A kernel is a magical shortcut to calculate even infinite dimensions!
- The idea of SVM is that y = w phi(x) +b, where w is the weight, phi is the feature vector, and b is the bias.
- if y> 0, then we classify datum to class 1, else to class 0.
- We want to find a set of weight and bias such that the margin is maximized.
- Previous answers mention that kernel makes data linearly separable for SVM. I think a more precise way to put this is, kernels do not make the the data linearly separable.
- The feature vector phi(x) makes the data linearly separable. Kernel is to make the calculation process faster and easier, especially when the feature vector phi is of very high dimension (for example, x1, x2, x3, ..., x_D^n, x1^2, x2^2, ...., x_D^2).
- Why it can also be understood as a measure of similarity: if we put the definition of kernel above, <f(x), f(y)>, in the context of SVM and feature vectors, it becomes <phi(x), phi(y)>. The inner product means the projection of phi(x) onto phi(y). or colloquially, how much overlap do x and y have in their feature space. In other words, how similar they are.
- SVM::LINEAR Linear kernel. No mapping is done, linear discrimination (or regression) is done in the original feature space. It is the fastest option..
- SVM::RBF Radial basis function (RBF), a good choice in most cases..
- There are heuristic methods that skip some search options
- However, no need for heuristics, computation-time is small, grid can be paralleled and we dont skip parameters.
- Controlling search complexity using two tier grid, coarse grid and then fine tune.
- In non linear kernels:
- Kernel choice
- Kernel parameters
- RBF - gamma, low and high values are far and near influence
- Reasonable first choice
- when the relation between class labels and attributes is nonlinear.
- Special case of C can make this similar to linear kernel (only! After finding C and gamma)
- Certain parameters makes it behave like the sigmoid kernel.
- Less hyperparameters than RBF kernel.
- 0 <Kij <1 unlike other kernels where the degree is 0<k<infinity
- Sigmoid is not valid under some parameters.
- DON'T USE when the #features is very large, use linear.
- Number of instances << number of features. I.e, 38 instances over 7000 features.
RBF=LINEAR When the number of features is large, we may not need to use RBF over Linear and vice versa (After finding C and gamma)
- Number of Instances & features is VERY LARGE. I.e, 20K samples X 20K features.
Similar performance with libsvm and liblinear, liblinear is faster by 150 times. Rule of thumb is to use for document classification.
- Number of instances >> number of features. Usually high dimensional mapping using non linear kernel. If we insist on liblinear, -s 2 leads to faster training.