[Notes] CSCI 567 lec6 Support Vector Machines (SVM)
26 Feb 2020 | CSCI567, English, Notes. Credit to: Prof. Yan Liu, CSCI567, Spring 2020
Outline
- Support vector machines (primal formulation)
- A detour of Lagrangian duality
- Support vector machines (dual formulation)
- One of the most commonly used classification algorithms
- Works well with the kernel trick
- Strong theoretical guarantees
Support vector machines (primal formulation)
- hinge loss: $\ell_\text{hinge}(z) = \max\{0,1-z\}$
For a linear model $(\boldsymbol{w}, b)$, SVM minimizes the L2-regularized total hinge loss (a code sketch follows the list below):
\[\min_{\boldsymbol{w},b} \left(\sum_n \max\{0,1 -y_n(\boldsymbol{w}^T \phi(\boldsymbol{x}_n)+b)\} + \frac{\lambda}{2}\|\boldsymbol{w}\|^2_2 \right)\]
- recall $y_n \in \{-1, +1\}$
- a nonlinear mapping $\phi$ is applied
- the bias/intercept term $b$ is used explicitly (think about why after this lecture)
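To make the objective concrete, here is a minimal NumPy sketch (not from the lecture); the identity feature map and the toy numbers are made-up placeholders.

```python
import numpy as np

def svm_primal_objective(w, b, X, y, lam):
    """sum_n max(0, 1 - y_n (w^T phi(x_n) + b)) + (lam/2) ||w||_2^2,
    with phi taken to be the identity feature map for simplicity."""
    margins = y * (X @ w + b)                      # y_n (w^T x_n + b)
    hinge = np.maximum(0.0, 1.0 - margins).sum()   # total hinge loss
    return hinge + 0.5 * lam * np.dot(w, w)        # plus L2 regularization

# toy usage with made-up numbers
X = np.array([[1.0, 2.0], [-1.0, -1.5]])
y = np.array([1.0, -1.0])
print(svm_primal_objective(np.array([0.5, 0.5]), 0.0, X, y, lam=0.1))
```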
Geometric motivation: separable case
When data is linearly separable, there are infinitely many hyperplanes with zero training error.
Intuition: the further the hyperplane is from the data points, the better.
Distance to hyperplane
What is the distance from a point $\boldsymbol{x}$ to the hyperplane $\{\boldsymbol{x}: \boldsymbol{w^Tx}+b=0\}$?
Consider moving the point onto the plane along the normal direction: after moving a distance $l$, the point becomes $\boldsymbol{x}-l\frac{\boldsymbol{w}}{\|\boldsymbol{w}\|_2}$, and this new point must lie on the plane:
\[0=\boldsymbol{w}^T\left(\boldsymbol{x}-l\frac{\boldsymbol{w}}{\|\boldsymbol{w}\|_2}\right) +b = \boldsymbol{w^Tx}-l\|\boldsymbol{w}\|_2 +b\]so $l = \frac{\boldsymbol{w^Tx}+b}{\|\boldsymbol{w}\|_2}$.
Therefore the distance is
\[\frac{|\boldsymbol{w^Tx}+b|}{\|\boldsymbol{w}\|_2}\]For a hyperplane that correctly classifies $(\boldsymbol{x}, y)$, the distance becomes
\[\frac{y(\boldsymbol{w^Tx}+b)}{\|\boldsymbol{w}\|_2}\]
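A small NumPy illustration of the two distance formulas above (unsigned and signed); the hyperplane $(\boldsymbol{w}, b)$, the point, and its label are arbitrary made-up values.

```python
import numpy as np

w, b = np.array([3.0, 4.0]), -5.0   # made-up hyperplane w^T x + b = 0
x, y = np.array([2.0, 1.0]), 1.0    # made-up point and its label

norm_w = np.linalg.norm(w)           # ||w||_2 = 5
distance = abs(w @ x + b) / norm_w   # |w^T x + b| / ||w||_2 = |6 + 4 - 5| / 5 = 1.0
signed = y * (w @ x + b) / norm_w    # equals the distance iff (x, y) is classified correctly
print(distance, signed)
```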
Maximizing margin
Margin: the smallest distance from all training points to the hyperplane
\[\text{MARGIN OF} (\boldsymbol{w}, b) = \min_n\frac{y_n(\boldsymbol{w^T\phi(x_n)}+b)}{\|\boldsymbol{w}\|_2}\]
Rescaling
Since $\{\boldsymbol{x}: \boldsymbol{w^T\phi(x)}+b = 0\} = \{\boldsymbol{x}: C\boldsymbol{w^T\phi(x)}+Cb = 0\}$ for any $C>0$, rescaling $(\boldsymbol{w}, b)$ does not change the hyperplane at all. Thus, to simplify the calculation, scale $(\boldsymbol{w}, b)$ s.t. $\min_n y_n(\boldsymbol{w^T\phi(x_n)}+b)=1$.
Then the margin becomes:
\[\text{MARGIN OF} (\boldsymbol{w}, b) = \min_n\frac{y_n(\boldsymbol{w^T\phi(x_n)}+b)}{\|\boldsymbol{w}\|_2}\\ =\frac{1}{\|\boldsymbol{w}\|_2}\]
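A tiny made-up numerical check that rescaling does not change the margin: suppose $\min_n y_n(\boldsymbol{w^T\phi(x_n)}+b)=2$ and $\|\boldsymbol{w}\|_2=4$, so the margin is $2/4 = 1/2$. Rescaling by $C=1/2$ gives
\[\min_n y_n\left(\tfrac{1}{2}\boldsymbol{w^T\phi(x_n)}+\tfrac{b}{2}\right)=1, \qquad \frac{1}{\|\tfrac{1}{2}\boldsymbol{w}\|_2} = \frac{1}{2},\]i.e. the same margin as before.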
For a separable training set, we aim to solve
\[\max_{\boldsymbol{w},b} \frac{1}{\|\boldsymbol{w}\|_2} \quad \text{s.t. } \min_n y_n(\boldsymbol{w^T\phi(x_n)}+b)=1\]which is equivalent to
\[\min_{\boldsymbol{w},b} \frac{1}{2}\|\boldsymbol{w}\|^2_2\\ \text{s.t. } y_n(\boldsymbol{w^T\phi(x_n)}+b)\geq 1, \forall n\]SVM is thus also called a max-margin classifier. The constraints above are called hard-margin constraints.
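Since the hard-margin problem above is a small quadratic program, it can be handed to a generic convex solver. Below is a hedged sketch assuming the cvxpy package is available; the 2-D toy data set is made up.

```python
import numpy as np
import cvxpy as cp

# made-up linearly separable toy data
X = np.array([[2.0, 2.0], [2.5, 1.5], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w, b = cp.Variable(2), cp.Variable()
constraints = [cp.multiply(y, X @ w + b) >= 1]          # hard-margin constraints
prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
prob.solve()

print(w.value, b.value)
print("margin =", 1.0 / np.linalg.norm(w.value))        # margin = 1 / ||w||_2
```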
General non-separable case
If the data is not linearly separable, the previous constraints $y_n(\boldsymbol{w^T\phi(x_n)}+b)\geq 1, \forall n$ are obviously infeasible.
To deal with this issue, we relax them to soft-margin constraints:
\[y_n(\boldsymbol{w^T\phi(x_n)}+b)\geq 1 - \xi_n, \forall n\]where we introduce slack variables $\xi_n ≥ 0$.
SVM Primal formulation
$\xi_n$ should be as small as possible. The objective becomes
\[\min_{\boldsymbol{w}, b, \{\xi_n\}} \frac{1}{2} \|\boldsymbol{w}\|^2_2 +C\sum_n \xi_n \\ \text{s.t. } y_n(\boldsymbol{w^T\phi(x_n)}+b)\geq 1-\xi_n, \qquad \forall n \\ \xi_n \ge 0, \qquad \forall n\]where $C$ is a hyperparameter that balances the two goals. Because each $\xi_n$ is minimized, the constraints
\[y_n(\boldsymbol{w^T\phi(x_n)}+b)\geq 1-\xi_n, \qquad \forall n \\ \xi_n \ge 0, \qquad \forall n\]are equivalent to
\[\max\{0, 1-y_n(\boldsymbol{w^T\phi(x_n)}+b)\} = \xi_n \qquad\forall n\]Substituting this back into the objective gives
\[\min_{\boldsymbol{w}, b}C\sum_n \max\{0, 1-y_n(\boldsymbol{w^T\phi(x_n)}+b)\} + \frac{1}{2}\|\boldsymbol{w}\|^2_2 \\ \min_{\boldsymbol{w}, b}\sum_n \max\{0, 1-y_n(\boldsymbol{w^T\phi(x_n)}+b)\} + \frac{\lambda}{2}\|\boldsymbol{w}\|^2_2\]where $\lambda = 1/C$. This is exactly minimizing the L2-regularized hinge loss!
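A hedged NumPy check of the equivalence: for a fixed $(\boldsymbol{w}, b)$, the smallest feasible slacks are exactly the hinge losses, so the constrained objective equals $C$ times the regularized hinge-loss objective with $\lambda = 1/C$. All numbers below are made up.

```python
import numpy as np

X = np.array([[1.0, 1.0], [2.0, -1.0], [-1.0, -2.0]])   # made-up data
y = np.array([1.0, -1.0, -1.0])
w, b, C = np.array([0.3, -0.2]), 0.1, 2.0

margins = y * (X @ w + b)
xi = np.maximum(0.0, 1.0 - margins)        # smallest slacks satisfying both constraints

constrained = 0.5 * w @ w + C * xi.sum()         # (1/2)||w||^2 + C sum_n xi_n
hinge_obj = xi.sum() + (1.0 / C) * 0.5 * w @ w   # hinge loss + (lambda/2)||w||^2
print(np.isclose(constrained, C * hinge_obj))    # True: objectives match up to the factor C
```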
Optimization
\[\min_{\boldsymbol{w}, b, \{\xi_n\}} \frac{1}{2} \|\boldsymbol{w}\|^2_2 +C\sum_n \xi_n \\ \text{s.t. } y_n(\boldsymbol{w^T\phi(x_n)}+b)\geq 1-\xi_n, \qquad \forall n \\ \xi_n \ge 0, \qquad \forall n\]- It is a convex (quadratic in fact) problem
- thus we can apply any convex optimization algorithm, e.g. SGD (see the sketch after this list)
- there are more specialized and efficient algorithms (these are used most often in practice)
- but usually we apply the kernel trick, which requires solving the dual problem
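As a concrete instance of the generic route mentioned above, here is a minimal full-batch subgradient descent sketch on the regularized hinge-loss form of the primal. The step size, iteration count, and toy data are arbitrary choices, not a tuned implementation.

```python
import numpy as np

def svm_subgradient_descent(X, y, lam=0.1, lr=0.01, iters=1000):
    """Minimize sum_n max(0, 1 - y_n (w^T x_n + b)) + (lam/2) ||w||_2^2 by subgradient descent."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(iters):
        margins = y * (X @ w + b)
        active = margins < 1                                            # points with nonzero hinge loss
        grad_w = lam * w - (y[active, None] * X[active]).sum(axis=0)    # subgradient w.r.t. w
        grad_b = -y[active].sum()                                       # subgradient w.r.t. b
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# made-up toy data
X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -1.0], [-1.0, -2.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = svm_subgradient_descent(X, y)
print(np.sign(X @ w + b))   # should recover the labels on this separable toy set
```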
Lagrangian duality
- Extremely important and powerful tool in analyzing optimizations
- We will introduce basic concepts and derive the KKT conditions
- Applying it to SVM reveals an important aspect of the algorithm
Primal Problem
Suppose we want to solve
\[\min_\boldsymbol{w} F(\boldsymbol{w})\qquad \text{s.t. } h_j (\boldsymbol{w}) \leq 0 \quad \forall j \in [J]\]Specialized to the SVM primal formulation, with $J=2N$ constraints:
\[\begin{aligned} F(\boldsymbol{w}, b,\{\xi_n\})&=\frac{1}{2} \|\boldsymbol{w}\|^2_2 +C\sum_n \xi_n \\ h_n(\boldsymbol{w}, b,\{\xi_n\})&= 1- y_n(\boldsymbol{w^T\phi(x_n)}+b) - \xi_n \quad \forall n \in [N]\\ h_{N+n}(\boldsymbol{w}, b,\{\xi_n\})&= -\xi_n \quad \forall n\in [N] \end{aligned}\]
Lagrangian
The Lagrangian of the previous problem is defined as:
\[L(\boldsymbol{w}, \{\lambda_j\}) = F(\boldsymbol{w})+\sum^J_{j=1}\lambda_jh_j(\boldsymbol{w})\]where $\lambda_1, …,\lambda_J\ge0$ are called Lagrangian multipliers. Note that
\[\max_{\{\lambda_j\}\ge 0} L(\boldsymbol{w}, \{\lambda_j\})= \begin{cases} F(\boldsymbol{w}) \qquad &\text{if } h_j(\boldsymbol{w})\le0 \quad \forall j\in [J] \\ + \infty \qquad &\text{otherwise} \end{cases}\](if some $h_j(\boldsymbol{w})>0$, let $\lambda_j \to \infty$; if all $h_j(\boldsymbol{w})\le 0$, the max is attained with $\lambda_j h_j(\boldsymbol{w})=0$). Thus
\[\min_\boldsymbol{w} \max_{\{\lambda_j\}\ge0}L(\boldsymbol{w},\{\lambda_j\}) \Leftrightarrow \min_\boldsymbol{w}F(\boldsymbol{w}) \qquad\text{s.t. } h_j(\boldsymbol{w})\le 0 \quad\forall j\in [J]\]
Duality
We define the dual problem by swapping the min and max:
\[\max_{\{\lambda_j\}\ge0} \min_{\boldsymbol{w}} L(\boldsymbol{w}, \{\lambda_j\})\]Let $\boldsymbol{w}^*$ and $\{\lambda_j^*\}$ be the primal and dual solutions respectively; then
\(\begin{aligned} \max_{\{\lambda_j\}\ge0} \min_{\boldsymbol{w}} L(\boldsymbol{w}, \{\lambda_j\}) &= \min_{\boldsymbol{w}} L(\boldsymbol{w}, \{\lambda_j^*\}) \le L(\boldsymbol{w}^*, \{\lambda_j^*\})\\ &\le \max_{\{\lambda_j\}\ge0} L(\boldsymbol{w}^*, \{\lambda_j\})= \min_{\boldsymbol{w}} \max_{\{\lambda_j\}\ge0} L(\boldsymbol{w}, \{\lambda_j\}) \end{aligned}\) This is called “weak duality”.
Strong duality
When $F, h_1, \dots, h_J$ are convex, under some mild conditions (e.g. Slater's condition: a strictly feasible point exists), strong duality holds:
\[\min_{\boldsymbol{w}} \max_{\{\lambda_j\}\ge0} L(\boldsymbol{w}, \{\lambda_j\}) =\max_{\{\lambda_j\}\ge0} \min_{\boldsymbol{w}} L(\boldsymbol{w}, \{\lambda_j\})\]
Karush-Kuhn-Tucker (KKT) conditions
Deriving
If strong duality holds:
\[\begin{aligned} F(\boldsymbol{w}^*) &= \min_{\boldsymbol{w}} \max_{\{\lambda_j\}\ge0} L(\boldsymbol{w}, \{\lambda_j\}) \\ &= \max_{\{\lambda_j\}\ge0} \min_{\boldsymbol{w}} L(\boldsymbol{w}, \{\lambda_j\}) \\ &=\min_{\boldsymbol{w}} L(\boldsymbol{w}, \{\lambda_j^*\})\\ &\le L(\boldsymbol{w}^*, \{\lambda_j^*\}) = F(\boldsymbol{w}^*)+\sum^J_{j=1}\lambda_j^*h_j(\boldsymbol{w}^*) \le F(\boldsymbol{w}^*) \end{aligned}\]Implications (KKT):
- all inequalities above have to be equalities!
- the last inequality being an equality implies $\sum_j \lambda^*_j h_j(\boldsymbol{w}^*) = 0$; since every term is nonpositive, $\lambda^*_jh_j(\boldsymbol{w}^*) = 0 \quad\forall j \in [J]$
- the equality $\min_{\boldsymbol{w}} L(\boldsymbol{w}, \{\lambda^*_j \}) = L(\boldsymbol{w}^*, \{\lambda^*_j \})$ implies $\boldsymbol{w}^*$ is a minimizer of $L(\boldsymbol{w}, \{\lambda^*_j \})$ and thus has zero gradient:
\[\nabla_{\boldsymbol{w}} L(\boldsymbol{w}^*, \{\lambda^*_j \}) = \nabla F(\boldsymbol{w}^*) + \sum^J_{j=1}\lambda_j^* \nabla h_j(\boldsymbol{w}^*) = 0\]
The conditions
If $\boldsymbol{w}^*$ and $\{\lambda_j^*\}$ are the primal and dual solutions respectively, then:
Stationarity:
\[\nabla_{\boldsymbol{w}} L(\boldsymbol{w}^*, \{\lambda^*_j \}) = \nabla F(\boldsymbol{w}^*) + \sum^J_{j=1}\lambda_j^* \nabla h_j(\boldsymbol{w}^*) = 0\]Complementary slackness:
\[λ^∗_jh_j(w^∗) = 0 \quad\forall j \in [J]\]Feasibility:
\[h_j(\boldsymbol{w}^*)\le 0 \text{ and } \lambda_j^*\ge 0 \quad \forall j \in [J]\]These are necessary conditions. They are also sufficient when $F$ is convex and $h_1, \dots, h_J$ are continuously differentiable convex functions.
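A tiny worked example (not from the lecture) to see the conditions in action: minimize $F(w)=w^2$ subject to $h(w) = 1 - w \le 0$. The Lagrangian is $L(w,\lambda) = w^2 + \lambda(1-w)$, so the KKT conditions read
\[2w^* - \lambda^* = 0, \qquad \lambda^*(1-w^*) = 0, \qquad 1-w^* \le 0, \qquad \lambda^* \ge 0.\]If $\lambda^* = 0$ then $w^* = 0$, which violates $1 - w^* \le 0$; hence $w^* = 1$ and $\lambda^* = 2$, which indeed minimizes $w^2$ over $w \ge 1$.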
Support vector machines (dual formulation)
The primal formulation
\[\min_{\boldsymbol{w}, b, \{\xi_n\}} \frac{1}{2} \|\boldsymbol{w}\|^2_2 +C\sum_n \xi_n \\ \text{s.t. } y_n(\boldsymbol{w^T\phi(x_n)}+b)\geq 1-\xi_n, \qquad \forall n \\ \xi_n \ge 0, \qquad \forall n\]The Lagrangian is
\[L(\boldsymbol{w}, b, \{\xi_n\}, \{\alpha_n\},\{\lambda_n\}) = \frac{1}{2} \|\boldsymbol{w}\|^2_2 +C\sum_n \xi_n -\sum_n\lambda_n\xi_n + \sum_n\alpha_n(1-y_n(\boldsymbol{w^T\phi(x_n)}+b)-\xi_n)\]where $\alpha_1, \dots, \alpha_N \ge 0$ and $\lambda_1, \dots, \lambda_N\ge0$ are Lagrangian multipliers.
Applying the stationarity condition
Let $\nabla_{\boldsymbol{w},b,\{\xi_n\}}L = 0$:
\[\frac{\partial L}{\partial\boldsymbol{w}} = \boldsymbol{w}-\sum_n y_n\alpha_n\phi(\boldsymbol{x}_n) = 0 \Rightarrow \boldsymbol{w}= \sum_ny_n\alpha_n\phi(\boldsymbol{x}_n) \\ \frac{\partial L}{\partial b} = -\sum_n \alpha_n y_n = 0\\ \frac{\partial L}{\partial \xi_n} = C-\lambda_n-\alpha_n=0, \quad \forall n\]Replacing $\boldsymbol{w}$ by $\sum_ny_n\alpha_n\phi(\boldsymbol{x}_n)$ in the Lagrangian then leads to the dual problem over $\{\alpha_n\}$.
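As a hedged numerical check of the first stationarity condition $\boldsymbol{w}= \sum_ny_n\alpha_n\phi(\boldsymbol{x}_n)$, the sketch below assumes scikit-learn is available: for a linear kernel ($\phi(\boldsymbol{x})=\boldsymbol{x}$), `SVC` stores $y_n\alpha_n$ for the support vectors in `dual_coef_`, so $\boldsymbol{w}$ can be reconstructed from the dual variables and compared with `coef_`.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# made-up two-class toy data with labels mapped to {-1, +1}
X, y = make_blobs(n_samples=40, centers=2, random_state=0)
y = 2 * y - 1

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# stationarity: w = sum_n y_n alpha_n x_n, summed over the support vectors
w_from_dual = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w_from_dual, clf.coef_))   # expected True (up to solver tolerance)
```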