[Notes] CSCI 567 lec6 Support Vector Machines (SVM)
26 Feb 2020 | CSCI567, English, Notes. Credit to: Prof. Yan Liu, CSCI567, Spring 2020
Outline
- Support vector machines (primal formulation)
- A detour of Lagrangian duality
- Support vector machines (dual formulation)
- One of the most commonly used classification algorithms
- Works well with the kernel trick
- Strong theoretical guarantees
Support vector machines (primal formulation)
- hinge loss: $\ell_\text{hinge}(z) = \max\{0,1-z\}$
For a linear model $(\boldsymbol{w}, b)$, SVM minimizes the L2-regularized total hinge loss (a code sketch follows the list below):
\[\min_{\boldsymbol{w},b} \left(\sum_n \max\{0,1 -y_n(\boldsymbol{w}^T \phi(\boldsymbol{x}_n)+b)\} + \frac{\lambda}{2}\|\boldsymbol{w}\|^2_2 \right)\]
- recall $y_n \in \{-1, +1\}$
- a nonlinear mapping $\phi$ is applied
- the bias/intercept term $b$ is used explicitly (think about why after this lecture)
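To make the objective concrete, here is a minimal NumPy sketch (not from the lecture); the identity feature map and the toy numbers are made-up placeholders.

```python
import numpy as np

def svm_primal_objective(w, b, X, y, lam):
    """sum_n max(0, 1 - y_n (w^T phi(x_n) + b)) + (lam/2) ||w||_2^2,
    with phi taken to be the identity feature map for simplicity."""
    margins = y * (X @ w + b)                      # y_n (w^T x_n + b)
    hinge = np.maximum(0.0, 1.0 - margins).sum()   # total hinge loss
    return hinge + 0.5 * lam * np.dot(w, w)        # plus L2 regularization

# toy usage with made-up numbers
X = np.array([[1.0, 2.0], [-1.0, -1.5]])
y = np.array([1.0, -1.0])
print(svm_primal_objective(np.array([0.5, 0.5]), 0.0, X, y, lam=0.1))
```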
Geometric motivation: separable case
When data is linearly separable, there are infinitely many hyperplanes with zero training error.
Intuition: the further the hyperplane is from the data points, the better.
Distance to hyperplane
What is the distance from a point $\boldsymbol{x}$ to the hyperplane $\{\boldsymbol{x}: \boldsymbol{w^Tx}+b=0\}$?
Consider moving the point onto the plane along the normal direction: after moving a distance $l$, the point becomes $\boldsymbol{x}-l\frac{\boldsymbol{w}}{\|\boldsymbol{w}\|_2}$, and this new point must lie on the plane:
\[0=\boldsymbol{w}^T\left(\boldsymbol{x}-l\frac{\boldsymbol{w}}{\|\boldsymbol{w}\|_2}\right) +b = \boldsymbol{w^Tx}-l\|\boldsymbol{w}\|_2 +b\]so $l = \frac{\boldsymbol{w^Tx}+b}{\|\boldsymbol{w}\|_2}$.
Therefore the distance is
\[\frac{|\boldsymbol{w^Tx}+b|}{\|\boldsymbol{w}\|_2}\]For a hyperplane that correctly classifies $(\boldsymbol{x}, y)$, the distance becomes
\[\frac{y(\boldsymbol{w^Tx}+b)}{\|\boldsymbol{w}\|_2}\]
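A small NumPy illustration of the two distance formulas above (unsigned and signed); the hyperplane $(\boldsymbol{w}, b)$, the point, and its label are arbitrary made-up values.

```python
import numpy as np

w, b = np.array([3.0, 4.0]), -5.0   # made-up hyperplane w^T x + b = 0
x, y = np.array([2.0, 1.0]), 1.0    # made-up point and its label

norm_w = np.linalg.norm(w)           # ||w||_2 = 5
distance = abs(w @ x + b) / norm_w   # |w^T x + b| / ||w||_2 = |6 + 4 - 5| / 5 = 1.0
signed = y * (w @ x + b) / norm_w    # equals the distance iff (x, y) is classified correctly
print(distance, signed)
```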
Maximizing margin
Margin: the smallest distance from all training points to the hyperplane
\[\text{MARGIN OF} (\boldsymbol{w}, b) = \min_n\frac{y_n(\boldsymbol{w^T\phi(x_n)}+b)}{\|\boldsymbol{w}\|_2}\]
Rescaling
Since $\{\boldsymbol{x}: \boldsymbol{w^T\phi(x)}+b = 0\} = \{\boldsymbol{x}: C\boldsymbol{w^T\phi(x)}+Cb = 0\}$ for any $C>0$, rescaling $(\boldsymbol{w}, b)$ does not change the hyperplane at all. Thus, to simplify the calculation, scale $(\boldsymbol{w}, b)$ s.t. $\min_n y_n(\boldsymbol{w^T\phi(x_n)}+b)=1$.
Then the margin becomes:
\[\text{MARGIN OF} (\boldsymbol{w}, b) = \min_n\frac{y_n(\boldsymbol{w^T\phi(x_n)}+b)}{\|\boldsymbol{w}\|_2}\\ =\frac{1}{\|\boldsymbol{w}\|_2}\]
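A tiny made-up numerical check that rescaling does not change the margin: suppose $\min_n y_n(\boldsymbol{w^T\phi(x_n)}+b)=2$ and $\|\boldsymbol{w}\|_2=4$, so the margin is $2/4 = 1/2$. Rescaling by $C=1/2$ gives
\[\min_n y_n\left(\tfrac{1}{2}\boldsymbol{w^T\phi(x_n)}+\tfrac{b}{2}\right)=1, \qquad \frac{1}{\|\tfrac{1}{2}\boldsymbol{w}\|_2} = \frac{1}{2},\]i.e. the same margin as before.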
For a separable training set, we aim to solve
\[\max_{\boldsymbol{w},b} \frac{1}{\|\boldsymbol{w}\|_2} \quad \text{s.t. } \min_n y_n(\boldsymbol{w^T\phi(x_n)}+b)=1\]which is equivalent to
\[\min_{\boldsymbol{w},b} \frac{1}{2}\|\boldsymbol{w}\|^2_2\\ \text{s.t. } y_n(\boldsymbol{w^T\phi(x_n)}+b)\geq 1, \forall n\]SVM is thus also called a max-margin classifier. The constraints above are called hard-margin constraints.
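Since the hard-margin problem above is a small quadratic program, it can be handed to a generic convex solver. Below is a hedged sketch assuming the cvxpy package is available; the 2-D toy data set is made up.

```python
import numpy as np
import cvxpy as cp

# made-up linearly separable toy data
X = np.array([[2.0, 2.0], [2.5, 1.5], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w, b = cp.Variable(2), cp.Variable()
constraints = [cp.multiply(y, X @ w + b) >= 1]          # hard-margin constraints
prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
prob.solve()

print(w.value, b.value)
print("margin =", 1.0 / np.linalg.norm(w.value))        # margin = 1 / ||w||_2
```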
General non-separable case
If the data is not linearly separable, the previous constraints $y_n(\boldsymbol{w^T\phi(x_n)}+b)\geq 1, \forall n$ are obviously infeasible.
To deal with this issue, we relax them to soft-margin constraints:
\[y_n(\boldsymbol{w^T\phi(x_n)}+b)\geq 1 - \xi_n, \forall n\]where we introduce slack variables $\xi_n ≥ 0$.
SVM Primal formulation
$\xi_n$ should be as small as possible. The objective becomes
\[\min_{\boldsymbol{w}, b, \{\xi_n\}} \frac{1}{2} \|\boldsymbol{w}\|^2_2 +C\sum_n \xi_n \\ \text{s.t. } y_n(\boldsymbol{w^T\phi(x_n)}+b)\geq 1-\xi_n, \qquad \forall n \\ \xi_n \ge 0, \qquad \forall n\]where $C$ is a hyperparameter that balances the two goals. Because each $\xi_n$ is minimized, the constraints
\[y_n(\boldsymbol{w^T\phi(x_n)}+b)\geq 1-\xi_n, \qquad \forall n \\ \xi_n \ge 0, \qquad \forall n\]are equivalent to
\[\max\{0, 1-y_n(\boldsymbol{w^T\phi(x_n)}+b)\} = \xi_n \qquad\forall n\]Substituting this back into the objective gives
\[\min_{\boldsymbol{w}, b}C\sum_n \max\{0, 1-y_n(\boldsymbol{w^T\phi(x_n)}+b)\} + \frac{1}{2}\|\boldsymbol{w}\|^2_2 \\ \min_{\boldsymbol{w}, b}\sum_n \max\{0, 1-y_n(\boldsymbol{w^T\phi(x_n)}+b)\} + \frac{\lambda}{2}\|\boldsymbol{w}\|^2_2\]where $\lambda = 1/C$. This is exactly minimizing the L2-regularized hinge loss!
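A hedged NumPy check of the equivalence: for a fixed $(\boldsymbol{w}, b)$, the smallest feasible slacks are exactly the hinge losses, so the constrained objective equals $C$ times the regularized hinge-loss objective with $\lambda = 1/C$. All numbers below are made up.

```python
import numpy as np

X = np.array([[1.0, 1.0], [2.0, -1.0], [-1.0, -2.0]])   # made-up data
y = np.array([1.0, -1.0, -1.0])
w, b, C = np.array([0.3, -0.2]), 0.1, 2.0

margins = y * (X @ w + b)
xi = np.maximum(0.0, 1.0 - margins)        # smallest slacks satisfying both constraints

constrained = 0.5 * w @ w + C * xi.sum()         # (1/2)||w||^2 + C sum_n xi_n
hinge_obj = xi.sum() + (1.0 / C) * 0.5 * w @ w   # hinge loss + (lambda/2)||w||^2
print(np.isclose(constrained, C * hinge_obj))    # True: objectives match up to the factor C
```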
Optimization
\[\min_{\boldsymbol{w}, b, \{\xi_n\}} \frac{1}{2} \|\boldsymbol{w}\|^2_2 +C\sum_n \xi_n \\ \text{s.t. } y_n(\boldsymbol{w^T\phi(x_n)}+b)\geq 1-\xi_n, \qquad \forall n \\ \xi_n \ge 0, \qquad \forall n\]- It is a convex (quadratic in fact) problem
- thus we can apply any convex optimization algorithm, e.g. SGD (see the sketch after this list)
- there are more specialized and efficient algorithms (these are used most often in practice)
- but usually we apply the kernel trick, which requires solving the dual problem
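As a concrete instance of the generic route mentioned above, here is a minimal full-batch subgradient descent sketch on the regularized hinge-loss form of the primal. The step size, iteration count, and toy data are arbitrary choices, not a tuned implementation.

```python
import numpy as np

def svm_subgradient_descent(X, y, lam=0.1, lr=0.01, iters=1000):
    """Minimize sum_n max(0, 1 - y_n (w^T x_n + b)) + (lam/2) ||w||_2^2 by subgradient descent."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(iters):
        margins = y * (X @ w + b)
        active = margins < 1                                            # points with nonzero hinge loss
        grad_w = lam * w - (y[active, None] * X[active]).sum(axis=0)    # subgradient w.r.t. w
        grad_b = -y[active].sum()                                       # subgradient w.r.t. b
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# made-up toy data
X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -1.0], [-1.0, -2.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = svm_subgradient_descent(X, y)
print(np.sign(X @ w + b))   # should recover the labels on this separable toy set
```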
Lagrangian duality
- Extremely important and powerful tool in analyzing optimizations
- We will introduce basic concepts and derive the KKT conditions
- Applying it to SVM reveals an important aspect of the algorithm
Primal Problem
Suppose we want to solve
\[\min_\boldsymbol{w} F(\boldsymbol{w})\qquad \text{s.t. } h_j (\boldsymbol{w}) \leq 0 \quad \forall j \in [J]\]Specialized to the SVM primal formulation, with $J=2N$ constraints:
\[\begin{aligned} F(\boldsymbol{w}, b,\{\xi_n\})&=\frac{1}{2} \|\boldsymbol{w}\|^2_2 +C\sum_n \xi_n \\ h_n(\boldsymbol{w}, b,\{\xi_n\})&= 1- y_n(\boldsymbol{w^T\phi(x_n)}+b) - \xi_n \quad \forall n \in [N]\\ h_{N+n}(\boldsymbol{w}, b,\{\xi_n\})&= -\xi_n \quad \forall n\in [N] \end{aligned}\]
Lagrangian
The Lagrangian of the previous problem is defined as:
\[L(\boldsymbol{w}, \{\lambda_j\}) = F(\boldsymbol{w})+\sum^J_{j=1}\lambda_jh_j(\boldsymbol{w})\]where $\lambda_1, …,\lambda_J\ge0$ are called Lagrangian multipliers. Note that
\[\max_{\{\lambda_j\}\ge 0} L(\boldsymbol{w}, \{\lambda_j\})= \begin{cases} F(\boldsymbol{w}) \qquad &\text{if } h_j(\boldsymbol{w})\le0 \quad \forall j\in [J] \\ + \infty \qquad &\text{otherwise} \end{cases}\](if some $h_j(\boldsymbol{w})>0$, let $\lambda_j \to \infty$; if all $h_j(\boldsymbol{w})\le 0$, the max is attained with $\lambda_j h_j(\boldsymbol{w})=0$). Thus
\[\min_\boldsymbol{w} \max_{\{\lambda_j\}\ge0}L(\boldsymbol{w},\{\lambda_j\}) \Leftrightarrow \min_\boldsymbol{w}F(\boldsymbol{w}) \qquad\text{s.t. } h_j(\boldsymbol{w})\le 0 \quad\forall j\in [J]\]
Duality
We define the dual problem by swapping the min and max:
\[\max_{\{\lambda_j\}\ge0} \min_{\boldsymbol{w}} L(\boldsymbol{w}, \{\lambda_j\})\]Let $\boldsymbol{w}^*$ and $\{\lambda_j^*\}$ be the primal and dual solutions respectively; then
\(\begin{aligned} \max_{\{\lambda_j\}\ge0} \min_{\boldsymbol{w}} L(\boldsymbol{w}, \{\lambda_j\}) &= \min_{\boldsymbol{w}} L(\boldsymbol{w}, \{\lambda_j^*\}) \le L(\boldsymbol{w}^*, \{\lambda_j^*\})\\ &\le \max_{\{\lambda_j\}\ge0} L(\boldsymbol{w}^*, \{\lambda_j\})= \min_{\boldsymbol{w}} \max_{\{\lambda_j\}\ge0} L(\boldsymbol{w}, \{\lambda_j\}) \end{aligned}\) This is called “weak duality”.
Strong duality
When $F, h_1, \dots, h_J$ are convex, under some mild conditions (e.g. Slater's condition: a strictly feasible point exists), strong duality holds:
\[\min_{\boldsymbol{w}} \max_{\{\lambda_j\}\ge0} L(\boldsymbol{w}, \{\lambda_j\}) =\max_{\{\lambda_j\}\ge0} \min_{\boldsymbol{w}} L(\boldsymbol{w}, \{\lambda_j\})\]
Karush-Kuhn-Tucker (KKT) conditions
Deriving
If strong duality holds:
\[\begin{aligned} F(\boldsymbol{w}^*) &= \min_{\boldsymbol{w}} \max_{\{\lambda_j\}\ge0} L(\boldsymbol{w}, \{\lambda_j\}) \\ &= \max_{\{\lambda_j\}\ge0} \min_{\boldsymbol{w}} L(\boldsymbol{w}, \{\lambda_j\}) \\ &=\min_{\boldsymbol{w}} L(\boldsymbol{w}, \{\lambda_j^*\})\\ &\le L(\boldsymbol{w}^*, \{\lambda_j^*\}) = F(\boldsymbol{w}^*)+\sum^J_{j=1}\lambda_j^*h_j(\boldsymbol{w}^*) \le F(\boldsymbol{w}^*) \end{aligned}\]Implications (KKT):
- all inequalities above have to be equalities!
- the last inequality being an equality implies $\sum_j \lambda^*_j h_j(\boldsymbol{w}^*) = 0$; since every term is nonpositive, $\lambda^*_jh_j(\boldsymbol{w}^*) = 0 \quad\forall j \in [J]$
- the equality $\min_{\boldsymbol{w}} L(\boldsymbol{w}, \{\lambda^*_j \}) = L(\boldsymbol{w}^*, \{\lambda^*_j \})$ implies $\boldsymbol{w}^*$ is a minimizer of $L(\boldsymbol{w}, \{\lambda^*_j \})$ and thus has zero gradient:
\[\nabla_{\boldsymbol{w}} L(\boldsymbol{w}^*, \{\lambda^*_j \}) = \nabla F(\boldsymbol{w}^*) + \sum^J_{j=1}\lambda_j^* \nabla h_j(\boldsymbol{w}^*) = 0\]
The conditions
If $\boldsymbol{w}^*$ and $\{\lambda_j^*\}$ are the primal and dual solutions respectively, then:
Stationarity:
\[\nabla_{\boldsymbol{w}} L(\boldsymbol{w}^*, \{\lambda^*_j \}) = \nabla F(\boldsymbol{w}^*) + \sum^J_{j=1}\lambda_j^* \nabla h_j(\boldsymbol{w}^*) = 0\]Complementary slackness:
\[λ^∗_jh_j(w^∗) = 0 \quad\forall j \in [J]\]Feasibility:
\[h_j(\boldsymbol{w}^*)\le 0 \text{ and } \lambda_j^*\ge 0 \quad \forall j \in [J]\]These are necessary conditions. They are also sufficient when $F$ is convex and $h_1, \dots, h_J$ are continuously differentiable convex functions.
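A tiny worked example (not from the lecture) to see the conditions in action: minimize $F(w)=w^2$ subject to $h(w) = 1 - w \le 0$. The Lagrangian is $L(w,\lambda) = w^2 + \lambda(1-w)$, so the KKT conditions read
\[2w^* - \lambda^* = 0, \qquad \lambda^*(1-w^*) = 0, \qquad 1-w^* \le 0, \qquad \lambda^* \ge 0.\]If $\lambda^* = 0$ then $w^* = 0$, which violates $1 - w^* \le 0$; hence $w^* = 1$ and $\lambda^* = 2$, which indeed minimizes $w^2$ over $w \ge 1$.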
Support vector machines (dual formulation)
The primal formulation
\[\min_{\boldsymbol{w}, b, \{\xi_n\}} \frac{1}{2} \|\boldsymbol{w}\|^2_2 +C\sum_n \xi_n \\ \text{s.t. } y_n(\boldsymbol{w^T\phi(x_n)}+b)\geq 1-\xi_n, \qquad \forall n \\ \xi_n \ge 0, \qquad \forall n\]The Lagrangian is
\[L(\boldsymbol{w}, b, \{\xi_n\}, \{\alpha_n\},\{\lambda_n\}) = \frac{1}{2} \|\boldsymbol{w}\|^2_2 +C\sum_n \xi_n -\sum_n\lambda_n\xi_n + \sum_n\alpha_n(1-y_n(\boldsymbol{w^T\phi(x_n)}+b)-\xi_n)\]where $\alpha_1, \dots, \alpha_N \ge 0$ and $\lambda_1, \dots, \lambda_N\ge0$ are Lagrangian multipliers.
Applying the stationarity condition
Let $\nabla_{\boldsymbol{w},b,\{\xi_n\}}L = 0$:
\[\frac{\partial L}{\partial\boldsymbol{w}} = \boldsymbol{w}-\sum_n y_n\alpha_n\phi(\boldsymbol{x}_n) = 0 \Rightarrow \boldsymbol{w}= \sum_ny_n\alpha_n\phi(\boldsymbol{x}_n) \\ \frac{\partial L}{\partial b} = -\sum_n \alpha_n y_n = 0\\ \frac{\partial L}{\partial \xi_n} = C-\lambda_n-\alpha_n=0, \quad \forall n\]Replacing $\boldsymbol{w}$ by $\sum_ny_n\alpha_n\phi(\boldsymbol{x}_n)$ in the Lagrangian then leads to the dual problem over $\{\alpha_n\}$.
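As a hedged numerical check of the first stationarity condition $\boldsymbol{w}= \sum_ny_n\alpha_n\phi(\boldsymbol{x}_n)$, the sketch below assumes scikit-learn is available: for a linear kernel ($\phi(\boldsymbol{x})=\boldsymbol{x}$), `SVC` stores $y_n\alpha_n$ for the support vectors in `dual_coef_`, so $\boldsymbol{w}$ can be reconstructed from the dual variables and compared with `coef_`.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# made-up two-class toy data with labels mapped to {-1, +1}
X, y = make_blobs(n_samples=40, centers=2, random_state=0)
y = 2 * y - 1

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# stationarity: w = sum_n y_n alpha_n x_n, summed over the support vectors
w_from_dual = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w_from_dual, clf.coef_))   # expected True (up to solver tolerance)
```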