
[Notes] CSCI 567 lec6 Support Vector Machines (SVM)

Credit to: Prof. Yan Liu, CSCI567, Spring 2020

Outline

  • Support vector machines (primal formulation)
  • A detour of Lagrangian duality
  • Support vector machines (dual formulation)

  • One of the most commonly used classification algorithms
  • Works well with the kernel trick
  • Strong theoretical guarantees

Support vector machines (primal formulation)

  • hinge loss $\ell_\text{hinge}(z) = \max\{0,1-z\}$

For a linear model $(\boldsymbol{w}, b)$, minimizing the L2-regularized total hinge loss is:

\[\min_{\boldsymbol{w},b} \left(\sum_n \max\{0,1 -y_n(\boldsymbol{w}^T \phi(\boldsymbol{x}_n)+b)\} + \frac{\lambda}{2}\|\boldsymbol{w}\|^2_2 \right)\]
  • recall $y_n \in \{-1, +1\}$
  • a nonlinear mapping $\phi$ is applied
  • the bias/intercept term $b$ is used explicitly (think about why after this lecture)
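
As a quick illustration, here is a minimal NumPy sketch of this objective; the function name, the default identity feature map, and the arguments are my own placeholders, not from the lecture.

```python
import numpy as np

def svm_objective(w, b, X, y, lam, phi=lambda x: x):
    """L2-regularized total hinge loss of a linear model (w, b).

    X: (N, d) inputs, y: (N,) labels in {-1, +1}, lam: regularization weight,
    phi: feature map (identity by default).
    """
    Phi = np.array([phi(x) for x in X])      # apply the feature map row-wise
    margins = y * (Phi @ w + b)              # y_n (w^T phi(x_n) + b)
    hinge = np.maximum(0.0, 1.0 - margins)   # per-example hinge loss
    return hinge.sum() + 0.5 * lam * np.dot(w, w)
```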

Geometric motivation: separable case

When data is linearly separable, there are infinitely many hyperplanes with zero training error.

Intuition: the farther the hyperplane is from the data points, the better.

Distance to hyperplane

What is the distance from a point $\boldsymbol{x}$ to the hyperplane $\{\boldsymbol{x}: \boldsymbol{w^Tx}+b=0\}$?

Consider moving the point a distance $l$ along the direction $-\boldsymbol{w}/\|\boldsymbol{w}\|_2$ onto the plane. The moved point is $\boldsymbol{x}-l\frac{\boldsymbol{w}}{\|\boldsymbol{w}\|_2}$, and it must lie on the plane:

\[0=\boldsymbol{w}^T\left(\boldsymbol{x}-l\frac{\boldsymbol{w}}{\|\boldsymbol{w}\|_2}\right) +b = \boldsymbol{w^Tx}-l\|\boldsymbol{w}\|_2 +b\]

so $l = \frac{\boldsymbol{w^Tx}+b}{\|\boldsymbol{w}\|_2}$.

Therefore the distance is

\[\frac{|\boldsymbol{w^Tx}+b|}{\|\boldsymbol{w}\|_2}\]

For a hyperplane that correctly classifies $(\boldsymbol{x}, y)$, the distance becomes

\[\frac{y(\boldsymbol{w^Tx}+b)}{\|\boldsymbol{w}\|_2}\]
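
For a concrete check, the toy example below (numbers of my own choosing, not from the slides) evaluates this formula for a 2-D point:

```python
import numpy as np

w, b = np.array([3.0, 4.0]), -5.0   # hyperplane {x : w^T x + b = 0}
x, y = np.array([2.0, 1.0]), +1     # a labeled point

dist = abs(w @ x + b) / np.linalg.norm(w)      # |w^T x + b| / ||w||_2
signed = y * (w @ x + b) / np.linalg.norm(w)   # positive iff (x, y) is correctly classified
print(dist, signed)                            # 1.0 1.0, since w^T x + b = 5 and ||w||_2 = 5
```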

Maximizing margin

Margin: the smallest distance from all training points to the hyperplane

\[\text{MARGIN OF} (\boldsymbol{w}, b) = \min_n\frac{y_n(\boldsymbol{w^T\phi(x_n)}+b)}{\|\boldsymbol{w}\|_2}\]

Rescaling

Note that $\boldsymbol{w^T\phi(x)}+b = 0$ and $C\boldsymbol{w^T\phi(x)}+Cb = 0$ define the same hyperplane for any $C>0$.

Since rescaling $(\boldsymbol{w}, b)$ does not change the hyperplane at all, to simplify the calculation we scale $(\boldsymbol{w}, b)$ s.t. $\min_ny_n(\boldsymbol{w^T\phi(x_n)}+b)=1$.

Then the margin becomes:

\[\text{MARGIN OF} (\boldsymbol{w}, b) = \min_n\frac{y_n(\boldsymbol{w^T\phi(x_n)}+b)}{\|\boldsymbol{w}\|_2}\\ =\frac{1}{\|\boldsymbol{w}\|_2}\]

For a separable training set, we aim to solve

\[\max_{\boldsymbol{w},b} \frac{1}{\|\boldsymbol{w}\|_2} \quad \text{s.t. } \min_n y_n(\boldsymbol{w^T\phi(x_n)}+b)=1\]

which is equivalent to

\[\min_{\boldsymbol{w},b} \frac{1}{2}\|\boldsymbol{w}\|^2_2\\ \text{s.t. } y_n(\boldsymbol{w^T\phi(x_n)}+b)\geq 1, \forall n\]

SVM is thus also called max-margin classifier. The constraints above are called hard-margin constraints.
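
As a sketch of how this hard-margin QP can be solved directly, here is a minimal example assuming the `cvxpy` package and a made-up separable 2-D dataset (both are my additions, with $\phi$ taken as the identity):

```python
import cvxpy as cp
import numpy as np

# Toy separable data (placeholder)
X = np.array([[2.0, 2.0], [3.0, 1.5], [-1.0, -1.0], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

# min (1/2)||w||^2  s.t.  y_n (w^T x_n + b) >= 1 for all n
constraints = [cp.multiply(y, X @ w + b) >= 1]
prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
prob.solve()

print(w.value, b.value)                   # max-margin hyperplane
print(1.0 / np.linalg.norm(w.value))      # achieved margin 1 / ||w||_2
```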

General non-separable case

If data is not linearly separable, the previous constraint $y_n(\boldsymbol{w^T\phi(x_n)}+b)\geq 1, \forall n$ is obviously not feasible.

To deal with this issue, we relax them to soft-margin constraints:

\[y_n(\boldsymbol{w^T\phi(x_n)}+b)\geq 1 - \xi_n, \forall n\]

where we introduce slack variables $\xi_n ≥ 0$.

SVM Primal formulation

$\xi_n$ should be as small as possible. The objective becomes

\[\min_{\boldsymbol{w}, b, \{\xi_n\}} \frac{1}{2} \|\boldsymbol{w}\|^2_2 +C\sum_n \xi_n \\ \text{s.t. } y_n(\boldsymbol{w^T\phi(x_n)}+b)\geq 1-\xi_n, \qquad \forall n \\ \xi_n \ge 0, \qquad \forall n\]

where $C$ is a hyperparameter that balances the two goals. Because each $\xi_n$ is pushed to its smallest feasible value at the optimum, the constraints

\[y_n(\boldsymbol{w^T\phi(x_n)}+b)\geq 1-\xi_n, \qquad \forall n \\ \xi_n \ge 0, \qquad \forall n\]

are equivalent to

\[\xi_n = \max\{0, 1-y_n(\boldsymbol{w^T\phi(x_n)}+b)\} \qquad\forall n\]

Substituting this equality into the objective gives

\[\min_{\boldsymbol{w}, b}C\sum_n \max\{0, 1-y_n(\boldsymbol{w^T\phi(x_n)}+b)\} + \frac{1}{2}\|\boldsymbol{w}\|^2_2 \\ \Leftrightarrow \min_{\boldsymbol{w}, b}\sum_n \max\{0, 1-y_n(\boldsymbol{w^T\phi(x_n)}+b)\} + \frac{\lambda}{2}\|\boldsymbol{w}\|^2_2\]

where $\lambda = 1/C$. This is exactly minimizing the L2-regularized hinge loss!
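
A minimal (sub)gradient-descent sketch of this unconstrained objective, assuming the identity feature map and full-batch updates (simplifications of my own):

```python
import numpy as np

def train_linear_svm(X, y, lam=0.1, lr=0.01, epochs=200):
    """Minimize sum_n max{0, 1 - y_n(w^T x_n + b)} + (lam/2)||w||^2 by subgradient descent."""
    N, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1                       # points with nonzero hinge loss
        grad_w = lam * w - X[active].T @ y[active] # subgradient w.r.t. w
        grad_b = -y[active].sum()                  # subgradient w.r.t. b
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```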

Optimization

\[\min_{\boldsymbol{w}, b, \{\xi_n\}} \frac{1}{2} \|\boldsymbol{w}\|^2_2 +C\sum_n \xi_n \\ \text{s.t. } y_n(\boldsymbol{w^T\phi(x_n)}+b)\geq 1-\xi_n, \qquad \forall n \\ \xi_n \ge 0, \qquad \forall n\]
  • It is a convex (quadratic, in fact) problem
  • thus we can apply any convex optimization algorithm, e.g. SGD
  • there are more specialized and efficient algorithms (used mostly in practice); see the usage sketch after this list
  • but usually we apply the kernel trick, which requires solving the dual problem
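
For instance, a common way to invoke such a specialized solver is `sklearn.svm.SVC`; the toy data below is made up, and its `C` argument corresponds to the hyperparameter $C$ above:

```python
import numpy as np
from sklearn.svm import SVC

# Toy data (placeholder)
X = np.array([[2.0, 2.0], [3.0, 1.5], [-1.0, -1.0], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1.0)     # linear kernel, i.e. phi = identity
clf.fit(X, y)
print(clf.coef_, clf.intercept_)      # the learned (w, b)
print(clf.support_vectors_)           # training points with alpha_n > 0
```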

Lagrangian duality

  • Extremely important and powerful tool for analyzing optimization problems
  • We will introduce basic concepts and derive the KKT conditions
  • Applying it to SVM reveals an important aspect of the algorithm

Primal Problem

Suppose we want to solve

\[\min_\boldsymbol{w} F(\boldsymbol{w})\qquad \text{s.t. } h_j (\boldsymbol{w}) \leq 0 \quad \forall j \in [J]\]

Specialized to the SVM primal formulation, with $J=2N$ constraints:

\[\begin{aligned} F(\boldsymbol{w}, b,\{\xi_n\})&=\frac{1}{2} \|\boldsymbol{w}\|^2_2 +C\sum_n \xi_n \\ h_n(\boldsymbol{w}, b,\{\xi_n\})&= 1- y_n(\boldsymbol{w^T\phi(x_n)}+b) - \xi_n \quad \forall n \in [N]\\ h_{N+n}(\boldsymbol{w}, b,\{\xi_n\})&= -\xi_n \quad \forall n\in [N] \end{aligned}\]

Lagrangian

The Lagrangian of the previous problem is defined as:

\[L(\boldsymbol{w}, \{\lambda_j\}) = F(\boldsymbol{w})+\sum^J_{j=1}\lambda_jh_j(\boldsymbol{w})\]

where $\lambda_1, …,\lambda_J\ge0$ are called Lagrangian multipliers. Note that

\[\max_{\{\lambda_j\}\ge 0} L(\boldsymbol{w}, \{\lambda_j\})= \begin{cases} F(\boldsymbol{w}) \qquad &\text{if } h_j(\boldsymbol{w})\le0 \quad \forall j\in [J] \\ + \infty \qquad &\text{if } h_j(\boldsymbol{w})>0 \quad \text{for some } j\in [J] \end{cases}\]

Thus

\[\min_\boldsymbol{w} \max_{\{\lambda_j\}\ge0}L(\boldsymbol{w},\{\lambda_j\}) \Leftrightarrow \min_\boldsymbol{w}F(\boldsymbol{w}) \qquad\text{s.t. } h_j(\boldsymbol{w})\le 0 \quad\forall j\in [J]\]
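
As a one-variable sanity check (a toy example of my own, not from the slides), take $F(w)=w^2$ with the single constraint $h(w)=1-w\le 0$. Then $L(w,\lambda)=w^2+\lambda(1-w)$ and

\[\max_{\lambda\ge 0}\left(w^2+\lambda(1-w)\right) = \begin{cases} w^2 \qquad &\text{if } w\ge 1 \quad (\text{optimal } \lambda=0)\\ +\infty \qquad &\text{if } w<1 \quad (\lambda\to\infty) \end{cases}\]

so minimizing this over $w$ recovers exactly $\min_w w^2$ s.t. $w\ge 1$, whose solution is $w^*=1$ with $F(w^*)=1$.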

Duality

We define the dual problem by swapping the min and max:

\[\max_{\{\lambda_j\}\ge0} \min_{\boldsymbol{w}} L(\boldsymbol{w}, \{\lambda_j\})\]

Let $\boldsymbol{w}^*$ and $\{\lambda_j^*\}$ be the primal and dual solutions respectively; then

\[\begin{aligned} \max_{\{\lambda_j\}\ge0} \min_{\boldsymbol{w}} L(\boldsymbol{w}, \{\lambda_j\}) &= \min_{\boldsymbol{w}} L(\boldsymbol{w}, \{\lambda_j^*\}) \le L(\boldsymbol{w}^*, \{\lambda_j^*\})\\ &\le \max_{\{\lambda_j\}\ge0} L(\boldsymbol{w}^*, \{\lambda_j\})= \min_{\boldsymbol{w}} \max_{\{\lambda_j\}\ge0} L(\boldsymbol{w}, \{\lambda_j\}) \end{aligned}\]

This is called “weak duality”.

Strong duality

When $F, h_1, \dots, h_J$ are convex, under some mild conditions:

\[\min_{\boldsymbol{w}} \max_{\{\lambda_j\}\ge0} L(\boldsymbol{w}, \{\lambda_j\}) =\max_{\{\lambda_j\}\ge0} \min_{\boldsymbol{w}} L(\boldsymbol{w}, \{\lambda_j\})\]
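
Continuing the toy example above: $\min_w\left(w^2+\lambda(1-w)\right)$ is attained at $w=\lambda/2$, giving the dual function $g(\lambda)=\lambda-\frac{\lambda^2}{4}$, and

\[\max_{\lambda\ge 0}\left(\lambda - \frac{\lambda^2}{4}\right) = 1 \quad \text{at } \lambda^*=2,\]

which matches the primal optimum $F(w^*)=1$, so strong duality indeed holds there.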

Karush-Kuhn-Tucker (KKT) conditions

Deriving

If strong duality holds:

\[\begin{aligned} F(\boldsymbol{w}^*) &= \min_{\boldsymbol{w}} \max_{\{\lambda_j\}\ge0} L(\boldsymbol{w}, \{\lambda_j\}) \\ &= \max_{\{\lambda_j\}\ge0} \min_{\boldsymbol{w}} L(\boldsymbol{w}, \{\lambda_j\}) \\ &=\min_{\boldsymbol{w}} L(\boldsymbol{w}, \{\lambda_j^*\})\\ &\le L(\boldsymbol{w}^*, \{\lambda_j^*\}) = F(\boldsymbol{w}^*)+\sum^J_{j=1}\lambda_j^*h_j(\boldsymbol{w}^*) \le F(\boldsymbol{w}^*) \end{aligned}\]

Implications (KKT):

  • all inequalities above have to be equalities!
  • last equality implies $λ^∗_jh_j(w^∗) = 0 \quad\forall j ∈ [J]$
  • equality $\min_w L(w, {λ^∗_j }) = L(w^∗, {λ^∗_j })$ implies $w^∗$ is a minimizer of $L(w, {λ^∗_j })$ and thus has zero gradient:

    \[\nabla_w L(w^∗, \{λ^∗_j \}) = \nabla F(\boldsymbol{w}^*) + \sum^J_{j=1}\lambda_j^* \nabla h_j(\boldsymbol{w}^*) = 0\]

The conditions

If $\boldsymbol{w}^*$ and $\{\lambda_j^*\}$ are the primal and dual solutions respectively, then:

Stationarity:

\[\nabla_w L(w^∗, \{λ^∗_j \}) = \nabla F(\boldsymbol{w}^*) + \sum^J_{j=1}\lambda_j^* \nabla h_j(\boldsymbol{w}^*) = 0\]

Complementary slackness:

\[λ^∗_jh_j(w^∗) = 0 \quad\forall j \in [J]\]

Feasibility:

\[h_j(\boldsymbol{w}^*)\le 0 \text{ and } \lambda_j^*\ge 0 \quad \forall j \in [J]\]

These are necessary conditions. They are also sufficient when $F$ is convex and $h_1, \dots, h_J$ are continuously differentiable convex functions.
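
They are easy to verify on the toy example from the duality section ($F(w)=w^2$, $h(w)=1-w$, $w^*=1$, $\lambda^*=2$):

\[\underbrace{2w^* - \lambda^* = 0}_{\text{stationarity}}, \qquad \underbrace{\lambda^* h(w^*) = 2\cdot 0 = 0}_{\text{complementary slackness}}, \qquad \underbrace{h(w^*)=0\le 0,\ \lambda^*=2\ge 0}_{\text{feasibility}}.\]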

Support vector machines (dual formulation)

The primal formulation

\[\min_{\boldsymbol{w}, b, \{\xi_n\}} \frac{1}{2} \|\boldsymbol{w}\|^2_2 +C\sum_n \xi_n \\ \text{s.t. } y_n(\boldsymbol{w^T\phi(x_n)}+b)\geq 1-\xi_n, \qquad \forall n \\ \xi_n \ge 0, \qquad \forall n\]

The Lagrangian is

\[L(\boldsymbol{w}, b, \{\xi_n\}, \{\alpha_n\},\{\lambda_n\}) = \frac{1}{2} \|\boldsymbol{w}\|^2_2 +C\sum_n \xi_n -\sum_n\lambda_n\xi_n + \sum_n\alpha_n(1-y_n(\boldsymbol{w^T\phi(x_n)}+b)-\xi_n)\]

Where $\alpha_1, \dots, \alpha_N \ge 0$ and $\lambda_1, \dots, \lambda_N\ge0$ are Lagrangian multipliers.

Applying the stationarity condition

Let $\nabla_{\boldsymbol{w},b,\{\xi_n\}}L = 0$:

\[\frac{\partial L}{\partial\boldsymbol{w}} = \boldsymbol{w}-\sum_n y_n\alpha_n\phi(\boldsymbol{x}_n) = 0 \Rightarrow \boldsymbol{w}= \sum_ny_n\alpha_n\phi(\boldsymbol{x}_n) \\ \frac{\partial L}{\partial b} = -\sum_n \alpha_n y_n = 0\\ \frac{\partial L}{\partial \xi_n} = C-\lambda_n-\alpha_n=0, \quad \forall n\]

Replacing $\boldsymbol{w}$ by $\sum_ny_n\alpha_n\phi(\boldsymbol{x}_n)$ in the Lagrangian then gives the dual problem in terms of $\{\alpha_n\}$ only.
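
As a numerical sanity check of the relation $\boldsymbol{w}= \sum_ny_n\alpha_n\phi(\boldsymbol{x}_n)$ (my own check, using `sklearn` with a linear kernel, where `dual_coef_` stores $y_n\alpha_n$ for the support vectors and made-up toy data):

```python
import numpy as np
from sklearn.svm import SVC

# Toy data (placeholder)
X = np.array([[2.0, 2.0], [3.0, 1.5], [-1.0, -1.0], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# w = sum_n y_n alpha_n phi(x_n); with a linear kernel phi is the identity,
# and dual_coef_ already holds y_n * alpha_n for the support vectors.
w_from_dual = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w_from_dual, clf.coef_))   # True: matches the primal weight vector
```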