[Notes] CSCI 567 lec4 Multiclass Classification and Neural Nets
05 Feb 2020 | CSCI567, English, Notes. Credit to: Prof. Yan Liu, CSCI567, Spring 2020
outline
- Multiclass Classification
- Neural Nets
Multiclass Classification
More than 2 classes, e.g., recognizing digits (C = 10) or letters (C = 26 or 52).
Nearest Neighbor Classifier naturally works for arbitrary C.
Multinomial logistic regression
Linear models
For each class, keep a separate weight vector. Think of $\boldsymbol{w}^T_k \boldsymbol{x}$ as a score for class $k$ (one set of weights per category).
Binary logistic regression rewritten with one weight vector per class:
\[f(\boldsymbol{x}) = \arg \max_{k\in\{1,2\}} \boldsymbol{w}^T_k \boldsymbol{x}\]Generalizing to $C$ classes gives the function class
\[\begin{aligned} \mathcal{F} &= \left\{ f(\boldsymbol{x}) = \arg \max_{k\in[C]} \boldsymbol{w}^T_k \boldsymbol{x} \;|\; \boldsymbol{w}_1, \dotsc, \boldsymbol{w}_C \in\mathbb{R}^D\right\}\\ &=\left\{ f(\boldsymbol{x}) = \arg \max_{k\in[C]} (\boldsymbol{W} \boldsymbol{x})_k \;|\; \boldsymbol{W} \in\mathbb{R}^{C\times D}\right\} \end{aligned}\]
Probabilistic view
With $\boldsymbol{w} = \boldsymbol{w}_1-\boldsymbol{w}_2$, binary logistic regression becomes
\[\begin{aligned} \mathbb{P}(y=1|\boldsymbol{x};\boldsymbol{W}) = \sigma((\boldsymbol{w}_1-\boldsymbol{w}_2)^T\boldsymbol{x}) &= \frac{1}{1+e^{-(\boldsymbol{w}_1-\boldsymbol{w}_2)^T\boldsymbol{x}}} \\ & = \frac{e^{\boldsymbol{w}_1^T\boldsymbol{x}}}{e^{\boldsymbol{w}_1^T\boldsymbol{x}} + e^{\boldsymbol{w}_2^T\boldsymbol{x}}} \propto e^{\boldsymbol{w}_1^T\boldsymbol{x}} \end{aligned}\]Naturally, for multiclass:
\[\mathbb{P}(y=k|\boldsymbol{x};\boldsymbol{W}) = \frac{e^{\boldsymbol{w}_k^T\boldsymbol{x}}}{\sum_{k'\in[C]}e^{\boldsymbol{w}_{k'}^T\boldsymbol{x}}} \propto e^{\boldsymbol{w}_k^T\boldsymbol{x}}\]This is called the softmax function (it turns the scores into a probability distribution over classes).
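As a quick illustration, here is a minimal numpy sketch of the softmax mapping from scores $\boldsymbol{Wx}$ to class probabilities; the max-subtraction is only a standard numerical-stability trick, not part of the definition, and the names are mine.

```python
import numpy as np

def softmax_probs(W, x):
    """P(y = k | x; W) for a score matrix W of shape (C, D) and a feature vector x of shape (D,)."""
    scores = W @ x                  # one score w_k^T x per class
    scores -= scores.max()          # stability shift; cancels in the ratio
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()
```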
Applying MLE (since a probabilistic model is given)
Maximize the probability of seeing labels $y_1, \dotsc, y_N$ given $\boldsymbol{x}_1, \dotsc, \boldsymbol{x}_N$ (the whole training set), i.e., find the weights under which the observed training labels are most probable.
Joint probability (under the i.i.d. assumption), also called the likelihood $L(\boldsymbol{W})$:
\[P(\boldsymbol{W}) = \prod^N_{n=1}\mathbb{P}(y_n|\boldsymbol{x}_n;\boldsymbol{W}) = \prod^N_{n=1}\frac{e^{\boldsymbol{w}_{y_n}^T\boldsymbol{x}_n}}{\sum_{k\in[C]}e^{\boldsymbol{w}_k^T\boldsymbol{x}_n}}\]Taking the negative log, maximizing the likelihood is equivalent to minimizing
\[F(\boldsymbol{W}) = \sum^N_{n=1}\ln\left(\frac{\sum_{k\in[C]}e^{\boldsymbol{w}_k^T\boldsymbol{x}_n}}{e^{\boldsymbol{w}^T_{y_n}\boldsymbol{x}_n}}\right) = \sum^N_{n=1}\ln \left(1+\sum_{k\ne y_n} e^{(\boldsymbol{w}_k-\boldsymbol{w}_{y_n})^T \boldsymbol{x}_n}\right)\]This is the multiclass logistic loss, a.k.a. the cross-entropy loss.
When C = 2, this is the same as binary logistic loss.
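A minimal numpy sketch of this loss on a training set, assuming `X` holds one example per row and `y` holds integer labels in `[0, C)` (the names are mine, not from the notes):

```python
import numpy as np

def cross_entropy_loss(W, X, y):
    """F(W) = sum_n [ ln sum_k exp(w_k^T x_n) - w_{y_n}^T x_n ]."""
    scores = X @ W.T                               # (N, C) matrix of scores w_k^T x_n
    scores -= scores.max(axis=1, keepdims=True)    # numerical stability; shifts both terms equally
    log_norm = np.log(np.exp(scores).sum(axis=1))  # ln sum_k exp(score)
    return np.sum(log_norm - scores[np.arange(len(y)), y])
```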
Optimization (apply SGD)
$\boldsymbol{W}$ is a $C \times D$ matrix; write $g(\boldsymbol{W}) = \ln \left(1+\sum_{k'\ne y_n} e^{(\boldsymbol{w}_{k'}-\boldsymbol{w}_{y_n})^T \boldsymbol{x}_n}\right)$ for the loss on example $n$, and focus on its $k$-th row. If $k\ne y_n$ (derivative with respect to $\boldsymbol{w}_k$):
\[\begin{aligned} \nabla_{\boldsymbol{w}_k}g(\boldsymbol{W}) & = \frac{e^{(\boldsymbol{w}_k-\boldsymbol{w}_{y_n})^T \boldsymbol{x}_n}} {1+\sum_{k'\ne y_n} e^{(\boldsymbol{w}_{k'}-\boldsymbol{w}_{y_n})^T \boldsymbol{x}_n}}\boldsymbol{x}_n^T \\ &= \frac{e^{\boldsymbol{w}_k^T\boldsymbol{x}_n}}{\sum_{k'\in[C]}e^{\boldsymbol{w}_{k'}^T\boldsymbol{x}_n}}\boldsymbol{x}_n^T \\ &= \mathbb{P}(y=k|\boldsymbol{x}_n;\boldsymbol{W})\boldsymbol{x}_n^T \end{aligned}\]If $k = y_n$ (derivative with respect to $\boldsymbol{w}_{y_n}$):
\[\begin{aligned} \nabla_{\boldsymbol{w}_{y_n}}g(\boldsymbol{W}) & = \frac{-\sum_{k'\ne y_n}e^{(\boldsymbol{w}_{k'}-\boldsymbol{w}_{y_n})^T \boldsymbol{x}_n}} {1+\sum_{k'\ne y_n} e^{(\boldsymbol{w}_{k'}-\boldsymbol{w}_{y_n})^T \boldsymbol{x}_n}}\boldsymbol{x}_n^T \\ &= \left(\mathbb{P}(y=y_n|\boldsymbol{x}_n;\boldsymbol{W})-1\right)\boldsymbol{x}_n^T \end{aligned}\]Here $\mathbb{P}(y=y_n|\boldsymbol{x}_n;\boldsymbol{W})$ is the probability of the correct label: as it approaches 1 the gradient vanishes, and the smaller it is, the larger the update.
Initialize $\boldsymbol{W}=0$ (or randomly). Repeat:
- pick $n\in [N]$ uniformly at random
- update the parameters: $\boldsymbol{W} \leftarrow \boldsymbol{W} - \eta\left(\mathbb{P}(y=k|\boldsymbol{x}_n;\boldsymbol{W}) - \mathbb{I}[k=y_n]\right)_{k\in[C]}\boldsymbol{x}_n^T$, i.e., subtract $\eta$ times the per-example gradient derived above (see the sketch below)
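A minimal sketch of this SGD loop for multinomial logistic regression, assuming `X` is an `(N, D)` array and `y` holds integer labels in `[0, C)`; all names here are mine:

```python
import numpy as np

def sgd_multinomial_logistic(X, y, C, eta=0.1, steps=10000, seed=0):
    """SGD on the multiclass logistic loss; returns the learned (C, D) weight matrix."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    W = np.zeros((C, D))
    for _ in range(steps):
        n = rng.integers(N)                        # pick n in [N] uniformly at random
        scores = W @ X[n]
        scores -= scores.max()                     # numerical stability
        p = np.exp(scores) / np.exp(scores).sum()  # softmax probabilities P(y = k | x_n; W)
        p[y[n]] -= 1.0                             # gradient factor P(y = k | x_n; W) - 1[k = y_n]
        W -= eta * np.outer(p, X[n])               # rank-one gradient step
    return W
```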
Notes on prediction
Having learned W , we can either
- make a deterministic prediction: $\arg \max_{k\in{[C]}} \boldsymbol{w}^T_k \boldsymbol{x}$
- make a randomized prediction according to: $\mathbb{P}(y=k|\boldsymbol{x};\boldsymbol{W}) \propto e^{\boldsymbol{w}_k^T\boldsymbol{x}}$
In either case, the (expected) 0-1 mistake is bounded by the logistic loss:
- deterministic: if the prediction is wrong, some $k\ne y$ has $\boldsymbol{w}_k^T\boldsymbol{x}\ge\boldsymbol{w}_y^T\boldsymbol{x}$, so $\mathbb{I}[f(\boldsymbol{x})\ne y] \le \log_2\left(1+\sum_{k\ne y}e^{(\boldsymbol{w}_k-\boldsymbol{w}_y)^T\boldsymbol{x}}\right)$
- randomized: the expected mistake is $1-\mathbb{P}(y|\boldsymbol{x};\boldsymbol{W}) \le \ln\frac{1}{\mathbb{P}(y|\boldsymbol{x};\boldsymbol{W})}$, which is exactly the logistic loss
Reduction to binary classification
Is there an even more general and simpler approach to derive multiclass classification algorithms?
Given a binary classification algorithm (any one, not just linear methods), can we turn it into a multiclass algorithm, in a black-box manner?
Yes, there are in fact many ways to do it.
- one-versus-all (one-versus-rest, one-against-all, etc)
- one-versus-one (all-versus-all, etc)
- Error-Correcting Output Codes (ECOC)
- tree-based reduction
one-versus-all (OvA)
Idea: train C binary classifiers to learn “is class k or not?” for each k.
Training: for each class $k ∈ [C]$,
- relabel examples with class k as +1, and all others as −1
- train a binary classifier $h_k$ using this new dataset
Prediction: for a new example x
- ask each $h_k$: does this belong to class $k$? (i.e. compute $h_k(\boldsymbol{x})$)
- randomly pick among all $k$'s s.t. $h_k(\boldsymbol{x}) = +1$
Issue: the final prediction is (probably) wrong as soon as a single $h_k$ errs. When no classifier predicts positive, we can fall back to the class whose classifier outputs the largest score (see the sketch below).
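A minimal OvA sketch in the spirit of the recipe above; the choice of `LinearSVC` as the underlying binary learner and the use of its real-valued `decision_function` score for the fallback are my assumptions, since any binary classifier works here:

```python
import numpy as np
from sklearn.svm import LinearSVC   # stand-in binary learner; any one with a score works

def ova_train(X, y, C):
    """Train one 'class k vs. rest' classifier per class."""
    return [LinearSVC().fit(X, np.where(y == k, 1, -1)) for k in range(C)]

def ova_predict(classifiers, X):
    """Predict the class whose classifier gives the largest score (also covers 'no positive vote')."""
    scores = np.column_stack([clf.decision_function(X) for clf in classifiers])
    return scores.argmax(axis=1)
```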
One-versus-one (OvO)
- advantage: each classifier solves a finer-grained problem that only distinguishes two classes
- disadvantage: more classifiers to train, and each one sees less data
Idea: train $\binom{C}{2}$ binary classifiers to learn “is class $k$ or $k'$?”.
Training: for each pair $(k, k’)$
- relabel examples with class k as +1 and examples with class k′ as −1
- discard all other examples (keep only the examples labeled $k$ or $k'$)
- train a binary classifier $h_{(k,k')}$ using this new dataset
Prediction: for a new example $\boldsymbol{x}$
- ask each classifier $h_{(k,k')}$ to vote for either class $k$ or $k'$
- predict the class with the most votes (break tie in some way)
More robust than one-versus-all, but slower in prediction.
Side note: class imbalance is less of a problem for generative classifiers than for discriminative ones.
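A minimal OvO sketch matching the recipe above; as before, `LinearSVC` is just a stand-in binary learner of my choosing and `y` is assumed to hold integer labels:

```python
import itertools
import numpy as np
from sklearn.svm import LinearSVC   # stand-in binary learner; any one works

def ovo_train(X, y, C):
    """Train one classifier per pair (k, k'), using only examples from those two classes."""
    classifiers = {}
    for k, k2 in itertools.combinations(range(C), 2):
        mask = (y == k) | (y == k2)
        classifiers[(k, k2)] = LinearSVC().fit(X[mask], np.where(y[mask] == k, 1, -1))
    return classifiers

def ovo_predict(classifiers, X, C):
    """Each pairwise classifier votes for k or k'; predict the class with the most votes."""
    votes = np.zeros((len(X), C), dtype=int)
    for (k, k2), clf in classifiers.items():
        winner = np.where(clf.predict(X) == 1, k, k2)   # +1 means class k, -1 means class k'
        votes[np.arange(len(X)), winner] += 1
    return votes.argmax(axis=1)                         # ties broken toward the smaller index
```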
Error-Correcting Output Codes (ECOC)
Idea: based on a code $\boldsymbol{M} \in \{-1, +1\}^{C\times L}$, learn “is bit $b$ on or off?”. This sits somewhere between OvA and OvO.
Training: for each bit $b \in [L]$ ($L$ is the number of binary classifiers):
- relabel example $x_n$ as $M_{y_n,b}$
- train a binary classifier $h_b$ using this new dataset.
Prediction: for a new example x
- compute the predicted code $c = (h_1(\boldsymbol{x}), . . . , h_L(\boldsymbol{x}))^T$
- predict the class with the most similar code: $k = \arg\max_k(\boldsymbol{Mc})_k$
How to pick the code M?
- the more dissimilar the codes between different classes are, the better
- random code is a good choice, but might create hard training sets
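A minimal ECOC sketch following the steps above, with `LinearSVC` again standing in for an arbitrary binary learner, `y` assumed to hold integer labels, and a random code matrix as suggested:

```python
import numpy as np
from sklearn.svm import LinearSVC   # stand-in binary learner; any one works

def ecoc_train(X, y, M):
    """M is a (C, L) code matrix with +/-1 entries; train one classifier per bit."""
    C, L = M.shape
    return [LinearSVC().fit(X, M[y, b]) for b in range(L)]   # relabel x_n as M[y_n, b]

def ecoc_predict(classifiers, X, M):
    """Predict the class whose code row agrees most with the predicted code."""
    codes = np.column_stack([clf.predict(X) for clf in classifiers])  # (N, L) in {-1, +1}
    return (codes @ M.T).argmax(axis=1)                               # inner product = similarity (Mc)_k

# a random code for C = 10 classes and L = 15 bits
M = np.where(np.random.default_rng(0).random((10, 15)) < 0.5, -1, 1)
```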
tree-based reduction
Idea: train $\approx$ C binary classifiers to learn “belongs to which half?”.
Training: see the figure in the slides (recursively split the classes into two halves; each internal node trains a binary classifier for “left half or right half?”).
Prediction is also natural, and very fast: only about $\log_2 C$ classifiers are evaluated (think ImageNet, where C ≈ 20K).
Comparisons
| Reduction | # of training points | test time | remark |
|---|---|---|---|
| OvA | $CN$ | $C$ | not robust (one erring classifier can flip the prediction) |
| OvO | $(C-1)N$ | $O(C^2)$ | can achieve very small training error |
| ECOC | $LN$ | $L$ | need diversity when designing the code |
| Tree | $(\log_2 C)N$ | $\log_2 C$ | good for “extreme classification” |
Neural Nets
Definition
We can use a nonlinear mapping as discussed: \(\phi(\boldsymbol{x}): \boldsymbol{x}\in \mathbb{R}^D \rightarrow\boldsymbol{z}\in \mathbb{R}^M\) But what kind of nonlinear mapping $\phi$ should be used? Can we actually learn this nonlinear mapping? THE most popular nonlinear models nowadays: neural nets
fig: a single neuron computes $o = h(\boldsymbol{w}^T \boldsymbol{x})$. For the activation $h$ we can use (see the sketch after this list):
- linear: $h(a)=a$
- Rectified Linear Unit (ReLU): $h(a)=\max\{0,a\}$
- sigmoid function: $h(a)=\frac{1}{1+e^{-a}}$
- TanH: $h(a) = \frac{e^a-e^{-a}}{e^a+e^{-a}}$
- and so on
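A minimal numpy sketch of these activations, just restating the formulas above:

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)          # max{0, a}, element-wise

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))    # squashes scores into (0, 1)

def tanh(a):
    return np.tanh(a)                  # (e^a - e^{-a}) / (e^a + e^{-a})
```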
more output nodes
$\boldsymbol{W}\in \mathbb{R}^{4\times3}, \boldsymbol{h}:\mathbb{R}^4\rightarrow\mathbb{R}^4$ so $\boldsymbol{h}(\boldsymbol{a}) = (h_1(a_1), …, h_4(a_4))$
Can think of this as a nonlinear basis: $\Phi(\boldsymbol{x}) = \boldsymbol{h}(\boldsymbol{Wx})$
more layers
fig
- each node is called a neuron
- $h$ is called the activation function
- can use $h(a) = 1$ for one neuron in each layer to incorporate bias term
- output neuron can use $h(a) = a$
- #layers refers to #hidden layers (plus 1 or 2 for input/output layers)
- deep neural nets can have many layers and millions of parameters
- this is a feedforward, fully connected neural net; there are many variants
how powerful are neural nets?
Universal approximation theorem (Cybenko, 89; Hornik, 91): A feedforward neural net with a single hidden layer can approximate any continuous function.
It might need a huge number of neurons though, and depth helps! Designing network architecture is important and very complicated
- for feedforward network, need to decide number of hidden layers, number of neurons at each layer, activation functions, etc.
An L-layer neural net can be written as
\[\boldsymbol{f(x)} = \boldsymbol{h}_L(\boldsymbol{W}_L\boldsymbol{h}_{L-1}(\boldsymbol{W}_{L-1} \dotsb \boldsymbol{h}_1(\boldsymbol{W}_1\boldsymbol{x})))\]define recursively:
$\boldsymbol{o_0=x}, \quad \boldsymbol{a_l = W_lo_{l-1}}, \quad \boldsymbol{o_l=h_l(a_l)} \qquad (l = 1, \dotsc, L)$
Where
- $\boldsymbol{W}_l \in\mathbb{R}^{D_l\times D_{l-1}}$ is the weight matrix for layer $l$
- $D_0=D,D_1,\dotsc,D_L$ are numbers of neurons at each layer
- $\boldsymbol{a}_l \in \mathbb{R}^{D_l}$ is input to layer $l$
- $\boldsymbol{o}_l$ is the output of layer $l$
- $\boldsymbol{h}_l: \mathbb{R}^{D_l} \rightarrow \mathbb{R}^{D_l}$ is the activation function at layer $l$
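A minimal sketch of this forward recursion, assuming `Ws` is a list of weight matrices $\boldsymbol{W}_1,\dotsc,\boldsymbol{W}_L$ and `hs` a matching list of activation functions (these names are mine, not from the notes):

```python
import numpy as np

def forward(Ws, hs, x):
    """o_0 = x, a_l = W_l o_{l-1}, o_l = h_l(a_l); returns the final output o_L."""
    o = x
    for W_l, h_l in zip(Ws, hs):
        a_l = W_l @ o      # input to layer l
        o = h_l(a_l)       # output of layer l
    return o
```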
Back propagation
Objective: minimize the total error
\[\mathcal{E}(\boldsymbol{W}_1, \dotsc, \boldsymbol{W}_L) = \sum^N_{n=1} \mathcal{E}_n(\boldsymbol{W}_1, \dotsc, \boldsymbol{W}_L)\]where
\[\mathcal{E}_n(\boldsymbol{W}_1, \dotsc, \boldsymbol{W}_L) = \begin{cases} \|\boldsymbol{f}(\boldsymbol{x}_n)-\boldsymbol{y}_n\|^2_2 & \text{for regression} \\ \ln \left(1+\sum_{k\ne y_n} e^{\boldsymbol{f}(\boldsymbol{x}_n)_k - \boldsymbol{f}(\boldsymbol{x}_n)_{y_n}}\right) & \text{for classification} \end{cases}\]Apply SGD, even though the objective is nonconvex. (Remark: SGD is guaranteed to work only for certain nonconvex functions.)
Use chain rule:
- for a composite function $f(g(w))$
- \[\frac{\partial f}{\partial w} = \frac{\partial f}{\partial g}\frac{\partial g}{\partial w}\]
- for a composite function $f(g_1(w), \dotsc, g_d(w))$
- \[\frac{\partial f}{\partial w} = \sum_{i=1}^{d}\frac{\partial f}{\partial g_i}\frac{\partial g_i}{\partial w}\]
Find the derivative of $\mathcal{E}_n$ w.r.t. $w_{ij}$, the weight connecting the output of neuron $j$ to the input of neuron $i$:
\[\frac{\partial\mathcal{E}_n}{\partial w_{ij}} = \frac{\partial\mathcal{E}_n}{\partial a_{i}}\frac{\partial a_i}{\partial w_{ij}} =\frac{\partial\mathcal{E}_n}{\partial a_{i}}o_{j}\] \[\frac{\partial\mathcal{E}_n}{\partial a_{i}} = \frac{\partial\mathcal{E}_n}{\partial o_{i}}\frac{\partial o_i}{\partial a_{i}} = \left ( \sum_k \frac{\partial\mathcal{E}_n}{\partial a_k}\frac{\partial a_k}{\partial o_{i}} \right)h'_i(a_i) = \left ( \sum_k \frac{\partial\mathcal{E}_n}{\partial a_k} w_{ki} \right)h'_i(a_i)\]Adding layer indices:
\[\frac{\partial\mathcal{E}_n}{\partial w_{ij}^{[l]}} =\frac{\partial\mathcal{E}_n}{\partial a_{i}^{[l]}}o_{j}^{[l-1]}\] \[\frac{\partial\mathcal{E}_n}{\partial a_{i}^{[l]}} = \left ( \sum_k \frac{\partial\mathcal{E}_n}{\partial a_k^{[l+1]}} w_{ki}^{[l+1]} \right)h_i^{'[l]}(a_i^{[l]})\]For the last layer, with the square loss:
\[\frac{\partial\mathcal{E}_n}{\partial a_{i}^{[L]}} = 2(h_{L,i}(a_{L,i})- y_{n,i})h'_{L,i}(a_{L,i})\]Using matrix notation greatly simplifies presentation and implementation:
\[\frac{\partial\mathcal{E}_n}{\partial \boldsymbol{W}_l} =\frac{\partial\mathcal{E}_n}{\partial \boldsymbol{a}_{l}}\boldsymbol{o}^T_{l-1}\] \[\frac{\partial\mathcal{E}_n}{\partial \boldsymbol{a}_{l}}= \begin{cases} \left( \boldsymbol{W}^T_{l+1} \frac{\partial\mathcal{E}_n}{\partial \boldsymbol{a}_{l+1}}\right) \circ \boldsymbol{h}'_l(\boldsymbol{a}_l) & \text{if $l<L$}\\ 2(\boldsymbol{h}_{L}(\boldsymbol{a}_{L})- \boldsymbol{y}_{n})\circ \boldsymbol{h}'_{L}(\boldsymbol{a}_{L}) & \text{else} \end{cases}\]where $\circ$ is the element-wise product (Hadamard product).
Full process of back propagation
initialize $\boldsymbol{W}_1, \dotsc, \boldsymbol{W}_L$ (all 0 or randomly). Repeat:
- randomly pick one data point $n\in [N]$
- forward propagation: for each layer $l=1, \dotsc, L$
- compute $\boldsymbol{a}_l = \boldsymbol{W}_l\boldsymbol{o}_{l-1}, \quad \boldsymbol{o}_l=\boldsymbol{h}_l(\boldsymbol{a}_l)$ (with $\boldsymbol{o}_0 = \boldsymbol{x}_n$)
- backward propagation: for each $l=L, \dotsc, 1$
- compute
\[\frac{\partial\mathcal{E}_n}{\partial \boldsymbol{a}_{l}}= \begin{cases} \left( \boldsymbol{W}^T_{l+1} \frac{\partial\mathcal{E}_n}{\partial \boldsymbol{a}_{l+1}}\right) \circ \boldsymbol{h}'_l(\boldsymbol{a}_l) & \text{if $l<L$}\\ 2(\boldsymbol{h}_{L}(\boldsymbol{a}_{L})- \boldsymbol{y}_{n})\circ \boldsymbol{h}'_{L}(\boldsymbol{a}_{L}) & \text{else} \end{cases}\]
- update weights
\[\boldsymbol{W}_l \leftarrow \boldsymbol{W}_l - \eta \frac{\partial\mathcal{E}_n}{\partial \boldsymbol{W}_l} =\boldsymbol{W}_l - \eta\frac{\partial\mathcal{E}_n}{\partial \boldsymbol{a}_{l}}\boldsymbol{o}^T_{l-1}\]
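A compact numpy sketch of one such SGD step for the squared loss, assuming `Ws` is a list of weight matrices and `h` / `h_prime` are a shared activation and its derivative (the shared activation and all names are my simplifications, not from the notes):

```python
import numpy as np

def backprop_step(Ws, h, h_prime, x, y, eta=0.01):
    """One SGD step on example (x, y) for the squared loss, following the recursions above."""
    # forward propagation: o_0 = x, a_l = W_l o_{l-1}, o_l = h(a_l)
    outs, acts = [x], []
    for W in Ws:
        acts.append(W @ outs[-1])
        outs.append(h(acts[-1]))
    # backward propagation, from the last layer down to the first
    grad_a = 2.0 * (outs[-1] - y) * h_prime(acts[-1])          # dE/da_L
    for l in range(len(Ws) - 1, -1, -1):
        grad_W = np.outer(grad_a, outs[l])                     # dE/dW_l = (dE/da_l) o_{l-1}^T
        if l > 0:
            grad_a = (Ws[l].T @ grad_a) * h_prime(acts[l - 1]) # propagate to dE/da_{l-1}
        Ws[l] -= eta * grad_W                                  # gradient step on W_l
    return Ws
```

For instance, `h = np.tanh` with `h_prime = lambda a: 1 - np.tanh(a)**2` gives a TanH network; repeatedly calling `backprop_step` on randomly chosen examples is exactly the loop above.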
More tricks to optimize neural nets
Many variants based on backprop
- SGD with minibatch: randomly sample a batch of examples to form a stochastic gradient
- SGD with momentum ···
SGD with momentum
initialize $\boldsymbol{w}_0$ and velocity $\boldsymbol{v}=0$
For t=1,2,…
- form a stochastic gradient $\boldsymbol{g}_t$
- update velocity $\boldsymbol{v}\leftarrow \alpha\boldsymbol{v}- \eta \boldsymbol{g}_t$
- update weight $\boldsymbol{w}_t \leftarrow \boldsymbol{w}_{t-1}+\boldsymbol{v}$
update for first few rounds:
- $\boldsymbol{w}_1 = \boldsymbol{w}_0 - \eta\boldsymbol{g}_1$
- $\boldsymbol{w}_2 = \boldsymbol{w}_1 - \alpha\eta\boldsymbol{g}_1-\eta\boldsymbol{g}_2$
- $\boldsymbol{w}_3 = \boldsymbol{w}_2 - \alpha^2\eta\boldsymbol{g}_1 - \alpha\eta\boldsymbol{g}_2-\eta\boldsymbol{g}_3$
- …
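A minimal sketch of this update rule, assuming a `stochastic_grad(w)` callable that returns $\boldsymbol{g}_t$ (that callable, like all names here, is a placeholder of mine):

```python
import numpy as np

def sgd_momentum(w0, stochastic_grad, eta=0.01, alpha=0.9, steps=100):
    """w_t = w_{t-1} + v, where the velocity v accumulates past gradients."""
    w = w0.copy()
    v = np.zeros_like(w)
    for _ in range(steps):
        g = stochastic_grad(w)    # form a stochastic gradient g_t
        v = alpha * v - eta * g   # update velocity
        w = w + v                 # update weight
    return w
```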
Overfitting
Overfitting is very likely since the models are too powerful.
Methods to overcome overfitting:
- data augmentation
- regularization (e.g. L2 weight decay; see the Regularization section below)
- dropout: randomly delete neurons during training (see the sketch after this list)
- early stopping: stop training when performance on the validation set stops improving
- ···
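A minimal sketch of (inverted) dropout applied to one layer's output during training; the keep probability `p` and the $1/p$ rescaling are common conventions I am assuming, not details from the notes:

```python
import numpy as np

def dropout(o, p=0.5, rng=np.random.default_rng(0)):
    """Keep each neuron's output with probability p, zero it otherwise; rescale survivors by 1/p."""
    mask = rng.random(o.shape) < p
    return o * mask / p   # at test time, simply use the layer output unchanged
```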
Data augmentation
- Exploit prior knowledge to add more training data
Regularization
L2 regularization: change the objective from $\mathcal{E}$ to $\mathcal{E}'$ and minimize
\[\mathcal{E}'(\boldsymbol{W}_1, \dotsc, \boldsymbol{W}_L) = \mathcal{E}(\boldsymbol{W}_1, \dotsc, \boldsymbol{W}_L) + \lambda\sum^L_{l=1}\|\boldsymbol{W}_l\|^2_2\]This is a simple change to the gradient:
\[\frac{\partial\mathcal{E}'}{\partial w_{ij}} = \frac{\partial\mathcal{E}}{\partial w_{ij}} + 2\lambda w_{ij}\]This introduces a weight-decay effect (see the sketch below).
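A one-function illustration of the resulting update; `grad_W` stands for $\partial\mathcal{E}_n/\partial\boldsymbol{W}_l$ as computed by backprop, and the names are mine:

```python
import numpy as np

def sgd_step_weight_decay(W_l, grad_W, eta=0.01, lam=1e-4):
    """SGD step on the regularized loss: the extra 2*lam*W_l term shrinks ('decays') the weights."""
    return W_l - eta * (grad_W + 2.0 * lam * W_l)
```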
Conclusions for neural nets
Deep neural networks
- are hugely popular, achieving best performance on many problems
- do need a lot of data to work well
- take a lot of time to train (need GPUs for massive parallel computing)
- take some work to select architecture and hyperparameters
- are still not well understood in theory