[Notes] CSCI 567 lec4 Multiclass Classification and Neural Nets
05 Feb 2020 | CSCI567, English, Notes. Credit to: Prof. Yan Liu, CSCI567, Spring 2020
outline
- Multiclass Classification
- Neural Nets
Multiclass Classification
More than 2 classes, e.g., recognizing digits (C = 10) or letters (C = 26 or 52).
Nearest Neighbor Classifier naturally works for arbitrary C.
Multinomial logistic regression
Linear models
For each class, keep a separate weight vector. Think of $\boldsymbol{w}^T_k \boldsymbol{x}$ as a score for class $k$ (one set of weights per category).
Binary logistic regression rewritten with one weight vector per class:
\[f(\boldsymbol{x}) = \arg \max_{k\in\{1,2\}} \boldsymbol{w}^T_k \boldsymbol{x}\]Generalizing to $C$ classes gives the function class
\[\begin{aligned} \mathcal{F} &= \left\{ f(\boldsymbol{x}) = \arg \max_{k\in[C]} \boldsymbol{w}^T_k \boldsymbol{x} \;|\; \boldsymbol{w}_1, \dotsc, \boldsymbol{w}_C \in\mathbb{R}^D\right\}\\ &=\left\{ f(\boldsymbol{x}) = \arg \max_{k\in[C]} (\boldsymbol{W} \boldsymbol{x})_k \;|\; \boldsymbol{W} \in\mathbb{R}^{C\times D}\right\} \end{aligned}\]
Probabilistic view
With $\boldsymbol{w} = \boldsymbol{w}_1-\boldsymbol{w}_2$, binary logistic regression becomes
\[\begin{aligned} \mathbb{P}(y=1|\boldsymbol{x};\boldsymbol{W}) = \sigma((\boldsymbol{w}_1-\boldsymbol{w}_2)^T\boldsymbol{x}) &= \frac{1}{1+e^{-(\boldsymbol{w}_1-\boldsymbol{w}_2)^T\boldsymbol{x}}} \\ & = \frac{e^{\boldsymbol{w}_1^T\boldsymbol{x}}}{e^{\boldsymbol{w}_1^T\boldsymbol{x}} + e^{\boldsymbol{w}_2^T\boldsymbol{x}}} \propto e^{\boldsymbol{w}_1^T\boldsymbol{x}} \end{aligned}\]Naturally, for multiclass:
\[\mathbb{P}(y=k|\boldsymbol{x};\boldsymbol{W}) = \frac{e^{\boldsymbol{w}_k^T\boldsymbol{x}}}{\sum_{k'\in[C]}e^{\boldsymbol{w}_{k'}^T\boldsymbol{x}}} \propto e^{\boldsymbol{w}_k^T\boldsymbol{x}}\]This is called the softmax function (it turns the scores into a probability distribution over classes).
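As a quick illustration, here is a minimal numpy sketch of the softmax mapping from scores $\boldsymbol{Wx}$ to class probabilities; the max-subtraction is only a standard numerical-stability trick, not part of the definition, and the names are mine.

```python
import numpy as np

def softmax_probs(W, x):
    """P(y = k | x; W) for a score matrix W of shape (C, D) and a feature vector x of shape (D,)."""
    scores = W @ x                  # one score w_k^T x per class
    scores -= scores.max()          # stability shift; cancels in the ratio
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()
```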
Applying MLE (since a probabilistic model is given)
Maximize the probability of seeing labels $y_1, \dotsc, y_N$ given $\boldsymbol{x}_1, \dotsc, \boldsymbol{x}_N$ (the whole training set), i.e., find the weights under which the observed training labels are most probable.
Joint probability (under the i.i.d. assumption), also called the likelihood $L(\boldsymbol{W})$:
\[P(\boldsymbol{W}) = \prod^N_{n=1}\mathbb{P}(y_n|\boldsymbol{x}_n;\boldsymbol{W}) = \prod^N_{n=1}\frac{e^{\boldsymbol{w}_{y_n}^T\boldsymbol{x}_n}}{\sum_{k\in[C]}e^{\boldsymbol{w}_k^T\boldsymbol{x}_n}}\]Taking the negative log, maximizing the likelihood is equivalent to minimizing
\[F(\boldsymbol{W}) = \sum^N_{n=1}\ln\left(\frac{\sum_{k\in[C]}e^{\boldsymbol{w}_k^T\boldsymbol{x}_n}}{e^{\boldsymbol{w}^T_{y_n}\boldsymbol{x}_n}}\right) = \sum^N_{n=1}\ln \left(1+\sum_{k\ne y_n} e^{(\boldsymbol{w}_k-\boldsymbol{w}_{y_n})^T \boldsymbol{x}_n}\right)\]This is the multiclass logistic loss, a.k.a. the cross-entropy loss.
When C = 2, this is the same as binary logistic loss.
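A minimal numpy sketch of this loss on a training set, assuming `X` holds one example per row and `y` holds integer labels in `[0, C)` (the names are mine, not from the notes):

```python
import numpy as np

def cross_entropy_loss(W, X, y):
    """F(W) = sum_n [ ln sum_k exp(w_k^T x_n) - w_{y_n}^T x_n ]."""
    scores = X @ W.T                               # (N, C) matrix of scores w_k^T x_n
    scores -= scores.max(axis=1, keepdims=True)    # numerical stability; shifts both terms equally
    log_norm = np.log(np.exp(scores).sum(axis=1))  # ln sum_k exp(score)
    return np.sum(log_norm - scores[np.arange(len(y)), y])
```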
Optimization (apply SGD)
$\boldsymbol{W}$ is a $C \times D$ matrix; write $g(\boldsymbol{W}) = \ln \left(1+\sum_{k'\ne y_n} e^{(\boldsymbol{w}_{k'}-\boldsymbol{w}_{y_n})^T \boldsymbol{x}_n}\right)$ for the loss on example $n$, and focus on its $k$-th row. If $k\ne y_n$ (derivative with respect to $\boldsymbol{w}_k$):
\[\begin{aligned} \nabla_{\boldsymbol{w}_k}g(\boldsymbol{W}) & = \frac{e^{(\boldsymbol{w}_k-\boldsymbol{w}_{y_n})^T \boldsymbol{x}_n}} {1+\sum_{k'\ne y_n} e^{(\boldsymbol{w}_{k'}-\boldsymbol{w}_{y_n})^T \boldsymbol{x}_n}}\boldsymbol{x}_n^T \\ &= \frac{e^{\boldsymbol{w}_k^T\boldsymbol{x}_n}}{\sum_{k'\in[C]}e^{\boldsymbol{w}_{k'}^T\boldsymbol{x}_n}}\boldsymbol{x}_n^T \\ &= \mathbb{P}(y=k|\boldsymbol{x}_n;\boldsymbol{W})\boldsymbol{x}_n^T \end{aligned}\]If $k = y_n$ (derivative with respect to $\boldsymbol{w}_{y_n}$):
\[\begin{aligned} \nabla_{\boldsymbol{w}_{y_n}}g(\boldsymbol{W}) & = \frac{-\sum_{k'\ne y_n}e^{(\boldsymbol{w}_{k'}-\boldsymbol{w}_{y_n})^T \boldsymbol{x}_n}} {1+\sum_{k'\ne y_n} e^{(\boldsymbol{w}_{k'}-\boldsymbol{w}_{y_n})^T \boldsymbol{x}_n}}\boldsymbol{x}_n^T \\ &= \left(\mathbb{P}(y=y_n|\boldsymbol{x}_n;\boldsymbol{W})-1\right)\boldsymbol{x}_n^T \end{aligned}\]Here $\mathbb{P}(y=y_n|\boldsymbol{x}_n;\boldsymbol{W})$ is the probability of the correct label: as it approaches 1 the gradient vanishes, and the smaller it is, the larger the update.
Initialize $\boldsymbol{W}=0$ (or randomly). Repeat:
- pick $n\in [N]$ uniformly at random
- update the parameters: $\boldsymbol{W} \leftarrow \boldsymbol{W} - \eta\left(\mathbb{P}(y=k|\boldsymbol{x}_n;\boldsymbol{W}) - \mathbb{I}[k=y_n]\right)_{k\in[C]}\boldsymbol{x}_n^T$, i.e., subtract $\eta$ times the per-example gradient derived above (see the sketch below)
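A minimal sketch of this SGD loop for multinomial logistic regression, assuming `X` is an `(N, D)` array and `y` holds integer labels in `[0, C)`; all names here are mine:

```python
import numpy as np

def sgd_multinomial_logistic(X, y, C, eta=0.1, steps=10000, seed=0):
    """SGD on the multiclass logistic loss; returns the learned (C, D) weight matrix."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    W = np.zeros((C, D))
    for _ in range(steps):
        n = rng.integers(N)                        # pick n in [N] uniformly at random
        scores = W @ X[n]
        scores -= scores.max()                     # numerical stability
        p = np.exp(scores) / np.exp(scores).sum()  # softmax probabilities P(y = k | x_n; W)
        p[y[n]] -= 1.0                             # gradient factor P(y = k | x_n; W) - 1[k = y_n]
        W -= eta * np.outer(p, X[n])               # rank-one gradient step
    return W
```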
Notes on prediction
Having learned W , we can either
- make a deterministic prediction: $\arg \max_{k\in{[C]}} \boldsymbol{w}^T_k \boldsymbol{x}$
- make a randomized prediction according to: $\mathbb{P}(y=k|\boldsymbol{x};\boldsymbol{W}) \propto e^{\boldsymbol{w}_k^T\boldsymbol{x}}$
In either case, the (expected) 0-1 mistake is bounded by the logistic loss:
- deterministic: if the prediction is wrong, some $k\ne y$ has $\boldsymbol{w}_k^T\boldsymbol{x}\ge\boldsymbol{w}_y^T\boldsymbol{x}$, so $\mathbb{I}[f(\boldsymbol{x})\ne y] \le \log_2\left(1+\sum_{k\ne y}e^{(\boldsymbol{w}_k-\boldsymbol{w}_y)^T\boldsymbol{x}}\right)$
- randomized: the expected mistake is $1-\mathbb{P}(y|\boldsymbol{x};\boldsymbol{W}) \le \ln\frac{1}{\mathbb{P}(y|\boldsymbol{x};\boldsymbol{W})}$, which is exactly the logistic loss
Reduction to binary classification
Is there an even more general and simpler approach to derive multiclass classification algorithms?
Given a binary classification algorithm (any one, not just linear methods), can we turn it into a multiclass algorithm, in a black-box manner?
Yes, there are in fact many ways to do it.
- one-versus-all (one-versus-rest, one-against-all, etc)
- one-versus-one (all-versus-all, etc)
- Error-Correcting Output Codes (ECOC)
- tree-based reduction
one-versus-all (OvA)
Idea: train C binary classifiers to learn “is class k or not?” for each k.
Training: for each class $k ∈ [C]$,
- relabel examples with class k as +1, and all others as −1
- train a binary classifier $h_k$ using this new dataset
Prediction: for a new example x
- ask each $h_k$: does this belong to class $k$? (i.e. compute $h_k(\boldsymbol{x})$)
- randomly pick among all $k$'s s.t. $h_k(\boldsymbol{x}) = +1$
Issue: the final prediction is (probably) wrong as soon as a single $h_k$ errs. When no classifier predicts positive, we can fall back to the class whose classifier outputs the largest score (see the sketch below).
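A minimal OvA sketch in the spirit of the recipe above; the choice of `LinearSVC` as the underlying binary learner and the use of its real-valued `decision_function` score for the fallback are my assumptions, since any binary classifier works here:

```python
import numpy as np
from sklearn.svm import LinearSVC   # stand-in binary learner; any one with a score works

def ova_train(X, y, C):
    """Train one 'class k vs. rest' classifier per class."""
    return [LinearSVC().fit(X, np.where(y == k, 1, -1)) for k in range(C)]

def ova_predict(classifiers, X):
    """Predict the class whose classifier gives the largest score (also covers 'no positive vote')."""
    scores = np.column_stack([clf.decision_function(X) for clf in classifiers])
    return scores.argmax(axis=1)
```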
One-versus-one (OvO)
- advantage: each classifier solves a finer-grained problem that only distinguishes two classes
- disadvantage: more classifiers to train, and each one sees less data
Idea: train $\binom{C}{2}$ binary classifiers to learn “is class $k$ or $k'$?”.
Training: for each pair $(k, k’)$
- relabel examples with class k as +1 and examples with class k′ as −1
- discard all other examples (keep only the examples labeled $k$ or $k'$)
- train a binary classifier $h_{(k,k')}$ using this new dataset
Prediction: for a new example $\boldsymbol{x}$
- ask each classifier $h_{(k,k')}$ to vote for either class $k$ or $k'$
- predict the class with the most votes (break tie in some way)
More robust than one-versus-all, but slower in prediction.
Side note: class imbalance is less of a problem for generative classifiers than for discriminative ones.
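A minimal OvO sketch matching the recipe above; as before, `LinearSVC` is just a stand-in binary learner of my choosing and `y` is assumed to hold integer labels:

```python
import itertools
import numpy as np
from sklearn.svm import LinearSVC   # stand-in binary learner; any one works

def ovo_train(X, y, C):
    """Train one classifier per pair (k, k'), using only examples from those two classes."""
    classifiers = {}
    for k, k2 in itertools.combinations(range(C), 2):
        mask = (y == k) | (y == k2)
        classifiers[(k, k2)] = LinearSVC().fit(X[mask], np.where(y[mask] == k, 1, -1))
    return classifiers

def ovo_predict(classifiers, X, C):
    """Each pairwise classifier votes for k or k'; predict the class with the most votes."""
    votes = np.zeros((len(X), C), dtype=int)
    for (k, k2), clf in classifiers.items():
        winner = np.where(clf.predict(X) == 1, k, k2)   # +1 means class k, -1 means class k'
        votes[np.arange(len(X)), winner] += 1
    return votes.argmax(axis=1)                         # ties broken toward the smaller index
```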
Error-Correcting Output Codes (ECOC)
Idea: based on a code $\boldsymbol{M} \in \{-1, +1\}^{C\times L}$, learn “is bit $b$ on or off?”. This sits somewhere between OvA and OvO.
Training: for each bit $b \in [L]$ ($L$ is the number of binary classifiers):
- relabel example $x_n$ as $M_{y_n,b}$
- train a binary classifier $h_b$ using this new dataset.
Prediction: for a new example x
- compute the predicted code $c = (h_1(\boldsymbol{x}), . . . , h_L(\boldsymbol{x}))^T$
- predict the class with the most similar code: $k = \arg\max_k(\boldsymbol{Mc})_k$
How to pick the code M?
- the more dissimilar the codes between different classes are, the better
- random code is a good choice, but might create hard training sets
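A minimal ECOC sketch following the steps above, with `LinearSVC` again standing in for an arbitrary binary learner, `y` assumed to hold integer labels, and a random code matrix as suggested:

```python
import numpy as np
from sklearn.svm import LinearSVC   # stand-in binary learner; any one works

def ecoc_train(X, y, M):
    """M is a (C, L) code matrix with +/-1 entries; train one classifier per bit."""
    C, L = M.shape
    return [LinearSVC().fit(X, M[y, b]) for b in range(L)]   # relabel x_n as M[y_n, b]

def ecoc_predict(classifiers, X, M):
    """Predict the class whose code row agrees most with the predicted code."""
    codes = np.column_stack([clf.predict(X) for clf in classifiers])  # (N, L) in {-1, +1}
    return (codes @ M.T).argmax(axis=1)                               # inner product = similarity (Mc)_k

# a random code for C = 10 classes and L = 15 bits
M = np.where(np.random.default_rng(0).random((10, 15)) < 0.5, -1, 1)
```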
tree-based reduction
Idea: train $\approx$ C binary classifiers to learn “belongs to which half?”.
Training: see the figure in the slides (recursively split the classes into two halves; each internal node trains a binary classifier for “left half or right half?”).
Prediction is also natural, and very fast: only about $\log_2 C$ classifiers are evaluated (think ImageNet, where C ≈ 20K).
Comparisons
| Reduction | # of training points | test time | remark |
|---|---|---|---|
| OvA | $CN$ | $C$ | not robust (one erring classifier can flip the prediction) |
| OvO | $(C-1)N$ | $O(C^2)$ | can achieve very small training error |
| ECOC | $LN$ | $L$ | need diversity when designing the code |
| Tree | $(\log_2 C)N$ | $\log_2 C$ | good for “extreme classification” |
Neural Nets
Definition
We can use a nonlinear mapping as discussed: \(\phi(\boldsymbol{x}): \boldsymbol{x}\in \mathbb{R}^D \rightarrow\boldsymbol{z}\in \mathbb{R}^M\) But what kind of nonlinear mapping $\phi$ should be used? Can we actually learn this nonlinear mapping? THE most popular nonlinear models nowadays: neural nets
fig: a single neuron computes $o = h(\boldsymbol{w}^T \boldsymbol{x})$. For the activation $h$ we can use (see the sketch after this list):
- linear: $h(a)=a$
- Rectified Linear Unit (ReLU): $h(a)=\max\{0,a\}$
- sigmoid function: $h(a)=\frac{1}{1+e^{-a}}$
- TanH: $h(a) = \frac{e^a-e^{-a}}{e^a+e^{-a}}$
- and so on
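A minimal numpy sketch of these activations, just restating the formulas above:

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)          # max{0, a}, element-wise

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))    # squashes scores into (0, 1)

def tanh(a):
    return np.tanh(a)                  # (e^a - e^{-a}) / (e^a + e^{-a})
```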
more output nodes
$\boldsymbol{W}\in \mathbb{R}^{4\times3}, \boldsymbol{h}:\mathbb{R}^4\rightarrow\mathbb{R}^4$ so $\boldsymbol{h}(\boldsymbol{a}) = (h_1(a_1), …, h_4(a_4))$
Can think of this as a nonlinear basis: $\Phi(\boldsymbol{x}) = \boldsymbol{h}(\boldsymbol{Wx})$
more layers
fig
- each node is called a neuron
- $h$ is called the activation function
- can use $h(a) = 1$ for one neuron in each layer to incorporate bias term
- output neuron can use $h(a) = a$
- #layers refers to #hidden layers (plus 1 or 2 for input/output layers)
- deep neural nets can have many layers and millions of parameters
- this is a feedforward, fully connected neural net; there are many variants
how powerful are neural nets?
Universal approximation theorem (Cybenko, 89; Hornik, 91): A feedforward neural net with a single hidden layer can approximate any continuous function.
It might need a huge number of neurons though, and depth helps! Designing network architecture is important and very complicated
- for feedforward network, need to decide number of hidden layers, number of neurons at each layer, activation functions, etc.
An L-layer neural net can be written as
\[\boldsymbol{f(x)} = \boldsymbol{h}_L(\boldsymbol{W}_L\boldsymbol{h}_{L-1}(\boldsymbol{W}_{L-1} \dotsb \boldsymbol{h}_1(\boldsymbol{W}_1\boldsymbol{x})))\]define recursively:
$\boldsymbol{o_0=x}, \quad \boldsymbol{a_l = W_lo_{l-1}}, \quad \boldsymbol{o_l=h_l(a_l)} \qquad (l = 1, \dotsc, L)$
Where
- $\boldsymbol{W}_l \in\mathbb{R}^{D_l\times D_{l-1}}$ is the weight matrix for layer $l$
- $D_0=D,D_1,\dotsc,D_L$ are numbers of neurons at each layer
- $\boldsymbol{a}_l \in \mathbb{R}^{D_l}$ is input to layer $l$
- $\boldsymbol{o}_l$ is the output of layer $l$
- $\boldsymbol{h}_l: \mathbb{R}^{D_l} \rightarrow \mathbb{R}^{D_l}$ is the activation function at layer $l$
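A minimal sketch of this forward recursion, assuming `Ws` is a list of weight matrices $\boldsymbol{W}_1,\dotsc,\boldsymbol{W}_L$ and `hs` a matching list of activation functions (these names are mine, not from the notes):

```python
import numpy as np

def forward(Ws, hs, x):
    """o_0 = x, a_l = W_l o_{l-1}, o_l = h_l(a_l); returns the final output o_L."""
    o = x
    for W_l, h_l in zip(Ws, hs):
        a_l = W_l @ o      # input to layer l
        o = h_l(a_l)       # output of layer l
    return o
```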
Back propagation
Objective: minimize the total error
\[\mathcal{E}(\boldsymbol{W}_1, \dotsc, \boldsymbol{W}_L) = \sum^N_{n=1} \mathcal{E}_n(\boldsymbol{W}_1, \dotsc, \boldsymbol{W}_L)\]where
\[\mathcal{E}_n(\boldsymbol{W}_1, \dotsc, \boldsymbol{W}_L) = \begin{cases} \|\boldsymbol{f}(\boldsymbol{x}_n)-\boldsymbol{y}_n\|^2_2 & \text{for regression} \\ \ln \left(1+\sum_{k\ne y_n} e^{\boldsymbol{f}(\boldsymbol{x}_n)_k - \boldsymbol{f}(\boldsymbol{x}_n)_{y_n}}\right) & \text{for classification} \end{cases}\]Apply SGD, even though the objective is nonconvex. (Remark: SGD is guaranteed to work only for certain nonconvex functions.)
Use chain rule:
- for a composite function $f(g(w))$
- \[\frac{\partial f}{\partial w} = \frac{\partial f}{\partial g}\frac{\partial g}{\partial w}\]
- for a composite function $f(g_1(w), \dotsc, g_d(w))$
- \[\frac{\partial f}{\partial w} = \sum_{i=1}^{d}\frac{\partial f}{\partial g_i}\frac{\partial g_i}{\partial w}\]
Find the derivative of $\mathcal{E}_n$ w.r.t. $w_{ij}$, the weight connecting the output of neuron $j$ to the input of neuron $i$:
\[\frac{\partial\mathcal{E}_n}{\partial w_{ij}} = \frac{\partial\mathcal{E}_n}{\partial a_{i}}\frac{\partial a_i}{\partial w_{ij}} =\frac{\partial\mathcal{E}_n}{\partial a_{i}}o_{j}\] \[\frac{\partial\mathcal{E}_n}{\partial a_{i}} = \frac{\partial\mathcal{E}_n}{\partial o_{i}}\frac{\partial o_i}{\partial a_{i}} = \left ( \sum_k \frac{\partial\mathcal{E}_n}{\partial a_k}\frac{\partial a_k}{\partial o_{i}} \right)h'_i(a_i) = \left ( \sum_k \frac{\partial\mathcal{E}_n}{\partial a_k} w_{ki} \right)h'_i(a_i)\]Adding layer indices:
\[\frac{\partial\mathcal{E}_n}{\partial w_{ij}^{[l]}} =\frac{\partial\mathcal{E}_n}{\partial a_{i}^{[l]}}o_{j}^{[l-1]}\] \[\frac{\partial\mathcal{E}_n}{\partial a_{i}^{[l]}} = \left ( \sum_k \frac{\partial\mathcal{E}_n}{\partial a_k^{[l+1]}} w_{ki}^{[l+1]} \right)h_i^{'[l]}(a_i^{[l]})\]For the last layer, with the square loss:
\[\frac{\partial\mathcal{E}_n}{\partial a_{i}^{[L]}} = 2(h_{L,i}(a_{L,i})- y_{n,i})h'_{L,i}(a_{L,i})\]Using matrix notation greatly simplifies presentation and implementation:
\[\frac{\partial\mathcal{E}_n}{\partial \boldsymbol{W}_l} =\frac{\partial\mathcal{E}_n}{\partial \boldsymbol{a}_{l}}\boldsymbol{o}^T_{l-1}\] \[\frac{\partial\mathcal{E}_n}{\partial \boldsymbol{a}_{l}}= \begin{cases} \left( \boldsymbol{W}^T_{l+1} \frac{\partial\mathcal{E}_n}{\partial \boldsymbol{a}_{l+1}}\right) \circ \boldsymbol{h}'_l(\boldsymbol{a}_l) & \text{if $l<L$}\\ 2(\boldsymbol{h}_{L}(\boldsymbol{a}_{L})- \boldsymbol{y}_{n})\circ \boldsymbol{h}'_{L}(\boldsymbol{a}_{L}) & \text{else} \end{cases}\]where $\circ$ is the element-wise product (Hadamard product).
Full process of back propagation
initialize $\boldsymbol{W}_1, \dotsc, \boldsymbol{W}_L$ (all 0 or randomly). Repeat:
- randomly pick one data point $n\in [N]$
- forward propagation: for each layer $l=1, \dotsc, L$
- compute $\boldsymbol{a}_l = \boldsymbol{W}_l\boldsymbol{o}_{l-1}, \quad \boldsymbol{o}_l=\boldsymbol{h}_l(\boldsymbol{a}_l)$ (with $\boldsymbol{o}_0 = \boldsymbol{x}_n$)
- backward propagation: for each $l=L, \dotsc, 1$
- compute
\[\frac{\partial\mathcal{E}_n}{\partial \boldsymbol{a}_{l}}= \begin{cases} \left( \boldsymbol{W}^T_{l+1} \frac{\partial\mathcal{E}_n}{\partial \boldsymbol{a}_{l+1}}\right) \circ \boldsymbol{h}'_l(\boldsymbol{a}_l) & \text{if $l<L$}\\ 2(\boldsymbol{h}_{L}(\boldsymbol{a}_{L})- \boldsymbol{y}_{n})\circ \boldsymbol{h}'_{L}(\boldsymbol{a}_{L}) & \text{else} \end{cases}\]
- update weights
\[\boldsymbol{W}_l \leftarrow \boldsymbol{W}_l - \eta \frac{\partial\mathcal{E}_n}{\partial \boldsymbol{W}_l} =\boldsymbol{W}_l - \eta\frac{\partial\mathcal{E}_n}{\partial \boldsymbol{a}_{l}}\boldsymbol{o}^T_{l-1}\]
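A compact numpy sketch of one such SGD step for the squared loss, assuming `Ws` is a list of weight matrices and `h` / `h_prime` are a shared activation and its derivative (the shared activation and all names are my simplifications, not from the notes):

```python
import numpy as np

def backprop_step(Ws, h, h_prime, x, y, eta=0.01):
    """One SGD step on example (x, y) for the squared loss, following the recursions above."""
    # forward propagation: o_0 = x, a_l = W_l o_{l-1}, o_l = h(a_l)
    outs, acts = [x], []
    for W in Ws:
        acts.append(W @ outs[-1])
        outs.append(h(acts[-1]))
    # backward propagation, from the last layer down to the first
    grad_a = 2.0 * (outs[-1] - y) * h_prime(acts[-1])          # dE/da_L
    for l in range(len(Ws) - 1, -1, -1):
        grad_W = np.outer(grad_a, outs[l])                     # dE/dW_l = (dE/da_l) o_{l-1}^T
        if l > 0:
            grad_a = (Ws[l].T @ grad_a) * h_prime(acts[l - 1]) # propagate to dE/da_{l-1}
        Ws[l] -= eta * grad_W                                  # gradient step on W_l
    return Ws
```

For instance, `h = np.tanh` with `h_prime = lambda a: 1 - np.tanh(a)**2` gives a TanH network; repeatedly calling `backprop_step` on randomly chosen examples is exactly the loop above.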
More tricks to optimize neural nets
Many variants based on backprop
- SGD with minibatch: randomly sample a batch of examples to form a stochastic gradient
- SGD with momentum ···
SGD with momentum
initialize $\boldsymbol{w}_0$ and velocity $\boldsymbol{v}=0$
For t=1,2,…
- form a stochastic gradient $\boldsymbol{g}_t$
- update velocity $\boldsymbol{v}\leftarrow \alpha\boldsymbol{v}- \eta \boldsymbol{g}_t$
- update weight $\boldsymbol{w}_t \leftarrow \boldsymbol{w}_{t-1}+\boldsymbol{v}$
update for first few rounds:
- $\boldsymbol{w}_1 = \boldsymbol{w}_0 - \eta\boldsymbol{g}_1$
- $\boldsymbol{w}_2 = \boldsymbol{w}_1 - \alpha\eta\boldsymbol{g}_1-\eta\boldsymbol{g}_2$
- $\boldsymbol{w}_3 = \boldsymbol{w}_2 - \alpha^2\eta\boldsymbol{g}_1 - \alpha\eta\boldsymbol{g}_2-\eta\boldsymbol{g}_3$
- …
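A minimal sketch of this update rule, assuming a `stochastic_grad(w)` callable that returns $\boldsymbol{g}_t$ (that callable, like all names here, is a placeholder of mine):

```python
import numpy as np

def sgd_momentum(w0, stochastic_grad, eta=0.01, alpha=0.9, steps=100):
    """w_t = w_{t-1} + v, where the velocity v accumulates past gradients."""
    w = w0.copy()
    v = np.zeros_like(w)
    for _ in range(steps):
        g = stochastic_grad(w)    # form a stochastic gradient g_t
        v = alpha * v - eta * g   # update velocity
        w = w + v                 # update weight
    return w
```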
Overfitting
Overfitting is very likely since the models are too powerful.
Methods to overcome overfitting:
- data augmentation
- regularization (e.g. L2 weight decay; see the Regularization section below)
- dropout: randomly delete neurons during training (see the sketch after this list)
- early stopping: stop training when performance on the validation set stops improving
- ···
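A minimal sketch of (inverted) dropout applied to one layer's output during training; the keep probability `p` and the $1/p$ rescaling are common conventions I am assuming, not details from the notes:

```python
import numpy as np

def dropout(o, p=0.5, rng=np.random.default_rng(0)):
    """Keep each neuron's output with probability p, zero it otherwise; rescale survivors by 1/p."""
    mask = rng.random(o.shape) < p
    return o * mask / p   # at test time, simply use the layer output unchanged
```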
Data augmentation
- Exploit prior knowledge to add more training data
Regularization
L2 regularization: change the objective from $\mathcal{E}$ to $\mathcal{E}'$ and minimize
\[\mathcal{E}'(\boldsymbol{W}_1, \dotsc, \boldsymbol{W}_L) = \mathcal{E}(\boldsymbol{W}_1, \dotsc, \boldsymbol{W}_L) + \lambda\sum^L_{l=1}\|\boldsymbol{W}_l\|^2_2\]This is a simple change to the gradient:
\[\frac{\partial\mathcal{E}'}{\partial w_{ij}} = \frac{\partial\mathcal{E}}{\partial w_{ij}} + 2\lambda w_{ij}\]This introduces a weight-decay effect (see the sketch below).
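A one-function illustration of the resulting update; `grad_W` stands for $\partial\mathcal{E}_n/\partial\boldsymbol{W}_l$ as computed by backprop, and the names are mine:

```python
import numpy as np

def sgd_step_weight_decay(W_l, grad_W, eta=0.01, lam=1e-4):
    """SGD step on the regularized loss: the extra 2*lam*W_l term shrinks ('decays') the weights."""
    return W_l - eta * (grad_W + 2.0 * lam * W_l)
```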
Conclusions for neural nets
Deep neural networks
- are hugely popular, achieving best performance on many problems
- do need a lot of data to work well
- take a lot of time to train (need GPUs for massive parallel computing)
- take some work to select architecture and hyperparameters
- are still not well understood in theory