[Notes] CSCI 567 lec2 LR

24 Jan 2020 | CSCI567, English/英文, Note

Credit to: Prof. Yan Liu, CSCI567, Spring 2020

Outline

Linear regression
Linear regression with nonlinear basis
Overfitting and Preventing Overfitting

Linear regression

Motivation

Predicting a continuous outcome variable using past observations. Linear Regression is Regression with linear models

Key difference from classification

continuous vs discrete
measure prediction errors differently
lead to quite different learning algorithms

Human are good at linear relationship (visually). Also, for taylor series of the target function, the first term is usually a constant, the second is a linear function. By ignoring the rest terms, a function can be simplified to a linear function.

\[f(x,a)= f(x_0,a)+ \frac{f'(x_0,a)}{1!}(x-x_0) + \frac{f''(x_0,a)}{2!}(x-x_0)^2+...\]

Setup and Algorithm

Input: $\boldsymbol{x}\in \mathbb{R}^D$ (features, covariates, …)
Output: $y\in \mathbb{R}$ ( responses, targets, …)
Training data: $\mathcal{D}={(\boldsymbol{x}_n, y_n), n=1,2, …, N}$
Linear model: $f: \mathbb{R}^D\rightarrow\mathbb{R}$ with $f(\boldsymbol{x})=w_0+\sum^D_{d=1}w_dx_d=w_0+\boldsymbol{w^T x}$ (notices, $T$ here stands for transpose)

where $\boldsymbol{w} =[w_1 \ w_2 \ …\ w_D]^T$ (weights, weight vector, parameter vector, etc.), $w_0$ is bias

Often to simplify the notation we use $\tilde{\boldsymbol{x}}=[1 \ x_1 \ x_2 \ …\ x_D]^T$ and $\tilde{\boldsymbol{w}}=[w_0 \ w_1 \ w_2 \ …\ w_D]^T$ then the model is simplified as $f(\boldsymbol{x})=\tilde{\boldsymbol{w}}^T\tilde{\boldsymbol{x}}$ *some time even use $\boldsymbol{w}, \boldsymbol{x}, D$ directly for $\tilde{\boldsymbol{w}}, \tilde{\boldsymbol{x}}, D+1$

Measure error for on prediction

The classification error (0-1 loss) is inappropriate fot the continuous outcomes

absolute error: |prediction - sale price|
squared error: (prediction - sale price)^2 (Most common, and usually well behaver)

Then the Goal is to minimize total squared error Residual Sum of Squares (RSS) $\text{RSS}(\tilde{\boldsymbol{w}})=\sum_n (f(\boldsymbol{x}_n)-y_n)^2 = \sum_n (\tilde{\boldsymbol{x}}_n^T\tilde{\boldsymbol{w}}-y_n)^2$

Find $\tilde{\boldsymbol{w}}^*=\arg\max_{\tilde{\boldsymbol{w}}\in \mathbb{R}^{D+1}}\text{RSS}(\tilde{\boldsymbol{w}})$, i.e. least(mean) squares solution (or empirical risk minimizer)

Reduce machine learning to optimization problem

Find stationary point(multivariate calculus)

\[\begin{aligned} \nabla\text{RSS}(\tilde{\boldsymbol{w}})&=2\sum_n \tilde{\boldsymbol{x}}_n(\tilde{\boldsymbol{x}}_n^T\tilde{\boldsymbol{w}}-y_n) \propto (\sum_n \tilde{\boldsymbol{x}}_n\tilde{\boldsymbol{x}}_n^T)\tilde{\boldsymbol{w}}-\sum_n \tilde{\boldsymbol{x}}_ny_n \\ &= (\tilde{\boldsymbol{X}}^T \tilde{\boldsymbol{X}})\tilde{\boldsymbol{w}}-\tilde{\boldsymbol{X}}^T\boldsymbol{y} = 0 \\ \Rightarrow \tilde{\boldsymbol{w}}^* &= (\tilde{\boldsymbol{X}}^T \tilde{\boldsymbol{X}})^{-1}\tilde{\boldsymbol{X}}^T\boldsymbol{y} \end{aligned}\]

Bottleneck of computing: to invert the matrix $\tilde{\boldsymbol{X}}^T \tilde{\boldsymbol{X}}\in \mathbb{R}^{(D+1)(D+1)}$ is $O(D^3)$ naively. (There are many faster approaches(such as conjugate gradient).

What if $\tilde{\boldsymbol{X}}^T \tilde{\boldsymbol{X}}$ is invertible?

$N < D+1$ i.e., not enough data to estimate all parameters (less data then the number of attributes )

fix the problem by adding a hyper-parameter $\lambda$, then the solution becomes

\[\tilde{\boldsymbol{w}}^* = (\tilde{\boldsymbol{X}}^T \tilde{\boldsymbol{X}}+\lambda\boldsymbol{I})^{-1}\tilde{\boldsymbol{X}}^T\boldsymbol{y}\]

Comparison to NNC

Parametric methods (LR): the size of the model does not grow with teh size of the training set N.

e.g., linear regression has D+1 parameters and is independent of N.
Non-parametric methods: the size of the model grows with the size of the training set.

e.g. NNC, the training set itself needs to be kept in order to predict. Thus, the size of the model is the size of the training set.

Linear regression with nonlinear basis

Use a nonlinear mapping to transform the data to a more complicate feature space

\[\phi(\boldsymbol{x}): \boldsymbol{x} \in \mathbb{R}^D \rightarrow \boldsymbol{z}\in \mathbb{R}^M \ \ \ (D\le M)\]

Then apply linear regression

Notice, if $\phi(.)$ is a linear feature map it basically does nothing (will not improve even change the result).

Overfitting and Preventing Overfitting

Underfitting
- large training error
- large test error
Overfitting
- small training error
- larger test error

More complicated models -> larger gap between training and test error

Prevent overfitting

more training data
control the model complexity (e.g., degree $M$ mentioned before) Larger weight -> more complex model

Regularized linear regression

new objective $\mathcal{E}(\boldsymbol{w}) = \text{RSS}(\boldsymbol{w})+\lambda R(\boldsymbol{w})$

find $\boldsymbol{w}^*=\arg\max_w\varepsilon(\boldsymbol{w})$

$R:\mathbb{R}^D \rightarrow \mathbb{R}^+$ is the regularizer

measure how complex the model $\boldsymbol{w}$ is
common choices: $|\boldsymbol{w}|^2_2, |\boldsymbol{w}|_1$, etc

$\lambda >0$ is the regularization coefficient

$\lambda = 0$, no regularization
$\lambda \rightarrow +\infty , \boldsymbol{w}\rightarrow R(\boldsymbol{w})$
i.e. control trade- off between training error and complexity

For $R(\boldsymbol{w})=|\boldsymbol{w}|^2_2$

\[\begin{aligned} \mathcal{E}(\boldsymbol{w}) & = \text{RSS}(\boldsymbol{w})+\lambda \|\boldsymbol{w}\|^2_2 \\ & = \|\pmb{\Phi w - y}\|^2_2 +\lambda \|w\| \\ \end{aligned}\] \[\begin{aligned} \nabla\mathcal{E}(\boldsymbol{w})=2(\pmb{\Phi^T\Phi w - \Phi^Ty}) + 2\lambda\boldsymbol{w} &= 0\\ (\boldsymbol{\Phi^T\Phi } + \lambda\boldsymbol{I})\boldsymbol{w} &= \boldsymbol{\Phi^Ty} \\ \boldsymbol{w}^*&=(\boldsymbol{\Phi^T\Phi } + \lambda\boldsymbol{I})^{-1}\boldsymbol{\Phi^Ty} \end{aligned}\]

Note the same form as in the ﬁx when $\boldsymbol{X}^T\boldsymbol{X}$ is not invertible!

For other regularizers, as long as it’s **, standard optimization algorithms can be applied.

Equivalent form

Regularization is also sometimes formulated as (optimization under constrain): $\arg\min_\boldsymbol{w} \text{RSS}(w) \text{ subject to } R(w)\le \beta$ where $\beta$ is some hyper-parameter. Choosing can be done by cross-validation.

Summary

Typical steps of develo p in g a machine learning system:

Collect data, split into training , development, and test sets.
Train a model with a machine learning algorithm. Most often we app y cross-validation to tune hyper- parameters.
Evaluate using the test data and report performance.
Use the model to predict future/make decisions.

General idea to derive ML algorithms

Pick a set of model $\mathcal{F}$
- \[\mathcal{F}=\{f(\boldsymbol{x})=\boldsymbol{w}^T\boldsymbol{x}|\boldsymbol{w}\in\mathbb{R}^D\}\]
- \[\mathcal{F}=\{f(\boldsymbol{x})=\boldsymbol{w}^T\Phi(\boldsymbol{x})|\boldsymbol{w}\in\mathbb{R}^D\}\]
Define error/loss $L(y’,y)$
Find empirical risk minimizer (ERM): $\boldsymbol{f}^* = \arg\max_{f\in\mathcal{F}}\sum^N_{n=1}=L(f(x_n),y_n)$

Or regularized empirical risk minimizer: $\boldsymbol{f}^* = \arg\max_{f\in\mathcal{F}}\sum^N_{n=1}=L(f(x_n),y_n)+\lambda R(f)$

Netside where knowledge is shared, ideas are spread.