Notes on Optimization
This post covers Optimization and Gradient Descent from cs231n.
Image Classification Task
- Score Function
- A parametrized score function maps raw image pixels to class scores
- e.g. Linear Classifier $ f(x_i, W) = W * x_i $
- Loss Function
- Measures the quality of a set of parameters based on how well the predicted scores agree with the ground-truth labels (a sketch computing these losses follows this list)
- $ Softmax~Loss = L_i = -\log \frac{e^{s_{y_i}} }{ \sum\limits_{j} e^{s_j} } $
- $ SVM~Loss = L_i = \sum\limits_{j \neq y_i} \max(0, s_j + 1 - s_{y_i}) $
- $Full~Loss = L = \left[ \frac{1}{N} \sum\limits_{i=1}^{N} L_i \right] + \lambda * R(W)$
- Optimization
- Finding the set of parameters $ W $ that minimizes the loss function
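To make the definitions above concrete, here is a minimal NumPy sketch (not from the original notes; shapes and names are illustrative) that computes the SVM and softmax losses for a single example under the linear score function:

```python
import numpy as np

def svm_loss_i(scores, y_i):
    # Multiclass SVM loss for one example: hinge on every incorrect class
    margins = np.maximum(0, scores - scores[y_i] + 1)
    margins[y_i] = 0                      # the correct class contributes no loss
    return np.sum(margins)

def softmax_loss_i(scores, y_i):
    # Softmax (cross-entropy) loss; scores are shifted for numerical stability
    shifted = scores - np.max(scores)
    return -shifted[y_i] + np.log(np.sum(np.exp(shifted)))

W = np.random.randn(10, 3073) * 0.001     # toy CIFAR-10-shaped weights (bias appended)
x_i = np.random.randn(3073)               # one flattened image with a bias dimension
scores = W.dot(x_i)                       # f(x_i, W) = W * x_i
print(svm_loss_i(scores, 3), softmax_loss_i(scores, 3))
```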
Optimization
- The loss function is defined over a very high-dimensional space. e.g. for the CIFAR-10 linear classifier, each image has $32 \times 32 \times 3 = 3072$ pixels, so the weight matrix $W$ (with an appended bias column) has shape $(10,~3073)$, i.e. 30,730 parameters
- Random Search
- Try many different random weights, keep track of which gives the lowest loss on the training data, and finally evaluate the best $W$ on the test set (see the sketch after this list)
- Iterative Refinement
- Start with random weights and iteratively refine them over time to get lower loss
- Blindfolded Hiker Strategy
- In the case of CIFAR-10, the hills are 30,730-dimensional; at every point on the hill we have a loss (the height of the terrain)
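A rough sketch of random search, assuming a hypothetical `loss_fn(X, y, W)` helper (a stand-in for the full training loss $L$) and toy stand-in data:

```python
import numpy as np

def loss_fn(X, y, W):
    # Stand-in for the full training loss L (here: mean multiclass SVM loss)
    scores = X.dot(W.T)                                 # (N, 10) class scores
    correct = scores[np.arange(len(y)), y][:, None]     # score of the true class
    margins = np.maximum(0, scores - correct + 1)
    margins[np.arange(len(y)), y] = 0                   # zero out the correct class
    return margins.sum() / len(y)

X_train = np.random.randn(50, 3073)                     # toy stand-in for CIFAR-10
y_train = np.random.randint(10, size=50)

best_loss, best_W = float('inf'), None
for _ in range(1000):
    W = np.random.randn(10, 3073) * 0.0001              # guess a random weight matrix
    loss = loss_fn(X_train, y_train, W)
    if loss < best_loss:                                # keep the best W seen so far
        best_loss, best_W = loss, W
```

The best $W$ found this way can then be evaluated on the test set.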
Random Local Search
- Start with a random $W$, generate random perturbations $\delta W$, and update $W$ to $W + \delta W$ only if the loss decreases (the perturbation scale may be e.g. $0.001$)
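A corresponding sketch of random local search, reusing the hypothetical `loss_fn`, `X_train`, and `y_train` from the sketch above:

```python
# Perturb the current W and keep the step only if it lowers the loss
W = np.random.randn(10, 3073) * 0.001
best_loss = loss_fn(X_train, y_train, W)
for _ in range(1000):
    step_size = 0.001                                   # scale of the perturbation
    W_try = W + np.random.randn(10, 3073) * step_size   # W + delta W
    loss = loss_fn(X_train, y_train, W_try)
    if loss < best_loss:
        W, best_loss = W_try, loss
```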
- Follow the Gradient
- We want an update to $W$ that reduces the loss; the update direction is related to the gradient of the loss function
- In 1-d functions, the slope is the instantaneous rate of change and is a scalar quantity. The gradient generalizes the slope to multivariable functions and is a vector quantity.
- The gradient of $f(x, y, \ldots)$, denoted $\nabla f$, is $\left[\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}, \ldots\right]$
- If we are at a point $(x_0, y_0, \ldots)$ in the input space of $f$, the vector $\nabla f(x_0, y_0, \ldots)$ gives the direction that increases the value of $f$ most rapidly.
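- For example, for $f(x, y) = x^2 + y^2$ the gradient is $\nabla f = [2x,~2y]$; at the point $(1, 2)$ it is $[2, 4]$, the direction of steepest ascent, so stepping along $-[2, 4]$ decreases $f$ fastest.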
Compute Gradient
- Slow, easy, and approximate - Numerical Gradient
- Fast, exact, error prone - Analytic Gradient
- Numeric Gradient with Finite Differences
- Iterate over all dimensions one by one, make a small change $h$ along that dimension, and compute the partial derivative of the loss function along that dimension (a sketch follows this list)
- From Wikipedia: a simple two-point estimate is the slope of a nearby secant line (a line that cuts the curve in at least two points) through the points $(x, f(x))$ and $(x+h, f(x+h))$. The slope of this line is $\frac{f(x+h)-f(x)}{h}$, known as Newton's difference quotient or the first-order divided difference.
- The slope of the secant differs from the slope of the tangent by an amount that depends on $h$. In the limit, $f'(x) = \lim_{h \to 0} \frac{f(x+h)-f(x)}{h}$
- Equivalently, using the backward difference: $f'(x) = \lim_{h \to 0} \frac{f(x)-f(x-h)}{h}$
- The two-point formula $f'(x) = \lim_{h \to 0} \frac{f(x+h)-f(x-h)}{2h}$ is known as the symmetric difference quotient
- The slope of this secant line differs from the slope of the tangent line by an amount proportional to $h^2$, so for small values of $h$ it is more accurate than the one-sided forms
- $ W_{new} = W - step\_size * \nabla_W L $
- Update in negative direction to reduce loss
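A minimal sketch of the centered-difference numerical gradient (the helper name and toy loss are illustrative; `f` is any scalar-valued function of an array):

```python
import numpy as np

def eval_numerical_gradient(f, x, h=1e-5):
    # Centered-difference estimate of the gradient of f at x.
    # Perturbs each dimension of x in place and restores it afterwards.
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        ix = it.multi_index
        old = x[ix]
        x[ix] = old + h
        fxph = f(x)                       # f(x + h) along dimension ix
        x[ix] = old - h
        fxmh = f(x)                       # f(x - h) along dimension ix
        x[ix] = old                       # restore the original value
        grad[ix] = (fxph - fxmh) / (2 * h)
        it.iternext()
    return grad

f = lambda w: np.sum(w ** 2)              # toy loss whose true gradient is 2w
w = np.random.randn(3, 4)
print(eval_numerical_gradient(f, w))      # approximately 2 * w
```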
- Computing Gradient Analytically with Calculus
- The numeric gradient is simple to compute, but it is approximate (it depends on the choice of $h$) and computationally expensive (one loss evaluation per dimension)
- Calculus allows us to compute the gradient from a direct formula, which is fast and exact, but the formula can be error prone to implement
- Thus, in practice we compute the analytic gradient and compare it against the numeric gradient to verify the implementation; this is called a gradient check (sketched below)
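A gradient check can be sketched as a relative-error comparison between the two gradients, reusing `eval_numerical_gradient`, `f`, and `w` from the sketch above (as a rough rule of thumb, relative errors around $10^{-7}$ or less usually indicate a correct analytic gradient):

```python
def relative_error(g_analytic, g_numeric):
    # Max elementwise relative error between the analytic and numeric gradients
    num = np.abs(g_analytic - g_numeric)
    den = np.maximum(np.abs(g_analytic) + np.abs(g_numeric), 1e-8)
    return np.max(num / den)

grad_analytic = 2 * w                                   # analytic gradient of sum(w**2)
grad_numeric = eval_numerical_gradient(f, w)            # numeric gradient from above
print(relative_error(grad_analytic, grad_numeric))      # should be tiny
```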
Gradient Descent
The process of repeatedly computing the gradient of the loss function and then performing a parameter update
```python
# Vanilla Gradient Descent
while True:
    weights_grad = eval_grad(loss_fn, data, weights)   # gradient of the loss w.r.t. weights
    weights += -step_size * weights_grad               # parameter update
```
Mini-batch Gradient Descent
Compute gradient over batches of the training data
```python
# Vanilla Mini-batch Gradient Descent
while True:
    data_batch = sample_training(data, batch_size)     # e.g. 256 examples
    weights_grad = eval_grad(loss_fn, data_batch, weights)
    weights += -step_size * weights_grad               # parameter update
```
This works well because examples in the training data are correlated: the gradient computed from a mini-batch is a good approximation of the gradient of the full objective function.
- Much faster convergence can be achieved in practice by evaluating mini-batch gradients and performing more frequent parameter updates
Stochastic Gradient Descent
- The extreme case where the mini-batch contains only one example
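- Strictly, stochastic gradient descent performs an update per single example (a mini-batch of size one), though in practice the term SGD is often used even when referring to mini-batch gradient descent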