Notes on Optimization
This post covers Optimization and Gradient Descent from cs231n.
Image Classification Task
- Score Function
- A parametrized score function maps raw image pixels to class scores
- e.g. Linear Classifier $ f(x_i, W) = W * x_i $
- Loss Function
- Measures the quality of a set of parameters based on how well the predicted scores agree with the ground-truth labels (a sketch computing these losses follows this list)
- $ Softmax~Loss = L_i = -\log \frac{e^{s_{y_i}} }{ \sum\limits_{j} e^{s_j} } $
- $ SVM~Loss = L_i = \sum\limits_{j \neq y_i} \max(0, s_j + 1 - s_{y_i}) $
- $Full~Loss = L = \left[ \frac{1}{N} \sum\limits_{i=1}^{N} L_i \right] + \lambda * R(W)$
- Optimization
- Finding the set of parameters $ W $ that minimizes the loss function
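To make the definitions above concrete, here is a minimal NumPy sketch (not from the original notes; shapes and names are illustrative) that computes the SVM and softmax losses for a single example under the linear score function:

```python
import numpy as np

def svm_loss_i(scores, y_i):
    # Multiclass SVM loss for one example: hinge on every incorrect class
    margins = np.maximum(0, scores - scores[y_i] + 1)
    margins[y_i] = 0                      # the correct class contributes no loss
    return np.sum(margins)

def softmax_loss_i(scores, y_i):
    # Softmax (cross-entropy) loss; scores are shifted for numerical stability
    shifted = scores - np.max(scores)
    return -shifted[y_i] + np.log(np.sum(np.exp(shifted)))

W = np.random.randn(10, 3073) * 0.001     # toy CIFAR-10-shaped weights (bias appended)
x_i = np.random.randn(3073)               # one flattened image with a bias dimension
scores = W.dot(x_i)                       # f(x_i, W) = W * x_i
print(svm_loss_i(scores, 3), softmax_loss_i(scores, 3))
```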
Optimization
- The loss function is defined over a very high-dimensional space. e.g. for the CIFAR-10 linear classifier, each image has $32 \times 32 \times 3 = 3072$ pixels, so the weight matrix $W$ (with an appended bias column) has shape $(10,~3073)$, i.e. 30,730 parameters
- Random Search
- Try many different random weights, keep track of which gives the lowest loss on the training data, and finally evaluate the best $W$ on the test set (see the sketch after this list)
- Iterative Refinement
- Start with random weights and iteratively refine them over time to get lower loss
- Blindfolded Hiker Strategy
- In the case of CIFAR-10, the hills are 30,730-dimensional; at every point on the hill we have a loss (the height of the terrain)
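A rough sketch of random search, assuming a hypothetical `loss_fn(X, y, W)` helper (a stand-in for the full training loss $L$) and toy stand-in data:

```python
import numpy as np

def loss_fn(X, y, W):
    # Stand-in for the full training loss L (here: mean multiclass SVM loss)
    scores = X.dot(W.T)                                 # (N, 10) class scores
    correct = scores[np.arange(len(y)), y][:, None]     # score of the true class
    margins = np.maximum(0, scores - correct + 1)
    margins[np.arange(len(y)), y] = 0                   # zero out the correct class
    return margins.sum() / len(y)

X_train = np.random.randn(50, 3073)                     # toy stand-in for CIFAR-10
y_train = np.random.randint(10, size=50)

best_loss, best_W = float('inf'), None
for _ in range(1000):
    W = np.random.randn(10, 3073) * 0.0001              # guess a random weight matrix
    loss = loss_fn(X_train, y_train, W)
    if loss < best_loss:                                # keep the best W seen so far
        best_loss, best_W = loss, W
```

The best $W$ found this way can then be evaluated on the test set.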
Random Local Search
- Start with a random $W$, generate random perturbations $\delta W$, and update $W$ to $W + \delta W$ only if the loss decreases (the perturbation scale may be e.g. $0.001$)
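A corresponding sketch of random local search, reusing the hypothetical `loss_fn`, `X_train`, and `y_train` from the sketch above:

```python
# Perturb the current W and keep the step only if it lowers the loss
W = np.random.randn(10, 3073) * 0.001
best_loss = loss_fn(X_train, y_train, W)
for _ in range(1000):
    step_size = 0.001                                   # scale of the perturbation
    W_try = W + np.random.randn(10, 3073) * step_size   # W + delta W
    loss = loss_fn(X_train, y_train, W_try)
    if loss < best_loss:
        W, best_loss = W_try, loss
```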
- Follow the Gradient
- We want an update to $W$ that reduces the loss; the update direction is related to the gradient of the loss function
- In 1-d functions, the slope is the instantaneous rate of change and is a scalar quantity. The gradient generalizes the slope to multivariable functions and is a vector quantity.
- The gradient of $f(x, y, \ldots)$, denoted $\nabla f$, is $\left[\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}, \ldots\right]$
- If we are at a point $(x_0, y_0, \ldots)$ in the input space of $f$, the vector $\nabla f(x_0, y_0, \ldots)$ gives the direction that increases the value of $f$ most rapidly.
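- For example, for $f(x, y) = x^2 + y^2$ the gradient is $\nabla f = [2x,~2y]$; at the point $(1, 2)$ it is $[2, 4]$, the direction of steepest ascent, so stepping along $-[2, 4]$ decreases $f$ fastest.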
Compute Gradient
- Slow, easy, and approximate - Numerical Gradient
- Fast, exact, error prone - Analytic Gradient
- Numeric Gradient with Finite Differences
- Iterate over all dimensions one by one, make a small change $h$ along that dimension, and compute the partial derivative of the loss function along that dimension (a sketch follows this list)
- From Wikipedia: a simple two-point estimate is the slope of a nearby secant line (a line that cuts the curve in at least two points) through the points $(x, f(x))$ and $(x+h, f(x+h))$. The slope of this line is $\frac{f(x+h)-f(x)}{h}$, known as Newton's difference quotient or the first-order divided difference.
- The slope of the secant differs from the slope of the tangent by an amount that depends on $h$. In the limit, $f'(x) = \lim_{h \to 0} \frac{f(x+h)-f(x)}{h}$
- Equivalently, using the backward difference: $f'(x) = \lim_{h \to 0} \frac{f(x)-f(x-h)}{h}$
- The two-point formula $f'(x) = \lim_{h \to 0} \frac{f(x+h)-f(x-h)}{2h}$ is known as the symmetric difference quotient
- The slope of this secant line differs from the slope of the tangent line by an amount proportional to $h^2$, so for small values of $h$ it is more accurate than the one-sided forms
- $ W_{new} = W - step\_size * \nabla_W L $
- Update in negative direction to reduce loss
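A minimal sketch of the centered-difference numerical gradient (the helper name and toy loss are illustrative; `f` is any scalar-valued function of an array):

```python
import numpy as np

def eval_numerical_gradient(f, x, h=1e-5):
    # Centered-difference estimate of the gradient of f at x.
    # Perturbs each dimension of x in place and restores it afterwards.
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        ix = it.multi_index
        old = x[ix]
        x[ix] = old + h
        fxph = f(x)                       # f(x + h) along dimension ix
        x[ix] = old - h
        fxmh = f(x)                       # f(x - h) along dimension ix
        x[ix] = old                       # restore the original value
        grad[ix] = (fxph - fxmh) / (2 * h)
        it.iternext()
    return grad

f = lambda w: np.sum(w ** 2)              # toy loss whose true gradient is 2w
w = np.random.randn(3, 4)
print(eval_numerical_gradient(f, w))      # approximately 2 * w
```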
- Computing Gradient Analytically with Calculus
- The numeric gradient is simple to compute, but it is approximate (it depends on the choice of $h$) and computationally expensive (one loss evaluation per dimension)
- Calculus allows us to compute the gradient from a direct formula, which is fast and exact, but the formula can be error prone to implement
- Thus, in practice we compute the analytic gradient and compare it against the numeric gradient to verify the implementation; this is called a gradient check (sketched below)
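A gradient check can be sketched as a relative-error comparison between the two gradients, reusing `eval_numerical_gradient`, `f`, and `w` from the sketch above (as a rough rule of thumb, relative errors around $10^{-7}$ or less usually indicate a correct analytic gradient):

```python
def relative_error(g_analytic, g_numeric):
    # Max elementwise relative error between the analytic and numeric gradients
    num = np.abs(g_analytic - g_numeric)
    den = np.maximum(np.abs(g_analytic) + np.abs(g_numeric), 1e-8)
    return np.max(num / den)

grad_analytic = 2 * w                                   # analytic gradient of sum(w**2)
grad_numeric = eval_numerical_gradient(f, w)            # numeric gradient from above
print(relative_error(grad_analytic, grad_numeric))      # should be tiny
```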
Gradient Descent
The process of repeatedly computing the gradient of the loss function and then performing a parameter update
```python
# Vanilla Gradient Descent
while True:
    weights_grad = eval_grad(loss_fn, data, weights)   # gradient of the loss w.r.t. weights
    weights += -step_size * weights_grad               # parameter update
```
Mini-batch Gradient Descent
Compute gradient over batches of the training data
```python
# Vanilla Mini-batch Gradient Descent
while True:
    data_batch = sample_training(data, batch_size)     # e.g. 256 examples
    weights_grad = eval_grad(loss_fn, data_batch, weights)
    weights += -step_size * weights_grad               # parameter update
```
This works well because examples in the training data are correlated: the gradient computed from a mini-batch is a good approximation of the gradient of the full objective function.
- Much faster convergence can be achieved in practice by evaluating mini-batch gradients and performing more frequent parameter updates
Stochastic Gradient Descent
- The extreme case where the mini-batch contains only one example
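- Strictly, stochastic gradient descent performs an update per single example (a mini-batch of size one), though in practice the term SGD is often used even when referring to mini-batch gradient descent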