# Notes on Optimization

This post covers Optimization and Gradient Descent from cs231n.

• Score Function
• A parametrized score function maps raw image pixels to class scores
• e.g. Linear Classifier $f(x_i, W) = W x_i$
• Loss Function
• Measures the quality of the parameters based on how well the computed scores agree with the ground truth labels (see the sketch after this list)
• $Softmax~Loss = L_i = -\log \frac{e^{s_{y_i}}}{\sum\limits_{j} e^{s_j}}$
• $SVM~Loss = L_i = \sum\limits_{j \neq y_i} \max(0, s_j - s_{y_i} + 1)$
• $Full~Loss = L = \left[ \frac{1}{N} \sum\limits_{i=1}^{N} L_i \right] + \lambda R(W)$
• Optimization
• Finding the set of parameters $W$ that minimizes the loss function
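
As a concrete check of the two loss formulas above, here is a minimal NumPy sketch; the scores and label are toy values chosen for illustration:

```python
import numpy as np

# Toy scores for one image over 3 classes; correct class index y = 0.
# These numbers are illustrative, not from a real classifier.
s = np.array([3.2, 5.1, -1.7])
y = 0

# Softmax loss: negative log of the normalized probability of the correct class
softmax_loss = -np.log(np.exp(s[y]) / np.sum(np.exp(s)))

# SVM loss: hinge on the margin of every incorrect class
margins = np.maximum(0, s - s[y] + 1)
margins[y] = 0
svm_loss = np.sum(margins)

print(softmax_loss, svm_loss)  # ~2.04 and 2.9
```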

# Optimization

• The loss function is defined over a very high-dimensional space, e.g. for a linear classifier on CIFAR-10 ($32 \times 32 \times 3 = 3072$ pixels, plus a bias) the weight matrix has shape $(10,~3073)$, giving 30,730 parameters
• Random Search
• Try different random weights and keep track of the best, based on the loss over the training set
• Iterative Refinement
• Start with random weights and iteratively refine them over time to get lower loss
• Blindfolded Hiker Strategy
• In the case of CIFAR-10, the hills are 30,730-dimensional, and at every point on the hill we have a loss (the height of the terrain)
• Random Local Search (see the sketch after this list)
• Start with a random $W$, generate random perturbations $\delta W$, and update $W$ to $W + \delta W$ only if the loss gets lower (the perturbation scale $\delta$ may be e.g. $0.001$)
• We need an update to $W$ that reduces the loss; the direction of this update is related to the gradient of the loss function
• In 1-d functions, the slope is the instantaneous rate of change and is a scalar quantity. The gradient generalizes the slope to multi-variable functions and is a vector quantity.
• The gradient, denoted $\nabla f$ for $f(x, y, \dots)$, is $\left[ \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}, \dots \right]$
• If we are at a point $(x_0, y_0, \dots)$ in the input space of $f$, the vector $\nabla f(x_0, y_0, \dots)$ gives the direction that increases the value of $f$ most rapidly.
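
A minimal sketch of random local search under the CIFAR-10 shapes above; `L`, `X`, and `Y` are hypothetical stand-ins for the loss function and the training data:

```python
import numpy as np

# Hypothetical helpers: L(X, Y, W) evaluates the full loss over training data X, Y.
W = np.random.randn(10, 3073) * 0.001  # random starting point
best_loss = L(X, Y, W)
for i in range(1000):
    dW = np.random.randn(10, 3073) * 0.001  # random perturbation
    loss = L(X, Y, W + dW)
    if loss < best_loss:  # accept the step only if the loss gets lower
        W, best_loss = W + dW, loss
```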

• Slow, easy, and approximate - Numerical Gradient
• Fast, exact, but error-prone - Analytic Gradient
• Numeric Gradient with Finite Differences
• Iterate over all dimensions one by one, make a small change $h$ along that dimension, and compute the partial derivative of the loss function along that dimension (see the sketch after this list)
• From Wikipedia: a simple two-point estimation is to compute the slope of a nearby secant line (a line that cuts the curve in at least two points) through the points $(x, f(x))$ and $(x+h, f(x+h))$. The slope of this line is $\frac{f(x+h)-f(x)}{h}$, known as Newton's difference quotient or the first-order divided difference.
• The slope of the secant differs from the slope of the tangent by an amount that depends on $h$. Thus $f'(x) = \lim_{h \to 0} \frac{f(x+h)-f(x)}{h}$
• Equivalently, $f'(x) = \lim_{h \to 0} \frac{f(x)-f(x-h)}{h}$
• The two-point formula $f'(x) = \lim_{h \to 0} \frac{f(x+h)-f(x-h)}{2h}$ is known as the symmetric difference quotient
• The slope of this secant line differs from the slope of the tangent line by an amount proportional to $h^2$, so for small values of $h$ it is more accurate than the one-sided quotients
• $W_{new} = W - \text{step\_size} \cdot \nabla_W L$
• Update in the negative gradient direction to reduce the loss
• Computing Gradient Analytically with Calculus
• The numeric gradient is simple to compute, but it is approximate (it depends on the choice of $h$) and computationally expensive (one evaluation of the loss per dimension)
• Calculus allows us to compute the gradient with a direct formula, which is fast and exact, but it can be error-prone to implement; in practice it is common to verify it against the numerical gradient (a gradient check)
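
A minimal sketch of the numerical gradient with finite differences, using the symmetric difference quotient from above (the helper name and the choice $h = 10^{-5}$ are illustrative):

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Centered-difference estimate of the gradient of f at x (a NumPy array)."""
    grad = np.zeros_like(x)
    for i in range(x.size):          # iterate over all dimensions one by one
        old = x.flat[i]
        x.flat[i] = old + h
        f_plus = f(x)                # f(..., x_i + h, ...)
        x.flat[i] = old - h
        f_minus = f(x)               # f(..., x_i - h, ...)
        x.flat[i] = old              # restore the original value
        grad.flat[i] = (f_plus - f_minus) / (2 * h)  # symmetric difference quotient
    return grad

# Example: the gradient of f(x) = sum(x^2) is 2x
print(numerical_gradient(lambda x: np.sum(x**2), np.array([1.0, -2.0, 3.0])))
# ~ [ 2. -4.  6.]
```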

• Gradient Descent: the process of repeatedly computing the gradient of the loss function and performing a parameter update, as in the loop below

```python
# Vanilla Gradient Descent
while True:
    weights_grad = evaluate_gradient(loss_fun, data, weights)
    weights += -step_size * weights_grad  # parameter update
```


• Mini-batch Gradient Descent: compute the gradient over batches of the training data rather than over the full set

```python
# Vanilla Minibatch Gradient Descent
while True:
    data_batch = sample_training_data(data, 256)  # sample a mini-batch of 256 examples
    weights += -step_size * evaluate_gradient(loss_fun, data_batch, weights)
```
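
• In practice, the batch size is a hyperparameter; powers of 2 (e.g. 32, 64, 128, 256) are common because many vectorized operations run faster with power-of-2 inputs, and the gradient over a mini-batch is a good approximation of the gradient of the full objective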