Notes on Optimization


This post covers Optimization and Gradient Descent from cs231n.

Image Classification Task

  • Score Function
    • A parametrized score function maps raw image pixels to class scores
    • e.g. Linear Classifier $ f(x_i, W) = W * x_i $
  • Loss Function
    • Measures quality of parameters based on how well scores agreed with ground truth labels
    • $ Softmax~Loss = L_i = -\log \frac{e^{s_{y_i}} }{ \sum\limits_{j} e^{s_j} } $
    • $ SVM~Loss = L_i = \sum\limits_{j \neq y_i} \max(0, s_j + 1 - s_{y_i}) $
    • $ Full~Loss = L = \left[ \frac{1}{N} \sum\limits_{i=1}^{N} L_i \right] + \lambda R(W) $
  • Optimization
    • Finding the set of parameters $ W $ that minimizes the loss function
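As a concrete sketch of the score and loss functions above (a minimal NumPy version; the shapes and the $+1$ margin follow the formulas above, while all function names are illustrative):

```python
import numpy as np

def scores(W, x):
    """Linear score function f(x_i, W) = W * x_i (bias folded into W)."""
    return W.dot(x)

def svm_loss_i(s, y_i, delta=1.0):
    """Multiclass SVM loss for one example: sum_{j != y_i} max(0, s_j + delta - s_{y_i})."""
    margins = np.maximum(0, s - s[y_i] + delta)
    margins[y_i] = 0  # the sum skips j == y_i
    return margins.sum()

def softmax_loss_i(s, y_i):
    """Softmax loss for one example: -log(e^{s_{y_i}} / sum_j e^{s_j})."""
    s = s - s.max()  # shift scores for numerical stability; the result is unchanged
    return -np.log(np.exp(s[y_i]) / np.exp(s).sum())

W = np.random.randn(10, 3073) * 0.001  # CIFAR-10: 10 classes, 3072 pixels + bias
x = np.random.randn(3073)              # one flattened image (with bias dimension)
s = scores(W, x)
svm_loss_i(s, y_i=3), softmax_loss_i(s, y_i=3)
```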


  • The loss function is defined over a very high-dimensional space, e.g. for a linear classifier on CIFAR-10 with $32 \times 32 \times 3 = 3072$ pixel inputs (plus a bias), the weight matrix is $ (10,~3073) $, so we have 30,730 parameters
  • Random Search
    • Try many different random weights and keep track of what works best (lowest loss over the training data)
    • Iterative Refinement
      • Start with random weights and iteratively refine them over time to get lower loss
    • Blindfolded Hiker Strategy
      • In the case of CIFAR-10, the hills are 30,730-dimensional; at every point on the hill we incur a loss (the height of the terrain).
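A minimal sketch of this random-search strategy (the toy `loss_fn` and all names here are illustrative, not from the course code):

```python
import numpy as np

def random_search(loss_fn, shape, tries=1000, scale=0.001):
    """Try fully random weight matrices and keep the best one seen so far."""
    best_loss, best_W = float('inf'), None
    for _ in range(tries):
        W = np.random.randn(*shape) * scale  # a fresh random guess each time
        loss = loss_fn(W)
        if loss < best_loss:
            best_loss, best_W = loss, W
    return best_W, best_loss

# toy stand-in for the CIFAR-10 loss: just the squared norm of W
best_W, best_loss = random_search(lambda W: np.sum(W ** 2), (10, 3073))
```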
  • Random Local Search

    • Start with a random $W$, generate random perturbations $\delta W$ (scaled by e.g. $0.001$), and update $W$ to $W + \delta W$ only if the loss is lower.
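The same idea in code (a sketch, again with a toy quadratic loss standing in for the full CIFAR-10 loss):

```python
import numpy as np

def random_local_search(loss_fn, W, steps=1000, step_size=0.001):
    """Perturb W randomly; keep the perturbation only if it lowers the loss."""
    best_loss = loss_fn(W)
    for _ in range(steps):
        W_try = W + np.random.randn(*W.shape) * step_size  # W + delta W
        loss = loss_fn(W_try)
        if loss < best_loss:  # accept the step only if it helps
            W, best_loss = W_try, loss
    return W, best_loss

W, best_loss = random_local_search(lambda W: np.sum(W ** 2), np.random.randn(10, 3073))
```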
  • Follow the Gradient
  • We need updates to $W$ that reduce the loss; the best update direction is related to the gradient of the loss function
    • In 1-d functions, slope is instantaneous rate of change and is a scalar quantity. Gradient is a generalization of slope for multi variable function and is a vector quantity.
      • The gradient, denoted $\nabla f$ for $f(x, y, \ldots)$, is $[\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}, \ldots]$
      • If we are at a point $(x_0, y_0, \ldots)$ in the input space of $f$, the vector $\nabla f(x_0, y_0, \ldots)$ gives the direction that increases the value of $f$ most rapidly.
  • Compute Gradient

    • Slow, easy, and approximate - Numerical Gradient
    • Fast, exact, error prone - Analytic Gradient
    • Numeric Gradient with Finite Differences
      • Iterate over all dimensions one by one, make small change $h$ along that dimension, compute partial derivative of loss function along that dimension
      • From Wikipedia: a simple two-point estimate is the slope of a nearby secant line (a line that cuts the curve at a minimum of two points) through the points $(x, f(x))$ and $(x+h, f(x+h))$. The slope of this line is $\frac{f(x+h)-f(x)}{h}$, known as Newton's difference quotient or the first-order divided difference.
      • The slope of the secant differs from that of the tangent by an amount depending on $h$. Thus $f'(x) = \lim_{h \to 0} \frac{f(x+h)-f(x)}{h}$
      • Equivalently, $f'(x) = \lim_{h \to 0} \frac{f(x)-f(x-h)}{h}$
      • The two-point formula $f'(x) = \lim_{h \to 0} \frac{f(x+h)-f(x-h)}{2h}$ is known as the symmetric difference quotient
        • The slope of these secant lines differs from the slope of the tangent line by an amount proportional to $h^2$, so for small values of $h$ this is more accurate.
      • $ W_{new} = W - step\_size * \nabla_W L $
        • Update in the negative gradient direction to reduce the loss
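The finite-difference procedure above can be sketched as follows, using the more accurate centered formula (a minimal version; `f` here is a toy function whose analytic gradient, $2x$, we know):

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Centered difference per dimension: (f(x+h) - f(x-h)) / (2h)."""
    grad = np.zeros_like(x)
    for i in range(x.size):          # iterate over all dimensions one by one
        old = x.flat[i]
        x.flat[i] = old + h
        fxph = f(x)                  # f evaluated at x + h along dimension i
        x.flat[i] = old - h
        fxmh = f(x)                  # f evaluated at x - h along dimension i
        x.flat[i] = old              # restore the original value
        grad.flat[i] = (fxph - fxmh) / (2 * h)
    return grad

f = lambda x: np.sum(x ** 2)         # analytic gradient is 2x
x = np.array([1.0, -2.0, 3.0])
grad = numerical_gradient(f, x)      # close to [2, -4, 6]
x_new = x - 0.1 * grad               # step in the negative gradient direction
```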
  • Computing Gradient Analytically with Calculus
    • The numeric gradient is simple to compute, but it is approximate (it depends on the choice of $h$) and computationally expensive (one evaluation of the loss per dimension)
    • Calculus allows us to compute the gradient with a direct formula, which is fast and exact, but it can be error prone to implement
    • Thus, in practice we compute the analytic gradient and compare it with the numeric gradient to verify the implementation, a procedure called a gradient check
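A gradient check might look like this sketch (the relative-error formula, threshold, and toy function are illustrative choices, not prescribed by the notes):

```python
import numpy as np

def grad_check(f, analytic_grad, x, h=1e-5):
    """Compare an analytic gradient against a centered-difference numeric one."""
    max_rel_error = 0.0
    for i in range(x.size):
        old = x.flat[i]
        x.flat[i] = old + h
        fp = f(x)
        x.flat[i] = old - h
        fm = f(x)
        x.flat[i] = old  # restore
        numeric = (fp - fm) / (2 * h)
        analytic = analytic_grad.flat[i]
        rel = abs(numeric - analytic) / max(abs(numeric) + abs(analytic), 1e-12)
        max_rel_error = max(max_rel_error, rel)
    return max_rel_error  # a very small value suggests the analytic gradient is correct

f = lambda x: np.sum(x ** 3)                 # analytic gradient is 3x^2
x = np.array([1.0, 2.0, -1.5])
err = grad_check(f, 3 * x ** 2, x)
```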

Gradient Descent

  • Process of repeatedly computing gradient of loss function and performing parameter update

  • # Vanilla Gradient Descent
    while True:
    	weights_grad = eval_grad(loss_fn, data, weights)
    	weights += - step_size * weights_grad
  • Mini-batch Gradient Descent

    • Compute gradient over batches of the training data

    • # Vanilla Minibatch Gradient Descent
      while True:
      	data_batch = sample_training(data, size)
      	weights_grad = eval_grad(loss_fn, data_batch, weights)
      	weights += - step_size * weights_grad
    • This works well because training examples are correlated: the gradient computed from a mini-batch is a good approximation of the gradient of the full objective function.

    • Much faster convergence can be achieved in practice, since mini-batches allow much more frequent parameter updates
  • Stochastic Gradient Descent

    • The extreme case where the mini-batch contains only a single example
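As a sketch, SGD with a batch of one on a toy 1-D regression problem (the data, model, and step size are all illustrative; the true weight is 3):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=100)
y = 3.0 * X + rng.normal(scale=0.1, size=100)  # noisy targets, true weight 3

w, step_size = 0.0, 0.05
for epoch in range(20):
    for i in rng.permutation(100):           # one training example per update
        grad = 2 * (w * X[i] - y[i]) * X[i]  # d/dw of the squared error (w*x - y)^2
        w -= step_size * grad                # update in the negative gradient direction
```

After a few passes over the shuffled data, `w` settles near the true weight.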