# NYU Deep Learning – Week 5

- Gradient Descent
- Worst optimization method in the world

- Optimization problem
- minimize f(w) over w
- w_{k+1} = w_k - γ_k ∇f(w_k)   (γ_k is the step size)

- Assumes f is differentiable and continuous – not true for neural nets
- actually only sub-differentiable (e.g. ReLU has a kink at zero)
- "It should work; no theory to support this"
- Follow the direction of the negative gradient

- we look at the optimization landscape locally
- landscape = domain of all weights in the neural network
- find the best solution relative to where we are

- Consider a quadratic optimization problem
- positive definite case
- the error (distance from the solution) is multiplied by a matrix at each step
- which gives a per-step reduction of the distance by a factor of 1 - λ_min/λ_max
- λ_max / λ_min = condition number κ
- poorly conditioned – κ is very large; well conditioned – κ is small
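A minimal sketch of this behavior (the quadratic and its eigenvalues are assumed for illustration): gradient descent on f(w) = ½ wᵀAw with step size 1/λ_max contracts the error by roughly 1 - λ_min/λ_max per step.

```python
import numpy as np

# Gradient descent on f(w) = 0.5 * w^T A w (minimizer w* = 0).
# A is an assumed toy Hessian with eigenvalues 1 and 10 -> condition number 10.
A = np.diag([1.0, 10.0])
lam_max = 10.0
step = 1.0 / lam_max          # "safe" step size

w = np.array([1.0, 1.0])
for _ in range(100):
    w = w - step * (A @ w)    # gradient of 0.5 w^T A w is A w

# The lambda_max direction is killed immediately; the lambda_min direction
# only shrinks by a factor 1 - 1/10 = 0.9 per step.
print(np.linalg.norm(w))
```

The slow direction dominates: the worse the conditioning, the closer the per-step factor gets to 1 and the slower the convergence.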

- Step sizes
- we don't have a good a-priori estimate of the learning rate
- try a bunch of values on the log scale
- ideally choose an optimal step size
- we tend to choose the largest possible learning rate – at the edge of divergence
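A sketch of such a log-scale sweep on an assumed 1-D quadratic (the curvature value is made up for illustration); the largest non-diverging step sits at the edge of divergence:

```python
import numpy as np

# Log-scale learning-rate sweep on f(w) = 0.5 * a * w^2 (a is assumed).
a = 10.0

def final_loss(lr, steps=50):
    w = 1.0
    for _ in range(steps):
        w -= lr * a * w       # gradient step; diverges once lr > 2/a
    return 0.5 * a * w * w

for lr in np.logspace(-3, 0, num=4):    # 0.001, 0.01, 0.1, 1.0
    print(lr, final_loss(lr))
```

Here lr = 0.1 converges fastest while lr = 1.0 blows up, illustrating why candidates are spaced by orders of magnitude.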

- Stochastic optimization
- Actually used to train nets in practice
- Replace gradient with a stochastic approximation to the gradient
- gradient of the loss for a single instance, ∇f_i(w)
- instance i chosen uniformly at random
- (the full loss f is the sum of all the f_i)

- the expected value of the stochastic gradient is the full gradient
- useful to think of SGD as GD with noise
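A sketch of single-instance SGD on an assumed least-squares toy problem (data and seed are made up); each step uses the gradient of one uniformly sampled f_i, whose expectation is the full gradient:

```python
import numpy as np

# SGD on least squares: f(w) = sum_i 0.5 * (x_i . w - y_i)^2.
rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])   # assumed ground-truth weights
y = X @ w_true                        # noiseless targets

w = np.zeros(d)
lr = 0.05
for step in range(2000):
    i = rng.integers(n)                   # instance chosen uniformly at random
    grad_i = (X[i] @ w - y[i]) * X[i]     # gradient of the single f_i
    w -= lr * grad_i

print(w)   # close to w_true
```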

- Annealing
- neural network landscapes are bumpy
- in SGD, it is particularly the noise that helps it jump over these (bad) minima
- good minima are larger and harder to skip

- Also valuable because
- datasets contain a lot of redundancy
- SGD exploits this redundancy
- an SGD step can be thousands of times cheaper than a full-gradient step
- which makes it hard to justify full GD instead

- Minibatching
- use batches randomly chosen
- practical reasons are overwhelming
- much more efficient utilization of hardware
- e.g. ImageNet training uses batch sizes of 64

- distributed training
- "ImageNet in one hour"
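A sketch of mini-batch SGD on assumed toy least-squares data: the per-instance gradients are averaged over a randomly chosen batch instead of using a single instance.

```python
import numpy as np

# Mini-batch SGD on least squares (toy data and batch size assumed).
rng = np.random.default_rng(1)
n, d, batch_size = 512, 4, 64
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true

w = np.zeros(d)
lr = 0.1
for step in range(500):
    idx = rng.choice(n, size=batch_size, replace=False)   # random mini-batch
    residual = X[idx] @ w - y[idx]
    grad = X[idx].T @ residual / batch_size               # batch-averaged gradient
    w -= lr * grad

print(np.linalg.norm(w - w_true))
```

The batched matrix products are exactly the kind of operation that makes hardware utilization so much better than single-instance updates.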

- Full batch
- do not use gradient descent
- use L-BFGS instead
- it builds on 50 years of optimization research
- scipy has a bulletproof implementation
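For instance, scipy's L-BFGS-B can be pointed at a full-batch objective and its exact gradient (the quadratic below is an assumed toy problem):

```python
import numpy as np
from scipy.optimize import minimize

# Full-batch optimization with scipy's L-BFGS-B implementation.
A = np.diag([1.0, 10.0, 100.0])   # assumed positive definite toy Hessian
b = np.array([1.0, 2.0, 3.0])

f = lambda w: 0.5 * w @ A @ w - b @ w    # minimizer is A^{-1} b
grad = lambda w: A @ w - b               # exact (full-batch) gradient

res = minimize(f, x0=np.zeros(3), jac=grad, method="L-BFGS-B")
print(res.x)   # ≈ A^{-1} b = [1.0, 0.2, 0.03]
```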

- on CPU, the batch size is not as critical
- still, always try mini-batching
- Momentum
- trick to always use with SGD
- adds a momentum parameter β
- heavy ball form: w_{k+1} = w_k - γ_k ∇f(w_k) + β_k (w_k - w_{k-1})
- equivalently, update both p and w – damp the old momentum and add the gradient:
- p_{k+1} = β p_k + ∇f(w_k);  w_{k+1} = w_k - γ p_{k+1}
- p is an accumulated gradient buffer – past gradients are damped – a running sum of gradients
- the stochastic version just uses the stochastic gradient
- "Stochastic heavy ball method"
- momentum keeps pushing in the same direction instead of making dramatic changes each step
- small beta – can change direction more quickly; high beta makes it harder to turn
- high beta helps dampen oscillations
- beta = .9, .99 always works well
- momentum also effectively increases the step size (past gradients accumulate)
- the effective step size becomes γ · 1/(1 - β), so scale γ down accordingly
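A sketch of the accumulated-buffer momentum update on an assumed toy quadratic (β = 0.9 as the notes suggest; the step size accounts for the 1/(1 - β) factor):

```python
import numpy as np

# SGD-with-momentum update: p <- beta*p + grad ; w <- w - gamma*p
# on f(w) = 0.5 * w^T A w (toy Hessian assumed).
A = np.diag([1.0, 10.0])
w = np.array([1.0, 1.0])
p = np.zeros(2)
beta, gamma = 0.9, 0.01    # effective step ~ gamma / (1 - beta) = 0.1

for _ in range(500):
    grad = A @ w
    p = beta * p + grad    # damp the old momentum, add the new gradient
    w = w - gamma * p

print(np.linalg.norm(w))   # converges to the minimizer at 0
```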

- why it works
- acceleration contributes to performance

- Nesterov – did a lot of research
- Acceleration
- Noise smoothing
- momentum averages gradients
- this smoothing makes the iterates a good approximation to the solution
- reduces the bouncing

- SGD with momentum works well when the problem is well conditioned
- otherwise – on poorly conditioned problems – we turn to adaptive methods

- Adaptive methods
- maintain an estimate of a good learning rate separately for each weight
- lots of different ways to do this
- smaller learning rates for weights later in the network, larger in the early weights
- fairly hand-wavy

- RMSProp
- normalize each weight's update by the root mean square of its gradients
- v_{k+1} = α v_k + (1 - α) (∇f(w_k))²   (elementwise exponential moving average)
- w_{k+1} = w_k - γ ∇f(w_k) / (√(v_{k+1}) + ε)

- ADAM: Adaptive moment estimation
- m_{k+1} = β₁ m_k + (1 - β₁) ∇f(w_k)     (first moment / momentum)
- v_{k+1} = β₂ v_k + (1 - β₂) (∇f(w_k))²   (second moment / RMS term)
- w_{k+1} = w_k - γ m̂_{k+1} / (√(v̂_{k+1}) + ε), with bias-corrected m̂, v̂
- bias correction in full Adam increases the update during the early stages
- occasionally doesn't converge
- Poorly understood
- Has worse generalization error
- Small neural networks will have different results depending on initial values
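One possible sketch of the Adam step with bias correction, run on an assumed toy quadratic with the usual default betas:

```python
import numpy as np

# Adam update on f(w) = 0.5 * w^T A w (toy Hessian assumed).
A = np.diag([1.0, 100.0])              # deliberately poorly conditioned
w = np.array([1.0, 1.0])
m = np.zeros(2)
v = np.zeros(2)
beta1, beta2, gamma, eps = 0.9, 0.999, 0.01, 1e-8

for k in range(1, 3001):
    g = A @ w
    m = beta1 * m + (1 - beta1) * g        # first moment (momentum)
    v = beta2 * v + (1 - beta2) * g**2     # second moment (RMS term)
    m_hat = m / (1 - beta1**k)             # bias correction: boosts early steps
    v_hat = v / (1 - beta2**k)
    w = w - gamma * m_hat / (np.sqrt(v_hat) + eps)

print(np.linalg.norm(w))
```

Because of the per-coordinate normalization, both the well- and badly-scaled directions make similar progress; near the minimum Adam tends to hover at a scale set by γ rather than converging exactly.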

- Normalization layers
- Linear -> norm -> activation or
- Conv -> norm -> ReLu
- They don't make the network more powerful
- a whitening-style operation applied to the activations
- with additional parameters so the output can still take any range of values
- adds parameters to the layer: a learnable scale and bias term
- y = (a / σ) (x - μ) + b, where μ and σ are the mean and std of x
- often they reverse the parametrization
- a & b move slowly as they're learned
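The y = (a/σ)(x - μ) + b computation, applied per feature across an assumed random batch (batch-norm-style statistics; a and b at their usual initial values):

```python
import numpy as np

# Normalization layer forward pass on a toy batch (data assumed).
rng = np.random.default_rng(2)
x = rng.normal(loc=5.0, scale=3.0, size=(32, 4))   # batch of 32, 4 features

mean = x.mean(axis=0)        # statistics computed across the batch
std = x.std(axis=0)
a = np.ones(4)               # learnable scale, initialized to 1
b = np.zeros(4)              # learnable bias, initialized to 0

y = a * (x - mean) / (std + 1e-5) + b
print(y.mean(axis=0), y.std(axis=0))   # ≈ 0 and ≈ 1 per feature
```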

- Batch norm
- bizarre, but works very well
- normalize across batch
- estimates mean and stddev across all instances in a mini batch
- breaks all the theory of SGD
- layer, instance, and group norm are other norms that work
- group norm works even where batch norm struggles (e.g. very small batches)

- Why does normalization help?
- the network becomes easier to optimize, can use larger lrs
- adds noise, which helps with generalization
- makes weight initialization less important
- allows plugging together multiple layers with impunity
- allows for automated architecture search
- without normalization, stacking arbitrary layers typically resulted in a poorly conditioned network

- have to backpropagate through the calculation of the mean and stddev
- for batch/instance norm: the mean/std are fixed after training
- group/layer norm can still update the values at inference time

- Death of optimization
- try to use a big neural network to solve the optimization problem

- Practicum
- convolution output dimensions: a "valid" convolution of an input of size n with a kernel of size k gives n - k + 1 outputs along that dimension (by m along the other)
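The n - k + 1 rule can be checked with a 1-D "valid" convolution (toy input assumed):

```python
import numpy as np

# "Valid" convolution: input length n, kernel size k -> n - k + 1 outputs.
n, k = 10, 3
x = np.arange(n, dtype=float)
kernel = np.ones(k) / k            # simple averaging kernel

out = np.convolve(x, kernel, mode="valid")
print(len(out))   # 10 - 3 + 1 = 8
```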