# 8 Optimization for Training Deep Models

This chapter focuses on one particular case of optimization: finding the parameters $\theta$ of a neural network that significantly reduce a cost function $J(\theta)$, which typically includes a performance measure evaluated on the entire training set as well as additional regularization terms.

## 8.1 How Learning Differs from Pure Optimization

### 8.1.3 Batch and Minibatch Algorithms

standard error of the mean estimated from $n$ samples: $\sigma / \sqrt{n}$, where $\sigma$ is the true standard deviation — so using more examples to estimate the gradient gives less than linear returns.

Redundancy is another motivation: training sets often contain large numbers of examples that all make very similar contributions to the gradient.
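The less-than-linear return can be checked numerically. The sketch below (synthetic data; all values are illustrative, not from the text) compares the empirical spread of a minibatch mean against $\sigma/\sqrt{n}$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Treat each draw as one example's contribution to the gradient.
sigma = 2.0
population = rng.normal(loc=0.5, scale=sigma, size=1_000_000)

for n in (100, 10_000):
    # Spread of the minibatch mean across 500 independent resamples.
    means = [rng.choice(population, size=n).mean() for _ in range(500)]
    print(f"n={n:>6}  empirical SE={np.std(means):.4f}  "
          f"sigma/sqrt(n)={sigma / np.sqrt(n):.4f}")
```

A 100-fold increase in examples buys only a 10-fold reduction in the standard error of the estimate.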

Minibatch sizes are generally driven by the following factors:

• Larger batches provide a more accurate estimate of the gradient, but with less than linear returns.
• Multicore architectures are usually underutilized by extremely small batches.
• If examples in the batch are processed in parallel, memory consumption scales with the batch size, which can be the limiting factor.
• Some hardware achieves better runtime with specific sizes of arrays (e.g., power-of-2 batch sizes on GPUs).
• Small batches can offer a regularizing effect.
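These trade-offs all play out inside an ordinary minibatch SGD loop. A minimal sketch on a synthetic least-squares problem (the sizes and learning rate below are illustrative choices, not prescriptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear-regression data: 1000 examples, 5 features.
X = rng.normal(size=(1000, 5))
true_w = rng.normal(size=5)
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(5)
lr, batch_size = 0.1, 64

for step in range(500):
    idx = rng.integers(0, len(X), size=batch_size)  # sample a minibatch
    xb, yb = X[idx], y[idx]
    grad = 2 * xb.T @ (xb @ w - yb) / batch_size    # minibatch MSE gradient
    w -= lr * grad
```

Each step touches only `batch_size` examples, yet the noisy gradient estimates still drive `w` close to `true_w`.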

Some algorithms are more sensitive to sampling error than others:

• either because they use information that is difficult to estimate accurately with few samples,
• or because they use information in ways that amplify sampling errors.

It is also crucial that the minibatches be selected randomly.

• In practice, it suffices to shuffle the order of the dataset once and then store it in shuffled fashion.
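A sketch of the shuffle-once pattern (a toy array stands in for the dataset):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.arange(20).reshape(10, 2)   # toy "dataset" of 10 examples

# Shuffle once up front; afterwards every epoch can read the data
# sequentially and still see a random ordering of examples.
perm = rng.permutation(len(X))
X_shuffled = X[perm]

batch_size = 4
for start in range(0, len(X_shuffled), batch_size):
    batch = X_shuffled[start:start + batch_size]
    # ... compute the minibatch gradient on `batch` here ...
```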

## 8.2 Challenges in Neural Network Optimization

### 8.2.2 Local Minima

model identifiability: a model is identifiable if a sufficiently large training set can rule out all but one setting of its parameters. Neural networks are not identifiable:

• weight space symmetry: permuting a layer's hidden units (along with their incoming and outgoing weights) yields an equivalent model, so any local minimum has many equivalent copies.

## 8.4 Parameter Initialization Strategies

Unfortunately, such optimality criteria for the initial weights often do not lead to optimal performance, for three different reasons:

• we may be using the wrong criteria;
• the properties imposed at initialization may not persist once learning begins;
• the criteria might succeed at improving the speed of optimization but inadvertently increase generalization error.

There are a few situations where we may set some biases to nonzero values:

• If a bias is for an output unit, initialize it to obtain the right marginal statistics of the output.
• Choose the bias to avoid causing too much saturation at initialization (e.g., a small positive bias for ReLU hidden units).
• When a unit acts as a gate controlling whether other units participate, set its bias so the gate is usually open at initialization.

## 8.5 Algorithms with Adaptive Learning Rates

delta-bar-delta: if the partial derivative of the loss with respect to a parameter keeps the same sign, increase that parameter's learning rate; if it changes sign, decrease it.

### 8.5.2 RMSProp

Algorithm 8.5 The RMSProp algorithm

Algorithm 8.6 RMSProp algorithm with Nesterov momentum
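A direct transcription of the Algorithm 8.5 update (the hyperparameter values below are common defaults, not taken from the text):

```python
import numpy as np

def rmsprop_step(theta, grad, r, lr=1e-3, rho=0.9, delta=1e-6):
    """One RMSProp step: accumulate a decaying average of squared
    gradients, then rescale the learning rate element-wise."""
    r = rho * r + (1 - rho) * grad * grad
    theta = theta - lr * grad / np.sqrt(delta + r)
    return theta, r

# Sanity check on f(theta) = ||theta||^2, whose gradient is 2*theta.
theta = np.array([5.0, -3.0])
r = np.zeros_like(theta)
for _ in range(2000):
    theta, r = rmsprop_step(theta, 2 * theta, r, lr=0.01)
```

The per-parameter scaling by `sqrt(delta + r)` is what distinguishes RMSProp from plain SGD; the Nesterov variant of Algorithm 8.6 additionally evaluates the gradient at a momentum-extrapolated point.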

### 8.5.4 Choosing the Right Optimization Algorithm

The choice of which algorithm to use, at this point, seems to depend largely on the user's familiarity with the algorithm (for ease of hyperparameter tuning).

## 8.6 Approximate Second-Order Methods

### 8.6.2 Conjugate Gradients

Conjugate gradients is a method to efficiently avoid the calculation of the inverse Hessian by iteratively descending conjugate directions.
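For a quadratic $f(x) = \frac{1}{2}x^\top A x - b^\top x$ (equivalently, solving $Ax = b$), the method reaches the exact minimum in at most $\dim(x)$ steps. A minimal linear-CG sketch (the matrix and sizes are made up for illustration):

```python
import numpy as np

def conjugate_gradient(A, b, x0, n_steps):
    """Minimize 0.5 * x @ A @ x - b @ x for symmetric positive
    definite A by descending along mutually conjugate directions."""
    x = x0.copy()
    r = b - A @ x              # residual = negative gradient
    d = r.copy()               # first direction: steepest descent
    for _ in range(n_steps):
        Ad = A @ d
        alpha = (r @ r) / (d @ Ad)        # exact line search along d
        x = x + alpha * d
        r_new = r - alpha * Ad
        beta = (r_new @ r_new) / (r @ r)  # keeps d conjugate to past directions
        d = r_new + beta * d
        r = r_new
    return x

rng = np.random.default_rng(0)
M = rng.normal(size=(5, 5))
A = M @ M.T + 5 * np.eye(5)    # symmetric positive definite
b = rng.normal(size=5)
x = conjugate_gradient(A, b, np.zeros(5), n_steps=5)
```

Note that no Hessian inverse (here, `A` inverse) is ever formed: each iteration needs only one matrix-vector product.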

## 8.7 Optimization Strategies and Meta-Algorithms

### 8.7.3 Polyak Averaging

When applying Polyak averaging to nonconvex problems, it is typical to use an exponentially decaying running average: $\hat{\theta}^{(t)} = \alpha \hat{\theta}^{(t-1)} + (1 - \alpha)\theta^{(t)}$.
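A sketch of such a running average over noisy iterates (the decay value and the synthetic "iterates" below are illustrative):

```python
import numpy as np

def polyak_update(theta_avg, theta, alpha=0.99):
    """Exponentially decaying running average of the iterates."""
    if theta_avg is None:
        return theta.copy()
    return alpha * theta_avg + (1 - alpha) * theta

# Fake SGD iterates bouncing around an optimum (noise std 0.5).
rng = np.random.default_rng(0)
optimum = np.array([1.0, 2.0])
thetas = optimum + 0.5 * rng.normal(size=(2000, 2))
theta_avg = None
for theta in thetas:
    theta_avg = polyak_update(theta_avg, theta)
# theta_avg ends up far closer to the optimum than the raw iterates.
```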

### 8.7.4 Supervised Pretraining

greedy supervised pretraining

transfer learning

teacher-student networks

### 8.7.5 Designing Models to Aid Optimization

In practice, it is more important to choose a model family that is easy to optimize than to use a powerful optimization algorithm.

ReLU

linear paths or skip connections
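A toy NumPy sketch of a residual (skip) connection; the shapes and two-weight form are hypothetical, but the structure is the standard `x + F(x)` pattern:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, W1, W2):
    """y = x + F(x): the identity path carries gradients straight
    through, so adding depth adds less optimization difficulty."""
    return x + W2 @ relu(W1 @ x)

# With zero weights the block is exactly the identity map.
x = np.array([1.0, -2.0, 3.0, 0.5])
y = residual_block(x, np.zeros((4, 4)), np.zeros((4, 4)))
```

Because the block defaults to the identity, early in training the network behaves like a much shallower, easier-to-optimize model.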

### 8.7.6 Continuation Methods and Curriculum Learning

Continuation methods

Traditional continuation methods are usually based on smoothing the objective function.

Curriculum learning