# 5 Machine Learning Basics

## 5.1 Learning Algorithms

### 5.1.1 The Task, $T$

• Classification
• Classification with missing inputs
• Regression
• Transcription
• Machine translation
• Structured output
• Anomaly detection
• Synthesis and sampling
• Imputation of missing values
• Denoising
• Density estimation or probability mass function estimation

## 5.2 Capacity, Overfitting and Underfitting

The error incurred by an oracle making predictions from the true distribution $p(x, y)$ is called the Bayes error.

### 5.2.2 Regularization

Regularization is any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error.
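One common instance of this definition is weight decay. The sketch below is a minimal illustration, assuming a one-dimensional linear model with no bias term; the names `fit_slope`, `training_mse`, and `lam` and the toy data are hypothetical, not taken from the text. Minimizing $\sum_i (w x_i - y_i)^2 + \lambda w^2$ over a scalar $w$ has the closed form $w = \sum_i x_i y_i / (\sum_i x_i^2 + \lambda)$.

```python
def fit_slope(xs, ys, lam=0.0):
    """Closed-form minimizer of sum((w*x - y)**2) + lam * w**2 for a scalar w."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def training_mse(w, xs, ys):
    """Mean squared error of the model y = w*x on the given examples."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy data, assumed for illustration only.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.1, 1.9, 3.2, 3.8]

w_plain = fit_slope(xs, ys)           # ordinary least squares (lam = 0)
w_decay = fit_slope(xs, ys, lam=5.0)  # weight decay shrinks w toward zero
```

By construction the penalized fit cannot have lower training error than the unpenalized one. Whether it generalizes better depends on the data, which this sketch does not test.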

## 5.5 Maximum Likelihood Estimation

### 5.5.1 Conditional Log-Likelihood and Mean Squared Error

Example: Linear Regression as Maximum Likelihood

We now revisit linear regression from the point of view of maximum likelihood estimation.
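The link to mean squared error can be written out as a short worked equation. This assumes the usual setup for this example: the model outputs a prediction $\hat{y}^{(i)}$ for each of the $m$ training examples, and $p(y \mid x) = \mathcal{N}(y; \hat{y}(x; w), \sigma^2)$ with a fixed variance $\sigma^2$. Under those assumptions the conditional log-likelihood is

$$
\sum_{i=1}^{m} \log p(y^{(i)} \mid x^{(i)}; w) = -m \log \sigma - \frac{m}{2} \log(2\pi) - \sum_{i=1}^{m} \frac{\lVert \hat{y}^{(i)} - y^{(i)} \rVert^2}{2\sigma^2}.
$$

Only the last term depends on $w$, so maximizing the log-likelihood with respect to $w$ gives the same estimate as minimizing the training mean squared error $\frac{1}{m} \sum_{i=1}^{m} \lVert \hat{y}^{(i)} - y^{(i)} \rVert^2$.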

### 5.5.2 Properties of Maximum Likelihood

Statistical efficiency

## 5.6 Bayesian Statistics

Example: Bayesian Linear Regression

## 5.7 Supervised Learning Algorithms

### 5.7.1 Probabilistic Supervised Learning

logistic regression

kernel trick

### 5.7.3 Other Simple Supervised Learning Algorithms

k-nearest neighbors
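A minimal sketch of k-nearest neighbors classification, assuming Euclidean distance and a majority vote. The function name `knn_predict`, the default `k=3`, and the toy data are hypothetical choices for illustration; ties are broken arbitrarily here.

```python
from collections import Counter
import math


def knn_predict(train_x, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Order training indices by Euclidean distance to the query point.
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    # Count the labels of the k closest points and return the most frequent one.
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Toy data, assumed for illustration only: two well-separated clusters.
train_x = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
train_y = ["a", "a", "a", "b", "b", "b"]
```

There is no training step beyond storing the data: all the work happens at prediction time, when the query is compared against every stored example.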

decision tree

## 5.8 Unsupervised Learning Algorithms

There are multiple ways of defining a simpler representation. Three of the most common include lower-dimensional representations, sparse representations, and independent representations.

### 5.8.2 k-means Clustering

## 5.9 Stochastic Gradient Descent

The insight of SGD is that the gradient is an expectation. The expectation may be approximately estimated using a small set of samples. Specifically, on each step of the algorithm, we can sample a minibatch of examples $B = \{ x^{(1)}, \dots, x^{(m')} \}$ drawn uniformly from the training set. The minibatch size $m'$ is typically chosen to be a relatively small number of examples, ranging from one to a few hundred.
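The minibatch estimate described above can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: the model is a single scalar weight fitted with squared error, and the names `minibatch_gradient` and `sgd`, the hyperparameters, and the synthetic data are all hypothetical.

```python
import random


def minibatch_gradient(w, batch):
    """Average gradient of the squared error (w*x - y)**2 over the minibatch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def sgd(data, batch_size=8, lr=0.05, steps=500, seed=0):
    """Fit a scalar weight w by stochastic gradient descent."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        # Draw a minibatch uniformly from the training set (without replacement).
        batch = rng.sample(data, batch_size)
        # Step along the negative of the estimated gradient.
        w -= lr * minibatch_gradient(w, batch)
    return w


# Synthetic data, assumed for illustration only: y = 3x plus small Gaussian noise.
data_rng = random.Random(1)
data = []
for _ in range(200):
    x = data_rng.uniform(-1.0, 1.0)
    data.append((x, 3.0 * x + data_rng.gauss(0.0, 0.1)))

w_hat = sgd(data)
```

Each step touches only `batch_size` examples, so the cost per update does not grow with the size of the training set. The estimate `w_hat` should land close to the true slope of 3 used to generate the data.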

## 5.10 Building a Machine Learning Algorithm

A fairly simple recipe: combine a specification of a dataset, a cost function, an optimization procedure and a model.

## 5.11 Challenges Motivating Deep Learning

### 5.11.3 Manifold Learning

The first observation in favor of the manifold hypothesis is that the probability distribution over images, text strings and sounds that occur in real life is highly concentrated.

The second argument in favor of the manifold hypothesis is that we can also imagine such neighborhoods and transformations, at least informally.