## Neural Networks and Deep Learning

### Week One

(Neural networks are also introduced in the Machine Learning course; see my learning notes there.)

House price prediction can be regarded as the simplest neural network: the function can be ReLU (Rectified Linear Unit), which we'll see a lot. This is a single neuron. A larger neural network is then formed by taking many single neurons and stacking them together.

Almost all the economic value created by neural networks has been through supervised learning.

| Input (x) | Output (y) | Application | Neural Network |
| --- | --- | --- | --- |
| House features | Price | Real estate | Standard NN |
| Ad, user info | Click on ad? (0/1) | Online advertising | Standard NN |
| Photo | Object (index 1, …, 1000) | Photo tagging | CNN |
| Audio | Text transcript | Speech recognition | RNN |
| English | Chinese | Machine translation | RNN |
| Image, radar info | Position of other cars | Autonomous driving | Custom/Hybrid |

Neural network examples:
• CNN: often for image data
• RNN: often for one-dimensional sequence data

Structured data and unstructured data

Scale drives deep learning progress. Scale means both the size of the neural network and the scale of the data.

• Data
• Computation
• Algorithms

Using ReLU instead of the sigmoid function as the activation function can make gradient descent run much faster, since ReLU's gradient does not vanish for positive inputs.

### Week Two

Notation
(x,y): a single training example. $x\in\mathbb{R}^{n_x},y\in\{0,1\}$
m training examples: $\{(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),...,(x^{(m)},y^{(m)})\}$

Take the training-set inputs $x^{(1)}$, $x^{(2)}$ and so on and stack them in columns to form $X$. (This makes the implementation much easier than stacking them in rows, i.e. $X$'s transpose.)

#### Logistic Regression

Differences from the former course
The notation differs a bit from what is introduced in Machine Learning (note).
Originally, we add $x_0=1$ so that $x\in\mathbb{R}^{n_x+1}$,

where $\theta=\begin{bmatrix}\theta_0\\\theta_1\\\theta_2\\...\\\theta_{n_x}\end{bmatrix}$.

Here in the Deep Learning course, we use $b$ to represent $\theta_0$, and $w$ to represent $\theta_1,...,\theta_{n_x}$. Just keep $b$ and $w$ as separate parameters.
Given x, want $\hat{y}=P(y=1|x)$. $x\in\mathbb{R}^{n_x},0\le\hat y\le1$
Parameters: $w\in\mathbb{R}^{n_x},b\in\mathbb{R}$
Output: $\hat{y}=\sigma(w^Tx+b)$
σ() is the sigmoid function: $\sigma(z)=\frac{1}{1+e^{-z}}$

Loss function: $\mathscr{L}(\hat{y},y)=-(y\log\hat{y}+(1-y)\log(1-\hat{y}))$

If $y=1:\mathscr{L}(\hat{y},y)=-\log\hat{y}$
If $y=0:\mathscr{L}(\hat{y},y)=-\log(1-\hat{y})$

Cost function: $J(w,b)=\frac{1}{m}\sum_{i=1}^m\mathscr{L}(\hat{y}^{(i)},y^{(i)})=-\frac{1}{m}\sum_{i=1}^m[y^{(i)}\log\hat{y}^{(i)}+(1-y^{(i)})\log(1-\hat{y}^{(i)})]$

The loss function is applied to a single training example.
The cost function measures how well the parameters do on the entire training set: it is the average of the losses.

Usually initialize the parameters to zero in logistic regression. Random initialization also works, but people don't usually do that for logistic regression.

Repeat {
$\quad w:=w-\alpha\frac{\partial J(w,b)}{\partial w}$
$\quad b:=b-\alpha\frac{\partial J(w,b)}{\partial b}$
}

From forward propagation, we calculate $z$, $a$ and finally $\mathscr{L}(a,y)$.
From backward propagation, we calculate the derivatives step by step: $da=-\frac{y}{a}+\frac{1-y}{1-a},\ dz=a-y,\ dw=x\,dz,\ db=dz$.
Algorithm
(Repeat)
J = 0; dw1, dw2, …, dwn = 0; db = 0
for i = 1 to m:
$\quad z^{(i)}=w^Tx^{(i)}+b,\quad a^{(i)}=\sigma(z^{(i)})$
$\quad J+=-[y^{(i)}\log a^{(i)}+(1-y^{(i)})\log(1-a^{(i)})]$
$\quad dz^{(i)}=a^{(i)}-y^{(i)}$
$\quad$for j = 1 to n: $dw_j+=x^{(i)}_jdz^{(i)}$
$\quad db+=dz^{(i)}$
J /= m;
dw1, dw2, …, dwn /= m;
db /= m

w1 := w1 - αdw1
w2 := w2 - αdw2
…
b := b - αdb

In the for loop, there's no superscript i on the dw variables, because dw in the code accumulates over the whole training set, while dz refers to a single training example.

Vectorization
Original for loop:
for i = 1 to m: compute $z^{(i)}=w^Tx^{(i)}+b,\ a^{(i)}=\sigma(z^{(i)}),\ dz^{(i)}=a^{(i)}-y^{(i)}$ one example at a time.

Vectorized:
$Z=w^TX+b,\quad A=\sigma(Z),\quad dZ=A-Y,\quad dw=\frac{1}{m}XdZ^T,\quad db=\frac{1}{m}\sum_{i=1}^m dz^{(i)}$

Code:
Z = np.dot(w.T, X) + b
A = sigmoid(Z)
dZ = A - Y
cost = -1 / m * np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A))
dw = 1 / m * np.dot(X, dZ.T)
db = 1 / m * np.sum(dZ)
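Putting the vectorized steps together, one full training loop might look like the following minimal sketch (the helper name `propagate` and the synthetic data are my own, not from the course):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def propagate(w, b, X, Y):
    """One pass of forward and backward propagation, fully vectorized.
    X: (n_x, m) inputs stacked in columns; Y: (1, m) labels; w: (n_x, 1); b: scalar."""
    m = X.shape[1]
    A = sigmoid(np.dot(w.T, X) + b)                         # (1, m) activations
    cost = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
    dZ = A - Y                                              # (1, m)
    dw = np.dot(X, dZ.T) / m                                # (n_x, 1)
    db = np.sum(dZ) / m
    return dw, db, cost

# tiny usage example with synthetic data
np.random.seed(0)
X = np.random.randn(2, 5)
Y = (np.random.rand(1, 5) > 0.5).astype(float)
w, b = np.zeros((2, 1)), 0.0
for _ in range(100):
    dw, db, cost = propagate(w, b, X, Y)
    w -= 0.1 * dw
    b -= 0.1 * db
```

With `w` and `b` initialized to zero, every activation starts at 0.5, so the initial cost is exactly $\log 2$; gradient descent then decreases it.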

A.sum(axis = 0): sum vertically
A.sum(axis = 1): sum horizontally

If an (m, n) matrix is combined (+, -, *, /) with a (1, n) row vector, the vector is broadcast vertically to (m, n) by copying it m times.
If an (m, n) matrix is combined with an (m, 1) column vector, the vector is broadcast horizontally to (m, n) by copying it n times.
If a row/column vector is combined with a real number, the number is broadcast to a vector of the same shape.
See the NumPy broadcasting documentation.

Rank 1 Array
a = np.random.randn(5) creates a rank-1 array whose shape is (5,).
Try to avoid rank-1 arrays. Use a = a.reshape((5, 1)) or a = np.random.randn(5, 1) instead.
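A quick demonstration of why rank-1 arrays are error-prone (transposing one is a no-op, while a proper column vector behaves as expected):

```python
import numpy as np

a = np.random.randn(5)        # rank-1 array, shape (5,)
assert a.shape == (5,)
# a.T is identical to a for a rank-1 array, which causes subtle bugs:
assert a.T.shape == (5,)

b = np.random.randn(5, 1)     # proper column vector
assert b.shape == (5, 1)
assert b.T.shape == (1, 5)
# the outer product works as expected with an explicit column vector:
assert np.dot(b, b.T).shape == (5, 5)
```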

Note that np.dot() performs a matrix-matrix or matrix-vector multiplication. This is different from np.multiply() and the * operator (which is equivalent to .* in MATLAB/Octave), which performs an element-wise multiplication.

### Week Three

#### Neural Network Overview

A superscript with square brackets denotes the layer; a superscript with round brackets refers to the i-th training example. Logistic regression can be regarded as the simplest neural network. The neuron takes in the inputs and makes two computations: $z=w^Tx+b,\ a=\sigma(z)$. A neural network functions similarly. (Note that this neural network has 2 layers: when counting layers, the input layer is not included.)
Take the first node in the hidden layer as an example: $z^{[1]}_1=w^{[1]T}_1x+b^{[1]}_1,\ a^{[1]}_1=\sigma(z^{[1]}_1)$

The superscript $[l]$ denotes the layer, and the subscript $i$ denotes the node within the layer.
Similarly, $z^{[1]}_i=w^{[1]T}_ix+b^{[1]}_i,\ a^{[1]}_i=\sigma(z^{[1]}_i)$

Vectorization: stack the $w^{[1]T}_i$ as the rows of $W^{[1]}$, and the $b^{[1]}_i$, $z^{[1]}_i$, $a^{[1]}_i$ in columns.

Formula: $z^{[1]}=W^{[1]}x+b^{[1]},\ a^{[1]}=\sigma(z^{[1]}),\ z^{[2]}=W^{[2]}a^{[1]}+b^{[2]},\ a^{[2]}=\sigma(z^{[2]})$

(dimensions, for a network with 3 inputs, 4 hidden units and 1 output: $z^{[1]}:(4,1),W^{[1]}:(4,3),a^{[0]}=x:(3,1),b^{[1]}:(4,1),a^{[1]}:(4,1);z^{[2]}:(1,1),W^{[2]}:(1,4),b^{[2]}:(1,1),a^{[2]}:(1,1)$)

Vectorizing across multiple examples: $Z^{[1]}=W^{[1]}X+b^{[1]},\ A^{[1]}=\sigma(Z^{[1]}),\ Z^{[2]}=W^{[2]}A^{[1]}+b^{[2]},\ A^{[2]}=\sigma(Z^{[2]})$

Explanation: stack the examples in columns.
Each column represents a training example, each row represents a hidden unit.

#### Activation Function

Sigmoid Function Only used in a binary classification's output layer (with output 0 or 1).
Not used on other occasions; tanh is a better choice.

tanh Function With a range of $(-1,1)$, it performs better than the sigmoid function because the mean of its output is closer to zero.
Both sigmoid and tanh have the disadvantage that when z is very large ($\to\infty$) or very small ($\to-\infty$), the derivative is close to 0, so gradient descent becomes very slow.

ReLU Default choice of activation function.
With $g'(z)=1$ when z is positive, it performs well in practice.
(Although $g'(z)=0$ when z is negative, and technically the derivative at $z=0$ is not well-defined.)

Leaky ReLU Makes sure that the derivative is not 0 when z < 0.

Linear Activation Function

Also called the identity function.
Not used in hidden layers, because stacking any number of linear hidden layers still yields a linear function. It may be used in the output layer when the output is a real number (regression).

Forward propagation:
$Z^{[1]}=W^{[1]}X+b^{[1]},\ A^{[1]}=g^{[1]}(Z^{[1]}),\ Z^{[2]}=W^{[2]}A^{[1]}+b^{[2]},\ A^{[2]}=\sigma(Z^{[2]})$

Backward propagation:
$dZ^{[2]}=A^{[2]}-Y$
$dW^{[2]}=\frac{1}{m}dZ^{[2]}A^{[1]T},\quad db^{[2]}=\frac{1}{m}\text{np.sum}(dZ^{[2]},\text{axis}=1,\text{keepdims=True})$
$dZ^{[1]}=W^{[2]T}dZ^{[2]}*g^{[1]\prime}(Z^{[1]})$
$dW^{[1]}=\frac{1}{m}dZ^{[1]}X^T,\quad db^{[1]}=\frac{1}{m}\text{np.sum}(dZ^{[1]},\text{axis}=1,\text{keepdims=True})$

note:
keepdims=True makes sure that Python won't produce a rank-1 array with shape (n,).
* is the element-wise product. Dimensions: $dZ^{[1]}:(n^{[1]},m)$; $W^{[2]T}dZ^{[2]}:(n^{[1]},m)$; $g^{[1]\prime}(Z^{[1]}):(n^{[1]},m)$.
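These formulas can be sketched in numpy for a 2-layer network with a tanh hidden layer and a sigmoid output; the function names `forward`/`backward` are illustrative, and shapes follow the notes' convention (X is (n_x, m)):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(X, W1, b1, W2, b2):
    """Forward pass for the 2-layer network; X is (n_x, m)."""
    Z1 = np.dot(W1, X) + b1        # (n1, m)
    A1 = np.tanh(Z1)               # tanh activation in the hidden layer
    Z2 = np.dot(W2, A1) + b2       # (1, m)
    A2 = sigmoid(Z2)               # sigmoid in the output layer
    return Z1, A1, Z2, A2

def backward(X, Y, W2, Z1, A1, A2):
    """Backward pass; returns gradients of the cost w.r.t. each parameter."""
    m = X.shape[1]
    dZ2 = A2 - Y                                    # (1, m)
    dW2 = np.dot(dZ2, A1.T) / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = np.dot(W2.T, dZ2) * (1 - A1 ** 2)         # tanh'(Z1) = 1 - tanh(Z1)^2
    dW1 = np.dot(dZ1, X.T) / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m
    return dW1, db1, dW2, db2
```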

#### Random Initialization

In logistic regression, it's okay to initialize all parameters to zero. However, that doesn't work in a neural network.
Instead, initialize w with small random values to break symmetry. It's okay to initialize b to zeros; symmetry is still broken so long as $W^{[l]}$ is initialized randomly.

Random
If the parameters w are all zeros, the neurons in a hidden layer are symmetric ("identical"). Even after gradient descent, they stay identical. So use random initialization.

Small
Both sigmoid and tanh have their greatest derivative at z=0. If z takes large-magnitude values, the derivative is close to zero, and consequently gradient descent is slow. Thus it's a good choice to keep the initial values small.
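A minimal initialization sketch following these two rules (random small weights, zero biases); the function name is my own:

```python
import numpy as np

def initialize_parameters(n_x, n_h, n_y):
    """Break symmetry: random small weights, zero biases."""
    W1 = np.random.randn(n_h, n_x) * 0.01   # small values keep z near 0
    b1 = np.zeros((n_h, 1))                 # zeros are fine for biases
    W2 = np.random.randn(n_y, n_h) * 0.01
    b2 = np.zeros((n_y, 1))
    return W1, b1, W2, b2
```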

### Week Four

Deep neural network notation
-$L$: number of layers
-$n^{[l]}$: number of units in layer l
-$a^{[l]}$: activations in layer l. $a^{[l]}=g^{[l]}(z^{[l]})$
($a^{[0]}=x,\ a^{[L]}=\hat{y}$)
-$W^{[l]}$: weights for $z^{[l]}$
-$b^{[l]}$: bias for $z^{[l]}$

Forward Propagation
for l = 1 to L:
$\quad Z^{[l]}=W^{[l]}A^{[l-1]}+b^{[l]},\quad A^{[l]}=g^{[l]}(Z^{[l]})$

Well, this for loop over the layers is inevitable.

Matrix Dimensions
-$W^{[l]},dW^{[l]}:(n^{[l]},n^{[l-1]})$
-$b^{[l]},db^{[l]}:(n^{[l]},1)$ (broadcast by Python to $(n^{[l]},m)$ when added to $W^{[l]}A^{[l-1]}$)
-$Z^{[l]},A^{[l]}:(n^{[l]},m)$
-$dZ^{[l]},dA^{[l]}$: same dimensions as $Z^{[l]},A^{[l]}$, i.e. $(n^{[l]},m)$

cache
Cache is used to pass variables computed during forward propagation to the corresponding backward propagation step. It contains useful values for backward propagation to compute derivatives.

Why deep representations?
Informally: There are functions you can compute with a “small” L-layer deep neural network that shallower networks require exponentially more hidden units to compute.

#### Forward and Backward Propagation

Forward propagation for layer l
Input $a^{[l-1]}$
Output $a^{[l]}$, cache $(z^{[l]})$

Backward propagation for layer l
Input $da^{[l]}$
Output $da^{[l-1]},dW^{[l]},db^{[l]}$

#### Hyperparameters and Parameters

Hyperparameters control the final values of the parameters.

Parameters
· $W^{[1]},b^{[1]},W^{[2]},b^{[2]},W^{[3]},b^{[3]},...$

Hyperparameters
· learning rate $\alpha$
· number of iterations
· number of hidden layers $L$
· number of hidden units $n^{[1]},n^{[2]},...$
· choice of activation function
· momentum, minibatch size, regularizations, etc.

## Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization

### Week One

#### Setting up your Machine Learning Application

##### Train/dev/test sets

Training set:
Keep on training algorithms on the training sets.

Development set
Also called Hold-out cross validation set, Dev set for short.
Use dev set to see which of many different models performs best on the dev set.

Test set
To get an unbiased estimate of how well your algorithm is doing.

Proportion
Previous era: the amount of data was not too large; it was common to take all the data and split it 70%/30% or 60%/20%/20%.
Big data: with millions of examples, 10,000 examples in the dev set and 10,000 in the test set are enough. The proportion can be 98/1/1 or even 99.5/0.4/0.1.

Notes
Make sure dev set and test set come from same distribution.

Not having a test set might be okay if you don't need an unbiased estimate of performance. (People often call the dev set the 'test set' when there is no real test set.)

##### Bias/Variance

Solutions
High bias:
Bigger network
Train longer
(Neural network architecture search)

High variance:
More data
Regularization
(Neural network architecture search)

Originally, reducing bias could increase variance and vice versa, so it was necessary to trade off bias against variance.
But in deep learning there are ways to reduce one without increasing the other, so the bias-variance trade-off is much less of a concern.

#### Regularization

##### L2 Regularization

Logistic regression: $J(w,b)=\frac{1}{m}\sum_{i=1}^{m}\mathscr{L}(\hat{y}^{(i)},y^{(i)})+\frac{\lambda}{2m}||w||_2^2$

L2-regularization relies on the assumption that a model with small weights is simpler than a model with large weights. Thus, by penalizing the square values of the weights in the cost function you drive all the weights to smaller values. It becomes too costly for the cost to have large weights! This leads to a smoother model in which the output changes more slowly as the input changes.
Weights end up smaller ("weight decay"): they are pushed toward smaller values.
L2 regularization: $||w||_2^2=\sum_{j=1}^{n_x}w_j^2=w^Tw$
L1 regularization: $||w||_1=\sum_{j=1}^{n_x}|w_j|$
(L1 regularization leads $w$ to be sparse, but is not very effective.)

Neural network: $J(W^{[1]},b^{[1]},...,W^{[L]},b^{[L]})=\frac{1}{m}\sum_{i=1}^{m}\mathscr{L}(\hat{y}^{(i)},y^{(i)})+\frac{\lambda}{2m}\sum_{l=1}^{L}||W^{[l]}||_F^2$

-$||W^{[l]}||_F^2=\sum_{i=1}^{n^{[l]}}\sum_{j=1}^{n^{[l-1]}}(w_{ij}^{[l]})^2$; this is called the Frobenius norm, the matrix analogue of the vector L2 norm.

Back propagation: $dW^{[l]}=(\text{term from backprop})+\frac{\lambda}{m}W^{[l]}$, so each update $W^{[l]}:=W^{[l]}-\alpha dW^{[l]}$ shrinks the weights slightly.

##### Dropout regularization

With dropout, we go through each layer of the network and set some probability of eliminating each node.
For each training example, you train using one of these thinned networks.
The idea behind dropout is that at each iteration you train a different model that uses only a subset of your neurons. Your neurons thus become less sensitive to the activation of any one specific other neuron, because that neuron might be shut down at any time. Dropout is widely used in computer vision.

Implementation with layer 3

• Dropout is a regularization technique.
• You only use dropout during training. Don’t use dropout (randomly eliminate nodes) during test time.
• Apply dropout both during forward and backward propagation.
• During training time, divide each dropout layer by keep_prob to keep the same expected value for the activations. For example, if keep_prob is 0.5, then we will on average shut down half the nodes, so the output will be scaled by 0.5 since only the remaining half are contributing to the solution. Dividing by 0.5 is equivalent to multiplying by 2. Hence, the output now has the same expected value. You can check that this works even when keep_prob is other values than 0.5.
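A minimal inverted-dropout sketch for layer 3's activations, following the bullets above (function names are illustrative):

```python
import numpy as np

def dropout_forward(A3, keep_prob):
    """Inverted dropout applied to layer 3's activations (training time only)."""
    D3 = np.random.rand(*A3.shape) < keep_prob   # mask: 1 = keep, 0 = drop
    A3 = A3 * D3                                 # shut down the dropped units
    A3 = A3 / keep_prob                          # scale up to keep the expected value
    return A3, D3                                # cache D3 for backprop

def dropout_backward(dA3, D3, keep_prob):
    """Apply the same mask and scaling during backprop."""
    return dA3 * D3 / keep_prob
```

With `keep_prob = 0.8`, roughly 20% of the units are zeroed and the survivors are scaled by 1/0.8, so the expected activation is unchanged.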

##### Other regularization methods

Data augmentation
Take image input as an example: flip the image horizontally, rotate, randomly zoom, distort, etc.
This yields more training data at little cost and helps reduce overfitting.

Early stopping
Stop training early so that $||w||_F^2$ stays relatively small.
Early stopping violates orthogonalization, which suggests treating "optimize the cost function J" and "don't overfit" as separate problems.

#### Optimization

##### Normalizing inputs

Subtract the mean: $\mu=\frac{1}{m}\sum_{i=1}^{m}x^{(i)},\quad x:=x-\mu$

Normalize the variance: $\sigma^2=\frac{1}{m}\sum_{i=1}^{m}(x^{(i)})^2$ (element-wise), $\quad x:=x/\sigma$

Note: use the same $\mu,\sigma^2$ to normalize the test set.

Vanishing/exploding gradients
Intuition: since the number of layers may be quite large, the product across L layers can tend to $\infty$ or $0$ (just think about $1.001^{1000}$ and $0.999^{1000}$).

Weight initialization for deep networks
Take a single neuron as an example: $z=w_1x_1+w_2x_2+...+w_nx_n$
If $n$ is large, we want each $w_i$ to be smaller; a reasonable target is $Var(w_i)=\frac{1}{n}$ or $\frac{2}{n}$.

Random initialization for ReLU (known as He initialization, named for the first author of He et al., 2015): $W^{[l]}=\text{np.random.randn(shape)}\cdot\sqrt{\frac{2}{n^{[l-1]}}}$

For tanh: use $\sqrt{\frac{1}{n^{[l-1]}}}$
Xavier initialization: $\sqrt{\frac{2}{n^{[l-1]}+n^{[l]}}}$
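A sketch of He initialization for a list of layer sizes (the helper name `initialize_he` and the dictionary layout are assumptions, not course code):

```python
import numpy as np

def initialize_he(layer_dims):
    """He initialization: scale weights by sqrt(2 / n^[l-1]), suited to ReLU layers."""
    params = {}
    for l in range(1, len(layer_dims)):
        params['W' + str(l)] = (np.random.randn(layer_dims[l], layer_dims[l - 1])
                                * np.sqrt(2.0 / layer_dims[l - 1]))
        params['b' + str(l)] = np.zeros((layer_dims[l], 1))
    return params
```

For a first layer with 1000 inputs, the resulting weights have variance close to 2/1000 = 0.002.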

Gradient checking
Take $W^{[1]},b^{[1]},...,W^{[L]},b^{[L]}$ and reshape them into a big vector $\theta$: $J(W^{[1]},b^{[1]},...,W^{[L]},b^{[L]})=J(\theta)$
Take $dW^{[1]},db^{[1]},...,dW^{[L]},db^{[L]}$ and reshape them into a big vector $d\theta$

for each i:
-$d\theta_{approx}[i]=\frac{J(\theta_1,\theta_2,...,\theta_i+\epsilon,...)-J(\theta_1,\theta_2,...,\theta_i-\epsilon,...)}{2\epsilon}$

Check whether $d\theta_{approx}\approx d\theta$:
Calculate $\frac{||d\theta_{approx}-d\theta||_2}{||d\theta_{approx}||_2+||d\theta||_2}$. (Around $10^{-7}$ is great.)

Note
Gradient checking verifies closeness between the gradients from backpropagation and the numerical approximation of the gradient (computed using forward propagation).
Gradient checking is slow, so we don’t run it in every iteration of training. You would usually run it only to make sure your code is correct, then turn it off and use backprop for the actual learning process.

• Don’t use in training - only to debug.
• If algorithm fails grad check, look at components to try to identify bug.
• Remember regularization.
• Doesn’t work with dropout.
• Run at random initialization; perhaps again after some training.
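The two-sided difference check above can be sketched as follows (the function signature is my own; `J` is any scalar cost function of a parameter vector):

```python
import numpy as np

def gradient_check(J, theta, dtheta, epsilon=1e-7):
    """Compare the analytic gradient dtheta with a two-sided numerical estimate.
    J: cost function taking a parameter vector; theta: (n,) parameters."""
    n = theta.shape[0]
    dtheta_approx = np.zeros(n)
    for i in range(n):
        plus, minus = theta.copy(), theta.copy()
        plus[i] += epsilon
        minus[i] -= epsilon
        dtheta_approx[i] = (J(plus) - J(minus)) / (2 * epsilon)
    num = np.linalg.norm(dtheta_approx - dtheta)
    denom = np.linalg.norm(dtheta_approx) + np.linalg.norm(dtheta)
    return num / denom          # around 1e-7 indicates a correct implementation

# usage: J(theta) = sum(theta^2) has the exact gradient 2*theta
theta = np.array([1.0, -2.0, 3.0])
diff = gradient_check(lambda t: np.sum(t ** 2), theta, 2 * theta)
```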

### Week Two

#### Mini-batch gradient descent

Batch gradient descent (the original gradient descent we've seen) processes the entire training set and then updates the parameters $W,b$ by one small step. If the training set is very large, training is quite slow. The idea of mini-batch gradient descent is to use only part of the training set per step and update the parameters more often.
For example, if $X$'s dimension is $(n_x,m)$, divide the training set into parts of dimension $(n_x,1000)$, i.e. $X^{\{1\}}=[x^{(1)}\ x^{(2)}\ ...\ x^{(1000)}],X^{\{2\}}=[x^{(1001)}\ x^{(1002)}\ ...\ x^{(2000)}],...$
Similarly, $Y^{\{1\}}=[y^{(1)}\ y^{(2)}\ ...\ y^{(1000)}],Y^{\{2\}}=[y^{(1001)}\ y^{(1002)}\ ...\ y^{(2000)}],...$
One iteration of mini-batch gradient descent(computing on a single mini-batch) is faster than one iteration of batch gradient descent.

Mini-batch gradient descent: repeat {
for t = 1, …, 5000 {
Forward prop on $X^{\{t\}}$:
$Z^{[1]}=W^{[1]}X^{\{t\}}+b^{[1]}$
$A^{[1]}=g^{[1]}(Z^{[1]})$
…
$A^{[L]}=g^{[L]}(Z^{[L]})$
Compute cost $J^{\{t\}}=\frac{1}{1000}\sum_{i=1}^{1000}\mathscr{L}(\hat{y}^{(i)},y^{(i)})+\frac{\lambda}{2\cdot1000}\sum_l||W^{[l]}||_F^2$ (summing over the examples in mini-batch t)
Backprop to compute gradients of cost $J^{\{t\}}$ (using $(X^{\{t\}},Y^{\{t\}})$)
$W^{[l]}:=W^{[l]}-\alpha dW^{[l]},\ b^{[l]}:=b^{[l]}-\alpha db^{[l]}$
} # one pass through all mini-batches is called 1 epoch
}

Choosing mini-batch size
Mini-batch size = m: batch gradient descent. $(X^{\{1\}},Y^{\{1\}})=(X,Y)$
It has to process the whole training set before making any progress, which takes too long per iteration.

Mini-batch size = 1: Stochastic gradient descent. $(X^{\{1\}},Y^{\{1\}})=(X^{(1)},Y^{(1)})$
It loses the benefits of vectorization across examples.

Mini-batch size between 1 and m:
Fastest learning: you keep the benefit of vectorization and make progress without processing the entire training set.

If training set is small(m≤2000): just use batch gradient descent.
Typical mini-batch sizes: 64, 128, 256, 512 (1024)
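Building mini-batches can be sketched as follows (the helper name `random_mini_batches` is illustrative; shuffling X and Y with the same permutation keeps examples aligned with their labels):

```python
import numpy as np

def random_mini_batches(X, Y, batch_size=64, seed=0):
    """Shuffle (X, Y) in unison, then slice into mini-batches of batch_size.
    X: (n_x, m); Y: (1, m). The last batch may be smaller than batch_size."""
    np.random.seed(seed)
    m = X.shape[1]
    perm = np.random.permutation(m)
    X_shuf, Y_shuf = X[:, perm], Y[:, perm]
    batches = []
    for t in range(0, m, batch_size):
        batches.append((X_shuf[:, t:t + batch_size], Y_shuf[:, t:t + batch_size]))
    return batches
```

For example, m = 150 with batch_size = 64 yields two batches of 64 examples and one of 22.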

#### Exponentially weighted averages

E.g. with $\beta=0.9$, $v_t=0.9v_{t-1}+0.1\theta_t$:
-$v_{100}=0.9v_{99}+0.1\theta_{100}$
-$v_{99}=0.9v_{98}+0.1\theta_{99}$
-$v_{98}=0.9v_{97}+0.1\theta_{98}$
-$...$
Substitute the second equation into the first, then the third into the result, and so on. Finally we get $v_{100}=0.1\theta_{100}+0.1\cdot0.9\theta_{99}+0.1\cdot0.9^2\theta_{98}+...+0.1\cdot0.9^{99}\theta_1$
This is why it is called an exponentially weighted average. Since $0.9^{10}\approx0.35\approx\frac{1}{e}$, it roughly averages over the last 10 values.

Bias correction
The exponentially weighted average without bias correction is much lower than the one with bias correction at the very beginning.
Since $v_0$ is set to zero (and assume $\beta=0.98$), the first value $v_1=0.98v_0+0.02\theta_1$ is quite small. The values stay too small until t gets larger (say $t=50$ for $\beta=0.98$). To avoid this, bias correction introduces another step: $v_t^{corrected}=\frac{v_t}{1-\beta^t}$
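A small sketch of the exponentially weighted average with optional bias correction (the function name is my own):

```python
def ewa(thetas, beta=0.9, bias_correction=True):
    """Exponentially weighted average: v_t = beta*v_{t-1} + (1-beta)*theta_t."""
    v = 0.0
    out = []
    for t, theta in enumerate(thetas, start=1):
        v = beta * v + (1 - beta) * theta
        out.append(v / (1 - beta ** t) if bias_correction else v)
    return out
```

On a constant sequence, the corrected average recovers the constant from the first step, while the uncorrected one starts far too low (0.1 of the true value for β = 0.9).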

#### Gradient descent with momentum

Set $v_{dW}=0,v_{db}=0$
On iteration t:
Compute dW, db on the current mini-batch
$v_{dW}=\beta v_{dW}+(1-\beta)dW$
$v_{db}=\beta v_{db}+(1-\beta)db$
$W:=W-\alpha v_{dW},\ b:=b-\alpha v_{db}$

Momentum takes past gradients into account to smooth out the steps of gradient descent, using the same idea as the exponentially weighted average (some variants omit the $(1-\beta)$ factor). We want slow learning in the direction that oscillates and faster learning toward the minimum; the exponentially weighted average damps the oscillation and makes gradient descent faster. Note that there's no need for bias correction with momentum: after several iterations, the average warms up on its own.

#### RMSprop

On iteration t:
Compute dW, db on the current mini-batch
$s_{dW}=\beta_2 s_{dW}+(1-\beta_2)dW^2$ (squares are element-wise)
$s_{db}=\beta_2 s_{db}+(1-\beta_2)db^2$
$W:=W-\alpha\frac{dW}{\sqrt{s_{dW}}+\epsilon},\ b:=b-\alpha\frac{db}{\sqrt{s_{db}}+\epsilon}$

RMS means Root Mean Square: dividing by the root of the mean of the squares damps the directions whose gradients oscillate with large magnitude.

#### Adam optimization algorithm

Combine momentum and RMSprop together:
1. It calculates an exponentially weighted average of past gradients, and stores it in variables v (before bias correction) and v_corrected (with bias correction).
2. It calculates an exponentially weighted average of the squares of the past gradients, and stores it in variables s (before bias correction) and s_corrected (with bias correction).
3. It updates the parameters in a direction combining the information from 1 and 2.

Set $v_{dW}=0,s_{dW}=0,v_{db}=0,s_{db}=0$
On iteration t:
Compute dW, db on the current mini-batch
$v_{dW}=\beta_1v_{dW}+(1-\beta_1)dW,\ v_{db}=\beta_1v_{db}+(1-\beta_1)db$
$s_{dW}=\beta_2s_{dW}+(1-\beta_2)dW^2,\ s_{db}=\beta_2s_{db}+(1-\beta_2)db^2$
$v_{dW}^{corrected}=v_{dW}/(1-\beta^t_1),\ v_{db}^{corrected}=v_{db}/(1-\beta^t_1)$
$s_{dW}^{corrected}=s_{dW}/(1-\beta^t_2),\ s_{db}^{corrected}=s_{db}/(1-\beta_2^t)$
$W:=W-\alpha\frac{v_{dW}^{corrected}}{\sqrt{s_{dW}^{corrected}}+\epsilon},\ b:=b-\alpha\frac{v_{db}^{corrected}}{\sqrt{s_{db}^{corrected}}+\epsilon}$

Hyperparameters:
-$\alpha$: needs to be tuned
-$\beta_1$: 0.9
-$\beta_2$: 0.999
-$\epsilon:10^{-8}$
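One Adam update step for a single parameter array, following the formulas above (the helper name `adam_step` is illustrative):

```python
import numpy as np

def adam_step(w, dw, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameter array w with gradient dw.
    v, s: running first and second moment estimates; t: iteration count, starting at 1."""
    v = beta1 * v + (1 - beta1) * dw                 # momentum-style average
    s = beta2 * s + (1 - beta2) * dw ** 2            # RMSprop-style average
    v_corr = v / (1 - beta1 ** t)                    # bias correction
    s_corr = s / (1 - beta2 ** t)
    w = w - alpha * v_corr / (np.sqrt(s_corr) + eps)
    return w, v, s
```

On the very first step (t = 1), the bias-corrected moments equal the raw gradient and its square, so the update size is approximately α regardless of the gradient's scale.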

#### Learning rate decay

Mini-batch gradient descent won’t converge, but step around at the optimal instead. To help converge, it’s advisable to decay learning rate with the number of iterations.
Some formula:
-$\alpha=\frac{1}{1+decay-rate*epoch-num}\alpha_0$
-$\alpha=0.95^{epoch-num}\alpha_0$
-$\alpha=\frac{k}{\sqrt{epoch-num}}\alpha_0$
-discrete stair case (half α after some iterations)
-manual decay

### Week Three

#### Hyperparameter tuning

Hyperparameters: $\alpha,\beta,\beta_1,\beta_2,\epsilon$, number of layers, number of units, learning rate decay, mini-batch size, etc.
Priority: $\alpha$ first; then $\beta$, number of hidden units, mini-batch size; then number of layers and learning-rate decay.

Try random values of hyperparameters rather than a grid search.
Coarse to fine: if some region gives good results, sample more densely within that region.

Appropriate scale:
It's okay to sample uniformly at random for some hyperparameters: number of layers, number of units.
But for hyperparameters like $\alpha$ and $\beta$, sample at random on a logarithmic scale instead of uniformly.
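Sampling on a logarithmic scale can be sketched as follows (the ranges 1e-4…1 for α and 0.9…0.999 for β are illustrative):

```python
import numpy as np

np.random.seed(1)
# learning rate alpha, sampled log-uniformly between 1e-4 and 1:
r = -4 * np.random.rand(1000)           # r in (-4, 0]
alphas = 10 ** r                        # alpha in (1e-4, 1]

# beta for exponentially weighted averages, between 0.9 and 0.999:
# sample 1-beta log-uniformly between 1e-3 and 1e-1 instead
r = -3 + 2 * np.random.rand(1000)       # r in [-3, -1)
betas = 1 - 10 ** r
```

Sampling β itself uniformly would waste most samples near 0.9, where the averaging window barely changes; sampling 1-β on a log scale spreads samples evenly across window sizes.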

Pandas & Caviar
Panda: babysitting one model at a time
Caviar: training many models in parallel
Largely determined by the amount of computational power you can access.

#### Batch Normalization

Using the idea of input normalization, apply normalization to the hidden layers as well.

Given some intermediate value in neural network $z^{(1)},...,z^{(m)}$(specifically $z^{[l](i)}$ in a single layer)
$\mu=\frac{1}{m}\sum_iz^{(i)}$
$\sigma^2=\frac{1}{m}\sum_i(z^{(i)}-\mu)^2$
$z^{(i)}_{norm}=\frac{z^{(i)}-\mu}{\sqrt{\sigma^2+\epsilon}}$
$\tilde{z}^{(i)}=\gamma z_{norm}^{(i)}+\beta$, where $\gamma$ and $\beta$ are learnable parameters of the model.
Use $\tilde{z}^{[l](i)}$ instead of $z^{[l](i)}$.
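The four steps above can be sketched as a batch-norm forward pass (the function name is my own; γ and β are the learnable scale and shift):

```python
import numpy as np

def batchnorm_forward(Z, gamma, beta, epsilon=1e-8):
    """Normalize Z (n_units, m) over the mini-batch, then scale and shift.
    gamma, beta: learnable (n_units, 1) parameters."""
    mu = np.mean(Z, axis=1, keepdims=True)
    var = np.var(Z, axis=1, keepdims=True)
    Z_norm = (Z - mu) / np.sqrt(var + epsilon)     # zero mean, unit variance
    Z_tilde = gamma * Z_norm + beta                # learnable mean/variance
    return Z_tilde, Z_norm, mu, var
```

With γ = 1 and β = 0, each row of the output has mean ≈ 0 and variance ≈ 1 over the mini-batch.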

Batch Norm as regularization
Each mini-batch is scaled by the mean/variance computed on just that mini-batch.
This adds some noise to the values $z^{[l]}$ within that mini-batch. So similar to dropout, it adds some noise to each hidden layer’s activations.
This has a slight regularization effect.

Batch Norm at test time: use exponentially weighted averages to compute average $\mu,\sigma^2$ for test.

#### Multi-class classification

Softmax
The output layer is a vector with dimension C rather than a real number, where C is the number of classes.
Activation function: $a^{[L]}_i=\frac{e^{z^{[L]}_i}}{\sum_{j=1}^{C}e^{z^{[L]}_j}}$

Loss function: $\mathscr{L}(\hat{y},y)=-\sum_{j=1}^{C}y_j\log\hat{y}_j$; cost function: $J=\frac{1}{m}\sum_{i=1}^{m}\mathscr{L}(\hat{y}^{(i)},y^{(i)})$
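A minimal numpy sketch of the softmax activation and its cross-entropy cost (the function names are my own; subtracting the per-column max is a standard numerical-stability trick not mentioned in the notes):

```python
import numpy as np

def softmax(Z):
    """Softmax over the class dimension; Z is (C, m)."""
    e = np.exp(Z - np.max(Z, axis=0, keepdims=True))   # stability: shift by max
    return e / np.sum(e, axis=0, keepdims=True)        # columns sum to 1

def softmax_cost(A, Y):
    """Cross-entropy cost; Y is one-hot with shape (C, m)."""
    m = Y.shape[1]
    return -np.sum(Y * np.log(A)) / m
```

On an all-zero input, softmax outputs the uniform distribution 1/C per class, and the cost equals log C.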

#### Deep Learning frameworks

• Caffe/Caffe2
• CNTK
• DL4J
• Keras
• Lasagne
• mxnet
• TensorFlow
• Theano
• Torch

Choosing deep learning frameworks
Ease of programming (development and deployment)
Running speed
Truly open (open source with good governance)

##### TensorFlow

Writing and running programs in TensorFlow has the following steps:

1. Create Tensors(variables) that are not yet executed/evaluated.
2. Write operations between those Tensors.
3. Initialize your Tensors.
4. Create a Session.
5. Run the Session. This will run the operations you’d written above.

tf.constant(...): to create a constant value
tf.placeholder(dtype = ..., shape = ..., name = ...): a placeholder is an object whose value you can specify only later

tf.add(..., ...): to do an addition
tf.multiply(..., ...): to do a multiplication
tf.matmul(..., ...): to do a matrix multiplication

There are 2 typical ways to create and use sessions in TensorFlow: either create the session explicitly with sess = tf.Session(), call sess.run(...), and close it yourself, or use a with tf.Session() as sess: block so the session is closed automatically.

## Structuring Machine Learning Projects

### Week One

#### Orthogonalization

Orthogonalization or orthogonality is a system design property that assures that modifying an instruction or a component of an algorithm will not create or propagate side effects to other components of the system. It becomes easier to verify the algorithms independently from one another, and it reduces testing and development time.

When a supervised learning system is designed, these are the 4 assumptions that need to be true and orthogonal.

1. Fit training set well on cost function - bigger network, Adam, etc
2. Fit dev set well on cost function - regularization, bigger training set, etc
3. Fit test set well on cost function - bigger dev set
4. Performs well in real world - change dev set or cost function

#### Single number evaluation metric

Precision

$Precision=\frac{TP}{TP+FP}$: among all examples predicted positive, the fraction that are actually positive.

Recall

$Recall=\frac{TP}{TP+FN}$: among all actual positive examples, the fraction that are correctly predicted.

F1-Score

$F_1=\frac{2}{\frac{1}{Precision}+\frac{1}{Recall}}$

The problem with using precision/recall as the evaluation metric is that you can't always tell which classifier is better when one has better precision and the other better recall. The F1-score, a harmonic mean, combines precision and recall into a single number.

Satisficing and optimizing metric
There are different metrics to evaluate the performance of a classifier, they are called evaluation matrices. They can be categorized as satisficing and optimizing matrices. It is important to note that these evaluation matrices must be evaluated on a training set, a development set or on the test set.
The general rule is:

For example:

| Classifier | Accuracy | Running Time |
| --- | --- | --- |
| A | 90% | 80 ms |
| B | 92% | 95 ms |
| C | 95% | 1500 ms |

Here there are two evaluation metrics: accuracy and running time. Take accuracy as the optimizing metric and running time as a satisficing metric: the satisficing metric has to meet a set threshold, and subject to that, the optimizing metric should be improved as much as possible.

#### Train/Dev/Test Set

It’s important to choose the development and test sets from the same distribution and it must be taken randomly from all the data.
Guideline: Choose a dev set and test set to reflect data you expect to get in the future and consider important to do well on.

Size
Old way of splitting data:
We had smaller data sets; therefore, we had to use a greater percentage of the data to develop and test ideas and models.

Modern era (big data):
Now, because a larger amount of data is available, we don't have to compromise, and can use a greater portion to train the model. Set your dev set to be big enough to detect differences between the algorithms/models you're trying out.
Set your test set to be big enough to give high confidence in the overall performance of your system.

When to change dev/test sets and metrics

Orthogonalization:
How to define a metric to evaluate classifiers.
Worry separately about how to do well on this metric.

If doing well on your metric + dev/test set does not correspond to doing well on your application, change your metric and/or dev/test set.

#### Comparing to human-level performance

The graph shows the performance of humans and machine learning over time.
Machine learning progresses slowly once it surpasses human-level performance. One of the reasons is that human-level performance can be close to the Bayes optimal error, especially for natural perception problems.
Bayes optimal error is defined as the best possible error: no function mapping from x to y can surpass that level of accuracy (for various reasons, e.g. blurry images, noisy audio).

Humans are quite good at a lot of tasks. So long as machine learning is worse than humans, you can:

• Get labeled data from humans
• Gain insight from manual error analysis: Why did a person get this right?
• Better analysis of bias/variance

Human-level error as a proxy for Bayes error(i.e. Human-level error ≈ Bayes error).
The difference between Human-level error and training error is also regarded as “Avoidable bias”.

If the difference between human-level error and the training error is bigger than the difference between the training error and the dev error, the focus should be on bias reduction techniques:
· Train bigger model
· Train longer/better optimization algorithms(momentum, RMSprop, Adam)
· NN architecture/hyperparameters search(RNN,CNN)

If the difference between the training error and the dev error is bigger than the difference between human-level error and the training error, the focus should be on variance reduction techniques:
· More data
· Regularization(L2, dropout, data augmentation)
· NN architecture/hyperparameters search

Problems where machines significantly surpass human-level performance
Common features: structured data, not natural perception, lots of data.
· Product recommendations
· Logistics(predicting transit time)
· Loan approvals

The two fundamental assumptions of supervised learning:
You can fit the training set pretty well.(avoidable bias ≈ 0)
The training set performance generalizes pretty well to the dev/test set.(variance ≈ 0)

### Week Two

#### Error Analysis

Before deciding how to improve accuracy, set up a spreadsheet to find out what matters.
For example:

| Image | Dog | Great Cat | Blurry | Comment |
| --- | --- | --- | --- | --- |
| 1 | | | | small white dog |
| 2 | | | | lion in rainy day |
| Percentage | 5% | 41% | 63% | |

Mislabeled examples are cases where your learning algorithm outputs the wrong value of $\hat{y}$.
Incorrectly labeled examples are cases where, in the training/dev/test set, the label a human assigned to the data is actually wrong.

Deep learning algorithms are quite robust to random errors in the training set, but less robust to systematic errors.

Guideline: Build system quickly, then iterate.

1. Set up dev/test set and metrics
• Set up a target
2. Build an initial system quickly
• Training set: fit the parameters
• Dev set: tune the parameters
• Test set: assess the performance
3. Use bias/variance analysis & error analysis to prioritize next steps

#### Mismatched training and dev/test set

The development set and test should come from the same distribution. However, the training set’s distribution might be a bit different. Take a mobile application of cat recognizer for example:
The images from web pages are high-resolution and professionally framed, while the images from the app's users are relatively low-resolution and blurrier.
The problem is that you have two different distributions:
A small data set of pictures uploaded by users (10,000). This is the distribution that matters for the mobile app.
A bigger data set from the web (200,000).

Instead of mixing all the data and randomly shuffling it, split as follows: put 5,000 examples from users into the training set (along with the web data), and halve the remaining 5,000 user examples into the dev and test sets.

The advantage of this way of splitting up is that the target is well defined.
The disadvantage is that the training distribution differs from the dev and test distributions. However, this way of splitting the data performs better in the long term.

Training-Dev Set
Since the distributions of the training and dev sets are now different, it's hard to know whether the gap between the training error and the dev error is caused by variance or by the different distributions.
Therefore, take a small fraction of the original training set, called training-dev set. Don’t use training-dev set for training, but to check variance.
The difference between the training-dev error and the dev error is called data mismatch.

Addressing data mismatch:
• Carry out manual error analysis to try to understand the difference between the training and dev/test sets.
• Make the training data more similar, or collect more data similar to the dev/test sets.

#### Transfer learning

When transfer learning makes sense:

• Task A and B have the same input x.
• You have a lot more data for Task A than Task B.
• Low level features from A could be helpful for learning B.

Guideline:

• Delete last layer of neural network
• Delete weights feeding into the last output layer of the neural network
• Create a new set of randomly initialized weights for the last layers only
• New data set (x,y)

#### Multi-task learning

Example: detect pedestrians, cars, road signs and traffic lights at the same time. The output is a 4-dimensional vector. Note that the inner sum (j = 1 to 4) runs only over the values of j with a 0/1 label (not a ? mark).

When multi-task learning makes sense

• Training on a set of tasks that could benefit from having shared lower-level features.
• Usually: Amount of data you have for each task is quite similar.
• Can train a big enough neural network to do well on all the tasks.

#### End-to-end deep learning

End-to-end deep learning is the simplification of a processing or learning pipeline into one neural network. End-to-end deep learning cannot be used for every problem, since it needs a lot of labeled data. It is used mainly in audio transcription, image captioning, image synthesis, machine translation, steering in self-driving cars, etc.

Pros and cons of end-to-end deep learning
Pros:
Let the data speak
Less hand-designing of components needed

Cons:
May need large amount of data
Excludes potentially useful hand-designed components

## Convolutional Neural Networks

### Week One

Computer Vision Problems

• Image Classification
• Object Detection
• Neural Style Transfer

#### Convolution

`*` is the operator for convolution.

Filter/Kernel
The second operand is called a filter in the course, and often a kernel in research papers.
There are different types of filters. A filter usually has an odd size: 1×1, 3×3, 5×5, ... (an odd size gives the filter a well-defined central pixel).

Vertical edge detection examples

Valid and Same Convolutions
Suppose that the original image has a size of n×n, the filter has a size of f×f, then the result has a size of (n-f+1)×(n-f+1). This is called Valid convolution.
The output will get smaller and smaller as valid convolutions are applied repeatedly.

To avoid this problem, we can pad the original image before convolution so that the output size is the same as the input size. This is called a Same convolution.
If the filter's size is f×f, then the padding $p=\frac{f-1}{2}$.

The main benefits of padding are the following:
· It allows you to use a CONV layer without necessarily shrinking the height and width of the volumes. This is important for building deeper networks, since otherwise the height/width would shrink as you go to deeper layers. An important special case is the “same” convolution, in which the height/width is exactly preserved after one layer.
· It helps us keep more of the information at the border of an image. Without padding, very few values at the next layer would be affected by pixels at the edges of an image.

Stride
The simplest stride is 1: the filter moves 1 step at a time. The stride can also be larger, e.g. the filter moves 2 steps at a time. That's called a strided convolution.

Given an $n\times n$ image, an $f\times f$ filter, padding p and stride s, the output size is:

$$\lfloor\frac{n+2p-f}{s}+1\rfloor\times\lfloor\frac{n+2p-f}{s}+1\rfloor$$

Technical note
In mathematics and DSP, convolution involves an additional "flip" step. This step is omitted in CNNs, so the technically correct term for the CNN operation would be "cross-correlation" rather than convolution.
By convention, it is still called convolution in CNNs.
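The size rule $\lfloor\frac{n+2p-f}{s}+1\rfloor$ can be packaged as a tiny helper (a sketch, not course code):

```python
def conv_output_size(n, f, p=0, s=1):
    """Spatial size after convolving an n×n input with an f×f filter,
    padding p and stride s: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1

# "Valid" convolution: 6x6 input, 3x3 filter -> 4x4 output
assert conv_output_size(6, 3) == 4
# "Same" convolution with s = 1 needs p = (f - 1) / 2
assert conv_output_size(6, 3, p=1) == 6
# Strided convolution: 7x7 input, 3x3 filter, s = 2 -> 3x3 output
assert conv_output_size(7, 3, s=2) == 3
```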

Convolution over volumes
A 1-channel filter cannot be applied to RGB images, but we can use filters with multiple channels (RGB images have 3 channels).

The number of the filter’s channel should match that of the image’s channel.
E.g.
A $6\times6\times3$ image convolved with a $3\times3\times3$ filter gives a result of size $4\times4\times1$. Note that this is only 1 channel! (The number of the result's channels corresponds to the number of filters used.)
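The single step of such a convolution is just an element-wise product and a sum. A numpy sketch of the 6×6×3 example (random values, for shape-checking only):

```python
import numpy as np

def conv_single_step(a_slice, W, b):
    """One application of a filter: element-wise product, sum, add bias.
    a_slice and W have shape (f, f, n_C_prev); b is a scalar."""
    return np.sum(a_slice * W) + float(b)

rng = np.random.default_rng(0)
image = rng.standard_normal((6, 6, 3))   # 6x6 RGB-like input
W = rng.standard_normal((3, 3, 3))       # one 3x3x3 filter
b = 0.0

# Slide the filter over every valid position -> a 4x4 (x1) output
out = np.zeros((4, 4))
for i in range(4):
    for j in range(4):
        out[i, j] = conv_single_step(image[i:i+3, j:j+3, :], W, b)
```

Stacking the outputs of several such filters along a new last axis is what produces the multi-channel output volume.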

#### CNN

notation
If layer l is a convolution layer:

• $f^{[l]}$= filter size
• $p^{[l]}$= padding
• $s^{[l]}$= stride
• $n_c^{[l]}$= number of filters

Each filter is: $f^{[l]}\times f^{[l]}\times n_c^{[l-1]}$
Activations: $a^{[l]}\to n_H^{[l]}\times n_w^{[l]}\times n_c^{[l]}$, $A^{[l]}\to m\times n_H^{[l]}\times n_w^{[l]}\times n_c^{[l]}$
Weights: $f^{[l]}\times f^{[l]}\times n_c^{[l-1]}\times n_c^{[l]}$,($n_c^{[l]}$: #filters in layer l.)
bias: $n_c^{[l]}$ values, stored with shape $(1,1,1,n_c^{[l]})$

Input: $n_H^{[l-1]}\times n_w^{[l-1]}\times n_c^{[l-1]}$
Output: $n_H^{[l]}\times n_w^{[l]}\times n_c^{[l]}$

Types of layers in a convolutional network:

• Convolution (conv)
• Pooling (pool)
• Fully Connected (FC)

Pooling layers

• Max pooling: slides an (f,f) window over the input and stores the max value of the window in the output.
• Average pooling: slides an (f,f) window over the input and stores the average value of the window in the output.

Hyperparameters:
f: filter size
s: stride
Max or average pooling
Note that there are no parameters to learn.

Suppose that the input has a size of $n_H\times n_W\times n_C$; then after pooling, the output has a size of $\lfloor{\frac{n_H-f}{s}+1}\rfloor\times\lfloor{\frac{n_W-f}{s}+1}\rfloor\times n_C$
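A minimal numpy max-pooling sketch following this size formula (loop-based for clarity, not efficiency):

```python
import numpy as np

def max_pool(a, f, s):
    """Max pooling with an f×f window and stride s, applied per channel.
    a: (n_H, n_W, n_C) -> (floor((n_H-f)/s)+1, floor((n_W-f)/s)+1, n_C)."""
    n_H, n_W, n_C = a.shape
    out_H = (n_H - f) // s + 1
    out_W = (n_W - f) // s + 1
    out = np.zeros((out_H, out_W, n_C))
    for i in range(out_H):
        for j in range(out_W):
            window = a[i*s:i*s+f, j*s:j*s+f, :]
            out[i, j, :] = window.max(axis=(0, 1))   # max over the window
    return out

a = np.arange(4 * 4 * 1, dtype=float).reshape(4, 4, 1)
p = max_pool(a, f=2, s=2)   # 2x2x1 output; the channel count is unchanged
```

Swapping `window.max` for `window.mean` gives average pooling with the same shapes.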

A more complicated CNN
Backpropagation is discussed in the programming assignment.

Why convolutions

• Parameter sharing: A feature detector(such as a vertical edge detector) that’s useful in one part of the image is probably useful in another part of the image.
• Sparsity of connections: In each layer, each output value depends only on a small number of inputs.

### Week Two

#### Classic networks

##### LeNet - 5

Paper link: Gradient-Based Learning Applied to Document Recognition (IEEE has another version of this paper.)
Take the input, apply a 5×5 filter with stride 1, then an average pooling with a 2×2 filter and s = 2. Again, apply a 5×5 filter with stride 1, then an average pooling with a 2×2 filter and s = 2. After two fully connected layers, the output uses softmax for classification.
conv → pool → conv → pool → fc → fc → output
As $n_H$ and $n_W$ decrease, $n_C$ increases.

##### AlexNet

Paper link: ImageNet Classification with Deep Convolutional Neural Networks
Similar to LeNet, but much bigger (roughly 60k parameters vs. 60M).
It uses ReLU.

##### VGG-16

Paper link: Very Deep Convolutional Networks for Large-Scale Image Recognition
CONV = 3×3 filter, s = 1, same convolution (padding keeps the size unchanged)
MAX-POOL = 2×2, s = 2
Only these two layer types are used.

##### Residual Networks

Paper link: Deep Residual Learning for Image Recognition

In a plain network, the training error won't keep decreasing as the network gets deeper; it may start increasing past some depth. In a residual network, the training error keeps decreasing.
The skip connection makes it easy for the network to learn an identity mapping between the input and the output of a ResNet block. In ResNets, a "shortcut" or "skip connection" also allows the gradient to be backpropagated directly to earlier layers.

##### 1×1 convolution

Paper link: Network in network

If the input has a volume of dimension $n_H\times n_W\times n_C$, then a single 1×1 convolutional filter has $1\times1\times n_C+1$ parameters(including bias).
You can use a 1×1 convolutional layer to reduce $n_C$ but not $n_H,n_W$.
You can use a pooling layer to reduce $n_H,n_W$, but not $n_C$.
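A numpy sketch of channel reduction with a 1×1 convolution (the 28×28×192 → 28×28×32 shapes follow the lecture's typical example; the einsum formulation is mine):

```python
import numpy as np

# A 1x1 convolution is a per-pixel fully connected layer across channels:
# it can shrink n_C (here 192 -> 32) while leaving n_H and n_W untouched.
rng = np.random.default_rng(0)
x = rng.standard_normal((28, 28, 192))   # n_H x n_W x n_C input
W = rng.standard_normal((192, 32))       # 32 filters, each of shape 1x1x192
b = np.zeros(32)

# For every pixel, take 32 dot products across the channel axis.
out = np.einsum('hwc,cf->hwf', x, W) + b

# Parameter count per filter: 1*1*192 weights + 1 bias = 193
```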

##### Inception network

Paper link: Going deeper with convolutions
Don’t bother worrying about what filters to use. Use all kinds of filters and stack them together.
Inception module. Typically, with deeper layers, $n_H$ and $n_W$ decrease, while $n_C$ increases.

#### Practical advice for using ConvNets

Using Open-Source Implementations: GitHub

Reasons for using open-source implementations of ConvNet:
Parameters trained for one computer vision task are often useful as pretraining for other computer vision tasks.
It is a convenient way to get a working implementation of a complex ConvNet architecture.

### Week Three

#### Classification, localization and detection

Image classification: given an image, predict which class it belongs to.
Classification with localization: in addition, output a bounding box showing where the object is.
Detection: multiple objects may appear in the image; detect all of them.

In classification localization, the output has some values $b_x, b_y$ which show the position of the centroid of the object,(note that the upper left corner’s coordinates is (0,0) and the lower right corner’s is (1,1)) and $b_h, b_w$ which show the height and width of the object.
If the output has 3 classes, the output has the following format:

$$y=\begin{bmatrix}p_c\\b_x\\b_y\\b_h\\b_w\\c_1\\c_2\\c_3\end{bmatrix}$$

For example, if the image contains a car ($c_2=1$), the output is

$$y=\begin{bmatrix}1\\b_x\\b_y\\b_h\\b_w\\0\\1\\0\end{bmatrix}$$

and if the image doesn't contain anything, the output is

$$y=\begin{bmatrix}0\\?\\?\\?\\?\\?\\?\\?\end{bmatrix}$$

(? means "don't care".)

The loss function is the squared error over all components when $y_1=1$, and only over the first component when $y_1=0$:

$$L(\hat y,y)=\begin{cases}\sum_i(\hat y_i-y_i)^2&\text{if }y_1=1\\(\hat y_1-y_1)^2&\text{if }y_1=0\end{cases}$$

Landmark detection
The output contains more information about the position of the landmarks $l_x,l_y$.

Sliding windows detection
Scan the image with a small sliding window and a small stride, running a classifier on each window to detect objects. Then repeat with a slightly bigger window, and then a bigger one.
However, this has a high computational cost.

Turning FC layers into convolutional layers
Use a filter with the same size as the previous layer's activation; the number of filters equals the number of fully connected nodes.

#### YOLO

##### Bounding boxes:

Divide the image into grid cells (19×19 grids are common in practice), and detect each object only once, in the grid cell that contains the object's midpoint.
Each grid cell's upper-left corner has coordinates (0,0) and its lower-right corner (1,1). Therefore, the values of $b_x,b_y$ lie between 0 and 1, while $b_h,b_w$ can be greater than 1.

##### IoU

Intersection over union

If IoU ≥ 0.5, the prediction is judged as correct.
More generally, IoU is a measure of the overlap between two bounding boxes.
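IoU is straightforward to compute from corner coordinates; a sketch (the corner-format box representation is an assumption):

```python
def iou(box1, box2):
    """Intersection over union of two boxes given as (x1, y1, x2, y2),
    i.e. upper-left and lower-right corners."""
    xi1 = max(box1[0], box2[0])
    yi1 = max(box1[1], box2[1])
    xi2 = min(box1[2], box2[2])
    yi2 = min(box1[3], box2[3])
    inter = max(0.0, xi2 - xi1) * max(0.0, yi2 - yi1)  # 0 if boxes don't overlap
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = area1 + area2 - inter
    return inter / union

# Two unit squares overlapping in a 0.5-wide strip: IoU = 0.5 / 1.5 = 1/3
print(iou((0, 0, 1, 1), (0.5, 0, 1.5, 1)))
```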

##### Non-max suppression

Algorithm:
Each output prediction is $\begin{bmatrix}p_c\\b_x\\b_y\\b_h\\b_w\end{bmatrix}$ (just focus on one class at a time so there’s no $c_1,c_2$)
Discard all boxes with $p_c\le0.6$
While there are any remaining boxes:
· Pick the box with the largest $p_c$. Output that as a prediction.
· Discard any remaining box with $IoU\ge0.5$ with the box output in the previous step.
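The algorithm above can be sketched as follows (single class; the box format and the inline `_iou` helper are my assumptions):

```python
def _iou(b1, b2):
    """IoU of two (x1, y1, x2, y2) corner boxes."""
    xi1, yi1 = max(b1[0], b2[0]), max(b1[1], b2[1])
    xi2, yi2 = min(b1[2], b2[2]), min(b1[3], b2[3])
    inter = max(0.0, xi2 - xi1) * max(0.0, yi2 - yi1)
    a1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
    a2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
    return inter / (a1 + a2 - inter)

def non_max_suppression(boxes, scores, score_threshold=0.6, iou_threshold=0.5):
    """Greedy NMS for one class. Returns indices of the kept boxes."""
    # Step 1: discard low-confidence boxes (p_c <= threshold)
    remaining = [i for i, s in enumerate(scores) if s > score_threshold]
    kept = []
    # Step 2: repeatedly keep the highest-score box, drop heavy overlaps
    while remaining:
        best = max(remaining, key=lambda i: scores[i])
        kept.append(best)
        remaining = [i for i in remaining
                     if i != best and _iou(boxes[i], boxes[best]) < iou_threshold]
    return kept

boxes = [(0, 0, 1, 1), (0.05, 0.05, 1.05, 1.05), (2, 2, 3, 3)]
scores = [0.9, 0.8, 0.7]
```

Here the second box heavily overlaps the first and gets suppressed, while the distant third box survives.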

##### Anchor boxes

In an image, some objects may be overlapping. To predict multiple objects in one grid cell, use some anchor boxes.

Previously:
Each object in training image is assigned to grid cell that contains that object’s midpoint.

With two anchor boxes:
Each object in training image is assigned to grid cell that contains object’s midpoint and anchor box for the grid cell with highest IoU.

The output volume has a size of $\#grid\times\#grid\times\#anchors\times(1+4+\#classes)$.
E.g. a 3×3 grid with 2 anchor boxes and 3 classes gives an output of size $3\times3\times2\times8=3\times3\times16$.

(Manually choose the shapes of the anchor boxes.)

#### Region proposals

Paper link:Rich feature hierarchies for accurate object detection and semantic segmentation
Instead of running sliding windows over and over again, use a segmentation algorithm to propose regions that may contain objects.

R-CNN: Propose regions. Classify proposed regions one at a time. Output label + bounding box.
Fast R-CNN: Propose regions. Use convolution implementation of sliding windows to classify all the proposed regions.
Faster R-CNN: Use convolutional network to propose regions.

### Week Four

#### Face Recognition

Face verification & Face recognition
Verification:
· Input image, name/ID
· Output whether the input image is that of the claimed person.
This is a 1:1 matching problem.

Recognition:
· Has a database of K persons
· Get an input image
· Output ID if the image is any of the K persons(or “not recognized”)
This is a 1:K matching problem.
(Recognition therefore demands much higher accuracy on each single comparison.)

Face verification requires comparing a new picture against one person's face, whereas face recognition requires comparing a new picture against K persons' faces.

##### One-shot learning

Learning from one example to recognize the person again. The idea is learning a “similarity” function. (A bit similar to recommendation system.)
d(img1, img2) = degree of difference between images.
If $d(img1,img2)\le\tau$, the output is same; else the output is different.

##### Siamese network

Parameters of NN define an encoding $f(x^{(i)})$. (Use a vector to represent the image x)
Goal: Learn parameters so that
if $x^{(i)},x^{(j)}$ are the same person, $||f(x^{(i)})-f(x^{(j)})||^2$ is small;
if $x^{(i)},x^{(j)}$ are different person, $||f(x^{(i)})-f(x^{(j)})||^2$ is large.

##### Triplet loss

Pick an anchor image(denoted as “A”), a positive image(denoted as “P”) and a negative image(denoted as “N”).
We can calculate the distances between A and P, and between A and N.

We want that

$$||f(A)-f(P)||^2+\alpha\le||f(A)-f(N)||^2$$

where α is called the margin.

Loss function:

$$\mathcal{L}(A,P,N)=\max(||f(A)-f(P)||^2-||f(A)-f(N)||^2+\alpha,\,0)$$

About choosing the triplets A,P,N
During training, if A,P,N are chosen randomly, $d(A,P)+\alpha\le d(A,N)$ is easily satisfied. Therefore, the gradient descent wouldn’t make much progress.
Thus, choose triplets that are “hard” to train on. That is, pick A,P,N such that $d(A,P)\approx d(A,N)$
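A numpy sketch of the triplet loss over a batch of encodings (shapes hypothetical):

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """Triplet loss: max(d(A,P) - d(A,N) + alpha, 0), summed over the batch.
    f_a, f_p, f_n: (m, d) encodings of anchor, positive, negative images."""
    pos_dist = np.sum((f_a - f_p) ** 2, axis=1)   # d(A, P), squared L2
    neg_dist = np.sum((f_a - f_n) ** 2, axis=1)   # d(A, N), squared L2
    return float(np.sum(np.maximum(pos_dist - neg_dist + alpha, 0.0)))

# An "easy" triplet: d(A,P) = 0 and d(A,N) = 4, so the margin is satisfied
# and the loss is 0, which is why random triplets give little gradient signal.
f_a = np.array([[0.0, 0.0]])
f_p = np.array([[0.0, 0.0]])
f_n = np.array([[2.0, 0.0]])
```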

#### Neural Style Transfer

##### Neural style transfer cost function

The input contains content image(denoted as C) and style image(denoted as S), and the output is the generated image(denoted as G).

To find the generated image G:
1. Initialize G randomly (e.g. with white noise)
2. Use gradient descent to minimize J(G): $G:=G-\frac{\partial}{\partial G}J(G)$

Content cost function

• Say you use hidden layer l to compute content cost.
• Use pre-trained ConvNet. (E.g., VGG network)
• Let $a^{[l](C)}$ and $a^{[l](G)}$ be the activation of layer l on the images.
• If $a^{[l](C)}$ and $a^{[l](G)}$ are similar, both images have similar content.

Style cost function
Say you are using layer l‘s activation to measure style.
Define style as correlation between activations across channels.

Let $a^{[l]}_{i,j,k}$ = activation at (i,j,k). $G^{[l]}$ is $n_C^{[l]}\times n_C^{[l]}$, with entries

$$G^{[l]}_{kk'}=\sum_{i=1}^{n_H^{[l]}}\sum_{j=1}^{n_W^{[l]}}a^{[l]}_{ijk}\,a^{[l]}_{ijk'}$$

The style matrix is also called a "Gram matrix". In linear algebra, the Gram matrix G of a set of vectors $(v_1,...,v_n)$ is the matrix of dot products, whose entries are $G_{ij}=v_i^Tv_j=\texttt{np.dot}(v_i,v_j)$. In other words, $G_{ij}$ measures how similar $v_i$ is to $v_j$: if they are highly similar, you would expect them to have a large dot product, and thus $G_{ij}$ to be large.
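Computing the Gram matrix from one layer's activations is a reshape and a matrix product (a sketch):

```python
import numpy as np

def gram_matrix(a):
    """Style (Gram) matrix of one layer's activations.
    a: (n_H, n_W, n_C) -> G: (n_C, n_C), where G[k, k'] sums
    a[..., k] * a[..., k'] over all spatial positions."""
    n_H, n_W, n_C = a.shape
    A = a.reshape(n_H * n_W, n_C)   # unroll spatial positions into rows
    return A.T @ A                  # channel-to-channel correlations

a = np.random.default_rng(0).standard_normal((4, 4, 3))
G = gram_matrix(a)
```

G is symmetric by construction, and its diagonal measures how active each channel is on its own.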

The style of an image can be represented using the Gram matrix of a hidden layer’s activations. However, we get even better results combining this representation from multiple different layers. This is in contrast to the content representation, where usually using just a single hidden layer is sufficient.
Minimizing the style cost will cause the image G to follow the style of the image S.

## Sequence Model

### Week One

#### Recurrent Neural Networks

Notation
- $x^{<t>}$: the input object at the t'th timestep
- $y^{<t>}$: the output at the t'th timestep
- t: implies that these are temporal sequences
- $T_x$: the length of the input sequence
- $T_y$: the length of the output sequence
- $T_x^{(i)}$: the input length of the i'th training example
- $T_y^{(i)}$: the output length of the i'th training example
- $x^{(i)<t>}$: the input at the t'th timestep of example i
- $y^{(i)<t>}$: the output at the t'th timestep of example i

One-hot representation
Use a large vector (as long as the dictionary, which contains tens of thousands of words) to represent a word. Only one element is one (at the word's position in the dictionary) and all the others are zero.
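A one-hot vector takes a couple of lines to build (the vocabulary size and word index below are made up):

```python
import numpy as np

def one_hot(index, vocab_size):
    """One-hot column vector: 1 at the word's dictionary position, 0 elsewhere."""
    o = np.zeros((vocab_size, 1))
    o[index, 0] = 1.0
    return o

# A toy 10,000-word vocabulary with some word at (hypothetical) position 4523.
o_word = one_hot(4523, 10000)
```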

Why not a standard network?
Problems:
• Inputs and outputs can be different lengths in different examples. (Different sentences have different lengths.)
• A standard network doesn't share features learned across different positions of the text. (A word may appear at several positions in a sentence; its features shouldn't have to be relearned at each position.)

RNN cell
A basic RNN cell takes as input $x^{<t>}$ (the current input) and $a^{<t-1>}$ (the previous hidden state containing information from the past), and outputs $a^{<t>}$, which is given to the next RNN cell and also used to predict $\hat y^{<t>}$.

#### Forward Propagation

Here the weight W has two subscripts: the former corresponds to the result and the latter to the operand it multiplies by, e.g. $a\leftarrow W_{ax}x$.

$$a^{<t>}=g_1(W_{aa}a^{<t-1>}+W_{ax}x^{<t>}+b_a)$$
$$\hat y^{<t>}=g_2(W_{ya}a^{<t>}+b_y)$$

The activation function $g_1()$ usually uses tanh, sometimes ReLU.
The $g_2()$ function uses sigmoid to make binary classification.

The formulas can be simplified as follows:

$$a^{<t>}=g(W_a[a^{<t-1>},x^{<t>}]+b_a)$$
$$\hat y^{<t>}=g(W_y a^{<t>}+b_y)$$

Here, $W_a=\begin{bmatrix}W_{aa}&|&W_{ax}\end{bmatrix}$, and $[a^{<t-1>},x^{<t>}]=\begin{bmatrix}a^{<t-1>}\\x^{<t>}\end{bmatrix}$
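The forward-propagation formulas map directly to a few lines of numpy (dimensions hypothetical):

```python
import numpy as np

def rnn_cell_forward(x_t, a_prev, Waa, Wax, Wya, ba, by):
    """One timestep of a basic RNN. x_t: (n_x, m); a_prev: (n_a, m)."""
    a_t = np.tanh(Waa @ a_prev + Wax @ x_t + ba)     # g1 = tanh
    y_t = 1.0 / (1.0 + np.exp(-(Wya @ a_t + by)))    # g2 = sigmoid
    return a_t, y_t

n_x, n_a, n_y, m = 3, 5, 2, 10
rng = np.random.default_rng(1)
a, y = rnn_cell_forward(rng.standard_normal((n_x, m)),
                        np.zeros((n_a, m)),            # a<0> = zero vector
                        rng.standard_normal((n_a, n_a)),
                        rng.standard_normal((n_a, n_x)),
                        rng.standard_normal((n_y, n_a)),
                        np.zeros((n_a, 1)), np.zeros((n_y, 1)))
```

Unrolling the RNN over $T_x$ timesteps just calls this cell in a loop, feeding each `a_t` into the next step.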

#### Different types

One to one Usage: Simple neural network

One to many Usage: Music generation, sequence generation

Many to one Usage: Sentiment classification

Many to many (I) Usage: Name entity recognition

Many to many (II) Usage: Machine translation

#### Language model

`<UNK>`: unknown words (words not in the vocabulary)
`<EOS>`: end of sentence

A language model computes the probability of a sentence using an RNN: each timestep's output is a probability distribution over the next word given the previous ones.
E.g. given the sentence "Cats average 15 hours of sleep a day.", $\hat y^{<1>}=P(\text{cats})$ (the probability that 'cats' appears at the beginning of the sentence); $\hat y^{<2>}=P(\text{average}|\text{cats})$ (conditional probability); ...; $\hat y^{<9>}=P(\text{<EOS>}|...)$.

Character-level language model
Instead of using words, a character-level model generates sequences of characters. It is more computationally expensive.

#### Gated Recurrent Unit(GRU)

The basic RNN unit: g() is tanh function.

GRU (simplified): instead of using $a^{<t>}$, use $c^{<t>}$ (though in the GRU, $a^{<t>}=c^{<t>}$). Here c stands for the memory cell.

u: update. r: relevance.
Gate $\Gamma_u$ is a vector of dimension equal to the number of hidden units.
Gate $\Gamma_r$ tells you how relevant $c^{<t-1>}$ is to computing the next candidate for $c^{<t>}$.

#### Long Short Term Memory(LSTM) Unit

Difference between LSTM and GRU: the LSTM came earlier, and the GRU can be regarded as a simplification of the LSTM.

Forget Gate
For the sake of this illustration, let's assume we are reading words in a piece of text and want to use an LSTM to keep track of grammatical structures, such as whether the subject is singular or plural. If the subject changes from a singular word to a plural word, we need a way to get rid of our previously stored memory value of the singular/plural state. In an LSTM, the forget gate lets us do this:

$$\Gamma_f^{\langle t \rangle}=\sigma(W_f[a^{\langle t-1 \rangle},x^{\langle t \rangle}]+b_f)$$

Here, $W_f$ are weights that govern the forget gate’s behavior. We concatenate $[a^{⟨t−1⟩},x^{⟨t⟩}]$ and multiply by $W_f$. The equation above results in a vector $\Gamma_f^{\langle t \rangle}$ with values between 0 and 1. This forget gate vector will be multiplied element-wise by the previous cell state $c^{\langle t-1 \rangle}$. So if one of the values of $\Gamma_f^{\langle t \rangle}$ is 0 (or close to 0) then it means that the LSTM should remove that piece of information (e.g. the singular subject) in the corresponding component of $c^{\langle t-1 \rangle}$. If one of the values is 1, then it will keep the information.

Update Gate
Once we forget that the subject being discussed is singular, we need to find a way to update it to reflect that the new subject is now plural. Here is the formula for the update gate:

$$\Gamma_u^{\langle t \rangle}=\sigma(W_u[a^{\langle t-1 \rangle},x^{\langle t \rangle}]+b_u)$$

Similar to the forget gate, here $\Gamma_u^{\langle t \rangle}$ is again a vector of values between 0 and 1. This will be multiplied element-wise with $\tilde{c}^{\langle t \rangle}$, in order to compute $c^{\langle t \rangle}$.

Updating the cell
To update the new subject we need to create a new candidate vector that we can add to our previous cell state. The equation we use is:

$$\tilde{c}^{\langle t \rangle}=\tanh(W_c[a^{\langle t-1 \rangle},x^{\langle t \rangle}]+b_c)$$

Finally, the new cell state is:

$$c^{\langle t \rangle}=\Gamma_f^{\langle t \rangle}*c^{\langle t-1 \rangle}+\Gamma_u^{\langle t \rangle}*\tilde{c}^{\langle t \rangle}$$

Output gate
To decide which outputs we will use, we use the following two formulas:

$$\Gamma_o^{\langle t \rangle}=\sigma(W_o[a^{\langle t-1 \rangle},x^{\langle t \rangle}]+b_o)$$
$$a^{\langle t \rangle}=\Gamma_o^{\langle t \rangle}*\tanh(c^{\langle t \rangle})$$

In the first equation you decide what to output using a sigmoid function, and in the second you multiply that by the tanh of the new cell state.
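Putting the gate equations together, one LSTM timestep can be sketched in numpy (the parameter-dictionary layout and dimensions are my assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_forward(x_t, a_prev, c_prev, params):
    """One LSTM timestep following the forget/update/output-gate equations.
    x_t: (n_x, m); a_prev, c_prev: (n_a, m)."""
    concat = np.concatenate([a_prev, x_t], axis=0)            # [a<t-1>, x<t>]
    gamma_f = sigmoid(params['Wf'] @ concat + params['bf'])   # forget gate
    gamma_u = sigmoid(params['Wu'] @ concat + params['bu'])   # update gate
    c_tilde = np.tanh(params['Wc'] @ concat + params['bc'])   # candidate cell
    c_t = gamma_f * c_prev + gamma_u * c_tilde                # new cell state
    gamma_o = sigmoid(params['Wo'] @ concat + params['bo'])   # output gate
    a_t = gamma_o * np.tanh(c_t)                              # new hidden state
    return a_t, c_t

n_x, n_a, m = 3, 4, 2
rng = np.random.default_rng(0)
params = {k: rng.standard_normal((n_a, n_a + n_x)) for k in ('Wf', 'Wu', 'Wc', 'Wo')}
params.update({k: np.zeros((n_a, 1)) for k in ('bf', 'bu', 'bc', 'bo')})
a, c = lstm_cell_forward(rng.standard_normal((n_x, m)),
                         np.zeros((n_a, m)), np.zeros((n_a, m)), params)
```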

### Week Two - Natural Language Processing & Word Embeddings

Transfer learning and word embeddings
1.Learn word embeddings from large text corpus. (1-100B words)
2.Transfer embedding to new task with smaller training set. (say, 100k words)
3.Optional: Continue to finetune the word embeddings with new data.

Computation of Similarities:
Cosine similarity: $sim(u,v)=\frac{u^Tv}{||u||_2||v||_2}$
Euclidean distance: $||u-v||^2$
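Cosine similarity in numpy (a sketch):

```python
import numpy as np

def cosine_similarity(u, v):
    """sim(u, v) = u.v / (||u||_2 ||v||_2): 1 for the same direction,
    -1 for opposite directions, near 0 for unrelated directions."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

u = np.array([1.0, 2.0, 3.0])
```

This is the similarity used for analogy tasks such as finding the word vector closest to $e_{king}-e_{man}+e_{woman}$.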

Embedding matrix
The embedding matrix is denoted as E.
The embedding for word j can be calculated as $e_j=E\cdot o_j$.
Here, e stands for embedding and o for one-hot. In practice, a specialized lookup function is used to fetch the embedding rather than a costly matrix multiplication.

Context/target pairs
Context:
· Last 4 words
· 4 words on left & right
· Last 1 word
· Nearby 1 word

#### Word2Vec

Using skip-grams:

Here, $e_c=E\cdot o_c$ and $\theta_t$ is the parameter associated with output t.

Problems:
The cost of computing $p(t|c)=\frac{e^{\theta_t^Te_c}}{\sum_{j=1}^{10000}e^{\theta_j^Te_c}}$ is too high, because the denominator sums over the whole vocabulary.
Solution:
Use hierarchical softmax.

#### Negative sampling

Randomly choose k+1 examples, where only 1 example is positive and the remaining k are negative. (The value of k depends on the size of the dataset: for a large dataset, k = 2-5; for a small dataset, k = 5-20.)
Instead of using softmax, compute k times binary classification to reduce the computation.

#### GloVe word vectors

- $x_{ij}$: the number of times word i appears in the context of word j. With a symmetric context window, $x_{ij}=x_{ji}$.

#### Applications using Word Embeddings

Sentiment Classification and Debiasing.

### Week Three

Machine translation can be regarded as a conditional language model. The original language model computes the probability $P(y^{<1>},...,y^{<T_y>})$, while machine translation computes the conditional probability $P(y^{<1>},...,y^{<T_y>}|x^{<1>},...,x^{<T_x>})$.
The machine translation model contains two parts: an encoder and a decoder.

#### Beam search

Instead of simply picking the most likely word at each step (greedy search), find the most likely translation overall.

Pick a hyperparameter B (the beam width). At each step of the decoder, keep the B most probable partial outputs.
The probability can be decomposed as:

$$P(y^{<1>},...,y^{<T_y>}|x)=P(y^{<1>}|x)\,P(y^{<2>}|x,y^{<1>})\cdots P(y^{<T_y>}|x,y^{<1>},...,y^{<T_y-1>})$$

(Beam search with B=1 is greedy search.)

Length normalization

Each probability is in [0,1], so the product of many small values can become extremely small. To avoid this, maximize the sum of logs instead:

$$\arg\max_y\sum_{t=1}^{T_y}\log P(y^{<t>}|x,y^{<1>},...,y^{<t-1>})$$

The model tends to prefer short translations, since they multiply fewer small factors, while a too-short translation is not satisfying. Therefore, normalize by the length with another hyperparameter α:

$$\frac{1}{T_y^\alpha}\sum_{t=1}^{T_y}\log P(y^{<t>}|x,y^{<1>},...,y^{<t-1>})$$
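A minimal numpy sketch of the length-normalized score (names hypothetical; the per-token conditional probabilities are assumed given):

```python
import numpy as np

def normalized_score(token_probs, alpha=0.7):
    """(1 / T_y^alpha) * sum_t log P(y<t> | x, y<1..t-1>) for one candidate."""
    T_y = len(token_probs)
    return float(np.sum(np.log(token_probs))) / (T_y ** alpha)

# The raw log-probability favors the short candidate (fewer factors),
# while the normalized score favors the longer, per-token better one.
short = [0.4]
long_ = [0.7, 0.7, 0.7, 0.7, 0.7]
```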

Unlike exact search algorithms like BFS or DFS, Beam Search runs faster but is not guaranteed to find exact maximum for arg max P(y|x).

Error analysis
There are two main components in machine translation: the RNN model and the beam search. If the error is high, we want to figure out which component is at fault.
Use $y^*$ to represent the human translation and $\hat y$ the machine's.

Case 1: $P(y^*|x)>P(\hat y|x)$
Beam search chose $\hat y$. But $y^*$ attains higher P(y|x).
Conclusion: Beam search is at fault.

Case 2: $P(y^*|x)\le P(\hat y|x)$
In fact, $y^*$ is a better translation than $\hat y$. But RNN predicted $P(y^*|x)\le P(\hat y|x)$
Conclusion: RNN model is at fault.

#### Bleu score

Here, $p_n$ = Bleu score on n-grams only (modified n-gram precision).
Combined Bleu score:

$$BP\cdot\exp\left(\frac{1}{4}\sum_{n=1}^{4}\log p_n\right)$$

BP is the brevity penalty:

$$BP=\begin{cases}1&\text{if MT output length}>\text{reference length}\\\exp(1-\text{reference length}/\text{MT output length})&\text{otherwise}\end{cases}$$

#### Attention model

Here, $\alpha^{<t,t'>}$ = the amount of attention $y^{<t>}$ should pay to $a^{<t'>}$.