Neural Networks and Deep Learning
Week One
(Neural networks are also introduced in the Machine Learning course; see my learning note there.)
House price prediction can be regarded as the simplest neural network:
The function can be a ReLU (Rectified Linear Unit), which we’ll see a lot.
This is a single neuron. A larger neural network is then formed by taking many of the single neurons and stacking them together.
Almost all the economic value created by neural networks has been through supervised learning.
| Input (x) | Output (y) | Application | Neural Network |
| --- | --- | --- | --- |
| House features | Price | Real estate | Standard NN |
| Ad, user info | Click on ad? (0/1) | Online advertising | Standard NN |
| Photo | Object (index 1, …, 1000) | Photo tagging | CNN |
| Audio | Text transcript | Speech recognition | RNN |
| English | Chinese | Machine translation | RNN |
| Image, radar info | Position of other cars | Autonomous driving | Custom/Hybrid |
Neural Network examples
CNN: often for image data
RNN: often for one-dimensional sequence data
Structured data and Unstructured data
Scale drives deep learning progress
Scale: both the size of the neural network and the scale of the data.
 Data
 Computation
 Algorithms
Using ReLU instead of sigmoid function as activation function can improve efficiency.
Week Two
Notation
(x,y): a single training example.
m training examples:
Take the training set inputs x⁽¹⁾, x⁽²⁾ and so on and stack them as columns of a matrix X. (This makes the implementation much easier than stacking them as rows and using X’s transpose.)
Logistic Regression
Differences with former course
Notation is a bit different from what is introduced in Machine Learning(note).
Originally, we add $x_0 = 1$ so that $h_\theta(x) = \theta^T x$,
where $\theta = [\theta_0, \theta_1, \dots, \theta_{n_x}]^T$.
Here in the Deep Learning course, we use $b$ to represent $\theta_0$, and $w$ to represent $[\theta_1, \dots, \theta_{n_x}]^T$. Just keep $b$ and $w$ as separate parameters.
Given $x$, want $\hat{y} = P(y = 1 \mid x)$.
Parameters: $w \in \mathbb{R}^{n_x}$, $b \in \mathbb{R}$
Output: $\hat{y} = \sigma(w^T x + b)$
$\sigma(\cdot)$ is the sigmoid function: $\sigma(z) = \frac{1}{1 + e^{-z}}$
Cost Function
If $y = 1$: $\mathcal{L}(\hat{y}, y) = -\log \hat{y}$
If $y = 0$: $\mathcal{L}(\hat{y}, y) = -\log(1 - \hat{y})$
Loss function: $\mathcal{L}(\hat{y}, y) = -\big(y \log \hat{y} + (1 - y)\log(1 - \hat{y})\big)$
Cost function: $J(w, b) = \frac{1}{m}\sum_{i=1}^{m}\mathcal{L}(\hat{y}^{(i)}, y^{(i)})$
Loss function is applied to just a single training example.
Cost function is the cost of your parameters, it is the average of the loss functions of the entire training set.
Gradient Descent
Usually initialize the value to zero in logistic regression. Random initialization also works, but people don’t usually do that for logistic regression.
Repeat {
$w := w - \alpha \frac{\partial J(w, b)}{\partial w}$
$b := b - \alpha \frac{\partial J(w, b)}{\partial b}$
}
From forward propagation, we calculate z, a and finally the loss $\mathcal{L}(a, y)$.
From back propagation, we calculate the derivatives step by step: $da$, $dz$, then $dw$ and $db$.
Algorithm
(Repeat)
J = 0; dw1, dw2, …, dwn = 0; db = 0
for i = 1 to m:
    z(i) = wᵀx(i) + b
    a(i) = σ(z(i))
    J += −[ y(i) log a(i) + (1 − y(i)) log(1 − a(i)) ]
    dz(i) = a(i) − y(i)
    for j = 1 to n:
        dwj += xj(i) · dz(i)
    db += dz(i)
J /= m;
dw1, dw2, …, dwn /= m;
db /= m
w1 := w1 − α·dw1
w2 := w2 − α·dw2
b := b − α·db
In the for loop, there is no superscript i on the dw variables, because dw in the code is an accumulator over the whole training set, while dz refers to a single training example.
Vectorization
Original for loop: `for i = 1 to m`, processing one example at a time.
Vectorized:
Z = np.dot(w.T, X) + b
dZ = A - Y
cost = -1 / m * np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A))
db = 1 / m * np.sum(dZ)
dw = 1 / m * np.dot(X, dZ.T)
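A minimal numpy sketch that puts the vectorized lines above together into one gradient-descent step (the variable names `w`, `b`, `X`, `Y`, `alpha` follow the notation in these notes; this is an illustrative sketch, not reference code from the course):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def logistic_step(w, b, X, Y, alpha):
    """One vectorized gradient-descent step.
    X: (n_x, m) inputs, Y: (1, m) labels, w: (n_x, 1), b: scalar."""
    m = X.shape[1]
    Z = np.dot(w.T, X) + b          # (1, m)
    A = sigmoid(Z)                  # predictions
    cost = -1 / m * np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A))
    dZ = A - Y
    dw = 1 / m * np.dot(X, dZ.T)    # (n_x, 1)
    db = 1 / m * np.sum(dZ)
    w = w - alpha * dw
    b = b - alpha * db
    return w, b, cost
```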
About Python
A.sum(axis=0): sum vertically
A.sum(axis=1): sum horizontally
Broadcasting
If an (m, n) matrix is combined (+, −, *, /) with a (1, n) row vector, the vector is expanded vertically to (m, n) by copying it m times.
If an (m, n) matrix is combined with an (m, 1) column vector, the vector is expanded horizontally to (m, n) by copying it n times.
If a row/column vector is combined with a real number, the real number is expanded to a vector of matching shape.
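A small numpy demo of the broadcasting rules above (shapes chosen arbitrarily for illustration):

```python
import numpy as np

A = np.random.randn(3, 4)      # (m, n) matrix
row = np.random.randn(1, 4)    # (1, n) row vector
col = np.random.randn(3, 1)    # (m, 1) column vector

print((A + row).shape)   # (3, 4): row is copied down m times
print((A * col).shape)   # (3, 4): col is copied across n times
print((row + 5).shape)   # (1, 4): the scalar is expanded to match
```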
See the numpy broadcasting documentation.
Rank 1 array: a = np.random.randn(5) creates a rank 1 array whose shape is (5,).
Try to avoid using rank 1 arrays. Use a = a.reshape((5, 1)) or a = np.random.randn(5, 1).
Note that np.dot() performs a matrix-matrix or matrix-vector multiplication. This is different from np.multiply() and the * operator (which is equivalent to .* in MATLAB/Octave), which perform an element-wise multiplication.
Week Three
Neural Network Overview
Superscript with square brackets denotes the layer, superscript with round brackets refers to i’th training example.
Logistic regression can be regarded as the simplest neural network. The neuron takes in the inputs and makes two computations: $z = w^T x + b$ and $a = \sigma(z)$.
A neural network works similarly. (Note that this neural network has 2 layers: when counting layers, the input layer is not included.)
Take the first node in the hidden layer as an example: $z_1^{[1]} = w_1^{[1]T}x + b_1^{[1]}$, $a_1^{[1]} = \sigma(z_1^{[1]})$.
The superscript $[l]$ denotes the layer, and the subscript $i$ represents the node within the layer.
Similarly for the other nodes in the hidden layer.
Vectorization (single example):
$z^{[1]} = W^{[1]}x + b^{[1]}$, $a^{[1]} = \sigma(z^{[1]})$; $z^{[2]} = W^{[2]}a^{[1]} + b^{[2]}$, $a^{[2]} = \sigma(z^{[2]})$
(dimensions: $W^{[1]}: (n^{[1]}, n_x)$, $b^{[1]}: (n^{[1]}, 1)$, $W^{[2]}: (n^{[2]}, n^{[1]})$, $b^{[2]}: (n^{[2]}, 1)$)
Vectorizing across multiple examples: $Z^{[1]} = W^{[1]}X + b^{[1]}$, $A^{[1]} = \sigma(Z^{[1]})$, and likewise for layer 2.
Explanation
Stack the training examples as columns.
Each column of $X$, $Z^{[1]}$, $A^{[1]}$ represents a training example; each row represents a hidden unit.
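A short numpy sketch of the vectorized forward pass for this 2-layer network, assuming a tanh hidden layer and a sigmoid output (the layer sizes and parameter names are placeholders for illustration):

```python
import numpy as np

def forward_2layer(X, W1, b1, W2, b2):
    """X: (n_x, m); W1: (n1, n_x); b1: (n1, 1); W2: (n2, n1); b2: (n2, 1).
    Each column of X / Z1 / A1 corresponds to one training example."""
    Z1 = np.dot(W1, X) + b1          # (n1, m); b1 is broadcast across columns
    A1 = np.tanh(Z1)                 # hidden-layer activation
    Z2 = np.dot(W2, A1) + b2         # (n2, m)
    A2 = 1 / (1 + np.exp(-Z2))       # sigmoid output
    return A1, A2
```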
Activation Function
Sigmoid Function
Only used in a binary classification’s output layer (where the output is 0 or 1).
Not used on other occasions; tanh is a better choice for hidden layers.
tanh Function
With a range of $(-1, 1)$, it performs better than the sigmoid function because the mean of its output is closer to zero.
Both the sigmoid and tanh functions have the disadvantage that when z is very large ($z \to +\infty$) or very small ($z \to -\infty$), the derivative is close to 0, so gradient descent becomes very slow.
ReLU
Default choice of activation function.
With a derivative of 1 when z is positive, it performs well in practice.
(Although $g'(z) = 0$ when z is negative, and technically the derivative at $z = 0$ is not well-defined.)
Leaky ReLU
Makes sure that the derivative is not equal to 0 when z < 0.
Linear Activation Function
Also called the identity function.
Not used in the hidden layers of a neural network, because a stack of linear layers still computes a linear function. It is only used in the output layer when the output is a real number (regression).
Gradient descent
Forward propagation:
$Z^{[1]} = W^{[1]}X + b^{[1]}$, $A^{[1]} = g^{[1]}(Z^{[1]})$, $Z^{[2]} = W^{[2]}A^{[1]} + b^{[2]}$, $A^{[2]} = \sigma(Z^{[2]})$
Backward propagation:
$dZ^{[2]} = A^{[2]} - Y$
$dW^{[2]} = \frac{1}{m} dZ^{[2]} A^{[1]T}$, $db^{[2]} = \frac{1}{m}$ np.sum($dZ^{[2]}$, axis=1, keepdims=True)
$dZ^{[1]} = W^{[2]T} dZ^{[2]} * g^{[1]\prime}(Z^{[1]})$
$dW^{[1]} = \frac{1}{m} dZ^{[1]} X^{T}$, $db^{[1]} = \frac{1}{m}$ np.sum($dZ^{[1]}$, axis=1, keepdims=True)
Note: keepdims=True makes sure that Python won’t produce a rank-1 array with shape (n,). Here * is the element-wise product. Dimensions: $Z^{[1]}$, $A^{[1]}$ and $dZ^{[1]}$ are all $(n^{[1]}, m)$.
Random Initialization
In logistic regression, it’s okay to initialize all parameters to zero. However, it’s not feasible in neural network.
Instead, initialize w with small random values to break symmetry. It’s okay to initialize b to zeros; symmetry is still broken so long as w is initialized randomly (see the sketch after this section).
Random
If the parameters w are all zeros, then the neurons in a hidden layer are symmetric (“identical”), and even after gradient descent they stay identical to each other. So use random initialization.
Small
Both the sigmoid and tanh functions have their greatest derivative at z = 0. If z had a large magnitude, the derivative would be close to zero and gradient descent would be slow. Thus it’s a good choice to make the initial weights small.
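A minimal sketch of this initialization scheme in numpy (the layer sizes here are hypothetical examples):

```python
import numpy as np

n_x, n_h, n_y = 2, 4, 1   # example layer sizes (hypothetical)

# Small random weights break symmetry; zero biases are fine.
W1 = np.random.randn(n_h, n_x) * 0.01
b1 = np.zeros((n_h, 1))
W2 = np.random.randn(n_y, n_h) * 0.01
b2 = np.zeros((n_y, 1))
```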
Week Four
Deep neural network notation
$L$: number of layers
$n^{[l]}$: number of units in layer l
$a^{[l]}$: activations in layer l
($a^{[l]} = g^{[l]}(z^{[l]})$)
$W^{[l]}$: weights for $z^{[l]}$
$b^{[l]}$: bias for $z^{[l]}$
Forward Propagation
for l = 1 to L:
$Z^{[l]} = W^{[l]}A^{[l-1]} + b^{[l]}$, $A^{[l]} = g^{[l]}(Z^{[l]})$
Well, this explicit for loop over the layers is inevitable.
Matrix Dimensions
$W^{[l]}, dW^{[l]}$: $(n^{[l]}, n^{[l-1]})$
$b^{[l]}, db^{[l]}$: $(n^{[l]}, 1)$ (here the dimension can be expanded to $(n^{[l]}, m)$ with Python’s broadcasting)
$Z^{[l]}, A^{[l]}$: $(n^{[l]}, m)$
$dZ^{[l]}, dA^{[l]}$: same dimensions as $Z^{[l]}, A^{[l]}$
cache
Cache is used to pass variables computed during forward propagation to the corresponding backward propagation step. It contains useful values for backward propagation to compute derivatives.
Why deep representations?
Informally: There are functions you can compute with a “small” Llayer deep neural network that shallower networks require exponentially more hidden units to compute.
Forward and Backward Propagation
Forward propagation for layer l
Input: $A^{[l-1]}$
Output: $A^{[l]}$, cache ($Z^{[l]}$, plus $W^{[l]}$, $b^{[l]}$)
Backward propagation for layer l
Input: $dA^{[l]}$ (and the cache)
Output: $dA^{[l-1]}$, $dW^{[l]}$, $db^{[l]}$
Hyperparameters and Parameters
Hyperparameters determine the final values of the parameters.
Parameters
· $W^{[l]}$, $b^{[l]}$
Hyperparameters
· learning rate
· number of iterations
· number of hidden layers
· number of hidden units
· choice of activation function
· momentum, minibatch size, regularizations, etc.
Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization
Week One
Setting up your Machine Learning Application
Train/dev/test sets
Training set:
Keep on training algorithms on the training sets.
Development set
Also called Holdout cross validation set, Dev set for short.
Use dev set to see which of many different models performs best on the dev set.
Test set
To get an unbiased estimate of how well your algorithm is doing.
Proportion
Previous era: when the amount of data was not too large, it was common to take all the data and split it 70%/30% (train/test) or 60%/20%/20% (train/dev/test).
Big data: with millions of examples, 10,000 examples for the dev set and 10,000 for the test set are enough. The proportion can be 98/1/1 or even 99.5/0.4/0.1.
Notes
Make sure dev set and test set come from same distribution.
Not having a test set might be okay if it’s not necessary to get an unbiased estimate of performance. Though dev set is called ‘test set’ if there’s no real test set.
Bias/Variance
Solutions
High bias:
Bigger network
Train longer
(Neural network architecture search)
High variance:
More data
Regularization
(Neural network architecture search)
Bias Variance tradeoff
Originally, reducing bias may increase variance, and vice versa. So it’s necessary to tradeoff between bias and variance.
But in deep learning, there’re ways to reduce one without increasing another. So don’t worry about bias variance tradeoff.
Regularization
L2 Regularization
Logistic regression
L2regularization relies on the assumption that a model with small weights is simpler than a model with large weights. Thus, by penalizing the square values of the weights in the cost function you drive all the weights to smaller values. It becomes too costly for the cost to have large weights! This leads to a smoother model in which the output changes more slowly as the input changes.
Weights end up smaller(“weight decay”): Weights are pushed to smaller values.
L2 regularization: $J(w,b) = \frac{1}{m}\sum_{i=1}^{m}\mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m}\|w\|_2^2$
L1 regularization: $J(w,b) = \frac{1}{m}\sum_{i=1}^{m}\mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m}\|w\|_1$
(L1 regularization leads w to be sparse, but it is not very effective in practice.)
Neural network
$J(W^{[1]},b^{[1]},\dots,W^{[L]},b^{[L]}) = \frac{1}{m}\sum_{i=1}^{m}\mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m}\sum_{l=1}^{L}\|W^{[l]}\|_F^2$, where $\|W^{[l]}\|_F^2 = \sum_i\sum_j (W_{ij}^{[l]})^2$ is called the Frobenius norm, a different convention from the Euclidean norm used for vectors.
Back propagation: $dW^{[l]} = (\text{term from backprop}) + \frac{\lambda}{m}W^{[l]}$, so each update shrinks the weights slightly.
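A short numpy sketch of the L2-regularized cost described above (the function and argument names are placeholders for illustration, not the assignment's exact API):

```python
import numpy as np

def cost_with_l2(A_L, Y, weights, lambd):
    """Cross-entropy cost plus the Frobenius-norm penalty.
    A_L: (1, m) output activations, Y: (1, m) labels,
    weights: list of W[l] matrices, lambd: regularization strength."""
    m = Y.shape[1]
    cross_entropy = -1 / m * np.sum(Y * np.log(A_L) + (1 - Y) * np.log(1 - A_L))
    l2_penalty = lambd / (2 * m) * sum(np.sum(np.square(W)) for W in weights)
    return cross_entropy + l2_penalty

# In backprop, each weight gradient gets an extra term: dW[l] += (lambd / m) * W[l]
```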
Dropout regularization
With dropout, what we’re going to do is go through each layer of the network and set some probability of eliminating each node.
For each training example, you train it using one of these thinned (reduced) networks.
The idea behind dropout is that at each iteration, you train a different model that uses only a subset of your neurons. With dropout, your neurons thus become less sensitive to the activation of one other specific neuron, because that other neuron might be shut down at any time.
Usually used in Computer Vision.
Implementation with layer 3 (“inverted dropout”) — see the sketch after the notes below.
 Dropout is a regularization technique.
 You only use dropout during training. Don’t use dropout (randomly eliminate nodes) during test time.
 Apply dropout both during forward and backward propagation.
 During training time, divide each dropout layer by keep_prob to keep the same expected value for the activations. For example, if keep_prob is 0.5, then we will on average shut down half the nodes, so the output will be scaled by 0.5 since only the remaining half are contributing to the solution. Dividing by 0.5 is equivalent to multiplying by 2. Hence, the output now has the same expected value. You can check that this works even when keep_prob is other values than 0.5.
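A minimal sketch of inverted dropout applied to layer 3, as described in the notes above (training time only; `A3` and `keep_prob` follow the course notation, the helper name is made up):

```python
import numpy as np

def dropout_forward(A3, keep_prob):
    """Inverted dropout on the activations of layer 3 (training time only)."""
    D3 = np.random.rand(A3.shape[0], A3.shape[1]) < keep_prob  # dropout mask
    A3 = A3 * D3            # shut down a fraction (1 - keep_prob) of the units
    A3 = A3 / keep_prob     # scale up so the expected value of A3 is unchanged
    return A3, D3           # keep D3 to apply the same mask during backprop
```

During backward propagation, the same mask is applied: `dA3 = dA3 * D3; dA3 = dA3 / keep_prob`.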
Other regularization methods
Data augmentation
Take image input as an example: flipping the image horizontally, random rotations and zooms, distortions, etc.
This gives a larger training set cheaply and helps reduce overfitting.
Early stopping
Stop training early so that the weights w are still relatively small (a similar effect to L2 regularization).
Early stopping violates orthogonalization, which suggests treating “optimize the cost function J” and “do not overfit” as separate tasks.
Optimization
Normalizing inputs
Subtract the mean: $\mu = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}$, $x := x - \mu$
Normalize the variance: $\sigma^2 = \frac{1}{m}\sum_{i=1}^{m} (x^{(i)})^2$ (element-wise), $x := x / \sigma$
Note: use the same $\mu$ and $\sigma^2$ to normalize the test set.
Intuition: with normalized features the cost surface is more symmetric, so gradient descent can use a larger learning rate and converge faster.
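A small numpy sketch of this normalization, emphasizing that the statistics fitted on the training set are reused on dev/test data (function names are illustrative):

```python
import numpy as np

def normalize_fit(X_train):
    """X_train: (n_x, m). Returns per-feature mean and std."""
    mu = np.mean(X_train, axis=1, keepdims=True)
    sigma = np.std(X_train, axis=1, keepdims=True)
    return mu, sigma

def normalize_apply(X, mu, sigma):
    # The same mu / sigma from the training set must be reused on dev/test sets.
    return (X - mu) / sigma
```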
Vanishing/Exploding gradients
Since the number of layers L in deep learning may be quite large, a product of L factors may explode or vanish (just think about repeatedly multiplying by a number slightly greater than 1, or slightly less than 1).
Weight initialization for deep networks
Take a single neuron as an example: $z = w_1x_1 + w_2x_2 + \dots + w_nx_n$.
If n is large, then we want each $w_i$ to be smaller, so that z doesn’t blow up. Our goal is to keep the scale of z roughly constant across layers.
Random initialization for ReLU: $W^{[l]}$ = np.random.randn(shape) * np.sqrt($2 / n^{[l-1]}$) (known as He initialization, named for the first author of He et al., 2015).
For tanh: use np.sqrt($1 / n^{[l-1]}$).
Xavier initialization; another variant uses np.sqrt($2 / (n^{[l-1]} + n^{[l]})$).
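A minimal sketch of He initialization for an L-layer network (the dictionary layout mirrors the course convention, but the function name is illustrative):

```python
import numpy as np

def initialize_he(layer_dims):
    """layer_dims, e.g. [n_x, n1, ..., n_L]. He initialization for ReLU layers."""
    params = {}
    for l in range(1, len(layer_dims)):
        params["W" + str(l)] = (np.random.randn(layer_dims[l], layer_dims[l - 1])
                                * np.sqrt(2 / layer_dims[l - 1]))
        params["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return params
```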
Gradient checking
Take $W^{[1]}, b^{[1]}, \dots, W^{[L]}, b^{[L]}$ and reshape them into a big vector $\theta$.
Take $dW^{[1]}, db^{[1]}, \dots, dW^{[L]}, db^{[L]}$ and reshape them into a big vector $d\theta$.
for each i:
$d\theta_{approx}[i] = \frac{J(\theta_1,\dots,\theta_i+\varepsilon,\dots) - J(\theta_1,\dots,\theta_i-\varepsilon,\dots)}{2\varepsilon}$
Check whether $d\theta_{approx} \approx d\theta$.
Calculate $\frac{\|d\theta_{approx} - d\theta\|_2}{\|d\theta_{approx}\|_2 + \|d\theta\|_2}$ (a value around $10^{-7}$ is great).
Note
Gradient checking verifies closeness between the gradients from backpropagation and the numerical approximation of the gradient (computed using forward propagation).
Gradient checking is slow, so we don’t run it in every iteration of training. You would usually run it only to make sure your code is correct, then turn it off and use backprop for the actual learning process.
 Don’t use in training  only to debug.
 If algorithm fails grad check, look at components to try to identify bug.
 Remember regularization.
 Doesn’t work with dropout.
 Run at random initialization; perhaps again after some training.
Week Two
Minibatch gradient descent
Batch gradient descent (the original gradient descent we’ve seen) processes the entire training set and then updates the parameters by just one step; if the training set is very large, training is slow. The idea of mini-batch gradient descent is to use only part of the training set per step, so the parameters are updated more often.
For example, if X’s dimension is $(n_x, 5{,}000{,}000)$, divide the training set into 5000 mini-batches $X^{\{1\}}, \dots, X^{\{5000\}}$, each of dimension $(n_x, 1000)$.
Similarly, Y is divided into $Y^{\{1\}}, \dots, Y^{\{5000\}}$, each of dimension $(1, 1000)$.
One iteration of mini-batch gradient descent (computing on a single mini-batch) is faster than one iteration of batch gradient descent.
Two steps of minibatch gradient descent:
repeat {
for t = 1,…,5000 {
Forward prop on $X^{\{t\}}$
…
Compute cost $J^{\{t\}}$
Backprop to compute the gradients of $J^{\{t\}}$ (using $X^{\{t\}}, Y^{\{t\}}$)
} # this is called 1 epoch
}
Choosing minibatch size
Mini-batch size = m: batch gradient descent.
It has to process the whole training set before making any progress, which takes too long per iteration.
Mini-batch size = 1: stochastic gradient descent.
It loses the benefits of vectorization across examples.
Mini-batch size in between 1 and m:
fastest learning — you get the benefit of vectorization and make progress without processing the entire training set.
If training set is small(m≤2000): just use batch gradient descent.
Typical minibatch sizes: 64, 128, 256, 512 (1024)
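A short numpy sketch of shuffling a training set and slicing it into mini-batches (the function name and defaults are illustrative):

```python
import numpy as np

def random_mini_batches(X, Y, batch_size=64, seed=0):
    """X: (n_x, m), Y: (1, m). Shuffle, then slice into mini-batches."""
    np.random.seed(seed)
    m = X.shape[1]
    perm = np.random.permutation(m)
    X_shuf, Y_shuf = X[:, perm], Y[:, perm]
    batches = []
    for k in range(0, m, batch_size):
        batches.append((X_shuf[:, k:k + batch_size], Y_shuf[:, k:k + batch_size]))
    return batches   # the last mini-batch may be smaller than batch_size
```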
Exponentially weighted averages
E.g. $v_t = \beta v_{t-1} + (1-\beta)\theta_t$ with $\beta = 0.9$:
$v_{100} = 0.9\,v_{99} + 0.1\,\theta_{100}$
$v_{99} = 0.9\,v_{98} + 0.1\,\theta_{99}$
$v_{98} = 0.9\,v_{97} + 0.1\,\theta_{98}$
…
Replace $v_{99}$ with the second equation, then $v_{98}$ with the third equation, and so on. Finally we get
$v_{100} = 0.1\,\theta_{100} + 0.1 \cdot 0.9\,\theta_{99} + 0.1 \cdot 0.9^2\,\theta_{98} + \dots$
This is why it is called an exponentially weighted average. In practice $v_t$ averages over roughly $\frac{1}{1-\beta}$ values, so with $\beta = 0.9$ it is approximately an average of the last 10 values.
Bias correction
As shown above, the purple line is the exponentially weighted average without bias correction; at the very beginning it is much lower than the exponentially weighted average with bias correction (green line).
Since $v_0$ is initialized to zero, the first values of $v_t$ are quite small, and they stay too small until t gets larger. To avoid this, bias correction introduces another step: use $\frac{v_t}{1 - \beta^t}$ instead of $v_t$.
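A tiny sketch of the exponentially weighted average with bias correction (plain Python, names chosen for illustration):

```python
def ewa(values, beta=0.9):
    """Exponentially weighted average with bias correction."""
    v = 0.0
    out = []
    for t, theta in enumerate(values, start=1):
        v = beta * v + (1 - beta) * theta
        out.append(v / (1 - beta ** t))   # bias-corrected estimate
    return out
```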
Gradient descent with momentum
Set $v_{dW} = 0$, $v_{db} = 0$.
On iteration t:
Compute dW, db on the current mini-batch
$v_{dW} = \beta v_{dW} + (1-\beta)\,dW$, $\quad v_{db} = \beta v_{db} + (1-\beta)\,db$
$W := W - \alpha v_{dW}$, $\quad b := b - \alpha v_{db}$
Momentum takes past gradients into account to smooth out the steps of gradient descent. It uses the same idea as the exponentially weighted average (some formulations omit the $(1-\beta)$ factor). As in the example shown above, we want to damp the oscillating direction and learn faster along the other direction; the exponentially weighted average helps cancel out the oscillation and makes gradient descent faster. Note there’s no need for gradient descent with momentum to do bias correction: after several iterations the average has warmed up.
RMSprop
On iteration t:
Compute dW, db on the current mini-batch
$S_{dW} = \beta_2 S_{dW} + (1-\beta_2)\,dW^2$ (element-wise square), $\quad S_{db} = \beta_2 S_{db} + (1-\beta_2)\,db^2$
$W := W - \alpha\,\frac{dW}{\sqrt{S_{dW}} + \varepsilon}$, $\quad b := b - \alpha\,\frac{db}{\sqrt{S_{db}} + \varepsilon}$
RMS means Root Mean Square: dividing by the root mean square of recent gradients adjusts the effective step size in each direction.
Adam optimization algorithm
Combine momentum and RMSprop together:
1.It calculates an exponentially weighted average of past gradients, and stores it in variable v (before bias correction) and v_corrected (with bias correction).
2.It calculates an exponentially weighted average of the squares of the past gradients, and stores it in variable s (before bias correction) and s_corrected (with bias correction).
3.It updates parameters in a direction based on combining information from 1 and 2.
Set $v_{dW} = 0$, $S_{dW} = 0$, $v_{db} = 0$, $S_{db} = 0$.
On iteration t:
Compute dW, db on the current mini-batch
$v_{dW} = \beta_1 v_{dW} + (1-\beta_1)\,dW$, $\quad S_{dW} = \beta_2 S_{dW} + (1-\beta_2)\,dW^2$ (and likewise for $v_{db}$, $S_{db}$)
$v_{dW}^{corrected} = \frac{v_{dW}}{1-\beta_1^t}$, $\quad S_{dW}^{corrected} = \frac{S_{dW}}{1-\beta_2^t}$
$W := W - \alpha\,\frac{v_{dW}^{corrected}}{\sqrt{S_{dW}^{corrected}} + \varepsilon}$, $\quad b := b - \alpha\,\frac{v_{db}^{corrected}}{\sqrt{S_{db}^{corrected}} + \varepsilon}$
Hyperparameters:
$\alpha$: needs to be tuned
$\beta_1$: 0.9
$\beta_2$: 0.999
$\varepsilon$: $10^{-8}$
(Adam stands for Adaptive Moment Estimation.)
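A compact numpy sketch of one Adam update for a single parameter array, following the formulas above (the function signature is illustrative):

```python
import numpy as np

def adam_step(param, grad, v, s, t, alpha=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single parameter array (e.g. W or b)."""
    v = beta1 * v + (1 - beta1) * grad               # momentum term
    s = beta2 * s + (1 - beta2) * np.square(grad)    # RMSprop term
    v_corr = v / (1 - beta1 ** t)                    # bias correction
    s_corr = s / (1 - beta2 ** t)
    param = param - alpha * v_corr / (np.sqrt(s_corr) + eps)
    return param, v, s
```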
Learning rate decay
Mini-batch gradient descent won’t converge exactly; it wanders around the optimum instead. To help it converge, it’s advisable to decay the learning rate as training proceeds.
Some formulas:
$\alpha = \frac{1}{1 + \text{decay\_rate} \cdot \text{epoch\_num}}\,\alpha_0$
$\alpha = 0.95^{\text{epoch\_num}}\,\alpha_0$ (exponential decay)
$\alpha = \frac{k}{\sqrt{\text{epoch\_num}}}\,\alpha_0$
Discrete staircase (halve α after some number of iterations)
Manual decay
Week Three
Hyperparameter tuning
Hyperparameters: learning rate $\alpha$, momentum $\beta$, Adam’s $\beta_1, \beta_2, \varepsilon$, number of layers, number of hidden units, learning rate decay, mini-batch size, etc.
Priority: tune $\alpha$ first; then $\beta$, the number of hidden units and the mini-batch size; then the number of layers and the learning rate decay.
Try to use random values of hyperparameters rather than grid.
Coarse to fine: if finds some region with good result, try more in that region.
Appropriate scale:
It’s okay to sample uniformly at random for some hyperparameters: number of layers, number of units.
For hyperparameters like the learning rate $\alpha$ (e.g. between 0.0001 and 1) or $\beta$ (e.g. between 0.9 and 0.999), instead of sampling uniformly at random, sample on a logarithmic scale (see the sketch below).
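A small numpy sketch of sampling on a log scale; the particular ranges are the common illustrative ones, not a prescription:

```python
import numpy as np

# Sample alpha uniformly on a log scale between 10^-4 and 10^0:
r = np.random.uniform(-4, 0)
alpha = 10 ** r

# Sample beta between 0.9 and 0.999 by sampling 1 - beta
# on a log scale between 10^-3 and 10^-1:
r = np.random.uniform(-3, -1)
beta = 1 - 10 ** r
```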
Pandas & Caviar
Panda: babysitting one model at a time
Caviar: training many models in parallel
Largely determined by the amount of computational power you can access.
Batch Normalization
Using the idea of normalizing inputs, apply normalization to the hidden layers as well.
Given some intermediate values $z^{(1)}, \dots, z^{(m)}$ in one layer of the network:
$\mu = \frac{1}{m}\sum_i z^{(i)}$, $\quad \sigma^2 = \frac{1}{m}\sum_i (z^{(i)} - \mu)^2$
$z_{norm}^{(i)} = \frac{z^{(i)} - \mu}{\sqrt{\sigma^2 + \varepsilon}}$, $\quad \tilde{z}^{(i)} = \gamma\, z_{norm}^{(i)} + \beta$ (γ and β are learnable parameters)
Use $\tilde{z}^{[l](i)}$ instead of $z^{[l](i)}$.
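A minimal numpy sketch of the training-time batch-norm transformation described above (function name and shapes are illustrative):

```python
import numpy as np

def batchnorm_forward(Z, gamma, beta, eps=1e-8):
    """Z: (n, m) pre-activations for a mini-batch.
    gamma, beta: (n, 1) learnable scale and shift."""
    mu = np.mean(Z, axis=1, keepdims=True)
    var = np.var(Z, axis=1, keepdims=True)
    Z_norm = (Z - mu) / np.sqrt(var + eps)
    Z_tilde = gamma * Z_norm + beta    # used in place of Z for the activation
    return Z_tilde, mu, var
```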
Batch Norm as regularization
Each minibatch is scaled by the mean/variance computed on just that minibatch.
This adds some noise to the values within that minibatch. So similar to dropout, it adds some noise to each hidden layer’s activations.
This has a slight regularization effect.
Batch Norm at test time: use exponentially weighted averages to compute average for test.
Multiclass classification
Softmax
The output layer is a vector of dimension C rather than a single number, where C is the number of classes.
Activation function: $a_i^{[L]} = \frac{e^{z_i^{[L]}}}{\sum_{j=1}^{C} e^{z_j^{[L]}}}$
Cost function: $\mathcal{L}(\hat{y}, y) = -\sum_{j=1}^{C} y_j \log \hat{y}_j$, $\quad J = \frac{1}{m}\sum_{i=1}^{m}\mathcal{L}(\hat{y}^{(i)}, y^{(i)})$
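A short numpy sketch of the softmax activation and the corresponding cost (a max-subtraction is added for numerical stability; it does not change the result):

```python
import numpy as np

def softmax(Z):
    """Z: (C, m) logits. Subtracting the column max improves numerical stability."""
    Z_shift = Z - np.max(Z, axis=0, keepdims=True)
    expZ = np.exp(Z_shift)
    return expZ / np.sum(expZ, axis=0, keepdims=True)

def softmax_cost(A, Y):
    """A: (C, m) softmax outputs, Y: (C, m) one-hot labels."""
    m = Y.shape[1]
    return -1 / m * np.sum(Y * np.log(A))
```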
Deep Learning frameworks
 Caffe/Caffe2
 CNTK
 DL4J
 Keras
 Lasagne
 mxnet
 PaddlePaddle
 TensorFlow
 Theano
 Torch
Choosing deep learning frameworks
Ease of programming (development and deployment)
Running speed
Truly Open (open source with good governance)
TensorFlow
Writing and running programs in TensorFlow has the following steps:
 Create Tensors(variables) that are not yet executed/evaluated.
 Write operations between those Tensors.
 Initialize your Tensors.
 Create a Session.
 Run the Session. This will run the operations you’d written above.
tf.constant(...): creates a constant value
tf.placeholder(dtype=..., shape=..., name=...): a placeholder is an object whose value you can specify only later
tf.add(..., ...): addition
tf.multiply(..., ...): element-wise multiplication
tf.matmul(..., ...): matrix multiplication
Two typical ways to create and use sessions in TensorFlow are shown in the sketch below.
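A minimal sketch of the two session patterns referred to above, using the TensorFlow 1.x API that these notes describe (the computation itself is an arbitrary example):

```python
import tensorflow as tf   # TensorFlow 1.x style API

c = tf.constant(2, dtype=tf.int64)
x = tf.placeholder(tf.int64, name="x")
y = tf.multiply(c, x)

# Pattern 1: create a session, run it, close it explicitly.
sess = tf.Session()
print(sess.run(y, feed_dict={x: 3}))   # 6
sess.close()

# Pattern 2: use a context manager, which closes the session automatically.
with tf.Session() as sess:
    print(sess.run(y, feed_dict={x: 3}))   # 6
```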
Structuring Machine Learning Projects
Week One
Orthogonalization
Orthogonalization or orthogonality is a system design property that assures that modifying an instruction or a component of an algorithm will not create or propagate side effects to other components of the system. It becomes easier to verify the algorithms independently from one another, and it reduces testing and development time.
When a supervised learning system is designed, these are the 4 assumptions that need to be true and orthogonal.
 Fit training set well on cost function  bigger network, Adam, etc
 Fit dev set well on cost function  regularization, bigger training set, etc
 Fit test set well on cost function  bigger dev set
 Performs well in real world  change dev set or cost function
Single number evaluation metric
Precision
Among all examples predicted positive, the fraction that are actually positive.
Recall
Among all actual positive examples, the fraction that are correctly predicted positive.
F1 Score
The problem with using precision/recall as the evaluation metric is that you cannot always tell which classifier is better when one has better precision and the other better recall. The F1 score, $F_1 = \frac{2}{\frac{1}{P} + \frac{1}{R}}$ (a harmonic mean), combines precision and recall into a single number.
Satisficing and optimizing metric
There are different metrics to evaluate the performance of a classifier; they are called evaluation metrics. They can be categorized as satisficing and optimizing metrics. It is important to note that these metrics must be evaluated on the training set, the development set or the test set.
The general rule is: choose 1 optimizing metric and treat the remaining N−1 metrics as satisficing metrics.
For example:

| Classifier | Accuracy | Running Time |
| --- | --- | --- |
| A | 90% | 80 ms |
| B | 92% | 95 ms |
| C | 95% | 1500 ms |

Here there are two evaluation metrics: accuracy and running time. Take accuracy as the optimizing metric and running time as the satisficing metric. The satisficing metric only has to meet some threshold, while the optimizing metric should be improved as much as possible.
Train/Dev/Test Set
It’s important to choose the development and test sets from the same distribution and it must be taken randomly from all the data.
Guideline: Choose a dev set and test set to reflect data you expect to get in the future and consider important to do well on.
Size
Old way of splitting data:
We had smaller data set, therefore, we had to use a greater percentage of data to develop and test ideas and models.
Modern era  Big data:
Now, because a larger amount of data is available, we don’t have to compromise and can use a greater portion to train the model.
Set your dev set to be big enough to detect differences in algorithms/models you’re trying out.
Set your test set to be big enough to give high confidence in the overall performance of your system.
When to change dev/test sets and metrics
Orthogonalization:
How to define a metric to evaluate classifiers.
Worry separately about how to do well on this metric.
If doing well on your metric + dev/test set does not correspond to doing well on your application, change your metric and/or dev/test set.
Comparing to humanlevel performance
The graph shows the performance of humans and machine learning over time.
Machine learning progress slows down after it surpasses human-level performance. One of the reasons is that human-level performance can be close to the Bayes optimal error, especially for natural perception problems.
Bayes optimal error is defined as the best possible error. In other words, it means that any functions mapping from x to y can’t surpass a certain level of accuracy(for different reasons, e.g. blurring images, audio with noise, etc).
Humans are quite good at a lot of tasks. So long as machine learning is worse than humans, you can:
 Get labeled data from humans
 Gain insight from manual error analysis: Why did a person get this right?
 Better analysis of bias/variance
Humanlevel error as a proxy for Bayes error(i.e. Humanlevel error ≈ Bayes error).
The difference between Humanlevel error and training error is also regarded as “Avoidable bias”.
If the difference between human-level error and training error is bigger than the difference between training error and dev error, the focus should be on bias reduction techniques:
· Train bigger model
· Train longer/better optimization algorithms(momentum, RMSprop, Adam)
· NN architecture/hyperparameters search(RNN,CNN)
If the difference between training error and dev error is bigger than the difference between human-level error and training error, the focus should be on variance reduction techniques:
· More data
· Regularization(L2, dropout, data augmentation)
· NN architecture/hyperparameters search
Problems where machine significantly surpasses humanlevel performance
Feature: Structured data, not natural perception, lots of data.
· Online advertising
· Product recommendations
· Logistics(predicting transit time)
· Loan approvals
The two fundamental assumptions of supervised learning:
You can fit the training set pretty well.(avoidable bias ≈ 0)
The training set performance generalizes pretty well to the dev/test set.(variance ≈ 0)
Week Two
Error Analysis
Spreadsheet:
Before deciding how to improve accuracy, set up a spreadsheet to find out what actually matters.
For example:

| Image | Dog | Great Cat | Blurry | Comment |
| --- | --- | --- | --- | --- |
| 1 | √ | | | small white dog |
| 2 | | √ | √ | lion on a rainy day |
| … | | | | |
| Percentage | 5% | 41% | 63% | |
Mislabeled examples refer to examples where your learning algorithm outputs the wrong value of Y.
Incorrectly labeled examples refer to examples in the training/dev/test set where the label for Y, whatever a human assigned to that piece of data, is actually wrong.
Deep learning algorithms are quite robust to random errors in the training set, but less robust to systematic errors.
Guideline: Build system quickly, then iterate.
 Set up development/test set and metrics
 Set up a target
 Build an initial system quickly
 Train training set quickly: Fit the parameters
 Development set: Tune the parameters
 Test set: Assess the performance
 Use bias/variance analysis & Error analysis to prioritize next steps
Mismatched training and dev/test set
The development set and test should come from the same distribution. However, the training set’s distribution might be a bit different. Take a mobile application of cat recognizer for example:
The images from webpages have high resolution and are professionally framed, whereas the images from the app’s users have relatively low resolution and are blurrier.
The problem is that you have a different distribution:
Small data set of pictures uploaded by users (10,000). This distribution is the one that matters for the mobile app.
Bigger data set from the web (200,000).
Instead of mixing all the data and randomly shuffling the whole data set, split it as follows:
take 5,000 examples from users into the training set (together with the web images), and split the remaining user images in half into the dev and test sets.
The advantage of this way of splitting up is that the target is well defined.
The disadvantage is that the training distribution differs from the dev and test distributions. However, this way of splitting the data performs better in the long run.
TrainingDev Set
Since the distributions of the training set and the dev set are different now, it’s hard to know whether the gap between the training error and the dev error is caused by variance or by the different distributions.
Therefore, take a small fraction of the original training set, called the training-dev set. Don’t use the training-dev set for training; use it to check variance.
The difference between the trainingdev set and the dev set is called data mismatch.
Addressing data mismatch:
 Carry out manual error analysis to try to understand difference between training and dev/test sets.
 Make training data more similar; or collect more data similar to dev/test sets
Transfer learning
When transfer learning makes sense:
 Task A and B have the same input x.
 You have a lot more data for Task A than Task B.
 Low level features from A could be helpful for learning B.
Guideline:
 Delete last layer of neural network
 Delete weights feeding into the last output layer of the neural network
 Create a new set of randomly initialized weights for the last layers only
 New data set (x,y)
Multi-task learning
Example: detect pedestrians, cars, road signs and traffic lights at the same time. The output is a 4-dimensional vector.
Note that the second sum (over j = 1 to 4) only includes values of j with a 0/1 label (not a “?” mark).
When multi-task learning makes sense
 Training on a set of tasks that could benefit from having shared lowerlevel features.
 Usually: Amount of data you have for each task is quite similar.
 Can train a big enough neural network to do well on all the tasks.
End-to-end deep learning
End-to-end deep learning is the simplification of a processing or learning pipeline into one neural network.
End-to-end deep learning cannot be used for every problem since it needs a lot of labeled data. It is used mainly in audio transcription, image captioning, image synthesis, machine translation, steering in self-driving cars, etc.
Pros and cons of end-to-end deep learning
Pros:
Let the data speak
Less hand-designing of components needed
Cons:
May need a large amount of data
Excludes potentially useful hand-designed components
Convolutional Neural Networks
Week One
Computer Vision Problems
 Image Classification
 Object Detection
 Neural Style Transfer
Convolution
The operator * denotes convolution.
Filter/Kernel
The second operand is called filter in the course and often called kernel in the research paper.
There are different types of filters:
A filter usually has an odd size: 1×1, 3×3, 5×5, …
(an odd size gives a well-defined central pixel/centroid)
Vertical edge detection examples
Valid and Same Convolutions
Suppose the original image has a size of n×n and the filter has a size of f×f; then the result has a size of (n−f+1)×(n−f+1). This is called a valid convolution.
The output keeps getting smaller as more valid convolutions are applied.
To avoid this, we can pad the original image before convolution so that the output size is the same as the input size (a “same” convolution).
If the filter’s size is f×f, then the padding is p = (f−1)/2.
The main benefits of padding are the following:
· It allows you to use a CONV layer without necessarily shrinking the height and width of the volumes. This is important for building deeper networks, since otherwise the height/width would shrink as you go to deeper layers. An important special case is the “same” convolution, in which the height/width is exactly preserved after one layer.
· It helps us keep more of the information at the border of an image. Without padding, very few values at the next layer would be affected by pixels at the edges of an image.
Stride
The simplest stride is 1, which means that the filter moves 1 step at a time. However, the stride can be not 1. For example, moves 2 steps at a time instead. That’s called strided convolution.
Given that:
the image size is n×n, the filter size is f×f, the padding is p, and the stride is s,
the output size is: $\left\lfloor \frac{n + 2p - f}{s} + 1 \right\rfloor \times \left\lfloor \frac{n + 2p - f}{s} + 1 \right\rfloor$
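A tiny helper that evaluates this output-size formula (the function name and the example sizes are illustrative):

```python
def conv_output_size(n, f, p=0, s=1):
    """Output height/width of a convolution: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1

print(conv_output_size(6, 3))            # valid convolution: 4
print(conv_output_size(6, 3, p=1))       # "same" convolution: 6
print(conv_output_size(7, 3, p=0, s=2))  # strided convolution: 3
```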
Technical note
In mathematics and digital signal processing, convolution involves an additional “flip” step. This step is omitted in CNNs, so strictly speaking the operation should be called “cross-correlation” rather than convolution.
By convention, it is still called convolution in CNNs.
Convolution over volumes
A 1-channel filter cannot be applied to RGB images, but we can use filters with multiple channels (RGB images have 3 channels).
The number of channels of the filter must match the number of channels of the image.
E.g.
An n×n×3 image convolved with an f×f×3 filter gives an (n−f+1)×(n−f+1) result. Note that this result has only 1 channel! (The number of channels of the output equals the number of filters used.)
CNN
notation
If layer l is a convolution layer:
$f^{[l]}$ = filter size
$p^{[l]}$ = padding
$s^{[l]}$ = stride
$n_c^{[l]}$ = number of filters
Each filter is: $f^{[l]} \times f^{[l]} \times n_c^{[l-1]}$
Activations: $a^{[l]} \to n_H^{[l]} \times n_W^{[l]} \times n_c^{[l]}$, $A^{[l]} \to m \times n_H^{[l]} \times n_W^{[l]} \times n_c^{[l]}$
Weights: $f^{[l]} \times f^{[l]} \times n_c^{[l-1]} \times n_c^{[l]}$ ($n_c^{[l]}$: the number of filters in layer l)
Bias: $n_c^{[l]}$
Input: $n_H^{[l-1]} \times n_W^{[l-1]} \times n_c^{[l-1]}$
Output: $n_H^{[l]} \times n_W^{[l]} \times n_c^{[l]}$, where $n_H^{[l]} = \left\lfloor \frac{n_H^{[l-1]} + 2p^{[l]} - f^{[l]}}{s^{[l]}} + 1 \right\rfloor$ (and similarly for $n_W^{[l]}$)
Types of layers in a convolutional network
 Convolution (conv)
 Pooling (pool)
 Fully Connected (FC)
Pooling layers
 Max pooling: slides an (f,f) window over the input and stores the max value of the window in the output.
 Average pooling: slides an (f,f) window over the input and stores the average value of the window in the output.
Hyperparameters:
f: filter size
s: stride
Max or average pooling
Note no parameters to learn.
Suppose the input has a size of $n_H \times n_W \times n_c$; then after pooling, the output has a size of $\left\lfloor \frac{n_H - f}{s} + 1 \right\rfloor \times \left\lfloor \frac{n_W - f}{s} + 1 \right\rfloor \times n_c$
A more complicated CNN:
Backpropagation is discussed in programming assignment.
Why convolutions
 Parameter sharing: A feature detector(such as a vertical edge detector) that’s useful in one part of the image is probably useful in another part of the image.
 Sparsity of connections: In each layer, each output value depends only on a small number of inputs.
Week Two
Classic networks
LeNet-5
Paper link: Gradient-Based Learning Applied to Document Recognition (IEEE has another version of this paper.)
Take the input, apply a 5×5 convolution with stride 1, then average pooling with a 2×2 filter and s = 2. Again apply a 5×5 convolution with stride 1, then average pooling with a 2×2 filter and s = 2. After two fully connected layers, the output uses softmax for classification.
conv → pool → conv → pool → fc → fc → output
As $n_H$ and $n_W$ decrease, $n_C$ increases.
AlexNet
Paper link: ImageNet Classification with Deep Convolutional Neural Networks
Similar to LeNet, but much bigger (roughly 60K parameters → 60M parameters).
It uses ReLU.
VGG-16
Paper link: Very Deep Convolutional Networks for Large-Scale Image Recognition
CONV = 3×3 filter, s = 1, same (padding keeps the output size equal to the input size)
MAXPOOL = 2×2, s = 2
Only use these 2 filters.
Residual Networks
Paper link: Deep Residual Learning for Image Recognition
In a plain network, the training error does not keep decreasing as the network gets deeper; past some depth it may start increasing. In a residual network, the training error keeps decreasing as depth grows.
The skipconnection makes it easy for the network to learn an identity mapping between the input and the output within the ResNet block.
In ResNets, a “shortcut” or a “skip connection” allows the gradient to be directly backpropagated to earlier layers:
1×1 convolution
Paper link: Network in network
If the input has a volume of dimension $n_H \times n_W \times n_C$, then a single 1×1 convolutional filter has $n_C + 1$ parameters (including the bias).
You can use a 1×1 convolutional layer to reduce $n_C$ but not $n_H, n_W$.
You can use a pooling layer to reduce $n_H, n_W$, but not $n_C$.
Inception network
Paper link: Going deeper with convolutions
Don’t bother deciding which single filter size to use. Use several kinds of filters (and pooling) in parallel and stack their outputs together along the channel dimension.
Module:
Typically, in deeper layers $n_H$ and $n_W$ decrease, while $n_C$ increases.
Practical advices for using ConvNets
Using OpenSource Implementations: GitHub
Reasons for using opensource implementations of ConvNet:
Parameters trained for one computer vision task are often useful as pretraining for other computer vision tasks.
It is a convenient way to get a working implementation of a complex ConvNet architecture.
Week Three
Classification, localization and detection
Image classification: given an image, predict which class it belongs to.
Classification localization: In addition, put a bounding box to figure out where the object is.
Detection: Multiple objects appear in the image, detect all of them.
In classification with localization, the output includes values $b_x, b_y$ giving the position of the object’s midpoint (note that the upper-left corner of the image has coordinates (0,0) and the lower-right corner (1,1)), and $b_h, b_w$ giving the height and width of the bounding box.
If there are 3 object classes, the label has the form $y = [p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3]^T$.
For example, if the image contains a car, the output is $y = [1, b_x, b_y, b_h, b_w, c_1, c_2, c_3]^T$ with the $c_j$ for “car” equal to 1 and the others 0,
and if the image doesn’t contain any object, the output is $y = [0, ?, ?, ?, ?, ?, ?, ?]^T$ (the remaining components are “don’t care”).
The loss function penalizes all components when $p_c = 1$, and only the $p_c$ component when $p_c = 0$.
Landmark detection
The output contains additional coordinate pairs giving the positions of the landmarks.
Sliding windows detection
Use a small sliding window with a small stride to scan the image and detect objects; then use a slightly bigger sliding window, and then a bigger one.
However, it has high computation cost.
Turning FC layer into convolutional layers
Use a filter with the same height and width as the previous layer’s activation volume; the number of filters equals the number of fully connected units.
YOLO
Paper link: You Only Look Once: Unified, Real-Time Object Detection
Bounding boxes:
Divide the image into grid cells (a 19×19 grid is common in practice), and detect each object only once, in the grid cell that contains the object’s midpoint.
Each grid cell’s upper-left corner has coordinates (0,0) and its lower-right corner (1,1). Therefore the values of $b_x, b_y$ should lie between 0 and 1, while $b_h, b_w$ can be greater than 1.
IoU
Intersection over union
If IoU ≥ 0.5, the detection is counted as correct.
More generally, IoU is a measure of the overlap between two bounding boxes.
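A small sketch of computing IoU for two boxes given as corner coordinates (the coordinate convention is an assumption for illustration):

```python
def iou(box1, box2):
    """Boxes given as (x1, y1, x2, y2) corner coordinates."""
    xi1, yi1 = max(box1[0], box2[0]), max(box1[1], box2[1])
    xi2, yi2 = min(box1[2], box2[2]), min(box1[3], box2[3])
    inter = max(0, xi2 - xi1) * max(0, yi2 - yi1)      # intersection area
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = area1 + area2 - inter
    return inter / union
```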
Non-max suppression
Algorithm:
Each output prediction is $[p_c, b_x, b_y, b_h, b_w]$ (focus on one class at a time, so there are no $c_1, c_2, c_3$ components).
Discard all boxes with $p_c \le 0.6$.
While there are any remaining boxes:
· Pick the box with the largest $p_c$ and output it as a prediction.
· Discard any remaining box with IoU ≥ 0.5 with the box output in the previous step.
Anchor boxes
In an image, some objects may overlap, so their midpoints can fall in the same grid cell. To predict multiple objects in one grid cell, use anchor boxes.
Previously:
Each object in training image is assigned to grid cell that contains that object’s midpoint.
With two anchor boxes:
Each object in a training image is assigned to the grid cell that contains the object’s midpoint, and to the anchor box for that grid cell with the highest IoU with the object’s shape.
The output volume then has size (grid height) × (grid width) × (#anchors × (5 + #classes)).
E.g.
(Manually choose the shape of anchor boxes.)
Region proposals
Paper link:Rich feature hierarchies for accurate object detection and semantic segmentation
Instead of running sliding windows over and over again, use a segmentation algorithm to propose regions that may contain objects.
R-CNN: Propose regions. Classify proposed regions one at a time. Output label + bounding box.
Fast R-CNN: Propose regions. Use a convolutional implementation of sliding windows to classify all the proposed regions.
Faster R-CNN: Use a convolutional network to propose regions.
Week Four
Face Recognition
Face verification & Face recognition
Verification:
· Input image, name/ID
· Output whether the input image is that of the claimed person.
This is a 1:1 matching problem.
Recognition:
· Has a database of K persons
· Get an input image
· Output ID if the image is any of the K persons(or “not recognized”)
This is a 1:K matching problem.
(High demand for single accuracy.)
Face verification requires comparing a new picture against one person’s face, whereas face recognition requires comparing a new picture against K persons’ faces.
Oneshot learning
Learning from one example to recognize the person again. The idea is learning a “similarity” function. (A bit similar to recommendation system.)
d(img1, img2) = degree of difference between the two images.
If d(img1, img2) ≤ τ (some threshold), predict “same person”; otherwise predict “different”.
Siamese network
Parameters of the NN define an encoding $f(x^{(i)})$ (a vector that represents the image $x^{(i)}$).
Goal: learn parameters so that
if $x^{(i)}, x^{(j)}$ are the same person, $\|f(x^{(i)}) - f(x^{(j)})\|^2$ is small;
if $x^{(i)}, x^{(j)}$ are different persons, $\|f(x^{(i)}) - f(x^{(j)})\|^2$ is large.
Triplet loss
Pick an anchor image(denoted as “A”), a positive image(denoted as “P”) and a negative image(denoted as “N”).
We can compare the distances between A and P and between A and N.
We want $\|f(A) - f(P)\|^2 + \alpha \le \|f(A) - f(N)\|^2$,
where α is called the margin.
Loss function: $\mathcal{L}(A, P, N) = \max\big(\|f(A) - f(P)\|^2 - \|f(A) - f(N)\|^2 + \alpha,\ 0\big)$
About choosing the triplets A,P,N
During training, if A, P, N are chosen randomly, the constraint is easily satisfied, so gradient descent wouldn’t make much progress.
Thus, choose triplets that are “hard” to train on, i.e. pick A, P, N such that $d(A, P) \approx d(A, N)$.
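A minimal numpy sketch of the triplet loss for one (A, P, N) triple, given precomputed encodings (the argument names and the default margin are illustrative):

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """f_a, f_p, f_n: encodings of anchor, positive, negative images (1-D arrays)."""
    pos_dist = np.sum(np.square(f_a - f_p))   # ||f(A) - f(P)||^2
    neg_dist = np.sum(np.square(f_a - f_n))   # ||f(A) - f(N)||^2
    return max(pos_dist - neg_dist + alpha, 0.0)
```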
Neural Style Transfer
Neural style transfer cost function
The input contains content image(denoted as C) and style image(denoted as S), and the output is the generated image(denoted as G).
To find the generated image G:
1. Initialize G randomly (e.g. as white noise).
2.Use gradient descent to minimize J(G).
Content cost function
· Say you use the activations of hidden layer l to compute the content cost.
· Use a pretrained ConvNet (e.g., a VGG network).
· Let $a^{[l](C)}$ and $a^{[l](G)}$ be the activations of layer l on the two images.
· If $a^{[l](C)}$ and $a^{[l](G)}$ are similar, both images have similar content.
Style cost function
Say you are using layer l’s activations to measure style.
Define style as the correlation between activations across channels.
Let $a_{ijk}^{[l]}$ = the activation at position (i, j, k). The style matrix $G^{[l]}$ is $n_c^{[l]} \times n_c^{[l]}$ with entries $G_{kk'}^{[l]} = \sum_i \sum_j a_{ijk}^{[l]}\, a_{ijk'}^{[l]}$.
The style matrix is also called a “Gram matrix”. In linear algebra, the Gram matrix G of a set of vectors $(v_1, \dots, v_n)$ is the matrix of dot products, whose entries are $G_{ij} = v_i^T v_j$. In other words, $G_{ij}$ measures how similar $v_i$ is to $v_j$: if they are highly similar, you would expect them to have a large dot product, and thus for $G_{ij}$ to be large.
The style of an image can be represented using the Gram matrix of a hidden layer’s activations. However, we get even better results combining this representation from multiple different layers. This is in contrast to the content representation, where usually using just a single hidden layer is sufficient.
Minimizing the style cost will cause the image G to follow the style of the image S.
Sequence Model
Week One
Recurrent Neural Networks
Notation
Superscript $\langle t \rangle$: denotes an object at the t’th timestep.
$\hat{y}^{\langle t \rangle}$: indexes into the output position.
t: implies that these are temporal sequences.
$T_x$: the length of the input sequence.
$T_y$: the length of the output sequence.
$T_x^{(i)}$: the input length of the i’th training example.
$T_y^{(i)}$: the output length of the i’th training example.
$x^{(i)\langle t \rangle}$: the input at the t’th timestep of example i.
$y^{(i)\langle t \rangle}$: the output at the t’th timestep of example i.
Onehot representation
Use a large vector (as long as a dictionary containing tens of thousands of words) to represent a word: only one element is one (at the word’s position in the dictionary) and the others are zero.
Why not a standard network?
Problems:
Inputs and outputs can be different lengths in different examples. (Different sentences have different lengths.)
A standard network doesn’t share features learned across different positions of the text. (A word may appear at many positions in a sentence, and the network would have to re-learn it at each position.)
RNN cell
Basic RNN cell. Takes as input $x^{\langle t \rangle}$ (the current input) and $a^{\langle t-1 \rangle}$ (the previous hidden state containing information from the past), and outputs $a^{\langle t \rangle}$, which is given to the next RNN cell and also used to predict $\hat{y}^{\langle t \rangle}$.
Forward Propagation
Here the weight matrices have two subscripts: the first corresponds to the quantity being computed and the second to the operand it multiplies, e.g. $a^{\langle t \rangle} = g(W_{aa} a^{\langle t-1 \rangle} + W_{ax} x^{\langle t \rangle} + b_a)$ and $\hat{y}^{\langle t \rangle} = g(W_{ya} a^{\langle t \rangle} + b_y)$.
The activation function for $a^{\langle t \rangle}$ is usually tanh, sometimes ReLU.
The output function uses sigmoid for binary classification.
The formulas can be simplified as follows:
$a^{\langle t \rangle} = g(W_a[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_a)$, $\quad \hat{y}^{\langle t \rangle} = g(W_y a^{\langle t \rangle} + b_y)$
Here $W_a = [W_{aa} \mid W_{ax}]$, and $[a^{\langle t-1 \rangle}, x^{\langle t \rangle}]$ stacks the two vectors vertically.
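A minimal numpy sketch of one RNN-cell timestep following the formulas above (a softmax output is used here as an example; a sigmoid would be used for a binary output):

```python
import numpy as np

def rnn_cell_forward(xt, a_prev, Waa, Wax, Wya, ba, by):
    """One timestep. xt: (n_x, m), a_prev: (n_a, m)."""
    a_next = np.tanh(np.dot(Waa, a_prev) + np.dot(Wax, xt) + ba)
    z_y = np.dot(Wya, a_next) + by
    y_hat = np.exp(z_y) / np.sum(np.exp(z_y), axis=0, keepdims=True)  # softmax
    return a_next, y_hat
```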
Backward Propagation
Different types
One to one
Usage: Simple neural network
One to many
Usage: Music generation, sequence generation
Many to one
Usage: Sentiment classification
Many to many (I)
Usage: Name entity recognition
Many to many (II)
Usage: Machine translation
Language model
A language model computes the probability of a sentence using an RNN. Each timestep’s output is a probability distribution over the next word given the previous words.
E.g. given the sentence “Cats average 15 hours of sleep a day.”: $P(\text{cats})$ (the probability of ‘cats’ appearing at the beginning of the sentence); $P(\text{average} \mid \text{cats})$ (a conditional probability); …; and the probability of the full sentence is the product of these conditional probabilities.
Characterlevel language model
Instead of using words, a character-level model generates sequences of characters. It is more computationally expensive.
Gated Recurrent Unit(GRU)
The basic RNN unit: $a^{\langle t \rangle} = g(W_a[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_a)$,
where g() is the tanh function.
GRU (the simplified version omits the relevance gate $\Gamma_r$; the full GRU is shown):
$\tilde{c}^{\langle t \rangle} = \tanh(W_c[\Gamma_r * c^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_c)$
$\Gamma_u = \sigma(W_u[c^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_u)$, $\quad \Gamma_r = \sigma(W_r[c^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_r)$
$c^{\langle t \rangle} = \Gamma_u * \tilde{c}^{\langle t \rangle} + (1 - \Gamma_u) * c^{\langle t-1 \rangle}$
Instead of using $a^{\langle t \rangle}$, use a memory cell $c^{\langle t \rangle}$ (though in a GRU $c^{\langle t \rangle} = a^{\langle t \rangle}$).
$\Gamma_u$: update gate. $\Gamma_r$: relevance gate.
The update gate $\Gamma_u$ is a vector with dimension equal to the number of hidden units.
The relevance gate $\Gamma_r$ tells you how relevant $c^{\langle t-1 \rangle}$ is for computing the candidate $\tilde{c}^{\langle t \rangle}$.
Long Short Term Memory(LSTM) Unit
Differences between the LSTM and the GRU (the LSTM came earlier historically, and the GRU can be regarded as a simplification of the LSTM).
Forget Gate
For the sake of this illustration, let’s assume we are reading words in a piece of text and want to use an LSTM to keep track of grammatical structures, such as whether the subject is singular or plural. If the subject changes from a singular word to a plural word, we need a way to get rid of our previously stored memory of the singular/plural state. In an LSTM, the forget gate lets us do this:
$\Gamma_f^{\langle t \rangle} = \sigma(W_f[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_f)$
Here, $W_f$ are weights that govern the forget gate’s behavior. We concatenate $a^{\langle t-1 \rangle}$ and $x^{\langle t \rangle}$ and multiply by $W_f$. The equation above results in a vector with values between 0 and 1. This forget gate vector will be multiplied element-wise by the previous cell state $c^{\langle t-1 \rangle}$. So if one of the values of $\Gamma_f^{\langle t \rangle}$ is 0 (or close to 0), it means that the LSTM should remove that piece of information (e.g. the singular subject) from the corresponding component of $c^{\langle t-1 \rangle}$. If one of the values is 1, it will keep the information.
Update Gate
Once we forget that the subject being discussed is singular, we need a way to update the state to reflect that the new subject is now plural. Here is the formula for the update gate:
$\Gamma_u^{\langle t \rangle} = \sigma(W_u[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_u)$
Similar to the forget gate, $\Gamma_u^{\langle t \rangle}$ is a vector of values between 0 and 1. It is multiplied element-wise with $\tilde{c}^{\langle t \rangle}$ in order to compute $c^{\langle t \rangle}$.
Updating the cell
To update the subject, we first create a new candidate vector of values to add to the previous cell state:
$\tilde{c}^{\langle t \rangle} = \tanh(W_c[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_c)$
Finally, the new cell state is:
$c^{\langle t \rangle} = \Gamma_f^{\langle t \rangle} * c^{\langle t-1 \rangle} + \Gamma_u^{\langle t \rangle} * \tilde{c}^{\langle t \rangle}$
Output gate
To decide which outputs we will use, we use the following two formulas:
$\Gamma_o^{\langle t \rangle} = \sigma(W_o[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_o)$
$a^{\langle t \rangle} = \Gamma_o^{\langle t \rangle} * \tanh(c^{\langle t \rangle})$
The first equation decides what to output using a sigmoid, and the second multiplies that by the tanh of the cell state.
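A compact numpy sketch of one LSTM timestep combining the four gates above (the parameter-dictionary layout is an assumption for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_cell_forward(xt, a_prev, c_prev, params):
    """One LSTM timestep. params holds Wf, bf, Wu, bu, Wc, bc, Wo, bo."""
    concat = np.concatenate([a_prev, xt], axis=0)                    # [a<t-1>, x<t>]
    gamma_f = sigmoid(np.dot(params["Wf"], concat) + params["bf"])   # forget gate
    gamma_u = sigmoid(np.dot(params["Wu"], concat) + params["bu"])   # update gate
    c_tilde = np.tanh(np.dot(params["Wc"], concat) + params["bc"])   # candidate
    c_next = gamma_f * c_prev + gamma_u * c_tilde                    # new cell state
    gamma_o = sigmoid(np.dot(params["Wo"], concat) + params["bo"])   # output gate
    a_next = gamma_o * np.tanh(c_next)                               # new hidden state
    return a_next, c_next
```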
Bidirectional RNN(BRNN)
Week Two  Natural Language Processing & Word Embeddings
Transfer learning and word embeddings
1. Learn word embeddings from a large text corpus (1 to 100B words).
(Or download pretrained embedding online.)
2. Transfer the embedding to a new task with a smaller training set (say, 100k words).
3. Optional: continue to fine-tune the word embeddings with the new data.
Computation of Similarities:
Cosine similarity: $\text{sim}(u, v) = \frac{u^T v}{\|u\|_2\,\|v\|_2}$
Euclidean distance: $\|u - v\|^2$ (a measure of dissimilarity rather than similarity)
Embedding matrix
The embedding matrix is denoted as E.
The embedding for word j can be calculated as $e_j = E\,o_j$, where $o_j$ is the one-hot vector for word j.
Here, e means embedding and o means one-hot. In practice, we just use a specialized lookup function to fetch the j-th column of E rather than this costly matrix multiplication.
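A small numpy sketch of the cosine similarity and of the embedding lookup described above (the vocabulary and embedding sizes are hypothetical):

```python
import numpy as np

def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Looking up the embedding for word j: E.o_j is just column j of E.
vocab_size, emb_dim = 10000, 300        # hypothetical sizes
E = np.random.randn(emb_dim, vocab_size)
j = 1234
e_j = E[:, j]                           # cheaper than np.dot(E, one_hot_j)
```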
Context/target pairs
Context:
· Last 4 words
· 4 words on left & right
· Last 1 word
· Nearby 1 word
Word2Vec
Using skip-grams: pick a context word c and a target word t within a window around it, and model
$p(t \mid c) = \frac{e^{\theta_t^T e_c}}{\sum_{j=1}^{V} e^{\theta_j^T e_c}}$. Here $e_c$ is the embedding of the context word and $\theta_t$ is the parameter vector associated with output t.
Problems:
Computing the softmax denominator over the whole vocabulary is too expensive.
Solution:
Use a hierarchical softmax.
Negative sampling
Randomly choose k+1 target words for a given context word, where only 1 example is positive and the remaining k are negative. (The value of k depends on the size of the dataset: if the dataset is big, k = 2 to 5; if the dataset is small, k = 5 to 20.)
Instead of computing a full softmax, compute k+1 binary classifications to reduce the computation.
GloVe word vectors
$X_{ij}$: the number of times word i appears in the context of word j. Thus, for a symmetric context window, $X_{ij} = X_{ji}$.
Applications using Word Embeddings
Sentiment Classification and Debiasing.
Week Three
Machine translation can be regarded as a conditional language model.
The original language model computes the probability $P(y^{\langle 1 \rangle}, \dots, y^{\langle T_y \rangle})$, while machine translation computes $P(y^{\langle 1 \rangle}, \dots, y^{\langle T_y \rangle} \mid x^{\langle 1 \rangle}, \dots, x^{\langle T_x \rangle})$. Therefore, it can be regarded as a conditional language model.
The machine translation contains two parts: encoder and decoder.
Just find the most likely translation.
(not use greedy search)
Beam Search
Pick a hyperparameter B (the beam width). At each decoding step, keep the B most likely partial outputs.
The probability can be decomposed as $P(y^{\langle 1 \rangle}, y^{\langle 2 \rangle} \mid x) = P(y^{\langle 1 \rangle} \mid x)\,P(y^{\langle 2 \rangle} \mid x, y^{\langle 1 \rangle})$, and so on.
(Beam search with B = 1 is greedy search.)
Length normalization
Each probability lies in [0, 1], so the product of many such terms can be extremely small. To avoid numerical problems, maximize a sum of logs instead: $\arg\max_y \sum_{t=1}^{T_y} \log P(y^{\langle t \rangle} \mid x, y^{\langle 1 \rangle}, \dots, y^{\langle t-1 \rangle})$.
The model tends to prefer short translations (fewer terms to multiply), while a too-short translation is not satisfying. Therefore, normalize by the length with another hyperparameter α: $\frac{1}{T_y^{\alpha}}\sum_{t=1}^{T_y} \log P(y^{\langle t \rangle} \mid x, y^{\langle 1 \rangle}, \dots, y^{\langle t-1 \rangle})$.
Unlike exact search algorithms like BFS or DFS, beam search runs faster but is not guaranteed to find the exact maximum of $\arg\max_y P(y \mid x)$.
Error analysis
There are two main components in this machine translation system: the RNN (encoder-decoder) and beam search. If the error is high, we want to figure out which part is at fault.
Use $y^*$ to represent the human translation and $\hat{y}$ the machine’s.
Case 1:
Beam search chose $\hat{y}$, but $y^*$ attains a higher $P(y \mid x)$.
Conclusion: beam search is at fault.
Case 2:
In fact, $y^*$ is a better translation than $\hat{y}$, but the RNN predicted $P(y^* \mid x) \le P(\hat{y} \mid x)$.
Conclusion: the RNN model is at fault.
Bleu score
Here, $p_n$ = the Bleu score on n-grams only.
Combined Bleu score: $BP \cdot \exp\!\left(\frac{1}{4}\sum_{n=1}^{4} p_n\right)$
BP is the brevity penalty: BP = 1 if the MT output is longer than the reference, and $\exp(1 - \text{reference length}/\text{MT output length})$ otherwise.
Attention model
Here, $\alpha^{\langle t, t' \rangle}$ = the amount of attention $y^{\langle t \rangle}$ should pay to $a^{\langle t' \rangle}$, with $\sum_{t'} \alpha^{\langle t, t' \rangle} = 1$ (the weights are computed with a softmax).