Neural Networks and Deep Learning
Week One
(Neural networks are also introduced in the Machine Learning course; see my learning notes for that course.)
House price Prediction can be regarded as the simplest neural network:
The function can be ReLU (REctified Linear Unit), which we’ll see a lot.
This is a single neuron. A larger neural network is then formed by taking many of the single neurons and stacking them together.
Almost all the economic value created by neural networks has been through supervised learning.
| Input (x) | Output (y) | Application | Neural Network |
|---|---|---|---|
| House features | Price | Real estate | Standard NN |
| Ad, user info | Click on ad? (0/1) | Online advertising | Standard NN |
| Photo | Object (index 1,…,1000) | Photo tagging | CNN |
| Audio | Text transcript | Speech recognition | RNN |
| English | Chinese | Machine translation | RNN |
| Image, radar info | Position of other cars | Autonomous driving | Custom/Hybrid |
Neural Network examples
CNN: often for image data
RNN: often for one-dimensional sequence data
Structured data and Unstructured data
Scale drives deep learning progress
Scale: both the size of the neural network and the scale of the data.
 Data
 Computation
 Algorithms
Using ReLU instead of sigmoid function as activation function can improve efficiency.
Week Two
Notation
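(x, y): a single training example, with $x \in \mathbb{R}^{n_x}$ and $y \in \{0, 1\}$.
m training examples: $\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)})\}$
Take the training set inputs $x^{(1)}, x^{(2)}$ and so on, and stack them in columns: $X = [x^{(1)}\ x^{(2)}\ \cdots\ x^{(m)}] \in \mathbb{R}^{n_x \times m}$. (This makes the implementation much easier than using X’s transpose.)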
Logistic Regression
Differences with former course
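The notation is a bit different from what is introduced in the Machine Learning course (note).
Originally, we add $x_0 = 1$ so that $x \in \mathbb{R}^{n_x + 1}$.
$\hat{y} = \sigma(\theta^T x)$, where $\theta = [\theta_0, \theta_1, \dots, \theta_{n_x}]^T$.
Here in the Deep Learning course, we use b to represent $\theta_0$, and w to represent $\theta_1, \dots, \theta_{n_x}$. Just keep b and w as separate parameters.
Given x, want $\hat{y} = P(y = 1 \mid x)$.
Parameters: $w \in \mathbb{R}^{n_x}$, $b \in \mathbb{R}$
Output: $\hat{y} = \sigma(w^T x + b)$
σ() is the sigmoid function: $\sigma(z) = \frac{1}{1 + e^{-z}}$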
Cost Function
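Loss function: $\mathcal{L}(\hat{y}, y) = -\left(y \log \hat{y} + (1 - y) \log(1 - \hat{y})\right)$
If $y = 1$: $\mathcal{L}(\hat{y}, y) = -\log \hat{y}$, so we want $\hat{y}$ to be large.
If $y = 0$: $\mathcal{L}(\hat{y}, y) = -\log(1 - \hat{y})$, so we want $\hat{y}$ to be small.
Cost function: $J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)})$
The loss function is applied to just a single training example.
The cost function is the cost of your parameters; it is the average of the loss function over the entire training set.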
Gradient Descent
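Usually initialize the parameters to zero in logistic regression. Random initialization also works, but people don’t usually do that for logistic regression.
Repeat {
    $w := w - \alpha \frac{\partial J(w, b)}{\partial w}$
    $b := b - \alpha \frac{\partial J(w, b)}{\partial b}$
}
From forward propagation, we calculate z, a and finally $\mathcal{L}(a, y)$.
From back propagation, we calculate the derivatives step by step: $da = -\frac{y}{a} + \frac{1 - y}{1 - a}$, $dz = a - y$, $dw_j = x_j\, dz$, $db = dz$.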
Algorithm
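(Repeat)
J = 0; dw1, dw2, …, dwn = 0; db = 0
for i = 1 to m:
    $z^{(i)} = w^T x^{(i)} + b$
    $a^{(i)} = \sigma(z^{(i)})$
    $J \mathrel{+}= -\left[y^{(i)} \log a^{(i)} + (1 - y^{(i)}) \log(1 - a^{(i)})\right]$
    $dz^{(i)} = a^{(i)} - y^{(i)}$
    for j = 1 to n:
        $dw_j \mathrel{+}= x_j^{(i)}\, dz^{(i)}$
    $db \mathrel{+}= dz^{(i)}$
J /= m;
dw1, dw2, …, dwn /= m;
db /= m
w1 := w1 − α dw1
w2 := w2 − α dw2
b := b − α db
In the for loop, there’s no superscript i on the dw variables, because the value of dw in the code is accumulated over all examples, while dz refers to a single training example.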
Vectorization
Original for loop:
for i = 1 to m: $z^{(i)} = w^T x^{(i)} + b$
Vectorized: $Z = w^T X + b$
Code:
Z = np.dot(w.T, X) + b
A = sigmoid(Z)
dZ = A - Y
cost = -1 / m * np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A))
db = 1 / m * np.sum(dZ)
dw = 1 / m * np.dot(X, dZ.T)
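The vectorized step above can be put together as a small runnable sketch. The function names `sigmoid` and `logistic_step`, the toy data, and the learning rate are assumptions made for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_step(w, b, X, Y, alpha):
    """One vectorized gradient-descent step.
    Shapes: w (n, 1), b scalar, X (n, m), Y (1, m)."""
    m = X.shape[1]
    A = sigmoid(np.dot(w.T, X) + b)               # forward pass, shape (1, m)
    cost = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
    dZ = A - Y                                    # shape (1, m)
    dw = np.dot(X, dZ.T) / m                      # shape (n, 1)
    db = np.sum(dZ) / m
    return w - alpha * dw, b - alpha * db, cost

# Toy data (hypothetical): the label is 1 when the two features sum to > 0.
rng = np.random.default_rng(0)
X = rng.standard_normal((2, 200))
Y = (X[0:1, :] + X[1:2, :] > 0).astype(float)

w, b = np.zeros((2, 1)), 0.0
costs = []
for _ in range(100):
    w, b, cost = logistic_step(w, b, X, Y, alpha=0.5)
    costs.append(cost)
```

With zero initialization the first cost equals log 2, and it should decrease from there as the loop runs.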
About Python
A.sum(axis = 0): sum vertically (down each column)
A.sum(axis = 1): sum horizontally (across each row)
Broadcasting
If an (m, n) matrix operates with (+*/) a (1, n) row vector, just expand the vector vertically to (m, n) by copying m times.
If an (m, n) matrix operates with a (m, 1) column vector, just expand the vector horizontally to (m, n) by copying n times.
If a row/column vector operates with a real number, just expand the real number to the shape of the vector.
documentation
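A short sketch of the axis sums and the broadcasting rules described above. The array values are made up for illustration.

```python
import numpy as np

A = np.arange(6).reshape(2, 3)       # [[0, 1, 2], [3, 4, 5]], shape (2, 3)
row = np.array([[10, 20, 30]])       # shape (1, 3), copied down 2 times
col = np.array([[100], [200]])       # shape (2, 1), copied across 3 times

plus_row = A + row                   # shape (2, 3)
plus_col = A + col                   # shape (2, 3)
plus_scalar = row + 1                # the scalar is expanded to shape (1, 3)

col_totals = A.sum(axis=0)           # sum vertically
row_totals = A.sum(axis=1)           # sum horizontally
```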
Rank 1 Array
a = np.random.randn(5) creates a rank 1 array whose shape is (5,).
Try to avoid using rank 1 arrays. Use a = a.reshape((5, 1)) or a = np.random.randn(5, 1).
Note that np.dot() performs a matrix-matrix or matrix-vector multiplication. This is different from np.multiply() and the * operator (which is equivalent to .* in MATLAB/Octave), which perform an element-wise multiplication.
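The pitfall with rank 1 arrays, and the difference between np.dot and element-wise multiplication, can be seen in a few lines. This is a minimal sketch with arbitrary values.

```python
import numpy as np

a = np.random.randn(5)               # rank 1 array, shape (5,)
a_col = a.reshape((5, 1))            # explicit column vector, shape (5, 1)

inner = np.dot(a, a.T)               # a.T is still (5,), so this is a scalar
outer = np.dot(a_col, a_col.T)       # (5, 1) x (1, 5) gives a (5, 5) matrix
elementwise = a_col * a_col          # same as np.multiply, shape (5, 1)
```

Transposing the rank 1 array does nothing, which is why the explicit (5, 1) shape is the safer choice.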
Week Three
Neural Network Overview
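Superscript with square brackets denotes the layer; superscript with round brackets refers to the i-th training example.
Logistic regression can be regarded as the simplest neural network. The neuron takes in the inputs and makes two computations: $z = w^T x + b$ and $a = \sigma(z)$.
A neural network functions similarly. (Note that this neural network has 2 layers. When counting layers, the input layer is not included.)
Take the first node in the hidden layer as an example: $z^{[1]}_1 = w^{[1]T}_1 x + b^{[1]}_1$, $a^{[1]}_1 = \sigma(z^{[1]}_1)$
The superscript denotes the layer, and the subscript i represents the node in the layer.
Similarly, $z^{[1]}_i = w^{[1]T}_i x + b^{[1]}_i$, $a^{[1]}_i = \sigma(z^{[1]}_i)$ for the other nodes.
Vectorization:
Formula: $z^{[1]} = W^{[1]} x + b^{[1]}$, $a^{[1]} = \sigma(z^{[1]})$
(dimensions: $(n^{[1]}, 1) = (n^{[1]}, n^{[0]}) \times (n^{[0]}, 1) + (n^{[1]}, 1)$)
Vectorizing across multiple examples: $Z^{[1]} = W^{[1]} X + b^{[1]}$, $A^{[1]} = \sigma(Z^{[1]})$
Explanation
Stack elements in columns: $A^{[1]} = [a^{[1](1)}\ a^{[1](2)}\ \cdots\ a^{[1](m)}]$.
Each column represents a training example, each row represents a hidden unit.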
Activation Function
Sigmoid Function
Only used in the output layer for binary classification (with output 0 or 1).
Not used on other occasions; tanh is a better choice.
tanh Function
With a range of $(-1, 1)$, it performs better than the sigmoid function because the mean of its output is closer to zero.
Both the sigmoid and tanh functions have the disadvantage that when z is very large ($z \to +\infty$) or very small ($z \to -\infty$), the derivative is close to 0, so gradient descent becomes very slow.
ReLU
Default choice of activation function.
With $g'(z) = 1$ when z is positive, it performs well in practice.
(Although $g'(z) = 0$ when z is negative, and technically the derivative at $z = 0$ is not well-defined.)
Leaky ReLU
Makes sure that the derivative is not equal to 0 when z < 0.
Linear Activation Function
Also called identity function.
Not used in hidden layers, because a composition of linear functions is still linear, so even many hidden layers would only produce a linear result. It is only used in the output layer when the output is a real number (e.g. regression).
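The activation functions above and their derivatives can be sketched in NumPy as follows. The function names and the leaky ReLU slope of 0.01 are assumptions for illustration; the derivative at z = 0 is set to 0 for ReLU by convention here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)                 # largest (0.25) at z = 0

def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2         # largest (1.0) at z = 0

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    return (z > 0).astype(float)         # 1 for z > 0, otherwise 0

def leaky_relu(z, slope=0.01):
    return np.where(z > 0, z, slope * z)

def leaky_relu_grad(z, slope=0.01):
    return np.where(z > 0, 1.0, slope)   # never exactly 0

z = np.array([-2.0, 3.0])
```

Evaluating `sigmoid_grad(10.0)` gives a value below 1e-4, which illustrates why saturated sigmoid or tanh units slow gradient descent down.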
Gradient descent
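Forward propagation:
$Z^{[1]} = W^{[1]} X + b^{[1]}$
$A^{[1]} = g^{[1]}(Z^{[1]})$
$Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}$
$A^{[2]} = g^{[2]}(Z^{[2]}) = \sigma(Z^{[2]})$
Backward propagation:
$dZ^{[2]} = A^{[2]} - Y$
$dW^{[2]} = \frac{1}{m}\, dZ^{[2]} A^{[1]T}$
$db^{[2]} = \frac{1}{m}$ np.sum($dZ^{[2]}$, axis=1, keepdims=True)
$dZ^{[1]} = W^{[2]T} dZ^{[2]} \mathbin{.*} g^{[1]\prime}(Z^{[1]})$
$dW^{[1]} = \frac{1}{m}\, dZ^{[1]} X^T$
$db^{[1]} = \frac{1}{m}$ np.sum($dZ^{[1]}$, axis=1, keepdims=True)
Note: keepdims=True makes sure that Python won’t produce a rank-1 array with shape (n,).
.* is the element-wise product. $W^{[2]T} dZ^{[2]}$: $(n^{[1]}, m)$; $g^{[1]\prime}(Z^{[1]})$: $(n^{[1]}, m)$; $dZ^{[1]}$: $(n^{[1]}, m)$.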
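A minimal sketch of one forward and backward pass for a 2-layer network, assuming a tanh hidden layer and a sigmoid output. The dictionary layout, function names, and toy sizes are illustrative assumptions.

```python
import numpy as np

def forward(X, p):
    Z1 = p["W1"] @ X + p["b1"]
    A1 = np.tanh(Z1)                                  # hidden layer, g = tanh
    Z2 = p["W2"] @ A1 + p["b2"]
    A2 = 1.0 / (1.0 + np.exp(-Z2))                    # output layer, sigmoid
    return A1, A2

def backward(X, Y, p, A1, A2):
    m = X.shape[1]
    dZ2 = A2 - Y
    dW2 = dZ2 @ A1.T / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m      # keepdims keeps shape (1, 1)
    dZ1 = (p["W2"].T @ dZ2) * (1.0 - A1 ** 2)         # tanh'(Z1) = 1 - A1^2
    dW1 = dZ1 @ X.T / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m
    return {"W1": dW1, "b1": db1, "W2": dW2, "b2": db2}

rng = np.random.default_rng(0)
n_x, n_h, m = 3, 4, 8
X = rng.standard_normal((n_x, m))
Y = (rng.random((1, m)) > 0.5).astype(float)
p = {
    "W1": rng.standard_normal((n_h, n_x)) * 0.01, "b1": np.zeros((n_h, 1)),
    "W2": rng.standard_normal((1, n_h)) * 0.01,   "b2": np.zeros((1, 1)),
}
A1, A2 = forward(X, p)
grads = backward(X, Y, p, A1, A2)
```

Each gradient has the same shape as the parameter it belongs to, which is a quick sanity check when debugging.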
Random Initialization
In logistic regression, it’s okay to initialize all parameters to zero. However, this does not work in a neural network.
Instead, initialize w with small random values to break symmetry. It’s okay to initialize b to zeros; symmetry is still broken so long as $W^{[l]}$ is initialized randomly.
Random
If the parameters w are all zeros, then the neurons in a hidden layer are symmetric (“identical”). Even after gradient descent they stay the same. So use random initialization.
Small
Both the sigmoid and tanh functions have their greatest derivative at z = 0. If z had a large positive or negative value, the derivative would be close to zero, and consequently gradient descent would be slow. Thus, it’s a good choice to keep the initial values small.
Week Four
Deep neural network notation
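$L$: number of layers
$n^{[l]}$: number of units in layer l
$a^{[l]}$: activations in layer l.
($a^{[0]} = x$, $a^{[L]} = \hat{y}$)
$W^{[l]}$: weights for $z^{[l]}$
$b^{[l]}$: bias for $z^{[l]}$
Forward Propagation
for l = 1 to L:
    $Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}$
    $A^{[l]} = g^{[l]}(Z^{[l]})$
Well, this for loop is inevitable.
Matrix Dimensions
$W^{[l]}$, $dW^{[l]}$: $(n^{[l]}, n^{[l-1]})$
$b^{[l]}$, $db^{[l]}$: $(n^{[l]}, 1)$
(here the dimension of $b^{[l]}$ can stay $(n^{[l]}, 1)$ with Python’s broadcasting)
$Z^{[l]}$, $A^{[l]}$, $dZ^{[l]}$, $dA^{[l]}$: $(n^{[l]}, m)$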
cache
Cache is used to pass variables computed during forward propagation to the corresponding backward propagation step. It contains useful values for backward propagation to compute derivatives.
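As a hedged illustration of the layer loop and the cache, the sketch below runs forward propagation through an L-layer network and stores what the backward step would need. ReLU hidden layers with a sigmoid output, the list-of-tuples parameter layout, and the layer sizes are assumptions.

```python
import numpy as np

def relu(Z):
    return np.maximum(0.0, Z)

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def forward_deep(X, params):
    """params: list of (W, b) tuples for layers 1..L."""
    A = X
    caches = []
    L = len(params)
    for l, (W, b) in enumerate(params, start=1):
        A_prev = A
        Z = W @ A_prev + b                        # b broadcasts from (n_l, 1) to (n_l, m)
        A = sigmoid(Z) if l == L else relu(Z)
        caches.append((A_prev, W, b, Z))          # kept for the backward pass
    return A, caches

layer_dims = [5, 4, 3, 1]                         # n[0], n[1], n[2], n[3]
rng = np.random.default_rng(0)
params = [
    (rng.standard_normal((layer_dims[l], layer_dims[l - 1])) * 0.01,
     np.zeros((layer_dims[l], 1)))
    for l in range(1, len(layer_dims))
]
X = rng.standard_normal((5, 7))                   # m = 7 examples
AL, caches = forward_deep(X, params)
```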
Why deep representations?
Informally: There are functions you can compute with a “small” L-layer deep neural network that shallower networks require exponentially more hidden units to compute.
Forward and Backward Propagation
Forward propagation for layer l
Input $a^{[l-1]}$
Output $a^{[l]}$, cache ($z^{[l]}$)
Backward propagation for layer l
Input $da^{[l]}$
Output $da^{[l-1]}$, $dW^{[l]}$, $db^{[l]}$
Hyperparameters and Parameters
Hyperparameters determine the final values of the parameters.
Parameters
· $W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]}, \dots$
Hyperparameters
· learning rate
· number of iterations
· number of hidden layers
· number of hidden units
· choice of activation function
· momentum, mini-batch size, regularization, etc.
Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization
Week One
Setting up your Machine Learning Application
Train/dev/test sets
Training set:
Keep training algorithms on the training set.
Development set
Also called the hold-out cross validation set, or dev set for short.
Use the dev set to see which of many different models performs best.
Test set
Used to get an unbiased estimate of how well your algorithm is doing.
Proportion
Previous era: when the amount of data was not too large, it was common to take all the data and split it 70%/30% (train/test) or 60%/20%/20% (train/dev/test).
Big data: when there are millions of examples, 10,000 examples for the dev set and 10,000 for the test set are enough. The proportion can be 98/1/1 or even 99.5/0.4/0.1.
Notes
Make sure the dev set and test set come from the same distribution.
Not having a test set might be okay if an unbiased estimate of performance is not needed. In that case the dev set is often called the ‘test set’, even though there is no real test set.
Bias/Variance
Solutions
High bias:
Bigger network
Train longer
(Neural network architecture search)
High variance:
More data
Regularization
(Neural network architecture search)
Bias Variance tradeoff
Traditionally, reducing bias may increase variance, and vice versa, so it was necessary to trade off between bias and variance.
But in deep learning, there are ways to reduce one without increasing the other much (a bigger network for bias, more data for variance), so the bias-variance trade-off is much less of a concern.
Regularization
L2 Regularization
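Logistic regression
$J(w, b) = \frac{1}{m}\sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m}\|w\|_2^2$
L2-regularization relies on the assumption that a model with small weights is simpler than a model with large weights. Thus, by penalizing the squared values of the weights in the cost function you drive all the weights to smaller values. It becomes too costly for the cost to have large weights! This leads to a smoother model in which the output changes more slowly as the input changes.
Weights end up smaller (“weight decay”): weights are pushed to smaller values.
L2 regularization: $\frac{\lambda}{2m}\|w\|_2^2 = \frac{\lambda}{2m}\sum_{j=1}^{n_x} w_j^2$
L1 regularization: $\frac{\lambda}{2m}\|w\|_1 = \frac{\lambda}{2m}\sum_{j=1}^{n_x} |w_j|$
(L1 regularization leads w to be sparse, but is not very effective.)
Neural network
$J = \frac{1}{m}\sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m}\sum_{l=1}^{L}\|W^{[l]}\|_F^2$
$\|W^{[l]}\|_F^2 = \sum_{i}\sum_{j} (w^{[l]}_{ij})^2$; it’s called the Frobenius norm, which is different from Euclidean distance.
Back propagation: $dW^{[l]} = (\text{from backprop}) + \frac{\lambda}{m} W^{[l]}$, $W^{[l]} := W^{[l]} - \alpha\, dW^{[l]}$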
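The L2 penalty and its contribution to the gradient can be sketched as below. The helper names and numbers are hypothetical; the last lines check the "weight decay" reading, namely that the update shrinks W by a factor of (1 − αλ/m) before applying the usual gradient step.

```python
import numpy as np

def l2_penalty(weights, lambd, m):
    """lambda / (2m) times the sum of squared Frobenius norms."""
    return lambd / (2 * m) * sum(np.sum(np.square(W)) for W in weights)

def l2_grad_term(W, lambd, m):
    """Extra term added to dW from backprop."""
    return lambd / m * W

W = np.array([[1.0, 2.0], [3.0, 4.0]])   # sum of squares = 30
lambd, m, alpha = 0.1, 10, 0.5
penalty = l2_penalty([W], lambd, m)

dW_backprop = np.full_like(W, 0.2)        # stand-in for the unregularized gradient
dW = dW_backprop + l2_grad_term(W, lambd, m)
W_new = W - alpha * dW
W_decay_form = (1 - alpha * lambd / m) * W - alpha * dW_backprop
```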
Dropout regularization
With dropout, we go through each of the layers of the network and set some probability of eliminating each node in that layer.
For each training example, you then train using one of these reduced (thinned) networks.
The idea behind dropout is that at each iteration, you train a different model that uses only a subset of your neurons. With dropout, your neurons thus become less sensitive to the activation of one other specific neuron, because that other neuron might be shut down at any time.
Usually used in Computer Vision.
Implementation with layer 3


 Dropout is a regularization technique.
 You only use dropout during training. Don’t use dropout (randomly eliminate nodes) during test time.
 Apply dropout both during forward and backward propagation.
 During training time, divide each dropout layer by keep_prob to keep the same expected value for the activations. For example, if keep_prob is 0.5, then we will on average shut down half the nodes, so the output will be scaled by 0.5 since only the remaining half are contributing to the solution. Dividing by 0.5 is equivalent to multiplying by 2. Hence, the output now has the same expected value. You can check that this works even when keep_prob is other values than 0.5.
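The points above describe what is often called inverted dropout. A minimal sketch, assuming a layer's activations are stored as a NumPy array; the function names and keep_prob = 0.8 are illustrative choices.

```python
import numpy as np

def dropout_forward(A, keep_prob, rng):
    D = rng.random(A.shape) < keep_prob      # mask: True means the unit is kept
    A = A * D / keep_prob                    # zero out dropped units, rescale the rest
    return A, D

def dropout_backward(dA, D, keep_prob):
    return dA * D / keep_prob                # reuse the same mask and the same scaling

rng = np.random.default_rng(0)
A3 = np.ones((50, 1000))
A3_drop, D3 = dropout_forward(A3, keep_prob=0.8, rng=rng)
dA3 = dropout_backward(np.ones_like(A3), D3, keep_prob=0.8)
```

Because of the division by keep_prob, the mean of `A3_drop` stays close to the mean of `A3`, so nothing needs to be rescaled at test time.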
Other regularization methods
Data augmentation
Take image input for example: flip the image horizontally, rotate it, randomly zoom or crop it, apply distortions, etc.
This gives more training data at little cost and helps reduce overfitting.
Early stopping
Stop training early so that $\|w\|_F^2$ is still relatively small.
Early stopping violates orthogonalization, which suggests treating “optimize the cost function J” and “do not overfit” as separate tasks.
Optimization
Normalizing inputs
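Subtract mean: $\mu = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}$, $x := x - \mu$
Normalize variance: $\sigma^2 = \frac{1}{m}\sum_{i=1}^{m} (x^{(i)})^2$ (element-wise, after subtracting the mean), $x := x / \sigma$
Note: use the same $\mu$ and $\sigma$ to normalize the test set.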
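A short sketch of input normalization in which the statistics are computed on the training set only and then reused for the test set. The function names, the small eps guard, and the synthetic data are assumptions.

```python
import numpy as np

def fit_normalizer(X_train):
    """X_train has shape (n_x, m); statistics are per feature (per row)."""
    mu = X_train.mean(axis=1, keepdims=True)
    sigma = X_train.std(axis=1, keepdims=True)    # uses 1/m, as in the formula above
    return mu, sigma

def normalize(X, mu, sigma, eps=1e-8):
    return (X - mu) / (sigma + eps)               # eps avoids division by zero

rng = np.random.default_rng(0)
X_train = rng.normal(5.0, 3.0, size=(2, 500))
X_test = rng.normal(5.0, 3.0, size=(2, 100))

mu, sigma = fit_normalizer(X_train)
X_train_n = normalize(X_train, mu, sigma)
X_test_n = normalize(X_test, mu, sigma)           # same mu and sigma as training
```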
Intuition:
Vanishing/Exploding gradients
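Since the number of layers in deep learning may be quite large, the product of the weights of L layers may tend to $\infty$ or to $0$. (Just think about a number slightly larger than 1, or slightly smaller than 1, raised to the power L.)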
Weight initialization for deep networks
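Take a single neuron as example: $z = w_1 x_1 + w_2 x_2 + \dots + w_n x_n$
If $n$ is large, then each $w_i$ should be smaller. Our goal is to get $\mathrm{Var}(w_i) = \frac{1}{n}$ (or $\frac{2}{n}$ for ReLU).
Random initialization for ReLU: $W^{[l]} = \text{np.random.randn(shape)} \times \sqrt{\frac{2}{n^{[l-1]}}}$ (known as He initialization, named for the first author of He et al., 2015.)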
For tanh: use
Xavier initialization:
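For the ReLU case, He initialization can be sketched as follows. The function name `he_init` and the layer sizes are assumptions; the check only confirms that the empirical variance of the weights is close to 2 / n_in.

```python
import numpy as np

def he_init(n_out, n_in, rng):
    """Weights for a layer with n_in inputs and n_out units, variance 2 / n_in."""
    return rng.standard_normal((n_out, n_in)) * np.sqrt(2.0 / n_in)

rng = np.random.default_rng(0)
W = he_init(400, 500, rng)
target_var = 2.0 / 500
```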
Gradient checking
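Take $W^{[1]}, b^{[1]}, \dots, W^{[L]}, b^{[L]}$ and reshape into a big vector $\theta$: $J(W^{[1]}, b^{[1]}, \dots, W^{[L]}, b^{[L]}) = J(\theta)$
Take $dW^{[1]}, db^{[1]}, \dots, dW^{[L]}, db^{[L]}$ and reshape into a big vector $d\theta$
for each i:
    $d\theta_{approx}[i] = \frac{J(\theta_1, \dots, \theta_i + \epsilon, \dots) - J(\theta_1, \dots, \theta_i - \epsilon, \dots)}{2\epsilon}$
check if $d\theta_{approx} \approx d\theta$?
Calculate $\frac{\|d\theta_{approx} - d\theta\|_2}{\|d\theta_{approx}\|_2 + \|d\theta\|_2}$. (A very small value, e.g. on the order of $10^{-7}$, is great.)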
Note
Gradient checking verifies closeness between the gradients from backpropagation and the numerical approximation of the gradient (computed using forward propagation).
Gradient checking is slow, so we don’t run it in every iteration of training. You would usually run it only to make sure your code is correct, then turn it off and use backprop for the actual learning process.
 Don’t use in training, only to debug.
 If the algorithm fails grad check, look at the individual components to try to identify the bug.
 Remember to include the regularization term.
 Doesn’t work with dropout.
 Run at random initialization; perhaps again after some training.
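Gradient checking as described above can be sketched with a generic helper. The cost function here is a made-up cubic whose exact gradient is known; `grad_check` and its arguments are hypothetical names.

```python
import numpy as np

def grad_check(J, theta, dtheta, eps=1e-7):
    """Relative difference between dtheta and a two-sided numerical gradient of J."""
    approx = np.zeros_like(theta)
    for i in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus[i] += eps
        minus[i] -= eps
        approx[i] = (J(plus) - J(minus)) / (2 * eps)
    numerator = np.linalg.norm(approx - dtheta)
    denominator = np.linalg.norm(approx) + np.linalg.norm(dtheta)
    return numerator / denominator

J = lambda th: np.sum(th ** 3)            # toy cost, exact gradient is 3 * theta^2
theta = np.array([1.0, -2.0, 0.5])

good = grad_check(J, theta, 3 * theta ** 2)   # correct "backprop" gradient
bad = grad_check(J, theta, 2 * theta ** 2)    # deliberately wrong gradient
```

The correct gradient yields a tiny relative difference, while the buggy one yields a value of about 0.2, far above any reasonable threshold.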
Week Two
Mini-batch gradient descent
Batch gradient descent (the original gradient descent that we’ve known) processes the entire training set before updating the parameters by one small step. If the training set is very large, training is quite slow. The idea of mini-batch gradient descent is to use only part of the training set for each update, so the parameters are updated more often.
For example, if $X$ has dimension $(n_x, m)$, divide the training set into $T$ parts, each with dimension $(n_x, m/T)$, i.e. $X = [X^{\{1\}}, X^{\{2\}}, \dots, X^{\{T\}}]$ (with $T = 5000$ in the loop below).
Similarly, $Y = [Y^{\{1\}}, Y^{\{2\}}, \dots, Y^{\{T\}}]$.
One iteration of mini-batch gradient descent (computing on a single mini-batch) is faster than one iteration of batch gradient descent.
Two steps of mini-batch gradient descent:
repeat {
for t = 1,…,5000 {
Forward prop on $X^{\{t\}}$
…
Compute cost $J^{\{t\}}$
Backprop to compute gradients of the cost $J^{\{t\}}$ (using $X^{\{t\}}, Y^{\{t\}}$), then update the parameters
} # this is called 1 epoch
}
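A sketch of how the training set might be split into mini-batches. Shuffling the columns before splitting is an assumption (the notes above do not mention it), and the function name and toy data are hypothetical.

```python
import numpy as np

def random_mini_batches(X, Y, batch_size, seed=0):
    """X: (n_x, m), Y: (1, m). Returns a list of (X_batch, Y_batch) pairs."""
    m = X.shape[1]
    perm = np.random.default_rng(seed).permutation(m)
    X_s, Y_s = X[:, perm], Y[:, perm]             # same permutation keeps pairs aligned
    return [(X_s[:, k:k + batch_size], Y_s[:, k:k + batch_size])
            for k in range(0, m, batch_size)]

# Toy data: row 0 of X equals the label, so alignment is easy to verify.
X = np.arange(20, dtype=float).reshape(2, 10)
Y = np.arange(10).reshape(1, 10)
batches = random_mini_batches(X, Y, batch_size=4)
```

When m is not a multiple of the batch size, the last mini-batch is simply smaller.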
Choosing minibatch size
Mini-batch size = m: Batch gradient descent.
It has to process the whole training set before making progress, which takes too long per iteration.
Mini-batch size = 1: Stochastic gradient descent.
It loses the benefits of vectorization across examples.
Mini-batch size in between 1 and m.
Fastest learning: it uses vectorization and makes progress without processing the entire training set.
If the training set is small (m ≤ 2000): just use batch gradient descent.
Typical mini-batch sizes: 64, 128, 256, 512 (1024)
Exponentially weighted averages
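E.g.
$v_t = \beta v_{t-1} + (1 - \beta)\theta_t$
$v_{t-1} = \beta v_{t-2} + (1 - \beta)\theta_{t-1}$
$v_{t-2} = \beta v_{t-3} + (1 - \beta)\theta_{t-2}$
…
Replace $v_{t-1}$ with the second equation, then replace $v_{t-2}$ with the third equation, and so on. Finally we’d get $v_t = (1 - \beta)\sum_{k=0}^{t-1} \beta^{k}\theta_{t-k}$ (with $v_0 = 0$).
This is why it is called an exponentially weighted average. In practice, $\beta = 0.9$ is common, so $v_t$ is roughly an average over the last $\frac{1}{1 - \beta} = 10$ values.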
Bias correction
As is shown above, the purple line is the exponentially weighted average without bias correction; at the very beginning it is much lower than the exponentially weighted average with bias correction (green line).
Since $v_0$ is set to zero and $\beta$ is close to 1, the first values (e.g. $v_1 = (1 - \beta)\theta_1$) are quite small. The result stays too small until t gets large enough that $\beta^t \approx 0$. To avoid this, bias correction introduces another step: $v_t^{corrected} = \frac{v_t}{1 - \beta^t}$
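The effect of bias correction is easy to see on a constant series. A minimal sketch in plain Python; the function name `ewa` and the input values are made up.

```python
def ewa(values, beta=0.9, bias_correction=True):
    """Exponentially weighted average, optionally divided by (1 - beta^t)."""
    v = 0.0
    out = []
    for t, theta in enumerate(values, start=1):
        v = beta * v + (1 - beta) * theta
        out.append(v / (1 - beta ** t) if bias_correction else v)
    return out

series = [10.0] * 5
raw = ewa(series, bias_correction=False)     # starts far below 10
corrected = ewa(series)                      # stays at 10 from the first step
```

Without correction the first value is only (1 − β) times the input; with correction the average of a constant series equals that constant at every step.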
Gradient descent with momentum
Set $v_{dW} = 0$, $v_{db} = 0$
On iteration t:
Compute dW, db on the current mini-batch
$v_{dW} = \beta v_{dW} + (1 - \beta)\, dW$
$v_{db} = \beta v_{db} + (1 - \beta)\, db$
$W := W - \alpha v_{dW}$, $b := b - \alpha v_{db}$
Momentum takes past gradients into account to smooth out the gradient steps. Gradient descent with momentum uses the same idea as the exponentially weighted average (though some implementations omit the $(1 - \beta)$ factor). As in the example shown above, we want slower learning along the direction that oscillates and faster learning along the direction that leads to the minimum. The exponentially weighted average averages out the oscillation and makes gradient descent faster. Note that bias correction is usually not needed for gradient descent with momentum: after several iterations the average has warmed up.
RMSprop
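On iteration t:
Compute dW, db on the current mini-batch
$s_{dW} = \beta s_{dW} + (1 - \beta)\, dW^2$, $s_{db} = \beta s_{db} + (1 - \beta)\, db^2$ (the squares are element-wise)
$W := W - \alpha \frac{dW}{\sqrt{s_{dW}} + \epsilon}$, $b := b - \alpha \frac{db}{\sqrt{s_{db}} + \epsilon}$
RMS means Root Mean Square; it divides each gradient by the root of the running average of its squares, which damps the directions with large oscillations.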
Adam optimization algorithm
Combine momentum and RMSprop together:
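1. It calculates an exponentially weighted average of past gradients, and stores it in variable v (before bias correction) and v_corrected (with bias correction).
2. It calculates an exponentially weighted average of the squares of the past gradients, and stores it in variable s (before bias correction) and s_corrected (with bias correction).
3. It updates parameters in a direction based on combining information from 1 and 2.
Set $v_{dW} = 0$, $s_{dW} = 0$, $v_{db} = 0$, $s_{db} = 0$
On iteration t:
Compute dW, db on the current mini-batch
$v_{dW} = \beta_1 v_{dW} + (1 - \beta_1)\, dW$, $v_{db} = \beta_1 v_{db} + (1 - \beta_1)\, db$
$s_{dW} = \beta_2 s_{dW} + (1 - \beta_2)\, dW^2$, $s_{db} = \beta_2 s_{db} + (1 - \beta_2)\, db^2$
$v^{corrected}_{dW} = \frac{v_{dW}}{1 - \beta_1^t}$, $s^{corrected}_{dW} = \frac{s_{dW}}{1 - \beta_2^t}$ (likewise for db)
$W := W - \alpha \frac{v^{corrected}_{dW}}{\sqrt{s^{corrected}_{dW}} + \epsilon}$ (likewise for b)
Hyperparameters:
$\alpha$: needs to be tuned
$\beta_1$: 0.9
$\beta_2$: 0.999
$\epsilon$: $10^{-8}$
(Adam stands for Adaptive Moment Estimation.)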
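A sketch of a single Adam update for one parameter array, combining the momentum and RMSprop averages with bias correction. The function name, the toy quadratic cost, and the chosen learning rate are assumptions for illustration.

```python
import numpy as np

def adam_update(w, dw, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    v = beta1 * v + (1 - beta1) * dw             # momentum-style average of gradients
    s = beta2 * s + (1 - beta2) * dw ** 2        # RMSprop-style average of squared gradients
    v_corrected = v / (1 - beta1 ** t)
    s_corrected = s / (1 - beta2 ** t)
    w = w - alpha * v_corrected / (np.sqrt(s_corrected) + eps)
    return w, v, s

# Toy problem: minimize ||w||^2, whose gradient is 2w.
w0 = np.array([5.0, -3.0])
first_step, _, _ = adam_update(w0, 2 * w0, np.zeros(2), np.zeros(2), t=1, alpha=0.01)

w, v, s = w0.copy(), np.zeros(2), np.zeros(2)
for t in range(1, 2001):
    w, v, s = adam_update(w, 2 * w, v, s, t, alpha=0.01)
```

On the first iteration the bias-corrected ratio is close to the sign of the gradient, so each coordinate moves by roughly α regardless of the gradient's magnitude.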
Learning rate decay
Mini-batch gradient descent won’t converge exactly, but will wander around the optimum instead. To help it converge, it’s advisable to decay the learning rate as training proceeds.
Some formula:



discrete staircase (halve α after some iterations)
manual decay
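As a hedged illustration, two decay schedules are sketched below: an inverse-time decay, which is an assumption here (one commonly used form, with hypothetical parameter names), and the discrete staircase that halves α after a fixed number of epochs.

```python
def inverse_time_decay(alpha0, decay_rate, epoch):
    """alpha = alpha0 / (1 + decay_rate * epoch); one common schedule."""
    return alpha0 / (1 + decay_rate * epoch)

def staircase_decay(alpha0, epoch, every=10):
    """Halve alpha once every `every` epochs."""
    return alpha0 * 0.5 ** (epoch // every)

schedule = [inverse_time_decay(0.2, 1.0, e) for e in range(4)]
```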
Week Three
Hyperparameter tuning
Hyperparameters: $\alpha$, number of layers, number of units, learning rate decay, mini-batch size, etc.
Priority:
Try to use random values of hyperparameters rather than a grid.
Coarse to fine: if some region gives good results, sample more densely in that region.
Appropriate scale:
It’s okay to sample uniformly at random for some hyperparameters: number of layers, number of units.
For hyperparameters like $\alpha$, instead of sampling uniformly at random, sample uniformly on a logarithmic scale.
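Sampling on a logarithmic scale can be sketched by drawing the exponent uniformly. The range 1e-4 to 1 and the function name are assumptions chosen for illustration.

```python
import numpy as np

def sample_log_uniform(low, high, size, rng):
    """Draw values whose log10 is uniform between log10(low) and log10(high)."""
    r = rng.uniform(np.log10(low), np.log10(high), size)
    return 10.0 ** r

rng = np.random.default_rng(0)
alphas = sample_log_uniform(1e-4, 1.0, 1000, rng)
frac_small = np.mean(alphas < 1e-2)          # about half the samples fall below 0.01
```

With plain uniform sampling on [0.0001, 1], only about 1% of the samples would fall below 0.01; on the log scale each decade gets equal attention.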
Pandas & Caviar
Panda: babysitting one model at a time
Caviar: training many models in parallel
Largely determined by the amount of computational power you can access.
Batch Normalization
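Using the idea of normalizing the input, apply normalization in the hidden layers as well.
Given some intermediate values $z^{(1)}, \dots, z^{(m)}$ in the neural network (specifically in a single layer $l$):
$\mu = \frac{1}{m}\sum_{i} z^{(i)}$
$\sigma^2 = \frac{1}{m}\sum_{i} (z^{(i)} - \mu)^2$
$z^{(i)}_{norm} = \frac{z^{(i)} - \mu}{\sqrt{\sigma^2 + \epsilon}}$
$\tilde{z}^{(i)} = \gamma z^{(i)}_{norm} + \beta$, where $\gamma$ and $\beta$ are learnable parameters
Use $\tilde{z}^{[l](i)}$ instead of $z^{[l](i)}$.
Batch Norm as regularization
Each mini-batch is scaled by the mean/variance computed on just that mini-batch.
This adds some noise to the values within that mini-batch. So similar to dropout, it adds some noise to each hidden layer’s activations.
This has a slight regularization effect.
Batch Norm at test time: use exponentially weighted averages of $\mu$ and $\sigma^2$ across mini-batches (collected during training) as the estimates for test.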
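A sketch of the batch norm forward computation on one mini-batch, with units as rows and examples as columns. The function name, the eps value, and the synthetic input are assumptions; the running averages needed at test time are not shown.

```python
import numpy as np

def batchnorm_forward(Z, gamma, beta, eps=1e-8):
    """Z: (n_units, batch_size); gamma, beta: (n_units, 1)."""
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    Z_norm = (Z - mu) / np.sqrt(var + eps)       # zero mean, unit variance per unit
    Z_tilde = gamma * Z_norm + beta              # learnable scale and shift
    return Z_tilde, mu, var

rng = np.random.default_rng(0)
Z = rng.normal(3.0, 2.0, size=(4, 64))
Z_plain, mu, var = batchnorm_forward(Z, np.ones((4, 1)), np.zeros((4, 1)))
Z_scaled, _, _ = batchnorm_forward(Z, 2.0 * np.ones((4, 1)), 3.0 * np.ones((4, 1)))
```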
Multiclass classification
Softmax
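The output layer is a vector with dimension C rather than a single real number, where C is the number of classes.
Activation function: $t = e^{z^{[L]}}$, $a^{[L]}_j = \frac{t_j}{\sum_{k=1}^{C} t_k}$
Cost function: $\mathcal{L}(\hat{y}, y) = -\sum_{j=1}^{C} y_j \log \hat{y}_j$, $J = \frac{1}{m}\sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)})$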
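Softmax and its cross-entropy cost can be sketched as below, with classes as rows and examples as columns. Subtracting the column maximum before exponentiating is a standard numerical-stability step added here as an assumption; it does not change the result.

```python
import numpy as np

def softmax(Z):
    """Z: (C, m). Each column is turned into a probability distribution."""
    Z_shift = Z - Z.max(axis=0, keepdims=True)
    T = np.exp(Z_shift)
    return T / T.sum(axis=0, keepdims=True)

def cross_entropy_cost(A, Y):
    """A: softmax outputs (C, m); Y: one-hot labels (C, m)."""
    m = Y.shape[1]
    return -np.sum(Y * np.log(A)) / m

Z = np.array([[2.0, 0.0], [1.0, 0.0], [0.1, 0.0]])
A = softmax(Z)
Y = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
uniform_cost = cross_entropy_cost(softmax(np.zeros((3, 2))), Y)
```

With all-zero logits every class gets probability 1/C, so the cost equals log C.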
Deep Learning frameworks
 Caffe/Caffe2
 CNTK
 DL4J
 Keras
 Lasagne
 mxnet
 PaddlePaddle
 TensorFlow
 Theano
 Torch
Choosing deep learning frameworks
Ease of programming (development and deployment)
Running speed
Truly Open (open source with good governance)
TensorFlow
Writing and running programs in TensorFlow has the following steps:
 Create Tensors(variables) that are not yet executed/evaluated.
 Write operations between those Tensors.
 Initialize your Tensors.
 Create a Session.
 Run the Session. This will run the operations you’d written above.
tf.constant(...): to create a constant value
tf.placeholder(dtype = ..., shape = ..., name = ...): a placeholder is an object whose value you can specify only later
tf.add(..., ...): to do an addition
tf.multiply(..., ...): to do a multiplication
tf.matmul(..., ...): to do a matrix multiplication
2 typical ways to create and use sessions in TensorFlow:




Structuring Machine Learning Projects
Week One
Orthogonalization
Orthogonalization or orthogonality is a system design property that assures that modifying an instruction or a component of an algorithm will not create or propagate side effects to other components of the system. It becomes easier to verify the algorithms independently from one another, and it reduces testing and development time.
When a supervised learning system is designed, these are the 4 assumptions that need to be true and orthogonal.
 Fit training set well on cost function: bigger network, Adam, etc.
 Fit dev set well on cost function: regularization, bigger training set, etc.
 Fit test set well on cost function: bigger dev set
 Performs well in real world: change dev set or cost function
Single number evaluation metric
Precision
Among all examples predicted positive, the fraction that are actually positive: $\frac{TP}{TP + FP}$.
Recall
Among all actual positive examples, the fraction that are correctly predicted positive: $\frac{TP}{TP + FN}$.
F1 Score
The problem with using precision and recall together as the evaluation metric is that when one classifier has better precision and another has better recall, you cannot tell which one is better. The F1 score, the harmonic mean $\frac{2}{\frac{1}{P} + \frac{1}{R}}$, combines precision and recall into a single number.
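Precision, recall and F1 can be computed from binary labels in a few lines. The function name and the example labels are made up for illustration.

```python
def precision_recall_f1(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)        # harmonic mean of P and R
    return precision, recall, f1

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]                      # 2 TP, 1 FP, 1 FN
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```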
Satisficing and optimizing metric
There are different metrics to evaluate the performance of a classifier; they are called evaluation metrics. They can be categorized as satisficing and optimizing metrics. It is important to note that these evaluation metrics must be evaluated on a training set, a development set or on the test set.
The general rule is:
For example:
| Classifier | Accuracy | Running Time |
|---|---|---|
| A | 90% | 80ms |
| B | 92% | 95ms |
| C | 95% | 1500ms |
For example, there are two evaluation metrics: accuracy and running time. Take accuracy as the optimizing metric and running time as the satisficing metric. A satisficing metric only has to meet a set threshold; among the classifiers that meet it, improve the optimizing metric as much as possible.
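The selection rule can be sketched with the three classifiers from the table. The 100 ms running-time threshold is an assumption chosen for illustration, as is the function name.

```python
# (accuracy, running time in ms) taken from the table above
classifiers = {"A": (0.90, 80), "B": (0.92, 95), "C": (0.95, 1500)}

def pick_classifier(candidates, max_ms):
    """Keep candidates that satisfy the running-time limit, then maximize accuracy."""
    feasible = {name: stats for name, stats in candidates.items()
                if stats[1] <= max_ms}
    if not feasible:
        return None
    return max(feasible, key=lambda name: feasible[name][0])

best = pick_classifier(classifiers, max_ms=100)   # hypothetical limit of 100 ms
```

Under this threshold C is excluded despite its higher accuracy, and B is chosen over A.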
Train/Dev/Test Set
It’s important that the development and test sets come from the same distribution, and that both are drawn randomly from all the data.
Guideline: Choose a dev set and test set to reflect data you expect to get in the future and consider important to do well on.
Size
Old way of splitting data:
We had smaller data set, therefore, we had to use a greater percentage of data to develop and test ideas and models.
Modern era  Big data:
Now, because a larger amount of data is available, we don’t have to compromise and can use a greater portion to train the model.
Set your dev set to be big enough to detect differences in algorithms/models you’re trying out.
Set your test set to be big enough to give high confidence in the overall performance of your system.
When to change dev/test sets and metrics
Orthogonalization:
How to define a metric to evaluate classifiers.
Worry separately about how to do well on this metric.
If doing well on your metric + dev/test set does not correspond to doing well on your application, change your metric and/or dev/test set.
Comparing to human-level performance
The graph shows the performance of humans and machine learning over time.
Machine learning progress slows down once it surpasses human-level performance. One of the reasons is that human-level performance can be close to Bayes optimal error, especially for natural perception problems.
Bayes optimal error is defined as the best possible error. In other words, no function mapping from x to y can surpass a certain level of accuracy (for different reasons, e.g. blurry images, noisy audio, etc.).
Humans are quite good at a lot of tasks. So long as machine learning is worse than humans, you can:
 Get labeled data from humans
 Gain insight from manual error analysis: Why did a person get this right?
 Better analysis of bias/variance
Human-level error can be used as a proxy for Bayes error (i.e. human-level error ≈ Bayes error).
The difference between human-level error and training error is regarded as “avoidable bias”.
If the difference between the human-level error and the training error is bigger than the difference between the training error and the development error, the focus should be on bias reduction techniques:
· Train bigger model
· Train longer/better optimization algorithms(momentum, RMSprop, Adam)
· NN architecture/hyperparameters search(RNN,CNN)
If the difference between the training error and the development error is bigger than the difference between the human-level error and the training error, the focus should be on variance reduction techniques:
· More data
· Regularization(L2, dropout, data augmentation)
· NN architecture/hyperparameters search
Problems where machine learning significantly surpasses human-level performance
Feature: Structured data, not natural perception, lots of data.
· Online advertising
· Product recommendations
· Logistics(predicting transit time)
· Loan approvals
The two fundamental assumptions of supervised learning:
You can fit the training set pretty well.(avoidable bias ≈ 0)
The training set performance generalizes pretty well to the dev/test set.(variance ≈ 0)
Week Two
Error Analysis
Spreadsheet:
Before deciding how to improve the accuracy, set up a spreadsheet to find out what matters.
For example:
| Image | Dog | Great Cat | Blurry | Comment |
|---|---|---|---|---|
| 1 | √ | | | small white dog |
| 2 | | √ | √ | lion in rainy day |
| … | | | | |
| Percentage | 5% | 41% | 63% | |
Mislabeled examples are those for which your learning algorithm outputs the wrong value of Y.
Incorrectly labeled examples are those in the training/dev/test set whose label Y, as assigned by a human labeler, is actually incorrect.
Deep learning algorithms are quite robust to random errors in the training set, but less robust to systematic errors.
Guideline: Build system quickly, then iterate.
 Set up development/test set and metrics
 Set up a target
 Build an initial system quickly
 Train training set quickly: Fit the parameters
 Development set: Tune the parameters
 Test set: Assess the performance
 Use bias/variance analysis & Error analysis to prioritize next steps
Mismatched training and dev/test set
The development set and test set should come from the same distribution. However, the training set’s distribution might be a bit different. Take a mobile cat-recognizer application for example:
The images from web pages have high resolution and are professionally framed. However, the images from the app’s users have relatively low resolution and are blurrier.
The problem is that you have two different distributions:
Small data set from pictures uploaded by users (10,000). This distribution is the one that matters for the mobile app.
Bigger data set from the web (200,000).
Do not mix all the data and randomly shuffle it into train/dev/test sets, because the dev and test sets would then consist mostly of web images.
Instead, take 5,000 examples from users into the training set (together with the web images), and split the remaining 5,000 user examples evenly into the dev and test sets.
The advantage of this way of splitting is that the target is well defined: the dev and test sets reflect the distribution you care about.
The disadvantage is that the training distribution is different from the dev and test set distributions. However, this way of splitting the data gives better performance in the long term.
Training-Dev Set
Since the distributions of the training set and the dev set are different now, it’s hard to know whether the difference between the training error and the dev error is caused by variance or by the different distributions.
Therefore, carve out a small fraction of the original training set, called the training-dev set. Don’t use the training-dev set for training; use it to check variance.
The difference between the training-dev error and the dev error is called data mismatch.
Addressing data mismatch:
 Carry out manual error analysis to try to understand difference between training and dev/test sets.
 Make training data more similar; or collect more data similar to dev/test sets
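The three gaps discussed in this course (avoidable bias, variance, data mismatch) can be read off from four error rates. The sketch below uses made-up error values and a hypothetical function name; it only shows the arithmetic, not a rule for every situation.

```python
def diagnose(human, train, train_dev, dev):
    """Return the largest gap and all three gaps, given four error rates."""
    gaps = {
        "avoidable bias": train - human,         # human-level error as a proxy for Bayes error
        "variance": train_dev - train,           # same distribution, unseen data
        "data mismatch": dev - train_dev,        # different distribution
    }
    return max(gaps, key=gaps.get), gaps

# Hypothetical error rates
worst_a, gaps_a = diagnose(human=0.01, train=0.02, train_dev=0.09, dev=0.10)
worst_b, gaps_b = diagnose(human=0.01, train=0.02, train_dev=0.025, dev=0.10)
worst_c, gaps_c = diagnose(human=0.01, train=0.08, train_dev=0.09, dev=0.10)
```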
Transfer learning
When transfer learning makes sense:
 Task A and B have the same input x.
 You have a lot more data for Task A than Task B.
 Low level features from A could be helpful for learning B.
Guideline:
 Delete last layer of neural network
 Delete weights feeding into the last output layer of the neural network
 Create a new set of randomly initialized weights for the last layers only
 New data set (x,y)
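Multi-task learning
Example: detect pedestrians, cars, road signs and traffic lights at the same time. The output is a 4-dimensional vector.
Loss: $J = \frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{4} \mathcal{L}(\hat{y}^{(i)}_j, y^{(i)}_j)$
Note that the second sum (j = 1 to 4) runs only over the values of j that have a 0/1 label (not a ? mark).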
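A sketch of the multi-task cost with missing labels skipped. Representing the "?" entries as NaN is an assumption made for this example, as are the function name and the toy values.

```python
import numpy as np

def multitask_cost(Y_hat, Y):
    """Y_hat, Y: (4, m). Entries of Y are 0, 1, or NaN for an unlabeled task."""
    mask = ~np.isnan(Y)
    Y_filled = np.where(mask, Y, 0.0)            # placeholder value, masked out below
    losses = -(Y_filled * np.log(Y_hat) + (1 - Y_filled) * np.log(1 - Y_hat))
    m = Y.shape[1]
    return np.sum(losses * mask) / m             # sum only over labeled entries

Y = np.array([[1.0, np.nan],
              [0.0, 1.0],
              [np.nan, np.nan],
              [1.0, 0.0]])                       # 5 labeled entries out of 8
Y_hat = np.full((4, 2), 0.5)
cost = multitask_cost(Y_hat, Y)
```

With every prediction at 0.5 each labeled entry contributes log 2, so the cost is 5 · log 2 / 2.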
When multi-task learning makes sense
 Training on a set of tasks that could benefit from having shared lower-level features.
 Usually: Amount of data you have for each task is quite similar.
 Can train a big enough neural network to do well on all the tasks.
End-to-end deep learning
End-to-end deep learning is the simplification of a multi-stage processing or learning system into one neural network.
End-to-end deep learning cannot be used for every problem since it needs a lot of labeled data. It is used mainly in audio transcription, image captioning, image synthesis, machine translation, steering in self-driving cars, etc.
Pros and cons of end-to-end deep learning
Pros:
Let the data speak
Less hand-designing of components needed
Cons:
May need a large amount of data
Excludes potentially useful hand-designed components
Convolutional Neural Networks
Week One
Computer Vision Problems
 Image Classification
 Object Detection
 Neural Style Transfer
Convolution
* is the operator for convolution.
Filter/Kernel
The second operand is called a filter in the course and is often called a kernel in research papers.
There are different types of filters:
A filter usually has an odd size: 1×1, 3×3, 5×5, ...
(An odd size gives the filter a central position.)
Vertical edge detection examples
Valid and Same Convolutions
Suppose that the original image has a size of n×n and the filter has a size of f×f; then the result has a size of (n−f+1)×(n−f+1). This is called a valid convolution.
The size gets smaller and smaller with each valid convolution.
To avoid this problem, we can pad the original image before the convolution so that the output size is the same as the input size.
If the filter’s size is f×f, then the padding is $p = \frac{f - 1}{2}$.
The main benefits of padding are the following:
· It allows you to use a CONV layer without necessarily shrinking the height and width of the volumes. This is important for building deeper networks, since otherwise the height/width would shrink as you go to deeper layers. An important special case is the “same” convolution, in which the height/width is exactly preserved after one layer.
· It helps us keep more of the information at the border of an image. Without padding, very few values at the next layer would be affected by pixels at the edges of an image.
Stride
The simplest stride is 1, which means that the filter moves 1 step at a time. However, the stride does not have to be 1; for example, the filter can move 2 steps at a time instead. That’s called strided convolution.
Given that:
Size of $n \times n$ image, $f \times f$ filter, padding p, stride s,
output size: $\lfloor \frac{n + 2p - f}{s} + 1 \rfloor \times \lfloor \frac{n + 2p - f}{s} + 1 \rfloor$
Technical note
In mathematics and DSP, convolution involves an additional “flip” step (the filter is mirrored). However, this step is omitted in CNNs. The technically correct name for the operation is “cross-correlation” rather than convolution.
By convention, it is still just called convolution in CNNs.
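The output-size formula and the (unflipped) convolution can be sketched with a naive loop. The helper names and the 6×6 test image are assumptions; the filter is a simple vertical edge detector.

```python
import numpy as np

def conv_output_size(n, f, p=0, s=1):
    return (n + 2 * p - f) // s + 1              # floor((n + 2p - f) / s) + 1

def conv2d(image, kernel, p=0, s=1):
    """Cross-correlation (no kernel flip) of a square 2-D image with an f x f kernel."""
    image = np.pad(image, p)                     # zero padding on every side
    f = kernel.shape[0]
    n_h = (image.shape[0] - f) // s + 1
    n_w = (image.shape[1] - f) // s + 1
    out = np.zeros((n_h, n_w))
    for i in range(n_h):
        for j in range(n_w):
            window = image[i * s:i * s + f, j * s:j * s + f]
            out[i, j] = np.sum(window * kernel)
    return out

# Bright left half, dark right half: a vertical edge in the middle.
image = np.hstack([np.full((6, 3), 10.0), np.zeros((6, 3))])
vertical_edge = np.array([[1.0, 0.0, -1.0]] * 3)
edges = conv2d(image, vertical_edge)             # valid convolution, 4 x 4 output
same = conv2d(image, vertical_edge, p=1)         # padding (f - 1) / 2 keeps 6 x 6
```

The valid output is nonzero only in its two middle columns, which is where the edge lies.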
Convolution over volumes
A 1-channel filter cannot be applied to RGB images. But we can use filters with multiple channels (RGB images have 3 channels).
The number of channels in the filter should match the number of channels in the image.
E.g.
An $n \times n \times n_C$ image convolved with an $f \times f \times n_C$ filter gives a result of size $(n - f + 1) \times (n - f + 1)$. Note that this is only 1 channel! (The number of channels in the result corresponds to the number of filters.)
CNN
notation
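If layer l is a convolution layer:
 $f^{[l]}$ = filter size
 $p^{[l]}$ = padding
 $s^{[l]}$ = stride
 $n_C^{[l]}$ = number of filters
Each filter is: $f^{[l]} \times f^{[l]} \times n_C^{[l-1]}$
Activations: $a^{[l]}$: $n_H^{[l]} \times n_W^{[l]} \times n_C^{[l]}$, $A^{[l]}$: $m \times n_H^{[l]} \times n_W^{[l]} \times n_C^{[l]}$
Weights: $f^{[l]} \times f^{[l]} \times n_C^{[l-1]} \times n_C^{[l]}$ ($n_C^{[l]}$: #filters in layer l.)
bias: $n_C^{[l]}$, stored as $(1, 1, 1, n_C^{[l]})$
Input: $n_H^{[l-1]} \times n_W^{[l-1]} \times n_C^{[l-1]}$
Output: $n_H^{[l]} \times n_W^{[l]} \times n_C^{[l]}$, where $n_H^{[l]} = \lfloor \frac{n_H^{[l-1]} + 2p^{[l]} - f^{[l]}}{s^{[l]}} + 1 \rfloor$ (and likewise for $n_W^{[l]}$)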
E.g.
Types of layers in a convolutional network
 Convolution (conv)
 Pooling (pool)
 Fully Connected (FC)
Pooling layers
 Max pooling: slides an (f,f) window over the input and stores the max value of the window in the output.
 Average pooling: slides an (f,f) window over the input and stores the average value of the window in the output.
Hyperparameters:
f: filter size
s: stride
Max or average pooling
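Note: there are no parameters to learn.
Suppose that the input has a size of $n_H \times n_W \times n_C$; then after pooling, the output has a size of $\lfloor \frac{n_H - f}{s} + 1 \rfloor \times \lfloor \frac{n_W - f}{s} + 1 \rfloor \times n_C$.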
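Max and average pooling on a single channel can be sketched with the same sliding-window loop. The function name and the 4×4 input are made up for illustration.

```python
import numpy as np

def pool2d(x, f=2, s=2, mode="max"):
    """Slide an f x f window with stride s over a 2-D array; no learned parameters."""
    n_h = (x.shape[0] - f) // s + 1
    n_w = (x.shape[1] - f) // s + 1
    reduce_fn = np.max if mode == "max" else np.mean
    out = np.zeros((n_h, n_w))
    for i in range(n_h):
        for j in range(n_w):
            out[i, j] = reduce_fn(x[i * s:i * s + f, j * s:j * s + f])
    return out

x = np.array([[1.0, 3.0, 2.0, 1.0],
              [2.0, 9.0, 1.0, 1.0],
              [1.0, 3.0, 2.0, 3.0],
              [5.0, 6.0, 1.0, 2.0]])
max_pooled = pool2d(x, mode="max")
avg_pooled = pool2d(x, mode="average")
```

For a multi-channel input the same operation is applied to each channel independently, so the number of channels is unchanged.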
A more complicated cnn:
Backpropagation is discussed in programming assignment.
Why convolutions
 Parameter sharing: A feature detector(such as a vertical edge detector) that’s useful in one part of the image is probably useful in another part of the image.
 Sparsity of connections: In each layer, each output value depends only on a small number of inputs.
Week Two
Classic networks
LeNet-5
Paper link: Gradient-Based Learning Applied to Document Recognition (IEEE has another version of this paper.)
Take the input, use a 5×5 filter with stride 1, then use average pooling with a 2×2 filter and s = 2. Again, use a 5×5 filter with stride 1, then use average pooling with a 2×2 filter and s = 2. After two fully connected layers, the output uses softmax for classification.
conv → pool → conv → pool → fc → fc → output
As $n_H$ and $n_W$ decrease, $n_C$ increases.
AlexNet
Paper link: ImageNet Classification with Deep Convolutional Neural Networks
Similar to LeNet, but much bigger. (about 60K parameters → about 60M parameters)
It uses ReLU.
VGG-16
Paper link: Very Deep Convolutional Networks for Large-Scale Image Recognition
CONV = 3×3 filter, s = 1, same (using padding to keep the size the same)
MAX-POOL = 2×2, s = 2
Only these two kinds of layers are used.
Residual Networks
Paper link: Deep Residual Learning for Image Recognition
In a plain network, the training error won’t keep decreasing as layers are added; beyond some depth it may increase. In a residual network, the training error keeps decreasing.
1×1 convolution
Paper link: Network In Network
It can reduce the number of channels $n_C$.
Inception network
Paper link: Going deeper with convolutions
Don’t bother worrying about what filters to use. Use all kinds of filters and stack them together.
Module: