Neural Networks and Deep Learning

Week One

(Neural networks are also introduced in the Machine Learning course; see my learning notes there.)

House price prediction can be regarded as the simplest neural network:

The function can be ReLU (Rectified Linear Unit), which we’ll see a lot.

This is a single neuron. A larger neural network is then formed by taking many of the single neurons and stacking them together.

Almost all the economic value created by neural networks has been through supervised learning.

Input (x) | Output (y) | Application | Neural Network
House features | Price | Real estate | Standard NN
Ad, user info | Click on ad? (0/1) | Online advertising | Standard NN
Photo | Object (index 1, …, 1000) | Photo tagging | CNN
Audio | Text transcript | Speech recognition | RNN
English | Chinese | Machine translation | RNN
Image, radar info | Position of other cars | Autonomous driving | Custom/Hybrid

Neural Network examples

CNN: often for image data
RNN: often for one-dimensional sequence data

Structured data and Unstructured data

Scale drives deep learning progress

Scale: both the size of the neural network and the scale of the data.

  • Data
  • Computation
  • Algorithms

Using ReLU instead of the sigmoid function as the activation function can make gradient descent much faster, because ReLU does not saturate for positive z.

Week Two

(x, y): a single training example.
m training examples: {(x^(1), y^(1)), …, (x^(m), y^(m))}.

Take the training inputs x^(1), x^(2) and so on and stack them as columns of a matrix X. (This makes the implementation much easier than stacking them as rows, i.e., using X’s transpose.)

Logistic Regression

Differences with former course
Notation is a bit different from what was introduced in Machine Learning (note).
Originally, we add x_0 = 1 so that h_θ(x) = σ(θ^T x), where θ ∈ R^(n+1).

Here in the Deep Learning course, we use b to represent θ_0, and w to represent [θ_1, …, θ_n]^T. Just keep b and w as separate parameters.
Given x, want ŷ = P(y = 1 | x), computed as ŷ = σ(w^T x + b).
σ() is the sigmoid function: σ(z) = 1 / (1 + e^(−z))
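A minimal NumPy sketch of the sigmoid function (the sample inputs are arbitrary):

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^(-z)) squashes any real number into (0, 1)
    return 1 / (1 + np.exp(-z))

print(sigmoid(0))    # 0.5
print(sigmoid(10))   # close to 1
print(sigmoid(-10))  # close to 0
```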

Cost Function


Loss function: L(ŷ, y) = −(y log ŷ + (1 − y) log(1 − ŷ))
Cost function: J(w, b) = (1/m) Σ_{i=1}^{m} L(ŷ^(i), y^(i))

The loss function is applied to just a single training example.
The cost function is the cost of your parameters; it is the average of the losses over the entire training set.

Gradient Descent
Usually we initialize the parameters to zero in logistic regression. Random initialization also works, but people don’t usually do that for logistic regression.

Repeat {
 w := w − α dJ(w, b)/dw
 b := b − α dJ(w, b)/db
}

From forward propagation, we calculate z, a and finally the loss L(a, y).
From back propagation, we calculate the derivatives step by step:
 da = −y/a + (1 − y)/(1 − a)
 dz = a − y
 dw_j = x_j dz, db = dz

J = 0; dw1, dw2, …, dwn = 0; db = 0
for i = 1 to m:
    z^(i) = w^T x^(i) + b
    a^(i) = σ(z^(i))
    J += −[y^(i) log a^(i) + (1 − y^(i)) log(1 − a^(i))]
    dz^(i) = a^(i) − y^(i)
    for j = 1 to n:
        dwj += xj^(i) dz^(i)
    db += dz^(i)
J /= m;
dw1, dw2, …, dwn /= m;
db /= m

In the loop, there’s no superscript i on the dw variables, because dw in the code accumulates over the whole training set, while dz refers to one training example.

The original for loop over i = 1 to m can be replaced by vectorized code:

Z =, X) + b
A = sigmoid(Z)
dZ = A - Y
cost = -1 / m * np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A))
db = 1 / m * np.sum(dZ)
dw = 1 / m *, dZ.T)

About Python

A.sum(axis = 0): sum vertically
A.sum(axis = 1): sum horizontally

If an (m, n) matrix operates (+ − * /) with a (1, n) row vector, the vector is expanded vertically to (m, n) by copying it m times.
If an (m, n) matrix operates with an (m, 1) column vector, the vector is expanded horizontally to (m, n) by copying it n times.
If a row/column vector operates with a real number, the real number is expanded to the vector’s shape.
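The three broadcasting rules above can be checked directly (the arrays are arbitrary):

```python
import numpy as np

A = np.arange(6).reshape(2, 3)   # (2, 3): [[0, 1, 2], [3, 4, 5]]
row = np.array([[10, 20, 30]])   # (1, 3) row vector
col = np.array([[100], [200]])   # (2, 1) column vector

# (2, 3) + (1, 3): the row vector is copied down over the 2 rows
print(A + row)   # [[10 21 32] [13 24 35]]
# (2, 3) + (2, 1): the column vector is copied across the 3 columns
print(A + col)   # [[100 101 102] [203 204 205]]
# vector + scalar: the scalar expands to the vector's shape
print(row + 1)   # [[11 21 31]]
```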

Rank 1 Array
a = np.random.randn(5) creates a rank 1 array whose shape is (5,).
Try to avoid using rank 1 array. Use a = a.reshape((5, 1)) or a = np.random.randn(5, 1).

Note that np.dot() performs a matrix-matrix or matrix-vector multiplication. This is different from np.multiply() and the * operator (which is equivalent to .* in MATLAB/Octave), which perform element-wise multiplication.

Week Three

Neural Network Overview

A superscript with square brackets denotes the layer; a superscript with round brackets refers to the i-th training example.

Logistic regression can be regarded as the simplest neural network. The neuron takes in the inputs and makes two computations:
 z = w^T x + b
 a = σ(z)

A neural network functions similarly. (Note that this neural network has 2 layers; when counting layers, the input layer is not included.)
Take the first node in the hidden layer as an example:
 z_1^[1] = w_1^[1]T x + b_1^[1], a_1^[1] = σ(z_1^[1])

The superscript [l] denotes the layer, and the subscript i represents the node within the layer.

Vectorized over all the nodes of a layer:
 z^[1] = W^[1] x + b^[1], a^[1] = σ(z^[1])
 z^[2] = W^[2] a^[1] + b^[2], a^[2] = σ(z^[2])
(dimensions: W^[1]: (n^[1], n^[0]), b^[1]: (n^[1], 1); W^[2]: (n^[2], n^[1]), b^[2]: (n^[2], 1))

Vectorizing across multiple examples:
 Z^[1] = W^[1] X + b^[1], A^[1] = σ(Z^[1])
 Z^[2] = W^[2] A^[1] + b^[2], A^[2] = σ(Z^[2])

Stack the examples as columns: each column represents a training example, each row represents a hidden unit.

Activation Function

Sigmoid Function

Only used in the output layer of binary classification (where the output should be between 0 and 1).
Not used on other occasions; tanh is a better choice.

tanh Function

tanh(z) = (e^z − e^(−z)) / (e^z + e^(−z))
With a range of (−1, 1), it performs better than the sigmoid function because the mean of its output is closer to zero.
Both the sigmoid and tanh functions have the disadvantage that when z is very large or very small, the derivative is close to 0, so gradient descent becomes very slow.


ReLU

Default choice of activation function: g(z) = max(0, z).
With g'(z) = 1 when z is positive, it performs well in practice.
(Although g'(z) = 0 when z is negative, and technically the derivative at z = 0 is not well-defined.)

Leaky ReLU

g(z) = max(0.01z, z)
Makes sure that the derivative is not equal to 0 when z < 0.

Linear Activation Function

Also called identity function.
Not used in the hidden layers of a neural network, because even with many hidden layers the result would still be a linear function of the input. It is only used in the output layer when the output is a real number (regression).

Gradient descent

Forward propagation:
 Z^[1] = W^[1] X + b^[1], A^[1] = g^[1](Z^[1])
 Z^[2] = W^[2] A^[1] + b^[2], A^[2] = σ(Z^[2])

Backward propagation:
 dZ^[2] = A^[2] − Y
 dW^[2] = (1/m) dZ^[2] A^[1]T
 db^[2] = (1/m) np.sum(dZ^[2], axis=1, keepdims=True)
 dZ^[1] = W^[2]T dZ^[2] * g^[1]'(Z^[1])
 dW^[1] = (1/m) dZ^[1] X^T
 db^[1] = (1/m) np.sum(dZ^[1], axis=1, keepdims=True)

keepdims=True makes sure that Python won’t produce a rank-1 array with shape (n,).
* is the element-wise product; dimensions: dZ^[1]: (n^[1], m); W^[2]T dZ^[2]: (n^[1], m); g^[1]'(Z^[1]): (n^[1], m).

Random Initialization

In logistic regression, it’s okay to initialize all parameters to zero. However, that doesn’t work in a neural network.
Instead, initialize W with random small values to break symmetry. It’s okay to initialize b to zeros; symmetry is still broken so long as W is initialized randomly.

W1 = np.random.randn(2, 2) * 0.01
b1 = np.zeros((2, 1))
W2 = np.random.randn(1, 2) * 0.01
b2 = 0

If the parameters W are all zeros, then the neurons in a hidden layer are symmetric (“identical”): they compute the same function, and even after gradient descent they stay the same. So use random initialization.

Both the sigmoid and tanh functions have their greatest derivative at z = 0. If z takes large absolute values, the derivative is close to zero, and consequently gradient descent is slow. Thus it’s a good choice to make the initial weights small.

Week Four

Deep neural network notation
- L: number of layers
- n^[l]: number of units in layer l
- a^[l]: activations in layer l
- W^[l]: weights for computing z^[l]
- b^[l]: bias for computing z^[l]

Forward Propagation
for l = 1 to L:
 Z^[l] = W^[l] A^[l−1] + b^[l]
 A^[l] = g^[l](Z^[l])

Well, this for loop over the layers is inevitable.

Matrix Dimensions
- W^[l], dW^[l]: (n^[l], n^[l−1])
- b^[l], db^[l]: (n^[l], 1) (here the dimension can effectively be (n^[l], m) with Python’s broadcasting)
- Z^[l], A^[l]: (n^[l], m)

Cache is used to pass variables computed during forward propagation to the corresponding backward propagation step. It contains useful values for backward propagation to compute derivatives.

Why deep representations?
Informally: There are functions you can compute with a “small” L-layer deep neural network that shallower networks require exponentially more hidden units to compute.

Forward and Backward Propagation

Forward propagation for layer l
Input A^[l−1]; output A^[l], cache Z^[l] (plus W^[l], b^[l]):
 Z^[l] = W^[l] A^[l−1] + b^[l]
 A^[l] = g^[l](Z^[l])

Backward propagation for layer l
Input dA^[l]; output dA^[l−1], dW^[l], db^[l]:
 dZ^[l] = dA^[l] * g^[l]'(Z^[l])
 dW^[l] = (1/m) dZ^[l] A^[l−1]T
 db^[l] = (1/m) np.sum(dZ^[l], axis=1, keepdims=True)
 dA^[l−1] = W^[l]T dZ^[l]

Hyperparameters and Parameters

Hyperparameters determine the final values of the parameters.


· learning rate
· number of iterations
· number of hidden layers
· number of hidden units
· choice of activation function
· momentum, minibatch size, regularizations, etc.

Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization

Week One

Setting up your Machine Learning Application

Train/dev/test sets

Training set:
Keep on training algorithms on the training sets.

Development set
Also called Hold-out cross validation set, Dev set for short.
Use dev set to see which of many different models performs best on the dev set.

Test set
To get an unbiased estimate of how well your algorithm is doing.

Previous era: when the amount of data was not too large, it was common to take all the data and split it 70%/30% or 60%/20%/20%.
Big data: with millions of examples, 10,000 examples for the dev set and 10,000 for the test set are enough. The proportions can be 98/1/1 or even 99.5/0.4/0.1.

Make sure dev set and test set come from same distribution.

Not having a test set might be okay if it’s not necessary to get an unbiased estimate of performance. (People often call the dev set the “test set” when there’s no real test set.)


High bias (poor training-set performance):
· Bigger network
· Train longer
· (Neural network architecture search)

High variance (poor dev-set performance):
· More data
· Regularization
· (Neural network architecture search)

Bias Variance trade-off
Originally, reducing bias may increase variance, and vice versa. So it’s necessary to trade-off between bias and variance.
But in deep learning, there’re ways to reduce one without increasing another. So don’t worry about bias variance trade-off.


L2 Regularization

Logistic regression

L2-regularization relies on the assumption that a model with small weights is simpler than a model with large weights. Thus, by penalizing the square values of the weights in the cost function you drive all the weights to smaller values. It becomes too costly for the cost to have large weights! This leads to a smoother model in which the output changes more slowly as the input changes.
Weights end up smaller (“weight decay”): weights are pushed to smaller values.
L2 regularization: ||w||_2^2 = Σ_j w_j^2 = w^T w
L1 regularization: (λ/m) Σ_j |w_j| = (λ/m) ||w||_1
(L1 regularization leads w to be sparse, but is not very effective.)

Neural network

- ||W^[l]||_F^2 = Σ_i Σ_j (W_ij^[l])^2; this is called the Frobenius norm, which is different from the Euclidean (L2) norm of a matrix.

Back propagation: dW^[l] = (term from backprop) + (λ/m) W^[l], so the update W^[l] := W^[l] − α dW^[l] shrinks W^[l] by a factor (1 − αλ/m) each step, hence “weight decay”.

Dropout regularization

With dropout, we go through each layer of the network and set some probability of eliminating each node.
For each training example, you train it using one of these thinned-out networks.
The idea behind drop-out is that at each iteration, you train a different model that uses only a subset of your neurons. With dropout, your neurons thus become less sensitive to the activation of one other specific neuron, because that other neuron might be shut down at any time.

Usually used in Computer Vision.

Implementation with layer 3

d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob # boolean matrix with 0/1
a3 = np.multiply(a3, d3) # a3 *= d3, element-wise multiply
a3 /= keep_prob # ensures that the expected value of a3 remains the same. Make test time easier because of less scaling problem
  • Dropout is a regularization technique.
  • You only use dropout during training. Don’t use dropout (randomly eliminate nodes) during test time.
  • Apply dropout both during forward and backward propagation.
  • During training time, divide each dropout layer by keep_prob to keep the same expected value for the activations. For example, if keep_prob is 0.5, then we will on average shut down half the nodes, so the output will be scaled by 0.5 since only the remaining half are contributing to the solution. Dividing by 0.5 is equivalent to multiplying by 2. Hence, the output now has the same expected value. You can check that this works even when keep_prob is other values than 0.5.
Other regularization methods

Data augmentation
Take image input for example. Flipping the image horizontally, rotating and sort of randomly zooming, distortion, etc.
Get more training set without paying much to reduce overfitting.

Early stopping
Stop training early so that the weights w are still relatively small (an effect similar to regularization).
Early stopping violates orthogonalization, which suggests treating “optimize the cost function J” and “do not overfit” as separate tasks.


Normalizing inputs

Subtract the mean:
 μ = (1/m) Σ_i x^(i); x := x − μ

Normalize the variance:
 σ² = (1/m) Σ_i x^(i)**2 (element-wise); x /= σ

Note: use the same μ and σ² to normalize the test set.


Vanishing/Exploding gradients

Since the number of layers L in deep learning may be quite large, the product of L weight matrices may tend to ∞ or 0 (just think about 1.1^L and 0.9^L).

Weight initialization for deep networks
Take a single neuron as an example: z = w_1 x_1 + … + w_n x_n.
If n is large, then we want each w_i to be smaller. A reasonable goal is Var(w_i) = 1/n (or 2/n for ReLU).

Random initialization for ReLU (known as He initialization, named for the first author of He et al., 2015):
 W^[l] = np.random.randn(shape) * np.sqrt(2 / n^[l−1])

For tanh, use np.sqrt(1 / n^[l−1]) (Xavier initialization).
Another version: np.sqrt(2 / (n^[l−1] + n^[l])).
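A sketch of He initialization for a list of layer sizes (the function name, `layer_dims`, and the seed are my own choices, not from the course):

```python
import numpy as np

def initialize_he(layer_dims, seed=0):
    # He initialization: W[l] ~ N(0, 2 / n[l-1]); biases start at zero
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_dims)):
        params["W" + str(l)] = rng.standard_normal(
            (layer_dims[l], layer_dims[l - 1])
        ) * np.sqrt(2 / layer_dims[l - 1])
        params["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return params

params = initialize_he([4, 3, 1])  # 2-layer network: 4 inputs, 3 hidden units, 1 output
print(params["W1"].shape)  # (3, 4)
print(params["b2"].shape)  # (1, 1)
```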

Gradient checking

Take W^[1], b^[1], …, W^[L], b^[L] and reshape them into a big vector θ.
Take dW^[1], db^[1], …, dW^[L], db^[L] and reshape them into a big vector dθ.

for each i:
 dθ_approx[i] = (J(θ_1, …, θ_i + ε, …) − J(θ_1, …, θ_i − ε, …)) / (2ε)

Check: is dθ_approx ≈ dθ?
Calculate ||dθ_approx − dθ||_2 / (||dθ_approx||_2 + ||dθ||_2). (≈10^−7 is great; ≈10^−3 is worrying.)

Gradient checking verifies closeness between the gradients from backpropagation and the numerical approximation of the gradient (computed using forward propagation).
Gradient checking is slow, so we don’t run it in every iteration of training. You would usually run it only to make sure your code is correct, then turn it off and use backprop for the actual learning process.

  • Don’t use in training - only to debug.
  • If algorithm fails grad check, look at components to try to identify bug.
  • Remember regularization.
  • Doesn’t work with dropout.
  • Run at random initialization; perhaps again after some training.
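The check above can be sketched on a toy cost function (J(θ) = Σθ² with known gradient 2θ; the function names are my own):

```python
import numpy as np

def grad_check(J, grad, theta, eps=1e-7):
    # compare the analytic gradient with the two-sided numerical approximation
    approx = np.zeros_like(theta)
    for i in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus[i] += eps
        minus[i] -= eps
        approx[i] = (J(plus) - J(minus)) / (2 * eps)
    num = np.linalg.norm(approx - grad)
    den = np.linalg.norm(approx) + np.linalg.norm(grad)
    return num / den  # ~1e-7 is great, ~1e-3 is worrying

theta = np.array([1.0, -2.0, 3.0])
ratio = grad_check(lambda t: np.sum(t ** 2), 2 * theta, theta)
print(ratio)  # tiny, so the "backprop" gradient 2*theta is correct
```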

Week Two

Mini-batch gradient descent

Batch gradient descent (the original gradient descent we’ve known) computes over the entire training set and then updates the parameters by just one little step. If the training set is pretty large, training is quite slow. The idea of mini-batch gradient descent is to use part of the training set at a time and update the parameters more often.
For example, if X’s dimension is (n_x, 5,000,000), divide the training set into 5,000 mini-batches X^{t}, each with dimension (n_x, 1000), i.e. X^{1}, …, X^{5000}.
Similarly, split Y into Y^{1}, …, Y^{5000}, each with dimension (1, 1000).
One iteration of mini-batch gradient descent (computing on a single mini-batch) is faster than one iteration of batch gradient descent.

The steps of mini-batch gradient descent:

repeat {
 for t = 1, …, 5000 {
  Forward prop on X^{t}
  Compute cost J^{t} (using X^{t}, Y^{t})
  Backprop to compute gradients of J^{t}
  Update W^[l] := W^[l] − α dW^[l], b^[l] := b^[l] − α db^[l]
 } # one pass through the training set is called 1 epoch
}

Choosing mini-batch size
Mini-batch size = m: Batch gradient descent.
It has to process the whole training set before making any progress, which takes too long per iteration.

Mini-batch size = 1: Stochastic gradient descent.
It loses the benefits of vectorization across examples.

Mini-batch size in between 1 and m:
Fastest learning: it uses vectorization and makes progress without processing the entire training set.

If training set is small(m≤2000): just use batch gradient descent.
Typical mini-batch sizes: 64, 128, 256, 512 (1024)

Exponentially weighted averages

v_t = β v_{t−1} + (1 − β) θ_t
Replace v_{t−1} using the same equation at t−1, then v_{t−2}, and so on. Finally we get
 v_t = (1 − β)(θ_t + β θ_{t−1} + β² θ_{t−2} + …)
This is why it is called an exponentially weighted average. In practice, v_t averages roughly the last 1/(1 − β) values; with β = 0.9, it is an average of about 10 examples.

Bias correction

v_t^corrected = v_t / (1 − β^t)

The exponentially weighted average without bias correction (purple line in the lecture) is much lower than the one with bias correction (green line) at the very beginning.
Since v_0 is set to zero (and assume β = 0.98), the first value v_1 = (1 − β)θ_1 = 0.02 θ_1 is quite small, and the values stay too small until t gets large enough that β^t ≈ 0. To avoid this, bias correction introduces the extra division step above.
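Both formulas in one sketch (the function name is mine; the θ stream is a constant toy signal):

```python
def ewa(thetas, beta=0.9, bias_correction=False):
    # v_t = beta * v_{t-1} + (1 - beta) * theta_t, optionally divided by (1 - beta^t)
    v = 0.0
    out = []
    for t, theta in enumerate(thetas, start=1):
        v = beta * v + (1 - beta) * theta
        out.append(v / (1 - beta ** t) if bias_correction else v)
    return out

vals = ewa([10.0] * 50)
print(vals[0])   # ~1.0 -- far below the true average at the start
print(vals[-1])  # close to 10 once the average has warmed up
print(ewa([10.0] * 50, bias_correction=True)[0])  # ~10.0 right away
```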

Gradient descent with momentum

On iteration t:
 Compute dW, db on the current mini-batch
 v_dW = β v_dW + (1 − β) dW
 v_db = β v_db + (1 − β) db
 W := W − α v_dW, b := b − α v_db

Momentum takes past gradients into account to smooth out the steps of gradient descent. It uses the same idea as the exponentially weighted average (though some versions omit the (1 − β) factor). As in the example from the lecture, we want to damp the oscillations perpendicular to the path toward the minimum while keeping fast progress along it; the exponentially weighted average cancels out the oscillations and makes gradient descent faster. Note there’s no need for momentum to do bias correction: after several iterations, the moving average has warmed up.


RMSprop

On iteration t:
 Compute dW, db on the current mini-batch
 s_dW = β₂ s_dW + (1 − β₂) dW² (element-wise square)
 s_db = β₂ s_db + (1 − β₂) db²
 W := W − α dW / (√s_dW + ε), b := b − α db / (√s_db + ε)

RMS means Root Mean Square: dividing by the root of the mean of the squared gradients damps the directions with large oscillations.

Adam optimization algorithm

Combine momentum and RMSprop together:
1.It calculates an exponentially weighted average of past gradients, and stores it in variable v (before bias correction) and v_corrected (with bias correction).
2.It calculates an exponentially weighted average of the squares of the past gradients, and stores it in variable s (before bias correction) and s_corrected (with bias correction).
3.It updates parameters in a direction based on combining information from 1 and 2.

On iteration t:
 Compute dW, db on the current mini-batch
 v_dW = β₁ v_dW + (1 − β₁) dW; v_db = β₁ v_db + (1 − β₁) db
 s_dW = β₂ s_dW + (1 − β₂) dW²; s_db = β₂ s_db + (1 − β₂) db²
 v^corrected = v / (1 − β₁^t); s^corrected = s / (1 − β₂^t)
 W := W − α v^corrected_dW / (√s^corrected_dW + ε), and likewise for b

- α: needs to be tuned
- β₁: 0.9
- β₂: 0.999
- ε: 10^−8

(Adam stands for Adaptive Moment Estimation.)
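One Adam step on a scalar toy problem (minimizing J(w) = w²; the hyperparameter defaults follow the values above, the rest is my own sketch):

```python
import numpy as np

def adam_step(w, dw, v, s, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    v = beta1 * v + (1 - beta1) * dw       # momentum-style average of gradients
    s = beta2 * s + (1 - beta2) * dw ** 2  # RMSprop-style average of squared gradients
    v_hat = v / (1 - beta1 ** t)           # bias correction
    s_hat = s / (1 - beta2 ** t)
    w = w - lr * v_hat / (np.sqrt(s_hat) + eps)
    return w, v, s

# minimize J(w) = w^2, whose gradient is dJ/dw = 2w
w, v, s = 5.0, 0.0, 0.0
for t in range(1, 5001):
    w, v, s = adam_step(w, 2 * w, v, s, t)
print(w)  # near 0
```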

Learning rate decay

Mini-batch gradient descent won’t converge exactly; it steps around near the optimum instead. To help it converge, it’s advisable to decay the learning rate with the number of epochs.
Some formulas:
- α = α₀ / (1 + decay_rate × epoch_num)
- α = 0.95^epoch_num × α₀ (exponential decay)
- α = (k / √epoch_num) × α₀
- discrete staircase (halve α after some iterations)
- manual decay
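The first decay formula as a one-liner (the argument names follow the course’s decay_rate/epoch_num wording; the sample values are arbitrary):

```python
def decayed_lr(alpha0, decay_rate, epoch_num):
    # alpha = alpha0 / (1 + decay_rate * epoch_num)
    return alpha0 / (1 + decay_rate * epoch_num)

print(decayed_lr(0.2, 1.0, 0))  # 0.2
print(decayed_lr(0.2, 1.0, 1))  # 0.1
print(decayed_lr(0.2, 1.0, 3))  # 0.05
```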

Week Three

Hyperparameter tuning

Hyperparameters: learning rate α, momentum β, Adam’s β₁/β₂/ε, number of layers, number of hidden units, learning rate decay, mini-batch size, etc.

Try to use random values of hyperparameters rather than grid.
Coarse to fine: if finds some region with good result, try more in that region.

Appropriate scale:
It’s okay to sample uniformly at random for some hyperparameters: number of layers, number of units.
While for some hyperparameters like the learning rate α, instead of sampling uniformly at random, sample randomly on a logarithmic scale (e.g. r = −4 * np.random.rand(), then α = 10^r).

Pandas & Caviar
Panda: babysitting one model at a time
Caviar: training many models in parallel
Largely determined by the amount of computational power you can access.

Batch Normalization

Using the idea of normalizing inputs, apply normalization in the hidden layers.

Given some intermediate values z^(1), …, z^(m) in a single layer of the network:
 μ = (1/m) Σ_i z^(i)
 σ² = (1/m) Σ_i (z^(i) − μ)²
 z_norm^(i) = (z^(i) − μ) / √(σ² + ε)
 z̃^(i) = γ z_norm^(i) + β (γ and β are learnable parameters)
Use z̃^(i) instead of z^(i).

Batch Norm as regularization
Each mini-batch is scaled by the mean/variance computed on just that mini-batch.
This adds some noise to the values within that mini-batch. So similar to dropout, it adds some noise to each hidden layer’s activations.
This has a slight regularization effect.

Batch Norm at test time: use exponentially weighted averages to compute average for test.

Multi-class classification

The output layer is a vector with dimension C rather than a real number, where C is the number of classes.
Activation function (softmax):
 t = e^(z^[L]); a^[L] = t / Σ_{j=1}^{C} t_j

Cost function:
 L(ŷ, y) = −Σ_{j=1}^{C} y_j log ŷ_j; J = (1/m) Σ_i L(ŷ^(i), y^(i))
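A numerically stable softmax sketch (subtracting the max before exponentiating is a standard trick, not course-specific; the logits are arbitrary):

```python
import numpy as np

def softmax(z):
    # exponentiate (shifted by the max for numerical stability), then normalize
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

p = softmax(np.array([2.0, 1.0, 0.1]))  # C = 3 classes
print(p)           # probabilities, largest for the largest logit
print(np.sum(p))   # sums to 1 (up to rounding)
```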

Deep Learning frameworks

  • Caffe/Caffe2
  • CNTK
  • DL4J
  • Keras
  • Lasagne
  • mxnet
  • PaddlePaddle
  • TensorFlow
  • Theano
  • Torch

Choosing deep learning frameworks
Ease of programming (development and deployment)
Running speed
Truly Open (open source with good governance)


Writing and running programs in TensorFlow has the following steps:

  1. Create Tensors(variables) that are not yet executed/evaluated.
  2. Write operations between those Tensors.
  3. Initialize your Tensors.
  4. Create a Session.
  5. Run the Session. This will run the operations you’d written above.

tf.constant(...): to create a constant value
tf.placeholder(dtype = ..., shape = ..., name = ...): a placeholder is an object whose value you can specify only later

tf.add(..., ...): to do an addition
tf.multiply(..., ...): to do a multiplication
tf.matmul(..., ...): to do a matrix multiplication

2 typical ways to create and use sessions in TensorFlow:

sess = tf.Session()
# Run the variables initialization (if needed), run the operations
result =, feed_dict = {...})
sess.close() # Close the session

with tf.Session() as sess:
    # run the variables initialization (if needed), run the operations
    result =, feed_dict = {...})
    # This takes care of closing the session

Structuring Machine Learning Projects

Week One


Orthogonalization or orthogonality is a system design property that assures that modifying an instruction or a component of an algorithm will not create or propagate side effects to other components of the system. It becomes easier to verify the algorithms independently from one another, and it reduces testing and development time.

When a supervised learning system is designed, these are the 4 assumptions that need to be true and orthogonal.

  1. Fit training set well on cost function - bigger network, Adam, etc
  2. Fit dev set well on cost function - regularization, bigger training set, etc
  3. Fit test set well on cost function - bigger dev set
  4. Performs well in real world - change dev set or cost function

Single number evaluation metric


Precision
Among all the examples predicted positive, the fraction that are actually positive: P = TP / (TP + FP).

Recall
Among all the actually positive examples, the fraction that are correctly predicted: R = TP / (TP + FN).

F1 score
The problem with using precision/recall as the evaluation metric is that you may not be sure which classifier is better when one has better precision and the other better recall. The F1 score, the harmonic mean of the two, combines precision and recall into a single number:
 F1 = 2 / (1/P + 1/R) = 2PR / (P + R)
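The three quantities computed from TP/FP/FN counts (the counts below are made up):

```python
def precision(tp, fp):
    # of all examples predicted positive, the fraction actually positive
    return tp / (tp + fp)

def recall(tp, fn):
    # of all actually positive examples, the fraction predicted positive
    return tp / (tp + fn)

def f1(p, r):
    # harmonic mean: punishes a large gap between precision and recall
    return 2 * p * r / (p + r)

p, r = precision(tp=8, fp=2), recall(tp=8, fn=8)
print(p, r)      # 0.8 0.5
print(f1(p, r))  # about 0.615
```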

Satisficing and optimizing metric
There are different metrics to evaluate the performance of a classifier; they can be categorized as satisficing and optimizing metrics. These metrics can be evaluated on the training set, the dev set, or the test set.
The general rule is: pick 1 metric to optimize, and require the other N − 1 metrics to merely satisfy set thresholds.

For example:

Classifier | Accuracy | Running time
A | 90% | 80 ms
B | 92% | 95 ms
C | 95% | 1,500 ms

For example, there are two evaluation metrics here: accuracy and running time. Take accuracy as the optimizing metric and running time as the satisficing metric. The satisficing metric has to meet the threshold you set; subject to that, improve the optimizing metric as much as possible.

Train/Dev/Test Set

It’s important to choose the development and test sets from the same distribution and it must be taken randomly from all the data.
Guideline: Choose a dev set and test set to reflect data you expect to get in the future and consider important to do well on.

Old way of splitting data:
We had smaller data set, therefore, we had to use a greater percentage of data to develop and test ideas and models.

Modern era - Big data:
Now, because a larger amount of data is available, we don’t have to compromise and can use a greater portion to train the model.

Set your dev set to be big enough to detect differences in algorithms/models you’re trying out.
Set your test set to be big enough to give high confidence in the overall performance of your system.

When to change dev/test sets and metrics

How to define a metric to evaluate classifiers.
Worry separately about how to do well on this metric.

If doing well on your metric + dev/test set does not correspond to doing well on your application, change your metric and/or dev/test set.

Comparing to human-level performance

The graph shows the performance of humans and machine learning over time.
Machine learning progress slows down once it approaches or surpasses human-level performance. One of the reasons is that human-level performance can be close to the Bayes optimal error, especially for natural perception problems.
The Bayes optimal error is defined as the best possible error. In other words, no function mapping from x to y can surpass that level of accuracy (for various reasons, e.g. blurry images, noisy audio, etc.).

Humans are quite good at a lot of tasks. So long as machine learning is worse than humans, you can:

  • Get labeled data from humans
  • Gain insight from manual error analysis: Why did a person get this right?
  • Better analysis of bias/variance

Human-level error as a proxy for Bayes error(i.e. Human-level error ≈ Bayes error).
The difference between Human-level error and training error is also regarded as “Avoidable bias”.

If the difference between human-level error and training error is bigger than the difference between training error and dev error, the focus should be on bias reduction techniques:
· Train bigger model
· Train longer/better optimization algorithms(momentum, RMSprop, Adam)
· NN architecture/hyperparameters search(RNN,CNN)

If the difference between training error and dev error is bigger than the difference between human-level error and training error, the focus should be on variance reduction techniques:
· More data
· Regularization(L2, dropout, data augmentation)
· NN architecture/hyperparameters search

Problems where machine significantly surpasses human-level performance
Feature: Structured data, not natural perception, lots of data.
· Online advertising
· Product recommendations
· Logistics(predicting transit time)
· Loan approvals

The two fundamental assumptions of supervised learning:
You can fit the training set pretty well.(avoidable bias ≈ 0)
The training set performance generalizes pretty well to the dev/test set.(variance ≈ 0)

Week Two

Error Analysis

Spreadsheet:
Before deciding how to improve accuracy, set up a spreadsheet to find out what matters.
For example:

Image | Dog | Great Cat | Blurry | Comment
1 | ✓ | | | small white dog
2 | | ✓ | ✓ | lion on a rainy day
% of total | 5% | 41% | 63% |

Mislabeled examples refer to cases where your learning algorithm outputs the wrong value of ŷ.
Incorrectly labeled examples refer to cases where, in the training/dev/test set, the label for Y, whatever a human labeler assigned to this piece of data, is actually incorrect.

Deep learning algorithms are quite robust to random errors in the training set, but less robust to systematic errors.

Guideline: Build system quickly, then iterate.

  1. Set up dev/test set and metrics
  • Set up a target
  2. Build an initial system quickly
  • Training set: fit the parameters
  • Dev set: tune the parameters
  • Test set: assess the performance
  3. Use bias/variance analysis & error analysis to prioritize next steps

Mismatched training and dev/test set

The development set and test should come from the same distribution. However, the training set’s distribution might be a bit different. Take a mobile application of cat recognizer for example:
The images from webpages are high resolution and professionally framed, while the images from the app’s users are relatively low resolution and blurrier.
The problem is that you have a different distribution:
Small data set of pictures uploaded by users (10,000). This distribution is the one that matters for the mobile app.
Bigger data set from the web (200,000).

Instead of mixing all the data and randomly shuffling it into train/dev/test sets, do the following:

Put 5,000 of the user examples into the training set (together with the 200,000 web examples), and split the remaining 5,000 user examples evenly into the dev and test sets.

The advantage of this way of splitting is that the target (the dev/test distribution) is well defined.
The disadvantage is that the training distribution is different from the dev and test distributions. However, this way of splitting the data gives better performance in the long term.

Training-Dev Set
Since the distributions of the training set and the dev set are different now, it’s hard to know whether the gap between training error and dev error is caused by variance or by the different distributions.
Therefore, take a small fraction of the original training set, called training-dev set. Don’t use training-dev set for training, but to check variance.
The difference between the training-dev set and the dev set is called data mismatch.

Addressing data mismatch:

  • Carry out manual error analysis to try to understand difference between training and dev/test sets.
  • Make training data more similar; or collect more data similar to dev/test sets

Transfer learning

When transfer learning makes sense:

  • Task A and B have the same input x.
  • You have a lot more data for Task A than Task B.
  • Low level features from A could be helpful for learning B.


  • Delete last layer of neural network
  • Delete weights feeding into the last output layer of the neural network
  • Create a new set of randomly initialized weights for the last layers only
  • Retrain on the new data set (x, y)

Multi-task learning
Example: detect pedestrians, cars, road signs and traffic lights at the same time. The output y is a 4-dimensional vector.

Cost: J = (1/m) Σ_i Σ_{j=1}^{4} L(ŷ_j^(i), y_j^(i))
Note that the inner sum (j = 1 to 4) is only over values of j with a 0/1 label (not over missing “?” labels).

When multi-task learning makes sense

  • Training on a set of tasks that could benefit from having shared lower-level features.
  • Usually: Amount of data you have for each task is quite similar.
  • Can train a big enough neural network to do well on all the tasks.

End-to-end deep learning

End-to-end deep learning is the simplification of a processing or learning systems into one neural network.

End-to-end deep learning cannot be used for every problem since it needs a lot of labeled data. It is used mainly in audio transcripts, image captures, image synthesis, machine translation, steering in self-driving cars, etc.

Pros and cons of end-to-end deep learning
Pros:
· Lets the data speak
· Less hand-designing of components needed

Cons:
· May need a large amount of data
· Excludes potentially useful hand-designed components

Convolutional Neural Networks

Week One

Computer Vision Problems

  • Image Classification
  • Object Detection
  • Neural Style Transfer


* is the operator for convolution.

The second operand is called a filter in the course, and often called a kernel in research papers.
There are different types of filters:

Filters usually have an odd size: 1×1, 3×3, 5×5, … (an odd size gives a central pixel).

Vertical edge detection examples

Valid and Same Convolutions
Suppose that the original image has a size of n×n and the filter has a size of f×f; then the result has a size of (n−f+1)×(n−f+1). This is called a Valid convolution.
The image gets smaller and smaller with repeated valid convolutions.

To avoid this, we can pad the original image before convolution so that the output size is the same as the input size; this is called a Same convolution.
If the filter’s size is f×f, then the padding is p = (f − 1) / 2.

The main benefits of padding are the following:
· It allows you to use a CONV layer without necessarily shrinking the height and width of the volumes. This is important for building deeper networks, since otherwise the height/width would shrink as you go to deeper layers. An important special case is the “same” convolution, in which the height/width is exactly preserved after one layer.
· It helps us keep more of the information at the border of an image. Without padding, very few values at the next layer would be affected by pixels at the edges of an image.

The simplest stride is 1, which means the filter moves 1 step at a time. However, the stride can be larger, for example moving 2 steps at a time instead. That’s called a strided convolution.

Given that:
size of image n×n, filter f×f, padding p, stride s,
output size: ⌊(n + 2p − f)/s + 1⌋ × ⌊(n + 2p − f)/s + 1⌋
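The formula as a helper (integer division implements the floor; the sample sizes are arbitrary):

```python
def conv_output_size(n, f, p=0, s=1):
    # floor((n + 2p - f) / s) + 1, per output dimension
    return (n + 2 * p - f) // s + 1

print(conv_output_size(6, 3))       # valid convolution: 4
print(conv_output_size(6, 3, p=1))  # same convolution: 6
print(conv_output_size(7, 3, s=2))  # strided convolution: 3
```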

In mathematics and DSP, convolution involves an extra “flip” step. This step is omitted in CNNs, so the technically correct name would be “cross-correlation” rather than convolution.
By convention, we just call it convolution in CNNs.

Convolution over volumes
A 1-channel filter cannot be applied to RGB images, but we can use filters with multiple channels (RGB images have 3 channels).

The number of the filter’s channels must match the number of the image’s channels.
An n×n×n_c image convolved with an f×f×n_c filter gives a result of size (n−f+1)×(n−f+1). Note that this is only 1 channel! (The number of channels in the result corresponds to the number of filters used.)


If layer l is a convolution layer:

  • f^[l] = filter size
  • p^[l] = padding
  • s^[l] = stride
  • n_C^[l] = number of filters

Each filter is: f^[l] × f^[l] × n_C^[l−1]
Activations: a^[l] → (n_H^[l], n_W^[l], n_C^[l]); A^[l] → (m, n_H^[l], n_W^[l], n_C^[l])
Weights: (f^[l], f^[l], n_C^[l−1], n_C^[l]) (n_C^[l]: # filters in layer l); bias: (1, 1, 1, n_C^[l])
Input: (n_H^[l−1], n_W^[l−1], n_C^[l−1]); output: (n_H^[l], n_W^[l], n_C^[l]), where
 n_H^[l] = ⌊(n_H^[l−1] + 2p^[l] − f^[l]) / s^[l] + 1⌋ (and likewise for n_W^[l])

Types of layers in a convolutional network

  • Convolution (conv)
  • Pooling (pool)
  • Fully Connected (FC)

Pooling layers

  • Max pooling: slides an (f,f) window over the input and stores the max value of the window in the output.
  • Average pooling: slides an (f,f) window over the input and stores the average value of the window in the output.

Hyperparameters:
f: filter size
s: stride
Max or average pooling
Note there are no parameters to learn.

Suppose the input has a size of (n_H, n_W, n_C); then after pooling, the output has a size of (⌊(n_H − f)/s + 1⌋, ⌊(n_W − f)/s + 1⌋, n_C).
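A naive max-pooling sketch for one channel (loop-based for clarity, not performance; the input values are arbitrary):

```python
import numpy as np

def max_pool(a, f=2, s=2):
    # slide an (f, f) window with stride s and keep the max of each window
    n_h, n_w = a.shape
    out_h, out_w = (n_h - f) // s + 1, (n_w - f) // s + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.max(a[i * s:i * s + f, j * s:j * s + f])
    return out

a = np.array([[1., 3., 2., 1.],
              [4., 6., 6., 8.],
              [3., 1., 1., 0.],
              [1., 2., 2., 4.]])
print(max_pool(a))  # [[6. 8.] [3. 4.]]
```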

A more complicated CNN:

Backpropagation is discussed in programming assignment.

Why convolutions

  • Parameter sharing: A feature detector(such as a vertical edge detector) that’s useful in one part of the image is probably useful in another part of the image.
  • Sparsity of connections: In each layer, each output value depends only on a small number of inputs.

Week Two

Classic networks

LeNet - 5

Paper link: Gradient-Based Learning Applied to Document Recognition(IEEE has another version of this paper.)

Take the input, use a 5×5 filter with stride 1, then an average pooling with a 2×2 filter and s = 2. Again, use a 5×5 filter with stride 1, then an average pooling with a 2×2 filter and s = 2. After two fully connected layers, the output uses softmax to make the classification.
conv → pool → conv → pool → fc → fc → output
As n_H and n_W decrease, n_C increases.


AlexNet

Paper link: ImageNet Classification with Deep Convolutional Neural Networks

Similar to LeNet, but much bigger (about 60K parameters → about 60M parameters).
It uses ReLU.


VGG-16

Paper link: Very Deep Convolutional Networks for Large-Scale Image Recognition

CONV = 3×3 filter, s = 1, same (using padding to keep the size the same)
MAX-POOL = 2×2, s = 2
Only these 2 kinds of layers are used.

Residual Networks

Paper link: Deep Residual Learning for Image Recognition

In a plain network, the training error does not keep decreasing as depth grows; it may start increasing past some depth. In a residual network, the training error keeps decreasing as layers are added.

1×1 convolution

Paper link: Network in network

It can reduce the number of channels n_C.

Inception network

Paper link: Going deeper with convolutions
Don’t bother worrying about what filters to use. Use all kinds of filters and stack them together.