Week One
Introduction
What is machine learning?
Field of study that gives computers the ability to learn without being explicitly programmed. — By Arthur Samuel
A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T as measured by P improves with experience E. — By Tom Mitchell
Machine learning algorithms
Supervised learning
Regression
try to map input variables to some continuous function.Classification
predict results in a discrete output.
Unsupervised learning
approach problems with little or no idea what our results should look like
Example:
 Clustering
 Nonclustering
Linear Regression with One Variable
Some notations:
m: Number of training examples
x: “input” variable / features
y: “output” variable / “target” variable
: ith training example: Note that the superscript “(i)” in the notation is simply an index into the training set, and has nothing to do with exponentiation.
Hypothesis Function and Cost Function
A slightly more formal description of supervised learning problem is that given a training set, to learn a function h : X → Y so that h(x) is a “good” predictor for the corresponding value of y. For historical reasons, this function h is called a hypothesis.
In this example of linear regression with one variable, the hypothesis function can be denoted as
(maybe we Chinese students are more familiar with the form like h(x)=kx+b) Here and are just parameters. And our goal is to choose and so that is close to y for our training examples(x,y).
The cost function takes an average difference of all the results of the hypothesis with inputs from x’s and the actual output y’s.
This function is otherwise called the “Squared error function”, or “Mean squared error”. The coefficient 1/2 is used for gradient descent so that the partial derivative result will be cleaner.
Gradient Descent
the Gradient descent algorithm:
repeat until convergence {
}
The value of α should not be too small or too large.
If α is too small, gradient descent can be slow.
If α is too large, gradient descent can overshoot the minimum. It may fail to converge, or even diverge.
In general, gradient descent can converge to a local minimum, even with the learning rate α fixed.
After calculating partial derivation, we can get the algorithm as :
repeat until convergence {
(update θ0 and θ1 simultaneously)
}
Linear Algebra Review
Matrix: 2dimensional array.
Vector: An n×1 matrix
Notation:Generally the uppercase letters are used for matrix and the lowercase letters are used for vector.
Matrix Manipulation
Addition
Scalar multiplication
Matrixvector multiplication
Let A be an m×n matrix, x be an ndimensional vector, then the result A×x will be an mdimensional vector.
To get yi, multiply A’s ith row with elements of vector x, and add them up.
Matrixmatrix multiplication
Let A be an m×n matrix, B be an n×o matrix, then the result A×B will be an m×o matrix.
The ith column of the matrix C is obtained by multiplying A with the ith column of B.(for i=1,2,…,o). Then the calculation can be simplified to matrixvector multiplication.
Matrix multiplication properties
Not commutative
Let A and B be matrices. Then in general, A×B≠B×A.
Associative
A×(B×C)=(A×B)×C
Special matrix
Identity Matrix
The identity matrix, which is denoted as I(sometimes with n×n subscript), simply has 1’s on the diagonal (upper left to lower right diagonal) and 0’s elsewhere. For example:
For any matrix A, A×I=I×A=A
Matrix Inverse
If A is an m×m matrix, and if it has an inverse(not all matrix have an inverse),
Matrices that don’t have an inverse are singular or degenerate.
Matrix Transpose
Let A be an m×n matrix, and let . Then B is an n×m matrix, and
Week Two
Multivariate Linear Regression
from one variable to multiple variables
In the first week’s course, we learned linear with one variable x, and the hypothesis could be . As a matter of fact, there can be more than one variables. So here’re the new notations:
m: the number of training examples
n: the number of features
 : input (features) of ith training example
 : value of feature j in ith training example
The hypothesis should be transformed to:
(for convenience of notation, we define x0=1).
Using the definition of matrix multiplication, our multivariable hypothesis function can be concisely represented as:
We can see that linear regression with one variable is just the special case when n=1.
Similarly, the new gradient descent algorithm would be represented as:
repeat until convergence{
}
Feature Scaling
The idea is to make sure features are on a similar scale so that the gradient descent can be sped up.
Generally, get every feature into approximately a range. (the range is not limited to [1,1])
Often used formula:
Mean normalization
Replace with to make features have approximately zero mean.
Formula:
where is the average of all the values for feature(i) and si is the range of values(max  min) or the standard deviation.
Something more about learning rate
The idea of learning rate is same as in the first week. And how to make sure gradient descent is working correctly(or say, debugging)?
Make a plot with number of iterations on the xaxis. Now plot the cost function, J(θ) over the number of iterations of gradient descent. If J(θ) ever increases, then you probably need to decrease α.
How to declare convergence?
If J(θ) decreased by less than in one iteration.
Polynomial regression
linear:
quadratic:
cubic:
square root:
One important thing to keep in mind is, if you choose your features this way then feature scaling becomes very important.
Normal equation
Gradient descent gives one way of minimizing J, and normal equation is another way. In the “Normal Equation” method, we will minimize J by explicitly taking its derivatives with respect to the θj ’s, and setting them to zero. This allows us to find the optimum theta without iteration.
Formula:
Here, X is m×(n+1) matrix(remember that x0 = 1), y is mdimensional vector, and θ is (n+1) dimensional vector.
The advantage of normal equation:
 no need to choose α
 don’t need to iterate
Comparison of gradient descent and normal equation
Gradient Descent  Normal Equation 

need to choose α  no need to choose α 
need many iterations  no need to iterate 
O(kn²)  O(n³)(need to calculate inverse) 
works well when n is large  slow if n is very large 
normal equation noninvertibility
Sometimes can be invertible, the common causes might be having:
 Redundant features(linear dependent)
 Too many features(e.g. m≤n)
Octave/MATLAB Tutorial
Generally the commands are both used in MATLAB and Octave. (It is suggested that change Octave’s command prompt using PS1('>> ')
.)
Elementary math operations
e.g., 1+2
, 34
, 5*6
, 7/8
, 2^6
//Note: if the result is floating point, the default digits after decimal point is different between Octave (6 digits) and MATLAB (4 digits).
Logical operations
Equality.1==2
, 1~=2 %~=: not equal
AND, OR, XOR.1 && 0
, 1  0
, xor(1,0)
The result of logical operations is 0(false) or 1(true).
Variable
Simple form.a = 3
, a = 3;
The semicolon suppresses the print output.
The assignment can be constants, strings, boolean expressions, etc.
Display variable.a
, disp(a)
, disp(sprintf('2 decimals: %0.2f', a)
Suppose a = 3, then the output of the command a
is a = 3
, of the command disp(a)
is 3
and of the command disp(sprintf('2 decimals: %0.2f', a)
is 2 decimals: 3.00
.
Format.sprintf
is a Clike syntax that defines the output format.format long
, format short
makes the all of the following commands output in long or short format.
Vectors and Matrices
Matrix.A = [1, 2; 3, 4; 5, 6]
We can memorize that the semicolon ;
means the next row of the matrix and the comma ,
(which can be replaced by space
) means the next column.
Vector.v = [1 2 3]
, v = [1; 2; 3]
The former creates a row vector (1×3 matrix), and the latter creates a column vector (3×1 matrix).
Some useful notationv = START(:INCREMENT):END
Create a row vector from START to END with each step incremented by INCREMENT. If :INCREMENT
is omitted, the increment is 1.
ones(ROW, COLUMN)
, zeros(ROW, COLUMN)
Create a ROW×COLUMN matrix of all ones/zeros.
rand(ROW, COLUMN)
rand
generates random numbers from the standard uniform distribution (0,1)randn
generates random numbers from standard normal distribution.
eye(ROW)
(Eye is maybe a pun on the word identity.) Create a ROW by ROW identity matrix.
Week Three
Logistic Regression
Binary Classification
The output y can take only two values, 0 and 1. Thus, y∈{0,1}, where the value 0 represents negative class and the value 1 represents positive class. It’s not advisable to use linear regression to represent the hypothesis. It’s logistic regression that has a range of (0,1).
Multiclass Classification:Onevsall
Train a logistic regression classifier for each class i to predict the probability that y = i.
On a new input x, to make a prediction, pick the class i that maximizes .
Logistic Regression Model
The function is called sigmoid function or logistic function, and the function image is as shown below:
What is the interpretation of hypothesis output?
 estimated probability that y = 1, given x, parameterized by θ. In mathematical formula it can be denoted as
Using probability theory knowledge, we can also know that
Decision Boundary
Suppose predict “y=1” if , and predict “y=0” if .
That is equal to
Cost function and Gradient Descent
Cost function in logistic regression is different from that in linear regression.
that means:
 if y = 1;
 if y = 0.
Vectorized implementation of cost function:
Gradient Descent:
repeat {
}
Vectorized implementation of gradient descent:
Advanced optimization
Except gradient descent, there’re Conjugate gradient, BFGS and LBFGS optimization algorithms. They have advantages that 1.No need to manually pick α;2.Often faster than gradient descent. But they’re more complex than gradient descent.
Regularization
Overfitting
If we have too many features, the learn hypothesis may fit the training set very well, but fail to generalize to new examples. (Also known high variance)
There’s also another problem called Underfit problem, which has high bias.
Two examples:
How to address overfitting
Reduce number of features.
 Manually select which features to keep
 Model selection algorithmRegularization.
 Keep all the features, but reduce magnitude/values of parameters
Cost function:
The additional part is called regularization parameter. Note that λ should be set a proper value. If λ is set to an extremely large value, the algorithm may fail to eliminate overfitting or results in underfitting.
Regularized Linear Regression
Cost function
Gradient Descent:
Repeat {
}
The second line can also denoted as:
The coefficient will always be less than 1.
Normal Equation:
*If λ>0, this normal equation makes it invertible.
Regularized Logistic Regression
Gradient Descent:
Repeat {
}
Week Four
Neural Networks
At a very simple level, neurons are basically computational units that take inputs (dendrites) as electrical inputs (called “spikes”) that are channeled to outputs (axons). In our model, our dendrites are like the input features x1⋯xn, and the output is the result of our hypothesis function. In this model our x0 input node is sometimes called the “bias unit.” It is always equal to 1. In neural networks, we use the same logistic function as in classification, , yet we sometimes call it a sigmoid (logistic) activation function. In this situation, our “theta” parameters are sometimes called “weights”.
There’re several layers in the neural network. The first layer is called Input Layer, the last layer is called Output Layer, and the others is called Hidden Layer.
Notations:
 = “activation” of unit i in layer j
 = matrix of weights controlling function mapping from layer j to layer j+1
In a simple example like this:
we have some equations:
If network has units in layer j, units in layer j+1, then will be of dimension .
Vectorized Implementation:
 (regard the input layer as )
Add (add the bias unit)
Week Five
Neural Networks: Learning
L: total no. of layers in network
: no. of units(not counting bias unit) in layer l
K: no. of units in the output layer()
(K=1:Binary classification, K≥3:Multiclass classification)
Cost Function
Compared with the cost function of logistic regression, we have added a few nested summations to account for our multiple output nodes. In the first part of the equation, before the square brackets, we have an additional nested summation that loops through the number of output nodes.(k = 1:K)
In the regularization part, after the square brackets, we must account for multiple theta matrices. The number of columns in our current theta matrix is equal to the number of nodes in our current layer (including the bias unit). The number of rows in our current theta matrix is equal to the number of nodes in the next layer (excluding the bias unit).
Backpropagation Algorithm
To minimize , we need code to compute and . The formula of is described above, and now let’s have a recall at forward propagation first.
In a 4layer neural network, we have the following forward propagation:
Then what about backward propagation(Intuition: = “error” of node j in layer l)?
（, it’s just the partial derivative)
=”error” of cost for .
Formally, .
Backpropagation algorithm:
Training set
Set .(initializing accumulators)
For i = 1 to m
Set
Perform forward propagation to compute for l=2,3,…,L
Using , compute
Compute
 if j≠0
 if j=0
Unrolling parameters
Idea: Unroll matrices into vectors. In order to use optimizing functions such as “fminunc()”, we will want to “unroll” all the elements and put them into one long vector.
Learning Algorithm
Have initial parameters Theta1,Theta2,Theta3
Unroll to get initialTheta
to pass to fminunc(@costFunction, initialTheta, options)
.
Gradient checking
Gradient checking will assure that our backpropagation works as intended. We can approximate the derivative of our cost function with:
With multiple theta matrices, we can approximate the derivative with respect to Θj as follows:
Implementation note:
 Implement backprop to compute DVec(unrolled D(1),D(2),D(3)).
 Implement numerical gradient check to compute
gradApprox
.  Make sure they have similar values.
 Turn off gradient checking. Using backprop code for learning. (or the code will be very slow)
Random initialization
For gradient descent and advanced optimization method, we need initial value for θ. However, it’s not advisable to set initial theta to all zeros. When we backpropagate, all nodes will update to the same value repeatedly. So instead of using zeros
, use rand
to initialize theta.
Training a neural network
 Randomly initialize weights
 Implement forward propagation to get for any
 Implement code to compute cost function
 Implement backprop to compute partial derivatives
 Use gradient checking to compare computed using backpropagation vs. using numerical estimate of gradient of .(Then disable gradient checking code.)
 Use gradient descent or advanced optimization method with backpropagation to try to minimize as a function of parameter θ
Week Six
Advice for applying machine learning
Divide the data set into three parts: training set, cross validation set and test set.(sometimes two parts, training set and test set)
Training error:
Cross Validation error:
Test error:
Model Selection: eg. for d = 1:10, when trying to minimize , we get the object . Then using the θ we get, we can estimate generalization error for test set.
bias v.s. variance
High bias: underfit
High variance: overfit
Just as the figure shown above, with the increment of degree of polynomial d, the training error will be less and less. However, if the degree is too high, the cross validation error would be high again(overfitting).
Bias:
 will be high.

Variance:
 will be low.

Regularization and bias/variance
Now let’s take the parameter λ that we use in regularization into consideration.
If λ is too large, it may lead to high bias. If λ is too small, it may lead to high variance.(e.g. λ=0)
Learning curves
Debugging a learning algorithm
Fixing high bias:
 Try getting additional features
 Try adding polynomial features
 Try decreasing λ
Fixing high variance:
 Get more training examples
 Try smaller sets of features
 Try increasing λ
Machine learning system design
Recommended approach
 Start with a simple algorithm that you can implement quickly. Implement it and test it on your crossvalidation data.
 Plot learning curves to decide if more data, more features, etc. are likely to help.
 Error analysis: Manually examine the examples (in cross validation set) that your algorithm made errors on. See if you spot any systematic trend in what type of examples it is making errors on.
Precision & Recall
Take cancer prediction as example. Since the cancer incidence is quite low, the prediction can simply be y=0
. The error rate is relatively low, however, it’s not a ‘prediction’ at all. Thus, we can’t judge predictions’ performance only using error rate. So the concept of precision and recall is introduced.
Actual Class 1  Actual Class 0  

Predicted Class 1  True positive  False positive 
Predicted Class 0  False negative  True negative 
y=1 in presence of rare class that we want to detect. (In cancer prediction example, isCancer
should be 1)
Precision:
Calculated in row.
Recall
Calculated in column.
Suppose we want to predict y=1
(cancer) only if very confident. Then turn threshold(originally 0.5) up to get high precision and lower recall.
Suppose we want to avoid missing too many cases of cancer(avoid false negatives). Then turn threshold down to get high recall and lower precision.
F1 Score(F score)
Large data rationale
Use a learning algorithm with many parameters(e.g. logistic regression/linear regression with many features; neural network with many hidden units).  low bias
Use a very large training set(unlikely to overfit).  low variance
Week Seven
Support Vector Machine
Alternative view of logistic regression
The cost of a specific example(x,y) is
If y=1
, then the right part of the function is ignored(since 1y=0
), and what we get is:
Similarly, if y=0
, then the left part of the function is ignored, and what we get is:
For support vector machine, we have and functions (the subscript corresponds to the value of y) that are similar to the original curves. The difference is that the curves are made up of straight lines.
Cost1:
Cost0:
And the cost function of SVM:
Difference between SVM and logistic regression:
SVM ignores the term 1/m.
The form of logistic regression is A+λB, while that of SVM is CA+B.(if C=1/λ, then two these two optimization objectives should give you the same optimal value for theta)
Large Margin Classifier
SVM wants a bit more than the original logistic regression.
If , we want (not just ≥0)
If , we want (not just ＜0)
Consider the situation that C is very large, and optimization would try to set the left part of cost function to be zero. And this leads to the large margin classifier concept:
Compared with the magenta and green lines, the black line has some larger minimum distance from any of the training examples. This distance is called the margin of the SVM and this gives the SVM a certain robustness, because it tries to separate the data with as large a margin as possible.
The mathematics behind large margin classification is the dot product of the vectors.
Kernels
For SVM, there’s a different(better) choice of the features:
given x, compute new features depending on proximity to landmarks .
For example,
Here the similarity is called the kernel function, and the corresponding kernel function is Gaussian Kernel which uses exp
.
How to choose landmarks?
Landmarks is just the input examples.
Given ,
choose .
And calculate features using kernals.
Hypothesis: Given x, compute features , predict “y=1” if .
Training:
Here n is equal to m.
And for some SVM, the computation of is using rather than .
SVM parameters
C
Large C: Lower bias, higher variance.
Small C: Higher bias, lower variance.
(regard C as 1/λ).
σ²
Large σ²: Features f vary more smoothly. Higher bias, lower variance.
Small σ²: Features f vary less smoothly. Lower bias, higher variance.
Using an SVM
SVM package: liblinear, libsvm, etc.
Choice of kernels:
Linear kernel(No kernel)
Predict “y=1” if
Gaussian kernel
(Remember to perform feature scaling before using Gaussian kernel.)
Polynomial kernel
More esoteric
String kernel, chisquare kernel, histogram intersection kernel, …
Logistic regression or SVM?
If n is large(relative to m): use logistic regression, or SVM without a kernel.
If n is small, m is intermediate: use SVM with Gaussian kernel.
If n is small, m is large: add more features, then use logistic regression, or SVM without a kernel
Neural network is likely to work well for most of these settings, but may be slower to train.
Week Eight
Clustering
The difference between unsupervised learning and supervised learning:
The supervised learning problem is given a set of labels to fit a hypothesis to it. In contrast, in the unsupervised learning problem we’re given data that does not have any labels associated with it.
Kmeans algorithm
Random initialize K cluster centroids
Repeat {
for i=1 to m
:=index(from 1 to K) of cluster centroid closest to
for k=1 to K
:=average(mean) of points assigned to cluster k
}
The first for loop is cluster assignment step, and the second for loop is moving centroid.
Optimization objective
: index of cluster(1,2,…,K) to which example is currently assigned
: cluster centroid
: cluster centroid of cluster to which example has been assigned
Try to minimize (also called distortion function).
Random initialization
Randomly pick K training examples, and set those examples as cluster centroids.
For better performance, run multiple times(e.g, 100 times) and pick clustering that gave lowest cost J.
Choosing the number of clusters
Manually.
Sometime it’s helpful to use elbow method, but it’s often not advisable:
Sometimes, you’re running Kmeans to get clusters to use for some later/downstream purpose. Evaluate Kmeans based on a metric for how well it performs for that later purpose.
Dimensionality Reduction
Motivation of dimensionality reduction:
Data Compression
Reduce memory/disk needed to store data
Speed up learning algorithmData Visualization
k=2 or k=3, so we can visualize the data and get an intuitive view
Principal Component Analysis(PCA)
Reduce from ndimension to kdimension: Find k vectors onto which to project the data, so as to minimize the projection error.
Difference between PCA and linear regression
PCA looks like linear regression(reduce from 2D to 1D), but they’re different.
Linear regression has the input x and corresponding label y. What linear regression does is trying to predict the output y. And the ‘error’ is computed vertically.
PCA is unsupervised learning and has no label y. What PCA does is reduce the dimension of features. And the ‘error’ is computed according to the vector difference.
PCA Algorithm
Before applying PCA algorithm, remember to make data preprocessing.
Given training set:, using feature scaling/mean normalization to preprocess
then replace each with .
If different features on different scales (e.g., x1=size of house, x2=number of bedrooms), scale features to have comparable range of values.
After mean normalization(ensure every feature has zero means) and optionally feature scaling:
Sigma = (1 / m) * X' * X
[U,S,V] = svd(Sigma);
Ureduce = U(:,1:k);
z = Ureduce' * x;
Choosing k
Reconstruction from compressed representation:
From the formula , we can get the reconstruction x using
Average squared projection error:
Total variation in the data:
Typically, choose k to be smallest value so that
Here 0.01 means that 99% of variance is retained.
We don’t have to loop k from 1 to n to find the smallest value. The function svd
has an output S which is useful.
[U,S,V] = svd(Sigma)
Just pick smallest value of k for which
Advice
To speed up supervised learning, note to map should be defined by running PCA only on the training set.
It’s not good to address overfitting using PCA. Use regularization instead.
Before implementing PCA, first try running whatever you want to do with the original/raw data . Only if that doesn’t do what you want, then implement PCA and consider using .
Week Nine
Anomaly Detection
Algorithm
1.Choose features that you think might be indicative of anomalous examples.
2.Fit parameters
3.Given new example x, computer p(x):
Check anomaly if
Developing and evaluating
Assume we have some labeled data, of anomalous and nonanomalous examples. (y=0 if normal, y=1 if anomalous).
Suppose we have 10000 good (normal) engines(it’s okay that there’re some anomalous fixed in) and 20 flawed engines(anomalous).
Then divide the examples into Training set:6000 good engines; CV: 2000 good engines and 10 anomalous; Test: 2000 good engines and 10 anomalous.
Possible evaluation metrics:
True positive, false positive, false negative, true negative.
Precision/Recall
F1score.
Anomaly detection vs. supervised learning
Comparison
Anomaly detection:
 Very small number of positive examples(y=1). (020 is common)
 Large number of negative(y=0) examples.
 Many different “types” of anomalies. Hard for any algorithm to learn from positive examples what the anomalies look like; future anomalies may look nothing like any of the anomalous examples we’ve seen so far.
Supervised learning:
 Large number of positive and negative examples.
 Enough positive examples for algorithm to get a sense of what positive examples are like, future positive examples likely to be similar to ones in training set.
Examples
 Anomaly detection  Supervised learning 
 —  — 
 Fraud detection  Email spam classification 
 Manufacturing(e.g. aircraft engines)  Weather prediction 
 Monitoring machines in a data center  Cancer classification 
 …  … 
Normal distribution
It is denoted as
Tips: if features are nongaussian features, it’s advisable to transform the original features to log/polynomial/…
Multivariate Gaussian distribution
Original model: corresponds to multivariate Gaussian where .(which is axis aligned).
The off diagonal means the correlations between axises. Here’re some examples.
Comparison
Original model:
Manually create features to capture anomalies where x1, x2 take unusual combinations of values.
Computationally cheaper.
Ok even if m is small.
Multivariate Gaussian:
Automatically captures correlations between features.
Computationally more expensive.
Must have m>n or else ∑ is noninvertible.
Recommender System
Well, I’ve written an article about recommender system. (about Itembased Collaborative Filtering:D)
Notation:
 if user j has rated movie i (0 otherwise)
= rating by user j on movie i (if defined)
= parameter vector for user j
= feature vector for movie i
: for user j, movie i, predicted rating
= no. of movies rated by user j
Contentbased
To learn (parameter for user j):
To learn :
Gradient descent update:
Collaborative filtering
Given , to learn :
Given , to learn :
Collaborative filtering optimization objective
Given , estimate :
Given , estimate :
Minimizing and simultaneously:
algorithm
1.Initialize to small random values.
2.Minimize using gradient descent(or an advanced optimization algorithm). E.g. for every :
3.For a user with parameters θ and a movie with (learned) features x, predict a star rating of .
Week Ten
Large scale machine learning
Stochastic gradient descent
Algorithm:
(Learning rate α is typically held constant. We can slowly decrease α over time if we want θ to converge.
Checking for convergence
Every 1000 iterations(say), plot averaged over the last 1000 examples processed by algorithm.
Batch Gradient Descent: use all m examples in each iteration
Stochastic Gradient Descent: use 1 example in each iteration
Minibatch Gradient Descent: use b examples in each iteration
MapReduce
Divide the total computation into several parts. Let different computers/cores to calculate a part and then a central computer calculate the final results.