Machine Learning Note

Week One


What is machine learning?

Field of study that gives computers the ability to learn without being explicitly programmed. — By Arthur Samuel

A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T as measured by P improves with experience E. — By Tom Mitchell

Machine learning algorithms

Supervised learning

  • Regression
    try to map input variables to some continuous function.

  • Classification
    predict results in a discrete output.

Unsupervised learning
approach problems with little or no idea what our results should look like


  • Clustering
  • Non-clustering

Linear Regression with One Variable

Some notations:
m: Number of training examples
x: “input” variable / features
y: “output” variable / “target” variable
: ith training example: Note that the superscript “(i)” in the notation is simply an index into the training set, and has nothing to do with exponentiation.

Hypothesis Function and Cost Function

A slightly more formal description of supervised learning problem is that given a training set, to learn a function h : X → Y so that h(x) is a “good” predictor for the corresponding value of y. For historical reasons, this function h is called a hypothesis.

In this example of linear regression with one variable, the hypothesis function can be denoted as

(maybe we Chinese students are more familiar with the form like h(x)=kx+b) Here and are just parameters. And our goal is to choose and so that is close to y for our training examples(x,y).
The cost function takes an average difference of all the results of the hypothesis with inputs from x’s and the actual output y’s.

This function is otherwise called the “Squared error function”, or “Mean squared error”. The coefficient 1/2 is used for gradient descent so that the partial derivative result will be cleaner.

Gradient Descent

the Gradient descent algorithm:

repeat until convergence {


The value of α should not be too small or too large.
If α is too small, gradient descent can be slow.
If α is too large, gradient descent can overshoot the minimum. It may fail to converge, or even diverge.
In general, gradient descent can converge to a local minimum, even with the learning rate α fixed.

After calculating partial derivation, we can get the algorithm as :

repeat until convergence {

(update θ0 and θ1 simultaneously)

Linear Algebra Review

Matrix: 2-dimensional array.
Vector: An n×1 matrix

Notation:Generally the uppercase letters are used for matrix and the lowercase letters are used for vector.

Matrix Manipulation


Scalar multiplication

Matrix-vector multiplication

Let A be an m×n matrix, x be an n-dimensional vector, then the result A×x will be an m-dimensional vector.
To get yi, multiply A’s ith row with elements of vector x, and add them up.

Matrix-matrix multiplication

Let A be an m×n matrix, B be an n×o matrix, then the result A×B will be an m×o matrix.
The ith column of the matrix C is obtained by multiplying A with the ith column of B.(for i=1,2,…,o). Then the calculation can be simplified to matrix-vector multiplication.

Matrix multiplication properties

Not commutative
Let A and B be matrices. Then in general, A×B≠B×A.


Special matrix

Identity Matrix
The identity matrix, which is denoted as I(sometimes with n×n subscript), simply has 1’s on the diagonal (upper left to lower right diagonal) and 0’s elsewhere. For example:

For any matrix A, A×I=I×A=A

Matrix Inverse
If A is an m×m matrix, and if it has an inverse(not all matrix have an inverse),

Matrices that don’t have an inverse are singular or degenerate.

Matrix Transpose
Let A be an m×n matrix, and let . Then B is an n×m matrix, and

Week Two

Multivariate Linear Regression

from one variable to multiple variables

In the first week’s course, we learned linear with one variable x, and the hypothesis could be . As a matter of fact, there can be more than one variables. So here’re the new notations:
m: the number of training examples
n: the number of features

- : input (features) of ith training example
- : value of feature j in ith training example

The hypothesis should be transformed to:

(for convenience of notation, we define x0=1).
Using the definition of matrix multiplication, our multivariable hypothesis function can be concisely represented as:

We can see that linear regression with one variable is just the special case when n=1.

Similarly, the new gradient descent algorithm would be represented as:
repeat until convergence{


Feature Scaling

The idea is to make sure features are on a similar scale so that the gradient descent can be sped up.
Generally, get every feature into approximately a range. (the range is not limited to [-1,1])
Often used formula:

Mean normalization

Replace with to make features have approximately zero mean.

where is the average of all the values for feature(i) and si is the range of values(max - min) or the standard deviation.

Something more about learning rate

The idea of learning rate is same as in the first week. And how to make sure gradient descent is working correctly(or say, debugging)?
Make a plot with number of iterations on the x-axis. Now plot the cost function, J(θ) over the number of iterations of gradient descent. If J(θ) ever increases, then you probably need to decrease α.

How to declare convergence?
If J(θ) decreased by less than in one iteration.

Polynomial regression




square root:

One important thing to keep in mind is, if you choose your features this way then feature scaling becomes very important.

Normal equation

Gradient descent gives one way of minimizing J, and normal equation is another way. In the “Normal Equation” method, we will minimize J by explicitly taking its derivatives with respect to the θj ’s, and setting them to zero. This allows us to find the optimum theta without iteration.

Here, X is m×(n+1) matrix(remember that x0 = 1), y is m-dimensional vector, and θ is (n+1) dimensional vector.

The advantage of normal equation:

  • no need to choose α
  • don’t need to iterate

Comparison of gradient descent and normal equation

Gradient Descent Normal Equation
need to choose α no need to choose α
need many iterations no need to iterate
O(kn²) O(n³)(need to calculate inverse)
works well when n is large slow if n is very large

normal equation noninvertibility

Sometimes can be invertible, the common causes might be having:

  • Redundant features(linear dependent)
  • Too many features(e.g. m≤n)

Octave/MATLAB Tutorial

Generally the commands are both used in MATLAB and Octave. (It is suggested that change Octave’s command prompt using PS1('>> ').)

Elementary math operations
e.g., 1+2, 3-4, 5*6, 7/8, 2^6
//Note: if the result is floating point, the default digits after decimal point is different between Octave (6 digits) and MATLAB (4 digits).

Logical operations
1==2, 1~=2 %~=: not equal

1 && 0, 1 || 0, xor(1,0)

The result of logical operations is 0(false) or 1(true).

Simple form.
a = 3, a = 3;
The semicolon suppresses the print output.
The assignment can be constants, strings, boolean expressions, etc.

Display variable.
a, disp(a), disp(sprintf('2 decimals: %0.2f', a)
Suppose a = 3, then the output of the command a is a = 3, of the command disp(a) is 3 and of the command disp(sprintf('2 decimals: %0.2f', a) is 2 decimals: 3.00.

sprintf is a C-like syntax that defines the output format.
format long, format short makes the all of the following commands output in long or short format.

Vectors and Matrices
A = [1, 2; 3, 4; 5, 6]
We can memorize that the semicolon ; means the next row of the matrix and the comma , (which can be replaced by space ) means the next column.

v = [1 2 3], v = [1; 2; 3]
The former creates a row vector (1×3 matrix), and the latter creates a column vector (3×1 matrix).

Some useful notation
Create a row vector from START to END with each step incremented by INCREMENT. If :INCREMENT is omitted, the increment is 1.

ones(ROW, COLUMN), zeros(ROW, COLUMN)
Create a ROW×COLUMN matrix of all ones/zeros.

rand generates random numbers from the standard uniform distribution (0,1)
randn generates random numbers from standard normal distribution.

(Eye is maybe a pun on the word identity.) Create a ROW by ROW identity matrix.

Week Three

Logistic Regression

Binary Classification
The output y can take only two values, 0 and 1. Thus, y∈{0,1}, where the value 0 represents negative class and the value 1 represents positive class. It’s not advisable to use linear regression to represent the hypothesis. It’s logistic regression that has a range of (0,1).

Multiclass Classification:One-vs-all
Train a logistic regression classifier for each class i to predict the probability that y = i.
On a new input x, to make a prediction, pick the class i that maximizes .

Logistic Regression Model

The function is called sigmoid function or logistic function, and the function image is as shown below:

What is the interpretation of hypothesis output?

- estimated probability that y = 1, given x, parameterized by θ. In mathematical formula it can be denoted as

Using probability theory knowledge, we can also know that

Decision Boundary
Suppose predict “y=1” if , and predict “y=0” if .
That is equal to

Cost function and Gradient Descent

Cost function in logistic regression is different from that in linear regression.

that means:

- if y = 1;
- if y = 0.

Vectorized implementation of cost function:

Gradient Descent:

repeat {


Vectorized implementation of gradient descent:

Advanced optimization
Except gradient descent, there’re Conjugate gradient, BFGS and L-BFGS optimization algorithms. They have advantages that 1.No need to manually pick α;2.Often faster than gradient descent. But they’re more complex than gradient descent.



If we have too many features, the learn hypothesis may fit the training set very well, but fail to generalize to new examples. (Also known high variance)
There’s also another problem called Underfit problem, which has high bias.
Two examples:

How to address overfitting

  1. Reduce number of features.
    - Manually select which features to keep
    - Model selection algorithm

  2. Regularization.
    - Keep all the features, but reduce magnitude/values of parameters

Cost function:

The additional part is called regularization parameter. Note that λ should be set a proper value. If λ is set to an extremely large value, the algorithm may fail to eliminate overfitting or results in underfitting.

Regularized Linear Regression

Cost function

Gradient Descent:

Repeat {


The second line can also denoted as:

The coefficient will always be less than 1.

Normal Equation:

*If λ>0, this normal equation makes it invertible.

Regularized Logistic Regression

Gradient Descent:
Repeat {


Week Four

Neural Networks

At a very simple level, neurons are basically computational units that take inputs (dendrites) as electrical inputs (called “spikes”) that are channeled to outputs (axons). In our model, our dendrites are like the input features x1⋯xn, and the output is the result of our hypothesis function. In this model our x0 input node is sometimes called the “bias unit.” It is always equal to 1. In neural networks, we use the same logistic function as in classification, , yet we sometimes call it a sigmoid (logistic) activation function. In this situation, our “theta” parameters are sometimes called “weights”.
There’re several layers in the neural network. The first layer is called Input Layer, the last layer is called Output Layer, and the others is called Hidden Layer.

- = “activation” of unit i in layer j
- = matrix of weights controlling function mapping from layer j to layer j+1

In a simple example like this:

we have some equations:

If network has units in layer j, units in layer j+1, then will be of dimension .

Vectorized Implementation:
- (regard the input layer as )

Add (add the bias unit)

Week Five

Neural Networks: Learning

-L: total no. of layers in network
-: no. of units(not counting bias unit) in layer l
-K: no. of units in the output layer()
(K=1:Binary classification, K≥3:Multi-class classification)

Cost Function

Compared with the cost function of logistic regression, we have added a few nested summations to account for our multiple output nodes. In the first part of the equation, before the square brackets, we have an additional nested summation that loops through the number of output nodes.(k = 1:K)
In the regularization part, after the square brackets, we must account for multiple theta matrices. The number of columns in our current theta matrix is equal to the number of nodes in our current layer (including the bias unit). The number of rows in our current theta matrix is equal to the number of nodes in the next layer (excluding the bias unit).

Backpropagation Algorithm

To minimize , we need code to compute and . The formula of is described above, and now let’s have a recall at forward propagation first.

In a 4-layer neural network, we have the following forward propagation:

Then what about backward propagation(Intuition: = “error” of node j in layer l)?

, it’s just the partial derivative)

-=”error” of cost for .
Formally, .

Backpropagation algorithm:
Training set
Set .(initializing accumulators)
For i = 1 to m
Perform forward propagation to compute for l=2,3,…,L
Using , compute

- if j≠0
- if j=0

Unrolling parameters
Idea: Unroll matrices into vectors. In order to use optimizing functions such as “fminunc()”, we will want to “unroll” all the elements and put them into one long vector.

Learning Algorithm
Have initial parameters Theta1,Theta2,Theta3
Unroll to get initialTheta to pass to fminunc(@costFunction, initialTheta, options).

Gradient checking

Gradient checking will assure that our backpropagation works as intended. We can approximate the derivative of our cost function with:

With multiple theta matrices, we can approximate the derivative with respect to Θj as follows:

Implementation note:

  • Implement backprop to compute DVec(unrolled D(1),D(2),D(3)).
  • Implement numerical gradient check to compute gradApprox.
  • Make sure they have similar values.
  • Turn off gradient checking. Using backprop code for learning. (or the code will be very slow)

Random initialization
For gradient descent and advanced optimization method, we need initial value for θ. However, it’s not advisable to set initial theta to all zeros. When we backpropagate, all nodes will update to the same value repeatedly. So instead of using zeros, use rand to initialize theta.

Training a neural network

  1. Randomly initialize weights
  2. Implement forward propagation to get for any
  3. Implement code to compute cost function
  4. Implement backprop to compute partial derivatives
  5. Use gradient checking to compare computed using backpropagation vs. using numerical estimate of gradient of .(Then disable gradient checking code.)
  6. Use gradient descent or advanced optimization method with backpropagation to try to minimize as a function of parameter θ

Week Six

Advice for applying machine learning

Divide the data set into three parts: training set, cross validation set and test set.(sometimes two parts, training set and test set)
Training error:

Cross Validation error:

Test error:

Model Selection: eg. for d = 1:10, when trying to minimize , we get the object . Then using the θ we get, we can estimate generalization error for test set.

bias v.s. variance

High bias: underfit
High variance: overfit

Just as the figure shown above, with the increment of degree of polynomial d, the training error will be less and less. However, if the degree is too high, the cross validation error would be high again(overfitting).
- will be high.
- will be low.

Regularization and bias/variance

Now let’s take the parameter λ that we use in regularization into consideration.
If λ is too large, it may lead to high bias. If λ is too small, it may lead to high variance.(e.g. λ=0)

Learning curves

Debugging a learning algorithm

Fixing high bias:

  • Try getting additional features
  • Try adding polynomial features
  • Try decreasing λ

Fixing high variance:

  • Get more training examples
  • Try smaller sets of features
  • Try increasing λ

Machine learning system design

  • Start with a simple algorithm that you can implement quickly. Implement it and test it on your cross-validation data.
  • Plot learning curves to decide if more data, more features, etc. are likely to help.
  • Error analysis: Manually examine the examples (in cross validation set) that your algorithm made errors on. See if you spot any systematic trend in what type of examples it is making errors on.

Precision & Recall

Take cancer prediction as example. Since the cancer incidence is quite low, the prediction can simply be y=0. The error rate is relatively low, however, it’s not a ‘prediction’ at all. Thus, we can’t judge predictions’ performance only using error rate. So the concept of precision and recall is introduced.

Actual Class 1 Actual Class 0
Predicted Class 1 True positive False positive
Predicted Class 0 False negative True negative

y=1 in presence of rare class that we want to detect. (In cancer prediction example, isCancer should be 1)


Calculated in row.


Calculated in column.

Suppose we want to predict y=1(cancer) only if very confident. Then turn threshold(originally 0.5) up to get high precision and lower recall.
Suppose we want to avoid missing too many cases of cancer(avoid false negatives). Then turn threshold down to get high recall and lower precision.

F1 Score(F score)

Large data rationale

Use a learning algorithm with many parameters(e.g. logistic regression/linear regression with many features; neural network with many hidden units). - low bias

Use a very large training set(unlikely to overfit). - low variance

Week Seven

Support Vector Machine

Alternative view of logistic regression
The cost of a specific example(x,y) is
If y=1, then the right part of the function is ignored(since 1-y=0), and what we get is:

Similarly, if y=0, then the left part of the function is ignored, and what we get is:

For support vector machine, we have and functions (the subscript corresponds to the value of y) that are similar to the original curves. The difference is that the curves are made up of straight lines.


And the cost function of SVM:

Difference between SVM and logistic regression:
SVM ignores the term 1/m.
The form of logistic regression is A+λB, while that of SVM is CA+B.(if C=1/λ, then two these two optimization objectives should give you the same optimal value for theta)

Large Margin Classifier
SVM wants a bit more than the original logistic regression.
If , we want (not just ≥0)
If , we want (not just <0)

Consider the situation that C is very large, and optimization would try to set the left part of cost function to be zero. And this leads to the large margin classifier concept:

Compared with the magenta and green lines, the black line has some larger minimum distance from any of the training examples. This distance is called the margin of the SVM and this gives the SVM a certain robustness, because it tries to separate the data with as large a margin as possible.

The mathematics behind large margin classification is the dot product of the vectors.


For SVM, there’s a different(better) choice of the features:
given x, compute new features depending on proximity to landmarks .

For example,

Here the similarity is called the kernel function, and the corresponding kernel function is Gaussian Kernel which uses exp.

How to choose landmarks?
Landmarks is just the input examples.
Given ,
choose .
And calculate features using kernals.
Hypothesis: Given x, compute features , predict “y=1” if .

Here n is equal to m.
And for some SVM, the computation of is using rather than .

SVM parameters

Large C: Lower bias, higher variance.
Small C: Higher bias, lower variance.
(regard C as 1/λ).

Large σ²: Features f vary more smoothly. Higher bias, lower variance.
Small σ²: Features f vary less smoothly. Lower bias, higher variance.

Using an SVM

SVM package: liblinear, libsvm, etc.

Choice of kernels:
Linear kernel(No kernel)
Predict “y=1” if

Gaussian kernel

(Remember to perform feature scaling before using Gaussian kernel.)

Polynomial kernel

More esoteric
String kernel, chi-square kernel, histogram intersection kernel, …

Logistic regression or SVM?
If n is large(relative to m): use logistic regression, or SVM without a kernel.
If n is small, m is intermediate: use SVM with Gaussian kernel.
If n is small, m is large: add more features, then use logistic regression, or SVM without a kernel

Neural network is likely to work well for most of these settings, but may be slower to train.

Week Eight


The difference between unsupervised learning and supervised learning:
The supervised learning problem is given a set of labels to fit a hypothesis to it. In contrast, in the unsupervised learning problem we’re given data that does not have any labels associated with it.

K-means algorithm

Random initialize K cluster centroids
Repeat {
for i=1 to m
-:=index(from 1 to K) of cluster centroid closest to
for k=1 to K
-:=average(mean) of points assigned to cluster k

The first for loop is cluster assignment step, and the second for loop is moving centroid.

Optimization objective
-: index of cluster(1,2,…,K) to which example is currently assigned
-: cluster centroid
-: cluster centroid of cluster to which example has been assigned

Try to minimize (also called distortion function).

Random initialization
Randomly pick K training examples, and set those examples as cluster centroids.
For better performance, run multiple times(e.g, 100 times) and pick clustering that gave lowest cost J.

Choosing the number of clusters
Sometime it’s helpful to use elbow method, but it’s often not advisable:

Sometimes, you’re running K-means to get clusters to use for some later/downstream purpose. Evaluate K-means based on a metric for how well it performs for that later purpose.

Dimensionality Reduction

Motivation of dimensionality reduction:

  • Data Compression

    -Reduce memory/disk needed to store data
    -Speed up learning algorithm

  • Data Visualization

    -k=2 or k=3, so we can visualize the data and get an intuitive view

Principal Component Analysis(PCA)

Reduce from n-dimension to k-dimension: Find k vectors onto which to project the data, so as to minimize the projection error.

Difference between PCA and linear regression
PCA looks like linear regression(reduce from 2D to 1D), but they’re different.
Linear regression has the input x and corresponding label y. What linear regression does is trying to predict the output y. And the ‘error’ is computed vertically.
PCA is unsupervised learning and has no label y. What PCA does is reduce the dimension of features. And the ‘error’ is computed according to the vector difference.

PCA Algorithm
Before applying PCA algorithm, remember to make data preprocessing.
Given training set:, using feature scaling/mean normalization to preprocess

then replace each with .
If different features on different scales (e.g., x1=size of house, x2=number of bedrooms), scale features to have comparable range of values.

After mean normalization(ensure every feature has zero means) and optionally feature scaling:

Sigma = (1 / m) * X' * X
[U,S,V] = svd(Sigma);
Ureduce = U(:,1:k);
z = Ureduce' * x;

Choosing k

Reconstruction from compressed representation:
From the formula , we can get the reconstruction x using

Average squared projection error:

Total variation in the data:

Typically, choose k to be smallest value so that

Here 0.01 means that 99% of variance is retained.

We don’t have to loop k from 1 to n to find the smallest value. The function svd has an output S which is useful.

[U,S,V] = svd(Sigma)
Just pick smallest value of k for which


To speed up supervised learning, note to map should be defined by running PCA only on the training set.
It’s not good to address overfitting using PCA. Use regularization instead.
Before implementing PCA, first try running whatever you want to do with the original/raw data . Only if that doesn’t do what you want, then implement PCA and consider using .

Week Nine

Anomaly Detection


1.Choose features that you think might be indicative of anomalous examples.
2.Fit parameters

3.Given new example x, computer p(x):

Check anomaly if

Developing and evaluating
Assume we have some labeled data, of anomalous and non-anomalous examples. (y=0 if normal, y=1 if anomalous).

Suppose we have 10000 good (normal) engines(it’s okay that there’re some anomalous fixed in) and 20 flawed engines(anomalous).
Then divide the examples into Training set:6000 good engines; CV: 2000 good engines and 10 anomalous; Test: 2000 good engines and 10 anomalous.

Possible evaluation metrics:
True positive, false positive, false negative, true negative.

Anomaly detection vs. supervised learning

Anomaly detection:

  • Very small number of positive examples(y=1). (0-20 is common)
  • Large number of negative(y=0) examples.
  • Many different “types” of anomalies. Hard for any algorithm to learn from positive examples what the anomalies look like; future anomalies may look nothing like any of the anomalous examples we’ve seen so far.

Supervised learning:

  • Large number of positive and negative examples.
  • Enough positive examples for algorithm to get a sense of what positive examples are like, future positive examples likely to be similar to ones in training set.

| Anomaly detection | Supervised learning |
| —- | —- |
| Fraud detection | Email spam classification |
| Manufacturing(e.g. aircraft engines) | Weather prediction |
| Monitoring machines in a data center | Cancer classification |
| … | … |

Normal distribution

It is denoted as

Tips: if features are non-gaussian features, it’s advisable to transform the original features to log/polynomial/…

Multivariate Gaussian distribution

Original model: corresponds to multivariate Gaussian where .(which is axis aligned).

The off diagonal means the correlations between axises. Here’re some examples.

Original model:
Manually create features to capture anomalies where x1, x2 take unusual combinations of values.
Computationally cheaper.
Ok even if m is small.

Multivariate Gaussian:
Automatically captures correlations between features.
Computationally more expensive.
Must have m>n or else ∑ is non-invertible.

Recommender System

Well, I’ve written an article about recommender system. (about Item-based Collaborative Filtering:D)

- if user j has rated movie i (0 otherwise)
-= rating by user j on movie i (if defined)
-= parameter vector for user j
-= feature vector for movie i
-: for user j, movie i, predicted rating
-= no. of movies rated by user j


To learn (parameter for user j):

To learn :

Gradient descent update:

Collaborative filtering

Given , to learn :

Given , to learn :

Collaborative filtering optimization objective
Given , estimate :

Given , estimate :

Minimizing and simultaneously:

1.Initialize to small random values.
2.Minimize using gradient descent(or an advanced optimization algorithm). E.g. for every :

3.For a user with parameters θ and a movie with (learned) features x, predict a star rating of .

Week Ten

Large scale machine learning

Stochastic gradient descent


(Learning rate α is typically held constant. We can slowly decrease α over time if we want θ to converge.
Checking for convergence
Every 1000 iterations(say), plot averaged over the last 1000 examples processed by algorithm.

Batch Gradient Descent: use all m examples in each iteration
Stochastic Gradient Descent: use 1 example in each iteration
Mini-batch Gradient Descent: use b examples in each iteration


Divide the total computation into several parts. Let different computers/cores to calculate a part and then a central computer calculate the final results.