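Formally, the task of language modeling is to assign a probability to any sequence of words $w_1, w_2, \dots, w_n$, i.e., to estimate $P(w_1, w_2, \dots, w_n)$.
Notation: the expression $w_1^{n-1}$ means the string $w_1, w_2, \dots, w_{n-1}$. For the joint probability of each word in a sequence having a particular value, $P(X = w_1, Y = w_2, Z = w_3, \dots, W = w_n)$, we’ll use $P(w_1, w_2, \dots, w_n)$.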
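By using the chain rule of probability, we get
$$P(w_1^n) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1^2) \cdots P(w_n \mid w_1^{n-1}) = \prod_{k=1}^{n} P(w_k \mid w_1^{k-1})$$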
The chain rule converts the joint probability into a product of conditional probabilities. On its own, however, it doesn’t simplify the calculation, because each factor still conditions on the entire history. Therefore, language models make use of the Markov assumption, which approximates the history by just the last few words.
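The bigram model, for example, approximates the probability of a word given all the previous words, $P(w_n \mid w_1^{n-1})$, by using only the conditional probability given the preceding word, $P(w_n \mid w_{n-1})$. We can then compute the probability of a complete word sequence:
$$P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-1})$$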
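To estimate these probabilities, an intuitive way is maximum likelihood estimation (MLE): count how often a bigram occurs and normalize by the count of its first word, $P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{C(w_{n-1})}$.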
To compute the probability of an entire sentence, it is convenient to pad the beginning and end with special symbols, e.g., a special symbol `<s>` at the beginning of the sentence and a special end symbol `</s>` at the end.
For the general case of MLE n-gram parameter estimation:
$$P(w_n \mid w_{n-N+1}^{n-1}) = \frac{C(w_{n-N+1}^{n-1}\, w_n)}{C(w_{n-N+1}^{n-1})}$$
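The counting behind the bigram case can be sketched in a few lines. This is only an illustration under my own assumptions: the toy corpus, the whitespace tokenization, and the helper name `train_bigram_mle` are not from any library, and returning 0.0 for an unseen context is just a convention chosen here.

```python
from collections import Counter

def train_bigram_mle(sentences):
    """Return a function prob(word, prev) giving the MLE bigram probability."""
    context_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        # Pad with the start and end symbols.
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        # Every token except the final </s> serves as a context once.
        context_counts.update(tokens[:-1])
        bigram_counts.update(zip(tokens[:-1], tokens[1:]))

    def prob(word, prev):
        if context_counts[prev] == 0:
            return 0.0  # unseen context: the MLE is undefined, 0.0 is a convention
        return bigram_counts[(prev, word)] / context_counts[prev]

    return prob

# Hypothetical toy corpus, whitespace-tokenized.
corpus = ["I am Sam", "Sam I am", "I do not like green eggs and ham"]
p = train_bigram_mle(corpus)
print(p("I", "<s>"))   # 2 of the 3 sentences start with "I"
print(p("Sam", "am"))  # "am" is followed once by "Sam" and once by "</s>"
```

Any bigram that never occurs in the training data gets probability zero here, which is exactly the problem smoothing addresses.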
The hyperparameter n controls the size of the context used in each conditional probability. N-gram language models face a bias-variance tradeoff: a small n may cause high bias, and a large n could lead to high variance. Since language is full of long-range dependencies, it is impossible to capture them with a small n. But language is also creative and varied, so most long n-grams are rarely or never observed, and n cannot be set too large either.
We have to modify our probability estimates so that the model can cope with patterns in the test text that it never saw in training. There are several ways: add-one smoothing, backoff and interpolation, Kneser-Ney smoothing, etc.
Take the unigram model as an example. The unsmoothed maximum likelihood estimate of the unigram probability of the word $w_i$ is its count $c_i$ normalized by the total number of word tokens $N$: $P(w_i) = \frac{c_i}{N}$.
Laplace smoothing just adds one to each count (that’s why it’s called add-one smoothing). Since there are $V$ words in the vocabulary and each one is incremented, we also need to adjust the denominator to take into account the extra $V$ observations:
$$P_{\text{Laplace}}(w_i) = \frac{c_i + 1}{N + V}$$
For bigram counts, we need to augment the unigram count in the denominator by the number of word types in the vocabulary, $V$:
$$P_{\text{Laplace}}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + 1}{C(w_{n-1}) + V}$$
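Laplace smoothing is a special case of Lidstone smoothing, which adds a constant $\alpha$ instead of one (Laplace smoothing is the case $\alpha = 1$). The basic framework of Lidstone smoothing:
$$P_{\text{Lidstone}}(w_i) = \frac{c_i + \alpha}{N + \alpha V}$$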
Instead of changing both the numerator and denominator, it is convenient to describe how a smoothing algorithm affects the numerator, by defining an adjusted count $c^*$. For add-one smoothing:
$$c_i^* = (c_i + 1)\,\frac{N}{N + V}$$
Then the probability can be calculated by dividing $c_i^*$ by $N$.
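As a rough illustration of additive smoothing on unigram counts, here is a minimal sketch. The function name `lidstone_unigram`, the toy sentence, and the extra vocabulary word are all assumptions made for the example.

```python
from collections import Counter

def lidstone_unigram(tokens, vocab, alpha=1.0):
    """Smoothed unigram probabilities; alpha=1.0 gives Laplace (add-one)."""
    counts = Counter(tokens)
    n = len(tokens)   # total number of word tokens
    v = len(vocab)    # number of word types in the vocabulary
    return {w: (counts[w] + alpha) / (n + alpha * v) for w in vocab}

tokens = "the cat sat on the mat".split()
vocab = sorted(set(tokens) | {"dog"})   # "dog" never occurs in the data
probs = lidstone_unigram(tokens, vocab)
print(probs["the"])          # (2 + 1) / (6 + 6)
print(probs["dog"])          # (0 + 1) / (6 + 6): nonzero despite a zero count
print(sum(probs.values()))   # the distribution still sums to (about) 1
```

Passing a smaller `alpha` moves less probability mass to unseen words.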
The Good-Turing estimate (Good, 1953) states that for any n-gram that occurs $r$ times, we pretend that it occurs $r^*$ times, where
$$r^* = (r + 1)\,\frac{n_{r+1}}{n_r}$$
and where $n_r$ is the number of n-grams that occur exactly $r$ times in the training data. The corresponding probability for an n-gram with $r$ counts is then normalized as:
$$P_{\text{GT}} = \frac{r^*}{N}, \quad N = \sum_{r=0}^{\infty} n_r\, r^*$$
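The adjusted counts can be computed directly from the "frequency of frequencies". The sketch below is deliberately naive: `good_turing_counts` is a hypothetical helper, the data is a toy example, and nothing is done about $n_{r+1} = 0$, which practical implementations handle by smoothing the $n_r$ values first.

```python
from collections import Counter

def good_turing_counts(ngram_counts):
    """Map each observed count r to r* = (r + 1) * n_{r+1} / n_r."""
    # n_r: how many distinct n-grams occur exactly r times.
    freq_of_freq = Counter(ngram_counts.values())
    adjusted = {}
    for r, n_r in freq_of_freq.items():
        n_next = freq_of_freq.get(r + 1, 0)
        # With n_{r+1} = 0 the raw formula yields 0 (not corrected here).
        adjusted[r] = (r + 1) * n_next / n_r
    return adjusted

# Toy unigram counts: a occurs 3 times, b twice, c/d/e once each.
counts = Counter("a b a c d a b e".split())
adjusted = good_turing_counts(counts)
print(adjusted[1])   # 2 * n_2 / n_1 = 2 * 1 / 3
print(adjusted[2])   # 3 * n_3 / n_2 = 3 * 1 / 1
```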
A related way to view smoothing is as discounting (lowering) some nonzero counts in order to get the probability mass that will be assigned to the zero counts. The ratio of the discounted counts to the original counts is the discount $d_c$:
$$d_c = \frac{c^*}{c}$$
Church and Gale (1991) show empirically that the average Good-Turing discount associated with n-grams is largely constant over $r$. Absolute discounting formalizes this by subtracting a fixed (absolute) discount $d$ from each count.
When using large n-grams (trigrams or higher), we may have no corresponding examples in the training data. In a backoff n-gram model, if the n-gram we need has zero counts, we approximate it by backing off to the (N-1)-gram. We continue backing off until we reach a history that has some counts.
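For the result to be a correct probability distribution, backoff must be combined with discounting. This kind of backoff is called Katz backoff. With bigram counts $c(i, j)$ for word $i$ in context $j$ and a discount $d$:
$$c^*(i, j) = c(i, j) - d$$
$$P_{\text{Katz}}(i \mid j) = \begin{cases} \dfrac{c^*(i, j)}{c(j)} & \text{if } c(i, j) > 0 \\ \alpha(j) \times \dfrac{P_{\text{unigram}}(i)}{\sum_{i' : c(i', j) = 0} P_{\text{unigram}}(i')} & \text{if } c(i, j) = 0 \end{cases}$$
The term $\alpha(j)$ indicates the amount of probability mass that has been discounted for context $j$.
Katz backoff is often combined with Good-Turing. The combined Good-Turing backoff algorithm involves quite detailed computation for estimating the Good-Turing smoothing and the $P^*$ and $\alpha$ values.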
In backoff, we only “back off” to a lower-order n-gram if we have zero evidence for a higher-order n-gram. By contrast, in interpolation, we always mix the probability estimates from all related n-gram estimators.
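In simple linear interpolation, we estimate the trigram probability by mixing together the unigram, bigram, and trigram probabilities, each weighted by a $\lambda$:
$$\hat{P}(w_n \mid w_{n-2} w_{n-1}) = \lambda_1 P(w_n \mid w_{n-2} w_{n-1}) + \lambda_2 P(w_n \mid w_{n-1}) + \lambda_3 P(w_n)$$
To ensure that the estimated probability is valid, the $\lambda$s should sum to 1: $\sum_i \lambda_i = 1$.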
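A small sketch of the mixing step, assuming the three component probabilities have already been estimated. The function name `interpolate` and the weights are arbitrary placeholders for illustration, not tuned values.

```python
def interpolate(p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
    """Mix unigram, bigram and trigram estimates with fixed weights.

    `lambdas` holds the weights for (unigram, bigram, trigram).
    """
    if abs(sum(lambdas) - 1.0) > 1e-9:
        raise ValueError("lambdas must sum to 1")
    l_uni, l_bi, l_tri = lambdas
    return l_uni * p_uni + l_bi * p_bi + l_tri * p_tri

# An unseen trigram (p_tri = 0) still receives mass from the lower orders.
print(interpolate(p_uni=0.01, p_bi=0.2, p_tri=0.0))
```

In practice the weights are chosen on held-out data rather than fixed by hand.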
An elegant way to find the specific $\lambda$ values is expectation-maximization (EM).
Extrinsic evaluation: the best way to evaluate the performance of a language model is to embed it in an application and measure how much the application, like machine translation, improves. However, such end-to-end evaluation is hard and often expensive.
Intrinsic evaluation: intrinsic evaluation is task-neutral and can be used to quickly evaluate potential improvements in a language model.
Note that for intrinsic evaluation, we have to use held-out data, which is not used during training.
Perplexity is an information-theoretic measurement of how well a probability model predicts a sample. The lower the perplexity, the better. For a held-out text $W = w_1 w_2 \dots w_N$:
$$PP(W) = P(w_1 w_2 \dots w_N)^{-\frac{1}{N}}$$
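To make the measure concrete, here is a minimal sketch that computes perplexity from the probabilities a model assigned to each held-out token. The helper name `perplexity` and the input lists are made up for illustration.

```python
import math

def perplexity(token_probs):
    """Perplexity from the probabilities a model gave each held-out token."""
    if any(p <= 0.0 for p in token_probs):
        return math.inf   # one zero-probability token makes perplexity infinite
    avg_neg_log2 = -sum(math.log2(p) for p in token_probs) / len(token_probs)
    return 2.0 ** avg_neg_log2

print(perplexity([0.25, 0.25, 0.25, 0.25]))  # uniform over 4 choices
print(perplexity([1.0, 1.0, 1.0]))           # always certain and always right
print(perplexity([0.5, 0.0]))                # an "impossible" token was observed
```

Working in log space avoids multiplying many small probabilities together.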
Some special cases:
The perplexity measure is a good indicator of the quality of a language model. However, in many cases improvements in perplexity scores do not transfer to improvements in extrinsic, task-quality scores. In that sense, the perplexity measure is good for comparing different language models in terms of their ability to pick up regularities in sequences, but is not a good measure for assessing progress in language understanding or language-processing tasks. A model’s improvement in perplexity should always be confirmed by an end-to-end evaluation on a real task before concluding the evaluation of the model.
A small benchmark dataset is the Penn Treebank, which contains roughly a million tokens; its vocabulary is limited to 10,000 words, with all other tokens mapped to a special symbol.
A larger-scale language modeling dataset is the 1B Word Benchmark, which contains text from news crawl data.
I’ll complete this section after I read the relevant papers. Besides, the state-of-the-art leaderboards can be viewed here.
Jacob Eisenstein. 2018. Natural Language Processing (draft of November 13, 2018). https://github.com/jacobeisenstein/gt-nlp-class/tree/master/notes, pages 125-143.
Dan Jurafsky, James H. Martin. 2018. Speech and Language Processing (3rd ed. draft, Sep. 23, 2018). https://web.stanford.edu/%7Ejurafsky/slp3/, pages 37-62.
Yoav Goldberg. 2017. Neural Network Methods for Natural Language Processing. https://doi.org/10.2200/S00762ED1V01Y201703HLT037, pages 105-113.
It doesn’t matter whether you already know Python. As long as you have learned some programming language, e.g., Java or C, that’s enough. Python is user-friendly and easy to use. You can refer to Python’s official tutorial at any time for better understanding, or you can just skip it and follow my post step by step. Either way, I’d suggest that you download Anaconda. It is straightforward and takes care of troublesome settings, so you can write code as soon as it is installed. And I’d highly recommend writing in Jupyter Notebook, which is installed together with Anaconda.
For Windows users, click Open Menu (or press the `Windows` key on the keyboard) → Anaconda3 → Jupyter Notebook.
For macOS users, open Launchpad (or press `F4` on the keyboard) → Anaconda Navigator → notebook.
After that, a command window (black in Windows, white in macOS) will pop up. You can ignore it (but don’t close it!). Then your default browser will open automatically. Click New in the top right corner and select Python 3, and you can start writing Python code!
The notebook consists of a sequence of cells. There are three types of cells: code cells, markdown cells, and raw cells. By default, a new cell is a code cell, and you can write Python code in it directly. You don’t need to worry about markdown or raw cells.
A cell has two modes: command mode and edit mode. Generally, the cell is in edit mode and you can type freely. In command mode, you can navigate around the notebook using keyboard shortcuts.
- `Shift + Enter`: run cell
- `Esc`: turn cell into command mode
- `Enter`: turn cell into edit mode
- `D, D`: delete the cell
- `A`: insert cell above
- `B`: insert cell below
These are the shortcuts that I use most frequently. For the full list of available shortcuts, click Help → Keyboard Shortcuts in the notebook menus.
Well, this is a simple introduction to Jupyter Notebook. You can download my code and open processing.ipynb using Jupyter Notebook, or you can continue reading this post and type the code on your own. The following content is almost the same.
Oops, I almost forgot to tell you how to shut down Jupyter Notebook when you’ve finished your work. Return to the command window and press `Ctrl+C` twice quickly. Then Jupyter Notebook will shut down and everything will be okay.
You have to glance through your corpus before processing. Generally the format of a corpus is `.txt` or `.xml`. You can use the default text editor in your operating system to open the file, or you can use a more advanced text editor like Sublime Text for a better view.
The above image shows the heading of LCMC_A.xml. (The Lancaster Corpus of Mandarin Chinese (LCMC) addresses an increasing need within the research community for a publicly available balanced corpus of Mandarin Chinese. Click here for more information about this corpus.) We don’t need to care about the heading, which gives information about the creator, date, etc. Focusing on the main body, we can see that each word is wrapped in an opening `<w POS="">` tag and a closing `</w>` tag, like `<w POS="a">大</w>`. Different corpora may use different tagging schemes; here (and in the following sections) I just take LCMC as an example.
All right. Let’s get started with a single file.


The first line declares a variable called `filename`. The second line tells the system to open the file. Note that `encoding='utf8'` indicates the encoding of the file; utf8 is currently the most frequently used encoding. If you don’t know the file’s encoding, try utf8 first. Later I’ll write about how to deal with other encodings. The third line just reads the entire content and assigns it to a variable `read_data`.
You can type `read_data` in a new cell and run the cell to see whether it was successful.
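A minimal sketch of these steps follows. So that it runs anywhere, it first writes a tiny stand-in file to the temporary directory; the sample content and the file name `LCMC_sample.xml` are assumptions, and with the real corpus you would point `filename` at your own file instead.

```python
import os
import tempfile

# Stand-in for the real corpus file (hypothetical content in the LCMC word format).
sample = '<w POS="r">这</w><w POS="v">是</w><w POS="a">大</w><c POS="ew">。</c>'
filename = os.path.join(tempfile.gettempdir(), 'LCMC_sample.xml')
with open(filename, 'w', encoding='utf8') as f:
    f.write(sample)

# Open the file with an explicit encoding and read its whole content.
with open(filename, encoding='utf8') as f:
    read_data = f.read()

print(read_data)
```

Using `with` closes the file automatically once the block ends.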


The most useful information is the tags and the tokens. With the above code we can extract these parts in parallel (in two separate lists). Here I use regular expressions (`re`), which may look a bit abstruse. For now you just need to know what the code does. Recall the word form `<w POS="a">大</w>`: the second line extracts the content between the quotes (`a`, which is a tag); the third line extracts the content between the right angle bracket and the left angle bracket (`大`, which is a token).
Note that the above code extracts only words, not punctuation. If you need to treat punctuation marks as tokens, use the following code instead:
`tag = re.findall(r'<[wc] POS="(\w{0,4})">.{0,7}</[wc]>', read_data)`,
`token = re.findall(r'<[wc] POS="\w{0,4}">(.{0,7})</[wc]>', read_data)`.
You can refer to my post for more information about regular expression.
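Here is a small runnable sketch of the extraction on a made-up one-line sample. The words-only patterns are my assumption, obtained by narrowing `[wc]` to `w` in the two patterns quoted above; the variable names `tag_all` and `token_all` are also just for this example.

```python
import re

# Hypothetical sample in the LCMC word format; <c> marks punctuation.
read_data = '<w POS="r">这</w><w POS="v">是</w><w POS="a">大</w><c POS="ew">。</c>'

# Words only: capture the tag inside the quotes, then the text between > and <.
tag = re.findall(r'<w POS="(\w{0,4})">.{0,7}</w>', read_data)
token = re.findall(r'<w POS="\w{0,4}">(.{0,7})</w>', read_data)
print(tag)
print(token)

# Words and punctuation, using the [wc] patterns.
tag_all = re.findall(r'<[wc] POS="(\w{0,4})">.{0,7}</[wc]>', read_data)
token_all = re.findall(r'<[wc] POS="\w{0,4}">(.{0,7})</[wc]>', read_data)
print(tag_all)
print(token_all)
```

Because both lists come from the same matches in the same order, `tag[i]` is the tag of `token[i]`.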


Since we have extracted the useful parts, we can count the frequencies. The first line imports the library `collections` for counting. The second line counts the token list and the third line outputs the frequencies in descending order.
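A sketch of the counting step on a made-up token list (the list stands in for the tokens extracted from the corpus):

```python
import collections

# Hypothetical token list standing in for the extracted corpus tokens.
token = ['这', '是', '大', '学', '是', '的', '是']

counter = collections.Counter(token)
print(counter.most_common())    # (token, count) pairs, most frequent first
print(counter.most_common(2))   # only the top two
```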
Well, the frequency just gives us an overview. Now let’s focus on our target words. We need to know where a word occurs in the corpus. For example, the following line finds all indices of the word ‘是’.


With the indices we get, we can now perform a variety of tasks: view the concordance of the word, extract n-grams, etc.


The first two lines store the tokens before ‘是’ and the tokens after ‘是’ separately, and the last two lines store the tags before ‘是’ and the tags after ‘是’.
Key Word In Context (KWIC) is a typical application in corpus linguistics. We can print KWIC easily with the indices we get:


(Well, the code here is simplified and does not handle indices that fall outside the list.)
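The index lookup, the neighbour lists and a KWIC display can be sketched together as below. The token and tag lists, the helper `kwic`, and the window width are all assumptions for illustration; unlike the simplified version, this sketch guards against positions at either end of the list.

```python
# Hypothetical parallel lists standing in for the extracted tokens and tags.
token = ['这', '是', '一', '本', '书', '他', '是', '学生']
tag = ['r', 'v', 'm', 'q', 'n', 'r', 'v', 'n']

# Every position at which the target word occurs.
indices = [i for i, t in enumerate(token) if t == '是']

# Neighbouring tokens and tags, skipping positions that fall off either end.
pre_token = [token[i - 1] for i in indices if i > 0]
post_token = [token[i + 1] for i in indices if i + 1 < len(token)]
pre_tag = [tag[i - 1] for i in indices if i > 0]
post_tag = [tag[i + 1] for i in indices if i + 1 < len(tag)]

def kwic(token, indices, width=2):
    """Return one 'left [keyword] right' line per occurrence."""
    lines = []
    for i in indices:
        left = token[max(0, i - width):i]     # clamp at the start of the list
        right = token[i + 1:i + 1 + width]    # slicing past the end is safe
        lines.append(' '.join(left) + ' [' + token[i] + '] ' + ' '.join(right))
    return lines

for line in kwic(token, indices):
    print(line)
```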
Maybe you are more familiar with Excel. You can export the statistics you get from Python as well.


With these two lines of code executed, you can see that a new Excel file is generated in your current directory. Then you can view the data in Excel.
Generally, a corpus may contain several files. Now that we are done with a single file, it’s easy to process all the files as well.


The above code will list the files in the current directory whose filename extension is `.xml`. With this list, we can read all the files within a `for` loop.


Here `tag` and `token` are just the same as in the previous section. The difference is that I declared two new variables, `tags` and `tokens`, to store all the tags and tokens in the corpus (while `tag` and `token` only store those of a single file). The last two lines just merge `tag` into `tags` and `token` into `tokens`.
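The whole multi-file loop might look like the sketch below. To stay runnable it creates two tiny stand-in files in a temporary directory first; the file names, their content, and the words-only regular expressions are assumptions, and with a real corpus you would point `corpus_dir` at your own folder.

```python
import glob
import os
import re
import tempfile

# Two tiny stand-in files so the sketch is runnable (hypothetical content).
corpus_dir = tempfile.mkdtemp()
samples = {'A.xml': '<w POS="r">这</w><w POS="v">是</w>',
           'B.xml': '<w POS="a">大</w>'}
for name, text in samples.items():
    with open(os.path.join(corpus_dir, name), 'w', encoding='utf8') as f:
        f.write(text)

# List every .xml file, then accumulate tags and tokens across files.
filenames = sorted(glob.glob(os.path.join(corpus_dir, '*.xml')))
tags, tokens = [], []
for filename in filenames:
    with open(filename, encoding='utf8') as f:
        read_data = f.read()
    tag = re.findall(r'<w POS="(\w{0,4})">.{0,7}</w>', read_data)
    token = re.findall(r'<w POS="\w{0,4}">(.{0,7})</w>', read_data)
    tags += tag      # merge this file's results into the corpus-wide lists
    tokens += token

print(tags)
print(tokens)
```

Sorting the file list keeps the order of tokens reproducible from run to run.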
Some lines of code serve a particular purpose and may be executed several times. It’s a good idea to wrap such code in a function for reuse. It helps to simplify our programs and makes them easier to use and understand.


For example, we define a function called `previous_tokens`. With the above code written, we can call it easily: `previous_tokens('的')` will give us the tokens that precede ‘的’.
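One possible shape for such a function is sketched below. Note the assumption: this version takes the token list as an explicit second argument with a module-level default, so that `previous_tokens('的')` still works as a one-argument call; the token list itself is made up.

```python
# Hypothetical corpus-wide token list.
tokens = ['我', '的', '书', '他', '的', '笔']

def previous_tokens(word, token_list=tokens):
    """Return the token immediately before each occurrence of `word`."""
    return [token_list[i - 1]
            for i, t in enumerate(token_list)
            if t == word and i > 0]   # i > 0: the first token has no predecessor

print(previous_tokens('的'))
```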
Simply put, tensors are a generalization of vectors and matrices. In PyTorch, a tensor is a multi-dimensional matrix containing elements of a single data type.


Resizing the tensor
There are a few options to use: `.reshape()`, `.resize_()` and `.view()`.
- `w.reshape(a, b)` returns a tensor with the same data as `w` and size `(a, b)`. Sometimes this is a view of the original data, and sometimes a clone, i.e., the data is copied to another part of memory.
- `w.resize_(a, b)` returns the same tensor with a different shape. However, if the new shape results in fewer elements than the original tensor, some elements will be removed from the tensor (but not from memory). If the new shape results in more elements than the original tensor, the new elements will be uninitialized in memory.
- `w.view(a, b)` returns a new tensor with the same data as `w` and size `(a, b)`.
The above three methods are introduced in Intro to Deep Learning with PyTorch. The official PyTorch tutorial only introduces `w.view()`, so generally I’d use `w.view()` for resizing.
Convenience of -1
When resizing the tensor, `-1` is helpful for the one dimension whose size we don’t specify: PyTorch infers it from the total number of elements and the other dimensions.
E.g.,


The size of y is `torch.Size([2, 8])`, just as we want.
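A short sketch that produces this result, assuming a 4x4 input (the shape of `x` is my choice; any tensor with 16 elements would do):

```python
import torch

x = torch.randn(4, 4)   # 16 elements in total
y = x.view(-1, 8)       # -1 is inferred as 16 / 8 = 2
print(y.size())         # torch.Size([2, 8])

# view() does not copy: both tensors point at the same storage.
print(y.data_ptr() == x.data_ptr())
```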
In-place operation
An in-place operation is an operation that directly changes the content of a given tensor without making a copy. In-place operations in PyTorch are always postfixed with a `_`, like `.add_()`. The `.resize_()` mentioned above is also an in-place operation.
NumPy to Torch and back
PyTorch has a great feature for converting between NumPy arrays and Torch tensors. To create a tensor from a NumPy array, use `torch.from_numpy()`. To convert a tensor to a NumPy array, use the `.numpy()` method.


The memory is shared between the NumPy array and Torch tensor.
sum() method
Setting the keyword `dim=0` takes the sum across the rows, i.e., it computes the sum of each column. Similarly, `dim=1` takes the sum across the columns (it computes the sum of each row).
The general process with PyTorch:
`loss.backward()` to calculate the gradients
Constructing Neural Networks


It is mandatory to inherit from `nn.Module` when creating a class for our network. The name of the class itself can be anything.
PyTorch networks created with `nn.Module` must have a `forward` method defined. It takes in a tensor `x` and passes it through the operations you defined in the `__init__` method. The `backward` function (where gradients are computed) is automatically defined for you using `autograd`.
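As an illustration of these two requirements, here is a minimal network. The class name `Network`, the layer sizes, and the choice of ReLU are placeholders of mine rather than anything prescribed.

```python
import torch
from torch import nn
import torch.nn.functional as F

class Network(nn.Module):
    def __init__(self, n_in=784, n_hidden=128, n_out=10):
        super().__init__()                     # required so PyTorch can register the layers
        self.hidden = nn.Linear(n_in, n_hidden)
        self.output = nn.Linear(n_hidden, n_out)

    def forward(self, x):
        x = F.relu(self.hidden(x))
        return self.output(x)                  # raw scores, one per class

model = Network()
scores = model(torch.randn(5, 784))            # calling the model runs forward()
print(scores.shape)                            # torch.Size([5, 10])
```

Note that the model is called directly (`model(x)`) rather than through `model.forward(x)`.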
Another way is mentioned in the course: `nn.Sequential`. (See the documentation for details.)


Loss
PyTorch provides losses such as the cross-entropy loss (`nn.CrossEntropyLoss`). You’ll usually see the loss assigned to `criterion`. This criterion combines `nn.LogSoftmax()` and `nn.NLLLoss()` in one single class.
The input is expected to contain raw, unnormalized scores for each class.


Autograd
The `autograd` package provides automatic differentiation for all operations on tensors. If the attribute `requires_grad` of a `torch.Tensor` is set to `True`, autograd starts to track all operations on it. When you have finished your computation, you can call `.backward()` and have all the gradients computed automatically. The gradient for this tensor will be accumulated into its `.grad` attribute.
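A tiny example of this tracking, using a function whose gradient is easy to check by hand (the input values are arbitrary):

```python
import torch

x = torch.tensor([2.0, 3.0], requires_grad=True)
y = (x ** 2).sum()   # y = x_0^2 + x_1^2
y.backward()         # computes dy/dx = 2 * x
print(x.grad)        # tensor([4., 6.])
```

Because gradients accumulate in `.grad`, calling `backward()` again on a fresh computation would add to these values rather than replace them.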
For more information, refer to autograd tutorial and autograd doc.
The most common method to reduce overfitting is dropout, where we randomly drop input units. Adding dropout in PyTorch is straightforward using the `nn.Dropout` module.


Turning off gradient computation when testing or validating can help accelerate the process. Generally we can do the testing in the following mode:


During training we want to use dropout to prevent overfitting, but during inference we want to use the entire network. So, we need to turn off dropout during validation, testing, and whenever we’re using the network to make predictions. To do this, you use `model.eval()`. This sets the model to evaluation mode where the dropout probability is 0. You can turn dropout back on by setting the model to train mode with `model.train()`.
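Putting the two switches together, a validation pass might be sketched like this. The small `nn.Sequential` model and the random input are stand-ins of mine, not a real trained network.

```python
import torch
from torch import nn

# Hypothetical small model with a dropout layer.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(8, 2))
x = torch.randn(3, 4)

model.eval()              # evaluation mode: dropout becomes a no-op
with torch.no_grad():     # no operations are tracked for gradients
    out1 = model(x)
    out2 = model(x)

print(torch.equal(out1, out2))   # repeated passes agree in eval mode
print(out1.requires_grad)        # False: nothing was tracked

model.train()             # switch dropout back on before training resumes
```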
Machine translation is the technology used to translate between human languages.
Source language: the language input to the machine translation system
Target language: the output language
Machine translation can be described as the task of converting a sequence of words in the source into a sequence of words in the target.
Sequence-to-sequence models refers to the broader class of models that map one sequence to another, e.g., machine translation, tagging, dialog, speech recognition, etc.
We define our task of machine translation as translating a source sentence $F = f_1, \dots, f_J$ into a target sentence $E = e_1, \dots, e_I$. Here, the subscripts at the end of the equations may be a bit misleading (at least for me at first glance). They mean that the identity of the first word in the sentence is $e_1$, the identity of the second word in the sentence is $e_2$, up until the last word in the sentence being $e_I$.
Then, a translation system can be defined as a function $\hat{E} = \text{mt}(F)$, which returns a translation hypothesis $\hat{E}$ given a source sentence $F$ as input.
Statistical machine translation systems are systems that perform translation by creating a probabilistic model for the probability of $E$ given $F$, $P(E \mid F; \theta)$, and finding the target sentence that maximizes the probability
$$\hat{E} = \underset{E}{\operatorname{argmax}}\; P(E \mid F; \theta)$$
where $\theta$ are the parameters of the model specifying the probability distribution.
The parameters $\theta$ are learned from data consisting of aligned sentences in the source and target languages, which are called parallel corpora.
Instead of calculating the original joint probability $P(E) = P(|E| = T, e_1^T)$ directly, it’s more manageable to calculate it by multiplying together conditional probabilities for each of its elements:
$$P(E) = \prod_{t=1}^{T+1} P(e_t \mid e_1^{t-1})$$
where $e_{T+1}=\langle/s\rangle$. It’s an implicit sentence end symbol, which indicates that we have terminated the sentence. By examining the position of the $\langle/s\rangle$ symbol, we can determine whether $|E| = T$.
Then how do we calculate the probability of the next word given the previous words, $P(e_t \mid e_1^{t-1})$? The first way is simple: prepare a set of training data from which we can count word strings, count up the number of times we have seen a particular string of words, and divide it by the number of times we have seen the context.
$$P_{\text{ML}}(e_t \mid e_1^{t-1}) = \frac{c_{\text{prefix}}(e_1^t)}{c_{\text{prefix}}(e_1^{t-1})}$$
Here $c_{\text{prefix}}(\cdot)$ is the count of the number of times this particular word string appeared at the beginning of a sentence in the training data.
However, this language model will assign a probability of zero to every sentence that it hasn’t seen before in the training corpus, which is not very useful.
To solve the problem, we set a fixed window of previous words upon which we will base our probability calculations instead of calculating probabilities from the beginning of the sentence. If we limit our context to the $n - 1$ previous words, this would amount to:
$$P(e_t \mid e_1^{t-1}) \approx P_{\text{ML}}(e_t \mid e_{t-n+1}^{t-1})$$
Models that make this assumption are called **n-gram models**. Specifically, models where $n = 1, 2, 3$ are called unigram models, bigram models, and trigram models, respectively.
In the simplest form, the parameters of n-gram models consist of the probabilities of the next word given the $n - 1$ previous words, which can be calculated using maximum likelihood estimation as follows:
$$\theta_{e_{t-n+1}^{t}} = P(e_t \mid e_{t-n+1}^{t-1}) = \frac{c(e_{t-n+1}^{t})}{c(e_{t-n+1}^{t-1})}$$
However, what if we encounter a two-word string that has never appeared in the training corpus? n-gram models fix this problem by smoothing probabilities, combining the maximum likelihood estimates for various values of n. In the simple case of smoothing unigram and bigram probabilities, we can think of a model that combines the probabilities as follows:
$$P(e_t \mid e_{t-1}) = (1 - \alpha) P_{\text{ML}}(e_t \mid e_{t-1}) + \alpha P_{\text{ML}}(e_t)$$
where $\alpha$ is a variable specifying how much probability mass we hold out for the unigram distribution. As long as $\alpha>0$, all the words in our vocabulary will be assigned some probability. This method is called interpolation, and is one of the standard ways to make probabilistic models more robust to low-frequency phenomena.
Some more sophisticated methods for smoothing: context-dependent smoothing coefficients, backoff, modified distributions, modified Kneser-Ney smoothing.
Likelihood
The most straightforward way of defining accuracy is the likelihood of the model with respect to the development or test data. The likelihood of the parameters with respect to this data is equal to the probability that the model assigns to the data.
Log likelihood
The log likelihood is used for a couple reasons. The first is because the probability of any particular sentence according to the language model can be a very small number, and the product of these small numbers can become a very small number that will cause numerical precision problems on standard computing hardware. The second is because sometimes
Perplexity
An intuitive explanation of the perplexity is “how confused is the model about its decision?” More accurately, it expresses the value “if we randomly picked words from the probability distribution calculated by the language model at each time step, on average how many words would it have to pick to get the correct one?”
Further reading includes large-scale language modeling, language model adaptation, longer-distance count-based language models, and syntax-based language models.
The course notes by Michael Collins from Columbia University are another good resource.
Log-linear language models revolve around the concept of features. We define a feature function $\phi(e_{t-n+1}^{t-1})$ that takes a context as input and outputs a real-valued feature vector $\boldsymbol{x} \in \mathbb{R}^N$ that describes the context using N different features.
The feature vector can be a one-hot vector. The function $\text{onehot}(i)$ returns a vector where only the $i$th element is one and the rest are zero (assume the length of the vector is the appropriate length given the context).
Calculating scores: We calculate a score vector $\boldsymbol{s} \in \mathbb{R}^{|V|}$ that corresponds to the likelihood of each word: words with higher scores in the vector will also have higher probabilities. The model parameters $\theta$ specifically come in two varieties: a bias vector $\boldsymbol{b} \in \mathbb{R}^{|V|}$, which tells us how likely each word in the vocabulary is overall, and a weight matrix $W \in \mathbb{R}^{|V| \times N}$, which describes the relationship between feature values and scores.
$$\boldsymbol{s} = W\boldsymbol{x} + \boldsymbol{b}$$
To make computation more efficient (because many elements are zero in one-hot vectors or other sparse vectors), we can add together the columns of the weight matrix for all active (non-zero) features:
$$\boldsymbol{s} = \sum_{\{j : x_j \neq 0\}} W_{\cdot, j}\, x_j + \boldsymbol{b}$$
where $W_{\cdot, j}$ is the $j$th column of $W$.
Calculating probabilities:
$$p_j = \frac{\exp(s_j)}{\sum_{\tilde{j}} \exp(s_{\tilde{j}})}$$
By taking the exponent and dividing by the sum of the values over the entire vocabulary, these scores can be turned into probabilities that are between 0 and 1 and sum to 1. This function is called the softmax function, and is often expressed in vector form as follows:
$$\boldsymbol{p} = \text{softmax}(\boldsymbol{s})$$
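A plain-Python sketch of this normalization is below. Subtracting the maximum score before exponentiating is a common numerical safeguard that I have added; it does not change the result, since the shift cancels in the ratio. The scores are arbitrary example values.

```python
import math

def softmax(scores):
    """Turn a list of real-valued scores into probabilities."""
    m = max(scores)                            # shift by the max to avoid overflow
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

p = softmax([2.0, 1.0, 0.1])
print(p)        # higher scores get higher probabilities
print(sum(p))   # sums to 1 (up to floating-point rounding)
```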
Sentence-level loss function
Per-word level
(In the original tutorial, the equation is as below. But I guess it should be $\log P(e_t \mid e_{t-n+1}^{t-1})$?)
Methods for SGD
Adjusting the learning rate, early stopping, shuffling training order, SGD with momentum, AdaGrad, Adam. (Common methods in machine learning)
Derivative
Stepping through the full loss function in one pass:
Using the chain rule:
One reason why log-linear models are nice is that they allow us to flexibly design features that we think might be useful for predicting the next word. Features include context word features, context class features, context suffix features, bag-of-words features, etc.
Further reading includes whole-sentence language models and discriminative language models.
Multilayer perceptrons (MLPs) consist of one or more hidden layers, each made up of an affine transform (a fancy name for a multiplication followed by an addition) followed by a nonlinear function (step function, tanh, ReLU, etc.), culminating in an output layer that calculates some variety of output.
Neural networks can be thought of as a chain of functions that takes some input and calculates some desired output. The power of neural networks lies in the fact that chaining together a variety of simpler functions makes it possible to represent more complicated functions in an easily trainable, parameter-efficient way.
Calculating loss function:
The derivatives:
We could go through all of the derivations above by hand and precisely calculate the gradients of all parameters in the model. But even for a simple model like the one above, it is quite a lot of work and error-prone. Fortunately, when we actually implement neural networks on a computer, there is a very useful tool that saves us a large portion of this pain: automatic differentiation (autodiff). To understand automatic differentiation, it is useful to think of our computation in a data structure called a computation graph. As shown in the following figure (figure 10 from the original paper), each node represents either an input to the network or the result of one computational operation, such as a multiplication, addition, tanh, or squared error. The first graph in the figure calculates the function of interest itself and would be used when we want to make predictions using our model, and the second graph calculates the loss function and would be used in training.
Automatic differentiation is a two-step dynamic programming algorithm that operates over the second graph and performs:
A trigram neural network model with a single layer is structured as shown in the figure above.
In the first line, we obtain a vector $\boldsymbol{m}$ representing the context $e_{i-n+1}^{i-1}$. Here, $M$ is a matrix with $|V|$ columns and $L_m$ rows, where each column corresponds to an $L_m$-length vector representing a single word in the vocabulary. This vector is called a word embedding or a word representation, which is a vector of real numbers corresponding to a particular word in the vocabulary.
The vector $\boldsymbol{m}$ then results from the concatenation of the word vectors for all of the words in the context, so $|\boldsymbol{m}| = L_m \times (n - 1)$. Once we have this $\boldsymbol{m}$, we run the vectors through a hidden layer to obtain vector $\boldsymbol{h}$. By doing so, the model can learn combination features that reflect information regarding multiple words in the context.
Next, we calculate the score vector for each word: $\boldsymbol{s} \in \mathbb{R}^{|V|}$. This is done by performing an affine transform of the hidden vector $\boldsymbol{h}$ with a weight matrix $W_{hs} \in \mathbb{R}^{|V| \times |\boldsymbol{h}|}$ and adding a bias vector $\boldsymbol{b}_s \in \mathbb{R}^{|V|}$. Finally, we get a probability estimate $\boldsymbol{p}$ by running the calculated scores $\boldsymbol{s}$ through a softmax function, like we did in the log-linear language models. For training, if we know $e_t$ we can also calculate the loss function $\ell = -\log(p_{e_t})$.
The advantages of the neural network formulation: better generalization of contexts, more generalizable combination of words into contexts, and the ability to skip previous words.
Further reading includes softmax approximations, other softmax structures, and other models to learn word representations.
Language models based on recurrent neural networks (RNNs) have the ability to capture long-distance dependencies in language.
Some examples of long-distance dependencies in language: reflexive forms (himself, herself) should match the gender of their antecedent, the conjugation of a verb depends on the subject of the sentence, selectional preferences, topic and register.
Recurrent neural networks are a variety of neural network that makes it possible to model these long-distance dependencies. The idea is simply to add a connection that references the previous hidden state $\boldsymbol{h}_{t-1}$ when calculating hidden state $\boldsymbol{h}_t$:
$$\boldsymbol{h}_t = \begin{cases} \tanh(W_{xh}\boldsymbol{x}_t + W_{hh}\boldsymbol{h}_{t-1} + \boldsymbol{b}_h) & t \geq 1 \\ \boldsymbol{0} & \text{otherwise} \end{cases}$$
For time steps $t \geq 1$, the only difference from the hidden layer in a standard neural network is the addition of the connection $W_{hh}\boldsymbol{h}_{t-1}$ from the hidden state at time step $t-1$ to that at time step $t$.
RNNs make it possible to model long-distance dependencies because they have the ability to pass information between time steps. For example, if some of the nodes in $\boldsymbol{h}_{t-1}$ encode the information that “the subject of the sentence is male”, it is possible to pass on this information to $\boldsymbol{h}_t$, which can in turn pass it on to $\boldsymbol{h}_{t+1}$ and on to the end of the sentence. This ability to pass information across an arbitrary number of consecutive time steps is the strength of recurrent neural networks, and allows them to handle long-distance dependencies.
Feedforward language model:
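The vanishing gradient problem and the exploding gradient problem are the problems that simple RNNs face. During back propagation through time, the gradient is multiplied by the derivative of the recurrent function at every time step. If that derivative is smaller than one, the gradient gets smaller and smaller and eventually vanishes; if it is larger than one, the gradient is amplified and explodes.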
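The effect of repeated multiplication is easy to see numerically. The sketch below is a deliberate simplification: it treats the per-step derivative as a single constant scalar `factor`, whereas in a real RNN it is a matrix that changes at every step; the helper name and the values 0.9 and 1.1 are arbitrary.

```python
def gradient_after(steps, factor):
    """Magnitude of a unit gradient after `steps` multiplications by `factor`."""
    grad = 1.0
    for _ in range(steps):
        grad *= factor   # one step of back propagation through time
    return grad

print(gradient_after(50, 0.9))   # shrinks towards zero
print(gradient_after(50, 1.1))   # grows rapidly
print(gradient_after(50, 1.0))   # a factor of exactly one preserves the gradient
```

The last case is the behaviour the LSTM memory cell is designed to achieve.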
One method to solve this problem, in the case of diminishing gradients, is the use of a neural network architecture, the long short-term memory (LSTM), that is specifically designed to ensure that the derivative of the recurrent function is exactly one. The most fundamental idea behind the LSTM is that in addition to the standard hidden state $\boldsymbol{h}$ used by most neural networks, it also has a memory cell $\boldsymbol{c}$, for which the gradient is exactly one. Because this gradient is exactly one, information stored in the memory cell does not suffer from vanishing gradients, and thus LSTMs can capture long-distance dependencies more effectively than standard recurrent neural networks.
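$$\begin{aligned} \boldsymbol{u}_t &= \tanh(W_{xu}\boldsymbol{x}_t + W_{hu}\boldsymbol{h}_{t-1} + \boldsymbol{b}_u) \\ \boldsymbol{i}_t &= \sigma(W_{xi}\boldsymbol{x}_t + W_{hi}\boldsymbol{h}_{t-1} + \boldsymbol{b}_i) \\ \boldsymbol{o}_t &= \sigma(W_{xo}\boldsymbol{x}_t + W_{ho}\boldsymbol{h}_{t-1} + \boldsymbol{b}_o) \\ \boldsymbol{c}_t &= \boldsymbol{i}_t \odot \boldsymbol{u}_t + \boldsymbol{c}_{t-1} \\ \boldsymbol{h}_t &= \boldsymbol{o}_t \odot \tanh(\boldsymbol{c}_t) \end{aligned}$$
The first equation is the update $\boldsymbol{u}_t$, which is basically the same as the RNN update. It takes in the input and hidden state, performs an affine transform and runs it through the tanh nonlinearity.
The following two equations are the input gate and output gate of the LSTM respectively. The function of “gates”, as indicated by their name, is to either allow information to pass through or block it from passing. Both of these gates perform an affine transform followed by the sigmoid function. The output of the sigmoid is then used to perform a componentwise multiplication, $\boldsymbol{z} = \boldsymbol{x} \odot \boldsymbol{y}$, which means $z_i = x_i \cdot y_i$, with the output of another function.
The next is the most important equation in the LSTM. This equation sets $\boldsymbol{c}_t$ to be equal to the update $\boldsymbol{u}_t$ modulated by the input gate $\boldsymbol{i}_t$ plus the cell value for the previous time step $\boldsymbol{c}_{t-1}$. Since we are directly adding $\boldsymbol{c}_{t-1}$ to $\boldsymbol{c}_t$, the gradient would be one.
The final equation calculates the next hidden state of the LSTM. This is calculated by using a tanh function to scale the cell value between -1 and 1, then modulating the output using the output gate value $\boldsymbol{o}_t$. This will be the value actually used in any downstream calculation.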
One modification to the standard LSTM that is used widely (in fact so widely that most people who refer to “LSTM” are now referring to this variant) is the addition of a forget gate. The equations that change (the update, input gate, output gate and hidden state are those of the standard LSTM):
$$\begin{aligned} \boldsymbol{f}_t &= \sigma(W_{xf}\boldsymbol{x}_t + W_{hf}\boldsymbol{h}_{t-1} + \boldsymbol{b}_f) \\ \boldsymbol{c}_t &= \boldsymbol{i}_t \odot \boldsymbol{u}_t + \boldsymbol{f}_t \odot \boldsymbol{c}_{t-1} \end{aligned}$$
The difference lies in the forget gate, which modulates the passing of the previous cell $\boldsymbol{c}_{t-1}$ to the current cell $\boldsymbol{c}_t$. This forget gate is useful in that it allows the cell to easily clear its memory when justified. Forget gates have the advantage of allowing this sort of fine-grained information flow control, but they also come with the risk that if $\boldsymbol{f}_t$ is set to zero all the time, the model will forget everything and lose its ability to handle long-distance dependencies. Thus, at the beginning of neural network training, it is common to initialize the bias of the forget gate to be a somewhat large value (e.g. 1), which will make the neural net start training without using the forget gate, and only gradually start forgetting content after the net has been trained to some extent.
The gated recurrent unit (GRU) is a simpler RNN variant than the LSTM:
The most characteristic element of the GRU is the last equation, which interpolates between a candidate for the updated hidden state $\tilde{\boldsymbol{h}}_t$ and the previous state $\boldsymbol{h}_{t-1}$ (in the original tutorial, here is noted as , which might be a mistake). This interpolation is modulated by an update gate $\boldsymbol{z}_t$: if the update gate is close to one, the GRU will use the new candidate hidden value, and if the update gate is close to zero, it will use the previous value. The candidate hidden state is similar to a standard RNN update but includes an additional modulation of the hidden state input by a reset gate $\boldsymbol{r}_t$. Compared to the LSTM, the GRU has slightly fewer parameters (it performs one less parameterized affine transform) and does not have a separate concept of a “cell”. Thus, GRUs have been used by some to conserve memory or computation time.
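The interpolation step on its own can be sketched in a few lines. The function name `gru_interpolate` is hypothetical, and the gate and candidate values are passed in directly rather than computed from weights.

```python
def gru_interpolate(z, h_candidate, h_prev):
    """Mix candidate and previous state: z near 1 keeps the candidate,
    z near 0 keeps the previous hidden state."""
    return [
        (1.0 - z_k) * hp_k + z_k * hc_k
        for z_k, hc_k, hp_k in zip(z, h_candidate, h_prev)
    ]

# Update gate fully open, fully closed, and half open (toy values)
h_new = gru_interpolate([1.0, 0.0, 0.5], [2.0, 2.0, 2.0], [0.0, 0.0, 0.0])
```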
[Stacked RNNs, residual networks, and online, batch and minibatch training are temporarily left out (time limited). I will go over the following chapters first and then return here.]
The basic idea of the encoder-decoder model is relatively simple: we have an RNN language model, but before starting calculation of the probabilities of E, we first calculate the initial state of the language model using another RNN over the source sentence F. The name “encoder-decoder” comes from the idea that the first neural network running over F “encodes” its information as a vector of real-valued numbers (the hidden state), then the second neural network used to predict E “decodes” this information into the target sentence.
In the first two lines, we look up the embedding and calculate the encoder hidden state for the $t$-th word in the source sequence F. We start with an empty vector $\boldsymbol{h}_0^{(f)} = \boldsymbol{0}$, and by $\boldsymbol{h}_{|F|}^{(f)}$, the encoder has seen all the words in the source sentence. Thus, this hidden state should theoretically be able to encode all of the information in the source sentence.
In the decoder phase, we predict the probability of word $e_t$ at each time step. First, we similarly look up the embedding, but this time use the previous word $e_{t-1}$, as we must condition the probability of $e_t$ on the previous word, not on itself. Then, we run the decoder to calculate $\boldsymbol{h}_t^{(e)}$, whose initial state is the final encoder state, $\boldsymbol{h}_0^{(e)} = \boldsymbol{h}_{|F|}^{(f)}$. Finally, we calculate the probability by using a softmax on the hidden state $\boldsymbol{h}_t^{(e)}$.
In general, given the probability model $P(E \mid F)$, we can generate output according to several criteria:
Random Sampling: Randomly select an output $E$ from the probability distribution $P(E \mid F)$. This is usually denoted $\hat{E} \sim P(E \mid F)$. Useful for a dialog system.
1-best Search: Find the $E$ that maximizes $P(E \mid F)$, denoted $\hat{E} = \operatorname{argmax}_E P(E \mid F)$. Useful in machine translation.
n-best Search: Find the n outputs with the highest probabilities according to $P(E \mid F)$.
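The three criteria can be illustrated on a toy distribution over complete outputs. This is only a sketch: the distribution values and the function names are made up, and a real decoder cannot enumerate every output like this, so it samples or searches word by word instead.

```python
import heapq
import random

# Hypothetical distribution P(E | F) over four whole outputs
dist = {"hello there": 0.5, "hi": 0.3, "good day": 0.15, "greetings": 0.05}

def random_sample(dist, rng=random):
    """Random sampling: draw one output in proportion to its probability."""
    outputs = list(dist)
    weights = [dist[e] for e in outputs]
    return rng.choices(outputs, weights=weights, k=1)[0]

def one_best(dist):
    """1-best search: the single most probable output (argmax)."""
    return max(dist, key=dist.get)

def n_best(dist, n):
    """n-best search: the n most probable outputs, best first."""
    return heapq.nlargest(n, dist, key=dist.get)

best = one_best(dist)
top_two = n_best(dist, 2)
sampled = random_sample(dist, random.Random(0))
```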
Reverse and bidirectional encoders, convolutional neural networks, and tree-structured networks will be covered later.
Ensembling is widely used in encoder-decoders: the combination of the predictions of multiple independently trained models to improve the overall prediction results.
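One simple way to combine predictions is to average the next-word distributions of the individual models with equal weights. That choice is an assumption for this sketch (other combination schemes exist), and `ensemble_average` is a hypothetical helper.

```python
def ensemble_average(distributions):
    """Average several probability distributions over the same vocabulary."""
    vocab = distributions[0].keys()
    k = len(distributions)
    return {w: sum(d[w] for d in distributions) / k for w in vocab}

# Two toy models that disagree about the next word
model_a = {"cat": 0.6, "dog": 0.4}
model_b = {"cat": 0.2, "dog": 0.8}
combined = ensemble_average([model_a, model_b])
```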
The basic idea of attention is that we keep around vectors for every word in the input sentence, and reference these vectors at each decoding step. Because the number of vectors available to reference is equivalent to the number of words in the input sentence, long sentences will have more vectors than short sentences.
First, we create a set of vectors that we will be using as this variable-length representation. To do so, we calculate a vector for every word in the source sentence by running an RNN in both directions:
Then, we concatenate the two vectors $\overrightarrow{\boldsymbol{h}}_j^{(f)}$ and $\overleftarrow{\boldsymbol{h}}_j^{(f)}$ into a bidirectional representation $\boldsymbol{h}_j^{(f)}$
We can further concatenate these vectors into a matrix:
The key insight of attention is that we calculate a vector $\boldsymbol{\alpha}_t$ that can be used to combine together the columns of H into a vector $\boldsymbol{c}_t$
$\boldsymbol{\alpha}_t$ is called the attention vector, and is generally assumed to have elements that are between zero and one and add to one.
The basic idea behind the attention vector is that it is telling us how much we are “focusing” on a particular source word at a particular time step. The larger the value in $\boldsymbol{\alpha}_t$, the more impact a word will have when predicting the next word in the output sentence.
Then how do we get $\boldsymbol{\alpha}_t$? The answer lies in the decoder RNN, which we use to track our state while we are generating output. The decoder’s hidden state $\boldsymbol{h}_t^{(e)}$ is a fixed-length continuous vector representing the previous target words $e_1^{t-1}$, initialized as . This is used to calculate a context vector $\boldsymbol{c}_t$ that summarizes the source attentional context used in choosing target word $e_t$, and initialized as .
First, we update the hidden state to $\boldsymbol{h}_t^{(e)}$ based on the word representation and context vector from the previous target time step.
Based on this $\boldsymbol{h}_t^{(e)}$, we calculate an attention score $\boldsymbol{a}_t$, with each element equal to
We then normalize this into the actual attention vector itself by taking a softmax over the scores:
Then we can use this attention vector to weight the encoded representation to create a context vector $\boldsymbol{c}_t$ for the current time step.
We now have a context vector and hidden state for time step t, which we can pass down to downstream tasks.
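The score, softmax and weighting steps can be sketched as follows. The score function here is a plain dot product, which is my assumption for illustration (the notes do not fix a particular score function), and `attend` is a hypothetical name.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(H, h_dec):
    """H: list of source vectors (the columns of the matrix);
    h_dec: current decoder hidden state of the same dimension."""
    # attention scores: one per source word (dot product, an assumption)
    scores = [sum(a * b for a, b in zip(h_j, h_dec)) for h_j in H]
    # attention vector: non-negative and sums to one
    alpha = softmax(scores)
    # context vector: attention-weighted sum of the source vectors
    dim = len(H[0])
    context = [sum(alpha[j] * H[j][k] for j in range(len(H))) for k in range(dim)]
    return alpha, context

H = [[1.0, 0.0], [0.0, 1.0]]               # two toy source vectors
alpha, context = attend(H, [0.0, 0.0])     # uninformative decoder state
focused, _ = attend(H, [10.0, 0.0])        # state aligned with the first word
```

With an all-zero decoder state every source word gets the same weight; a state aligned with the first source vector moves almost all of the weight onto it.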
If V is the size of the target vocabulary, how many possible sentences $E$ are there for a sentence of length T? (on page 4)
There are $V^T$.
How many parameters does an ngram model with a particular n have? (on page 6)
What is this probability? (on page 7)
Google Research just released their TensorFlow code and pretrained models for BERT on Halloween and received nearly 3k stars within 24 hours.
I just glanced through the paper and I will go over it and write a detailed note. (Maybe after the graduation application).
Texts are represented in Python using lists. We can use indexing, slicing, and the `len()` function.
Some word comparison operators: `s.startswith(t)`, `s.endswith(t)`, `t in s`, `s.islower()`, `s.isupper()`, `s.isalpha()`, `s.isalnum()`, `s.isdigit()`, `s.istitle()`.
text1.concordance("monstrous")
A concordance view shows us every occurrence of a given word, together with some context.
A concordance permits us to see words in context.
text1.similar("monstrous")
We can find out which other words appear in a similar range of contexts by appending the term `similar` to the name of the text, then inserting the relevant word in parentheses (a bit like finding synonyms).
text2.common_contexts(["monstrous", "very"])
The term `common_contexts` allows us to examine just the contexts that are shared by two or more words.
text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
We can determine the location of a word in the text: how many words from the beginning it appears. This positional information can be displayed using a dispersion plot.
Fine-grained Selection of Words
The mathematical set notation and corresponding Python expression.
[w for w in V if p(w)]
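As a small illustration of this pattern, take `V` to be a toy vocabulary and `p` the property "longer than 10 characters". Both are made-up stand-ins, not data from the book.

```python
# Toy vocabulary (hypothetical); in the book V is the vocabulary of a text
V = {"the", "constantinople", "whale", "circumnavigation", "sea"}

def p(w):
    """The selection property: here, words longer than 10 characters."""
    return len(w) > 10

# {w | w in V and p(w)} written as a comprehension, sorted for stable output
long_words = sorted(w for w in V if p(w))
```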


Collocations and Bigrams
The bigram is written as `('than', 'said')` in Python.
A collocation is a sequence of words that occur together unusually often. Collocations are essentially just frequent bigrams, except that we want to pay more attention to the cases that involve rare words. In particular, we want to find bigrams that occur more often than we would expect based on the frequency of the individual words.
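One rough way to capture "more often than expected" is to compare each bigram count with what the individual word frequencies would predict. The ratio below is a simplification I chose for illustration, not the scoring NLTK uses, and `collocation_scores` is a hypothetical helper.

```python
from collections import Counter

def collocation_scores(words):
    """Score each bigram by count(w1, w2) * N / (count(w1) * count(w2)).
    Values well above 1 mean the pair co-occurs more than chance predicts."""
    unigrams = Counter(words)
    bigrams = Counter(zip(words, words[1:]))
    n = len(words)
    return {
        bg: count * n / (unigrams[bg[0]] * unigrams[bg[1]])
        for bg, count in bigrams.items()
    }

words = ["new", "york", "the", "a", "the", "b", "new", "york", "the", "c"]
scores = collocation_scores(words)
```

Ratios like this overrate pairs of very rare words, which is one reason practical collocation finders also apply frequency filters.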


A frequency distribution tells us the frequency of each vocabulary item in the text.
`FreqDist` can be treated as a dictionary in Python, where the word (or word length, etc.) is the key, and the number of occurrences is the corresponding value.
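NLTK's `FreqDist` is built on `collections.Counter`, so the dictionary-like behaviour can be tried out with the standard library alone. The sentence below is a made-up example.

```python
from collections import Counter

words = "the whale and the sea and the sky".split()

# word -> number of occurrences
freq = Counter(words)
the_count = freq["the"]
top_two = freq.most_common(2)

# the key does not have to be the word itself, e.g. word length -> count
length_freq = Counter(len(w) for w in words)
```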


functions defined for NLTK’s Frequency Distributions can be found here
A text corpus is a large, structured collection of texts. Some text corpora are categorized, e.g., by genre or topic; sometimes the categories of a corpus overlap each other.
NLTK has many corpora in the package `nltk.corpus`. To perform the functions introduced before, we have to employ the following pair of statements:


A short program to display information about each text, by looping over all the values of `fileid` corresponding to the `gutenberg` file identifiers and then computing statistics for each text.


Brown Corpus
The Brown Corpus was the first million-word electronic corpus of English, created in 1961 at Brown University. This corpus contains text from 500 sources, and the sources have been categorized by genre. (See here for detail.)
We can access the corpus as a list of words, or a list of sentences. We can optionally specify particular categories or files to read.


Use Brown Corpus to study stylistics: systematic differences between genres.


A conditional frequency distribution is a collection of frequency distributions, each one for a different “condition”. The condition will often be the category of the text.
A frequency distribution counts observable events, such as the appearance of words in a text. A conditional frequency distribution needs to pair each event with a condition.
NLTK’s Conditional Frequency Distributions: commonly used methods and idioms for defining, accessing, and visualizing a conditional frequency distribution of counters.

| Example | Description |
| --- | --- |
| `cfdist = ConditionalFreqDist(pairs)` | create a conditional frequency distribution from a list of pairs |
| `cfdist.conditions()` | the conditions |
| `cfdist[condition]` | the frequency distribution for this condition |
| `cfdist[condition][sample]` | frequency for the given sample for this condition |
| `cfdist.tabulate()` | tabulate the conditional frequency distribution |
| `cfdist.tabulate(samples, conditions)` | tabulation limited to the specified samples and conditions |
| `cfdist.plot()` | graphical plot of the conditional frequency distribution |
| `cfdist.plot(samples, conditions)` | graphical plot limited to the specified samples and conditions |
| `cfdist1 < cfdist2` | test if samples in cfdist1 occur less frequently than in cfdist2 |
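The idea of pairing each event with a condition can be sketched with the standard library: a mapping from condition to a counter of events. This is a stand-in for illustration, not NLTK's `ConditionalFreqDist` itself, and the (genre, word) pairs are invented.

```python
from collections import Counter, defaultdict

def conditional_freq_dist(pairs):
    """Build {condition: Counter(events)} from (condition, event) pairs."""
    cfd = defaultdict(Counter)
    for condition, event in pairs:
        cfd[condition][event] += 1
    return cfd

# (genre, word) pairs: the genre is the condition, the word is the event
pairs = [
    ("news", "the"), ("news", "could"),
    ("romance", "could"), ("romance", "could"), ("romance", "the"),
]
cfd = conditional_freq_dist(pairs)
conditions = sorted(cfd)               # like cfdist.conditions()
romance_could = cfd["romance"]["could"]  # like cfdist[condition][sample]
```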
Reuters Corpus
The Reuters Corpus contains 10788 news documents totaling 1.3 million words. The documents have been classified into 90 topics and grouped into training and test sets.
Unlike the Brown Corpus, categories in the Reuters corpus overlap with each other, simply because a news story often covers multiple topics. We can ask for the topics covered by one or more documents, or for the documents included in one or more categories.


Listed here are some of the corpora and corpus samples distributed with NLTK. For more information, consult the NLTK HOWTOs.
Basic Corpus Functionality defined in NLTK:
| Example | Description |
| --- | --- |
| `fileids()` | the files of the corpus |
| `fileids([categories])` | the files of the corpus corresponding to these categories |
| `categories()` | the categories of the corpus |
| `categories([fileids])` | the categories of the corpus corresponding to these files |
| `raw()` | the raw content of the corpus |
| `raw(fileids=[f1,f2,f3])` | the raw content of the specified files |
| `raw(categories=[c1,c2])` | the raw content of the specified categories |
| `words()` | the words of the whole corpus |
| `words(fileids=[f1,f2,f3])` | the words of the specified fileids |
| `words(categories=[c1,c2])` | the words of the specified categories |
| `sents()` | the sentences of the whole corpus |
| `sents(fileids=[f1,f2,f3])` | the sentences of the specified fileids |
| `sents(categories=[c1,c2])` | the sentences of the specified categories |
| `abspath(fileid)` | the location of the given file on disk |
| `encoding(fileid)` | the encoding of the file (if known) |
| `open(fileid)` | open a stream for reading the given corpus file |
| `root` | the path to the root of the locally installed corpus |
| `readme()` | the contents of the README file of the corpus |
A lexicon, or lexical resource, is a collection of words and/or phrases along with associated information such as part of speech and sense definition. Lexical resources are secondary to texts, and are usually created and enriched with the help of texts. For example, `vocab = sorted(set(my_text))` and `word_freq = FreqDist(my_text)` are both simple lexical resources.
Lexicon Terminology: lexical entries for two lemmas having the same spelling (homonyms), providing part of speech and gloss information.
Wordlist Corpora








The CMU Pronouncing Dictionary and Toolbox are introduced in the book; I’ll just omit them in this note.
WordNet is a semantically oriented dictionary of English, similar to a traditional thesaurus but with a richer structure. NLTK includes the English WordNet, with 155287 words and 117659 synonym sets.
Synsets
With WordNet, we can find a word’s synonyms in synsets (“synonym sets”), along with definitions and examples.


Hyponyms and hypernyms
WordNet synsets correspond to abstract concepts, and they don’t always have corresponding words in English. These concepts are linked together in a hierarchy. (See hyponyms and hypernyms in lexical relations.)
The corresponding methods are `hyponyms()` and `hypernyms()`.


Some other lexical relations
Another important way to navigate the WordNet network is from items to their components (meronyms), or to the things they are contained in (holonyms). There are three kinds of holonym-meronym relation, so there are six methods: `member_meronyms()`, `part_meronyms()`, `substance_meronyms()`, `member_holonyms()`, `part_holonyms()`, `substance_holonyms()`.
There are also relationships between verbs. For example, the act of walking involves the act of stepping, so walking entails stepping. Some verbs have multiple entailments. (NLTK also includes VerbNet, a hierarchical verb lexicon linked to WordNet. It can be accessed with `nltk.corpus.verbnet`.)


Semantic Similarity
Given a particular synset, we can traverse the WordNet network to find synsets with related meanings. Knowing which words are semantically related is useful for indexing a collection of texts, so that a search for a general term will match documents containing specific terms.
We can quantify the concept of generality (specific or general) by looking up the depth of the synset. `path_similarity` assigns a score in the range 0–1 based on the shortest path that connects the concepts in the hypernym hierarchy (-1 is returned in those cases where a path cannot be found; recent NLTK versions return `None` instead). Comparing a synset with itself will return 1.
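To see how such a score behaves, here is a sketch on a tiny made-up hypernym hierarchy. It scores a pair as 1 / (1 + shortest path length), which is how I understand NLTK's `path_similarity` to work; the `HYPERNYM` table and helper names are hypothetical, and a missing path gives `None`.

```python
# Hypothetical toy hierarchy: child -> parent (hypernym)
HYPERNYM = {
    "whale": "mammal", "dog": "mammal", "mammal": "animal",
    "trout": "fish", "fish": "animal",
}

def ancestors(node):
    """Map every node on the path to the root to its distance from `node`."""
    dist, d = {}, 0
    while node is not None:
        dist[node] = d
        node = HYPERNYM.get(node)
        d += 1
    return dist

def path_similarity(a, b):
    """1 / (1 + length of the shortest path through a shared hypernym)."""
    da, db = ancestors(a), ancestors(b)
    common = set(da) & set(db)
    if not common:
        return None  # no connecting path
    shortest = min(da[n] + db[n] for n in common)
    return 1.0 / (1.0 + shortest)

same = path_similarity("whale", "whale")
near = path_similarity("whale", "dog")
far = path_similarity("whale", "trout")
```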


From Web


From local files


ASCII text and HTML text are human-readable formats. Text often comes in binary formats (like PDF and MS Word) that can only be opened using specialized software. Third-party libraries such as `pypdf` and `pywin32` provide access to these formats. Extracting text from multi-column documents is particularly challenging. For once-off conversion of a few documents, it is simpler to open the document with a suitable application, then save it as text to your local drive, and access it as described below. If the document is already on the web, you can enter its URL in Google’s search box. The search result often includes a link to an HTML version of the document, which you can save as text.
I’ve uploaded my summary of regular expression in the post Regular Expression.


NLTK provides a regular expression tokenizer: `nltk.regexp_tokenize()`.
The basic part is discussed in my Python learning note. And in this section, I just record some unfamiliar knowledge.
Assignment always copies the value of an expression, but a value is not always what you might expect it to be. In particular, the “value” of a structured object such as a list is actually just a reference to the object.
Python provides two ways to check that a pair of items are the same. The `is` operator tests for object identity, while the `==` operator tests whether the two values are equal.
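A short illustration of both points, with made-up variable names: assigning a list copies only the reference, so a change through one name is visible through the other, while a real copy is equal but not identical.

```python
foo = ["Monty", "Python"]
bar = foo          # copies the reference: bar and foo are the same object
baz = list(foo)    # builds a new list with equal contents

same_object = bar is foo       # identity
equal_copy = baz == foo        # equality of values
distinct_copy = baz is foo     # a copy is a different object

bar[1] = "Bodkin"  # mutating through bar also changes what foo sees
```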
A list is typically a sequence of objects all having the same type, of arbitrary length. We often use lists to hold sequences of words. In contrast, a tuple is typically a collection of objects of different types, of fixed length. We often use a tuple to hold a record, a collection of different fields relating to some entity.
Generator Expression:


The second line uses a generator expression. This is more than a notational convenience: in many language processing situations, generator expressions will be more efficient. In the first form, storage for the list object must be allocated before the value of `max()` is computed. If the text is very large, this could be slow. In the second form, the data is streamed to the calling function. Since the calling function simply has to find the maximum value (the word which comes latest in lexicographic sort order), it can process the stream of data without having to store anything more than the maximum value seen so far.
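The two forms can be compared side by side on a small example. The sentence is made up; both calls give the same answer and differ only in whether an intermediate list is built.

```python
text = "The quick brown fox jumps over the lazy dog"

# 1: list comprehension, the whole list is built before max() runs
latest_from_list = max([w.lower() for w in text.split()])

# 2: generator expression, items are streamed to max() one at a time
latest_from_gen = max(w.lower() for w in text.split())
```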


Function
It is not necessary to have any parameters.
A function usually communicates its results back to the calling program via the `return` statement.
A Python function is not required to have a return statement. Some functions do their work as a side effect, printing a result, modifying a file, or updating the contents of a parameter to the function (such functions are called “procedures” in some other programming languages).


When you refer to an existing name from within the body of a function, the Python interpreter first tries to resolve the name with respect to the names that are local to the function. If nothing is found, the interpreter checks if it is a global name within the module. Finally, if that does not succeed, the interpreter checks if the name is a Python built-in. This is the so-called LGB rule of name resolution: local, then global, then built-in.
Python also lets us pass a function as an argument to another function. Now we can abstract out the operation, and apply a different operation on the same data.


Python’s default arguments are evaluated once when the function is defined, not each time the function is called (as they are in, say, Ruby). This means that if you use a mutable default argument and mutate it, you will have mutated that object for all future calls to the function as well.
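A minimal demonstration of the pitfall and the usual workaround of defaulting to `None`. The function names are hypothetical.

```python
def append_bad(item, bucket=[]):
    # the default list is created once, at definition time, and then shared
    bucket.append(item)
    return bucket

def append_good(item, bucket=None):
    # create a fresh list on every call that does not pass one in
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket

first = append_bad(1)
second = append_bad(2)   # the same list as `first`, now holding both items
```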
Space-Time Tradeoffs
We can test the efficiency using the `timeit` module. The `Timer` class has two parameters, a statement which is executed multiple times, and setup code that is executed once at the beginning.
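For example, a membership test on a list can be timed against the same test on a set. The vocabulary here is synthetic and the measured numbers depend on the machine, so no timings are shown.

```python
from timeit import Timer

# setup code runs once; the statement is then executed `number` times
setup = "vocab = [str(i) for i in range(1000)]; vocab_set = set(vocab)"

list_time = Timer("'999' in vocab", setup).timeit(number=1000)
set_time = Timer("'999' in vocab_set", setup).timeit(number=1000)
```

The set uses extra memory for its hash table but should answer membership queries much faster than scanning the list, which is the kind of space-time tradeoff meant here.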


Dynamic Programming
Dynamic programming is a general technique for designing algorithms which is widely used in natural language processing. Dynamic programming is used when a problem contains overlapping subproblems. Instead of computing solutions to these subproblems repeatedly, we simply store them in a lookup table.


Four Ways to Compute Sanskrit Meter: (i) recursive; (ii) bottom-up dynamic programming; (iii) top-down dynamic programming; and (iv) built-in memoization.
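The task is to list every way of filling a meter of length n with short syllables (S, one unit) and long syllables (L, two units). Below is a sketch of the four approaches, loosely following the book's `virahanka` functions; the details (for example using `functools.lru_cache` for the memoized version) are my own choices.

```python
from functools import lru_cache

def virahanka1(n):
    """(i) Plain recursion: the same subproblems are recomputed many times."""
    if n == 0:
        return [""]
    if n == 1:
        return ["S"]
    s = ["S" + p for p in virahanka1(n - 1)]
    l = ["L" + p for p in virahanka1(n - 2)]
    return s + l

def virahanka2(n):
    """(ii) Bottom-up DP: fill a lookup table from the smallest case upward."""
    lookup = [[""], ["S"]]
    for i in range(n - 1):
        s = ["S" + p for p in lookup[i + 1]]
        l = ["L" + p for p in lookup[i]]
        lookup.append(s + l)
    return lookup[n]

def virahanka3(n, lookup=None):
    """(iii) Top-down DP: recurse, but store each result the first time."""
    if lookup is None:
        lookup = {0: [""], 1: ["S"]}
    if n not in lookup:
        s = ["S" + p for p in virahanka3(n - 1, lookup)]
        l = ["L" + p for p in virahanka3(n - 2, lookup)]
        lookup[n] = s + l
    return lookup[n]

@lru_cache(maxsize=None)
def virahanka4(n):
    """(iv) Built-in memoization; tuples keep the cached values immutable."""
    if n == 0:
        return ("",)
    if n == 1:
        return ("S",)
    s = tuple("S" + p for p in virahanka4(n - 1))
    l = tuple("L" + p for p in virahanka4(n - 2))
    return s + l

meters = virahanka2(4)
```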
Universal Part-of-Speech Tagset

| Tag | Meaning | English Examples |
| --- | --- | --- |
| ADJ | adjective | new, good |
| ADP | adposition | on, of |
| ADV | adverb | really, already |
| CONJ | conjunction | and, or |
| DET | determiner, article | the, some |
| NOUN | noun | year, home |
| NUM | numeral | twenty-four, fourth |
| PRT | particle | at, on |
| PRON | pronoun | he, their |
| VERB | verb | is, say |
| . | punctuation marks | . , ; ! |
| X | other | gr8, univeristy |
Some related functions:


The Default Tagger
Default taggers assign their tag to every single word, even words that have never been encountered before.


The Regular Expression Tagger
The regular expression tagger assigns tags to tokens on the basis of matching patterns.
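The idea can be sketched without NLTK: try a list of patterns in order and use the tag of the first one that matches. The pattern list is modelled on the kind of suffix rules used in the book, but the exact rules and the `regexp_tag` helper are assumptions for illustration.

```python
import re

# (pattern, tag) pairs, tried in order; the last one is a catch-all
PATTERNS = [
    (r".*ing$", "VBG"),                 # gerunds
    (r".*ed$", "VBD"),                  # simple past
    (r".*es$", "VBZ"),                  # 3rd person singular present
    (r".*ould$", "MD"),                 # modals
    (r".*'s$", "NN$"),                  # possessive nouns
    (r".*s$", "NNS"),                   # plural nouns
    (r"^-?[0-9]+(\.[0-9]+)?$", "CD"),   # cardinal numbers
    (r".*", "NN"),                      # nouns (default)
]
COMPILED = [(re.compile(pattern), tag) for pattern, tag in PATTERNS]

def regexp_tag(words):
    """Tag each word with the tag of the first matching pattern."""
    tagged = []
    for word in words:
        for regex, tag in COMPILED:
            if regex.match(word):
                tagged.append((word, tag))
                break
    return tagged

tagged = regexp_tag(["running", "walked", "cats", "42", "dog"])
```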


The Lookup Tagger
Find the most frequent words and store their most likely tag. Then use this information as the model for a “lookup tagger” (an NLTK `UnigramTagger`).
For those words not among the most frequent words, it’s okay to assign the default tag of `NN`. In other words, we want to use the lookup table first, and if it is unable to assign a tag, then use the default tagger, a process known as backoff (5).
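The lookup-then-backoff idea can be sketched in plain Python. The tiny tagged corpus and the helper names `train_lookup` and `tag` are made up; in NLTK the same effect comes from combining a `UnigramTagger` with a default tagger as backoff.

```python
from collections import Counter, defaultdict

def train_lookup(tagged_words, size):
    """Map each of the `size` most frequent words to its most likely tag."""
    word_freq = Counter(word for word, _ in tagged_words)
    tag_freq = defaultdict(Counter)
    for word, tag_ in tagged_words:
        tag_freq[word][tag_] += 1
    return {
        word: tag_freq[word].most_common(1)[0][0]
        for word, _ in word_freq.most_common(size)
    }

def tag(words, lookup, default="NN"):
    """Use the lookup table first; back off to the default tag otherwise."""
    return [(word, lookup.get(word, default)) for word in words]

# Hypothetical training data
tagged_words = [
    ("the", "AT"), ("dog", "NN"), ("the", "AT"),
    ("run", "VB"), ("run", "NN"), ("run", "VB"), ("of", "IN"),
]
lookup = train_lookup(tagged_words, size=2)
result = tag(["the", "run", "of", "blog"], lookup)
```

Here "of" was seen in training but is not among the two most frequent words, so it falls through to the default tag just like the unseen word "blog".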


Unigram Tagging
Unigram taggers are based on a simple statistical algorithm: for each token, assign the tag that is most likely for that particular token.
An n-gram tagger is a generalization of a unigram tagger whose context is the current word together with the part-of-speech tags of the n-1 preceding tokens.


Note:
n-gram taggers should not consider context that crosses a sentence boundary. Accordingly, NLTK taggers are designed to work with lists of sentences, where each sentence is a list of words. At the start of a sentence, $t_{n-1}$ and preceding tags are set to `None`.
As n gets larger, the specificity of the contexts increases, as does the chance that the data we wish to tag contains contexts that were not present in the training data. This is known as the sparse data problem, and is quite pervasive in NLP. As a consequence, there is a tradeoff between the accuracy and the coverage of our results (and this is related to the precision/recall tradeoff in information retrieval).
One way to address the tradeoff between accuracy and coverage is to use the more accurate algorithms when we can, but to fall back on algorithms with wider coverage when necessary.


Classification is the task of choosing the correct class label for a given input. In basic classification tasks, each input is considered in isolation from all other inputs, and the set of labels is defined in advance.
E.g., deciding whether an email is spam or not; deciding what the topic of a news article is, from a fixed list of topic areas such as “sports,” “technology,” and “politics”; deciding whether a given occurrence of the word bank is used to refer to a river bank, a financial institution, the act of tilting to the side, or the act of depositing something in a financial institution.


The training set is used to train the model, and the dev-test set is used to perform error analysis. The test set serves in our final evaluation of the system, just as in common machine learning practice.
The following content seems to focus on some methods provided by NLTK. To learn principles like decision trees, which are not covered in Andrew Ng’s course, I’d like to turn to Hands-On Machine Learning with Scikit-Learn and TensorFlow rather than this book. I’ll write a new post recording notes on that book :D
(Well, I’ll come back to continue updating this post when it’s necessary to.)
Well, we simply do not know how language originated. Some speculations about the origins of language: the divine source, the natural-sound source and the oral-gesture source.
From a different view, glossogenetics and physiological adaptation show that human teeth (upright), lips (intricate muscle interlacing), larynx (with pharynx) and brain (lateralized) provide the possibility of language.
Interactional function: to do with how humans use language to interact with each other, socially or emotionally; how they indicate friendliness, cooperation or hostility, or annoyance, pain, or pleasure.
Transactional function: use linguistic abilities to communicate knowledge, skills and information.
Pictograms: picture-writing
Ideograms: idea-writing
The distinction between pictograms and ideograms is essentially a difference in the relationship between the symbol and the entity it represents. The more ‘picture-like’ forms are pictograms, the more abstract, derived forms are ideograms. A key property of both pictograms and ideograms is that they do not represent words or sounds in a particular language.
When symbols come to be used to represent words in a language, they are described as examples of word-writing, or logograms.
Cuneiform writing: normally referred to when the expression “the earliest known writing system” is used.
Characters(Chinese writing): the longest continuous history of use as a writing system.
To avoid substantial memory load, some principled method is required to go from symbols which represent words(i.e. a logographic system) to a set of symbols which represent sounds(i.e. a phonographic system).
Syllabic writing: when a writing system employs a set of symbols which represent the pronunciations of syllables, it is described as syllabic writing. (There are no purely syllabic writing systems in use today.)
Alphabetic writing: the symbols can be used to represent single sound types in a language. An alphabet is essentially a set of written symbols which each represent a single type of sound.
Communicative: intentionally communicating something
Informative: unintentionally sent signals
Displacement: It allows the users of language to talk about things and events not present in the immediate environment. It enables us to talk about things and places whose existence we cannot even be sure of.
Arbitrariness: A property of linguistic signs is their arbitrary relationship with the objects they are used to indicate. They do not, in any way, ‘fit’ the objects they denote.
Productivity (creativity/open-endedness): The potential number of utterances in any human language is infinite.
Cultural transmission: Humans are not born with the ability to produce utterances in a specific language.
Discreteness: Each sound in the language is treated as discrete.
Duality: Language is organized at two levels or layers simultaneously. At one level, we have distinct sounds, and at another level, we have distinct meanings.
The use of the vocal-auditory channel: Human linguistic communication is typically generated via the vocal organs and perceived via the ears.
Reciprocity: Any speaker/sender of a linguistic signal can also be a listener/receiver.
Specialization: Linguistic signals do not normally serve any other type of purpose, such as breathing or feeding.
Non-directionality: Linguistic signals can be picked up by anyone within hearing, even unseen.
Rapid fade: Linguistic signals are produced and disappear quickly.
Most of these are properties of the spoken language, but not of the written language.
Phonetics: the general study of the characteristics of speech sounds
Articulatory phonetics: the study of how speech sounds are made, or ‘articulated’
Acoustic phonetics: deals with the physical properties of speech as sound waves ‘in the air’
Auditory(Perceptual) phonetics: deals with the perception, via the ear, of speech sounds
Forensic phonetics: has applications in legal cases involving speaker identification and the analysis of recorded utterances
When the vocal cords are spread apart, the air from the lungs passes between them unimpeded. Sounds produced in this way are described as voiceless.
When the vocal cords are drawn together, the air from the lungs repeatedly pushes them apart as it passes through, creating a vibration effect. Sounds produced in this way are described as voiced.
Bilabials
These are sounds formed using both upper and lower lips.
Includes:
Labiodentals
These are sounds formed with the upper teeth and the lower lip.
Includes:
Dentals
These sounds are formed with the tongue tip behind the upper front teeth.
Includes: (Sorry can’t type the phonetic symbol directly. The phonetic of th in the,there,then)
Alveolars
These are sounds formed with the front part of the tongue on the alveolar ridge.
Includes:
Alveopalatals
These are sounds formed with the tongue at the very front of the palate, near the alveolar ridge.
Includes:
Velars
These are sounds formed with the back of the tongue against the velum.
Includes:
Glottals
This sound is produced without the active use of the tongue and other parts of the mouth.
Includes:
Stops
Consonant sound resulting from a blocking or stopping effect on the airstream.
Includes:
Fricatives
As the air is pushed through, a type of friction is produced.
Includes:
Affricates
Combine a brief stopping of the airstream with an obstructed release which causes some friction.
Includes:
Nasals
The velum is lowered and the airstream is allowed to flow out through the nose.
Includes:
Approximants
The articulation of each is strongly influenced by the following vowel sound.
Includes:
The contents of vowel and the sound patterns are omitted
Coinage
The invention of totally new terms. (The most typical sources are invented trade names for one company’s product which become general terms for any version of that product.)
aspirin, nylon, zipper
Borrowing
Take over of words from other languages.
A special type of borrowing is described as loan-translation, or calque. In this process, there is a direct translation of the elements of a word into the borrowing language.
alcohol, boss, piano
Compounding
There is a joining of two separate words to produce a single form.
bookcase, fingerprint, wallpaper
Blending
Blending is typically accomplished by taking only the beginning of one word and joining it to the end of the other word.
smog(smoke+fog), bit(binary+digit), brunch(breakfast+lunch)
Clipping
This occurs when a word of more than one syllable is reduced to a shorter form, often in casual speech.
fax(facsimile), gas(gasoline), ad(advertisement)
Backformation
A word of one type(usually a noun) is reduced to form another word of a different type(usually a verb).
televise(television), donate(donation), opt(option)
Conversion
A change in the function of a word.
paper(noun>verb), guess(verb>noun), empty(adjective>verb)
Acronyms
New words formed from the initial letters of a set of other words.
CD (compact disk), radar (radio detecting and ranging), ATM (automatic teller machine)
Derivation
Accomplished by means of a large number of affixes which are not usually given separate listings in dictionaries.
prefix: added to the beginning of the word (e.g. un-)
suffix: added to the end of the word (e.g. -ish)
infix: incorporated inside another word (e.g. unfuckingbelievable)
Morphology, which literally means the ‘study of forms’, was originally used in biology, but, since the middle of the nineteenth century, has also been used to describe that type of investigation which analyzes all those morphemes which are used in a language.
The definition of a morpheme is “a minimal unit of meaning or grammatical function”.
Morphemes which can stand by themselves as single words.
Lexical morphemes: a set of ordinary nouns, adjectives and verbs which we think of as the words which carry the ‘content’ of messages we convey.
boy, man, house, tiger, sad, long, yellow, sincere, open, look, follow, break
‘Open’ class of words(we can add new lexical morphemes to the language rather easily).
Functional morphemes: a set consisting largely of the functional words in the language such as conjunctions, prepositions, articles and pronouns.
and, but, when, because, on, near, above, in, the, that, it
‘Closed’ class of words(we almost never add new functional morphemes to the language).
Morphemes which cannot normally stand alone, but are typically attached to another form.
Derivational morphemes: used to make new words in the language and are often used to make words of a different grammatical category from the stem (when affixes are used with bound morphemes, the basic word-form involved is technically known as the stem).
(-ness, -ful, -less, -ish, -ly, re-, pre-, ex-, dis-, co-, un-)
Inflectional morphemes: to indicate aspects of the grammatical function of a word.
Noun + -'s (possessive), -s (plural)
Verb + -s (3rd person present singular), -ing (present participle), -ed (past tense), -en (past participle)
Adjective + -est (superlative), -er (comparative)
An inflectional morpheme never changes the grammatical category of a word. A derivational morpheme can change the grammatical category of a word.
We need a way of describing the structure of phrases and sentences which will account for all of the grammatical sequences and rule out all ungrammatical sequences. Providing such an account involves us in the study of grammar.
The part of speech
Nouns are words used to refer to people, objects, creatures, places, qualities, phenomena and abstract ideas as if they were all ‘things’.
Adjectives are words used, typically with nouns, to provide more information about the ‘things’ referred to. (happy, large, cute)
Verbs are words used to refer to various kinds of actions(run, jump) and states(be, seem) involving the ‘things’ in events.
Adverbs are words used to provide more information about the actions and events (slowly, suddenly). Some adverbs (really, very) are also used with adjectives to modify the information about ‘things’.
Prepositions are words(at, in, on, near, with, without) used with nouns in phrases providing information about time, place and other connections involving actions and things.
Pronouns are words(me, they, he, himself, this, it) used in place of noun phrases, typically referring to things already known.
Conjunctions are words(and, but, although, if) used to connect, and indicate relationships between events and things.
In addition to the terms used for the parts of speech, traditional grammatical analysis also gave us a number of other categories, including ‘number’, ‘person’, ‘tense’, ‘voice’ and ‘gender’.
Number is whether the noun is singular or plural.
Person covers the distinctions of first person(involving the speaker), second person(involving the hearer) and third person(involving any others).
Tense: present tense, past tense, future tense.
Voice: active voice, passive voice
Gender: describe the relationship in terms of natural gender, mainly derived from a biological distinction between male and female. (Grammatical gender is common but may not be as appropriate in describing English)
The view of grammar as a set of rules for the ‘proper’ use of a language is still to be found today and may be best characterized as the prescriptive approach.
Some familiar examples of prescriptive rules for English sentences:
Analysts collect samples of the language they are interested in and attempt to describe the regular structures of the language as it is used, not according to some view of how it should be used. This is called the descriptive approach and it is the basis of most modern attempts to characterize the structure of different languages.
Structural analysis
One type of descriptive approach is called structural analysis and its main concern is to investigate the distribution of forms(e.g., morphemes) in a language. The method employed involves the use of ‘test-frames’ which can be sentences with empty slots in them.
Immediate constituent analysis
An approach with the same descriptive aims is called immediate constituent analysis. The technique employed in this approach is designed to show how small constituents(or components) in sentences go together to form larger constituents.
The analysis of the constituent structure of the sentence can be represented in different types of diagrams. (Simple diagram, labeled and bracketed sentences, tree diagrams, discussed in the following chapter in detail)
The word ‘syntax’ came originally from Greek and literally meant ‘a setting out together’ or ‘arrangement’.
Generative grammar
Since the 1950s, there have been attempts to produce a particular type of grammar which would have a very explicit system of rules specifying what combinations of basic elements would result in well-formed sentences.
Given a simple algebraic expression, we can generate an endless set of values by following the simple rules of arithmetic. The endless set of such results is ‘generated’ by the operation of the explicitly formalized rules. If the sentences of a language can be seen as a comparable set, then there must be a set of explicit rules which yield those sentences. Such a set of explicit rules is a generative grammar.
Some properties of the grammar
S sentence
N noun
Pro pronoun
PN proper noun
V verb
Adj adjective
Art article
Adv adverb
Prep preposition
NP noun phrase
VP verb phrase
PP prepositional phrase
* ungrammatical sequence
> consists of
() optional constituent
{} one and only one of these constituents must be selected
(These may be a bit different from the symbols used in the post COMS W4705 Natural Language Processing Note, but it doesn’t matter. The tree diagram is introduced in that post as well.)
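To make the idea of explicit rules that ‘generate’ sentences concrete, here is a minimal sketch in Python that uses the symbols listed above. The rules and the tiny lexicon in RULES, and the helper name expand, are illustrative assumptions rather than a grammar taken from the source.

```python
from itertools import product

# Toy phrase-structure rules written with the symbols listed above.
# Each left-hand symbol maps to a list of alternative expansions.
# The rules and the tiny lexicon are assumptions made for illustration.
RULES = {
    "S": [["NP", "VP"]],
    "NP": [["Art", "N"], ["PN"]],
    "VP": [["V", "NP"]],
    "Art": [["the"]],
    "N": [["dog"], ["cat"]],
    "PN": [["Mary"]],
    "V": [["saw"]],
}


def expand(symbol):
    """Return every word sequence the rules generate from `symbol`."""
    if symbol not in RULES:  # anything without a rule is a terminal word
        return [[symbol]]
    results = []
    for expansion in RULES[symbol]:
        # Expand each child symbol, then combine one choice per child.
        child_options = [expand(child) for child in expansion]
        for combo in product(*child_options):
            results.append([word for part in combo for word in part])
    return results


sentences = [" ".join(words) for words in expand("S")]
print(len(sentences))
print(sentences[:3])
```

This toy grammar has no recursive rule, so it generates a finite set of sentences. A rule that reintroduces its own left-hand symbol would make the set endless, and the exhaustive enumeration above would then no longer terminate.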
Semantics is the study of the meaning of the words, phrases and sentences. Linguistic semantics deals with the conventional meaning conveyed by the use of words and sentences of a language.
Analyze meaning in terms of semantic features. Features such as +animate, -animate; +human, -human, for example, can be treated as the basic features involved in differentiating the meanings of each word in the language from every other word.
This approach gives us the ability to predict which nouns would make a sentence semantically odd.
However, for many words in a language it may not be so easy to come up with neat components of meaning. Part of the problem seems to be that the approach involves a view of words in a language as some sort of ‘containers’, carrying meaning-components.
Instead of thinking of the words as ‘containers’ of meaning, we can look at the ‘roles’ they fulfill within the situation described by a sentence.
agent: the entity that performs the action
theme: the entity that is involved in or affected by the action
instrument: if an agent uses another entity in performing an action, that other entity fills the role of instrument
experiencer: when a noun phrase designates an entity as the person who has a feeling, a perception or a state
location: where an entity is
source: where an entity moves from
goal: where an entity moves to
Characterize the meaning of a word not in terms of its component features, but in terms of its relationship to other words. This procedure has also been used in the semantic description of languages and is treated as the analysis of lexical relations.
Synonymy
Two or more forms with very closely related meanings, which are often, but not always, intersubstitutable in sentences.
Antonymy
Two forms with opposite meanings
gradable antonyms, such as the pair big-small, can be used in comparative constructions like bigger than-smaller than. Also, the negative of one member of the gradable pair does not necessarily imply the other.
non-gradable antonyms, also called ‘complementary pairs’: comparative constructions are not normally used, and the negative of one member does imply the other.
Hyponymy
When the meaning of one form is included in the meaning of another, the relationship is described as hyponymy. (The meaning of animal is ‘included’ in the meaning of dog. Or, dog is a hyponym of animal.)
When we consider hyponymous relations, we are essentially looking at the meaning of words in some type of hierarchical relationship.
From the hierarchical diagram, we can say that two or more terms which share the same superordinate(higher-up) term are co-hyponyms.
The relation of hyponymy captures the idea of ‘is a kind of’.
Prototype
The concept of a prototype helps explain the meaning of certain words not in terms of component features, but in terms of resemblance to the clearest exemplar. (For many American English speakers, the prototype of ‘bird’ is the robin.)
Homophony
Two or more different (written) forms have the same pronunciation.(bare-bear, meat-meet)
Homonymy
One form(written and spoken) has two or more unrelated meanings.(bank: of a river; financial institution, race: contest of speed; ethnic group)
Polysemy
One form(written or spoken) has multiple meanings which are all related by extension.(head, foot)
(Some other lexical relations like meronyms and holonyms are introduced in this post)
Relationship between words based simply on a close connection in everyday experience. That close connection can be based on a container-contents relation(bottle-coke; can-juice), a whole-part relation(car-wheels; house-roof) or a representative-symbol relationship(king-crown; the President-the White House).
Many examples of metonymy are highly conventionalized and easy to interpret. However, many others depend on an ability to infer what the speaker has in mind.
One way we seem to organize our knowledge of words is simply in terms of collocation, or frequently occurring together.
(butter-bread, needle-thread, salt-pepper)
When we read or hear pieces of language, we normally try to understand not only what the words mean, but what the writer or speaker of those words intended to convey. The study of ‘intended speaker meaning’ is called pragmatics.
Linguistic context(co-text)
The co-text of a word is the set of other words used in the same phrase or sentence. This surrounding co-text has a strong effect on what we think the word means.
Physical context
Our understanding of much of what we read and hear is tied to the physical context, particularly the time and place, in which we encounter linguistic expressions.
There are some words in the language that cannot be interpreted at all unless the physical context, especially the physical context of the speaker, is known. Expressions, which depend for their interpretation on the immediate physical context in which they were uttered, are very obvious examples of bits of language which we can only understand in terms of speaker’s intended meaning. These are technically known as deictic expressions.
Person deixis: used to point to a person(me, you, him, them)
Place deixis: (here, there, yonder)
Time deixis: (now, then, tonight, last week)
An act by which a speaker(or writer) uses language to enable a listener(or reader) to identify something.
We can use names associated with things to refer to people and names of people to refer to things. The key process here is called inference. An inference is any additional information used by the listener to connect what is said to what must be meant.
When we establish a referent and subsequently refer to the same object, we have a particular kind of referential relationship. The second referring expression is an example of anaphora and the first mention is called the antecedent.
Anaphora can be defined as subsequent reference to an already introduced entity. Mostly we use anaphora in texts to maintain reference.
What a speaker assumes is true or is known by the hearer can be described as a presupposition.
Constancy under negation: although two sentences have opposite meanings, the underlying presupposition remains true in both.
In very general terms, we can usually recognize the type of ‘act’ performed by a speaker in uttering a sentence. The use of the term speech act covers ‘actions’ such as ‘requesting’, ‘commanding’, ‘questioning’ and ‘informing’.
| Forms | Functions |
| --- | --- |
| Interrogative | Question |
| Imperative | Command(request) |
| Declarative | Statement |
Direct speech act: one of the forms in the set above is used to perform the corresponding function
Indirect speech act: whenever one of the forms in the set above is used to perform a function other than the one listed beside it
Politeness is showing awareness of another person’s face. (Face is the public self-image: the emotional and social sense of self that every person has and expects everyone else to recognize.)
face-threatening act: say something that represents a threat to another person’s self-image (use a direct speech act to order someone to do something)
face-saving act: say something that lessens the possible threat to another’s face (use an indirect speech act instead)
negative face: the need to be independent and to have freedom from imposition
positive face: the need to be connected, to belong, to be a member of the group
When we ask how it is that we, as language users, make sense of what we read in texts, understand what speakers mean despite what they say, recognize connected as opposed to jumbled or incoherent discourse, and successfully take part in that complex activity called conversation, we are undertaking what is known as discourse analysis.
Texts must have a certain structure which depends on factors quite different from those required in the structure of a single sentence. Some of those factors are described in terms of cohesion, or the ties and connections which exist within texts.
Analysis of cohesive links within a text gives us some insight into how writers structure what they want to say, and may be a crucial factor in our judgments on whether something is well-written or not.
There must be some factor which leads us to distinguish connected texts which make sense from those which do not. This factor is usually described as coherence.
The key to the concept of coherence is not something which exists in the language, but something which exists in people. It is people who ‘make sense’ of what they read and hear.
An underlying assumption in most conversational exchanges seems to be that the participants are cooperating with each other.
Four maxims
Quantity: Make your contribution as informative as is required, but not more, or less, than is required
Quality: Do not say that which you believe to be false or for which you lack evidence
Relation: Be relevant
Manner: Be clear, brief and orderly
We actually create what the text is about, based on our expectations of what normally happens. In attempting to describe this phenomenon, many researchers use the concept of a ‘schema’.
A schema is a general term for a conventional knowledge structure which exists in memory. We have many schemata which are used in the interpretation of what we experience and what we hear or read about.
One particular kind of schema is a ‘script’. A script is essentially a dynamic schema, in which a series of conventional actions takes place.
Broca’s area
Paul Broca, a French surgeon, reported in the 1860s that damage to this specific part of the brain was related to extreme difficulty in producing speech. It was noted that damage to the corresponding area on the right hemisphere had no such effect. This finding was first used to argue that language ability must be located in the left hemisphere and since then has been taken as more specifically illustrating that Broca’s area is crucially involved in the production of speech.
Wernicke’s area
Carl Wernicke was a German doctor who, in the 1870s, reported that damage to this part of the brain was found among patients who had speech comprehension difficulties. This finding confirmed the left-hemisphere location of language ability and led to the view that Wernicke’s area is part of the brain crucially involved in the understanding of speech.
The motor cortex
The motor cortex generally controls movement of the muscles. Close to Broca’s area is the part of the motor cortex that controls the articulatory muscles of the face, jaw, tongue and larynx. Evidence that this area is involved in the actual physical articulation of speech comes from the work, reported in the 1950s, of two neurosurgeons, Penfield and Roberts.
The arcuate fasciculus
The arcuate fasciculus is a bundle of nerve fibers. This was also one of Wernicke’s discoveries and forms a crucial connection between Wernicke’s area and Broca’s area.
The word is heard and comprehended via Wernicke’s area. This signal is then transferred via the arcuate fasciculus to Broca’s area where preparations are made to produce it. A signal is then sent to the motor cortex to physically articulate the word.(A massively oversimplified version of what may actually take place.)
Tip-of-the-tongue
There is the tip-of-the-tongue phenomenon in which you feel that some word is just eluding you, that you know the word, but it just won’t come to the surface.
Malapropisms
The experience which occurs with uncommon terms or names suggests that our ‘word-storage’ may be partially organized on the basis of some phonological information and that some words in that ‘store’ are more easily retrieved than others. When we make mistakes in this retrieval process, there are often strong phonological similarities between the target word and the mistake. Mistakes of this type are sometimes referred to as Malapropisms.
Slip-of-the-tongue
A slip-of-the-tongue often results in tangled expressions or word reversals. This type of slip is also known as a Spoonerism.
Tip-of-the-lung
Tips-of-the-lung are often simply the result of a sound being carried over from one word to the next, or a sound used in one word in anticipation of its occurrence in the next word.
Slip-of-the-ear
Slips-of-the-ear can result in misinterpretation when hearing.
Aphasia is defined as an impairment of language function due to localized cerebral damage which leads to difficulty in understanding and/or producing linguistic forms.
Broca’s aphasia
The type of serious language disorder known as Broca’s aphasia(also called ‘motor aphasia’) is characterized by a substantially reduced amount of speech, distorted articulation and slow, often effortful speech. What is said often consists almost entirely of lexical morphemes(e.g. nouns and verbs). The frequent omission of functional morphemes(e.g. articles, prepositions, inflections) has led to the characterization of this type of aphasia as agrammatic. The grammatical markers are missing.
In Broca’s aphasia, comprehension is typically much better than production.
Wernicke’s aphasia
The type of language disorder which results in difficulties in auditory comprehension is sometimes called ‘sensory aphasia’, but is more commonly known as Wernicke’s aphasia. Someone suffering from this disorder can actually produce very fluent speech which is, however, often difficult to make sense of.
Difficulty in finding the correct words(sometimes referred to as anomia) is also very common and circumlocution may be used.
Conduction aphasia
Another type of aphasia is identified with damage to the arcuate fasciculus and is called conduction aphasia. Individuals suffering from this disorder typically do not have articulation problems. They are fluent, but may have disrupted rhythm because of pauses and hesitations. Comprehension of spoken words is normally good. However, the task of repeating a word or phrase(spoken by someone else) will create major difficulty. What is heard and understood cannot be transferred to the speech production area.
An experimental technique which has demonstrated that, for the majority of subjects tested, the language functions must be located in the left hemisphere is called the dichotic listening test. This is a technique which uses the generally established fact that anything experienced on the right-hand side of the body is processed in the left hemisphere of the brain and anything on the left side is processed in the right hemisphere.
An experiment is possible in which a subject sits with a set of earphones on and is given two different sound signals simultaneously, one through each earphone. When asked to say what was heard, the subject more often correctly identifies the sound which came via the right ear. This has come to be known as the right ear advantage.
The explanation of this process proposes that a language signal received through the left ear is first sent to the right hemisphere and then has to be sent over to the left hemisphere(language center) for processing. This non-direct route will take longer than a linguistic signal which is received through the right ear and goes directly to the left hemisphere. First signal to get processed wins.
the historical study of languages
Cognates: within groups of related languages, we often find close similarities in particular sets of terms. A cognate of a word in one language is a word in another language which has a similar form and is, or was, used with a similar meaning.
Comparative reconstruction: the aim of this procedure is to reconstruct what must have been the original, or ‘proto’ form in the common ancestral language. It’s a bit like trying to work out what the great-grandmother must have been like on the basis of common features possessed by the set of granddaughters.
The majority principle: if, in a cognate set, three forms begin with a [p] sound and one form begins with a [b] sound, then our best guess is that the majority have retained the original sound(i.e. [p]), and the minority has changed a little through time.
The most natural development principle: based on the fact that certain types of sound-change are very common, whereas others are extremely unlikely.
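The majority principle can be read as a simple counting rule. Below is a minimal sketch in Python that uses the [p]/[b] example from above. The function name majority_sound is hypothetical, and the sketch ignores the most natural development principle, which real comparative reconstruction also weighs.

```python
from collections import Counter


def majority_sound(initial_sounds):
    """Return the most frequent initial sound in a cognate set and its count."""
    counts = Counter(initial_sounds)
    sound, count = counts.most_common(1)[0]
    return sound, count


# Three forms begin with [p] and one with [b], as in the example above.
cognate_initials = ["p", "p", "p", "b"]
best_guess = majority_sound(cognate_initials)
print(best_guess)
```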
Sound changes
metathesis: involves a reversal in position of two adjoining sounds
epenthesis: involves the addition of a sound to the middle of a word
prothesis: involves the addition of a sound to the beginning of a word
Lexical changes
broadening
narrowing
In general terms, sociolinguistics deals with the interrelationships between language and society. It has strong connections to anthropology, through the investigation of language and culture, and to sociology, through the crucial role that language plays in the organization of social groups and institutions. It is also tied to social psychology, particularly with regard to how attitudes and perceptions are expressed and how in-group and out-group behaviors are identified.
Varieties of language used by groups defined according to class, education, age, sex, and a number of other social parameters.
Factors include: social class and education; age and gender; ethnic background; idiolect; style, register and jargon, diglossia.
Linguistic determinism: language determines thought
The Sapir-Whorf hypothesis: we dissect nature along lines laid down by our native languages.
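Hangzhou is big, and Hangzhou is also rather small.
Zijingang, Yuquan, Hangzhou Tower. Unlike the previous semesters, when I was confined to a single campus, this semester’s classes not only had me commuting between Yuquan and Zijingang from Monday to Friday, but also included the New Oriental GRE course at Hangzhou Tower on Sundays. The city of Hangzhou, which once seemed impossibly vast, has finally become familiar to me in one of its corners. Alternating between the transit card, campus shuttle tickets and DiDi rides traced out the crisscrossing routes between Zijingang and Yuquan.
It turns out that three bus stops belong to Yuquan: ZJU Yuquan Campus, the Yugu Road / Qiushi Road intersection, and Gudang. And, by coincidence, three bus stops belong to Zijingang: the Zijinghua Road / Yuhangtang Road intersection, ZJU Zijingang Campus, and Zhangqiaotou.
It turns out that different ways of getting around have different styles too: the 己-shaped route of bus No. 89, the 卅-shaped route of the campus shuttle, and the 乛-shaped route of a taxi. Some hurried, some leisurely; some simple, some convoluted.
A semester is long, and a semester is also over in the blink of an eye.
Although I had kept this semester’s timetable as light as I could, the on-campus classes from Monday to Friday, the GRE course on Sundays, and the assorted things I was learning outside class still filled every week and every day to the brim. Going back and forth between Yuquan and Zijingang, keeping aimlessly busy over the Qingming, Labor Day and Dragon Boat holidays, staying up night after night during the exam weeks: the all-important second semester of my junior year has passed as well.
(Well... I just wanted to remark on how fast time goes by.)
I am not good enough; I deserve better.
During the big companies’ spring recruitment in March and April, I applied, just to give it a try, for machine-learning internship positions at Alibaba, NetEase and others. Thanks to the halo of Zhejiang University I was lucky enough to get written tests and interviews, but for lack of ability I was knocked out in the first interview round every time. Admittedly, I had not prepared for the interviews because of my coursework, but subconsciously I was probably also afraid: afraid that even after careful preparation the ending would still be failure.
To enrich my research experience, and also to be able to skip the boring short semester, I skimmed the papers of several LTI professors at CMU and then tried emailing them to apply for summer research. However, one professor was traveling over the summer, one did not take summer research students, and the rest of my emails sank without a trace. At least I can spend the summer in Professor Ding’s lab, studying and digging into things in peace.
TOEFL 96, GRE 332. Both were taken with almost no preparation. The scores are just about presentable, but I still hope to work harder and get better results.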
I don’t need love. I wanna find love.
Finished? To be continued.
A regular expression(RegEx) is an algebraic notation for characterizing a set of strings. They are particularly useful for searching in texts, when we have a pattern to search for and a corpus of texts to search through. A regular expression search function will search through the corpus, returning all texts that match the pattern. The corpus can be a single document or a collection.
(In some cases, regular expressions are shown delimited by slashes /. But the slashes are not part of the regular expression.)
In this post, I’ll just use expression (without quotes) to denote regular expressions, and 'expression' to denote the patterns matched.
Regular expressions can contain both special and ordinary characters. Most ordinary characters, like A, a, or 0, are the simplest regular expressions; they simply match themselves. E.g., hello matches 'hello', and world matches 'world'.
Regular expressions are case sensitive. Therefore, the lower case h is distinct from upper case H.
[]
Used to indicate a set of characters.
Characters can be listed individually, e.g. [amk] will match 'a', 'm', or 'k'.
Ranges of characters can be indicated by giving two characters and separating them by a '-'. E.g., [A-Z] matches an upper case letter, [0-9] matches a single digit. If - is escaped (\-) or if it’s placed as the first or last character (e.g., [-a] or [a-]), it will match a literal '-'.
Special characters lose their special meaning inside sets. E.g., [(+*)] will match any of the literal characters '(', '+', '*', or ')'.
Character classes such as \w or \S are also accepted inside a set.
Characters that are not within a range can be matched by complementing the set. If the first character of the set is ^, all characters that are not in the set will be matched. This is only true when the caret is the first symbol after the open square brace. If it occurs anywhere else, it usually stands for a caret and has no special meaning.
To match a literal ']' inside a set, precede it with a backslash, or place it at the beginning of the set.
Some of the special sequences beginning with \ represent predefined sets of characters that are often useful, such as the set of digits, the set of letters, or the set of anything that isn’t whitespace.
\d
Matches any decimal digit; this is equivalent to [0-9].
\D
Matches any non-digit character; this is equivalent to [^0-9].
\s
Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v].
\S
Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].
\w
Matches any alphanumeric character and the underscore (easily overlooked); this is equivalent to the class [a-zA-Z0-9_].
\W
Matches any non-alphanumeric character; this is equivalent to the class [^a-zA-Z0-9_].
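A short sketch of sets and predefined classes with Python’s re module may help here. The sample string and the variable names are made up for illustration.

```python
import re

text = "Order A-17 shipped 2024_03 to Zed, cost $9.50!"

# [A-Z] matches one upper-case letter.
upper = re.findall(r"[A-Z]", text)
# \d+ matches runs of decimal digits.
digits = re.findall(r"\d+", text)
# A complemented set: anything that is neither a word character nor whitespace.
punctuation = re.findall(r"[^a-zA-Z0-9_\s]", text)
# \w includes the underscore, so 2024_03 stays in one piece.
words = re.findall(r"\w+", text)
# A hyphen placed first in a set is a literal hyphen; $ is literal inside a set.
literals = re.findall(r"[-$]", text)

print(upper, digits, punctuation, words, literals, sep="\n")
```

One caveat: for str patterns in Python 3, \w, \d and \s also match non-ASCII letters, digits and whitespace unless the re.ASCII flag is given, so the equivalences listed above hold exactly only for ASCII text.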
The period/dot . is a special character that matches any single character (except a newline). It is often used where “any character” is to be matched.
Anchors are special characters that anchor regular expressions to particular places in a string.
^
The caret ^ matches the start of a line.
Thus, the caret ^ has three uses: to match the start of a line (^), to indicate a negation inside of square brackets ([^]), and just to mean a caret (\^ or [.^]).
$
The dollar sign $ matches the end of a line.
There are also two other anchors: \b matches a word boundary, and \B matches a non-boundary. More technically, a “word” for the purposes of a regular expression is defined as any sequence of digits, underscores, or letters.
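The anchors can be tried out with Python’s re module. In this sketch the sample text is made up, and match positions are collected instead of matched strings, so the four anchors can be told apart. Note that in Python ^ and $ match at every line only when re.MULTILINE is set; by default they anchor to the whole string.

```python
import re

lines = "the cat\nother things\nbathe"


def starts(pattern, flags=0):
    """Return the start offsets of all matches of `pattern` in `lines`."""
    return [m.start() for m in re.finditer(pattern, lines, flags)]


line_start = starts(r"^the", re.MULTILINE)   # "the" at the start of a line
line_end = starts(r"the$", re.MULTILINE)     # "the" at the end of a line
whole_word = starts(r"\bthe\b")              # "the" as a standalone word
inside_word = starts(r"\Bthe\B")             # "the" strictly inside a word

# Without re.MULTILINE, ^ only matches at the very start of the string.
without_flag = re.findall(r"^other", lines)
with_flag = re.findall(r"^other", lines, flags=re.MULTILINE)

print(line_start, line_end, whole_word, inside_word, without_flag, with_flag)
```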
?
The question mark ? means “zero or one instances of the previous character”. ab? will match either 'a' or 'ab'.
(?i) makes the regular expression case-insensitive.
*
Commonly called the Kleene *. The Kleene star means “zero or more occurrences of the immediately previous character or regular expression”.
+
The Kleene + means “one or more of the previous character”.
The ?, * and + qualifiers are all greedy; they match as much text as possible. There are also ways to enforce non-greedy matching, using another meaning of the ? qualifier (here ? means lazy: it causes the match to take as few characters as possible): *?, +?, ??.
{}
{n}: Exactly n repeats, where n is a non-negative integer
{n,}: At least n repeats
{,n}: No more than n repeats
{m,n}: At least m and no more than n repeats
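The counters above can be compared side by side in a small Python sketch. The sample strings and variable names are assumptions chosen for illustration.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ grabs as much as it can, so one match spans the whole string.
greedy = re.findall(r"<.+>", html)
# Lazy: .+? stops at the first possible >, giving one match per tag.
lazy = re.findall(r"<.+?>", html)

optional = re.findall(r"colou?r", "color colour colouur")  # zero or one u
star = re.findall(r"ab*", "a ab abbb")                     # zero or more b
plus = re.findall(r"ab+", "a ab abbb")                     # one or more b
counted = re.findall(r"\b\d{2,4}\b", "7 42 2024 123456")   # 2 to 4 digits

print(greedy, lazy, optional, star, plus, counted, sep="\n")
```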
The disjunction operator, also called the pipe symbol |, acts like a boolean OR. It matches the expression before or after the |.
In some sense, | is never greedy. As the target string is scanned, REs separated by | are tried from left to right. When one pattern completely matches, that branch is accepted. This means that in A|B, once A matches, B will not be tested further, even if it would produce a longer overall match.
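A two-line Python check shows the left-to-right behaviour of the pipe. The words used are only an illustration.

```python
import re

# The first alternative that matches wins, even though a longer match exists.
first_wins = re.match(r"cat|category", "category").group()
# Putting the longer alternative first changes the result.
reordered = re.match(r"category|cat", "category").group()

print(first_wins, reordered)
```

When alternatives share a prefix, listing the longer one first is a common way to get the longer match.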
()
Enclosing a pattern in parentheses makes it act like a single character for the purposes of neighboring operators like the pipe | and the Kleene *.
The use of parentheses to store a pattern in memory is called a capture group. Every time a capture group is used (i.e., parentheses surround a pattern), the resulting match is stored in a numbered register. Use a backslash with a number, like \1, to refer to those registers. Here the \1 will be replaced by whatever string matched the first item in parentheses.
Parentheses thus have a double function in regular expressions; they are used to group terms for specifying the order in which operators should apply, and they are used to capture something in a register. Occasionally we might want to use parentheses for grouping, but don’t want to capture the resulting pattern in a register. In that case we use a non-capturing group, which is specified by putting the commands ?: after the open paren, in the form (?: pattern ).
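The difference between capturing and non-capturing groups is easiest to see in Python’s re module. This is a small sketch with made-up sample strings.

```python
import re

# \1 must repeat exactly what the first group captured: a doubled word.
doubled = re.search(r"\b(\w+) \1\b", "it was was a typo")

# In a replacement string, \1 and \2 refer to the captured registers.
swapped = re.sub(r"(\w+)@(\w+)", r"\2 at \1", "user@example")

# With a capturing group, findall returns the group's (last) capture.
with_capture = re.findall(r"(ab)+c", "ababc")
# With a non-capturing group, findall returns the whole match.
without_capture = re.findall(r"(?:ab)+c", "ababc")

print(doubled.group(0), doubled.group(1), swapped, with_capture, without_capture)
```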
This idea that one operator may take precedence over another, requiring us to sometimes use parentheses to specify what we mean, is formalized by the operator precedence hierarchy for regular expressions, listed here from highest precedence to lowest.
| Kind | Operators |
| --- | --- |
| Parenthesis | () |
| Counters | * + ? {} |
| Sequences and anchors | the ^my end$ |
| Disjunction | \| |


There will be times when we need to predict the future: look ahead in the text to see if some pattern matches, but not advance the match cursor, so that we can deal with the pattern if it occurs.
Positive lookahead: (?= pattern)
Negative lookahead: (?! pattern)
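A minimal Python sketch of both lookaheads follows. The sample strings are made up, and the check on first.end() shows that the looked-ahead text is not consumed.

```python
import re

prices = "30USD 5EUR 712USD"

# Positive lookahead: digits only if USD follows, without consuming USD.
usd_amounts = re.findall(r"\d+(?=USD)", prices)
first = re.search(r"\d+(?=USD)", prices)

colors = "red, green blue, black"
# Negative lookahead: whole words that are NOT followed by a comma.
no_comma_after = re.findall(r"\b\w+\b(?!,)", colors)

print(usd_amounts, first.group(), first.end(), no_comma_after)
```

The \b on both sides matters in the negative case. Without it the engine could backtrack to a shorter piece of a word (such as 're' from 'red') that is technically not followed by a comma.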
To the Python interpreter, a regular expression is just like any other string. If the string contains a backslash followed by particular characters, it will interpret these specially. For example, \b would be interpreted as the backspace character. In general, when using regular expressions containing backslashes, we should instruct the interpreter not to look inside the string at all, but simply to pass it directly to the re library for processing. We do this by prefixing the string with the letter r, to indicate that it is a raw string.
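The effect of the r prefix can be checked directly. In this sketch the pattern and the sample sentence are made up.

```python
import re

plain = "\bcat\b"   # Python turns each \b into a backspace character (\x08)
raw = r"\bcat\b"    # the backslashes reach the re library untouched

plain_length = len(plain)  # backspace + c + a + t + backspace
raw_length = len(raw)      # \ b c a t \ b

plain_hit = re.search(plain, "a cat sat")  # looks for literal backspaces
raw_hit = re.search(raw, "a cat sat")      # looks for the word "cat"

print(plain_length, raw_length, plain_hit, raw_hit.group())
```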
Python’s built-in split() function accepts only a single separator string. However, a regular expression makes it convenient to split on multiple separators:
re.split(r'\W+', original_string)
It will return the list of separated strings. But it gives us empty strings at the start and the end when the text begins or ends with a separator.
We can use re.findall(r'\w+', original_string) to get the same tokens, but without the empty strings.
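The two calls can be compared on a small example. The sample sentence is made up; original_string is the name used above.

```python
import re

original_string = "'Hello, world!' she said."

# Splitting on runs of non-word characters leaves empty strings at both
# ends, because the text starts and ends with a separator.
split_tokens = re.split(r"\W+", original_string)

# Collecting runs of word characters gives the tokens directly.
found_tokens = re.findall(r"\w+", original_string)

print(split_tokens)
print(found_tokens)
```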
re.findall(r"\w+(?:[-']\w+)*|'|[-.(]+|\S\w*", raw)
This helps to deal with words like it's and warm-hearted.
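Here is a sketch of how a tokenizer pattern of this shape behaves. It assumes the alternation bars and hyphens are in place, as in r"\w+(?:[-']\w+)*|'|[-.(]+|\S\w*". The sample sentence is made up, and raw is the variable name used above.

```python
import re

# Alternatives, tried left to right:
#   \w+(?:[-']\w+)*  words with internal hyphens or apostrophes
#   '                a lone apostrophe
#   [-.(]+           runs of hyphens, periods or open parentheses
#   \S\w*            any other non-space character plus trailing word characters
TOKEN_PATTERN = r"\w+(?:[-']\w+)*|'|[-.(]+|\S\w*"

raw = "It's a well-known idea (really)."
tokens = re.findall(TOKEN_PATTERN, raw)
print(tokens)
```

Because the group is non-capturing, findall returns whole tokens rather than the contents of the group.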
re.findall() is also useful for extracting information when dealing with tagged corpora, and I am currently doing research with its help.
Markdown editors support inline LaTeX well. However, to support LaTeX on Hexo, it’s necessary to use MathJax and to wrap each dollar-sign expression in inline code. Now it’s time to use regular expressions to modify the format.


(Well, I just use Sublime Text’s Replace function to make the modification.)
Speech and Language Processing (3rd ed. draft), Dan Jurafsky and James H. Martin:
Regular Expressions, Text Normalization, Edit Distance
Python Documentation:
re: Regular expression operations
Regular Expression HOWTO
As for me, my environment is:
macOS Sierra 10.12.6
Python 3.6.3
pip3
The official command uses pip install --upgrade virtualenv. However, my MacBook Air doesn’t have pip and just can’t install pip using sudo easy_install pip. The error message is:
Download error on https://pypi.python.org/simple/pip/: [SSL: TLSV1_ALERT_PROTOCOL_VERSION] tlsv1 alert protocol version (_ssl.c:590) -- Some packages may not be found!
There seems to be something wrong with openssl. But after updating openssl with homebrew, it still doesn’t work. Thankfully, I can install with pip3 instead.
So the full commands are as follows:


To learn TensorFlow, I’m following Stanford’s course CS20: TensorFlow for Deep Learning Research. So I’ve also installed TensorFlow 1.4.1 with the setup instruction.
Something seems to go wrong when importing tensorflow. The error message:
/usr/local/Cellar/python/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: compiletime version 3.5 of module 'tensorflow.python.framework.fast_tensor_util' does not match runtime version 3.6
  return f(*args, **kwds)
The solution is found here. Download the binary resource and use the command pip install --ignore-installed --upgrade tensorflow-1.4.0-cp36-cp36m-macosx_10_12_x86_64.whl (the file name may be different on different machines).
Activate the Virtualenv each time when using TensorFlow in a new shell.


Change the path to Virtualenv environment, invoke the activation command and then the prompt will transform to the following to indicate that TensorFlow environment is active:
(targetDirectory)$
When it is done, deactivate the environment by issuing the following command:
(targetDirectory)$ deactivate
TensorFlow separates definition of computations from their execution.
Phase 1: assemble a graph.
Phase 2: use a session to execute operations in the graph.
(this might change in the future with eager mode)
Tensor
A tensor is an n-dimensional array.
0-d tensor: scalar (number)
1-d tensor: vector
2-d tensor: matrix
and so on…
Data Flow Graphs
Nodes: operators, variables, and constants
Edges: tensors
Tensors are data.
TensorFlow = tensor + flow = data + flow
Then how to get the value of a?
Create a session, assign it to variable sess so we can call it later.
Within the session, evaluate the graph to fetch the value of a.
Two ways:




A session object encapsulates the environment in which Operation objects are executed, and Tensor objects are evaluated.
Session will also allocate memory to store the current values of variables.
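The split between defining a computation and executing it can be imitated without TensorFlow. The following is a toy sketch in plain Python, not TensorFlow’s API: Node, constant, add, multiply and Session are hypothetical stand-ins that only mimic the define-then-run idea.

```python
class Node:
    """A graph node: an operation plus the nodes whose outputs feed it."""

    def __init__(self, op, inputs=(), name=None):
        self.op = op
        self.inputs = tuple(inputs)
        self.name = name


def constant(value, name=None):
    return Node(lambda: value, name=name)


def add(x, y, name=None):
    return Node(lambda a, b: a + b, (x, y), name=name)


def multiply(x, y, name=None):
    return Node(lambda a, b: a * b, (x, y), name=name)


class Session:
    """Evaluates only the part of the graph that a fetched node depends on."""

    def __init__(self):
        self.evaluated = []  # names of the nodes computed, in order

    def run(self, fetch):
        cache = {}

        def evaluate(node):
            if node not in cache:
                args = [evaluate(parent) for parent in node.inputs]
                cache[node] = node.op(*args)
                self.evaluated.append(node.name)
            return cache[node]

        return evaluate(fetch)


# Phase 1: assemble the graph. Nothing is computed yet.
x = constant(3, name="x")
y = constant(5, name="y")
a = add(x, y, name="a")
unused = multiply(x, y, name="unused")

# Phase 2: execute it. Only the nodes that `a` depends on are evaluated.
sess = Session()
result = sess.run(a)
print(result, sess.evaluated)
```

Evaluating only the subgraph that leads to the fetched node is one commonly cited benefit of building a graph first: the node named unused above is never computed.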
Why graphs
The computations you’ll use TensorFlow for  like training a massive deep neural network  can be complex and confusing. To make it easier to understand, debug, and optimize TensorFlow programs, we’ve included a suite of visualization tools called TensorBoard.
When a user perform certain operations in a TensorBoardactivated TensorFlow program, these operations are exported to an event log file. TensorBoard is able to convert these event files to visualizations that can give insight into a model’s graph and its runtime behavior. Learning to use TensorBoard early and often will make working with TensorFlow that much more enjoyable and productive.
To visualize the program with TensorBoard, we need to write log files of the program. To write event files, we first need to create a writer for those logs, using the code writer = tf.summary. FileWriter ([logdir], [graph])
[graph] is the graph of the program that we’re working on. Either call it using tf.get_default_graph()
, which returns the default graph of the program, or through sess.graph
, which returns the graph the session is handling. The latter requires that a session is created.
[logdir] is the folder where we want to store those log files.
Note: if you run the code several times, there will be multiple event files in [logdir]. TF will show only the latest graph and display a warning about multiple event files. To get rid of the warning, just delete the event files you no longer need.






a = tf.constant([2, 2], name='a')
Tensors filled with a specific value


Similar to numpy.zeros
tf.zeros([2, 3], tf.int32)
==> [[0, 0, 0], [0, 0, 0]]


Similar to numpy.zeros_like
tf.zeros_like(input_tensor) # input_tensor = [[0, 1], [2, 3], [4, 5]]
==> [[0, 0], [0, 0], [0, 0]]


Similar to numpy.ones, numpy.ones_like


Similar to numpy.full
tf.fill([2, 3], 8)
==> [[8, 8, 8], [8, 8, 8]]
Constants as sequences


tf.lin_space(10.0, 13.0, 4)
==> [10. 11. 12. 13.]


tf.range(3, 18, 3)
==> [3 6 9 12 15]
tf.range(5)
==> [0 1 2 3 4]
Randomly Generated Constants
tf.random_normal
tf.truncated_normal
tf.random_uniform
tf.random_shuffle
tf.random_crop
tf.multinomial
tf.random_gamma
tf.set_random_seed(seed)
Elementwise mathematical operations
Add, Sub, Mul, Div, Exp, Log, Greater, Less, Equal, …
Well, there are 7 different div operations in TensorFlow, all doing more or less the same thing: tf.div(), tf.divide(), tf.truediv(), tf.floordiv(), tf.realdiv(), tf.truncatediv(), tf.floor_div()
Array operations
Concat, Slice, Split, Constant, Rank, Shape, Shuffle, …
Matrix operations
MatMul, MatrixInverse, MatrixDeterminant, …
Stateful operations
Variable, Assign, AssignAdd, …
Neural network building blocks
SoftMax, Sigmoid, ReLU, Convolution2D, MaxPool, …
Checkpointing operations
Save, Restore
Queue and synchronization operations
Enqueue, Dequeue, MutexAcquire, MutexRelease, …
Control flow operations
Merge, Switch, Enter, Leave, NextIteration
TensorFlow takes Python native types: boolean, numeric (int, float), strings
scalars are treated like 0-d tensors
1-d arrays are treated like 1-d tensors
2-d arrays are treated like 2-d tensors
TensorFlow integrates seamlessly with NumPy
Can pass numpy types to TensorFlow ops
Use TF DType when possible:
Python native types: TensorFlow has to infer Python type
NumPy arrays: NumPy is not GPU compatible
Constants are stored in the graph definition. This makes loading graphs expensive when constants are big.
Therefore, only use constants for primitive types. Use variables or readers for data that requires more memory.


With tf.get_variable, we can provide the variable’s internal name, shape, type, and initializer to give the variable its initial value.
The old way to create a variable is to simply call tf.Variable(<initial-value>, name=<optional-name>). (Note that it’s written tf.constant with a lowercase ‘c’ but tf.Variable with an uppercase ‘V’. That’s because tf.constant is an op, while tf.Variable is a class with multiple ops.) However, this old way is discouraged and TensorFlow recommends that we use the wrapper tf.get_variable, which allows for easy variable sharing.
Some initializers:
tf.zeros_initializer()
tf.ones_initializer()
tf.random_normal_initializer()
tf.random_uniform_initializer()
We have to initialize a variable before using it. (If you try to evaluate the variables before initializing them you’ll run into FailedPreconditionError: Attempting to use uninitialized value.)
The easiest way is initializing all variables at once:


Initialize only a subset of variables:


Initialize a single variable:


Eval: get a variable’s value.
print(W.eval()) # Similar to print(sess.run(W))


Why is W 10 and not 100? In fact, W.assign(100) creates an assign op. That op needs to be executed in a session to take effect.


Note that we don’t have to initialize W in this case, because assign() does it for us. In fact, the initializer op is an assign op that assigns the variable’s initial value to the variable itself.
For simple incrementing and decrementing of variables, TensorFlow includes the tf.Variable.assign_add() and tf.Variable.assign_sub() methods. Unlike tf.Variable.assign(), tf.Variable.assign_add() and tf.Variable.assign_sub() don’t initialize your variables for you because these ops depend on the initial values of the variable.
Each session maintains its own copy of variables.
Sometimes, we have two or more independent ops and we’d like to specify which ops should be run first. In this case, we use tf.Graph.control_dependencies([control_inputs]).


We can assemble the graphs first without knowing the values needed for computation.
(Just think about defining a function of x and y without knowing the values of x and y.)
With the graph assembled, we, or our clients, can later supply their own data when they need to execute the computation.
To define a placeholder:
tf.placeholder(dtype, shape=None, name=None)
We can feed as many data points to the placeholder as we want by iterating through the data set and feeding in the values one at a time.


We can feed_dict any feedable tensor. A placeholder is just a way to indicate that something must be fed. Use tf.Graph.is_feedable(tensor) to check if a tensor is feedable or not.
feed_dict can be extremely useful to test models. When you have a large graph and just want to test out certain parts, you can provide dummy values so TensorFlow won’t waste time doing unnecessary computations.
Pros and Cons of placeholder:
Pro: put the data processing outside TensorFlow, making it easy to do in Python
Con: users often end up processing their data in a single thread, creating a data bottleneck that slows execution down
tf.data
tf.data.Dataset.from_tensor_slices((features, labels))
tf.data.Dataset.from_generator(gen, output_types, output_shapes)
For prototyping, feed_dict can be faster and easier to write (Pythonic)
tf.data is tricky to use when you have complicated preprocessing or multiple data sources
NLP data is normally just a sequence of integers. In this case, transferring the data over to GPU is pretty quick, so the speedup of tf.data isn’t that large
How does TensorFlow know what variables to update?


By default, the optimizer trains all the trainable variables its objective function depends on. If there are variables that you do not want to train, you can set the keyword trainable=False when declaring a variable.
Solution for LAZY LOADING
Given the World Development Indicators dataset, X is birth rate and Y is life expectancy. Find a linear relationship between X and Y to predict Y from X.
Phase 1: Assemble the graph
Y_predicted = w * X + b
Phase 2: Train the model
Write log files using a FileWriter
See it on TensorBoard
Huber loss
One way to deal with outliers is to use Huber loss.
If the difference between the predicted value and the real value is small, square it
If it’s large, take its absolute value
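The two cases above can be sketched in plain Python. This uses the standard Huber definition, which adds a factor of 1/2 and an offset so that the two branches meet at the threshold; the threshold name delta and its default of 1.0 are assumptions for illustration, not values from these notes.

```python
def huber_loss(label, prediction, delta=1.0):
    """Standard Huber loss for a single example.

    Quadratic for small residuals, linear for large ones, so outliers
    contribute less than they would under squared error.
    """
    residual = abs(label - prediction)
    if residual <= delta:
        return 0.5 * residual ** 2
    return delta * residual - 0.5 * delta ** 2


print(huber_loss(1.0, 1.5))  # small residual, quadratic branch: 0.125
print(huber_loss(0.0, 3.0))  # large residual, linear branch: 2.5
```

Inside a TensorFlow graph the same branching would have to be expressed with graph ops rather than a Python if statement, since the residual is not known when the graph is built.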


X: image of a handwritten digit
Y: the digit value
Recognize the digit in the image
Phase 1: Assemble the graph
Phase 2: Train the model


Pros and Cons of Graph:
PRO:
Optimizable
· automatic buffer reuse
· constant folding
· inter-op parallelism
· automatic tradeoff between compute and memory
Deployable
· the Graph is an intermediate representation for models
Rewritable
· experiment with automatic device placement or quantization
CON:
Difficult to debug
· errors are reported long after graph construction
· execution cannot be debugged with pdb or print statements
Un-Pythonic
· writing a TensorFlow program is an exercise in metaprogramming
· control flow (e.g., tf.while_loop) differs from Python
· can’t easily mix graph construction with custom data structures
A NumPy-like library for numerical computation with support for GPU acceleration and automatic differentiation, and a flexible platform for machine learning research and experimentation.


Key advantages of eager execution
pdb.set_trace() to your heart’s content
if statements, for loops, recursion
Since TensorFlow 2.0 is coming (a preview version is expected later this year) and eager execution is a central feature of 2.0, I’ll update this post after the release of TensorFlow 2.0. Looking forward to it.
Semantic networks represent concepts as nodes in a graph. Edges in the graph denote semantic relationships between concepts (e.g., DOG IS-A MAMMAL, DOG HAS TAIL), and word meaning is expressed by the number and type of connections to other words. In this framework, word similarity is a function of path length: semantically related words are expected to have shorter paths between them. Semantic networks constitute a somewhat idealized representation that abstracts away from real-world usage: they are traditionally hand-coded by modelers who decide a priori which relationships are most relevant in representing meaning.
More recent work creates a semantic network from word association norms (Nelson, McEvoy, & Schreiber, 1999); however, these can only represent a small fraction of the vocabulary of an adult speaker.
Feature-based models rest on the idea that word meaning can be described in terms of feature lists. Theories tend to differ with respect to their definition of features. In many cases, the features are obtained by asking native speakers to generate attributes they consider important in describing the meaning of a word. This allows the representation of each word by a distribution of numerical values over the feature set.
Admittedly, norming studies have the potential of revealing which dimensions of meaning are psychologically salient. However, a number of difficulties arise when working with such data. For example, the number and types of attributes generated can vary substantially as a function of the amount of time devoted to each word. There are many degrees of freedom in the way that responses are coded and analyzed. And multiple subjects are required to create a representation for each word, which in practice limits elicitation studies to a small lexicon.
This approach has been driven by the assumption that word meaning can be learned from the linguistic environment. Words that are similar in meaning tend to occur in contexts of similar words. Semantic space models capture meaning quantitatively in terms of simple co-occurrence statistics. Words are represented as vectors in a high-dimensional space, where each component corresponds to some co-occurring contextual element. The latter can be words themselves, larger linguistic units such as paragraphs or documents, or even more complex linguistic representations such as n-grams and the argument slots of predicates.
The advantage of taking such a geometric approach is that the similarity of word meanings can be easily quantified by measuring their distance in the vector space, or the cosine of the angle between them.
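As a minimal sketch of the cosine measure mentioned above, the function below computes it for two plain Python lists. The example vectors are made up for illustration and do not come from any corpus.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical co-occurrence vectors for three words.
cat = [4.0, 1.0, 0.0]
dog = [3.0, 2.0, 0.0]
car = [0.0, 1.0, 5.0]

print(cosine_similarity(cat, dog))  # close to 1: similar contexts
print(cosine_similarity(cat, car))  # close to 0: different contexts
```

Because the measure divides by both norms, scaling a vector by a positive constant leaves the similarity unchanged.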
Hyperspace Analogue to Language (HAL) model
Represents each word by a vector where each element of the vector corresponds to a weighted co-occurrence value of that word with some other word.
Latent Semantic Analysis (LSA)
Derives a high-dimensional semantic space for words while using co-occurrence information between words and the passages they occur in. LSA constructs a word-document co-occurrence matrix from a large document collection.
Probabilistic topic models
Offer an alternative to semantic spaces based on the assumption that words observed in a corpus manifest some latent structure linked to topics. Words are represented as a probability distribution over a set of topics (corresponding to coarse-grained senses). Each topic is a probability distribution over words, and the content of the topic is reflected in the words to which it assigns high probability. Topic models are generative: they specify a probabilistic procedure by which documents can be generated. Thus, to make a new document, one first chooses a distribution over topics. Then for each word in that document, one chooses a topic at random according to this distribution and selects a word from that topic. Under this framework, the problem of meaning representation is expressed as one of statistical inference: given some data (words in a corpus), infer the latent structure from which it was generated.
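The generative procedure described above can be sketched in a few lines of plain Python. The topic names, word probabilities and document length below are all made up for illustration, and the document's distribution over topics is simply passed in rather than sampled from a prior.

```python
import random


def generate_document(topic_dist, topics, length, rng):
    """Sample `length` words: pick a topic per word, then a word from that topic."""
    topic_names = list(topic_dist)
    topic_weights = [topic_dist[name] for name in topic_names]
    words = []
    for _ in range(length):
        topic = rng.choices(topic_names, weights=topic_weights)[0]
        vocab = list(topics[topic])
        word_weights = [topics[topic][w] for w in vocab]
        words.append(rng.choices(vocab, weights=word_weights)[0])
    return words


# Hypothetical topics: each is a probability distribution over words.
topics = {
    "sport": {"goal": 0.5, "match": 0.3, "fan": 0.2},
    "finance": {"profit": 0.6, "forecast": 0.4},
}
# Hypothetical distribution over topics for one document.
topic_dist = {"sport": 0.8, "finance": 0.2}

document = generate_document(topic_dist, topics, length=8, rng=random.Random(0))
print(document)
```

Inference in a topic model runs this procedure in reverse: given only the words, estimate the topic distributions that most plausibly produced them.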
It is well known that linguistic structures are compositional (simpler elements are combined to form more complex ones).
Morphemes are combined into words, words into phrases, and phrases into sentences. It is also reasonable to assume that the meaning of sentences is composed of the meanings of individual words or phrases.
Compositionality allows languages to construct complex meanings from combinations of simpler elements. This property is often captured in the following principle: The meaning of a whole is a function of the meaning of the parts. Therefore, whatever approach we take to modeling semantics, representing the meanings of complex structures will involve modeling the way in which meanings combine.
In this article, the authors attempt to bridge the gap in the literature by developing models of semantic composition that can represent the meaning of word combinations as opposed to individual words. Their models are narrower in scope compared with those developed in earlier connectionist work. Their vectors represent words; they are high-dimensional but relatively structured, and every component corresponds to a predefined context in which the words are found. The authors take it as a defining property of the vectors they consider that the values of their components are derived from event frequencies such as the number of times a given word appears in a given context. With this in mind, they present a general framework for vector-based composition that allows them to consider different classes of models. Specifically, they formulate composition as a function of two vectors and introduce models based on addition and multiplication. They also investigate how the choice of the underlying semantic representation interacts with the choice of composition function by comparing a spatial model that represents words as vectors in a high-dimensional space against a probabilistic model that represents words as topic distributions. They assess the performance of these models directly on a similarity task. They elicit similarity ratings for pairs of adjective–noun, noun–noun, and verb–object constructions and examine the strength of the relationship between similarity ratings and the predictions of their models.
Express the composition of two constituents, u and v, in terms of a function f acting on those constituents: $p = f(u, v)$ (Eq. 1).
A further refinement of the above principle takes the role of syntax into account: the meaning of a whole is a function of the meaning of the parts and of the way they are syntactically combined. The authors thus modify the composition function in Eq. 1 to account for the fact that there is a syntactic relation R between constituents u and v: $p = f(u, v, R)$.
Even this formulation may not be fully adequate. The meaning of the whole is greater than the meaning of the parts. The implication here is that language users are bringing more to the problem of constructing complex meanings than simply the meaning of the parts and their syntactic relations. This additional information includes both knowledge about the language itself and also knowledge about the real world.
A full understanding of the compositional process involves an account of how novel interpretations are integrated with existing knowledge. The composition function needs to be augmented to include an additional argument, K, representing any knowledge utilized by the compositional process: $p = f(u, v, R, K)$.
Compositionality is a matter of degree rather than a binary notion. Linguistic structures range from fully compositional (e.g., black hair), to partly compositional, syntactically fixed expressions (e.g., take advantage), in which the constituents can still be assigned separate meanings, to non-compositional idioms (e.g., kick the bucket) or multi-word expressions (e.g., by and large), whose meaning cannot be distributed across their constituents.
Within symbolic logic, compositionality is accounted for elegantly by assuming a tight correspondence between syntactic expressions and semantic form. In this tradition, the meaning of a phrase or sentence is its truth conditions, which are expressed in terms of truth relative to a model. In classical Montague grammar, for each syntactic category there is a uniform semantic type (e.g., sentences express propositions; nouns and adjectives express properties of entities; verbs express properties of events). Most lexical meanings are left unanalyzed and treated as primitive.
A noun is represented by a logical symbol. A verb is represented by a function from entities to propositions, expressed in lambda calculus. (Well, I’m not familiar with lambda calculus yet, but the idea is similar to predicate logic in discrete mathematics.)
For example, the proper noun John is represented by the logical symbol JOHN denoting a specific entity, and the verb wrote is represented as λx.WROTE(x). Applying this function to the entity JOHN yields the logical formula WROTE(JOHN) as a representation of the sentence John wrote. It is worth noting that the entity and predicate within this formula are represented symbolically, and that the connection between a symbol and its meaning is an arbitrary matter of convention.
Advantage
Allows composition to be carried out syntactically.
The laws of deductive logic in particular can be defined as syntactic processes which act irrespective of the meanings of the symbols involved.
Disadvantage
Abstracting away from the actual meanings may not be fully adequate for modeling semantic composition.
For example, doesn’t mean that John is a good lawyer.
Modeling semantic composition means modeling the way in which meanings combine, and this requires that words have representations which are richer than single, arbitrary symbols.
The key premise here is that knowledge is represented not as discrete symbols that enter into symbolic expressions, but as patterns of activation distributed over many processing elements. These representations are distributed in the sense that any single concept is represented as a pattern, that is, vector, of activation over many elements(nodes or units) that are typically assumed to correspond to neurons or small collections of neurons.
Tensor product
The tensor product is a matrix whose components are all the possible products of the components of vectors u and v.
The tensor product has dimensionality m×n, which grows exponentially in size as more constituents are composed.
Holographic reduced representation
The tensor product is projected onto the space of the original vectors, thus avoiding any dimensionality increase.
The projection is defined in terms of circular convolution, which compresses the tensor product of two vectors. The compression is achieved by summing along the transdiagonal(?) elements of the tensor product. Noisy versions of the original vectors can be recovered by means of circular correlation, which is the approximate inverse of circular convolution. The success of circular correlation crucially depends on the components of the n-dimensional vectors u and v being real numbers and randomly distributed with mean 0 and variance 1/n.
Binary spatter codes
Binary spatter codes are a particularly simple form of holographic reduced representation. Typically, these vectors are random bit strings or binary N-vectors (e.g., N=10000). Compositional representations are synthesized from parts or chunks. Chunks are combined by binding, which is the same as taking the exclusive or (XOR) of two vectors. Here, only the transdiagonal elements of the tensor product of two vectors are kept and the rest are discarded.
Both spatter codes and holographic reduced representations can be implemented efficiently and the dimensionality of the resulting vector does not change.
The downside is that operations like circular convolution are a form of lossy compression that introduces noise into the representation. To retrieve the original vectors from their bindings, a cleanup memory process is usually employed where the noisy vector is compared with all component vectors in order to find the closest one.
Premise: Words occurring within similar contexts are semantically similar.
Semantic space models extract from a corpus a set of counts representing the occurrences of a target word t in the specific context c of choice and then map these counts into the components of a vector in some space.
Semantic space models resemble the representations used in the connectionist literature. Words are represented as vectors and their meaning is distributed across many dimensions. Crucially, the vector components are neither binary nor randomly distributed (compared with the holographic reduced representations and binary spatter codes mentioned above). They correspond to co-occurrence counts, and it is assumed that differences in meaning arise from differences in the distribution of these counts across contexts.
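To make the idea of co-occurrence counts concrete, here is a minimal sketch that builds one count vector per target word. It assumes the context of a word is the set of words within a fixed window around it; the window size and the toy sentence are illustrative choices, not details from the paper.

```python
from collections import Counter, defaultdict


def cooccurrence_vectors(tokens, window=2):
    """Map each target word to a Counter of the words seen within `window` positions."""
    counts = defaultdict(Counter)
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                counts[target][tokens[j]] += 1
    return counts


tokens = "the fan saw the player".split()
vectors = cooccurrence_vectors(tokens, window=1)

print(dict(vectors["the"]))  # {'fan': 1, 'saw': 1, 'player': 1}
print(dict(vectors["saw"]))  # {'fan': 1, 'the': 1}
```

Fixing an ordering over the context words turns each Counter into a vector whose components are the counts.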
Aim: construct vector representations for phrases and sentences.
Note: the problem of combining semantic vectors to make a representation of a multi-word phrase is different from the problem of how to incorporate information about multi-word contexts into a distributional representation for a single target word.
Define p, the composition of vectors u and v, representing a pair of words which stand in some syntactic relation R, given some background knowledge K, as: $p = f(u, v, R, K)$.
To begin with, just ignore K so as to explore what can be achieved in the absence of any background or world knowledge.
Assumption
Constituents are represented by vectors which subsequently combine in some way to produce a new vector.
p lies in the same space as u and v. This essentially means that all syntactic types have the same dimensionality.
The restriction renders the composition problem computationally feasible.
A hypothetical semantic space for practical and difficulty
| | music | solution | economy | craft | reasonable |
| --- | --- | --- | --- | --- | --- |
| practical | 0 | 6 | 2 | 10 | 4 |
| difficulty | 1 | 8 | 4 | 4 | 0 |
The additive model, $p = u + v$, assumes that composition is a symmetric function of the constituents; in other words, the order of constituents essentially makes no difference.
This might be reasonable for certain structures, a list perhaps.
A sum of predicate, argument, and a number of neighbors of the predicate: $p = u + v + \sum_i n_i$
Model the composition of a predicate with its argument in a manner that distinguishes the role of these constituents, making use of the lexicon of semantic representations to identify the features of each constituent relevant to their combination.
Considerable latitude is allowed in selecting the appropriate neighbors. Kintsch(2001) considers only the m most similar neighbors to the predicate, from which he subsequently selects k, those most similar to its argument.
E.g., in the composition of practical with difficulty, the chosen neighbor is problem, with ,
In this process, the selection of relevant neighbors for the predicate plays a role similar to the integration of a representation with existing background knowledge in the original construction–integration model. Here, background knowledge takes the form of the lexicon from which the neighbors are drawn.
Weight the constituents differentially in the summation: $p = \alpha u + \beta v$
This makes the composition function asymmetric in u and v, allowing their distinct syntactic roles to be recognized.
E.g., set α to 0.4 and β to 0.6.
(there’s some calculation mistake in the paper)
Extreme form
One of the vectors (u) contributes nothing at all to the combination, i.e., $p = v$. It can serve as a simple baseline against which to compare more sophisticated models.
$p = u \odot v$, where the symbol $\odot$ represents multiplication of the corresponding components: $p_i = u_i \cdot v_i$
It is still a symmetric function and thus does not take word order or syntax into account.
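The additive, weighted additive and multiplicative models can be checked by hand on the hypothetical semantic space for practical and difficulty given earlier. The sketch below assumes u is practical and v is difficulty, and uses the weights 0.4 and 0.6 from the example above; the function names are made up for illustration.

```python
practical = [0, 6, 2, 10, 4]
difficulty = [1, 8, 4, 4, 0]


def additive(u, v):
    """p = u + v"""
    return [a + b for a, b in zip(u, v)]


def weighted_additive(u, v, alpha, beta):
    """p = alpha * u + beta * v"""
    return [alpha * a + beta * b for a, b in zip(u, v)]


def multiplicative(u, v):
    """p_i = u_i * v_i"""
    return [a * b for a, b in zip(u, v)]


print(additive(practical, difficulty))        # [1, 14, 6, 14, 4]
print(multiplicative(practical, difficulty))  # [0, 48, 8, 40, 0]

weighted = weighted_additive(practical, difficulty, 0.4, 0.6)
print([round(x, 1) for x in weighted])        # [0.6, 7.2, 3.2, 6.4, 1.6]
```

Swapping the arguments leaves the additive and multiplicative results unchanged, which is the symmetry noted above, while the weighted sum changes whenever the two weights differ.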
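$p = u \otimes v$, where the symbol $\otimes$ stands for the operation of taking all pairwise products of the components of u and v: $p_{i,j} = u_i \cdot v_j$
$p = u \circledast v$, where the symbol $\circledast$ stands for a compression of the tensor product based on summing along its transdiagonal elements: $p_i = \sum_j u_j \cdot v_{i-j}$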
Subscripts are interpreted modulo n which gives the operation its circular nature.
(the result in the paper is reversed, which seems wrong)
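Circular convolution is easy to compute directly, which also lets us check the example by hand. The sketch below assumes the definition $p_i = \sum_j u_j \cdot v_{i-j}$ with subscripts taken modulo n, and uses the practical and difficulty vectors from the hypothetical semantic space above.

```python
def circular_convolution(u, v):
    """p_i = sum_j u_j * v_{(i - j) mod n} for two vectors of length n."""
    n = len(u)
    return [
        sum(u[j] * v[(i - j) % n] for j in range(n))
        for i in range(n)
    ]


practical = [0, 6, 2, 10, 4]
difficulty = [1, 8, 4, 4, 0]

print(circular_convolution(practical, difficulty))  # [80, 62, 66, 50, 116]
```

The result has the same length as the inputs, so composing more constituents does not increase dimensionality, and the operation is commutative, so it is symmetric in u and v.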
I don’t yet understand why multiplicative functions affect only magnitude and not direction, while additive models can have a considerable effect on both magnitude and direction, given that cosine similarity is itself insensitive to the magnitudes of vectors.
To see how the vector u can be thought of as something that modifies v, consider the partial product of C with u, producing a matrix which is called U.
Here, the composition function can be thought of as the action of a matrix, U, representing one constituent, on a vector v, representing the other constituent. Since the authors decided to use only vectors, they simply borrow this insight: map a constituent vector, u, onto a matrix, U, while still representing all words with vectors.
U’s off-diagonal elements are zero and U’s diagonal elements are equal to the components of u.
The action of this matrix on v is a type of dilation, in that it stretches and squeezes v in various directions. Specifically, v is scaled by a factor of $u_i$ along the ith basis direction.
The drawback of this process is that its results are dependent on the basis used.
Ideally, we would like to have a basis-independent composition, that is, one which is based solely on the geometry of u and v. One way to achieve basis independence is by dilating v along the direction of u, rather than along the basis directions. Just decompose v into a component x parallel to u and a component y orthogonal to u, with $x = \frac{u \cdot v}{u \cdot u} u$ and $y = v - x$, and then stretch the parallel component to modulate v to be more like u.
By dilating x by a factor λ, while leaving y unchanged, we get a modified vector v’, which has been stretched to emphasize the contribution of u: $v' = \lambda x + y = (\lambda - 1) \frac{u \cdot v}{u \cdot u} u + v$
Multiplying through by $u \cdot u$ makes the expression easier to work with (since the cosine similarity function is insensitive to the magnitudes of vectors): $p = (u \cdot u) v + (\lambda - 1)(u \cdot v) u$
From the given example, and . Assuming λ=2 and we can get
(Again, there are some mistakes in the paper: it confuses the coefficients u·u and u·v.)
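The dilation model can be worked through numerically. The sketch below assumes the scaled form $p = (u \cdot u) v + (\lambda - 1)(u \cdot v) u$, takes u to be practical and v to be difficulty from the hypothetical semantic space, and sets λ = 2; which word plays the role of u is an assumption made here.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def dilate(u, v, lam):
    """Stretch v along the direction of u: p = (u.u) v + (lam - 1)(u.v) u."""
    uu = dot(u, u)
    uv = dot(u, v)
    return [uu * b + (lam - 1) * uv * a for a, b in zip(u, v)]


practical = [0, 6, 2, 10, 4]   # u
difficulty = [1, 8, 4, 4, 0]   # v

print(dot(practical, practical))          # u.u = 156
print(dot(practical, difficulty))         # u.v = 96
print(dilate(practical, difficulty, 2))   # [156, 1824, 816, 1584, 384]
```

With λ = 1 the parallel component is not stretched at all, so the result is just v scaled by u·u, which has the same direction as v.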


Count model
The count model learns word vectors by calculating the co-occurrence frequency of each word with features.
Predict model
The predict model learns word vectors by maximizing the probability of the contexts in which the word is observed in the corpus.
The Skip-Gram and CBOW models included in the word2vec tool are the most widely used for generating word vectors.
A method described by many works. I’m planning to write another post about Word2Vec in detail.
There are some mistakes in this paper’s introduction of the retrofitting method (e.g., the neighbor word vector should be instead of which is confusing), so I went to the original paper describing the method: Retrofitting Word Vectors to Semantic Lexicons.
Let $V = \{w_1, \ldots, w_n\}$ be a vocabulary, i.e., the set of word types (well, I don’t know why it’s the set of word types…), and $\Omega$ be an ontology that encodes semantic relations between words in $V$. $\Omega$ is an undirected graph $(V, E)$ with one vertex for each word type and edges $(w_i, w_j) \in E$ indicating a semantic relationship of interest.
The matrix $\hat{Q}$ will be the collection of vector representations $\hat{q}_i \in \mathbb{R}^d$, for each $w_i \in V$, learned using a standard data-driven technique, where $d$ is the length of the word vectors. The objective is to learn the matrix $Q = (q_1, \ldots, q_n)$ such that the columns are both close (under a distance metric) to their counterparts in $\hat{Q}$ and to adjacent vertices in $\Omega$.
The distance between a pair of vectors is defined to be the Euclidean distance. Since we want the inferred word vector $q_i$ to be close to the observed value $\hat{q}_i$ and close to its neighbors $q_j$, for all $j$ such that $(i, j) \in E$, the objective to be minimized becomes:
$\Psi(Q) = \sum_{i=1}^{n} \left[ \alpha_i \lVert q_i - \hat{q}_i \rVert^2 + \sum_{(i,j) \in E} \beta_{ij} \lVert q_i - q_j \rVert^2 \right]$
where the $\alpha$ and $\beta$ values control the relative strengths of associations.
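A minimal sketch of how such an objective can be minimized: repeatedly replace each word vector by a weighted average of its original vector and its neighbors' current vectors. The update rule and the weights (α = 1, β = 1 / number of neighbors) are the setting I recall from the retrofitting paper, so treat them as assumptions; the tiny lexicon and vectors are made up.

```python
def retrofit(q_hat, edges, alpha=1.0, iterations=10):
    """Pull each vector towards its lexicon neighbours while staying near q_hat.

    q_hat: dict mapping word -> original vector (list of floats)
    edges: dict mapping word -> list of neighbouring words in the lexicon
    """
    q = {word: list(vec) for word, vec in q_hat.items()}
    for _ in range(iterations):
        for word, neighbours in edges.items():
            neighbours = [n for n in neighbours if n in q]
            if word not in q or not neighbours:
                continue
            beta = 1.0 / len(neighbours)  # assumed weighting: 1 / degree
            total = alpha + beta * len(neighbours)
            q[word] = [
                (alpha * q_hat[word][k] + beta * sum(q[n][k] for n in neighbours)) / total
                for k in range(len(q[word]))
            ]
    return q


# Hypothetical vectors and a hypothetical synonym lexicon.
q_hat = {"happy": [1.0, 0.0], "glad": [0.0, 1.0], "table": [5.0, 5.0]}
edges = {"happy": ["glad"], "glad": ["happy"]}

q = retrofit(q_hat, edges)
print(q["happy"], q["glad"], q["table"])
```

Words with no edges in the lexicon, such as table here, keep their original vectors, while linked words move closer together without collapsing onto each other.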
Related paper: From Paraphrase Database to Compositional Paraphrase Model and Back
Train word vectors with a contrastive max-margin objective function. Specifically, the training data consist of a set X of word paraphrase pairs , while are negative examples that are the most similar word pairs to in a mini-batch during optimization. The objective function is given as follows:
where is the regularization parameter, is the length of training paraphrase pairs, is the target word vector matrix, and is the initial word vector matrix.
For the bigram phrase similarity task, there are two general types of training data in existing work.
Pseudo-word training data
Consists of tuples in the form {adjective1 noun1, adjective1noun1}. E.g., {older man, olderman}.
Pseudo-word training data have been widely used in composition models.
Pair training data
Consists of tuples in the form {adjective1 noun1, adjective2 noun2}. E.g., {older man, elderly woman}.
For multi-word phrases, pair training data is the first choice because it is hard to learn the representation of pseudo-words with multiple words. So this paper uses pair training data for multi-word phrase experiments.
With a phrase p consisting of two words and , we can establish equation as follows:
where are the word representations, is the phrase representation, and is the composition function.
The Additive model assumes that the meaning of the composition is a linear combination of the constituent words.
The Additive model describes the phenomenon that people concatenate the meanings of two words when understanding phrases.
The Multiplicative model assumes that the meaning of the composition is the element-wise product of the two vectors.
The Multiplicative model has the highest correlation with the neural activity observed in the human brain when reading adjective-noun phrase data.
Composition functions such as Matrix, RecNN (Recursive Neural Network) and RNN transform word vectors into another vector space through matrix transformations and nonlinear functions. These three functions differ primarily in the order of transformation.
The Matrix model first transforms component words into another vector space and then composes them using addition. The RecNN model takes word order into consideration: it concatenates the component word vectors and then transforms the resulting vector using a matrix and a nonlinearity. The RNN model composes words in a sentence from left to right by forming new representations from previous representations and the representation of the current word.
What is natural language processing?
computers using natural language as input and/or output
NLP contains two types: understanding (NLU) and generation (NLG).
Machine Translation
Translate from one language to another.
Information Extraction
Take some text as input, and produce a structured (database) representation of some key content in the text.
Goal: Map a document collection to structured database
Motivation: Complex searches, Statistical queries.
Text Summarization
Take a single document, or potentially a group of several documents, and try to condense them down to summarize main information in those documents.
Dialogue Systems
Humans can interact with computers.
Strings to Tagged Sequences
Examples:
Part-of-speech tagging
Profits(/N) soared(/V) at(/P) Boeing(/N) Co.(/N) .(/.) easily(/ADV) topping(/V) forecasts(/N) on (/P) Wall(/N) Street(/N) .(/.)
Named Entity Recognition
Profits(/NA) soared(/NA) at(/NA) Boeing(/SC) Co.(/CC) .(/NA) easily(/NA) topping(/NA) forecasts(/NA) on (/NA) Wall(/SL) Street(/CL) .(/.)
/NA: not any entity
/SC: start of company
/CC: continuation of company
/SL: start of location
/CL: continuation of location
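Once a sequence is tagged this way, the entities can be read off by grouping each start tag with the continuation tags that follow it. The sketch below is a hypothetical helper, not part of any library; it assumes the SC/CC/SL/CL/NA tag set above and, for brevity, does not check that a continuation tag matches the type of the entity it extends.

```python
def extract_entities(tagged):
    """Group (token, tag) pairs into (entity text, entity type) pairs."""
    entities = []
    tokens, kind = [], None
    for token, tag in tagged:
        if tag in ("SC", "SL"):             # start of a new entity
            if tokens:
                entities.append((" ".join(tokens), kind))
            tokens = [token]
            kind = "company" if tag == "SC" else "location"
        elif tag in ("CC", "CL") and tokens:  # continuation of the open entity
            tokens.append(token)
        else:                                 # NA (or a stray continuation)
            if tokens:
                entities.append((" ".join(tokens), kind))
            tokens, kind = [], None
    if tokens:
        entities.append((" ".join(tokens), kind))
    return entities


tagged = [
    ("Profits", "NA"), ("soared", "NA"), ("at", "NA"),
    ("Boeing", "SC"), ("Co.", "CC"), (".", "NA"),
    ("easily", "NA"), ("topping", "NA"), ("forecasts", "NA"), ("on", "NA"),
    ("Wall", "SL"), ("Street", "CL"), (".", "NA"),
]

print(extract_entities(tagged))
# [('Boeing Co.', 'company'), ('Wall Street', 'location')]
```

This is why entity recognition can be treated as a tagging problem: the per-token tags carry enough information to recover the entity spans.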
At last, a computer that understands you like your mother.
This sentence can be interpreted in different ways:
1. It understands you as well as your mother understands you.
2. It understands (that) you like your mother.
3. It understands you as well as it understands your mother.
Ambiguity at Many Levels:
Acoustic level
In speech recognition:
1. “… a computer that understands you like your mother”
2. “… a computer that understands lie cured mother”
Syntactic level
Semantic level
A word may have a variety of meanings and it may cause word sense ambiguity.
Discourse (multiclause) level
Alice says they’ve built a computer that understands you like your mother
But she…
… doesn’t know any details
… doesn’t understand me at all
This is an instance of anaphora, where "she" refers to some other discourse entity.
· We have some (finite) vocabulary $\mathcal{V}$
· We have an (infinite) set of strings, $\mathcal{V}^{\dagger}$:
the STOP
a STOP
the fan STOP
the fan saw Beckham STOP
the fan saw saw STOP
the fan saw Beckham play for Real Madrid STOP
· We have a training sample of example sentences in English
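· We need to "learn" a probability distribution p, i.e., p is a function that satisfies $\sum_{x \in \mathcal{V}^{\dagger}} p(x) = 1$ and $p(x) \ge 0$ for all $x \in \mathcal{V}^{\dagger}$
Definition: A language model consists of a finite set $\mathcal{V}$, and a function $p(x_1, x_2, \ldots, x_n)$ such that:
1. For any $\langle x_1 \ldots x_n \rangle \in \mathcal{V}^{\dagger}$, $p(x_1, x_2, \ldots, x_n) \ge 0$
2. In addition, $\sum_{\langle x_1 \ldots x_n \rangle \in \mathcal{V}^{\dagger}} p(x_1, x_2, \ldots, x_n) = 1$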
Language models are very useful in a broad range of applications, the most obvious perhaps being speech recognition and machine translation. In many applications it is very useful to have a good “prior” distribution over which sentences are or aren’t probable in a language. For example, in speech recognition the language model is combined with an acoustic model that models the pronunciation of different words: one way to think about it is that the acoustic model generates a large number of candidate sentences, together with probabilities; the language model is then used to reorder these possibilities based on how likely they are to be a sentence in the language.
The techniques we describe for defining the function p, and for estimating the parameters of the resulting model from training examples, will be useful in several other contexts during the course: for example in hidden Markov models and in models for natural language parsing.
We have N training sentences.
For any sentence $x_1 \ldots x_n$, $c(x_1 \ldots x_n)$ is the number of times the sentence is seen in our training data. A naive estimate is $p(x_1 \ldots x_n) = \frac{c(x_1 \ldots x_n)}{N}$.
This is a poor model: in particular, it will assign probability 0 to any sentence not seen in the training corpus. Thus it fails to generalize to sentences that have not been seen in the training data. The key technical contribution of this chapter will be to introduce methods that do generalize to sentences that are not seen in our training data.
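First-Order Markov Processes
$P(X_1 = x_1, \ldots, X_n = x_n) = P(X_1 = x_1) \prod_{i=2}^{n} P(X_i = x_i \mid X_1 = x_1, \ldots, X_{i-1} = x_{i-1}) = P(X_1 = x_1) \prod_{i=2}^{n} P(X_i = x_i \mid X_{i-1} = x_{i-1})$
The first step is exact: by the chain rule of probabilities, any distribution can be written in this form. So we have made no assumptions in this step of the derivation. However, the second step is not necessarily exact: we have made the assumption that for any $i \in \{2 \ldots n\}$, for any $x_1 \ldots x_i$, $P(X_i = x_i \mid X_1 = x_1, \ldots, X_{i-1} = x_{i-1}) = P(X_i = x_i \mid X_{i-1} = x_{i-1})$.
This is a first-order Markov assumption. We have assumed that the identity of the i'th word in the sequence depends only on the identity of the previous word, $x_{i-1}$. More formally, we have assumed that the value of $X_i$ is conditionally independent of $X_1 \ldots X_{i-2}$, given the value of $X_{i-1}$.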
Second-Order Markov Processes
$P(X_1 = x_1, \ldots, X_n = x_n) = \prod_{i=1}^{n} P(X_i = x_i \mid X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1})$
(For convenience we assume $x_0 = x_{-1} = *$, where * is a special "start" symbol.)
Compared with the first-order Markov process, we make a slightly weaker assumption, namely that each word depends on the previous two words in the sequence. The second-order Markov process will form the basis of trigram language models.
The length of the sequence, n, can itself vary. We just assume that the n'th word in the sequence, $X_n$, is always equal to a special symbol, the STOP symbol. This symbol can only appear at the end of a sequence and it does not belong to the set $\mathcal{V}$.
Process:
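A trigram language model consists of a finite set $\mathcal{V}$, and a parameter $q(w \mid u, v)$ for each trigram u, v, w such that $w \in \mathcal{V} \cup \{\text{STOP}\}$, and $u, v \in \mathcal{V} \cup \{*\}$. The value for $q(w \mid u, v)$ can be interpreted as the probability of seeing the word w immediately after the bigram (u, v). For any sentence $x_1 \ldots x_n$ where $x_i \in \mathcal{V}$ for i = 1…(n−1), and $x_n = \text{STOP}$, the probability of the sentence under the trigram language model is $p(x_1 \ldots x_n) = \prod_{i=1}^{n} q(x_i \mid x_{i-2}, x_{i-1})$, where we define $x_0 = x_{-1} = *$.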
Estimation Problem:
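A natural estimate (the "maximum likelihood estimate"): $q(w \mid u, v) = \frac{c(u, v, w)}{c(u, v)}$, where $c(u, v, w)$ and $c(u, v)$ are the numbers of times the trigram and the bigram are seen in the training corpus.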
For example:
This way of estimating parameters runs into a very serious issue. Say our vocabulary size is $N = |\mathcal{V}|$; then there are $N^3$ parameters in the model. This leads to two problems:
· Many of the above estimates will be $q(w \mid u, v) = 0$, due to the count in the numerator being 0. This will lead to many trigram probabilities being systematically underestimated: it seems unreasonable to assign probability 0 to any trigram not seen in training data, given that the number of parameters of the model is typically very large in comparison to the number of words in the training corpus.
· In cases where the denominator $c(u, v)$ is equal to zero, the estimate is not well defined.
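The maximum likelihood estimate above is a ratio of two counts, which can be sketched in a few lines. This is only an illustration: the `TrigramCounts` struct, its member names, and the choice of returning -1 for an undefined estimate are all assumptions made here, not part of the notes.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical container for trigram counts c(u, v, w) and bigram counts c(u, v).
struct TrigramCounts {
    std::map<std::tuple<std::string, std::string, std::string>, int> tri;
    std::map<std::pair<std::string, std::string>, int> bi;

    // Count every trigram of one sentence, padded with "*" "*" and ended by "STOP".
    void add_sentence(const std::vector<std::string>& words) {
        std::vector<std::string> x;
        x.push_back("*");
        x.push_back("*");
        x.insert(x.end(), words.begin(), words.end());
        x.push_back("STOP");
        for (std::size_t i = 2; i < x.size(); ++i) {
            ++tri[std::make_tuple(x[i - 2], x[i - 1], x[i])];
            ++bi[std::make_pair(x[i - 2], x[i - 1])];
        }
    }

    // q_ML(w | u, v) = c(u, v, w) / c(u, v).
    // Returns -1 when c(u, v) == 0, where the estimate is not well defined.
    double q_ml(const std::string& u, const std::string& v, const std::string& w) const {
        auto b = bi.find(std::make_pair(u, v));
        if (b == bi.end()) return -1.0;
        auto t = tri.find(std::make_tuple(u, v, w));
        int c_uvw = (t == tri.end()) ? 0 : t->second;
        return static_cast<double>(c_uvw) / b->second;
    }
};
```

With the two training sentences "the dog laughs" and "the dog barks", `q_ml("the", "dog", "laughs")` is 1/2, an unseen continuation such as "sleeps" gets 0, and an unseen context gets the undefined marker; these are exactly the two problems listed above.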
Perplexity
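We have some test data sentences $x^{(1)}, x^{(2)}, \ldots, x^{(m)}$. Note that test sentences are "held out", in the sense that they are not part of the corpus used to estimate the language model. (Just the same as in normal machine learning problems.)
We could look at the probability under our model, $\prod_{i=1}^{m} p(x^{(i)})$.
More conveniently, the log probability: $\log \prod_{i=1}^{m} p(x^{(i)}) = \sum_{i=1}^{m} \log p(x^{(i)})$
In fact the usual evaluation measure is perplexity: $\text{Perplexity} = 2^{-l}$, where $l = \frac{1}{M} \sum_{i=1}^{m} \log_2 p(x^{(i)})$
and M is the total number of words in the test data, $M = \sum_{i=1}^{m} n_i$, where $n_i$ is the length of the i'th test sentence.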
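$q_{ML}(w \mid u, v) = \frac{c(u, v, w)}{c(u, v)}$, $q_{ML}(w \mid v) = \frac{c(v, w)}{c(v)}$, $q_{ML}(w) = \frac{c(w)}{c()}$
The subscript ML means maximum-likelihood estimation. $c(w)$ is the number of times word w is seen in the training corpus, and $c()$ is the total number of words seen in the training corpus.
The trigram, bigram and unigram estimates have different strengths and weaknesses. The idea in linear interpolation is to use all three estimates, by taking a weighted average of the three estimates: $q(w \mid u, v) = \lambda_1 q_{ML}(w \mid u, v) + \lambda_2 q_{ML}(w \mid v) + \lambda_3 q_{ML}(w)$
Here $\lambda_1$, $\lambda_2$ and $\lambda_3$ are three additional parameters of the model, which satisfy $\lambda_1 + \lambda_2 + \lambda_3 = 1$ and $\lambda_i \ge 0$ for all i.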
(Our estimate correctly defines a distribution.)
There are various ways of estimating the λ values. A common one is as follows. Say we have some additional heldout data(development data), which is separate from both our training and test corpora.
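Define $c'(u, v, w)$ to be the number of times that the trigram (u, v, w) is seen in the development data.
We would like to choose our λ values to maximize $L(\lambda_1, \lambda_2, \lambda_3) = \sum_{u, v, w} c'(u, v, w) \log q(w \mid u, v)$. (This is equivalent to minimizing perplexity on the development data.)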
The three parameters can be interpreted as an indication of the confidence, or weight, placed on each of the trigram, bigram and unigram estimates.
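In practice, it is important to add an additional degree of freedom, by allowing the values for $\lambda_1$, $\lambda_2$ and $\lambda_3$ to vary with the history.
Take a function $\Pi$ that partitions histories into a small number of buckets,
e.g., according to the count $c(u, v)$ of the history bigram.
Introduce a dependence of the λ's on the partition: $q(w \mid u, v) = \lambda_1^{\Pi(u, v)} q_{ML}(w \mid u, v) + \lambda_2^{\Pi(u, v)} q_{ML}(w \mid v) + \lambda_3^{\Pi(u, v)} q_{ML}(w)$
where $\lambda_1^{\Pi(u, v)} + \lambda_2^{\Pi(u, v)} + \lambda_3^{\Pi(u, v)} = 1$, and $\lambda_i^{\Pi(u, v)} \ge 0$ for all i.
By the way, $\lambda_1$ should be 0 if $c(u, v) = 0$, because in this case the trigram estimate is undefined. Similarly, if both $c(u, v)$ and $c(v)$ are equal to zero, we need $\lambda_1 = \lambda_2 = 0$.
In the referenced note, a simple method is introduced:
$\lambda_1 = \frac{c(u, v)}{c(u, v) + \gamma}$, $\lambda_2 = (1 - \lambda_1) \times \frac{c(v)}{c(v) + \gamma}$, $\lambda_3 = 1 - \lambda_1 - \lambda_2$, with γ > 0.
Under this definition, it can be seen that $\lambda_1$ increases as $c(u, v)$ increases, and similarly that $\lambda_2$ increases as $c(v)$ increases. This method is relatively crude, and is not likely to be optimal. It is, however, very simple, and in practice it can work well in some applications.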
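The interpolated estimate and the simple count-based choice of the λ values can be sketched as two small functions. The names `Lambdas`, `count_based_lambdas` and `interpolate` are hypothetical, and the three ML estimates are assumed to be computed elsewhere and passed in as plain numbers.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical bundle of the three interpolation weights.
struct Lambdas {
    double l1, l2, l3;
};

// Count-based weights: l1 grows with c(u, v), l2 grows with c(v), and the
// three values always sum to 1. gamma must be greater than 0.
Lambdas count_based_lambdas(double c_uv, double c_v, double gamma) {
    Lambdas lam;
    lam.l1 = c_uv / (c_uv + gamma);
    lam.l2 = (1.0 - lam.l1) * c_v / (c_v + gamma);
    lam.l3 = 1.0 - lam.l1 - lam.l2;
    return lam;
}

// q(w | u, v) = l1 * q_ML(w | u, v) + l2 * q_ML(w | v) + l3 * q_ML(w)
double interpolate(const Lambdas& lam, double q_tri, double q_bi, double q_uni) {
    return lam.l1 * q_tri + lam.l2 * q_bi + lam.l3 * q_uni;
}
```

When both counts are zero the weights collapse to (0, 0, 1), so the model falls back to the unigram estimate, which matches the requirement stated above.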
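For any bigram count $c(v, w)$ such that $c(v, w) > 0$, we define the discounted count as $c^*(v, w) = c(v, w) - \beta$
where β is a value between 0 and 1 (a typical value might be β = 0.5). Thus we simply subtract a constant value, β, from the count. This reflects the intuition that if we take counts from the training corpus, we will systematically overestimate the probability of bigrams seen in the corpus (and underestimate bigrams not seen in the corpus).
For any bigram such that $c(v, w) > 0$, we can then define $q(w \mid v) = \frac{c^*(v, w)}{c(v)}$. For any context v, this definition leads to some missing probability mass, defined as $\alpha(v) = 1 - \sum_{w : c(v, w) > 0} \frac{c^*(v, w)}{c(v)}$
The intuition behind discounting methods is to divide this "missing mass" between the words w such that $c(v, w) = 0$.
Define two sets: $\mathcal{A}(v) = \{w : c(v, w) > 0\}$ and $\mathcal{B}(v) = \{w : c(v, w) = 0\}$
Then the backoff model is defined as $q_{BO}(w \mid v) = \frac{c^*(v, w)}{c(v)}$ if $w \in \mathcal{A}(v)$, and $q_{BO}(w \mid v) = \alpha(v) \times \frac{q_{ML}(w)}{\sum_{w' \in \mathcal{B}(v)} q_{ML}(w')}$ if $w \in \mathcal{B}(v)$
Thus if $c(v, w) > 0$ we return the estimate $\frac{c^*(v, w)}{c(v)}$; otherwise we divide the remaining probability mass $\alpha(v)$ in proportion to the unigram estimates $q_{ML}(w)$.
The method can be generalized to trigram language models in a natural, recursive way: with $\mathcal{A}(u, v) = \{w : c(u, v, w) > 0\}$ and $\mathcal{B}(u, v) = \{w : c(u, v, w) = 0\}$, define $q_{BO}(w \mid u, v) = \frac{c^*(u, v, w)}{c(u, v)}$ if $w \in \mathcal{A}(u, v)$, and $q_{BO}(w \mid u, v) = \alpha(u, v) \times \frac{q_{BO}(w \mid v)}{\sum_{w' \in \mathcal{B}(u, v)} q_{BO}(w' \mid v)}$ if $w \in \mathcal{B}(u, v)$, where $\alpha(u, v) = 1 - \sum_{w \in \mathcal{A}(u, v)} \frac{c^*(u, v, w)}{c(u, v)}$.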
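A small worked example may make the missing mass concrete. The counts below are hypothetical, chosen only for illustration, with β = 0.5 and a context v that is followed by three distinct words a, b and c in training:

```latex
\begin{align*}
c(v) &= 10, \qquad c(v, a) = 6, \quad c(v, b) = 3, \quad c(v, c) = 1 \\
c^*(v, a) &= 5.5, \quad c^*(v, b) = 2.5, \quad c^*(v, c) = 0.5 \\
\alpha(v) &= 1 - \frac{5.5 + 2.5 + 0.5}{10} = 1 - 0.85 = 0.15
\end{align*}
```

The missing mass 0.15 equals the number of seen bigram types times β divided by the context count, here 3 × 0.5 / 10, and it is this mass that is shared among the unseen words.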
Given the input to the tagging model (referred to as a sentence) $x_1 \ldots x_n$, we use $y_1 \ldots y_n$ to denote the output of the tagging model (referred to as the state sequence or tag sequence).
This type of problem, where the task is to map a sentence to a tag sequence, is often referred to as a sequence labeling problem, or a tagging problem.
We will assume that we have a set of training examples, $(x^{(i)}, y^{(i)})$ for $i = 1 \ldots m$, where each $x^{(i)}$ is a sentence $x_1^{(i)} \ldots x_{n_i}^{(i)}$, and each $y^{(i)}$ is a tag sequence $y_1^{(i)} \ldots y_{n_i}^{(i)}$ (we assume that the i'th training example is of length $n_i$). Hence $x_j^{(i)}$ is the j'th word in the i'th training example, and $y_j^{(i)}$ is the tag for that word. Our task is to learn a function that maps sentences to tag sequences from these training examples.
POS Tagging: Part-of-Speech tagging.
The input to the problem is a sentence. The output is a tagged sentence, where each word in the sentence is annotated with its part of speech. Our goal will be to construct a model that recovers POS tags for sentences with high accuracy. POS tagging is one of the most basic problems in NLP, and is useful in many natural language applications.
Ambiguity
Many words in English can take several possible parts of speech(as well as in Chinese and many other languages). A word can be a noun as well as a verb. E.g., look, result…
Rare words
Some words are rare and may not be seen in the training examples. Even with say a million words of training data, there will be many words in new sentences which have not been seen in training. It will be important to develop methods that deal effectively with words which have not been seen in training data.
Local
Individual words have statistical preferences for their part of speech.
E.g., can is more likely to be a modal verb rather than a noun.
Contextual
The context has an important effect on the part of speech for a word. In particular, some sequences of POS tags are much more likely than others. If we consider POS trigrams, the sequence D N V will be frequent in English, whereas the sequence D V N is much less likely.
Sometimes these two sources of evidence are in conflict. For example, in the sentence "The trash can is hard to find", the part of speech for "can" is a noun; however, "can" can also be a modal verb, and in fact it is much more frequently seen as a modal verb in English. In this sentence the context has overridden the tendency for "can" to be a verb as opposed to a noun.
For the named-entity problem, the input is again a sentence. The output is the sentence with entity boundaries marked. Recognizing entities such as people, locations and organizations has many applications, and named-entity recognition has been widely studied in NLP research.
Once this mapping has been performed on training examples, we can train a tagging model on these training examples. Given a new test sentence we can then recover the sequence of tags from the model, and it is straightforward to identify the entities identified by the model.
Supervised Learning:
Assume training examples $(x^{(1)}, y^{(1)}) \ldots (x^{(m)}, y^{(m)})$, where each example consists of an input $x^{(i)}$ paired with a label $y^{(i)}$. We use $\mathcal{X}$ to refer to the set of possible inputs, and $\mathcal{Y}$ to refer to the set of possible labels. Our task is to learn a function $f : \mathcal{X} \to \mathcal{Y}$ that maps any input x to a label f(x).
One way to define the function f(x) is through a conditional model. In this approach we define a model that gives the conditional probability $p(y \mid x)$ for any x, y pair. The parameters of the model are estimated from the training examples. Given a new test example x, the output from the model is $f(x) = \arg\max_{y \in \mathcal{Y}} p(y \mid x)$
Thus we simply take the most likely label y as the output from the model. If our model $p(y \mid x)$ is close to the true conditional distribution of labels given inputs, the function f(x) will be close to optimal.
Generative Models
Rather than directly estimating the conditional distribution $p(y \mid x)$, in generative models we instead model the joint probability $p(x, y)$ over $(x, y)$ pairs. The parameters of the model are again estimated from the training examples $(x^{(i)}, y^{(i)})$ for $i = 1 \ldots m$. In many cases we further decompose the probability as $p(x, y) = p(y) p(x \mid y)$ and then estimate the models for $p(y)$ and $p(x \mid y)$ separately. These two model components have the following interpretations:
· $p(y)$ is a prior probability distribution over labels y.
· $p(x \mid y)$ is the probability of generating the input x, given that the underlying label is y.
Given a generative model, we can use Bayes rule to derive the conditional probability $p(y \mid x)$ for any $(x, y)$ pair: $p(y \mid x) = \frac{p(y) p(x \mid y)}{p(x)}$
where $p(x) = \sum_{y \in \mathcal{Y}} p(x, y) = \sum_{y \in \mathcal{Y}} p(y) p(x \mid y)$
We use Bayes rule directly in applying the joint model to a new test example. Given an input x, the output of our model, f(x), can be derived as follows: $f(x) = \arg\max_y p(y \mid x)$ (1) $= \arg\max_y \frac{p(y) p(x \mid y)}{p(x)}$ (2) $= \arg\max_y p(y) p(x \mid y)$ (3)
Eq. 2 follows by Bayes rule. Eq. 3 follows because the denominator, $p(x)$, does not depend on y, and hence does not affect the arg max. This is convenient, because it means that we do not need to calculate $p(x)$, which can be an expensive operation.
Models that decompose a joint probability into terms $p(y)$ and $p(x \mid y)$ are often called noisy-channel models. Intuitively, when we see a test example x, we assume that it has been generated in two steps: first, a label y has been chosen with probability $p(y)$; second, the example x has been generated from the distribution $p(x \mid y)$. The model $p(x \mid y)$ can be interpreted as a "channel" which takes a label y as its input, and corrupts it to produce x as its output. Our task is to find the most likely label y, given that we observe x.
In summary:
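Assume a finite set of words $\mathcal{V}$, and a finite set of tags $\mathcal{K}$. Define $\mathcal{S}$ to be the set of all sequence/tag-sequence pairs $\langle x_1 \ldots x_n, y_1 \ldots y_n \rangle$ such that $n \ge 0$, $x_i \in \mathcal{V}$ and $y_i \in \mathcal{K}$ for $i = 1 \ldots n$. A generative tagging model is then a function p such that: (1) for any $\langle x_1 \ldots x_n, y_1 \ldots y_n \rangle \in \mathcal{S}$, $p(x_1 \ldots x_n, y_1 \ldots y_n) \ge 0$; (2) $\sum_{\langle x_1 \ldots x_n, y_1 \ldots y_n \rangle \in \mathcal{S}} p(x_1 \ldots x_n, y_1 \ldots y_n) = 1$
Hence $p(x_1 \ldots x_n, y_1 \ldots y_n)$ is a probability distribution over pairs of sequences (i.e., a probability distribution over the set $\mathcal{S}$).
Given a generative tagging model, the function from sentences $x_1 \ldots x_n$ to tag sequences $y_1 \ldots y_n$ is defined as $f(x_1 \ldots x_n) = \arg\max_{y_1 \ldots y_n} p(x_1 \ldots x_n, y_1 \ldots y_n)$
where the arg max is taken over all sequences $y_1 \ldots y_n$ such that $y_i \in \mathcal{K}$ for $i = 1 \ldots n$. Thus for any input $x_1 \ldots x_n$, we take the highest probability tag sequence as the output from the model.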
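A trigram HMM consists of a finite set $\mathcal{V}$ of possible words, and a finite set $\mathcal{K}$ of possible tags, together with the following parameters:
· A parameter $q(s \mid u, v)$ for any trigram (u, v, s) such that $s \in \mathcal{K} \cup \{\text{STOP}\}$, and $u, v \in \mathcal{K} \cup \{*\}$. The value for $q(s \mid u, v)$ can be interpreted as the probability of seeing the tag s immediately after the bigram of tags (u, v).
· A parameter $e(x \mid s)$ for any $x \in \mathcal{V}$, $s \in \mathcal{K}$. The value for $e(x \mid s)$ can be interpreted as the probability of seeing observation x paired with state s.
Define $\mathcal{S}$ to be the set of all sequence/tag-sequence pairs $\langle x_1 \ldots x_n, y_1 \ldots y_{n+1} \rangle$ such that $n \ge 0$, $x_i \in \mathcal{V}$ for $i = 1 \ldots n$, $y_i \in \mathcal{K}$ for $i = 1 \ldots n$, and $y_{n+1} = \text{STOP}$.
We then define the probability for any $\langle x_1 \ldots x_n, y_1 \ldots y_{n+1} \rangle \in \mathcal{S}$ as $p(x_1 \ldots x_n, y_1 \ldots y_{n+1}) = \prod_{i=1}^{n+1} q(y_i \mid y_{i-2}, y_{i-1}) \prod_{i=1}^{n} e(x_i \mid y_i)$
where we have assumed that $y_0 = y_{-1} = *$.
E.g.
If we have n = 3, $x_1 \ldots x_3$ equal to the sentence "the dog laughs", and $y_1 \ldots y_4$ equal to the tag sequence D N V STOP, then $p(x_1 \ldots x_n, y_1 \ldots y_{n+1}) = q(\text{D} \mid *, *) \times q(\text{N} \mid *, \text{D}) \times q(\text{V} \mid \text{D}, \text{N}) \times q(\text{STOP} \mid \text{N}, \text{V}) \times e(\text{the} \mid \text{D}) \times e(\text{dog} \mid \text{N}) \times e(\text{laughs} \mid \text{V})$
The quantity $q(\text{D} \mid *, *) \times q(\text{N} \mid *, \text{D}) \times q(\text{V} \mid \text{D}, \text{N}) \times q(\text{STOP} \mid \text{N}, \text{V})$ is the prior probability of seeing the tag sequence D N V STOP, where we have used a second-order Markov model (a trigram model).
The quantity $e(\text{the} \mid \text{D}) \times e(\text{dog} \mid \text{N}) \times e(\text{laughs} \mid \text{V})$ can be interpreted as the conditional probability $p(\text{the dog laughs} \mid \text{D N V STOP})$: that is, the conditional probability $p(x \mid y)$ where x is the sentence "the dog laughs", and y is the sequence D N V STOP.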
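Consider a pair of sequences of random variables $X_1 \ldots X_n$ and $Y_1 \ldots Y_n$, where n is the length of the sequences. We want to model the joint probability $P(X_1 = x_1 \ldots X_n = x_n, Y_1 = y_1 \ldots Y_n = y_n)$
for any observation sequence $x_1 \ldots x_n$ paired with a state sequence $y_1 \ldots y_n$, where each $x_i$ is a member of $\mathcal{V}$, and each $y_i$ is a member of $\mathcal{K}$.
Define one additional variable $Y_{n+1}$ which always takes the value STOP, just as we did in variable-length Markov sequences.
The key idea in hidden Markov models is the following definition: $P(X_1 = x_1 \ldots X_n = x_n, Y_1 = y_1 \ldots Y_{n+1} = y_{n+1}) = \prod_{i=1}^{n+1} P(Y_i = y_i \mid Y_{i-2} = y_{i-2}, Y_{i-1} = y_{i-1}) \prod_{i=1}^{n} P(X_i = x_i \mid Y_i = y_i)$
We have assumed that for any i, for any values of $y_{i-2}, y_{i-1}, y_i$, $P(Y_i = y_i \mid Y_{i-2} = y_{i-2}, Y_{i-1} = y_{i-1}) = q(y_i \mid y_{i-2}, y_{i-1})$
and that for any value of i, for any values of $x_i$ and $y_i$, $P(X_i = x_i \mid Y_i = y_i) = e(x_i \mid y_i)$
The derivation of hidden Markov models: $P(X_1 = x_1 \ldots X_n = x_n, Y_1 = y_1 \ldots Y_{n+1} = y_{n+1}) = P(Y_1 = y_1 \ldots Y_{n+1} = y_{n+1}) \times P(X_1 = x_1 \ldots X_n = x_n \mid Y_1 = y_1 \ldots Y_{n+1} = y_{n+1})$
This follows just by the chain rule of probabilities. The joint probability is decomposed into two terms: first, the probability of choosing the tag sequence $y_1 \ldots y_{n+1}$; second, the probability of choosing the word sequence $x_1 \ldots x_n$, conditioned on the choice of tag sequence.
Consider the first term: assume the sequence $Y_1 \ldots Y_{n+1}$ is a second-order Markov sequence.
Consider the second term: assume that the value for the random variable $X_i$ depends only on the value of $Y_i$.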
Stochastic process
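Define $c(u, v, s)$ to be the number of times the sequence of three states (u, v, s) is seen in training data. Similarly, define $c(u, v)$ to be the number of times the tag bigram (u, v) is seen and $c(s)$ to be the number of times that the state s is seen in the corpus.
Define $c(s \leadsto x)$ to be the number of times state s is seen paired with observation x in the corpus.
The maximum-likelihood estimates are $q(s \mid u, v) = \frac{c(u, v, s)}{c(u, v)}$ and $e(x \mid s) = \frac{c(s \leadsto x)}{c(s)}$
In some cases it is useful to smooth the estimates of $q(s \mid u, v)$, using the smoothing techniques described earlier: $q(s \mid u, v) = \lambda_1 q_{ML}(s \mid u, v) + \lambda_2 q_{ML}(s \mid v) + \lambda_3 q_{ML}(s)$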
A common method is as follows:
· Step 1: Split vocabulary into two sets
Frequent words : words occurring ≥ 5 times in training
Lowfrequency words : all other words
· Step 2: Map low frequency words into a small, finite set, depending on prefixes, suffixes etc.
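Problem: for an input $x_1 \ldots x_n$, find $\arg\max_{y_1 \ldots y_{n+1}} p(x_1 \ldots x_n, y_1 \ldots y_{n+1})$ where the arg max is taken over all sequences $y_1 \ldots y_{n+1}$ such that $y_i \in \mathcal{K}$ for $i = 1 \ldots n$, and $y_{n+1} = \text{STOP}$.
We assume that p again takes the form $p(x_1 \ldots x_n, y_1 \ldots y_{n+1}) = \prod_{i=1}^{n+1} q(y_i \mid y_{i-2}, y_{i-1}) \prod_{i=1}^{n} e(x_i \mid y_i)$
The naive brute-force method, which enumerates all $|\mathcal{K}|^n$ tag sequences, would be hopelessly inefficient. Instead, we can efficiently find the highest probability tag sequence using a dynamic programming algorithm that is often called the Viterbi algorithm. The input to the algorithm is a sentence $x_1 \ldots x_n$.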
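A compact sketch of the Viterbi recursion for a trigram HMM follows. Everything about the interface is an assumption made for illustration: tags are the integers 0..K-1, index K stands for the start symbol *, index K+1 stands for STOP, and the model is supplied as two hypothetical callbacks that return log-probabilities.

```cpp
#include <cassert>
#include <cmath>
#include <functional>
#include <limits>
#include <vector>

// Hypothetical callbacks, both returning log-probabilities.
using LogQ = std::function<double(int u, int v, int s)>;  // log q(s | u, v)
using LogE = std::function<double(int i, int s)>;         // log e(word at 0-based position i | s)

// Returns the highest-probability tag sequence for a sentence of n words,
// or an empty vector if every tag sequence has probability zero.
std::vector<int> viterbi(int n, int K, const LogQ& q, const LogE& e) {
    const double NEG_INF = -std::numeric_limits<double>::infinity();
    const int START = K, STOP = K + 1;
    typedef std::vector<std::vector<double> > Table;
    // pi[k][u][v]: best log-probability of a tag sequence of length k ending in tags u, v.
    std::vector<Table> pi(n + 1, Table(K + 1, std::vector<double>(K + 1, NEG_INF)));
    // bp[k][u][v]: the tag two positions back that achieves pi[k][u][v].
    std::vector<std::vector<std::vector<int> > > bp(
        n + 1, std::vector<std::vector<int> >(K + 1, std::vector<int>(K + 1, -1)));
    pi[0][START][START] = 0.0;

    for (int k = 1; k <= n; ++k)
        for (int u = 0; u <= K; ++u)          // tag at position k-1 (or START)
            for (int v = 0; v < K; ++v)       // tag at position k
                for (int w = 0; w <= K; ++w) {  // tag at position k-2 (or START)
                    if (pi[k - 1][w][u] == NEG_INF) continue;
                    double score = pi[k - 1][w][u] + q(w, u, v) + e(k - 1, v);
                    if (score > pi[k][u][v]) { pi[k][u][v] = score; bp[k][u][v] = w; }
                }

    // Pick the best final tag pair, including the transition to STOP.
    double best = NEG_INF;
    int bu = START, bv = START;
    for (int u = 0; u <= K; ++u)
        for (int v = 0; v <= K; ++v) {
            if (pi[n][u][v] == NEG_INF) continue;
            double score = pi[n][u][v] + q(u, v, STOP);
            if (score > best) { best = score; bu = u; bv = v; }
        }
    if (best == NEG_INF) return std::vector<int>();

    // Follow the back-pointers from the end of the sentence to the start.
    std::vector<int> tags(n, -1);
    if (n >= 1) tags[n - 1] = bv;
    if (n >= 2) tags[n - 2] = bu;
    for (int k = n - 2; k >= 1; --k) tags[k - 1] = bp[k + 2][tags[k]][tags[k + 1]];
    return tags;
}
```

The four nested loops give a running time on the order of n·K³, compared with K^n for brute-force enumeration. Working in log space avoids underflow on long sentences.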
Input: a sentence
Output: a parse tree
A parse tree is a tree structure with the words in the sentence at the leaves of the tree, and the tree has labels on the internal nodes.
Syntactic formalisms: minimalism, lexical functional grammar (LFG), head-driven phrase-structure grammar (HPSG), tree adjoining grammars (TAG), categorial grammars, etc.
The lecture focuses on context-free grammars, which are fundamental and form the basis for all modern formalisms.
Data: Penn WSJ Treebank: 50000 sentences with associated trees(annotated by hand).
It plays the same role as POS tagging. For each word in the sentence, just put a tag for the word.
N = noun, V = verb, DT = determiner
Phrases, or what are often called constituents, are compositions of words.
NP = noun phrase, VP = verb phrase
Parse trees encode important grammatical relationships within a sentence.
Some templates: subject+verb, verb+DIRECT OBJECT.
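A context-free grammar $G = (N, \Sigma, R, S)$ where: N is a set of non-terminal symbols; Σ is a set of terminal symbols; R is a set of rules of the form $X \to Y_1 Y_2 \ldots Y_n$ for $n \ge 0$, $X \in N$, $Y_i \in (N \cup \Sigma)$; and $S \in N$ is a distinguished start symbol.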
Properties
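A left-most derivation is a sequence of strings $s_1 \ldots s_n$, where $s_1 = S$ (the start symbol), $s_n \in \Sigma^*$ (it consists of terminal symbols only), and each $s_i$ for $i = 2 \ldots n$ is derived from $s_{i-1}$ by picking the left-most non-terminal X in $s_{i-1}$ and replacing it by some β where $X \to \beta$ is a rule in R.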
E.g. [S]→[NP VP]→[D N VP]→[the N VP]→[the man VP]→[the man Vi]→[the man sleeps]
· Nouns
NN: singular noun (e.g., man, dog, park)
NNS: plural noun (e.g., books, pens)
NNP: proper noun (e.g., Bob, IBM)
NP: noun phrase (e.g., the girl)
· Determiners
DT: determiner (e.g., the, a, some, every)
· Adjectives
JJ: adjective (e.g., good, quick, big)
· Prepositions
IN: preposition (e.g., of, in, out, beside, as)
PP: prepositional phrase
· Basic Verb Types
Vi: intransitive verb (e.g., sleep, walk)
Vt: transitive verb (e.g., like, see)
Vd: ditransitive verb (e.g., give)
VP: verb phrase (e.g., sleep in the car, go to school)
· New Verb Types
V[5]: verb followed directly by a clause (e.g., say, report)
V[6]: verb followed by one object and a clause (e.g., tell, inform)
V[7]: verb followed by two objects and a clause (e.g., bet)
· Complementizers
COMP: complementizer (e.g., that)
· Coordinators
CC: coordinator (e.g., and, or, but)
· Sentences
S: sentence (e.g., the dog sleeps)
N(bar) => NN
N(bar) => NN N(bar)
N(bar) => JJ N(bar)
N(bar) => N(bar) N(bar)
NP => DT N(bar)
PP => IN NP
N(bar) => N(bar) PP
VP => Vi
VP => Vt NP
VP => Vd NP NP
VP => VP PP
S => NP VP
SBAR => COMP S
VP => V[5] SBAR
VP => V[6] NP SBAR
VP => V[7] NP NP SBAR
NP => NP CC NP
N(bar) => N(bar) CC N(bar)
VP => VP CC VP
S => S CC S
SBAR => SBAR CC SBAR
What is discussed in the lecture is only a small part of the syntax of English.
There’re some problems:
Agreement
The dogs laugh vs. The dog laughs
Wh-movement
The dog [that the cat] liked…
Active vs. passive
The dog saw the cat vs. The cat was seen by the dog
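A probabilistic context-free grammar consists of: a context-free grammar $G = (N, \Sigma, S, R)$, and a parameter $q(\alpha \to \beta)$ for each rule $\alpha \to \beta \in R$.
Constraint: for any $X \in N$, $\sum_{\alpha \to \beta \in R : \alpha = X} q(\alpha \to \beta) = 1$, and $q(\alpha \to \beta) \ge 0$ for every rule.
Given a parse tree t containing rules $\alpha_1 \to \beta_1, \alpha_2 \to \beta_2, \ldots, \alpha_n \to \beta_n$, the probability of t under the PCFG is $p(t) = \prod_{i=1}^{n} q(\alpha_i \to \beta_i)$
A PCFG assigns a probability to each left-most derivation, or parse tree, allowed by the underlying CFG.
Say we have a sentence s, and the set of derivations for that sentence is $\mathcal{T}(s)$. Then a PCFG assigns a probability to each member of $\mathcal{T}(s)$, i.e., we now have a ranking in order of probability.
The most likely parse tree for a sentence s is $\arg\max_{t \in \mathcal{T}(s)} p(t)$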
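Since the probability of a tree is just the product of the probabilities of the rules it uses, the computation can be sketched very briefly. The representation is an assumption made for illustration: rules are plain strings such as "S -> NP VP", and `tree_probability` is a hypothetical helper name.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// p(t) = product of q(rule) over every rule occurrence in the tree t.
// Returns 0 if the tree uses a rule that has no parameter in q.
double tree_probability(const std::map<std::string, double>& q,
                        const std::vector<std::string>& rules_in_tree) {
    double p = 1.0;
    for (std::size_t i = 0; i < rules_in_tree.size(); ++i) {
        std::map<std::string, double>::const_iterator it = q.find(rules_in_tree[i]);
        if (it == q.end()) return 0.0;
        p *= it->second;
    }
    return p;
}
```

Ranking the candidate trees of a sentence then amounts to calling this function on each tree and keeping the one with the largest value; a real parser would avoid enumerating all trees by using dynamic programming.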
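A programming paradigm is the set of conventions, models, and styles used to design programs; it is the foundation of a family of programming languages.
Main structural features of object-oriented programs:
1. A program generally consists of two parts, class definitions and class usage; the main program defines the objects and specifies how messages pass between them.
2. Every operation in the program is carried out by sending a message to an object; on receiving a message, the object invokes the relevant method to perform the operation.
In the real world, everything is an object. An object can be a tangible, concrete thing or an intangible, abstract event.
An object can generally be expressed as: attributes + operations
Name: distinguishes one entity from another.
Attributes/state: attributes describe the characteristics of an entity; the state is determined by the object's attributes and their current values.
Operations: describe the behavior an entity can exhibit; an operation is a service the object offers to its users, also called a behavior or method.
· An object's operations fall into two kinds: operations the object undergoes itself (private/protected) and operations applied to other objects (public).
Method: an operation that an object can perform, i.e., a service. A method describes the algorithm by which the object performs the operation and responds to a message. In C++ it is called a member function.
Attribute: the data defined in a class; an abstraction of the properties of a real-world entity. In C++ it is called a data member.
In object-oriented programming, an object is a single unit that encapsulates the data describing its attributes together with the set of operations applied to that data.
An object can be thought of as: data + methods (operations)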
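In the real world, a class is an abstraction of a group of objects that share the same attributes and behavior.
The relationship between a class and an object is that of abstract to concrete. A class is the result of abstracting over many objects; an object is an instance of a class.
In object-oriented programming, a class is the set of objects that have the same data and the same operations; it describes a kind of object with the same data structure and the same operations.
In object-oriented programming, a class is always declared first, and objects are then created from it.
Note that a class cannot be formed merely by grouping functions together; a class is not a collection of functions.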
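An object-oriented design technique must provide a mechanism that lets one object interact with another; this mechanism is called message passing.
In object-oriented programming, a request sent from one object to another is called a message. When an object receives a message, it calls the relevant method and performs the corresponding operation. A message is a specification of an operation that one object asks another to perform; objects can make requests of each other and cooperate only through message passing.
Messages have three properties:
(1) The same object can receive several messages of different forms and respond differently to each.
(2) Messages of the same form can be sent to different objects, and the responses can differ.
(3) Responding to a message is not mandatory; an object may respond to it or ignore it.
Messages fall into two kinds: public messages (sent by other objects) and private messages (sent by an object to itself).
A method is an operation that an object can perform. A method consists of two parts, an interface and a body.
The interface of a method is the pattern of the message; it gives the protocol for calling the method.
The method body is the sequence of computation steps that implements the operation, i.e., a piece of code.
In C++, methods are implemented as functions, called member functions.
The relationship between messages and methods: an object calls the corresponding method according to the message it receives; conversely, only with a method can an object respond to the corresponding message.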
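Abstraction is the process of forming a concept by extracting the common properties of particular instances (objects). An abstraction is a simplified description and specification of a system; it emphasizes some of the system's details and properties and ignores the rest.
Abstraction has two aspects, data abstraction and code abstraction (also called behavioral abstraction). The former describes the attributes and state of a kind of object, that is, the characteristic quantities that distinguish it from other kinds; the latter describes the common behavior or common operations of that kind of object.
In object-oriented programming, the result of abstracting a concrete problem is described and implemented through classes.
In object-oriented programming, encapsulation means gathering the data and the code that implements the operations inside the object and hiding the object's internal details as far as possible.
Encapsulation should satisfy the following conditions:
(1) The object has a clear boundary; its private data and the code that implements its operations are enclosed within that boundary.
(2) There is an interface describing how the object interacts with other objects; the interface must specify how messages are passed.
(3) The code and data inside the object are protected; other objects cannot modify them directly.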
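Inheritance is a mechanism for declaring a new class as an extension of an existing one. The existing class is called the base class and the class extended from it is called the derived class; a derived class can in turn serve as the base class for further extension, so derivation can continue layer by layer.
A derived class has the data structures and algorithms of its base class and adds behavior and properties of its own, so inheritance promotes code reuse.
A base class can have several derived classes and, conversely, a derived class can have several base classes, forming a complex inheritance hierarchy.
The essential relationship between base and derived class: the base class is a simpler class describing a relatively simple thing, and the derived class is a more complex class handling a relatively complex phenomenon.
Purposes of inheritance:
It avoids re-developing common code and reduces code and data redundancy.
It reduces the interfaces between modules by increasing consistency.
Inheritance is either single inheritance or multiple inheritance.
Polymorphism means that different objects behave in different ways when they receive the same message.
C++ supports two kinds of polymorphism: compile-time polymorphism (overloading) and run-time polymorphism (virtual functions).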
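Main advantages of OOP
(1) It improves program reusability.
(2) It keeps program complexity under control.
(3) It improves program maintainability.
(4) It supports large-scale program design better.
(5) It broadens the range of information computers can process.
(6) It adapts well to new hardware environments.
Advantages of C++
C++ inherits the strengths of C and adds features of its own, mainly:
(1) It is largely compatible with C: much C code can be used in C++ without modification, and library functions and utilities written in C can be used from C++.
(2) Programs written in C++ are more readable and better structured, and can map the structure of the problem space directly into the program.
(3) The generated code is of high quality and runs efficiently.
(4) Development time, cost, and the reusability, extensibility, maintainability, and reliability of the resulting software all improve considerably, which makes medium and large projects much easier.
(5) It supports object-oriented mechanisms, making it convenient to build entities and operations that model real-world problems.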
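Comment markers: /* */ or //
Line continuation: \. When a statement is too long, this symbol lets it be split across several lines.
note: an ordinary statement can also simply be broken across lines without the continuation character; the backslash is needed only inside macros and string literals.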
E.g.


This program will print hello world
in a line.
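A minimal sketch of line continuation inside a string literal is shown here; the function name `greeting` is hypothetical. The backslash must be the last character on its line, and the continued text starts at the first column of the next line because any leading spaces would become part of the string.

```cpp
#include <cassert>
#include <string>

// The backslash-newline pair is removed before compilation,
// so the literal below is the single string "hello world".
const char* greeting() {
    return "hello \
world";
}
```

Sending `greeting()` to cout would print hello world on one line.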
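C: scanf and printf
C++: cin >> and cout <<
(The C functions also work, but are not recommended.)
cout and cin are the C++ standard output and input streams. C++ supports redirection, but by default cout refers to the screen and cin to the keyboard. Besides their C meanings of left shift and right shift, the operators << and >> have a second role here: << writes its right-hand operand to the standard output stream cout, and >> stores data from the standard input stream into the variable on its right.
cin is used with >>, and cout with <<.
With cout and cin the input and output format can also be controlled; for example, integers can be shown in a different base by using the manipulators dec, hex, and oct.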
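Where variables are defined
Where a variable is defined determines its properties. There are generally three places:
(1) Inside a function body
A variable defined inside a function body is called a local variable.
(2) Formal parameters
When a function with parameters is defined, the variables in the parentheses after the function name are called formal parameters.
(3) Global variables: variables defined outside all function bodies; their scope is the whole program and they exist for the entire run.
In C (C89), local variables must be declared together at the start of a block, before any executable statement, and a global variable must be declared before the functions that use it.
Variable declarations in C++ are very flexible: declarations and executable statements may be interleaved, so a variable can be declared at the point where it is needed, e.g. for (int i = 0; i < 10; i++)
In C++, struct names, union names, and enum names are all type names. When defining a variable, the keyword struct, union, or enum need not be written before the name.
For example, define the enumeration type boole: enum boole{FALSE, TRUE};
In C the variable must be defined as enum boole done;, but in C++ it can be declared as boole done;.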
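C recommends that programmers provide a prototype for every function, whereas C++ requires one: it states the function's name, the types and number of its parameters, and the type of its return value. The main purpose is to let the C++ compiler perform type checking, i.e., check that the arguments match the parameter types and that the return value matches the prototype, which helps keep the program correct.
A function's prototype must appear before any call to that function. Notes:
(1) The parameter list of a prototype need not contain parameter names, only their types. For example long Area(int, int);
(2) A function definition consists of a function header and a function body. The header is basically the same as the prototype, but its parameters are given names (needed to use them in the body) and it has no trailing semicolon.
(3) In C++ the parameter declarations must be placed inside the parentheses after the function name; they cannot be placed between the function header and the function body. That old style is valid only in C.
(4) The main function does not need a prototype, because it is treated as automatically declared.
(5) Standard C++ does not allow the return type to be omitted; the old implicit int rule is not part of the language, and main must be declared to return int.
(6) If a function returns no value, its return type must be given as void in the prototype (main is the exception: it returns int).
(7) If no parameters are given in the prototype, C++ assumes the parameter list is empty (void).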
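C customarily uses #define to define constants; C++ uses const to define proper typed constants.
General form: const type name = value;
A constant defined this way is typed; it has an address and a pointer can point to it, but its value cannot be modified.
const is normally written before the type specifier it qualifies (int const is also accepted).
The type cannot be omitted in standard C++; only some older compilers treated a missing type as int.
const works much like #define, but removes the unsafety of #define.
const can be used together with pointers.
There are pointers to constants, constant pointers, and constant pointers to constants.
1) A pointer to a constant is a pointer variable that points to a constant, e.g. const int* p.
2) A constant pointer declares the pointer itself, rather than the object it points to, as constant, e.g. int* const p.
3) A constant pointer to a constant can change neither the pointer itself nor the value it points to. To declare one, both must be declared const, e.g. const int* const p.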
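The three combinations of const and pointers can be compared side by side. The function name `const_pointer_demo` and the variable names are made up for this sketch; the lines described as "would not compile" are left as comments.

```cpp
#include <cassert>

int const_pointer_demo() {
    int a = 1, b = 2;

    const int* p1 = &a;        // pointer to constant: *p1 = 5 would not compile
    p1 = &b;                   // the pointer itself may be re-pointed

    int* const p2 = &a;        // constant pointer: p2 = &b would not compile
    *p2 = 10;                  // the value it points to may change

    const int* const p3 = &b;  // constant pointer to constant: neither may change

    return *p1 + *p2 + *p3;    // 2 + 10 + 2
}
```

A useful reading rule is to read the declaration from right to left: `int* const p2` is "p2 is a constant pointer to int".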
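Notes:
(1) When a const integer is defined, the keyword int should still be written; omitting it relies on the obsolete implicit int rule and is rejected by standard C++ compilers.
(2) Once a constant has been created, it cannot be changed anywhere in the program.
(3) Unlike constants defined with #define, constants defined with const have a data type of their own, so the C++ compiler can perform stricter type checking at compile time.
(4) Function parameters can also be declared const, which guarantees that the function does not modify the argument through that parameter.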
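void usually means no value, but as the type a pointer points to it means an unspecified type. A void pointer is a generic pointer: the value of a pointer to any object type can be assigned to a void* variable.
A void pointer can accept a pointer of any object type, but before the object it points to is used (for example dereferenced or printed), the pointer must be cast back explicitly to the proper type; otherwise the code does not compile.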


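Calling a function incurs some overhead, for pushing and popping information on the stack and for passing arguments.
C++ introduced the concept of the inline function. At compile time the compiler may copy the body of an inline function and insert it at each place the function is called.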


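Notes:
(1) An inline function must be declared or defined before its first call; otherwise the compiler cannot know what code to insert.
(2) A C++ inline function serves the same purpose as a C #define macro and works in a similar way, but removes the unsafety of #define.
(3) Inline functions generally should not contain loops or switch statements; compilers may decline to inline such functions.
(4) Every member function defined inside a class definition (covered later) is an inline function.
(5) Usually only short functions are defined as inline.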
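In C++ function parameters can have default values. When a function with default parameters is called and no argument is supplied for a parameter, the default value is used as the argument. Default arguments are given in the function prototype.
Notes
(1) In the prototype, all parameters with default values must appear to the right of the parameters without default values.
(2) In a call, if the argument for one parameter is omitted, the arguments for all parameters after it must be omitted too, and their default values are used.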
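Both rules can be seen in a small sketch. The function `volume` and its default values are hypothetical; the defaults appear only in the prototype, on the rightmost parameters.

```cpp
#include <cassert>

// Prototype: defaults are given here, rightmost parameters first.
int volume(int length, int width = 2, int height = 3);

// Definition: the defaults are not repeated.
int volume(int length, int width, int height) {
    return length * width * height;
}
```

The calls `volume(1)`, `volume(1, 4)` and `volume(1, 4, 5)` are all valid and use two, one and zero defaults respectively. There is no way to supply `height` while omitting `width`.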
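Function overloading means that a function can have the same name as other functions in the same scope, provided the functions differ in the types or number of their parameters.
Why use function overloading?
Functions that do the same job and differ only in their parameter types can be given the same name.
Steps in resolving a call:
(1) Look for an exact match, i.e., the function whose parameter types and count are identical to those of the arguments.
(2) Look for a match through built-in conversions: if step (1) finds no matching function, the compiler applies built-in conversions to the argument types, and if a matching function then exists, that function is called.
(3) Look for a match through user-defined conversions; if a unique set of conversions is found, that function is called. The programmer can also cast an argument explicitly at the call site to select the intended function.
Cautions:
Overloaded functions cannot differ only in return type; they must differ in at least the number, types, or order of their parameters.
All overloads of a function should do the same job. Making overloads perform different tasks hurts readability.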
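Function templates
Function template: a generic function whose return type and parameter types are not specified concretely but are given as a placeholder type.
When to use: functions whose bodies are identical can all be replaced by one template, so the function is defined only once instead of several times. When the function is called, the compiler substitutes the argument types for the placeholder type, producing the different functions. The form is template<typename T> followed by the generic function definition, or template<class T> followed by the generic function definition (class and typename are interchangeable here).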


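Compared with overloaded functions: a function template is more convenient than overloading and keeps the program shorter. Note, however, that it applies only when the functions have the same number of parameters, different types, and identical bodies. If the number of parameters differs, a single function template cannot be used.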
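A minimal function template, assuming a hypothetical name `max_of`, shows one definition serving several argument types:

```cpp
#include <cassert>
#include <string>

// One generic definition; T is replaced by the argument type at each call.
template <typename T>
T max_of(T a, T b) {
    return (a > b) ? a : b;
}
```

`max_of(3, 7)` instantiates the template with int and `max_of(2.5, 1.5)` with double. Both arguments must deduce to the same T, so a mixed call such as `max_of(3, 2.5)` needs an explicit type, as in `max_of<double>(3, 2.5)`.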
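Normally, if two variables share a name and one is global and the other local, the local variable takes precedence within its scope.
Prefixing the name with :: selects the global variable; ::var then denotes the global var.
An anonymous union is a special kind of union in C++ that declares a group of unnamed data items sharing the same memory address. For example: union {int i; float f;};
This anonymous union declares variables i and f with the same storage address. Its members can be accessed directly by name; for example, the variables i and f above can be used directly.
The general form of a type conversion in C is (type) expression
C++ supports this form and also provides a function-call style, in which the type name is used like a function name so that the conversion looks like a function call. The form is type(expression).
The latter form is recommended.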
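As replacements for malloc and free in C, C++ introduced the new and delete operators, which allocate and release memory dynamically. The forms are pointer = new type; or pointer = new type(initial value);
For example: int *a, *b;
a = new int;
b = new int(10);
Memory allocated dynamically with new is released with delete. The form is delete pointer;, for example delete a; and delete b;.
Advantages:
(1) new and delete work out the size of the type being allocated and released automatically. This saves the step of computing the size with sizeof and, more importantly, avoids the serious consequences of getting the size wrong.
(2) new returns a pointer of the allocated type, so no cast is needed.
(3) new can initialize the object it allocates.
(4) Both new and delete can be overloaded, which allows custom memory management.
Notes:
(1) Space allocated with new should be released explicitly with delete when it is no longer needed; otherwise it cannot be reclaimed and is leaked.
(2) When there is not enough memory, the standard new throws std::bad_alloc; only the new(std::nothrow) form returns a null pointer. Code that uses the nothrow form should therefore check whether the allocation succeeded.
(3) new can allocate memory for an array dynamically; the array size follows the type. The form is pointer = new type[size expression];
When new allocates a multidimensional array, the sizes of all dimensions must be given.
(4) A dynamically allocated array is released with the array form of delete: delete [] pointer;
(5) new can initialize a simple variable while allocating it. The form is pointer = new type(initializer list)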
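The single-object and array forms of new and delete, together with the nothrow form that reports failure through a null pointer, can be sketched as follows. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// Allocates one initialised int and an array of n ints, sums them, frees both.
int sum_with_heap_array(int n) {
    int* single = new int(10);   // allocate and initialise one int
    int* arr = new int[n];       // allocate an array of n ints
    for (int i = 0; i < n; ++i) arr[i] = i;

    int total = *single;
    for (int i = 0; i < n; ++i) total += arr[i];

    delete single;               // matches new
    delete[] arr;                // matches new[]
    return total;
}

// The nothrow form returns a null pointer instead of throwing std::bad_alloc.
bool try_allocate(std::size_t count) {
    int* p = new (std::nothrow) int[count];
    if (p == nullptr) return false;
    delete[] p;
    return true;
}
```

Pairing `new` with `delete` and `new[]` with `delete[]` matters: mixing the two forms is undefined behavior. In modern code, containers such as std::vector and smart pointers usually replace manual new and delete.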
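A reference is an alias for a variable (the target), so an operation on the reference is an operation on the target.
Declaring a reference: type &reference = target;
Notes:
(1) Here & is not the address-of operator; it marks the declaration as a reference.
(2) The type is the type of the target variable.
(3) A reference must be initialized when it is declared.
(4) Once the reference is declared, the target variable in effect has two names.
(5) Declaring a reference does not define a new variable; the reference simply names the existing object.
Using references
(1) A reference name can be any legal variable name. Except when used as a function parameter or return type, it must be initialized at its declaration; it cannot be declared first and assigned later.
(2) A reference cannot be reseated: the name cannot become an alias for another variable, and any assignment to the reference is an assignment to its target. Taking the address of a reference gives the address of the target.
(3) Since a pointer variable is also a variable, a reference to a pointer can be declared. The form is type *&reference = pointer;
(4) A reference refers to a variable or object and is not itself an object, so a reference to a reference and a pointer to a reference cannot be declared.
(5) An array of references cannot be created. A reference to an array is allowed, e.g. int (&r)[10] = a;
(6) A reference cannot be bound through a null pointer.
(7) A reference to void cannot be created.
(8) The reference declarator and the address-of operator use the same symbol but are different things. The reference declarator & appears only in the declaration; afterwards the reference is used like an ordinary variable, without &. An & used elsewhere is the address-of operator.
Using a reference as a function parameter
A function parameter can be declared as a reference.
At the call site the variable is simply passed as the argument; nothing special is required of it.
Returning a value by reference
A function can be declared to return a reference.
The main purpose is to allow the function call to appear on the left side of an assignment. The form is type &function(parameter list){function body}
(1) To return by reference, put & before the function name in the definition.
(2) The biggest benefit of returning by reference is that no copy of the return value is made in memory.
When defining a function that returns a reference, take care not to return a reference to one of its automatic (local) variables: the lifetime of an automatic variable is limited to the function, so it is gone once the function returns.
The value of a function that returns a reference can be used as the left-hand side of an assignment.
Ordinarily the left-hand side of an assignment must be a variable name: only a variable can be assigned to.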
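Reference parameters and reference return values can be sketched together. The names `swap_values`, `scores` and `score_at` are made up for this illustration; note that `score_at` returns a reference to a global array element, not to a local variable.

```cpp
#include <cassert>

// Reference parameters: the function works on the caller's variables.
void swap_values(int& a, int& b) {
    int tmp = a;
    a = b;
    b = tmp;
}

int scores[3] = {0, 0, 0};

// Returning a reference lets the call appear on the left of an assignment.
// The referenced object is global, so it outlives the call.
int& score_at(int i) {
    return scores[i];
}
```

After `score_at(1) = 42;` the element `scores[1]` holds 42, because the call expression names the array element itself rather than a copy.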
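A class is C++'s extension of the C struct.
A C struct is a collection of data members, whereas a C++ class is a collection of data members and member functions.
A struct is a user-defined, composite data type. A class, like a struct, is a user-defined, composite data type.
A C struct cannot protect its data or control access to it, so the data in a struct is not safe. A C++ class encapsulates data together with the functions associated with it into a single unit with a well-defined external interface; this prevents unauthorized access to the data and keeps modules independent.
The members of a class fall into two groups: those that hold the state of the data, called data members, and the functions that act on that state, called member functions.
· The private section is the private part of the class; its data members and member functions are the private members. Private members can be accessed only by member functions of the class, and any access from outside the class is illegal. (They can be used only in the class's definition and implementation.)
· The public section is the public part of the class; its data members and member functions are the public members. Public members can be accessed by any function in the program; they are fully open to the outside.
· The protected section is the protected part of the class; its data members and member functions are the protected members. Protected members can be accessed by member functions of the class and by member functions of its derived classes; any access from outside is illegal.
(1) The three sections of a class declaration need not all be present, but at least one should be.
Generally the data members of a class should be private and the member functions public.
(2) The keywords private, protected, and public can appear in any order and any number of times. The program is clearer, however, if all private, protected, and public members are grouped together.
(3) When the private section comes first in the class body, the keyword private can be omitted.
(4) Data members can be of any data type, but cannot be declared auto, register, or extern.
(5) Before C++11, data members could not be given values in the class declaration; they are initialized when an object is created. C++11 and later also allow default member initializers.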
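Ordinary member function form
The class declaration (.h) gives only the prototypes of the member functions, and the function bodies are written outside the class (.cpp).
Inline function form
Define the function directly inside the class;
or give only the prototype in the class declaration, write the body outside the class, and put the keyword inline before the return type.
In C++, objects with the same data structure and the same set of operations can be regarded as belonging to one class.
Once an object of a class has been defined, its members can be accessed. Outside the class, public members are accessed through an object using the . operator (the member selection operator, or dot operator).
If what is defined is a pointer to an object, the members are accessed with the -> operator.
Inside the class, member functions can access all members directly, but code outside the class cannot access an object's private members.
Access attributes of class members
Members declared public can be accessed by the member functions of the class and also from outside the class through an object.
Members declared private can be accessed only by member functions of the class, not from outside the class through an object.
Members declared protected can be accessed by member functions of the class and by members of its derived classes, but not from outside the class through an object.
The visibility of a class's members to objects of the class differs from their visibility to the class's member functions.
Member functions can access all members of the class, whereas access through an object is restricted by the members' access attributes.
Generally, the public members are the external interface of the class, and the private and protected members are its internal data and implementation, which outside code should not touch. Dividing members into access levels has two benefits: information hiding, i.e., encapsulation; and data protection, i.e., shielding the class's important data from improper modification by other code.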
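Object assignment
Two variables of the same type can be assigned to each other. Objects of the same type can be assigned too; when one object is assigned to another, all data members are copied member by member.
Notes:
· In an object assignment the two objects must have the same type; if the types differ, compilation fails.
· Assignment only makes the data members of the two objects equal; the two objects remain separate.
· Object assignment with = is performed by the default assignment operator function. (Complex classes need to overload it.)
· When a class contains pointers, assigning objects with the default assignment operator may cause errors.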
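Constructors and destructors are both member functions of a class, but special ones: they perform special tasks, run automatically without being called, and have names tied to the class name.
Some member functions in C++ are special: they are responsible for creating and destroying objects. They are special in that the compiler can call them implicitly, and some of them are invoked using operator-overloading syntax. C++ provides a mechanism that initializes objects automatically: the class constructor.
Object initialization
1) Before C++11, data members could not be initialized in the class declaration.
2) Ways to initialize an object of class type:
· Through the external interface (public member functions): declare the class → define the object → call the interface to assign the members
· Through a constructor: declare the class → define the object, giving the members their values at the same time
A constructor is a special member function used mainly to initialize an object when it is created. Constructors have some special properties:
(1) The constructor's name must be the same as the class name.
(2) A constructor can have parameters of any type but cannot specify a return type, not even void.
(3) A constructor is a special member function; its body can be written inside or outside the class body.
(4) Constructors can be overloaded: a class can define several constructors that differ in the number or types of their parameters. Constructors are not inherited.
(5) A constructor is usually declared public, but it cannot be called explicitly like other member functions; it is called when the object is defined.
· If no constructor is defined in the class declaration, the compiler generates a default constructor automatically.
· A default constructor is one that can be called without arguments.
· Besides a constructor with no parameters, a constructor whose parameters all have default values is also a default constructor.
· Automatic call: the constructor is called automatically when an object of the class is defined; the user neither needs to nor can call it. It runs before the object is used.
· Call order: the constructor is called when the object enters its scope (before the object is used).
Two ways to create an object with a constructor:
(1) Create the object directly. The general form is: ClassName object[(arguments)];
Here ClassName is the same as the constructor name, and the arguments are the actual arguments passed to the constructor.
(2) Create the object through a pointer and new. The general form is: ClassName *pointer = new ClassName[(arguments)];
Data members of const type or reference type cannot be given values by assignment inside the constructor; C++ provides the member initializer list for this.
ClassName::ClassName([parameters])[: member initializer list]
The general form of the member initializer list is: member1(value1), member2(value2), …
If a data member is to be stored on the heap or in an array, assignment statements in the constructor body should be used, even if the constructor also has a member initializer list.
Class members are initialized in the order in which they are declared in the class, regardless of the order in which they appear in the initializer list.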
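A copy constructor is a special constructor whose parameter is a reference to an object of the same class. It initializes a new object from an existing object of the same class.
When an object is initialized from another with = (as in T b = a;), the copy constructor is called automatically. Assigning to an object that already exists calls the assignment operator instead.
Features of the copy constructor:
It is a constructor, so its name is the same as the class name and it has no return type.
It takes one parameter, a reference to an object of the same class.
Every class has a copy constructor. One can be defined as needed to control how data members are transferred between objects of the class. If none is defined, the compiler generates a default copy constructor.
The default copy constructor
If no copy constructor is written, C++ copies an existing object into the new object automatically; this member-by-member copy is done by the default copy constructor.
Three situations in which the copy constructor is called:
(1) When one object of the class is used to initialize another (the direct form T b(a); and the = form T b = a;).
(2) When a function parameter is an object of the class and the argument is bound to it in a call.
(3) When a function returns an object by value and control returns to the caller (compilers may elide this copy).
Shallow copy and deep copy
A shallow copy is the member-by-member copy performed by the default copy constructor. If the class has pointer members, both objects end up pointing to the same memory, which leads to errors such as releasing it twice.
To avoid the errors of shallow copying, a copy constructor must be defined explicitly that copies the data members and also allocates separate memory for object 1 and object 2; this is a deep copy.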
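A deep copy can be sketched with a small class that owns a character buffer. The class `Name` and its member names are hypothetical; the point is that the copy constructor and the assignment operator each allocate a fresh buffer, so two objects never share the same memory.

```cpp
#include <cassert>
#include <cstring>

class Name {
public:
    explicit Name(const char* s) : text_(new char[std::strlen(s) + 1]) {
        std::strcpy(text_, s);
    }
    // Deep copy: allocate a separate buffer and copy the characters into it.
    Name(const Name& other) : text_(new char[std::strlen(other.text_) + 1]) {
        std::strcpy(text_, other.text_);
    }
    // Assignment needs the same care, plus a guard against self-assignment.
    Name& operator=(const Name& other) {
        if (this != &other) {
            char* copy = new char[std::strlen(other.text_) + 1];
            std::strcpy(copy, other.text_);
            delete[] text_;
            text_ = copy;
        }
        return *this;
    }
    ~Name() { delete[] text_; }

    void set_first(char c) { text_[0] = c; }
    const char* c_str() const { return text_; }

private:
    char* text_;  // owned heap buffer
};
```

With the compiler-generated shallow copy, changing the first character of one object would also change the other, and both destructors would release the same buffer. With the deep copy above, the two objects stay independent.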
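A destructor is also a special member function. It does the opposite of a constructor and is usually used for cleanup when an object is destroyed, such as releasing memory allocated for the object.
Destructors have the following features:
① A destructor has the same name as the class, preceded by a tilde (~);
② A destructor has no parameters and no return value, and cannot be overloaded, so a class has only one destructor;
③ When an object is destroyed, the destructor is called automatically. If the programmer does not define one, the compiler generates and calls a default destructor, which releases only the space occupied by the object's data members, not heap memory they point to.
Two situations in which the destructor is called:
(1) If an object is defined inside a function body, its destructor is called automatically when the function ends.
(2) If an object is created dynamically with new, its destructor is called automatically when it is released with delete.
1) General order: destructors are called in exactly the reverse order of constructors: the object constructed first is destroyed last, and the object constructed last is destroyed first.
2) Global objects: for an object defined at global scope (outside all functions), the constructor is called before main begins. Its destructor is called when the program ends (for example when main returns or exit is called).
3) Automatic local objects: for a local automatic object (for example one defined in a function), the constructor is called when the object is created. If the function is called several times, the constructor is called each time the object is created. The destructor is called when the function call ends and the object is released.
4) Static local objects: for a static local object defined in a function, the constructor is called only once, the first time execution reaches the definition. The object is not released when the call ends, so the destructor is not called then; it is called only when main ends or exit terminates the program.
Object lifetime
(1) Local object: the constructor is called and the object created when the object is defined; the destructor is called and the object released when the program leaves the function body or block that contains it.
(2) Global object: the constructor is called and the object created when the program starts running; the destructor is called and the object released when the program ends.
(3) Static object: the constructor is called and the object created where the static object is defined; the destructor is called and the object released when the whole program ends.
(4) Dynamic object: the new operator calls the constructor and creates the object; the destructor is called when the object is released with delete.
A local object is defined inside a function body or block; its scope is limited to that function body or block and its lifetime is short.
A static object is defined in a file; its scope runs from its definition to the end of the file, and its lifetime is long.
A global object is defined in some file; its scope covers the whole program, and its lifetime is the longest.
A dynamic object is controlled by the programmer; its lifetime is the interval between new and delete.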
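Every non-static member function has an implicitly defined constant pointer called the this pointer.
The type of this is pointer to the class to which the member function belongs.
Whenever a member function is called, this is initialized to the address of the object on which it is called; in other words, a pointer to the object is passed to the function automatically. When different objects call the same member function, the compiler uses this to determine which object's data members are meant.
Normally this is implicit, but it can also be written explicitly.
this is a const pointer; it cannot be modified or assigned to.
this is local data; it can be used only inside the member functions of an object.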
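An object array is an array whose every element is an object.
As with arrays of basic types, elements of an object array are accessed one at a time; each element is an object, and its public members can be accessed through it.
If an array of objects of some class is needed, the class's constructors should be designed with the initialization of array elements in mind:
When all elements are to have the same initial value, the class should define a constructor with no parameters or one with default parameter values.
When the elements are to have different initial values, a constructor with parameters (without defaults) is needed.
When an object array is defined, values can be supplied through an initializer list.
Every object occupies some memory once it is created. An object can therefore be accessed by its name or by its address. An object pointer is a variable that holds the address of an object. The form is ClassName *pointer;
Accessing members of a single object through a pointer
Initialize the pointer to point to an existing object and use the -> operator to access the object's public members.
Accessing an object array through an object pointer
Incrementing the pointer (pointer++) makes it point to the next element of the array.
Pointers to class members
Members of a class are themselves variables, functions, or objects, so a pointer can be made to refer to a member directly, and the member can then be accessed through that pointer.
Outside the class, pointers to members can be formed only for public data members and member functions.
A pointer to member must be declared, then assigned, then used.
Pointer to data member
Declaration: type ClassName::*ptr;
Assignment: ptr = &ClassName::member;
Use: object.*ptr
objectPointer->*ptr
Pointer to member function
Declaration: type (ClassName::*ptr)(parameters);
Assignment: ptr = &ClassName::function;
Use: (object.*ptr)(arguments)
(objectPointer->*ptr)(arguments)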
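An object can be passed to a function as an argument in the same way as data of any other type. An object passed this way is passed by value, so any change the function makes to it does not affect the caller's object.
An object pointer can also be a function parameter. This gives call by address: the called function can change the value of the argument object, which allows information to be passed between functions. In addition, only the object's address is passed to the parameter and no copy of the object is made, which improves efficiency and reduces time and space overhead.
Using an object reference as a function parameter has the same advantages as using an object pointer, and it is simpler and more direct.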
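Purpose: to share data and functions among different objects of the same class
Static data members
Declared with the keyword static
All objects of the class share a single copy of the member
Must be defined and initialized outside the class, using the scope operator (::) to indicate the class it belongs to
Unlike ordinary data members, there is only one copy of a static data member no matter how many objects of the class are created, which is how data is shared among objects of the same class. It is not created when an object is constructed, nor destroyed when an object is destructed.
Format for initializing a static data member: <data type> <class name>::<static data member name> = <value>;
The scope operator is used in the initialization to mark the class the member belongs to; a static data member is therefore a member of the class, not of any object.
To refer to a static data member, use the form: <class name>::<static member name>
How are static data members used?
(1) A static data member is declared like an ordinary data member, but preceded by the keyword static.
(2) A static data member is initialized differently from an ordinary data member. The initialization takes place before any objects are defined, usually after the class definition and before main().
(3) Access (from outside the class, only public static data members can be accessed):
By class name: ClassName::staticDataMember
By object: objectName.staticDataMember, objectPointer->staticDataMember
(4) Private static data members cannot be accessed by functions outside the class, nor through an object.
(5) A main reason for supporting static data members is that they make global variables unnecessary. Their main use is to hold data common to all objects of a class.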
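Static member functions
Code outside the class can call a public static member function using the class name and the scope operator.
A static member function can directly reference only the static data members and static member functions of its class. To access non-static data members, it must receive an object as a parameter and access them through that object.
Static data members can be accessed by defining and using static member functions.
A static member function is a member function declared with the keyword static. Like static data members, static member functions belong to the class as a whole and are shared by all objects of the class.
As a member function, a static member function is subject to the class's access control. A public static member function can be called through the class name or through an object, whereas an ordinary non-static public member function can be called only through an object.
Declaration: static returnType functionName(parameter list);
Use: ClassName::functionName(argument list)
object.functionName(argument list)
objectPointer->functionName(argument list)
Notes:
(1) A static member function can be defined inline in the class or outside the class; when defined outside the class, the static prefix must not be repeated.
(2) Static member functions are mainly used to access global variables or static data members of the same class. In particular, used together with static data members, they maintain the data shared among objects of the same class.
(3) Private static member functions cannot be accessed by functions outside the class or through objects.
(4) One reason to use a static member function is that it can operate on static data members before any object has been created, which ordinary member functions cannot do.
(5) A static member function has no this pointer, so it cannot directly access the non-static data members of the class; if it really needs to, it must do so through an object passed as a parameter.
Static data members and static member functions can also be accessed through pointers.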
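A friend can access the private members of the class that has granted it friendship. Friends include friend functions and friend classes.
A friend function is not a member function of the class but an external function independent of it; nevertheless it can access the members of all objects of the class, including private, protected, and public members.
Declaring a friend function:
Position: inside the body of the class
Format: prefix the function declaration with friend
Defining a friend function:
Outside the class body: like an ordinary function (the function name must not be qualified with ClassName::)
Inside the class body: prefix the function definition with friend
Notes:
(1) A friend function is not a member function, so when it is defined outside the class its name must not be qualified with ClassName::.
(2) A friend function usually takes a parameter of the class type. Because it is not a member function, it has no this pointer, so it can neither refer to an object's members by their bare names nor through this; it must refer to them through an object name or object pointer passed in as a parameter.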
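Why the friend mechanism was introduced
(1) The friend mechanism supplements class encapsulation: with it, a class can grant certain functions the privilege of accessing its private members.
(2) Friends provide a way to share data between member functions of different classes, and between member functions and ordinary functions.
A member function of one class can also be a friend of another class. Such a member function can access all members of objects of its own class as well as all members of objects of the class that contains the friend declaration.
This lets two classes cooperate and coordinate to complete a task.
When a member function of one class is to be declared a friend of another class, the class that owns the member function must be defined first.
A whole class can also be a friend of another class.
In that case all member functions of class Y are friend functions of class X.
In practice, unless it is really necessary, do not declare a whole class as a friend; declare only the member functions that genuinely need access as friends, which is safer.
Friendship is one-way, not mutual.
Friendship is not transitive.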
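General form of a const object: TypeName const objectName[(constructor arguments)];
or: const TypeName objectName[(constructor arguments)];
A const object must be given an initial value.
None of the data members of an object defined as const can be modified. Any call to a non-const member function on it is a compile error.
When a data member is declared mutable, it can still be modified even in a const object.
A data member declared const cannot change its value. It can be initialized only through the constructor's member initializer list.
When a member function declaration includes const, it is a const member function. Such a function can only read the data members of its class and cannot modify them; that is, data members cannot appear as the left-hand side of an assignment (mutable members are the exception). Form: returnType functionName(parameter list) const;
The const keyword goes after the function name and parameter list. It is part of the function type, so it must appear in both the declaration and the definition.
If an object is declared const, only its const member functions can be called through it, not ordinary member functions. A const member function also cannot update the object's data members.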


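General form of a pointer to a const object: const Type *pointerName
A pointer to a const object (or variable) cannot be used to change the value of the object it points to, but the pointer itself can be changed.
If an object (or variable) is declared const, only a pointer to const can point to it; a non-const pointer cannot.
A pointer to const can also point to an object (or variable) that was not declared const. In that case the object cannot be changed through this pointer.
In summary, a pointer to const can point to both const and non-const objects, whereas a pointer to non-const can point only to non-const objects.
If a function parameter is a pointer to non-const, the argument must be a pointer to non-const, not a pointer to const; the function may then change the value of the object the parameter points to.
If a function parameter is a pointer to const, the argument may be either a pointer to const or a pointer to non-const.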


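| Form | Meaning |
| --- | --- |
| Time const t = Time(1,2,3); const Time t = Time(1,2,3); | t is a const object; its member values cannot be changed under any circumstances |
| const int a = 10; int const a = 10; | a is a const variable; its value cannot be changed |
| void Time::fun() const; | fun is a const member function of class Time; it can be called, but it cannot modify the (non-mutable) data members of the class |
| Time * const pt; int * const pa; | pt is a const pointer to a Time object and pa is a const pointer to an integer; the pointer value cannot be changed |
| const Time *pt; const int *pa; | pt is a pointer to a const Time object and pa is a pointer to a const integer; the object (value) pointed to cannot be changed through the pointer |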
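Purpose of inheritance: code reuse and code extension
Design approach with inheritance: general -> specific
Kinds of inheritance: single inheritance, multiple inheritance
Inheritance modes: public, protected, private
What is inherited: all members except constructors and destructors (private members are inherited but are not directly accessible in the derived class)
The process of building a new class while keeping the features of an existing class is called inheritance.
The process of producing a new class by adding its own features on top of an existing class is called derivation.
The existing class being inherited from is called the base class (parent class).
The new class is called the derived class.
Three inheritance modes: public, private, protected
Access levels of members in the derived class: inaccessible, public, private, protected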
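| Access in base class | Inheritance mode | Access in derived class |
| --- | --- | --- |
| private | public | inaccessible |
| private | private | inaccessible |
| private | protected | inaccessible |
| public | public | public |
| public | private | private |
| public | protected | protected |
| protected | public | protected |
| protected | private | private |
| protected | protected | protected |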
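Private inheritance access rules
The base class's public and protected members become private members of the derived class. Other members of the derived class can access them directly, but they cannot be accessed from outside the class through a derived-class object.
The base class's private members are not directly accessible in a privately derived class: neither members of the derived class nor code using a derived-class object can access them directly, though they can be accessed indirectly through public member functions provided by the base class.
No base-class member can be accessed through an object of the derived class.
Public inheritance access rules
The base class's public and protected members remain public and protected members of the derived class, and other members of the derived class can access them directly. Users outside the class, however, can access only the inherited public members through a derived-class object.
An object of the derived class can access only the public members of the base class.
1. An object of the derived class can be assigned to an object of the base class
2. An object of the derived class can initialize a reference to the base class
3. The address of a derived-class object can be assigned to a pointer to the base class
Through such a pointer or reference, only the base-class members inherited by the derived object d can be accessed.
Protected inheritance access rules
The base class's public and protected members all become protected members of the derived class. Other members of the derived class can access them directly, but users outside the class cannot access them through a derived-class object.
No base-class member can be accessed through an object of the derived class.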
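Relationship between base class and derived class
A derived class is a specialization of the base class
A derived class is a continuation of the base class's definition
A derived class is a composition of base classes
The constructors and destructor of the base class are not inherited; a derived class generally adds its own constructors.
Normally, when a derived-class object is created, the base-class constructor runs first and the derived-class constructor runs afterwards;
when a derived-class object is destroyed, the derived-class destructor runs first and the base-class destructor runs afterwards.
When the base-class constructor takes no parameters, or no constructor is explicitly defined, the derived class need not pass arguments to the base class and may even omit its own constructor. When the base class has a constructor with parameters, the derived class must define a constructor to provide a way of passing arguments to the base-class constructor. General form: DerivedName(full parameter list) : BaseName(argument list) { ... }
When the derived class contains member objects, the general form of its constructor is: DerivedName(full parameter list) : BaseName(argument list 0), memberObject1(argument list 1), ..., memberObjectN(argument list n) { ... }
When a derived-class object is defined, constructors run in the following order:
Call the base-class constructor;
Call the constructors of the member objects (when there are several, the order is the order in which they are declared in the class);
Execute the body of the derived-class constructor
When the object is destroyed, destructors are called in the reverse of the constructor order.
When the base-class constructor takes no parameters, the derived class may omit its constructor; if the base-class constructor takes parameters, the derived class must define one.
If the base class uses a default constructor or a constructor without parameters, the ": BaseName(argument list)" part can be omitted when defining the derived-class constructor.
If the base class of a derived class is itself a derived class, each derived class is responsible only for constructing its direct base class, and so on up the chain.
Because destructors take no parameters, whether a derived class defines a destructor is independent of its base class. The base-class destructor still runs even if the derived class has no destructor; base-class and derived-class destructors are independent of each other.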
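When a derived class has only one base class, the derivation is called single-base derivation or single inheritance.
When a derived class has several base classes, the derivation is called multi-base derivation or multiple inheritance: class DerivedName : mode1 BaseName1, ..., modeN BaseNameN {}
Constructors run in the same order as in single inheritance:
first the base-class constructors, then the constructors of member objects, and finally the derived-class constructor.
The derived class is responsible for calling the constructors of all its base classes, so its constructor's parameters must include everything needed to initialize all of them.
The order in which constructors of base classes at the same level run depends on the order in which the base classes are listed in the declaration of the derived class, not on the order of the items in the derived constructor's member initializer list.
Access to base-class members must be unambiguous; qualifying with the class name removes ambiguity.
When several direct base classes of a class are themselves derived from a common base class, the members they inherit from that common base have the same names, and a derived-class object holds several copies of those members in memory. One way to tell them apart is to use the scope operator to identify each uniquely. The other is to declare the common base as a virtual base class, so that the derived class keeps only one copy: class DerivedName : virtual mode BaseName {}
If the virtual base class defines a constructor with parameters and no default constructor, then every directly or indirectly derived class in the hierarchy must list a call to the virtual base class's constructor in its member initializer list, so that the data members defined in the virtual base class are initialized.
When an object is created that contains members inherited from a virtual base class, those members are initialized by the constructor of the most derived class through its call to the virtual base class's constructor. Calls to the virtual base class's constructor made by the other base classes are ignored automatically.
If the same level contains both virtual and non-virtual base classes, the virtual base class constructors are called first, then the non-virtual base class constructors, and finally the derived-class constructor.
For several virtual base classes, constructors still run left to right, top to bottom.
For non-virtual base classes, constructors still run left to right, top to bottom.
If a virtual base class is itself derived from a non-virtual base class, the base-class constructor is still called first, then the derived-class constructor.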
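The assignment-compatibility rule says that an object of a publicly derived class can be used wherever a base-class object is required. A publicly derived class therefore has all the features of its base class: any problem the base class can solve, the publicly derived class can solve too. (This was already mentioned under public derivation.)
(1) A derived-class object can be assigned to a base-class object.
(2) A derived-class object can initialize a reference to the base class.
(3) The address of a derived-class object can be assigned to a pointer to the base class. (This form of conversion is the most common in practice.)
(4) A pointer to a derived-class object can be assigned to a pointer to a base-class object.
Notes
(1) A pointer declared as pointing to a base-class object can point to an object of a publicly derived class, but not to an object of a privately derived class.
(2) A pointer declared as pointing to the base class may point to an object of its publicly derived class, but a pointer declared as pointing to a derived class cannot point to a base-class object.
(3) When a pointer declared as pointing to a base-class object points to an object of a publicly derived class, it can be used to access directly only the members inherited from the base class, not the members defined in the derived class.
To access members specific to the publicly derived class, convert the base-class pointer to a derived-class pointer with an explicit cast.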
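Polymorphism means that different objects perform different actions when they receive the same message.
Polymorphism in C++:
Universal polymorphism: parametric polymorphism, inclusion polymorphism
Ad hoc polymorphism: overloading polymorphism, coercion polymorphism
Parametric polymorphism is associated with generic functions and generic classes; function templates and class templates are examples.
Inclusion polymorphism concerns the polymorphic behavior of member functions with the same name defined in different classes of a class hierarchy; it is implemented mainly through virtual functions.
Overloading polymorphism covers function overloading, operator overloading, and so on. Overloading of ordinary functions and of class member functions mostly belongs here.
Coercion polymorphism changes the type of an operand to meet the requirements of a function or operation. For example, when the addition operator adds a floating-point number and an integer, the integer is first converted to floating point and then added.
In C++, compile-time polymorphism is implemented mainly through function overloading and operator overloading, and run-time polymorphism mainly through virtual functions.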
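Virtual functions allow the link between a function call and the function body to be established only at run time; the action is decided at run time. This is called dynamic binding.
A virtual function is a non-static member function and is the basis of dynamic binding. Form: virtual <return type> <function name>(<parameter list>)
If a member function of a class is declared virtual, it may have different implementations in derived classes. When this member function is invoked through a pointer or reference to an object, the call is bound dynamically, that is, resolved at run time.
Dynamic binding applies only when a virtual function is invoked through a pointer or reference. If it is invoked through an ordinary object, the call is bound statically.
When a derived class replaces a virtual function of its base class, the function declared in the derived class must satisfy the following conditions with respect to the replaced base-class function:
(1) It has the same number of parameters as the base-class virtual function
(2) Its parameter types are the same as the corresponding parameter types of the base-class virtual function
(3) Its return type is either the same as that of the base-class virtual function, or both return pointers or references and the type referred to by the derived function's return type is a subtype of the type referred to by the base function's return type
The role of virtual functions
Combined with derived classes, virtual functions give C++ run-time polymorphism: the base class defines the common interface shared by the derived classes, and each derived class defines its own implementation. This is the familiar "one interface, multiple methods", and it helps programmers handle increasingly complex programs.
Defining a virtual function: virtual returnType functionName(parameter list){}
When it is redefined in a derived class, its prototype (function name, number of parameters, and order of parameter types) must match the base-class prototype exactly, and the return type must be the same except in the pointer or reference case described in (3) above.
C++ specifies that if a derived class does not explicitly mark a function with virtual, the following rules decide whether the member function is virtual:
· The function has the same name as a base-class virtual function
· The function has the same number of parameters and the same corresponding parameter types as the base-class virtual function
· The function has the same return type as the base-class virtual function, or a pointer or reference return type that satisfies the assignment-compatibility rule
A derived-class function that satisfies these conditions is automatically virtual.
Notes:
(1) To use C++ polymorphism through virtual functions, the derived class should be publicly derived from its base class, because the assignment-compatibility rule holds only for public derivation.
(2) The virtual function must first be declared in the base class. In practice, declare it in the highest class among the levels of the hierarchy that need dynamic polymorphism.
(3) When a derived class redefines a virtual function declared in the base class, the keyword virtual is optional.
(4) A virtual function can also be called through an object name and the dot operator, but such a call is bound statically at compile time and does not take advantage of the virtual mechanism. Run-time polymorphism is obtained only when the virtual function is called through a base-class pointer or reference.
(5) A virtual function keeps its virtual nature no matter how many times it is publicly inherited.
(6) A virtual function must be a member function of its class. It cannot be a friend function or a static member function, because a virtual call relies on a specific object to decide which function to invoke. A virtual function can, however, be declared a friend in another class.
(7) A call that is resolved dynamically cannot be expanded inline, because the function to run is not known until run time. Even when a virtual function is defined inside the class, such calls are compiled as ordinary (non-inline) calls.
(8) A constructor cannot be virtual. Virtual functions are the basis of run-time polymorphism and operate on existing objects, whereas a constructor runs before the object exists, so a virtual constructor would be meaningless.
(9) A destructor can be virtual, and usually is declared virtual.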
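Virtual destructors
When a program destroys a derived-class object by applying delete to a base-class pointer, a problem can occur: typically only the base-class destructor runs and the derived-class destructor does not.
Solution: declare the base-class destructor virtual.
Once the destructor is virtual, calls through a pointer are bound dynamically, which gives run-time polymorphism and ensures that a pointer of base-class type calls the appropriate destructor to clean up each kind of object.
Relationship between virtual functions and overloaded functions
Redefining a base-class virtual function in a derived class resembles function overloading, but it differs from ordinary overloading.
In ordinary function overloading, the number or types of parameters must differ, and the return types may differ too.
When a virtual function is redefined in a derived class, the function name, return type, number of parameters, and the types and order of parameters must match the base-class prototype.
If only the name is the same while the number, types, or order of parameters differ, the system treats it as ordinary function overloading and the virtual behavior is lost.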
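A pure virtual function is a virtual function declared in a base class without a definition there; each derived class must define its own version or redeclare it as pure virtual. Form: virtual <return type> <function name>(parameter list) = 0;
The only difference in form between the prototype of a pure virtual function and that of an ordinary virtual member function is the trailing = 0, which indicates that the base class does not need to define the function and that its implementation (the function body) is left to derived classes.
A pure virtual function has no function body
The trailing = 0 does not mean the function returns 0
It is a declaration, so it ends with a semicolon
A pure virtual function has only a name and no functionality, so it cannot be called. Once a derived class provides a definition, it gains functionality and can be called.
If a class declares a pure virtual function and a derived class does not define it, the function remains pure virtual in the derived class.
A class with at least one pure virtual function is called an abstract class.
An abstract class can be used only as a base class for other classes; no objects of it can be created, and the implementation of its pure virtual functions is supplied by derived classes.
A derived class must redefine the pure virtual functions of its base class; otherwise it is still treated as an abstract class.
Rules:
(1) Because an abstract class contains at least one pure virtual function without defined functionality, it can be used only as a base class for other classes; no objects of it can be created, and the implementation of its pure virtual functions is supplied by derived classes
(2) Deriving an abstract class from a concrete class is generally avoided (the language itself permits it)
(3) An abstract class cannot be used as a parameter type, a function return type, or the target type of an explicit conversion
(4) A pointer or reference to an abstract class can be declared; it can refer to objects of derived classes and thus provide polymorphism
(5) The destructor of an abstract class can be declared pure virtual; in that case an implementation of the destructor must still be provided
(6) If a derived class does not redefine the pure virtual functions and merely inherits them from the base class, it is still an abstract class. If the derived class implements the base class's pure virtual functions, it is no longer abstract; it is a concrete class whose objects can be created
(7) An abstract class can also define ordinary member functions or virtual functions. Although no object of the abstract class can be declared, these non-pure functions can still be called through derived-class objects.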
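Operator overloading lets the same operator behave differently when applied to data of different types. In essence, the operands become the arguments of an operator function, and the types of the arguments determine which overloaded function is used.
Rules for operator overloading
1. Only operators that already exist in C++ can be overloaded; new operators cannot be invented
2. The member access operator ., the scope resolution operator ::, the pointer-to-member operator .*, the sizeof operator, and the ternary operator ?: cannot be overloaded
3. After overloading, the precedence and associativity of an operator do not change; a unary operator can be overloaded only as a unary operator and a binary operator only as a binary operator
4. The behavior of an overloaded operator should be similar to its original behavior
5. The meaning of an overloaded operator must be clear and unambiguous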
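To overload an operator as a member function, define a member function in the class with the keyword operator; the function name is the operator being overloaded. An operator overloaded as a member function can freely access the data members of the class. Form: <type> <class name>::operator<operator to overload>(parameter list){}
Binary operations
op1 B op2
Overload B as a member function of op1's class with a single parameter whose type is op2's class.
For example, after overloading, op1 + op2 is equivalent to op1.operator+(op2)
Unary operations
(1) Prefix unary operation: U op
Overload U as a member function of op's class with no parameters.
For example, the syntax for overloading ++ is: <return type> operator++();
++op is equivalent to the call op.operator++();
(2) Postfix unary operation: op V
Overload the operator V as a member function of op's class with one integer (int) parameter.
For example, the syntax for overloading the postfix unary operator -- is: <return type> operator--(int);
op-- is equivalent to the call op.operator--(0);
When overloading ++ (or --), the compiler cannot otherwise tell the prefix form from the postfix form, so the postfix version takes an (int) parameter to distinguish them.
Assignment
Overloading the assignment operator generally involves the following steps. First, check for self-assignment and return immediately if it is detected; otherwise the later statements would delete the memory the object itself points to and cause an error. Second, release the existing memory resources. Third, allocate new memory resources and copy the contents. Fourth, return a reference to the current object. If no pointers are involved, the second step is unnecessary.
The assignment operator and the copy constructor are similar in function: both use one object to fill another. The copy constructor runs when an object is being created, whereas the assignment operator runs after the object already exists.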
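friend <return type> operator <binary operator>(<parameter 1>, <parameter 2>);
friend <return type> operator <unary operator>(ClassName &object){}
Here the return type is the return type of the overloaded operator function, and operator<operator symbol> is the function name. When an operator is overloaded as a friend (non-member) function, it is not called on an object, so the operands must be passed into the function as parameters. For a binary operator overloaded as a friend, the parameters are usually the two operands.
Binary operations
op1 B op2
The binary operator B is overloaded as a friend function of op1's class with two parameters; the expression op1 B op2 is equivalent to the call operator B(op1, op2)
Unary operations
(1) Prefix unary operation: U op
The prefix unary operator U is overloaded as a friend function of op's class; the expression U op is equivalent to the call operator U(op)
(2) Postfix unary operation: op V
The postfix unary operator V is overloaded as a friend function of op's class; the expression op V is equivalent to the call operator V(op, int)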
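istream and ostream are predefined stream classes in C++; cin is an istream object and cout is an ostream object. The operator << is overloaded by ostream as the insertion operation and the operator >> is overloaded by istream as the extraction operation, for the input and output of basic data types. The << and >> operators can be overloaded to read and write user-defined types; these overloads must be non-member functions and are usually declared as friends of the class.
ostream & operator <<(ostream &, const UserClass &);
The first parameter and the return type must both be ostream &. The second parameter is a reference to the class type to be output; it can be const, because outputting an object should generally not change it. The return value is an ostream reference, normally the ostream object the output operator was applied to.
istream & operator >>(istream &, UserClass &)
As with the output operator, the first parameter of the input operator is a reference to the stream to read from, and the return value is a reference to that same stream. The second parameter is a non-const reference to the object to read into; it must be non-const because the purpose of the input operator is to read data into this object. Unlike the output operator, the input operator must handle the possibility of errors and end-of-file.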
(Neural networks are also introduced in the Machine Learning course; see my learning notes.)
House price prediction can be regarded as the simplest neural network:
The function can be ReLU (Rectified Linear Unit), which we'll see a lot.
This is a single neuron. A larger neural network is formed by taking many such neurons and stacking them together.
Almost all the economic value created by neural networks has come through supervised learning.
| Input (x) | Output (y) | Application | Neural network |
| --- | --- | --- | --- |
| House features | Price | Real estate | Standard NN |
| Ad, user info | Click on ad? (0/1) | Online advertising | Standard NN |
| Photo | Object (index 1,…,1000) | Photo tagging | CNN |
| Audio | Text transcript | Speech recognition | RNN |
| English | Chinese | Machine translation | RNN |
| Image, radar info | Position of other cars | Autonomous driving | Custom/Hybrid |
Neural Network examples
CNN: often for image data
RNN: often for one-dimensional sequence data
Structured data and Unstructured data
Scale drives deep learning progress
Scale: both the size of the neural network and the amount of data.
Using ReLU instead of the sigmoid function as the activation function can make gradient descent run faster.
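Notation
$(x, y)$: a single training example, with $x \in \mathbb{R}^{n_x}$ and $y \in \{0, 1\}$.
$m$ training examples: $\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)})\}$
Take the training inputs $x^{(1)}, x^{(2)}, \dots$ and stack them in columns: $X = [x^{(1)}\ x^{(2)}\ \cdots\ x^{(m)}] \in \mathbb{R}^{n_x \times m}$. (This makes the implementation much easier than using the transpose of $X$.)
Differences from the former course
The notation is a bit different from what is introduced in the Machine Learning course (see note).
Originally, we add $x_0 = 1$ so that $x \in \mathbb{R}^{n_x + 1}$.
$\hat{y} = \sigma(\theta^T x)$, where $\theta = [\theta_0, \theta_1, \dots, \theta_{n_x}]^T$.
Here in the Deep Learning course, we use $b$ to represent $\theta_0$ and $w$ to represent $\theta_1, \dots, \theta_{n_x}$. Just keep $b$ and $w$ as separate parameters.
Given $x$, want $\hat{y} = P(y = 1 \mid x)$.
Parameters: $w \in \mathbb{R}^{n_x}$, $b \in \mathbb{R}$
Output: $\hat{y} = \sigma(w^T x + b)$
$\sigma(\cdot)$ is the sigmoid function: $\sigma(z) = \frac{1}{1 + e^{-z}}$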
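Cost Function
Loss function: $\mathcal{L}(\hat{y}, y) = -\big(y \log \hat{y} + (1 - y) \log(1 - \hat{y})\big)$
If $y = 1$: $\mathcal{L}(\hat{y}, y) = -\log \hat{y}$
If $y = 0$: $\mathcal{L}(\hat{y}, y) = -\log(1 - \hat{y})$
Cost function: $J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)})$
The loss function applies to a single training example.
The cost function is the cost of your parameters: the average of the loss over the entire training set.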
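As a rough illustration of the difference between the per-example loss and the averaged cost, here is a minimal sketch in plain Python. The function names and the sample predictions are made up for the example and are not taken from the course code.

```python
import math

def sigmoid(z):
    """Logistic function: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def loss(y_hat, y):
    """Cross-entropy loss for ONE example with label y in {0, 1}."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def cost(y_hats, ys):
    """Cost: the average of the per-example losses over the training set."""
    return sum(loss(a, y) for a, y in zip(y_hats, ys)) / len(ys)

# Hypothetical predictions and labels, chosen only for illustration.
predictions = [0.9, 0.2, 0.6]
labels = [1, 0, 1]
print(cost(predictions, labels))
```

A confident correct prediction (0.9 for a positive label) contributes a small loss, while the hesitant 0.6 contributes more.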
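Gradient Descent
Usually initialize the parameters to zero in logistic regression. Random initialization also works, but people don't usually do that for logistic regression.
Repeat {
$w := w - \alpha \frac{\partial J(w, b)}{\partial w}$
$b := b - \alpha \frac{\partial J(w, b)}{\partial b}$
}
From forward propagation, we calculate $z$, $a$ and finally $\mathcal{L}(a, y)$.
From back propagation, we calculate the derivatives step by step: $da = -\frac{y}{a} + \frac{1 - y}{1 - a}$, $dz = a - y$, $dw_j = x_j\, dz$, $db = dz$.
Algorithm
(Repeat)
J=0; dw1,dw2,…,dwn=0; db=0
for i = 1 to m
    z(i) = wT x(i) + b
    a(i) = σ(z(i))
    J += -[y(i) log a(i) + (1 - y(i)) log(1 - a(i))]
    dz(i) = a(i) - y(i)
    for j = 1 to n:
        dwj += xj(i) dz(i)
    db += dz(i)
J /= m;
dw1,dw2,…,dwn /= m;
db /= m
w1 := w1 - α dw1
w2 := w2 - α dw2
b := b - α db
In the for loop, dw carries no superscript (i) because its value in the code is accumulated over all examples, whereas dz(i) refers to a single training example.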
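Vectorization
Original for loop:
for i = 1 to m: $z^{(i)} = w^T x^{(i)} + b$, $a^{(i)} = \sigma(z^{(i)})$
Vectorized:
$Z = w^T X + b$, $A = \sigma(Z)$
Code:
Z = np.dot(w.T, X) + b
A = 1 / (1 + np.exp(-Z))
dZ = A - Y
cost = -1 / m * np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A))
db = 1 / m * np.sum(dZ)
dw = 1 / m * np.dot(X, dZ.T)
A.sum(axis = 0): sum vertically (down each column)
A.sum(axis = 1): sum horizontally (across each row)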
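To show the vectorized expressions working end to end, the sketch below wraps them in one gradient-descent step and runs it on a tiny made-up data set. The function name `gradient_step`, the toy data, and the learning rate are assumptions for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_step(w, b, X, Y, alpha):
    """One vectorized gradient-descent step for logistic regression.

    w: (n, 1) weights, b: scalar bias,
    X: (n, m) inputs stacked in columns, Y: (1, m) labels.
    """
    m = X.shape[1]
    A = sigmoid(np.dot(w.T, X) + b)            # (1, m) predictions
    cost = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
    dZ = A - Y                                 # (1, m)
    dw = np.dot(X, dZ.T) / m                   # (n, 1)
    db = np.sum(dZ) / m                        # scalar
    return w - alpha * dw, b - alpha * db, cost

# Toy data, made up for illustration: 2 features, 4 examples.
X = np.array([[0.0, 1.0, 2.0, 3.0],
              [1.0, 0.0, 1.0, 0.0]])
Y = np.array([[0, 0, 1, 1]])
w, b = np.zeros((2, 1)), 0.0
costs = []
for _ in range(200):
    w, b, c = gradient_step(w, b, X, Y, alpha=0.5)
    costs.append(c)
```

No explicit loop over the m examples or the n features appears; the only remaining loop is the one over gradient-descent iterations.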
Broadcasting
If an (m, n) matrix operates with (+ - * /) a (1, n) row vector, the vector is expanded vertically to (m, n) by copying it m times.
If an (m, n) matrix operates with an (m, 1) column vector, the vector is expanded horizontally to (m, n) by copying it n times.
If a row or column vector operates with a real number, the real number is expanded to the shape of the vector.
See the NumPy documentation on broadcasting.
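The three broadcasting cases can be checked directly in NumPy. The arrays below are arbitrary small examples chosen so the results are easy to verify by hand.

```python
import numpy as np

M = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])            # shape (2, 3), i.e. (m, n)
row = np.array([[10.0, 20.0, 30.0]])       # shape (1, 3): copied down the m rows
col = np.array([[100.0], [200.0]])         # shape (2, 1): copied across the n columns

plus_row = M + row       # every row of M gets `row` added
plus_col = M + col       # every column of M gets `col` added
plus_scalar = row + 1    # the scalar is expanded to the vector's shape

# A typical use: column-wise percentages, dividing by the (1, n) column sums.
percent = 100 * M / M.sum(axis=0, keepdims=True)
```

`keepdims=True` keeps the column sums as a (1, 3) array, so the division is an explicit matrix-by-row-vector broadcast.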
Rank 1 Array
a = np.random.randn(5) creates a rank 1 array whose shape is (5,).
Try to avoid using rank 1 arrays. Use a = a.reshape((5, 1)) or a = np.random.randn(5, 1) instead.
Note that np.dot() performs a matrix-matrix or matrix-vector multiplication. This is different from np.multiply() and the * operator (which is equivalent to .* in MATLAB/Octave), which perform an elementwise multiplication.
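Superscripts in square brackets denote the layer; superscripts in round brackets refer to the i-th training example.
Logistic regression can be regarded as the simplest neural network. The neuron takes in the inputs and makes two computations: $z = w^T x + b$ and $a = \sigma(z)$.
A neural network works similarly. (Note that this neural network has 2 layers. When counting layers, the input layer is not included.)
Take the first node in the hidden layer as an example: $z^{[1]}_1 = w^{[1]T}_1 x + b^{[1]}_1$, $a^{[1]}_1 = \sigma(z^{[1]}_1)$
The superscript $[l]$ denotes the layer, and the subscript i represents the node in that layer.
Similarly, $z^{[1]}_i = w^{[1]T}_i x + b^{[1]}_i$, $a^{[1]}_i = \sigma(z^{[1]}_i)$ for every node i of the layer.
Vectorization:
Formula: $z^{[1]} = W^{[1]} x + b^{[1]}$, $a^{[1]} = \sigma(z^{[1]})$, $z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}$, $a^{[2]} = \sigma(z^{[2]})$
(dimensions: $z^{[1]}$: $(n^{[1]}, 1)$; $W^{[1]}$: $(n^{[1]}, n^{[0]})$; $x$: $(n^{[0]}, 1)$; $b^{[1]}$: $(n^{[1]}, 1)$)
Vectorizing across multiple examples: $Z^{[1]} = W^{[1]} X + b^{[1]}$, $A^{[1]} = \sigma(Z^{[1]})$, $Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}$, $A^{[2]} = \sigma(Z^{[2]})$
Explanation
Stack the per-example vectors in columns: $Z^{[1]} = [z^{[1](1)}\ z^{[1](2)}\ \cdots\ z^{[1](m)}]$.
Each column represents a training example, each row represents a hidden unit.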
Sigmoid Function
Only used in the output layer of binary classification (with output 0 or 1).
Not used on other occasions; tanh is a better choice.
tanh Function
With a range of $(-1, 1)$, it performs better than the sigmoid function because the mean of its output is closer to zero.
Both sigmoid and tanh have the disadvantage that when z is very large ($z \to +\infty$) or very small ($z \to -\infty$), the derivative is close to 0, so gradient descent becomes very slow.
ReLU
Default choice of activation function: $g(z) = \max(0, z)$.
With $g'(z) = 1$ when z is positive, it performs well in practice.
(Although $g'(z) = 0$ when z is negative, and technically the derivative at $z = 0$ is not well-defined.)
Leaky ReLU
Makes sure that the derivative is not 0 when z < 0, e.g. $g(z) = \max(0.01z, z)$.
Linear Activation Function
Also called the identity function.
Not used in hidden layers, because a network with only linear activations still computes a linear function no matter how many hidden layers it has. It is used only in the output layer when the output is a real number.
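The derivative behaviour described for each activation can be checked numerically. The sketch below uses plain Python; the leaky slope of 0.01 and the convention of returning 0 for the ReLU derivative at z = 0 are assumptions made for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    a = sigmoid(z)
    return a * (1.0 - a)

def tanh_prime(z):
    return 1.0 - math.tanh(z) ** 2

def relu(z):
    return max(0.0, z)

def relu_prime(z):
    # The derivative at z == 0 is undefined; returning 0 there is a convention.
    return 1.0 if z > 0 else 0.0

def leaky_relu(z, slope=0.01):
    # `slope` is an assumed value for the negative side.
    return z if z > 0 else slope * z

def leaky_relu_prime(z, slope=0.01):
    return 1.0 if z > 0 else slope

# Saturation: sigmoid and tanh gradients shrink for large |z|, ReLU's does not.
for z in (0.0, 5.0, 10.0):
    print(z, sigmoid_prime(z), tanh_prime(z), relu_prime(z))
```

The printed rows show the sigmoid and tanh derivatives collapsing towards 0 as z grows, which is the slow-learning problem mentioned above.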
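Forward propagation:
$Z^{[1]} = W^{[1]} X + b^{[1]}$, $A^{[1]} = g^{[1]}(Z^{[1]})$, $Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}$, $A^{[2]} = g^{[2]}(Z^{[2]}) = \sigma(Z^{[2]})$
Backward propagation:
$dZ^{[2]} = A^{[2]} - Y$
$dW^{[2]} = \frac{1}{m} dZ^{[2]} A^{[1]T}$
db[2] = 1/m * np.sum(dZ[2], axis=1, keepdims=True)
$dZ^{[1]} = W^{[2]T} dZ^{[2]} \mathbin{.*} g^{[1]\prime}(Z^{[1]})$
$dW^{[1]} = \frac{1}{m} dZ^{[1]} X^T$
db[1] = 1/m * np.sum(dZ[1], axis=1, keepdims=True)
Note: keepdims=True makes sure that Python won't produce a rank-1 array with shape (n,).
.* is the elementwise product. $W^{[2]T} dZ^{[2]}$: $(n^{[1]}, m)$; $g^{[1]\prime}(Z^{[1]})$: $(n^{[1]}, m)$; $dZ^{[1]}$: $(n^{[1]}, m)$.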
In logistic regression, it's okay to initialize all parameters to zero. However, this does not work in a neural network.
Instead, initialize W with small random values to break symmetry. It's okay to initialize b to zeros: symmetry is still broken so long as $W$ is initialized randomly.
Random
If the weights W are all zeros, then the neurons in a hidden layer are symmetric ("identical"), and they stay identical even after gradient descent updates. So use random initialization.
Small
Both sigmoid and tanh have their greatest derivative at z=0. If z were very large or very small, the derivative would be close to zero and gradient descent would be slow. Thus it's a good choice to keep the initial values small.
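Deep neural network notation
$L$: number of layers
$n^{[l]}$: number of units in layer l
$a^{[l]}$: activations in layer l.
($a^{[l]} = g^{[l]}(z^{[l]})$, with $a^{[0]} = x$ and $a^{[L]} = \hat{y}$)
$W^{[l]}$: weights for $z^{[l]}$
$b^{[l]}$: bias for $z^{[l]}$
Forward Propagation
for l = 1 to L:
$Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}$
$A^{[l]} = g^{[l]}(Z^{[l]})$
Well, this for loop is inevitable.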
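Matrix Dimensions
$W^{[l]}$, $dW^{[l]}$: $(n^{[l]}, n^{[l-1]})$
$b^{[l]}$, $db^{[l]}$: $(n^{[l]}, 1)$
(here the dimension of $b^{[l]}$ can stay $(n^{[l]}, 1)$ with Python's broadcasting)
$Z^{[l]}$, $A^{[l]}$, $dZ^{[l]}$, $dA^{[l]}$: $(n^{[l]}, m)$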
cache
The cache is used to pass variables computed during forward propagation to the corresponding backward propagation step. It contains values that backward propagation needs to compute derivatives.
Why deep representations?
Informally: there are functions you can compute with a "small" L-layer deep neural network that shallower networks require exponentially more hidden units to compute.
Forward propagation for layer l
Input $a^{[l-1]}$
Output $a^{[l]}$, cache $z^{[l]}$
Backward propagation for layer l
Input $da^{[l]}$
Output $da^{[l-1]}$, $dW^{[l]}$, $db^{[l]}$
Hyperparameters determine the final values of the parameters.
Parameters
· $W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]}, \dots$
Hyperparameters
· learning rate $\alpha$
· number of iterations
· number of hidden layers
· number of hidden units
· choice of activation function
· momentum, mini-batch size, regularization, etc.
Training set:
Keep training algorithms on the training set.
Development set
Also called the hold-out cross-validation set, or dev set for short.
Use the dev set to see which of many different models performs best.
Test set
Used to get an unbiased estimate of how well your algorithm is doing.
Proportion
Previous era: when the amount of data was not too large, it was common to split all the data 70%/30% or 60%/20%/20%.
Big data: with millions of examples, 10,000 examples in the dev set and 10,000 in the test set are enough. The proportion can be 98/1/1 or even 99.5/0.4/0.1.
Notes
Make sure the dev set and test set come from the same distribution.
Not having a test set might be okay if an unbiased estimate of performance is not needed. In that case the dev set is often called the "test set", even though there is no real test set.
Solutions
High bias:
Bigger network
Train longer
(Neural network architecture search)
High variance:
More data
Regularization
(Neural network architecture search)
Bias-variance tradeoff
Originally, reducing bias could increase variance, and vice versa, so it was necessary to trade one off against the other.
But in deep learning there are ways to reduce one without increasing the other, so the bias-variance tradeoff is much less of a concern.
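Logistic regression
L2 regularization relies on the assumption that a model with small weights is simpler than a model with large weights. Thus, by penalizing the squared values of the weights in the cost function, you drive all the weights to smaller values: large weights become too costly. This leads to a smoother model in which the output changes more slowly as the input changes.
Weights end up smaller ("weight decay"): weights are pushed to smaller values.
L2 regularization: $J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m} \|w\|_2^2$, where $\|w\|_2^2 = \sum_j w_j^2 = w^T w$
L1 regularization: $J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m} \|w\|_1$, where $\|w\|_1 = \sum_j |w_j|$
(L1 regularization leads $w$ to be sparse, but it is not very effective)
Neural network
$J = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m} \sum_{l=1}^{L} \|W^{[l]}\|_F^2$
$\|W^{[l]}\|_F^2 = \sum_i \sum_j (w^{[l]}_{ij})^2$; it's called the Frobenius norm, which is different from the Euclidean distance.
Back propagation: $dW^{[l]} = (\text{term from backprop}) + \frac{\lambda}{m} W^{[l]}$, $W^{[l]} := W^{[l]} - \alpha\, dW^{[l]}$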
With dropout, we go through each layer of the network and set a probability of eliminating each node.
For each training example, you train using one of these reduced networks.
The idea behind dropout is that at each iteration you train a different model that uses only a subset of your neurons. With dropout, your neurons thus become less sensitive to the activation of any one other neuron, because that other neuron might be shut down at any time.
Usually used in computer vision.
Implementation with layer 3
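A minimal sketch of dropout on one layer's activations, assuming the common "inverted dropout" variant (scaling the surviving activations by 1/keep_prob). The variable names, the keep probability of 0.8, and the random stand-in activations are assumptions for illustration, not the original listing.

```python
import numpy as np

rng = np.random.default_rng(0)             # fixed seed so the run is repeatable
keep_prob = 0.8                            # assumed probability of keeping a unit

a3 = rng.standard_normal((5, 4))           # stand-in activations of layer 3: (units, examples)
d3 = rng.random(a3.shape) < keep_prob      # boolean mask: True where the unit is kept
a3_dropped = a3 * d3                       # shut down the dropped units
a3_dropped = a3_dropped / keep_prob        # scale up so the expected value is unchanged
```

The mask is redrawn on every training iteration and is not applied at test time; the division by keep_prob is what lets the test-time forward pass run without any extra scaling.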


Data augmentation
Take image input as an example: flip the image horizontally, rotate it, randomly zoom or distort it, etc.
This yields more training data at little cost and helps reduce overfitting.
Early stopping
Stop early so that $\|W\|_F^2$ is relatively small.
Early stopping violates orthogonalization, which suggests treating "optimize the cost function J" and "do not overfit" as separate tasks.
Subtract mean: $\mu = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}$, $x := x - \mu$
Normalize variance: $\sigma^2 = \frac{1}{m} \sum_{i=1}^{m} (x^{(i)})^2$ (elementwise), $x := x / \sigma$
Note: use the same $\mu$ and $\sigma$ to normalize the test set.
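Intuition:
Since the number of layers in deep learning may be quite large, the product over L layers may tend to $\infty$ or $0$. (Just think about $c^L$ for $c > 1$ and for $c < 1$.)
Weight initialization for deep networks
Take a single neuron as an example: $z = w_1 x_1 + w_2 x_2 + \dots + w_n x_n$
If $n$ is large, then each $w_i$ should be smaller. Our goal is to get $\mathrm{Var}(w_i) = \frac{1}{n}$.
Random initialization for ReLU: $\mathrm{Var}(w_i) = \frac{2}{n^{[l-1]}}$, i.e. scale standard normal samples by $\sqrt{2 / n^{[l-1]}}$ (known as He initialization, named for the first author of He et al., 2015).
For tanh: use $\sqrt{\frac{1}{n^{[l-1]}}}$
Xavier initialization: $\sqrt{\frac{2}{n^{[l-1]} + n^{[l]}}}$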
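The scaling rules can be sketched as a small helper that draws standard normal weights and multiplies them by the appropriate factor. The function name `init_weights`, the scheme labels, and the layer sizes are hypothetical; the three factors are the usual He, 1/n, and Xavier/Glorot scalings and are an assumption about which formulas these notes originally showed.

```python
import numpy as np

def init_weights(n_prev, n_curr, kind, rng):
    """Return an (n_curr, n_prev) weight matrix scaled for the given scheme."""
    if kind == "he":              # for ReLU: variance 2 / n_prev
        scale = np.sqrt(2.0 / n_prev)
    elif kind == "tanh":          # variance 1 / n_prev
        scale = np.sqrt(1.0 / n_prev)
    elif kind == "xavier":        # variance 2 / (n_prev + n_curr)
        scale = np.sqrt(2.0 / (n_prev + n_curr))
    else:
        raise ValueError(f"unknown scheme: {kind}")
    return rng.standard_normal((n_curr, n_prev)) * scale

rng = np.random.default_rng(1)
W = init_weights(n_prev=500, n_curr=400, kind="he", rng=rng)
print(W.var(), 2.0 / 500)         # empirical variance vs. the target
```

With a layer this wide, the empirical variance of W lands very close to the target 2/n_prev.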
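Take $W^{[1]}, b^{[1]}, \dots, W^{[L]}, b^{[L]}$ and reshape into a big vector $\theta$:
Take $dW^{[1]}, db^{[1]}, \dots, dW^{[L]}, db^{[L]}$ and reshape into a big vector $d\theta$
for each i:
$d\theta_{approx}[i] = \frac{J(\theta_1, \dots, \theta_i + \varepsilon, \dots) - J(\theta_1, \dots, \theta_i - \varepsilon, \dots)}{2\varepsilon}$
check if $d\theta_{approx} \approx d\theta$?
Calculate $\frac{\|d\theta_{approx} - d\theta\|_2}{\|d\theta_{approx}\|_2 + \|d\theta\|_2}$. (A value around $10^{-7}$ is great.)
Note
Gradient checking verifies closeness between the gradients from backpropagation and the numerical approximation of the gradient (computed using forward propagation).
Gradient checking is slow, so we don't run it in every iteration of training. You would usually run it only to make sure your code is correct, then turn it off and use backprop for the actual learning process.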
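The procedure can be tried on a toy cost whose gradient is known by hand. In this sketch the cost function, the hand-derived gradient (standing in for backprop output), and the deliberately wrong gradient are all invented for illustration.

```python
import math

def cost(theta):
    """Toy cost J(theta) = theta_0^2 + 3 * theta_0 * theta_1."""
    return theta[0] ** 2 + 3.0 * theta[0] * theta[1]

def analytic_grad(theta):
    """Hand-derived gradient, standing in for the output of backprop."""
    return [2.0 * theta[0] + 3.0 * theta[1], 3.0 * theta[0]]

def buggy_grad(theta):
    """A deliberately wrong gradient (the 3 * theta_1 term is missing)."""
    return [2.0 * theta[0], 3.0 * theta[0]]

def numeric_grad(f, theta, eps=1e-7):
    """Two-sided difference approximation, one component at a time."""
    grad = []
    for i in range(len(theta)):
        plus = list(theta)
        plus[i] += eps
        minus = list(theta)
        minus[i] -= eps
        grad.append((f(plus) - f(minus)) / (2.0 * eps))
    return grad

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def relative_difference(g1, g2):
    diff = [a - b for a, b in zip(g1, g2)]
    return norm(diff) / (norm(g1) + norm(g2))

theta = [1.5, -2.0]
approx = numeric_grad(cost, theta)
good = relative_difference(approx, analytic_grad(theta))
bad = relative_difference(approx, buggy_grad(theta))
print(good, bad)
```

The correct gradient gives a tiny relative difference, while the buggy one gives a difference many orders of magnitude larger, which is the signal gradient checking looks for.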
Batch gradient descent (the original gradient descent we have seen) processes the entire training set before updating the parameters by one small step. If the training set is large, training is slow. The idea of mini-batch gradient descent is to use part of the training set for each update, so the parameters are updated more often.
For example, if $X$'s dimension is $(n_x, m)$, divide the training set into $T$ parts, each with dimension $(n_x, m/T)$, i.e. $X = [X^{\{1\}}, X^{\{2\}}, \dots, X^{\{T\}}]$
Similarly, $Y = [Y^{\{1\}}, Y^{\{2\}}, \dots, Y^{\{T\}}]$.
One iteration of mini-batch gradient descent (computing on a single mini-batch) is faster than one iteration of batch gradient descent.
Two steps of mini-batch gradient descent:
repeat {
    for t = 1,…,5000 {
        Forward prop on $X^{\{t\}}$
        …
        Compute cost $J^{\{t\}}$
        Backprop to compute gradients of cost $J^{\{t\}}$ (using $(X^{\{t\}}, Y^{\{t\}})$)
    } # this is called 1 epoch
}
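Building the mini-batches themselves might look like the sketch below. Shuffling the columns before slicing is an assumption (a common practice, not stated in these notes), and the helper name `make_mini_batches` and the toy data are hypothetical.

```python
import numpy as np

def make_mini_batches(X, Y, batch_size, rng):
    """Shuffle the columns of X (n_x, m) and Y (1, m) together, then slice them."""
    m = X.shape[1]
    order = rng.permutation(m)
    X_shuffled, Y_shuffled = X[:, order], Y[:, order]
    return [(X_shuffled[:, k:k + batch_size], Y_shuffled[:, k:k + batch_size])
            for k in range(0, m, batch_size)]

rng = np.random.default_rng(0)
X = np.arange(20.0).reshape(2, 10)         # toy data: n_x = 2, m = 10
Y = (np.arange(10) % 2).reshape(1, 10)     # label = parity of the example index
batches = make_mini_batches(X, Y, batch_size=4, rng=rng)
for X_t, Y_t in batches:
    print(X_t.shape, Y_t.shape)
```

When m is not a multiple of the batch size, the last mini-batch is simply smaller; here the 10 examples split into batches of 4, 4 and 2.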
Choosing mini-batch size
Mini-batch size = m: batch gradient descent.
It has to process the whole training set before making progress, which takes too long per iteration.
Mini-batch size = 1: stochastic gradient descent.
It loses the benefits of vectorization across examples.
Mini-batch size between 1 and m:
Fastest learning: it uses vectorization and makes progress without processing the entire training set.
If the training set is small (m ≤ 2000): just use batch gradient descent.
Typical mini-batch sizes: 64, 128, 256, 512 (1024)
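E.g.
$v_t = \beta v_{t-1} + (1 - \beta)\theta_t$
$v_{t-1} = \beta v_{t-2} + (1 - \beta)\theta_{t-1}$
$v_{t-2} = \beta v_{t-3} + (1 - \beta)\theta_{t-2}$
…
Replace $v_{t-1}$ with the second equation, then replace $v_{t-2}$ with the third equation, and so on. Finally we'd get $v_t = (1 - \beta) \sum_{k=0}^{t-1} \beta^k \theta_{t-k}$ (with $v_0 = 0$).
This is why it is called an exponentially weighted average. In practice, $\beta = 0.9$, so it averages over roughly $\frac{1}{1 - \beta} = 10$ examples.
Bias correction
As shown above, the purple line is the exponentially weighted average without bias correction; at the very beginning it is much lower than the exponentially weighted average with bias correction (green line).
Since $v_0$ is set to zero (and $\beta$ is close to 1), the first values are quite small, and they stay small until t gets larger. To avoid this, bias correction introduces another step: use $\frac{v_t}{1 - \beta^t}$ instead of $v_t$.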
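The effect of bias correction is easy to see on a constant series, where the true average is known. This is a small sketch in plain Python; the function name `ewa` and the data are invented for the example.

```python
def ewa(values, beta, bias_correction=False):
    """Return the exponentially weighted averages v_1..v_T of `values`."""
    v, out = 0.0, []
    for t, theta in enumerate(values, start=1):
        v = beta * v + (1.0 - beta) * theta
        out.append(v / (1.0 - beta ** t) if bias_correction else v)
    return out

data = [10.0] * 50                  # constant series, made up for illustration
raw = ewa(data, beta=0.9)
corrected = ewa(data, beta=0.9, bias_correction=True)
print(raw[:3], corrected[:3])
```

Without correction the first averages start far below the true value 10 and creep upward; with correction they sit at 10 from the first step, and the two versions agree once t is large.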
Set $v_{dW} = 0$, $v_{db} = 0$
On iteration t:
Compute dW, db on the current mini-batch
$v_{dW} = \beta v_{dW} + (1 - \beta)\, dW$, $v_{db} = \beta v_{db} + (1 - \beta)\, db$
$W := W - \alpha v_{dW}$, $b := b - \alpha v_{db}$
Momentum takes past gradients into account to smooth out the gradient steps. Gradient descent with momentum follows the same idea as the exponentially weighted average (although some formulations omit the $(1 - \beta)$ factor). Just as in the example shown above, we want slow learning horizontally and faster learning vertically. The exponentially weighted average helps eliminate the horizontal oscillation and makes gradient descent faster. Note that gradient descent with momentum does not need bias correction: after several iterations the average has warmed up.
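A minimal sketch of the momentum update on a one-dimensional toy cost. The cost $J(w) = w^2$, the starting point, and the hyperparameter values are assumptions picked so the run is easy to follow.

```python
def grad(w):
    """Gradient of the toy cost J(w) = w^2 (an assumed example)."""
    return 2.0 * w

def momentum_descent(w, alpha=0.1, beta=0.9, steps=200):
    v = 0.0
    for _ in range(steps):
        dw = grad(w)
        v = beta * v + (1.0 - beta) * dw    # exponentially weighted average of gradients
        w = w - alpha * v                   # step along the smoothed gradient
    return w

w_final = momentum_descent(w=5.0)
print(w_final)
```

No bias correction is applied here, matching the remark above: the first few steps are smaller than they would otherwise be, but the run still converges to the minimum at 0.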
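On iteration t:
Compute dW, db on the current mini-batch
$s_{dW} = \beta s_{dW} + (1 - \beta)\, dW^2$, $s_{db} = \beta s_{db} + (1 - \beta)\, db^2$ (elementwise squares)
$W := W - \alpha \frac{dW}{\sqrt{s_{dW}} + \varepsilon}$, $b := b - \alpha \frac{db}{\sqrt{s_{db}} + \varepsilon}$
RMS means Root Mean Square; RMSprop divides each gradient by the root of the running mean of its squares to adjust the gradient descent step.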
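Combine momentum and RMSprop together:
1. It calculates an exponentially weighted average of past gradients, and stores it in variable v (before bias correction) and v_corrected (with bias correction).
2. It calculates an exponentially weighted average of the squares of the past gradients, and stores it in variable s (before bias correction) and s_corrected (with bias correction).
3. It updates parameters in a direction based on combining information from 1 and 2.
Set $v_{dW} = 0$, $s_{dW} = 0$, $v_{db} = 0$, $s_{db} = 0$
On iteration t:
Compute dW, db on the current mini-batch
$v_{dW} = \beta_1 v_{dW} + (1 - \beta_1)\, dW$, $s_{dW} = \beta_2 s_{dW} + (1 - \beta_2)\, dW^2$ (and likewise for $db$)
$v^{corrected}_{dW} = \frac{v_{dW}}{1 - \beta_1^t}$, $s^{corrected}_{dW} = \frac{s_{dW}}{1 - \beta_2^t}$
$W := W - \alpha \frac{v^{corrected}_{dW}}{\sqrt{s^{corrected}_{dW}} + \varepsilon}$ (and likewise for $b$)
Hyperparameters:
$\alpha$: needs to be tuned
$\beta_1$: 0.9
$\beta_2$: 0.999
$\varepsilon$: $10^{-8}$
(Adam stands for Adaptive Moment Estimation)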
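The three numbered steps can be written out for a single scalar parameter. This is a sketch, not a reference implementation: the toy gradient (of $(w - 3)^2$), the learning rate 0.1, and the step count are assumptions, while $\beta_1 = 0.9$, $\beta_2 = 0.999$ and $\varepsilon = 10^{-8}$ are the usual defaults.

```python
import math

def adam_minimize(grad, w, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    v, s = 0.0, 0.0
    for t in range(1, steps + 1):
        dw = grad(w)
        v = beta1 * v + (1.0 - beta1) * dw           # 1. average of past gradients
        s = beta2 * s + (1.0 - beta2) * dw * dw      # 2. average of squared gradients
        v_corrected = v / (1.0 - beta1 ** t)         # bias correction
        s_corrected = s / (1.0 - beta2 ** t)
        w = w - alpha * v_corrected / (math.sqrt(s_corrected) + eps)   # 3. combined update
    return w

# Toy problem: minimize (w - 3)^2, whose gradient is 2 * (w - 3).
w_final = adam_minimize(lambda w: 2.0 * (w - 3.0), w=0.0)
print(w_final)
```

Because of the division by the root of s, the very first step has size close to alpha regardless of how large the raw gradient is.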
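Mini-batch gradient descent with a fixed learning rate may not converge exactly; it tends to wander around the optimum instead. To help it converge, it's advisable to decay the learning rate as the number of iterations grows.
Some formulas:
$\alpha = \frac{1}{1 + \text{decay\_rate} \times \text{epoch\_num}}\, \alpha_0$
$\alpha = 0.95^{\text{epoch\_num}}\, \alpha_0$
$\alpha = \frac{k}{\sqrt{\text{epoch\_num}}}\, \alpha_0$
discrete staircase (halve α after some iterations)
manual decay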
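A few common decay schedules can be compared side by side as plain functions of the epoch number. The function names, the base 0.95, the constant k, and the staircase interval are assumed values for illustration.

```python
import math

def inverse_decay(alpha0, epoch, decay_rate=1.0):
    return alpha0 / (1.0 + decay_rate * epoch)

def exponential_decay(alpha0, epoch, base=0.95):
    return alpha0 * base ** epoch

def sqrt_decay(alpha0, epoch, k=1.0):
    # Defined for epoch >= 1.
    return alpha0 * k / math.sqrt(epoch)

def staircase_decay(alpha0, epoch, every=10):
    # Halve the learning rate after every `every` epochs.
    return alpha0 * 0.5 ** (epoch // every)

schedule = [inverse_decay(0.2, e) for e in range(4)]
print(schedule)
```

With alpha0 = 0.2 and decay_rate = 1, the inverse schedule gives roughly 0.2, 0.1, 0.067, 0.05 over the first four epochs.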
Hyperparameters: $\alpha$, $\beta$, $\beta_1$, $\beta_2$, $\varepsilon$, number of layers, number of units, learning rate decay, mini-batch size, etc.
Priority:
Try random values of hyperparameters rather than a grid.
Coarse to fine: if some region gives good results, sample more densely in that region.
Appropriate scale:
It's okay to sample uniformly at random for some hyperparameters: number of layers, number of units.
For other hyperparameters like $\alpha$, instead of sampling uniformly at random, sample randomly on a logarithmic scale.
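Sampling on a logarithmic scale means drawing the exponent uniformly rather than the value itself. In this sketch the search range $[10^{-4}, 1]$ and the function name are assumptions chosen for the example.

```python
import random

def sample_learning_rate(rng, low_exp=-4, high_exp=0):
    """Sample alpha in [10**low_exp, 10**high_exp], uniform in the exponent."""
    r = rng.uniform(low_exp, high_exp)
    return 10 ** r

rng = random.Random(0)
alphas = [sample_learning_rate(rng) for _ in range(1000)]

# Each decade should receive about a quarter of the samples.
smallest_decade = sum(1 for a in alphas if a < 1e-3)
print(smallest_decade)
```

Sampling uniformly on [0.0001, 1] instead would put about 90% of the draws in [0.1, 1] and almost none in the smallest decades.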
Pandas & Caviar
Panda: babysitting one model at a time
Caviar: training many models in parallel
Largely determined by the amount of computational power you can access.
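Using the idea of normalizing inputs, normalize the values in hidden layers as well.
Given some intermediate values $z^{(1)}, \dots, z^{(m)}$ in the neural network (specifically in a single layer): $\mu = \frac{1}{m} \sum_i z^{(i)}$, $\sigma^2 = \frac{1}{m} \sum_i (z^{(i)} - \mu)^2$, $z^{(i)}_{norm} = \frac{z^{(i)} - \mu}{\sqrt{\sigma^2 + \varepsilon}}$, $\tilde{z}^{(i)} = \gamma z^{(i)}_{norm} + \beta$
Use $\tilde{z}^{(i)}$ instead of $z^{(i)}$
Batch Norm as regularization
Each mini-batch is scaled by the mean/variance computed on just that mini-batch.
This adds some noise to the values within that mini-batch. So, similar to dropout, it adds some noise to each hidden layer's activations.
This has a slight regularization effect.
Batch Norm at test time: use exponentially weighted averages of $\mu$ and $\sigma^2$, computed across mini-batches during training.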
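Softmax
The output layer is a vector with dimension C rather than a real number, where C is the number of classes.
Activation function: $a_i = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}$
Cost function: $\mathcal{L}(\hat{y}, y) = -\sum_{j=1}^{C} y_j \log \hat{y}_j$, $J = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)})$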
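A small sketch of the softmax activation and its cross-entropy loss in plain Python. The scores are made-up output-layer values for C = 4 classes; subtracting the maximum before exponentiating is a standard numerical-stability trick added here, not something stated in the notes.

```python
import math

def softmax(z):
    """Map a list of C scores to probabilities that sum to 1."""
    shift = max(z)                          # subtracting the max avoids overflow
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y_hat, y):
    """Loss for one example; y is a one-hot list of length C."""
    return -sum(t * math.log(p) for t, p in zip(y, y_hat) if t > 0)

scores = [5.0, 2.0, -1.0, 3.0]              # made-up output-layer values, C = 4
probs = softmax(scores)
loss = cross_entropy(probs, [1, 0, 0, 0])
print(probs, loss)
```

With a one-hot label, the loss reduces to the negative log of the probability assigned to the true class, so it is small when that probability is close to 1.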
Choosing deep learning frameworks
Ease of programming (development and deployment)
Running speed
Truly open (open source with good governance)
Writing and running programs in TensorFlow has the following steps:
tf.constant(...): to create a constant value
tf.placeholder(dtype = ..., shape = ..., name = ...): a placeholder is an object whose value you can specify only later
tf.add(..., ...): to do an addition
tf.multiply(..., ...): to do a multiplication
tf.matmul(..., ...): to do a matrix multiplication
2 typical ways to create and use sessions in TensorFlow:




Orthogonalization or orthogonality is a system design property that assures that modifying an instruction or a component of an algorithm will not create or propagate side effects to other components of the system. It becomes easier to verify the algorithms independently from one another, and it reduces testing and development time.
When a supervised learning system is designed, these are the 4 assumptions that need to be true and orthogonal.
Precision
Among all the examples predicted as positive, the fraction that are actually positive.
Recall
Among all the positive examples, the fraction that are correctly predicted as positive.
F1-Score
The problem with using precision and recall as the evaluation metric is that when one classifier has better precision and another has better recall, you cannot tell which is better. The F1-score, the harmonic mean of the two, combines precision and recall into a single number: $F_1 = \frac{2}{\frac{1}{P} + \frac{1}{R}}$.
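The three metrics follow directly from counts of true positives, false positives, and false negatives. The labels and predictions below are invented for the example, and returning 0.0 when a denominator is zero is a convention chosen here.

```python
def precision_recall_f1(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Made-up labels and predictions for illustration.
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 1, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```

Here there are 3 true positives, 1 false positive, and 1 false negative, so precision and recall are both 0.75 and so is their harmonic mean.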
Satisficing and optimizing metric
There are different metrics to evaluate the performance of a classifier; they are called evaluation metrics. They can be categorized as satisficing and optimizing metrics. Note that these evaluation metrics must be evaluated on a training set, a development set, or the test set.
The general rule is:
For example:
| Classifier | Accuracy | Running time |
| --- | --- | --- |
| A | 90% | 80ms |
| B | 92% | 95ms |
| C | 95% | 1500ms |
For example, there are two evaluation metrics: accuracy and running time. Take accuracy as the optimizing metric and running time as the satisficing metric. A classifier has to meet the threshold set for each satisficing metric; among the classifiers that do, pick the one with the best optimizing metric.
It's important to choose the development and test sets from the same distribution, and both must be drawn randomly from all the data.
Guideline: Choose a dev set and test set to reflect data you expect to get in the future and consider important to do well on.
Size
Old way of splitting data:
We had smaller data sets, so we had to use a greater percentage of the data to develop and test ideas and models.
Modern era - Big data:
Now, because a larger amount of data is available, we don't have to compromise and can use a greater portion to train the model.
Set your dev set to be big enough to detect differences in algorithms/models you’re trying out.
Set your test set to be big enough to give high confidence in the overall performance of your system.
When to change dev/test sets and metrics
Orthogonalization:
How to define a metric to evaluate classifiers.
Worry separately about how to do well on this metric.
If doing well on your metric + dev/test set does not correspond to doing well on your application, change your metric and/or dev/test set.
The graph shows the performance of humans and machine learning over time.
Machine learning progress slows down once it surpasses human-level performance. One of the reasons is that human-level performance can be close to Bayes optimal error, especially for natural perception problems.
Bayes optimal error is defined as the best possible error. In other words, no function mapping from x to y can surpass a certain level of accuracy (for various reasons, e.g. blurry images, noisy audio, etc.).
Humans are quite good at a lot of tasks. So long as machine learning is worse than humans, you can:
Human-level error serves as a proxy for Bayes error (i.e. human-level error ≈ Bayes error).
The difference between human-level error and training error is regarded as "avoidable bias".
If the difference between human-level error and the training error is bigger than the difference between the training error and the development error, the focus should be on bias reduction techniques:
· Train bigger model
· Train longer/better optimization algorithms(momentum, RMSprop, Adam)
· NN architecture/hyperparameters search(RNN,CNN)
If the difference between the training error and the development error is bigger than the difference between the human-level error and the training error, the focus should be on variance reduction techniques:
· More data
· Regularization(L2, dropout, data augmentation)
· NN architecture/hyperparameters search
Problems where machine learning significantly surpasses human-level performance
Features: structured data, not natural perception, lots of data.
· Online advertising
· Product recommendations
· Logistics(predicting transit time)
· Loan approvals
The two fundamental assumptions of supervised learning:
You can fit the training set pretty well.(avoidable bias ≈ 0)
The training set performance generalizes pretty well to the dev/test set.(variance ≈ 0)
Spreadsheet:
Before deciding how to improve the accuracy, set up a spreadsheet to find out what matters.
For example:
Image | Dog | Great Cat | Blurry | Comment
1 | √ | | | small white dog
2 | | √ | √ | lion in rainy day
… | | | |
Percentage | 5% | 41% | 63% |
Mislabeled examples are examples for which your learning algorithm outputs the wrong value of Y.
Incorrectly labeled examples are examples in the training/dev/test set whose label Y, as assigned by a human labeler, is itself wrong.
Deep learning algorithms are quite robust to random errors in the training set, but less robust to systematic errors.
Guideline: Build system quickly, then iterate.
The development set and test set should come from the same distribution. However, the training set’s distribution might be a bit different. Take a cat-recognizer mobile app as an example:
The images from webpages have high resolution and are professionally framed, whereas the images from the app’s users have lower resolution and are blurrier.
The problem is that you have a different distribution:
Small data set from pictures uploaded by users (10,000). This distribution is the one that matters for the mobile app.
Bigger data set from the web (200,000).
Instead of mixing all the data and shuffling it randomly, put all 200,000 web images plus 5,000 of the user images into the training set, and split the remaining 5,000 user images evenly between the dev and test sets.
The advantage of this way of splitting is that the target is well defined: the dev and test sets contain only the distribution you care about.
The disadvantage is that the training distribution is different from the dev and test set distributions. Even so, this way of splitting the data gives better performance in the long term.
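A minimal sketch of that split, assuming the data is held in plain Python lists. The function name `split_data`, the tuple placeholders and the fixed seed are illustrative choices, not part of the course material.

```python
import random


def split_data(web_images, user_images, n_user_train=5000, seed=0):
    """Train on all web images plus some user images; dev/test are user-only."""
    rng = random.Random(seed)
    users = list(user_images)
    rng.shuffle(users)

    train = list(web_images) + users[:n_user_train]
    rest = users[n_user_train:]
    half = len(rest) // 2
    dev, test = rest[:half], rest[half:]
    return train, dev, test


# Placeholder records standing in for real images.
web = [("web", i) for i in range(200000)]
user = [("user", i) for i in range(10000)]
train, dev, test = split_data(web, user)
print(len(train), len(dev), len(test))
```

The dev and test sets contain only user images, so the metric is measured on the distribution the app will actually see.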
Training-Dev Set
Since the training set and the dev set now come from different distributions, it’s hard to know whether the difference between training error and dev error is caused by variance or by the difference in distributions.
Therefore, hold out a small fraction of the original training set, called the training-dev set. Don’t use the training-dev set for training; use it only to check variance.
The difference between the training-dev error and the dev error is called data mismatch.
Addressing data mismatch:
When transfer learning makes sense:
Guideline:
Multi-task learning
Example: detect pedestrians, cars, road signs and traffic lights at the same time. The output is a 4-dimensional vector.
Note that the second sum (j = 1 to 4) runs only over the values of j that have a 0/1 label (not a ? mark).
When multi-task learning makes sense
End-to-end deep learning is the simplification of a processing or learning system into one neural network.
End-to-end deep learning cannot be used for every problem since it needs a lot of labeled data. It is used mainly in audio transcription, image captioning, image synthesis, machine translation, steering in self-driving cars, etc.
Pros and cons of end-to-end deep learning
Pros:
Let the data speak
Less hand-designing of components needed
Cons:
May need a large amount of data
Excludes potentially useful hand-designed components
Computer Vision Problems
* is the operator for convolution.
Filter/Kernel
The second operand is called a filter in the course and is often called a kernel in research papers.
There are different types of filters:
A filter usually has an odd size: 1×1, 3×3, 5×5, ... (an odd size gives the filter a central pixel, which makes its position easy to refer to).
Vertical edge detection examples
Valid and Same Convolutions
Suppose that the original image has a size of n×n and the filter has a size of f×f; then the result has a size of (n−f+1)×(n−f+1). This is called a valid convolution.
The output gets smaller with every valid convolution.
To avoid this, we can pad the original image before the convolution so that the output size is the same as the input size. This is called a same convolution.
If the filter’s size is f×f, then the padding is p = (f−1)/2.
The main benefits of padding are the following:
· It allows you to use a CONV layer without necessarily shrinking the height and width of the volumes. This is important for building deeper networks, since otherwise the height/width would shrink as you go to deeper layers. An important special case is the “same” convolution, in which the height/width is exactly preserved after one layer.
· It helps us keep more of the information at the border of an image. Without padding, very few values at the next layer would be affected by pixels at the edges of an image.
Stride
The simplest stride is 1, which means that the filter moves 1 step at a time. The stride can also be larger; for example, the filter can move 2 steps at a time. That’s called strided convolution.
Given an image of size n×n, a filter of size f×f, padding p and stride s, the output size is ⌊(n+2p−f)/s + 1⌋ × ⌊(n+2p−f)/s + 1⌋.
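The size arithmetic for valid, same and strided convolutions can be checked with a small helper. This is a sketch using the standard formula ⌊(n+2p−f)/s⌋+1; the function names are invented for the example.

```python
def conv_output_size(n, f, p=0, s=1):
    """Side length of the output for an n x n input and an f x f filter."""
    return (n + 2 * p - f) // s + 1


def same_padding(f):
    """Padding that keeps the size unchanged at stride 1 (f must be odd)."""
    if f % 2 == 0:
        raise ValueError("same padding needs an odd filter size")
    return (f - 1) // 2


valid = conv_output_size(6, 3)                     # valid convolution
same = conv_output_size(6, 3, p=same_padding(3))   # same convolution
strided = conv_output_size(7, 3, s=2)              # stride 2, no padding
print(valid, same, strided)
```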
Technical note
In mathematics and DSP, convolution involves an additional “flip” of the filter. This step is omitted in CNNs, so the technically correct name for the operation is “cross-correlation” rather than convolution.
By convention, it is still called convolution in the CNN literature.
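To make the difference concrete, here is a plain-Python sketch of both operations on a single-channel image (valid mode, stride 1). The 6×6 test image and the vertical edge filter are the usual textbook example and are assumptions of this sketch, as are the function names.

```python
def cross_correlate(image, kernel):
    """What CNNs call 'convolution': slide the kernel without flipping it."""
    n, f = len(image), len(kernel)
    size = n - f + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(f) for b in range(f))
            for j in range(size)
        ]
        for i in range(size)
    ]


def flip(kernel):
    """Rotate the kernel by 180 degrees (reverse rows and columns)."""
    return [row[::-1] for row in kernel[::-1]]


def convolve(image, kernel):
    """Mathematical convolution: flip the kernel, then cross-correlate."""
    return cross_correlate(image, flip(kernel))


# Bright left half, dark right half: one vertical edge in the middle.
image = [[10, 10, 10, 0, 0, 0] for _ in range(6)]
vertical_edge = [[1, 0, -1] for _ in range(3)]

cnn_style = cross_correlate(image, vertical_edge)
math_style = convolve(image, vertical_edge)
print(cnn_style[0], math_style[0])
```

For this filter the flip only changes the sign of the response, which is one reason the distinction rarely matters in practice: the network learns the filter values anyway.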
Convolution over volumes
A 1-channel filter cannot be applied to RGB images, but we can use filters with multiple channels (RGB images have 3 channels).
The number of channels in the filter must match the number of channels in the image.
E.g.
An n×n×3 image convolved with an f×f×3 filter gives a result of size (n−f+1)×(n−f+1). Note that this is only 1 channel! (The number of channels in the result equals the number of filters.)
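Notation
If layer l is a convolution layer:
Each filter is: $f^{[l]} \times f^{[l]} \times n_c^{[l-1]}$
Activations: $a^{[l]} \to n_H^{[l]} \times n_W^{[l]} \times n_c^{[l]}$, $A^{[l]} \to m \times n_H^{[l]} \times n_W^{[l]} \times n_c^{[l]}$
Weights: $f^{[l]} \times f^{[l]} \times n_c^{[l-1]} \times n_c^{[l]}$, ($n_c^{[l]}$: #filters in layer l.)
bias: $n_c^{[l]}$
Input: $n_H^{[l-1]} \times n_W^{[l-1]} \times n_c^{[l-1]}$
Output: $n_H^{[l]} \times n_W^{[l]} \times n_c^{[l]}$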
E.g.
Types of layers in a convolutional network
Pooling layers
Hyperparameters:
f: filter size
s: stride
Max or average pooling
Note no parameters to learn.
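Suppose that the input has a size of $n_H \times n_W \times n_C$; then after pooling, the output has a size of $\lfloor (n_H - f)/s + 1 \rfloor \times \lfloor (n_W - f)/s + 1 \rfloor \times n_C$.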
A more complicated cnn:
Backpropagation is discussed in programming assignment.
Why convolutions
Paper link: Gradient-Based Learning Applied to Document Recognition (IEEE has another version of this paper.)
Take the input, apply a 5×5 filter with stride 1, then average pooling with a 2×2 filter and s = 2. Again, apply a 5×5 filter with stride 1, then average pooling with a 2×2 filter and s = 2. After two fully connected layers, the output uses softmax for classification.
conv → pool → conv → pool → fc → fc → output
As nH and nW decrease, nC increases.
Paper link: ImageNet Classification with Deep Convolutional Neural Networks
Similar to LeNet, but much bigger (about 60K parameters → about 60M parameters).
It uses ReLU.
Paper link: Very Deep Convolutional Networks for Large-Scale Image Recognition
CONV = 3×3 filter, s = 1, same(using padding to make the size same)
MAXPOOL = 2×2, s = 2
Only use these 2 filters.
Paper link: Deep Residual Learning for Image Recognition
In a plain network, the training error does not keep decreasing as layers are added; beyond some depth it may even increase. In a residual network, the training error keeps decreasing.
The skip connection makes it easy for the network to learn an identity mapping between the input and the output within the ResNet block.
In ResNets, a “shortcut” or a “skip connection” allows the gradient to be directly backpropagated to earlier layers:
Paper link: Network in network
If the input has a volume of dimension $n_H \times n_W \times n_C$, then a single 1×1 convolutional filter has $n_C + 1$ parameters (including bias).
You can use a 1×1 convolutional layer to reduce $n_C$ but not $n_H$, $n_W$.
You can use a pooling layer to reduce $n_H$, $n_W$, but not $n_C$.
Paper link: Going deeper with convolutions
Don’t bother worrying about what filters to use. Use all kinds of filters and stack them together.
Module:
Typically, with deeper layers, $n_H$ and $n_W$ decrease, while $n_C$ increases.
Using Open-Source Implementations: GitHub
Reasons for using open-source implementations of ConvNets:
Parameters trained for one computer vision task are often useful as pretraining for other computer vision tasks.
It is a convenient way to get an implementation of a complex ConvNet architecture working.
Image classification: Given an image, predict which class it belongs to.
Classification localization: In addition, put a bounding box to figure out where the object is.
Detection: Multiple objects appear in the image, detect all of them.
In classification with localization, the output includes $b_x$, $b_y$, which give the position of the center of the object (note that the upper-left corner’s coordinates are (0,0) and the lower-right corner’s are (1,1)), and $b_h$, $b_w$, which give the height and width of the object.
If there are 3 classes, the output has the following format:
For example, if the image contains a car, then the output is
and if the image doesn’t contain anything, the output is
The loss function is
Landmark detection
The output contains more information about the position of the landmarks .
Sliding windows detection
Use a small sliding window with small stride scanning the image, detect the objects. Then use a slightly bigger sliding window, and then bigger.
However, it has high computation cost.
Turning FC layer into convolutional layers
Use a filter with the same size of the last layer, the number of filters is the same as the fully connected nodes.
Paper link: You Only Look Once: Unified, Real-Time Object Detection
Divide the image into grid cells (a 19×19 grid is common), and assign each object only to the grid cell that contains its midpoint, so that it is detected once.
Within each grid cell, the upper-left corner has coordinates (0,0) and the lower-right corner (1,1). Therefore, $b_x$ and $b_y$ must lie between 0 and 1, while $b_h$ and $b_w$ can be greater than 1.
Intersection over union
If IoU≥0.5 we can estimate that the result is right.
More generally, IoU is a measure of the overlap between two bounding boxes.
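A small sketch of the IoU computation. It assumes boxes are given as corner coordinates `(x1, y1, x2, y2)` rather than the midpoint/height/width form used in the notes; that representation and the function name are choices made for this example.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap; zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))
print(overlap)
```

Here the boxes share a 1×1 square out of a combined area of 7, so the prediction would not pass an IoU ≥ 0.5 check.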
Algorithm:
Each output prediction is $[p_c, b_x, b_y, b_h, b_w]$ (just focus on one class at a time, so there are no class entries $c_1, c_2, \dots$).
Discard all boxes with $p_c$ below a chosen confidence threshold.
While there are any remaining boxes:
· Pick the box with the largest $p_c$. Output that as a prediction.
· Discard any remaining box with a high IoU (e.g. IoU ≥ 0.5) with the box output in the previous step.
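The loop above can be sketched as follows for a single class. Several things here are assumptions for the sake of a runnable example: boxes use corner coordinates `(x1, y1, x2, y2)`, each prediction is a `(p_c, box)` pair, and the two threshold values are illustrative defaults.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(predictions, score_threshold=0.6, iou_threshold=0.5):
    """Keep the most confident box of each cluster of overlapping boxes."""
    # Step 1: drop low-confidence boxes, then sort by confidence.
    remaining = sorted(
        (p for p in predictions if p[0] >= score_threshold),
        key=lambda p: p[0],
        reverse=True,
    )
    kept = []
    while remaining:
        best = remaining.pop(0)          # box with the largest p_c
        kept.append(best)
        # Step 2: discard boxes that overlap the chosen box too much.
        remaining = [p for p in remaining if iou(best[1], p[1]) < iou_threshold]
    return kept


predictions = [
    (0.9, (0.0, 0.0, 2.0, 2.0)),
    (0.8, (0.1, 0.0, 2.1, 2.0)),   # near-duplicate of the first box
    (0.7, (5.0, 5.0, 7.0, 7.0)),   # a separate object
    (0.3, (9.0, 9.0, 10.0, 10.0)),  # too uncertain
]
kept = non_max_suppression(predictions)
print([score for score, _ in kept])
```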
In an image, some objects may be overlapping. To predict multiple objects in one grid cell, use some anchor boxes.
Previously:
Each object in training image is assigned to grid cell that contains that object’s midpoint.
With two anchor boxes:
Each object in training image is assigned to grid cell that contains object’s midpoint and anchor box for the grid cell with highest IoU.
The output vector has a size of
E.g.
(Manually choose the shape of anchor boxes.)
Paper link: Rich feature hierarchies for accurate object detection and semantic segmentation
Instead of running sliding windows over and over again, use a segmentation algorithm to propose regions that may contain objects.
R-CNN: Propose regions. Classify proposed regions one at a time. Output label + bounding box.
Fast R-CNN: Propose regions. Use a convolutional implementation of sliding windows to classify all the proposed regions.
Faster R-CNN: Use a convolutional network to propose regions.
Face verification & Face recognition
Verification:
· Input image, name/ID
· Output whether the input image is that of the claimed person.
This is a 1:1 matching problem.
Recognition:
· Has a database of K persons
· Get an input image
· Output ID if the image is any of the K persons(or “not recognized”)
This is a 1:K matching problem.
(Recognition therefore places a high demand on the accuracy of each single comparison.)
Face verification requires comparing a new picture against one person’s face, whereas face recognition requires comparing a new picture against K persons’ faces.
Learn from one example to recognize the person again. The idea is to learn a “similarity” function. (A bit similar to a recommendation system.)
d(img1, img2) = degree of difference between images.
If $d(\text{img1}, \text{img2}) \le \tau$, the output is “same”; otherwise the output is “different”.
The parameters of the NN define an encoding $f(x)$. (Use a vector to represent the image x.)
Goal: Learn parameters so that
if $x^{(i)}, x^{(j)}$ are the same person, $\|f(x^{(i)}) - f(x^{(j)})\|^2$ is small;
if $x^{(i)}, x^{(j)}$ are different persons, $\|f(x^{(i)}) - f(x^{(j)})\|^2$ is large.
Pick an anchor image(denoted as “A”), a positive image(denoted as “P”) and a negative image(denoted as “N”).
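We can calculate the differences between A and P and between A and N: $d(A,P) = \|f(A) - f(P)\|^2$ and $d(A,N) = \|f(A) - f(N)\|^2$.
We want that $d(A,P) + \alpha \le d(A,N)$,
where α is called the margin.
Loss function: $\mathcal{L}(A,P,N) = \max(d(A,P) - d(A,N) + \alpha,\ 0)$
About choosing the triplets A,P,N
During training, if A,P,N are chosen randomly, $d(A,P) + \alpha \le d(A,N)$ is easily satisfied. Therefore, gradient descent wouldn’t make much progress.
Thus, choose triplets that are “hard” to train on. That is, pick A,P,N such that $d(A,P) \approx d(A,N)$.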
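As a rough sketch, the triplet loss for one (A, P, N) triple can be written in plain Python, assuming the encodings are already computed as lists of floats and using the standard hinge form max(d(A,P) − d(A,N) + α, 0). The margin value 0.2 and the toy encodings are arbitrary illustration values.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two encodings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(f_anchor, f_positive, f_negative, alpha=0.2):
    """Zero once the negative is at least alpha farther away than the positive."""
    d_ap = squared_distance(f_anchor, f_positive)
    d_an = squared_distance(f_anchor, f_negative)
    return max(d_ap - d_an + alpha, 0.0)


# An "easy" triplet: the negative is already far away, so the loss is zero.
easy = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
# A "hard" triplet: positive and negative are equally far from the anchor.
hard = triplet_loss([0.0, 0.0], [1.0, 0.0], [0.0, 1.0])
print(easy, hard)
```

The easy triplet contributes no gradient, which is why randomly chosen triplets teach the network very little.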
The input contains content image(denoted as C) and style image(denoted as S), and the output is the generated image(denoted as G).
To find the generated image G:
1. Initialize G randomly (e.g. with white noise).
2. Use gradient descent to minimize J(G).
Content cost function
Style cost function
Say you are using layer l’s activations to measure style.
Define style as the correlation between activations across channels.
Let $a^{[l]}_{i,j,k}$ = activation at (i,j,k). $G^{[l]}$ is $n_c^{[l]} \times n_c^{[l]}$.
The style matrix is also called a “Gram matrix”. In linear algebra, the Gram matrix G of a set of vectors $(v_1, \dots, v_n)$ is the matrix of dot products, whose entries are $G_{ij} = v_i^T v_j$. In other words, $G_{ij}$ measures how similar $v_i$ is to $v_j$: if they are highly similar, you would expect them to have a large dot product, and thus $G_{ij}$ to be large.
The style of an image can be represented using the Gram matrix of a hidden layer’s activations. However, we get even better results combining this representation from multiple different layers. This is in contrast to the content representation, where usually using just a single hidden layer is sufficient.
Minimizing the style cost will cause the image G to follow the style of the image S.
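A Gram matrix is simple to compute once each channel’s activations are flattened into a vector. The sketch below uses plain lists and tiny made-up activations; in a real implementation the flattening and the matrix product would be done by a tensor library.

```python
def gram_matrix(channels):
    """Gram matrix of a list of vectors: entry [i][j] is the dot product v_i . v_j.

    `channels` holds one flattened activation vector per channel, so the
    result is n_c x n_c.
    """
    return [
        [sum(a * b for a, b in zip(v_i, v_j)) for v_j in channels]
        for v_i in channels
    ]


# Two channels, each flattened to two activations (illustrative values).
activations = [[1, 2], [3, 4]]
style = gram_matrix(activations)
print(style)
```

The diagonal entries measure how active each channel is, and the off-diagonal entries measure how strongly two channels fire together.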
Notation
: denotes an object at the t’th timestep.
: index into the output position
t: implies that these are temporal sequences
$T_x$: the length of the input sequence
$T_y$: the length of the output sequence
$T_x^{(i)}$: the input length of the i-th training example
$T_y^{(i)}$: the output length of the i-th training example
$x^{(i)\langle t \rangle}$: the input at the t-th timestep of example i
$y^{(i)\langle t \rangle}$: the output at the t-th timestep of example i
One-hot representation
Use a large vector (as long as a dictionary containing tens of thousands of words) to represent a word. Only one element is one (at the word’s position in the dictionary) and the others are zero.
Why not a standard network?
Problems:
Inputs, outputs can be different lengths in different examples. (Different sentences have different lengths.)
Doesn’t share features learned across different positions of text. (A word may appear many times in a sentence. Need to make repetitions.)
RNN cell
Basic RNN cell. Takes as input $x^{\langle t \rangle}$ (current input) and $a^{\langle t-1 \rangle}$ (previous hidden state containing information from the past), and outputs $a^{\langle t \rangle}$, which is given to the next RNN cell and also used to predict $y^{\langle t \rangle}$.
Here the weight W has two subscripts: the first corresponds to the quantity being computed and the second to the operand that it multiplies.
The activation function usually uses tanh, sometimes ReLU.
The function uses sigmoid to make binary classification.
The formulas can be simplified as follows:
Here, , and
One to one
Usage: Simple neural network
One to many
Usage: Music generation, sequence generation
Many to one
Usage: Sentiment classification
Many to many (I)
Usage: Named entity recognition
Many to many (II)
Usage: Machine translation
A language model uses an RNN to calculate the probability of a sentence. Each step’s output is a probability conditioned on the previous words.
E.g. given the sentence “Cats average 15 hours of sleep a day.”, the first output is P(cats) (the probability that ‘cats’ appears at the beginning of the sentence); the second is P(average | cats) (a conditional probability); and so on.
Character-level language model
Instead of words, a character-level model generates sequences of characters. It is more computationally expensive.
The basic RNN unit:
g() is tanh function.
GRU(simplified):
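Instead of using $a^{\langle t \rangle}$, use $c^{\langle t \rangle}$ (though in GRU $c^{\langle t \rangle} = a^{\langle t \rangle}$). Here c represents the memory cell.
u: update. r: relevance.
Gate u is a vector of dimension equal to the number of hidden units.
Gate r tells you how relevant $c^{\langle t-1 \rangle}$ is to computing the next candidate for $c^{\langle t \rangle}$.
Difference between LSTM and GRU (the LSTM came earlier, and the GRU can be regarded as a simplified variant of the LSTM).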
Forget Gate
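For the sake of this illustration, let’s assume we are reading words in a piece of text, and want to use an LSTM to keep track of grammatical structures, such as whether the subject is singular or plural. If the subject changes from a singular word to a plural word, we need to find a way to get rid of our previously stored memory value of the singular/plural state. In an LSTM, the forget gate lets us do this:
$$\Gamma_f^{\langle t \rangle} = \sigma(W_f[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_f) \tag{1}$$
Here, $W_f$ are weights that govern the forget gate’s behavior. We concatenate $[a^{\langle t-1 \rangle}, x^{\langle t \rangle}]$ and multiply by $W_f$. The equation above results in a vector $\Gamma_f^{\langle t \rangle}$ with values between 0 and 1. This forget gate vector will be multiplied elementwise by the previous cell state $c^{\langle t-1 \rangle}$. So if one of the values of $\Gamma_f^{\langle t \rangle}$ is 0 (or close to 0) then it means that the LSTM should remove that piece of information (e.g. the singular subject) in the corresponding component of $c^{\langle t-1 \rangle}$. If one of the values is 1, then it will keep the information.
Update Gate
Once we forget that the subject being discussed is singular, we need to find a way to update it to reflect that the new subject is now plural. Here is the formula for the update gate:
$$\Gamma_u^{\langle t \rangle} = \sigma(W_u[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_u) \tag{2}$$
Similar to the forget gate, here $\Gamma_u^{\langle t \rangle}$ is again a vector of values between 0 and 1. This will be multiplied elementwise with $\tilde{c}^{\langle t \rangle}$, in order to compute $c^{\langle t \rangle}$.
Updating the cell
To update the new subject we need to create a new vector of numbers that we can add to our previous cell state. The equation we use is:
$$\tilde{c}^{\langle t \rangle} = \tanh(W_c[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_c) \tag{3}$$
Finally, the new cell state is:
$$c^{\langle t \rangle} = \Gamma_f^{\langle t \rangle} * c^{\langle t-1 \rangle} + \Gamma_u^{\langle t \rangle} * \tilde{c}^{\langle t \rangle} \tag{4}$$
Output gate
To decide which outputs we will use, we will use the following two formulas:
$$\Gamma_o^{\langle t \rangle} = \sigma(W_o[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_o) \tag{5}$$
$$a^{\langle t \rangle} = \Gamma_o^{\langle t \rangle} * \tanh(c^{\langle t \rangle}) \tag{6}$$
In equation 5 you decide what to output using a sigmoid function, and in equation 6 you multiply that by the tanh of the cell state just computed.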
Transfer learning and word embeddings
1. Learn word embeddings from a large text corpus. (1-100B words)
(Or download pretrained embeddings online.)
2. Transfer the embeddings to a new task with a smaller training set. (say, 100k words)
3. Optional: Continue to fine-tune the word embeddings with new data.
Computation of Similarities:
Cosine similarity:
Euclidean distance:
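Both measures are easy to sketch in plain Python. The definitions used here are the standard ones, which is an assumption about what the missing formulas showed: cosine similarity is u·v / (‖u‖‖v‖) and Euclidean distance is ‖u − v‖. The example vectors are made up.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 1 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between u and v; 0 means identical."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


parallel = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
distance = euclidean_distance([0.0, 0.0], [3.0, 4.0])
print(parallel, orthogonal, distance)
```

Note the opposite conventions: a larger cosine similarity means more similar, while a larger distance means less similar.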
Embedding matrix
The embedding matrix is denoted as E.
The embedding for word j can be calculated as $e_j = E \cdot o_j$.
Here, e means embedding and o means one-hot. In practice, we just use a specialized function to look up an embedding rather than a costly matrix multiplication.
Context/target pairs
Context:
· Last 4 words
· 4 words on left & right
· Last 1 word
· Nearby 1 word
Using skipgrams:
Here, and is the parameter associated with output t.
Problems:
The cost of computation is too high.
Solution:
Using hierarchical softmax.
Randomly choose k+1 examples, where only 1 example is positive and the remaining k are negative. (The value of k depends on the size of the data set: if the data set is big, k = 2-5; if the data set is small, k = 5-20.)
Instead of using a softmax over the whole vocabulary, solve one binary classification problem per sampled example to reduce the computation.
: the number of times i appears in context of j. Thus,
Sentiment Classification and Debiasing.
Machine translation can be regarded as a conditional language model.
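The original language model computes the probability $P(y^{\langle 1 \rangle}, \dots, y^{\langle T_y \rangle})$, while machine translation computes the probability $P(y^{\langle 1 \rangle}, \dots, y^{\langle T_y \rangle} \mid x)$. Therefore, it can be regarded as a conditional language model.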
The machine translation contains two parts: encoder and decoder.
Just find the most likely translation.
(Do not use greedy search.)
Pick a hyperparameter B (the beam width). At each decoding step, keep the B most likely partial outputs.
Since the probability can be computed as:
(Beam search with B=1 is greedy search.)
Length normalization
Each probability lies in [0,1], so the original formula, a product of many small values, can become extremely small. To avoid this, use logs in the calculation:
The model tends to prefer short translations, because every extra word multiplies in another probability below 1, yet a translation that is too short is not satisfying. Therefore, normalize by the length, with another hyperparameter to control how strongly:
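The effect of length normalization can be sketched numerically. The scoring form used here, the sum of log-probabilities divided by T_y raised to a power α, is the commonly used one and is an assumption about the missing formula; α = 0.7 and the per-word probabilities are illustrative values.

```python
import math


def raw_probability(step_probs):
    """Plain product of the per-word probabilities."""
    product = 1.0
    for p in step_probs:
        product *= p
    return product


def normalized_log_prob(step_probs, alpha=0.7):
    """Sum of log-probabilities divided by (length ** alpha)."""
    log_sum = sum(math.log(p) for p in step_probs)
    return log_sum / (len(step_probs) ** alpha)


short = [0.5, 0.5]            # two words, each fairly uncertain
long = [0.7, 0.7, 0.7, 0.7]   # four words, each more confident

print(raw_probability(short), raw_probability(long))
print(normalized_log_prob(short), normalized_log_prob(long))
```

Without normalization the two-word output wins (0.25 against about 0.24), even though each of its words is less likely; with normalization the longer, more confident output scores higher.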
Unlike exact search algorithms like BFS or DFS, beam search runs faster but is not guaranteed to find the exact maximum of P(y|x).
Error analysis
There are two main components in machine translation: the RNN and beam search. If the training error is high, we want to figure out which part is not functioning well.
Use $y^*$ to represent the human’s translation and $\hat{y}$ the machine’s.
Case 1: $P(y^* \mid x) > P(\hat{y} \mid x)$
Beam search chose $\hat{y}$, but $y^*$ attains higher P(y|x).
Conclusion: Beam search is at fault.
Case 2: $P(y^* \mid x) \le P(\hat{y} \mid x)$
In fact, $y^*$ is a better translation than $\hat{y}$, but the RNN predicted $P(y^* \mid x) \le P(\hat{y} \mid x)$.
Conclusion: RNN model is at fault.
Here, $p_n$ = Bleu score on n-grams only.
Combined Bleu score:
BP is brevity penalty with
Here, = amount of attention should pay to
Learning Python by myself.
Here’s some environment configuration:
Version: Python 3.6.3
Platform: macOS 10.12.6
Interpreter: terminal
Text editor: Sublime Text 3
Notebook: Jupyter Notebook
Material:
These materials should be enough. I use the official tutorial to get a detailed understanding of Python and its most noteworthy features, and to get a good idea of the language’s flavor and style.
I plan to get the hang of Python in my winter vacation.
Try to get more familiar with Python with practical applications. (And keep updating this post!)
When commands are read from a tty, the interpreter is said to be in interactive mode. To start interactive mode, type python3 in the terminal (the default Python version on macOS is Python 2, so typing python would start Python 2). To stop interactive mode, type exit().
If code is saved as a file with the .py extension, type python3 filename.py in the terminal to run the file.
On BSD’ish Unix systems, Python scripts can be made directly executable, like shell scripts, by putting the line #!/usr/bin/env python3 at the beginning of the script.
The script can be given an executable mode, or permission, using the chmod +x script.py command.
On Windows systems, there is no notion of an “executable mode”. The Python installer automatically associates .py files with python.exe so that a double-click on a Python file will run it as a script. The extension can also be .pyw; in that case, the console window that normally appears is suppressed.
Comment
Comments in Python start with the hash character, #, and extend to the end of the physical line.
The # sign only comments out a single line; for multi-line comments, begin each line with #. (A block enclosed in triple quotation marks is sometimes used as a multi-line comment in a .py file, but it is really a string literal, so in interactive mode its value is echoed back.)


Prompt
When commands are read from a tty, the interpreter is said to be in interactive mode. In this mode it prompts for the next command with the primary prompt, usually three greater-than signs (>>>); for continuation lines it prompts with the secondary prompt, by default three dots (...).
Most prompts are primary prompts. Secondary prompts appear inside multi-line constructs such as if, while, etc.
Primary prompt (after the python3 command in my terminal):


Secondary prompt (example from tutorial):


This part is not introduced in the tutorial, but online practice sites (like CodeStepByStep) require knowing how to do input and output, so I include some information here.
Input:


input() always returns a string. If we input an integer and want to use x as an integer, convert it with int(): x = int(input("prompts"))
Output:
Just like MATLAB, in interactive mode we can output a variable’s value by typing the variable’s name, or we can use the print() function.
When combining values in print(), we can use either , or +. With , we don’t need to convert int/float to string, and print() inserts a blank space between the arguments; with + we need to convert int/float to string using str(), but + won’t add an extra blank space.
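A short sketch of the two styles side by side. The output is written into an `io.StringIO` buffer through print()’s `file` argument only so that the two lines can be compared afterwards; the variable names and values are made up.

```python
import io

name, age = "Tom", 18
captured = io.StringIO()  # stands in for the terminal

# Comma form: print() converts each argument itself and separates them with a space.
print("Name:", name, "Age:", age, file=captured)

# Plus form: the int must be converted with str(), and spaces are written by hand.
print("Name: " + name + " Age: " + str(age), file=captured)

comma_line, plus_line = captured.getvalue().splitlines()
print(comma_line)
print(plus_line)
```

Both lines come out identical; leaving out `str()` in the second call would raise a TypeError instead.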
printfstyle


The brackets and characters within them (called format fields) are replaced with the objects passed into the str.format() method.
Positional arguments
A number in the brackets can be used to refer to the position of the object passed into the str.format() method.


Keyword arguments
If keyword arguments are used in the str.format() method, their values are referred to by using the name of the argument.


Positional and keyword arguments can be arbitrarily combined.
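The three cases can be illustrated with a few calls to str.format(). The strings are in the spirit of the official tutorial’s examples; the variable names are only for this sketch.

```python
# Positional: the numbers pick arguments by position, in any order.
in_order = "{0} and {1}".format("spam", "eggs")
swapped = "{1} and {0}".format("spam", "eggs")

# Keyword: the names pick arguments by keyword.
by_name = "This {food} is {adjective}.".format(
    food="spam", adjective="absolutely horrible"
)

# Mixed: positional arguments first, keyword arguments after them.
mixed = "The story of {0}, {1}, and {other}.".format(
    "Bill", "Manfred", other="Georg"
)

for line in (in_order, swapped, by_name, mixed):
    print(line)
```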
Old string formatting
The % operator can also be used for string formatting.


printf-style String Formatting
PEP 8 - Style Guide for Python Code
a = f(1, 2) + g(3, 4)
CamelCase for classes and lower_case_with_underscores for functions and methods. Always use self as the name for the first method argument.
The Python interpreter can act as a simple calculator: typing a math expression prints its result.
Data type
int
float
Decimal
Fraction
Complex numbers: use the j or J suffix to indicate the imaginary part (e.g. 3+5j).
Operation
+: Addition. 2 + 2 = 4
-: Subtraction. 3 - 1 = 2
*: Multiplication. 2 * 2 = 4
**: Power. 2 ** 7 = 128
/: Division. Always returns a float. 10 / 3 = 3.3333333333333335
//: Floor division. Discards any fractional part; the result is an integer when both operands are integers.
_: Last printed expression in interactive mode (makes it easier to continue calculations).
Comparison
< less than, > greater than, == equal to, <= less than or equal to, >= greater than or equal to, != not equal to
in and not in check whether a value occurs (does not occur) in a sequence.
is and is not compare whether two objects are really the same object; this only matters for mutable objects like lists.
Boolean
Operators: and, or, not. Values: True, False.
Any nonzero integer value is true; zero is false. The condition may also be any sequence (string, list, etc.): anything with a nonzero length is true, empty sequences are false.
and and or are short-circuit operators: evaluation stops as soon as the outcome is determined.
Strings in Python can be enclosed in either single quotes or double quotes.
Escape character
Just like in C, \ can be used to escape quotes in Python, and it introduces escape sequences such as \t, \n, etc.
Adding an r before the first quote makes a raw string, which prevents \ from being treated as a special character. Raw strings are often used in regular expressions. (See notes about regular expressions here)


Concatenating
Strings can be concatenated with the + operator, and repeated with *.
String literals (i.e. the ones enclosed between quotes) next to each other are automatically concatenated.
Remember to use str() when concatenating strings and other data types (e.g. int).
s.join() combines the words of a text into a string, using s as the glue.
Index
Indices of strings can be nonnegative(just like C array) and negative(start counting from the right).


Slicing
Like MATLAB, Python supports slicing (word[0:2]), which extracts a substring.
Note the start is always included, and the end always excluded.
Default: an omitted first index defaults to zero, an omitted second index defaults to the size of the string being sliced.
Attempting to use an index that is too large will result in an error. However, out of range slice indexes are handled gracefully when used for slicing.
Substring
We test if a string contains a particular substring using the in operator.
We can also find the position of a substring within a string, using find().
s.find(t): index of first instance of string t inside s (-1 if not found)
s.rfind(t): index of last instance of string t inside s (-1 if not found)
s.index(t): like s.find(t) except it raises ValueError if not found
s.rindex(t): like s.rfind(t) except it raises ValueError if not found
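A quick sketch of how the four methods differ, using an arbitrary example string:

```python
s = "banana"

contains = "nan" in s        # membership test with the in operator
first = s.find("an")         # leftmost match
last = s.rfind("an")         # rightmost match
missing = s.find("xy")       # find() signals "not found" with -1

# index() behaves like find() but raises instead of returning -1.
try:
    s.index("xy")
    raised = False
except ValueError:
    raised = True

print(contains, first, last, missing, raised)
```

Because -1 is also a valid (negative) index, index() is the safer choice when a missing substring should be treated as an error.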
Immutable
Python strings are immutable, just as in Java. Thus, if it’s necessary to edit a string, just create a new one.
Multilines
This is introduced in the NLTK book, Chapter 3.
Sometimes strings go over several lines. Python provides us with various ways of entering them. In the next example, a sequence of two strings is joined into a single string. We need to use backslash or parentheses so that the interpreter knows that the statement is not complete after the first line.


Others
The built-in function len() returns the length of a string.
String Methods
s.startswith(t): test if s starts with t
s.endswith(t): test if s ends with t
t in s: test if t is a substring of s
s.islower(): test if s contains cased characters and all are lowercase
s.isupper(): test if s contains cased characters and all are uppercase
s.isalpha(): test if s is non-empty and all characters in s are alphabetic
s.isalnum(): test if s is non-empty and all characters in s are alphanumeric
s.isdigit(): test if s is non-empty and all characters in s are digits
s.istitle(): test if s contains cased characters and is titlecased (i.e. all words in s have initial capitals)
title(): convert the first character in each word to uppercase and the remaining characters to lowercase, and return the new string
upper(): convert all characters to uppercase
lower(): convert all characters to lowercase
rstrip(): returns a copy of the string with trailing characters (whitespace by default) removed.
lstrip(): returns a copy of the string with leading characters (whitespace by default) removed.
strip(): returns a copy of the string with the leading and trailing characters (whitespace by default) removed.
replace(t, u): replace instances of t with u.
if Statement


There can be zero or more elif parts, and the else part is optional.
Note there’s no switch/case statement in Python 3.6. (A match statement was added later, in Python 3.10.)
for Statement
Python’s for statement iterates over the items of any sequence (a list or a string), in the order that they appear in the sequence.


The range() Function
for i in range(5)
The range() function takes 1, 2 or 3 arguments.
range(term): generates term values from 0 to (term - 1). If term is zero or negative, the range is empty.
range(begin, end): generates values from begin to (end - 1).
range(begin, end, step): specifies a different increment (step), which can even be negative.
In many ways the object returned by range() behaves as if it is a list, but in fact it isn’t. It is an object which returns the successive items of the desired sequence when you iterate over it, but it doesn’t really make the list, thus saving space.
We say such an object is iterable, that is, suitable as a target for functions and constructs that expect something from which they can obtain successive items until the supply is exhausted.
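Wrapping a range in list() makes the generated values visible, which is a convenient way to check the three call forms. The particular arguments below are arbitrary.

```python
one_arg = list(range(5))                 # 0 up to 4
two_args = list(range(5, 10))            # 5 up to 9
with_step = list(range(0, 10, 3))        # every third value
negative_step = list(range(-10, -100, -30))
empty = list(range(-3))                  # nothing to generate

# A range is not a list, but it supports len() and indexing without
# ever building the full sequence in memory.
lazy = range(1000000)
print(one_arg, two_args, with_step, negative_step, empty)
print(len(lazy), lazy[-1], isinstance(lazy, list))
```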
break and continue Statements, and else Clauses on Loops
The break statement breaks out of the innermost enclosing for or while loop.
The continue statement continues with the next iteration of the loop.
These two statements are borrowed from C.
Loop statements may have an else clause; it is executed when the loop terminates through exhaustion of the list (with for) or when the condition becomes false (with while), but not when the loop is terminated by a break statement.
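The loop else clause is easiest to understand from a search loop. The sketch below follows the prime-search idea from the official tutorial but collects the results in a dictionary instead of printing them; the function name `classify` is made up.

```python
def classify(limit):
    """Label every n in [2, limit) as prime or show one factorization."""
    results = {}
    for n in range(2, limit):
        for x in range(2, n):
            if n % x == 0:
                results[n] = "{} = {} * {}".format(n, x, n // x)
                break  # a factor was found, so the else clause is skipped
        else:
            # Runs only when the inner loop finished without hitting break.
            results[n] = "prime"
    return results


labels = classify(10)
for n, label in labels.items():
    print(n, label)
```

Note that the else belongs to the inner for loop, not to the if statement.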


pass Statements
The pass statement does nothing. It can be used when a statement is required syntactically but the program requires no action.


This is commonly used for creating minimal classes:


pass can be used as a placeholder for a function or conditional body when you are working on new code, allowing you to keep thinking at a more abstract level.


Detail
The keyword def introduces a function definition. It must be followed by the function name and the parenthesized list of formal parameters. The statements that form the body of the function start at the next line, and must be indented.
docstring
The first statement of the function body can optionally be a string literal (conventionally in triple quotes, e.g. ''' '''); this string literal is the function’s documentation string, or docstring. Its first line should begin with a capital letter and end with a period.
Docstrings can include a doctest block, illustrating the use of the function and the expected output. These can be tested automatically using Python’s doctest module. Docstrings should document the type of each parameter to the function, and the return type. At a minimum, that can be done in plain text.
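Here is a small sketch of a docstring with a doctest block and one way to run it programmatically. The function `accuracy` is a made-up example; in a script one would more commonly just call `doctest.testmod()`, but the finder/runner pair is used here so the outcome is available as a value.

```python
import doctest


def accuracy(reference, test):
    """Return the fraction of positions at which two sequences agree.

    Both arguments are sequences of equal length; the result is a float.

    >>> accuracy([1, 2, 3, 4], [1, 2, 0, 4])
    0.75
    >>> accuracy(["a"], ["a"])
    1.0
    """
    matches = sum(1 for r, t in zip(reference, test) if r == t)
    return matches / len(reference)


# Collect the examples embedded in the docstring and execute them.
finder = doctest.DocTestFinder()
runner = doctest.DocTestRunner(verbose=False)
for case in finder.find(accuracy, "accuracy", globs={"accuracy": accuracy}):
    runner.run(case)
results = runner.summarize(verbose=False)
print(results.attempted, results.failed)
```

If the function’s behavior drifts away from the documented examples, the failed count becomes nonzero, so the docstring doubles as a regression test.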
call by value
The actual parameters (arguments) to a function call are introduced in the local symbol table of the called function when it is called; thus, arguments are passed using call by value (where the value is always an object reference, not the value of the object). When a function calls another function, a new local symbol table is created for that call.
Actually, call by object reference would be a better description, since if a mutable object is passed, the caller will see any changes the callee makes to it (items inserted into a list).
return
The return
statement returns with a value from a function. return
without an expression argument returns None
. Falling off the end of a function also returns None
.
default argument values
When defining functions, it’s useful to specify a default value for one or more arguments. This creates a function that can be called with fewer arguments than it is defined to allow.
Remember that the default value is evaluated only once.
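The following sketch shows what "evaluated only once" means in practice when the default is a mutable object. The function names append_shared and append_fresh are illustrative only.

```python
def append_shared(item, bucket=[]):
    # The default list is created once, when the def statement runs,
    # so every call that omits bucket reuses the same list object.
    bucket.append(item)
    return bucket


def append_fresh(item, bucket=None):
    # Common workaround: use None as a sentinel and build a new list per call.
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket


shared_first = list(append_shared(1))    # snapshot after the first call
shared_second = list(append_shared(2))   # the earlier item is still there
fresh_first = append_fresh(1)
fresh_second = append_fresh(2)

print(shared_first, shared_second)
print(fresh_first, fresh_second)
```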
keyword argument & positional argument
From Glossary
A value passed to a function
(or method
) when calling the function. There are two kinds of argument:
· keyword argument. an argument preceded by an identifier(e.g. name=
) in a function call or passed as a value in a dictionary preceded by **
.
· positional argument. an argument that is not a keyword argument. Positional arguments can appear at the beginning of an argument list and/or be passed as elements of an iterable
preceded by *
.
When a final formal parameter of the form **name
is present, it receives a dictionary containing all keyword arguments except for those corresponding to a formal parameter. This may be combined with a formal parameter of the form *name
(described in the next subsection) which receives a tuple containing the positional arguments beyond the formal parameter list. (*name
must occur before **name
.)
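A small sketch of how *name and **name collect the extra arguments, and how * and ** unpack them again at the call site. The function describe and its argument values are made up for illustration.

```python
def describe(kind, *args, **kwargs):
    # kind is an ordinary formal parameter; extra positional arguments are
    # collected in the tuple args, extra keyword arguments in the dict kwargs.
    return kind, args, kwargs


result = describe("cheese", "a", "b", shop="Michael", client="John")
print(result)

# Unpacking at the call site: * spreads an iterable, ** spreads a dictionary.
positional = ["cheese", "a"]
named = {"shop": "Michael"}
unpacked = describe(*positional, **named)
print(unpacked)
```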
Sequence Types  list, tuple, range
Set Types  set, frozenset
Mapping Types  dict
A list is written as comma-separated values (items) between square brackets (similar to Java’s ArrayList).
Lists might contain items of different types, but usually the items all have the same type. squares = [1, 4, 9, 16, 25]
It’s a good choice to name the list variable with plurals.
Index and slicing are similar to those of String.
Mutable
Lists are a mutable type, i.e. it is possible to change their content. Just use the index to change the content(like array in C).
Concatenation+
operator: squares + [36, 49]
append
method: squares.append(64)
Insert
Use insert()
to add elements in the specified location.
Use append()
to add elements to the end of the list.
Delete
Use del
if the index of element is known. del a[0]
Methods of list objects
list.append(x)
Add an item to the end of the list. Equivalent to a[len(a):] = [x]
.
list.extend(iterable)
Extend the list by appending all the items from the iterable. Equivalent to a[len(a):] = iterable
.
list.insert(i,x)
Insert an item at a given position. The first argument is the index of the element before which to insert, so a.insert(0, x)
inserts at the front of the list, and a.insert(len(a), x)
is equivalent to a.append(x)
list.remove(x)
Remove the first item from the list whose value is x. It is an error if there is no such item.
list.pop([i])
Remove the item at the given position in the list, and return it. If no index is specified, a.pop()
removes and returns the last item in the list.
list.index(x[, start[, end]])
Return zerobased index in the list of the first item whose value is x. Raises a ValueError
if there is no such item.
The optional arguments start and end are interpreted as in the slice notation and are used to limit the search to a particular subsequence of the list. The returned index is computed relative to the beginning of the full sequence rather than the start argument.
list.count(x)
Return the number of times x appears in the list.
list.sort()
Sort the items of the list in place.
list.reverse()
Reverse the elements of the list in place.
list.copy()
Return a shallow copy of the list. Equivalent to a[:]
.
List Comprehensions
A list comprehension consists of brackets containing an expression followed by a for
clause, then zero or more for
or if
clauses. The result will be a new list resulting from evaluating the expression in the context of the for
and if
clauses which follow it.
E.g:


It is equivalent to squares = [x**2 for x in range(10)]
Various ways to iterate over sequences:


Given a sequence s
, enumerate(s)
returns pairs consisting of an index and the item at that index.
sort
Python lists have a built-in list.sort()
method that modifies the list in place. There is also a sorted()
built-in function that builds a new sorted list from an iterable. list.sort()
returns None
to avoid confusion. It is usually less convenient than sorted()
, but if you don’t need the original list, it’s slightly more efficient.
Both list.sort()
and sorted()
have a key parameter to specify a function to be called on each list element prior to making comparisons. A common pattern is to sort complex objects using some of the object’s indices as keys. The same technique works for objects with name attributes.
The keyfunction patterns are very common, and Python provides convenience functions to make accessor functions easier and faster. The operator
module has itemgetter()
, attrgetter()
and a methodcaller()
function.
Both list.sort()
and sorted()
accept a reverse parameter with a boolean value. This is used to flag descending sorts.
Sorts are guaranteed to be stable.
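A short sketch of the key, reverse, and stability behavior described above. The student records are sample data in the style of the Sorting HOWTO.

```python
from collections import namedtuple
from operator import attrgetter, itemgetter

students = [("john", "A", 15), ("jane", "B", 12), ("dave", "B", 10)]

# sorted() builds a new list; key selects the value used for comparison.
by_age = sorted(students, key=itemgetter(2))
by_age_desc = sorted(students, key=itemgetter(2), reverse=True)

# Stability: jane and dave share grade "B" and keep their original order.
by_grade = sorted(students, key=itemgetter(1))

# attrgetter does the same for objects with named attributes.
Student = namedtuple("Student", ["name", "grade", "age"])
records = [Student(*fields) for fields in students]
by_name = sorted(records, key=attrgetter("name"))

# list.sort() sorts in place and returns None.
in_place = list(students)
returned = in_place.sort(key=lambda student: student[0])

print(by_age)
print(by_grade)
```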
Sorting HOWTO
A tuple consists of a number of values separated by commas, and typically enclosed using parentheses.
Tuples are constructed using the comma operator. Parentheses are a more general feature of Python syntax, designed for grouping. A tuple containing a single element is defined by adding a trailing comma. The empty tuple is a special case, and is defined using empty parentheses ()
.
Comparison with list
| Type | tuple | list |
| --- | --- | --- |
| Mutability | immutable | mutable |
| Elements | heterogeneous | homogeneous |
| Access | unpacking/indexing | iterating |
Packing and Unpacking
Packing: t = 12345, 54321, 'hello!'
is an example of tuple packing.
Unpacking: x, y, z = t
is called sequence unpacking and works for any sequence on the right-hand side. Sequence unpacking requires that there are as many variables on the left side of the equals sign as there are elements in the sequence. Note that multiple assignment is really just a combination of tuple packing and sequence unpacking.
zip()
takes the items of two or more sequences and “zips” them together into tuples. In Python 3 the result is a lazy iterator rather than a list; wrap it in list() to materialize it.
Introduction in documentation:
Make an iterator that aggregates elements from each of the iterables.
Returns an iterator of tuples, where the i-th tuple contains the i-th element from each of the argument sequences or iterables. The iterator stops when the shortest input iterable is exhausted.
Therefore, if the inputs have different lengths, the result is only as long as the shortest one, and the trailing items of the longer inputs are discarded. (If those values are important, use itertools.zip_longest()
instead.)
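A quick sketch comparing zip() with itertools.zip_longest() on inputs of different lengths. The sample lists are made up.

```python
from itertools import zip_longest

names = ["a", "b", "c"]
scores = [1, 2]

# zip() stops as soon as the shortest input is exhausted.
pairs = list(zip(names, scores))

# zip_longest() keeps going and pads the shorter input (None by default).
padded = list(zip_longest(names, scores))
filled = list(zip_longest(names, scores, fillvalue=0))

print(pairs)
print(padded)
print(filled)
```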
A set is an unordered collection with no duplicate elements.
Curly braces {}
or the set()
function can be used to create sets. An empty set must be created with set(), because {} creates an empty dictionary. set()
can also be used to remove duplicate items from a list.
Dictionaries are indexed by keys, which can be any immutable type; strings and numbers can always be keys. Tuples can be used as keys if they contain only strings, numbers, or tuples; if a tuple contains any mutable object either directly or indirectly, it cannot be used as a key. Since lists are mutable, they can’t be used as keys either.
The dictionary methods keys()
, values()
and items()
allow us to access the keys, values, and keyvalue pairs. (The type is dict_keys
, etc. Sometimes we have to convert to list before further processing).
Python’s Dictionary Methods: a summary of commonly used methods and idioms involving dictionaries.
· d = {}: create an empty dictionary and assign it to d
· d[key] = value: assign a value to a given dictionary key
· d.keys(): the keys of the dictionary (a dict_keys view)
· list(d): the list of keys of the dictionary
· sorted(d): the keys of the dictionary, sorted
· key in d: test whether a particular key is in the dictionary
· for key in d: iterate over the keys of the dictionary
· d.values(): the values in the dictionary (a dict_values view)
· dict([(k1,v1), (k2,v2), ...]): create a dictionary from a list of key-value pairs
· d1.update(d2): add all items from d2 to d1
· defaultdict(int): a dictionary whose default value is zero
Default dictionary
If we try to access a key that is not in a dictionary, we get an error. However, it’s often useful if a dictionary can automatically create an entry for this new key and give it a default value, such as zero or the empty list. For this reason, a special kind of dictionary called a defaultdict
is available. In order to use it, we have to supply a parameter which can be used to create the default value, e.g. int
, float
, str
, list
, dict
, tuple
.
If the parameter is None
, it behaves just like the original dict
. We can also specify any default value we like using a lambda expression.
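A brief sketch of the cases described above: a type as the default factory, a lambda as the default factory, and None as the factory. The sample words and the "N/A" default are made up.

```python
from collections import defaultdict

# int() returns 0, so missing keys start counting from zero.
counts = defaultdict(int)
for letter in "banana":
    counts[letter] += 1

# list() returns [], so missing keys start as an empty list.
groups = defaultdict(list)
for word in ["apple", "avocado", "bean"]:
    groups[word[0]].append(word)

# A lambda can supply any default value; looking a key up creates the entry.
tagged = defaultdict(lambda: "N/A")
missing_value = tagged["unknown"]

# With None as the factory, a missing key raises KeyError like a plain dict.
plain = defaultdict(None)
try:
    plain["missing"]
    raised = False
except KeyError:
    raised = True

print(dict(counts), dict(groups), missing_value, raised)
```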


Deques are a generalization of stacks and queues. Deques support thread-safe, memory-efficient appends and pops from either side of the deque with approximately the same O(1) performance in either direction.
The constructor returns a new deque object initialized left-to-right with data from iterable.
If iterable is not specified, the new deque is empty.
If maxlen is not specified or is None
, deques may grow to an arbitrary length. Otherwise, the deque is bounded to the specified maximum length. Once a bounded length deque is full, when new items are added, a corresponding number of items are discarded from the opposite end. (Can be used for keeping last N items)
Methods: append()
, appendleft()
, pop()
, popleft()
, etc.
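A small sketch of a bounded deque used to keep the last N items, plus the four basic methods listed above. The values are arbitrary.

```python
from collections import deque

# A bounded deque: once it holds maxlen items, appending on the right
# silently discards the oldest item on the left.
recent = deque(maxlen=3)
for number in range(5):
    recent.append(number)

# An unbounded deque used from both ends.
queue = deque([1, 2, 3])
queue.appendleft(0)       # deque([0, 1, 2, 3])
last = queue.pop()        # removes from the right
first = queue.popleft()   # removes from the left

print(list(recent), list(queue), first, last)
```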


The module provides an implementation of the heap queue algorithm. The implementation uses arrays for which heap[k] <= heap[2*k+1]
and heap[k] <= heap[2*k+2]
for all k, counting elements from zero.
Python implementation uses zerobased indexing. This makes the relationship between the index for a node and the indices for its children slightly less obvious. And the pop method returns the smallest item.
Functions:heapq.heappush(heap, item)
, heapq.heappop(heap)
, heapq.heapify(x)
, heapq.nlargest(n, iterable, key=None)
, heapq.nsmallest(n, iterable, key=None)
.
The latter two functions perform best for smaller values of n. For larger values, it is more efficient to use the sorted()
function (with slicing). Also, when n==1
, it is more efficient to use the builtin min()
and max()
functions.
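A minimal sketch of the heapq functions listed above, run on an arbitrary sample list.

```python
import heapq

data = [5, 1, 8, 3, 2]

heap = list(data)
heapq.heapify(heap)              # rearranges the list into heap order in place
heapq.heappush(heap, 0)
smallest = heapq.heappop(heap)   # always pops the smallest item

# Popping repeatedly yields the remaining items in ascending order.
drained = [heapq.heappop(heap) for _ in range(len(heap))]

top_two = heapq.nlargest(2, data)
bottom_two = heapq.nsmallest(2, data)

print(smallest, drained, top_two, bottom_two)
```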
A module is a file containing Python definitions and statements. The file name is the module name with the suffix .py
appended. Definitions from a module can be imported into other modules or into the main module (the collection of variables that you have access to in a script executed at the top level and in calculator mode). Within a module, the module’s name (as a string) is available as the value of the global variable __name__
.
open(filename, mode)
open()
returns a file object. The first argument is a string containing the filename. The second argument is another string containing a few characters describing the way in which the file will be used. mode can be 'r'
when the file will only be read, 'w'
for only writing (an existing file with the same name will be erased), and 'a'
opens the file for appending; any data written to the file is automatically added to the end. 'r+'
opens the file for both reading and writing. The mode argument is optional; 'r'
will be assumed if it’s omitted.
It is good practice to use the with
keyword when dealing with file objects. (similar to with tf.Session() as sess:
in TensorFlow). The advantage is that the file is properly closed after its suite finishes, even if an exception is raised at some point. Using with is also much shorter than writing equivalent try-finally blocks:


If not using with
keyword, just call f.close()
to close the file and immediately free up any system resources used by it.
f.read(size)
Reads some quantity of data and returns it as a string (in text mode) or bytes object(in binary mode). size is an optional numeric argument. When size is omitted or negative, the entire contents of the file will be read and returned. Otherwise, at most size bytes are read and returned. If the end of the file has been reached, f.read()
will return an empty string(''
).
f.readline()
Reads a single line from the file; a newline character(\n
) is left at the end of the string, and is only omitted on the last line of the file if the file doesn’t end in a newline. This makes the return value unambiguous: if f.readline()
returns an empty string, the end of the file has been reached, while a blank line is represented by '\n'
, a string containing only a single newline.
Looping over the file object:


Read all the lines of a file in a list: list(f)
or f.readlines()
.
f.write(string)
writes the contents of string to the file, returning the number of characters written. Other types of objects need to be converted  either to a string (in text mode) or a bytes object (in binary mode)  before writing them.
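The file operations above can be tried out with a throwaway file. This sketch writes to a temporary directory; the file name demo.txt is arbitrary.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.txt")

with open(path, "w") as f:
    written = f.write("first\nsecond\n")   # returns the number of characters

with open(path) as f:                      # mode defaults to 'r'
    first_line = f.readline()              # keeps the trailing newline
    rest = f.read()                        # everything that is left
    at_end = f.read()                      # '' signals end of file

with open(path) as f:
    lines = [line.rstrip("\n") for line in f]   # loop over the file object

closed_after_with = f.closed               # the with block closed the file

print(written, repr(first_line), repr(rest), repr(at_end), lines)
```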


It will return a list of the file name in the current directory.
Errors detected during execution are called exceptions.
Exceptions come in different types, and the type is printed as part of the message. The string printed as the exception type is the name of the builtin exception that occurred.(true for all builtin exceptions, but need not be true for userdefined exceptions)
All builtin exceptions are listed here
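The try
statement works as follows.
· First, the try clause (the statements between the try and except keywords) is executed.
· If no exception occurs, the except clause is skipped and execution of the try statement is finished.
· If an exception occurs during execution of the try clause, the rest of the clause is skipped. Then, if its type matches the exception named after the except keyword, the except clause is executed, and execution continues after the try statement.
· If an exception occurs which does not match the exception named in the except clause, it is passed on to outer try statements; if no handler is found, it is an unhandled exception and execution stops with a message.
A try
statement may have more than one except clause, to specify handlers for different exceptions. At most one handler will be executed. Handlers only handle exceptions that occur in the corresponding try clause, not in other handlers of the same try statement. An except clause may name multiple exceptions as a parenthesized tuple.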
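A small sketch of these rules: several except clauses, a parenthesized tuple of exceptions, and an unmatched exception travelling to an outer try statement. The function parse_ratio is made up for illustration.

```python
def parse_ratio(numerator, denominator):
    try:
        return int(numerator) / int(denominator)
    except ValueError:
        # Runs when int() cannot parse one of the strings.
        return "not a number"
    except (ZeroDivisionError, OverflowError):
        # One clause may name several exceptions as a tuple.
        return "cannot divide"


ok = parse_ratio("6", "3")
bad_text = parse_ratio("x", "3")
by_zero = parse_ratio("6", "0")

# int(None) raises TypeError, which no clause above matches,
# so it is passed on to this outer try statement.
try:
    parse_ratio(None, "1")
    propagated = False
except TypeError:
    propagated = True

print(ok, bad_text, by_zero, propagated)
```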
Assignment always copies the value of an expression, but a value is not always what you might expect it to be. In particular, the “value” of a structured object such as a list is actually just a reference to the object.
Python provides two ways to check that a pair of items are the same. ==
and is
. The is
operator tests for object identity.
We can use id()
function to find out the numerical identifier for any object.
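A short sketch of the difference between equality and identity, and of assignment copying only the reference. The variable names are illustrative.

```python
first = ["Monty", "Python"]
alias = first             # assignment copies the reference, not the list
clone = list(first)       # a new list with equal contents

same_object = alias is first
equal_but_distinct = (clone == first) and (clone is not first)

alias.append("!")         # visible through first, because it is the same object

print(same_object, equal_but_distinct, first, clone)
print(id(first) == id(alias))
```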
import this
The Zen of Python, by Tim Peters
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren’t special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one— and preferably only one —obvious way to do it.
Although that way may not be obvious at first unless you’re Dutch.
Now is better than never.
Although never is often better than right now.
If the implementation is hard to explain, it’s a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea — let’s do more of those!
The keyword argument end
can be used to avoid the newline after the output, or end the output with a different string.


in
keyword tests whether or not a sequence contains a certain value.
The square brackets in the method signature denote that the parameter is optional, not that you should type square brackets at that position. This is frequent in the Python Library Reference.
When looping through dictionaries, the key and corresponding value can be retrieved at the same time using the items()
method.


When looping through a sequence, the position index and corresponding value can be retrieved at the same time using the enumerate()
function.


To loop over two or more sequence at the same time, the entries can be paired with the zip()
function.


To loop over a sequence in reverse, first specify the sequence in a forward direction and then call the reversed()
function.


To loop over a sequence in sorted order, use the sorted()
function which returns a new sorted list while leaving the source unaltered.
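The looping techniques above can be sketched in a few lines. The sample data follows the style of the Python tutorial and is otherwise arbitrary.

```python
knights = {"gallahad": "the pure", "robin": "the brave"}
lines = [f"{name} {title}" for name, title in knights.items()]

indexed = list(enumerate(["tic", "tac", "toe"]))

questions = ["name", "quest"]
answers = ["lancelot", "the holy grail"]
paired = [f"{q}: {a}" for q, a in zip(questions, answers)]

backwards = list(reversed(range(1, 10, 2)))

basket = ["apple", "orange", "apple", "pear"]
in_order = sorted(basket)      # new list; basket itself is left unaltered

print(lines, indexed, paired, backwards, in_order, sep="\n")
```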


Eval: Builtin Function
The return value is the result of the evaluated expression. Syntax errors are reported as exceptions.
This function can also be used to execute arbitrary code objects (such as those created by compile()
). In this case pass a code object instead of a string.
The following lines make the most common conversions between string and list.


For more information, refer to documentation of split() and join(). The following section split with multiple delimiters is also helpful.
See my post Regular Expression.
Measure execution time of small code snippets
timeit
provides a simple way to time small bits of Python code.
CommandLine Interface:


Python Interface:


(The NLTK book uses timeit.Timer
, but that’s not necessary: timeit.timeit()
automatically creates a Timer instance.)
timeit.timeit(stmt='pass', setup='pass', timer=<default timer>, number=1000000, globals=None)
The globals
parameter is new in Python 3.5. Passing globals()
to the globals parameter causes the code to be executed within the current global namespace. This can be more convenient than individually specifying imports.
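A hedged sketch of the Python interface. The function join_numbers is a made-up workload, and the explicit namespace dictionary stands in for globals(), which is the usual shorthand when the code runs at module level.

```python
import timeit


def join_numbers():
    return "-".join(str(n) for n in range(100))


# stmt may be a callable, in which case no namespace is needed.
elapsed_callable = timeit.timeit(join_numbers, number=1000)

# stmt may also be a string; names in it are looked up in the globals mapping.
namespace = {"join_numbers": join_numbers}
elapsed_string = timeit.timeit("join_numbers()", globals=namespace, number=1000)

print(f"{elapsed_callable:.4f}s {elapsed_string:.4f}s for 1000 runs each")
```

The returned value is the total time in seconds for all runs, so it should be divided by number to get a per-call figure.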
General
https://docs.python.org/3.7/library/time.html?highlight=timer#time.process_time
Use time.process_time()
method (which is new in Python 3.3) to record the beginning and end time.


To avoid the precision loss of floats, Python 3.7 introduces process_time_ns()
, which returns the time as an integer number of nanoseconds.
2to3 is a Python program that reads Python 2.x source code and applies a series of fixers to transform it into valid Python 3.x code. 2to3 will usually be installed with the Python interpreter as a script.
2to3’s basic arguments are a list of files or directories to transform: 2to3 example.py
A diff against the original source file is printed. 2to3 can also write the needed modifications right back to the source file. Writing the changes back is enabled with the -w flag: 2to3 -w example.py
2to3 document
index()
returns the index of the first matching item in the list. But what if we need all the indices of an item that occurs multiple times? Python doesn’t seem to provide a dedicated method for this. A common solution is:
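One way to collect every index is a list comprehension over enumerate(). The helper name all_indices is made up for illustration.

```python
def all_indices(items, target):
    """Return the positions of every occurrence of target in items."""
    return [index for index, item in enumerate(items) if item == target]


print(all_indices([1, 2, 1, 3, 1], 1))
print(all_indices(["a", "b"], "z"))
```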


Use map function


or list comprehension


There are some methods in Python to round off float number:
round(number[, ndigits]) return number rounded to ndigits precision after the decimal point. If ndigits is omitted or is None
, it returns the nearest integer to its input.
For the builtin types supporting round(), values are rounded to the closest multiple of 10 to the power minus ndigits; if two multiples are equally close, rounding is done toward the even choice (bankers’ rounding).
Note: The behavior of round()
for floats can be surprising: for example, round(2.675, 2)
gives 2.67
instead of the expected 2.68
. This is not a bug: it’s a result of the fact that most decimal fractions can’t be represented exactly as a float. See Floating Point Arithmetic: Issues and Limitations for more information.
print("%.1f" % number)
is used when printing the number out. It seems that it also observes bankers’ rounding.
To round decimal values exactly, we can use the decimal
module. (Adapted from Stack Overflow)
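A sketch of half-up rounding with the decimal module, contrasted with the built-in round(). The helper round_half_up is an illustrative name, and it takes the number as a string so that no binary float error is introduced before rounding.

```python
from decimal import ROUND_HALF_UP, Decimal


def round_half_up(text, places):
    """Round a decimal string half-up to the given number of decimal places."""
    quantum = Decimal(10) ** -places       # places=2 gives Decimal('0.01')
    return Decimal(text).quantize(quantum, rounding=ROUND_HALF_UP)


exact = round_half_up("2.675", 2)
builtin = round(2.675, 2)                  # the float 2.675 is slightly below 2.675
ties = (round(0.5), round(1.5), round(2.5))   # ties go to the even neighbor

print(exact, builtin, ties)
```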


To flatten the list:


To compute the cartesian product:


This section collects topics that the tutorial introduces only briefly, to be discussed in detail later.
Arbitrary Argument Lists and Unpacking Argument Lists
Introduced in Section 4.7,
Method
A method is a function that ‘belongs’ to an object and is named obj.methodname, where obj is some object (this may be an expression), and methodname is the name of a method that is defined by the object’s type. Different types define different methods. Methods of different types may have the same name without causing ambiguity.
Lambda Expressions
Small anonymous functions can be created with the lambda
keyword. This function returns the sum of its two arguments: lambda a, b: a+b
. Lambda functions can be used wherever function objects are required. They are syntactically restricted to a single expression. Semantically, they are just syntactic sugar for a normal function definition. Like nested function definitions, lambda functions can reference variables from the containing scope:
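A brief sketch of a lambda that references a variable from the containing scope, and of a lambda passed as a sort key. The sample pairs are made up.

```python
def make_incrementor(n):
    # The lambda captures n from the enclosing function's scope.
    return lambda x: x + n


add_five = make_incrementor(5)

pairs = [(2, "two"), (1, "one"), (3, "three")]
pairs.sort(key=lambda pair: pair[1])   # sort by the word, alphabetically

print(add_five(0), add_five(1), pairs)
```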


The del statement
Remove an item from a list given its index instead of its value. The del
statement can also be used to remove slices from a list or clear the entire list.


del
can also be used to delete entire variables. It can also delete a key:value pair from a dict; the target is written with the key, as in del d[key].
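A short sketch of the uses of del described above. The list contents follow the Python tutorial and the dictionary is made up.

```python
a = [-1, 1, 66.25, 333, 333, 1234.5]

del a[0]                 # remove one item by index
after_index = list(a)

del a[2:4]               # remove a slice
after_slice = list(a)

del a[:]                 # clear the whole list, keeping the variable
after_clear = list(a)

ages = {"tom": 3, "ann": 5}
del ages["tom"]          # remove a key:value pair by key

del a                    # delete the variable itself
try:
    a
    name_is_gone = False
except NameError:
    name_is_gone = True

print(after_index, after_slice, after_clear, ages, name_is_gone)
```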
Takes any string as input
Fixed-size output (e.g., 256 bits, as in Bitcoin)
Efficiently computable
Collision-free
Nobody can find x and y such that x != y and H(x) = H(y).
Note: Collisions do exist. The possible outputs are finite(string of 256 bits in size), while the possible inputs can be a string of any size.
To find a collision of any hash function: Trying randomly chosen inputs and chances are 99.8% that two of them will collide. Well, it takes too long to matter.
For some possible H’s, there’s a faster way to find collisions; for others, we haven’t known yet.
No H has been proven collisionfree.
Hiding
(r ‖ x) means: take all the bits of r and put all the bits of x after them.
If r is chosen from a probability distribution that has high min-entropy, then given H(r ‖ x), it is infeasible to find x.
High min-entropy means that the distribution is “very spread out”, so that no particular value is chosen with more than negligible probability.
Puzzle-friendly
For every possible output value y, if k is chosen from a distribution with high min-entropy, then it is infeasible to find x such that H(k ‖ x) = y.
Hash as message digest.
If we know H(x)=H(y), it’s safe to assume that x=y.
To recognize a file that we saw before, just remember its hash.
It’s useful because the hash is small, while a file may be very big.
Commitment
Want to “seal a value in an envelope”, and “open the envelope” later.
Commit to a value, reveal it later.


To seal msg in envelope:
(com, key) := commit(msg) — then publish com
To open envelope:
publish key, msg
anyone can use verify() to check validity
Commitment API:
commit(msg) := (H(key ‖ msg), key), where key is a random 256-bit value
verify(com, key, msg) := (H(key ‖ msg) == com)
Security properties:
Hiding: Given com, infeasible to find msg.
Binding: Infeasible to find msg != msg'
such that verify(commit(msg), msg') == true
.
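The commitment API can be sketched in Python. This assumes SHA-256 as the hash function H and byte-string concatenation for key ‖ msg; it is an illustration of the idea, not a vetted cryptographic implementation.

```python
import hashlib
import secrets


def commit(msg: bytes):
    """Return (com, key): publish com now, keep key and msg secret until later."""
    key = secrets.token_bytes(32)                 # random 256-bit value
    com = hashlib.sha256(key + msg).digest()      # H(key || msg)
    return com, key


def verify(com: bytes, key: bytes, msg: bytes) -> bool:
    """Check that (key, msg) opens the commitment com."""
    return hashlib.sha256(key + msg).digest() == com


com, key = commit(b"heads")
print(verify(com, key, b"heads"))   # the honest opening is accepted
print(verify(com, key, b"tails"))   # a different message is rejected
```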
Search Puzzle
Given a “puzzle ID” id (from a high min-entropy distribution) and a target set Y,
try to find a “solution” x such that H(id ‖ x) ∈ Y.
The puzzle-friendly property implies that no solving strategy is much better than trying random values of x.
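A toy version of the search puzzle, assuming SHA-256 as H and taking the target set Y to be all hashes whose first zero_bits bits are zero. The 8-byte encoding of x and the difficulty of 10 bits are arbitrary choices that keep the search short.

```python
import hashlib
import secrets


def in_target(puzzle_id: bytes, x: int, zero_bits: int) -> bool:
    """True when H(id || x) falls in Y, i.e. is below 2**(256 - zero_bits)."""
    digest = hashlib.sha256(puzzle_id + x.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - zero_bits))


def solve(puzzle_id: bytes, zero_bits: int) -> int:
    """Try x = 0, 1, 2, ... until one lands in the target set."""
    x = 0
    while not in_target(puzzle_id, x, zero_bits):
        x += 1
    return x


puzzle_id = secrets.token_bytes(32)       # high min-entropy puzzle ID
solution = solve(puzzle_id, zero_bits=10)
print(solution, in_target(puzzle_id, solution, 10))
```

Finding the solution takes about 2^10 tries on average, while checking it takes a single hash.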
SHA-256 takes the message that you’re hashing and breaks it up into blocks that are 512 bits in size (with some padding added at the end, 100…00).
IV: a 256-bit value (looked up in a standards document).
c: the compression function. It takes 768 bits (256 + 512) as input and outputs 256 bits.
A hash pointer is a pointer to where some info is stored, together with a (cryptographic) hash of the info.
If we have a hash pointer, we can ask to get the info back, and verify that it hasn’t changed.
Data structure of blockchain:
This is blockchain, and it’s similar to linked list.
If an adversary wants to change a block’s data (e.g., the left one), the content of that block changes, so the middle block’s hash pointer is no longer consistent with the left block. The adversary therefore has to change the middle block’s header, then the next block’s header, and so on.
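The tamper-evidence argument can be sketched with a tiny hash-pointer chain. This assumes SHA-256, a dictionary per block, and an all-zero value standing in for the first block's missing predecessor; none of this mirrors Bitcoin's real block format.

```python
import hashlib

GENESIS_PREV = b"\x00" * 32   # placeholder "previous hash" for the first block


def block_hash(prev_hash: bytes, data: bytes) -> bytes:
    return hashlib.sha256(prev_hash + data).digest()


def build_chain(items):
    """Return (chain, head), where head is the hash pointer to the last block."""
    chain = []
    prev = GENESIS_PREV
    for data in items:
        chain.append({"prev": prev, "data": data})
        prev = block_hash(prev, data)
    return chain, prev


def is_intact(chain, head) -> bool:
    """Recompute every hash pointer and compare against the stored ones."""
    prev = GENESIS_PREV
    for block in chain:
        if block["prev"] != prev:
            return False
        prev = block_hash(block["prev"], block["data"])
    return prev == head


chain, head = build_chain([b"tx-a", b"tx-b", b"tx-c"])
before = is_intact(chain, head)
chain[0]["data"] = b"tx-evil"          # tamper with the oldest block
after = is_intact(chain, head)
print(before, after)
```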
Merkle tree
Advantages of Merkle trees:
Tree holds many items, but just need to remember the root hash.
Can verify membership in O(log n) time/space.
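A sketch of a Merkle root and a logarithmic-size membership proof. It assumes SHA-256, a non-empty list of leaves, and the convention of duplicating the last node on levels with an odd number of nodes; other conventions exist.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root_and_proof(leaves, index):
    """Return the root hash and the sibling path for leaves[index]."""
    level = [h(leaf) for leaf in leaves]
    proof = []
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])            # pad odd levels
        sibling = index ^ 1                    # the neighbor in the same pair
        proof.append((level[sibling], sibling < index))
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        index //= 2
    return level[0], proof


def verify_membership(leaf, proof, root) -> bool:
    """Rebuild the path from the leaf up to the root using only the proof."""
    node = h(leaf)
    for sibling, sibling_is_left in proof:
        node = h(sibling + node) if sibling_is_left else h(node + sibling)
    return node == root


items = [b"a", b"b", b"c", b"d", b"e"]
root, proof = merkle_root_and_proof(items, 4)
print(len(proof), verify_membership(b"e", proof, root))
```

For 5 items the proof holds 3 hashes, one per tree level, which is where the O(log n) bound comes from.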
What we want from signatures is two things:
API for digital signatures


Requirements for signatures
·Valid signatures verify
verify(pk, message, sign(sk, message))==true
·Can’t forge signatures
An adversary who knows your public key and gets to see signatures on some other messages, can’t forge your signature on some message that he wants to forge it on.
Practical stuff
algorithms are randomized
need good source of randomness
limit on message size
fix: use Hash(message) rather than message
fun trick: sign a hash pointer
signature “covers” the whole structure
Goofy can create new coins.
A coin’s owner can spend it.
The recipient can pass on the coin again.
Problem: Doublespending attack
the main design challenge in digital currency
CreateCoin transaction creates new coins.
PayCoin transaction consumes (and destroys) some coins, and create new coins of the same total value.
Valid if:
consumed coins valid
not already consumed
total value out = total value in
signed by owners of all consumed coins
Note: Coins are immutable, that is, coins can’t be transferred, subdivided, or combined.
Why consensus protocols?
Traditional motivation: reliability in distributed systems
Distributed keyvalue store enables various applications: DNS, public key directory, stock trades …
The protocol terminates and all correct nodes decide on the same value
This value must have been proposed by some correct node
Why consensus is hard?
Nodes may crash
Nodes may be malicious
Network is imperfect (Not all pairs of nodes connected; faults in network; latency)
Consensus algorithm (simplified)
What can a malicious node do?  Double Spending
Honest nodes will extend the longest valid branch. In the above image, the green block and the red block are identical. So chances are that the next node will extend the red block, and so on, which makes the doublespending attack.
However, from Bob’s view, it looks like this:
The double-spend attack described here relies on Bob accepting the payment after only 1 confirmation. If Bob is patient and waits for more confirmations, he is unlikely to suffer a double-spend.
Double-spend probability decreases exponentially with the number of confirmations.
(Most common heuristic: 6 confirmations)
Incentive 1: block reward
Creator of block gets to
Note block creator gets to “collect” the reward only if the block ends up on longterm consensus branch.
Incentive 2: transaction fees
Creator of transaction can choose to make output value less than input value
Remainder is a transaction fee and goes to block creator
Purely voluntary, like a tip.
PoW property
1: difficult to compute
Only some nodes bother to compute (the miners)
2: parameterizable cost
Nodes automatically recalculate the target every two weeks.
Goal: average time between blocks = 10 minutes
3: trivial to verify
Nonce must be published as part of block
Other miners simply verify that
Key security assumption
Attacks infeasible if majority of miners weighted by hash power follow the protocol.
for individual miner:
What can a “51% attacker” do?
Steal coins from existing address? ×
Suppress some transactions?
· From the block chain √
· From the P2P network ×
Change the block reward? ×
Destroy confidence in Bitcoin? √√
Comparison between an accountbased ledger and a transactionbased ledger:
An accountbased ledger
If we want to check whether a transaction is valid, we might need to scan backwards until genesis. That’s quite inconvenient and inefficient.
A transactionbased ledger
The verification just needs finite scan with hash pointers.
Then, we can easily realize functions like merging value and joint payments.
Merging value
(The slides provided on Coursera contain some mistakes, so I took screenshots from the lecture video instead.)
Joint payments
Here’s what a Bitcoin transaction looks like:
Note that in Bitcoin transaction, instead of assigning a public key, Bitcoin uses script.
Design goals
Instructions
256 opcodes total (15 disabled, 75 reserved)
Instruction:
<sig>
: Input script. Push data onto the stack.
<pubKey>
: Push data onto the stack.
OP_DUP
: Duplicate instruction. Take the value on the top of the stack, pop it off, and then write two copies back to the stack.
OP_HASH160
: Take the top value on the stack and compute a cryptographic hash of it.
pubKeyHash
: The hash of the public key that was actually used by the recipient when trying to claim the coins.
pubKeyHash?
: Specified by the sender of the coins. The public key that the sender specified had to be used to generate the signature to redeem these coins.
OP_EQUALVERIFY
: Check whether the two values at the top of the stack are equal. If they aren’t, an error is thrown and the script stops executing. If they are, the instruction consumes those two data items at the top of the stack.
OP_CHECKSIG
: Verify the entire transaction was successfully signed. Pop those remaining two items off of the stack, check the signature is valid.
OP_CHECKMULTISIG
:
Builtin support for joint signatures
Specify n public key
Specify t (threshold)
Verification requires t signatures
(BUG: Extra data value popped from the stack and ignored)
There’s a special transaction, which is the coinbase transaction:
Since this transaction creates new coins, its prev_out has a null hash pointer.
Miners can put anything in coinbase.
The output value is slightly more than the fixed block reward, because it also includes the transaction fees.
P2P Network
Transaction propagation
· Transaction valid with current blockchain
· (default) script matches a whitelist  avoid unusual scripts
· Haven’t seen before  Avoid infinite loops
· Doesn’t conflict with others I’ve relayed  avoid doublespends
Block propagation
Relay a new block when you hear it if:
· Block meets the hash target
· Block has all valid transactions  Run all scripts, even if you wouldn’t relay
· Block builds on current longest chain  Avoid forks
Fullyvalidating nodes
· Permanently connected
· Store entire block chain (Storage cost: 20 GB)
· Hear and forward every node/transaction
Thin/SPV clients(not fullyvalidating)
Ideas: don’t store everything
· Store block headers only (1000x cost saving)
· Request transactions as needed  to verify incoming payment
· Trust fullyvalidating nodes
Hardcoded limits in Bitcoin
· 10 min. average creation time per block
· 1 M bytes in a block
· 20,000 signature operations per block
·· 100 M satoshis per bitcoin
·· 21 M total bitcoins maximum
·· 50, 25, 12.5, … bitcoin mining reward
·· : These affect economic balance of power too much to change now
Throughput limits in Bitcoin
· 1 M bytes/block (10 min)
· >250 bytes/transaction
· 7 transactions/sec
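The figure of about 7 transactions per second follows from the two limits above. A quick worked calculation, treating 250 bytes as the smallest transaction size:

```python
BLOCK_SIZE_BYTES = 1_000_000          # 1 M bytes per block
MIN_TX_BYTES = 250                    # each transaction is at least this large
BLOCK_INTERVAL_SECONDS = 10 * 60      # one block every 10 minutes on average

max_tx_per_block = BLOCK_SIZE_BYTES // MIN_TX_BYTES
tx_per_second = max_tx_per_block / BLOCK_INTERVAL_SECONDS

print(max_tx_per_block)               # upper bound on transactions per block
print(round(tx_per_second, 2))        # upper bound on transactions per second
```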
Cryptographic limits in Bitcoin
· Only 1 signature algorithm (ECDSA over the secp256k1 curve)
· Hardcoded hash functions
Hardforking
If old nodes didn’t update software, they will reject all the new transactions/blocks.
Old nodes will never catch up.
Soft forks
Observation: we can add new features which only limit the set of valid transactions
Need majority of nodes to enforce new rules
Old nodes will approve.
There is a risk that old nodes might mine now-invalid blocks, since there are new limits on blocks.
Soft fork possibilities
· New signature schemes
· Extra perblock metadata  Shove in the coinbase parameter; Commit to UTXO tree in each block
Hard forks
· New op codes
· Changes to size limits
· Changes to mining rate
· Many small bug fixes (like bug in MULTISIG)
Currently seem very unlikely to happen.
Throughput on a high-end PC = 10-20 MHz ≈ 2^24
139,461 years on average to find a block today
GPUs designed for highperformance graphics  high parallelism & high throughput
First used for Bitcoin ca. October 2010
Implemented in OpenCL  Later: hacks for specific cards
Advantages
· easily available, easy to set up
· parallel ALUs
· bitspecific instructions
· can drive many from 1 CPU
· can overclock
Disadvantages
· poor utilization of hardware
· poor cooling
· large power draw
· few boards to hold multiple GPUs
Throughput on a good card = 20-200 MHz ≈ 2^27
173 years on average to find a block w/100 cards
First used for Bitcoin ca. June 2011
Implemented in Verilog
Advantages
· higher performance than GPUs  excellent performance on bitwise operations
· better cooling
· extensive customization, optimization
Disadvantages
· higher power draw than the chips were designed for; frequent malfunctions, errors
· poor optimization of 32-bit adds
· fewer hobbyists with sufficient expertise
· more expensive than GPUs
· marginal performance/cost advantage over GPUs
Throughput on a good card = 100-1000 MHz ≈ 2^30
25 years on average to find a block w/100 boards
· special purpose; approaching known limits on feature sizes, less than 10x performance improvement expected
· designed to be run constantly for life
· require significant expertise, long lead-times
· perhaps the fastest chip development ever
Anonymity in computer science:
Pseudonymity: Bitcoin addresses are public key hashes rather than real identities
Unlinkability: different interactions of the same user with the system should not be linkable to each other
Unlinkability in Bitcoin:
Hard to link different addresses of the same user
Hard to link different transactions of the same user
Hard to link sender of a payment to its recipient
Blind signature
A two-party protocol to create a digital signature without the signer knowing the input
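Most students will have used a C-like language (C, C++, Java) before meeting MATLAB. MATLAB is an interpreted language and differs from the C family in several significant ways, so I'll start with a short list of the differences between the MATLAB language ("the m language" below) and C-like languages ("C" below).
Variable definition
C requires variables to be declared before use, e.g. int i, double sum. The m language has no declaration step: any variable that appears on the left-hand side of an assignment can be used directly (just don't use a name on the right-hand side of an assignment before it has been assigned).
Numeric operations
Arithmetic in C mostly matches everyday arithmetic; the occasional counterintuitive result is something like 10/3=3. In the m language every value is a matrix, so operations follow the rules of matrix arithmetic (for example, addition requires the two matrices to have the same dimensions, and multiplication requires the number of columns of the first matrix to equal the number of rows of the second, etc.).
The m language also has element-wise operations such as .* and .^. Take A=[a1,a2,a3]; and B=[b1,b2,b3];. By linear algebra these two matrices cannot be multiplied with the matrix product (*), but they can be multiplied element by element (.*), giving A.*B=[a1*b1,a2*b2,a3*b3].
Program execution
C is a compiled language: a .c file must be compiled before it can run. The m language is interpreted, so statements can be executed one at a time. MATLAB can therefore run a command typed into the Command Window immediately, or run a script saved as a .m file.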
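This is a screenshot of MATLAB R2017b, with the different regions marked in different colours.
Toolstrip
Provides opening and saving files, setting breakpoints and many other functions. (I never use the toolstrip myself; I use either shortcuts or commands, so I won't describe it in detail.)
A few useful shortcuts (on a MacBook replace Ctrl with Command, here and below):
Ctrl+N: create a new .m file
Ctrl+O: open a file (usually a .m file)
Ctrl+S: save the file
Current Folder
Shows all files and folders under the current path; different file types get different icons.
For example, ex3_nn.m and ex3.m in the screenshot are executable m files: typing the file name (ex3_nn, ex3) in the Command Window runs the file. displayData.m, fmincg.m and so on are function files that define functions (since functions usually take arguments, running a function file directly is not recommended). ex3data1.mat, ex3weights.mat and so on are data files, which MATLAB can read directly.
Editor
The Editor window appears when you create, open or edit a .m file (by default the Command Window occupies the Editor's position). Compared with other editors such as Sublime Text or VS Code, the MATLAB Editor is somewhat lacking: it has only limited syntax highlighting, and Chinese text may come out garbled (copying code straight into Word garbles it, so use Paste Special with formatted text; opening a .m file in another text editor can garble it too). Still, it is the built-in editor, so we make do with it.
(Also: MATLAB has live scripts, with the .mlx extension, which can embed plots. They are recommended for writing papers and the like; in ordinary use a plain .m file is enough.)
Command Window
The Command Window supports many operations. As mentioned above, you can run a program by typing its name (only the file name, without the .m extension), or copy some statements from a .m file and run them in the Command Window. Later sections cover various other commands.
Workspace
The Workspace shows the value or dimensions of every variable in the current session, which is handy for debugging.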
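help
help is arguably the most useful command in MATLAB. If you are not sure how to use a function, type help followed by the function name in the Command Window to see its usage and examples, e.g. help plot. You can even run help help to see how to use help itself. That is why help comes first.
doc
help gives a brief description of a command, while doc opens the more complete MATLAB documentation for it.
Scalar: a=5;
Every variable in MATLAB is an array, or in other words a matrix. A single number is called a scalar in MATLAB and is a special 1×1 matrix.
Also: pi can be used directly for the constant π, but e cannot be used for the base of the natural logarithm (MATLAB reports an undefined function or variable 'e'). For computations with e, use the exp() function.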
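Matrix initialization: ones(), zeros(), eye()
All three functions initialize matrices or vectors. ones and zeros are used the same way: ones creates a matrix whose elements are all 1, and zeros one whose elements are all 0. eye creates an identity matrix (1 on the main diagonal and 0 elsewhere).
ones(N), zeros(N), eye(N): create an N×N matrix.
ones(M,N), zeros(M,N): create an M×N matrix.
Random numbers
rand(): pseudorandom numbers from the uniform distribution on (0,1)
randn(): pseudorandom numbers from the standard normal distribution
rand() and randn() are called the same way as ones and zeros.
randi(MAX,N), randi(MAX,M,N): pseudorandom integers from the discrete uniform distribution on 1:MAX (that is, 1, 2, …, MAX). Apart from the extra MAX argument, it is called like rand and randn.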
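When you separate numbers by spaces (or commas), MATLAB combines the numbers into a row vector, which is an array with one row and multiple columns (1-by-n). When you separate them by semicolons, MATLAB creates a column vector (n-by-1).
In other words, when defining a matrix, , (or a space) separates elements within a row and ; separates rows. For example, [1,2,3;4,5,6] creates a 2×3 matrix.
;: besides separating rows, a semicolon at the end of a command suppresses its output, as in a=1;. The command still runs, but nothing is printed in the Command Window. When writing .m files it is therefore a good idea to end every statement with ; unless there is a reason not to.
:: the first common use is min:max or min:step:max. The former creates the row vector [min, min + 1, min + 2, … , max]; the latter creates [min, min + step, min + 2 × step, … , min + n × step], where min + n × step ≤ max and min + (n + 1) × step > max. For example, 1:2:6 gives [1,3,5].
(Note: linspace(first,last,number_of_elements) is a function similar to :.)
The second use is to select every element along a dimension. For example, x = A(2,:) assigns all elements of the second row of A to x. x = A(1:3,:) combines both uses of : and assigns the first three rows of A to x.
...: if a statement is long, type ... and press Enter to continue it on the next line. It is mostly used for function calls with many arguments.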
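Defining matrices and vectors
Vectorized (matrix) operations make MATLAB code considerably faster. All data in MATLAB is treated as a matrix; a matrix of dimension 1×n or n×1 is a vector. The size command shows the dimensions of a matrix, and length can be used for vectors.
Matrix multiplication and element-wise multiplication
Matrix multiplication comes in two kinds: a matrix times a scalar, and a matrix times a matrix. Both use the * operator, e.g. 2 * [1 2 3] = [2 4 6] and [1,2;3,4] * [1;1] = [3;7].
Two matrices of the same dimensions often cannot be multiplied as matrices. As mentioned earlier, though, MATLAB also supports element-wise operations, with the operator .*, e.g. [1,2,3] .* [2,3,4] = [2,6,12].
Matrix maximum
max(X): if X is a vector, returns its largest element; if X is a matrix, returns a row vector whose elements are the maxima of the columns, e.g. max([1,3,5;2,3,4])=[2,3,5].
[m,i]=max(X): also returns the index of the maximum.
Matrix sums
sum(x): column sums
sum(x,2): row sums
sum(x(:)): sum of all elements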
General form of a for loop:


if conditional:


switch multi-way branch:


The usage is similar to C; just note that the syntax differs.
Four tables follow for reference (from Digital Image Processing Using MATLAB, Second Edition):
Arithmetic operators
| Operator | Name | Comments and Examples |
| --- | --- | --- |
| + | Array and matrix addition | a+b, A+B, or a+A. |
| - | Array and matrix subtraction | a-b, A-B, A-a, or a-A. |
| .* | Array multiplication | C=A.*B, C(I,J)=A(I,J)*B(I,J). |
| * | Matrix multiplication | A*B, standard matrix multiplication, or a*A, multiplication of a scalar times all elements of A. |
| ./ | Array right division | C=A./B, C(I,J)=A(I,J)/B(I,J). |
| .\ | Array left division | C=A.\B, C(I,J)=B(I,J)/A(I,J). |
| / | Matrix right division | A/B is the preferred way to compute A*inv(B). |
| \ | Matrix left division | A\B is the preferred way to compute inv(A)*B. |
| .^ | Array power | If C=A.^B, then C(I,J)=A(I,J)^B(I,J). |
| ^ | Matrix power | See help for a discussion of this operator. |
| .' | Vector and matrix transpose | A.', standard vector and matrix transpose. |
| ' | Vector and matrix complex conjugate transpose | A', standard vector and matrix conjugate transpose. When A is real, A.'=A'. |
| + | Unary plus | +A is the same as 0+A. |
| - | Unary minus | -A is the same as 0-A or -1*A. |
| : | Colon | Discussed above. |
Relational operators
| Operator | Name |
| --- | --- |
| < | Less than |
| <= | Less than or equal to |
| > | Greater than |
| >= | Greater than or equal to |
| == | Equal to |
| ~= | Not equal to |
Logical operators
| Operator | Description |
| --- | --- |
| & | Elementwise AND |
| \| | Elementwise OR |
| ~ | Elementwise and scalar NOT |
| && | Scalar AND |
| \|\| | Scalar OR |
Flow control statements
| Statement | Description |
| --- | --- |
| if | if, together with else and elseif, executes a group of statements based on a specified logical condition. |
| for | Executes a group of statements a fixed (specified) number of times. |
| while | Executes a group of statements an indefinite number of times, based on a specified logical condition. |
| break | Terminates execution of a for loop or while loop. |
| continue | Passes control to the next iteration of a for or while loop, skipping any remaining statements in the body of the loop. |
| switch | switch, together with case and otherwise, executes different groups of statements, depending on a specified value or string. |
| return | Causes execution to return to the invoking function. |
| try catch | Changes flow control if an error is detected during execution. |
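Saving and loading data
save: save data to a .mat file
load: load data from a .mat file
These are not used very often, because the mouse can do the same: right-click in the Workspace to save, and use the toolstrip to open.
imread: load an image in a common standard format (GIF, JPEG, PNG, etc.)
imshow: display the image
imwrite: save the image to the current path
datastore, imageDatastore
Example: ds = imageDatastore('foo*.png'): creates a datastore in MATLAB. The argument is a folder name or file name, and the wildcard * can be used to match multiple files; foo*.png matches every PNG image whose name starts with foo.
read, readimage, readall.
Comments
%: everything after a percent sign is a comment, like // in C-like languages.
Clearing commands
clear: clear the Workspace variables
clc: clear the contents of the Command Window
close all: close all figure windows
Linux-like commands
The Command Window also supports many common Linux commands (if you are interested, see my post on common Linux commands). A brief overview:
cd ..: go up to the parent folder
cd folder1: change to the folder named 'folder1' under the current path
cd ~: change to the home folder
mkdir folder1: create a folder named 'folder1' under the current path
rmdir folder1: delete the folder named 'folder1' under the current path
Ctrl C: abort the running program. (Note that it is also Ctrl C on a MacBook.)
Those are the commands most commonly used in MATLAB.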
https://cn.mathworks.com/help/matlab/ref/linespec.html
https://cn.mathworks.com/help/matlab/ref/chartlineproperties.html
https://cn.mathworks.com/help/matlab/creating_plots/usinghighlevelplottingfunctions.html
hold on
title
xlabel
ylabel
legend
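A MATLAB M-file can be either a script of directly executable MATLAB statements or a function that accepts arguments and produces output. An M-function usually consists of:
Function definition: function output = name(inputParameter1, inputParameter2...)
H1 line: the first single-line comment after the function definition. There is no blank line between the function definition and the H1 line, and usually no space between % and the first word, e.g. %sum Computes the sum... The lookfor command displays this line.
Help text: the comment lines after the H1 line, again with no blank line in between. The help command displays the H1 line together with the help text.
Function body: the MATLAB code that performs computations, assignments and so on.
Comments: all comments other than the H1 line and the help text.
Below are some functions used in MATLAB's official Deep Learning Onramp course.
alexnet: creates the pretrained deep learning network "AlexNet" in the MATLAB workspace, e.g. net = alexnet;
classify: predicts the class of an image.
Note: alexnet can be used directly in MATLAB Online, but in desktop MATLAB the corresponding support package must be installed before AlexNet and other pretrained networks can be used. Search for "pretrained network" in the Add-Ons to download the package for free.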
You can use the splitEachLabel function to divide the images in a datastore into two separate datastores: [ds1,ds2] = splitEachLabel(imds,p);
The proportion p (a value from 0 to 1) indicates the proportion of images with each label from imds that should be contained in ds1. The remaining files are assigned to ds2.
This semester I'm taking the E-commerce System Structure course, and we have come to personalization. In e-commerce, personalization takes the form of recommending items to the user.
Recommendation Metrics
Recommendation Strategies
In the corresponding experiment, our assignment is to read the paper Item-Based Collaborative Filtering Recommendation Algorithms, a famous paper on recommender systems. I'd like to note down some key points, with specific examples and implementation code.
Figure 1 shows the schematic diagram of the collaborative filtering process. CF algorithms represent the entire m×n user-item data as a ratings matrix, A. Here m is the number of users and n is the number of items. Each entry in A represents the preference score (rating) of the ith user on the jth item. Each individual rating is within a numerical scale (e.g., from 1 to 5, the higher the better), and it can also be 0 or NULL, indicating that the user has not yet rated that item.
Prediction is a numerical value.
Recommendation is a list of N items.
The image may be a bit abstract, so here is a simplified toy example. Ratings range from 1 to 5, where 1 means "dislike the movie very much" and 5 means "like the movie very much". If a rating is blank, the user has not seen (rated) the movie yet. I'll refer to this table later.
Harry Potter  Pirates of the Caribbean  Titanic  Avatar  Transformers  

Alice  3  5  4  1  
Bob  3  4  4  1  
Carol  4  3  3  1  
Dave  4  4  4  3  1 
Eve  5  4  5  3 
Memory-based CF Algorithms
Memory-based algorithms utilize the entire user-item database to generate a prediction. These systems employ statistical techniques to find a set of users, known as neighbors, that have a history of agreeing with the target user. The techniques, also known as nearest-neighbor or user-based collaborative filtering, are more popular and widely used in practice.
Model-based CF Algorithms
Model-based collaborative filtering algorithms provide item recommendation by first developing a model of user ratings. Algorithms in this category take a probabilistic approach and envision the collaborative filtering process as computing the expected value of a user prediction, given his/her ratings on other items.
The item-based approach looks into the set of items the target user has rated, computes how similar they are to the target item i, and then selects the k most similar items. At the same time their corresponding similarities are also computed. Once the most similar items are found, the prediction is computed by taking a weighted average of the target user's ratings on these similar items.
Two items are thought of as two vectors in the m dimensional userspace. The similarity between them is measured by computing the cosine of the angle between these two vectors.
Take the previous example: each column can be viewed as a vector. Therefore, the item Harry Potter has a vector of and the item Pirates of the Caribbean has a vector of . Note that a blank rating is treated as 0. And the result should be
Similarity between two items i and j is measured by computing the Pearson-r correlation. To make the correlation computation accurate, first isolate the co-rated cases (i.e., cases where the users rated both i and j). Let the set of users who rated both i and j be denoted by U.
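The two similarity measures described so far can be sketched in plain Python. This is only an illustration under stated assumptions: the two rating vectors are hypothetical (they are not the table above), `None` marks a blank rating, the cosine version treats a blank as 0, the Pearson version uses only the co-rated users and takes each item's mean over those users, and the function names are my own. In formulas, cosine similarity is $\cos(\vec{i}, \vec{j}) = \frac{\vec{i} \cdot \vec{j}}{\lVert \vec{i} \rVert \, \lVert \vec{j} \rVert}$.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two item vectors; blank ratings count as 0."""
    a = [0 if x is None else x for x in a]
    b = [0 if y is None else y for y in b]
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pearson_similarity(a, b):
    """Pearson correlation over the users who rated both items."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    mean_a = sum(x for x, _ in pairs) / len(pairs)
    mean_b = sum(y for _, y in pairs) / len(pairs)
    num = sum((x - mean_a) * (y - mean_b) for x, y in pairs)
    den_a = math.sqrt(sum((x - mean_a) ** 2 for x, _ in pairs))
    den_b = math.sqrt(sum((y - mean_b) ** 2 for _, y in pairs))
    return num / (den_a * den_b)


# Hypothetical ratings of two items by five users (None = not rated).
item_i = [None, 3, 4, 4, 5]
item_j = [5, 4, 3, 4, 5]

cos_sim = cosine_similarity(item_i, item_j)
pearson_sim = pearson_similarity(item_i, item_j)
print(cos_sim, pearson_sim)
```

Neither function guards against a zero denominator (an item nobody rated, or an item with constant ratings among the co-rated users); real code would need to handle those cases.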
Take the example again: , . The set of users who both rated Harry Potter and The Avengers is . Then the similarity is


There are tough users and easy users. A tough user tends to give relatively low scores and may have an average rating of 2.5, while an easy user may have an average rating of 4.0. When computing similarity we have to take this difference between users into account, and that is what adjusted cosine similarity does: it subtracts the user's average rating from each of that user's ratings, which reduces the difference between users and produces a more accurate result.
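A minimal sketch of adjusted cosine similarity, assuming ratings are stored as a dictionary from user to that user's item ratings. The data and the names `adjusted_cosine` and `ratings` are hypothetical; each user's mean is taken over all items that user rated, and only users who rated both items contribute.

```python
import math


def adjusted_cosine(ratings, i, j):
    """Similarity of items i and j after subtracting each user's mean rating."""
    num = den_i = den_j = 0.0
    for user_ratings in ratings.values():
        if i in user_ratings and j in user_ratings:
            mean = sum(user_ratings.values()) / len(user_ratings)
            di = user_ratings[i] - mean
            dj = user_ratings[j] - mean
            num += di * dj
            den_i += di * di
            den_j += dj * dj
    return num / math.sqrt(den_i * den_j)


# Hypothetical data: u1 is a "tough" user, u2 and u3 are "easy" users.
ratings = {
    "u1": {"A": 2, "B": 3, "C": 1},
    "u2": {"A": 4, "B": 5, "C": 3},
    "u3": {"A": 5, "B": 3, "C": 4},
}

sim_ab = adjusted_cosine(ratings, "A", "B")
print(sim_ab)
```

Here u1 and u2 give different raw scores but identical deviations from their own means, so after the adjustment they contribute the same evidence.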


This method computes the prediction on an item i for a user u by computing the sum of the ratings given by the user on the items similar to i. Each rating is weighted by the corresponding similarity between items i and j.
For this to work best, each rating should be a value in the range -1 to 1. Our ratings are in the range 1 to 5, so we need to convert them to the -1 to 1 scale. Here are two formulas (from IT533 Item-Based Recommender Systems (with Cosine Similarity and Python Code)):
Normalization
Denormalization
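A sketch of the weighted-sum prediction with normalization, under stated assumptions. The normalization used here is the usual linear map from [Min, Max] to [-1, 1], $NR = \frac{2(R - Min) - (Max - Min)}{Max - Min}$, and its inverse $R = \frac{1}{2}(NR + 1)(Max - Min) + Min$; the prediction is $\sum_N s_{i,N} \cdot NR_{u,N} / \sum_N |s_{i,N}|$. The similarity values and the user's ratings below are hypothetical, and the function names are my own.

```python
def normalize(r, r_min=1, r_max=5):
    """Map a rating from [r_min, r_max] to [-1, 1]."""
    return (2 * (r - r_min) - (r_max - r_min)) / (r_max - r_min)


def denormalize(nr, r_min=1, r_max=5):
    """Map a normalized rating from [-1, 1] back to [r_min, r_max]."""
    return 0.5 * (nr + 1) * (r_max - r_min) + r_min


def predict(user_ratings, similarities):
    """Weighted sum over the items the user rated, on the normalized scale."""
    num = sum(similarities[item] * normalize(r) for item, r in user_ratings.items())
    den = sum(abs(similarities[item]) for item in user_ratings)
    return denormalize(num / den)


# Hypothetical: the user rated items B and C; similarities are to the target item.
user_ratings = {"B": 5, "C": 1}
similarities = {"B": 0.9, "C": 0.1}

prediction = predict(user_ratings, similarities)
print(round(prediction, 2))
```

Because item B is far more similar to the target than item C, the prediction lands close to the user's rating of B.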
Example of predicting Alice’s rating on Harry Potter:
If we use the prediction formula directly:
This is not an accurate prediction. Intuitively, Harry Potter is similar to Titanic (Dave rated both 4 and Eve rated both 5), so Alice's rating of Harry Potter should be close to her rating of Titanic. So let's look at the result with normalization:
Normalize Alice's rating: . Using these normalized ratings, the prediction would be:
This is Alice's normalized rating on Harry Potter. To denormalize it:
And it's a more accurate prediction.




In practice, the similarities computed using cosine or correlation measures may be misleading in the sense that two rating vectors may be distant (in the Euclidean sense) yet have very high similarity. The basic idea is to use the same formula as the weighted sum technique, but instead of using the similar item N's "raw" rating values $R_{u,N}$, this model uses their approximated values $R'_{u,N}$ based on a linear regression model.
The respective vectors of the target item i and the similar item N are denoted by $R_i$ and $R_N$.
About linear regression, maybe it’s a good choice to refer to the machine learning article, in which I briefly introduced linear regression.
x: determines what percentage of data is used as training and test sets.
ML: the data set with 943 rows and 1682 columns.
sparsity level:
MAE: mean absolute error.
What is machine learning?
Field of study that gives computers the ability to learn without being explicitly programmed. — By Arthur Samuel
A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T as measured by P improves with experience E. — By Tom Mitchell
Supervised learning
Regression
try to map input variables to some continuous function.
Classification
predict results in a discrete output.
Unsupervised learning
approach problems with little or no idea what our results should look like
Example:
Some notations:
m: Number of training examples
x: “input” variable / features
y: “output” variable / “target” variable
$(x^{(i)}, y^{(i)})$: ith training example. Note that the superscript "(i)" in the notation is simply an index into the training set, and has nothing to do with exponentiation.
A slightly more formal description of supervised learning problem is that given a training set, to learn a function h : X → Y so that h(x) is a “good” predictor for the corresponding value of y. For historical reasons, this function h is called a hypothesis.
In this example of linear regression with one variable, the hypothesis function can be denoted as
$$h_\theta(x) = \theta_0 + \theta_1 x$$
(we Chinese students may be more familiar with the form h(x)=kx+b). Here $\theta_0$ and $\theta_1$ are just parameters, and our goal is to choose $\theta_0$ and $\theta_1$ so that $h_\theta(x)$ is close to y for our training examples (x, y).
The cost function takes the average of the squared differences between the hypothesis outputs for the inputs x and the actual outputs y:
$$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$
This function is also called the "squared error function" or "mean squared error". The factor 1/2 is included so that the partial derivatives used in gradient descent come out cleaner.
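The gradient descent algorithm:
repeat until convergence {
$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$ (for j = 0 and j = 1)
}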
The value of α should not be too small or too large.
If α is too small, gradient descent can be slow.
If α is too large, gradient descent can overshoot the minimum. It may fail to converge, or even diverge.
In general, gradient descent can converge to a local minimum, even with the learning rate α fixed.
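After calculating the partial derivatives, we get the algorithm:
repeat until convergence {
$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$
$\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}$
(update θ0 and θ1 simultaneously)
}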
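A minimal sketch of batch gradient descent for one-variable linear regression, with hypothesis θ0 + θ1·x and each parameter moved against the average gradient of the squared-error cost. The toy data, learning rate and iteration count are assumptions chosen so that the run converges; they are not from the course.

```python
def gradient_descent(xs, ys, alpha=0.1, iterations=2000):
    """Fit h(x) = theta0 + theta1 * x by batch gradient descent."""
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update: both gradients were computed from the old thetas.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1


# Toy data generated from y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

theta0, theta1 = gradient_descent(xs, ys)
print(round(theta0, 4), round(theta1, 4))
```

Computing both gradients before assigning either parameter is what makes the update simultaneous; updating θ0 first and then using the new θ0 to compute the gradient for θ1 would be a different algorithm.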
Matrix: 2-dimensional array.
Vector: An n×1 matrix
Notation:Generally the uppercase letters are used for matrix and the lowercase letters are used for vector.
Addition
Scalar multiplication
Matrix-vector multiplication
Let A be an m×n matrix and x be an n-dimensional vector; then the result y = A×x is an m-dimensional vector.
To get yi, multiply A's ith row with the elements of vector x, and add them up.
Matrix-matrix multiplication
Let A be an m×n matrix and B be an n×o matrix; then the result C = A×B is an m×o matrix.
The ith column of the matrix C is obtained by multiplying A with the ith column of B (for i=1,2,…,o). The calculation thus reduces to matrix-vector multiplications.
Not commutative
Let A and B be matrices. Then in general, A×B≠B×A.
Associative
A×(B×C)=(A×B)×C
Identity Matrix
The identity matrix, which is denoted as I(sometimes with n×n subscript), simply has 1’s on the diagonal (upper left to lower right diagonal) and 0’s elsewhere. For example:
For any matrix A, A×I=I×A=A
Matrix Inverse
If A is an m×m matrix, and if it has an inverse (not every matrix has an inverse), then $A A^{-1} = A^{-1} A = I$.
Matrices that don't have an inverse are singular or degenerate.
Matrix Transpose
Let A be an m×n matrix, and let $B = A^T$. Then B is an n×m matrix, and $B_{ij} = A_{ji}$.
In the first week's course, we learned linear regression with one variable x, where the hypothesis was $h_\theta(x) = \theta_0 + \theta_1 x$. In fact, there can be more than one variable. So here are the new notations:
m: the number of training examples
n: the number of features
$x^{(i)}$: input (features) of the ith training example
$x_j^{(i)}$: value of feature j in the ith training example
The hypothesis becomes:
$$h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$$
(for convenience of notation, we define x0=1).
Using the definition of matrix multiplication, our multivariable hypothesis function can be concisely represented as:
$$h_\theta(x) = \theta^T x$$
We can see that linear regression with one variable is just the special case when n=1.
Similarly, the new gradient descent algorithm is:
repeat until convergence{
$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$ (simultaneously update for j = 0, 1, …, n)
}
The idea is to make sure features are on a similar scale so that the gradient descent can be sped up.
Generally, get every feature into approximately a $-1 \le x_i \le 1$ range (the range is not limited to [-1, 1]).
Often used formula (mean normalization):
Replace $x_i$ with $x_i - \mu_i$ to make features have approximately zero mean.
Formula:
$$x_i := \frac{x_i - \mu_i}{s_i}$$
where $\mu_i$ is the average of all the values for feature i and $s_i$ is the range of values (max - min) or the standard deviation.
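A small sketch of mean normalization for a single feature, using the range (max minus min) as the scale. The feature values and the function name are hypothetical.

```python
def mean_normalize(values):
    """Subtract the mean and divide by the range (max - min)."""
    mu = sum(values) / len(values)
    scale = max(values) - min(values)
    return [(v - mu) / scale for v in values]


# Hypothetical feature: house sizes in square feet.
sizes = [1000.0, 1500.0, 2000.0]
scaled = mean_normalize(sizes)
print(scaled)
```

The same mean and scale must be reused when scaling new inputs at prediction time, otherwise the learned parameters no longer match the features.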
The idea of the learning rate is the same as in the first week. How do we make sure gradient descent is working correctly (that is, debug it)?
Make a plot with the number of iterations on the x-axis, and plot the cost function J(θ) against the number of iterations of gradient descent. If J(θ) ever increases, then you probably need to decrease α.
How to declare convergence?
If J(θ) decreases by less than some small threshold ε (for example $10^{-3}$) in one iteration.
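linear: $h_\theta(x) = \theta_0 + \theta_1 x_1$
quadratic: $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2$
cubic: $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2 + \theta_3 x_1^3$
square root: $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 \sqrt{x_1}$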
One important thing to keep in mind is, if you choose your features this way then feature scaling becomes very important.
Gradient descent gives one way of minimizing J, and normal equation is another way. In the “Normal Equation” method, we will minimize J by explicitly taking its derivatives with respect to the θj ’s, and setting them to zero. This allows us to find the optimum theta without iteration.
Formula:
$$\theta = (X^T X)^{-1} X^T y$$
Here, X is an m×(n+1) matrix (remember that x0 = 1), y is an m-dimensional vector, and θ is an (n+1)-dimensional vector.
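A dependency-free sketch of the normal equation θ = (XᵀX)⁻¹Xᵀy. Instead of forming the inverse explicitly, it solves the linear system (XᵀX)θ = Xᵀy by Gauss-Jordan elimination; that substitution is my choice for the sketch, and the helper names and toy data are hypothetical.

```python
def solve(a, b):
    """Solve a * x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def normal_equation(X, y):
    """Return theta minimizing the squared error, via (X^T X) theta = X^T y."""
    cols = list(zip(*X))
    xtx = [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(a * b for a, b in zip(ci, y)) for ci in cols]
    return solve(xtx, xty)


# Toy data from y = 1 + 2x; the first column of X is the x0 = 1 term.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]

theta = normal_equation(X, y)
print(theta)
```

If XᵀX is singular the pivot becomes zero and the division fails, which is the non-invertible case.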
The advantage of normal equation:
Comparison of gradient descent and normal equation
| Gradient Descent | Normal Equation |
| --- | --- |
| need to choose α | no need to choose α |
| need many iterations | no need to iterate |
| O(kn²) | O(n³) (need to calculate the inverse of $X^T X$) |
| works well when n is large | slow if n is very large |
Sometimes $X^T X$ can be non-invertible. The common causes are redundant (linearly dependent) features, or too many features (for example m ≤ n).
Generally, the commands work in both MATLAB and Octave. (It is suggested to change Octave's command prompt using PS1('>> ').)
Elementary math operations
e.g., 1+2, 3-4, 5*6, 7/8, 2^6
Note: if the result is floating point, the default number of digits after the decimal point differs between Octave (6 digits) and MATLAB (4 digits).
Logical operations
Equality: 1==2, 1~=2 (~= means not equal)
AND, OR, XOR: 1 && 0, 1 || 0, xor(1,0)
The result of a logical operation is 0 (false) or 1 (true).
Variable
Simple form: a = 3, a = 3;
The semicolon suppresses the print output.
The assigned value can be a constant, a string, a boolean expression, etc.
Display variable: a, disp(a), disp(sprintf('2 decimals: %0.2f', a))
Suppose a = 3. Then the output of the command a is a = 3, that of disp(a) is 3, and that of disp(sprintf('2 decimals: %0.2f', a)) is 2 decimals: 3.00.
Format: sprintf uses a C-like syntax to define the output format. format long and format short make all following commands print in long or short format.
Vectors and Matrices
Matrix: A = [1, 2; 3, 4; 5, 6]
We can remember that the semicolon ; starts the next row of the matrix and the comma , (which can be replaced by a space) moves to the next column.
Vector: v = [1 2 3], v = [1; 2; 3]
The former creates a row vector (1×3 matrix), and the latter creates a column vector (3×1 matrix).
Some useful notation
v = START(:INCREMENT):END
Creates a row vector from START to END with each step incremented by INCREMENT. If :INCREMENT is omitted, the increment is 1.
ones(ROW, COLUMN), zeros(ROW, COLUMN)
Create a ROW×COLUMN matrix of all ones/zeros.
rand(ROW, COLUMN)
rand generates random numbers from the standard uniform distribution on (0,1); randn generates random numbers from the standard normal distribution.
eye(ROW)
(Eye is maybe a pun on the word identity.) Creates a ROW by ROW identity matrix.
Binary Classification
The output y can take only two values, 0 and 1. Thus, y∈{0,1}, where 0 represents the negative class and 1 represents the positive class. It's not advisable to use linear regression for the hypothesis, because its output is not confined to [0,1]; the logistic regression hypothesis has a range of (0,1).
Multiclass Classification: One-vs-all
Train a logistic regression classifier $h_\theta^{(i)}(x)$ for each class i to predict the probability that y = i.
On a new input x, to make a prediction, pick the class i that maximizes $h_\theta^{(i)}(x)$.
The hypothesis is $h_\theta(x) = g(\theta^T x)$, where
$$g(z) = \frac{1}{1 + e^{-z}}$$
The function g is called the sigmoid function or logistic function, and its graph is as shown below:
What is the interpretation of the hypothesis output?
$h_\theta(x)$ = estimated probability that y = 1, given x, parameterized by θ. In mathematical notation:
$$h_\theta(x) = P(y = 1 \mid x; \theta)$$
Using probability theory, we also know that
$$P(y = 0 \mid x; \theta) + P(y = 1 \mid x; \theta) = 1$$
Decision Boundary
Suppose we predict "y=1" if $h_\theta(x) \ge 0.5$, and predict "y=0" if $h_\theta(x) < 0.5$.
That is equivalent to predicting "y=1" if $\theta^T x \ge 0$ and "y=0" if $\theta^T x < 0$.
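Cost function in logistic regression is different from that in linear regression.
$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{Cost}\left( h_\theta(x^{(i)}), y^{(i)} \right)$$
that means:
$\mathrm{Cost}(h_\theta(x), y) = -\log(h_\theta(x))$ if y = 1;
$\mathrm{Cost}(h_\theta(x), y) = -\log(1 - h_\theta(x))$ if y = 0.
Vectorized implementation of cost function:
$$h = g(X\theta), \qquad J(\theta) = \frac{1}{m} \left( -y^T \log(h) - (1 - y)^T \log(1 - h) \right)$$
Gradient Descent:
repeat {
$\theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$
}
Vectorized implementation of gradient descent:
$$\theta := \theta - \frac{\alpha}{m} X^T \left( g(X\theta) - y \right)$$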
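A plain-Python sketch of the logistic regression cost and its gradient, with the sigmoid hypothesis g(θᵀx), the cross-entropy cost averaged over the m examples, and gradient component j equal to the average of (h − y)·x_j. The toy data and function names are hypothetical, and list comprehensions stand in for the matrix products of a vectorized implementation.

```python
import math


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def cost_and_gradient(theta, X, y):
    """Return the logistic cost J(theta) and its gradient as a list."""
    m = len(y)
    h = [sigmoid(sum(t * x for t, x in zip(theta, row))) for row in X]
    cost = -sum(
        yi * math.log(hi) + (1 - yi) * math.log(1 - hi) for hi, yi in zip(h, y)
    ) / m
    grad = [
        sum((hi - yi) * row[j] for hi, yi, row in zip(h, y, X)) / m
        for j in range(len(theta))
    ]
    return cost, grad


# Toy data: the first column is the bias term x0 = 1.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]]
y = [0, 0, 1, 1]

cost, grad = cost_and_gradient([0.0, 0.0], X, y)
print(cost, grad)
```

At θ = 0 every prediction is 0.5, so the cost is log 2; a gradient descent step would then be θ := θ − α·grad.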
Advanced optimization
Besides gradient descent, there are the conjugate gradient, BFGS and L-BFGS optimization algorithms. Their advantages are: 1. no need to manually pick α; 2. often faster than gradient descent. But they're more complex than gradient descent.
If we have too many features, the learned hypothesis may fit the training set very well but fail to generalize to new examples. (This is overfitting, also known as high variance.)
There's also the opposite problem, underfitting, which has high bias.
Two examples:
How to address overfitting
Reduce number of features.
 Manually select which features to keep
 Model selection algorithm
Regularization.
 Keep all the features, but reduce magnitude/values of parameters
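Cost function:
$$J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right]$$
The additional term is the regularization term, and λ is called the regularization parameter. Note that λ should be set to a proper value. If λ is extremely large, the parameters are penalized too heavily and the result is underfitting.
Cost function
Gradient Descent:
Repeat {
$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_0^{(i)}$
$\theta_j := \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} \theta_j \right]$ (j = 1, 2, …, n)
}
The second line can also be written as:
$$\theta_j := \theta_j \left( 1 - \alpha \frac{\lambda}{m} \right) - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$
The coefficient $1 - \alpha \frac{\lambda}{m}$ will always be less than 1.
Normal Equation:
$$\theta = \left( X^T X + \lambda L \right)^{-1} X^T y$$
where L is the (n+1)×(n+1) identity matrix with its top-left entry set to 0.
If λ>0, the matrix $X^T X + \lambda L$ is invertible.
Gradient Descent (the same update form, with $h_\theta(x) = g(\theta^T x)$):
Repeat {
$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_0^{(i)}$
$\theta_j := \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} \theta_j \right]$ (j = 1, 2, …, n)
}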
At a very simple level, neurons are basically computational units that take inputs (dendrites) as electrical inputs (called "spikes") that are channeled to outputs (axons). In our model, our dendrites are like the input features x1⋯xn, and the output is the result of our hypothesis function. In this model our x0 input node is sometimes called the "bias unit." It is always equal to 1. In neural networks, we use the same logistic function as in classification, $\frac{1}{1 + e^{-\theta^T x}}$, yet we sometimes call it a sigmoid (logistic) activation function. In this situation, our "theta" parameters are sometimes called "weights".
There are several layers in a neural network. The first layer is called the input layer, the last layer is called the output layer, and the layers in between are called hidden layers.
Notations:
$a_i^{(j)}$ = "activation" of unit i in layer j
$\Theta^{(j)}$ = matrix of weights controlling function mapping from layer j to layer j+1
In a simple example like this:
we have some equations:
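If the network has $s_j$ units in layer j and $s_{j+1}$ units in layer j+1, then $\Theta^{(j)}$ will be of dimension $s_{j+1} \times (s_j + 1)$.
Vectorized Implementation:
$z^{(j+1)} = \Theta^{(j)} a^{(j)}$, $a^{(j+1)} = g(z^{(j+1)})$ (regard the input layer x as $a^{(1)}$)
Add $a_0^{(j+1)} = 1$ (add the bias unit)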
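L: total no. of layers in network
$s_l$: no. of units (not counting bias unit) in layer l
K: no. of units in the output layer ($s_L = K$)
(K=1: Binary classification, K≥3: Multiclass classification)
The cost function is:
$$J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \left[ y_k^{(i)} \log\left( (h_\Theta(x^{(i)}))_k \right) + (1 - y_k^{(i)}) \log\left( 1 - (h_\Theta(x^{(i)}))_k \right) \right] + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \left( \Theta_{j,i}^{(l)} \right)^2$$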
Compared with the cost function of logistic regression, we have added a few nested summations to account for our multiple output nodes. In the first part of the equation, before the square brackets, we have an additional nested summation that loops through the number of output nodes.(k = 1:K)
In the regularization part, after the square brackets, we must account for multiple theta matrices. The number of columns in our current theta matrix is equal to the number of nodes in our current layer (including the bias unit). The number of rows in our current theta matrix is equal to the number of nodes in the next layer (excluding the bias unit).
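To minimize $J(\Theta)$, we need code to compute $J(\Theta)$ and $\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta)$. The formula for $J(\Theta)$ is described above, so let's first recall forward propagation.
In a 4-layer neural network, we have the following forward propagation:
$a^{(1)} = x$
$z^{(2)} = \Theta^{(1)} a^{(1)}$, $a^{(2)} = g(z^{(2)})$ (add $a_0^{(2)}$)
$z^{(3)} = \Theta^{(2)} a^{(2)}$, $a^{(3)} = g(z^{(3)})$ (add $a_0^{(3)}$)
$z^{(4)} = \Theta^{(3)} a^{(3)}$, $a^{(4)} = h_\Theta(x) = g(z^{(4)})$
Then what about backward propagation (intuition: $\delta_j^{(l)}$ = "error" of node j in layer l)?
$\delta^{(4)} = a^{(4)} - y$
$\delta^{(3)} = (\Theta^{(3)})^T \delta^{(4)} \;.\!*\; g'(z^{(3)})$
$\delta^{(2)} = (\Theta^{(2)})^T \delta^{(3)} \;.\!*\; g'(z^{(2)})$
($g'(z^{(l)}) = a^{(l)} \;.\!*\; (1 - a^{(l)})$, it's just the derivative of the sigmoid)
$\delta_j^{(l)}$ = "error" of cost for $a_j^{(l)}$.
Formally, $\delta_j^{(l)} = \frac{\partial}{\partial z_j^{(l)}} \mathrm{cost}(i)$.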
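Backpropagation algorithm:
Training set $\{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}$
Set $\Delta_{ij}^{(l)} = 0$ for all l, i, j. (initializing accumulators)
For i = 1 to m
Set $a^{(1)} = x^{(i)}$
Perform forward propagation to compute $a^{(l)}$ for l=2,3,…,L
Using $y^{(i)}$, compute $\delta^{(L)} = a^{(L)} - y^{(i)}$
Compute $\delta^{(L-1)}, \delta^{(L-2)}, \ldots, \delta^{(2)}$
$\Delta_{ij}^{(l)} := \Delta_{ij}^{(l)} + a_j^{(l)} \delta_i^{(l+1)}$
$D_{ij}^{(l)} := \frac{1}{m} \left( \Delta_{ij}^{(l)} + \lambda \Theta_{ij}^{(l)} \right)$ if j≠0
$D_{ij}^{(l)} := \frac{1}{m} \Delta_{ij}^{(l)}$ if j=0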
Unrolling parameters
Idea: Unroll matrices into vectors. In order to use optimizing functions such as “fminunc()”, we will want to “unroll” all the elements and put them into one long vector.
Learning Algorithm
Have initial parameters Theta1,Theta2,Theta3
Unroll to get initialTheta to pass to fminunc(@costFunction, initialTheta, options).
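Gradient checking will assure that our backpropagation works as intended. We can approximate the derivative of our cost function with:
$$\frac{\partial}{\partial \Theta} J(\Theta) \approx \frac{J(\Theta + \epsilon) - J(\Theta - \epsilon)}{2\epsilon}$$
With multiple theta matrices, we can approximate the derivative with respect to Θj as follows:
$$\frac{\partial}{\partial \Theta_j} J(\Theta) \approx \frac{J(\Theta_1, \ldots, \Theta_j + \epsilon, \ldots, \Theta_n) - J(\Theta_1, \ldots, \Theta_j - \epsilon, \ldots, \Theta_n)}{2\epsilon}$$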
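A minimal sketch of gradient checking: each parameter is perturbed by ±ε in turn and the two-sided difference of the cost is compared with an analytic gradient. The cost function `J` below is a hypothetical stand-in with a known gradient, not a neural network cost, and ε = 1e-4 is an assumed value.

```python
def numerical_gradient(cost, theta, epsilon=1e-4):
    """Two-sided finite-difference approximation of the gradient of cost."""
    grad = []
    for j in range(len(theta)):
        plus = theta[:]
        minus = theta[:]
        plus[j] += epsilon
        minus[j] -= epsilon
        grad.append((cost(plus) - cost(minus)) / (2 * epsilon))
    return grad


def J(theta):
    """Hypothetical cost: J = theta0^2 + 3 * theta0 * theta1."""
    return theta[0] ** 2 + 3 * theta[0] * theta[1]


def analytic_gradient(theta):
    """Exact gradient of J, playing the role of the backpropagation output."""
    return [2 * theta[0] + 3 * theta[1], 3 * theta[0]]


theta = [1.0, 2.0]
grad_approx = numerical_gradient(J, theta)
grad_exact = analytic_gradient(theta)
print(grad_approx, grad_exact)
```

The check costs two evaluations of the cost per parameter, so it is normally run once to validate the gradient code and then switched off for training.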
Implementation note:
gradApprox
Random initialization
For gradient descent and advanced optimization methods, we need initial values for θ. However, it's not advisable to set the initial theta to all zeros: when we backpropagate, all nodes will update to the same value repeatedly. So instead of using zeros, use rand to initialize theta.
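Divide the data set into three parts: training set, cross validation set and test set (sometimes two parts: training set and test set).
Training error:
$$J_{train}(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$
Cross Validation error:
$$J_{cv}(\theta) = \frac{1}{2m_{cv}} \sum_{i=1}^{m_{cv}} \left( h_\theta(x_{cv}^{(i)}) - y_{cv}^{(i)} \right)^2$$
Test error:
$$J_{test}(\theta) = \frac{1}{2m_{test}} \sum_{i=1}^{m_{test}} \left( h_\theta(x_{test}^{(i)}) - y_{test}^{(i)} \right)^2$$
Model Selection: e.g. for d = 1:10, minimizing $J(\theta)$ for each polynomial degree d gives parameters $\theta^{(d)}$. Choose the d whose $J_{cv}(\theta^{(d)})$ is lowest; then, using that θ, estimate the generalization error on the test set with $J_{test}(\theta^{(d)})$.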
High bias: underfit
High variance: overfit
As the figure above shows, as the degree of the polynomial d increases, the training error keeps decreasing. However, if the degree is too high, the cross validation error becomes high again (overfitting).
Bias:
$J_{train}(\theta)$ will be high.
$J_{cv}(\theta) \approx J_{train}(\theta)$
Variance:
$J_{train}(\theta)$ will be low.
$J_{cv}(\theta) \gg J_{train}(\theta)$
Now let’s take the parameter λ that we use in regularization into consideration.
If λ is too large, it may lead to high bias. If λ is too small, it may lead to high variance.(e.g. λ=0)
Fixing high bias:
Fixing high variance:
Take cancer prediction as an example. Since the incidence of cancer is quite low, a classifier can simply always predict y=0. Its error rate is low, yet it is not a "prediction" at all. Thus, we can't judge a classifier's performance by error rate alone, so the concepts of precision and recall are introduced.
| | Actual Class 1 | Actual Class 0 |
| --- | --- | --- |
| Predicted Class 1 | True positive | False positive |
| Predicted Class 0 | False negative | True negative |
y=1 in presence of the rare class that we want to detect. (In the cancer prediction example, isCancer should be 1.)
Precision:
$$\text{Precision} = \frac{\text{True positives}}{\text{True positives} + \text{False positives}}$$
Calculated along the row "Predicted Class 1".
Recall:
$$\text{Recall} = \frac{\text{True positives}}{\text{True positives} + \text{False negatives}}$$
Calculated along the column "Actual Class 1".
Suppose we want to predict y=1 (cancer) only if very confident. Then turn the threshold (originally 0.5) up to get higher precision and lower recall.
Suppose we want to avoid missing too many cases of cancer (avoid false negatives). Then turn the threshold down to get higher recall and lower precision.
F1 Score (F score):
$$F_1 = 2 \frac{PR}{P + R}$$
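A short sketch computing precision, recall and the F1 score from label lists, where precision is TP/(TP+FP), recall is TP/(TP+FN) and F1 = 2PR/(P+R). The labels are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen for this sketch.

```python
def precision_recall_f1(y_true, y_pred):
    """Return (precision, recall, F1) for binary labels with 1 as the rare class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical labels: 3 positives among 10 examples.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
always_zero = [0] * 10

scores = precision_recall_f1(y_true, y_pred)
trivial = precision_recall_f1(y_true, always_zero)
print(scores, trivial)
```

The always-zero classifier has 70% accuracy on this data but an F1 score of 0, which is why error rate alone is misleading for skewed classes.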
Use a learning algorithm with many parameters (e.g. logistic regression/linear regression with many features; neural network with many hidden units): low bias
Use a very large training set (unlikely to overfit): low variance
Alternative view of logistic regression
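The cost of a specific example (x,y) is
$$-\left( y \log h_\theta(x) + (1 - y) \log(1 - h_\theta(x)) \right) = -y \log \frac{1}{1 + e^{-\theta^T x}} - (1 - y) \log\left( 1 - \frac{1}{1 + e^{-\theta^T x}} \right)$$
If y=1, then the right part of the function is ignored (since 1-y=0), and what we get is:
$$-\log \frac{1}{1 + e^{-\theta^T x}}$$
Similarly, if y=0, then the left part of the function is ignored, and what we get is:
$$-\log\left( 1 - \frac{1}{1 + e^{-\theta^T x}} \right)$$
For the support vector machine, we have $\mathrm{cost}_1(z)$ and $\mathrm{cost}_0(z)$ functions (the subscript corresponds to the value of y) that are similar to the original curves. The difference is that the curves are made up of straight line segments.
Cost1:
Cost0:
And the cost function of SVM:
$$\min_\theta \; C \sum_{i=1}^{m} \left[ y^{(i)} \mathrm{cost}_1(\theta^T x^{(i)}) + (1 - y^{(i)}) \mathrm{cost}_0(\theta^T x^{(i)}) \right] + \frac{1}{2} \sum_{j=1}^{n} \theta_j^2$$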
Difference between SVM and logistic regression:
SVM drops the factor 1/m.
The form of logistic regression is A+λB, while that of SVM is CA+B (if C=1/λ, these two optimization objectives give the same optimal value for theta).
Large Margin Classifier
SVM wants a bit more than the original logistic regression.
If y=1, we want $\theta^T x \ge 1$ (not just ≥0)
If y=0, we want $\theta^T x \le -1$ (not just <0)
Consider the case where C is very large: the optimization will try to make the first term of the cost function (the one multiplied by C) zero. This leads to the large margin classifier concept:
Compared with the magenta and green lines, the black line has some larger minimum distance from any of the training examples. This distance is called the margin of the SVM and this gives the SVM a certain robustness, because it tries to separate the data with as large a margin as possible.
The mathematics behind large margin classification is the dot product of the vectors.
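For SVM, there's a different (better) choice of the features:
given x, compute new features depending on proximity to landmarks $l^{(1)}, l^{(2)}, l^{(3)}, \ldots$
For example,
$$f_1 = \mathrm{similarity}(x, l^{(1)}) = \exp\left( -\frac{\lVert x - l^{(1)} \rVert^2}{2\sigma^2} \right)$$
Here the similarity function is called the kernel function, and this particular kernel, which uses exp, is the Gaussian kernel.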
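A small sketch of the Gaussian kernel as a similarity feature, exp(−‖x − l‖² / (2σ²)). The points, the landmark and the value of σ are hypothetical.

```python
import math


def gaussian_kernel(x, landmark, sigma):
    """Similarity between x and a landmark: exp(-||x - l||^2 / (2 * sigma^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, landmark))
    return math.exp(-squared_distance / (2 * sigma ** 2))


landmark = [3.0, 4.0]
f_near = gaussian_kernel([3.0, 4.0], landmark, sigma=5.0)   # x on the landmark
f_far = gaussian_kernel([0.0, 0.0], landmark, sigma=5.0)    # x at distance 5
print(f_near, f_far)
```

The feature is 1 when x sits on the landmark and decays toward 0 as x moves away; a larger σ makes the decay slower.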
How to choose landmarks?
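The landmarks are just the input examples.
Given $(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})$,
choose $l^{(1)} = x^{(1)}, l^{(2)} = x^{(2)}, \ldots, l^{(m)} = x^{(m)}$.
And calculate the features using kernels.
Hypothesis: Given x, compute features $f \in \mathbb{R}^{m+1}$, predict "y=1" if $\theta^T f \ge 0$.
Training:
$$\min_\theta \; C \sum_{i=1}^{m} \left[ y^{(i)} \mathrm{cost}_1(\theta^T f^{(i)}) + (1 - y^{(i)}) \mathrm{cost}_0(\theta^T f^{(i)}) \right] + \frac{1}{2} \sum_{j=1}^{n} \theta_j^2$$
Here n is equal to m.
And in some SVM implementations, the term $\sum_j \theta_j^2 = \theta^T \theta$ is computed as $\theta^T M \theta$ instead, where M is a matrix that depends on the kernel.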
C
Large C: Lower bias, higher variance.
Small C: Higher bias, lower variance.
(regard C as 1/λ).
σ²
Large σ²: Features f vary more smoothly. Higher bias, lower variance.
Small σ²: Features f vary less smoothly. Lower bias, higher variance.
SVM package: liblinear, libsvm, etc.
Choice of kernels:
Linear kernel (no kernel)
Predict “y=1” if $\theta^Tx\ge0$
Gaussian kernel
(Remember to perform feature scaling before using Gaussian kernel.)
Polynomial kernel
More esoteric
String kernel, chi-square kernel, histogram intersection kernel, …
Logistic regression or SVM?
Here n is the number of features and m is the number of training examples.
If n is large (relative to m): use logistic regression, or SVM without a kernel.
If n is small, m is intermediate: use SVM with Gaussian kernel.
If n is small, m is large: add more features, then use logistic regression, or SVM without a kernel.
A neural network is likely to work well for most of these settings, but may be slower to train.
The difference between unsupervised learning and supervised learning:
In supervised learning, we are given a labeled training set and fit a hypothesis to it. In contrast, in unsupervised learning we’re given data that does not have any labels associated with it.
Randomly initialize $K$ cluster centroids $\mu_1,\mu_2,\dots,\mu_K\in\mathbb{R}^n$
Repeat {
for i=1 to m
$c^{(i)}$ := index (from 1 to K) of cluster centroid closest to $x^{(i)}$
for k=1 to K
$\mu_k$ := average (mean) of points assigned to cluster k
}
The first for loop is the cluster assignment step, and the second for loop is the move centroid step.
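The two steps can be sketched in plain Python as below. This is a minimal illustration rather than a tuned implementation: cluster indices are 0-based, the initial centroids are K randomly chosen training examples, the number of iterations is fixed, and an empty cluster simply keeps its previous centroid (all of these are assumptions).

```python
import random

def squared_distance(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def k_means(points, K, iterations=10, seed=0):
    rng = random.Random(seed)
    # Initialize the centroids with K distinct training examples.
    centroids = [list(p) for p in rng.sample(points, K)]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Cluster assignment step: c_i := index of the closest centroid.
        for i, x in enumerate(points):
            assignments[i] = min(
                range(K), key=lambda k: squared_distance(x, centroids[k]))
        # Move centroid step: mu_k := mean of the points assigned to cluster k.
        for k in range(K):
            members = [x for x, c in zip(points, assignments) if c == k]
            if members:  # an empty cluster keeps its old centroid
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments
```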
Optimization objective
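$c^{(i)}$: index of cluster (1,2,…,K) to which example $x^{(i)}$ is currently assigned
$\mu_k$: cluster centroid k ($\mu_k\in\mathbb{R}^n$)
$\mu_{c^{(i)}}$: cluster centroid of the cluster to which example $x^{(i)}$ has been assigned
Try to minimize $J(c^{(1)},\dots,c^{(m)},\mu_1,\dots,\mu_K)=\frac{1}{m}\sum_{i=1}^m\|x^{(i)}-\mu_{c^{(i)}}\|^2$ (also called the distortion function).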
Random initialization
Randomly pick K training examples, and set those examples as cluster centroids.
For better performance, run K-means multiple times (e.g., 100 times) with different random initializations and pick the clustering that gave the lowest cost J.
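As a small sketch of this selection step, the distortion J can be computed for each run and the run with the lowest value kept. The `(centroids, assignments)` pair format for a run is an assumption made for this example, and cluster indices are 0-based.

```python
def distortion(points, centroids, assignments):
    """J = (1/m) * sum_i ||x_i - mu_{c_i}||^2."""
    total = 0.0
    for x, c in zip(points, assignments):
        total += sum((a - b) ** 2 for a, b in zip(x, centroids[c]))
    return total / len(points)

def pick_best(points, runs):
    """runs: list of (centroids, assignments) pairs from separate K-means runs."""
    return min(runs, key=lambda run: distortion(points, run[0], run[1]))
```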
Choosing the number of clusters
Manually.
Sometimes it’s helpful to use the elbow method (plot the cost J against K and pick the K at the “elbow”), but the curve often has no clear elbow, so it is not always reliable.
Sometimes, you’re running K-means to get clusters to use for some later/downstream purpose. In that case, evaluate K-means based on a metric for how well it performs for that later purpose.
Motivation of dimensionality reduction:
Data Compression
Reduce memory/disk needed to store data
Speed up learning algorithm
Data Visualization
k=2 or k=3, so we can visualize the data and get an intuitive view
Reduce from n dimensions to k dimensions: find k vectors $u^{(1)},u^{(2)},\dots,u^{(k)}$ onto which to project the data, so as to minimize the projection error.
Difference between PCA and linear regression
PCA looks like linear regression(reduce from 2D to 1D), but they’re different.
Linear regression has the input x and a corresponding label y. It tries to predict the output y, and the ‘error’ is measured vertically (along the y axis).
PCA is unsupervised learning and has no label y. It reduces the dimension of the features, and the ‘error’ is the vector difference between each point and its projection (the shortest, orthogonal distance).
PCA Algorithm
Before applying the PCA algorithm, remember to preprocess the data.
Given training set $x^{(1)},x^{(2)},\dots,x^{(m)}$, use feature scaling/mean normalization to preprocess: compute $\mu_j=\frac{1}{m}\sum_{i=1}^mx_j^{(i)}$,
then replace each $x_j^{(i)}$ with $x_j^{(i)}-\mu_j$.
If different features are on different scales (e.g., x1=size of house, x2=number of bedrooms), scale features to have a comparable range of values.
After mean normalization (ensuring every feature has zero mean) and optionally feature scaling:
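% X is m-by-n: one mean-normalized example per row
Sigma = (1 / m) * (X' * X);   % n-by-n covariance matrix
[U, S, V] = svd(Sigma);       % columns of U are the principal directions
Ureduce = U(:, 1:k);          % keep the first k columns (n-by-k)
z = Ureduce' * x;             % project one example x (n-by-1) onto k dimensions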
Reconstruction from compressed representation:
From the formula $z=U_{reduce}^Tx$, we can get the reconstruction of $x$ using $x_{approx}=U_{reduce}z$.
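A small pure-Python sketch of projection and reconstruction. It assumes the principal directions are already known (for example from `svd`) and are passed as a list of k orthonormal column vectors, which is a representation chosen for this example only.

```python
def project(U_reduce, x):
    """z = U_reduce^T x, where U_reduce is a list of k column vectors of length n."""
    return [sum(u_j * x_j for u_j, x_j in zip(u, x)) for u in U_reduce]

def reconstruct(U_reduce, z):
    """x_approx = U_reduce z, a vector of length n."""
    n = len(U_reduce[0])
    return [sum(z_i * u[j] for z_i, u in zip(z, U_reduce)) for j in range(n)]
```

A point that already lies along the kept directions is reconstructed exactly (up to rounding); a point orthogonal to them is reconstructed as the zero vector, which is where the projection error comes from.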
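Average squared projection error: $\frac{1}{m}\sum_{i=1}^m\|x^{(i)}-x_{approx}^{(i)}\|^2$
Total variation in the data: $\frac{1}{m}\sum_{i=1}^m\|x^{(i)}\|^2$
Typically, choose k to be the smallest value so that
$$\frac{\frac{1}{m}\sum_{i=1}^m\|x^{(i)}-x_{approx}^{(i)}\|^2}{\frac{1}{m}\sum_{i=1}^m\|x^{(i)}\|^2}\le0.01$$
Here 0.01 means that 99% of variance is retained.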
We don’t have to rerun PCA for every k from 1 to n to find the smallest value. The function svd has an output S (a diagonal matrix) which is useful.
[U,S,V] = svd(Sigma)
Just pick the smallest value of k for which
$$\frac{\sum_{i=1}^kS_{ii}}{\sum_{i=1}^nS_{ii}}\ge0.99$$
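The selection rule could look like this in Python, assuming the diagonal entries of S are available as a list sorted in decreasing order (the function name and the `retain` parameter are illustrative).

```python
def choose_k(singular_values, retain=0.99):
    """Smallest k whose leading k diagonal entries of S retain the given
    fraction of the total variance."""
    total = sum(singular_values)
    running = 0.0
    for k, s in enumerate(singular_values, start=1):
        running += s
        if running / total >= retain:
            return k
    return len(singular_values)
```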
To speed up supervised learning, note that the mapping $x^{(i)}\to z^{(i)}$ should be defined by running PCA only on the training set. The same mapping can then be applied to the examples in the cross-validation and test sets.
It’s not good to address overfitting using PCA. Use regularization instead.
Before implementing PCA, first try running whatever you want to do with the original/raw data $x^{(i)}$. Only if that doesn’t do what you want, then implement PCA and consider using $z^{(i)}$.
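1. Choose features $x_j$ that you think might be indicative of anomalous examples.
2. Fit parameters $\mu_1,\dots,\mu_n,\sigma_1^2,\dots,\sigma_n^2$:
$$\mu_j=\frac{1}{m}\sum_{i=1}^mx_j^{(i)},\qquad\sigma_j^2=\frac{1}{m}\sum_{i=1}^m\left(x_j^{(i)}-\mu_j\right)^2$$
3. Given a new example $x$, compute $p(x)$:
$$p(x)=\prod_{j=1}^np(x_j;\mu_j,\sigma_j^2)=\prod_{j=1}^n\frac{1}{\sqrt{2\pi}\,\sigma_j}\exp\left(-\frac{(x_j-\mu_j)^2}{2\sigma_j^2}\right)$$
Flag an anomaly if $p(x)<\varepsilon$.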
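The three steps can be sketched in Python as follows. The function names are illustrative, the threshold `epsilon` is assumed to be chosen elsewhere (for example on a cross-validation set), and every feature is assumed to have nonzero variance.

```python
import math

def fit_gaussians(X):
    """Per-feature mean and variance (dividing by m, as in the formulas above)."""
    m = len(X)
    mus, variances = [], []
    for column in zip(*X):
        mu = sum(column) / m
        mus.append(mu)
        variances.append(sum((v - mu) ** 2 for v in column) / m)
    return mus, variances

def density(x, mus, variances):
    """p(x) = product over features of the univariate Gaussian densities."""
    p = 1.0
    for x_j, mu, var in zip(x, mus, variances):
        p *= math.exp(-(x_j - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)
    return p

def is_anomaly(x, mus, variances, epsilon):
    return density(x, mus, variances) < epsilon
```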
Developing and evaluating
Assume we have some labeled data, of anomalous and non-anomalous examples (y=0 if normal, y=1 if anomalous).
Suppose we have 10000 good (normal) engines (it’s okay if a few anomalous ones are mixed in) and 20 flawed (anomalous) engines.
Then divide the examples into: Training set: 6000 good engines; CV: 2000 good engines and 10 anomalous; Test: 2000 good engines and 10 anomalous.
Possible evaluation metrics:
True positive, false positive, false negative, true negative.
Precision/Recall
F1-score.
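These metrics can be computed from the labels and predictions as sketched below. Returning 0.0 when a denominator is zero is a convention chosen for this example, with y=1 treated as the positive (anomalous) class.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1-score, with y = 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

Because anomalies are rare, plain accuracy is misleading here: always predicting y=0 is almost always correct, but it has zero recall.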
Comparison
Anomaly detection:
Supervised learning:
Examples
| Anomaly detection | Supervised learning |
| --- | --- |
| Fraud detection | Email spam classification |
| Manufacturing (e.g., aircraft engines) | Weather prediction |
| Monitoring machines in a data center | Cancer classification |
| … | … |
It is denoted as $x\sim\mathcal{N}(\mu,\sigma^2)$.
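Tips: if features are non-Gaussian, it’s advisable to transform the original features, e.g., with $\log(x)$, a power such as $x^{1/2}$, a polynomial, …
Original model: $p(x)=p(x_1;\mu_1,\sigma_1^2)\times\dots\times p(x_n;\mu_n,\sigma_n^2)$ corresponds to a multivariate Gaussian where $\Sigma=\text{diag}(\sigma_1^2,\dots,\sigma_n^2)$ (which is axis-aligned).
The off-diagonal entries of $\Sigma$ describe the correlations between the axes. Here are some examples.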
Comparison
Original model:
Manually create features to capture anomalies where x1, x2 take unusual combinations of values.
Computationally cheaper.
Ok even if m is small.
Multivariate Gaussian:
Automatically captures correlations between features.
Computationally more expensive.
Must have m>n, or else $\Sigma$ is non-invertible.
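To make the role of the off-diagonal entries concrete, here is a minimal sketch of the multivariate Gaussian density restricted to two features, so that the inverse and determinant of $\Sigma$ can be written by hand. The function name and the 2-by-2 restriction are assumptions made for this example.

```python
import math

def multivariate_gaussian_2d(x, mu, sigma):
    """Density of a 2-D Gaussian with mean mu and 2x2 covariance matrix sigma."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    if det <= 0:
        raise ValueError("Sigma must be invertible with a positive determinant")
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx = [x[0] - mu[0], x[1] - mu[1]]
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu)
    quad = sum(dx[i] * inv[i][j] * dx[j] for i in range(2) for j in range(2))
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))
```

With a positive correlation of 0.8 between the two features, the point (1, 1) gets a much higher density than (1, -1), a distinction the original axis-aligned model cannot make without a manually created feature.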
Well, I’ve written an article about recommender systems (about item-based collaborative filtering :D).
Notation:
$r(i,j)=1$ if user j has rated movie i (0 otherwise)
$y^{(i,j)}$ = rating by user j on movie i (if defined)
$\theta^{(j)}$ = parameter vector for user j
$x^{(i)}$ = feature vector for movie i
$(\theta^{(j)})^Tx^{(i)}$: for user j, movie i, predicted rating
$m^{(j)}$ = no. of movies rated by user j
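To learn $\theta^{(j)}$ (parameter for user j):
$$\min_{\theta^{(j)}}\frac{1}{2}\sum_{i:r(i,j)=1}\left((\theta^{(j)})^Tx^{(i)}-y^{(i,j)}\right)^2+\frac{\lambda}{2}\sum_{k=1}^n\left(\theta_k^{(j)}\right)^2$$
To learn $\theta^{(1)},\theta^{(2)},\dots,\theta^{(n_u)}$ (where $n_u$ is the number of users):
$$\min_{\theta^{(1)},\dots,\theta^{(n_u)}}\frac{1}{2}\sum_{j=1}^{n_u}\sum_{i:r(i,j)=1}\left((\theta^{(j)})^Tx^{(i)}-y^{(i,j)}\right)^2+\frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^n\left(\theta_k^{(j)}\right)^2$$
Gradient descent update:
$$\theta_k^{(j)}:=\theta_k^{(j)}-\alpha\sum_{i:r(i,j)=1}\left((\theta^{(j)})^Tx^{(i)}-y^{(i,j)}\right)x_k^{(i)}\quad(k=0)$$
$$\theta_k^{(j)}:=\theta_k^{(j)}-\alpha\left(\sum_{i:r(i,j)=1}\left((\theta^{(j)})^Tx^{(i)}-y^{(i,j)}\right)x_k^{(i)}+\lambda\theta_k^{(j)}\right)\quad(k\ne0)$$
Given $x^{(1)},\dots,x^{(n_m)}$ (and the ratings), we can learn $\theta^{(1)},\dots,\theta^{(n_u)}$.
Given $\theta^{(1)},\dots,\theta^{(n_u)}$, we can learn $x^{(1)},\dots,x^{(n_m)}$ (where $n_m$ is the number of movies).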
Collaborative filtering optimization objective
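Given $x^{(1)},\dots,x^{(n_m)}$, estimate $\theta^{(1)},\dots,\theta^{(n_u)}$:
$$\min_{\theta^{(1)},\dots,\theta^{(n_u)}}\frac{1}{2}\sum_{j=1}^{n_u}\sum_{i:r(i,j)=1}\left((\theta^{(j)})^Tx^{(i)}-y^{(i,j)}\right)^2+\frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^n\left(\theta_k^{(j)}\right)^2$$
Given $\theta^{(1)},\dots,\theta^{(n_u)}$, estimate $x^{(1)},\dots,x^{(n_m)}$:
$$\min_{x^{(1)},\dots,x^{(n_m)}}\frac{1}{2}\sum_{i=1}^{n_m}\sum_{j:r(i,j)=1}\left((\theta^{(j)})^Tx^{(i)}-y^{(i,j)}\right)^2+\frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^n\left(x_k^{(i)}\right)^2$$
Minimizing over $x^{(1)},\dots,x^{(n_m)}$ and $\theta^{(1)},\dots,\theta^{(n_u)}$ simultaneously:
$$J(x^{(1)},\dots,x^{(n_m)},\theta^{(1)},\dots,\theta^{(n_u)})=\frac{1}{2}\sum_{(i,j):r(i,j)=1}\left((\theta^{(j)})^Tx^{(i)}-y^{(i,j)}\right)^2+\frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^n\left(x_k^{(i)}\right)^2+\frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^n\left(\theta_k^{(j)}\right)^2$$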
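Collaborative filtering algorithm
1. Initialize $x^{(1)},\dots,x^{(n_m)},\theta^{(1)},\dots,\theta^{(n_u)}$ to small random values.
2. Minimize $J(x^{(1)},\dots,x^{(n_m)},\theta^{(1)},\dots,\theta^{(n_u)})$ using gradient descent (or an advanced optimization algorithm). E.g. for every $j=1,\dots,n_u$, $i=1,\dots,n_m$:
$$x_k^{(i)}:=x_k^{(i)}-\alpha\left(\sum_{j:r(i,j)=1}\left((\theta^{(j)})^Tx^{(i)}-y^{(i,j)}\right)\theta_k^{(j)}+\lambda x_k^{(i)}\right)$$
$$\theta_k^{(j)}:=\theta_k^{(j)}-\alpha\left(\sum_{i:r(i,j)=1}\left((\theta^{(j)})^Tx^{(i)}-y^{(i,j)}\right)x_k^{(i)}+\lambda\theta_k^{(j)}\right)$$
3. For a user with parameters θ and a movie with (learned) features x, predict a star rating of $\theta^Tx$.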
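A compact pure-Python sketch of the three steps, using plain gradient descent on both the movie features and the user parameters at once. The rating matrix layout (`Y[i][j]` for movie i and user j, with `R[i][j]` marking rated entries), the hyperparameter defaults and the helper names are assumptions chosen for this example.

```python
import random

def predict(theta_j, x_i):
    """Predicted rating theta_j^T x_i."""
    return sum(t * v for t, v in zip(theta_j, x_i))

def squared_error(Y, R, X, Theta):
    """Sum of squared errors over the rated (i, j) pairs only."""
    return sum((predict(Theta[j], X[i]) - Y[i][j]) ** 2
               for i in range(len(Y)) for j in range(len(Y[0])) if R[i][j] == 1)

def collaborative_filtering(Y, R, n_features, alpha=0.01, lam=0.1,
                            iterations=2000, seed=0):
    rng = random.Random(seed)
    n_m, n_u = len(Y), len(Y[0])
    # Step 1: small random initial values for movie features and user parameters.
    X = [[rng.uniform(-0.1, 0.1) for _ in range(n_features)] for _ in range(n_m)]
    Theta = [[rng.uniform(-0.1, 0.1) for _ in range(n_features)] for _ in range(n_u)]
    # Step 2: simultaneous gradient descent on X and Theta.
    for _ in range(iterations):
        grad_X = [[lam * v for v in row] for row in X]          # regularization part
        grad_Theta = [[lam * v for v in row] for row in Theta]
        for i in range(n_m):
            for j in range(n_u):
                if R[i][j] == 1:
                    err = predict(Theta[j], X[i]) - Y[i][j]
                    for k in range(n_features):
                        grad_X[i][k] += err * Theta[j][k]
                        grad_Theta[j][k] += err * X[i][k]
        X = [[v - alpha * g for v, g in zip(row, g_row)]
             for row, g_row in zip(X, grad_X)]
        Theta = [[v - alpha * g for v, g in zip(row, g_row)]
                 for row, g_row in zip(Theta, grad_Theta)]
    return X, Theta
```

Step 3 is then just `predict(Theta[j], X[i])` for any user j and movie i, including the pairs that were never rated.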
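Algorithm (stochastic gradient descent):
1. Randomly shuffle the training examples.
2. Repeat { for i=1 to m: $\theta_j:=\theta_j-\alpha\left(h_\theta(x^{(i)})-y^{(i)}\right)x_j^{(i)}$ (for every j) }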
(Learning rate α is typically held constant. We can slowly decrease α over time if we want θ to converge.)
Checking for convergence
Every 1000 iterations (say), plot $cost(\theta,(x^{(i)},y^{(i)}))$ averaged over the last 1000 examples processed by the algorithm.
Batch Gradient Descent: use all m examples in each iteration
Stochastic Gradient Descent: use 1 example in each iteration
Mini-batch Gradient Descent: use b examples in each iteration
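The three variants differ only in how many examples feed each update, which the sketch below makes explicit for linear regression with squared error. The model choice, the function name and the default hyperparameters are assumptions for illustration; `batch_size=len(X)` gives batch gradient descent and `batch_size=1` gives stochastic gradient descent.

```python
import random

def minibatch_gradient_descent(X, y, batch_size, alpha=0.05, epochs=1000, seed=0):
    """Linear regression fitted with updates computed on batch_size examples."""
    rng = random.Random(seed)
    n = len(X[0])
    theta = [0.0] * n
    indices = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the examples in a new random order
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            grad = [0.0] * n
            for i in batch:
                err = sum(t * v for t, v in zip(theta, X[i])) - y[i]
                for j in range(n):
                    grad[j] += err * X[i][j]
            # Average the gradient over the batch, then take one step.
            theta = [t - alpha * g / len(batch) for t, g in zip(theta, grad)]
    return theta
```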
Divide the total computation into several parts. Let different computers/cores each calculate one part, and then let a central computer combine the partial results into the final result.
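A minimal sketch of this idea for the gradient of linear regression: each worker sums the gradient over its own slice of the data, and the partial sums are then added together. A thread pool on one machine stands in for separate computers here, so this shows the structure of the computation rather than a real speed-up; the function names are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_gradient(theta, X_part, y_part):
    """Unnormalized squared-error gradient summed over one slice of the data."""
    grad = [0.0] * len(theta)
    for x_i, y_i in zip(X_part, y_part):
        err = sum(t * v for t, v in zip(theta, x_i)) - y_i
        for j, v in enumerate(x_i):
            grad[j] += err * v
    return grad

def map_reduce_gradient(theta, X, y, n_workers=4):
    m = len(X)
    chunk = (m + n_workers - 1) // n_workers
    parts = [(X[s:s + chunk], y[s:s + chunk]) for s in range(0, m, chunk)]
    # "Map": every worker handles one slice of the training set.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(lambda part: partial_gradient(theta, *part), parts))
    # "Reduce": the central step adds the partial sums and divides by m.
    return [sum(p[j] for p in partials) / m for j in range(len(theta))]
```

This works because the batch gradient is a sum over training examples, so it can be split into independent partial sums.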