Composition in Distributional Models of Semantics

Semantic representation

Semantic Networks

Semantic networks represent concepts as nodes in a graph. Edges in the graph denote semantic relationships between concepts (e.g., DOG IS-A MAMMAL, DOG HAS TAIL), and word meaning is expressed by the number and type of connections to other words. In this framework, word similarity is a function of path length: semantically related words are expected to have shorter paths between them. Semantic networks constitute a somewhat idealized representation that abstracts away from real-world usage: they are traditionally hand-coded by modelers who decide a priori which relationships are most relevant in representing meaning.
More recent work creates semantic networks from word association norms (Nelson, McEvoy, & Schreiber, 1999); however, these can represent only a small fraction of the vocabulary of an adult speaker.

Feature-based Models

Feature-based models rest on the idea that word meaning can be described in terms of feature lists. Theories differ in how they define features. In many cases, the features are obtained by asking native speakers to generate attributes they consider important in describing the meaning of a word. Each word can then be represented by a distribution of numerical values over the feature set.
Admittedly, norming studies have the potential to reveal which dimensions of meaning are psychologically salient. However, a number of difficulties arise when working with such data. For example, the number and types of attributes generated can vary substantially with the amount of time devoted to each word. There are many degrees of freedom in the way responses are coded and analyzed. And multiple subjects are required to create a representation for each word, which in practice limits elicitation studies to a small lexicon.

Semantic Spaces

Semantic space models are driven by the assumption that word meaning can be learned from the linguistic environment: words that are similar in meaning tend to occur in contexts of similar words. These models capture meaning quantitatively in terms of simple co-occurrence statistics. Words are represented as vectors in a high-dimensional space, where each component corresponds to some co-occurring contextual element. The latter can be words themselves, larger linguistic units such as paragraphs or documents, or even more complex linguistic representations such as n-grams and the argument slots of predicates.
The advantage of taking such a geometric approach is that the similarity of word meanings can be easily quantified by measuring their distance in the vector space, or the cosine of the angle between them.
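
Both measures take only a few lines of numpy; the words and count values here are made up purely for illustration:

```python
import numpy as np

# Two hypothetical co-occurrence vectors (counts invented for illustration).
dog = np.array([10.0, 0.0, 7.0, 2.0])
cat = np.array([8.0, 1.0, 6.0, 1.0])

def cosine(u, v):
    """Cosine of the angle between u and v: u.v / (|u| |v|)."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def euclidean(u, v):
    """Euclidean distance between u and v."""
    return np.linalg.norm(u - v)

similarity = cosine(dog, cat)   # close to 1 for similar distributions
```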

Some semantic space models

Hyperspace Analogue to Language (HAL)
Represents each word by a vector in which each element corresponds to a weighted co-occurrence value of that word with some other word.

Latent Semantic Analysis (LSA)
Derives a high-dimensional semantic space for words using co-occurrence information between words and the passages they occur in. LSA constructs a word-document co-occurrence matrix from a large document collection.

Probabilistic topic models
Offers an alternative to semantic spaces based on the assumption that words observed in a corpus manifest some latent structure linked to topics. Words are represented as probability distributions over a set of topics (corresponding to coarse-grained senses). Each topic is a probability distribution over words, and the content of a topic is reflected in the words to which it assigns high probability. Topic models are generative: they specify a probabilistic procedure by which documents can be generated. To make a new document, one first chooses a distribution over topics; then, for each word in that document, one chooses a topic at random according to this distribution and selects a word from that topic. Under this framework, the problem of meaning representation is expressed as one of statistical inference: given some data (the words in a corpus), infer the latent structure from which it was generated.
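
The generative procedure can be sketched directly; the vocabulary, topics, and probabilities below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and two hypothetical topics, each a distribution over words.
vocab = ["bank", "money", "river", "water", "loan"]
topics = np.array([
    [0.3, 0.4, 0.0, 0.0, 0.3],   # a "finance"-flavored topic
    [0.2, 0.0, 0.4, 0.4, 0.0],   # a "nature"-flavored topic
])

def generate_document(n_words, topic_dist):
    """For each word: draw a topic from topic_dist, then a word from that topic."""
    words = []
    for _ in range(n_words):
        z = rng.choice(len(topics), p=topic_dist)   # choose a topic
        w = rng.choice(len(vocab), p=topics[z])     # choose a word from it
        words.append(vocab[w])
    return words

doc = generate_document(10, topic_dist=[0.7, 0.3])
```

Inference (e.g., in LDA) runs this process in reverse: given only documents like `doc`, recover plausible `topics` and per-document topic distributions.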

Application

• semantic priming
• discourse comprehension
• word categorization
• judgments of essay quality
• synonymy tests
• association

Composition

It is well known that linguistic structures are compositional (simpler elements are combined to form more complex ones).
Morphemes are combined into words, words into phrases, and phrases into sentences. It is also reasonable to assume that the meaning of sentences is composed of the meanings of individual words or phrases.
Compositionality allows languages to construct complex meanings from combinations of simpler elements. This property is often captured in the following principle: The meaning of a whole is a function of the meaning of the parts. Therefore, whatever approach we take to modeling semantics, representing the meanings of complex structures will involve modeling the way in which meanings combine.

Present Work

In this article, the authors attempt to bridge the gap in the literature by developing models of semantic composition that can represent the meaning of word combinations as opposed to individual words. Their models are narrower in scope compared with those developed in earlier connectionist work. Their vectors represent words; they are high-dimensional but relatively structured, and every component corresponds to a predefined context in which the words are found. The authors take it as a defining property of the vectors they consider that the values of the components are derived from event frequencies, such as the number of times a given word appears in a given context. With this in mind, they present a general framework for vector-based composition that allows them to consider different classes of models. Specifically, they formulate composition as a function of two vectors and introduce models based on addition and multiplication. They also investigate how the choice of the underlying semantic representation interacts with the choice of composition function by comparing a spatial model that represents words as vectors in a high-dimensional space against a probabilistic model that represents words as topic distributions. They assess the performance of these models directly on a similarity task: they elicit similarity ratings for pairs of adjective–noun, noun–noun, and verb–object constructions and examine the strength of the relationship between similarity ratings and the predictions of their models.

Functions

Express the composition of two constituents, $\mathbf{u}$ and $\mathbf{v}$, in terms of a function acting on those constituents: $\mathbf{p}=f(\mathbf{u},\mathbf{v})$.

A further refinement of the above principle takes the role of syntax into account: the meaning of a whole is a function of the meaning of the parts and of the way they are syntactically combined. They thus modify the composition function in Eq. 1 to account for the syntactic relation $R$ between constituents $\mathbf{u}$ and $\mathbf{v}$: $\mathbf{p}=f(\mathbf{u},\mathbf{v},R)$.
Even this formulation may not be fully adequate. The meaning of the whole is greater than the meaning of the parts. The implication here is that language users are bringing more to the problem of constructing complex meanings than simply the meaning of the parts and their syntactic relations. This additional information includes both knowledge about the language itself and also knowledge about the real world.

A full understanding of the compositional process involves an account of how novel interpretations are integrated with existing knowledge. The composition function needs to be augmented with an additional argument, K, representing any knowledge utilized by the compositional process: $\mathbf{p}=f(\mathbf{u},\mathbf{v},R,K)$.

Compositionality is a matter of degree rather than a binary notion. Linguistic structures range from fully compositional (e.g., black hair), to partly compositional, syntactically fixed expressions (e.g., take advantage), in which the constituents can still be assigned separate meanings, to noncompositional idioms (e.g., kick the bucket) and multiword expressions (e.g., by and large), whose meaning cannot be distributed across their constituents.

Logic-based view

Within symbolic logic, compositionality is accounted for elegantly by assuming a tight correspondence between syntactic expressions and semantic form. In this tradition, the meaning of a phrase or sentence is its truth conditions, which are expressed in terms of truth relative to a model. In classical Montague grammar, for each syntactic category there is a uniform semantic type (e.g., sentences express propositions; nouns and adjectives express properties of entities; verbs express properties of events). Most lexical meanings are left unanalyzed and treated as primitive.
A noun is represented by a logical symbol, and a verb by a function from entities to propositions, expressed in the lambda calculus. (Well, I’m not familiar with lambda calculus yet, but the idea is similar to predicate logic in discrete mathematics.)
For example, the proper noun John is represented by the logical symbol JOHN denoting a specific entity, and the verb wrote is represented as λx.WROTE(x). Applying this function to the entity JOHN yields the logical formula WROTE(JOHN) as a representation of the sentence John wrote. It is worth noting that the entity and predicate within this formula are represented symbolically, and that the connection between a symbol and its meaning is an arbitrary matter of convention.
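
As a loose illustration (using Python functions as a stand-in for lambda terms, and strings as a stand-in for logical formulas), the application of λx.WROTE(x) to JOHN can be mimicked like this:

```python
# Python analogue of function application in the lambda calculus.
# Strings play the role of logical symbols/formulas here.
JOHN = "JOHN"                      # a logical symbol denoting an entity
wrote = lambda x: f"WROTE({x})"    # the verb as a function over entities

sentence = wrote(JOHN)             # the formula "WROTE(JOHN)"
```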

This allows composition to be carried out syntactically.
The laws of deductive logic in particular can be defined as syntactic processes which act irrespective of the meanings of the symbols involved.

Abstracting away from the actual meanings may not be fully adequate for modeling semantic composition.
For example, $GOOD(JOHN)\wedge LAWYER(JOHN)$ does not capture the meaning of John is a good lawyer: John may be good as a lawyer without being good in general.
Modeling semantic composition means modeling the way in which meanings combine, and this requires that words have representations which are richer than single, arbitrary symbols.

Connectionism

The key premise here is that knowledge is represented not as discrete symbols that enter into symbolic expressions, but as patterns of activation distributed over many processing elements. These representations are distributed in the sense that any single concept is represented as a pattern, that is, a vector, of activation over many elements (nodes or units) that are typically assumed to correspond to neurons or small collections of neurons.

Tensor product
The tensor product $\mathbf{u}\otimes\mathbf{v}$ is a matrix whose components are all the possible products $u_iv_j$ of the components of vectors u and v.

The tensor product has dimensionality m×n, which grows exponentially in size as more constituents are composed.

Holographic reduced representation
The tensor product is projected onto the space of the original vectors, thus avoiding any increase in dimensionality.
The projection is defined in terms of circular convolution, which compresses the tensor product of two vectors. The compression is achieved by summing along the transdiagonal elements of the tensor product. Noisy versions of the original vectors can be recovered by means of circular correlation, which is the approximate inverse of circular convolution. The success of circular correlation crucially depends on the components of the n-dimensional vectors u and v being real numbers that are randomly distributed with mean 0 and variance 1/n.
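
A sketch of encoding and decoding with numpy's FFT (both circular convolution and circular correlation can be computed in the frequency domain); the vectors are drawn i.i.d. with mean 0 and variance 1/n, as the scheme requires:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000

# Components with mean 0 and variance 1/n, as required for good recovery.
u = rng.normal(0.0, np.sqrt(1.0 / n), n)
v = rng.normal(0.0, np.sqrt(1.0 / n), n)

def circular_convolution(u, v):
    """(u ⊛ v)_i = sum_j u_j * v_{(i-j) mod n}, via the convolution theorem."""
    return np.real(np.fft.ifft(np.fft.fft(u) * np.fft.fft(v)))

def circular_correlation(u, c):
    """Approximate inverse of convolution: recovers a noisy version of v."""
    return np.real(np.fft.ifft(np.conj(np.fft.fft(u)) * np.fft.fft(c)))

binding = circular_convolution(u, v)
v_noisy = circular_correlation(u, binding)

# The decoded vector is noisy but correlated with the original v;
# a clean-up memory would map it back to the closest stored vector.
similarity = np.dot(v_noisy, v) / (np.linalg.norm(v_noisy) * np.linalg.norm(v))
```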

Binary spatter codes
Binary spatter codes are a particularly simple form of holographic reduced representation. Typically, the vectors are random bit strings, i.e., binary vectors of dimensionality N (e.g., N = 10,000). Compositional representations are synthesized from parts or chunks. Chunks are combined by binding, which amounts to taking the exclusive or (XOR) of two vectors. Here, only the transdiagonal elements of the tensor product of the two vectors are kept and the rest are discarded.
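
A minimal sketch of XOR binding; since XOR is its own inverse, binding the same vector again recovers the other chunk exactly:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 10_000

# Random binary vectors ("chunks") -- arbitrary stand-ins for role and filler.
role = rng.integers(0, 2, N, dtype=np.uint8)
filler = rng.integers(0, 2, N, dtype=np.uint8)

# Binding is component-wise XOR; unbinding is the same operation.
binding = np.bitwise_xor(role, filler)
recovered = np.bitwise_xor(role, binding)   # exactly the original filler
```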

Both spatter codes and holographic reduced representations can be implemented efficiently and the dimensionality of the resulting vector does not change.
The downside is that operations like circular convolution are a form of lossy compression that introduces noise into the representation. To retrieve the original vectors from their bindings, a clean-up memory process is usually employed where the noisy vector is compared with all component vectors in order to find the closest one.

Semantic Space

Premise: Words occurring within similar contexts are semantically similar.
Semantic space models extract from a corpus a set of counts representing the occurrences of a target word t in a chosen context c, and then map these counts onto the components of a vector in some space.
Semantic space models resemble the representations used in the connectionist literature. Words are represented as vectors and their meaning is distributed across many dimensions. Crucially, the vector components are neither binary nor randomly distributed (compare the holographic reduced representations and binary spatter codes mentioned above). They correspond to co-occurrence counts, and it is assumed that differences in meaning arise from differences in the distribution of these counts across contexts.
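
A minimal sketch of how such counts might be collected, using a toy corpus and a ±2-word context window (corpus and window size are arbitrary choices here):

```python
from collections import defaultdict

# Toy corpus; contexts are the words within a +/- 2 word window of the target.
corpus = [
    "the dog chased the cat",
    "the cat chased the mouse",
    "the dog ate the food",
]

window = 2
counts = defaultdict(lambda: defaultdict(int))
for sentence in corpus:
    tokens = sentence.split()
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[target][tokens[j]] += 1

# "dog" and "cat" end up sharing contexts such as "the" and "chased",
# which is exactly what makes their vectors similar.
```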

Composition models

Aim: construct vector representations for phrases and sentences.
Note: the problem of combining semantic vectors to make a representation of a multiword phrase is different from the problem of how to incorporate information about multiword contexts into a distributional representation for a single target word.

Define p, the composition of vectors u and v, representing a pair of words which stand in some syntactic relation R, given some background knowledge K as: $\mathbf{p}=f(\mathbf{u},\mathbf{v},R,K)$.
To begin with, just ignore K so as to explore what can be achieved in the absence of any background or world knowledge.

Assumption
Constituents are represented by vectors which subsequently combine in some way to produce a new vector.
p lies in the same space as u and v. This essentially means that all syntactic types have the same dimensionality.
The restriction renders the composition problem computationally feasible.

A hypothetical semantic space for practical and difficulty

|            | music | solution | economy | craft | reasonable |
|------------|------:|---------:|--------:|------:|-----------:|
| practical  | 0     | 6        | 2       | 10    | 4          |
| difficulty | 1     | 8        | 4       | 4     | 0          |

This (additive) model, $\mathbf{p}=\mathbf{u}+\mathbf{v}$, assumes that composition is a symmetric function of the constituents; in other words, the order of constituents makes no difference.
Might be reasonable for certain structures, a list perhaps.
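
With the hypothetical vectors from the table, additive composition and its symmetry are easy to verify:

```python
import numpy as np

# Vectors from the hypothetical semantic space above.
practical  = np.array([0, 6, 2, 10, 4])
difficulty = np.array([1, 8, 4, 4, 0])

# Additive composition: p = u + v (order of constituents does not matter).
p = practical + difficulty   # [1, 14, 6, 14, 4]
```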

A sum of predicate, argument, and a number of neighbors of the predicate:

Model the composition of a predicate with its argument in a manner that distinguishes the role of these constituents, making use of the lexicon of semantic representations to identify the features of each constituent relevant to their combination.
Considerable latitude is allowed in selecting the appropriate neighbors. Kintsch (2001) considers only the m most similar neighbors to the predicate, from which he subsequently selects the k most similar to its argument.

E.g., in the composition of practical with difficulty, the chosen neighbor is problem, with $\mathbf{problem}=\begin{bmatrix}2&15&7&9&1\end{bmatrix}$, giving $\mathbf{practical}+\mathbf{difficulty}+\mathbf{problem}=\begin{bmatrix}3&29&13&23&5\end{bmatrix}$.

In this process, the selection of relevant neighbors for the predicate plays a role similar to the integration of a representation with existing background knowledge in the original construction–integration model. Here, background knowledge takes the form of the lexicon from which the neighbors are drawn.
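
For the example above, Kintsch's composition is just the three-way sum of predicate, argument, and selected neighbor:

```python
import numpy as np

practical  = np.array([0, 6, 2, 10, 4])
difficulty = np.array([1, 8, 4, 4, 0])
problem    = np.array([2, 15, 7, 9, 1])   # the neighbor chosen from the lexicon

# Kintsch's model: sum the predicate, its argument, and the neighbor(s).
p = practical + difficulty + problem      # [3, 29, 13, 23, 5]
```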

Weighted summation

Weight the constituents differentially in the summation: $\mathbf{p}=\alpha\mathbf{u}+\beta\mathbf{v}$.

This makes the composition function asymmetric in u and v, allowing their distinct syntactic roles to be recognized.
E.g., set α to 0.4 and β to 0.6.

(There is a calculation mistake in the paper here.)
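
For the record, the weighted sum with these values works out as follows (which constituent receives which weight is an assumption here):

```python
import numpy as np

practical  = np.array([0, 6, 2, 10, 4])
difficulty = np.array([1, 8, 4, 4, 0])

alpha, beta = 0.4, 0.6
# Weighted additive composition: p = alpha * u + beta * v.
# (Assigning alpha to practical and beta to difficulty is an assumption.)
p = alpha * practical + beta * difficulty   # [0.6, 7.2, 3.2, 6.4, 1.6]
```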

Extreme form

One of the vectors (u) contributes nothing at all to the combination. This can serve as a simple baseline against which to compare more sophisticated models.

Multiplicative function

Simple

$\mathbf{p}=\mathbf{u}\odot\mathbf{v}$, where the symbol $\odot$ represents multiplication of the corresponding components: $p_i=u_i\cdot v_i$

It is still a symmetric function and thus does not take word order or syntax into account.

Tensor product

$\mathbf{p}=\mathbf{u}\otimes\mathbf{v}$, where the symbol $\otimes$ stands for the operation of taking all pairwise products of the components of u and v: $p_{i,j}=u_i\cdot v_j$

Circular Convolution

$\mathbf{p}=\mathbf{u}\circledast\mathbf{v}$, where the symbol $\circledast$ stands for a compression of the tensor product based on summing along its transdiagonal elements: $p_i=\sum_j u_j\cdot v_{(i-j)}$
Subscripts are interpreted modulo n, which gives the operation its circular nature.

(The result in the paper is reversed, which seems wrong.)
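
Using the hypothetical vectors for practical and difficulty, the three multiplicative-family operations can be checked directly (the convolution loop below follows the formula literally, with indices taken modulo n):

```python
import numpy as np

practical  = np.array([0, 6, 2, 10, 4])
difficulty = np.array([1, 8, 4, 4, 0])
n = len(practical)

# Component-wise multiplication: p_i = u_i * v_i.
simple = practical * difficulty              # [0, 48, 8, 40, 0]

# Tensor product: p_{i,j} = u_i * v_j, here a 5x5 matrix.
tensor = np.outer(practical, difficulty)

# Circular convolution: p_i = sum_j u_j * v_{(i-j) mod n}.
conv = np.array([
    sum(practical[j] * difficulty[(i - j) % n] for j in range(n))
    for i in range(n)
])
```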

For now I don’t understand why multiplicative functions affect only the magnitude but not the direction, while additive models can have a considerable effect on both magnitude and direction; cosine similarity is itself insensitive to the magnitudes of vectors.

Matrix idea

Basic

To see how the vector u can be thought of as something that modifies v, consider the partial product of C with u, which produces a matrix called U.

Here, the composition function can be thought of as the action of a matrix, U, representing one constituent, on a vector, v, representing the other constituent. Given the authors’ decision to use vectors, they simply exploit this insight: map a constituent vector, u, onto a matrix, U, while still representing all words with vectors.

U's off-diagonal elements are zero, and its diagonal elements are equal to the components of u.
The action of this matrix on v is a type of dilation, in that it stretches and squeezes v in various directions. Specifically, v is scaled by a factor of $u_i$ along the ith basis direction.
The drawback of this process is that its results depend on the basis used.

Parallel and Orthogonal Decomposition

Ideally, we would like a basis-independent composition, that is, one based solely on the geometry of u and v. One way to achieve basis independence is to dilate v along the direction of u rather than along the basis directions: decompose v into a component parallel to u and a component orthogonal to u, and then stretch the parallel component to make v more like u.

By dilating x by a factor λ while leaving y unchanged, we get a modified vector $\mathbf{v}'=\lambda\mathbf{x}+\mathbf{y}$, which has been stretched to emphasize the contribution of u.

Multiplying through by $\mathbf{u}\cdot\mathbf{u}$ makes the expression easier to work with (since the cosine similarity function is insensitive to the magnitudes of vectors), giving $\mathbf{p}=(\mathbf{u}\cdot\mathbf{u})\mathbf{v}+(\lambda-1)(\mathbf{u}\cdot\mathbf{v})\mathbf{u}$.

From the given example, $\mathbf{practical}\cdot\mathbf{practical}=156$ and $\mathbf{practical}\cdot\mathbf{difficulty}=96$. Assuming λ = 2, we get $\mathbf{p}=156\,\mathbf{difficulty}+96\,\mathbf{practical}=\begin{bmatrix}156&1824&816&1584&384\end{bmatrix}$.

(Again, there are some mistakes in the paper: it confuses the coefficients $\mathbf{u}\cdot\mathbf{u}$ and $\mathbf{u}\cdot\mathbf{v}$.)
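
As a sanity check, the dilation $\mathbf{p}=(\mathbf{u}\cdot\mathbf{u})\mathbf{v}+(\lambda-1)(\mathbf{u}\cdot\mathbf{v})\mathbf{u}$ can be computed directly, assuming practical plays the role of u (the dilating constituent):

```python
import numpy as np

practical  = np.array([0, 6, 2, 10, 4])
difficulty = np.array([1, 8, 4, 4, 0])

# Dilation model: stretch the component of v parallel to u by lambda,
# then scale through by (u.u) to avoid fractions:
#   p = (u.u) v + (lambda - 1)(u.v) u
u, v = practical, difficulty   # taking practical as u is an assumption
lam = 2
uu = np.dot(u, u)              # 156
uv = np.dot(u, v)              # 96
p = uu * v + (lam - 1) * uv * u
```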

Comparison Study on Critical Components in Compositional Model for Phrase Representation

Word Representation

Count model
The count model learns word vectors by calculating the co-occurrence frequency of each word with features.

Predict model
The predict model learns word vectors by maximizing the probability of the contexts in which the word is observed in the corpus.
The Skip-Gram and CBOW models included in the word2vec tool are the most widely used for generating word vectors.

Word2Vec

A method described in many works. I’m planning to write another post about Word2Vec in detail.

Retrofitting Method

There are some mistakes in the presentation of the retrofitting method in this paper (e.g., the neighbor word vector should be $q_j$ instead of $q_i$, which is confusing), so I consulted the original paper describing the retrofitting method: Retrofitting Word Vectors to Semantic Lexicons.

Let $V=\{w_1,...,w_n\}$ be a vocabulary, i.e., the set of word types (well, I don’t know why it’s the set of word types…), and $\Omega$ be an ontology that encodes semantic relations between words in $V$. $\Omega$ is an undirected graph $(V,E)$ with one vertex for each word type and edges $(w_i,w_j)\in E\subseteq V\times V$ indicating a semantic relationship of interest.
The matrix $\hat{Q}$ is the collection of vector representations $\hat{q}_i\in\mathbb{R}^d$, one for each $w_i\in V$, learned using a standard data-driven technique, where $d$ is the length of the word vectors. The objective is to learn the matrix $Q=(q_1,...,q_n)$ such that the columns are both close (under a distance metric) to their counterparts in $\hat{Q}$ and to adjacent vertices in $\Omega$.
The distance between a pair of vectors is defined to be the Euclidean distance. Since we want the inferred word vector to be close to the observed value $\hat{q}_i$ and close to its neighbors $q_j,\forall j$ such that $(i,j)\in E$, the objective to be minimized becomes:

$$\Psi(Q)=\sum_{i=1}^{n}\left[\alpha_i\|q_i-\hat{q}_i\|^2+\sum_{(i,j)\in E}\beta_{ij}\|q_i-q_j\|^2\right]$$

where $\alpha$ and $\beta$ values control the relative strengths of associations.
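
The minimization has a simple iterative solution; below is a sketch of the update from the original retrofitting paper, with $\alpha_i=1$ and $\beta_{ij}=1/\text{degree}(i)$ (the defaults suggested there), run on made-up two-dimensional vectors:

```python
import numpy as np

# Pre-trained vectors q_hat (values invented for illustration).
q_hat = {
    "happy":  np.array([1.0, 0.0]),
    "glad":   np.array([0.0, 1.0]),
    "joyful": np.array([0.5, 0.5]),
}
# Ontology neighbors (undirected edges of the graph Omega).
edges = {
    "happy":  ["glad", "joyful"],
    "glad":   ["happy"],
    "joyful": ["happy"],
}

# Iterative update: q_i = (q_hat_i + beta * sum of neighbor q_j) / (1 + beta * degree).
q = {w: vec.copy() for w, vec in q_hat.items()}
for _ in range(10):                      # converges quickly in practice
    for w, neighbors in edges.items():
        beta = 1.0 / len(neighbors)
        numer = q_hat[w] + beta * sum(q[j] for j in neighbors)
        denom = 1.0 + beta * len(neighbors)
        q[w] = numer / denom
```

After retrofitting, vectors connected in the ontology move closer together while staying near their pre-trained values.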

Word Paraphrasing

Related paper: From Paraphrase Database to Compositional Paraphrase Model and Back
Train word vectors with a contrastive max-margin objective function. Specifically, the training data consist of a set X of word paraphrase pairs $(x_1,x_2)$, while $(t_1,t_2)$ are negative examples: the word pairs most similar to $(x_1,x_2)$ within a mini-batch during optimization. The objective function is given as follows:

where $\lambda$ is the regularization parameter, $|X|$ is the number of training paraphrase pairs, $W_w$ is the target word vector matrix, and $W_{w_{initial}}$ is the initial word vector matrix.

Training data

For the bigram phrase similarity task, there are two general types of training data in existing work.

Pseudo-word training data
Consists of tuples in the form {adjective1 noun1, adjective1-noun1}. E.g., {older man, older-man}.
Pseudo-word training data have been widely used in composition models.

Pair training data
Consists of tuples in the form {adjective1 noun1, adjective2 noun2}. E.g., {older man, elderly woman}.

For multi-word phrases, pair training data is the first choice because it is hard to learn representations of pseudo-words that span multiple words. So this paper uses pair training data for the multi-word phrase experiments.

Composition Function

Given a phrase p consisting of two words $w^{(1)}$ and $w^{(2)}$, we can write: $\vec{p}=f(\vec w^{(1)},\vec w^{(2)})$

where $\vec w^{(1)},\vec w^{(2)}$ are the word representations, $\vec{p}$ is the phrase representation, and $f$ is the composition function.

The Additive model assumes that the meaning of the composition is a linear combination of the constituent words.
The Additive model reflects the phenomenon that people concatenate the meanings of two words when understanding a phrase.

The Multiplicative model assumes that the meaning of composition is the element-wise product of the two vectors.
The Multiplicative model has the highest correlation with the neural activity observed in the human brain when people read adjective-noun phrase data.

Composition functions such as Matrix, RecNN (Recursive Neural Network), and RNN transform word vectors into another vector space through matrix transformations and nonlinear functions. These three functions differ primarily in the order of transformation.
The Matrix model first transforms component words into another vector space and then composes them using addition. The RecNN model takes word order into consideration, concatenates vector component words and then transforms the vector using a matrix and nonlinearity. The RNN model composes words in a sentence from left to right by forming new representations from previous representations and the representation of the current word.
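
A rough numpy sketch of the three families. The parameterizations here are illustrative guesses rather than the exact ones from the paper, and W1, W2, W are randomly initialized stand-ins for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
u = rng.normal(size=d)         # hypothetical word vectors
v = rng.normal(size=d)
W1 = rng.normal(size=(d, d))   # stand-ins for learned parameter matrices
W2 = rng.normal(size=(d, d))
W  = rng.normal(size=(d, 2 * d))

# Matrix model: transform each word first, then compose by addition.
p_matrix = np.tanh(W1 @ u) + np.tanh(W2 @ v)

# RecNN: concatenate first (so word order matters), then transform.
p_recnn = np.tanh(W @ np.concatenate([u, v]))

# RNN: fold words in left to right, carrying a hidden state h.
def rnn_step(h, w, W_h, W_x):
    return np.tanh(W_h @ h + W_x @ w)

h = np.zeros(d)
for word in (u, v):            # compose the phrase word by word
    h = rnn_step(h, word, W1, W2)
```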