Validating the bag-of-words model
Processing natural language is a notoriously difficult task for computers.
Natural language doesn’t follow the same sort of rigid, consistent standards as other systems. Our day-to-day language is actually riddled with ambiguities and inconsistencies, even if we’re not consciously aware of it.
While the human brain can resolve these easily, a conventional computer system would need a list of every single rule, edge case and exception in the English language.
This is practically impossible1, so natural language processing systems need to break the problem down into smaller parts that are easier to work with.
Usually, this involves constructing a simplified model of language, something that works well for certain tasks, but fails spectacularly at others. In other words, it’s a model that doesn’t generalize to the language as a whole.
One such model is the “bag-of-words” approach, which considers documents or sentences as a collection of specific words.
Under this model, a document can be created by placing all of its constituent words in one big bag and drawing them out in random order. Shaking the bag to randomly jumble up the words has no impact on the document, because word order is treated as unimportant.
The model treats the sentences “Bob is Alice’s father.” and “is father Alice’s Bob.” as equivalent.
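To make that equivalence concrete, here is a minimal sketch in Python. The `bag_of_words` helper and its simple regex tokenizer are illustrative assumptions, not part of any particular library; a real tokenizer would handle punctuation and possessives more carefully.

```python
from collections import Counter
import re

def bag_of_words(text):
    """Return a multiset (Counter) of lower-cased word tokens.

    Hypothetical helper: the regex keeps letters and apostrophes
    together, so "Alice’s" stays a single token.
    """
    return Counter(re.findall(r"[a-z'’]+", text.lower()))

a = bag_of_words("Bob is Alice’s father.")
b = bag_of_words("is father Alice’s Bob.")

# Word order is discarded, so the two bags compare equal.
print(a == b)  # True
```

A `Counter` is used rather than a `set` so that repeated words are still counted; only their positions are thrown away.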
Obviously, this isn’t a realistic model of the English language. These two sentences are clearly different – one is grammatically correct and clear. The other is basically gibberish.
But adopting this model lets a computer construct a much simpler set of rules for processing language.
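By assuming that each word is independent of the words around it, a document can be modelled as a much simpler joint probability distribution: the probability of the document is just the product of the probabilities of its individual words (since every word drawn from the bag is independent of the last).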
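As a rough illustration of what independence buys us, the sketch below scores a sentence as a product of per-word probabilities. The probabilities are estimated from the toy sentence itself purely for demonstration, and `unigram_probs` and `joint_prob` are hypothetical names. Exact fractions are used so the comparison isn't muddied by floating-point rounding.

```python
from collections import Counter
from fractions import Fraction
from math import prod

def unigram_probs(tokens):
    """Estimate P(word) as its relative frequency in `tokens`."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: Fraction(count, total) for word, count in counts.items()}

def joint_prob(tokens, probs):
    """Probability of a token sequence if every word is independent."""
    return prod(probs[word] for word in tokens)

tokens = "bob is alice's father".split()
shuffled = ["is", "father", "alice's", "bob"]
probs = unigram_probs(tokens)  # toy estimate, for illustration only

# Multiplication is commutative, so word order cannot affect the score.
print(joint_prob(tokens, probs) == joint_prob(shuffled, probs))  # True
```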
Ignoring sequences of words (e.g. word pairs or “bi-grams”) is also much more efficient computationally.
For example, consider a table listing the frequency of each word in a 100-word document. The table will usually need fewer rows than the total number of words in the document, because certain words will occur very frequently – “the”, “and”, “to”, and so on.
A table representing all word pairs, however, will need more rows. Consider the sentence “The cat sat on the table next to the couch.” The word frequency table will require 8 rows – the, cat, sat, on, table, next, to, couch. The word-pair frequency table, however, requires 9 rows – the cat, cat sat, sat on, on the, the table, table next, next to, to the, the couch.
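The two row counts can be checked with a few lines of Python. This is only a sketch: it assumes lower-casing and whitespace splitting are an adequate tokenizer for this one sentence.

```python
from collections import Counter

sentence = "The cat sat on the table next to the couch."
tokens = sentence.lower().rstrip(".").split()

# One row per distinct word, and one row per distinct adjacent pair.
word_table = Counter(tokens)
bigram_table = Counter(zip(tokens, tokens[1:]))

print(len(word_table))    # 8
print(len(bigram_table))  # 9
```

The word table is smaller only because “the” appears three times; every one of the nine word pairs is unique.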
This holds true for larger documents, too, because commonly occurring words – the, and, to – occur more frequently than commonly occurring bi-grams (“the cat”, “the table”, and so on). As a document grows, the number of unique bi-grams tends to grow faster than the number of unique words.
Intuitively, this makes sense – we would expect the number of unique word pairs to be larger than the number of unique words, even though not every two words form a linguistically valid word pair (for example, I don’t think there’s any such thing as a “cat table”).
This is also true empirically, which we can see by taking a corpus and plotting the number of unique bi-grams against the number of unique words.
We can see that the number of unique words levels off after a certain point, but the number of bi-grams continues to grow (almost linearly as a function of the total number of words encountered in a corpus).
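The numbers behind a plot like this could be gathered with something like the sketch below. `vocabulary_growth` is a hypothetical helper, and it assumes the corpus has already been flattened into a single list of tokens, so bi-grams that straddle sentence boundaries are counted too.

```python
def vocabulary_growth(tokens):
    """Yield (tokens_seen, unique_words, unique_bigrams) after each token."""
    words, bigrams = set(), set()
    previous = None
    for n, token in enumerate(tokens, start=1):
        words.add(token)
        if previous is not None:
            bigrams.add((previous, token))
        previous = token
        yield n, len(words), len(bigrams)

# Toy input; a real run would stream the tokens of the whole corpus.
tokens = "the cat sat on the table next to the couch".split()
growth = list(vocabulary_growth(tokens))
print(growth[-1])  # (10, 8, 9)
```

Plotting the second and third values against the first gives the two curves being compared.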
For the same corpus, a “bag of bi-grams” is therefore going to require a much larger “bag” than a corresponding “bag of words” (not to mention the additional space required to actually hold each bi-gram in memory).
So although the bag-of-words model is “wrong”, it’s a very useful simplifying assumption. The model transforms the problem into something that is computationally and mathematically easier to work with.
But exactly how “wrong” is the model? Rather than contriving examples like “is father Alice’s Bob.”, how often do we actually encounter a single bag of words that can be combined into two or more valid, but “different”, documents?
For this exercise, let’s look at a corpus of written judgments from the South Australian Supreme Court between 1991 and 2016, published on AustLII.
To start with, let’s assume two documents never share exactly the same bag of words. For demonstration purposes, let’s also assume that two full sentences rarely share exactly the same bag of words.
Although we won’t prove these are true, the assumptions are probably reasonable if we just want to demonstrate collisions between bags of words2.
We’re going to split each sentence in each document into sequences of 7 words or fewer3, and find any phrases that use the same words in a different order.
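One way this search could be implemented is sketched below. It is not necessarily the approach used for the results that follow: the function names, the regex tokenizer and the choice to key each phrase on its sorted words (so repeated words must match in number) are all assumptions made for illustration.

```python
from collections import defaultdict
import re

MAX_WORDS = 7  # phrase length cap described in the text

def phrases(sentence, max_words=MAX_WORDS):
    """Yield every contiguous run of 2..max_words tokens in a sentence."""
    tokens = re.findall(r"[a-z'’]+", sentence.lower())
    for size in range(2, max_words + 1):
        for start in range(len(tokens) - size + 1):
            yield tuple(tokens[start:start + size])

def find_collisions(sentences, max_words=MAX_WORDS):
    """Group distinct phrases that share exactly the same bag of words."""
    by_bag = defaultdict(set)
    for sentence in sentences:
        for phrase in phrases(sentence, max_words):
            # Sorting the words discards order but keeps repeat counts.
            by_bag[tuple(sorted(phrase))].add(phrase)
    return [group for group in by_bag.values() if len(group) > 1]

# Toy input only; the real exercise would feed in every sentence of the corpus.
sentences = [
    "I am not satisfied that X is Y.",
    "I am satisfied that X is not Y.",
]
collisions = find_collisions(sentences)
for group in collisions:
    print(sorted(" ".join(phrase) for phrase in group))
```

Each printed group is a set of phrases built from the same words in a different order. Single words are skipped, since a one-word phrase can only ever collide with itself.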
The overwhelming majority of actual collisions were of the form:
These phrases might differ in grammatical terms, but you’d be hard pressed to identify any semantic difference. Though the position of certain words has changed, this is simply due to the author’s style; the meaning conveyed is identical.
99.9% of collisions4 took this form. There were, however, a material number that took a slightly different form:
Here, the position of a single word can completely reverse the meaning of the phrase. This makes sense grammatically, of course. Swapping the subject and object in a sentence will obviously change the direction of action of the verb.
But a computer has no inherent understanding of “subject”, “object” or “verb”. The bag-of-words model assumes that all words are equal. But if all words are equal, why does swapping two words in “I think in all the circumstances that…” make no difference, but swapping two words in “The respondent must pay the appellant’s” completely reverses its meaning?
Another interesting collision took the following form:
Is “I am not satisfied that X is Y” different from “I am satisfied that X is not Y”?
The question probably has different answers depending on the perspective from which it is asked.
A lawyer might detect a difference in the burden of proof. After all, persuading me that X is Y is obviously different from persuading me that X is not Y.
A journalist reporting on the case might perceive no difference whatsoever – after all, the judge concluded that X was not Y in this particular case, so there is nothing further to discuss.
An impartial observer might treat the latter as definitive, whereas the former leaves open the possibility that X could be Y if more evidence were provided.
The fact that these questions exist suggests that words cannot be treated as independent of their position in a sentence. For that matter, words may not even be independent of the perspective of the reader. The bag-of-words model is starting to break down once we delve beneath the surface.
Similarly, compare the following two phrases:
In the first, it’s perfectly clear that “he” and “the respondent” refer to separate people.
In the second, however, does “he” refer to “the respondent”, or to some other individual?
Swapping two words not only changes the meaning of the phrase, it also renders the meaning completely ambiguous. Having the remainder of the sentence wouldn’t necessarily clarify the ambiguity, either. The reader would probably still need the broader context of the phrase within the paragraph or document.
So sentences can clearly take on different meanings depending on the order of their words. Similarly, words themselves can take on different meanings depending on their position within a sentence, and the context of the sentence in the document. This is a simple and obvious conclusion, but it’s illuminating to use real-world examples to guide our understanding. So it’s unlikely that the bag-of-words model will be able to represent all of the contextual variation within the English language.
That being said, it was never expected to. Basic spam filters don’t need an understanding of grammar. They just need to know which words are likely to be spam, and which are not.
Equally, understanding broadly what a document is “about” doesn’t require understanding the exact meaning of every sentence. “The plaintiff must pay the defendant’s costs” is clearly “about” costs in litigation, irrespective of who specifically is paying whom. This principle is very useful in document classification, search indexing and retrieval, and so on.
Moreover, we can’t overlook that the collisions above were the exception, not the rule. The overwhelming majority of collisions were entirely inconsequential (not to mention that we also ignored the fact that documents and sentences almost never reduce to the same bag of words). So the bag-of-words model, though an incomplete model of the English language, is still an incredibly useful approach to take.
2 These assumptions themselves actually support the use of the bag-of-words model.
3 Inter-sentence collisions are a further complication (for example, “The dog sat on the chair. The cat sat on the floor” should collide if we swap the words “dog” and “cat”). Whether or not this is a realistic problem will be a topic for another day.
4 Probably – I didn’t actually count. Mainly because we didn’t adjust for phrases containing different numbers of the same words, which cluttered the output with unimportant collisions.
5 All code available on Bitbucket.
6 Image credit: Kjpargeter – Freepik.com.