Thoughts on data-driven language learning

I used to be a language pedant. I would bemoan the use of the word “presently” to mean “currently”, shudder at “between you and I”, gasp at the use of “literally” to mean… “not literally” (“I literally peed my pants laughing.” “Orly?”) I would get particularly exasperated when I heard people use phrases that were clearly (to me) nonsensical but that sounded almost correct. A classic example of this is when people say “The reason being is…” or start a sentence with “As such, …” when the word “such” does not refer to anything. The dangling participle is another whole class of examples.

But now I actually find these usage tendencies quite fascinating (OK, I still shudder at “between you and I”). They seem to indicate that people’s choice of words is based on what they’ve heard over and over, rather than on the application of grammatical rules. And if they misremember what they heard, they’ll say something that sounds roughly the same even if it’s ungrammatical. Or they’ll throw phrases into places where they sound like they might belong, even though they don’t. Then common mistakes like using “as such” to mean something similar to “therefore” proliferate, and before you know it “literally” means “not literally”.

A recent article in Scientific American challenges Chomsky’s very influential Universal Grammar theory, which postulates a sort of toolkit for learning language that is hard-wired into our brains. According to the theory, this module is an instantiation of a grammar that all human languages conform to, and it is what enables us to learn language. The Scientific American article introduces an alternative, referred to simply as a usage-based theory of linguistics, which does away entirely with the need for any hard-wired grammar toolkit to explain our ability to acquire language. Instead, the authors claim, children are born with a set of general-purpose tools that allow them to perform tasks like object categorization, reading the intentions of others, and making analogies; these same tools are used in language learning as well as in other, non-language-related tasks.

One argument they give that I find quite compelling talks about utterances of the form “Him a presidential candidate?!”, “Her go to the ballet?!,” “Him a doctor?!”, etc., and how we can produce infinitely many such utterances even though they are ungrammatical. The Universal Grammar theory has to jump through hoops to explain this, invoking a list of exceptions to grammatical rules in addition to the rules themselves…

So the question becomes, are these utterances part of the core grammar or the list of exceptions? If they are not part of a core grammar, then they must be learned individually as separate items. But if children can learn these part-rule, part-exception utterances, then why can they not learn the rest of language the same way? In other words, why do they need universal grammar at all?

When theories have to introduce all kinds of complexities in order to cope with new evidence, that does seem like the right time to abandon them. At any rate, given my observations above about people shoving words and phrases that sound roughly appropriate into their sentences, without regard for correctness, I was obviously going to like this new usage-based theory :) But I like it for another reason also: it aligns beautifully with data-driven approaches to Natural Language Processing tasks such as machine translation.

In the 2009 paper “The Unreasonable Effectiveness of Data”, researchers at Google observe that tasks where rule-based approaches have traditionally beaten statistical approaches usually succumb eventually when enough data is thrown at them. The more data you have, the less need you have of rules:

“Instead of assuming that general patterns are more effective than memorizing specific phrases, today’s translation models introduce general rules only when they improve translation over just memorizing particular phrases (for instance, in rules for dates and numbers).”

Memorizing phrases is exactly what I think people do. Now, I’m not suggesting that all of this means that Natural Language Understanding in AI will soon be a solved problem – there’s still the issue of equipping machines with the ability to analogize, read communicative intentions, etc. – but I do think it means we’ve only seen the tip of the iceberg when it comes to what the data-driven approach to NLU can achieve.

I’m particularly excited by the idea of distributed word representations, a.k.a. word embeddings, where you train a set of vector representations of words such that the relationships between words are captured by the mathematical relationships that hold between the vectors representing them. This idea, coupled with massive corpora (e.g. all of Wikipedia), leads to pretrained sets of embeddings like word2vec and GloVe, which hold really rich representations of how words are used in a language. These pretrained embeddings can then be used for a wide range of downstream NLP tasks, e.g. text classification, topic modeling, and document similarity.
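The classic illustration of this is the analogy “man is to king as woman is to ?”, answered by vector arithmetic: take the king vector, subtract man, add woman, and look for the nearest remaining word vector. Here is a minimal sketch of the idea using tiny hand-crafted 2-D vectors (real embeddings like word2vec or GloVe are learned from large corpora and have hundreds of dimensions; the vectors and vocabulary below are invented purely for illustration):

```python
import numpy as np

# Toy, hand-crafted "embeddings" -- the two dimensions loosely encode
# (royalty, gender). Real embeddings are learned, not written by hand.
vectors = {
    "king":   np.array([1.0,  1.0]),
    "queen":  np.array([1.0, -1.0]),
    "man":    np.array([0.0,  1.0]),
    "woman":  np.array([0.0, -1.0]),
    "prince": np.array([0.6,  1.0]),
    "apple":  np.array([-0.8, 0.1]),
}

def cosine(a, b):
    # Cosine similarity: 1.0 means same direction, -1.0 opposite.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c):
    """Solve 'a is to b as c is to ?' by vector arithmetic."""
    target = vectors[b] - vectors[a] + vectors[c]
    # Exclude the query words themselves, as is standard practice.
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(candidates[w], target))

print(analogy("man", "king", "woman"))  # -> queen
```

With these toy vectors, king − man + woman lands exactly on the queen vector; with real pretrained embeddings the arithmetic is approximate, but the nearest neighbor of the result is famously still “queen”.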

Granted, these existing sets of embeddings may not capture the “not literally” sense of “literally”, but once we add YouTube comments to the training corpora and figure out the mathematics required to make a word vector represent its own opposite, we’ll be off to the races! Literally.
