Natural Language Processing

about this course

Machines have had some ability to understand and generate text for several decades. More recently, neural approaches have revolutionized the field. The most striking and impressive models are large Transformer-based models trained on large amounts of text.

But if "Attention is All You Need," then maybe we don't need those large language models at all. If "Language Models are Few-Shot Learners," then maybe small language models can learn from a few examples too. Maybe we can develop a small language model that better understands our language if we "pay attention" to the language.

Training a model to examine the links between words in a sequence and directly model those relationships trains the model to understand human language. Super-sizing a model super-sizes the cost. It does not super-size the understanding.

Larger models tend to perform better than smaller models, but performance gains diminish as model size increases.

At large sizes, language models can become superficially fluent without truly understanding human language. Such "Stochastic Parrots" simply repeat long sequences that they learned from the training data. So we need to "pay attention" to those datasets and ask what the model has learned.

Language models perform better in the domain that they have been trained on. A model trained on Wikipedia will not understand Huckleberry Finn. So we also need to ask if there is good reason to believe that a large language model can be fine-tuned for a given task. Many times there will be a good reason. Sometimes there will not.

And when the language is not English, a large language model that can be fine-tuned for any task might not even exist at all. In those cases, we need to "pay attention" to the language, so that we can train a small language model that understands our (non-English) language.

what you will learn

This course will compare the performance of RNNs, Transformers, BERT and GPT to previous approaches. And it will pay particular attention to how those performance gains were achieved. Did the researchers develop a better model? Or did they train a larger model?

For example, the fluency and translation quality of neural translation models far surpass those of phrase-based statistical models, including in low-resource cases.

But what's important is how those performance gains were achieved. Instead of translating words or phrases, the neural approach attempts to understand context. Neural models translate better than phrase-based models because they attempt to create a sentence in the target language with the same meaning as the source language sentence.

In that spirit, this course will explore neural approaches to natural language processing. Comparing them, it will ask how we can develop models that better understand our language.

By training small, comparably sized models, we can compare approaches. Holding model size constant, we'll ask which training or fine-tuning technique performs best on a given task. Identifying the techniques that work well at small scale, we'll find techniques that work exceptionally well at large scale.

links and files

models, corpora and software

course outline

  • lecture 00 — context and background
    • themes:
      • Before neural networks were used in NLP, count-based methods like term frequency-inverse document frequency (TF-IDF) provided features for classifiers. And phrase-based methods provided statistical machine translation.
      • More recently, deep learning has given NLP simple but sophisticated models to understand and generate human language. And at large scale, the new models provide impressive but superficial fluency.
      • This lecture explores the beautiful potential to help people learn a new language or understand someone else's language. And it also explores the dangerous potential to generate environmental pollution and bad information.
    • readings:
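As a rough illustration of the count-based methods named in lecture 00, the sketch below computes TF-IDF weights in plain Python. The weighting variant (relative term frequency times log(N / df)), the function name `tf_idf` and the toy documents are assumptions made for illustration; libraries often use smoothed variants.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# toy documents, invented for this example
docs = ["the cat sat", "the dog barked", "the cat purred"]
weights = tf_idf(docs)
```

A term that appears in every document ("the") gets weight zero, while a term found in only one document ("dog") gets the largest weight. Vectors like these can then serve as input features for a classifier.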
  • lecture 01 — tools for our "NLP kitchen"
    • theme:
      • In linear regression, the confidence interval around the prediction is smaller when the predictor variables lie close to the sample means in the dataset used to estimate the regression model.
      • Similarly, language models make better predictions when the subword distribution at inference matches the subword distribution at training.
      • Incorporating linguistic theory into the pre-processing aligns those distributions.
    • tools we can use:
      • regular expressions
      • dictionaries and grammar books
      • lemmas, parts of speech and dependency labels
    • readings:
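The regression analogy in lecture 01 can be checked numerically. The sketch below computes the standard error of the fitted mean response in simple linear regression, s * sqrt(1/n + (x0 - x_bar)^2 / Sxx), which sets the width of the confidence interval. The function name `mean_response_se` and the data points are made up for illustration.

```python
import math

def mean_response_se(xs, ys, x0):
    """Standard error of the fitted mean response at x0 (simple OLS)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # residual variance estimate with n - 2 degrees of freedom
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))
    return s * math.sqrt(1 / n + (x0 - x_bar) ** 2 / sxx)

# made-up illustration data
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.1, 1.9, 3.2, 3.8, 5.1]

se_at_mean = mean_response_se(xs, ys, 3.0)   # x0 equals the sample mean
se_far = mean_response_se(xs, ys, 10.0)      # x0 far outside the sample
```

The standard error is smallest at the sample mean and grows as x0 moves away from it, which is the intuition the lecture carries over to subword distributions at training and inference time.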
  • lecture 02 — word embeddings
    • theme:
      • Humans understand a word or phrase by understanding the context in which the word or phrase appears. Similarly, we can train a machine to understand words and phrases by training it to understand the contexts in which those words or phrases appear.
      • So our first language understanding task is to convert words and phrases into vector representations of their context. Then, from those representations, we can measure the similarity between words and phrases. Words and phrases that are close to each other in vector space should have similar meanings.
    • readings:
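To make the idea in lecture 02 concrete, the sketch below represents each word by counts of the words that appear near it, then compares words by cosine similarity. This is a count-based stand-in for trained embeddings, not a specific embedding model. The window size, function names and toy corpus are assumptions.

```python
import math
from collections import Counter, defaultdict

def context_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# toy corpus, invented for this example
corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the dog",
    "we read the book",
]
vecs = context_vectors(corpus)
sim_cat_dog = cosine(vecs["cat"], vecs["dog"])
sim_cat_book = cosine(vecs["cat"], vecs["book"])
```

In this toy corpus "cat" and "dog" share more contexts than "cat" and "book", so their vectors lie closer together.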
  • lecture 03 — subword segmentation
    • theme:
      • Newer segmentation methods provide language-independent tokenization, detokenization and segmentation from raw sentences, whereas previous methods assumed prior tokenization.
      • This choice is important because once a model has been trained, the subword vocabulary is "locked in." It's also important because some languages do not divide sentences into words.
      • And it's important because a language model predicts a sequence of subword units. Incorporating linguistic theory into the subword splitting trains the model to predict a linguistically motivated sequence of subword units.
    • readings:
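The points in lecture 03 about raw-sentence segmentation and a "locked in" vocabulary can be sketched with a greedy longest-match segmenter. This is a simplified stand-in, not the algorithm of any particular library. The vocabulary is hypothetical, and marking spaces with "▁" follows a common convention for segmenting raw text without prior tokenization.

```python
def segment(text, vocab):
    """Greedy longest-match segmentation of raw text.

    Spaces become the marker "▁", so no prior word tokenization is assumed.
    Characters not covered by the vocabulary are emitted one at a time.
    """
    marked = "▁" + text.replace(" ", "▁")
    pieces, i = [], 0
    while i < len(marked):
        # try the longest candidate first, fall back to a single character
        for j in range(len(marked), i, -1):
            if marked[i:j] in vocab or j == i + 1:
                pieces.append(marked[i:j])
                i = j
                break
    return pieces

def detokenize(pieces):
    """Join pieces and restore spaces (drops the leading marker)."""
    return "".join(pieces).replace("▁", " ")[1:]

# hypothetical vocabulary, fixed once "training" is done
VOCAB = {"▁the", "▁walk", "▁talk", "ed", "ing", "s"}

seen = segment("the walked", VOCAB)     # ['▁the', '▁walk', 'ed']
unseen = segment("the jogged", VOCAB)   # ['▁the', '▁', 'j', 'o', 'g', 'g', 'ed']
```

Because the vocabulary is fixed, an unfamiliar stem like "jogg" breaks into single characters, while the suffix "ed" is still recognized. Detokenization recovers the original sentence in both cases.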
  • lecture 04 — recurrent neural networks
    • themes:
      • RNNs were the first neural model used in machine translation. Reading words sequentially, they employ gating mechanisms to identify relationships between separated words in a sequence. The gating mechanisms use word embeddings to understand context.
      • The output that RNNs produced was far more fluent than that of phrase-based models. Nonetheless, phrase-based models continued to outperform RNNs in low-resource cases. Only when trained on very large datasets did RNN models outperform phrase-based models.
      • RNNs fell out of favor because recurrent processing requires long training times. And the gating mechanisms only partially solved the vanishing gradient problem, which often made RNNs difficult to train.
    • readings:
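The sequential, gated processing described in lecture 04 can be sketched with a scalar recurrent unit that has a single update gate. It is a simplified relative of GRU and LSTM cells, not either architecture in full. The weights are arbitrary illustrative values, not trained parameters.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def gated_rnn(inputs, w_x=1.0, w_h=0.5, u_x=2.0, u_h=0.0, b=-1.0):
    """Run a scalar gated recurrent unit over a sequence of numbers.

    All weights are hypothetical values chosen for illustration.
    """
    h = 0.0
    states = []
    for x in inputs:
        z = sigmoid(u_x * x + u_h * h + b)        # update gate, between 0 and 1
        candidate = math.tanh(w_x * x + w_h * h)  # proposed new state
        h = (1.0 - z) * h + z * candidate         # blend old state and candidate
        states.append(h)
    return states

# one input "event" followed by silence
states = gated_rnn([1.0, 0.0, 0.0, 0.0])
```

Each step needs the previous hidden state, so the loop cannot be parallelized across time steps. The gate lets part of the first input survive in the state after the input has gone quiet, which is how gating helps relate separated positions in a sequence.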
  • lecture 05 — the Transformer
    • themes:
      • The Transformer quickly became the neural model of choice for NLP tasks.
      • Unlike RNNs, Transformers do not require recurrent processing of a hidden state. Instead, they encode and decode using only self-attention, performing computations in parallel, which reduces training costs.
      • Self-attention directly models the relationships between words in a sequence as it examines the links between them. So when training a Transformer, computations are spent modeling the language (not encoding/decoding a hidden state vector).
      • It's a simpler approach that scales upward to larger datasets. And it also scales downward to smaller datasets, enabling us to develop useful, meaningful models in low-resource cases.
    • readings:
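The self-attention computation discussed in lecture 05 can be sketched as scaled dot-product attention in plain Python. To keep the example short, the learned query, key and value projections are omitted and the toy token vectors are used directly; that simplification and the function names are assumptions.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # score this position against every position in the sequence
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        w = softmax(scores)
        weights.append(w)
        # output is the attention-weighted average of the value vectors
        outputs.append([sum(wj * v[c] for wj, v in zip(w, values))
                        for c in range(len(values[0]))])
    return outputs, weights

# toy token vectors, invented for this example
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(x, x, x)
```

Each row of `weights` says how strongly one position attends to every other position. The rows are computed independently of one another, which is why the work can run in parallel instead of step by step.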
  • lecture 07 — GPT models
    • themes:
      • Using a decoder-only Transformer, researchers at OpenAI observed that pre-training a language model in an unsupervised fashion and then fine-tuning it for a given task is an effective strategy.
      • In their 2018 paper introducing GPT, they hypothesized that "the more structured attentional memory of the transformer assists in transfer compared to LSTMs."
      • Then in 2019 and 2020, they trained models at a range of sizes and observed that a model's ability to transfer learning across tasks increases with model size. Their largest model performed best on all tests. And sometimes it performed much better than the smaller models. But in many cases, the improvement was small.
    • readings:
  • lecture 08 — BERT models
    • themes:
      • Using an encoder-only Transformer, researchers at Google took a different approach to pre-training and unsupervised learning. They developed a "bidirectional" model, BERT, which allows each token to attend to all tokens in the self-attention layers, so that the representation learns from context on both sides.
      • For comparison, GPT (like the decoder in the original Transformer) only allows a token to attend to previous tokens.
      • So to prevent trivial predictions, the team that developed BERT introduced masked language modeling. During pre-training, they randomly selected a fraction of the tokens and replaced each one with either a "mask" token, a random token or the same (unchanged) token. Then they pre-trained BERT to predict the original tokens.
      • And to capture the relationship between two sentences, they created paired training examples in which the second sentence was either the next sentence or a random sentence.
      • The bidirectional cross-attention between the two sentences makes BERT a good choice for question-answering tasks, inference tasks and classification tasks.
    • readings:
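The masking procedure described in lecture 08 can be sketched as follows. The 15% selection rate and the 80/10/10 split between mask, random and unchanged tokens are the values reported in the BERT paper. Selecting each position independently, the function name `mask_tokens` and the toy sentence are simplifying assumptions.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, select_prob=0.15, seed=0):
    """Return (corrupted tokens, {position: original token to predict})."""
    rng = random.Random(seed)
    corrupted, targets = list(tokens), {}
    for i, token in enumerate(tokens):
        if rng.random() >= select_prob:
            continue                      # position not selected, left alone
        targets[i] = token                # the model must predict the original
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK           # 80%: replace with the mask token
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the token unchanged
    return corrupted, targets

# toy sentence; a higher selection rate makes the effect visible on short input
tokens = "the quick brown fox jumps over the lazy dog".split()
corrupted, targets = mask_tokens(tokens, vocab=tokens, select_prob=0.5, seed=1)
```

Only the selected positions contribute prediction targets. Because a selected token is sometimes left unchanged or swapped for a random one, the model cannot assume that an unmasked token is correct.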
  • lecture 09 — "Stochastic Parrots"
    • themes:
      • Large language models impressively generate coherent, fluent text by predicting a sequence of subword units. This concluding lecture will consider optimal model size and what language models truly understand.
      • Large models perform better, but they also cost more to train and deploy. So given the diminishing marginal returns to model capacity, the profit-maximizing model size may be quite small.
      • Training and fine-tuning large language models consume energy. And often that energy does not come from renewable sources.
      • Training models on mostly-English datasets leaves other languages poorly served. And even within English, the data may not reflect the way the language has changed and is changing in response to changing social views and opinions.
      • But with careful thought and planning, we can develop language models that understand the language that we speak.
    • readings:

Copyright © 2002-2025 Eryk Wdowiak