Unit 1: Additional Readings and Extensions

For each unit, I’ll try to compile a list of resources/further readings in case a topic we covered is of particular interest to you! For the most part, what we cover will only skim the surface of what is out there. The aim of this page is to guide you towards reasonable next steps in exploring these topics that should be challenging, but accessible given what we’ve covered.

N-Gram Language Modeling

  • If you were to train an n-gram model of any sort in practice, you would not want to roll your own implementation (i.e., don’t actually use the code you wrote for HW1!). This is mainly because when you work with large models, you need to be very careful about efficiency, lest your models take up massive amounts of time and memory to train (imagine the counts for a large $n$ and a massive corpus). In practice, there are two libraries folks use: KenLM and SRILM, both widely used and with usable documentation/tutorials (a short KenLM sketch appears at the end of this list).

  • N-Gram models are primarily useful as a historical and pedagogical tool, but n-gram-like ideas do show up in modern research in a number of ways!

  • The smoothing techniques we saw in class are relatively simple, and were primarily meant to show you the two basic approaches to smoothing: interpolation-based methods and backoff-based methods (the schematic formulas at the end of this list contrast the two). More complex techniques abound, and the definitive reference for these methods is Chen & Goodman 1996, which surveys their empirical performance and finds that Modified Kneser-Ney Smoothing performs best. This is usually the default option when doing n-gram smoothing nowadays.

  • That’s not to say that add-1/Laplace-style smoothing is entirely irrelevant. It was very recently shown that add-k smoothing is formally equivalent to label smoothing, a technique for regularizing some neural language models (Malagutti et al. 2024); a one-line PyTorch example appears at the end of this list. The paper also shows a method for developing new regularizers based on other n-gram smoothing techniques.
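
A quick supplement to the smoothing bullet above: here is the simplest member of each family, written out for bigrams (this is linear interpolation à la Jelinek-Mercer and a Katz-style backoff scheme, not Modified Kneser-Ney itself):

$$P_{\text{interp}}(w_i \mid w_{i-1}) = \lambda\, P_{\text{ML}}(w_i \mid w_{i-1}) + (1 - \lambda)\, P_{\text{ML}}(w_i)$$

$$P_{\text{backoff}}(w_i \mid w_{i-1}) = \begin{cases} P^{*}(w_i \mid w_{i-1}) & \text{if } C(w_{i-1} w_i) > 0 \\ \alpha(w_{i-1})\, P(w_i) & \text{otherwise} \end{cases}$$

Interpolation always mixes in the lower-order estimate; backoff only falls back to it when the higher-order count is zero, with $P^{*}$ discounted and $\alpha$ chosen so the distribution still sums to one.

As for KenLM, here is roughly what querying a trained model looks like from Python. This is a sketch that assumes you have installed the kenlm bindings and already built example.arpa with KenLM's lmplz tool (the file name is just a placeholder):

```python
import kenlm

# Load an ARPA (or binarized) model built with KenLM's lmplz / build_binary tools.
model = kenlm.Model("example.arpa")

# Total log10 probability of the sentence, with BOS/EOS symbols added for us.
print(model.score("colorless green ideas sleep furiously", bos=True, eos=True))

# Per-word perplexity of the same sentence.
print(model.perplexity("colorless green ideas sleep furiously"))
```

And for the label-smoothing connection: in PyTorch (assuming a reasonably recent version), label smoothing is just a flag on the standard cross-entropy loss:

```python
import torch.nn as nn

# Smooths the one-hot targets toward a uniform distribution; Malagutti et al. (2024)
# relate this formally to add-k smoothing of n-gram counts.
loss_fn = nn.CrossEntropyLoss(label_smoothing=0.1)
```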

Tokenization

  • BPE is a very popular tokenization technique, but it has a number of alternatives like WordPiece (Schuster & Nakajima 2012) and UnigramLM. Tokenization tends to be introduced through papers that aim to do something bigger than just tokenization (even if tokenization is the legacy they leave!), so I recommend looking towards broader survey articles like Mielke et al. (2021) to get the lay of the land and a description of the key differences between methods.

  • There are many reasonable ways to tokenize in practice, but a convenient and easy-to-use go-to for reference implementations is the set maintained by HuggingFace (a minimal example appears at the end of this list). The linked tutorial is a good introduction to start playing with tokenizers!

  • Tokenization is tricky business in practice, both in terms of developing a vocabulary and in terms of deciding how to tokenize a sequence when that is ambiguous. BPE handles both by incrementally building a vocabulary through merging frequent adjacent symbol pairs, and then applying those merges in the order they were learned during training (a toy version of this loop appears at the end of this list). Different tokenization algorithms vary in how they execute both steps, so pay attention to this if you’re looking into different tokenization schemes!

  • The relationship between tokenization and morphological decomposition (i.e., breaking words into meaningful units) is complex. You can look at how weird real tokenizations end up in this paper, or you can check how aspects of tokenization map onto psychological measures of word processing in this paper or this paper.
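
As promised above, here is a minimal sketch of playing with an off-the-shelf tokenizer. It assumes you have the transformers library installed; GPT-2's tokenizer is just a convenient example of a trained BPE vocabulary:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # downloads GPT-2's BPE vocabulary and merge list

# The subword pieces, and the corresponding vocabulary ids.
print(tok.tokenize("Tokenization is tricky business in practice"))
print(tok("Tokenization is tricky business in practice").input_ids)
```

And here is a toy version of the BPE training loop described above: repeatedly count adjacent symbol pairs, merge the most frequent one, and record the merges in order. The corpus and the number of merges are invented for illustration, and real implementations are much more careful about efficiency:

```python
import re
from collections import Counter

def pair_counts(vocab):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def apply_merge(pair, vocab):
    """Merge one symbol pair everywhere it occurs, respecting symbol boundaries."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    new_symbol = "".join(pair)
    return {pattern.sub(new_symbol, word): freq for word, freq in vocab.items()}

# Words represented as space-separated symbols, with an end-of-word marker.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}

merges = []
for _ in range(10):                     # learn 10 merges on this toy corpus
    counts = pair_counts(vocab)
    best = max(counts, key=counts.get)  # most frequent adjacent pair
    merges.append(best)                 # at tokenization time, apply merges in this order
    vocab = apply_merge(best, vocab)

print(merges)
print(vocab)
```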

CFGs and Parsing

  • One issue we didn’t talk much about was how we get probabilities for a CFG (or even build a CFG in the first place). The empirical answer is from treebanks: we have linguists annotate (or create annotation guidelines for) a large corpus, and then we estimate the probabilities so as to, say, maximize the likelihood of the grammar generating the treebank. The most common treebank for English is the Penn Treebank, a small portion of which is accessible in NLTK (as I believe HW0 demonstrated); a short sketch of estimating a PCFG from it appears after this list.

  • If you don’t have any parses available, the task of determining the grammar is called grammar induction, and it is a hard problem! It is also quite neat because it mirrors (in some way) one possibility for how humans come to acquire a mental grammar (i.e., rules about what is and isn’t a good sentence of our languages), since we know this even before we might have taken a class that teaches us a prescriptive grammar! Something like Klein & Manning (2001) is probably a good start to get thinking about that!

  • CFGs are one of many kinds of grammars one can use to model language. Others include Combinatory Categorial Grammar, or CCG, which is both widely used in NLP and backed by serious linguistic theorizing; Dependency Grammars, which eschew constituency as an idea in favor of directly modeling grammatical relations (like agreement, subject-hood, etc.); and Tree-Adjoining Grammars (TAG), which are a little bit more linguistics-facing. To get a lot more linguistics-facing, you can look at a formalism like Minimalist Grammar, which directly attempts to encode syntactic theorizing from linguistics in a formal grammar system. This unfortunately means that the link I provide is both the most accessible intro I could find and highly technical!

    • One thing notable about CCG and TAG is that they are more expressive than CFGs: there are some (formal) languages that can be defined by CCG/TAG but cannot be defined by any CFG. For this reason, the languages they define are called (Mildly) Context-Sensitive Languages. This notion is above and beyond an undergraduate curriculum, but these are the kinds of questions we get interested in as we talk about formal language theory in a class like Theory of Computation (offered this Spring (2025)!). It is also worth noting that some human languages have been shown not to be expressible by context-free grammars, most notably Swiss-German and Dutch (the classic arguments come from cross-serial dependencies), so modeling human linguistic competence with CFGs is already a futile game.
  • The broad issue of understanding what the right structure underlying language is falls within the domain of syntactic theory, a field in linguistics. Sans taking a full linguistics class, the reference I recommend is the one I was taught out of – Carnie’s Syntax: A Generative Introduction. One thing you should know about linguistics is that everything in syntax is highly controversial, especially the ideas behind generative syntax stemming from Chomsky’s research program. To avoid claims of impropriety, I will also provide references for alternative approaches, like Optimality-Theoretic Syntax or HPSG.

  • Parsing is a broad issue that spans Formal Language Theory, NLP/Linguistics, and Compiler Design. As a result, a variety of parsing strategies beyond what we’ve looked at are relevant in different fields. The most notable missing pieces are Shift-Reduce Parsing, a simple bottom-up strategy that finds use both in neural NLP and in compilers; Earley Parsing (Earley 1970, Stolcke 1995), an incremental parsing algorithm that’s often used to compute prefix probabilities and thereby make PCFGs usable as conditional language models (a small demo appears after this list); and Left-Corner Parsing, a strategy that merges some of the ideas behind top-down and bottom-up parsing to build more efficient parsers (like the left-corner filter in Stolcke’s Earley Parsing implementation!).

    • Earley Parsing in particular is influential in computational linguistics. Hale (2001) proposes using a probabilistic Earley parser as a model of human sentence processing (or at least of the processing of Garden Path sentences!). The critical factor here is incrementality: being able to build partial parses before you have seen the full string, since you receive your input strictly from left to right.
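
To make the treebank bullet above concrete, here is a sketch of estimating a PCFG by relative frequency from the Penn Treebank sample bundled with NLTK. It assumes you have already run nltk.download("treebank"); this is the same corpus HW0 poked at:

```python
import nltk
from nltk.corpus import treebank

# Collect every rule used in the sample's parse trees.
productions = []
for tree in treebank.parsed_sents():
    productions += tree.productions()

# induce_pcfg gives each rule its relative frequency among rules sharing a left-hand side,
# i.e., the maximum-likelihood estimate of the rule probabilities.
grammar = nltk.induce_pcfg(nltk.Nonterminal("S"), productions)
print(grammar.productions(lhs=nltk.Nonterminal("S"))[:10])
```

And since Earley parsing came up twice, here is a small demo using NLTK's Earley chart parser on a toy grammar with a classic attachment ambiguity. The grammar and sentence are invented for illustration; setting trace=1 prints the chart as it fills from left to right, which is exactly the incrementality Hale's proposal leans on:

```python
import nltk
from nltk.parse import EarleyChartParser

toy_grammar = nltk.CFG.fromstring("""
    S  -> NP VP
    NP -> Det N | NP PP
    VP -> V NP | VP PP
    PP -> P NP
    Det -> 'the' | 'a'
    N  -> 'dog' | 'park' | 'telescope'
    V  -> 'saw'
    P  -> 'in' | 'with'
""")

parser = EarleyChartParser(toy_grammar, trace=1)
for tree in parser.parse("the dog saw a dog in the park".split()):
    tree.pretty_print()   # two parses: the PP attaches to the VP or to the object NP
```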
