Subword Splitting

In a recent case study, Sennrich and Zhang (2019) develop a set of best practices for low-resource neural machine translation and show that those practices can achieve better translation quality than phrase-based statistical machine translation on a 100,000-word dataset derived from the 2014 German-English IWSLT.

Among their best practices, they suggest using a smaller neural network with fewer layers, smaller batch sizes and larger dropout parameters. Their largest improvements in translation quality (as measured by BLEU score) came from applying a byte-pair encoding that reduced the vocabulary from 14,000 words to 2,000 words.

The best neural model that they developed with that 100,000-word dataset achieved a BLEU score of 16.6 on German-to-English translation, while their phrase-based statistical model scored 15.9.

For comparison, just two years earlier, with a 377,000 word English-to-Spanish dataset, Koehn and Knowles (2017) only obtained a BLEU score of 1.6 with a neural model, but 16.4 with a phrase-based statistical model.

Although the languages are different, the comparison seems valid because the better results required far less parallel text and because both pairs of researchers used recurrent neural networks. On the next page, we show better results with the Transformer model.

And in general, the way that subword splitting improves translation quality gives me hope that neural machine translation will soon be available for all the world's languages.

Specifically, Sennrich, Haddow and Birch (2016) developed a byte-pair encoding algorithm which replaces the fixed vocabulary of the usual model with a vocabulary of "subwords."
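To make the idea concrete, here is a minimal sketch of how byte-pair encoding can learn merge operations from word counts. It is an illustration under simplifying assumptions, not the reference implementation: the function name learn_bpe, the "</w>" end-of-word marker, the tie-breaking rule and the toy word counts are all our own choices.

```python
from collections import Counter

def learn_bpe(word_counts, num_merges):
    """Learn up to num_merges merges from a {word: count} dictionary (toy sketch)."""
    # Start with each word as a sequence of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += count
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken alphabetically (an assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the chosen pair.
        new_vocab = {}
        for symbols, count in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + count
        vocab = new_vocab
    return merges, vocab

# Hypothetical counts for three forms of the verb "parrari" (to speak).
merges, vocab = learn_bpe({"parru": 5, "parri": 4, "parra": 3}, num_merges=3)
print(merges)
print(sorted(vocab))
```

With these toy counts, three merges are enough for the shared stem "parr" to become a single symbol, while the endings remain separate symbols.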

For example, the English present tense only has two forms: "speak" and "speaks." By contrast, Sicilian has six different forms for the present tense. But splitting them into subwords:

Sicilian        English
parr + u        I      speak
parr + i        you    speak
parr + a        he     speak + s
parr + amu      we     speak
parr + ati      you    speak
parr + anu      they   speak

yields something closer to English: "parr" matches "speak" and the Sicilian verb endings match the English pronouns.

Diminutives and augmentatives can be similarly expressed as sequences of subword units:

Sicilian              English
jatt + u              cat
jatt + ar + eddu      little  cat
banc + u              bench
banc + ar + eddu      little  bench

Subword splitting allows us to represent many different word forms with a much smaller vocabulary, which lets the translator learn rare words and unknown words. Even if "jo manciu" ("I eat") never appears in the dataset, the translator can still learn it from forms that do appear, such as "jo parru" ("I speak") and "iddu mancia" ("he eats").
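The following sketch illustrates why an unseen form can still be represented. It assumes a tiny, hand-picked subword vocabulary and uses a greedy longest-match segmenter, which is a simplification of how learned byte-pair merges are actually applied. The function name segment and the vocabulary are hypothetical.

```python
def segment(word, subwords):
    """Split a word into the longest known subwords, left to right (toy sketch)."""
    pieces, i = [], 0
    while i < len(word):
        # Try the longest remaining prefix first, then shorter ones.
        for j in range(len(word), i, -1):
            if word[i:j] in subwords:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No known subword starts here, so fall back to a single character.
            pieces.append(word[i])
            i += 1
    return pieces

# Hypothetical vocabulary, as if learned from "parru", "parra" and "mancia".
subwords = {"parr", "manci", "u", "a"}
print(segment("manciu", subwords))  # ['manci', 'u']
```

The word "manciu" is not in the vocabulary as a whole, but both of its pieces are, so the model has already seen each piece in other contexts.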

With a vocabulary of 1500 subwords, the sentence: "Carinisi are dogs!" gets tokenized and split into:

car@@ in@@ isi are dogs !

which is translated into Sicilian as:

cani car@@ in@@ isi !

and detokenized into: "Cani carinisi!"
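The last step above can be sketched in a few lines. The "@@" marker followed by a space signals that a word continues, so removing it rejoins the subwords. The detokenizer here is a deliberately naive stand-in that we wrote for illustration (real pipelines use a dedicated detokenizer and recasing step), and both function names are our own.

```python
import re

def join_subwords(text, marker="@@"):
    """Undo subword splitting by removing each 'marker + space' continuation."""
    return text.replace(marker + " ", "")

def naive_detokenize(text):
    """Rough stand-in: reattach punctuation and capitalize the first letter."""
    text = re.sub(r"\s+([!?.,;:])", r"\1", text)
    return text[:1].upper() + text[1:]

joined = join_subwords("cani car@@ in@@ isi !")
print(joined)                    # cani carinisi !
print(naive_detokenize(joined))  # Cani carinisi!
```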

One innovation that greatly improved translation quality was to bias the learned subword vocabulary towards the desinences one finds in a textbook. Specifically, we added a unique list of words from the Dieli Dictionary and the inflections of verbs, nouns and adjectives from Chiù dâ Palora to the Sicilian data.

Because each word was only added once, none of them affected the distribution of whole words. But once the words were split, they greatly affected the distribution of subwords, filling it with stems and suffixes. So the subword vocabulary that the machine learns is similar to the theoretical stems and desinences of a textbook.
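A toy example may help show the effect. The sketch below appends each listed word once to a small corpus and then counts word endings as a rough proxy for the suffix statistics that byte-pair encoding would see. The corpus, the word list and both helper functions are hypothetical, and counting the last three characters is only an approximation of the real pair statistics.

```python
from collections import Counter

def add_word_list(corpus_tokens, word_list):
    """Return the corpus tokens plus each distinct listed word, added exactly once."""
    return list(corpus_tokens) + sorted(set(word_list))

def suffix_counts(tokens, n=3):
    """Count the final n characters of each token longer than n characters."""
    return Counter(tok[-n:] for tok in tokens if len(tok) > n)

corpus = ["parru", "parru", "parru", "cani"]
word_list = ["parramu", "manciamu", "parramu"]  # the duplicate is dropped

augmented = add_word_list(corpus, word_list)

# Whole-word counts barely move: "parru" is still the dominant word.
print(Counter(augmented).most_common(1))

# The ending "amu" was absent before and is now shared by two words.
print(suffix_counts(corpus)["amu"], suffix_counts(augmented)["amu"])
```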

And since subword splitting appears to be an effective tool for developing a neural machine translator for the Sicilian language, we will continue assembling parallel text and hope to present a better quality translator soon. In the meantime, you can see the results of this experiment at Napizia.

Copyright © 2002-2024 Eryk Wdowiak