Reverse Training Strategy

The strategy described below reverses the conventional approach to training a Transformer model. The final implementation follows the same sequence of steps that Radford et al. (2018) proposed – pre-training then fine-tuning. The difference is that we think about the steps in reverse order.

First, we develop the dataset that we'll use for fine-tuning and we train an initial model on that dataset. Then we pre-train a model that will provide a good starting point for the subsequent fine-tuning.

Observing that "most deep learning methods require substantial amounts of manually labeled data" (e.g. parallel text), the lack of which poses an obstacle in low-resource settings, Radford et al. created a general task-agnostic model that could "learn effectively from raw text."

Our experience has shown that one can find or create the necessary data. On these pages, we have developed techniques to work effectively with limited resources. And in the case of Sicilian, we have a very specific goal in mind – to create a translator.

Nonetheless, to reach a high level of translation quality, we also had to find a creative way to train our model to "learn effectively from raw text," so we took the insights of Radford et al. and (inadvertently) thought about them in reverse order.

And we're happy with the reverse order because the best pre-trained model is your own pre-trained model – the one that you developed.

You will get a much better result when you fine-tune your own pre-trained model because designing your own model allows you to select the model size, tokenization and subword splitting that best meet your needs.
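
Subword splitting is a good example of such a choice. The sketch below is illustrative only, with a hypothetical corpus path and SentencePiece standing in for whichever subword tool you prefer; vocabulary size, character coverage and splitting method are all decisions you get to make when you build your own model.

    import sentencepiece as spm

    # Illustrative only: train a joint subword model on your own text,
    # choosing the vocabulary size and splitting method yourself.
    # "corpus.scn-eng-ita.txt" is a hypothetical file of raw sentences.
    spm.SentencePieceTrainer.train(
        input="corpus.scn-eng-ita.txt",
        model_prefix="subword_joint",
        vocab_size=8000,              # a small vocabulary suits a low-resource pair
        character_coverage=1.0,       # keep every accented Sicilian character
        model_type="bpe",             # or "unigram"
    )

    sp = spm.SentencePieceProcessor(model_file="subword_joint.model")
    print(sp.encode("Bon jornu a tutti!", out_type=str))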

BLEU scores at each training stage

direction    stage 1: forward    stage 2: pre-train    stage 3: fine-tune
Eng→Scn            38.2                34.7                  45.1
Scn→Eng            44.5                45.8                  48.6
Ita→Scn            59.6                51.3                  61.4
Scn→Ita            60.2                61.1                  62.9
Ita→Eng            48.2                48.0                  48.2
Eng→Ita            47.1                46.9                  46.7
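
Scores like the ones above can be computed with a tool such as sacrebleu. A minimal sketch, with hypothetical file names and one detokenized sentence per line:

    import sacrebleu

    # Hypothetical file names: one detokenized sentence per line.
    with open("hypotheses.eng-scn.txt") as f:
        hypotheses = [line.strip() for line in f]
    with open("reference.scn.txt") as f:
        references = [line.strip() for line in f]

    bleu = sacrebleu.corpus_bleu(hypotheses, [references])
    print(f"BLEU = {bleu.score:.1f}")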

And in practice, we divided training into three stages. We thought about them in reverse order.

first step:  think about the last stage

In the case of Sicilian, our goal was to create a neural machine translator for the Sicilian language. When we started, back in 2019, there was no publicly available collection of Sicilian-English or Sicilian-Italian parallel text.

Given those circumstances, there was no model that had been pre-trained on Sicilian text. So, as explained on the low-resource NMT page, we trained an initial model with the parallel text that we collected from issues of Arba Sicula. Those issues provided the parallel text for what later became our fine-tuning set.

For the purposes of this page, what's important to notice is that we thought about our fine-tuning set before we thought about pre-training. When we wrote our two papers – Wdowiak (2021) and Wdowiak (2022) – there was no large collection of Sicilian language text that we could use for pre-training.

second step:  think about the second stage

Then, four days after publication of our paper, Facebook announced "No Language Left Behind" (2022) and suddenly a large Sicilian-English dataset became available (thanks to Allen AI).

At last, we could think about pre-training a translation model with Sicilian text. And we spent a lot of time thinking about it, because the NLLB Sicilian-English collection contains only a handful of good translation pairs.

There are about one million Sicilian-English sentence pairs in the NLLB dataset, but very few of them (less than five percent) are translations of each other. And many of the Sicilian sentences are written in local dialect. For a translation model, we need sentences written in the Sicilian literary language.

But at least we had a large collection of potential Sicilian sentences, which we could use for back-translation (Sennrich, Haddow and Birch, 2015). So we translated them all into English and scored the resulting pairs on the task of English-to-Sicilian translation.

We scored each pair on the task of English-to-Sicilian translation because a sentence written in Standard Sicilian will score better than one written in local dialect, which allowed us to identify the sentences written in the literary standard.
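
The sketch below illustrates the idea with a generic encoder-decoder from the transformers library; the model path and example pair are hypothetical, and this is an outline of the technique rather than our actual pipeline. Each candidate pair is scored by the average per-token log-likelihood that an English-to-Sicilian model assigns to the Sicilian sentence, given its English back-translation as the source.

    import torch
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    # Hypothetical checkpoint: an English-to-Sicilian model trained earlier.
    MODEL = "path/to/eng-scn-model"
    tok = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForSeq2SeqLM.from_pretrained(MODEL).eval()

    @torch.no_grad()
    def pair_score(english: str, sicilian: str) -> float:
        """Average per-token log-likelihood of the Sicilian sentence,
        given its English back-translation as the source."""
        src = tok(english, return_tensors="pt")
        labels = tok(text_target=sicilian, return_tensors="pt").input_ids
        loss = model(**src, labels=labels).loss   # mean cross-entropy per target token
        return -loss.item()

    # Keep the best-scoring pairs: sentences in the literary standard tend to
    # score higher under such a model than local-dialect spellings do.
    pairs = [("Good morning to everyone.", "Bon jornu a tutti.")]
    ranked = sorted(pairs, key=lambda p: pair_score(*p), reverse=True)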

This back-translation process yielded a dataset of 750,000 pairs to simulate English-to-Sicilian translation. A similar process, which also included a portion of the ParaCrawl data (described below), yielded 750,000 pairs in the other five translation directions.

Together, these pairs provided data for pre-training, which is the second stage of our training strategy.

last step:  think about the first stage

The last step is to prepare a dataset for initial training, so that the pre-training stage (second stage) begins from a good set of parameters, not from a random initialization. For lack of a better term, we'll call this the "forward training" stage because we used forward-translations to simulate a larger Sicilian language dataset.

Starting from a random initialization, we trained a bi-directional model on 37 million English-Italian sentence pairs from ParaCrawl, version 7.1. Then we further trained this initial model on forward-translations to Sicilian.

Specifically, we translated all 37 million Italian sentences into Sicilian and scored them on the task of Italian-to-Sicilian translation. Using the best four million forward-translations, we further trained the initial model in all six translation directions, so that the next training stage (the pre-training stage) could make better use of the available Sicilian language text.
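
In practice, that selection step amounts to sorting the scored forward-translations and keeping the top of the list. A minimal sketch, with a hypothetical tab-separated file layout:

    import csv

    N_KEEP = 4_000_000

    # Hypothetical layout: one "italian <tab> sicilian <tab> score" per line.
    with open("forward_translations.scored.tsv", newline="") as f:
        rows = list(csv.reader(f, delimiter="\t"))

    rows.sort(key=lambda r: float(r[2]), reverse=True)   # best scores first

    with open("forward_translations.top4M.tsv", "w", newline="") as f:
        csv.writer(f, delimiter="\t").writerows(rows[:N_KEEP])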

putting it all together

The first training stage used synthetic Sicilian. The second training stage used internet Sicilian. And, saving the best for last, the third training stage used issues of Arba Sicula.
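
In code, the whole strategy boils down to warm-starting each stage from the previous stage's parameters. The sketch below outlines the idea with the transformers Trainer API; the stage names, data directories, starting checkpoint and load_stage_dataset helper are all hypothetical, and this is an outline of the idea rather than our actual training scripts.

    from datasets import Dataset
    from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                              Seq2SeqTrainer, Seq2SeqTrainingArguments)

    def load_stage_dataset(data_dir, tok):
        """Hypothetical helper: in reality, read the stage's parallel text
        from data_dir; here, a single toy pair keeps the sketch runnable."""
        src, tgt = ["Good morning."], ["Bon jornu."]
        enc = tok(src, text_target=tgt, truncation=True)
        return Dataset.from_dict(dict(enc))

    # Hypothetical stage names and data directories.
    STAGES = [
        ("stage1_forward",  "data/forward_translations"),  # synthetic Sicilian
        ("stage2_pretrain", "data/backtranslated_nllb"),   # internet Sicilian
        ("stage3_finetune", "data/arba_sicula"),           # Arba Sicula issues
    ]

    checkpoint = "path/to/initial-eng-ita-model"   # hypothetical starting point

    for name, data_dir in STAGES:
        tok = AutoTokenizer.from_pretrained(checkpoint)
        model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
        train_ds = load_stage_dataset(data_dir, tok)
        trainer = Seq2SeqTrainer(
            model=model,
            args=Seq2SeqTrainingArguments(output_dir=name, num_train_epochs=1),
            train_dataset=train_ds,
        )
        trainer.train()
        trainer.save_model(name)
        tok.save_pretrained(name)
        checkpoint = name        # the next stage starts from these parameters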

In the 13th century, the Sicilian School of Poets at the imperial court of Frederick II created the first literary standard in Italy. We hope our Tradutturi Sicilianu will help us create new Sicilian literature in the 21st century.
