How do you translate a sentence into another language? Do you translate the words or do you translate the meaning?
When you look up a word in the dictionary, what are you hoping to find? Its most common translation? An example? Or a definition of the word?
And using that information, what do you translate: the words or the meaning?
How should a machine translate a sentence into another language?
Should the machine look up the words in a dictionary and translate them according to a set of rules? If so, how should we develop that dictionary and how should we write those rules?
Or should the machine create a target-language sentence with the same meaning as the source-language sentence? If so, how should the machine learn the meaning of the source-language sentence and how should it create a target-language sentence with the same meaning?
The first systems to translate text from one language to another were based on rules and dictionaries. A dictionary translated words in the source language to words in the target language, while a system of rules arranged the words into grammatical order.
Such systems inherently require large dictionaries and long lists of rules (and are therefore very complex), but projects like Apertium show that the approach is feasible.
But complications arise with words like "know." Romance languages like Sicilian distinguish between "sapiri" ("know how" or "know something") and "canusciri" ("know someone"). Slavic languages like Polish distinguish between "umieć" ("know how"), "wiedzieć" ("know something") and "znać" ("know someone").
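To make the problem concrete, here is a minimal sketch of a dictionary lookup for "know". The dictionary structure, the function name `translate_word` and the `sense` argument are our own illustrative assumptions, not part of Apertium or any other system. The only real data are the Sicilian glosses quoted above.

```python
# Toy bilingual dictionary (hypothetical structure, for illustration only).
# The senses and Sicilian verbs are the ones given in the text above.
DICTIONARY = {
    "know": {
        "know how": "sapiri",
        "know something": "sapiri",
        "know someone": "canusciri",
    },
}

def translate_word(word, sense=None):
    """Return the candidate translations of a single word.

    Without a sense, every distinct candidate is returned, because the
    dictionary alone cannot tell which one is meant.
    """
    senses = DICTIONARY.get(word)
    if senses is None:
        return [word]  # unknown words are passed through unchanged
    if sense is not None:
        return [senses[sense]]
    # dict.fromkeys removes duplicates while preserving order
    return list(dict.fromkeys(senses.values()))

candidates = translate_word("know")
chosen = translate_word("know", sense="know someone")
print(candidates)
print(chosen)
```

A word-by-word lookup returns two candidates for "know". Only a rule that inspects the rest of the sentence (for example, whether the object is a person) can pick between them, and that is the kind of rule that has to be written by hand.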
Because a single word can have many meanings, an alternative to translating the word is to translate a phrase. But given the infinite number of ways that words can come together to form a phrase, it's impossible to write all of the rules necessary to translate phrase-by-phrase.
So the next generation of models began using statistical methods to train the machine to learn the rules of translation. The Moses SMT system is a good example of this method.
As its phrase-based statistical model learns, it creates tables of phrases, how those phrases were translated and how often. It then uses those tables to translate whole sequences of words. An alternative syntax-based model learns to translate syntactic units, instead of single words or phrases.
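The idea of a phrase table can be sketched in a few lines. This is a simplified illustration, not how Moses is implemented: a real system extracts phrase pairs from word-aligned parallel text and combines several scores. The function names and the observation counts below are invented for the example.

```python
from collections import Counter, defaultdict

def build_phrase_table(pairs):
    """Count how often each source phrase was translated each way,
    then turn the counts into relative frequencies."""
    counts = defaultdict(Counter)
    for source, target in pairs:
        counts[source][target] += 1
    table = {}
    for source, target_counts in counts.items():
        total = sum(target_counts.values())
        table[source] = {t: n / total for t, n in target_counts.items()}
    return table

def most_likely(table, source):
    """Pick the translation that was observed most often."""
    options = table[source]
    return max(options, key=options.get)

# Made-up observations: "to know" seen three times as "sapiri"
# and once as "canusciri".
observed = [("to know", "sapiri")] * 3 + [("to know", "canusciri")]

phrase_table = build_phrase_table(observed)
best = most_likely(phrase_table, "to know")
print(phrase_table["to know"])
print(best)
```

Notice that the table always prefers the most frequent translation. It answers "how was this phrase usually translated?" and has no way to ask what the phrase means.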
Such statistical methods worked well enough to power the early versions of Google Translate, but what we really want to translate is meaning. So neural machine translation models, like the current Google Translate and Sockeye (which we used to train our Tradutturi Sicilianu), replace the tables of phrases or syntax with word embeddings, which attempt to capture the meaning of a word by identifying other words that often appear in the same context.
So instead of asking: "What's the most common translation of this phrase?" the neural approach asks: "What is the meaning of this sentence? And how can I create a sentence in the other language with the same meaning?"
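The intuition behind word embeddings, that words appearing in similar contexts have similar meanings, can be illustrated with simple co-occurrence counts. This is only a sketch of the intuition: a neural model learns dense vectors during training instead of counting neighbors, and the tiny English corpus, the window size and the function names here are our own assumptions.

```python
import math
from collections import Counter, defaultdict

def context_vectors(sentences, window=2):
    """For each word, count the words that appear within `window`
    positions of it. The counts act as a crude meaning vector."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Invented toy corpus: "cat" and "dog" share contexts, "book" does not.
corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat eats fish",
    "the dog eats meat",
    "she reads the book",
]

vecs = context_vectors(corpus)
sim_cat_dog = cosine(vecs["cat"], vecs["dog"])
sim_cat_book = cosine(vecs["cat"], vecs["book"])
print(sim_cat_dog, sim_cat_book)
```

In this toy corpus "cat" comes out closer to "dog" than to "book", even though nobody told the machine that cats and dogs are both animals. It inferred the similarity from the company the words keep.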
It's better than phrase-based translation because the machine no longer needs to have seen the whole phrase. It can assemble a translation from smaller pieces. For example, if we split words into subword units, then the machine can identify verb stems and verb endings in a source-language sentence. And, as it creates the target-language sentence, it can select (from the subword vocabulary) a verb stem, a verb ending and the other words needed to form the sentence's predicate, creating phrases that it did not observe in the training data.
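Here is a minimal sketch of splitting verbs into a stem and an ending. It uses a greedy longest-match rule and a hand-picked subword vocabulary, both of which are assumptions made for illustration. In practice the subword vocabulary is learned from the training data, for example with byte-pair encoding, and the splits do not always fall on stem boundaries.

```python
def segment(word, vocab):
    """Split a word into known subword units, always taking the
    longest unit that matches at the current position."""
    pieces = []
    start = 0
    while start < len(word):
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No known unit starts here: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces

# Hypothetical subword vocabulary: two verb stems and one infinitive ending.
SUBWORDS = {"sap", "canusc", "iri"}

split_sapiri = segment("sapiri", SUBWORDS)
split_canusciri = segment("canusciri", SUBWORDS)
print(split_sapiri)
print(split_canusciri)
```

Both verbs share the unit "iri", so whatever the model learns about that ending from one verb is available when it builds the other.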
The following pages seek to explain this method. The next page will discuss our sources of parallel text. The page after that will explain how we preprocess the text with subword splitting. And then we'll explain how we trained our Tradutturi Sicilianu in our recipe for low-resource NMT.
Copyright © 2002-2025 Eryk Wdowiak