2.3.5 Statistical Machine Translation Systems
In the early 1990s, computers began to offer enough speed and memory for more complex methods, and a new type of machine translation system emerged: statistical machine translation, a modern approach that promises greater flexibility. The first software based on statistical methods was IBM's CANDIDE.

The system is built on corpora of sample translations, and the quality of its output grows with the size of those corpora: the more distinct translation examples the corpora include, the more clues and options can be found when comparing against the actual source text. The system looks for analogous translations in its translation memory and filters the matches. At its core it consists of an enormous quantity of human translation samples, from which it must select the best sample translation on the basis of linguistic devices, verb forms and prepositions.

The fundamental idea is modeled on the natural way of learning a language. Young children listen to dialogs and observe their environment; they learn the language best by absorbing as much input as possible. In the same way, statistical translation systems require neither ordinary dictionaries nor linguistic rules, but a kind of training: fed with a vast amount of correct human translations, they are expected to extract vocabulary items, analogies, rules and linguistic devices on their own (a toy illustration of this learning process is sketched at the end of this section). Applied to texts on a specific topic, such a system can produce useful output that needs only a few corrections.

CANDIDE's task was the translation of the English-French records of the Canadian Parliament, and it convinces with mostly semantically correct translations. The software was "trained" on the so-called Canadian Hansard corpus, a large collection of Canadian parliamentary debates (1986-1993) in French and English comprising about 26 million word records. Today this corpus contains 32 million words, and even today's computers need several hours to translate a text because of the many complex statistical and probability calculations involved in the long learning process. Another disadvantage is that large corpora are rare and require a great deal of human translation work. Still, the encouraging results hold out excellent prospects.

That is why Google Inc., the company that probably holds more text than any other, has been collecting data since 1998. Last year the company disclosed that it was working on a statistical machine translation system in the Google Labs. At the moment the Google translation service still uses an older version of Systran, a mixture of the interlingua and direct machine translation approaches. But once the Google developers consider their corpus large enough, the translation tool is expected to deliver results of considerably higher quality.
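To make the learning process described above concrete, the following is a minimal sketch of IBM Model 1, the simplest of the word-based statistical models developed by the IBM group behind CANDIDE. It estimates word-translation probabilities t(f|e) purely from sentence-aligned example translations using expectation-maximization; the three-sentence corpus, the number of iterations and all identifiers are invented for illustration, and a real system trains on millions of sentence pairs and layers further models on top of this one.

    # Toy illustration of the statistical idea behind systems such as CANDIDE:
    # IBM Model 1 learns word-translation probabilities t(f|e) from nothing
    # but sentence-aligned translations, via expectation-maximization (EM).
    # The corpus below is invented for demonstration purposes.
    from collections import defaultdict

    corpus = [
        ("the house".split(), "la maison".split()),
        ("the blue house".split(), "la maison bleue".split()),
        ("the flower".split(), "la fleur".split()),
    ]

    # Uniform initialization of t(f|e) over the French vocabulary.
    f_vocab = {f for _, fs in corpus for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))

    for _ in range(10):  # EM iterations
        count = defaultdict(float)   # expected counts c(f, e)
        total = defaultdict(float)   # expected counts c(e)
        for e_sent, f_sent in corpus:
            for f in f_sent:
                # E-step: how strongly f is explained by each English word.
                z = sum(t[(f, e)] for e in e_sent)
                for e in e_sent:
                    delta = t[(f, e)] / z
                    count[(f, e)] += delta
                    total[e] += delta
        # M-step: re-estimate t(f|e) from the expected counts.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]

    # After training, the model has extracted word correspondences on its
    # own, e.g. t("maison" | "house") approaches 1.
    print(t[("maison", "house")], t[("fleur", "flower")])

After a few iterations the probability mass concentrates on the correct correspondences even though the program was never given a dictionary, which is exactly the sense in which such systems are said to learn vocabulary from examples alone.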