Doubling Up

Monday, September 29, 2008 at 08:06:00 PM



Machine translation is hard. Natural languages are so complex and have
so many ambiguities and exceptions that teaching a computer to
translate between them turned out to be a much harder problem than
people thought when the field of machine translation was born over 50
years ago. At Google Research, our approach is to have the machines
learn to translate by using learning algorithms on gigantic amounts of
monolingual and translated data. Another knowledge source is user
suggestions. This approach allows us to constantly improve the
quality of machine translations as we mine more data and
get more and more feedback from users.

A nice property of the learning algorithms that we use is that they
are largely language independent -- we use the same set of core
algorithms for all languages. This means that if we find a lot of
translated data for a new language, we can just run our algorithms and
build a new translation system for that language.

As a result, we were recently able to significantly increase the number of
languages on translate.google.com. Last week, we launched eleven new
languages: Catalan, Filipino, Hebrew, Indonesian, Latvian, Lithuanian, Serbian,
Slovak, Slovenian, Ukrainian, and Vietnamese. This increases the
total number of languages from 23 to 34. Since we offer translation
between any two of those languages, this increases the number of language
pairs from 506 to 1122 (well, depending on how you count simplified
and traditional Chinese you might get even larger numbers). We're very
happy that we can now provide free online machine translation for many
languages that didn't have any available translation system before.
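
The pair counts above come from counting ordered (source, target) pairs: with n languages, each one can be translated into the other n - 1, giving n * (n - 1) pairs. Here is a minimal sketch in Python that checks the arithmetic. The function name `ordered_language_pairs` is made up for illustration and is not part of any Google Translate API.

```python
def ordered_language_pairs(n: int) -> int:
    """Count (source, target) pairs among n languages, with source != target.

    Hypothetical helper for illustration only: each of the n languages
    can be translated into each of the other n - 1 languages.
    """
    return n * (n - 1)


before = ordered_language_pairs(23)  # languages before the launch
after = ordered_language_pairs(34)   # languages after the launch
print(before, after)
```

Counting English to Ukrainian and Ukrainian to English as separate pairs is what gives 506 and 1122. If a pair were counted only once regardless of direction, both numbers would be halved.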

So how far can we go with adding new languages in the future? Can we
go to 40, 50 or even more languages?  It is certainly getting harder,
as less data is available for those languages and as a result it is
harder to build systems that meet our quality bar.  But we're working
on better learning algorithms and new ways to mine data and so even if
we haven't covered your favorite language yet, we hope that we will have
it soon.

6 comments:

D_K said...

The quality of your machine translator has become amazing over time. Do you train Li<->Lj, i != j, i.e. every pair? For N languages you need N*N pairs and there are "At least 500 (But that’s just in Northern Italy)." [1]

Thank you for your project.

References
[1] Stephen R. Anderson: "How Many Languages Are There in the World?": http://www.lsadc.org/info/pdf_files/howmany.pdf (accessed: September, 2008)

mb said...

Do you use (bi|multi)lingual open source data like OPUS (http://urd.let.rug.nl/tiedeman/OPUS/) ?

Do you try to fetch and align multilingual websites (most big companies have their corporate websites translated into different languages)? Also, I was wondering whether it is legal to use data from these websites as input for your algorithms and then use the results of the learning to give (or sell) translation services?

Are there any (bi|multi)lingual aligned corpora available for purchase?

ymerej said...

What are your considerations for determining if translated bilingual texts contributed are acceptable? I'm interested in locating some for the two official languages of New Zealand (English and Maori) for use by the Google Translate team. Do they need to be in more than two languages for you to consider them?

Kind Regards,

Jeremy

vpetro said...

By the way, your English to Ukrainian translation mixes Ukrainian and Russian in the translated sentence. :)

Marjory said...

Translation is hard even if you're not a machine. You're doing a pretty good job.