Tuesday, December 30, 2008
At Google, we like search. So it's no surprise that we treat language translation as a search problem. We build statistical models of how one language maps to another (the translation model) and models of what the target language is supposed to look like (the language model), and then we search for the best translation according to those models (combined into one big log-linear model, for those of you taking notes).
But, just as putting all of your money in the investment with the highest historical return is not always the best idea, choosing the translation with the highest probability is not always the best idea either - especially when you have a relatively flat distribution among the top candidates. Instead, we can use the Minimum Bayes Risk (MBR) criterion. Essentially, we look at a sample of the best candidate translations (the so-called n-best list) and choose the safest one: the one with the lowest expected 'damage' when each of the other candidates is treated as a possible correct answer and weighted by its probability (where 'damage' is defined by our measurement of translation quality). You might want to view this as choosing a translation that is a lot like the other good translations instead of choosing that strange one that happened to get the best model score.
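For those of you taking notes, here is a minimal sketch of the n-best version of that idea in Python. It is an illustration, not our production system: the function names, the toy n-best list, and its probabilities are all made up, and the similarity measure is a simple unigram F1 standing in for a real translation quality metric.

```python
from collections import Counter


def overlap_gain(hypothesis, reference):
    """Toy similarity: F1 over unigram counts (an assumed stand-in for a real metric)."""
    hyp = Counter(hypothesis.split())
    ref = Counter(reference.split())
    matches = sum((hyp & ref).values())  # clipped count of shared words
    if matches == 0:
        return 0.0
    precision = matches / sum(hyp.values())
    recall = matches / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def map_decode(nbest):
    """Pick the candidate with the highest model probability."""
    return max(nbest, key=lambda pair: pair[1])[0]


def mbr_decode(nbest, gain=overlap_gain):
    """Pick the candidate with the highest expected gain (lowest expected loss).

    nbest is a list of (translation, probability) pairs; probabilities are
    renormalized over the list before use.
    """
    total = sum(prob for _, prob in nbest)

    def expected_gain(candidate):
        return sum(prob / total * gain(candidate, other) for other, prob in nbest)

    return max((text for text, _ in nbest), key=expected_gain)


# Hypothetical n-best list with a fairly flat distribution at the top.
nbest = [
    ("the cat sat in the hat", 0.30),
    ("the cat sat on the mat", 0.25),
    ("the cat sat on a mat", 0.25),
    ("a cat sat on the mat", 0.20),
]

print(map_decode(nbest))  # the odd one out with the top score
print(mbr_decode(nbest))  # the one that agrees most with the rest
```

On this toy list the highest-probability pick is "the cat sat in the hat", while the MBR pick is "the cat sat on the mat", because the other candidates mostly agree with it.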
If this is our 'diversification' strategy, how can we make things even safer? Exactly the same way we do for investments: we diversify even more. That is, we look at more of the candidate translations to make the MBR decision. A lot more. The way to do that is to build a lattice of translations during the search and then run our MBR search over the lattice. Instead of the 100 or 1000 best translations that we would use for the n-best approach, lattices give us access to a number that rivals the number of particles in the visible universe (really, it's huge).
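To get a feel for why the numbers grow so quickly, here is a small Python sketch that counts the distinct paths through a toy lattice. The lattice shape (a chain of slots with a fixed number of alternatives each) and the helper names are assumptions made for illustration; real translation lattices are less regular, and this only counts paths, it does not do the MBR search itself.

```python
def count_paths(lattice, start, end):
    """Count distinct start-to-end paths in a DAG given as node -> list of successors.

    Parallel edges are represented by repeating a successor in the list.
    """
    memo = {}

    def paths_from(node):
        if node == end:
            return 1
        if node not in memo:
            memo[node] = sum(paths_from(nxt) for nxt in lattice.get(node, []))
        return memo[node]

    return paths_from(start)


def chain_lattice(positions, alternatives):
    """Hypothetical toy lattice: `positions` slots in a row, each with
    `alternatives` parallel edges to the next node."""
    return {i: [i + 1] * alternatives for i in range(positions)}


# 30 slots with just 2 choices each already encode over a billion translations.
print(count_paths(chain_lattice(30, 2), 0, 30))  # 1073741824
```

The lattice itself stays tiny (30 slots, 60 edges) while the number of translations it represents doubles with every slot, which is why a lattice can cover so many more candidates than an n-best list ever could.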
Interested? You can read all about it here.