Monday, June 14, 2010
Google’s speech team is composed of people from many different cultural backgrounds. Indeed, if we count the languages spoken by our teammates, the number comes to well over a dozen. Given our own backgrounds and interests, we are naturally excited to extend our software to work with many different languages and dialects. After testing the waters with English, Mandarin Chinese, and Japanese, we decided to tackle four main European languages which are often referred to as FIGS - French, Italian, German and Spanish.
Developing Voice Search systems in each of these languages presented its own challenges. French and Spanish required special work to deal with diacritics and accent marks (e.g. ç in French, ñ in Spanish). When we develop a system for a new language, we tweak our dictionaries based on user-generated content. To our surprise, we found that a lot of this content in French and Spanish uses non-standard orthography. For example, a French speaker might type “francoise” into a search engine and still expect it to return results for “Françoise”. Likewise, in Spanish a user might type “espana” and expect results for the term “España”. Of course, a lot of this has to do with the fact that, until recently, domain names (like www.elpais.es) did not allow diacritics. Entering special characters is also often painful, while omitting diacritics is usually not an obstacle to communication. However, non-standard spellings distort the intended pronunciations. For example, if “francoise” were a real French word, one would expect it to be pronounced “franquoise”. In order to capture the intended pronunciation of the non-standard spellings, we fixed the orthography in our dictionaries for Spanish and French automatically. While this is not perfect, it deals with many of the offending cases.
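To make the idea of automatic orthography fixing concrete, here is a minimal sketch in Python. It is not the procedure we actually used. It assumes a hypothetical table of correctly spelled reference words with counts (`reference_counts`), and maps each diacritic-free spelling to the most frequent accented form that shares it. The function names are made up for illustration.

```python
import unicodedata


def strip_diacritics(word: str) -> str:
    """Remove combining marks, e.g. 'ç' -> 'c' and 'ñ' -> 'n'."""
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def build_restoration_table(reference_counts: dict) -> dict:
    """Map each stripped spelling to its most frequent reference spelling.

    `reference_counts` is an assumed input: word -> observed count in some
    trusted, correctly spelled text collection.
    """
    best = {}
    for word, count in reference_counts.items():
        lowered = word.lower()
        key = strip_diacritics(lowered)
        if key not in best or count > best[key][1]:
            best[key] = (lowered, count)
    return {key: word for key, (word, _) in best.items()}


def restore(word: str, table: dict) -> str:
    """Return the preferred spelling, or the word unchanged if it is unknown."""
    return table.get(strip_diacritics(word.lower()), word)


if __name__ == "__main__":
    table = build_restoration_table({"françoise": 120, "españa": 300, "espana": 5})
    print(restore("francoise", table))
    print(restore("espana", table))
```

A lookup like this cannot choose between two legitimate words that differ only in their accents, which is one reason an automatic fix of this kind is not perfect.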
Since our Voice Search systems typically understand more than a million different words in each language, developing pronunciation dictionaries is one of the most critical tasks. We need the dictionary to match what the user said with the written form. Not surprisingly, we found dictionary development for some languages, like Spanish and Italian, to be extremely easy, as they have very regular orthographies. In fact, the core of our Spanish pronunciation module consists of fewer than 100 lines of source code. Other languages, like German and French, have more complex orthographies. For example, in French “au”, “eaux” and “hauts” are all pronounced “o”.
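To illustrate why a regular orthography keeps a pronunciation module small, here is a toy spelling-to-sound converter for Spanish in Python. It is only a sketch and not our actual module. The rule list, the phoneme symbols, and the choice of “θ” for “c” before “e” or “i” (as in the variety spoken in Spain) are assumptions made for this example, and many details such as stress and the different “r” sounds are left out.

```python
# Vowels that trigger the "soft" reading of c and g.
FRONT_VOWELS = set("eiéí")

# Ordered rules: (spelling, required following letter set or None, sound).
# Earlier rules win, so multi-letter and context-dependent rules come first.
RULES = [
    ("ch", None, "tʃ"),
    ("ll", None, "ʝ"),
    ("qu", FRONT_VOWELS, "k"),
    ("gu", FRONT_VOWELS, "g"),
    ("c", FRONT_VOWELS, "θ"),
    ("g", FRONT_VOWELS, "x"),
    ("c", None, "k"),
    ("j", None, "x"),
    ("h", None, ""),  # silent
    ("ñ", None, "ɲ"),
    ("v", None, "b"),
    ("z", None, "θ"),
    ("á", None, "a"),
    ("é", None, "e"),
    ("í", None, "i"),
    ("ó", None, "o"),
    ("ú", None, "u"),
]


def to_phonemes(word: str) -> list:
    """Convert a Spanish word to a rough phoneme list using RULES."""
    word = word.lower()
    phonemes = []
    i = 0
    while i < len(word):
        for spelling, context, sound in RULES:
            if not word.startswith(spelling, i):
                continue
            following = word[i + len(spelling): i + len(spelling) + 1]
            if context is not None and following not in context:
                continue
            if sound:
                phonemes.append(sound)
            i += len(spelling)
            break
        else:
            # No rule matched: the letter stands for itself.
            phonemes.append(word[i])
            i += 1
    return phonemes


if __name__ == "__main__":
    for example in ["españa", "queso", "chico", "hola", "cena"]:
        print(example, to_phonemes(example))
```

A table like this would not get far in French, where several unrelated spellings such as “au”, “eaux” and “hauts” collapse to the same sound and exceptions are common.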
A notable aspect of German (especially “Internet German”) is that a lot of English words are in common usage. We do our best to recognize thousands of English words, even though English contains some sounds that don’t exist in German, like “th” in “the”. One of the trickiest examples we came across was when one of our volunteers read “nba playoffs 2009”, saying “nba playoffs” in English followed by “zwei tausend neun” in German. So go ahead and search for “Germany’s Next Topmodel” or “Postbank Online”, and see if it works for you.
German is also notorious for having long, complex words. Our favorite examples include:
- Berufskraftfahrerqualifikationsgesetz (or shorter: BKrFQG)
Just for fun, compare how long it takes you to say these to Voice Search vs. typing them.
Even though a vocabulary size of one million words sounds like a large number, each of these languages has even more words, so we need a procedure to select which ones to model. We obviously do not do this manually and instead use statistical procedures to identify the list of words we will allow. We do this by looking at many sources of data and at how frequently each word occurs in them. It is sometimes surprising to find really weird terms selected by our algorithms. For example, in Spanish we found these unusual words:
So, in the unlikely event that you ever try a Spanish voice search query like this “imágenes del músculo supercalifragilisticoespialidoso chiripitiflautico esternocleidomastoideo” you may be surprised to see that it works.
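The frequency-based selection described above can be sketched in a few lines of Python. This is a simplified illustration and not our production procedure. It assumes the data sources are plain strings, splits them on whitespace, and keeps the most frequent words. The function name `select_vocabulary` and its parameters are made up for this example.

```python
from collections import Counter


def select_vocabulary(sources, max_words):
    """Return the `max_words` most frequent words across all sources.

    `sources` is an assumed input: an iterable of text strings.
    Ties are broken alphabetically so the result is deterministic.
    """
    counts = Counter()
    for text in sources:
        counts.update(text.lower().split())
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:max_words]]


if __name__ == "__main__":
    sources = ["imágenes de gatos", "imágenes de perros", "fotos de gatos"]
    print(select_vocabulary(sources, 3))
```

Because a word only has to be frequent enough in the data to make the cut, a rare-sounding term that happens to appear often will be selected just like any everyday word.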
French, Italian, German, and Spanish are spoken in many parts of the world. In this first release of Google Search by Voice in these languages, we initially only support the varieties spoken in France, Italy, Germany, and Spain, respectively. The reason is that almost all aspects of a Voice Search system are affected by regional variation: French speakers from different regions have slightly different accents, use a number of different words, and will want to search for different things. Eventually, we plan to support other regions as well, and we will work hard to make sure our systems work well for all of you.
So, we hope you find these new voice search systems useful and fun to use. We definitely had a “supercalifragilisticoespialidoso chiripitiflautico” time developing them.