Wednesday, June 30, 2010
On June 16th, we launched our Korean voice search system. Google Search by Voice has been available in various flavors of English since 2008, in Mandarin and Japanese since 2009, and in French, Italian, German and Spanish just a few weeks ago (some more details in a recent blog post).
Korean speech recognition has received less attention than English, which has been studied extensively around the world by teams in both English and non-English speaking countries. Fundamentally, our methodology for developing a Korean speech recognition system is similar to the process we have used for other languages. We created a set of statistical models: an acoustic model for the basic sounds of the language, a language model for the words and phrases of the language, and a dictionary mapping the words to their pronunciations. We trained our acoustic model using a large quantity of recorded and transcribed Korean speech. The language model was trained using anonymized Korean web search queries. With these models trained, we can take an audio input, compute the most likely spoken phrase, and display it along with its search results.
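As a rough illustration of how the two kinds of scores can be combined, here is a minimal Python sketch. It assumes that each candidate phrase already carries a log-probability from the acoustic model and one from the language model. The candidate list, the score values, the `lm_weight` parameter and the function name are all invented for this example and are not taken from our production system.

```python
# Toy candidates with made-up log-probabilities (hypothetical values).
CANDIDATES = [
    {"text": "김연아", "acoustic_logp": -12.0, "lm_logp": -3.0},
    {"text": "기면아", "acoustic_logp": -11.5, "lm_logp": -9.0},
]


def best_hypothesis(candidates, lm_weight=1.0):
    """Return the candidate with the highest combined log score.

    The combined score is acoustic_logp + lm_weight * lm_logp, i.e. the
    log of P(audio | words) * P(words) ** lm_weight.
    """
    return max(
        candidates,
        key=lambda c: c["acoustic_logp"] + lm_weight * c["lm_logp"],
    )


if __name__ == "__main__":
    print(best_hypothesis(CANDIDATES)["text"])
```

In this toy data the second candidate fits the audio slightly better, but the language model, which has seen the first phrase far more often in queries, tips the decision toward it.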
There were several challenges in developing a Korean speech recognition system, some unique to Korean, some typical of Asian languages and some universal to all languages. Here are some examples of problems that stood out:
- Developing a Korean dictionary: Unlike English, where there are many publicly-available dictionaries for mapping words to their pronunciations, there are very few available for Korean. Since our Korean recognizer knows several hundred thousand words, we needed to create these mappings ourselves. Luckily, Korean has one of the most elegant and simple writing systems in the world (created in the 15th century!) and this makes mapping Korean words to pronunciations relatively straightforward. However, we found that Koreans also use quite a few English words in their queries, which complicates the mapping process. To predict these pronunciations, we built a statistical model using data from an existing (smaller) Korean dictionary.
- Korean word boundaries: Although Korean orthography uses spaces to indicate word boundaries (unlike Japanese or Mandarin), we found that people use word boundaries inconsistently in search queries. To limit the size of the vocabulary generated from the search queries, we used statistical techniques to cut rare long words into smaller sub-words (similar to the approach we developed for Japanese).
- Pronunciation exceptions: Korean (like all other languages) has many exceptions for pronunciations that are not immediately obvious. For example, numbers are often written as digit sequences but not necessarily spoken this way (2010 = 이천십). The same is true for many common alphanumeric sequences like “mp3”, “kbs2” or mixed queries like “삼성 tv”, which often contain spelled-out letters and sometimes digits spoken in English rather than in Korean.
- Encoding issues: Korean script (Hangul) is written in syllabic blocks, with each block containing at least two of the 24 modern Hangul letters (Jamo): at least one consonant and one vowel. Counting every possible block, together with the normal ASCII characters, brings the total number of possible basic characters to over 10000, not including Hanja (used mostly in the formal spelling of names). So, despite its simple writing system, Korean still presents the same challenge of handling a large alphabet that is typical of Asian languages.
- Script ambiguity: We found that some users prefer to type the original English words while others use the Korean transliteration (example: “ncis season 6” vs. “ncis 시즌6”). This makes it challenging to train and evaluate the system. We use a metric that estimates whether our transcription will give the correct web page result on the user’s smartphone screen, and such script variations make this tricky.
- Recognizing rare words: The recognizer is good at recognizing things users often type into the search engine, such as cities, shops, addresses, common abbreviations, common product model numbers and well-known names like “김연아”. However, rare words (like many personal names) are often harder for us to recognize. We continue to work on improving those.
- Every speaker sounds different: People speak in different styles, slow or fast, with an accent or without, have lower or higher pitched voices, etc. To make our system work for all these different conditions, we trained our system using data from many different sources to capture as many conditions as possible.
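To make the encoding point in the list above more concrete, here is a small Python sketch that splits a precomposed Hangul syllable into its letters using the arithmetic layout of the Unicode Hangul Syllables block (starting at U+AC00). The function and constant names are hypothetical, and this only illustrates the character layout; it is not a description of how our dictionary was built.

```python
# Letters in Unicode order for each position of a syllable block.
LEADS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"          # 19 initial consonants
VOWELS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"     # 21 vowels
TAILS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # 27 finals + "none"

HANGUL_BASE = 0xAC00
SYLLABLE_COUNT = len(LEADS) * len(VOWELS) * len(TAILS)  # 19 * 21 * 28


def decompose(syllable):
    """Split one precomposed Hangul syllable into (lead, vowel, tail)."""
    index = ord(syllable) - HANGUL_BASE
    if not 0 <= index < SYLLABLE_COUNT:
        raise ValueError(f"not a precomposed Hangul syllable: {syllable!r}")
    lead, rest = divmod(index, len(VOWELS) * len(TAILS))
    vowel, tail = divmod(rest, len(TAILS))
    return LEADS[lead], VOWELS[vowel], TAILS[tail]


if __name__ == "__main__":
    print(SYLLABLE_COUNT)
    print([decompose(ch) for ch in "김연아"])
```

The product of the three position counts is where the figure of more than 10000 possible blocks comes from, and the regular layout is one reason mapping Hangul text to its letters is comparatively simple.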
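The number example from the list above (2010 = 이천십) can also be sketched in a few lines of Python. This is a simplified illustration that assumes the Sino-Korean reading and only handles values from 1 to 9999. The function name is hypothetical, and a real system would also need native Korean numerals, larger units, digit-by-digit readings and English readings.

```python
DIGITS = ["", "일", "이", "삼", "사", "오", "육", "칠", "팔", "구"]
UNITS = ["", "십", "백", "천"]  # ones, tens, hundreds, thousands


def read_sino_korean(n):
    """Spell out an integer from 1 to 9999 using Sino-Korean numerals."""
    if not 0 < n < 10000:
        raise ValueError("this sketch only handles 1..9999")
    text = str(n)
    parts = []
    for position, char in enumerate(text):
        digit = int(char)
        if digit == 0:
            continue  # zero digits are not spoken
        unit = UNITS[len(text) - 1 - position]
        # "일" (one) is normally dropped before 십, 백 and 천.
        spoken_digit = "" if (digit == 1 and unit) else DIGITS[digit]
        parts.append(spoken_digit + unit)
    return "".join(parts)


if __name__ == "__main__":
    print(read_sino_korean(2010))
```

Even this small rule set shows why written digit strings cannot simply be looked up in a pronunciation dictionary: the spoken form depends on the position of each digit.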
When speech recognizers make errors, the reason is usually that the models are not good enough, and that often means they haven’t been trained on enough data. For Korean (and all other languages), our cloud computing infrastructure allows us to retrain our models frequently, using an ever-growing amount of data to continually improve performance. We are committed to improving the system regularly over time to make speech a user-friendly input method on mobile devices.