Thursday, December 02, 2010
On November 30, 2010, Google launched Cantonese Voice Search in Hong Kong. Google Search by Voice has been available in a growing number of languages since we launched our first US English system in 2008. In addition to US English, we already support Mandarin for Mainland China, Mandarin for Taiwan, Japanese, Korean, French, Italian, German, Spanish, Turkish, Russian, Czech, Polish, Brazilian Portuguese, Dutch, Afrikaans, and Zulu, along with special recognizers for English spoken with British, Indian, Australian, and South African accents.
Cantonese is widely spoken in Hong Kong, where it is written using traditional Chinese characters, similar to those used in Taiwan. Chinese script is much harder to type than the Latin alphabet, especially on mobile devices with small or virtual keyboards. People in Hong Kong typically use either “Cangjie” (倉頡) or “Handwriting” (手寫輸入) input methods. Cangjie (倉頡) has a steep learning curve and requires users to break the Chinese characters down into sequences of graphical components. The Handwriting (手寫輸入) method is easier to learn, but slow to use. Neither is an ideal input method for people in Hong Kong trying to use Google Search on their mobile phones.
Speaking is generally much faster and more natural than typing. Moreover, some Chinese characters – like “滘” in “滘西洲” (Kau Sai Chau) and “砵” in “砵典乍街” (Pottinger Street) – are so rarely used that people often know only the pronunciation, and not how to write them. Our Cantonese Voice Search begins to address these situations by allowing Hong Kong users to speak queries instead of entering Chinese characters on mobile devices. We believe our development of Cantonese Voice Search is a step towards solving the text input challenge for devices with small or virtual keyboards for users in Hong Kong.
There were several challenges in developing Cantonese Voice Search, some unique to Cantonese, some typical of Asian languages and some universal to all languages. Here are some examples of problems that stood out:
- Data Collection: In contrast to English, there are few existing Cantonese datasets that can be used to train a recognition system. Building a recognition system requires both audio and text data so it can recognize both the sounds and the words. For audio data, our efficient DataHound collection technique uses smartphones to record and upload large numbers of audio samples from local Cantonese-speaking volunteers. For text data, we sample from anonymized search query logs from http://www.google.com.hk to obtain the large amounts of data needed to train language models.
- Chinese Word Boundaries: Chinese writing doesn’t use spaces to indicate word boundaries. To limit the size of the vocabulary for our speech recognizer and to simplify lexicon development, we use characters, rather than words, as the basic units in our system and allow multiple pronunciations for each character.
- Mixing of Chinese Characters and English Words: We found that Hong Kong users mix more English into their queries than users in Mainland China and Taiwan. To build a lexicon for both Chinese characters and English words, we map English words to a sequence of Cantonese pronunciation units.
- Tone Issues: Linguists disagree on how many tones Cantonese has: some say 6, others 7, 9, or 10. In any case, it’s a lot. We decided to model tone-plus-vowel combinations as single units. To limit the complexity of the resulting model, some rarely used tone-vowel combinations are merged into single models.
- Transliteration: We found that some users use English words while others use the Cantonese transliteration (e.g., “Jordan” vs. “佐敦”). This makes it challenging to develop and evaluate the system, since it’s often impossible for the recognizer to distinguish between an English word and its Cantonese transliteration. During development, we therefore use a metric that simply checks whether the correct search results are returned, rather than whether the transcript matches exactly.
- Different Accents and Noisy Environments: People speak in different styles with different accents. They use our systems in a variety of environments, including offices, subways, and shopping malls. To make our system work in all these different conditions, we train it using data collected from many different volunteers in many different environments.
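
To make the lexicon ideas in the list above more concrete, here is a minimal Python sketch of a character-based lexicon with multiple pronunciations per character, an English word mapped onto Cantonese-style units, and tone-plus-vowel units with rare combinations merged. It is an illustration under stated assumptions, not the production system: the names `LEXICON`, `UNIT_COUNTS`, `build_merge_map`, `tokenize`, and `expand` are hypothetical, the romanizations are only loosely Jyutping-like, the counts and the pronunciation given for "jordan" are made up, and merging a rare unit into a tone-independent vowel unit is just one possible merging strategy.

```python
import re
from itertools import product

# Illustrative lexicon (hypothetical entries). Each key is a single Chinese
# character or a lowercase English word. Each value lists one or more
# pronunciations; a pronunciation is a list of syllables written as
# (onset, vowel, tone, coda).
LEXICON = {
    "佐": [[("z", "o", 2, "")]],
    "敦": [[("d", "eo", 1, "n")]],
    # A character with two alternative pronunciations.
    "行": [[("h", "a", 4, "ng")], [("h", "o", 4, "ng")]],
    # An English word mapped to Cantonese-style units (made-up mapping).
    "jordan": [[("z", "o", 1, ""), ("d", "a", 2, "n")]],
}

# Made-up training counts for tone-plus-vowel units.
UNIT_COUNTS = {"o2": 120, "eo1": 95, "a4": 80, "o4": 60, "o1": 3, "a2": 2}


def build_merge_map(unit_counts, min_count):
    """Map each rare tone-vowel unit to a shared tone-independent vowel unit.

    Assumption: rare units are merged by dropping the tone digit.
    """
    return {
        unit: unit.rstrip("0123456789")
        for unit, count in unit_counts.items()
        if count < min_count
    }


def tokenize(query):
    """Split a query into lowercase English words and single characters."""
    return [t.lower() for t in re.findall(r"[A-Za-z]+|[^\sA-Za-z]", query)]


def syllable_units(syllable, merge_map):
    """Turn one syllable into units, treating vowel+tone as a single unit."""
    onset, vowel, tone, coda = syllable
    tone_vowel = f"{vowel}{tone}"
    tone_vowel = merge_map.get(tone_vowel, tone_vowel)
    return [unit for unit in (onset, tone_vowel, coda) if unit]


def expand(tokens, lexicon, merge_map):
    """Return every unit sequence for the tokens, one per pronunciation choice.

    Raises KeyError if a token is not in the lexicon.
    """
    per_token = [lexicon[token] for token in tokens]
    sequences = []
    for choice in product(*per_token):
        units = []
        for pronunciation in choice:
            for syllable in pronunciation:
                units.extend(syllable_units(syllable, merge_map))
        sequences.append(units)
    return sequences


if __name__ == "__main__":
    merge_map = build_merge_map(UNIT_COUNTS, min_count=10)
    for query in ("佐敦", "Jordan", "行"):
        print(query, expand(tokenize(query), LEXICON, merge_map))
```

Keeping characters as the basic units means the lexicon grows with the number of distinct characters rather than the number of words, and listing several pronunciations per entry leaves the choice between them to the recognizer.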