CantoFish

About


What is it?

CantoFish is a popup Cantonese-English dictionary for Firefox. It contains over 200,000 entries and works with both traditional and simplified Chinese characters. The Yale and Jyutping romanization systems are supported, as well as Mandarin pinyin.


How were the Cantonese readings generated?

Fortunately, many characters have only one reading, or only one commonly used reading. Unfortunately, a large number of characters are not so well behaved. Since there is no extensive, publicly available Cantonese dictionary, I had to improvise. Existing data sets such as adso and CC-CEDICT were designed for Mandarin, so I needed a way to generate the Cantonese pronunciation for all of those words.

This is not a trivial task. Since each character may have multiple readings, it is impossible to guess the correct one programmatically every time. Although there is no 1:1 mapping between Mandarin and Cantonese, it is sometimes possible to use the Mandarin reading as a hint when choosing between multiple Cantonese readings.

For example, let’s look at the 覺 character. In Cantonese this can be read as either gok3 (覺得: feel) or gaau3 (睡覺: sleep). If we encounter a new compound word containing this character, how do we guess which reading is correct? Since this character also has two readings in Mandarin, we can create a mapping which says that when the Mandarin reading is jue2, the Cantonese reading is probably gok3, and when the Mandarin reading is jiao4, the Cantonese reading is probably gaau3.
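The idea can be sketched in a few lines of Python. This is only an illustration, not the code CantoFish uses: the table name `MANDARIN_HINTS`, the function `guess_cantonese`, and the table layout are all assumptions, and the only data in it is the 覺 example from the paragraph above.

```python
# Hypothetical hint table: (character, Mandarin pinyin) -> Cantonese Jyutping.
# The layout is an assumption; the two entries are the 覺 example from the text.
MANDARIN_HINTS = {
    ("覺", "jue2"): "gok3",    # as in 覺得 (feel)
    ("覺", "jiao4"): "gaau3",  # as in 睡覺 (sleep)
}


def guess_cantonese(char, pinyin):
    """Return the Cantonese reading hinted at by the Mandarin one, or None."""
    return MANDARIN_HINTS.get((char, pinyin))


print(guess_cantonese("覺", "jue2"))   # gok3
print(guess_cantonese("覺", "jiao4"))  # gaau3
```

A character/reading pair with no mapping returns `None`, which leaves room for some other strategy to pick the reading.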

This same strategy can also help with tone changes. For example, the 好 character is usually read as hou2 (hao3 in Mandarin), but it can also sometimes be read as hou3. Luckily, the change in tone also happens in Mandarin, so when we see a word like 愛好 (hobby), where the second character is read as hao4, we can guess that the Cantonese will likely use either the 3rd or 6th tone (in this case, hou3). This substantially increases the chance of choosing the correct reading.
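Here is a rough Python sketch of the tone hint, again purely illustrative. The names `TONE_HINTS` and `pick_by_tone` are made up for this example, the table holds only the single correspondence mentioned above (Mandarin 4th tone suggesting Cantonese 3rd or 6th), and falling back to the first candidate is an assumption about how ties might be broken.

```python
# Hypothetical table: Mandarin tone digit -> Cantonese tone digits it hints at.
# Only the correspondence described in the text is included.
TONE_HINTS = {"4": {"3", "6"}}


def pick_by_tone(candidates, pinyin):
    """Pick the first candidate whose tone matches the Mandarin tone hint.

    `candidates` are Jyutping readings ending in a tone digit; `pinyin` is the
    Mandarin reading ending in a tone digit. If nothing matches, fall back to
    the first candidate (assumed to be the most common reading).
    """
    likely_tones = TONE_HINTS.get(pinyin[-1], set())
    for reading in candidates:
        if reading[-1] in likely_tones:
            return reading
    return candidates[0]


print(pick_by_tone(["hou2", "hou3"], "hao4"))  # hou3, as in 愛好
print(pick_by_tone(["hou2", "hou3"], "hao3"))  # hou2, the usual reading
```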

To date, I’ve spent a considerable amount of time researching these relationships, and have compiled a large set of mappings. Where no mapping is defined, I tried to order the readings by how frequently they are used, as closely as I could.

In addition to the mappings, I also make use of a separate data file which contains hand-edited entries. This provides a way to disambiguate readings which don’t follow any set pattern, and to add colloquial Cantonese words which don’t appear in standard Chinese. These entries take priority over the generated ones since they have been checked by a human being.
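The priority rule might look something like the following Python sketch. The dictionary names, the `lookup` function, and the sample entries are all hypothetical; in particular, the wrong generated reading for 睡覺 is invented on purpose to show the manual entry winning.

```python
# Hand-checked entries (hypothetical sample data).
MANUAL_ENTRIES = {
    "唔該": "m4 goi1",      # colloquial Cantonese word
    "睡覺": "seoi6 gaau3",  # overrides the generated guess below
}

# Automatically generated readings; the 睡覺 guess is deliberately wrong.
GENERATED_ENTRIES = {
    "睡覺": "seoi6 gok3",
    "覺得": "gok3 dak1",
}


def lookup(word):
    """Return the manual reading if one exists, else the generated one."""
    if word in MANUAL_ENTRIES:
        return MANUAL_ENTRIES[word]
    return GENERATED_ENTRIES.get(word)


print(lookup("睡覺"))  # seoi6 gaau3 (manual entry wins)
print(lookup("覺得"))  # gok3 dak1 (only a generated entry exists)
print(lookup("唔該"))  # m4 goi1 (only a manual entry exists)
```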

One thing that needs work is defining the words which are affected by tone sandhi, most notably when the final character in a compound word changes its tone. As far as I can tell this is fairly arbitrary and there is no simple pattern to determine when this happens. Adding a sufficient number of manual entries should ultimately help alleviate this problem.

Update: As of August 2009, a very large number of human-checked compound word readings have been provided by Adam Sheik’s CantoDict project. This greatly increased the quality of the mapping data.

System Requirements and Installation

CantoFish will run on any operating system with Firefox. To install the plugin manually, select File > Open File from the menu in Firefox and select the CantoFish xpi file. You’ll need to restart Firefox after adding this plugin.

Usage

To enable CantoFish, either right-click inside a web page and select CantoFish, or select it from the Tools menu. Then simply hover over a Chinese word to see the definition and pronunciation.

To switch between romanization settings (Yale, Jyutping, and Mandarin Pinyin), go into the Tools menu and select Add-ons, then find CantoFish and click the Options button. Change the Romanization dropdown, and click OK.

Credits

The plugin itself is based on the code from Chinese Perapera-kun (http://perapera.wordpress.com/), which in turn is based on Rikaichan (http://www.polarcloud.com/rikaichan/) for Japanese

Romanization is based primarily on data from Aaron Chan’s HanConv utility: http://www.icycloud.tk

Dictionary data derived from adso (http://adsotrans.com) and CC-CEDICT (http://usa.mdbg.net/chindict/chindict.php?page=cedict)

Compound reading data (human checked) generously provided by Adam Sheik’s CantoDict project: http://www.cantodict.org

Disclaimer

This program may be freely distributed, and is provided ‘as is’ without warranty of any kind.

Written by jburket

February 28, 2009 at 9:44 pm
