In June this year I came across an interesting project called Common Voice, and I have since started volunteering for it. It was launched by Mozilla, the non-profit behind the Firefox browser.

What is Common Voice?

“Common Voice is Mozilla’s initiative to help teach machines how real people speak.” — https://voice.mozilla.org/

Voice recognition technology now powers a lot of everyday apps, from the obvious speech-to-text software to virtual personal assistants (Google Assistant, Siri, Alexa, Cortana), and it may already have become your preferred way of interacting with your devices.

Tech giants collect your voices to create their own proprietary voice databases, which they use to train their technology to be even better. Creating a great voice system requires an extremely large amount of voice data. As a developer, you can rely on proprietary APIs, but they are often restrictive, expensive, and of course closed source. The data used by these large companies isn’t available to most people.

Mozilla thinks this stifles innovation, so it launched its own open source project, Common Voice, aiming to make voice recognition open to everyone.

The official Common Voice project logo

How does it work?

Crowdsourcing. Anyone can record their voice on the Common Voice web app, and anyone can listen to the donated recordings to validate their correctness. The validated voice data library will be available for everyone to download. Developers, academics, companies, anyone, for free. Your voice donations and validations will help build this future open source voice database!

Canto Common Voice!

Mozilla has opened up the Common Voice project beyond English, and so far a dozen other languages have already gone online and are collecting voices.

Creating the Cantonese Chinese (Hong Kong) version of this project became one of my personal ambitions. There’s just something noble about doing this for my own native language, a language whose use is also being eclipsed in many parts of the world.

So I sent the project coordinators of Common Voice an email asking them to open up the [zh-hk] version of the website (Thanks Michael and Peiying!). To begin collecting voice donations, the new Cantonese branch will need:

  1. A localised, fully translated version of the website, and
  2. 5000+ sentences in Cantonese Chinese ready to be recorded.

Localisation

Translation for Mozilla projects is done via their bespoke localisation platform, Pontoon. If you have worked on software projects before, this is basically a hosted string file which anyone can suggest edits to. It is extremely easy to use, even for the least tech-savvy person.

After about two months of work (Shout out to top contributor Terry! And F!) a completed first draft of our localised web app is finally available today!

I am aware that the translations may not be perfect with only three contributors, so if you are a native Cantonese speaker, or even an overseas Cantonese speaker, please don’t hesitate to give our Pontoon a browse and suggest edits/improvements!

Take a look at the translated voice.mozilla.org/zh-hk

Going forward

Sentence collection is going to be hard work, and the main project team is still working on a proper interface for accepting donations. The big limitation here is that in order to stay open source, Common Voice cannot accept sentences that aren’t licensed under Creative Commons CC-0 or an equivalent (or less restrictive) license. So please let me/us/Moz-HK know if you have seen any large databases (news archives, movie subtitles, academic databases) of colloquial Cantonese, or written Hong Kong Chinese, that carry a CC-0 license.

Localisation best practices are something many other Mozilla local communities already have (e.g. Moz-TW and Moz-CN). In the future, other Mozilla applications/projects are likely to be localised into Hong Kong Chinese, so preparing a few brief guidelines for how to “localise more good” will be important. Recognising that I am no expert in this, I am looking for suggestions for useful readings on translation for the Hong Kong market (be it academic/governmental/corporate), and for another volunteer who has professional/academic English-Cantonese translation experience.

Let me know in the comments section!


The author

Born and raised in Hong Kong for twenty years, transported to West London in young adulthood, a brief stint in Manchester, now torn between sleepy suburbia and the bustling city.
