This is a start-to-finish guide showing how to add internationalization (i18n) to a Python application. When I added i18n to handroll, I struggled to find clear advice for supporting other languages. This is one opinionated view explaining how I got there.
Table of Contents:
- Marking strings
- Extracting the master list
- Getting translations for other languages
- Packaging it together
- Testing out the package
- Questions?
Overview
To internationalize code, you have to treat user text strings in a certain way. All text strings must be wrapped with a special function call. This special function marks the strings as something that needs translation. Once all the strings are marked, i18n tools can scan your code to make a master list of everything. With the master list available, translators can produce a list of translated strings for each desired language. The translated strings are added back to the Python project and it’s all bundled up into a nice translated final product. Then you may want to test that out.
That’s a lot of stuff, but I’ll explain each part of that process in detail.
For the purposes of this guide, all of my example code will refer to the handroll package. If this is your Python code, replace handroll with the name of your top-level Python package.
Marking strings
The special function that I mentioned in the overview looks like _('Hello World'). This function comes from the gettext module. Python uses GNU gettext to do translation, so let’s look at the handroll code that creates the _ function in handroll/i18n.py.
import gettext
import os
localedir = os.path.join(os.path.abspath(os.path.dirname(__file__)), 'locale')
translate = gettext.translation('handroll', localedir, fallback=True)
_ = translate.gettext
This little chunk of code makes a translation object using a locale directory as the source of translated strings. Your locale directory doesn’t exist yet, but it is where your application will look for translations at run time. From the translation object, I made an _ alias for translate.gettext. This underscore is not particularly special; it is simply the conventional name for this function in the Python community.
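With that alias in place, marking a string is just a matter of importing _ and wrapping the literal. Here is a small sketch of what that looks like in application code (the announce function and its message are invented for illustration, not taken from handroll):
from handroll.i18n import _

def announce():
    # The string literal is both the key that the extraction tool finds
    # and the English fallback shown when no translation exists.
    print(_('Generating your site...'))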
I hope there is enough here now to understand what will happen at run time. The important idea is that 'Hello World' acts like a key. When the code executes, that “key” will be used to look in the locale data to find a matching translation. The return value of _ will be the translated string, or the original string if the translation is missing.
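You can see the fallback behavior from an interactive session. This is only a sketch of what to expect before any catalogs are compiled: because the translation object was created with fallback=True, gettext hands back the original string unchanged when no matching catalog is found.
>>> from handroll.i18n import _
>>> _('Hello World')  # no catalog for the active language, so the key comes back
'Hello World'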
_ and format strings
One weirdness that you will quickly encounter if you use format for your strings is where to close the _ parentheses. Here is an example to make it clear.
_('Close the underscore function first. A {number}').format(number=42)
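To spell out why the placement matters, here is a small before/after sketch reusing the same string. Formatting inside the underscore call changes the key at run time, so the lookup can no longer match the entry that was extracted from the source.
# Correct: translate the literal first, then format the translated result.
message = _('Close the underscore function first. A {number}').format(number=42)

# Wrong: format runs first, so the runtime key is the already-formatted
# string and it will not match the catalog entry for the original literal.
message = _('Close the underscore function first. A {number}'.format(number=42))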
Extracting the master list
With all strings marked with _(), it is time to extract them into a master list. Gettext refers to this master list as the Portable Object Template (POT) file. To generate the POT file, I turned to a tool called Babel. Babel is designed to help developers handle i18n issues in a wide variety of ways. It has a lot of cool features, but for this guide, I’ll focus on its gettext support.
Babel extends a setup.py file with new commands. (Python packaging is out of scope for this guide, so if setup.py is completely foreign to you, please check out Distributing Python Modules for more information.)
To generate a POT file, run python setup.py extract_messages. You will need some settings in your setup.cfg similar to:
[extract_messages]
input_dirs = handroll
output_file = handroll/locale/handroll.pot
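For reference, each marked string ends up as an entry in the generated POT file, roughly like the snippet below. The file name and line number here are invented for illustration; msgstr stays empty in the template because translators fill it in per language.
#: handroll/example.py:42
msgid "Hello World"
msgstr ""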
Getting translations for other languages
Now that you have generated a POT file, you are ready to get translated versions of your strings. Remember that the T in POT is for template. The translated files are called PO files, and translators use the POT file as a starting point to generate their PO file. Doing translation by manually creating the PO file is possible, but it is tedious. At this point, I turned to a service called Transifex.
Transifex is focused on making translation easy. It’s free for open source projects and really easy to use. I set up a project for handroll, configured Transifex to “watch” the POT file for changes, and translators can easily translate all the strings in the project. Transifex has an API that enabled me to pull PO files back into the repository whenever I needed. The script is a little too long to put in this post, but you can look at it on GitHub.
One important thing that the script does is store PO files in a specific structure in the locale directory. Gettext expects a layout like <language>/LC_MESSAGES/handroll.po.
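Assuming Spanish and French translations as an example, the locale directory would end up looking something like this:
handroll/locale/
    handroll.pot
    es/LC_MESSAGES/handroll.po
    fr/LC_MESSAGES/handroll.po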
Packaging it together
After the translated PO files are in place, it is time to package your translation data. The setup process picks up extra data files from MANIFEST.in. Add a line similar to the one below to ensure all the locale data is put into the package archive.
recursive-include handroll/locale *
So far I’ve lied a little bit for clarity in the explanation. Gettext does not use PO files directly for looking up translations. The truth is that the PO file must be compiled into a binary Machine Object (MO) file for faster lookups. Again, we can turn to Babel for help. Babel has a compile_catalog command that compiles PO files into MO files. You can run it with python setup.py compile_catalog. It will need settings like:
[compile_catalog]
domain = handroll
directory = handroll/locale
Because translations will not work without MO files, I extended setup.py so that the sdist command will always run compile_catalog.
from setuptools.command.sdist import sdist


class Sdist(sdist):
    """Custom ``sdist`` command to ensure that mo files are always created."""

    def run(self):
        self.run_command('compile_catalog')
        # sdist is an old style class so super cannot be used.
        sdist.run(self)
If you do this, don’t forget to add cmdclass={'sdist': Sdist} and setup_requires=['Babel'] in your setup call.
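Putting those pieces together, the relevant part of the setup call might look like the sketch below. The version and other metadata are placeholders, not handroll’s real values.
from setuptools import setup, find_packages

setup(
    name='handroll',
    version='0.0',  # placeholder
    packages=find_packages(),
    # Install the locale data files named in MANIFEST.in along with the code.
    include_package_data=True,
    cmdclass={'sdist': Sdist},
    setup_requires=['Babel'],
)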
At this point, running python setup.py sdist should create a tarball with translations for your project. You’re almost done!
Testing out the package
Testing translations is not easy because you need tests that exercise every string in your code. If you go the extra mile, though, you can know that people from all different cultures can enjoy your software. For handroll, that meant I had to get to 100% test coverage to reach those really strange corner cases.
The reason to test is that an incorrectly translated string can break your code. If you have a format string like _('Hello {name}!').format(name='Johnny') and a translator makes a mistake like '¡Hola {nombre}!', then that code will break for Spanish users.
>>> '¡Hola {nombre}!'.format(name='Johnny')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
KeyError: 'nombre'
To test out these strings, you need two things: MO files and the LANGUAGE environment variable. Let’s assume you use the nose test runner. To test out your Spanish support, you can run:
$ LANGUAGE=es nosetests
..................................
----------------------------------------------------------------------
Ran 34 tests in 1.440s
OK
In whatever continuous integration system you use, run your unit tests through each language you support to give yourself some confidence that translations have not broken the software. The handroll tox.ini provides an example to compare against. Tox is awesome for this kind of stuff.
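As a rough sketch of that idea (not handroll’s actual tox.ini, with environment names invented for the example), a tox configuration with per-language test runs could look like:
[tox]
envlist = py27, es, fr

[testenv]
deps = nose
commands = nosetests

[testenv:es]
setenv =
    LANGUAGE = es

[testenv:fr]
setenv =
    LANGUAGE = fr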
Questions?
In this guide, I did my best to document all that I did to internationalize my project. I hope that all the concrete details help you see how to translate your own project. If something is missing or broken, let me know and I’ll be glad to update the post.