Proposal: Convert .mo files to UTF-8

Thursday, 1 March 2007

Hi,

I'd like to propose a small change that involves many Fedora packages.
(First I thought I'd put it in bugzilla, but I don't know what the right
component would be.)

The proposed change is the following: when building RPM packages, let's
convert all .mo files (gettext translations) to UTF-8.

Why?

- As Fedora is a fully UTF-8 system, applications are likely to request
  translations in UTF-8. (There might be a few applications that are
  exceptions, and some users may have a special setup or special wrappers to
  run certain applications in some other charset, but in the vast majority
  of cases gettext is required to return UTF-8 strings.)

  If the .mo file is already in UTF-8, the gettext() call simply returns a
  pointer into the area where the .mo file is mmap()ed. This can easily be
  checked with strace. This way no run-time conversion happens and no
  per-process memory is involved; translations are shared by all the
  processes that use the same message catalog.

  If, however, the .mo file uses a different encoding, gettext() has to
  allocate memory for the converted string and perform the conversion. So
  if several processes display the same localized string, they each
  allocate their own memory area to store the UTF-8 version of the string
  and they each perform the charset conversion. They also each load the
  corresponding gconv module, which could be avoided, too.

  To summarize, having all the .mo files in UTF-8 would save both memory and
  CPU time.

- Currently the encoding of the .mo files is completely arbitrary; it is
  always what the software developers or the translators happened to use. 
  With this change, it would be consistently UTF-8. This would make it
  easier to find which package ships a particular translation. It often
  happens that I want to locate which package a particular message comes
  from. It might happen because a word is misspelled, or because the whole
  message shouldn't appear and I'd like to fix the buggy package. The
  obvious solution is to do a recursive grep on /usr/share/locale/<lang>. If
  all the .mo files of the distribution are converted to UTF-8, I can do it
  simply, without having to worry about accented characters. (grep in UTF-8
  mode works fine and finds the matching UTF-8 .mo files even though they
  are not fully valid UTF-8 files, since the UTF-8 strings are surrounded by
  other binary data.) However, if multiple encodings are used, there is no
  straightforward way to search for accented letters, and the job becomes
  much harder.
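The search described in the second point can be sketched in a few lines of
Python. This is only an illustration of the idea, not a replacement for
grep; the function name find_catalogs is made up for this example. It looks
for the UTF-8 bytes of a message, so a catalog stored in another encoding is
missed as soon as the message contains an accented letter.

```python
import os


def find_catalogs(locale_root, text):
    """Return sorted paths of .mo files that contain `text` encoded as UTF-8."""
    needle = text.encode("utf-8")
    hits = []
    for dirpath, _dirs, files in os.walk(locale_root):
        for name in files:
            if not name.endswith(".mo"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, "rb") as fp:
                # A plain byte search is enough: the strings sit unmodified
                # between the binary index tables of the .mo file.
                if needle in fp.read():
                    hits.append(path)
    return sorted(hits)
```

For example, find_catalogs("/usr/share/locale/hu", "Megnyitás") would list
only those catalogs that store the string in UTF-8.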

How?

- Due to RPM's flexibility, none of the packages needs to be modified, only
  the RPM macros. I recommend performing the conversion on the .mo files
  after the '%install' step (in '%__install_post' or whatever it's called).
  This way the whole thing is independent of the package's build procedure
  (whether it uses autotools or not; whether it regenerates .mo files from
  .po or ships pre-built .mo files; no need to worry about faulty and hence
  skipped .po files; no need to take care of non-standard locations of
  po/mo files within the source tree; etc.)

- The only thing that needs to be done is an "msgunfmt" followed by
  "msgconv -t UTF-8" and finally "msgfmt" for all the .mo files under the
  standard locale directories.

- So, after all, it is _very_ easy to implement.
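The three steps above could be driven by something like the following
Python sketch. It is not the actual RPM macro change, and the function
names, the temporary file names and the idea of passing in the locale
directory are all assumptions made for illustration. Only the three gettext
commands and the "-t UTF-8" option come from the proposal; "-o" is the
usual output-file option of these tools.

```python
import os
import shutil
import subprocess
import tempfile


def conversion_commands(mo_path, po_in, po_utf8, mo_out):
    """Return the three gettext commands as argv lists (nothing is executed)."""
    return [
        ["msgunfmt", "-o", po_in, mo_path],                 # .mo -> .po
        ["msgconv", "-t", "UTF-8", "-o", po_utf8, po_in],   # recode to UTF-8
        ["msgfmt", "-o", mo_out, po_utf8],                  # .po -> .mo
    ]


def convert_mo_to_utf8(mo_path):
    """Rewrite one .mo file as UTF-8; needs the gettext tools on PATH."""
    with tempfile.TemporaryDirectory() as tmp:
        po_in = os.path.join(tmp, "in.po")
        po_utf8 = os.path.join(tmp, "utf8.po")
        mo_out = os.path.join(tmp, "out.mo")
        for argv in conversion_commands(mo_path, po_in, po_utf8, mo_out):
            subprocess.run(argv, check=True)   # stop at the first failure
        shutil.copyfile(mo_out, mo_path)       # overwrite only after success


def convert_tree(locale_root):
    """Convert every .mo file found below locale_root."""
    for dirpath, _dirs, files in os.walk(locale_root):
        for name in files:
            if name.endswith(".mo"):
                convert_mo_to_utf8(os.path.join(dirpath, name))
```

Writing into a temporary directory first means a failing tool leaves the
original .mo file untouched.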

Is it safe?

- The encoding inside the .mo files is completely transparent to the
  applications as gettext() and its friends always convert the strings to
  the charset requested by the application. So applications won't notice any
  change.

- We performed this step when building all the packages of the UHU-Linux 2.0
  distribution, which was released half a year ago, and so far no known
  problems have arisen. (During the test period there was only one package,
  namely coreutils, where the converted .mo files were corrupted, but as it
  turned out, this was caused by a bug in msgunfmt in gettext-0.15, which is
  already fixed in gettext-0.16.)
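The transparency mentioned in the first point can be illustrated with
Python's own gettext module. Note that this is a separate implementation
from the C library's gettext(), so it only demonstrates the principle. The
build_mo helper is a toy .mo writer written for this example (little-endian,
no hash table, no plural forms), and the sample message is arbitrary.

```python
import gettext
import io
import struct


def build_mo(messages, charset):
    """Build a minimal .mo image from {msgid: msgstr} in the given charset."""
    header = "Content-Type: text/plain; charset=%s\n" % charset
    entries = {b"": header.encode("ascii")}   # empty msgid holds the header
    for msgid, msgstr in messages.items():
        entries[msgid.encode(charset)] = msgstr.encode(charset)
    keys = sorted(entries)                     # header entry sorts first
    ids = b"".join(k + b"\0" for k in keys)
    strs = b"".join(entries[k] + b"\0" for k in keys)
    id_off = 28 + 16 * len(keys)               # 7-word header + 2 index tables
    str_off = id_off + len(ids)
    id_table, str_table = b"", b""
    for k in keys:
        id_table += struct.pack("<II", len(k), id_off)
        str_table += struct.pack("<II", len(entries[k]), str_off)
        id_off += len(k) + 1
        str_off += len(entries[k]) + 1
    head = struct.pack("<7I", 0x950412DE, 0, len(keys),
                       28, 28 + 8 * len(keys), 0, 0)
    return head + id_table + str_table + ids + strs


def translate(mo_bytes, msgid):
    """Look up msgid in an in-memory catalog."""
    return gettext.GNUTranslations(io.BytesIO(mo_bytes)).gettext(msgid)


latin2 = build_mo({"Open": "Megnyitás"}, "ISO-8859-2")
utf8 = build_mo({"Open": "Megnyitás"}, "UTF-8")
# The two catalogs differ on disk, but the caller sees the same string.
print(latin2 != utf8, translate(latin2, "Open") == translate(utf8, "Open"))
```

The two catalogs differ byte for byte, yet the lookup returns the same
string from both, which is what an application would see after the
conversion as well.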

Any drawbacks?

- None that I know of, except for a negligible growth in package sizes.

Well, I hope you like my idea :-)

bye,

Egmont
