Hi,
I'd like to propose a small change that involves many Fedora packages.
(First I thought I'd put it in bugzilla, but I don't know what the right
component would be.)
The proposed change is the following: when building RPM packages, let's
convert all .mo files (gettext translations) to UTF-8.
Why?
- As Fedora is a fully UTF-8 system, applications are likely to request
translations in UTF-8. (There might be a few applications that are
exceptions, and some users may have special setup or special wrappers to
run certain applications in some other charset, but in the vast majority
of cases gettext is asked to return UTF-8 strings.)
If the .mo file is already in UTF-8, the gettext() call simply returns a
pointer into the area where the .mo file is mmap()ed. This is easy to
check with strace. No run-time conversion happens and no per-process
memory is involved; translations are shared by all the processes that use
the same message catalog.
If, however, the .mo file uses a different encoding, gettext() has to
allocate memory for the converted string and perform the conversion. So
if several processes display the same localized string, each of them
allocates its own memory to store the UTF-8 version and each performs the
charset conversion. Each of them also loads the corresponding gconv
module, which could be avoided, too.
To summarize, having all the .mo files in UTF-8 would save both memory and
CPU time.
- Currently the encoding of the .mo files is completely arbitrary; it is
always what the software developers or the translators happened to use.
With this change, it would be consistently UTF-8. This would make it
easier to find which package ships a particular translation. It often
happens that I want to locate which package a particular message comes
from. It might happen because a word is misspelled, or because the whole
message shouldn't appear and I'd like to fix the buggy package. The
obvious solution is to do a recursive grep on /usr/share/locale/<lang>. If
all the .mo files of the distribution are converted to UTF-8, I can do it
simply, without having to worry about accented characters. (grep in UTF-8
mode works fine and finds the matching UTF-8 .mo files even though they
are not fully valid UTF-8 files, because the UTF-8 strings are surrounded
by other binary data.) However, if multiple encodings are used, there is
no straightforward way to search for accented letters, and the job becomes
much harder.
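
The recursive search described above can be sketched in a few lines of Python. This is only an illustration, not part of the proposal: the function name find_catalogs_containing and the example word are made up here, and the sketch assumes the catalogs are small enough to read into memory.

```python
from pathlib import Path


def find_catalogs_containing(locale_dir, text):
    """Return the sorted .mo paths under locale_dir that contain `text`
    encoded as UTF-8 (hypothetical helper, for illustration only)."""
    needle = text.encode("utf-8")
    hits = []
    for mo in Path(locale_dir).rglob("*.mo"):
        # A .mo file is binary, so compare raw bytes instead of decoding it.
        if needle in mo.read_bytes():
            hits.append(mo)
    return sorted(hits)
```

For example, find_catalogs_containing("/usr/share/locale/hu", "Mégse") would list the Hungarian catalogs that contain that word, but only those stored in UTF-8. A catalog stored in ISO-8859-2 holds the same word as different bytes and is silently missed, which is exactly the problem with mixed encodings.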
How?
- Thanks to RPM's flexibility, none of the packages needs to be modified,
only the RPM macros. I recommend performing the conversion on the .mo files
after the '%install' step (in '%__install_post' or whatever it's called).
This way the whole thing is independent of the package's build procedure:
it doesn't matter whether it uses autotools or not, or whether it
regenerates .mo files from .po or ships pre-built .mo files; there is no
need to worry about faulty and hence skipped .po files, or about
non-standard locations of po/mo files within the source tree; etc.
- The only thing that needs to be done is an "msgunfmt" followed by
"msgconv -t UTF-8" and finally "msgfmt" for all the .mo files under the
standard locale directories.
- So, after all, it is _very_ easy to implement it.
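
A minimal sketch of that three-step pipeline, written in Python rather than as an RPM macro. The function names, the temporary file names and the default "usr/share/locale" subdirectory are assumptions made for illustration; only the three gettext tools and the "-t UTF-8" option come from the proposal itself. The "-o" option used to name the output file is the standard output-file option of these tools.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def conversion_commands(mo, po, utf8_po, utf8_mo):
    """Build the three gettext commands that re-encode one catalog as UTF-8."""
    return [
        ["msgunfmt", "-o", str(po), str(mo)],                     # .mo -> .po
        ["msgconv", "-t", "UTF-8", "-o", str(utf8_po), str(po)],  # re-encode
        ["msgfmt", "-o", str(utf8_mo), str(utf8_po)],             # .po -> .mo
    ]


def convert_catalog(mo):
    """Rewrite a single .mo file in place as UTF-8 (hypothetical helper)."""
    mo = Path(mo)
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        utf8_mo = tmp / "utf8.mo"
        for cmd in conversion_commands(mo, tmp / "orig.po", tmp / "utf8.po", utf8_mo):
            subprocess.run(cmd, check=True)  # stop at the first failing tool
        shutil.copyfile(utf8_mo, mo)         # overwrite only after all steps succeeded


def convert_tree(buildroot, locale_subdir="usr/share/locale"):
    """Convert every .mo file under the (assumed) locale directory of a buildroot."""
    for mo in sorted(Path(buildroot, locale_subdir).rglob("*.mo")):
        convert_catalog(mo)
```

The original file is overwritten only once all three tools have succeeded, so a failure in msgunfmt or msgconv leaves the shipped catalog untouched.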
Is it safe?
- The encoding inside the .mo files is completely transparent to the
applications as gettext() and its friends always convert the strings to
the charset requested by the application. So applications won't notice any
change.
- We performed this step when building all the packages of the UHU-Linux 2.0
distribution, which was released half a year ago, and no problems have
come to light so far. (During the test period there was only one package,
namely coreutils, where the converted .mo files were corrupted, but as it
turned out, this was caused by a bug in msgunfmt in gettext-0.15, which is
already fixed in gettext-0.16.)
Any drawbacks?
- None that I know of, except for a negligible growth in the packages' sizes.
Well, I hope you like my idea :-)
bye,
Egmont
Egmont Koblinger (egmont(a)uhulinux.hu) said:
> - The only thing that needs to be done is an "msgunfmt" followed by
> "msgconv -t UTF-8" and finally "msgfmt" for all the .mo files under the
> standard locale directories.
>
> - So, after all, it is _very_ easy to implement it.
So, what this has the potential to do is encode the timestamp of this
conversion in the final .mo file. Which then will cause multilib conflicts
on package installs.
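
Whether the conversion really embeds a timestamp is not settled here, but it is easy to test. As a rough sketch, assuming two buildroots produced by building the same package twice (for example once per architecture), one could compare checksums of the resulting catalogs. The function names below are made up for illustration.

```python
import hashlib
from pathlib import Path


def catalog_digests(root):
    """Map each .mo path (relative to root) to the SHA-256 of its contents."""
    root = Path(root)
    return {
        str(mo.relative_to(root)): hashlib.sha256(mo.read_bytes()).hexdigest()
        for mo in root.rglob("*.mo")
    }


def differing_catalogs(root_a, root_b):
    """Return the relative paths present in both trees whose contents differ."""
    a, b = catalog_digests(root_a), catalog_digests(root_b)
    return sorted(path for path in a.keys() & b.keys() if a[path] != b[path])
```

An empty result from differing_catalogs() would suggest that the converted files are reproducible and safe to share between multilib packages; any path it reports is a potential conflict.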
Bill
On 01/03/07, Bill Nottingham <notting(a)redhat.com> wrote:
> Egmont Koblinger (egmont(a)uhulinux.hu) said:
>> - The only thing that needs to be done is an "msgunfmt" followed by
>> "msgconv -t UTF-8" and finally "msgfmt" for all the .mo files under the
>> standard locale directories.
>>
>> - So, after all, it is _very_ easy to implement it.
> So, what this has the potential to do is encode the timestamp of this
> conversion in the final .mo file. Which then will cause multilib conflicts
> on package installs.
Surely this sort of good idea should be done at a much higher level than
Fedora? If this sort of thing is done at the automake-type level then all
the distros benefit, although I can't pretend that I understand all the
msgfmt stuff.
Richard.
On Thu, 01.03.2007 at 15:54 -0500, Bill Nottingham wrote:
> Egmont Koblinger (egmont(a)uhulinux.hu) said:
>> - The only thing that needs to be done is an "msgunfmt" followed by
>> "msgconv -t UTF-8" and finally "msgfmt" for all the .mo files under the
>> standard locale directories.
>>
>> - So, after all, it is _very_ easy to implement it.
> So, what this has the potential to do is encode the timestamp of this
> conversion in the final .mo file. Which then will cause multilib conflicts
> on package installs.
We've had a policy in the GNOME project that says all .po files should
be UTF-8 encoded. Why not do it at that level so it has no chance of
impacting packaging? Are there any reasons why translators would need a
different encoding?
Cheers
Kjartan
Kjartan Maraas (kmaraas(a)broadpark.no) said:
> We've had a policy in the GNOME project that says all .po files should
> be UTF-8 encoded. Why not do it at that level so it has no chance of
> impacting packaging? Are there any reasons why translators would need a
> different encoding?
Other than potentially using cranky tools that only understand local
charsets, I can't think of one.
Heck, if someone wants this unilaterally done on all the i18n.redhat.com
po files, it can happen. He he he.
Bill
On Thu, Mar 01, 2007 at 10:05:28PM +0100, Kjartan Maraas wrote:
> Are there any reasons why translators would need a different encoding?
No, there aren't. But many translators and software developers still use
other encodings nowadays, and there are stalled projects too. You can't get
every software developer to convert .po files to UTF-8 and release a new
mainstream version :( Still, you want to have much of this software in
Fedora...
--
Egmont
Egmont Koblinger (egmont(a)uhulinux.hu) said:
>> Are there any reasons why translators would need a different encoding?
> No, there isn't. But many translators or software developers still use other
> encodings nowadays, and there are stalling projects too. You can't get every
> software developer to convert .po files to UTF-8 and release a new
> mainstream version :( Still you want to have many of these software in
> Fedora...
True, but at least for packages where Fedora is the upstream, might
as well do it. There seem to be only 36 or so files that need fixed
(out of 3900+).
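
How a count like this might be obtained is not stated in the thread. One possible sketch, in Python, reads the charset declared in each .po header and tallies the results. The function names are made up, and the sketch assumes the header's Content-Type line appears within the first 4096 bytes of the file.

```python
import re
from collections import Counter
from pathlib import Path

# Matches the charset in a header line such as:
#   "Content-Type: text/plain; charset=ISO-8859-2\n"
CHARSET_RE = re.compile(rb"charset=([A-Za-z0-9._:-]+)", re.IGNORECASE)


def po_charset(po_path):
    """Return the charset declared in a .po file's header, or None."""
    # Read raw bytes: the file cannot be decoded before its encoding is known.
    head = Path(po_path).read_bytes()[:4096]  # assumption: header is near the top
    match = CHARSET_RE.search(head)
    return match.group(1).decode("ascii") if match else None


def charset_counts(root):
    """Count the .po files under root by declared charset."""
    return Counter(po_charset(po) for po in Path(root).rglob("*.po"))
```

Everything in the resulting counter other than the UTF-8 entry is a candidate for conversion; files with no declared charset show up under the key None.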
Bill
Bill Nottingham wrote:
> Egmont Koblinger (egmont(a)uhulinux.hu) said:
>>> Are there any reasons why translators would need a different encoding?
>> No, there isn't. But many translators or software developers still use other
>> encodings nowadays, and there are stalling projects too. You can't get every
>> software developer to convert .po files to UTF-8 and release a new
>> mainstream version :( Still you want to have many of these software in
>> Fedora...
> True, but at least for packages where Fedora is the upstream, might
> as well do it. There seem to be only 36 or so files that need fixed
> (out of 3900+).
For which packages is Fedora the upstream?
-d
--
Dimitris Glezos
Jabber ID: glezos(a)jabber.org, GPG: 0xA5A04C3B
http://dimitris.glezos.com/
"He who gives up functionality for ease of use
loses both and deserves neither." (Anonymous)
--
Dimitris Glezos (dimitris(a)glezos.com) said:
>> True, but at least for packages where Fedora is the upstream, might
>> as well do it. There seem to be only 36 or so files that need fixed
>> (out of 3900+).
> For which packages is Fedora the upstream?
system-config-*, anaconda, initscripts, various other things.
:pserver:rhlinux.redhat.com:/usr/local/CVS translate will get
you some sort of idea.
Bill