The program's interface with the human should be designed to make the user's task easier. One possibility is to print messages in whatever language the user prefers.
Printing messages in different languages can be implemented in different ways. One could put all the different languages in the source code and choose among the variants every time a message has to be printed. This is certainly not a good solution since extending the set of languages is difficult (the code must be changed) and the code itself can become really big with dozens of message sets.
A better solution is to keep the message sets for each language in separate files which are loaded at runtime depending on the language selection of the user.
The GNU C Library provides two different sets of functions to support message translation. The problem is that neither of the interfaces is officially defined by the POSIX standard. The catgets family of functions is defined in the X/Open standard but this is derived from industry decisions and therefore not necessarily based on reasonable decisions.
As mentioned above the message catalog handling provides easy extendibility by using external data files which contain the message translations. I.e., these files contain for each of the messages used in the program a translation for the appropriate language. So the tasks of the message handling functions are
The two approaches mainly differ in the implementation of this last step. The design decisions made for this influence everything else.
The catgets functions are based on a simple scheme:
Associate every message to translate in the source code with a unique identifier. To retrieve a message from a catalog file solely the identifier is used.
This means for the author of the program that s/he will have to make sure the meaning of the identifier in the program code and in the message catalogs is always the same.
Before a message can be translated the catalog file must be located. The user of the program must be able to guide the responsible function to find whatever catalog the user wants. This is separated from what the programmer had in mind.
All the types, constants and functions for the catgets functions are defined/declared in the `nl_types.h' header file.
The catgets function family
The catopen function tries to locate the message data file named cat_name and loads it when found. The return value is of an opaque type and can be used in calls to the other functions to refer to this loaded catalog.
The return value is (nl_catd) -1 in case the function failed and no catalog was loaded. The global variable errno contains a code for the error causing the failure. But even if the function call succeeded this does not mean that all messages can be translated.
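The failure check described above can be sketched as follows. This is only an illustration: the helper name open_catalog is hypothetical and not part of the library, and reporting the error on standard error is an assumption about what a program might want to do.

```c
#include <errno.h>
#include <nl_types.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: open the catalog NAME for the locale selected
   for LC_MESSAGES.  On failure the reason stored in errno is reported
   and (nl_catd) -1 is returned unchanged to the caller.  */
nl_catd
open_catalog (const char *name)
{
  nl_catd desc = catopen (name, NL_CAT_LOCALE);

  if (desc == (nl_catd) -1)
    fprintf (stderr, "cannot open catalog %s: %s\n", name, strerror (errno));

  return desc;
}
```

A caller can simply continue with the returned descriptor, since a missing catalog only means that the untranslated messages are used.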
Locating the catalog file must happen in a way which lets the user of the program influence the decision. It is up to the user to decide about the language to use and sometimes it is useful to use alternate catalog files. All this can be specified by the user by setting some environment variables.
The first problem is to find out where all the message catalogs are stored. Every program could have its own place to keep all the different files but usually the catalog files are grouped by languages and the catalogs for all programs are kept in the same place.
To tell the catopen function where the catalog for the program can be found the user can set the environment variable NLSPATH to a value which describes her/his choice. Since this value must be usable for different languages and locales it cannot be a simple string. Instead it is a format string (similar to printf's). An example is
/usr/share/locale/%L/%N:/usr/share/locale/%L/LC_MESSAGES/%N
First one can see that more than one directory can be specified (with the usual syntax of separating them by colons). The next things to observe are the format elements, %L and %N in this case. The catopen function knows about several of them and the replacement for all of them is of course different.
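%N
This format element is substituted with the name of the catalog file. This is the value of the cat_name argument given to catopen.
%L
This format element is substituted with the name of the currently selected locale for translating messages.
%l
(This is the lowercase ell.) This format element is substituted with the language element of the locale name. The string describing the selected locale is expected to have the form lang[_terr[.codeset]] and this format uses the first part lang.
%t
This format element is substituted by the territory part terr of the name of the currently selected locale.
%c
This format element is substituted by the codeset part codeset of the name of the currently selected locale.
%%
Since % is used as a meta character there must be a way to express the % character in the result itself. Using %% does this just like it works for printf.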
Using NLSPATH allows arbitrary directories to be searched for message catalogs while still allowing different languages to be used. If the NLSPATH environment variable is not set, the default value is
prefix/share/locale/%L/%N:prefix/share/locale/%L/LC_MESSAGES/%N
where prefix is given to configure while installing the GNU C Library (this value is in many cases /usr or the empty string).
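To make the substitution of the format elements concrete, here is a minimal sketch that expands a single NLSPATH template (one colon-separated component). It is not the code used by catopen; the names expand_template and append are hypothetical, and the sketch assumes the locale name has the form lang[_terr[.codeset]] without further modifiers.

```c
#include <stddef.h>
#include <string.h>

/* Append N bytes of SRC to OUT, keeping OUT NUL-terminated.
   Returns 0 if the result would not fit into SIZE bytes.  */
static int
append (char *out, size_t size, size_t *len, const char *src, size_t n)
{
  if (*len + n >= size)
    return 0;
  memcpy (out + *len, src, n);
  *len += n;
  out[*len] = '\0';
  return 1;
}

/* Expand one NLSPATH template TMPL for the catalog NAME and a LOCALE
   of the form lang[_terr[.codeset]].  Returns 0 on success and -1 if
   OUT is too small.  Illustration only, not the library's code.  */
int
expand_template (const char *tmpl, const char *name, const char *locale,
                 char *out, size_t size)
{
  size_t lang_len = strcspn (locale, "_.");
  const char *terr = locale[lang_len] == '_' ? locale + lang_len + 1 : "";
  size_t terr_len = strcspn (terr, ".");
  const char *dot = strchr (locale, '.');
  const char *codeset = dot != NULL ? dot + 1 : "";
  size_t len = 0;

  if (size == 0)
    return -1;
  out[0] = '\0';

  for (; *tmpl != '\0'; ++tmpl)
    {
      const char *src = tmpl;   /* Default: copy the character itself.  */
      size_t n = 1;

      if (*tmpl == '%' && tmpl[1] != '\0')
        {
          ++tmpl;
          switch (*tmpl)
            {
            case 'N': src = name; n = strlen (name); break;
            case 'L': src = locale; n = strlen (locale); break;
            case 'l': src = locale; n = lang_len; break;
            case 't': src = terr; n = terr_len; break;
            case 'c': src = codeset; n = strlen (codeset); break;
            case '%': src = "%"; n = 1; break;
            default: src = tmpl - 1; n = 2; break; /* Unknown: keep as is.  */
            }
        }

      if (!append (out, size, &len, src, n))
        return -1;
    }

  return 0;
}
```

With the template /usr/share/locale/%L/%N, the catalog name hello.cat and the locale name de_DE.ISO-8859-1 this produces /usr/share/locale/de_DE.ISO-8859-1/hello.cat.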
The remaining problem is to decide which locale name must be used. This value decides about the substitution of the format elements mentioned above. First of all the user can specify a path in the message catalog name (i.e., the name contains a slash character). In this situation the NLSPATH environment variable is not used. The catalog must exist as specified in the program, perhaps relative to the current working directory. This situation is not desirable and catalog names should never be written this way. Besides, this behavior is not portable to all other platforms providing the catgets interface.
Otherwise the values of environment variables from the standard environment are examined (see section Standard Environment Variables). Which variables are examined is decided by the flag parameter of catopen. If the value is NL_CAT_LOCALE (which is defined in `nl_types.h') then the catopen function uses the name of the locale currently selected for the LC_MESSAGES category.
If flag is zero the LANG environment variable is examined. This is a left-over from the early days when the concept of locales had not even reached the level of POSIX locales.
The environment variable and the locale name should have a value of the form lang[_terr[.codeset]] as explained above. If no environment variable is set the "C" locale is used which prevents any translation.
The return value of the function is in any case a valid string. Either it is a translation from a message catalog or it is the same as the string parameter. So a piece of code to decide whether a translation actually happened must look like this:
    {
      char *trans = catgets (desc, set, msg, input_string);
      if (trans == input_string)
        {
          /* Something went wrong.  */
        }
    }
When an error occurred the global variable errno is set to
While it sometimes can be useful to test for errors programs normally will avoid any test. If the translation is not available it is no big problem if the original, untranslated message is printed. Either the user understands this as well or s/he will look for the reason why the messages are not translated.
Please note that the currently selected locale does not depend on a call to the setlocale function. It is not necessary that the locale data files for this locale exist and calling setlocale succeeds. The catopen function directly reads the values of the environment variables.
The function catgets has to be used to access the message catalog previously opened using the catopen function. The catalog_desc parameter must be a value previously returned by catopen.
The next two parameters, set and message, reflect the internal organization of the message catalog files. This will be explained in detail below. For now it is interesting to know that a catalog can consist of several sets and the messages in each set are individually numbered. Neither the set numbers nor the message numbers need to be consecutive. They can be arbitrarily chosen. But each message (unless equal to another one) must have its own unique pair of set and message number.
Since it is not guaranteed that the message catalog for the language selected by the user exists the last parameter string helps to handle this case gracefully. If no matching string can be found string is returned. This means for the programmer that
It is somewhat uncomfortable to write a program using the catgets functions if no supporting functionality is available. Since each set/message number tuple must be unique the programmer must keep lists of the messages at the same time the code is written. And the work between several people working on the same project must be coordinated. We will see how some of these problems can be relaxed a bit (see section How to use the catgets interface).
The catclose function can be used to free the resources associated with a message catalog which previously was opened by a call to catopen. If the resources can be successfully freed the function returns 0. Otherwise it returns -1 and the global variable errno is set. Errors can occur if the catalog descriptor catalog_desc is not valid in which case errno is set to EBADF.
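Putting catopen, catgets and catclose together, a lookup with graceful fallback might look like the following sketch. The helper name lookup_message is hypothetical. The message is copied into a caller-supplied buffer because, as an assumption made here for safety, the string returned by catgets is not used after the catalog has been closed.

```c
#include <nl_types.h>
#include <stdio.h>

/* Hypothetical helper: copy the message (SET, MSG) from the catalog
   NAME into BUF.  FALLBACK is used when the catalog cannot be opened
   or the message is not present.  */
void
lookup_message (const char *name, int set, int msg, const char *fallback,
                char *buf, size_t size)
{
  nl_catd desc = catopen (name, NL_CAT_LOCALE);
  const char *text = fallback;

  /* catgets returns FALLBACK itself if no matching message exists.  */
  if (desc != (nl_catd) -1)
    text = catgets (desc, set, msg, fallback);

  /* Copy before closing; the catalog data is released by catclose.  */
  snprintf (buf, size, "%s", text);

  if (desc != (nl_catd) -1)
    catclose (desc);
}
```

A real program would normally open the catalog once at startup instead of once per message; the sketch only shows the order of the calls.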
The only reasonable way to translate all the messages of a program and store the result in a message catalog file which can be read by the catopen function is to write all the message texts to a file, hand it to the translator and let her/him translate them all. I.e., we must have a file with entries which associate the set/message tuple with a specific translation. This file format is specified in the X/Open standard and is as follows:
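Lines containing only whitespace characters or empty lines are ignored.
Lines which contain as the first non-whitespace character a $ followed by a whitespace character are comments and are also ignored.
If a line contains as the first non-whitespace characters the sequence $set followed by a whitespace character an additional argument is required to follow. This argument can either be a number, which is then used as the set number, or an identifier consisting of alphanumeric characters plus the underscore character. In the latter case the set automatically gets a number assigned, one more than the largest set number which appeared so far. How to use the symbolic names is explained in section How to use the catgets interface. It is an error if a symbol name appears more than once. All following messages are placed in a set with this number.
If a line contains as the first non-whitespace characters the sequence $delset followed by a whitespace character an additional argument is required to follow. This argument can either be a number or an identifier, as for $set. In both cases all messages in the named set are removed and do not appear in the output. But if this set is later selected with a $set command again messages could be added and these messages will appear in the output.
If a line contains after leading whitespace the sequence $quote, the quoting character used for this input file is changed to the first non-whitespace character following the $quote. If no non-whitespace character is present before the line ends quoting is disabled. By default no quoting character is used. In this mode strings are terminated with the first unescaped line break. If there is a $quote sequence present newlines need not be escaped. Instead a string is terminated with the first unescaped appearance of the quote character. A common usage of this feature would be to set the quote character to ". Then any appearance of the " in the strings must be escaped using the backslash (i.e., \" must be written).
Any other line must start with a number or an alphanumeric identifier (with the underscore character included). The text following the first whitespace character forms the string which is associated with the currently selected set and the message number given by the number or identifier. A message named by an identifier automatically gets a number assigned. How to use the symbolic identifiers is explained below (see section How to use the catgets interface). There is one limitation with the identifier: it must not be Set. The reason will be explained below.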
The text of the messages can contain escape characters. The usual bunch of characters known from the ISO C language are recognized (\n, \t, \v, \b, \r, \f, \\, and \nnn, where nnn is the octal coding of a character code).
Important: The handling of identifiers instead of numbers for the set and messages is a GNU extension. Systems strictly following the X/Open specification do not have this feature. An example for a message catalog file is this:
    $ This is a leading comment.
    $quote "

    $set SetOne
    1 Message with ID 1.
    two " Message with ID \"two\", which gets the value 2 assigned"

    $set SetTwo
    $ Since the last set got the number 1 assigned this set has number 2.
    4000 "The numbers can be arbitrary, they need not start at one."
This small example shows various aspects:
The first line and the line following $set SetTwo are comments since they start with $ followed by a whitespace.
The quoting character is set to ". Otherwise the quotes in the message definition would have to be left away and in this case the message with the identifier two would lose its leading whitespace.
While this file format is pretty easy it is not the best possible for use in a running program. The catopen function would have to parse the file and handle syntactic errors gracefully. This is not so easy and the whole process is pretty slow. Therefore the catgets functions expect the data in another more compact and ready-to-use file format. There is a special program gencat which is explained in detail in the next section.
Files in this other format are not human readable. To be easy to use by programs it is a binary file. But the format is byte order independent so translation files can be shared by systems of arbitrary architecture (as long as they use the GNU C Library).
Details about the binary file format are not important to know since these files are always created by the gencat program. The sources of the GNU C Library also provide the sources for the gencat program and so the interested reader can look through these source files to learn about the file format.
The gencat program is specified in the X/Open standard and the GNU implementation follows this specification and so processes all correctly formed input files. Additionally some extensions are implemented which help to work in a more reasonable way with the catgets functions.
The gencat program can be invoked in two ways:
`gencat [Option]... [Output-File [Input-File]...]`
This is the interface defined in the X/Open standard. If no Input-File parameter is given input will be read from standard input. Multiple input files will be read as if they are concatenated. If Output-File is also missing, the output will be written to standard output. To provide the interface one is used to from other programs a second interface is provided.
`gencat [Option]... -o Output-File [Input-File]...`
The option `-o' is used to specify the output file and all file arguments are used as input files.
Besides this one can use `-' or `/dev/stdin' for Input-File to denote the standard input. Correspondingly one can use `-' and `/dev/stdout' for Output-File to denote standard output. Using `-' as a file name is allowed in X/Open while using the device names is a GNU extension.
The gencat program works by concatenating all input files and then merging the resulting collection of message sets with a possibly existing output file. This is done by removing all messages with set/message number tuples matching any of the generated messages from the output file and then adding all the new messages. Regenerating a catalog file while ignoring the old contents therefore requires removing the output file if it exists. If the output is written to standard output no merging takes place.
The following table shows the options understood by the gencat program. The X/Open standard does not specify any option for the program so all of these are GNU extensions.
`-H name'
This option is used to emit the symbolic names given to sets and messages in the input files for use in the C program. The name parameter specifies the name of the output file. It will contain a number of C preprocessor #defines to associate a name with a number.
Please note that the generated file only contains the symbols from the input files. If the output is merged with the previous content of the output file the possibly existing symbols from the file(s) which generated the old output files are not in the generated header file.
How to use the catgets interface
The catgets functions can be used in two different ways: by following slavishly the X/Open specs and not relying on the extensions, or by using the GNU extensions. We will take a look at the former method first to understand the benefits of extensions.
Since the X/Open format of the message catalog files does not allow symbol names we have to work with numbers all the time. When we start writing a program we have to replace all appearances of translatable strings with something like
catgets (catdesc, set, msg, "string")
catdesc is obtained from a call to catopen which is normally done once at the program start. The "string" is the string we want to translate. The problems start with the set and message numbers.
In a bigger program several programmers usually work at the same time on the program and so coordinating the number allocation is crucial. Though no two different strings must be indexed by the same tuple of numbers it is highly desirable to reuse the numbers for equal strings with equal translations (please note that there might be strings which are equal in one language but have different translations due to different contexts).
The allocation process can be relaxed a bit by using different set numbers for different parts of the program. So the number of developers who have to coordinate the allocation can be reduced. But still lists must be kept to track the allocation and errors can easily happen. These errors cannot be discovered by the compiler or the catgets functions. Only the user of the program might see wrong messages printed. In the worst cases the messages are so misleading that they cannot be recognized as wrong. Think about the translations for "true" and "false" being exchanged. This could result in a disaster.
The problems mentioned in the last section derive from the fact that:
By constantly using symbolic names and by providing a method which maps the string content to a symbolic name (however this will happen) one can prevent both problems above. The cost of this is that the programmer has to write a complete message catalog file while s/he is writing the program itself.
This is necessary since the symbolic names must be mapped to numbers before the program sources can be compiled. In the last section it was described how to generate a header containing the mapping of the names. E.g., for the example message file given in the last section we could call the gencat program as follows (assume `ex.msg' contains the sources).
gencat -H ex.h -o ex.cat ex.msg
This generates a header file with the following content:
    #define SetTwoSet 0x2   /* ex.msg:8 */

    #define SetOneSet 0x1   /* ex.msg:4 */
    #define SetOnetwo 0x2   /* ex.msg:6 */
As can be seen the various symbols given in the source file are mangled to generate unique identifiers and these identifiers get numbers assigned. Reading the source file and knowing about the rules will allow one to predict the content of the header file (it is deterministic) but this is not necessary. The gencat program can take care of everything. All the programmer has to do is to put the generated header file in the dependency list of the source files of her/his project and to add a rule to regenerate the header if any of the input files change.
One word about the symbol mangling. Every symbol consists of two parts: the name of the message set plus the name of the message or the special string Set. So SetOnetwo means this macro can be used to access the translation with identifier two in the message set SetOne.
The other names denote the names of the message sets. The special string Set is used in the place of the message identifier.
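The mangling rule just described is simple enough to be written down in a few lines. The following sketch only mirrors the rule as stated in the text; the function mangled_name is hypothetical and is not taken from the gencat sources.

```c
#include <stdio.h>

/* Build the macro name for the message MSG_NAME in the set SET_NAME.
   A null MSG_NAME requests the name of the set itself, which uses the
   special string "Set".  Returns 0 on success, -1 if BUF is too small.  */
int
mangled_name (const char *set_name, const char *msg_name,
              char *buf, size_t size)
{
  int n = snprintf (buf, size, "%s%s", set_name,
                    msg_name != NULL ? msg_name : "Set");

  return (n >= 0 && (size_t) n < size) ? 0 : -1;
}
```

For the set SetOne this yields SetOneSet for the set and SetOnetwo for the message two, which also shows why a message identifier named Set would collide with the name of its set.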
If in the code the second string of the set SetOne is used the C code should look like this:
catgets (catdesc, SetOneSet, SetOnetwo, " Message with ID \"two\", which gets the value 2 assigned")
Writing the function this way will allow to change the message number and even the set number without requiring any change in the C source code. (The text of the string is normally not the same; this is only for this example.)
To illustrate the usual way to work with the symbolic version numbers here is a little example. Assume we want to write the very complex and famous greeting program. We start by writing the code as usual:
    #include <stdio.h>

    int
    main (void)
    {
      printf ("Hello, world!\n");
      return 0;
    }
Now we want to internationalize the message and therefore replace the message with whatever the user wants.
    #include <nl_types.h>
    #include <stdio.h>

    #include "msgnrs.h"

    int
    main (void)
    {
      nl_catd catdesc = catopen ("hello.cat", NL_CAT_LOCALE);

      /* MainSet and MainHello are defined in the generated msgnrs.h.  */
      printf (catgets (catdesc, MainSet, MainHello, "Hello, world!\n"));

      catclose (catdesc);

      return 0;
    }
We see how the catalog object is opened and the returned descriptor used in the other function calls. It is not really necessary to check for failure of any of the functions since even in these situations the functions will behave reasonably. They will simply return the untranslated default string.
What remains unspecified here are the constants MainSet and MainHello. These are the symbolic names describing the message. To get the actual definitions which match the information in the catalog file we have to create the message catalog source file and process it using the gencat program.
    $ Messages for the famous greeting program.
    $quote "

    $set Main
    Hello "Hallo, Welt!\n"
Now we can start building the program (assume the message catalog source file is named `hello.msg' and the program source file `hello.c'):
    % gencat -H msgnrs.h -o hello.cat hello.msg
    % cat msgnrs.h
    #define MainSet 0x1     /* hello.msg:4 */
    #define MainHello 0x1   /* hello.msg:5 */
    % gcc -o hello hello.c -I.
    % cp hello.cat /usr/share/locale/de/LC_MESSAGES
    % echo $LC_ALL
    de
    % ./hello
    Hallo, Welt!
    %
The call of the gencat program creates the missing header file `msgnrs.h' as well as the message catalog binary. The former is used in the compilation of `hello.c' while the latter is placed in a directory in which the catopen function will try to locate it. Please check the LC_ALL environment variable and the default path for catopen presented in the description above.
Sun Microsystems tried to standardize a different approach to message translation in the Uniforum group. There never was a real standard defined but still the interface was used in Sun's operating systems. Since this approach fits better in the development process of free software it is also used throughout the GNU project and the GNU `gettext' package provides support for this outside the GNU C Library.
The code of the `libintl' from GNU `gettext' is the same as the code in the GNU C Library. So the documentation in the GNU `gettext' manual is also valid for the functionality here. The following text will describe the library functions in detail. But the numerous helper programs are not described in this manual. Instead people should read the GNU `gettext' manual (see section `GNU gettext utilities' in Native Language Support Library and Tools). We will only give a short overview.
Though the catgets functions are available by default on more systems the gettext interface is at least as portable as the former. The GNU `gettext' package can be used wherever the functions are not available.
The gettext family of functions
The paradigm underlying the gettext approach to message translation is different from that of the catgets functions, but the basic functionality is equivalent. There are functions of the following categories:
The gettext functions have a very simple interface. The most basic function just takes the string which shall be translated as the argument and it returns the translation. This is fundamentally different from the catgets approach where an extra key is necessary and the original string is only used for the error case.
If the string which has to be translated is the only argument this of course means the string itself is the key. I.e., the translation will be selected based on the original string. The message catalogs must therefore contain the original strings plus one translation for any such string. The task of the gettext function is to compare the argument string with the available strings in the catalog and return the appropriate translation. Of course this process is optimized so that it is not more expensive than an access using an atomic key like in catgets.
The gettext approach has some advantages but also some disadvantages. Please see the GNU `gettext' manual for a detailed discussion of the pros and cons.
All the definitions and declarations for gettext can be found in the `libintl.h' header file. On systems where these functions are not part of the C library they can be found in a separate library named `libintl.a' (or accordingly different for shared libraries).
The gettext function searches the currently selected message catalogs for a string which is equal to msgid. If there is such a string available it is returned. Otherwise the argument string msgid is returned.
Please note that although the return value is char * the returned string must not be changed. This broken type results from the history of the function and does not reflect the way the function should be used.
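One way to respect the rule that the result must not be changed is to store it in a pointer to const right away, so the compiler rejects accidental modifications. The following sketch shows this; the helper name print_translated is hypothetical. Its return value relies on the behavior described above, namely that msgid itself is returned when no translation is found.

```c
#include <libintl.h>
#include <stdio.h>

/* Hypothetical helper: write the translation of MSGID to STREAM.
   Returns nonzero if a string other than MSGID itself was returned,
   i.e., if a translation was found.  */
int
print_translated (FILE *stream, const char *msgid)
{
  /* Pointer to const: the returned string must never be modified.  */
  const char *text = gettext (msgid);

  fputs (text, stream);
  return text != msgid;
}
```

On systems where the gettext functions are not part of the C library the program has to be linked with the separate libintl library.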
Please note that above we wrote "message catalogs" (plural). This is a specialty of the GNU implementation of these functions and we will say more about this when we talk about the ways message catalogs are selected (see section How to determine which catalog to be used).
The gettext function does not modify the value of the global errno variable. This is necessary to make it possible to write something like
printf (gettext ("Operation failed: %m\n"));
Here the errno value is used in the printf function while processing the %m format element and if the gettext function would change this value (it is called before printf is called) we would get a wrong message.
So there is no easy way to detect a missing message catalog besides comparing the argument string with the result. But it is normally the task of the user to react to missing catalogs. The program cannot guess when a message catalog is really necessary since a user who speaks the language the program was developed in does not need any translation.
The remaining two functions to access the message catalog add some functionality to select a message catalog which is not the default one. This is important if parts of the program are developed independently. Every part can have its own message catalog and all of them can be used at the same time. The C library itself is an example: internally it uses the gettext functions but since it must not depend on a currently selected default message catalog it must specify all ambiguous information.
The dgettext function acts just like the gettext function. It only takes an additional first argument domainname which guides the selection of the message catalogs which are searched for the translation. If the domainname parameter is the null pointer the dgettext function is exactly equivalent to gettext since the default value for the domain name is used.
As for gettext the return value type is char * which is an anachronism. The returned string must never be modified.
The dcgettext function adds another argument to those which dgettext takes. This argument category specifies the last piece of information needed to locate the message catalog. I.e., the domain name and the locale category exactly specify which message catalog has to be used (relative to a given directory, see below).
The dgettext function can be expressed in terms of dcgettext by using
dcgettext (domain, string, LC_MESSAGES)
instead of
dgettext (domain, string)
This also shows which values are expected for the third parameter. One has to use the available selectors for the categories available in `locale.h'. Normally the available values are LC_CTYPE, LC_COLLATE, LC_MESSAGES, LC_MONETARY, LC_NUMERIC, and LC_TIME. Please note that LC_ALL must not be used and even though the names might suggest this, there is no relation to the environment variables of this name.
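A program part with its own catalog would typically hide the domain name behind a small wrapper. The following sketch uses the explicit dcgettext form from above; the wrapper name foo_gettext and the domain name "foo" are only examples.

```c
#include <libintl.h>
#include <locale.h>

/* Hypothetical wrapper for a program part with the example domain
   "foo".  Spelling out LC_MESSAGES makes this equivalent to
   dgettext ("foo", msgid).  */
const char *
foo_gettext (const char *msgid)
{
  return dcgettext ("foo", msgid, LC_MESSAGES);
}
```

Because the domain is named explicitly, such a wrapper does not depend on which default domain the rest of the program has selected.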
The dcgettext function is only implemented for compatibility with other systems which have gettext functions. There is not really any situation where it is necessary (or useful) to use a value other than LC_MESSAGES for the category parameter. We are dealing with messages here and any other choice can only be confusing.
As for gettext the return value type is char * which is an anachronism. The returned string must never be modified.
When using the three functions above in a program it is a frequent case that the msgid argument is a constant string. So it is worthwhile to optimize this case. Thinking about this for a moment one will realize that as long as no new message catalog is loaded the translation of a message will not change. This optimization is actually implemented by the gettext, dgettext and dcgettext functions.
The functions to retrieve the translations for a given message have a remarkably simple interface. But to still give the user of the program the opportunity to select exactly the translation s/he wants, and also to give the programmer the possibility to influence the way catalog files are located, there is a quite complicated underlying mechanism which controls all this. The code is complicated, the use is easy.
Basically we have two different tasks to perform which can also be performed by the catgets functions:
This is the functionality required by the specifications for gettext and this is also what the catgets functions are able to do. But there are some problems unresolved:
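The language to be used can be specified in several different ways. There is no generally accepted standard for this and the user always expects the program to understand what s/he means. E.g., to select the German translation one could write de, german, or deutsch and the program should always react the same.
Sometimes the specification of the user is too detailed. If s/he, e.g., specifies de_DE.ISO-8859-1 which means German, spoken in Germany, coded using the ISO 8859-1 character set there is the possibility that a message catalog matching this exactly is not available. But there could be a catalog matching de and if the character set used on the machine is always ISO 8859-1 there is no reason why this latter message catalog should not be used. (We call this message inheritance.)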
We can divide the configuration actions in two parts: the one is performed by the programmer, the other by the user. We will start with the functions the programmer can use since the user configuration will be based on this.
As the functions described in the last sections already mention separate sets of messages can be selected by a domain name. This is a simple string which should be unique for each program part which uses a separate domain. It is possible to use arbitrarily many domains in one program at the same time. E.g., the GNU C Library itself uses a domain named libc while the program using the C Library could use a domain named foo. The important point is that at any time exactly one domain is active. This is controlled with the following function.
The textdomain function sets the default domain, which is used in all future gettext calls, to domainname. Please note that dgettext and dcgettext calls are not influenced if the domainname parameter of these functions is not the null pointer.
Before the first call to textdomain the default domain is messages. This is the name specified in the specification of the gettext API. This name is as good as any other name. No program should ever really use a domain with this name since this can only lead to problems.
The function returns the value which is from now on taken as the default domain. If the system went out of memory the returned value is NULL and the global variable errno is set to ENOMEM. Despite the return value type being char * the return string must not be changed. It is allocated internally by the textdomain function.
If the domainname parameter is the null pointer no new default domain is set. Instead the currently selected default domain is returned.
If the domainname parameter is the empty string the default domain is reset to its initial value, the domain with the name messages. This possibility is questionable to use since the domain messages really never should be used.
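The two uses of textdomain, setting and querying the default domain, can be sketched as follows. The helper names select_domain and current_domain are hypothetical; the error handling simply reports the failure and is only one possible choice.

```c
#include <libintl.h>
#include <stdio.h>

/* Hypothetical helper: make DOMAIN the default domain for all future
   gettext calls.  Returns 0 on success and -1 on failure.  */
int
select_domain (const char *domain)
{
  if (textdomain (domain) == NULL)
    {
      perror ("textdomain");
      return -1;
    }
  return 0;
}

/* Hypothetical helper: return the current default domain without
   changing it.  The result must not be modified by the caller.  */
const char *
current_domain (void)
{
  return textdomain (NULL);
}
```

Before the first call to select_domain the function current_domain reports the initial domain messages.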
The bindtextdomain function can be used to specify the directory which contains the message catalogs for domain domainname for the different languages. To be correct, this is the directory where the hierarchy of directories is expected. Details are explained below.
For the programmer it is important to note that the translations which come with the program have to be placed in a directory hierarchy starting at, say, `/foo/bar'. Then the program should make a bindtextdomain call to bind the domain for the current program to this directory. This ensures the catalogs are found. A correctly running program does not depend on the user setting an environment variable.
The bindtextdomain function can be used several times and if the domainname argument is different the previously bound domains will not be overwritten.
If the program which wishes to use bindtextdomain at some point uses the chdir function to change the current working directory it is important that the dirname string is an absolute pathname. Otherwise the addressed directory might vary over time.
If the dirname parameter is the null pointer bindtextdomain returns the currently selected directory for the domain with the name domainname.
The bindtextdomain
function returns a pointer to a string
containing the name of the selected directory name. The string is
allocated internally in the function and must not be changed by the
user. If the system went out of core during the execution of
bindtextdomain
the return value is NULL
and the global
variable errno is set accordingly.
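A minimal sketch of a binding call that follows the advice above. The helper name bind_catalog_dir is hypothetical, and rejecting relative paths is a choice made for this example, not something bindtextdomain itself enforces.

```c
#include <libintl.h>
#include <stdio.h>

/* Hypothetical helper: bind DOMAIN to the absolute directory DIR.
   Returns 0 on success, -1 for a relative path or on failure.  */
int
bind_catalog_dir (const char *domain, const char *dir)
{
  /* Insist on an absolute pathname so that a later chdir call
     cannot change which directory is addressed.  */
  if (dir == NULL || dir[0] != '/')
    return -1;

  if (bindtextdomain (domain, dir) == NULL)
    {
      perror ("bindtextdomain");    /* errno was set by the call */
      return -1;
    }
  return 0;
}
```

Calling bindtextdomain with a null pointer as the second argument afterwards returns the directory that was bound.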
The functions of the gettext family described so far (and all the catgets functions as well) have one problem in the real world which has been neglected completely in all existing approaches: the handling of plural forms.
Looking through Unix source code before the time anybody thought about internationalization (and, sadly, even afterwards) one can often find code similar to the following:
printf ("%d file%s deleted", n, n == 1 ? "" : "s");
After the first complaints from people internationalizing the code, people either completely avoided formulations like this or used strings like "file(s)". Both look unnatural and should be avoided. First tries to solve the problem correctly looked like this:
if (n == 1)
  printf ("%d file deleted", n);
else
  printf ("%d files deleted", n);
But this does not solve the problem. It helps languages where the plural form of a noun is not simply constructed by adding an `s' but that is all. Once again people fell into the trap of believing the rules their language is using are universal. But the handling of plural forms differs widely between the language families. Two things can differ between language families (and even within them): how the plural forms are built, and how many plural forms exist.
The consequence of this is that application writers should not try to solve the problem in their code. This would be localization since it is only usable for certain, hardcoded language environments. Instead the extended gettext interface should be used.
These extra functions take two strings and a numerical argument instead of the one key string. The idea behind this is that using the numerical argument and the first string as a key, the implementation can select the right plural form using rules specified by the translator. The two string arguments are then used to provide a return value in case no message catalog is found (similar to the normal gettext behavior). In this case the rules for Germanic languages are used and it is assumed that the first string argument is the singular form, the second the plural form.
This has the consequence that programs without language catalogs can display the correct strings only if the program itself is written using a Germanic language. This is a limitation but since the GNU C library (as well as the GNU gettext package) is written as part of the GNU project and the coding standards for the GNU project require programs to be written in English, this solution nevertheless fulfills its purpose.
The ngettext function is similar to the gettext function as it finds the message catalogs in the same way. But it takes two extra arguments. The msgid1 parameter must contain the singular form of the string to be converted. It is also used as the key for the search in the catalog. The msgid2 parameter is the plural form. The parameter n is used to determine the plural form. If no message catalog is found msgid1 is returned if n == 1, otherwise msgid2.
An example for the use of this function is:
printf (ngettext ("%d file removed", "%d files removed", n), n);
Please note that the numeric value n has to be passed to the printf function as well. It is not sufficient to pass it only to ngettext.
The dngettext function is similar to the dgettext function in the way the message catalog is selected. The difference is that it takes two extra parameters to provide the correct plural form. These two parameters are handled in the same way ngettext handles them.
The dcngettext function is similar to the dcgettext function in the way the message catalog is selected. The difference is that it takes two extra parameters to provide the correct plural form. These two parameters are handled in the same way ngettext handles them.
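A minimal sketch of the domain-specific variant, assuming a hypothetical domain test-package and a hypothetical helper format_removed. Without an installed catalog the call falls back to the two strings given in the source, as described above.

```c
#include <libintl.h>
#include <stdio.h>

/* Hypothetical helper: format a "files removed" message for N files
   into BUF.  Returns the value of snprintf.  */
int
format_removed (char *buf, size_t bufsize, unsigned long n)
{
  /* dngettext picks the format string; N must still be passed to
     snprintf separately so that the %lu conversion is filled in.  */
  const char *fmt = dngettext ("test-package",
                               "%lu file removed",
                               "%lu files removed", n);
  return snprintf (buf, bufsize, fmt, n);
}
```

The same shape works for dcngettext, which takes the locale category as one further argument after n.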
A description of the problem can be found at the beginning of the last section. Now there is the question how to solve it. Without the input of linguists (which was not available) it was not possible to determine whether there are only a few different forms in which plural forms are formed or whether the number can increase with every new supported language.
Therefore the solution implemented is to allow the translator to specify the rules of how to select the plural form. Since the formula varies with every language this is the only viable solution except for hardcoding the information in the code (which still would require the possibility of extensions to not prevent the use of new languages). The details are explained in the GNU gettext manual. Here only a bit of information is provided.
The information about the plural form selection has to be stored in the header entry (the one with the empty msgid string). It looks like this:
Plural-Forms: nplurals=2; plural=n == 1 ? 0 : 1;
The nplurals value must be a decimal number which specifies how many different plural forms exist for this language. The string following plural is an expression which uses the C language syntax. Exceptions are that no negative numbers are allowed, numbers must be decimal, and the only variable allowed is n. This expression will be evaluated whenever one of the functions ngettext, dngettext, or dcngettext is called. The numeric value passed to these functions is then substituted for all uses of the variable n in the expression. The resulting value then must be greater than or equal to zero and smaller than the value given as the value of nplurals.
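The evaluation step can be illustrated with a small C sketch. The function names are hypothetical, and each function simply hard-codes one plural expression as ordinary C, whereas the library reads the expression from the catalog header at run time. The first function mirrors the header entry shown above, the second a three-form rule of the kind some languages need.

```c
/* nplurals=2; plural=n == 1 ? 0 : 1;
   Index 0 selects the singular, index 1 the plural.  */
unsigned long
plural_two_forms (unsigned long n)
{
  return n == 1 ? 0 : 1;
}

/* nplurals=3;
   plural=n==1 ? 0 :
          n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20) ? 1 : 2;
   The result is always 0, 1, or 2, i.e. smaller than nplurals.  */
unsigned long
plural_three_forms (unsigned long n)
{
  return n == 1 ? 0
         : (n % 10 >= 2 && n % 10 <= 4
            && (n % 100 < 10 || n % 100 >= 20)) ? 1
         : 2;
}
```

With the three-form rule the values 2, 22, and 104 select form 1, while 5, 12, and 21 select form 2.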
The following rules are known at this point. The languages are listed with their families. But this does not necessarily mean the information can be generalized for the whole family (as can be easily seen in the table below).
Plural-Forms: nplurals=1; plural=0;
Languages with this property include:
Plural-Forms: nplurals=2; plural=n != 1;
(Note: this uses the feature of C expressions that boolean expressions have the value zero or one.) Languages with this property include:
Plural-Forms: nplurals=2; plural=n>1;
Languages with this property include:
Plural-Forms: nplurals=3; plural=n==1 ? 0 : n==2 ? 1 : 2;
Languages with this property include:
Plural-Forms: nplurals=3; \
    plural=n%100/10==1 ? 2 : n%10==1 ? 0 : (n+9)%10>3 ? 2 : 1;
Languages with this property include:
Plural-Forms: nplurals=3; \
    plural=n==1 ? 0 : \
           n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20) ? 1 : 2;
(Continuation in the next line is possible.) Languages with this property include:
Plural-Forms: nplurals=4; \
    plural=n==1 ? 0 : n%10==2 ? 1 : n%10==3 || n%10==4 ? 2 : 3;
Languages with this property include:
gettext
uses
gettext not only looks up a translation in a message catalog. It also converts the translation on the fly to the desired output character set. This is useful if the user is working in a different character set than the translator who created the message catalog, because it avoids distributing variants of message catalogs which differ only in the character set.
The output character set is, by default, the value of nl_langinfo (CODESET), which depends on the LC_CTYPE part of the current locale. But programs which store strings in a locale independent way (e.g. UTF-8) can request that gettext and related functions return the translations in that encoding, by use of the bind_textdomain_codeset function.
Note that the msgid argument to gettext is not subject to character set conversion. Also, when gettext does not find a translation for msgid, it returns msgid unchanged, independently of the current output character set. It is therefore recommended that all msgids be US-ASCII strings.
The bind_textdomain_codeset function can be used to specify the output character set for message catalogs for domain domainname. The codeset argument must be a valid codeset name which can be used for the iconv_open function, or a null pointer.
If the codeset parameter is the null pointer, bind_textdomain_codeset returns the currently selected codeset for the domain with the name domainname. It returns NULL if no codeset has yet been selected.
The bind_textdomain_codeset function can be used several times. If used multiple times with the same domainname argument, the later call overrides the settings made by the earlier one.
The bind_textdomain_codeset function returns a pointer to a string containing the name of the selected codeset. The string is allocated internally in the function and must not be changed by the user. If the system ran out of memory during the execution of bind_textdomain_codeset, the return value is NULL and the global variable errno is set accordingly.
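A minimal sketch of a program that keeps its strings in UTF-8 and asks for translations in that encoding. The helper name request_utf8 and the domain name used with it are assumptions for illustration.

```c
#include <libintl.h>
#include <stdio.h>

/* Hypothetical helper: request UTF-8 output for DOMAIN.
   Returns 0 on success, -1 on failure.  */
int
request_utf8 (const char *domain)
{
  if (bind_textdomain_codeset (domain, "UTF-8") == NULL)
    {
      perror ("bind_textdomain_codeset");   /* errno was set */
      return -1;
    }
  return 0;
}
```

A later call with a null pointer as the codeset argument returns the name that was selected here.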
gettext
in GUI programs
One place where the gettext functions, if used normally, have big problems is within programs with graphical user interfaces (GUIs). The problem is that many of the strings which have to be translated are very short. They have to appear in pull-down menus which restricts the length. But strings which do not contain entire sentences or at least large fragments of a sentence may appear in more than one situation in the program but might have different translations. This is especially true for the one-word strings which are frequently used in GUI programs.
As a consequence many people say that the gettext approach is wrong and instead catgets should be used which indeed does not have this problem. But there is a very simple and powerful method to handle this kind of problem with the gettext functions.
As an example consider the following fictional situation. A GUI program has a menu bar with the following entries:
+------------+------------+--------------------------------------+
| File       | Printer    |                                      |
+------------+------------+--------------------------------------+
| Open     | | Select   |
| New      | | Open     |
+----------+ | Connect  |
             +----------+
To have the strings File, Printer, Open, New, Select, and Connect translated there has to be at some point in the code a call to a function of the gettext family. But in two places the string passed into the function would be Open. The translations might not be the same and therefore we are in the dilemma described above.
One solution to this problem is to artificially lengthen the strings to make them unambiguous. But what would the program do if no translation is available? The lengthened string is not what should be printed. So we should use a slightly modified version of the functions.
To lengthen the strings a uniform method should be used. E.g., in the example above the strings could be chosen as
Menu|File
Menu|Printer
Menu|File|Open
Menu|File|New
Menu|Printer|Select
Menu|Printer|Open
Menu|Printer|Connect
Now all the strings are different and if now instead of gettext the following little wrapper function is used, everything works just fine:
char *
sgettext (const char *msgid)
{
  char *msgval = gettext (msgid);
  if (msgval == msgid)
    msgval = strrchr (msgid, '|') + 1;
  return msgval;
}
What this little function does is to recognize the case when no translation is available. This can be done very efficiently by a pointer comparison since the return value is the input value. If there is no translation we know that the input string is in the format we used for the Menu entries and therefore contains a | character. We simply search for the last occurrence of this character and return a pointer to the character following it. That's it!
If one now consistently uses the lengthened string form and replaces the gettext calls with calls to sgettext (this is normally limited to very few places in the GUI implementation) then it is possible to produce a program which can be internationalized.
With advanced compilers (such as GNU C) one can write the sgettext function as an inline function or as a macro like this:
#define sgettext(msgid) \
  ({ const char *__msgid = (msgid);            \
     char *__msgval = gettext (__msgid);       \
     if (__msgval == __msgid)                  \
       __msgval = strrchr (__msgid, '|') + 1;  \
     __msgval; })
The other gettext functions (dgettext, dcgettext and the ngettext equivalents) can and should have corresponding functions as well which look almost identical, except for the parameters and the call to the underlying function.
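As an illustration, a counterpart for dgettext might look like the following sketch. The name dsgettext is an assumption, and unlike the wrapper shown earlier this version also tolerates message ids without a | character, which is a choice made for this example.

```c
#include <libintl.h>
#include <string.h>

/* Hypothetical counterpart of sgettext for a specific domain.  */
char *
dsgettext (const char *domainname, const char *msgid)
{
  char *msgval = dgettext (domainname, msgid);
  if (msgval == msgid)
    {
      /* No translation was found: drop the disambiguating prefix
         if there is one, otherwise return the string unchanged.  */
      const char *sep = strrchr (msgid, '|');
      if (sep != NULL)
        msgval = (char *) sep + 1;
    }
  return msgval;
}
```

With no catalog installed, dsgettext ("test-package", "Menu|File|Open") yields the string Open.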
Now there is of course the question why such functions do not exist in the GNU C library? There are two parts of the answer to this question.
The examples above use | as the separator, which is a quite good choice because it resembles a notation frequently used in this context and it also is a character not often used in message strings.
But what if the character is used in message strings? Or if the chosen character is not available in the character set on the machine one compiles on (e.g., | is not required to exist for ISO C; this is why the `iso646.h' file exists in ISO C programming environments)?
There is only one more comment to make. The wrapper function above requires that the translated strings are not lengthened themselves. This is only logical. There is no need to disambiguate the strings (since they are never used as keys for a search) and one also saves quite some memory and disk space by doing this.
gettext
The last sections described what the programmer can do to internationalize the messages of the program. But it is finally up to the user to select the messages s/he wants to see. S/He must understand them.
The POSIX locale model uses the environment variables LC_COLLATE, LC_CTYPE, LC_MESSAGES, LC_MONETARY, LC_NUMERIC, and LC_TIME to select the locale which is to be used. This way the user can influence lots of functions. As we mentioned above the gettext functions also take advantage of this.
To understand how this happens it is necessary to take a look at the various components of the filename which gets computed to locate a message catalog. It is composed as follows:
dir_name/locale/LC_category/domain_name.mo
The default value for dir_name is system specific. It is computed from the value given as the prefix while configuring the C library. This value normally is `/usr' or `/'. For the former the complete dir_name is:
/usr/share/locale
We can use `/usr/share' since the `.mo' files containing the message catalogs are system independent, so all systems can use the same files. If the program executed the bindtextdomain function for the message domain that is currently handled, the dir_name component is exactly the value which was given to the function as the second parameter. I.e., bindtextdomain allows overwriting the only system dependent and fixed value to make it possible to address files anywhere in the filesystem.
The category is the name of the locale category which was selected in the program code. For gettext and dgettext this is always LC_MESSAGES, for dcgettext this is selected by the value of the third parameter. As said above, using a category other than LC_MESSAGES should be avoided.
The locale component is computed based on the category used. Just like for the setlocale function, the user's selection comes into play here. Some environment variables are examined in a fixed order and the first environment variable set determines the return value of the lookup process. In detail, for the category LC_xxx the following variables are examined in this order:
LANGUAGE
LC_ALL
LC_xxx
LANG
This looks very familiar. With the exception of the LANGUAGE environment variable this is exactly the lookup order the setlocale function uses. But why introduce the LANGUAGE variable?
The reason is that the syntax of the values these variables can have is different from what is expected by the setlocale function. If we set LC_ALL to a value following the extended syntax, the setlocale function would never be able to use the value of this variable. An additional variable removes this problem plus we can select the language independently of the locale setting which sometimes is useful.
While for the LC_xxx variables the value should consist of exactly one specification of a locale the LANGUAGE variable's value can consist of a colon separated list of locale names. The attentive reader will realize that this is the way we manage to implement one of our additional demands above: we want to be able to specify an ordered list of languages.
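The lookup order and the colon separated LANGUAGE value can be sketched in C as follows. Both function names are hypothetical, treating an empty variable as unset is an assumption of this sketch, and the real lookup inside the library has further conditions that are ignored here.

```c
#include <stdlib.h>
#include <string.h>

/* Return the value of the first variable that is set and not empty,
   in the order LANGUAGE, LC_ALL, CATEGORY_VAR, LANG, or NULL.  */
const char *
first_locale_setting (const char *category_var)
{
  const char *names[] = { "LANGUAGE", "LC_ALL", category_var, "LANG" };
  for (size_t i = 0; i < sizeof names / sizeof names[0]; i++)
    {
      const char *value = getenv (names[i]);
      if (value != NULL && value[0] != '\0')
        return value;
    }
  return NULL;
}

/* Copy entry number INDEX (counting from 0) of the colon separated
   LIST into BUF.  Returns 1 if the entry exists and fits, else 0.  */
int
language_entry (const char *list, size_t index, char *buf, size_t bufsize)
{
  while (*list != '\0')
    {
      size_t len = strcspn (list, ":");
      if (index == 0)
        {
          if (len >= bufsize)
            return 0;
          memcpy (buf, list, len);
          buf[len] = '\0';
          return 1;
        }
      index--;
      list += len;
      if (*list == ':')
        list++;                 /* skip the separator */
    }
  return 0;
}
```

For a LANGUAGE value of de_AT:de the entries are tried in order, so the more specific name comes first and the general one acts as the fallback.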
Back to the constructed filename we have only one component missing. The domain_name part is the name which was either registered using the textdomain function or which was given to dgettext or dcgettext as the first parameter. Now it becomes obvious that a good choice for the domain name in the program code is a string which is closely related to the program/package name. E.g., for the GNU C Library the domain name is libc.
A small piece of example code should show how the programmer is supposed to work:
{
  setlocale (LC_ALL, "");
  textdomain ("test-package");
  bindtextdomain ("test-package", "/usr/local/share/locale");
  puts (gettext ("Hello, world!"));
}
At the program start the default domain is messages, and the default locale is "C". The setlocale call sets the locale according to the user's environment variables; remember that correct functioning of gettext relies on the correct setting of the LC_MESSAGES locale (for looking up the message catalog) and of the LC_CTYPE locale (for the character set conversion). The textdomain call changes the default domain to test-package. The bindtextdomain call specifies that the message catalogs for the domain test-package can be found below the directory `/usr/local/share/locale'.
If the user now sets the variable LANGUAGE to de in her/his environment, the gettext function will try to use the translations from the file
/usr/local/share/locale/de/LC_MESSAGES/test-package.mo
From the above descriptions it should be clear which component of this filename is determined by which source.
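The composition of the filename can be written down as a short C sketch. The function name catalog_path is hypothetical, and the sketch only joins the four components; it does not reproduce how the library computes each of them.

```c
#include <stdio.h>

/* Compose dir_name/locale/category/domain_name.mo into BUF.
   Returns 0 on success, -1 if BUF is too small.  */
int
catalog_path (char *buf, size_t bufsize, const char *dir_name,
              const char *locale, const char *category,
              const char *domain_name)
{
  int n = snprintf (buf, bufsize, "%s/%s/%s/%s.mo",
                    dir_name, locale, category, domain_name);
  return (n < 0 || (size_t) n >= bufsize) ? -1 : 0;
}
```

Called with `/usr/local/share/locale', de, LC_MESSAGES, and test-package, it produces the filename shown above.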
In the above example we assumed that the LANGUAGE environment variable is set to de. This might be an appropriate selection but what happens if the user wants to use LC_ALL because of the wider usability and here the required value is de_DE.ISO-8859-1? We already mentioned above that a situation like this is not infrequent. E.g., a person might prefer reading a dialect and if this is not available fall back on the standard language.
The gettext functions know about situations like this and can handle them gracefully. The functions recognize the format of the value of the environment variable. They can split the value into different pieces and by leaving out one or the other part they can construct new values. This happens of course in a predictable way. To understand this one must know the format of the environment variable value. There are two more or less standardized forms:
language[_territory[.codeset]][@modifier]
language[_territory][+audience][+special][,[sponsor][_revision]]
The functions will automatically recognize which format is used. Less specific locale names will be stripped off in the order of the following list:
revision
sponsor
special
codeset
normalized codeset
territory
audience/modifier
From the last entry one can see that the modifier field in the X/Open format and the audience field in the other format have the same meaning. Besides, one can see that the language field for obvious reasons never will be dropped.
The only new thing is the normalized codeset entry. This is another goodie which is introduced to help reduce the chaos which derives from the inability of people to standardize the names of character sets. Instead of ISO-8859-1 one can often see 8859-1, 88591, iso8859-1, or iso_8859-1. The normalized codeset value is generated from the user-provided character set name by applying the following rules:
Remove all characters besides numbers and letters.
Fold letters to lowercase.
If the name now contains only digits, prepend the string "iso".
So all of the above names will be normalized to iso88591. This gives the program user much more freedom in choosing the locale name.
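A minimal sketch of such a normalization, under the assumption that the rules are: drop every character that is not a letter or digit, fold letters to lowercase, and prepend iso when only digits remain. The function name normalize_codeset is hypothetical and this is not the library's own implementation.

```c
#include <ctype.h>
#include <string.h>

/* Write the normalized form of CODESET into BUF.
   Returns 0 on success, -1 if BUF is too small.  */
int
normalize_codeset (const char *codeset, char *buf, size_t bufsize)
{
  size_t len = 0;
  int only_digits = 1;

  /* First pass: count the characters that are kept and check
     whether all of them are digits.  */
  for (const char *p = codeset; *p != '\0'; p++)
    if (isalnum ((unsigned char) *p))
      {
        len++;
        if (!isdigit ((unsigned char) *p))
          only_digits = 0;
      }

  size_t prefix = only_digits ? 3 : 0;
  if (len + prefix + 1 > bufsize)
    return -1;

  /* Second pass: emit the optional prefix, then the kept
     characters folded to lowercase.  */
  char *out = buf;
  if (only_digits)
    {
      memcpy (out, "iso", 3);
      out += 3;
    }
  for (const char *p = codeset; *p != '\0'; p++)
    if (isalnum ((unsigned char) *p))
      *out++ = (char) tolower ((unsigned char) *p);
  *out = '\0';
  return 0;
}
```

Under these assumed rules ISO-8859-1, 8859-1, 88591, iso8859-1, and iso_8859-1 all come out as iso88591.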
Even this extended functionality still does not help to solve the problem that completely different names can be used to denote the same locale (e.g., de and german). To be of help in this situation the locale implementation and also the gettext functions know about aliases.
The file `/usr/share/locale/locale.alias' (replace `/usr' with whatever prefix you used for configuring the C library) contains a mapping of alternative names to more regular names. The system manager is free to add new entries to fill her/his own needs. The selected locale from the environment is compared with the entries in the first column of this file ignoring the case. If they match the value of the second column is used instead for the further handling.
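The comparison against the first column can be sketched as follows. The table entries and the names locale_alias, alias_table, and expand_alias are hypothetical; a real implementation would read the entries from the alias file instead of using a built-in table.

```c
#include <stddef.h>
#include <strings.h>

struct locale_alias
{
  const char *alias;    /* first column: alternative name */
  const char *value;    /* second column: regular name */
};

/* Hypothetical sample entries standing in for the alias file.  */
static const struct locale_alias alias_table[] = {
  { "german", "de_DE.ISO-8859-1" },
  { "french", "fr_FR.ISO-8859-1" },
};

/* Return the regular name for NAME if it matches an alias,
   ignoring case, and NAME itself otherwise.  */
const char *
expand_alias (const char *name)
{
  for (size_t i = 0; i < sizeof alias_table / sizeof alias_table[0]; i++)
    if (strcasecmp (name, alias_table[i].alias) == 0)
      return alias_table[i].value;
  return name;
}
```

A name that matches no alias passes through unchanged, so regular locale names are not affected by the lookup.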
In the description of the format of the environment variables we already mentioned the character set as a factor in the selection of the message catalog. In fact, only catalogs which contain text written using the character set of the system/program can be used (directly; there will come a solution for this some day). This means for the user that s/he will always have to take care of this. If in the collection of the message catalogs there are files for the same language but coded using different character sets the user has to be careful.
gettext
The GNU C Library does not contain the source code for the programs to handle message catalogs for the gettext functions. As part of the GNU project the GNU gettext package contains everything the developer needs. The functionality provided by the tools in this package by far exceeds the abilities of the gencat program described above for the catgets functions.
There is a program msgfmt which is the equivalent program to the gencat program. It generates from the human-readable and -editable form of the message catalog a binary file which can be used by the gettext functions. But there are several more programs available.
The xgettext program can be used to automatically extract the translatable messages from a source file. I.e., the programmer need not take care of the translations and the list of messages which have to be translated. S/He will simply wrap the translatable string in calls to gettext et al. and the rest will be done by xgettext. This program has a lot of options which help to customize the output or help to understand the input better.
Other programs help to manage the development cycle when new messages appear in the source files or when a new translation of the messages appears. Here it should only be noted that using all the tools in GNU gettext it is possible to completely automate the handling of message catalogs. Besides marking the translatable strings in the source code and generating the translations the developers do not have anything to do themselves.