iconv Implementation in the GNU C library
After reading about the problems of iconv
implementations in the
last section it is certainly good to note that the implementation in
the GNU C library has none of the problems mentioned above. What
follows is a step-by-step analysis of the points raised above. The
evaluation is based on the current state of the development (as of
January 1999). The development of the iconv
functions is not
complete, but basic functionality has solidified.
The GNU C library's iconv
implementation uses shared loadable
modules to implement the conversions. A very small number of
conversions are built into the library itself but these are only rather
trivial conversions.
All the benefits of loadable modules are available in the GNU C library
implementation. This is especially appealing since the interface is
well documented (see below) and it therefore is easy to write new
conversion modules. The drawback of using loadable objects is not a
problem in the GNU C library, at least on ELF systems. Since the
library is able to load shared objects even in statically linked
binaries, static linking need not be forbidden in case one wants to
use iconv.
The second mentioned problem is the number of supported conversions. Currently, the GNU C library supports more than 150 character sets. The way the implementation is designed the number of supported conversions is greater than 22350 (150 times 149). If any conversion from or to a character set is missing it can easily be added.
Impressive as it may be, this high number is due to the fact that the
GNU C library implementation of iconv does not have the third problem
mentioned above. I.e., whenever there is a conversion from a character
set A to B and from B to C it is always possible to convert from A to C
directly. If iconv_open returns an error and sets errno to EINVAL this
really means there is no known way, directly or indirectly, to perform
the wanted conversion.
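The check described above can be sketched with the standard iconv_open interface. This is a minimal illustration, not code from the library; the helper name conversion_available is made up for this example. Note that iconv_open takes the destination character set as its first argument.

```c
#include <errno.h>
#include <iconv.h>

/* Hypothetical helper: returns 1 if a conversion from FROM to TO can
   be opened, 0 if iconv_open fails with EINVAL (no known conversion,
   directly or indirectly), and -1 for any other failure.  */
int
conversion_available (const char *from, const char *to)
{
  /* The destination character set comes first.  */
  iconv_t cd = iconv_open (to, from);

  if (cd == (iconv_t) -1)
    return errno == EINVAL ? 0 : -1;

  iconv_close (cd);
  return 1;
}
```

A caller would treat a result of 0 as "no module path exists" and a result of -1 as a resource problem that is unrelated to the character sets requested.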
This is achieved by providing for each character set a conversion from and to UCS-4 encoded ISO 10646. Using ISO 10646 as an intermediate representation it is possible to triangulate, i.e., to convert from the source to the destination character set by way of the intermediate representation.
There is no inherent requirement to provide a conversion to ISO 10646 for a new character set and it is also possible to provide other conversions where neither source nor destination character set is ISO 10646. The currently existing set of conversions is simply meant to cover all conversions which might be of interest.
All currently available conversions use the triangulation method above, which makes conversions run unnecessarily slowly. If, e.g., somebody often needs the conversion from ISO-2022-JP to EUC-JP, a quicker solution would involve direct conversion between the two character sets, skipping the intermediate conversion to ISO 10646. The two character sets of interest are much more similar to each other than to ISO 10646.
In such a situation one can easily write a new conversion and provide
it as a better alternative. The GNU C library iconv implementation
would automatically use the module implementing the conversion if it is
specified to be more efficient.
All information about the available conversions comes from a file named
`gconv-modules' which can be found in any of the directories along
the GCONV_PATH. The `gconv-modules' files are line-oriented
text files, where each of the lines has one of the following formats:
alias
Defines an alias name for a character set. There are two more words
expected on the line. The first one defines the alias name and the
second defines the original name of the character set. The effect is
that it is possible to use the alias name in the fromset or toset
parameters of iconv_open and achieve the same result as when using the
real character set name.
This is quite important as a character set often has many different
names. There is normally an official name but this need not correspond
to the most popular name. Besides this, many character sets have
special names which are constructed in a systematic way. E.g., all
character sets specified by the ISO have an alias of the form
ISO-IR-nnn where nnn is the registration number. This allows programs
which know about the registration number to construct character set
names and use them in iconv_open calls. More on the available names
and aliases follows below.
module
Introduces an available conversion module. These lines must contain
three or four more words.
The first word specifies the source character set, the second word the
destination character set of the conversion implemented in this module.
The third word is the name of the loadable module. The filename is
constructed by appending the usual shared object suffix (normally
`.so') and this file is then supposed to be found in the same
directory the `gconv-modules' file is in. The last word on the
line, which is optional, is a numeric value representing the cost of the
conversion. If this word is missing a cost of 1 is assumed. The
numeric value itself does not matter that much; what counts are the
relative values of the sums of costs for all possible conversion paths.
Below is a more precise description of the use of the cost value.
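To make the two line formats concrete, here is a small sketch of a parser for them. It is an illustration under simplifying assumptions (words of at most 63 bytes, no comment handling), not the code the library uses; the names parse_gconv_line, struct parsed_line and enum line_kind are invented for this example.

```c
#include <stdio.h>
#include <string.h>

enum line_kind { LINE_OTHER, LINE_ALIAS, LINE_MODULE };

/* Hypothetical record for one parsed `gconv-modules' line.  */
struct parsed_line
{
  enum line_kind kind;
  char a[64];      /* alias name, or source character set */
  char b[64];      /* original name, or destination character set */
  char file[64];   /* module name without the shared object suffix */
  int cost;        /* stays 1 when the optional fourth word is missing */
};

/* Parse one line and return the kind that was recognized.  */
enum line_kind
parse_gconv_line (const char *line, struct parsed_line *out)
{
  char key[16];

  memset (out, 0, sizeof *out);
  out->kind = LINE_OTHER;
  out->cost = 1;

  if (sscanf (line, "%15s", key) != 1)
    return out->kind;

  if (strcmp (key, "alias") == 0)
    {
      /* Two more words: alias name, then original name.  */
      if (sscanf (line, "%*s %63s %63s", out->a, out->b) == 2)
        out->kind = LINE_ALIAS;
    }
  else if (strcmp (key, "module") == 0)
    {
      /* Three or four more words; the cost is optional.  */
      int n = sscanf (line, "%*s %63s %63s %63s %d",
                      out->a, out->b, out->file, &out->cost);
      if (n >= 3)
        out->kind = LINE_MODULE;
    }

  return out->kind;
}
```

The default cost is assigned before parsing, so a module line with only three words ends up with a cost of 1, matching the rule stated above.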
Let us return to the example above where one has written a module to directly convert from ISO-2022-JP to EUC-JP and back. All that has to be done is to put the new module, say with the name ISO2022JP-EUCJP.so, in a directory and add a file `gconv-modules' with the following content in the same directory:

module  ISO-2022-JP//   EUC-JP//        ISO2022JP-EUCJP    1
module  EUC-JP//        ISO-2022-JP//   ISO2022JP-EUCJP    1

To see why this is sufficient, it is necessary to understand how the
conversion used by iconv
(and described in the descriptor) is
selected. The approach to this problem is quite simple.
At the first call of the iconv_open
function the program reads
all available `gconv-modules' files and builds up two tables: one
containing all the known aliases and another which contains the
information about the conversions and which shared object implements
them.
Finding the conversion path in iconv
The set of available conversions forms a directed graph with weighted
edges. The weights on the edges are the costs specified in the
`gconv-modules' files. The iconv_open function uses an algorithm
suitable for searching for the best path in such a graph and so
constructs a list of conversions which must be performed in succession
to get the transformation from the source to the destination character
set.
Explaining why the above `gconv-modules' file allows the iconv
implementation to select the specific ISO-2022-JP to EUC-JP conversion
module instead of the conversion coming with the library itself is
straightforward. Since the latter conversion takes two steps (from
ISO-2022-JP to ISO 10646 and then from ISO 10646 to EUC-JP) the cost
is 1+1 = 2. But the above `gconv-modules' file specifies that the new
conversion module can perform this conversion with a cost of only 1.
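The cost comparison can be reproduced with a small search over a list of weighted edges. The sketch below uses repeated edge relaxation (in the style of Bellman-Ford) purely for brevity; it makes no claim about which algorithm the library actually uses, and the names struct conv_edge and cheapest_conversion are invented for this example. Costs are assumed to be small positive integers.

```c
#include <limits.h>
#include <string.h>

#define MAX_EDGES 64

/* Hypothetical description of one module line: FROM -> TO at COST.  */
struct conv_edge
{
  const char *from;
  const char *to;
  int cost;
};

/* Return the cheapest total cost of a path from SRC to DST, or -1 if
   no path exists (or if there are more than MAX_EDGES edges).  */
int
cheapest_conversion (const struct conv_edge *edges, int nedges,
                     const char *src, const char *dst)
{
  /* best[i] is the cheapest cost of a path from SRC ending in edge i.  */
  int best[MAX_EDGES];
  int i, j, round, answer = -1;

  if (nedges > MAX_EDGES)
    return -1;

  for (i = 0; i < nedges; ++i)
    best[i] = strcmp (edges[i].from, src) == 0 ? edges[i].cost : INT_MAX;

  /* Relax until no improvement is possible; NEDGES rounds suffice.  */
  for (round = 0; round < nedges; ++round)
    for (i = 0; i < nedges; ++i)
      for (j = 0; j < nedges; ++j)
        if (best[j] != INT_MAX
            && strcmp (edges[j].to, edges[i].from) == 0
            && best[j] + edges[i].cost < best[i])
          best[i] = best[j] + edges[i].cost;

  for (i = 0; i < nedges; ++i)
    if (best[i] != INT_MAX && strcmp (edges[i].to, dst) == 0
        && (answer == -1 || best[i] < answer))
      answer = best[i];

  return answer;
}
```

With only the two edges through INTERNAL the cheapest ISO-2022-JP to EUC-JP path costs 2; once a direct edge with cost 1 is added, the search prefers it.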
A mysterious item about the `gconv-modules' file above (and also
the file coming with the GNU C library) is the names of the character
sets specified in the module lines. Why do almost all the names end in
//? And this is not all: the names can actually be regular expressions.
At this point in time this mystery should not be revealed, unless you
have the relevant spell-casting materials: ashes from an original
DOS 6.2 boot disk burnt in effigy, a crucifix blessed by St. Emacs,
assorted herbal roots from Central America, sand from Cebu, etc.
Sorry! The part of the implementation where this is used is not yet
finished. For now please simply follow the existing examples. It'll
become clearer once it is. --drepper
A last remark about the `gconv-modules' file is about the names not
ending with //. There often is a character set named INTERNAL
mentioned. From the discussion above and the chosen name it should have
become clear that this is the name for the representation used in the
intermediate step of the triangulation. We have said that this is UCS-4
but actually this is not quite right. The UCS-4 specification also
includes the specification of the byte ordering used. Since a UCS-4
value consists of four bytes a stored value is affected by byte
ordering. The internal representation is not the same as UCS-4 in case
the byte ordering of the processor (or at least the running process) is
not the same as the one required for UCS-4. This is done for
performance reasons as one does not want to perform unnecessary
byte-swapping operations if one is not interested in actually seeing
the result in UCS-4. To avoid trouble with endianness the internal
representation consistently is named INTERNAL even on big-endian
systems where the representations are identical.
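The difference between the two representations can be shown in a few lines. The sketch assumes, as the paragraph above implies, that UCS-4 stores the most significant byte first while the internal form is simply a 32-bit value in the byte order of the host. The function names are invented for this example.

```c
#include <stdint.h>
#include <string.h>

/* Store one character value in UCS-4 byte order (most significant
   byte first), regardless of the byte order of the processor.  */
void
store_ucs4 (uint32_t ch, unsigned char out[4])
{
  out[0] = (unsigned char) (ch >> 24);
  out[1] = (unsigned char) (ch >> 16);
  out[2] = (unsigned char) (ch >> 8);
  out[3] = (unsigned char) ch;
}

/* Store the same value the way an INTERNAL-style buffer would hold
   it: as a plain 32-bit integer in host byte order, no swapping.  */
void
store_internal (uint32_t ch, unsigned char out[4])
{
  memcpy (out, &ch, sizeof ch);
}

/* Nonzero when both representations coincide for CH on this machine,
   which is always the case on big-endian processors.  */
int
internal_equals_ucs4 (uint32_t ch)
{
  unsigned char a[4], b[4];

  store_ucs4 (ch, a);
  store_internal (ch, b);
  return memcmp (a, b, 4) == 0;
}
```

On a little-endian machine store_internal needs no shifts at all, which is the performance argument made above; the swap is only paid for when real UCS-4 output is requested.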
iconv module data structures
So far this section described how modules are located and considered to be used. What remains to be described is the interface of the modules so that one can write new ones. This section describes the interface as it is in use in January 1999. The interface will change a bit in the future but hopefully only in an upward compatible way.
The definitions necessary to write new modules are publicly available in the non-standard header `gconv.h'. The following text will therefore describe the definitions from this header file. But first it is necessary to get an overview.
From the perspective of the user of iconv the interface is quite
simple: the iconv_open function returns a handle which can be used in
calls to iconv and finally the handle is freed with a call to
iconv_close. The problem is that the handle has to be able to represent
the possibly long sequences of conversion steps and also the state of
each conversion since the handle is all that is passed to the iconv
function. Therefore the data structures are really the key to
understanding the implementation.
We need two different kinds of data structures. The first describes the conversion and the second describes the state etc. There are really two type definitions like this in `gconv.h': struct __gconv_step and struct __gconv_step_data.
The following elements belong to struct __gconv_step.
struct __gconv_loaded_object *__shlib_handle
const char *__modname
int __counter
const char *__from_name
const char *__to_name
__from_name and __to_name contain the names of the source and
destination character sets. They can be used to identify the actual
conversion to be carried out since one module might implement
conversions for more than one character set and/or direction.
gconv_fct __fct
gconv_init_fct __init_fct
gconv_end_fct __end_fct
int __min_needed_from
int __max_needed_from
int __min_needed_to
int __max_needed_to
The __min_needed_from value specifies how many bytes a character of
the source character set needs at least. The __max_needed_from value
specifies the maximum, which also includes possible shift sequences.
The __min_needed_to and __max_needed_to values serve the same purpose
but this time for the destination character set.
It is crucial that these values are accurate since otherwise the
conversion functions will have problems or not work at all.
int __stateful
void *__data
The __data element must not contain data specific to one particular
use of the conversion function.
The following elements belong to struct __gconv_step_data.
char *__outbuf
char *__outbufend
The __outbuf element points to the beginning of the buffer and
__outbufend points to the byte following the last byte in the buffer.
The conversion function must not assume anything about the size of the
buffer but it can be safely assumed that there is room for at least one
complete character in the output buffer.
Once the conversion is finished and the conversion is the last step the
__outbuf element must be modified to point after the last byte written
into the buffer to signal how much output is available. If this
conversion step is not the last one the element must not be modified.
The __outbufend element must not be modified.
int __is_last
int __invocation_counter
int __internal_use
The __internal_use element is nonzero when the conversion functions are
used to implement mbsrtowcs et al., i.e., when the function is not used
directly through the iconv interface.
This sometimes makes a difference as it is expected that the iconv
functions are used to translate entire texts while the mbsrtowcs
functions are normally only used to convert single strings and might be
used multiple times to convert entire texts.
But in this situation we would have a problem complying with some rules
of the character set specification. Some character sets require a
prolog which must appear exactly once for an entire text. If a number
of mbsrtowcs calls are used to convert the text only the first call
must add the prolog. But since there is no communication between the
different calls of mbsrtowcs the conversion functions have no
possibility to find this out. The situation is different for sequences
of iconv calls since the handle allows access to the needed
information.
This element is mostly used together with __invocation_counter in a
way like this:

if (!data->__internal_use && data->__invocation_counter == 0)
  /* Emit prolog.  */
  ...

This element must never be modified.
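The effect of that test can be tried out with a stand-in structure. The sketch below does not use the real definitions from `gconv.h'; struct demo_step_data, demo_emit and the prolog string are hypothetical and only mirror the two elements discussed here.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in with the two elements discussed above.  */
struct demo_step_data
{
  int invocation_counter;
  int internal_use;
};

/* Copy TEXT to OUT, preceded by PROLOG only on the first invocation
   of a use which goes through the iconv interface.  Returns the
   number of bytes written, or 0 if the output does not fit.  */
size_t
demo_emit (struct demo_step_data *data, const char *prolog,
           const char *text, char *out, size_t outsize)
{
  size_t plen = 0;
  size_t tlen = strlen (text);

  if (!data->internal_use && data->invocation_counter == 0)
    plen = strlen (prolog);

  if (plen + tlen > outsize)
    return 0;                   /* nothing written, counter untouched */

  memcpy (out, prolog, plen);
  memcpy (out + plen, text, tlen);

  /* One use of this step is finished.  */
  ++data->invocation_counter;
  return plen + tlen;
}
```

Only the counter changes between calls; the internal_use flag is read but never written, in line with the rule that the element must never be modified.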
mbstate_t *__statep
The __statep element points to an object of type mbstate_t
(see section Representing the state of the conversion). The conversion
of a stateful character set must use the object pointed to by this
element to store information about the conversion state. The __statep
element itself must never be modified.
mbstate_t __state
iconv module interfaces
With the knowledge about the data structures we now can describe the conversion functions themselves. To understand the interface a bit of knowledge about the functionality in the C library which loads the objects with the conversions is necessary.
It is often the case that one conversion is used more than once. I.e.,
there are several iconv_open
calls for the same set of character
sets during one program run. The mbsrtowcs
et.al. functions in
the GNU C library also use the iconv
functionality which
increases the number of uses of the same functions even more.
For this reason the modules do not get loaded exclusively for one
conversion. Instead a module once loaded can be used by arbitrarily many
iconv
or mbsrtowcs
calls at the same time. The splitting
of the information between conversion function specific information and
conversion data makes this possible. The last section showed the two
data structures used to do this.
This is of course also reflected in the interface and semantics of the functions the modules must provide. There are three functions which must have the following names:
gconv_init
The gconv_init function initializes the conversion function specific
data structure. This very same object is shared by all conversions
which use this conversion function and therefore no state information
about the conversion itself must be stored in here. If a module
implements more than one conversion the gconv_init function will be
called multiple times.
gconv_end
The gconv_end function is responsible for freeing all resources
allocated by the gconv_init function. If there is nothing to do this
function can be missing. Special care must be taken if the module
implements more than one conversion and the gconv_init function does
not allocate the same resources for all conversions.
gconv
The gconv function is the conversion function itself. It is passed the
conversion step information initialized by gconv_init and the
conversion data, specific to this use of the conversion functions.
There are three data types defined for the three module interface functions and these define the interface.
As explained in the description of the struct __gconv_step data
structure above the initialization function has to initialize parts of
it.
__min_needed_from
__max_needed_from
__min_needed_to
__max_needed_to
__stateful
If the initialization function needs to communicate some information
to the conversion function this can happen using the __data element of
the __gconv_step structure. But since this data is shared by all the
conversions it must not be modified by the conversion function. How
this can be used is shown in the example below.

#define MIN_NEEDED_FROM         1
#define MAX_NEEDED_FROM         4
#define MIN_NEEDED_TO           4
#define MAX_NEEDED_TO           4

int
gconv_init (struct __gconv_step *step)
{
  /* Determine which direction.  */
  struct iso2022jp_data *new_data;
  enum direction dir = illegal_dir;
  enum variant var = illegal_var;
  int result;

  if (__strcasecmp (step->__from_name, "ISO-2022-JP//") == 0)
    {
      dir = from_iso2022jp;
      var = iso2022jp;
    }
  else if (__strcasecmp (step->__to_name, "ISO-2022-JP//") == 0)
    {
      dir = to_iso2022jp;
      var = iso2022jp;
    }
  else if (__strcasecmp (step->__from_name, "ISO-2022-JP-2//") == 0)
    {
      dir = from_iso2022jp;
      var = iso2022jp2;
    }
  else if (__strcasecmp (step->__to_name, "ISO-2022-JP-2//") == 0)
    {
      dir = to_iso2022jp;
      var = iso2022jp2;
    }

  result = __GCONV_NOCONV;
  if (dir != illegal_dir)
    {
      new_data = (struct iso2022jp_data *)
        malloc (sizeof (struct iso2022jp_data));

      result = __GCONV_NOMEM;
      if (new_data != NULL)
        {
          new_data->dir = dir;
          new_data->var = var;
          step->__data = new_data;

          if (dir == from_iso2022jp)
            {
              step->__min_needed_from = MIN_NEEDED_FROM;
              step->__max_needed_from = MAX_NEEDED_FROM;
              step->__min_needed_to = MIN_NEEDED_TO;
              step->__max_needed_to = MAX_NEEDED_TO;
            }
          else
            {
              step->__min_needed_from = MIN_NEEDED_TO;
              step->__max_needed_from = MAX_NEEDED_TO;
              step->__min_needed_to = MIN_NEEDED_FROM;
              step->__max_needed_to = MAX_NEEDED_FROM + 2;
            }

          /* Yes, this is a stateful encoding.  */
          step->__stateful = 1;

          result = __GCONV_OK;
        }
    }

  return result;
}

The function first checks which conversion is wanted. The module from which this function is taken implements four different conversions and which one is selected can be determined by comparing the names. The comparison should always be done without paying attention to the case.
Then a data structure is allocated which contains the necessary
information about which conversion is selected. The data structure
struct iso2022jp_data is locally defined since outside the module this
data is not used at all. Please note that if all four conversions this
module supports are requested there are four data blocks.
One interesting thing is the initialization of the __min_ and __max_
elements of the step object. A single ISO-2022-JP character can consist
of one to four bytes. Therefore the MIN_NEEDED_FROM and MAX_NEEDED_FROM
macros are defined this way. The output is always the INTERNAL
character set (aka UCS-4) and therefore each character consists of
exactly four bytes. For the conversion from INTERNAL to ISO-2022-JP we
have to take into account that escape sequences might be necessary to
switch the character sets. Therefore the __max_needed_to element for
this direction gets assigned MAX_NEEDED_FROM + 2. This takes into
account the two bytes needed for the escape sequences to signal the
switching. The asymmetry in the maximum values for the two directions
can be explained easily: when reading ISO-2022-JP text escape sequences
can be handled alone. I.e., it is not necessary to process a real
character since the effect of the escape sequence can be recorded in
the state information. The situation is different for the other
direction. Since it is in general not known which character comes next
one cannot emit escape sequences to change the state in advance. This
means the escape sequences have to be emitted together with the next
character. Therefore one needs more room than only for the character
itself.
The possible return values of the initialization function are:
__GCONV_OK
__GCONV_NOCONV
__GCONV_NOMEM
The function called before the module is unloaded is significantly simpler. It often has nothing at all to do, in which case it can be left out completely.
Only the __data element of the object pointed to by the argument is of
interest. Continuing the example from the initialization function, the
finalization function looks like this:

void
gconv_end (struct __gconv_step *data)
{
  free (data->__data);
}

The most important function is the conversion function itself. It can get quite complicated for complex character sets. But since this is not of interest here we will only describe a possible skeleton for the conversion function.
From the description of the iconv function it can be seen why the
flushing mode is necessary. What mode is selected is determined by the
sixth argument, an integer. If it is nonzero it means that flushing is
selected.
Common to both modes is where the output buffer can be found. The
information about this buffer is stored in the conversion step data. A
pointer to this is passed as the second argument to this function. The
description of the struct __gconv_step_data structure has more
information on this.
What has to be done for flushing depends on the source character set.
If it is not stateful nothing has to be done. Otherwise the function
has to emit a byte sequence to bring the state object into the initial
state. Once this has happened the other conversion modules in the
chain of conversions have to get the same chance. Whether another step
follows can be determined from the __is_last element of the step data
structure to which the second parameter points.
The more interesting mode is when text actually has to be converted. The first step in this case is to convert as much text as possible from the input buffer and store the result in the output buffer. The start of the input buffer is determined by the third argument which is a pointer to a pointer variable referencing the beginning of the buffer. The fourth argument is a pointer to the byte right after the last byte in the buffer.
The conversion has to be performed according to the current state if the
character set is stateful. The state is stored in an object pointed to
by the __statep
element of the step data (second argument). Once
either the input buffer is empty or the output buffer is full the
conversion stops. At this point the pointer variable referenced by the
third parameter must point to the byte following the last processed
byte. I.e., if all of the input is consumed this pointer and the fourth
parameter have the same value.
What now happens depends on whether this step is the last one or not.
If it is the last step the only thing which has to be done is to update
the __outbuf element of the step data structure to point after the last
written byte. This gives the caller the information on how much text is
available in the output buffer. Besides this the variable pointed to by
the fifth parameter, which is of type size_t, must be incremented by
the number of characters (not bytes) which were converted in a
non-reversible way. Then the function can return.
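The bookkeeping for a last step can be sketched with a deliberately simple conversion: single bytes read as ISO-8859-1, written as 32-bit values in host byte order. Everything here is a stand-in for illustration; struct demo_out, the DEMO_* codes and demo_latin1_step are hypothetical and do not come from `gconv.h'.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for the status codes and the step data.  */
enum { DEMO_EMPTY_INPUT, DEMO_FULL_OUTPUT };

struct demo_out
{
  char *outbuf;       /* next free byte; updated by the last step */
  char *outbufend;    /* one past the end; never modified */
};

/* Convert until the input is empty or the output is full.  */
int
demo_latin1_step (struct demo_out *data, const char **inbuf,
                  const char *inbufend, size_t *irreversible)
{
  const char *in = *inbuf;
  char *out = data->outbuf;
  int status = DEMO_EMPTY_INPUT;

  while (in < inbufend)
    {
      uint32_t ch;

      if ((size_t) (data->outbufend - out) < sizeof ch)
        {
          status = DEMO_FULL_OUTPUT;
          break;
        }
      ch = (unsigned char) *in++;
      memcpy (out, &ch, sizeof ch);
      out += sizeof ch;
    }

  /* Point after the last processed input byte.  */
  *inbuf = in;
  /* Last step: signal how much output is available.  */
  data->outbuf = out;
  /* Every byte maps to exactly one code point here, so nothing was
     converted in a non-reversible way.  */
  *irreversible += 0;

  return status;
}
```

When all input is consumed the input pointer equals inbufend, and in either case the distance between the old and new outbuf values tells the caller how many bytes were produced.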
In case the step is not the last one the later conversion functions have to get a chance to do their work. Therefore the appropriate conversion function has to be called. The information about the functions is stored in the conversion data structures, passed as the first parameter. This information and the step data are stored in arrays so the next element in both cases can be found by simple pointer arithmetic:

int
gconv (struct __gconv_step *step, struct __gconv_step_data *data,
       const char **inbuf, const char *inbufend, size_t *written,
       int do_flush)
{
  struct __gconv_step *next_step = step + 1;
  struct __gconv_step_data *next_data = data + 1;
  ...

The next_step
pointer references the next step information and
next_data
the next data record. The call of the next function
therefore will look similar to this:
next_step->__fct (next_step, next_data, &outerr, outbuf, written, 0)
But this is not yet all. Once the function call returns the conversion
function might have some more to do. If the return value of the
function is __GCONV_EMPTY_INPUT this means there is more room in the
output buffer. Unless the input buffer is empty the conversion function
starts all over again and processes the rest of the input buffer. If
the return value is not __GCONV_EMPTY_INPUT something went wrong and we
have to recover from this.
A requirement for the conversion function is that the input buffer pointer (the third argument) always points to the last character which was put in converted form into the output buffer. This is trivially true after the conversion performed in the current step. But if the conversion functions deeper down the stream stop prematurely not all characters from the output buffer are consumed and therefore the input buffer pointers must be backed off to the right position.
This is easy to do if the input and output character sets have a fixed width for all characters. In this situation we can compute how many characters are left in the output buffer and therefore can correct the input buffer pointer appropriately with a similar computation. Things get tricky if either character set has characters represented with variable length byte sequences and it gets even more complicated if the conversion has to take care of the state. In these cases the conversion has to be performed once again, from the known state before the initial conversion. I.e., if necessary the state of the conversion has to be reset and the conversion loop has to be executed again. The difference now is that it is known how much output must be created and the conversion can stop before converting the first unused character. Once this is done the input buffer pointers must be updated again and the function can return.
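For the fixed-width case the correction is plain arithmetic. The helper below is a sketch with an invented name, assuming every input character occupies in_width bytes and every output character out_width bytes.

```c
#include <stddef.h>

/* IN_AFTER points after the last input byte converted by this step.
   UNCONSUMED_OUT is the number of bytes of this step's output which
   the following step did not process.  Return the position the input
   pointer has to be backed off to.  */
const char *
demo_back_off_input (const char *in_after, size_t unconsumed_out,
                     size_t in_width, size_t out_width)
{
  /* Number of whole characters left in the output buffer.  */
  size_t chars_left = unconsumed_out / out_width;

  return in_after - chars_left * in_width;
}
```

For example, with a one-byte source character set and a four-byte destination, twelve unconsumed output bytes correspond to three characters, so the input pointer moves back by three bytes. The variable-width and stateful cases cannot be handled this way and need the redo loop described above.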
One final thing should be mentioned. If it is necessary for the
conversion to know whether it is the first invocation (in case a prolog
has to be emitted) the conversion function should just before returning
to the caller increment the __invocation_counter
element of the
step data structure. See the description of the struct
__gconv_step_data
structure above for more information on how this can
be used.
The return value must be one of the following values:
__GCONV_EMPTY_INPUT
__GCONV_FULL_OUTPUT
__GCONV_INCOMPLETE_INPUT
The following example provides a framework for a conversion function. In case a new conversion has to be written the holes in this implementation have to be filled and that is it.

int
gconv (struct __gconv_step *step, struct __gconv_step_data *data,
       const char **inbuf, const char *inbufend, size_t *written,
       int do_flush)
{
  struct __gconv_step *next_step = step + 1;
  struct __gconv_step_data *next_data = data + 1;
  gconv_fct fct = next_step->__fct;
  int status;

  /* If the function is called with no input this means we have
     to reset to the initial state.  The possibly partly
     converted input is dropped.  */
  if (do_flush)
    {
      status = __GCONV_OK;

      /* Possibly emit a byte sequence which puts the state object
         into the initial state.  */

      /* Call the steps down the chain if there are any but only
         if we successfully emitted the escape sequence.  */
      if (status == __GCONV_OK && ! data->__is_last)
        status = fct (next_step, next_data, NULL, NULL,
                      written, 1);
    }
  else
    {
      /* We preserve the initial values of the pointer variables.  */
      const char *inptr = *inbuf;
      char *outbuf = data->__outbuf;
      char *outend = data->__outbufend;
      char *outptr;

      do
        {
          /* Remember the start value for this round.  */
          inptr = *inbuf;
          /* The outbuf buffer is empty.  */
          outptr = outbuf;

          /* For stateful encodings the state must be saved here.  */

          /* Run the conversion loop.  status is set
             appropriately afterwards.  */

          /* If this is the last step leave the loop, there is
             nothing we can do.  */
          if (data->__is_last)
            {
              /* Store information about how many bytes are
                 available.  */
              data->__outbuf = outbuf;

              /* If any non-reversible conversions were performed,
                 add the number to *written.  */

              break;
            }

          /* Write out all output which was produced.  */
          if (outbuf > outptr)
            {
              const char *outerr = data->__outbuf;
              int result;

              result = fct (next_step, next_data, &outerr,
                            outbuf, written, 0);

              if (result != __GCONV_EMPTY_INPUT)
                {
                  if (outerr != outbuf)
                    {
                      /* Reset the input buffer pointer.  We
                         document here the complex case.  */
                      size_t nstatus;

                      /* Reload the pointers.  */
                      *inbuf = inptr;
                      outbuf = outptr;

                      /* Possibly reset the state.  */

                      /* Redo the conversion, but this time
                         the end of the output buffer is at
                         outerr.  */
                    }

                  /* Change the status.  */
                  status = result;
                }
              else
                /* All the output is consumed, we can make
                   another run if everything was ok.  */
                if (status == __GCONV_FULL_OUTPUT)
                  status = __GCONV_OK;
            }
        }
      while (status == __GCONV_OK);

      /* We finished one use of this step.  */
      ++data->__invocation_counter;
    }

  return status;
}

This information should be sufficient to write new modules. Anybody doing so should also take a look at the available source code in the GNU C library sources. It contains many examples of working and optimized modules.