/glibc/libc/glibc-iconv-Implementation.html
- <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
- <html>
- <!-- This file documents the GNU C Library.
- This is
- The GNU C Library Reference Manual, for version
- 2.23.
- Copyright (C) 1993-2016 Free Software Foundation, Inc.
- Permission is granted to copy, distribute and/or modify this document
- under the terms of the GNU Free Documentation License, Version
- 1.3 or any later version published by the Free
- Software Foundation; with the Invariant Sections being "Free Software
- Needs Free Documentation" and "GNU Lesser General Public License",
- the Front-Cover texts being "A GNU Manual", and with the Back-Cover
- Texts as in (a) below. A copy of the license is included in the
- section entitled "GNU Free Documentation License".
- (a) The FSF's Back-Cover Text is: "You have the freedom to
- copy and modify this GNU manual. Buying copies from the FSF
- supports it in developing GNU and promoting software freedom." -->
- <!-- Created by GNU Texinfo 6.0, http://www.gnu.org/software/texinfo/ -->
- <head>
- <title>The GNU C Library: glibc iconv Implementation</title>
- <meta name="description" content="The GNU C Library: glibc iconv Implementation">
- <meta name="keywords" content="The GNU C Library: glibc iconv Implementation">
- <meta name="resource-type" content="document">
- <meta name="distribution" content="global">
- <meta name="Generator" content="makeinfo">
- <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
- <link href="index.html#Top" rel="start" title="Top">
- <link href="Concept-Index.html#Concept-Index" rel="index" title="Concept Index">
- <link href="index.html#SEC_Contents" rel="contents" title="Table of Contents">
- <link href="Generic-Charset-Conversion.html#Generic-Charset-Conversion" rel="up" title="Generic Charset Conversion">
- <link href="Locales.html#Locales" rel="next" title="Locales">
- <link href="Other-iconv-Implementations.html#Other-iconv-Implementations" rel="prev" title="Other iconv Implementations">
- <style type="text/css">
- <!--
- a.summary-letter {text-decoration: none}
- blockquote.indentedblock {margin-right: 0em}
- blockquote.smallindentedblock {margin-right: 0em; font-size: smaller}
- blockquote.smallquotation {font-size: smaller}
- div.display {margin-left: 3.2em}
- div.example {margin-left: 3.2em}
- div.lisp {margin-left: 3.2em}
- div.smalldisplay {margin-left: 3.2em}
- div.smallexample {margin-left: 3.2em}
- div.smalllisp {margin-left: 3.2em}
- kbd {font-style: oblique}
- pre.display {font-family: inherit}
- pre.format {font-family: inherit}
- pre.menu-comment {font-family: serif}
- pre.menu-preformatted {font-family: serif}
- pre.smalldisplay {font-family: inherit; font-size: smaller}
- pre.smallexample {font-size: smaller}
- pre.smallformat {font-family: inherit; font-size: smaller}
- pre.smalllisp {font-size: smaller}
- span.nocodebreak {white-space: nowrap}
- span.nolinebreak {white-space: nowrap}
- span.roman {font-family: serif; font-weight: normal}
- span.sansserif {font-family: sans-serif; font-weight: normal}
- ul.no-bullet {list-style: none}
- -->
- </style>
- </head>
- <body lang="en">
- <a name="glibc-iconv-Implementation"></a>
- <div class="header">
- <p>
- Previous: <a href="Other-iconv-Implementations.html#Other-iconv-Implementations" accesskey="p" rel="prev">Other iconv Implementations</a>, Up: <a href="Generic-Charset-Conversion.html#Generic-Charset-Conversion" accesskey="u" rel="up">Generic Charset Conversion</a> [<a href="index.html#SEC_Contents" title="Table of contents" rel="contents">Contents</a>][<a href="Concept-Index.html#Concept-Index" title="Index" rel="index">Index</a>]</p>
- </div>
- <hr>
- <a name="The-iconv-Implementation-in-the-GNU-C-Library"></a>
- <h4 class="subsection">6.5.4 The <code>iconv</code> Implementation in the GNU C Library</h4>
- <p>After reading about the problems of <code>iconv</code> implementations in the
- last section it is certainly good to note that the implementation in
- the GNU C Library has none of the problems mentioned above. What
- follows is a step-by-step analysis of the points raised above. The
- evaluation is based on the current state of the development (as of
- January 1999). The development of the <code>iconv</code> functions is not
- complete, but basic functionality has solidified.
- </p>
- <p>The GNU C Library’s <code>iconv</code> implementation uses shared loadable
- modules to implement the conversions. A very small number of
- conversions are built into the library itself but these are only rather
- trivial conversions.
- </p>
- <p>All the benefits of loadable modules are available in the GNU C Library
- implementation. This is especially appealing since the interface is
- well documented (see below), and it, therefore, is easy to write new
- conversion modules. The drawback of using loadable objects is not a
- problem in the GNU C Library, at least on ELF systems. Since the
- library is able to load shared objects even in statically linked
- binaries, static linking need not be forbidden in case one wants to use
- <code>iconv</code>.
- </p>
- <p>The second mentioned problem is the number of supported conversions.
- Currently, the GNU C Library supports more than 150 character sets. The
- way the implementation is designed the number of supported conversions
- is greater than 22350 (<em>150</em> times <em>149</em>). If any conversion
- from or to a character set is missing, it can be added easily.
- </p>
- <p>Impressive as it may be, this high number is due to the
- fact that the GNU C Library implementation of <code>iconv</code> does not have
- the third problem mentioned above (i.e., whenever there is a conversion
- from a character set <em>A</em> to <em>B</em> and from
- <em>B</em> to <em>C</em> it is always possible to convert from
- <em>A</em> to <em>C</em> directly). If <code>iconv_open</code>
- returns an error and sets <code>errno</code> to <code>EINVAL</code>, there is no
- known way, directly or indirectly, to perform the wanted conversion.
- </p>
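- <p>As a minimal sketch of that check, a program can probe for a conversion as shown here. The helper name <code>conversion_available</code> is hypothetical, and the character set names passed to it are assumed to be spelled the way the local implementation expects.
- </p>

```c
#include <errno.h>
#include <iconv.h>

/* Return 1 if a conversion path exists, 0 if iconv_open reports EINVAL,
   and -1 for any other failure (for example, out of memory).  */
static int
conversion_available (const char *toset, const char *fromset)
{
  iconv_t cd = iconv_open (toset, fromset);

  if (cd == (iconv_t) -1)
    return errno == EINVAL ? 0 : -1;

  iconv_close (cd);
  return 1;
}
```

- <p>A call such as <code>conversion_available ("EUC-JP", "ISO-2022-JP")</code> then tells the caller whether any path, direct or triangulated, is known.
- </p>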
- <a name="index-triangulation"></a>
- <p>Triangulation is achieved by providing for each character set a
- conversion from and to UCS-4 encoded ISO 10646<!-- /@w -->. Using ISO 10646<!-- /@w -->
- as an intermediate representation it is possible to <em>triangulate</em>
- (i.e., convert with an intermediate representation).
- </p>
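- <p>The idea can be illustrated with a toy example that is not taken from the library: two independent steps, ISO-8859-1 to code points and code points to UTF-8, are chained through an intermediate buffer. All function names are made up for this sketch, and the 64-character limit is an arbitrary simplification.
- </p>

```c
#include <stddef.h>
#include <stdint.h>

/* Step 1: ISO-8859-1 bytes map one-to-one onto the first 256 code points.  */
static size_t
latin1_to_ucs4 (const unsigned char *in, size_t n, uint32_t *out)
{
  for (size_t i = 0; i < n; i++)
    out[i] = in[i];
  return n;
}

/* Step 2: encode code points as UTF-8.  Only values below 0x800 are
   handled, which is all that step 1 can produce.  */
static size_t
ucs4_to_utf8 (const uint32_t *in, size_t n, unsigned char *out)
{
  size_t len = 0;

  for (size_t i = 0; i < n; i++)
    {
      if (in[i] < 0x80)
        out[len++] = (unsigned char) in[i];
      else
        {
          out[len++] = (unsigned char) (0xC0 | (in[i] >> 6));
          out[len++] = (unsigned char) (0x80 | (in[i] & 0x3F));
        }
    }
  return len;
}

/* Triangulated conversion: neither step knows about the other character
   set.  OUT must have room for 2 * N bytes; returns 0 if N exceeds 64.  */
static size_t
latin1_to_utf8 (const unsigned char *in, size_t n, unsigned char *out)
{
  uint32_t mid[64];
  size_t count;

  if (n > 64)
    return 0;
  count = latin1_to_ucs4 (in, n, mid);
  return ucs4_to_utf8 (mid, count, out);
}
```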
- <p>There is no inherent requirement to provide a conversion to ISO 10646<!-- /@w --> for a new character set, and it is also possible to provide other
- conversions where neither source nor destination character set is ISO 10646<!-- /@w -->. The existing set of conversions is simply meant to cover all
- conversions that might be of interest.
- </p>
- <a name="index-ISO_002d2022_002dJP"></a>
- <a name="index-EUC_002dJP"></a>
- <p>All currently available conversions use the triangulation method above,
- which can make a conversion run unnecessarily slowly. If, for example, somebody
- often needs the conversion from ISO-2022-JP to EUC-JP, a quicker solution
- would involve direct conversion between the two character sets, skipping
- the intermediate conversion to ISO 10646<!-- /@w -->. The two character sets of interest
- are much more similar to each other than to ISO 10646<!-- /@w -->.
- </p>
- <p>In such a situation one easily can write a new conversion and provide it
- as a better alternative. The GNU C Library <code>iconv</code> implementation
- would automatically use the module implementing the conversion if it is
- specified to be more efficient.
- </p>
- <a name="Format-of-gconv_002dmodules-files"></a>
- <h4 class="subsubsection">6.5.4.1 Format of <samp>gconv-modules</samp> files</h4>
- <p>All information about the available conversions comes from a file named
- <samp>gconv-modules</samp>, which can be found in any of the directories along
- the <code>GCONV_PATH</code>. The <samp>gconv-modules</samp> files are line-oriented
- text files, where each of the lines has one of the following formats:
- </p>
- <ul>
- <li> If the first non-whitespace character is a <kbd>#</kbd> the line contains only
- comments and is ignored.
- </li><li> Lines starting with <code>alias</code> define an alias name for a character
- set. Two more words are expected on the line. The first word
- defines the alias name, and the second defines the original name of the
- character set. The effect is that it is possible to use the alias name
- in the <var>fromset</var> or <var>toset</var> parameters of <code>iconv_open</code> and
- achieve the same result as when using the real character set name.
- <p>This is quite important as a character set often has many different
- names. There is normally an official name but this need not correspond to
- the most popular name. Besides this, many character sets have special
- names that are somehow constructed. For example, all character sets
- specified by the ISO have an alias of the form <code>ISO-IR-<var>nnn</var></code>
- where <var>nnn</var> is the registration number. This allows programs that
- know about the registration number to construct character set names and
- use them in <code>iconv_open</code> calls. More on the available names and
- aliases follows below.
- </p>
- </li><li> Lines starting with <code>module</code> introduce an available conversion
- module. These lines must contain three or four more words.
- <p>The first word specifies the source character set, the second word the
- destination character set of conversion implemented in this module, and
- the third word is the name of the loadable module. The filename is
- constructed by appending the usual shared object suffix (normally
- <samp>.so</samp>) and this file is then supposed to be found in the same
- directory the <samp>gconv-modules</samp> file is in. The last word on the line,
- which is optional, is a numeric value representing the cost of the
- conversion. If this word is missing, a cost of <em>1</em> is assumed. The
- numeric value itself does not matter that much; what counts are the
- relative values of the sums of costs for all possible conversion paths.
- Below is a more precise description of the use of the cost value.
- </p></li></ul>
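- <p>The three line formats can be sketched as a small classifier. This is only an illustration of the format described above, not the parser the library uses; the type and function names are hypothetical, fields are assumed to be shorter than 64 characters, and treating blank lines as ignorable is an assumption of this sketch.
- </p>

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

enum line_kind { LINE_IGNORED, LINE_ALIAS, LINE_MODULE, LINE_INVALID };

struct parsed_line
{
  enum line_kind kind;
  char a[64];     /* Alias name, or source character set.  */
  char b[64];     /* Original name, or destination character set.  */
  char file[64];  /* Module name without suffix (module lines only).  */
  int cost;       /* Conversion cost (module lines only).  */
};

static enum line_kind
parse_line (const char *line, struct parsed_line *out)
{
  char keyword[16];

  memset (out, 0, sizeof *out);
  while (isspace ((unsigned char) *line))
    line++;
  if (*line == '\0' || *line == '#')
    return out->kind = LINE_IGNORED;
  if (sscanf (line, "%15s", keyword) != 1)
    return out->kind = LINE_INVALID;

  if (strcmp (keyword, "alias") == 0)
    {
      if (sscanf (line, "%*s %63s %63s", out->a, out->b) == 2)
        return out->kind = LINE_ALIAS;
    }
  else if (strcmp (keyword, "module") == 0)
    {
      int n = sscanf (line, "%*s %63s %63s %63s %d",
                      out->a, out->b, out->file, &out->cost);
      if (n == 3)
        out->cost = 1;  /* A missing cost defaults to 1.  */
      if (n >= 3)
        return out->kind = LINE_MODULE;
    }
  return out->kind = LINE_INVALID;
}
```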
- <p>Returning to the example above, suppose one has written a module to directly
- convert from ISO-2022-JP to EUC-JP and back. All that has to be done is
- to put the new module, let its name be ISO2022JP-EUCJP.so, in a directory
- and add a file <samp>gconv-modules</samp> with the following content in the
- same directory:
- </p>
- <div class="smallexample">
- <pre class="smallexample">module ISO-2022-JP// EUC-JP// ISO2022JP-EUCJP 1
- module EUC-JP// ISO-2022-JP// ISO2022JP-EUCJP 1
- </pre></div>
- <p>To see why this is sufficient, it is necessary to understand how the
- conversion used by <code>iconv</code> (and described in the descriptor) is
- selected. The approach to this problem is quite simple.
- </p>
- <p>At the first call of the <code>iconv_open</code> function the program reads
- all available <samp>gconv-modules</samp> files and builds up two tables: one
- containing all the known aliases and another that contains the
- information about the conversions and which shared object implements
- them.
- </p>
- <a name="Finding-the-conversion-path-in-iconv"></a>
- <h4 class="subsubsection">6.5.4.2 Finding the conversion path in <code>iconv</code></h4>
- <p>The set of available conversions forms a directed graph with weighted
- edges. The weights on the edges are the costs specified in the
- <samp>gconv-modules</samp> files. The <code>iconv_open</code> function uses an
- algorithm suitable for finding the best path in such a graph and so
- constructs a list of conversions that must be performed in succession
- to get the transformation from the source to the destination character
- set.
- </p>
- <p>Explaining why the above <samp>gconv-modules</samp> file allows the
- <code>iconv</code> implementation to select the specific ISO-2022-JP to
- EUC-JP conversion module instead of the conversion coming with the
- library itself is straightforward. Since the latter conversion takes two
- steps (from ISO-2022-JP to ISO 10646<!-- /@w --> and then from ISO 10646<!-- /@w --> to
- EUC-JP), the cost is <em>1+1 = 2</em>. The above <samp>gconv-modules</samp>
- file, however, specifies that the new conversion modules can perform this
- conversion with only the cost of <em>1</em>.
- </p>
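- <p>The cost comparison can be reproduced with a generic cheapest-path search. The sketch below uses simple Bellman-Ford style relaxation over an edge list; this text does not say which algorithm <code>iconv_open</code> really uses, so the choice is an assumption, and <code>struct edge</code> and <code>cheapest_cost</code> are hypothetical names.
- </p>

```c
#include <limits.h>
#include <string.h>

#define MAX_NODES 16

struct edge { const char *from; const char *to; int cost; };

/* Return the index of NAME in NAMES, adding it if necessary.  */
static int
node_index (const char *names[], int *n, const char *name)
{
  for (int i = 0; i < *n; i++)
    if (strcmp (names[i], name) == 0)
      return i;
  if (*n == MAX_NODES)
    return -1;
  names[*n] = name;
  return (*n)++;
}

/* Return the lowest total cost from FROM to TO, or -1 if there is no path
   (or more than MAX_NODES character sets are involved).  */
static int
cheapest_cost (const struct edge *edges, int n_edges,
               const char *from, const char *to)
{
  const char *names[MAX_NODES];
  int dist[MAX_NODES];
  int n = 0;
  int src = node_index (names, &n, from);
  int dst = node_index (names, &n, to);

  for (int e = 0; e < n_edges; e++)
    if (node_index (names, &n, edges[e].from) < 0
        || node_index (names, &n, edges[e].to) < 0)
      return -1;

  for (int i = 0; i < n; i++)
    dist[i] = INT_MAX;
  dist[src] = 0;

  /* Relax every edge n - 1 times.  */
  for (int round = 1; round < n; round++)
    for (int e = 0; e < n_edges; e++)
      {
        int u = node_index (names, &n, edges[e].from);
        int v = node_index (names, &n, edges[e].to);
        if (dist[u] != INT_MAX && dist[u] + edges[e].cost < dist[v])
          dist[v] = dist[u] + edges[e].cost;
      }

  return dist[dst] == INT_MAX ? -1 : dist[dst];
}
```

- <p>With only the two edges through <code>INTERNAL</code> the result for ISO-2022-JP to EUC-JP is 2; once a direct edge of cost 1 is added, the result drops to 1, so the direct module wins.
- </p>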
- <p>A mysterious aspect of the <samp>gconv-modules</samp> file above (and also
- the file coming with the GNU C Library) is the names of the character
- sets specified in the <code>module</code> lines. Why do almost all the names
- end in <code>//</code>? And this is not all: the names can actually be
- regular expressions. At this point in time this mystery should not be
- revealed, unless you have the relevant spell-casting materials: ashes
- from an original DOS 6.2<!-- /@w --> boot disk burnt in effigy, a crucifix
- blessed by St. Emacs, assorted herbal roots from Central America, sand
- from Cebu, etc. Sorry! <strong>The part of the implementation where
- this is used is not yet finished. For now please simply follow the
- existing examples. It’ll become clearer once it is. –drepper</strong>
- </p>
- <p>A last remark about the <samp>gconv-modules</samp> file concerns the names not
- ending with <code>//</code>. A character set named <code>INTERNAL</code> is often
- mentioned. From the discussion above and the chosen name it should have
- become clear that this is the name for the representation used in the
- intermediate step of the triangulation. We have said that this is UCS-4
- but actually that is not quite right. The UCS-4 specification also
- includes the specification of the byte ordering used. Since a UCS-4 value
- consists of four bytes, a stored value is affected by byte ordering. The
- internal representation is <em>not</em> the same as UCS-4 in case the byte
- ordering of the processor (or at least the running process) is not the
- same as the one required for UCS-4. This is done for performance reasons
- as one does not want to perform unnecessary byte-swapping operations if
- one is not interested in actually seeing the result in UCS-4. To avoid
- trouble with endianness, the internal representation consistently is named
- <code>INTERNAL</code> even on big-endian systems where the representations are
- identical.
- </p>
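- <p>The difference is easy to observe. In the following sketch <code>store_ucs4</code> writes a value in the big-endian order that UCS-4 requires, while <code>store_internal</code> simply copies the value in host byte order as a stand-in for <code>INTERNAL</code>. All three function names are hypothetical.
- </p>

```c
#include <stdint.h>
#include <string.h>

/* Store CODE_POINT in the byte order UCS-4 requires (big-endian).  */
static void
store_ucs4 (uint32_t code_point, unsigned char out[4])
{
  out[0] = (unsigned char) (code_point >> 24);
  out[1] = (unsigned char) (code_point >> 16);
  out[2] = (unsigned char) (code_point >> 8);
  out[3] = (unsigned char) code_point;
}

/* Store CODE_POINT in host byte order; no byte swapping is needed.  */
static void
store_internal (uint32_t code_point, unsigned char out[4])
{
  memcpy (out, &code_point, 4);
}

/* Nonzero if the most significant byte of a 32-bit value comes first.  */
static int
host_is_big_endian (void)
{
  const uint32_t probe = 1;
  unsigned char first;

  memcpy (&first, &probe, 1);
  return first == 0;
}
```

- <p>The two stored forms compare equal exactly when <code>host_is_big_endian</code> returns nonzero.
- </p>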
- <a name="iconv-module-data-structures"></a>
- <h4 class="subsubsection">6.5.4.3 <code>iconv</code> module data structures</h4>
- <p>So far this section has described how modules are located and considered
- to be used. What remains to be described is the interface of the modules
- so that one can write new ones. This section describes the interface as
- it is in use in January 1999. The interface will change a bit in the
- future but, with luck, only in an upwardly compatible way.
- </p>
- <p>The definitions necessary to write new modules are publicly available
- in the non-standard header <samp>gconv.h</samp>. The following text,
- therefore, describes the definitions from this header file. First,
- however, it is necessary to get an overview.
- </p>
- <p>From the perspective of the user of <code>iconv</code> the interface is quite
- simple: the <code>iconv_open</code> function returns a handle that can be used
- in calls to <code>iconv</code>, and finally the handle is freed with a call to
- <code>iconv_close</code>. The problem is that the handle has to be able to
- represent the possibly long sequences of conversion steps and also the
- state of each conversion since the handle is all that is passed to the
- <code>iconv</code> function. Therefore, the data structures are really the
- elements necessary to understanding the implementation.
- </p>
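- <p>From the user's side the whole life cycle fits in a few lines. The following sketch converts a UTF-8 buffer to big-endian UCS-4; the helper name is hypothetical, and the character set names <code>"UTF-8"</code> and <code>"UCS-4BE"</code> are assumed to be known to the local implementation.
- </p>

```c
#include <iconv.h>
#include <stddef.h>

/* Convert IN_LEN bytes of UTF-8 to UCS-4 (big-endian).  Returns the number
   of bytes written to OUT, or (size_t) -1 on any failure.  */
static size_t
utf8_to_ucs4be (const char *in, size_t in_len, char *out, size_t out_size)
{
  iconv_t cd = iconv_open ("UCS-4BE", "UTF-8");
  char *inp = (char *) in;  /* iconv does not modify the input bytes.  */
  char *outp = out;
  size_t in_left = in_len;
  size_t out_left = out_size;
  size_t rc;

  if (cd == (iconv_t) -1)
    return (size_t) -1;

  /* The handle carries the conversion steps and their state.  */
  rc = iconv (cd, &inp, &in_left, &outp, &out_left);
  iconv_close (cd);

  if (rc == (size_t) -1)
    return (size_t) -1;
  return out_size - out_left;
}
```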
- <p>We need two different kinds of data structures. The first describes the
- conversion and the second describes the state etc. There are really two
- type definitions like this in <samp>gconv.h</samp>.
- <a name="index-gconv_002eh"></a>
- </p>
- <dl>
- <dt><a name="index-struct-_005f_005fgconv_005fstep"></a>Data type: <strong>struct __gconv_step</strong></dt>
- <dd><p>This data structure describes one conversion a module can perform. For
- each function in a loaded module with conversion functions there is
- exactly one object of this type. This object is shared by all users of
- the conversion (i.e., this object does not contain any information
- corresponding to an actual conversion; it only describes the conversion
- itself).
- </p>
- <dl compact="compact">
- <dt><code>struct __gconv_loaded_object *__shlib_handle</code></dt>
- <dt><code>const char *__modname</code></dt>
- <dt><code>int __counter</code></dt>
- <dd><p>All these elements of the structure are used internally in the C library
- to coordinate loading and unloading the shared object. One must not expect any
- of the other elements to be available or initialized.
- </p>
- </dd>
- <dt><code>const char *__from_name</code></dt>
- <dt><code>const char *__to_name</code></dt>
- <dd><p><code>__from_name</code> and <code>__to_name</code> contain the names of the source and
- destination character sets. They can be used to identify the actual
- conversion to be carried out since one module might implement conversions
- for more than one character set and/or direction.
- </p>
- </dd>
- <dt><code>gconv_fct __fct</code></dt>
- <dt><code>gconv_init_fct __init_fct</code></dt>
- <dt><code>gconv_end_fct __end_fct</code></dt>
- <dd><p>These elements contain pointers to the functions in the loadable module.
- The interface will be explained below.
- </p>
- </dd>
- <dt><code>int __min_needed_from</code></dt>
- <dt><code>int __max_needed_from</code></dt>
- <dt><code>int __min_needed_to</code></dt>
- <dt><code>int __max_needed_to</code></dt>
- <dd><p>These values have to be supplied in the init function of the module. The
- <code>__min_needed_from</code> value specifies how many bytes a character of
- the source character set at least needs. The <code>__max_needed_from</code>
- specifies the maximum value that also includes possible shift sequences.
- </p>
- <p>The <code>__min_needed_to</code> and <code>__max_needed_to</code> values serve the
- same purpose as <code>__min_needed_from</code> and <code>__max_needed_from</code> but
- this time for the destination character set.
- </p>
- <p>It is crucial that these values be accurate since otherwise the
- conversion functions will have problems or not work at all.
- </p>
- </dd>
- <dt><code>int __stateful</code></dt>
- <dd><p>This element must also be initialized by the init function.
- <code>__stateful</code> is nonzero if the source character set is stateful.
- Otherwise it is zero.
- </p>
- </dd>
- <dt><code>void *__data</code></dt>
- <dd><p>This element can be used freely by the conversion functions in the
- module. <code>__data</code> can be used to communicate extra information
- from one call to another. It need not be initialized if it is
- not needed at all. If the <code>__data</code> element is assigned a pointer
- to dynamically allocated memory (presumably in the init function), the
- end function must deallocate that memory. Otherwise
- the application will leak memory.
- </p>
- <p>It is important to be aware that this data structure is shared by all
- users of this specific conversion and therefore the <code>__data</code>
- element must not contain data specific to one particular use of the
- conversion function.
- </p></dd>
- </dl>
- </dd></dl>
- <dl>
- <dt><a name="index-struct-_005f_005fgconv_005fstep_005fdata"></a>Data type: <strong>struct __gconv_step_data</strong></dt>
- <dd><p>This is the data structure that contains the information specific to
- each use of the conversion functions.
- </p>
- <dl compact="compact">
- <dt><code>char *__outbuf</code></dt>
- <dt><code>char *__outbufend</code></dt>
- <dd><p>These elements specify the output buffer for the conversion step. The
- <code>__outbuf</code> element points to the beginning of the buffer, and
- <code>__outbufend</code> points to the byte following the last byte in the
- buffer. The conversion function must not assume anything about the size
- of the buffer but it can be safely assumed that there is room for at
- least one complete character in the output buffer.
- </p>
- <p>Once the conversion is finished, if the conversion is the last step, the
- <code>__outbuf</code> element must be modified to point after the last byte
- written into the buffer to signal how much output is available. If this
- conversion step is not the last one, the element must not be modified.
- The <code>__outbufend</code> element must not be modified.
- </p>
- </dd>
- <dt><code>int __is_last</code></dt>
- <dd><p>This element is nonzero if this conversion step is the last one. This
- information is necessary for the recursion. See the description of the
- conversion function internals below. This element must never be
- modified.
- </p>
- </dd>
- <dt><code>int __invocation_counter</code></dt>
- <dd><p>The conversion function can use this element to see how many calls of
- the conversion function already happened. Some character sets require a
- certain prolog when generating output, and by comparing this value with
- zero, one can find out whether it is the first call and whether,
- therefore, the prolog should be emitted. This element must never be
- modified.
- </p>
- </dd>
- <dt><code>int __internal_use</code></dt>
- <dd><p>This element is another one rarely used but needed in certain
- situations. It is assigned a nonzero value in case the conversion
- functions are used to implement <code>mbsrtowcs</code> et al. (i.e., the
- function is not used directly through the <code>iconv</code> interface).
- </p>
- <p>This sometimes makes a difference as it is expected that the
- <code>iconv</code> functions are used to translate entire texts while the
- <code>mbsrtowcs</code> functions are normally used only to convert single
- strings and might be used multiple times to convert entire texts.
- </p>
- <p>But in this situation we would have a problem complying with some rules of
- the character set specification. Some character sets require a prolog,
- which must appear exactly once for an entire text. If a number of
- <code>mbsrtowcs</code> calls are used to convert the text, only the first call
- must add the prolog. However, because there is no communication between the
- different calls of <code>mbsrtowcs</code>, the conversion functions have no
- possibility to find this out. The situation is different for sequences
- of <code>iconv</code> calls since the handle allows access to the needed
- information.
- </p>
- <p>The <code>__internal_use</code> element is mostly used together with
- <code>__invocation_counter</code> as follows:
- </p>
- <div class="smallexample">
- <pre class="smallexample">if (!data->__internal_use
-     && data->__invocation_counter == 0)
-   /* <span class="roman">Emit prolog.</span> */
-   …
- </pre></div>
- <p>This element must never be modified.
- </p>
- </dd>
- <dt><code>mbstate_t *__statep</code></dt>
- <dd><p>The <code>__statep</code> element points to an object of type <code>mbstate_t</code>
- (see <a href="Keeping-the-state.html#Keeping-the-state">Keeping the state</a>). The conversion of a stateful character
- set must use the object pointed to by <code>__statep</code> to store
- information about the conversion state. The <code>__statep</code> element
- itself must never be modified.
- </p>
- </dd>
- <dt><code>mbstate_t __state</code></dt>
- <dd><p>This element must <em>never</em> be used directly. It is only part of
- this structure to have the needed space allocated.
- </p></dd>
- </dl>
- </dd></dl>
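- <p>How the per-use fields work together can be sketched with a simplified stand-in structure. <code>struct demo_step_data</code> and <code>emit_prolog</code> are hypothetical and only mirror the elements described above; the real structure in <samp>gconv.h</samp> has more members and should be used as declared there.
- </p>

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the per-use fields described above.  */
struct demo_step_data
{
  char *outbuf;
  char *outbufend;
  int is_last;
  int invocation_counter;
  int internal_use;
};

/* Write PROLOG on the first iconv-driven call only.  Returns the number
   of bytes written (0 if no prolog is due or it does not fit).  */
static size_t
emit_prolog (struct demo_step_data *data, const char *prolog, size_t len)
{
  if (data->internal_use || data->invocation_counter != 0)
    return 0;
  if ((size_t) (data->outbufend - data->outbuf) < len)
    return 0;

  memcpy (data->outbuf, prolog, len);
  if (data->is_last)
    data->outbuf += len;  /* Only the last step advances the pointer.  */
  return len;
}
```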
- <a name="iconv-module-interfaces"></a>
- <h4 class="subsubsection">6.5.4.4 <code>iconv</code> module interfaces</h4>
- <p>With the knowledge about the data structures we now can describe the
- conversion function itself. To understand the interface a bit of
- knowledge is necessary about the functionality in the C library that
- loads the objects with the conversions.
- </p>
- <p>It is often the case that one conversion is used more than once (i.e.,
- there are several <code>iconv_open</code> calls for the same set of character
- sets during one program run). The <code>mbsrtowcs</code> et al. functions in
- the GNU C Library also use the <code>iconv</code> functionality, which
- increases the number of uses of the same functions even more.
- </p>
- <p>Because of this multiple use of conversions, the modules do not get
- loaded exclusively for one conversion. Instead a module once loaded can
- be used by an arbitrary number of <code>iconv</code> or <code>mbsrtowcs</code> calls
- at the same time. The splitting of the information between conversion-
- function-specific information and conversion data makes this possible.
- The last section showed the two data structures used to do this.
- </p>
- <p>This is of course also reflected in the interface and semantics of the
- functions that the modules must provide. There are three functions that
- must have the following names:
- </p>
- <dl compact="compact">
- <dt><code>gconv_init</code></dt>
- <dd><p>The <code>gconv_init</code> function initializes the conversion function
- specific data structure. This very same object is shared by all
- conversions that use this conversion and, therefore, no state information
- about the conversion itself must be stored in here. If a module
- implements more than one conversion, the <code>gconv_init</code> function will
- be called multiple times.
- </p>
- </dd>
- <dt><code>gconv_end</code></dt>
- <dd><p>The <code>gconv_end</code> function is responsible for freeing all resources
- allocated by the <code>gconv_init</code> function. If there is nothing to do,
- this function can be missing. Special care must be taken if the module
- implements more than one conversion and the <code>gconv_init</code> function
- does not allocate the same resources for all conversions.
- </p>
- </dd>
- <dt><code>gconv</code></dt>
- <dd><p>This is the actual conversion function. It is called to convert one
- block of text. It gets passed the conversion step information
- initialized by <code>gconv_init</code> and the conversion data, specific to
- this use of the conversion functions.
- </p></dd>
- </dl>
- <p>There are three data types defined for the three module interface
- functions and these define the interface.
- </p>
- <dl>
- <dt><a name="index-_0028_002a_005f_005fgconv_005finit_005ffct_0029"></a>Data type: <em>int</em> <strong>(*__gconv_init_fct)</strong> <em>(struct __gconv_step *)</em></dt>
- <dd><p>This specifies the interface of the initialization function of the
- module. It is called exactly once for each conversion the module
- implements.
- </p>
- <p>As explained in the description of the <code>struct __gconv_step</code> data
- structure above the initialization function has to initialize parts of
- it.
- </p>
- <dl compact="compact">
- <dt><code>__min_needed_from</code></dt>
- <dt><code>__max_needed_from</code></dt>
- <dt><code>__min_needed_to</code></dt>
- <dt><code>__max_needed_to</code></dt>
- <dd><p>These elements must be initialized to the exact numbers of the minimum
- and maximum number of bytes used by one character in the source and
- destination character sets, respectively. If the characters all have the
- same size, the minimum and maximum values are the same.
- </p>
- </dd>
- <dt><code>__stateful</code></dt>
- <dd><p>This element must be initialized to a nonzero value if the source
- character set is stateful. Otherwise it must be zero.
- </p></dd>
- </dl>
- <p>If the initialization function needs to communicate some information
- to the conversion function, this communication can happen using the
- <code>__data</code> element of the <code>__gconv_step</code> structure. But since
- this data is shared by all the conversions, it must not be modified by
- the conversion function. The example below shows how this can be used.
- </p>
- <div class="smallexample">
- <pre class="smallexample">#define MIN_NEEDED_FROM 1
- #define MAX_NEEDED_FROM 4
- #define MIN_NEEDED_TO 4
- #define MAX_NEEDED_TO 4
- int
- gconv_init (struct __gconv_step *step)
- {
-   /* <span class="roman">Determine which direction.</span> */
-   struct iso2022jp_data *new_data;
-   enum direction dir = illegal_dir;
-   enum variant var = illegal_var;
-   int result;
-   if (__strcasecmp (step->__from_name, "ISO-2022-JP//") == 0)
-     {
-       dir = from_iso2022jp;
-       var = iso2022jp;
-     }
-   else if (__strcasecmp (step->__to_name, "ISO-2022-JP//") == 0)
-     {
-       dir = to_iso2022jp;
-       var = iso2022jp;
-     }
-   else if (__strcasecmp (step->__from_name, "ISO-2022-JP-2//") == 0)
-     {
-       dir = from_iso2022jp;
-       var = iso2022jp2;
-     }
-   else if (__strcasecmp (step->__to_name, "ISO-2022-JP-2//") == 0)
-     {
-       dir = to_iso2022jp;
-       var = iso2022jp2;
-     }
-   result = __GCONV_NOCONV;
-   if (dir != illegal_dir)
-     {
-       new_data = (struct iso2022jp_data *)
-         malloc (sizeof (struct iso2022jp_data));
-       result = __GCONV_NOMEM;
-       if (new_data != NULL)
-         {
-           new_data->dir = dir;
-           new_data->var = var;
-           step->__data = new_data;
-           if (dir == from_iso2022jp)
-             {
-               step->__min_needed_from = MIN_NEEDED_FROM;
-               step->__max_needed_from = MAX_NEEDED_FROM;
-               step->__min_needed_to = MIN_NEEDED_TO;
-               step->__max_needed_to = MAX_NEEDED_TO;
-             }
-           else
-             {
-               step->__min_needed_from = MIN_NEEDED_TO;
-               step->__max_needed_from = MAX_NEEDED_TO;
-               step->__min_needed_to = MIN_NEEDED_FROM;
-               step->__max_needed_to = MAX_NEEDED_FROM + 2;
-             }
-           /* <span class="roman">Yes, this is a stateful encoding.</span> */
-           step->__stateful = 1;
-           result = __GCONV_OK;
-         }
-     }
-   return result;
- }
- </pre></div>
- <p>The function first checks which conversion is wanted. The module from
- which this function is taken implements four different conversions;
- which one is selected can be determined by comparing the names. The
- comparison should always be done without paying attention to the case.
- </p>
- <p>Next, a data structure, which contains the necessary information about
- which conversion is selected, is allocated. The data structure
- <code>struct iso2022jp_data</code> is locally defined since, outside the
- module, this data is not used at all. Please note that if all four
- conversions this module supports are requested there are four data
- blocks.
- </p>
- <p>One interesting thing is the initialization of the <code>__min_</code> and
- <code>__max_</code> elements of the step data object. A single ISO-2022-JP
- character can consist of one to four bytes. Therefore the
- <code>MIN_NEEDED_FROM</code> and <code>MAX_NEEDED_FROM</code> macros are defined
- this way. The output is always the <code>INTERNAL</code> character set (aka
- UCS-4) and therefore each character consists of exactly four bytes. For
- the conversion from <code>INTERNAL</code> to ISO-2022-JP we have to take into
- account that escape sequences might be necessary to switch the character
- sets. Therefore the <code>__max_needed_to</code> element for this direction
- gets assigned <code>MAX_NEEDED_FROM + 2</code>. This takes into account the
- two bytes needed for the escape sequences to signal the switching. The
- asymmetry in the maximum values for the two directions can be explained
- easily: when reading ISO-2022-JP text, escape sequences can be handled
- alone (i.e., it is not necessary to process a real character since the
- effect of the escape sequence can be recorded in the state information).
- The situation is different for the other direction. Since it is in
- general not known which character comes next, one cannot emit escape
- sequences to change the state in advance. This means the escape
- sequences have to be emitted together with the next character.
- Therefore one needs more room than only for the character itself.
- </p>
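- <p>The size bookkeeping described above might look roughly like this. The
- macro values are the ones stated in the text; <code>struct step_sizes</code>
- and <code>set_step_sizes</code> are hypothetical stand-ins for the
- corresponding fields of the real step object and are not part of the
- library.
- </p>

```c
#include <assert.h>
#include <stddef.h>

/* Values from the text: one to four bytes per ISO-2022-JP character,
   exactly four bytes per INTERNAL (UCS-4) character.  */
#define MIN_NEEDED_FROM 1
#define MAX_NEEDED_FROM 4
#define MIN_NEEDED_TO   4
#define MAX_NEEDED_TO   4

/* Hypothetical stand-in for the four size fields of the step object.  */
struct step_sizes
{
  int min_needed_from, max_needed_from;
  int min_needed_to, max_needed_to;
};

/* Fill in the sizes for one direction.  TO_INTERNAL is nonzero for
   ISO-2022-JP -> INTERNAL and zero for the reverse direction.  */
void
set_step_sizes (struct step_sizes *s, int to_internal)
{
  assert (s != NULL);

  if (to_internal)
    {
      s->min_needed_from = MIN_NEEDED_FROM;
      s->max_needed_from = MAX_NEEDED_FROM;
      s->min_needed_to = MIN_NEEDED_TO;
      s->max_needed_to = MAX_NEEDED_TO;
    }
  else
    {
      /* The roles swap.  The extra two bytes leave room for an escape
         sequence that must be emitted together with the character.  */
      s->min_needed_from = MIN_NEEDED_TO;
      s->max_needed_from = MAX_NEEDED_TO;
      s->min_needed_to = MIN_NEEDED_FROM;
      s->max_needed_to = MAX_NEEDED_FROM + 2;
    }
}
```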
- <p>The possible return values of the initialization function are:
- </p>
- <dl compact="compact">
- <dt><code>__GCONV_OK</code></dt>
- <dd><p>The initialization succeeded.
- </p></dd>
- <dt><code>__GCONV_NOCONV</code></dt>
- <dd><p>The requested conversion is not supported in the module. This can
- happen if the <samp>gconv-modules</samp> file has errors.
- </p></dd>
- <dt><code>__GCONV_NOMEM</code></dt>
- <dd><p>Memory required to store additional information could not be allocated.
- </p></dd>
- </dl>
- </dd></dl>
- <p>The function called before the module is unloaded is significantly
- simpler. It often has nothing at all to do, in which case it can be left
- out completely.
- </p>
- <dl>
- <dt><a name="index-_0028_002a_005f_005fgconv_005fend_005ffct_0029"></a>Data type: <em>void</em> <strong>(*__gconv_end_fct)</strong> <em>(struct __gconv_step *)</em></dt>
- <dd><p>The task of this function is to free all resources allocated in the
- initialization function. Therefore only the <code>__data</code> element of
- the object pointed to by the argument is of interest. Continuing the
- example from the initialization function, the finalization function
- looks like this:
- </p>
- <div class="smallexample">
- <pre class="smallexample">void
- gconv_end (struct __gconv_step *data)
- {
- free (data->__data);
- }
- </pre></div>
- </dd></dl>
- <p>The most important function is the conversion function itself, which can
- get quite complicated for complex character sets. But since this is not
- of interest here, we will only describe a possible skeleton for the
- conversion function.
- </p>
- <dl>
- <dt><a name="index-_0028_002a_005f_005fgconv_005ffct_0029"></a>Data type: <em>int</em> <strong>(*__gconv_fct)</strong> <em>(struct __gconv_step *, struct __gconv_step_data *, const char **, const char *, size_t *, int)</em></dt>
- <dd><p>The conversion function can be called for two basic reasons: to convert
- text or to reset the state. From the description of the <code>iconv</code>
- function it can be seen why the flushing mode is necessary. What mode
- is selected is determined by the sixth argument, an integer. This
- argument being nonzero means that flushing is selected.
- </p>
- <p>Common to both modes is where the output buffer can be found. The
- information about this buffer is stored in the conversion step data. A
- pointer to this information is passed as the second argument to this
- function. The description of the <code>struct __gconv_step_data</code>
- structure has more information on the conversion step data.
- </p>
- <a name="index-stateful-5"></a>
- <p>What has to be done for flushing depends on the source character set.
- If the source character set is not stateful, nothing has to be done.
- Otherwise the function has to emit a byte sequence to bring the state
- object into the initial state. Once all this has happened, the other
- conversion modules in the chain of conversions have to get the same
- chance. Whether another step follows can be determined from the
- <code>__is_last</code> element of the step data structure to which the first
- parameter points.
- </p>
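- <p>As a rough sketch of the flushing step for a stateful character set,
- the following helper writes a reset sequence only when the state is not
- already the initial one. Everything named here is hypothetical: the
- <code>shift_state</code> type, the <code>SKETCH_</code> status codes, the
- helper <code>emit_reset</code>, and the three-byte sequence itself are
- chosen for illustration and do not come from a real module.
- </p>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical status codes and state type, local to this sketch.  */
enum { SKETCH_OK, SKETCH_FULL_OUTPUT };
enum shift_state { initial_state, shifted_state };

/* Hypothetical three-byte reset sequence (ESC ( B).  */
static const char reset_seq[3] = { 0x1b, '(', 'B' };

/* Bring *STATE back to the initial state, writing the reset sequence
   at *OUTPTR if one is needed and there is room before OUTEND.  */
int
emit_reset (enum shift_state *state, char **outptr, char *outend)
{
  assert (state != NULL && outptr != NULL && *outptr <= outend);

  if (*state == initial_state)
    return SKETCH_OK;           /* Nothing has to be emitted.  */

  if ((size_t) (outend - *outptr) < sizeof reset_seq)
    return SKETCH_FULL_OUTPUT;  /* The state is left unchanged.  */

  memcpy (*outptr, reset_seq, sizeof reset_seq);
  *outptr += sizeof reset_seq;
  *state = initial_state;
  return SKETCH_OK;
}
```

- <p>Only after such a helper reports success would the flush request be
- passed on to the next step in the chain.
- </p>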
- <p>The more interesting mode is when actual text has to be converted. The
- first step in this case is to convert as much text as possible from the
- input buffer and store the result in the output buffer. The start of the
- input buffer is determined by the third argument, which is a pointer to a
- pointer variable referencing the beginning of the buffer. The fourth
- argument is a pointer to the byte right after the last byte in the buffer.
- </p>
- <p>The conversion has to be performed according to the current state if the
- character set is stateful. The state is stored in an object pointed to
- by the <code>__statep</code> element of the step data (second argument). Once
- either the input buffer is empty or the output buffer is full the
- conversion stops. At this point, the pointer variable referenced by the
- third parameter must point to the byte following the last processed
- byte (i.e., if all of the input is consumed, this pointer and the fourth
- parameter have the same value).
- </p>
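- <p>The pointer handling described above can be illustrated with a small,
- stateless loop. This is a sketch under stated assumptions, not code from
- the library: it converts ISO-8859-1 bytes to four-byte UCS-4 values in host
- byte order, and the function name <code>latin1_to_ucs4</code> and the
- <code>SKETCH_</code> status codes are invented for the example.
- </p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical status codes, local to this sketch.  */
enum { SKETCH_EMPTY_INPUT, SKETCH_FULL_OUTPUT };

/* Convert as many bytes as possible from [*INBUF, INEND) into
   [*OUTBUF, OUTEND).  On return *INBUF points to the byte following
   the last processed byte and *OUTBUF points after the last written
   byte.  */
int
latin1_to_ucs4 (const char **inbuf, const char *inend,
                char **outbuf, char *outend)
{
  assert (*inbuf <= inend && *outbuf <= outend);

  const char *inptr = *inbuf;
  char *outptr = *outbuf;
  int status = SKETCH_EMPTY_INPUT;

  while (inptr != inend)
    {
      if (outend - outptr < 4)
        {
          /* No room for another four-byte character.  */
          status = SKETCH_FULL_OUTPUT;
          break;
        }
      uint32_t ch = (unsigned char) *inptr++;
      memcpy (outptr, &ch, 4);
      outptr += 4;
    }

  *inbuf = inptr;
  *outbuf = outptr;
  return status;
}
```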
- <p>What now happens depends on whether this step is the last one. If it is
- the last step, the only thing that has to be done is to update the
- <code>__outbuf</code> element of the step data structure to point after the
- last written byte. This update gives the caller the information on how
- much text is available in the output buffer. In addition, the variable
- pointed to by the fifth parameter, which is of type <code>size_t</code>, must
- be incremented by the number of characters (<em>not bytes</em>) that were
- converted in a non-reversible way. Then, the function can return.
- </p>
- <p>In case the step is not the last one, the later conversion functions have
- to get a chance to do their work. Therefore, the appropriate conversion
- function has to be called. The information about the functions is
- stored in the conversion data structures, passed as the first parameter.
- This information and the step data are stored in arrays, so the next
- element in both cases can be found by simple pointer arithmetic:
- </p>
- <div class="smallexample">
- <pre class="smallexample">int
- gconv (struct __gconv_step *step, struct __gconv_step_data *data,
- const char **inbuf, const char *inbufend, size_t *written,
- int do_flush)
- {
- struct __gconv_step *next_step = step + 1;
- struct __gconv_step_data *next_data = data + 1;
- …
- </pre></div>
- <p>The <code>next_step</code> pointer references the next step information and
- <code>next_data</code> the next data record. The call of the next function
- therefore will look similar to this:
- </p>
- <div class="smallexample">
- <pre class="smallexample"> next_step->__fct (next_step, next_data, &outerr, outbuf,
- written, 0)
- </pre></div>
- <p>But this is not yet all. Once the function call returns the conversion
- function might have some more to do. If the return value of the function
- is <code>__GCONV_EMPTY_INPUT</code>, more room is available in the output
- buffer. Unless the input buffer is empty, the conversion functions start
- all over again and process the rest of the input buffer. If the return
- value is not <code>__GCONV_EMPTY_INPUT</code>, something went wrong and we have
- to recover from this.
- </p>
- <p>A requirement for the conversion function is that the input buffer
- pointer (the third argument) always point to the last character that
- was put in converted form into the output buffer. This is trivially
- true after the conversion performed in the current step, but if the
- conversion functions deeper downstream stop prematurely, not all
- characters from the output buffer are consumed and, therefore, the input
- buffer pointers must be backed off to the right position.
- </p>
- <p>Correcting the input buffers is easy to do if the input and output
- character sets have a fixed width for all characters. In this situation
- we can compute how many characters are left in the output buffer and,
- therefore, can correct the input buffer pointer appropriately with a
- similar computation. Things get tricky if either character set
- has characters represented with variable length byte sequences, and it
- gets even more complicated if the conversion has to take care of the
- state. In these cases the conversion has to be performed once again, from
- the known state before the initial conversion (i.e., if necessary the
- state of the conversion has to be reset and the conversion loop has to be
- executed again). The difference now is that it is known how much output
- must be created, and the conversion can stop before converting the first
- unused character. Once this is done the input buffer pointers must be
- updated again and the function can return.
- </p>
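- <p>For the simple fixed-width case, the correction amounts to a single
- computation. The helper below is a hypothetical sketch (the name
- <code>back_off_input</code> and its parameters are not part of the
- library): it takes the input position reached by the current step, the
- end of the output this step produced, and the position up to which the
- next step consumed that output.
- </p>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper for fixed-width character sets.  INPTR is the
   input position after the conversion loop, OUTBUF the end of the
   output produced by this step, and OUTERR the position up to which
   the next step consumed that output.  IN_WIDTH and OUT_WIDTH are
   the bytes per character on each side.  Returns the corrected input
   position.  */
const char *
back_off_input (const char *inptr, const char *outerr,
                const char *outbuf, size_t in_width, size_t out_width)
{
  assert (in_width > 0 && out_width > 0 && outerr <= outbuf);
  assert ((size_t) (outbuf - outerr) % out_width == 0);

  /* Characters produced by this step but not consumed downstream.  */
  size_t unused_chars = (size_t) (outbuf - outerr) / out_width;

  return inptr - unused_chars * in_width;
}
```

- <p>With one-byte input and four-byte output, for example, twelve
- unconsumed output bytes move the input pointer back by three bytes.
- </p>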
- <p>One final thing should be mentioned. If it is necessary for the
- conversion to know whether it is the first invocation (in case a prolog
- has to be emitted), the conversion function should increment the
- <code>__invocation_counter</code> element of the step data structure just
- before returning to the caller. See the description of the <code>struct
- __gconv_step_data</code> structure above for more information on how this can
- be used.
- </p>
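- <p>A minimal sketch of how such a counter could guard a prolog is shown
- below. It assumes a hypothetical two-byte prolog (the UTF-16 big-endian
- byte order mark) and uses a stand-in structure,
- <code>struct sketch_step_data</code>, rather than the real step data type;
- the function name <code>begin_invocation</code> is likewise invented.
- </p>

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the step data; only the counter matters here.  */
struct sketch_step_data
{
  int invocation_counter;   /* Plays the role of __invocation_counter.  */
};

/* Emit the prolog on the first invocation only.  Returns the number
   of bytes written to OUT, which has room for AVAIL bytes.  */
size_t
begin_invocation (struct sketch_step_data *data, unsigned char *out,
                  size_t avail)
{
  assert (data != NULL && out != NULL);

  size_t written = 0;
  if (data->invocation_counter == 0)
    {
      if (avail < 2)
        return 0;   /* No room yet; the counter stays untouched.  */
      out[0] = 0xfe;
      out[1] = 0xff;
      written = 2;
    }

  /* The conversion proper would run here.  */

  /* Count this use of the step just before returning.  */
  ++data->invocation_counter;
  return written;
}
```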
- <p>The return value must be one of the following values:
- </p>
- <dl compact="compact">
- <dt><code>__GCONV_EMPTY_INPUT</code></dt>
- <dd><p>All input was consumed and there is room left in the output buffer.
- </p></dd>
- <dt><code>__GCONV_FULL_OUTPUT</code></dt>
- <dd><p>No more room in the output buffer. In case this is not the last step
- this value is propagated down from the call of the next conversion
- function in the chain.
- </p></dd>
- <dt><code>__GCONV_INCOMPLETE_INPUT</code></dt>
- <dd><p>The input buffer is not entirely empty since it contains an incomplete
- character sequence.
- </p></dd>
- </dl>
- <p>The following example provides a framework for a conversion function.
- To write a new conversion, only the holes in this implementation have
- to be filled in.
- </p>
- <div class="smallexample">
- <pre class="smallexample">int
- gconv (struct __gconv_step *step, struct __gconv_step_data *data,
- const char **inbuf, const char *inbufend, size_t *written,
- int do_flush)
- {
- struct __gconv_step *next_step = step + 1;
- struct __gconv_step_data *next_data = data + 1;
- __gconv_fct fct = next_step->__fct;
- int status;
- /* <span class="roman">If the function is called with no input this means we have</span>
- <span class="roman">to reset to the initial state. The possibly partly</span>
- <span class="roman">converted input is dropped.</span> */
- if (do_flush)
- {
- status = __GCONV_OK;
- /* <span class="roman">Possibly emit a byte sequence which puts the state object</span>
- <span class="roman">into the initial state.</span> */
- /* <span class="roman">Call the steps down the chain if there are any but only</span>
- <span class="roman">if we successfully emitted the escape sequence.</span> */
- if (status == __GCONV_OK && ! data->__is_last)
- status = fct (next_step, next_data, NULL, NULL,
- written, 1);
- }
- else
- {
- /* <span class="roman">We preserve the initial values of the pointer variables.</span> */
- const char *inptr = *inbuf;
- char *outbuf = data->__outbuf;
- char *outend = data->__outbufend;
- char *outptr;
- do
- {
- /* <span class="roman">Remember the start value for this round.</span> */
- inptr = *inbuf;
- /* <span class="roman">The outbuf buffer is empty.</span> */
- outptr = outbuf;
- /* <span class="roman">For stateful encodings the state must be saved here.</span> */
- /* <span class="roman">Run the conversion loop. <code>status</code> is set</span>
- <span class="roman">appropriately afterwards.</span> */
- /* <span class="roman">If this is the last step, leave the loop. There is</span>
- <span class="roman">nothing we can do.</span> */
- if (data->__is_last)
- {
- /* <span class="roman">Store information about how many bytes are</span>
- <span class="roman">available.</span> */
- data->__outbuf = outbuf;
- /* <span class="roman">If any non-reversible conversions were performed,</span>
- <span class="roman">add the number to <code>*written</code>.</span> */
- break;
- }
- /* <span class="roman">Write out all output that was produced.</span> */
- if (outbuf > outptr)
- {
- const char *outerr = data->__outbuf;
- int result;
- result = fct (next_step, next_data, &outerr,
- outbuf, written, 0);
- if (result != __GCONV_EMPTY_INPUT)
- {
- if (outerr != outbuf)
- {
- /* <span class="roman">Reset the input buffer pointer. We</span>
- <span class="roman">document here the complex case.</span> */
- size_t nstatus;
- /* <span class="roman">Reload the pointers.</span> */
- *inbuf = inptr;
- outbuf = outptr;
- /* <span class="roman">Possibly reset the state.</span> */
- /* <span class="roman">Redo the conversion, but this time</span>
- <span class="roman">the end of the output buffer is at</span>
- <span class="roman"><code>outerr</code>.</span> */
- }
- /* <span class="roman">Change the status.</span> */
- status = result;
- }
- else
- /* <span class="roman">All the output is consumed, we can make</span>
- <span class="roman"> another run if everything was ok.</span> */
- if (status == __GCONV_FULL_OUTPUT)
- status = __GCONV_OK;
- }
- }
- while (status == __GCONV_OK);
- /* <span class="roman">We finished one use of this step.</span> */
- ++data->__invocation_counter;
- }
- return status;
- }
- </pre></div>
- </dd></dl>
- <p>This information should be sufficient to write new modules. Anybody
- doing so should also take a look at the available source code in the
- GNU C Library sources. It contains many examples of working and optimized
- modules.
- </p>
- </body>
- </html>