/glibc/libc/glibc-iconv-Implementation.html

https://gitlab.com/Gentio/my-pdf
  1. <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
  2. <html>
  3. <!-- This file documents the GNU C Library.
  4. This is
  5. The GNU C Library Reference Manual, for version
  6. 2.23.
  7. Copyright (C) 1993-2016 Free Software Foundation, Inc.
  8. Permission is granted to copy, distribute and/or modify this document
  9. under the terms of the GNU Free Documentation License, Version
  10. 1.3 or any later version published by the Free
  11. Software Foundation; with the Invariant Sections being "Free Software
  12. Needs Free Documentation" and "GNU Lesser General Public License",
  13. the Front-Cover texts being "A GNU Manual", and with the Back-Cover
  14. Texts as in (a) below. A copy of the license is included in the
  15. section entitled "GNU Free Documentation License".
  16. (a) The FSF's Back-Cover Text is: "You have the freedom to
  17. copy and modify this GNU manual. Buying copies from the FSF
  18. supports it in developing GNU and promoting software freedom." -->
  19. <!-- Created by GNU Texinfo 6.0, http://www.gnu.org/software/texinfo/ -->
  20. <head>
  21. <title>The GNU C Library: glibc iconv Implementation</title>
  22. <meta name="description" content="The GNU C Library: glibc iconv Implementation">
  23. <meta name="keywords" content="The GNU C Library: glibc iconv Implementation">
  24. <meta name="resource-type" content="document">
  25. <meta name="distribution" content="global">
  26. <meta name="Generator" content="makeinfo">
  27. <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  28. <link href="index.html#Top" rel="start" title="Top">
  29. <link href="Concept-Index.html#Concept-Index" rel="index" title="Concept Index">
  30. <link href="index.html#SEC_Contents" rel="contents" title="Table of Contents">
  31. <link href="Generic-Charset-Conversion.html#Generic-Charset-Conversion" rel="up" title="Generic Charset Conversion">
  32. <link href="Locales.html#Locales" rel="next" title="Locales">
  33. <link href="Other-iconv-Implementations.html#Other-iconv-Implementations" rel="prev" title="Other iconv Implementations">
  34. <style type="text/css">
  35. <!--
  36. a.summary-letter {text-decoration: none}
  37. blockquote.indentedblock {margin-right: 0em}
  38. blockquote.smallindentedblock {margin-right: 0em; font-size: smaller}
  39. blockquote.smallquotation {font-size: smaller}
  40. div.display {margin-left: 3.2em}
  41. div.example {margin-left: 3.2em}
  42. div.lisp {margin-left: 3.2em}
  43. div.smalldisplay {margin-left: 3.2em}
  44. div.smallexample {margin-left: 3.2em}
  45. div.smalllisp {margin-left: 3.2em}
  46. kbd {font-style: oblique}
  47. pre.display {font-family: inherit}
  48. pre.format {font-family: inherit}
  49. pre.menu-comment {font-family: serif}
  50. pre.menu-preformatted {font-family: serif}
  51. pre.smalldisplay {font-family: inherit; font-size: smaller}
  52. pre.smallexample {font-size: smaller}
  53. pre.smallformat {font-family: inherit; font-size: smaller}
  54. pre.smalllisp {font-size: smaller}
  55. span.nocodebreak {white-space: nowrap}
  56. span.nolinebreak {white-space: nowrap}
  57. span.roman {font-family: serif; font-weight: normal}
  58. span.sansserif {font-family: sans-serif; font-weight: normal}
  59. ul.no-bullet {list-style: none}
  60. -->
  61. </style>
  62. </head>
  63. <body lang="en">
  64. <a name="glibc-iconv-Implementation"></a>
  65. <div class="header">
  66. <p>
  67. Previous: <a href="Other-iconv-Implementations.html#Other-iconv-Implementations" accesskey="p" rel="prev">Other iconv Implementations</a>, Up: <a href="Generic-Charset-Conversion.html#Generic-Charset-Conversion" accesskey="u" rel="up">Generic Charset Conversion</a> &nbsp; [<a href="index.html#SEC_Contents" title="Table of contents" rel="contents">Contents</a>][<a href="Concept-Index.html#Concept-Index" title="Index" rel="index">Index</a>]</p>
  68. </div>
  69. <hr>
  70. <a name="The-iconv-Implementation-in-the-GNU-C-Library"></a>
  71. <h4 class="subsection">6.5.4 The <code>iconv</code> Implementation in the GNU C Library</h4>
  72. <p>After reading about the problems of <code>iconv</code> implementations in the
  73. last section it is certainly good to note that the implementation in
  74. the GNU C Library has none of the problems mentioned above. What
  75. follows is a step-by-step analysis of the points raised above. The
  76. evaluation is based on the current state of the development (as of
  77. January 1999). The development of the <code>iconv</code> functions is not
  78. complete, but basic functionality has solidified.
  79. </p>
  80. <p>The GNU C Library&rsquo;s <code>iconv</code> implementation uses shared loadable
  81. modules to implement the conversions. A very small number of
  82. conversions are built into the library itself but these are only rather
  83. trivial conversions.
  84. </p>
  85. <p>All the benefits of loadable modules are available in the GNU C Library
  86. implementation. This is especially appealing since the interface is
  87. well documented (see below), and it, therefore, is easy to write new
  88. conversion modules. The drawback of using loadable objects is not a
  89. problem in the GNU C Library, at least on ELF systems. Since the
  90. library is able to load shared objects even in statically linked
  91. binaries, static linking need not be forbidden in case one wants to use
  92. <code>iconv</code>.
  93. </p>
  94. <p>The second mentioned problem is the number of supported conversions.
  95. Currently, the GNU C Library supports more than 150 character sets. The
  96. way the implementation is designed the number of supported conversions
  97. is greater than 22350 (<em>150</em> times <em>149</em>). If any conversion
  98. from or to a character set is missing, it can be added easily.
  99. </p>
  100. <p>Particularly impressive as it may be, this high number is due to the
  101. fact that the GNU C Library implementation of <code>iconv</code> does not have
  102. the third problem mentioned above (i.e., whenever there is a conversion
  103. from a character set <em>A</em> to <em>B</em> and from
  104. <em>B</em> to <em>C</em> it is always possible to convert from
<em>A</em> to <em>C</em> directly). If <code>iconv_open</code>
returns an error and sets <code>errno</code> to <code>EINVAL</code>, there is no
known way, directly or indirectly, to perform the requested conversion.
  108. </p>
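<p>As an illustration of the <code>EINVAL</code> behavior just described, the
following minimal sketch probes whether a conversion is known. The helper
name <code>conversion_available</code> is made up for this example; only
<code>iconv_open</code>, <code>iconv_close</code>, and <code>errno</code> belong to
the standard interface.
</p>

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper, not part of any library.
   Returns 1 if iconv_open finds a conversion from FROMCODE to TOCODE,
   0 if it fails with EINVAL (no direct or indirect path is known),
   and -1 for any other failure (for example, lack of memory).  */
int
conversion_available (const char *tocode, const char *fromcode)
{
  assert (tocode != NULL && fromcode != NULL);

  iconv_t cd = iconv_open (tocode, fromcode);
  if (cd == (iconv_t) -1)
    return errno == EINVAL ? 0 : -1;

  /* The descriptor was only needed for the probe.  */
  iconv_close (cd);
  return 1;
}
```

<p>A call such as <code>conversion_available (&quot;EUC-JP&quot;, &quot;ISO-2022-JP&quot;)</code>
would then report whether some path, direct or through triangulation, exists
on the running system.
</p>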
  109. <a name="index-triangulation"></a>
  110. <p>Triangulation is achieved by providing for each character set a
  111. conversion from and to UCS-4 encoded ISO&nbsp;10646<!-- /@w -->. Using ISO&nbsp;10646<!-- /@w -->
  112. as an intermediate representation it is possible to <em>triangulate</em>
  113. (i.e., convert with an intermediate representation).
  114. </p>
  115. <p>There is no inherent requirement to provide a conversion to ISO&nbsp;10646<!-- /@w --> for a new character set, and it is also possible to provide other
  116. conversions where neither source nor destination character set is ISO&nbsp;10646<!-- /@w -->. The existing set of conversions is simply meant to cover all
  117. conversions that might be of interest.
  118. </p>
  119. <a name="index-ISO_002d2022_002dJP"></a>
  120. <a name="index-EUC_002dJP"></a>
  121. <p>All currently available conversions use the triangulation method above,
  122. making conversion run unnecessarily slow. If, for example, somebody
  123. often needs the conversion from ISO-2022-JP to EUC-JP, a quicker solution
would involve direct conversion between the two character sets, skipping
the intermediate conversion to ISO&nbsp;10646<!-- /@w -->. The two character sets of interest
  126. are much more similar to each other than to ISO&nbsp;10646<!-- /@w -->.
  127. </p>
  128. <p>In such a situation one easily can write a new conversion and provide it
  129. as a better alternative. The GNU C Library <code>iconv</code> implementation
  130. would automatically use the module implementing the conversion if it is
  131. specified to be more efficient.
  132. </p>
  133. <a name="Format-of-gconv_002dmodules-files"></a>
  134. <h4 class="subsubsection">6.5.4.1 Format of <samp>gconv-modules</samp> files</h4>
  135. <p>All information about the available conversions comes from a file named
  136. <samp>gconv-modules</samp>, which can be found in any of the directories along
  137. the <code>GCONV_PATH</code>. The <samp>gconv-modules</samp> files are line-oriented
  138. text files, where each of the lines has one of the following formats:
  139. </p>
  140. <ul>
  141. <li> If the first non-whitespace character is a <kbd>#</kbd> the line contains only
  142. comments and is ignored.
  143. </li><li> Lines starting with <code>alias</code> define an alias name for a character
  144. set. Two more words are expected on the line. The first word
  145. defines the alias name, and the second defines the original name of the
  146. character set. The effect is that it is possible to use the alias name
  147. in the <var>fromset</var> or <var>toset</var> parameters of <code>iconv_open</code> and
  148. achieve the same result as when using the real character set name.
<p>This is quite important as a character set often has many different
names. There is normally an official name but this need not correspond to
the most popular name. Besides this, many character sets have special
  152. names that are somehow constructed. For example, all character sets
  153. specified by the ISO have an alias of the form <code>ISO-IR-<var>nnn</var></code>
  154. where <var>nnn</var> is the registration number. This allows programs that
  155. know about the registration number to construct character set names and
  156. use them in <code>iconv_open</code> calls. More on the available names and
  157. aliases follows below.
  158. </p>
  159. </li><li> Lines starting with <code>module</code> introduce an available conversion
  160. module. These lines must contain three or four more words.
  161. <p>The first word specifies the source character set, the second word the
  162. destination character set of conversion implemented in this module, and
  163. the third word is the name of the loadable module. The filename is
  164. constructed by appending the usual shared object suffix (normally
  165. <samp>.so</samp>) and this file is then supposed to be found in the same
  166. directory the <samp>gconv-modules</samp> file is in. The last word on the line,
  167. which is optional, is a numeric value representing the cost of the
  168. conversion. If this word is missing, a cost of <em>1</em> is assumed. The
  169. numeric value itself does not matter that much; what counts are the
  170. relative values of the sums of costs for all possible conversion paths.
  171. Below is a more precise description of the use of the cost value.
  172. </p></li></ul>
<p>Return now to the example above, where one has written a module to convert
directly from ISO-2022-JP to EUC-JP and back. All that has to be done is
to put the new module, say <samp>ISO2022JP-EUCJP.so</samp>, in a directory
and add a file <samp>gconv-modules</samp> with the following content to the
same directory:
  178. </p>
  179. <div class="smallexample">
  180. <pre class="smallexample">module ISO-2022-JP// EUC-JP// ISO2022JP-EUCJP 1
  181. module EUC-JP// ISO-2022-JP// ISO2022JP-EUCJP 1
  182. </pre></div>
  183. <p>To see why this is sufficient, it is necessary to understand how the
  184. conversion used by <code>iconv</code> (and described in the descriptor) is
  185. selected. The approach to this problem is quite simple.
  186. </p>
  187. <p>At the first call of the <code>iconv_open</code> function the program reads
  188. all available <samp>gconv-modules</samp> files and builds up two tables: one
  189. containing all the known aliases and another that contains the
  190. information about the conversions and which shared object implements
  191. them.
  192. </p>
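<p>The line formats listed above can be sketched as a small parser. This is
only an illustration under stated assumptions, not the code the library
uses: the names <code>parse_gconv_line</code>, <code>struct parsed_line</code>,
and <code>enum line_kind</code> are invented here, fields are limited to 63
characters, and blank lines are treated like comments.
</p>

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* All names in this block are hypothetical; they only mirror the
   textual description of the gconv-modules format.  */
enum line_kind { LINE_IGNORED, LINE_ALIAS, LINE_MODULE, LINE_INVALID };

struct parsed_line
{
  enum line_kind kind;
  char from[64];    /* Alias name, or source character set.  */
  char to[64];      /* Original name, or destination character set.  */
  char module[64];  /* Module name without the shared object suffix.  */
  int cost;         /* Stays 1 when the optional fourth word is missing.  */
};

struct parsed_line
parse_gconv_line (const char *line)
{
  struct parsed_line p = { LINE_INVALID, "", "", "", 1 };
  char keyword[16];
  int consumed = 0;

  assert (line != NULL);

  /* A '#' as first non-whitespace character marks a comment line.
     Treating empty lines the same way is an assumption of this sketch.  */
  line += strspn (line, " \t");
  if (*line == '#' || *line == '\n' || *line == '\0')
    {
      p.kind = LINE_IGNORED;
      return p;
    }

  if (sscanf (line, "%15s%n", keyword, &consumed) != 1)
    return p;
  line += consumed;

  if (strcmp (keyword, "alias") == 0)
    {
      /* Two more words: the alias and the original name.  */
      if (sscanf (line, "%63s %63s", p.from, p.to) == 2)
        p.kind = LINE_ALIAS;
    }
  else if (strcmp (keyword, "module") == 0)
    {
      /* Three or four more words; the cost is optional.  */
      int n = sscanf (line, "%63s %63s %63s %d",
                      p.from, p.to, p.module, &p.cost);
      if (n >= 3)
        p.kind = LINE_MODULE;
    }
  return p;
}
```

<p>Feeding each line of a <samp>gconv-modules</samp> file to such a function
and sorting the results by <code>kind</code> would give the two tables
mentioned above, one for aliases and one for modules.
</p>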
  193. <a name="Finding-the-conversion-path-in-iconv"></a>
  194. <h4 class="subsubsection">6.5.4.2 Finding the conversion path in <code>iconv</code></h4>
<p>The set of available conversions forms a directed graph with weighted
edges. The weights on the edges are the costs specified in the
<samp>gconv-modules</samp> files. The <code>iconv_open</code> function uses an
algorithm suitable for finding the best path in such a graph and so
  199. constructs a list of conversions that must be performed in succession
  200. to get the transformation from the source to the destination character
  201. set.
  202. </p>
<p>Explaining why the above <samp>gconv-modules</samp> file allows the
<code>iconv</code> implementation to select the specific ISO-2022-JP to
EUC-JP conversion module instead of the conversion coming with the
library itself is straightforward. Since the latter conversion takes two
steps (from ISO-2022-JP to ISO&nbsp;10646<!-- /@w --> and then from ISO&nbsp;10646<!-- /@w --> to
EUC-JP), the cost is <em>1+1 = 2</em>. The above <samp>gconv-modules</samp>
file, however, specifies that the new conversion module can perform this
conversion with a cost of only <em>1</em>.
  211. </p>
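<p>The cost comparison can be reproduced with a tiny shortest-path search.
The sketch below uses Bellman-Ford relaxation purely for brevity; the text
does not say which algorithm the library itself uses, and the names
<code>struct edge</code>, <code>enum charset_node</code>, and
<code>cheapest_path_cost</code> are invented for this example.
</p>

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical model: each character set is a node, each "module"
   line is a directed edge carrying the cost from that line.  */
enum charset_node { NODE_ISO2022JP, NODE_INTERNAL, NODE_EUCJP, NODE_COUNT };

struct edge
{
  int from;
  int to;
  int cost;
};

/* Return the smallest total cost of any path from SRC to DST,
   or -1 if DST cannot be reached.  Handles at most 16 nodes.  */
int
cheapest_path_cost (const struct edge *edges, size_t nedges,
                    int nodes, int src, int dst)
{
  int dist[16];

  assert (nodes > 0 && nodes <= 16);
  assert (src >= 0 && src < nodes && dst >= 0 && dst < nodes);

  for (int i = 0; i < nodes; i++)
    dist[i] = INT_MAX;
  dist[src] = 0;

  /* Relax every edge NODES - 1 times (Bellman-Ford).  */
  for (int round = 0; round < nodes - 1; round++)
    for (size_t e = 0; e < nedges; e++)
      {
        int d = dist[edges[e].from];
        if (d != INT_MAX && d + edges[e].cost < dist[edges[e].to])
          dist[edges[e].to] = d + edges[e].cost;
      }

  return dist[dst] == INT_MAX ? -1 : dist[dst];
}
```

<p>With only the two triangulation edges the result for ISO-2022-JP to EUC-JP
is 2; adding a direct edge of cost 1 lowers it to 1, which is why the new
module is preferred.
</p>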
<p>A mysterious aspect of the <samp>gconv-modules</samp> file above (and also
of the file coming with the GNU C Library) is the names of the character
sets specified in the <code>module</code> lines. Why do almost all the names
  215. end in <code>//</code>? And this is not all: the names can actually be
  216. regular expressions. At this point in time this mystery should not be
  217. revealed, unless you have the relevant spell-casting materials: ashes
  218. from an original DOS&nbsp;6.2<!-- /@w --> boot disk burnt in effigy, a crucifix
  219. blessed by St. Emacs, assorted herbal roots from Central America, sand
  220. from Cebu, etc. Sorry! <strong>The part of the implementation where
  221. this is used is not yet finished. For now please simply follow the
  222. existing examples. It&rsquo;ll become clearer once it is. &ndash;drepper</strong>
  223. </p>
<p>A last remark about the <samp>gconv-modules</samp> file concerns the names not
ending with <code>//</code>. A character set named <code>INTERNAL</code> is often
  226. mentioned. From the discussion above and the chosen name it should have
  227. become clear that this is the name for the representation used in the
  228. intermediate step of the triangulation. We have said that this is UCS-4
  229. but actually that is not quite right. The UCS-4 specification also
  230. includes the specification of the byte ordering used. Since a UCS-4 value
  231. consists of four bytes, a stored value is affected by byte ordering. The
  232. internal representation is <em>not</em> the same as UCS-4 in case the byte
  233. ordering of the processor (or at least the running process) is not the
  234. same as the one required for UCS-4. This is done for performance reasons
  235. as one does not want to perform unnecessary byte-swapping operations if
  236. one is not interested in actually seeing the result in UCS-4. To avoid
  237. trouble with endianness, the internal representation consistently is named
  238. <code>INTERNAL</code> even on big-endian systems where the representations are
  239. identical.
  240. </p>
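<p>The difference between UCS-4 and the <code>INTERNAL</code> representation
can be made concrete with a few lines of C. This is a sketch assuming, as the
paragraph above implies, that UCS-4 stores the most significant byte first
and that <code>INTERNAL</code> is the same 32-bit value in host byte order.
All function names are invented for the illustration.
</p>

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Store CODE_POINT in UCS-4 byte order: most significant byte first.  */
void
store_ucs4 (uint32_t code_point, unsigned char out[4])
{
  assert (out != NULL);
  out[0] = (unsigned char) (code_point >> 24);
  out[1] = (unsigned char) (code_point >> 16);
  out[2] = (unsigned char) (code_point >> 8);
  out[3] = (unsigned char) code_point;
}

/* Store CODE_POINT the way the processor lays out a 32-bit value,
   which is what this sketch assumes INTERNAL to be.  */
void
store_host_order (uint32_t code_point, unsigned char out[4])
{
  assert (out != NULL);
  memcpy (out, &code_point, 4);
}

/* Nonzero if both layouts coincide, i.e., on big-endian hosts.  */
int
host_order_matches_ucs4 (void)
{
  unsigned char a[4];
  unsigned char b[4];

  store_ucs4 (0x000000E9u, a);
  store_host_order (0x000000E9u, b);
  return memcmp (a, b, 4) == 0;
}
```

<p>On a little-endian machine <code>host_order_matches_ucs4</code> returns zero,
so producing real UCS-4 output costs a byte swap that the intermediate step
avoids.
</p>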
  241. <a name="iconv-module-data-structures"></a>
  242. <h4 class="subsubsection">6.5.4.3 <code>iconv</code> module data structures</h4>
  243. <p>So far this section has described how modules are located and considered
  244. to be used. What remains to be described is the interface of the modules
  245. so that one can write new ones. This section describes the interface as
  246. it is in use in January 1999. The interface will change a bit in the
  247. future but, with luck, only in an upwardly compatible way.
  248. </p>
  249. <p>The definitions necessary to write new modules are publicly available
  250. in the non-standard header <samp>gconv.h</samp>. The following text,
  251. therefore, describes the definitions from this header file. First,
  252. however, it is necessary to get an overview.
  253. </p>
  254. <p>From the perspective of the user of <code>iconv</code> the interface is quite
  255. simple: the <code>iconv_open</code> function returns a handle that can be used
  256. in calls to <code>iconv</code>, and finally the handle is freed with a call to
  257. <code>iconv_close</code>. The problem is that the handle has to be able to
  258. represent the possibly long sequences of conversion steps and also the
  259. state of each conversion since the handle is all that is passed to the
  260. <code>iconv</code> function. Therefore, the data structures are really the
  261. elements necessary to understanding the implementation.
  262. </p>
  263. <p>We need two different kinds of data structures. The first describes the
  264. conversion and the second describes the state etc. There are really two
  265. type definitions like this in <samp>gconv.h</samp>.
  266. <a name="index-gconv_002eh"></a>
  267. </p>
  268. <dl>
  269. <dt><a name="index-struct-_005f_005fgconv_005fstep"></a>Data type: <strong>struct __gconv_step</strong></dt>
  270. <dd><p>This data structure describes one conversion a module can perform. For
  271. each function in a loaded module with conversion functions there is
  272. exactly one object of this type. This object is shared by all users of
  273. the conversion (i.e., this object does not contain any information
  274. corresponding to an actual conversion; it only describes the conversion
  275. itself).
  276. </p>
  277. <dl compact="compact">
  278. <dt><code>struct __gconv_loaded_object *__shlib_handle</code></dt>
  279. <dt><code>const char *__modname</code></dt>
  280. <dt><code>int __counter</code></dt>
  281. <dd><p>All these elements of the structure are used internally in the C library
to coordinate loading and unloading the shared object. One must not expect any
  283. of the other elements to be available or initialized.
  284. </p>
  285. </dd>
  286. <dt><code>const char *__from_name</code></dt>
  287. <dt><code>const char *__to_name</code></dt>
  288. <dd><p><code>__from_name</code> and <code>__to_name</code> contain the names of the source and
  289. destination character sets. They can be used to identify the actual
  290. conversion to be carried out since one module might implement conversions
  291. for more than one character set and/or direction.
  292. </p>
  293. </dd>
<dt><code>__gconv_fct __fct</code></dt>
<dt><code>__gconv_init_fct __init_fct</code></dt>
<dt><code>__gconv_end_fct __end_fct</code></dt>
  297. <dd><p>These elements contain pointers to the functions in the loadable module.
  298. The interface will be explained below.
  299. </p>
  300. </dd>
  301. <dt><code>int __min_needed_from</code></dt>
  302. <dt><code>int __max_needed_from</code></dt>
  303. <dt><code>int __min_needed_to</code></dt>
<dt><code>int __max_needed_to</code></dt>
  305. <dd><p>These values have to be supplied in the init function of the module. The
  306. <code>__min_needed_from</code> value specifies how many bytes a character of
  307. the source character set at least needs. The <code>__max_needed_from</code>
  308. specifies the maximum value that also includes possible shift sequences.
  309. </p>
  310. <p>The <code>__min_needed_to</code> and <code>__max_needed_to</code> values serve the
  311. same purpose as <code>__min_needed_from</code> and <code>__max_needed_from</code> but
  312. this time for the destination character set.
  313. </p>
  314. <p>It is crucial that these values be accurate since otherwise the
  315. conversion functions will have problems or not work at all.
  316. </p>
  317. </dd>
  318. <dt><code>int __stateful</code></dt>
  319. <dd><p>This element must also be initialized by the init function.
<code>__stateful</code> is nonzero if the source character set is stateful.
  321. Otherwise it is zero.
  322. </p>
  323. </dd>
  324. <dt><code>void *__data</code></dt>
<dd><p>This element can be used freely by the conversion functions in the
module. <code>__data</code> can be used to communicate extra information
from one call to another. It need not be initialized if it is
not needed at all. If the <code>__data</code> element is assigned a pointer
to dynamically allocated memory (presumably in the init function), the
end function must deallocate that memory. Otherwise
the application will leak memory.
  332. </p>
<p>It is important to be aware that this data structure is shared by all
users of this specific conversion and therefore the <code>__data</code>
element must not contain data specific to one particular use of the
conversion function.
  337. </p></dd>
  338. </dl>
  339. </dd></dl>
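<p>The ownership rule for <code>__data</code> can be sketched as a matching pair
of init and end functions. To keep the example independent of the real
header, it uses a simplified stand-in, <code>struct demo_step</code>, with
un-prefixed member names. The stand-in, <code>struct demo_settings</code>,
<code>demo_init</code>, <code>demo_end</code>, and the 0/-1 return values are all
assumptions of this sketch, not definitions from <samp>gconv.h</samp>.
</p>

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for struct __gconv_step.  */
struct demo_step
{
  const char *from_name;
  const char *to_name;
  int min_needed_from;
  int max_needed_from;
  int min_needed_to;
  int max_needed_to;
  int stateful;
  void *data;
};

/* Read-only settings shared by every user of the conversion.  */
struct demo_settings
{
  int to_internal;  /* Nonzero: one-byte set to four-byte INTERNAL.  */
};

/* Allocate the shared settings and fill in the byte counts.
   Returns 0 on success, -1 if memory is exhausted.  */
int
demo_init (struct demo_step *step, int to_internal)
{
  assert (step != NULL);

  struct demo_settings *settings = malloc (sizeof *settings);
  if (settings == NULL)
    return -1;

  settings->to_internal = to_internal;
  step->data = settings;

  /* One side is a fixed one-byte set, the other uses four bytes.  */
  step->min_needed_from = step->max_needed_from = to_internal ? 1 : 4;
  step->min_needed_to = step->max_needed_to = to_internal ? 4 : 1;
  step->stateful = 0;
  return 0;
}

/* Release exactly what demo_init allocated, so nothing leaks.  */
void
demo_end (struct demo_step *step)
{
  assert (step != NULL);
  free (step->data);
  step->data = NULL;
}
```

<p>Because the object is shared, a conversion function would treat the
settings as read-only and keep everything that varies per use elsewhere.
</p>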
  340. <dl>
  341. <dt><a name="index-struct-_005f_005fgconv_005fstep_005fdata"></a>Data type: <strong>struct __gconv_step_data</strong></dt>
  342. <dd><p>This is the data structure that contains the information specific to
  343. each use of the conversion functions.
  344. </p>
  345. <dl compact="compact">
  346. <dt><code>char *__outbuf</code></dt>
  347. <dt><code>char *__outbufend</code></dt>
  348. <dd><p>These elements specify the output buffer for the conversion step. The
  349. <code>__outbuf</code> element points to the beginning of the buffer, and
  350. <code>__outbufend</code> points to the byte following the last byte in the
  351. buffer. The conversion function must not assume anything about the size
of the buffer but it can be safely assumed that there is room for at
  353. least one complete character in the output buffer.
  354. </p>
  355. <p>Once the conversion is finished, if the conversion is the last step, the
  356. <code>__outbuf</code> element must be modified to point after the last byte
  357. written into the buffer to signal how much output is available. If this
  358. conversion step is not the last one, the element must not be modified.
  359. The <code>__outbufend</code> element must not be modified.
  360. </p>
  361. </dd>
  362. <dt><code>int __is_last</code></dt>
  363. <dd><p>This element is nonzero if this conversion step is the last one. This
  364. information is necessary for the recursion. See the description of the
  365. conversion function internals below. This element must never be
  366. modified.
  367. </p>
  368. </dd>
  369. <dt><code>int __invocation_counter</code></dt>
  370. <dd><p>The conversion function can use this element to see how many calls of
  371. the conversion function already happened. Some character sets require a
  372. certain prolog when generating output, and by comparing this value with
  373. zero, one can find out whether it is the first call and whether,
  374. therefore, the prolog should be emitted. This element must never be
  375. modified.
  376. </p>
  377. </dd>
  378. <dt><code>int __internal_use</code></dt>
  379. <dd><p>This element is another one rarely used but needed in certain
  380. situations. It is assigned a nonzero value in case the conversion
functions are used to implement <code>mbsrtowcs</code> et al. (i.e., the
  382. function is not used directly through the <code>iconv</code> interface).
  383. </p>
  384. <p>This sometimes makes a difference as it is expected that the
  385. <code>iconv</code> functions are used to translate entire texts while the
  386. <code>mbsrtowcs</code> functions are normally used only to convert single
  387. strings and might be used multiple times to convert entire texts.
  388. </p>
<p>But in this situation we would have a problem complying with some rules of
  390. the character set specification. Some character sets require a prolog,
  391. which must appear exactly once for an entire text. If a number of
  392. <code>mbsrtowcs</code> calls are used to convert the text, only the first call
  393. must add the prolog. However, because there is no communication between the
  394. different calls of <code>mbsrtowcs</code>, the conversion functions have no
  395. possibility to find this out. The situation is different for sequences
  396. of <code>iconv</code> calls since the handle allows access to the needed
  397. information.
  398. </p>
<p>The <code>__internal_use</code> element is mostly used together with
  400. <code>__invocation_counter</code> as follows:
  401. </p>
  402. <div class="smallexample">
<pre class="smallexample">if (!data-&gt;__internal_use
    &amp;&amp; data-&gt;__invocation_counter == 0)
  /* <span class="roman">Emit prolog.</span> */
  &hellip;
</pre></div>
  408. <p>This element must never be modified.
  409. </p>
  410. </dd>
  411. <dt><code>mbstate_t *__statep</code></dt>
  412. <dd><p>The <code>__statep</code> element points to an object of type <code>mbstate_t</code>
  413. (see <a href="Keeping-the-state.html#Keeping-the-state">Keeping the state</a>). The conversion of a stateful character
  414. set must use the object pointed to by <code>__statep</code> to store
  415. information about the conversion state. The <code>__statep</code> element
  416. itself must never be modified.
  417. </p>
  418. </dd>
  419. <dt><code>mbstate_t __state</code></dt>
  420. <dd><p>This element must <em>never</em> be used directly. It is only part of
  421. this structure to have the needed space allocated.
  422. </p></dd>
  423. </dl>
  424. </dd></dl>
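<p>The rules for the per-use data can be summarized in code. The sketch below
again uses a simplified stand-in, <code>struct demo_step_data</code>, rather
than the real structure. The helper names <code>needs_prolog</code> and
<code>finish_step</code> are invented, and the sketch covers only the prolog
test and the handling of the output pointer described above.
</p>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct __gconv_step_data.  */
struct demo_step_data
{
  char *outbuf;            /* Start of the output buffer.  */
  char *outbufend;         /* One past the last byte of the buffer.  */
  int is_last;             /* Nonzero for the last conversion step.  */
  int invocation_counter;  /* Number of earlier calls.  */
  int internal_use;        /* Nonzero when called for mbsrtowcs et al.  */
};

/* A prolog is due only on the first call made through iconv.  */
int
needs_prolog (const struct demo_step_data *data)
{
  return !data->internal_use && data->invocation_counter == 0;
}

/* Report that output was written up to WRITTEN_END.  Only the last
   step advances outbuf; outbufend is never touched.  Returns the
   number of bytes produced by this call.  */
size_t
finish_step (struct demo_step_data *data, char *written_end)
{
  assert (written_end >= data->outbuf && written_end <= data->outbufend);

  size_t produced = (size_t) (written_end - data->outbuf);
  if (data->is_last)
    data->outbuf = written_end;
  return produced;
}
```

<p>Keeping these checks in small helpers makes it harder to modify
<code>__outbuf</code> by accident in a step that is not the last one.
</p>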
  425. <a name="iconv-module-interfaces"></a>
  426. <h4 class="subsubsection">6.5.4.4 <code>iconv</code> module interfaces</h4>
  427. <p>With the knowledge about the data structures we now can describe the
  428. conversion function itself. To understand the interface a bit of
  429. knowledge is necessary about the functionality in the C library that
  430. loads the objects with the conversions.
  431. </p>
  432. <p>It is often the case that one conversion is used more than once (i.e.,
  433. there are several <code>iconv_open</code> calls for the same set of character
sets during one program run). The <code>mbsrtowcs</code> et al. functions in
  435. the GNU C Library also use the <code>iconv</code> functionality, which
  436. increases the number of uses of the same functions even more.
  437. </p>
  438. <p>Because of this multiple use of conversions, the modules do not get
  439. loaded exclusively for one conversion. Instead a module once loaded can
  440. be used by an arbitrary number of <code>iconv</code> or <code>mbsrtowcs</code> calls
at the same time. The splitting of the information between
conversion-function-specific information and conversion data makes this possible.
  443. The last section showed the two data structures used to do this.
  444. </p>
  445. <p>This is of course also reflected in the interface and semantics of the
  446. functions that the modules must provide. There are three functions that
  447. must have the following names:
  448. </p>
  449. <dl compact="compact">
  450. <dt><code>gconv_init</code></dt>
  451. <dd><p>The <code>gconv_init</code> function initializes the conversion function
  452. specific data structure. This very same object is shared by all
  453. conversions that use this conversion and, therefore, no state information
  454. about the conversion itself must be stored in here. If a module
  455. implements more than one conversion, the <code>gconv_init</code> function will
  456. be called multiple times.
  457. </p>
  458. </dd>
  459. <dt><code>gconv_end</code></dt>
  460. <dd><p>The <code>gconv_end</code> function is responsible for freeing all resources
  461. allocated by the <code>gconv_init</code> function. If there is nothing to do,
  462. this function can be missing. Special care must be taken if the module
  463. implements more than one conversion and the <code>gconv_init</code> function
  464. does not allocate the same resources for all conversions.
  465. </p>
  466. </dd>
  467. <dt><code>gconv</code></dt>
  468. <dd><p>This is the actual conversion function. It is called to convert one
  469. block of text. It gets passed the conversion step information
  470. initialized by <code>gconv_init</code> and the conversion data, specific to
  471. this use of the conversion functions.
  472. </p></dd>
  473. </dl>
  474. <p>There are three data types defined for the three module interface
  475. functions and these define the interface.
  476. </p>
  477. <dl>
  478. <dt><a name="index-_0028_002a_005f_005fgconv_005finit_005ffct_0029"></a>Data type: <em>int</em> <strong>(*__gconv_init_fct)</strong> <em>(struct __gconv_step *)</em></dt>
  479. <dd><p>This specifies the interface of the initialization function of the
  480. module. It is called exactly once for each conversion the module
  481. implements.
  482. </p>
  483. <p>As explained in the description of the <code>struct __gconv_step</code> data
  484. structure above the initialization function has to initialize parts of
  485. it.
  486. </p>
  487. <dl compact="compact">
  488. <dt><code>__min_needed_from</code></dt>
  489. <dt><code>__max_needed_from</code></dt>
  490. <dt><code>__min_needed_to</code></dt>
  491. <dt><code>__max_needed_to</code></dt>
  492. <dd><p>These elements must be initialized to the exact numbers of the minimum
  493. and maximum number of bytes used by one character in the source and
  494. destination character sets, respectively. If the characters all have the
  495. same size, the minimum and maximum values are the same.
  496. </p>
  497. </dd>
  498. <dt><code>__stateful</code></dt>
  499. <dd><p>This element must be initialized to a nonzero value if the source
  500. character set is stateful. Otherwise it must be zero.
  501. </p></dd>
  502. </dl>
  503. <p>If the initialization function needs to communicate some information
  504. to the conversion function, this communication can happen using the
  505. <code>__data</code> element of the <code>__gconv_step</code> structure. But since
  506. this data is shared by all the conversions, it must not be modified by
  507. the conversion function. The example below shows how this can be used.
  508. </p>
  509. <div class="smallexample">
<pre class="smallexample">#define MIN_NEEDED_FROM 1
#define MAX_NEEDED_FROM 4
#define MIN_NEEDED_TO   4
#define MAX_NEEDED_TO   4

int
gconv_init (struct __gconv_step *step)
{
  /* <span class="roman">Determine which direction.</span> */
  struct iso2022jp_data *new_data;
  enum direction dir = illegal_dir;
  enum variant var = illegal_var;
  int result;

  if (__strcasecmp (step-&gt;__from_name, &quot;ISO-2022-JP//&quot;) == 0)
    {
      dir = from_iso2022jp;
      var = iso2022jp;
    }
  else if (__strcasecmp (step-&gt;__to_name, &quot;ISO-2022-JP//&quot;) == 0)
    {
      dir = to_iso2022jp;
      var = iso2022jp;
    }
  else if (__strcasecmp (step-&gt;__from_name, &quot;ISO-2022-JP-2//&quot;) == 0)
    {
      dir = from_iso2022jp;
      var = iso2022jp2;
    }
  else if (__strcasecmp (step-&gt;__to_name, &quot;ISO-2022-JP-2//&quot;) == 0)
    {
      dir = to_iso2022jp;
      var = iso2022jp2;
    }

  result = __GCONV_NOCONV;
  if (dir != illegal_dir)
    {
      new_data = (struct iso2022jp_data *)
        malloc (sizeof (struct iso2022jp_data));

      result = __GCONV_NOMEM;
      if (new_data != NULL)
        {
          new_data-&gt;dir = dir;
          new_data-&gt;var = var;
          step-&gt;__data = new_data;

          if (dir == from_iso2022jp)
            {
              step-&gt;__min_needed_from = MIN_NEEDED_FROM;
              step-&gt;__max_needed_from = MAX_NEEDED_FROM;
              step-&gt;__min_needed_to = MIN_NEEDED_TO;
              step-&gt;__max_needed_to = MAX_NEEDED_TO;
            }
          else
            {
              step-&gt;__min_needed_from = MIN_NEEDED_TO;
              step-&gt;__max_needed_from = MAX_NEEDED_TO;
              step-&gt;__min_needed_to = MIN_NEEDED_FROM;
              step-&gt;__max_needed_to = MAX_NEEDED_FROM + 2;
            }

          /* <span class="roman">Yes, this is a stateful encoding.</span> */
          step-&gt;__stateful = 1;

          result = __GCONV_OK;
        }
    }

  return result;
}
</pre></div>
  575. <p>The function first checks which conversion is wanted. The module from
  576. which this function is taken implements four different conversions;
  577. which one is selected can be determined by comparing the names. The
  578. comparison should always be done without paying attention to the case.
  579. </p>
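<p>The name comparison can also be written as a table lookup. The following is a minimal sketch, not the module's actual code: it uses the public POSIX function <code>strcasecmp</code> in place of the internal <code>__strcasecmp</code>, and the type <code>struct conversion</code>, the table <code>known_names</code>, and the function <code>select_conversion</code> are names invented for this illustration.
</p>

```c
#include <assert.h>
#include <stddef.h>
#include <strings.h>   /* strcasecmp (POSIX) */

/* Hypothetical tags, modelled on the identifiers used in the example.  */
enum direction { illegal_dir, to_iso2022jp, from_iso2022jp };
enum variant { illegal_var, iso2022jp, iso2022jp2 };

struct conversion
{
  enum direction dir;
  enum variant var;
};

/* Charset names handled by this sketch, in the trailing "//" form.  */
static const struct
{
  const char *name;
  enum variant var;
} known_names[] =
  {
    { "ISO-2022-JP//", iso2022jp },
    { "ISO-2022-JP-2//", iso2022jp2 },
  };

/* Select a conversion from the source and destination names, ignoring
   case.  Returns illegal_dir/illegal_var if neither name is known.  */
static struct conversion
select_conversion (const char *from_name, const char *to_name)
{
  struct conversion c = { illegal_dir, illegal_var };
  size_t i;

  assert (from_name != NULL && to_name != NULL);
  for (i = 0; i < sizeof known_names / sizeof known_names[0]; ++i)
    {
      if (strcasecmp (from_name, known_names[i].name) == 0)
        {
          c.dir = from_iso2022jp;
          c.var = known_names[i].var;
          break;
        }
      if (strcasecmp (to_name, known_names[i].name) == 0)
        {
          c.dir = to_iso2022jp;
          c.var = known_names[i].var;
          break;
        }
    }
  return c;
}
```

<p>A caller would treat a result of <code>illegal_dir</code> as the case in which the initialization function returns <code>__GCONV_NOCONV</code>.
</p>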
<p>Next, a data structure, which contains the necessary information about
which conversion is selected, is allocated. The data structure
<code>struct iso2022jp_data</code> is locally defined since, outside the
module, this data is not used at all. Please note that if all four
conversions this module supports are requested, there are four data
blocks.
</p>
<p>One interesting thing is the initialization of the <code>__min_</code> and
<code>__max_</code> elements of the step data object. A single ISO-2022-JP
character can consist of one to four bytes. Therefore the
<code>MIN_NEEDED_FROM</code> and <code>MAX_NEEDED_FROM</code> macros are defined
this way. The output is always the <code>INTERNAL</code> character set (aka
UCS-4) and therefore each character consists of exactly four bytes. For
the conversion from <code>INTERNAL</code> to ISO-2022-JP we have to take into
account that escape sequences might be necessary to switch the character
sets. Therefore the <code>__max_needed_to</code> element for this direction
gets assigned <code>MAX_NEEDED_FROM + 2</code>. This takes into account the
two bytes needed for the escape sequences to signal the switching. The
asymmetry in the maximum values for the two directions can be explained
easily: when reading ISO-2022-JP text, escape sequences can be handled
alone (i.e., it is not necessary to process a real character since the
effect of the escape sequence can be recorded in the state information).
The situation is different for the other direction. Since it is in
general not known which character comes next, one cannot emit escape
sequences to change the state in advance. This means the escape
sequences have to be emitted together with the next character.
Therefore one needs more room than only for the character itself.
</p>
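<p>One way to see what these limits are good for is to compute an upper bound on the output produced for a given amount of input. The sketch below is an illustration under assumptions: <code>struct step_limits</code> and <code>worst_case_output</code> are hypothetical names that only mirror the <code>__min_needed_from</code> and <code>__max_needed_to</code> elements, and the bound covers converted characters only, not any bytes emitted while flushing.
</p>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical subset of the step elements discussed above.  */
struct step_limits
{
  int min_needed_from;   /* fewest bytes one input character can occupy */
  int max_needed_to;     /* most bytes one output character can occupy */
};

/* Upper bound on the output bytes for INPUT_BYTES bytes of input: there
   are at most input_bytes / min_needed_from complete characters, and
   each one needs at most max_needed_to bytes of output.  */
static size_t
worst_case_output (const struct step_limits *lim, size_t input_bytes)
{
  assert (lim != NULL);
  assert (lim->min_needed_from > 0 && lim->max_needed_to > 0);
  return (input_bytes / (size_t) lim->min_needed_from)
         * (size_t) lim->max_needed_to;
}
```

<p>With the values from the example, 10 bytes of ISO-2022-JP input need at most 40 bytes of <code>INTERNAL</code> output, and 40 bytes of <code>INTERNAL</code> input (ten characters) need at most 60 bytes of ISO-2022-JP output.
</p>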
  608. <p>The possible return values of the initialization function are:
  609. </p>
  610. <dl compact="compact">
  611. <dt><code>__GCONV_OK</code></dt>
<dd><p>The initialization succeeded.
  613. </p></dd>
  614. <dt><code>__GCONV_NOCONV</code></dt>
  615. <dd><p>The requested conversion is not supported in the module. This can
  616. happen if the <samp>gconv-modules</samp> file has errors.
  617. </p></dd>
  618. <dt><code>__GCONV_NOMEM</code></dt>
  619. <dd><p>Memory required to store additional information could not be allocated.
  620. </p></dd>
  621. </dl>
  622. </dd></dl>
<p>The function called before the module is unloaded is significantly
simpler. It often has nothing at all to do, in which case it can be left
out completely.
</p>
  627. <dl>
<dt><a name="index-_0028_002a_005f_005fgconv_005fend_005ffct_0029"></a>Data type: <em>void</em> <strong>(*__gconv_end_fct)</strong> <em>(struct __gconv_step *)</em></dt>
  629. <dd><p>The task of this function is to free all resources allocated in the
  630. initialization function. Therefore only the <code>__data</code> element of
  631. the object pointed to by the argument is of interest. Continuing the
  632. example from the initialization function, the finalization function
  633. looks like this:
  634. </p>
  635. <div class="smallexample">
  636. <pre class="smallexample">void
  637. gconv_end (struct __gconv_step *data)
  638. {
  639. free (data-&gt;__data);
  640. }
  641. </pre></div>
  642. </dd></dl>
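<p>The pairing of the initialization and finalization functions can be tried out in isolation. The following sketch uses simplified stand-ins, all of them hypothetical: <code>struct demo_step</code> models only the <code>__data</code> element, and <code>demo_init</code> and <code>demo_end</code> are not part of the gconv interface. Clearing the pointer after freeing it is an extra precaution of this sketch, not something the interface requires.
</p>

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the one element of the step used here.  */
struct demo_step
{
  void *data;
};

/* Hypothetical private data of the module.  */
struct demo_private
{
  int direction;
};

/* Allocate the private data.  Returns 0 on success and -1 if memory
   could not be allocated.  */
static int
demo_init (struct demo_step *step, int direction)
{
  struct demo_private *p;

  assert (step != NULL);
  p = malloc (sizeof *p);
  if (p == NULL)
    return -1;
  p->direction = direction;
  step->data = p;
  return 0;
}

/* Release everything demo_init allocated.  */
static void
demo_end (struct demo_step *step)
{
  assert (step != NULL);
  free (step->data);
  step->data = NULL;   /* a second call then only frees NULL */
}
```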
  643. <p>The most important function is the conversion function itself, which can
  644. get quite complicated for complex character sets. But since this is not
  645. of interest here, we will only describe a possible skeleton for the
  646. conversion function.
  647. </p>
  648. <dl>
  649. <dt><a name="index-_0028_002a_005f_005fgconv_005ffct_0029"></a>Data type: <em>int</em> <strong>(*__gconv_fct)</strong> <em>(struct __gconv_step *, struct __gconv_step_data *, const char **, const char *, size_t *, int)</em></dt>
<dd><p>The conversion function can be called for two basic reasons: to convert
text or to reset the state. From the description of the <code>iconv</code>
function it can be seen why the flushing mode is necessary. Which mode
is selected is determined by the sixth argument, an integer. This
argument being nonzero means that flushing is selected.
</p>
  656. <p>Common to both modes is where the output buffer can be found. The
  657. information about this buffer is stored in the conversion step data. A
  658. pointer to this information is passed as the second argument to this
  659. function. The description of the <code>struct __gconv_step_data</code>
  660. structure has more information on the conversion step data.
  661. </p>
  662. <a name="index-stateful-5"></a>
<p>What has to be done for flushing depends on the source character set.
If the source character set is not stateful, nothing has to be done.
Otherwise the function has to emit a byte sequence to bring the state
object into the initial state. Once this has happened, the other
conversion modules in the chain of conversions have to get the same
chance. Whether another step follows can be determined from the
<code>__is_last</code> element of the step data structure to which the second
parameter points.
</p>
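<p>As an illustration of the flushing step, consider a toy stateful encoding in which a single control byte returns to the initial character set. This is a sketch under assumptions, not ISO-2022-JP and not the gconv interface: <code>enum shift_state</code>, <code>SHIFT_IN</code>, and <code>flush_state</code> are invented for this example.
</p>

```c
#include <assert.h>
#include <stddef.h>

/* Toy encoding: one control byte switches back to the initial set.  */
enum { SHIFT_IN = 0x0F };
enum shift_state { STATE_INITIAL, STATE_SHIFTED };

/* Emit whatever is needed to bring *STATE back to the initial state.
   Returns the number of bytes written, or -1 if OUT has no room; in
   that case the state is left unchanged.  */
static int
flush_state (enum shift_state *state, unsigned char *out, size_t outlen)
{
  assert (state != NULL);
  if (*state == STATE_INITIAL)
    return 0;                    /* nothing to do */
  if (out == NULL || outlen < 1)
    return -1;
  out[0] = SHIFT_IN;
  *state = STATE_INITIAL;
  return 1;
}
```

<p>A real conversion function would then call the next step in flushing mode as well, provided the reset sequence was emitted successfully.
</p>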
  672. <p>The more interesting mode is when actual text has to be converted. The
  673. first step in this case is to convert as much text as possible from the
  674. input buffer and store the result in the output buffer. The start of the
  675. input buffer is determined by the third argument, which is a pointer to a
  676. pointer variable referencing the beginning of the buffer. The fourth
  677. argument is a pointer to the byte right after the last byte in the buffer.
  678. </p>
  679. <p>The conversion has to be performed according to the current state if the
  680. character set is stateful. The state is stored in an object pointed to
  681. by the <code>__statep</code> element of the step data (second argument). Once
  682. either the input buffer is empty or the output buffer is full the
  683. conversion stops. At this point, the pointer variable referenced by the
  684. third parameter must point to the byte following the last processed
  685. byte (i.e., if all of the input is consumed, this pointer and the fourth
  686. parameter have the same value).
  687. </p>
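<p>The pointer handling described above can be sketched with a stateless, fixed-width conversion. The example converts ISO-8859-1 bytes to four-byte code points in host byte order. It is a simplified illustration: <code>latin1_to_ucs4</code> and <code>enum demo_status</code> are hypothetical, and the output buffer is passed directly instead of through the step data.
</p>

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum demo_status { DEMO_EMPTY_INPUT, DEMO_FULL_OUTPUT };

/* Convert ISO-8859-1 bytes to 4-byte code points in host byte order.
   On return *INBUF points after the last consumed byte and *OUTBUF
   after the last written byte.  */
static enum demo_status
latin1_to_ucs4 (const unsigned char **inbuf, const unsigned char *inend,
                unsigned char **outbuf, unsigned char *outend)
{
  const unsigned char *in = *inbuf;
  unsigned char *out = *outbuf;
  enum demo_status status = DEMO_EMPTY_INPUT;

  assert (in <= inend && out <= outend);
  while (in < inend)
    {
      uint32_t ch;

      if ((size_t) (outend - out) < sizeof ch)
        {
          /* No room for another character: stop without consuming it.  */
          status = DEMO_FULL_OUTPUT;
          break;
        }
      ch = *in++;                     /* Latin-1 maps to U+0000..U+00FF */
      memcpy (out, &ch, sizeof ch);   /* avoids unaligned stores */
      out += sizeof ch;
    }

  *inbuf = in;
  *outbuf = out;
  return status;
}
```

<p>If all input is consumed, <code>*inbuf</code> equals <code>inend</code> on return, which matches the requirement stated above.
</p>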
  688. <p>What now happens depends on whether this step is the last one. If it is
  689. the last step, the only thing that has to be done is to update the
  690. <code>__outbuf</code> element of the step data structure to point after the
  691. last written byte. This update gives the caller the information on how
  692. much text is available in the output buffer. In addition, the variable
  693. pointed to by the fifth parameter, which is of type <code>size_t</code>, must
  694. be incremented by the number of characters (<em>not bytes</em>) that were
  695. converted in a non-reversible way. Then, the function can return.
  696. </p>
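<p>Counting non-reversible conversions might look like the following sketch. It assumes a hypothetical conversion to ASCII that substitutes <code>?</code> for every character it cannot represent; the function <code>ucs4_to_ascii</code> and this substitution policy are illustrations only. Note that the counter is incremented, not overwritten, and that it counts characters.
</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convert code points to ASCII, writing '?' for anything above 0x7F.
   Each substitution loses information, so it is added to
   *IRREVERSIBLE, counted per character rather than per byte.
   Returns the number of characters written.  */
static size_t
ucs4_to_ascii (const uint32_t *in, size_t nchars, char *out, size_t outlen,
               size_t *irreversible)
{
  size_t i;

  assert (irreversible != NULL);
  for (i = 0; i < nchars && i < outlen; ++i)
    {
      if (in[i] <= 0x7F)
        out[i] = (char) in[i];
      else
        {
          out[i] = '?';
          ++*irreversible;
        }
    }
  return i;
}
```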
  697. <p>In case the step is not the last one, the later conversion functions have
  698. to get a chance to do their work. Therefore, the appropriate conversion
  699. function has to be called. The information about the functions is
  700. stored in the conversion data structures, passed as the first parameter.
  701. This information and the step data are stored in arrays, so the next
  702. element in both cases can be found by simple pointer arithmetic:
  703. </p>
  704. <div class="smallexample">
<pre class="smallexample">int
gconv (struct __gconv_step *step, struct __gconv_step_data *data,
       const char **inbuf, const char *inbufend, size_t *written,
       int do_flush)
{
  struct __gconv_step *next_step = step + 1;
  struct __gconv_step_data *next_data = data + 1;
  &hellip;
</pre></div>
  714. <p>The <code>next_step</code> pointer references the next step information and
  715. <code>next_data</code> the next data record. The call of the next function
  716. therefore will look similar to this:
  717. </p>
  718. <div class="smallexample">
  719. <pre class="smallexample"> next_step-&gt;__fct (next_step, next_data, &amp;outerr, outbuf,
  720. written, 0)
  721. </pre></div>
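<p>The way one step hands its output to the next can be tried out with a reduced model. Everything in the following sketch is hypothetical: <code>struct demo_step</code> and <code>struct demo_data</code> keep only the elements needed here, the function takes four arguments instead of six, and each step merely copies bytes. What it does show is the pointer arithmetic on the two parallel arrays and the fact that the output of one step becomes the input of the next.
</p>

```c
#include <assert.h>
#include <stddef.h>

struct demo_step;
struct demo_data;
typedef int (*demo_fct) (struct demo_step *, struct demo_data *,
                         const char **, const char *);

/* Reduced, hypothetical versions of the step and step data records.  */
struct demo_step { demo_fct fct; };
struct demo_data { char *outbuf; char *outbufend; int is_last; };

/* Copy the input to this step's output buffer, then pass the result
   on.  Returns the number of bytes that reached the last step.  */
static int
copy_step (struct demo_step *step, struct demo_data *data,
           const char **inbuf, const char *inbufend)
{
  char *outstart;
  char *out;

  assert (step != NULL && data != NULL);
  outstart = data->outbuf;
  out = outstart;
  while (*inbuf < inbufend && out < data->outbufend)
    *out++ = *(*inbuf)++;

  if (data->is_last)
    {
      data->outbuf = out;   /* tell the caller how much was written */
      return (int) (out - outstart);
    }
  else
    {
      /* Steps and step data are stored in parallel arrays.  */
      struct demo_step *next_step = step + 1;
      struct demo_data *next_data = data + 1;
      const char *outerr = outstart;

      return next_step->fct (next_step, next_data, &outerr, out);
    }
}
```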
<p>But this is not yet all. Once the function call returns, the conversion
function might have some more to do. If the return value of the function
is <code>__GCONV_EMPTY_INPUT</code>, more room is available in the output
buffer. Unless the input buffer is empty, the conversion functions start
all over again and process the rest of the input buffer. If the return
value is not <code>__GCONV_EMPTY_INPUT</code>, something went wrong and we have
to recover from this.
</p>
  730. <p>A requirement for the conversion function is that the input buffer
  731. pointer (the third argument) always point to the last character that
  732. was put in converted form into the output buffer. This is trivially
  733. true after the conversion performed in the current step, but if the
  734. conversion functions deeper downstream stop prematurely, not all
  735. characters from the output buffer are consumed and, therefore, the input
  736. buffer pointers must be backed off to the right position.
  737. </p>
<p>Correcting the input buffers is easy to do if the input and output
character sets have a fixed width for all characters. In this situation
we can compute how many characters are left in the output buffer and,
therefore, can correct the input buffer pointer appropriately with a
similar computation. Things get tricky if either character set
has characters represented with variable-length byte sequences, and it
gets even more complicated if the conversion has to take care of the
state. In these cases the conversion has to be performed once again, from
the known state before the initial conversion (i.e., if necessary the
state of the conversion has to be reset and the conversion loop has to be
executed again). The difference now is that it is known how much output
must be produced, and the conversion can stop before converting the first
unused character. Once this is done the input buffer pointers must be
updated again and the function can return.
</p>
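<p>For the fixed-width case the correction is a single computation. The helper below is a hypothetical sketch (<code>input_backoff</code> is not a library function). It is valid only when every character occupies exactly <code>in_width</code> bytes in the source and <code>out_width</code> bytes in the destination.
</p>

```c
#include <assert.h>
#include <stddef.h>

/* Given that the next step left UNCONSUMED bytes of our output unread,
   return the number of input bytes that produced them.  Valid only for
   fixed-width character sets on both sides.  */
static size_t
input_backoff (size_t unconsumed, size_t in_width, size_t out_width)
{
  assert (in_width > 0 && out_width > 0);
  assert (unconsumed % out_width == 0);   /* whole characters only */
  return unconsumed / out_width * in_width;
}
```

<p>For a conversion from a one-byte character set to <code>INTERNAL</code>, eight unconsumed output bytes correspond to two characters, so the input pointer is moved back by two bytes.
</p>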
  753. <p>One final thing should be mentioned. If it is necessary for the
  754. conversion to know whether it is the first invocation (in case a prolog
  755. has to be emitted), the conversion function should increment the
  756. <code>__invocation_counter</code> element of the step data structure just
  757. before returning to the caller. See the description of the <code>struct
  758. __gconv_step_data</code> structure above for more information on how this can
  759. be used.
  760. </p>
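<p>A prolog that is emitted only on the first invocation could be handled as in the following sketch. The names are hypothetical: <code>struct demo_data</code> models only the counter, and the two-byte prolog (a big-endian UTF-16 byte order mark) is just an example. The sketch assumes the caller guarantees room for the prolog.
</p>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the relevant element of the step data.  */
struct demo_data
{
  int invocation_counter;
};

/* Write a two-byte prolog on the first call only.  Returns the number
   of bytes written to OUT.  */
static size_t
emit_prolog_once (struct demo_data *data, unsigned char *out, size_t outlen)
{
  size_t written = 0;

  assert (data != NULL && out != NULL && outlen >= 2);
  if (data->invocation_counter == 0)
    {
      out[0] = 0xFE;
      out[1] = 0xFF;
      written = 2;
    }
  ++data->invocation_counter;   /* done just before returning */
  return written;
}
```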
  761. <p>The return value must be one of the following values:
  762. </p>
  763. <dl compact="compact">
  764. <dt><code>__GCONV_EMPTY_INPUT</code></dt>
  765. <dd><p>All input was consumed and there is room left in the output buffer.
  766. </p></dd>
  767. <dt><code>__GCONV_FULL_OUTPUT</code></dt>
  768. <dd><p>No more room in the output buffer. In case this is not the last step
  769. this value is propagated down from the call of the next conversion
  770. function in the chain.
  771. </p></dd>
  772. <dt><code>__GCONV_INCOMPLETE_INPUT</code></dt>
  773. <dd><p>The input buffer is not entirely empty since it contains an incomplete
  774. character sequence.
  775. </p></dd>
  776. </dl>
  777. <p>The following example provides a framework for a conversion function.
  778. In case a new conversion has to be written the holes in this
  779. implementation have to be filled and that is it.
  780. </p>
  781. <div class="smallexample">
<pre class="smallexample">int
gconv (struct __gconv_step *step, struct __gconv_step_data *data,
       const char **inbuf, const char *inbufend, size_t *written,
       int do_flush)
{
  struct __gconv_step *next_step = step + 1;
  struct __gconv_step_data *next_data = data + 1;
  __gconv_fct fct = next_step-&gt;__fct;
  int status;

  /* <span class="roman">If the function is called with no input this means we have</span>
     <span class="roman">to reset to the initial state. The possibly partly</span>
     <span class="roman">converted input is dropped.</span> */
  if (do_flush)
    {
      status = __GCONV_OK;

      /* <span class="roman">Possibly emit a byte sequence which puts the state object</span>
         <span class="roman">into the initial state.</span> */

      /* <span class="roman">Call the steps down the chain if there are any but only</span>
         <span class="roman">if we successfully emitted the escape sequence.</span> */
      if (status == __GCONV_OK &amp;&amp; ! data-&gt;__is_last)
        status = fct (next_step, next_data, NULL, NULL,
                      written, 1);
    }
  else
    {
      /* <span class="roman">We preserve the initial values of the pointer variables.</span> */
      const char *inptr = *inbuf;
      char *outbuf = data-&gt;__outbuf;
      char *outend = data-&gt;__outbufend;
      char *outptr;

      do
        {
          /* <span class="roman">Remember the start value for this round.</span> */
          inptr = *inbuf;
          /* <span class="roman">The outbuf buffer is empty.</span> */
          outptr = outbuf;

          /* <span class="roman">For stateful encodings the state must be saved here.</span> */

          /* <span class="roman">Run the conversion loop. <code>status</code> is set</span>
             <span class="roman">appropriately afterwards.</span> */

          /* <span class="roman">If this is the last step, leave the loop. There is</span>
             <span class="roman">nothing we can do.</span> */
          if (data-&gt;__is_last)
            {
              /* <span class="roman">Store information about how many bytes are</span>
                 <span class="roman">available.</span> */
              data-&gt;__outbuf = outbuf;

              /* <span class="roman">If any non-reversible conversions were performed,</span>
                 <span class="roman">add the number to <code>*written</code>.</span> */

              break;
            }

          /* <span class="roman">Write out all output that was produced.</span> */
          if (outbuf &gt; outptr)
            {
              const char *outerr = data-&gt;__outbuf;
              int result;

              result = fct (next_step, next_data, &amp;outerr,
                            outbuf, written, 0);

              if (result != __GCONV_EMPTY_INPUT)
                {
                  if (outerr != outbuf)
                    {
                      /* <span class="roman">Reset the input buffer pointer. We</span>
                         <span class="roman">document here the complex case.</span> */
                      size_t nstatus;

                      /* <span class="roman">Reload the pointers.</span> */
                      *inbuf = inptr;
                      outbuf = outptr;

                      /* <span class="roman">Possibly reset the state.</span> */

                      /* <span class="roman">Redo the conversion, but this time</span>
                         <span class="roman">the end of the output buffer is at</span>
                         <span class="roman"><code>outerr</code>.</span> */
                    }

                  /* <span class="roman">Change the status.</span> */
                  status = result;
                }
              else
                /* <span class="roman">All the output is consumed, we can make</span>
                   <span class="roman">another run if everything was ok.</span> */
                if (status == __GCONV_FULL_OUTPUT)
                  status = __GCONV_OK;
            }
        }
      while (status == __GCONV_OK);

      /* <span class="roman">We finished one use of this step.</span> */
      ++data-&gt;__invocation_counter;
    }

  return status;
}
</pre></div>
  871. </dd></dl>
<p>This information should be sufficient to write new modules. Anybody
doing so should also take a look at the GNU C Library sources, which
contain many examples of working and optimized modules.
</p>
  877. <hr>
  878. <div class="header">
  879. <p>
  880. Previous: <a href="Other-iconv-Implementations.html#Other-iconv-Implementations" accesskey="p" rel="prev">Other iconv Implementations</a>, Up: <a href="Generic-Charset-Conversion.html#Generic-Charset-Conversion" accesskey="u" rel="up">Generic Charset Conversion</a> &nbsp; [<a href="index.html#SEC_Contents" title="Table of contents" rel="contents">Contents</a>][<a href="Concept-Index.html#Concept-Index" title="Index" rel="index">Index</a>]</p>
  881. </div>
  882. </body>
  883. </html>