/glibc/libc/Representation-of-Strings.html
- <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
- <html>
- <!-- This file documents the GNU C Library.
- This is
- The GNU C Library Reference Manual, for version
- 2.23.
- Copyright (C) 1993-2016 Free Software Foundation, Inc.
- Permission is granted to copy, distribute and/or modify this document
- under the terms of the GNU Free Documentation License, Version
- 1.3 or any later version published by the Free
- Software Foundation; with the Invariant Sections being "Free Software
- Needs Free Documentation" and "GNU Lesser General Public License",
- the Front-Cover texts being "A GNU Manual", and with the Back-Cover
- Texts as in (a) below. A copy of the license is included in the
- section entitled "GNU Free Documentation License".
- (a) The FSF's Back-Cover Text is: "You have the freedom to
- copy and modify this GNU manual. Buying copies from the FSF
- supports it in developing GNU and promoting software freedom." -->
- <!-- Created by GNU Texinfo 6.0, http://www.gnu.org/software/texinfo/ -->
- <head>
- <title>The GNU C Library: Representation of Strings</title>
- <meta name="description" content="The GNU C Library: Representation of Strings">
- <meta name="keywords" content="The GNU C Library: Representation of Strings">
- <meta name="resource-type" content="document">
- <meta name="distribution" content="global">
- <meta name="Generator" content="makeinfo">
- <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
- <link href="index.html#Top" rel="start" title="Top">
- <link href="Concept-Index.html#Concept-Index" rel="index" title="Concept Index">
- <link href="index.html#SEC_Contents" rel="contents" title="Table of Contents">
- <link href="String-and-Array-Utilities.html#String-and-Array-Utilities" rel="up" title="String and Array Utilities">
- <link href="String_002fArray-Conventions.html#String_002fArray-Conventions" rel="next" title="String/Array Conventions">
- <link href="String-and-Array-Utilities.html#String-and-Array-Utilities" rel="prev" title="String and Array Utilities">
- <style type="text/css">
- <!--
- a.summary-letter {text-decoration: none}
- blockquote.indentedblock {margin-right: 0em}
- blockquote.smallindentedblock {margin-right: 0em; font-size: smaller}
- blockquote.smallquotation {font-size: smaller}
- div.display {margin-left: 3.2em}
- div.example {margin-left: 3.2em}
- div.lisp {margin-left: 3.2em}
- div.smalldisplay {margin-left: 3.2em}
- div.smallexample {margin-left: 3.2em}
- div.smalllisp {margin-left: 3.2em}
- kbd {font-style: oblique}
- pre.display {font-family: inherit}
- pre.format {font-family: inherit}
- pre.menu-comment {font-family: serif}
- pre.menu-preformatted {font-family: serif}
- pre.smalldisplay {font-family: inherit; font-size: smaller}
- pre.smallexample {font-size: smaller}
- pre.smallformat {font-family: inherit; font-size: smaller}
- pre.smalllisp {font-size: smaller}
- span.nocodebreak {white-space: nowrap}
- span.nolinebreak {white-space: nowrap}
- span.roman {font-family: serif; font-weight: normal}
- span.sansserif {font-family: sans-serif; font-weight: normal}
- ul.no-bullet {list-style: none}
- -->
- </style>
- </head>
- <body lang="en">
- <a name="Representation-of-Strings"></a>
- <div class="header">
- <p>
- Next: <a href="String_002fArray-Conventions.html#String_002fArray-Conventions" accesskey="n" rel="next">String/Array Conventions</a>, Up: <a href="String-and-Array-Utilities.html#String-and-Array-Utilities" accesskey="u" rel="up">String and Array Utilities</a> [<a href="index.html#SEC_Contents" title="Table of contents" rel="contents">Contents</a>][<a href="Concept-Index.html#Concept-Index" title="Index" rel="index">Index</a>]</p>
- </div>
- <hr>
- <a name="Representation-of-Strings-1"></a>
- <h3 class="section">5.1 Representation of Strings</h3>
- <a name="index-string_002c-representation-of"></a>
- <p>This section is a quick summary of string concepts for beginning C
- programmers. It describes how strings are represented in C
- and some common pitfalls. If you are already familiar with this
- material, you can skip this section.
- </p>
- <a name="index-string"></a>
- <p>A <em>string</em> is a null-terminated array of bytes of type <code>char</code>,
- including the terminating null byte. String-valued
- variables are usually declared to be pointers of type <code>char *</code>.
- Such variables do not include space for the text of a string; that has
- to be stored somewhere else—in an array variable, a string constant,
- or dynamically allocated memory (see <a href="Memory-Allocation.html#Memory-Allocation">Memory Allocation</a>). It’s up to
- you to store the address of the chosen memory space into the pointer
- variable. Alternatively, you can store a <em>null pointer</em> in the
- pointer variable. A null pointer does not point to any object, so
- attempting to access a string through it is an error: the behavior is
- undefined, and the program typically crashes.
- </p>
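- <p>The three storage choices described above can be sketched as follows.
- This is a minimal illustration, not library code: the names
- <code>greeting_array</code>, <code>text_in_array</code>,
- <code>text_in_constant</code>, <code>text_in_heap</code> and
- <code>length_or_zero</code> are hypothetical and chosen only for this example.
- </p>

```c
#include <stdlib.h>
#include <string.h>

/* An array variable: the array itself holds the bytes of the string.  */
static char greeting_array[] = "hello";

/* Return a pointer to text stored in an array variable.  */
const char *
text_in_array (void)
{
  return greeting_array;
}

/* Return a pointer to text stored in a string constant.  */
const char *
text_in_constant (void)
{
  return "hello";
}

/* Return a pointer to dynamically allocated memory holding a copy of
   SRC, or a null pointer if allocation fails.  The caller must free
   the result.  */
char *
text_in_heap (const char *src)
{
  size_t size = strlen (src) + 1;   /* Text plus terminating null byte.  */
  char *copy = malloc (size);
  if (copy != NULL)
    memcpy (copy, src, size);
  return copy;
}

/* Check for a null pointer before touching the string it might name.  */
size_t
length_or_zero (const char *p)
{
  return p != NULL ? strlen (p) : 0;
}
```

- <p>In each case the pointer only records an address; the bytes live in the
- array, the constant, or the allocated block.
- </p>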
- <a name="index-multibyte-character"></a>
- <a name="index-multibyte-string"></a>
- <a name="index-wide-string"></a>
- <p>A <em>multibyte character</em> is a sequence of one or more bytes that
- represents a single character using the locale’s encoding scheme; a
- null byte always represents the null character. A <em>multibyte
- string</em> is a string that consists entirely of multibyte
- characters. In contrast, a <em>wide string</em> is a null-terminated
- sequence of <code>wchar_t</code> objects. A wide-string variable is usually
- declared to be a pointer of type <code>wchar_t *</code>, by analogy with
- string variables and <code>char *</code>. See <a href="Extended-Char-Intro.html#Extended-Char-Intro">Extended Char Intro</a>.
- </p>
- <a name="index-null-byte"></a>
- <a name="index-null-wide-character"></a>
- <p>By convention, the <em>null byte</em>, <code>'\0'</code>,
- marks the end of a string and the <em>null wide character</em>,
- <code>L'\0'</code>, marks the end of a wide string. For example, in
- testing to see whether the <code>char *</code> variable <var>p</var> points to a
- null byte marking the end of a string, you can write
- <code>!*<var>p</var></code> or <code>*<var>p</var> == '\0'</code>.
- </p>
- <p>A null byte is quite different conceptually from a null pointer,
- although both are represented by the integer constant <code>0</code>.
- </p>
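- <p>As a rough sketch of these conventions, the hypothetical helpers below
- (not part of the library) scan for the terminating null byte or null wide
- character, and keep the null-pointer test separate from the null-byte test.
- </p>

```c
#include <stddef.h>
#include <wchar.h>

/* Count the bytes before the terminating null byte.  */
size_t
count_bytes (const char *p)
{
  size_t n = 0;
  while (*p != '\0')            /* Same test as: while (*p)  */
    {
      n++;
      p++;
    }
  return n;
}

/* Count the wide characters before the terminating null wide character.  */
size_t
count_wide (const wchar_t *w)
{
  size_t n = 0;
  while (*w != L'\0')
    {
      n++;
      w++;
    }
  return n;
}

/* Nonzero if P is a null pointer, that is, there is no string at all.  */
int
is_null_pointer (const char *p)
{
  return p == NULL;
}

/* Nonzero if P points to an empty string, that is, to a null byte.  */
int
is_empty_string (const char *p)
{
  return p != NULL && !*p;
}
```

- <p>An empty string is still a valid string with storage behind it; a null
- pointer is not.
- </p>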
- <a name="index-string-literal"></a>
- <p>A <em>string literal</em> appears in C program source as a multibyte
- string between double-quote characters (‘<samp>"</samp>’). If the
- initial double-quote character is immediately preceded by a capital
- ‘<samp>L</samp>’ (ell) character (as in <code>L"foo"</code>), it is a wide string
- literal. String literals can also contribute to <em>string
- concatenation</em>: <code>"a" "b"</code> is the same as <code>"ab"</code>.
- For wide strings one can use either
- <code>L"a" L"b"</code> or <code>L"a" "b"</code>. Do not modify string
- literals: the GNU C compiler places them in read-only storage, so
- attempting to modify one has undefined behavior and typically crashes
- the program.
- </p>
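- <p>A small sketch of literal concatenation, assuming a C99 or later
- compiler for the mixed wide/narrow form. The function names are
- hypothetical. The last function shows one way to obtain modifiable text:
- initialize an array from the literal instead of pointing at the literal.
- </p>

```c
#include <wchar.h>

/* Adjacent string literals are joined at compile time.  */
const char *
joined (void)
{
  return "a" "b";
}

/* The same for wide string literals.  */
const wchar_t *
joined_wide (void)
{
  return L"a" L"b";
}

/* A wide literal next to an ordinary one yields a wide string.  */
const wchar_t *
joined_mixed (void)
{
  return L"a" "b";
}

/* The array BUF holds its own copy of the bytes, so writing to it is
   fine; writing through a pointer to the literal itself is not.  */
char
first_byte_after_edit (void)
{
  char buf[] = "foo";
  buf[0] = 'F';
  return buf[0];
}
```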
- <p>Arrays that are declared <code>const</code> cannot be modified
- either. It’s generally good style to declare non-modifiable string
- pointers to be of type <code>const char *</code>, since this often allows the
- C compiler to detect accidental modifications as well as providing some
- amount of documentation about what your program intends to do with the
- string.
- </p>
- <p>The amount of memory allocated for a byte array may extend past the null byte
- that marks the end of the string that the array contains. In this
- document, the term <em>allocated size</em> is always used to refer to the
- total amount of memory allocated for an array, while the term
- <em>length</em> refers to the number of bytes up to (but not including)
- the terminating null byte. Wide strings are similar, except their
- sizes and lengths count wide characters, not bytes.
- <a name="index-length-of-string"></a>
- <a name="index-allocation-size-of-string"></a>
- <a name="index-size-of-string"></a>
- <a name="index-string-length"></a>
- <a name="index-string-allocation"></a>
- </p>
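- <p>The distinction between allocated size and length can be seen in a short
- sketch. The function names and the 32-element arrays are arbitrary choices
- made for this illustration.
- </p>

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Allocated size, in bytes, of an array that holds "hello".  */
size_t
byte_allocated_size (void)
{
  char buf[32] = "hello";
  return sizeof buf;                    /* 32 */
}

/* Length of the string in the same kind of array: the bytes before
   the terminating null byte.  */
size_t
byte_length (void)
{
  char buf[32] = "hello";
  return strlen (buf);                  /* 5 */
}

/* For wide strings, sizes are counted in wide characters.  */
size_t
wide_allocated_size (void)
{
  wchar_t buf[32] = L"hello";
  return sizeof buf / sizeof buf[0];    /* 32 */
}

/* Length of the wide string: wide characters before the terminator.  */
size_t
wide_length (void)
{
  wchar_t buf[32] = L"hello";
  return wcslen (buf);                  /* 5 */
}
```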
- <p>A notorious source of program bugs is trying to put more bytes into a
- string than fit in its allocated size. When writing code that extends
- strings or moves bytes into a pre-allocated array, you should be
- very careful to keep track of the length of the text and make explicit
- checks for overflowing the array. Many of the library functions
- <em>do not</em> do this for you! Remember also that you need to allocate
- an extra byte to hold the null byte that marks the end of the
- string.
- </p>
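- <p>One way to make the overflow check explicit is sketched below.
- <code>append_checked</code> is a hypothetical helper, not a library function;
- it assumes <var>dst</var> already holds a null-terminated string inside an
- array of <var>dst_size</var> bytes.
- </p>

```c
#include <stddef.h>
#include <string.h>

/* Append SRC to the string in DST, whose array has an allocated size
   of DST_SIZE bytes.  Return 0 on success, or -1 (leaving DST
   unchanged) if the result would not fit.  */
int
append_checked (char *dst, size_t dst_size, const char *src)
{
  size_t dst_len = strlen (dst);
  size_t src_len = strlen (src);

  /* Both texts plus one terminating null byte must fit.  The first
     test also keeps the subtraction from wrapping around.  */
  if (dst_len >= dst_size || src_len >= dst_size - dst_len)
    return -1;

  /* Copy the text of SRC together with its terminating null byte.  */
  memcpy (dst + dst_len, src, src_len + 1);
  return 0;
}
```

- <p>With <code>char buf[8] = "abc";</code>, appending <code>"defg"</code>
- succeeds because 3 + 4 + 1 bytes fit exactly; appending one more byte
- afterwards is rejected.
- </p>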
- <a name="index-single_002dbyte-string"></a>
- <a name="index-multibyte-string-1"></a>
- <p>Originally strings were sequences of bytes where each byte represented a
- single character. This is still true today if the strings are encoded
- using a single-byte character encoding. Things are different if the
- strings are encoded using a multibyte encoding (for more information on
- encodings see <a href="Extended-Char-Intro.html#Extended-Char-Intro">Extended Char Intro</a>). There is no difference in
- the programming interface for these two kinds of strings; the programmer
- has to be aware of this and interpret the byte sequences accordingly.
- </p>
- <p>But since there is no separate interface taking care of these
- differences, the byte-based string functions are sometimes hard to use.
- Since the count parameters of these functions specify bytes, a call to
- <code>memcpy</code> could cut a multibyte character in the middle and put an
- incomplete (and therefore unusable) byte sequence in the target buffer.
- </p>
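- <p>To illustrate the problem, the sketch below assumes the text is valid
- UTF-8; that encoding is an assumption of this example, not something the
- byte-based functions know about. The hypothetical helper
- <code>utf8_safe_prefix</code> picks a byte count that does not split a
- character, so a later <code>memcpy</code> of that many bytes stays on a
- character boundary. For other encodings, a locale-aware function such as
- <code>mblen</code> would be needed instead.
- </p>

```c
#include <stddef.h>
#include <string.h>

/* Assuming S is valid UTF-8, return the largest count N such that
   N <= MAX_BYTES, N <= strlen (S), and the first N bytes of S end on
   a character boundary.  */
size_t
utf8_safe_prefix (const char *s, size_t max_bytes)
{
  size_t len = strlen (s);
  size_t n = max_bytes < len ? max_bytes : len;

  /* UTF-8 continuation bytes have the bit pattern 10xxxxxx.  If the
     byte at offset N is one, a cut at N would split a character, so
     back up to the start of that character.  */
  while (n > 0 && ((unsigned char) s[n] & 0xC0) == 0x80)
    n--;

  return n;
}
```

- <p>For the four-byte string <code>"a\xc3\xa9z"</code> (the letter a, a
- two-byte character, the letter z), a limit of 2 bytes would cut the middle
- character in half, so the helper returns 1.
- </p>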
- <a name="index-wide-string-1"></a>
- <p>To avoid these problems, later versions of the ISO C<!-- /@w --> standard
- introduce a second set of functions that operate on <em>wide
- characters</em> (see <a href="Extended-Char-Intro.html#Extended-Char-Intro">Extended Char Intro</a>). These functions don’t have
- the problems the single-byte versions have since every wide character is
- a legal, interpretable value. This does not mean that cutting wide
- strings at arbitrary points is without problems. It normally
- is for alphabet-based languages (except for non-normalized text) but
- languages based on syllables still have the problem that more than one
- wide character is necessary to complete a logical unit. This is a
- higher-level problem which the C library<!-- /@w --> functions are not designed
- to solve. But it is at least good that no invalid byte sequences can be
- created. Higher-level functions can also operate much more easily
- on wide characters than on multibyte characters, so a common strategy
- is to use wide characters internally whenever text is more than simply
- copied.
- </p>
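- <p>The strategy of working on wide characters internally might look like
- the following sketch. <code>to_wide</code> and <code>truncate_wide</code> are
- hypothetical helpers; the conversion uses the standard
- <code>mbsrtowcs</code> function and depends on the current locale’s
- encoding.
- </p>

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Convert the multibyte string SRC to a newly allocated wide string
   using the current locale.  Return a null pointer if SRC contains an
   invalid sequence or if allocation fails.  The caller must free the
   result.  */
wchar_t *
to_wide (const char *src)
{
  mbstate_t state;
  const char *p = src;
  size_t n;
  wchar_t *w;

  /* First pass: with a null destination, only count wide characters.  */
  memset (&state, 0, sizeof state);
  n = mbsrtowcs (NULL, &p, 0, &state);
  if (n == (size_t) -1)
    return NULL;

  w = malloc ((n + 1) * sizeof *w);
  if (w == NULL)
    return NULL;

  /* Second pass: convert, including the null wide character.  */
  p = src;
  memset (&state, 0, sizeof state);
  mbsrtowcs (w, &p, n + 1, &state);
  return w;
}

/* Cut W to at most MAX_CHARS wide characters.  Every element is a
   complete wide character, so no invalid sequence can result.  */
void
truncate_wide (wchar_t *w, size_t max_chars)
{
  if (wcslen (w) > max_chars)
    w[max_chars] = L'\0';
}
```

- <p>As the text above notes, a cut like this can still separate wide
- characters that belong to one logical unit; the sketch only guarantees
- that each remaining element is a valid wide character.
- </p>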
- <p>The remainder of this chapter discusses the functions for handling
- wide strings in parallel with the discussion of
- strings, since there is almost always an exact equivalent
- available.
- </p>
- </body>
- </html>