http://github.com/nicolas-grekas/Patchwork-UTF8

Patchwork UTF-8 for PHP
=======================

[![Latest Stable Version](https://poser.pugx.org/patchwork/utf8/v/stable.png)](https://packagist.org/packages/patchwork/utf8)
[![Total Downloads](https://poser.pugx.org/patchwork/utf8/downloads.png)](https://packagist.org/packages/patchwork/utf8)
[![Build Status](https://secure.travis-ci.org/tchwork/utf8.png?branch=master)](http://travis-ci.org/tchwork/utf8)
[![SensioLabsInsight](https://insight.sensiolabs.com/projects/666c8ae7-0997-4d27-883a-6089ce3cc76b/mini.png)](https://insight.sensiolabs.com/projects/666c8ae7-0997-4d27-883a-6089ce3cc76b)

Patchwork UTF-8 gives PHP developers extensive, portable and performant
handling of UTF-8 and [grapheme clusters](http://unicode.org/reports/tr29/).

It provides both:

- a portability layer for `mbstring`, `iconv`, and intl `Normalizer` and
  `grapheme_*` functions,
- a UTF-8 grapheme-cluster-aware replica of native string functions.

It can also serve as a documentation source referencing the practical problems
that arise when handling UTF-8 in PHP: Unicode concepts, related algorithms,
bugs in PHP core, workarounds, etc.

Version 1.2 adds best-fit mappings for UTF-8 to *Code Page* approximations.
It also adds Unicode filesystem access under Windows, preferably using
[wfio](https://github.com/kenjiuno/php-wfio), with a COM-based fallback otherwise.

Portability
-----------

Unicode handling in PHP is best performed using a combo of `mbstring`, `iconv`,
`intl` and `pcre` with the `u` flag enabled. But when an application is expected
to run on many servers, you should be aware that these 4 extensions are not
always enabled.

Patchwork UTF-8 provides pure PHP implementations for 3 of those 4 extensions.
`pcre` compiled with Unicode support is required but is widely available.
The following set of portability fallbacks allows an application to run on a
server even if one or more of those extensions are not enabled:

- *utf8_encode, utf8_decode*,
- `mbstring`: *mb_check_encoding, mb_convert_case, mb_convert_encoding,
  mb_decode_mimeheader, mb_detect_encoding, mb_detect_order,
  mb_encode_mimeheader, mb_encoding_aliases, mb_get_info, mb_http_input,
  mb_http_output, mb_internal_encoding, mb_language, mb_list_encodings,
  mb_output_handler, mb_strlen, mb_strpos, mb_strrpos, mb_strtolower,
  mb_strtoupper, mb_stripos, mb_stristr, mb_strrchr, mb_strrichr, mb_strripos,
  mb_strstr, mb_strwidth, mb_substitute_character, mb_substr, mb_substr_count*,
- `iconv`: *iconv, iconv_mime_decode, iconv_mime_decode_headers,
  iconv_get_encoding, iconv_set_encoding, iconv_mime_encode, ob_iconv_handler,
  iconv_strlen, iconv_strpos, iconv_strrpos, iconv_substr*,
- `intl`: *Normalizer, grapheme_extract, grapheme_stripos, grapheme_stristr,
  grapheme_strlen, grapheme_strpos, grapheme_strripos, grapheme_strrpos,
  grapheme_strstr, grapheme_substr, normalizer_is_normalized,
  normalizer_normalize*.
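
To see which of these fallbacks a given server would actually rely on, it can help to list the extensions that are natively present. The following is a minimal sketch using only PHP built-ins; the helper name `detectUnicodeSupport()` is hypothetical and is not part of Patchwork UTF-8.

```php
<?php
// Hypothetical helper (not part of Patchwork UTF-8): reports which of the
// four extensions named above are natively available on this server.
function detectUnicodeSupport()
{
    return array(
        'mbstring' => extension_loaded('mbstring'),
        'iconv'    => extension_loaded('iconv'),
        'intl'     => extension_loaded('intl'),
        // PCRE's Unicode support is a compile-time option: a pattern using
        // the `u` flag fails to compile (preg_match returns false) without it.
        'pcre_u'   => @preg_match('/\pL/u', 'a') === 1,
    );
}

foreach (detectUnicodeSupport() as $name => $available) {
    echo $name, ': ', $available ? 'native' : 'missing', "\n";
}
```

Note that `extension_loaded()` only reports native extensions, so it keeps returning `false` for an extension whose functions are supplied by a pure PHP fallback.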

Patchwork\Utf8
--------------

[Grapheme clusters](http://unicode.org/reports/tr29/) should always be
considered when working with generic Unicode strings. The `Patchwork\Utf8`
class implements a nearly complete set of the native string functions that need
UTF-8 grapheme cluster awareness. Function names, arguments and behavior
carefully replicate native PHP string functions.
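
The difference between bytes, code points and grapheme clusters can be illustrated without the library. The sketch below uses only native `strlen()` and PCRE with the `u` flag (the one extension the library requires); it is an illustration of the concept, not of how `Patchwork\Utf8` is implemented.

```php
<?php
// "é" written as the letter "e" followed by U+0301 COMBINING ACUTE ACCENT.
$s = "e\xCC\x81";

$bytes      = strlen($s);                       // counts bytes
$codePoints = preg_match_all('/./su', $s, $m);  // counts Unicode code points
$graphemes  = preg_match_all('/\X/u', $s, $m);  // counts grapheme clusters

echo "$bytes bytes, $codePoints code points, $graphemes grapheme cluster(s)\n";
// prints "3 bytes, 2 code points, 1 grapheme cluster(s)"
```

A reader sees a single character here, which is why a byte-based or code-point-based function can cut such a string in the middle of what looks like one letter.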

Some more functions are also provided to help handle UTF-8 strings:

- *filter()*: normalizes to UTF-8 NFC, converting from [CP-1252](http://wikipedia.org/wiki/CP-1252) when needed,
- *isUtf8()*: checks if a string contains well-formed UTF-8 data,
- *toAscii()*: generic UTF-8 to ASCII transliteration,
- *strtocasefold()*: Unicode transformation for caseless matching,
- *strtonatfold()*: generic case-sensitive transformation for collation matching,
- *strwidth()*: computes the width of a string when printed on a terminal,
- *wrapPath()*: Unicode filesystem access under Windows and other OSes.
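
As an illustration of what a well-formedness check such as *isUtf8()* has to decide, here is a minimal sketch based on PCRE's own UTF-8 validation. The function name `looksLikeUtf8()` is hypothetical, and this is not presented as the library's implementation.

```php
<?php
// Hypothetical stand-in for a well-formedness check; not the library's code.
function looksLikeUtf8($s)
{
    // With the `u` flag, PCRE validates the subject as UTF-8 before matching,
    // so the empty pattern matches (returns 1) only on well-formed input.
    return @preg_match('//u', $s) === 1;
}

var_dump(looksLikeUtf8("caf\xC3\xA9")); // bool(true): "é" encoded as two bytes
var_dump(looksLikeUtf8("caf\xE9"));     // bool(false): a lone CP-1252 byte
```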

Mirrored string functions are:
*strlen, substr, strpos, stripos, strrpos, strripos, strstr, stristr, strrchr,
strrichr, strtolower, strtoupper, wordwrap, chr, count_chars, ltrim, ord, rtrim,
trim, str_ireplace, str_pad, str_shuffle, str_split, str_word_count, strcmp,
strnatcmp, strcasecmp, strnatcasecmp, strncasecmp, strncmp, strcspn, strpbrk,
strrev, strspn, strtr, substr_compare, substr_count, substr_replace, ucfirst,
lcfirst, ucwords, number_format, utf8_encode, utf8_decode, json_decode,
filter_input, filter_input_array*.

Notably missing (but hard to replicate) are *printf*-family functions.

The implementation favors performance over exhaustive handling of edge cases.
It generally expects normalized UTF-8 strings and provides filters to produce them.

As the Turkish locale requires special care, a `Patchwork\TurkishUtf8` class
is provided for working with this locale. It clones all the features of
`Patchwork\Utf8` but knows about the Turkish specifics.

Usage
-----

The recommended way to install Patchwork UTF-8 is [through
composer](http://getcomposer.org). Just create a `composer.json` file and run
the `php composer.phar install` command to install it:

    {
        "require": {
            "patchwork/utf8": "~1.2"
        }
    }

Then, early in your bootstrap sequence, you have to configure your environment:

```php
\Patchwork\Utf8\Bootup::initAll(); // Enables the portability layer and configures PHP for UTF-8
\Patchwork\Utf8\Bootup::filterRequestUri(); // Redirects to a UTF-8 encoded URL if the request URI is not one already
\Patchwork\Utf8\Bootup::filterRequestInputs(); // Normalizes HTTP inputs to UTF-8 NFC
```
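
Once the environment is configured, the static methods of `Patchwork\Utf8` can be called in place of their native counterparts. The sketch below is only an illustration: it assumes Composer installed the package into `./vendor`, the helper `inspectString()` is hypothetical, and only the `strlen`, `isUtf8` and `toAscii` methods it calls are taken from the lists above. It returns `null` when the library is not installed, so it can be run as-is.

```php
<?php
// Assumption: dependencies were installed with Composer into ./vendor.
$autoload = __DIR__ . '/vendor/autoload.php';
if (is_file($autoload)) {
    require $autoload;
}
if (class_exists('Patchwork\Utf8\Bootup')) {
    \Patchwork\Utf8\Bootup::initAll(); // same call as in the bootstrap above
}

// Hypothetical helper for illustration; only the Patchwork\Utf8 methods it
// calls come from the library.
function inspectString($s)
{
    if (!class_exists('Patchwork\Utf8')) {
        return null; // library not installed: nothing to demonstrate
    }

    return array(
        'bytes'     => strlen($s),                  // native, byte-based
        'graphemes' => \Patchwork\Utf8::strlen($s), // grapheme-cluster-aware
        'is_utf8'   => \Patchwork\Utf8::isUtf8($s),
        'ascii'     => \Patchwork\Utf8::toAscii($s),
    );
}

var_dump(inspectString("d\xC3\xA9j\xC3\xA0 vu"));
```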

Run `phpunit` to see the code in action.

Make sure that you are confident about using UTF-8 by reading
[Character Sets / Character Encoding Issues](http://www.phpwact.org/php/i18n/charsets)
and [Handling UTF-8 with PHP](http://www.phpwact.org/php/i18n/utf-8),
or [PHP et UTF-8](http://julp.lescigales.org/articles/3-php-et-utf-8.html) for French readers.

You should also get familiar with the concepts of
[Unicode Normalization](http://en.wikipedia.org/wiki/Unicode_equivalence) and
[Grapheme Clusters](http://unicode.org/reports/tr29/).

Do not blindly replace all uses of PHP's string functions. Most of the time you
will not need to, and you would be introducing a significant performance overhead
to your application.

Screen your input on the *outer perimeter* so that only well-formed UTF-8 passes
through. When dealing with badly formed UTF-8, you should not try to fix it
(see [Unicode Security Considerations](http://www.unicode.org/reports/tr36/#Deletion_of_Noncharacters)).
Instead, consider it as [CP-1252](http://wikipedia.org/wiki/CP-1252) and use
`Patchwork\Utf8::utf8_encode()` to get a UTF-8 string. Also, don't forget to
choose one Unicode normalization form and stick to it. NFC is now the de facto
standard. `Patchwork\Utf8::filter()` implements this behavior: it converts from
CP-1252 and normalizes to NFC.
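
The screening policy described above can be sketched with native functions alone. This is a simplified illustration under stated assumptions, not the implementation of `Patchwork\Utf8::filter()`: the helper name `screenInput()` is hypothetical, and `mb_convert_encoding()` or `iconv()` stand in for `Patchwork\Utf8::utf8_encode()` as the CP-1252 converter.

```php
<?php
// Hypothetical helper sketching the policy described above; it is not the
// implementation of Patchwork\Utf8::filter().
function screenInput($s)
{
    // Step 1: input that is not well-formed UTF-8 is not repaired in place;
    // it is reinterpreted as CP-1252 and converted as a whole.
    if (@preg_match('//u', $s) !== 1) {
        if (function_exists('mb_convert_encoding')) {
            $s = mb_convert_encoding($s, 'UTF-8', 'Windows-1252');
        } elseif (function_exists('iconv')) {
            $converted = iconv('CP1252', 'UTF-8', $s);
            if ($converted === false) {
                throw new RuntimeException('Input could not be read as CP-1252');
            }
            $s = $converted;
        } else {
            throw new RuntimeException('No converter available for CP-1252 input');
        }
    }

    // Step 2: normalize to NFC when a Normalizer class is available.
    if (class_exists('Normalizer')) {
        $nfc = Normalizer::normalize($s, Normalizer::FORM_C);
        if (is_string($nfc)) {
            $s = $nfc;
        }
    }

    return $s;
}

// "caf\xE9" is "café" in CP-1252; with mbstring or iconv available this
// prints 636166c3a9, the UTF-8 bytes of the same word.
echo bin2hex(screenInput("caf\xE9")), "\n";
```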

This library is incompatible with `mbstring.func_overload` and will not work if
that `php.ini` setting is enabled.
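
An application can check for this setting at startup. The following is a minimal sketch; the function name `stringOverloadEnabled()` is hypothetical and not part of the library. It relies on the flag value 2 of `mbstring.func_overload` being the one that overloads the native string functions.

```php
<?php
// Hypothetical guard, not part of the library. The flag value 2 of
// mbstring.func_overload replaces native string functions (strlen, substr, ...).
function stringOverloadEnabled()
{
    // ini_get() returns false when the setting does not exist (it was removed
    // in PHP 8.0), which casts to 0 here.
    return (((int) ini_get('mbstring.func_overload')) & 2) !== 0;
}

if (stringOverloadEnabled()) {
    trigger_error('mbstring.func_overload must be disabled', E_USER_WARNING);
}
```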

Licensing
---------

Patchwork\Utf8 is free software; you can redistribute it and/or modify it under
the terms of either of the following licenses, at your option:

- [Apache License v2.0](http://apache.org/licenses/LICENSE-2.0.txt), or
- [GNU General Public License v2.0](http://gnu.org/licenses/gpl-2.0.txt).

Unicode handling requires tedious work to implement and maintain in the
long run. As such, contributions such as unit tests, bug reports, comments or
patches licensed under both licenses are very welcome.

I hope many projects will adopt this code and together help solve the Unicode
problem for PHP.