/lib/exif/php-jpg/Unicode.php

http://buddypress-media.googlecode.com/
Possible License(s): AGPL-1.0, Apache-2.0, GPL-2.0, LGPL-2.1

  1. <?php
  2. /******************************************************************************
  3. *
  4. * Filename: Unicode.php
  5. *
  6. * Description: Provides functions for handling Unicode strings in PHP without
  7. * needing to configure the non-default mbstring extension
  8. *
  9. * Author: Evan Hunter
  10. *
  11. * Date: 27/7/2004
  12. *
  13. * Project: JPEG Metadata
  14. *
  15. * Revision: 1.10
  16. *
  17. * Changes: 1.00 -> 1.10 : Added the following functions:
  18. * smart_HTML_Entities
  19. * smart_htmlspecialchars
  20. * HTML_UTF16_UnEscape
  21. * HTML_UTF8_UnEscape
  22. * changed HTML_UTF8_Escape and HTML_UTF16_Escape to
  23. * use smart_htmlspecialchars, so that characters which
  24. * were already escaped would remain intact
  25. *
  26. *
  27. * URL: http://electronics.ozhiker.com
  28. *
  29. * License: This file is part of the PHP JPEG Metadata Toolkit.
  30. *
  31. * The PHP JPEG Metadata Toolkit is free software; you can
  32. * redistribute it and/or modify it under the terms of the
  33. * GNU General Public License as published by the Free Software
  34. * Foundation; either version 2 of the License, or (at your
  35. * option) any later version.
  36. *
  37. * The PHP JPEG Metadata Toolkit is distributed in the hope
  38. * that it will be useful, but WITHOUT ANY WARRANTY; without
  39. * even the implied warranty of MERCHANTABILITY or FITNESS
  40. * FOR A PARTICULAR PURPOSE. See the GNU General Public License
  41. * for more details.
  42. *
  43. * You should have received a copy of the GNU General Public
  44. * License along with the PHP JPEG Metadata Toolkit; if not,
  45. * write to the Free Software Foundation, Inc., 59 Temple
  46. * Place, Suite 330, Boston, MA 02111-1307 USA
  47. *
  48. * If you require a different license for commercial or other
  49. * purposes, please contact the author: evan@ozhiker.com
  50. *
  51. ******************************************************************************/
  52. // TODO: UTF-16 functions have not been tested fully
  53. /******************************************************************************
  54. *
  55. * Unicode UTF-8 Encoding Functions
  56. *
* Description: UTF-8 is a Unicode encoding system in which every byte of an
* extended (non-ASCII) character has its high bit set, so these
* bytes use only the upper half (128 values) of the byte range.
* 7-Bit ASCII therefore passes straight through UTF-8
* encoding/decoding without change.
*
*
* The encoding is as follows:
* Unicode Value : Binary representation (x=data bit)
*--------------------------------------------------------------------------------
* U-00000000 - U-0000007F: 0xxxxxxx <- This is 7-bit ASCII
* U-00000080 - U-000007FF: 110xxxxx 10xxxxxx
* U-00000800 - U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
* U-00010000 - U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
* U-00200000 - U-03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
* U-04000000 - U-7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
*--------------------------------------------------------------------------------
* Note: RFC 3629 restricts UTF-8 to U-0010FFFF, so the 5 and 6 byte forms
* are obsolete. The functions in this file still accept them.
  73. *
  74. ******************************************************************************/
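// Worked example (not part of the original toolkit): a minimal sketch of how
// the bit layouts in the table above turn one code point into UTF-8 bytes.
// The function name example_utf8_encode is hypothetical, and the sketch
// assumes the modern limit of U+10FFFF (at most 4 bytes).

```php
<?php
// Hypothetical helper for illustration only; not defined elsewhere in this file.
function example_utf8_encode(int $cp): string
{
    if ($cp < 0 || $cp > 0x10FFFF) {
        throw new InvalidArgumentException('Code point out of range');
    }
    if ($cp < 0x80) {
        // 0xxxxxxx
        return chr($cp);
    }
    if ($cp < 0x800) {
        // 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        // 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

echo bin2hex(example_utf8_encode(0x20AC)), "\n"; // e282ac (EURO SIGN)
```

// U+20AC falls in the 3 byte row: its 16 bits are split 4 + 6 + 6 and placed
// behind the 1110, 10 and 10 prefixes, giving E2 82 AC.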
  75. /******************************************************************************
  76. *
  77. * Unicode UTF-16 Encoding Functions
  78. *
* Description: UTF-16 is a Unicode encoding system which uses 16 bit values for
* representing characters.
* It also has an extended set of characters available by the use
* of surrogate pairs, which are a pair of 16 bit values, giving a
* total data length of 20 useful bits.
*
*
* The encoding is as follows:
* Unicode Value : Binary representation (x=data bit)
*--------------------------------------------------------------------------------
* U-000000 - U-00D7FF: xxxxxxxx xxxxxxxx
* U-00D800 - U-00DBFF: Not available - used for high surrogates
* U-00DC00 - U-00DFFF: Not available - used for low surrogates
* U-00E000 - U-00FFFF: xxxxxxxx xxxxxxxx
* U-010000 - U-10FFFF: 110110ww wwxxxxxx 110111xx xxxxxxxx ( wwww = (uni-0x10000)/0x10000 )
*--------------------------------------------------------------------------------
*
* Surrogate pair Calculations (the division must be an integer division)
*
* $hi = intdiv( $uni - 0x10000, 0x400 ) + 0xD800;
* $lo = ($uni - 0x10000) % 0x400 + 0xDC00;
*
*
* $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
  103. *
  104. *
  105. ******************************************************************************/
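// Worked example (not part of the original toolkit): a minimal sketch of the
// surrogate pair calculations given above, using bit shifts so that every
// value stays an integer. The function names are hypothetical.

```php
<?php
// Hypothetical helpers for illustration only.

// Split a code point in the range 0x10000 - 0x10FFFF into [high, low] surrogates.
function example_to_surrogates(int $uni): array
{
    $offset = $uni - 0x10000;            // 20 useful bits
    return [
        0xD800 + ($offset >> 10),        // top 10 bits
        0xDC00 + ($offset & 0x3FF),      // bottom 10 bits
    ];
}

// Recombine a high and low surrogate into the original code point.
function example_from_surrogates(int $hi, int $lo): int
{
    return 0x10000 + (($hi - 0xD800) << 10) + ($lo - 0xDC00);
}

[$hi, $lo] = example_to_surrogates(0x1F600);
printf("%04X %04X\n", $hi, $lo);                      // D83D DE00
printf("%X\n", example_from_surrogates($hi, $lo));    // 1F600
```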
  106. /******************************************************************************
  107. *
  108. * Function: UTF8_fix
  109. *
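* Description: Checks a string for badly formed Unicode UTF-8 coding and
* returns the same string containing only the parts which
* were properly formed UTF-8 data.
* Note: only the lead byte and the length of each sequence are
* checked. Invalid lead bytes are skipped and a truncated final
* sequence is dropped, but continuation bytes are copied
* without being validated.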
  113. *
  114. * Parameters: utf8_text - a string with possibly badly formed UTF-8 data
  115. *
  116. * Returns: output - the well formed UTF-8 version of the string
  117. *
  118. ******************************************************************************/
  119. function UTF8_fix( $utf8_text )
  120. {
  121. // Initialise the current position in the string
  122. $pos = 0;
  123. // Create a string to accept the well formed output
  124. $output = "" ;
  125. // Cycle through each group of bytes, ensuring the coding is correct
  126. while ( $pos < strlen( $utf8_text ) )
  127. {
// Retrieve the current numerical character value
// (square brackets are used because the {} string offset syntax was removed in PHP 8)
$chval = ord( $utf8_text[$pos] );
  130. // Check what the first character is - it will tell us how many bytes the
  131. // Unicode value covers
  132. if ( ( $chval >= 0x00 ) && ( $chval <= 0x7F ) )
  133. {
  134. // 1 Byte UTF-8 Unicode (7-Bit ASCII) Character
  135. $bytes = 1;
  136. }
  137. else if ( ( $chval >= 0xC0 ) && ( $chval <= 0xDF ) )
  138. {
  139. // 2 Byte UTF-8 Unicode Character
  140. $bytes = 2;
  141. }
  142. else if ( ( $chval >= 0xE0 ) && ( $chval <= 0xEF ) )
  143. {
  144. // 3 Byte UTF-8 Unicode Character
  145. $bytes = 3;
  146. }
  147. else if ( ( $chval >= 0xF0 ) && ( $chval <= 0xF7 ) )
  148. {
  149. // 4 Byte UTF-8 Unicode Character
  150. $bytes = 4;
  151. }
  152. else if ( ( $chval >= 0xF8 ) && ( $chval <= 0xFB ) )
  153. {
  154. // 5 Byte UTF-8 Unicode Character
  155. $bytes = 5;
  156. }
  157. else if ( ( $chval >= 0xFC ) && ( $chval <= 0xFD ) )
  158. {
  159. // 6 Byte UTF-8 Unicode Character
  160. $bytes = 6;
  161. }
  162. else
  163. {
  164. // Invalid Code - skip character and do nothing
  165. $bytes = 0;
  166. $pos++;
  167. }
  168. // check that there is enough data remaining to read
  169. if (($pos + $bytes - 1) < strlen( $utf8_text ) )
  170. {
  171. // Cycle through the number of bytes specified,
  172. // copying them to the output string
  173. while ( $bytes > 0 )
  174. {
$output .= $utf8_text[$pos];
  176. $pos++;
  177. $bytes--;
  178. }
  179. }
  180. else
  181. {
  182. break;
  183. }
  184. }
  185. // Return the result
  186. return $output;
  187. }
  188. /******************************************************************************
  189. * End of Function: UTF8_fix
  190. ******************************************************************************/
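// Illustrative sketch (not part of the original toolkit): UTF8_fix above only
// looks at lead bytes, so a stricter filter may be wanted in some cases.
// The hypothetical function below also validates each whole sequence. It
// assumes the PCRE extension is available and relies on preg_match() with the
// /u modifier failing on malformed UTF-8. It also assumes the modern 4 byte
// limit, so it is stricter than UTF8_fix rather than a drop-in replacement.

```php
<?php
// Hypothetical helper for illustration only.
function example_utf8_strict_fix(string $text): string
{
    $out = '';
    $len = strlen($text);
    $pos = 0;
    while ($pos < $len) {
        $lead = ord($text[$pos]);
        // Work out the expected sequence length from the lead byte
        if ($lead <= 0x7F) {
            $bytes = 1;
        } elseif ($lead >= 0xC2 && $lead <= 0xDF) {
            $bytes = 2;
        } elseif ($lead >= 0xE0 && $lead <= 0xEF) {
            $bytes = 3;
        } elseif ($lead >= 0xF0 && $lead <= 0xF4) {
            $bytes = 4;
        } else {
            // Stray continuation byte or invalid lead byte - skip it
            $pos++;
            continue;
        }
        $seq = substr($text, $pos, $bytes);
        // Keep the sequence only if it is complete and PCRE accepts it as UTF-8
        if (strlen($seq) === $bytes && preg_match('//u', $seq) === 1) {
            $out .= $seq;
            $pos += $bytes;
        } else {
            // Drop the lead byte only and resynchronise on the next byte
            $pos++;
        }
    }
    return $out;
}

echo example_utf8_strict_fix("a\xC3(b"), "\n"; // a(b
```

// In the example the byte 0xC3 announces a 2 byte sequence but is followed by
// an ASCII character, so only the 0xC3 is removed.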
  191. /******************************************************************************
  192. *
  193. * Function: UTF16_fix
  194. *
  195. * Description: Checks a string for badly formed Unicode UTF-16 coding and
  196. * returns the same string containing only the parts which
  197. * were properly formed UTF-16 data.
  198. *
  199. * Parameters: utf16_text - a string with possibly badly formed UTF-16 data
  200. * MSB_first - True will cause processing as Big Endian UTF-16 (Motorola, MSB first)
  201. * False will cause processing as Little Endian UTF-16 (Intel, LSB first)
  202. *
  203. * Returns: output - the well formed UTF-16 version of the string
  204. *
  205. ******************************************************************************/
  206. function UTF16_fix( $utf16_text, $MSB_first )
  207. {
  208. // Initialise the current position in the string
  209. $pos = 0;
  210. // Create a string to accept the well formed output
  211. $output = "" ;
  212. // Cycle through each group of bytes, ensuring the coding is correct
  213. while ( $pos < strlen( $utf16_text ) )
  214. {
// Retrieve the current numerical character value
$chval1 = ord( $utf16_text[$pos] );
// Skip over character just read
$pos++;
// Check if there is another character available
if ( $pos < strlen( $utf16_text ) )
{
    // Another character is available - get it for the second half of the UTF-16 value
    $chval2 = ord( $utf16_text[$pos] );
}
else
{
    // Error - no second byte to this UTF-16 value - end processing
    break;
}
// Skip over character just read
$pos++;
  232. // Calculate the 16 bit unicode value
  233. if ( $MSB_first )
  234. {
  235. // Big Endian
  236. $UTF16_val = $chval1 * 0x100 + $chval2;
  237. }
  238. else
  239. {
  240. // Little Endian
  241. $UTF16_val = $chval2 * 0x100 + $chval1;
  242. }
  243. if ( ( ( $UTF16_val >= 0x0000 ) && ( $UTF16_val <= 0xD7FF ) ) ||
  244. ( ( $UTF16_val >= 0xE000 ) && ( $UTF16_val <= 0xFFFF ) ) )
  245. {
  246. // Normal Character (Non Surrogate pair)
  247. // Add it to the output
  248. $output .= chr( $chval1 ) . chr ( $chval2 );
  249. }
  250. else if ( ( $UTF16_val >= 0xD800 ) && ( $UTF16_val <= 0xDBFF ) )
  251. {
  252. // High surrogate of a surrogate pair
  253. // Now we need to read the low surrogate
// Check if there are another 2 bytes available for the low surrogate
// (the bytes needed are at $pos and $pos + 1, so testing $pos + 3 would
// wrongly reject a surrogate pair at the very end of the string)
if ( ( $pos + 1 ) < strlen( $utf16_text ) )
{
    // Another 2 bytes are available - get them
    $chval3 = ord( $utf16_text[$pos] );
    $chval4 = ord( $utf16_text[$pos + 1] );
    // Calculate the second 16 bit unicode value
    if ( $MSB_first )
    {
        // Big Endian
        $UTF16_val2 = $chval3 * 0x100 + $chval4;
    }
    else
    {
        // Little Endian
        $UTF16_val2 = $chval4 * 0x100 + $chval3;
    }
    // Check that this is a low surrogate
    if ( ( $UTF16_val2 >= 0xDC00 ) && ( $UTF16_val2 <= 0xDFFF ) )
    {
        // Low surrogate found following high surrogate
        // Add both to the output
        $output .= chr( $chval1 ) . chr( $chval2 ) . chr( $chval3 ) . chr( $chval4 );
        // Skip over the low surrogate
        $pos += 2;
    }
    else
    {
        // Low surrogate not found after high surrogate
        // Don't add either to the output
        // Only the high surrogate is skipped and processing continues after it
    }
}
else
{
    // Error - not enough data for low surrogate - end processing
    break;
}
  292. }
  293. else
  294. {
  295. // Low surrogate of a surrogate pair
  296. // This should not happen - it means this is a lone low surrogate
// Don't add it to the output
  298. }
  299. }
  300. // Return the result
  301. return $output;
  302. }
  303. /******************************************************************************
  304. * End of Function: UTF16_fix
  305. ******************************************************************************/
  306. /******************************************************************************
  307. *
  308. * Function: UTF8_to_unicode_array
  309. *
  310. * Description: Converts a string encoded with Unicode UTF-8, to an array of
  311. * numbers which represent unicode character numbers
  312. *
  313. * Parameters: utf8_text - a string containing the UTF-8 data
  314. *
  315. * Returns: output - the array containing the unicode character numbers
  316. *
  317. ******************************************************************************/
  318. function UTF8_to_unicode_array( $utf8_text )
  319. {
  320. // Create an array to receive the unicode character numbers output
  321. $output = array( );
  322. // Cycle through the characters in the UTF-8 string
  323. for ( $pos = 0; $pos < strlen( $utf8_text ); $pos++ )
  324. {
// Retrieve the current numerical character value
// (square brackets are used because the {} string offset syntax was removed in PHP 8)
$chval = ord( $utf8_text[$pos] );
  327. // Check what the first character is - it will tell us how many bytes the
  328. // Unicode value covers
  329. if ( ( $chval >= 0x00 ) && ( $chval <= 0x7F ) )
  330. {
  331. // 1 Byte UTF-8 Unicode (7-Bit ASCII) Character
  332. $bytes = 1;
  333. $outputval = $chval; // Since 7-bit ASCII is unaffected, the output equals the input
  334. }
  335. else if ( ( $chval >= 0xC0 ) && ( $chval <= 0xDF ) )
  336. {
  337. // 2 Byte UTF-8 Unicode
  338. $bytes = 2;
  339. $outputval = $chval & 0x1F; // The first byte is bitwise ANDed with 0x1F to remove the leading 110b
  340. }
  341. else if ( ( $chval >= 0xE0 ) && ( $chval <= 0xEF ) )
  342. {
  343. // 3 Byte UTF-8 Unicode
  344. $bytes = 3;
  345. $outputval = $chval & 0x0F; // The first byte is bitwise ANDed with 0x0F to remove the leading 1110b
  346. }
  347. else if ( ( $chval >= 0xF0 ) && ( $chval <= 0xF7 ) )
  348. {
  349. // 4 Byte UTF-8 Unicode
  350. $bytes = 4;
  351. $outputval = $chval & 0x07; // The first byte is bitwise ANDed with 0x07 to remove the leading 11110b
  352. }
  353. else if ( ( $chval >= 0xF8 ) && ( $chval <= 0xFB ) )
  354. {
  355. // 5 Byte UTF-8 Unicode
  356. $bytes = 5;
  357. $outputval = $chval & 0x03; // The first byte is bitwise ANDed with 0x03 to remove the leading 111110b
  358. }
  359. else if ( ( $chval >= 0xFC ) && ( $chval <= 0xFD ) )
  360. {
  361. // 6 Byte UTF-8 Unicode
  362. $bytes = 6;
  363. $outputval = $chval & 0x01; // The first byte is bitwise ANDed with 0x01 to remove the leading 1111110b
  364. }
  365. else
  366. {
  367. // Invalid Code - do nothing
  368. $bytes = 0;
  369. }
  370. // Check if the byte was valid
  371. if ( $bytes !== 0 )
  372. {
  373. // The byte was valid
  374. // Check if there is enough data left in the UTF-8 string to allow the
  375. // retrieval of the remainder of this unicode character
  376. if ( $pos + $bytes - 1 < strlen( $utf8_text ) )
  377. {
  378. // The UTF-8 string is long enough
  379. // Cycle through the number of bytes required,
  380. // minus the first one which has already been done
  381. while ( $bytes > 1 )
  382. {
  383. $pos++;
  384. $bytes--;
// Each remaining byte is coded with 6 bits of data and 10b on the high
// order bits. Hence we need to shift left by 6 bits (multiply by 0x40) then add
// the current character after it has been bitwise ANDed with 0x3F to remove the
// highest two bits.
$outputval = $outputval * 0x40 + ( ord( $utf8_text[$pos] ) & 0x3F );
  390. }
  391. // Add the calculated Unicode number to the output array
  392. $output[] = $outputval;
  393. }
  394. }
  395. }
  396. // Return the resulting array
  397. return $output;
  398. }
  399. /******************************************************************************
  400. * End of Function: UTF8_to_unicode_array
  401. ******************************************************************************/
  402. /******************************************************************************
  403. *
  404. * Function: UTF16_to_unicode_array
  405. *
  406. * Description: Converts a string encoded with Unicode UTF-16, to an array of
  407. * numbers which represent unicode character numbers
  408. *
  409. * Parameters: utf16_text - a string containing the UTF-16 data
  410. * MSB_first - True will cause processing as Big Endian UTF-16 (Motorola, MSB first)
  411. * False will cause processing as Little Endian UTF-16 (Intel, LSB first)
  412. *
  413. * Returns: output - the array containing the unicode character numbers
  414. *
  415. ******************************************************************************/
  416. function UTF16_to_unicode_array( $utf16_text, $MSB_first )
  417. {
  418. // Create an array to receive the unicode character numbers output
  419. $output = array( );
  420. // Initialise the current position in the string
  421. $pos = 0;
  422. // Cycle through each group of bytes, ensuring the coding is correct
  423. while ( $pos < strlen( $utf16_text ) )
  424. {
// Retrieve the current numerical character value
$chval1 = ord( $utf16_text[$pos] );
// Skip over character just read
$pos++;
// Check if there is another character available
if ( $pos < strlen( $utf16_text ) )
{
    // Another character is available - get it for the second half of the UTF-16 value
    $chval2 = ord( $utf16_text[$pos] );
}
else
{
    // Error - no second byte to this UTF-16 value - end processing
    break;
}
// Skip over character just read
$pos++;
  442. // Calculate the 16 bit unicode value
  443. if ( $MSB_first )
  444. {
  445. // Big Endian
  446. $UTF16_val = $chval1 * 0x100 + $chval2;
  447. }
  448. else
  449. {
  450. // Little Endian
  451. $UTF16_val = $chval2 * 0x100 + $chval1;
  452. }
  453. if ( ( ( $UTF16_val >= 0x0000 ) && ( $UTF16_val <= 0xD7FF ) ) ||
  454. ( ( $UTF16_val >= 0xE000 ) && ( $UTF16_val <= 0xFFFF ) ) )
  455. {
  456. // Normal Character (Non Surrogate pair)
  457. // Add it to the output
  458. $output[] = $UTF16_val;
  459. }
  460. else if ( ( $UTF16_val >= 0xD800 ) && ( $UTF16_val <= 0xDBFF ) )
  461. {
  462. // High surrogate of a surrogate pair
  463. // Now we need to read the low surrogate
// Check if there are another 2 bytes available for the low surrogate
// (the bytes needed are at $pos and $pos + 1, so testing $pos + 3 would
// wrongly reject a surrogate pair at the very end of the string)
if ( ( $pos + 1 ) < strlen( $utf16_text ) )
{
    // Another 2 bytes are available - get them
    $chval3 = ord( $utf16_text[$pos] );
    $chval4 = ord( $utf16_text[$pos + 1] );
    // Calculate the second 16 bit unicode value
    if ( $MSB_first )
    {
        // Big Endian
        $UTF16_val2 = $chval3 * 0x100 + $chval4;
    }
    else
    {
        // Little Endian
        $UTF16_val2 = $chval4 * 0x100 + $chval3;
    }
    // Check that this is a low surrogate
    if ( ( $UTF16_val2 >= 0xDC00 ) && ( $UTF16_val2 <= 0xDFFF ) )
    {
        // Low surrogate found following high surrogate
        // Combine both into a single unicode character number and add it to the output
        $output[] = 0x10000 + ( ( $UTF16_val - 0xD800 ) * 0x400 ) + ( $UTF16_val2 - 0xDC00 );
        // Skip over the low surrogate
        $pos += 2;
    }
    else
    {
        // Low surrogate not found after high surrogate
        // Don't add either to the output
        // The high surrogate is skipped and processing continues after it
    }
}
else
{
    // Error - not enough data for low surrogate - end processing
    break;
}
  502. }
  503. else
  504. {
  505. // Low surrogate of a surrogate pair
  506. // This should not happen - it means this is a lone low surrogate
  507. // Don't add it to the output
  508. }
  509. }
  510. // Return the result
  511. return $output;
  512. }
  513. /******************************************************************************
  514. * End of Function: UTF16_to_unicode_array
  515. ******************************************************************************/
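// Illustrative sketch (not part of the original toolkit): the same decoding
// can be written more compactly with PHP's built-in unpack(), whose 'n' and
// 'v' format codes read unsigned 16 bit values in big and little endian byte
// order. The function name is hypothetical. Like the function above, it
// drops lone surrogates and ignores a trailing odd byte.

```php
<?php
// Hypothetical helper for illustration only.
function example_utf16_to_codepoints(string $utf16, bool $msbFirst): array
{
    // Read the string as a list of 16 bit code units; unpack() returns a
    // 1-based array, so re-index it from 0
    $unpacked = unpack($msbFirst ? 'n*' : 'v*', $utf16);
    $units = $unpacked === false ? [] : array_values($unpacked);

    $out = [];
    $count = count($units);
    for ($i = 0; $i < $count; $i++) {
        $unit = $units[$i];
        if ($unit < 0xD800 || $unit > 0xDFFF) {
            // Normal character (not a surrogate)
            $out[] = $unit;
        } elseif ($unit <= 0xDBFF
            && $i + 1 < $count
            && $units[$i + 1] >= 0xDC00
            && $units[$i + 1] <= 0xDFFF
        ) {
            // High surrogate followed by a low surrogate - combine them
            $out[] = 0x10000 + (($unit - 0xD800) << 10) + ($units[$i + 1] - 0xDC00);
            $i++;
        }
        // Anything else is a lone surrogate and is dropped
    }
    return $out;
}

// "A" followed by U+1F600, big endian
print_r(example_utf16_to_codepoints("\x00\x41\xD8\x3D\xDE\x00", true));
```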
  516. /******************************************************************************
  517. *
  518. * Function: unicode_array_to_UTF8
  519. *
  520. * Description: Converts an array of unicode character numbers to a string
  521. * encoded by UTF-8
  522. *
  523. * Parameters: unicode_array - the array containing unicode character numbers
  524. *
  525. * Returns: output - the UTF-8 encoded string representing the data
  526. *
  527. ******************************************************************************/
  528. function unicode_array_to_UTF8( $unicode_array )
  529. {
  530. // Create a string to receive the UTF-8 output
  531. $output = "";
  532. // Cycle through each Unicode character number
  533. foreach( $unicode_array as $unicode_char )
  534. {
  535. // Check which range the current unicode character lies in
// Bit shifts are used instead of division below, so that the values passed
// to chr() stay integers ( >> 6 is the same as an integer division by 0x40 )
if ( ( $unicode_char >= 0x00 ) && ( $unicode_char <= 0x7F ) )
{
    // 1 Byte UTF-8 Unicode (7-Bit ASCII) Character
    $output .= chr( $unicode_char ); // Output is equal to input for 7-bit ASCII
}
else if ( ( $unicode_char >= 0x80 ) && ( $unicode_char <= 0x7FF ) )
{
    // 2 Byte UTF-8 Unicode - binary encode data as : 110xxxxx 10xxxxxx
    $output .= chr( 0xC0 + ( $unicode_char >> 6 ) );
    $output .= chr( 0x80 + ( $unicode_char & 0x3F ) );
}
else if ( ( $unicode_char >= 0x800 ) && ( $unicode_char <= 0xFFFF ) )
{
    // 3 Byte UTF-8 Unicode - binary encode data as : 1110xxxx 10xxxxxx 10xxxxxx
    $output .= chr( 0xE0 + ( $unicode_char >> 12 ) );
    $output .= chr( 0x80 + ( ( $unicode_char >> 6 ) & 0x3F ) );
    $output .= chr( 0x80 + ( $unicode_char & 0x3F ) );
}
else if ( ( $unicode_char >= 0x10000 ) && ( $unicode_char <= 0x1FFFFF ) )
{
    // 4 Byte UTF-8 Unicode - binary encode data as : 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    $output .= chr( 0xF0 + ( $unicode_char >> 18 ) );
    $output .= chr( 0x80 + ( ( $unicode_char >> 12 ) & 0x3F ) );
    $output .= chr( 0x80 + ( ( $unicode_char >> 6 ) & 0x3F ) );
    $output .= chr( 0x80 + ( $unicode_char & 0x3F ) );
}
else if ( ( $unicode_char >= 0x200000 ) && ( $unicode_char <= 0x3FFFFFF ) )
{
    // 5 Byte UTF-8 Unicode - binary encode data as : 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    $output .= chr( 0xF8 + ( $unicode_char >> 24 ) );
    $output .= chr( 0x80 + ( ( $unicode_char >> 18 ) & 0x3F ) );
    $output .= chr( 0x80 + ( ( $unicode_char >> 12 ) & 0x3F ) );
    $output .= chr( 0x80 + ( ( $unicode_char >> 6 ) & 0x3F ) );
    $output .= chr( 0x80 + ( $unicode_char & 0x3F ) );
}
else if ( ( $unicode_char >= 0x4000000 ) && ( $unicode_char <= 0x7FFFFFFF ) )
{
    // 6 Byte UTF-8 Unicode - binary encode data as : 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    $output .= chr( 0xFC + ( $unicode_char >> 30 ) );
    $output .= chr( 0x80 + ( ( $unicode_char >> 24 ) & 0x3F ) );
    $output .= chr( 0x80 + ( ( $unicode_char >> 18 ) & 0x3F ) );
    $output .= chr( 0x80 + ( ( $unicode_char >> 12 ) & 0x3F ) );
    $output .= chr( 0x80 + ( ( $unicode_char >> 6 ) & 0x3F ) );
    $output .= chr( 0x80 + ( $unicode_char & 0x3F ) );
}
else
{
    // Invalid Code - do nothing
}
  585. }
  586. // Return resulting UTF-8 String
  587. return $output;
  588. }
  589. /******************************************************************************
  590. * End of Function: unicode_array_to_UTF8
  591. ******************************************************************************/
  592. /******************************************************************************
  593. *
  594. * Function: unicode_array_to_UTF16
  595. *
  596. * Description: Converts an array of unicode character numbers to a string
  597. * encoded by UTF-16
  598. *
  599. * Parameters: unicode_array - the array containing unicode character numbers
  600. * MSB_first - True will cause processing as Big Endian UTF-16 (Motorola, MSB first)
  601. * False will cause processing as Little Endian UTF-16 (Intel, LSB first)
  602. *
  603. * Returns: output - the UTF-16 encoded string representing the data
  604. *
  605. ******************************************************************************/
  606. function unicode_array_to_UTF16( $unicode_array, $MSB_first )
  607. {
  608. // Create a string to receive the UTF-16 output
  609. $output = "";
  610. // Cycle through each Unicode character number
  611. foreach( $unicode_array as $unicode_char )
  612. {
  613. // Check which range the current unicode character lies in
// Bit shifts and masks are used instead of division and modulus below,
// so that the values passed to chr() stay integers
if ( ( ( $unicode_char >= 0x0000 ) && ( $unicode_char <= 0xD7FF ) ) ||
     ( ( $unicode_char >= 0xE000 ) && ( $unicode_char <= 0xFFFF ) ) )
{
    // Normal 16 Bit Character (Not a Surrogate Pair)
    // Check what byte order should be used
    if ( $MSB_first )
    {
        // Big Endian
        $output .= chr( $unicode_char >> 8 ) . chr( $unicode_char & 0xFF );
    }
    else
    {
        // Little Endian
        $output .= chr( $unicode_char & 0xFF ) . chr( $unicode_char >> 8 );
    }
}
else if ( ( $unicode_char >= 0x10000 ) && ( $unicode_char <= 0x10FFFF ) )
{
    // Surrogate Pair required
    // Calculate Surrogates
    $High_Surrogate = ( ( $unicode_char - 0x10000 ) >> 10 ) + 0xD800;
    $Low_Surrogate = ( ( $unicode_char - 0x10000 ) & 0x3FF ) + 0xDC00;
    // Check what byte order should be used
    if ( $MSB_first )
    {
        // Big Endian
        $output .= chr( $High_Surrogate >> 8 ) . chr( $High_Surrogate & 0xFF );
        $output .= chr( $Low_Surrogate >> 8 ) . chr( $Low_Surrogate & 0xFF );
    }
    else
    {
        // Little Endian
        $output .= chr( $High_Surrogate & 0xFF ) . chr( $High_Surrogate >> 8 );
        $output .= chr( $Low_Surrogate & 0xFF ) . chr( $Low_Surrogate >> 8 );
    }
}
else
{
    // Invalid UTF-16 codepoint
    // Unicode values between 0xD800 and 0xDFFF are reserved for surrogates,
    // and values above 0x10FFFF are outside the range of UTF-16
    // Do not output this point - there is no way to encode it in UTF-16
}
  656. }
  657. // Return resulting UTF-16 String
  658. return $output;
  659. }
  660. /******************************************************************************
  661. * End of Function: unicode_array_to_UTF16
  662. ******************************************************************************/
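// Illustrative sketch (not part of the original toolkit): the encoding step
// can also be written with PHP's built-in pack(), whose 'n' and 'v' format
// codes write unsigned 16 bit values in big and little endian byte order.
// The function name is hypothetical. As above, values that cannot be
// encoded in UTF-16 are silently skipped.

```php
<?php
// Hypothetical helper for illustration only.
function example_codepoints_to_utf16(array $codepoints, bool $msbFirst): string
{
    $format = $msbFirst ? 'n' : 'v';
    $out = '';
    foreach ($codepoints as $cp) {
        if (($cp >= 0x0000 && $cp <= 0xD7FF) || ($cp >= 0xE000 && $cp <= 0xFFFF)) {
            // Normal 16 bit character
            $out .= pack($format, $cp);
        } elseif ($cp >= 0x10000 && $cp <= 0x10FFFF) {
            // Surrogate pair required: split the 20 bit offset into 10 + 10 bits
            $offset = $cp - 0x10000;
            $out .= pack($format . '2', 0xD800 + ($offset >> 10), 0xDC00 + ($offset & 0x3FF));
        }
        // Surrogate values and anything above 0x10FFFF are skipped
    }
    return $out;
}

echo bin2hex(example_codepoints_to_utf16([0x41, 0x1F600], true)), "\n";  // 0041d83dde00
echo bin2hex(example_codepoints_to_utf16([0x41, 0x1F600], false)), "\n"; // 41003dd800de
```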
  663. /******************************************************************************
  664. *
  665. * Function: xml_UTF8_clean
  666. *
  667. * Description: XML has specific requirements about the characters that are
  668. * allowed, and characters that must be escaped.
  669. * This function ensures that all characters in the given string
  670. * are valid, and that characters such as Quotes, Greater than,
  671. * Less than and Ampersand are properly escaped. Newlines and Tabs
  672. * are also escaped.
  673. * Note - Do not use this on constructed XML which includes tags,
  674. * as it will escape the tags. It is designed to be used
  675. * on the tag and attribute names, attribute values, and text.
  676. *
  677. * Parameters: utf8_text - a string containing the UTF-8 data
  678. *
* Returns: output - the cleaned and escaped UTF-8 string, valid for use in XML
  680. *
  681. ******************************************************************************/
  682. function xml_UTF8_clean( $UTF8_text )
  683. {
  684. // Ensure that the Unicode UTF8 encoding is valid.
  685. $UTF8_text = UTF8_fix( $UTF8_text );
  686. // XML only allows characters in the following unicode ranges
  687. // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
// Hence we need to delete any characters that don't fit this
  689. // Convert the UTF-8 string to an array of unicode character numbers
  690. $unicode_array = UTF8_to_unicode_array( $UTF8_text );
  691. // Create a new array to receive the valid unicode character numbers
  692. $new_unicode_array = array( );
  693. // Cycle through the unicode character numbers
  694. foreach( $unicode_array as $unichar )
  695. {
  696. // Check if the unicode character number is valid for XML
  697. if ( ( $unichar == 0x09 ) ||
  698. ( $unichar == 0x0A ) ||
  699. ( $unichar == 0x0D ) ||
  700. ( ( $unichar >= 0x20 ) && ( $unichar <= 0xD7FF ) ) ||
  701. ( ( $unichar >= 0xE000 ) && ( $unichar <= 0xFFFD ) ) ||
  702. ( ( $unichar >= 0x10000 ) && ( $unichar <= 0x10FFFF ) ) )
  703. {
  704. // Unicode character is valid for XML - add it to the valid characters array
  705. $new_unicode_array[] = $unichar;
  706. }
  707. }
  708. // Convert the array of valid unicode character numbers back to UTF-8 encoded text
  709. $UTF8_text = unicode_array_to_UTF8( $new_unicode_array );
  710. // Escape any special HTML characters present
  711. $UTF8_text = htmlspecialchars ( $UTF8_text, ENT_QUOTES );
  712. // Escape CR, LF and TAB characters, so that they are kept and not treated as expendable white space
  713. $trans = array( "\x09" => "&#x09;", "\x0A" => "&#x0A;", "\x0D" => "&#x0D;" );
  714. $UTF8_text = strtr( $UTF8_text, $trans );
  715. // Return the resulting XML valid string
  716. return $UTF8_text;
  717. }
  718. /******************************************************************************
  719. * End of Function: xml_UTF8_clean
  720. ******************************************************************************/
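// Illustrative sketch (not part of the original toolkit): the same three
// steps (remove characters XML does not allow, escape the reserved
// characters, escape TAB, LF and CR) written with a single PCRE character
// class. The function name is hypothetical, and the sketch assumes the input
// is already valid UTF-8; preg_replace() returns null otherwise, in which
// case an empty string is returned here.

```php
<?php
// Hypothetical helper for illustration only.
function example_xml_clean(string $utf8): string
{
    // Remove every character outside the ranges XML 1.0 allows:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    $clean = preg_replace(
        '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
        '',
        $utf8
    );
    if ($clean === null) {
        // The input was not valid UTF-8
        return '';
    }
    // Escape quotes, greater than, less than and ampersand
    $clean = htmlspecialchars($clean, ENT_QUOTES, 'UTF-8');
    // Escape TAB, LF and CR so they are not treated as expendable white space
    return strtr($clean, ["\x09" => '&#x09;', "\x0A" => '&#x0A;', "\x0D" => '&#x0D;']);
}

echo example_xml_clean("a\x01<b>\tc"), "\n"; // a&lt;b&gt;&#x09;c
```

// The control character 0x01 is removed, the angle brackets are escaped and
// the tab is kept as a numeric character reference.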
  721. /******************************************************************************
  722. *
  723. * Function: xml_UTF16_clean
  724. *
  725. * Description: XML has specific requirements about the characters that are
  726. * allowed, and characters that must be escaped.
  727. * This function ensures that all characters in the given string
  728. * are valid, and that characters such as Quotes, Greater than,
  729. * Less than and Ampersand are properly escaped. Newlines and Tabs
  730. * are also escaped.
  731. * Note - Do not use this on constructed XML which includes tags,
  732. * as it will escape the tags. It is designed to be used
  733. * on the tag and attribute names, attribute values, and text.
  734. *
  735. * Parameters: utf16_text - a string containing the UTF-16 data
  736. * MSB_first - True will cause processing as Big Endian UTF-16 (Motorola, MSB first)
  737. * False will cause processing as Little Endian UTF-16 (Intel, LSB first)
  738. *
* Returns: output - the cleaned and escaped UTF-16 string, valid for use in XML
  740. *
  741. ******************************************************************************/
  742. function xml_UTF16_clean( $UTF16_text, $MSB_first )
  743. {
  744. // Ensure that the Unicode UTF16 encoding is valid.
  745. $UTF16_text = UTF16_fix( $UTF16_text, $MSB_first );
  746. // XML only allows characters in the following unicode ranges
  747. // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
// Hence we need to delete any characters that don't fit this
  749. // Convert the UTF-16 string to an array of unicode character numbers
  750. $unicode_array = UTF16_to_unicode_array( $UTF16_text, $MSB_first );
  751. // Create a new array to receive the valid unicode character numbers
  752. $new_unicode_array = array( );
  753. // Cycle through the unicode character numbers
  754. foreach( $unicode_array as $unichar )
  755. {
  756. // Check if the unicode character number is valid for XML
  757. if ( ( $unichar == 0x09 ) ||
  758. ( $unichar == 0x0A ) ||
  759. ( $unichar == 0x0D ) ||
  760. ( ( $unichar >= 0x20 ) && ( $unichar <= 0xD7FF ) ) ||
  761. ( ( $unichar >= 0xE000 ) && ( $unichar <= 0xFFFD ) ) ||
  762. ( ( $unichar >= 0x10000 ) && ( $unichar <= 0x10FFFF ) ) )
  763. {
  764. // Unicode character is valid for XML - add it to the valid characters array
  765. $new_unicode_array[] = $unichar;
  766. }
  767. }
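// htmlspecialchars and strtr work on single bytes, so applying them directly
// to UTF-16 text would insert 8 bit entities between 16 bit values and corrupt it.
// Hence the escaping is done on a UTF-8 copy, which is then converted back.
// Convert the array of valid unicode character numbers to UTF-8 encoded text
$UTF8_text = unicode_array_to_UTF8( $new_unicode_array );
// Escape any special HTML characters present
$UTF8_text = htmlspecialchars( $UTF8_text, ENT_QUOTES, 'UTF-8' );
// Escape CR, LF and TAB characters, so that they are kept and not treated as expendable white space
$trans = array( "\x09" => "&#x09;", "\x0A" => "&#x0A;", "\x0D" => "&#x0D;" );
$UTF8_text = strtr( $UTF8_text, $trans );
// Convert the escaped text back to UTF-16
$UTF16_text = unicode_array_to_UTF16( UTF8_to_unicode_array( $UTF8_text ), $MSB_first );
// Return the resulting XML valid string
return $UTF16_text;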
  777. }
  778. /******************************************************************************
  779. * End of Function: xml_UTF16_clean
  780. ******************************************************************************/
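The range check used by xml_UTF16_clean can be tried on its own. The following is a minimal sketch rather than toolkit code: the helper name is_xml_char_sketch is hypothetical, and it works on an array of code point numbers instead of an encoded string.

```php
<?php
// Hypothetical helper (not part of the toolkit): returns true when the
// code point falls in one of the ranges listed in the comment above
function is_xml_char_sketch( $codepoint )
{
        return ( $codepoint == 0x09 ) ||
               ( $codepoint == 0x0A ) ||
               ( $codepoint == 0x0D ) ||
               ( ( $codepoint >= 0x20 ) && ( $codepoint <= 0xD7FF ) ) ||
               ( ( $codepoint >= 0xE000 ) && ( $codepoint <= 0xFFFD ) ) ||
               ( ( $codepoint >= 0x10000 ) && ( $codepoint <= 0x10FFFF ) );
}

// 'H', NUL, 'i', vertical tab, line feed, U+FFFE, U+1F600
$input = array( 0x48, 0x00, 0x69, 0x0B, 0x0A, 0xFFFE, 0x1F600 );

// Keep only the code points that XML allows, re-indexing the result
$clean = array_values( array_filter( $input, 'is_xml_char_sketch' ) );
```

Here NUL, the vertical tab and U+FFFE are dropped, while the line feed and the supplementary plane character are kept.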
/******************************************************************************
*
* Function:     HTML_UTF8_Escape
*
* Description:  An HTML page can display UTF-8 data properly if it has a
*               META http-equiv="Content-Type" tag with the content attribute
*               including the value: "charset=utf-8".
*               Otherwise the ISO-8859-1 character set is usually assumed, and
*               Unicode values above 0x7F must be escaped.
*               This function takes a UTF-8 encoded string and escapes the
*               characters above 0x7F as well as reserved HTML characters such
*               as Quotes, Greater than, Less than and Ampersand.
*
* Parameters:   UTF8_text - a string containing the UTF-8 data
*
* Returns:      htmloutput - a string containing the HTML equivalent
*
******************************************************************************/
function HTML_UTF8_Escape( $UTF8_text )
{
        // Ensure that the Unicode UTF8 encoding is valid.
        $UTF8_text = UTF8_fix( $UTF8_text );

        // Change: changed to use smart_htmlspecialchars, so that characters which were already escaped would remain intact, as of revision 1.10
        // Escape any special HTML characters present
        // Note: smart_htmlspecialchars is declared with a single parameter,
        // so the ENT_QUOTES argument passed here has no effect
        $UTF8_text = smart_htmlspecialchars( $UTF8_text, ENT_QUOTES );

        // Convert the UTF-8 string to an array of unicode character numbers
        $unicode_array = UTF8_to_unicode_array( $UTF8_text );

        // Create a string to receive the escaped HTML
        $htmloutput = "";

        // Cycle through the unicode character numbers
        foreach( $unicode_array as $unichar )
        {
                // Check if the character needs to be escaped
                if ( ( $unichar >= 0x00 ) && ( $unichar <= 0x7F ) )
                {
                        // Character is 0x7F or below - add it to the html as is
                        $htmloutput .= chr( $unichar );
                }
                else
                {
                        // Character is above 0x7F - escape it and add it to the html
                        $htmloutput .= "&#x" . dechex($unichar) . ";";
                }
        }

        // Return the resulting escaped HTML
        return $htmloutput;
}

/******************************************************************************
* End of Function:     HTML_UTF8_Escape
******************************************************************************/
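The numeric escaping step can be illustrated without the rest of the toolkit. This is a minimal sketch under stated assumptions: both function names are hypothetical, the input is assumed to be well-formed UTF-8 already (the toolkit runs UTF8_fix first), closures require PHP 5.3 or later, and reserved HTML characters are not handled here.

```php
<?php
// Hypothetical helper: decode one well-formed UTF-8 character (1 to 4 bytes)
// into its unicode character number
function utf8_char_to_codepoint_sketch( $char )
{
        $bytes = array_values( unpack( 'C*', $char ) );

        switch ( count( $bytes ) )
        {
                case 2:
                        return ( ( $bytes[0] & 0x1F ) << 6 ) | ( $bytes[1] & 0x3F );
                case 3:
                        return ( ( $bytes[0] & 0x0F ) << 12 ) | ( ( $bytes[1] & 0x3F ) << 6 ) |
                               ( $bytes[2] & 0x3F );
                case 4:
                        return ( ( $bytes[0] & 0x07 ) << 18 ) | ( ( $bytes[1] & 0x3F ) << 12 ) |
                               ( ( $bytes[2] & 0x3F ) << 6 ) | ( $bytes[3] & 0x3F );
                default:
                        return $bytes[0];
        }
}

// Hypothetical helper: replace every character above 0x7F with a
// hexadecimal numeric character reference, leaving ASCII untouched
function escape_non_ascii_sketch( $utf8_text )
{
        return preg_replace_callback(
                '/[^\x00-\x7F]/u',
                function ( $match )
                {
                        return "&#x" . dechex( utf8_char_to_codepoint_sketch( $match[0] ) ) . ";";
                },
                $utf8_text );
}

// "caf" + U+00E9 + " " + U+20AC + "5"
$escaped = escape_non_ascii_sketch( "caf\xC3\xA9 \xE2\x82\xAC5" );
```

With the /u modifier the pattern matches whole UTF-8 characters, so each multi-byte sequence is converted to a single reference.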
/******************************************************************************
*
* Function:     HTML_UTF8_UnEscape
*
* Description:  Converts HTML which contains escaped decimal or hex characters
*               into UTF-8 text
*
* Parameters:   HTML_text - a string containing the HTML text to convert
*
* Returns:      utfoutput - a string containing the UTF-8 equivalent
*
******************************************************************************/
function HTML_UTF8_UnEscape( $HTML_text )
{
        // Find all decimal escapes, e.g. &#233;
        preg_match_all( "/&#(\d+);/", $HTML_text, $matches );

        // Find all hexadecimal escapes, e.g. &#xE9;
        // (a character class needs no '|' separators; inside the brackets
        // a '|' would be matched as a literal character)
        preg_match_all( "/&#[xX]([0-9A-Fa-f]+);/", $HTML_text, $hexmatches );

        // Add the hexadecimal escapes to the list of matches, converted to decimal
        foreach( $hexmatches[1] as $index => $match )
        {
                $matches[0][] = $hexmatches[0][$index];
                $matches[1][] = hexdec( $match );
        }

        // Replace each escape with its UTF-8 equivalent
        $match_count = count( $matches[ 0 ] );
        for ( $i = 0; $i < $match_count; $i++ )
        {
                $trans = array( $matches[0][$i] => unicode_array_to_UTF8( array( $matches[1][$i] ) ) );
                $HTML_text = strtr( $HTML_text , $trans );
        }

        return $HTML_text;
}

/******************************************************************************
* End of Function:     HTML_UTF8_UnEscape
******************************************************************************/
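An alternative way to unescape is to handle decimal and hexadecimal references in a single pass, so that text produced by one replacement is never scanned again. The sketch below is an illustration only and is not how the toolkit does it: both function names are hypothetical, closures require PHP 5.3 or later, and surrogate values are not rejected.

```php
<?php
// Hypothetical helper: encode one unicode character number as UTF-8 bytes
function codepoint_to_utf8_sketch( $cp )
{
        if ( $cp < 0x80 )
        {
                return chr( $cp );
        }
        if ( $cp < 0x800 )
        {
                return chr( 0xC0 | ( $cp >> 6 ) ) . chr( 0x80 | ( $cp & 0x3F ) );
        }
        if ( $cp < 0x10000 )
        {
                return chr( 0xE0 | ( $cp >> 12 ) ) . chr( 0x80 | ( ( $cp >> 6 ) & 0x3F ) ) .
                       chr( 0x80 | ( $cp & 0x3F ) );
        }
        return chr( 0xF0 | ( $cp >> 18 ) ) . chr( 0x80 | ( ( $cp >> 12 ) & 0x3F ) ) .
               chr( 0x80 | ( ( $cp >> 6 ) & 0x3F ) ) . chr( 0x80 | ( $cp & 0x3F ) );
}

// Hypothetical helper: convert &#NNN; and &#xHHH; references to UTF-8
function unescape_numeric_refs_sketch( $html_text )
{
        return preg_replace_callback(
                '/&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));/',
                function ( $match )
                {
                        // Group 1 holds hex digits and is empty for a decimal reference
                        $cp = ( $match[1] !== '' ) ? hexdec( $match[1] ) : (int) $match[2];

                        // Leave references beyond the Unicode range untouched
                        if ( $cp > 0x10FFFF )
                        {
                                return $match[0];
                        }
                        return codepoint_to_utf8_sketch( $cp );
                },
                $html_text );
}

$decoded = unescape_numeric_refs_sketch( "caf&#233; &#x20AC;5 &amp;" );
```

Named entities such as &amp; are left alone, and an input like "&#38;#65;" becomes "&#65;" rather than being decoded a second time.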
/******************************************************************************
*
* Function:     HTML_UTF16_Escape
*
* Description:  An HTML page can display UTF-16 data properly if it has a
*               META http-equiv="Content-Type" tag with the content attribute
*               including the value: "charset=utf-16".
*               Otherwise the ISO-8859-1 character set is usually assumed, and
*               Unicode values above 0x7F must be escaped.
*               This function takes a UTF-16 encoded string and escapes the
*               characters above 0x7F as well as reserved HTML characters such
*               as Quotes, Greater than, Less than and Ampersand.
*
* Parameters:   UTF16_text - a string containing the UTF-16 data
*               MSB_first - True will cause processing as Big Endian UTF-16 (Motorola, MSB first)
*                           False will cause processing as Little Endian UTF-16 (Intel, LSB first)
*
* Returns:      htmloutput - a string containing the HTML equivalent
*
******************************************************************************/
function HTML_UTF16_Escape( $UTF16_text, $MSB_first )
{
        // Ensure that the Unicode UTF16 encoding is valid.
        $UTF16_text = UTF16_fix( $UTF16_text, $MSB_first );

        // Change: changed to use smart_htmlspecialchars, so that characters which were already escaped would remain intact, as of revision 1.10
        // Escape any special HTML characters present
        // Note: this replacement is applied to the raw bytes of the string,
        // before the UTF-16 text is converted to unicode character numbers
        $UTF16_text = smart_htmlspecialchars( $UTF16_text );

        // Convert the UTF-16 string to an array of unicode character numbers
        $unicode_array = UTF16_to_unicode_array( $UTF16_text, $MSB_first );

        // Create a string to receive the escaped HTML
        $htmloutput = "";

        // Cycle through the unicode character numbers
        foreach( $unicode_array as $unichar )
        {
                // Check if the character needs to be escaped
                if ( ( $unichar >= 0x00 ) && ( $unichar <= 0x7F ) )
                {
                        // Character is 0x7F or below - add it to the html as is
                        $htmloutput .= chr( $unichar );
                }
                else
                {
                        // Character is above 0x7F - escape it and add it to the html
                        $htmloutput .= "&#x" . dechex($unichar) . ";";
                }
        }

        // Return the resulting escaped HTML
        return $htmloutput;
}

/******************************************************************************
* End of Function:     HTML_UTF16_Escape
******************************************************************************/
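The effect of the MSB_first flag can be shown with a small standalone decoder. This is a sketch under assumptions, not the toolkit's UTF16_to_unicode_array: both function names are hypothetical, the input is assumed to have an even number of bytes, unpaired surrogates are passed through unchanged, and reserved HTML characters are not handled.

```php
<?php
// Hypothetical helper: split UTF-16 text into unicode character numbers,
// combining surrogate pairs into a single number
function utf16_to_codepoints_sketch( $utf16_text, $msb_first )
{
        // 'n' reads 16 bit big endian units, 'v' reads 16 bit little endian units
        $units = array_values( unpack( ( $msb_first ? 'n' : 'v' ) . '*', $utf16_text ) );
        $unit_count = count( $units );
        $codepoints = array( );

        for ( $i = 0; $i < $unit_count; $i++ )
        {
                $unit = $units[$i];

                // A high surrogate followed by a low surrogate encodes one character
                if ( ( $unit >= 0xD800 ) && ( $unit <= 0xDBFF ) && ( $i + 1 < $unit_count ) &&
                     ( $units[$i + 1] >= 0xDC00 ) && ( $units[$i + 1] <= 0xDFFF ) )
                {
                        $codepoints[] = 0x10000 + ( ( $unit - 0xD800 ) << 10 ) + ( $units[$i + 1] - 0xDC00 );
                        $i++;
                }
                else
                {
                        $codepoints[] = $unit;
                }
        }

        return $codepoints;
}

// Hypothetical helper: escape every character above 0x7F as a hex reference
function utf16_escape_sketch( $utf16_text, $msb_first )
{
        $htmloutput = "";
        foreach( utf16_to_codepoints_sketch( $utf16_text, $msb_first ) as $unichar )
        {
                $htmloutput .= ( $unichar <= 0x7F ) ? chr( $unichar ) : "&#x" . dechex( $unichar ) . ";";
        }
        return $htmloutput;
}

// "A", U+00E9 and U+1F600 in big endian and little endian byte order
$from_big_endian    = utf16_escape_sketch( "\x00\x41\x00\xE9\xD8\x3D\xDE\x00", true );
$from_little_endian = utf16_escape_sketch( "\x41\x00\xE9\x00\x3D\xD8\x00\xDE", false );
```

Both calls give the same ASCII result, since the byte order only affects how the 16 bit units are read.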
/******************************************************************************
*
* Function:     HTML_UTF16_UnEscape
*
* Description:  Converts HTML which contains escaped decimal or hex characters
*               into UTF-16 text
*
* Parameters:   HTML_text - a string containing the HTML text to be converted
*               MSB_first - True will cause processing as Big Endian UTF-16 (Motorola, MSB first)
*                           False will cause processing as Little Endian UTF-16 (Intel, LSB first)
*
* Returns:      utfoutput - a string containing the UTF-16 equivalent
*
******************************************************************************/
function HTML_UTF16_UnEscape( $HTML_text, $MSB_first )
{
        // Unescape to UTF-8 first, then convert the result to UTF-16
        $utf8_text = HTML_UTF8_UnEscape( $HTML_text );

        return unicode_array_to_UTF16( UTF8_to_unicode_array( $utf8_text ), $MSB_first );
}

/******************************************************************************
* End of Function:     HTML_UTF16_UnEscape
******************************************************************************/
/******************************************************************************
*
* Function:     smart_HTML_Entities
*
* Description:  Performs the same function as htmlentities, but leaves entities
*               that are already escaped intact.
*
* Parameters:   HTML_text - a string containing the HTML text to be escaped
*
* Returns:      HTML_text_out - a string containing the escaped HTML text
*
******************************************************************************/
function smart_HTML_Entities( $HTML_text )
{
        // Get a table containing the HTML entities translations
        $translation_table = get_html_translation_table( HTML_ENTITIES );

        // Change the ampersand to translate to itself, to avoid getting &amp;
        $translation_table[ chr(38) ] = '&';

        // Perform replacements
        // Regular expression says: find an ampersand, check the text after it,
        // if the text after it is not one of the following, then replace the ampersand
        // with &amp;
        // a) up to 4 letters (upper or lower case) followed by 2 or 3 word characters, then a semicolon
        // b) a hash symbol, then between 2 and 7 digits
        // c) a hash symbol, an 'x' character, then between 2 and 7 hexadecimal digits
        // d) a hash symbol, an 'X' character, then between 2 and 7 hexadecimal digits, then a semicolon
        return preg_replace( "/&(?![A-Za-z]{0,4}\w{2,3};|#[0-9]{2,7}|#x[0-9A-Fa-f]{2,7}|#X[0-9A-Fa-f]{2,7};)/","&amp;" , strtr( $HTML_text, $translation_table ) );
}

/******************************************************************************
* End of Function:     smart_HTML_Entities
******************************************************************************/
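The negative lookahead technique used above can be tried in isolation. The following is a minimal sketch, not the toolkit's expression: the function name is hypothetical, and the pattern is deliberately stricter than the one above, in that every entity form must end with a semicolon and entity names may be of any length.

```php
<?php
// Hypothetical helper: escape only those ampersands which do not already
// begin a named, decimal or hexadecimal entity
function escape_bare_ampersands_sketch( $html_text )
{
        return preg_replace(
                '/&(?![A-Za-z][A-Za-z0-9]*;|#[0-9]+;|#[xX][0-9A-Fa-f]+;)/',
                '&amp;',
                $html_text );
}

$escaped_text = escape_bare_ampersands_sketch( "Fish & Chips &amp; peas &#233; &#xE9; &bogus" );
```

The lookahead only inspects the text after each ampersand without consuming it, so existing entities pass through unchanged while the bare ampersand and the unterminated "&bogus" are escaped.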
/******************************************************************************
*
* Function:     smart_htmlspecialchars
*
* Description:  Performs the same function as htmlspecialchars, but leaves characters
*               that are already escaped intact.
*
* Parameters:   HTML_text - a string containing the HTML text to be escaped
*
* Returns:      HTML_text_out - a string containing the escaped HTML text
*
******************************************************************************/
function smart_htmlspecialchars( $HTML_text )
{
        // Get a table containing the HTML special characters translations
        $translation_table = get_html_translation_table( HTML_SPECIALCHARS );

        // Change the ampersand to translate to itself, to avoid getting &amp;
        $translation_table[ chr(38) ] = '&';
