/application/third_party/ar-php/Arabic/Query.php

https://bitbucket.org/sammousa/valuematchbv-ls2
Possible License(s): GPL-2.0, LGPL-2.1, BSD-3-Clause, GPL-3.0, LGPL-3.0
<?php
/**
 * ----------------------------------------------------------------------
 *
 * Copyright (c) 2006-2012 Khaled Al-Sham'aa.
 *
 * http://www.ar-php.org
 *
 * PHP Version 5
 *
 * ----------------------------------------------------------------------
 *
 * LICENSE
 *
 * This program is open source product; you can redistribute it and/or
 * modify it under the terms of the GNU Lesser General Public License (LGPL)
 * as published by the Free Software Foundation; either version 3
 * of the License, or (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU Lesser General Public License for more details.
 *
 * You should have received a copy of the GNU Lesser General Public License
 * along with this program. If not, see <http://www.gnu.org/licenses/lgpl.txt>.
 *
 * ----------------------------------------------------------------------
 *
 * Class Name: Arabic Query Class
 *
 * Filename: Query.php
 *
 * Original Author(s): Khaled Al-Sham'aa <khaled@ar-php.org>
 *
 * Purpose: Build a WHERE condition for an SQL statement using MySQL REGEXP
 * and Arabic lexical rules
 *
 * ----------------------------------------------------------------------
 *
 * Arabic Query Class
 *
 * PHP class that builds a WHERE condition for an SQL statement using MySQL
 * REGEXP and Arabic lexical rules.
 *
 * With the exception of the Qur'an and pedagogical texts, Arabic is generally
 * written without vowels or other graphic symbols that indicate how a word is
 * pronounced. The reader is expected to fill these in from context. Some of the
 * graphic symbols include sukuun, which is placed over a consonant to indicate that
 * it is not followed by a vowel; shadda, written over a consonant to indicate it is
 * doubled; and hamza, the sign of the glottal stop, which can be written above or
 * below (alif) at the beginning of a word, or on (alif), (waaw), (yaa'),
 * or by itself on the line elsewhere. Also, common spelling differences regularly
 * appear, including the use of (haa') for (taa' marbuuta) and (alif maqsuura)
 * for (yaa'). These features of written Arabic, which are also seen in Hebrew as
 * well as other languages written with Arabic script (such as Farsi, Pashto, and
 * Urdu), make analyzing and searching texts quite challenging. In addition, Arabic
 * morphology and grammar are quite rich and present some unique issues for
 * information retrieval applications.
 *
 * There are essentially three ways to search an Arabic text with Arabic queries:
 * literal, stem-based or root-based.
 *
 * A literal search, the simplest search and retrieval method, matches documents
 * based on the search terms exactly as the user entered them. The advantage of this
 * technique is that the documents returned will without a doubt contain the exact
 * term for which the user is looking. But this advantage is also the biggest
 * disadvantage: many, if not most, of the documents containing the terms in
 * different forms will be missed. Given the many ambiguities of written Arabic, the
 * success rate of this method is quite low. For example, if the user searches
 * for (kitaab, book), he or she will not find documents that only
 * contain (`al-kitaabu, the book).
 *
 * Stem-based searching, a more complicated method, requires some normalization of
 * the original texts and the queries. This is done by removing the vowel signs,
 * unifying the hamza forms and removing or standardizing the other signs.
 * Additionally, grammatical affixes and other constructions which attach directly
 * to words, such as conjunctions, prepositions, and the definite article, should be
 * identified and removed. Finally, regular and irregular plural forms need to be
 * identified and reduced to their singular forms. Performing this type of stemming
 * leads to more successful searches, but can be problematic due to over-generation
 * or incorrect generation of stems.
 *
 * A third method for searching Arabic texts is to index and search for the root
 * forms of each word. Since most verbs and nouns in Arabic are derived from
 * triliteral (or, rarely, quadriliteral) roots, identifying the underlying root of
 * each word theoretically retrieves most of the documents containing a given search
 * term regardless of form. However, there are some significant challenges with this
 * approach. Determining the root for a given word is extremely difficult, since it
 * requires a detailed morphological, syntactic and semantic analysis of the text to
 * fully disambiguate the root forms. The issue is complicated further by the fact
 * that not all words are derived from roots. For example, loan words (words
 * borrowed from another language) are not based on root forms, although there are
 * even exceptions to this rule. For example, some loans that have a structure
 * similar to triliteral roots, such as the English word film, are handled
 * grammatically as if they were root-based, adding to the complexity of this type
 * of search. Finally, the root can serve as the foundation for a wide variety of
 * words with related meanings. The root (k-t-b) is used for many words related
 * to writing, including (kataba, to write), (kitaab, book), (maktab,
 * office), and (kaatib, author). But the same root is also used for regiment/
 * battalion, (katiiba). As a result, searching based on root forms results in
 * very high recall, but precision is usually quite low.
 *
 * While search and retrieval of Arabic text will never be an easy task, relying on
 * linguistic analysis tools and methods can help make the process more successful.
 * Ultimately, the search method you choose should depend on how critical it is to
 * retrieve every conceivable instance of a word or phrase and the resources you
 * have to process search returns in order to determine their true relevance.
 *
 * Source: Volume 13 Issue 7 of MultiLingual Computing &
 * Technology published by MultiLingual Computing, Inc., 319 North First Ave.,
 * Sandpoint, Idaho, USA, 208-263-8178, Fax: 208-263-6310.
 *
 * Example:
 * <code>
 *     include('./I18N/Arabic.php');
 *     $obj = new I18N_Arabic('Query');
 *
 *     $dbuser = 'root';
 *     $dbpwd  = '';
 *     $dbname = 'test';
 *
 *     try {
 *         $dbh = new PDO('mysql:host=localhost;dbname='.$dbname, $dbuser, $dbpwd);
 *
 *         // Set the error reporting attribute
 *         $dbh->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
 *
 *         $dbh->exec("SET NAMES 'utf8'");
 *
 *         if (isset($_GET['keyword']) && $_GET['keyword'] != '') {
 *             $keyword = $_GET['keyword'];
 *             $keyword = str_replace('\"', '"', $keyword);
 *
 *             $obj->setStrFields('headline');
 *             $obj->setMode($_GET['mode']);
 *
 *             // Build the condition with the same object configured above
 *             $strCondition = $obj->getWhereCondition($keyword);
 *         } else {
 *             $strCondition = '1';
 *         }
 *
 *         $StrSQL = "SELECT `headline` FROM `aljazeera` WHERE $strCondition";
 *
 *         $i = 0;
 *         foreach ($dbh->query($StrSQL) as $row) {
 *             $headline = $row['headline'];
 *             $i++;
 *             if ($i % 2 == 0) {
 *                 $bg = "#f0f0f0";
 *             } else {
 *                 $bg = "#ffffff";
 *             }
 *             echo "<tr bgcolor=\"$bg\"><td>$headline</td></tr>";
 *         }
 *
 *         // Close the database connection
 *         $dbh = null;
 *
 *     } catch (PDOException $e) {
 *         echo $e->getMessage();
 *     }
 * </code>
 *
 * @category  I18N
 * @package   I18N_Arabic
 * @author    Khaled Al-Sham'aa <khaled@ar-php.org>
 * @copyright 2006-2012 Khaled Al-Sham'aa
 *
 * @license   LGPL <http://www.gnu.org/licenses/lgpl.txt>
 * @link      http://www.ar-php.org
 */

// New in PHP V5.3: Namespaces
// namespace I18N\Arabic;
//
// $obj = new I18N\Arabic\Query();
//
// use I18N\Arabic;
// $obj = new Arabic\Query();
//
// use I18N\Arabic\Query as Query;
// $obj = new Query();

/**
 * This PHP class builds a WHERE condition for an SQL statement using MySQL
 * REGEXP and Arabic lexical rules
 *
 * @category  I18N
 * @package   I18N_Arabic
 * @author    Khaled Al-Sham'aa <khaled@ar-php.org>
 * @copyright 2006-2012 Khaled Al-Sham'aa
 *
 * @license   LGPL <http://www.gnu.org/licenses/lgpl.txt>
 * @link      http://www.ar-php.org
 */
class I18N_Arabic_Query
{
    private $_fields          = array();
    private $_lexPatterns     = array();
    private $_lexReplacements = array();

    // Declared explicitly because the methods below read and write them;
    // kept public to match their previous behavior as dynamic properties.
    public $mode          = 0;
    public $wordCondition = array();

    /**
     * Loads the lexical search patterns and their replacements from
     * data/ArQuery.xml
     */
    public function __construct()
    {
        $xml = simplexml_load_file(dirname(__FILE__).'/data/ArQuery.xml');

        foreach ($xml->xpath("//preg_replace[@function='__construct']/pair")
            as $pair) {
            array_push($this->_lexPatterns, (string)$pair->search);
            array_push($this->_lexReplacements, (string)$pair->replace);
        }
    }

    /**
     * Sets the $_fields array from an array of field names
     *
     * @param array $arrConfig Names of the fields that the SQL statement will
     *                         search (an array whose items are the field names)
     *
     * @return object $this to build a fluent interface
     * @author Khaled Al-Sham'aa <khaled@ar-php.org>
     */
    public function setArrFields($arrConfig)
    {
        if (is_array($arrConfig)) {
            // Store the field names
            $this->_fields = $arrConfig;
        }

        return $this;
    }

    /**
     * Sets the $_fields array from a comma-delimited string
     *
     * @param string $strConfig Names of the fields that the SQL statement will
     *                          search (a string with commas as delimiters)
     *
     * @return object $this to build a fluent interface
     * @author Khaled Al-Sham'aa <khaled@ar-php.org>
     */
    public function setStrFields($strConfig)
    {
        if (is_string($strConfig)) {
            // Split the string into the field names
            $this->_fields = explode(',', $strConfig);
        }

        return $this;
    }

    /**
     * Sets the $mode property, which selects the search mode
     * [0 for OR logic | 1 for AND logic]
     *
     * @param integer $mode Search mode to store in the $mode property
     *
     * @return object $this to build a fluent interface
     * @author Khaled Al-Sham'aa <khaled@ar-php.org>
     */
    public function setMode($mode)
    {
        if (in_array($mode, array('0', '1'))) {
            // Set search mode [0 for OR logic | 1 for AND logic]
            $this->mode = $mode;
        }

        return $this;
    }

    /**
     * Gets the $mode property, which selects the search mode
     * [0 for OR logic | 1 for AND logic]
     *
     * @return integer Value of the $mode property
     * @author Khaled Al-Sham'aa <khaled@ar-php.org>
     */
    public function getMode()
    {
        // Get search mode value [0 for OR logic | 1 for AND logic]
        return $this->mode;
    }

    /**
     * Gets the values of the $_fields array in array format
     *
     * @return array Values of the $_fields array
     * @author Khaled Al-Sham'aa <khaled@ar-php.org>
     */
    public function getArrFields()
    {
        $fields = $this->_fields;

        return $fields;
    }

    /**
     * Gets the values of the $_fields array in string format (comma-delimited)
     *
     * @return string Values of the $_fields array joined with commas
     * @author Khaled Al-Sham'aa <khaled@ar-php.org>
     */
    public function getStrFields()
    {
        $fields = implode(',', $this->_fields);

        return $fields;
    }

    /**
     * Builds the WHERE section of the SQL statement using the defined lexical
     * rules and the search mode [AND | OR]. Phrases (enclosed in "") are
     * matched as they are with a normal LIKE condition.
     *
     * @param string $arg String that the user searches for in the database table
     *
     * @return string The WHERE section of the SQL statement
     *                (MySQL database engine format)
     * @author Khaled Al-Sham'aa <khaled@ar-php.org>
     */
    public function getWhereCondition($arg)
    {
        $sql = '';

        // Reset the conditions so that repeated calls do not accumulate them
        $this->wordCondition = array();

        // mysql_escape_string() was removed in PHP 7.0, so the same set of
        // characters is escaped by hand. Like that function, this is not
        // charset-aware. Escaping is applied per phrase and per word, after
        // splitting on the quotes, so no backslash is left at the end of a phrase.
        $search  = array("\\", "\x00", "\n", "\r", "'", '"', "\x1a");
        $replace = array("\\\\", "\\0", "\\n", "\\r", "\\'", '\\"', "\\Z");

        // Check if there are phrases in $arg that should be handled as they are
        $phrase = explode("\"", $arg);

        if (count($phrase) > 2) {
            // Re-init $arg variable
            // (It will contain the rest of $arg except phrases).
            $arg = '';

            for ($i = 0; $i < count($phrase); $i++) {
                $subPhrase = $phrase[$i];
                if ($i % 2 == 0 && $subPhrase != '') {
                    // Re-build $arg variable after extracting phrases
                    $arg .= $subPhrase;
                } elseif ($i % 2 == 1 && $subPhrase != '') {
                    // Handle phrases using regular LIKE matching in MySQL
                    $subPhrase = str_replace($search, $replace, $subPhrase);
                    $this->wordCondition[] = $this->getWordLike($subPhrase);
                }
            }
        }

        // Handle normal $arg using the lexical rules and regular expressions
        $words = preg_split('/\s+/', trim($arg));

        foreach ($words as $word) {
            //if (is_numeric($word) || strlen($word) > 2) {
                // Take off all the punctuation
                //$word = preg_replace("/\p{P}/", '', $word);
                $exclude = array('(', ')', '[', ']', '{', '}', ',', ';', ':',
                                 '?', '!', '،', '؛', '؟');
                $word    = str_replace($exclude, '', $word);

                // Skip empty words, which carry nothing to search for
                if ($word === '') {
                    continue;
                }

                $word = str_replace($search, $replace, $word);
                $this->wordCondition[] = $this->getWordRegExp($word);
            //}
        }

        if (!empty($this->wordCondition)) {
            if ($this->mode == 0) {
                $sql = '(' . implode(') OR (', $this->wordCondition) . ')';
            } elseif ($this->mode == 1) {
                $sql = '(' . implode(') AND (', $this->wordCondition) . ')';
            }
        }

        return $sql;
    }

    /**
     * Builds the search condition in SQL format for one word in all defined
     * fields using the REGEXP clause and the lexical rules
     *
     * @param string $arg String (one word) that you want to build a condition for
     *
     * @return string sub SQL condition (for internal use)
     * @author Khaled Al-Sham'aa <khaled@ar-php.org>
     */
    protected function getWordRegExp($arg)
    {
        $arg = $this->lex($arg);
        //$sql = implode(" REGEXP '$arg' OR ", $this->_fields) . " REGEXP '$arg'";

        // Strip the tatweel (kashida) from each field before matching
        $sql = ' REPLACE(' .
            implode(", 'ـ', '') REGEXP '$arg' OR REPLACE(", $this->_fields) .
            ", 'ـ', '') REGEXP '$arg'";

        return $sql;
    }

    /**
     * Builds the search condition in SQL format for one word in all defined
     * fields using the normal LIKE clause
     *
     * @param string $arg String (one word) that you want to build a condition for
     *
     * @return string sub SQL condition (for internal use)
     * @author Khaled Al-Sham'aa <khaled@ar-php.org>
     */
    protected function getWordLike($arg)
    {
        $sql = implode(" LIKE '$arg' OR ", $this->_fields) . " LIKE '$arg'";

        return $sql;
    }

    /**
     * Gets an ORDER BY section that ranks rows by how many of the user's
     * search keywords they match. Note that $arg is not escaped here.
     *
     * @param string $arg String that the user searches for in the database table
     *
     * @return string sub SQL ORDER BY section
     * @author Saleh AlMatrafe <saleh@saleh.cc>
     */
    public function getOrderBy($arg)
    {
        // Initialize the list so that it is defined even when $arg is empty
        $wordOrder = array();

        // Check if there are phrases in $arg that should be handled as they are
        $phrase = explode("\"", $arg);

        if (count($phrase) > 2) {
            // Re-init $arg variable
            // (It will contain the rest of $arg except phrases).
            $arg = '';

            for ($i = 0; $i < count($phrase); $i++) {
                if ($i % 2 == 0 && $phrase[$i] != '') {
                    // Re-build $arg variable after extracting phrases
                    $arg .= $phrase[$i];
                } elseif ($i % 2 == 1 && $phrase[$i] != '') {
                    // Handle phrases using regular LIKE matching in MySQL
                    $wordOrder[] = $this->getWordLike($phrase[$i]);
                }
            }
        }

        // Handle normal $arg using the lexical rules and regular expressions
        $words = explode(' ', $arg);

        foreach ($words as $word) {
            if ($word != '') {
                $wordOrder[] = 'CASE WHEN ' .
                               $this->getWordRegExp($word) .
                               ' THEN 1 ELSE 0 END';
            }
        }

        $order = '((' . implode(') + (', $wordOrder) . ')) DESC';

        return $order;
    }

    /**
     * Applies the regular expression rules that implement the pre-defined
     * Arabic lexical rules
     *
     * @param string $arg String of one word that the user wants to search for
     *
     * @return string Regular expression to be used in the MySQL query statement
     * @author Khaled Al-Sham'aa <khaled@ar-php.org>
     */
    protected function lex($arg)
    {
        $arg = preg_replace($this->_lexPatterns, $this->_lexReplacements, $arg);

        return $arg;
    }

    /**
     * Gets the most likely Arabic lexical forms for a given word
     *
     * @param string $word String that the user searches for
     *
     * @return array list of the most likely Arabic lexical forms for the word
     * @author Khaled Al-Sham'aa <khaled@ar-php.org>
     */
    protected function allWordForms($word)
    {
        $wordForms = array($word);

        $postfix1 = array('كم', 'كن', 'نا', 'ها', 'هم', 'هن');
        $postfix2 = array('ين', 'ون', 'ان', 'ات', 'وا');

        $len = mb_strlen($word);

        // Drop the definite article if present
        if (mb_substr($word, 0, 2) == 'ال') {
            $word = mb_substr($word, 2);
        }

        $wordForms[] = $word;

        $str1 = mb_substr($word, 0, -1);
        $str2 = mb_substr($word, 0, -2);
        $str3 = mb_substr($word, 0, -3);

        $last1 = mb_substr($word, -1);
        $last2 = mb_substr($word, -2);
        $last3 = mb_substr($word, -3);

        if ($len >= 6 && $last3 == 'تين') {
            $wordForms[] = $str3;
            $wordForms[] = $str3 . 'ة';
            $wordForms[] = $word . 'ة';
        }

        if ($len >= 6 && ($last3 == 'كما' || $last3 == 'هما')) {
            $wordForms[] = $str3;
            $wordForms[] = $str3 . 'كما';
            $wordForms[] = $str3 . 'هما';
        }

        if ($len >= 5 && in_array($last2, $postfix2)) {
            $wordForms[] = $str2;
            $wordForms[] = $str2.'ة';
            $wordForms[] = $str2.'تين';

            foreach ($postfix2 as $postfix) {
                $wordForms[] = $str2 . $postfix;
            }
        }

        if ($len >= 5 && in_array($last2, $postfix1)) {
            $wordForms[] = $str2;
            $wordForms[] = $str2.'ي';
            $wordForms[] = $str2.'ك';
            $wordForms[] = $str2.'كما';
            $wordForms[] = $str2.'هما';

            foreach ($postfix1 as $postfix) {
                $wordForms[] = $str2 . $postfix;
            }
        }

        if ($len >= 5 && $last2 == 'ية') {
            $wordForms[] = $str1;
            $wordForms[] = $str2;
        }

        if (($len >= 4 && ($last1 == 'ة' || $last1 == 'ه' || $last1 == 'ت'))
            || ($len >= 5 && $last2 == 'ات')
        ) {
            $wordForms[] = $str1;
            $wordForms[] = $str1 . 'ة';
            $wordForms[] = $str1 . 'ه';
            $wordForms[] = $str1 . 'ت';
            $wordForms[] = $str1 . 'ات';
        }

        if ($len >= 4 && $last1 == 'ى') {
            $wordForms[] = $str1 . 'ا';
        }

        // Add a copy of each form with the hamza-carrying alifs unified
        $trans = array('أ' => 'ا', 'إ' => 'ا', 'آ' => 'ا');

        foreach ($wordForms as $word) {
            $normWord = strtr($word, $trans);
            if ($normWord != $word) {
                $wordForms[] = $normWord;
            }
        }

        $wordForms = array_unique($wordForms);

        return $wordForms;
    }

    /**
     * Gets the most likely Arabic lexical forms of the user's search keywords
     *
     * @param string $arg String that the user searches for
     *
     * @return string space-separated list of the most likely Arabic lexical
     *                forms for the given keywords
     * @author Khaled Al-Sham'aa <khaled@ar-php.org>
     */
    public function allForms($arg)
    {
        $wordForms = array();
        $words     = explode(' ', $arg);

        foreach ($words as $word) {
            $wordForms = array_merge($wordForms, $this->allWordForms($word));
        }

        $str = implode(' ', $wordForms);

        return $str;
    }
}
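
The stem-based normalization described in the header comment (removing the vowel signs, unifying the hamza forms, and dropping the definite article) can be sketched as a small standalone function. This is a minimal illustration and not part of I18N_Arabic_Query: the name normalizeArabicWord() and the length threshold used before stripping the article are assumptions made for the example, and it assumes PHP 7 or later with the mbstring extension and UTF-8 input.

```php
<?php
// Hypothetical helper for illustration only; it is not part of ar-php.
function normalizeArabicWord(string $word): string
{
    // Remove the vowel and related signs (U+064B fathatan through U+0652
    // sukun, a range that includes shadda) and the tatweel (U+0640).
    $stripped = preg_replace('/[\x{064B}-\x{0652}\x{0640}]/u', '', $word);
    if ($stripped === null) {
        // preg_replace() failed (for example on invalid UTF-8): give up
        return $word;
    }
    $word = $stripped;

    // Unify alif with hamza above, hamza below, or madda into a bare alif.
    $word = strtr($word, array('أ' => 'ا', 'إ' => 'ا', 'آ' => 'ا'));

    // Drop the definite article when enough letters remain
    // (the threshold of 4 is an assumption for this sketch).
    if (mb_strlen($word, 'UTF-8') > 4
        && mb_substr($word, 0, 2, 'UTF-8') === 'ال'
    ) {
        $word = mb_substr($word, 2, null, 'UTF-8');
    }

    return $word;
}

// A vocalized, definite form and the bare form reduce to the same string.
var_dump(normalizeArabicWord('الْكِتَابُ') === normalizeArabicWord('كتاب'));
```

Applying the same normalization to both the stored text and the query lets a plain comparison match forms that a literal search would miss, at the cost of the over-generation the header comment warns about.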