
/documentation/3. ХЭРЭГЖҮҮЛЭЛТ.ipynb

https://gitlab.com/enhush.toy/ai-sentiment
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# IMPLEMENTATION\n",
"Using a corpus of 50,000 user reviews and ratings collected from the movie site IMDb, we trained a classifier based on a logistic regression model to label the sentiment of incoming reviews [5].\n",
"## Technologies used\n",
" - Python 3\n",
" - NumPy 1.11.0\n",
" - SciPy 0.17.1\n",
" - matplotlib 1.5.1\n",
" - scikit-learn 0.17.1\n",
" - nltk 3.2.1\n",
" - Flask 0.10.1\n",
"\n",
"## Workflow\n",
" 1. Reading the data\n",
" 2. Cleaning the text\n",
" 3. Training a classifier with logistic regression\n",
" 4. Saving the trained classifier to a file\n",
" 5. Testing\n",
" 6. Web application"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1. Reading the data\n",
"The full 85 MB dataset was downloaded from http://ai.stanford.edu/~amaas/data/sentiment/ and reshaped into a form suitable for training the model."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"0%                          100%\n",
"[##############################] | ETA: 00:00:00\n",
"Total time elapsed: 00:06:35\n"
]
}
],
"source": [
"import pyprind\n",
"import pandas\n",
"import os\n",
"\n",
"basepath = '../aclImdb'\n",
"\n",
"labels = {'pos': 1, 'neg': 0}\n",
"progressBar = pyprind.ProgBar(50000)\n",
"dataFrame = pandas.DataFrame()\n",
"for subpath1 in ('test', 'train'):\n",
"    for subpath2 in ('pos', 'neg'):\n",
"        path = os.path.join(basepath, subpath1, subpath2)\n",
"        for file in os.listdir(path):\n",
"            with open(os.path.join(path, file), 'r', encoding='utf-8') as inputFile:\n",
"                txt = inputFile.read()\n",
"            dataFrame = dataFrame.append([[txt, labels[subpath2]]], ignore_index=True)\n",
"            progressBar.update()\n",
"dataFrame.columns = ['review', 'sentiment']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The loaded data was then shuffled randomly."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import numpy\n",
"\n",
"numpy.random.seed(0)\n",
"dataFrame = dataFrame.reindex(numpy.random.permutation(dataFrame.index))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The shuffled data was saved as a CSV file."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"dataFrame.to_csv('../movie_data.csv', index=False, encoding='utf-8')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sample row of the prepared data was printed for inspection."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false,
"scrolled": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>review</th>\n",
"      <th>sentiment</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>In 1974, the teenager Martha Moxley (Maggie Gr...</td>\n",
"      <td>1</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"                                              review  sentiment\n",
"0  In 1974, the teenager Martha Moxley (Maggie Gr...          1"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas\n",
"dataFrame = pandas.read_csv('../movie_data.csv', encoding='utf-8')\n",
"dataFrame.head(1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2. Cleaning the text\n",
"HTML tags, words that carry no weight for classification, and similar noise were removed from the saved movie_data.csv file."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A ready-made list of stop words was downloaded and used."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[nltk_data] Downloading package stopwords to\n",
"[nltk_data]     C:\\Users\\User\\AppData\\Roaming\\nltk_data...\n",
"[nltk_data]   Unzipping corpora\\stopwords.zip.\n"
]
},
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import nltk\n",
"nltk.download('stopwords')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A function that strips the irrelevant parts from a text and tokenizes the rest:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import numpy\n",
"import re\n",
"from nltk.corpus import stopwords\n",
"\n",
"def tokenizer(text):\n",
"    text = re.sub('<[^>]*>', '', text)\n",
"    emoticons = re.findall('(?::|;|=)(?:-)?(?:\\)|\\(|D|P)', text.lower())\n",
"    text = re.sub('[\\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')\n",
"    tokenized = [w for w in text.split() if w not in stopwords.words('english')]\n",
"    return tokenized"
]
},
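{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check (an added example, not part of the original run): the tokenizer should strip the HTML tag, drop stop words such as 'this', 'is', 'a', and keep the emoticons with their '-' noses removed."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Illustrative call; the expected value follows from the regexes above.\n",
"tokenizer('</a>This is a :) test! :-(')\n",
"# expected: ['test', ':)', ':(']"
]
},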
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Functions that stream the movie_data.csv file and pull mini-batches from it:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def stream_docs(path):\n",
"    with open(path, 'r', encoding='utf-8') as csv:\n",
"        # skip the header line\n",
"        next(csv)\n",
"        for line in csv:\n",
"            # each line ends with ',<label>\\n', so the last characters\n",
"            # are split off: the review text and its 0/1 label\n",
"            text, label = line[:-3], int(line[-2])\n",
"            yield text, label\n",
"\n",
"def get_minibatch(doc_stream, size):\n",
"    docs, y = [], []\n",
"    try:\n",
"        for _ in range(size):\n",
"            text, label = next(doc_stream)\n",
"            docs.append(text)\n",
"            y.append(label)\n",
"    except StopIteration:\n",
"        return None, None\n",
"    return docs, y"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3. Training a classifier with logistic regression\n",
"Using the preprocessing functions above, the feature vectors were extracted and a logistic regression model was built with scikit-learn's SGDClassifier."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from sklearn.feature_extraction.text import HashingVectorizer\n",
"from sklearn.linear_model import SGDClassifier\n",
"\n",
"vect = HashingVectorizer(decode_error='ignore',\n",
"                         n_features=2**21,\n",
"                         preprocessor=None,\n",
"                         tokenizer=tokenizer)\n",
"\n",
"clf = SGDClassifier(loss='log', random_state=1, n_iter=1)\n",
"doc_stream = stream_docs(path='../movie_data.csv')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The model was trained on the first 45,000 reviews."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"0%                          100%\n",
"[##############################] | ETA: 00:00:00\n",
"Total time elapsed: 00:44:16\n"
]
}
],
"source": [
"import pyprind\n",
"progressbar = pyprind.ProgBar(45)\n",
"\n",
"classes = numpy.array([0, 1])\n",
"for _ in range(45):\n",
"    X_train, y_train = get_minibatch(doc_stream, size=1000)\n",
"    if not X_train:\n",
"        break\n",
"    X_train = vect.transform(X_train)\n",
"    clf.partial_fit(X_train, y_train, classes=classes)\n",
"    progressbar.update()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It was then tested on the remaining 5,000 reviews."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Accuracy: 0.867\n"
]
}
],
"source": [
"X_test, y_test = get_minibatch(doc_stream, size=5000)\n",
"X_test = vect.transform(X_test)\n",
"print('Accuracy: %.3f' % clf.score(X_test, y_test))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4. Saving the trained classifier to a file\n",
"Retraining every time the application starts would waste time, so the finished model was serialized to Python pickle files and reused from there."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import pickle\n",
"import os\n",
"\n",
"dest = os.path.join('../application', 'pkl_objects')\n",
"if not os.path.exists(dest):\n",
"    os.makedirs(dest)\n",
"\n",
"pickle.dump(stopwords.words('english'), open(os.path.join(dest, 'stopwords.pkl'), 'wb'), protocol=4)\n",
"pickle.dump(clf, open(os.path.join(dest, 'classifier.pkl'), 'wb'), protocol=4)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The function that converts raw text into feature vectors was also saved to a separate module."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Writing ../application/vectorizer.py\n"
]
}
],
"source": [
"%%writefile ../application/vectorizer.py\n",
"from sklearn.feature_extraction.text import HashingVectorizer\n",
"import re\n",
"import os\n",
"import pickle\n",
"\n",
"# resolve the pickle path relative to this module, not the working directory\n",
"cur_dir = os.path.dirname(__file__)\n",
"stop = pickle.load(open(\n",
"    os.path.join(cur_dir,\n",
"                 'pkl_objects',\n",
"                 'stopwords.pkl'), 'rb'))\n",
"\n",
"def tokenizer(text):\n",
"    text = re.sub('<[^>]*>', '', text)\n",
"    emoticons = re.findall('(?::|;|=)(?:-)?(?:\\)|\\(|D|P)',\n",
"                           text.lower())\n",
"    text = re.sub('[\\W]+', ' ', text.lower()) \\\n",
"        + ' '.join(emoticons).replace('-', '')\n",
"    tokenized = [w for w in text.split() if w not in stop]\n",
"    return tokenized\n",
"\n",
"vect = HashingVectorizer(decode_error='ignore',\n",
"                         n_features=2**21,\n",
"                         preprocessor=None,\n",
"                         tokenizer=tokenizer)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 5. Testing"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import os\n",
"os.chdir('../application')"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import pickle\n",
"import re\n",
"import os\n",
"from vectorizer import vect\n",
"\n",
"clf = pickle.load(open(os.path.join('pkl_objects', 'classifier.pkl'), 'rb'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, given the review 'I hate this movie. Very bad', the classifier labels it negative with about 97% probability."
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Prediction: negative\n",
"Probability: 97.05%\n"
]
}
],
"source": [
"import numpy as np\n",
"label = {0: 'negative', 1: 'positive'}\n",
"\n",
"example = ['I hate this movie. Very bad']\n",
"X = vect.transform(example)\n",
"print('Prediction: %s\\nProbability: %.2f%%' %\\\n",
"      (label[clf.predict(X)[0]], clf.predict_proba(X).max()*100))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 6. Web application"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using the finished classifier, a small web application was developed that takes a user's review, automatically classifies the attitude it expresses toward the movie (positive/negative, i.e. liked/disliked), lets any user confirm whether the classification was correct (feedback), and keeps training itself on that feedback. A minimal sketch of such an app follows below."
]
},
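{
"cell_type": "markdown",
"metadata": {},
"source": [
"The application itself lives outside this notebook, so the cell below is only an illustration of what a minimal Flask app along these lines could look like: a /classify route that runs the saved model on a submitted review, and a /feedback route that feeds a user-corrected label back in through partial_fit. The file name app.py, the route names, and the form fields are assumptions, not the project's actual code."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%%writefile app.py\n",
"# Illustrative sketch only; assumes it sits next to vectorizer.py\n",
"# and the pkl_objects/ directory inside ../application.\n",
"import os\n",
"import pickle\n",
"\n",
"import numpy as np\n",
"from flask import Flask, request, jsonify\n",
"\n",
"from vectorizer import vect\n",
"\n",
"app = Flask(__name__)\n",
"clf = pickle.load(open(os.path.join('pkl_objects', 'classifier.pkl'), 'rb'))\n",
"label = {0: 'negative', 1: 'positive'}\n",
"\n",
"@app.route('/classify', methods=['POST'])\n",
"def classify():\n",
"    # vectorize the submitted review and return the predicted sentiment\n",
"    review = request.form['review']\n",
"    X = vect.transform([review])\n",
"    y = int(clf.predict(X)[0])\n",
"    proba = clf.predict_proba(X).max()\n",
"    return jsonify(sentiment=label[y], probability=round(proba * 100, 2))\n",
"\n",
"@app.route('/feedback', methods=['POST'])\n",
"def feedback():\n",
"    # update the model incrementally with a user-corrected label (0 or 1)\n",
"    review = request.form['review']\n",
"    y = int(request.form['label'])\n",
"    clf.partial_fit(vect.transform([review]), np.array([y]))\n",
"    return jsonify(ok=True)\n",
"\n",
"if __name__ == '__main__':\n",
"    app.run()"
]
}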
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 0
}