% Source: CSC586C/Term2/paper.tex
% https://github.com/jordanell/School
% This will be the main document for the Technical Networks paper to
% be written by the Eggnet team of Jordan Ell, Triet Huynh and Braden
% Simpson in association with Adrian Schroeter and Daniela Damian.
\documentclass[conference]{IEEEtran}

% Use of outside images
\usepackage{graphicx}
% Use text inside equations
\usepackage{amsmath}
\usepackage{float}
\floatstyle{plaintop}
\restylefloat{table}

% Correct bad hyphenation here
\hyphenation{op-tical net-works semi-conduc-tor}

% Begin the paper here
\begin{document}

% Paper title
% Can use linebreaks \\ within to get better formatting as desired
\title{Forum-Miner: An Analytical Tool for the Blizzard Game Forums}
% Author names
\author{\IEEEauthorblockN{Jordan Ell}
\IEEEauthorblockA{University of Victoria,
Victoria, British Columbia, Canada \\ jell@uvic.ca}
}

% Make the title area
\maketitle
\begin{abstract}
Internet forums are a popular venue for discussing a favorite piece of software
or even a video game. However, forums themselves do not provide the ability to perform any sort
of deep analytical queries on their information. The best the industry currently has
are community managers who monitor online forums and report back to developers with improvement
suggestions. This manual process is why I have created the website known as ``Forum-Miner'', a
forum analytics tool for the Blizzard game forums that determines what players are talking about
and how those conversations can be used to improve the games. Through the use of web crawling,
Python scripts, Ruby on Rails, the traditional web stack (HTML5, JavaScript, CSS), and a PostgreSQL
database, I have created an easy-to-use website for deep forum analytics which provides visualization
and aggregation of player thoughts.
\end{abstract}
\section{Introduction}
Internet forums are a great means of communication between the end users of a system and the system's
developers. These forums allow users to discuss which aspects of a software system they like,
which parts they are having issues with, and even which features they would like to see in
the future. Internet forums often follow a generic template built around topics, threads, and
comments. A software system may have unique topics such as bug reports and feature requests. Inside
these topics is where threads are found. A thread is created by an individual wishing to express
some idea, and is usually accompanied by a title and some initial body text. Once a thread
is created, other users can post comments inside the thread as per the thread's topic and direction.

The issue with online forums is that their size is quite daunting. If, for example, we look at the online
forums for the video game company Blizzard Entertainment, we see that for each of the five games Blizzard produces,
thousands if not millions of threads have been created and discussed. To analyze these threads as
per developer and business needs, we so far have only human effort. If we look at the Blizzard online
job postings, we can see that they hire ``Community Managers'' who are in charge of sorting through
the thousands of threads to see what players are actually talking about online. This is a terribly
inefficient system.

The goal of this paper is to show a way in which we can monitor online forums for Blizzard's video games
in an automatic fashion and provide developers with the information that they need. To achieve this goal,
I created a website called Forum-Miner (FM) which, by using natural language processing techniques,
is able to sort through thousands if not millions of forum threads and identify trends in conversations,
as well as provide developers with changes to the game that the end users would like to see.

The rest of this paper is laid out as follows. Section~\ref{sec:meth} will outline the technical details
of how the website was made and how it can be used by end users. Section~\ref{sec:fw} will outline
the future work that is planned for this website and how it will change over time to better support
deeper analysis of the Blizzard forums. Finally, Section~\ref{sec:conc} will give a final conclusion of what has
been learned over the course of this project.
\section{Methodology}
\label{sec:meth}
In order to create the website ``Forum-Miner'', several technologies had to be combined to create
the unique set of analysis tools seen in the final product. These tools included Python scripts,
the Ruby on Rails web framework, HTML5 technologies, and natural language processing.
\subsection{Collecting Forum Data}
In order to collect forum data from Blizzard, a few steps needed to be taken. First, how to store the forum
data had to be considered, as it would impact the design decisions for visualization moving forward. I decided
to store the data in a PostgreSQL database as it allowed me to create the final web application I wanted with
greater ease (through Ruby on Rails) than storing it in a NoSQL database or plain files for MapReduce jobs.
Second, I had to figure out a way to actually pull the data down from the Blizzard web pages. After doing some
research, I saw that Blizzard had no REST API available for programmers to use on their forums. This being
the case, I resorted to writing a web crawler in Python, using the library ``BeautifulSoup''
to parse the pages. Once the data could be pulled down, I stored it in the PostgreSQL
database with a very simple schema of threads and comments. I omitted the topics of the forum because I wanted
my final product to be generalizable to all online forums, not just those with defined topic areas.
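The sketch below illustrates the shape of this crawl-and-store step. It is a simplified
reconstruction rather than the exact crawler: the \texttt{requests} library, the CSS class
names, and the column names are assumptions for illustration, while the threads-and-comments
schema matches the description above.

\begin{verbatim}
# Hypothetical sketch of the crawl-and-store step.
# Page selectors and table columns are placeholders.
import requests
import psycopg2
from bs4 import BeautifulSoup

conn = psycopg2.connect(dbname="forum_miner")  # assumed database name

def crawl_thread(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.text, "html.parser")
    title = soup.find("h1").get_text(strip=True)
    posts = [div.get_text(strip=True)
             for div in soup.find_all("div", class_="post-body")]
    with conn, conn.cursor() as cur:
        cur.execute("INSERT INTO threads (title, url) "
                    "VALUES (%s, %s) RETURNING id", (title, url))
        thread_id = cur.fetchone()[0]
        for body in posts:
            cur.execute("INSERT INTO comments (thread_id, body) "
                        "VALUES (%s, %s)", (thread_id, body))
\end{verbatim}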
The only major hiccup along the way was that the web crawler could lose its connection
while running. (I was initially concerned about being IP banned by Blizzard for
using too much bandwidth, but that did not end up being a problem.) To mitigate connection losses,
I throttled the speed at which the crawler visited web pages. This meant
having to run the crawler for longer periods of time to ensure accuracy, and as a result
I did not collect all the data available on the forums. In fact, I crawled roughly 75,000 pages out of a total
of nearly a million.
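A minimal sketch of this throttling, with retries added for dropped connections; the delay
and retry counts shown are illustrative rather than the exact values used.

\begin{verbatim}
import time
import requests

def fetch_with_throttle(url, delay=2.0, retries=3):
    # Throttled fetch: wait between requests, back off on errors.
    for attempt in range(retries):
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()
            time.sleep(delay)  # slow down between pages
            return response.text
        except requests.RequestException:
            time.sleep(delay * (attempt + 1))  # back off, then retry
    return None  # give up on this page and move on
\end{verbatim}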
\subsection{Forum Analysis}
This section describes the different analysis techniques used. Each technique is accompanied by a
screenshot of the resulting web interface.
\subsubsection{Activity}
The activity measure of the online forums is straightforward. I simply took how many comments were created on each
day over the last year and plotted them on a stock-ticker graph using ``HighCharts.js''. This can be seen in
Figure~\ref{fig:activity}. The activity can also be plotted by topic. (Topic identification and search will
be shown later.) Once the user enters a topic he or she would like to learn more about, the same aggregation
happens, only with a filter on that topic: only comments which mention the provided topic are counted against
the daily totals on the graph.
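Conceptually, the aggregation is a per-day count with an optional topic filter, along the
lines of the sketch below (the comment fields shown are illustrative).

\begin{verbatim}
from collections import Counter

def daily_activity(comments, topic=None):
    # Count comments per calendar day, optionally filtered by topic.
    counts = Counter()
    for comment in comments:
        if topic is None or topic.lower() in comment["body"].lower():
            counts[comment["created_at"].date()] += 1
    # HighCharts consumes the series as sorted (date, count) pairs.
    return sorted(counts.items())
\end{verbatim}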
\begin{figure}[h]
\centering
\includegraphics[width=0.5\textwidth]{images/activity.png}
\caption{A screenshot of the activity graph.\label{fig:activity}}
\end{figure}
\subsubsection{Sentiment}
For every comment that came in, every word of the comment was separated out so it could be analyzed on its own.
Each word was assigned a sentiment score using the word list handed out in this class for Assignment 1,
known as AFINN-111. This list contains a variety of English words with scores assigned to them
between -5 and +5. A word with a negative score carries negative sentiment (sad,
angry, etc.), and a word with a positive score carries happy sentiment. Each word in a comment is assigned
a score, and the total score for the comment is the sum of the word scores within it. This final
score is also clamped between the values of -5 and +5 to keep single comments from skewing the results of the
final analysis techniques.
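The scoring step amounts to the following sketch, assuming the published tab-separated
AFINN-111 file format (one word and score per line).

\begin{verbatim}
def load_afinn(path="AFINN-111.txt"):
    # AFINN-111 lines look like: "happy<TAB>3"
    scores = {}
    with open(path) as f:
        for line in f:
            word, score = line.rsplit("\t", 1)
            scores[word] = int(score)
    return scores

def comment_sentiment(text, afinn):
    # Sum the word scores, then clamp to [-5, +5].
    total = sum(afinn.get(word, 0) for word in text.lower().split())
    return max(-5, min(5, total))
\end{verbatim}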
The results of the sentiment analysis are shown in Figure~\ref{fig:sent}. If no topic is specified, all comments
are used in the total sentiment analysis. If, however, a user uses the search bar to provide a topic, only those
comments containing that topic are used.
\begin{figure}[h]
\centering
\includegraphics[width=0.5\textwidth]{images/sent.png}
\caption{A screenshot of the sentiment graph.\label{fig:sent}}
\end{figure}
\subsubsection{Related Topics}
When a user searches for more information about a particular topic on the forum, FM presents them with topics
related to their search terms. To accomplish this, I used the Python library called ``Topia''. Topia
uses part-of-speech tagging to categorize every word from every post as a noun, adjective, verb, etc. Once these
categorizations are completed, I simply filtered the words down to nouns, objects, and verbs. This subset became the
list of topic words for any given comment. Once these keyword topics had been extracted, it was a simple matter
of seeing which keywords are referenced most often alongside the user-provided topic. I ended up limiting the display to the
top 50 keywords so as not to overwhelm the user. I found that the top 10 keywords ended up being the same for most
topics provided by the user, but the related topics ranked 10th--30th were often quite useful. The related topics
can be seen in Figure~\ref{fig:rel}.
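A simplified sketch of this co-occurrence ranking, using the \texttt{topia.termextract}
package's term extractor (comments are assumed to be plain strings here):

\begin{verbatim}
from collections import Counter
from topia.termextract import extract

extractor = extract.TermExtractor()

def related_topics(comments, query, limit=50):
    # Rank keywords by how often they co-occur with the query term.
    counts = Counter()
    for body in comments:
        body = body.lower()
        if query.lower() not in body:
            continue
        for term, occurrences, strength in extractor(body):
            if term != query.lower():
                counts[term] += occurrences
    return [term for term, _ in counts.most_common(limit)]
\end{verbatim}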
\begin{figure}[h]
\centering
\includegraphics[width=0.5\textwidth]{images/rel.png}
\caption{A screenshot of the related-topics tags.\label{fig:rel}}
\end{figure}
\subsubsection{Requirements}
When a user searches for more information about a particular topic, FM will present a list of requirements
as designated by the community surrounding that topic. In Figure~\ref{fig:req}, we can see that when
the user searches for the term Priest, objects that are related to the Priest object in the game come up
with their recommended requirements.
\begin{figure}[h]
\centering
\includegraphics[width=0.5\textwidth]{images/req.png}
\caption{A screenshot of recommended requirements.\label{fig:req}}
\end{figure}
The results seen in Figure~\ref{fig:req} are hard coded for this example, as the actual algorithm implemented
did not work quite as well as expected (I will address this in Section~\ref{sec:fw}). To produce these
results, I used the following algorithm. First, I found which keywords were related to the search term, as
in the related topics section above. Once I had these, I found which sentences in which posts corresponded
to these topics. I then used ``Topia'' once again to label the parts of speech found in
the sentences, and extracted the objects and verb phrases of the sentences. For instance, the sentence
``I think that Mind Control is too strong and that it should be set to 9 mana.'' yields
``Mind Control'' as the object and ``set to 9 mana'' as the verb phrase. This was my initial idea for the requirements
algorithm; however, it yielded output that was very difficult to read and required me to intervene and make the
results readable, in terms of sentence structure, for other users. I have plans to improve this algorithm,
which I will talk about in Section~\ref{sec:fw}.
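The extraction idea can be sketched as follows. The tagger usage follows the
\texttt{topia.termextract} package; the phrase-assembly rules shown are a simplified
reconstruction of the approach, not the exact implementation.

\begin{verbatim}
from topia.termextract import tag

tagger = tag.Tagger()
tagger.initialize()

def extract_requirement(sentence):
    tagged = tagger(sentence)  # [[word, POS, normalized], ...]
    # Take the run of nouns before the first verb as the object,
    # e.g. "Mind Control".
    obj = []
    for word, pos, _ in tagged:
        if pos.startswith("NN"):
            obj.append(word)
        elif pos.startswith("VB") and obj:
            break
    # Take everything from the first modal ("should") onward as the
    # verb phrase, e.g. "should be set to 9 mana".
    phrase = []
    for word, pos, _ in tagged:
        if phrase or pos == "MD":
            phrase.append(word)
    return " ".join(obj), " ".join(phrase)
\end{verbatim}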
\subsection{Visualization}
To visualize all of the previous analysis techniques, I used a variety of tools. I first used Ruby
on Rails to create a web application to host all of my findings. Ruby on Rails allows for quick prototyping
of websites, so it was an easy selection for this project.

I next had to find a way to graph both the sentiment and activity analyses I performed. I originally planned
to use Google Charts, as I had used it before in other big data applications, but I stumbled across
HighCharts.js, which is a great charting library for big data. HighCharts.js is a pure JavaScript
charting library which requires only jQuery (included in most projects anyway). This lack of dependencies made it
an easy tool to work with.

Finally, for styling purposes, I used Twitter's Bootstrap CSS framework. Bootstrap is used in many web applications
for its ease of use and for its polish.
\section{Future Work}
\label{sec:fw}
For the future work of this project, I have only one major focus, though due to the nature of web applications,
other interesting additions can be created with ease. My main focus moving forward is the further development
of the requirements elicitation tool. As per Figure~\ref{fig:req}, we can see how automatically generated
requirements could become handy to developers in that they can see what the community wants. However, finding
these requirements and displaying them in a way that makes sense proved difficult in this project.

To improve the requirements, I suggest two steps. First, to make the results more human readable, I would
like to incorporate other elements of the source sentences aside from the object and verb phrase, in order to make
the results a more cohesive sentence. Second, I would like to use some of the ideas from plagiarism detection
software and research in order to rank the findings better. Plagiarism research deals with taking two pieces of
English, say two sentences, and determining how alike they are. I would use this idea to find similar requirements:
if a requirement is voiced by 90\% of the community, it is probably a higher priority
for developers than a single suggestion by one user.
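As a rough sketch of this second idea, a simple similarity measure such as Jaccard similarity
over word sets could group near-duplicate requirements and estimate how much of the community
voices each one (the similarity threshold shown is an assumption):

\begin{verbatim}
def jaccard(a, b):
    # Word-set overlap between two requirement sentences.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def support(requirement, all_requirements, threshold=0.5):
    # Fraction of extracted requirements similar to this one.
    similar = sum(1 for other in all_requirements
                  if jaccard(requirement, other) >= threshold)
    return similar / len(all_requirements)
\end{verbatim}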
Through these two changes, I hope to make Forum-Miner a more robust tool and to eventually support the work
of community managers in software development.
\section{Conclusions}
\label{sec:conc}
This paper has walked through the creation and implementation of the ``Forum-Miner'' web tool. I have shown
how, through the use of Python and web crawling, storage facilities, and Ruby on Rails, we can create tools
that perform deep analysis on the natural, unstructured data
that is available on the web through software system forums. Moving forward, I hope not only to improve my
website with a better implementation of the requirements-gathering algorithm, but also to open the ideas
of this paper to further unstructured data research on the web. Free text is everywhere on the Internet,
but we do not yet have the tools to harness it.

I hope to release ``Forum-Miner'' to a public web server around April 2014, as I will be continuing this project
as a directed study in the following semester. The code is available on GitHub.

% End of the paper
\end{document}