<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
<head>
<title>Boost Longest Common Subsequence</title>
</head>
<body>
<table border="1" bgcolor="teal" cellpadding="2">
<tr>
<td bgcolor="white">
<a href="../../../index.htm"><img src="../../../c++boost.gif" alt="Home" width="277" height="86" border="0"></a></td>
<td><a href="../../../index.htm"><font face="Arial" color="white"><big>Home</big></font></a></td>
<td><a href="../../libraries.htm"><font face="Arial" color="white"><big>Libraries</big></font></a></td>
<td><a href="../../../people/people.htm"><font face="Arial" color="white"><big>People</big></font></a></td>
<td><a href="../../../more/faq.htm"><font face="Arial" color="white"><big>FAQ</big></font></a></td>
<td><a href="../../../more/index.htm"><font face="Arial" color="white"><big>More</big></font></a></td>
</tr>
</table>
<font face="Times" size="3">
<h1>Boost Longest Common Subsequence</h1>
<font size="+1"><b>Definitions</b></font>
<table width="100%">
<tr>
<td valign="top"><i>Element</i></td>
<td>A single item of data.</td>
</tr>
<tr>
<td valign="top"><i>Sequence</i></td>
<td>An ordered container of elements. This can be an STL container or any other
type that provides the necessary interface, such as <code>boost::array&lt;&gt;</code>
or a raw C++ array.</td>
</tr>
</table>
<h2>Overview</h2>
<p>The <i>Longest Common Subsequence</i> (LCS) algorithm constructs, from two
source sequences, the longest sequence whose elements are common to both sources
and appear in the same order as in each source. For example, take the first two
lines of a popular nursery rhyme as input sequences. What is the longest sequence
that contains elements from both of these sequences, with the elements in the same
order as they appear in the originals?</p>
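<p>To make the definition concrete, the following minimal sketch checks whether a
candidate string is a subsequence of both inputs. The names <code>is_subsequence</code>
and <code>is_common_subsequence</code> are hypothetical helpers written for this
illustration; they are not part of the library.</p>

```cpp
#include <cassert>
#include <string>

// Returns true when every element of `candidate` occurs in `source`
// in the same relative order (gaps between matches are allowed).
// Hypothetical helper for illustration only.
bool is_subsequence(const std::string &candidate, const std::string &source)
{
    std::string::size_type next = 0;   // index of the next element of candidate to match
    for (char c : source)
        if (next < candidate.size() && candidate[next] == c)
            ++next;
    return next == candidate.size();
}

// True when `candidate` is a subsequence of both inputs.
bool is_common_subsequence(const std::string &candidate,
                           const std::string &first,
                           const std::string &second)
{
    return is_subsequence(candidate, first) && is_subsequence(candidate, second);
}
```

<p>A common subsequence need not be contiguous in either input: <code>"ace"</code> is a
subsequence of <code>"abcde"</code>, whereas <code>"aec"</code> is not because the order differs.</p>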
<p>To visualise the problem, we can write the two strings, one below the other,
such that each letter that appears in both strings is aligned:</p>
<table width="100%">
<tr>
<td>
<pre>
ja <font color="#0000dd"><b>c</b></font>k <font color="#0000dd"><b>a</b></font>nd jil <font color="#0000dd"><b>l</b></font> w<font color="#0000dd"><b>e</b></font>nt up <font color="#0000dd"><b>t</b></font>h<font color="#0000dd"><b>e</b></font> hill
to fet<font color="#0000dd"><b>c</b></font> h <font color="#0000dd"><b>a</b></font> pa<font color="#0000dd"><b>l</b></font> <font color="#0000dd"><b>e</b></font> of wa<font color="#0000dd"><b>t</b></font> <font color="#0000dd"><b>e</b></font> r
</pre>
</td>
</tr>
</table>
We can clearly see from this that the Longest Common Subsequence, highlighted in
blue, is "<code><b>c a le&nbsp;&nbsp;te</b></code>", and has a length of 10
elements (spaces count as elements). To solve the problem of identifying the LCS,
we can define that:<br>
<ul><li>
if the first elements of the two sequences are equal, then this element must also
appear at the beginning of the subsequence.</li>
<li>
if the first elements of the two sequences are not equal, then either element, but
not both, may appear in the subsequence.</li>
</ul>
<p>Once a decision is made with respect to the first element of each input sequence,
the rest of the problem is simply another LCS problem on the shorter sequences.
That is to say, the problem is recursive.</p>
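<p>The two rules and the recursion can be sketched directly in code. This is a naive
illustration that only returns the length and takes exponential time in the worst
case, so it is not how a practical implementation would work. The function name
<code>lcs_length_recursive</code> is an assumption made for this example.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Length of the LCS of first[i..] and second[j..], written straight from
// the two rules. Hypothetical helper; exponential time, illustration only.
std::size_t lcs_length_recursive(const std::string &first, std::size_t i,
                                 const std::string &second, std::size_t j)
{
    if (i == first.size() || j == second.size())
        return 0;                                    // one input is exhausted
    if (first[i] == second[j])                       // rule 1: keep the shared element
        return 1 + lcs_length_recursive(first, i + 1, second, j + 1);
    // rule 2: drop one leading element or the other and keep the better result
    const std::size_t drop_first  = lcs_length_recursive(first, i + 1, second, j);
    const std::size_t drop_second = lcs_length_recursive(first, i, second, j + 1);
    return drop_first > drop_second ? drop_first : drop_second;
}
```

<p>Calling <code>lcs_length_recursive(a, 0, b, 0)</code> gives the LCS length of
<code>a</code> and <code>b</code>. The same sub-problems are solved repeatedly, which is
what a table of intermediate results avoids.</p>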
<p>The LCS is computed using a 2-dimensional array, with one sequence along the
x-axis and the other sequence along the y-axis. Given that the subsequence is
<i>common</i> to both sequences, the x and y axes are interchangeable. The algorithm
allocates memory by row, so representing the shorter sequence on the x-axis uses
the least memory, whereas representing the shorter sequence on the y-axis uses
fewer iterations and is therefore slightly faster. I have adopted the smallest
memory footprint approach in my algorithm.</p>
<p>In order to calculate a value in the array, we must know the values to the left, above
and diagonally above-left. We therefore only ever need two rows of information, so we can
allocate enough memory for two rows of working data and reuse the buffers for alternate
rows. At the end of the algorithm, the length of the subsequence is given by the
rightmost value in the last row.</p>
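<p>A minimal sketch of the two-row scheme follows, assuming <code>std::string</code>
inputs for brevity. The name <code>lcs_length_two_rows</code> is hypothetical, and the
code is an illustration of the idea rather than the library's implementation. It
places the shorter sequence along the x-axis so that each row is as short as possible.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// LCS length using two rows of working data. Hypothetical helper.
// Inputs are taken by value so they can be swapped cheaply.
std::size_t lcs_length_two_rows(std::string a, std::string b)
{
    if (a.size() > b.size())
        std::swap(a, b);                 // shorter sequence on the x-axis (columns)

    // Index 0 of each row stays zero, so the loops need no boundary checks.
    std::vector<std::size_t> previous(a.size() + 1, 0);
    std::vector<std::size_t> current(a.size() + 1, 0);

    for (std::size_t row = 1; row <= b.size(); ++row) {
        for (std::size_t col = 1; col <= a.size(); ++col) {
            if (a[col - 1] == b[row - 1])
                current[col] = previous[col - 1] + 1;          // diagonal above-left
            else
                current[col] = std::max(previous[col],         // above
                                        current[col - 1]);     // left
        }
        std::swap(previous, current);    // reuse the buffers for the next row
    }
    return previous[a.size()];           // rightmost value of the last row
}
```

<p>After the final swap, <code>previous</code> holds the last computed row, which is why the
result is read from it.</p>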
<h4>Algorithm</h4>
<p>Note that the first and second rows must be initialised to zero. The algorithm below assumes
the sequence elements are in array indexes 1..n. Index 0 is reserved for zero-valued elements to
avoid conditions in the main loops. The <code>position</code> array has the same length as the first
sequence and tracks the position in sequence one at which the middle element of sequence two
is found.</p>
<pre>
for each element in the first sequence from first to last
    for each element in the second sequence from first to (n / 2)
        if the data for each element is equal, then
            set array[column][row] = 1 + array[column-1][row-1]
        else
            set array[column][row] = max(array[column-1][row], array[column][row-1])

for each element in the first sequence from the first to the last
    for each element in the second sequence from (n / 2) + 1 to the last
        if the data for each element is equal, then
            set array[column][row] = 1 + array[column-1][row-1]
            set position[column] = position[column-1]
        else if array[column-1][row] &gt; array[column][row-1]
            set array[column][row] = array[column-1][row]
            set position[column] = position[column-1]
        else
            set array[column][row] = array[column][row-1]
</pre>
<p>Once we get to this point, we know the length of the subsequence, and we know the middle
node along the solution path, which is in the last element of <code>position</code>. The
sub-problems can now be solved recursively for elements 1..n/2 and n/2..n.</p>
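<p>The divide-and-conquer step can be sketched as follows. This is a self-contained
illustration, similar in spirit to Hirschberg's linear-space technique, and not the
library's actual code. The names <code>lcs_recover</code> and <code>lcs_linear_space</code>
are hypothetical, and the sketch keeps two rows of positions (rather than one array)
so that the diagonal value is still available when it is needed.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Appends an LCS of a[a_lo, a_hi) and b[b_lo, b_hi) to `out`.
// Hypothetical helper; `a` runs along the columns and `b` along the rows.
void lcs_recover(const std::string &a, std::size_t a_lo, std::size_t a_hi,
                 const std::string &b, std::size_t b_lo, std::size_t b_hi,
                 std::string &out)
{
    const std::size_t m = a_hi - a_lo;   // number of columns
    const std::size_t n = b_hi - b_lo;   // number of rows
    if (m == 0 || n == 0)
        return;
    if (n == 1) {                        // a single row yields at most one match
        const auto first = a.begin() + a_lo, last = a.begin() + a_hi;
        if (std::find(first, last, b[b_lo]) != last)
            out.push_back(b[b_lo]);
        return;
    }

    const std::size_t mid = n / 2;       // the middle row
    std::vector<std::size_t> prev_len(m + 1, 0), cur_len(m + 1, 0);
    std::vector<std::size_t> prev_pos(m + 1, 0), cur_pos(m + 1, 0);

    for (std::size_t row = 1; row <= n; ++row) {
        for (std::size_t col = 1; col <= m; ++col) {
            if (a[a_lo + col - 1] == b[b_lo + row - 1]) {
                cur_len[col] = prev_len[col - 1] + 1;   // diagonal
                cur_pos[col] = prev_pos[col - 1];
            } else if (prev_len[col] >= cur_len[col - 1]) {
                cur_len[col] = prev_len[col];           // above
                cur_pos[col] = prev_pos[col];
            } else {
                cur_len[col] = cur_len[col - 1];        // left
                cur_pos[col] = cur_pos[col - 1];
            }
        }
        if (row == mid)                  // a path leaving (mid, col) crosses at col
            for (std::size_t col = 0; col <= m; ++col)
                cur_pos[col] = col;
        std::swap(prev_len, cur_len);
        std::swap(prev_pos, cur_pos);
    }

    // Column at which the solution path crosses the middle row.
    const std::size_t split = prev_pos[m];
    lcs_recover(a, a_lo, a_lo + split, b, b_lo, b_lo + mid, out);
    lcs_recover(a, a_lo + split, a_hi, b, b_lo + mid, b_hi, out);
}

// Convenience wrapper returning one LCS of a and b.
std::string lcs_linear_space(const std::string &a, const std::string &b)
{
    std::string out;
    lcs_recover(a, 0, a.size(), b, 0, b.size(), out);
    return out;
}
```

<p>Position values computed before the middle row are placeholders; they are
overwritten when the middle row is reached, and only the values carried forward from
that row are used to choose the split.</p>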
<h2>Synopsis</h2>
<p>There are two fundamental algorithms: one calculates the subsequence and the other
calculates only the length of the subsequence. Each algorithm has two overloaded
implementations, one that takes an allocator object for memory management and one that
provides a default.</p>
<pre>
template&lt;typename size_type,
         typename ItIn1,
         typename ItIn2,
         typename ItSubSeq&gt;
inline
size_type
longest_common_subsequence(ItIn1    begin_first,
                           ItIn1    end_first,
                           ItIn2    begin_second,
                           ItIn2    end_second,
                           ItSubSeq subsequence)

template&lt;typename size_type,
         typename Alloc,
         typename ItIn1,
         typename ItIn2,
         typename ItSubSeq&gt;
size_type
longest_common_subsequence(ItIn1    begin_first,
                           ItIn1    end_first,
                           ItIn2    begin_second,
                           ItIn2    end_second,
                           ItSubSeq subsequence,
                           Alloc    &amp;alloc)

template&lt;typename size_type, typename ItIn1, typename ItIn2&gt;
inline
size_type
longest_common_subsequence_length(ItIn1 begin_first,  ItIn1 end_first,
                                  ItIn2 begin_second, ItIn2 end_second)

template&lt;typename size_type, typename Alloc, typename ItIn1, typename ItIn2&gt;
size_type
longest_common_subsequence_length(ItIn1 begin_first,  ItIn1 end_first,
                                  ItIn2 begin_second, ItIn2 end_second,
                                  Alloc &amp;alloc)
</pre>
<h2>Example</h2>
<pre>
#include &lt;cstring&gt;
#include &lt;iostream&gt;
#include &lt;iterator&gt;
#include &lt;string&gt;
#include &lt;boost/assert.hpp&gt;
// The header that declares boost::longest_common_subsequence must also be included.

void test_character_seq(const char *str1, const char *str2)
{
    std::cout &lt;&lt; "Comparing: \"" &lt;&lt; str1 &lt;&lt; "\"" &lt;&lt; std::endl
              &lt;&lt; "         &amp; \"" &lt;&lt; str2 &lt;&lt; "\"" &lt;&lt; std::endl;

    // Compute the subsequence itself, appending each element to a string.
    std::string  subsequence;
    signed short llcs;
    llcs = boost::longest_common_subsequence&lt;signed short&gt;(
               str1, str1 + std::strlen(str1),
               str2, str2 + std::strlen(str2),
               std::back_inserter(subsequence));

    std::cout &lt;&lt; "Longest Common Subsequence: \"" &lt;&lt; subsequence
              &lt;&lt; "\"" &lt;&lt; std::endl
              &lt;&lt; "Length is " &lt;&lt; llcs &lt;&lt; std::endl;

    // The returned length must agree with the subsequence and with the length-only algorithm.
    BOOST_ASSERT(llcs == static_cast&lt;signed short&gt;(subsequence.size()));
    BOOST_ASSERT(llcs == boost::longest_common_subsequence_length&lt;signed short&gt;(
                             str1, str1 + std::strlen(str1),
                             str2, str2 + std::strlen(str2)));
}
</pre>
<h2>Test Program</h2>
<p>The test program has been compiled and tested with 4 compilers and 5 STL variants.</p>
<table border=1 cellspacing=0 cellpadding=6 width=100%>
<tr bgcolor="teal">
<td><b>&nbsp;Compiler</b></td>
<td><b>&nbsp;STL version</b></td>
<td><b>&nbsp;Command Line</b></td>
<td><b>&nbsp;Success/Failure</b></td>
</tr>
<tr>
<td bgcolor="teal"><font color="#ffffff"><b>&nbsp;Microsoft Visual C++.Net (VC7)</b></font></td>
<td bgcolor="#ffcc66">Dinkumware (with compiler)</td>
<td bgcolor="#ffcc66">Using the project file <code>lcs.vcproj</code></td>
<td bgcolor="#ffcc66">Success.</td>
</tr>
<tr>
<td bgcolor="teal"><font color="#ffffff"><b>&nbsp;Microsoft Visual C++.Net (VC7)</b></font></td>
<td bgcolor="#ffcc66">STLport-4.5.3</td>
<td bgcolor="#ffcc66">Using the project file <code>lcs.vcproj</code></td>
<td bgcolor="#ffcc66">Success.</td>
</tr>
<tr>
<td bgcolor="teal"><font color="#ffffff"><b>&nbsp;Microsoft Visual C++ v6</b></font></td>
<td bgcolor="#ffcc66">Dinkumware (with compiler)</td>
<td bgcolor="#ffcc66"><code>cl lcs_test.cpp -GX</code></td>
<td bgcolor="#ffcc66">Success.</td>
</tr>
<tr>
<td bgcolor="teal"><font color="#ffffff"><b>&nbsp;g++ 3.1.1 20020718 (prerelease) (Cygwin)</b></font></td>
<td bgcolor="#ffcc66">Supplied with compiler</td>
<td bgcolor="#ffcc66"><code>g++ lcs_test.cpp</code></td>
<td bgcolor="#ffcc66">Success.</td>
</tr>
<tr>
<td bgcolor="teal"><font color="#ffffff"><b>&nbsp;Borland Free C++ 5.5.1</b></font></td>
<td bgcolor="#ffcc66">Supplied with compiler</td>
<td bgcolor="#ffcc66"><code>bcc32 lcs_test.cpp</code></td>
<td bgcolor="#ffcc66">Success.</td>
</tr>
<tr>
<td bgcolor="teal"><font color="#ffffff"><b>&nbsp;Borland Free C++ 5.5.1</b></font></td>
<td bgcolor="#ffcc66">STLport-4.5.3</td>
<td bgcolor="#ffcc66"><code>bcc32 lcs_test.cpp</code></td>
<td bgcolor="#ffcc66">Success.</td>
</tr>
</table>
<h2>Disclaimer</h2>
<p>&copy; Copyright 2002, 2003, <a href="mailto:cdm.henderson@virgin.net">Craig Henderson</a></p>
<p><i>Permission to use, copy, modify, distribute and sell this software and its
documentation for any purpose is hereby granted without fee, provided that the
above copyright notice appears in all copies and that both that copyright
notice and this permission notice appear in supporting documentation. The
author makes no representations about the suitability of this software for any
purpose. It is provided "as is" without express or implied warranty.</i></p>
<hr>
<i>Last Revised: 10th October, 2003</i> </font>
</body>
</html>