<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
  <head>
    <title>Boost Longest Common Subsequence</title>
  </head>
  <body>
    <table border="1" bgcolor="teal" cellpadding="2">
      <tr>
        <td bgcolor="white">
          <a href="../../../index.htm"><img src="../../../c++boost.gif" alt="Home" width="277" height="86" border="0"></a></td>
        <td><a href="../../../index.htm"><font face="Arial" color="white"><big>Home</big></font></a></td>
        <td><a href="../../libraries.htm"><font face="Arial" color="white"><big>Libraries</big></font></a></td>
        <td><a href="../../../people/people.htm"><font face="Arial" color="white"><big>People</big></font></a></td>
        <td><a href="../../../more/faq.htm"><font face="Arial" color="white"><big>FAQ</big></font></a></td>
        <td><a href="../../../more/index.htm"><font face="Arial" color="white"><big>More</big></font></a></td>
      </tr>
    </table>
    <font face="Times" size="3">
      <h1>Boost Longest Common Subsequence</h1>
      <font size="+1"><b>Definitions</b></font>
      <table width="100%">
        <tr>
          <td valign="top"><i>Element</i></td>
          <td>A single item of data</td>
        </tr>
        <tr>
          <td valign="top"><i>Sequence</i></td>
          <td>A container of elements with an order. This can be an STL container or any other
            container implementing the necessary interface, such as <code>boost::array<></code>
            or raw C++ arrays.</td>
        </tr>
      </table>
      <h2>Overview</h2>
      <p>The <i>Longest Common Subsequence</i> algorithm constructs, from two
        source sequences, a sequence that contains the elements common to both sources in the order
        in which they appear. For example, take the first two lines of a popular
        nursery rhyme as input sequences.
        What is the longest sequence that contains
        elements from both of these sequences, with the elements in the same order as
        they appear in the originals?</p>
      <p>To visualise the problem, we can write the two strings, one below the other,
        such that each letter that appears in both strings is aligned:</p>
      <table width="100%">
        <tr>
          <td>
            <pre>
ja      <font color="#0000dd"><b>c</b></font>k  <font color="#0000dd"><b>a</b></font>nd jil  <font color="#0000dd"><b>l</b></font> w<font color="#0000dd"><b>e</b></font>nt up     <font color="#0000dd"><b>t</b></font>h<font color="#0000dd"><b>e</b></font> hill
  to fet<font color="#0000dd"><b>c</b></font> h <font color="#0000dd"><b>a</b></font>      pa<font color="#0000dd"><b>l</b></font>  <font color="#0000dd"><b>e</b></font>     of wa<font color="#0000dd"><b>t</b></font> <font color="#0000dd"><b>e</b></font>     r
</pre>
          </td>
        </tr>
      </table>
      We can clearly see from this that the Longest Common Subsequence, highlighted in
      blue, is "<code><b>c a le te</b></code>", and has a length of 10
      elements. To solve the problem of identifying the LCS, we can define that:<br>
      <ul><li>
          if the first elements in each sequence are equal, then this element must also
          appear at the beginning of the subsequence.</li>
        <li>
          if the first elements in each sequence are not equal, then either element, but
          not both, may appear in the subsequence.</li>
      </ul>
      <p>Once a decision is made with respect to the first element in the input sequences,
        the rest of the problem is simply another LCS problem on the shorter sequences.
        That is to say, the problem is recursive.</p>
      <p>The LCS is computed using a 2-dimensional array, with one sequence along the
        x-axis and the other sequence along the y-axis. Given that the subsequence is
        <i>common</i> to both sequences, the x and y axes are interchangeable.
        The algorithm
        uses memory in proportion to the row length, so placing the shorter sequence
        along the x axis uses the least memory, whereas placing the shorter sequence along
        the y axis uses fewer iterations and is therefore slightly faster. I have adopted
        the smallest memory footprint approach in my algorithm.</p>

      <p>In order to calculate a value in the array, we must know the values to the left, above
        and diagonally above-left. We therefore only ever need two rows of information, so we can
        allocate enough memory for two rows of working data and reuse the buffers for alternate
        rows. At the end of the algorithm, the length of the subsequence is given by the
        rightmost value in the last row.</p>
      <h4>Algorithm</h4>
      <p>Note that the first and second rows must be initialised to zero. The algorithm below assumes
        the sequence elements are in array indexes 1..n. Index 0 is reserved for zero-valued elements to
        avoid boundary conditions in the main loops. The <code>position</code> array has the same length as the first
        sequence and tracks the position in which the middle element of sequence two is found in
        sequence one.</p>
      <pre>
for each element in the first sequence from first to last
    for each element in the second sequence from first to (n / 2)
        if the data for each element is equal, then
            set array[column][row] = 1 + array[column-1][row-1]
        else
            set array[column][row] = max(array[column-1][row], array[column][row-1])

for each element in the first sequence from the first to the last
    for each element in the second sequence from (n / 2) + 1 to the last
        if the data for each element is equal, then
            set array[column][row] = 1 + array[column-1][row-1]
            set position[column] = position[column-1]
        else if array[column-1][row] > array[column][row-1]
            set array[column][row] = array[column-1][row]
            set position[column] = position[column-1]
        else
            set array[column][row] = array[column][row-1]
</pre>
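      <p>The two-row technique described above can be sketched in C++ for the length-only case.
        This is a minimal illustration under stated assumptions, not the library's implementation:
        the function name <code>lcs_length_two_rows</code> is hypothetical, it works only on
        <code>std::string</code>, and it does not track the <code>position</code> array used to
        recover the subsequence itself.</p>

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper for illustration; not part of the interface documented here.
// Returns the length of the longest common subsequence of a and b while
// keeping only two rows of the dynamic-programming table in memory.
inline std::size_t lcs_length_two_rows(const std::string& a, const std::string& b)
{
    // Place the shorter sequence along the row (x axis) to minimise memory use.
    const std::string& row_seq = (a.size() <= b.size()) ? a : b;
    const std::string& col_seq = (a.size() <= b.size()) ? b : a;

    // Index 0 is a zero-valued sentinel, so the loops need no boundary checks.
    std::vector<std::size_t> previous(row_seq.size() + 1, 0);
    std::vector<std::size_t> current(row_seq.size() + 1, 0);

    for (std::size_t y = 1; y <= col_seq.size(); ++y) {
        for (std::size_t x = 1; x <= row_seq.size(); ++x) {
            if (row_seq[x - 1] == col_seq[y - 1])
                current[x] = previous[x - 1] + 1;                   // above-left + 1
            else
                current[x] = std::max(previous[x], current[x - 1]); // above or left
        }
        std::swap(previous, current); // reuse the two buffers for alternate rows
    }

    // After the final swap the last completed row is held in `previous`;
    // its rightmost value is the length of the subsequence.
    return previous[row_seq.size()];
}
```

      <p>For instance, calling <code>lcs_length_two_rows("ABCBDAB", "BDCABA")</code> yields 4.
        Swapping the buffers rather than copying them keeps the cost of moving to the next row constant.</p>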
      <p>Once we get to this point, we know the length of the subsequence, and we know the middle
        node along the solution path, which is in the last element of <code>position</code>. The
        sub-problems can now be solved recursively for elements 1..n/2 and n/2..n.</p>

      <h2>Synopsis</h2>
      <p>There are two fundamental algorithms: one calculates the subsequence and the other calculates
        only the length of the subsequence. Each algorithm has two overloaded implementations, one that takes an
        allocator object for memory management and one that provides a default.</p>
<pre>
template<typename size_type,
         typename ItIn1,
         typename ItIn2,
         typename ItSubSeq>
inline
size_type
longest_common_subsequence(ItIn1    begin_first,
                           ItIn1    end_first,
                           ItIn2    begin_second,
                           ItIn2    end_second,
                           ItSubSeq subsequence)


template<typename size_type,
         typename Alloc,
         typename ItIn1,
         typename ItIn2,
         typename ItSubSeq>
size_type
longest_common_subsequence(ItIn1    begin_first,
                           ItIn1    end_first,
                           ItIn2    begin_second,
                           ItIn2    end_second,
                           ItSubSeq subsequence,
                           Alloc   &alloc)

template<typename size_type, typename ItIn1, typename ItIn2>
inline
size_type
longest_common_subsequence_length(ItIn1 begin_first,  ItIn1 end_first,
                                  ItIn2 begin_second, ItIn2 end_second)


template<typename size_type, typename Alloc, typename ItIn1, typename ItIn2>
size_type
longest_common_subsequence_length(ItIn1 begin_first,  ItIn1 end_first,
                                  ItIn2 begin_second, ItIn2 end_second,
                                  Alloc &alloc)
</pre>
      <h2>Example</h2>
<pre>
// Standard headers used by this example; the header that declares
// boost::longest_common_subsequence must be included as well.
#include <cstring>
#include <iostream>
#include <iterator>
#include <string>
#include <boost/assert.hpp>

void test_character_seq(const char *str1, const char *str2)
{
    std::cout << "Comparing: \"" << str1 << "\"" << std::endl
              << "         & \"" << str2 << "\"" << std::endl;

    // The subsequence is appended to this string through a back inserter.
    std::string  subsequence;
    signed short llcs;
    llcs = boost::longest_common_subsequence<
                signed short>(
                    str1, str1+std::strlen(str1),
                    str2, str2+std::strlen(str2),
                    std::back_inserter<>(subsequence));

    std::cout << "Longest Common Subsequence: \"" << subsequence
              << "\"" << std::endl
              << "Length is " << llcs << std::endl;

    // The returned length must agree with the subsequence produced
    // and with the length-only algorithm.
    BOOST_ASSERT(llcs == static_cast<signed short>(subsequence.size()));
    BOOST_ASSERT(llcs == boost::longest_common_subsequence_length<
                            signed short>(str1, str1+std::strlen(str1),
                                          str2, str2+std::strlen(str2)));
}
</pre>
      <h2>Test Program</h2>
      <p>The test program has been compiled and tested with 4 compilers with 5 STL variants.</p>
      <table border=1 cellspacing=0 cellpadding=6 width=100%>
        <tr bgcolor="teal">
          <td><b> Compiler</b></td>
          <td><b> STL version</b></td>
          <td><b> Command Line</b></td>
          <td><b> Success/Failure</b></td>
        </tr>
        <tr>
          <td bgcolor="teal"><font color="#ffffff"><b> Microsoft Visual C++.Net (VC7)</b></font></td>
          <td bgcolor="#ffcc66">Dinkumware (with compiler)</td>
          <td bgcolor="#ffcc66">Using the project file <code>lcs.vcproj</code></td>
          <td bgcolor="#ffcc66">Success.</td>
        </tr>
        <tr>
          <td bgcolor="teal"><font color="#ffffff"><b> Microsoft Visual C++.Net (VC7)</b></font></td>
          <td bgcolor="#ffcc66">STLport-4.5.3</td>
          <td bgcolor="#ffcc66">Using the project file <code>lcs.vcproj</code></td>
          <td bgcolor="#ffcc66">Success.</td>
        </tr>
        <tr>
          <td bgcolor="teal"><font color="#ffffff"><b> Microsoft Visual C++ v6</b></font></td>
          <td bgcolor="#ffcc66">Dinkumware (with compiler)</td>
          <td bgcolor="#ffcc66"><code>cl lcs_test.cpp -GX</code></td>
          <td bgcolor="#ffcc66">Success.</td>
        </tr>
        <tr>
          <td bgcolor="teal"><font color="#ffffff"><b> g++ 3.1.1 20020718 (prerelease) (Cygwin)</b></font></td>
          <td bgcolor="#ffcc66">Compiler</td>
          <td bgcolor="#ffcc66"><code>g++ lcs_test.cpp</code></td>
          <td bgcolor="#ffcc66">Success.</td>
        </tr>
        <tr>
          <td bgcolor="teal"><font color="#ffffff"><b> Borland Free C++ 5.5.1</b></font></td>
          <td bgcolor="#ffcc66">Compiler</td>
          <td bgcolor="#ffcc66"><code>bcc32
lcs_test.cpp</code></td>
          <td bgcolor="#ffcc66">Success.</td>
        </tr>
        <tr>
          <td bgcolor="teal"><font color="#ffffff"><b> Borland Free C++ 5.5.1</b></font></td>
          <td bgcolor="#ffcc66">STLport-4.5.3</td>
          <td bgcolor="#ffcc66"><code>bcc32 lcs_test.cpp</code></td>
          <td bgcolor="#ffcc66">Success.</td>
        </tr>
      </table>
      <h2>Disclaimer</h2>
      <p>© Copyright 2002, 2003, <a href="mailto:cdm.henderson@virgin.net">Craig Henderson</a></p>
      <p><i>Permission to use, copy, modify, distribute and sell this software and its
          documentation for any purpose is hereby granted without fee, provided that the
          above copyright notice appears in all copies and that both that copyright
          notice and this permission notice appear in supporting documentation. The
          author makes no representations about the suitability of this software for any
          purpose. It is provided "as is" without express or implied warranty.</i></p>
      <hr>
<i>Last Revised: 10th October, 2003</i> </font>
</body>
</html>