- <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
- <html>
- <head>
- <title>Boost Longest Common Subsequence</title>
- </head>
- <body>
- <table border="1" bgcolor="teal" cellpadding="2">
- <tr>
- <td bgcolor="white">
- <a href="../../../index.htm"><img src="../../../c++boost.gif" alt="Home" width="277" height="86" border="0"></a></td>
- <td><a href="../../../index.htm"><font face="Arial" color="white"><big>Home</big></font></a></td>
- <td><a href="../../libraries.htm"><font face="Arial" color="white"><big>Libraries</big></font></a></td>
- <td><a href="../../../people/people.htm"><font face="Arial" color="white"><big>People</big></font></a></td>
- <td><a href="../../../more/faq.htm"><font face="Arial" color="white"><big>FAQ</big></font></a></td>
- <td><a href="../../../more/index.htm"><font face="Arial" color="white"><big>More</big></font></a></td>
- </tr>
- </table>
- <font face="Times" size="3">
- <h1>Boost Longest Common Subsequence</h1>
- <font size="+1"><b>Definitions</b></font>
- <table width="100%">
- <tr>
- <td valign="top"><i>Element</i></td>
- <td>A single item of data</td>
- </tr>
- <tr>
- <td valign="top"><i>Sequence</i></td>
- <td>An ordered container of elements. This can be an STL container or any other type
- implementing the necessary interface, such as <code>boost::array<></code>
- or raw C++ arrays.</td>
- </tr>
- </table>
- <h2>Overview</h2>
- <p>The <i>Longest Common Subsequence</i> (LCS) algorithm constructs, from two
- source sequences, the longest sequence of elements common to both sources, in the order
- in which they appear. For example, take the first two lines of a popular
- nursery rhyme as input sequences. What is the longest sequence that contains
- elements from both of these sequences, with the elements in the same order as
- they appear in the originals?</p>
- <p>To visualise the problem, we can write the two strings, one below the other,
- such that each letter that appears in both strings is aligned:</p>
- <table width="100%">
- <tr>
- <td>
- <pre>
- ja <font color="#0000dd"><b>c</b></font>k <font color="#0000dd"><b>a</b></font>nd jil <font color="#0000dd"><b>l</b></font> w<font color="#0000dd"><b>e</b></font>nt up <font color="#0000dd"><b>t</b></font>h<font color="#0000dd"><b>e</b></font> hill
- to fet<font color="#0000dd"><b>c</b></font> h <font color="#0000dd"><b>a</b></font> pa<font color="#0000dd"><b>l</b></font> <font color="#0000dd"><b>e</b></font> of wa<font color="#0000dd"><b>t</b></font> <font color="#0000dd"><b>e</b></font> r
- </pre>
- </td>
- </tr>
- </table>
- We can clearly see from this that the Longest Common Subsequence, highlighted in
- blue, is "<code><b>c a le te</b></code>", and has a length of 10
- elements. To solve the problem of identifying the LCS, we can observe that:<br>
- <ul><li>
- if the first elements of the two sequences are equal, then this element must also
- appear at the beginning of the subsequence.</li>
- <li>
- if the first elements of the two sequences are not equal, then at most one of them
- can begin the subsequence, so one of the two can be discarded.</li>
- </ul>
- <p>Once a decision is made with respect to the first element of each input sequence,
- the rest of the problem is simply another LCS problem on the shorter sequences.
- That is to say, the problem is recursive.</p>
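- <p>The two rules above can be sketched directly as a recursive function. This is only
- an illustration of the recurrence, not the library's implementation:
- <code>lcs_length_recursive</code> is a hypothetical helper, it works on
- <code>std::string</code> for brevity, and it takes exponential time, so it is
- suitable for short inputs only.</p>

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical helper (not part of the library): length of the LCS of
// a[i..] and b[j..], computed by applying the two rules literally.
std::size_t lcs_length_recursive(const std::string &a, std::size_t i,
                                 const std::string &b, std::size_t j)
{
    // An exhausted sequence has nothing in common with anything.
    if (i == a.size() || j == b.size())
        return 0;

    // Rule 1: equal first elements begin the common subsequence.
    if (a[i] == b[j])
        return 1 + lcs_length_recursive(a, i + 1, b, j + 1);

    // Rule 2: discard one first element or the other and keep the better result.
    return std::max(lcs_length_recursive(a, i + 1, b, j),
                    lcs_length_recursive(a, i, b, j + 1));
}
```

- <p>A call such as <code>lcs_length_recursive(s1, 0, s2, 0)</code> gives the length for
- the whole of both strings. The same sub-problems are solved many times over, which is
- what the array-based approach avoids.</p>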
- <p>The LCS is computed using a 2-dimensional array, with one sequence along the
- x-axis and the other sequence along the y-axis. Given that the subsequence is
- <i>common</i> to both sequences, the x and y axes are interchangeable. The algorithm
- allocates memory by row, so placing the shorter sequence
- on the x-axis uses the least memory, whereas placing the shorter sequence on
- the y-axis uses fewer row iterations and is therefore slightly faster. I have adopted
- the smallest memory footprint approach in my algorithm.</p>
-
- <p>To calculate a value in the array, we must know the values to the left, above,
- and diagonally above-left. We therefore only ever need two rows of information, so we can
- allocate enough memory for two rows of working data and reuse the buffers for alternate
- rows. At the end of the algorithm, the length of the subsequence is given by the
- right-most value in the last row.</p>
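- <p>As a minimal sketch of the two-row technique, assuming <code>std::string</code>
- inputs: <code>lcs_length_two_rows</code> is a hypothetical name used for illustration and
- is not the library's <code>longest_common_subsequence_length</code>. It places the shorter
- sequence along the x-axis, as described above, and keeps index 0 as a zero border.</p>

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: LCS length using only two rows of working storage.
std::size_t lcs_length_two_rows(const std::string &a, const std::string &b)
{
    // Shorter sequence along the x-axis, so each row is as short as possible.
    const std::string &x = a.size() <= b.size() ? a : b;
    const std::string &y = a.size() <= b.size() ? b : a;

    // Both rows start at zero; index 0 is a zero border that is never written.
    std::vector<std::size_t> previous(x.size() + 1, 0);
    std::vector<std::size_t> current(x.size() + 1, 0);

    for (std::size_t row = 1; row <= y.size(); ++row) {
        for (std::size_t col = 1; col <= x.size(); ++col) {
            if (x[col - 1] == y[row - 1])
                current[col] = previous[col - 1] + 1;       // diagonal above-left
            else
                current[col] = std::max(previous[col],      // above
                                        current[col - 1]);  // left
        }
        std::swap(previous, current);  // reuse the two buffers for the next row
    }

    // After the final swap, 'previous' holds the last row that was computed.
    return previous[x.size()];
}
```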
- <h4>Algorithm</h4>
- <p>Note that the first and second rows must be initialised to zero. The algorithm below assumes
- the sequence elements are in array indexes 1..n. Index 0 is reserved for zero-valued elements to
- avoid conditions in the main loops. The <code>position</code> array has the length of the first
- sequence and tracks the position at which the middle element of sequence two is found in
- sequence one.</p>
- <pre>
- for each element in the first sequence from first to last
-     for each element in the second sequence from first to (n / 2)
-         if the data for each element is equal, then
-             set array[column][row] = 1 + array[column-1][row-1]
-         else
-             set array[column][row] = max(array[column-1][row], array[column][row-1])
-
- for each element in the first sequence from the first to the last
-     for each element in the second sequence from (n / 2) + 1 to the last
-         if the data for each element is equal, then
-             set array[column][row] = 1 + array[column-1][row-1]
-             set position[column] = position[column-1]
-         else if array[column-1][row] > array[column][row-1]
-             set array[column][row] = array[column-1][row]
-             set position[column] = position[column-1]
-         else
-             set array[column][row] = array[column][row-1]
- </pre>
- <p>At this point we know the length of the subsequence, and we know the middle
- node along the solution path, which is in the last element of <code>position</code>. The
- sub-problems can now be solved recursively for elements 1..n/2 and n/2..n.</p>
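- <p>The divide-and-conquer step can be illustrated with the classic Hirschberg
- formulation. Note that this sketch is a substitute technique, not the method described above: it
- finds the split point by combining a forward pass with a reverse pass, rather than by
- tracking a <code>position</code> array. The names <code>lcs_last_row</code> and
- <code>lcs_hirschberg</code> are hypothetical, and <code>std::string</code> inputs are assumed.</p>

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: final row of LCS lengths between 'a' and every
// prefix of 'b', computed with two rows of storage.
static std::vector<std::size_t> lcs_last_row(const std::string &a,
                                             const std::string &b)
{
    std::vector<std::size_t> prev(b.size() + 1, 0), cur(b.size() + 1, 0);
    for (std::size_t i = 1; i <= a.size(); ++i) {
        for (std::size_t j = 1; j <= b.size(); ++j)
            cur[j] = (a[i - 1] == b[j - 1]) ? prev[j - 1] + 1
                                            : std::max(prev[j], cur[j - 1]);
        prev.swap(cur);
    }
    return prev;
}

// Hypothetical helper: builds one longest common subsequence by splitting
// 'a' in the middle and finding the matching split point in 'b'.
std::string lcs_hirschberg(const std::string &a, const std::string &b)
{
    if (a.empty() || b.empty())
        return std::string();
    if (a.size() == 1)  // a single element is the LCS if it occurs in b
        return b.find(a[0]) != std::string::npos ? a : std::string();

    const std::size_t mid = a.size() / 2;
    const std::string a_left(a, 0, mid), a_right(a, mid);

    // Forward scores for the left half, reverse scores for the right half.
    const std::vector<std::size_t> fwd = lcs_last_row(a_left, b);
    const std::string a_right_rev(a_right.rbegin(), a_right.rend());
    const std::string b_rev(b.rbegin(), b.rend());
    const std::vector<std::size_t> rev = lcs_last_row(a_right_rev, b_rev);

    // Choose the split of b that maximises the combined length.
    std::size_t split = 0, best = 0;
    for (std::size_t k = 0; k <= b.size(); ++k) {
        const std::size_t total = fwd[k] + rev[b.size() - k];
        if (total > best) { best = total; split = k; }
    }

    // Solve the two smaller problems and concatenate the results.
    return lcs_hirschberg(a_left, b.substr(0, split))
         + lcs_hirschberg(a_right, b.substr(split));
}
```

- <p>Each level of recursion needs only the two working rows, so the whole subsequence is
- recovered without ever storing the full 2-dimensional array.</p>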
-
- <h2>Synopsis</h2>
- <p>There are two fundamental algorithms: the calculation of the subsequence itself and the
- calculation of the length of the subsequence. Each algorithm has two overloads: one that takes an
- allocator object for memory management, and one that uses a default allocator.</p>
- <pre>
- template<typename size_type,
-          typename ItIn1,
-          typename ItIn2,
-          typename ItSubSeq>
- inline
- size_type
- longest_common_subsequence(ItIn1    begin_first,
-                            ItIn1    end_first,
-                            ItIn2    begin_second,
-                            ItIn2    end_second,
-                            ItSubSeq subsequence);
-
- template<typename size_type,
-          typename Alloc,
-          typename ItIn1,
-          typename ItIn2,
-          typename ItSubSeq>
- size_type
- longest_common_subsequence(ItIn1    begin_first,
-                            ItIn1    end_first,
-                            ItIn2    begin_second,
-                            ItIn2    end_second,
-                            ItSubSeq subsequence,
-                            Alloc   &alloc);
-
- template<typename size_type, typename ItIn1, typename ItIn2>
- inline
- size_type
- longest_common_subsequence_length(ItIn1 begin_first,  ItIn1 end_first,
-                                   ItIn2 begin_second, ItIn2 end_second);
-
- template<typename size_type, typename Alloc, typename ItIn1, typename ItIn2>
- size_type
- longest_common_subsequence_length(ItIn1 begin_first,  ItIn1 end_first,
-                                   ItIn2 begin_second, ItIn2 end_second,
-                                   Alloc &alloc);
- </pre>
- <h2>Example</h2>
- <pre>
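- #include <cstring>   // std::strlen
- #include <iostream>
- #include <iterator>  // std::back_inserter
- #include <string>
- #include <boost/assert.hpp>
- // Also include the header that declares boost::longest_common_subsequence.
-
- void test_character_seq(const char *str1, const char *str2)
- {
-     std::cout << "Comparing: \"" << str1 << "\"" << std::endl
-               << " & \"" << str2 << "\"" << std::endl;
-
-     // Compute the subsequence itself; the return value is its length.
-     std::string subsequence;
-     signed short llcs;
-     llcs = boost::longest_common_subsequence<signed short>(
-                str1, str1 + std::strlen(str1),
-                str2, str2 + std::strlen(str2),
-                std::back_inserter(subsequence));
-
-     std::cout << "Longest Common Subsequence: \"" << subsequence
-               << "\"" << std::endl
-               << "Length is " << llcs << std::endl;
-
-     // The length-only algorithm must agree with the full algorithm.
-     BOOST_ASSERT(llcs == static_cast<signed short>(subsequence.size()));
-     BOOST_ASSERT(llcs == boost::longest_common_subsequence_length<signed short>(
-                              str1, str1 + std::strlen(str1),
-                              str2, str2 + std::strlen(str2)));
- }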
- </pre>
- <h2>Test Program</h2>
- <p>The test program has been compiled and tested with 4 compilers with 5 STL variants.</p>
- <table border=1 cellspacing=0 cellpadding=6 width=100%>
- <tr bgcolor="teal">
- <td><b> Compiler</b></td>
- <td><b> STL version</b></td>
- <td><b> Command Line</b></td>
- <td><b> Success/Failure</b></td>
- </tr>
- <tr>
- <td bgcolor="teal"><font color="#ffffff"><b> Microsoft Visual C++.Net (VC7)</b></font></td>
- <td bgcolor="#ffcc66">Dinkumware (with compiler)</td>
- <td bgcolor="#ffcc66">Using the project file <code>lcs.vcproj</code></td>
- <td bgcolor="#ffcc66">Success.</td>
- </tr>
- <tr>
- <td bgcolor="teal"><font color="#ffffff"><b> Microsoft Visual C++.Net (VC7)</b></font></td>
- <td bgcolor="#ffcc66">STLport-4.5.3</td>
- <td bgcolor="#ffcc66">Using the project file <code>lcs.vcproj</code></td>
- <td bgcolor="#ffcc66">Success.</td>
- </tr>
- <tr>
- <td bgcolor="teal"><font color="#ffffff"><b> Microsoft Visual C++ v6</b></font></td>
- <td bgcolor="#ffcc66">Dinkumware (with compiler)</td>
- <td bgcolor="#ffcc66"><code>cl lcs_test.cpp -GX</code></td>
- <td bgcolor="#ffcc66">Success.</td>
- </tr>
- <tr>
- <td bgcolor="teal"><font color="#ffffff"><b> g++ 3.1.1 20020718 (prerelease) (Cygwin)</b></font></td>
- <td bgcolor="#ffcc66">Supplied with compiler</td>
- <td bgcolor="#ffcc66"><code>g++ lcs_test.cpp</code></td>
- <td bgcolor="#ffcc66">Success.</td>
- </tr>
- <tr>
- <td bgcolor="teal"><font color="#ffffff"><b> Borland Free C++ 5.5.1</b></font></td>
- <td bgcolor="#ffcc66">Supplied with compiler</td>
- <td bgcolor="#ffcc66"><code>bcc32 lcs_test.cpp</code></td>
- <td bgcolor="#ffcc66">Success.</td>
- </tr>
- <tr>
- <td bgcolor="teal"><font color="#ffffff"><b> Borland Free C++ 5.5.1</b></font></td>
- <td bgcolor="#ffcc66">STLport-4.5.3</td>
- <td bgcolor="#ffcc66"><code>bcc32 lcs_test.cpp</code></td>
- <td bgcolor="#ffcc66">Success.</td>
- </tr>
- </table>
- <h2>Disclaimer</h2>
- <p>© Copyright 2002, 2003, <a href="mailto:cdm.henderson@virgin.net">Craig Henderson</a></p>
- <p><i>Permission to use, copy, modify, distribute and sell this software and its
- documentation for any purpose is hereby granted without fee, provided that the
- above copyright notice appears in all copies and that both that copyright
- notice and this permission notice appear in supporting documentation. The
- author makes no representations about the suitability of this software for any
- purpose. It is provided "as is" without express or implied warranty.</i></p>
- <hr>
- <i>Last Revised: 10th October, 2003</i> </font>
- </body>
- </html>