<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
	<head>
		<title>Boost Longest Common Subsequence</title>
	</head>
	<body>
		<table border="1" bgcolor="teal" cellpadding="2">
			<tr>
				<td bgcolor="white">
					<a href="../../../index.htm"><img src="../../../c++boost.gif" alt="Home" width="277" height="86" border="0"></a></td>
				<td><a href="../../../index.htm"><font face="Arial" color="white"><big>Home</big></font></a></td>
				<td><a href="../../libraries.htm"><font face="Arial" color="white"><big>Libraries</big></font></a></td>
				<td><a href="../../../people/people.htm"><font face="Arial" color="white"><big>People</big></font></a></td>
				<td><a href="../../../more/faq.htm"><font face="Arial" color="white"><big>FAQ</big></font></a></td>
				<td><a href="../../../more/index.htm"><font face="Arial" color="white"><big>More</big></font></a></td>
			</tr>
		</table>
		<font face="Times" size="3">
			<h1>Boost Longest Common Subsequence</h1>
			<font size="+1"><b>Definitions</b></font>
			<table width="100%">
				<tr>
					<td valign="top"><i>Element</i></td>
					<td>A single item of data</td>
				</tr>
				<tr>
					<td valign="top"><i>Sequence</i></td>
					<td>An ordered container of elements. This can be an STL container or any other
						type implementing the necessary interface, such as <code>boost::array&lt;&gt;</code>
						or a raw C++ array.</td>
				</tr>
			</table>
			<h2>Overview</h2>
			<p>The <i>Longest Common Subsequence</i> (LCS) algorithm constructs a sequence from two
				source sequences. The result contains the elements common to both sources, in the
				order in which they appear. For example, take the first two lines of a popular
				nursery rhyme as input sequences. What is the longest sequence that contains
				elements from both of these sequences, with the elements in the same order as
				they appear in the originals?</p>
			<p>To visualise the problem, we can write the two strings, one below the other,
				such that each letter that appears in both strings is aligned:</p>
			<table width="100%">
				<tr>
					<td>
						<pre>
ja      <font color="#0000dd"><b>c</b></font>k  <font color="#0000dd"><b>a</b></font>nd jil  <font color="#0000dd"><b>l</b></font> w<font color="#0000dd"><b>e</b></font>nt up     <font color="#0000dd"><b>t</b></font>h<font color="#0000dd"><b>e</b></font> hill
  to fet<font color="#0000dd"><b>c</b></font> h <font color="#0000dd"><b>a</b></font>      pa<font color="#0000dd"><b>l</b></font>  <font color="#0000dd"><b>e</b></font>     of wa<font color="#0000dd"><b>t</b></font> <font color="#0000dd"><b>e</b></font>     r
</pre>
					</td>
				</tr>
			</table>
			We can see from this that the Longest Common Subsequence, whose letters are
			highlighted in blue, is "<code><b>c a le&nbsp;&nbsp;te</b></code>", and has a length of 10
			elements, counting the spaces. To solve the problem of identifying the LCS, we can observe that:<br>
			<ul><li>
				If the first elements of the two sequences are equal, then this element must also
				appear at the beginning of the subsequence.</li>
			<li>
				If the first elements of the two sequences are not equal, then either element, but
				not both, may appear at the beginning of the subsequence.</li>
			</ul>
			<p>Once a decision has been made about the first elements of the input sequences,
				the rest of the problem is simply another LCS problem on the shorter sequences.
				That is to say, the problem is recursive.</p>
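			<p>The two rules above can be sketched directly as a recursive function. This is a
				minimal illustration only, not the library's implementation: the name
				<code>lcs_length_recursive</code> and its use of <code>std::string</code> are
				assumptions made for the example, and the running time is exponential.</p>

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical helper, not part of the library: naive recursive LCS length
// over the suffixes a[i..] and b[j..]. Exponential time; illustration only.
std::size_t lcs_length_recursive(const std::string &a, const std::string &b,
                                 std::size_t i = 0, std::size_t j = 0)
{
    if (i == a.size() || j == b.size())
        return 0;   // one sequence is exhausted, nothing left in common

    if (a[i] == b[j])   // equal first elements: the element joins the subsequence
        return 1 + lcs_length_recursive(a, b, i + 1, j + 1);

    // unequal first elements: discard one or the other, keep the better result
    return std::max(lcs_length_recursive(a, b, i + 1, j),
                    lcs_length_recursive(a, b, i, j + 1));
}
```

			<p>Each call solves a smaller LCS problem on the remaining elements, which is what
				makes the tabular approach described next possible.</p>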
			<p>The LCS is computed using a 2-dimensional array, with one sequence along the
				x-axis and the other sequence along the y-axis. Given that the subsequence is
				<i>common</i> to both sequences, the x and y axes are interchangeable. The algorithm
				allocates memory by row, so placing the shorter sequence along the x-axis uses the
				least memory, whereas placing the shorter sequence along the y-axis uses fewer
				iterations and is therefore slightly faster. I have adopted the smallest memory
				footprint approach in my algorithm.</p>

			<p>In order to calculate a value in the array, we must know the values to the left, above
				and diagonally above-left. We therefore only ever need two rows of information, so we
				can allocate enough memory for two rows of working data and reuse the buffer for
				alternate rows. At the end of the algorithm, the length of the subsequence is given by
				the rightmost value in the last row.</p>
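			<p>The two-row technique described above might look like the following sketch. It is
				not the library's code: the function name <code>lcs_length_two_rows</code>, the
				<code>std::vector</code> buffers and the <code>std::string</code> inputs are
				assumptions chosen to keep the example short.</p>

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, not part of the library: LCS length using only two
// rows of working data.
std::size_t lcs_length_two_rows(const std::string &a, const std::string &b)
{
    // Put the shorter sequence along the x-axis so each row is as short as possible.
    const std::string &x = a.size() <= b.size() ? a : b;
    const std::string &y = a.size() <= b.size() ? b : a;

    // Index 0 of each row is a zero sentinel, so the loops need no edge checks.
    std::vector<std::size_t> prev(x.size() + 1, 0), curr(x.size() + 1, 0);

    for (std::size_t row = 1; row <= y.size(); ++row) {
        for (std::size_t col = 1; col <= x.size(); ++col) {
            if (x[col - 1] == y[row - 1])
                curr[col] = prev[col - 1] + 1;                   // diagonal above-left
            else
                curr[col] = std::max(curr[col - 1], prev[col]);  // left or above
        }
        std::swap(prev, curr);   // reuse the buffers for alternate rows
    }
    return prev[x.size()];       // rightmost value of the last row computed
}
```

			<p>Swapping the two buffers after each row means no memory is allocated inside the
				loops, whatever the length of the longer sequence.</p>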
			<h4>Algorithm</h4>
			<p>Note that the first and second rows must be initialised to zero. The algorithm below assumes
			that the sequence elements are at array indexes 1..n. Index 0 is reserved for zero-valued elements, to
			avoid conditions in the main loops. The <code>position</code> array has the same length as the first
			sequence and tracks the position at which the middle element of sequence two is found in
			sequence one.</p>
			<pre>
for each element in the first sequence from first to last
    for each element in the second sequence from first to (n / 2)
        if the data for each element is equal, then
            set array[column][row] = 1 + array[column-1][row-1]
        else
            set array[column][row] = max(array[column-1][row], array[column][row-1])

for each element in the first sequence from the first to the last
    for each element in the second sequence from (n / 2) + 1 to the last
        if the data for each element is equal, then
            set array[column][row] = 1 + array[column-1][row-1]
            set position[column] = position[column-1]
        else if array[column-1][row] &gt; array[column][row-1]
            set array[column][row] = array[column-1][row]
            set position[column] = position[column-1]
        else
            set array[column][row] = array[column][row-1]
</pre>
			<p>Once we get to this point, we know the length of the subsequence, and we know the middle
			node along the solution path, which is held in the last element of <code>position</code>. The
			sub-problems can now be solved recursively for elements 1..n/2 and (n/2)+1..n.</p>
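			<p>The divide-and-conquer step can be illustrated with Hirschberg's well-known
				formulation. Note that this is a different way of finding the split point from the
				<code>position</code> array described above: it combines a forward pass over one half
				with a pass over the reversed other half. The names <code>lcs_last_row</code> and
				<code>lcs_hirschberg</code> are hypothetical and the sketch is not the library's code.</p>

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: last row of the LCS length table, computed with two rows.
// result[j] is the LCS length of a and the first j elements of b.
std::vector<std::size_t> lcs_last_row(const std::string &a, const std::string &b)
{
    std::vector<std::size_t> prev(b.size() + 1, 0), curr(b.size() + 1, 0);
    for (std::size_t i = 1; i <= a.size(); ++i) {
        for (std::size_t j = 1; j <= b.size(); ++j)
            curr[j] = (a[i - 1] == b[j - 1]) ? prev[j - 1] + 1
                                             : std::max(curr[j - 1], prev[j]);
        std::swap(prev, curr);
    }
    return prev;
}

// Hypothetical helper: recover one LCS in linear space (Hirschberg's method).
std::string lcs_hirschberg(const std::string &a, const std::string &b)
{
    if (a.empty() || b.empty())
        return std::string();
    if (a.size() == 1)   // base case: a single element either occurs in b or not
        return b.find(a[0]) != std::string::npos ? a : std::string();

    // Split the first sequence in the middle.
    const std::size_t mid = a.size() / 2;
    const std::string a_top(a, 0, mid), a_bottom(a, mid);

    // Forward pass on the top half; pass over the reversed bottom half.
    const std::vector<std::size_t> fwd = lcs_last_row(a_top, b);
    const std::string a_bottom_rev(a_bottom.rbegin(), a_bottom.rend());
    const std::string b_rev(b.rbegin(), b.rend());
    const std::vector<std::size_t> bwd = lcs_last_row(a_bottom_rev, b_rev);

    // Choose the split of b that maximises the combined length of both halves.
    std::size_t split = 0, best = 0;
    for (std::size_t k = 0; k <= b.size(); ++k) {
        const std::size_t total = fwd[k] + bwd[b.size() - k];
        if (total > best) {
            best = total;
            split = k;
        }
    }

    // Solve the two sub-problems recursively and concatenate the results.
    return lcs_hirschberg(a_top, std::string(b, 0, split))
         + lcs_hirschberg(a_bottom, std::string(b, split));
}
```

			<p>When several subsequences share the maximum length, this sketch returns just one of
				them; which one depends on where the split falls.</p>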

			<h2>Synopsis</h2>
			<p>There are two fundamental algorithms: one computes the subsequence itself, and the
			other computes only its length. Each algorithm has two overloads, one that takes an
			allocator object for memory management and one that uses a default allocator.</p>
<pre>
template&lt;typename size_type,
         typename ItIn1,
         typename ItIn2,
         typename ItSubSeq&gt;
inline
size_type
longest_common_subsequence(ItIn1    begin_first,
                           ItIn1    end_first,
                           ItIn2    begin_second,
                           ItIn2    end_second,
                           ItSubSeq subsequence);

template&lt;typename size_type,
         typename Alloc,
         typename ItIn1,
         typename ItIn2,
         typename ItSubSeq&gt;
size_type
longest_common_subsequence(ItIn1     begin_first,
                           ItIn1     end_first,
                           ItIn2     begin_second,
                           ItIn2     end_second,
                           ItSubSeq  subsequence,
                           Alloc    &amp;alloc);

template&lt;typename size_type, typename ItIn1, typename ItIn2&gt;
inline
size_type
longest_common_subsequence_length(ItIn1 begin_first, ItIn1 end_first,
                                  ItIn2 begin_second, ItIn2 end_second);

template&lt;typename size_type, typename Alloc, typename ItIn1, typename ItIn2&gt;
size_type
longest_common_subsequence_length(ItIn1 begin_first, ItIn1 end_first,
                                  ItIn2 begin_second, ItIn2 end_second,
                                  Alloc &amp;alloc);
</pre>
			<h2>Example</h2>
<pre>
#include &lt;cstring&gt;
#include &lt;iostream&gt;
#include &lt;iterator&gt;
#include &lt;string&gt;
#include &lt;boost/assert.hpp&gt;
// Also include the header that declares boost::longest_common_subsequence.

void test_character_seq(const char *str1, const char *str2)
{
    std::cout &lt;&lt; "Comparing: \"" &lt;&lt; str1 &lt;&lt; "\"" &lt;&lt; std::endl
              &lt;&lt; "         &amp; \"" &lt;&lt; str2 &lt;&lt; "\"" &lt;&lt; std::endl;

    // Compute the subsequence, appending each element to the string.
    std::string subsequence;
    signed short llcs;
    llcs = boost::longest_common_subsequence&lt;
               signed short&gt;(
                   str1, str1+std::strlen(str1),
                   str2, str2+std::strlen(str2),
                   std::back_inserter(subsequence));

    std::cout &lt;&lt; "Longest Common Subsequence: \"" &lt;&lt; subsequence
              &lt;&lt; "\"" &lt;&lt; std::endl
              &lt;&lt; "Length is " &lt;&lt; llcs &lt;&lt; std::endl;

    // The returned length must match both the sequence written and
    // the result of the length-only algorithm.
    BOOST_ASSERT(llcs == static_cast&lt;signed short&gt;(subsequence.size()));
    BOOST_ASSERT(llcs == boost::longest_common_subsequence_length&lt;
                             signed short&gt;(str1, str1+std::strlen(str1),
                                           str2, str2+std::strlen(str2)));
}
</pre>
			<h2>Test Program</h2>
			<p>The test program has been compiled and tested with 4 compilers with 5 STL variants.</p>
			<table border="1" cellspacing="0" cellpadding="6" width="100%">
				<tr bgcolor="teal">
					<td><b>&nbsp;Compiler</b></td>
					<td><b>&nbsp;STL version</b></td>
					<td><b>&nbsp;Command Line</b></td>
					<td><b>&nbsp;Success/Failure</b></td>
				</tr>
				<tr>
					<td bgcolor="teal"><font color="#ffffff"><b>&nbsp;Microsoft Visual C++.Net (VC7)</b></font></td>
					<td bgcolor="#ffcc66">Dinkumware (with compiler)</td>
					<td bgcolor="#ffcc66">Using the project file <code>lcs.vcproj</code></td>
					<td bgcolor="#ffcc66">Success.</td>
				</tr>
				<tr>
					<td bgcolor="teal"><font color="#ffffff"><b>&nbsp;Microsoft Visual C++.Net (VC7)</b></font></td>
					<td bgcolor="#ffcc66">STLport-4.5.3</td>
					<td bgcolor="#ffcc66">Using the project file <code>lcs.vcproj</code></td>
					<td bgcolor="#ffcc66">Success.</td>
				</tr>
				<tr>
					<td bgcolor="teal"><font color="#ffffff"><b>&nbsp;Microsoft Visual C++ v6</b></font></td>
					<td bgcolor="#ffcc66">Dinkumware (with compiler)</td>
					<td bgcolor="#ffcc66"><code>cl lcs_test.cpp -GX</code></td>
					<td bgcolor="#ffcc66">Success.</td>
				</tr>
				<tr>
					<td bgcolor="teal"><font color="#ffffff"><b>&nbsp;g++ 3.1.1 20020718 (prerelease) (Cygwin)</b></font></td>
					<td bgcolor="#ffcc66">Compiler</td>
					<td bgcolor="#ffcc66"><code>g++ lcs_test.cpp</code></td>
					<td bgcolor="#ffcc66">Success.</td>
				</tr>
				<tr>
					<td bgcolor="teal"><font color="#ffffff"><b>&nbsp;Borland Free C++ 5.5.1</b></font></td>
					<td bgcolor="#ffcc66">Compiler</td>
					<td bgcolor="#ffcc66"><code>bcc32 lcs_test.cpp</code></td>
					<td bgcolor="#ffcc66">Success.</td>
				</tr>
				<tr>
					<td bgcolor="teal"><font color="#ffffff"><b>&nbsp;Borland Free C++ 5.5.1</b></font></td>
					<td bgcolor="#ffcc66">STLport-4.5.3</td>
					<td bgcolor="#ffcc66"><code>bcc32 lcs_test.cpp</code></td>
					<td bgcolor="#ffcc66">Success.</td>
				</tr>
			</table>
			<h2>Disclaimer</h2>
			<p>&copy; Copyright 2002, 2003, <a href="mailto:cdm.henderson@virgin.net">Craig Henderson</a></p>
			<p><i>Permission to use, copy, modify, distribute and sell this software and its
					documentation for any purpose is hereby granted without fee, provided that the
					above copyright notice appears in all copies and that both that copyright
					notice and this permission notice appear in supporting documentation. The
					author makes no representations about the suitability of this software for any
					purpose. It is provided "as is" without express or implied warranty.</i></p>
			<hr>
<i>Last Revised: 10th October, 2003</i> </font>
</body>
</html>