/Src/Dependencies/Boost/libs/statechart/doc/performance.html
http://hadesmem.googlecode.com/ · HTML · 454 lines · 367 code · 87 blank · 0 comment · 0 complexity · e94bb4a0396b85ad454dedf336a2cebe MD5 · raw file
- <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
- <html>
- <head>
- <meta http-equiv="Content-Language" content="en-us">
- <meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
- <meta name="GENERATOR" content="Microsoft FrontPage 6.0">
- <meta name="ProgId" content="FrontPage.Editor.Document">
- <link rel="stylesheet" type="text/css" href="../../../boost.css">
- <title>The Boost Statechart Library - Performance</title>
- </head>
- <body link="#0000FF" vlink="#800080">
- <table border="0" cellpadding="7" cellspacing="0" width="100%" summary=
- "header">
- <tr>
- <td valign="top" width="300">
- <h3><a href="../../../index.htm"><img alt="C++ Boost" src=
- "../../../boost.png" border="0" width="277" height="86"></a></h3>
- </td>
- <td valign="top">
- <h1 align="center">The Boost Statechart Library</h1>
- <h2 align="center">Performance</h2>
- </td>
- </tr>
- </table>
- <hr>
- <dl class="index">
- <dt><a href="#SpeedVersusScalabilityTradeoffs">Speed versus scalability
- tradeoffs</a></dt>
- <dt><a href="#MemoryManagementCustomization">Memory management
- customization</a></dt>
- <dt><a href="#RttiCustomization">RTTI customization</a></dt>
- <dt><a href="#ResourceUsage">Resource usage</a></dt>
- </dl>
- <h2><a name="SpeedVersusScalabilityTradeoffs" id=
- "SpeedVersusScalabilityTradeoffs">Speed versus scalability
- tradeoffs</a></h2>
- <p>Quite a bit of effort has gone into making the library fast for small
- simple machines <b>and</b> scaleable at the same time (this applies only to
- <code>state_machine<></code>, there still is some room for optimizing
- <code>fifo_scheduler<></code>, especially for multi-threaded builds).
- While I believe it should perform reasonably in most applications, the
- scalability does not come for free. Small, carefully handcrafted state
- machines will thus easily outperform equivalent Boost.Statechart machines.
- To get a picture of how big the gap is, I implemented a simple benchmark in
- the BitMachine example. The Handcrafted example is a handcrafted variant of
- the 1-bit-BitMachine implementing the same benchmark.</p>
- <p>I tried to create a fair but somewhat unrealistic <b>worst-case</b>
- scenario:</p>
- <ul>
- <li>For both machines exactly one object of the only event is allocated
- before starting the test. This same object is then sent to the machines
- over and over</li>
- <li>The Handcrafted machine employs GOF-visitor double dispatch. The
- states are preallocated so that event dispatch & transition amounts
- to nothing more than two virtual calls and one pointer assignment</li>
- </ul>
- <p>The Benchmarks - running on a 3.2GHz Intel Pentium 4 - produced the
- following dispatch and transition times per event:</p>
- <ul>
- <li>Handcrafted:
- <ul>
- <li>MSVC7.1: 10ns</li>
- <li>GCC3.4.2: 20ns</li>
- <li>Intel9.0: 20ns</li>
- </ul>
- </li>
- <li>1-bit-BitMachine with customized memory management:
- <ul>
- <li>MSVC7.1: 150ns</li>
- <li>GCC3.4.2: 320ns</li>
- <li>Intel9.0: 170ns</li>
- </ul>
- </li>
- </ul>
- <p>Although this is a big difference I still think it will not be
- noticeable in most real-world applications. No matter whether an
- application uses handcrafted or Boost.Statechart machines it will...</p>
- <ul>
- <li>almost never run into a situation where a state machine is swamped
- with as many events as in the benchmarks. A state machine will almost
- always spend a good deal of time waiting for events (which typically come
- from a human operator, from machinery or from electronic devices over
- often comparatively slow I/O channels). Parsers are just about the only
- application of FSMs where this is not the case. However, parser FSMs are
- usually not directly specified on the state machine level but on a higher
- one that is better suited for the task. Examples of such higher levels
- are: Boost.Regex, Boost.Spirit, XML Schemas, etc. Moreover, the nature of
- parsers allows for a number of optimizations that are not possible in a
- general-purpose FSM framework.<br>
- Bottom line: While it is possible to implement a parser with this
- library, it is almost never advisable to do so because other approaches
- lead to better performing and more expressive code</li>
- <li>often run state machines in their own threads. This adds considerable
- locking and thread-switching overhead. Performance tests with the
- PingPong example, where two asynchronous state machines exchange events,
- gave the following times to process one event and perform the resulting
- in-state reaction (using the library with
- <code>boost::fast_pool_allocator<></code>):
- <ul>
- <li>Single-threaded (no locking and waiting): 840ns / 840ns</li>
- <li>Multi-threaded with one thread (the scheduler uses mutex locking
- but never has to wait for events): 6500ns / 4800ns</li>
- <li>Multi-threaded with two threads (both schedulers use mutex
- locking and exactly one always waits for an event): 14000ns /
- 7000ns</li>
- </ul>
- <p>As mentioned above, there definitely is some room to improve the
- timings for the asynchronous machines. Moreover, these are very crude
- benchmarks, designed to show the overhead of locking and thread context
- switching. The overhead in a real-world application will typically be
- smaller and other operating systems can certainly do better in this
- area. However, I strongly believe that on most platforms the threading
- overhead is usually larger than the time that Boost.Statechart spends
- for event dispatch and transition. Handcrafted machines will inevitably
- have the same overhead, making raw single-threaded dispatch and
- transition speed much less important</p>
- </li>
- <li>almost always allocate events with <code>new</code> and destroy them
- after consumption. This will add a few cycles, even if event memory
- management is customized</li>
- <li>often use state machines that employ orthogonal states and other
- advanced features. This forces the handcrafted machines to use a more
- adequate and more time-consuming book-keeping</li>
- </ul>
- <p>Therefore, in real-world applications event dispatch and transition not
- normally constitutes a bottleneck and the relative gap between handcrafted
- and Boost.Statechart machines also becomes much smaller than in the
- worst-case scenario.</p>
- <h3>Detailed performance data</h3>
- <p>In an effort to identify the main performance bottleneck, the example
- program "Performance" has been written. It measures the time that is spent
- to process one event in different BitMachine variants. In contrast to the
- BitMachine example, which uses only transitions, Performance uses a varying
- number of in-state reactions together with transitions. The only difference
- between in-state-reactions and transitions is that the former neither enter
- nor exit any states. Apart from that, the same amount of code needs to be
- run to dispatch an event and execute the resulting action.</p>
- <p>The following diagrams show the average time the library spends to
- process one event, depending on the percentage of in-state reactions
- employed. 0% means that all triggered reactions are transitions. 100% means
- that all triggered reactions are in-state reactions. I draw the following
- conclusions from these measurements:</p>
- <ul>
- <li>The fairly linear course of the curves suggests that the measurements
- with a 100% in-state reaction ratio are accurate and not merely a product
- of optimizations in the compiler. Such optimizations might have been
- possible due to the fact that in the 100% case it is known at
- compile-time that the current state will never change</li>
- <li>The data points with 100% in-state reaction ratio and speed optimized
- RTTI show that modern compilers seem to inline the complex-looking
- dispatch code so aggressively that dispatch is reduced to little more
- than it actually is, one virtual function call followed by a linear
- search for a suitable reaction. For instance, in the case of the 1-bit
- Bitmachine, Intel9.0 seems to produce dispatch code that is equally
- efficient like the two virtual function calls in the Handcrafted
- machine</li>
- <li>On all compilers and in all variants the time spent in event dispatch
- is dwarfed by the time spent to exit the current state and enter the
- target state. It is worth noting that BitMachine is a flat and
- non-orthogonal state machine, representing a close-to-worst case.
- Real-world machines will often exit and enter multiple states during a
- transition, what further dwarfs pure dispatch time. This makes the
- implementation of constant-time dispatch (as requested by a few people
- during formal review) an undertaking with little merit. Instead, the
- future optimization effort will concentrate on state-entry and
- state-exit</li>
- <li>Intel9.0 seems to have problems to optimize/inline code as soon as
- the amount of code grows over a certain threshold. Unlike with the other
- two compilers, I needed to compile the tests for the 1, 2, 3 and 4-bit
- BitMachine into separate executables to get good performance. Even then
- was the performance overly bad for the 4-bit BitMachine. It was much
- worse when I compiled all 4 tests into the same executable. This surely
- looks like a bug in the compiler</li>
- </ul>
- <h4>Out of the box</h4>
- <p>The library is used as is, without any optimizations/modifications.</p>
- <p><img alt="PerformanceNormal1" src="PerformanceNormal1.gif" border="0"
- width="371" height="284"><img alt="PerformanceNormal2" src=
- "PerformanceNormal2.gif" border="0" width="371" height="284"><img alt=
- "PerformanceNormal3" src="PerformanceNormal3.gif" border="0" width="371"
- height="284"><img alt="PerformanceNormal4" src="PerformanceNormal4.gif"
- border="0" width="371" height="284"></p>
- <h4>Native RTTI</h4>
- <p>The library is used with <code><a href=
- "configuration.html#ApplicationDefinedMacros">BOOST_STATECHART_USE_NATIVE_RTTI</a></code>
- defined.</p>
- <p><img alt="PerformanceNative1" src="PerformanceNative1.gif" border="0"
- width="371" height="284"><img alt="PerformanceNative2" src=
- "PerformanceNative2.gif" border="0" width="371" height="284"><img alt=
- "PerformanceNative3" src="PerformanceNative3.gif" border="0" width="371"
- height="284"><img alt="PerformanceNative4" src="PerformanceNative4.gif"
- border="0" width="371" height="284"></p>
- <h4>Customized memory-management</h4>
- <p>The library is used with customized memory management
- (<code>boost::fast_pool_allocator</code>).</p>
- <p><img alt="PerformanceCustom1" src="PerformanceCustom1.gif" border="0"
- width="371" height="284"><img alt="PerformanceCustom2" src=
- "PerformanceCustom2.gif" border="0" width="371" height="284"><img alt=
- "PerformanceCustom3" src="PerformanceCustom3.gif" border="0" width="371"
- height="284"><img alt="PerformanceCustom4" src="PerformanceCustom4.gif"
- border="0" width="371" height="284"></p>
- <h3><a name="DoubleDispatch" id="DoubleDispatch">Double dispatch</a></h3>
- <p>At the heart of every state machine lies an implementation of double
- dispatch. This is due to the fact that the incoming event <b>and</b> the
- active state define exactly which <a href=
- "definitions.html#Reaction">reaction</a> the state machine will produce.
- For each event dispatch, one virtual call is followed by a linear search
- for the appropriate reaction, using one RTTI comparison per reaction. The
- following alternatives were considered but rejected:</p>
- <ul>
- <li><a href=
- "http://www.objectmentor.com/resources/articles/acv.pdf">Acyclic
- visitor</a>: This double-dispatch variant satisfies all scalability
- requirements but performs badly due to costly inheritance tree
- cross-casts. Moreover, a state must store one v-pointer for <b>each</b>
- reaction what slows down construction and makes memory management
- customization inefficient. In addition, C++ RTTI must inevitably be
- turned on, with negative effects on executable size. Boost.Statechart
- originally employed acyclic visitor and was about 4 times slower than it
- is now (MSVC7.1 on Intel Pentium M). The dispatch speed might be better
- on other platforms but the other negative effects will remain</li>
- <li><a href="http://en.wikipedia.org/wiki/Visitor_pattern">GOF
- Visitor</a>: The GOF Visitor pattern inevitably makes the whole machine
- depend upon all events. That is, whenever a new event is added there is
- no way around recompiling the whole state machine. This is contrary to
- the scalability requirements</li>
- <li>Single two-dimensional array of function pointers: To satisfy
- requirement 6, it should be possible to spread a single state machine
- over several translation units. This however means that the dispatch
- table must be filled at runtime and the different translation units must
- somehow make themselves "known", so that their part of the state machine
- can be added to the table. There simply is no way to do this
- automatically <b>and</b> portably. The only portable way that a state
- machine distributed over several translation units could employ
- table-based double dispatch relies on the user. The programmer(s) would
- somehow have to <b>manually</b> tie together the various pieces of the
- state machine. Not only does this scale badly but is also quite
- error-prone</li>
- </ul>
- <h2><a name="MemoryManagementCustomization" id=
- "MemoryManagementCustomization">Memory management customization</a></h2>
- <p>Out of the box, everything (event objects, state objects, internal data,
- etc.) is allocated through <code>std::allocator< void ></code> (the
- default for the Allocator template parameter). This should be satisfactory
- for applications meeting the following prerequisites:</p>
- <ul>
- <li>There are no deterministic reaction time (hard real-time)
- requirements</li>
- <li>The application will never run long enough for heap fragmentation to
- become a problem. This is of course an issue for all long running
- programs not only the ones employing this library. However, it should be
- noted that fragmentation problems could show up earlier than with
- traditional FSM frameworks</li>
- </ul>
- <p>Should an application not meet these prerequisites, Boost.Statechart's
- memory management can be customized as follows:</p>
- <ul>
- <li>By passing a model of the standard Allocator concept to the class
- templates that support a corresponding parameter
- (<code>event<></code>, <code>state_machine<></code>,
- <code>asynchronous_state_machine<></code>,
- <code>fifo_scheduler<></code>, <code>fifo_worker<></code>).
- This redirects all allocations to the passed custom allocator and should
- satisfy the needs of just about any project</li>
- <li>Additionally, it is possible to <b>separately</b> customize
- <b>state</b> memory management by overloading <code>operator new()</code>
- and <code>operator delete()</code> for all state classes but this is
- probably only useful under rare circumstances</li>
- </ul>
- <h2><a name="RttiCustomization" id="RttiCustomization">RTTI
- customization</a></h2>
- <p>RTTI is used for event dispatch and
- <code>state_downcast<>()</code>. Currently, there are exactly two
- options:</p>
- <ol>
- <li>By default, a speed-optimized internal implementation is
- employed</li>
- <li>The library can be instructed to use native C++ RTTI instead by
- defining <code><a href=
- "configuration.html#ApplicationDefinedMacros">BOOST_STATECHART_USE_NATIVE_RTTI</a></code></li>
- </ol>
- <p>There are 2 reasons to favor 2:</p>
- <ul>
- <li>When a state machine (or parts of it) are compiled into a DLL,
- problems could arise from the use of the internal RTTI mechanism (see the
- FAQ item "<a href="faq.html#Dll">How can I compile a state machine into a
- dynamic link library (DLL)?</a>"). Using option 2 is one way to work
- around such problems (on some platforms, it seems to be the only
- way)</li>
- <li>State and event objects need to store one pointer less, meaning that
- in the best case the memory footprint of a state machine object could
- shrink by 15% (an empty event is typically 30% smaller, what can be an
- advantage when there are bursts of events rather than a steady flow).
- However, on most platforms executable size grows when C++ RTTI is turned
- on. So, given the small per machine object savings, this only makes sense
- in applications where both of the following conditions hold:
- <ul>
- <li>Event dispatch will never become a bottleneck</li>
- <li>There is a need to reduce the memory allocated at runtime (at the
- cost of a larger executable)</li>
- </ul>
- <p>Obvious candidates are embedded systems where the executable resides
- in ROM. Other candidates are applications running a large number of
- identical state machines where this measure could even reduce the
- <b>overall</b> memory footprint</p>
- </li>
- </ul>
- <h2><a name="ResourceUsage" id="ResourceUsage">Resource usage</a></h2>
- <h3>Memory</h3>
- <p>On a 32-bit box, one empty active state typically needs less than 50
- bytes of memory. Even <b>very</b> complex machines will usually have less
- than 20 simultaneously active states so just about every machine should run
- with less than one kilobyte of memory (not counting event queues).
- Obviously, the per-machine memory footprint is offset by whatever
- state-local members the user adds.</p>
- <h3>Processor cycles</h3>
- <p>The following ranking should give a rough picture of what feature will
- consume how many cycles:</p>
- <ol>
- <li><code>state_cast<>()</code>: By far the most cycle-consuming
- feature. Searches linearly for a suitable state, using one
- <code>dynamic_cast</code> per visited state</li>
- <li>State entry and exit: Profiling of the fully optimized
- 1-bit-BitMachine suggested that roughly 3 quarters of the total event
- processing time is spent destructing the exited state and constructing
- the entered state. Obviously, transitions where the <a href=
- "definitions.html#InnermostCommonContext">innermost common context</a> is
- "far" from the leaf states and/or with lots of orthogonal states can
- easily cause the destruction and construction of quite a few states
- leading to significant amounts of time spent for a transition</li>
- <li><code>state_downcast<>()</code>: Searches linearly for the
- requested state, using one virtual call and one RTTI comparison per
- visited state</li>
- <li>Deep history: For all innermost states inside a state passing either
- <code>has_deep_history</code> or <code>has_full_history</code> to its
- state base class, a binary search through the (usually small) history map
- must be performed on each exit. History slot allocation is performed
- exactly once, at first exit</li>
- <li>Shallow history: For all direct inner states of a state passing
- either <code>has_shallow_history</code> or <code>has_full_history</code>
- to its state base class, a binary search through the (usually small)
- history map must be performed on each exit. History slot allocation is
- performed exactly once, at first exit</li>
- <li>Event dispatch: One virtual call followed by a linear search for a
- suitable <a href="definitions.html#Reaction">reaction</a>, using one RTTI
- comparison per visited reaction</li>
- <li>Orthogonal states: One additional virtual call for each exited state
- <b>if</b> there is more than one active leaf state before a transition.
- It should also be noted that the worst-case event dispatch time is
- multiplied in the presence of orthogonal states. For example, if two
- orthogonal leaf states are added to a given state configuration, the
- worst-case time is tripled</li>
- </ol>
- <hr>
- <p><a href="http://validator.w3.org/check?uri=referer"><img border="0" src=
- "../../../doc/images/valid-html401.png" alt="Valid HTML 4.01 Transitional"
- height="31" width="88"></a></p>
- <p>Revised
- <!--webbot bot="Timestamp" s-type="EDITED" s-format="%d %B, %Y" startspan -->03 December, 2006<!--webbot bot="Timestamp" endspan i-checksum="38512" --></p>
- <p><i>Copyright © 2003-<!--webbot bot="Timestamp" s-type="EDITED" s-format="%Y" startspan -->2006<!--webbot bot="Timestamp" endspan i-checksum="770" -->
- <a href="contact.html">Andreas Huber Dönni</a></i></p>
- <p><i>Distributed under the Boost Software License, Version 1.0. (See
- accompanying file <a href="../../../LICENSE_1_0.txt">LICENSE_1_0.txt</a> or
- copy at <a href=
- "http://www.boost.org/LICENSE_1_0.txt">http://www.boost.org/LICENSE_1_0.txt</a>)</i></p>
- </body>
- </html>