NOTES ON OPTIMIZING DICTIONARIES
================================


Principal Use Cases for Dictionaries
------------------------------------

Passing keyword arguments
    Typically, one read and one write for 1 to 3 elements.
    Occurs frequently in normal python code.

Class method lookup
    Dictionaries vary in size with 8 to 16 elements being common.
    Usually written once with many lookups.
    When base classes are used, there are many failed lookups
    followed by a lookup in a base class.

Instance attribute lookup and Global variables
    Dictionaries vary in size.  4 to 10 elements are common.
    Both reads and writes are common.

Builtins
    Frequent reads.  Almost never written.
    Size 126 interned strings (as of Py2.3b1).
    A few keys are accessed much more frequently than others.

Uniquification
    Dictionaries of any size.  Bulk of work is in creation.
    Repeated writes to a smaller set of keys.
    Single read of each key.
    Some use cases have two consecutive accesses to the same key.

    * Removing duplicates from a sequence.
        dict.fromkeys(seqn).keys()

    * Counting elements in a sequence.
        for e in seqn:
            d[e] = d.get(e, 0) + 1

    * Accumulating references in a dictionary of lists:
        for pagenumber, page in enumerate(pages):
            for word in page:
                d.setdefault(word, []).append(pagenumber)

    Note, the second example is a use case characterized by a get and set
    to the same key.  There are similar use cases with a __contains__
    followed by a get, set, or del to the same key.  Part of the
    justification for d.setdefault is combining the two lookups into one.

Membership Testing
    Dictionaries of any size.  Created once and then rarely changed.
    Single write to each key.
    Many calls to __contains__() or has_key().
    Similar access patterns occur with replacement dictionaries
    such as with the % formatting operator.
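
    For instance, the replacement-dictionary pattern with the % operator
    looks like this (an illustrative sketch with made-up keys and values):

        d = {'name': 'dictnotes', 'lines': 270}     # single write per key
        present = 'name' in d                       # many membership tests
        msg = '%(name)s has %(lines)d lines' % d    # repeated read-only lookups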

Dynamic Mappings
    Characterized by deletions interspersed with adds and replacements.
    Performance benefits greatly from the re-use of dummy entries.

Data Layout (assuming a 32-bit box with 64 bytes per cache line)
----------------------------------------------------------------

Smalldicts (8 entries) are attached to the dictobject structure
and the whole group nearly fills two consecutive cache lines.

Larger dicts use the first half of the dictobject structure (one cache
line) and a separate, contiguous block of entries (at 12 bytes each
for a total of 5.333 entries per cache line).
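
The arithmetic behind those figures, as a quick check (on a 32-bit box
each entry holds three 4-byte fields: me_hash, me_key, and me_value):

    ENTRY_SIZE = 3 * 4              # bytes per entry on a 32-bit box
    CACHE_LINE = 64                 # bytes per cache line
    print(CACHE_LINE / float(ENTRY_SIZE))   # -> 5.333... entries per line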

Tunable Dictionary Parameters
-----------------------------

* PyDict_MINSIZE.  Currently set to 8.
    Must be a power of two.  New dicts have to zero-out every cell.
    Each additional 8 consumes 1.5 cache lines.  Increasing improves
    the sparseness of small dictionaries but costs time to read in
    the additional cache lines if they are not already in cache.
    That case is common when keyword arguments are passed.

* Maximum dictionary load in PyDict_SetItem.  Currently set to 2/3.
    Increasing this ratio makes dictionaries more dense, resulting
    in more collisions.  Decreasing it improves sparseness at the
    expense of spreading entries over more cache lines and at the
    cost of total memory consumed.

    The load test occurs in highly time-sensitive code.  Efforts
    to make the test more complex (for example, varying the load
    for different sizes) have degraded performance.

* Growth rate upon hitting maximum load.  Currently set to *2.
    Raising this to *4 results in half the number of resizes,
    less effort to resize, better sparseness for some (but not
    all) dict sizes, and potentially doubled memory consumption
    depending on the size of the dictionary.  Setting to *4
    eliminates every other resize step.

* Maximum sparseness (minimum dictionary load).  What percentage
    of entries can be unused before the dictionary shrinks to
    free up memory and speed up iteration?  (The current CPython
    code does not represent this parameter directly.)

* Shrinkage rate upon exceeding maximum sparseness.  The current
    CPython code never even checks sparseness when deleting a
    key.  When a new key is added, it resizes based on the number
    of active keys, so that the addition may trigger shrinkage
    rather than growth.
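
To get a feel for how the load limit and the growth factor interact,
here is a rough simulation (a sketch only: it mimics the 2/3 rule and
a chosen growth multiplier, not the actual resizing code, which also
rounds the new size to a power of two based on the number of active
keys):

    def count_resizes(n_keys, minsize=8, load_num=2, load_den=3, growth=2):
        """Count table resizes while inserting n_keys unique keys."""
        size = minsize
        resizes = 0
        for used in range(1, n_keys + 1):
            if used * load_den > size * load_num:   # load exceeds 2/3
                size *= growth
                resizes += 1
        return resizes

    # Doubling versus quadrupling on 10,000 unique keys:
    #   count_resizes(10000, growth=2)  ->  11 resizes
    #   count_resizes(10000, growth=4)  ->   6 resizes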

Tune-ups should be measured across a broad range of applications and
use cases.  A change to any parameter will help in some situations and
hurt in others.  The key is to find settings that help the most common
cases and do the least damage to the less common cases.  Results will
vary dramatically depending on the exact number of keys, whether the
keys are all strings, whether reads or writes dominate, and the exact
hash values of the keys (some sets of values have fewer collisions than
others).  Any one test or benchmark is likely to prove misleading.

While making a dictionary more sparse reduces collisions, it impairs
iteration and key listing.  Those methods loop over every potential
entry.  Doubling the size of a dictionary results in twice as many
non-overlapping memory accesses for keys(), items(), values(),
__iter__(), iterkeys(), iteritems(), itervalues(), and update().
Also, every dictionary iterates at least twice, once for the memset()
when it is created and once by dealloc().

Dictionary operations involving only a single key can be O(1) unless
resizing is possible.  By checking for a resize only when the
dictionary can grow (and may *require* resizing), other operations
remain O(1), and the odds of resize thrashing or memory fragmentation
are reduced.  In particular, an algorithm that empties a dictionary
by repeatedly invoking .pop will see no resizing, which would be
wasted effort since the dictionary is eventually discarded entirely.
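
For example, a consume-all loop of this shape never triggers a resize
even as the dictionary steadily empties (process() is a stand-in for
whatever work is done per item):

    while d:
        key, value = d.popitem()
        process(key, value)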

Results of Cache Locality Experiments
-------------------------------------

When an entry is retrieved from memory, 4.333 adjacent entries are also
retrieved into a cache line.  Since accessing items in cache is *much*
cheaper than a cache miss, an enticing idea is to probe the adjacent
entries as a first step in collision resolution.  Unfortunately, the
introduction of any regularity into collision searches results in more
collisions than the current random chaining approach.

Exploiting cache locality at the expense of additional collisions fails
to pay off when the entries are already loaded in cache (the expense
is paid with no compensating benefit).  This occurs in small dictionaries
where the whole dictionary fits into a pair of cache lines.  It also
occurs frequently in large dictionaries which have a common access pattern
where some keys are accessed much more frequently than others.  The
more popular entries *and* their collision chains tend to remain in cache.

To exploit cache locality, change the collision resolution section
in lookdict() and lookdict_string().  Set i^=1 at the top of the
loop and move the i = (i << 2) + i + perturb + 1 to an unrolled
version of the loop.
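
In Python terms, the probe sequence and the proposed tweak look roughly
like this (a sketch of the C logic only, not the real implementation;
PERTURB_SHIFT is 5 in CPython):

    PERTURB_SHIFT = 5

    def probe_order(hashvalue, mask, tweak=False):
        """Yield table indices in the order the lookup would visit them."""
        perturb = hashvalue & 0xFFFFFFFF    # treat the hash as unsigned, as C does
        i = perturb & mask
        while True:
            yield i & mask
            if tweak:
                yield (i ^ 1) & mask        # peek at the cache-line neighbor first
            i = (i << 2) + i + perturb + 1  # same as i = 5*i + perturb + 1
            perturb >>= PERTURB_SHIFT

The first few values of probe_order(hash('somekey'), 7) give the order
in which an 8-cell table would be searched.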

This optimization strategy can be leveraged in several ways:

* If the dictionary is kept sparse (through the tunable parameters),
  then the occurrence of additional collisions is lessened.

* If lookdict() and lookdict_string() are specialized for small dicts
  and for large dicts, then the versions for large dicts can be given
  an alternate search strategy without increasing collisions in small dicts
  which already have the maximum benefit of cache locality.

* If the use case for a dictionary is known to have a random key
  access pattern (as opposed to a more common pattern with a Zipf's law
  distribution), then there will be more benefit for large dictionaries
  because any given key is no more likely than another to already be
  in cache.

* In use cases with paired accesses to the same key, the second access
  is always in cache and gets no benefit from efforts to further improve
  cache locality.

Optimizing the Search of Small Dictionaries
-------------------------------------------

If lookdict() and lookdict_string() are specialized for smaller dictionaries,
then a custom search approach can be implemented that exploits the small
search space and cache locality.

* The simplest example is a linear search of contiguous entries.  This is
  simple to implement, guaranteed to terminate rapidly, never searches
  the same entry twice, and precludes the need to check for dummy entries
  (a sketch follows this list).

* A more advanced example is a self-organizing search so that the most
  frequently accessed entries get probed first.  The organization
  adapts if the access pattern changes over time.  Treaps are ideally
  suited for self-organization, with the most common entries at the
  top of the heap and a rapid binary search pattern.  Most probes and
  results are located at the top of the tree, allowing them all to fit
  in one or two cache lines.

* Also, small dictionaries may be made more dense, perhaps filling all
  eight cells to take the maximum advantage of two cache lines.
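
The sketch promised above: a linear scan over a small, dense entry
table, expressed in Python terms (illustrative only; entries stands in
for the C entry array as a list of (hash, key, value) triples):

    def small_lookup(entries, key):
        """Linear search of a small, dense entry table."""
        hashvalue = hash(key)
        for h, k, v in entries:     # at most 8 entries, no dummies to skip
            if h == hashvalue and (k is key or k == key):
                return v
        raise KeyError(key)

The identity-then-equality test mirrors the shortcut the real lookup
routines take before falling back to a full comparison.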

Strategy Pattern
----------------

Consider allowing the user to set the tunable parameters or to select a
particular search method.  Since some dictionary use cases have known
sizes and access patterns, the user may be able to provide useful hints.

1) For example, if membership testing or lookups dominate runtime and memory
   is not at a premium, the user may benefit from setting the maximum load
   ratio at 5% or 10% instead of the usual 66.7%.  This will sharply
   curtail the number of collisions but will increase iteration time.
   The builtin namespace is a prime example of a dictionary that can
   benefit from being highly sparse.

2) Dictionary creation time can be shortened in cases where the ultimate
   size of the dictionary is known in advance.  The dictionary can be
   pre-sized so that no resize operations are required during creation.
   Not only does this save resizes, but the key insertion will go
   more quickly because the first half of the keys will be inserted into
   a more sparse environment than before.  The preconditions for this
   strategy arise whenever a dictionary is created from a key or item
   sequence and the number of *unique* keys is known.  (See the sketch
   after this list.)

3) If the key space is large and the access pattern is known to be random,
   then search strategies exploiting cache locality can be fruitful.
   The preconditions for this strategy arise in simulations and
   numerical analysis.

4) If the keys are fixed and the access pattern strongly favors some of
   the keys, then the entries can be stored contiguously and accessed
   with a linear search or treap.  This exploits knowledge of the data,
   cache locality, and a simplified search routine.  It also eliminates
   the need to test for dummy entries on each probe.  The preconditions
   for this strategy arise in symbol tables and in the builtin dictionary.
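
The sketch promised in point 2, reusing count_resizes() from the
tunables section (the pre-size value below is just the smallest power
of two whose 2/3 load bound covers 10,000 keys; CPython exposes no
dict pre-sizing hook at the Python level):

    # Built key-by-key from the default size, the table resizes 11 times;
    # pre-sized, it never resizes and every insertion lands in a table
    # that is already sparse.
    assert count_resizes(10000, minsize=8) == 11
    assert count_resizes(10000, minsize=16384) == 0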

Readonly Dictionaries
---------------------

Some dictionary use cases pass through a build stage and then move to a
more heavily exercised lookup stage with no further changes to the
dictionary.

An idea that emerged on python-dev is to be able to convert a dictionary
to a read-only state.  This can help prevent programming errors and also
provide knowledge that can be exploited for lookup optimization.

The dictionary can be immediately rebuilt (eliminating dummy entries),
resized (to an appropriate level of sparseness), and the keys can be
jostled (to minimize collisions).  The lookdict() routine can then
eliminate the test for dummy entries (saving about 1/4 of the time
spent in the collision resolution loop).

An additional possibility is to insert links into the empty spaces
so that dictionary iteration can proceed in len(d) steps instead of
(mp->mask + 1) steps.  Alternatively, a separate tuple of keys can be
kept just for iteration.
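
The error-prevention half of the idea is available today as
types.MappingProxyType (Python 3.3 and later), though it is only a
read-only view and performs none of the lookup optimizations described
above:

    import types

    d = {'pi': 3.14159, 'e': 2.71828}
    frozen = types.MappingProxyType(d)  # read-only view of d
    value = frozen['pi']                # reads work as usual
    try:
        frozen['tau'] = 6.28318         # writes are rejected
    except TypeError:
        pass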

Caching Lookups
---------------

The idea is to exploit key access patterns by anticipating future lookups
based on previous lookups.

The simplest incarnation is to save the most recently accessed entry.
This gives optimal performance for use cases where every get is followed
by a set or del to the same key.
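
In Python terms, the behavior (though not the real benefit, which comes
from caching the entry's position in the C table so that the following
set or del can skip the probe entirely) looks like this sketch:

    class LastHitDict(dict):
        """Sketch: remember the most recently read key/value pair."""

        def __init__(self, *args, **kwds):
            dict.__init__(self, *args, **kwds)
            self._last = None

        def __getitem__(self, key):
            if self._last is not None and self._last[0] == key:
                return self._last[1]
            value = dict.__getitem__(self, key)
            self._last = (key, value)
            return value

        def __setitem__(self, key, value):
            self._last = None           # invalidate the cache on mutation
            dict.__setitem__(self, key, value)

        def __delitem__(self, key):
            self._last = None
            dict.__delitem__(self, key)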