#Klib: a Generic Library in C

##<a name="overview"></a>Overview

Klib is a standalone and lightweight C library distributed under the [MIT/X11
license][1]. Most components are independent of external libraries, except the
standard C library, and independent of each other. To use a component of this
library, you only need to copy a couple of files to your source code tree
without worrying about library dependencies.

Klib strives for efficiency and a small memory footprint. Some components, such
as khash.h, kbtree.h, ksort.h and kvec.h, are among the most efficient
implementations of similar algorithms or data structures in all programming
languages, in terms of both speed and memory use.

New documentation is available [here](http://attractivechaos.github.io/klib/),
which includes most information in this README file.

####Common components

* [khash.h][khash]: generic hash table based on [double hashing][2].
* [kbtree.h][kbtree]: generic search tree based on [B-tree][3].
* [ksort.h][ksort]: generic sort, including [introsort][4], [merge sort][5], [heap sort][6], [comb sort][7], [Knuth shuffle][8] and the [k-small][9] algorithm.
* [kseq.h][kseq]: generic stream buffer and a [FASTA][10]/[FASTQ][11] format parser.
* kvec.h: generic dynamic array.
* klist.h: generic singly linked list and [memory pool][12].
* kstring.{h,c}: basic string library.
* kmath.{h,c}: numerical routines including [MT19937-64][13] [pseudorandom generator][14], basic [nonlinear programming][15] and a few special math functions.

####Components for more specific use cases

* ksa.c: constructing [suffix arrays][16] for strings with multiple sentinels, based on a revised [SAIS algorithm][17].
* knetfile.{h,c}: random access to remote files on HTTP or FTP.
* kopen.c: smart stream opening.
* khmm.{h,c}: basic [HMM][18] library.
* ksw.{h,c}: Striped [Smith-Waterman algorithm][19].
* knhx.{h,c}: [Newick tree format][20] parser.

##<a name="methodology"></a>Methodology

For the implementation of generic [containers][21], klib extensively uses C
macros. To use these data structures, we usually need to instantiate methods by
expanding a long macro. This makes the source code look unusual or even ugly
and adds difficulty to debugging. Unfortunately, for efficient generic
programming in C, which lacks [templates][22], using macros is the only
solution. Only with macros can we write a generic container that, once
instantiated, competes with a type-specific container in efficiency. Some
generic libraries in C, such as [GLib][23], use the `void*` type to implement
containers. These implementations are usually slower and use more memory than
klib (see [this benchmark][31]).

To effectively use klib, it is important to understand how it achieves generic
programming. We will use the hash table library as an example:

    #include "khash.h"
    KHASH_MAP_INIT_INT(m32, char)        // instantiate structs and methods
    int main() {
        int ret, is_missing;
        khint_t k;
        khash_t(m32) *h = kh_init(m32);  // allocate a hash table
        k = kh_put(m32, h, 5, &ret);     // insert a key into the hash table
        if (!ret) kh_del(m32, h, k);     // ret is 0 if the key was already present
        kh_value(h, k) = 10;             // set the value
        k = kh_get(m32, h, 10);          // query the hash table
        is_missing = (k == kh_end(h));   // test if the key is present
        k = kh_get(m32, h, 5);
        kh_del(m32, h, k);               // remove a key-value pair
        for (k = kh_begin(h); k != kh_end(h); ++k)  // traverse
            if (kh_exist(h, k))          // test if a bucket contains data
                kh_value(h, k) = 1;
        kh_destroy(m32, h);              // deallocate the hash table
        return 0;
    }

In this example, the second line instantiates a hash table with `unsigned` as
the key type and `char` as the value type. `m32` names such a type of hash table.
All types and functions associated with this name are macros, which will be
explained later. Macro `kh_init()` initializes a hash table and `kh_destroy()`
frees it. `kh_put()` inserts a key and returns the iterator (or the position)
in the hash table. `kh_get()` looks up a key and `kh_del()` deletes an element.
Macro `kh_exist()` tests if an iterator (or a position) is filled with data.

An immediate question is why the `khash.h` example compiles at all, since it
does not look like a valid C program: it lacks a semicolon after the second
line, assigns to an _apparent_ function call and uses an _apparently_ undefined
`m32` 'variable'. To understand why the code is correct, let's go a bit further
into the source code of `khash.h`, whose skeleton looks like:


    #define KHASH_INIT(name, SCOPE, key_t, val_t, is_map, _hashf, _hasheq) \
        typedef struct { \
            int n_buckets, size, n_occupied, upper_bound; \
            unsigned *flags; \
            key_t *keys; \
            val_t *vals; \
        } kh_##name##_t; \
        SCOPE inline kh_##name##_t *init_##name() { \
            return (kh_##name##_t*)calloc(1, sizeof(kh_##name##_t)); \
        } \
        SCOPE inline int get_##name(kh_##name##_t *h, key_t k) \
        ... \
        SCOPE inline void destroy_##name(kh_##name##_t *h) { \
            if (h) { \
                free(h->keys); free(h->flags); free(h->vals); free(h); \
            } \
        }

    #define _int_hf(key) (unsigned)(key)
    #define _int_heq(a, b) ((a) == (b))
    #define khash_t(name) kh_##name##_t
    #define kh_value(h, k) ((h)->vals[k])
    #define kh_begin(h) 0
    #define kh_end(h) ((h)->n_buckets)
    #define kh_init(name) init_##name()
    #define kh_get(name, h, k) get_##name(h, k)
    #define kh_destroy(name, h) destroy_##name(h)
    ...
    #define KHASH_MAP_INIT_INT(name, val_t) \
        KHASH_INIT(name, static, unsigned, val_t, 1, _int_hf, _int_heq)

`KHASH_INIT()` is a huge macro defining all the structs and methods. When this
macro is called, all the code inside it is inserted by the [C
preprocessor][37] at the place where it is called. If the macro is called
multiple times, multiple copies of the code are inserted. To avoid naming
conflicts between hash tables with different key-value types, the library uses
[token concatenation][36], a preprocessor feature whereby we can substitute
part of a symbol based on the parameter of the macro. In the end, the C
preprocessor will generate the following code and feed it to the compiler
(macro `kh_exist(h,k)` is a little complex and not expanded for simplicity):


    typedef struct {
        int n_buckets, size, n_occupied, upper_bound;
        unsigned *flags;
        unsigned *keys;
        char *vals;
    } kh_m32_t;
    static inline kh_m32_t *init_m32() {
        return (kh_m32_t*)calloc(1, sizeof(kh_m32_t));
    }
    static inline int get_m32(kh_m32_t *h, unsigned k)
    ...
    static inline void destroy_m32(kh_m32_t *h) {
        if (h) {
            free(h->keys); free(h->flags); free(h->vals); free(h);
        }
    }

    int main() {
        int ret, is_missing;
        khint_t k;
        kh_m32_t *h = init_m32();
        k = put_m32(h, 5, &ret);
        if (!ret) del_m32(h, k);
        h->vals[k] = 10;
        k = get_m32(h, 10);
        is_missing = (k == h->n_buckets);
        k = get_m32(h, 5);
        del_m32(h, k);
        for (k = 0; k != h->n_buckets; ++k)
            if (kh_exist(h, k)) h->vals[k] = 1;
        destroy_m32(h);
        return 0;
    }

This is the C program we know.

From this example, we can see that macros and the C preprocessor play a key
role in klib. Klib is fast partly because the compiler knows the key-value
types at compile time and is able to optimize the code to the same level
as type-specific code. A generic library written with `void*` will not get such
a performance boost.

Massively inserting code upon instantiation may remind us of C++'s slow
compiling speed and huge binary size when STL/boost is in use. Klib is much
better in this respect due to its small code size and the independence of its
components. Inserting several hundred lines of code won't make compiling
noticeably slower.

##<a name="resources"></a>Resources

* Library documentation, if present, is available in the header files. Examples
can be found in the [test/][24] directory.
* **Obsolete** documentation of the hash table library can be found at
[SourceForge][25]. This README is partly adapted from the old documentation.
* [Blog post][26] describing the hash table library.
* [Blog post][27] on why using `void*` for generic programming may be inefficient.
* [Blog post][28] on the generic stream buffer.
* [Blog post][29] evaluating the performance of `kvec.h`.
* [Blog post][30] arguing B-tree may be a better data structure than a binary search tree.
* [Blog post][31] evaluating the performance of `khash.h` and `kbtree.h` among many other implementations.
[An older version][33] of the benchmark is also available.
* [Blog post][34] benchmarking internal sorting algorithms and implementations.
* [Blog post][32] on the k-small algorithm.
* [Blog post][35] on the Hooke-Jeeves algorithm for nonlinear programming.

[1]: http://en.wikipedia.org/wiki/MIT_License
[2]: http://en.wikipedia.org/wiki/Double_hashing
[3]: http://en.wikipedia.org/wiki/B-tree
[4]: http://en.wikipedia.org/wiki/Introsort
[5]: http://en.wikipedia.org/wiki/Merge_sort
[6]: http://en.wikipedia.org/wiki/Heapsort
[7]: http://en.wikipedia.org/wiki/Comb_sort
[8]: http://en.wikipedia.org/wiki/Fisher-Yates_shuffle
[9]: http://en.wikipedia.org/wiki/Selection_algorithm
[10]: http://en.wikipedia.org/wiki/FASTA_format
[11]: http://en.wikipedia.org/wiki/FASTQ_format
[12]: http://en.wikipedia.org/wiki/Memory_pool
[13]: http://en.wikipedia.org/wiki/Mersenne_twister
[14]: http://en.wikipedia.org/wiki/Pseudorandom_generator
[15]: http://en.wikipedia.org/wiki/Nonlinear_programming
[16]: http://en.wikipedia.org/wiki/Suffix_array
[17]: https://sites.google.com/site/yuta256/sais
[18]: http://en.wikipedia.org/wiki/Hidden_Markov_model
[19]: http://en.wikipedia.org/wiki/Smith-Waterman_algorithm
[20]: http://en.wikipedia.org/wiki/Newick_format
[21]: http://en.wikipedia.org/wiki/Container_(abstract_data_type)
[22]: http://en.wikipedia.org/wiki/Template_(C%2B%2B)
[23]: http://en.wikipedia.org/wiki/GLib
[24]: https://github.com/attractivechaos/klib/tree/master/test
[25]: http://klib.sourceforge.net/
[26]: http://attractivechaos.wordpress.com/2008/09/02/implementing-generic-hash-library-in-c/
[27]: http://attractivechaos.wordpress.com/2008/10/02/using-void-in-generic-c-programming-may-be-inefficient/
[28]: http://attractivechaos.wordpress.com/2008/10/11/a-generic-buffered-stream-wrapper/
[29]: http://attractivechaos.wordpress.com/2008/09/19/c-array-vs-c-vector/
[30]: http://attractivechaos.wordpress.com/2008/09/24/b-tree-vs-binary-search-tree/
[31]: http://attractivechaos.wordpress.com/2008/10/07/another-look-at-my-old-benchmark/
[32]: http://attractivechaos.wordpress.com/2008/09/13/calculating-median/
[33]: http://attractivechaos.wordpress.com/2008/08/28/comparison-of-hash-table-libraries/
[34]: http://attractivechaos.wordpress.com/2008/08/28/comparison-of-internal-sorting-algorithms/
[35]: http://attractivechaos.wordpress.com/2008/08/24/derivative-free-optimization-dfo/
[36]: http://en.wikipedia.org/wiki/C_preprocessor#Token_concatenation
[37]: http://en.wikipedia.org/wiki/C_preprocessor
[kbtree]: http://attractivechaos.github.io/klib/#KBtree%3A%20generic%20ordered%20map:%5B%5BKBtree%3A%20generic%20ordered%20map%5D%5D
[khash]: http://attractivechaos.github.io/klib/#Khash%3A%20generic%20hash%20table:%5B%5BKhash%3A%20generic%20hash%20table%5D%5D
[kseq]: http://attractivechaos.github.io/klib/#Kseq%3A%20stream%20buffer%20and%20FASTA%2FQ%20parser:%5B%5BKseq%3A%20stream%20buffer%20and%20FASTA%2FQ%20parser%5D%5D
[ksort]: http://attractivechaos.github.io/klib/#Ksort%3A%20sorting%2C%20shuffling%2C%20heap%20and%20k-small:%5B%5BKsort%3A%20sorting%2C%20shuffling%2C%20heap%20and%20k-small%5D%5D