.. _faq:

===========================
Frequently Asked Questions
===========================

Here we try to give some answers to questions that regularly pop up on the mailing list.

What is the project name (a lot of people get it wrong)?
--------------------------------------------------------
scikit-learn, but not scikit or SciKit nor sci-kit learn. Also not scikits.learn or scikits-learn, which were previously used.

How do you pronounce the project name?
------------------------------------------
sy-kit learn. sci stands for science!

Why scikit?
------------
There are multiple scikits, which are scientific toolboxes built around SciPy.
You can find a list at `<https://scikits.appspot.com/scikits>`_.
Apart from scikit-learn, another popular one is `scikit-image <http://scikit-image.org/>`_.

How can I contribute to scikit-learn?
-----------------------------------------
See :ref:`contributing`. Before attempting to add a new algorithm, which is
usually a major and lengthy undertaking, it is recommended to start with
:ref:`known issues <easy_issues>`.

How can I create a bunch object?
------------------------------------------------

Don't make a bunch object! They are not part of the scikit-learn API. Bunch
objects are just a way to package some numpy arrays. As a scikit-learn user you
only ever need numpy arrays to feed your model with data.

For instance, to train a classifier, all you need is a 2D array ``X`` for the
input variables and a 1D array ``y`` for the target variables. The array ``X``
holds the features as columns and the samples as rows. The array ``y`` contains
integer values to encode the class membership of each sample in ``X``.

To load data as numpy arrays you can use different libraries depending on the
original data format:

* `numpy.loadtxt
  <http://docs.scipy.org/doc/numpy/reference/generated/numpy.loadtxt.html>`_ to
  load text files (such as CSV) assuming that all the columns have a
  homogeneous data type (e.g. all numeric values).

* `scipy.io <http://docs.scipy.org/doc/scipy/reference/io.html>`_ for common
  binary formats often used in scientific computing contexts.

* `scipy.misc.imread <http://docs.scipy.org/doc/scipy/reference/generated/scipy.misc.imread.html#scipy.misc.imread>`_
  (requires the `Pillow <https://pypi.python.org/pypi/Pillow>`_ package) to
  load pixel intensity data from various image file formats.

* `pandas.io <http://pandas.pydata.org/pandas-docs/stable/io.html>`_ to load
  heterogeneously typed data from various file formats and database protocols.
  The loaded data can be sliced and diced before conversion to numerical
  features in a numpy array.

Note: if you manage your own numerical data it is recommended to use an
optimized file format such as HDF5 to reduce data load times. Various libraries
such as H5Py, PyTables and pandas provide a Python interface for reading and
writing data in that format.

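As a rough illustration of the point above, the following sketch loads a small,
made-up CSV table with ``numpy.loadtxt`` and splits it into ``X`` and ``y``.
The column layout (two feature columns followed by an integer class label) and
the in-memory ``csv_text`` are assumptions made for this example; they do not
come from any scikit-learn dataset.

```python
import io

import numpy as np

# Hypothetical CSV content: two numeric features, then an integer class label.
csv_text = """5.1,3.5,0
4.9,3.0,0
6.2,2.9,1
5.9,3.0,1
"""

# loadtxt accepts any file-like object; all columns must share one dtype.
table = np.loadtxt(io.StringIO(csv_text), delimiter=",")

X = table[:, :-1]             # 2D array: samples as rows, features as columns
y = table[:, -1].astype(int)  # 1D array: one integer class label per sample

print(X.shape, y.shape)
```

Arrays built this way can be passed directly to the ``fit`` method of an
estimator.
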
What are the inclusion criteria for new algorithms?
----------------------------------------------------

We only consider well-established algorithms for inclusion. A rule of thumb is
at least 3 years since publication, 200+ citations and wide use and
usefulness. A technique that provides a clear-cut improvement (e.g. an
enhanced data structure or a more efficient approximation technique) on
a widely-used method will also be considered for inclusion.

From the algorithms or techniques that meet the above criteria, only those
which fit well within the current API of scikit-learn, that is a ``fit``,
``predict/transform`` interface and ordinarily having input/output that is a
numpy array or sparse matrix, are accepted.

The contributor should support the importance of the proposed addition with
research papers and/or implementations in other similar packages, demonstrate
its usefulness via common use-cases/applications and corroborate performance
improvements, if any, with benchmarks and/or plots. It is expected that the
proposed algorithm should outperform the methods that are already implemented
in scikit-learn at least in some areas.

Also note that your implementation need not be in scikit-learn to be used
together with scikit-learn tools. You can implement your favorite algorithm in
a scikit-learn compatible way, upload it to GitHub and let us know. We will
list it under :ref:`related_projects`.

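To make the ``fit`` / ``predict`` convention concrete, here is a minimal sketch
of a classifier that follows it. ``MajorityClassifier`` is a hypothetical name
invented for this example. It only uses numpy and, for brevity, does not
subclass ``sklearn.base.BaseEstimator``, which a real scikit-learn compatible
project would normally do so that ``get_params`` and ``set_params`` are
available.

```python
import numpy as np


class MajorityClassifier:
    """Toy classifier that always predicts the most frequent training label.

    Hypothetical example: it only illustrates the fit/predict convention.
    """

    def fit(self, X, y):
        # Learned attributes end with an underscore by scikit-learn convention.
        classes, counts = np.unique(np.asarray(y), return_counts=True)
        self.classes_ = classes
        self.majority_ = classes[np.argmax(counts)]
        return self  # returning self allows chaining: clf.fit(X, y).predict(X)

    def predict(self, X):
        # One prediction per row of X.
        n_samples = np.asarray(X).shape[0]
        return np.full(n_samples, self.majority_)


X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 0.5], [3.0, 0.0]])
y = np.array([1, 1, 0, 1])

clf = MajorityClassifier().fit(X, y)
predictions = clf.predict(X)
print(predictions)
```
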
Why are you so selective on what algorithms you include in scikit-learn?
------------------------------------------------------------------------
Code is maintenance cost, and we need to balance the amount of
code we have with the size of the team (and add to this the fact that
complexity scales non-linearly with the number of features).
The package relies on core developers using their free time to
fix bugs, maintain code and review contributions.
Any algorithm that is added needs future attention by the developers,
at which point the original author might long have lost interest.
Also see `this thread on the mailing list
<http://sourceforge.net/p/scikit-learn/mailman/scikit-learn-general/thread/CAAkaFLWcBG%2BgtsFQzpTLfZoCsHMDv9UG5WaqT0LwUApte0TVzg%40mail.gmail.com/#msg33104380>`_.

Why did you remove HMMs from scikit-learn?
--------------------------------------------
See :ref:`adding_graphical_models`.

.. _adding_graphical_models:

Will you add graphical models or sequence prediction to scikit-learn?
---------------------------------------------------------------------

Not in the foreseeable future.
scikit-learn tries to provide a unified API for the basic tasks in machine
learning, with pipelines and meta-algorithms like grid search to tie
everything together. The concepts, APIs, algorithms and
expertise required for structured learning are different from what
scikit-learn has to offer. If we started doing arbitrary structured
learning, we'd need to redesign the whole package and the project
would likely collapse under its own weight.

There are two projects with APIs similar to scikit-learn that
do structured prediction:

* `pystruct <http://pystruct.github.io/>`_ handles general structured
  learning (focuses on SSVMs on arbitrary graph structures with
  approximate inference; defines the notion of sample as an instance of
  the graph structure)

* `seqlearn <http://larsmans.github.io/seqlearn/>`_ handles sequences only
  (focuses on exact inference; has HMMs, but mostly for the sake of
  completeness; treats a feature vector as a sample and uses an offset encoding
  for the dependencies between feature vectors)

Will you add GPU support?
-------------------------

No, or at least not in the near future. The main reason is that GPU support
would introduce many software dependencies and platform-specific
issues. scikit-learn is designed to be easy to install on a wide variety of
platforms. Outside of neural networks, GPUs don't play a large role in machine
learning today, and much larger gains in speed can often be achieved by a
careful choice of algorithms.

Do you support PyPy?
--------------------

In case you didn't know, `PyPy <http://pypy.org/>`_ is the new, fast,
just-in-time compiling Python implementation. We don't support it.
When the `NumPy support <http://buildbot.pypy.org/numpy-status/latest.html>`_
in PyPy is complete or near-complete, and SciPy is ported over as well,
we can start thinking of a port.
We use too much of NumPy to work with a partial implementation.

How do I deal with string data (or trees, graphs...)?
-----------------------------------------------------

scikit-learn estimators assume you'll feed them real-valued feature vectors.
This assumption is hard-coded in pretty much all of the library.
However, you can feed non-numerical inputs to estimators in several ways.

If you have text documents, you can use term frequency features; see
:ref:`text_feature_extraction` for the built-in *text vectorizers*.
For more general feature extraction from any kind of data, see
:ref:`dict_feature_extraction` and :ref:`feature_hashing`.

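The idea behind feature hashing can be sketched in a few lines of plain Python.
This is only an illustration of the principle: ``hash_features`` is a
hypothetical helper, and it uses ``zlib.crc32`` where scikit-learn's own hasher
uses a signed MurmurHash3, so the column indices will differ from what the
library produces.

```python
import zlib


def hash_features(sample, n_features=8):
    """Map a dict of {feature name: value} to a fixed-length list of floats."""
    vector = [0.0] * n_features
    for name, value in sample.items():
        # crc32 is deterministic across runs, unlike the built-in hash() on str.
        index = zlib.crc32(name.encode("utf-8")) % n_features
        vector[index] += value  # colliding features simply add up
    return vector


# Hypothetical word counts for two short documents.
docs = [{"cat": 2, "sat": 1}, {"dog": 1, "sat": 3, "mat": 1}]
X = [hash_features(doc) for doc in docs]
print(X)
```

Every sample ends up with the same number of columns regardless of which
feature names it contains, which is what makes the result usable as estimator
input.
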
Another common case is when you have non-numerical data and a custom distance
(or similarity) metric on these data. Examples include strings with edit
distance (a.k.a. Levenshtein distance; e.g., DNA or RNA sequences). These can be
encoded as numbers, but doing so is painful and error-prone. Working with
distance metrics on arbitrary data can be done in two ways.

Firstly, many estimators take precomputed distance/similarity matrices, so if
the dataset is not too large, you can compute distances for all pairs of inputs.

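A minimal sketch of this first option, assuming a small dataset: compute every
pairwise edit distance up front and store the result in a square matrix. The
``edit_distance`` function below is a plain-Python stand-in written for this
example (a dedicated package would be much faster), and the matrix is kept as
nested lists so that the sketch has no dependencies.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


data = ["ACCTCCTAGAAG", "ACCTACTAGAAGTT", "GAATATTAGGCCGA"]

# Symmetric matrix of all pairwise distances, with zeros on the diagonal.
distances = [[edit_distance(s, t) for t in data] for s in data]
```

Estimators that accept precomputed distances, such as DBSCAN with
``metric='precomputed'``, can then be fitted on ``distances`` (converted to a
numpy array) instead of on feature vectors.
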
If the dataset is large, you can use feature vectors with only one "feature",
which is an index into a separate data structure, and supply a custom metric
function that looks up the actual data in this data structure. E.g., to use
DBSCAN with Levenshtein distances::

    >>> from leven import levenshtein       # doctest: +SKIP
    >>> import numpy as np
    >>> from sklearn.cluster import dbscan
    >>> data = ["ACCTCCTAGAAG", "ACCTACTAGAAGTT", "GAATATTAGGCCGA"]
    >>> def lev_metric(x, y):
    ...     i, j = int(x[0]), int(y[0])     # extract indices
    ...     return levenshtein(data[i], data[j])
    ...
    >>> X = np.arange(len(data)).reshape(-1, 1)
    >>> X
    array([[0],
           [1],
           [2]])
    >>> dbscan(X, metric=lev_metric, eps=5, min_samples=2)  # doctest: +SKIP
    ([0, 1], array([ 0,  0, -1]))

(This uses the third-party edit distance package ``leven``.)

Similar tricks can be used, with some care, for tree kernels, graph kernels,
etc.

Why do I sometimes get a crash/freeze with n_jobs > 1 under OSX or Linux?
--------------------------------------------------------------------------

Several scikit-learn tools such as ``GridSearchCV`` and ``cross_val_score``
rely internally on Python's ``multiprocessing`` module to parallelize execution
onto several Python processes when ``n_jobs > 1`` is passed as an argument.

The problem is that Python ``multiprocessing`` does a ``fork`` system call
without following it with an ``exec`` system call for performance reasons. Many
libraries like (some versions of) Accelerate / vecLib under OSX, (some versions
of) MKL, the OpenMP runtime of GCC, nvidia's Cuda (and probably many others),
manage their own internal thread pool. Upon a call to ``fork``, the thread pool
state in the child process is corrupted: the thread pool believes it has many
threads while only the main thread state has been forked. It is possible to
change the libraries to make them detect when a fork happens and reinitialize
the thread pool in that case: we did that for OpenBLAS (merged upstream in
master since 0.2.10) and we contributed a `patch
<https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60035>`_ to GCC's OpenMP runtime
(not yet reviewed).

But in the end the real culprit is Python's ``multiprocessing`` that does
``fork`` without ``exec`` to reduce the overhead of starting and using new
Python processes for parallel computing. Unfortunately this is a violation of
the POSIX standard and therefore some software vendors like Apple refuse to
consider the lack of fork-safety in Accelerate / vecLib as a bug.

In Python 3.4+ it is now possible to configure ``multiprocessing`` to use the
'forkserver' or 'spawn' start methods (instead of the default 'fork') to manage
the process pools. This makes it possible to not be subject to this issue
anymore. The version of joblib shipped with scikit-learn automatically uses
that setting by default (under Python 3.4 and later).

If you have custom code that uses ``multiprocessing`` directly instead of using
it via joblib, you can enable the 'forkserver' mode globally for your
program. Insert the following instructions in your main script::

    import multiprocessing

    # other imports, custom code, load data, define model...

    if __name__ == '__main__':
        multiprocessing.set_start_method('forkserver')

        # call scikit-learn utils with n_jobs > 1 here

You can find more details on the new start methods in the `multiprocessing
documentation <https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods>`_.

Why is there no support for deep learning / Will there be support for deep learning in scikit-learn?
----------------------------------------------------------------------------------------------------

Deep learning requires a rich vocabulary to define an architecture and the
use of GPUs for efficient computing. However, neither of these fits within
the design constraints of scikit-learn. As a result, deep learning is
currently out of scope for what scikit-learn seeks to achieve.

Why is my pull request not getting any attention?
-------------------------------------------------

The scikit-learn review process takes a significant amount of time, and
contributors should not be discouraged by a lack of activity or review on
their pull request. We care a lot about getting things right
the first time, as maintenance and later changes come at a high cost.
We rarely release any "experimental" code, so all of our contributions
will be subject to high use immediately and should be of the highest
quality possible initially.

Beyond that, scikit-learn is limited in its reviewing bandwidth; many of the
reviewers and core developers are working on scikit-learn on their own time.
If a review of your pull request comes slowly, it is likely because the
reviewers are busy. We ask for your understanding and request that you
not close your pull request or discontinue your work solely for
this reason.