.. _faq:

===========================
Frequently Asked Questions
===========================

.. currentmodule:: sklearn

Here we try to give some answers to questions that regularly pop up on the mailing list.

What is the project name (a lot of people get it wrong)?
---------------------------------------------------------
scikit-learn, but not scikit or SciKit nor sci-kit learn.
Also not scikits.learn or scikits-learn, which were previously used.

How do you pronounce the project name?
------------------------------------------
sy-kit learn. sci stands for science!

Why scikit?
------------
There are multiple scikits, which are scientific toolboxes built around SciPy.
You can find a list at `<https://scikits.appspot.com/scikits>`_.
Apart from scikit-learn, another popular one is `scikit-image <https://scikit-image.org/>`_.

How can I contribute to scikit-learn?
-----------------------------------------
See :ref:`contributing`. Before adding a new algorithm, which is
usually a major and lengthy undertaking, it is recommended to start with
:ref:`known issues <new_contributors>`. Please do not contact the contributors
of scikit-learn directly regarding contributing to scikit-learn.

What's the best way to get help on scikit-learn usage?
--------------------------------------------------------------
**For general machine learning questions**, please use
`Cross Validated <https://stats.stackexchange.com/>`_ with the ``[machine-learning]`` tag.

**For scikit-learn usage questions**, please use `Stack Overflow <https://stackoverflow.com/questions/tagged/scikit-learn>`_
with the ``[scikit-learn]`` and ``[python]`` tags. You can alternatively use the `mailing list
<https://mail.python.org/mailman/listinfo/scikit-learn>`_.

Please make sure to include a minimal reproduction code snippet (ideally shorter
than 10 lines) that highlights your problem on a toy dataset (for instance from
``sklearn.datasets`` or randomly generated with functions of ``numpy.random`` with
a fixed random seed). Please remove any line of code that is not necessary to
reproduce your problem.

The problem should be reproducible by simply copy-pasting your code snippet in a Python
shell with scikit-learn installed. Do not forget to include the import statements.
More guidance to write good reproduction code snippets can be found at:
https://stackoverflow.com/help/mcve

If your problem raises an exception that you do not understand (even after googling it),
please make sure to include the full traceback that you obtain when running the
reproduction script.
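
As an illustration, a minimal reproduction snippet might look like the
following sketch (the estimator, parameters and random data here are
arbitrary placeholders; substitute whatever actually triggers your problem)::

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # small toy dataset generated with a fixed random seed
    rng = np.random.RandomState(42)
    X = rng.normal(size=(20, 3))
    y = rng.randint(0, 2, size=20)

    model = LogisticRegression()
    model.fit(X, y)
    print(model.predict(X[:5]))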

For bug reports or feature requests, please make use of the
`issue tracker on GitHub <https://github.com/scikit-learn/scikit-learn/issues>`_.

There is also a `scikit-learn Gitter channel
<https://gitter.im/scikit-learn/scikit-learn>`_ where some users and developers
might be found.

**Please do not email any authors directly to ask for assistance, report bugs,
or for any other issue related to scikit-learn.**

How should I save, export or deploy estimators for production?
----------------------------------------------------------------
See :ref:`model_persistence`.
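
As a quick sketch (the full discussion, including caveats about security and
version compatibility, lives in :ref:`model_persistence`), a fitted estimator
can for instance be serialized with ``joblib``, assuming it is available (it
is a scikit-learn dependency in recent versions); the estimator and file name
below are arbitrary examples::

    import joblib
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_iris(return_X_y=True)
    clf = RandomForestClassifier(random_state=0).fit(X, y)

    # persist the fitted estimator to disk, then load it back
    joblib.dump(clf, "model.joblib")
    clf_loaded = joblib.load("model.joblib")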

How can I create a bunch object?
------------------------------------------------
Bunch objects are sometimes used as an output for functions and methods. They
extend dictionaries by enabling values to be accessed by key,
`bunch["value_key"]`, or by an attribute, `bunch.value_key`.

They should not be used as an input; therefore you almost never need to create
a ``Bunch`` object, unless you are extending scikit-learn's API.
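
For instance, the dataset loaders return ``Bunch`` objects, so both access
styles below are equivalent (a small sketch using the built-in iris dataset)::

    from sklearn.datasets import load_iris

    bunch = load_iris()            # returns a Bunch
    print(bunch["feature_names"])  # access by key...
    print(bunch.feature_names)     # ...or by attribute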

How can I load my own datasets into a format usable by scikit-learn?
----------------------------------------------------------------------
Generally, scikit-learn works on any numeric data stored as numpy arrays
or scipy sparse matrices. Other types that are convertible to numeric
arrays such as pandas DataFrame are also acceptable.

For more information on loading your data files into these usable data
structures, please refer to :ref:`loading external datasets <external_datasets>`.
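
As a minimal sketch, both of the following inputs (a NumPy array and a pandas
DataFrame with made-up numeric columns) can be passed directly to an estimator::

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LinearRegression

    X_np = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
    X_df = pd.DataFrame(X_np, columns=["feature_a", "feature_b"])
    y = np.array([1.0, 2.0, 3.0])

    LinearRegression().fit(X_np, y)
    LinearRegression().fit(X_df, y)  # the DataFrame is converted to numeric arrays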

.. _new_algorithms_inclusion_criteria:

What are the inclusion criteria for new algorithms?
------------------------------------------------------
We only consider well-established algorithms for inclusion. A rule of thumb is
at least 3 years since publication, 200+ citations, and wide use and
usefulness. A technique that provides a clear-cut improvement (e.g. an
enhanced data structure or a more efficient approximation technique) on
a widely-used method will also be considered for inclusion.

From the algorithms or techniques that meet the above criteria, only those
which fit well within the current API of scikit-learn, that is a ``fit``,
``predict/transform`` interface and ordinarily having input/output that is a
numpy array or sparse matrix, are accepted.

The contributor should support the importance of the proposed addition with
research papers and/or implementations in other similar packages, demonstrate
its usefulness via common use-cases/applications and corroborate performance
improvements, if any, with benchmarks and/or plots. It is expected that the
proposed algorithm should outperform the methods that are already implemented
in scikit-learn at least in some areas.

Inclusion of a new algorithm speeding up an existing model is easier if:

- it does not introduce new hyper-parameters (as it makes the library
  more future-proof),
- it is easy to document clearly when the contribution improves the speed
  and when it does not, for instance "when n_features >> n_samples",
- benchmarks clearly show a speed up.

Also, note that your implementation need not be in scikit-learn to be used
together with scikit-learn tools. You can implement your favorite algorithm
in a scikit-learn compatible way, upload it to GitHub and let us know. We
will be happy to list it under :ref:`related_projects`. If you already have
a package on GitHub following the scikit-learn API, you may also be
interested to look at `scikit-learn-contrib
<https://scikit-learn-contrib.github.io>`_.
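
As a rough sketch of what "scikit-learn compatible" means (this is not an
official template; see the developer documentation for the full contract), a
custom estimator typically subclasses ``BaseEstimator`` and implements
``fit`` and ``predict``::

    import numpy as np
    from sklearn.base import BaseEstimator, ClassifierMixin
    from sklearn.utils.validation import check_X_y, check_array

    class MostFrequentClassifier(BaseEstimator, ClassifierMixin):
        """Toy classifier that always predicts the most frequent class."""

        def fit(self, X, y):
            X, y = check_X_y(X, y)
            classes, counts = np.unique(y, return_counts=True)
            self.classes_ = classes
            self.most_frequent_ = classes[np.argmax(counts)]
            return self

        def predict(self, X):
            X = check_array(X)
            return np.full(X.shape[0], self.most_frequent_)

Such an estimator can then be used inside pipelines and grid searches like any
built-in estimator, and :func:`sklearn.utils.estimator_checks.check_estimator`
can be used to test how well it follows the API conventions.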

.. _selectiveness:

Why are you so selective on what algorithms you include in scikit-learn?
---------------------------------------------------------------------------
Code comes with maintenance cost, and we need to balance the amount of
code we have with the size of the team (and add to this the fact that
complexity scales non linearly with the number of features).
The package relies on core developers using their free time to
fix bugs, maintain code and review contributions.
Any algorithm that is added needs future attention by the developers,
at which point the original author might long have lost interest.
See also :ref:`new_algorithms_inclusion_criteria`. For a great read about
long-term maintenance issues in open-source software, look at
`the Executive Summary of Roads and Bridges
<https://www.fordfoundation.org/media/2976/roads-and-bridges-the-unseen-labor-behind-our-digital-infrastructure.pdf#page=8>`_.

Why did you remove HMMs from scikit-learn?
--------------------------------------------
See :ref:`adding_graphical_models`.

.. _adding_graphical_models:

Will you add graphical models or sequence prediction to scikit-learn?
------------------------------------------------------------------------
Not in the foreseeable future.
scikit-learn tries to provide a unified API for the basic tasks in machine
learning, with pipelines and meta-algorithms like grid search to tie
everything together. The required concepts, APIs, algorithms and
expertise required for structured learning are different from what
scikit-learn has to offer. If we started doing arbitrary structured
learning, we'd need to redesign the whole package and the project
would likely collapse under its own weight.

There are two projects with APIs similar to scikit-learn that
do structured prediction:

* `pystruct <https://pystruct.github.io/>`_ handles general structured
  learning (focuses on SSVMs on arbitrary graph structures with
  approximate inference; defines the notion of sample as an instance of
  the graph structure)

* `seqlearn <https://larsmans.github.io/seqlearn/>`_ handles sequences only
  (focuses on exact inference; has HMMs, but mostly for the sake of
  completeness; treats a feature vector as a sample and uses an offset encoding
  for the dependencies between feature vectors)

Will you add GPU support?
----------------------------
No, or at least not in the near future. The main reason is that GPU support
will introduce many software dependencies and introduce platform specific
issues. scikit-learn is designed to be easy to install on a wide variety of
platforms. Outside of neural networks, GPUs don't play a large role in machine
learning today, and much larger gains in speed can often be achieved by a
careful choice of algorithms.

Do you support PyPy?
------------------------
In case you didn't know, `PyPy <https://pypy.org/>`_ is an alternative
Python implementation with a built-in just-in-time compiler. Experimental
support for PyPy3-v5.10+ has been added, which requires Numpy 1.14.0+,
and scipy 1.1.0+.

How do I deal with string data (or trees, graphs...)?
---------------------------------------------------------
scikit-learn estimators assume you'll feed them real-valued feature vectors.
This assumption is hard-coded in pretty much all of the library.
However, you can feed non-numerical inputs to estimators in several ways.

If you have text documents, you can use term frequency features; see
:ref:`text_feature_extraction` for the built-in *text vectorizers*.
For more general feature extraction from any kind of data, see
:ref:`dict_feature_extraction` and :ref:`feature_hashing`.
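
For instance, a minimal sketch of turning a handful of (made-up) documents
into a numeric matrix with the built-in ``CountVectorizer``::

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["the cat sat on the mat", "the dog ate my homework"]
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(docs)  # sparse matrix of token counts
    print(vectorizer.vocabulary_)       # mapping from token to column index
    print(X.toarray())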

Another common case is when you have non-numerical data and a custom distance
(or similarity) metric on these data. Examples include strings with edit
distance (aka. Levenshtein distance; e.g., DNA or RNA sequences). These can be
encoded as numbers, but doing so is painful and error-prone. Working with
distance metrics on arbitrary data can be done in two ways.

Firstly, many estimators take precomputed distance/similarity matrices, so if
the dataset is not too large, you can compute distances for all pairs of inputs.
If the dataset is large, you can use feature vectors with only one "feature",
which is an index into a separate data structure, and supply a custom metric
function that looks up the actual data in this data structure. E.g., to use
DBSCAN with Levenshtein distances::

    >>> from leven import levenshtein       # doctest: +SKIP
    >>> import numpy as np
    >>> from sklearn.cluster import dbscan
    >>> data = ["ACCTCCTAGAAG", "ACCTACTAGAAGTT", "GAATATTAGGCCGA"]
    >>> def lev_metric(x, y):
    ...     i, j = int(x[0]), int(y[0])     # extract indices
    ...     return levenshtein(data[i], data[j])
    ...
    >>> X = np.arange(len(data)).reshape(-1, 1)
    >>> X
    array([[0],
           [1],
           [2]])
    >>> # We need to specify algorithm='brute' as the default assumes
    >>> # a continuous feature space.
    >>> dbscan(X, metric=lev_metric, eps=5, min_samples=2, algorithm='brute')
    ... # doctest: +SKIP
    ([0, 1], array([ 0,  0, -1]))

(This uses the third-party edit distance package ``leven``.)

Similar tricks can be used, with some care, for tree kernels, graph kernels,
etc.
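
As another sketch of the precomputed route (the kernel function here is a
deliberately trivial placeholder standing in for e.g. a tree or graph kernel),
estimators such as ``SVC`` accept a precomputed Gram matrix::

    import numpy as np
    from sklearn.svm import SVC

    X = np.random.RandomState(0).normal(size=(6, 2))
    y = np.array([0, 0, 0, 1, 1, 1])

    def my_kernel(A, B):
        # placeholder: replace with a real kernel over your objects
        return A @ B.T

    gram = my_kernel(X, X)  # shape (n_samples, n_samples)
    clf = SVC(kernel="precomputed").fit(gram, y)
    # at prediction time, rows are test samples and columns are training samples
    clf.predict(my_kernel(X, X))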

Why do I sometimes get a crash/freeze with n_jobs > 1 under OSX or Linux?
----------------------------------------------------------------------------
Several scikit-learn tools such as ``GridSearchCV`` and ``cross_val_score``
rely internally on Python's `multiprocessing` module to parallelize execution
onto several Python processes by passing ``n_jobs > 1`` as an argument.

The problem is that Python ``multiprocessing`` does a ``fork`` system call
without following it with an ``exec`` system call for performance reasons. Many
libraries like (some versions of) Accelerate / vecLib under OSX, (some versions
of) MKL, the OpenMP runtime of GCC, nvidia's Cuda (and probably many others),
manage their own internal thread pool. Upon a call to `fork`, the thread pool
state in the child process is corrupted: the thread pool believes it has many
threads while only the main thread state has been forked. It is possible to
change the libraries to make them detect when a fork happens and reinitialize
the thread pool in that case: we did that for OpenBLAS (merged upstream in
master since 0.2.10) and we contributed a `patch
<https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60035>`_ to GCC's OpenMP runtime
(not yet reviewed).

But in the end the real culprit is Python's ``multiprocessing`` that does
``fork`` without ``exec`` to reduce the overhead of starting and using new
Python processes for parallel computing. Unfortunately this is a violation of
the POSIX standard and therefore some software editors like Apple refuse to
consider the lack of fork-safety in Accelerate / vecLib as a bug.

In Python 3.4+ it is now possible to configure ``multiprocessing`` to
use the 'forkserver' or 'spawn' start methods (instead of the default
'fork') to manage the process pools. To work around this issue when
using scikit-learn, you can set the ``JOBLIB_START_METHOD`` environment
variable to 'forkserver'. However the user should be aware that using
the 'forkserver' method prevents joblib.Parallel from calling functions
interactively defined in a shell session.

If you have custom code that uses ``multiprocessing`` directly instead of using
it via joblib you can enable the 'forkserver' mode globally for your
program; insert the following instructions in your main script::

    import multiprocessing

    # other imports, custom code, load data, define model...

    if __name__ == '__main__':
        multiprocessing.set_start_method('forkserver')

        # call scikit-learn utils with n_jobs > 1 here

You can find more details on the new start methods in the `multiprocessing
documentation <https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods>`_.

.. _faq_mkl_threading:

Why does my job use more cores than specified with n_jobs?
-------------------------------------------------------------
This is because ``n_jobs`` only controls the number of jobs for
routines that are parallelized with ``joblib``, but parallel code can come
from other sources:

- some routines may be parallelized with OpenMP (for code written in C or
  Cython).
- scikit-learn relies a lot on numpy, which in turn may rely on numerical
  libraries like MKL, OpenBLAS or BLIS which can provide parallel
  implementations.

For more details, please refer to our :ref:`Parallelism notes <parallelism>`.
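
If you need to limit the number of threads used by those numerical libraries,
one option (a sketch assuming the third-party ``threadpoolctl`` package is
installed) is to restrict their thread pools explicitly::

    import numpy as np
    from threadpoolctl import threadpool_limits
    from sklearn.decomposition import PCA

    X = np.random.RandomState(0).normal(size=(1000, 50))

    # limit BLAS/OpenMP thread pools to a single thread inside this block
    with threadpool_limits(limits=1):
        PCA(n_components=10).fit(X)

Environment variables such as ``OMP_NUM_THREADS`` or ``MKL_NUM_THREADS`` can
achieve a similar effect at the process level.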

Why is there no support for deep or reinforcement learning / Will there be support for deep or reinforcement learning in scikit-learn?
--------------------------------------------------------------------------------------------------------------------------------------
Deep learning and reinforcement learning both require a rich vocabulary to
define an architecture, with deep learning additionally requiring
GPUs for efficient computing. However, neither of these fit within
the design constraints of scikit-learn; as a result, deep learning
and reinforcement learning are currently out of scope for what
scikit-learn seeks to achieve.

You can find more information about the addition of GPU support at
`Will you add GPU support?`_.

Note that scikit-learn currently implements a simple multilayer perceptron
in `sklearn.neural_network`. We will only accept bug fixes for this module.
If you want to implement more complex deep learning models, please turn to
popular deep learning frameworks such as
`tensorflow <https://www.tensorflow.org/>`_,
`keras <https://keras.io/>`_
and `pytorch <https://pytorch.org/>`_.

Why is my pull request not getting any attention?
----------------------------------------------------
The scikit-learn review process takes a significant amount of time, and
contributors should not be discouraged by a lack of activity or review on
their pull request. We care a lot about getting things right
the first time, as maintenance and later change comes at a high cost.
We rarely release any "experimental" code, so all of our contributions
will be subject to high use immediately and should be of the highest
quality possible initially.

Beyond that, scikit-learn is limited in its reviewing bandwidth; many of the
reviewers and core developers are working on scikit-learn on their own time.
If a review of your pull request comes slowly, it is likely because the
reviewers are busy. We ask for your understanding and request that you
not close your pull request or discontinue your work solely because of
this reason.

How do I set a ``random_state`` for an entire execution?
------------------------------------------------------------
For testing and replicability, it is often important to have the entire execution
controlled by a single seed for the pseudo-random number generator used in
algorithms that have a randomized component. Scikit-learn does not use its own
global random state; whenever a RandomState instance or an integer random seed
is not provided as an argument, it relies on the numpy global random state,
which can be set using :func:`numpy.random.seed`.
For example, to set an execution's numpy global random state to 42, one could
execute the following in their script::

    import numpy as np
    np.random.seed(42)

However, a global random state is prone to modification by other code during
execution. Thus, the only way to ensure replicability is to pass ``RandomState``
instances everywhere and ensure that both estimators and cross-validation
splitters have their ``random_state`` parameter set.
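
A minimal sketch of this recommendation (the estimator and splitter are
arbitrary examples)::

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import KFold, cross_val_score

    X, y = load_iris(return_X_y=True)

    clf = RandomForestClassifier(random_state=np.random.RandomState(42))
    cv = KFold(n_splits=5, shuffle=True, random_state=0)
    scores = cross_val_score(clf, X, y, cv=cv)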

Why do categorical variables need preprocessing in scikit-learn, compared to other tools?
---------------------------------------------------------------------------------------------
Most of scikit-learn assumes data is in NumPy arrays or SciPy sparse matrices
of a single numeric dtype. These do not explicitly represent categorical
variables at present. Thus, unlike R's data.frames or pandas.DataFrame, we
require explicit conversion of categorical features to numeric values, as
discussed in :ref:`preprocessing_categorical_features`.
See also :ref:`sphx_glr_auto_examples_compose_plot_column_transformer_mixed_types.py` for an
example of working with heterogeneous (e.g. categorical and numeric) data.
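
A small sketch of such a conversion with ``OneHotEncoder`` (the column names
and values are made up)::

    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder

    df = pd.DataFrame({"color": ["red", "blue", "red"],
                       "size": ["S", "M", "L"]})

    encoder = OneHotEncoder()
    X_encoded = encoder.fit_transform(df)  # sparse one-hot encoded matrix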

Why does Scikit-learn not directly work with, for example, pandas.DataFrame?
--------------------------------------------------------------------------------
The homogeneous NumPy and SciPy data objects currently expected are most
efficient to process for most operations. Extensive work would also be needed
to support Pandas categorical types. Restricting input to homogeneous
types therefore reduces maintenance cost and encourages usage of efficient
data structures.

Do you plan to implement transform for target y in a pipeline?
--------------------------------------------------------------------
Currently transform only works for features X in a pipeline.
There's a long-standing discussion about
not being able to transform y in a pipeline.
Follow the discussion on GitHub issue
`#4143 <https://github.com/scikit-learn/scikit-learn/issues/4143>`_.
Meanwhile, check out
:class:`sklearn.compose.TransformedTargetRegressor`,
`pipegraph <https://github.com/mcasl/PipeGraph>`_,
and `imbalanced-learn <https://github.com/scikit-learn-contrib/imbalanced-learn>`_.

Note that Scikit-learn solved for the case where y
has an invertible transformation applied before training
and inverted after prediction. Scikit-learn intends to solve for
use cases where y should be transformed at training time
and not at test time, for resampling and similar uses,
as in imbalanced-learn.
In general, these use cases can be solved
with a custom meta estimator rather than a Pipeline.
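
A minimal sketch of the invertible-transformation case handled by
:class:`sklearn.compose.TransformedTargetRegressor` (the regressor and the
log/exp transform pair are arbitrary examples)::

    import numpy as np
    from sklearn.compose import TransformedTargetRegressor
    from sklearn.linear_model import LinearRegression

    X = np.arange(1, 11, dtype=float).reshape(-1, 1)
    y = np.exp(X.ravel() / 10.0)

    # y is log-transformed before fitting and exponentiated after prediction
    reg = TransformedTargetRegressor(regressor=LinearRegression(),
                                     func=np.log, inverse_func=np.exp)
    reg.fit(X, y)
    reg.predict(X[:3])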