PageRenderTime 169ms CodeModel.GetById 24ms RepoModel.GetById 0ms app.codeStats 0ms

/BRI_RNAseq_Nextera_optimized/pymodules/python2.7/lib/python/statsmodels-0.5.0-py2.7-linux-x86_64.egg/statsmodels/sandbox/multilinear.py

https://gitlab.com/pooja043/Globus_Docker_1
Python | 329 lines | 141 code | 9 blank | 179 comment | 23 complexity | 56b13b12930fc7fd5dc26eeecb18c6d2 MD5 | raw file
  1. """Analyze a set of multiple variables with a linear models
  2. multiOLS:
  3. take a model and test it on a series of variables defined over a
  4. pandas dataset, returning a summary for each variable
  5. multigroup:
  6. take a boolean vector and the definition of several groups of variables
  7. and test if the group has a fraction of true values higher than the
  8. rest. It allows to test if the variables in the group are significantly
  9. more significant than outside the group.
  10. """
  11. from patsy import dmatrix
  12. import pandas as pd
  13. from statsmodels.api import OLS
  14. from statsmodels.api import stats
  15. import numpy as np
  16. import logging
  17. def _model2dataframe(model_endog, model_exog, model_type=OLS, **kwargs):
  18. """return a series containing the summary of a linear model
  19. All the exceding parameters will be redirected to the linear model
  20. """
  21. # create the linear model and perform the fit
  22. model_result = model_type(model_endog, model_exog, **kwargs).fit()
  23. # keeps track of some global statistics
  24. statistics = pd.Series({'r2': model_result.rsquared,
  25. 'adj_r2': model_result.rsquared_adj})
  26. # put them togher with the result for each term
  27. result_df = pd.DataFrame({'params': model_result.params,
  28. 'pvals': model_result.pvalues,
  29. 'std': model_result.bse,
  30. 'statistics': statistics})
  31. # add the complexive results for f-value and the total p-value
  32. fisher_df = pd.DataFrame({'params': {'_f_test': model_result.fvalue},
  33. 'pvals': {'_f_test': model_result.f_pvalue}})
  34. # merge them and unstack to obtain a hierarchically indexed series
  35. res_series = pd.concat([result_df, fisher_df]).unstack()
  36. return res_series.dropna()
  37. def multiOLS(model, dataframe, column_list=None, method='fdr_bh',
  38. alpha=0.05, subset=None, model_type=OLS, **kwargs):
  39. """apply a linear model to several endogenous variables on a dataframe
  40. Take a linear model definition via formula and a dataframe that will be
  41. the environment of the model, and apply the linear model to a subset
  42. (or all) of the columns of the dataframe. It will return a dataframe
  43. with part of the information from the linear model summary.
  44. Parameters
  45. ----------
  46. model : string
  47. formula description of the model
  48. dataframe : pandas.dataframe
  49. dataframe where the model will be evaluated
  50. column_list : list of strings, optional
  51. Names of the columns to analyze with the model.
  52. If None (Default) it will perform the function on all the
  53. eligible columns (numerical type and not in the model definition)
  54. model_type : model class, optional
  55. The type of model to be used. The default is the linear model.
  56. Can be any linear model (OLS, WLS, GLS, etc..)
  57. method: string, optional
  58. the method used to perform the pvalue correction for multiple testing.
  59. default is the Benjamini/Hochberg, other available methods are:
  60. `bonferroni` : one-step correction
  61. `sidak` : on-step correction
  62. `holm-sidak` :
  63. `holm` :
  64. `simes-hochberg` :
  65. `hommel` :
  66. `fdr_bh` : Benjamini/Hochberg
  67. `fdr_by` : Benjamini/Yekutieli
  68. alpha: float, optional
  69. the significance level used for the pvalue correction (default 0.05)
  70. subset: boolean array
  71. the selected rows to be used in the regression
  72. all the other parameters will be directed to the model creation.
  73. Returns
  74. -------
  75. summary : pandas.DataFrame
  76. a dataframe containing an extract from the summary of the model
  77. obtained for each columns. It will give the model complexive f test
  78. result and p-value, and the regression value and standard deviarion
  79. for each of the regressors. The Dataframe has a hierachical column
  80. structure, divided as:
  81. - params: contains the parameters resulting from the models. Has
  82. an additional column named _f_test containing the result of the
  83. F test.
  84. - pval: the pvalue results of the models. Has the _f_test column
  85. for the significativity of the whole test.
  86. - adj_pval: the corrected pvalues via the multitest function.
  87. - std: uncertainties of the model parameters
  88. - statistics: contains the r squared statistics and the adjusted
  89. r squared.
  90. Notes
  91. -----
  92. The main application of this function is on system biology to perform
  93. a linear model testing of a lot of different parameters, like the
  94. different genetic expression of several genes.
  95. See Also
  96. --------
  97. statsmodels.stats.multitest
  98. contains several functions to perform the multiple p-value correction
  99. Examples
  100. --------
  101. Using the longley data as dataframe example
  102. >>> import statsmodels.api as sm
  103. >>> data = sm.datasets.longley.load_pandas()
  104. >>> df = data.exog
  105. >>> df['TOTEMP'] = data.endog
  106. This will perform the specified linear model on all the
  107. other columns of the dataframe
  108. >>> multiOLS('GNP + 1', df)
  109. This select only a certain subset of the columns
  110. >>> multiOLS('GNP + 0', df, ['GNPDEFL', 'TOTEMP', 'POP'])
  111. It is possible to specify a trasformation also on the target column,
  112. conforming to the patsy formula specification
  113. >>> multiOLS('GNP + 0', df, ['I(GNPDEFL**2)', 'center(TOTEMP)'])
  114. It is possible to specify the subset of the dataframe
  115. on which perform the analysis
  116. >> multiOLS('GNP + 1', df, subset=df.GNPDEFL > 90)
  117. Even a single column name can be given without enclosing it in a list
  118. >>> multiOLS('GNP + 0', df, 'GNPDEFL')
  119. """
  120. # data normalization
  121. # if None take all the numerical columns that aren't present in the model
  122. # it's not waterproof but is a good enough criterion for everyday use
  123. if column_list is None:
  124. column_list = [name for name in dataframe.columns
  125. if dataframe[name].dtype != object and name not in model]
  126. # if it's a single string transform it in a single element list
  127. if isinstance(column_list, basestring):
  128. column_list = [column_list]
  129. if subset is not None:
  130. dataframe = dataframe.ix[subset]
  131. # perform each model and retrieve the statistics
  132. col_results = {}
  133. # as the model will use always the same endogenous variables
  134. # we can create them once and reuse
  135. model_exog = dmatrix(model, data=dataframe, return_type="dataframe")
  136. for col_name in column_list:
  137. # it will try to interpret the column name as a valid dataframe
  138. # index as it can be several times faster. If it fails it
  139. # interpret it as a patsy formula (for example for centering)
  140. try:
  141. model_endog = dataframe[col_name]
  142. except KeyError:
  143. model_endog = dmatrix(col_name + ' + 0', data=dataframe)
  144. # retrieve the result and store them
  145. res = _model2dataframe(model_endog, model_exog, model_type, **kwargs)
  146. col_results[col_name] = res
  147. # mangle them togheter and sort by complexive p-value
  148. summary = pd.DataFrame(col_results)
  149. # order by the p-value: the most useful model first!
  150. summary = summary.T.sort([('pvals', '_f_test')])
  151. summary.index.name = 'endogenous vars'
  152. # implementing the pvalue correction method
  153. smt = stats.multipletests
  154. for (key1, key2) in summary:
  155. if key1 != 'pvals':
  156. continue
  157. p_values = summary[key1, key2]
  158. corrected = smt(p_values, method=method, alpha=alpha)[1]
  159. # extend the dataframe of results with the column
  160. # of the corrected p_values
  161. summary['adj_' + key1, key2] = corrected
  162. return summary
  163. def _test_group(pvalues, group_name, group, exact=True):
  164. """test if the objects in the group are different from the general set.
  165. The test is performed on the pvalues set (ad a pandas series) over
  166. the group specified via a fisher exact test.
  167. """
  168. from scipy.stats import fisher_exact
  169. try:
  170. from scipy.stats import chi2_contingency
  171. except ImportError:
  172. def chi2_contingency(*args, **kwds):
  173. raise ValueError('exact=False is not available with old scipy')
  174. totals = 1.0 * len(pvalues)
  175. total_significant = 1.0 * np.sum(pvalues)
  176. cross_index = [c for c in group if c in pvalues.index]
  177. missing = [c for c in group if c not in pvalues.index]
  178. if missing:
  179. s = ('the test is not well defined if the group '
  180. 'has elements not presents in the significativity '
  181. 'array. group name: {}, missing elements: {}')
  182. logging.warning(s.format(group_name, missing))
  183. # how many are significant and not in the group
  184. group_total = 1.0 * len(cross_index)
  185. group_sign = 1.0 * len([c for c in cross_index if pvalues[c]])
  186. group_nonsign = 1.0 * (group_total - group_sign)
  187. # how many are significant and not outside the group
  188. extern_sign = 1.0 * (total_significant - group_sign)
  189. extern_nonsign = 1.0 * (totals - total_significant - group_nonsign)
  190. # make the fisher test or the chi squared
  191. test = fisher_exact if exact else chi2_contingency
  192. table = [[extern_nonsign, extern_sign], [group_nonsign, group_sign]]
  193. pvalue = test(np.array(table))[1]
  194. # is the group more represented or less?
  195. part = group_sign, group_nonsign, extern_sign, extern_nonsign
  196. #increase = (group_sign / group_total) > (total_significant / totals)
  197. increase = np.log((totals * group_sign)
  198. / (total_significant * group_total))
  199. return pvalue, increase, part
  200. def multigroup(pvals, groups, exact=True, keep_all=True, alpha=0.05):
  201. """Test if the given groups are different from the total partition.
  202. Given a boolean array test if each group has a proportion of positives
  203. different than the complexive proportion.
  204. The test can be done as an exact Fisher test or approximated as a
  205. Chi squared test for more speed.
  206. Parameters
  207. ----------
  208. pvals: pandas series of boolean
  209. the significativity of the variables under analysis
  210. groups: dict of list
  211. the name of each category of variables under exam.
  212. each one is a list of the variables included
  213. exact: boolean, optional
  214. If True (default) use the fisher exact test, otherwise
  215. use the chi squared test for contingencies tables.
  216. For high number of elements in the array the fisher test can
  217. be significantly slower than the chi squared.
  218. keep_all: boolean, optional
  219. if False it will drop those groups where the fraction
  220. of positive is below the expected result. If True (default)
  221. it will keep all the significant results.
  222. alpha: float, optional
  223. the significativity level for the pvalue correction
  224. on the whole set of groups (not inside the groups themselves).
  225. Returns
  226. -------
  227. result_df: pandas dataframe
  228. for each group returns:
  229. pvals - the fisher p value of the test
  230. adj_pvals - the adjusted pvals
  231. increase - the log of the odd ratio between the
  232. internal significant ratio versus the external one
  233. _in_sign - significative elements inside the group
  234. _in_non - non significative elements inside the group
  235. _out_sign - significative elements outside the group
  236. _out_non - non significative elements outside the group
  237. Notes
  238. -----
  239. This test allow to see if a category of variables is generally better
  240. suited to be described for the model. For example to see if a predictor
  241. gives more information on demographic or economical parameters,
  242. by creating two groups containing the endogenous variables of each
  243. category.
  244. This function is conceived for medical dataset with a lot of variables
  245. that can be easily grouped into functional groups. This is because
  246. The significativity of a group require a rather large number of
  247. composing elements.
  248. Examples
  249. --------
  250. A toy example on a real dataset, the Guerry dataset from R
  251. >>> url = "http://vincentarelbundock.github.com/"
  252. >>> url = url + "Rdatasets/csv/HistData/Guerry.csv"
  253. >>> df = pd.read_csv(url, index_col='dept')
  254. evaluate the relationship between the variuos paramenters whith the Wealth
  255. >>> pvals = multiOLS('Wealth', df)['adj_pvals', '_f_test']
  256. define the groups
  257. >>> groups = {}
  258. >>> groups['crime'] = ['Crime_prop', 'Infanticide',
  259. ... 'Crime_parents', 'Desertion', 'Crime_pers']
  260. >>> groups['religion'] = ['Donation_clergy', 'Clergy', 'Donations']
  261. >>> groups['wealth'] = ['Commerce', 'Lottery', 'Instruction', 'Literacy']
  262. do the analysis of the significativity
  263. >>> multigroup(pvals < 0.05, groups)
  264. """
  265. pvals = pd.Series(pvals)
  266. if not (set(pvals.unique()) <= set([False, True])):
  267. raise ValueError("the series should be binary")
  268. if hasattr(pvals.index, 'is_unique') and not pvals.index.is_unique:
  269. raise ValueError("series with duplicated index is not accepted")
  270. results = {'pvals': {},
  271. 'increase': {},
  272. '_in_sign': {},
  273. '_in_non': {},
  274. '_out_sign': {},
  275. '_out_non': {}}
  276. for group_name, group_list in groups.iteritems():
  277. res = _test_group(pvals, group_name, group_list, exact)
  278. results['pvals'][group_name] = res[0]
  279. results['increase'][group_name] = res[1]
  280. results['_in_sign'][group_name] = res[2][0]
  281. results['_in_non'][group_name] = res[2][1]
  282. results['_out_sign'][group_name] = res[2][2]
  283. results['_out_non'][group_name] = res[2][3]
  284. result_df = pd.DataFrame(results).sort('pvals')
  285. if not keep_all:
  286. result_df = result_df[result_df.increase]
  287. smt = stats.multipletests
  288. corrected = smt(result_df['pvals'], method='fdr_bh', alpha=alpha)[1]
  289. result_df['adj_pvals'] = corrected
  290. return result_df