https://github.com/maximtrp/scikit-posthocs
Multiple Pairwise Comparisons (Post Hoc) Tests in Python
- Host: GitHub
- URL: https://github.com/maximtrp/scikit-posthocs
- Owner: maximtrp
- License: mit
- Created: 2017-06-22T19:41:37.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2024-06-26T08:01:19.000Z (7 months ago)
- Last Synced: 2024-08-01T18:28:27.734Z (5 months ago)
- Topics: anova, multiple-comparisons, nonparametric-statistics, post-hoc, python, statistical-analysis, statistical-methods, statistics, stats
- Language: Python
- Homepage: https://scikit-posthocs.rtfd.io
- Size: 5.35 MB
- Stars: 327
- Watchers: 5
- Forks: 40
- Open Issues: 7
Metadata Files:
- Readme: README.rst
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
Awesome Lists containing this project
- awesome-python-tools - scikit-posthocs - Post hoc tests to complement statistical analysis. (Statistical Analysis / Specialized Machine Learning Libraries)
- awesome-python-machine-learning - scikit-posthocs - Pairwise multiple comparisons (post hoc) tests in Python. (Uncategorized / Uncategorized)
- awesome-datascience - scikit-posthocs
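- awesome-python-machine-learning-resources - GitHub - 12% open · ⏱️ 21.08.2022): (Probability and Statistics)
- StarryDivineSky - maximtrp/scikit-posthocs - scikit-posthocs is a Python package that provides post hoc tests for pairwise multiple comparisons, which are usually performed in statistical data analysis to assess differences between group levels once an ANOVA test has produced a statistically significant result. scikit-posthocs is tightly integrated with Pandas DataFrames and NumPy arrays to ensure fast computation and convenient data import and storage. The package is useful for statisticians, data analysts, and researchers who work in Python. It offers a variety of parametric and nonparametric post hoc tests, along with outlier detection and basic plotting methods, and aims to close gaps in the Python statistical ecosystem relative to R packages. (Other: Machine Learning and Deep Learning)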
README
.. image:: images/logo.png

===============

.. image:: http://joss.theoj.org/papers/10.21105/joss.01169/status.svg
    :target: https://doi.org/10.21105/joss.01169
.. image:: https://img.shields.io/github/actions/workflow/status/maximtrp/scikit-posthocs/package-test.yml?label=build
    :target: https://github.com/maximtrp/scikit-posthocs/actions/workflows/package-test.yml
.. image:: https://img.shields.io/readthedocs/scikit-posthocs.svg
    :target: https://scikit-posthocs.readthedocs.io
.. image:: https://img.shields.io/codacy/coverage/50d2a82a6dd84b51b515cebf931067d7/master
    :target: https://app.codacy.com/gh/maximtrp/scikit-posthocs/dashboard
.. image:: https://img.shields.io/codacy/grade/50d2a82a6dd84b51b515cebf931067d7
    :target: https://www.codacy.com/gh/maximtrp/scikit-posthocs/dashboard
.. image:: https://static.pepy.tech/badge/scikit-posthocs
    :target: https://pepy.tech/project/scikit-posthocs
.. image:: https://img.shields.io/github/issues/maximtrp/scikit-posthocs.svg
    :target: https://github.com/maximtrp/scikit-posthocs/issues
.. image:: https://img.shields.io/pypi/v/scikit-posthocs.svg
    :target: https://pypi.python.org/pypi/scikit-posthocs/
.. image:: https://img.shields.io/conda/vn/conda-forge/scikit-posthocs.svg
    :target: https://anaconda.org/conda-forge/scikit-posthocs

===============

**scikit-posthocs** is a Python package that provides post hoc tests for
pairwise multiple comparisons that are usually performed in statistical
data analysis to assess the differences between group levels if a statistically
significant result of ANOVA test has been obtained.

**scikit-posthocs** is tightly integrated with Pandas DataFrames and NumPy
arrays to ensure fast computations and convenient data import and storage.

This package will be useful for statisticians, data analysts, and
researchers who use Python in their work.

Background
----------

The Python statistical ecosystem comprises multiple packages. However, it
still has numerous gaps and is surpassed by R packages and capabilities.

SciPy (version 1.2.0) offers *Student*, *Wilcoxon*,
and *Mann-Whitney* tests that are not adapted to multiple pairwise
comparisons. Statsmodels (version 0.9.0)
features *TukeyHSD* test that needs some extra actions to be fluently
integrated into a data analysis pipeline.
Statsmodels also has good helper
methods: ``allpairtest`` (adapts an external function such as
``scipy.stats.ttest_ind`` to multiple pairwise comparisons) and
``multipletests`` (adjusts *p* values for multiple testing to control the
family-wise error rate or the false discovery rate).
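
To make the idea of adjusting *p* values concrete, here is a minimal
plain-Python sketch of the Holm step-down procedure. It is an illustration
only, not the Statsmodels implementation, and the function name
``holm_adjust`` is hypothetical.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p values, in the original order."""
    m = len(p_values)
    # Indices of the p values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p value (k = rank + 1) is multiplied by m - k + 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: adjusted values never decrease along the order.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


raw = [0.01, 0.04, 0.03]
print(holm_adjust(raw))
```

The smallest raw value is penalised most heavily, and the running maximum
keeps the adjusted values in the same order as the raw ones.
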
PMCMRplus is a very good R package that
has no rivals in Python as it offers more than 40 various tests (including
post hoc tests) for factorial and block design data. PMCMRplus was an
inspiration and a reference for *scikit-posthocs*.

**scikit-posthocs** attempts to improve Python statistical capabilities by
offering a lot of parametric and nonparametric post hoc tests along with
outlier detection and basic plotting methods.

Features
--------

.. image:: images/flowchart.png
    :alt: Tests Flowchart

- *Omnibus* tests:

  - Durbin test (for balanced incomplete block design).
  - Mack-Wolfe test.
  - Hayter (OSRT) test.

- *Parametric* pairwise multiple comparisons tests:

  - Scheffe test.
  - Student T test.
  - Tamhane T2 test.
  - TukeyHSD test.

- *Non-parametric* tests for factorial design:

  - Conover test.
  - Dunn test.
  - Dwass, Steel, Critchlow, and Fligner test.
  - Mann-Whitney test.
  - Nashimoto and Wright (NPM) test.
  - Nemenyi test.
  - van der Waerden test.
  - Wilcoxon test.

- *Non-parametric* tests for block design:

  - Conover test.
  - Durbin and Conover test.
  - Miller test.
  - Nemenyi test.
  - Quade test.
  - Siegel test.

- Outlier detection tests:

  - Simple test based on interquartile range (IQR).
  - Grubbs test.
  - Tietjen-Moore test.
  - Generalized Extreme Studentized Deviate test (ESD test).

- Other tests:

  - Anderson-Darling test.

- Global null hypothesis tests:

  - Fisher's combination test.
  - Simes test.

- Plotting functionality (e.g. significance plots).

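
Of the outlier detection tests listed above, the IQR rule is the simplest to
illustrate. The following standard-library sketch shows the general idea; the
function name ``iqr_filter`` and the default coefficient of 1.5 are
assumptions made for this example, and the package's own routine may differ
in detail (for instance in how quartiles are interpolated).

```python
import statistics


def iqr_filter(values, coef=1.5):
    """Keep only values inside [Q1 - coef*IQR, Q3 + coef*IQR]."""
    # method="inclusive" interpolates linearly between the sorted data points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - coef * iqr, q3 + coef * iqr
    return [v for v in values if lower <= v <= upper]


sample = [4, 5, 5, 6, 6, 7, 7, 8, 30]
print(iqr_filter(sample))
```
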
All post hoc tests are capable of p adjustments for multiple
pairwise comparisons.

Dependencies
------------

- NumPy and SciPy packages
- Statsmodels
- Pandas
- Matplotlib
- Seaborn

Compatibility
-------------

Package is only compatible with Python 3.

Install
-------

You can install the package using ``pip`` (from PyPI):

.. code:: bash

    pip install scikit-posthocs

Or using ``conda`` (from the conda-forge repo):

.. code:: bash

    conda install -c conda-forge scikit-posthocs

The latest version from GitHub can be installed using:

.. code:: bash

    pip install git+https://github.com/maximtrp/scikit-posthocs.git

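
After installing, one way to confirm that the distribution is visible to the
current interpreter is to query its metadata. This is a small sketch that
uses only the standard library (Python 3.8 or later); the helper name
``installed_version`` is hypothetical.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# "scikit-posthocs" is the distribution name used in the commands above.
print(installed_version("scikit-posthocs"))
```
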
Examples
--------

Parametric ANOVA with post hoc tests
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Here is a simple example of the one-way analysis of variance (ANOVA)
with post hoc tests used to compare *sepal width* means of three
groups (three iris species) in *iris* dataset.

To begin, we will import the dataset using statsmodels
``get_rdataset()`` method.

.. code:: python

    >>> import statsmodels.api as sa
    >>> import statsmodels.formula.api as sfa
    >>> import scikit_posthocs as sp
    >>> df = sa.datasets.get_rdataset('iris').data
    >>> # regex=False treats '.' as a literal dot, not a regex wildcard
    >>> df.columns = df.columns.str.replace('.', '', regex=False)
    >>> df.head()
       SepalLength  SepalWidth  PetalLength  PetalWidth Species
    0          5.1         3.5          1.4         0.2  setosa
    1          4.9         3.0          1.4         0.2  setosa
    2          4.7         3.2          1.3         0.2  setosa
    3          4.6         3.1          1.5         0.2  setosa
    4          5.0         3.6          1.4         0.2  setosa

Now, we will build a model and run ANOVA using statsmodels ``ols()``
and ``anova_lm()`` methods. Columns ``Species`` and ``SepalWidth``
contain independent (predictor) and dependent (response) variable
values, respectively.

.. code:: python

    >>> lm = sfa.ols('SepalWidth ~ C(Species)', data=df).fit()
    >>> anova = sa.stats.anova_lm(lm)
    >>> print(anova)
                   df     sum_sq   mean_sq         F        PR(>F)
    C(Species)    2.0  11.344933  5.672467  49.16004  4.492017e-17
    Residual    147.0  16.962000  0.115388       NaN           NaN

The results tell us that there is a significant difference between
group means (p = 4.49e-17), but do not tell us the exact group pairs which
are different in means. To obtain pairwise group differences, we will carry
out a posteriori (post hoc) analysis using ``scikit-posthocs`` package.
Student T test applied pairwise gives us the following p values:

.. code:: python

    >>> sp.posthoc_ttest(df, val_col='SepalWidth', group_col='Species', p_adjust='holm')
                      setosa    versicolor     virginica
    setosa     -1.000000e+00  5.535780e-15  8.492711e-09
    versicolor  5.535780e-15 -1.000000e+00  1.819100e-03
    virginica   8.492711e-09  1.819100e-03 -1.000000e+00

Remember to use a family-wise error rate (FWER) controlling procedure,
such as Holm procedure, when making multiple comparisons. As seen from this
table, significant differences in group means are obtained for all group pairs.

Non-parametric ANOVA with post hoc tests
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If normality and other assumptions
are violated, one can use a non-parametric Kruskal-Wallis H test (one-way
non-parametric ANOVA) to test if samples came from the same distribution.

Let's use the same dataset just to demonstrate the procedure. Kruskal-Wallis
test is implemented in SciPy package. ``scipy.stats.kruskal`` method
accepts array-like structures, but not DataFrames.

.. code:: python

    >>> import scipy.stats as ss
    >>> import statsmodels.api as sa
    >>> import scikit_posthocs as sp
    >>> df = sa.datasets.get_rdataset('iris').data
    >>> df.columns = df.columns.str.replace('.', '', regex=False)
    >>> # Collect the SepalWidth values of each species into its own 1D array
    >>> data = [df.loc[ids, 'SepalWidth'].values for ids in df.groupby('Species').groups.values()]

``data`` is a list of 1D arrays containing *sepal width* values, one array per
species. Now we can run Kruskal-Wallis analysis of variance.

.. code:: python

    >>> H, p = ss.kruskal(*data)
    >>> p
    1.5692820940316782e-14

The p value tells us we may reject the null hypothesis that the population medians
of all of the groups are equal. To learn what groups (species) differ in their
medians we need to run post hoc tests. ``scikit-posthocs`` provides a lot of
non-parametric tests mentioned above. Let's choose Conover's test.

.. code:: python

    >>> sp.posthoc_conover(df, val_col='SepalWidth', group_col='Species', p_adjust='holm')
                      setosa    versicolor     virginica
    setosa     -1.000000e+00  2.278515e-18  1.293888e-10
    versicolor  2.278515e-18 -1.000000e+00  1.881294e-03
    virginica   1.293888e-10  1.881294e-03 -1.000000e+00

Pairwise comparisons show that we may reject the null hypothesis (p < 0.01) for
each pair of species and conclude that all groups (species) differ in their
sepal widths.

Block design
~~~~~~~~~~~~

In block design case, we have a primary factor (e.g. treatment) and a blocking
factor (e.g. age or gender). A blocking factor is also called a *nuisance*
factor, and it is usually a source of variability that needs to be accounted
for.

An example scenario is testing the effect of four fertilizers on crop yield in
four cornfields. We can represent the results with a matrix in which rows
correspond to the blocking factor (field) and columns correspond to the
primary factor (fertilizer), with each cell holding the measured yield.

The following dataset is artificial and created just for demonstration
of the procedure:

.. code:: python

    >>> import numpy as np
    >>> data = np.array([[ 8.82, 11.8 , 10.37, 12.08],
    ...                  [ 8.92,  9.58, 10.59, 11.89],
    ...                  [ 8.27, 11.46, 10.24, 11.6 ],
    ...                  [ 8.83, 13.25,  8.33, 11.51]])

First, we need to perform an omnibus test, the Friedman rank sum test. It is
implemented in ``scipy.stats`` subpackage:

.. code:: python

    >>> import scipy.stats as ss
    >>> ss.friedmanchisquare(*data.T)
    FriedmanchisquareResult(statistic=8.700000000000003, pvalue=0.03355726870553798)

We can reject the null hypothesis that our treatments have the same
distribution, because p value is less than 0.05. A number of post hoc tests are
available in ``scikit-posthocs`` package for unreplicated block design data.
In the following example, Nemenyi's test is used:

.. code:: python

    >>> import scikit_posthocs as sp
    >>> sp.posthoc_nemenyi_friedman(data)
              0         1         2         3
    0 -1.000000  0.220908  0.823993  0.031375
    1  0.220908 -1.000000  0.670273  0.823993
    2  0.823993  0.670273 -1.000000  0.220908
    3  0.031375  0.823993  0.220908 -1.000000

This function returns a DataFrame with p values obtained in pairwise
comparisons between all treatments.
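
As a cross-check on the omnibus step, the Friedman statistic can be recomputed
by hand from the within-block ranks. The sketch below is plain Python and
assumes there are no ties within a block (SciPy additionally applies a tie
correction); the function name ``friedman_statistic`` is hypothetical. For
the artificial dataset used in this section it should agree, up to rounding,
with the ``statistic`` reported by ``friedmanchisquare``.

```python
def friedman_statistic(matrix):
    """Friedman chi-square for a blocks-by-treatments matrix without ties."""
    n = len(matrix)       # number of blocks (rows)
    k = len(matrix[0])    # number of treatments (columns)
    rank_sums = [0] * k
    for row in matrix:
        # Rank the treatments within this block: 1 = smallest value.
        order = sorted(range(k), key=lambda j: row[j])
        for rank, j in enumerate(order, start=1):
            rank_sums[j] += rank
    sum_sq = sum(r * r for r in rank_sums)
    return 12.0 * sum_sq / (n * k * (k + 1)) - 3.0 * n * (k + 1)


data = [[8.82, 11.8, 10.37, 12.08],
        [8.92, 9.58, 10.59, 11.89],
        [8.27, 11.46, 10.24, 11.6],
        [8.83, 13.25, 8.33, 11.51]]
print(friedman_statistic(data))
```
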
One can also pass a DataFrame and specify the names of columns containing
dependent variable values, blocking and primary factor values.
The following code creates a DataFrame with the same data:

.. code:: python

    >>> import pandas as pd
    >>> data = pd.DataFrame.from_dict({
    ...     'blocks': {0: 0, 1: 1, 2: 2, 3: 3, 4: 0, 5: 1, 6: 2, 7: 3,
    ...                8: 0, 9: 1, 10: 2, 11: 3, 12: 0, 13: 1, 14: 2, 15: 3},
    ...     'groups': {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1, 7: 1,
    ...                8: 2, 9: 2, 10: 2, 11: 2, 12: 3, 13: 3, 14: 3, 15: 3},
    ...     'y': {0: 8.82, 1: 8.92, 2: 8.27, 3: 8.83, 4: 11.8, 5: 9.58,
    ...           6: 11.46, 7: 13.25, 8: 10.37, 9: 10.59, 10: 10.24,
    ...           11: 8.33, 12: 12.08, 13: 11.89, 14: 11.6, 15: 11.51}})
    >>> data
        blocks  groups      y
    0        0       0   8.82
    1        1       0   8.92
    2        2       0   8.27
    3        3       0   8.83
    4        0       1  11.80
    5        1       1   9.58
    6        2       1  11.46
    7        3       1  13.25
    8        0       2  10.37
    9        1       2  10.59
    10       2       2  10.24
    11       3       2   8.33
    12       0       3  12.08
    13       1       3  11.89
    14       2       3  11.60
    15       3       3  11.51

This is a *melted* and ready-to-use DataFrame. Do not forget to pass ``melted``
argument:

.. code:: python

    >>> sp.posthoc_nemenyi_friedman(data, y_col='y', block_col='blocks', group_col='groups', melted=True)
              0         1         2         3
    0 -1.000000  0.220908  0.823993  0.031375
    1  0.220908 -1.000000  0.670273  0.823993
    2  0.823993  0.670273 -1.000000  0.220908
    3  0.031375  0.823993  0.220908 -1.000000

Data types
~~~~~~~~~~

Internally, ``scikit-posthocs`` uses NumPy ndarrays and pandas DataFrames to
store and process data. Python lists, NumPy ndarrays, and pandas DataFrames
are supported as *input* data types. Below are usage examples of various
input data structures.

Lists and arrays
^^^^^^^^^^^^^^^^

.. code:: python

    >>> import numpy as np
    >>> x = [[1,2,1,3,1,4], [12,3,11,9,3,8,1], [10,22,12,9,8,3]]
    >>> # or (rows of unequal length need dtype=object on recent NumPy versions)
    >>> x = np.array([[1,2,1,3,1,4], [12,3,11,9,3,8,1], [10,22,12,9,8,3]], dtype=object)
    >>> sp.posthoc_conover(x, p_adjust='holm')
              1         2         3
    1 -1.000000  0.057606  0.007888
    2  0.057606 -1.000000  0.215761
    3  0.007888  0.215761 -1.000000

You can check how it is processed with a hidden function ``__convert_to_df()``:

.. code:: python

    >>> sp.__convert_to_df(x)
    (    vals  groups
    0      1       1
    1      2       1
    2      1       1
    3      3       1
    4      1       1
    5      4       1
    6     12       2
    7      3       2
    8     11       2
    9      9       2
    10     3       2
    11     8       2
    12     1       2
    13    10       3
    14    22       3
    15    12       3
    16     9       3
    17     8       3
    18     3       3, 'vals', 'groups')

It returns a tuple of a DataFrame representation and names of the columns
containing dependent (``vals``) and independent (``groups``) variable values.

*Block design* matrix passed as a NumPy ndarray is processed with a hidden
``__convert_to_block_df()`` function:

.. code:: python

    >>> data = np.array([[ 8.82, 11.8 , 10.37, 12.08],
    ...                  [ 8.92,  9.58, 10.59, 11.89],
    ...                  [ 8.27, 11.46, 10.24, 11.6 ],
    ...                  [ 8.83, 13.25,  8.33, 11.51]])
    >>> sp.__convert_to_block_df(data)
    (    blocks  groups      y
    0        0       0   8.82
    1        1       0   8.92
    2        2       0   8.27
    3        3       0   8.83
    4        0       1  11.80
    5        1       1   9.58
    6        2       1  11.46
    7        3       1  13.25
    8        0       2  10.37
    9        1       2  10.59
    10       2       2  10.24
    11       3       2   8.33
    12       0       3  12.08
    13       1       3  11.89
    14       2       3  11.60
    15       3       3  11.51, 'y', 'groups', 'blocks')

DataFrames
^^^^^^^^^^

If you are using DataFrames, you need to pass column names containing variable
values to a post hoc function:

.. code:: python

    >>> import statsmodels.api as sa
    >>> import scikit_posthocs as sp
    >>> df = sa.datasets.get_rdataset('iris').data
    >>> df.columns = df.columns.str.replace('.', '', regex=False)
    >>> sp.posthoc_conover(df, val_col='SepalWidth', group_col='Species', p_adjust='holm')

``val_col`` and ``group_col`` arguments specify the names of the columns
containing dependent (response) and independent (grouping) variable values.

Significance plots
------------------

P values can be plotted using a heatmap:

.. code:: python

    >>> # x is assumed here to be a DataFrame with 'values' and 'groups' columns
    >>> pc = sp.posthoc_conover(x, val_col='values', group_col='groups')
    >>> heatmap_args = {'linewidths': 0.25, 'linecolor': '0.5', 'clip_on': False, 'square': True, 'cbar_ax_bbox': [0.80, 0.35, 0.04, 0.3]}
    >>> sp.sign_plot(pc, **heatmap_args)

.. image:: images/plot-conover.png

Custom colormap applied to a plot:

.. code:: python

    >>> pc = sp.posthoc_conover(x, val_col='values', group_col='groups')
    >>> # Format: diagonal, non-significant, p<0.001, p<0.01, p<0.05
    >>> cmap = ['1', '#fb6a4a', '#08306b', '#4292c6', '#c6dbef']
    >>> heatmap_args = {'cmap': cmap, 'linewidths': 0.25, 'linecolor': '0.5', 'clip_on': False, 'square': True, 'cbar_ax_bbox': [0.80, 0.35, 0.04, 0.3]}
    >>> sp.sign_plot(pc, **heatmap_args)

.. image:: images/plot-conover-custom-cmap.png

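
To illustrate how the five colormap entries relate to p values, here is a
small plain-Python sketch that bins a p value matrix in the order given in
the colormap comment (diagonal, non-significant, p<0.001, p<0.01, p<0.05).
The helper name ``significance_bin`` is hypothetical, and this is only an
assumption about the binning, not the package's plotting code.

```python
def significance_bin(p, is_diagonal=False):
    """Map a p value to an index into a five-colour list.

    Index order: 0 = diagonal, 1 = non-significant,
    2 = p<0.001, 3 = p<0.01, 4 = p<0.05.
    """
    if is_diagonal:
        return 0
    if p < 0.001:
        return 2
    if p < 0.01:
        return 3
    if p < 0.05:
        return 4
    return 1


cmap = ['1', '#fb6a4a', '#08306b', '#4292c6', '#c6dbef']
# Example p values; the diagonal entries are placeholders.
p_values = [[1.0, 0.0576, 0.0079],
            [0.0576, 1.0, 0.2158],
            [0.0079, 0.2158, 1.0]]
colours = [[cmap[significance_bin(p, i == j)] for j, p in enumerate(row)]
           for i, row in enumerate(p_values)]
for row in colours:
    print(row)
```
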
Citing
------

If you want to cite *scikit-posthocs*, please refer to the publication in
the Journal of Open Source Software:

Terpilowski, M. (2019). scikit-posthocs: Pairwise multiple comparison tests in
Python. Journal of Open Source Software, 4(36), 1169, https://doi.org/10.21105/joss.01169

.. code::

    @ARTICLE{Terpilowski2019,
      title    = {scikit-posthocs: Pairwise multiple comparison tests in Python},
      author   = {Terpilowski, Maksim},
      journal  = {The Journal of Open Source Software},
      volume   = {4},
      number   = {36},
      pages    = {1169},
      year     = {2019},
      doi      = {10.21105/joss.01169}
    }

Acknowledgement
---------------

Thorsten Pohlert, PMCMR author and maintainer