{"id":23116069,"url":"https://github.com/bukson/nancorrmp","last_synced_at":"2025-08-16T21:32:17.508Z","repository":{"id":57444800,"uuid":"237411424","full_name":"bukson/nancorrmp","owner":"bukson","description":"Parallel correlation calculation of big numpy arrays or pandas dataframes with NaNs and infs.","archived":false,"fork":false,"pushed_at":"2023-10-18T04:47:32.000Z","size":29,"stargazers_count":25,"open_issues_count":2,"forks_count":5,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-07-01T16:51:45.038Z","etag":null,"topics":["correlation","correlation-matrices","data-science","machine-learning","multiprocessing","numpy","pandas","python"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bukson.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2020-01-31T10:52:55.000Z","updated_at":"2025-06-01T16:32:22.000Z","dependencies_parsed_at":"2022-09-26T17:30:36.615Z","dependency_job_id":"5fa634d2-bc08-48d0-8644-85f74ef8400c","html_url":"https://github.com/bukson/nancorrmp","commit_stats":{"total_commits":13,"total_committers":3,"mean_commits":4.333333333333333,"dds":"0.23076923076923073","last_synced_commit":"0634ba42d557f1fd9df7ab5421969c636b54fc5e"},"previous_names":[],"tags_count":5,"template":false,"template_full_name":null,"purl":"pkg:github/bukson/nancorrmp","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bukson%2Fnancorrmp","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bukson%2Fnancorrmp/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositorie
s/bukson%2Fnancorrmp/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bukson%2Fnancorrmp/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bukson","download_url":"https://codeload.github.com/bukson/nancorrmp/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bukson%2Fnancorrmp/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":270775799,"owners_count":24642961,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-16T02:00:11.002Z","response_time":91,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["correlation","correlation-matrices","data-science","machine-learning","multiprocessing","numpy","pandas","python"],"created_at":"2024-12-17T04:10:51.746Z","updated_at":"2025-08-16T21:32:17.250Z","avatar_url":"https://github.com/bukson.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"Multiprocessing correlation calculation for Python\n=======================\n\n[![Build Status](https://travis-ci.com/bukson/nancorrmp.svg?branch=master)](https://travis-ci.com/bukson/nancorrmp)\n\n`nancorrmp` is a small module for calculating correlations of big numpy arrays or pandas dataframes with\n NaNs and infs, using multiple cores. 
The default `numpy.corrcoef` method does not calculate correlations\n for input that contains NaNs and infs, and the `pandas` method `pandas.DataFrame.corr` is single-threaded\n only. \n \n `nancorrmp` uses the Pearson correlation code from `scipy`, which is based on `numpy` instead\n of the `pandas` Cython backend. Multiprocessing is implemented with the Python `multiprocessing` module. \n `nancorrmp` uses the `pandas` method of calculating correlations of arrays with NaNs and infs,\n which skips a pair of observations when either of them is NaN, +inf, or -inf. `nancorrmp` can also\n calculate the result with p values, similar to the `scipy.stats.pearsonr` function.\n \n Benchmarks show that with 4 cores, calculating correlations is faster with `nancorrmp` than with `pandas`\n even for a 1200x1200 matrix. With 2 cores this holds for a 2400x2400 matrix. The single-process `pandas` implementation is still faster\n than single-process `nancorrmp` for a 5000x5000 matrix, so it is recommended to use `nancorrmp` with at least\n 2 cores.\n \n Table of Contents\n================\n\n* [Installation](https://github.com/bukson/nancorrmp#installation)\n\n* [Usage](https://github.com/bukson/nancorrmp#usage)\n\n* [Methods](https://github.com/bukson/nancorrmp#nancorrmp-methods)\n\n* [Benchmark](https://github.com/bukson/nancorrmp#benchmark)\n\n* [Test](https://github.com/bukson/nancorrmp#test)\n\n* [License](https://github.com/bukson/nancorrmp#license)\n\nInstallation\n============\n\n```\npip install nancorrmp\n```\nUsage\n=====\n```python\nimport pandas as pd\nimport numpy as np\nfrom nancorrmp.nancorrmp import NaNCorrMp\nfrom pandas.testing import assert_frame_equal\n\nnp.random.seed(0)\nrandom_dataframe = pd.DataFrame(np.random.rand(100, 100))\ncorr = NaNCorrMp.calculate(random_dataframe)\ncorr_pandas = random_dataframe.corr()\nassert_frame_equal(corr, corr_pandas)\ncorr, p_value = NaNCorrMp.calculate_with_p_value(random_dataframe)\n```\n\nNaNCorrMp Methods\n=================\n`nancorrmp` module has 
one static class named `NaNCorrMp` with 2 public methods and 1 type.\n\n**ArrayLike = Union[pd.DataFrame, np.ndarray]**\n\n\nType used to unify `pd.DataFrame` and `np.ndarray`. \n\n\n**NaNCorrMp.calculate(X: ArrayLike, n_jobs: int = -1, chunks: int = 500) -\u003e ArrayLike**\n\nCalculates the correlation matrix using Pearson correlation. `n_jobs` controls the number of cores to use;\nthe default of -1 uses all available cores. `chunks` controls how many pairs of arrays are sent to\neach process; the default of 500 should be suitable for all purposes. \n\nReturns output of the same type as the input: if `X` is a `pd.DataFrame` it returns a `pd.DataFrame`; if\n`X` is an `np.ndarray` it returns an `np.ndarray`.\n\n```python\nimport pandas as pd\nimport numpy as np\nfrom nancorrmp.nancorrmp import NaNCorrMp\n\nnp.random.seed(0)\nrandom_dataframe = pd.DataFrame(np.random.rand(100, 100))\ncorr = NaNCorrMp.calculate(random_dataframe)\n```\n\n\n**NaNCorrMp.calculate_with_p_value(X: ArrayLike, n_jobs: int = -1, chunks: int = 500) -\u003e Tuple[ArrayLike, ArrayLike]**\n\nCalculates the correlation matrix and the p value matrix using Pearson correlation. `n_jobs` controls the number of cores to use;\nthe default of -1 uses all available cores. `chunks` controls how many pairs of arrays are sent to\neach process; the default of 500 should be suitable for all purposes. 
The correlation and p value are the same as the result of \nusing `scipy.stats.pearsonr`, but the method also works with NaNs and infs and on multiple cores.\n\nReturns output of the same type as the input: if `X` is a `pd.DataFrame` it returns `(pd.DataFrame, pd.DataFrame)`; if\n`X` is an `np.ndarray` it returns `(np.ndarray, np.ndarray)`.\n\n```python\nimport pandas as pd\nimport numpy as np\nfrom nancorrmp.nancorrmp import NaNCorrMp\n\nnp.random.seed(0)\nrandom_dataframe = pd.DataFrame(np.random.rand(100, 100))\ncorr, p_value = NaNCorrMp.calculate_with_p_value(random_dataframe)\n```\n\n\nBenchmark\n============\n\nResults can be reproduced using the `test/test_benchmark_nancorrmp.py` module.\n\n```python\nimport pandas as pd\nimport numpy as np\nfrom nancorrmp.nancorrmp import NaNCorrMp\n\nnp.random.seed(0)\nrandom_dataframe = pd.DataFrame(np.random.rand(1200, 1200))\n\n%timeit NaNCorrMp.calculate(random_dataframe, n_jobs=4, chunks=1000)\n# 9.92 s ± 205 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n\n%timeit random_dataframe.corr()\n# 10.4 s ± 56.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n\nrandom_dataframe = pd.DataFrame(np.random.rand(2400, 2400))\n\n%timeit NaNCorrMp.calculate(random_dataframe, n_jobs=2, chunks=1000)\n# 1min 26s ± 3.16 s per loop (mean ± std. dev. of 7 runs, 1 loop each)\n\n%timeit random_dataframe.corr()\n# 1min 45s ± 3.58 s per loop (mean ± std. dev. of 7 runs, 1 loop each)\n```\n\nTest\n====\n\nThe `test` module contains tests for both single-core and multi-core usage. The tests assert\nthat the output of `NaNCorrMp.calculate` is the same as the output of `pandas.DataFrame.corr` for the same data. 
\nTests require `scipy` and can be run with the following command:\n```bash\npython setup.py test\n```\nLicense\n========\n\nMIT License\n\nCopyright (c) 2020 Michał Bukowski michal.bukowski@buksoft.pl\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the \"Software\"), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ncopies of the Software, and to permit persons to whom the Software is\nfurnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in all\ncopies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\nSOFTWARE.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbukson%2Fnancorrmp","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbukson%2Fnancorrmp","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbukson%2Fnancorrmp/lists"}