{"id":20938152,"url":"https://github.com/gibsramen/qupid","last_synced_at":"2025-05-13T22:31:34.487Z","repository":{"id":37051130,"uuid":"452982795","full_name":"gibsramen/qupid","owner":"gibsramen","description":"Case-control matching for microbiome data","archived":false,"fork":false,"pushed_at":"2023-08-08T20:36:32.000Z","size":178,"stargazers_count":10,"open_issues_count":4,"forks_count":1,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-04-17T09:19:10.138Z","etag":null,"topics":["bioinformatics","case-control","microbiome","python"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/gibsramen.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-01-28T07:52:20.000Z","updated_at":"2024-05-17T14:56:44.000Z","dependencies_parsed_at":"2024-11-18T22:54:32.849Z","dependency_job_id":"686f2d64-38a4-49de-a0b8-c110058aeac7","html_url":"https://github.com/gibsramen/qupid","commit_stats":{"total_commits":128,"total_committers":2,"mean_commits":64.0,"dds":0.0078125,"last_synced_commit":"68372cfd36eeca8f1e3ed8a3a3557ec359bc4051"},"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gibsramen%2Fqupid","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gibsramen%2Fqupid/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gibsramen%2Fqupid/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gibsramen%2Fqupid/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/gibsramen","download_url":"https://codeload.github.com/gibsramen/qupid/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254036813,"owners_count":22003655,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bioinformatics","case-control","microbiome","python"],"created_at":"2024-11-18T22:49:32.048Z","updated_at":"2025-05-13T22:31:34.104Z","avatar_url":"https://github.com/gibsramen.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![Main CI](https://github.com/gibsramen/qupid/actions/workflows/main.yml/badge.svg)](https://github.com/gibsramen/qupid/actions)\n[![QIIME 2 CI](https://github.com/gibsramen/qupid/actions/workflows/q2.yml/badge.svg)](https://github.com/gibsramen/qupid/actions/workflows/q2.yml)\n[![PyPI](https://img.shields.io/pypi/v/qupid.svg)](https://pypi.org/project/qupid)\n\n# Qupid\n\n(Pronounced like cupid)\n\nQupid is a tool for generating and statistically evaluating *multiple* case-control matchings of microbiome data.\n\n## Installation\n\nYou can install the most up-to-date version of Qupid from PyPi using the following command:\n\n```\npip install qupid\n```\n\n## Quickstart\n\nQupid provides a convenience function, `shuffle`, to easily generate multiple matches based on matching critiera.\nThis block of code will determine each viable control per case and randomly pick 10 arrangments of a single case matched to a single valid control.\nThe output is a pandas DataFrame where the rows are case names and each column represents a valid mapping of case to control.\n\n```python\nfrom pkg_resources import resource_filename\nimport pandas as pd\nimport qupid\n\nmetadata_fpath = resource_filename(\"qupid\", \"tests/data/asd.tsv\")\nmetadata = pd.read_table(metadata_fpath, sep=\"\\t\", index_col=0)\n\nasd_str = \"Diagnosed by a medical professional (doctor, physician assistant)\"\nno_asd_str = \"I do not have this condition\"\n\nbackground = metadata.query(\"asd == @no_asd_str\")\nfocus = metadata.query(\"asd == @asd_str\")\n\nmatches = qupid.shuffle(\n    focus=focus,\n    background=background,\n    categories=[\"sex\", \"age_years\"],\n    tolerance_map={\"age_years\": 10},\n    iterations=100\n)\n```\n\n## Tutorial\n\nThere are three primary steps to the Qupid workflow:\n\n1. Match each case to all valid controls\n2. Generate multiple one-to-one matchings\n3. Evaluate the statistical differences between cases and controls for all matchings\n\nTo match each case to all valid controls, we need to first establish matching criteria.\nQupid allows matching by both categorical metadata (exact matches) and continuous metadata (matching within provided tolerance).\nYou can match on either a single metadata column or based on multiple.\n\nIn Qupid, the cases to be matched are referred to as the \"focus\" set, while the set of all possible controls is called the \"background\".\nFor this tutorial we will be used data from the American Gut Project to match cases to controls in samples from people with autism.\n\nFirst, we'll load in the provided example metadata and separate it into the focus (samples from people with autism) and the background (samples from people who do not have autism).\n\n### Loading data\n\n```python\nfrom pkg_resources import resource_filename\nimport pandas as pd\n\nmetadata_fpath = resource_filename(\"qupid\", \"tests/data/asd.tsv\")\nmetadata = pd.read_table(metadata_fpath, sep=\"\\t\", index_col=0)\n\n# Designate focus samples\nasd_str = \"Diagnosed by a medical professional (doctor, physician assistant)\"\nno_asd_str = \"I do not have this condition\"\n\nbackground = metadata.query(\"asd == @no_asd_str\")\nfocus = metadata.query(\"asd == @asd_str\")\n```\n\n### Matching each case to all possible controls\n\nNext, we want to perform case-control matching on sex and age.\nSex is a discrete factor, so Qupid will attempt to find exact matches (e.g. male to male, female to female).\nHowever, age is a continuous factor; as a result, we should provide a tolerance value (e.g. match within 10 years).\nWe use the `match_by_multiple` function to match based on more than one metadata category.\n\n```python\nfrom qupid import match_by_multiple\n\ncm = match_by_multiple(\n    focus=focus,\n    background=background,\n    categories=[\"sex\", \"age_years\"],\n    tolerance_map={\"age_years\": 10}\n)\n```\n\nThis creates a `CaseMatchOneToMany` object where each case is matched to each possible control.\nYou can view the underlying matches as a dictionary with `cm.case_control_map`.\n\n### Generating mappings from each case to a single control\n\nWhat we now want is to match each case to a *single* control so we can perform downstream analysis.\nHowever, we have *a lot* of possible controls.\nWe can easily see how many cases and possible controls we have.\n\n```python\nprint(len(cm.cases), len(cm.controls))\n```\n\nThis tells us that we have 45 cases and 1785 possible controls.\nBecause of this, there are many possible sets of valid matchings of each case to a single control.\nWe can use Qupid to generate many such cases.\n\n```python\nresults = cm.create_matched_pairs(iterations=100)\n```\n\nThis creates a `CaseMatchCollection` data structure that contains 100 `CaseMatchOneToOne` instances.\nEach `CaseMatchOneToOne` entry maps each case to *a single control* rather than all possible controls.\nWe can verify that each entry has exactly 45 cases and 45 controls.\n\n```python\nprint(len(results[0].cases), len(results[0].controls))\n```\n\nQupid provides a convenience method to convert a `CaseMatchCollection` object into a pandas DataFrame.\nThe DataFrame index corresponds to the cases, while each column represents a distinct set of matching controls.\nThe value in a cell represents a matching control to the row's case.\n\n```python\nresults_df = results.to_dataframe()\nresults_df.head()\n```\n\n```\n                                0                 1   ...                98                99\ncase_id                                               ...\nS10317.000026181  S10317.000033804  S10317.000069086  ...  S10317.000108605  S10317.000076381\nS10317.000071491  S10317.000155409  S10317.000103912  ...  S10317.000099277  S10317.000036401\nS10317.000029293  S10317.000069676  S10317.X00175749  ...  S10317.000069299  S10317.000066846\nS10317.000067638  S10317.X00179103  S10317.000052409  ...  S10317.000067511  S10317.000067601\nS10317.000067637  S10317.000067747  S10317.000098161  ...  S10317.000017116  S10317.000067997\n\n[5 rows x 100 columns]\n```\n\n### Statistical assessment of matchings\n\nOnce we have this list of matchings, we want to determine how statistically difference cases are from controls based on some values.\nQupid supports two types of statistical tests: univariate and multivariate.\nUnivariate data is in the form of a vector where each case and control has a single value.\nThis can be alpha diversity, log-ratios, etc.\nMultivariate data is in the form of a distance matrix where each entry is the pairwise distance between two samples, e.g. from beta diversity analysis.\nWe will generate random data for this tutorial where there exists a small difference between ASD samples and non-ASD samples.\n\n```python\nimport numpy as np\n\nrng = np.random.default_rng()\nasd_mean = 4\nctrl_mean = 3.75\n\nnum_cases = len(cm.cases)\nnum_ctrls = len(cm.controls)\n\nasd_values = rng.normal(asd_mean, 1, size=num_cases)\nctrl_values = rng.normal(ctrl_mean, 1, size=num_ctrls)\n\nasd_values = pd.Series(asd_values, index=focus.index)\nctrl_values = pd.Series(ctrl_values, index=background.index)\n\nsample_values = pd.concat([asd_values, ctrl_values])\n```\n\nWe can now evaluate a t-test between case values and control values for each possible case-control matching in our collection.\n\n```python\nfrom qupid.stats import bulk_univariate_test\n\ntest_results = bulk_univariate_test(\n    casematches=results,\n    values=sample_values,\n    test=\"t\"\n)\n```\n\nThis returns a DataFrame of test results sorted by descending test statistic.\n\n```\n   method_name test_statistic_name  test_statistic   p-value  sample_size  number_of_groups\n15      t-test                   t        3.900874  0.000187           90                 2\n61      t-test                   t        3.770914  0.000294           90                 2\n50      t-test                   t        3.536803  0.000649           90                 2\n32      t-test                   t        3.395298  0.001030           90                 2\n68      t-test                   t        3.310822  0.001350           90                 2\n..         ...                 ...             ...       ...          ...               ...\n13      t-test                   t        0.645694  0.520158           90                 2\n49      t-test                   t        0.555063  0.580260           90                 2\n92      t-test                   t        0.409252  0.683349           90                 2\n51      t-test                   t        0.110707  0.912101           90                 2\n34      t-test                   t        0.048571  0.961371           90                 2\n\n[100 rows x 6 columns]\n```\n\nFrom this table, we can see that iteration 15 best separates cases from controls based on our random data.\nConversely, iteration 34 showed essentially no difference between cases and controls.\nThis shows that it is important to create multiple matchings as some of them are better than others.\nWe can plot the distribution of p-values to get a sense of the overall distribution.\n\n```python\nimport matplotlib.pyplot as plt\nimport seaborn as sns\n\nsns.histplot(test_results[\"p-value\"])\n```\n\n![p-value Histogram](https://raw.githubusercontent.com/gibsramen/qupid/main/imgs/asd_match_pvals.png)\n\nWe see that most of the p-values are near zero which makes sense because we simulated our data with a difference between ASD and non-ASD samples.\n\n### Saving and loading qupid results\n\nQupid allows the saving and loading of both `CaseMatch` and `CaseMatchCollection` objects.\n`CaseMatchOneToMany` and `CaseMatchOneToOne` objects are saved as JSON files while `CaseMatchCollection` objects are saved as pandas DataFrames.\n\n```python\nfrom qupid.casematch import CaseMatchOneToMany, CaseMatchOneToOne, CaseMatchCollection\n\ncm.save(\"asd_matches.one_to_many.json\")  # Save all possible matches\nresults.save(\"asd_matches.100.tsv\")  # Save all 100 iterations\nresults[15].save(\"asd_matches.best.json\")  # Save best matching\n\nCaseMatchOneToMany.load(\"asd_matches.one_to_many.json\")\nCaseMatchCollection.load(\"asd_matches.100.tsv\")\nCaseMatchOneToOne.load(\"asd_matches.best.json\")\n```\n\n## Command Line Interface\n\nQupid has a command line interface to create multiple matchings from cases and possible controls.\nIf providing numeric categories, the column name must be accompanied by the tolerance after a space (e.g. `age_years 5` for a tolerance of 5 years).\nYou can pass multiple options to `--discrete-cat` or `--numeric-cat` to specify multiple matching criteria.\n\nFor usage detalls, use `qupid shuffle --help`.\n\n```\nqupid shuffle \\\n    --focus focus.tsv \\\n    --background background.tsv \\\n    --iterations 15 \\\n    --discrete-cat sex \\\n    --discrete-cat race \\\n    --numeric-cat age_years 5 \\\n    --numeric-cat weight_lbs 10 \\\n    --output matches.tsv\n```\n\n## QIIME 2 Usage\n\nQupid provides support for the popular QIIME 2 framework of microbiome data analysis.\nWe assume in this tutorial that you are familiar with using QIIME 2 on the command line.\nIf not, we recommend you read the excellent [documentation](https://docs.qiime2.org/) before you get started with Qupid.\n\nRun `qiime qupid --help` to see all possible commands.\n\n### Matching one-to-many\n\nUse `qiime qupid match-one-to-many` to match each case to all possible controls.\nNote that for numeric categories, you must pass in tolerances in the form of `\u003ccolumn_name\u003e+-\u003ctolerance\u003e`.\n\n```\nqiime qupid match-one-to-many \\\n    --m-sample-metadata-file metadata.tsv \\\n    --p-case-control-column case_control \\\n    --p-categories sex age_years \\\n    --p-case-identifier case \\\n    --p-tolerances age_years+-10 \\\n    --o-case-match-one-to-many cm_one_to_many.qza\n```\n\n### Matching one-to-one\n\nWith a one-to-many match, you can generate multiple possible one-to-one matches using `qiime qupid match-one-to-one`.\n\n```\nqiime qupid match-one-to-one \\\n    --i-case-match-one-to-many cm_one_to_many.qza \\\n    --p-iterations 10 \\\n    --o-case-match-collection cm_collection.qza\n```\n\n### Qupid shuffle\n\nThe previous two commands can be run sequentially using `qiime qupid shuffle`.\n\n```\nqiime qupid shuffle \\\n    --m-sample-metadata-file metadata.tsv \\\n    --p-case-control-column case_control \\\n    --p-categories sex age_years \\\n    --p-case-identifier case \\\n    --p-tolerances age_years+-10 \\\n    --p-iterations 10 \\\n    --output-dir shuffle\n```\n\n### Statistical assessment of matches\n\nYou can assess how different cases are from controls using both univariate data (such as alpha diversity) or multivariate data (distance matrices).\nThe result will be a histogram of p-values from either a t-test (univariate) or PERMANOVA (multivariate) comparing cases to controls.\nNote that for either command, the input data must contain values for all possible cases and controls.\n\n```\nqiime qupid assess-matches-univariate \\\n    --i-case-match-collection cm_collection.qza \\\n    --m-data-file data.tsv \\\n    --m-data-column faith_pd \\\n    --o-visualization univariate_p_values.qzv\n```\n\n```\nqiime qupid assess-matches-multivariate \\\n    --i-case-match-collection cm_collection.qza \\\n    --i-distance-matrix uw_unifrac_distance_matrix.qza \\\n    --p-permutations 999 \\\n    --o-visualization multivariate_p_values.qzv\n```\n\n## Help with Qupid\n\nIf you encounter a bug in Qupid, please post a GitHub issue and we will get to it as soon as we can. We welcome any ideas or documentation updates/fixes so please submit an issue and/or a pull request if you have thoughts on making Qupid better.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgibsramen%2Fqupid","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgibsramen%2Fqupid","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgibsramen%2Fqupid/lists"}