{"id":13457710,"url":"https://github.com/omerwe/MKLMM","last_synced_at":"2025-03-24T14:32:16.367Z","repository":{"id":88522838,"uuid":"50574796","full_name":"omerwe/MKLMM","owner":"omerwe","description":"Multi Kernel Linear Mixed Models for Complex Phenotype Prediction","archived":false,"fork":false,"pushed_at":"2022-07-06T17:26:26.000Z","size":16737,"stargazers_count":8,"open_issues_count":1,"forks_count":5,"subscribers_count":2,"default_branch":"master","last_synced_at":"2024-10-29T02:31:12.803Z","etag":null,"topics":["gaussian-process-regression","gaussian-processes","gwas","kernel-methods","machine-learning","polygenic-risk-scores","polygenic-scores","statistical-genetics"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-2-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/omerwe.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2016-01-28T10:18:28.000Z","updated_at":"2024-01-05T04:30:10.000Z","dependencies_parsed_at":"2023-07-30T13:16:16.324Z","dependency_job_id":null,"html_url":"https://github.com/omerwe/MKLMM","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/omerwe%2FMKLMM","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/omerwe%2FMKLMM/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/omerwe%2FMKLMM/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/omerwe%2FMKLMM/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/omerwe","download_url":"https://cod
eload.github.com/omerwe/MKLMM/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245289630,"owners_count":20591108,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["gaussian-process-regression","gaussian-processes","gwas","kernel-methods","machine-learning","polygenic-risk-scores","polygenic-scores","statistical-genetics"],"created_at":"2024-07-31T09:00:34.144Z","updated_at":"2025-03-24T14:32:15.276Z","avatar_url":"https://github.com/omerwe.png","language":"Python","funding_links":[],"categories":["\u003cp align=\"center\"\u003e Challenges in the prediction \u003c/p\u003e"],"sub_categories":["\u003cp align = 'center'\u003e Why an individual's phenotype cannot always be predicted from their genome and the environment they experience? \u003c/p\u003e"],"readme":"# MKLMM\nMulti Kernel Linear Mixed Models for Complex Phenotype Prediction\n\nMKLMM is a Python package for prediction of complex phenotypes from single nucleotide polymorphism (SNP) data that can model genetic interactions by using multi-kernel-learning techniques. The model is a generalization of [Adaptive MultiBLUP](http://genome.cshlp.org/content/early/2014/06/24/gr.169375.113.abstract), which divides the genome into several regions and infers a different variance component for every region. 
MKLMM improves upon MultiBLUP by additionally modeling genetic interactions through non-linear variance components (or kernels), with all model parameters inferred jointly via restricted maximum likelihood.\n\nMKLMM is particularly suitable for modeling complex local interactions between nearby variants. MKLMM-Adapt is a variant of MKLMM which automatically infers interaction patterns across multiple genomic regions. \n\nMKLMM was published in: [Multikernel linear mixed models for complex phenotype prediction. Genome research 26, 969-979 (2016)](http://genome.cshlp.org/content/26/7/969.short).\n\nSeveral parts of the code are based on code translated from the [GPML toolbox](http://www.gaussianprocess.org/gpml/code/matlab/) and [Fast-LMM](https://github.com/MicrosoftGenomics/FaST-LMM).\n\n------------------\nInstallation\n------------------\nMKLMM is designed to work in Python 2.7, and depends on the following freely available Python packages:\n* [Numpy](http://www.numpy.org/) and [Scipy](http://www.scipy.org/)\n* [Scikit-learn](http://scikit-learn.org/stable/)\n* [PySnpTools](https://github.com/MicrosoftGenomics/PySnpTools)\n\nTypically, the packages can be installed with the command \"pip install --user \u003cpackage_name\u003e\".\n\nMKLMM is particularly easy to use with the [Anaconda Python distribution](https://store.continuum.io/cshop/anaconda). 
The [numerically optimized version](http://docs.continuum.io/mkl-optimizations/index) of Anaconda can speed MKLMM up significantly.\nAlternatively (if numerically optimized Anaconda can't be installed), for very fast performance it is recommended to have an optimized version of Numpy/Scipy [installed on your system](http://www.scipy.org/scipylib/building), using optimized numerical libraries such as [OpenBLAS](http://www.openblas.net) or [Intel MKL](https://software.intel.com/en-us/intel-mkl) (see [Compilation instructions for scipy with Intel MKL](https://software.intel.com/en-us/articles/numpyscipy-with-intel-mkl)).\n\nOnce all the prerequisite packages are installed, MKLMM can be installed on a git-enabled machine by typing:\n```\ngit clone https://github.com/omerwe/MKLMM\n```\n```\nunzip example.zip\n```\n\nThe project can also be downloaded as a zip file from the GitHub website.\n\n------------------\nUsage Overview\n------------------\nMKLMM works with [binary Plink format](http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#bed), and with phenotype and covariate files written in Plink format. \n\nThere are two files that need to be executed:\n\n1. rank_regions.py: This file divides the genome into small regions (mean length 75Kb) and ranks them. It outputs a file that reports the first and last SNP in each region, and the region score.\n\n2. mklmm_wrapper.py: This file trains a model and predicts phenotypes for a set of individuals. The file can be invoked in train mode, test mode or both. When invoked in train mode, it creates an output file with the learned REML parameters. 
When invoked in test mode, it creates an output file which lists the posterior mean and variance of the estimated phenotypes of all individuals.\n\n\nThe list of available options for both files can be seen by typing\n```\npython \u003cfile_name\u003e --help\n```\n\nFor an example, please run:\n```\npython mklmm_wrapper.py --bfile_train example/train --pheno example/train.phe --bfile example/test --regions example/regions.txt --numRegions 2 --kernel lin --out example/predictions.txt\n```\nThis will train a model on the individuals in the file example/train.bed, will perform prediction on individuals in the file example/test.bed and will write the output to the file example/predictions.txt. The model will use two regions: one genome-wide region spanning all SNPs, and an additional region that obtained the highest log-likelihood in the region-ranking stage.\n\n------------------\nDetailed Instructions\n------------------\n#### Ranking of regions\nTo rank regions, one should type:\n```\npython rank_regions.py --bfile_train \u003cbfile\u003e --pheno_train \u003cphenotype file\u003e --meanLen \u003cmean region length\u003e --out \u003coutput file\u003e --covar_train \u003ccovariates file\u003e\n```\nThe bfile_train and phenotype files should contain only train set individuals, to prevent leakage. The covar_train flag is optional, and can be used to specify a covariates file, in Plink format, whose covariates will be modeled as fixed effects. Note that MKLMM automatically creates an intercept covariate, so there is no need to add a column of ones.\n\nIt is also possible to specify a bfile with test individuals (using the flag --bfile). This is useful for removal of top principal components from the genotypes, which can prevent confounding due to population structure (see details below). For an effective removal, it is required to compute the principal components using both the train and test individuals. 
This does not introduce any form of leakage, because the phenotypes of the test individuals are not known at any stage.\nRanking of regions is typically very fast, owing to the fast LMM inference algorithm of the FastLMM method.\n\n\nThe output file contains a row for every region, where each row has three entries. These entries correspond to the first and last SNP in the region (where the first SNP is numbered as 0), and to the log likelihood of the phenotype when using an LMM whose covariance matrix is built from only the SNPs in that region.\n\n#### Training an MKLMM model\nTo train a model, one should type:\n```\npython mklmm_wrapper.py --bfile_train \u003cbfile\u003e --pheno_train \u003cphenotype file\u003e --covar_train \u003ccovariates file\u003e --regions \u003cregions file\u003e --numRegions \u003c#regions to use\u003e --train_out \u003coutput training file\u003e --kernel \u003ckernel type\u003e \n```\n\nThe bfile_train, phenotype file and covar_train files should be the same as those provided in the first stage. The regions file is the output of the first stage. numRegions specifies the number of regions that will be assigned their own kernel. For example, when numRegions=0, only a single genome-wide kernel is used, and the model thus becomes equivalent to GBLUP. The code automatically merges adjacent regions that have a high log likelihood. The parameter train_out specifies the output file which will contain the learned parameters of the kernels and the fixed effects.\nThe --kernel flag specifies the type of kernel that will be used, among several options:\n\n* lin - a linear kernel will be used for every region. This is the fastest option. 
MKLMM with this option is equivalent to [Adaptive MultiBLUP](http://genome.cshlp.org/content/early/2014/06/24/gr.169375.113.abstract).\n* rbf - An RBF kernel for every region\n* poly2 - An inhomogeneous polynomial kernel of degree 2 for every region\n* poly3 - An inhomogeneous polynomial kernel of degree 3 for every region\n* nn - a neural network kernel (referred to in the paper as an SP kernel) for every region\n* adapt - MKLMM-Adapt: The code will adaptively select a different kernel for every region in a data-driven manner. Note that this results in a slower run-time compared to the other kernels, because several kernels are evaluated for every region.\n\nAll kernels can be augmented with a linear kernel, by adding the suffix \"_lin\". For example, rbf_lin specifies that for every region the model will use a weighted combination of a linear and an RBF kernel. Note that the polynomial kernels become homogeneous when used along with a linear kernel, because otherwise the model is overparameterized. Also note that the code supports several other kernel types not reported in the paper, such as the Matern, Gabor and piecewise polynomial kernels. Please look at the source code for the full list of kernels.\n\n#### Performing Prediction:\nTo perform prediction, one should type:\n```\npython mklmm_wrapper.py --bfile_train \u003cbfile\u003e --pheno_train \u003cphenotype file\u003e --covar_train \u003ccovariates file\u003e --regions \u003cregions file\u003e --numRegions \u003c#regions to use\u003e --train_file \u003ctraining file\u003e --kernel \u003ckernel type\u003e --bfile \u003cbfile with new individuals\u003e --out \u003coutput file\u003e\n```\n\nThe bfile_train, pheno_train, covar_train, regions and numRegions parameters should be exactly the same as in the previous run of mklmm_wrapper.py. The train_file is the output of the previous run of mklmm_wrapper.py.\nThe file specified with --bfile should contain new individuals that did not participate in the training set. 
Prediction will be performed for these individuals.\nThe output is written to the output file. For every individual, the file reports the mean (column 3) and variance (column 4) of the posterior phenotype distribution for this individual.\n\n\nThe training set should be provided in the testing stage for several reasons:\n* The SNPs of the test individuals should be normalized in the same way as the train individuals. It is possible to save the mean and standard deviation of every SNP, but this is wasteful, so this approach was not used here.\n* In order to compute the covariance between train and test individuals, the train genotypes are used. Note that one can bypass this requirement, as explained in the MKLMM paper, but this option is not yet enabled.\n* Efficient computation of the predictive distribution of test individuals also requires the phenotypes and covariates of training individuals. Again, this requirement can be bypassed, but is not yet enabled.\n\nNote that both runs of mklmm_wrapper.py can be combined. This is especially useful when one wants to remove principal components, as explained below.\n\n\n------------------\nRemoval of top principal components\n------------------\nIt is often beneficial to project genotype vectors into a subspace that is orthogonal to the top principal components of the data, in order to prevent confounding due to population structure. This can be done by using the flag \"--numRemovePCs \u003c#PCs\u003e\", and specifying a number larger than zero. Note that this should be done for both rank_regions.py and mklmm_wrapper.py to obtain consistent results.\nWhen removing principal components, the test individuals should be provided (using the --bfile flag), so that the principal components computation will also consider their genotypes. \n\n\n-----------------------------\nStandardization of genotypes\n-----------------------------\nStandardization of genotypes can be carried out in several ways. 
The standard way is to simply standardize all SNPs to have a zero mean and a unit variance. Alternatively, one can apply an \"ascertainment-aware\" standardization by computing a weighted mean and standard deviations according to the disease prevalence, so that cases are not overrepresented in the computation. This can be accomplished with the flag \"--norm controls\". Note that unlike genotypes, standardization of covariates does not affect the analysis in any way, because they are treated as fixed effects.\n\n\n-----------------------------\nExample data set\n-----------------------------\nThe package contains a small example dataset in the \"example\" directory with synthetic genotypes and phenotypes. The example contains two regions with interacting SNPs, as well as a polygenic term spanning all SNPs.\nThe data is divided into a train set (containing 1,500 individuals) and a test set (containing 1,301 individuals).\nThe example directory also contains the original (synthetic) phenotypes of all individuals, in the file pheno_all.phe, which can be used to evaluate prediction performance.\n\n\n\n\n\n\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fomerwe%2FMKLMM","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fomerwe%2FMKLMM","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fomerwe%2FMKLMM/lists"}