{"id":45210268,"url":"https://github.com/viralemergence/trefle","last_synced_at":"2026-02-20T16:27:56.024Z","repository":{"id":50493616,"uuid":"334272263","full_name":"viralemergence/trefle","owner":"viralemergence","description":"Imputing the mammalian virome with the LF-SVD model","archived":false,"fork":false,"pushed_at":"2023-03-17T17:25:10.000Z","size":42659,"stargazers_count":1,"open_issues_count":2,"forks_count":1,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-09-05T13:01:28.156Z","etag":null,"topics":["imputation","svd","verena","virology","zoonotic-disease"],"latest_commit_sha":null,"homepage":"","language":"Julia","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"cc-by-4.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/viralemergence.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"citation_counts/m1.rds","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2021-01-29T21:58:31.000Z","updated_at":"2022-07-29T14:46:25.000Z","dependencies_parsed_at":"2025-09-05T13:02:22.322Z","dependency_job_id":null,"html_url":"https://github.com/viralemergence/trefle","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/viralemergence/trefle","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/viralemergence%2Ftrefle","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/viralemergence%2Ftrefle/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/viralemergence%2Ftrefle/releases","manifests_url":"
https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/viralemergence%2Ftrefle/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/viralemergence","download_url":"https://codeload.github.com/viralemergence/trefle/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/viralemergence%2Ftrefle/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29656773,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-20T09:27:29.698Z","status":"ssl_error","status_checked_at":"2026-02-20T09:26:12.373Z","response_time":59,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["imputation","svd","verena","virology","zoonotic-disease"],"created_at":"2026-02-20T16:27:55.280Z","updated_at":"2026-02-20T16:27:56.015Z","avatar_url":"https://github.com/viralemergence.png","language":"Julia","funding_links":[],"categories":[],"sub_categories":[],"readme":"# A model-inflated list of potential host-virus associations\n\nWhat is `trefle`?\n\nIt is a data product derived from the [`clover`][clover] database of\nmammal-virus associations. 
Specifically, `trefle` was produced using LF-SVD\nimputation, a two-step algorithm where novel host-virus associations are\nrecommended based on truncated singular value decomposition applied to initial\nvalues based on a linear filter.\n\n[![Project Status: Active – The project has reached a stable, usable state and is being actively developed.](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/#active)\n\n\n\n[clover]: https://github.com/viralemergence/clover\n\n## LF-SVD\n\nAssociations in `trefle` are recommended based on the output of a two-step\nprocess. First, [linear filtering][LF] is used to generate an initial value based\non network properties. The linear filter has four hyper-parameters (the four\nweights assigned to the initial association, the connectance, and the in and out\ndegree of the nodes), constrained so that their values sum to one.\n\n[LF]: https://www.nature.com/articles/srep45908\n\nSecond, we apply truncated SVD to the modified `clover`, in which the missing\nassociation we impute gets its initial value from the linear filter. The rank\nof truncation for the low-rank approximation is a fifth hyper-parameter in this\nmodel.\n\nIn short, `trefle` is a giant leave-one-out cross-validation (LOOCV) dataset.\nThis has consequences for how many computational resources are required to\n*produce* it, which we will approximate as: hella. We will discuss the\ncomputational requirements more below.\n\n## Hyper-parameters tuning\n\nIn practice, we can get away with removing the first hyper-parameter of the\nlinear filter, as we have reasons to suspect that negative associations can\noften be false negatives. 
This leaves us with four hyper-parameters to tune.\n\nBecause exploring the grid of linear filter parameters would be prohibitive in\nterms of computing time (and would also lead to less interpretable model\ninputs), we picked three initial models: the initial value is the same for all\nassociations and determined by the connectance of `clover` (`connectance`); the\ninitial value is given by the averaged relative degree of the host and the virus\n(`degree`); the initial value is given by the average of the previous two models\n(`hybrid`).\n\nWe applied each model at various depths of low-rank approximation, *i.e.* by\ntruncating the SVD to its first 1 to 20 singular values. Within each model-rank\ncombination, we imputed the value of 780 positive interactions (which we should\nassume are true positives given the nature of the `clover` data), and of 780\nnegative interactions (about which we will refrain from making assumptions),\nusing LOOCV.\n\nThe performance of each model-rank combination was measured using ROC-AUC,\nassuming that negative interactions are true negatives. Note that owing to the\ndimensions of `clover`, the training sample represents less than 1/1000 of the\nentire dataset. Further, for each model we decided on a threshold of evidence\nabove which the pseudo-probability should be indicative of an actual association\nby picking the value of evidence which maximizes Youden's J statistic. In the\noverwhelming majority of cases, this value of evidence *also* maximized the\naccuracy of the model.\n\n## Output values\n\nThe output value in `trefle` is akin to an association probability (but it is\nnot a probability of association in the sense of [probabilistic ecological\nnetworks][pen]). The final value after imputation is divided by the initial\nvalue before imputation. If the association \"score\" does not change, this gives\na value of 1. 
We transform this by subtracting one from the result, yielding an\n*evidence* value for the association: positive evidence makes the association\nmore likely. To convert the evidence into a pseudo-probability, we put it\nthrough the logistic function. This returns values in [0, 1]. In practice, owing\nto the numerical imprecision involved in evaluating the logistic function on even\nmoderately large 64-bit floating-point numbers, it is common to have final\npseudo-probability values of 1, and we rely on the *evidence* for ranking.\n\n[pen]: https://besjournals.onlinelibrary.wiley.com/doi/10.1111/2041-210X.12468\n\nThe following figure is an illustration of the resulting probabilities in an ensemble model of all of the model candidates used during tuning - the little bump in `false` values around 1 corresponds to candidate false negatives:\n\n![proba ensemble](model_performance/probabilities.png)\n\n## Model performance\n\n### Top 10 models\n\nThe following table has the 10 best models ranked from first to last, as well as\nthe usual measures of model performance derived from the confusion table. 
In\naddition to the AUC and cutoff (expressed as a *pseudo-probability*), we report\nthe true positive and true negative rates (TPR, TNR), the positive and negative\npredictive values (PPV, NPV), the false negative and positive rates (FNR, FPR),\nthe false discovery and false omission rates (FDR, FOR), the critical success\nindex (CSI), accuracy (ACC), and Youden's J.\n\n| model         | rank | AUC   | cutoff | TPR   | TNR   | PPV   | NPV   | FNR   | FPR   | FDR   | FOR   | CSI   | ACC   | J     |\n|---------------|------|-------|--------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|\n| `connectance` | 12   | 0.849 | 0.846  | 0.720 | 0.925 | 0.906 | 0.769 | 0.28  | 0.074 | 0.093 | 0.230 | 0.669 | 0.823 | 0.645 |\n| `connectance` | 11   | 0.846 | 0.908  | 0.684 | 0.936 | 0.914 | 0.75  | 0.315 | 0.063 | 0.085 | 0.25  | 0.643 | 0.811 | 0.621 |\n| `connectance` | 17   | 0.844 | 0.929  | 0.692 | 0.935 | 0.913 | 0.754 | 0.307 | 0.064 | 0.086 | 0.245 | 0.649 | 0.814 | 0.627 |\n| `connectance` | 8    | 0.842 | 0.705  | 0.701 | 0.895 | 0.868 | 0.751 | 0.298 | 0.104 | 0.131 | 0.248 | 0.634 | 0.798 | 0.596 |\n| `hybrid`      | 12   | 0.841 | 0.707  | 0.703 | 0.877 | 0.851 | 0.748 | 0.296 | 0.122 | 0.148 | 0.251 | 0.626 | 0.790 | 0.581 |\n| `connectance` | 14   | 0.839 | 0.902  | 0.700 | 0.929 | 0.907 | 0.758 | 0.299 | 0.070 | 0.092 | 0.241 | 0.653 | 0.815 | 0.629 |\n| `hybrid`      | 11   | 0.837 | 0.820  | 0.647 | 0.918 | 0.888 | 0.723 | 0.352 | 0.081 | 0.111 | 0.276 | 0.598 | 0.783 | 0.566 |\n| `connectance` | 5    | 0.836 | 0.931  | 0.660 | 0.940 | 0.916 | 0.735 | 0.339 | 0.059 | 0.083 | 0.264 | 0.623 | 0.800 | 0.600 |\n| `connectance` | 7    | 0.836 | 0.948  | 0.655 | 0.957 | 0.939 | 0.735 | 0.344 | 0.042 | 0.060 | 0.264 | 0.628 | 0.806 | 0.613 |\n| `connectance` | 16   | 0.835 | 0.961  | 0.667 | 0.945 | 0.923 | 0.741 | 0.332 | 0.054 | 0.076 | 0.258 | 0.632 | 0.807 | 0.613 |\n\nFollowing these results, we have conducted the 
imputation with the model\nbased on connectance and a rank 12 approximation. Visualisations of all these\nmetrics are provided in `model_performance/metrics`.\n\n### Overview of the best model\n\nThe following figure is the ROC curve, with a depiction of the point maximizing\nYouden's J and the associated probability cutoff:\n\n![ROC-AUC](model_performance/roc/rank-12-model-connectance.png)\n\nVisualisations of the same curve for all model-rank combinations are in\n`model_performance/roc`.\n\n## Computational resources\n\nWe assembled `trefle` on the [beluga][beluga] supercomputer, operated by *Calcul\nQuébec*, using a pipeline built entirely in [Julia][jl] (1.5.2).\n\n[beluga]: https://www.computecanada.ca/featured/beluga-the-latest-supercomputer-for-canadian-researchers/\n[jl]: https://julialang.org/\n\nTuning the hyper-parameters required about 2400 core hours, and imputation took\napproximately 59500 core hours. Rounding up, using recent ARC hardware, the\nassembly of `trefle` takes 62000 core hours, or just above 7 core years.\nAssuming a cost of $0.051 per core hour (equivalent to what a commercial cloud\ncomputing provider would charge), the entire `trefle` production process costs\nabout $3200.\n\nDealing with the `artifacts/tuning.csv` and `artifacts/predictions.csv` is\n*considerably* less demanding. The project comes bundled with a `Project.toml`\nwhich specifies the dependencies, and the compatible major/minor releases of the\npackages. The `hpc/inputs` folder also comes with its `Manifest.toml` file, to\nensure that we would get the same environment should we decide to run the code\nagain (but see the previous paragraph for why this is unlikely).\n\n## How to use `trefle`\n\nThe output of running the pipeline is a *prediction* (specifically based on a binary\nclassifier) for host-virus associations that are likely to exist given what we know\nabout true positives (*i.e.* the content of `clover`). 
These recommended interactions are\n*not* actual observations, and should not be treated as such.\n\n🧑‍⚖️ Let's talk about licensing, said no one ever. The `trefle` repo is a\ncomplex beast with data from other projects, code to work on it, and derived data products\nfrom both of these things. As a result, intellectual property and\nuse rights are applied *within each top-level folder*. A folder that has *no\n`LICENSE` file in it* is understood to contain information that should not be\nre-used or re-distributed. This is notably the case for `data/`, which contains\ninformation from other projects. Note that the repo has a `LICENSE` (CC-BY 4.0)\nfile at its root, which covers this `README` and *all images present within this\nproject*. All derived data (in `artifacts`) are released under the CC0 waiver and\nare usable without condition or restriction. Re-use of content under CC-BY 4.0\nshould mention the URL to this repository and credit \"The VERENA consortium\".\n\n⚠️ Discussions about intellectual property notwithstanding, `trefle` should most\nlikely not be merged into your own database. The associations are *predictions*,\nand we can estimate how many of them are false positives, and how many are\nmissing (but we do not know which are which). In addition, the probability score\nis not a biologically meaningful probability. 
Unless your database is able to\naccommodate these subtleties and convey them clearly to the user, we advise you\nagainst consuming `trefle` to re-distribute as part of another database.\n\nContact: `timothee.poisot@umontreal.🇨🇦`\n\n## Repository content\n\n- `hpc` contains all the code used to run the tuning and simulation using `slurm`\n    - `inputs` is the main location for the bash scripts and helper functions\n    - `outputs` is where the output files are located -- note that they are not written here by default, this was us doing some post-processing\n        - `tuning.csv` is the file for model selection (about 6MB)\n        - `predictions.csv` is the output of imputation (about 85MB)\n    - as a side-note, each thread is responsible for its own files (and works on its own copy of the data, so think about memory use)\n    - as an additional side-note, not all species pairs in `clover` are in `trefle`, because some proportion (\u003c1%) of runs fail for reasons that always mean that the association is [almost surely][almost_surely] not happening\n- `data` is storing all the data that are *not* directly generated by `trefle`\n- `model_performance` has the file for model selection *and* the figures generated as part of this process\n    - `roc` has all the plots of ROC-AUCs\n    - `metrics` has the plots of all metrics presented in the table above\n- `imputation` has the files to read the data from `hpc/outputs` and do the analyses\n- `artifacts` has derived data tables\n    - `modelselection.csv` is the list of all models considered during hyper-parameters tuning\n    - `imputed_associations.csv` is the list of all suspected positive associations (~ 6MB) - associations are ranked from least to most likely\n    - `zoonoses.csv` is the list of the subset of suspected positive associations involving *H. 
sapiens* - associations are ranked from least to most likely\n    - `trefle.csv` is the edgelist of `clover` plus the imputed associations, sorted by virus name (~ 3MB)\n    - `phylo_distance_to_human.csv` is the phylogenetic distance between *H. sapiens* and other taxa in the Upham tree\n    - `sharing-phylogeny.csv` is a table with the Jaccard similarity of viruses, number of shared viruses, and phylogenetic distance between pairs of hosts -- it contains values from both *before* and *after* the imputation step\n    - `viral_subspace.csv` contains truncated SVD embeddings of the left-subspace (viruses) at rank 12 multiplied by the square root of the eigenvalues, as in an RDPG.\n- `demo-phylogeny` contains a visualization of phylogenetic signal in the data and predictions as a use case vignette\n- `R` has `.r` files to read the phylogeny\n\n[almost_surely]: https://en.wikipedia.org/wiki/Almost_surely\n\n## Main results\n\nThis section will grow as we develop more analyses.\n\n### Imputation changes the network\n\nThe LF-SVD approach suggested 75901 new interactions, from the original 5494 in\n`clover`. With a total of 81395 interactions, `trefle` has a connectance of\n0.09, which is well within the range of connectances for antagonistic bipartite\nnetworks.\n\nThe following figure is the result of a 2-dimensional t-SNE embedding of `clover`\n(left) and `trefle` (right):\n\n![before-after](figures/before-after.png)\n\nNot only can we see an increase in the degree of most nodes, we can also see the\nshape of the network change, with fewer clusters of mostly homogeneous species.\n\n### Top 10 predicted *H. 
sapiens* viruses\n\n| Host         | Virus                       | Evidence |\n|--------------|-----------------------------|----------|\n| Homo sapiens | **Torque teno virus 2**     | 182.4210 |\n| Homo sapiens | **Torque teno virus 23**    | 187.3940 |\n| Homo sapiens | Panine betaherpesvirus 2    | 187.3940 |\n| Homo sapiens | **Torque teno virus 4**     | 187.3940 |\n| Homo sapiens | **Torque teno virus 14**    | 187.3940 |\n| Homo sapiens | Carnivore protoparvovirus 1 | 191.2557 |\n| Homo sapiens | Phocid alphaherpesvirus 1   | 191.4652 |\n| Homo sapiens | Panine gammaherpesvirus 1   | 201.9715 |\n| Homo sapiens | Simian mastadenovirus A     | 242.8597 |\n| Homo sapiens | Canine mastadenovirus A     | 275.6808 |\n\n### Zoonotic viruses have more paths to reach humans\n\nThis next figure is the evidence for (potentially novel) zoonotic viruses in\n`trefle`, compared to the number of paths existing from this virus to *H.\nsapiens* in `clover`. The log-log relationship is quite clear: viruses that are\nmore likely to be zoonotic according to our model have more direct paths (bridge\nhosts) to reach humans.\n\n![number of paths to human](figures/number_of_paths.png)\n\nThe same relationship holds for 2 jumps, 3 jumps, and 4 jumps.\n\n### Imputation removes the livestock bias\n\nThe original data that went into `clover` had a lot of information about\nlivestock viruses. In the following figure, we show the ten species most similar\n(using Additive Jaccard Similarity) to *H. sapiens* before and after imputation:\n\n![similarity to human](figures/human-similarity.png)\n\nStrikingly, if not unexpectedly, the hosts with viral associations most similar\nto humans after imputation are mostly primates (chimpanzees and both gorilla\nspecies). Some rodents are also joining the top 10. 
This result suggests that\nthe LF-SVD approach is able to somewhat overcome the initial data bias.\n\n### LF-SVD predicts associations between species not shared by databases\n\nIn the next figure, we look at the probability of association as a function of\nwhether the two species were reported as part of the same database that went\ninto making `clover`:\n\n![probability by co-occurrence](figures/probability-by-cooccurrence.png)\n\nThere is little to report here - the method is indeed able to predict\nassociations between species that were non-overlapping across data sources. Due\nto the effort that went into reconciling the taxonomic names in `clover`, the\nfinal amount of overlap is rather large anyway.\n\n### Predicted associations have a strong phylogenetic plausibility\n\nThe figure below shows pre- and post-imputation host sharing networks analyzed as a function of phylogenetic distance between hosts, pairwise across the entire network (top) and hostwise with humans (bottom), using either binary sharing of at least one virus (sharing) or total number of viruses shared (counts).\n\n![phylogenetic effect](demo-phylogeny/PhylogenyGAMs.png)\n\nThere are two main results:\n\n1. The missing links recommended by SVD have a strong phylogenetic signal even though the method is trait-agnostic, implying the signal in the network is strong enough to be propagated by latent factor approaches. (SVD is good)\n2. The less sparse the matrix becomes, the more we will need to move from thinking about sharing networks as binary networks to weighted ones, which is a bit of a change from the last 20 years of sharing work like the GMPD-based work (count data matters)\n\n### The impact of sampling bias on viral richness is reduced after imputation\n\nObserved host-parasite association networks are heavily influenced by sampling biases across hosts and parasites. In comparative analyses of the number of documented viral species per host species, research effort is often the strongest predictor. 
These models typically use the number of publications per host species as a measure of sampling effort, and find that well-researched hosts harbour a larger number of viruses. To explore whether network imputation via LF-SVD is extrapolating from previous sampling biases, we conducted a set of comparative analyses investigating how the explanatory power of sampling effort on viral species richness changes after network imputation. We find that sampling effort explains less of the variance in viral richness after imputation, suggesting that imputation via LF-SVD is not merely recapitulating the observed sampling effort per host.\n\n|Response               | Predictor             |Slope  | Std. Error | R Squared | Lambda | Lambda 95% CI |\n|-----------------------|-----------------------|-------|------------|-----------|--------|---------------|\n|Viral Richness (clover)| # pubs                | 0.53  | 0.02       | 0.46      | 0.59   | 0.47 - 0.69   |\n|Viral Richness (trefle)| # pubs                | 0.39  | 0.02       | 0.23      | 0.59   | 0.45 - 0.72   |\n|Viral Richness (clover)| # virus related pubs  | 0.71  | 0.02       | 0.54      | 0.45   | 0.31 - 0.58   |\n|Viral Richness (trefle)| # virus related pubs  | 0.47  | 0.03       | 0.22      | 0.60   | 0.46 - 0.71   |\n\n\n### The imputed network improves zoonotic ranking models\n\nCode for this section can be found in [viralemergence/haystack_zoonotic](https://github.com/viralemergence/haystack_zoonotic).\n\nKnowing the network of observed (non-human) hosts for each virus increases the probability that a randomly chosen *known* human-infecting virus is ranked above viruses that have not been detected in humans. 
Imputing missing links improves this even further.\n\n|Model                                 | AUC (mean)  | SD    | AUC (bagged) |\n|--------------------------------------|-------------|-------|--------------|\n|Genome composition                    | 0.723       | 0.053 | 0.755        |\n|Genome composition + Observed network | 0.830       | 0.043 | 0.848        |\n|Genome composition + Imputed network  | 0.875       | 0.036 | 0.898        |\n\nIn the combined genome composition + imputed network model, features describing the imputed network are more important.\n\n![zoonotic rank result](figures/human_models_main.png)\n\n\n### Spatial analysis of hotspots of viral diversity\n\n![LCBD](figures/lcbd-panel.png)\n\n**Analysis in development**: @tpoisot - comparison of pre- and post-imputation LCBD\n\n## Get involved\n\nIf you want to develop an analysis, **please open an issue** (and if you want to\nstart working, please make an explicitly named branch).\n\nIf you have to create new data files, please mind the current directory, and\nwhen in doubt, ask @tpoisot.\n\nIf you require a new data file to be created for you, ask @tpoisot.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fviralemergence%2Ftrefle","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fviralemergence%2Ftrefle","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fviralemergence%2Ftrefle/lists"}