{"id":19839858,"url":"https://github.com/qdata/fastsk","last_synced_at":"2025-05-01T19:30:31.890Z","repository":{"id":52164295,"uuid":"130412607","full_name":"QData/FastSK","owner":"QData","description":"Bioinformatics 2020: FastSK: Fast and Accurate Sequence Classification by making gkm-svm faster and scalable. https://fastsk.readthedocs.io/en/master/","archived":false,"fork":false,"pushed_at":"2022-12-08T09:37:45.000Z","size":119145,"stargazers_count":21,"open_issues_count":3,"forks_count":9,"subscribers_count":8,"default_branch":"master","last_synced_at":"2025-04-17T01:58:55.542Z","etag":null,"topics":["cpp","gkm-svm","python-library","sequence-classification","string-classification","string-kernel"],"latest_commit_sha":null,"homepage":"https://fastsk.readthedocs.io/en/master/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/QData.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-04-20T20:50:03.000Z","updated_at":"2023-03-14T15:16:11.000Z","dependencies_parsed_at":"2023-01-25T06:00:07.631Z","dependency_job_id":null,"html_url":"https://github.com/QData/FastSK","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/QData%2FFastSK","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/QData%2FFastSK/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/QData%2FFastSK/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/QData%2FFastSK/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitH
ub/owners/QData","download_url":"https://codeload.github.com/QData/FastSK/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":251932525,"owners_count":21667159,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cpp","gkm-svm","python-library","sequence-classification","string-classification","string-kernel"],"created_at":"2024-11-12T12:24:36.684Z","updated_at":"2025-05-01T19:30:26.874Z","avatar_url":"https://github.com/QData.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# FastSK: fast sequence analysis with gapped string kernels (Fast-GKM-SVM)\n\nThis GitHub repo provides improved algorithms for gkm-svm string kernel calculations. We provide a C++ implementation of the algorithm and a Python wrapper (distributed as a Python package) for the C++ implementation. Our package provides fast and accurate gkm-svm based training of SVM classifiers and regressors for gkm string kernel based sequence analysis. 
\n\nThis repo is built with a novel and fast algorithm design for the gapped k-mer algorithm, together with [pybind11](https://github.com/pybind/pybind11) and [LIBSVM](https://github.com/cjlin1/libsvm).\n\n#### More details of the algorithms and results are in: [Bioinformatics 2020](https://academic.oup.com/bioinformatics/article/36/Supplement_2/i857/6055916)\n\n## Prerequisites\n\n* Python 3.6+\n* setuptools version 42 or greater (run `pip install --upgrade setuptools`)\n* `pybind11` (run `pip install pybind11`)\n\n\n## Installation via Pip Install (Linux and MacOS)\n\n#### Way 1: from PyPI\n\n```bash\npip install fastsk\n```\n\n#### Way 2: Clone this repository and run:\n\n```bash\ngit clone https://github.com/QData/FastSK.git\ncd FastSK\npip install -r requirements.txt\npip install .\n```\n\n#### The pip installation of FastSK has been tested successfully on CentOS, Red Hat, and MacOS.\n\n## Python Version Tutorial\n\n### Example Jupyter notebook  \n- 'docs/2demo/fastDemo.ipynb'\n\n### You can check whether the fastsk library is installed correctly in a Python shell:\n\n```\nfrom fastsk import FastSK\n\n## Create a FastSK kernel object\nfastsk = FastSK(g=10, m=6, t=1, approx=True)\n```\n\n\n### Example Python usage script (assuming you have cloned FastSK.git):\n```\ncd test\npython run_check.py \n```\n\n\n### Experimental Results, Baselines, Utility Codes and Setup\n\n- We have provided all datasets we used in the subfolder \"data\"\n- We have provided all scripts we used to generate results under the subfolder \"results\"\n\n#### Grid Search for FastSK and gkm-svm baseline\nTo run a grid search over the hyperparameter space (g, m, and C) for the optimal parameters, use, e.g., this utility script:\n```\ncd results/\npython run_gridsearch.py\n```\n\n#### When comparing with Deep Learning baselines\n+ You need to have PyTorch installed\n```\npip install torch torchvision\n```\n+ One utility script: runs on all datasets with hyperparameter tuning of CharCNN, each configuration with 5 random-seed 
repeats:\n```\ncd results/neural_nets\npython run_cnn_hyperTrTune.py \n```\n+ We provide many other utility scripts to help users run CNN and RNN baselines\n\n#### Some of our experimental results comparing FastSK with baselines with respect to performance and speed\n\n\n\u003cimg src=\"results/spreadsheets/Figure5.png\" width=\"800\"\u003e\n\n\u003cimg src=\"results/spreadsheets/Table1.png\" width=\"800\"\u003e\n\n\u003cimg src=\"results/spreadsheets/Table2.png\" width=\"800\"\u003e\n\n\n#### Some of our experimental results comparing FastSK with character-based Convolutional Neural Nets (CharCNN) when varying the training set size. \n\n\u003cimg src=\"results/neural_nets/trainsize_varyresults/dna.png\" width=\"800\"\u003e\n\n\u003cimg src=\"results/neural_nets/trainsize_varyresults/protein.png\" width=\"800\"\u003e\n\n\u003cimg src=\"results/neural_nets/trainsize_varyresults/nlp.png\" width=\"800\"\u003e\n\n\n#### To Do: \n* a detailed user document, with example input files, output files, code, and perhaps a user group where people can post their questions\n\n\n### Citations\n\nIf you find this tool useful, please cite us!\n\n```\n@article{fast-gkm-svm,\n    author = {Blakely, Derrick and Collins, Eamon and Singh, Ritambhara and Norton, Andrew and Lanchantin, Jack and Qi, Yanjun},\n    title = \"{FastSK: fast sequence analysis with gapped string kernels}\",\n    journal = {Bioinformatics},\n    volume = {36},\n    number = {Supplement_2},\n    pages = {i857-i865},\n    year = {2020},\n    month = {12},\n    abstract = \"{Gapped k-mer kernels with support vector machines (gkm-SVMs) have achieved strong predictive performance on regulatory DNA sequences on modestly sized training sets. However, existing gkm-SVM algorithms suffer from slow kernel computation time, as they depend exponentially on the sub-sequence feature length, number of mismatch positions, and the task’s alphabet size. In this work, we introduce a fast and scalable algorithm for calculating gapped k-mer string kernels. 
Our method, named FastSK, uses a simplified kernel formulation that decomposes the kernel calculation into a set of independent counting operations over the possible mismatch positions. This simplified decomposition allows us to devise a fast Monte Carlo approximation that rapidly converges. FastSK can scale to much greater feature lengths, allows us to consider more mismatches, and is performant on a variety of sequence analysis tasks. On multiple DNA transcription factor binding site prediction datasets, FastSK consistently matches or outperforms the state-of-the-art gkmSVM-2.0 algorithms in area under the ROC curve, while achieving average speedups in kernel computation of ∼100× and speedups of ∼800× for large feature lengths. We further show that FastSK outperforms character-level recurrent and convolutional neural networks while achieving low variance. We then extend FastSK to 7 English-language medical named entity recognition datasets and 10 protein remote homology detection datasets. FastSK consistently matches or outperforms these baselines. Our algorithm is available as a Python package and as C++ source code at https://github.com/QData/FastSK. Supplementary data are available at Bioinformatics online.}\",\n    issn = {1367-4803},\n    doi = {10.1093/bioinformatics/btaa817},\n    url = {https://doi.org/10.1093/bioinformatics/btaa817},\n    eprint = {https://academic.oup.com/bioinformatics/article-pdf/36/Supplement\\_2/i857/35337038/btaa817.pdf},\n}\n```\n\n### Legacy: If you prefer using the executable built from the pure C++ source code (without the Python wrapper or R wrapper)\n\n- you can clone this repository:\n```\ngit clone --recursive https://github.com/QData/FastSK.git\n```\nthen run\n```\ncd FastSK\nmake\n```\nA `fastsk` executable will be installed to the `bin` directory, which you can use for kernel computation and inference. 
For example:\n```\n./bin/fastsk -g 10 -m 6 -C 1 -t 1 -a data/EP300.train.fasta data/EP300.test.fasta\n```\nThis will run the approximate kernel algorithm on the EP300 TFBS dataset using a feature length of `g = 10` with up to `m = 6` mismatches. It will then train and evaluate an SVM classifier with the SVM parameter `C = 1`.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fqdata%2Ffastsk","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fqdata%2Ffastsk","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fqdata%2Ffastsk/lists"}