{"id":27881813,"url":"https://github.com/src-d/models","last_synced_at":"2026-02-17T16:05:10.654Z","repository":{"id":79454032,"uuid":"107106225","full_name":"src-d/models","owner":"src-d","description":"Machine learning models for MLonCode trained using the source{d} stack","archived":false,"fork":false,"pushed_at":"2019-10-30T11:07:16.000Z","size":83,"stargazers_count":19,"open_issues_count":5,"forks_count":11,"subscribers_count":8,"default_branch":"master","last_synced_at":"2025-10-11T18:19:56.602Z","etag":null,"topics":["machine-learning","mlosc","model","nlp","source-code"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/src-d.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2017-10-16T09:25:20.000Z","updated_at":"2022-12-29T19:07:48.000Z","dependencies_parsed_at":null,"dependency_job_id":"c214f52a-c355-41ef-a6ee-be483cf437f7","html_url":"https://github.com/src-d/models","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/src-d/models","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/src-d%2Fmodels","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/src-d%2Fmodels/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/src-d%2Fmodels/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/src-d%2Fmodels/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/src-d","download_url":
"https://codeload.github.com/src-d/models/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/src-d%2Fmodels/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29549250,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-17T14:33:00.708Z","status":"ssl_error","status_checked_at":"2026-02-17T14:32:58.657Z","response_time":100,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["machine-learning","mlosc","model","nlp","source-code"],"created_at":"2025-05-05T05:05:23.957Z","updated_at":"2026-02-17T16:05:10.624Z","avatar_url":"https://github.com/src-d.png","language":null,"readme":"source{d} MLonCode models\n=========================\n\n## bot-detection\nModel that distinguishes bots from humans among developer identities.\n\nExample:\n\n```python\nfrom sklearn.preprocessing import LabelEncoder\nfrom sourced.ml.models import BotDetection\nfrom xgboost import XGBClassifier\n\nbot_detection = BotDetection().load(bot-detection)\nxgb_cls = XGBClassifier()\nxgb_cls._Booster = bot_detection.booster\nxgb_cls._le = LabelEncoder().fit([False, True])\nprint('model configuration: ', xgb_cls)\nprint('BPE model vocabulary size: ', len(bot_detection.bpe_model.vocab()))\n```\n\n1 model:\n\n* \u003cdefault\u003e 
[94806d1f-1995-4c72-89c9-07681fa9d97d](/bot-detection/94806d1f-1995-4c72-89c9-07681fa9d97d.md)\n\n## bow\nWeighted bag-of-words, that is, every bag is a feature extracted from source code and associated with a weight obtained by applying TFIDF.\n\nExample:\n\n```python\nfrom sourced.ml.models import BOW\nbow = BOW().load(bow)\nprint(\"Number of documents:\", len(bow))\nprint(\"Number of tokens:\", len(bow.tokens))\n```\n\n4 models:\n\n*  [1e0deee4-7dc1-400f-acb6-74c0f4aec471](/bow/1e0deee4-7dc1-400f-acb6-74c0f4aec471.md)\n* \u003cdefault\u003e [1e3da42a-28b6-4b33-94a2-a5671f4102f4](/bow/1e3da42a-28b6-4b33-94a2-a5671f4102f4.md)\n*  [694c20a0-9b96-4444-80ae-f2fa5bd1395b](/bow/694c20a0-9b96-4444-80ae-f2fa5bd1395b.md)\n*  [da8c5dee-b285-4d55-8913-a5209f716564](/bow/da8c5dee-b285-4d55-8913-a5209f716564.md)\n\n## docfreq\nDocument frequencies of features extracted from source code, that is, how many documents (repositories, files or functions) contain each tokenized feature.\n\nExample:\n\n```python\nfrom sourced.ml.models import DocumentFrequencies\ndf = DocumentFrequencies().load(docfreq)\nprint(\"Number of tokens:\", len(df))\n```\n\n2 models:\n\n*  [55215392-36fc-43e5-b277-500f5b68d0c6](/docfreq/55215392-36fc-43e5-b277-500f5b68d0c6.md)\n* \u003cdefault\u003e [f64bacd4-67fb-4c64-8382-399a8e7db52a](/docfreq/f64bacd4-67fb-4c64-8382-399a8e7db52a.md)\n\n## id2vec\nSource code identifier embeddings, that is, every identifier is represented by a dense vector.\n\nExample:\n\n```python\nfrom sourced.ml.models import Id2Vec\nid2vec = Id2Vec().load(id2vec)\nprint(\"Number of tokens:\", len(id2vec))\n```\n\n2 models:\n\n*  [3467e9ca-ec11-444a-ba27-9fa55f5ee6c1](/id2vec/3467e9ca-ec11-444a-ba27-9fa55f5ee6c1.md)\n* \u003cdefault\u003e [92609e70-f79c-46b5-8419-55726e873cfc](/id2vec/92609e70-f79c-46b5-8419-55726e873cfc.md)\n\n## id_splitter_bilstm\nModel that contains source code identifier splitter BiLSTM weights.\n\nExample:\n\n```python\nfrom sourced.ml.models.id_splitter import 
IdentifierSplitterBiLSTM\nid_splitter = IdentifierSplitterBiLSTM().load(id_splitter_bilstm)\nid_splitter.split(identifiers)\n```\n\n1 model:\n\n* \u003cdefault\u003e [522bdd11-d1fa-49dd-9e51-87c529283418](/id_splitter_bilstm/522bdd11-d1fa-49dd-9e51-87c529283418.md)\n\n## topics\nTopic modeling of Git repositories. All tokens are identifiers extracted from repositories and seen as indicators for topics. They are used to infer the topic(s) of repositories.\n\nExample:\n\n```python\nfrom sourced.ml.models import Topics\ntopics = Topics().load(topics)\nprint(\"Number of topics:\", len(topics))\nprint(\"Number of tokens:\", len(topics.tokens))\n```\n\n1 model:\n\n* \u003cdefault\u003e [c70a7514-9257-4b33-b468-27a8588d4dfa](/topics/c70a7514-9257-4b33-b468-27a8588d4dfa.md)\n\n## typos_correction\nModel that suggests fixes to correct typos.\n\nExample:\n\n```python\nfrom lookout.style.typos.corrector import TyposCorrector\ncorrector = TyposCorrector().load(typos_correction)\nprint(\"Corrector configuration:\\n\", corrector.dump())\n```\n\n3 models:\n\n*  [16577a2c-7f17-4a6f-a759-92f3a00cf339](/typos_correction/16577a2c-7f17-4a6f-a759-92f3a00cf339.md)\n*  [245fae3a-2f87-4990-ab9a-c463393cfe51](/typos_correction/245fae3a-2f87-4990-ab9a-c463393cfe51.md)\n* \u003cdefault\u003e [9b82399a-1a4d-48c2-9e53-c4f0be631a45](/typos_correction/9b82399a-1a4d-48c2-9e53-c4f0be631a45.md)\n","funding_links":[],"categories":["Software"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsrc-d%2Fmodels","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsrc-d%2Fmodels","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsrc-d%2Fmodels/lists"}