{"id":32978802,"url":"https://github.com/src-d/ml","last_synced_at":"2025-12-30T09:06:28.547Z","repository":{"id":41352016,"uuid":"94111019","full_name":"src-d/ml","owner":"src-d","description":"sourced.ml is a library and command line tools to build and apply machine learning models on top of Universal Abstract Syntax Trees","archived":true,"fork":false,"pushed_at":"2019-05-22T09:56:11.000Z","size":29668,"stargazers_count":141,"open_issues_count":26,"forks_count":44,"subscribers_count":13,"default_branch":"master","last_synced_at":"2025-12-15T14:59:20.286Z","etag":null,"topics":["ast","machine-learning","mloncode","word2vec"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/src-d.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"contributing.md","funding":null,"license":"license.md","code_of_conduct":"code_of_conduct.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-06-12T15:21:41.000Z","updated_at":"2025-08-08T19:57:38.000Z","dependencies_parsed_at":"2022-09-07T23:30:57.152Z","dependency_job_id":null,"html_url":"https://github.com/src-d/ml","commit_stats":null,"previous_names":["src-d/ast2vec"],"tags_count":33,"template":false,"template_full_name":null,"purl":"pkg:github/src-d/ml","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/src-d%2Fml","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/src-d%2Fml/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/src-d%2Fml/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/src-d%2Fml/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/src-d","download_url":"https://codeload.github.com/src-d
/ml/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/src-d%2Fml/sbom","scorecard":{"id":843511,"data":{"date":"2025-08-11","repo":{"name":"github.com/src-d/ml","commit":"db23fb14fc93dece05b434342def5f77b01c6cc3"},"scorecard":{"version":"v5.2.1-40-gf6ed084d","commit":"f6ed084d17c9236477efd66e5b258b9d4cc7b389"},"score":2.3,"checks":[{"name":"Packaging","score":-1,"reason":"packaging workflow not detected","details":["Warn: no GitHub/GitLab publishing workflow detected."],"documentation":{"short":"Determines if the project is published as a package that others can easily download, install, easily update, and uninstall.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#packaging"}},{"name":"Maintained","score":0,"reason":"project is archived","details":["Warn: Repository is archived."],"documentation":{"short":"Determines if the project is \"actively maintained\".","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#maintained"}},{"name":"Token-Permissions","score":-1,"reason":"No tokens found","details":null,"documentation":{"short":"Determines if the project's workflows follow the principle of least privilege.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#token-permissions"}},{"name":"Code-Review","score":6,"reason":"Found 5/8 approved changesets -- score normalized to 6","details":null,"documentation":{"short":"Determines if the project requires human code review before pull requests (aka merge requests) are merged.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#code-review"}},{"name":"Dangerous-Workflow","score":-1,"reason":"no workflows found","details":null,"documentation":{"short":"Determines if the project's GitHub Action workflows avoid dangerous 
patterns.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#dangerous-workflow"}},{"name":"Binary-Artifacts","score":10,"reason":"no binaries found in the repo","details":null,"documentation":{"short":"Determines if the project has generated executable (binary) artifacts in the source repository.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#binary-artifacts"}},{"name":"CII-Best-Practices","score":0,"reason":"no effort to earn an OpenSSF best practices badge detected","details":null,"documentation":{"short":"Determines if the project has an OpenSSF (formerly CII) Best Practices Badge.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#cii-best-practices"}},{"name":"Security-Policy","score":0,"reason":"security policy file not detected","details":["Warn: no security policy file detected","Warn: no security file to analyze","Warn: no security file to analyze","Warn: no security file to analyze"],"documentation":{"short":"Determines if the project has published a security policy.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#security-policy"}},{"name":"Fuzzing","score":0,"reason":"project is not fuzzed","details":["Warn: no fuzzer integrations found"],"documentation":{"short":"Determines if the project uses fuzzing.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#fuzzing"}},{"name":"License","score":9,"reason":"license file detected","details":["Info: project has a license file: license.md:0","Warn: project license file does not contain an FSF or OSI license."],"documentation":{"short":"Determines if the project has defined a license.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#license"}},{"name":"Signed-Releases","score":-1,"reason":"no 
releases found","details":null,"documentation":{"short":"Determines if the project cryptographically signs release artifacts.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#signed-releases"}},{"name":"Branch-Protection","score":0,"reason":"branch protection not enabled on development/release branches","details":["Warn: branch protection not enabled for branch 'master'"],"documentation":{"short":"Determines if the default and release branches are protected with GitHub's branch protection settings.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#branch-protection"}},{"name":"Pinned-Dependencies","score":0,"reason":"dependency not pinned by hash detected -- score normalized to 0","details":["Warn: containerImage not pinned by hash: Dockerfile:1: pin your Docker image by updating srcd/ml-core to srcd/ml-core@sha256:35e2462d70d84017a516037e2902e8f870e5452bade3f8c29c2ebc8b3bd6ecf8","Warn: containerImage not pinned by hash: Dockerfile.core:1: pin your Docker image by updating srcd/spark:2.2.1 to srcd/spark:2.2.1@sha256:f2ac7acabfb8ee0c666ba59efdc617d7dcdf6e514ca6885504aaf4c9809683b9","Warn: pipCommand not pinned by hash: Dockerfile:6","Warn: pipCommand not pinned by hash: Dockerfile.core:5-23","Info:   0 out of   2 containerImage dependencies pinned","Info:   0 out of   2 pipCommand dependencies pinned"],"documentation":{"short":"Determines if the project has declared and pinned the dependencies of its build process.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#pinned-dependencies"}},{"name":"Vulnerabilities","score":0,"reason":"13 existing vulnerabilities detected","details":["Warn: Project is vulnerable to: GHSA-cjgq-5qmw-rcj6","Warn: Project is vulnerable to: GHSA-x4wf-678h-2pmq","Warn: Project is vulnerable to: PYSEC-2021-856 / GHSA-5545-2q6w-2gh6","Warn: Project is vulnerable to: 
GHSA-6p56-wp2h-9hxr","Warn: Project is vulnerable to: PYSEC-2021-857 / GHSA-f7c7-j99h-c22f","Warn: Project is vulnerable to: GHSA-fpfv-jqm9-f5jm","Warn: Project is vulnerable to: PYSEC-2021-140 / GHSA-9w8r-397f-prfh","Warn: Project is vulnerable to: PYSEC-2023-117 / GHSA-mrwq-x4v8-fh7p","Warn: Project is vulnerable to: PYSEC-2021-141 / GHSA-pq64-v7f5-gqh8","Warn: Project is vulnerable to: PYSEC-2020-107 / GHSA-jjw5-xxj6-pcv5","Warn: Project is vulnerable to: PYSEC-2024-110 / GHSA-jw8x-6495-233v","Warn: Project is vulnerable to: PYSEC-2020-108","Warn: Project is vulnerable to: GHSA-g7vv-2v7x-gj9p"],"documentation":{"short":"Determines if the project has open, known unfixed vulnerabilities.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#vulnerabilities"}},{"name":"SAST","score":0,"reason":"SAST tool is not run on all commits -- score normalized to 0","details":["Warn: 0 commits out of 30 are checked with a SAST tool"],"documentation":{"short":"Determines if the project uses static code 
analysis.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#sast"}}]},"last_synced_at":"2025-08-23T20:58:03.020Z","repository_id":41352016,"created_at":"2025-08-23T20:58:03.020Z","updated_at":"2025-08-23T20:58:03.020Z"},"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28080908,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-12-27T02:00:05.897Z","response_time":58,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ast","machine-learning","mloncode","word2vec"],"created_at":"2025-11-13T06:00:38.078Z","updated_at":"2025-12-30T09:06:27.928Z","avatar_url":"https://github.com/src-d.png","language":"Python","readme":"# MLonCode research playground [![PyPI](https://img.shields.io/pypi/v/sourced-ml.svg)](https://pypi.python.org/pypi/sourced-ml) [![Build Status](https://travis-ci.org/src-d/ml.svg)](https://travis-ci.org/src-d/ml) [![Docker Build Status](https://img.shields.io/docker/build/srcd/ml.svg)](https://hub.docker.com/r/srcd/ml) [![codecov](https://codecov.io/github/src-d/ml/coverage.svg)](https://codecov.io/gh/src-d/ml)\n\n**This project is no longer maintained, it has evolved into several others:**\n\n* [ml-core](https://github.com/src-d/ml-core) - the bits which are independent of mining tools.\n* [ml-mining](https://github.com/src-d/ml-mining) - general purpose mining environment, 
currently based on the deprecated [jgit-spark-connector](https://github.com/src-d/jgit-spark-connector).\n\n**Below goes the original README.**\n\nThis project is the foundation for [MLonCode](https://github.com/src-d/awesome-machine-learning-on-source-code) research and development. It abstracts feature extraction and model training, so that you can focus on higher-level tasks.\n\nCurrently, the following models are implemented:\n\n* BOW - weighted bag of x, where x is any of many different extracted feature types.\n* id2vec, source code identifier embeddings.\n* docfreq, feature document frequencies \\(part of TF-IDF\\).\n* topic modeling over source code identifiers.\n\nIt is written in Python3 and has been tested on Linux and macOS. source{d} ml is tightly coupled with [source{d} engine](https://engine.sourced.tech) and delegates all the feature extraction parallelization to it.\n\nHere is the list of proof-of-concept projects which are built using sourced.ml:\n\n* [vecino](https://github.com/src-d/vecino) - finding similar repositories.\n* [tmsc](https://github.com/src-d/tmsc) - listing topics of a repository.\n* [snippet-ranger](https://github.com/src-d/snippet-ranger) - topic modeling of source code snippets.\n* [apollo](https://github.com/src-d/apollo) - source code deduplication at scale.\n\n## Installation\n\nWhether you wish to include Spark in your installation or would rather use an existing\ninstallation, to use `sourced-ml` you will need to have some native libraries installed,\ne.g. on Ubuntu you must first run: `apt install libxml2-dev libsnappy-dev`. [Tensorflow](https://tensorflow.org)\nis also a requirement - we support both the CPU and GPU versions. 
\nIn order to select which version you want, modify the package name in the next section\nto either `sourced-ml[tf]` or `sourced-ml[tf-gpu]` depending on your choice.\n**If you don't, neither version will be installed.**\n\n### With Apache Spark included\n\n```text\npip3 install sourced-ml\n```\n\n### Use existing Apache Spark\n\nIf you already have Apache Spark installed and configured in your environment at `$SPARK_HOME`, you can re-use it and avoid downloading about 200 MB through [pip \"editable installs\"](https://pip.pypa.io/en/stable/reference/pip_install/#editable-installs) by running\n\n```text\npip3 install -e \"$SPARK_HOME/python\"\npip3 install sourced-ml\n```\n\nIn both cases, you will need to have some native libraries installed. E.g., \non Ubuntu `apt install libxml2-dev libsnappy-dev`. Some parts require [Tensorflow](https://tensorflow.org).\n\n## Usage\n\nThis project exposes two interfaces: API and command line. The command line is\n\n```text\nsrcml --help\n```\n\n## Docker image\n\n```text\ndocker run -it --rm srcd/ml --help\n```\n\nIf this first command fails with\n\n```text\nCannot connect to the Docker daemon. Is the docker daemon running on this host?\n```\n\nand you are sure that the daemon is running, then you need to add your user to the `docker` group: refer to the [documentation](https://docs.docker.com/engine/installation/linux/linux-postinstall/#manage-docker-as-a-non-root-user).\n\n## Contributions\n\n...are welcome! See [CONTRIBUTING](contributing.md) and [CODE\\_OF\\_CONDUCT.md](code_of_conduct.md).\n\n## License\n\n[Apache 2.0](license.md)\n\n## Algorithms\n\n#### Identifier embeddings\n\nWe build the source code identifier co-occurrence matrix for every repository.\n\n1. Read Git repositories.\n2. Classify files using [enry](https://github.com/src-d/enry).\n3. Extract [UAST](https://doc.bblf.sh/uast/specification.html) from each supported file.\n4. 
[Split and stem](https://github.com/src-d/ml/tree/d1f13d079f57caa6338bb7eb8acb9062e011eda9/sourced/ml/algorithms/token_parser.py) all the identifiers in each tree.\n5. [Traverse UAST](https://github.com/src-d/ml/tree/d1f13d079f57caa6338bb7eb8acb9062e011eda9/sourced/ml/transformers/coocc.py), collapse all non-identifier paths and record all identifiers on the same level as co-occurring. In addition, connect them with their immediate parents.\n6. Write the global co-occurrence matrix.\n7. Train the embeddings using [Swivel](https://github.com/src-d/ml/tree/d1f13d079f57caa6338bb7eb8acb9062e011eda9/sourced/ml/algorithms/swivel.py) \\(requires Tensorflow\\). Interactively view the intermediate results in Tensorboard using `--logs`.\n8. Write the identifier embeddings model.\n\nSteps 1-5 are performed with the `repos2coocc` command, 6 with `id2vec_preproc`, 7 with `id2vec_train`, 8 with `id2vec_postproc`.\n\n#### Weighted Bag of X\n\nWe represent every repository as a weighted bag-of-vectors, provided that we have document frequencies \\(\"docfreq\"\\) and identifier embeddings \\(\"id2vec\"\\).\n\n1. Clone or read the repository from disk.\n2. Classify files using [enry](https://github.com/src-d/enry).\n3. Extract [UAST](https://doc.bblf.sh/uast/specification.html) from each supported file.\n4. Extract various features from each tree, e.g. identifiers, literals or node2vec-like structural fingerprints.\n5. Group by repository, file or function.\n6. Set the weight of each such feature according to TF-IDF.\n7. 
Write the BOW model.\n\nSteps 1-7 are performed with the `repos2bow` command.\n\n#### Topic modeling\n\nSee [here](doc/topic_modeling.md).\n\n## Glossary\n\nSee [here](GLOSSARY.md).\n","funding_links":[],"categories":["Software"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsrc-d%2Fml","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsrc-d%2Fml","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsrc-d%2Fml/lists"}