{"id":19020037,"url":"https://github.com/bzz/ml-on-code","last_synced_at":"2025-04-23T05:22:38.141Z","repository":{"id":74014445,"uuid":"127106872","full_name":"bzz/ml-on-code","owner":"bzz","description":"\"Introduction to ML-on-Code\" workshop materials 2018","archived":false,"fork":false,"pushed_at":"2018-08-22T10:14:43.000Z","size":68,"stargazers_count":10,"open_issues_count":0,"forks_count":2,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-04-17T20:39:45.185Z","etag":null,"topics":["ml-on-code"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bzz.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2018-03-28T08:12:37.000Z","updated_at":"2020-11-24T18:09:19.000Z","dependencies_parsed_at":null,"dependency_job_id":"44bd7af8-5372-4727-a10b-732b81fdc0d7","html_url":"https://github.com/bzz/ml-on-code","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bzz%2Fml-on-code","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bzz%2Fml-on-code/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bzz%2Fml-on-code/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bzz%2Fml-on-code/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bzz","download_url":"https://codeload.github.com/bzz/ml-on-code/tar.gz/refs/heads/master",
"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250374538,"owners_count":21419955,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ml-on-code"],"created_at":"2024-11-08T20:15:33.393Z","updated_at":"2025-04-23T05:22:38.132Z","avatar_url":"https://github.com/bzz.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Introduction to ML-on-Code Workshop\n\nThese are materials for a workshop on \"Introduction to ML-on-Code\" - a guided tour on source{d} open source technology stack for Machine Learning on Code.\n\nSlides [on GDrive](https://docs.google.com/presentation/d/12NdxDQLrtwMu2J-k0HB86I7H-eRDI3N9XDUMvea2Ioc/edit?usp=sharing).\n\n\nOSS tools covered:\n- Public Github Archive: http://pga.sourced.tech/\n- Siva: https://github.com/src-d/go-siva#command-line-interface\n- source{d} Engine: https://github.com/src-d/engine/\n- Project Babelfish: https://doc.bblf.sh/\n\n## Content\n\n  * [Prerequisites](#prerequisites)\n  * [Dependencies](#dependencies)\n  * [Workflow](#workflow)\n     * [1. Play with PublicGithubArchive CLI](#1-play-with-publicgithubarchive-cli)\n     * [2. Get used to Siva format](#2-get-used-to-siva-format)\n     * [3. Engine (basic queries)](#3-engine-basic-queries)\n     * [4. Project Babelfish](#4-project-babelfish)\n     * [5. 
Engine (advanced, UAST)](#5-engine-advanced-uast)\n    \n\n## Prerequisites\n - Docker \n - Go\n\n## Dependencies\n\nGolang for CLI tools: \n```\ngo get github.com/src-d/datasets/PublicGitArchive/pga\ngo get -u gopkg.in/src-d/go-siva.v1/...\n# add \"$GOPATH/bin\" to \"$PATH\"\necho \"export PATH=$PATH:$(go env GOPATH)/bin\" \u003e\u003e ~/.bash_profile\nsource ~/.bash_profile\n```\n\nImport Docker images (works offline):\n```\ndocker load -i images/engine-jupyter-bblfsh.tgz\ndocker load -i images/bblfshd-with-drivers.tgz\n\ndocker images\n```\n\nRun Bblfsh containers:\n```\ndocker run -d --name bblfshd --privileged -p 9432:9432 bblfsh/bblfshd-with-drivers\n\ndocker exec -it bblfshd bblfshctl driver list\n\n# if the above did not work for some reason, use\ndocker run -d --name bblfshd --privileged -p 9432:9432 bblfsh/bblfshd\ndocker exec -it bblfshd bblfshctl driver install --recommended\n```\n\nRun the Engine container with Jupyter:\n```\ndocker run --name engine-jupyter -it -p 8080:8080 -v $(pwd)/repositories:/repositories -v $(pwd)/notebooks:/home --link bblfshd:bblfshd srcd/engine-jupyter-bblfsh\n```\n\n## Workflow\n\nThe workshop is structured as a sequence of steps, each introducing a layer of the source{d} technology stack, from the bottom up.\n\n\u003cimg width=\"720\" alt=\"Workshop flow\" src=\"https://user-images.githubusercontent.com/5582506/38016881-03b62980-32a3-11e8-9926-2f3d56faf1b3.png\"\u003e\n\n### 1. Play with PublicGithubArchive CLI\n\nPublic Git Archive (PGA) is a reference dataset containing the full history of the ~180k most popular (\u003e50 stars) projects on GitHub.\n\n 710 GB of code in 3 TB of packfiles.\n\n```sh\ncp -r .pga/latest.csv.gz ~/\npga help\n\n# number of repos from Github\npga list -u github.com/github/ -f json | wc -l\n\n# number of repos from Github in Golang\npga list -u github.com/github/ --lang go -f json | wc -l\n\n# pretty-print src-d repos\npga list -u github.com/src-d/ -f json | jq -r . 
| less\n\n# URLs and languages for src-d repos with more than 50 files\npga list -u github.com/src-d/ -f json | jq -r 'select(.fileCount \u003e 50) | .url + \" \" + .langs[]' | less\n```\n\n\nMaterials:\n  - http://pga.sourced.tech/\n  - https://github.com/src-d/datasets/tree/master/PublicGitArchive/pga\n  - https://github.com/src-d/datasets/blob/master/PublicGitArchive/doc/dataset_analysis.md#description-of-the-current-dataset\n\n\n\n### 2. Get used to Siva format\n\n[**S**eekable **I**ndexed **B**lock **A**rchiver](https://github.com/src-d/go-siva) file format.\n\nKeeps all files and updates of a single Git repository in one file on the file system.\n\n```sh\nfind ./repositories/\n\n# list files in archive\nsiva list ./repositories/siva/latest/65/65c397a8673c0f4b98e3867e5fd6efdaa7d9ccd2.siva\n\n# extract single file\nsiva unpack -m=config ./repositories/siva/latest/65/65c397a8673c0f4b98e3867e5fd6efdaa7d9ccd2.siva .\nless config\n\n# extract all files (bare Git repository)\nsiva unpack ./repositories/siva/latest/65/65c397a8673c0f4b98e3867e5fd6efdaa7d9ccd2.siva go-kallax/.git\n\n# list all Git objects\ncd go-kallax\ngit verify-pack -v .git/objects/pack/pack-4a202ad08739b7236f57a3a283f45c27087a99f6.idx\n\n# get a single object\ngit cat-file -p 72e6129819d6a580512f131f0c8d34cf16ffe4e5\ngit cat-file -p 63d6012da17573aec5d61d8ba4bae4bf8eab257e\n```\n\nMaterials:\n  - https://github.com/src-d/go-siva#command-line-interface\n  - https://blog.sourced.tech/post/siva/\n  - https://git-scm.com/book/en/v2/Git-Internals-Packfiles\n\n\n### 3. 
Engine (basic queries)\n\n[source{d} engine](https://github.com/src-d/engine/) is a library that lets you query Git repositories in parallel across a cluster of machines using Apache Spark.\n\nTo start an Apache Spark session:\n```sh\nspark-shell --packages \"tech.sourced:engine:0.5.5\"\n```\n\nExample query:\n```python\nfrom sourced.engine import Engine\n\n# the outer parentheses let the method chain span several lines in Python\n(Engine(spark, 'siva',\n        '/path/to/siva-files')\n  .repositories\n  .references\n  .head_ref\n  .files\n  .classify_languages()\n  .filter(\"lang = 'java'\")\n  .select('path',\n          'repository_id')\n  .write\n  .parquet(\"hdfs://...\"))\n```\n\nOpen your [Jupyter Notebook - Engine (basic)](http://localhost:8080/notebooks/Intro%20ML-on-Code%20-%20Python.ipynb#) in a browser from the running Docker container.\n\n\nMaterials:\n  - https://github.com/src-d/engine#playing-around-with-engine-on-jupyter\n  - https://github.com/src-d/engine/blob/master/_examples/pyspark/pyspark-shell-basic.md\n\n\n\n### 4. Project Babelfish\n\n\u003cimg src=\"https://avatars2.githubusercontent.com/u/25795418?v=3\u0026s=200f\" align=\"right\" width=\"100px\" height=\"100px\" alt=\"Babelfish logo\" /\u003e\n\nProject Babelfish provides a universal code parser: a containerized parser infrastructure that extracts a uAST representation from source code text.\n\nVisit http://dashboard.bblf.sh/ to experiment with the uAST representation.\n\n```xpath\n(: function names :)\n//*[@roleFunction and @roleDeclaration and @roleName and not(@roleArgument)]\n    \n(: python Docstrings :)\n//*[@roleFunction and @roleDeclaration and @roleBody]/*/*[@roleLiteral]\n    \n(: identifiers :)\n//*[@roleIdentifier and not(@roleIncomplete)]\n```\n\nMaterials:\n  - https://blog.sourced.tech/post/announcing_babelfish/\n  - https://doc.bblf.sh/\n  - https://doc.bblf.sh/using-babelfish/getting-started.html\n  - https://doc.bblf.sh/using-babelfish/uast-querying.html\n  - https://doc.bblf.sh/uast/roles.html#roles-list\n\n\n### 5. 
Engine (advanced, UAST)\n\nThrough Engine, it is possible to parse files to uASTs using Bblfsh and then query those with XPath.\n\nOpen your [Jupyter Notebook - Engine (advanced)](http://localhost:8080/notebooks/Intro%20ML-on-Code%20-%20Python.ipynb#) in a browser from your running Docker container.\n\nMaterials:\n  - https://github.com/src-d/engine-tour#exploring-public-git-archive-with-sourced-engine\n  - https://github.com/src-d/engine/blob/master/_examples/pyspark/pyspark-shell-xpath-query.md\n  - https://github.com/src-d/engine/blob/master/examples/notebooks/Example.ipynb\n\n\n### 6. (TBD) ML: train a model\n\nUse the data saved in the previous step to train a source code identifier embedding model with TensorFlow.\n\nMaterials:\n  - https://blog.sourced.tech/post/id2vec/\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbzz%2Fml-on-code","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbzz%2Fml-on-code","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbzz%2Fml-on-code/lists"}