https://github.com/bzz/ml-on-code
"Introduction to ML-on-Code" workshop materials 2018
https://github.com/bzz/ml-on-code
ml-on-code
Last synced: about 2 months ago
JSON representation
"Introduction to ML-on-Code" workshop materials 2018
- Host: GitHub
- URL: https://github.com/bzz/ml-on-code
- Owner: bzz
- Created: 2018-03-28T08:12:37.000Z (about 7 years ago)
- Default Branch: master
- Last Pushed: 2018-08-22T10:14:43.000Z (almost 7 years ago)
- Last Synced: 2025-04-17T20:39:45.185Z (about 2 months ago)
- Topics: ml-on-code
- Language: Jupyter Notebook
- Homepage:
- Size: 66.4 KB
- Stars: 10
- Watchers: 3
- Forks: 2
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Introduction to ML-on-Code Workshop
These are materials for a workshop on "Introduction to ML-on-Code" - a guided tour on source{d} open source technology stack for Machine Learning on Code.
Slides [on GDrive](https://docs.google.com/presentation/d/12NdxDQLrtwMu2J-k0HB86I7H-eRDI3N9XDUMvea2Ioc/edit?usp=sharing).
OSS tools covered:
- Public Github Archive: http://pga.sourced.tech/
- Siva: https://github.com/src-d/go-siva#command-line-interface
- source{d} Engine: https://github.com/src-d/engine/
- Project Babelfish: https://doc.bblf.sh/## Content
* [Prerequisites](#prerequisites)
* [Dependencies](#dependencies)
* [Workflow](#workflow)
* [1. Play with PublicGithubArchive CLI](#1-play-with-publicgithubarchive-cli)
* [2. Get used to Siva format](#2-get-used-to-siva-format)
* [3. Engine (basic queries)](#3-engine-basic-queries)
* [4. Project Babelfish](#4-project-babelfish)
* [5. Engine (advanced, UAST)](#5-engine-advanced-uast)
## Prerequisites
- Docker
- Go## Dependencies
Golang for CLI tools:
```
go get github.com/src-d/datasets/PublicGitArchive/pga
go get -u gopkg.in/src-d/go-siva.v1/...
# add "$GOPATH/bin" to "$PATH"
echo "export PATH=$PATH:$(go env GOPATH)/bin" >> ~/.bash_profile
source ~/.bash_profile
```Import Docker images (works offline):
```
docker load -i images/engine-jupyter-bblfsh.tgz
docker load -i images/bblfshd-with-drivers.tgzdocker images
```Run Bblfsh containers:
```
docker run -d --name bblfshd --privileged -p 9432:9432 bblfsh/bblfshd-with-driversdocker exec -it bblfshd bblfshctl driver list
# if above did not work for some reason, use
docker run -d --name bblfshd --privileged -p 9432:9432 bblfsh/bblfshd
docker exec -it bblfshd bblfshctl driver install --recommended
```Run Engine container \w Jupyter:
```
docker run --name engine-jupyter -it -p 8080:8080 -v $(pwd)/repositories:/repositories -v $(pwd)/notebooks:/home --link bblfshd:bblfshd srcd/engine-jupyter-bblfsh
```## Workflow
Workshop is structured as a sequence of steps, each introducing a layer of source{d} technology stack, from bottom up.
### 1. Play with PublicGithubArchive CLI
Public Github Playground is a reference dataset of full history of ~180k most popular (>50 stars) projects from Github.
710 GB of code in 3 TB of packfiles.
```sh
cp -r .pga/latest.csv.gz ~/
pga help# number of repos from Github
pga list -u github.com/github/ -f json | wc -l# number of repos from Github in Golang
pga list -u github.com/github/ --lang go -f json | wc -l# pretty-print src-d repos
pga list -u github.com/src-d/ -f json | jq -r . | less# URLs and languages for src-d repos \w more then 50 files
pga list -u github.com/src-d/ -f json | jq -r 'select(.fileCount > 50) | .url + " " + .langs[]' | less
```Materials:
- http://pga.sourced.tech/
- https://github.com/src-d/datasets/tree/master/PublicGitArchive/pga
- https://github.com/src-d/datasets/blob/master/PublicGitArchive/doc/dataset_analysis.md#description-of-the-current-dataset### 2. Get used to Siva format
[**S**eekable **I**ndexed **B**lock **A**rchiver](https://github.com/src-d/go-siva) file format.
Keeps all files + updates of a single Git repository in 1 file in FS.
```sh
find ./repositories/# list files in archive
siva list ./repositories/siva/latest/65/65c397a8673c0f4b98e3867e5fd6efdaa7d9ccd2.siva# extract single file
siva unpack -m=config ./repositories/siva/latest/65/65c397a8673c0f4b98e3867e5fd6efdaa7d9ccd2.siva .
less config# extract all files (bare Git repository)
siva unpack ./repositories/siva/latest/65/65c397a8673c0f4b98e3867e5fd6efdaa7d9ccd2.siva go-kallax/.git# list all Git objects
cd go-kallax
git verify-pack -v .git/objects/pack/pack-4a202ad08739b7236f57a3a283f45c27087a99f6.idx# get a single object
git cat-file -p 72e6129819d6a580512f131f0c8d34cf16ffe4e5
git cat-file -p 63d6012da17573aec5d61d8ba4bae4bf8eab257e
```Materials:
- https://github.com/src-d/go-siva#command-line-interface
- https://blog.sourced.tech/post/siva/
- https://git-scm.com/book/en/v2/Git-Internals-Packfiles### 3. Engine (basic queries)
[source{d} engine](https://github.com/src-d/engine/) is a library that allows to query Git repositories in parallele from a cluster of machines using Apache Spark.
To start Apache Spark session:
```sh
spark-shell --packages "tech.sourced:engine:0.5.5"
```Example of the query:
```scala
from sourced.engine import EngineEngine(spark, 'siva',
'/path/to/siva-files')
.repositories
.references
.head_ref
.files
.classify_languages()
.filter("lang = 'java'")
.select('path',
'repository_id')
.write
.parquet("hdfs://...")
```Open in browser your [Jupyter Notebook - Engine (basic)](http://localhost:8080/notebooks/Intro%20ML-on-Code%20-%20Python.ipynb#) from a running Docker container.
Materials:
- https://github.com/src-d/engine#playing-around-with-engine-on-jupyter
- https://github.com/src-d/engine/blob/master/_examples/pyspark/pyspark-shell-basic.md### 4. Project Babelfish
Project Babelfish provides a universal code parser - contenerized parser infrastructure, to extract uAST representation from the source code text.
Visit http://dashboard.bblf.sh/ to try experiment with uAST representation.
```xpath
(: function names :)
//*[@roleFunction and @roleDeclaration and @roleName and not(@roleArgument)]
(: python Docstrings :)
//*[@roleFunction and @roleDeclaration and @roleBody]/*/*[@roleLiteral]
(: identifiers :)
//*[@roleIdentifier and not(@roleIncomplete)]
```Materials:
- https://blog.sourced.tech/post/announcing_babelfish/
- https://doc.bblf.sh/
- https://doc.bblf.sh/using-babelfish/getting-started.html
- https://doc.bblf.sh/using-babelfish/uast-querying.html
- https://doc.bblf.sh/uast/roles.html#roles-list### 5. Engine (advanced, UAST)
Through Engine, it is possible to parse files to uASTs using Bblfsh and then query those with XPath.
Open in browser your [Jupyter Notebook - Engine (advanced)](http://localhost:8080/notebooks/Intro%20ML-on-Code%20-%20Python.ipynb#) from your running Docker container.
Materials:
- https://github.com/src-d/engine-tour#exploring-public-git-archive-with-sourced-engine
- https://github.com/src-d/engine/blob/master/_examples/pyspark/pyspark-shell-xpath-query.md
- https://github.com/src-d/engine/blob/master/examples/notebooks/Example.ipynb### 6. (TBD) ML: train a model
Use the data, saved from a previous step to train source code identifier embedding model with Tensorflow.
Materials:
- https://blog.sourced.tech/post/id2vec/