https://github.com/vmarkovtsev/codeneuron
Recurrent neural network to split code snippets from text.
- Host: GitHub
- URL: https://github.com/vmarkovtsev/codeneuron
- Owner: vmarkovtsev
- License: MIT
- Created: 2018-03-03T09:27:10.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2018-12-10T18:11:24.000Z (almost 7 years ago)
- Last Synced: 2025-04-14T17:12:44.578Z (7 months ago)
- Language: Python
- Size: 32.7 MB
- Stars: 12
- Watchers: 2
- Forks: 3
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-machine-learning-on-source-code - Code Neuron - Recurrent neural network to detect code blocks in natural language text. (Software)
README
Code Neuron
===========
Recurrent neural network to detect code blocks. Runs on TensorFlow. It is trained in two stages.
The first stage pre-trains the character-level RNN with two branches, "before" and "after":

```
my code : FooBar
------> x <------
```
We assign the recurrent branches to different GPUs to train faster.
With 512 LSTM neurons, the network reaches 89% validation accuracy over the 200 most frequent character classes.
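The pre-training samples can be sketched like this - a minimal illustration (not the repository's actual data pipeline) of how each position pairs a left-to-right "before" context and a right-to-left "after" context with the character the network must predict:

```python
def make_pretrain_samples(text, window=8):
    """Build (before, after, target) triples for the two-branch RNN.

    For each position i, the "before" branch reads the preceding
    characters left to right, the "after" branch reads the following
    characters right to left, and the network predicts the character
    standing between them.
    """
    samples = []
    for i in range(len(text)):
        before = text[max(0, i - window):i]       # left context
        after = text[i + 1:i + 1 + window][::-1]  # right context, reversed
        samples.append((before, after, text[i]))
    return samples

samples = make_pretrain_samples("my code: FooBar")
# samples[9] pairs "y code: " and "raBoo" with the target "F"
```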

The second stage trains the same network, but with a different dense layer which predicts
only 3 classes: code block begins, code block ends, and no-op.
The prediction scheme changes: we now look at adjacent characters and decide whether there is
a code boundary between them.

This stage is much faster to train, and it reaches **~99.2% validation accuracy**.
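The stage-two labeling can be illustrated as follows - a hypothetical sketch that assigns one of the three classes to each gap between adjacent characters. The `\x03` end marker echoes the one visible in the dataset pipeline below; the `\x02` begin marker is an assumption for illustration:

```python
# One class per gap between adjacent characters.
NOOP, CODE_BEGINS, CODE_ENDS = 0, 1, 2

def label_gaps(text, begin_mark="\x02", end_mark="\x03"):
    """Strip code-span markers from `text` and label every gap between
    the remaining adjacent characters with a boundary class."""
    plain = []
    labels = {}  # gap i sits between plain[i-1] and plain[i]
    for ch in text:
        if ch == begin_mark:
            labels[len(plain)] = CODE_BEGINS
        elif ch == end_mark:
            labels[len(plain)] = CODE_ENDS
        else:
            plain.append(ch)
    gaps = [labels.get(i, NOOP) for i in range(1, len(plain))]
    return "".join(plain), gaps

plain, gaps = label_gaps("run \x02ls -l\x03 now")
# plain is "run ls -l now"; the gaps before "ls" and after "-l"
# are labeled CODE_BEGINS and CODE_ENDS, everything else NOOP
```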
Training set
------------
[StackSample questions and answers](https://www.kaggle.com/stackoverflow/stacksample), processed with
```
unzip -p Answers(Questions).csv.zip | ./dataset | sed -r -e '/^$/d' -e '/\x03/ {N; s/\x03\s*\n/\x03/g}' | gzip >> Dataset.txt.gz
```
Baked model
-----------
[model_LSTM_600_0.9924.pb](model_LSTM_600_0.9924.pb) - reaches 99.2% accuracy on validation. The model
is in TensorFlow "GraphDef" protobuf format.
Pretraining was performed with 20% validation on the first 8,000,000 bytes of the uncompressed questions.
Training was performed with 20% validation and 90% negative samples on the first 256,000,000 bytes of
the uncompressed questions.
In other words, I did not want to wait a week for it to train on the whole dataset - you are encouraged
to experiment.
Try to run it:
```
cat sample.txt | python3 run_model.py -m model_LSTM_600_0.9924.pb
```
You should see:
```
Here is my Python code, it is awesome and easy to read:
def main():
print("Hello, world!")
Please say what you think about it. Mad skills. Here is another one,
func main() {
println("Hello, world!")
}
As you see, I know Go too. Some more text to provide enough context.
```
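Conceptually, inference walks every gap between adjacent characters and splices in a marker wherever the network predicts a begin or end class. Here is a toy sketch with a stand-in predictor (the real `run_model.py` feeds contexts to the TensorFlow graph instead):

```python
NOOP, CODE_BEGINS, CODE_ENDS = 0, 1, 2

def insert_markers(text, predict_gap, begin_mark="<code>", end_mark="</code>"):
    """Walk every gap between adjacent characters and splice in a
    marker wherever the classifier reports a code boundary."""
    out = [text[0]] if text else []
    for i in range(1, len(text)):
        cls = predict_gap(text[:i], text[i:])  # contexts around gap i
        if cls == CODE_BEGINS:
            out.append(begin_mark)
        elif cls == CODE_ENDS:
            out.append(end_mark)
        out.append(text[i])
    return "".join(out)

# Stand-in predictor: pretend code always starts right after ": ".
def toy_predictor(left, right):
    return CODE_BEGINS if left.endswith(": ") else NOOP

# insert_markers("say: x=1", toy_predictor) yields "say: <code>x=1"
```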
Visualize the trained model:
```
python3 model2tb.py --model-dir model_LSTM_600_0.9924.pb --log-dir tb_logs
tensorboard --logdir=tb_logs
```
Go inference
------------
```
go get gopkg.in/vmarkovtsev/CodeNeuron.v1/...
cat sample.txt | $(go env GOPATH)/bin/codetect
```
API:
```go
package main

import (
	"fmt"
	"io/ioutil"

	codetect "gopkg.in/vmarkovtsev/CodeNeuron.v1"
)

func main() {
	session, _ := codetect.OpenSession()
	textBytes, _ := ioutil.ReadFile("test.txt")
	result, _ := codetect.Run(string(textBytes), session)
	fmt.Println(result)
}
```
#### Updating the model
```
go-bindata -nomemcopy -nometadata -pkg assets -o assets/bindata.go model.pb
```
License
-------
MIT, see [LICENSE](LICENSE).