https://github.com/vmarkovtsev/codeneuron
Recurrent neural network to split code snippets from text.
- Host: GitHub
- URL: https://github.com/vmarkovtsev/codeneuron
- Owner: vmarkovtsev
- License: MIT
- Created: 2018-03-03T09:27:10.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2018-12-10T18:11:24.000Z (almost 7 years ago)
- Last Synced: 2025-04-14T17:12:44.578Z (7 months ago)
- Language: Python
- Size: 32.7 MB
- Stars: 12
- Watchers: 2
- Forks: 3
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-machine-learning-on-source-code - Code Neuron - Recurrent neural network to detect code blocks in natural language text. (Software)
README
Code Neuron
===========
Recurrent neural network to detect code blocks. Runs on TensorFlow. It is trained in two stages.
The first stage pre-trains the character-level RNN with two branches, "before" and "after":

```
my code : FooBar
------> x <------
```
We assign the recurrent branches to different GPUs to train faster.
With 512 LSTM neurons, the network reaches 89% validation accuracy over the 200 most frequent character classes.
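The pre-training samples can be sketched like this - a minimal illustration (not the repository's actual data pipeline) of how each position pairs a left-to-right "before" context and a right-to-left "after" context with the character the network must predict:

```python
def make_pretrain_samples(text, window=8):
    """Build (before, after, target) triples for the two-branch RNN.

    For each position i, the "before" branch reads the preceding
    characters left to right, the "after" branch reads the following
    characters right to left, and the network predicts the character
    standing between them.
    """
    samples = []
    for i in range(len(text)):
        before = text[max(0, i - window):i]       # left context
        after = text[i + 1:i + 1 + window][::-1]  # right context, reversed
        samples.append((before, after, text[i]))
    return samples

samples = make_pretrain_samples("my code: FooBar")
# samples[9] pairs "y code: " and "raBoo" with the target "F"
```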

The second stage trains the same network, but with a different dense layer which predicts
only 3 classes: code block begins, code block ends, and no-op.
The prediction scheme changes: we now look at adjacent characters and decide whether there is
a code boundary between them.

This stage is much faster to train, and it reaches **~99.2% validation accuracy**.
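The stage-two labeling can be illustrated as follows - a hypothetical sketch that assigns one of the three classes to each gap between adjacent characters. The `\x03` end marker echoes the one visible in the dataset pipeline below; the `\x02` begin marker is an assumption for illustration:

```python
# One class per gap between adjacent characters.
NOOP, CODE_BEGINS, CODE_ENDS = 0, 1, 2

def label_gaps(text, begin_mark="\x02", end_mark="\x03"):
    """Strip code-span markers from `text` and label every gap between
    the remaining adjacent characters with a boundary class."""
    plain = []
    labels = {}  # gap i sits between plain[i-1] and plain[i]
    for ch in text:
        if ch == begin_mark:
            labels[len(plain)] = CODE_BEGINS
        elif ch == end_mark:
            labels[len(plain)] = CODE_ENDS
        else:
            plain.append(ch)
    gaps = [labels.get(i, NOOP) for i in range(1, len(plain))]
    return "".join(plain), gaps

plain, gaps = label_gaps("run \x02ls -l\x03 now")
# plain is "run ls -l now"; the gaps before "ls" and after "-l"
# are labeled CODE_BEGINS and CODE_ENDS, everything else NOOP
```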
Training set
------------
[StackSample questions and answers](https://www.kaggle.com/stackoverflow/stacksample), processed with
```
unzip -p Answers(Questions).csv.zip | ./dataset | sed -r -e '/^$/d' -e '/\x03/ {N; s/\x03\s*\n/\x03/g}' | gzip >> Dataset.txt.gz
```
Baked model
-----------
[model_LSTM_600_0.9924.pb](model_LSTM_600_0.9924.pb) - reaches 99.2% accuracy on validation. The model
is in TensorFlow "GraphDef" protobuf format.
Pretraining was performed with 20% validation on the first 8,000,000 bytes of the uncompressed questions.
Training was performed with 20% validation and 90% negative samples on the first 256,000,000 bytes of
the uncompressed questions.
In other words, I did not want to wait a week for it to train on the whole dataset - you are encouraged
to experiment.
Try to run it:
```
cat sample.txt | python3 run_model.py -m model_LSTM_600_0.9924.pb
```
You should see:
```
Here is my Python code, it is awesome and easy to read:
def main():
print("Hello, world!")
Please say what you think about it. Mad skills. Here is another one,
func main() {
println("Hello, world!")
}
As you see, I know Go too. Some more text to provide enough context.
```
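Conceptually, inference walks every gap between adjacent characters and splices in a marker wherever the network predicts a begin or end class. Here is a toy sketch with a stand-in predictor (the real `run_model.py` feeds contexts to the TensorFlow graph instead):

```python
NOOP, CODE_BEGINS, CODE_ENDS = 0, 1, 2

def insert_markers(text, predict_gap, begin_mark="<code>", end_mark="</code>"):
    """Walk every gap between adjacent characters and splice in a
    marker wherever the classifier reports a code boundary."""
    out = [text[0]] if text else []
    for i in range(1, len(text)):
        cls = predict_gap(text[:i], text[i:])  # contexts around gap i
        if cls == CODE_BEGINS:
            out.append(begin_mark)
        elif cls == CODE_ENDS:
            out.append(end_mark)
        out.append(text[i])
    return "".join(out)

# Stand-in predictor: pretend code always starts right after ": ".
def toy_predictor(left, right):
    return CODE_BEGINS if left.endswith(": ") else NOOP

# insert_markers("say: x=1", toy_predictor) yields "say: <code>x=1"
```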
Visualize the trained model:
```
python3 model2tb.py --model-dir model_LSTM_600_0.9924.pb --log-dir tb_logs
tensorboard --logdir=tb_logs
```
Go inference
------------
```
go get gopkg.in/vmarkovtsev/CodeNeuron.v1/...
cat sample.txt | $(go env GOPATH)/bin/codetect
```
API:
```go
package main

import (
	"fmt"
	"io/ioutil"

	codetect "gopkg.in/vmarkovtsev/CodeNeuron.v1"
)

func main() {
	session, _ := codetect.OpenSession()
	textBytes, _ := ioutil.ReadFile("test.txt")
	result, _ := codetect.Run(string(textBytes), session)
	fmt.Println(result)
}
```
#### Updating the model
```
go-bindata -nomemcopy -nometadata -pkg assets -o assets/bindata.go model.pb
```
License
-------
MIT, see [LICENSE](LICENSE).