{"id":17327117,"url":"https://github.com/vmarkovtsev/codeneuron","last_synced_at":"2025-04-14T17:13:03.336Z","repository":{"id":57612819,"uuid":"123677637","full_name":"vmarkovtsev/CodeNeuron","owner":"vmarkovtsev","description":"Recurrent neural network to split code snippets from text.","archived":false,"fork":false,"pushed_at":"2018-12-10T18:11:24.000Z","size":34264,"stargazers_count":12,"open_issues_count":0,"forks_count":3,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-04-14T17:12:44.578Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/vmarkovtsev.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-03-03T09:27:10.000Z","updated_at":"2024-11-22T01:55:36.000Z","dependencies_parsed_at":"2022-09-05T17:40:18.620Z","dependency_job_id":null,"html_url":"https://github.com/vmarkovtsev/CodeNeuron","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vmarkovtsev%2FCodeNeuron","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vmarkovtsev%2FCodeNeuron/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vmarkovtsev%2FCodeNeuron/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vmarkovtsev%2FCodeNeuron/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/vmarkovtsev","download_url":"https://codeload.github.com/vmarkovtsev/CodeNeuron/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https:/
/github.com","kind":"github","repositories_count":248923765,"owners_count":21183953,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-15T14:18:47.999Z","updated_at":"2025-04-14T17:13:03.303Z","avatar_url":"https://github.com/vmarkovtsev.png","language":"Python","readme":"Code Neuron\n===========\n\nRecurrent neural network to detect code blocks. Runs on TensorFlow. It is trained in two stages.\n\nThe first stage is pre-training a character-level RNN with two branches - before and after:\n\n![CharRNN Architecture](doc/char_rnn_arch.png)\n\n```\nmy code :  FooBar\n------\u003e x \u003c------\n```\n\nWe assign the recurrent branches to different GPUs to train faster.\nI set 512 LSTM neurons and reach 89% validation accuracy over the 200 most frequent character classes:\n\n![CharRNN Validation](doc/char_rnn_validation.png)\n\nThe second stage is training the same network but with a different dense layer which predicts\nonly 3 classes: code block begins, code block ends and no-op.\nThe prediction scheme changes: now we look at two adjacent chars and decide whether there is a code boundary\nbetween them or not.\n\n![Code Neuron Validation](doc/code_neuron_validation.png)\n\nIt is much faster to train and it reaches **~99.2% validation accuracy**.\n\nTraining set\n------------\n\n[StackSample questions and answers](https://www.kaggle.com/stackoverflow/stacksample), processed with\n\n```\nunzip -p Answers(Questions).csv.zip | ./dataset | sed -r -e '/^$/d' -e '/\\x03/ {N; s/\\x03\\s*\\n/\\x03/g}' | gzip \u003e\u003e Dataset.txt.gz\n```\n\nBaked 
model\n-----------\n\n[model_LSTM_600_0.9924.pb](model_LSTM_600_0.9924.pb) - reaches 99.2% accuracy on validation. The model\nis in TensorFlow \"GraphDef\" protobuf format.\n\nPretraining was performed with 20% validation on the first 8000000 bytes of the uncompressed questions.\nTraining was performed with 20% validation and 90% negative samples on the first 256000000 bytes of\nthe uncompressed questions.\nThis means I was too lazy to wait a week for it to train on the whole dataset - you are encouraged\nto experiment.\n\nTry running it:\n\n```\ncat sample.txt | python3 run_model.py -m model_LSTM_600_0.9924.pb\n```\n\nYou should see:\n\n```\nHere is my Python code, it is awesome and easy to read:\n\u003ccode\u003edef main():\n    print(\"Hello, world!\")\n\u003c/code\u003ePlease say what you think about it. Mad skills. Here is another one,\n\u003ccode\u003efunc main() {\n  println(\"Hello, world!\")\n}\n\u003c/code\u003eAs you see, I know Go too. Some more text to provide enough context.\n```\n\nVisualize the trained model:\n\n```\npython3 model2tb.py --model-dir model_LSTM_600_0.9924.pb --log-dir tb_logs\ntensorboard --logdir=tb_logs\n```\n\nGo inference\n------------\n\n```\ngo get gopkg.in/vmarkovtsev/CodeNeuron.v1/...\ncat sample.txt | $(go env GOPATH)/bin/codetect\n```\n\nAPI:\n\n```go\nimport (\n  \"io/ioutil\"\n\n  \"gopkg.in/vmarkovtsev/CodeNeuron.v1\"\n)\n\nfunc main() {\n  // Errors are ignored here for brevity.\n  session, _ := codetect.OpenSession()\n  textBytes, _ := ioutil.ReadFile(\"test.txt\")\n  result, _ := codetect.Run(string(textBytes), session)\n}\n```\n\n#### Updating the model\n\n```\ngo-bindata -nomemcopy -nometadata -pkg assets -o assets/bindata.go  model.pb\n```\n\nLicense\n-------\n\nMIT, see 
[LICENSE](LICENSE).\n","funding_links":[],"categories":["Software"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvmarkovtsev%2Fcodeneuron","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvmarkovtsev%2Fcodeneuron","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvmarkovtsev%2Fcodeneuron/lists"}