https://github.com/kirili4ik/nl2ml-corpus
Natural Language to Machine Learning corpus
- Host: GitHub
- URL: https://github.com/kirili4ik/nl2ml-corpus
- Owner: Kirili4ik
- Created: 2020-05-09T16:25:53.000Z (about 5 years ago)
- Default Branch: master
- Last Pushed: 2020-06-09T12:21:32.000Z (about 5 years ago)
- Last Synced: 2025-01-20T18:46:22.038Z (5 months ago)
- Topics: code2vec, machine-learning-corpus, snippets
- Language: Jupyter Notebook
- Size: 174 KB
- Stars: 0
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# NL2ML corpus
Natural Language to Machine Learning corpus. A coursework of mine.
### Workflow
[Workflow diagram; the image link is truncated in the source and ends in `-Page-1.jpg`]
### Presentation with explanation:
https://github.com/Kirili4ik/pres-n-articles/blob/master/corpus_NL2ML_Presentation.pdf
### Expertly collected and marked data:
https://docs.google.com/spreadsheets/d/1gDhVdq2GktuWXh7hDyt_js335Xbvsw57iSNh_wEaUxE/
### Data parsed from Kaggle:
https://yadi.sk/d/kvnqRG6ngt8emw
markup_complete - Python 3 code snippets with 6 binary columns that stand for KG nodes (> 100 000 snippets).
chunks_30_final - Python 3 code split into chunks of 30 rows each (2 574k rows).
code_blocks_final - Python 3 code split into blocks (from .ipynb files) where the authors left comments (2 211k rows).
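The two splitting strategies described above could look roughly like the sketch below. It is not the parsing code used for this corpus: the function names `chunk_lines` and `commented_code_cells` are hypothetical, and treating "a comment" as any line that starts with `#` is an assumption made for illustration.

```python
import json
from typing import Iterator, List


def chunk_lines(source: str, size: int = 30) -> List[str]:
    """Split source text into consecutive chunks of at most `size` lines."""
    lines = source.splitlines()
    return ["\n".join(lines[i:i + size]) for i in range(0, len(lines), size)]


def commented_code_cells(notebook_json: str) -> Iterator[str]:
    """Yield the source of every code cell that has at least one '#' comment line.

    Assumption: a "comment" is a line whose first non-blank character is '#'.
    """
    notebook = json.loads(notebook_json)
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        src = cell.get("source", "")
        # The notebook format stores source as one string or as a list of lines.
        text = "".join(src) if isinstance(src, list) else src
        if any(line.lstrip().startswith("#") for line in text.splitlines()):
            yield text
```

With this sketch, a 65-line script yields three chunks (30, 30 and 5 lines), and a notebook contributes only those code cells in which the author wrote a comment.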
### Work files
first_attempt_baseline - Naive Bayes classifier solution.
pre-preprocessing - basic preprocessing to find the most popular comments and to try stemming/lemmatization.
regular_expressions+LogReg_code2vec - creating tags from the KG and regular expressions, plus a look at code2vec and logistic regression F1 scores.
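As a rough illustration of regex-based tagging into binary columns and of how an F1 score is computed, here is a minimal sketch. The tag names and patterns in `TAG_PATTERNS` are hypothetical placeholders, since the six KG nodes are not listed in this README, and `tag_snippet` and `f1_score` are illustrative helpers rather than code from the notebooks.

```python
import re
from typing import Dict, List

# Hypothetical tags and patterns, for illustration only;
# the real KG nodes used in the corpus are not named in this README.
TAG_PATTERNS: Dict[str, "re.Pattern[str]"] = {
    "import": re.compile(r"^\s*(?:import|from)\s+\w+", re.MULTILINE),
    "data_load": re.compile(r"\bread_csv\s*\("),
    "model_fit": re.compile(r"\.fit\s*\("),
}


def tag_snippet(code: str) -> Dict[str, int]:
    """Return one binary column per tag: 1 if its pattern occurs in the snippet."""
    return {tag: int(bool(p.search(code))) for tag, p in TAG_PATTERNS.items()}


def f1_score(y_true: List[int], y_pred: List[int]) -> float:
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Each tag column can then be scored separately by comparing a classifier's predictions for that column against the regex-derived (or expert-marked) labels.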
### Code2vec implementation:
https://github.com/Kirili4ik/code2vec
### Related works:
https://github.com/whatevernevermindbro/source_code_classification
https://github.com/ramazyant/nl2ml