https://github.com/kirili4ik/nl2ml-corpus


# NL2ML corpus
Natural Language to Machine Learning corpus. A coursework project of mine.

### Workflow
![Workflow diagram](https://github.com/Kirili4ik/NL2ML-corpus/blob/master/whole_workflow%20(1)-Page-1.jpg)

### Presentation with explanation:

https://github.com/Kirili4ik/pres-n-articles/blob/master/corpus_NL2ML_Presentation.pdf

### Expert-collected and labeled data:
https://docs.google.com/spreadsheets/d/1gDhVdq2GktuWXh7hDyt_js335Xbvsw57iSNh_wEaUxE/

### Data parsed from Kaggle:
https://yadi.sk/d/kvnqRG6ngt8emw

markup_complete - Python 3 code snippets with 6 binary columns, one per KG node (more than 100,000 snippets).

chunks_30_final - Python 3 code split into chunks of 30 lines each (2,574k rows).

code_blocks_final - Python 3 code split into blocks (taken from .ipynb notebooks) in which the authors left comments (2,211k rows).
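
The two splitting strategies above can be sketched roughly as follows. This is a minimal illustration, not the code used to build the corpus: the function names `chunk_by_lines` and `commented_code_cells` are hypothetical, and treating any line that starts with `#` as an author comment is an assumed heuristic.

```python
import json


def chunk_by_lines(source: str, size: int = 30) -> list[str]:
    """Split source code into consecutive chunks of at most `size` lines."""
    lines = source.splitlines()
    return ["\n".join(lines[i:i + size]) for i in range(0, len(lines), size)]


def commented_code_cells(notebook_json: str) -> list[str]:
    """Return the code cells of a .ipynb document that contain a '#' comment."""
    notebook = json.loads(notebook_json)
    blocks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # Notebook JSON stores a cell's source as a string or a list of lines.
        text = "".join(source) if isinstance(source, list) else source
        # Assumed heuristic: a line starting with '#' counts as a comment.
        if any(line.lstrip().startswith("#") for line in text.splitlines()):
            blocks.append(text)
    return blocks


# Usage: a 65-line file yields two full chunks and one 5-line remainder.
code = "\n".join(f"x{i} = {i}" for i in range(65))
print([len(chunk.splitlines()) for chunk in chunk_by_lines(code)])
# → [30, 30, 5]
```

The last chunk is simply shorter than 30 lines; whether the original pipeline kept or dropped such remainders is not stated here.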

### Work files
first_attempt_baseline - a baseline solution using a Naive Bayes classifier.

pre-preprocessing - basic preprocessing: finding the most popular comments and trying stemming/lemmatization.

regular_expressions+LogReg_code2vec - tagging snippets using the KG and regular expressions, then comparing the F1 scores of code2vec and logistic regression.
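
As a rough sketch of the regular-expression tagging and F1 comparison described above: the tag names and patterns in `TAG_PATTERNS` are hypothetical placeholders (the actual six KG-node columns are not listed in this README), and `tag_snippet` and `f1_score` are illustrative helpers, not the notebook's code.

```python
import re

# Hypothetical tags and patterns, for illustration only; the corpus's real
# KG-node columns and their regular expressions are not given in this README.
TAG_PATTERNS = {
    "import": re.compile(r"^\s*(?:import|from)\s+\w+", re.MULTILINE),
    "data_load": re.compile(r"\bread_csv\s*\("),
    "model_fit": re.compile(r"\.fit\s*\("),
}


def tag_snippet(code: str) -> dict[str, int]:
    """Return one 0/1 flag per tag, set when its pattern occurs in the code."""
    return {tag: int(bool(p.search(code))) for tag, p in TAG_PATTERNS.items()}


def f1_score(y_true: list[int], y_pred: list[int]) -> float:
    """F1 for one binary column: 2*TP / (2*TP + FP + FN), 0.0 if undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


snippet = "import pandas as pd\ndf = pd.read_csv('train.csv')\n"
print(tag_snippet(snippet))
# → {'import': 1, 'data_load': 1, 'model_fit': 0}
```

Computing F1 per binary column keeps each tag's score separate, so a rare tag's errors are not hidden by the frequent ones.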

### Code2vec implementation:

https://github.com/Kirili4ik/code2vec

### Related works:

https://github.com/whatevernevermindbro/source_code_classification

https://github.com/ramazyant/nl2ml