https://github.com/stepicorg/submissions-clustering
- Host: GitHub
- URL: https://github.com/stepicorg/submissions-clustering
- Owner: StepicOrg
- License: mit
- Created: 2017-07-05T15:38:36.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2018-11-13T15:31:26.000Z (about 7 years ago)
- Last Synced: 2025-03-22T08:30:05.454Z (10 months ago)
- Language: Python
- Size: 6.97 MB
- Stars: 1
- Watchers: 6
- Forks: 0
- Open Issues: 2
Metadata Files:
- Readme: README.rst
- License: LICENSE
README
======================
submissions-clustering
======================
A fine tool for splitting code submissions into clusters.
------------
Installation
------------
Dependencies
============
1. **make**
2. **python** version *3.4* or later.
3. **pip**
Pip
===
``pip install git+https://github.com/StepicOrg/submissions-clustering.git``
Locally
=======
1. ``git clone https://github.com/StepicOrg/submissions-clustering``
2. ``make reqs``
3. ``make build``
You can run ``make help`` for a list of available targets. Some of them are provided just for convenience. For example,
you can run ``make run`` to execute the **main.py** script or ``make check`` to run Python static code checkers.
-----
Usage
-----
Setting Up
==========
First, we need to set things up. Customize the logger behavior:
>>> import logging
>>> logging.basicConfig(
... level=logging.INFO,
... format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
... )
Training
========
A simple example of usage. First, we read our submissions as an iterable of *(code : str, status : str)* pairs from
some source. Then we create a model from a specification (language and approach) and feed the submissions into it (this
is where the training happens). Lastly, we predict the ids of the first 5 neighbors of the first code sample:
>>> import subsclu as sc
>>> from subsclu.utils import read as read_utils
>>> submissions = list(read_utils.from_sl3("data/subs.sl3", nrows=3000))
>>> model = sc.SubmissionsClustering.outof("python", "ast")
>>> model.fit(submissions)
>>> model.neighbors([submissions[0][0]])[0][:5]
array([ 0, 2308, 1686, 460, 643])
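The returned ids appear to index into the fitted submissions (note that the first neighbor above, id *0*, is the query
itself). Under that assumption, a minimal sketch for pulling out the neighboring code samples:

>>> neighbor_ids = model.neighbors([submissions[0][0]])[0][:5]
>>> neighbor_codes = [submissions[i][0] for i in neighbor_ids]  # code of the 5 closest submissions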
Saving & Restoring
==================
The model class provides ``save`` and static ``load`` methods for fast and efficient model saving and loading:
>>> model.save("data/model.dump")
>>> del model
>>> model = sc.SubmissionsClustering.load("data/model.dump")
The default serializing machinery is provided by the joblib package from sklearn (for faster work with numpy matrices).
Since the model is a picklable object, you can also use methods from the pickle package (and use it with
`django-picklefield`_):
.. _`django-picklefield`: https://pypi.python.org/pypi/django-picklefield
>>> from subsclu.utils import dump as dump_utils
>>> dump_utils.pickle_save(model, "data/model.dump")
>>> del model
>>> model = dump_utils.pickle_load("data/model.dump")
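Since the model is plain-picklable, a minimal sketch using only the standard library also works (the *data/model.pkl*
path is just an example):

>>> import pickle
>>> with open("data/model.pkl", "wb") as fout:
...     pickle.dump(model, fout)  # serialize the fitted model
>>> with open("data/model.pkl", "rb") as fin:
...     model = pickle.load(fin)  # restore it later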
Evaluating
==========
Of course, we need some way to evaluate the performance of our model. For this purpose we use scorers. Here we create
one of them from a specification (language and testing approach). A scorer instance has a *score(model, submissions,
**kwargs)* method that calculates how well an (unfitted) model performs on the given submissions:
>>> from subsclu.scorers import Scorer
>>> scorer = Scorer.outof("python", "diff")
>>> hasattr(scorer, "score")
True
Internally, a scorer uses a metric instance that measures how close one piece of code is to another. The overall score
is computed as the mean difference between the best metric value and the local best metric value (within the
neighbors). We can speed up the calculation by using a predefined array of best scores:
>>> scorer.score(model, submissions, presaved_score_path="data/best_metrics.dump")
0.013940358468805102
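If you have no presaved best metrics, the *score(model, submissions, **kwargs)* signature suggests the keyword can
simply be omitted; presumably the best metrics are then computed on the fly, which is slower (an assumption, not
verified against the source):

>>> score = scorer.score(model, submissions)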
Spec and creating custom
========================
To see the list of supported languages, run:
>>> from subsclu.languages.spec import VALID_NAMES
>>> VALID_NAMES
('python',)
The same goes for approaches and other things:
>>> from subsclu.spec import VALID_APPROACHES
>>> VALID_APPROACHES
('diff', 'token', 'ast', 'test')
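Any of these names should plug into the corresponding factory. For instance, assuming *token* is a valid model approach
(it is listed in the spec above, though only *ast* is demonstrated in this README):

>>> token_model = sc.SubmissionsClustering.outof("python", "token")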
If you are not satisfied with the set of things predefined in the spec, you can define your own. See the **subsclu**
package subpackages for more info on that.
----
Test
----
Run ``make test`` to start a full build-test cycle in a separate py34 venv using **tox**.
---
Doc
---
Run ``make doc`` to get a PDF file of the full package documentation.
------------
Useful Links
------------
Node embedding tensorboard
==========================
`Here `_ you can find an embedding of AST nodes for visualization in TensorBoard.
Articles
========
The entire project idea is based on `this article `_.
I also used `this `_,
`this `_, and
`that `_.