https://github.com/dmitrii-artuhov/tinkoff-ml-2023
Test work for Tinkoff ML 2023 3-month course
- Host: GitHub
- URL: https://github.com/dmitrii-artuhov/tinkoff-ml-2023
- Owner: dmitrii-artuhov
- Created: 2023-01-07T14:10:13.000Z (almost 3 years ago)
- Default Branch: master
- Last Pushed: 2023-01-08T16:38:12.000Z (almost 3 years ago)
- Last Synced: 2025-01-20T13:45:07.596Z (9 months ago)
- Language: Python
- Homepage:
- Size: 16.6 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Tinkoff ML 2023 enrollment project
## Anti-plagiarism
A utility that compares Python source files to check for plagiarism.
## How to use
1. Clone the project: `git clone https://github.com/dmitrii-artuhov/tinkoff-ml-2023.git`.
2. Install Python `3.9` or newer (version `3.10.9` was used during development).
3. Run the command:
```sh
python compare.py input.txt output.txt
```
where `input.txt` contains the pairs of files to be compared, one pair per line, e.g.:
```txt
plagiat/1.py plagiat/2.py
plagiat/2.py plagiat/2.py
plagiat/1.py plagiat/not-found.py
```
In the output file you will find one result per input line: a floating-point similarity score in the range `[0, 1]` (the higher the score, the more similar the programs), or an error message if a path is invalid, e.g.:
```txt
0.7687309644670051
1.0
Invalid path(s): 'plagiat/not-found.py'
```

## How it works
The utility applies the [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance#:~:text=Informally%2C%20the%20Levenshtein%20distance%20between,considered%20this%20distance%20in%201965.) algorithm to the pre-processed texts of the Python programs. To be exact, the formula looks like this: `compare(a, b) = levenstein_distance(a, b) / max(len(a), len(b))`, where `a` and `b` are the program texts and `compare(a, b)` is the function that calculates the final answer. Note that this ratio is `0` for identical texts, while the sample output above reports `1.0` for a file compared with itself, so the reported score is presumably `1 - levenstein_distance(a, b) / max(len(a), len(b))`.

Text pre-processing does the following:
1. Removes all docstrings from the program.
2. Renames all variables to the pattern `var_{i}`, where `i` is the number of variables renamed before `var_{i}`, and replaces every occurrence with the new name.
3. Renames all functions, including class methods, to the pattern `func_{i}`, following the same rule as described above.
4. Renames all function and method arguments (except `self`) to the pattern `arg_{i}` and replaces every occurrence inside the function/method body with the new name.
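
The pre-processing steps and the scoring formula above can be sketched roughly as follows. This is a minimal illustration, not the code from `compare.py`: the names `Normalizer`, `normalize`, `levenshtein` and `similarity` are hypothetical, all scopes share a single rename table, method calls through attributes (`obj.method()`) are left untouched, and the final score is assumed to be `1 - distance / max(len(a), len(b))` so that identical programs score `1.0`. It relies on `ast.unparse`, which requires Python 3.9 or newer.

```python
import ast


class Normalizer(ast.NodeTransformer):
    """Hypothetical sketch: strips docstrings and renames identifiers."""

    def __init__(self):
        self.names = {}  # original identifier -> placeholder
        self.counts = {"var": 0, "func": 0, "arg": 0}

    def placeholder(self, name, kind):
        # Assign the next free `{kind}_{i}` the first time a name is seen.
        if name not in self.names:
            self.names[name] = f"{kind}_{self.counts[kind]}"
            self.counts[kind] += 1
        return self.names[name]

    def strip_docstring(self, node):
        if ast.get_docstring(node, clean=False) is not None:
            node.body = node.body[1:] or [ast.Pass()]

    def visit_block(self, node):
        self.strip_docstring(node)
        self.generic_visit(node)
        return node

    visit_Module = visit_ClassDef = visit_block

    def visit_FunctionDef(self, node):
        node.name = self.placeholder(node.name, "func")
        return self.visit_block(node)

    visit_AsyncFunctionDef = visit_FunctionDef

    def visit_arg(self, node):
        if node.arg != "self":
            node.arg = self.placeholder(node.arg, "arg")
        return node

    def visit_Name(self, node):
        if node.id in self.names:
            node.id = self.names[node.id]
        elif isinstance(node.ctx, ast.Store):
            # Only names that are assigned to become variables;
            # builtins such as `print` are read, never stored, so they stay.
            node.id = self.placeholder(node.id, "var")
        return node


def normalize(source):
    """Return the source with docstrings removed and names replaced."""
    tree = ast.parse(source)
    normalizer = Normalizer()
    # Register function names first so calls that precede a definition
    # are renamed consistently.
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            normalizer.placeholder(node.name, "func")
    return ast.unparse(normalizer.visit(tree))


def levenshtein(a, b):
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def similarity(source_a, source_b):
    """Assumed score: 1 minus the normalized distance from the formula."""
    a, b = normalize(source_a), normalize(source_b)
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - levenshtein(a, b) / longest


sample_a = '''
def add(x, y):
    """Add two numbers."""
    total = x + y
    return total
'''
sample_b = '''
def plus(first, second):
    result = first + second
    return result
'''
print(normalize(sample_a))
print(similarity(sample_a, sample_b))  # 1.0: the programs differ only in names
```

Working on the syntax tree rather than on raw text means that formatting and comments are discarded as a side effect of `ast.unparse`, so two files that differ only in naming, docstrings and layout normalize to the same text.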