{"id":16108685,"url":"https://github.com/aliosm/ai-soco","last_synced_at":"2026-01-19T11:33:16.561Z","repository":{"id":90007196,"uuid":"265671705","full_name":"AliOsm/AI-SOCO","owner":"AliOsm","description":"Official FIRE 2020 Authorship Identification of SOurce COde (AI-SOCO) task repository containing dataset, evaluation tools and baselines","archived":false,"fork":false,"pushed_at":"2023-05-22T23:30:12.000Z","size":66609,"stargazers_count":20,"open_issues_count":2,"forks_count":6,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-10-09T04:53:08.602Z","etag":null,"topics":["ai-soco","authorship-identification","codeforces","fire2020","machine-learning","pan2020","source-code"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/AliOsm.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-05-20T19:48:36.000Z","updated_at":"2025-09-10T08:31:04.000Z","dependencies_parsed_at":"2024-10-31T20:51:14.019Z","dependency_job_id":null,"html_url":"https://github.com/AliOsm/AI-SOCO","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/AliOsm/AI-SOCO","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AliOsm%2FAI-SOCO","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AliOsm%2FAI-SOCO/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AliOsm%2FAI-SOCO/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/
AliOsm%2FAI-SOCO/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/AliOsm","download_url":"https://codeload.github.com/AliOsm/AI-SOCO/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AliOsm%2FAI-SOCO/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28566478,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-19T08:53:44.001Z","status":"ssl_error","status_checked_at":"2026-01-19T08:52:40.245Z","response_time":67,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-soco","authorship-identification","codeforces","fire2020","machine-learning","pan2020","source-code"],"created_at":"2024-10-09T19:27:53.133Z","updated_at":"2026-01-19T11:33:16.547Z","avatar_url":"https://github.com/AliOsm.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cimg alt=\"Logo\" height=\"150px\" src=\"https://i.imgur.com/UyORSKr.png\"/\u003e\n\n# AI-SOCO\nOfficial [FIRE 2020](http://fire.irsi.res.in/fire/2018/home) **A**uthorship **I**dentification of **SO**urce **CO**de (AI-SOCO) [PAN](https://pan.webis.de) task repository containing dataset, evaluation tools and baselines.\n\n10 - 13 December, Virtually.\n\nWelcome to participate in our Codalab competition 
[here](https://competitions.codalab.org/competitions/25148)!\n\nAll participants are welcome to open a new [issue](https://github.com/AliOsm/AI-SOCO/issues/new) about dataset problems!\n\n## Introduction\nGeneral authorship identification is essential for detecting the misuse of others' content and for exposing the owners of anonymous harmful content. This is done by revealing the author of that content. **A**uthorship **I**dentification of **SO**urce **CO**de (AI-SOCO) focuses on uncovering the author who wrote a given piece of code. This helps address cheating in academic, work and open source environments. It can also help identify the authors of malware.\n\nDetecting cheating in academic communities is important for properly attributing the contribution of each researcher. Also, in work environments, credit sometimes goes to people who do not deserve it. Similar plagiarism issues can arise in open source projects that are available on public platforms. Likewise, such a system could be used in public or private online coding contests, whether held as coding interviews or as official coding training contests, to detect cheating by applicants or contestants. It could also play a big role in identifying the source of anonymous malicious software.\n\nThe dataset is composed of source codes collected from the open submissions in the [Codeforces](http://codeforces.com/) online judge. Codeforces is an online judge that hosts competitive programming contests, where each contest consists of multiple problems to be solved by the participants. A Codeforces participant can solve a problem by writing a solution for it using any of the available programming languages on the website, and then submitting the solution through the website. 
The solution's result can be correct (accepted) or incorrect (wrong answer, time limit exceeded, etc.).\n\nIn our dataset, we selected 1,000 users and collected 100 source codes from each one. So, the total number of source codes is 100,000. All collected source codes are correct, bug-free, compile-ready and written in various versions of the C++ programming language. For each user, all collected source codes are from unique problems.\n\nGiven the pre-defined set of source codes and their writers, the task is to build a system that can identify the writer of any new, previously unseen source code from the pre-defined list of writers.\n\n### Example\nGiven the following bug-free and ready to compile C++ source code:\n\n```c++\n#include \u003cstring\u003e\n#include \u003ciostream\u003e\n#include \u003cctype.h\u003e\nusing namespace std;\n \nint main() {\n    string s;\n    cin \u003e\u003e s;\n    s[0] = toupper(s[0]);\n    cout \u003c\u003c s \u003c\u003c endl;\n    return 0;\n}\n```\n\nYou need to build a system that can determine the source code writer from a list of 1,000 writers.\n\n## Dataset Structure\nThe `data_dir` directory contains the following:\n- `train.csv` file which contains 50K pairs of `uid`s (User IDs) and `pid`s (Problem IDs). 
Each `uid` appears 50 times in the file with 50 different `pid`s.\n- `train` directory which contains 50K files. Each file corresponds to a different `pid` and contains the C++ source code that will be the input to your system.\n- `dev.csv` file is similar to `train.csv`, but it will be used to evaluate your system during development, so it must not be used in the training phase.\n- `dev` directory is similar to `train`, but it will be used to evaluate your system during development, so it must not be used in the training phase.\n- `unlabeled_test.csv` file is similar to `train.csv`, but it will be used to evaluate your system, so it must not be used in the training phase.\n- `test` directory is similar to `train`, but it will be used to evaluate your system, so it must not be used in the training phase.\n\n### Note\nThe data is now available on [Zenodo](https://zenodo.org/record/4059840#.X3ScgnX7Qno) with the test set labels.\n\n## Baseline\n- [**Random Baseline**](random_baseline.py) simply predicts a random writer for each piece of code from the list of 1,000 writers (from 0 to 999). Its accuracy is around **0.1%**.\n- [**Characters Count Logistic Baseline**](characters_logistic_baseline.py) converts each source code to a vector representing the counts of the 100 [printable characters](https://en.wikipedia.org/wiki/ASCII#Printable_characters), then builds a [logistic regression](https://en.wikipedia.org/wiki/Logistic_regression) model on the vectorized representations. It achieved **29.252%** accuracy on the development set.\n- [**TF-IDF KNN Baseline**](tfidf_knn_baseline.py) vectorizes the source codes using the [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) method with **10K** features and builds a [KNN](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm) classifier with **25** neighbors on top of the representations extracted from TF-IDF. 
Its accuracy on the development set is **62.128%**, which is much better than the previous baselines. Keep in mind that this baseline is very slow: it takes about **4** hours to predict all examples in the development set using **6** threads.\n\nTo train and predict on the development set using any of the previously mentioned baselines, please run the following command:\n```bash\npython baselines/[random_baseline.py|characters_logistic_baseline.py|tfidf_knn_baseline.py]\n```\n\n## Evaluation\nSystems will be evaluated and ranked based on the **Accuracy** metric. An evaluation [script](scorer.py) is available in the GitHub repository.\n\n## Important Dates\n- ~~8th June - Open track website~~\n- ~~8th June – Training and development data release~~\n- ~~31st July – Test data release~~\n- ~~7th September – Run submission deadline~~\n- ~~15th September – Results declared~~\n- ~~5th October – Working notes papers due~~\n- 10th November – Final version of working notes papers due\n- 16th-20th December - FIRE 2020 (Online Event)\n\n## Notes\n- All scripts in this repository were tested on **Ubuntu 20.04** and **Python 3.8.2**.\n\n## License\nThe dataset is distributed under the [MIT](/LICENSE) license.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faliosm%2Fai-soco","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Faliosm%2Fai-soco","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faliosm%2Fai-soco/lists"}