{"id":16281053,"url":"https://github.com/mafesan/2021-tfm-code","last_synced_at":"2026-04-15T15:40:08.979Z","repository":{"id":85582730,"uuid":"595582982","full_name":"mafesan/2021-tfm-code","owner":"mafesan","description":"Revelio: Machine-Learning classifier to identify Bots integrable with GrimoireLab","archived":false,"fork":false,"pushed_at":"2023-04-17T19:56:17.000Z","size":12714,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-14T16:54:34.787Z","etag":null,"topics":["bot-accounts","data-analysis","data-analytics","data-science","grimoirelab","machine-learning","metrics","open-source","open-source-community","project-health","python","scikit-learn"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mafesan.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-01-31T11:40:53.000Z","updated_at":"2024-09-09T12:09:01.000Z","dependencies_parsed_at":"2023-07-04T10:47:12.060Z","dependency_job_id":null,"html_url":"https://github.com/mafesan/2021-tfm-code","commit_stats":{"total_commits":20,"total_committers":1,"mean_commits":20.0,"dds":0.0,"last_synced_commit":"efa991cfaf1b69c9d5b2b69356b190b4d2965542"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mafesan%2F2021-tfm-code","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mafesan%2F2021-tfm-code/tags","rele
ases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mafesan%2F2021-tfm-code/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mafesan%2F2021-tfm-code/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mafesan","download_url":"https://codeload.github.com/mafesan/2021-tfm-code/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247930995,"owners_count":21020148,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bot-accounts","data-analysis","data-analytics","data-science","grimoirelab","machine-learning","metrics","open-source","open-source-community","project-health","python","scikit-learn"],"created_at":"2024-10-10T19:04:46.617Z","updated_at":"2026-04-15T15:40:03.934Z","avatar_url":"https://github.com/mafesan.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Revelio: Bot classifier integrable with GrimoireLab\n\nThe aim of this tool is to detect Bots automatically based on their profiles' information and their activity in the project, integrable as a component inside the [GrimoireLab](https://github.com/chaoss/grimoirelab) toolchain. 
To develop the first version of this tool, we used GrimoireLab to analyze the code changes produced between January 2008 and September 2021 in a set of software projects from the Wikimedia Foundation. We manually labeled the Bot accounts generating activity to create an input dataset for training a binary classifier that detects whether a given profile is a Bot or not.\n\n\u003cimg src=\"docs/imgs/revelio-logo.png\" alt=\"revelio logo\" width=\"150\" height=\"150\"\u003e\n\n## General architecture\n\n![general-architecture](docs/imgs/general-architecture-revelio.png)\n\nThe tool requires a running GrimoireLab instance to execute. This GrimoireLab instance contains data from many endpoints stored in an ElasticSearch instance, together with a relational database containing identity information.\n\nWith the GrimoireLab instance in place, Revelio accepts three main input parameters:\n* The URL or the IP address of the ElasticSearch instance.\n* The credentials to access ElasticSearch and SortingHat.\n* The name of the ElasticSearch index containing the GrimoireLab-formatted data.\n\nWith these input parameters, the tool executes the following steps:\n\n### Data extraction\n\nRevelio extracts the data per individual from the selected index by querying the ElasticSearch instance (`ES-extract-datasets.py`).\n\n### Data processing\n\nThe extracted data is analyzed and processed, creating the datasets for the classification phase (`build-classifier-input.py`, `exploratory-data-analysis.ipynb`).\n\n### Classification\n\nIn this phase, the classification models are defined and adjusted. 
The output of this chain is a report containing the results of the classification: a boolean attribute for each individual indicating whether it is a bot, and another attribute for the accuracy of the result (`classifiers.ipynb`).\n\n## How to execute\n\nInstall the requirements using python-pip:\n\n```bash\n$ pip3 install -r requirements.txt\n```\n\nGet the GrimoireLab instance up and running:\n\n```bash\n$ cd docker-compose\n$ docker-compose up -d\n```\n\nTo extract the data from ElasticSearch, run:\n\n```bash\n$ python3 revelio/ES-extract-datasets.py\n```\n\nThis script generates one JSON file per unique individual found in the data obtained by GrimoireLab and writes the files to the `data` directory.\n\nNext, the script `build-classifier-input.py` loads each file in the `data` directory to build the input dataset for the classification stage, and exports this dataset to a file inside the `datasets` directory.\n\n```bash\n$ python3 revelio/build-classifier-input.py\n```\n\nNote: The resulting datasets are not published because they contain personal information (names and emails). To ease future analysis, we are sharing the accounts marked as bots, as they are not subject to the GDPR.\n\nTo execute the notebooks, we need to start Jupyter:\n\n```bash\n$ jupyter notebook\n```\n\nBy default, the Jupyter interface is available at http://localhost:8888. There, we can execute both notebooks: `exploratory-data-analysis.ipynb` and `classifier.ipynb`.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmafesan%2F2021-tfm-code","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmafesan%2F2021-tfm-code","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmafesan%2F2021-tfm-code/lists"}