{"id":19890949,"url":"https://github.com/madjakul/halvesting","last_synced_at":"2026-02-14T07:31:15.205Z","repository":{"id":223003240,"uuid":"753551990","full_name":"Madjakul/HALvesting","owner":"Madjakul","description":"Harvests open research papers from HAL (Hyper Articles en Ligne).","archived":false,"fork":false,"pushed_at":"2025-02-18T14:25:14.000Z","size":502,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-10-06T13:50:30.589Z","etag":null,"topics":["dataset-generation","language-modeling","natural-language-processing"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Madjakul.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2024-02-06T10:52:08.000Z","updated_at":"2025-02-18T14:25:18.000Z","dependencies_parsed_at":"2024-02-17T16:27:39.281Z","dependency_job_id":"4db73715-64e9-42db-b051-5ecbb46bdeac","html_url":"https://github.com/Madjakul/HALvesting","commit_stats":null,"previous_names":["madjakul/halversting"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Madjakul/HALvesting","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Madjakul%2FHALvesting","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Madjakul%2FHALvesting/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Madjakul%2F
HALvesting/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Madjakul%2FHALvesting/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Madjakul","download_url":"https://codeload.github.com/Madjakul/HALvesting/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Madjakul%2FHALvesting/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29439486,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-14T07:24:13.446Z","status":"ssl_error","status_checked_at":"2026-02-14T07:23:58.969Z","response_time":53,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dataset-generation","language-modeling","natural-language-processing"],"created_at":"2024-11-12T18:16:33.513Z","updated_at":"2026-02-14T07:31:15.185Z","avatar_url":"https://github.com/Madjakul.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# HALvesting\n\n[![arXiv](https://img.shields.io/badge/arXiv-2407.20595-b31b1b.svg)](https://arxiv.org/abs/2407.20595)\n[![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Data-yellow)](https://huggingface.co/datasets/almanach/HALvest)\n\nHarvests open scientific papers from HAL.\n\n* See also: [HALvesting-Geometric](https://github.com/Madjakul/HALvesting-Geometric)\n* See also: 
[HALvesting-Contrastive](https://github.com/Madjakul/HALvesting-Contrastive)\n\n---\n\n\nHALvesting is a Python project designed to crawl data from the [HAL (Hyper Articles en Ligne) repository](https://hal.science/). It provides functionality to fetch data from HAL and to process it for further analysis.\n\nThe latest dump can be found on [HuggingFace](https://huggingface.co/datasets/Madjakul/HALvest).\n\n\n## Features\n\n- [**fetch_data.py**](fetch_data.py): This script fetches data from HAL using specified criteria.\n- [**merge_data.py**](merge_data.py): This script is used for post-processing the fetched data.\n- [**enrich_data.py**](enrich_data.py): This script adds new keys to the merged data.\n- [**filter_data.py**](filter_data.py): This script removes gibberish documents.\n\n\n## Requirements\n\nYou will need Python \u003e 3.8 and an internet connection.\n\n\n## Installation\n\n1. Clone the repository\n\n```sh\ngit clone https://github.com/Madjakul/HALvesting.git\n```\n\n2. Navigate to the project directory\n\n```sh\ncd HALvesting\n```\n\n3. Install the required dependencies:\n\n```sh\npip install -r requirements.txt\n```\n\n\n## Usage\n\nThe easiest approach is to edit the files [`scripts/fetch_data.sh`](scripts/fetch_data.sh), [`scripts/merge_data.sh`](scripts/merge_data.sh), [`scripts/enrich_data.sh`](scripts/enrich_data.sh) and [`scripts/filter_data.sh`](scripts/filter_data.sh) as needed and launch them. However, the Python scripts can also be launched directly with the correct arguments, as shown below.\n\n\n### Fetching Data\n\nThis script passes a query to HAL's API and fetches all the open papers before storing them in a JSON file. 
The fetched data consists only of the papers' metadata.\n\n```js\n[\n    {\n        \"halid\": \"02975689\",\n        \"lang\": \"ar\",\n        \"domain\": [ \"math.math-ho\", \"shs.edu\" ],\n        \"timestamp\": \"2024/03/04 16:36:13\",\n        \"year\": \"2020\",\n        \"url\": \"https://hal.science/hal-02975689/file/DAD-TGP_Encyclopedia_Arabic.pdf\"\n    },\n    ...\n]\n```\n\n```\n\u003e\u003e\u003e python3 fetch_data.py -h\nusage: fetch_data.py [-h] [--query [QUERY]] [--from_date FROM_DATE] [--from_hour FROM_HOUR] [--to_date TO_DATE] [--to_hour TO_HOUR] --pdf PDF --response_dir RESPONSE_DIR [--pdf_dir [PDF_DIR]] [--num_chunks [NUM_CHUNKS]]\n\nArguments used to fetch data.\n\noptions:\n  -h, --help            show this help message and exit\n  --query [QUERY]       Query used to request APIs.\n  --from_date FROM_DATE\n                        Minimum submition date of documents.\n  --from_hour FROM_HOUR\n                        Minimum submition hour of documents.\n  --to_date TO_DATE     Maximum submition date of documents.\n  --to_hour TO_HOUR     Maximum submition hour of documents.\n  --pdf PDF             Set to `true` if you want to download the PDFs.\n  --response_dir RESPONSE_DIR\n                        Target directory used to store fetched data.\n  --pdf_dir [PDF_DIR]   Target directory used to store the PDFs.\n  --num_chunks [NUM_CHUNKS]\n                        Number of semaphores for the PDF downloader.\n\n```\n\n\n### Post-process Data\n\nThis script merges the fetched metadata with the generated text from GROBID and harvesting. 
The output is a set of compressed JSON files sorted by language.\n\n\n```\n\u003e\u003e\u003e python3 merge_data.py\nusage: merge_data.py [-h] --js_dir_path JS_DIR_PATH --txts_dir_path TXTS_DIR_PATH --output_dir_path OUTPUT_DIR_PATH --version VERSION\n\nArguments used to fetch data.\n\noptions:\n  -h, --help            show this help message and exit\n  --js_dir_path JS_DIR_PATH\n                        Folder containing fetched data.\n  --txts_dir_path TXTS_DIR_PATH\n                        Folder containing the txt files.\n  --output_dir_path OUTPUT_DIR_PATH\n                        Final folder containing the processed data for HuggingFace.\n  --version VERSION     Version of the dump starting at '1.0'.\n```\n\n\n### Enrich Data\n\nThis script adds new keys to the merged data.\n\n```\n\u003e\u003e\u003e python3 enrich_data.py -h\nusage: enrich_data.py [-h] [--dataset_checkpoint DATASET_CHECKPOINT] [--cache_dir_path [CACHE_DIR_PATH]] [--dataset_config_path DATASET_CONFIG_PATH] [--download_models DOWNLOAD_MODELS] [--kenlm_dir_path KENLM_DIR_PATH] [--num_proc NUM_PROC]\n                      [--batch_size BATCH_SIZE] [--output_dir_path OUTPUT_DIR_PATH] [--tokenizer_checkpoint [TOKENIZER_CHECKPOINT]] [--use_fast [USE_FAST]] [--load_from_cache_file [LOAD_FROM_CACHE_FILE]] --version VERSION\n\nDownload Sentencepiece and KenLM models for supported languages.\n\noptions:\n  -h, --help            show this help message and exit\n  --dataset_checkpoint DATASET_CHECKPOINT\n                        Name of the HuggingFace dataset to be processed.\n  --cache_dir_path [CACHE_DIR_PATH]\n                        Path to the HuggingFace cache directory.\n  --dataset_config_path DATASET_CONFIG_PATH\n                        Path to the txt file containing the dataset configs to process.\n  --download_models DOWNLOAD_MODELS\n                        Set to `true` if you want to download the KenLM models.\n  --kenlm_dir_path KENLM_DIR_PATH\n                        Path to the directory 
containing the sentencepiece and kenlm models.\n  --num_proc NUM_PROC   Number of processes to use for processing the dataset.\n  --batch_size BATCH_SIZE\n                        Number of documents loaded per proc.\n  --output_dir_path OUTPUT_DIR_PATH\n                        Path to the directory where the processed dataset will be saved.\n  --tokenizer_checkpoint [TOKENIZER_CHECKPOINT]\n                        Name of the HuggingFace tokenizer model to be used.\n  --use_fast [USE_FAST]\n                        Set to `true` if you want to use the Ruste-based tokenizer from HF.\n  --load_from_cache_file [LOAD_FROM_CACHE_FILE]\n                        Set to `true` if you if some of the enriching functions have been altered.\n  --version VERSION     Version of the dump starting at '1.0'.\n```\n\n\n### Filter Data\n\nThis script filters gibberish documents out of the raw data.\n\n```\n\u003e\u003e\u003e python3 filter_data.py -h\nusage: filter_data.py [-h] [--dataset_checkpoint DATASET_CHECKPOINT] [--cache_dir_path [CACHE_DIR_PATH]] [--dataset_config_path DATASET_CONFIG_PATH] [--num_proc NUM_PROC] [--batch_size BATCH_SIZE] [--output_dir_path OUTPUT_DIR_PATH]\n                      [--load_from_cache_file [LOAD_FROM_CACHE_FILE]] --version VERSION\n\nArgument used to filter the dataset.\n\noptions:\n  -h, --help            show this help message and exit\n  --dataset_checkpoint DATASET_CHECKPOINT\n                        Name of the HuggingFace dataset to be processed.\n  --cache_dir_path [CACHE_DIR_PATH]\n                        Path to the HuggingFace cache directory.\n  --dataset_config_path DATASET_CONFIG_PATH\n                        Path to the txt file containing the dataset configs to process.\n  --num_proc NUM_PROC   Number of processes to use for processing the dataset.\n  --batch_size BATCH_SIZE\n                        Number of documents loaded per proc.\n  --output_dir_path OUTPUT_DIR_PATH\n                        Path to the directory where the 
processed dataset will be saved.\n  --load_from_cache_file [LOAD_FROM_CACHE_FILE]\n                        Set to `true` if you if some of the enriching functions have been altered.\n  --version VERSION     Version of the dump starting at '1.0'.\n```\n\n\n## Citation\n\nTo cite HALvesting/HALvest:\n\n```bib\n@misc{kulumba2024harvestingtextualstructureddata,\n      title={Harvesting Textual and Structured Data from the HAL Publication Repository}, \n      author={Francis Kulumba and Wissam Antoun and Guillaume Vimont and Laurent Romary},\n      year={2024},\n      eprint={2407.20595},\n      archivePrefix={arXiv},\n      primaryClass={cs.DL},\n      url={https://arxiv.org/abs/2407.20595}, \n}\n```\n\n\n## Acknowledgement\n\nThe code and the dataset have been built upon the following work:\n\n```\nGROBID: A machine learning software for extracting information from scholarly documents\nhttps://github.com/kermitt2/grobid\n\nharvesting: Collection of parsers for scientific data\n```\n\n\n## License\n\nThis project is licensed under the [Apache License 2.0](LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmadjakul%2Fhalvesting","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmadjakul%2Fhalvesting","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmadjakul%2Fhalvesting/lists"}