Harvests open research papers from HAL (Hyper Articles en Ligne).
- Host: GitHub
- URL: https://github.com/madjakul/halvesting
- Owner: Madjakul
- License: apache-2.0
- Created: 2024-02-06T10:52:08.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-02-18T14:25:14.000Z (8 months ago)
- Last Synced: 2025-02-18T15:30:36.480Z (8 months ago)
- Topics: dataset-generation, language-modeling, natural-language-processing
- Language: Python
- Homepage:
- Size: 490 KB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Metadata Files:
- Readme: README.md
- License: LICENSE
README
# HALvesting
[arXiv](https://arxiv.org/abs/2407.20595)
[Dataset on HuggingFace](https://huggingface.co/datasets/almanach/HALvest)

Harvests open scientific papers from HAL.
* See also: [HALvesting-Geometric](https://github.com/Madjakul/HALvesting-Geometric)
* See also: [HALvesting-Contrastive](https://github.com/Madjakul/HALvesting-Contrastive)

---
HALvesting is a Python project designed to crawl data from the [HAL (Hyper Articles en Ligne) repository](https://hal.science/). It provides functionalities to fetch data from HAL and to process it for further analysis.
The latest dump can be found on [HuggingFace](https://huggingface.co/datasets/Madjakul/HALvest).
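To get a local copy of the dump, one option is to clone the dataset repository directly; this is a minimal sketch that assumes `git` and `git-lfs` are installed, using the dataset path from the link above:

```sh
# Clone the HALvest dataset repository from HuggingFace.
# Large files are stored with git-lfs, so set it up first.
git lfs install
git clone https://huggingface.co/datasets/Madjakul/HALvest
```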
## Features
- [**fetch_data.py**](fetch_data.py): This script fetches data from HAL using specified criteria.
- [**merge_data.py**](merge_data.py): This script is used for post-processing the fetched data.
- [**enrich_data.py**](enrich_data.py): This script adds new keys to the merged data.
- [**filter_data.py**](filter_data.py): This script removes gibberish documents.

## Requirements
You will need Python > 3.8 and an internet connection.
## Installation
1. Clone the repository
```sh
git clone https://github.com/Madjakul/HALvesting.git
```

2. Navigate to the project directory

```sh
cd HALvesting
```

3. Install the required dependencies:

```sh
pip install -r requirements.txt
```

## Usage
The easiest way is to modify the wrapper scripts [`scripts/fetch_data.sh`](scripts/fetch_data.sh), [`scripts/merge_data.sh`](scripts/merge_data.sh), [`scripts/enrich_data.sh`](scripts/enrich_data.sh), and [`scripts/filter_data.sh`](scripts/filter_data.sh) as needed and launch them. However, one can also run the Python scripts directly with the correct arguments, as shown below.
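For example, the fetching step can be launched through its wrapper script from the project root (after editing the variables inside the script if needed):

```sh
# Run the fetching step via its wrapper script.
bash scripts/fetch_data.sh
```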
### Fetching Data
This script passes a query to HAL's API and fetches all the open papers before storing them in a JSON file. The fetched data comprises only the papers' metadata.
```js
[
{
"halid": "02975689",
"lang": "ar",
"domain": [ "math.math-ho", "shs.edu" ],
"timestamp": "2024/03/04 16:36:13",
"year": "2020",
"url": "https://hal.science/hal-02975689/file/DAD-TGP_Encyclopedia_Arabic.pdf"
},
...
]
```

```
>>> python3 fetch_data.py -h
usage: fetch_data.py [-h] [--query [QUERY]] [--from_date FROM_DATE] [--from_hour FROM_HOUR] [--to_date TO_DATE] [--to_hour TO_HOUR] --pdf PDF --response_dir RESPONSE_DIR [--pdf_dir [PDF_DIR]] [--num_chunks [NUM_CHUNKS]]

Arguments used to fetch data.
options:
-h, --help show this help message and exit
--query [QUERY] Query used to request APIs.
--from_date FROM_DATE
Minimum submission date of documents.
--from_hour FROM_HOUR
Minimum submission hour of documents.
--to_date TO_DATE Maximum submission date of documents.
--to_hour TO_HOUR Maximum submission hour of documents.
--pdf PDF Set to `true` if you want to download the PDFs.
--response_dir RESPONSE_DIR
Target directory used to store fetched data.
--pdf_dir [PDF_DIR] Target directory used to store the PDFs.
--num_chunks [NUM_CHUNKS]
Number of semaphores for the PDF downloader.
```
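As a concrete illustration, the invocation below fetches metadata only (no PDFs) for documents submitted during 2023; the query string, date format, and directory name are illustrative placeholders rather than values prescribed by the project:

```sh
# Fetch metadata for open papers submitted in 2023, without downloading PDFs.
# The query, dates, and response directory below are placeholders.
python3 fetch_data.py \
    --query "*" \
    --from_date 2023-01-01 \
    --to_date 2023-12-31 \
    --pdf false \
    --response_dir ./data/responses
```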
### Post-process Data
This script merges the fetched metadata with the text generated by GROBID and harvesting. The output is a set of compressed JSON files sorted by language.
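A possible invocation, using the arguments documented in the help output below (the directory names and version are placeholders):

```sh
# Merge fetched metadata with the extracted text files and write
# language-sorted, compressed JSON ready for HuggingFace.
python3 merge_data.py \
    --js_dir_path ./data/responses \
    --txts_dir_path ./data/txts \
    --output_dir_path ./data/halvest \
    --version 1.0
```

The full list of supported arguments is shown below: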
```
>>> python3 merge_data.py -h
usage: merge_data.py [-h] --js_dir_path JS_DIR_PATH --txts_dir_path TXTS_DIR_PATH --output_dir_path OUTPUT_DIR_PATH --version VERSION

Arguments used to fetch data.
options:
-h, --help show this help message and exit
--js_dir_path JS_DIR_PATH
Folder containing fetched data.
--txts_dir_path TXTS_DIR_PATH
Folder containing the txt files.
--output_dir_path OUTPUT_DIR_PATH
Final folder containing the processed data for HuggingFace.
--version VERSION Version of the dump starting at '1.0'.
```

### Enrich Data
This script adds new keys to the merged data.
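A possible invocation, based on the arguments documented in the help output below (the dataset name, paths, and other values are illustrative placeholders):

```sh
# Enrich the merged dump with additional keys, downloading the
# SentencePiece and KenLM models on the first run.
# All values below are illustrative placeholders.
python3 enrich_data.py \
    --dataset_checkpoint Madjakul/HALvest \
    --dataset_config_path ./configs/languages.txt \
    --download_models true \
    --kenlm_dir_path ./models/kenlm \
    --num_proc 4 \
    --batch_size 1000 \
    --output_dir_path ./data/halvest_enriched \
    --version 1.0
```

The full list of supported arguments is shown below: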
```
>>> python3 enrich_data.py -h
usage: enrich_data.py [-h] [--dataset_checkpoint DATASET_CHECKPOINT] [--cache_dir_path [CACHE_DIR_PATH]] [--dataset_config_path DATASET_CONFIG_PATH] [--download_models DOWNLOAD_MODELS] [--kenlm_dir_path KENLM_DIR_PATH] [--num_proc NUM_PROC]
[--batch_size BATCH_SIZE] [--output_dir_path OUTPUT_DIR_PATH] [--tokenizer_checkpoint [TOKENIZER_CHECKPOINT]] [--use_fast [USE_FAST]] [--load_from_cache_file [LOAD_FROM_CACHE_FILE]] --version VERSION

Download Sentencepiece and KenLM models for supported languages.
options:
-h, --help show this help message and exit
--dataset_checkpoint DATASET_CHECKPOINT
Name of the HuggingFace dataset to be processed.
--cache_dir_path [CACHE_DIR_PATH]
Path to the HuggingFace cache directory.
--dataset_config_path DATASET_CONFIG_PATH
Path to the txt file containing the dataset configs to process.
--download_models DOWNLOAD_MODELS
Set to `true` if you want to download the KenLM models.
--kenlm_dir_path KENLM_DIR_PATH
Path to the directory containing the sentencepiece and kenlm models.
--num_proc NUM_PROC Number of processes to use for processing the dataset.
--batch_size BATCH_SIZE
Number of documents loaded per proc.
--output_dir_path OUTPUT_DIR_PATH
Path to the directory where the processed dataset will be saved.
--tokenizer_checkpoint [TOKENIZER_CHECKPOINT]
Name of the HuggingFace tokenizer model to be used.
--use_fast [USE_FAST]
Set to `true` if you want to use the Rust-based tokenizer from HF.
--load_from_cache_file [LOAD_FROM_CACHE_FILE]
Set to `true` if some of the enriching functions have been altered.
--version VERSION Version of the dump starting at '1.0'.
```

### Filter Data
This script filters gibberish documents out of the raw data.
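A possible invocation, mirroring the arguments documented in the help output below (values are placeholders):

```sh
# Filter gibberish documents out of the raw dump.
# Dataset name, paths, and other values are illustrative placeholders.
python3 filter_data.py \
    --dataset_checkpoint Madjakul/HALvest \
    --dataset_config_path ./configs/languages.txt \
    --num_proc 4 \
    --batch_size 1000 \
    --output_dir_path ./data/halvest_filtered \
    --version 1.0
```

The full list of supported arguments is shown below: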
```
>>> python3 filter_data.py -h
usage: filter_data.py [-h] [--dataset_checkpoint DATASET_CHECKPOINT] [--cache_dir_path [CACHE_DIR_PATH]] [--dataset_config_path DATASET_CONFIG_PATH] [--num_proc NUM_PROC] [--batch_size BATCH_SIZE] [--output_dir_path OUTPUT_DIR_PATH]
[--load_from_cache_file [LOAD_FROM_CACHE_FILE]] --version VERSION

Argument used to filter the dataset.
options:
-h, --help show this help message and exit
--dataset_checkpoint DATASET_CHECKPOINT
Name of the HuggingFace dataset to be processed.
--cache_dir_path [CACHE_DIR_PATH]
Path to the HuggingFace cache directory.
--dataset_config_path DATASET_CONFIG_PATH
Path to the txt file containing the dataset configs to process.
--num_proc NUM_PROC Number of processes to use for processing the dataset.
--batch_size BATCH_SIZE
Number of documents loaded per proc.
--output_dir_path OUTPUT_DIR_PATH
Path to the directory where the processed dataset will be saved.
--load_from_cache_file [LOAD_FROM_CACHE_FILE]
Set to `true` if some of the enriching functions have been altered.
--version VERSION Version of the dump starting at '1.0'.
```

## Citation
To cite HALvesting/HALvest:
```bib
@misc{kulumba2024harvestingtextualstructureddata,
title={Harvesting Textual and Structured Data from the HAL Publication Repository},
author={Francis Kulumba and Wissam Antoun and Guillaume Vimont and Laurent Romary},
year={2024},
eprint={2407.20595},
archivePrefix={arXiv},
primaryClass={cs.DL},
url={https://arxiv.org/abs/2407.20595},
}
```

## Acknowledgement
The code and the dataset have been built upon the following work:
```
GROBID: A machine learning software for extracting information from scholarly documents
https://github.com/kermitt2/grobid

harvesting: Collection of parsers for scientific data
```

## License
This project is licensed under the [Apache License 2.0](LICENSE).