https://github.com/veya2ztn/upar5iv
Convert the ar5iv HTML dataset into LLM-friendly formats such as .md and .json
Last synced: 3 days ago
- Host: GitHub
- URL: https://github.com/veya2ztn/upar5iv
- Owner: veya2ztn
- Created: 2024-05-05T13:08:11.000Z (8 months ago)
- Default Branch: master
- Last Pushed: 2024-05-05T14:06:03.000Z (8 months ago)
- Last Synced: 2024-05-06T14:33:59.799Z (8 months ago)
- Language: Python
- Size: 65.4 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- awesome_ai_agents - Upar5Iv - Convert ar5iv html dataset into LLM friendly format such .md and .json (Building / Datasets)
README
# Usage
For a single file:
```
python python_script/html_to_json.py --root datasets/ar5iv/no-problem/0003/cond-mat0003325.html --savepath datasets/ar5iv/no_problem_json/
```
For a batch of files (store the file list in a file whose name ends with `.filelist`):
```
python python_script/html_to_json.py --root datasets/ar5iv/no_problem.html.filelist --savepath datasets/ar5iv/no_problem_json/
```
- The `.filelist` should contain the absolute path of each HTML file, one per line, such as
```
datasets/ar5iv/no-problem/0003/cond-mat0003325.html
datasets/ar5iv/no-problem/0003/hep-ph0003287.html
datasets/ar5iv/no-problem/0003/gr-qc0003030.html
datasets/ar5iv/no-problem/0003/astro-ph0003274.html
datasets/ar5iv/no-problem/0003/cond-mat0003474.html
datasets/ar5iv/no-problem/0003/astro-ph0003320.html
datasets/ar5iv/no-problem/0003/astro-ph0003298.html
datasets/ar5iv/no-problem/0003/cond-mat0003427.html
datasets/ar5iv/no-problem/0003/physics0003094.html
datasets/ar5iv/no-problem/0003/math0003213.html
```
- The `.json` files are saved under `savepath`, following the same directory structure, such as
```
datasets/ar5iv/no_problem_json/0003/cond-mat0003325/ar5iv/cond-mat0003325.json
datasets/ar5iv/no_problem_json/0003/hep-ph0003287/ar5iv/hep-ph0003287.json
datasets/ar5iv/no_problem_json/0003/gr-qc0003030/ar5iv/gr-qc0003030.json
datasets/ar5iv/no_problem_json/0003/astro-ph0003274/ar5iv/astro-ph0003274.json
datasets/ar5iv/no_problem_json/0003/cond-mat0003474/ar5iv/cond-mat0003474.json
datasets/ar5iv/no_problem_json/0003/astro-ph0003320/ar5iv/astro-ph0003320.json
datasets/ar5iv/no_problem_json/0003/astro-ph0003298/ar5iv/astro-ph0003298.json
datasets/ar5iv/no_problem_json/0003/cond-mat0003427/ar5iv/cond-mat0003427.json
datasets/ar5iv/no_problem_json/0003/physics0003094/ar5iv/physics0003094.json
datasets/ar5iv/no_problem_json/0003/math0003213/ar5iv/math0003213.json
```
- Enable multiprocessing by adding `--batch_num [thread_num]`, such as
`python python_script/html_to_json.py --root datasets/ar5iv/no_problem.html.filelist --savepath datasets/ar5iv/no_problem_json/ --batch_num 32`
- Skip analyzing notes and putting them back into the text by adding `--passNote`
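
The batch workflow above needs a `.filelist` and produces one `.json` per paper. As a minimal sketch (not part of this repository), the helper below builds such a file list and guesses where each output should land. The function names `write_filelist` and `expected_json_path` are hypothetical, and the output layout `<savepath>/<month dir>/<paper id>/ar5iv/<paper id>.json` is only inferred from the example paths shown above, not read from `html_to_json.py`.

```python
from pathlib import Path


def write_filelist(html_root, filelist_path):
    """Write the absolute path of every .html file under html_root, one per line."""
    paths = sorted(p.resolve() for p in Path(html_root).rglob("*.html"))
    Path(filelist_path).write_text("".join(f"{p}\n" for p in paths))
    return paths


def expected_json_path(html_path, savepath):
    """Guess the output location for one input file.

    Assumption: the layout follows the pattern seen in the examples,
    <savepath>/<month dir>/<paper id>/ar5iv/<paper id>.json
    """
    html_path = Path(html_path)
    paper_id = html_path.stem          # e.g. "cond-mat0003325"
    month_dir = html_path.parent.name  # e.g. "0003"
    return Path(savepath) / month_dir / paper_id / "ar5iv" / f"{paper_id}.json"
```

For example, `write_filelist("datasets/ar5iv/no-problem", "datasets/ar5iv/no_problem.html.filelist")` would create the list passed to `--root`, and comparing `expected_json_path(...)` against the files on disk afterwards is one way to spot inputs that were not converted.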