https://github.com/centre-for-humanities-computing/friths
- Host: GitHub
- URL: https://github.com/centre-for-humanities-computing/friths
- Owner: centre-for-humanities-computing
- Created: 2023-11-06T13:55:44.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2024-02-01T08:29:54.000Z (over 2 years ago)
- Last Synced: 2025-09-10T00:02:11.697Z (9 months ago)
- Language: HTML
- Size: 15.6 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 8
# Friths
Get started using `bash dev_setup.sh`
The project is structured following the [cookiecutter data science template](https://github.com/drivendata/cookiecutter-data-science/tree/master).
## Pipelines
### Publications Pipeline
Run using `make publications`
Input: publication PDFs (`data/raw/UTA publications`)
Manual work required to keep most data:
- Export scanned files (`failed_pdf_paths.txt`) with "Embed text" (Sofie)
- Manually convert or fix non-PDF files (`non_pdf_paths.txt`)
Pipeline:
A) `dataset/parse_publications.py`
We try to extract information using `scipdf`, which depends on the Java tool GROBID (`grobid`).
If parsing fails, the file is added to `failed_pdf_paths.txt` and OCR'd later (no metadata is available in that case).
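The parse-or-defer logic of step A can be sketched roughly as follows. This is a minimal illustration, not the code in `dataset/parse_publications.py`: the function name `parse_publications` and its parameters are hypothetical, and the parser is passed in as `parse_fn` so the sketch runs without GROBID. In practice `scipdf.parse_pdf_to_dict` would presumably be the callable used.

```python
from pathlib import Path
from typing import Callable, Iterable


def parse_publications(
    pdf_paths: Iterable[Path],
    parse_fn: Callable[[str], dict],
    failed_log: Path,
) -> dict[str, dict]:
    """Parse each PDF and record the paths that could not be parsed.

    `parse_fn` is a stand-in for the real parser (assumption: something
    like `scipdf.parse_pdf_to_dict`). Any exception or empty result is
    treated as a failure, and the path is written to `failed_log`.
    """
    parsed: dict[str, dict] = {}
    failed: list[str] = []
    for path in pdf_paths:
        try:
            result = parse_fn(str(path))
        except Exception:
            result = None  # treat any parser error as "needs OCR"
        if result:
            parsed[str(path)] = result
        else:
            failed.append(str(path))
    # One path per line, mirroring a file such as failed_pdf_paths.txt.
    failed_log.write_text("\n".join(failed), encoding="utf-8")
    return parsed
```

Passing the parser as an argument keeps the failure bookkeeping testable without a running GROBID server.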
B) `dataset/ocr_failed_pdf.py`
Runs OCR using `pytesseract`.
Outputs are saved separately from the parsed files.
C1) `dataset/metadata_publications.py`
Generates metadata & document ids separately for both file types (PARSING and OCR).
Most importantly, tries to reconstruct the publication year.
Otherwise, it takes the metadata found by `scipdf` (PARSING files only) or adds blank columns (OCR files only).
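The README does not say how the publication year is reconstructed. One possible heuristic, shown here purely as an assumption, is to look for a copyright year near the start of the text and otherwise fall back to the latest plausible year mentioned there. The function name, the `head_chars` window and the `max_year` cutoff are all hypothetical.

```python
import re
from typing import Optional

# Year directly following a copyright marker, e.g. "© 2009".
COPYRIGHT_YEAR = re.compile(r"(?:©|\(c\)|copyright)\s*((?:19|20)\d{2})", re.IGNORECASE)
# Any standalone four-digit year starting with 19 or 20.
ANY_YEAR = re.compile(r"\b((?:19|20)\d{2})\b")


def guess_publication_year(
    text: str, head_chars: int = 3000, max_year: int = 2024
) -> Optional[int]:
    """Guess a publication year from the first `head_chars` characters.

    Heuristic (an assumption, not the project's method):
    1. prefer a year attached to a copyright marker;
    2. otherwise take the latest year found, since a paper cannot
       cite work published after itself.
    Returns None when no plausible year is present.
    """
    head = text[:head_chars]
    match = COPYRIGHT_YEAR.search(head)
    if match and int(match.group(1)) <= max_year:
        return int(match.group(1))
    years = [int(y) for y in ANY_YEAR.findall(head) if int(y) <= max_year]
    return max(years) if years else None
```

A heuristic like this will misfire on reprints and preprints, which is one reason to cross-check against an external source such as Scopus.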
C2) `dataset/fetch_scopus.py`
Fetches metadata from the Scopus API, given author IDs.
C3) `dataset/metadata_scopus.py`
Merges PARSING metadata with records extracted from Scopus (`data/raw/ScopusExport_{author_id}_{date}.csv`)
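How the PARSING metadata is matched to Scopus records is not documented here. The sketch below assumes a left join on a normalized title, using only the standard library. The column names (`title` on the parsed side; `Title`, `Year`, `DOI` on the Scopus side) and the function names are assumptions for illustration.

```python
import re


def normalize_title(title: str) -> str:
    """Lowercase and collapse punctuation/whitespace so titles compare loosely."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def merge_with_scopus(parsed_rows: list[dict], scopus_rows: list[dict]) -> list[dict]:
    """Left-join parsed metadata onto Scopus rows by normalized title.

    Every parsed row is kept; rows without a Scopus match get None
    in the added `scopus_year` and `scopus_doi` fields.
    """
    by_title: dict[str, dict] = {}
    for row in scopus_rows:
        key = normalize_title(row.get("Title") or "")
        if key:  # skip records whose title normalizes to nothing
            by_title[key] = row

    merged = []
    for row in parsed_rows:
        key = normalize_title(row.get("title") or "")
        match = by_title.get(key) if key else None
        out = dict(row)  # copy so the input rows are not mutated
        out["scopus_year"] = match.get("Year") if match else None
        out["scopus_doi"] = match.get("DOI") if match else None
        merged.append(out)
    return merged
```

The Scopus rows could be loaded with `csv.DictReader` from an export file; exact title matching after normalization is deliberately strict, and a fuzzy match may be needed for noisy parsed titles.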
D) `dataset/concat_publications.py`
Concatenates article sections into a single block of text.
Creates files with the extracted abstracts.
Merges:
- PARSING and OCR files.
- PARSING and OCR metadata files.
- PARSING and OCR abstract files.
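The section concatenation in step D might look like the sketch below. It assumes each parsed article is a dict with a `sections` list whose items carry `heading` and `text` keys (the shape `scipdf` is understood to return, but treat that as an assumption). The function name and the `include_headings` switch are hypothetical.

```python
def concat_sections(
    article: dict, separator: str = "\n\n", include_headings: bool = False
) -> str:
    """Join the text of all non-empty sections into one block of text."""
    parts = []
    for section in article.get("sections", []):
        heading = (section.get("heading") or "").strip()
        text = (section.get("text") or "").strip()
        if not text:
            continue  # skip sections with no body text
        if include_headings and heading:
            parts.append(f"{heading}\n{text}")
        else:
            parts.append(text)
    return separator.join(parts)
```

For example, `concat_sections({"sections": [{"heading": "Intro", "text": "Hello."}, {"heading": "Methods", "text": "World."}]})` yields the two section texts separated by a blank line.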
E) `dataset/quality_checks_publications.py`
Adds info about language & text descriptive stats into the metadata.
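The README does not list which descriptive statistics are computed. As a rough illustration, a few common ones can be derived with the standard library alone. The function name and the chosen statistics are assumptions, and language detection is left out because it needs a third-party library.

```python
def text_stats(text: str) -> dict:
    """Compute simple descriptive statistics for one document.

    Words are split on whitespace, which is crude but dependency-free.
    """
    words = text.split()
    n_words = len(words)
    n_unique = len({w.lower() for w in words})
    return {
        "n_chars": len(text),
        "n_words": n_words,
        "mean_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "type_token_ratio": n_unique / n_words if n_words else 0.0,
    }
```

Statistics like these make it easy to flag OCR output that is suspiciously short or dominated by very short tokens.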
F) `features/run_embeddings.py`
Gets embeddings from OpenAI.
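A minimal sketch of batched embedding requests follows. The batching helper is generic; `openai_embed_batch` shows one way to call the OpenAI embeddings endpoint with the `openai` Python client (version 1.x). The model name, batch size and function names are assumptions, not what `features/run_embeddings.py` necessarily uses.

```python
from typing import Callable, Sequence


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[list[str]], list[list[float]]],
    batch_size: int = 64,
) -> list[list[float]]:
    """Embed `texts` in order, sending at most `batch_size` texts per call."""
    vectors: list[list[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        vectors.extend(embed_batch(batch))
    return vectors


def openai_embed_batch(
    batch: list[str], model: str = "text-embedding-3-small"
) -> list[list[float]]:
    """Embed one batch via the OpenAI API (model name is an assumption)."""
    from openai import OpenAI  # imported lazily so the helper above has no dependency

    client = OpenAI()  # reads the OPENAI_API_KEY environment variable
    response = client.embeddings.create(model=model, input=batch)
    return [item.embedding for item in response.data]
```

With an API key set, `embed_in_batches(abstracts, openai_embed_batch)` would return one vector per abstract. Long documents may exceed the model's token limit and would need truncating or chunking first.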
Analysis:
`notebooks/experiment_abstracts.ipynb`
`notebooks/experiment_infodynamics.ipynb`