https://github.com/centre-for-humanities-computing/friths


Host: GitHub
URL: https://github.com/centre-for-humanities-computing/friths
Owner: centre-for-humanities-computing
Created: 2023-11-06T13:55:44.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2024-02-01T08:29:54.000Z (over 2 years ago)
Last Synced: 2025-09-10T00:02:11.697Z (9 months ago)
Language: HTML
Size: 15.6 MB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 8
Metadata Files:
- Readme: README.md

README

# Friths
Get started using `bash dev_setup.sh`

The project is structured following the [cookiecutter data science template](https://github.com/drivendata/cookiecutter-data-science/tree/master).

## Pipelines
### Publications Pipeline
Run using `make publications`

Input: publication PDFs (`data/raw/UTA publications`)

Manual work required to keep most data:
- Export scanned files (`failed_pdf_paths.txt`) with "Embed text" (Sofie)
- Manually convert or fix non-pdf files (`non_pdf_paths.txt`)

Pipeline:
A) `dataset/parse_publications.py`
We try to extract information using `scipdf`, which depends on the Java-based `grobid`.
If parsing fails, the file is added to `failed_pdf_paths.txt` and OCR'd later (no metadata is available in that case).
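
The parse-or-defer logic described above can be sketched as follows. This is a minimal illustration, not the code in `dataset/parse_publications.py`: the function name `parse_with_fallback` and its parameters are hypothetical, and the parser is passed in as a callable so the sketch does not assume any particular `scipdf` call.

```python
from pathlib import Path
from typing import Callable, Iterable


def parse_with_fallback(
    pdf_paths: Iterable[Path],
    parse_fn: Callable[[str], dict],
    failed_log: Path,
) -> dict[str, dict]:
    """Parse each PDF; record the paths that fail so they can be OCR'd later."""
    parsed: dict[str, dict] = {}
    failed: list[str] = []
    for path in pdf_paths:
        try:
            result = parse_fn(str(path))
        except Exception:
            # Any parser error is treated as a failed parse (an assumption).
            result = None
        if result:
            parsed[str(path)] = result
        else:
            failed.append(str(path))
    # One path per line, mirroring the role of failed_pdf_paths.txt.
    failed_log.write_text("\n".join(failed), encoding="utf-8")
    return parsed
```

Passing the parser as an argument keeps the failure handling testable without a running `grobid` service.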

B) `dataset/ocr_failed_pdf.py`
Runs OCR using `pytesseract`.
Outputs are saved separately from the parsed files.

C1) `dataset/metadata_publications.py`
Generates metadata & document ids separately for both file types (PARSING and OCR).
Most importantly, tries to reconstruct the publication year.
Otherwise, just takes the metadata found by `scipdf` (PARSING files only), or adds blank columns (OCR files only).
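
The README does not say how the publication year is reconstructed. One plausible approach, shown here purely as an assumption, is to look for a four-digit year in candidate strings such as the file name or the first page of text. The name `guess_year` and the year bounds are hypothetical.

```python
import re
from typing import Optional

# A four-digit year starting with 19 or 20, not embedded in a longer number.
YEAR_PATTERN = re.compile(r"(?<!\d)(?:19|20)\d{2}(?!\d)")


def guess_year(*candidates: str, min_year: int = 1950, max_year: int = 2024) -> Optional[int]:
    """Return the first plausible year found in the candidate strings, else None."""
    for text in candidates:
        for match in YEAR_PATTERN.finditer(text or ""):
            year = int(match.group(0))
            # Bounds are illustrative; adjust them to the corpus.
            if min_year <= year <= max_year:
                return year
    return None
```

Candidates are checked in order, so the most trusted source (for example the file name) should come first.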

C2) `dataset/fetch_scopus.py`
Fetches metadata from the Scopus API, given author IDs.

C3) `dataset/metadata_scopus.py`
Merges PARSING metadata with records extracted from Scopus (`data/raw/ScopusExport_{author_id}_{date}.csv`).

D) `dataset/concat_publications.py`
Concatenates article sections into a single block of text.
Creates files with the extracted abstracts.
Merges:
- PARSING and OCR files.
- PARSING and OCR metadata files.
- PARSING and OCR abstract files.
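
Concatenating sections into one block of text could look like the sketch below. It assumes each parsed document is a dict with a `sections` list whose items carry `heading` and `text` strings; that layout is an assumption about the parser output, and `concat_sections` is a hypothetical name.

```python
def concat_sections(parsed: dict) -> str:
    """Join section headings and bodies into a single block of text."""
    parts = []
    for section in parsed.get("sections", []):
        heading = (section.get("heading") or "").strip()
        text = (section.get("text") or "").strip()
        # Skip empty headings and bodies so no blank blocks are emitted.
        if heading:
            parts.append(heading)
        if text:
            parts.append(text)
    return "\n\n".join(parts)
```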

E) `dataset/quality_checks_publications.py`
Adds info about language & text descriptive stats into the metadata.
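
As a rough illustration of the descriptive statistics step, the sketch below computes a few simple counts per document. The function name `text_stats` and the specific statistics are assumptions; the README does not list which ones the script adds, and language detection is not covered here.

```python
def text_stats(text: str) -> dict:
    """Compute simple descriptive statistics for one document."""
    words = text.split()  # whitespace tokenisation, a deliberate simplification
    n_words = len(words)
    return {
        "n_chars": len(text),
        "n_words": n_words,
        "mean_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
    }
```

Very low word counts can flag documents where parsing or OCR produced little usable text.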

F) `features/run_embeddings.py`
Gets embeddings from OpenAI.

Analysis:
`notebooks/experiment_abstracts.ipynb`
`notebooks/experiment_infodynamics.ipynb`
