https://github.com/mvuorre/psyarxivdb
Datasette serving PsyArXiv preprint metadata
https://github.com/mvuorre/psyarxivdb
data datasette open-science preprints psyarxiv
Last synced: about 1 month ago
JSON representation
Datasette serving PsyArXiv preprint metadata
- Host: GitHub
- URL: https://github.com/mvuorre/psyarxivdb
- Owner: mvuorre
- License: mit
- Created: 2025-04-02T08:19:27.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2026-03-31T12:53:49.000Z (3 months ago)
- Last Synced: 2026-05-04T21:45:49.404Z (about 2 months ago)
- Topics: data, datasette, open-science, preprints, psyarxiv
- Language: Python
- Homepage: https://psyarxivdb.vuorre.com/
- Size: 337 KB
- Stars: 1
- Watchers: 2
- Forks: 1
- Open Issues: 1
-
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
Awesome Lists containing this project
README
# PsyArXivDB
A [Datasette](https://datasette.io/) instance serving [PsyArXiv](https://psyarxiv.com/) preprint data from the [Open Science Framework](https://osf.io/) (OSF) API. Updated daily.
Browse PsyArXiv preprints at .
[Comments](https://github.com/mvuorre/psyarxivdb/issues) welcome. Thanks to OSF, PsyArXiv, and all researchers sharing their work.
## Development
Before getting started, get a copy of a current database file from Matti to speed up the setup & avoid unnecessarily stressing the OSF API.
### Prerequisites
- Computer etc.
- Python 3.9-3.13
- [UV](https://github.com/astral-sh/uv)
### Setup and use
```bash
# Clone the repository
git clone https://github.com/mvuorre/psyarxivdb
cd psyarxivdb
# Create .venv with the repo-pinned Python 3.13 and install dependencies
uv sync --frozen
# Optional: activate the environment in your shell
source .venv/bin/activate
# Harvest and process PsyArXiv data
make harvest # Fetch raw data from OSF API
make ingest # Process into normalized tables
# Check status
make status
# Dump processed preprints table as compressed csv
make dump
# Run daily update of database
make daily-update
# Show all available commands
make help
```
To rebuild the small precomputed tables used by the dashboard without doing a full ingest, including the precomputed `coauthor_edges` table, run:
```bash
make refresh-dashboard-summaries
```
### Local copy workflow
The simplest local development flow is:
1. Pull a fresh copy of the database from the VPS.
2. Run Datasette locally against that copy.
3. Point any local apps at `http://127.0.0.1:8001`.
Create a local `.env` file with the VPS connection details. `.env` is gitignored.
```bash
cp .env.example .env
```
Then edit `.env` so it contains:
```bash
DEV_DB_HOST=user@your-vps
DEV_DB_REMOTE_PATH=/path/to/preprints.db
```
Then pull the latest copy:
```bash
make pull-db
```
This copies `DEV_DB_REMOTE_PATH` from the VPS to `data/preprints.db` using `rsync`.
Keep it simple: run this when the VPS database is not being updated.
If you pulled an older copy of the database or you changed the dashboard summary-table logic, refresh those tables once:
```bash
make refresh-dashboard-summaries
```
Then serve the local copy with Datasette:
```bash
make datasette-local
```
This serves `data/preprints.db` on `http://127.0.0.1:8001` with the Datasette metadata in `datasette/metadata.yml` and production-like query limits, including `sql_time_limit_ms=5000`.
If `make datasette-local` says Datasette is not installed, run:
```bash
uv sync --frozen
```
### Project Structure
- `data/preprints.db` - SQLite database with all data
- `osf/` - Core library (harvester, ingestor, database)
- `scripts/` - CLI tools (`harvest.py`, `ingest.py`)
- `tools/` - Admin utilities (`show_status.py`, `fix_gaps.py`, `refresh_dashboard_summaries.py`)
- `datasette/` - Metadata configuration for web interface