https://github.com/forrtproject/flora-extractor
FloRA extractor
https://github.com/forrtproject/flora-extractor
Last synced: about 2 months ago
JSON representation
FloRA extractor
- Host: GitHub
- URL: https://github.com/forrtproject/flora-extractor
- Owner: forrtproject
- Created: 2026-04-29T18:44:33.000Z (2 months ago)
- Default Branch: main
- Last Pushed: 2026-05-11T18:58:25.000Z (about 2 months ago)
- Last Synced: 2026-05-11T20:38:48.205Z (about 2 months ago)
- Language: Python
- Size: 605 KB
- Stars: 2
- Watchers: 2
- Forks: 1
- Open Issues: 16
-
Metadata Files:
- Readme: README.md
- Agents: AGENTS.md
Awesome Lists containing this project
README
# FLoRA Extractor
A Python tool that discovers, extracts, and validates replication and reproduction studies for the [FLoRA database](https://forrt.org/replication-hub/flora/).
**Part of the [FORRT](https://forrt.org) project.**
---
## What It Does
Starting from keyword searches of academic databases, FLoRA Extractor:
1. **Discovers** candidate replication/reproduction papers from OpenAlex and curated lists
2. **Filters** false positives using rule-based and LLM classification
3. **Extracts** the target study and replication outcome from each paper
4. **Validates** results through a crowdsourced voting web interface
---
## Architecture
```
Stage 1: search/ → data/candidates.csv (discover candidates)
Stage 2: filter/ → data/filtered.csv (remove false positives)
Stage 3: extract/ → data/extracted.csv (link original + code outcome)
Stage 4: validate/ → Flask web app (human voting, export)
```
Each stage is independently runnable. See [CLAUDE.md](CLAUDE.md) for full technical details.
---
## Quick Start
> **Note:** The commands below are the target state once all stages are implemented.
> Currently, `flora_selected.csv` can be used to seed Stage 4 directly.
```bash
# 1. Clone and setup
git clone https://github.com/forrtproject/flora-extractor.git
cd flora-extractor
pip install -r requirements.txt
cp .env.example .env # fill in your API keys
# 2. Run the pipeline
python search/run_search.py # → data/candidates.csv
python filter/run_filter.py # → data/filtered.csv
python extract/run_extract.py # → data/extracted.csv
# 3. Start the validation web app
python -m validate.import_csv # load into SQLite
python -m validate.app # → http://localhost:5001
```
---
## API Keys Required
Add to your `.env` file (copy from `.env.example`):
```
RESEARCHER_EMAIL=you@example.com # for OpenAlex/Crossref API politeness
GEMINI_API_KEY=... # primary LLM
GEMINI_API_KEY_2=... # optional: rotate for higher quota
OPENAI_API_KEY=... # fallback LLM (optional)
GROBID_URL=http://localhost:8070 # local GROBID server (optional, for full-text extraction)
```
Get a free Gemini API key at [aistudio.google.com](https://aistudio.google.com).
---
## Data Sources
**Bibliographic databases (primary):**
| Source | Coverage |
| -------------------------------------------------- | ------------------------------------------------ |
| [OpenAlex](https://openalex.org) | Broad academic literature, free API |
| [Semantic Scholar](https://www.semanticscholar.org)| Supplementary coverage |
| [Crossref](https://www.crossref.org) | DOI resolution and reference lists |
| [OpenCitations](https://opencitations.net) | Reference lists (where OpenAlex coverage is thin)|
**Curated lists (secondary, pluggable):**
| Source | Coverage |
| ------------------------------------------------------------------------------------- | ----------------------------------- |
| [Bob Reed's Replication Network](https://replicationnetwork.com/replication-studies/) | Economics |
| [I4R](https://i4replication.org/reports/) | Institute for Replication reports |
Full-text acquisition (for Stage 3): [Unpaywall](https://unpaywall.org), [CORE](https://core.ac.uk), arXiv, OSF.
---
## Output Schema
Each extracted record contains:
| Field | Description |
|-------|-------------|
| `doi_r` | Replication paper DOI |
| `doi_o` | Original target study DOI |
| `title_o` | Original target study title |
| `outcome` | success / failure / mixed / uninformative / descriptive |
| `outcome_phrase` | Supporting quote from the paper |
| `link_evidence` | Evidence used to identify the original |
| `validation_status` | confirmed / rejected / pending / needs_review |
Full schema: [shared/schema.py](shared/schema.py)
---
## Team Guide
| Team | Stage | Branch | Docs |
|------|-------|--------|------|
| Team Search | Stage 1 | `feature/search` | [docs/STAGE1_SEARCH.md](docs/STAGE1_SEARCH.md) |
| Team Filter | Stage 2 | `feature/filter` | [docs/STAGE2_FILTER.md](docs/STAGE2_FILTER.md) |
| Team Extract | Stage 3 | `feature/extract` | [docs/STAGE3_EXTRACT.md](docs/STAGE3_EXTRACT.md) |
| Team Validate | Stage 4 | `feature/validate` | [docs/STAGE4_VALIDATE.md](docs/STAGE4_VALIDATE.md) |
**New team member?** Read [CLAUDE.md](CLAUDE.md) first — it contains architecture, schema, and coding rules.
**AI coding agent?** Read [CLAUDE.md](CLAUDE.md) (Claude Code) or [AGENTS.md](AGENTS.md) (all others).
**Working in R?** See the R note in [CLAUDE.md](CLAUDE.md#r-support).
---
## Contributing
1. Branch from `dev` using your team's branch name (`feature/search`, etc.)
2. Use sample data in `misc/` to develop and test independently
3. Open a PR to `dev` when a feature is stable — don't wait until the end
4. `main` and `dev` are branch-protected; all merges require a PR review
---
## Related Projects
- [flora_search_approaches](https://github.com/forrtproject/flora_search_approaches) — original R-based pathway pipeline (reference implementation)
- [FLoRA database](https://forrt.org/replication-hub/flora/) — the database this tool feeds into
---
## License
MIT