https://github.com/forrtproject/flora-extractor

FloRA extractor

Host: GitHub
URL: https://github.com/forrtproject/flora-extractor
Owner: forrtproject
Created: 2026-04-29T18:44:33.000Z (2 months ago)
Default Branch: main
Last Pushed: 2026-05-11T18:58:25.000Z (about 2 months ago)
Last Synced: 2026-05-11T20:38:48.205Z (about 2 months ago)
Language: Python
Size: 605 KB
Stars: 2
Watchers: 2
Forks: 1
Open Issues: 16
Metadata Files:
- Readme: README.md
- Agents: AGENTS.md

# FLoRA Extractor

A Python tool that discovers, extracts, and validates replication and reproduction studies for the [FLoRA database](https://forrt.org/replication-hub/flora/).

**Part of the [FORRT](https://forrt.org) project.**

---

## What It Does

Starting from keyword searches of academic databases, FLoRA Extractor:

1. **Discovers** candidate replication/reproduction papers from OpenAlex and curated lists

2. **Filters** false positives using rule-based and LLM classification

3. **Extracts** the target study and replication outcome from each paper

4. **Validates** results through a crowdsourced voting web interface
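
As an illustration of the rule-based half of step 2, here is a minimal sketch of a keyword pre-filter. The patterns, the three labels, and the function name `rule_based_label` are assumptions made for this example, not the project's actual rules; the idea is only that cheap rules settle the obvious cases and leave ambiguous ones for the LLM classifier.

```python
import re

# Hypothetical patterns; the project's real rules may differ.
POSITIVE = re.compile(r"\b(replicat\w*|reproduc\w*)\b", re.IGNORECASE)
# "Replication" used in a biological sense rather than as a study type.
NEGATIVE = re.compile(r"\b(dna|viral|virus|genome)\s+replication\b", re.IGNORECASE)


def rule_based_label(title: str, abstract: str = "") -> str:
    """Return 'keep', 'reject', or 'unsure' for a candidate paper."""
    text = f"{title} {abstract}"
    if NEGATIVE.search(text):
        return "reject"
    if POSITIVE.search(title):
        return "keep"
    if POSITIVE.search(abstract):
        # Only the abstract mentions replication: defer to a second pass.
        return "unsure"
    return "reject"


if __name__ == "__main__":
    print(rule_based_label("A replication of the Stroop effect"))
    print(rule_based_label("Mechanisms of DNA replication in yeast"))
```

Records labelled `unsure` would be the ones worth sending to an LLM, which keeps API usage down.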

---

## Architecture

```
Stage 1: search/      → data/candidates.csv   (discover candidates)
Stage 2: filter/      → data/filtered.csv     (remove false positives)
Stage 3: extract/     → data/extracted.csv    (link original + code outcome)
Stage 4: validate/    → Flask web app         (human voting, export)
```

Each stage is independently runnable. See [CLAUDE.md](CLAUDE.md) for full technical details.
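
The stages can run independently because each one only reads the CSV written by the stage before it. A minimal sketch of that contract follows, using only the standard library. The helper `run_stage`, the toy filter `keep_with_doi`, and the `title_r` column are hypothetical names for illustration; the real stage scripts may be organised differently.

```python
import csv
from pathlib import Path
from typing import Callable, Iterable

Row = dict[str, str]


def run_stage(
    src: Path,
    dst: Path,
    transform: Callable[[Iterable[Row]], Iterable[Row]],
    fieldnames: list[str],
) -> int:
    """Read rows from src, apply transform, write them to dst.

    Returns the number of rows written.
    """
    with src.open(newline="", encoding="utf-8") as f:
        # Materialise inside the with-block so the file is still open.
        rows = list(transform(csv.DictReader(f)))
    dst.parent.mkdir(parents=True, exist_ok=True)
    with dst.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


def keep_with_doi(rows: Iterable[Row]) -> Iterable[Row]:
    """Toy stand-in for Stage 2: drop rows that have no replication DOI."""
    return (r for r in rows if r.get("doi_r"))
```

Under these assumptions, a call such as `run_stage(Path("data/candidates.csv"), Path("data/filtered.csv"), keep_with_doi, ["doi_r", "title_r"])` would play the role of Stage 2, and any stage can be developed against a sample CSV without running the others.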

---

## Quick Start

> **Note:** The commands below describe the target state once all stages are implemented.
> Currently, `flora_selected.csv` can be used to seed Stage 4 directly.

```bash
# 1. Clone and set up
git clone https://github.com/forrtproject/flora-extractor.git
cd flora-extractor
pip install -r requirements.txt
cp .env.example .env   # fill in your API keys

# 2. Run the pipeline
python search/run_search.py        # → data/candidates.csv
python filter/run_filter.py        # → data/filtered.csv
python extract/run_extract.py      # → data/extracted.csv

# 3. Start the validation web app
python -m validate.import_csv      # load into SQLite
python -m validate.app             # → http://localhost:5001
```

---

## API Keys Required

Add to your `.env` file (copy from `.env.example`):

```
RESEARCHER_EMAIL=you@example.com      # for OpenAlex/Crossref API politeness
GEMINI_API_KEY=...                    # primary LLM
GEMINI_API_KEY_2=...                  # optional: rotate for higher quota
OPENAI_API_KEY=...                    # fallback LLM (optional)
GROBID_URL=http://localhost:8070      # local GROBID server (optional, for full-text extraction)
```

Get a free Gemini API key at [aistudio.google.com](https://aistudio.google.com).
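
For orientation, here is a minimal sketch of how a stage might read these variables and rotate between the two Gemini keys. It assumes the variables are already present in the process environment (for example, exported by a dotenv loader). It also assumes that `RESEARCHER_EMAIL` is mandatory and that "rotate" means simple round-robin. `load_settings` and `key_rotation` are hypothetical names, not functions from this repository.

```python
import itertools
import os
from typing import Iterator, Mapping, Optional


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect pipeline settings from environment variables."""
    env = os.environ if env is None else env
    email = (env.get("RESEARCHER_EMAIL") or "").strip()
    if not email:
        # Assumption: the contact email is treated as required.
        raise RuntimeError("RESEARCHER_EMAIL is not set")
    gemini_keys = [
        key
        for key in (env.get("GEMINI_API_KEY"), env.get("GEMINI_API_KEY_2"))
        if key
    ]
    return {
        "email": email,
        "gemini_keys": gemini_keys,
        "openai_key": env.get("OPENAI_API_KEY") or None,
        "grobid_url": env.get("GROBID_URL") or None,
    }


def key_rotation(keys: list[str]) -> Iterator[str]:
    """Yield keys round-robin so consecutive requests alternate keys."""
    if not keys:
        raise ValueError("at least one API key is needed")
    return itertools.cycle(keys)
```

Optional values come back as `None` when unset, so callers can skip the OpenAI fallback or GROBID full-text extraction without extra checks.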

---

## Data Sources

**Bibliographic databases (primary):**

| Source                                             | Coverage                                         |
| -------------------------------------------------- | ------------------------------------------------ |
| [OpenAlex](https://openalex.org)                   | Broad academic literature, free API              |
| [Semantic Scholar](https://www.semanticscholar.org)| Supplementary coverage                           |
| [Crossref](https://www.crossref.org)               | DOI resolution and reference lists               |
| [OpenCitations](https://opencitations.net)         | Reference lists (where OpenAlex coverage is thin)|

**Curated lists (secondary, pluggable):**

| Source                                                                                | Coverage                            |
| ------------------------------------------------------------------------------------- | ----------------------------------- |
| [Bob Reed's Replication Network](https://replicationnetwork.com/replication-studies/) | Economics                           |
| [I4R](https://i4replication.org/reports/)                                             | Institute for Replication reports   |

Full-text acquisition (for Stage 3): [Unpaywall](https://unpaywall.org), [CORE](https://core.ac.uk), arXiv, OSF.
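
To make the keyword search against OpenAlex concrete, here is a minimal sketch that builds a request URL for its `/works` endpoint and passes the contact email as the `mailto` parameter. The function name, the example query, and the page size are assumptions for illustration; the actual search code may use different parameters or a client library, and the OpenAlex documentation should be checked for current usage requirements.

```python
from urllib.parse import urlencode

OPENALEX_WORKS = "https://api.openalex.org/works"


def build_openalex_url(query: str, email: str, per_page: int = 25) -> str:
    """Build a keyword-search URL for OpenAlex works.

    `email` would come from RESEARCHER_EMAIL and identifies the caller.
    """
    params = {"search": query, "per-page": per_page, "mailto": email}
    return f"{OPENALEX_WORKS}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_openalex_url("replication study", "you@example.com"))
```

The sketch only constructs the URL, so it can be tested offline; fetching and paging through results is left to whichever HTTP client the stage uses.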

---

## Output Schema

Each extracted record contains:

| Field | Description |
|-------|-------------|
| `doi_r` | Replication paper DOI |
| `doi_o` | Original target study DOI |
| `title_o` | Original target study title |
| `outcome` | success / failure / mixed / uninformative / descriptive |
| `outcome_phrase` | Supporting quote from the paper |
| `link_evidence` | Evidence used to identify the original |
| `validation_status` | confirmed / rejected / pending / needs_review |

Full schema: [shared/schema.py](shared/schema.py)
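
As a rough illustration of the fields above, the sketch below models one record as a dataclass and rejects values outside the listed vocabularies. This is not the contents of `shared/schema.py`: the class name, the all-string field types, and the `pending` default are assumptions.

```python
from dataclasses import dataclass

OUTCOMES = {"success", "failure", "mixed", "uninformative", "descriptive"}
STATUSES = {"confirmed", "rejected", "pending", "needs_review"}


@dataclass
class ExtractedRecord:
    doi_r: str            # replication paper DOI
    doi_o: str            # original target study DOI
    title_o: str          # original target study title
    outcome: str          # one of OUTCOMES
    outcome_phrase: str   # supporting quote from the paper
    link_evidence: str    # evidence used to identify the original
    validation_status: str = "pending"  # assumed default before any votes

    def __post_init__(self) -> None:
        # Fail early on values outside the documented vocabularies.
        if self.outcome not in OUTCOMES:
            raise ValueError(f"unknown outcome: {self.outcome!r}")
        if self.validation_status not in STATUSES:
            raise ValueError(f"unknown validation_status: {self.validation_status!r}")
```

Checking the vocabularies when a record is built would catch typos such as `sucess` before they reach the validation web app.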

---

## Team Guide

| Team | Stage | Branch | Docs |
|------|-------|--------|------|
| Team Search | Stage 1 | `feature/search` | [docs/STAGE1_SEARCH.md](docs/STAGE1_SEARCH.md) |
| Team Filter | Stage 2 | `feature/filter` | [docs/STAGE2_FILTER.md](docs/STAGE2_FILTER.md) |
| Team Extract | Stage 3 | `feature/extract` | [docs/STAGE3_EXTRACT.md](docs/STAGE3_EXTRACT.md) |
| Team Validate | Stage 4 | `feature/validate` | [docs/STAGE4_VALIDATE.md](docs/STAGE4_VALIDATE.md) |

**New team member?** Read [CLAUDE.md](CLAUDE.md) first — it contains architecture, schema, and coding rules.  

**AI coding agent?** Read [CLAUDE.md](CLAUDE.md) (Claude Code) or [AGENTS.md](AGENTS.md) (all others).  

**Working in R?** See the R note in [CLAUDE.md](CLAUDE.md#r-support).

---

## Contributing

1. Branch from `dev` using your team's branch name (`feature/search`, etc.)

2. Use sample data in `misc/` to develop and test independently

3. Open a PR to `dev` when a feature is stable — don't wait until the end

4. `main` and `dev` are branch-protected; all merges require a PR review

---

## Related Projects

- [flora_search_approaches](https://github.com/forrtproject/flora_search_approaches) — original R-based pathway pipeline (reference implementation)

- [FLoRA database](https://forrt.org/replication-hub/flora/) — the database this tool feeds into

---

## License

MIT
