{"id":49090694,"url":"https://github.com/forrtproject/flora_preprint_notifier","last_synced_at":"2026-04-20T18:05:13.700Z","repository":{"id":319676246,"uuid":"1070680538","full_name":"forrtproject/flora_preprint_notifier","owner":"forrtproject","description":"Code to check preprint references for potentially missing replication studies and notify authors","archived":false,"fork":false,"pushed_at":"2026-03-23T19:45:25.000Z","size":1985050,"stargazers_count":6,"open_issues_count":0,"forks_count":2,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-03-24T17:53:48.805Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/forrtproject.png","metadata":{"files":{"readme":"readme.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":".github/CODEOWNERS","security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":"AGENTS.md","dco":null,"cla":null}},"created_at":"2025-10-06T09:44:16.000Z","updated_at":"2026-03-23T19:45:30.000Z","dependencies_parsed_at":"2025-10-19T23:25:15.958Z","dependency_job_id":null,"html_url":"https://github.com/forrtproject/flora_preprint_notifier","commit_stats":null,"previous_names":["forrtproject/fred_preprint_bot","forrtproject/flora_preprint_notifier"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/forrtproject/flora_preprint_notifier","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/forrtproject%2Fflora_preprint_notifier","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/forrtproject%2Fflora_preprint_notifier/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/forrtproject%2Fflora_preprint_notifier/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/forrtproject%2Fflora_preprint_notifier/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/forrtproject","download_url":"https://codeload.github.com/forrtproject/flora_preprint_notifier/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/forrtproject%2Fflora_preprint_notifier/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32059144,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-20T11:35:06.609Z","status":"ssl_error","status_checked_at":"2026-04-20T11:34:48.899Z","response_time":94,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-04-20T18:05:12.038Z","updated_at":"2026-04-20T18:05:13.684Z","avatar_url":"https://github.com/forrtproject.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# OSF Preprints - Modular Pipeline (No Celery)\n\nThis repository runs a bounded, stage-based pipeline for OSF preprints using DynamoDB as the single source of truth.\n\nPipeline stages:\n1. `sync`: ingest preprints from OSF\n2. `pdf`: download/convert primary files\n3. `grobid`: generate TEI from PDFs\n4. `extract`: parse TEI and write references\n5. `enrich`: fill missing reference DOIs\n6. `flora`: FLoRA lookup + screening\n7. `author`: author/email candidate extraction\n\nAll stages run as normal Python commands and exit. Scheduling is external (cron or GitHub Actions).\nThe `flora` stage checks whether originals have replications cited in the FLoRA database (the FORRT Library of Replication Attempts).\n\nSee the [trial flowchart](docs/protocol_flowchart.md) for a visual overview of how a preprint flows through the pipeline.\n\n## Quick Start (Local)\n\n1. Create a virtual environment and install Python dependencies:\n```bash\npython -m venv .venv\nsource .venv/bin/activate\npip install -r requirements.txt\n```\n2. Install LibreOffice (`soffice`) locally if you need DOCX -\u003e PDF conversion in the `pdf` stage.\n3. Configure `.env`:\n```bash\ncp .env.example .env\n```\n4. Review committed runtime rules in `config/runtime.toml` (for example `ingest.anchor_date` and FLORA endpoint).\n5. Start local infrastructure services (optional if you use AWS DynamoDB and/or a remote GROBID):\n```bash\ndocker compose up -d dynamodb-local grobid\n```\n6. Initialize DynamoDB tables:\n```bash\npython -c \"from osf_sync.db import init_db; init_db(); print('Dynamo tables ready')\"\n```\n7. Run pipeline stages:\n```bash\npython -m osf_sync.pipeline run --stage sync --limit 1000\npython -m osf_sync.pipeline run --stage pdf --limit 100\npython -m osf_sync.pipeline run --stage grobid --limit 50\npython -m osf_sync.pipeline run --stage extract --limit 200\npython -m osf_sync.pipeline run --stage enrich --limit 300\npython -m osf_sync.pipeline run --stage flora --limit-lookup 200 --limit-screen 500\n```\n\n## Main Commands\n\nSingle stage:\n```bash\npython -m osf_sync.pipeline run --stage \u003csync|pdf|grobid|extract|enrich|flora|author\u003e [options]\n```\n\nFull bounded run:\n```bash\npython -m osf_sync.pipeline run-all \\\n  --sync-limit 1000 --pdf-limit 100 --grobid-limit 50 --extract-limit 200 --enrich-limit 300\n```\n`run-all` includes the `author` stage by default; use `--skip-author` to disable it for a run.\nBy default, `run-all` keeps local PDF/TEI files during `author`; use `--cleanup-author-files` to allow cleanup.\nBy default, `author` updates DynamoDB only (no local CSV output). Use `--write-debug-csv` (and optionally `--out`) for local debug snapshots.\n\nAd-hoc window sync:\n```bash\npython -m osf_sync.pipeline sync-from-date --start 2025-07-01\n```\n\nOne-off preprint:\n```bash\npython -m osf_sync.pipeline fetch-one --id \u003cOSF_ID\u003e\n# or\npython -m osf_sync.pipeline fetch-one --doi \u003cDOI_OR_URL\u003e\n```\n\nAuthor-cluster randomisation (standalone, not in `run-all`):\n```bash\npython -m osf_sync.pipeline author-randomize \\\n  --network-state-key trial:author_network_state\n```\nOptionally add `--authors-csv \u003cpath\u003e` to use an enriched author CSV if available.\nStatus: this workflow is not yet validated end-to-end in production and should be treated as experimental.\nThis command processes only unassigned preprints.\nIf no prior trial network exists, it initializes one from those preprints; otherwise it loads the latest network from DynamoDB and augments it.\nAllocations, graph state, and run metadata are stored in DynamoDB trial tables plus `sync_state`.\nUse `--dry-run` to preview candidate processing and allocation counts without writing to DynamoDB:\n```bash\npython -m osf_sync.pipeline author-randomize --dry-run\n```\n\n`python -m osf_sync.cli ...` is now a thin alias to the same pipeline CLI.\n## Common Options\n\n- `--limit`: max items for the stage.\n- `--max-seconds`: stop the stage after N seconds.\n- `--dry-run`: compute/select work without executing mutations.\n- `--debug`: enable verbose logging.\n- `--owner` and `--lease-seconds` (queue stages): override DynamoDB claim ownership/lease duration.\n- `--skip-author` (`run-all`): skip author extraction when needed.\n- `--cleanup-author-files` (`run-all`): allow author stage file deletion (off by default).\n- `--write-debug-csv` (`author` stage): write a local debug CSV snapshot (`--out` overrides the default path).\n\n## Environment (`.env`)\n\n```dotenv\n# local Docker GROBID:\nGROBID_URL=http://localhost:8070\n# remote GROBID example:\n# GROBID_URL=https://grobid.example.org\nGROBID_INCLUDE_RAW_CITATIONS=true\nPIPELINE_ENV=dev\nDDB_BILLING_MODE=PAY_PER_REQUEST\nDEV_SYNC_LOOKBACK_DAYS=7\n# Optional explicit override for sync start date:\n# SYNC_START_DATE_OVERRIDE=2026-01-01\n# Optional explicit override for sync end date:\n# SYNC_END_DATE_OVERRIDE=2026-03-15\n# Safety default: override runs do not rewrite sync cursor.\nSYNC_OVERRIDE_WRITES_CURSOR=false\n# Optional global cursor-write disable.\nSYNC_DISABLE_CURSOR_WRITE=false\nDYNAMO_LOCAL_URL=http://localhost:8000\nAWS_REGION=eu-north-1\nAWS_SECRET_ACCESS_KEY=\u003cAWS_SECRET_ACCESS_KEY\u003e\nAWS_ACCESS_KEY_ID=\u003cAWS_ACCESS_KEY_ID\u003e\nDDB_TABLE_PREPRINTS=dev_preprints\nDDB_TABLE_REFERENCES=dev_preprint_references\nDDB_TABLE_TEI=dev_preprint_tei\nDDB_TABLE_EXCLUDED_PREPRINTS=dev_excluded_preprints\nDDB_TABLE_SYNCSTATE=dev_sync_state\nDDB_TABLE_API_CACHE=dev_api_cache\nDDB_TABLE_TRIAL_AUTHOR_NODES=dev_trial_author_nodes\nDDB_TABLE_TRIAL_AUTHOR_TOKENS=dev_trial_author_tokens\nDDB_TABLE_TRIAL_CLUSTERS=dev_trial_clusters\nDDB_TABLE_TRIAL_ASSIGNMENTS=dev_trial_preprint_assignments\nOPENALEX_EMAIL=\u003cPERSONAL_EMAIL_ID\u003e\nPDF_DEST_ROOT=./data/preprints\nLOG_LEVEL=INFO\nOSF_INGEST_SKIP_EXISTING=false\nAPI_CACHE_TTL_MONTHS=6\nPIPELINE_CLAIM_LEASE_SECONDS=1800\n```\n\n`sync` window behavior:\n- `PIPELINE_ENV=dev`: sync uses a rolling `DEV_SYNC_LOOKBACK_DAYS` window (default 7).\n- `PIPELINE_ENV=prod`: sync uses `ingest.anchor_date`/`ingest.window_months` from `config/runtime.toml`.\n- In prod, changing `ingest.anchor_date` or `ingest.window_months` triggers an automatic bounded backfill (controlled by `ingest.backfill_on_config_change`).\n- `SYNC_START_DATE_OVERRIDE` (optional): forces an explicit start date in either mode.\n- `SYNC_END_DATE_OVERRIDE` (optional): explicit end date; in prod override mode, omitted end defaults to `ingest.anchor_date`.\n- Recommended naming: keep local `.env` on `dev_*` tables; GH Actions prod workflows are set to `prod_*`.\n- `SYNC_OVERRIDE_WRITES_CURSOR=false` (default) keeps continuation cursor unchanged during override/backfill runs.\n\nBackfill without breaking continuation:\n1. In GROBID workflow dispatch, set `sync_start_date_override` (and optionally `sync_end_date_override`).\n2. Leave `sync_override_writes_cursor` as `false` (default).\n3. Run backfill as needed; normal continuation cursor is preserved.\n4. Clear override inputs for subsequent normal runs.\n\n## Runtime Rules (`config/runtime.toml`)\n\nThese non-secret operational rules are committed in git:\n\n```toml\n[ingest]\nanchor_date = \"2026-02-20\" # ISO date/timestamp; empty disables date-window filter\nwindow_months = 6\n\n[flora]\noriginal_lookup_url = \"https://rep-api.forrt.org/v1/original-lookup\"\ncache_ttl_hours = 48\ncsv_url = \"https://github.com/forrtproject/FReD-data/raw/refs/heads/main/output/flora_filtered.csv\"\ncsv_path = \"data/flora.csv\"\n```\n\n## Scheduling\n\nUse either:\n- Cron/systemd timers on a VM, or\n- GitHub Actions `schedule` workflows.\n\nRecommended pattern:\n- Run each stage independently on a cadence with bounded limits.\n- Allow overlap; claim/lease fields in DynamoDB prevent duplicate processing.\n\n## DynamoDB Queue Flow\n\n1. `sync` sets `queue_pdf=pending` when eligible.\n2. `pdf` marks `queue_pdf=done`, `queue_grobid=pending`.\n3. `grobid` marks `queue_grobid=done`, `queue_extract=pending`.\n4. `extract` marks `queue_extract=done`.\n\nQueue stages use claim/lease metadata (`claim_*_owner`, `claim_*_until`) and error tracking fields (`last_error_*`, `retry_count_*`).\n\n## DOI Experiment Command\n\nUse the module entrypoint directly for DOI matching experiments:\n\n```bash\npython -m osf_sync.augmentation.doi_multi_method_lookup --from-db --limit 400 --output doi_multi_method.csv\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fforrtproject%2Fflora_preprint_notifier","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fforrtproject%2Fflora_preprint_notifier","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fforrtproject%2Fflora_preprint_notifier/lists"}