{"id":51340025,"url":"https://github.com/melika-kheirieh/async-job-api","last_synced_at":"2026-07-02T06:33:21.669Z","repository":{"id":363836366,"uuid":"1260196570","full_name":"melika-kheirieh/async-job-api","owner":"melika-kheirieh","description":"FastAPI + Celery async job API with PostgreSQL-backed status tracking, retry/failure handling, stuck-job recovery, idempotency keys, Docker Compose, and tests.","archived":false,"fork":false,"pushed_at":"2026-06-28T08:59:14.000Z","size":122,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-28T09:21:26.355Z","etag":null,"topics":["alembic","async-jobs","backend","celery","docker-compose","fastapi","postgresql","pytest","redis","sqlalchemy"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/melika-kheirieh.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-06-05T08:44:00.000Z","updated_at":"2026-06-28T08:58:52.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/melika-kheirieh/async-job-api","commit_stats":null,"previous_names":["melika-kheirieh/async-job-api"],"tags_count":3,"template":false,"template_full_name":null,"purl":"pkg:github/melika-kheirieh/async-job-api","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/melika-kheirieh%2Fasync-job-api","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/melika-kheirieh%2Fasync-job-api/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/melika-kheirieh%2Fasync-job-api/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/melika-kheirieh%2Fasync-job-api/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/melika-kheirieh","download_url":"https://codeload.github.com/melika-kheirieh/async-job-api/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/melika-kheirieh%2Fasync-job-api/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":35036553,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-07-02T02:00:06.368Z","response_time":173,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["alembic","async-jobs","backend","celery","docker-compose","fastapi","postgresql","pytest","redis","sqlalchemy"],"created_at":"2026-07-02T06:33:21.025Z","updated_at":"2026-07-02T06:33:21.662Z","avatar_url":"https://github.com/melika-kheirieh.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Async Job API\n\n[![CI](https://github.com/melika-kheirieh/async-job-api/actions/workflows/ci.yml/badge.svg)](https://github.com/melika-kheirieh/async-job-api/actions/workflows/ci.yml)\n\nA compact FastAPI, SQLAlchemy, and Celery backend for durable asynchronous job\nprocessing.\n\nThe API accepts work and returns immediately. Celery processes jobs in the\nbackground, Redis coordinates task delivery, and PostgreSQL remains the source\nof truth for every lifecycle transition.\n\n\u003e This project is production-aware, not production-complete. It handles selected\n\u003e retry, duplicate-delivery, and recovery risks without claiming exactly-once\n\u003e execution or exactly-once side effects.\n\n## What This Demonstrates\n\n- FastAPI API design around durable asynchronous work.\n- PostgreSQL-backed job lifecycle state instead of in-memory task status.\n- Celery and Redis integration with the database as the source of truth.\n- Guarded state transitions for claim, cancellation, and recovery paths.\n- Explicit retry state for jobs waiting on another Celery delivery.\n- A separate processor boundary for demo payload handling.\n- Idempotent job submission using a database uniqueness boundary.\n- Focused tests for API, service, repository, processor, and worker behavior.\n- Clear production boundaries for what the project does and does not guarantee.\n\n## Final Demo\n\nRun the full Docker-based demo:\n\n```bash\n./scripts/e2e_smoke.sh\n```\n\nThe script starts a clean Compose stack, applies migrations, and verifies the\nmain public workflows end to end:\n\n- canceling a waiting job before a worker can claim it;\n- completing a successful job;\n- persisting a non-retryable failure;\n- exhausting retryable failures with Celery retry scheduling;\n- returning the same job for a duplicate idempotency key;\n- filtering `canceled`, `completed`, and `failed` jobs through the list API.\n\nThe cancellation scenario intentionally starts the API before the worker so the\njob remains in `queued` long enough to cancel deterministically.\n\nFor a faster local check that does not require Docker services, run:\n\n```bash\npytest -q\n```\n\n## Reliability Model\n\n| Concern | Current approach |\n|---|---|\n| Durable job state | PostgreSQL stores status, payload, result, errors, attempts, and timestamps. |\n| Task delivery | Redis brokers Celery delivery; API-visible state is read from PostgreSQL. |\n| Concurrent delivery | A conditional database update allows only one successful claim. |\n| Retry visibility | Retryable failures persist `retrying` before Celery schedules another attempt. |\n| Cancellation | Waiting jobs can be canceled through a guarded transition. |\n| Duplicate submission | A unique idempotency key returns the existing job. |\n| Stuck execution | Manual recovery fails old `running` jobs conditionally. |\n| Operational visibility | Stable lifecycle events include `job_id` and execution context. |\n| Exactly-once behavior | Not guaranteed; side-effecting handlers must be idempotent. |\n\n## Architecture\n\n![Async Job API architecture](docs/assets/async-job-api-architecture.png)\n\n```mermaid\nflowchart TD\n    Client[\"Client\"] --\u003e API[\"FastAPI\"]\n    API --\u003e Redis[\"Redis broker\"]\n    Redis --\u003e Worker[\"Celery worker\"]\n    API --\u003e DB[\"PostgreSQL source of truth\"]\n    Worker --\u003e DB\n    Worker --\u003e Processor[\"Demo processor\"]\n```\n\nApplication boundaries remain small and explicit:\n\n```text\nRouter -\u003e Service -\u003e Repository -\u003e Database\nCelery task -\u003e Worker orchestration -\u003e Processor\nWorker/API -\u003e Service -\u003e Repository -\u003e Database\n```\n\n- The router owns HTTP concerns.\n- The service owns job use cases.\n- The repository owns persistence and guarded transitions.\n- The worker owns orchestration: claim, process, mark final state, retry, and log.\n- The processor owns demo payload behavior and retryable/non-retryable errors.\n- The Celery task owns retry scheduling and delegates processing to testable logic.\n\nTasks carry only a `job_id`. The worker loads the latest payload and state from\nPostgreSQL before attempting a guarded claim.\n\n## Job Lifecycle\n\n```mermaid\nstateDiagram-v2\n    [*] --\u003e queued\n    queued --\u003e running: claim\n    queued --\u003e canceled: cancel\n    running --\u003e completed: success\n    running --\u003e failed: permanent failure\n    running --\u003e retrying: retryable failure\n    retrying --\u003e running: reclaim\n    retrying --\u003e canceled: cancel\n    retrying --\u003e failed: retry limit exhausted\n```\n\n| Status | Meaning |\n|---|---|\n| `queued` | Waiting to be claimed. |\n| `running` | Claimed and being processed. |\n| `retrying` | Waiting for another attempt. |\n| `canceled` | Canceled before a worker could claim or reclaim it. |\n| `completed` | Finished successfully. |\n| `failed` | Permanently failed or recovered as stuck. |\n\nOnly `queued` and `retrying` jobs are claimable. Those two states are also\ncancelable. `completed`, `failed`, and `canceled` are terminal.\n\nClaiming is performed with a conditional database update. Two workers may\nreceive the same task, but only one can transition the job from a claimable state\nto `running`. The `attempts` counter increases only when that transition succeeds.\n\n## Quick Start\n\nStart PostgreSQL and Redis:\n\n```bash\ndocker compose up -d postgres redis\n```\n\nApply migrations:\n\n```bash\ndocker compose run --rm api alembic upgrade head\n```\n\nStart the API and worker:\n\n```bash\ndocker compose up --build api worker\n```\n\nThe API is available at `http://localhost:8001` and its OpenAPI documentation at\n`http://localhost:8001/docs`.\n\nFor manual API checks, set:\n\n```bash\nBASE_URL=http://localhost:8001\n```\n\nView worker logs:\n\n```bash\ndocker compose logs -f worker\n```\n\nStop the stack with `docker compose down`. Use `docker compose down -v` only when\nyou also want to delete local PostgreSQL data.\n\n## API\n\n| Method | Endpoint | Purpose |\n|---|---|---|\n| `POST` | `/jobs` | Create and enqueue a job. |\n| `GET` | `/jobs/{job_id}` | Read durable state and result. |\n| `GET` | `/jobs` | Filter and paginate jobs. |\n| `POST` | `/jobs/{job_id}/cancel` | Cancel a waiting job. |\n\n### Create a Job\n\n```bash\ncurl -X POST \"$BASE_URL/jobs\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"payload\": {\"text\": \"hello backend\"}}'\n```\n\nThe endpoint returns `201 Created` with a persisted job in `queued`. Submission\ndoes not mean the background work has completed.\n\nAdd an optional idempotency key when clients may repeat a submission:\n\n```bash\ncurl -X POST \"$BASE_URL/jobs\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"payload\": {\"text\": \"same request\"},\n    \"idempotency_key\": \"demo-123\"\n  }'\n```\n\nRepeating the same key returns the existing job without intentionally enqueueing\nanother task. The API treats the key as the request identity and does not compare\npayloads for conflicting reuse. This deduplicates job creation, not execution or\nside effects.\n\n### Read and List Jobs\n\n```bash\ncurl \"$BASE_URL/jobs/1\"\ncurl \"$BASE_URL/jobs?status=failed\u0026limit=20\u0026offset=0\"\n```\n\nList behavior:\n\n- `status` is optional and accepts any lifecycle status;\n- `limit` accepts 1-100;\n- `offset` must be non-negative;\n- results are ordered newest first;\n- responses include `items`, `limit`, `offset`, and total matching `count`.\n\nUnknown job IDs return `404`; invalid query parameters return `422`.\n\n### Cancel a Job\n\n```bash\ncurl -X POST \"$BASE_URL/jobs/1/cancel\"\n```\n\nOnly jobs in `queued` or `retrying` can be canceled. Canceling a `running`,\n`completed`, `failed`, or already `canceled` job returns `409 Conflict`. Missing\njobs return `404`.\n\nCancellation updates PostgreSQL state. It does not revoke a Celery task or\ndelete an already-published Redis message; a later stale delivery is skipped\nbecause the worker cannot claim a canceled job.\n\n## Failure and Retry Behavior\n\nThe demo processor exposes two deterministic failure inputs.\n\n```bash\n# Non-retryable: becomes failed immediately\ncurl -X POST \"$BASE_URL/jobs\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"payload\": {\"text\": \"bad input\", \"fail\": true}}'\n\n# Retryable: enters retrying and eventually fails after the retry limit\ncurl -X POST \"$BASE_URL/jobs\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"payload\": {\"text\": \"temporary issue\", \"transient_fail\": true}}'\n```\n\nThe retry policy allows three retries after the initial attempt. With the current\nretry limit, scheduled countdowns are 1, 2, and 4 seconds; the backoff helper is\ncapped at 30 seconds. Before each retry, the latest error is persisted and the\nnext delivery must claim the job again.\n\n## Stuck-Job Recovery\n\nA worker can stop after claiming a job but before persisting its final state. The\nservice exposes:\n\n```python\nrecover_stuck_jobs(timeout_minutes=10)\n```\n\nRecovery fails old `running` jobs instead of automatically requeueing work whose\nprevious outcome is uncertain. Its write succeeds only if the job is still\n`running` and the observed `started_at` has not changed.\n\nRecovery is manually invoked through the recovery CLI:\n\n```bash\ndocker compose run --rm api python -m app.cli.recover_stuck_jobs --timeout-minutes 10\n```\n\nScheduling, leases, heartbeats, fencing, and full stale-worker protection remain\nout of scope. See the [operational runbook](docs/runbook.md) for local\ndiagnosis and recovery commands.\n\n## Lifecycle Logging\n\nLogs use stable events such as `job_created`, `job_claimed`, `job_retrying`,\n`job_retry_scheduled`, `job_canceled`, `job_completed`, `job_failed`,\n`job_skipped`, and `stuck_job_recovered`.\n\n```text\nevent=job_claimed job_id=42 status=running attempts=2\nevent=job_completed job_id=42 status=completed attempts=2\n```\n\nThis is lightweight lifecycle logging, not durable event history or a\ncentralized observability stack.\n\n## Tests\n\nRun the fast test suite:\n\n```bash\npytest -q\n```\n\nIt covers API, service, repository, worker lifecycle, retry, cancellation,\nduplicate delivery, idempotency, listing, and recovery behavior without requiring\na live broker.\n\nRun the multi-service smoke test:\n\n```bash\n./scripts/e2e_smoke.sh\n```\n\nThe script starts a clean Docker Compose stack, applies migrations, and verifies\ncancellation, successful completion, non-retryable failure, retry exhaustion,\nduplicate idempotency behavior, and filtered job listing. It exits non-zero on\nfailure and cleans up on exit.\n\nGitHub Actions runs `pytest -q` with Python 3.12 on pushes and pull requests. The\nDocker smoke test remains a manual integration check.\n\n## Decisions and Boundaries\n\n- [Architecture decisions](docs/decisions.md)\n- [Production boundaries](docs/production-boundaries.md)\n- [Operational runbook](docs/runbook.md)\n\nImportant non-guarantees include:\n\n- no exactly-once execution or side effects;\n- no atomic database-to-broker publication;\n- no broker-level cancellation or task revocation;\n- no automatic scheduled recovery or dead-letter workflow;\n- no full stale-worker fencing;\n- no production observability or deployment hardening.\n\nThe project intentionally avoids expanding into Kafka, Kubernetes, multiple job\ntypes, priority queues, an admin dashboard, distributed locking, or a complete\nmonitoring stack without a concrete operational need.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmelika-kheirieh%2Fasync-job-api","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmelika-kheirieh%2Fasync-job-api","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmelika-kheirieh%2Fasync-job-api/lists"}