https://github.com/fn-devx/crawler-ai-task
Last synced: 20 days ago
- Host: GitHub
- URL: https://github.com/fn-devx/crawler-ai-task
- Owner: fn-devX
- Created: 2026-05-30T22:48:01.000Z (30 days ago)
- Default Branch: main
- Last Pushed: 2026-05-30T22:50:46.000Z (30 days ago)
- Last Synced: 2026-05-31T00:14:48.946Z (30 days ago)
- Language: Python
- Size: 27.3 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: newskg/__init__.py
README
---
## How it works
```
topic listing + article pages
      │
┌───────────┐      ┌──────────────┐     ┌────────────────┐     ┌────────────┐
│  Crawler  │────> │    Source    │────>│   Extractor    │────>│   Store    │
│ async I/O │      │  TechCrunch  │     │  Claude tool-  │     │   SQLite   │
│  + retry  │      │    parser    │     │   use / LLM    │     │   graph    │
└───────────┘      └──────────────┘     └────────────────┘     └─────┬──────┘
                                                                     │
   fetch → parse → extract → resolve → store                         │
                                                                     ▼
                                                               ┌────────────┐
                                                               │  FastAPI   │
                                                               │    API     │
                                                               └────────────┘
```
Five stages run end to end: **crawl** the pages, **parse** them into clean articles,
**extract** people and relationships with the LLM, **resolve** duplicate names into one
entity, then **store and serve** the graph. Each layer talks to the next through a small
interface, so sources, extractors, and storage are all swappable.
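One way to picture the "small interface" idea is with `typing.Protocol`. This is only a sketch under stated assumptions: every name in it (`Article`, `Source`, `Extractor`, `Store`, `run_pipeline`, and the toy implementations) is hypothetical and not taken from the `newskg` package, and the resolver is a naive case-insensitive merge standing in for whatever the project actually does.

```python
from dataclasses import dataclass
from typing import Iterable, Protocol


@dataclass(frozen=True)
class Article:
    url: str
    text: str


# Hypothetical stage interfaces; the real package may define them differently.
class Source(Protocol):
    def parse(self, url: str, html: str) -> Article: ...


class Extractor(Protocol):
    def extract(self, article: Article) -> list[str]: ...


class Store(Protocol):
    def add(self, entity: str, url: str) -> None: ...


def resolve(name: str) -> str:
    """Stand-in resolver: collapse whitespace and ignore case."""
    return " ".join(name.split()).casefold()


def run_pipeline(pages: Iterable[tuple[str, str]], source: Source,
                 extractor: Extractor, store: Store) -> None:
    """Run parse -> extract -> resolve -> store over already-fetched pages."""
    for url, html in pages:
        article = source.parse(url, html)
        for name in extractor.extract(article):
            store.add(resolve(name), url)


# Toy implementations that satisfy the protocols structurally.
class PlainSource:
    def parse(self, url: str, html: str) -> Article:
        return Article(url=url, text=html)


class CapitalizedPairExtractor:
    """Heuristic: treat two adjacent title-cased words as a person name."""

    def extract(self, article: Article) -> list[str]:
        words = article.text.split()
        return [f"{a} {b}" for a, b in zip(words, words[1:])
                if a.istitle() and b.istitle()]


class MemoryStore:
    def __init__(self) -> None:
        self.mentions: dict[str, set[str]] = {}

    def add(self, entity: str, url: str) -> None:
        self.mentions.setdefault(entity, set()).add(url)
```

Because `run_pipeline` depends only on the three protocols, swapping the toy extractor for an LLM-backed one, or the in-memory store for SQLite, would not change the loop itself.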
---
## Run it
```bash
# Python 3.10+
python -m venv .venv && source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -e ".[dev]"
cp .env.example .env # then set ANTHROPIC_API_KEY
# no key? run fully offline: export NEWSKG_EXTRACTOR=heuristic
uvicorn newskg.api:app --reload
```
Then open **http://localhost:8000/docs** — the interactive Swagger UI documents every
endpoint, with request/response shapes and examples. (ReDoc is at `/redoc`.)
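FastAPI also serves the machine-readable schema behind the Swagger UI, by default at `/openapi.json`. The sketch below lists the endpoints from that schema. It assumes the server is running on `localhost:8000` and that the default schema URL has not been changed; `list_endpoints` and `fetch_schema` are illustrative helper names, not part of the project.

```python
import json
from urllib.request import urlopen

HTTP_METHODS = {"get", "post", "put", "patch", "delete", "head", "options"}


def list_endpoints(schema: dict) -> list[tuple[str, str]]:
    """Return sorted (METHOD, path) pairs from an OpenAPI schema dict."""
    pairs = []
    for path, item in schema.get("paths", {}).items():
        for method in item:
            # Path items can also hold non-method keys such as "parameters".
            if method in HTTP_METHODS:
                pairs.append((method.upper(), path))
    return sorted(pairs)


def fetch_schema(base_url: str = "http://localhost:8000") -> dict:
    """Download the schema; requires the uvicorn server to be running."""
    with urlopen(f"{base_url}/openapi.json", timeout=5) as resp:
        return json.load(resp)


# Usage (needs the server running):
#   for method, path in list_endpoints(fetch_schema()):
#       print(method, path)
```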
---
## Test it
```bash
pytest
```
Tests run fully offline — no network and no API key required — and cover entity
resolution, the store's merge/dedup logic, the TechCrunch parser, the API, and the
evaluation metrics.
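To illustrate what an offline entity-resolution test can look like, here is a pytest-style sketch. `canonical_key` and `merge_mentions` are hypothetical stand-ins defined inline, not functions from `newskg`, and the real resolver may apply different rules.

```python
import unicodedata
from collections import defaultdict


def canonical_key(name: str) -> str:
    """Hypothetical normalizer: strip accents, case, and extra whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.casefold().split())


def merge_mentions(names: list[str]) -> dict[str, list[str]]:
    """Group surface forms that share a canonical key."""
    groups: dict[str, list[str]] = defaultdict(list)
    for name in names:
        groups[canonical_key(name)].append(name)
    return dict(groups)


def test_variants_merge_into_one_entity():
    merged = merge_mentions(["Jane Doe", "jane  doe", "JANE DOE"])
    assert list(merged) == ["jane doe"]
    assert len(merged["jane doe"]) == 3


def test_distinct_people_stay_separate():
    merged = merge_mentions(["Jane Doe", "John Roe"])
    assert len(merged) == 2
```

Neither test touches the network or needs an API key, which is the property the suite above is described as having.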
# P.S.
# good joke with apple :)