https://github.com/fn-devx/crawler-ai-task

Host: GitHub
URL: https://github.com/fn-devx/crawler-ai-task
Owner: fn-devX
Created: 2026-05-30T22:48:01.000Z (30 days ago)
Default Branch: main
Last Pushed: 2026-05-30T22:50:46.000Z (30 days ago)
Last Synced: 2026-05-31T00:14:48.946Z (30 days ago)
Language: Python
Size: 27.3 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: newskg/__init__.py

README

---

## How it works

```
topic listing + article pages
      │
┌───────────┐     ┌──────────────┐     ┌────────────────┐     ┌────────────┐
│  Crawler  │────>│    Source    │────>│   Extractor    │────>│   Store    │
│ async I/O │     │  TechCrunch  │     │  Claude tool-  │     │   SQLite   │
│  + retry  │     │    parser    │     │   use / LLM    │     │   graph    │
└───────────┘     └──────────────┘     └────────────────┘     └─────┬──────┘
                                                                    │
fetch → parse → extract → resolve → store                           │
                                                                    ▼
                                                              ┌────────────┐
                                                              │  FastAPI   │
                                                              │    API     │
                                                              └────────────┘
```
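
The Crawler box in the diagram mentions async I/O plus retry. A minimal sketch of that combination follows, using only the standard library. The names `with_retry` and `crawl`, the backoff schedule, and the concurrency limit are illustrative assumptions, not the repository's actual code; `fetch` stands for whatever coroutine downloads a page.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def with_retry(
    op: Callable[[], Awaitable[T]],
    attempts: int = 3,
    base_delay: float = 0.5,
) -> T:
    """Await op(), retrying on any exception with exponential backoff."""
    for attempt in range(attempts):
        try:
            return await op()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, 4x, ... before the next try.
            await asyncio.sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")


async def crawl(urls, fetch, concurrency=5, base_delay=0.5):
    """Fetch every URL concurrently, at most `concurrency` at a time."""
    gate = asyncio.Semaphore(concurrency)

    async def one(url):
        async with gate:
            return await with_retry(lambda: fetch(url), base_delay=base_delay)

    # gather() returns results in the same order as `urls`.
    return await asyncio.gather(*(one(u) for u in urls))
```

Passing `fetch` in as a parameter keeps the retry logic independent of any particular HTTP client, so it can be exercised with a fake coroutine and no network.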

Five stages run end to end: **crawl** the pages, **parse** them into clean articles,
**extract** people and relationships with the LLM, **resolve** duplicate names into one
entity, then **store and serve** the graph. Each layer talks to the next through a small
interface, so sources, extractors, and storage are all swappable.
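
To illustrate what "a small interface" between layers can look like, here is a hedged sketch using `typing.Protocol`. Every class and method name below (`Article`, `Relation`, `Source.parse`, `Extractor.extract`, `Store.add`, `run_pipeline`) is hypothetical and chosen for the example; the real interfaces in `newskg` may differ.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Article:
    url: str
    text: str


@dataclass(frozen=True)
class Relation:
    subject: str
    predicate: str
    obj: str


class Source(Protocol):
    def parse(self, url: str, html: str) -> Article: ...


class Extractor(Protocol):
    def extract(self, article: Article) -> list[Relation]: ...


class Store(Protocol):
    def add(self, relations: list[Relation]) -> None: ...


def run_pipeline(pages: dict[str, str], source: Source,
                 extractor: Extractor, store: Store) -> None:
    """Push already-fetched pages through parse -> extract -> store."""
    for url, html in pages.items():
        store.add(extractor.extract(source.parse(url, html)))


# Toy implementations: any object with matching methods can be swapped in.
class PlainTextSource:
    def parse(self, url: str, html: str) -> Article:
        return Article(url=url, text=html)


class FoundedExtractor:
    """Stand-in for an LLM extractor: matches '<X> founded <Y>' sentences."""

    def extract(self, article: Article) -> list[Relation]:
        relations = []
        for sentence in article.text.split("."):
            subject, sep, obj = sentence.partition(" founded ")
            if sep:
                relations.append(Relation(subject.strip(), "founded", obj.strip()))
        return relations


class MemoryStore:
    def __init__(self) -> None:
        self.relations: set[Relation] = set()  # a set drops exact duplicates

    def add(self, relations: list[Relation]) -> None:
        self.relations.update(relations)
```

Because `run_pipeline` depends only on the three protocols, replacing `MemoryStore` with a SQLite-backed store or `FoundedExtractor` with an LLM-backed one requires no change to the pipeline itself.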

---

## Run it

```bash
# Python 3.10+
python -m venv .venv && source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -e ".[dev]"

cp .env.example .env # then set ANTHROPIC_API_KEY
# no key? run fully offline: export NEWSKG_EXTRACTOR=heuristic

uvicorn newskg.api:app --reload
```

Then open **http://localhost:8000/docs** — the interactive Swagger UI documents every
endpoint, with request/response shapes and examples. (ReDoc is at `/redoc`.)
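
For a quick look at the routes without a browser, the sketch below reads the OpenAPI schema that FastAPI serves at `/openapi.json` by default and prints one line per endpoint. It assumes the app keeps that default schema URL and runs on `localhost:8000`; the helper names `fetch_schema` and `list_endpoints` are made up for this example.

```python
import json
from urllib.request import urlopen

HTTP_METHODS = {"get", "post", "put", "patch", "delete", "head", "options", "trace"}


def fetch_schema(base_url: str = "http://localhost:8000") -> dict:
    """Download the OpenAPI schema from a running server."""
    with urlopen(f"{base_url}/openapi.json") as resp:
        return json.load(resp)


def list_endpoints(schema: dict) -> list[str]:
    """Return 'METHOD /path  summary' lines, sorted by path."""
    lines = []
    for path, item in sorted(schema.get("paths", {}).items()):
        for method, operation in item.items():
            if method in HTTP_METHODS:  # skip non-operation keys such as "parameters"
                summary = operation.get("summary", "")
                lines.append(f"{method.upper()} {path}  {summary}".rstrip())
    return lines
```

With the server from the commands above running, `print("\n".join(list_endpoints(fetch_schema())))` lists whatever routes the app actually exposes.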

---

## Test it

```bash
pytest
```

Tests run fully offline — no network and no API key required — and cover entity
resolution, the store's merge/dedup logic, the TechCrunch parser, the API, and the
evaluation metrics.
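
As an illustration of what an offline entity-resolution test can look like, the sketch below merges name variants by a normalized key and checks the result with a pytest-style test. The normalization rule (strip accents, lowercase, collapse punctuation and whitespace) and the names `normalize` and `resolve` are assumptions made for this example; the repository's resolver may use a different strategy.

```python
import re
import unicodedata
from collections import defaultdict


def normalize(name: str) -> str:
    """Strip accents, lowercase, and collapse punctuation and whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", no_accents.lower()).strip()


def resolve(mentions: list[str]) -> dict[str, str]:
    """Map each normalized key to one canonical surface form."""
    groups = defaultdict(list)
    for mention in mentions:
        groups[normalize(mention)].append(mention)
    # Canonical form: the most frequent spelling; ties go to the first seen.
    return {key: max(forms, key=forms.count) for key, forms in groups.items()}


def test_resolve_merges_spelling_variants():
    canonical = resolve(
        ["Ada Lovelace", "ada lovelace", "Ada  Lovelace", "José Díaz", "Jose Diaz"]
    )
    assert canonical == {"ada lovelace": "Ada Lovelace", "jose diaz": "José Díaz"}
```

The test needs neither network access nor an API key, which is the property the suite described above relies on.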

# P.S.
# good joke with apple :)
