https://github.com/devflowinc/hackernews-ingest
Scripts to ingest hackernews
- Host: GitHub
- URL: https://github.com/devflowinc/hackernews-ingest
- Owner: devflowinc
- Created: 2024-07-10T04:17:59.000Z (12 months ago)
- Default Branch: master
- Last Pushed: 2024-07-10T04:30:04.000Z (12 months ago)
- Last Synced: 2025-06-22T05:17:05.349Z (6 days ago)
- Language: Python
- Size: 9.77 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Hackernews Ingest
This repo consists of 3 scripts, each of which is pretty self-contained.

1. set-ids
   1) Get the current max post id from [https://hacker-news.firebaseio.com/v0/maxitem.json](https://hacker-news.firebaseio.com/v0/maxitem.json).
   2) Read the last value written from Redis (the key is called `last_final`).
   3) From `last_final` to `max`, push these values into a Redis list called `tovisit`.

2. get-dataset
   1) Pop an item off the `tovisit` Redis list.
   2) Request the post for that id from `f"https://hacker-news.firebaseio.com/v0/item/{start}.json"`.
   3) If that value exists and is not deleted, push the JSON response from step 2 into the Redis list `hn`.

3. bulk-ingest
   1) Pop 120 JSON items from Redis.
   2) Format each into a Trieve chunk.
   3) Make a POST request to Trieve's `/api/chunk` to create the data.
   4) Push the JSON items into the Redis list `sent` (just so we can skip scripts 1 and 2).

### Running it
We run all of the following scripts in Kubernetes as deployments.
```sh
kubectl apply -f set-ids/set-ids.yaml
kubectl apply -f get_dataset/get-datasets.yaml
kubectl apply -f bulk_ingest/bulk_send.yaml
```

All scripts can be horizontally scaled except for `set-ids.yaml`, but it runs much faster than the other 2, so there is no need to worry about it.

Scaling is pretty easy with:
```sh
kubectl scale --replicas 3 deploy/bulksend
kubectl scale --replicas 50 deploy/get-datasets
```
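
For orientation, the queue logic of the three scripts described at the top can be sketched roughly as below. This is a minimal illustration, not the code in this repo: the function names (`ids_to_visit`, `should_keep`, `pop_batch`) are made up for the example, plain `deque` objects stand in for the Redis lists, and a small dict stands in for the Hacker News API. It also assumes that `last_final` itself was already queued, so the range starts one past it. Formatting items into Trieve chunks and the POST request are left out.

```python
from collections import deque
from typing import Optional

BATCH_SIZE = 120  # bulk-ingest pops 120 items per batch


def ids_to_visit(last_final: int, max_item: int) -> range:
    """set-ids: post ids to queue, given the last queued id and the current max."""
    # Assumption: `last_final` was queued on a previous run, so start after it.
    return range(last_final + 1, max_item + 1)


def should_keep(item: Optional[dict]) -> bool:
    """get-dataset: keep an item only if it exists and is not deleted."""
    # A missing id comes back as JSON null (None); deleted items carry "deleted": true.
    return item is not None and not item.get("deleted", False)


def pop_batch(queue: deque, size: int = BATCH_SIZE) -> list:
    """bulk-ingest: pop up to `size` items from the front of the queue."""
    batch = []
    while queue and len(batch) < size:
        batch.append(queue.popleft())
    return batch


# Walk the three stages with in-memory stand-ins for Redis and the HN API.
fake_api = {
    101: {"id": 101, "type": "story"},
    102: None,                          # id does not exist
    103: {"id": 103, "deleted": True},  # deleted post
    104: {"id": 104, "type": "comment"},
}

tovisit = deque(ids_to_visit(last_final=100, max_item=104))  # script 1
hn = deque()
while tovisit:                                               # script 2
    item = fake_api.get(tovisit.popleft())
    if should_keep(item):
        hn.append(item)

batch = pop_batch(hn)  # script 3: these would be formatted and POSTed
sent = list(batch)     # then recorded in the `sent` list
```

In the real deployment each stage is a separate process talking to Redis, which is what lets `get-datasets` and `bulksend` run as many replicas: every worker pops its own items from the shared list.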