https://github.com/devflowinc/hackernews-ingest

Scripts to ingest hackernews

Host: GitHub
URL: https://github.com/devflowinc/hackernews-ingest
Owner: devflowinc
Created: 2024-07-10T04:17:59.000Z (12 months ago)
Default Branch: master
Last Pushed: 2024-07-10T04:30:04.000Z (12 months ago)
Last Synced: 2025-06-22T05:17:05.349Z (6 days ago)
Language: Python
Size: 9.77 KB
Stars: 0
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# Hackernews Ingest

This repository consists of 3 scripts, each of which is pretty self-contained.

1. set-ids

1) Get the current max post id from [https://hacker-news.firebaseio.com/v0/maxitem.json](https://hacker-news.firebaseio.com/v0/maxitem.json).
2) Read the last value written from redis (the redis key is called `last_final`).
3) Push every id from `last_final` to `max` into a redis list called `tovisit`.
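
The steps above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual script: `client` stands for any object with redis-py style `get`, `set` and `rpush` methods, and the function names, the batching, the starting value of 0 and the final update of `last_final` are all assumptions.

```python
import json
import urllib.request

MAX_ITEM_URL = "https://hacker-news.firebaseio.com/v0/maxitem.json"


def fetch_max_id(url=MAX_ITEM_URL):
    """Return the current max item id reported by the Hacker News API."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return int(json.load(resp))


def set_ids(client, max_id, batch_size=1000):
    """Queue every id after `last_final` up to `max_id` on the `tovisit` list.

    Returns the number of ids that were queued.
    """
    raw = client.get("last_final")
    # Assumption: start from 0 when the key has never been written.
    last_final = int(raw) if raw is not None else 0
    # Assumption: `last_final` itself was queued by an earlier run.
    ids = range(last_final + 1, max_id + 1)
    # Push in batches so a large backlog is not sent as one huge command.
    for start in range(0, len(ids), batch_size):
        client.rpush("tovisit", *ids[start:start + batch_size])
    # Assumption: record progress so the next run resumes from here.
    if max_id > last_final:
        client.set("last_final", max_id)
    return len(ids)
```

With a real connection this would be called as `set_ids(redis.Redis(), fetch_max_id())`, assuming the `redis` package is installed.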

2. get-dataset

1) Pop an item off the `tovisit` redis list
2) Request the post for that id from `f"https://hacker-news.firebaseio.com/v0/item/{start}.json"`
3) If that value exists and is not deleted, push the json response from step 2 into the redis list `hn`
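
A rough sketch of one iteration of this worker is shown below. It is an illustration under assumptions rather than the repository's code: `client` is any object with redis-py style `lpop` and `rpush` methods, the helper names are hypothetical, and popping from the left end of `tovisit` is a guess. The Hacker News API returns `null` for ids that do not exist and marks removed items with `"deleted": true`.

```python
import json
import urllib.request

ITEM_URL = "https://hacker-news.firebaseio.com/v0/item/{start}.json"


def fetch_item(start):
    """Fetch one item from the Hacker News API (None if the id is unknown)."""
    with urllib.request.urlopen(ITEM_URL.format(start=start), timeout=10) as resp:
        return json.load(resp)


def process_next(client, fetch=fetch_item):
    """Pop one id from `tovisit` and store its item on `hn` if it is usable.

    Returns False when `tovisit` is empty, True otherwise.
    """
    raw = client.lpop("tovisit")  # assumption: ids are consumed from the left
    if raw is None:
        return False
    item = fetch(int(raw))
    # Skip ids that do not exist (null response) and deleted items.
    if item is not None and not item.get("deleted", False):
        client.rpush("hn", json.dumps(item))
    return True
```

Because each call handles a single id taken from the shared list, several copies of such a worker can run side by side without coordinating with each other.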

3. bulk-ingest

1) Pop 120 json items from redis
2) Format each into a trieve chunk
3) Make a POST request to trieve `/api/chunk` to create the data
4) Push the json items into redis list `sent` (just so we can skip scripts 1 and 2)
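
The batching logic above might look something like the sketch below. Several details are assumptions and not taken from the repository: the items are assumed to come from the `hn` list, the chunk field names in `to_chunk` are illustrative and should be checked against the trieve API documentation, and `post` is a hypothetical callable that wraps the authenticated POST to `/api/chunk`.

```python
import json

BATCH_SIZE = 120


def pop_batch(client, size=BATCH_SIZE):
    """Pop up to `size` raw json strings from the `hn` list (assumed source)."""
    batch = []
    while len(batch) < size:
        raw = client.lpop("hn")
        if raw is None:  # list is empty, send whatever was collected
            break
        batch.append(raw)
    return batch


def to_chunk(item):
    """Map a Hacker News item to a chunk payload (field names are assumptions)."""
    return {
        "tracking_id": str(item["id"]),
        "chunk_html": item.get("title") or item.get("text") or "",
        "link": item.get("url", ""),
        "metadata": {"by": item.get("by"), "type": item.get("type")},
    }


def bulk_ingest(client, post):
    """Send one batch through `post` and record the raw items on `sent`.

    Returns the number of items sent.
    """
    batch = pop_batch(client)
    if not batch:
        return 0
    post([to_chunk(json.loads(raw)) for raw in batch])
    # Keep the raw items so the data can be re-sent without refetching.
    client.rpush("sent", *batch)
    return len(batch)
```

Passing `post` in as a parameter keeps the HTTP client, API key and dataset id out of the sketch, since none of those are described in this README.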

### Running it

We run all the following scripts in kubernetes as deployments.

```sh
kubectl apply -f set-ids/set-ids.yaml
kubectl apply -f get_dataset/get-datasets.yaml
kubectl apply -f bulk_ingest/bulk_send.yaml
```

All scripts can be horizontally scaled except for `set-ids.yaml`, but it runs much faster than the other 2, so there is no need to worry about it.

Scaling is pretty easy with:
```sh
kubectl scale --replicas 3 deploy/bulksend
kubectl scale --replicas 50 deploy/get-datasets
```
