https://github.com/34j/cached-historical-data-fetcher

Python utility for fetching any historical data using caching. Suitable for news, posts, weather, etc.

aiohttp asyncio cache fetch hacktoberfest historical-data http joblib lz4 pagenation pandas python realtime scraper scraping tqdm update web-scraping

# Cached Historical Data Fetcher




Python utility for fetching any historical data using caching.
Suitable for acquiring data that is added frequently and incrementally, e.g. news, posts, weather, etc.

## Installation

Install this via pip (or your favourite package manager):

```shell
pip install cached-historical-data-fetcher
```

## Features

- Uses a cache built on top of [`joblib`](https://github.com/joblib/joblib), [`lz4`](https://github.com/lz4/lz4) and [`aiofiles`](https://github.com/Tinche/aiofiles).
- Ready to use with [`asyncio`](https://docs.python.org/3/library/asyncio.html), [`aiohttp`](https://github.com/aio-libs/aiohttp) and [`aiohttp-client-cache`](https://github.com/requests-cache/aiohttp-client-cache). Uses `asyncio.gather` to fetch chunks concurrently. (For performance reasons, relying on `aiohttp-client-cache` alone is probably not a good idea when fetching a large number of chunks, i.e. web requests.)
- Based on [`pandas`](https://github.com/pandas-dev/pandas) and supports `MultiIndex`.
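
To illustrate the `asyncio.gather` point above, here is a minimal standard-library sketch of fetching several chunks concurrently and combining them in order. It is not this library's code: `fetch_chunk` and `fetch_all` are hypothetical names, and the sleep stands in for a real web request.

```python
import asyncio


async def fetch_chunk(start: int) -> list[int]:
    """Hypothetical stand-in for one web request; returns the rows of one chunk."""
    await asyncio.sleep(0.01)  # simulate network latency
    return [start, start + 1]


async def fetch_all(starts: list[int]) -> list[int]:
    """Fetch all chunks concurrently and flatten them in request order."""
    # asyncio.gather runs the coroutines concurrently and returns the results
    # in the same order as its arguments, regardless of completion order.
    chunks = await asyncio.gather(*(fetch_chunk(s) for s in starts))
    return [row for chunk in chunks for row in chunk]


rows = asyncio.run(fetch_all([0, 2, 4]))
print(rows)  # [0, 1, 2, 3, 4, 5]
```

Because the results come back in argument order, the chunks can be concatenated directly without re-sorting.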

## Usage

### `HistoricalDataCache`, `HistoricalDataCacheWithChunk` and `HistoricalDataCacheWithFixedChunk`

Override the `get_one()` method to fetch data for one chunk. The `update()` method calls `get_one()` for each unfetched chunk, concatenates the results, and saves them to the cache.

```python
from cached_historical_data_fetcher import HistoricalDataCacheWithFixedChunk
from pandas import DataFrame, Timedelta, Timestamp
from typing import Any

# define cache class
class MyCacheWithFixedChunk(HistoricalDataCacheWithFixedChunk[Timestamp, Timedelta, Any]):
    delay_seconds = 0.0  # delay between chunks (requests) in seconds
    interval = Timedelta(days=1)  # interval between chunks, can be any type
    start_index = Timestamp.utcnow().floor("10D")  # start index, can be any type

    async def get_one(self, start: Timestamp, *args: Any, **kwargs: Any) -> DataFrame:
        """Fetch data for one chunk."""
        return DataFrame({"day": [start.day]}, index=[start])

# get complete data
# (top-level `await` works in Jupyter/IPython; in a script, wrap the call in `asyncio.run()`)
print(await MyCacheWithFixedChunk().update())
```

```shell
                           day
2023-09-30 00:00:00+00:00   30
2023-10-01 00:00:00+00:00    1
2023-10-02 00:00:00+00:00    2
```

**See [example.ipynb](example.ipynb) for a real-world example.**
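
The incremental behaviour described above (fetch only the chunks that are not yet cached) can be sketched with the standard library alone. This is an assumption about the general idea, not the library's actual algorithm; `unfetched_starts` and its parameters are hypothetical, and the real implementation may treat the newest, still-incomplete chunk differently.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def unfetched_starts(
    start_index: datetime,
    interval: timedelta,
    last_cached: Optional[datetime],
    now: datetime,
) -> list[datetime]:
    """Return the chunk starts that are not in the cache yet, oldest first."""
    # Resume one interval after the newest cached chunk, or from the very start.
    current = start_index if last_cached is None else last_cached + interval
    starts = []
    while current <= now:
        starts.append(current)
        current += interval
    return starts


day = timedelta(days=1)
origin = datetime(2023, 9, 30, tzinfo=timezone.utc)
now = datetime(2023, 10, 2, 12, tzinfo=timezone.utc)

first_run = unfetched_starts(origin, day, None, now)  # empty cache: every chunk
second_run = unfetched_starts(origin, day, first_run[-1], now + day)  # one day later
print(len(first_run), len(second_run))  # 3 1
```

On the first run every chunk since `origin` is requested; a day later only the single new chunk is.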

### `IdCacheWithFixedChunk`

Override the `get_one()` method to fetch data for one chunk, in the same way as for `HistoricalDataCacheWithFixedChunk`.
After `ids` has been updated by calling `set_ids()`, the `update()` method calls `get_one()` for every unfetched id, concatenates the results, and saves them to the cache.

```python
from cached_historical_data_fetcher import IdCacheWithFixedChunk
from pandas import DataFrame
from typing import Any

class MyIdCache(IdCacheWithFixedChunk[str, Any]):
    delay_seconds = 0.0  # delay between chunks (requests) in seconds

    async def get_one(self, start: str, *args: Any, **kwargs: Any) -> DataFrame:
        """Fetch data for one chunk."""
        return DataFrame({"id+hello": [start + "+hello"]}, index=[start])

cache = MyIdCache()  # create cache
cache.set_ids(["a"])  # set ids
cache.set_ids(["b"])  # set ids again, now `cache.ids` is ["a", "b"]
# top-level `await` works in Jupyter/IPython; in a script, wrap the calls in `asyncio.run()`
print(await cache.update(reload=True))  # discard previous cache and fetch again
cache.set_ids(["b", "c"])  # set ids again, now `cache.ids` is ["a", "b", "c"]
print(await cache.update())  # fetch only new data
```

```shell
  id+hello
a  a+hello
b  b+hello
  id+hello
a  a+hello
b  b+hello
c  c+hello
```
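
The id-merging and fetch-only-new behaviour shown above can be mimicked with a small in-memory stand-in. This is a sketch for illustration, not the library's implementation: `TinyIdCache` and its `calls` log are hypothetical, and a plain `dict` replaces the on-disk `DataFrame` cache.

```python
import asyncio


class TinyIdCache:
    """Hypothetical in-memory stand-in; not the library's implementation."""

    def __init__(self) -> None:
        self.ids: list[str] = []
        self.data: dict[str, str] = {}
        self.calls: list[str] = []  # records every id actually fetched

    def set_ids(self, ids: list[str]) -> None:
        # Merge new ids into the known ones, keeping first-seen order.
        for id_ in ids:
            if id_ not in self.ids:
                self.ids.append(id_)

    async def get_one(self, id_: str) -> str:
        self.calls.append(id_)
        return id_ + "+hello"

    async def update(self, reload: bool = False) -> dict[str, str]:
        if reload:
            self.data.clear()  # discard everything fetched so far
        missing = [i for i in self.ids if i not in self.data]
        results = await asyncio.gather(*(self.get_one(i) for i in missing))
        self.data.update(zip(missing, results))
        return dict(self.data)


async def main() -> tuple[dict[str, str], dict[str, str], list[str]]:
    cache = TinyIdCache()
    cache.set_ids(["a"])
    cache.set_ids(["b"])
    first = await cache.update(reload=True)
    cache.set_ids(["b", "c"])
    second = await cache.update()
    return first, second, cache.calls


first, second, calls = asyncio.run(main())
print(first)   # {'a': 'a+hello', 'b': 'b+hello'}
print(second)  # {'a': 'a+hello', 'b': 'b+hello', 'c': 'c+hello'}
print(calls)   # ['a', 'b', 'c']
```

The `calls` log shows that the second `update()` fetched only `"c"`; `"a"` and `"b"` were served from the cache.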

## Contributors ✨

Thanks goes to these wonderful people ([emoji key](https://allcontributors.org/docs/en/emoji-key)):

This project follows the [all-contributors](https://github.com/all-contributors/all-contributors) specification. Contributions of any kind welcome!