https://github.com/34j/cached-historical-data-fetcher
Python utility for fetching any historical data using caching. Suitable for news, posts, weather, etc.
- Host: GitHub
- URL: https://github.com/34j/cached-historical-data-fetcher
- Owner: 34j
- License: MIT
- Created: 2023-10-02T02:43:01.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2025-03-11T19:27:07.000Z (7 months ago)
- Last Synced: 2025-03-11T20:29:40.944Z (7 months ago)
- Topics: aiohttp, asyncio, cache, fetch, hacktoberfest, historical-data, http, joblib, lz4, pagenation, pandas, python, realtime, scraper, scraping, tqdm, update, web-scraping
- Language: Python
- Homepage:
- Size: 493 KB
- Stars: 3
- Watchers: 2
- Forks: 1
- Open Issues: 13
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG
- Contributing: CONTRIBUTING.md
- Funding: .github/FUNDING.yml
- License: LICENSE
- Code of conduct: .github/CODE_OF_CONDUCT.md
- Security: .github/SECURITY.md
README
# Cached Historical Data Fetcher
Python utility for fetching any historical data using caching.
Suitable for acquiring data that is added frequently and incrementally, e.g. news, posts, weather, etc.

## Installation
Install this via pip (or your favourite package manager):
```shell
pip install cached-historical-data-fetcher
```

## Features
- Uses cache built on top of [`joblib`](https://github.com/joblib/joblib), [`lz4`](https://github.com/lz4/lz4) and [`aiofiles`](https://github.com/Tinche/aiofiles).
- Ready to use with [`asyncio`](https://docs.python.org/3/library/asyncio.html), [`aiohttp`](https://github.com/aio-libs/aiohttp) and [`aiohttp-client-cache`](https://github.com/requests-cache/aiohttp-client-cache). Uses `asyncio.gather` to fetch chunks in parallel; see the sketch after this list. (For performance reasons, relying on `aiohttp-client-cache` alone is probably not a good idea when fetching a large number of chunks (web requests).)
- Based on [`pandas`](https://github.com/pandas-dev/pandas) and supports `MultiIndex`.
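As a minimal, self-contained sketch of the `asyncio.gather` pattern mentioned above (the `fetch_chunk` coroutine here is a hypothetical stand-in for a real request, not part of this library):

```python
import asyncio

async def fetch_chunk(index: int) -> str:
    """Hypothetical per-chunk fetch; in this library the equivalent
    work happens in your `get_one()` override."""
    await asyncio.sleep(0.1)  # simulate a web request
    return f"chunk-{index}"

async def main() -> None:
    # gather() schedules all chunk coroutines concurrently,
    # so total wall time is ~one request, not three.
    chunks = await asyncio.gather(*(fetch_chunk(i) for i in range(3)))
    print(chunks)  # ['chunk-0', 'chunk-1', 'chunk-2']

asyncio.run(main())
```

## Usage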
### `HistoricalDataCache`, `HistoricalDataCacheWithChunk` and `HistoricalDataCacheWithFixedChunk`
Override the `get_one()` method to fetch data for one chunk. The `update()` method will call `get_one()` for each unfetched chunk, concatenate the results, and save them to the cache.
```python
from cached_historical_data_fetcher import HistoricalDataCacheWithFixedChunk
from pandas import DataFrame, Timedelta, Timestamp
from typing import Any

# define cache class
class MyCacheWithFixedChunk(HistoricalDataCacheWithFixedChunk[Timestamp, Timedelta, Any]):
    delay_seconds = 0.0  # delay between chunks (requests) in seconds
    interval = Timedelta(days=1)  # interval between chunks, can be any type
    start_index = Timestamp.utcnow().floor("10D")  # start index, can be any type

    async def get_one(self, start: Timestamp, *args: Any, **kwargs: Any) -> DataFrame:
        """Fetch data for one chunk."""
        return DataFrame({"day": [start.day]}, index=[start])

# get complete data
print(await MyCacheWithFixedChunk().update())
```

```shell
day
2023-09-30 00:00:00+00:00 30
2023-10-01 00:00:00+00:00 1
2023-10-02 00:00:00+00:00 2
```

**See [example.ipynb](example.ipynb) for a real-world example.**
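As a hedged sketch of what a real-world subclass might look like, the snippet below combines `HistoricalDataCacheWithFixedChunk` with `aiohttp`. The endpoint URL and JSON shape are hypothetical; only the class attributes and the `get_one()` signature are taken from the example above:

```python
import aiohttp
from typing import Any
from pandas import DataFrame, Timedelta, Timestamp
from cached_historical_data_fetcher import HistoricalDataCacheWithFixedChunk

class MyHttpCache(HistoricalDataCacheWithFixedChunk[Timestamp, Timedelta, Any]):
    delay_seconds = 1.0  # be polite to the (hypothetical) server
    interval = Timedelta(days=1)  # one chunk per day
    start_index = Timestamp.utcnow().floor("10D")

    async def get_one(self, start: Timestamp, *args: Any, **kwargs: Any) -> DataFrame:
        """Fetch one day of data over HTTP."""
        url = f"https://api.example.com/data?date={start:%Y-%m-%d}"  # hypothetical endpoint
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as resp:
                resp.raise_for_status()
                payload = await resp.json()  # assumed: mapping of column name -> values
        return DataFrame(payload, index=[start])

# In a notebook: df = await MyHttpCache().update()
```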
### `IdCacheWithFixedChunk`
Override the `get_one()` method to fetch data for one chunk, in the same way as in `HistoricalDataCacheWithFixedChunk`.
After updating `ids` by calling `set_ids()`, the `update()` method will call `get_one()` for every unfetched id, concatenate the results, and save them to the cache.

```python
from cached_historical_data_fetcher import IdCacheWithFixedChunk
from pandas import DataFrame
from typing import Any

class MyIdCache(IdCacheWithFixedChunk[str, Any]):
    delay_seconds = 0.0  # delay between chunks (requests) in seconds

    async def get_one(self, start: str, *args: Any, **kwargs: Any) -> DataFrame:
        """Fetch data for one chunk."""
        return DataFrame({"id+hello": [start + "+hello"]}, index=[start])

cache = MyIdCache()  # create cache
cache.set_ids(["a"]) # set ids
cache.set_ids(["b"]) # set ids again, now `cache.ids` is ["a", "b"]
print(await cache.update(reload=True)) # discard previous cache and fetch again
cache.set_ids(["b", "c"]) # set ids again, now `cache.ids` is ["a", "b", "c"]
print(await cache.update()) # fetch only new data
```

```shell
id+hello
a a+hello
b b+hello
id+hello
a a+hello
b b+hello
c c+hello
```
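Note that the top-level `await ... update()` calls in the snippets above assume an already-running event loop, as in Jupyter or IPython. In a plain Python script you would wrap them yourself, for example (reusing the `MyIdCache` class defined above):

```python
import asyncio

async def main() -> None:
    cache = MyIdCache()
    cache.set_ids(["a", "b"])
    print(await cache.update())

asyncio.run(main())
```

## Contributors ✨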
Thanks goes to these wonderful people ([emoji key](https://allcontributors.org/docs/en/emoji-key)):
This project follows the [all-contributors](https://github.com/all-contributors/all-contributors) specification. Contributions of any kind welcome!