https://github.com/34j/cached-historical-data-fetcher
Python utility for fetching any historical data using caching. Suitable for news, posts, weather, etc.
- Host: GitHub
- URL: https://github.com/34j/cached-historical-data-fetcher
- Owner: 34j
- License: MIT
- Created: 2023-10-02T02:43:01.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2025-03-11T19:27:07.000Z (7 months ago)
- Last Synced: 2025-03-11T20:29:40.944Z (7 months ago)
- Topics: aiohttp, asyncio, cache, fetch, hacktoberfest, historical-data, http, joblib, lz4, pagenation, pandas, python, realtime, scraper, scraping, tqdm, update, web-scraping
- Language: Python
- Homepage:
- Size: 493 KB
- Stars: 3
- Watchers: 2
- Forks: 1
- Open Issues: 13
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG
- Contributing: CONTRIBUTING.md
- Funding: .github/FUNDING.yml
- License: LICENSE
- Code of conduct: .github/CODE_OF_CONDUCT.md
- Security: .github/SECURITY.md
README
# Cached Historical Data Fetcher
Python utility for fetching any historical data using caching.
Suitable for acquiring data that is added frequently and incrementally, e.g. news, posts, weather, etc.

## Installation
Install this via pip (or your favourite package manager):
```shell
pip install cached-historical-data-fetcher
```

## Features
- Uses cache built on top of [`joblib`](https://github.com/joblib/joblib), [`lz4`](https://github.com/lz4/lz4) and [`aiofiles`](https://github.com/Tinche/aiofiles).
- Ready to use with [`asyncio`](https://docs.python.org/3/library/asyncio.html), [`aiohttp`](https://github.com/aio-libs/aiohttp) and [`aiohttp-client-cache`](https://github.com/requests-cache/aiohttp-client-cache). Uses `asyncio.gather` to fetch chunks in parallel; see the sketch after this list. (For performance reasons, relying on `aiohttp-client-cache` alone is probably not a good idea when fetching a large number of chunks (web requests).)
- Based on [`pandas`](https://github.com/pandas-dev/pandas) and supports `MultiIndex`.
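As a minimal, self-contained sketch of the `asyncio.gather` pattern mentioned above (the `fetch_chunk` coroutine here is a hypothetical stand-in for a real request, not part of this library):

```python
import asyncio

async def fetch_chunk(index: int) -> str:
    """Hypothetical per-chunk fetch; in this library the equivalent
    work happens in your `get_one()` override."""
    await asyncio.sleep(0.1)  # simulate a web request
    return f"chunk-{index}"

async def main() -> None:
    # gather() schedules all chunk coroutines concurrently,
    # so total wall time is ~one request, not three.
    chunks = await asyncio.gather(*(fetch_chunk(i) for i in range(3)))
    print(chunks)  # ['chunk-0', 'chunk-1', 'chunk-2']

asyncio.run(main())
```

## Usage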
### `HistoricalDataCache`, `HistoricalDataCacheWithChunk` and `HistoricalDataCacheWithFixedChunk`
Override the `get_one()` method to fetch data for one chunk. The `update()` method will call `get_one()` for each unfetched chunk, concatenate the results, and save them to the cache.
```python
from cached_historical_data_fetcher import HistoricalDataCacheWithFixedChunk
from pandas import DataFrame, Timedelta, Timestamp
from typing import Any

# define cache class
class MyCacheWithFixedChunk(HistoricalDataCacheWithFixedChunk[Timestamp, Timedelta, Any]):
    delay_seconds = 0.0  # delay between chunks (requests) in seconds
    interval = Timedelta(days=1)  # interval between chunks, can be any type
    start_index = Timestamp.utcnow().floor("10D")  # start index, can be any type

    async def get_one(self, start: Timestamp, *args: Any, **kwargs: Any) -> DataFrame:
        """Fetch data for one chunk."""
        return DataFrame({"day": [start.day]}, index=[start])

# get complete data
print(await MyCacheWithFixedChunk().update())
```

```shell
day
2023-09-30 00:00:00+00:00 30
2023-10-01 00:00:00+00:00 1
2023-10-02 00:00:00+00:00 2
```

**See [example.ipynb](example.ipynb) for a real-world example.**
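As a hedged sketch of what a real-world subclass might look like, the snippet below combines `HistoricalDataCacheWithFixedChunk` with `aiohttp`. The endpoint URL and JSON shape are hypothetical; only the class attributes and the `get_one()` signature are taken from the example above:

```python
import aiohttp
from typing import Any
from pandas import DataFrame, Timedelta, Timestamp
from cached_historical_data_fetcher import HistoricalDataCacheWithFixedChunk

class MyHttpCache(HistoricalDataCacheWithFixedChunk[Timestamp, Timedelta, Any]):
    delay_seconds = 1.0  # be polite to the (hypothetical) server
    interval = Timedelta(days=1)  # one chunk per day
    start_index = Timestamp.utcnow().floor("10D")

    async def get_one(self, start: Timestamp, *args: Any, **kwargs: Any) -> DataFrame:
        """Fetch one day of data over HTTP."""
        url = f"https://api.example.com/data?date={start:%Y-%m-%d}"  # hypothetical endpoint
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as resp:
                resp.raise_for_status()
                payload = await resp.json()  # assumed: mapping of column name -> values
        return DataFrame(payload, index=[start])

# In a notebook: df = await MyHttpCache().update()
```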
### `IdCacheWithFixedChunk`
Override the `get_one()` method to fetch data for one chunk, in the same way as in `HistoricalDataCacheWithFixedChunk`.
After updating `ids` by calling `set_ids()`, the `update()` method will call `get_one()` for every unfetched id, concatenate the results, and save them to the cache.

```python
from cached_historical_data_fetcher import IdCacheWithFixedChunk
from pandas import DataFrame
from typing import Any

class MyIdCache(IdCacheWithFixedChunk[str, Any]):
    delay_seconds = 0.0  # delay between chunks (requests) in seconds

    async def get_one(self, start: str, *args: Any, **kwargs: Any) -> DataFrame:
        """Fetch data for one chunk."""
        return DataFrame({"id+hello": [start + "+hello"]}, index=[start])

cache = MyIdCache()  # create cache
cache.set_ids(["a"]) # set ids
cache.set_ids(["b"]) # set ids again, now `cache.ids` is ["a", "b"]
print(await cache.update(reload=True)) # discard previous cache and fetch again
cache.set_ids(["b", "c"]) # set ids again, now `cache.ids` is ["a", "b", "c"]
print(await cache.update()) # fetch only new data
```

```shell
id+hello
a a+hello
b b+hello
id+hello
a a+hello
b b+hello
c c+hello
```
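Note that the top-level `await ... update()` calls in the snippets above assume an already-running event loop, as in Jupyter or IPython. In a plain Python script you would wrap them yourself, for example (reusing the `MyIdCache` class defined above):

```python
import asyncio

async def main() -> None:
    cache = MyIdCache()
    cache.set_ids(["a", "b"])
    print(await cache.update())

asyncio.run(main())
```

## Contributors ✨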
Thanks goes to these wonderful people ([emoji key](https://allcontributors.org/docs/en/emoji-key)):
This project follows the [all-contributors](https://github.com/all-contributors/all-contributors) specification. Contributions of any kind welcome!