https://github.com/tohtsky/log_count_util

A utility module to count/aggregate logs within a time interval

# log-count-utils

## Introduction & Usage

Suppose we have action log data `df` like the following:

| user_id | timestamp | expense |
| ------: | :------------------ | ------: |
| 0 | 2021-02-18 10:00:00 | 100 |
| 0 | 2021-02-18 10:00:10 | 10 |
| 0 | 2021-02-18 10:00:21 | 1 |
| 0 | 2021-02-18 11:00:21 | 0.1 |
| 1 | 2020-02-18 10:00:10 | 100 |
| 1 | 2020-02-18 10:00:20 | 10 |
| 1 | 2020-02-18 10:00:20 | 1 |
| 1 | 2020-02-18 10:00:29 | 0 |

Suppose that you have to compute the following quantities **for each row in this dataframe**:

- the number of actions the same user has taken within the preceding 10 seconds
- the total expense of the same user within the preceding 10 seconds

The following naive approach, shown here for the first quantity, is fine for this tiny example but becomes costly (O(N^2)) for a large data frame.

```python
from datetime import timedelta
import numpy as np

td = timedelta(seconds=10)

answers = []
for uid, time_point in zip(df.user_id, df.timestamp):
    # Count this user's rows in the half-open window [time_point - td, time_point).
    cnt = np.sum(
        (df.user_id == uid)
        & (df.timestamp < time_point)
        & (df.timestamp >= (time_point - td))
    )
    answers.append(cnt)
```
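The second quantity (total expense in the preceding 10 seconds) can be computed with the same kind of naive scan. The sketch below is self-contained: it uses plain Python tuples holding the rows of the example table in place of `df`, and the names `rows` and `naive_expense_totals` are illustrative, not part of `log_count_util`. It assumes the half-open window `[t - 10s, t)`, which excludes the current row.

```python
from datetime import datetime, timedelta

# Rows of the example table: (user_id, timestamp, expense).
rows = [
    (0, datetime(2021, 2, 18, 10, 0, 0), 100),
    (0, datetime(2021, 2, 18, 10, 0, 10), 10),
    (0, datetime(2021, 2, 18, 10, 0, 21), 1),
    (0, datetime(2021, 2, 18, 11, 0, 21), 0.1),
    (1, datetime(2020, 2, 18, 10, 0, 10), 100),
    (1, datetime(2020, 2, 18, 10, 0, 20), 10),
    (1, datetime(2020, 2, 18, 10, 0, 20), 1),
    (1, datetime(2020, 2, 18, 10, 0, 29), 0),
]


def naive_expense_totals(rows, window):
    """Sum each user's expenses in [t - window, t) for every row (O(N^2))."""
    totals = []
    for uid, t, _ in rows:
        # Scan every row again and keep those of the same user inside the window.
        total = sum(e for u, s, e in rows if u == uid and t - window <= s < t)
        totals.append(total)
    return totals


totals = naive_expense_totals(rows, timedelta(seconds=10))
print(totals)  # [0, 100, 0, 0, 0, 100, 100, 11]
```

Every row triggers a full pass over the data, which is where the quadratic cost comes from.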

If `df` is sorted (by `user_id` as the primary and `timestamp` as the secondary key),
we can do this blazing fast (O(N)) using `log_count_util`.

```python
from log_count_util import find_n_records_within_interval

answers = find_n_records_within_interval(
    df.user_id, df.timestamp, df.user_id, df.timestamp, td
)
```
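To see why sorted input makes a linear-time pass possible, here is a minimal pure-Python sketch of a two-pointer sliding window. It is not the implementation used by `log_count_util`; the function name `count_within_interval` and the plain lists standing in for the data frame columns are assumptions made for illustration. It assumes a positive window and uses the same half-open window `[t - window, t)` as the naive loop.

```python
from datetime import datetime, timedelta


def count_within_interval(user_ids, timestamps, window):
    """Count, for each row, same-user rows with t - window <= ts < t.

    Assumes the rows are sorted by (user_id, timestamp) and window > 0.
    """
    counts = []
    left = 0   # first same-user row with timestamp >= t - window
    right = 0  # first same-user row with timestamp >= t
    for uid, t in zip(user_ids, timestamps):
        # Both pointers only move forward, so the total work is O(N).
        while user_ids[left] != uid or timestamps[left] < t - window:
            left += 1
        while user_ids[right] != uid or timestamps[right] < t:
            right += 1
        counts.append(right - left)
    return counts


# The example table, already sorted by user_id and then timestamp.
user_ids = [0, 0, 0, 0, 1, 1, 1, 1]
timestamps = [
    datetime(2021, 2, 18, 10, 0, 0),
    datetime(2021, 2, 18, 10, 0, 10),
    datetime(2021, 2, 18, 10, 0, 21),
    datetime(2021, 2, 18, 11, 0, 21),
    datetime(2020, 2, 18, 10, 0, 10),
    datetime(2020, 2, 18, 10, 0, 20),
    datetime(2020, 2, 18, 10, 0, 20),
    datetime(2020, 2, 18, 10, 0, 29),
]

counts = count_within_interval(user_ids, timestamps, timedelta(seconds=10))
print(counts)  # [0, 1, 0, 0, 0, 1, 1, 2]
```

Because each pointer stops at the current row at the latest, neither loop can run past the end of the lists. A windowed sum can be obtained the same way by taking the difference of a prefix-sum array at `right` and `left`.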