https://github.com/tohtsky/log_count_util
A utility module to count/aggregate logs within a time interval
- Host: GitHub
- URL: https://github.com/tohtsky/log_count_util
- Owner: tohtsky
- Created: 2021-02-18T06:03:18.000Z (over 4 years ago)
- Default Branch: main
- Last Pushed: 2021-10-06T21:47:11.000Z (over 3 years ago)
- Last Synced: 2025-01-28T10:45:40.645Z (5 months ago)
- Language: C++
- Size: 32.2 KB
- Stars: 2
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: Readme.md
README
# log-count-utils
## Introduction & Usage
Suppose we have an action log data `df` like
| user_id | timestamp | expense |
| ------: | :------------------ | ------: |
| 0 | 2021-02-18 10:00:00 | 100 |
| 0 | 2021-02-18 10:00:10 | 10 |
| 0 | 2021-02-18 10:00:21 | 1 |
| 0 | 2021-02-18 11:00:21 | 0.1 |
| 1 | 2020-02-18 10:00:10 | 100 |
| 1 | 2020-02-18 10:00:20 | 10 |
| 1 | 2020-02-18 10:00:20 | 1 |
| 1 | 2020-02-18 10:00:29 | 0 |Suppose that you have to compute the following quantity **for each row in this dataframe**:
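
For concreteness, here is one way to build this dataframe with pandas (a minimal sketch; the README only specifies the table, so the dtypes below are assumptions):

```python
import pandas as pd

# Reconstruction of the table above; dtypes are assumed, not specified by the README.
df = pd.DataFrame(
    {
        "user_id": [0, 0, 0, 0, 1, 1, 1, 1],
        "timestamp": pd.to_datetime(
            [
                "2021-02-18 10:00:00",
                "2021-02-18 10:00:10",
                "2021-02-18 10:00:21",
                "2021-02-18 11:00:21",
                "2020-02-18 10:00:10",
                "2020-02-18 10:00:20",
                "2020-02-18 10:00:20",
                "2020-02-18 10:00:29",
            ]
        ),
        "expense": [100, 10, 1, 0.1, 100, 10, 1, 0],
    }
)
```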
Suppose that you have to compute the following quantities **for each row in this dataframe**:

- the number of actions the same user has taken within the preceding 10 seconds
- the total expense of the same user within the preceding 10 seconds

The following naive way (shown here for the count; the expense sum is analogous) is fine for this tiny example, but becomes costly (O(N^2)) for a large data frame.
```python
from datetime import timedelta
import numpy as np

td = timedelta(seconds=10)
answers = []
for uid, time_point in zip(df.user_id, df.timestamp):
    # Count this user's earlier actions falling in the half-open window
    # [time_point - td, time_point).
    cnt = np.sum(
        (df.user_id == uid) & (df.timestamp < time_point) & (df.timestamp >= (time_point - td))
    )
    answers.append(cnt)
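# For the example dataframe above, answers works out by hand to
# [0, 1, 0, 0, 0, 1, 1, 2].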
```

If `df` is sorted (by `user_id` as the primary key and `timestamp` as the secondary key),
we can do this blazing fast (O(N)) using `log_count_util`.

```python
from log_count_util import find_n_records_within_interval

answers = find_n_records_within_interval(
    df.user_id, df.timestamp, df.user_id, df.timestamp, td
)
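# If your frame is not already sorted, a standard pandas one-liner can do it
# first (an assumption about your data, not part of the library's API):
# df = df.sort_values(["user_id", "timestamp"]).reset_index(drop=True)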
```
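
Why is this O(N)? On sorted input, both edges of the time window only ever move forward. Below is a hypothetical pure-Python sketch of that sliding-window (two-pointer) idea; it illustrates the technique, not the library's actual C++ implementation, and the name `count_within_interval_sorted` is made up:

```python
from datetime import timedelta


def count_within_interval_sorted(user_ids, timestamps, td):
    # Two-pointer sketch: for each row hi, count earlier rows of the same user
    # whose timestamp lies in [timestamps[hi] - td, timestamps[hi]).
    n = len(user_ids)
    counts = [0] * n
    lo = mid = 0  # window edges within the current user's block
    for hi in range(n):
        if hi > 0 and user_ids[hi] != user_ids[hi - 1]:
            lo = mid = hi  # new user: restart the window
        # lo: first row with timestamp >= timestamps[hi] - td
        while timestamps[lo] < timestamps[hi] - td:
            lo += 1
        # mid: first row with timestamp >= timestamps[hi] (strict "< time_point")
        while timestamps[mid] < timestamps[hi]:
            mid += 1
        counts[hi] = mid - lo  # rows in [lo, mid) fall inside the window
    return counts


counts = count_within_interval_sorted(
    list(df.user_id), list(df.timestamp), timedelta(seconds=10)
)
# -> [0, 1, 0, 0, 0, 1, 1, 2] for the example dataframe above
```

The expense aggregation works the same way: keep a running prefix sum of `expense` and replace the count `mid - lo` with the difference of prefix sums at `mid` and `lo`.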