https://github.com/tohtsky/log_count_util
A utility module to count/aggregate logs within a time interval
- Host: GitHub
- URL: https://github.com/tohtsky/log_count_util
- Owner: tohtsky
- Created: 2021-02-18T06:03:18.000Z (over 4 years ago)
- Default Branch: main
- Last Pushed: 2021-10-06T21:47:11.000Z (over 3 years ago)
- Last Synced: 2025-01-28T10:45:40.645Z (5 months ago)
- Language: C++
- Size: 32.2 KB
- Stars: 2
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: Readme.md
README
# log-count-utils
## Introduction & Usage
Suppose we have an action log data `df` like
| user_id | timestamp | expense |
| ------: | :------------------ | ------: |
| 0 | 2021-02-18 10:00:00 | 100 |
| 0 | 2021-02-18 10:00:10 | 10 |
| 0 | 2021-02-18 10:00:21 | 1 |
| 0 | 2021-02-18 11:00:21 | 0.1 |
| 1 | 2020-02-18 10:00:10 | 100 |
| 1 | 2020-02-18 10:00:20 | 10 |
| 1 | 2020-02-18 10:00:20 | 1 |
| 1 | 2020-02-18 10:00:29 | 0 |Suppose that you have to compute the following quantity **for each row in this dataframe**:
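
For concreteness, here is one way to build this dataframe with pandas (a minimal sketch; the README only specifies the table, so the dtypes below are assumptions):

```python
import pandas as pd

# Reconstruction of the table above; dtypes are assumed, not specified by the README.
df = pd.DataFrame(
    {
        "user_id": [0, 0, 0, 0, 1, 1, 1, 1],
        "timestamp": pd.to_datetime(
            [
                "2021-02-18 10:00:00",
                "2021-02-18 10:00:10",
                "2021-02-18 10:00:21",
                "2021-02-18 11:00:21",
                "2020-02-18 10:00:10",
                "2020-02-18 10:00:20",
                "2020-02-18 10:00:20",
                "2020-02-18 10:00:29",
            ]
        ),
        "expense": [100, 10, 1, 0.1, 100, 10, 1, 0],
    }
)
```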
Suppose that you have to compute the following quantities **for each row in this dataframe**:

- the number of actions the same user has taken within the preceding 10 seconds
- the total expense of the same user within the preceding 10 seconds

The following naive way (shown here for the count; the expense sum is analogous) is fine for this tiny example, but becomes costly (O(N^2)) for a large data frame.
```python
from datetime import timedelta
import numpy as np

td = timedelta(seconds=10)
answers = []
for uid, time_point in zip(df.user_id, df.timestamp):
    # Count this user's earlier actions falling in the half-open window
    # [time_point - td, time_point).
    cnt = np.sum(
        (df.user_id == uid) & (df.timestamp < time_point) & (df.timestamp >= (time_point - td))
    )
    answers.append(cnt)
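# For the example dataframe above, answers works out by hand to
# [0, 1, 0, 0, 0, 1, 1, 2].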
```

If `df` is sorted (by `user_id` as the primary key and `timestamp` as the secondary key),
we can do this blazing fast (O(N)) using `log_count_util`.

```python
from log_count_util import find_n_records_within_interval

answers = find_n_records_within_interval(
    df.user_id, df.timestamp, df.user_id, df.timestamp, td
)
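# If your frame is not already sorted, a standard pandas one-liner can do it
# first (an assumption about your data, not part of the library's API):
# df = df.sort_values(["user_id", "timestamp"]).reset_index(drop=True)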
```
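
Why is this O(N)? On sorted input, both edges of the time window only ever move forward. Below is a hypothetical pure-Python sketch of that sliding-window (two-pointer) idea; it illustrates the technique, not the library's actual C++ implementation, and the name `count_within_interval_sorted` is made up:

```python
from datetime import timedelta


def count_within_interval_sorted(user_ids, timestamps, td):
    # Two-pointer sketch: for each row hi, count earlier rows of the same user
    # whose timestamp lies in [timestamps[hi] - td, timestamps[hi]).
    n = len(user_ids)
    counts = [0] * n
    lo = mid = 0  # window edges within the current user's block
    for hi in range(n):
        if hi > 0 and user_ids[hi] != user_ids[hi - 1]:
            lo = mid = hi  # new user: restart the window
        # lo: first row with timestamp >= timestamps[hi] - td
        while timestamps[lo] < timestamps[hi] - td:
            lo += 1
        # mid: first row with timestamp >= timestamps[hi] (strict "< time_point")
        while timestamps[mid] < timestamps[hi]:
            mid += 1
        counts[hi] = mid - lo  # rows in [lo, mid) fall inside the window
    return counts


counts = count_within_interval_sorted(
    list(df.user_id), list(df.timestamp), timedelta(seconds=10)
)
# -> [0, 1, 0, 0, 0, 1, 1, 2] for the example dataframe above
```

The expense aggregation works the same way: keep a running prefix sum of `expense` and replace the count `mid - lo` with the difference of prefix sums at `mid` and `lo`.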