https://github.com/dmatking/dtlab
Date Time Lab
https://github.com/dmatking/dtlab
csv data-analysis data-quality datetime python timezone
Last synced: 11 days ago
JSON representation
Date Time Lab
- Host: GitHub
- URL: https://github.com/dmatking/dtlab
- Owner: dmatking
- License: mit
- Created: 2025-09-11T03:43:22.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2026-04-03T16:48:36.000Z (2 months ago)
- Last Synced: 2026-04-03T19:41:12.540Z (2 months ago)
- Topics: csv, data-analysis, data-quality, datetime, python, timezone
- Language: Python
- Size: 43 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# dtlab — datetime equivalence detector
Finds datetime columns in a CSV or DataFrame that represent the same instant in different formats or timezones, and groups them together.
Useful when working with wide tables that have many timestamp columns — epoch seconds alongside ISO strings alongside RFC5322 headers, all meaning the same thing.
## What it detects
- ISO 8601 (with Z, numeric offsets, naive)
- RFC 5322 / HTTP-date
- Epoch integers (seconds, milliseconds, microseconds, nanoseconds — inferred by magnitude)
- Slash-style dates (MDY/DMY, with ambiguity flagging)
- ISO week / ordinal formats
- TZ abbreviations (PST/PDT, CST/CDT, EST/EDT, MST/MDT, UTC/GMT)
## Install
```bash
pip install pandas numpy python-dateutil
```
Requires Python 3.9+ (uses `zoneinfo`).
## CLI
```bash
python dt_equivalence.py --in data.csv
```
```
=== dt-equivalence report ===
Source: data.csv
-- Detected columns --
column | role | format | unit | parse_rate | parser | naive_policy | notes
...
-- Equivalence groups (tol=1s, min_overlap=100, min_match_ratio=98%) --
Group 1: ts_iso_utc, ts_iso_cdt, ts_iso_pdt, ts_epoch_s, ts_epoch_ms, ts_rfc5322
Singletons (no match): ts_naive_local
```
Also writes `data.dt_report.json` with full pairwise details.
### Options
| Flag | Default | Description |
| ------------------- | -------- | ----------------------------------------------------------------- |
| `--in` | required | Input CSV path |
| `--delimiter` | auto | CSV delimiter |
| `--naive-tz` | UTC | IANA timezone for naive datetime strings (e.g. `America/Chicago`) |
| `--encoding` | utf-8 | File encoding |
| `--max-rows` | all | Limit rows read |
| `--tolerance` | 1 | Max seconds difference to consider two timestamps equivalent |
| `--min-overlap` | 100 | Minimum non-null row overlap required to compare two columns |
| `--min-match-ratio` | 0.98 | Fraction of overlapping rows that must match within tolerance |
| `--include-columns` | all | Comma-separated list of columns to analyze |
| `--exclude-columns` | none | Comma-separated list of columns to skip |
| `--preview` | off | Write a normalized UTC preview CSV (first 50 rows) |
## Notebook / script API
```python
import pandas as pd
from dt_equivalence import analyze
df = pd.read_parquet("events.parquet")
result = analyze(df, naive_tz="America/Chicago")
result.report() # print text report
result.summary() # pd.DataFrame of column metadata
result.normalized() # pd.DataFrame of detected columns as UTC ISO strings
result.equivalent_groups() # list of groups with 2+ members
result.groups # all groups including singletons
result.parsed # dict of col → UTC pd.Series
result.sim # pairwise {overlap, match_ratio, equivalent}
```
`analyze()` accepts the same parameters as the CLI flags:
```python
result = analyze(
df,
naive_tz="America/New_York",
tolerance_seconds=5,
min_overlap=50,
min_match_ratio=0.95,
include_columns=["created_at", "event_ts", "ts_epoch"],
exclude_columns=["id"],
)
```
## How equivalence works
All detected columns are normalized to UTC. Two columns are considered equivalent if:
1. They share at least `min_overlap` non-null rows (or 1% of total rows for large files)
2. At least `min_match_ratio` of those rows have timestamps within `tolerance_seconds` of each other
Grouping uses union-find, so transitivity is handled correctly (if A≡B and B≡C, all three end up in the same group).
## Caveats
- **Naive timestamps**: without `--naive-tz`, naive strings are assumed to be UTC. If your data has naive local times, set `--naive-tz` to get correct grouping.
- **Ambiguous slash dates**: `03/08/2025` is ambiguous (MDY vs DMY). These are flagged in the notes column but still parsed by pandas using its default interpretation.
- **Floating-point epoch loss**: epoch values stored as floats with few significant digits may not match string timestamps exactly — raise `--tolerance` if needed.