https://github.com/kashirin-alex/data-engineer-interview
- Host: GitHub
- URL: https://github.com/kashirin-alex/data-engineer-interview
- Owner: kashirin-alex
- Created: 2021-12-18T16:52:15.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2021-12-18T17:03:31.000Z (over 3 years ago)
- Last Synced: 2025-02-12T18:53:35.906Z (2 months ago)
- Language: Python
- Size: 3.91 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Data-Engineer-interview
This is a minimal, non-persistent analyzer (it analyzes a single log file; no cross-logfile data is retained). \
Ideally it would run with database support, receiving the logs (CSV data) from requests to a queue service.
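A minimal sketch of that intake path, using an in-process `queue.Queue` as a stand-in for the real queue service (the names here are illustrative, not taken from the repository):

```python
import pathlib
import queue

# Stand-in for the queue service; in production this would be a broker client.
log_queue = queue.Queue()

def consume_one(target: pathlib.Path = pathlib.Path("output.csv")) -> None:
    # Block until a CSV payload arrives, then write it where
    # log_alerts.py expects its input file.
    csv_payload = log_queue.get()
    target.write_text(csv_payload)
    log_queue.task_done()
```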
1) The `log_alerts.py` script expects the logfile `output.csv` on its path
2) The logfile is renamed while it is being processed and, once processing is done, renamed again with a timestamp
3) Alerts can be defined in the `alerts` dict (see the sketch after this list)
* `distinct` - defines the key whose values are counted
* `by` - defines the field-to-value match that filters the rows
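
A minimal sketch of steps 1-3, assuming illustrative alert fields (`duration`, `count`) and CSV column names that are not taken from the repository's code:

```python
import csv
import os
import time

# Illustrative alert definition; the exact schema in log_alerts.py may differ.
alerts = {
    "too_many_errors": {
        "distinct": "request_id",   # key whose distinct values are counted
        "by": {"status": "500"},    # field-to-value match filtering the rows
        "duration": 60,             # seconds to aggregate over (assumed field)
        "count": 10,                # alert threshold (assumed field)
    },
}

def process_logfile(path: str = "output.csv") -> None:
    # Step 2: rename while processing, so a freshly written output.csv
    # is never read half-way through.
    working = path + ".processing"
    os.rename(path, working)
    with open(working, newline="") as fd:
        for row in csv.DictReader(fd):
            for alert in alerts.values():
                if all(row.get(f) == v for f, v in alert["by"].items()):
                    pass  # track row[alert["distinct"]] against alert["count"]
    # Step 2, second half: keep the processed file under a timestamped name.
    os.rename(working, f"{path}.{int(time.time())}")
```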
---
#### Database support (for the case of using [SWC-DB](https://www.swcdb.org))
1) define the alerts
2) insert/index by key=[rounded(ts/duration), distinct value/s, ..] with value=+1, into a COUNTER column whose TTL equals the alert duration \
(expired log cells are cleared automatically once the alert duration passes)
* there is no more need for the `self.tracker` object;
just iterate the CSV and update the cells per alert duration and distinct kind
3) select the cells with COUNTER >= alert-count (see the sketch below)
* size and tracked durations would no longer cause the worker host to consume more resources
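
A sketch of the keying scheme from the steps above, using a plain dict in place of an SWC-DB COUNTER column with a TTL; `db_insert`/`db_select` are hypothetical stand-ins, not SWC-DB client calls, and TTL expiry is not modeled:

```python
import csv

# Stand-in for the counter column; a real setup would use the SWC-DB client
# and a COUNTER column whose TTL equals the alert duration.
counters: dict[tuple, int] = {}

def db_insert(key: tuple) -> None:
    # Step 2: key = (rounded(ts / duration), *distinct values); value = +1.
    counters[key] = counters.get(key, 0) + 1

def db_select(threshold: int) -> dict:
    # Step 3: the cells whose COUNTER reached the alert count.
    return {key: count for key, count in counters.items() if count >= threshold}

def index_logfile(path: str, duration: int, distinct_fields: list[str]) -> None:
    # No self.tracker needed: just iterate the CSV and update the cells.
    with open(path, newline="") as fd:
        for row in csv.DictReader(fd):
            ts = float(row["timestamp"])  # column name assumed for illustration
            key = (int(ts // duration), *(row[f] for f in distinct_fields))
            db_insert(key)
```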