{"id":16017363,"url":"https://github.com/robertohuertasm/log-ingestor","last_synced_at":"2025-03-18T03:30:52.673Z","repository":{"id":77795542,"uuid":"463687701","full_name":"robertohuertasm/log-ingestor","owner":"robertohuertasm","description":"🌐🦀 A simple http logs ingestor","archived":false,"fork":false,"pushed_at":"2022-03-04T17:10:24.000Z","size":399,"stargazers_count":7,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-02-28T06:31:18.566Z","etag":null,"topics":["cli","csv","http","logs"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/robertohuertasm.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-02-25T22:03:58.000Z","updated_at":"2024-09-21T06:15:12.000Z","dependencies_parsed_at":"2023-03-12T02:07:19.895Z","dependency_job_id":null,"html_url":"https://github.com/robertohuertasm/log-ingestor","commit_stats":{"total_commits":37,"total_committers":1,"mean_commits":37.0,"dds":0.0,"last_synced_commit":"bad564f8218c5cb2a0063281a098fd61675d9b96"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/robertohuertasm%2Flog-ingestor","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/robertohuertasm%2Flog-ingestor/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/robertohuertasm%2Flog-ingestor/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/reposi
tories/robertohuertasm%2Flog-ingestor/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/robertohuertasm","download_url":"https://codeload.github.com/robertohuertasm/log-ingestor/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243896389,"owners_count":20365370,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cli","csv","http","logs"],"created_at":"2024-10-08T16:05:11.626Z","updated_at":"2025-03-18T03:30:52.659Z","avatar_url":"https://github.com/robertohuertasm.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Log Ingestor\n\nA simple tool to ingest HTTP logs.\n\n## Motivation\n\nThis is merely an exercise to play a little bit with Rust, so take it with a pinch of salt. There are many things that can be improved.\n\n## Features\n\nThe tool can ingest logs either from a file or from the standard input.\n\nThe logs are CSV formatted:\n\n```csv\n\"remotehost\",\"rfc931\",\"authuser\",\"date\",\"request\",\"status\",\"bytes\"\n\"10.0.0.2\",\"-\",\"apache\",1549573860,\"GET /api/user HTTP/1.0\",200,1234\n\"10.0.0.4\",\"-\",\"apache\",1549573860,\"GET /api/user HTTP/1.0\",200,1234\n```\n\nThere's a sample [here](sample.csv).\n\n## Usage\n\nThe project is written in [Rust](https://www.rust-lang.org/), so you need Rust installed in order to run it. 
Follow [these instructions](https://www.rust-lang.org/tools/install) to install Rust.\n\nIn order to run it you can use the following command:\n\n```sh\n# get the logs from a file\ncargo run --release sample.csv\n# or use the standard input\ncargo run --release \u003c sample.csv\n```\n\nAlternatively, you can build the tool first and then run the binary from the command line:\n\n```sh\n# let's build it\ncargo build --release\n# and then run it\n./target/release/log-ingestor sample.csv\n# or\n./target/release/log-ingestor \u003c sample.csv\n```\n\nYou should get a list of events similar to this one:\n\n![terminal](./docs/images/terminal.png)\n\n## Events\n\nThe tool produces a list of events depending on the logs it receives:\n\n- **Stats**: Every 10 seconds it prints some statistics about the requests received, grouped by section. A section is the first part of the request path (e.g. for `/api/user`, the section is `/api`).\n\n- **Alerts**: If there are more than 10 requests per second on average over a period of 2 minutes, it will print an alert with the average requests per second and the time when the alert was triggered. It will also display another alert message whenever the high traffic alert recovers.\n\n## Architecture\n\nHere's a simple diagram of the architecture of the tool, which describes the main components and how they interact.\n\n![components](./docs/images/components.png)\n\nOne of the basic ideas is that all the components are independent so they can be easily tested. There's no coupling to the reader, the writer or even the processors. By leveraging [traits](https://doc.rust-lang.org/reference/items/traits.html), we can feel free to change the implementation of some of the components.\n\nIt's also important to mention that **the code is asynchronously executed** when reading, parsing and buffering the logs. 
Processes are spawned in different threads (see [Writer](#writer) for more details).\n\n## Testing\n\nMost of the components have been tested so we can be sure that the tool works as expected. Nevertheless, the CLI has not been exhaustively tested, mostly because of time constraints while developing the exercise and because the rest of the components are well covered.\n\nYou can run the tests by executing the following command:\n\n```sh\ncargo test\n```\n\n## Tracing \u0026 environment variables\n\nI usually instrument the code I write so I can understand what's going on. The tool has been instrumented using the [tracing](https://docs.rs/tracing/0.1.31/tracing/) crate.\n\nIn order to enable it, you can set the env var `RUST_LOG` to `log-ingestor=debug`.\n\nAlternatively, you can leverage the [dotenv](https://docs.rs/dotenv/latest/dotenv/) support to set the env var `RUST_LOG` in the `.env` file.\n\n## Future improvements and limitations\n\nBecause of time constraints I only implemented the basic features of the tool.\n\nHere I describe some of the limitations and possible ways to improve it.\n\n### Writer\n\nThe writer is synchronous so it will block the threads. It's not a problem for the moment because the tool is really performant, but this could pose a problem with large datasets.\n\nIdeally, it would be nice to be able to use an [async writer implementation](https://docs.rs/tokio/1.17.0/tokio/io/trait.AsyncWrite.html) the same way we're using an async reader.\n\nThe limitation is not technical. It's just that I didn't have time to implement it.\n\nFor the time being, I went with [rayon](https://docs.rs/rayon/latest/rayon/) in order to leverage the parallelism of the machine when processing the logs.\n\n### BufferedLogs\n\nThis component is configurable and we can set the number of seconds to buffer. 
On top of that, it will group the logs by time and return them in order.\n\nOne thing that is missing is the ability to handle the logs that were not correctly parsed. For now, it will just swallow the error and trace it. We should probably think of a better strategy to deal with this in the long term.\n\n### Stats processor\n\nOne of the problems while working on this project is that we're not ingesting the logs in real time, meaning that they are ingested from a file or the standard input fairly quickly.\n\nThis means that we cannot use time to unblock or trigger certain events and that's why I chose to naively implement the stats processor.\n\nThe idea is that we should be able to get stats every 10 seconds, but consider this case:\n\nLet's imagine we have a first log at time 0 and a second log at time 20.\n\nIdeally, we should get a stats event for the first log at time 10 and a stats event for the second log at time 20.\n\nIf we could rely on time, we could just trigger the event once 10 seconds have passed since the first log. Unfortunately, we can't do that.\n\nThe current implementation will just trigger the event once the second log is received (given that it falls beyond the 10 seconds) or the stream ends.\n\nThis works, but it means that you can find stats events that are triggered beyond the 10-second interval that we established.\n\nWe should work on that algorithm to improve the stats processor in this regard. A possible approach would be to calculate the amount of time that has (virtually) passed when receiving the second log and create the events accordingly. For example, in our case, when receiving the log at time 20, we know that 2 events should be triggered and we could process the logs and split them into two events.\n\nAside from that, the stats processor doesn't ensure that the stats shown are ordered. Just to be clear, **the stats events will be shown in order** but the statistics of the logs are not. 
For instance, when receiving a stats event, we will see the number of hits that each section is receiving. It's possible that you would see the number of hits for section `/api` first (with 5 hits) and then the number of hits for section `/api/user` (with 100 hits). Ordering them would just be a nice-to-have, as it doesn't affect the correctness of the stats.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frobertohuertasm%2Flog-ingestor","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frobertohuertasm%2Flog-ingestor","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frobertohuertasm%2Flog-ingestor/lists"}