{"id":27255548,"url":"https://github.com/bric3/drain-java","last_synced_at":"2025-04-11T02:20:06.887Z","repository":{"id":39580334,"uuid":"325029390","full_name":"bric3/drain-java","owner":"bric3","description":"This a pet project to explore log pattern extraction using DRAIN","archived":false,"fork":false,"pushed_at":"2025-03-16T05:55:37.000Z","size":1256,"stargazers_count":28,"open_issues_count":12,"forks_count":9,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-03-16T06:25:48.422Z","etag":null,"topics":["drain","java","log","tail","template-mining"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mpl-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bric3.png","metadata":{"files":{"readme":"README.adoc","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-12-28T14:09:47.000Z","updated_at":"2025-03-16T05:55:41.000Z","dependencies_parsed_at":"2023-02-18T16:32:15.956Z","dependency_job_id":"d3c9e7bb-5d4c-44d6-978d-3e79f793cd84","html_url":"https://github.com/bric3/drain-java","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bric3%2Fdrain-java","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bric3%2Fdrain-java/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bric3%2Fdrain-java/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bric3%2Fdrain-java/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bric3","download_u
rl":"https://codeload.github.com/bric3/drain-java/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248328317,"owners_count":21085295,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["drain","java","log","tail","template-mining"],"created_at":"2025-04-11T02:20:06.375Z","updated_at":"2025-04-11T02:20:06.852Z","avatar_url":"https://github.com/bric3.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"= drain java\n\nimage:https://github.com/bric3/drain-java/actions/workflows/gradle.yml/badge.svg[Java CI with Gradle,link=https://github.com/bric3/drain-java/actions/workflows/gradle.yml]\n\n== Introduction\n\ndrain-java is a continuous _log template miner_: for each log message it extracts\ntokens and groups them into _clusters of tokens_. As new log messages are added,\ndrain-java identifies similar tokens and updates the matching cluster with the new template,\nor simply creates a new token cluster. 
Each time a cluster is matched, a counter is\nincremented.\n\nThese clusters are stored in a prefix tree, which is somewhat similar to a trie, but\nhere the tree has a fixed depth in order to avoid long tree traversals.\nAvoiding deep trees also helps keep it balanced.\n\n== Usage\n\nFirst, https://foojay.io/almanac/jdk-11/[Java 11] is required to run drain-java.\n\n=== As a dependency\n\nYou can consume drain-java as a dependency in your project using the coordinates `io.github.bric3.drain:drain-java-core`.\nCurrently only https://s01.oss.sonatype.org/content/repositories/snapshots/io/github/bric3/drain/[snapshots]\nare available, which requires adding this repository.\n\n[source, kotlin]\n----\nrepositories {\n    maven {\n        url(\"https://oss.sonatype.org/content/repositories/snapshots/\")\n    }\n}\n----\n\n=== From command line\n\nSince this tool is not yet released, it needs to be built locally.\nAlso, the built jar is not yet super user-friendly. Since it's not a finished\nproduct, anything could change.\n\n.Example usage\n[source, shell]\n----\n$ ./gradlew build\n$ java -jar tailer/build/libs/tailer-0.1.0-SNAPSHOT-all.jar -h\n\ntail - drain\nUsage: tail [-dfhV] [--verbose] [-n=NUM]\n            [--parse-after-str=FIXED_STRING_SEPARATOR]\n            [--parser-after-col=COLUMN] FILE\n...\n      FILE          log file\n  -d, --drain       use DRAIN to extract log patterns\n  -f, --follow      output appended data as the file grows\n  -h, --help        Show this help message and exit.\n  -n, --lines=NUM   output the last NUM lines, instead of the last 10; or use\n                      -n 0 to output starting from beginning\n      --parse-after-str=FIXED_STRING_SEPARATOR\n                    when using DRAIN remove the left part of a log line up to\n                      after the FIXED_STRING_SEPARATOR\n      --parser-after-col=COLUMN\n                    when using DRAIN remove the left part of a log line up to\n                      COLUMN\n  -V, --version     Print version 
information and exit.\n      --verbose     Verbose output, mostly for DRAIN or errors\n$ java -jar tailer/build/libs/tailer-0.1.0-SNAPSHOT-all.jar --version\nVersioned Command 1.0\nPicocli 4.6.3\nJVM: 19 (Amazon.com Inc. OpenJDK 64-Bit Server VM 19+36-FR)\nOS: Mac OS X 12.6 x86_64\n----\n\nBy default, the tool acts similarly to `tail` and writes the file to stdout.\nThe tool can _follow_ a file if the `--follow` option is passed.\nHowever, when run with the `--drain` option, this tool classifies log lines using DRAIN and\noutputs the identified clusters.\nNote that this tool doesn't handle multiline log messages (like logs that contain a stack trace).\n\nOn the SSH log data set we can use it this way.\n\n[source, shell]\n----\n$ java -jar build/libs/drain-java-1.0-SNAPSHOT-all.jar \\\n  -d \\ \u003c1\u003e\n  -n 0 \\ \u003c2\u003e\n  --parse-after-str \"]: \" \\ \u003c3\u003e\n  build/resources/test/SSH.log \u003c4\u003e\n----\n\u003c1\u003e Identify patterns in the log\n\u003c2\u003e Start from the beginning of the file (otherwise it starts from the last 10 lines)\n\u003c3\u003e Remove the left part of the log line (`Dec 10 06:55:46 LabSZ sshd[24200]: `), i.e. effectively\nignoring some variable elements like the time.\n\u003c4\u003e The log file\n\n.log pattern clusters and their occurrences\n[source]\n--------\n---- Done processing file. 
Total of 655147 lines, done in 1.588 s, 51 clusters \u003c1\u003e\n0010 (size 140768): Failed password for \u003c*\u003e from \u003c*\u003e port \u003c*\u003e ssh2 \u003c2\u003e\n0009 (size 140701): pam unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= \u003c*\u003e \u003c*\u003e\n0007 (size 68958): Connection closed by \u003c*\u003e [preauth]\n0008 (size 46642): Received disconnect from \u003c*\u003e 11: \u003c*\u003e \u003c*\u003e \u003c*\u003e\n0014 (size 37963): PAM service(sshd) ignoring max retries; \u003c*\u003e \u003e 3\n0012 (size 37298): Disconnecting: Too many authentication failures for \u003c*\u003e [preauth]\n0013 (size 37029): PAM \u003c*\u003e more authentication \u003c*\u003e logname= uid=0 euid=0 tty=ssh ruser= \u003c*\u003e \u003c*\u003e\n0011 (size 36967): message repeated \u003c*\u003e times: [ Failed password for \u003c*\u003e from \u003c*\u003e port \u003c*\u003e ssh2]\n0006 (size 20241): Failed \u003c*\u003e for invalid user \u003c*\u003e from \u003c*\u003e port \u003c*\u003e ssh2\n0004 (size 19852): pam unix(sshd:auth): check pass; user unknown\n0001 (size 18909): reverse mapping checking getaddrinfo for \u003c*\u003e \u003c*\u003e failed - POSSIBLE BREAK-IN ATTEMPT!\n0002 (size 14551): Invalid user \u003c*\u003e from \u003c*\u003e\n0003 (size 14551): input userauth request: invalid user \u003c*\u003e [preauth]\n0005 (size 14356): pam unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= \u003c*\u003e\n0018 (size 1289): PAM \u003c*\u003e more authentication \u003c*\u003e logname= uid=0 euid=0 tty=ssh ruser= \u003c*\u003e\n0024 (size 952): fatal: Read from socket failed: Connection reset by peer [preauth]\n...\n--------\n\u003c1\u003e 51 _types_ of logs were identified from 655147 lines in 1.588s\n\u003c2\u003e There were `140768` similar log messages with this pattern, with `3` positions\nwhere the token is identified as a parameter `\u003c*\u003e`.\n\nOn the same dataset, the Java 
implementation performed roughly 10 times faster.\nAs my implementation does not yet have masking, the mask configuration was removed from the\nDrain3 implementation for this comparison.\n\n=== From Java\n\nThis tool is not yet intended to be used as a library, but for the curious\nthe DRAIN algorithm can be used this way:\n\n.Minimal DRAIN example\n[source, java]\n----\nvar drain = Drain.drainBuilder()\n                 .additionalDelimiters(\"_\")\n                 .depth(4)\n                 .build();\nFiles.lines(Paths.get(\"build/resources/test/SSH.log\"),\n            StandardCharsets.UTF_8)\n     .forEach(drain::parseLogMessage);\n\n// do something with clusters\ndrain.clusters();\n----\n\n\n\n== Status\n\nThe pieces of the puzzle are coming together in no particular order. I first bootstrapped the code from a simple Java\nfile, then wrote a Java implementation of Drain. Here's what I would like to do next.\n\n.Todo\n- [ ] More unit tests\n- [x] Wire things together\n- [ ] More documentation\n- [x] Implement _tail follow_ mode (currently in drain mode the whole file is read and the tool stops once finished)\n- [ ] In follow drain mode dump clusters on forced exit (e.g. when hitting `ctrl`+`c`)\n- [x] Start reading from the last x lines (like `tail -n 30`)\n- [ ] Implement log masking (e.g. when a log contains an email or an IP address, which may be considered private data)\n\n.For later\n- [ ] JSON message field extraction\n- [ ] How to handle prefixes: dates, log level, etc.; possibly using masking\n- [ ] Investigate marker with specific behavior, e.g. 
log level severity\n- [ ] Investigate logs with stack traces (likely multiline)\n- [ ] Improve handling of very long lines\n- [ ] Logback appender with micrometer counter\n\n== Motivation\n\nI was inspired by a https://sayr.us/log-pattern-recognition/logmine/[blog article on LogMine by one of my colleagues].\nMany thanks to him for doing the initial research and explaining the concepts. We were both impressed by the log\npattern extraction of https://docs.datadoghq.com/logs/explorer/patterns/[Datadog's Log Explorer], and his blog post\nsparked my interest.\n\nAfter some discussion together, we saw that Drain was a bit superior to LogMine.\nGoogling Drain in Java didn't yield any results, although I certainly didn't search exhaustively.\nRegardless, this triggered the idea of implementing the algorithm in Java.\n\n== References\n\nThis Java implementation is mostly a port of https://github.com/IBM/Drain3[Drain3],\nwritten by IBM folks (_David Ohana_, _Moshik Hershcovitch_). IBM's Drain3 is a fork of the\nhttps://github.com/logpai/logparser[original work] done by the LogPai team, based on the paper by\n_Pinjia He_, _Jieming Zhu_, _Zibin Zheng_, and _Michael R. Lyu_.\n\n_I didn't follow up on other contributors of these projects; reach out if you think you have been omitted._\n\n\nFor reference, here are the links I looked at:\n\n* https://logparser.readthedocs.io/\n* https://github.com/logpai/logparser\n* https://github.com/IBM/Drain3\n* https://jiemingzhu.github.io/pub/pjhe_icws2017.pdf\n(a copy of this publication is available link:doc/pjhe_icws2017.pdf[here])\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbric3%2Fdrain-java","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbric3%2Fdrain-java","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbric3%2Fdrain-java/lists"}