{"id":20816318,"url":"https://github.com/riptano/logparse","last_synced_at":"2025-07-24T11:06:37.317Z","repository":{"id":141876087,"uuid":"37682136","full_name":"riptano/logparse","owner":"riptano","description":"Parser for Cassandra Logs","archived":false,"fork":false,"pushed_at":"2016-03-23T21:18:27.000Z","size":541,"stargazers_count":13,"open_issues_count":0,"forks_count":6,"subscribers_count":264,"default_branch":"logparse","last_synced_at":"2025-03-31T10:05:00.862Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/riptano.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2015-06-18T20:09:04.000Z","updated_at":"2021-07-17T13:14:17.000Z","dependencies_parsed_at":null,"dependency_job_id":"ed131bed-935e-4888-9f2d-1e32cc643a64","html_url":"https://github.com/riptano/logparse","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/riptano%2Flogparse","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/riptano%2Flogparse/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/riptano%2Flogparse/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/riptano%2Flogparse/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/riptano","download_url":"https://codeload.github.com/riptano/logparse/tar.gz/refs/heads/logparse","host":{"name":
"GitHub","url":"https://github.com","kind":"github","repositories_count":252879453,"owners_count":21818799,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-17T21:29:42.074Z","updated_at":"2025-05-07T12:41:33.757Z","avatar_url":"https://github.com/riptano.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Cassandra system.log parser\n\nThis rule-based log parser uses regular expressions to match various messages logged \nby Cassandra and extract any useful information they contain into separate fields.  \nAdditional transformations can be applied to each of the captured values, and then\na dictionary containing the resulting values on each line is returned.  The dictionaries\ncan be output in json format or inserted into a storage backend.\n\n## log_to_json\n\nThe [log_to_json](log_to_json) script parses system.log and outputs events in JSON format with one\nevent per line.  It takes a list of log files on the command line and parses them.\nIf no arguments are supplied it will attempt to parse stdin. This can be used to parse\na live log file by piping from tail: `tail -f /var/log/cassandra/system.log | log_to_json`.\n\n## cassandra_ingest\n\nThe [cassandra_ingest](cassandra_ingest) script parses system.log and inserts each event into the\nlogparse.systemlog table defined in [systemlog.cql](systemlog.cql). It takes a list of log files\non the command line and parses them.  If no arguments are supplied it will attempt \nto parse stdin. 
This can be used to parse a live log file by piping from tail: \n`tail -f /var/log/cassandra/system.log | cassandra_ingest`.\n\n## Rule-Based Message Parser\nTo reduce the tedium of defining parsers for many different messages, I created a simple \nDSL using Python function objects.  The function objects can be called like normal \nfunctions, but they are created by a constructor which allows you to define the specific\nbehavior of the resulting function when it is called.  \n\nThe function objects themselves are defined in `rules.py`, and the rules specific to the\nCassandra system.log are defined in `systemlog.py`.  In the future, I may add additional\nsets of rules for Spark executor logs, and OpsCenter daemon and agent logs. These rules \ncan be used to create parsers for your own application logs as well. \n\nA minimal set of two rules is defined as follows:\n\n```\ncapture_message = switch((\n    case('CassandraDaemon'),\n        rule(\n            capture(r'Heap size: (?P\u003cheap_used\u003e[0-9]*)/(?P\u003ctotal_heap\u003e[0-9]*)'),\n            convert(int, 'heap_used', 'total_heap'),\n            update(event_product='cassandra', event_category='startup', event_type='heap_size')),\n            \n        rule(\n            capture(r'Classpath: (?P\u003cclasspath\u003e.*)'),\n            convert(split(':'), 'classpath'),\n            update(event_product='cassandra', event_category='startup', event_type='classpath'))))\n```\n\nThe `switch(cases)` constructor takes a tuple of cases and rules. It was necessary to use\nan actual tuple instead of argument unpacking because the number of rules exceeds the \nmaximum number of parameters supported by a Python function call. The constructor returns\na function that we assign the name `capture_message`.  This function accepts two parameters:\nthe first determines which group of rules will be applied, and the second is the string\nthat the selected group of rules will be applied to until a match is found. 
The function\nreturns the value returned by the first matching rule. If no rules match, None is returned.  \n\nRules are grouped using the `case(*keys)` constructor. The keys specified in the case\nconstructor will be used by the switch function to determine which group of rules to\nexecute.  The keys in a case constructor apply to all of the rules that follow until the\nnext case constructor is encountered.\n\nRules are defined using `rule(source, *transformations)` where source is a function that\nis expected to take a string and return a dictionary of fields extracted from the string \nif it matches, or None if it doesn't. Unless None is returned, the rule will then pass \nthe resulting dictionary into the transformation functions in the specified order, each\nof which is expected to manipulate the dictionary in some way by adding, removing, or \noverwriting fields. \n\nCurrently the only source defined is `capture(*regexes)`, which takes a list of regular expressions\nto apply against the input string.  Each of the regular expressions will be applied\nuntil the first match is found, and then the match's groupdict will be returned. If no \nmatches are found, None is returned.\n\nSeveral transformations are provided:\n\n- `convert(function, *field_names)` will iterate over the k/v pairs in a dictionary and \napply the specified function to convert the specified fields to a different type or \nperform some other transformation on the string value.  The function can be a simple \ntype conversion such as int or float, or it can be a user-defined function or the \nconstructor for a function object.  The field names are just one or more strings \nspecifying the dictionary field that the conversion should be applied to. The convert \nfunction will iterate over the fields specified and apply the conversion function to \neach, replacing the value of the field with the result.\n\n- `update(**fields)` simply adds the specified key-value pairs to the dictionary. 
This can\nbe used to tag the event with a category or type based on the regular expression that has\nmatched it.\n\n- `default(**fields)` is the same as update, but it will only set the key/value pairs for\nfields that do not already exist within the dictionary.\n\n`systemlog.py` defines a capture_line rule to match the overall log line of the format:\n\n```\nlevel [thread] date sourcefile:sourceline - message\n```\n\nThis rule then passes the sourcefile and message fields to the capture_message function \ndefined above, which chooses a group of rules based on the sourcefile, then applies them \nto the message until a match is found.\n\nThese rules are wrapped by a `parse_log` generator that iterates over a sequence of log lines\nand yields a dictionary for each event within the log. This has special handling for\nexceptions which can follow on separate lines after the main line of an error message.\n\nIn order to test the rules, I created a simple front-end called `log_to_json`, which reads\none or more system.log files (or stdin) and converts each event into a json representation\nwith one event per line. \n\n## Cassandra Storage Backend\n\nThe Cassandra storage backend is designed to store the data generated by the log parser in a \nflexible schema. Any provided key/value pairs that match the name of a field in the table\nwill be inserted into the corresponding field. Any pairs that do not match a field in the table\nwill instead be inserted into a set of generic map fields based on the type of the value. \nThe table has a map for each common data type, including boolean (b_), date (d_), integer (i_),\nfloat (f_), string (s_), and list (l_).  
\n\nThe required fields on the table are shown below, and additional fields can be added as desired.\n\n```\ncreate table generic (\n    id timeuuid primary key,\n    b_ map\u003ctext, boolean\u003e,\n    d_ map\u003ctext, timestamp\u003e,\n    i_ map\u003ctext, bigint\u003e,\n    f_ map\u003ctext, double\u003e,\n    s_ map\u003ctext, text\u003e,\n    l_ map\u003ctext, text\u003e\n);\n```\n\nThe `genericize` function in `cassandra_store.py` handles the transformation of arbitrary\ndictionaries into the format of the parameters expected by the Cassandra Python Driver.  \nPrior to insertion, any nested dictionaries are flattened by combining the key paths using \nunderscores. For example `{'a': {'b': 'c'}, 'd': {'e': 'f', 'g': {'h': 'i'}}}` becomes \n`{'a_b': 'c', 'd_e': 'f', 'd_g_h': 'i'}`. Since lists can't be nested within a map in \nCassandra, lists are actually expressed as a string where each element of the list is \nseparated by a newline. Anything else will be coerced to JSON and inserted into the string map.\nAny fields that are present in the table but not provided in the dictionary will be set to None.\n\nThe `CassandraStore` class connects to the Cassandra cluster using the DataStax Python Driver\nand handles automatic preparation and caching of insert statements.  It provides an `insert` method \nto insert a single record into a specified table, either synchronously or asynchronously. 
\nTo maximize throughput without overloading the cluster, it provides a `slurp` method that will\nconcurrently insert records provided by a generator while maintaining an optimal number of inflight\nqueries.\n\n## Solr indexing\n\nThe Cassandra table containing parsed log entries can be indexed using the Solr implementation from \n[DataStax Enterprise](http://docs.datastax.com/en/datastax_enterprise/4.7//datastax_enterprise/newFeatures.html).\nA [schema.xml](solr/schema.xml) and [solrconfig.xml](solr/solrconfig.xml) are provided\nin the [solr](solr) directory along with the [add-schema.sh](solr/add-schema.sh) script \nwhich will upload the Solr schema to DSE.  \n\nOnce indexed in Solr, the log events can be subsequently analyzed and visualized using \n[Banana](https://github.com/LucidWorks/banana).  Banana is a port of Kibana 3.0 to Solr.\nSeveral pre-made dashboards are saved in json format in the [banana](banana) subdirectory. \nThese can be loaded using the Banana web UI.\n\nSetup Instructions:\n\n1. Clone https://github.com/LucidWorks/banana to $DSE_HOME/resources/banana.\n   Make sure you've checked out the release branch (should be the default).\n   If you want, you can `rm -rf .git` at this point to save space.\n   \n2. Edit resources/banana/src/config.js and:\n   - change `solr_core` to the core you're most frequently going to work with (only a \n     convenience; you can pick a different one later in the settings for each dashboard).\n   - change `banana_index` to `banana.dashboards` (can be anything you want, but modify step \n     3 accordingly). Not strictly necessary if you don't want to save dashboards to solr.\n\n3. 
Post the banana schema from `resources/banana/resources/banana-int-solr-4.5/banana-int/conf`\n   - Use the `solrconfig.xml` from this project instead of the one provided by banana\n   - Name the core the same name specified above in step 2.\n   - Not strictly necessary if you don't want to save dashboards to solr.\n\n   ```\n   curl --data-binary @solrconfig.xml -H 'Content-type:text/xml; charset=utf-8' \"http://localhost:8983/solr/resource/banana.dashboards/solrconfig.xml\"\n   curl --data-binary @schema.xml -H 'Content-type:text/xml; charset=utf-8' \"http://localhost:8983/solr/resource/banana.dashboards/schema.xml\"\n   curl -X POST -H 'Content-type:text/xml; charset=utf-8' \"http://localhost:8983/solr/admin/cores?action=CREATE\u0026name=banana.dashboards\"\n   ```\n\n4. Edit resources/tomcat/conf/server.xml and add the following inside the `\u003cHost\u003e` tags:\n\n   ```\n   \u003cContext docBase=\"../../banana/src\" path=\"/banana\" /\u003e\n   ```\n   \n5. If you've previously started DSE, remove `resources/tomcat/work` and restart.\n\n6. Start DSE and go to http://localhost:8983/banana\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Friptano%2Flogparse","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Friptano%2Flogparse","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Friptano%2Flogparse/lists"}