{"id":18521853,"url":"https://github.com/onai/loggraph","last_synced_at":"2025-07-08T20:12:00.380Z","repository":{"id":112448731,"uuid":"265373547","full_name":"onai/loggraph","owner":"onai","description":"Logfiles to graphs","archived":false,"fork":false,"pushed_at":"2020-05-22T05:13:39.000Z","size":302,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-02-17T05:25:31.685Z","etag":null,"topics":["giraph","hdf5","logs","pcap"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/onai.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-05-19T21:39:50.000Z","updated_at":"2020-05-22T05:13:42.000Z","dependencies_parsed_at":"2023-05-15T01:45:16.554Z","dependency_job_id":null,"html_url":"https://github.com/onai/loggraph","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/onai%2Floggraph","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/onai%2Floggraph/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/onai%2Floggraph/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/onai%2Floggraph/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/onai","download_url":"https://codeload.github.com/onai/loggraph/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.
com","kind":"github","repositories_count":254198510,"owners_count":22030966,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["giraph","hdf5","logs","pcap"],"created_at":"2024-11-06T17:27:59.249Z","updated_at":"2025-05-14T18:09:22.279Z","avatar_url":"https://github.com/onai.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Loggraph\nAutomatic generation of graphs from log files\n\n# Table of contents\n\n1. [Background](#Background)\n2. [How to run](#how-to-run)\n## Background\nFirst, we describe some log files of interest, then describe the tool.\n### Giraph graphs:\nThe graph engine Giraph ingests graphs in a list format: `[source_id,source_value,[[dest_id, edge_value],...]]`.\n\n### Graphs from HDFS logs:\n![HDFS log file snippet](images/hdf5log.png)\n\nConsider the use case of analyzing HDFS logs. Apache HDFS is a distributed file system designed to handle large data. A distributed system running algorithms on Hadoop with HDFS has an inherent graph structure produced by the block (data) transfers between nodes. For example, the first line in the figure above has a source (src) and destination (dest) IP/port. A Python script scans each line for `src` and `dest`, and each match is treated as an edge from source to destination. Multiple occurrences of an edge in the log file result in a higher edge weight.\n\n### PCAP graphs:\nAs another example, consider pcap logs. Pcap can capture network traffic and save it as a `.pcap` file. 
The IP layer captured by pcap can be used to create a meaningful graph of source IPs and destination IPs. As with HDFS logs, we assign higher edge weights for multiple occurrences of edges.\n\n### Graph from any arbitrary log file:\n\n![Randomly generated log file](images/logfile.png)\n\n\nTo generate a random log for testing, we can leverage Python's logging facility. For example, the figure above shows a snippet of a random log file. This log file documents the day, time, user, type of error and IPs in question. A random log file might use any delimiter, but certain ones are more common.\n\n### Method:\nWe desire an approach that works on all of the above types of logs. We developed a tool that allows the user to specify which columns correspond to the nodes (and then infers the edges). While this of course yields good results, it is a far cry from our goal of automation. We therefore also developed a rules-based approach.\nThis approach first predicts which delimiter a log file uses by checking for the delimiters ',' ' ' '\\t' ';' in that order. After the delimiter is predicted, we look for matching columns as follows, leveraging the fact that all nodes must be of the same “type.”\nWe compare the set of values in each column with the set of values in every other column, and check whether the size of their set intersection exceeds a pre-selected threshold. A set intersection is computed by finding the unique elements that are common to the two columns. If the intersection size exceeds the threshold, we select the two columns as nodes and generate a Giraph graph from them. We do this for each pair of columns whose set intersection size exceeds the threshold.\nIn testing, this succeeded in perfectly handling all log files described above, including random log files generated for IP addresses, call records, and more. 
It is packaged as a container.\n\n## How to run\n\n### Build the Docker container\n`docker build -t logtogiraph .`\n\n### Enter the Docker container\n`docker run -it --entrypoint=\"/bin/bash\" logtogiraph`\n\n### Generating a Giraph graph from HDFS logs\n`python3 hdfs-graph.py \u003chdfs_log_file\u003e`\n\n### Generating a random log file\n```\ncd random-log\npython3 logexample.py\n```\n\n### Generating Giraph graphs from log files\n```\ncd random-log\npython3 log-graph.py \u003crandom_log_filename\u003e\n```\nThe graphs are stored in JSON files numbered from 0 to n, e.g. `0.json`, `1.json`, `2.json`, etc.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fonai%2Floggraph","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fonai%2Floggraph","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fonai%2Floggraph/lists"}