{"id":20749178,"url":"https://github.com/g-research/dgraph-lanl-csr","last_synced_at":"2025-04-28T12:24:06.513Z","repository":{"id":95346985,"uuid":"314345911","full_name":"G-Research/dgraph-lanl-csr","owner":"G-Research","description":"Project to load the \"Comprehensive, Multi-Source Cyber-Security Events\" dataset into a Dgraph cluster.","archived":false,"fork":false,"pushed_at":"2020-12-17T19:54:58.000Z","size":165,"stargazers_count":8,"open_issues_count":0,"forks_count":2,"subscribers_count":6,"default_branch":"master","last_synced_at":"2025-03-30T09:31:36.753Z","etag":null,"topics":["cyber-security","dataset","dgraph"],"latest_commit_sha":null,"homepage":"","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/G-Research.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-11-19T19:11:03.000Z","updated_at":"2025-03-10T18:39:16.000Z","dependencies_parsed_at":null,"dependency_job_id":"26df189e-4e3c-4efb-acf2-f67667b4480e","html_url":"https://github.com/G-Research/dgraph-lanl-csr","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/G-Research%2Fdgraph-lanl-csr","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/G-Research%2Fdgraph-lanl-csr/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/G-Research%2Fdgraph-lanl-csr/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/G-Research%2Fdgraph-lanl-csr/manifests","owner_url
":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/G-Research","download_url":"https://codeload.github.com/G-Research/dgraph-lanl-csr/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":251311990,"owners_count":21569138,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cyber-security","dataset","dgraph"],"created_at":"2024-11-17T08:21:28.437Z","updated_at":"2025-04-28T12:24:06.508Z","avatar_url":"https://github.com/G-Research.png","language":"Scala","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Dgraph LANL CSR cyber1 dataset\n\nThis project helps to load the [\"Comprehensive, Multi-Source Cyber-Security Events\"](https://csr.lanl.gov/data/cyber1/) dataset published\nby [Advanced Research in Cyber Systems](https://csr.lanl.gov/) into a [Dgraph cluster](https://dgraph.io/docs/get-started#dgraph).\n\nA detailed introduction into the Spark code that performs the pre-processing can be found in [SPARK.md](SPARK.md).\n\nThe pre-processing comprises the following steps:\n\n- [Download dataset](#download-dataset)\n- [Transform the dataset into RDF](#transform-the-dataset-into-rdf)\n- [Bulk-load the RDF into Dgraph](#loading-rdf-into-dgraph)\n- [Spin-up Dgraph cluster](#serve-the-graph)\n- [Example queries for Dgraph](#querying-dgraph)\n\nThe graph has the following schema:\n\n![Graph Schema](schema.png)\n\nThe graph model mimics the original dataset model as much as possible and adds the `User`, `Computer`\nand `ComputerUser` entities. 
Those have no `time` property, in contrast to the dataset entities that\nhave either `time` (event types) or `start`, `end` and `duration` (duration types) properties.\n\n## Statistics\n\nThe dataset and the derived graph have the following properties:\n\n|Table           |Rows         |Node Type        |Properties\u003cbr/\u003e/ Edges|Nodes        |Triples       |\n|:--------------:|:-----------:|:---------------:|:--------------------:|:-----------:|:------------:|\n|*all files*     |             |`User`           | 3 / 0                |      100,162|       400,648|\n|*all files*     |             |`Computer`       | 1 / 0                |       17,684|        35,368|\n|*all files*     |             |`ComputerUser`   | 0 / 2                |      900,983|     2,702,949|\n|`auth.txt.gz`   |1,051,430,459|`AuthEvent`      | 6 / 2                |1,051,430,459| 7,680,842,814|\n|`proc.txt.gz`   |  426,045,096|`ProcessEvent`   | 4 / 1                |  426,045,096| 2,130,225,480|\n|`flow.txt.gz`   |  129,977,412|`FlowDuration`   | 9 / 2                |  107,968,032| 1,048,963,354|\n|`dns.txt.gz`    |   40,821,591|`DnsEvent`       | 2 / 2                |   40,821,591|   163,286,364|\n|`redteam.txt.gz`|          749|`CompromiseEvent`| 2 / 2                |          715|         2,872|\n|||||||\n|**sum**         |1,648,275,307|                 |27 / 11               |1,627,284,722|11,026,459,849|\n\nThe dataset requires 11 GB (`.txt.gz`) / 89 GB (`.txt`) / 11 GB (`.parquet`) disk space.\nThe RDF version is 41 GB in size (`.gz`), Dgraph requires 191 GB disk space to store the data.\n\nThe dataset contains some null values in four columns:\nauthentication type (55%) and logon type (14%) in `auth.txt.gz` as well as\nsource (71%) and destination port (64%) in `flow.txt.gz`.\nAll other columns have values in all rows.\n\nTwo tables have duplicate rows: `flow.txt.gz` has 6,569,939 and `redteam.txt.gz` has 12 duplicates.\nThese get de-duplicated and respective nodes provide 
the number of duplicates in the `occurrences`\nproperty.\n\n## Download dataset\n\nFirst, download the dataset from https://csr.lanl.gov/data/cyber1/.\nThe compressed `.txt.gz` files should be decompressed, because gzip files cannot be split and the next step would otherwise not scale.\n\n## Transform the dataset into RDF\n\nThis project provides a Spark application that lets you transform the dataset CSV files into RDF\nthat can be processed by Dgraph live and bulk loaders.\n\nThe following commands read the dataset from `./data` and write RDF files to `./rdf`.\nUse appropriate paths accessible to the Spark workers if you run on a Spark cluster.\n\nRun the Spark application locally on your machine with\n\n    MAVEN_OPTS=-Xmx2g mvn test-compile exec:java -Dexec.classpathScope=\"test\" -Dexec.cleanupDaemonThreads=false \\\n        -Dexec.mainClass=\"uk.co.gresearch.dgraph.lanl.csr.RunSparkApp\" -Dexec.args=\"data rdf\"\n\nYou may want the Spark application to use a path other than `/tmp` for its temporary files.\nUse `SPARK_LOCAL_DIRS` for that:\n\n    SPARK_LOCAL_DIRS=$(pwd)/tmp MAVEN_OPTS=-Xmx2g mvn …\n\nRun the application via Spark submit on your Spark cluster:\n\n    mvn package\n    spark-submit --master \"…\" --class uk.co.gresearch.dgraph.lanl.csr.CsrDgraphSparkApp \\\n        target/dgraph-lanl-csr-1.0-SNAPSHOT.jar data/ rdf/\n\nThe application takes 2-3 hours on 8 CPUs with 4 GB RAM and 100 GB SSD disk.\nOn a cluster with more CPUs the time reduces proportionally.\n\n## Loading RDF into Dgraph\n\nLoad the RDF files by running\n\n    mkdir -p bulk tmp\n    cp dgraph.schema.rdf rdf/\n    ./dgraph.bulk.sh rdf bulk tmp /data/dgraph.schema.rdf \"/data/*.rdf/*.txt.gz\"\n\nThe `dgraph.schema.rdf` schema file defines all predicates and types and adds indices to all predicates.\n\nThe Dgraph bulk loader requires up to 32 GB of RAM and 200 GB of disk space.\nLoading the graph with 16 CPUs, 32 GB RAM, 200 GB temporary disk space and SSD disks takes 16 hours.\n\n## Serve the graph\n\nAfter bulk 
loading the RDF files into `bulk/out/0`, we can serve that graph by running\n\n    ./dgraph.serve.sh bulk\n\n## Querying Dgraph\n\nTen users (`User`), their logins (`ComputerUser`) and destinations of `AuthEvent`s from those logins:\n\n    {\n      user(func: eq(\u003cdgraph.type\u003e, \"User\"), first: 10) {\n        uid\n        id\n        login\n        domain\n        logins: ~user {\n          uid\n          computer { uid id }\n          logsOnto: ~sourceComputerUser @filter(eq(\u003cdgraph.type\u003e, \"AuthEvent\")) {\n            destinationComputerUser {\n              uid\n              computer { uid id }\n              user { uid id }\n            }\n          }\n        }\n      }\n    }\n\n![...](dgraph-ratel-query-graph.png)\n\n\n## Fine-tuning\n\nThe Spark application `CsrDgraphSparkApp` lets you customize the RDF generation part of this pipeline.\n\nThe input files are not particularly Spark-friendly. With `doParquet = true` they will be converted into\nParquet files on the first run and used from then on. The original `.txt` files can then be deleted.\n\n    // convert the input files to parquet on the first run, original .txt files can be deleted then\n    // parquet is compressed but can be read in a scalable way, unlike the original .txt.gz files\n    val doParquet = true\n\nUser ids are split on the `@` character. If your dataset uses a different separator between login and domain, set this here:\n\n    // user ids are split on this pattern to extract login and domain\n    val userIdSplitPattern = \"@\"\n\nThe Spark application prints some statistics of the dataset. 
Computing these is expensive and only needed once.\nYou should run this at least once to see whether the assumptions of the code hold for the particular dataset.\n\n    // prints statistics of the dataset, this is expensive so only really needed once\n    // this is considerably faster with parquet input files (see doParquet)\n    val doStatistics = false\n\nThe RDF files will be several times the size of the input files. Compressing them saves disk space at the extra cost of CPU.\n\n    // written RDF files will be compressed if true\n    val compressRdf = true\n\nSome input files are known to have duplicate rows. These are de-duplicated by the Spark application, which adds\nan optional `occurrences` predicate to those events that occur multiple times in the input files.\nComputing these extra predicates is expensive and only needs to be done for files where duplicate rows are known to exist.\nThe statistics provide such information for all input files.\n\n    // tables with duplicate rows need to be de-duplicated\n    // deduplication is expensive, so only set to true if there are duplicate rows\n    // you can set doStatistics = true to find out\n    val deduplicateAuth = false\n    val deduplicateProc = false\n    val deduplicateFlow = true\n    val deduplicateDns = false\n    val deduplicateRed = true\n\nIn `uk/co/gresearch/dgraph/lanl/package.scala` you can switch from [int](https://dgraph.io/docs/query-language/schema/#scalar-types) time\nto proper Dgraph [datetime](https://dgraph.io/docs/query-language/schema/#scalar-types) timestamps.\nInstead of\n\n```scala\ndef timeLiteral(time: Int): String = literal(time, integerType)\n```\n\nuse this `timeLiteral` implementation:\n\n```scala\ndef timeLiteral(time: Int): String =\n  literal(Instant.ofEpochSecond(time).atOffset(ZoneOffset.UTC).toString, datetimeType)\n```\n\nHere you could also offset the `int` time to any time epoch other than `1970-01-01`.\n\nSwitching to `datetime` timestamps requires you to also modify the 
`dgraph.schema.rdf`. Instead of\n\n    \u003ctime\u003e: int @index(int) .\n    \u003cstart\u003e: int @index(int) .\n    \u003cend\u003e: int @index(int) .\n\nyou should now use\n\n    \u003ctime\u003e: dateTime @index(hour) .\n    \u003cstart\u003e: dateTime @index(hour) .\n    \u003cend\u003e: dateTime @index(hour) .\n\nAll this allows you to benefit from [datetime indices](https://dgraph.io/docs/query-language/schema/#datetime-indices)\nrather than [integer index](https://dgraph.io/docs/query-language/schema/#indexing).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fg-research%2Fdgraph-lanl-csr","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fg-research%2Fdgraph-lanl-csr","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fg-research%2Fdgraph-lanl-csr/lists"}