{"id":17464188,"url":"https://github.com/onlyphantom/verisr2","last_synced_at":"2025-10-25T12:20:10.021Z","repository":{"id":109717463,"uuid":"174252049","full_name":"onlyphantom/verisr2","owner":"onlyphantom","description":"Convenience functions for exploratory analysis on VERIS database","archived":false,"fork":false,"pushed_at":"2019-04-13T05:27:07.000Z","size":2551,"stargazers_count":2,"open_issues_count":1,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-06-20T06:38:15.049Z","etag":null,"topics":["incidents","r-cyber","security","security-incidents","security-issue","veris","verisr","vz-risk"],"latest_commit_sha":null,"homepage":"","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/onlyphantom.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2019-03-07T01:53:40.000Z","updated_at":"2020-02-26T15:06:13.000Z","dependencies_parsed_at":"2023-04-08T06:46:20.441Z","dependency_job_id":null,"html_url":"https://github.com/onlyphantom/verisr2","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"purl":"pkg:github/onlyphantom/verisr2","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/onlyphantom%2Fverisr2","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/onlyphantom%2Fverisr2/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/onlyphantom%2Fverisr2/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/onlyphantom%2Fveri
sr2/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/onlyphantom","download_url":"https://codeload.github.com/onlyphantom/verisr2/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/onlyphantom%2Fverisr2/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":270974739,"owners_count":24678250,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-18T02:00:08.743Z","response_time":89,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["incidents","r-cyber","security","security-incidents","security-issue","veris","verisr","vz-risk"],"created_at":"2024-10-18T10:44:55.399Z","updated_at":"2025-10-25T12:20:09.956Z","avatar_url":"https://github.com/onlyphantom.png","language":"R","readme":"verisr2\n=======\n\nConvenience functions for exploratory analysis on the VERIS database\n(\u003ca href=\"http://veriscommunity.net\" class=\"uri\"\u003ehttp://veriscommunity.net\u003c/a\u003e).\n\nSmall helper functions for working with the data frame objects from the\n[VERIS Community Database (VCDB)](http://veriscommunity.net/vcdb.html),\ntypically converted from JSON using the\n[verisr](https://github.com/vz-risk/verisr) package (or, if unavailable,\nfrom my fork of [this\npackage](https://github.com/onlyphantom/verisr)). 
This package\nreplicates in base R or dplyr many of the helper functions originally\nimplemented in the verisr package by Jay Jacobs.\n\nThe original package by Jay uses `data.table` code that is deprecated\nand no longer works with recent versions of R. The author has stated his\ndesire to one day rewrite these functions in dplyr code, but since effort\non that has been stagnant for a few years, this package is a simple attempt\nto recreate these helper functions in `dplyr` or base R code.\n\nInstallation and Getting Started\n--------------------------------\n\nInstall it from GitHub and load the built-in dataset:\n\n``` r\n# install devtools from https://github.com/hadley/devtools\ndevtools::install_github(\"onlyphantom/verisr2\")\nlibrary(verisr2)\ndata(vcdb)\n```\n\nInspecting the class of the data:\n\n``` r\nclass(vcdb)\n```\n\n    ## [1] \"verisr\"     \"data.frame\"\n\nBecause the incidents are originally recorded in JSON, the transformed\ndata is “wide”, spanning more than 2,430 variables as of this\nwriting. 
The VERIS specification intends for the data schema to be\nextended, and when expressed as a data frame, this wide format\npresents an opportunity for data analysis and exploratory exercises:\n\n``` r\ndim(vcdb)\n```\n\n    ## [1] 8198 2436\n\nConvenience Functions\n---------------------\n\nRetrieve a list of variables (enumeration / factors) in the data frame\nfrom a specified “parent”:\n\n``` r\ngetenum_stri(vcdb, \"action.error.vector\")[1:5]\n```\n\n    ## [1] \"action.error.vector.Carelessness\"         \n    ## [2] \"action.error.vector.Inadequate personnel\" \n    ## [3] \"action.error.vector.Inadequate processes\" \n    ## [4] \"action.error.vector.Inadequate technology\"\n    ## [5] \"action.error.vector.Other\"\n\nThe same function also accepts a vector of (character)\nstrings instead of a single string value:\n\n``` r\ngetenum_stri(vcdb, c(\"actor.internal.motive\", \"value_chain.money laundering.variety\"))[8:12]\n```\n\n    ## [1] \"actor.internal.motive.NA\"                 \n    ## [2] \"actor.internal.motive.Other\"              \n    ## [3] \"actor.internal.motive.Secondary\"          \n    ## [4] \"actor.internal.motive.Unknown\"            \n    ## [5] \"value_chain.money laundering.variety.Bank\"\n\nTo get a frequency table, use `getenum_tbl`:\n\n``` r\ngetenum_tbl(vcdb, c(\"action\", \"asset.variety\"))\n```\n\n    ##           action.Malware           action.Hacking            action.Social \n    ##                      678                     2185                      554 \n    ##          action.Physical            action.Misuse             action.Error \n    ##                     1565                     1675                     2374 \n    ##     action.Environmental           action.Unknown     asset.variety.Server \n    ##                        8                      237                     3819 \n    ##    asset.variety.Network   asset.variety.User Dev      asset.variety.Media \n    ##                      157   
                  1478                     2207 \n    ##     asset.variety.Person asset.variety.Kiosk/Term    asset.variety.Unknown \n    ##                      606                      345                      646 \n    ##   asset.variety.Embedded \n    ##                        2\n\nWe can use the `getenum_df` function to get both the count and the\nproportion of assets where data loss has occurred. This replicates the\noriginal functionality from the `jayjacobs` and `vz-risk` versions but uses\nbase R in its underlying function:\n\n``` r\ngetenum_df(vcdb, \"asset.variety\")\n```\n\n    ##         enum    x    n    freq\n    ## 1     Server 3819 8188 0.46641\n    ## 2      Media 2207 8188 0.26954\n    ## 3   User Dev 1478 8188 0.18051\n    ## 4     Person  606 8188 0.07401\n    ## 5 Kiosk/Term  345 8188 0.04213\n    ## 6    Network  157 8188 0.01917\n    ## 7   Embedded    2 8188 0.00024\n    ## 8    Unknown  646   NA      NA\n\nSimilarly, we can pass in a vector of two character strings to the function,\nwhich will count the number of incidents across the two enumerations:\n\n``` r\ngetenum_df(vcdb, c(\"action\", \"asset.variety\"))\n```\n\n    ## # A tibble: 64 x 3\n    ##    action   asset.variety     x\n    ##    \u003cchr\u003e    \u003cchr\u003e         \u003cint\u003e\n    ##  1 Hacking  Server         1890\n    ##  2 Error    Media          1395\n    ##  3 Misuse   Server         1030\n    ##  4 Physical User Dev        706\n    ##  5 Error    Server          662\n    ##  6 Social   Person          554\n    ##  7 Physical Media           478\n    ##  8 Malware  Server          453\n    ##  9 Social   Server          375\n    ## 10 Malware  User Dev        371\n    ## # … with 54 more rows\n\n`enum2grid` replicates the plotting function in the `jayjacobs` version, and\nwill work with all recent versions of R:\n\n``` r\nenum2grid(vcdb, c(\"asset.variety\", \"actor.external.variety\"))\n```\n\n![](README_files/figure-markdown_github/unnamed-chunk-10-1.png)\n\nAnother 
example:\n\n``` r\nenum2grid(vcdb, c(\"action\", \"asset.variety\"))\n```\n\n![](README_files/figure-markdown_github/unnamed-chunk-11-1.png)\n\n`importveris()` is a thin wrapper over the `json2veris()` function. In\nlater versions of vcdb incidents, the original function may return a\ndata frame in which one or more variables are themselves nested\nlist objects. This function drops these columns, leaving the data frame in a\nmore ready state for most data analysis tasks:\n\n``` r\nvcdb_small \u003c- importveris(\"~/Datasets/vcdb_small/\")\n```\n\n    ## [1] \"veris dimensions\"\n    ## [1]    0 2437\n    ## named integer(0)\n    ## named integer(0)\n\nTransform VCDB to a tidyverse-esque data frame\n----------------------------------------------\n\n`collapse_vcdb()` takes a `vcdb` data frame and turns it into a more\ncompact data frame that conforms to the “tidyverse” specifications. New\nfeatures are created from the original data, using values that best\nrepresent each enumeration. An oversimplified diagram explaining this\nprocess is as follows: ![](README_files/collapse.png)\n\n``` r\ntidy_vcdb \u003c- collapse_vcdb(vcdb)\nstr(tidy_vcdb)\n```\n\n    ## Loading verisr2\n\n    ## 'data.frame':    8198 obs. 
of  15 variables:\n    ##  $ action                      : Factor w/ 9 levels \"Environmental\",..: 5 7 2 3 2 2 3 3 7 6 ...\n    ##  $ action.environmental.notes  : chr  NA NA NA NA ...\n    ##  $ action.environmental.variety: Factor w/ 4 levels \"Fire\",\"Humidity\",..: 4 4 4 4 4 4 4 4 4 4 ...\n    ##  $ action.error.notes          : chr  NA NA NA NA ...\n    ##  $ action.error.variety        : Factor w/ 18 levels \"Capacity shortage\",..: 18 18 6 18 10 10 18 18 18 18 ...\n    ##  $ action.error.vector         : Factor w/ 8 levels \"Carelessness\",..: 8 8 8 8 1 1 8 8 8 8 ...\n    ##  $ action.hacking.cve          : chr  NA NA NA NA ...\n    ##  $ action.hacking.notes        : chr  NA NA NA NA ...\n    ##  $ action.hacking.result       : Factor w/ 5 levels \"Elevate\",\"Exfiltrate\",..: 5 5 5 5 5 5 5 5 5 5 ...\n    ##  $ action.hacking.variety      : Factor w/ 8 levels \"Brute force\",..: 6 6 6 5 6 6 6 6 6 6 ...\n    ##  $ action.hacking.vector       : Factor w/ 11 levels \"Backdoor or C2\",..: 9 9 9 11 9 9 11 11 9 9 ...\n    ##  $ action.malware.cve          : chr  NA NA NA NA ...\n    ##  $ action.malware.name         : chr  NA NA NA NA ...\n    ##  $ action.malware.notes        : chr  NA NA NA NA ...\n    ##  $ action.malware.result       : Factor w/ 5 levels \"Elevate\",\"Exfiltrate\",..: 5 5 5 5 5 5 5 5 5 5 ...\n\nNote that the new data frame is a lot more compact, with 175 instead of\nthe original 2,430+ variables:\n\n``` r\ndim(tidy_vcdb)\n```\n\n    ## [1] 8198  175\n\nWhere the original VCDB has a shape that resembles a “sparse matrix”,\nthis new “tidy” data frame now has most variables as factor and numeric\nvalues. 
Obviously some loss of fidelity happens (a nearly 2,500-column data\nmatrix where most values are 0 is reduced to a 175-column data frame where only the\nrepresentative value is stored in each dimension / enumeration):\n\n    ## \n    ## c(\"ordered\", \"factor\")              character                 factor \n    ##                      1                     59                    105 \n    ##                numeric \n    ##                     10\n\nCombining with `ggplot2`\n------------------------\n\nThe data (both the original `vcdb` and its tidy variant) also works\nwell with the rest of `tidyverse`. An example is to use the data in\nconjunction with `dplyr` and `ggplot2`:\n\n``` r\nlibrary(dplyr)\nlibrary(ggplot2)\n\nvcdb %\u003e%\n  group_by(attribute.confidentiality.data_disclosure.Yes) %\u003e%\n  dplyr::count(timeline.incident.year) %\u003e%\n  ungroup() %\u003e% \n  mutate(\n    breach = ifelse(attribute.confidentiality.data_disclosure.Yes, \n                    \"Breach\", \"Incident\")\n  ) %\u003e% filter(\n    timeline.incident.year \u003e 2000\n  ) %\u003e% ggplot(aes(x=timeline.incident.year, y=n, group=breach)) +\n  geom_col(aes(fill=breach), position = \"dodge\") +\n  scale_x_continuous(expand=c(0,0), breaks=seq(2000, 2018, 3)) + \n  scale_y_continuous(expand=c(0,0)) + \n  scale_fill_brewer(palette = 11) + \n  labs(title=\"VCDB Confidentiality Breaches\", caption=\"Confidentiality breaches where data disclosure occurred\")\n```\n\n![](README_files/figure-markdown_github/unnamed-chunk-18-1.png)\n\nCountry-level investigation\n---------------------------\n\nExisting functions in this package already make country-level\ninspection fairly effortless:\n\n``` r\nsummary(tidy_vcdb$actor.external.country, maxsum=8)\n```\n\n    ##  Unknown       US Multiple       RU       CN       PK       SY  (Other) \n    ##     7320      220      180      110       43       40       36      249\n\nWe can use the collapsed data frame (result of `collapse_vcdb`) to\nperform our inspection:\n\n``` r\nusvictim \u003c- 
subset(tidy_vcdb, victim.country==\"US\")\nhead(usvictim$notes)\n```\n\n    ## [1] \"lincoln financial securities Corporation is a subsidiary of Lincoln national Corporation\"                                                                \n    ## [2] \"Limited information provided and there have been no follow-up articles.\"                                                                                 \n    ## [3] \"HHS Breach Tool\"                                                                                                                                         \n    ## [4] \"The Sentry email mistake was modeled seperately. \"                                                                                                       \n    ## [5] \"I can't discern who was breached here. It says DoD. But it also says satellite manufacturer. I'm assuming the latter working for DoD\"                    \n    ## [6] \"The final record count was obtained from the HHS Breachtool record for this incident.  It was listed under the partner rather than Owensboro, strangely.\"\n\nAs of version 0.4.0, the new function `involving_country()` allows us to\nquery even more effectively for all incidents where a specified country\nis involved:\n\n``` r\nus \u003c- involving_country(data = vcdb, \"US\")\nhead(us$discovery_notes)\n```\n\n    ## [1] \"In June 2014, Epic Systems Corp. 
in Verona received an email that no software company can ignore: Employees of a company working for one of its customers had gained unauthorized access to a restricted website and may have stolen documents that contained trade secrets.\"\n    ## [2] \"actor was arrested on drug charge and they found skimmer and cards, notified employer.\"                                                                                                                                                                                      \n    ## [3] \"Committed ID fraud against her own maid of honor and used her real phone number when establishing fraudulent lines of credit.\"                                                                                                                                               \n    ## [4] \"the FBI was investigating after 2.5GB of data taken from its servers was dumped online and swiftly shared on social media. The union's national site, fop.net, remained offline on Thursday evening\"                                                                         \n    ## [5] \"Engel said, though, that the university didn’t confirm that data had been breached or learn about its apparent scope until external investigators notified officials July 31, 2018.\"                                                                                         \n    ## [6] \"We have disabled the malware and have reconfigured our point-of-sale and payment card processing systems to enhance the security of these systems\"\n\nBy default, `involving_country` returns all columns of every incident\ninvolving that country. If we would like to retrieve only the columns\nwhere one or more notes are present (discovery notes, incident notes,\nimpact notes, actor notes etc - more than 30 of such columns), then set\n`notes_only` to `TRUE`. 
The function helpfully drops any incidents (rows)\nwhere no notes are present:\n\n``` r\n# only notes-type columns\nus_small \u003c- involving_country(vcdb, \"US\", notes_only=TRUE)\ndim(us_small)\n```\n\n    ## [1] 1928   30\n\nCredits\n-------\n\n-   Many thanks to [Jay Jacobs](https://github.com/jayjacobs) for\n    the original `verisr` package. While it hasn’t received any updates\n    in recent years, the project has been a tremendous help and starting\n    point.\n\n-   Thanks to the Verizon RISK Team and the community behind the VERIS\n    Community Database.\n\n-   Thanks to Hadley Wickham, the contributors and all maintainers of\n    packages used in this project.\n\nContributing and Issues\n-----------------------\n\nThe project is licensed under GPL-2. Please feel free to fork, submit\npull requests or open issues.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fonlyphantom%2Fverisr2","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fonlyphantom%2Fverisr2","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fonlyphantom%2Fverisr2/lists"}