{"id":19632115,"url":"https://github.com/das-group/rba-dataset","last_synced_at":"2026-03-02T09:02:01.626Z","repository":{"id":109343607,"uuid":"508778554","full_name":"das-group/rba-dataset","owner":"das-group","description":"Login feature data of more than 33M login attempts and 3M users (IP, UA, RTT)","archived":false,"fork":false,"pushed_at":"2022-06-29T18:59:46.000Z","size":2472,"stargazers_count":14,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-26T21:14:52.358Z","etag":null,"topics":["authentication","data-set","ip-address","login-data","risk-based-authentication","round-trip-time","security-testing","user-agent"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/das-group.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-06-29T17:03:53.000Z","updated_at":"2025-01-07T10:15:15.000Z","dependencies_parsed_at":"2023-04-26T14:37:52.294Z","dependency_job_id":null,"html_url":"https://github.com/das-group/rba-dataset","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/das-group/rba-dataset","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/das-group%2Frba-dataset","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/das-group%2Frba-dataset/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/das-group%2Frba-dataset/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/ho
sts/GitHub/repositories/das-group%2Frba-dataset/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/das-group","download_url":"https://codeload.github.com/das-group/rba-dataset/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/das-group%2Frba-dataset/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29996268,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-02T01:47:34.672Z","status":"online","status_checked_at":"2026-03-02T02:00:07.342Z","response_time":60,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["authentication","data-set","ip-address","login-data","risk-based-authentication","round-trip-time","security-testing","user-agent"],"created_at":"2024-11-11T12:12:57.688Z","updated_at":"2026-03-02T09:02:01.575Z","avatar_url":"https://github.com/das-group.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# Login Data Set for Risk-Based Authentication\n\n\u003e Synthesized login feature data of \u003e33M login attempts and \u003e3.3M users\n\u003e on a large-scale online service in Norway. Original data collected\n\u003e between February 2020 and February 2021.\n\nThis data set aims to foster research and development for\n[Risk-Based Authentication (RBA)] systems. 
The data was synthesized from\nthe real-world login behavior of more than 3.3M users at a large-scale\nsingle sign-on (SSO) online service in Norway.\n\nThe users used this SSO to access sensitive data provided by the online\nservice, e.g., cloud storage and billing information. We used this\ndata set to study how the [Freeman et al. (2016)] RBA model behaves on a\nlarge-scale online service in the real world (see [Publication](#publication)). The\nsynthesized data set can reproduce the results obtained on the original\ndata set (see [Study Reproduction](#study-reproduction)). Beyond that, you can use this data\nset to evaluate and improve RBA algorithms under real-world conditions.\n\n**WARNING:** The feature values are plausible, but still **totally**\n**artificial**. Therefore, you should NOT use this data set in\nproduction systems, e.g., intrusion detection systems.\n\n![Distribution of Login Attempts Included in the Synthesized Data Set](images/login-overview.png)\n\n## Table of Contents\n\n\u003c!-- TOC depthFrom:2 depthTo:6 withLinks:1 updateOnSave:1 orderedList:0 --\u003e\n\n- [Table of Contents](#table-of-contents)\n- [Download](#download)\n- [Overview](#overview)\n- [Data Creation](#data-creation)\n\t- [Regarding the Data Values](#regarding-the-data-values)\n- [Study Reproduction](#study-reproduction)\n- [Ethics](#ethics)\n- [Publication](#publication)\n\t\t- [Bibtex](#bibtex)\n- [License](#license)\n\n\u003c!-- /TOC --\u003e\n\n## Download\n\nYou can download the data set under the [Releases](https://github.com/das-group/rba-dataset/releases) section of this GitHub project.\n\n\n## Overview\n\nThe data set contains the following features related to each login\nattempt on the SSO:\n\n\nFeature                    | Data Type | Description                                                                                      | Range or 
Example\n---------------------------|-----------|--------------------------------------------------------------------------------------------------|------------------------------------------------------\nIP Address                 | String    | IP address belonging to the login attempt                                                        | 0.0.0.0 - 255.255.255.255\nCountry                    | String    | Country derived from the IP address                                                              | US\nRegion                     | String    | Region derived from the IP address                                                               | New York\nCity                       | String    | City derived from the IP address                                                                 | Rochester\nASN                        | Integer   | Autonomous system number derived from the IP address                                             | 0 - 600000\nUser Agent String          | String    | User agent string submitted by the client                                                        | Mozilla/5.0 (Windows NT 10.0; Win64; \\...\nOS Name and Version        | String    | Operating system name and version derived from the user agent string                             | Windows 10\nBrowser Name and Version   | String    | Browser name and version derived from the user agent string                                      | Chrome 70.0.3538\nDevice Type                | String    | Device type derived from the user agent string                                                   | (`mobile`, `desktop`, `tablet`, `bot`, `unknown`)[^1]\nUser ID                    | Integer   | Identification number related to the affected user account                                       | [Random pseudonym]\nLogin Timestamp            | Integer   | Timestamp related to the login attempt                                                           | [64 Bit timestamp]\nRound-Trip Time (RTT) 
[ms] | Integer   | Server-side measured latency between client and server                                           | 1 - 8600000\nLogin Successful           | Boolean   | `True`: Login was successful, `False`: Login failed                                              | (`true`, `false`)\nIs Attack IP               | Boolean   | IP address was found in known attacker data set                                                  | (`true`, `false`)\nIs Account Takeover        | Boolean   | Login attempt was identified as account takeover by incident response team of the online service | (`true`, `false`)\n\n[^1]: A few (invalid) user agent strings from the original data set could not be parsed, so their device type is empty. Perhaps this parse error is useful information for your studies, so we kept these 1526 entries.\n\n## Data Creation\n\nAs the data set targets RBA systems, especially the [Freeman et al.\n(2016)] model, the statistical feature probabilities between all users,\nglobally and locally, are identical for the categorical data. All the\nother data was randomly generated while maintaining logical relations\nand temporal order between the features.\n\nThe timestamps, however, are not identical and contain randomness. The\nfeature values related to IP address and user agent string were randomly\ngenerated from publicly available data, so they were very likely not\npresent in the real data set. The RTTs resemble real values but were\nrandomly assigned among users per geolocation. Therefore, the RTT\nentries were probably in other positions in the original data set.\n\n- The country was randomly assigned per unique feature value. Based on\n  that, we randomly assigned an ASN related to the country, and\n  generated the IP addresses for this ASN. 
The cities and regions were\n  derived from the generated IP addresses for privacy reasons and do not\n  reflect the real logical relations from the original data set.\n\n- The device types are identical to the real data set. Based on that, we\n  randomly assigned the OS, and based on the OS the browser information.\n  From this information, we randomly generated the user agent string.\n  Therefore, all the logical relations regarding the user agent are\n  identical to those in the real data set.\n\n- The RTT was randomly drawn based on the login success status and\n  synthesized geolocation data. We did this to ensure that the RTTs are\n  realistic.\n\n\n### Regarding the Data Values\n\nDue to unresolvable conflicts during the data creation, we had to assign\nsome unrealistic IP addresses and ASNs that are not present in the real\nworld. Nevertheless, these do not have any effects on the risk scores\ngenerated by the [Freeman et al. (2016)] model.\n\nYou can recognize them by the following values:\n\n- ASNs with values \u003e= 500000\n\n- IP addresses in the range 10.0.0.0 - 10.255.255.255 (10.0.0.0/8 CIDR\n  range)\n\n\n## Study Reproduction\n\nBased on our evaluation, this data set can reproduce our study results\nregarding the RBA behavior of an RBA model using the IP address (IP\naddress, country, and ASN) and user agent string (full string, OS name\nand version, browser name and version, device type) as features.\n\nThe calculated RTT significances for countries and regions inside Norway\nare not identical using this data set, but have similar tendencies. The\nsame is true for the median RTTs per country. This is due to the fact\nthat the available number of entries per country, region, and city\nchanged with the data creation procedure. 
However, the RTTs still\nreflect the real-world distributions of different geolocations by city.\n\nSee [RESULTS.md](RESULTS.md) for more details.\n\n![Median RTTs by Country](images/rtts-continents.png)\n\n\n## Ethics\n\nBy using the SSO service, the users agreed to the data collection and\nevaluation for research purposes. For study reproduction and fostering\nRBA research, we agreed with the data owner to create a synthesized data\nset that does not allow re-identification of customers.\n\nThe synthesized data set does not contain any sensitive data values, as\nthe IP addresses, browser identifiers, login timestamps, and RTTs were\nrandomly generated and assigned.\n\n\n## Publication\n\nYou can find more details on our study in the following\njournal article:\n\n[Pump Up Password Security! Evaluating and Enhancing Risk-Based Authentication on a Real-World Large-Scale Online Service] (2022)\u003cbr\u003e\n_Stephan Wiefling, Paul René Jørgensen, Sigurd Thunem, and Luigi Lo Iacono_.\u003cbr\u003e\n_ACM Transactions on Privacy and Security_\n\n\n\n#### Bibtex\n\n~~~.bibtex\n@article{Wiefling_Pump_2022,\n  author = {Wiefling, Stephan and Jørgensen, Paul René and Thunem, Sigurd and Lo Iacono, Luigi},\n  title  = {Pump {Up} {Password} {Security}! {Evaluating} and {Enhancing} {Risk}-{Based} {Authentication} on a {Real}-{World} {Large}-{Scale} {Online} {Service}},\n  journal = {{ACM} {Transactions} on {Privacy} and {Security}},\n  doi = {10.1145/3546069},\n  publisher = {ACM},\n  year   = {2022}\n}\n~~~\n\n\n\n## License\n\nThis data set and the contents of this repository are licensed under the\n[Creative Commons Attribution 4.0 International (CC BY 4.0)] license.\nSee the [LICENSE](LICENSE) file for details.  If the data set is used\nwithin a publication, the following journal article has to be cited as\nthe source of the data set:\n\nStephan Wiefling, Paul René Jørgensen, Sigurd Thunem, and Luigi Lo\nIacono: Pump Up Password Security! 
Evaluating and Enhancing Risk-Based\nAuthentication on a Real-World Large-Scale Online Service. In: ACM\nTransactions on Privacy and Security (2022). doi: [10.1145/3546069](https://doi.org/10.1145/3546069)\n\n\n\n[Pump Up Password Security! Evaluating and Enhancing Risk-Based Authentication on a Real-World Large-Scale Online Service]: https://doi.org/10.1145/3546069\n[Risk-Based Authentication (RBA)]: https://riskbasedauthentication.org\n[Freeman et al. (2016)]: https://doi.org/10.14722/ndss.2016.23240\n[Creative Commons Attribution 4.0 International (CC BY 4.0)]: https://creativecommons.org/licenses/by/4.0/","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdas-group%2Frba-dataset","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdas-group%2Frba-dataset","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdas-group%2Frba-dataset/lists"}