https://github.com/gregavrbancic/phishing-dataset

Phishing dataset with more than 88,000 instances and 111 features. Web application: https://gregavrbancic.github.io/Phishing-Dataset/

# Datasets for Phishing Websites Detection

This repository presents two variants of the phishing dataset: `dataset_full.csv` and `dataset_small.csv`.

## Web application

To preview the dataset interactively and/or tailor it to your needs, please visit a dedicated [web application](https://gregavrbancic.github.io/Phishing-Dataset/).

## dataset_full.csv

**Short description of the full variant dataset:**
- Total number of instances: 88,647
- Number of legitimate website instances (labeled as 0): 58,000
- Number of phishing website instances (labeled as 1): 30,647
- Total number of features: 111 (without target)

## dataset_small.csv

**Short description of the small variant dataset:**
- Total number of instances: 58,645
- Number of legitimate website instances (labeled as 0): 27,998
- Number of phishing website instances (labeled as 1): 30,647
- Total number of features: 111 (without target)
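As a quick sanity check after downloading either file, the class counts listed above can be compared against the data. The following is a minimal sketch using only the Python standard library. It assumes the CSV has a header row and that the target column is named `phishing` and holds the values `0` and `1`, as the feature table in this README suggests. The function name `count_labels` is hypothetical and not part of this repository.

```python
import csv
from collections import Counter


def count_labels(csv_file, target="phishing"):
    """Count rows per class label in an open CSV file that has a header row.

    Labels are returned as the strings found in the file (assumed "0" / "1").
    """
    reader = csv.DictReader(csv_file)
    return Counter(row[target].strip() for row in reader)


# Example usage (assumes dataset_small.csv is in the working directory):
#
#     with open("dataset_small.csv", newline="") as f:
#         print(count_labels(f))
```

If the file matches the description above, the count for label `1` should be 30,647 in both variants, while the count for label `0` should differ between them.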

## Extracted Features

| Feature | Description |
|----------------------------|----------------------------------------------------|
| qty_dot_url | count (.) in URL |
| qty_hyphen_url | count (-) in URL |
| qty_underline_url | count (_) in URL |
| qty_slash_url | count (/) in URL |
| qty_questionmark_url | count (?) in URL |
| qty_equal_url | count (=) in URL |
| qty_at_url | count (@) in URL |
| qty_and_url | count (&) in URL |
| qty_exclamation_url | count (!) in URL |
| qty_space_url | count ( ) in URL |
| qty_tilde_url | count (~) in URL |
| qty_comma_url | count (,) in URL |
| qty_plus_url | count (+) in URL |
| qty_asterisk_url | count (*) in URL |
| qty_hashtag_url | count (#) in URL |
| qty_dollar_url | count ($) in URL |
| qty_percent_url | count (%) in URL |
| qty_tld_url | length of the top-level domain (TLD) |
| length_url | URL length |
| qty_dot_domain | count (.) in domain |
| qty_hyphen_domain | count (-) in domain |
| qty_underline_domain | count (_) in domain |
| qty_slash_domain | count (/) in domain |
| qty_questionmark_domain | count (?) in domain |
| qty_equal_domain | count (=) in domain |
| qty_at_domain | count (@) in domain |
| qty_and_domain | count (&) in domain |
| qty_exclamation_domain | count (!) in domain |
| qty_space_domain | count ( ) in domain |
| qty_tilde_domain | count (~) in domain |
| qty_comma_domain | count (,) in domain |
| qty_plus_domain | count (+) in domain |
| qty_asterisk_domain | count (*) in domain |
| qty_hashtag_domain | count (#) in domain |
| qty_dollar_domain | count ($) in domain |
| qty_percent_domain | count (%) in domain |
| qty_vowels_domain | count of vowels in domain |
| domain_length | domain length |
| domain_in_ip | whether the URL domain is in IP address format |
| server_client_domain | domain contains the keywords "server" or "client" |
| qty_dot_directory | count (.) in directory |
| qty_hyphen_directory | count (-) in directory |
| qty_underline_directory | count (_) in directory |
| qty_slash_directory | count (/) in directory |
| qty_questionmark_directory | count (?) in directory |
| qty_equal_directory | count (=) in directory |
| qty_at_directory | count (@) in directory |
| qty_and_directory | count (&) in directory |
| qty_exclamation_directory | count (!) in directory |
| qty_space_directory | count ( ) in directory |
| qty_tilde_directory | count (~) in directory |
| qty_comma_directory | count (,) in directory |
| qty_plus_directory | count (+) in directory |
| qty_asterisk_directory | count (*) in directory |
| qty_hashtag_directory | count (#) in directory |
| qty_dollar_directory | count ($) in directory |
| qty_percent_directory | count (%) in directory |
| directory_length | directory length |
| qty_dot_file | count (.) in file |
| qty_hyphen_file | count (-) in file |
| qty_underline_file | count (_) in file |
| qty_slash_file | count (/) in file |
| qty_questionmark_file | count (?) in file |
| qty_equal_file | count (=) in file |
| qty_at_file | count (@) in file |
| qty_and_file | count (&) in file |
| qty_exclamation_file | count (!) in file |
| qty_space_file | count ( ) in file |
| qty_tilde_file | count (~) in file |
| qty_comma_file | count (,) in file |
| qty_plus_file | count (+) in file |
| qty_asterisk_file | count (*) in file |
| qty_hashtag_file | count (#) in file |
| qty_dollar_file | count ($) in file |
| qty_percent_file | count (%) in file |
| file_length | file length |
| qty_dot_params | count (.) in parameters |
| qty_hyphen_params | count (-) in parameters |
| qty_underline_params | count (_) in parameters |
| qty_slash_params | count (/) in parameters |
| qty_questionmark_params | count (?) in parameters |
| qty_equal_params | count (=) in parameters |
| qty_at_params | count (@) in parameters |
| qty_and_params | count (&) in parameters |
| qty_exclamation_params | count (!) in parameters |
| qty_space_params | count ( ) in parameters |
| qty_tilde_params | count (~) in parameters |
| qty_comma_params | count (,) in parameters |
| qty_plus_params | count (+) in parameters |
| qty_asterisk_params | count (*) in parameters |
| qty_hashtag_params | count (#) in parameters |
| qty_dollar_params | count ($) in parameters |
| qty_percent_params | count (%) in parameters |
| params_length | parameters length |
| tld_present_params | whether a TLD is present in the parameters |
| qty_params | number of parameters |
| email_in_url | email present in URL |
| time_response | domain lookup response time |
| domain_spf | whether the domain has an SPF record |
| asn_ip | autonomous system number (ASN) |
| time_domain_activation | time (in days) of domain activation |
| time_domain_expiration | time (in days) of domain expiration |
| qty_ip_resolved | number of resolved IPs |
| qty_nameservers | number of resolved name servers (NS) |
| qty_mx_servers | number of MX Servers |
| ttl_hostname | time-to-live (TTL) value associated with hostname |
| tls_ssl_certificate | whether the TLS/SSL certificate is valid |
| qty_redirects | number of redirects |
| url_google_index | whether the URL is indexed on Google |
| domain_google_index | whether the domain is indexed on Google |
| url_shortened | whether the URL is shortened |
| phishing | target label: 1 for a phishing website, 0 for a legitimate one |
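To illustrate what the symbol-count features in the table measure, here is a rough sketch that computes the `qty_*_url` and `qty_*_domain` counts together with `length_url` and `domain_length` for a single URL. This README does not describe the authors' extraction procedure, so the sketch should not be read as their method. It assumes the URL includes a scheme (such as `http://`) and treats the hostname returned by `urllib.parse.urlsplit` as the domain. The names `SYMBOLS`, `count_symbols`, and `lexical_features` are hypothetical. The directory, file, and parameter groups are left out because the way the URL is split into those parts is not specified here.

```python
from urllib.parse import urlsplit

# The 17 symbols counted per URL part, keyed by the name used in the table.
SYMBOLS = {
    "dot": ".", "hyphen": "-", "underline": "_", "slash": "/",
    "questionmark": "?", "equal": "=", "at": "@", "and": "&",
    "exclamation": "!", "space": " ", "tilde": "~", "comma": ",",
    "plus": "+", "asterisk": "*", "hashtag": "#", "dollar": "$",
    "percent": "%",
}


def count_symbols(text, suffix):
    """Return {'qty_<symbol>_<suffix>': count} for each of the 17 symbols."""
    return {f"qty_{name}_{suffix}": text.count(ch) for name, ch in SYMBOLS.items()}


def lexical_features(url):
    """Compute whole-URL and domain symbol counts plus two length features."""
    # Assumption: the "domain" is the hostname part of the URL.
    domain = urlsplit(url).hostname or ""
    features = count_symbols(url, "url")
    features.update(count_symbols(domain, "domain"))
    features["length_url"] = len(url)
    features["domain_length"] = len(domain)
    return features
```

For example, `lexical_features("http://example.com/a-b/index.php?x=1&y=2")` yields `qty_dot_url == 2`, `qty_slash_url == 4`, `length_url == 40`, and `domain_length == 11` under these assumptions.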

## Cite this dataset

G. Vrbančič, I. Fister Jr., V. Podgorelec. Datasets for Phishing Websites Detection. Data in Brief, Vol. 33, 2020, DOI: [10.1016/j.dib.2020.106438](http://dx.doi.org/10.1016/j.dib.2020.106438)