https://github.com/lucasayres/url-feature-extractor
Extracting features from URLs to build a data set for machine learning. The purpose is to find a machine learning model to predict phishing URLs, which are targeted to the Brazilian population.
- Host: GitHub
- URL: https://github.com/lucasayres/url-feature-extractor
- Owner: lucasayres
- Created: 2018-06-25T15:35:32.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2021-06-01T22:21:31.000Z (over 3 years ago)
- Last Synced: 2023-06-04T14:31:10.171Z (over 1 year ago)
- Topics: benign, blacklist, dataset, extractor, host, lexical, machine-learning, phishing, phishtank, python, safebrowsing, wot
- Language: Python
- Size: 5.27 MB
- Stars: 52
- Watchers: 4
- Forks: 2
- Open Issues: 2
Metadata Files:
- Readme: README.md
# URL Feature Extractor
Extracts features from URLs to build a data set for machine learning. The goal is to train a machine learning model that predicts phishing URLs targeting the Brazilian population.
This repo includes the implementation of our paper:
Lucas Dantas Gama Ayres, Italo Valcy S Brito and Rodrigo Rocha Gomes e Souza. Using Machine Learning to Automatically Detect Malicious URLs in Brazil. In Simpósio Brasileiro de Redes de Computadores e Sistemas Distribuídos (SBRC 2019) - 2019, Gramado - RS - Brazil.
The paper is available here: https://sol.sbc.org.br/index.php/sbrc/article/view/7416
DOI: https://doi.org/10.5753/sbrc.2019.7416
## Install
```bash
$ sudo apt-get update && sudo apt-get upgrade
$ sudo apt-get install virtualenv python3 python3-dev python-dev gcc libpq-dev libssl-dev libffi-dev build-essential
$ virtualenv -p /usr/bin/python3 .env
$ source .env/bin/activate
$ pip install -r requirements.txt
```

## How to use
Before running the software, add your API keys for Google Safe Browsing, PhishTank, and MyWOT to the `config.ini` file.
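As a rough illustration, the keys could be checked with Python's standard `configparser` before starting a run. The section name `API_KEYS` and the option names `google_safe_browsing`, `phishtank`, and `mywot` are assumptions made for this sketch; use whatever names the repository's `config.ini` actually defines.

```python
import configparser

# Hypothetical option names; the real config.ini may use different ones.
REQUIRED_KEYS = ("google_safe_browsing", "phishtank", "mywot")


def missing_api_keys(config_text, section="API_KEYS"):
    """Return the required options that are absent or empty in the INI text."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    if not parser.has_section(section):
        return list(REQUIRED_KEYS)
    return [key for key in REQUIRED_KEYS
            if not parser.get(section, key, fallback="").strip()]


# Example INI content with one key left empty.
example = """
[API_KEYS]
google_safe_browsing = your-key-here
phishtank =
mywot = your-key-here
"""
print(missing_api_keys(example))
```

Failing early on an empty key avoids a long extraction run that silently produces no blacklist features.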
Now, run:
```bash
$ python run.py
```

## Features implemented
#### Lexical
- Count (.) in URL
- Count (-) in URL
- Count (_) in URL
- Count (/) in URL
- Count (?) in URL
- Count (=) in URL
- Count (@) in URL
- Count (&) in URL
- Count (!) in URL
- Count ( ) in URL
- Count (~) in URL
- Count (,) in URL
- Count (+) in URL
- Count (*) in URL
- Count (#) in URL
- Count ($) in URL
- Count (%) in URL
- URL Length
- Number of TLDs in URL
- Count (.) in Domain
- Count (-) in Domain
- Count (_) in Domain
- Count (/) in Domain
- Count (?) in Domain
- Count (=) in Domain
- Count (@) in Domain
- Count (&) in Domain
- Count (!) in Domain
- Count ( ) in Domain
- Count (~) in Domain
- Count (,) in Domain
- Count (+) in Domain
- Count (*) in Domain
- Count (#) in Domain
- Count ($) in Domain
- Count (%) in Domain
- Domain Length
- Number of vowels in Domain
- URL domain in IP address format
- Domain contains the keywords "server" or "client"
- Count (.) in Directory
- Count (-) in Directory
- Count (_) in Directory
- Count (/) in Directory
- Count (?) in Directory
- Count (=) in Directory
- Count (@) in Directory
- Count (&) in Directory
- Count (!) in Directory
- Count ( ) in Directory
- Count (~) in Directory
- Count (,) in Directory
- Count (+) in Directory
- Count (*) in Directory
- Count (#) in Directory
- Count ($) in Directory
- Count (%) in Directory
- Directory Length
- Count (.) in file
- Count (-) in file
- Count (_) in file
- Count (/) in file
- Count (?) in file
- Count (=) in file
- Count (@) in file
- Count (&) in file
- Count (!) in file
- Count ( ) in file
- Count (~) in file
- Count (,) in file
- Count (+) in file
- Count (*) in file
- Count (#) in file
- Count ($) in file
- Count (%) in file
- File length
- Count (.) in parameters
- Count (-) in parameters
- Count (_) in parameters
- Count (/) in parameters
- Count (?) in parameters
- Count (=) in parameters
- Count (@) in parameters
- Count (&) in parameters
- Count (!) in parameters
- Count ( ) in parameters
- Count (~) in parameters
- Count (,) in parameters
- Count (+) in parameters
- Count (*) in parameters
- Count (#) in parameters
- Count ($) in parameters
- Count (%) in parameters
- Length of parameters
- TLD presence in arguments
- Number of parameters
- Email address present in URL
- File extension
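
The lexical features can be sketched with the standard library alone. The following is a minimal illustration, not the repository's implementation: the function names, the output layout, and the way the path is split into directory and file are all assumptions made for this example.

```python
import ipaddress
from urllib.parse import parse_qsl, urlsplit

# The 17 characters counted in each URL component, taken from the feature list.
SPECIAL_CHARS = ".-_/?=@&! ~,+*#$%"
VOWELS = set("aeiou")


def split_url(url):
    """Split a URL into the components named in the list (one possible reading)."""
    parts = urlsplit(url)
    # Everything up to the last "/" is treated as the directory, the rest as the file.
    directory, slash, filename = parts.path.rpartition("/")
    return {
        "url": url,
        "domain": parts.hostname or "",
        "directory": directory + slash,
        "file": filename,
        "parameters": parts.query,
    }


def is_ip_address(host):
    """Return True when the host is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
    except ValueError:
        return False
    return True


def lexical_features(url):
    """Compute per-component lengths and character counts plus a few extras."""
    components = split_url(url)
    features = {
        name: {
            "length": len(text),
            "counts": {ch: text.count(ch) for ch in SPECIAL_CHARS},
        }
        for name, text in components.items()
    }
    domain = components["domain"]
    features["domain_vowels"] = sum(ch in VOWELS for ch in domain.lower())
    features["domain_is_ip"] = is_ip_address(domain)
    features["num_parameters"] = len(
        parse_qsl(components["parameters"], keep_blank_values=True)
    )
    return features


sample = "http://secure-login.example.com.br/account/update.php?user=a@b.com&id=1"
result = lexical_features(sample)
print(result["domain"]["length"], result["domain"]["counts"]["."])
print(result["domain_vowels"], result["num_parameters"])
```

Keeping the counts in a nested dictionary makes it straightforward to flatten them into one row per URL when building the data set.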

#### Blacklist
- Presence of the URL in blacklists
- Presence of the IP address in blacklists
- Presence of the domain in blacklists

#### Host
- Presence of the domain in RBL (Real-time Blackhole List)
- Domain lookup response time
- Domain has SPF?
- Geographical location of IP
- AS Number (or ASN)
- PTR of IP
- Time (in days) of domain activation
- Time (in days) of domain expiration
- Number of resolved IPs
- Number of resolved name servers (NS)
- Number of MX Servers
- Time-to-live (TTL) value associated with hostname
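
Most host features need live DNS or WHOIS queries, but the two time-based ones reduce to date arithmetic once the registration and expiration dates are known. The sketch below assumes those dates were already obtained from some WHOIS client (not shown), reads the features as "days since activation" and "days until expiration", and uses `-1` as a placeholder for missing dates. All three choices are assumptions for illustration, not necessarily what the repository does.

```python
from datetime import date


def domain_age_features(created, expires, today=None):
    """Return days since registration and days until expiration.

    A missing date yields -1, a placeholder convention assumed for this sketch.
    """
    today = today or date.today()
    return {
        "days_since_activation": (today - created).days if created else -1,
        "days_until_expiration": (expires - today).days if expires else -1,
    }


# Fixed reference date so the example is reproducible.
print(domain_age_features(date(2018, 6, 25), date(2025, 6, 25),
                          today=date(2019, 6, 25)))
```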

#### Others
- Valid TLS / SSL Certificate
- Number of redirects
- Check if URL is indexed on Google
- Check if domain is indexed on Google
- Uses URL shortener service

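
The URL shortener check is one of the few features in this group that works offline. A minimal sketch, assuming a hand-maintained set of shortener domains; the set below is a small illustrative sample and the function name is hypothetical:

```python
from urllib.parse import urlsplit

# Illustrative sample only; a real list would be much longer.
KNOWN_SHORTENERS = {"bit.ly", "tinyurl.com", "goo.gl", "t.co", "ow.ly", "is.gd"}


def uses_shortener(url, shorteners=KNOWN_SHORTENERS):
    """Return True when the URL's host is a known shortener domain."""
    host = (urlsplit(url).hostname or "").lower()
    # Treat "www.bit.ly" the same as "bit.ly".
    if host.startswith("www."):
        host = host[4:]
    return host in shorteners


print(uses_shortener("https://bit.ly/2abcXYZ"))
print(uses_shortener("https://example.com/bit.ly"))
```

Matching on the parsed host rather than on a substring of the URL avoids false positives when a shortener name appears in the path.
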
## Contributing
Any contribution is appreciated.
#### Submitting a Pull Request (PR)
1. Clone the project:
   ```
   $ git clone https://github.com/lucasayres/url-feature-extractor.git
   ```
2. Make your changes in a new git branch:
   ```
   $ git checkout -b my-branch master
   ```
3. Add your changes.
4. Push your branch to GitHub.
5. Create a PR to master.