# URL Feature Extractor

Extracts features from URLs to build a data set for machine learning. The goal is to find a machine learning model that predicts phishing URLs targeting the Brazilian population.

This repo includes the implementation of our paper:

Lucas Dantas Gama Ayres, Italo Valcy S Brito and Rodrigo Rocha Gomes e Souza. Using Machine Learning to Automatically Detect Malicious URLs in Brazil. In Simpósio Brasileiro de Redes de Computadores e Sistemas Distribuídos (SBRC 2019), Gramado, RS, Brazil, 2019.

The paper is available here: https://sol.sbc.org.br/index.php/sbrc/article/view/7416

DOI: https://doi.org/10.5753/sbrc.2019.7416

## Install

```bash
$ sudo apt-get update && sudo apt-get upgrade
$ sudo apt-get install virtualenv python3 python3-dev python-dev gcc libpq-dev libssl-dev libffi-dev build-essential
$ virtualenv -p /usr/bin/python3 .env
$ source .env/bin/activate
$ pip install -r requirements.txt
```

## How to use

Before running the software, add your API keys for Google Safe Browsing, PhishTank, and MyWOT to the `config.ini` file.

Now, run:

```bash
$ python run.py
```
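
As a rough illustration of how keys stored in an INI file can be read and sanity-checked before a run, here is a minimal sketch using Python's standard `configparser`. The section name `API_KEYS`, the option names, and the `YOUR_` placeholder convention are assumptions made for this example; the actual layout of this project's `config.ini` may differ.

```python
import configparser

# Hypothetical layout: the section and option names are assumptions,
# not necessarily the ones used by this project's config.ini.
SAMPLE = """
[API_KEYS]
google_safe_browsing = YOUR_GOOGLE_KEY
phishtank = YOUR_PHISHTANK_KEY
mywot = YOUR_MYWOT_KEY
"""


def load_api_keys(text):
    """Parse INI text and return the API keys as a plain dict."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return dict(parser["API_KEYS"])


def missing_keys(keys):
    """Return the names of keys that are empty or still placeholders."""
    return sorted(
        name for name, value in keys.items()
        if not value or value.startswith("YOUR_")
    )


if __name__ == "__main__":
    # Lists every key that still needs a real value.
    print(missing_keys(load_api_keys(SAMPLE)))
```

A check like this makes a missing key fail early instead of surfacing later as an API error.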

## Features implemented



| LEXICAL | | | |
|---|---|---|---|
| Count (.) in URL | Count (-) in URL | Count (_) in URL | Count (/) in URL |
| Count (?) in URL | Count (=) in URL | Count (@) in URL | Count (&) in URL |
| Count (!) in URL | Count ( ) in URL | Count (~) in URL | Count (,) in URL |
| Count (+) in URL | Count (*) in URL | Count (#) in URL | Count ($) in URL |
| Count (%) in URL | URL length | Number of TLDs in URL | Count (.) in domain |
| Count (-) in domain | Count (_) in domain | Count (/) in domain | Count (?) in domain |
| Count (=) in domain | Count (@) in domain | Count (&) in domain | Count (!) in domain |
| Count ( ) in domain | Count (~) in domain | Count (,) in domain | Count (+) in domain |
| Count (*) in domain | Count (#) in domain | Count ($) in domain | Count (%) in domain |
| Domain length | Number of vowels in domain | URL domain in IP address format | Domain contains the keywords "server" or "client" |
| Count (.) in directory | Count (-) in directory | Count (_) in directory | Count (/) in directory |
| Count (?) in directory | Count (=) in directory | Count (@) in directory | Count (&) in directory |
| Count (!) in directory | Count ( ) in directory | Count (~) in directory | Count (,) in directory |
| Count (+) in directory | Count (*) in directory | Count (#) in directory | Count ($) in directory |
| Count (%) in directory | Directory length | Count (.) in file | Count (-) in file |
| Count (_) in file | Count (/) in file | Count (?) in file | Count (=) in file |
| Count (@) in file | Count (&) in file | Count (!) in file | Count ( ) in file |
| Count (~) in file | Count (,) in file | Count (+) in file | Count (*) in file |
| Count (#) in file | Count ($) in file | Count (%) in file | File length |
| Count (.) in parameters | Count (-) in parameters | Count (_) in parameters | Count (/) in parameters |
| Count (?) in parameters | Count (=) in parameters | Count (@) in parameters | Count (&) in parameters |
| Count (!) in parameters | Count ( ) in parameters | Count (~) in parameters | Count (,) in parameters |
| Count (+) in parameters | Count (*) in parameters | Count (#) in parameters | Count ($) in parameters |
| Count (%) in parameters | Length of parameters | TLD present in parameters | Number of parameters |
| Email address present in URL | File extension | | |
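
Most of the lexical features above are character counts taken over the whole URL and over each of its parts. The following is a minimal sketch of that idea, not the project's actual code: it assumes the URL is split with the standard `urllib.parse.urlsplit`, that "directory" and "file" are the two halves of the path, and that "parameters" is the query string. The function and key names are hypothetical.

```python
from urllib.parse import urlsplit
import ipaddress
import posixpath

# The 17 characters counted in the list above (including the space).
CHARS = ".-_/?=@&! ~,+*#$%"


def count_chars(text):
    """Map each tracked character to its number of occurrences in text."""
    return {ch: text.count(ch) for ch in CHARS}


def is_ip(host):
    """Return True if host is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def lexical_features(url):
    """Compute a subset of the lexical features for one URL."""
    parts = urlsplit(url)
    # hostname is None when the URL has no scheme/netloc; treat as empty.
    domain = parts.hostname or ""
    directory, filename = posixpath.split(parts.path)
    params = parts.query
    return {
        "url": count_chars(url),
        "domain": count_chars(domain),
        "directory": count_chars(directory),
        "file": count_chars(filename),
        "parameters": count_chars(params),
        "url_length": len(url),
        "domain_length": len(domain),
        "domain_vowels": sum(domain.count(v) for v in "aeiou"),
        "domain_is_ip": is_ip(domain),
        "n_parameters": len(params.split("&")) if params else 0,
    }


if __name__ == "__main__":
    feats = lexical_features("http://example.com/a/b/login.php?user=1&id=2")
    print(feats["domain"]["."], feats["n_parameters"])
```

Note that `urlsplit` only fills in the hostname when the URL carries a scheme (or starts with `//`), so scheme-less inputs would need normalizing first.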



| BLACKLIST | | |
|---|---|---|
| Presence of the URL in blacklists | Presence of the IP address in blacklists | Presence of the domain in blacklists |



| HOST | | | |
|---|---|---|---|
| Presence of the domain in RBL (Real-time Blackhole List) | Domain lookup response time | Domain has an SPF record | Geographical location of IP |
| AS number (ASN) | PTR record of IP | Time (in days) since domain activation | Time (in days) until domain expiration |
| Number of resolved IPs | Number of resolved name servers (NS) | Number of MX servers | Time-to-live (TTL) value associated with hostname |
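
The two "time (in days)" host features can be derived from a domain's registration and expiry dates. The sketch below assumes they mean days elapsed since activation and days remaining until expiration, and that the dates have already been obtained elsewhere (for example from a WHOIS lookup, which is not shown). The function name and the use of `None` for unknown dates are choices made for this example only.

```python
from datetime import date


def domain_age_features(created, expires, today):
    """Return (days since activation, days until expiration).

    Each value is None when the corresponding date is unknown.
    A negative expiration value means the domain has already expired.
    """
    activation = (today - created).days if created else None
    expiration = (expires - today).days if expires else None
    return activation, expiration


if __name__ == "__main__":
    print(domain_age_features(date(2019, 1, 1), date(2020, 1, 1), date(2019, 1, 31)))
```

Passing `today` explicitly keeps the function deterministic, which helps when rebuilding a data set at a later date.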



| OTHERS | | | |
|---|---|---|---|
| Valid TLS/SSL certificate | Number of redirects | Check if URL is indexed on Google | Check if domain is indexed on Google |
| Uses URL shortener service | | | |
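
One simple way to approximate the URL shortener feature is to compare the URL's host against a list of known shortener domains. This is a sketch under that assumption; the `SHORTENERS` set is a small illustrative sample, not the list this project uses, and the function name is hypothetical.

```python
from urllib.parse import urlsplit

# Illustrative sample only; a real list would be much longer.
SHORTENERS = {"bit.ly", "goo.gl", "tinyurl.com", "t.co", "ow.ly"}


def uses_shortener(url):
    """Return True if the URL's host is a known shortener domain."""
    host = (urlsplit(url).hostname or "").lower()
    # Treat "www.tinyurl.com" the same as "tinyurl.com".
    if host.startswith("www."):
        host = host[4:]
    return host in SHORTENERS


if __name__ == "__main__":
    print(uses_shortener("https://bit.ly/abc"))
```

Matching on the parsed host, rather than searching the whole URL string, avoids false positives when a shortener name merely appears in the path.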

## Contributing

Any contribution is appreciated.

### Submitting a Pull Request (PR)

1. Clone the project:
   ```bash
   $ git clone https://github.com/lucasayres/url-feature-extractor.git
   ```

2. Make your changes in a new git branch:
   ```bash
   $ git checkout -b my-branch master
   ```

3. Commit your changes.

4. Push your branch to GitHub.

5. Open a PR against master.