{"id":21089885,"url":"https://github.com/lucasayres/url-feature-extractor","last_synced_at":"2025-05-16T13:30:54.820Z","repository":{"id":41284387,"uuid":"138614922","full_name":"lucasayres/url-feature-extractor","owner":"lucasayres","description":"Extracting features from URLs to build a data set for machine learning. The purpose is to find a machine learning model to predict phishing URLs, which are targeted to the Brazilian population.","archived":false,"fork":false,"pushed_at":"2021-06-01T22:21:31.000Z","size":5528,"stargazers_count":60,"open_issues_count":2,"forks_count":2,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-04-04T01:24:57.885Z","etag":null,"topics":["benign","blacklist","dataset","extractor","host","lexical","machine-learning","phishing","phishtank","python","safebrowsing","wot"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lucasayres.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-06-25T15:35:32.000Z","updated_at":"2025-01-07T23:04:23.000Z","dependencies_parsed_at":"2022-07-06T16:32:13.477Z","dependency_job_id":null,"html_url":"https://github.com/lucasayres/url-feature-extractor","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lucasayres%2Furl-feature-extractor","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lucasayres%2Furl-feature-extractor/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lucasayres%2Furl-feature-extractor/releases","mani
fests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lucasayres%2Furl-feature-extractor/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lucasayres","download_url":"https://codeload.github.com/lucasayres/url-feature-extractor/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254538286,"owners_count":22087834,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["benign","blacklist","dataset","extractor","host","lexical","machine-learning","phishing","phishtank","python","safebrowsing","wot"],"created_at":"2024-11-19T21:32:33.137Z","updated_at":"2025-05-16T13:30:53.634Z","avatar_url":"https://github.com/lucasayres.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# URL Feature Extractor\n\nExtracting features from URLs to build a data set for machine learning. The purpose is to find a machine learning model to predict phishing URLs, which are targeted to the Brazilian population.\n\nThis repo includes the implementation of our paper:\n\nLucas Dantas Gama Ayres, Italo Valcy S Brito and Rodrigo Rocha Gomes e Souza. Using Machine Learning to Automatically Detect Malicious URLs in Brazil. 
In Simpósio Brasileiro de Redes de Computadores e Sistemas Distribuídos (SBRC 2019) - 2019, Gramado - RS - Brazil.\n\nThe paper is available here: https://sol.sbc.org.br/index.php/sbrc/article/view/7416\n\nDOI: https://doi.org/10.5753/sbrc.2019.7416\n\n## Install\n\n```bash\n$ sudo apt-get update \u0026\u0026 sudo apt-get upgrade\n$ sudo apt-get install virtualenv python3 python3-dev python-dev gcc libpq-dev libssl-dev libffi-dev build-essential\n$ virtualenv -p /usr/bin/python3 .env\n$ source .env/bin/activate\n$ pip install -r requirements.txt\n```\n\n## How to use\n\nBefore running the software, add your API keys for Google Safe Browsing, PhishTank, and MyWOT to the ```config.ini``` file.\n\nNow, run:\n\n```bash\n$ python run.py \u003cinput-urls\u003e \u003coutput-dataset\u003e\n```\n\n## Features implemented\n\n\u003ctable\u003e\n    \u003ctr\u003e\n        \u003cth style=\"text-align:center\" colspan=\"4\"\u003e\n            \u003cb\u003eLEXICAL\u003c/b\u003e\n        \u003c/th\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003eCount (.) in URL\u003c/td\u003e\n        \u003ctd\u003eCount (-) in URL\u003c/td\u003e\n        \u003ctd\u003eCount (_) in URL\u003c/td\u003e\n        \u003ctd\u003eCount (/) in URL\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003eCount (?) in URL\u003c/td\u003e\n        \u003ctd\u003eCount (=) in URL\u003c/td\u003e\n        \u003ctd\u003eCount (@) in URL\u003c/td\u003e\n        \u003ctd\u003eCount (\u0026) in URL\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003eCount (!) 
in URL\u003c/td\u003e\n        \u003ctd\u003eCount ( ) in URL\u003c/td\u003e\n        \u003ctd\u003eCount (~) in URL\u003c/td\u003e\n        \u003ctd\u003eCount (,) in URL\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003eCount (+) in URL\u003c/td\u003e\n        \u003ctd\u003eCount (*) in URL\u003c/td\u003e\n        \u003ctd\u003eCount (#) in URL\u003c/td\u003e\n        \u003ctd\u003eCount ($) in URL\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003eCount (%) in URL\u003c/td\u003e\n        \u003ctd\u003eURL Length\u003c/td\u003e\n        \u003ctd\u003eTLD amount in URL\u003c/td\u003e\n        \u003ctd\u003eCount (.) in Domain\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003eCount (-) in Domain\u003c/td\u003e\n        \u003ctd\u003eCount (_) in Domain\u003c/td\u003e\n        \u003ctd\u003eCount (/) in Domain\u003c/td\u003e\n        \u003ctd\u003eCount (?) in Domain\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003eCount (=) in Domain\u003c/td\u003e\n        \u003ctd\u003eCount (@) in Domain\u003c/td\u003e\n        \u003ctd\u003eCount (\u0026) in Domain\u003c/td\u003e\n        \u003ctd\u003eCount (!) 
in Domain\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003eCount ( ) in Domain\u003c/td\u003e\n        \u003ctd\u003eCount (~) in Domain\u003c/td\u003e\n        \u003ctd\u003eCount (,) in Domain\u003c/td\u003e\n        \u003ctd\u003eCount (+) in Domain\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003eCount (*) in Domain\u003c/td\u003e\n        \u003ctd\u003eCount (#) in Domain\u003c/td\u003e\n        \u003ctd\u003eCount ($) in Domain\u003c/td\u003e\n        \u003ctd\u003eCount (%) in Domain\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003eDomain Length\u003c/td\u003e\n        \u003ctd\u003eNumber of vowels in Domain\u003c/td\u003e\n        \u003ctd\u003eURL domain in IP address format\u003c/td\u003e\n        \u003ctd\u003eDomain contains the keywords \"server\" or \"client\"\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003eCount (.) in Directory\u003c/td\u003e\n        \u003ctd\u003eCount (-) in Directory\u003c/td\u003e\n        \u003ctd\u003eCount (_) in Directory\u003c/td\u003e\n        \u003ctd\u003eCount (/) in Directory\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003eCount (?) in Directory\u003c/td\u003e\n        \u003ctd\u003eCount (=) in Directory\u003c/td\u003e\n        \u003ctd\u003eCount (@) in Directory\u003c/td\u003e\n        \u003ctd\u003eCount (\u0026) in Directory\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003eCount (!) 
in Directory\u003c/td\u003e\n        \u003ctd\u003eCount ( ) in Directory\u003c/td\u003e\n        \u003ctd\u003eCount (~) in Directory\u003c/td\u003e\n        \u003ctd\u003eCount (,) in Directory\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003eCount (+) in Directory\u003c/td\u003e\n        \u003ctd\u003eCount (*) in Directory\u003c/td\u003e\n        \u003ctd\u003eCount (#) in Directory\u003c/td\u003e\n        \u003ctd\u003eCount ($) in Directory\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003eCount (%) in Directory\u003c/td\u003e\n        \u003ctd\u003eDirectory Length\u003c/td\u003e\n        \u003ctd\u003eCount (.) in file\u003c/td\u003e\n        \u003ctd\u003eCount (-) in file\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003eCount (_) in file\u003c/td\u003e\n        \u003ctd\u003eCount (/) in file\u003c/td\u003e\n        \u003ctd\u003eCount (?) in file\u003c/td\u003e\n        \u003ctd\u003eCount (=) in file\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003eCount (@) in file\u003c/td\u003e\n        \u003ctd\u003eCount (\u0026) in file\u003c/td\u003e\n        \u003ctd\u003eCount (!) in file\u003c/td\u003e\n        \u003ctd\u003eCount ( ) in file\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003eCount (~) in file\u003c/td\u003e\n        \u003ctd\u003eCount (,) in file\u003c/td\u003e\n        \u003ctd\u003eCount (+) in file\u003c/td\u003e\n        \u003ctd\u003eCount (*) in file\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003eCount (#) in file\u003c/td\u003e\n        \u003ctd\u003eCount ($) in file\u003c/td\u003e\n        \u003ctd\u003eCount (%) in file\u003c/td\u003e\n        \u003ctd\u003eFile length\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003eCount (.) 
in parameters\u003c/td\u003e\n        \u003ctd\u003eCount (-) in parameters\u003c/td\u003e\n        \u003ctd\u003eCount (_) in parameters\u003c/td\u003e\n        \u003ctd\u003eCount (/) in parameters\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003eCount (?) in parameters\u003c/td\u003e\n        \u003ctd\u003eCount (=) in parameters\u003c/td\u003e\n        \u003ctd\u003eCount (@) in parameters\u003c/td\u003e\n        \u003ctd\u003eCount (\u0026) in parameters\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003eCount (!) in parameters\u003c/td\u003e\n        \u003ctd\u003eCount ( ) in parameters\u003c/td\u003e\n        \u003ctd\u003eCount (~) in parameters\u003c/td\u003e\n        \u003ctd\u003eCount (,) in parameters\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003eCount (+) in parameters\u003c/td\u003e\n        \u003ctd\u003eCount (*) in parameters\u003c/td\u003e\n        \u003ctd\u003eCount (#) in parameters\u003c/td\u003e\n        \u003ctd\u003eCount ($) in parameters\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003eCount (%) in parameters\u003c/td\u003e\n        \u003ctd\u003eLength of parameters\u003c/td\u003e\n        \u003ctd\u003eTLD presence in arguments\u003c/td\u003e\n        \u003ctd\u003eNumber of parameters\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003eEmail present at URL\u003c/td\u003e\n        \u003ctd\u003eFile extension\u003c/td\u003e\n    \u003c/tr\u003e\n\u003c/table\u003e\n\n\u003ctable\u003e\n    \u003ctr\u003e\n        \u003cth style=\"text-align:center\" colspan=\"4\"\u003e\n            \u003cb\u003eBLACKLIST\u003c/b\u003e\n        \u003c/th\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003ePresence of the URL in blacklists\u003c/td\u003e\n        \u003ctd\u003ePresence of the IP Address in blacklists\u003c/td\u003e\n        \u003ctd\u003ePresence of the domain in 
blacklists\u003c/td\u003e\n    \u003c/tr\u003e\n\u003c/table\u003e\n\n\u003ctable\u003e\n    \u003ctr\u003e\n        \u003cth style=\"text-align:center\" colspan=\"4\"\u003e\n            \u003cb\u003eHOST\u003c/b\u003e\n        \u003c/th\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003ePresence of the domain in RBL (Real-time Blackhole List)\u003c/td\u003e\n        \u003ctd\u003eDomain lookup response time\u003c/td\u003e\n        \u003ctd\u003eDomain has SPF?\u003c/td\u003e\n        \u003ctd\u003eGeographical location of IP\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003eAS Number (or ASN)\u003c/td\u003e\n        \u003ctd\u003ePTR of IP\u003c/td\u003e\n        \u003ctd\u003eTime (in days) of domain activation\u003c/td\u003e\n        \u003ctd\u003eTime (in days) of domain expiration\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003eNumber of resolved IPs\u003c/td\u003e\n        \u003ctd\u003eNumber of resolved name servers (NameServers - NS)\u003c/td\u003e\n        \u003ctd\u003eNumber of MX Servers\u003c/td\u003e\n        \u003ctd\u003eTime-to-live (TTL) value associated with hostname\u003c/td\u003e\n    \u003c/tr\u003e\n\u003c/table\u003e\n\n\u003ctable\u003e\n    \u003ctr\u003e\n        \u003cth style=\"text-align:center\" colspan=\"4\"\u003e\n            \u003cb\u003eOTHERS\u003c/b\u003e\n        \u003c/th\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003eValid TLS / SSL Certificate\u003c/td\u003e\n        \u003ctd\u003eNumber of redirects\u003c/td\u003e\n        \u003ctd\u003eCheck if URL is indexed on Google\u003c/td\u003e\n        \u003ctd\u003eCheck if domain is indexed on Google\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003eUses URL shortener service\u003c/td\u003e\n    \u003c/tr\u003e\n\u003c/table\u003e\n\n## Contributing\n\nAny contribution is appreciated.\n\n#### Submitting a Pull Request (PR)\n\n1. 
Clone the project:\n  ```\n  $ git clone https://github.com/lucasayres/url-feature-extractor.git\n  ```\n\n2. Make your changes in a new git branch:\n  ```\n  $ git checkout -b my-branch master\n  ```\n\n3. Add your changes.\n\n4. Push your branch to GitHub.\n\n5. Create a PR to master.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flucasayres%2Furl-feature-extractor","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flucasayres%2Furl-feature-extractor","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flucasayres%2Furl-feature-extractor/lists"}