https://github.com/fabriziosalmi/fqdn-model
Tiny and fast FQDN classifier to detect malicious domains
https://github.com/fabriziosalmi/fqdn-model
blacklist classifier-model dataset domain-blocklist domains fqdn model training
Last synced: 3 months ago
JSON representation
Tiny and fast FQDN classifier to detect malicious domains
- Host: GitHub
- URL: https://github.com/fabriziosalmi/fqdn-model
- Owner: fabriziosalmi
- License: mit
- Created: 2025-03-05T10:00:03.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2026-02-20T08:18:42.000Z (3 months ago)
- Last Synced: 2026-02-20T11:12:38.739Z (3 months ago)
- Topics: blacklist, classifier-model, dataset, domain-blocklist, domains, fqdn, model, training
- Language: Python
- Homepage: https://fabriziosalmi.github.io/fqdn-model/
- Size: 64.7 MB
- Stars: 5
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- Funding: .github/FUNDING.yml
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Security: SECURITY.md
Awesome Lists containing this project
README
# FQDN (Fully Qualified Domain Name) Classifier
[](https://github.com/fabriziosalmi/fqdn-model/issues)
[](https://github.com/fabriziosalmi/fqdn-model/pulls)
[](https://opensource.org/licenses/MIT)
[](https://www.python.org/downloads/)
A Machine Learning classifier for predicting whether an FQDN (Fully Qualified Domain Name) is **benign** or **malicious**.
The project trains a classifier on features extracted from domain analysis (DNS, SSL, HTTP behavior, WHOIS, and lexical properties) and exposes a Flask API for inference.
> **Data Source**: The model can be trained using domain lists from [fabriziosalmi/blacklist](https://github.com/fabriziosalmi/blacklist) or any compatible blacklist/whitelist files.
## Key Features
* **Feature Extraction**: Extracts ~22 binary/ternary features per domain including DNS record presence, SSL certificate validity, HTTP redirect behavior, HSTS, suspicious keywords, TLD risk, domain length, hyphen/digit counts, and optional WHOIS data.
* **Multiple Model Types**: Supports Gaussian Naive Bayes, Logistic Regression, and Random Forest (default).
* **Flask API**: REST API with input validation and health check endpoint.
* **CLI Prediction**: Classify domains directly from the terminal using `predict.py`.
* **Centralized Configuration**: All settings managed via `settings.py` with `config.ini` override support.
* **Multithreaded Extraction**: Concurrent domain analysis using a configurable thread pool.
## Architecture
The project consists of three main components:
1. **`augment.py`**: The ETL pipeline. Enriches raw FQDN lists with analysis features (DNS, HTTP, SSL, WHOIS) and writes a JSON dataset.
2. **`fqdn_classifier.py`**: The training script. Trains a model on the augmented dataset and serializes it using `joblib`.
3. **`api.py` / `predict.py`**: The inference layer. Loads the serialized model to serve predictions via CLI or HTTP API.
## Installation
1. **Clone the repository:**
```bash
git clone https://github.com/fabriziosalmi/fqdn-model.git
cd fqdn-model
```
2. **Set up environment:**
```bash
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install -r requirements.txt
```
## Usage
### 1. Training (Optional)
If you want to retrain the model with your own data or fresh data from [fabriziosalmi/blacklist](https://github.com/fabriziosalmi/blacklist):
```bash
# 1. Download lists
wget https://raw.githubusercontent.com/fabriziosalmi/blacklist/master/blacklist.txt
wget https://raw.githubusercontent.com/fabriziosalmi/blacklist/master/whitelist.txt
# 2. Augment data (extract features)
python augment.py -i blacklist.txt -o blacklist.json --is_bad Yes
python augment.py -i whitelist.txt -o whitelist.json --is_bad No
# 3. Merge & Train
python merge_datasets.py blacklist.json whitelist.json -o dataset.json
python fqdn_classifier.py dataset.json
```
### 2. Prediction (CLI)
Classify domains directly from your terminal:
```bash
python predict.py google.com
python predict.py malicious-test-domain.xyz
```
### 3. API Serving
Start the server:
```bash
python api.py
```
**Health Check:**
```bash
curl http://localhost:5000/health
```
**Prediction:**
```bash
curl -X POST http://localhost:5000/predict \
-H "Content-Type: application/json" \
-d '{"fqdn": "example.com"}'
```
---
## Configuration
The project uses a hierarchical configuration system:
1. **Defaults**: Defined in `settings.py`.
2. **Config File**: Values in `config.ini` override defaults.
3. **CLI arguments**: Runtime arguments override everything.
See `settings.py` for all available options.
## Testing
Run the test suite:
```bash
pytest tests/
```
## Data Format
The training data consists of two text files:
* **`whitelist.txt`:** Contains a list of benign FQDNs, one per line.
* **`blacklist.txt`:** Contains a list of malicious FQDNs, one per line.
Each line should contain only the FQDN itself, without extra characters or whitespace.
**Example:**
**`whitelist.txt`:**
```
google.com
facebook.com
wikipedia.org
```
**`blacklist.txt`:**
```
malware-domain.xyz
phishing-site.tk
evil-domain.com
```
## Model Details
* **Supported Models:** Gaussian Naive Bayes (`gaussian_nb`), Logistic Regression (`logistic_regression`), Random Forest (`random_forest`, default)
* **Default Estimators (Random Forest):** 100 (configurable via `--rf_n_estimators`)
### Features Extracted
Features are extracted by the `analyze_fqdn` function in `augment.py`. Each feature is encoded as a binary or ternary value (0 = benign indicator, 1 = malicious indicator, 2 = unknown/unavailable):
* DNS record presence (A, AAAA, MX, TXT, CNAME)
* SSL certificate validity
* HTTP status code
* Final protocol (HTTP vs HTTPS)
* HTTP-to-HTTPS redirect
* Excessive redirects
* HSTS header presence
* Suspicious keyword detection in page content
* SSL verification failure
* Risky TLD
* Domain length
* Hyphen count
* Digit count
* IP address embedded in domain
* URL shortener detection
* Subdomain count
* Page title and body presence
* WHOIS data (creation date, expiration date, age) — optional, requires `--whois` flag
### Model Persistence
Trained models are saved using `joblib` in the `models/` directory and loaded automatically by `predict.py` and `api.py`.
## Performance Metrics
The `fqdn_classifier.py` script evaluates the trained model using:
* **Accuracy:** Overall correctness of the model.
* **ROC AUC:** Area under the receiver operating characteristic curve.
* **Precision:** Proportion of true positives among predicted positives.
* **Recall:** Proportion of true positives among actual positives.
* **F1-Score:** Harmonic mean of precision and recall.
* **Log Loss / Brier Score:** Probability calibration metrics.
* **Confusion Matrix:** Counts of true/false positives and negatives.
* **Feature Importance:** Ranking of features by contribution (for Random Forest and Logistic Regression).
## Contributing
Contributions are welcome! Here's how you can contribute:
1. Fork the repository.
2. Create a new branch: `git checkout -b feature/my-new-feature` or `git checkout -b fix/my-bug-fix`
3. Make your changes and commit them: `git commit -am 'Add some feature'`
4. Push to the branch: `git push origin feature/my-new-feature`
5. Create a new Pull Request.
**Guidelines:**
* Follow the existing code style.
* Write clear and concise commit messages.
* Provide tests for your changes.
* Explain the purpose of your changes in the Pull Request description.
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Credits
* `scikit-learn`: [https://scikit-learn.org/](https://scikit-learn.org/)
* `tldextract`: [https://github.com/john-kurkowski/tldextract](https://github.com/john-kurkowski/tldextract)
* `joblib`: [https://joblib.readthedocs.io/en/latest/](https://joblib.readthedocs.io/en/latest/)
* `rich`: [https://github.com/Textualize/rich](https://github.com/Textualize/rich)
* `Flask`: [https://flask.palletsprojects.com/](https://flask.palletsprojects.com/)
## Contact
If you have any questions or suggestions, feel free to open an issue.