https://github.com/hatamiarash7/duckdb-netquack
DuckDB extension for parsing, extracting, and analyzing domains, URIs, and paths with ease.
- Host: GitHub
- URL: https://github.com/hatamiarash7/duckdb-netquack
- Owner: hatamiarash7
- License: mit
- Created: 2025-01-26T06:59:06.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2026-02-15T17:02:37.000Z (4 months ago)
- Last Synced: 2026-02-15T23:39:47.024Z (4 months ago)
- Topics: database, database-extension, duckdb, duckdb-community, duckdb-database, duckdb-extension, duckdb-udf, extension, network-utilities, sql
- Language: C++
- Homepage: https://duckdb.org/community_extensions/extensions/netquack.html
- Size: 646 KB
- Stars: 31
- Watchers: 1
- Forks: 2
- Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE
- Citation: CITATION.cff
- Codeowners: .github/CODEOWNERS
Awesome Lists containing this project
- awesome-duckdb - `netquack` - Parsing, extracting, and analyzing domains, URIs, and paths with ease. (Extensions / [Community Extensions](https://duckdb.org/community_extensions/))
README
# DuckDB Netquack Extension

This extension is designed to simplify working with domains, URIs, and web paths directly within your database queries. Whether you're extracting top-level domains (TLDs), parsing URI components, or analyzing web paths, Netquack provides a suite of intuitive functions to handle all your network tasks efficiently. Built for data engineers, analysts, and developers.
With Netquack, you can unlock deeper insights from your web-related datasets without the need for external tools or complex workflows.
Netquack uses ClickHouse-inspired character-by-character parsing and gperf-generated perfect hash functions for fast lookups.
Table of Contents
- [DuckDB Netquack Extension](#duckdb-netquack-extension)
  - [Installation](#installation)
  - [Usage Examples](#usage-examples)
    - [Extracting The Main Domain](#extracting-the-main-domain)
    - [Extracting The Path](#extracting-the-path)
    - [Extracting The Host](#extracting-the-host)
    - [Extracting The Schema](#extracting-the-schema)
    - [Extracting The Query](#extracting-the-query)
      - [Query String](#query-string)
      - [Query Parameters](#query-parameters)
    - [Extracting The Port](#extracting-the-port)
    - [Extracting The File Extension](#extracting-the-file-extension)
    - [Extracting The TLD (Top-Level Domain)](#extracting-the-tld-top-level-domain)
    - [Extracting The Sub Domain](#extracting-the-sub-domain)
    - [Get Tranco Rank](#get-tranco-rank)
      - [Update Tranco List](#update-tranco-list)
      - [Get Tranco Ranking](#get-tranco-ranking)
    - [IP Address Functions](#ip-address-functions)
      - [IP Calculator](#ip-calculator)
    - [Get Extension Version](#get-extension-version)
  - [Build Requirements](#build-requirements)
  - [Debugging](#debugging)
  - [Roadmap](#roadmap)
  - [Contributing](#contributing)
  - [Issues](#issues)
## Installation
**netquack** is distributed as a [DuckDB Community Extension](https://duckdb.org/community_extensions/) and can be installed using SQL:
```sql
SET allow_community_extensions = true;
INSTALL netquack FROM community;
LOAD netquack;
```
If you previously installed the `netquack` extension, upgrade it with `FORCE INSTALL`:
```sql
FORCE INSTALL netquack FROM community;
LOAD netquack;
```
You can also update the extension to the latest available version with this command:
```sql
UPDATE EXTENSIONS (netquack);
```
## Usage Examples
Once installed, the [macro functions](https://duckdb.org/community_extensions/extensions/netquack.html#added-functions) provided by the extension can be used just like built-in functions.
### Extracting The Main Domain
This function extracts the main domain from a URL using an optimized static TLD lookup system. The extension uses Mozilla's Public Suffix List compiled into a gperf-generated perfect hash function for O(1) TLD lookups with zero collisions.
```sql
D SELECT extract_domain('a.example.com') AS domain;
┌─────────────┐
│   domain    │
│   varchar   │
├─────────────┤
│ example.com │
└─────────────┘
D SELECT extract_domain('https://b.a.example.com/path') AS domain;
┌─────────────┐
│   domain    │
│   varchar   │
├─────────────┤
│ example.com │
└─────────────┘
```
The TLD lookup is built into the extension at compile time using the latest Mozilla Public Suffix List. No runtime downloads or database operations are required.
### Extracting The Path
This function extracts the path from a URL.
```sql
D SELECT extract_path('https://b.a.example.com/path/path') AS path;
┌────────────┐
│    path    │
│  varchar   │
├────────────┤
│ /path/path │
└────────────┘
D SELECT extract_path('example.com/path/path/image.png') AS path;
┌──────────────────────┐
│         path         │
│       varchar        │
├──────────────────────┤
│ /path/path/image.png │
└──────────────────────┘
```
### Extracting The Host
This function extracts the host from a URL.
```sql
D SELECT extract_host('https://b.a.example.com/path/path') AS host;
┌─────────────────┐
│      host       │
│     varchar     │
├─────────────────┤
│ b.a.example.com │
└─────────────────┘
D SELECT extract_host('example.com:443/path/image.png') AS host;
┌─────────────┐
│    host     │
│   varchar   │
├─────────────┤
│ example.com │
└─────────────┘
```
### Extracting The Schema
This function extracts the schema (the URL scheme) from a URL. Currently supported schemes:
- `http` | `https`
- `ftp`
- `mailto`
- `tel` | `sms`
```sql
D SELECT extract_schema('https://b.a.example.com/path/path') AS schema;
┌─────────┐
│ schema  │
│ varchar │
├─────────┤
│ https   │
└─────────┘
D SELECT extract_schema('mailto:someone@example.com') AS schema;
┌─────────┐
│ schema  │
│ varchar │
├─────────┤
│ mailto  │
└─────────┘
D SELECT extract_schema('tel:+123456789') AS schema;
┌─────────┐
│ schema  │
│ varchar │
├─────────┤
│ tel     │
└─────────┘
```
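As a rough illustration of the scheme detection described above, the following Python sketch checks the text before the first `:` against the supported list. It is not the extension's code; the name `extract_scheme` and the empty-string result for unsupported input are assumptions made for this example.

```python
# Schemes listed as supported in this section.
SUPPORTED_SCHEMES = ("http", "https", "ftp", "mailto", "tel", "sms")


def extract_scheme(url: str) -> str:
    """Return the scheme if it is in the supported list, else '' (assumed fallback)."""
    prefix, sep, _ = url.partition(":")
    # A scheme only counts when a ':' was actually found after it.
    if sep and prefix.lower() in SUPPORTED_SCHEMES:
        return prefix.lower()
    return ""


print(extract_scheme("https://b.a.example.com/path/path"))
print(extract_scheme("mailto:someone@example.com"))
print(extract_scheme("tel:+123456789"))
```

An input such as `example.com:443/path` has text before the colon that is not a known scheme, so the sketch reports no scheme for it.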
### Extracting The Query
#### Query String
The `extract_query_string` function extracts the query string from a URL as a single string.
```sql
D SELECT extract_query_string('example.com?key=value') AS query;
┌───────────┐
│   query   │
│  varchar  │
├───────────┤
│ key=value │
└───────────┘
D SELECT extract_query_string('http://example.com.ac/path/?a=1&b=2') AS query;
┌─────────┐
│  query  │
│ varchar │
├─────────┤
│ a=1&b=2 │
└─────────┘
```
#### Query Parameters
The `extract_query_parameters` table function parses the query string and returns each key-value pair as a separate row. This is useful for analyzing URL parameters in a structured way.
```sql
D SELECT * FROM extract_query_parameters('http://example.com/path/?a=1&b=2');
┌─────────┬─────────┐
│   key   │  value  │
│ varchar │ varchar │
├─────────┼─────────┤
│ a       │ 1       │
│ b       │ 2       │
└─────────┴─────────┘
D SELECT * FROM extract_query_parameters('https://example.com/search?q=duckdb&hl=en&num=10');
┌─────────┬─────────┐
│   key   │  value  │
│ varchar │ varchar │
├─────────┼─────────┤
│ q       │ duckdb  │
│ hl      │ en      │
│ num     │ 10      │
└─────────┴─────────┘
D SELECT m.media_url,
         e.key,
         e.value
  FROM instagram_posts m,
       LATERAL extract_query_parameters(m.media_url) e
  ORDER BY m.id;
┌───────────────────────────────────────────────────────────────────────────────────────────┬────────────┬───────────┐
│                                         media_url                                         │    key     │   value   │
│                                          varchar                                          │  varchar   │  varchar  │
├───────────────────────────────────────────────────────────────────────────────────────────┼────────────┼───────────┤
│ https://cdn.instagram.com/media/abc123.jpg?utm_source=instagram&utm_medium=social&id=1001 │ id         │ 1001      │
│ https://cdn.instagram.com/media/abc123.jpg?utm_source=instagram&utm_medium=social&id=1001 │ utm_medium │ social    │
│ https://cdn.instagram.com/media/abc123.jpg?utm_source=instagram&utm_medium=social&id=1001 │ utm_source │ instagram │
│ https://cdn.instagram.com/media/def456.jpg?quality=hd&format=webp&user=arash              │ user       │ arash     │
│ https://cdn.instagram.com/media/def456.jpg?quality=hd&format=webp&user=arash              │ format     │ webp      │
│ https://cdn.instagram.com/media/def456.jpg?quality=hd&format=webp&user=arash              │ quality    │ hd        │
│ https://cdn.instagram.com/media/ghi789.mp4?autoplay=true&loop=false&session_id=xyz987     │ session_id │ xyz987    │
│ https://cdn.instagram.com/media/ghi789.mp4?autoplay=true&loop=false&session_id=xyz987     │ loop       │ false     │
│ https://cdn.instagram.com/media/ghi789.mp4?autoplay=true&loop=false&session_id=xyz987     │ autoplay   │ true      │
└───────────────────────────────────────────────────────────────────────────────────────────┴────────────┴───────────┘
```
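To make the row-per-pair idea concrete outside of SQL, here is a minimal Python sketch that splits a query string into `(key, value)` pairs with the standard library. The function name `query_parameters` is hypothetical, and `parse_qsl` percent-decodes values, which may or may not match what the extension does.

```python
from urllib.parse import parse_qsl


def query_parameters(url: str) -> list:
    """Return the query string of `url` as a list of (key, value) pairs."""
    # Everything after the first '?' and before any '#fragment' is the query.
    _, _, rest = url.partition("?")
    query = rest.split("#", 1)[0]
    # keep_blank_values keeps parameters such as "flag=" instead of dropping them.
    return parse_qsl(query, keep_blank_values=True)


for key, value in query_parameters("http://example.com/path/?a=1&b=2"):
    print(key, value)
```

Each tuple corresponds to one row of the table function's output in the examples above.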
### Extracting The Port
This function extracts the port from a URL.
```sql
D SELECT extract_port('https://example.com:8443/') AS port;
┌─────────┐
│  port   │
│ varchar │
├─────────┤
│ 8443    │
└─────────┘
D SELECT extract_port('[::1]:6379') AS port;
┌─────────┐
│  port   │
│ varchar │
├─────────┤
│ 6379    │
└─────────┘
```
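The host and port examples above can be reproduced with a small hand-written scanner. The Python sketch below is only an illustration of that style of parsing, not the extension's implementation; `split_host_port` is a hypothetical name, and returning an empty string for a missing port is an assumption.

```python
def split_host_port(url: str) -> tuple:
    """Return (host, port) from a URL; port is '' when absent (assumed fallback)."""
    rest = url
    # Strip "scheme://" only when the text before it looks like a scheme.
    scheme_end = url.find("://")
    if scheme_end != -1:
        scheme = url[:scheme_end]
        if scheme.replace("+", "").replace("-", "").replace(".", "").isalnum():
            rest = url[scheme_end + 3:]
    # The authority ends at the first '/', '?' or '#'.
    end = len(rest)
    for i, ch in enumerate(rest):
        if ch in "/?#":
            end = i
            break
    # Drop any "user:password@" prefix.
    authority = rest[:end].rsplit("@", 1)[-1]
    if authority.startswith("["):
        # Bracketed IPv6 literal: the port, if any, follows the closing bracket.
        close = authority.find("]")
        if close == -1:
            return authority, ""
        tail = authority[close + 1:]
        return authority[: close + 1], tail[1:] if tail.startswith(":") else ""
    host, _, port = authority.partition(":")
    return host, port


print(split_host_port("https://example.com:8443/"))
print(split_host_port("[::1]:6379"))
```

The sketch keeps the brackets around an IPv6 host; whether the extension does the same is not stated in this README.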
### Extracting The File Extension
This function extracts the file extension from a URL. It will return the file extension without the dot.
```sql
D SELECT extract_extension('http://example.com/image.jpg') AS ext;
┌─────────┐
│   ext   │
│ varchar │
├─────────┤
│ jpg     │
└─────────┘
```
### Extracting The TLD (Top-Level Domain)
This function extracts the top-level domain from a URL using the optimized gperf-based public suffix lookup system. The function correctly handles multi-part TLDs (like `com.au`) using the longest-match algorithm from Mozilla's Public Suffix List.
```sql
D SELECT extract_tld('https://example.com.ac/path/path') AS tld;
┌─────────┐
│   tld   │
│ varchar │
├─────────┤
│ com.ac  │
└─────────┘
D SELECT extract_tld('a.example.com') AS tld;
┌─────────┐
│   tld   │
│ varchar │
├─────────┤
│ com     │
└─────────┘
```
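The longest-match idea can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the extension's gperf-based code: `PUBLIC_SUFFIXES` is a tiny hypothetical stand-in for the Public Suffix List, `split_host` is a made-up name, and the list's wildcard and exception rules are ignored.

```python
# Tiny stand-in for the Public Suffix List (the real list has thousands of entries).
PUBLIC_SUFFIXES = frozenset({"com", "ac", "com.ac", "au", "com.au"})


def split_host(host: str) -> tuple:
    """Return (subdomain, domain, tld) using longest-suffix matching."""
    labels = host.lower().split(".")
    # Candidates run from longest to shortest, so "com.ac" wins over "ac".
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in PUBLIC_SUFFIXES:
            if i == 0:
                # The whole host is itself a public suffix.
                return "", "", candidate
            domain = labels[i - 1] + "." + candidate
            subdomain = ".".join(labels[: i - 1])
            return subdomain, domain, candidate
    # No known suffix: treat the whole host as the domain.
    return "", host, ""


print(split_host("test.example.com.ac"))
print(split_host("a.b.example.com"))
```

A set membership test plays the role of the perfect hash here: each candidate suffix is checked in constant time on average, and the first hit is the longest match.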
### Extracting The Sub Domain
This function extracts the sub-domain from a URL using the optimized public suffix lookup system to correctly identify the domain boundary and extract everything before it.
```sql
D SELECT extract_subdomain('http://a.b.example.com/path') AS dns_record;
┌────────────┐
│ dns_record │
│  varchar   │
├────────────┤
│ a.b        │
└────────────┘
D SELECT extract_subdomain('test.example.com.ac') AS dns_record;
┌────────────┐
│ dns_record │
│  varchar   │
├────────────┤
│ test       │
└────────────┘
```
### Get Tranco Rank
#### Update Tranco List
The extension can return the [Tranco](https://tranco-list.eu/) rank of a domain. Use the `update_tranco` function to update the Tranco list manually.
```sql
D SELECT update_tranco(true);
┌─────────────────────────────────────┐
│ update_tranco(CAST('t' AS BOOLEAN)) │
│               varchar               │
├─────────────────────────────────────┤
│ Tranco list updated                 │
└─────────────────────────────────────┘
```
This function fetches the latest Tranco list and saves it into the `tranco_list` table. After the call, a `tranco_list_%Y-%m-%d.csv` file is left in the current directory, and the extension reuses this file to avoid downloading the list again.
To ignore the existing file and force a fresh download, call the function with `true` as the parameter. To reuse the existing file instead, call it with `false`:
```sql
D SELECT update_tranco(false);
```
Because the latest Tranco list covers the previous day, you can also download a list manually and rename it to `tranco_list_%Y-%m-%d.csv` to use it with the extension.
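The file-reuse behaviour described above amounts to a date-stamped cache check. The Python sketch below shows one way such a check could look; it is not the extension's code. The names `cache_file` and `needs_download` are hypothetical, and the date is passed in explicitly because this README does not say which day the extension puts in the file name.

```python
from datetime import date
from pathlib import Path


def cache_file(day: date, directory: Path = Path(".")) -> Path:
    """Build the dated file name using the tranco_list_%Y-%m-%d.csv pattern."""
    return directory / f"tranco_list_{day:%Y-%m-%d}.csv"


def needs_download(day: date, force: bool, directory: Path = Path(".")) -> bool:
    """Download when forced, or when no file for that day exists yet."""
    return force or not cache_file(day, directory).exists()


print(cache_file(date(2025, 1, 26)).name)
```

With this shape, dropping a manually downloaded list into the directory under the expected name is enough for the check to skip the download.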
#### Get Tranco Ranking
You can use this function to get the ranking of a domain:
```sql
D SELECT get_tranco_rank('microsoft.com') AS rank;
┌─────────┐
│  rank   │
│ varchar │
├─────────┤
│ 2       │
└─────────┘
D SELECT get_tranco_rank('cloudflare.com') AS rank;
┌─────────┐
│  rank   │
│ varchar │
├─────────┤
│ 13      │
└─────────┘
```
You can use the `get_tranco_rank_category` function to get a domain's rank category. The `category` value is on a log10 scale with half steps (e.g., top 1k, top 5k, top 10k, top 50k, top 100k, top 500k, top 1M, top 5M), with each category excluding the previous ones (e.g., top 5k actually contains 4k domains, because it excludes the top 1k).
```sql
D SELECT get_tranco_rank_category('microsoft.com') AS category;
┌──────────┐
│ category │
│ varchar  │
├──────────┤
│ top1k    │
└──────────┘
```
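The bucketing rule can be written down as a short Python sketch. Only the `top1k` label appears in the output above; the other label spellings, the function name `rank_category`, and the `None` result for ranks beyond 5M are assumptions made for illustration.

```python
from typing import Optional

# Upper bounds on a log10 scale with half steps; labels other than "top1k" are assumed.
THRESHOLDS = [
    (1_000, "top1k"), (5_000, "top5k"),
    (10_000, "top10k"), (50_000, "top50k"),
    (100_000, "top100k"), (500_000, "top500k"),
    (1_000_000, "top1m"), (5_000_000, "top5m"),
]


def rank_category(rank: int) -> Optional[str]:
    """Return the first bucket whose upper bound covers `rank`."""
    for upper, label in THRESHOLDS:
        if rank <= upper:
            return label
    return None  # assumed behaviour for ranks beyond the last bucket


print(rank_category(2))
print(rank_category(1001))
```

Because the first matching bound wins, each bucket automatically excludes the ones before it: ranks 1,001 through 5,000 land in the second bucket, which therefore holds 4,000 domains.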
### IP Address Functions
This extension provides various functions for manipulating and analyzing IP addresses, including calculating networks, hosts, and subnet masks.
#### IP Calculator
> [!WARNING]
> It's an experimental function.
The `ipcalc` function takes an IP address and netmask and calculates the resulting broadcast, network, wildcard mask, and host range.

```sql
SELECT * FROM ipcalc('192.168.1.0/24');
```
It's a table function that provides various details about IP addresses, including:
- Address
- Netmask
- Wildcard
- Network / Hostroute
- HostMin
- HostMax
- Broadcast
- Hosts count
You can use this table function with your data easily:
```sql
D CREATE OR REPLACE TABLE ips AS SELECT '127.0.0.1' AS ip UNION ALL SELECT '192.168.1.0/22';
D SELECT i.IP,
         (
             SELECT hostsPerNet
             FROM ipcalc(i.IP)
         ) AS hosts
  FROM ips AS i;
┌────────────────┬───────┐
│       ip       │ hosts │
│    varchar     │ int64 │
├────────────────┼───────┤
│ 127.0.0.1      │   254 │
│ 192.168.1.0/22 │  1022 │
└────────────────┴───────┘
```
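For readers who want to check the arithmetic behind these fields, the following Python sketch derives the same kind of values with the standard `ipaddress` module. It is an independent illustration rather than the extension's implementation: `ipcalc_sketch` and its dictionary keys are hypothetical, and it assumes an IPv4 input with an explicit prefix of /30 or shorter (a bare address is treated as /32 by Python, unlike the `127.0.0.1` row above).

```python
import ipaddress


def ipcalc_sketch(cidr: str) -> dict:
    """Compute network details for an IPv4 address given in CIDR notation."""
    iface = ipaddress.IPv4Interface(cidr)
    net = iface.network  # host bits are masked off automatically
    return {
        "address": str(iface.ip),
        "netmask": str(net.netmask),
        "wildcard": str(net.hostmask),            # bitwise inverse of the netmask
        "network": str(net.network_address),
        "host_min": str(net.network_address + 1),
        "host_max": str(net.broadcast_address - 1),
        "broadcast": str(net.broadcast_address),
        "hosts": net.num_addresses - 2,           # minus network and broadcast
    }


for field, value in ipcalc_sketch("192.168.1.0/22").items():
    print(field, value)
```

For a /22 the block has 1,024 addresses, and removing the network and broadcast addresses leaves the 1,022 usable hosts shown in the table above.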
### Get Extension Version
You can use the `netquack_version` function to get the extension version.
```sql
D SELECT * FROM netquack_version();
┌─────────┐
│ version │
│ varchar │
├─────────┤
│ v1.8.1  │
└─────────┘
```
## Build Requirements
- **C++ compiler**: Needs C++17 or later (e.g., `g++`, `clang++`).
- **gperf**: Perfect hash generation requires `gperf`.
- **CMake**
- **GNU Make**
```bash
# On Debian-based systems
sudo apt-get install gperf cmake make
# On macOS using Homebrew
brew install gperf cmake make
```
## Debugging
Debugging DuckDB extensions is not easy. To help with this, Netquack writes a log file named `netquack.log` to the current directory. It contains all of the extension's logs, and you can use it to debug your code.
Errors from background tasks such as CURL are also printed to stdout.
## Roadmap
- [ ] Implement `extract_custom_format` function
- [ ] Implement `parse_uri` function - Return a STRUCT with all components (scheme, host, port, path, query, fragment) in a single call
- [ ] Save Tranco data as Parquet
- [ ] Implement GeoIP functionality
- [ ] Return default value for `get_tranco_rank`
- [ ] Implement `extract_fragment` function - Extract the fragment (`#section`) from a URL
- [ ] Implement `normalize_url` function - Canonicalize URLs (lowercase scheme/host, remove default ports, sort query params, remove trailing slashes)
- [ ] Implement `is_valid_url` function - Return whether a string is a well-formed URL
- [ ] Implement `url_encode` / `url_decode` functions - Standalone percent-encoding and decoding
- [ ] Implement `is_valid_ip` function - Return whether a string is a valid IPv4 or IPv6 address
- [ ] Implement `is_private_ip` function - Check if an IP is in a private/reserved range (RFC 1918, loopback, link-local)
- [ ] Implement `ip_to_int` / `int_to_ip` functions - Convert between dotted-quad notation and integer representation
- [ ] Implement `ip_in_range` function - Check if an IP falls within a given CIDR block
- [ ] Implement `ip_version` function - Return `4` or `6` for the IP version of a given address
- [ ] Support internationalized domain names (IDNs)
- [ ] Implement `punycode_encode` / `punycode_decode` functions - Convert internationalized domain names to/from ASCII-compatible encoding
- [ ] Implement `is_valid_domain` function - Validate a domain name against RFC rules
- [ ] Implement `domain_depth` function - Return the number of levels in a domain
- [ ] Implement `base64_encode` / `base64_decode` functions - Encode and decode Base64 strings
- [ ] Implement `extract_path_segments` table function - Split a URL path into individual segment rows
## Contributing
Don't be shy and reach out to us if you want to contribute.
1. Fork it!
2. Create your feature branch: `git checkout -b my-new-feature`
3. Commit your changes: `git commit -am 'Add some feature'`
4. Push to the branch: `git push origin my-new-feature`
5. Submit a pull request
## Issues
Every project has problems. You can contribute to the development of this project by [reporting them](https://github.com/hatamiarash7/duckdb-netquack/issues).