
# DuckDB Netquack Extension

[![DuckDB Badge](https://img.shields.io/badge/Built_With-DuckDB-fff100)](https://duckdb.org/community_extensions/extensions/netquack.html) [![GitHub License](https://img.shields.io/github/license/hatamiarash7/duckdb-netquack)](https://github.com/hatamiarash7/duckdb-netquack/blob/main/LICENSE) [![GitHub Release](https://img.shields.io/github/v/release/hatamiarash7/duckdb-netquack)](https://github.com/hatamiarash7/duckdb-netquack/releases/latest)

![logo](./.github/netquack.webp)

This extension is designed to simplify working with domains, URIs, and web paths directly within your database queries. Whether you're extracting top-level domains (TLDs), parsing URI components, or analyzing web paths, Netquack provides a suite of intuitive functions to handle all your network tasks efficiently. Built for data engineers, analysts, and developers.

With Netquack, you can unlock deeper insights from your web-related datasets without the need for external tools or complex workflows.

Table of Contents

- [DuckDB Netquack Extension](#duckdb-netquack-extension)
  - [Installation 🚀](#installation-)
  - [Usage Examples 📚](#usage-examples-)
    - [Extracting The Main Domain](#extracting-the-main-domain)
    - [Extracting The Path](#extracting-the-path)
    - [Extracting The Host](#extracting-the-host)
    - [Extracting The Schema](#extracting-the-schema)
    - [Extracting The Query](#extracting-the-query)
    - [Extracting The Port](#extracting-the-port)
    - [Extracting The File Extension](#extracting-the-file-extension)
    - [Extracting The TLD (Top-Level Domain)](#extracting-the-tld-top-level-domain)
    - [Extracting The Sub Domain](#extracting-the-sub-domain)
    - [Get Tranco Rank](#get-tranco-rank)
      - [Update Tranco List](#update-tranco-list)
      - [Get Tranco Ranking](#get-tranco-ranking)
    - [IP Address Functions](#ip-address-functions)
      - [IP Calculator](#ip-calculator)
    - [Get Extension Version](#get-extension-version)
  - [Debugging](#debugging)
  - [Roadmap 🗺️](#roadmap-️)
  - [Contributing 🤝](#contributing-)
  - [Issues 🐛](#issues-)

## Installation 🚀

**netquack** is distributed as a [DuckDB Community Extension](https://duckdb.org/community_extensions/) and can be installed using SQL:

```sql
INSTALL netquack FROM community;
LOAD netquack;
```

If you previously installed the `netquack` extension, upgrade it with `FORCE INSTALL`:

```sql
FORCE INSTALL netquack FROM community;
LOAD netquack;
```
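
To confirm that the extension is installed and loaded in the current session, you can query DuckDB's built-in `duckdb_extensions()` table function. This is a minimal check and not part of Netquack itself:

```sql
-- List the install/load state of the netquack extension in this session.
SELECT extension_name, installed, loaded
FROM duckdb_extensions()
WHERE extension_name = 'netquack';
```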

## Usage Examples 📚

Once installed, the [macro functions](https://duckdb.org/community_extensions/extensions/netquack.html#added-functions) provided by the extension can be used just like built-in functions.

### Extracting The Main Domain

This function extracts the main domain from a URL. To do so, the extension looks up the public suffixes from the [publicsuffix.org](https://publicsuffix.org/) list.

The public suffix list is downloaded automatically the first time the function is called. It is then stored in the `public_suffix_list` table so that it does not need to be downloaded again.

```sql
D SELECT extract_domain('a.example.com') AS domain;
┌─────────────┐
│   domain    │
│   varchar   │
├─────────────┤
│ example.com │
└─────────────┘

D SELECT extract_domain('https://b.a.example.com/path') AS domain;
┌─────────────┐
│   domain    │
│   varchar   │
├─────────────┤
│ example.com │
└─────────────┘
```
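
Because `extract_domain` is used like a built-in function, it can also be applied to a column. The sketch below groups a few sample URLs by main domain. The `page_views` CTE and its `url` column are hypothetical names used only for illustration:

```sql
-- Hypothetical sample data; replace the CTE with your own table.
WITH page_views(url) AS (
    VALUES
        ('https://b.a.example.com/path'),
        ('a.example.com'),
        ('https://example.com/other')
)
SELECT extract_domain(url) AS domain,
       count(*)            AS hits
FROM page_views
GROUP BY domain
ORDER BY hits DESC;
```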

You can use the `update_suffixes` function to update the public suffix list manually.

```sql
D SELECT update_suffixes();
┌───────────────────┐
│ update_suffixes() │
│      varchar      │
├───────────────────┤
│ updated           │
└───────────────────┘
```

> [!WARNING]
> This is a public service with a limited number of requests. If you call the function too many times, you may get a 403 error:
> `AccessDenied`: `Access denied.`
> The list usually changes a few times per week; downloading it more frequently will cause rate limiting.
> In this case, you can download the list manually from [publicsuffix.org](https://publicsuffix.org/) and save it in the `public_suffix_list` table.
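
The column layout of `public_suffix_list` is not documented here, so before loading a manually downloaded list it may help to inspect the table that the extension created. A minimal sketch, assuming the table already exists because a suffix-based function has been called at least once:

```sql
-- Show the columns the extension expects in the table.
DESCRIBE public_suffix_list;

-- Check how many suffixes are currently stored.
SELECT count(*) AS suffix_count
FROM public_suffix_list;
```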

### Extracting The Path

This function extracts the path from a URL.

```sql
D SELECT extract_path('https://b.a.example.com/path/path') AS path;
┌────────────┐
│    path    │
│  varchar   │
├────────────┤
│ /path/path │
└────────────┘

D SELECT extract_path('example.com/path/path/image.png') AS path;
┌──────────────────────┐
│         path         │
│       varchar        │
├──────────────────────┤
│ /path/path/image.png │
└──────────────────────┘
```

### Extracting The Host

This function extracts the host from a URL.

```sql
D SELECT extract_host('https://b.a.example.com/path/path') AS host;
┌─────────────────┐
│      host       │
│     varchar     │
├─────────────────┤
│ b.a.example.com │
└─────────────────┘

D SELECT extract_host('example.com:443/path/image.png') AS host;
┌─────────────┐
│    host     │
│   varchar   │
├─────────────┤
│ example.com │
└─────────────┘
```

### Extracting The Schema

This function extracts the schema (the URI scheme) from a URL. Currently supported schemas:

- `http` | `https`
- `ftp`
- `mailto`
- `tel` | `sms`

```sql
D SELECT extract_schema('https://b.a.example.com/path/path') AS schema;
┌─────────┐
│ schema  │
│ varchar │
├─────────┤
│ https   │
└─────────┘

D SELECT extract_schema('mailto:someone@example.com') AS schema;
┌─────────┐
│ schema  │
│ varchar │
├─────────┤
│ mailto  │
└─────────┘

D SELECT extract_schema('tel:+123456789') AS schema;
┌─────────┐
│ schema  │
│ varchar │
├─────────┤
│ tel     │
└─────────┘
```

### Extracting The Query

This function extracts the query string from a URL.

```sql
D SELECT extract_query_string('example.com?key=value') AS query;
┌───────────┐
│   query   │
│  varchar  │
├───────────┤
│ key=value │
└───────────┘

D SELECT extract_query_string('http://example.com.ac/path/?a=1&b=2&') AS query;
┌──────────┐
│  query   │
│ varchar  │
├──────────┤
│ a=1&b=2& │
└──────────┘
```

### Extracting The Port

This function extracts the port from a URL.

```sql
D SELECT extract_port('https://example.com:8443/') AS port;
┌─────────┐
│  port   │
│ varchar │
├─────────┤
│ 8443    │
└─────────┘

D SELECT extract_port('[::1]:6379') AS port;
┌─────────┐
│  port   │
│ varchar │
├─────────┤
│ 6379    │
└─────────┘
```

### Extracting The File Extension

This function extracts the file extension from a URL. It will return the file extension without the dot.

```sql
D SELECT extract_extension('http://example.com/image.jpg') AS ext;
┌─────────┐
│   ext   │
│ varchar │
├─────────┤
│ jpg     │
└─────────┘
```
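
The component extractors shown so far can be combined in a single query to split URLs into columns. The sketch below uses a hypothetical `requests` CTE with a `url` column. No output is shown because the value returned for a missing component (for example, a URL without a port) is not specified in this document:

```sql
-- Hypothetical sample data; replace the CTE with your own table.
WITH requests(url) AS (
    VALUES
        ('https://b.a.example.com/path/path'),
        ('https://example.com:8443/'),
        ('http://example.com/image.jpg'),
        ('http://example.com.ac/path/?a=1&b=2&')
)
SELECT
    url,
    extract_schema(url)       AS schema,
    extract_host(url)         AS host,
    extract_port(url)         AS port,
    extract_path(url)         AS path,
    extract_query_string(url) AS query,
    extract_extension(url)    AS ext
FROM requests;
```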

### Extracting The TLD (Top-Level Domain)

This function extracts the top-level domain from a URL. It uses the public suffix list, so multi-label suffixes such as `com.ac` are returned whole. Check the [Extracting The Main Domain](#extracting-the-main-domain) section for more information about the public suffix list.

```sql
D SELECT extract_tld('https://example.com.ac/path/path') AS tld;
┌─────────┐
│   tld   │
│ varchar │
├─────────┤
│ com.ac  │
└─────────┘

D SELECT extract_tld('a.example.com') AS tld;
┌─────────┐
│   tld   │
│ varchar │
├─────────┤
│ com     │
└─────────┘
```

### Extracting The Sub Domain

This function extracts the sub-domain from a URL. It uses the public suffix list to find the TLD and treats everything in front of the main domain as the sub-domain. Check the [Extracting The Main Domain](#extracting-the-main-domain) section for more information about the public suffix list.

```sql
D SELECT extract_subdomain('http://a.b.example.com/path') AS dns_record;
┌────────────┐
│ dns_record │
│  varchar   │
├────────────┤
│ a.b        │
└────────────┘

D SELECT extract_subdomain('test.example.com.ac') AS dns_record;
┌────────────┐
│ dns_record │
│  varchar   │
├────────────┤
│ test       │
└────────────┘
```
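
The three suffix-based functions complement each other: together they split a host name into sub-domain, main domain, and TLD. A minimal sketch over a hypothetical `hosts` CTE (the CTE and column names are illustrative only):

```sql
-- Hypothetical sample data; each function consults the public suffix list.
WITH hosts(url) AS (
    VALUES
        ('http://a.b.example.com/path'),
        ('test.example.com.ac'),
        ('a.example.com')
)
SELECT
    url,
    extract_subdomain(url) AS subdomain,
    extract_domain(url)    AS domain,
    extract_tld(url)       AS tld
FROM hosts;
```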

### Get Tranco Rank

#### Update Tranco List

The extension can return the [Tranco](https://tranco-list.eu/) rank of a domain. Use the `update_tranco` function to update the Tranco list manually.

```sql
D SELECT update_tranco(true);
┌─────────────────────────────────────┐
│ update_tranco(CAST('f' AS BOOLEAN)) │
│               varchar               │
├─────────────────────────────────────┤
│ Tranco list updated                 │
└─────────────────────────────────────┘
```

This function will get the latest Tranco list and save it into the `tranco_list` table. There will be a `tranco_list_%Y-%m-%d.csv` file in the current directory after the function is called. The extension will use this file to prevent downloading the list again.

Call the function with `true` to ignore the file and force the extension to download the list again. Call it with `false` to reuse the existing file instead of downloading the list again.

```sql
D SELECT update_tranco(false);
```

Because the latest Tranco list covers the previous day, you can also download a list manually and rename it to `tranco_list_%Y-%m-%d.csv` to use it with the extension.
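
The `%Y-%m-%d` placeholders follow the usual `strftime` date format. The sketch below uses DuckDB's built-in `strftime` to show how the pattern expands. It assumes the file is stamped with the current date, which this document does not state explicitly:

```sql
-- Expand the file name pattern for today's date (plain DuckDB, no extension needed).
SELECT strftime(current_date, 'tranco_list_%Y-%m-%d.csv') AS expected_file;
```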

#### Get Tranco Ranking

You can use this function to get the ranking of a domain:

```sql
D SELECT get_tranco_rank('microsoft.com') AS rank;
┌─────────┐
│  rank   │
│ varchar │
├─────────┤
│ 2       │
└─────────┘

D SELECT get_tranco_rank('cloudflare.com') AS rank;
┌─────────┐
│  rank   │
│ varchar │
├─────────┤
│ 13      │
└─────────┘
```

Use the `get_tranco_rank_category` function to get a domain's rank category. The `category` value is on a log10 scale with half steps (e.g., top 1k, top 5k, top 10k, top 50k, top 100k, top 500k, top 1M, top 5M, etc.), with each category excluding the previous ones (e.g., top 5k actually contains 4k domains, because it excludes the top 1k).

```sql
D SELECT get_tranco_rank_category('microsoft.com') AS category;
┌──────────┐
│ category │
│ varchar  │
├──────────┤
│ top1k    │
└──────────┘
```
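
Both Tranco functions can be applied to a column of domains. The examples above show the rank returned as `varchar`, so this sketch casts it with `TRY_CAST` to sort numerically. The `sites` CTE and the `tranco_rank` alias are hypothetical names, and the query assumes the Tranco list has already been loaded with `update_tranco`:

```sql
-- Hypothetical sample data; requires a populated Tranco list.
WITH sites(domain) AS (
    VALUES ('microsoft.com'), ('cloudflare.com')
)
SELECT
    domain,
    TRY_CAST(get_tranco_rank(domain) AS INTEGER) AS tranco_rank,
    get_tranco_rank_category(domain)             AS category
FROM sites
ORDER BY tranco_rank;
```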

### IP Address Functions

This extension provides various functions for manipulating and analyzing IP addresses, including calculating networks, hosts, and subnet masks.

#### IP Calculator

> [!WARNING]
> It's an experimental function.

The `ipcalc` function takes an IP address and netmask and calculates the resulting broadcast, network, wildcard mask, and host range.

![ipcalc-sc](./.github/ipcalc-sc.png)

```sql
SELECT * FROM ipcalc('192.168.1.0/24');
```

It's a table function that provides various details about IP addresses, including:

- Address
- Netmask
- Wildcard
- Network / Hostroute
- HostMin
- HostMax
- Broadcast
- Hosts count

You can use this table function with your data easily:

```sql
D CREATE OR REPLACE TABLE ips AS SELECT '127.0.0.1' AS ip UNION ALL SELECT '192.168.1.0/22';

D SELECT i.IP,
         (
             SELECT hostsPerNet
             FROM ipcalc(i.IP)
         ) AS hosts
  FROM ips AS i;
┌────────────────┬───────┐
│       ip       │ hosts │
│    varchar     │ int64 │
├────────────────┼───────┤
│ 127.0.0.1      │   254 │
│ 192.168.1.0/22 │  1022 │
└────────────────┴───────┘
```
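
The host counts above match the usual IPv4 arithmetic: a prefix of length `n` leaves `32 - n` host bits, and the network and broadcast addresses are excluded. The sketch below reproduces the numbers in plain SQL without the extension. That `ipcalc` counts hosts this way, and that the 254 reported for the bare `127.0.0.1` comes from a `/24` default, are inferences from the output rather than documented behavior:

```sql
-- Usable hosts = 2^(32 - prefix_len) - 2 (network and broadcast excluded).
SELECT prefix_len,
       (1 << (32 - prefix_len)) - 2 AS usable_hosts
FROM (VALUES (22), (24)) AS t(prefix_len);
-- prefix_len 22 -> 1022, prefix_len 24 -> 254
```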

### Get Extension Version

You can use the `netquack_version` function to get the extension version.

```sql
D SELECT * FROM netquack_version();
┌─────────┐
│ version │
│ varchar │
├─────────┤
│ v1.4.0  │
└─────────┘
```

## Debugging

Debugging DuckDB extensions is not easy. To help, Netquack writes a log file named `netquack.log` to the current directory. It contains all of the extension's logs, and you can use it to debug your code.

Also, there will be stdout errors for background tasks like CURL.

## Roadmap 🗺️

- [ ] Create a `TableFunction` for `extract_query_parameters` that returns each key-value pair as a row.
- [ ] Implement `extract_custom_format` function
- [ ] Implement `parse_uri` function
- [ ] Save Tranco data as Parquet
- [ ] Implement GeoIP functionality
- [ ] Return default value for `get_tranco_rank`
- [ ] Support internationalized domain names (IDNs)

## Contributing 🤝

Don't be shy and reach out to us if you want to contribute 😉

1. Fork it!
2. Create your feature branch: `git checkout -b my-new-feature`
3. Commit your changes: `git commit -am 'Add some feature'`
4. Push to the branch: `git push origin my-new-feature`
5. Submit a pull request

## Issues 🐛

Every project has problems. Help improve this one by [reporting them](https://github.com/hatamiarash7/duckdb-netquack/issues). 👍