https://github.com/dense-analysis/dank

ai clickhouse knowledge-graph python redis scraping scraping-websites

# DANK - Dense Analysis Network Knowledge

DANK is a Dense Analysis project focused on collecting and analyzing live data
from the public Internet. It uses API access, web scraping, RSS feeds, and
semantic indexing tools to ingest external content in real time. It applies
sentiment analysis, semantic clustering, and AI models to build structured
insights about the world, including trends, public perception, and evolving
narratives. The goal is to automate contextual understanding and surface
relevant knowledge as it emerges.

## Requirements

- Python 3.13
- uv
- ClickHouse (local server)

## ClickHouse setup

1. Install ClickHouse: https://clickhouse.com/docs/en/install
2. Start the ClickHouse server (systemd or `clickhouse server`).
3. Create the schema:

```
~/clickhouse/clickhouse client --multiquery < schema.sql
```

The schema uses the `dank` database by default. Adjust `config.toml` if you
need a different database name.

## Configuration

Configuration lives in `config.toml` and should not be committed. Example:

```toml
sources = [
    { domain = "x.com", accounts = ["example"] },
    "blog.codinghorror.com",
]

[clickhouse]
host = "localhost"
port = 8123
database = "dank"
username = "default"
password = ""
secure = false
use_http = true

[x]
username = "your-x-username"
password = "your-x-password"
max_posts = 200
max_scrolls = 20
scroll_pause_seconds = 1.5

[storage]
data_dir = "data"
max_asset_bytes = 10485760

[browser]
# Optional: full path or command name for a Chromium-based browser.
executable_path = "thorium-browser"
# Optional: extra time to wait for the browser to start.
connection_timeout = 1.0
# Optional: connection retry count for slow browser startups.
connection_max_tries = 30

[email]
# Optional: IMAP settings for OTP codes.
host = "imap.example.com"
username = "you@example.com"
password = "your-imap-password"
port = 993

[logging]
# Optional: file path for scrape/process logs.
file = "dank.log"
# Optional: logging level (DEBUG, INFO, WARNING, ERROR).
level = "INFO"
```

`sources` controls which domains to scrape and process. Each entry can provide
accounts for account-based sources like `x.com`.

If a domain has no source-specific configuration, DANK scrapes the root of the
domain to discover RSS feeds to read from.
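
Feed discovery of this kind usually relies on `<link rel="alternate">` tags in the page head. The sketch below shows that general technique with the standard library only. It is an assumption about how discovery could work, not a description of DANK's code, and the names `FeedLinkParser` and `discover_feeds` are hypothetical. Accepting Atom feeds alongside RSS is also an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumed set of feed MIME types; the README only mentions RSS.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkParser(HTMLParser):
    """Collect feed URLs advertised through <link rel="alternate"> tags."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.feeds: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        rel = attr_map.get("rel", "").lower().split()
        feed_type = attr_map.get("type", "").lower()
        href = attr_map.get("href")
        if "alternate" in rel and feed_type in FEED_TYPES and href:
            # Resolve relative hrefs against the page URL.
            self.feeds.append(urljoin(self.base_url, href))


def discover_feeds(html: str, base_url: str) -> list[str]:
    parser = FeedLinkParser(base_url)
    parser.feed(html)
    return parser.feeds


PAGE = (
    '<html><head><link rel="alternate" type="application/rss+xml" '
    'href="/feed.xml"></head><body></body></html>'
)
print(discover_feeds(PAGE, "https://blog.example.com/"))
# prints ['https://blog.example.com/feed.xml']
```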

`browser.executable_path` sets the browser binary to launch. If unset, DANK
will try common Chromium locations.

`storage.max_asset_bytes` caps the size of each downloaded asset, in bytes.
Assets larger than the cap are not downloaded, but they are still recorded.
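
The example value `10485760` is 10 MiB (10 × 1024 × 1024). A minimal sketch of the skip-but-record behavior follows. The `AssetRecord` type and `plan_asset` function are hypothetical, and treating an asset exactly at the cap as allowed is an assumption.

```python
from dataclasses import dataclass

MAX_ASSET_BYTES = 10 * 1024 * 1024  # 10485760, as in the example config


@dataclass(frozen=True)
class AssetRecord:
    """Hypothetical record kept for every asset, downloaded or not."""

    url: str
    size_bytes: int
    downloaded: bool


def plan_asset(url: str, size_bytes: int, max_asset_bytes: int) -> AssetRecord:
    # Oversized assets are still recorded, only the download is skipped.
    return AssetRecord(url, size_bytes, downloaded=size_bytes <= max_asset_bytes)


small = plan_asset("https://example.com/a.png", 2_000_000, MAX_ASSET_BYTES)
large = plan_asset("https://example.com/b.mp4", 50_000_000, MAX_ASSET_BYTES)
print(small.downloaded, large.downloaded)  # prints True False
```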

When X prompts for a one-time code, DANK will poll the IMAP inbox for messages
from `x.com` that arrived after the login attempt and extract the confirmation
code.

If the browser is slow to start, increase `browser.connection_timeout` or
`browser.connection_max_tries`.
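
To show how the two settings could interact, here is a minimal retry loop. It assumes the timeout is a pause between attempts, which would put the worst-case wait at roughly `connection_timeout × connection_max_tries`, or about 30 seconds with the example values. DANK's real startup logic may differ, and `connect_with_retries` is a hypothetical name.

```python
import time
from collections.abc import Callable
from typing import TypeVar

T = TypeVar("T")


def connect_with_retries(
    connect: Callable[[], T],
    timeout: float,
    max_tries: int,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `connect` until it succeeds or `max_tries` attempts have failed."""
    last_error: Exception | None = None
    for attempt in range(1, max_tries + 1):
        try:
            return connect()
        except ConnectionError as error:
            last_error = error
            if attempt < max_tries:
                sleep(timeout)  # give the browser more time to start
    raise TimeoutError(f"no connection after {max_tries} tries") from last_error


# Usage: a stand-in browser that becomes reachable on the third attempt.
attempts = 0
pauses: list[float] = []


def flaky_connect() -> str:
    global attempts
    attempts += 1
    if attempts < 3:
        raise ConnectionError("browser not ready")
    return "connected"


result = connect_with_retries(flaky_connect, 1.0, 30, sleep=pauses.append)
print(result, attempts, pauses)  # prints connected 3 [1.0, 1.0]
```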

`logging.file` controls where scrape/process logs are written. Relative paths
are resolved from the current working directory.

## Usage

DANK offers the following commands.

* `uv run scrape` -- Scrape the web for data.
  * Pass `--domains` to scrape only matching domains from `sources`,
    for example `--domains '^x\.com$'`.
* `uv run process` -- Process previously scraped data.
  * The `--age` argument can be given a duration to process, for example
    `6hours` or `2days`.
* `uv run clickhouse-query` -- Run queries on the database.
  * You can only run `SELECT`, `SHOW`, or `EXPLAIN` queries through this tool.
  * Query results are formatted for readability.
  * Query results are truncated unless you pass `--full`.
* `uv run embed-text "your text"` -- Print an embedding vector.
  * Output is a JSON `list[float]` for easy copy/paste into other tools.
* `uv run download-embedding-model` -- Download and cache the embedding model.
  * Pass `--model` to choose another Hugging Face model id.
* `uv run web` -- Start a simple web server to view content.
  * Pass `--no-reload` to disable hot code reloading.
  * Supports search filters for domain/account and a days-back slider.
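
The `--age` durations and `--domains` patterns above could be handled along these lines. This is a sketch under stated assumptions: `parse_age` and `filter_domains` are hypothetical names, only the `hours` and `days` units shown in the examples are accepted, and the domain pattern is treated as a regular expression searched against each configured domain.

```python
import re
from datetime import timedelta

# Only the units that appear in the README examples are accepted here.
AGE_PATTERN = re.compile(r"(\d+)\s*(hour|day)s?")


def parse_age(text: str) -> timedelta:
    """Turn a string such as '6hours' or '2days' into a timedelta."""
    match = AGE_PATTERN.fullmatch(text.strip().lower())
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    amount, unit = match.groups()
    return timedelta(**{unit + "s": int(amount)})


def filter_domains(domains: list[str], pattern: str) -> list[str]:
    """Keep only the domains that the regular expression matches."""
    regex = re.compile(pattern)
    return [domain for domain in domains if regex.search(domain)]


print(parse_age("6hours"))  # prints 6:00:00
print(parse_age("2days"))   # prints 2 days, 0:00:00
print(filter_domains(["x.com", "blog.codinghorror.com"], r"^x\.com$"))
# prints ['x.com']
```

Escaping the dot matters: an unescaped `^x.com$` would also match a domain such as `xacom`.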

## Testing

* `uv run pytest` -- Run the default test suite.
* `uv run pytest -m embeddings -s` -- Run real-model embedding checks.
  * These tests are skipped by default and require the model cache.
  * Includes per-case similarity and margin output for each model.
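
For readers unfamiliar with these terms, the sketch below shows one common way to compute a similarity and a margin for embedding vectors. It assumes cosine similarity, and it assumes the margin is the gap between a related pair and an unrelated pair. The README does not define either, so the test suite may compute them differently.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


def margin(query: list[float], related: list[float], unrelated: list[float]) -> float:
    """Assumed definition: how much closer the related text is than the unrelated one."""
    return cosine_similarity(query, related) - cosine_similarity(query, unrelated)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))      # prints 1.0
print(margin([1.0, 0.0], [1.0, 0.0], [0.0, 1.0]))     # prints 1.0
```

A positive margin means the model ranks the related text above the unrelated one for that case.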