https://github.com/dots-suite/thunderdots
ThunderDoTS: a DTS Crawler via DoTS
- Host: GitHub
- URL: https://github.com/dots-suite/thunderdots
- Owner: dots-suite
- License: MIT
- Created: 2026-02-19T16:49:10.000Z (4 months ago)
- Default Branch: master
- Last Pushed: 2026-06-12T16:09:52.000Z (20 days ago)
- Last Synced: 2026-06-18T16:31:44.106Z (14 days ago)
- Topics: api, corpora, crawler, digital-humanities, distributed-text-services, dots, dts, humanities
- Language: Jupyter Notebook
- Homepage: https://dots-suite.github.io/ThunderDots/
- Size: 4.03 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE.md
- Citation: CITATION.cff
README

ThunderDots — DTS client for documentary corpora
DTS scraping, TEI fragmentation, metadata filtering, validation, and export pipelines.
---
## Overview
**ThunderDots** is a Python client for [DTS](https://dtsapi.org/specifications/) (*Distributed Text Services*) endpoints, initially built for [DoTS](https://dots-suite.github.io/dots_documentation/).
It helps you move from a remote DTS API to structured Python objects and JSON records that can feed indexing pipelines, including full-text search, RAG/vector databases, and corpus-analysis workflows.
ThunderDots focuses on practical documentary workflows: crawling DTS collections, fetching TEI/XML resources, extracting reusable text fragments, selecting metadata, validating outputs, and exporting data to downstream search or indexing systems.
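To make the crawling step concrete, here is a minimal sketch of a recursive collection walk. It is not ThunderDots code: `walk_collection`, the `fetch` callable, and the in-memory `FAKE_API` are hypothetical, and the `@id`, `@type`, and `member` keys are assumed to follow the shape of DTS collection responses.

```python
from typing import Callable, Iterator


def walk_collection(collection_id: str, fetch: Callable[[str], dict]) -> Iterator[dict]:
    """Yield every Resource reachable from `collection_id`, depth-first.

    `fetch` is a hypothetical callable returning the parsed JSON of a
    collection; a real client would issue an HTTP request here.
    """
    node = fetch(collection_id)
    for member in node.get("member", []):
        if member.get("@type") == "Resource":
            yield member
        else:
            # Anything that is not a Resource is treated as a subcollection.
            yield from walk_collection(member["@id"], fetch)


# In-memory stand-in for a DTS endpoint (assumed response shape).
FAKE_API = {
    "root": {
        "@id": "root",
        "@type": "Collection",
        "member": [
            {"@id": "sub", "@type": "Collection", "title": "Subcollection"},
            {"@id": "doc1", "@type": "Resource", "title": "Document 1"},
        ],
    },
    "sub": {
        "@id": "sub",
        "@type": "Collection",
        "member": [{"@id": "doc2", "@type": "Resource", "title": "Document 2"}],
    },
}

resource_ids = [r["@id"] for r in walk_collection("root", FAKE_API.__getitem__)]
print(resource_ids)
```

Passing the fetch function in as an argument keeps the traversal logic testable without network access.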
---
## What ThunderDots does
ThunderDots can:
- walk DTS collections and subcollections;
- fetch resources and TEI/XML documents;
- extract text fragments from full documents, DTS navigation, or custom TEI XPath rules;
- preserve or filter Dublin Core and extension metadata;
- enrich temporal metadata such as dates and coverage ranges;
- validate generated outputs with JSON Schema;
- transform DTS resources into Pandas/Polars DataFrames;
- export records in formats compatible with indexing systems such as Elasticsearch or Qdrant;
- cache fetched corpora as JSON and CSV;
- run synchronous or asynchronous workflows.
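
As an illustration of the XPath-based fragment extraction listed above, the following sketch uses only the Python standard library. It does not show ThunderDots' own API: `extract_fragments`, the sample document, and the output record shape (`id`, `text`) are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Tiny hypothetical TEI document used only for this example.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <div type="chapter" xml:id="c1">
      <head>Intro</head>
      <p>First paragraph.</p>
    </div>
    <div type="chapter" xml:id="c2"><p>Second <hi>paragraph</hi>.</p></div>
  </body></text>
</TEI>"""


def extract_fragments(tei_xml: str, xpath: str) -> list[dict]:
    """Return one record per element matched by `xpath`.

    Each record carries the element's xml:id and its whitespace-normalized text.
    """
    root = ET.fromstring(tei_xml)
    fragments = []
    for element in root.findall(xpath, TEI_NS):
        # Join all descendant text nodes, then collapse runs of whitespace.
        text = " ".join("".join(element.itertext()).split())
        fragments.append({"id": element.get(XML_ID), "text": text})
    return fragments


fragments = extract_fragments(SAMPLE_TEI, ".//tei:div[@type='chapter']")
for fragment in fragments:
    print(fragment)
```

`xml.etree.ElementTree` supports only a subset of XPath; rules that need full XPath 1.0 would require a library such as lxml.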
---
## Installation
### With `uv`
```bash
uv add thunderdots
```
### With `pip`
```bash
pip install thunderdots
```
### For development
```bash
git clone https://github.com/chartes/thunderdots.git
cd thunderdots
uv venv
source .venv/bin/activate
uv sync --all-extras --dev
```
Or with `pip`:
```bash
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
```
## Minimal example
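```python
from thunderdots import ThunderDots

# Configure the client: DTS endpoint, target collection, fragmentation mode.
td = ThunderDots(
    endpoint_dts="https://dots.chartes.psl.eu/api/dts",
    collection_params={"collection_id": "ENCPOS_1900"},
    resource_params={"fragment_mode": "document"},
)

td.fetch()  # run the crawl
results = td.results()  # retrieve the fetched records
print(td.stats())  # print run statistics
```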
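Once `results()` has returned, a downstream step might serialize the records for indexing. The sketch below builds an Elasticsearch bulk payload (newline-delimited JSON) with the standard library only. It assumes the records can be represented as a list of dicts with an `id` key; the records, the `to_bulk_ndjson` helper, and the index name are hypothetical and do not reflect the actual structure returned by ThunderDots.

```python
import json


def to_bulk_ndjson(records: list[dict], index_name: str) -> str:
    """Serialize records as an Elasticsearch bulk payload.

    Each record becomes two lines: an `index` action, then the document source.
    """
    lines = []
    for record in records:
        action = {"index": {"_index": index_name, "_id": record["id"]}}
        lines.append(json.dumps(action, ensure_ascii=False))
        lines.append(json.dumps(record, ensure_ascii=False))
    # The bulk API expects the payload to end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical records standing in for crawler output.
records = [
    {"id": "doc1", "title": "Document 1", "text": "First fragment."},
    {"id": "doc2", "title": "Document 2", "text": "Second fragment."},
]

payload = to_bulk_ndjson(records, "thunderdots-demo")
print(payload, end="")
```

The resulting string can be sent to the `_bulk` endpoint with the `application/x-ndjson` content type.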
## Development
### Run tests
```bash
pytest
```
Online DTS tests are opt-in:
```bash
RUN_NETWORK_TESTS=1 pytest
```
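For readers unfamiliar with opt-in tests, this is one way an environment-variable gate can be written. It is a sketch using the standard library's `unittest`, whose test cases pytest also collects; the project itself may use a pytest marker or fixture instead, and `network_tests_enabled` and `OnlineDTSTest` are hypothetical names.

```python
import os
import unittest


def network_tests_enabled(environ=None) -> bool:
    """Return True only when RUN_NETWORK_TESTS is set to "1"."""
    env = os.environ if environ is None else environ
    return env.get("RUN_NETWORK_TESTS") == "1"


@unittest.skipUnless(
    network_tests_enabled(),
    "set RUN_NETWORK_TESTS=1 to run online DTS tests",
)
class OnlineDTSTest(unittest.TestCase):
    def test_placeholder(self):
        # A real test would query a live DTS endpoint here.
        self.assertTrue(network_tests_enabled())
```

With this pattern the default `pytest` run stays offline and reports the online tests as skipped.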
### Run Ruff (format check and lint)
```bash
ruff format --check
ruff check
```
### Build the documentation
```bash
mkdocs build --strict -f mkdocs/mkdocs.yml
```
### Create a new PyPI release
Check the [release checklist](./RELEASE.md) for details.
### License
ThunderDots is distributed under the [MIT License](./LICENSE.md).
### Citation
If you use ThunderDots in academic work, please cite it as:
```bibtex
@software{terriel_thunderdots_2026,
  author      = {Terriel, Lucas},
  title       = {ThunderDots},
  year        = {2026},
  publisher   = {GitHub},
  institution = {{École nationale des chartes}},
  url         = {https://github.com/chartes/thunderdots},
  note        = {Python client for Distributed Text Services (DTS) via DoTS}
}
```
You can also use the repository metadata from [CITATION.cff](./CITATION.cff).