https://github.com/jmfontaine/dgkit
Quickly convert Discogs data dumps or load them into databases
https://github.com/jmfontaine/dgkit
discogs discogs-dump python
Last synced: 4 months ago
JSON representation
Quickly convert Discogs data dumps or load them into databases
- Host: GitHub
- URL: https://github.com/jmfontaine/dgkit
- Owner: jmfontaine
- License: apache-2.0
- Created: 2025-12-29T16:59:43.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2026-01-06T23:10:57.000Z (5 months ago)
- Last Synced: 2026-01-07T00:17:41.092Z (5 months ago)
- Topics: discogs, discogs-dump, python
- Language: Python
- Homepage:
- Size: 979 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 1
-
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
Awesome Lists containing this project
README
# dgkit
A fast, easy-to-use tool for converting, and loading data from [Discogs data dumps](https://www.discogs.com/data/) into various formats and databases.

## Installation
TODO
## Usage
```text
Usage: dgkit [OPTIONS] COMMAND [ARGS]...
Process Discogs data dumps.
╭─ Options ────────────────────────────────────────────────────────────────────────────╮
│ --debug Show full error tracebacks. │
│ --install-completion Install completion for the current shell. │
│ --show-completion Show completion for the current shell, to copy it or │
│ customize the installation. │
│ --help Show this message and exit. │
╰──────────────────────────────────────────────────────────────────────────────────────╯
╭─ Commands ───────────────────────────────────────────────────────────────────────────╮
│ convert Convert data dumps to another format. │
│ load Load data dumps into a database. │
│ sample Extract a sample from a Discogs data dump. │
╰──────────────────────────────────────────────────────────────────────────────────────╯
```
### dgkit convert
Convert Discogs data dumps to various formats. Supports filtering records, limiting output, and optional compression.
| Option | Values |
|-----------------|-------------------------------|
| Output formats | `json`, `jsonl` |
| Compression | `bz2`, `gzip`, `none` |
Full command help
```text
Usage: dgkit convert [OPTIONS] FILES...
Convert data dumps to another format.
╭─ Arguments ──────────────────────────────────────────────────────────────────────────╮
│ * files FILES... Discogs dump files. [required] │
╰──────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ────────────────────────────────────────────────────────────────────────────╮
│ * --format -f [blackhole|console|js Output file format. │
│ on|jsonl] [required] │
│ --compress -c [bz2|gzip|none] Compression algorithm. │
│ [default: none] │
│ --drop-if TEXT Drop records matching │
│ field=value. │
│ --limit INTEGER Max records per file. │
│ --output-dir -o PATH Output directory. │
│ [default: .] │
│ --overwrite -w Overwrite existing │
│ files. │
│ --progress --no-progress Show progress bar. │
│ [default: progress] │
│ --strict Warn about unhandled │
│ XML elements. │
│ --strict-fail Fail on unhandled XML │
│ data (implies │
│ --strict). │
│ --summary --no-summary Show summary. │
│ [default: summary] │
│ --type -t [artists|labels|maste Entity type (if not │
│ rs|releases] auto-detected). │
│ --unset TEXT Fields to set to null │
│ (comma-separated). │
│ --verbose -v Show detailed │
│ processing info. │
│ --help Show this message and │
│ exit. │
╰──────────────────────────────────────────────────────────────────────────────────────╯
```
#### Examples
```shell
# Convert releases to JSONL format
dgkit convert -f jsonl discogs_20260101_releases.xml.gz
# Convert to JSON with bzip2 compression
dgkit convert -f json -c bz2 discogs_20260101_artists.xml.gz
# Convert first 1000 records only (useful for testing)
dgkit convert -f jsonl --limit 1000 discogs_20260101_releases.xml.gz
# Drop records with specific data quality
dgkit convert -f jsonl --drop-if "data_quality=Needs Vote" discogs_20260101_releases.xml.gz
# Clear specific fields from output
dgkit convert -f jsonl --unset notes,images discogs_20260101_releases.xml.gz
# Validate XML for unhandled elements
dgkit convert -f jsonl --strict discogs_20260101_releases.xml.gz
# Fail on any unhandled XML data
dgkit convert -f jsonl --strict-fail discogs_20260101_releases.xml.gz
# Convert multiple files
dgkit convert -f jsonl discogs_20260101_artists.xml.gz discogs_20260101_labels.xml.gz discogs_20260101_releases.xml.gz
# Convert all XML files using a wildcard
dgkit convert -f jsonl discogs_20260101_*.xml.gz
# Convert a file with entity type in filename (auto-detected)
dgkit convert -f jsonl my_releases_backup.xml.gz
# Explicit type override when filename has no entity type
dgkit convert -f jsonl --type releases dump.xml.gz
```
### dgkit load
Load Discogs data dumps directly into a database. Supports batched inserts, filtering, and schema auto-creation.
| Database | Versions |
|----------|----------|
| SQLite | 3.x |
| PostgreSQL | 14+ |
Full command help
```text
Usage: dgkit load [OPTIONS] FILES...
Load data dumps into a database.
╭─ Arguments ──────────────────────────────────────────────────────────────────────────╮
│ * files FILES... Discogs dump files. [required] │
╰──────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ────────────────────────────────────────────────────────────────────────────╮
│ --batch -b INTEGER Batch size for │
│ database inserts. │
│ [default: 10000] │
│ --commit-interval INTEGER Commit transaction │
│ every N records │
│ (PostgreSQL only). │
│ --drop-if TEXT Drop records matching │
│ field=value. │
│ --dsn TEXT Database connection │
│ string (PostgreSQL: │
│ postgresql://..., │
│ SQLite: path or │
│ sqlite:///...). │
│ --limit INTEGER Max records per file. │
│ --overwrite -w Overwrite existing │
│ database. │
│ --progress --no-progress Show progress bar. │
│ [default: progress] │
│ --strict Warn about unhandled │
│ XML elements. │
│ --strict-fail Fail on unhandled XML │
│ data (implies │
│ --strict). │
│ --summary --no-summary Show summary. │
│ [default: summary] │
│ --type -t [artists|labels|maste Entity type (if not │
│ rs|releases] auto-detected). │
│ --unset TEXT Fields to set to null │
│ (comma-separated). │
│ --verbose -v Show detailed │
│ database operation │
│ info. │
│ --help Show this message and │
│ exit. │
╰──────────────────────────────────────────────────────────────────────────────────────╯
```
#### Examples
```shell
# Load releases into SQLite (database name auto-generated from input files)
dgkit load discogs_20260101_releases.xml.gz
# Load into a specific SQLite database
dgkit load --dsn discogs.db discogs_20260101_releases.xml.gz
# Load into PostgreSQL
dgkit load --dsn "postgresql://user:pass@localhost/discogs" discogs_20260101_releases.xml.gz
# Load first 1000 records only (useful for testing)
dgkit load --limit 1000 discogs_20260101_releases.xml.gz
# Use smaller batch size for memory-constrained environments
dgkit load -b 1000 discogs_20260101_releases.xml.gz
# Overwrite existing database without prompting
dgkit load -w discogs_20260101_releases.xml.gz
# Load multiple entity types into the same database
dgkit load --dsn discogs.db discogs_20260101_artists.xml.gz discogs_20260101_labels.xml.gz discogs_20260101_releases.xml.gz
# Load all dump files using a wildcard
dgkit load --dsn discogs.db discogs_20260101_*.xml.gz
# Drop records with specific data quality
dgkit load --drop-if "data_quality=Needs Vote" discogs_20260101_releases.xml.gz
# Clear specific fields before loading
dgkit load --unset notes,images discogs_20260101_releases.xml.gz
# Validate XML for unhandled elements
dgkit load --strict discogs_20260101_releases.xml.gz
# Fail on any unhandled XML data
dgkit load --strict-fail discogs_20260101_releases.xml.gz
# Load a file with entity type in filename (auto-detected)
dgkit load my_releases_backup.xml.gz
# Explicit type override when filename has no entity type
dgkit load --type releases dump.xml.gz
```
## License
dgkit is licensed under the [Apache License 2.0](LICENSE.txt).