https://github.com/czcorpus/depreldb
A fast database for UD dependency relations between lemmas
collocation-extraction corpus-linguistics corpus-processing corpus-tools data-retrieval database linguistics universal-dependencies
- Host: GitHub
- URL: https://github.com/czcorpus/depreldb
- Owner: czcorpus
- Created: 2025-07-02T06:47:23.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2025-10-14T08:38:02.000Z (8 months ago)
- Last Synced: 2026-01-12T00:33:29.967Z (5 months ago)
- Topics: collocation-extraction, corpus-linguistics, corpus-processing, corpus-tools, data-retrieval, database, linguistics, universal-dependencies
- Language: Go
- Homepage:
- Size: 124 KB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# DeprelDB
A high-performance Go-based **dependency-based collocation extraction and search library** for linguistic analysis. DeprelDB processes linguistic data to calculate statistical measures like T-Score, Log-Dice, and LMI (Local Mutual Information) for finding meaningful syntactic collocations between lemmas.
## Features
- **Fast collocation search** using BadgerDB with optimized read-only configurations
- **High-performance storage**: memory-efficient binary key encoding and optimized grouping algorithms
- **Statistical measures**: T-Score, Log-Dice, and LMI calculations with Reciprocal Rank Fusion (RRF) scoring
- **Universal Dependencies support**: Full integration with UD POS tags and dependency relations
- **Flexible querying**: Filter by lemma, POS tags, dependency relations, and text types
- **Multiple output formats**: Tabular display or JSON output
- **Large dataset optimized**: Handles multi-GB databases with intelligent caching
- **REPL mode**: Interactive query session with CTRL+C support
- **Can be used as a library**
## Installation
### Prerequisites
- Go 1.23.4 or later
### Building
```bash
# Clone the repository
git clone https://github.com/czcorpus/depreldb
cd depreldb
# Build the project
make all
```
This will build:
1. The `scollsrch` binary for querying databases
2. The `mkscolldb` binary for data import
Alternatively, build manually:
```bash
go build -o scollsrch ./cmd/search
go build -o mkscolldb ./cmd/mkscolldb
```
## Input Data Format
DeprelDB expects linguistic data in **vertical format**, where each token is on a separate line with tab-separated attributes. Sentences are delimited by `<s>` structures, which may carry XML-like attributes.
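To make the format concrete, here is a minimal sketch of reading one vertical line in Go. The `Token` type, the `parseToken` helper, the zero-based column indices, and the sample line are all assumptions made for illustration; they are not taken from DeprelDB's import code.

```go
package main

import (
	"fmt"
	"strings"
)

// Token holds the attributes picked from one vertical-file line.
// The type and its field names are hypothetical.
type Token struct {
	Lemma  string
	POS    string
	Deprel string
	Parent string
}

// parseToken splits a tab-separated token line and selects columns by
// zero-based index (an assumption). It reports false for structural
// lines (anything starting with '<') and for lines with too few columns.
func parseToken(line string, lemmaIdx, posIdx, deprelIdx, parentIdx int) (Token, bool) {
	if strings.HasPrefix(line, "<") {
		return Token{}, false
	}
	cols := strings.Split(line, "\t")
	for _, idx := range []int{lemmaIdx, posIdx, deprelIdx, parentIdx} {
		if idx < 0 || idx >= len(cols) {
			return Token{}, false
		}
	}
	return Token{
		Lemma:  cols[lemmaIdx],
		POS:    cols[posIdx],
		Deprel: cols[deprelIdx],
		Parent: cols[parentIdx],
	}, true
}

func main() {
	// A made-up five-column line: word, lemma, POS, deprel, parent.
	tok, ok := parseToken("schools\tschool\tNOUN\tnsubj\t+2", 1, 2, 3, 4)
	fmt.Println(tok, ok)
}
```

Treating every line that starts with `<` as structural is a simplification; a real importer would also need to handle tokens that themselves begin with that character.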
### Import Profiles
Import profiles define the column structure of your vertical files. Predefined profiles include:
- **intercorp_v16ud**: InterCorp v16 with Universal Dependencies
- Add custom profiles in `storage/profiles.go`
Each profile specifies:
- Lemma column position
- POS tag column position
- Dependency relation column position
- Syntactic parent column position
- Text type mappings
- Custom deprel values
## Usage
### Data Import
Before searching, you need to import linguistic data into the database using the `mkscolldb` tool:
```bash
./mkscolldb [options] [vert_path] [db_path]
```
#### Import Options
- `-import-profile=NAME` - Use predefined corpus profile (e.g., "intercorp_v16ud")
- `-lemma-idx=2` - Column position of lemma in vertical file (default: 2)
- `-pos-idx=5` - Column position of POS tag (default: 5)
- `-parent-idx=12` - Column position of syntactic parent info (default: 12)
- `-deprel-idx=11` - Column position of dependency relation (default: 11)
- `-min-freq=20` - Minimal frequency of collocates to accept (default: 20)
- `-verbose` - Print detailed activity information (default: false)
- `-log-level=info` - Set logging level (debug, info, warn, error)
#### Import Examples
```bash
# Import using predefined profile
./mkscolldb -import-profile intercorp_v16ud -min-freq 10 /path/to/corpus.vert /path/to/database.db
# Import with custom column positions
./mkscolldb -lemma-idx 1 -pos-idx 3 -min-freq 5 /path/to/corpus.vert /path/to/database.db
# Import from directory of vertical files
./mkscolldb -import-profile intercorp_v16ud /path/to/corpus/dir/ /path/to/database.db
```
### Basic Search
```bash
./scollsrch [options] [db_path] [lemma] [pos] [text_type]
```
### Command Line Options
- `-limit` - Maximum number of matching items to show (default: 10)
- `-sort-by` - Sorting measure: `tscore`, `ldice`, `lmi`, or `rrf` (default: rrf)
- `-collocate-group-by-pos` - Group collocates by their POS tags
- `-collocate-group-by-deprel` - Group collocates by their dependency relations
- `-collocate-group-by-tt` - Group collocates by their text type
- `-json-out` - Output results in JSON format instead of tabular format
- `-repl` - Run in interactive read-eval-print loop mode (exit with CTRL+C)
- `-log-level` - Set logging level (debug, info, warn, error, default = info)
### Examples
```bash
# Basic search for collocations of "run"
./scollsrch /path/to/database.db run
# Search with POS filtering
./scollsrch /path/to/database.db run VERB
# Search with custom limits and sorting
./scollsrch -limit=20 -sort-by=ldice /path/to/database.db run VERB
# Search using LMI measure
./scollsrch -sort-by=lmi /path/to/database.db run VERB
# Search using RRF (default) - combines all measures
./scollsrch -sort-by=rrf /path/to/database.db run VERB
# JSON output for programmatic processing
./scollsrch -json-out /path/to/database.db run VERB
# Group results by POS and dependency relations
./scollsrch -collocate-group-by-pos -collocate-group-by-deprel /path/to/database.db run
# Interactive REPL mode
./scollsrch -repl /path/to/database.db
```
## Output Format
### Tabular Output (default)
```
registry  lemma      lemma props.  collocate  collocate props  T-Score  Log-Dice  LMI     RRF Score  mutual dist.
════════  ═════════  ════════════  ═════════  ═══════════════  ═══════  ════════  ══════  ═════════  ════════════
-         education  (nmod, -)     of         (-)              45.78    11.29     245.67  0.0821     1.10
-         education  (obj, -)      a          (-)              29.17    9.62      178.43  0.0734     1.10
-         education  (obj, -)      have       (-)              27.51    8.75      156.92  0.0687     -1.00
-         education  (nmod, -)     training   (-)              27.11    9.00      163.45  0.0701     2.00
```
### JSON Output (`-json-out`)
```json
{
  "lemma": {
    "value": "education",
    "pos": "",
    "deprel": "nmod"
  },
  "collocate": {
    "value": "of",
    "pos": "",
    "deprel": ""
  },
  "logDice": 11.28,
  "tScore": 45.78,
  "lmi": 245.67,
  "rrfScore": 0.0821,
  "mutualDist": 1.1,
  "textType": ""
}
// etc...
```
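For programmatic processing, a record like the one above can be decoded with Go's standard `encoding/json` package. The struct names `Item` and `Collocation` below are hypothetical and only mirror the JSON keys shown in the sample; the types exported by the library itself may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Item mirrors the "lemma" and "collocate" objects of the sample output.
type Item struct {
	Value  string `json:"value"`
	POS    string `json:"pos"`
	Deprel string `json:"deprel"`
}

// Collocation mirrors one JSON record as shown in the sample output.
type Collocation struct {
	Lemma      Item    `json:"lemma"`
	Collocate  Item    `json:"collocate"`
	LogDice    float64 `json:"logDice"`
	TScore     float64 `json:"tScore"`
	LMI        float64 `json:"lmi"`
	RRFScore   float64 `json:"rrfScore"`
	MutualDist float64 `json:"mutualDist"`
	TextType   string  `json:"textType"`
}

// decodeCollocation parses a single JSON record.
func decodeCollocation(data []byte) (Collocation, error) {
	var c Collocation
	err := json.Unmarshal(data, &c)
	return c, err
}

func main() {
	raw := []byte(`{"lemma":{"value":"education","pos":"","deprel":"nmod"},
		"collocate":{"value":"of","pos":"","deprel":""},
		"logDice":11.28,"tScore":45.78,"lmi":245.67,
		"rrfScore":0.0821,"mutualDist":1.1,"textType":""}`)
	c, err := decodeCollocation(raw)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s -> %s (logDice %.2f)\n", c.Lemma.Value, c.Collocate.Value, c.LogDice)
}
```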
## Statistical Measures
### T-Score
Measures the confidence of word association:
```
T-Score = (F(x,y) - F(x)*F(y)/N) / √F(x,y)
```
### Log-Dice
Measures the strength of association between words:
```
Log-Dice = 14.0 + log₂(2*F(x,y)/(F(x)+F(y)))
```
### LMI (Local Mutual Information)
Measures pointwise mutual information weighted by co-occurrence frequency:
```
LMI = F(x,y) * log₂(N * F(x,y) / (F(x) * F(y)))
```
### RRF (Reciprocal Rank Fusion)
Combines rankings from T-Score, Log-Dice, and LMI using reciprocal rank fusion for better overall ranking:
```
RRF_score = Σ(1 / (60 + rank_i))
```
Where:
- `F(x,y)` = co-occurrence frequency of `x` and `y`
- `F(x)`, `F(y)` = individual frequencies of `x` and `y`
- `N` = corpus size
- `rank_i` = rank of an item under the `i`-th measure
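The fusion step can be sketched as follows. This assumes ranks start at 1 and that higher values of each measure rank first; the README does not state either detail, and the `candidate` type, the helper names, and the scores are all hypothetical. Ties are resolved by input order here, which is a simplification.

```go
package main

import (
	"fmt"
	"sort"
)

// rrfK is the constant from the RRF formula above.
const rrfK = 60

// candidate is a hypothetical container for one collocation's scores.
type candidate struct {
	Name                 string
	TScore, LogDice, LMI float64
	RRF                  float64
}

// addRanks orders candidates by one measure (highest first) and adds
// 1/(rrfK + rank) to each candidate's RRF, with ranks starting at 1.
func addRanks(items []candidate, measure func(candidate) float64) {
	order := make([]int, len(items))
	for i := range order {
		order[i] = i
	}
	sort.SliceStable(order, func(a, b int) bool {
		return measure(items[order[a]]) > measure(items[order[b]])
	})
	for pos, idx := range order {
		items[idx].RRF += 1.0 / float64(rrfK+pos+1)
	}
}

// fuse recomputes the RRF score of every candidate from all three measures.
func fuse(items []candidate) {
	for i := range items {
		items[i].RRF = 0
	}
	addRanks(items, func(c candidate) float64 { return c.TScore })
	addRanks(items, func(c candidate) float64 { return c.LogDice })
	addRanks(items, func(c candidate) float64 { return c.LMI })
}

func main() {
	// Made-up scores for three collocates.
	items := []candidate{
		{Name: "a", TScore: 10, LogDice: 5, LMI: 100},
		{Name: "b", TScore: 8, LogDice: 7, LMI: 90},
		{Name: "c", TScore: 1, LogDice: 1, LMI: 1},
	}
	fuse(items)
	for _, c := range items {
		fmt.Printf("%s %.4f\n", c.Name, c.RRF)
	}
}
```

Because only ranks enter the sum, RRF is insensitive to the very different scales of the three measures, which is the usual motivation for using it as a combined ordering.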
## Database Schema
DeprelDB uses BadgerDB with highly optimized binary encoding for maximum performance:
- **Binary encoding**: collocation entries are encoded in 16-byte keys (9 bytes for single-lemma frequencies)
- **Frequency and node distance encoded in DB values**: 4 bytes for **frequency**, 1 byte for **distance** (0.1 precision; values from -12.7 to +12.7)
- **Efficient result grouping operations** - based on binary keys
- **Read-optimized**: Large block cache (512MB) and index cache (256MB) for fast queries
### Key Types
- **Metadata**: `0x01 + keyID` → JSON metadata (import profile, corpus info)
- **Lemma to ID**: `0x02 + lemma` → `tokenID`
- **Reverse index**: `0x03 + tokenID` → `lemma`
- **Token frequency**: `0x04 + tokenID + pos + textType + deprel` → `freq`
- **Collocation frequency**: `0x05 + [composite key]` → `freq + distance`
## Development
### Project Structure
```
depreldb/
├── cmd/
│   ├── mkscolldb/    # A utility for importing corpus vertical files
│   └── search/       # Search command-line interface with REPL mode
├── record/           # Data structures, binary encoding, and key generation
├── storage/          # BadgerDB storage layer
├── scoll/            # High-level interface for collocation search
└── dataimport/       # Data import logic
```
### Running Tests
```bash
# Run all tests
go test ./...
# Run specific package tests
go test ./storage -v
go test ./record -v
```