{"id":43954112,"url":"https://github.com/czcorpus/depreldb","last_synced_at":"2026-02-07T04:06:21.584Z","repository":{"id":302601952,"uuid":"1012320811","full_name":"czcorpus/depreldb","owner":"czcorpus","description":"A fast database for  UD dependency relations between lemmas","archived":false,"fork":false,"pushed_at":"2025-10-14T08:38:02.000Z","size":127,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-01-12T00:33:29.967Z","etag":null,"topics":["collocation-extraction","corpus-linguistics","corpus-processing","corpus-tools","data-retrieval","database","linguistics","universal-dependencies"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/czcorpus.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-07-02T06:47:23.000Z","updated_at":"2025-10-14T08:36:35.000Z","dependencies_parsed_at":"2025-07-03T08:53:48.780Z","dependency_job_id":"d3cf7ae1-69e6-439a-b8bc-b73b22cfb57f","html_url":"https://github.com/czcorpus/depreldb","commit_stats":null,"previous_names":["czcorpus/scollector"],"tags_count":29,"template":false,"template_full_name":null,"purl":"pkg:github/czcorpus/depreldb","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/czcorpus%2Fdepreldb","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/czcorpus%2Fdepreldb/tags","releases_url":"https://repos.ecosyste.ms/a
pi/v1/hosts/GitHub/repositories/czcorpus%2Fdepreldb/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/czcorpus%2Fdepreldb/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/czcorpus","download_url":"https://codeload.github.com/czcorpus/depreldb/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/czcorpus%2Fdepreldb/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29186091,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-07T03:35:06.566Z","status":"ssl_error","status_checked_at":"2026-02-07T03:34:57.604Z","response_time":63,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["collocation-extraction","corpus-linguistics","corpus-processing","corpus-tools","data-retrieval","database","linguistics","universal-dependencies"],"created_at":"2026-02-07T04:06:20.914Z","updated_at":"2026-02-07T04:06:21.579Z","avatar_url":"https://github.com/czcorpus.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# DeprelDB\n\nA high-performance Go-based **dependency-based collocation extraction and search library** for linguistic analysis. 
DeprelDB processes linguistic data to calculate statistical measures like T-Score, Log-Dice, and LMI (Local Mutual Information) for finding meaningful syntactic collocations between lemmas.\n\n## Features\n\n- **Fast collocation search** using BadgerDB with optimized read-only configurations\n- **High-performance storage**:\n  - **memory-efficient** binary key encoding and optimized grouping algorithms\n- **Statistical measures**: T-Score, Log-Dice, and LMI calculations with Reciprocal Rank Fusion (RRF) scoring\n- **Universal Dependencies support**: Full integration with UD POS tags and dependency relations\n- **Flexible querying**: Filter by lemma, POS tags, dependency relations, and text types\n- **Multiple output formats**: Tabular display or JSON output\n- **Large dataset optimized**: Handles multi-GB databases with intelligent caching\n- **REPL mode**: Interactive query session with CTRL+C support\n- **Can be used as a library**\n\n## Installation\n\n### Prerequisites\n\n- Go 1.23.4 or later\n\n### Building\n\n```bash\n# Clone the repository\ngit clone https://github.com/czcorpus/depreldb\ncd depreldb\n\n# Build the project\nmake all\n```\n\nThis will build:\n1. The `scollsrch` binary for querying databases\n2. The `mkscolldb` binary for data import\n\nAlternatively, build manually:\n```bash\ngo build -o scollsrch ./cmd/search\n```\n\n## Input Data Format\n\nDeprelDB expects linguistic data in **vertical format**, where each token is on a separate line with tab-separated attributes. Sentences are separated by `\u003cs\u003e` structures with possible xml-like attributes.\n\n\n\n### Import Profiles\n\nImport profiles define the column structure of your vertical files. 
Predefined profiles include:\n\n- **intercorp_v16ud**: InterCorp v16 with Universal Dependencies\n- Add custom profiles in `storage/profiles.go`\n\nEach profile specifies:\n- Lemma column position\n- POS tag column position\n- Dependency relation column position\n- Syntactic parent column position\n- Text type mappings\n- Custom deprel values\n\n## Usage\n\n### Data Import\n\nBefore searching, you need to import linguistic data into the database using the `mkscolldb` tool:\n\n```bash\n./mkscolldb [options] [vert_path] [db_path]\n```\n\n#### Import Options\n\n- `-import-profile=NAME` - Use predefined corpus profile (e.g., \"intercorp_v16ud\")\n- `-lemma-idx=2` - Column position of lemma in vertical file (default: 2)\n- `-pos-idx=5` - Column position of POS tag (default: 5)\n- `-parent-idx=12` - Column position of syntactic parent info (default: 12)\n- `-deprel-idx=11` - Column position of dependency relation (default: 11)\n- `-min-freq=20` - Minimal frequency of collocates to accept (default: 20)\n- `-verbose` - Print detailed activity information (default: false)\n- `-log-level=info` - Set logging level (debug, info, warn, error)\n\n#### Import Examples\n\n```bash\n# Import using predefined profile\n./mkscolldb -import-profile intercorp_v16ud -min-freq 10 /path/to/corpus.vert /path/to/database.db\n\n# Import with custom column positions\n./mkscolldb -lemma-idx 1 -pos-idx 3 -min-freq 5 /path/to/corpus.vert /path/to/database.db\n\n# Import from directory of vertical files\n./mkscolldb -import-profile intercorp_v16ud /path/to/corpus/dir/ /path/to/database.db\n```\n\n### Basic Search\n\n```bash\n./scollsrch [options] [db_path] [lemma] [pos] [text_type]\n```\n\n### Command Line Options\n\n- `-limit` - Maximum number of matching items to show (default: 10)\n- `-sort-by` - Sorting measure: `tscore`, `ldice`, `lmi`, or `rrf` (default: rrf)\n- `-collocate-group-by-pos` - Group collocates by their POS tags\n- `-collocate-group-by-deprel` - Group collocates by their 
dependency relations\n- `-collocate-group-by-tt` - Group collocates by their text type\n- `-json-out` - Output results in JSON format instead of tabular format\n- `-repl` - Run in interactive read-eval-print loop mode (exit with CTRL+C)\n- `-log-level` - Set logging level (debug, info, warn, error, default = info)\n\n### Examples\n\n```bash\n# Basic search for collocations of \"run\"\n./scollsrch /path/to/database.db run\n\n# Search with POS filtering\n./scollsrch /path/to/database.db run VERB\n\n# Search with custom limits and sorting\n./scollsrch -limit=20 -sort-by=ldice /path/to/database.db run VERB\n\n# Search using LMI measure\n./scollsrch -sort-by=lmi /path/to/database.db run VERB\n\n# Search using RRF (default) - combines all measures\n./scollsrch -sort-by=rrf /path/to/database.db run VERB\n\n# JSON output for programmatic processing\n./scollsrch -json-out /path/to/database.db run VERB\n\n# Group results by POS and dependency relations\n./scollsrch -collocate-group-by-pos -collocate-group-by-deprel /path/to/database.db run\n\n# Interactive REPL mode\n./scollsrch -repl /path/to/database.db\n```\n\n## Output Format\n\n\n### Tabular Output (default)\n```\nregistry  lemma      lemma props.   
collocate   collocate props  T-Score  Log-Dice  LMI     RRF Score  mutual dist.\n════════  ═════      ════════════   ═════════   ═══════════════  ═══════  ════════  ══════  ═════════  ════════════\n-         education  (nmod, -)      of          (-)               45.78    11.29     245.67  0.0821     1.10\n-         education  (obj, -)       a           (-)               29.17    9.62      178.43  0.0734     1.10\n-         education  (obj, -)       have        (-)               27.51    8.75      156.92  0.0687    -1.00\n-         education  (nmod, -)      training    (-)               27.11    9.00      163.45  0.0701     2.00\n```\n\n### JSON Output (`-json-out`)\n```json\n{\n  \"lemma\":{\n    \"value\":\"education\",\n    \"pos\":\"\",\n    \"deprel\":\"nmod\"\n  },\n  \"collocate\":{\n    \"value\":\"of\",\n    \"pos\":\"\",\n    \"deprel\":\"\"\n  },\n  \"logDice\":11.28,\n  \"tScore\":45.78,\n  \"lmi\":245.67,\n  \"rrfScore\":0.0821,\n  \"mutualDist\":1.1,\n  \"textType\":\"\"\n}\n// etc...\n\n```\n\n## Statistical Measures\n\n### T-Score\n\nMeasures the confidence of word association:\n```\nT-Score = (F(x,y) - F(x)*F(y)/N) / √F(x,y)\n```\n\n### Log-Dice\n\nMeasures the strength of association between words:\n```\nLog-Dice = 14.0 + log₂(2*F(x,y)/(F(x)+F(y)))\n```\n\n### LMI (Local Mutual Information)\n\nMeasures pointwise mutual information weighted by co-occurrence frequency:\n```\nLMI = F(x,y) * log₂(N * F(x,y) / (F(x) * F(y)))\n```\n\n### RRF (Reciprocal Rank Fusion)\n\nCombines rankings from T-Score, Log-Dice, and LMI using reciprocal rank fusion for better overall ranking:\n```\nRRF_score = Σ(1 / (60 + rank_i))\n```\n\nWhere:\n- `F(x,y)` = frequency of the co-occurrence of `x` and `y`\n- `F(x)`, `F(y)` = individual word frequencies\n- `N` = corpus size\n- `rank_i` = rank of an item under the `i`-th measure\n\n## Database Schema\n\nDeprelDB uses BadgerDB with highly optimized binary encoding for maximum performance:\n\n- **Binary encoding**: collocation 
entries encoded in 16-byte keys (9 bytes for single-lemma frequencies)\n- **Frequency and node distance encoded in DB values**\n  - 4 bytes for **frequency**, 1 byte for **distance encoding** (0.1 precision; values from -12.7 to +12.7)\n- **Efficient result grouping operations** - based on binary keys\n- **Read-optimized**: Large block cache (512MB) and index cache (256MB) for fast queries\n\n\n### Key Types\n- **Metadata**: `0x01 + keyID` → JSON metadata (import profile, corpus info)\n- **Lemma to ID**: `0x02 + lemma` → `tokenID`\n- **Reverse index**: `0x03 + tokenID` → `lemma`\n- **Token frequency**: `0x04 + tokenID + pos + textType + deprel` → `freq`\n- **Collocation frequency**: `0x05 + [composite key]` → `freq + distance`\n\n\n\n## Development\n\n### Project Structure\n\n```\ndepreldb/\n├── cmd/\n│   ├── mkscolldb/       # A utility for importing corpus vertical files\n│   └── search/          # Search command-line interface with REPL mode\n├── record/              # Data structures, binary encoding, and key generation\n├── storage/             # BadgerDB storage layer\n├── scoll/               # High-level interface for collocation search\n└── dataimport/          # Data import logic\n```\n\n### Running Tests\n\n```bash\n# Run all tests\ngo test ./...\n\n# Run specific package tests\ngo test ./storage -v\ngo test ./record -v\n```\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fczcorpus%2Fdepreldb","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fczcorpus%2Fdepreldb","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fczcorpus%2Fdepreldb/lists"}