{"id":37247292,"url":"https://github.com/georgeguimaraes/stephen","last_synced_at":"2026-02-27T01:34:42.608Z","repository":{"id":331964802,"uuid":"1132264058","full_name":"georgeguimaraes/stephen","owner":"georgeguimaraes","description":"ColBERT-style neural retrieval for Elixir","archived":false,"fork":false,"pushed_at":"2026-01-20T20:54:43.000Z","size":208,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-01-21T05:38:48.467Z","etag":null,"topics":["bert","colbert","elixir","embedding","nx","retrieval"],"latest_commit_sha":null,"homepage":"","language":"Elixir","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/georgeguimaraes.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-01-11T16:31:49.000Z","updated_at":"2026-01-20T20:54:46.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/georgeguimaraes/stephen","commit_stats":null,"previous_names":["georgeguimaraes/stephen"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/georgeguimaraes/stephen","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/georgeguimaraes%2Fstephen","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/georgeguimaraes%2Fstephen/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/georgeguimaraes%2Fstephen/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/georgeguimaraes%2Fstephen/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/georgeguimaraes","download_url":"https://codeload.github.com/georgeguimaraes/stephen/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/georgeguimaraes%2Fstephen/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28667308,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-22T14:01:31.714Z","status":"ssl_error","status_checked_at":"2026-01-22T13:59:23.143Z","response_time":144,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bert","colbert","elixir","embedding","nx","retrieval"],"created_at":"2026-01-15T13:00:27.048Z","updated_at":"2026-01-22T17:01:57.935Z","avatar_url":"https://github.com/georgeguimaraes.png","language":"Elixir","funding_links":[],"categories":["Machine Learning"],"sub_categories":["Vector Search \u0026 Similarity"],"readme":"# Stephen\n\n[![Run in Livebook](https://livebook.dev/badge/v1/blue.svg)](https://livebook.dev/run?url=https%3A%2F%2Fgithub.com%2Fgeorgeguimaraes%2Fstephen%2Fblob%2Fmain%2Flivebook%2Fgetting_started.livemd)\n\nColBERT-style neural retrieval for Elixir.\n\nStephen implements late interaction retrieval using per-token embeddings and MaxSim scoring. Instead of compressing text into a single vector, it keeps one embedding per token, enabling fine-grained semantic matching.\n\n## Installation\n\n```elixir\ndef deps do\n  [\n    {:stephen, \"~\u003e 0.1.0\"},\n    {:exla, \"~\u003e 0.9\"}  # optional, for GPU acceleration\n  ]\nend\n```\n\nFor GPU acceleration:\n\n```elixir\n# config/config.exs\nconfig :nx, default_backend: EXLA.Backend\n```\n\n## Quick Start\n\n```elixir\n# Load encoder (downloads model on first use)\n{:ok, encoder} = Stephen.load_encoder()\n\n# Create index and add documents\nindex = Stephen.new_index(encoder)\nindex = Stephen.index(encoder, index, [\n  {\"colbert\", \"Stephen Colbert hosted The Colbert Report before The Late Show\"},\n  {\"conan\", \"Conan O'Brien is known for his self-deprecating humor and tall hair\"},\n  {\"seth\", \"Seth Meyers was head writer at SNL before hosting Late Night\"}\n])\n\n# Search\nresults = Stephen.search(encoder, index, \"late night comedy\")\n# =\u003e [%{doc_id: \"colbert\", score: 15.2}, ...]\n\n# Save/load\n:ok = Stephen.save_index(index, \"my_index\")\n{:ok, index} = Stephen.load_index(\"my_index\")\n```\n\n## Why ColBERT?\n\nTraditional dense retrieval compresses each text into a single vector. ColBERT keeps per-token embeddings and matches query tokens to document tokens individually:\n\n1. Each token gets its own embedding\n2. For each query token, find the best-matching document token (MaxSim)\n3. Sum these maximum similarities\n\nThis captures nuanced relevance that single-vector methods miss.\n\n## Reranking\n\nUse Stephen to rerank candidates from a faster first-stage retriever:\n\n```elixir\n# From indexed documents\nresults = Stephen.rerank(encoder, index, \"query\", [\"doc1\", \"doc5\", \"doc12\"])\n\n# From raw text (no index needed)\ncandidates = [\n  {\"colbert\", \"Stephen Colbert interviews politicians with satirical wit\"},\n  {\"conan\", \"Conan O'Brien traveled the world for his travel show\"}\n]\nresults = Stephen.rerank_texts(encoder, \"political satire comedy\", candidates)\n```\n\n## Query Expansion (PRF)\n\nImprove recall with pseudo-relevance feedback:\n\n```elixir\nresults = Stephen.search_with_prf(encoder, index, \"late night hosts\")\n\n# Tune expansion parameters\nresults = Stephen.search_with_prf(encoder, index, query,\n  feedback_docs: 5,\n  expansion_tokens: 15,\n  expansion_weight: 0.3\n)\n```\n\nPRF uses top-ranked documents to expand the query with related terms, finding documents that may not match the exact query.\n\n## Debugging Scores\n\nUnderstand why a document scored the way it did:\n\n```elixir\nexplanation = Stephen.explain(encoder, \"satirical comedy\", \"Colbert is a satirical host\")\n\n# Print formatted explanation\nexplanation |\u003e Stephen.Scorer.format_explanation() |\u003e IO.puts()\n# Score: 15.20\n#\n# Query Token          -\u003e Doc Token            Similarity\n# --------------------------------------------------------\n# satirical            -\u003e satirical            0.95\n# comedy               -\u003e host                 0.72\n# ...\n```\n\nThis shows which query tokens matched which document tokens and their similarity scores.\n\n## Index Types\n\n| Index | Use Case |\n|-------|----------|\n| `Stephen.Index` | Small-medium collections, fast updates |\n| `Stephen.Plaid` | Larger collections, sub-linear search |\n| `Stephen.Index.Compressed` | Memory-constrained, 4-32x compression |\n\n## Documentation\n\nTry the interactive [Livebook tutorial](livebook/getting_started.livemd) to explore Stephen hands-on.\n\nSee the [guides](guides/) for detailed documentation:\n\n- [Architecture](guides/architecture.md): how ColBERT and Stephen work\n- [Index Types](guides/index_types.md): choosing and configuring indexes\n- [Compression](guides/compression.md): residual quantization for memory efficiency\n- [Chunking](guides/chunking.md): handling long documents\n- [Configuration](guides/configuration.md): encoder and index options\n- [Reranking](guides/reranking.md): using Stephen as a second-stage reranker\n- [Query Expansion](guides/prf.md): pseudo-relevance feedback for better recall\n- [Debugging Scores](guides/explain.md): understanding why documents scored the way they did\n\n## License\n\nCopyright (c) 2025 George Guimarães\n\nLicensed under the MIT License. See [LICENSE](LICENSE) for details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgeorgeguimaraes%2Fstephen","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgeorgeguimaraes%2Fstephen","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgeorgeguimaraes%2Fstephen/lists"}