{"id":50680325,"url":"https://github.com/hyparam/hypgrep","last_synced_at":"2026-06-08T18:03:58.455Z","repository":{"id":360156321,"uuid":"1101656629","full_name":"hyparam/hypgrep","owner":"hyparam","description":"Full Text Search for Parquet","archived":false,"fork":false,"pushed_at":"2026-05-25T07:45:55.000Z","size":240,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2026-05-25T09:26:27.606Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hyparam.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-11-22T02:12:46.000Z","updated_at":"2026-05-25T07:45:58.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/hyparam/hypgrep","commit_stats":null,"previous_names":["hyparam/hypgrep"],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/hyparam/hypgrep","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hyparam%2Fhypgrep","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hyparam%2Fhypgrep/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hyparam%2Fhypgrep/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hyparam%2Fhypgrep/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hyparam","download_url":"https://codeload.github.com/hyparam/hypgrep/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hyparam%2Fhypgrep/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34073817,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-08T02:00:07.615Z","response_time":111,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-06-08T18:03:57.570Z","updated_at":"2026-06-08T18:03:58.449Z","avatar_url":"https://github.com/hyparam.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# HypGrep\n\n[![mit license](https://img.shields.io/badge/License-MIT-orange.svg)](https://opensource.org/licenses/MIT)\n![coverage](https://img.shields.io/badge/Coverage-95-darkred)\n\nBuild a compact n-gram search index for a Parquet file using [`hyparquet`](https://github.com/hyparam/hyparquet) and [`hyparquet-writer`](https://github.com/hyparam/hyparquet-writer). Queries are case-insensitive substring matches — grep semantics over a precomputed index.\n\n## Why?\n\nEnable efficient grep-style search on large Parquet datasets from any client without a server. Store your Parquet dataset on S3, generate a compact index file, and query it directly from a browser or other clients using HTTP range requests. The index tells you exactly which row blocks to fetch, so you only download the data you need.\n\nPerfect for serverless architectures where you want to offer search capabilities without managing infrastructure.\n\n## CLI usage\n\nBuild an index:\n\n```bash\nnpx hypgrep dataset.parquet [dataset.index.parquet]\n```\n\nGrep against the indexed file:\n\n```bash\nnpx hypgrep search dataset.parquet 'serverless'          # literal substring\nnpx hypgrep search dataset.parquet '/eigen.+value/i'      # regex\nnpx hypgrep search dataset.parquet 'rhythm' --limit 5     # first N matches\nnpx hypgrep search dataset.parquet 'rhythm' -c            # count only\nnpx hypgrep search dataset.parquet 'rhythm' -i            # case-insensitive literal\n```\n\nTo install as a system-wide CLI tool:\n\n```bash\nnpm install -g hypgrep\nhypgrep search dataset.parquet 'pattern'\n```\n\n## Find rows in a parquet file in JavaScript\n\nUse `parquetFind` to find rows containing the query as a substring while preserving natural row order (like Ctrl+F):\n\n```javascript\nimport { parquetFind } from 'hypgrep'\n\nfor await (const row of parquetFind({\n  query: 'serverless',\n  url: 'https://s3.hyperparam.app/hypgrep/wiki_en.parquet',\n})) {\n  console.log(row) // { title: '...', text: '...' }\n}\n```\n\nThe query matches as a contiguous substring (grep semantics): `'speed of light'` matches rows containing that exact phrase, not rows where the words merely co-occur. Queries shorter than the indexed n-gram length (default 5) fall back to a full scan but still return correct results.\n\n### Regex queries\n\nPass a `RegExp` directly — mandatory literals are extracted from the pattern for index pruning, and `regex.test` runs against each row:\n\n```javascript\nfor await (const row of parquetFind({\n  query: /eigen\\w*value/i,\n  url: '...',\n})) ...\n```\n\nIf the regex has no extractable literal (e.g. `/./`, `/foo|bar/`), the index can't prune and HypGrep does a full scan. The substring/regex filter still applies — results are correct, just unaccelerated.\n\nIf you want full control over the row predicate (e.g. a custom JS function), pass `rowFilter`. The string `query` is still used for index pruning while the callback decides which rows to keep:\n\n```javascript\nfor await (const row of parquetFind({\n  query: 'eigen',\n  rowFilter: row =\u003e myCustomCheck(row),\n  url: '...',\n})) ...\n```\n\n## Ranked search\n\nUse `parquetSearch` for Google-style ranked search: whitespace-separated words are ANDed (every word must appear), and results are ranked by total occurrence count:\n\n```javascript\nimport { parquetSearch } from 'hypgrep'\n\nfor await (const row of parquetSearch({\n  query: 'serverless',\n  url: 'https://s3.hyperparam.app/hypgrep/wiki_en.parquet',\n})) {\n  console.log(row) // most matches first\n}\n```\n\n## Create an index in JavaScript\n\n```javascript\nimport { asyncBufferFromFile } from 'hyparquet'\nimport { fileWriter } from 'hyparquet-writer'\nimport { createIndex } from 'hypgrep'\n\n// Generate dataset.index.parquet from dataset.parquet\nconst sourceFile = await asyncBufferFromFile('dataset.parquet')\nconst indexFile = fileWriter('dataset.index.parquet')\nawait createIndex({ sourceFile, indexFile })\n```\n\n## Local parquet files\n\nTo search against local parquet files, provide an `asyncBufferFactory` that loads the file from the local filesystem:\n\n```js\nimport { asyncBufferFromFile } from 'hyparquet'\nimport { parquetFind } from 'hypgrep'\n\n// Loads parquet file from local filesystem\nfunction asyncBufferFactory({ url }) {\n  return asyncBufferFromFile(url)\n}\n\nfor await (const row of parquetFind({\n  query: 'serverless',\n  url: 'dataset.parquet',\n  asyncBufferFactory,\n})) {\n  console.log(row)\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhyparam%2Fhypgrep","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhyparam%2Fhypgrep","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhyparam%2Fhypgrep/lists"}