https://github.com/hyparam/hypgrep
Full Text Search for Parquet
- Host: GitHub
- URL: https://github.com/hyparam/hypgrep
- Owner: hyparam
- License: mit
- Created: 2025-11-22T02:12:46.000Z (7 months ago)
- Default Branch: master
- Last Pushed: 2026-05-25T07:45:55.000Z (about 1 month ago)
- Last Synced: 2026-05-25T09:26:27.606Z (about 1 month ago)
- Language: JavaScript
- Homepage:
- Size: 234 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# HypGrep
[License: MIT](https://opensource.org/licenses/MIT)

Build a compact n-gram search index for a Parquet file using [`hyparquet`](https://github.com/hyparam/hyparquet) and [`hyparquet-writer`](https://github.com/hyparam/hyparquet-writer). Queries are case-insensitive substring matches — grep semantics over a precomputed index.
## Why?
Enable efficient grep-style search on large Parquet datasets from any client without a server. Store your Parquet dataset on S3, generate a compact index file, and query it directly from a browser or other clients using HTTP range requests. The index tells you exactly which row blocks to fetch, so you only download the data you need.
Perfect for serverless architectures where you want to offer search capabilities without managing infrastructure.
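As a rough illustration of the range-request pattern described above, the sketch below turns a list of matching row blocks into HTTP `Range` headers. The `{ start, end }` block shape and the `coalesce`, `rangeHeader`, and `fetchRange` helpers are hypothetical names made up for this example; they are not part of HypGrep's API, and the library's real index layout may differ.

```javascript
// Hypothetical shape: each block is { start, end } in bytes, end exclusive.
// Merge blocks that touch or overlap so fewer requests are needed.
function coalesce(blocks) {
  const sorted = [...blocks].sort((a, b) => a.start - b.start)
  const merged = []
  for (const { start, end } of sorted) {
    const last = merged[merged.length - 1]
    if (last && start <= last.end) {
      last.end = Math.max(last.end, end)
    } else {
      merged.push({ start, end })
    }
  }
  return merged
}

// HTTP Range headers use inclusive end offsets, hence end - 1.
function rangeHeader({ start, end }) {
  return `bytes=${start}-${end - 1}`
}

// Fetch one byte range from a server that supports range requests (S3 does).
async function fetchRange(url, block) {
  const res = await fetch(url, { headers: { Range: rangeHeader(block) } })
  if (res.status !== 206) {
    throw new Error(`expected 206 Partial Content, got ${res.status}`)
  }
  return res.arrayBuffer()
}
```

Only the bytes covered by the merged ranges are downloaded, which is what makes the approach workable without a search server.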
## CLI usage
Build an index:
```bash
npx hypgrep dataset.parquet [dataset.index.parquet]
```
Grep against the indexed file:
```bash
npx hypgrep search dataset.parquet 'serverless' # literal substring
npx hypgrep search dataset.parquet '/eigen.+value/i' # regex
npx hypgrep search dataset.parquet 'rhythm' --limit 5 # first N matches
npx hypgrep search dataset.parquet 'rhythm' -c # count only
npx hypgrep search dataset.parquet 'rhythm' -i # case-insensitive literal
```
To install as a system-wide CLI tool:
```bash
npm install -g hypgrep
hypgrep search dataset.parquet 'pattern'
```
## Find rows in a Parquet file in JavaScript
Use `parquetFind` to find rows containing the query as a substring while preserving natural row order (like Ctrl+F):
```javascript
import { parquetFind } from 'hypgrep'
for await (const row of parquetFind({
  query: 'serverless',
  url: 'https://s3.hyperparam.app/hypgrep/wiki_en.parquet',
})) {
  console.log(row) // { title: '...', text: '...' }
}
```
The query matches as a contiguous substring (grep semantics): `'speed of light'` matches rows containing that exact phrase, not rows where the words merely co-occur. Queries shorter than the indexed n-gram length (default 5) fall back to a full scan but still return correct results.
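To make the n-gram idea concrete, here is a minimal in-memory sketch of how such an index can prune candidates and why short queries force a full scan. It is a toy model under stated assumptions (plain strings stand in for row blocks, and `ngrams`, `buildIndex`, `candidateBlocks`, and `find` are hypothetical helper names); it does not reflect HypGrep's on-disk index format.

```javascript
// Illustrative only: a toy in-memory n-gram index, not HypGrep's actual format.
const N = 5 // same as the default n-gram length mentioned above

// Split lowercased text into the set of overlapping n-character windows.
function ngrams(text, n = N) {
  const lower = text.toLowerCase()
  const grams = new Set()
  for (let i = 0; i + n <= lower.length; i++) {
    grams.add(lower.slice(i, i + n))
  }
  return grams
}

// Map each n-gram to the set of block ids whose text contains it.
function buildIndex(blocks, n = N) {
  const index = new Map()
  blocks.forEach((text, blockId) => {
    for (const gram of ngrams(text, n)) {
      if (!index.has(gram)) index.set(gram, new Set())
      index.get(gram).add(blockId)
    }
  })
  return index
}

// Return candidate block ids, or null when the query is too short to prune.
function candidateBlocks(index, query, n = N) {
  if (query.toLowerCase().length < n) return null // caller must scan everything
  let candidates = null
  for (const gram of ngrams(query, n)) {
    const ids = index.get(gram) ?? new Set()
    candidates = candidates === null
      ? new Set(ids)
      : new Set([...candidates].filter(id => ids.has(id)))
  }
  return candidates
}

// Candidates are a superset of true matches, so each one is verified.
function find(blocks, index, query, n = N) {
  const ids = candidateBlocks(index, query, n) ?? blocks.map((_, i) => i)
  const needle = query.toLowerCase()
  return [...ids]
    .sort((a, b) => a - b)
    .filter(id => blocks[id].toLowerCase().includes(needle))
}
```

Intersecting the block sets of every n-gram in the query narrows the search, and the final `includes` check enforces the contiguous-substring rule, since a block can contain all the n-grams without containing the phrase itself.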
### Regex queries
Pass a `RegExp` directly — mandatory literals are extracted from the pattern for index pruning, and `regex.test` runs against each row:
```javascript
import { parquetFind } from 'hypgrep'
for await (const row of parquetFind({
  query: /eigen\w*value/i,
  url: '...',
})) {
  console.log(row) // handle each matching row
}
```
If the regex has no extractable literal (e.g. `/./`, `/foo|bar/`), the index can't prune and HypGrep does a full scan. The substring/regex filter still applies — results are correct, just unaccelerated.
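The notion of a "mandatory literal" can be sketched with a deliberately conservative extractor. The function below is a simplified illustration, not HypGrep's implementation: `mandatoryLiterals` and its `minLength` parameter are hypothetical, and the parser gives up (returns an empty list, meaning "full scan") on anything it does not understand, such as alternation or groups.

```javascript
// Toy extractor, NOT HypGrep's implementation. Returns literal runs that every
// match must contain; returns [] whenever unsure, which means "do a full scan".
function mandatoryLiterals(regex, minLength = 5) {
  const src = regex.source
  // Alternation, groups, non-ASCII text, and the v flag need a real parser.
  if (/[|()]|[^\x00-\x7f]/.test(src) || regex.unicodeSets) return []

  const literals = []
  let run = ''
  const flush = () => {
    if (run.length >= minLength) literals.push(run)
    run = ''
  }

  for (let i = 0; i < src.length; i++) {
    const ch = src[i]
    if (ch === '\\') {
      const next = src[++i]
      if (next === undefined) return []
      if ('wWdDsSbB'.includes(next)) flush() // class or boundary breaks the run
      else if (/[^A-Za-z0-9]/.test(next)) run += next // escaped punctuation is literal
      else return [] // \n, \u..., backreferences, etc.: give up
    } else if (ch === '[') {
      flush()
      i++ // skip the whole character class
      while (i < src.length && src[i] !== ']') i += src[i] === '\\' ? 2 : 1
      if (i >= src.length) return []
    } else if (ch === '*' || ch === '?' || ch === '{') {
      run = run.slice(0, -1) // the previous character may be absent
      flush()
      if (ch === '{') {
        i = src.indexOf('}', i)
        if (i === -1) return []
      }
    } else if ('+.^$'.includes(ch)) {
      flush() // `x+` still requires x once, so the run up to x is kept
    } else {
      run += ch
    }
  }
  flush()
  return regex.ignoreCase ? literals.map(s => s.toLowerCase()) : literals
}
```

With this sketch, `/eigen\w*value/i` yields `['eigen', 'value']`, while `/./` and `/foo|bar/` yield `[]`. Returning too few literals only costs speed; returning a wrong literal would drop valid rows, which is why the sketch bails out early.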
If you want full control over the row predicate (e.g. a custom JS function), pass `rowFilter`. The string `query` is still used for index pruning while the callback decides which rows to keep:
```javascript
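import { parquetFind } from 'hypgrep'
for await (const row of parquetFind({
  query: 'eigen',
  rowFilter: row => myCustomCheck(row), // myCustomCheck is your own (row) => boolean
  url: '...',
})) {
  console.log(row) // handle each row the callback kept
}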
```
## Ranked search
Use `parquetSearch` for Google-style ranked search: whitespace-separated words are ANDed (every word must appear), and results are ranked by total occurrence count:
```javascript
import { parquetSearch } from 'hypgrep'
for await (const row of parquetSearch({
query: 'serverless',
url: 'https://s3.hyperparam.app/hypgrep/wiki_en.parquet',
})) {
console.log(row) // most matches first
}
```
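The ranking rule (every word must appear, ordered by total occurrence count) can be sketched over plain strings as follows. This is an assumption-laden illustration rather than HypGrep's code: `countOccurrences` and `rankRows` are hypothetical helpers, matching is assumed to be case-insensitive, and occurrences are counted without overlap.

```javascript
// Count non-overlapping, case-insensitive occurrences of `word` in `text`.
function countOccurrences(text, word) {
  const haystack = text.toLowerCase()
  const needle = word.toLowerCase()
  let count = 0
  let pos = haystack.indexOf(needle)
  while (pos !== -1) {
    count++
    pos = haystack.indexOf(needle, pos + needle.length)
  }
  return count
}

// Keep rows where every query word appears, then sort by total occurrences.
function rankRows(rows, query) {
  const words = query.split(/\s+/).filter(Boolean)
  if (words.length === 0) return []
  return rows
    .map(text => ({ text, counts: words.map(w => countOccurrences(text, w)) }))
    .filter(({ counts }) => counts.every(c => c > 0)) // AND semantics
    .map(({ text, counts }) => ({ text, score: counts.reduce((a, b) => a + b, 0) }))
    .sort((a, b) => b.score - a.score) // most matches first
}
```

For the query `'serverless search'`, a row containing only one of the two words is dropped, and a row mentioning both words twice ranks above a row mentioning each once.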
## Create an index in JavaScript
```javascript
import { asyncBufferFromFile } from 'hyparquet'
import { fileWriter } from 'hyparquet-writer'
import { createIndex } from 'hypgrep'
// Generate dataset.index.parquet from dataset.parquet
const sourceFile = await asyncBufferFromFile('dataset.parquet')
const indexFile = fileWriter('dataset.index.parquet')
await createIndex({ sourceFile, indexFile })
```
## Local Parquet files
To search local Parquet files, provide an `asyncBufferFactory` that loads the file from the local filesystem:
```js
import { asyncBufferFromFile } from 'hyparquet'
import { parquetFind } from 'hypgrep'
// Loads the Parquet file from the local filesystem
function asyncBufferFactory({ url }) {
  return asyncBufferFromFile(url)
}
for await (const row of parquetFind({
  query: 'serverless',
  url: 'dataset.parquet',
  asyncBufferFactory,
})) {
  console.log(row)
}
```