https://github.com/hyparam/hypgrep

Full Text Search for Parquet

Host: GitHub
URL: https://github.com/hyparam/hypgrep
Owner: hyparam
License: mit
Created: 2025-11-22T02:12:46.000Z (7 months ago)
Default Branch: master
Last Pushed: 2026-05-25T07:45:55.000Z (about 1 month ago)
Last Synced: 2026-05-25T09:26:27.606Z (about 1 month ago)
Language: JavaScript
Homepage:
Size: 234 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE


# HypGrep

[![mit license](https://img.shields.io/badge/License-MIT-orange.svg)](https://opensource.org/licenses/MIT)

![coverage](https://img.shields.io/badge/Coverage-95-darkred)

Build a compact n-gram search index for a Parquet file using [`hyparquet`](https://github.com/hyparam/hyparquet) and [`hyparquet-writer`](https://github.com/hyparam/hyparquet-writer). Queries are case-insensitive substring matches — grep semantics over a precomputed index.

## Why?

Enable efficient grep-style search on large Parquet datasets from any client without a server. Store your Parquet dataset on S3, generate a compact index file, and query it directly from a browser or other clients using HTTP range requests. The index tells you exactly which row blocks to fetch, so you only download the data you need.
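As a rough illustration of the range-request idea, the sketch below fetches only a byte window of a remote file. `rangeHeader` and `fetchRange` are hypothetical helpers written for this example and are not part of HypGrep's API; the byte offsets would come from whatever the index reports.

```javascript
// Hypothetical helper: build the Range header value for bytes [start, end).
// HTTP byte ranges are inclusive on both ends, hence `end - 1`.
function rangeHeader(start, end) {
  if (!Number.isInteger(start) || !Number.isInteger(end) || start < 0 || end <= start) {
    throw new RangeError(`invalid byte range: ${start}-${end}`)
  }
  return `bytes=${start}-${end - 1}`
}

// Hypothetical helper: download only the requested bytes of a remote file.
async function fetchRange(url, start, end) {
  const res = await fetch(url, { headers: { Range: rangeHeader(start, end) } })
  // 206 Partial Content means the server honored the range;
  // a plain 200 would mean it sent the whole file instead.
  if (res.status !== 206) {
    throw new Error(`expected 206 Partial Content, got ${res.status}`)
  }
  return res.arrayBuffer()
}
```

Static hosts such as S3 generally honor `Range` headers, which is what makes this approach workable without a dedicated search server.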

Perfect for serverless architectures where you want to offer search capabilities without managing infrastructure.

## CLI usage

Build an index:

```bash
npx hypgrep dataset.parquet [dataset.index.parquet]
```

Grep against the indexed file:

```bash
npx hypgrep search dataset.parquet 'serverless'          # literal substring
npx hypgrep search dataset.parquet '/eigen.+value/i'     # regex
npx hypgrep search dataset.parquet 'rhythm' --limit 5    # first N matches
npx hypgrep search dataset.parquet 'rhythm' -c           # count only
npx hypgrep search dataset.parquet 'rhythm' -i           # case-insensitive literal
```

To install as a system-wide CLI tool:

```bash
npm install -g hypgrep
hypgrep search dataset.parquet 'pattern'
```

## Find rows in a parquet file in JavaScript

Use `parquetFind` to find rows containing the query as a substring while preserving natural row order (like Ctrl+F):

```javascript
import { parquetFind } from 'hypgrep'

for await (const row of parquetFind({
  query: 'serverless',
  url: 'https://s3.hyperparam.app/hypgrep/wiki_en.parquet',
})) {
  console.log(row) // { title: '...', text: '...' }
}
```

The query matches as a contiguous substring (grep semantics): `'speed of light'` matches rows containing that exact phrase, not rows where the words merely co-occur. Queries shorter than the indexed n-gram length (default 5) fall back to a full scan but still return correct results.
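To make the pruning idea concrete, here is a minimal in-memory sketch of n-gram indexing. It is not HypGrep's implementation or on-disk format: `ngrams`, `buildIndex`, `candidateBlocks`, and `find` are hypothetical names, "blocks" are plain arrays of strings, and lowercasing is an assumption made to get case-insensitive matching.

```javascript
// Split text into overlapping n-grams, lowercased for case-insensitive lookup.
function ngrams(text, n = 5) {
  const lower = text.toLowerCase()
  const grams = new Set()
  for (let i = 0; i + n <= lower.length; i++) grams.add(lower.slice(i, i + n))
  return grams
}

// Toy index: n-gram -> set of block ids whose rows contain that n-gram.
function buildIndex(blocks, n = 5) {
  const index = new Map()
  blocks.forEach((rows, blockId) => {
    for (const row of rows) {
      for (const gram of ngrams(row, n)) {
        if (!index.has(gram)) index.set(gram, new Set())
        index.get(gram).add(blockId)
      }
    }
  })
  return index
}

// Blocks that contain every n-gram of the query, or undefined when the
// query is shorter than n and the index cannot prune anything.
function candidateBlocks(index, query, n = 5) {
  const grams = [...ngrams(query, n)]
  if (grams.length === 0) return undefined
  let candidates
  for (const gram of grams) {
    const blocks = index.get(gram) ?? new Set()
    candidates = candidates === undefined
      ? new Set(blocks)
      : new Set([...candidates].filter(id => blocks.has(id)))
  }
  return [...candidates].sort((a, b) => a - b)
}

// Scan only candidate blocks (or all blocks on fallback) and apply the
// real substring test, so results are correct either way.
function find(blocks, index, query, n = 5) {
  const ids = candidateBlocks(index, query, n) ?? blocks.map((_, i) => i)
  const needle = query.toLowerCase()
  return ids.flatMap(id => blocks[id].filter(row => row.toLowerCase().includes(needle)))
}
```

The index only narrows the set of blocks to read. A block can contain every n-gram of the query without containing the phrase itself, which is why the final substring check on each row is still needed.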

### Regex queries

Pass a `RegExp` directly — mandatory literals are extracted from the pattern for index pruning, and `regex.test` runs against each row:

```javascript
import { parquetFind } from 'hypgrep'

for await (const row of parquetFind({
  query: /eigen\w*value/i,
  url: '...',
})) {
  console.log(row)
}
```

If the regex has no extractable literal (e.g. `/./`, `/foo|bar/`), the index can't prune and HypGrep does a full scan. The substring/regex filter still applies — results are correct, just unaccelerated.
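The sketch below shows one conservative way to pull mandatory literals out of a pattern. It is an assumption about the general technique, not HypGrep's actual extractor: `mandatoryLiterals` is a hypothetical helper, and it simply gives up on alternation, groups, character classes, and counted repeats.

```javascript
// Hypothetical helper: return literal substrings every match must contain.
// An empty array means "no usable literal", so the caller must do a full scan.
function mandatoryLiterals(regex) {
  const src = regex.source
  // Out of scope for this sketch: alternation, groups, classes, {m,n} repeats.
  if (/[|()[\]{}]/.test(src)) return []

  const literals = []
  let run = ''
  const flush = () => {
    if (run) literals.push(run)
    run = ''
  }

  for (let i = 0; i < src.length; i++) {
    const ch = src[i]
    if (ch === '\\') {
      const next = src[++i]
      if (next !== undefined && /[^A-Za-z0-9]/.test(next)) {
        run += next // escaped punctuation such as \. is a literal character
      } else {
        flush() // class escapes such as \w or \d end the literal run
      }
    } else if (ch === '*' || ch === '?') {
      run = run.slice(0, -1) // the preceding character is optional
      flush()
    } else if (ch === '+' || ch === '.' || ch === '^' || ch === '$') {
      flush() // the run so far is still mandatory, but it ends here
    } else {
      run += ch
    }
  }
  flush()
  // With the i flag, lowercase so literals can be compared case-insensitively.
  return regex.ignoreCase ? literals.map(s => s.toLowerCase()) : literals
}
```

For `/eigen\w*value/i` this yields `eigen` and `value`, either of which could be looked up in an n-gram index; for `/./` and `/foo|bar/` it yields nothing, matching the full-scan cases described above.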

If you want full control over the row predicate (e.g. a custom JS function), pass `rowFilter`. The string `query` is still used for index pruning while the callback decides which rows to keep:

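```javascript
import { parquetFind } from 'hypgrep'

for await (const row of parquetFind({
  query: 'eigen',
  rowFilter: row => myCustomCheck(row), // myCustomCheck is a placeholder for your own predicate
  url: '...',
})) {
  console.log(row)
}
```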

## Ranked search

Use `parquetSearch` for Google-style ranked search: whitespace-separated words are ANDed (every word must appear), and results are ranked by total occurrence count:

```javascript
import { parquetSearch } from 'hypgrep'

for await (const row of parquetSearch({
  query: 'serverless',
  url: 'https://s3.hyperparam.app/hypgrep/wiki_en.parquet',
})) {
  console.log(row) // most matches first
}
```
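The ranking rule described above (AND across words, ordered by total occurrence count) can be sketched over plain in-memory rows. This is an illustration under assumptions, not `parquetSearch` itself: `countOccurrences` and `rankRows` are hypothetical helpers, matching is lowercased, and all column values of a row are joined into one string.

```javascript
// Count non-overlapping occurrences of `word` in `text`.
function countOccurrences(text, word) {
  let count = 0
  let pos = text.indexOf(word)
  while (pos !== -1) {
    count++
    pos = text.indexOf(word, pos + word.length)
  }
  return count
}

// Keep rows containing every query word, ordered by total occurrence count.
function rankRows(rows, query) {
  const words = query.toLowerCase().split(/\s+/).filter(Boolean)
  if (words.length === 0) return []

  const scored = []
  for (const row of rows) {
    const text = Object.values(row).join(' ').toLowerCase()
    const counts = words.map(word => countOccurrences(text, word))
    // AND semantics: drop the row if any word is missing.
    if (counts.every(c => c > 0)) {
      scored.push({ row, score: counts.reduce((a, b) => a + b, 0) })
    }
  }
  // Highest total count first; ties keep their original row order.
  return scored.sort((a, b) => b.score - a.score).map(s => s.row)
}
```

Unlike `parquetFind`, this kind of search cannot stream results in natural row order, because the best-scoring row may be the last one read.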

## Create an index in JavaScript

```javascript
import { asyncBufferFromFile } from 'hyparquet'
import { fileWriter } from 'hyparquet-writer'
import { createIndex } from 'hypgrep'

// Generate dataset.index.parquet from dataset.parquet
const sourceFile = await asyncBufferFromFile('dataset.parquet')
const indexFile = fileWriter('dataset.index.parquet')
await createIndex({ sourceFile, indexFile })
```

## Local parquet files

To search against local parquet files, provide an `asyncBufferFactory` that loads the file from the local filesystem:

```js
import { asyncBufferFromFile } from 'hyparquet'
import { parquetFind } from 'hypgrep'

// Loads parquet file from local filesystem
function asyncBufferFactory({ url }) {
  return asyncBufferFromFile(url)
}

for await (const row of parquetFind({
  query: 'serverless',
  url: 'dataset.parquet',
  asyncBufferFactory,
})) {
  console.log(row)
}
```
