https://github.com/ahcm/tantivy_warc_indexer

builds a tantivy index from common crawl warc.wet files

# tantivy_warc_indexer

tantivy_warc_indexer builds a [tantivy](https://github.com/tantivy-search/tantivy) index from Common Crawl `warc.wet` files and PubMed Entrez articles.

## Build
Install Rust (e.g. via [rustup](https://rustup.rs)), then run:
```
make
```
## Usage
```
./target/release/tantivy_warc_indexer --help
WARC Indexer

Usage:
warc_parser [-t ] [--from ] [--to ] -s  
warc_parser (-h | --help)

Options:
-h --help Show this help
-s type of source files (WARC or ENTREZ or WIKIPEDIA_ABSTRACT)
-t number of threads to use, default 4
--from skip files until from
--to skip files after to
```

## Run

The first argument is the directory of an empty index you created (e.g. with tantivy-cli), and the second is the path to the directory containing the Common Crawl `warc.wet` or `warc.wet.gz` files.
Depending on your system, indexing might take a few days or weeks.
```
./target/release/tantivy_warc_indexer -s WARC ../common_crawl_tantivy_index ../wet
```
To create the empty index beforehand:
```
mkdir ../common_crawl_tantivy_index
cp template/meta.json ../common_crawl_tantivy_index/
```
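
The two steps above have to happen in order: create the empty index first, then run the indexer. A minimal sketch of that sequence as a small shell helper follows. The variable names (`INDEX_DIR`, `WET_DIR`, `THREADS`, `TEMPLATE`, `INDEXER`) and the function names are illustrative assumptions, not part of tantivy_warc_indexer. Only the `-t` and `-s` flags, the two positional arguments, and `template/meta.json` come from this README.

```shell
#!/bin/sh
# Sketch only: all variable and function names here are hypothetical helpers,
# not options or commands provided by tantivy_warc_indexer.
set -eu

INDEX_DIR="${INDEX_DIR:-../common_crawl_tantivy_index}"
WET_DIR="${WET_DIR:-../wet}"
THREADS="${THREADS:-4}"                      # 4 is the default shown by --help
TEMPLATE="${TEMPLATE:-template/meta.json}"   # template shipped in this repository
INDEXER="${INDEXER:-./target/release/tantivy_warc_indexer}"

# Step 1: create the empty index directory and copy the template meta.json.
prepare_index() {
    mkdir -p "$INDEX_DIR"
    cp "$TEMPLATE" "$INDEX_DIR/"
}

# Step 2: index the WET files with the chosen number of threads.
run_indexer() {
    "$INDEXER" -t "$THREADS" -s WARC "$INDEX_DIR" "$WET_DIR"
}
```

After sourcing or pasting the functions, `prepare_index && run_indexer` runs both steps from the repository root. Setting `THREADS=8` in the environment beforehand would pass `-t 8` instead of the default.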

Best
Andreas