https://github.com/ahcm/tantivy_warc_indexer
builds a tantivy index from common crawl warc.wet files
- Host: GitHub
- URL: https://github.com/ahcm/tantivy_warc_indexer
- Owner: ahcm
- Created: 2021-05-04T14:55:04.000Z (about 4 years ago)
- Default Branch: master
- Last Pushed: 2024-06-16T23:58:16.000Z (11 months ago)
- Last Synced: 2024-11-28T14:18:19.857Z (6 months ago)
- Topics: commoncrawl, index, search, tantivy
- Language: Rust
- Homepage:
- Size: 19.5 KB
- Stars: 10
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# tantivy_warc_indexer
tantivy_warc_indexer builds a [tantivy](https://github.com/tantivy-search/tantivy) index from Common Crawl warc.wet files and PubMed Entrez articles.
## Build
Install Rust (e.g. via [rustup](https://rustup.rs)).
```
make
```
## Usage
```
./target/release/tantivy_warc_indexer --help
WARC Indexer

Usage:
  warc_parser [-t ] [--from ] [--to ] -s
  warc_parser (-h | --help)

Options:
  -h --help  Show this help
  -s         type of source files (WARC or ENTREZ or WIKIPEDIA_ABSTRACT)
  -t         number of threads to use, default 4
  --from     skip files until from
  --to       skip files after to
```
## Run
The first positional argument is the directory of an empty index you created (e.g. with tantivy-cli), and the second is the path to the directory containing the Common Crawl warc.wet or warc.wet.gz files.
Depending on your system, indexing might take a few days or weeks.
```
./target/release/tantivy_warc_indexer -s WARC ../common_crawl_tantivy_index ../wet
```
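For orientation, here is a minimal sketch of what reading records from a warc.wet stream involves. It is not the indexer's actual code: the `WetRecord` type and the `next_record` function are hypothetical names chosen for illustration, it uses only the standard library, and it assumes the input is already decompressed (a warc.wet.gz file would need a gzip decoder in front of it). It relies only on the general WARC layout: a block of `Name: value` header lines, an empty line, then exactly `Content-Length` bytes of payload.

```rust
use std::io::{self, BufRead, Read};

/// One extracted-text record (hypothetical type, for illustration only).
#[derive(Debug, PartialEq)]
struct WetRecord {
    uri: String,
    body: String,
}

/// Reads the next record that has a WARC-Target-URI header.
/// Returns Ok(None) at end of input. Records without a target URI
/// (such as the leading `warcinfo` record) are skipped.
fn next_record<R: BufRead>(reader: &mut R) -> io::Result<Option<WetRecord>> {
    let mut raw = Vec::new();
    loop {
        let mut uri = None;
        let mut len = 0usize;
        let mut saw_header = false;
        // Header block: "Name: value" lines terminated by an empty line.
        loop {
            raw.clear();
            if reader.read_until(b'\n', &mut raw)? == 0 {
                return Ok(None); // end of input
            }
            let line = String::from_utf8_lossy(&raw);
            let trimmed = line.trim_end();
            if trimmed.is_empty() {
                if saw_header {
                    break; // end of this record's headers
                }
                continue; // blank separator lines between records
            }
            saw_header = true;
            if let Some((name, value)) = trimmed.split_once(':') {
                let value = value.trim();
                if name.eq_ignore_ascii_case("WARC-Target-URI") {
                    uri = Some(value.to_string());
                } else if name.eq_ignore_ascii_case("Content-Length") {
                    len = value
                        .parse::<usize>()
                        .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))?;
                }
            }
        }
        // Payload: exactly Content-Length bytes, decoded leniently as UTF-8.
        let mut payload = vec![0u8; len];
        reader.read_exact(&mut payload)?;
        if let Some(uri) = uri {
            let body = String::from_utf8_lossy(&payload).into_owned();
            return Ok(Some(WetRecord { uri, body }));
        }
    }
}
```

Calling `next_record` in a loop on a `BufReader` over an opened file yields one `(uri, body)` pair per page, which is roughly the unit an indexer would turn into a document. How the real tool maps records to tantivy fields is determined by its own source and `template/meta.json`, not by this sketch.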
To create an empty index from the bundled template:
```
mkdir ../common_crawl_tantivy_index
cp template/meta.json ../common_crawl_tantivy_index/
```

Best
Andreas