Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/hugovk/gutengrep
Find whole sentences matching a regex in Project Gutenberg
- Host: GitHub
- URL: https://github.com/hugovk/gutengrep
- Owner: hugovk
- Created: 2014-11-12T13:04:40.000Z (about 10 years ago)
- Default Branch: gh-pages
- Last Pushed: 2023-02-05T07:11:32.000Z (almost 2 years ago)
- Last Synced: 2024-10-16T04:21:29.026Z (29 days ago)
- Language: HTML
- Size: 9.68 MB
- Stars: 32
- Watchers: 4
- Forks: 10
- Open Issues: 1
Metadata Files:
- Readme: README.md

gutengrep
=========

[![Build Status](https://travis-ci.org/hugovk/gutengrep.svg?branch=gh-pages)](https://travis-ci.org/hugovk/gutengrep)

Find whole sentences matching a regex in Project Gutenberg plain text files.

Example commands
----------------

    gutengrep.py "^[^\w]*And then" "*.txt" --cache --sort --correct -o output/and-then.txt
    gutengrep.py "^[^\w]*But why" "*.txt" --cache --sort --correct -o output/but-why.txt
    gutengrep.py -i "whale" moby11.txt --sort --correct -o out\mobydick-whale.txt
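
To make the idea behind these commands concrete, here is a minimal Python sketch of sentence-level regex matching. It is not gutengrep's actual code: the `split_sentences` and `grep_sentences` helpers are hypothetical, and the naive punctuation-based splitter stands in for a real sentence tokeniser, which would handle abbreviations and quotations far better.

```python
import re

# Naive splitter (an assumption, not gutengrep's tokeniser): break after
# '.', '!' or '?' when the next character is whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Collapse line breaks and split the text into rough sentences."""
    flat = " ".join(text.split())
    return [s for s in SENTENCE_END.split(flat) if s]


def grep_sentences(pattern, text, ignore_case=False):
    """Return every whole sentence in which the regex matches."""
    flags = re.IGNORECASE if ignore_case else 0
    regex = re.compile(pattern, flags)
    return [s for s in split_sentences(text) if regex.search(s)]


sample = (
    "Call me Ishmael. And then the whale rose.\n"
    '"And then what?" he asked. But why did it\nsink again?'
)

# Sentences starting with "And then", allowing leading quotes or punctuation.
print(grep_sentences(r"^[^\w]*And then", sample))
# Case-insensitive match anywhere in the sentence, like the -i examples.
print(grep_sentences("WHALE", sample, ignore_case=True))
```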

Example output
--------------

| Name | Sorted | Regex | Input | Word count |
|:-----------------------------------------------:|:----------------------------------------------------:|:----------------:|:------------:|:----------:|
| [But why?](output/but-why.txt?raw=true) | [But why?](output/but-why-sort.txt?raw=true) | `^[^\w]*But why` | `*.txt` | 7,572 |
| [And then!](output/and-then.txt?raw=true) | [And then!](output/and-then-sort.txt?raw=true) | `[^\w]*And then` | `*.txt` | 85,014 |
| [The whale](output/mobydick-whale.txt?raw=true) | [The whale](output/mobydick-whale-sort.txt?raw=true) | `whale` | `moby11.txt` | 50,913 |
| [Why](output/why.txt?raw=true) | [Why](output/why-sort.txt?raw=true) | `[^\w]*Why` | `*.txt` | 184,832 |
| [Once upon a time](output/once-upon-a-time.txt?raw=true) | [Once upon a time](output/once-upon-a-time-sort.txt?raw=true) | `-i` `once upon a time` | `*.txt` | 6,195 |
| [The End](output/the-end.txt?raw=true) | [The End](output/the-end-sort.txt?raw=true) | `-i` `the end\.` | `*.txt` | 142,94 |
| [Happily ever after](output/happily-ever-after.txt?raw=true) | [Happily ever after](output/happily-ever-after-sort.txt?raw=true) | `-i` `happily ever after` | `*.txt` | 271 |
| [Moonlit](output/moonlit.txt?raw=true) | [Moonlit](output/moonlit-sort.txt?raw=true) | `-i` `moonlit` | `*.txt` | 52,345 |
| [Moonlight](output/moonlight.txt?raw=true) | [Moonlight](output/moonlight-sort.txt?raw=true) | `-i` `moonlight` | `*.txt` | 3,186 |

See also [nanogenmo.md](nanogenmo.md).


Tips
----

Download the [Project Gutenberg August 2003 CD](https://www.gutenberg.org/ebooks/11220) ISO file, mount it, and copy all the text files from the 'etext' directories into a single directory on your hard drive.
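
The copy step can be scripted. The sketch below assumes the text files sit in directories whose names start with `etext` directly under the mount point and end in a lowercase `.txt`; the `collect_texts` helper and both paths in the usage comment are hypothetical placeholders.

```python
import shutil
from pathlib import Path


def collect_texts(cd_root, dest):
    """Copy every .txt file under the CD's etext* directories into dest."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    copied = 0
    for etext_dir in sorted(Path(cd_root).glob("etext*")):
        if not etext_dir.is_dir():
            continue
        for src in sorted(etext_dir.rglob("*.txt")):
            target = dest / src.name
            if target.exists():
                # Skip name clashes rather than overwrite an earlier copy.
                continue
            shutil.copy2(src, target)
            copied += 1
    return copied


# Hypothetical usage; substitute your own mount point and destination:
# collect_texts("/path/to/mounted-cd", "/path/to/gutenberg-texts")
```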
When working on the whole corpus, use `--cache` to cut down on file operations. The first run builds a cache file of all tokenised sentences; on my MBP this takes about 5 minutes to go through the 597 books of the Project Gutenberg CD and extract their 3,583,390 sentences. Subsequent runs using the cache take about 40 seconds.
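
A build-once, reuse-afterwards cache of this kind can be sketched as follows. This is an illustration under assumptions, not gutengrep's implementation: the pickle format, the `load_sentences` helper, and its `cache_path` and `tokenise` parameters are all hypothetical.

```python
import glob
import pickle
from pathlib import Path


def load_sentences(filespec, cache_path, tokenise):
    """Return all sentences for filespec, building the cache on first use.

    tokenise is any callable mapping a text string to a list of sentences.
    """
    cache = Path(cache_path)
    if cache.exists():
        # Fast path: one read instead of opening every book again.
        # The cache is not keyed on filespec, so a different filespec
        # still gets the sentences from the run that built the cache.
        with cache.open("rb") as f:
            return pickle.load(f)

    # Slow path: read and tokenise every matching file, then save the result.
    sentences = []
    for name in sorted(glob.glob(filespec)):
        text = Path(name).read_text(encoding="utf-8", errors="replace")
        sentences.extend(tokenise(text))

    with cache.open("wb") as f:
        pickle.dump(sentences, f)
    return sentences
```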
When searching just a single file, or a subset of files, do not use `--cache`: it will use the cache file generated from the initial file spec instead of the files you named.