https://github.com/samwho/rsgrep

A pure Ruby implementation of the sorted grep program. Grep efficiently over large, sorted files.

# Rsgrep

This is a pure Ruby implementation with the same goal as the small but amazing
[sorted grep](http://sourceforge.net/projects/sgrep/) program written by
Stephen C. Losen.

It is designed for use on large, lexicographically sorted files and finds lines
that *begin* with a given pattern. Because it relies on a binary search, it
cannot match a pattern that appears anywhere other than the start of a line.
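
The binary search idea can be sketched in a few lines of Ruby. This is a minimal
illustration under stated assumptions, not rsgrep's actual implementation:
`prefix_search` and `line_start_at_or_after` are hypothetical names, the input
is assumed to be sorted bytewise, and a `StringIO` stands in for a large file.

``` ruby
require 'stringio'

# Hypothetical helper (not part of rsgrep): byte offset of the first line
# that starts at or after `pos`.
def line_start_at_or_after(io, pos)
  return 0 if pos.zero?
  io.seek(pos - 1)
  io.gets # consume everything up to and including the next newline
  io.pos
end

# Hypothetical helper (not part of rsgrep): all lines in a bytewise-sorted
# IO that start with `prefix`, found by binary search over byte offsets.
def prefix_search(io, prefix)
  lo = 0
  hi = io.size
  while lo < hi
    mid = (lo + hi) / 2
    io.seek(line_start_at_or_after(io, mid))
    line = io.gets
    if line.nil? || line.chomp >= prefix
      hi = mid     # the first match starts at or before this line
    else
      lo = mid + 1 # this line sorts before the prefix
    end
  end

  # Matching lines are contiguous, so read forward until one stops matching.
  io.seek(line_start_at_or_after(io, lo))
  matches = []
  while (line = io.gets) && line.start_with?(prefix)
    matches << line.chomp
  end
  matches
end

sorted = StringIO.new("apple\napricot\nbanana\nbandana\ncherry\n")
p prefix_search(sorted, "ban") #=> ["banana", "bandana"]
```

Each probe seeks to a byte offset and reads at most two lines, so the number of
reads grows with the logarithm of the file size rather than with the number of
lines.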

## Installation

Add this line to your application's Gemfile:

    gem 'rsgrep'

And then execute:

    $ bundle

Or install it yourself as:

    $ gem install rsgrep

## Usage

The gem monkey-patches the `File` class, adding `sgrep` as both a class method
and an instance method. It can be used in the following two ways:

``` ruby
require 'rsgrep'

puts File.sgrep("key pattern", "path/to/file.txt")
#=> array of all lines that start with "key pattern", empty array for no
# matches.

# or ...

f = File.open("path/to/file.txt")
puts f.sgrep("key pattern")
#=> array of all lines that start with "key pattern", empty array for no
# matches.

f.close
```

Both methods accept an options hash. For example:

``` ruby
require 'rsgrep'

f = File.open("path/to/file.txt")

# Case insensitive search
f.sgrep("PaTTern", :insensitive => true)

f.close
```

**NOTE**: There are a lot of caveats involved in getting this to work properly.
For example, you **cannot** do a case insensitive search on a file that is not
sorted in a case insensitive fashion. The results will not be what you expect.

The same applies to almost every option you pass to rsgrep. You will get the
best results on a file that contains only alphanumeric characters in a single
casing (upper or lower, it doesn't matter which).
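
To see why the sort order matters, here is a small sketch that uses an
in-memory array in place of a file. `prefix_matches` is a hypothetical helper
written for this illustration, not rsgrep's API; it performs a case insensitive
binary search with Ruby's built-in `Array#bsearch_index`.

``` ruby
# Hypothetical helper (not part of rsgrep): case insensitive prefix search
# over an array that is assumed to be sorted case insensitively.
def prefix_matches(sorted_lines, prefix)
  key = prefix.downcase
  first = sorted_lines.bsearch_index { |line| line.downcase >= key }
  return [] if first.nil?
  sorted_lines.drop(first).take_while { |line| line.downcase.start_with?(key) }
end

# Bytewise sort: every uppercase letter sorts before every lowercase letter.
byte_sorted = %w[banana Apple apple Banana].sort
# Case insensitive sort, with the original string as a tie-breaker.
case_sorted = byte_sorted.sort_by { |line| [line.downcase, line] }

p byte_sorted                          #=> ["Apple", "Banana", "apple", "banana"]
p prefix_matches(byte_sorted, "APPLE") #=> ["Apple"]
p prefix_matches(case_sorted, "APPLE") #=> ["Apple", "apple"]
```

On the bytewise-sorted input the search stops at `"Banana"` and never reaches
`"apple"`, so a match is silently dropped.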

## Contributing

1. Fork it
2. Create your feature branch (`git checkout -b my-new-feature`)
3. Commit your changes (`git commit -am 'Added some feature'`)
4. Push to the branch (`git push origin my-new-feature`)
5. Create a new Pull Request

## A note on the specs...

Writing specs for this required a very large, freely available file to scan.
For obvious reasons, I cannot put the file into this repository, but you can
download it from the
[Google Books Ngram datasets page](http://books.google.com/ngrams/datasets).
It's the 0th file of the English 3-grams dataset.

Direct link:
[http://commondatastorage.googleapis.com/books/ngrams/books/googlebooks-eng-all-3gram-20090715-0.csv.zip](http://commondatastorage.googleapis.com/books/ngrams/books/googlebooks-eng-all-3gram-20090715-0.csv.zip)

It's about 440 MB compressed and 3 GB uncompressed. You will need to uncompress
it into the `spec/data` directory in order to run the specs successfully.

This file is a bit of a bad example though, to be honest. I'm only using it at
the moment so that the specs give a good idea of how long it takes to scan
through such a large file. It is not a good file to use because it isn't sorted
in a way that rsgrep knows how to process yet. Its handling of capital letters
and punctuation is a bit confusing, and I haven't yet been able to find a
consistent and clean way of scanning it.
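
One way to check up front whether a file is ordered the way a given search
expects is to stream through it once and compare neighbouring lines. The sketch
below is an assumption-laden illustration rather than anything rsgrep provides:
`sorted_lines?` is a hypothetical helper, and the optional block is a made-up
convention for normalising each line before it is compared.

``` ruby
require 'stringio'

# Hypothetical helper (not part of rsgrep): true if each line sorts at or
# after the previous one. The optional block maps a line to its sort key;
# without a block, lines are compared bytewise.
def sorted_lines?(io, &key)
  key ||= ->(line) { line }
  io.each_line.each_cons(2).all? do |a, b|
    (key.call(a.chomp) <=> key.call(b.chomp)) <= 0
  end
end

data = "Apple\nBanana\napple\n"
p sorted_lines?(StringIO.new(data))              #=> true
p sorted_lines?(StringIO.new(data), &:downcase)  #=> false
```

The same call should work on a `File` opened for reading, since it only relies
on `each_line`.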