https://github.com/massprospecting/csv-indexer
Simple indexation and searching of CSV large files. Not as robust as Lucence, but simple and cost-effective. May index files with millions of rows and find specific rows in matter of seconds.
https://github.com/massprospecting/csv-indexer
csv elastic-search elasticsearch indexer indexing lucence ruby ruby-gem rubygem
Last synced: 11 months ago
JSON representation
Simple indexation and searching of CSV large files. Not as robust as Lucence, but simple and cost-effective. May index files with millions of rows and find specific rows in matter of seconds.
- Host: GitHub
- URL: https://github.com/massprospecting/csv-indexer
- Owner: MassProspecting
- License: mit
- Created: 2022-11-08T19:26:30.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2022-11-23T18:22:59.000Z (over 3 years ago)
- Last Synced: 2024-11-01T18:06:16.003Z (over 1 year ago)
- Topics: csv, elastic-search, elasticsearch, indexer, indexing, lucence, ruby, ruby-gem, rubygem
- Language: Ruby
- Homepage:
- Size: 481 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 1
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
   
# CSV-Indexer
CSV-Indexer makes it simple the indexation and searching in large CSV files.
CSV-Indexer is not as robust as Lucene, but it is simple and cost-effective. May index files with millions of rows and find specific rows in matter of seconds.
## 1. Installation
```bash
gem install csv-indexer
```
## 2. Quick Start
**Step 1.** Download a sample CSV file in the same directory where you are running your Ruby script:
```bash
wget https://raw.githubusercontent.com/leandrosardi/csv-indexer/main/examples/example.csv
```
**Step 2.** In your Ruby script, require the `csv-indexer` gem.
```ruby
require 'csv-indexer'
```
**Step 3.** Setup the index for that CSV file.
```ruby
# define the indexation of for `example.csv`
source = BlackStack::CSVIndexer.add_indexation({
# Assign a unique name for this indexation.
#
# This parameter is mandatory.
#
# Each `.csv` file indexed will be stored in a file with the same name but replaciing `.csv` with the name of this index.
# For example, if you use `:name => 'my_index'` and you index a file called `my_file.csv`, the index will be stored in a file called `my_file.my_index`.
#
# This name must have filename safe characters only. No spaces, no special characters.
#
:name => 'ix_example01',
# Write a brief description of what you are indexing and why.
# This parameter is optional.
# Default: nil.
:description => 'Find the email address and other insights of any LinkedIn user from his/her LinkedIn URL.',
# The path to the `.csv` file(s) to be indexed.
# This parameter is optional.
# Default: './*.csv'
:input => './example.csv',
# The path to the directory where the index will be stored.
# This parameter is optional.
# Default: './'
:output => './',
# The path to the directory where the log files will be stored.
# This parameter is optional.
# Default: './'
:log => './',
# The mapping of the columns in the `.csv` file to be index.
# This parameter is mandatory.
:mapping => {
:first_name => 0,
:last_name => 1,
:linkedin_url => 2,
:email => 5,
},
# List column mapped to the index who are used to build the key of the index.
# This parameter is mandatory.
:keys => [:linkedin_url],
})
```
**Step 4.** Run the indexation
Add this line to build the index.
```ruby
BlackStack::CSVIndexer.index('ix_example01')
# => 2022-11-09 15:37:46: Indexing example.csv... done
```
**Note:**
For better performance, the `index` method loads the whole file to memory.
So, if you have `csv` files higher than 500MB, it is advisable you split then in chunks using the `split` command.
E.g.:
```bash
split -C 500m --numeric-suffixes input_filename
```
**Step 5.** Searching for a specific LinkedIn URL in your index.
```ruby
ret = BlackStack::CSVIndexer.find('ix_example01', 'linkedin.com/in/almu-dan-9808753a')
puts "#{ret[:matches].size.to_s} results found."
puts "Enlapsed seconds: #{ret[:enlapsed_seconds].to_s}"
# => 1 results found.
# => Enlapsed seconds: 0.001595287
```
## 3. Indexing Many Files
You can define the indexation of many files.
E.g.: Replacing `'./example.csv'` by `'./*.csv'`.
```ruby
source = BlackStack::CSVIndexer.add_indexation({
:name => 'ix_example01',
:input => './*.csv',
:mapping => {
:first_name => 0,
:last_name => 1,
:linkedin_url => 2,
:email => 5,
},
:keys => [:linkedin_url],
})
```
## 4. Indexing by Many Columns
You can index by many columns.
E.g.: Replacing `[:linkedin_url]` by `[:first_name, :last_name]`.
```ruby
source = BlackStack::CSVIndexer.add_indexation({
:name => 'ix_example02',
:description => 'Find the email address and other insights of any LinkedIn user from his/her name.',
:input => './example.csv',
:output => './',
:log => './',
:mapping => {
:first_name => 0,
:last_name => 1,
:linkedin_url => 2,
:email => 5,
},
:keys => [:first_name, :last_name],
})
BlackStack::CSVIndexer.index('ix_example02')
# => 2022-11-09 16:43:52: Indexing example.csv... done
```
## 5. Searching by Many Columns
If you indexed by more than one column, you can choose one or more of those columns for search.
E.g.: Replacing `'linkedin.com/in/almu-dan-9808753a'` by `['alan', 'armstrong']`.
```ruby
ret = BlackStack::CSVIndexer.find('ix_example02', ['alan', 'armstrong'])
puts "#{ret[:matches].size.to_s} results found."
if ret[:matches].size > 0
puts "First Name: #{ret[:matches].first[2]}"
puts "Last Name: #{ret[:matches].first[3]}"
puts "Email: #{ret[:matches].first[5]}"
end
puts "Enlapsed seconds: #{ret[:enlapsed_seconds].to_s}"
# => 1 results found.
# => First Name: alan
# => Last Name: armstrong
# => Email: razorback1@plansandmorellp.com
# => Enlapsed seconds: 0.001613454
```
## 6. Key Must Be Unique
At this moment, CSV-Indexer returns no more than 1 result.
If there are two or more rows in your index who match with the criteria, CSV-Indexer will return the first that it founds.
E.g.: If you remove the `'armstrong'`, you get another Alan.
```ruby
ret = BlackStack::CSVIndexer.find('ix_example02', ['alan'])
puts "#{ret[:matches].size.to_s} results found."
if ret[:matches].size > 0
puts "First Name: #{ret[:matches].first[2]}"
puts "Last Name: #{ret[:matches].first[3]}"
puts "Email: #{ret[:matches].first[5]}"
end
puts "Enlapsed seconds: #{ret[:enlapsed_seconds].to_s}"
# => 1 results found.
# => First Name: alan
# => Last Name: kane
# => Email: akane@myalexandertoyota.com
# => Enlapsed seconds: 0.001480246
```
## 7. Case Insensitive
CSV-Indexer is case-insensitive.
E.g.: `['alan', 'armstrong']` is the same than `['Alan', 'Armstrong']`.
```ruby
ret = BlackStack::CSVIndexer.find('ix_example02', ['Alan', 'Armstrong'])
puts "#{ret[:matches].size.to_s} results found."
if ret[:matches].size > 0
puts "First Name: #{ret[:matches].first[2]}"
puts "Last Name: #{ret[:matches].first[3]}"
puts "Email: #{ret[:matches].first[5]}"
end
puts "Enlapsed seconds: #{ret[:enlapsed_seconds].to_s}"
# => 1 results found.
# => First Name: alan
# => Last Name: armstrong
# => Email: razorback1@plansandmorellp.com
# => Enlapsed seconds: 0.001613454
```
## 8. Matching Criteria
You can find values who match partially with the key.
E.g: `['Ala', 'Armstrong']` works the same than `['Alan', 'Armstrong']` if you add a thirth parameter `exact_match=false`
```ruby
ret = BlackStack::CSVIndexer.find('ix_example02', ['Ala', 'Armstrong'], exact_match=false)
puts "#{ret[:matches].size.to_s} results found."
if ret[:matches].size > 0
puts "First Name: #{ret[:matches].first[2]}"
puts "Last Name: #{ret[:matches].first[3]}"
puts "Email: #{ret[:matches].first[5]}"
end
puts "Enlapsed seconds: #{ret[:enlapsed_seconds].to_s}"
# => 1 results found.
# => First Name: alan
# => Last Name: armstrong
# => Email: razorback1@plansandmorellp.com
# => Enlapsed seconds: 0.001595377
```