{"id":18142239,"url":"https://github.com/massprospecting/csv-indexer","last_synced_at":"2025-07-26T21:10:17.379Z","repository":{"id":181418359,"uuid":"563517628","full_name":"MassProspecting/csv-indexer","owner":"MassProspecting","description":"Simple indexation and searching of CSV large files. Not as robust as Lucence, but simple and cost-effective. May index files with millions of rows and find specific rows in matter of seconds.","archived":false,"fork":false,"pushed_at":"2022-11-23T18:22:59.000Z","size":493,"stargazers_count":0,"open_issues_count":1,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-11-01T18:06:16.003Z","etag":null,"topics":["csv","elastic-search","elasticsearch","indexer","indexing","lucence","ruby","ruby-gem","rubygem"],"latest_commit_sha":null,"homepage":"","language":"Ruby","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/MassProspecting.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2022-11-08T19:26:30.000Z","updated_at":"2024-07-24T11:06:52.000Z","dependencies_parsed_at":"2023-07-15T14:11:23.683Z","dependency_job_id":"d15b958d-bbb8-4205-b2b6-489fa8084dd4","html_url":"https://github.com/MassProspecting/csv-indexer","commit_stats":null,"previous_names":["leandrosardi/csv-indexer","massprospecting/csv-indexer"],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MassProspecting%2Fcsv-indexer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MassProspecting%2Fcsv-indexer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/reposit
ories/MassProspecting%2Fcsv-indexer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MassProspecting%2Fcsv-indexer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/MassProspecting","download_url":"https://codeload.github.com/MassProspecting/csv-indexer/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247535524,"owners_count":20954576,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["csv","elastic-search","elasticsearch","indexer","indexing","lucence","ruby","ruby-gem","rubygem"],"created_at":"2024-11-01T18:06:17.390Z","updated_at":"2025-04-06T19:16:35.716Z","avatar_url":"https://github.com/MassProspecting.png","language":"Ruby","funding_links":[],"categories":[],"sub_categories":[],"readme":"![GitHub issues](https://img.shields.io/github/issues/leandrosardi/csv-indexer) ![GitHub](https://img.shields.io/github/license/leandrosardi/csv-indexer) ![GitHub tag (latest by date)](https://img.shields.io/github/v/tag/leandrosardi/csv-indexer) ![GitHub last commit](https://img.shields.io/github/last-commit/leandrosardi/csv-indexer)\n\n# CSV-Indexer\n\nCSV-Indexer makes it simple to index and search large CSV files.\n\nCSV-Indexer is not as robust as Lucene, but it is simple and cost-effective. It can index files with millions of rows and find specific rows in a matter of seconds.\n\n## 1. Installation\n\n```bash\ngem install csv-indexer\n```\n\n## 2. 
Quick Start\n\n**Step 1.** Download a sample CSV file into the directory where you run your Ruby script:\n\n```bash\nwget https://raw.githubusercontent.com/leandrosardi/csv-indexer/main/examples/example.csv\n```\n\n**Step 2.** In your Ruby script, require the `csv-indexer` gem.\n\n```ruby\nrequire 'csv-indexer'\n```\n\n**Step 3.** Set up the index for that CSV file.\n\n```ruby\n# define the indexation for `example.csv`\nsource = BlackStack::CSVIndexer.add_indexation({\n    # Assign a unique name for this indexation.\n    #\n    # This parameter is mandatory.\n    #\n    # The index of each `.csv` file is stored in a file with the same name, but replacing `.csv` with the name of this index.\n    # For example, if you use `:name =\u003e 'my_index'` and you index a file called `my_file.csv`, the index will be stored in a file called `my_file.my_index`.\n    #\n    # This name must contain filename-safe characters only. No spaces, no special characters.\n    # \n    :name =\u003e 'ix_example01',\n    # Write a brief description of what you are indexing and why.\n    # This parameter is optional.\n    # Default: nil.\n    :description =\u003e 'Find the email address and other insights of any LinkedIn user from his/her LinkedIn URL.',\n    # The path to the `.csv` file(s) to be indexed.\n    # This parameter is optional.\n    # Default: './*.csv'\n    :input =\u003e './example.csv',\n    # The path to the directory where the index will be stored.\n    # This parameter is optional.\n    # Default: './'\n    :output =\u003e './',\n    # The path to the directory where the log files will be stored.\n    # This parameter is optional.\n    # Default: './'\n    :log =\u003e './',\n    # The mapping of the columns in the `.csv` file to be indexed.\n    # This parameter is mandatory.\n    :mapping =\u003e {\n        :first_name =\u003e 0,\n        :last_name =\u003e 1,\n        :linkedin_url =\u003e 2,\n        :email =\u003e 5,\n    },\n    # List of the columns mapped to the 
index that are used to build the key of the index.\n    # This parameter is mandatory.\n    :keys =\u003e [:linkedin_url],\n})\n```\n\n**Step 4.** Run the indexation.\n\nAdd this line to build the index.\n\n```ruby\nBlackStack::CSVIndexer.index('ix_example01')\n# =\u003e 2022-11-09 15:37:46: Indexing example.csv... done\n```\n\n**Note:**\n\nFor better performance, the `index` method loads the whole file into memory.\nSo, if you have `.csv` files larger than 500MB, it is advisable to split them into chunks using the `split` command.\n\nE.g.:\n\n```bash\nsplit -C 500m --numeric-suffixes input_filename\n```\n\n**Step 5.** Search for a specific LinkedIn URL in your index.\n\n```ruby\nret = BlackStack::CSVIndexer.find('ix_example01', 'linkedin.com/in/almu-dan-9808753a')\nputs \"#{ret[:matches].size.to_s} results found.\"\nputs \"Elapsed seconds: #{ret[:enlapsed_seconds].to_s}\"\n# =\u003e 1 results found.\n# =\u003e Elapsed seconds: 0.001595287\n```\n\n## 3. Indexing Many Files\n\nYou can define the indexation of many files.\n\nE.g.: Replace `'./example.csv'` with `'./*.csv'`.\n\n```ruby\nsource = BlackStack::CSVIndexer.add_indexation({\n    :name =\u003e 'ix_example01',\n    :input =\u003e './*.csv',\n    :mapping =\u003e {\n        :first_name =\u003e 0,\n        :last_name =\u003e 1,\n        :linkedin_url =\u003e 2,\n        :email =\u003e 5,\n    },\n    :keys =\u003e [:linkedin_url],\n})\n```\n\n## 4. Indexing by Many Columns\n\nYou can index by many columns.\n\nE.g.: Replace `[:linkedin_url]` with `[:first_name, :last_name]`. 
\n\n```ruby\nsource = BlackStack::CSVIndexer.add_indexation({\n    :name =\u003e 'ix_example02',\n    :description =\u003e 'Find the email address and other insights of any LinkedIn user from his/her name.',\n    :input =\u003e './example.csv',\n    :output =\u003e './',\n    :log =\u003e './',\n    :mapping =\u003e {\n        :first_name =\u003e 0,\n        :last_name =\u003e 1,\n        :linkedin_url =\u003e 2,\n        :email =\u003e 5,\n    },\n    :keys =\u003e [:first_name, :last_name],\n})\n\nBlackStack::CSVIndexer.index('ix_example02')\n# =\u003e 2022-11-09 16:43:52: Indexing example.csv... done\n```\n\n## 5. Searching by Many Columns\n\nIf you indexed by more than one column, you can search by one or more of those columns.\n\nE.g.: Replace `'linkedin.com/in/almu-dan-9808753a'` with `['alan', 'armstrong']`.\n\n```ruby\nret = BlackStack::CSVIndexer.find('ix_example02', ['alan', 'armstrong'])\nputs \"#{ret[:matches].size.to_s} results found.\"\nif ret[:matches].size \u003e 0\n    puts \"First Name: #{ret[:matches].first[2]}\" \n    puts \"Last Name: #{ret[:matches].first[3]}\" \n    puts \"Email: #{ret[:matches].first[5]}\" \nend\nputs \"Elapsed seconds: #{ret[:enlapsed_seconds].to_s}\"\n# =\u003e 1 results found.\n# =\u003e First Name: alan\n# =\u003e Last Name: armstrong\n# =\u003e Email: razorback1@plansandmorellp.com\n# =\u003e Elapsed seconds: 0.001613454\n```\n\n## 6. Key Must Be Unique\n\nAt the moment, CSV-Indexer returns no more than one result.\n\nIf two or more rows in your index match the criteria, CSV-Indexer returns the first one it finds. 
\n\nE.g.: If you remove `'armstrong'`, you get another Alan.\n\n```ruby\nret = BlackStack::CSVIndexer.find('ix_example02', ['alan'])\nputs \"#{ret[:matches].size.to_s} results found.\"\nif ret[:matches].size \u003e 0\n    puts \"First Name: #{ret[:matches].first[2]}\" \n    puts \"Last Name: #{ret[:matches].first[3]}\" \n    puts \"Email: #{ret[:matches].first[5]}\" \nend\nputs \"Elapsed seconds: #{ret[:enlapsed_seconds].to_s}\"\n# =\u003e 1 results found.\n# =\u003e First Name: alan\n# =\u003e Last Name: kane\n# =\u003e Email: akane@myalexandertoyota.com\n# =\u003e Elapsed seconds: 0.001480246\n```\n\n## 7. Case Insensitive\n\nCSV-Indexer is case-insensitive.\n\nE.g.: `['alan', 'armstrong']` is the same as `['Alan', 'Armstrong']`.\n\n```ruby\nret = BlackStack::CSVIndexer.find('ix_example02', ['Alan', 'Armstrong'])\nputs \"#{ret[:matches].size.to_s} results found.\"\nif ret[:matches].size \u003e 0\n    puts \"First Name: #{ret[:matches].first[2]}\" \n    puts \"Last Name: #{ret[:matches].first[3]}\" \n    puts \"Email: #{ret[:matches].first[5]}\" \nend\nputs \"Elapsed seconds: #{ret[:enlapsed_seconds].to_s}\"\n# =\u003e 1 results found.\n# =\u003e First Name: alan\n# =\u003e Last Name: armstrong\n# =\u003e Email: razorback1@plansandmorellp.com\n# =\u003e Elapsed seconds: 0.001613454\n```\n\n## 8. Matching Criteria\n\nYou can find values that partially match the key. 
\n\nE.g.: `['Ala', 'Armstrong']` works the same as `['Alan', 'Armstrong']` if you add a third parameter, `exact_match=false`.\n\n```ruby\nret = BlackStack::CSVIndexer.find('ix_example02', ['Ala', 'Armstrong'], exact_match=false)\nputs \"#{ret[:matches].size.to_s} results found.\"\nif ret[:matches].size \u003e 0\n    puts \"First Name: #{ret[:matches].first[2]}\" \n    puts \"Last Name: #{ret[:matches].first[3]}\" \n    puts \"Email: #{ret[:matches].first[5]}\" \nend\nputs \"Elapsed seconds: #{ret[:enlapsed_seconds].to_s}\"\n# =\u003e 1 results found.\n# =\u003e First Name: alan\n# =\u003e Last Name: armstrong\n# =\u003e Email: razorback1@plansandmorellp.com\n# =\u003e Elapsed seconds: 0.001595377\n```\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmassprospecting%2Fcsv-indexer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmassprospecting%2Fcsv-indexer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmassprospecting%2Fcsv-indexer/lists"}