Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

https://github.com/woahdae/simple_xlsx_reader

Ruby xlsx reader to parse cell values into plain ruby primitives and dates/times.
https://github.com/woahdae/simple_xlsx_reader

Last synced: 4 days ago
JSON representation

Ruby xlsx reader to parse cell values into plain ruby primitives and dates/times.

Host: GitHub
URL: https://github.com/woahdae/simple_xlsx_reader
Owner: woahdae
License: mit
Created: 2013-01-16T17:01:19.000Z (over 11 years ago)
Default Branch: master
Last Pushed: 2024-05-03T19:09:30.000Z (about 2 months ago)
Last Synced: 2024-06-18T14:10:20.002Z (6 days ago)
Language: Ruby
Size: 338 KB
Stars: 177
Watchers: 9
Forks: 54
Open Issues: 10
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE.txt

Lists

awesome-stars - simple_xlsx_reader
awesome-stars - woahdae/simple_xlsx_reader - Ruby xlsx reader to parse cell values into plain ruby primitives and dates/times. (Ruby)
awesome-stars - woahdae/simple_xlsx_reader - Ruby xlsx reader to parse cell values into plain ruby primitives and dates/times. (Ruby)

README

        # SimpleXlsxReader

A [fast](#performance) xlsx reader for Ruby that parses xlsx cell values into

plain ruby primitives and dates/times.

This is *not* a rewrite of excel in Ruby. Font styles, for

example, are parsed to determine whether a cell is a number or a date,

then forgotten. We just want to get the data, and get out!

## Summary (now with stream parsing):

```ruby

doc = SimpleXlsxReader.open('/path/to/workbook.xlsx')

doc.sheets # => [<#SXR::Sheet>, ...]

doc.sheets.first.name # 'Sheet1'

rows = doc.sheet.first.rows # 

rows.each # an  ready to chain or stream

rows.each {} # Streams the rows to your block

rows.each(headers: true) {} # Streams row-hashes

rows.each(headers: {id: /ID/}) {} # finds & maps headers, streams

rows.slurp # Slurps rows into memory as a 2D array

```

That's the gist of it!

See also the [Document](https://github.com/woahdae/simple_xlsx_reader/blob/2.0.0-pre/lib/simple_xlsx_reader/document.rb) object.

## Why?

### Accurate

This project was started years ago, primarily because other Ruby xlsx parsers

didn't import data with the correct types. Numbers as strings, dates as numbers,

[hyperlinks](https://github.com/woahdae/simple_xlsx_reader/blob/master/lib/simple_xlsx_reader/hyperlink.rb)

with inaccessible URLs, or - subtly buggy - simple dates as DateTime

objects. If your app uses a timezone offset, depending on what timezone and

what time of day you load the xlsx file, your dates might end up a day off!

SimpleXlsxReader understands all these correctly.

### Idiomatic

Many Ruby xlsx parsers seem to be inspired more by Excel than Ruby, frankly.

SimpleXlsxReader strives to be fairly idiomatic Ruby:

```ruby

# quick example having fun w/ ruby

doc = SimpleXlsxReader.open(path_or_io)

doc.sheets.first.rows.each(headers: {id: /ID/})

  .with_index.with_object({}) do |(row, index), acc|

    acc[row[:id]] = index

end

```

### Now faster

Finally, as of v2.0, SimpleXlsxReader is the fastest and most

memory-efficient parser. Previously this project couldn't reasonably load

anything over ~10k rows. Other parsers could load 100k+ rows, but were still

taking ~1gb RSS to do so, even "streaming," which seemed excessive. So a SAX

implementation was born. See [performance](#performance) for details.

## Usage

### Streaming

SimpleXlsxReader is performant by default - If you use

`rows.each {|row| ...}` it will stream the XLSX rows to your block without

loading either the sheet XML or the full sheet data into memory.

You can also chain `rows.each` with other Enumerable functions without

triggering a slurp, and you have lots of ways to find and map headers while

streaming.

If you had an excel sheet representing this data:

```

| Hero ID | Hero Name  | Location     |

| 13576   | Samus Aran | Planet Zebes |

| 117     | John Halo  | Ring World   |

| 9704133 | Iron Man   | Planet Earth |

```

Get a handle on the rows proxy:

```ruby

rows = SimpleXlsxReader.open('suited_heroes.xlsx').sheets.first.rows

```

Simple streaming (kinda boring):

```ruby

rows.each { |row| ... }

````

Streaming with headers, and how about a little enumerable chaining:

```ruby

# Map of hero names by ID: { 117 => 'John Halo', ... }

rows.each(headers: true).with_object({}) do |row, acc|

  acc[row['Hero ID']] = row['Hero Name']

end

```

Sometimes though you have some junk at the top of your spreadsheet:

```

| Unofficial Report  |                        |              |

| Dont tell Nintendo | Yes "John Halo" I know |              |

|                    |                        |              |

| Hero ID            | Hero Name              | Location     |

| 13576              | Samus Aran             | Planet Zebes |

| 117                | John Halo              | Ring World   |

| 9704133            | Iron Man               | Planet Earth |

```

For this, `headers` can be a hash whose keys replace headers and whose values

help find the correct header row:

```ruby

# Same map of hero names by ID: { 117 => 'John Halo', ... }

rows.each(headers: {id: /ID/, name: /Name/}).with_object({}) do |row, acc|

  acc[row[:id]] = row[:name]

end

```

If your header-to-attribute mapping is more complicated than key/value, you

can do the mapping elsewhere, but use a block to find the header row:

```ruby

# Example roughly analogous to some production code mapping a single spreadsheet

# across many objects. Might be a simpler way now that we have the headers-hash

# feature.

object_map = { Hero => { id: 'Hero ID', name: 'Hero Name', location: 'Location' } }

HEADERS = ['Hero ID', 'Hero Name', 'Location']

rows.each(headers: ->(row) { (HEADERS & row).any? }) do |row|

  object_map.each_pair do |klass, attribute_map|

    attributes =

      attribute_map.each_pair.with_object({}) do |(key, header), attrs|

        attrs[key] = row[header]

      end

    klass.new(attributes)

  end

end

```

### Slurping

To make SimpleXlsxReader rows act like an array, for use with legacy

SimpleXlsxReader apps or otherwise, we still support slurping the whole array

into memory. The good news is even when doing this, the xlsx worksheet & shared

string files are never loaded as a (big) Nokogiri doc, so that's nice.

By default, to prevent accidental slurping, `` will throw an exception

if you try to access it with array methods like `[]` and `shift` without

explicitly slurping first. You can slurp either by calling `rows.slurp` or

globally by setting `SimpleXlsxReader.configuration.auto_slurp = true`.

Once slurped, enumerable methods on `rows` will use the slurped data

(i.e. not re-parse the sheet), and those Array-like methods will work.

We don't support all Array methods, just the few we have used in real projects,

as we transition towards streaming instead.

### Load Errors

By default, cell load errors (ex. if a date cell contains the string

'hello') result in a SimpleXlsxReader::CellLoadError.

If you would like to provide better error feedback to your users, you

can set `SimpleXlsxReader.configuration.catch_cell_load_errors =

true`, and load errors will instead be inserted into Sheet#load_errors keyed

by [rownum, colnum]:

```ruby

{

  [rownum, colnum] => '[error]'

}

```

### Performance

SimpleXlsxReader is (as of this writing) the fastest and most memory efficient

Ruby xlsx parser.

Recent updates here have focused on large spreadsheets with especially

non-unique strings in sheets using xlsx' shared strings feature

(Excel-generated spreadsheets always use this). Other projects have implemented

streaming parsers for the sheet data, but currently none stream while loading

the shared strings file, which is the second-largest file in an xlsx archive

and can represent millions of strings in large files.

For more details, see [my fork of @shkm's excel benchmark project](https://github.com/woahdae/excel-parsing-benchmarks), but here's the summary:

1mb excel file, 10,000 rows of sample "sales records" with a fair amount of

non-unique strings (ran on an M1 Macbook Pro):

| Gem                | Parses/second | RSS Increase | Allocated Mem | Retained Mem | Allocated Objects | Retained Objects |

|--------------------|---------------|--------------|---------------|--------------|-------------------|------------------|

| simple_xlsx_reader | 1.13          | 36.94mb      | 614.51mb      | 1.13kb       | 8796275           | 3                |

| roo                | 0.75          | 74.0mb       | 164.47mb      | 2.18kb       | 2128396           | 4                |

| creek              | 0.65          | 107.55mb     | 581.38mb      | 3.3kb        | 7240760           | 16               |

| xsv                | 0.61          | 75.66mb      | 2127.42mb     | 3.66kb       | 5922563           | 10               |

| rubyxl             | 0.27          | 373.52mb     | 716.7mb       | 2.18kb       | 10612577          | 4                |

Here is a benchmark for the "worst" file I've seen, a 26mb file whose shared

strings represent 10% of the archive (note, MemoryProfiler has too much

overhead to reasonably measure allocations so that analysis was left off, and

we just measure total time for one parse):

| Gem                | Time    | RSS Increase |

|--------------------|---------|--------------|

| simple_xlsx_reader | 28.71s  | 148.77mb     |

| roo                | 40.25s  | 1322.08mb    |

| xsv                | 45.82s  | 391.27mb     |

| creek              | 60.63s  | 886.81mb     |

| rubyxl             | 238.68s | 9136.3mb     |

## Installation

Add this line to your application's Gemfile:

    gem 'simple_xlsx_reader'

And then execute:

    $ bundle

Or install it yourself as:

    $ gem install simple_xlsx_reader

## Versioning

This project follows [semantic versioning 1.0](http://semver.org/spec/v1.0.0.html)

## Contributing

Remember to write tests, think about edge cases, and run the existing

suite.

The full suite contains a performance test that on an M1 MBP runs the final

large file in about five seconds. Check out that test before & after your

change to check for performance changes.

Then, the standard stuff:

1. Fork this project

2. Create your feature branch (`git checkout -b my-new-feature`)

3. Commit your changes (`git commit -am 'Add some feature'`)

4. Push to the branch (`git push origin my-new-feature`)

5. Create new Pull Request