https://github.com/rsheets/jailbreakr
  
  
Get out of Excel free.
- Host: GitHub
- URL: https://github.com/rsheets/jailbreakr
- Owner: rsheets
- Created: 2016-01-24T05:04:04.000Z (almost 10 years ago)
- Default Branch: master
- Last Pushed: 2016-08-18T02:38:06.000Z (about 9 years ago)
- Last Synced: 2024-08-06T03:05:28.188Z (about 1 year ago)
- Language: R
- Size: 56.6 KB
- Stars: 89
- Watchers: 16
- Forks: 9
- Open Issues: 6
- Metadata Files:
  - Readme: README.md
 
Awesome Lists containing this project
- jimsghstars - rsheets/jailbreakr - Get out of Excel free. (R)
README
# jailbreakr
**Warning: This project is in the early scoping stages; do not use for anything other than amusement/frustration purposes**
Data liberator: extracts tabular data that people have put in non-tabular structures in a program designed to hold tables.

## Installation
Requires the development version of xml2 (for `xml_find_lgl`) as well as [cellranger](https://github.com/rsheets/cellranger) and [linen](https://github.com/rsheets/linen).  Chances are you'll want [rexcel](https://github.com/rsheets/rexcel) too.
```r
devtools::install_github(c("hadley/xml2",
                           "rsheets/linen",
                           "rsheets/cellranger",
                           "rsheets/rexcel",
                           "rsheets/jailbreakr"))
```
## Goals
There are two large Excel spreadsheet corpora; it would be nice to use them to get a feel for what fraction of spreadsheets we can handle and for the range of non-table-like data out there.

The first is the [EUSES corpus](http://openscience.us/repo/spreadsheet/euses.html) of 4,447 spreadsheets (16,853 worksheets).  These are all xls files (rather than xlsx) and therefore need either an [xls -> xlsx conversion](http://bit.ly/1P2rMGr) or support in jailbreakr for xls files.
The second, larger one is the [Enron corpus](http://www.felienne.com/archives/3634) of 15,770 spreadsheets (79,983 worksheets).
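
A first pass at sizing up a local copy of either corpus could be a simple tally of files by extension, to see how much of it is xls versus xlsx. The block below is a minimal sketch using base R only; `tally_spreadsheets` and the example path are hypothetical and not part of jailbreakr.

```r
# Hypothetical helper (not part of jailbreakr): count spreadsheet files
# under `path` by extension, to see how much of a corpus is xls vs xlsx.
tally_spreadsheets <- function(path) {
  # Match both .xls and .xlsx, in any case, in all subdirectories
  files <- list.files(path, pattern = "\\.xlsx?$", recursive = TRUE,
                      ignore.case = TRUE)
  ext <- tolower(tools::file_ext(files))
  # Fix the factor levels so that a missing extension shows up as 0
  table(factor(ext, levels = c("xls", "xlsx")))
}
```

Calling `tally_spreadsheets("path/to/corpus")` on a directory of downloaded files would give one count per extension, which is a rough indication of how much xls support or conversion matters.
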
## Roadmap
* data structure package:
  - linen?  General representation of spreadsheet data, plus some limited low-level operations on that data
  - depends on cellranger, tibble
  - constructor function
  - print methods
  - subsetting, range extraction etc.
  - plot method - for quickly getting a feel for structure, or a shiny app
  - summary: this has n sheets, no formulae, 3 plots, etc, things about the references between the sheets?
  - where it came from (Excel, Google Sheets, etc.), with filenames, reference ids etc.
  - probably needs references to handle multiple sheets and formulae within them, definitely if we need to do things with plots, but make them immutable at first?
  - md5 or other "id" so that we can see if the upstream source has changed.  This is different for googlesheets where the id is properly baked into the sheet
* low level packages:
  - googlesheets
  - rexcel
  - these depend on linen, and will have to provide things like ids and filenames to satisfy all the features that linen will do.
* jailbreakr
  - uses output in linen format that is provided by googlesheets or rexcel
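
To make the data structure items above more concrete, here is a rough sketch of what a constructor, a print method, and an md5-based "id" might look like. Everything in it is an assumption for illustration: the names `sketch_workbook` and `has_changed`, the fields, and the cell layout (one data frame per sheet with `row`, `col`, `value` columns) are hypothetical and are not the linen API.

```r
# Hypothetical sketch of a linen-style container; all names and fields
# here are illustrative assumptions, not the actual linen API.
sketch_workbook <- function(sheets, path = NULL, source = "excel") {
  stopifnot(is.list(sheets), !is.null(names(sheets)))
  # md5 of the source file acts as an "id" for spotting upstream changes
  id <- NA_character_
  if (!is.null(path) && file.exists(path)) {
    id <- unname(tools::md5sum(path))
  }
  structure(list(sheets = sheets, path = path, source = source, id = id),
            class = "sketch_workbook")
}

# Print method: a short overview of where the data came from and its size
print.sketch_workbook <- function(x, ...) {
  cat("<sketch_workbook> from ", x$source, ": ",
      length(x$sheets), " sheet(s)\n", sep = "")
  for (nm in names(x$sheets)) {
    cat("  - ", nm, ": ", nrow(x$sheets[[nm]]), " cell(s)\n", sep = "")
  }
  invisible(x)
}

# TRUE if the file on disk no longer matches the md5 recorded at creation
has_changed <- function(x) {
  stopifnot(!is.null(x$path))
  !identical(unname(tools::md5sum(x$path)), x$id)
}

# Usage: one sheet holding three cells, with no backing file
cells <- data.frame(row = c(1, 1, 2), col = c(1, 2, 1),
                    value = c("a", "b", "c"))
wb <- sketch_workbook(list(Sheet1 = cells))
print(wb)
```

A file hash only makes sense for file-backed sources; for googlesheets, where the id is baked into the sheet, that id would presumably be stored instead.
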
## Ideas
Can we feed things through OpenRefine or something?