https://github.com/lmullen/chronam-ocr-debatcher

Turn a batch of OCR files from Chronicling America into a CSV that can be imported into a database

[![Build Status](https://travis-ci.org/lmullen/chronam-ocr-debatcher.svg?branch=master)](https://travis-ci.org/lmullen/chronam-ocr-debatcher)

# Chronicling America OCR debatcher

This program takes paths to `.tar.bz2` batches of OCR files from the
*Chronicling America* [bulk data
downloads](https://chroniclingamerica.loc.gov/about/api/#bulk-data). It converts
each batch into a CSV file, which you can load into a database or use however
you like. Batches are processed concurrently.
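To illustrate the idea, here is a minimal Python sketch of the same kind of conversion. It is not this program's implementation. The function names, the `path` and `text` CSV columns, the `workers` parameter, and the assumption that OCR text lives in `.txt` members of the archive are all hypothetical. It also uses a thread pool for concurrency, which may differ from how the program itself runs batches in parallel.

```python
import csv
import tarfile
from concurrent.futures import ThreadPoolExecutor


def batch_to_csv(batch_path, csv_path):
    """Write one CSV row (member path, OCR text) per .txt file in a batch."""
    rows = 0
    with tarfile.open(batch_path, "r:bz2") as tar, \
            open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["path", "text"])  # hypothetical column names
        for member in tar:
            # Assumption: plain-text OCR is stored in regular .txt members.
            if not (member.isfile() and member.name.endswith(".txt")):
                continue
            raw = tar.extractfile(member).read()
            writer.writerow([member.name, raw.decode("utf-8", errors="replace")])
            rows += 1
    return rows


def convert_batches(batch_paths, workers=8):
    """Convert several batches concurrently; returns {batch path: row count}."""
    batch_paths = list(batch_paths)

    def convert(path):
        # batch.tar.bz2 -> batch.csv, written next to the input file
        csv_path = str(path).removesuffix(".tar.bz2") + ".csv"
        return batch_to_csv(path, csv_path)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(batch_paths, pool.map(convert, batch_paths)))
```

Calling `convert_batches(["a.tar.bz2", "b.tar.bz2"])` would write `a.csv` and `b.csv` alongside the inputs. The sketch requires Python 3.9 or later for `str.removesuffix`.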

Usage:

```
./chronam-ocr-debatcher [--processes=8]
```

You can download binaries from the [releases page](https://github.com/lmullen/chronam-ocr-debatcher/releases).