https://github.com/lmullen/ats-corpus

A corpus of historical texts for the purpose of detecting similar documents and text reuse


## American Tract Society corpus

This corpus contains plain text versions of publications by the American Tract Society between 1800 and 1900 (according to the Internet Archive's metadata, anyway). This corpus was created for the purpose of testing document similarity and text reuse algorithms. The ATS frequently republished tracts under the same title. Furthermore, they published volumes with collections of tracts. So there are many examples of text reuse to be detected. (And of course, the documents are historically interesting in their own right.)
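As a rough illustration of the kind of comparison this corpus is meant to exercise, here is a minimal sketch of one common document-similarity measure: Jaccard similarity over word n-grams ("shingles"). This is not the method used to build or evaluate the corpus; the function names `shingles` and `jaccard` and the default `n=5` are assumptions chosen for the example.

```python
import re


def shingles(text, n=5):
    """Return the set of word n-grams (shingles) found in `text`.

    Tokenization here is deliberately crude (lowercased runs of ASCII
    letters and digits); it is an assumption for this sketch.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 0.0  # two empty sets: treat as no evidence of similarity
    return len(a & b) / len(a | b)
```

Two tracts that share long passages should share many shingles and score close to 1, while unrelated texts should score near 0. Comparing every pair of documents this way scales quadratically, so larger experiments usually rely on an approximation such as minhash.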

The texts themselves are in the `corpus` directory. The file `manifest.csv` contains the file names in the corpus along with associated metadata.
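Given that layout, a minimal sketch for loading the corpus might look like the following. It assumes the downloaded corpus has been unpacked so that `manifest.csv` and the `corpus` directory sit side by side under one root directory, and that the texts are UTF-8 (undecodable bytes are replaced rather than raising). The function names are hypothetical, and because the manifest's column names are not documented here, the code makes no assumption about them.

```python
import csv
from pathlib import Path


def read_manifest(manifest_path):
    """Return the manifest rows as a list of dicts keyed by the CSV header."""
    with open(manifest_path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def load_corpus(root="."):
    """Load the manifest rows and every text file under `root`/corpus.

    Returns (rows, texts), where `texts` maps each file name to its contents.
    Assumes `root` contains `manifest.csv` and a `corpus` directory.
    """
    root = Path(root)
    rows = read_manifest(root / "manifest.csv")
    texts = {}
    for path in sorted((root / "corpus").iterdir()):
        if path.is_file():
            # errors="replace" keeps loading even if a file has stray bytes
            texts[path.name] = path.read_text(encoding="utf-8", errors="replace")
    return rows, texts
```

After calling `rows, texts = load_corpus("path/to/ats-corpus")`, inspecting `rows[0].keys()` shows which metadata columns the manifest actually provides, and the keys of `texts` can then be matched against whichever column holds the file names.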

### Downloading the corpus

The corpus can be downloaded here:

### Reproducing the corpus

You can reproduce the corpus using the code available [on GitHub](https://github.com/lmullen/ats-corpus). The texts themselves are too big for the GitHub repository.

### Copyright and license

All of the texts are in the public domain and were gathered from the [Internet Archive](https://archive.org/).

All code is licensed [MIT](https://opensource.org/licenses/MIT) by [Lincoln Mullen](http://lincolnmullen.com/), 2015.