https://github.com/lmullen/ats-corpus
A corpus of historical texts for the purpose of detecting similar documents and text reuse
Last synced: about 1 month ago
- Host: GitHub
- URL: https://github.com/lmullen/ats-corpus
- Owner: lmullen
- Created: 2015-09-10T00:41:13.000Z (over 9 years ago)
- Default Branch: master
- Last Pushed: 2015-09-11T20:49:03.000Z (over 9 years ago)
- Last Synced: 2024-10-28T04:59:17.635Z (3 months ago)
- Language: R
- Size: 172 KB
- Stars: 4
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
## American Tract Society corpus
This corpus contains plain text versions of publications by the American Tract Society between 1800 and 1900 (according to the Internet Archive's metadata, anyway). It was created for the purpose of testing document similarity and text reuse algorithms. The ATS frequently republished tracts under the same title and also published volumes that collected multiple tracts, so the corpus contains many examples of text reuse to be detected. (And of course, the documents are historically interesting in their own right.)
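To make "text reuse" concrete, here is a minimal sketch of one common way to score how much two documents overlap: split each text into word n-grams (shingles) and compute the Jaccard similarity of the two sets. This is a generic technique chosen for illustration, not necessarily the method this corpus was built to test; the function names and the default shingle length `n=5` are assumptions.

```python
def shingles(text, n=5):
    """Return the set of word n-grams (shingles) in a text.

    n=5 is an illustrative default, not a value taken from this project.
    """
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 0.0  # two empty documents: define the score as 0
    return len(a & b) / len(a | b)


# Two short passages that share most of their wording.
first = "the way of salvation is plain to all who seek it"
second = "the way of salvation is plain to those who seek it"
score = jaccard(shingles(first, n=3), shingles(second, n=3))
```

A score near 1 suggests one document largely repeats the other, while a score near 0 suggests little shared wording. Republished tracts and tracts reprinted inside collected volumes would be expected to score high against each other.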
The texts themselves are in the `corpus` directory. The file `manifest.csv` contains the file names in the corpus along with associated metadata.
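As a rough sketch of how the layout described above might be read, the following pairs each row of `manifest.csv` with the text of the matching file in `corpus`. The column name `filename` is hypothetical, since the manifest's actual headers are not listed here, and the sketch assumes the manifest stores bare file names relative to the `corpus` directory. Adjust both to match the real file.

```python
import csv
from pathlib import Path


def load_corpus(root, filename_column="filename"):
    """Map each file name in manifest.csv to (metadata row, document text).

    `filename_column` is a hypothetical header name; check manifest.csv
    for the real one before using this.
    """
    root = Path(root)
    documents = {}
    with open(root / "manifest.csv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            name = row[filename_column]
            # Assumes file names are relative to the corpus directory.
            path = root / "corpus" / name
            text = path.read_text(encoding="utf-8", errors="replace")
            documents[name] = (row, text)
    return documents
```

Calling `load_corpus(".")` from the directory that holds `manifest.csv` would return a dictionary keyed by file name, so the metadata for any text stays attached to it during later comparisons.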
### Downloading the corpus
The corpus can be downloaded here:
### Reproducing the corpus
You can reproduce the corpus using the code available [on GitHub](https://github.com/lmullen/ats-corpus). The texts themselves are too big for the GitHub repository.
### Copyright and license
All of the texts are in the public domain and were gathered from the [Internet Archive](https://archive.org/).
All code is licensed [MIT](https://opensource.org/licenses/MIT) by [Lincoln Mullen](http://lincolnmullen.com/), 2015.