https://github.com/christianmurphy/gutenberg-book-normalize
Normalize project Gutenberg books to a format easier for statistical models and machine learning to consume
- Host: GitHub
- URL: https://github.com/christianmurphy/gutenberg-book-normalize
- Owner: ChristianMurphy
- License: mit
- Created: 2018-12-01T16:51:10.000Z (about 6 years ago)
- Default Branch: master
- Last Pushed: 2018-12-02T04:15:12.000Z (about 6 years ago)
- Last Synced: 2024-12-27T16:13:25.491Z (11 days ago)
- Topics: book, machine-learning, natural-language-processing, project-gutenberg, statistics
- Language: JavaScript
- Size: 18.6 KB
- Stars: 2
- Watchers: 3
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: readme.md
- License: license
# Gutenberg Book Normalize
> Normalize Project Gutenberg books into a format that is easier for statistical models and machine learning to consume
## Installation
```bash
git clone git@github.com:ChristianMurphy/gutenberg-book-normalize.git
cd gutenberg-book-normalize
npm install
```

## Usage
### Download books
> Download all English-language Project Gutenberg books in HTML format

Follows the recommendations in Project Gutenberg's official robot access guide.

:warning: The download is over 75 gigabytes and can take 24 hours or more.
```bash
npm run gutenberg-download
```

### Extract books
> Unzips content into files and folders
```bash
npm run gutenberg-extract
```

### Normalize books
> Normalizes HTML content into an easier-to-process JSON format
```bash
npm run gutenberg-normalize
```

Example output:
```json
{
  "type": "book",
  "title": "lorem ipsum",
  "author": "lorem ipsum",
  "children": [
    {
      "type": "chapter",
      "title": "lorem ipsum",
      "level": "h2",
      "children": [
        {
          "type": "paragraph",
          "value": "lorem ipsum"
        }
      ]
    }
  ]
}
```

:notebook: The format conforms to [unist](https://github.com/syntax-tree/unist).
Any of the [unist utilities](https://github.com/syntax-tree/unist#list-of-utilities) can be used to process the content further.
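
As a rough illustration of what further processing might look like, the sketch below walks a tree shaped like the example output and collects the text of every paragraph. It is dependency-free on purpose: `visitNodes` and `collectParagraphs` are hypothetical helpers written for this example, not part of this project or of any unist utility, and the `book` object is sample data copied from the example output rather than a real normalized file. In practice a unist utility would likely replace the hand-written walker.

```javascript
// Sample data mirroring the example output above (an assumption for
// illustration; real normalized books will be much larger).
const book = {
  type: 'book',
  title: 'lorem ipsum',
  author: 'lorem ipsum',
  children: [
    {
      type: 'chapter',
      title: 'lorem ipsum',
      level: 'h2',
      children: [{ type: 'paragraph', value: 'lorem ipsum' }]
    }
  ]
};

// Hypothetical helper: depth-first walk over a unist-style tree that
// calls `visitor` for every node whose `type` matches.
function visitNodes(node, type, visitor) {
  if (node.type === type) {
    visitor(node);
  }
  // Leaf nodes have no `children`, so fall back to an empty list.
  for (const child of node.children || []) {
    visitNodes(child, type, visitor);
  }
}

// Hypothetical helper: gather the `value` of every paragraph node.
function collectParagraphs(tree) {
  const paragraphs = [];
  visitNodes(tree, 'paragraph', (node) => paragraphs.push(node.value));
  return paragraphs;
}

console.log(collectParagraphs(book));
```

The resulting array of paragraph strings could then be passed to a tokenizer or a statistical model. Keeping the walker generic over `type` means the same helper can also be used to pull out `chapter` nodes and their titles.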