https://github.com/jferrl/gutemberg-analysis
Gutenberg corpus analysis with Apache Hadoop
analysis gutemberg hadoop java
- Host: GitHub
- URL: https://github.com/jferrl/gutemberg-analysis
- Owner: jferrl
- License: mit
- Created: 2019-11-06T17:01:03.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2020-12-10T22:57:25.000Z (over 5 years ago)
- Last Synced: 2025-01-19T19:26:10.234Z (over 1 year ago)
- Topics: analysis, gutemberg, hadoop, java
- Language: Java
- Homepage:
- Size: 50.8 KB
- Stars: 0
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Gutemberg Analysis
[Build status](https://travis-ci.org/jferrl/gutemberg-analysis)
[Maintainability](https://codeclimate.com/github/jferrl/gutemberg-analysis/maintainability)
This project analyzes the Project Gutenberg linguistic corpus. The Java code is also intended to be adapted to run on Hadoop.
## Gutenberg Dataset
This is a collection of 3,036 English books written by 142 authors. This collection is a small subset of the Project Gutenberg corpus. All books have been manually cleaned to remove metadata, license information, and transcribers' notes, as much as possible.
Link to dataset: https://drive.google.com/file/d/0B2Mzhc7popBga2RkcWZNcjlRTGM/edit
## Tests performed
- Tokenize the dataset into sentences
- Find the 10 most used words
- Count the total number of words
- Find valid numeric words
- Compute the average paragraph size
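
The analyses listed above can be sketched in plain Java as follows. This is only an illustration, not the project's actual implementation: the class `CorpusStats` and all of its method names are hypothetical, and several definitions are assumptions made for the example (a word is a run of letters, digits, or apostrophes; a "numeric word" is a token made only of digits; paragraphs are separated by blank lines and their size is measured in words).

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical helper class; none of these names come from the repository.
final class CorpusStats {

    // Split text into sentences with the JDK's locale-aware sentence iterator.
    static List<String> sentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        List<String> result = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                result.add(sentence);
            }
        }
        return result;
    }

    // Assumption: a word is a run of letters, digits, or apostrophes, lowercased.
    static List<String> words(String text) {
        return Arrays.stream(text.toLowerCase(Locale.ENGLISH).split("[^\\p{L}\\p{Nd}']+"))
                .filter(w -> !w.isEmpty())
                .collect(Collectors.toList());
    }

    // Total number of word tokens in the text.
    static long totalWords(String text) {
        return words(text).size();
    }

    // The n most frequent words, most frequent first; ties are broken alphabetically.
    static List<String> topWords(String text, int n) {
        Map<String, Long> counts = words(text).stream()
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<String, Long>comparingByKey()))
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    // Assumption: a "valid numeric word" is a token consisting only of ASCII digits.
    static List<String> numericWords(String text) {
        return words(text).stream()
                .filter(w -> w.matches("\\d+"))
                .collect(Collectors.toList());
    }

    // Assumption: paragraphs are separated by blank lines; size is counted in words.
    static double averageParagraphWords(String text) {
        return Arrays.stream(text.split("\\R\\s*\\R"))
                .filter(p -> !p.trim().isEmpty())
                .mapToInt(p -> words(p).size())
                .average()
                .orElse(0.0);
    }
}
```

For the 10 most used words, a caller would use `CorpusStats.topWords(text, 10)`. Everything here runs in memory on a single string, which keeps the sketch short but would not scale to the full dataset without streaming the files or moving the counting to Hadoop.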
## Team Members
- Luis Gómez García
- Cansu Ozturk
- Jorge Ferrero Linacero
## How to execute it
From VS Code:
- Run
- Debug
The program takes the location of the Gutenberg dataset as its first execution argument (`args[0]` = path to the Gutenberg dataset).
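
As a rough sketch of how an entry point might consume that argument, the example below validates `args[0]` and lists the book files under it. The class name `DatasetArgs` and its methods are hypothetical, and the layout of the dataset (a `.txt` file or a directory tree of `.txt` files) is an assumption; the repository's real main class may work differently.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical entry point; the repository's actual main class may differ.
final class DatasetArgs {

    // Validate args[0] and return it as a path to an existing file or directory.
    static Path datasetPath(String[] args) {
        if (args.length < 1) {
            throw new IllegalArgumentException("Usage: <path to Gutenberg dataset>");
        }
        Path path = Paths.get(args[0]);
        if (!Files.exists(path)) {
            throw new IllegalArgumentException("Dataset not found: " + path);
        }
        return path;
    }

    // Assumption: the dataset is a .txt file or a directory tree of .txt files.
    static List<Path> bookFiles(Path dataset) throws IOException {
        try (Stream<Path> paths = Files.walk(dataset)) {
            return paths.filter(Files::isRegularFile)
                    .filter(p -> p.toString().endsWith(".txt"))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        List<Path> books = bookFiles(datasetPath(args));
        System.out.println("Found " + books.size() + " book files");
    }
}
```

Failing early with a usage message when the argument is missing makes a misconfigured Run or Debug launch easier to diagnose than a later `ArrayIndexOutOfBoundsException`.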