https://github.com/jferrl/gutemberg-analysis
Gutemberg corpus analysis with apache hadoop
Gutemberg corpus analysis with apache hadoop
- Host: GitHub
- URL: https://github.com/jferrl/gutemberg-analysis
- Owner: jferrl
- License: mit
- Created: 2019-11-06T17:01:03.000Z (about 5 years ago)
- Default Branch: master
- Last Pushed: 2020-12-10T22:57:25.000Z (about 4 years ago)
- Last Synced: 2024-11-18T22:06:16.786Z (2 months ago)
- Topics: analysis, gutemberg, hadoop, java
- Language: Java
- Homepage:
- Size: 50.8 KB
- Stars: 0
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Gutemberg Analysis
[![Build Status](https://travis-ci.org/jferrl/gutemberg-analysis.svg?branch=master)](https://travis-ci.org/jferrl/gutemberg-analysis)
[![Maintainability](https://api.codeclimate.com/v1/badges/3dcbeed599bb53561265/maintainability)](https://codeclimate.com/github/jferrl/gutemberg-analysis/maintainability)

This project was created to analyze the Project Gutenberg linguistic corpus. The Java project will also be prepared so that it can be adapted to run on Hadoop.
## Gutenberg Dataset
This is a collection of 3,036 English books written by 142 authors. This collection is a small subset of the Project Gutenberg corpus. All books have been manually cleaned to remove metadata, license information, and transcribers' notes, as much as possible.
Link to dataset: https://drive.google.com/file/d/0B2Mzhc7popBga2RkcWZNcjlRTGM/edit
## Tests performed
- Tokenize the dataset into sentences
- Find the 10 most used words
- Count the total number of words
- Find valid numeric words
- Compute the average paragraph size

## Team Members
- Luis Gómez García
- Cansu Ozturk
- Jorge Ferrero Linacero

## How to execute it
From VS Code:
- Run
- Debug

The program accepts the location of the Gutenberg dataset as an execution argument (`args[0]` = path of the Gutenberg dataset).
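
The analyses listed under "Tests performed" could be sketched roughly as follows. This is a minimal, standalone illustration and not the code in this repository: the class name `CorpusSketch` and all method names are hypothetical. It assumes that a word is a run of letters or digits, that paragraphs are separated by blank lines, that "numeric words" means tokens made only of digits, and that the dataset path in `args[0]` is a file or a directory of UTF-8 text files small enough to load into memory.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch; names are illustrative and not taken from the repository.
public class CorpusSketch {

    // Lowercase the text and split on runs of characters that are neither letters nor digits.
    // Assumption: this also splits contractions such as "don't" into two tokens.
    static List<String> words(String text) {
        return Arrays.stream(text.toLowerCase(Locale.ENGLISH).split("[^\\p{L}\\p{N}]+"))
                .filter(w -> !w.isEmpty())
                .collect(Collectors.toList());
    }

    static long totalWords(String text) {
        return words(text).size();
    }

    // The n most frequent words, ordered by descending count, ties broken alphabetically.
    static List<Map.Entry<String, Long>> topWords(String text, int n) {
        Map<String, Long> counts = words(text).stream()
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
        return counts.entrySet().stream()
                .sorted((a, b) -> {
                    int byCount = Long.compare(b.getValue(), a.getValue());
                    return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
                })
                .limit(n)
                .collect(Collectors.toList());
    }

    // Sentence tokenization using the JDK's locale-aware BreakIterator.
    static List<String> sentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        List<String> result = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                result.add(sentence);
            }
        }
        return result;
    }

    // One possible reading of "numeric words": tokens consisting only of digits.
    static List<String> numericWords(String text) {
        return words(text).stream()
                .filter(w -> w.matches("\\d+"))
                .collect(Collectors.toList());
    }

    // Average paragraph size in words, treating blank lines as paragraph separators.
    static double averageParagraphWords(String text) {
        return Arrays.stream(text.split("\\R\\s*\\R"))
                .mapToLong(CorpusSketch::totalWords)
                .filter(n -> n > 0)
                .average()
                .orElse(0.0);
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("Usage: CorpusSketch <path-to-dataset>");
            return;
        }
        Path root = Paths.get(args[0]);
        List<Path> files;
        try (Stream<Path> walk = Files.walk(root)) {
            files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        // Concatenate all files, separated by blank lines so paragraphs do not merge.
        StringBuilder corpus = new StringBuilder();
        for (Path file : files) {
            corpus.append(new String(Files.readAllBytes(file), StandardCharsets.UTF_8)).append("\n\n");
        }
        String text = corpus.toString();
        System.out.println("Sentences: " + sentences(text).size());
        System.out.println("Total words: " + totalWords(text));
        System.out.println("Top 10 words: " + topWords(text, 10));
        System.out.println("Numeric words found: " + numericWords(text).size());
        System.out.println("Average paragraph size (words): " + averageParagraphWords(text));
    }
}
```

Compiled with `javac CorpusSketch.java`, it can be run as `java CorpusSketch <path-to-dataset>`, mirroring the `args[0]` convention described above. For the full corpus, a streaming or Hadoop-based approach would avoid holding all of the text in memory at once.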