https://github.com/jferrl/gutemberg-analysis

Gutemberg corpus analysis with apache hadoop
Host: GitHub
URL: https://github.com/jferrl/gutemberg-analysis
Owner: jferrl
License: mit
Created: 2019-11-06T17:01:03.000Z (over 6 years ago)
Default Branch: master
Last Pushed: 2020-12-10T22:57:25.000Z (over 5 years ago)
Last Synced: 2025-01-19T19:26:10.234Z (over 1 year ago)
Topics: analysis, gutemberg, hadoop, java
Language: Java
Homepage:
Size: 50.8 KB
Stars: 0
Watchers: 3
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# Gutemberg Analysis

[![Build Status](https://travis-ci.org/jferrl/gutemberg-analysis.svg?branch=master)](https://travis-ci.org/jferrl/gutemberg-analysis)
[![Maintainability](https://api.codeclimate.com/v1/badges/3dcbeed599bb53561265/maintainability)](https://codeclimate.com/github/jferrl/gutemberg-analysis/maintainability)

This project analyzes the Project Gutenberg linguistic corpus. The Java code is also intended to be adapted so that it can run on Hadoop.

## Gutenberg Dataset

This is a collection of 3,036 English books written by 142 authors. This collection is a small subset of the Project Gutenberg corpus. All books have been manually cleaned to remove metadata, license information, and transcribers' notes, as much as possible.

Link to dataset: https://drive.google.com/file/d/0B2Mzhc7popBga2RkcWZNcjlRTGM/edit

## Tests performed

- Split the dataset into sentences
- Find the 10 most frequently used words
- Count the total number of words
- Find valid numeric words
- Compute the average paragraph size
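
The analyses listed above could be sketched in plain Java roughly as follows. This is a minimal illustration, not the project's actual code: the class name `CorpusStats`, the method names, the definition of a word as a run of letters, and the use of blank lines as paragraph separators are all assumptions made for the example. Numeric-word detection is left out because the README does not define what counts as a valid numeric word.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical helper class; names and tokenization rules are assumptions.
final class CorpusStats {

    // Assumption: a word is a maximal run of Unicode letters, lower-cased.
    static List<String> words(String text) {
        return Arrays.stream(text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+"))
                .filter(w -> !w.isEmpty())
                .collect(Collectors.toList());
    }

    // Splits text into sentences using the JDK's English sentence rules.
    static List<String> sentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                out.add(sentence);
            }
        }
        return out;
    }

    // Returns the n most frequent words; ties are broken alphabetically.
    static List<Map.Entry<String, Long>> topWords(String text, int n) {
        Map<String, Long> counts = words(text).stream()
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
        return counts.entrySet().stream()
                .sorted((a, b) -> {
                    int byCount = Long.compare(b.getValue(), a.getValue());
                    return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
                })
                .limit(n)
                .collect(Collectors.toList());
    }

    // Assumption: paragraphs are separated by one or more blank lines,
    // and paragraph size is measured in words.
    static double averageParagraphWords(String text) {
        return Arrays.stream(text.split("\\R\\s*\\R"))
                .map(CorpusStats::words)
                .filter(ws -> !ws.isEmpty())
                .mapToInt(List::size)
                .average()
                .orElse(0.0);
    }
}
```

The total number of words is then `CorpusStats.words(text).size()`, and the 10 most used words are `CorpusStats.topWords(text, 10)`. Grouping by word and counting is also the shape a Hadoop word-count job takes, with the mapper emitting words and the reducer summing counts.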

## Team Members

- Luis Gómez García
- Cansu Ozturk
- Jorge Ferrero Linacero

## How to execute it

From VS Code:

- Run
- Debug

The program takes the location of the Gutenberg dataset as its first command-line argument (args[0] = path to the Gutenberg dataset).
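
As a rough sketch of how an entry point might read that argument, the example below validates `args[0]` before doing any work. The class name `DatasetArgs`, the helper `resolveDatasetPath`, and the usage message are hypothetical and are not taken from the repository.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical entry point; the project's real main class may differ.
final class DatasetArgs {

    // Returns the dataset path from args[0], or fails with a clear message.
    static Path resolveDatasetPath(String[] args) {
        if (args == null || args.length < 1 || args[0].trim().isEmpty()) {
            throw new IllegalArgumentException(
                    "Missing argument: args[0] must be the path of the Gutenberg dataset");
        }
        return Paths.get(args[0].trim());
    }

    public static void main(String[] args) {
        Path dataset;
        try {
            dataset = resolveDatasetPath(args);
        } catch (IllegalArgumentException e) {
            System.err.println(e.getMessage());
            System.exit(1);
            return;
        }
        if (!Files.exists(dataset)) {
            System.err.println("Dataset not found: " + dataset);
            System.exit(1);
        }
        System.out.println("Using dataset at: " + dataset.toAbsolutePath());
    }
}
```

When running from VS Code, the argument can be supplied through the `args` field of the Java launch configuration in `.vscode/launch.json`.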
