{"id":20932907,"url":"https://github.com/jferrl/gutemberg-analysis","last_synced_at":"2026-05-20T11:38:43.058Z","repository":{"id":55861755,"uuid":"220047810","full_name":"jferrl/gutemberg-analysis","owner":"jferrl","description":"Gutemberg corpus analysis with apache hadoop","archived":false,"fork":false,"pushed_at":"2020-12-10T22:57:25.000Z","size":52,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-01-19T19:26:10.234Z","etag":null,"topics":["analysis","gutemberg","hadoop","java"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jferrl.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-11-06T17:01:03.000Z","updated_at":"2020-12-10T22:57:22.000Z","dependencies_parsed_at":"2022-08-15T08:00:40.919Z","dependency_job_id":null,"html_url":"https://github.com/jferrl/gutemberg-analysis","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jferrl%2Fgutemberg-analysis","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jferrl%2Fgutemberg-analysis/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jferrl%2Fgutemberg-analysis/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jferrl%2Fgutemberg-analysis/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jferrl","download_url":"https://codeload.github.com/jferrl/gutemberg-analysis/tar.gz/refs/heads/master","host":{"name
":"GitHub","url":"https://github.com","kind":"github","repositories_count":243325421,"owners_count":20273287,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["analysis","gutemberg","hadoop","java"],"created_at":"2024-11-18T21:53:44.465Z","updated_at":"2025-12-28T11:58:05.279Z","avatar_url":"https://github.com/jferrl.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Gutemberg Analysis\n\n[![Build Status](https://travis-ci.org/jferrl/gutemberg-analysis.svg?branch=master)](https://travis-ci.org/jferrl/gutemberg-analysis)\n[![Maintainability](https://api.codeclimate.com/v1/badges/3dcbeed599bb53561265/maintainability)](https://codeclimate.com/github/jferrl/gutemberg-analysis/maintainability)\n\nThis project has been created with the purpose of analyzing the linguistic corpus of Gutemberg. In addition, this java project will be prepared to adapt it to a Hadoop execution.\n\n## Gutenberg Dataset\n\nThis is a collection of 3,036 English books written by 142 authors. This collection is a small subset of the Project Gutenberg corpus. 
All books have been manually cleaned to remove metadata, license information, and transcribers' notes as far as possible.\n\nLink to dataset: https://drive.google.com/file/d/0B2Mzhc7popBga2RkcWZNcjlRTGM/edit\n\n## Tests performed\n\n-   Tokenize the dataset into sentences\n-   Find the 10 most used words\n-   Count the total number of words\n-   Find valid numeric words\n-   Compute the average paragraph size\n\n## Team Members\n\n-   Luis Gómez García\n-   Cansu Ozturk\n-   Jorge Ferrero Linacero\n\n## How to execute it\n\nFrom VS Code:\n\n-   Run\n-   Debug\n\nThe program accepts the location of the Gutenberg dataset as its first execution argument (args[0] = path of the Gutenberg dataset).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjferrl%2Fgutemberg-analysis","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjferrl%2Fgutemberg-analysis","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjferrl%2Fgutemberg-analysis/lists"}