Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/kylase/data-science-capstone
Data Processing scripts for Coursera Data Science Capstone (Swiftkey)
- Host: GitHub
- URL: https://github.com/kylase/data-science-capstone
- Owner: kylase
- Created: 2015-10-18T15:20:03.000Z (about 9 years ago)
- Default Branch: master
- Last Pushed: 2015-12-20T06:41:51.000Z (about 9 years ago)
- Last Synced: 2024-11-01T07:42:49.997Z (about 2 months ago)
- Topics: coursera, data-science, linux, natural-language-processing, python
- Language: HTML
- Homepage:
- Size: 182 KB
- Stars: 0
- Watchers: 2
- Forks: 5
- Open Issues: 0
- Metadata Files:
  - Readme: README.md
README
# Data Science Capstone Data Processing Scripts (July to August 2015)
This repository contains the code and scripts used to process the raw text files obtained from [Coursera](https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip).
The objective of these scripts is to demonstrate data engineering rather than building a smoothed ngram model. The scripts managed to process the entire corpus and generate the ngrams and their counts, which is hardly feasible in R alone. The resulting data can then be used to develop the predictive text algorithm.
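As a rough illustration of the approach (a minimal sketch, not the repository's actual code, and written for Python 3 even though the repo targets Python 2.7), ngram counts can be accumulated line by line so the full corpus never has to fit in memory at once:

```python
# Sketch of streaming ngram counting; the file name is an example from the
# SwiftKey dataset, not necessarily how this repo's scripts name things.
from collections import Counter
from nltk.util import ngrams  # nltk 3.0 is listed as a requirement below

def count_ngrams(path, n=3):
    counts = Counter()
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            tokens = line.lower().split()  # naive whitespace tokenization
            for gram in ngrams(tokens, n):
                counts[" ".join(gram)] += 1
    return counts

if __name__ == "__main__":
    counts = count_ngrams("en_US.twitter.txt")
    for gram, c in counts.most_common(10):
        print(gram, c)
```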
## Requirements
### Hardware
1. At least 16 GB of RAM
2. Most of the scripts are single-threaded, so a dual-core machine is sufficient

### Software
1. Linux (tested on Ubuntu 14.04; should work on most Linux distributions)
2. Python (version 2.7)
   1. nltk (version 3.0)
   2. numpy

## Scripts
Run the following scripts in order:
1. `1-get_src_data.sh`
   > Downloads the raw corpus file
2. `2-extract_data.sh`
   > Unpacks the raw corpus file and removes the non-English data files
3. `3-data_cleaning.sh`
   > Cleans up the corpus by replacing Unicode characters with their ASCII equivalents (see the first sketch after this list)
4. `4-generate_grams.sh`
   > Uses the Python scripts (`generate_grams.py` and `generate_unigrams.py`) to generate the ngrams (up to n = 3) and their counts for each corpus separately (therefore unaggregated) as CSV files
5. `5-post_process.sh`
   > Removes invalid ngrams and aggregates the counts of common ngrams (see the second sketch after this list)
6. `6-compress.sh`
   > Compresses the files of aggregated counts with bzip2 (optional)
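For step 3, the shell script's exact commands are not shown here; as a hypothetical illustration of the idea in Python (the script may well use `iconv`, `sed`, or similar instead), Unicode text can be folded down to ASCII like this:

```python
# Hypothetical sketch of the Unicode-to-ASCII cleaning in step 3.
# NFKD decomposition splits accented letters into a base letter plus
# combining marks, which the ASCII encode step then drops.
import unicodedata

def to_ascii(text):
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(to_ascii(u"naïve café résumé"))  # -> "naive cafe resume"
```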
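And for step 5, merging the per-corpus counts might look like the following. This is a sketch only: the two-column `ngram,count` CSV layout and the file names are assumptions, not taken from the repository.

```python
# Hypothetical sketch of step 5's aggregation: sum counts for ngrams
# that appear in more than one corpus, then write a single CSV.
import csv
from collections import Counter

totals = Counter()
for path in ["blogs_3grams.csv", "news_3grams.csv", "twitter_3grams.csv"]:
    with open(path, newline="") as f:
        for ngram, count in csv.reader(f):
            totals[ngram] += int(count)

with open("3grams_aggregated.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for ngram, count in totals.most_common():
        writer.writerow([ngram, count])
```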