https://github.com/hamidzr/freq-analysis

python map-reduce freq analysis with basic stemmer

Host: GitHub
URL: https://github.com/hamidzr/freq-analysis
Owner: hamidzr
Created: 2019-06-10T03:57:54.000Z (almost 6 years ago)
Default Branch: master
Last Pushed: 2019-06-17T22:15:30.000Z (almost 6 years ago)
Last Synced: 2025-01-21T00:45:44.537Z (4 months ago)
Topics: frequency-analysis, map-reduce
Language: Python
Size: 644 KB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# Frequency Analyzer - Human Practice - HZ

## Prompt
- The application should accept a text document from the user, count how often each word is used in it, and report the top 25 most frequently used via a friendly and attractive web page.
- In order to make the results more useful, the analysis should extract the stems of the words so that different inflections of the same word are all counted in the same bucket. Use the following categories when stemming:
  - Regularly conjugated English verbs. For example, consider "talk", "talks", "talking", and "talked" to all be forms of "talk", and "passes", "passed", and "passing" to all be forms of "pass".
  - Regularly pluralized English nouns. For example, consider "cat" and "cats" to be forms of "cat".

- Exclude common English stop words from your counts. Allow the user to decide whether to exclude stop words from their analysis.
- Save the 10 most recent frequency analyses (original text, stop words setting, and resulting word frequencies), allowing the user to navigate back to view a previous analysis for comparison. These persisted analyses should survive a restart of the server process.
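
The stemming and counting requirements above can be sketched roughly as follows. This is a minimal illustration, not the code in this repository: the names `stem`, `top_words`, and `STOP_WORDS` are hypothetical, the stop word set is a tiny placeholder, and the suffix rules are deliberately naive (they only cover the regular cases from the prompt and will mangle words such as "speed" or "this").

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop word list; a real list would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "or", "of", "to", "in", "is", "it"}


def stem(word):
    """Strip regular verb and plural suffixes (naive, assumption-laden rules)."""
    w = word.lower()
    if w.endswith("ing") and len(w) > 5:
        return w[:-3]  # talking -> talk, passing -> pass
    if w.endswith("ed") and len(w) > 4:
        return w[:-2]  # talked -> talk, passed -> pass
    if w.endswith("es") and w[:-2].endswith(("s", "x", "z", "ch", "sh")):
        return w[:-2]  # passes -> pass
    if w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        return w[:-1]  # talks -> talk, cats -> cat
    return w


def top_words(text, n=25, exclude_stop_words=True):
    """Return up to n (stem, count) pairs, most frequent first."""
    # "Map" step: split the text into lowercase alphabetic tokens.
    tokens = re.findall(r"[a-z]+", text.lower())
    if exclude_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    # "Reduce" step: count tokens per stem.
    counts = Counter(stem(t) for t in tokens)
    return counts.most_common(n)


if __name__ == "__main__":
    sample = "The cat talks. Cats talked and talking cats pass; she passes."
    for word, count in top_words(sample):
        print(word, count)
```

Passing `exclude_stop_words=False` keeps words like "the" in the counts, which corresponds to the user-facing toggle described in the prompt.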

## Steps
- count the tokens
- group using the stemming algorithm: conjugated and plural words
- UI
- save and present history
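
One possible shape for the "save and present history" step is sketched below using only the standard library `sqlite3` module. The table name `analyses`, its columns, and the helper names are assumptions made for illustration; the project itself may use SQLAlchemy and a different schema.

```python
import json
import sqlite3

MAX_HISTORY = 10  # the prompt asks for the 10 most recent analyses


def open_db(path=":memory:"):
    """Open the database and create the (hypothetical) analyses table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS analyses (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               text TEXT NOT NULL,
               exclude_stop_words INTEGER NOT NULL,
               frequencies TEXT NOT NULL
           )"""
    )
    return conn


def save_analysis(conn, text, exclude_stop_words, frequencies):
    """Store one analysis, then drop everything older than the newest 10."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO analyses (text, exclude_stop_words, frequencies) "
            "VALUES (?, ?, ?)",
            (text, int(exclude_stop_words), json.dumps(frequencies)),
        )
        conn.execute(
            "DELETE FROM analyses WHERE id NOT IN "
            "(SELECT id FROM analyses ORDER BY id DESC LIMIT ?)",
            (MAX_HISTORY,),
        )


def load_history(conn):
    """Return saved analyses, newest first."""
    rows = conn.execute(
        "SELECT id, text, exclude_stop_words, frequencies "
        "FROM analyses ORDER BY id DESC"
    ).fetchall()
    return [
        {
            "id": row[0],
            "text": row[1],
            "exclude_stop_words": bool(row[2]),
            "frequencies": json.loads(row[3]),
        }
        for row in rows
    ]
```

Opening the database with a file path instead of `:memory:` is what would let the saved analyses survive a restart of the server process.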

## TODO
- [x] create simple webserver, flask
- [x] simple database setup, sqlite, sqlalchemy?
- [x] set up endpoints
- [x] create a UI

- [ ] do we want non-alphanumeric strings? what about pure numbers?

- [ ] improve stemming
- [ ] add unit and e2e testing
- [ ] improve the frontend build setup, minify, etc using webpack?
- [ ] decouple request submission from getting the response. non-blocking: sockets? polling?
- [ ] use a logger instead of console.log
- [ ] keep it consistent between `camelCase` vs `snake_case`
- [ ] dev tools
  - [ ] add linters
  - [ ] watch and hot reload
  - [ ] create an integrated build script

## Installation
- run `pipenv install` to install pip dependencies
- run `pipenv shell` to enter the created virtualenv, or prepend all other commands with `pipenv run`

- cd to `src/client` and run `npm i && npm run build`

### Env variables
```
export FLASK_APP=src/server/server.py
export FLASK_ENV=development
export FLASK_DEBUG=1
```

### Requirements
- python 3.7+
- pipenv
- nodejs 8+
