https://github.com/khinshankhan/nlp-tf-idf-hadoop
NLP analysis of Term Frequency - Inverse Document Frequency using Hadoop
hadoop mapreduce nlp tf-idf
- Host: GitHub
- URL: https://github.com/khinshankhan/nlp-tf-idf-hadoop
- Owner: khinshankhan
- License: mit
- Created: 2019-11-26T21:41:18.000Z (almost 6 years ago)
- Default Branch: master
- Last Pushed: 2019-12-19T03:46:45.000Z (almost 6 years ago)
- Last Synced: 2025-01-19T21:46:52.245Z (9 months ago)
- Topics: hadoop, mapreduce, nlp, tf-idf
- Language: Python
- Size: 7.22 MB
- Stars: 4
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# nlp-tf-idf-hadoop
NLP analysis of Term Frequency - Inverse Document Frequency using Hadoop
Khan_Rafi: Khinshan Khan and Shakil Rafi
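For readers unfamiliar with the metric, the following is a minimal pure-Python sketch of TF-IDF. It is not the project's Spark implementation: the function names `tokenize` and `tf_idf` are hypothetical, and the weighting shown (relative term frequency times `log(N / df)`) is only one common variant, assumed here because the README does not state which formula the project uses.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tf_idf(documents):
    """Return one {term: score} dict per input document.

    Assumed variant (not taken from the project):
      tf(t, d) = occurrences of t in d / number of tokens in d
      idf(t)   = log(N / number of documents containing t)
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


if __name__ == "__main__":
    # Illustrative documents, not data from the project.
    docs = ["the gene regulates the disease", "the protein binds the gene"]
    for doc_scores in tf_idf(docs):
        print(sorted(doc_scores.items(), key=lambda kv: -kv[1]))
```

Under this variant, a term that occurs in every document scores zero, so only terms that distinguish one document from the others receive weight.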
## Requirements
- Apache Spark
  - `pyspark` must be on the path
- Python 3
  - Note: versions 3.8 and above do not work well with Spark
- Python packages available in the environment:
  - math
  - re
  - sys
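The requirements above can be checked before running anything. The snippet below is a small sketch of such a check and is not part of the project; the function name `environment_problems` is hypothetical, and the version rule simply mirrors the note above, which may not hold for newer Spark releases.

```python
import shutil
import sys


def environment_problems(version_info=sys.version_info, pyspark_path=None):
    """Return a list of human-readable problems; an empty list means OK.

    version_info: a (major, minor, ...) tuple, defaulting to the running interpreter.
    pyspark_path: location of the `pyspark` executable, or None if not found.
    """
    problems = []
    if version_info[0] != 3:
        problems.append("Python 3 is required")
    elif tuple(version_info[:2]) >= (3, 8):
        # Mirrors the README's note; assumed to apply to the Spark version in use.
        problems.append("Python 3.8+ may not work well with Spark")
    if pyspark_path is None:
        problems.append("pyspark was not found on the path")
    return problems


if __name__ == "__main__":
    # shutil.which looks up an executable on PATH and returns None if absent.
    for problem in environment_problems(pyspark_path=shutil.which("pyspark")):
        print("warning:", problem)
```

The three listed packages (`math`, `re`, `sys`) ship with Python's standard library, so the sketch does not check for them.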
## Run
One can run the project two ways:
- Traditional Way
```bash
spark-submit app.py
cat output
```
- Abstracted Way
```bash
make FILE= QUERY=
```
## Notes
- Running the program will write relevant output to `output` rather than stdout