Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/keval-gandevia/bigdataetlandsentimentanalysis
A Java-based project that extracts news articles from large .sgm files, processes them, and loads them into a MongoDB database. It includes an Apache Spark job for word frequency analysis directly from .sgm files, and a sentiment analysis implementation using a Bag-of-Words model in Java.
apache-spark bag-of-words big-data dataproc-cluster etl gcp java mongodb nosql regex sentiment-analysis solid-principles
Last synced: 6 days ago
- Host: GitHub
- URL: https://github.com/keval-gandevia/bigdataetlandsentimentanalysis
- Owner: Keval-Gandevia
- Created: 2024-08-22T15:59:54.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2024-08-22T17:00:08.000Z (6 months ago)
- Last Synced: 2025-02-02T01:31:43.499Z (19 days ago)
- Topics: apache-spark, bag-of-words, big-data, dataproc-cluster, etl, gcp, java, mongodb, nosql, regex, sentiment-analysis, solid-principles
- Language: Java
- Homepage:
- Size: 437 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# BigDataETLAndSentimentAnalysis
## Overview
This project processes and analyzes Reuters news data. It includes:
- A Java application for parsing and storing news articles in MongoDB.
- An Apache Spark job for word frequency analysis directly from .sgm files.
- A Java-based sentiment analysis implementation using a Bag-of-Words model that assigns a polarity to each word.
## Features
- **Data Parsing and Storage:** Extracts news articles from .sgm files and stores them in a MongoDB database.
- **Word Frequency Analysis:** Utilizes Apache Spark to count word frequencies in news articles.
- **Sentiment Analysis:** Implements a Bag-of-Words model in Java to classify news article titles as positive, negative, or neutral.
## Technologies Used
- **Java**
- **MongoDB**
- **Apache Spark**
- **Bag-of-Words Model**
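## Illustrative Sketch
The parsing and sentiment analysis features listed above can be illustrated with a minimal sketch in plain Java. It is not the repository's code. It assumes the .sgm files follow the Reuters-21578 layout, where each article is wrapped in `<REUTERS>` and its title in `<TITLE>`. The class name `TitleSentimentSketch`, its method names, and the two tiny word lists are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: names, word lists, and tag layout are assumptions,
// not taken from the repository.
class TitleSentimentSketch {

    // Assumed layout: each title sits inside <TITLE>...</TITLE>.
    // DOTALL lets a title span several lines; the reluctant quantifier
    // stops at the first closing tag.
    private static final Pattern TITLE =
            Pattern.compile("<TITLE>(.*?)</TITLE>", Pattern.DOTALL);

    // Tiny illustrative lexicons; a real run would load larger word lists.
    private static final Set<String> POSITIVE =
            Set.of("gain", "gains", "rise", "rises", "profit", "growth", "strong");
    private static final Set<String> NEGATIVE =
            Set.of("loss", "losses", "fall", "falls", "drop", "weak", "crisis");

    // Collects the text of every <TITLE> element found in the input.
    static List<String> extractTitles(String sgm) {
        List<String> titles = new ArrayList<>();
        Matcher m = TITLE.matcher(sgm);
        while (m.find()) {
            titles.add(m.group(1).trim());
        }
        return titles;
    }

    // Bag-of-Words score: +1 per positive word, -1 per negative word.
    // Word order is ignored; only word occurrences count.
    static int score(String title) {
        int score = 0;
        for (String token : title.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (POSITIVE.contains(token)) {
                score++;
            } else if (NEGATIVE.contains(token)) {
                score--;
            }
        }
        return score;
    }

    // Maps the score to one of the three labels named in the README.
    static String classify(String title) {
        int s = score(title);
        return s > 0 ? "positive" : s < 0 ? "negative" : "neutral";
    }

    public static void main(String[] args) {
        String sample =
                "<REUTERS><TITLE>Company profit rises on strong growth</TITLE>"
              + "<BODY>Sample body.</BODY></REUTERS>\n"
              + "<REUTERS><TITLE>Shares fall after quarterly loss</TITLE>"
              + "<BODY>Sample body.</BODY></REUTERS>";
        for (String title : extractTitles(sample)) {
            System.out.println(classify(title) + " : " + title);
        }
    }
}
```

Scoring by simple counts keeps the model order-independent, which is the defining property of a Bag-of-Words approach. A title with equal positive and negative counts, or with no lexicon words at all, is labeled neutral.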