https://github.com/keval-gandevia/bigdataetlandsentimentanalysis

A Java-based project that extracts news articles from large .sgm files, processes them, and loads them into a MongoDB database. It includes an Apache Spark job for word frequency analysis directly from .sgm files, and a sentiment analysis implementation using a Bag-of-Words model in Java.

apache-spark bag-of-words big-data dataproc-cluster etl gcp java mongodb nosql regex sentiment-analysis solid-principles




# BigDataETLAndSentimentAnalysis

## Overview
This project processes and analyzes Reuters news data stored in .sgm files. It includes:
- A Java application for parsing and storing news articles in MongoDB.
- An Apache Spark job for word frequency analysis directly from .sgm files.
- A Java-based sentiment analysis implementation using a Bag-of-Words model that assigns a polarity to each word.
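
The parsing step can be sketched with regular expressions, as a minimal illustration rather than the project's actual code. It assumes each article is wrapped in a `<REUTERS ...>` element containing `<TITLE>` and `<BODY>` tags, as in the Reuters-21578 collection; the class and method names (`SgmArticleExtractor`, `extract`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: names and tag layout are assumptions, not taken from the repository.
class SgmArticleExtractor {

    // DOTALL lets '.' span line breaks; the reluctant '.*?' stops at the first closing tag.
    private static final Pattern ARTICLE =
            Pattern.compile("<REUTERS[^>]*>(.*?)</REUTERS>", Pattern.DOTALL);
    private static final Pattern TITLE =
            Pattern.compile("<TITLE>(.*?)</TITLE>", Pattern.DOTALL);
    private static final Pattern BODY =
            Pattern.compile("<BODY>(.*?)</BODY>", Pattern.DOTALL);

    // Returns one map per article with the keys "title" and "body".
    static List<Map<String, String>> extract(String sgm) {
        List<Map<String, String>> articles = new ArrayList<>();
        Matcher m = ARTICLE.matcher(sgm);
        while (m.find()) {
            String block = m.group(1);
            Map<String, String> article = new LinkedHashMap<>();
            article.put("title", firstGroup(TITLE, block));
            article.put("body", firstGroup(BODY, block));
            articles.add(article);
        }
        return articles;
    }

    // Trimmed first capture, or an empty string when the tag is missing.
    private static String firstGroup(Pattern p, String text) {
        Matcher m = p.matcher(text);
        return m.find() ? m.group(1).trim() : "";
    }

    public static void main(String[] args) {
        String sample = "<REUTERS ID=\"1\"><TITLE>OIL PRICES RISE</TITLE>"
                + "<BODY>Crude oil rose today.</BODY></REUTERS>";
        for (Map<String, String> article : extract(sample)) {
            System.out.println(article.get("title") + " | " + article.get("body"));
        }
    }
}
```

Each returned map has a flat key/value shape, so it could be converted into a document and inserted with the MongoDB Java driver; that step is omitted here to keep the sketch free of external dependencies.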

## Features
- **Data Parsing and Storage:** Extracts news articles from .sgm files and stores them in a MongoDB database.
- **Word Frequency Analysis:** Utilizes Apache Spark to count word frequencies in news articles.
- **Sentiment Analysis:** Implements a Bag-of-Words model in Java to classify news article titles as positive, negative, or neutral.
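
As a rough sketch of how a Bag-of-Words classifier for titles might work: the title is split into lowercase tokens, each token is counted, and the counts are scored against positive and negative word lists. The two word lists below are tiny placeholders invented for illustration, and the names `TitleSentiment`, `bagOfWords`, `score`, and `classify` are hypothetical; the project's own lexicon and scoring rule may differ.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: word lists and method names are assumptions for illustration.
class TitleSentiment {

    private static final Set<String> POSITIVE = new HashSet<>(
            Arrays.asList("rise", "growth", "gain", "profit", "improve"));
    private static final Set<String> NEGATIVE = new HashSet<>(
            Arrays.asList("loss", "losses", "fall", "strike", "crisis"));

    // Builds the bag of words: token -> number of occurrences in the title.
    static Map<String, Integer> bagOfWords(String title) {
        Map<String, Integer> bag = new HashMap<>();
        for (String token : title.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!token.isEmpty()) {
                bag.merge(token, 1, Integer::sum);
            }
        }
        return bag;
    }

    // Adds +1 per positive occurrence and -1 per negative occurrence.
    static int score(Map<String, Integer> bag) {
        int total = 0;
        for (Map.Entry<String, Integer> entry : bag.entrySet()) {
            if (POSITIVE.contains(entry.getKey())) {
                total += entry.getValue();
            } else if (NEGATIVE.contains(entry.getKey())) {
                total -= entry.getValue();
            }
        }
        return total;
    }

    // Maps the sign of the score to a polarity label.
    static String classify(String title) {
        int s = score(bagOfWords(title));
        if (s > 0) return "positive";
        if (s < 0) return "negative";
        return "neutral";
    }

    public static void main(String[] args) {
        System.out.println(classify("Profits rise as growth continues"));
        System.out.println(classify("Losses deepen after strike"));
        System.out.println(classify("Company names new chairman"));
    }
}
```

Exact string matching means that "profits" does not match "profit"; a fuller implementation might add stemming or a larger lexicon.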

## Technologies Used
- **Java**
- **MongoDB**
- **Apache Spark**
- **Bag-of-Words Model**
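
To make the word frequency computation concrete, here is a single-machine stand-in written with plain Java streams. It is not the project's Spark job: it only illustrates the tokenize, group, and count logic that such a job would distribute across a cluster. The class name `WordFrequency` and the tokenization rule (lowercase, split on non-alphanumeric characters) are assumptions.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical stand-in for the distributed job; runs locally with no Spark dependency.
class WordFrequency {

    // Lowercases the text, splits on non-alphanumeric runs, and counts each word.
    static Map<String, Long> count(String text) {
        return Arrays.stream(text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+"))
                .filter(word -> !word.isEmpty())
                .collect(Collectors.groupingBy(
                        Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, Long> counts = count("Oil prices rise. Oil exports rise, prices hold.");
        // TreeMap keeps the words in alphabetical order when printed.
        counts.forEach((word, n) -> System.out.println(word + " " + n));
    }
}
```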