https://github.com/theveryhim/stream-data-analysis
Filtering, sampling and analysis on a news data set using stream simulation.
pyspark stream-data-analysis stream-processing
Last synced: 11 months ago
- Host: GitHub
- URL: https://github.com/theveryhim/stream-data-analysis
- Owner: theveryhim
- License: mit
- Created: 2025-07-04T19:40:59.000Z (12 months ago)
- Default Branch: main
- Last Pushed: 2025-07-04T20:12:18.000Z (12 months ago)
- Last Synced: 2025-07-04T21:23:23.393Z (12 months ago)
- Topics: pyspark, stream-data-analysis, stream-processing
- Language: Jupyter Notebook
- Homepage:
- Size: 29.8 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: news_dataset_MDA2024.json
- License: LICENSE
README
# Stream data analysis
## *news_dataset_MDA2024*
- Implement a streaming data processing system using Structured Streaming in PySpark.
- Continuously process data arriving in real time from a news source and extract useful information from it.
- Filter the data stream according to the requirements of each task.
```markdown
Time: 2024-12-29 12:00:00
Category: PARENTING, Count: 1
Category: CULTURE & ARTS, Count: 1
Category: U.S. NEWS, Count: 5
Category: COMEDY, Count: 1
Category: WORLD NEWS, Count: 2
------------------------------
Time: 2024-12-29 12:00:20
Category: SPORTS, Count: 1
Category: CULTURE & ARTS, Count: 1
Category: U.S. NEWS, Count: 1
Category: WORLD NEWS, Count: 6
Category: TECH, Count: 1
------------------------------
...
```
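The sample output above groups category counts by time. The following is a minimal plain-Python sketch of that kind of tumbling-window count, written without Spark so it runs anywhere; it is not the repository's implementation. The 20-second window length is an assumption inferred from the two timestamps in the sample, and the records are hypothetical. In Structured Streaming, a comparable aggregation would typically group by `pyspark.sql.functions.window` over a timestamp column together with the category column.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta

# Assumed window length, inferred from the timestamps in the sample output.
WINDOW = timedelta(seconds=20)
EPOCH = datetime(1970, 1, 1)


def window_start(ts, width=WINDOW):
    """Floor a timestamp to the start of its tumbling window."""
    return ts - (ts - EPOCH) % width


def count_by_window(records):
    """Group (timestamp, category) pairs into per-window category counts."""
    counts = defaultdict(Counter)
    for ts, category in records:
        counts[window_start(ts)][category] += 1
    return dict(counts)


# Hypothetical records standing in for the news stream.
records = [
    (datetime(2024, 12, 29, 12, 0, 3), "U.S. NEWS"),
    (datetime(2024, 12, 29, 12, 0, 11), "U.S. NEWS"),
    (datetime(2024, 12, 29, 12, 0, 19), "COMEDY"),
    (datetime(2024, 12, 29, 12, 0, 21), "WORLD NEWS"),
]

for start, per_category in sorted(count_by_window(records).items()):
    print(f"Time: {start}")
    for category, n in per_category.items():
        print(f"Category: {category}, Count: {n}")
    print("-" * 30)
```

Flooring each timestamp to a window boundary keeps the grouping independent of arrival order, which is the same property an event-time window gives a streaming query.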
## *web_streaming_dataset*
- Get acquainted with two important stream-analysis algorithms, DGIM and FM (Flajolet-Martin), and implement them in PySpark.
- Estimate the number of 1 bits (successful user requests) in each window using the DGIM algorithm.
- Estimate the number of unique users who have accessed the website using the *FM* algorithm.
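To make the DGIM step in the list above concrete, here is a minimal plain-Python sketch of the bucket-based estimate. It is an illustration under stated assumptions, not the repository's PySpark code: the class name `DGIM`, the window size, and the sample bits are all hypothetical. Each 1 bit stands for a successful request, as described above.

```python
class DGIM:
    """Approximate count of 1 bits among the last `window` bits of a stream."""

    def __init__(self, window):
        self.window = window
        self.time = 0
        # Buckets stored newest first as (end_timestamp, size); sizes are powers of 2.
        self.buckets = []

    def add(self, bit):
        self.time += 1
        # Drop the oldest bucket once its end timestamp falls outside the window.
        if self.buckets and self.buckets[-1][0] <= self.time - self.window:
            self.buckets.pop()
        if bit != 1:
            return
        self.buckets.insert(0, (self.time, 1))
        # Restore the invariant of at most two buckets per size.
        i = 0
        while i + 2 < len(self.buckets) and self.buckets[i][1] == self.buckets[i + 2][1]:
            # Merge the two oldest of three equal-sized buckets; the merged
            # bucket keeps the more recent of the two end timestamps.
            newer, older = self.buckets[i + 1], self.buckets[i + 2]
            self.buckets[i + 1:i + 3] = [(newer[0], newer[1] + older[1])]
            i += 1

    def estimate(self):
        if not self.buckets:
            return 0
        total = sum(size for _, size in self.buckets)
        # The oldest bucket may only partly overlap the window, so count
        # half of it (integer halving is a simplification made in this sketch).
        return total - self.buckets[-1][1] // 2


# Hypothetical success flags: 1 = successful request, 0 = failed request.
bits = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1]
dgim = DGIM(window=8)
for b in bits:
    dgim.add(b)
print("estimate:", dgim.estimate(), "exact:", sum(bits[-8:]))
```

The sketch stores only a logarithmic number of buckets instead of the whole window, at the cost of an estimate that can be off by up to half of the true count.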
```markdown
Actual number of unique users: 1491
Estimated number of unique users: 1575.3846153846155
```
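The idea behind an estimate like the one above can be sketched in plain Python. This is a simplified illustration of the Flajolet-Martin approach, not the repository's code: the hash family built from `hashlib.sha256`, the number of hash functions, and the median-of-means combination are all assumptions chosen for this example, and the user IDs are hypothetical. Each hash function tracks the largest number of trailing zero bits `R` it has seen and contributes `2**R` as its estimate.

```python
import hashlib
from statistics import mean, median


def trailing_zeros(x, width=32):
    """Number of trailing zero bits in x (returns `width` when x is 0)."""
    if x == 0:
        return width
    return (x & -x).bit_length() - 1


def hash32(item, seed):
    """Deterministic 32-bit hash; the seed picks one function from a family."""
    digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
    return int.from_bytes(digest[:4], "big")


def fm_estimate(items, num_groups=4, group_size=8):
    """Median over groups of the mean of 2**R, one R per hash function."""
    num_hashes = num_groups * group_size
    max_zeros = [0] * num_hashes
    for item in items:
        for seed in range(num_hashes):
            r = trailing_zeros(hash32(item, seed))
            if r > max_zeros[seed]:
                max_zeros[seed] = r
    estimates = [2 ** r for r in max_zeros]
    groups = [estimates[i:i + group_size] for i in range(0, num_hashes, group_size)]
    return median(mean(g) for g in groups)


# Hypothetical stream: 300 distinct users, each appearing four times.
users = [f"user{i % 300}" for i in range(1200)]
print("Actual number of unique users:", len(set(users)))
print("Estimated number of unique users:", fm_estimate(users))
```

Because only the maximum per hash function is stored, repeated visits by the same user never change the estimate, and memory use does not grow with the stream.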
## Persian Twitter Dataset (*dataset*)
- Both the task description and the implementation for this section are provided as *Task3*.
- Using PySpark's Structured Streaming, process new tweets in real time, then identify and count the hashtags in each tweet.
```markdown
+--------------------+--------------------+-------------+
| window| hashtag|hashtag_count|
+--------------------+--------------------+-------------+
|{2023-12-01 06:14...| پرستو_معینی| 1|
|{2023-11-22 11:18...| دلیران_میدان| 1|
|{2023-12-01 06:14...| زهرا_صفایی| 1|
|{2023-11-10 07:10...| درمان_سرطان| 1|
|{2023-11-10 07:10...| سرطان_پروستات| 1|
...
```
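The hashtag step behind a table like the one above can be illustrated in plain Python. This is a minimal sketch under stated assumptions, not the repository's streaming query: the regular expression, the helper names, and the sample tweets are hypothetical. It relies on `\w` being Unicode-aware for Python `str` patterns, so Persian letters and underscores are matched; hashtags that contain a zero-width non-joiner would be cut at that character.

```python
import re
from collections import Counter

# "#" followed by one or more word characters (Unicode letters, digits, "_").
HASHTAG = re.compile(r"#(\w+)")


def extract_hashtags(text):
    """Return the hashtags in a tweet, without the leading '#'."""
    return HASHTAG.findall(text)


def count_hashtags(tweets):
    """Count hashtag occurrences across an iterable of tweet texts."""
    counts = Counter()
    for text in tweets:
        counts.update(extract_hashtags(text))
    return counts


# Hypothetical tweets reusing hashtags from the sample output.
tweets = [
    "خبر جدید #درمان_سرطان و #سرطان_پروستات",
    "#درمان_سرطان again, plus #KingRezaPahlavi",
]
for tag, n in count_hashtags(tweets).most_common():
    print(tag, n)
```

In a windowed streaming query, the same extraction would run per tweet and the counts would be kept per window rather than over the whole list.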
- Use each tweet's sentiment feature to calculate and display the average sentiment for each hashtag in real time.
```markdown
+--------------------+-------------+
| hashtag|avg_sentiment|
+--------------------+-------------+
| حسن_روحانی| 0.0|
| قیام_سراسری| 0.0|
| یمن| 0.0|
| آرمیتا_گراوند| 0.0|
| KingRezaPahlavi| 0.0|
...
```
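A running per-hashtag average like the one in the table above can be maintained from two pieces of state per hashtag, a sum and a count. The sketch below shows that idea in plain Python; the class name `HashtagSentiment` and the sample values are hypothetical, and the repository's PySpark aggregation may be structured differently.

```python
from collections import defaultdict


class HashtagSentiment:
    """Keeps a running average sentiment per hashtag."""

    def __init__(self):
        self.total = defaultdict(float)  # sum of sentiment values per hashtag
        self.count = defaultdict(int)    # number of tweets seen per hashtag

    def update(self, hashtags, sentiment):
        """Record one tweet: its hashtags and its sentiment score."""
        for tag in hashtags:
            self.total[tag] += sentiment
            self.count[tag] += 1

    def averages(self):
        """Current average sentiment for every hashtag seen so far."""
        return {tag: self.total[tag] / self.count[tag] for tag in self.total}


# Hypothetical (hashtags, sentiment) pairs arriving one tweet at a time.
stream = [
    (["یمن"], 0.0),
    (["KingRezaPahlavi", "یمن"], 0.5),
    (["KingRezaPahlavi"], 1.0),
]
state = HashtagSentiment()
for hashtags, sentiment in stream:
    state.update(hashtags, sentiment)
for tag, avg in state.averages().items():
    print(tag, avg)
```

Storing the sum and count instead of the average itself lets each new tweet update the result in constant time without revisiting earlier tweets.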