https://github.com/theveryhim/massive-text-processing
Cleaning, processing, and analysis of a papers dataset with the PySpark (RDD) framework
big-data data-analysis frequent-itemsets massive-datasets pyspark text-preprocessing
- Host: GitHub
- URL: https://github.com/theveryhim/massive-text-processing
- Owner: theveryhim
- License: mit
- Created: 2025-07-03T08:23:54.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2025-07-04T22:24:14.000Z (3 months ago)
- Last Synced: 2025-07-04T23:27:35.022Z (3 months ago)
- Topics: big-data, data-analysis, frequent-itemsets, massive-datasets, pyspark, text-preprocessing
- Language: Jupyter Notebook
- Homepage:
- Size: 1.31 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Large-scale data analysis with the PySpark framework
Some of the tasks done in this repo ([papers' dataset](https://drive.google.com/file/d/1-EhpZaY5gvbgNuEU5IskmlQ0EnNAG5cu/view?usp=drive_link)):
- Clean the text in the title and abstract fields where needed.
- Remove mathematical symbols, meaningless characters, stopwords, etc.
- Calculate the number of articles in each category (e.g., hep-ph or math.CO).
- Identify the category that has the most articles.
- Analyze the distribution of the number of authors per article.
- Filter articles that have more than three authors and list their titles and authors.
- Plot the number of articles registered each year.
- Extract and display the 20 most frequently used words in the abstracts (the sample output below shows the top 5).
```markdown
5 most frequent words in abstract:
model : 1188676
data : 917131
results : 859049
show : 831879
using : 809828
```
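The cleaning and word-frequency steps can be sketched with plain RDD operations. This is a minimal sketch, not the notebook's actual code: `clean_tokens`, `top_words`, and the tiny `STOPWORDS` set are hypothetical names introduced here, and `abstracts_rdd` is assumed to be an RDD of abstract strings.

```python
import re

# Tiny illustrative stopword list (an assumption; a fuller list would normally be used).
STOPWORDS = {"a", "an", "the", "of", "in", "on", "and", "or", "to",
             "is", "are", "we", "for", "with", "this", "that"}

# Anything that is not a lowercase letter or whitespace (digits, math symbols, punctuation).
_NON_LETTERS = re.compile(r"[^a-z\s]+")


def clean_tokens(text):
    """Lowercase the text, blank out non-letter characters, drop stopwords and 1-letter tokens."""
    lowered = (text or "").lower()
    letters_only = _NON_LETTERS.sub(" ", lowered)
    return [w for w in letters_only.split() if len(w) > 1 and w not in STOPWORDS]


def top_words(abstracts_rdd, n=20):
    """Return the n most frequent cleaned words as (word, count) pairs.

    abstracts_rdd is assumed to be an RDD of abstract strings.
    """
    return (
        abstracts_rdd
        .flatMap(clean_tokens)                   # one record per cleaned word
        .map(lambda w: (w, 1))                   # pair each word with a count of 1
        .reduceByKey(lambda a, b: a + b)         # sum the counts per word
        .takeOrdered(n, key=lambda kv: -kv[1])   # n largest counts, collected on the driver
    )
```

With a live `SparkContext` named `sc`, something like `top_words(sc.parallelize(["The model fits the data"]), n=5)` would exercise the pipeline. `takeOrdered` returns only the top `n` pairs to the driver, which avoids sorting and collecting the full vocabulary.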
- Find the articles whose abstract mentions the word "algorithm".
- Count the number of words in the abstract of each of these articles.
- Sort them in descending order of word count.
- Display the five articles with the highest word counts as the final result.
```markdown
Top 5 articles with the highest word counts in their abstract (containing 'algorithm'):
Title: The Nonlinearity Coefficient - A Practical Guide to Neural Architecture Design, Word Count: 498
Title: Generating a Generic Fluent API in Java, Word Count: 488
Title: Boxicity and Poset Dimension, Word Count: 484
Title: An Anytime Algorithm for Optimal Coalition Structure Generation, Word Count: 484
Title: McMini: A Programmable DPOR-Based Model Checker for Multithreaded Programs, Word Count: 475
```
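The filter, count, and rank steps above can be sketched as follows. This is an illustration under stated assumptions rather than the repo's implementation: `papers_rdd` is assumed to be an RDD of dicts with `title` and `abstract` keys, the helper names are hypothetical, and "mentions the word" is read here as a case-insensitive whole-word match (a plain substring test would also catch "algorithms").

```python
import re

# Case-insensitive, whole-word match (an assumption about what "mentioned" means).
_ALGORITHM = re.compile(r"\balgorithm\b", re.IGNORECASE)


def mentions_algorithm(abstract):
    """True if the abstract contains the standalone word 'algorithm'."""
    return bool(_ALGORITHM.search(abstract or ""))


def word_count(abstract):
    """Number of whitespace-separated tokens in the abstract."""
    return len((abstract or "").split())


def top_algorithm_articles(papers_rdd, n=5):
    """Return the n (title, word_count) pairs with the longest matching abstracts.

    papers_rdd is assumed to be an RDD of dicts with 'title' and 'abstract' keys.
    """
    return (
        papers_rdd
        .filter(lambda p: mentions_algorithm(p.get("abstract")))
        .map(lambda p: (p.get("title"), word_count(p.get("abstract"))))
        .takeOrdered(n, key=lambda tc: -tc[1])  # descending by word count
    )
```

Negating the count in the `key` function makes `takeOrdered` return the largest values first, so no separate `sortBy` followed by `take` is needed.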