https://github.com/ibrahimhabibeg/national-university-of-singapore-sms-analysis
Analysis of SMS messages collected by the National University of Singapore
- Host: GitHub
- URL: https://github.com/ibrahimhabibeg/national-university-of-singapore-sms-analysis
- Owner: ibrahimhabibeg
- Created: 2024-06-19T21:57:50.000Z (7 months ago)
- Default Branch: main
- Last Pushed: 2024-06-19T21:59:30.000Z (7 months ago)
- Last Synced: 2024-11-05T00:42:16.707Z (2 months ago)
- Topics: analytics, data-analysis, data-science, nlp, python
- Language: Jupyter Notebook
- Homepage: https://www.kaggle.com/code/ibrahimhabibeg/national-university-of-singapore-sms-analysis
- Size: 1.83 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# National University of Singapore SMS Analysis Project
## Overview
This project analyzes a dataset drawn from the National University of Singapore SMS Corpus. The dataset comprises thousands of text messages (SMS), mostly from Singaporeans and students attending the university. The analysis focuses on understanding the characteristics of these messages.
## Dataset Description
The dataset contains the following features:
- **id**: A unique identifier for every SMS.
- **Message**: The actual text of the SMS in natural language.
- **length**: The number of characters in the message.
- **country**: The country the sender is from.
- **Date**: The month and year the message was sent.
## Data Cleaning and Preprocessing
- Dropped an unnamed column
- Created a new binary feature from_singapore to indicate messages from Singapore
- Preprocessed messages by converting to lowercase, tokenizing, removing stop words, punctuation, numbers, single characters, and non-alphabetic characters, and lemmatizing the remaining words
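The preprocessing step above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the notebook's actual code: the function name `preprocess`, the tiny `STOP_WORDS` set, and the whitespace-based tokenization are assumptions made for the example, and lemmatization is left as a pluggable function that defaults to doing nothing.

```python
import string

# Hypothetical, deliberately tiny stop-word list for illustration only;
# the project presumably relied on a fuller list such as NLTK's.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "at", "and", "in", "on", "of", "for"}


def preprocess(message, lemmatize=lambda word: word):
    """Lowercase, tokenize, filter, and lemmatize a single SMS message."""
    kept = []
    for raw in message.lower().split():
        token = raw.strip(string.punctuation)  # drop leading/trailing punctuation
        if not token.isalpha():                # drop numbers and mixed tokens like "7pm"
            continue
        if len(token) == 1:                    # drop single characters
            continue
        if token in STOP_WORDS:                # drop stop words
            continue
        kept.append(lemmatize(token))          # lemmatize only the remaining words
    return kept


print(preprocess("Haha ok, c u at 7pm 2moro!!"))
```

To get real lemmatization, one option is to pass `WordNetLemmatizer().lemmatize` from `nltk.stem` as the `lemmatize` argument, which requires the WordNet data to be downloaded first. Note that this sketch also discards tokens with internal punctuation (for example "don't"), which may differ from how the notebook tokenizes.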
## Text Analysis
- Created word clouds to visualize frequently used words and nouns
- Identified the most common noun phrases and verb phrases
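Extracting true noun and verb phrases generally needs part-of-speech tagging and chunking. As a simpler stand-in, the sketch below counts adjacent word pairs (bigrams) across already-preprocessed messages. The function name `top_bigrams` and the sample token lists are hypothetical and are not taken from the corpus.

```python
from collections import Counter


def top_bigrams(token_lists, n=3):
    """Return the n most frequent adjacent word pairs across all messages."""
    counts = Counter()
    for tokens in token_lists:
        # zip(tokens, tokens[1:]) pairs each word with its successor
        counts.update(zip(tokens, tokens[1:]))
    return [(" ".join(pair), count) for pair, count in counts.most_common(n)]


# Made-up token lists for illustration only
sample = [
    ["go", "home", "now"],
    ["wat", "time", "go", "home"],
    ["go", "eat", "wat", "time"],
    ["go", "home"],
]

print(top_bigrams(sample, n=2))
```

Bigram counts pick up frequent pairs regardless of grammatical role, so the output would still need filtering by part of speech to separate noun phrases from verb phrases.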
## Sentiment Analysis
- Analyzed sentiment distribution by country, year, month, and season
## Key Findings
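The first two findings below are statements about the share of each sentiment label within a group. A minimal sketch of how such shares might be tallied is given here. It assumes each message already carries a label in a hypothetical `sentiment` field; the README does not say how labels were produced or stored. The rows are made up for illustration and are not values from the corpus.

```python
from collections import Counter, defaultdict


def sentiment_shares(rows, key):
    """Return, per group, the fraction of messages carrying each sentiment label."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[key]][row["sentiment"]] += 1
    return {
        group: {label: n / sum(labels.values()) for label, n in labels.items()}
        for group, labels in counts.items()
    }


# Made-up rows for illustration only; "sentiment" is an assumed field name
rows = [
    {"from_singapore": True, "sentiment": "positive"},
    {"from_singapore": True, "sentiment": "positive"},
    {"from_singapore": True, "sentiment": "positive"},
    {"from_singapore": True, "sentiment": "neutral"},
    {"from_singapore": False, "sentiment": "positive"},
    {"from_singapore": False, "sentiment": "neutral"},
]

shares = sentiment_shares(rows, "from_singapore")
print(shares)
```

The same function could group by any other field, such as country, year, month, or a derived season, by passing a different `key`.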
- Most messages are positive, followed by neutral messages.
- Messages from Singapore are more positive than those from other countries.
- The most common words include informal language and laughter indicators ("haha", "lol").
- Common verb phrases suggest a focus on daily activities ("go home", "go eat").
- Frequent mentions of time ("wat time", "next time") and social events ("new year", "happy birthday") highlight their importance.
## Dependencies
- pandas
- nltk
- matplotlib
## Running the Analysis
- Clone this repository.
- Install the required libraries (`pip install pandas nltk matplotlib`).
- Ensure the `clean_nus_sms.csv` file is in the same directory as the notebook.
- Run the Jupyter notebook.
## Acknowledgments
I would like to thank Codecademy for offering the Data Scientist: NLP Specialist Professional Certificate, which provided the foundation and guidance for this project.