https://github.com/kwokhing/exploratory-data-analysis-on-smrt-tweets

Demo of exploratory data analysis (EDA) on train service disruptions, based on tweets (user-generated content) scraped from the Twitter account of the train operator (SMRT).

data-analysis data-cleaning data-collection data-preparation exploratory-data-analysis exploratory-data-visualizations folium geospatial-data leaflet-map python python3 regex scraping selenium selenium-python social-media text-processing user-generated-content web-scraping webscraping

## Project Overview

This demo gives a brief introduction to performing a rudimentary analysis of train service disruptions in Singapore. The data are scraped from SMRT's Twitter account and from Wikipedia, which provides the relevant train station information such as names and codes. The demo covers:

- scraping data from a website (Twitter) using Selenium
- scraping tabular data from a website (Wikipedia) using XPath
- exploratory data analysis (EDA) on the scraped data
- data cleaning, data preparation and processing
- loading .shp (shape) files into Python
- geospatial analysis of the frequency of service disruptions using Folium and Leaflet
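
The XPath step can be illustrated with a minimal sketch. It is not the notebook's code: it uses the limited XPath support in Python's standard-library `xml.etree.ElementTree` on a small, hand-written table fragment (`SAMPLE_TABLE` and `parse_station_table` are hypothetical names). A real Wikipedia page is not well-formed XML, so a full HTML parser with XPath support would probably be needed there.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for a Wikipedia station table.
SAMPLE_TABLE = """
<table class="wikitable">
  <tr><th>Code</th><th>Station name</th></tr>
  <tr><td>NS1 EW24</td><td>Jurong East</td></tr>
  <tr><td>NS2</td><td>Bukit Batok</td></tr>
  <tr><td>NS3</td><td>Bukit Gombak</td></tr>
</table>
"""


def parse_station_table(markup):
    """Return a list of (codes, name) tuples from a well-formed table."""
    root = ET.fromstring(markup.strip())
    stations = []
    # ".//tr" is an XPath expression selecting every row below the root.
    for row in root.findall(".//tr"):
        cells = ["".join(td.itertext()).strip() for td in row.findall("td")]
        if len(cells) == 2:  # header rows use <th>, so they yield no cells
            codes, name = cells
            # Interchange stations carry several codes separated by spaces.
            stations.append((codes.split(), name))
    return stations


if __name__ == "__main__":
    for codes, name in parse_station_table(SAMPLE_TABLE):
        print(name, codes)
```

Splitting the code cell into a list keeps interchange stations, which carry more than one code, usable as a lookup key for each line they serve.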

There are two primary methods of extracting data from the SMRT tweets. The first is to use the Twitter API to retrieve the tweets; the second is to scrape the information from the HTML of the official SMRT Twitter page (https://twitter.com/smrt_singapore). Because the Twitter API limits the number of tweets that can be pulled, and a substantial number of SMRT tweets was expected (approximately 4000), the second method was used to work around the API's rate limits.
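
As a rough sketch of the second method, the function below scrolls an infinite-scroll page and collects the text of matching elements until the page stops growing. It is not the notebook's code: `collect_tweet_texts`, the `selector` argument and the scroll limits are assumptions for illustration, and Twitter's markup and access rules change often, so any concrete selector would need to be checked against the live page.

```python
import time


def collect_tweet_texts(driver, selector, max_scrolls=50, pause=1.0):
    """Scroll a page with a Selenium WebDriver and collect element texts.

    `selector` is a CSS selector matching one tweet (an assumption; it
    must be worked out by inspecting the page).
    """
    seen = {}  # a dict keeps insertion order and drops duplicates

    def harvest():
        # "css selector" is the string value of Selenium's By.CSS_SELECTOR.
        for element in driver.find_elements("css selector", selector):
            text = element.text.strip()
            if text:
                seen.setdefault(text, None)

    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_scrolls):
        harvest()
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load more tweets
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:  # nothing new was loaded, so stop
            break
        last_height = new_height
    harvest()  # pick up anything loaded by the final scroll
    return list(seen)
```

With a real browser, `driver` would be something like `selenium.webdriver.Chrome()` after calling `driver.get(...)` on the account page. Harvesting on every pass, rather than once at the end, guards against pages that remove older elements from the DOM while scrolling.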

This code was submitted as a web scraping project for NTU's WKW H6752 - Data Extraction Techniques module.

![png](images/output_28.png)

## Getting started
Open `1_scrape_tweets.ipynb` and `2_geospatial_EDA_tweets.ipynb` in a Jupyter notebook environment or Google Colab. The notebooks contain further technical details.

- `1_scrape_tweets.ipynb` shows the steps taken to scrape tweets from Twitter using Selenium
- `2_geospatial_EDA_tweets.ipynb` shows the steps taken to generate a heat map of the frequency of train breakdowns
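
For orientation, a heat map of this kind could be sketched as follows. This is an illustration under stated assumptions, not the notebook's implementation: the coordinates in `STATION_COORDS` are approximate values chosen for the example, the station mentions passed in are hypothetical, and `heatmap_points` and `build_heatmap` are made-up helper names. Only `build_heatmap` requires Folium to be installed.

```python
from collections import Counter

# Approximate station coordinates (latitude, longitude), for illustration only.
STATION_COORDS = {
    "Jurong East": (1.3331, 103.7422),
    "Bishan": (1.3510, 103.8485),
    "City Hall": (1.2931, 103.8520),
}


def heatmap_points(mentions, coords):
    """Turn station mentions into [lat, lon, weight] rows.

    Mentions without known coordinates are skipped.
    """
    counts = Counter(name for name in mentions if name in coords)
    return [[coords[name][0], coords[name][1], count]
            for name, count in counts.items()]


def build_heatmap(points, out_path="heatmap.html"):
    """Render the weighted points on a Leaflet map via Folium."""
    import folium  # imported lazily so heatmap_points works without it
    from folium.plugins import HeatMap

    # Centre the map roughly on Singapore.
    fmap = folium.Map(location=[1.3521, 103.8198], zoom_start=12)
    HeatMap(points).add_to(fmap)
    fmap.save(out_path)
    return fmap


if __name__ == "__main__":
    # Hypothetical station mentions extracted from disruption tweets.
    sample = ["Bishan", "Jurong East", "Bishan", "City Hall"]
    print(heatmap_points(sample, STATION_COORDS))
```

Keeping the counting step separate from the rendering step means the frequencies can be inspected or tested without a map library.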

## Improvements
Extend the scraping and heat map generation to cover SBS train breakdowns as well.