Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/kwokhing/exploratory-data-analysis-on-smrt-tweets
Demo of exploratory data analysis (EDA) on train service disruptions, based on tweets (user-generated content) scraped from the Twitter account of the train operator (SMRT)
- Host: GitHub
- URL: https://github.com/kwokhing/exploratory-data-analysis-on-smrt-tweets
- Owner: KwokHing
- License: unlicense
- Created: 2018-06-02T07:09:50.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2019-12-05T11:23:57.000Z (about 5 years ago)
- Last Synced: 2023-08-19T08:42:07.281Z (over 1 year ago)
- Topics: data-analysis, data-cleaning, data-collection, data-preparation, exploratory-data-analysis, exploratory-data-visualizations, folium, geospatial-data, leaflet-map, python, python3, regex, scraping, selenium, selenium-python, social-media, text-processing, user-generated-content, web-scraping, webscraping
- Language: Jupyter Notebook
- Homepage:
- Size: 1.25 MB
- Stars: 3
- Watchers: 2
- Forks: 4
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
## Project Overview
This demo gives a brief introduction to performing a rudimentary analysis of train service disruptions in Singapore. The data are scraped from SMRT's Twitter account and from Wikipedia, which provides the relevant train station information such as station names and codes. The demo covers:
- scraping of data from website (twitter) using Selenium
- scraping of tabular data from website (wikipedia) using Xpath
- exploratory data analysis (EDA) on the scraped data
- data cleaning, data preparation and processing
- loading of .shp (shape) files into Python
- geospatial analysis on frequency of service disruptions using Folium & Leaflet

There are two primary methods of extracting data from the SMRT tweets on the Twitter website. The first is to use the Twitter API to retrieve SMRT tweets; the second is to scrape the information from the HTML of the official SMRT Twitter page (https://twitter.com/smrt_singapore). Because the Twitter API limits the number of tweets that can be pulled, and a substantial number of SMRT tweets was expected (approximately 4000), the second method was used to work around the API's rate limits.
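The list above mentions scraping tabular station data with XPath. The notebooks hold the actual implementation; what follows is only a minimal sketch of the idea, using the limited XPath subset in Python's standard `xml.etree.ElementTree`. The inline table, its two-column layout and the function name `parse_station_table` are assumptions made for illustration and do not reflect the real Wikipedia markup.

```python
import xml.etree.ElementTree as ET

# Illustrative, well-formed snippet (an assumption); the real Wikipedia
# table has more columns and is not guaranteed to be valid XML.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Station name</th><th>Code</th></tr>
  <tr><td>Jurong East</td><td>NS1</td></tr>
  <tr><td>Bukit Batok</td><td>NS2</td></tr>
</table>
"""


def parse_station_table(markup):
    """Return a list of {"name", "code"} dicts, one per data row."""
    root = ET.fromstring(markup.strip())
    stations = []
    # ".//tr" is an XPath expression selecting every row below the root.
    for row in root.findall(".//tr"):
        cells = [(td.text or "").strip() for td in row.findall("td")]
        # Header rows contain <th> cells only, so they yield no <td> cells.
        if len(cells) == 2:
            name, code = cells
            stations.append({"name": name, "code": code})
    return stations


stations = parse_station_table(SAMPLE_HTML)
for station in stations:
    print(station["code"], station["name"])
```

For real pages, a lenient HTML parser with full XPath support (for example `lxml`) is usually a better fit, since live HTML is rarely well-formed XML.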
This code was submitted as a web scraping project for NTU's WKW H6752 - Data Extraction Techniques module.
![png](images/output_28.png)
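A heat map like the one above needs a disruption count per station plus station coordinates. The sketch below shows one possible way to get there, not the notebook's actual code: it counts station mentions in tweet text with a regular expression and turns the counts into `[lat, lon, weight]` rows for Folium's `HeatMap` plugin. The sample tweets are made up, the coordinates are rough approximations, and all function and variable names are hypothetical.

```python
import re
from collections import Counter

# Approximate coordinates for illustration only (assumption, not from the repo).
STATION_COORDS = {
    "Jurong East": (1.333, 103.742),
    "Bishan": (1.351, 103.848),
    "Raffles Place": (1.284, 103.851),
}

# Made-up tweets in a plausible style; not real SMRT tweets.
SAMPLE_TWEETS = [
    "[NSL] No train service between Bishan and Jurong East due to a track fault.",
    "[NSL] UPDATE: Train service between bishan and Ang Mo Kio has resumed.",
    "[EWL] Please add 10 mins travel time from Jurong East to Clementi.",
]


def count_station_mentions(tweets, station_names):
    """Count how many tweets mention each known station (case-insensitive)."""
    # Longer names first, so overlapping names prefer the longest match.
    ordered = sorted(station_names, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(n) for n in ordered), re.IGNORECASE)
    canonical = {n.lower(): n for n in station_names}
    counts = Counter()
    for tweet in tweets:
        # A set ensures each station is counted at most once per tweet.
        mentioned = {canonical[m.lower()] for m in pattern.findall(tweet)}
        counts.update(mentioned)
    return counts


def heat_rows(counts, coords):
    """Build [lat, lon, weight] rows, skipping stations without coordinates."""
    return [[coords[n][0], coords[n][1], c] for n, c in counts.items() if n in coords]


def render_heat_map(rows, out_path="disruptions_heatmap.html"):
    """Write an interactive Leaflet heat map; requires the folium package."""
    import folium
    from folium.plugins import HeatMap

    # Centre the map roughly on Singapore.
    fmap = folium.Map(location=[1.3521, 103.8198], zoom_start=11)
    HeatMap(rows).add_to(fmap)
    fmap.save(out_path)
    return out_path


counts = count_station_mentions(SAMPLE_TWEETS, STATION_COORDS)
rows = heat_rows(counts, STATION_COORDS)
# render_heat_map(rows)  # uncomment when folium is installed
```

Counting each station once per tweet keeps a single long tweet from inflating a station's weight. Whether follow-up tweets such as "service has resumed" should count as separate disruptions is a cleaning decision this sketch leaves open.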
## Getting started
Open `1_scrape_tweets.ipynb` and `2_geospatial_EDA_tweets.ipynb` in a Jupyter Notebook environment or Google Colab. The notebooks contain further technical details.

- `1_scrape_tweets.ipynb` shows the steps taken to scrape tweets from Twitter using Selenium
- `2_geospatial_EDA_tweets.ipynb` shows the steps taken to generate a heat map of the frequency of train breakdowns

## Improvements
Scrape SBS train breakdown data and generate a heat map for it as well.