# CA682: Data Management and Visualisation
Code and analyses for the CA682 module.
## Data
* Tarball containing 498 CSV files: [csvs.tar.gz](https://drive.google.com/open?id=18hNkat9vaakwkwfsJoylyVINF45yQr6t) [31 MB]
## Finding weather station data
Met Éireann provides access to historical data via their [website](https://www.met.ie/climate/available-data/historical-data).
The interface provided allows a user to download a dataset for one weather station at a time.
However, for this project we require much more data, and would like all data for all weather stations.

Using the inspection tools in Chrome, I found a `stations.csv` file that was being loaded.
Some fields of interest here were the station name, county, latitude, longitude, and, most importantly, the assigned station number.

I then found a base URL where the historical data was hosted.
By manually downloading a single dataset and examining the URL, I was able to determine the form of the URLs for the hosted data: `"$BASE_URL"/dly"$STATION_NUMBER".zip`
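
To make the URL pattern concrete, here is a minimal Python sketch (not part of the repository) that builds one download URL per station number; `BASE_URL` is a placeholder, since the actual host is not recorded in this README.

```python
# Minimal sketch: build a download URL for each station number in stations.csv.
# BASE_URL is a placeholder -- substitute the host found with the browser's
# inspection tools; it is not recorded in this README.
import csv

BASE_URL = "https://example.invalid/climate_data"


def station_url(station_number: str) -> str:
    """Return the URL of a station's daily-data archive, dly<number>.zip."""
    return f"{BASE_URL}/dly{station_number}.zip"


with open("stations.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    urls = [station_url(row[0]) for row in reader if row]

print(urls[:3])
```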
## Cleaning up `stations.csv`
There is a lot of information that we don't need in `stations.csv`.
This file can be tidied up to extract only the fields of interest by running `clean_stations_csv.sh`.
This is a very simple shell script.
The contents are shown below:

```bash
$ cat clean_stations_csv.sh
#!/bin/bash
# extract fields of interest from `stations.csv`

# directory containing the input file `stations.csv`
data_dir=$1

# the name of the input file
input_file="$data_dir"/stations.csv

# the name of the output file
output_file="$data_dir"/stations_clean.csv

# add the header to the output file
echo "station_number,station_name,county,latitude,longitude" >"$output_file"

# extract the first five fields of the input file, skip the original header,
# and append the rest to the output file
cut -d',' -f1-5 "$input_file" | tail -n+2 >>"$output_file"
```
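
For example, if `stations.csv` were saved in a `data/` directory (this path is only illustrative), running `./clean_stations_csv.sh data` would write the tidied file to `data/stations_clean.csv`.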
## Getting all weather station data

Run the Python application to download the data:
```bash
$ python get_data.py data/daily/zips
```

Note that the path passed above (`data/daily/zips`) **must** already exist.
With the example above, `518` zip archives were downloaded to `data/daily/zips`.
This took several minutes, running with two threads.
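
The contents of `get_data.py` are not shown in this README, so the following is only a rough sketch of what a two-threaded downloader along these lines might look like; the `BASE_URL` placeholder and the use of `stations_clean.csv` as the station list are assumptions, not the actual implementation.

```python
# Rough sketch of a two-threaded downloader (not the actual get_data.py).
# BASE_URL is a placeholder; the target directory is taken from argv,
# mirroring the `python get_data.py data/daily/zips` invocation above.
import csv
import sys
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from urllib.request import urlretrieve

BASE_URL = "https://example.invalid/climate_data"  # placeholder, not the real host


def download(station_number: str, out_dir: Path) -> Path:
    """Download a single station's dly<number>.zip archive into out_dir."""
    filename = f"dly{station_number}.zip"
    target = out_dir / filename
    urlretrieve(f"{BASE_URL}/{filename}", target)
    return target


def main() -> None:
    out_dir = Path(sys.argv[1])  # e.g. data/daily/zips -- must already exist
    with open("stations_clean.csv", newline="") as f:
        station_numbers = [row["station_number"] for row in csv.DictReader(f)]

    # two worker threads, matching the "running with two threads" note above
    with ThreadPoolExecutor(max_workers=2) as pool:
        list(pool.map(lambda n: download(n, out_dir), station_numbers))


if __name__ == "__main__":
    main()
```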
## Unzipping the archives

The zip archives were downloaded to `data/daily/zips`.
From the parent directory (`data/daily`), the following commands were run to extract the archives:

```bash
$ mkdir csvs
$ for f in zips/*.zip; do unzip -n -d csvs "$f"; done >log 2>&1
```

Some errors occurred during the `unzip` operation (recorded in the generated `log` file).
It appears that a number of the constructed URLs were not valid.
There are `20` files (listed below) for which the archive size is < 1 KB; these simply contain the source of a `404` error page.

```bash
$ find . -type f -size -1024c
./dly3422.zip
./dly9938.zip
./dly2931.zip
./dly538.zip
./dly9323.zip
./dly4413.zip
./dly3037.zip
./dly1807.zip
./dly4702.zip
./dly2604.zip
./dly199.zip
./dly5819.zip
./dly5602.zip
./dly9206.zip
./dly8912.zip
./dly2218.zip
./dly8123.zip
./dly5729.zip
./dly9106.zip
./dly907.zip
```

This leaves us with valid weather data for `498` weather stations across Ireland.
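
The README does not record exactly how the invalid archives were excluded; one possible approach, sketched below under the assumption that they are simply deleted before further processing, is to drop every archive smaller than 1 KB, mirroring the `find` check above.

```python
# Sketch: remove downloaded archives smaller than 1 KB, which contain only a
# 404 error page rather than real weather data. Deleting them is an assumption;
# the check mirrors `find . -type f -size -1024c` above.
from pathlib import Path

zip_dir = Path("data/daily/zips")
invalid = [p for p in zip_dir.glob("dly*.zip") if p.stat().st_size < 1024]

for path in invalid:
    path.unlink()

print(f"removed {len(invalid)} invalid archives")
```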
## Finding how many rows to skip
The CSV files are not standardised.
In order to load the CSVs into a `DataFrame`, we need to provide the `skiprows` argument.
To find out how many rows to skip, we can recursively `grep` for the header line.
In the `data` directory, run:

```bash
$ grep -nr "date," daily/csvs/dly*.csv | cut -d'/' -f3 | cut -d':' -f1,2 >header_rows.txt
```

We can inspect the first `10` lines with:
```bash
$ head header_rows.txt
dly10023.csv:10
dly1024.csv:10
dly1033.csv:10
dly1042.csv:10
dly1043.csv:10
dly1075.csv:25
dly108.csv:10
dly1103.csv:10
dly1107.csv:10
dly1108.csv:10
```

For most of the files, the header is on line `10`.
However, for `dly1075.csv`, the header is on line `25`.
We can extract this information and create a mapping using Python, as sketched below.
We'll also make use of the `stations_clean.csv` file generated earlier.
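
A minimal sketch of building that mapping from `header_rows.txt` (the line format follows the `grep` output above; since the `grep` line numbers are 1-based and `pandas.read_csv` expects the number of lines to skip, one is subtracted):

```python
# Sketch: build a {filename: skiprows} mapping from header_rows.txt.
# Each line looks like "dly10023.csv:10", where 10 is the 1-based line number
# of the CSV header found by grep. pandas' skiprows counts lines to skip, so
# the header line number minus one is stored.
skiprows_by_file = {}

with open("header_rows.txt") as f:
    for line in f:
        filename, header_line = line.strip().split(":")
        skiprows_by_file[filename] = int(header_line) - 1

print(skiprows_by_file["dly1075.csv"])  # 24
```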
## Data shift in `stations.csv`

There is an extra comma in the `county` field of the original `stations.csv`, which caused the data to be shifted when running the cleaning script.
This can be corrected by running the following code on macOS (BSD `sed` requires a backup suffix for the `-i` flag; the backup file is removed afterwards):

```bash
$ sed -i.bak "460s/.*/5331,DELVIN CASTLE G.C.,Westmeath,53.608,-7.104/" stations_clean.csv
$ rm stations_clean.csv.bak
```

or on Linux/Windows:
```bash
$ sed -i "460s/.*/5331,DELVIN CASTLE G.C.,Westmeath,53.608,-7.104/" stations_clean.csv
```

## Creating a `DataFrame` with all station details
We want to gather all information about the stations in an easy-to-use format.
We'll use `pandas` to create `DataFrame` objects to work with.
Initial cleaning and transformation are done in [`data-cleaning_2018-12-14.ipynb`](data-cleaning_2018-12-14.ipynb).
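
As an illustration only (the notebook is the source of truth, and the exact columns and parsing options used there are not reproduced in this README), loading the cleaned station list and one station's daily data might look roughly like this:

```python
# Rough sketch of loading the station metadata and one station's daily CSV.
# Exact columns and parsing options used in the notebook are not known here;
# this only shows the skiprows idea from the previous section.
import pandas as pd

# station metadata produced by clean_stations_csv.sh (plus the sed fix above)
stations = pd.read_csv("stations_clean.csv")

# daily observations for one station; the header of dly1075.csv is on line 25
# (see header_rows.txt), so the 24 preamble lines before it are skipped
daily = pd.read_csv("data/daily/csvs/dly1075.csv", skiprows=24)

print(stations.head())
print(daily.head())
```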