https://github.com/districtdatalabs/dod-ds-overview
Data Science and Big Data Overview Training
- Host: GitHub
- URL: https://github.com/districtdatalabs/dod-ds-overview
- Owner: DistrictDataLabs
- License: mit
- Created: 2018-07-19T13:50:47.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2022-12-08T02:21:01.000Z (about 3 years ago)
- Last Synced: 2025-04-05T02:01:34.660Z (9 months ago)
- Language: Jupyter Notebook
- Size: 6.48 MB
- Stars: 6
- Watchers: 7
- Forks: 1
- Open Issues: 11
Metadata Files:
- Readme: README.md
- License: LICENSE
# Data Science and Big Data Overview Training
This repository contains Jupyter notebooks and associated data for the [District Data Labs](https://www.districtdatalabs.com/) introductory training on data science and big data.
## Requirements
Before running any code in this repository, make sure you have installed the requirements (preferably in a `virtualenv` or `conda` environment) with:
```
pip install -r requirements.txt
```
## Notebooks
This repository contains the following notebooks:
* `exploratory_data_analysis.ipynb`: a notebook demonstrating basic exploratory data analysis (EDA) techniques using Yelp and U.S. Census data
* `supervised_learning.ipynb`: a notebook demonstrating supervised learning techniques on baseball player statistics
* `data_collection.ipynb`: a notebook demonstrating data acquisition by web scraping public speeches by the U.S. Secretary of Defense
* `unsupervised_learning.ipynb`: a notebook demonstrating unsupervised learning (clustering) on the public speeches referenced above
* `string_matching.ipynb`: a notebook demonstrating techniques for entity resolution using string matching
* `elasticsearch_overview.ipynb`: a notebook demonstrating how to interact with an Elasticsearch cluster (requires access to an Elasticsearch instance, either remote or local, for example one running in Docker)
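
To give a flavor of the string-matching approach to entity resolution mentioned above, here is a minimal sketch using only Python's standard-library `difflib`. It is not taken from `string_matching.ipynb`, which may use different libraries and techniques; the function names `similarity` and `best_match` and the `0.8` threshold are illustrative assumptions.

```python
from difflib import SequenceMatcher


def _normalize(text):
    # Lowercase and collapse runs of whitespace so trivial differences do not count.
    return " ".join(text.lower().split())


def similarity(a, b):
    """Return a similarity ratio in [0, 1] between two normalized strings."""
    return SequenceMatcher(None, _normalize(a), _normalize(b)).ratio()


def best_match(name, candidates, threshold=0.8):
    """Return the most similar candidate, or None if none reaches the threshold.

    The threshold is a hypothetical default; tune it for your own data.
    """
    scored = [(similarity(name, candidate), candidate) for candidate in candidates]
    score, match = max(scored, default=(0.0, None))
    return match if score >= threshold else None
```

Calling `best_match("Distrct Data Labs", ["District Data Labs", "Data Science Inc"])` would resolve the misspelled name to its closest candidate, while a name with no sufficiently similar candidate yields `None`.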
## Data
This repository is self-contained: the data used by the notebooks is available in `/data` as a number of `.csv` files. The provenance of each file is explained in the notebook that uses it.
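
As a quick way to see what the data directory contains before opening a notebook, the following sketch lists each `.csv` file with its first row. It assumes the code runs from the repository root, that `/data` refers to a `data` directory there, that each file starts with a header row, and that the files are UTF-8 encoded; none of these is stated in this README. The helper name `csv_headers` is hypothetical.

```python
import csv
from pathlib import Path


def csv_headers(data_dir="data"):
    """Map each .csv file name in data_dir to its first row (assumed to be a header)."""
    headers = {}
    for path in sorted(Path(data_dir).glob("*.csv")):
        # UTF-8 is an assumption; adjust the encoding if a file fails to decode.
        with path.open(newline="", encoding="utf-8") as handle:
            headers[path.name] = next(csv.reader(handle), [])
    return headers
```

Iterating over `csv_headers().items()` and printing each pair gives a one-line summary per file.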