https://github.com/filipinascimento/dataviz



# Welcome to the Data Visualization Course!

Welcome to our exciting journey into the world of Data Visualization! This course is designed to provide you with a solid understanding of visualization fundamentals, emphasizing practical skills and real-world applications through Python and JavaScript.

Visualizing data is an essential skill for researchers, data scientists, analysts, journalists, and professionals in various fields dealing with information. It is an important tool for understanding complex datasets and making data-driven decisions, and it can help troubleshoot issues within complex data analysis pipelines or AI models. Data visualization is also a powerful medium for communicating data-driven insights and narratives to colleagues or broader audiences.

This course is heavily based on Prof. Yong-Yeol “YY” Ahn's Data Visualization course (http://yongyeol.com).

# Table of Contents

1. [Welcome to the Data Visualization Course!](#welcome-to-the-data-visualization-course)
2. [Objectives](#objectives)
3. [Prerequisites](#prerequisites)
4. [Course Structure](#course-structure)
5. [Grades](#grades)
6. [Communication](#communication)
7. [Final Project](#final-project)
8. [Recommended Books and Resources](#recommended-books-and-resources)
9. [Course Materials](#course-materials)
10. [Getting Started With the Course](#getting-started-with-the-course)
11. [Setting Up Your Data Visualization Environment](#setting-up-your-data-visualization-environment)
12. [Basic GitHub Usage for the Data Visualization Course (for submitting assignments)](#basic-github-usage-for-the-data-visualization-course-for-submitting-assignments)

## Objectives

By the end of this course, you are expected to be able to:
- Prepare and manipulate basic data types, such as numerical, categorical, and textual data.
- Explain and summarize data using descriptive statistics.
- Analyze data using exploratory visualization techniques.
- Critically analyze and improve visualizations based on principles such as human perception, design, visualization techniques, technology, and ethics.
- Understand how visualizations can be misleading or misrepresent data.
- Use ethical considerations when creating and deploying visualizations, such as fairness, accuracy, transparency, accessibility and diversity.
- Use modern libraries and tools for creating interactive visualizations.
- Integrate visualization into data analysis and machine learning pipelines.
- Prepare narrative visualizations to communicate data-driven insights and stories.
- Create and deploy interactive visualizations to the web.

You will showcase your learned skills by undertaking a course project, in which you will develop a visualization to reveal insights from real-world datasets. This project will require detailed documentation of each step involved in its development, from initial concept to final execution.

### Prerequisites
In this course, we will primarily use Python for data analysis and visualization tasks. Thus, you are required to have a good understanding of algorithms and practical experience with Python. We also expect you to have some level of familiarity with web technologies (HTML, CSS, and JavaScript), which will be essential for creating and deploying interactive visualizations. You are encouraged to also have a basic understanding of statistics and probability, as well as notions of 2D geometry and linear algebra.

For self-assessment, please visit the following link: http://bit.ly/dvizselfassess (created by YY Ahn). Contact the instructor if you are uncertain about your background.

### Course Structure

Each week, we'll explore different topics in Data Visualization, starting from the basics and gradually moving to more advanced concepts. The course is designed to be hands-on, with a mix of theory, practical exercises, and projects.

Here's a tentative outline of the course for Fall 2024:

- **Week 1** (Aug 26 and Aug 28): Introduction to Data Visualization
  - Overview of Data Visualization
  - The Importance of Data Visualization
  - Historical overview
  - Course summary and expectations
  - Famous visualizations and their impact
  - Demonstration of visualizations, tools, and libraries

- **Week 2** (Sep 4): Principles of data visualization
  - Human perception and cognition
  - Gestalt principles
  - Visual encoding
  - Design principles
  - Color perception, theory and representations
  - Ethical considerations in data visualization

- **Week 3** (Sep 9 and Sep 11): Prerequisites and Recap of Fundamentals
  - Python basics (Jupyter, Pandas)
  - Simple statistics
  - Modern JavaScript, HTML, and CSS basics
  - Setup of a web development environment
  - Introduction to d3.js
  - Canvas and SVG
  - Basics of 2D computer graphics, geometry, and affine transformations

- **Week 4** (Sep 16 and Sep 18): Data types and Exploratory Data Analysis
  - Data types and data structures
  - Data cleaning and preprocessing
  - Manipulation of data
  - Description and summarization of data
  - Histograms
  - Box plots and variants
  - File formats for visualizations
  - Exporting visualizations to design tools

- **Week 5** (Sep 23 and Sep 25): Distributions, scales and axes
  - Typical distributions
  - Kernel density estimation
  - Power-law distribution
  - Linear, logarithmic and time scales
  - Revisiting the power-law distribution
  - Line plots
  - Axes, ticks and labels

- **Week 6** (Sep 30 and Oct 2): Mapping data to 2D
  - Scatter and bubble plots
  - Heatmaps
  - Color scales
  - 2D histograms, contour and density plots
  - Which chart, color, scale, map, \*, to use?
  - Bad vs Good visualizations
  - **Project idea discussions and matchmaking**

- **Week 7** (Oct 7 and Oct 9): Multidimensional data I
  - Parallel coordinates and radar chart
  - Scatter matrices and multi-panel plots
  - Are 3D plots the solution?
  - Principles of dimensionality reduction
  - Principal component analysis

- **Week 8** (Oct 14 and Oct 16): Multidimensional data II
  - Visualizing high-dimensional data
  - t-SNE and UMAP
  - Clustering

- **Week 9** (Oct 21 and Oct 23): Geospatial data
  - Map projections
  - Choropleth maps
  - Density projection and caveats
  - Geodesic and great-circle distances
  - Routes

- **Week 10** (Oct 28 and Oct 30): Text and embeddings
  - Preprocessing text
  - Word clouds and variations
  - Word prevalence plots
  - Word and text embeddings (e.g., word2vec, BERT)
  - Other types of embeddings (e.g., images)

- **Week 11** (Nov 4 and Nov 6): Network visualization
  - Network visualization
  - Node-link diagrams
  - Graph layout algorithms
  - Visualizing social media
  - **Project checkpoint and discussions**

- **Week 12** (Nov 11 and Nov 13): Interactive visualizations
  - Importance of interactivity
  - Types of interactions
  - Building interactive visualizations with d3.js
  - **Guest lecture**

- **Week 13** (Nov 18 and Nov 20): Deconstructing and reconstructing visualizations with d3.js
  - Deploying visualizations to the web
  - The building blocks of visualizations
  - Customizing visualizations with d3.js

- **Week 14**: **Thanksgiving break**

- **Week 15** (Dec 2 and Dec 4): **Project hackday week**

- **Week 16** (Dec 9 and Dec 11): **Project presentations week**

- **Week 17** (Dec 16 and Dec 18): **Final Exam Week**

### Grades
You will be evaluated on participation, attendance, assignments, the final project, and the final exam. The final grade will be calculated as follows:

- 20% - Participation and attendance
- 20% - Assignments
- 30% - Final project
- 30% - Final exam

Some assignments may give bonus points toward the final grade. Extra credit will be given based on engagement in class and offline, such as asking questions, helping others, and contributing to the course materials.
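
As an illustration of the weighting above, here is a minimal sketch that computes a final grade from the four components. The function name `final_grade`, the 0-100 scale, and the example scores are assumptions made for illustration only. Bonus points and extra credit are not modeled, since how they are applied is not specified here.

```python
from __future__ import annotations

# Weights taken from the grading breakdown above (they sum to 1.0).
WEIGHTS = {
    "participation": 0.20,  # participation and attendance
    "assignments": 0.20,
    "project": 0.30,
    "exam": 0.30,
}


def final_grade(scores: dict[str, float]) -> float:
    """Return the weighted final grade for component scores on a 0-100 scale.

    Hypothetical helper: bonus points and extra credit are not modeled.
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing component scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Made-up example scores, for illustration only.
example = {"participation": 90, "assignments": 85, "project": 80, "exam": 75}
print(round(final_grade(example), 2))
```

Because the final project and the final exam each carry 30%, together they account for more than half of the grade.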

### Communication

We will use GitHub for all course materials, assignments, and the final project. Slack will be used for communication, discussions, and feedback on assignments and projects. We encourage you to actively participate in discussions, ask questions, and share your thoughts and ideas. Please be respectful and considerate of others' opinions and ideas. Do not post personal information or sensitive data in the Slack channel.

Slack Channel for the course: [TBD]

Canvas and email can also be used for communication, but expect slower responses. We encourage you to use Slack for faster communication.

If you have suggestions, criticism or feedback on improving the course, please feel free to share them with us. You can use Slack or use the anonymous feedback form: [TBD].

## Final Project
You can choose your own project topic, individually or as a team, but it is a good idea to talk it over with your instructor. You must deliver a final report that shows your findings and clearly explains how you created your visualizations. The report should demonstrate your understanding of visualization methods and your skill in applying them to present data visually.

### Recommended Books and Resources
Here are some highly recommended books and resources on Data Visualization and general Data Science with Python:

1. **Fundamentals of Data Visualization by Claus O. Wilke** (available online at https://serialmentor.com/dataviz/): A comprehensive guide to the theory and practice of data visualization.

2. **The Visual Display of Quantitative Information (2nd ed.) by E.R. Tufte**: A classic book on data visualization principles and techniques.

3. **"Python Data Science Handbook" by Jake VanderPlas**: A comprehensive guide to using Python for data analysis, manipulation, and visualization.

4. **D3 Tips and Tricks v7.x by Malcolm Maclean**. https://leanpub.com/d3-t-and-t-v7 (Free online book, or pay what you want)

5. **D3 Tutorial updated by Danny Yang** (forked from Square's original tutorial): Online tutorial on d3.js (https://yangdanny97.github.io/blog/2022/08/07/d3-resources).

6. **"Data Science from Scratch" by Joel Grus**: A great introduction to Data Science fundamentals using Python.

7. **"Python for Data Analysis"** by Wes McKinney (the creator of pandas): A practical guide to using Python for data analysis, manipulation, and visualization.

8. **"Introduction to Machine Learning with Python" by Andreas C. Müller & Sarah Guido**: A practical approach to learning machine learning with Python.

9. **Kaggle**: Participate in competitions or explore datasets for practical experience. (https://www.kaggle.com)

10. **Awesome Public Datasets**: For a huge list of public datasets for practice and projects (https://github.com/awesomedata/awesome-public-datasets)

11. **[Visual Complexity: Mapping Patterns of Information by Manuel Lima](https://www.amazon.com/Visual-Complexity-Mapping-Patterns-Information/dp/1568989369)**: A book on the visualization of complex networks and systems.

More resources will be added to this list over time. Feel free to suggest your own!

### Course Materials

Here's what you can find in our repository:

- **Python Jupyter Notebooks**: Interactive notebooks with code, explanations, and exercises.
- **PDF Presentations**: Slides covering key concepts and examples.
- **Assignments**: Python notebook assignments to apply what you've learned.
- **Datasets**: A collection of datasets used in our materials, including links to Kaggle datasets for hands-on practice.
- **Additional Resources**: Links to further reading and external resources.

Most of these materials will be available when the course starts.

### Getting Started With the Course

To get started, please ensure you have set up your environment as described in the next sections. We encourage you to explore the materials, complete the assignments, and actively participate in discussions using Slack. If you have any questions or need help, feel free to ask the instructor or your peers. We're here to help you succeed in your Data Visualization journey!

## Setting Up Your Data Visualization Environment

We suggest using Miniconda to install Python packages and set up your environment. You can also use Anaconda, but it is a larger package and may take longer to install. Alternatively, you can set up your own Python environment using pip and virtualenv.

### Step 1: Install Miniconda

Miniconda is a minimal installer for Conda, a package manager and an environment manager. Here’s how to install it:

1. **Download Miniconda**:
- Visit the [Miniconda download page](https://docs.conda.io/en/latest/miniconda.html).
- Choose the version suitable for your operating system (Windows, macOS, or Linux).
- Download the appropriate installer (Python 3.x is recommended).

2. **Install Miniconda**:
- **Windows**: Run the downloaded `.exe` file and follow the on-screen instructions.
- **macOS/Linux**: Open a terminal, navigate to the folder containing the downloaded file, and run `bash Miniconda3-latest-MacOSX-x86_64.sh` (adjust the filename as needed).

3. **Verify the Installation**:
- Open a new terminal window.
- Type `conda list`. If Miniconda is installed correctly, you'll see a list of installed packages.

### Step 2: Create a Conda Environment

Creating a separate environment for your Data Science projects is good practice:

1. **Create a New Environment for this course**:
- From the folder that contains the course's `environment.yml` file, run the command: `conda env create -f environment.yml`. This will create a new environment called `datascience` with all the necessary packages installed.

2. **Activate the Environment**:
- Run: `conda activate datascience`.

3. **Launch Jupyter Lab**:
- Run: `jupyter lab`.
- This will open Jupyter Lab in your default web browser.

### Step 3: Manually Install Essential Packages

The created environment already includes most of the packages we'll need, but if you need to install any additional packages, here's how:

1. **Manually Install Packages**:
- In your activated environment, run `conda install <package-name>` to install a package.
- For example, to install the `scikit-learn` package, run: `conda install scikit-learn`.
- Alternatively, you can use `pip install <package-name>` to install packages from PyPI.

### Step 4: Verify Installation

Make sure everything is installed correctly:

1. **Open a New Notebook in Jupyter Lab**:
- In Jupyter Lab, create a new notebook.

2. **Test the Packages**:
- Try importing the packages: `import numpy as np`, `import pandas as pd`, `import matplotlib.pyplot as plt`.
- If there are no errors, the packages are installed correctly.
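
The check above can also be run as a single notebook cell. The following is a minimal sketch, not an official course script: the helper name `check_imports` is hypothetical, and the package list only covers the three imports named above. It reports each import failure instead of stopping at the first one.

```python
import importlib

# Modules named in the verification step above.
PACKAGES = ["numpy", "pandas", "matplotlib.pyplot"]


def check_imports(names):
    """Try to import each module; map its name to a version string or an error."""
    results = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except Exception as exc:  # report any import failure instead of stopping
            results[name] = f"FAILED ({exc})"
        else:
            # Submodules such as matplotlib.pyplot may not define __version__.
            results[name] = getattr(module, "__version__", "imported")
    return results


for name, status in check_imports(PACKAGES).items():
    print(f"{name}: {status}")
```

If a line starts with `FAILED`, make sure the course environment is activated before launching Jupyter Lab, then install the missing package as described in Step 3.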

### Additional Tips

- **Updating Conda**: Keep Conda and your packages updated with `conda update conda` and `conda update --all`.
- **Managing Environments**: View your environments with `conda env list` and switch between them using `conda activate <environment-name>`.
- **Finding Packages**: To find available packages, use `conda search <package-name>`.
- **Conda Cheat Sheet**: For more commands, see the [Conda Cheat Sheet](https://docs.conda.io/projects/conda/en/latest/user-guide/cheatsheet.html).

## Basic GitHub Usage for the Data Visualization Course (for submitting assignments)

### Step 1: Fork the Course Repository

"Forking" means creating a personal copy of someone else's project. To fork the `filipinascimento/datascience` repository:

1. **Navigate to the Repository**:
- Go to the [filipinascimento/datascience](https://github.com/filipinascimento/datascience) repository on GitHub.

2. **Fork the Repository**:
- Click on the "Fork" button at the top right corner of the page.
- This will create a copy of the repository in your GitHub account.

### Step 2: Clone the Forked Repository

After forking, you'll want to clone the repository to work on it locally:

1. **Clone the Repository**:
- On your forked repository page, click the "Code" button and copy the URL under "Clone with HTTPS".
- Open a terminal on your computer and run `git clone [URL]` (replace `[URL]` with the copied URL).

### Step 3: Use the Issues Tab

The Issues tab in GitHub is used to track ideas, enhancements, tasks, or bugs:

1. **Navigate to Issues**:
- In the original `filipinascimento/datascience` repository, go to the "Issues" tab.

2. **Create a New Issue**:
- Click on "New Issue".
- Provide a title and a detailed description of your problem or discussion point.
- Click "Submit new issue" when done.

### Step 4: Submitting Assignments as Pull Requests

Submit your completed assignments as Pull Requests:

1. **Complete Your Assignment**:
- Work on the assignment in your local clone of the forked repository.
- Rename the assignment file to include your GitHub username (e.g., `assignment1-username.ipynb`).

2. **Commit Your Changes**:
- Use `git add .` to stage your changes.
- Commit the changes with `git commit -m "Completed Assignment"`.

3. **Push to GitHub**:
- Push your changes to your forked repository using `git push`.

4. **Create a Pull Request**:
- Go to the original `filipinascimento/datascience` repository.
- Click on "Pull Requests" and then "New Pull Request".
- Choose your forked repository as the source.
- Add a title and description for your PR.
- Click "Create pull request".

### Step 5: Keep Your Fork Updated

Ensure your fork is up to date with the main repository:

1. **Configure a Remote for the Fork**:
- Run `git remote add upstream https://github.com/filipinascimento/datascience.git`.

2. **Sync Your Fork**:
- Fetch the changes with `git fetch upstream`.
- Check out your fork's local default branch (`main` or `master`) with `git checkout main`.
- Merge changes from `upstream/main` into your local branch with `git merge upstream/main`.
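
If you sync often, the three commands above can be wrapped in a small script. This is a minimal sketch under stated assumptions: the helper names `sync_commands` and `sync_fork` are hypothetical, the remote is assumed to be named `upstream` as configured above, and the default branch is assumed to be `main`. By default it only prints the commands (a dry run).

```python
import subprocess


def sync_commands(branch="main", remote="upstream"):
    """Return the Git commands from the steps above as argument lists."""
    return [
        ["git", "fetch", remote],
        ["git", "checkout", branch],
        ["git", "merge", f"{remote}/{branch}"],
    ]


def sync_fork(branch="main", dry_run=True):
    """Print each command; also run it in the current directory if dry_run is False."""
    for command in sync_commands(branch):
        print("$", " ".join(command))
        if not dry_run:
            # check=True raises an error at the first failing command
            # (for example, when the merge hits a conflict).
            subprocess.run(command, check=True)


# Dry run: prints the commands without executing them.
sync_fork()
```

To actually run the commands, call `sync_fork(dry_run=False)` from inside your local clone. Pass `branch="master"` if that is your default branch.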

### Additional Resources

- **GitHub Documentation**: For more detailed instructions, visit [GitHub Docs](https://docs.github.com/en).
- **Git Cheatsheet**: Refer to this [Git Cheatsheet](https://training.github.com/downloads/github-git-cheat-sheet.pdf) for common Git commands.

## Policies
(Copied from Prof. YY Ahn's course)

1. **Be honest. Don’t be a cheater.** Your assignments and papers should be your own work. If you find useful resources for your assignments, share them and cite them. If your friends helped you, acknowledge them. You should feel free to discuss both online and offline (except for the exam), but do not show your code directly. Any cases of academic misconduct (cheating, fabrication, plagiarism, etc.) will be reported to the School and the Dean of Students, following the standard procedure. Cheating is not cool.

2. **You have the responsibility of backing up all your data and code.** Always back up your code and data. At a minimum, use Google Drive or Dropbox. You can also use cloud services like Google Colaboratory. Ideally, learn version control systems and use [https://github.iu.edu](https://github.iu.edu) or [https://github.com](https://github.com). Loss of data, code, or papers (e.g., due to malfunction of your laptop) is not an acceptable excuse for delayed or missing submission.

3. **Disabilities.** Every attempt will be made to accommodate qualified students with disabilities (e.g., mental health, learning, chronic health, physical, hearing, vision, neurological, etc.). You must have established your eligibility for support services through Disability Services for Students. Note that services are confidential, may take time to put into place, and are not retroactive. Captions and alternate media for print materials may take three or more weeks to get produced. Please contact Disability Services for Students at [http://disabilityservices.indiana.edu](http://disabilityservices.indiana.edu) or 812-855-7578 as soon as possible if accommodations are needed. The office is located on the third floor, west tower, of the Wells Library (Room W302). Walk-ins are welcome 8 AM to 5 PM, Monday through Friday. You can also locate a variety of campus resources for students and visitors who need assistance at [http://www.iu.edu/~ada/index.shtml](http://www.iu.edu/~ada/index.shtml).

4. **Bias-based incidents.** Any act of discrimination or harassment based on race, ethnicity, religious affiliation, gender, gender identity, sexual orientation, or disability can be reported to [email protected] or to the Dean of Students Office at (812) 855-8188.

5. **Sexual misconduct and Title IX.** Title IX and IU’s Sexual Misconduct Policy prohibit sexual misconduct in any form, including sexual harassment, sexual assault, stalking, and dating and domestic violence. If you have experienced sexual misconduct, or know someone who has, you can use university resources:
- a) The Sexual Assault Crisis Services (SACS) at (812) 855-8900 (counseling services)
- b) Confidential Victim Advocates (CVA) at (812) 856-2469 (advocacy and advice services)
- c) IU Health Center at (812) 855-4011 (health and medical services)

It is also important that you know that Title IX and University policy require me to share any information brought to my attention about potential sexual misconduct, with the campus Deputy Title IX Coordinator or IU’s Title IX Coordinator. In that event, those individuals will work to ensure that appropriate measures are taken and resources are made available. Protecting student privacy is of utmost concern, and information will only be shared with those that need to know to ensure the University can respond and assist. Visit [stopsexualviolence.iu.edu](http://stopsexualviolence.iu.edu) to learn more.

6. **If you have any mental health issues,** don’t hesitate to contact IU’s Counseling and Psychological Services, which provides free counseling sessions. Also, please contact Disability Services for Students at [http://disabilityservices.indiana.edu](http://disabilityservices.indiana.edu) or 812-855-7578 as soon as possible if accommodations are needed.