https://github.com/kwokhing/visualizing-datasets-with-facets
Demo on using Facets: An Open Source Visualization Tool for Machine Learning Training Data developed by Google's PAIR Initiative
anaconda data-analysis data-visualization facets jupyter-notebook missing-data open-source python skewness unbalanced-data visualisation visualization
- Host: GitHub
- URL: https://github.com/kwokhing/visualizing-datasets-with-facets
- Owner: KwokHing
- Created: 2017-12-10T05:02:33.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2017-12-10T05:50:30.000Z (over 7 years ago)
- Last Synced: 2025-01-30T05:26:57.176Z (5 months ago)
- Topics: anaconda, data-analysis, data-visualization, facets, jupyter-notebook, missing-data, open-source, python, skewness, unbalanced-data, visualisation, visualization
- Language: Jupyter Notebook
- Homepage:
- Size: 2.51 MB
- Stars: 0
- Watchers: 1
- Forks: 1
- Open Issues: 0
### Visualizing Machine Learning Datasets using Anaconda & Facets
Facets makes it easy to visualize machine learning datasets. To use Facets, first clone the Git repository:
> git clone https://github.com/PAIR-code/facets.git
To use the visualization capabilities, you will have to add an nbextension. Therefore, find the path to the facets-dist directory in the cloned git repo and execute the following line of code:
> jupyter nbextension install facets-dist/ --user
Here, 'facets-dist' is the path to that folder.
If the visualizations still do not appear in the notebook after running the command above, copy the file facets-jupyter.html from the 'facets/facets-dist' folder to your local Anaconda path _'[anaconda_path]/share/jupyter/nbextensions/'_. This is a known issue: https://github.com/PAIR-code/facets/issues/41
You might need to restart Jupyter after this before proceeding with the visualization. For a more detailed installation guide and updates, have a look at:
> https://github.com/PAIR-code/facets
Also install the protobuf package:
> conda install protobuf
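
If the visualization still does not appear, it can help to confirm where `facets-jupyter.html` actually ended up. The following is a minimal sketch using only the standard library. The helper name `find_facets_extension` and the candidate directory are illustrative assumptions, not part of Facets or Jupyter.

```python
from pathlib import Path


def find_facets_extension(candidate_dirs, filename="facets-jupyter.html"):
    """Return paths to `filename` found directly in, or in a 'facets-dist'
    subfolder of, each candidate nbextensions directory."""
    found = []
    for base in candidate_dirs:
        base = Path(base).expanduser()
        for candidate in (base / filename, base / "facets-dist" / filename):
            if candidate.is_file():
                found.append(candidate)
    return found


# Assumed location for illustration only; replace it with your own paths,
# for example '[anaconda_path]/share/jupyter/nbextensions'.
candidates = [
    Path.home() / ".local" / "share" / "jupyter" / "nbextensions",
]
for path in find_facets_extension(candidates):
    print("Found:", path)
```

If nothing is printed, the file is in none of the listed directories, and the copy step described above is worth trying.
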
```python
# Add the facets overview python code to the python path
import sys
# FACETS_PATH is the full path to the Python folder in the cloned GitHub repo of Facets.
# It should look similar to this: ".../facets/facets_overview/python"
# If you have cloned the facets repo to your current working directory, you can proceed.
# If you have chosen another location, just add it here.
# Note: 'git clone' creates a folder named 'facets'; 'facets-master' is the folder
# name of a downloaded ZIP archive. Adjust the path to match your setup.
FACETS_PATH = 'facets-master/facets_overview/python'
sys.path.append(FACETS_PATH)
```

```python
import pandas as pd

# Load the datasets; "?" entries are treated as missing values.
train_data = pd.read_csv(
    "train.csv",
    # sep=r'\s*,\s*',
    engine='python',
    na_values="?")

test_data = pd.read_csv(
    "test.csv",
    # sep=r'\s*,\s*',
    engine='python',
    na_values="?")

test_salaries = pd.read_csv(
    "test_salaries.csv",
    # sep=r'\s*,\s*',
    engine='python',
    na_values="?")

# Combine test_salaries and test_data column-wise (axis=1).
test_data = pd.concat([test_salaries, test_data], axis=1)
```

```python
# Calculate the feature statistics proto from the datasets and stringify it for use in
# facets overview
import base64

from generic_feature_statistics_generator import GenericFeatureStatisticsGenerator

gfsg = GenericFeatureStatisticsGenerator()
proto = gfsg.ProtoFromDataFrames([{'name': 'train', 'table': train_data},
                                  {'name': 'test', 'table': test_data}])
protostr = base64.b64encode(proto.SerializeToString()).decode("utf-8")
```

```python
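# Display the facets overview visualization for this data
from IPython.display import display, HTML

# Template as in the Facets README. It assumes the nbextension is served from
# /nbextensions/facets-dist/; adjust the href if facets-jupyter.html was copied elsewhere.
HTML_TEMPLATE = """<link rel="import" href="/nbextensions/facets-dist/facets-jupyter.html">
        <facets-overview id="elem"></facets-overview>
        <script>
          document.querySelector("#elem").protoInput = "{protostr}";
        </script>"""
html = HTML_TEMPLATE.format(protostr=protostr)
display(HTML(html))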
```
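
The `protostr` step encodes the serialized proto as base64 so that the binary data can be embedded safely in a JavaScript string inside the HTML template. Below is a minimal sketch of that round trip, using placeholder bytes in place of a real serialized proto (the byte values are an assumption for illustration only):

```python
import base64

# Placeholder bytes standing in for proto.SerializeToString();
# real output would come from GenericFeatureStatisticsGenerator.
raw_bytes = b"\x0a\x05train\x00\xff"

# Encode to base64 and decode to str, as done for protostr.
encoded = base64.b64encode(raw_bytes).decode("utf-8")

# Base64 output uses only ASCII letters, digits, '+', '/' and '=',
# so it cannot contain quotes that would break the template string.
assert '"' not in encoded

# Decoding recovers the original bytes exactly.
decoded = base64.b64decode(encoded)
print(decoded == raw_bytes)
```
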

Facets Overview gives users a quick understanding of the distribution of values across the features of their datasets. Multiple datasets, such as a training set and a test set, can also be compared in the same visualization.
Common data issues that can hamper machine learning are pushed to the forefront, such as: unexpected feature values, features with high percentages of missing values, features with unbalanced distributions, and feature distribution skew between datasets.
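
The issues listed above can also be checked numerically before opening the visualization. The following is a rough sketch with pandas, assuming two small made-up DataFrames in place of `train.csv` and `test.csv`. The column names and the helper `missing_percent` are hypothetical.

```python
import pandas as pd

# Made-up data standing in for the real train and test sets.
train = pd.DataFrame({
    "age": [25, 32, None, 41],
    "workclass": ["Private", "Private", "Private", None],
})
test = pd.DataFrame({
    "age": [52, 60, 58, 49],
    "workclass": ["Self-emp", "Private", None, None],
})


def missing_percent(df):
    """Percentage of missing values per column."""
    return df.isna().mean() * 100


# Features with a high percentage of missing values.
print(missing_percent(train))
print(missing_percent(test))

# Unbalanced distributions: share of each category, most common first.
print(train["workclass"].value_counts(normalize=True))

# Distribution skew between datasets: compare summary statistics side by side.
summary = pd.concat(
    {"train": train["age"].describe(), "test": test["age"].describe()},
    axis=1,
)
print(summary)
```

These numbers do not replace the visualization, but they give a quick sanity check on the same properties that Facets Overview brings to the forefront.
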
### Known Issues ###
The Facets visualizations currently work only in the Chrome browser.