https://github.com/dylan-profiler/visions

Type System for Data Analysis in Python

data-analysis data-science hacktoberfest numpy pandas python spark type-inference type-system

Last synced: about 1 year ago

Host: GitHub
URL: https://github.com/dylan-profiler/visions
Owner: dylan-profiler
License: other
Created: 2019-12-12T15:09:01.000Z (over 6 years ago)
Default Branch: develop
Last Pushed: 2025-02-01T23:40:28.000Z (over 1 year ago)
Last Synced: 2025-05-03T05:02:27.074Z (about 1 year ago)
Topics: data-analysis, data-science, hacktoberfest, numpy, pandas, python, spark, type-inference, type-system
Language: Python
Homepage: https://dylan-profiler.github.io/visions/visions/getting_started/usage/types.html
Size: 37.9 MB
Stars: 212
Watchers: 6
Forks: 19
Open Issues: 18
Metadata Files:
- Readme: README.md
- License: LICENSE

And these visions of data types, they kept us up past the dawn.

# The Semantic Data Library

`Visions` provides a set of tools for defining and using *semantic* data types.

- [x] [Semantic type](https://dylan-profiler.github.io/visions/visions/getting_started/concepts.html#types) detection & inference on sequence data.
- [x] Automated data processing.
- [x] Completely customizable. `Visions` makes it easy to build and modify semantic data types for domain-specific purposes.
- [x] Out-of-the-box support for multiple [backend implementations](https://github.com/dylan-profiler/visions#supported-frameworks) including pandas, spark, numpy, and python.
- [x] A robust set of [default types and typesets](https://dylan-profiler.github.io/visions/visions/getting_started/usage/defaults.html) covering the most common use cases.

Check out the complete documentation [here](https://dylan-profiler.github.io/visions/visions/getting_started/introduction.html).

## Installation

The source code is available on [GitHub](https://github.com/dylan-profiler/visions), and binary installers are available via pip.

```
# Pip
pip install visions
```

Complete installation instructions (including extras) are available in the [docs](https://dylan-profiler.github.io/visions/visions/getting_started/installation.html).

## Quick Start Guide

If you want to start experimenting right away, check out the examples folder on [![](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/dylan-profiler/visions/master). Otherwise, let's get some data:

```python
import pandas as pd

df = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")
df.head(2)
```

  

    

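| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |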

    

  

The most important abstraction in `visions` is the type: types represent semantic notions about data. You have access to a range of well-tested types like `Integer`, `Float`, and `File` covering the most common software development use cases.

Types can be bundled together into typesets. Behind the scenes, `visions` builds a traversable graph for any collection of types.

```python
from visions import types, typesets

# CompleteSet is one of the builtin typesets (StandardSet is a smaller one)
typeset = typesets.CompleteSet()
typeset.plot_graph()
```

![](https://dylan-profiler.github.io/visions/_images/typeset_complete_base.svg)

Note: Plots require pygraphviz to be [installed](https://pygraphviz.github.io/documentation/stable/install.html).

Because of the special relationship between types, these graphs can be used to detect the type of your data or _infer_ a more appropriate one.

```python
# Detection looks like this
typeset.detect_type(df)

# While inference looks like this
typeset.infer_type(df)

# Inference works well even if we monkey with the data, say by converting everything to strings
typeset.infer_type(df.astype(str))
# Output:
# {
#     'PassengerId': Integer,
#     'Survived': Integer,
#     'Pclass': Integer,
#     'Name': String,
#     'Sex': String,
#     'Age': Float,
#     'SibSp': Integer,
#     'Parch': Integer,
#     'Ticket': String,
#     'Fare': Float,
#     'Cabin': String,
#     'Embarked': String
# }
```
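To make the difference between detection and inference concrete, here is a minimal pure-Python sketch of the idea. It is not how `visions` is implemented: the three-node graph, the `CONTAINS`, `CHILDREN` and `CASTS` tables, and the `detect` and `infer` helpers are all hypothetical names invented for illustration. Detection only follows membership checks on the data as stored, while inference also follows an edge when a cast along it succeeds.

```python
def is_str(values):
    return all(isinstance(v, str) for v in values)


def is_int(values):
    return all(isinstance(v, int) and not isinstance(v, bool) for v in values)


def str_to_int(values):
    """Return the values cast to int, or None if any value is not an integer string."""
    try:
        return [int(v) for v in values]
    except ValueError:
        return None


# Hypothetical toy graph: type name -> membership check.
CONTAINS = {"Generic": lambda values: True, "String": is_str, "Integer": is_int}
# Edges followed by detection (membership only).
CHILDREN = {"Generic": ["String", "Integer"], "String": [], "Integer": []}
# Extra edges followed by inference: source -> [(target, cast function)].
CASTS = {"String": [("Integer", str_to_int)]}


def detect(values, node="Generic"):
    """Walk down the graph to the most specific type the stored values belong to."""
    for child in CHILDREN[node]:
        if CONTAINS[child](values):
            return detect(values, child)
    return node


def infer(values):
    """Detect first, then keep following cast edges for as long as a cast succeeds."""
    node = detect(values)
    for _target, cast in CASTS.get(node, []):
        converted = cast(values)
        if converted is not None:
            return infer(converted)
    return node


print(detect(["1", "2", "3"]))  # String
print(infer(["1", "2", "3"]))   # Integer
print(infer(["a", "2"]))        # String
```

In this toy version the strings `"1", "2", "3"` are detected as `String` because that is how they are stored, but inferred as `Integer` because the cast succeeds for every value.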

`Visions` solves many of the most common problems of working with tabular data. For example, sequences of integers are still recognized as integers whether they have trailing decimal 0's from being cast to float, missing values, or something else altogether. Much of this cleaning is performed automatically, so you get cleaned and processed data as well.

```python
cleaned_df = typeset.cast_to_inferred(df)
```
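As a rough illustration of the kind of check this involves, the pure-Python sketch below treats a sequence of floats as integer-like when every non-missing value has no fractional part. It is an assumption-laden toy, not the logic `visions` uses: `is_missing`, `floats_are_integers` and `cast_to_nullable_int` are hypothetical helper names.

```python
import math
from typing import List, Optional, Sequence


def is_missing(value: Optional[float]) -> bool:
    """Treat None and NaN as missing values."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def floats_are_integers(values: Sequence[Optional[float]]) -> bool:
    """True if every non-missing value is a whole number such as 3.0."""
    present = [v for v in values if not is_missing(v)]
    return all(float(v).is_integer() for v in present)


def cast_to_nullable_int(values: Sequence[Optional[float]]) -> List[Optional[int]]:
    """Convert whole-number floats to int, keeping missing values as None."""
    return [None if is_missing(v) else int(v) for v in values]


counts = [1.0, 0.0, float("nan"), 3.0]
if floats_are_integers(counts):
    print(cast_to_nullable_int(counts))  # [1, 0, None, 3]
```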

This is only a small taste of everything `visions` can do, including [building your own](https://dylan-profiler.github.io/visions/visions/getting_started/extending.html) domain-specific types and typesets, so please check out the [API](https://dylan-profiler.github.io/visions/visions/api.html) documentation or the [examples/](https://github.com/dylan-profiler/visions/tree/develop/examples) directory for more info!

## Supported frameworks

Thanks to its dispatch-based implementation, `Visions` is able to exploit framework-specific capabilities offered by libraries like pandas and spark. Currently it works with the following backends by default.

- [Pandas](https://github.com/pandas-dev/pandas) (feature complete)
- [Numpy](https://github.com/numpy/numpy) (boolean, complex, date time, float, integer, string, time delta, object)
- [Spark](https://github.com/apache/spark) (boolean, categorical, date, date time, float, integer, numeric, object, string)
- [Python](https://docs.python.org/3/library/stdtypes.html#sequence-types-list-tuple-range) (string, float, integer, date time, time delta, boolean, categorical, object, complex; other datatypes are untested)

If you're using pandas, it will also take advantage of parallelization tools like [swifter](https://github.com/jmcarpenter2/swifter) if available.

It also offers a simple annotation-based API for registering new implementations as needed. For example, if you wished to extend the categorical data type to include a Dask-specific implementation, you might do something like:

```python
import dask.dataframe as dd
import pandas as pd

from visions.types.categorical import Categorical


# Register a Dask-specific membership check for Categorical
@Categorical.contains_op.register
def categorical_contains(series: dd.Series, state: dict) -> bool:
    # is_categorical_dtype is deprecated in recent pandas, so check the dtype directly
    return isinstance(series.dtype, pd.CategoricalDtype)
```
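The dispatch idea itself can be sketched with the standard library alone. The example below uses `functools.singledispatch` purely to illustrate registering one implementation per container type, so that each can exploit what its container already knows. It is not the mechanism `visions` uses internally, and `contains_integer` is a hypothetical function name.

```python
from array import array
from functools import singledispatch


@singledispatch
def contains_integer(data) -> bool:
    """Fallback for containers without a registered implementation."""
    raise NotImplementedError(f"No implementation for {type(data).__name__}")


@contains_integer.register
def _(data: list) -> bool:
    # Generic Python list: every element has to be inspected.
    return all(isinstance(v, int) and not isinstance(v, bool) for v in data)


@contains_integer.register
def _(data: array) -> bool:
    # Typed array: the typecode alone answers the question, no scan needed.
    return data.typecode in "bBhHiIlLqQ"


print(contains_integer([1, 2, 3]))          # True
print(contains_integer(array("d", [1.0])))  # False
```

Supporting another container then only requires registering one more function; the existing implementations stay untouched.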

## Contributing and support

Contributions to `visions` are welcome. For more information, please visit the community contributions [page](https://dylan-profiler.github.io/visions/visions/contributing/contributing.html) and join us on [Slack](https://join.slack.com/t/dylan-profiling/shared_invite/zt-11c9blvpt-AqxXD5AMS9Q6CO7UUm~cRw). The GitHub [issues tracker](https://github.com/dylan-profiler/visions/issues/new/choose) is used for reporting bugs, feature requests and support questions.

Also, please check out some of the other companies and packages using `visions`, including:

* [pandas profiling](https://github.com/pandas-profiling/pandas-profiling)
* [Compress*io*](https://github.com/dylan-profiler/compressio)
* [Bitrook](https://www.bitrook.com/)

If you're currently using `visions` or would like to be featured here, please let us know.

## Acknowledgements

This package is part of the [dylan-profiler](https://github.com/dylan-profiler) project. The package is a core component of [pandas-profiling](https://github.com/pandas-profiling/pandas-profiling). More information can be found [here](https://dylan-profiler.github.io/visions/visions/background/about.html). This work was partially supported by [SIDN Fonds](https://www.sidnfonds.nl/projecten/dylan-data-analysis-leveraging-automatisation).

![](https://github.com/dylan-profiler/visions/raw/master/images/SIDNfonds.png)
