https://github.com/dariodip/rfd-discovery
This project, written in Python and Cython, deals with Discovery of Relaxed Functional Dependencies(RFDs) using a bottom-up approach.
- Host: GitHub
- URL: https://github.com/dariodip/rfd-discovery
- Owner: dariodip
- Created: 2016-12-14T10:02:29.000Z (over 9 years ago)
- Default Branch: master
- Last Pushed: 2021-03-17T11:55:09.000Z (about 5 years ago)
- Last Synced: 2025-04-13T18:09:09.857Z (about 1 year ago)
- Topics: artificial-intelligence, cython, data-science, python, python-3, university-project
- Language: Python
- Homepage:
- Size: 3.42 MB
- Stars: 8
- Watchers: 1
- Forks: 5
- Open Issues: 1
Metadata Files:
- Readme: README.md
README
# **rfd-discovery**
[Travis CI build status](https://travis-ci.org/dariodip/rfd-discovery)
###### By
- [Altamura Antonio](https://www.linkedin.com/in/antonio-altamura-26ab85136/en)
- [Tomeo Mattia](https://www.linkedin.com/in/mattia-tomeo-b71aa6130/en)
- [Di Pasquale Dario](https://it.linkedin.com/in/dario-di-pasquale)
## Description
This project, written in Python and Cython, deals with the discovery of Relaxed Functional Dependencies (RFDs)
[[1](http://hdl.handle.net/11386/4658456)] using a bottom-up approach:
instead of taking a fixed threshold as input and then finding all the RFDs, this method infers the distances for the different RHS
attributes by itself and then discovers the RFDs for them.
rfd-discovery takes a dataset, representing a table of a relational database, in CSV format as input and prints the set
of the discovered RFDs.
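To make the notion of an RFD more concrete, here is a minimal sketch of how one could check whether a single RFD holds on a small table. It assumes the usual reading of an RFD: whenever two rows are within the given distance thresholds on every LHS attribute, they must also be within the threshold on the RHS attribute. The function names, the toy distance function and the sample data are all hypothetical, and this is not the discovery algorithm used by rfd-discovery.

```python
from itertools import combinations


def distance(a, b):
    """Toy distance: absolute difference for numbers, 0/1 mismatch otherwise."""
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return abs(a - b)
    return 0 if a == b else 1


def rfd_holds(rows, lhs, rhs):
    """Check one RFD on a list of dict rows.

    lhs: {column: threshold} for every LHS attribute (hypothetical format).
    rhs: (column, threshold) for the RHS attribute.
    """
    rhs_col, rhs_thr = rhs
    for r1, r2 in combinations(rows, 2):
        # The pair is "similar" on the LHS if every distance is within its threshold.
        lhs_close = all(distance(r1[c], r2[c]) <= t for c, t in lhs.items())
        if lhs_close and distance(r1[rhs_col], r2[rhs_col]) > rhs_thr:
            return False  # similar on the LHS but too far apart on the RHS
    return True


rows = [
    {"height": 175, "weight": 70, "shoe": 41},
    {"height": 175, "weight": 70, "shoe": 40},
    {"height": 180, "weight": 80, "shoe": 43},
    {"height": 181, "weight": 82, "shoe": 44},
]

print(rfd_holds(rows, {"height": 1, "weight": 2}, ("shoe", 1)))  # True
print(rfd_holds(rows, {"height": 1, "weight": 2}, ("shoe", 0)))  # False
```

With a fixed-threshold method the thresholds above would be supplied by the user; the bottom-up approach described here derives them from the data instead.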
The CSV file can contain columns of the following types:
- int;
- int32;
- int64;
- float;
- float64;
- string;
- datetime64*.
*For dates, you can use any of the formats recognized by [pandas](http://pandas.pydata.org/pandas-docs/stable/timeseries.html).
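As a rough illustration of what a distance on a datetime column can look like, the sketch below parses two date strings and expresses their difference in a chosen time unit, in the spirit of the `-d` option described under Usage. It uses only the standard library rather than pandas, and the function name, the date format and the unit names are assumptions made for this example; the project's own computation may differ.

```python
from datetime import datetime, timedelta

# Hypothetical unit table for this example only.
UNITS = {
    "ms": timedelta(milliseconds=1),
    "sec": timedelta(seconds=1),
    "min": timedelta(minutes=1),
}


def date_distance(a, b, unit="sec", fmt="%Y-%m-%d %H:%M:%S"):
    """Absolute difference between two date strings, expressed in `unit`."""
    delta = abs(datetime.strptime(b, fmt) - datetime.strptime(a, fmt))
    # Dividing a timedelta by a timedelta yields a float.
    return delta / UNITS[unit]


print(date_distance("2016-12-14 10:00:00", "2016-12-14 10:30:00", "min"))  # 30.0
print(date_distance("2016-12-14 10:00:00", "2016-12-14 10:30:00", "sec"))  # 1800.0
```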
***Index:***
- [Requirements](#requirements)
- [Setup rfd-discovery](#setup)
- [Build](#build)
- [Usage](#usage)
## Requirements
rfd-discovery is developed using **[Python 3.5](http://www.python.it/)**, a C compiler ([gcc](https://gcc.gnu.org/) or [Visual Studio C++](https://www.visualstudio.com/vs/cplusplus/)) and [Cython 0.25.2](http://cython.org/);
the latter is used to reduce time and memory consumption in CPU-bound operations.
To run rfd-discovery correctly, you have to install **Python 3.5** and **Cython 0.25**.
To install all the requirements correctly, you need **pip 9.0** (or higher).
rfd-discovery uses the following Python libraries:
*[matplotlib✛](http://matplotlib.org/)*
*[numpy✛](http://www.numpy.org/)*
*[pandas✛](http://pandas.pydata.org/)*
*[tornado](http://www.tornadoweb.org/en/stable/)*
*[Cython](http://cython.org/)*
*[nltk](http://www.nltk.org/)*
*[flask](http://flask.pocoo.org/)*
You can install these by following the [Setup Section](#setup).
✛ These libraries are part of the [SciPy stack](https://www.scipy.org/index.html).
## Setup
In order to install rfd-discovery and all its requirements, you have to create a virtual environment using [virtualenv](https://virtualenv.pypa.io/en/stable/) on Python 3.5.
To install *virtualenv*, run the following:
`[sudo] pip3 install virtualenv` on Linux/macOS
or
`pip install virtualenv` using the prompt as the administrator on Windows.
To create a virtual environment, in the main directory of the project run:
`virtualenv venv`.
To activate the virtual environment, in the main directory of the project run:
`source venv/bin/activate` on Linux/MacOS
or
`venv\Scripts\activate` on Windows.
You can check that the virtual environment is active by verifying that the command prompt has the prefix `(venv)`.
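If the prompt prefix is not visible (for example in a customized shell), the same check can be done from Python. The following is a minimal sketch using only the standard library; the function name is hypothetical. It relies on the fact that older virtualenv releases set `sys.real_prefix`, while `venv` and newer virtualenv releases make `sys.prefix` differ from `sys.base_prefix`.

```python
import sys


def in_virtual_environment():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Fall back to sys.prefix if base_prefix is missing, so the comparison is False.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    return hasattr(sys, "real_prefix") or sys.prefix != base_prefix


print(in_virtual_environment())
```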
To install all the requirements, run the following:
`pip install -r requirements.txt`
This should install, using [pip](https://pypi.python.org/pypi/pip), all the [requirements](#requirements).
To install WordNet, run:
`python setup.py install`.
## Build
Part of rfd-discovery is written in *Cython*, a superset of the Python programming language designed to give C-like
performance with code that is mostly written in Python. This is because the operations that take place in the code are mostly
CPU-bound, and running them in pure Python would waste computation and memory resources.
You can compile the Cython code by running the following:
`python build.py build_ext --inplace`
This will generate C code from the Cython code and try to compile it.
**Note that you'll need gcc or another C compiler.**
If the build phase ends without errors, you should have some *.c* and *.pyd* (or *.so*, depending on your OS) files. Don't
worry about dealing with these, Python does it automatically **:)**.
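If you want to verify that the build produced something your interpreter can import, the sketch below lists the extension-module suffixes accepted on the current platform and searches a directory for matching files. It uses only the standard library; the helper name is hypothetical and the directory to scan is an assumption (here, the current directory and its subdirectories).

```python
import importlib.machinery
from pathlib import Path

# Suffixes the running interpreter accepts for compiled extension modules
# (they end in .so on Linux/macOS and in .pyd on Windows).
suffixes = importlib.machinery.EXTENSION_SUFFIXES
print(suffixes)


def compiled_extensions(directory="."):
    """Return the names of compiled extension modules found under `directory`."""
    return sorted(
        path.name
        for path in Path(directory).rglob("*")
        if path.name.endswith(tuple(suffixes))
    )


print(compiled_extensions("."))
```

An empty list after a build suggests that the compilation step did not complete.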
## Usage
Using rfd-discovery is easy enough. Just run the following command:
`python3 main.py -c [options]`
- *`-c `*: is the path of the dataset on which you want to discover RFDs;
Options:
- *`-v`* : display the version number;
- *`-s `*: the separation char used in your CSV file. If you don't provide this, rfd-discovery tries to infer
it for you;
- *`-h`*: indicates that the CSV file has a header row. If you don't provide this, rfd-discovery tries to infer it for you;
- *`-r `*: is the column number of the RHS attribute. It must be a valid integer. You can avoid specifying it only if you don't specify LHS attributes (it will find RFDs using each attribute as RHS and the remaining as LHS);
- *`-l `*: column indexes of the LHS attributes, separated by commas
(e.g. *1,2,3*). You can avoid specifying them:
  - if you don't specify the index of the RHS attribute, it will find RFDs using each attribute as RHS and the remaining ones as LHS;
  - if you specify a valid RHS index, it will assume the remaining attributes as your LHS;
- *`-i `*: the column which contains the primary key of the dataset. If you specify it, the program will not
calculate distances on it. **NOTE: the index column should contain unique values**;
- *`-d `*: a list of columns, separated by commas, whose values are in datetime format.
Specifying this, rfd-discovery can express the distance between two dates in time units (e.g. ms, sec, min);
- *`--semantic`*: use semantic distance on WordNet for strings.
For more info, see [here](http://www.cs.toronto.edu/pub/gh/Budanitsky+Hirst-2001.pdf);
- *`--human`*: print the RFDs to the standard output in a human-readable form;
- *`--help`*: show help.
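The interplay between `-r` and `-l` described above can be summarized with a small sketch. The function names are hypothetical, column indexes are assumed to be zero-based (as the examples below suggest), and this is only an illustration of the documented rules, not the code in `main.py`.

```python
def parse_indexes(text):
    """Turn a comma-separated string such as '1,2,3' into [1, 2, 3]."""
    return [int(part) for part in text.split(",")]


def resolve_combinations(n_columns, rhs=None, lhs=None):
    """Return the (rhs_index, lhs_indexes) pairs that would be examined."""
    columns = list(range(n_columns))
    if rhs is None:
        if lhs is not None:
            # RHS may be omitted only when LHS is omitted too.
            raise ValueError("LHS attributes require an explicit RHS index")
        # Neither given: try each attribute as RHS, the remaining ones as LHS.
        return [(r, [c for c in columns if c != r]) for r in columns]
    if rhs not in columns:
        raise ValueError("RHS index out of range")
    if lhs is None:
        # Only RHS given: the LHS is every other attribute.
        lhs = [c for c in columns if c != rhs]
    return [(rhs, lhs)]


print(resolve_combinations(3))
# [(0, [1, 2]), (1, [0, 2]), (2, [0, 1])]
print(resolve_combinations(4, rhs=0))
# [(0, [1, 2, 3])]
print(resolve_combinations(4, rhs=0, lhs=parse_indexes("1,2,3")))
# [(0, [1, 2, 3])]
```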
##### Valid Examples:
###### Check on each combination of attributes:
`python main.py -c resources/dataset.csv`
###### Infer LHS attributes given a fixed RHS attribute index:
`python main.py -c resources/dataset.csv -r 0`
###### RHS and LHS fixed, separator and header line specified:
`python main.py -c resources/dataset.csv -r 0 -l 1,2,3 -s , -h 0`