{"id":13693198,"url":"https://github.com/HoloClean/holoclean","last_synced_at":"2025-05-02T21:31:43.828Z","repository":{"id":29750949,"uuid":"156911280","full_name":"HoloClean/holoclean","owner":"HoloClean","description":"A Machine Learning System for Data Enrichment.","archived":false,"fork":false,"pushed_at":"2023-07-20T01:00:14.000Z","size":8482,"stargazers_count":519,"open_issues_count":20,"forks_count":131,"subscribers_count":30,"default_branch":"master","last_synced_at":"2024-12-08T00:03:00.730Z","etag":null,"topics":["data-enrichment","data-science","inference-engine","machine-learning","pytorch"],"latest_commit_sha":null,"homepage":"http://www.holoclean.io","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/HoloClean.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2018-11-09T19:55:30.000Z","updated_at":"2024-11-30T16:28:51.000Z","dependencies_parsed_at":"2022-09-01T10:42:02.485Z","dependency_job_id":"797087ad-2e21-4e3d-acc4-d8cf84d596ee","html_url":"https://github.com/HoloClean/holoclean","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HoloClean%2Fholoclean","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HoloClean%2Fholoclean/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HoloClean%2Fholoclean/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HoloClean%2Fholoclean/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/HoloClean","downl
oad_url":"https://codeload.github.com/HoloClean/holoclean/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252108841,"owners_count":21696146,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-enrichment","data-science","inference-engine","machine-learning","pytorch"],"created_at":"2024-08-02T17:01:06.841Z","updated_at":"2025-05-02T21:31:43.053Z","avatar_url":"https://github.com/HoloClean.png","language":"Python","funding_links":[],"categories":["Data-driven methods","Python"],"sub_categories":["Tabular"],"readme":"Master:\n[![Build Status](https://travis-ci.org/HoloClean/holoclean.svg?branch=master)](https://travis-ci.org/HoloClean/holoclean)\nDev:\n[![Build Status](https://travis-ci.org/HoloClean/holoclean.svg?branch=dev)](https://travis-ci.org/HoloClean/holoclean)\n\n# HoloClean: A Machine Learning System for Data Enrichment\n\n[HoloClean](http://www.holoclean.io) is built on top of PyTorch and PostgreSQL.\n\nHoloClean is a statistical inference engine to impute, clean, and enrich data.\nAs a weakly supervised machine learning system, HoloClean leverages available\nquality rules, value correlations, reference data, and multiple other signals\nto build a probabilistic model that accurately captures the data generation\nprocess, and uses the model in a variety of data curation tasks. 
HoloClean\nallows data practitioners and scientists to save the enormous time they spend\nin building piecemeal cleaning solutions, and instead, effectively communicate\ntheir domain knowledge in a declarative way to enable accurate analytics,\npredictions, and insights from noisy, incomplete, and erroneous data.\n\n## Installation\n\nHoloClean was tested on Python versions 2.7, 3.6, and 3.7. \nIt requires PostgreSQL version 9.4 or higher.\n\n\n### 1. Install and configure PostgreSQL\n\nWe describe how to install PostgreSQL and configure it for HoloClean\n(creating a database, a user, and setting the required permissions).\n\n#### Option 1: Native installation of PostgreSQL\n\nA native installation of PostgreSQL runs faster than docker containers.\nWe explain how to install PostgreSQL then how to configure it for HoloClean use.\n\n##### a. Installing PostgreSQL\n\nOn Ubuntu, install PostgreSQL by running\n`\n$ apt-get install postgresql postgresql-contrib\n`\n\nFor macOS, you can find the installation instructions on\n[https://www.postgresql.org/download/macosx/](https://www.postgresql.org/download/macosx/)\n\n##### b. Setting up PostgreSQL for HoloClean\n\nBy default, HoloClean needs a database `holo` and a user `holocleanuser` with permissions on it.\n\n1. Start the PostgreSQL `psql` console from the terminal using \\\n`$ psql --user \u003cusername\u003e`. You can omit `--user \u003cusername\u003e` to use the current user.\n\n2. 
Create a database `holo` and user `holocleanuser`\n```sql\nCREATE DATABASE holo;\nCREATE USER holocleanuser;\nALTER USER holocleanuser WITH PASSWORD 'abcd1234';\nGRANT ALL PRIVILEGES ON DATABASE holo TO holocleanuser;\n\\c holo\nALTER SCHEMA public OWNER TO holocleanuser;\n```\n\nYou can connect to the `holo` database from the PostgreSQL `psql` console by running\n`psql -U holocleanuser -W holo`.\n\nHoloClean currently populates the database `holo` with auxiliary and meta tables.\nTo clear the database simply connect as a `root` user or as `holocleanuser` and run\n```sql\nDROP DATABASE holo;\nCREATE DATABASE holo;\n```\n\n#### Option 2: Using Docker\nIf you are familiar with docker, an easy way to start using\nHoloClean is to start a PostgreSQL docker container.\n\nTo start a PostgreSQL docker container, run the following command:\n\n```bash\ndocker run --name pghc \\\n    -e POSTGRES_DB=holo -e POSTGRES_USER=holocleanuser -e POSTGRES_PASSWORD=abcd1234 \\\n    -p 5432:5432 \\\n    -d postgres:11\n```\n\nwhich starts a backend server and creates a database with the required permissions.\n\nYou can then use `docker start pghc` and `docker stop pghc` to start/stop the container.\n\n\nNote the port number which may conflict with existing PostgreSQL servers.\nRead more about this docker image [here](https://hub.docker.com/_/postgres/). \n\n### 2. Setting up HoloClean\nHoloClean runs on Python 2.7 or 3.6+. We recommend running it from within\na virtual environment.\n\n#### Creating a virtual environment for HoloClean\n##### Option 1: Conda Virtual Environment\n\nFirst, download Anaconda (not miniconda) from [this link](https://www.anaconda.com/download).\nFollow the steps for your OS and framework. 
\n\nSecond, create a conda environment (python 2.7 or 3.6+).\nFor example, to create a *Python 3.6* conda environment, run:\n\n```bash\n$ conda create -n hc36 python=3.6\n```\n\nUpon starting/restarting your terminal session, you will need to activate your\nconda environment by running\n```bash\n$ conda activate hc36\n```\n\n##### Option 2: Set up a virtual environment using pip and Virtualenv\n\nIf you are familiar with `virtualenv`, you can use it to create \na virtual environment.\n\nFor Python 3.6, create a new environment\nwith your preferred virtualenv wrapper, for example:\n\n* [virtualenvwrapper](https://virtualenvwrapper.readthedocs.io/en/latest/) (Bourne-shells)\n* [virtualfish](https://virtualfish.readthedocs.io/en/latest/) (fish-shell)\n\n\nEither follow instructions [here](https://virtualenv.pypa.io/en/stable/installation/) or install via\n`pip`.\n```bash\n$ pip install virtualenv\n```\n\nThen, create a `virtualenv` environment by creating a new directory for a Python 3.6 virtualenv environment\n```bash\n$ mkdir -p hc36\n$ virtualenv --python=python3.6 hc36\n```\nwhere `python3.6` is a valid reference to a Python 3.6 executable.\n\nActivate the environment\n```bash\n$ source hc36/bin/activate\n```\n\n#### Install the required python packages\n\n*Note: make sure that the environment is activated throughout the installation process.\nWhen you are done, deactivate it using* \n`conda deactivate`, `source deactivate`, *or* `deactivate` \n*depending on your version*.\n\nIn the project root directory, run the following to install the required packages.\nNote that this command installs the packages within the activated virtual environment.\n\n```bash\n$ pip install -r requirements.txt\n```\n\n\n*Note for macOS Users:*\nyou may need to install XCode developer tools using `xcode-select --install`.\n\n\n## Running HoloClean\n\nSee the code in `examples/holoclean_repair_example.py` for a documented usage of HoloClean.\n\nIn order to run the example script, run 
the following:\n```bash\n$ cd examples\n$ ./start_example.sh\n```\n\nNotice that the script sets up the Python path environment to run HoloClean.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FHoloClean%2Fholoclean","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FHoloClean%2Fholoclean","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FHoloClean%2Fholoclean/lists"}