{"id":13701632,"url":"https://github.com/probcomp/PClean","last_synced_at":"2025-05-04T21:31:12.498Z","repository":{"id":45757395,"uuid":"211813088","full_name":"probcomp/PClean","owner":"probcomp","description":"A domain-specific probabilistic programming language for scalable Bayesian data cleaning","archived":false,"fork":false,"pushed_at":"2024-07-31T03:47:53.000Z","size":1430,"stargazers_count":216,"open_issues_count":21,"forks_count":32,"subscribers_count":22,"default_branch":"master","last_synced_at":"2024-11-13T07:36:19.574Z","etag":null,"topics":["bayesian-inference","data-cleaning","data-cleansing","probabilistic-graphical-models","probabilistic-programming"],"latest_commit_sha":null,"homepage":"","language":"Julia","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/probcomp.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-09-30T08:31:11.000Z","updated_at":"2024-10-29T07:24:48.000Z","dependencies_parsed_at":"2024-07-31T05:45:12.441Z","dependency_job_id":"9873b1fa-a76d-4c25-a2b8-79f8bea86340","html_url":"https://github.com/probcomp/PClean","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/probcomp%2FPClean","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/probcomp%2FPClean/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/probcomp%2FPClean/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/probcomp%2FPClean/manifests",
"owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/probcomp","download_url":"https://codeload.github.com/probcomp/PClean/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252403952,"owners_count":21742469,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bayesian-inference","data-cleaning","data-cleansing","probabilistic-graphical-models","probabilistic-programming"],"created_at":"2024-08-02T20:01:52.815Z","updated_at":"2025-05-04T21:31:10.768Z","avatar_url":"https://github.com/probcomp.png","language":"Julia","funding_links":[],"categories":["Julia"],"sub_categories":[],"readme":"# PClean\n\n[![Build Status](https://travis-ci.com/probcomp/PClean.svg?branch=master)](https://travis-ci.com/probcomp/PClean)\n\nPClean: A Domain-Specific Probabilistic Programming Language for Bayesian Data Cleaning\n\n*Warning: This is a rapidly evolving research prototype.*\n\nPClean was created at the [MIT Probabilistic Computing Project](http://probcomp.csail.mit.edu/).\n\nIf you use PClean in your research, please cite our 2021 AISTATS paper:\n\nPClean: Bayesian Data Cleaning at Scale with Domain-Specific Probabilistic Programming. Lew, A. K.; Agrawal, M.; Sontag, D.; and Mansinghka, V. K. (2021, March).\nIn International Conference on Artificial Intelligence and Statistics (pp. 1927-1935). PMLR. 
([pdf](http://proceedings.mlr.press/v130/lew21a/lew21a.pdf))\n\n## Using PClean\n\n\nTo use PClean, create a Julia file with the following structure:\n\n```julia\nusing PClean\nusing DataFrames: DataFrame\nimport CSV\n\n# Load data\ndata = CSV.File(filepath) |\u003e DataFrame\n\n# Define PClean model\nPClean.@model MyModel begin\n    @class ClassName1 begin\n        ...\n    end\n\n    ...\n    \n    @class ClassNameN begin\n        ...\n    end\nend\n\n# Align column names of CSV with variables in the model.\n# Format is ColumnName CleanVariable DirtyVariable, or, if\n# there is no corruption for a certain variable, one can omit\n# the DirtyVariable.\nquery = @query MyModel.ClassNameN [\n  HospitalName hosp.name             observed_hosp_name\n  Condition    metric.condition.desc observed_condition\n  ...\n]\n\n# Configure observed dataset\nobservations = [ObservedDataset(query, data)]\n\n# Configuration\nconfig = PClean.InferenceConfig(1, 2; use_mh_instead_of_pg=true)\n\n# SMC initialization\nstate = initialize_trace(observations, config)\n\n# Rejuvenation sweeps\nrun_inference!(state, config)\n\n# Evaluate accuracy, if ground truth is available\nground_truth = CSV.File(filepath) |\u003e DataFrame\nresults = evaluate_accuracy(data, ground_truth, state, query)\n\n# Can print results.f1, results.precision, results.accuracy, etc.\nprintln(results)\n\n# Even without ground truth, can save the entire latent database to CSV files:\nPClean.save_results(dir, dataset_name, state, observations)\n```\n\nThen, from this directory, run the Julia file.\n\n```\nJULIA_PROJECT=. 
julia my_file.jl\n```\n\nTo learn to write a PClean model, see [our paper](http://proceedings.mlr.press/v130/lew21a/lew21a.pdf), but note\nthe surface syntax changes described below.\n\n## Differences from the paper\n\nAs a DSL embedded into Julia, our implementation of the PClean language has some differences, in terms of surface syntax,\nfrom the stand-alone syntax presented in our paper:\n\n(1) Instead of `latent class C ... end`, we write `@class C begin ... end`.\n\n(2) Instead of `subproblem begin ... end`, inference hints are given using ordinary\n    Julia `begin ... end` blocks.\n\n(3) Instead of `parameter x ~ d(...)`, we use `@learned x :: D{...}`. The set of\n    distributions D for parameters is somewhat restricted.\n\n(4) Instead of `x ~ d(...) preferring E`, we write `x ~ d(..., E)`.\n\n(5) Instead of `observe x as y, ... from C`, write `@query ModelName.C [x y; ...]`.\n    Clauses of the form `x z y` are also allowed, and tell PClean that the model variable\n    `C.z` represents a clean version of `x`, whose observed (dirty) version is modeled\n    as `C.y`. This is used when automatically reconstructing a clean, flat dataset.\n\nThe names of built-in distributions may also be different, e.g. `AddTypos` instead of `typos`,\nand `ProportionsParameter` instead of `dirichlet`.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fprobcomp%2FPClean","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fprobcomp%2FPClean","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fprobcomp%2FPClean/lists"}