{"id":15350594,"url":"https://github.com/maropu/spark-data-repair-plugin","last_synced_at":"2025-04-15T03:32:36.092Z","repository":{"id":37682043,"uuid":"225989054","full_name":"maropu/spark-data-repair-plugin","owner":"maropu","description":"Provide functionality to build statistical models to repair dirty tabular data in Spark","archived":false,"fork":false,"pushed_at":"2023-04-21T20:49:04.000Z","size":37182,"stargazers_count":12,"open_issues_count":3,"forks_count":4,"subscribers_count":4,"default_branch":"master","last_synced_at":"2024-11-01T08:50:42.814Z","etag":null,"topics":["data-repairing","distributed-computing","error-detection","parallel-computing","spark"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/maropu.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-12-05T01:21:07.000Z","updated_at":"2024-05-23T17:16:36.000Z","dependencies_parsed_at":"2024-10-16T01:42:07.665Z","dependency_job_id":"7f741645-463b-41a8-a9f7-30ef2cc30174","html_url":"https://github.com/maropu/spark-data-repair-plugin","commit_stats":{"total_commits":592,"total_committers":2,"mean_commits":296.0,"dds":"0.0016891891891891442","last_synced_commit":"7701550d50d079f0471d8ff70f23f8931e199b6b"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maropu%2Fspark-data-repair-plugin","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maropu%2Fspark-data-repair
-plugin/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maropu%2Fspark-data-repair-plugin/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maropu%2Fspark-data-repair-plugin/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/maropu","download_url":"https://codeload.github.com/maropu/spark-data-repair-plugin/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":223657852,"owners_count":17181024,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-repairing","distributed-computing","error-detection","parallel-computing","spark"],"created_at":"2024-10-01T11:58:46.960Z","updated_at":"2024-11-08T09:04:29.355Z","avatar_url":"https://github.com/maropu.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![License](http://img.shields.io/:license-Apache_v2-blue.svg)](https://github.com/maropu/spark-data-repair-plugin/blob/master/LICENSE)\n[![Build and test](https://github.com/maropu/spark-data-repair-plugin/workflows/Build%20and%20tests/badge.svg)](https://github.com/maropu/spark-data-repair-plugin/actions?query=workflow%3A%22Build+and+tests%22)\n\u003c!---\n[![Coverage Status](https://coveralls.io/repos/github/maropu/spark-data-repair-plugin/badge.svg?branch=master)](https://coveralls.io/github/maropu/spark-data-repair-plugin?branch=master)\n--\u003e\n\nThis is an experimental prototype for building a statistical model to repair tabular data errors on [Apache 
Spark](https://spark.apache.org/)\nwhich is a parallel and distributed framework for large-scale data processing.\nClean and consistent data is one of major interests for downstream analytics;\nclean data makes machine learning and BI reporting more accurate and\nconsistent data with constraints (e.g., functional dependences) is important for efficient query plans.\nTherefore, data repairing is a first step for a reliable analytics pipeline.\n\n## How to Repair Error Cells\n\n```\n$ git clone https://github.com/maropu/spark-data-repair-plugin.git\n$ cd spark-data-repair-plugin\n\n# This repository includes a simple wrapper script `bin/python` to create\n# a conda virtual environment to resolve the required dependencies\n# (e.g., Python 3.7 and PySpark 3.2), and then\n# launch a Python VM with our plugin.\n$ ./bin/python\n\nWelcome to\n      ____              __\n     / __/__  ___ _____/ /__\n    _\\ \\/ _ \\/ _ `/ __/  '_/\n   /__ / .__/\\_,_/_/ /_/\\_\\   version 3.2.0\n      /_/\n\nUsing Python version 3.7.11 (default, Jul 27 2021 07:03:16)\nSparkSession available as 'spark'.\nDelphi APIs (version 0.1.0-spark3.2-EXPERIMENTAL) available as 'delphi'.\n\n# Loads CSV data having seven NULL cells\n\u003e\u003e\u003e spark.read.option(\"header\", True).csv(\"./testdata/adult.csv\").createOrReplaceTempView(\"adult\")\n\u003e\u003e\u003e spark.table(\"adult\").show()\n+---+-----+------------+-----------------+-------------+------+-------------+-----------+\n|tid|  Age|   Education|       Occupation| Relationship|   Sex|      Country|     Income|\n+---+-----+------------+-----------------+-------------+------+-------------+-----------+\n|  0|31-50|Some-college|     Craft-repair|      Husband|  Male|United-States|LessThan50K|\n|  1|  \u003e50|Some-college|  Exec-managerial|    Own-child|Female|United-States|LessThan50K|\n|  2|31-50|   Bachelors|            Sales|      Husband|  Male|United-States|LessThan50K|\n|  3|22-30|     HS-grad|     Craft-repair|    Own-child|  
null|United-States|LessThan50K|\n|  4|22-30|     HS-grad|  Farming-fishing|      Husband|Female|United-States|LessThan50K|\n|  5| null|Some-college|     Craft-repair|      Husband|  Male|United-States|       null|\n|  6|31-50|     HS-grad|   Prof-specialty|Not-in-family|Female|United-States|LessThan50K|\n|  7|31-50| Prof-school|   Prof-specialty|      Husband|  null|        India|MoreThan50K|\n|  8|18-21|Some-college|     Adm-clerical|    Own-child|Female|United-States|LessThan50K|\n|  9|  \u003e50|     HS-grad|  Farming-fishing|      Husband|  Male|United-States|LessThan50K|\n| 10|  \u003e50|   Assoc-voc|   Prof-specialty|      Husband|  Male|United-States|LessThan50K|\n| 11|  \u003e50|     HS-grad|            Sales|      Husband|Female|United-States|MoreThan50K|\n| 12| null|   Bachelors|  Exec-managerial|      Husband|  null|United-States|MoreThan50K|\n| 13|22-30|     HS-grad|     Craft-repair|Not-in-family|  Male|United-States|LessThan50K|\n| 14|31-50|  Assoc-acdm|  Exec-managerial|    Unmarried|  Male|United-States|LessThan50K|\n| 15|22-30|Some-college|            Sales|    Own-child|  Male|United-States|LessThan50K|\n| 16|  \u003e50|Some-college|  Exec-managerial|    Unmarried|Female|United-States|       null|\n| 17|31-50|     HS-grad|     Adm-clerical|Not-in-family|Female|United-States|LessThan50K|\n| 18|31-50|        10th|Handlers-cleaners|      Husband|  Male|United-States|LessThan50K|\n| 19|31-50|     HS-grad|            Sales|      Husband|  Male|         Iran|MoreThan50K|\n+---+-----+------------+-----------------+-------------+------+-------------+-----------+\n\n# Runs a job to compute repair updates for the seven NULL cells above in the `adult` table\n# A `repaired` column represents proposed updates to repair them\n\u003e\u003e\u003e from repair.errors import NullErrorDetector\n\u003e\u003e\u003e repair_updates_df = delphi.repair \\\n...   .setInput(\"adult\") \\\n...   .setRowId(\"tid\") \\\n...   .setErrorDetectors([NullErrorDetector()]) \\\n...   
.run()\n\n\u003e\u003e\u003e repair_updates_df.show()\n+---+---------+-------------+-----------+\n|tid|attribute|current_value|   repaired|\n+---+---------+-------------+-----------+\n|  7|      Sex|         null|     Female|\n| 12|      Age|         null|      18-21|\n| 12|      Sex|         null|     Female|\n|  3|      Sex|         null|     Female|\n|  5|      Age|         null|      18-21|\n|  5|   Income|         null|MoreThan50K|\n| 16|   Income|         null|MoreThan50K|\n+---+---------+-------------+-----------+\n\n# You need to set `True` to `repair_data` for getting repaired data directly\n\u003e\u003e\u003e clean_df = delphi.repair \\\n...   .setInput(\"adult\") \\\n...   .setRowId(\"tid\") \\\n...   .setErrorDetectors([NullErrorDetector()]) \\\n...   .run(repair_data=True)\n\n\u003e\u003e\u003e clean_df.show()\n+---+-----+------------+-----------------+-------------+------+-------------+-----------+\n|tid|  Age|   Education|       Occupation| Relationship|   Sex|      Country|     Income|\n+---+-----+------------+-----------------+-------------+------+-------------+-----------+\n|  0|31-50|Some-college|     Craft-repair|      Husband|  Male|United-States|LessThan50K|\n|  1|  \u003e50|Some-college|  Exec-managerial|    Own-child|Female|United-States|LessThan50K|\n|  2|31-50|   Bachelors|            Sales|      Husband|  Male|United-States|LessThan50K|\n|  3|22-30|     HS-grad|     Craft-repair|    Own-child|  Male|United-States|LessThan50K|\n|  4|22-30|     HS-grad|  Farming-fishing|      Husband|Female|United-States|LessThan50K|\n|  5|31-50|Some-college|     Craft-repair|      Husband|  Male|United-States|LessThan50K|\n|  6|31-50|     HS-grad|   Prof-specialty|Not-in-family|Female|United-States|LessThan50K|\n|  7|31-50| Prof-school|   Prof-specialty|      Husband|  Male|        India|MoreThan50K|\n|  8|18-21|Some-college|     Adm-clerical|    Own-child|Female|United-States|LessThan50K|\n|  9|  \u003e50|     HS-grad|  Farming-fishing|      Husband|  
Male|United-States|LessThan50K|\n| 10|  \u003e50|   Assoc-voc|   Prof-specialty|      Husband|  Male|United-States|LessThan50K|\n| 11|  \u003e50|     HS-grad|            Sales|      Husband|Female|United-States|MoreThan50K|\n| 12|31-50|   Bachelors|  Exec-managerial|      Husband|  Male|United-States|MoreThan50K|\n| 13|22-30|     HS-grad|     Craft-repair|Not-in-family|  Male|United-States|LessThan50K|\n| 14|31-50|  Assoc-acdm|  Exec-managerial|    Unmarried|  Male|United-States|LessThan50K|\n| 15|22-30|Some-college|            Sales|    Own-child|  Male|United-States|LessThan50K|\n| 16|  \u003e50|Some-college|  Exec-managerial|    Unmarried|Female|United-States|LessThan50K|\n| 17|31-50|     HS-grad|     Adm-clerical|Not-in-family|Female|United-States|LessThan50K|\n| 18|31-50|        10th|Handlers-cleaners|      Husband|  Male|United-States|LessThan50K|\n| 19|31-50|     HS-grad|            Sales|      Husband|  Male|         Iran|MoreThan50K|\n+---+-----+------------+-----------------+-------------+------+-------------+-----------+\n\n# Or, you can merge the computed repair updates with the input table as follows\n\u003e\u003e\u003e repair_updates_df.createOrReplaceTempView(\"predicted\")\n\u003e\u003e\u003e clean_df = delphi.misc.options({\"repair_updates\": \"predicted\", \"table_name\": \"adult\", \"row_id\": \"tid\"}).repair()\n\u003e\u003e\u003e clean_df.show()\n\u003cthe same output above\u003e\n```\n\nFor more running examples, please check Python scripts in the [resources/examples](./resources/examples) folder.\n\nNOTE: There are many types of errors on dirty data [9], but our purpose is to repair the data\nwhose attribute already has correct values against their errors.\nFor instance, in the `Sex` column in the `adult` table above, our plugin can repair the three NULL cells\nbecause it already has correct values, `Female` or `Male`, against the NULL cells.\nTo repair them, our plugin captures and exploits data dependencies between the `Sex` column and the 
other ones.\nFor repairing the other types of data errors, existing data cleaning tools might be suitable;\na programming-by-examples technique is a good fit to fix format errors like `2021.8.23` -\u003e `2021/8/23` and\n[Trifacta](https://www.trifacta.com/) has a feature,\nnamed [Transformation by Example](https://docs.trifacta.com/display/SS/Transformation+by+Example+Page),\nto implement it. Few existing tools can handle the error cases in the `adult` example above and,\ntherefore, our plugin is complementary to those other tools.\n\n## Error Detection\n\nTo detect error cells, you can use some of the built-in error detectors below:\n\n - NullErrorDetector\n - DomainValues\n - RegExErrorDetector\n - ConstraintErrorDetector\n - GaussianOutlierErrorDetector\n - LOFOutlierErrorDetector\n\nPlease check [the example code](./resources/examples/error-detectors.py) for how to use these error detectors.\nIf you specify no error detector, `DomainValues` for each attribute and `NullErrorDetector` are used by default.\n\n```\n# Setting `detect_errors_only` to `True` returns only the detected error cells\n\u003e\u003e\u003e error_cells_df = delphi.repair \\\n...   .setInput(\"adult\") \\\n...   .setRowId(\"tid\") \\\n...   .setErrorDetectors([NullErrorDetector()]) \\\n...   .run(detect_errors_only=True)\n\n\u003e\u003e\u003e error_cells_df.show()\n+---+---------+-------------+\n|tid|attribute|current_value|\n+---+---------+-------------+\n| 12|      Age|         null|\n|  5|      Age|         null|\n| 12|      Sex|         null|\n|  7|      Sex|         null|\n|  3|      Sex|         null|\n| 16|   Income|         null|\n|  5|   Income|         null|\n+---+---------+-------------+\n\n# `DomainValues` and `NullErrorDetector` are used by default\n\u003e\u003e\u003e error_cells_df = delphi.repair \\\n...   .setInput(\"adult\") \\\n...   .setRowId(\"tid\") \\\n...   
.run(detect_errors_only=True)\n\n\u003e\u003e\u003e error_cells_df.show()\n+---+----------+--------------+\n|tid| attribute| current_value|\n+---+----------+--------------+\n| 12|       Age|          null|\n|  5|       Age|          null|\n|  7|       Sex|          null|\n| 12|       Sex|          null|\n|  3|       Sex|          null|\n|  5|    Income|          null|\n| 16|    Income|          null|\n|  4|       Age|         22-30|\n|  8|       Age|         18-21|\n|  3|       Age|         22-30|\n| 13|       Age|         22-30|\n| 15|       Age|         22-30|\n| 10| Education|     Assoc-voc|\n|  7| Education|   Prof-school|\n| 14| Education|    Assoc-acdm|\n| 12| Education|     Bachelors|\n|  2| Education|     Bachelors|\n| 18| Education|          10th|\n|  0|Occupation|  Craft-repair|\n|  6|Occupation|Prof-specialty|\n+---+----------+--------------+\nonly showing top 20 rows\n```\n\nNote that `ConstraintErrorDetector` is the most powerful choice; it uses [denial constraints](https://www.sciencedirect.com/science/article/pii/S0890540105000179) [5]\nthat an input tabular data should follow. The constraints consist of the predicates that cannot hold true simultaneously.\n\n```\n# Constraints below mean that `Sex=\"Female\"` and `Relationship=\"Husband\"`\n# (`Sex=\"Male\"` and `Relationship=\"Wife\"`) does not hold true simultaneously.\n# Note that the syntax for denial constraints follows the HoloClean [7] one and\n# it is a research-backed statistical inference engine to clean data.\n$ cat ./testdata/adult_constraints.txt\nt1\u0026EQ(t1.Sex,\"Female\")\u0026EQ(t1.Relationship,\"Husband\")\nt1\u0026EQ(t1.Sex,\"Male\")\u0026EQ(t1.Relationship,\"Wife\")\n\n# Use the constraints to detect errors and then repair them\n\u003e\u003e\u003e repair_updates_df = delphi.repair \\\n...   .setInput(\"adult\") \\\n...   .setRowId(\"tid\") \\\n...   .setErrorDetectors([NullErrorDetector(), ConstraintErrorDetector(constraint_path=\"./testdata/adult_constraints.txt\")]) \\\n...   
.run()\n\n# Changes values from `Female` to `Male` in the `Sex` cells\n# of the 4th and 11th rows.\n\u003e\u003e\u003e repair_updates_df.show()\n+---+------------+-------------+-----------+\n|tid|   attribute|current_value|   repaired|\n+---+------------+-------------+-----------+\n|  3|         Sex|         null|       Male|\n|  4|Relationship|      Husband|    Husband|\n|  4|         Sex|       Female|       Male|\n|  5|         Age|         null|      31-50|\n|  5|      Income|         null|LessThan50K|\n|  7|         Sex|         null|       Male|\n| 11|Relationship|      Husband|    Husband|\n| 11|         Sex|       Female|       Male|\n| 12|         Age|         null|      31-50|\n| 12|         Sex|         null|       Male|\n| 16|      Income|         null|LessThan50K|\n+---+------------+-------------+-----------+\n\n# If the \"adult\" table has a functional dependency from \"Age\" to \"Income\",\n# its dependency is represented as a following denial constraint:\n\u003e\u003e\u003e repair_updates_df = delphi.repair \\\n...   .setInput(\"adult\") \\\n...   .setRowId(\"tid\") \\\n...   .setErrorDetectors([ConstraintErrorDetector(constraints=\"t1\u0026t2\u0026EQ(t1.Age,t2.Age)\u0026IQ(t1.Income,t2.Income)\")]) \\\n...   .run()\n\n# Or, you can use syntactic sugar instead\n\u003e\u003e\u003e repair_updates_df = delphi.repair \\\n...   .setInput(\"adult\") \\\n...   .setRowId(\"tid\") \\\n...   .setErrorDetectors([ConstraintErrorDetector(constraints=\"Age-\u003eIncome\")]) \\\n...   
.run()\n```\n\n## Repairing based on Predicted Probabilities\n\nIf you want to select some of the repair updates based on their probabilities, you can set\n`compute_repair_prob` to `True` to get the probabilities from the built statistical models.\n\n```\n# To get predicted probabilities, computes repair updates with `compute_repair_prob`=`True`\n\u003e\u003e\u003e repair_updates_df = delphi.repair.setInput(\"adult\").setRowId(\"tid\").run(compute_repair_prob=True)\n\u003e\u003e\u003e repair_updates_df.show()\n+---+---------+-------------+-----------+------------------+\n|tid|attribute|current_value|   repaired|              prob|\n+---+---------+-------------+-----------+------------------+\n|  3|      Sex|         null|     Female|0.6664498420338913|\n|  7|      Sex|         null|     Female|0.7436767447201434|\n| 16|   Income|         null|MoreThan50K|0.8721610530603738|\n|  5|      Age|         null|      18-21|0.3018171710707878|\n|  5|   Income|         null|MoreThan50K|0.8333912988626406|\n| 12|      Age|         null|      18-21|0.3598905853884847|\n| 12|      Sex|         null|     Female|0.7436767447201434|\n+---+---------+-------------+-----------+------------------+\n\n# Applies the repair updates whose probabilities are greater than 0.70\n\u003e\u003e\u003e repair_updates_df.where(\"prob \u003e 0.70\").createOrReplaceTempView(\"predicted\")\n\u003e\u003e\u003e clean_df = delphi.misc.options({\"repair_updates\": \"predicted\", \"table_name\": \"adult\", \"row_id\": \"tid\"}).repair()\n\u003e\u003e\u003e clean_df.show()\n\u003coutput with the four cells repaired\u003e\n```\n\n## Run a Repair Job via spark-submit\n\nYou can run a repair job ([main.py](./python/main.py)) on your Spark cluster as follows:\n\n```\n$ echo $SPARK_HOME\n/tmp/spark-3.2.0-bin-hadoop3.2\n\n$ ./bin/spark-submit ./python/main.py --input adult --output repaired --row-id tid\nPredicted repair values are saved as 'repaired'\n\n$ $SPARK_HOME/bin/spark-shell\n\nscala\u003e 
spark.table(\"repaired\").show()\n+---+---------+-------------+-----------+\n|tid|attribute|current_value|   repaired|\n+---+---------+-------------+-----------+\n|  7|      Sex|         null|     Female|\n| 12|      Age|         null|      18-21|\n| 12|      Sex|         null|     Female|\n|  3|      Sex|         null|     Female|\n|  5|      Age|         null|      18-21|\n|  5|   Income|         null|MoreThan50K|\n| 16|   Income|         null|MoreThan50K|\n+---+---------+-------------+-----------+\n```\n\n## Major Configurations\n\n```\ndelphi.repair\n\n  // Basic Parameters\n  .setDbName(str)                              // database name (default: '')\n  .setInput(str)                               // table name or `DataFrame`\n  .setRowId(str)                               // unique column name in table\n  .setTargets(list)                            // target attribute list to repair\n\n  // Parameters for Error Detection\n  .setErrorCells(str)                          // user-specified error cells\n  .setErrorDetectors(list)                     // list of error detector implementations (`NullErrorDetector`, `DomainValues`, `RegExErrorDetector`, `ConstraintErrorDetector`, and `GaussianOutlierErrorDetector`)\n  .setDiscreteThreshold(int)                   // max domain size of discrete values (default: 80)\n\n  // Parameters for Repair Model Training\n  .setRepairByRules(bool)                      // whether to enable rule-based repair techniques, e.g., using functional dependencies and merging nearest values (default: False)\n  .setParallelStatTrainingEnabled(bool)        // whether to run multiple tasks to build stat repair models (default: False)\n  .setTrainingDataRebalancingEnabled(bool)     // whether to rebalance class labels in training data (default: False)\n\n  // Parameters for Repairing\n  .setRepairDelta(int)                         // max number of applied repairs\n\n  // Running Mode Parameters\n  .run(\n    detect_errors_only=bool,             
      // whether to return detected error cells (default: False)\n    compute_repair_candidate_prob=bool,        // whether to return probability mass function of candidate repairs (default: False)\n    compute_repair_prob=bool,                  // whether to return probability of predicted repairs\n    repair_data=bool                           // whether to return repaired data\n  )\n```\n\n## References\n\n - [1] Heidari, Alireza et al., HoloDetect: Few-Shot Learning for Error Detection, Proceedings of SIGMOD, 2019.\n - [2] Mohamed Yakout et al., Don't be SCAREd: use SCalable Automatic REpairing with maximal likelihood and bounded changes, Proceedings of SIGMOD, 2013.\n - [3] Ihab F. Ilyas and Xu Chu, Data Cleaning, ACM Books, 2019.\n - [4] Theodoros Rekatsinas et al., Holoclean: Holistic Data Repairs with Probabilistic Inference, PVLDB 10, no.11, pp.1190-1201, 2017.\n - [5] Jan Chomicki and Jerzy Marcinkowski, Minimal-Change Integrity Maintenance Using Tuple Deletions, Inf. Comput. 197(1-2), pp.90–121, 2005.\n - [6] Eduardo H. M. Pena et al., Discovery of Approximate (and Exact) Denial Constraints. Proceedings of the VLDB Endowment. 13(3), pp.266–278, 2019.\n - [7] Wu, Richard et al., Attention-based Learning for Missing Data Imputation in HoloClean, MLSys, 2020.\n - [8] Michael Stonebraker et al., Data Curation at Scale: The Data Tamer System, CIDR, 2013.\n - [9] Ziawasch Abedjan et al., Detecting Data Errors: Where Are We and What Needs to be Done?, Proceedings of the VLDB Endowment, 9(12), pp.993–1004, 2016.\n - [10] Zuhair Khayyat et al., BigDansing: A System for Big Data Cleansing, Proceedings of SIGMOD, pp.1215–1230, 2015.\n - [11] George Papadakis, et al., Blocking and Filtering Techniques for Entity Resolution, ACM Computing Surveys, Article 31, pp.42, 2020.\n - [12] Ahmed K. Elmagarmid et al., Duplicate Record Detection: A Survey, IEEE Transactions on Knowledge and Data Engineering, vol.19, no.1, pp.1-16, 2007.\n - [13] Ihab F. 
Ilyas and Xu Chu, Trends in Cleaning Relational Data: Consistency and Deduplication, Foundations and Trends in Databases, vol.5, no.4, pp.281-393, 2015.\n - [14] Mohamed Yakout et al., Guided data repair, Proceedings of the VLDB Endowment, 4(5), pp.279–289, 2011.\n - [15] El Kindi Rezig et al., Horizon: Scalable Dependency-driven Data Cleaning, Proceedings of the VLDB Endowment, vol.14, no.11, 2021.\n - [16] Peng Li et al., CleanML: A Study for Evaluating the Impact of Data Cleaning on ML Classification Tasks, Proceedings of ICDE, pp.13-24, 2021.\n - [17] Zeyu Li et al., Repairing data through regular expressions, Proceedings of the VLDB Endowment, vol.9, no.5, pp.432-443, 2016.\n - [18] Leopoldo Bertossi, Database Repairing and Consistent Query Answering, Synthesis Lectures on Data Management, Morgan \u0026 Claypool Publishers, 2011.\n - [19] Babak Salimi et al., Interventional Fairness: Causal Database Repair for Algorithmic Fairness, Proceedings of SIGMOD, pp.793–810, 2019.\n\n## TODO\n\n - Implements a rule-based repair strategy using regular expressions (See [17])\n\n## Bug Reports\n\nIf you hit some bugs and have requests, please leave some comments on [Issues](https://github.com/maropu/spark-data-repair-plugin/issues)\nor Twitter ([@maropu](http://twitter.com/#!/maropu)).\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmaropu%2Fspark-data-repair-plugin","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmaropu%2Fspark-data-repair-plugin","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmaropu%2Fspark-data-repair-plugin/lists"}