{"id":13719312,"url":"https://github.com/larsga/Duke","last_synced_at":"2025-05-07T11:31:30.489Z","repository":{"id":13764047,"uuid":"16458851","full_name":"larsga/Duke","owner":"larsga","description":"Duke is a fast and flexible deduplication engine written in Java","archived":false,"fork":false,"pushed_at":"2023-10-11T07:12:54.000Z","size":3546,"stargazers_count":620,"open_issues_count":115,"forks_count":193,"subscribers_count":71,"default_branch":"master","last_synced_at":"2025-04-14T08:11:36.547Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/larsga.png","metadata":{"files":{"readme":"README.md","changelog":"changes.txt","contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2014-02-02T17:17:38.000Z","updated_at":"2025-03-30T17:36:52.000Z","dependencies_parsed_at":"2022-07-30T16:18:58.035Z","dependency_job_id":"3c42ae97-9208-40bf-a5af-9f9e6882c8b3","html_url":"https://github.com/larsga/Duke","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/larsga%2FDuke","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/larsga%2FDuke/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/larsga%2FDuke/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/larsga%2FDuke/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/larsga","download_url":"https://codeload.github.com/larsga/Duke/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"git
hub","repositories_count":252868847,"owners_count":21816925,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-03T01:00:46.263Z","updated_at":"2025-05-07T11:31:29.359Z","avatar_url":"https://github.com/larsga.png","language":"Java","funding_links":[],"categories":[":hammer: Frameworks","Machine Learning","数据科学"],"sub_categories":["Clustering /","BBedit"],"readme":"# Duke\n\nDuke is a fast and flexible deduplication (or entity resolution, or\nrecord linkage) engine written in Java on top of Lucene.  The latest\nversion is 1.2 (see [ReleaseNotes](https://github.com/larsga/Duke/wiki/ReleaseNotes)).\n\nDuke can find duplicate customer records, or other kinds of records in\nyour database. Or you can use it to connect records in one data set\nwith other records representing the same thing in another data set.\nDuke has sophisticated comparators that can handle spelling\ndifferences, numbers, geopositions, and more. 
Using a probabilistic\nmodel, Duke can handle noisy data with good accuracy.\n\nFeatures\n\n  * High performance.\n  * Highly configurable.\n  * Support for [CSV, JDBC, SPARQL, NTriples, and JSON](https://github.com/larsga/Duke/wiki/DataSources).\n  * Many built-in [comparators](https://github.com/larsga/Duke/wiki/Comparator).\n  * Plug in your own data sources, comparators, and [cleaners](https://github.com/larsga/Duke/wiki/Cleaner).\n  * [Genetic algorithm](https://github.com/larsga/Duke/wiki/GeneticAlgorithm) for automatically tuning configurations.\n  * Command-line client for getting started.\n  * [API](https://github.com/larsga/Duke/wiki/UsingTheAPI) for embedding into any kind of application.\n  * Support for batch processing and continuous processing.\n  * Can maintain a database of links found via JNDI/JDBC.\n  * Can run in multiple threads.\n\nThe [GettingStarted page](https://github.com/larsga/Duke/wiki/GettingStarted) explains how to get started and has links to\nfurther documentation. The [examples of use](https://github.com/larsga/Duke/wiki/ExamplesOfUse) page\nlists real examples of using Duke, complete with data and\nconfigurations. [This\npresentation](http://www.slideshare.net/larsga/linking-data-without-common-identifiers)\nhas more of the big picture and background.\n\nContributions, whether issue reports or patches, are very much\nwelcome.  Please fork the repository and make pull requests.\n\nSupports Java 1.7 and 1.8.\n\n[![Build status](https://travis-ci.org/larsga/Duke.png?branch=master)](https://travis-ci.org/larsga/Duke)\n\nIf you have questions or problems, please register an issue in the\nissue tracker, or post to [the mailing\nlist](http://groups.google.com/group/duke-dedup). 
If you don't want to\njoin the list you can always write to me at `larsga [a]\ngarshol.priv.no`, too.\n\n## Using Duke with Maven\n\nDuke is hosted in Maven Central, so if you want to use Duke it's as\neasy as including the following in your pom file:\n\n```\n\u003cdependency\u003e\n  \u003cgroupId\u003eno.priv.garshol.duke\u003c/groupId\u003e\n  \u003cartifactId\u003eduke\u003c/artifactId\u003e\n  \u003cversion\u003e1.2\u003c/version\u003e\n\u003c/dependency\u003e\n```\n\n## Building the source\n\nIf you have [Maven](https://maven.apache.org/) installed, this is as\neasy as giving the command `mvn package` in the root directory. This\nwill produce a `.jar` file in the `target/` subdirectory of each\nmodule.\n\n## Older documentation\n\n[This blog post](http://www.garshol.priv.no/blog/217.html) describes\nthe basic approach taken to match records. It does not deal with the\nLucene-based lookup, but describes an early, slow O(n^2)\nprototype. [This early\npresentation](http://www.slideshare.net/larsga/deduplication)\ndescribes the ideas behind the engine and the intended architecture","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flarsga%2FDuke","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flarsga%2FDuke","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flarsga%2FDuke/lists"}