{"id":18683609,"url":"https://github.com/scify/jedaitoolkit","last_synced_at":"2025-04-06T04:11:23.733Z","repository":{"id":15291334,"uuid":"54029243","full_name":"scify/JedAIToolkit","owner":"scify","description":"An open source, high scalability toolkit in Java for Entity Resolution.","archived":false,"fork":false,"pushed_at":"2024-04-12T11:52:56.000Z","size":291025,"stargazers_count":218,"open_issues_count":14,"forks_count":47,"subscribers_count":26,"default_branch":"master","last_synced_at":"2025-03-30T02:09:46.970Z","etag":null,"topics":["blocking","entity-matching","entity-resolution","scalability"],"latest_commit_sha":null,"homepage":"http://jedai.scify.org","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/scify.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2016-03-16T12:18:20.000Z","updated_at":"2025-03-22T16:47:26.000Z","dependencies_parsed_at":"2024-12-24T13:22:56.464Z","dependency_job_id":null,"html_url":"https://github.com/scify/JedAIToolkit","commit_stats":null,"previous_names":[],"tags_count":7,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scify%2FJedAIToolkit","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scify%2FJedAIToolkit/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scify%2FJedAIToolkit/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scify%2FJedAIToolkit/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/G
itHub/owners/scify","download_url":"https://codeload.github.com/scify/JedAIToolkit/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247430870,"owners_count":20937874,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["blocking","entity-matching","entity-resolution","scalability"],"created_at":"2024-11-07T10:15:07.110Z","updated_at":"2025-04-06T04:11:23.713Z","avatar_url":"https://github.com/scify.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e \n\u003cimg src=\"https://github.com/scify/JedAIToolkit/blob/master/documentation/JedAI_logo_small.png\"\u003e\n\u003c/p\u003e\n\n**Please check [pyJedAI](https://github.com/AI-team-UoA/pyJedAI) for an implementation of JedAI in native Python.**\n\nPlease check our [paper](https://github.com/scify/JedAIToolkit/blob/master/documentation/JedAI_3D_ER.pdf) for a detailed description of version 3.0. \n\nThe code for running JedAI on **Apache Spark** is available [here](https://github.com/scify/JedAI-Spark). \n\nThe **Web Application** for running JedAI is available [here](https://github.com/GiorgosMandi/JedAI-WebApp). A video explaining how to use it is available [here](https://www.youtube.com/watch?v=OJY1DUrUAe8).\n\nJedAI is also available as a **Docker image** [here](https://hub.docker.com/repository/docker/gmandi/jedai-webapp). 
See below for more details.\n\nThe latest version of JedAI-gui is available [here](jedai-ui.7z).\n\n# Java gEneric DAta Integration (JedAI) Toolkit\nJedAI constitutes an open source, high scalability toolkit that offers out-of-the-box solutions for any data integration task, e.g., Record Linkage, Entity Resolution and Link Discovery. At its core lies a set of *domain-independent*, *state-of-the-art* techniques that apply to both RDF and relational data. These techniques rely on an approximate, *schema-agnostic* functionality based on *(meta-)blocking* for high scalability. \n\nJedAI can be used in three different ways:\n\n  1) As an **open source library** that implements numerous state-of-the-art methods for all steps of the end-to-end ER workflows presented in the figures below.\n  2) As a [**desktop application**](https://github.com/scify/jedai-ui) with an intuitive Graphical User Interface that can be used by both expert and lay users.\n  3) As a **workbench** that compares the relative performance of different (configurations of) ER workflows.\n  \nThis repository contains the code (in Java 8) of JedAI's open source library. The code of JedAI's desktop application and workbench is available in this [repository](https://github.com/scify/jedai-ui). 
\n\nSeveral **datasets** already converted into the serialized data type of JedAI can be found [here](./data).\n\nYou can find a short presentation of JedAI Toolkit [here](documentation/JedAIpresentation.pptx).\n\n### Citation\n\nIf you use JedAI, please cite the following paper:\n\n*George Papadakis, Leonidas Tsekouras, Emmanouil Thanos, George Giannakopoulos, Themis Palpanas and Manolis Koubarakis: \"The return of JedAI: End-to-End Entity Resolution for Structured and Semi-Structured Data\", in VLDB 2018* ([pdf](http://www.vldb.org/pvldb/vol11/p1950-papadakis.pdf)).\n\n### Consortium\n\nJedAI is a collaborative project involving the following partners:\n* [Department of Informatics and Telecommunications, University of Athens](http://www.di.uoa.gr)\n* [Software and Knowledge Engineering Lab, National Center for Scientific Research \"Demokritos\"](https://www.iit.demokritos.gr/labs/skel/)\n* [Science-For-You not-for-profit company](https://www.scify.gr/site/en)\n* [LIPADE, Paris Descartes University](http://lipade.mi.parisdescartes.fr)\n\n## JedAI Workflow\n\nJedAI supports 3 workflows, as shown in the following images:\n\n\u003cimg src=\"documentation/workflow1.png\" height=\"80\"\u003e\n\u003cimg src=\"documentation/workflow2.png\" height=\"80\"\u003e\n\u003cimg src=\"documentation/workflow3.png\" height=\"80\"\u003e\n\nBelow, we explain in more detail the purpose and the functionality of every step.\n\n### Data Reading \nIt transforms the input data into a list of entity profiles. An entity is a uniquely identified set of name-value pairs (e.g., an RDF resource with its URI as identifier and its set of predicates and objects as name-value pairs). 
\n\nThe following formats are currently supported:\n 1) CSV \n 2) RDF in any format, i.e., XML, OWL, HDT, JSON\n 3) Relational Databases (MySQL, PostgreSQL)\n 4) SPARQL endpoints\n 5) Java serialized objects\n \n### Schema Clustering\n\nThis is an optional step, suitable for highly heterogeneous datasets with a schema comprising a large diversity of attribute names. To this end, it groups together attributes that are syntactically similar, but are not necessarily semantically equivalent. \n\nThe following methods are currently supported:\n1) Attribute Name Clustering\n2) Attribute Value Clustering\n3) Holistic Attribute Clustering\n\nFor more details on the functionality of these methods, see [here](http://www.vldb.org/pvldb/vol9/p312-papadakis.pdf).  \n  \n### Block Building \nIt clusters entities into overlapping blocks in a lazy manner that relies on unsupervised blocking keys: every token in an attribute value forms a key. Blocks are then extracted from each key, possibly after a transformation, based on its equality or similarity with other keys.\n\nThe following methods are currently supported:\n 1) Standard/Token Blocking\n 2) Sorted Neighborhood\n 3) Extended Sorted Neighborhood\n 4) Q-Grams Blocking\n 5) Extended Q-Grams Blocking\n 6) Suffix Arrays Blocking\n 7) Extended Suffix Arrays Blocking\n 8) LSH MinHash Blocking\n 9) LSH SuperBit Blocking\n  \nFor more details on the functionality of these methods, see [here](https://github.com/scify/JedAIToolkit/blob/master/documentation/JedAI_3D_ER.pdf).  \n\n### Block Cleaning\nIts goal is to clean a set of overlapping blocks from unnecessary comparisons, which can be either *redundant* (i.e., repeated comparisons that have already been executed in a previously examined block) or *superfluous* (i.e., comparisons that involve non-matching entities). 
Its methods operate on the coarse level of individual blocks or entities.\n\nThe following methods are currently supported:\n 1) Size-based Block Purging\n 2) Cardinality-based Block Purging\n 3) Block Filtering\n 4) Block Clustering\n \nAll methods are optional, but complementary to each other and can be used in combination. For more details on the functionality of these methods, see [here](http://www.vldb.org/pvldb/vol9/p684-papadakis.pdf).  \n\n### Comparison Cleaning\nSimilar to Block Cleaning, this step aims to clean a set of blocks from both redundant and superfluous comparisons. Unlike Block Cleaning, its methods operate on the finer granularity of individual comparisons. \n\nThe following methods are currently supported:\n 1) Comparison Propagation\n 2) Cardinality Edge Pruning (CEP)\n 3) Cardinality Node Pruning (CNP)\n 4) Weighted Edge Pruning (WEP)\n 5) Weighted Node Pruning (WNP)\n 6) Reciprocal Cardinality Node Pruning (ReCNP)\n 7) Reciprocal Weighted Node Pruning (ReWNP)\n 8) BLAST\n 9) Canopy Clustering\n 10) Extended Canopy Clustering\n\nMost of these methods are Meta-blocking techniques. All methods are optional, but competitive, in the sense that only one of them can be part of an ER workflow. For more details on the functionality of these methods, see [here](http://www.sciencedirect.com/science/article/pii/S2214579616300168). They can be combined with one of the following weighting schemes:\n   1) Aggregate Reciprocal Comparisons Scheme (ARCS)\n   2) Common Blocks Scheme (CBS)\n   3) Enhanced Common Blocks Scheme (ECBS)\n   4) Jaccard Scheme (JS)\n   5) Enhanced Jaccard Scheme (EJS)\n   6) Pearson chi-squared test\n\n### Entity Matching\nIt compares pairs of entity profiles, associating every pair with a similarity in [0,1]. Its output comprises the *similarity graph*, i.e., an undirected, weighted graph where the nodes correspond to entities and the edges connect pairs of compared entities. 
\n\nThe following schema-agnostic methods are currently supported:\n  1) [Group Linkage](http://pike.psu.edu/publications/icde07.pdf)\n  2) Profile Matcher, which aggregates all attribute values of an individual entity into a textual representation.\n\nBoth methods can be combined with the following representation models:\n  1) character n-grams (n=2, 3 or 4)\n  2) character n-gram graphs (n=2, 3 or 4)\n  3) token n-grams (n=1, 2 or 3)\n  4) token n-gram graphs (n=1, 2 or 3)\n\nFor more details on the functionality of these bag and graph models, see [here](https://link.springer.com/article/10.1007%2Fs11280-015-0365-x).\n\nThe bag models can be combined with the following similarity measures, using both TF and TF-IDF weights: \n   1) ARCS similarity\n   2) Cosine similarity \n   3) Jaccard similarity \n   4) Generalized Jaccard similarity \n   5) Enhanced Jaccard similarity\n   \nThe graph models can be combined with the following graph similarity measures:\n   1) Containment similarity \n   2) Normalized Value similarity \n   3) Value similarity \n   4) Overall Graph similarity\n   \nAny word- or character-level pre-trained embeddings are also supported in combination with cosine similarity or Euclidean distance.\n\n### Entity Clustering\nIt takes as input the similarity graph produced by Entity Matching and partitions it into a set of equivalence clusters, with every cluster corresponding to a distinct real-world object.\n\nThe following domain-independent methods are currently supported for Dirty ER:\n  1) Connected Components Clustering\n  2) Center Clustering\n  3) Merge-Center Clustering\n  4) Ricochet SR Clustering\n  5) Correlation Clustering\n  6) Markov Clustering\n  7) Cut Clustering\n\nFor more details on the functionality of these methods, see [here](http://dblab.cs.toronto.edu/~fchiang/docs/vldb09.pdf). 
\n\nFor Clean-Clean ER, the following methods are supported:\n  1) Unique Mapping Clustering\n  2) Row-Column Clustering\n  3) Best Assignment Clustering\n\nFor more details on the functionality of the first method, see [here](https://arxiv.org/pdf/1207.4525.pdf). The second method implements an efficient approximation of the Hungarian Algorithm, while the third one implements an efficient, heuristic solution to the assignment problem in unbalanced bipartite graphs.\n\n### Similarity Join\nSimilarity Join comprises state-of-the-art algorithms for accelerating the computation of a specific character- or token-based similarity measure in combination with a user-determined similarity threshold.\n\nThe following token-based similarity join algorithms are supported:\n  1) AllPairs\n  2) PPJoin\n  3) SilkMoth\n\nThe following character-based similarity join algorithms are also supported:\n  1) FastSS\n  2) PassJoin\n  3) PartEnum\n  4) EdJoin\n  5) AllPairs\n\n### Comparison Prioritization\nComparison Prioritization associates every comparison in a block collection with a weight that is proportional to the likelihood that it involves duplicates, and then emits the comparisons iteratively, in decreasing order of weight.\n\nThe following methods are currently supported:\n  1) Local Progressive Sorted Neighborhood\n  2) Global Progressive Sorted Neighborhood\n  3) Progressive Block Scheduling\n  4) Progressive Entity Scheduling\n  5) Progressive Global Top Comparisons\n  6) Progressive Local Top Comparisons\n\nFor more details on the functionality of these methods, see [here](https://arxiv.org/pdf/1905.06385.pdf).\n  \n## How to add JedAI as a dependency to your project\n\nVisit https://search.maven.org/artifact/org.scify/jedai-core\n\n## How to run JedAI as a Docker image\n\nAfter installing Docker on your machine, type the following commands:\n\n~~~~\ndocker pull gmandi/jedai-webapp\n\ndocker run -p 8080:8080 gmandi/jedai-webapp\n~~~~\n\nThen, open your browser and go to localhost:8080. 
JedAI should be running in your browser!\n\n## How to use JedAI with Python\n\nYou can combine JedAI with Python through PyJNIus (https://github.com/kivy/pyjnius).\n\nPreparation Steps:\n1. Install Python 3 and PyJNIus (https://github.com/kivy/pyjnius).\n2. Install OpenJDK 8 and OpenJFX for Java 8, and configure that JDK as the default Java.\n3. Create a directory or a jar file with jedai-core and its dependencies. One approach is to use the maven-assembly-plugin\n(https://maven.apache.org/plugins/maven-assembly-plugin/usage.html), which will package everything into a single jar file:\njedai-core-3.0-jar-with-dependencies.jar\n\nThe following code block presents a simple example in Python 3. The code reads the ACM.csv file found in JedAIToolkit/data/cleanCleanErDatasets/DBLP-ACM and prints the entities found:\n\n~~~~\nimport jnius_config\n\n# The classpath must be set before jnius itself is imported.\njnius_config.add_classpath('jedai-core-3.0-jar-with-dependencies.jar')\n\nfrom jnius import autoclass\n\nfilePath = 'path_to/ACM.csv'\nCsvReader = autoclass('org.scify.jedai.datareader.entityreader.EntityCSVReader')\n\n# Configure the reader: header in the first row, comma-separated, id in column 0.\ncsvReader = CsvReader(filePath)\ncsvReader.setAttributeNamesInFirstRow(True)\ncsvReader.setSeparator(\",\")\ncsvReader.setIdIndex(0)\n\n# Print every entity profile followed by its attributes.\nprofiles = csvReader.getEntityProfiles()\nprofilesIterator = profiles.iterator()\nwhile profilesIterator.hasNext():\n    profile = profilesIterator.next()\n    print(\"\\n\\n\" + profile.getEntityUrl())\n    attributesIterator = profile.getAttributes().iterator()\n    while attributesIterator.hasNext():\n        print(attributesIterator.next().toString())\n~~~~\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fscify%2Fjedaitoolkit","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fscify%2Fjedaitoolkit","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fscify%2Fjedaitoolkit/lists"}