{"id":18553464,"url":"https://github.com/hpcc-systems/medley","last_synced_at":"2026-01-24T12:07:41.843Z","repository":{"id":110264206,"uuid":"262392160","full_name":"hpcc-systems/Medley","owner":"hpcc-systems","description":"ECL module for performing whole-record fuzzy matching.","archived":false,"fork":false,"pushed_at":"2024-01-03T14:16:17.000Z","size":532,"stargazers_count":3,"open_issues_count":0,"forks_count":2,"subscribers_count":13,"default_branch":"master","last_synced_at":"2025-02-17T10:49:40.880Z","etag":null,"topics":["ecl","hpcc","hpcc-systems"],"latest_commit_sha":null,"homepage":"","language":"ECL","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hpcc-systems.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-05-08T17:56:58.000Z","updated_at":"2023-11-08T14:15:13.000Z","dependencies_parsed_at":null,"dependency_job_id":"f2c28804-54db-47e8-9a0a-77544bbeebf0","html_url":"https://github.com/hpcc-systems/Medley","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"purl":"pkg:github/hpcc-systems/Medley","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hpcc-systems%2FMedley","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hpcc-systems%2FMedley/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hpcc-systems%2FMedley/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hpcc-systems%2FMedley/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/ow
ners/hpcc-systems","download_url":"https://codeload.github.com/hpcc-systems/Medley/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hpcc-systems%2FMedley/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28727384,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-24T10:24:43.181Z","status":"ssl_error","status_checked_at":"2026-01-24T10:24:36.112Z","response_time":89,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ecl","hpcc","hpcc-systems"],"created_at":"2024-11-06T21:17:14.661Z","updated_at":"2026-01-24T12:07:41.820Z","avatar_url":"https://github.com/hpcc-systems.png","language":"ECL","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Medley\n\n## What Is It?\n\nMedley is a library that supports searching for \"similar\" records in a dataset.\n\nThis concept of \"similar\" is best defined via analogy:  Two words can be similar\nif they differ in spelling by only a letter or two.  This is how many spell-checkers\nwork:  Finding real words that differ from what you typed by only\na few letters.  
Extending that idea, Medley finds similar dataset records\nby examining field values and finding records where only a few field values\nare different.\n\nThe module [Medley.ecl](Medley.ecl) provides the following features:\n\n- UTF-8 support\n- Creation of search indexes, linked back to entity ID values\n- Fuzzy matching within the source dataset, linking similar IDs\n- Searching for related IDs, given a dataset of IDs\n- Searching for related IDs, given a dataset of data mimicking the source dataset\n\n## License\nThis software is [licensed](LICENSE.txt) under the Apache v2 license.\n\n## Dependencies\n\nMedley supports UTF-8 via the ICU (International Components for Unicode) library.\nThe runtime version of this library is used by the HPCC Systems platform code,\nbut Medley requires the header files as well since you're compiling new code\nusing Unicode support.  If you receive an error indicating that the file\nunicode/unistr.h cannot be found, then you need to install a library package.\nOn Debian-based operating systems, that package is libicu-dev; on RHEL/CentOS,\nthe equivalent package is typically named libicu-devel.\nAt minimum, you need to install it on the node that compiles your ECL code\n(the node running eclccserver).\n\n## Versions\n\nThe ECL module itself can be inspected for version information at compile time.\nThe following attributes are all exported:\n\n\tUNSIGNED1 VERSION_MAJOR\n\tUNSIGNED1 VERSION_MINOR\n\tUNSIGNED1 VERSION_POINT\n\tSTRING    VERSION_STRING\n\n|Version|Notes|\n|:----:|:-----|\n|0.5.0|Initial public release|\n|0.6.0|Support for multiple field directives in a single build|\n|0.6.1|Rename offensive terms; replace expensive self-join with a rollup; skip non-required fields containing empty strings when computing hashes|\n|0.6.2|Rearrange ECL #body declarations in embedded C++ functions|\n\n## Example Code\n\nThe Example directory contains BWRs for creating Medley indexes, analyzing deduplication\nresults, and querying the indexes for both related entity IDs and fuzzy  
searching.\n\n## Theory of Operations\n\nThe basic concept implemented here is called \"deletion neighborhoods\".\nIt is a term coined in an academic paper written by Thomas Bocek, Ela Hunt,\nand Burkhard Stiller from the University of Zurich, titled [\"Fast Similarity\nSearch in Large Dictionaries\"](https://fastss.csg.uzh.ch/ifi-2007.02.pdf).\nThe work described there was expanded in a paper written by Daniel Karch,\nDennis Luxen, and Peter Sanders from the Karlsruhe Institute of Technology,\ntitled [\"Improved Fast Similarity Search in Dictionaries\"](https://arxiv.org/abs/1008.1191v2).\nBoth of these papers deal with efficient searching for similar string values, given a query string.\n\nThis ECL module takes the concept of a deletion neighborhood and applies it\nto dataset records.  Once you read the papers, the executive summary is\nstraightforward:  instead of working with words composed of characters, we\nwork with records composed of fields.  Both string- and record-based\ndeletion neighborhood techniques are used here, as both offer powerful\ncapabilities when combined.\n\nThe data that this module is designed to work with is most easily described\nas \"entity data\" -- usually thought of as, \"each record describes a single\nperson, place, or thing.\"  More generally, any kind of data will work just\nfine as long as each record contains a unique entity ID field and then\nmore fields that add information about that ID.  IDs can be duplicated\nin the dataset, so long as the information associated with them belongs\nto the right IDs (in other words, don't reuse the entity IDs for\ndifferent entities).\n\nThe concept of applying a deletion neighborhood technique to records as\nyou would to strings is easy to grasp, but there are a few tweaks that improve\nthe results:\n\n1) Field values, in isolation, sometimes don't offer enough actual\ninformation to be worthwhile when it comes to searching (or to\nput it a different way, discriminating between records).  
The rule of\nthumb is, the lower the cardinality in a field the less unique\ninformation it adds.  An example is a field containing abbreviations\nof U.S. states.  A value of \"TX\" does not necessarily help\ndiscriminate one record from another (though it might; it depends\non the use case).  However, if you pair this field with another\nfield, say \"city_name\", then the combined information helps\nrecord discrimination tremendously.  \"Austin\" as a city name, by\nitself, can refer to a number of different cities in the U.S. but\nwhen paired with the state abbreviation \"TX\" it suddenly acquires\ngreater precision.  This module therefore offers a method for\ncombining fields into \"field groups\" and treating those groups\nas if they were a single value.\n\n2) One of the more powerful aspects of deletion neighborhoods is that\nthe process of iteratively deleting units (characters from strings,\nor field values from records) is blind.  We don't actually care\nabout what we're deleting.  But that is not always true when it\ncomes to fields (or field groups).  When working with records\nthat contain location data, for instance, it may be super important\nthat we never ignore the postal code from the record, because if we\ndo then we run the risk of matching records that are geographically\ndistant.  This module therefore offers the ability to designate\nfields and field groups as \"required\" which has the effect of\nmaking them not deletable.\n\n3) The idea of creating deletion neighborhoods against string values\nis powerful and should not require an extra step to use.  Therefore,\nthis module offers the ability to expand any field into a deletion\nneighborhood prior to creating the deletion neighborhood for the\nrecord as a whole.  This has the effect of duplicating a given\nrecord a number of times, with each containing a slightly different\n\"version\" of that field value.  
As a bonus, any number of fields\nmay be expanded this way, and each may cite a different edit\ndistance (you did read the papers, right?).  The practical side\nof this functionality is the ability to \"fuzzy match\" on these\nfields even while performing the record-based similarity matching.\nIt is also worth noting that the module supports UTF-8 strings,\nnot just plain ASCII strings.  (Side note:  If you are interested\nin just fuzzy-matching strings and not whole-record fuzzy matching, see\n[FuzzyStringSearch.ecl](https://github.com/dcamper/Useful_ECL/blob/master/FuzzyStringSearch.ecl).)\n\nOne of the extremely cool features of record-level deletion neighborhoods is\nthe ability to solve a normally hard-to-code use case:  Given a search form\nin a browser, the user is presented with some fields to fill in (these just\nhappen to correspond to the fields you previously indexed using this module).\nThe requirement is that the user should basically fill in as many of the\nfields as she can and the system should locate related records.\n\nUsing this module, you can accommodate this type of searching by adjusting\nthe field group-level maximum edit distance (MaxED) when building the index.\nThe simple formula is:\n\n     MaxED = (total number of fields) - (minimum number of entered fields)\n\nIf the requirements say that the user needs to fill out only one of ten\nfields presented, then your MaxED value is 9.  If the user needs to fill\nout any six of the ten fields, then the MaxED value is 4.  
When searching, set the MaxED\nvalue to zero to prevent over-fuzzing the search parameters.\n\nThe uber-cool part of this is two-fold:\n\n1) Other than different MaxED values, there is no change to the\nindexing code.\n2) Other than different MaxED values, there is no change to the\nsearching code.\n\nDo keep in mind that the larger the edit distance, the bigger the indexes\nand the greater the chance of seeing a false positive search result.\n\n\"Larger indexes\" has been bandied about a few times.  What does that really\nmean?  You can compute the number of records created by a deletion\nneighborhood with the following pseudocode (note that 'fact' means\ncomputing the factorial of the argument):\n\n     numRecs = 0;\n     for (r = 0; r \u003c= MaxED; r++)\n         numRecs += (fact(n) / (fact(r) * fact(n - r)));\n\nWhere n = the number of fields in your record.  n could also be the number\nof characters in a string, because this equation works just as well for\nstrings.  The reason this is a summation is that deletion neighborhoods\nin effect store everything from edit distance zero (an exact copy of the\ninput) up to MaxED.\n\nNote that the equation gives you the result if you have ONE record (or\nstring).  If you have many records to process, multiply that count by\nnumRecs.  
If you are using the equation for a string field, use the\naverage length of the string value for n, then multiply the number of values by numRecs.\n\nTo give you some idea of the scale of this \"record explosion\" here are\nthe results of expanding one record with 10 fields (or one string that\nis 10 characters in length), with different MaxED values:\n\n     MaxED   numRecs\n     ----------------\n       1       11\n       2       56\n       3      176\n       4      386\n       5      638\n       6      848\n       7      968\n       8     1013\n       9     1023\n\nTo reiterate:  That is the number of index records generated from ONE input\nrecord.\n\nThere is another consideration regarding the index files this module\ncreates:  Disk space.  In HPCC Systems, indexes are naturally compressed.\nThis is good.  Unfortunately, while these indexes all have simple layouts\n(a pair of numbers), they are mostly *random* numbers.  They do not\ncompress well at all.  The ones with ID values first tend to compress better,\nbut none of them are outstanding.  This in no way hurts performance, but it\nis a consideration for storage.\n\n## How To Use Medley\n\nWhat follows is the suggested basic \"flow\" for using this module.\n\n1) Examine your dataset and identify the search fields -- those fields\nthat contain data that help discriminate between records.  The fewer\nthe fields, the smaller the indexes and the faster the process\nwill run.  But hey, if you have a monstrous HPCC Systems cluster\nfeel free to go nuts.\n\n2) Identify fields that should be grouped together, if any.\n\n3) Identify fields that should be expanded with their own deletion\nneighborhoods to aid fuzzy matching, if any.  
Note that if your\ndata is already heavily normalized, you may not have any such\nfields.\n\n4) Create a field directive string or a set of strings (see the section titled\n[Field Directive Formatting](#field_directive_formatting), below)\nusing the information from steps 1-3.\n\n5) Decide on the record (or field group-level) maximum edit distance\nyou want to use.  Remember that higher values produce more index\nrecords, take longer to process, and will return more \"hits\" when\nsearching, but they will also produce more false positives.  Note\nthat you can set the maximum edit distance to zero.  If you do,\nno field-based deletion neighborhood will be built, turning the\nindexes into a fast method for multi-field exact matches.\n\n6) Call BuildAllIndexes() to create all of the indexes.\n\nAfter these six steps, you now have all the data you need for Fun Searching.\nThat data is composed of four simply-formatted but probably very large\nINDEX files.\n\n**Fun Search Scenario #1:** Given one or more of your unique IDs, find all\nthe related IDs.  This is pretty straightforward:\n\n1) Stuff your IDs into a DATASET(IDLayout) and call the\nFindRelatedIDs() function.  You will need to pass in the logical\npathnames for some of the index files as well.  What you get back\nis a set of records, each containing one of your original IDs and\na related ID (RelatedIDLayout format).\n\n**Fun Search Scenario #2:**  Given a set of information that mimics the data\nyou have indexed -- basically a populated one-record dataset in the\nsame format as your original data -- find the IDs of the matching\nrecords.\n\n1) Create a dataset with a layout that includes at least the fields\nused when creating the deletion neighborhood indexes and populate it\nwith your search values.  Make sure to include the unique ID field, even\nthough you probably don't have a value for it.  
You can have more than\none record in this dataset if needed.\n\n2) Pass this created dataset to CreateLookupTable() along with the\nfield directive string you used when creating the lookup\nindexes, and the field group-level edit distance.  You will get\nback a lookup table of hash codes.\n\n3) Pass the lookup table to the FindRelatedIDsFromLookupTable()\nfunction.  What you get back will be a simple list of IDs\n(in IDLayout format) that are similar to the data from step #1.\n\n**Fun Search Scenario #3:** Like #2, but you are satisfying the use case\nof \"find records given anywhere from N to M field values\"\n(where M is the actual number of fields you have indexed).\n\n1) Create a dataset with a layout that includes at least the fields\nused when creating the deletion neighborhood indexes and populate it\nwith your search values.  In this layout, use UTF8 as the data type\nfor every field, but use the same names.  Where the user does not\nsupply a value for a field, make sure that field's value is an empty\nstring.  Make sure to include the unique ID field, even\nthough you probably don't have a value for it.\n\n2) Pass this created dataset to CreateLookupTable() along with the\nfield directive string you used when creating the lookup\nindexes, and use zero for the field group-level edit distance.  You will get\nback a lookup table of hash codes.\n\n3) JOIN those hash codes against the index defined by\n`Medley.Hash2IDLookupIndexDef(Medley.HASH2ID_LOOKUP.PATH_ROXIE)`,\nbasically filtering that index by the lookup table you computed.  
The JOIN\nwould look something like this:\n\n        hashMatches := JOIN\n            (\n                lookupTable,\n                hash2IDIndex,\n                LEFT.hash_value = RIGHT.hash_value,\n                TRANSFORM(RIGHT),\n                LIMIT(0)\n            );\n\nThe result of that JOIN will be a dataset with this layout:\n\n        LookupTableLayout := RECORD\n            ID_t                id;\n            Hash_t              hash_value;\n        END;\n\nThe results will be the matches against your data, and the `id` field\nis your entity ID.  Now you can look up those IDs in your master file\nto retrieve the original data.\n\n\u003ca name=\"field_directive_formatting\"\u003e\u003c/a\u003e\n## Field Directive Formatting\n\nThe field directive (the 'fieldSpec' parameter in Medley's\nCreateLookupTable() function macro) is a single ```STRING``` or\n```SET OF STRING``` argument defining how the input dataset\nshould be parsed while creating lookup and search neighborhoods.  Multiple\ndirectives can be supplied via the ```SET OF STRING``` form, with the\neffect of creating an OR condition between them.\n\nEach field directive string is a semi-colon delimited string, with each\nelement defining a \"field group\".  A field group is a comma-delimited list\nof field names from the input dataset.  A field group may contain only one\nfield name.  The unique ID field should not normally be included in any\nfield group.   Individual fields may appear multiple times, but you should\nthink carefully about the impact of doing so.\n\nIndividual fields may be expanded with their own deletion neighborhoods.\nTo indicate such an expansion, append the suffix of '%N' to the field's\nname, where N is the maximum edit distance for the deletion neighborhood.\nNormally, N will be either 1 or 2 (larger maximum edit distances create\nconsiderably larger lookup tables and the result may cause too many\nfalse positive search results).  
If a field appears more than once in a\nfield directive and any occurrence has a %N suffix, then all\noccurrences of that field will be expanded with a maximum edit distance\nof MAX(N).\n\nField group-level deletion neighborhoods are all about systematically\nignoring certain field groups when creating hash values.  To indicate that\na field group should not be ignored -- that is, that it is required -- prepend\nthe entire field group with a '\u0026' character.  The '\u0026' character should\nappear as a prefix of the first field name in the field group.\n\nPractical example:  Let's assume you are working with this data structure:\n\n     RECORD\n         UNSIGNED6    id;\n         UTF8         fname;\n         UTF8         lname;\n         UTF8         street;\n         UTF8         city;\n         UTF8         state;\n         STRING       postal;\n     END;\n\nRemember, the entity ID field is NOT part of your directive.\n\nIf you want to consider all of these fields independently, without grouping\nor creating any string-based deletion neighborhoods, the field\ndirective is a simple semi-colon delimited list of the six fields:\n\n     'fname;lname;street;city;state;postal;'\n\nNote that a \"field group\" is defined as \"one or more fields\" so you have a\nfield directive defining six field groups, even though each field\ngroup has only one field in it.\n\nNow let's say that in the interest of precision, you want to consider the\ncity, state, and postal fields together:  Don't break them up, and don't\ndelete one of those independently.  Those fields become a \"field group\"\nand are comma-delimited items:\n\n     'fname;lname;street;city,state,postal;'\n\nYou now have four field groups:\n\n     fname\n     lname\n     street\n     city, state, postal\n\nLet's further assume that the city/state/postal field group is important\nand that you never want it omitted from the index creation (recall that\ncreating deletion neighborhoods BLINDLY deletes items).  
The way you\ndesignate a field group as required is to prepend the entire group (meaning,\nthe first field name in the group) with an ampersand character:\n\n     'fname;lname;street;\u0026city,state,postal;'\n\nYou still have four field groups, but only three of them will participate\nin the deletion neighborhood.  If your MaxED is 1, that means indexes will\nbe created for the following combinations of field values:\n\n     fname, lname, street, city/state/postal\n     lname, street, city/state/postal\n     fname, street, city/state/postal\n     fname, lname, city/state/postal\n\nThat combination of values is what you will be matching on.\n\nYou could also indicate that you want string-based deletion neighborhoods\ncreated for certain fields (not field groups).  For instance, if your data\nhas not been thoroughly cleaned or if you will need to account for typos\nand such when searching, you may need that extra \"fuzziness\" to match\nrecords correctly.  String-based deletion neighborhoods are designated\nwithin the field directive by adding a suffix of '%N' to the field's\nname, where N is the MaxED you want.  In our example, let's say that you\nwant to be able to find first names with up to one character different\n(MaxED = 1) and the street part of the address with up to two characters\ndifferent (MaxED = 2).  The directive will become:\n\n     'fname%1;lname;street%2;\u0026city,state,postal;'\n\nThis would cause every record in your dataset to be duplicated with subtle\nvariations in the fname and street values, exploding the size of the data\ntemporarily.  
The field-level deletion neighborhood is the same, though:\n\n     fname, lname, street, city/state/postal\n     lname, street, city/state/postal\n     fname, street, city/state/postal\n     fname, lname, city/state/postal\n\nIt's just that there will be many more of those records processed.\n\nAs an example of using multiple field directives, let us take our example\nand assume that we want to match records using two different criteria:\n\n     'fname;lname;postal;'\n     'lname;city,state,postal;'\n\nThose two criteria are considered OR'd together when it comes to matching.\nAll of the other formatting directives and limitations are valid for each\ndirective.  To supply them, simply submit them in a ```SET OF STRING```\ndata type rather than a simple ```STRING```:\n\n     ['fname;lname;postal;', 'lname;city,state,postal;']\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhpcc-systems%2Fmedley","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhpcc-systems%2Fmedley","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhpcc-systems%2Fmedley/lists"}