{"id":21496557,"url":"https://github.com/mlibrary/library_identifier_solr_filters","last_synced_at":"2025-03-17T12:16:06.963Z","repository":{"id":46412275,"uuid":"372945081","full_name":"mlibrary/library_identifier_solr_filters","owner":"mlibrary","description":null,"archived":false,"fork":false,"pushed_at":"2024-01-11T14:24:36.000Z","size":122,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-01-23T21:53:24.511Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Java","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mlibrary.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2021-06-01T19:48:44.000Z","updated_at":"2022-04-21T14:03:38.000Z","dependencies_parsed_at":"2024-01-11T16:27:06.208Z","dependency_job_id":"460ecb2b-7ff5-47a5-bd08-26438787a24c","html_url":"https://github.com/mlibrary/library_identifier_solr_filters","commit_stats":null,"previous_names":[],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mlibrary%2Flibrary_identifier_solr_filters","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mlibrary%2Flibrary_identifier_solr_filters/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mlibrary%2Flibrary_identifier_solr_filters/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mlibrary%2Flibrary_identifier_solr_filters/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mlibrary",
"download_url":"https://codeload.github.com/mlibrary/library_identifier_solr_filters/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244031154,"owners_count":20386534,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-23T16:17:11.584Z","updated_at":"2025-03-17T12:16:06.941Z","avatar_url":"https://github.com/mlibrary.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"# library_identifier_solr_filters\n\n## Overview\n\nThis is a series of simple solr analysis-chain filters useful to those\ndealing with library identifiers (currently only LC Callnumbers, but\nmore to come).\n\n## Getting/generating the .jar file\n\nYou can just nab a .jar file from the [github releases page](https://github.com/billdueber/library_identifier_solr_filters/releases). They're labeled\nwith the version of the library and the version of solr they're created\nagainst. \n\nYou can also use maven. You should be able to build with just\n\n```shell\nmvn package # .jar file will appear in `target/`\n\n```\n\nTo use different versions of solr and/or the icu4j library, you can\ndefine them on the command line (defaults are in the pom.xml file)\n\n```shell\nmvn package -Dsolr.version=8.6.1 -Dicu.version=66.1\n\n```\n\n## Placing the .jar file\n\nThe jar file needs to be put somewhere solr's going to pick it up, which\nis defined in the `solrconfig.xml` file. 
\n\nI like to have a `lib` directory\n\"next to\" my `conf` directory with the solr configuration\n\n```\nmycore\n  |- conf\n      |- schema.xml\n      |- solrconfig.xml\n      |- ...\n  |- lib\n      |- library_identifier_solr_filters-0.1-solr8.8.2.jar\n\n```\n\n... and then have `conf/solrconfig.xml` include the line:\n\n```xml\n\u003clib dir=\"${solr.core.config}/lib\" regex=\".*\\.jar\"/\u003e\n```\n\n## LC Callnumbers\n\nThis is a simple/simplistic attempt to take LC callnumbers and turn them\ninto something sortable/searchable. It does the bare minimum to massage the \ncallnumbers before indexing.\n\nIn addition to the underlying code to do the conversion, there are two ways\nto use it in fieldTypes.\n\n### LCCallNumberSimpleFilterFactory for prefix queries\n\nAn analysis filter that will take a token and perform the callnumber\nnormalization (or at least do its best), suitable for use with\nthe edge n-gram filter for providing prefix searches. It requires \nthat callnumbers be treated as a single token, so it should only be used \nwith a keyword tokenizer.\n\nNote that this filter will never be called for range searches; if you \nwant to use ranges, see the `CallnumberSortableFieldType`.\n\nA good fieldType definition for prefix searches is as follows:\n\n```xml\n\u003cfieldType name=\"callnumber_prefix_search\"  class=\"solr.TextField\"\u003e\n  \u003canalyzer type=\"index\"\u003e\n    \u003ctokenizer class=\"solr.KeywordTokenizerFactory\"/\u003e\n    \u003cfilter class=\"edu.umich.library.lucene.analysis.LCCallNumberSimpleFilterFactory\" passThroughOnError=\"true\"/\u003e\n    \u003cfilter class=\"solr.EdgeNGramFilterFactory\" maxGramSize=\"40\" minGramSize=\"2\"/\u003e\n  \u003c/analyzer\u003e\n  \u003canalyzer type=\"query\"\u003e\n    \u003ctokenizer class=\"solr.KeywordTokenizerFactory\"/\u003e\n    \u003cfilter class=\"edu.umich.library.lucene.analysis.LCCallNumberSimpleFilterFactory\" passThroughOnError=\"true\"/\u003e\n  
\u003c/analyzer\u003e\n\u003c/fieldType\u003e\n```\n\n\n### The CallnumberSortableFieldType\n\n`CallnumberSortableFieldType` is a derivative of `solr.StrField` which does\nthe callnumber conversion on the way in (for both stored and indexed values). \nThis not only gives you a sortable value (which the filter does as well),\nbut allows the type to be used correctly with ranges (since Solr doesn't\nrun the analysis chain for range queries).\n\nBecause it's implemented as a FieldType, all the normalization works\nas you'd expect (e.g., `callnumber_search:[qa20 TO *]` will pick up\nthe callnumber \"QA 20.2\" and not \"QA 3.11 .D4\"). \n\nThe FieldType can/should be used for both exact queries and range queries,\nas well as (pretty) accurate sorting,\nbut can't be used for prefix search since it's not a part of the analysis \nchain (being based on StrField and not TextField), hence the filter, above.\n\n```xml\n\n\u003cfieldType name=\"callnumber_sortable\"\n           class=\"edu.umich.library.library_identifier.schema.CallnumberSortableFieldType\" /\u003e\n\n\n\u003cfield name=\"callnumber_search\" type=\"callnumber_sortable\"\n       multiValued=\"true\"/\u003e\n\n\u003cfield name=\"callnumber_sort\" type=\"callnumber_sortable\"\n       multiValued=\"false\"/\u003e\n\n\n```\n\n\n### The normalization algorithm\n\nGiven a callnumber:\n```\nQA 123.456 .C5 D6 1990 v.3\n 1  2   3  \u003c--    4    --\u003e\n```\n\nWe label it as follows: \n 1. The _initial letters_\n 2. The _digits_\n 3. An (optional) _decimal_\n 4. Everything else\n\nIn particular, there's no attempt to separate out the cutters, enumchron,\nyear, etc. 
since at my institution there just wasn't any appetite for what\nlittle functionality it added compared to the ambiguities/bugs it \nproduced.\n\nThe transformation process is, essentially:\n\n  * Lowercase everything, trim/collapse whitespace\n  * Remove any space between the initial letters and digits\n  * Prefix the digits with their string length (e.g., 44 -\u003e 244, 1234 -\u003e \n    41234).\n    This makes the numbers sort correctly as plain strings, so we don't have\n    to mess around with zero-padding or anything\n  * Remove punctuation other than dots that create a decimal (e.g., \"v.1\" will\n    become \"v1\", but \"no. 123.45\" will become \"no 123.45\").\n    \n## Invalid callnumbers\n\nAnything that doesn't start with some letters followed by some digits is\ndeclared _invalid_. These values can be either kept or ignored depending on\nthe argument `allowInvalid` in the solr fieldType (see below).\n\nThe invalid callnumber passed through isn't exactly the same as what\nwas passed in -- we still lowercase, collapse/trim whitespace, and remove\nnon-decimal-place-looking punctuation.\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmlibrary%2Flibrary_identifier_solr_filters","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmlibrary%2Flibrary_identifier_solr_filters","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmlibrary%2Flibrary_identifier_solr_filters/lists"}