{"id":13623071,"url":"https://github.com/GSA/punchcard","last_synced_at":"2025-04-15T10:32:26.317Z","repository":{"id":23558265,"uuid":"26925714","full_name":"GSA/punchcard","owner":"GSA","description":"Repository of synonyms, protected words, stop words, and localizations","archived":false,"fork":false,"pushed_at":"2022-07-11T12:40:42.000Z","size":586,"stargazers_count":45,"open_issues_count":3,"forks_count":23,"subscribers_count":25,"default_branch":"master","last_synced_at":"2025-04-03T04:51:08.546Z","etag":null,"topics":["maintained"],"latest_commit_sha":null,"homepage":"http://search.digitalgov.gov","language":"Ruby","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"cc0-1.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/GSA.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2014-11-20T18:25:30.000Z","updated_at":"2025-03-15T03:40:59.000Z","dependencies_parsed_at":"2022-08-23T15:31:15.962Z","dependency_job_id":null,"html_url":"https://github.com/GSA/punchcard","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GSA%2Fpunchcard","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GSA%2Fpunchcard/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GSA%2Fpunchcard/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GSA%2Fpunchcard/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/GSA","download_url":"https://codeload.github.com/GSA/punchcard/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","re
positories_count":249051792,"owners_count":21204887,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["maintained"],"created_at":"2024-08-01T21:01:27.758Z","updated_at":"2025-04-15T10:32:26.051Z","avatar_url":"https://github.com/GSA.png","language":"Ruby","readme":"punchcard\n=========\n\n[![CircleCI](https://circleci.com/gh/GSA/punchcard.svg?style=svg)](https://circleci.com/gh/GSA/punchcard)\n\nRepository of synonyms, protected words, stop words, localizations, and other vocabularies to improve the precision, recall, and usability of search results.\n\n# Synonyms\n\nEach locale's synonyms are in a separate YAML file (e.g., `es.yml`, `en.yml`). Here is a sample entry:\n\n```yaml\ninmunización, vacuna, vacunación:\n  :notes: Approved (synonyms) and (stemming). AFF 11/12/14\n  :status: Approved\n  :analyzed: inmunizacion, vacun, vacunacion\n```\n\nThe entry listing is a comma-separated list of natural language terms, probably lemmas. \n\nThe `notes` field can be long and multi-line, but it still needs to be [valid YAML](http://www.yamllint.com). Notes include information on the type of synonym:\n\n1. Abbreviations\n1. Acronyms\n1. Clipped words\n1. Gerunds\n1. Irregular plurals\n1. Language variants\n1. Misspellings\n1. Numbers\n1. Spelling variants\n1. Stemming\n1. Synonyms\n1. Tickers (stock ticker symbols)\n1. Verbs\n\nThe `status` is either `Approved`, `Rejected`, or `Candidate`.\n\nThe `analyzed` field is a comma-separated list of the entry terms after they have been run through an analyzer and de-duped. The analysis chain comprises 6 filters:\n\n1. standard \n2. 
asciifolding \n3. lowercase \n4. es_stop_filter\n5. es_protected_filter\n6. es_stem_filter\n\nIn the example entry above, `vacuna` becomes `vacun` because of the Spanish stemmer, and `vacunación` and `inmunización` become `vacunacion` and `inmunizacion`, respectively, because of ASCII folding.\n\n## Extracting approved synonyms to Solr/Elasticsearch format\n\nGenerate the text file of *approved* synonyms for each locale, like this:\n\n    cd synonyms\n    ./lib/yaml_to_solr.rb es.yml \u003e es.txt\n    ./lib/yaml_to_solr.rb en.yml \u003e en.txt\n\nYou can then reference these files when you define your per-locale synonym filters.\n\n# Protected words\n\nEach locale's protected words are in a separate YAML file (e.g., `es.yml`, `en.yml`). The [keyword marker token filter](http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-keyword-marker-tokenfilter.html) keeps these words from getting treated by the stemmer. \nHere is a sample entry:\n\n```yaml\nirs:\n  :notes: Stems to ir in minimal_english\n  :status: Approved\n```\n\nThe entry listing is the token *after* ASCII folding and lowercasing.\n\n## Extracting approved protected words to Solr/Elasticsearch format\n\nGenerate the text file of *approved* protected words for each locale, like this:\n\n    cd protected_words\n    ./lib/yaml_to_solr.rb es.yml \u003e es.txt\n    ./lib/yaml_to_solr.rb en.yml \u003e en.txt\n\nYou can then reference these files when you define your per-locale keyword marker filters.\n\n# Stop words\n\nEach locale's stop words are in a separate YAML file (e.g., `es.yml`, `en.yml`). 
Here is a sample entry:\n\n```yaml\nthey:\n  :notes: \n  :status: Approved\n```\n\nThe entry listing is the token *after* ASCII folding and lowercasing.\n\n## Extracting approved stop words to Solr/Elasticsearch format\n\nGenerate the text file of *approved* stop words for each locale, like this:\n\n    cd stop_words\n    ./lib/yaml_to_solr.rb es.yml \u003e es.txt\n    ./lib/yaml_to_solr.rb en.yml \u003e en.txt\n\nYou can then reference these files when you define your per-locale stop word filters.\n\n# Analysis\n\nThe Elasticsearch index mapping used to transform entries into analyzed fields is here:\n\n```json\n{\n  \"settings\": {\n    \"index\": {\n      \"analysis\": {\n        \"char_filter\": {\n          \"ignore_chars\": {\n            \"type\": \"mapping\",\n            \"mappings\": [\n              \"'=\u003e\",\n              \"’=\u003e\",\n              \"`=\u003e\"\n            ]\n          }\n        },\n        \"filter\": {\n          \"es_protected_filter\": {\n            \"type\": \"keyword_marker\",\n            \"keywords\": [\n              \"ronaldo\"\n            ]\n          },\n          \"es_stem_filter\": {\n            \"type\": \"stemmer\",\n            \"name\": \"light_spanish\"\n          },\n          \"es_stop_filter\": {\n            \"type\": \"stop\",\n            \"stopwords\": [\n              \"a\",\n              \"al\",\n              \"ante\",\n              \"aquel\",\n              \"aquello\",\n              \"bajo\",\n              \"cabe\",\n              \"cada\",\n              \"como\",\n              \"con\",\n              \"conmigo\",\n              \"consigo\",\n              \"contigo\",\n              \"contra\",\n              \"cual\",\n              \"cuando\",\n              \"de\",\n              \"del\",\n              \"desde\",\n              \"despues\",\n              \"donde\",\n              \"durante\",\n              \"e\",\n              \"el\",\n              \"en\",\n              
\"entonces\",\n              \"entre\",\n              \"es\",\n              \"esta\",\n              \"esto\",\n              \"fin\",\n              \"fue\",\n              \"ha\",\n              \"hacia\",\n              \"has\",\n              \"hasta\",\n              \"la\",\n              \"las\",\n              \"le\",\n              \"les\",\n              \"los\",\n              \"mas\",\n              \"mediante\",\n              \"menos\",\n              \"mi\",\n              \"ni\",\n              \"o\",\n              \"para\",\n              \"pero\",\n              \"por\",\n              \"que\",\n              \"quien\",\n              \"salvo\",\n              \"segun\",\n              \"ser\",\n              \"si\",\n              \"sin\",\n              \"so\",\n              \"sobre\",\n              \"solamente\",\n              \"solo\",\n              \"somos\",\n              \"son\",\n              \"soy\",\n              \"su\",\n              \"suya\",\n              \"suyo\",\n              \"suyos\",\n              \"tal\",\n              \"tambien\",\n              \"tras\",\n              \"u\",\n              \"un\",\n              \"una\",\n              \"unas\",\n              \"unos\",\n              \"via\",\n              \"y\"\n            ]\n          },\n          \"en_protected_filter\": {\n            \"type\": \"keyword_marker\",\n            \"keywords\": [\n              \"irs\"\n            ]\n          },\n          \"en_stem_filter\": {\n            \"type\": \"stemmer\",\n            \"name\": \"minimal_english\"\n          },\n          \"en_stop_filter\": {\n            \"type\": \"stop\",\n            \"stopwords\": [\n              \"a\",\n              \"an\",\n              \"and\",\n              \"are\",\n              \"as\",\n              \"at\",\n              \"be\",\n              \"but\",\n              \"by\",\n              \"for\",\n              \"if\",\n              \"in\",\n              
\"into\",\n              \"is\",\n              \"no\",\n              \"not\",\n              \"of\",\n              \"on\",\n              \"or\",\n              \"s\",\n              \"such\",\n              \"t\",\n              \"that\",\n              \"the\",\n              \"their\",\n              \"then\",\n              \"there\",\n              \"these\",\n              \"they\",\n              \"this\",\n              \"to\",\n              \"was\",\n              \"with\"\n            ]\n          }\n        },\n        \"analyzer\": {\n          \"en_analyzer\": {\n            \"type\": \"custom\",\n            \"char_filter\": [\n              \"ignore_chars\"\n            ],\n            \"filter\": [\n              \"standard\",\n              \"asciifolding\",\n              \"lowercase\",\n              \"en_stop_filter\",\n              \"en_protected_filter\",\n              \"en_stem_filter\"\n            ],\n            \"tokenizer\": \"standard\"\n          },\n          \"es_analyzer\": {\n            \"type\": \"custom\",\n            \"char_filter\": [\n              \"ignore_chars\"\n            ],\n            \"filter\": [\n              \"standard\",\n              \"asciifolding\",\n              \"lowercase\",\n              \"es_stop_filter\",\n              \"es_protected_filter\",\n              \"es_stem_filter\"\n            ],\n            \"tokenizer\": \"standard\"\n          }\n        }\n      }\n    }\n  }\n}\n```\n\n# Localizations (l10n)\n\nAs of November 2021, translations management has moved to the [GSA/search-gov](https://github.com/GSA/search-gov) repository. 
Learn how to contribute to Search.gov translations [here](https://github.com/GSA/search-gov/blob/master/CONTRIBUTING.md).\n\n# Contributing\n\nYou're encouraged to submit changes via pull requests, propose features, and discuss issues.\n\nSee [CONTRIBUTING](CONTRIBUTING.md).","funding_links":[],"categories":["Ruby"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FGSA%2Fpunchcard","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FGSA%2Fpunchcard","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FGSA%2Fpunchcard/lists"}