{"id":17181906,"url":"https://github.com/clipperhouse/jargon","last_synced_at":"2025-04-09T23:34:47.653Z","repository":{"id":140010292,"uuid":"132513937","full_name":"clipperhouse/jargon","owner":"clipperhouse","description":"Tokenizers and lemmatizers for Go","archived":false,"fork":false,"pushed_at":"2024-05-27T21:49:36.000Z","size":1153,"stargazers_count":109,"open_issues_count":4,"forks_count":1,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-03-24T01:24:24.717Z","etag":null,"topics":["data-science","go","lemmatizer","nlp","tokenizer"],"latest_commit_sha":null,"homepage":"https://clipperhouse.com/jargon/","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/clipperhouse.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-05-07T20:38:20.000Z","updated_at":"2025-02-24T22:38:49.000Z","dependencies_parsed_at":"2024-06-19T00:22:09.028Z","dependency_job_id":"a8450af5-e269-48f2-9bbd-3909e86bea3c","html_url":"https://github.com/clipperhouse/jargon","commit_stats":null,"previous_names":[],"tags_count":25,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/clipperhouse%2Fjargon","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/clipperhouse%2Fjargon/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/clipperhouse%2Fjargon/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/clipperhouse%2Fjargon/manifests","owner_url":"https://repos.ecosyste.ms/api/v1
/hosts/GitHub/owners/clipperhouse","download_url":"https://codeload.github.com/clipperhouse/jargon/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248130358,"owners_count":21052735,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-science","go","lemmatizer","nlp","tokenizer"],"created_at":"2024-10-15T00:35:30.942Z","updated_at":"2025-04-09T23:34:47.634Z","avatar_url":"https://github.com/clipperhouse.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Jargon\n\nJargon is a text pipeline, focused on recognizing variations on canonical and synonymous terms.\n\nFor example, jargon lemmatizes `react`, `React.js`, `React JS` and `REACTJS` to a canonical `reactjs`.\n\n## Install\n\nBinaries are available on the [Releases page](https://github.com/clipperhouse/jargon/releases).\n\nIf you have [Homebrew](https://brew.sh):\n```\nbrew install clipperhouse/tap/jargon\n```\n\nIf you have a [Go installation](https://golang.org/doc/install):\n```\ngo install github.com/clipperhouse/jargon/cmd/jargon\n```\n\nTo display usage, simply type:\n\n```bash\njargon\n```\n\nExample:\n\n```bash\ncurl -s https://en.wikipedia.org/wiki/Computer_programming | jargon -html -stack -lemmas -lines\n```\n\n[CLI usage and details...](https://github.com/clipperhouse/jargon/tree/master/cmd/jargon)\n\n## In your code\n\nSee [GoDoc](https://godoc.org/github.com/clipperhouse/jargon). 
Example:\n\n```go\nimport (\n\t\"fmt\"\n\t\"log\"\n\n\t\"github.com/clipperhouse/jargon\"\n\t\"github.com/clipperhouse/jargon/filters/stackoverflow\"\n)\n \ntext := `Let’s talk about Ruby on Rails and ASPNET MVC.`\nstream := jargon.TokenizeString(text).Filter(stackoverflow.Tags)\n\n// Loop while Scan() returns true. Scan() will return false on error or end of tokens.\nfor stream.Scan() {\n\ttoken := stream.Token()\n\t// Do stuff with token\n\tfmt.Print(token)\n}\n\nif err := stream.Err(); err != nil {\n\t// Because the source is I/O, errors are possible\n\tlog.Fatal(err)\n}\n\n// As an iterator, a token stream is 'forward-only'; once you consume a token, you can't go back.\n\n// See also the convenience methods String, ToSlice, WriteTo\n```\n\n## Token filters\n\nCanonical terms (lemmas) are looked up in token filters. Several are available:\n\n[Stack Overflow technology tags](https://pkg.go.dev/github.com/clipperhouse/jargon/filters/stackoverflow)\n  - `Ruby on Rails → ruby-on-rails`\n  - `ObjC → objective-c`\n\n[Contractions](https://pkg.go.dev/github.com/clipperhouse/jargon/filters/contractions)\n  - `Couldn’t → Could not`\n\n[ASCII fold](https://pkg.go.dev/github.com/clipperhouse/jargon/filters/ascii)\n  - `café → cafe`\n\n[Stem](https://pkg.go.dev/github.com/clipperhouse/jargon/filters/stemmer)\n  - `Manager|management|manages → manag`\n\nTo implement your own, see the [Filter type](https://godoc.org/github.com/clipperhouse/jargon/#Filter).\n\n## Performance\n\n`jargon` is designed to work in constant memory, regardless of input size. It buffers input and streams tokens.\n\nExecution time is designed to be O(n) in the input size. It is I/O-bound. In your code, you control I/O and its performance implications through the `Reader` you pass to Tokenize.\n\n## Tokenizer\n\nJargon includes a tokenizer based partially on [Unicode text segmentation](https://unicode.org/reports/tr29/). 
It’s good for many common cases.\n\nIt preserves all tokens verbatim, including whitespace and punctuation, so the original text can be reconstructed with fidelity (“round tripped”).\n\n## Background\n\nWhen dealing with technical terms in text – say, a job listing or a resume – it’s easy to use different words for the same thing. This is acute for things like “react” where it’s not obvious what the canonical term is. Is it React or reactjs or react.js?\n\nThis presents a problem when **searching** for such terms. _We_ know the above terms are synonymous but databases don’t.\n\nA further problem is that some n-grams should be understood as a single term. We know that “Objective C” represents **one** technology, but databases naively see two words.\n\n## What’s it for?\n\n- Recognition of domain terms in text\n- NLP for unstructured data, when we wish to ensure consistency of vocabulary for statistical analysis.\n- Search applications, where searches for “Ruby on Rails” are understood as a single entity instead of three unrelated words, or to ensure that “React”, “reactjs” and “react.js” are handled synonymously.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fclipperhouse%2Fjargon","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fclipperhouse%2Fjargon","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fclipperhouse%2Fjargon/lists"}