{"id":13413733,"url":"https://github.com/ndabAP/assocentity","last_synced_at":"2025-03-14T20:30:37.646Z","repository":{"id":57495880,"uuid":"162680118","full_name":"ndabAP/assocentity","owner":"ndabAP","description":"Package assocentity returns the mean distance from tokens to an entity and its synonyms","archived":false,"fork":false,"pushed_at":"2025-02-22T07:20:45.000Z","size":106923,"stargazers_count":15,"open_issues_count":4,"forks_count":3,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-03-08T14:32:18.150Z","etag":null,"topics":["go","golang","natural-language-processing","nlp","social-sciences","tokenizer"],"latest_commit_sha":null,"homepage":"https://pkg.go.dev/github.com/ndabAP/assocentity/v14","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ndabAP.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-12-21T07:17:09.000Z","updated_at":"2025-01-18T18:46:21.000Z","dependencies_parsed_at":"2024-06-18T23:01:08.660Z","dependency_job_id":"9a6b8a49-ed4c-412e-8bfb-d8fc2cf005b9","html_url":"https://github.com/ndabAP/assocentity","commit_stats":{"total_commits":178,"total_committers":3,"mean_commits":"59.333333333333336","dds":"0.011235955056179803","last_synced_commit":"9d81cd1480bc957edd9c730d0f5fff007d79db8a"},"previous_names":[],"tags_count":66,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ndabAP%2Fassocentity","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ndabAP%2Fassocentity/tags","releases_url"
:"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ndabAP%2Fassocentity/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ndabAP%2Fassocentity/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ndabAP","download_url":"https://codeload.github.com/ndabAP/assocentity/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243641951,"owners_count":20323940,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["go","golang","natural-language-processing","nlp","social-sciences","tokenizer"],"created_at":"2024-07-30T20:01:47.694Z","updated_at":"2025-03-14T20:30:34.491Z","avatar_url":"https://github.com/ndabAP.png","language":"Go","funding_links":[],"categories":["Science and Data Analysis","Relational Databases","数据分析与数据科学","科学与数据分析"],"sub_categories":["HTTP Clients","查询语","HTTP客户端"],"readme":"# assocentity\n\n[![Go Report Card](https://goreportcard.com/badge/github.com/ndabAP/assocentity/v14)](https://goreportcard.com/report/github.com/ndabAP/assocentity/v14)\n\nPackage assocentity is a social science tool to analyze the relative distance\nfrom tokens to entities. The motivation is to draw conclusions based on the\ndistance from interesting tokens to a certain entity and its synonyms. 
Visit\n[this](https://ndabap.github.io/entityscrape/index.html) website to see a\nusage example.\n\n## Features\n\n- Provide your own tokenizer\n- Provides a default NLP tokenizer (by Google)\n- Define aliases for entities\n- Provides a multi-OS, language-agnostic CLI version\n\n## Installation\n\n```bash\n$ go get github.com/ndabAP/assocentity/v14\n```\n\n## Prerequisites\n\nIf you want to analyze human-readable texts, you can use the provided Natural\nLanguage tokenizer (powered by Google). To do so, sign up for a Cloud Natural\nLanguage API service account key and download the generated JSON file. This\nfile is the `credentialsFile` in the example below. You should never commit that\nfile.\n\nA possible offline tokenizer would be a whitespace tokenizer. You also might\nuse a parser, depending on your purposes.\n\n## Example\n\nWe would like to find out how close, on average, each adjective is to a certain\npublic person. Let's take George W. Bush and 1,000 NBC news articles as an\nexample. \"George Bush\" is the entity, and its synonyms are \"George Walker Bush\",\n\"Bush\" and so on. The texts are the 1,000 NBC news articles.\n\nThe first step is to define a text source and set the entity. Next, we need\nto instantiate our tokenizer. In this case, we use the provided Google NLP\ntokenizer. Then we can calculate the distances with\n`assocentity.Distances`, which accepts multiple texts. 
Notice\nhow we pass `tokenize.ADJ` to include only adjectives as the part of speech.\nFinally, we can take the mean by passing the result to `assocentity.Mean`.\n\n```go\n// Define the text source and entity\ntexts := []string{\n\t\"Former Presidents Barack Obama, Bill Clinton and ...\", // Truncated\n\t\"At the pentagon on the afternoon of 9/11, ...\",\n\t\"Tony Blair moved swiftly to place his relationship with ...\",\n}\nentities := []string{\n\t\"George Walker Bush\",\n\t\"George Bush\",\n\t\"Bush\",\n}\nsource := assocentity.NewSource(entities, texts)\n\n// Instantiate the NLP tokenizer (powered by Google)\nnlpTok := nlp.NewNLPTokenizer(credentialsFile, nlp.AutoLang)\n\n// Get the distances to adjectives\nctx := context.TODO()\ndists, err := assocentity.Distances(ctx, nlpTok, tokenize.ADJ, source)\nif err != nil {\n\t// Handle error\n}\n// Get the mean from the distances\nmean := assocentity.Mean(dists)\n```\n\nThe `NLPTokenizer` has a built-in retryer with a strategy that works well with\nthe Google Language API limitations. It can't be disabled or configured.\n\n### Tokenization\n\nA `Tokenizer` produces tokens from a given text, while a\n`Token` is the smallest possible unit of a text. 
The interface with the\nmethod `Tokenize` has the following signature:\n\n```go\ntype Tokenizer interface {\n\tTokenize(ctx context.Context, text string) ([]Token, error)\n}\n```\n\nA `Token` has the following properties:\n\n```go\ntype Token struct {\n\tPoS  PoS    // Part of speech\n\tText string // Text\n}\n\n// Part of speech\ntype PoS int\n```\n\nFor example, given the text:\n\n```go\ntext := \"Punchinello was burning to get me\"\n```\n\nThe result from `Tokenize` would be a slice of tokens:\n\n```go\n[]Token{\n\t{\n\t\tText: \"Punchinello\",\n\t\tPoS:  tokenize.NOUN,\n\t},\n\t{\n\t\tText: \"was\",\n\t\tPoS:  tokenize.VERB,\n\t},\n\t{\n\t\tText: \"burning\",\n\t\tPoS:  tokenize.VERB,\n\t},\n\t{\n\t\tText: \"to\",\n\t\tPoS:  tokenize.PRT,\n\t},\n\t{\n\t\tText: \"get\",\n\t\tPoS:  tokenize.VERB,\n\t},\n\t{\n\t\tText: \"me\",\n\t\tPoS:  tokenize.PRON,\n\t},\n}\n```\n\n## CLI\n\nIf you don't have Go available, there is also a language-agnostic terminal\nversion for Windows, Mac (Darwin) or Linux (only with 64-bit support).\nThe application expects the text on \"stdin\" and accepts the following flags:\n\n| Flag                 | Description                                                                                           | Type     | Default |\n| -------------------- | ----------------------------------------------------------------------------------------------------- | -------- | ------- |\n| `entities`           | List of comma-separated entities, example: `-entities=\"Max Payne,Payne\"`                              | `string` |         |\n| `google-svc-acc-key` | Google Cloud NLP JSON service account file, example: `-google-svc-acc-key=~/google-svc-acc-key.json`  | `string` |         |\n| `op`                 | Operation to execute, default is `mean`                                                               | `string` | `mean`  |\n| `pos`                | List of comma-separated parts of speech, example: `-pos=noun,verb,pron`      
                         | `string` | `any`   |\n\nExample:\n\n```bash\necho \"Relax, Max. You're a nice guy.\" | ./bin/assocentity_linux_amd64_v14.0.0-0-g948274a-dirty -google-svc-acc-key=/home/max/.config/assocentity/google-service.json -entities=\"Max Payne,Payne,Max\"\n```\n\nThe output is written to \"stdout\" in appropriate formats.\n\n## Projects using assocentity\n\n- [entityscrape](https://github.com/ndabAP/entityscrape) - Distance between word\n  types (default: adjectives) in news articles and persons\n\n## Author\n\n[Julian Claus](https://www.julian-claus.de) and contributors.\n\n## License\n\nMIT\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FndabAP%2Fassocentity","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FndabAP%2Fassocentity","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FndabAP%2Fassocentity/lists"}