{"id":13788013,"url":"https://github.com/CogComp/talen","last_synced_at":"2025-05-12T02:30:53.331Z","repository":{"id":8770952,"uuid":"59701653","full_name":"CogComp/talen","owner":"CogComp","description":"A way to do annotations for NER. TALEN: Tool for Annotation of Low-resource ENtities","archived":false,"fork":false,"pushed_at":"2022-05-11T19:50:57.000Z","size":5541,"stargazers_count":112,"open_issues_count":10,"forks_count":25,"subscribers_count":8,"default_branch":"master","last_synced_at":"2024-11-18T01:39:20.477Z","etag":null,"topics":["named-entity-recognition","ner"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/CogComp.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2016-05-25T22:20:39.000Z","updated_at":"2024-03-31T14:17:51.000Z","dependencies_parsed_at":"2022-08-08T23:00:14.137Z","dependency_job_id":null,"html_url":"https://github.com/CogComp/talen","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CogComp%2Ftalen","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CogComp%2Ftalen/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CogComp%2Ftalen/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CogComp%2Ftalen/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/CogComp","download_url":"https://codeload.github.com/CogComp/talen/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github"
,"repositories_count":253662561,"owners_count":21944096,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["named-entity-recognition","ner"],"created_at":"2024-08-03T21:00:34.464Z","updated_at":"2025-05-12T02:30:52.364Z","avatar_url":"https://github.com/CogComp.png","language":"Java","funding_links":[],"categories":["Labeling Tools","人工智能"],"sub_categories":["Text","自然语言处理"],"readme":"\n\u003c!--\n\u003cimg src=\"/src/main/resources/static/img/logo-black-trans.png\" width=\"50%\" /\u003e\n--\u003e\n\n# TALEN: Tool for Annotation of Low-resource ENtities\n\nA lightweight web-based tool for annotating word sequences.\n\n![Screenshot of web interface](/src/main/resources/static/img/selection.png?raw=true \"Screenshot\")\n\n\n\n## Installation\n\nRequires Java 8 and Maven. Run:\n\n    $ ./scripts/run.sh\n\nThis will start the server on port 8009. Point a browser to [localhost:8009](http://localhost:8009). The port number is specified in [`application.properties`](./src/main/resources/application.properties).\n\nThis reads from [`config/users.txt`](config/users.txt), which has a username and password pair on each line. You will\nlog in using one of those pairs, and then that username is tied to your activities in that session. All annotations\nthat you do will be written to a path called `\u003corig\u003e-annotation-\u003cusername\u003e`, where `\u003corig\u003e` is the original path\nspecified in the config file, and `\u003cusername\u003e` is what you chose as username.\n\nSuppose you do some annotations, then leave the session, and come back again. 
If you log in with the same\nusername as the previous session, it will reload all of the annotations right where you left off, so no\nwork is lost.\n\n## Usage\n\nYou make annotations by clicking on words and selecting a label. If you want to remove a label, right click on a word.\n\nTo annotate a phrase, highlight the phrase, ending with the mouse in the middle of the last word. The standard box will\n  show up, and you can select the correct label. To dismiss the annotation box, click on the word it points to.\n\nA document is saved by pressing the Save button. If you navigate away using\nthe links on the top of the page, the document is not saved. \n\n## Configuration\n\nThere are two kinds of config files, corresponding to the two annotation methods\n(see below). The document-based method looks for config files that start with 'doc-'\nand the sentence-based method looks for config files that start with 'sent-'.\n\nSee the [example config files](config/) for the minimally required set of options.\n\n## Annotation Methods\n\nThere are two main annotation methods supported: document-based, and sentence-based. \n\n### Document-based\nThe document-based method is a common paradigm. You point the software to a folder of documents\nand each is displayed in turn, and you annotate them.\n\n### Sentence-based  \nThe sentence-based method is intended to allow a rapid annotation process. First, you need to\nbuild an index using `TextFileIndexer.java`, then you supply some seed names\nin the config file. The system searches for these seed names in the index, and returns \na small number of sentences containing them. The annotator is encouraged to annotate\nthese correctly, and also annotate any other names which may appear. These new names then \njoin the list of seed names, and annotation continues. \n\nFor example, if the seed name is 'Pete Sampras', then we might hope that 'Andre Agassi'\nwill show up in the same sentence. 
If the annotator also annotates\n'Andre Agassi', then the system will retrieve new sentences containing 'Andre Agassi'.\nPresumably these sentences will contain entities such as 'Wimbledon' and 'New York City'. In principle,\nthis will continue until some cap on the number of entities has been reached.\n\n#### Using the sentence-based method\n\nFirst, you need to download a corpus. We have used Hindi for this. Run:\n\n```bash\n$ sudo pip install -U nltk  # only if you don't already have nltk\n$ python -m nltk.downloader indian\n```\n\nNow convert this:\n```bash\n$ cd data\n$ python data/getindian.py\n$ cd ..\n```\n\nYou'll notice that this created files in `data/txt/hindi` and in `data/tajson/hindi`. Now build the index:\n```bash\n$ mvn dependency:copy-dependencies\n$ ./scripts/buildindex.sh data/tajson/hindi/ data/index_hindi \n```\n\nThat's it! There is already a config file called `config/sent-Hindi.txt` that should get you started.\n\n\n## Non-speaker Helps\nOne major focus of the software is to allow non-speakers of a language to \nannotate text. Some features are: inline dictionary replacement, morphological \nawareness and coloring, entity propagation, entity suggestions, and hints based on frequency and \nmutual information.\n\n### How to build an index\nUse [`buildindex.sh`](scripts/buildindex.sh) to build a local index for the sentence-based mode. The resulting index directory goes in the `indexdir` variable\nof the sentence-based config file. The script, in turn, calls `TextFileIndexer.java`.\n\n## Command line tool\nWe also ship a lightweight command line tool for TALEN. This tool will read a folder of JSON TextAnnotations (more formats coming soon)\nand spin up a Java-only server, serving static HTML versions of each document. It is intended only for examination and exploration.\n\nInstall it as follows:\n```bash\n$ ./scripts/install-cli.sh\n$ export PATH=$PATH:$HOME/software/talen/\n```  \n\n(You can change the `INSTALLDIR` in `install-cli.sh` if you want it installed somewhere else). 
Now that it is installed, you can run it \nfrom any folder in your terminal:\n\n```bash\n$ talen-cli FolderOfTAFiles\n```\n\nThis will serve static HTML documents at `localhost:PORT` (default `PORT` is 8008). You can run with additional options:\n\n```bash\n$ talen-cli FolderOfTAFiles -roman -port 8888\n```\n\nThe `-roman` option uses the `ROMANIZATION` view in the TextAnnotation for text (if available), and the `-port` option\nsets the port to serve on.\n\n\n## Mechanical Turk\nAlthough the main function of this software is a server-based system, there is also a lightweight version that runs\nentirely in JavaScript, for the express purpose of creating Mechanical Turk jobs.\n\nThe important files are [mturkTemplate.html](src/main/resources/templates/mturk/mturkTemplate.html) and [annotate-local.js](src/main/resources/static/js/annotate-local.js). The\nlatter is a version of [annotate.js](src/main/resources/static/js/annotate.js), but the code to handle adding and\nremoving spans is included in the JavaScript instead of being sent to a Java controller. This is less powerful (because we have\nNLP libraries written in Java, not JavaScript), but can be run with no server.\n\n\nAll the scripts needed to create this file are included in this repository. It was created as follows:\n\n```bash\n$ python scripts/preparedata.py preparedata data/txt tmp.csv\n$ python scripts/preparedata.py testfile tmp.csv docs/index.html\n```\n\n[mturkTemplate.html](src/main/resources/templates/mturk/mturkTemplate.html) has a lot of extra material (instructions, annotator test, etc.) which\ncan all be removed if desired. I found it useful for mturk tasks. When you create the mturk task, there will be a \nsubmit button, and the answer will be put into the `#finalsubmission` field. The output string is a JavaScript list of token spans along with \ntheir labels. 
\n\n\n## Citation\n\nIf you use this in your research paper, please cite us!\n\n```\n@inproceedings{talen2018,\n    author = {Stephen Mayhew and Dan Roth},\n    title = {TALEN: Tool for Annotation of Low-resource ENtities},\n    booktitle = {ACL System Demonstrations},\n    year = {2018},\n}\n```\n\nRead the paper here: [http://cogcomp.org/papers/MayhewRo18.pdf](http://cogcomp.org/papers/MayhewRo18.pdf) \n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FCogComp%2Ftalen","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FCogComp%2Ftalen","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FCogComp%2Ftalen/lists"}