{"id":24912842,"url":"https://github.com/raynardj/langhuan","last_synced_at":"2025-10-16T23:31:25.018Z","repository":{"id":62575055,"uuid":"329886416","full_name":"raynardj/langhuan","owner":"raynardj","description":"Light weight labeling engine","archived":false,"fork":false,"pushed_at":"2021-09-14T07:51:17.000Z","size":1109,"stargazers_count":12,"open_issues_count":2,"forks_count":0,"subscribers_count":3,"default_branch":"main","last_synced_at":"2024-04-23T01:01:40.455Z","etag":null,"topics":["classification","data-science","labeling","labeling-tool","machine-learning","named-entity-recognition","ner","nlp","tagging-tool"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/raynardj.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-01-15T10:56:02.000Z","updated_at":"2022-03-25T08:07:51.000Z","dependencies_parsed_at":"2022-11-03T18:51:55.311Z","dependency_job_id":null,"html_url":"https://github.com/raynardj/langhuan","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/raynardj%2Flanghuan","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/raynardj%2Flanghuan/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/raynardj%2Flanghuan/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/raynardj%2Flanghuan/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/raynardj","download_url":"https://codeload.github.com/raynardj/langhuan/tar.gz/refs/h
eads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":236756752,"owners_count":19199894,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["classification","data-science","labeling","labeling-tool","machine-learning","named-entity-recognition","ner","nlp","tagging-tool"],"created_at":"2025-02-02T05:28:44.823Z","updated_at":"2025-10-16T23:31:24.325Z","avatar_url":"https://github.com/raynardj.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# LangHuAn\n\u003e **Lang**uage **Hu**man **An**notations, a frontend for tagging AI project labels, driven by pandas dataframe data.\n\n\u003e From the Chinese word **琅嬛[langhuan]** (a legendary realm where a god curates books)\n\nHere's a [5-minute YouTube video](https://www.youtube.com/watch?v=Nwh6roiX_9I) explaining how langhuan works.\n\n[![Introduction Video](https://raw.githubusercontent.com/raynardj/langhuan/main/docs/imgs/ner1.jpg)](https://www.youtube.com/watch?v=Nwh6roiX_9I)\n\n## Installation\n```shell\npip install langhuan\n```\n\n## Minimum configuration walkthrough\n\u003e langhuan starts a Flask application from a **pandas dataframe** 🐼 !\n\n### Simplest configuration for **NER** task 🚀\n\n```python\nfrom langhuan import NERTask\n\napp = NERTask.from_df(\n    df, text_col=\"description\",\n    options=[\"institution\", \"company\", \"name\"])\napp.run(\"0.0.0.0\", port=5000)\n```\n\n### Simplest configuration for **Classify** task 🚀\n```python\nfrom langhuan import ClassifyTask\n\napp = ClassifyTask.from_df(\n    df, text_col=\"comment\",\n   
 options=[\"positive\", \"negative\", \"unbiased\", \"not sure\"])\napp.run(\"0.0.0.0\", port=5000)\n```\n![classification image](https://raw.githubusercontent.com/raynardj/langhuan/main/docs/imgs/classify1.jpg)\n\n## Frontend\n\u003e You can visit the following pages for this app.\n\n### Tagging\n```http://[ip]:[port]/``` is for our hard-working taggers to visit.\n\n### Admin\n```http://[ip]:[port]/admin``` is a page where you can 👮🏽‍♂️:\n* See the progress of each user.\n* Force-save the progress (otherwise it is saved only according to ```save_frequency```, default 42 entries).\n* Download the tagged entries.\n\n## Advanced settings\n#### Validation\nYou can set the minimum verification number, ```cross_verify_num```, i.e. how many taggers validate each entry. The default is 1.\n\nIf you set ```cross_verify_num``` to 2 and you have 5 taggers, each entry will be seen by 2 taggers.\n\n```python\napp = ClassifyTask.from_df(\n    df, text_col=\"comment\",\n    options=[\"positive\", \"negative\", \"unbiased\", \"not sure\"],\n    cross_verify_num=2,)\n```\n\n#### Preset the tagging\nYou can use a column in the dataframe, e.g. one called ```guessed_tags```, to preset the tagging result.\n\nEach cell can contain a tagging result in this format, e.g. 
\n```json\n{\"tags\":[\n    {\"text\": \"Genomicare Bio Tech\", \"offset\":32, \"label\":\"company\"},\n    {\"text\": \"East China University of Politic Science \u0026 Law\", \"offset\":96, \"label\":\"company\"}\n    ]}\n```\n\nThen you can run the app with the preset tag column:\n```python\napp = NERTask.from_df(\n    df, text_col=\"description\",\n    options=[\"institution\", \"company\", \"name\"],\n    preset_tag_col=\"guessed_tags\")\napp.run(\"0.0.0.0\", port=5000)\n```\n\n#### Order strategy\nThe order in which texts are tagged is determined by ```order_strategy```.\n\nThe default is ```\"forward_match\"```; you can also try ```pincer``` or ```trident```.\n![order strategies](https://raw.githubusercontent.com/raynardj/langhuan/main/docs/imgs/strategies.jpg)\n\nAssume the ```order_by_column``` is set to the predictions from the last batch of a deep learning model:\n- ```trident``` means the taggers tag the most confident positives, the most confident negatives, and the most uncertain entries first.\n\n#### Load History\nIf your service stops, you can recover the progress from the cache.\n\nThe previous cache will be at ```$HOME/.cache/langhuan/{task_name}```.\n\nYou can change ```save_frequency``` to suit your task; the default is 42 entries.\n\n```python\napp = NERTask.from_df(\n    df, text_col=\"description\",\n    options=[\"institution\", \"company\", \"name\"],\n    save_frequency=128,\n    load_history=True,\n    task_name=\"task_NER_210123_110327\"\n    )\n```\n\n#### Admin Control\n\u003e This application assumes internal use within an organization, hence the minimal security. 
If you enable ```admin_control```, all admin-related pages will require an ```adminkey```; the key will appear in the console prompt.\n\n```python\napp = NERTask.from_df(\n    df, text_col=\"description\",\n    options=[\"institution\", \"company\", \"name\"],\n    admin_control=True,\n    )\n```\n\n#### From downloaded data =\u003e pytorch dataset\n\u003e For downloaded NER data tags, you can automatically create a dataloader from the JSON file:\n* [pytorch + huggingface tokenizer](https://raynardj.github.io/langhuan/docs/loader)\n* tensorflow + huggingface tokenizer, development pending\n\n#### Gunicorn support\nThis is a **lightweight** solution. When moving to gunicorn, multiple threads are acceptable, but multiple workers will cause chaos.\n\n```shell\ngunicorn --workers=1 --threads=5 app:app\n```\n\n## Compatibility 💍\nThis library hasn't been tested rigorously across many browsers and versions. So far:\n* compatible with Chrome, Firefox, and Safari, provided the version is not too old.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fraynardj%2Flanghuan","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fraynardj%2Flanghuan","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fraynardj%2Flanghuan/lists"}