{"id":15520599,"url":"https://github.com/percevalw/nlstruct","last_synced_at":"2025-04-12T20:25:36.739Z","repository":{"id":41782514,"uuid":"229176303","full_name":"percevalw/nlstruct","owner":"percevalw","description":"Natural language structuring library","archived":false,"fork":false,"pushed_at":"2024-06-05T11:45:29.000Z","size":556,"stargazers_count":19,"open_issues_count":9,"forks_count":11,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-03-09T23:02:09.076Z","etag":null,"topics":["deep-learning","machine-learning","natural-language-processing","notebook","python","structured-data"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/percevalw.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2019-12-20T02:38:34.000Z","updated_at":"2024-12-30T22:24:43.000Z","dependencies_parsed_at":"2024-01-11T23:50:22.338Z","dependency_job_id":"5b6d50d0-b57c-42f0-8656-4cf361f5907b","html_url":"https://github.com/percevalw/nlstruct","commit_stats":{"total_commits":378,"total_committers":6,"mean_commits":63.0,"dds":0.07671957671957674,"last_synced_commit":"da30fbee021bae3b06a0a586f160b073d56bef6a"},"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/percevalw%2Fnlstruct","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/percevalw%2Fnlstruct/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/percevalw%2Fnlstruct/releases","manifests_url":"https://repos.ecosyste.m
s/api/v1/hosts/GitHub/repositories/percevalw%2Fnlstruct/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/percevalw","download_url":"https://codeload.github.com/percevalw/nlstruct/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248441525,"owners_count":21104011,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","machine-learning","natural-language-processing","notebook","python","structured-data"],"created_at":"2024-10-02T10:28:06.253Z","updated_at":"2025-04-12T20:25:36.711Z","avatar_url":"https://github.com/percevalw.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# NLStruct\n\nNatural language structuring library.\nCurrently, it implements a nested NER model and a span classification model, but other algorithms might follow.\n\nIf you find this library useful in your research, please consider citing:\n\n```\n@phdthesis{wajsburt:tel-03624928,\n  TITLE = {{Extraction and normalization of simple and structured entities in medical documents}},\n  AUTHOR = {Wajsb{\\\"u}rt, Perceval},\n  URL = {https://hal.archives-ouvertes.fr/tel-03624928},\n  SCHOOL = {{Sorbonne Universit{\\'e}}},\n  YEAR = {2021},\n  MONTH = Dec,\n  KEYWORDS = {nlp ; structure ; extraction ; normalization ; clinical ; multilingual},\n  TYPE = {Theses},\n  PDF = {https://hal.archives-ouvertes.fr/tel-03624928/file/updated_phd_thesis_PW.pdf},\n  HAL_ID = {tel-03624928},\n  HAL_VERSION = {v1},\n}\n```\n\nThis work was performed at [LIMICS](http://www.limics.fr/), in 
collaboration with [AP-HP's Clinical Data Warehouse](https://eds.aphp.fr/) and funded by the [Institute of Computing and Data Science](https://iscd.sorbonne-universite.fr/).\n\n## Features\n\n- processes large documents seamlessly: it automatically handles tokenization and sentence splitting.\n- do not train twice: an automatic caching mechanism detects when an experiment has already been run\n- stop \u0026 resume with checkpoints\n- easy import and export of data\n- handles nested or overlapping entities\n- multi-label classification of recognized entities\n- strict or relaxed multi-label end-to-end retrieval metrics\n- pretty logging with [rich-logger](https://github.com/percevalw/rich_logger)\n- heavily customizable, without config files (see [train_ner.py](https://github.com/percevalw/nlstruct/blob/nlstruct/recipes/train_ner.py))\n- built on top of [transformers](https://github.com/huggingface/transformers) and [pytorch_lightning](https://github.com/PyTorchLightning/pytorch-lightning)\n\n## Training models\n\n### How to train a NER model\n\n```python\nfrom nlstruct.recipes import train_ner\n\nmodel = train_ner(\n    dataset={\n        \"train\": \"path to your train brat/standoff data\",\n        \"val\": 0.05,  # or path to your validation data\n        # \"test\": # and optional path to your test data\n    },\n    finetune_bert=False,\n    seed=42,\n    bert_name=\"camembert/camembert-base\",\n    fasttext_file=\"\",\n    gpus=0,\n    xp_name=\"my-xp\",\n    return_model=True,\n)\nmodel.save_pretrained(\"model.pt\")\n```\n\n### How to use it\n\n```python\nfrom nlstruct import load_pretrained\nfrom nlstruct.datasets import load_from_brat, export_to_brat\n\nner = load_pretrained(\"model.pt\")\nner.eval()\nner.predict({\"doc_id\": \"doc-0\", \"text\": \"Je lui prescris du lorazepam.\"})\n# Out: \n# {'doc_id': 'doc-0',\n#  'text': 'Je lui prescris du lorazepam.',\n#  'entities': [{'entity_id': 0,\n#    'label': ['substance'],\n#    'attributes': [],\n#    
'fragments': [{'begin': 19,\n#      'end': 28,\n#      'label': 'substance',\n#      'text': 'lorazepam'}],\n#    'confidence': 0.9998705969553088}]}\nexport_to_brat(ner.predict(load_from_brat(\"path/to/brat/test\")), filename_prefix=\"path/to/exported_brat\")\n```\n\n### How to train a NER model followed by a span classification model\n\n```python\nfrom nlstruct.recipes import train_qualified_ner\n\nmodel = train_qualified_ner(\n    dataset={\n        \"train\": \"path to your train brat/standoff data\",\n        \"val\": 0.05,  # or path to your validation data\n        # \"test\": # and optional path to your test data\n    },\n    finetune_bert=False,\n    seed=42,\n    bert_name=\"camembert/camembert-base\",\n    fasttext_file=\"\",\n    gpus=0,\n    xp_name=\"my-xp\",\n    return_model=True,\n)\nmodel.save_pretrained(\"model.pt\")\n```\n\n## Ensembling\n\nEasily ensemble multiple models (same architecture, different seeds):\n```python\nfrom nlstruct import load_pretrained\nfrom nlstruct.datasets import load_from_brat, export_to_brat\n\n# Load models trained with the same architecture but different seeds\nmodel1 = load_pretrained(\"model-1.pt\")\nmodel2 = load_pretrained(\"model-2.pt\")\nmodel3 = load_pretrained(\"model-3.pt\")\nensemble = model1.ensemble_with([model2, model3]).cuda()\nexport_to_brat(ensemble.predict(load_from_brat(\"path/to/brat/test\")), filename_prefix=\"path/to/exported_brat\")\n```\n\n## Advanced use\n\nShould you need to further configure the training of a model, please directly modify one \nof the recipes located in the [recipes](nlstruct/recipes/) folder.\n\n\n### Install\n\nThis project is still under development and subject to changes.\n\n```bash\npip install nlstruct==0.2.0\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpercevalw%2Fnlstruct","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpercevalw%2Fnlstruct","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpercevalw%2Fnlstruct/lists"}