{"id":13741438,"url":"https://github.com/proycon/folia","last_synced_at":"2025-05-07T15:47:05.319Z","repository":{"id":57431785,"uuid":"1948022","full_name":"proycon/folia","owner":"proycon","description":"FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations. A wide variety of linguistic annotations are supported, making FoLiA a useful format for NLP tasks and data interchange. Note that the actual Python library for processing FoLiA is implemented as part of PyNLPl, this contains higher-level tools that use the library as well as the full documentation, validation schemas, and set definitions","archived":false,"fork":false,"pushed_at":"2024-05-14T12:29:06.000Z","size":57691,"stargazers_count":61,"open_issues_count":21,"forks_count":10,"subscribers_count":13,"default_branch":"master","last_synced_at":"2024-12-17T01:54:40.380Z","etag":null,"topics":["computational-linguistics","corpus","file-format","folia","language","library","linguistic-annotation-framework","linguistics","nlp","python","xml"],"latest_commit_sha":null,"homepage":"http://proycon.github.io/folia/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/proycon.png","metadata":{"files":{"readme":"README.md","changelog":"ChangeLog","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2011-06-24T14:50:47.000Z","updated_at":"2024-11-14T21:27:25.000Z","dependencies_parsed_at":"2024-08-03T04:18:04.867Z","dependency_job_id":null,"html_url":"https://github.com/proycon/folia","commit_stats":{"total_com
mits":1517,"total_committers":8,"mean_commits":189.625,"dds":"0.030323005932762048","last_synced_commit":"a90d91a99603c9b1d117bd2a7cf59c65eb4132f5"},"previous_names":[],"tags_count":32,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/proycon%2Ffolia","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/proycon%2Ffolia/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/proycon%2Ffolia/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/proycon%2Ffolia/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/proycon","download_url":"https://codeload.github.com/proycon/folia/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":231449739,"owners_count":18378431,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["computational-linguistics","corpus","file-format","folia","language","library","linguistic-annotation-framework","linguistics","nlp","python","xml"],"created_at":"2024-08-03T04:00:59.185Z","updated_at":"2024-12-27T07:09:11.636Z","avatar_url":"https://github.com/proycon.png","language":"Python","readme":"\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"https://raw.githubusercontent.com/proycon/folia/master/logo.png\" width=\"200\" /\u003e\n\u003c/div\u003e\n\n# FoLiA: Format for Linguistic 
Annotation\n\n[![tests](https://travis-ci.org/proycon/folia.svg?branch=master)](https://travis-ci.org/proycon/folia)\n[![documentation](http://readthedocs.org/projects/folia/badge/?version=latest)](http://foliapy.readthedocs.io/en/latest/?badge=latest)\n[![lamabadge](http://applejack.science.ru.nl/lamabadge.php/folia)](http://applejack.science.ru.nl/languagemachines/)\n[![DOI](https://zenodo.org/badge/1948022.svg)](https://zenodo.org/badge/latestdoi/1948022)\n[![Project Status: Active – The project has reached a stable, usable state and is being actively developed.](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/#active)\n\n[Documentation](http://foliapy.readthedocs.io/en/latest/?badge=latest) | [Examples](https://github.com/proycon/folia/tree/master/examples) | [Python Library](https://pypi.org/project/FoLiA/) | [Python Library Documentation](https://foliapy.readthedocs.io/en/latest/) | [C++ Library](https://github.com/LanguageMachines/libfolia) | [Rust Library](https://crates.io/crates/folia) | [FoLiA-Tools](https://github.com/proycon/foliatools) | [FoLiA Utilities](https://github.com/LanguageMachines/foliautils) | [FLAT: Web-based Annotation environment](https://github.com/proycon/flat)\n\n*by Maarten van Gompel, CLST/Radboud University Nijmegen \u0026 KNAW Humanities Cluster*\n\n\u003chttps://proycon.github.io/folia\u003e\n\nFoLiA is an XML-based annotation format, suitable for the representation\nof linguistically annotated language resources. FoLiA's intended use is\nas a format for storing and/or exchanging language resources, including\ncorpora. Our aim is to introduce a single rich format that can\naccommodate a wide variety of linguistic annotation types through a\nsingle generalised paradigm. We do not commit to any label set, language\nor linguistic theory. This is always left to the developer of the\nlanguage resource, and provides maximum flexibility.\n\nXML is an inherently hierarchic format. 
FoLiA does justice to this by\nmaking maximal use of a hierarchic, inline setup. We inherit from the\nD-Coi format, which claims to be loosely based on a minimal subset of\nTEI. Because of the introduction of a new and much broader paradigm,\nFoLiA is not backwards-compatible with D-Coi, i.e. validators for D-Coi\nwill not accept FoLiA XML. It is however easy to convert FoLiA to less\ncomplex or verbose formats such as the D-Coi format, or plain text.\nConverters are provided.\n\nThe main characteristics of FoLiA are:\n\n-   **Generalised paradigm** - We use a generalised paradigm, with as\n    few ad-hoc provisions for annotation types as possible.\n-   **Expressivity** - The format is highly expressive: annotations can\n    be expressed in great detail and with flexibility to the user's\n    needs, without forcing unwanted details. Moreover, FoLiA has\n    generalised support for representing annotation alternatives, and\n    annotation metadata such as information on annotator, time of\n    annotation, and annotation confidence.\n-   **Extensible** - Due to the generalised paradigm and the fact that\n    the format does not commit to any label set, FoLiA is fairly easily\n    extensible.\n-   **Formalised** - The format is formalised, can be validated on\n    both a shallow and a deep level (the latter including tagset\n    validation), and is easily machine-parsable, for which tools are\n    provided.\n-   **Practical** - FoLiA has been developed in a bottom-up fashion\n    right alongside applications, libraries, and other toolkits and\n    converters. 
Whilst the format is rich, we try to keep it as\n    simple and straightforward as possible, minimising the learning\n    curve and making it easy to adopt FoLiA in practical applications.\n\nThe FoLiA format makes mixed use of inline and stand-off annotation.\nInline annotation is used for annotations pertaining to single tokens,\nwhilst stand-off annotation in separate annotation layers is adopted\nfor annotation types that span multiple tokens. This provides FoLiA\nwith the necessary flexibility and extensibility to deal with various\nkinds of annotations.\n\nNotable features are:\n\n-   XML-based, UTF-8 encoded\n-   Language and tagset independent\n-   Can encode both tokenised and untokenised text + partial\n    reconstructability of untokenised form even after tokenisation.\n-   Generalised paradigm, extensible and flexible\n-   Provenance support for all linguistic annotations: annotator, type\n    (automatic or manual), time.\n-   Used by various software projects and corpora, especially in the\n    Dutch-Flemish NLP community\n\nParadigm Schema\n===============\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"https://github.com/proycon/folia/blob/master/docs/folia_paradigm2.png\" width=\"800\" /\u003e\n\u003c/div\u003e\n\nResources\n=========\n\n-   [Website](https://proycon.github.io/folia) - **Please visit this\n    FoLiA website for more information**\n-   [Documentation](https://folia.readthedocs.io)\n-   [RelaxNG schema](http://github.com/proycon/folia/blob/master/schemas/folia.rng)\n    (not sufficient for full validation, use the\n    [foliavalidator](https://github.com/proycon/foliatools) or\n    [folialint](https://github.com/LanguageMachines/libfolia) tool!)\n-   [Examples of FoLiA documents](https://github.com/proycon/folia/tree/master/examples)\n-   [FoLiApy: FoLiA library for Python](https://github.com/proycon/foliapy) (`pip install folia`)\n    -   [Library documentation](https://foliapy.readthedocs.io)\n-   [libfolia: 
FoLiA library for C++](https://github.com/LanguageMachines/libfolia)\n-   [folia-rust: FoLiA library for Rust](https://github.com/proycon/folia-rust)\n-   [FoLiA Tools: Various command-line tools for FoLiA](https://github.com/proycon/foliatools)\n    (`pip install folia-tools`)\n-   [FoLiA Utilities: Various command-line tools for FoLiA](https://github.com/LanguageMachines/foliautils)\n-   [FLAT: A web-based annotation environment](https://github.com/proycon/flat)\n\nA more extensive list of FoLiA-capable software is maintained on the\n[FoLiA website](https://proycon.github.io/folia).\n\nPublications\n============\n\nSee the [FoLiA website](https://proycon.github.io/folia) for more\npublications and full-text links.\n\n-   Maarten van Gompel (2019). FoLiA: Format for Linguistic Annotation -\n    Documentation. Language and Speech Technology Technical Report\n    Series. Radboud University Nijmegen.\n-   Maarten van Gompel, Ko van der Sloot, Martin Reynaert, Antal van den\n    Bosch (2017). **FoLiA in Practice: The Infrastructure of a\n    Linguistic Annotation Format.** In: CLARIN in the Low Countries.\n    Eds: Jan Odijk and Arjan van Hessen. Pp. 71-81.\n    [PDF](https://www.jstor.org/stable/j.ctv3t5qjk.13?seq=1#metadata_info_tab_contents)\n-   Maarten van Gompel \u0026 Martin Reynaert (2014). **FoLiA: A practical\n    XML format for linguistic annotation - a descriptive and comparative\n    study;** Computational Linguistics in the Netherlands Journal;\n    3:63-81; 2013. [PDF](https://clinjournal.org/clinj/article/view/26/22)\n-   Maarten van Gompel (2014). **FoLiA: Format for Linguistic\n    Annotation. Documentation.** Language and Speech Technology\n    Technical Report Series LST-14-01. 
Radboud University Nijmegen.\n","funding_links":[],"categories":["Software","Python"],"sub_categories":["Utilities"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fproycon%2Ffolia","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fproycon%2Ffolia","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fproycon%2Ffolia/lists"}