{"id":13543352,"url":"https://github.com/LanguageMachines/PICCL","last_synced_at":"2025-04-02T12:32:14.748Z","repository":{"id":146794074,"uuid":"86154255","full_name":"LanguageMachines/PICCL","owner":"LanguageMachines","description":"A set of workflows for corpus building through OCR, post-correction and normalisation","archived":false,"fork":false,"pushed_at":"2022-09-07T12:28:13.000Z","size":4472,"stargazers_count":48,"open_issues_count":3,"forks_count":7,"subscribers_count":7,"default_branch":"master","last_synced_at":"2025-03-20T09:06:34.320Z","etag":null,"topics":["computational-linguistics","corpus-linguistics","corpus-tools","folia","nlp","ocr","workflow"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/LanguageMachines.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2017-03-25T12:01:28.000Z","updated_at":"2024-09-05T18:47:56.000Z","dependencies_parsed_at":"2024-01-15T23:27:22.908Z","dependency_job_id":"54193895-bbf0-4933-ad51-c1d85f426f3d","html_url":"https://github.com/LanguageMachines/PICCL","commit_stats":null,"previous_names":[],"tags_count":28,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LanguageMachines%2FPICCL","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LanguageMachines%2FPICCL/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LanguageMachines%2FPICCL/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LanguageMachines%2FPICCL/manif
ests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/LanguageMachines","download_url":"https://codeload.github.com/LanguageMachines/PICCL/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246815838,"owners_count":20838525,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["computational-linguistics","corpus-linguistics","corpus-tools","folia","nlp","ocr","workflow"],"created_at":"2024-08-01T11:00:30.577Z","updated_at":"2025-04-02T12:32:09.739Z","avatar_url":"https://github.com/LanguageMachines.png","language":"Python","readme":"[![Language Machines Badge](http://applejack.science.ru.nl/lamabadge.php/PICCL)](http://applejack.science.ru.nl/languagemachines/)\n[![Build Status](https://travis-ci.org/LanguageMachines/PICCL.svg?branch=master)](https://travis-ci.org/LanguageMachines/PICCL)\n\n[![GitHub release](https://img.shields.io/github/release/LanguageMachines/PICCL.svg)](https://GitHub.com/LanguageMachines/PICCL/releases/)\n[![Project Status: Unsupported – The project has reached a stable, usable state but the author(s) have ceased all work on it. 
A new maintainer may be desired.](https://www.repostatus.org/badges/latest/unsupported.svg)](https://www.repostatus.org/#unsupported)\n\n# PICCL: Philosophical Integrator of Computational and Corpus Libraries\n\nPICCL offers a workflow for corpus building and builds on a variety of tools.\nThe primary component of PICCL is TICCL, a Text-induced Corpus Clean-up system, which\nperforms spelling correction and OCR post-correction (normalisation of spelling\nvariants, etc.).\n\nPICCL and TICCL constitute original research by Martin Reynaert (Tilburg University \u0026 Radboud University Nijmegen), and\nare currently developed within the scope of the [CLARIAH](https://www.clariah.nl) project.\n\nThis repository hosts the workflows that constitute PICCL, powered by\n[Nextflow](https://www.nextflow.io). These are shipped as part of our\n[LaMachine](https://proycon.github.io/LaMachine) software distribution. This\ncombination makes the PICCL workflow portable and scalable: it\ncan be executed across multiple computing nodes on a high-performance cluster\nusing schedulers and platforms such as SGE, LSF, SLURM, PBS, HTCondor, Kubernetes and Amazon AWS.\nParallelisation is handled automatically. Consult the [Nextflow\ndocumentation](https://www.nextflow.io/docs/latest/index.html) for details.\n\nAll the modules that make up TICCL are part of the [TicclTools](https://github.com/LanguageMachines/ticcltools)\ncollection and are not part of the current repository. Certain other required components are in the\n[FoLiA-Utils](https://github.com/LanguageMachines/foliautils) collection. 
There is no need to install either of these or\nother dependencies manually.\n\nPICCL makes extensive use of the [FoLiA](https://proycon.github.io/folia) format, a rich XML-based format for linguistic\nannotation.\n\n**Important Note**: This is beta software still in development; for the old and deprecated version consult [this repository](https://github.com/martinreynaert/TICCL).\n\n## Installation\n\nPICCL is shipped as part of [LaMachine](https://proycon.github.io/LaMachine), although you need to explicitly select it for installation using ``lamachine-add piccl \u0026\u0026 lamachine-update`` (from inside a LaMachine installation). Once inside LaMachine, the command line interface can be invoked by directly specifying one of the workflows:\n\n    $ ocr.nf\n\nor\n\n    $ ticcl.nf\n\nIf you are using a LaMachine installation, you can skip the rest of this section. If not, you can install [Nextflow](https://www.nextflow.io) and [Docker](https://docker.io) manually and then run the\nfollowing to obtain the latest development release of PICCL:\n\n    $ nextflow pull LanguageMachines/PICCL\n\nIn this case, make sure to always run it with the ``-with-docker proycon/lamachine:piccl`` parameter; this lets\nNextflow manage your LaMachine Docker container (this is not tested as thoroughly as running from inside the container\ndirectly):\n\n    $ nextflow run LanguageMachines/PICCL -with-docker proycon/lamachine:piccl\n\nWe have prepared PICCL for many languages, mainly on the basis of open-source lexicons available through [Aspell](http://aspell.net). These data files serve as the input for TICCL and have to be downloaded once, as follows:\n\n    $ nextflow run LanguageMachines/PICCL/download-data.nf -with-docker proycon/lamachine:piccl\n\nThis will generate a ``data/`` directory in your current directory, which is referenced in the usage examples in the\nnext section. 
In a LaMachine environment, this directory is already available in ``$LM_PREFIX/opt/PICCL/data``.\n\nIn addition, you can also download example corpora (\u003e300MB), which will be placed in a ``corpora/`` directory:\n\n    $ nextflow run LanguageMachines/PICCL/download-examples.nf -with-docker proycon/lamachine:piccl\n\n## Architecture\n\nPICCL consists of two workflows, one for optical character recognition using [tesseract](https://github.com/tesseract-ocr/tesseract), and a TICCL workflow for\nOCR-post-correction and normalisation. Third, PICCL provides a webservice that ties together both these workflows and\nalso integrates two other workflows from [aNtiLoPe](https://github.com/proycon/antilope): a workflow for tokenisation (using [ucto](https://languagemachines.github.io/ucto)) and Dutch Linguistic Enrichment (using [frog](https://languagemachines.github.io/frog)).\n\nThe architecture of the PICCL webservice, and its two integral workflows, is visualised schematically as follows:\n\n![PICCL Architecture](https://raw.githubusercontent.com/LanguageMachines/PICCL/master/architecture.png)\n\n\n## Usage\n\n### Command line interface\n\nPICCL encompasses two workflows (and in webservice form it also integrates two more from\n[aNtiLoPe](https://github.com/proycon/antilope))\n\n * ``ocr.nf``   - A pipeline for Optical Character Recognition using [Tesseract](https://github.com/tesseract-ocr/tesseract); takes PDF documents or images of scanned pages and produces [FoLiA](https://proycon.github.io/folia) documents.\n * ``ticcl.nf`` - The Text-induced Corpus Clean-up system: performs OCR-postcorrection, takes as input the result from\n   ``ocr.nf``, or standalone text or PDF (text; no OCR), and produces further enriched [FoLiA](https://proycon.github.io/folia) documents.\n\nIf you are inside LaMachine, you can invoke these directly. 
If you let Nextflow manage LaMachine through Docker, then\nyou have to invoke them like ``nextflow run LanguageMachines/PICCL/ocr.nf -with-docker proycon/lamachine:piccl``. This applies to all examples in this section.\n\nRunning with the ``--help`` parameter or with no parameters at all will output usage\ninformation.\n\n    $ ocr.nf --help\n    --------------------------\n    OCR Pipeline\n    --------------------------\n    Usage:\n      ocr.nf [PARAMETERS]\n\n    Mandatory parameters:\n      --inputdir DIRECTORY     Input directory\n      --language LANGUAGE      Language (iso-639-3)\n\n    Optional parameters:\n    --inputtype STR          Specify input type, the following are supported:\n            pdf (extension *.pdf)  - Scanned PDF documents (image content) [default]\n            tif ($document-$sequencenumber.tif)  - Images per page (adhere to the naming convention!)\n            jpg ($document-$sequencenumber.jpg)  - Images per page\n            png ($document-$sequencenumber.png)  - Images per page\n            gif ($document-$sequencenumber.gif)  - Images per page\n            djvu (extension *.djvu)\n            (The hyphen delimiter may optionally be changed using --seqdelimiter)\n    --outputdir DIRECTORY    Output directory (FoLiA documents)\n    --virtualenv PATH        Path to Python Virtual Environment to load (usually path to LaMachine)\n    --pdfhandling reassemble Reassemble/merge all PDFs with the same base name and a number suffix; this can\n                             for instance reassemble a book that has its chapters in different PDFs.\n                             Input PDFs must adhere to a \\$document-\\$sequencenumber.pdf convention.\n                             (The hyphen delimiter may optionally be changed using --seqdelimiter)\n    --seqdelimiter           Sequence delimiter in input files (defaults to: _)\n    --seqstart               What input field is the sequence number (may be a negative number to count from the end), 
default: -2\n\n\n    $ ticcl.nf --help\n    --------------------------\n    TICCL Pipeline\n    --------------------------\n    Usage:\n      ticcl.nf [OPTIONS]\n\n    Mandatory parameters:\n      --inputdir DIRECTORY     Input directory (FoLiA documents with an OCR text layer)\n      --lexicon FILE           Path to lexicon file (*.dict)\n      --alphabet FILE          Path to alphabet file (*.chars)\n      --charconfus FILE        Path to character confusion list (*.confusion)\n\n    Optional parameters:\n      --outputdir DIRECTORY    Output directory (FoLiA documents)\n      --language LANGUAGE      Language\n      --extension STR          Extension of FoLiA documents in input directory (default: folia.xml)\n      --inputclass CLASS       FoLiA text class to use for input, defaults to 'current' for FoLiA input; must be set to 'OCR' for FoLiA documents produced by ocr.nf\n      --inputtype STR          Input type can be either 'folia' (default), 'text', or 'pdf' (i.e. pdf with text; no OCR)\n      --virtualenv PATH        Path to Virtual Environment to load (usually path to LaMachine)\n      --artifrq INT            Default value for missing frequencies in the validated lexicon (default: 10000000)\n      --distance INT           Levenshtein/edit distance (default: 2)\n      --clip INT               Limit the number of variants per word (default: 10)\n      --corpusfreqlist FILE    Corpus frequency list (skips the first step that would compute one for you)\n      --low INT                skip entries from the anagram file shorter than 'low' characters. (default=5)\n      --high INT               skip entries from the anagram file longer than 'high' characters. (default=35)\n      --chainclean BOOLINT     enable chain clean or not (1 = on, 0 = off, default)\n\nAn example of invoking an OCR workflow for English is provided below; it assumes the sample data are installed in the ``corpora/``\ndirectory. 
It OCRs the ``OllevierGeets.pdf`` file, which contains scanned image data, so we choose the\n``pdfimages`` input type.\n\n    $ ocr.nf --inputdir corpora/PDF/ENG/ --inputtype pdfimages --language eng\n\nAlternative input types are images per page, in which case ``inputtype`` is set to either ``tif``, ``jpg``, ``gif`` or ``png``. These input files should be placed in the designated input directory and follow the naming convention\n``$documentname-$sequencenumber.$extension``, for example ``harrypotter-032.png``. An example invocation on Dutch\nscanned pages in the example collection would be:\n\n    $ ocr.nf --inputdir corpora/TIFF/NLD/ --inputtype tif --language nld\n\nIn the case of the first example, the result will be a file ``OllevierGeets.folia.xml`` in the ``ocr_output/`` directory. This in turn can serve as\ninput for the TICCL workflow, which will attempt to correct OCR errors. Note that the ``--inputclass OCR``\nparameter is mandatory if you want to use the FoLiA output of ``ocr.nf`` as input for TICCL:\n\n    $ ticcl.nf --inputdir ocr_output/ --inputclass OCR --lexicon $LM_PREFIX/opt/PICCL/data/int/eng/eng.aspell.dict --alphabet $LM_PREFIX/opt/PICCL/data/int/eng/eng.aspell.dict.lc.chars --charconfus $LM_PREFIX/opt/PICCL/data/int/eng/eng.aspell.dict.c0.d2.confusion\n\nNote that here we pass a language-specific lexicon file, alphabet file, and character confusion file from the data files obtained by\n``download-data.nf``. The result will be a file ``OllevierGeets.folia.ticcl.xml`` in the ``ticcl_output/`` directory,\ncontaining enriched corrections. 
The second example, on the Dutch corpus data, can be run as follows:\n\n    $ ticcl.nf --inputdir ocr_output/ --inputclass OCR --lexicon $LM_PREFIX/opt/PICCL/data/int/nld/nld.aspell.dict --alphabet $LM_PREFIX/opt/PICCL/data/int/nld/nld.aspell.dict.lc.chars --charconfus $LM_PREFIX/opt/PICCL/data/int/nld/nld.aspell.dict.c20.d2.confusion\n\n\n## Webapplication / RESTful webservice\n\n### Installation\n\nPICCL is also available as a webapplication and RESTful webservice, powered by [CLAM](https://proycon.github.io/clam).\nIf you are in LaMachine with PICCL, the webservice is already installed, but you may need to run\n``lamachine-start-webserver`` if it is not already running.\n\nFor production environments, you will want to adapt the CLAM configuration. To this end,\ncopy ``$LM_PREFIX/etc/piccl.config.yml`` to ``$LM_PREFIX/etc/piccl.$HOST.yml``, where ``$HOST`` corresponds to your\nhostname, and edit the file with your host-specific settings. Always enable authentication if your server is world-accessible (consult the CLAM\ndocumentation for instructions).\n\n\n## Technical Details \u0026 Contributing\n\nPlease see CONTRIBUTE.md for technical details and information on how to contribute.\n","funding_links":[],"categories":["Optical Character Recognition Engines and Frameworks"],"sub_categories":["CTPN [paper:2016](https://arxiv.org/pdf/1609.03605.pdf)"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FLanguageMachines%2FPICCL","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FLanguageMachines%2FPICCL","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FLanguageMachines%2FPICCL/lists"}