{"id":22344699,"url":"https://github.com/languagemachines/luiginlp","last_synced_at":"2025-07-30T03:31:16.124Z","repository":{"id":146793997,"uuid":"58048606","full_name":"LanguageMachines/LuigiNLP","owner":"LanguageMachines","description":"A workflow system for Natural Language Processing.","archived":false,"fork":false,"pushed_at":"2019-10-17T14:26:11.000Z","size":498,"stargazers_count":21,"open_issues_count":1,"forks_count":4,"subscribers_count":7,"default_branch":"master","last_synced_at":"2024-07-08T15:41:03.622Z","etag":null,"topics":["natural-language-processing","nlp","workflow-management-system"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/LanguageMachines.png","metadata":{"files":{"readme":"README.rst","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2016-05-04T12:00:51.000Z","updated_at":"2022-03-08T13:47:14.000Z","dependencies_parsed_at":"2024-01-16T17:34:09.379Z","dependency_job_id":"4f45d727-290c-499a-9433-50edb4374c42","html_url":"https://github.com/LanguageMachines/LuigiNLP","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LanguageMachines%2FLuigiNLP","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LanguageMachines%2FLuigiNLP/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LanguageMachines%2FLuigiNLP/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LanguageMachines%2FLuigiNLP/manifests","owner_url":"https://repos.ecosyste.ms/a
pi/v1/hosts/GitHub/owners/LanguageMachines","download_url":"https://codeload.github.com/LanguageMachines/LuigiNLP/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":228078624,"owners_count":17865959,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["natural-language-processing","nlp","workflow-management-system"],"created_at":"2024-12-04T09:14:23.700Z","updated_at":"2024-12-04T09:14:24.328Z","avatar_url":"https://github.com/LanguageMachines.png","language":"Python","funding_links":[],"categories":["Uncategorized"],"sub_categories":["Uncategorized"],"readme":".. image:: http://applejack.science.ru.nl/lamabadge.php/LuigiNLP\n   :target: http://applejack.science.ru.nl/languagemachines/\n\n.. image:: https://travis-ci.org/LanguageMachines/LuigiNLP.svg?branch=master\n    :target: https://travis-ci.org/LanguageMachines/LuigiNLP\n\n.. 
image:: https://www.repostatus.org/badges/latest/abandoned.svg\n   :alt: Project Status: Abandoned – The project has been abandoned and the author(s) do not intend on continuing development.\n   :target: https://www.repostatus.org/#abandoned\n\n\n\n*************\nLuigiNLP\n*************\n\nAn NLP workflow system building upon\n**sciluigi** (https://github.com/pharmbio/sciluigi), which is in turn based on\n**luigi** (https://github.com/spotify/luigi).\n\nThis started out as a proof of concept intended to be used for the PICCL and\nQuoll NLP pipelines developed at Radboud University Nijmegen.\n\nThis is a solution for either a single computing node or a cluster of nodes\n(Hadoop, SLURM, not tested yet). The individual components are not webservices,\nnor is data passed around. This ensures minimal overhead and higher performance.\n\n=========\nGoals\n=========\n\n* Abstraction of workflow/pipeline logic: a generic, scalable and adaptable solution\n* Modularisation; clear separation of all components from the workflow system itself\n* Automatic dependency resolution (similar to GNU Make, top-down)\n* Robust failure recovery: when failures occur, fix the problem and run the workflow again; tasks that have completed will not be rerun.\n* Easy to extend with new modules (i.e. workflow components \u0026 tasks).\n* Traceability of all intermediate steps, retain intermediate results until explicitly discarded\n* Explicit workflow definitions\n* Automatic parallelisation of tasks where possible\n* Keep it simple, minimize overhead for the developer of the workflow, use Pythonic principles\n* Python-based, all workflow and component specifications are in Python rather than external.\n* Protection against shell injection attacks for tasks that invoke external tools\n* Runnable standalone from command-line\n\n==============\nArchitecture\n==============\n\nLuigiNLP follows a **goal-oriented** paradigm. 
The user invokes the workflow\nsystem by specifying a target **workflow component** along with an initial\ninput file. Given the target and an initial input file, a sequence of workflow\ncomponents will be automatically found that leads from initial input to the\ndesired goal, processing the data each step of the way. Workflow components are\ndefined in a *backwards* manner, as is also common in tools such as GNU Make.\nEach component expresses which other components it **accepts** as input, or\nwhich input files it accepts directly. This enables you to run the\ncomponent either directly on an input file, or have the input go through other\ncomponents first for necessary preprocessing. The dependency resolution\nmechanism will automatically choose a path based on the specified input and\nselected parameters.\n\nA workflow component consists of a specification that chains together\n**tasks**. Whereas a workflow component represents a more comprehensive piece\nof work that is defined in a context of other components, a **task** represents\nthe smallest unit of work and is defined **independently** of any other tasks\nor components, making it a highly reusable part. A task consists of one or more\ninput slots, corresponding to input files of a particular type, one or more\noutput slots corresponding to output files of a particular type, and\nparameters. A workflow component only glues together different tasks; the task\nperforms the actual job, either by invoking an external tool, or by running\nPython code directly. Chaining together tasks in the definition of the\nworkflow component is done by connecting output slots of one task to input\nslots of the other.\n\nThe architecture is visualised in the following scheme:\n\n.. image:: https://raw.githubusercontent.com/LanguageMachines/LuigiNLP/master/architecture.png\n    :alt: LuigiNLP Architecture\n    :align: center\n\nTasks and workflow components may take **parameters**. 
These are available\nwithin a task's ``run()`` method to either be propagated to an external tool\nor to be handled within Python directly. At the component level, parameters may also be used to influence\ntask composition, though often they are just passed on to the tasks.\n\nThe simplest instance of a workflow component is just one that accepts one\nparticular type of input file and sets up just a single task.\n\nBoth tasks and workflow components are defined in a **module** (in the Python\nsense of the word), which simply groups several tasks and workflow components together.\n\nLuigiNLP relies heavily on filename extensions. Input formats are matched on\nthe basis of an extension, and generally each task reads a file and outputs\na file with a new extension. Re-use of the same filename (i.e. writing output to the\ninput file) is **strictly forbidden**!\n\nIt is important to understand that the actual input files are only open for\ninspection when a Task is executed (its ``run()`` method is invoked).  During\nworkflow composition in a component (in its ``setup()/autosetup()`` method),  files cannot\nbe inspected as the composition by definition precedes the existence of any\nfiles, and the whole process has to proceed deterministically.\n\n=============\nLimitations\n=============\n\n* No circular dependencies allowed in workflow components\n* Intermediate files are not open for inspection in workflow specifications, only within ``Task.run()``\n\n====================\nDirectory Structure\n====================\n\n* ``luiginlp/luiginlp.py`` - Main tool\n* ``luiginlp/modules/`` - Modules, each addressing a specific tool/goal. 
A module consists of workflow components and tasks.\n* ``luiginlp/util.py`` - Auxiliary functions\n* ``setup.py`` - Installation script for LuigiNLP (only covers LuigiNLP and its direct python dependencies)\n\n==============\nInstallation\n==============\n\nInstall as follows::\n\n    $ python setup.py install\n\n(If this fails due to a ``python-daemon`` error, just run it again. There is a\nproblem in that package)\n\nMany of the implemented modules rely on software distributed as part of\nLaMachine (https://proycon.github.io/LaMachine), so LuigiNLP is best used from\nwithin a LaMachine installation. LuigiNLP itself is included in LaMachine as\nwell.\n\n===========\nUsage\n===========\n\nExample, specify a workflow corresponding to your intended goal and an input file. Workflows may take extra parameters (``--skip`` for Frog in this case)::\n\n    $ luiginlp Frog --module luiginlp.modules.frog --inputfile test.rst --skip p\n\nA workflow can be run parallelised for multiple input files as well, the number\nof workers should be explicitly set::\n\n    $ luiginlp Parallel --module luiginlp.modules.frog --component Frog --inputfiles test.rst,test2.rst --workers 2 --skip p\n\nYou can always pass workflow-component-specific parameters by using the workflow component name as a prefix. For\ninstance, the Frog component takes an option ``skip``, you can use ``--Frog-skip`` to explicitly set it.\n\nYou can also invoke LuigiNLP from within Python of course:\n\n.. code-block:: python\n\n    import luiginlp\n    from luiginlp.modules.frog import Frog\n    luiginlp.run(Frog(inputfile=\"test.rst\",skip='p'))\n\nTo parallelize multiple tasks you can just do:\n\n.. code-block:: python\n\n    import luiginlp\n    from luiginlp.modules.frog import Frog\n    luiginlp.run(\n        Frog(inputfile=\"test.rst\",skip='p'),\n        Frog(inputfile=\"test2.rst\",skip='p'))\n\nOr use the ``Parallel`` interface:\n\n.. 
code-block:: python\n\n    import luiginlp\n    from luiginlp.modules.frog import Frog\n    from luiginlp.engine import Parallel, PassParameters\n    luiginlp.run(\n        Parallel(component=\"Frog\",inputfiles=\"test.rst,test2.rst\",\n            passparameters=PassParameters(skip='p')\n        )\n    )\n\n\nHere's an example of running an OCR workflow for a scanned PDF file (requires the tools ``pdfimages``,\n``Tesseract``, ``FoLiA-hocr`` and ``foliacat``, the latter two are a part of LaMachine)::\n\n    $ luiginlp --module luiginlp.modules.ocr OCR_folia --inputfile OllevierGeets.pdf --language eng\n\nLuigiNLP automatically finds a sequence of components leading from your input\nfile (provided its name matches whatever convention you use) to the target\ncomponent. You may, however, force an inputfile by setting the ``--inputslot``\nparameter to some input format ID. This can be useful if you want to feed an\ninput file that does not comply with your naming convention.\nYou may also specify a ``--startcomponent`` to explicitly state which component\nshould be the first one; this may be useful in cases of ambiguity where\nmultiple paths are possible (the first possibility would otherwise be chosen).\n\nWriting tasks and components for LuigiNLP\n=============================================\n\nIn order to plug in your own tools into LuigiNLP, you will need to do\nseveral things:\n\n* Create a new module that groups your code (inside LuigiNLP these reside in ``luiginlp/modules/*.py``, but you may just as well have a module in an external Python project)\n* Write one or more tasks, tasks are classes derived from ``luiginlp.engine.Task``\n* Write one or more workflow components that chain tasks together, workflow components are classes derived from ``luiginlp.engine.WorkflowComponent``, you usually want to derive from ``luiginlp.engine.StandardWorkflowComponent`` which is a standard component that takes one inputfile as parameter.\n\nAlways keep in mind the 
following guidelines when writing tasks and components for\nLuigiNLP:\n\n* Tasks should cover the smallest unit of work, do not do too much in one task, but chain tasks instead.\n* Be very specific in your file extensions. If two tasks output a file with the\n  same extension, they are considered identical for all intents and purposes!  Multiple stacking extensions are fine and\n  recommended (``*.x.y.z``). Generally, each task strips input extensions (optional) and adds a new extension.\n* Input and output filenames may never be the same! It is forbidden to change a file in-place.\n* Consider whether you want to chain multiple workflow components and to use the automatic\n  resolution mechanism, or whether you have larger components that chain\n  multiple tasks. Components are needed whenever you want to have multiple entry points.\n\nLet's begin by writing a simple task that invokes the tokeniser\n*ucto* (https://languagemachines.github.io/ucto) to convert plain text to\ntokenised plain text. We prescribe that the plain text document has the\nextension ``txt`` and tokenised text has the extension ``tok``. The tokeniser\ntakes one mandatory parameter: the language the text is in.\n\n.. 
code-block:: python\n\n    from luiginlp.engine import Task, InputSlot, Parameter\n\n    class Ucto_txt2tok(Task):\n        #This task invokes an external tool (ucto), set the executable to invoke\n        executable = 'ucto'\n\n        #Parameters for this task\n        language = Parameter()\n\n        #this is the input slot for plaintext files, input slots are connected\n        #to output slots of other tasks by a workflow component\n        in_txt = InputSlot()\n\n        #Define an output slot, output slots are methods that start with out_\n        def out_tok(self):\n            #Output slots should call outputfrominput() to automatically derive the output file\n            #from the input file, typically by stripping the specified\n            #extension from the input and adding a new *and distinct* output extension. The inputformat\n            #parameter must correspond to an input slot (in_txt in this case).\n            #If an outputdir parameter is defined in the task, it is automatically\n            #supported.\n            return self.outputfrominput(inputformat='txt',stripextension='.txt',addextension='.tok')\n\n        #Define the run method, this will be called to do the actual work\n        def run(self):\n            #Here we run the external tool. This will invoke the executable\n            #specified. Keyword arguments are passed as option flags (-L in\n            #this case). Positional arguments are passed as such (after option flags).\n            #All parameters are available on the Task instance\n            #Values will be passed in a shell-safe manner, protecting against injection attacks\n            self.ex(self.in_txt().path, self.out_tok().path,\n                    L=self.language,\n            )\n\nWe can now turn this task into a simple component that we can invoke:\n\n.. 
code-block:: python\n\n    from luiginlp.engine import StandardWorkflowComponent, InputFormat, Parameter\n\n    class Ucto(StandardWorkflowComponent):\n        #parameters for the task, most are just passed on to the task(s)\n        language = Parameter()\n\n        #The accepts method returns what input formats or other input components are accepted as input. It may return multiple values (in a tuple/list), the one that matches with the specified input is chosen\n        def accepts(self):\n            return InputFormat(self, format_id='txt',extension='txt')\n\n        #Autosetup constructs a workflow for you automatically based on the tasks you specify. If you specify a tuple of multiple tasks, the one fitting the input will be executed.\n        def autosetup(self):\n            return Ucto_txt2tok\n\nAssuming you wrote all this in a ``mymodule.py`` file, you now can invoke this\nworkflow component on a text document as follows::\n\n    $ luiginlp Ucto --module mymodule --inputfile test.txt --language en\n\nUcto does not just support plain text input, it can also handle input in the\n*FoLiA* format (https://proycon.github.io/folia), an XML-based format for linguistic\nannotation. We could write a task ``Ucto_folia2tok`` that runs ucto in this\nmanner. Suppose we did that, we could extend our workflow component as\nfollows:\n\n.. code-block:: python\n\n    def accepts(self):\n        return InputFormat(self, format_id='txt',extension='txt'), InputFormat(self, format_id='folia', extension='folia.xml')\n\n    def autosetup(self):\n        return Ucto_txt2tok, Ucto_folia2tok\n\nNow the workflow component will be able to automatically figure out which of the tasks to run based on the supplied input, allowing us to do::\n\n    $ luiginlp Ucto --module mymodule --inputfile test.folia.xml --language en\n\nWhat about any other file format? Ucto itself can only handle plain text or\nFoLiA. 
What if our input text is in PDF format, MarkDown format, or God forbid,\nin MS Word format? We could solve this problem by writing a\n``ConvertToPlaintext`` component that handles a multitude of formats and simply\ninstructs ucto to accept the plaintext output from that component. We need some extra imports\nand would then modify the ``accepts()`` to tie in the component:\n\n.. code-block:: python\n\n    from luiginlp.engine import InputComponent\n    from some.other.module import ConvertToPlaintext\n\n.. code-block:: python\n\n    def accepts(self):\n        return (\n            InputFormat(self, format_id='txt',extension='txt'),\n            InputFormat(self, format_id='folia', extension='folia.xml'),\n            InputComponent(self, ConvertToPlaintext) #you can pass parameters using keyword arguments here\n        )\n\nOur ucto component thus-far has been fairly simple, we first used ``autosetup()`` to\nwrap a single task, and later to choose amongst two tasks. Let's look at a more\nexplicit example with actual task chaining.\n\nSuppose we want the Ucto component to lowercase our text before passing it on\nto the actual task that invokes ucto. We can write a simple lowercase task as\nfollows, for this one we just use Python and call no external tools (i.e. we\nset no ``executable`` and do not call ``ex()``):\n\n.. 
code-block:: python\n\n    from luiginlp.engine import Task, InputSlot, Parameter\n\n    class LowercaseText(Task):\n        #Parameters for this task\n        language = Parameter()\n        encoding = Parameter(default='utf-8')\n\n        in_txt = InputSlot()\n\n        #Define an output slot, output slots are methods that start with out_\n        def out_txt(self):\n            #We add a lowercased prefix to the extension\n            #The output file may NEVER be equal to the input file\n            return self.outputfrominput(inputformat='txt',stripextension='.txt',addextension='.lowercased.txt')\n\n        #Define the run method, this will be called to do the actual work\n        def run(self):\n            #We do the work in Python itself\n            #Input and output slots can be opened as file objects\n            with self.in_txt().open('r',encoding=self.encoding) as inputfile:\n                with self.out_txt().open('w',encoding=self.encoding) as outputfile:\n                    outputfile.write(inputfile.read().lower())\n\nNow we go back to our Ucto component, we forget about the FoLiA part for a\nbit, and we set up explicit chaining using ``setup()`` instead of\n``autosetup()``, which is a bit more work but gives us complete control over\neverything.\n\n\n.. code-block:: python\n\n    from luiginlp.engine import StandardWorkflowComponent, InputFormat, InputComponent, Parameter\n\n    class Ucto(StandardWorkflowComponent):\n        #parameters for the task, most are just passed on to the task(s)\n        language = Parameter()\n\n        #The accepts method returns what input formats or other input components are accepted as input. 
It may return multiple values (in a tuple/list), the one that matches with the specified input is chosen\n        def accepts(self):\n            return (\n                InputFormat(self, format_id='txt',extension='txt'),\n                InputComponent(self, ConvertToPlaintext) #you can pass parameters using keyword arguments here\n            )\n\n        #Setup a workflow chain manually\n        def setup(self, workflow, input_feeds):\n            #input_feeds will be a dictionary of format_id =\u003e output_slot\n\n            #set up the lower caser and feed the input to it\n            lowercaser = workflow.new_task('lowercaser',LowercaseText)\n            lowercaser.in_txt = input_feeds['txt']\n\n            #set up ucto and feed the output of the lower caser to it.\n            #we explicitly pass any parameters we want to propagate\n            #if you instead want to implicitly pass all parameters with matching names\n            #between component and task, just set keyword argument autopass=True\n            ucto = workflow.new_task('ucto', Ucto_txt2tok, language=self.language)\n            ucto.in_txt = lowercaser.out_txt\n\n            #always return the last task(s)\n            return ucto\n\n-----------------------------------\nExecuting external commands\n-----------------------------------\n\nWe have seen that the ``ex()`` method on a task can be invoked from its\n``run()`` method to call external tools. The executable to execute is defined\nin the task's ``executable`` property.\n\nThe ``ex()`` method allows you to define your calls to external tools in a\nPythonic way, and ensures that all parameter values are properly escaped to prevent any\nshell injection attacks. It offers cleaner and more secure code.\n\nWhen you call ``ex()``, all keyword arguments will be passed as parameters. 
The\nkeyword argument ``x`` (one letter) to ``ex()`` , will result in the flag ``-x``,\nwhereas keyword argument ``foo`` (multiple letters), will result in the flag\n``--foo``. If you want to force single hyphens for multiletter options, set ``__singlehyphen=True``.\nKeyword arguments starting with a double underscore are special directives to\n``ex()``. A double underscore inside a parameter will be translated to a\nhyphen, as Python does not allow variables with hyphens. So keyword argument\n``foo__bar`` will result in the option ``--foo-bar``.\n\nKeyword arguments with a boolean value are passed as flags without\nvalue, i.e. passing ``foo=True`` results just in ``--foo``, whereas ``foo=5``\nyields ``--foo 5``. If you want to force the use of an assignment operator, as\nin ``--foo=5``, pass  ``__assignop=True``.\n\nShell redirects (``\u003c``,``\u003e``,``2\u003e``) are supported through the special keyword\narguments ``__stin_from``, ``__stdout_to`` and ``__stderr_to``, each expecting\na path to a file. Further piping is not supported through the ``ex()`` command.\n\nKeyword arguments starting with a single underscore will have that underscore\nremoved, this is useful in cases where parameters clash with reserved keywords\nin Python, such as ``from`` or ``import``.\n\nProcesses are expected to return proper exit codes (0 for success, non-zero for\nfailure), LuigiNLP will interpret it as such and consider the task failed if a\nnon-zero exit code is obtained. If you want to ignore failures,\nset ``__ignorefailure=True``.\n\n------------------------------------\nDynamic dependencies aka Inception\n------------------------------------\n\nWorkflows are static in the sense that based on the format of the input file\nand all given parameters, all workflow components and tasks are assembled\ndeterministically. 
This means that, within a component's ``setup()`` method, it\nis not possible to inspect input/intermediate files nor adjust the flow based\non file contents.\n\nAt times, however, more dynamic workflows are needed. In such cases, the common\ntheme is that input data has to be inspected and decisions made accordingly.\nThe **only** stage at which input files can be inspected is in a task's\n``run()`` method. Fortunately, there are facilities here to implement more\ndynamic dependencies: a task's ``run()`` method is allowed to **yield** (in the\nPython sense of the word) a list of other tasks that it depends on.\n\nA good example would be if we create a new tokenisation component that does not\njust take an input file, but takes a **directory** containing input files and\nproduces a directory of output files. The proper way to implement this is to\nreuse the component that performs on the individual files (i.e. our ``Ucto``\ncomponent).  Consider the following task and component:\n\n.. code-block:: python\n\n    import glob\n    from luiginlp.engine import Task, StandardWorkflowComponent, InputFormat, InputSlot, Parameter\n\n    class Ucto_txtdir2tokdir(Task):\n        language = Parameter()\n\n        in_txtdir = InputSlot()\n\n        def out_tokdir(self):\n            return self.outputfrominput(inputformat='txtdir',stripextension='.txtdir',addextension='.tokdir')\n\n        def run(self):\n            #setup the output directory\n            # this creates the directory and also moves it out of the way again when failures occur in this task\n            self.setup_output_dir(self.out_tokdir().path)\n\n            #gather input files\n            inputfiles = [ filename for filename in glob.glob(self.in_txtdir().path + '/*.txt') ]\n\n            #inception aka dynamic dependencies: we yield a list of components which could not have been predicted statically\n            #in this case we run the Ucto component for each input file in the directory\n            yield [ 
Ucto(inputfile=inputfile,outputdir=self.out_tokdir().path,language=self.language) for inputfile in inputfiles ]\n\n\n    class Ucto_collection(StandardWorkflowComponent):\n        def accepts(self):\n            return (\n                InputFormat(self, format_id='txtdir',extension='txtdir', directory=True),\n            )\n\n        def autosetup(self):\n            return Ucto_txtdir2tokdir\n\n\nThe magic happens in the task's ``run()`` method, as that is the only stage\nwhere we can examine the contents of any input files, in this case: the contents\nof the input directory. First we set up the output directory with a call to\n``self.setup_output_dir()``. This creates the directory if it doesn't exist\nyet, but also makes sure the directory is stashed away when the task fails,\nensuring you can always rerun the pipeline if it happens to break off (in\ntechnical terms, this preserves idempotency).\n\nNext, we construct a list of all the txt files in the directory. We use this\nlist to yield a **list** of components to run, one component for each input file.\nNow, when the task's ``run()`` method is called, a series of components will be\nscheduled and run **in parallel** (up to the number of workers).\n\nYou may be tempted to yield the components individually one by one, but that\nwon't result in parallelisation; you must really yield an entire list (or tuple).\n\nNote that we added an ``outputdir`` parameter to the Ucto component which we\nhadn't implemented yet. This is necessary to ensure all individual output files\nend up in the directory that groups our output. The Ucto component should\nsimply pass this parameter on to the ``Ucto_txt2tok`` task. 
The outputdir\nparameter is implicitly present on all tasks as well as on\n``StandardWorkflowComponent``, the ``outputfrominput()`` method automatically\nsupports this parameter.\n\nAssuming you have a collection of text files in a directory ``corpus.txtdir/``,\nyou can now invoke LuigiNLP as follows and end up with a ``corpus.tokdir/``\ndirectory with tokenised output texts::\n\n    $ luiginlp Ucto_collection --module mymodule --inputfile corpus.txtdir --language en --workers 4\n\nNote the ``--workers`` parameter, which is the generic way to tell LuigiNLP how\nmany workers may run in parallel. You will want to explicitly set this to a\nvalue that approximates the number of free CPU cores as the default value is\none (no parallelisation).\n\n-----------------------------\nInheriting parameters\n-----------------------------\n\nComponents often inherit parameters from the tasks they wrap. When you use\n``autosetup()``, parameters with matching names are automatically passed\nfrom component to task. Similarly, if you use ``workflow.new_task()`` in your\nsetup method, you can set the keyword argument ``autopass=True`` to also\naccomplish this.\n\nStill, you actually need to declare the parameters on the component.\nThis can be done in the usual way, but if a task already defines them, you may want to inherit the parameters automatically and prevent any code duplication. This is done as follows:\n\n.. code-block:: python\n\n    class MyComponent(WorkflowComponent):\n        ...\n    MyComponent.inherit_parameters(MyTask1,MyTask2,MyTask3)\n\nNote that the ``inherit_parameters()`` call is not part of the class definition (not in class scope) but placed after it.\n\n-----------------------------\nMultiple inputs\n-----------------------------\n\nTasks may define multiple input slots (and multiple output slots). Components\nmay accept multiple inputs simultaneously as well. Consider for example a\nclassifier that takes a training file and a test file. 
Components cannot use\n``autosetup()`` in this case, but you need to explicitly define a ``setup()``\nmethod.\n\nTo define multiple concurrent inputs, group them together in a tuple and return\nthis as part of a list or tuple from ``accepts()``. The following example\ncomponent is for a classifier, it takes two inputs (``trainfile`` and\n``testfile``) rather than the standard ``inputfile`` pre-defined in\n``StandardWorkflowComponent`` (this class is therefore subclassed from\n``WorkflowComponent`` instead, which does not predefine ``inputfile``).\n\nNote furthermore that the ``InputFormat`` tuple contains the ``inputparameter``\nkeyword argument that binds the proper inputformat to the proper parameter (the\ndefault was ``inputparameter=\"inputfile\"`` so we never needed it before).\n\n\n.. code-block:: python\n\n    @registercomponent\n    class TimblClassifier(WorkflowComponent):\n        \"\"\"A Timbl classifier that takes training data, test data, and outputs the test data with classification\"\"\"\n\n        trainfile = Parameter()\n        testfile = Parameter()\n\n        def accepts(self):\n            #Note: tuple in a list, the outer list corresponds to options (just one here), while the inner tuples are conjunctions\n            return [ ( InputFormat(self, format_id='train', extension='train',inputparameter='trainfile'), InputFormat(self, format_id='test', extension='test',inputparameter='testfile')) ]\n\n        def setup(self, workflow, input_feeds):\n            timbl_train = workflow.new_task('timbl_train',Timbl_train, autopass=True)\n            timbl_train.in_train = input_feeds['train']\n\n            timbl_test = workflow.new_task('timbl_test',Timbl_test, autopass=True)\n            timbl_test.in_test = input_feeds['test']\n            timbl_test.in_ibase = timbl_train.out_ibase\n            timbl_test.in_wgt = timbl_train.out_wgt\n\n            return timbl_test\n\nWe have not defined the tasks here, but you can infer that the 
``Timbl_train``\ntask has at least two output slots, and ``Timbl_test`` has at least three input slots.\n\n\n==================\nTroubleshooting\n==================\n\n* *Everything is run sequentially, nothing is parallelised?* -- Did you explicitly supply a ``workers`` parameter with the desired maximum number of threads? Otherwise just one worker will be used and everything is sequential. If you did supply multiple workers, it may simply be the case that there is nothing to run in parallel in your invoked workflow.\n* *I get no errors but nothing seems to run when I rerun my workflow?* -- If all output files already exist, then the workflow has nothing to do. You will need to explicitly delete your output if you want to rerun things that have already been produced successfully.\n* *error: unrecognized argument* -- You are passing an argument that\n  is not known to the target component. Perhaps you forgot to inherit certain\n  parameters from tasks to components using ``inherit_parameters()``?\n* *RuntimeError: Unfulfilled dependency at run time* -- This error says that the\n  specified task or component has not delivered the output files that were\n  promised by the output slots. You should ensure all of the promised files are\n  delivered and there are no typos in the filenames/extensions.\n* *InvalidInput: Unable to find an entry point for supplied input* -- The\n  filename you specified cannot be matched with one of the input formats. Are\n  you supplying the right file, and does your target component have a possible\n  path to that input (through ``accepts()``)? Make sure the file has the expected extension\n  so it is automatically detected. 
You may also explicitly supply an\n  ``inputslot`` parameter with the ID of the format, possibly in combination\n  with a ``startcomponent`` parameter with the name of the component you want\n  to start with.\n* *ValueError: Workflow setup() did not return a valid last task (or sequence of\n  tasks)* or *TypeError: setup() did not return a Task or a sequence of Tasks* -- At the end of your component's ``setup()``\n  method you must return the last task instance, or a list of the last task\n  instances. Is a return statement missing?\n* *Exception: Input item is neither callable, TargetInfo, nor list: None* --\n  All ``out_*()`` methods must return a ``TargetInfo`` instance, which is\n  usually achieved by returning whatever ``outputfrominput()`` returns. Is a\n  return statement missing in an output slot?\n* *ValueError: Inputslot .... of ..... is not connected to any output slot!* -- You forgot to connect the specified input slot of the specified\n  task to an output slot (in a component's ``setup()`` method). All input slots must be satisfied.\n* *ValueError: Specified inputslot for ... does not exist in ....* -- Your call\n  to ``outputfrominput()`` has an ``inputformat`` argument that does not\n  correspond to any of the input slots. If you have an input slot ``in_x``, the\n  inputformat should be ``x``.\n* *Exception: No executable defined for .....* -- You are invoking the ``ex()``\n  method to execute through the shell but the Task's class does not specify an\n  executable to run. 
Set ``executable = \"yourexecutable\"`` in the class.\n* *TypeError: Invalid element in accepts(), must be InputFormat or InputComponent* -- Your component's ``accepts()`` method returns something it shouldn't. You may return a list/tuple of ``InputFormat`` or ``InputComponent`` instances; you may also include tuples grouping multiple ``InputFormat`` or ``InputComponent`` instances in case the component takes multiple input files.\n* *AutoSetupError: AutoSetup expected a Task class* -- Your component's ``autosetup()`` method must return either a single Task class (not an instance) or a list/tuple of Task classes.\n* *AutoSetupError: No outputslot found on ....* -- The task you are returning in a component's ``autosetup()`` method has no output slots (one or more ``out_*()`` methods).\n* *AutoSetupError: AutoSetup only works for single input/output tasks now* -- You cannot use ``autosetup()`` for components that take multiple input files; use ``setup()`` instead.\n* *AutoSetupError: No matching input slots found for the specified task* -- Autosetup was not able to automatically connect any of the supplied input formats or input components (those in ``accepts()``) to one of the tasks defined in ``autosetup()``; there is probably a mismatch between format names (outputslot using a different format id than the inputslot). 
Use the ``setup()`` method instead of ``autosetup()`` and connect everything explicitly there.\n* *NotImplementedError: Override the setup() or autosetup() method for your workflow component* -- Each component must define a ``setup()`` or ``autosetup()`` method; it is missing here.\n\n\n============\nPlans/TODO\n============\n\n* Expand autosetup to build longer sequential chains of tasks (a2b b2c c2d)\n* [tentative] Integration with `CLAM \u003chttps://github.com/proycon/clam\u003e`_ to automatically\n  create webservices of workflow components\n* Further testing...\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flanguagemachines%2Fluiginlp","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flanguagemachines%2Fluiginlp","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flanguagemachines%2Fluiginlp/lists"}