{"id":26864408,"url":"https://github.com/ad4gd/harmonisationpipelines","last_synced_at":"2025-03-31T03:39:10.089Z","repository":{"id":277163338,"uuid":"931534870","full_name":"AD4GD/HarmonisationPipelines","owner":"AD4GD","description":null,"archived":false,"fork":false,"pushed_at":"2025-02-12T12:55:38.000Z","size":140,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-02-12T13:58:48.869Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/AD4GD.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-02-12T12:49:56.000Z","updated_at":"2025-02-12T12:55:42.000Z","dependencies_parsed_at":"2025-02-12T13:58:50.221Z","dependency_job_id":"807f0ca9-43f9-4a4b-9bdf-820573bfba1f","html_url":"https://github.com/AD4GD/HarmonisationPipelines","commit_stats":null,"previous_names":["ad4gd/harmonisationpipelines"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AD4GD%2FHarmonisationPipelines","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AD4GD%2FHarmonisationPipelines/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AD4GD%2FHarmonisationPipelines/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AD4GD%2FHarmonisationPipelines/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/AD4GD","download_
url":"https://codeload.github.com/AD4GD/HarmonisationPipelines/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246413263,"owners_count":20773053,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-03-31T03:39:09.522Z","updated_at":"2025-03-31T03:39:10.079Z","avatar_url":"https://github.com/AD4GD.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Table of contents\n\n- [1. Introduction ](#1-introduction-)\n- [2. Installation ](#2-installation-)\n  - [2.1 Automatically, from the DockerHub Image (recommended) ](#21-automatically-from-the-dockerhub-image-recommended-)\n  - [2.2 Manually, by building a container ](#22-manually-by-building-a-container-)\n- [3. Configuration ](#3-configuration-)\n  - [3.1 Virtuoso instance ](#31-virtuoso-instance-)\n  - [3.2 LPIS country configuration files ](#32-lpis-country-configuration-files-)\n  - [3.3 General mapping generator ](#33-general-mapping-generator-)\n  - [3.4 SPARQL mapping ](#34-sparql-mapping-)\n  - [3.5 Relational Database as input source ](#35-relational-database-as-input-source-)\n  - [3.6 Link discovery process ](#36-link-discovery-process-)\n- [4. Usage ](#4-usage-)\n  - [4.1 FADN Pipeline ](#41-fadn-pipeline-)\n  - [4.2 LPIS Pipeline ](#42-lpis-pipeline-)\n  - [4.3 GENERIC Pipeline ](#43-generic-pipeline-)\n- [5. Version ](#5-version-)\n- [6. Team ](#6-team-)\n- [7. License ](#7-license-)\n\n\n# 1. 
Introduction \u003ca name=\"introduction\"\u003e\u003c/a\u003e\n\nThe Linked Data Pipelines is an ETL tool written in Python. It takes care of fetching, extracting, preprocessing, transforming, post-processing, and loading linked data into the triplestore. User interaction takes place through a CLI (Command Line Interface) and configuration files. Users can choose from a set of specific pipelines, as well as the generic pipeline. The generic pipeline enables individual, flexible operations on different types of data. \nThe tool re-uses other existing tools, such as [Geotriples](https://github.com/LinkedEOData/GeoTriples), [RMLmapper](https://github.com/RMLio/rmlmapper-java), and others, for particular tasks, providing a unified interface over them and connecting them transparently in sequences to implement full pipelines.\n\n\n# 2. Installation \u003ca name=\"installation\"\u003e\u003c/a\u003e\n\nThe project is distributed with a Dockerfile, which helps immensely with setting up the whole environment in a stable and reproducible way. It is recommended to use this tool inside the Docker container, although if one takes care of all dependencies it is also possible to set it up locally.\n\n## 2.1 Automatically, from the DockerHub Image (recommended) \u003ca name=\"installation-auto\"\u003e\u003c/a\u003e\n\n1. Make sure Docker is installed on your machine.\n2. Pull the image from the Docker repository by typing: ```docker pull montanaz0r/demeter-pipelines:latest```\n3. Next, run the container by typing the following in the console/terminal: ```docker run -ti montanaz0r/demeter-pipelines:latest```\n4. Inside the container, type ```source pipelines/bin/activate``` to activate the virtual environment created for the Python dependencies.\n5. Move to the src directory by typing ```cd src```.\n6. You are now ready to use the tool.\n7. (Optional) In step 3, you can use docker run with a bind mount to forward the output directly into a local directory, \n  e.g. 
```docker run -v {local host directory}:{container output directory} -ti montanaz0r/demeter-pipelines:latest```\n\n## 2.2 Manually, by building a container \u003ca name=\"installation-manual\"\u003e\u003c/a\u003e\n\n1. Download the project's Dockerfile.\n2. Make sure Docker is installed on your machine.\n3. Go to the directory where the Dockerfile is stored.\n4. In the console/terminal, type ```docker build -f Dockerfile -t {image name} . ```\n    e.g. ```docker build -f Dockerfile -t pipelines .```.\n5. Next, run the container by typing the following in the console/terminal: ```docker run -ti {image name}```\n   e.g. ```docker run -ti pipelines```\n6. Inside the container, type ```source pipelines/bin/activate``` to activate the virtual environment created for the Python dependencies.\n7. Move to the src directory by typing ```cd src```.\n8. You are now ready to use the tool.\n9. (Optional) In step 5, you can use docker run with a bind mount to forward the output directly into a local directory, \n  e.g. ```docker run -v {local host directory}:{container output directory} -ti {image name}```\n\n# 3. Configuration \u003ca name=\"configuration\"\u003e\u003c/a\u003e\n\n## 3.1 Virtuoso instance \u003ca name=\"virtuoso-instance\"\u003e\u003c/a\u003e\n\nThe tool allows users to load RDF dumps into a preconfigured Virtuoso triple store.\nThe tool assumes that the Virtuoso server can be reached via SSH from the machine where the tool is running (i.e., port 22 on the Virtuoso server is open to the CLI machine). Additionally, the tool assumes that access to the Virtuoso server is via SSH keys. Therefore, the Virtuoso server should have the public key (in its authorized keys) of an account that will be used by the CLI tool. \n\nAccordingly, to use this functionality, users need to configure the following two files:\n* ./.env\n* ./cfg/config.yaml\n\n**.env configuration file**\n\nIn this file, the user should provide both the private and public SSH keys used to log in to the Virtuoso server. 
SSH credentials should be provided encoded as base64, and the values will be automatically decoded by the tool. The key values should be provided as follows:  \n``export VSO_PRIVATE_KEY=\"\u003cvirtuoso_server_account_private_key\u003e\"  ``\n``export VSO_PUBLIC_KEY=\"\u003cvirtuoso_server_account_public_key\u003e\"  ``\n\n**./cfg/config.yaml configuration file**\n\nThis is the master config file, which should be populated. The template can be found under ./cfg/config_template.yaml. It includes a section called vto_cfg with the Virtuoso server settings, which include:\n\n  •   SSH_PASSPHRASE  \n  •   VIRTUOSO_PORT (*Virtuoso DB port, by default 1111*)  \n  •   VIRTUOSO_USER (*Virtuoso DB user account with write access permissions*)  \n  •   VIRTUOSO_PASSWORD (*Virtuoso DB user's account password*)  \n  •   DUMP_EXTENSION (*Extension of dump files to be loaded in Virtuoso, e.g. .nt*)  \n  •   DATA_FOLDER (*The main directory in the Virtuoso server where all data is stored*)  \n  •   DATA_PREFIX (*A prefix that will be temporarily added to the data in the DATA_FOLDER. This is inconsequential if the data was properly loaded, since the original data is removed from the server after it has been loaded into a graph. However, it might be useful to always tag your data in case something goes wrong and manual intervention is needed.*)  \n  •   SERVER_USER (*Username that connects to the machine where Virtuoso DB is located.*)  \n  •   SERVER_HOST (*Hostname for the machine where Virtuoso DB is located.*)  \n\n## 3.2 LPIS country configuration files \u003ca name=\"lpis-country-configuration-files\"\u003e\u003c/a\u003e\n\nFor the LPIS Pipelines, the mappings are generated based on the configuration files that can differ between countries. \nThe list of supported countries can be found in ./cfg/config.yaml in the \"lpis_countries\" section. 
\nThe section specifies the link between the country name and the specific configuration file that will be used in the mapping generation.\n\nCountry-specific pre-defined mapping configuration files are stored under **./cfg/LPIS/**. Currently there are pre-defined mappings for **Spain**, **Poland** and **Lithuania**. Additionally, the folder includes the **other** configuration file, which can be used to configure mapping for any other country. Users can change the content of the respective configuration file, including the **other** configuration file, by changing the values in key-value pairs or removing/adding items.\n\n**Note: There are keys that are required to generate the mapping and those should always be present in the configuration file with a respective value! Below is the list of supported keys that can be included in the country-specific LPIS configuration file:**\n\n**Required:**\n  - BASE_URI\n  - LABEL\n  - IACS_ID\n  - TEMPLATE_ID\n\n**Optional:**\n  - VALID_FROM\n  - SHORT_ID\n  - SPECIFIC_LAND_USE\n  - PARENT_ADM2\n  - PARENT_ADM3\n  - MUNICIPALITY_ID\n  - AREA\n  - PERIMETER\n  - LAYER_ABBREVIATION\n\n## 3.3 General mapping generator \u003ca name=\"general-mapping-generator\"\u003e\u003c/a\u003e\n\nA general mapping generator can be used as a part of the **GENERIC Pipeline** (*--from_config flag*) to create mappings from scratch based on a simple YAML configuration file provided by the user. \n\nUsers can use this feature by modifying **[this config file](cfg/GENERIC/generic_cfg.yaml)**. There are examples provided [here](cfg/GENERIC/) to allow users to get a grasp of how this YAML should be specified. With that being said, there are a number of general rules for constructing the YAML configuration file:\n\n1. The specification should be placed under the cfg section key.\n2. The TEMPLATE_ID is the URL path that is added to the base URL provided via the *base_uri* parameter. 
The automatically generated URL will be created by concatenating base_uri, TEMPLATE_ID, and entity_name. This path can include multiple / characters and can reference fields in the data source by enclosing them in curly brackets and backticks, e.g., ```\"RPL/2022/Parcel/{`BLOKAS_ID`}\"```\n3. The CONTEXT key is required. CONTEXT provides a list of terms that can be used for defining entities in the config. CONTEXT can be a single reference to the context file (JSONLD), or a list with multiple context files.\n4. The CONTEXT value can also consist of dictionary-like key/value pairs for additional references that are not included in the context file/files.\n5. The MAIN_TYPE key is required and provides information about entities that are related to the core section of the mapping.\n6. Each property consists of a predicate (key) and an object (value).\n7. For complex types, additional information is nested under the predicate (key).\n8. @type is a special keyword that references the type and can contain multiple types in a list-like object.\n9. The value of a predicate can be i) a column/variable from the data source, ii) a fixed value (of a particular datatype), iii) a string with references to columns/variables, iv) a URL with references to columns/variables. \n10. A column/variable should be surrounded by curly brackets and backticks in the config, i.e. {\\`variable_value\\`}\n11. Fixed values (of any datatype) are specified using angle brackets, i.e. \u003cfixed_value\u003e.\n12. The data type can be specified by adding a vertical bar after the value, followed by the specific datatype, e.g.\n     ```\n    `some_value`|\u003cinteger\u003e\n    ```\n13. If the datatype is not explicitly provided, the value will be resolved as an IRI by default.\n14. Enumerated values can be provided using special keywords that consist of an \"@\" symbol followed by an integer. For example, @1.\n15. 
The Enumerator should be placed under the property that is to be enumerated; for example, \n    ``` \n    propertyName:\n       @1:\n         ...\n       @2:\n         ...\n    ```\n16. If the uri value for a complex type is left empty (by YAML convention this should be either null or ~), the program will generate the uri for this type automatically using base_uri and template_id. However, the key is mandatory, so even though it can be left empty, do not delete the key!\n\n17. Be careful of illegal characters in cases where the datatype is not specified. A proper URI might not be generated in some cases, and the program will throw an error.\n\n## 3.4 SPARQL mapping \u003ca name=\"sparql-mapping\"\u003e\u003c/a\u003e\n\nThe **GENERIC Pipeline** has supported mappings specified as SPARQL queries (in a .sparql file) since version 0.2.1. This option is reserved for cases where the input data is provided in the form of a CSV file!\n\n\n## 3.5 Relational Database as input source \u003ca name=\"relational-database-as-input-source\"\u003e\u003c/a\u003e\n\nThe **GENERIC pipeline** supports relational databases as a source of input data. Such a source can be used to generate mappings, transform the data (using an existing mapping), or both.\n\nThe user has to provide a set of details in order to successfully establish a database connection. This information is provided in the main configuration file, i.e., **./cfg/config.yaml**, under the sql_cfg section. The tool expects and accepts the following details:\n\n  •   DB_TYPE (*REQUIRED Type of the database e.g. 
postgresql, mysql*)  \n  •   DB_USERNAME (*REQUIRED Username that can access the database*)  \n  •   DB_PASSWORD (*OPTIONAL If the database uses a password, then this should be filled in*)  \n  •   DB_HOST (*REQUIRED*)  \n  •   DB_PORT (*OPTIONAL*)  \n  •   DB_NAME (*REQUIRED*)  \n\n\n## 3.6 Link discovery process \u003ca name=\"linking-process\"\u003e\u003c/a\u003e\n\nThe **GENERIC pipeline** has supported link discovery since version 1.0.0; this feature relies on the [Silk tool](https://github.com/silk-framework/silk). To carry out this process, the user needs to provide a linking configuration file, as specified by the [Link Specification Language](https://app.assembla.com/wiki/show/silk/Link_Specification_Language), as input - either in the local directory or through a URL.\n\nNote that the linking configuration file can specify different types of inputs, e.g., RDF dump files or SPARQL endpoints. If the user links physical files and not SPARQL endpoints, those files should be provided together with the linking config file (either through dir_input or url_input). This is because the input files (e.g., n-triple files) are provided with simple filenames instead of complete paths, so the tool assumes that the input files are on the same level as the provided configuration file. \n\nIn the following example, the input zip file consists of a config file (XML) and the two input files (as specified in the config file).\n\n```python main.py generic --process=link --url_input=\"https://box.psnc.pl/f/7d856fd71a/?raw=1\"```\n\n\n# 4. Usage \u003ca name=\"usage\"\u003e\u003c/a\u003e\n\nThere are currently three different pipeline types supported by the tool: FADN, LPIS, and Generic. Each of them has a specific set of parameters, options, and rules, which are briefly described below. 
\n\nTo list the supported pipelines, the following command should be used:\n```python main.py -h```\n\nTo get details for a specific pipeline, the following command should be used:\n```python main.py \u003cpipeline_name\u003e -h```\n\n## 4.1 FADN Pipeline \u003ca name=\"fadn-pipeline\"\u003e\u003c/a\u003e\n\nThe FADN Pipeline is used to handle FADN data that is provided through the https://ec.europa.eu site. Data comes in packages, and each package has a specific CSV file structure. The full pipeline can be invoked by running the following command:\n\n```python main.py fadn --stage=all --url_input=\u003curl_to_the_zip_file_containing_data\u003e --graph_uri=\u003cgraph_uri_value\u003e```\n\nThe stage parameter is required, and instead of running the whole pipeline, one may choose to run it up until a particular stage. The chosen stage is **inclusive**, so for instance running --stage=transform will perform all the tasks that precede transformation and the transformation itself as a final step. In that way, the output will consist of a set of dumps. The other stages follow similar logic, and they are all inclusive. Here is the list of all available stages for the FADN Pipeline with the corresponding output in brackets:\n\n- **all** (set of dumps loaded into a triplestore),\n- **postprocess** (a set of post-processed dumps),\n- **transform** (a set of dumps),\n- **mapping** (a set of mapping files),\n- **preprocess** (a set of auxiliary CSV files),\n- **fetch** (raw data acquired and unzipped from the source).\n\nHere is the list of all parameters and options for the FADN Pipeline with a corresponding description:\n\n```\nUsage: main.py fadn [OPTIONS]\n\n  Function that initializes FADN Pipeline.\n\nOptions:\n  Input data sources: [mutually_exclusive]\n                                  The source of the input data. 
The default,\n                                  if neither option is used, is the fadn\n                                  folder in the current directory.\n\n    -ui, --url_input TEXT         URL to the zip file with input file package.\n                                  Required when --stage=fetch. Optional in all\n                                  other cases.\n\n    -di, --dir_input DIRECTORY    Directory containing input files.\n\n  -s, --stage [all|fetch|preprocess|mapping|transform|postprocess]\n                                  Runs the whole fadn_pipeline or a single\n                                  fadn_pipeline stage.  [required]\n\n  -u, --graph_uri TEXT            Graph's URI that will be used to load dumps\n                                  into the database. Required when --stage is\n                                  set to all\n\n  -gpd, --graph_per_dump          Treats -u/--graph_uri as base that will be\n                                  extended with the dump name for each load.\n                                  Optional when --stage is set to all.\n\n  -rg, --reload_graph             Removes target graph before loading dumps\n                                  into the database. Optional when --stage is\n                                  set to all\n\n  -o, --output DIRECTORY          Output folder name. Optional.  [default:\n                                  results]\n\n  -c, --clean                     Removes all files generated throughout the\n                                  run of the full pipeline. 
Optional when\n                                  --stage is set to all.\n\n  -h, --help                      Show this message and exit.\n  ```\n\n   **Example**:\n    Running the whole FADN Pipeline using url_input and the graph_per_dump flag:\n\n    python main.py fadn --stage=all --graph_uri=http://testing/FADN/ --url_input=https://ec.europa.eu/agriculture/rica/database/reports/archives/fadn20200621.zip -gpd\n\n## 4.2 LPIS Pipeline \u003ca name=\"lpis-pipeline\"\u003e\u003c/a\u003e\n\nThe LPIS Pipeline handles datasets provided in the form of shapefiles. The pipeline will process each shapefile that is present in the provided input. Similar to the FADN pipeline, the input can be either in the form of a URL or a directory.\nJust as with the FADN pipeline, a user can choose to run a different stage by picking a value for the --stage parameter. The full pipeline can be run using the following command:\n\n```python main.py lpis --stage=all --url_input=\u003curl_to_the_zip_file_containing_data\u003e --graph_uri=\u003cgraph_uri_value\u003e --country=\u003cname_of_the_country_selected_from_a_list\u003e```\n\nThe LPIS pipeline currently handles transformation for three specific countries: Lithuania, Poland, and Spain. The details regarding the mapping are provided through configuration files, as explained in the [LPIS country configuration files](#lpis-country-configuration-files) section. The pre-configured files for those countries can be found under the **./cfg/LPIS/** directory along with other.yaml, which can be used for processing any other country. 
In that case, the country parameter should be \"other\" and the user should fill in the corresponding YAML file (other.yaml).\n\nThe following stages with their respective output are available for the LPIS type of pipeline:\n\n  - **all** (set of dumps loaded into a triplestore),\n  - **postprocess** (a set of post-processed dumps),\n  - **transform** (a set of dumps),\n  - **mapping** (a set of mapping files),\n  - **fetch** (raw data acquired and unzipped from the source).\n\nHere is the list of all parameters and options for the LPIS Pipeline with a corresponding description:\n\n```\nUsage: main.py lpis [OPTIONS]\n\n  Function that initializes LPIS Pipeline.\n\nOptions:\n  Input data sources: [mutually_exclusive]\n                                  The source of the input data. The default,\n                                  if neither option is used, is the current\n                                  directory.\n\n    -ui, --url_input TEXT         URL to the zip file with input file package.\n                                  The file is unpacked and the directory\n                                  traversed to find all existing shapefiles.\n                                  Required when --stage=fetch.\n\n    -di, --dir_input DIRECTORY    Directory containing input data. The the\n                                  directory is traversed to find all existing\n                                  shapefiles.\n  \n  -s, --stage [all|fetch|mapping|transform|postprocess]\n                                  Runs the whole LPIS Pipeline or a single\n                                  LPIS Pipeline stage.  
[required]\n\n  -cn, --country [other|spain|poland|lithuania]\n                                  Mappings will be generated for a specific\n                                  country that was chosen from the list.\n                                  Required when --stage=all, mapping,\n                                  transformor postprocess.\n\n  -u, --graph_uri TEXT            Graph's URI that will be used to load dumps\n                                  into the database. Required when --stage is\n                                  set to all.\n\n  -gpd, --graph_per_dump          Treats -u/--graph_uri as base that will be\n                                  extended with the dump name for each load.\n                                  Optional when --stage is set to all.\n\n  -rg, --reload_graph             Removes target graph before loading dumps\n                                  into the database. Optional when --stage is\n                                  set to all\n\n  -o, --output DIRECTORY          Output folder name. Optional.  [default:\n                                  results]\n\n\n  -c, --clean                     Removes all files generated throughout the\n                                  run of the full pipeline. Optional when\n                                  --stage is set to all.\n\n  -h, --help                      Show this message and exit.\n  ```\n     \n   **Example**:\n    Running the LPIS Pipeline up to the post-processing stage for Spain using url_input:\n\n    python main.py lpis --stage=postprocess --country=spain --url_input=\u003curl\u003e\n\n## 4.3 GENERIC Pipeline \u003ca name=\"generic-pipeline\"\u003e\u003c/a\u003e\n\nThe Generic Pipeline aims at providing as much flexibility to the user as possible. 
The tool can work with multiple data types, including:\n\n- **Shapefiles**\n- **JSON files**\n- **CSV files**\n- **Databases**\n\nAs with the FADN and LPIS Pipelines, users can choose to provide data through a URL (url_input) or just point to a directory (dir_input). But unlike the other pipelines, there is no full pipeline method, as every process can be treated as an autonomous step. With that being said, the Generic Pipeline supports stacking multiple processes. \n\nFor the transformation process, this pipeline can handle different situations:\n* when the number of mapping files is equal to the number of input files (in this scenario, mapping files should have the same base name as data files)\n* when a single mapping file is used with multiple input files\n* when a single mapping file is used with a single input file\n\nFor the last two scenarios, the tool will adjust the mapping appropriately to align it with the input file/files.\n\n\nThe following processes are currently available for the Generic pipeline. \n\n- **preprocess** Currently supporting only CSV files (output: preprocessed file),\n- **mapping generation** Currently supporting only CSV and Shapefiles (output: mapping file),\n- **transform** Supporting all available data types (output: dump file),\n- **postprocess** (output: post-processed dump file),\n- **load** (output: dump loaded into a triplestore),\n- **link** (output: dump file with discovered links). See details in section [Link discovery process](#linking-process)\n\n\nIt is useful to know that processes do not need to be stacked in any particular order, as the tool handles the sequence of actions by itself. 
Therefore, the two examples below are treated identically:\n\n```--process=transform --process=mapping --process=preprocess```\n\n```--process=mapping --process=preprocess --process=transform```\n\n\nBelow is the list of all parameters and options for the Generic Pipeline with a short description:\n\n```\nUsage: main.py generic [OPTIONS]\n\n  Function that initializes Generic Pipeline.\n\nOptions:\n  -p, --process [preprocess|mapping|transform|postprocess|load|link]\n                                  Runs a single process from the generic\n                                  Pipeline. Can be used multiple times to\n                                  evoke set of tasks.  [required]\n\n  -u, --graph_uri TEXT            Graph's URI. Required when using load\n                                  process.\n\n  Input data sources: [mutually_exclusive, required]\n                                  The source of the input data.\n    -ui, --url_input TEXT         URL to the input zip file package.\n    -di, --dir_input DIRECTORY    Directory containing input files.\n    -db, --db_input               Flag indicating database input. Details are\n                                  provided through cfg/config.yaml by updating\n                                  sql_cfg section.\n\n  -o, --output DIRECTORY          Output folder name. Optional. !!!WARNING!!!\n                                  if directory already exist the content of it\n                                  will be erased before pipeline execution.\n                                  Select your output path with caution!\n                                  [default: results]\n\n  -it, --input_type [Shapefile|GML|KML|GeoJson|CSV|JSON|XML|DB|netCDF|CSVW]\n                                  Type of input data that has to be\n                                  transformed. Required if process is\n                                  transform, mapping or preprocess. 
WARNING:\n                                  this option is case sensitive!\n\n  -mi, --mapping_input DIRECTORY  Input path to the mapping directory.\n                                  Required when process is transform and there\n                                  is no preceding mapping process.\n\n  -mu, --mapping_url TEXT         Input mapping as URL to a zip package (works\n                                  only with data provided as url_input).\n                                  Required when process is transform and there\n                                  is no preceding mapping process.\n\n  -bu, --base_uri TEXT            Base URI. Required with mapping process and\n                                  transform when processing shapefiles.\n\n  -gpd, --graph_per_dump          Treats -u/--graph_uri as base that will be\n                                  extended with the dump name for each load.\n                                  Optional when --process is set to load.\n\n  -rg, --reload_graph             Removes target graph before loading dumps\n                                  into the database. Optional when --process\n                                  is set to load.\n\n  -ppa, --preprocess_activity [add_seq_col|unzip_multiple_archives|normalize_delimiter|to_crs|add_enum]\n                                  List of available methods to choose from for\n                                  the preprocessing part of the Pipeline.\n                                  Required with --process=preprocess.\n                                  unzip_multiple_archives works for every\n                                  input type. 
add_seq_col and\n                                  normalize_delimiter areconsequential only\n                                  for CSV input type, to_crs works only with\n                                  Shapefiles and add_enum is somewhat\n                                  equivalent to add_seq_col but works for json\n                                  files.\n\n  -ttl, --to_ttl                  Converts output from n-triples to turtle as\n                                  a part of post-processing. Available only\n                                  with post-processing.\n\n  -re, --replace_expression TEXT...\n                                  Replaces all occurrences of X with Y as a\n                                  part of additional dump postprocessing.\n                                  Optional when process is set to postprocess.\n                                  Regular expression patterns are accepted but\n                                  they needto be properly escaped!\n\n  -rl, --remove_line TEXT         Removes every line containing provided\n                                  string in the of dump file. Optional when\n                                  process=postprocess\n\n  -te, --target_encoding TEXT     Desired encoding for output dump. Optional\n                                  when process is set to postprocess.\n\n  -se, --source_encoding TEXT     Encoding of a source file. Optional when\n                                  --target_encoding is provided. if source\n                                  encoding was not provided program will try\n                                  to make an educated guess on the source\n                                  encoding.\n\n  -fc, --from_config              Flag indicating that the mapping should be\n                                  generated based on the configfile (Supports\n                                  Shapefiles and CSV). 
Optional when --process\n                                  is set to mapping.\n\n  -fcv, --from_config_value FILE  Alternative path to the config file for\n                                  generating a custom mapping.  [default:\n                                  cfg/GENERIC/generic_cfg.yaml]\n\n  -sq, --sparql_query             Uses a SPARQL query as a mapping to produce\n                                  dumps. Query file should be provided like\n                                  the regular mapping either through\n                                  --mapping_input or --mapping_url. Can be\n                                  used only with a transform process.\n\n  -yrv, --yarrrml_rules_value FILE\n                                  Path to YAML containing all the rules for\n                                  mapping generation using YARRRML tool.\n\n  -yru, --yarrrml_rules_url TEXT  URL to zip package containing YAML file with\n                                  rules for mapping generation using YARRRML\n                                  tool.\n\n  -c, --clean                     Removes all files generated throughout the\n                                  run of the full pipeline. Optional when\n                                  --process=load.\n\n  -h, --help                      Show this message and exit.\n```\n\n# 5. Version \u003ca name=\"version\"\u003e\u003c/a\u003e\n\nLatest version:\n**version v1.1.2**\n\nTo check the version of the tool you are currently using, you can pass\n ```python main.py --version``` into the terminal.\n\n# 6. Team \u003ca name=\"team\"\u003e\u003c/a\u003e\n\nThe Linked Data Pipelines was built with an effort of the Data Analytics and Semantics Department in Poznan Supercomputing and Networking Center. [Bogusz Janiak](http://boguszjaniak.xyz/) is the main creator and maintainer. 
People who contributed to the project: [Raul Palma](http://orcid.org/0000-0003-4289-4922), [Soumya Brahma](https://github.com/sbrahma), [Andrzej Mazurek](mailto:andrzej.a.mazurek@gmail.com).\n\n# 7. License \u003ca name=\"license\"\u003e\u003c/a\u003e\n\nThe Linked Data Pipelines is released under the MIT License, as found in the [LICENSE](LICENSE) file.\n\n![MIT License icon](https://upload.wikimedia.org/wikipedia/commons/thumb/c/c3/License_icon-mit.svg/384px-License_icon-mit.svg.png)","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fad4gd%2Fharmonisationpipelines","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fad4gd%2Fharmonisationpipelines","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fad4gd%2Fharmonisationpipelines/lists"}