{"id":13725658,"url":"https://github.com/ceumicrodata/mETL","last_synced_at":"2025-05-07T20:33:16.959Z","repository":{"id":8619124,"uuid":"10261631","full_name":"ceumicrodata/mETL","owner":"ceumicrodata","description":"mito ETL tool","archived":false,"fork":false,"pushed_at":"2021-06-01T08:26:55.000Z","size":7794,"stargazers_count":162,"open_issues_count":18,"forks_count":40,"subscribers_count":19,"default_branch":"master","last_synced_at":"2024-08-04T01:27:55.037Z","etag":null,"topics":["data-integration","etl","etl-framework","pipeline","python"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ceumicrodata.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2013-05-24T07:43:05.000Z","updated_at":"2024-06-25T13:52:10.000Z","dependencies_parsed_at":"2022-09-09T23:10:22.498Z","dependency_job_id":null,"html_url":"https://github.com/ceumicrodata/mETL","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ceumicrodata%2FmETL","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ceumicrodata%2FmETL/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ceumicrodata%2FmETL/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ceumicrodata%2FmETL/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ceumicrodata","download_url":"https://codeload.github.com/ceumicrodata/mETL/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224645478,"owners_count":17346166,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-integration","etl","etl-framework","pipeline","python"],"created_at":"2024-08-03T01:02:30.376Z","updated_at":"2024-11-14T15:31:44.360Z","avatar_url":"https://github.com/ceumicrodata.png","language":"Python","readme":"\u003ccenter\u003e\r\n\u003ctable\u003e\r\n\t\u003ctr\u003e\r\n\t\t\u003ctd style=\"border: none;\"\u003e\u003cimg src=\"docs/logo.png\" width=\"200\"\u003e\u003c/td\u003e\r\n\t\t\u003ctd style=\"border: none;\"\u003e\r\n\t\t\t\u003cul\u003e\r\n\t\t\t\t\u003cli\u003e\u003ca href=\"#overview\"\u003eOverview\u003c/a\u003e\u003c/li\u003e\r\n\t\t\t\t\u003cli\u003e\u003ca href=\"#thirtysecondstutorial\"\u003eThirty Seconds Tutorial\u003c/a\u003e\u003c/li\u003e\r\n\t\t\t\t\u003cli\u003e\u003ca href=\"#changelog\"\u003eChangelog\u003c/a\u003e\u003c/li\u003e\r\n\t\t\t\t\u003cli\u003e\u003ca href=\"#documentation\"\u003eDocumentation\u003c/a\u003e\u003c/li\u003e\r\n\t\t\t\u003c/ul\u003e\r\n\t\t\u003c/td\u003e\r\n\t\u003c/tr\u003e\r\n\u003c/table\u003e\r\n\u003c/center\u003e\r\n\r\n\u003ca 
id=\"overview\"\u003e\u003c/a\u003e\r\n# Overview\r\n\r\nmETL is an ETL tool which has been especially designed to load elective data necessary for CEU. Obviously, the program can be used in a more general way, it can be used to load practically any kind of data. The program was written in Python, taking into maximum consideration the optimal memory usage after having assessed the Brewery tool’s capabilities.\r\n\r\n### Presentations\r\n\r\n1. [Extract, Transform, Load in Python](https://speakerdeck.com/bfaludi/extract-transform-load-in-python) - Bence Faludi (@bfaludi), Budapest.py Meetup\r\n\r\n\tOur solutions to create a new Python ETL tool from scratch.\r\n\r\n2. [mETL - just another ETL tool?](https://speakerdeck.com/bpdatabase/metl-just-another-etl-tool-molnar-daniel-mito) - Dániel Molnár (@soobrosa), Budapest Database Meetup\r\n\r\n\tA practical rimer on how to make your life easier on ETL processes - even without writing loader code.\r\n\r\n3. [Extract, Transform, Load using mETL](https://speakerdeck.com/bfaludi/extract-transform-load-using-metl) - Bence Faludi (@bfaludi), PyData '14, Berlin\r\n\r\n\t\u003e Presentation was published at PyData '14 conference in Berlin. Novice level training to help you learn and use mETL in your daily work. [video](https://www.youtube.com/watch?v=NOGXdKbB-gQ)\r\n\t\r\n4. [Extract, Transform, Load using mETL](https://speakerdeck.com/bfaludi/extract-transform-load-using-metl-1) - Bence Faludi (@bfaludi), PyCon Sei, Florince\r\n\r\n\t\u003e We are using this tool in production for many of our clients and It is really stable and reliable. The project has a few contributors all around the world right now and I hope many developers will join soon. I want to introduce you this tool. In this presentation I will show you the functionality and the common use cases. Furthermore, I will talk about other ETL tools in Python. [video](https://www.youtube.com/watch?v=5fe3wBMsmMg)\r\n\r\n### Tutorials\r\n\r\n1. [Novice level exercises](https://github.com/bfaludi/mETL-tutorials)\r\n\r\n\u003ca id=\"thirtysecondstutorial\"\u003e\u003c/a\u003e\r\n# Thirty-seconds tutorial\r\n\r\nFirst of all, let's see the most common problem. Want to load data into database from a text or binary file. Our example file is called \u003cspan\u003eauthors.csv\u003c/span\u003e and file's structure is the following:\r\n\r\n\tAuthor,Email,Birth,Phone\r\n\tDuane Boyer,duaneboyer@yahoo.com,1918-05-01,+3670636943\r\n\tJonah Bazile,jonahbazile@live.com,1971-10-05,+3670464615\r\n\tWilliam Teeple,williamteeple@gmail.com,1995-07-26,+3670785797\r\n\tJunior Thach,juniorthach@msn.com,1941-08-10,+3630589648\r\n\tEmilie Smoak,emiliesmoak@msn.com,1952-03-08,+3670407688\r\n\tLouella Utecht,louellautecht@yahoo.com,1972-02-28,+3670942982\r\n\t...\r\n\r\nFirst task to generate a Yaml configuration for mETL. This configuration file contains the fields and types, transformation steps and source and target data. Write the following into the terminal. 
`config.yml` will be the configuration file's name, and the example file's type is `CSV`.

    $ metl-generate csv config.yml

The script prints information about the attributes you have to fill out:

    Usage: metl-generate [options] SOURCE_TYPE CONFIG_FILE

    Options:
      -h, --help            show this help message and exit
      -l LIMIT, --limit=LIMIT
                            Create the configuration file by examining LIMIT
                            number of records.
      --delimiter=DELIMITER
      --quote=QUOTE
      --skipRows=SKIPROWS
      --headerRow=HEADERROW
      --resource=RESOURCE
      --encoding=ENCODING
      --username=USERNAME
      --password=PASSWORD
      --realm=REALM
      --host=HOST

You have to add the following attributes for the generator script:

* **headerRow**: The file has a header in its first row.
* **skipRows**: Because it has a header, you should skip one row.
* **resource**: The file's path.

Run the command with these attributes:

    $ metl-generate --resource authors.csv --headerRow 0 --skipRows 1 csv config.yml

The script creates a YAML configuration that can be used by mETL. You could write the configuration manually, but `metl-generate` examines the rows and determines the correct field types and mapping.

    source:
      fields:
      - map: Phone
        name: Phone
        type: BigInteger
      - map: Email
        name: Email
        type: String
      - map: Birth
        name: Birth
        type: DateTime
      - map: Author
        name: Author
        type: String
      headerRow: '0'
      resource: authors.csv
      skipRows: '1'
      source: CSV
    target:
      silence: false
      type: Static

Modify the `target`, because currently it writes the records to stdout. You have to add the database target:

    ...
    target:
      type: Database
      url: postgresql://username:password@localhost:5432/database
      table: authors
      createTable: true

The script will create the table and load the data into the PostgreSQL database automatically. Run the following command to start the process:

    $ metl config.yml

It's done. mETL knows many source and target types and supports transformations and manipulations as well.
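
Nothing above is specific to PostgreSQL: the target `url` is an SQLAlchemy-style connection string (SQLAlchemy is among the package's dependencies), so for a quick local test you could, for instance, point the same target at an SQLite file instead. A sketch reusing the `Database` target attributes shown above:

    ...
    target:
      type: Database
      url: sqlite:///authors.db
      table: authors
      createTable: true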

<a id="changelog"></a>
# Change Log

### Version 1.0

- .0: First stable release, with full documentation and a running-time reduction on OrderModifier.
- .0.1: Added the ability to merge Excel sheets.
- .0.2: Tarr dependency fix.
- .0.3: Added the [dm](https://github.com/bfaludi/dm) package dependency. From now on, everyone can use mETL's fieldmap standalone.
- .0.4: Added dispatcher option for metl-transfer.
- .0.5: Added encoding option for the Database source.
- .0.6: The JSON target now uses the standard json library instead of demjson, which gives better performance.
- .0.7: Fixed an error when using a schema with PostgreSQL.

### Version 0.1.8
- .0: Minor but useful changes.

   * DatabaseTarget has a new attribute that allows the write/update process to continue when an error happens.

            target:
               type: Database
               url: sqlite:///database.db
               table: t_table
               createTable: true
               continueOnError: true

   * Running-time optimization for migrations on big datasets.
   * Execute function on DatabaseTarget.

      You can use it to load data into special or multiple tables at once, or to trigger changes (deleted, changed, ...) on records based on migration differences.

      In 0.1.7 inactivating deleted records was sluggish; now it is quite easy:

            metl-differences -d delete.yml migration/current.pickle migration/prev.pickle

      where delete.yml is the following:

            target:
               type: Database
               url: sqlite:///database.db
               fn: mgmt.inactivateRecords

      and mgmt.py contains:

            def inactivateRecords( connection, delete_buffer, other_buffer ):

               connection.execute(
                  """
                  UPDATE
                      t_table
                  SET
                      active = FALSE
                  WHERE
                      id IN ( %s )
                  """ % ( ', '.join( [ b['key'] for b in delete_buffer ] ) )
               )

- .1: Added a logger attribute to Source/Manipulation/Target elements to define a specific logger method.
- .2: AppendAllExpand got a skipIfFails attribute.
- .3: Neo4j target added.
- .4: metl-transfer command added to migrate and copy whole databases.
- .5: Minor fix on metl-transfer.
- .6: Fixed a bug in metl-transfer that sometimes lost the source connection on big datasets.
- .7: AppendAllExpand can now skip walking the whole directory.

### Version 0.1.7
- .0: Major changes and running-time reduction.

   * Changed the PostgreSQL target to load data more efficiently (12x speed boost) by working around slow behaviour in psycopg2 and SQLAlchemy.
   * JSON file loading is now done with the standard json package (instead of demjson) because it is faster with big (>100MB) files.
   * BigInteger type added to handle 8-byte integers.
   * Pickle type added to handle serialized BLOB objects.

   <br>
   **IMPORTANT**: The alternate PostgreSQL target only works with basic field types and lower-case column names.

- .1: Added GoogleSpreadsheetTarget to write and update spreadsheet files.
- .2: Added AppendAll expander to append files' content by walking a folder.

### Version 0.1.6
- .0: Changed the XML converter to the <a href="https://github.com/bfaludi/xmlsquash">xmlsquash</a> package.

   **IMPORTANT**: It brings a new XML mapping technique; all XML source maps must be updated!

   * For an element's value: path/for/xml/element/**text**
   * For an element's attribute: path/for/xml/element/attributename

- .0: Fixed a bug in XML sources when multiple list elements were found at sub-sub-sub level.
- .0: Fixed a bug with htaccess file opening in CSV, TSV, Yaml, JSON sources.
- .1: Fixed a bug where the map ending was `*`.
- .1: Added SetWithMap modifier and Complex type.
- .2: Fixed a bug in the List expander when the field's value was empty.
- .2: The Split transform can now split list items too.
- .2: The Clean transform removes new lines.
- .3: Added Order modifier.
- .4: Added basic aggregator functions.
- .5: Added dinamicSheetField attribute to the XLS target to group your data into different sheets.
- .6: Added KeepByCondition filter.
- .7: JSON and Yaml source rootIterator and XML source itemName attributes now work like fieldMaps.
- .8: Absolute FieldMap (starting with the `/` mark) usage for JSON, XML, YAML files.
- .9: The Database source has a resource attribute to load SQL statements from a file.
- .10: The Database source has a params attribute to add parameters to statements.
- .11: Fields have a new limit attribute for database targets. Easy to add new database types if necessary.
- .12: Boolean conversion now works correctly for String.
- .13: Added JoinByKey modifier to easily join two sources and fill out some fields.
- .14: Added `metl-generate` command to generate YAML configuration files automatically.

### Version 0.1.5
- .0: htaccess file opening support.
- .1: List-type JSON support for database target and source.
- .1: ListExpander with map ability.

### Version 0.1.4
- .0: First public release.
- .1: Removed elementtree and cElementTree dependencies.
- .2: TARR dependency link added, PyXML dependency removed.
- .3: The JSON target got a compact format parameter to create pretty-printed files.
- .4: Updated TARR dependency.
- .5: Added missing dependency: python-dateutil.
- .6: Fixed XML test case for Python versions after 2.7.2.
- .7: Fixed the List type converter for string or unicode data. It will not split the string!
- .8: Fixed the JSON source when no root iterator is given and the resource file contains only one dictionary.
- .9: Added a new operator `!` to convert a dictionary into a list during the mapping process.
- .10: Fixed a bug on Windows when opening a resource with an absolute path.
- .11: Added ListExpander to expand list information into single fields.
- .12: XML sources can be opened via http and https protocols.

<center>
<img src="docs/logo.png" width="200">
</center>

<a id="documentation"></a>
# Documentation

## Capabilities

The current version supports the most widespread file formats and databases, with data-migration capabilities as well.
These include:

**Source types:**

- CSV, TSV, XLS, XLSX, Google SpreadSheet, fixed-width file
- PostgreSQL, MySQL, Oracle, SQLite, Microsoft SQL Server
- JSON, XML, YAML

**Target types:**

- CSV, TSV, XLS, XLSX, Google SpreadSheet - with file continuation as well
- Fixed-width file
- PostgreSQL, MySQL, Oracle, SQLite, Microsoft SQL Server - with the purpose of modification as well
- JSON, XML, YAML
- Neo4j

During the development of the program we tried to cover the whole course of processing with the most widespread transformation steps, program structures and mutation steps. Accordingly, the program possesses the following transformations by default (a combined example follows the list):

- **Add**: Adds an arbitrary number to a value.
- **Clean**: Removes the different types of punctuation marks (dots, commas, etc.).
- **ConvertType**: Modifies the type of the field to another type.
- **Homogenize**: Converts accentuated letters to unaccentuated ones (NFKD format).
- **LowerCase**: Converts to lower case.
- **Map**: Changes the value of a field to another value.
- **RemoveWordsBySource**: Removes certain words, using another source.
- **ReplaceByRegexp**: Replaces by a regular expression.
- **ReplaceWordsBySource**: Replaces words using another source.
- **Set**: Sets a certain value.
- **Split**: Separates words by spaces and keeps a given interval.
- **Stem**: Reduces words to their stem (root).
- **Strip**: Removes unnecessary spaces and/or other characters from the beginning and end of the value.
- **Sub**: Subtracts a given number from a given value.
- **Title**: Capitalizes the first letter of every word.
- **UpperCase**: Converts to upper case.
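
To give a feel for how these combine, here is a small illustrative sketch (the field name and data are made up; the `transforms` list syntax is the one used throughout this documentation). It strips surrounding whitespace, removes accents, then upper-cases the value:

    - name: cityname
      type: String
      transforms:
        - transform: Strip
        - transform: Homogenize
        - transform: UpperCase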

Four groups of manipulations are differentiated:

1. **Modifier**

   Modifiers are objects that are given a whole line (record) and return a whole line. During their processing they may change values using the related values of other fields.

   - **JoinByKey**: Merges and joins two different records.
   - **Order**: Orders lines according to the given conditions.
   - **Set**: Sets a value with the use of a fixed value scheme, a function, or another source.
   - **SetWithMap**: Sets a value of a complex type with a given map.
   - **TransformField**: During manipulation, regular field transformation can be achieved with this command.

2. **Filter**

   Their function is primarily filtering. They are used when we would like to evaluate or get rid of incomplete or faulty records resulting from an earlier transformation.

   - **DropByCondition**: The fate of the record depends on a condition.
   - **DropBySource**: The fate is decided by whether or not the record is in another file.
   - **DropField**: Does not decrease the number of records, but a field can be deleted with it.
   - **KeepByCondition**: The fate of the record depends on a condition.

3. **Expand**

   Expanders are used for enlargement, when we would like to add more values to the present source.

   - **Append**: Appends a new source file, identical in structure to the one being used, after the actual one.
   - **AppendAll**: Walks over a folder and appends the files' content to the process.
   - **AppendBySource**: A new file source may be appended after the original one.
   - **Field**: Collects columns as parameters and puts them into another column with the columns' values.
   - **BaseExpander**: Class used for enlargement, primarily used when we would like to multiply a record.
   - **ListExpander**: Splits list-type elements and puts them into separate lines.
   - **Melt**: Fixes given columns and shows the rest of the columns as key-value pairs.

4. **Aggregator**

   Aggregators are used to connect and arrange data.

   - **Avg**: Used to determine the mean (average).
   - **Count**: Used to count records.
   - **Sum**: Used to determine sums.


### Component figure

<img src="docs/components.png" alt="Components" style="width: 100%;"/>

## Installation

As a traditional Python package, installation is most easily carried out with the following command in the mETL directory:

    $ python setup.py install

or

    $ easy_install mETL

Then the package can be tested with the following command:

    $ python setup.py test

The package has the following dependencies: `python-dateutil`, `xlrd`, `gdata`, `demjson`, `pyyaml`, `sqlalchemy`, `xlwt`, `nltk`, `tarr`, `xlutils`, `xmlsquash`, `qspread`, `py2neo`


On Mac OS X, one needs to have the following packages installed before installation. Afterwards, all packages install properly.

- XCode
- [Macports](https://distfiles.macports.org/MacPorts/MacPorts-2.1.3-10.8-MountainLion.pkg)

On Linux, one needs to check that `python-setuptools` is present; in its absence it needs to be installed (e.g. with `apt-get install python-setuptools`).

## Running the program

The program is a collection of console scripts which can be integrated into any system and can even be scheduled with cron.

**The program is made up of the following scripts:**

1. `metl`: Starts a complete process on the basis of the YAML file given as a parameter.
The whole process should be described by the configuration file, including the exact paths of input and output files.

        Usage: metl [options] CONFIG.YML

        Options:
          -h, --help            show this help message and exit
          -t TARGET_MIGRATION_FILE, --targetMigration=TARGET_MIGRATION_FILE
                                During running, it prepares a migration file from the
                                state of the present data.
          -m MIGRATION_FILE, --migration=MIGRATION_FILE
                                Conveyance of the previous migration file that was
                                created by the previous run.
          -p PATH, --path=PATH  Conveyance of a folder which is added to the PATH
                                variable so that a reference in the YAML
                                configuration can point to an outside Python file.
          -d, --debug           Debug mode, writes everything out to stdout.
          -l LIMIT, --limit=LIMIT
                                Sets the number of elements to be processed. An
                                excellent way to test huge files with a small number
                                of records until everything works the way it should.
          -o OFFSET, --offset=OFFSET
                                Starting element of the processing.
          -s SOURCE, --source=SOURCE
                                If the configuration does not contain the path of the
                                resource, it can be given here as well.
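
   For example, to run a process while writing out the current state as a migration file and comparing it against the previous run's file (an illustrative invocation; the file names follow the changelog examples above):

        $ metl -t migration/current.pickle -m migration/prev.pickle config.yml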

2. `metl-walk`: Its task is to apply a base YAML configuration to every file in the folder given as a parameter. In this case the configuration files do not have to contain the location of the input file, as the script substitutes it automatically.

        Usage: metl-walk [options] BASECONFIG.YML FOLDER

        Options:
          -h, --help            show this help message and exit
          -p PATH, --path=PATH  Conveyance of a folder which is added to the PATH
                                variable so that a reference in the YAML
                                configuration can point to an outside Python file.
          -d, --debug           Debug mode, writes everything out to stdout.
          -l LIMIT, --limit=LIMIT
                                Sets the number of elements to be processed. An
                                excellent way to test huge files with a small number
                                of records until everything works the way it should.
          -o OFFSET, --offset=OFFSET
                                Starting element of the processing.
          -m, --multiprocessing
                                Turns on multiprocessing on computers with more than
                                one CPU. The files to be processed are put onto
                                different threads. It is to be used exclusively for
                                Database targets, as otherwise it causes problems!

3. `metl-transform`: Its task is to test the transformation steps of a field in a YAML file. As parameters, it requires the name of the field and the value on which the test should be based. The script writes out the changes of the value step by step.

        Usage: metl-transform [options] CONFIG.YML FIELD VALUE

        Options:
          -h, --help            show this help message and exit
          -p PATH, --path=PATH  Conveyance of a folder which is added to the PATH
                                variable so that a reference in the YAML
                                configuration can point to an outside Python file.
          -d, --debug           Debug mode, writes everything out to stdout.

4. `metl-aggregate`: Its task is to collect all the possible values of the field given as a parameter. Based on these values, a Map is easily made for the records.

        Usage: metl-aggregate [options] CONFIG.YML FIELD

        Options:
          -h, --help            show this help message and exit
          -p PATH, --path=PATH  Conveyance of a folder which is added to the PATH
                                variable so that a reference in the YAML
                                configuration can point to an outside Python file.
          -d, --debug           Debug mode, writes everything out to stdout.
          -l LIMIT, --limit=LIMIT
                                Sets the number of elements to be processed. An
                                excellent way to test huge files with a small number
                                of records until everything works the way it should.
          -o OFFSET, --offset=OFFSET
                                Starting element of the processing.
          -s SOURCE, --source=SOURCE
                                If the configuration file does not contain the
                                resource path, it can be given here as well.
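
   For instance, with the tutorial's configuration you could list every distinct value of the `Author` field like this (an illustrative invocation):

        $ metl-aggregate config.yml Author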

5. `metl-differences`: Its task is to compare two migrations. Its first parameter is the recent migration, the second is the older one. The script reports the number of elements that are new in the recent migration, the number of elements that have been modified, and the number of elements that have been left unchanged or deleted.

        Usage: metl-differences [options] CURRENT_MIGRATION LAST_MIGRATION

        Options:
          -h, --help            show this help message and exit
          -p PATH, --path=PATH  Conveyance of a folder which is added to the PATH
                                variable so that a reference in the YAML
                                configuration can point to an outside Python file.
          -d DELETED, --deleted=DELETED
                                Configuration file for receiving keys of the deleted
                                elements.
          -n NEWS, --news=NEWS  Configuration file for receiving keys of the new
                                elements.
          -m MODIFIED, --modified=MODIFIED
                                Configuration file for receiving keys of the modified
                                elements.
          -u UNCHANGED, --unchanged=UNCHANGED
                                Configuration file for receiving keys of the
                                unmodified elements.

6. `metl-generate`: Prepares a YAML configuration file for a chosen source type. To generate a configuration, the initialisation and source parameters of the source are needed.

        Usage: metl-generate [options] SOURCE_TYPE CONFIG_FILE


7. `metl-transfer`: Transfers all data from one database to another.

        Usage: metl-transfer CONFIG.YML


## Functioning

The tool **uses a YAML file for configuration**, which describes the whole run as well as all the needed transformation steps.

**Outline of the functioning of an average program:**

1. The program reads the given source file.
2. Line by line, the program fills in the fields with values, with the help of the mapping.
3. Different transformations are carried out individually on each field.
4. The transformed line reaches the first manipulation, where further filtering or modifications can be done to the whole line. Each manipulation sends the converted and processed line to the next manipulation step.
5. After reaching the target object, the final line is written out in the given file type.

Let's examine in detail all the components used during a run, and then take a look at how we can turn the listed steps into YAML configuration files.

This document has two main objectives. Firstly, it defines how one can describe tasks in a YAML configuration. Secondly, through examples, it gives glimpses of the Python code, which helps us write additional conditions and modifications when the basic tools prove insufficient.

## Workflow

<img src="docs/workflow.png" alt="Workflow" style="width: 100%;"/>

## Configuration file (YAML)

All mETL configuration files are made up of the following:

    source:
      source: <source_type>
      …

    manipulations:
      - <listing_of_manipulations>
      …

    target:
      type: <target_type>
      …

Of these, the listing of manipulations is optional, as in simpler processes it is not even needed. Interestingly, configuration files can be extended any number of times, so similar configuration files can be derived from the same "abstract" configuration. Let us examine the following example.

Let the base file be called **base.yml**:

    source:
      fields:
        - name: ID
          type: Integer
          key: true
        - name: FORMATTED_NAME
          key: true
        - name: DISTRICT
          type: Integer
        - name: LATITUDE
          type: Float
        - name: LONGITUDE
          type: Float

    target:
      type: Static

And the next one **fromcsv.yml**:

    base: base.yml
    source:
      source: CSV
      resource: input/forras.csv
      map:
        ID: 0
        FORMATTED_NAME: 1
        DISTRICT: 2
        LATITUDE: 3
        LONGITUDE: 4

The above file is derived from the first one and only adds where the source file and its fields are when processing data from the CSV source. If we had a TSV file, the following configuration could be written for it (where the separator is not `,` but `\t`):

    base: fromcsv.yml
    source:
      source: TSV
      resource: input/forras.tsv

### Source

All processes start with a source, from which the data are retrieved. There are distinct source types, which all have their own settings. Their role is complex, as during the ETL procedure any transformation or manipulation can pull in further sources in order to do its work. Their number and interconnection are critical.

**All source types** have the following attributes:

- **source**: Source type.
- **fields**: List of fields.
- **map**: Mapping/interconnection of fields. Not necessary here; can be given at the level of fields as well.
- **defaultValues**: Default values for the fields. Not necessary here; can be given at the level of fields as well.

An example of the configuration of a Static source YAML file:

    source:
      source: Static
      sourceRecords:
        - [ 'El Agent', 'El Agent@metl-test-data.com', 2008, 2008 ]
        - [ 'Serious Electron', 'Serious Electron@metl-test-data.com', 2008, 2013 ]
        - [ 'Brave Wizard', 'Brave Wizard@metl-test-data.com', 2008, 2008 ]
        - [ 'Forgotten Itchy Emperor', 'Forgotten Itchy Emperor@metl-test-data.com', 2008, 2013 ]
        - [ 'The Moving Monkey', 'The Moving Monkey@metl-test-data.com', 2008, 2008 ]
        - [ 'Evil Ghostly Brigadier', 'Evil Ghostly Brigadier@metl-test-data.com', 2008, 2013 ]
        - [ 'Strangely Oyster', 'Strangely Oyster@metl-test-data.com', 2008, 2008 ]
        - [ 'Anaconda Silver', 'Anaconda Silver@metl-test-data.com', 2006, 2008 ]
        - [ 'Hawk Tough', 'Hawk Tough@metl-test-data.com', 2004, 2008 ]
        - [ 'The Disappointed Craw', 'The Disappointed Craw@metl-test-data.com', 2008, 2013 ]
        - [ 'The Raven', 'The Raven@metl-test-data.com', 1999, 2008 ]
        - [ 'Ruby Boomerang', 'Ruby Boomerang@metl-test-data.com', 2008, 2008 ]
        - [ 'Skunk Tough', 'Skunk Tough@metl-test-data.com', 2010, 2008 ]
        - [ 'The Nervous Forgotten Major', 'The Nervous Forgotten Major@metl-test-data.com', 2008, 2013 ]
        - [ 'Bursting Furious Puppet', 'Bursting Furious Puppet@metl-test-data.com', 2011, 2008 ]
        - [ 'Neptune Eagle', 'Neptune Eagle@metl-test-data.com', 2011, 2013 ]
        - [ 'The Skunk', 'The Skunk@metl-test-data.com', 2008, 2013 ]
        - [ 'Lone Demon', 'Lone Demon@metl-test-data.com', 2008, 2008 ]
        - [ 'The Skunk', 'The Skunk@metl-test-data.com', 1999, 2008 ]
        - [ 'Gamma Serious Spear', 'Gamma Serious Spear@metl-test-data.com', 2008, 2008 ]
        - [ 'Sleepy Dirty Sergeant', 'Sleepy Dirty Sergeant@metl-test-data.com', 2008, 2008 ]
        - [ 'Red Monkey', 'Red Monkey@metl-test-data.com', 2008, 2008 ]
        - [ 'Striking Tiger', 'Striking Tiger@metl-test-data.com', 2005, 2008 ]
        - [ 'Sliding Demon', 'Sliding Demon@metl-test-data.com', 2011, 2008 ]
        - [ 'Lone Commander', 'Lone Commander@metl-test-data.com', 2008, 2013 ]
        - [ 'Dragon Insane', 'Dragon Insane@metl-test-data.com', 2013, 2013 ]
        - [ 'Demon Skilled', 'Demon Skilled@metl-test-data.com', 2011, 2004 ]
        - [ 'Vulture Lucky', 'Vulture Lucky@metl-test-data.com', 2003, 2008 ]
      map:
        name: 0
        year: 2
      defaultValues:
        name: 'Empty Name'
      fields:
        - name: name
          type: String
          key: true
        - name: time
          type: Date
          finalType: String
          transforms:
            - transform: ConvertType
              fieldType: String
            - transform: ReplaceByRegexp
              regexp: '^([0-9]{4}-[0-9]{2})-[0-9]{2}$'
              to: '$1'
        - name: year
          type: Integer

The example is long and may contain data and structure not introduced yet; these will be analysed in depth later on.

The source is therefore responsible for the following:

1. Description of the type and format of the file containing the data (source)
2. Description of the processed data (fields)
3. Definition of the mapping between them (map)

Let us examine how we can describe the types of files containing data.

#### CSV

Source type used for CSV files. Its initialisation parameters:

- **delimiter**: The separator character used in the CSV file. By default `,` is used.
- **quote**: The character used to protect data that contains the above-mentioned delimiter. By default `"` is used.
- **skipRows**: Sets the number of lines to be skipped at the beginning of the CSV file. By default no lines are skipped.
- **headerRow**: Tells which line contains the header of the CSV file. If given, the mapping is done not by index (ordinal number) but by column name.

Further parameters for the source data:

- **resource**: Path of the CSV file, which can even be a URL.
- **encoding**: Encoding of the CSV file. By default `UTF-8` is used.

An example extract of a YAML configuration with a CSV source:

    source: CSV
    resource: path/to/file/name.csv
    delimiter: "|"
    headerRow: 0
    skipRows: 1

#### Database

Source type for getting data from databases. It can perform more than one function, but first let us examine the parameters necessary for getting data:

- **url**: Connection URL of the database.
- **schema**: Schema of the database to connect to. Not required.
- **table**: Table of the database from which the data are extracted.
- **statement**: A custom query can be given. If it is given, there is no need to give the `table` parameter.

In light of this, let's see two examples of YAML configuration. Let the first be the `test` table of an `SQLite` database:

    source: Database
    url: sqlite:///tests/test_sources/test_db_source.db
    table: test

The second one is a custom query against a `PostgreSQL` database:

    source: Database
    url: 'postgresql://felhasznalo:jelszo@localhost:5432/adatbazis'
    statement: "select c.*, p.* from public.t_customer as c inner join public.t_purchase as p on ( p.cid = c.id ) where p.purchase_date >= CURRENT_TIMESTAMP - interval '2 months'"

#### FixedWidthText

Source type for using fixed-width files. Its initialisation parameter:

- **skipRows**: Skips the given number of lines at the beginning of the TXT file. By default no lines are skipped.

Further parameters for the source data:

- **resource**: Path of the TXT file, which can even be a URL.
- **encoding**: Encoding of the TXT file. By default it is `UTF-8`.

An example of a FixedWidthText configuration:

    source: FixedWidthText
    resource: path/to/file.txt
    skipRows: 1

#### GoogleSpreadsheet

It is also possible to use a Google Spreadsheet as a source. It doesn't require much data for initialisation; however, for getting the source data it does require several parameters:

- **username**: Username
- **password**: Password
- **spreadsheetKey**: Key of the spreadsheet
- **spreadsheetName**: Name of the spreadsheet
- **worksheetId**: ID of the worksheet
- **worksheetName**: Name of the worksheet

None of the above is mandatory on its own; however, the source is unable to work without proper data. When supplying the data, the following rules apply:

1. **Public Google SpreadSheet**: Only `spreadsheetKey` is required. Usage of public spreadsheets is not perfect: if the file contains `:` and `,` characters, they can give problematic results. It is Google's fault, because in the case of public documents it does not return values per cell, but as one complete text, without protecting these characters.

2. **Non-public spreadsheet**: It is mandatory to give the `username` and `password` fields, plus one of `spreadsheetKey` or `spreadsheetName`. If we wish to refer to a given worksheet, it is enough to supply one of `worksheetId` or `worksheetName`.

An example of a YAML configuration of a public Google Spreadsheet:

    source: GoogleSpreadsheet
    spreadsheetKey: 0ApA_54tZDwKTdHNGNVFRX3g1aE12bXhzckRzd19aNnc
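
For a non-public spreadsheet, the rules above give a configuration along these lines (the credentials, spreadsheet and worksheet names are placeholders):

    source: GoogleSpreadsheet
    username: someone@example.com
    password: secret
    spreadsheetName: Authors
    worksheetName: Sheet1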

#### JSON

Source type used for JSON files. Its initialisation parameter:

- **rootIterator**: Name of the root element that contains the list of data. Not required, but if it is not given, the whole JSON file is considered one record. Such data can be processed en masse with `metl-walk`.

An example for the above, where the value of `rootIterator` is `items`:

    {
       "items":[
          {
             "lat":47.5487066254,
             "lng":19.0546094353,
             "nev":"Óbudaisziget"
          },
          …
       ]
    }

Further parameters for the source data:

- **resource**: Path of the JSON file, which can even be a URL.
- **encoding**: Encoding of the JSON file. By default `UTF-8` is expected.

An example of a YAML configuration:

    source: JSON
    resource: path/to/file.json
    rootIterator: items

#### Static

Source type mainly used for testing, in which the configuration file itself contains the records. It has only one parameter:

- **sourceRecords**: List of data in arbitrary order.

An example from above:

    source: Static
    sourceRecords:
      - [ 'El Agent', 'El Agent@metl-test-data.com', 2008, 2008 ]
      - [ 'Serious Electron', 'Serious Electron@metl-test-data.com', 2008, 2013 ]
      - [ 'Brave Wizard', 'Brave Wizard@metl-test-data.com', 2008, 2008 ]
      - [ 'Forgotten Itchy Emperor', 'Forgotten Itchy Emperor@metl-test-data.com', 2008, 2013 ]

#### TSV

Source type used for TSV files. Its initialisation parameters:

- **delimiter**: The separator character in the TSV file. By default `\t` is used.
- **quote**: The character used to protect data that contains the above-mentioned delimiter. By default `"` is used.
- **skipRows**: Sets the number of lines to be skipped at the beginning of the TSV file. By default no lines are skipped.
- **headerRow**: The number of the row that contains the header of the TSV file. If given, the mapping is done by column name and NOT by index (ordinal number).

Further parameters for the source data:

- **resource**: Path of the TSV file, which can even be a URL.
- **encoding**: Encoding of the TSV file. By default it is `UTF-8`.

An example of a YAML configuration with a TSV source:

    source: TSV
    resource: path/to/file.tsv
    headerRow: 0
    skipRows: 1

#### XLS

Source type used for XLS files. Its initialisation parameter:

- **skipRows**: The number of lines to be skipped. By default no lines are skipped.

Further parameters for the source data:

- **resource**: Path of the XLS file, which can even be a URL.
- **encoding**: Encoding of the XLS file. By default `UTF-8` is expected.
- **sheetName**: Name or number of the sheet in the XLS file.
- **mergeSheets**: Merges all sheets. If the value is `true`, defining `sheetName` is not necessary. By default no merging happens.

An example of an XLS configuration:

    source: XLS
    resource: path/to/file.xls
    skipRows: 1
    sheetName: Sheet1

#### XML

Source type for XML files. Its initialisation parameter:

- **itemName**: Name of the block containing the data. Not required; however, in that case the whole file is considered one record. Such data can be processed en masse with `metl-walk`.

An example of giving `itemName` when the file contains more than one record. In this case, the value of `itemName` should be `item`:

    <?xml version="1.0" ?>
    <items>
      <item>
        <lat>
          47.5487066254
        </lat>
        <lng>
          19.0546094353
        </lng>
        <nev>
          Óbudaisziget
        </nev>
      </item>
      …
    </items>

Further parameters for the source data:

- **resource**: Path of the XML file, which can even be a URL.
- **encoding**: Encoding of the XML file. By default `UTF-8` is expected. The header of the XML file contains an encoding attribute, which should match the file's actual encoding.

An example of an XML configuration:

    source: XML
    resource: path/to/file.xml
    itemName: item

Later on, during mapping, one should take into account that the XML and its paths are accessed through the <a href="https://github.com/bfaludi/XML2Dict">xml2dict</a> package, therefore an element's value is found at its `text` attribute. An example of a mapping for the `latitude`, `longitude` and `name` fields:

    map:
      latitude: lat/text
      longitude: lng/text
      name: nev/text

#### Yaml

Source type used for YAML files. Its initialisation parameter:

- **rootIterator**: Name of the root element that contains the list of data.

An example for the above, where the value of `rootIterator` is `items`:

    items:
    - district_id: 3
      lat: 47.5487066254
      lng: 19.0546094353
      nev: "Óbudaisziget"

Further parameters for the source data:

- **resource**: Path of the YAML file, which can even be a URL.
- **encoding**: Encoding of the YAML file. By default `UTF-8` is expected.

An example of a YAML configuration:

    source: Yaml
    resource: path/to/file.yml
    rootIterator: items

A couple of pages above it was mentioned that the source is responsible for the following:

1. Description of the type and format of the files containing data (source)
2. Description of the processed data structure (fields)
3. The mapping between the above two (map)

We've seen the first function; let us now examine how we describe the processed data.

### Field

It is mandatory to give the fields of the source for all source files. Naturally, if a field is not necessary for the process, it does not have to be included, unless we want it to appear in the output. But those fields into which we would like to write values must be listed, as during the process there is no possibility to add new fields.
All fields can possess the following attributes:

* **name**: Name of the field; it must be unique.
* **type**: Type of the field; by default it is String.
* **map**: Description of the mapping. Not necessary to supply here; can be given at the source level as well.
* **finalType**: Final type of the field, if the transformations changed it compared to the original type.
* **key**: Whether it is a key field or not. To be able to use the migration capabilities in the future, set this value to `true` in all cases.
* **defaultValue**: Can be used only if there is no mapping defined for the field.
* **transforms**: Transformation steps.
* **limit**: Field length, used in databases.
* **nullable**: Whether the field can be left empty or not. If not, an empty value is stored as empty text.

An example of a YAML configuration:

    - name: uniquename
      type: Float

Important functions in Python:

- **setValue( value ):** Sets the field's value.
- **getValue():** Gets the current value of the field.

An example in Python:

    f = Field( 'uniquename', FloatFieldType(), key = True )
    f.setValue( u'5,211' )
    print repr( f.getValue() )
    # 5.211

#### FieldType

Each field possesses a type. The following types are currently handled by mETL:

- **Boolean**: True-false field.
- **Complex**: Complex type for any kind of data storage. Worth using for dicts/lists when we need to work with the given value later.
- **Date**: Date type.
- **DateTime**: Date and time type.
- **Float**: Fractional number field type.
- **Integer**: Whole number field type.
- **List**: List type for any kind of data storage.
- **String**: Text field type.
- **Text**: Long text field type.

The type is used for conversion. Its basic task is to convert an incoming value to the defined type. If the conversion is unsuccessful, or the value is empty (e.g. empty text), the result is a `None` value. All field types can hold `None`; such a value is still valid for the type.

Example from Python:

    print repr( DateFieldType().getValue( u'22/06/2013 11:33:11 GMT+1' ) )
    # datetime.date(2013, 6, 22)

Usage in a YAML configuration file:

    type: Date

The most interesting type is `List`, since it is hard to picture for certain resources. In `XML` and `JSON` it is stored in its original format, in `CSV` and `TSV` the Python list is converted to text format, while in the case of `Database` it is stored as `JSON` in a `VARCHAR` field.

#### Transforms
Field transforms within mETL are handled by the TARR package. An ordinary list should be used, which can contain both transforms and statements.
However, on the configuration side there is a possibility to arrange the steps with the `then` structure to make the configuration easier to read.

Its functioning is simple: it runs the transforms, evaluating the statements along the way, and at the end, if the `finalType` of the field differs from the field's earlier type, it tries to convert the value.

Let's examine the following YAML configuration for a field:

    - name: district
      type: Integer
      finalType: String
      transforms:
        - transform: ConvertType
          fieldType: String
        - transform: Map
          values:
            '1': Budavár
            '2': null
            '3': 'Óbuda-Békásmegyer'
            '4': Újpest
            '5': 'Belváros-Lipótváros'
            '6': Terézváros
            '7': Erzsébetváros
            '8': Józsefváros
            '9': Ferencváros
            '10': Kőbánya
            '11': Újbuda
            '12': Hegyvidék
            '13': 'Angyalföld-Újlipótváros'
            '14': Zugló
            '15': null
            '16': null
            '17': Rákosmente
            '18': 'Pestszentlőrinc-Pestszentimre'
            '19': Kispest
            '20': Pestszenterzsébet
            '21': Csepel
            '22': 'Budafok-Tétény'
            '23': Soroksár

What needs to be noticed first is that the `Integer` type is used during the read, so we can be sure that every value which cannot be handled as a number is stored as `None`. Since the `finalType` is `String`, a type change will be performed during the transform as well. The first transform is a `ConvertType`, which performs the above type change, while the next step is the `Map`, which substitutes one value for another. This way we created a text field from a number field, showing the original names of the districts. Every transformation must be named with the keyword `transform`.

These types of transforms can be defined for each field, one by one.

Before getting into the description of more difficult transforms, let's take a look at another simple example:

    - name: district_roman
      type: Integer
      finalType: String
      transforms:
        - transform: ConvertType
          fieldType: String
        - transform: tests.test_source.convertToRomanNumber

Our aim with this field is to generate a Roman numeral from a whole number value. Since mETL does not have this transform by default, we use the content of another package. If it is not an installed package, the `PATH` variable can be supplemented for mETL with the **`-p`** parameter to load the necessary Python module.

    @tarr.rule
    def convertToRomanNumber( field ):

        if field.getValue() is None:
            return None

        number = int( field.getValue() )
        ints = (1000, 900, 500, 400, 100,  90, 50, 40, 10, 9, 5, 4, 1)
        nums = ('M', 'CM', 'D', 'CD', 'C', 'XC', 'L', 'XL', 'X', 'IX', 'V', 'IV', 'I')

        result = ""
        for i in range( len( ints ) ):
            count   = int( number / ints[i] )
            result += nums[i] * count
            number -= ints[i] * count

        field.setValue( '%s.' % ( result ) )
        return field

With this method we can easily add unique transforms to our project. It has only one shortcoming: no further parameters can be given to transforms defined this way, so it may not be usable for more general tasks.

See the StripTransform code as an example:

    class StripTransform( metl.transform.base.Transform ):

        init = ['chars']

        # void
        def __init__( self, chars = None, *args, **kwargs ):

            self.chars = chars

            super( StripTransform, self ).__init__( *args, **kwargs )

        def transform( self, field ):

            if field.getValue() is None:
                return field

            field.setValue( field.getValue().strip( self.chars ) )
            return field

This method allows us to add transforms to the system that accept further parameters, in the following way:

    transforms:
      ...
      - transform: package.path.StripTransform
        chars: '-'
      ...

The above example is not the best, since for the transforms built into mETL the word 'Transform' never needs to be appended to the name, and no path needs to be supplied either.

It was mentioned that transforms support statements as well. Let's see an example YAML configuration for this case:

    - name: intervalled
      type: Date
      map: created
      transforms:
        - statement: IF
          condition: IsBetween
          fromValue: 2012-02-02
          toValue: 2012-09-01
          then:
            - transform: ConvertType
              fieldType: Boolean
              hard: true
              defaultValue: true
        - statement: ELSE
          then:
            - transform: ConvertType
              fieldType: Boolean
              hard: true
            - transform: Set
              value: false
        - statement: ENDIF
      finalType: Boolean

The above reads a date field, from which a true-false value is generated by the end of the process. To achieve this, statements are used. As in every low-level programming language, `IF` needs to be closed with `ENDIF`. The example checks whether the read date falls within the given interval. If yes, the field takes the value `true`, otherwise it ends up as `false`. The example also shows several possibilities for setting the value.

In the case of statements, the keyword `statement` must be used instead of `transform`. For conditions, the keyword `condition` is the one needed; but first, let's see what conditions exist and how their parametrization works.

#### Condition
Each condition uses the keyword `condition`, but it has no importance on its own; it is only used by statements and certain manipulation objects for decision making.
A condition **decides the true or false value for exactly one field**; it cannot be used for entire lines or for correlations between fields!

Conditions work the following way in Python:

    f = Field( 'uniquename', FloatFieldType(), key = True, defaultValue = '5,211' )
    print repr( IsBetweenCondition( 5.11, '5,2111' ).getResult( f ) )
    # True

As in the case of ordinary transforms, unique conditions can be defined here as well, in the following way:

    @tarr.branch
    def IsGreaterThenFiveCondition( field ):

        return field.getValue() is not None and field.getValue() > 5

The above can also be done in a parameterized way:

    class IsGreaterCondition( metl.condition.base.Condition ):

        init = ['value']

        # void
        def __init__( self, value, *args, **kwargs ):

            self.value = value

            super( IsGreaterCondition, self ).__init__( *args, **kwargs )

        # bool
        def getResult( self, field ):

            if field.getValue() is None:
                return False

            return field.getValue() > field.getType().getValue( self.value )

Here you can see the code of one of the built-in conditions. This version is longer than the previous one because the number against which we compare is passed as a parameter, and a type conversion is performed on it so that the comparison happens between values of the same type.

##### IsBetween
Whether the field value falls into a given interval. It makes sense only for `Integer`, `Float`, `Date`, `DateTime` types. Its parameters:

- **fromValue**: Minimum value of the interval.
- **toValue**: Maximum value of the interval.

An example of a YAML configuration:

    condition: IsBetween
    fromValue: 2012-02-02
    toValue: 2012-09-01

##### IsEmpty
Checks whether the field is empty. No parameters are expected, and it can be used for all types.

An example of a YAML configuration:

    condition: IsEmpty

##### IsEqual
Whether the value of the field is the same as the value of the parameter. This condition can be used for all types. Its parameter:

- **value**: The value to compare against.

An example of a YAML configuration:

    condition: IsEqual
    value: 2012-02-02

##### IsGreaterAndEqual
Whether the value of the field is greater than or equal to the value of the parameter. This condition can be used for all types. Its parameter:

- **value**: The value to compare against.

An example of a YAML configuration:

    condition: IsGreaterAndEqual
    value: 2012-02-02

##### IsGreater
Whether the value of the field is greater than the value of the parameter. This condition can be used for all types. Its parameter:

- **value**: The value to compare against.

An example of a YAML configuration:

    condition: IsGreater
    value: 2012-02-02

##### IsLessAndEqual
Whether the value of the field is less than or equal to the value of the parameter. This condition can be used for all types.
##### IsBetween\r\nChecks whether the field value falls into a given interval. It makes sense to use it only in the case of `Integer`, `Float`, `Date`, `DateTime` types. Its parameters:\r\n\r\n- **fromValue**: Minimum value of the interval\r\n- **toValue**: Maximum value of the interval\r\n\r\nAn example of a YAML configuration:\r\n\r\n\tcondition: IsBetween\r\n\tfromValue: 2012-02-02\r\n\ttoValue: 2012-09-01\r\n\r\n##### IsEmpty\r\nChecks whether the field is empty. No parameters are expected and it can be used for all types.\r\n\r\nAn example of a YAML configuration:\r\n\r\n\tcondition: IsEmpty\r\n\r\n##### IsEqual\r\nThe value of the field is the same as the value of the parameter. This condition can be used for all types. Its parameter:\r\n\r\n- **value**: The value that is examined during the comparison\r\n\r\nAn example of a YAML configuration:\r\n\r\n\tcondition: IsEqual\r\n\tvalue: 2012-02-02\r\n\r\n##### IsGreaterAndEqual\r\nThe value of the field is greater than or equal to the value of the parameter. This condition can be used for all types. Its parameter:\r\n\r\n- **value**: The value that is examined during the comparison\r\n\r\nAn example of a YAML configuration:\r\n\r\n\tcondition: IsGreaterAndEqual\r\n\tvalue: 2012-02-02\r\n\r\n##### IsGreater\r\nThe value of the field is greater than the value of the parameter. This condition can be used for all types. Its parameter:\r\n\r\n- **value**: The value that is examined during the comparison\r\n\r\nAn example of a YAML configuration:\r\n\r\n\tcondition: IsGreater\r\n\tvalue: 2012-02-02\r\n\r\n##### IsLessAndEqual\r\nThe value of the field is less than or equal to the value of the parameter. This condition can be used for all types. Its parameter:\r\n\r\n- **value**: The value that is examined during the comparison\r\n\r\nAn example of a YAML configuration:\r\n\r\n\tcondition: IsLessAndEqual\r\n\tvalue: 2012-02-02\r\n\r\n##### IsLess\r\nThe value of the field is less than the value of the parameter. This condition can be used for all types. Its parameter:\r\n\r\n- **value**: The value that is examined during the comparison\r\n\r\nAn example of a YAML configuration:\r\n\r\n\tcondition: IsLess\r\n\tvalue: 2012-02-02\r\n\r\n##### IsIn\r\nThe value of the field is one of the values of the parameter. This condition can be used for all types. Its parameter:\r\n\r\n- **values**: The list of values that are examined during the comparison\r\n\r\nAn example of a YAML configuration:\r\n\r\n\tcondition: IsIn\r\n\tvalues:\r\n\t  - MICRA / MARCH\r\n\t  - PATHFINDER\r\n\t  - ALMERA TINO\r\n\t  - PRIMASTAR\r\n\r\n##### IsInSource\r\nIt works much like IsIn, but the values used in the examination are loaded from another source file, and the field value is checked for inclusion against that file. Parameter:\r\n\r\n- **join**: The name of the field in the other source which contains the values the IsIn check will be applied to.\r\n\r\nThere are other parameters as well, since the entire `Source` configuration must be attached to this condition.\r\n\r\nAn example of a YAML configuration:\r\n\r\n\tcondition: IsInSource\r\n\tsource: Yaml\r\n\tresource: examples/vins.yml\r\n\trootIterator: vins\r\n\tjoin: vin\r\n\tfields:\r\n\t  - name: vin\r\n\t    type: String\r\n\r\n##### IsMatchByRegexp\r\nIt uses a regular expression to evaluate the field value. The field passes if the regular expression matches its value. Its parameters:\r\n\r\n- **regexp**: Regular expression to be examined.\r\n- **ignorecase**: Ignore the difference between lower and upper case during the evaluation of the regular expression. Case is differentiated by default!\r\n\r\nAn example of a YAML configuration:\r\n\r\n\tcondition: IsMatchByRegexp\r\n\tregexp: '^.*[0-9]+.*$'\r\n\tignorecase: false\r\n\r\n#### Statement\r\nThe key word **statement** must be used instead of **transform**. Statements can only be used during the transform steps of fields, to create the final form of the field value and to carry out a successful data cleanup. Statements **can be nested in each other without limit**, but all of them must be closed. \r\n\r\n##### IF\r\nIt serves as an \"if, then\" branch just as in regular programming languages. Each **IF** must be followed by an **ENDIF** later on. Its parameter:\r\n\r\n- **condition**: Condition with all its needed parameters.\r\n\r\nIt is not mandatory, but a **then** can be added to it in the YAML configuration as well, containing the transforms that belong to it.\r\n\r\n##### IFNOT\r\nIt serves as an \"if not, then\" branch. The same rules and parameters apply to it as for **IF**. Its parameter:\r\n\r\n- **condition**: Condition with all its needed parameters.\r\n\r\n##### ELIF\r\nThis statement is used if the previous condition is not met but we want to define a new one. It can be used between **IF** and **ENDIF**, but always before **ELSE**. The same rules and parameters apply to it as for **IF**. Its parameter:\r\n\r\n- **condition**: Condition with all its needed parameters.\r\n\r\n##### ELIFNOT\r\nThe same rules and parameters apply to it as for **ELIF**, but its branch is taken when the condition is not met.
Its parameter:\r\n\r\n- **condition**: Condition with all its needed parameters.\r\n\r\n##### ELSE\r\nIt is used if the condition is not met and we do not want to set further conditions, but a path without a condition is needed where the transform can go. It must be used between **IF** and **ENDIF**. If **ELIF** or **ELIFNOT** is present, this statement must be added after them. It does not have any parameter.\r\n\r\n##### ENDIF\r\nKey word to close an **IF** condition. It does not have any parameter.\r\n\r\n##### ReturnTrue\r\nIt exits from the statement and ends the remaining transforms. It does not have any parameter.\r\n\r\n#### Transform\r\nIt was mentioned that a list of transforms can be defined for fields. These transformations are labelled with the key word **transform**, after which the name of the used transformation must be added. The following transforms are available in the system.\r\n\r\n##### Add\r\nIt adds a number to the value of the field. Can be used only in the case of `Integer` and `Float` fields. Its parameter:\r\n\r\n- **number**: The number with which we want to increase the value of the field.\r\n\r\nAn example of a YAML configuration:\r\n\r\n\t- transform: Add\r\n\t  number: 4\r\n\t\r\nAn example for the result of the above transform:\r\n\r\n\t12 \r\n\t=\u003e 16\r\n\r\n##### Clean\r\nRemoves the given stop characters from the defined field. It is important that it can be used only in the case of `String` and `Text` fields. Its parameters are not mandatory, but can be redefined:\r\n\r\n- **stopChars**: Which characters to remove from the values of the field. By default: .,!?\"\r\n- **replaces**: List of value pairs prescribing what to replace with what as part of the clean process.\r\n\r\nAn example of a YAML configuration:\r\n\r\n\t- transform: Clean\r\n\t  replaces:\r\n\t    many: 1+\r\n\t\r\nAn example for the result of the above transform:\r\n\r\n\t'  That is a good sentence, which is contains many english word! ' \r\n\t=\u003e 'That is a good sentence which is contains 1+ english word'\r\n\r\n##### ConvertType\r\nModifies the type of the field to another type. Since not all field types can be converted to another type without loss, more parameters need to be set here.\r\n\r\n- **fieldType**: Name of the new field type.\r\n- **hard**: With this forced modification the current values are destroyed. It has a false value by default.\r\n- **defaultValue**: Sets the default value for a forced modification. No default value is defined by default.\r\n\r\nIf we want to convert a date into text, there is no need for hard mode since this conversion is easily processed, and in a lucky case it can be executed in the other direction as well. However, hard mode must be used to convert the value of a date field to Boolean. It is important to note that **the finalType value of the field does not work on the hard principle**, so a hard conversion must be ensured with a transform beforehand.\r\n\r\nAn example of a YAML configuration:\r\n\r\n\t- transform: ConvertType\r\n\t  fieldType: Boolean\r\n\t  hard: true\r\n\t  defaultValue: true\r\n\r\nor\r\n\r\n\t- transform: ConvertType\r\n\t  fieldType: String
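An example for the result of the first variant (the value is illustrative; in hard mode the current value is destroyed and the field takes the given defaultValue):\r\n\r\n\t'2012-05-13'\r\n\t=\u003e true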
##### Homogenize\r\nIt changes accentuated characters to non-accentuated ones in the case of `String` and `Text` fields. It is a very frequent step when we want to match values against data coming from other sources: the quality of the data can be questionable, but this way we can make crosschecks easily. No parameter is expected.\r\n\r\nAn example of a YAML configuration:\r\n\r\n\t- transform: Homogenize\r\n\t\r\nAn example for the result of the above transform:\r\n\r\n\tu'árvíztűrőtükörfúrógépÁRVÍZTŰRŐTÜKÖRFÚRÓGÉP' \r\n\t=\u003e 'arvizturotukorfurogeparvizturotukorfurogep'\r\n\r\n##### LowerCase\r\nIt changes the field value to lowercase in the case of `String` and `Text` fields. No parameter is expected.\r\n\r\nAn example of a YAML configuration:\r\n\r\n\t- transform: LowerCase\r\n\t\r\nAn example for the result of the above transform:\r\n\r\n\t'That is a good sentence, which is contains many english word!'\r\n\t=\u003e 'that is a good sentence, which is contains many english word!'\r\n\r\n##### Map\r\nIt changes the value of the field to other values, given as key-value pairs. It works appropriately only in the case of `String` and `Text` field types. Its parameters:\r\n\r\n- **values**: Group of key-value pairs that contains the conversion values.\r\n- **ignorecase**: Whether to ignore the difference between lowercase and uppercase during the evaluation.\r\n- **elseValue**: Not a mandatory parameter. It should be given if we want to map every value not included in the defined list to a fixed value.\r\n- **elseClear**: Not a mandatory parameter. It should be given if we want to clear all values that are not among the listed values.\r\n\r\nAn example can be the previously seen YAML configuration:\r\n\r\n\t- transform: Map\r\n\t  values:\r\n\t    '1': Budavár\r\n\t    '2': null\r\n\t    '3': 'Óbuda-Békásmegyer'\r\n\t    '4': Újpest\r\n\t    '5': 'Belváros-Lipótváros'\r\n\t    '6': Terézváros\r\n\t    '7': Erzsébetváros\r\n\t    '8': Józsefváros\r\n\t    '9': Ferencváros\r\n\t    '10': Kőbánya\r\n\t    '11': Újbuda\r\n\t    '12': Hegyvidék\r\n\t    '13': 'Angyalföld-Újlipótváros'\r\n\t    '14': Zugló\r\n\t    '15': null\r\n\t    '16': null\r\n\t    '17': Rákosmente\r\n\t    '18': 'Pestszentlőrinc-Pestszentimre'\r\n\t    '19': Kispest\r\n\t    '20': Pestszenterzsébet\r\n\t    '21': Csepel\r\n\t    '22': 'Budafok-Tétény'\r\n\t    '23': Soroksár\r\n\r\nwhich results in the following:\r\n\t\r\n\t'4'\r\n\t=\u003e 'Újpest'\r\n\r\n##### RemoveWordsBySource\r\nIt removes words from an arbitrary sentence in `String` or `Text` field types by using another source file. Words are separated by spaces, so it is highly recommended to run `Clean` before the removal.  \r\n\r\nIt does not have its own parameters, but the entire `Source` configuration is needed in this case as well. The source can contain only one field. If more fields are included, the transform considers only the first one, and the removals will be done based on the values defined there.\r\n\r\nAn example of a YAML configuration:\r\n\r\n\t- transform: RemoveWordsBySource\r\n\t  source: CSV\r\n\t  resource: materials/hu_stopword.csv\r\n\t  fields:\r\n\t    - name: word\r\n\t      type: String\r\n\t      map: 0\r\n\t      \r\n##### ReplaceByRegexp\r\nIt executes a replacement based on regular expressions in `String` and `Text` field types. Parameters that can be used:\r\n\r\n- **regexp**: Regular expression based on which the replacement is done.\r\n- **to**: The output/target of the replacement. As opposed to the usual Python syntax, `$` must be used instead of `\\\\` to paste the captured groups.\r\n- **ignorecase**: Whether to differentiate between lowercase and uppercase during the evaluation of the regular expression.
Differentiation is in place by default.\r\n\r\nAn example for a YAML configuration where only the year-month pair is kept from a text based date format:\r\n\r\n\t- transform: ReplaceByRegexp\r\n\t  regexp: '^([0-9]{4}-[0-9]{2})-[0-9]{2}$'\r\n\t  to: '$1'\r\n\t  \r\nThe above results in the following:\r\n\r\n\t'2013-04-15'\r\n\t=\u003e '2013-04'\r\n          \r\n##### ReplaceWordsBySource\r\nIt replaces words in `String` or `Text` field types by using another source file. Its parameter:\r\n\r\n- **join**: The name of the field in the other source which contains the value based on which we want to join the current source with the other one. The field name must be identical in both sources!\r\n\r\nThere are other parameters as well, since the entire `Source` configuration must be attached to this transform. **Important to note that the source defined here can contain only 2 fields - including the field defined during join!** Thus the value to which the replacement takes place will be the column that is not part of the join condition!\r\n\r\nAn example of a YAML configuration:\r\n\r\n\t- transform: ReplaceWordsBySource\r\n\t  join: KEY\r\n\t  source: CSV\r\n\t  resource: materials/hu_wordtoenglish.csv\r\n\t  fields:\r\n\t    - name: KEY\r\n\t      type: Integer\r\n\t      map: 0\r\n\t    - name: VALUE\r\n\t      type: String\r\n\t      map: 1\r\n\r\n##### Set\r\nIt performs value setting in the case of any field type. Its parameter:\r\n\r\n- **value**: New field value\r\n\r\nIn the case of String and Text fields it is possible to paste the old value into the new one. Let's see a YAML configuration example for this:\r\n\r\n\t- transform: Set\r\n\t  value: '%(self)s or the new string'\r\n\t\r\nAn example for the result of the above transform:\r\n\r\n\t'Myself'\r\n\t=\u003e 'Myself or the new string'\r\n\r\n##### Split\r\nIt splits the value into words (by whitespace by default) and keeps the defined slice, in the case of `String` and `Text` field types. An exact number can be given (e.g.: 1), meaning that only the word with that index is kept, or an interval separated by a colon (e.g.: 2:-1). Its parameters:\r\n\r\n- **idx**: Index or interval of the excerpt, numbered from 0.\r\n- **chars**: The character on which the split should take place. Whitespace is set by default. \r\n\r\nAn example of a YAML configuration:\r\n\r\n\t- transform: Split\r\n\t  idx: '1:-1'\r\n\t\r\nAn example for the result of the above transform:\r\n\r\n\t'contains hungarian members attractive sadness killing'\r\n\t=\u003e 'hungarian members attractive sadness'\r\n\r\n##### Stem\r\nIt reduces the words in `String` and `Text` fields to their stem. Words are separated by spaces, so in most cases the usage of the `Clean` transform is necessary beforehand. Its parameter:\r\n\r\n- **language**: Language of the stemming\r\n\r\nAn example of a YAML configuration:\r\n\r\n\t- transform: Stem\r\n\t  language: English\r\n\t\r\nAn example for the result of the above transform:\r\n\r\n\t'contains hungarian members attractive sadness killing'\r\n\t=\u003e 'contain hungarian member attract sad kill'\r\n\r\nFor the execution the nltk **SnowballStemmer** package is used.\r\n\r\n##### Strip\r\nRemoves unnecessary spaces or other characters from the beginning and end of the value. Can be used only in the case of `String` and `Text` fields. Its parameter:\r\n\r\n- **chars**: Which characters to remove from the beginning and end of the text.
Whitespace is set by default.\r\n\r\nAn example of a YAML configuration:\r\n\r\n\t- transform: Strip\r\n\t\r\nAn example for the result of the above transform:\r\n\r\n\t'  That is a good sentence, which is contains many english word!   '\r\n\t=\u003e 'That is a good sentence, which is contains many english word!'\r\n\r\n##### Sub\r\nIt subtracts a number from the field value. Can be used only in the case of `Integer` and `Float` fields. Its parameter:\r\n\r\n- **number**: The number with which we want to decrease the actual value of the field.\r\n\r\nAn example of a YAML configuration:\r\n\r\n\t- transform: Sub\r\n\t  number: 4\r\n\t\r\nAn example for the result of the above transform:\r\n\r\n\t12 \r\n\t=\u003e 8\r\n\t\r\n##### Title\r\nIt capitalizes each word. Can be used only in the case of `String` and `Text` fields. No parameter is expected.\r\n\r\nAn example of a YAML configuration:\r\n\r\n\t- transform: Title\r\n\t\r\nAn example for the result of the above transform:\r\n\r\n\t'That is a good sentence, which is contains many english word!'\r\n\t=\u003e 'That Is A Good Sentence, Which Is Contains Many English Word!'\r\n\r\n##### UpperCase\r\nIt changes the field value to upper case in the case of `String` and `Text` fields. No parameter is expected.\r\n\r\nAn example of a YAML configuration:\r\n\r\n\t- transform: UpperCase\r\n\t\r\nAn example for the result of the above transform:\r\n\r\n\t'That is a good sentence, which is contains many english word!'\r\n\t=\u003e 'THAT IS A GOOD SENTENCE, WHICH IS CONTAINS MANY ENGLISH WORD!'\r\n\r\nIt has already been mentioned that the source is responsible for the following:\r\n\r\n1. To describe the type and form of the resource containing the data (source)\r\n2. To describe the data structure to be read (fields)\r\n3. To define the mapping between the above items (map)\r\n\r\nWe have already covered the first 2 points; let's see how the mapping of the processed data lines onto the arbitrary fields defined above works.\r\n\r\n### FieldMap\r\nWe have a resource whose values we want to read, and we have fields we want to put the values into for each line. The only missing item is a mapping, which we can define for all sources.\r\n\r\nFor fields, the fieldmap can be defined in two places:\r\n\r\n1. In the `Source` record under the `map` parameter.\r\n\r\n\t\tsource:\r\n\t\t  …\r\n\t\t  map:\r\n\t\t    MEZONEV: 0\r\n\t\t    MASIKMEZONEV: 2\r\n\t      … \r\n      \r\n2. Within the `Field` itself as `map`.\r\n\r\n\t\tsource:\r\n\t\t  …\r\n\t\t  fields:\r\n\t\t    …\r\n\t\t    - name: MEZONEV\r\n\t\t      map: 0\r\n\t\t    - name: MASIKMEZONEV\r\n\t\t      type: Integer\r\n\t\t      map: 2\r\n\t\t    … \r\n\t      … \r\n\r\nBoth will produce the same result. The first version is the better choice if we want to derive a configuration file from it, since in this case the fields do not need to be redefined. The second version is better in the sense that it is more transparent, since everything is where it belongs. **If no map is defined for a field, then values are searched based on the field name by default.**\r\n\r\nEach map means a path to the \"data\". The path can contain words, numbers (indices) and combinations of them, divided by a `/`. \r\n\r\nIn the light of this, let's see a more complex example based on which it will be easier to understand the process.
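First, though, a minimal sketch of how such paths resolve (the record and the paths here are invented for illustration; the `FieldMap` API is the one used in the complex example below):\r\n\r\n\timport metl.fieldmap\r\n\t\r\n\trecord = { 'user': { 'name': 'Ada', 'emails': [ 'ada@ceu.hu', 'lovelace@ceu.hu' ] } }\r\n\t\r\n\tprint repr( metl.fieldmap.FieldMap({\r\n\t    'name': 'user/name',             # dictionary keys, separated by /\r\n\t    'first_email': 'user/emails/0',  # numeric index into a list\r\n\t    'missing': 'user/phone'          # unknown paths resolve to None\r\n\t}).getValues( record ) )\r\n\t\r\n\t# {'name': 'Ada', 'first_email': 'ada@ceu.hu', 'missing': None}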
The `XML`, `JSON`, `YAML` resources can contain multidimensional lists, but when the data is coming from `Database` or `GoogleSpreadsheet`, it has to be a one-dimensional list.\r\n\r\n\tpython_dict = {\r\n\t    'first': {\r\n\t        'of': {\r\n\t            'all': 'dictionary',\r\n\t            'with': [ 'many', 'list', 'item' ]\r\n\t        },\r\n\t        'and': [ (0, 1), (1, 2), (2, 3), (3, 4) ]\r\n\t    },\r\n\t    'filtered': [ { \r\n\t        'name': 'first', \r\n\t        'value': 'good'\r\n\t    }, {\r\n\t        'name': 'second',\r\n\t        'value': 'normal'\r\n\t    }, {\r\n\t        'name': 'third',\r\n\t        'value': 'bad'\r\n\t    } ],\r\n\t    'emptylist': {\r\n\t        'item': 'itemname'\r\n\t    },\r\n\t    'notemptylist': [\r\n\t        { 'item': 'itemname' },\r\n\t        { 'item': 'seconditemname' }\r\n\t    ],\r\n\t    'strvalue': 'many',\r\n\t    'strlist': [ 'many', 'list', 'item' ],\r\n\t    'root': 'R'\r\n\t}\r\n\t\r\n\tprint repr( metl.fieldmap.FieldMap({\r\n\t    'list_first': 'first/of/with/0',\r\n\t    'list_last': 'first/of/with/-1',\r\n\t    'tuple_last_first': 'first/and/-1/0',\r\n\t    'not_existing': 'first/of/here',\r\n\t    'root': 'root',\r\n\t    'dict': 'first/of/all',\r\n\t    'filtered': 'filtered/name=second/value',\r\n\t    'list': 'filtered/*/value',\r\n\t    'emptylistref': 'emptylist/~0/item',\r\n\t    'notemptylistref': 'notemptylist/~0/item',\r\n\t    'strvalue': 'strvalue',\r\n\t    'strvalue1': 'strvalue/!/0',\r\n\t    'strvalue2': 'strvalue/!/1',\r\n\t    'strlist1': 'strlist/!/0',\r\n\t    'strlist2': 'strlist/!/1'\r\n\t}).getValues( python_dict ) )\r\n\t\r\n\t# {'list_first': 'many', 'not_existing': None, 'dict': 'dictionary', 'tuple_last_first': 3, 'list_last': 'item', 'root': 'R', 'filtered': 'normal', 'list': ['good', 'normal', 'bad'], 'emptylistref': 'itemname', 'notemptylistref': 'itemname', 'strvalue': 'many', 'strvalue1': 'many', 'strvalue2': None, 'strlist1': 'many', 'strlist2': 'list'}\r\n     \r\nIf tabular data sources are used - like `CSV`, `TSV`, `XLS` - lists arrive. Note that in the case of `CSV` and `TSV` it can be achieved that they receive one-dimensional values in the above format, by defining the `headerRow` parameter.\r\n\r\n\tpython_list = [ 'many', 'list', 'item' ]\r\n\t\r\n\tprint repr( metl.fieldmap.FieldMap({\r\n\t    'first': 0,\r\n\t    'last': '-1',\r\n\t    'not_existing': 4\r\n\t}).getValues( python_list ) )\r\n\t\r\n\t# {'last': 'item', 'not_existing': None, 'first': 'many'}\r\n    \r\nThe more important operators:\r\n\r\n - `/`: Defines the next level in the given path/mapping.\r\n - `*`: Takes all elements in the case of lists. If we want to save the result in this format instead of converted text, the usage of the List type and a JSON or XML target type is recommended. This operator is used mainly in the case of XML and JSON sources.\r\n - `~`: Test operator, used when either a list or a dict can exist on the same level. If a list exists, the given index is taken; if a dict exists, nothing happens and the process goes on along the given path from the next element. This operator is used in the case of XML and JSON sources.\r\n - `!`: Converter operator that wraps a single value into a list if needed. It is used if we want to get a list but it is not known whether we will get one.
This operator is used in the case of XML sources.\r\n\r\n### Manipulation\r\nAfter the whole line is processed, the values are in the fields and the transforms are done on the field level, there is a possibility to manipulate the entire, cleaned values based on their correlations. There are 4 key words that can be used - each of them labels a single type: `modifier`, `filter`, `expand`, `aggregator`. We will mainly use them during our more complex tasks (e.g. API communication) Manipulation steps can follow each other **in any order**, regardless of the type. As soon as one of them finishes, it gives the result to the next one. This process continues until the `Target` object is reached.\r\n\r\n\tmanipulations:\r\n\t  - filter: DropByCondition\r\n\t    condition: IsMatchByRegexp\r\n\t    regexp: '^.*\\-.*$'\r\n\t    fieldNames: name \r\n\t  - modifier: Set\r\n\t    fieldNames: \r\n\t      - district_search\r\n\t      - district_copy\r\n\t    value: '%(district)s'\r\n\t  - modifier: TransformField\r\n\t    fieldNames: district_copy\r\n\t    transforms:\r\n\t      - statement: IFNot\r\n\t        condition: IsEmpty\r\n\t        then:\r\n\t          - transform: Set\r\n\t            value: '%(self)s, '\r\n\t      - statement: ENDIF\r\n\t  - modifier: TransformField\r\n\t    fieldNames: \r\n\t      - name_search\r\n\t      - district_search\r\n\t    transforms:\r\n\t      - transform: Clean\r\n\t      - transform: LowerCase\r\n\t      - transform: Homogenize\r\n\t  - modifier: Set\r\n\t    fieldNames: formatted_name\r\n\t    value: '%(district_roman)s kerület, %(district_copy)s%(name)s'\r\n\t  - filter: tests.test_source.DropIfSameNameAndDistrict\r\n\t  - filter: DropField\r\n\t    fieldNames:\r\n\t      - name_search\r\n\t      - district_copy\r\n\t      - district_search\r\n\t      - district_roman\r\n\t      - district_id\r\n\t      - region_id\r\n\t      \r\nThe above example will not be explained in details, the main points are to show the key word usage and the format.\r\n\r\n#### Modifier\r\nModifiers are those objects that are given a whole line (record) and always return with a whole line. However, during their processes they make changes to values with the usage of the related values of different fields. 
In manipulations they always start with the key word `modifier`, and we will use them most of the time during our work.\r\n\r\nBefore examining what system level modifiers mETL has, let's see how we can add new ones, as this step will be needed most frequently.\r\n\r\n\timport urllib, demjson\r\n\tfrom metl.utils import *\r\n\t\r\n\tclass MitoAPIPhoneSearch( Modifier ):\r\n\t\r\n\t    # str\r\n\t    def getURL( self, firstname, lastname, city ):\r\n\t\r\n\t        return 'http://mito.api.hu/api/KEY/phone/search/hu/%(firstname)s/%(lastname)s/%(city)s' % {\r\n\t            'firstname': urllib.quote( firstname.encode('utf-8') ),\r\n\t            'lastname': urllib.quote( lastname.encode('utf-8') ),\r\n\t            'city': urllib.quote( city.encode('utf-8') )\r\n\t        }\r\n\t\r\n\t    # FieldSet\r\n\t    def modify( self, record ):\r\n\t\r\n\t        url = self.getURL( \r\n\t            record.getField('FIRSTNAME').getValue(), \r\n\t            record.getField('LASTNAME').getValue(),\r\n\t            record.getField('CITY').getValue()\r\n\t        )\r\n\r\n\t        fp = urllib.urlopen( url )\r\n\t        result = demjson.decode( fp.read() )\r\n\t        phones = list( set([\r\n\t            r.get('phone',{}).get('format',{}).get('e164') \\\r\n\t            for r in result['result']\r\n\t        ]))\r\n\t\r\n\t        record.getField('PHONENUMBERS').setValue( u', '.join( phones ) )\r\n\t\r\n\t        return record\r\n\r\nThis example shows that from three values read from any source (first name, last name, city) we create a new value (a list of phone numbers), gathering the data through an API request. The above takes no parameters, so it can be easily embedded in the process.\r\n\r\n\t- modifier: package.path.MitoAPIPhoneSearch\r\n\r\nWe need to pay attention to two things during extension:\r\n\r\n1. It needs to be derived from the `Modifier` class\r\n2. The `modify` function needs to be overridden, and it must return the `record`\r\n\r\n##### JoinByKey\r\nIt joins two sources by a key defined by the inner source. During the process, the key fields must be marked in the inner source, and those fields will be searched for in the outer source. If there is a match, the outer source is refreshed with the fields listed in `fieldNames`. **The same name must be given** for both the key and the fields that need to be refreshed. In a key based join, only one record can belong to one line. \r\n\r\n- **fieldNames**: Which fields are to be updated.
Fields must have the same name in both sources!\r\n\r\nAn example of a YAML configuration:\r\n\r\n\tsource: \r\n\t  source: XML\r\n\t  resource: outer_source.xml\r\n\t  ...\r\n\t  itemName: property\r\n\t  fields:\r\n\t    - name: originalId\r\n\t      map: \"source-system-id\"\r\n\t\r\n\t    - name: agentId\r\n\t      map: \"agent-id\"\r\n\t\r\n\t    ...\r\n\t\r\n\t    - name: phone\r\n\t    - name: email\r\n\t\r\n\tmanipulations:\r\n\t  ...\r\n\t  - modifier: JoinByKey\r\n\t    source: XML\r\n\t    resource: inner_source.xml\r\n\t    itemName: agent\r\n\t    fieldNames:\r\n\t      - phone\r\n\t      - email\r\n\t    fields:\r\n\t      - name: agentId\r\n\t        map: \"agent-id\"\r\n\t        key: true\r\n\t      - name: name\r\n\t        map: \"agent-name/text\"\r\n\t      - name: phone\r\n\t        map: \"agent-phone/text\"\r\n\t        transforms:\r\n\t          - transform: test.convertToE164\r\n\t      - name: email\r\n\t        map: \"agent-email/text\"\r\n\t\r\n\ttarget:\r\n\t  type: JSON\r\n\t  ...\r\n\r\n\t    \r\n##### Order\r\nIt orders the records based on the defined fields. Its parameter:\r\n\r\n- **fieldNamesAndOrder**: The fields on which the ordering should occur, and in which direction. Only `ASC` and `DESC` can be given as an order.\r\n\r\nAn example of a YAML configuration:\r\n\r\n\t- modifier: Order\r\n\t  fieldNamesAndOrder:\r\n\t    - year: DESC\r\n\t    - name: ASC\r\n\r\n##### Set\r\nIt executes value setting by using a fixed value scheme, a function or another source. It is the most commonly used modifier, but for faster and more optimal processing it is worth writing our own modifier. Its parameters for initialization:\r\n\r\n- **fieldNames**: On which fields the value setting should take place.\r\n- **value**: New field value.\r\n\r\nSet is versatile. It can be extended with an `fn` parameter as well, where an arbitrary value setting function can be defined for it, and an entire source description can also be given. \r\n\r\nTypes of usage:\r\n\r\n1. **Value modification**\r\n\r\n   In short, value modification can be done based on the actual field values by putting the names of the fields into the value parameter. The value set will be performed for all fields listed in `fieldNames`.\r\n\r\n\r\n\t\t- modifier: Set\r\n\t\t  fieldNames: formatted_name\r\n\t\t  value: '%(district_roman)s kerület, %(district_copy)s%(name)s'\r\n    \r\n2. **Value modification by using a function**\r\n\r\n   For performing a complex calculation, it is worth using the `Set` modifier this way. The function needs to be written by us, therefore the `-p` parameter should be used here as well when running the metl script.\r\n   \r\n\t\t- modifier: Set\r\n\t\t  fieldNames: age\r\n\t\t  fn: utvonal.calculateAge\r\n\t\t \r\n   For the above, the following function can be written:\r\n\t\t   \r\n\t\timport datetime\r\n\t\t\r\n\t\tdef calculateAge( record, field, scope ):\r\n\t\t\r\n\t\t    if record.getField('date_of_birth').getValue() is None:\r\n\t\t        return None\r\n\t\t\r\n\t\t    td = datetime.date.today() - record.getField('date_of_birth').getValue()\r\n\t\t    return int( td.days / 365.25 )\r\n\t\t\t\r\n   In the above function, `record` means the whole line, `field` is the field that needs to be set (the function is carried out for all fields listed in `fieldNames`), while `scope` stands for the `SetModifier` instance.\r\n   
3. **Value modification by using another source**\r\n\r\n   This is the most difficult type to use, but if optimal speed is important, it is worth redefining it based on the data structure of the known source. `fn` and `source` need to be given as well.\r\n \r\n\t\t- modifier: Set\r\n\t\t  fieldNames: EMAILFOUND\r\n\t\t  fn: utvonal.setValue\r\n\t\t  source: TSV\r\n\t\t  resource: utvonal/masikforras.tsv\r\n\t\t  fields:\r\n\t\t    - name: EMAIL\r\n\t\t    - name: FIRSTNAME\r\n\t\t    - name: LASTNAME\r\n\r\n   The following function belongs to it:\r\n\t\r\n\t\tdef setValue( record, field, scope ):\r\n\t\r\n\t\t    return 'Found same email address' \\\r\n\t\t        if record.getField('EMAIL').getValue() in \\\r\n\t\t            [ sr.getField('EMAIL').getValue() for sr in scope.getSourceRecords() ] \\\r\n\t\t        else 'Not found same email address'\r\n\r\n##### SetWithMap\r\nSets fields' values based on the mapping of a Complex field.\r\n\r\n- **fieldNamesWithMap**: Field names and map paths on which the setting must be performed.\r\n- **complexFieldName**: Name of the complex field from which we want to derive the values. Its type can be `List` or `Complex`.\r\n\r\nThe setting is done one by one for each field listed in `fieldNamesWithMap`.\r\n\r\nAn example of a YAML configuration:\r\n\r\n\tsource:\r\n\t  source: JSON\r\n\t  fields:\r\n\t    - name: LISTITEMS\r\n\t      map: response/tips/items/*\r\n\t      type: List\r\n\t    - name: LISTELEMENT\r\n\t      type: Complex\r\n\t    - name: CREATEDAT\r\n\t      type: Integer\r\n\t    - name: TEXT\r\n\t    - name: CATEGORIES\r\n\t      type: List\r\n\t\r\n\tmanipulations:\r\n\t  - expand: ListExpander\r\n\t    listFieldName: LISTITEMS\r\n\t    expandedFieldName: LISTELEMENT\r\n\t  - modifier: SetWithMap\r\n\t    fieldNamesWithMap: \r\n\t      CREATEDAT: createdAt\r\n\t      TEXT: text\r\n\t      CATEGORIES: venue/categories/*/id\r\n\t    complexFieldName: LISTELEMENT\r\n\t  - filter: DropField\r\n\t    fieldNames:\r\n\t      - LISTELEMENT\r\n\t      - LISTITEMS\r\n\r\n\t    \r\n##### TransformField\r\nIt performs a regular field level transformation during the manipulation step. Parameters of its initialization:\r\n\r\n- **fieldNames**: Name of the fields on which the transforms must be performed.\r\n- **transforms**: List of field level transforms.\r\n\r\nThe transforms are done one by one for each field listed in `fieldNames`.\r\n\r\nAn example of a YAML configuration:\r\n\r\n\t- modifier: TransformField\r\n\t  fieldNames: district_copy\r\n\t  transforms:\r\n\t    - statement: IFNot\r\n\t      condition: IsEmpty\r\n\t      then:\r\n\t        - transform: Set\r\n\t          value: '%(self)s, '\r\n\t    - statement: ENDIF\r\n\r\n#### Filter\r\nTheir function is primarily filtering. They are used when we would like to evaluate or get rid of incomplete or faulty records resulting from an earlier transformation.\r\n\r\nIf we want to put a new filter into the system, the following can help:\r\n\r\n\tfrom metl.utils import *\r\n\t\r\n\tclass MyFilter( Filter ):\r\n\t\r\n\t    # bool\r\n\t    def isFiltered( self, record ):\r\n\t\r\n\t        return not record.getField('MEGMARADJON').getValue()
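A record is dropped when `isFiltered` returns True; here every record whose `MEGMARADJON` ('keep it') field is falsy gets filtered out. As with custom modifiers, the new filter is then referenced by its dotted path in the configuration (the path below is illustrative):\r\n\r\n\t- filter: package.path.MyFilter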
##### DropByCondition\r\nThe fate of the record is decided by a condition. Parameters of its initialization:\r\n\r\n- **condition**: A condition shown before, with all its parameters.\r\n- **fieldNames**: Fields on which the examination must be performed.\r\n- **operation**: The logical relation between the assessments of the individual fields. `AND` is used by default.\r\n\r\nLet's see three examples. In the first example, we want to leave out from the results those cases where the value of the `NAME` field matches a pattern. When only a single field is evaluated, the `operation` parameter does not matter.\r\n\r\n\t- filter: DropByCondition\r\n\t  condition: IsMatchByRegexp\r\n\t  regexp: '^.*\\-.*$'\r\n\t  fieldNames: NAME \r\n\r\nIn the second example, let's see another type of assessment. We want to delete the line if both the `EMAIL` and `NAME` values are empty.\r\n\r\n\t- filter: DropByCondition\r\n\t  condition: IsEmpty\r\n\t  fieldNames:\r\n\t   - NAME\r\n\t   - EMAIL\r\n\t  operation: AND\r\n\r\nIn the third example, let's examine the previous one with the `OR` `operation`. In this case, the line will be deleted from the results if either the `NAME` or the `EMAIL` field is empty.\r\n\r\n\t- filter: DropByCondition\r\n\t  condition: IsEmpty\r\n\t  fieldNames:\r\n\t   - NAME\r\n\t   - EMAIL\r\n\t  operation: OR\r\n\r\n##### DropBySource\r\nInclusion in another source file decides the fate of the record. All rules are identical with the ones applicable to DropByCondition, so the cases will not be described here again. Its initialization:\r\n\r\n- **condition**: A condition described before, with all its parameters. Only conditions with 1 parameter can be used!\r\n- **join**: Name of the fields that are joined. They must have the same name in both sources.\r\n- **operation**: The logical relation between the assessments of the individual fields. `AND` is used by default.\r\n\r\nand `source` with all its parameters. The only parameter that belongs to the condition (which is usually `value`) must name the column of the other source in which the comparison value can be found.\r\n\r\nAn example of a YAML configuration:\r\n\r\n\t- filter: DropBySource\r\n\t  join: PID\r\n\t  condition: IsEqual\r\n\t  value: NAME\r\n\t  source: Database\r\n\t  url: 'postgresql://felhasznalonev:jelszo@localhost:5432/adatbazis'\r\n\t  table: adattabla\r\n\t  fields:\r\n\t    - name: PID\r\n\t      type: Integer\r\n\t      map: id\r\n\t    - name: NAME\r\n\t      type: String\r\n\r\n##### DropField\r\nFields can be dropped from a record with its help. It can happen that a value from a source is pasted into several fields in order to perform different transformations on them, and at the end of the process we want to delete those fields that are not needed anymore. This filter makes this possible.\r\n\r\n- **fieldNames**: List of the fields to be dropped.\r\n\r\nAn example of a YAML configuration:\r\n\r\n\t- filter: DropField\r\n\t  fieldNames:\r\n\t    - name_search\r\n\t    - district_copy\r\n\t    - district_search\r\n\t    - district_roman\r\n\t    - district_id\r\n\t    - region_id\r\n\t    \r\n##### KeepByCondition\r\nThe fate of the record is decided by a condition. Parameters of its initialization:\r\n\r\n- **condition**: A condition described before, with all its parameters.\r\n- **fieldNames**: On which fields the examination should take place.\r\n- **operation**: The logical relation between the assessments of the individual fields. `AND` is used by default.\r\n\r\nIt is almost identical with `DropByCondition`; the only difference is that in this case the record will **not** be filtered out if the condition is met!
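An example of a YAML configuration (a sketch mirroring the DropByCondition examples above; the field name and pattern are illustrative):\r\n\r\n\t- filter: KeepByCondition\r\n\t  condition: IsMatchByRegexp\r\n\t  regexp: '^.*@.*$'\r\n\t  fieldNames: EMAIL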
#### Expand\r\nIt is used for expansion, when we want to add additional values after the current source.\r\n\r\n##### Append\r\nIt gives the possibility to read a resource with the same format as the actual source, and paste it into the actual process. Its parameter:\r\n\r\n - **skipIfFails**: If the source is incorrect, the process will not be stopped, only this step will be skipped. The list of incorrect sources can be saved with `logFile` and `appendLog`.\r\n\r\nImportant to note that **it does not extend the previous source with new fields**; everything continues the same way as if the current file got another `resource` without `modifier` and `target`. All **resource** attributes can be rewritten, even the `username`, `password` or `encoding` data connected to htaccess!\r\n\r\nAn example of a YAML configuration:\r\n\t\r\n\t- expand: Append\r\n\t  resource: target/otherfile.json\r\n\t  encoding: iso-8859-2\r\n\t  skipIfFails: true\r\n\t  logFile: log/otherfile.txt\r\n\t  appendLog: true\r\n\r\n##### AppendAll\r\nIt gives the possibility to read several resources with the same format as the actual source, and paste them into the actual process. Its parameters:\r\n\r\n - **folder**: The folder which contains the files that need to be appended to the end. If the source file is part of the folder as well, it will be ignored.\r\n - **extension**: Only files with the given extension will be processed.\r\n - **skipIfFails**: If a source file is incorrect, the process will not be stopped, only that step will be skipped. The list of incorrect sources can be saved with `logFile` and `appendLog`\r\n - **skipSubfolders**: It skips the subfolders. They are included by default.\r\n\r\nImportant to note that **it does not extend the previous source with new fields**; everything continues the same way as if the current file got another `resource` without `modifier` and `target`. It creates an `Append` for each file and the process is executed this way. \r\n\r\nAn example of a YAML configuration:\r\n\t\t\r\n\t- expand: AppendAll\r\n\t  folder: source/oc\r\n\t  extension: xml\r\n\t    \r\n      \r\n##### AppendBySource\r\nThe content of another source can be appended after the original source. Only `source` is needed for initialization, with all its parameters.\r\n\r\nImportant to note that **it does not extend the previous source with new fields**; it pairs everything by name to the data and columns of the original source. The same fields must have identical names in both sources. Those fields of the appended source that do not exist in the original source will not be included in the results!\r\n\r\nAn example of a YAML configuration:\r\n\r\n\t- expand: AppendBySource\r\n\t  source: Database\r\n\t  url: 'postgresql://felhasznalonev:jelszo@localhost:5432/adatbazis'\r\n\t  table: adattabla\r\n\t  fields:\r\n\t    - name: PID\r\n\t      type: Integer\r\n\t      map: id\r\n\t    - name: NAME\r\n\t      type: String\r\n\r\n\r\n##### Field\r\nIt collects the columns defined as parameters into a label column and a value column, keeping the column values. It can be used if we want to list a few lines of statistics in key-value form below each other while keeping all the original columns.
Its initialization:\r\n\r\n- **fieldNamesAndLabels**: The fields (and their labels) that we want to contract into the two columns.\r\n- **valueFieldName**: The name of the field where the value of the column will be written.\r\n- **labelFieldName**: The name of the field where the name of the value column will be written.\r\n\r\nAn example of a YAML configuration:\r\n\r\n\t- expand: Field\r\n\t  fieldNamesAndLabels:\r\n\t    cz: Czech\r\n\t    hu: Hungary\r\n\t    sk: Slovak\r\n\t    pl: Poland\r\n\t  valueFieldName: count\r\n\t  labelFieldName: country\r\n\r\n##### BaseExpander\r\nA base class that can be used for expansion, but it cannot work on its own. It is used when we want to create several lines from one line in the process. Don't forget the `clone()` call when querying the prototype!\r\n\r\n\tclass ResultExpand( BaseExpander ):\r\n\t\r\n\t    def expand( self, record ):\r\n\t\r\n\t        for phone in record.getField('PHONES').getValue().split(', '):\r\n\t            fs = self.getFieldSetPrototypeCopy().clone()\r\n\t            fs.setValues( record.getValues() )\r\n\t            fs.getField('PHONES').setValue( phone )\r\n\t\r\n\t            yield fs\r\n\r\n##### ListExpander\r\nIt breaks up list type elements into separate lines based on their values. It derives from `BaseExpander`, therefore their functioning is quite similar. Its parameters:\r\n\r\n - **listFieldName**: The name of the list type field whose values need to be broken into separate lines.\r\n - **expandedFieldName**: Where to write the actual value of the list element. A type can be given to it for further type conversion. \r\n - **expanderMap**: It is used when we want to write the list value into several fields. In this case, each field can have a map added to define where the values should be gathered from within the list.\r\n \r\nIt is important to note that the two fields can never be identical. If the list element is not needed later, the unnecessary field can be dropped by a `filter` step. Either `expandedFieldName` or `expanderMap` is mandatory.\r\n\r\n\r\n\tsource:\r\n\t  resource: 589739.json\r\n\t  source: JSON\r\n\t  fields:\r\n\t    - name: ID\r\n\t      type: Integer\r\n\t      map: response/user/id\r\n\t    - name: FIRST\r\n\t      map: response/user/firstName\r\n\t    - name: LAST\r\n\t      map: response/user/lastName\r\n\t    - name: FRIENDS\r\n\t      map: response/user/friends/groups/0/items/*/id\r\n\t      type: List\r\n\t    - name: FRIEND\r\n\t      type: Integer\r\n\t\r\n\tmanipulations:\r\n\t  - expand: ListExpander\r\n\t    listFieldName: FRIENDS\r\n\t    expandedFieldName: FRIEND\r\n\t  - filter: DropField\r\n\t    fieldNames: FRIENDS\r\n\t\r\n\ttarget:\r\n\t  type: JSON\r\n\t  resource: result.json\r\n\t  compact: false\r\n\r\n##### Melt\r\nIt fixes the given columns while the other columns will be shown as key-value pairs. **During the process, all fields that are neither fixed nor contain the key-value pairs will be deleted.** If we do not want to remove the other fields and only a few fields should be melted, the `Field` expander must be used. Its initialization (the parameters mirror those of the `Field` expander):\r\n\r\n- **fieldNames**: The fields to keep fixed.\r\n- **valueFieldName**: The name of the field where the value will be written.\r\n- **labelFieldName**: The name of the field where the name of the melted column will be written.\r\n\r\nAn example of a YAML configuration:\r\n\r\n\t- expand: Melt\r\n\t  fieldNames:\r\n\t    - first\r\n\t    - last\r\n\t  valueFieldName: value\r\n\t  labelFieldName: quantity\r\n\r\n#### Aggregator\r\nIt is used to create groups and calculate information from them. Aggregators often act as Filters or Modifiers as well, since in several cases they delete lines or columns, modify and collect given values. All procedures like this start with the key word `aggregator`.
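As an illustration (the data is invented for the example), grouping by `word` with the `Count` aggregator described below turns\r\n\r\n\tword,price\r\n\tapple,12\r\n\tapple,7\r\n\tpear,3\r\n\r\ninto\r\n\r\n\tword,count\r\n\tapple,2\r\n\tpear,1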
##### Avg\r\nIt is used to determine the mean average. Its initialization:\r\n\r\n- **fieldNames**: Which fields belong to the group. These values will appear in distinct form in the result!\r\n- **targetFieldName**: Name of the field which will contain the calculated average.\r\n- **valueFieldName**: The field whose values must be averaged.\r\n- **listFieldName**: Name of a List field. The matched records will be saved here. It is not mandatory to give.\r\n\r\nThe aggregator deletes all columns except for `fieldNames`, `targetFieldName`, `listFieldName`.\r\n\r\nAn example of a YAML configuration:\r\n\r\n\t- aggregator: Avg\r\n\t  fieldNames: author\r\n\t  targetFieldName: avgprice\r\n\t  valueFieldName: price\r\n\t  \r\n##### Count\r\nIt is used to count the records of a group. Its initialization:\r\n\r\n-  **fieldNames**: Which fields belong to the group. These values will appear in distinct form in the result!\r\n-  **targetFieldName**: Name of the field which will contain the count of the records.\r\n-  **listFieldName**: Name of a List field. The matched records will be saved here. It is not mandatory to give.\r\n\r\nThe aggregator deletes all columns except for `fieldNames`, `targetFieldName`, `listFieldName`.\r\n\r\nAn example of a YAML configuration:\r\n\r\n\t- aggregator: Count\r\n\t  fieldNames: word\r\n\t  targetFieldName: count\r\n\t  listFieldName: matches\r\n\r\n##### Sum\r\nIt is used to sum values. Its initialization:\r\n\r\n-  **fieldNames**: Which fields belong to the group. These values will appear in distinct form in the result!\r\n-  **targetFieldName**: Name of the field which will contain the calculated sum.\r\n-  **valueFieldName**: The field whose values must be added up.\r\n-  **listFieldName**: Name of a List field. The matched records will be saved here. It is not mandatory to give.\r\n\r\nThe aggregator deletes all columns except for `fieldNames`, `targetFieldName`, `listFieldName`.\r\n\r\nAn example of a YAML configuration:\r\n\r\n\t- aggregator: Sum\r\n\t  fieldNames: author\r\n\t  targetFieldName: sumprice\r\n\t  valueFieldName: price\r\n\t      \r\n### Target\r\nAfter the data is read from the source, and the transform and manipulation steps are over, the finalized record gets to the `Target`. This will write and create the file with the final data.\r\n\r\n\ttarget:\r\n\t  type: \u003ctarget_type\u003e\r\n\t  … \r\n\r\nA target is required for every process, and **only one instance of it can exist**. With `CSV`, `TSV` and `Database` targets a previous run can be continued.
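Before going through the individual target types, here is a minimal end-to-end configuration showing where `target` sits next to `source` (paths and field names are illustrative):\r\n\r\n\tsource:\r\n\t  source: CSV\r\n\t  resource: source/data.csv\r\n\t  skipRows: 1\r\n\t  fields:\r\n\t    - name: NAME\r\n\t      map: 0\r\n\t    - name: EMAIL\r\n\t      map: 1\r\n\t\r\n\ttarget:\r\n\t  type: CSV\r\n\t  resource: target/output.csv\r\n\t  addHeader: true\r\n\r\nAssuming it is saved as config.yml, the process can then be run with the metl script mentioned earlier, along the lines of:\r\n\r\n\tmetl config.yml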
#### CSV\r\nTarget type used in the case of CSV resources. Parameters of its initialization:\r\n\r\n- **delimiter**: The sign used for separation in the CSV resource. By default `,` is used.\r\n- **quote**: The character used to protect data if the text contains the previously mentioned delimiter. `\"` is used by default.\r\n- **addHeader**: It puts the field names into the first line as a header.\r\n- **appendFile**: If the target file already exists, should the writing continue at the end or start from the beginning. It always rewrites the file by default.\r\n\r\nFurther parameters to define the target place:\r\n\r\n- **resource**: Target path of the CSV resource; it can be a URL as well.\r\n- **encoding**: Encoding of the CSV resource. `UTF-8` is used by default.\r\n\r\nAn example of a YAML configuration:\r\n\r\n\ttarget:\r\n\t  type: CSV\r\n\t  resource: path/to/the/output.csv\r\n\t  delimiter: \"|\"\r\n\t  addHeader: true\r\n\t  appendFile: true\r\n\r\n#### Database\r\nThis target type is used if we want to write our records into a database. Several parameters exist for its initialization:\r\n\r\n- **createTable**: If the table does not exist in the database, should it be created or not. It is not created by default; the assumption is that the table already exists with the correct scheme.\r\n- **replaceTable**: Should the table be dropped and re-created, whether or not it already exists in the database. It does not drop and replace by default. Its usage is needed if the table already exists but the new process would expand it with additional columns.\r\n- **truncateTable**: Should the already existing table be cleared in order to write the records into an empty state. It does not clear by default. It is important to note that if the value of `replaceTable` is true, the table will become empty anyway, so this parameter does not need to be defined in that case.\r\n- **addIDKey**: Should a unique key with an autoincrement sequence be added to the table at the moment of creation. It is added by default.\r\n- **idKeyName**: If the value of `addIDKey` is true, what the name of that column should be. No column with this name may be among the values to be written.\r\n- **continueOnError**: The line can be skipped if an error occurs during the writing or modification (e.g. a Foreign Key is not listed). It does not continue by default.\r\n\r\nFurther parameters to define the target place:\r\n\r\n- **url**: Connection link of the database.\r\n- **table**: The name of the table into which we want to write. If it is given, the system performs the writing/modification automatically.\r\n- **fn**: The name of the function with which we will write/modify. It must be given when no table is defined. It is worth using if we want to write into several tables in a Foreign Key environment. If both `table` and `fn` are defined, then both the automatic writing and the function call happen. \r\n- **schema**: The name of the scheme in which the table is found. It is not necessary to define.\r\n\r\nAn example of a YAML configuration:\r\n\r\n\ttarget:\r\n\t  type: Database\r\n\t  url: sqlite:///tests/target\r\n\t  table: result\r\n\t  addIDKey: false\r\n\t  createTable: true\r\n\t  replaceTable: true\r\n\t  truncateTable: true\r\n\t  \r\nor\r\n\r\n\ttarget:\r\n\t  type: Database\r\n\t  url: sqlite:///tests/target\r\n\t  fn: mgmt.RunFunctionQuery\r\n\t  \r\nwith the following Python resource:\r\n\r\n\tdef RunFunctionQuery( connection, insert_buffer, update_buffer ):\r\n\t\r\n\t    for item in insert_buffer:\r\n\t        connection.execute(\r\n\t            \"\"\"\r\n\t            INSERT INTO result ( lat, lng ) VALUES ( :lat, :lng );\r\n\t            \"\"\",\r\n\t            item\r\n\t        )\r\n\t        \r\n\t        ...\r\n\t        \r\nDefining `fn` is most useful when creating migrations; an example appears later.\r\n\r\n#### FixedWidthText\r\nTarget type used in the case of fixed width resources. Parameter of its initialization:\r\n\r\n- **addHeader**: Should the field names be put into the first line as a header. They are added by default.\r\n\r\nFurther parameters to define the target place:\r\n\r\n- **resource**: Target path of the TXT resource; it can be a URL as well.\r\n- **encoding**: Encoding of the TXT resource.
`UTF-8` is used by default. \r\n\r\nAn example of a YAML configuration:\r\n\r\n\ttarget:\r\n\t  type: FixedWidthText\r\n\t  resource: utvonal/output.txt\r\n\t  \r\n#### GoogleSpreadsheet\r\nTarget type used in the case of spreadsheet resources. Parameter of its initialization:\r\n\r\n- **addHeader**: Should the field names be put into the first line as a header. They are added by default.\r\n\r\nFurther parameters to define the target place:\r\n\r\n- **username**: Name of the user\r\n- **password**: Password of the user\r\n- **spreadsheetKey**: Identifier of the spreadsheet to be written.\r\n- **spreadsheetName**: Name of the spreadsheet to be written. Either spreadsheetKey or spreadsheetName must be given!\r\n- **worksheetName**: Name of the worksheet. If it does not exist, it will be created automatically.\r\n- **truncateSheet**: Should the content of the worksheet be cleared. It does not clear by default.\r\n\r\nAn example of a YAML configuration:\r\n\r\n\ttarget:\r\n\t  type: GoogleSpreadsheet\r\n\t  username: ***\r\n\t  password: ***\r\n\t  spreadsheetKey: 0ApA_54tZDwKTdDlibXppSkd1MExxb3Y5WmJrZjFxR1E\r\n\t  worksheetName: Sheet1\r\n\r\n\t   \r\n#### JSON\r\nTarget type used in the case of JSON resources. Parameters of its initialization:\r\n\r\n- **rootIterator**: The name of the variable in which we want to collect the records. It is not mandatory; the JSON resource will contain only a list if this parameter is left empty.\r\n\r\n  Left empty:\r\n  \r\n  \t\t[ { … }, { … }, …, { … } ]\r\n  \t\t\r\n  Filled in, e.g. with `items`:\r\n  \r\n  \t\t{ \"items\": [ { … }, { … }, …, { … } ] }\r\n\r\n- **flat**: If only one field exists, this option makes it possible to list only the values, without the field name. It is not used by default.\r\n- **compact**: Whether the JSON is generated formatted. Its value is false by default, which generates the JSON into one line.\r\n\r\nFurther parameters to define the target place:\r\n\r\n- **resource**: Target path of the JSON resource. It can be a URL as well.\r\n- **encoding**: Encoding of the JSON resource. `UTF-8` is used by default.\r\n\r\nAn example of a YAML configuration:\r\n\r\n\ttarget:\r\n\t  type: JSON\r\n\t  resource: utvonal/output.json\r\n\t  rootIterator: items\r\n\t  \r\n#### Neo4j\r\nTarget type used to write into a graph database. Parameter of its initialization:\r\n\r\n- **bufferSize**: Number of records to be written at a time.\r\n\r\nFurther parameters to define the target place:\r\n\r\n- **url**: The address of the Neo4j database\r\n- **resourceType**: The type of data we want to write. It can have `Node` and `Relation` values.\r\n- **label**: Label to be used for the loaded data. It is always mandatory to give here, even though Neo4j itself does not require it.\r\n- **truncateLabel**: Delete the already existing records with the same label at the beginning of the load. It does not delete by default.\r\n\r\nIf we choose the `Relation` resourceType, the following parameters are mandatory as well:\r\n\r\n- **fieldNameLeft**: Which field of the data to be loaded contains the identifier describing the left side of the relation.\r\n- **fieldNameRight**: Which field of the data to be loaded contains the identifier describing the right side of the relation.\r\n- **keyNameLeft**: Which field of the object on the left hand side of the relation contains the key.\r\n- **keyNameRight**: Which field of the object on the right hand side of the relation contains the key.\r\n- **labelLeft**: The label the left hand side element has.
Not mandatory.\r\n- **labelRight**: The label the right hand side element has. Not mandatory.\r\n\r\nThe system automatically creates an index on the fields marked with **key**.\r\n\r\nAn example for a YAML configuration in the case of `Node`:\r\n\r\n\tsource:\r\n\t  source: TSV\r\n\t  resource: Artist.txt\r\n\t  quote: \"\"\r\n\t  skipRows: 1\r\n\t  fields:\r\n\t    - name: uid\r\n\t      map: 0\r\n\t      key: true\r\n\t    - name: name\r\n\t      map: 1\r\n\t    - name: nationality\r\n\t      map: 2\r\n      \r\n\ttarget:\r\n\t  type: Neo4j\r\n\t  url: http://localhost:7474\r\n\t  label: Artist\r\n\t  truncateLabel: true\r\n\t  resourceType: Node\r\n\r\nAnd in the case of `Relation`:\r\n\r\n\tsource:\r\n\t  source: TSV\r\n\t  resource: AlbumArtist.txt\r\n\t  quote: \"\"\r\n\t  skipRows: 1\r\n\t  fields:\r\n\t    - name: album_uid\r\n\t      map: 0\r\n\t    - name: artist_uid\r\n\t      map: 1\r\n\t\r\n\ttarget:\r\n\t  type: Neo4j\r\n\t  url: http://localhost:7474\r\n\t  label: Contains\r\n\t  truncateLabel: true\r\n\t  resourceType: Relation\r\n\t  fieldNameLeft: album_uid\r\n\t  fieldNameRight: artist_uid\r\n\t  keyNameLeft: uid\r\n\t  keyNameRight: uid\r\n\t  labelLeft: Album\r\n\t  labelRight: Artist\r\n\r\n\t  \r\n#### Static\r\nA type created for testing purposes; it writes to `stdout` in TSV format. Parameter of its initialization:\r\n\r\n- **silence**: If true, nothing is written to stdout.\r\n\r\nAn example of a YAML configuration:\r\n\r\n\ttarget:\r\n\t  type: Static\r\n\t  silence: false\r\n\r\n#### TSV\r\nTarget type used in the case of TSV resources. Parameters of its initialization:\r\n\r\n- **delimiter**: The sign used for separation in the TSV resource. By default a tab character is used.\r\n- **quote**: The character used to protect data if the text contains the previously mentioned delimiter. `\"` is used by default.\r\n- **addHeader**: Whether to put the field names into the first line as a header. They are added by default.\r\n- **appendFile**: If the target file already exists, should the writing continue at the end or start from the beginning. It always rewrites the file by default.\r\n\r\nFurther parameters to define the target place:\r\n\r\n- **resource**: Target path of the TSV resource; it can be a URL as well.\r\n- **encoding**: Encoding of the TSV resource. `UTF-8` is used by default.\r\n\r\nAn example of a YAML configuration:\r\n\r\n\ttarget:\r\n\t  type: TSV\r\n\t  resource: path/to/output.tsv\r\n\t  delimiter: \"|\"\r\n\t  addHeader: true\r\n\t  appendFile: true\r\n\t  \r\n#### XLS\r\nTarget type used in the case of XLS resources. Parameters of its initialization:\r\n\r\n- **addHeader**: Whether to put the field names into the first line as a header. They are added by default.\r\n\r\nFurther parameters to define the target place:\r\n\r\n- **resource**: Target path of the XLS resource; it can be a URL as well.\r\n- **encoding**: Encoding of the XLS resource. `UTF-8` is used by default.\r\n- **sheetName**: Name of the worksheet into which we want to write.\r\n- **replaceFile**: Should the whole XLS file be replaced. If we had an XLS file before, even with several worksheets, their values will be lost. It replaces by default. \r\n- **truncateSheet**: Should the worksheet be emptied. If the value of `replaceFile` is true, then this parameter has no importance. But in the other case it is important to define whether we want to continue writing at the end of the worksheet or replace the worksheet with the new data.
The worksheet is replaced by default.\r\n- **dinamicSheetField**: If we want to create more worksheets other than the main field by data scattering, then here the field name must be given on the basis of which we want to group the data. The value of the defined field here will be the name of the worksheet. Not necessary. \r\n\r\nIf we give a non-existing worksheet, the process automatically creates one together with the entire XLS file.\r\n\r\nAn example of a YAML configuration:\r\n\r\n\ttarget:\r\n\t  type: XLS\r\n\t  resource: path/to/output.xls\r\n\t  sheetName: NotExisting\r\n\t  replaceFile: false\r\n\t  truncateSheet: false\r\n  \r\n#### XML\r\nTarget type used in the case of XML resources. Parameters of its initialization:\r\n\r\n- **itemName**: Name of an XML element (or the path to that element) which contains one record we need to process.  It is mandatory to define!\r\n- **rootIterator**: The name of the root element to which the above XML data should be collected. `root` is used by default. It can't be left empty.\r\n- **flat**: If only one field exists, this option makes possible to only list the values without field name. It is not used by default.\r\n\r\nFurther parameters to define the target place:\r\n\r\n- **resource**: Target path of the XML file, it can be URL as well.\r\n- **encoding**: Coding of the XML file. Coding occurs in `UTF-8` by default. \r\n\r\nAn example of a YAML configuration:\r\n\r\n\ttarget:\r\n\t  type: XML\r\n\t  resource: utvonal/output.xml\r\n\t  itemName: estate\r\n\t  rootIterator: estates\r\n\t  \r\n#### Yaml\r\nTarget type used in the case of Yaml resources. Parameters of its initialization:\r\n\r\n- **rootIterator**: The name of the root element to which the data should be collected. It can't be left empty.\r\n- **safe**: Should the indications generated by Python be removed. (e.g. unicode character coding) They are kept by default.\r\n- **flat**: If only one field exists, this option makes possible to only list the values without field name. It is not used by default.\r\n\r\nFurther parameters to define the target place:\r\n\r\n- **resource**: Target path of the YML resource, it can be URL as well.\r\n- **encoding**: Coding of the YML resource. Coding occurs in `UTF-8` by default.\r\n\r\nAn example of a YAML configuration:\r\n\r\n\ttarget:\r\n\t  type: Yaml\r\n\t  resource: utvonal/output.yml\r\n\t  rootIterator: estates\r\n\r\n### Migrate\r\nDuring the running of the mETL script, there is a possibility to define a migration file (`-m` parameter) and to generate a new migration file (`-t` parameter). There are two types of migrations, **keyless** and **with key**. If a modification occurs in the configuration file, the earlier migration cannot be used in the future. The two different types of migrations cannot be mixed with each other in any way.\r\n\r\n#### Key, Hash, ID\r\nEach line has the above parameters.\r\n\r\n- **Key**: The values of the fields identified as key field separated by `-`. These values clearly identify the record (line). The line does not have a `key` if no fields are labelled. \r\n- **Hash**: Long number and letter row created from the value of an entire record (line) which cannot be decoded.\r\n- **ID**: Merge of the above ones separated by `:`. It clearly identifies the actual status of all values of a line.\r\n\r\n#### Logging\r\n`log` variable can be given to a start-up source, manipulation and target file. To activate the logging to a given step, the path of the file must be defined. 
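For example (paths are illustrative), to log what the target writes:\r\n\r\n\ttarget:\r\n\t  type: CSV\r\n\t  resource: target/output.csv\r\n\t  log: log/target.txt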
\r\n\r\n#### Logging\r\nA `log` variable can be given to the source, to the manipulations, and to the target. To activate logging for a given step, the path of the log file must be defined. Each step creates a different logging format, but in general the following applies:\r\n\r\n1. **Source log**: It contains the `ID` of the processed line, the dictionary of the processed line as JSON, and the value after transformation as JSON.\r\n2. **Filter modifier log**: It contains the `ID` of the processed line and the transformed value of the dropped line as JSON.\r\n3. **Target file log**: It contains the `ID` of the processed line, the operation (write or modification), and the value of the line to be written out as JSON.\r\n\r\nModifiers, Expanders, and the individual transformation steps are not logged one by one.\r\n\r\n#### Migration without key fields\r\nIn this case we can't identify the lines unambiguously, so we can't determine whether a line's value has changed compared to its previous state. In this version none of the fields carry `key` in their configuration, so the only thing the migration can determine is whether an identical value existed in the previous version (`-m`), based on the `hash`.\r\n\r\nOnly the records that did not exist before are written to the target. If a new migration is generated (`-t`), the migration file will contain all old and new values. Records that are not included in the new file will not be part of the new migration; they are considered deleted elements, although they are not marked for deletion anywhere.\r\n\r\n#### Migration with the use of key fields\r\nThis is the more commonly used type, since for almost all sources there is a combination of fields that identifies a line unambiguously. During the migration process, the `hash` belonging to the line is stored for every `ID`. This way, if the hash belonging to the same `ID` changes, we know exactly which record (line) has been modified. With a text target, all new and modified records get into the target file; with a database target, `UPDATE` commands are used instead. The migration to be generated (`-t`) will contain the final state.
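\r\n\r\nConceptually, the keyed comparison boils down to matching two `ID -> hash` mappings against each other. The sketch below illustrates the idea; it is not mETL's code:\r\n\r\n\t# Illustrative sketch of keyed migration comparison -- not mETL's code.\r\n\t# `current` and `previous` both map a record's key to its hash.\r\n\tdef diff( current, previous ):\r\n\t    new       = [ k for k in current if k not in previous ]\r\n\t    modified  = [ k for k in current if k in previous and current[k] != previous[k] ]\r\n\t    unaltered = [ k for k in current if previous.get( k ) == current[k] ]\r\n\t    deleted   = [ k for k in previous if k not in current ]\r\n\t    return new, modified, unaltered, deleted\r\n\r\nThe `metl-differences` script described next performs this kind of comparison between two saved migrations.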
\r\n\r\n#### List of new/modified/unaltered/deleted element keys\r\nThe `metl-differences` script is able to compare migrations. An example:\r\n\r\n`metl-differences -d delete.yml migration/migration.pickle migration/historical.pickle`\r\n\r\nAs can be seen, it gets a `-d` parameter with a configuration file, which defines where to write the keys of the elements that are to be deleted by the new migration. An example of the delete.yml configuration:\r\n\r\n\ttarget:\r\n\t  type: JSON\r\n\t  resource: migration/migration_delete.json\r\n\t  rootIterator: deleted\r\n\t  flat: true\r\n\r\nOnly the target must be defined; everything else is handled by the script. The above generates a list similar to this one:\r\n\r\n\t{\"deleted\":[\"23105283\",\"23099212\",\"23101411\"]}\r\n\r\nIn the original configuration file, a single `id` field contained the `key` setting.\r\n**The above script can only be used to compare migrations of the same type!**\r\n\r\nIn most cases, however, the modifications and the list of deleted records are needed for other purposes. It is common that the new records of the whole migration are written to a database, while the deleted records are to be inactivated. The `fn` attribute of the `DatabaseTarget` is used for this.\r\n\r\nIn the case of\r\n\r\n`metl-differences -d delete.yml migration/current.pickle migration/prev.pickle`\r\n\r\nthe content of delete.yml is:\r\n\r\n\ttarget:\r\n\t  type: Database\r\n\t  url: sqlite:///database.db\r\n\t  fn: mgmt.inactivateRecords\r\n\r\nwhile the content of mgmt.py is:\r\n\r\n\tdef inactivateRecords( connection, delete_buffer, other_buffer ):\r\n\t    # `delete_buffer` holds the deleted records; each entry carries the\r\n\t    # record's `key`. We mark them inactive instead of removing them.\r\n\t    connection.execute(\r\n\t        \"\"\"\r\n\t        UPDATE\r\n\t            t_result\r\n\t        SET\r\n\t            active = FALSE\r\n\t        WHERE\r\n\t            id = ANY( VALUES %s )\r\n\t        \"\"\" % ( ', '.join( [ \"('%(key)s')\" % b for b in delete_buffer ] ) )\r\n\t    )\r\n\r\n## Example\r\nNow that we are familiar with what the tool can do, let's see what it can be used for.\r\n\r\n### Loading Spanish data\r\nThis is a fairly simple load, but it is interesting because of the amount of data: several GB covering almost a decade, with roughly ten 80 MB resources of about 250,000 lines each per month. The goal is to load the data into a database table as fast as possible.\r\n\r\nThe following configuration was created for it:\r\n\r\n\tsource:\r\n\t  source: FixedWidthText\r\n\t  map:\r\n\t    FLOW: '0:1'\r\n\t    YEAR: '1:3'\r\n\t    MONTH: '3:5'\r\n\t    CUSTOM_ENCLOSURE_PROVINCE: '5:7'\r\n\t    DATE_OF_ADMISSION_DOCUMENT: '19:25'\r\n\t    POSITION_STATISTICS: '25:37'\r\n\t    DECLARATION_TYPE: '37:38'\r\n\t    ADDITIONAL_CODES: '38:46'\r\n\t    COUNTRY_ORIGIN_DESTINATION: '66:69'\r\n\t    COUNTRY_OF_ORIGIN_ISSUE: '69:72'\r\n\t    PROVINCE_OF_ORIGIN_DESTINATION: '75:77'\r\n\t    CUSTOMS_REGIME_REQUESTED: '82:84'\r\n\t    PRECEDING_CUSTOMS_PROCEDURE: '84:86'\r\n\t    WEIGHT: '89:104'\r\n\t    UNITS: '104:119'\r\n\t    STATISTICAL_VALUE: '119:131'\r\n\t    INVOICE_VALUE: '131:143'\r\n\t    COUNTRY_CURRENCY: '143:146'\r\n\t    CONTAINER: '158:159'\r\n\t    TRANSPORT_SYSTEM: '159:164'\r\n\t    BORDER_TRANSPORT_MODE: '164:165'\r\n\t    INLAND_TRANSPORT_MODE: '165:166'\r\n\t    NATIONALITY_THROUGH_TRANSPORT: '166:169'\r\n\t    ZONE_EXCHANGE: '170:171'\r\n\t    NATURE_OF_TRANSACTION: '172:174'\r\n\t    TERMS_OF_DELIVERY: '174:177'\r\n\t    CONTINGENT: '177:183'\r\n\t    TARIFF_PREFERENCE: '183:189'\r\n\t    FREIGHT: '189:201'\r\n\t    TAX_ADDRESS_PROVINCE: '224:226'\r\n\t  fields:\r\n\t    - name: FLOW\r\n\t    - name: YEAR\r\n\t      type: Integer\r\n\t    - name: MONTH\r\n\t      type: Integer\r\n\t    - name: CUSTOM_ENCLOSURE_PROVINCE\r\n\t    - name: DATE_OF_ADMISSION_DOCUMENT\r\n\t      type: Date\r\n\t    - name: POSITION_STATISTICS\r\n\t    - name: DECLARATION_TYPE\r\n\t    - name: ADDITIONAL_CODES\r\n\t    - name: COUNTRY_ORIGIN_DESTINATION\r\n\t    - name: COUNTRY_OF_ORIGIN_ISSUE\r\n\t    - name: PROVINCE_OF_ORIGIN_DESTINATION\r\n\t    - name: CUSTOMS_REGIME_REQUESTED\r\n\t    - name: PRECEDING_CUSTOMS_PROCEDURE\r\n\t    - name: WEIGHT\r\n\t    - name: UNITS\r\n\t    - name: STATISTICAL_VALUE\r\n\t    - name: INVOICE_VALUE\r\n\t    - name: COUNTRY_CURRENCY\r\n\t    - name: CONTAINER\r\n\t    - name: TRANSPORT_SYSTEM\r\n\t    - name: BORDER_TRANSPORT_MODE\r\n\t    - name: INLAND_TRANSPORT_MODE\r\n\t    - name: NATURE_OF_TRANSACTION\r\n\t    - name: ZONE_EXCHANGE\r\n\t    - name: NATIONALITY_THROUGH_TRANSPORT\r\n\t    - name: TERMS_OF_DELIVERY\r\n\t    - name: CONTINGENT\r\n\t    - name: TARIFF_PREFERENCE\r\n\t    - name: FREIGHT\r\n\t    - name: TAX_ADDRESS_PROVINCE\r\n\t\r\n\ttarget:\r\n\t  type: Database\r\n\t  url: postgresql://metl:metl@localhost:5432/metl\r\n\t  table: spanish_trade\r\n\t  createTable: true\r\n\t  replaceTable: false\r\n\t  truncateTable: false\r\n\r\nAs can be seen, no `resource` parameter is defined in the case of the `Source`, since we do not want to create separate configurations for similar file formats. For running, `metl-walk` is used with the multiprocess (`-m`) setting, to process the ten monthly files simultaneously and as fast as possible.\r\n\r\n`metl-walk -m spanishtrade.yml data/spanish_trade/2013/jan`
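\r\n\r\nEach `map` entry above is simply a character range: a value such as `YEAR: '1:3'` selects characters 1 to 3 of the raw line, with Python slice semantics. A tiny illustration follows; the sample line and the `cut` helper are made up for demonstration and are not part of mETL:\r\n\r\n\t# Illustration of FixedWidthText ranges -- the sample line is made up;\r\n\t# judging by the map above, real lines are at least 226 characters wide.\r\n\tline = 'I1301'\r\n\r\n\tdef cut( line, spec ):\r\n\t    start, end = ( int( p ) for p in spec.split( ':' ) )\r\n\t    return line[start:end].strip()\r\n\r\n\tcut( line, '0:1' )  # -> 'I'   (FLOW)\r\n\tcut( line, '1:3' )  # -> '13'  (YEAR)\r\n\tcut( line, '3:5' )  # -> '01'  (MONTH)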
\r\n\r\n### Aggregated data conversion and collection\r\nIn many cases we need to create meaningful, clean data from unwieldy data sources. A resource in the format below arrived for each county:\r\n\r\n\t{\r\n\t   \"data\":[\r\n\t      {\r\n\t         \"category\":\"Local business\",\r\n\t         \"category_list\":[\r\n\t            {\r\n\t               \"id\":\"115725465228008\",\r\n\t               \"name\":\"Region\"\r\n\t            },\r\n\t            {\r\n\t               \"id\":\"192803624072087\",\r\n\t               \"name\":\"Fast Food Restaurant\"\r\n\t            }\r\n\t         ],\r\n\t         \"location\":{\r\n\t            \"street\":\"Sz\\u00e9chenyi t\\u00e9r 1.\",\r\n\t            \"city\":\"P\\u00e9cs\",\r\n\t            \"state\":\"\",\r\n\t            \"country\":\"Hungary\",\r\n\t            \"zip\":\"7621\",\r\n\t            \"latitude\":46.07609661278,\r\n\t            \"longitude\":18.228635482364\r\n\t         },\r\n\t         \"name\":\"McDonald's P\\u00e9cs Sz\\u00e9chenyi t\\u00e9r\",\r\n\t         \"id\":\"201944486491918\"\r\n\t      },\r\n\t      …\r\n\t   ]\r\n\t}\r\n\r\nThe goal is to generate a TSV resource that contains all data included in these files. The configuration used to achieve this:\r\n\r\n\tsource:\r\n\t  source: JSON\r\n\t  fields:\r\n\t    - name: category\r\n\t    - name: category_list_id\r\n\t      map: category_list/0/id\r\n\t    - name: category_list_name\r\n\t      map: category_list/0/name\r\n\t    - name: location_street\r\n\t      map: location/street\r\n\t    - name: location_city\r\n\t      map: location/city\r\n\t    - name: location_state\r\n\t      map: location/state\r\n\t    - name: location_country\r\n\t      map: location/country\r\n\t    - name: location_zip\r\n\t      map: location/zip\r\n\t      type: Integer\r\n\t    - name: location_latitude\r\n\t      map: location/latitude\r\n\t      type: Float\r\n\t    - name: location_longitude\r\n\t      map: location/longitude\r\n\t      type: Float\r\n\t    - name: name\r\n\t    - name: id\r\n\t  rootIterator: data\r\n\t\r\n\ttarget:\r\n\t  type: TSV\r\n\t  resource: common.tsv\r\n\t  appendFile: true\r\n\r\nThe program was run as follows:\r\n\r\n`metl-walk config.yml data/`
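\r\n\r\nThe slash-separated `map` values above address nested JSON: each segment is a dictionary key or, when numeric, a list index. A sketch of the idea (illustrative only; the `resolve` helper is not mETL's resolver, and `item` is built from the sample shown above):\r\n\r\n\t# Illustrative resolver for slash-separated map paths -- not mETL's code.\r\n\tdef resolve( record, path ):\r\n\t    for part in path.split( '/' ):\r\n\t        record = record[int( part )] if part.isdigit() else record[part]\r\n\t    return record\r\n\r\n\titem = {\r\n\t    'category_list': [ { 'id': '115725465228008', 'name': 'Region' } ],\r\n\t    'location': { 'city': 'Pécs' },\r\n\t}\r\n\tresolve( item, 'category_list/0/name' )  # -> 'Region'\r\n\tresolve( item, 'location/city' )         # -> 'Pécs'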
\r\n\r\n### Long format conversion from table form\r\n\r\n#### By using the Field expander\r\n\r\nWe have a TSV resource in the following format:\r\n\r\n\tYear\tCZ\tHU\tSK\tPL\r\n\t1999\t32\t694\t129\t230\r\n\t1999\t395\t392\t297\t453\r\n\t1999\t635\t812\t115\t97\r\n\t…\r\n\r\nFor this, we create the configuration below:\r\n\r\n\tsource:\r\n\t  source: TSV\r\n\t  resource: input1.csv\r\n\t  skipRows: 1\r\n\t  fields:\r\n\t    - name: year\r\n\t      type: Integer\r\n\t      map: 0\r\n\t    - name: country\r\n\t    - name: count\r\n\t      type: Integer\r\n\t    - name: cz\r\n\t      type: Integer\r\n\t      map: 1\r\n\t    - name: hu\r\n\t      type: Integer\r\n\t      map: 2\r\n\t    - name: sk\r\n\t      type: Integer\r\n\t      map: 3\r\n\t    - name: pl\r\n\t      type: Integer\r\n\t      map: 4\r\n\t\r\n\tmanipulations:\r\n\t  - expand: Field\r\n\t    fieldNamesAndLabels:\r\n\t      cz: Czech\r\n\t      hu: Hungary\r\n\t      sk: Slovak\r\n\t      pl: Poland\r\n\t    valueFieldName: count\r\n\t    labelFieldName: country\r\n\t  - filter: DropField\r\n\t    fieldNames:\r\n\t      - cz\r\n\t      - hu\r\n\t      - sk\r\n\t      - pl\r\n\t\r\n\ttarget:\r\n\t  type: TSV\r\n\t  resource: output1.csv\r\n\r\nThus we get the following result:\r\n\r\n\tyear\tcountry\tcount\r\n\t1999\tSlovak\t129\r\n\t1999\tCzech\t32\r\n\t1999\tPoland\t230\r\n\t1999\tHungary\t694\r\n\t1999\tSlovak\t297\r\n\t1999\tCzech\t395\r\n\r\n#### By using the Melt expander\r\n\r\nLet's look at the following input file:\r\n\r\n\tfirst\theight\tlast\tweight\tiq\r\n\tJohn\t5.5\t    Doe\t    130\t    102\r\n\tMary\t6.0\t    Bo\t    150\t    98\r\n\r\nAn example configuration file that converts the data to long format, based on key-value pairs:\r\n\r\n\tsource:\r\n\t  source: TSV\r\n\t  resource: input2.csv\r\n\t  skipRows: 1\r\n\t  fields:\r\n\t    - name: first\r\n\t      map: 0\r\n\t    - name: height\r\n\t      type: Float\r\n\t      map: 1\r\n\t    - name: last\r\n\t      map: 2\r\n\t    - name: weight\r\n\t      type: Integer\r\n\t      map: 3\r\n\t    - name: iq\r\n\t      type: Integer\r\n\t      map: 4\r\n\t    - name: quantity\r\n\t    - name: value\r\n\t\r\n\tmanipulations:\r\n\t  - expand: Melt\r\n\t    fieldNames:\r\n\t      - first\r\n\t      - last\r\n\t    valueFieldName: value\r\n\t    labelFieldName: quantity\r\n\t\r\n\ttarget:\r\n\t  type: TSV\r\n\t  resource: output2.csv\r\n\r\nAs a result, the below will be created:\r\n\r\n\tfirst\tlast\tquantity\tvalue\r\n\tJohn\tDoe\t    iq\t        102\r\n\tJohn\tDoe\t    weight\t    130\r\n\tJohn\tDoe\t    height\t    5.5\r\n\tMary\tBo\t    iq\t        98\r\n\tMary\tBo\t    weight\t    150\r\n\tMary\tBo\t    height\t    6.0
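\r\n\r\nThe Melt expander behaves much like the well-known melt operation in pandas. The snippet below reproduces the same reshaping outside mETL; it is an analogy, not how mETL runs internally, and it assumes pandas is installed and `input2.csv` is tab-separated as above:\r\n\r\n\timport pandas as pd\r\n\r\n\t# Same reshaping as the Melt expander above, done with pandas.melt:\r\n\t# `first` and `last` stay as identifiers, every other column becomes\r\n\t# a (quantity, value) pair.\r\n\tdf = pd.read_csv( 'input2.csv', sep='\\t' )\r\n\tlong = pd.melt( df, id_vars=[ 'first', 'last' ],\r\n\t                var_name='quantity', value_name='value' )\r\n\tprint( long )  # first, last, quantity, value -- as in the table above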
\r\n\r\n### Data load to two tables of a database\r\n\r\nLet's look at a complex example based on the use of the ListExpander.\r\n\r\n\tsource:\r\n\t  source: JSON\r\n\t  rootIterator: features\r\n\t  resource: hucitystreet.geojson\r\n\t  fields:\r\n\t    - name: id\r\n\t      type: Integer\r\n\t      map: id\r\n\t      key: true\r\n\t    - name: osm_id\r\n\t      type: Float\r\n\t      map: properties/osm_id\r\n\t    - name: name\r\n\t      map: properties/name\r\n\t    - name: ref\r\n\t      map: properties/ref\r\n\t    - name: type\r\n\t      map: properties/type\r\n\t    - name: oneway\r\n\t      type: Boolean\r\n\t      map: properties/oneway\r\n\t    - name: bridge\r\n\t      type: Boolean\r\n\t      map: properties/bridge\r\n\t    - name: tunnel\r\n\t      type: Boolean\r\n\t      map: properties/tunnel\r\n\t    - name: maxspeed\r\n\t      map: properties/maxspeed\r\n\t    - name: telkod\r\n\t      map: properties/TEL_KOD\r\n\t    - name: telnev\r\n\t      map: properties/TEL_NEV\r\n\t    - name: kistkod\r\n\t      map: properties/KIST_KOD\r\n\t    - name: kistnev\r\n\t      map: properties/KIST_NEV\r\n\t    - name: megynev\r\n\t      map: properties/MEGY_NEV\r\n\t    - name: regnev\r\n\t      map: properties/REG_NEV\r\n\t    - name: regkod\r\n\t      map: properties/REG_KOD\r\n\t    - name: geometry\r\n\t      type: List\r\n\t      map: geometry/coordinates\r\n\t\r\n\ttarget:\r\n\t  type: Database\r\n\t  url: postgresql://metl:metl@localhost:5432/metl\r\n\t  table: osm_streets\r\n\t  createTable: true\r\n\t  replaceTable: true\r\n\t  truncateTable: true\r\n\t  addIDKey: false\r\n\r\nIn the database, the value of the `geometry` field will be stored as `JSON`. We want to break this list up into another table as `latitude` and `longitude` coordinates. Currently the `geometry` field holds values like the following:\r\n\r\n\t[[17.6874552,46.7871465],[17.6865955,46.7870049],[17.6846158,46.7866786],[17.6834977,46.7864944],[17.6822251,46.7862847],[17.6815319,46.7861705],[17.6811473,46.7861071],[17.6795989,46.785852],[17.6774482,46.7854976],[17.6739061,46.7849139],[17.6729351,46.7847539],[17.6720789,46.7846318]]\r\n\r\nWe achieve this with the configuration below:\r\n\r\n\tsource:\r\n\t  source: Database\r\n\t  url: postgresql://metl:metl@localhost:5432/metl\r\n\t  table: osm_streets\r\n\t  fields:\r\n\t    - name: street_id\r\n\t      type: Integer\r\n\t      map: id\r\n\t    - name: latitude\r\n\t      type: Float\r\n\t    - name: longitude\r\n\t      type: Float\r\n\t    - name: geometry\r\n\t      type: List\r\n\t      map: geometry\r\n\t\r\n\tmanipulations:\r\n\t  - expand: ListExpander\r\n\t    listFieldName: geometry\r\n\t    expanderMap:\r\n\t      latitude: 0\r\n\t      longitude: 1\r\n\t  - filter: DropField\r\n\t    fieldNames: geometry\r\n\t\r\n\ttarget:\r\n\t  type: Database\r\n\t  url: postgresql://metl:metl@localhost:5432/metl\r\n\t  table: osm_coords\r\n\t  createTable: true\r\n\t  replaceTable: true\r\n\t  truncateTable: true\r\n\r\nHere we read the previously loaded value of the `geometry` field back into a list, then with the help of the `ListExpander` we define exactly what to write into the `latitude` and `longitude` fields. With this, we have created a table and a connection table belonging to it.
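\r\n\r\nWhat the `expanderMap` does above can be pictured as follows: every item of the list field becomes its own record, and the listed indexes feed the named fields. This is an illustrative sketch, not mETL's implementation; the sample `row` uses the first two coordinate pairs shown above, and the index assignment simply follows the configuration:\r\n\r\n\t# Illustrative sketch of ListExpander -- not mETL's implementation.\r\n\trow = { 'street_id': 1,\r\n\t        'geometry': [ [ 17.6874552, 46.7871465 ],\r\n\t                      [ 17.6865955, 46.7870049 ] ] }\r\n\r\n\t# expanderMap: latitude -> index 0, longitude -> index 1\r\n\texpanded = [ { 'street_id': row['street_id'],\r\n\t               'latitude': coord[0],\r\n\t               'longitude': coord[1] }\r\n\t             for coord in row['geometry'] ]\r\n\t# Two osm_coords records are produced from this one osm_streets record.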
\r\n\r\n### Database transfer\r\n\r\nWe have a MySQL and a PostgreSQL database, and we want to move from one to the other.\r\nThe data can be transferred easily with a single command:\r\n\r\n`metl-transfer config.yml`\r\n\r\nThe configuration file is the following:\r\n\r\n\tsourceURI: mysql+mysqlconnector://xyz:xyz@localhost/database\r\n\ttargetURI: postgresql://xyz:xyz@localhost:5432/database\r\n\t\r\n\ttables:\r\n\t  - [ 'Message', 'message' ]\r\n\t  - [ 'SourceMessage', 'sourcemessage' ]\r\n\t  - related_content\r\n\t  - poi\r\n\t  - shorturl\r\n\t  - ident_data\r\n\t  - user\r\n\t  - estate_agency\r\n\t  - time_series\r\n\t  - auth_item\r\n\t  - property_migration\r\n\t  - property_group\r\n\t  - cemp_id_daily\r\n\t  - cluster\r\n\t  - auth_assignment\r\n\t  - property\r\n\t  - lead\r\n\t  - pic\r\n\t  - lead_comment\r\n\t  - similarity\r\n\t  - property_cluster\r\n\t\r\n\ttruncate:\r\n\t  - auth_item\r\n\t  - estate_agency\r\n\t\r\n\trunAfter: |\r\n\t  UPDATE\r\n\t      property\r\n\t  SET\r\n\t      status = status + 1,\r\n\t      condition = condition + 1,\r\n\t      estatetype = estatetype + 1,\r\n\t      heating = heating + 1,\r\n\t      conveniences = conveniences + 1,\r\n\t      parking = parking + 1,\r\n\t      view = view + 1,\r\n\t      material = material + 1;\r\n\r\n`sourceURI` contains the address of the source database, while `targetURI` contains the address of the target database.\r\nListing the `tables` is not mandatory; if they are not listed, all tables of the source database are copied to the target database. With the `truncate` option, the given tables can be cleared in the target database before loading, while SQL commands can be run with `runBefore` and `runAfter`.\r\n\r\n**Note that the tables must already exist** in the target database; the transfer does not create them.\r\n","funding_links":[],"categories":["Python"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fceumicrodata%2FmETL","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fceumicrodata%2FmETL","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fceumicrodata%2FmETL/lists"}