{"id":14982413,"url":"https://github.com/absaoss/cobrix","last_synced_at":"2026-02-24T13:04:37.839Z","repository":{"id":34345034,"uuid":"142015736","full_name":"AbsaOSS/cobrix","owner":"AbsaOSS","description":"A COBOL parser and Mainframe/EBCDIC data source for Apache Spark","archived":false,"fork":false,"pushed_at":"2025-09-29T08:40:48.000Z","size":5011,"stargazers_count":153,"open_issues_count":105,"forks_count":84,"subscribers_count":23,"default_branch":"master","last_synced_at":"2025-09-29T10:23:37.305Z","etag":null,"topics":["cobol","cobol-parser","copybook","ebcdic","etl","mainframe","scalable","spark"],"latest_commit_sha":null,"homepage":"","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/AbsaOSS.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":".github/CODEOWNERS","security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2018-07-23T13:08:07.000Z","updated_at":"2025-09-29T08:40:53.000Z","dependencies_parsed_at":"2024-01-15T17:16:04.226Z","dependency_job_id":"fa429ac5-9c84-4ee7-b987-71832be2c0dc","html_url":"https://github.com/AbsaOSS/cobrix","commit_stats":{"total_commits":1202,"total_committers":30,"mean_commits":40.06666666666667,"dds":0.3252911813643927,"last_synced_commit":"9d1fb43f8369200695effc5cd55490f90d75147c"},"previous_names":[],"tags_count":99,"template":false,"template_full_name":null,"purl":"pkg:github/AbsaOSS/cobrix","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AbsaOSS%2Fcobrix","tags_url":"https://repos.ecosyste.ms/ap
i/v1/hosts/GitHub/repositories/AbsaOSS%2Fcobrix/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AbsaOSS%2Fcobrix/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AbsaOSS%2Fcobrix/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/AbsaOSS","download_url":"https://codeload.github.com/AbsaOSS/cobrix/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AbsaOSS%2Fcobrix/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":278872050,"owners_count":26060525,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-07T02:00:06.786Z","response_time":59,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cobol","cobol-parser","copybook","ebcdic","etl","mainframe","scalable","spark"],"created_at":"2024-09-24T14:05:22.415Z","updated_at":"2026-02-24T13:04:37.809Z","avatar_url":"https://github.com/AbsaOSS.png","language":"Scala","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Cobrix - COBOL Data Source for Apache Spark\n\n[![License: Apache v2](https://img.shields.io/badge/license-Apache%202-blue)](https://directory.fsf.org/wiki/License:Apache-2.0)\n[![FOSSA 
Status](https://app.fossa.com/api/projects/custom%2B24661%2Fgithub.com%2FAbsaOSS%2Fcobrix.svg?type=shield)](https://app.fossa.com/projects/custom%2B24661%2Fgithub.com%2FAbsaOSS%2Fcobrix)\n[![Build](https://github.com/AbsaOSS/cobrix/workflows/Build/badge.svg)](https://github.com/AbsaOSS/cobrix/actions)\n[![Maven Central](https://img.shields.io/maven-central/v/za.co.absa.cobrix/cobol-parser_2.12?label=cobol-parser)](https://mvnrepository.com/artifact/za.co.absa.cobrix/cobol-parser)\n[![Maven Central](https://img.shields.io/maven-central/v/za.co.absa.cobrix/spark-cobol_2.12?label=spark-cobol)](https://mvnrepository.com/artifact/za.co.absa.cobrix/spark-cobol)\n\nPain free Spark/Cobol files integration.\n\nSeamlessly query your COBOL/EBCDIC binary files as Spark Dataframes and streams.   \n\nAdd mainframe as a source to your data engineering strategy.\n\n## Motivation\n\nAmong the motivations for this project, it is possible to highlight:\n\n- Lack of expertise in the Cobol ecosystem, which makes it hard to integrate mainframes into data engineering strategies.\n\n- Lack of support from the open-source community to initiatives in this field.\n\n- The overwhelming majority (if not all) of tools to cope with this domain are proprietary.\n\n- Several institutions struggle daily to maintain their legacy mainframes, which prevents them from evolving to more modern approaches to data management.\n\n- Mainframe data can only take part in data science activities through very expensive investments.\n\n\n## Features\n\n- Supports primitive types (although some are \"Cobol compiler specific\").\n\n- Supports REDEFINES, OCCURS and DEPENDING ON fields (e.g. unchecked unions and variable-size arrays).\n\n- Supports nested structures and arrays.\n\n- Supports Hadoop (HDFS, S3, ...) 
as well as local file system.\n\n- The COBOL copybooks parser doesn't have a Spark dependency and can be reused for integrating into other data processing engines.\n\n- Supports reading files compressed in Hadoop-compatible way (gzip, bzip2, etc), but with limited parallelism. \n  Uncompressed files are preferred for performance. \n\n## Videos\n\nWe have presented Cobrix at DataWorks Summit 2019 and Spark Summit 2019 conferences. The screencasts are available here:\n\nDataWorks Summit 2019 (General Cobrix workflow for hierarchical databases): https://www.youtube.com/watch?v=o_up7X3ZL24\n\nSpark Summit 2019 (More detailed overview of performance optimizations): https://www.youtube.com/watch?v=BOBIdGf3Tm0\n\n## Requirements\n\n| spark-cobol | Spark   |\n|-------------|---------|\n| 0.x         | 2.2+    |\n| 1.x         | 2.2+    |\n| 2.x         | 2.4.3+  |\n| 2.6.x+      | 3.2.0+  |\n\n## Linking\n\nYou can link against this library in your program at the following coordinates:\n\n\u003ctable\u003e\n\u003ctr\u003e\u003cth\u003eScala 2.11\u003c/th\u003e\u003cth\u003eScala 2.12\u003c/th\u003e\u003cth\u003eScala 2.13\u003c/th\u003e\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"center\"\u003e\n\u003ca href = \"https://mvnrepository.com/artifact/za.co.absa.cobrix/spark-cobol\"\u003e\u003cimg src=\"https://img.shields.io/maven-central/v/za.co.absa.cobrix/spark-cobol_2.11?label=spark-cobol_2.11\"\u003e\u003c/a\u003e\u003c/td\u003e\n\u003ctd align=\"center\"\u003e\n\u003ca href = \"https://mvnrepository.com/artifact/za.co.absa.cobrix/spark-cobol\"\u003e\u003cimg src=\"https://img.shields.io/maven-central/v/za.co.absa.cobrix/spark-cobol_2.12?label=spark-cobol_2.12\"\u003e\u003c/a\u003e\u003c/td\u003e\n\u003ctd align=\"center\"\u003e\n\u003ca href = \"https://mvnrepository.com/artifact/za.co.absa.cobrix/spark-cobol\"\u003e\u003cimg 
src=\"https://img.shields.io/maven-central/v/za.co.absa.cobrix/spark-cobol_2.13?label=spark-cobol_2.13\"\u003e\u003c/a\u003e\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd\u003e\n\u003cpre\u003egroupId: za.co.absa.cobrix\u003cbr\u003eartifactId: spark-cobol_2.11\u003cbr\u003eversion: 2.9.8\u003c/pre\u003e\n\u003c/td\u003e\n\u003ctd\u003e\n\u003cpre\u003egroupId: za.co.absa.cobrix\u003cbr\u003eartifactId: spark-cobol_2.12\u003cbr\u003eversion: 2.9.8\u003c/pre\u003e\n\u003c/td\u003e\n\u003ctd\u003e\n\u003cpre\u003egroupId: za.co.absa.cobrix\u003cbr\u003eartifactId: spark-cobol_2.13\u003cbr\u003eversion: 2.9.8\u003c/pre\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003c/table\u003e\n\n## Using with Spark shell\nThis package can be added to Spark using the `--packages` command line option. For example, to include it when starting the spark shell:\n\n\n### Spark compiled with Scala 2.11\n```\n$SPARK_HOME/bin/spark-shell --packages za.co.absa.cobrix:spark-cobol_2.11:2.9.8\n```\n\n### Spark compiled with Scala 2.12\n```\n$SPARK_HOME/bin/spark-shell --packages za.co.absa.cobrix:spark-cobol_2.12:2.9.8\n```\n\n### Spark compiled with Scala 2.13\n```\n$SPARK_HOME/bin/spark-shell --packages za.co.absa.cobrix:spark-cobol_2.13:2.9.8\n```\n\n## Usage\n\n## Quick start\n\nThis repository contains several standalone example applications in `examples/spark-cobol-app` directory.\nIt is a Maven project that contains several examples:\n* `SparkTypesApp` is an example of a very simple mainframe file processing.\n   It is a fixed record length raw data file with a corresponding copybook. The copybook \n   contains examples of various numeric data types Cobrix supports.\n* `SparkCobolApp` is an example of a Spark Job for handling multisegment variable record\n   length mainframe files.  \n* `SparkCodecApp` is an example usage of a custom record header parser. This application reads a variable\n   record length file having non-standard RDW headers. 
In this example the RDW header is 5 bytes instead of 4.\n* `SparkCobolHierarchical` is an example of processing an EBCDIC multisegment file extracted from a hierarchical database.\n\n\nThe example project can be used as a template for creating a Spark application. Refer to README.md\nof that project for a detailed guide on how to run the examples locally and on a cluster.\n\nWhen running `mvn clean package` in `examples/spark-cobol-app` an uber jar will be created. It can be used to run\njobs via `spark-submit` or `spark-shell`. \n\n## How to generate Code coverage report\n```sbt\nsbt ++{scala_version} jacoco\n```\nCode coverage will be generated on path:\n```\n{project-root}/cobrix/{module}/target/scala-{scala_version}/jacoco/report/html\n```\n\n### Reading Cobol binary files from Hadoop/local and querying them \n\n1. Create a Spark ```SQLContext```\n\n2. Start a ```sqlContext.read``` operation specifying ```za.co.absa.cobrix.spark.cobol.source``` as the format\n\n3. Specify the path to the copybook describing the files through ```... .option(\"copybook\", \"path_to_copybook_file\")```. \n   - By default the copybook is expected to be in the default Hadoop filesystem (HDFS, S3, etc). \n   - You can specify that a copybook is located in the local file system by adding the `file://` prefix. \n   - For example, you can specify a local file like this `.option(\"copybook\", \"file:///home/user/data/copybook.cpy\")`.\n   - Alternatively, instead of providing a path to a copybook file you can provide the contents of the copybook itself by using `.option(\"copybook_contents\", \"...copybook contents...\")`. \n   - You can store the copybook in the JAR itself in the resources section. In this case use the `jar://` prefix, e.g.: `.option(\"copybook\", \"jar:///copybooks/copybook.cpy\")`.\n\n4. Specify the path to the Hadoop directory containing the files: ```... .load(\"s3a://path_to_directory_containing_the_binary_files\")``` \n\n5. 
Specify the query you would like to run on the Cobol Dataframe\n\nBelow is an example whose full version can be found at ```za.co.absa.cobrix.spark.cobol.examples.SampleApp``` and ```za.co.absa.cobrix.spark.cobol.examples.CobolSparkExample```\n\n```scala\nimport org.apache.spark.sql.SparkSession\n\nval sparkBuilder = SparkSession.builder().appName(\"Example\")\nval spark = sparkBuilder\n  .getOrCreate()\n\nval cobolDataframe = spark\n  .read\n  .format(\"cobol\")\n  .option(\"copybook\", \"data/test1_copybook.cob\")\n  .load(\"data/test2_data\")\n\ncobolDataframe\n    .filter(\"RECORD.ID % 2 = 0\") // keep only the records with even values of the nested field 'RECORD.ID'\n    .take(10)\n    .foreach(v =\u003e println(v))\n```\n\nThe full example is available [here](https://github.com/AbsaOSS/cobrix/blob/master/spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/examples/CobolSparkExample.scala)\n\nIn some scenarios Spark is unable to find the \"cobol\" data source by its short name. In that case you can use the full path to the source class instead: `.format(\"za.co.absa.cobrix.spark.cobol.source\")`\n\nCobrix assumes input data is encoded in EBCDIC. You can load ASCII files as well by specifying the following option:\n`.option(\"encoding\", \"ascii\")`.\n\nIf the input file is a text file (CRLF / LF are used to split records), use\n`.option(\"is_text\", \"true\")`.\n\nMultisegment ASCII text files are supported using this option:\n`.option(\"record_format\", \"D\")`.\n\nCobrix has better handling of special characters and partial records using its extension format:\n`.option(\"record_format\", \"D2\")`.\n\nRead more on record formats at https://www.ibm.com/docs/en/zos/2.4.0?topic=files-selecting-record-formats-non-vsam-data-sets\n\n### Streaming Cobol binary files from a directory\n\n1. Create a Spark ```StreamingContext```\n\n2. Import the binary files/stream conversion manager: ```za.co.absa.spark.cobol.source.streaming.CobolStreamer._```\n\n3. 
Read the binary files contained in the path specified when creating the ```SparkSession``` as a stream: ```... streamingContext.cobolStream()```\n\n4. Apply queries on the stream: ```... stream.filter(\"some_filter\") ...```\n\n5. Start the streaming job.\n\nBelow is an example whose full version can be found at ```za.co.absa.cobrix.spark.cobol.examples.StreamingExample```\n\n```scala\nimport org.apache.spark.sql.SparkSession\nimport org.apache.spark.streaming.{Seconds, StreamingContext}\n\nval spark = SparkSession\n  .builder()\n  .appName(\"CobolParser\")\n  .master(\"local[2]\")\n  .config(\"duration\", 2)\n  .config(\"copybook\", \"path_to_the_copybook\")\n  .config(\"path\", \"path_to_source_directory\") // can be either local or Hadoop (s3://, hdfs://, etc)\n  .getOrCreate()\n\nval streamingContext = new StreamingContext(spark.sparkContext, Seconds(3))\n\nimport za.co.absa.spark.cobol.source.streaming.CobolStreamer._ // imports the Cobol streams manager\n\nval stream = streamingContext.cobolStream() // streams the binary files into the application\n\nstream\n    .filter(row =\u003e row.getAs[Integer](\"NUMERIC_FLD\") % 2 == 0) // keeps only the rows with even values of the nested field 'NUMERIC_FLD'\n    .print(10)\n\nstreamingContext.start()\nstreamingContext.awaitTermination()\n```\n\n### Using Cobrix from a Spark shell\n\nTo query mainframe files interactively using `spark-shell` you need to provide jar(s) containing Cobrix and its dependencies.\nThis can be done either by downloading all the dependencies as separate jars or by creating an uber jar that contains all\nof the dependencies.\n\n#### Getting all Cobrix dependencies\n\nCobrix's `spark-cobol` data source depends on the COBOL parser that is a part of Cobrix itself.\n\nThe jars that you need to get are:\n\n* spark-cobol_2.12-2.9.8.jar\n* cobol-parser_2.12-2.9.8.jar\n\n\u003e Versions older than 2.8.0 also need `scodec-core_2.12-1.10.3.jar` and `scodec-bits_2.12-1.1.4.jar`.\n\n\u003e Versions older than 2.7.1 also need `antlr4-runtime-4.8.jar`.\n\nAfter that you can specify 
these jars in `spark-shell` command line. Here is an example:\n```\n$ spark-shell --packages za.co.absa.cobrix:spark-cobol_2.12:2.9.8\nor \n$ spark-shell --master yarn --deploy-mode client --driver-cores 4 --driver-memory 4G --jars spark-cobol_2.12-2.9.8.jar,cobol-parser_2.12-2.9.8.jar\n\nSetting default log level to \"WARN\".\nTo adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\nSpark context available as 'sc' (master = yarn, app id = application_1535701365011_2721).\nSpark session available as 'spark'.\nWelcome to\n      ____              __\n     / __/__  ___ _____/ /__\n    _\\ \\/ _ \\/ _ `/ __/  '_/\n   /___/ .__/\\_,_/_/ /_/\\_\\   version 2.4.5\n      /_/\n\nUsing Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_171)\nType in expressions to have them evaluated.\nType :help for more information.\n\nscala\u003e val df = spark.read.format(\"cobol\").option(\"copybook\", \"/data/example1/test3_copybook.cob\").load(\"/data/example1/data\")\ndf: org.apache.spark.sql.DataFrame = [TRANSDATA: struct\u003cCURRENCY: string, SIGNATURE: string ... 
4 more fields\u003e]\n\nscala\u003e df.show(false)\n+----------------------------------------------------+\n|TRANSDATA                                           |\n+----------------------------------------------------+\n|[GBP,S9276511,Beierbauh.,0123330087,1,89341.00]     |\n|[ZAR,S9276511,Etqutsa Inc.,0039003991,1,2634633.00] |\n|[USD,S9276511,Beierbauh.,0038903321,0,75.71]        |\n|[ZAR,S9276511,Beierbauh.,0123330087,0,215.39]       |\n|[ZAR,S9276511,Test Bank,0092317899,1,643.94]        |\n|[ZAR,S9276511,Xingzhoug,8822278911,1,998.03]        |\n|[USD,S9276511,Beierbauh.,0123330087,1,848.88]       |\n|[USD,S9276511,Beierbauh.,0123330087,0,664.11]       |\n|[ZAR,S9276511,Beierbauh.,0123330087,1,55262.00]     |\n+----------------------------------------------------+\nonly showing top 20 rows\n\nscala\u003e\n``` \n\n#### Creating an uber jar\n\nGathering all dependencies manually may be a tiresome task. A better approach would be to create a jar file that contains\nall required dependencies (an uber jar aka fat jar). \n\nCreating an uber jar for Cobrix is very easy. Steps to build:\n- Install JDK 8\n- Install SBT\n- Clone the Cobrix repository\n- Run `sbt assembly` in the root directory of the repository specifying the Scala and Spark version you want to build for:\n    ```sh\n    # For Scala 2.11\n    sbt -DSPARK_VERSION=\"2.4.8\" ++2.11.12 assembly\n  \n    # For Scala 2.12\n    sbt -DSPARK_VERSION=\"3.3.4\" ++2.12.20 assembly\n    sbt -DSPARK_VERSION=\"3.4.4\" ++2.12.20 assembly\n  \n    # For Scala 2.13\n    sbt -DSPARK_VERSION=\"3.3.4\" ++2.13.17 assembly\n    sbt -DSPARK_VERSION=\"3.4.4\" ++2.13.17 assembly\n    ```\n\nYou can collect the uber jar of `spark-cobol` either in\n`spark-cobol/target/scala-2.11/` or in `spark-cobol/target/scala-2.12/` depending on the Scala version you used.\nThe fat jar will have the '-bundle' suffix. 
You can also download pre-built bundles from https://github.com/AbsaOSS/cobrix/releases/tag/v2.7.3\n\nThen, run `spark-shell` or `spark-submit` adding the fat jar as the option.\n```sh\n$ spark-shell --jars spark-cobol_2.12_3.3-2.9.9-SNAPSHOT-bundle.jar\n```\n\n\u003e \u003cb\u003eA note for building and running tests on Windows\u003c/b\u003e\n\u003e - `java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$POSIX.stat` is a Hadoop compatibility with\n\u003e   Windows issue. The workaround is described here: https://stackoverflow.com/questions/41851066/exception-in-thread-main-java-lang-unsatisfiedlinkerror-org-apache-hadoop-io\n\u003e - When running assembly with `-DSPARK_VERSION=...` on Windows, it seems an sbt compatibility with Windows issue:\n\u003e   https://stackoverflow.com/questions/59144913/run-sbt-1-2-8-project-with-java-d-options-on-windows\n\u003e   You can work around it by using default Spark version for a given Scala version:\n\u003e   ```sh\n\u003e   sbt ++2.11.12 assembly\n\u003e   sbt ++2.12.20 assembly\n\u003e   sbt ++2.13.17 assembly\n\u003e   ```\n\n## Other Features\n\n### Loading several paths\nCurrently, specifying multiple paths in `load()` is not supported. Use the following syntax: \n```scala\n    spark\n      .read\n      .format(\"cobol\")\n      .option(\"copybook_contents\", copybook)\n      .option(\"paths\", inputPaths.mkString(\",\"))\n      .load()\n```\n\n### Spark SQL schema extraction\nThis library also provides convenient methods to extract Spark SQL schemas and Cobol layouts from copybooks.  
\n\nIf you want to extract a Spark SQL schema from a copybook by providing same options you provide to Spark: \n```scala\n// Same options that you use for spark.read.format(\"cobol\").option()\nval options = Map(\"schema_retention_policy\" -\u003e \"keep_original\")\n\nval cobolSchema = CobolSchema.fromSparkOptions(Seq(copybook), options)\nval sparkSchema = cobolSchema.getSparkSchema.toString()\n\nprintln(sparkSchema)\n```\n\nIf you want to extract a Spark SQL schema from a copybook using the Cobol parser directly:\n```scala\nimport za.co.absa.cobrix.cobol.parser.CopybookParser\nimport za.co.absa.cobrix.cobol.reader.policies.SchemaRetentionPolicy\nimport za.co.absa.cobrix.spark.cobol.schema.CobolSchema\n\nval parsedSchema = CopybookParser.parseSimple(copyBookContents)\nval cobolSchema = new CobolSchema(parsedSchema, SchemaRetentionPolicy.CollapseRoot, inputFileNameField = \"\", generateRecordId = false)\nval sparkSchema = cobolSchema.getSparkSchema.toString()\n\nprintln(sparkSchema)\n```\n\nIf you want to check the layout of the copybook: \n\n```scala\nimport za.co.absa.cobrix.cobol.parser.CopybookParser\n\nval copyBook = CopybookParser.parseSimple(copyBookContents)\nprintln(copyBook.generateRecordLayoutPositions())\n```\n\n### Spark schema metadata\nWhen a copybook is converted to a Spark schema, some information is lost, such as length of string fields or\nminimum and maximum number of elements in arrays. To preserve this information, Cobrix adds metadata to Spark schema\nfields. The following metadata is added:\n\n| Metadata key | Description                                 |\n|--------------|---------------------------------------------|\n| maxLength    | The maximum length of a string field.       |\n| minElements  | The minimum number of elements in an array. |\n| maxElements  | The maximum number of elements in an array. 
|\n\nYou can access the metadata in the usual way:\n```scala\n// This example returns the maximum length of a string field that is the first field of the copybook\ndf.schema.fields(0).metadata.getLong(\"maxLength\")\n```\n\n### Fixed record length files\nCobrix assumes files have a fixed length (`F`) record format by default. The record length is determined by the length of\nthe record defined by the copybook. But you can specify the record length explicitly:\n```\n.option(\"record_format\", \"F\")\n.option(\"record_length\", \"250\")\n```\n\nFixed block record formats (`FB`) are also supported. The support is _experimental_; if you find any issues, please\nlet us know. When the record format is 'FB' you can specify the block length or the number of records per\nblock. As with 'F', if `record_length` is not specified, it will be determined from the copybook.\n\nRecords that have BDWs, but not RDWs, can be read like this:\n```\n.option(\"record_format\", \"FB\")\n.option(\"record_length\", \"250\")\n```\nor simply\n```\n.option(\"record_format\", \"FB\")\n```\n\nRecords that have neither BDWs nor RDWs can be read like this:\n\n```\n.option(\"record_format\", \"FB\")\n.option(\"record_length\", \"250\")\n.option(\"block_length\", \"500\")\n```\nor\n```\n.option(\"record_format\", \"FB\")\n.option(\"record_length\", \"250\")\n.option(\"records_per_block\", \"2\")\n```\n\nMore on fixed-length record formats: https://www.ibm.com/docs/en/zos/2.3.0?topic=sets-fixed-length-record-formats\n\n### Variable length records support\n\nCobrix supports variable record length files. The only requirement is that such a file should contain a standard 4 byte\nrecord header known as _Record Descriptor Word_ (RDW). Such headers are created automatically when a variable record length\nfile is copied from a mainframe. Another type of file is _variable blocked length_. Such files contain _Block Descriptor\nWord_ (BDW), as well as Record Descriptor Word (RDW) headers. 
Any such header can be either big-endian or little-endian.\nAlso, quite often BDW headers need to be adjusted in order to be read properly. See the use cases section below.\n\nTo load a variable length record file, the following option should be specified:\n```\n.option(\"record_format\", \"V\")\n```\n\nTo load a variable blocked length record file, the following option should be specified:\n```\n.option(\"record_format\", \"VB\")\n```\n\nMore on record formats: https://www.ibm.com/docs/en/zos/2.3.0?topic=files-selecting-record-formats-non-vsam-data-sets\n\nThe space used by the headers (both BDW and RDW) should not be mentioned in the copybook if this option is used. Please refer to the\n'Record headers support' section below. \n\nIf a field of the copybook contains the length of each record, you can use 'record_length_field' like this:\n```\n.option(\"record_format\", \"F\")\n.option(\"record_length_field\", \"RECORD_LENGTH\")\n```\n\nYou can use expressions as well:\n```\n.option(\"record_format\", \"F\")\n.option(\"record_length_field\", \"RECORD_LENGTH + 10\")\n```\nor\n```\n.option(\"record_format\", \"F\")\n.option(\"record_length_field\", \"FIELD1 * 10 + 200\")\n```\n\nIf the record field contains a string that can be mapped to a record size, you can add the mapping as a JSON:\n```\n.option(\"record_format\", \"F\")\n.option(\"record_length_field\", \"FIELD_STR\")\n.option(\"record_length_map\", \"\"\"{\"SEG1\":100,\"SEG2\":200}\"\"\")  \n```\n\nYou can specify the default record size by defining the key \"_\":\n```\n.option(\"record_format\", \"F\")\n.option(\"record_length_field\", \"FIELD_STR\")\n.option(\"record_length_map\", \"\"\"{\"SEG1\":100,\"SEG2\":200,\"_\":100}\"\"\")  \n```\n\n### Use cases for various variable length formats\n\nIn order to understand the file format it is often sufficient to look at the first 4 bytes of the file (in case of RDW only files),\nor the first 8 bytes of a file + lookup the offset of the block (in case of BDW + RDW)  
\n\n#### V header examples (have only RDW headers)\n\nIn order to determine if an RDW is big- or little-endian, take a look at the first 4 bytes. If the first 2 bytes are zeros,\nit's a little-endian RDW header, otherwise it is a big-endian RDW header.\n\n| Header example |                           Description                                  |       Options  |\n| -------------- |:---------------------------------------------------------------------- | :----------------\n| `00 10 00 00`  |  Big-endian RDW, no adjustments,\u003cbr/\u003ethe record size: `0x10 = 16 bytes`    | `.option(\"record_format\", \"V\")`\u003cbr/\u003e`.option(\"is_rdw_big_endian\", \"true\")`  |\n| `01 10 00 00`  |  Big-endian RDW, adjustment `-2`,\u003cbr/\u003ethe record size: `0x01*256 + 0x10 - 2 = 256 + 16 - 2 = 270 bytes`    | `.option(\"record_format\", \"V\")`\u003cbr/\u003e`.option(\"is_rdw_big_endian\", \"true\")`\u003cbr/\u003e`.option(\"rdw_adjustment\", -2)`  |\n| `00 00 10 00`  |  Little-endian RDW, no adjustments,\u003cbr/\u003ethe record size: `0x10 = 16 bytes` | `.option(\"record_format\", \"V\")`\u003cbr/\u003e`.option(\"is_rdw_big_endian\", \"false\")` |\n| `00 00 10 01`  |  Little-endian RDW, adjustment `-2`,\u003cbr/\u003ethe record size: `0x01*256 + 0x10 - 2 = 256 + 16 - 2 = 270 bytes` | `.option(\"record_format\", \"V\")`\u003cbr/\u003e`.option(\"is_rdw_big_endian\", \"false\")`\u003cbr/\u003e`.option(\"rdw_adjustment\", -2)` |\n\n#### VB header examples (have both BDW and RDW headers)\n\nIt is harder to determine if a BDW header is big- or little-endian since BDW header bytes can be all non-zero.\nBut in the VB format RDWs follow BDWs and use the same endianness. 
You can determine the endianness from an RDW, and use the same option for BDW.\n\n|               Header example             |                           Description                           |       Options  |\n| ---------------------------------------- |:--------------------------------------------------------------- | :----------------\n| `00 28 00 00`  `00 10 00 00` (BDW, RDW)  |  Big-endian BDW+RDW, no adjustments,\u003cbr/\u003eBDW = `0x28 = 40 bytes`\u003cbr/\u003ethe record size: `0x10 = 16 bytes`    | `.option(\"record_format\", \"VB\")`\u003cbr/\u003e`.option(\"is_bdw_big_endian\", \"true\")`\u003cbr/\u003e`.option(\"is_rdw_big_endian\", \"true\")`  |\n| `00 2C 00 00`  `00 10 00 00` (BDW, RDW)  |  Big-endian BDW+RDW, need -4 byte adjustment since BDW includes its own length,\u003cbr/\u003eBDW = `0x2C - 4 = 40 bytes`\u003cbr/\u003ethe record size: `0x10 = 16 bytes`    | `.option(\"record_format\", \"VB\")`\u003cbr/\u003e`.option(\"is_bdw_big_endian\", \"true\")`\u003cbr/\u003e`.option(\"is_rdw_big_endian\", \"true\")`\u003cbr/\u003e`.option(\"rdw_adjustment\", -4)`  |\n| `00 00 28 00`  `00 00 10 00` (BDW, RDW)  |  Little-endian BDW+RDW, no adjustments,\u003cbr/\u003eBDW = `0x28 = 40 bytes`\u003cbr/\u003ethe record size: `0x10 = 16 bytes`    | `.option(\"record_format\", \"VB\")`\u003cbr/\u003e`.option(\"is_bdw_big_endian\", \"false\")`\u003cbr/\u003e`.option(\"is_rdw_big_endian\", \"false\")`  |\n| `00 00 2C 00`  `00 00 10 00` (BDW, RDW)  |  Little-endian BDW+RDW, need -4 byte adjustment since BDW includes its own length,\u003cbr/\u003eBDW = `0x2C - 4 = 40 bytes`\u003cbr/\u003ethe record size: `0x10 = 16 bytes`    | `.option(\"record_format\", \"VB\")`\u003cbr/\u003e`.option(\"is_bdw_big_endian\", \"false\")`\u003cbr/\u003e`.option(\"is_rdw_big_endian\", \"false\")`\u003cbr/\u003e`.option(\"rdw_adjustment\", -4)`  |\n\n### Schema collapsing\n\nMainframe data often contain only one root GROUP. 
In such cases such a GROUP can be considered something similar to XML rowtag.\nCobrix allows either to collapse or to retain the GROUP. To turn this on use the following option:\n\n```scala\n.option(\"schema_retention_policy\", \"collapse_root\")\n```\nor\n```scala\n.option(\"schema_retention_policy\", \"keep_original\")\n```\n\nLet's look at an example. Let's say we have a copybook that looks like this:\n```cobol\n       01  RECORD.\n           05  ID                        PIC S9(4)  COMP.\n           05  COMPANY.\n               10  SHORT-NAME            PIC X(10).\n               10  COMPANY-ID-NUM        PIC 9(5) COMP-3.\n```\n\nWhen \"schema_retention_policy\" is set to \"collapse_root\" (default), the root group will be collapsed and the schema will look\nlike this (note the RECORD field is not part of the schema):\n```\nroot\n |-- ID: integer (nullable = true)\n |-- COMPANY: struct (nullable = true)\n |    |-- SHORT_NAME: string (nullable = true)\n |    |-- COMPANY_ID_NUM: integer (nullable = true)\n```\n\nBut when \"schema_retention_policy\" is set to \"keep_original\", the schema will look like this (note the RECORD field is part of the schema):\n\n```\nroot\n |-- RECORD: struct (nullable = true)\n |    |-- ID: integer (nullable = true)\n |    |-- COMPANY: struct (nullable = true)\n |    |    |-- SHORT_NAME: string (nullable = true)\n |    |    |-- COMPANY_ID_NUM: integer (nullable = true)\n```\n\nYou can experiment with this feature using built-in example in `za.co.absa.cobrix.spark.cobol.examples.CobolSparkExample`\n\n\n### Record Id fields generation\n\nFor data that has record order dependency generation of \"File_Id\", \"Record_Id\", and \"Record_Byte_Length\" fields is\nsupported. The values of the File_Id column will be unique for each file when a directory is specified as the source for\ndata. 
The values of the Record_Id column will be unique and sequential record identifiers within the file.\n\nTo turn this feature on, use\n```\n.option(\"generate_record_id\", true)\n```\n\nThe following fields will be added to the top of the schema:\n```\nroot\n |-- File_Id: integer (nullable = false)\n |-- Record_Id: long (nullable = false)\n |-- Record_Byte_Length: integer (nullable = false)\n```\n\nYou can use this option to generate raw bytes of each record as a binary field:\n```\n.option(\"generate_record_bytes\", \"true\")\n```\n\nThe following field will be added to the top of the schema:\n```\nroot\n |-- Record_Bytes: binary (nullable = false)\n```\n\nYou can generate `_corrupt_fields` that will contain original binary values of fields Cobrix was unable to decode:\n```scala\n.option(\"generate_corrupt_fields\", \"true\")\n```\n\n### Locality optimization for variable-length records parsing\n\nVariable-length records depend on headers to have their length calculated, which makes it hard to achieve parallelism while parsing.\n\nCobrix strives to overcome this drawback by performing two-stage parsing. The first stage traverses the records retrieving their lengths\nand offsets into structures called indexes. 
Then, the indexes are distributed across the cluster, which allows for parallel variable-length\nrecords parsing.\n\nHowever effective, this strategy may also suffer from excessive shuffling, since indexes may be sent to executors far from the actual data.\n\nThe latter issue is overcome by extracting the preferred locations for each index directly from HDFS/S3/..., and then passing those locations to\nSpark during the creation of the RDD that distributes the indexes.\n\nWhen processing large collections, the overhead of collecting the locations is offset by the benefits of locality, thus, this feature is\nenabled by default, but can be disabled by the configuration below:\n```\n.option(\"improve_locality\", false)\n```\n\n### Workload optimization for variable-length records parsing\n\nThis feature works only for HDFS, not for any other of Hadoop filesystems.\n\nWhen dealing with variable-length records, Cobrix strives to maximize locality by identifying the preferred locations in the cluster to parse\neach record, i.e. the nodes where the record resides.\n\nThis feature is implemented by querying HDFS about the locations of the blocks containing each record and instructing Spark to create the\npartition for that record in one of those locations.\n\nHowever, sometimes, new nodes can be added to the cluster after the Cobol file is stored, in which case those nodes would be ignored when\nprocessing the file since they do not contain any record.\n\nTo overcome this issue, Cobrix also strives to re-balance the records among the new nodes at parsing time, as an attempt to maximize the\nutilization of the cluster. 
This is done by identifying the busiest nodes and sharing part of their burden with the new ones.\n\nSince this is not an issue present in most cluster configurations, this feature is disabled by default, and can be enabled with the\nconfiguration below:\n```\n.option(\"optimize_allocation\", true)\n```\n\nIf, however, the option ```improve_locality``` is disabled, this option will also be disabled regardless of the value in ```optimize_allocation```.\n\n### Record headers support\n\nAs you may already know, a file in the mainframe world does not mean the same as in the PC world. On PCs we think of a file\nas a stream of bytes that we can open, read/write and close. On mainframes a file can be a set of records that we can query.\nA record is a blob of bytes, and records can have different sizes. The mainframe's 'filesystem' handles the mapping between logical records\nand the physical location of data.\n\n\u003e _Details are available at this [Wikipedia article](https://en.wikipedia.org/wiki/MVS) (look for MVS filesystem)._ \n\nSo usually a file cannot simply be 'copied' from a mainframe. When files are transferred using tools like XCOM, each\nrecord is prepended with an additional *record header* or *RDW*. This header allows readers of a file on a PC to restore the\n'set of records' nature of the file.\n\nMainframe files coming from IMS and copied through specialized tools contain records (the payload) having the schema of the DB's\ncopybook wrapped with DB export tool headers, which are in turn wrapped with record headers. Like this:\n\nRECORD_HEADERS ( TOOL_HEADERS ( PAYLOAD ) )\n\n\u003e _Similar to Internet's TCP protocol   IP_HEADERS ( TCP_HEADERS ( PAYLOAD ) )._\n\nTOOL_HEADERS are application dependent. Often they contain the length of the payload, but this length is sometimes\nnot very reliable. RECORD_HEADERS contain the record length (including TOOL_HEADERS length) and have proved to be reliable.\n\nFor fixed record length files record headers can be ignored since we already know the record length. 
But for variable\nrecord length files and for multisegment files record headers can be considered the most reliable single point of truth\nabout record length.\n\nYou can instruct the reader to use 4 byte record headers to extract records from a mainframe file.\n\n```\n.option(\"record_format\", \"V\")\n```\n\nThis is very helpful for multisegment files when segments have different lengths. Since each segment has its own\ncopybook, it is very convenient to extract segments one by one by combining the `record_format = V` option with the segment\nfilter option.\n\n```\n.option(\"segment_field\", \"SEG-ID\")\n.option(\"segment_filter\", \"1122334\")\n```\n\nIn this example it is expected that the copybook has a field with the name 'SEG-ID'. The data source will read all\nsegments, but will parse only the ones that have `SEG-ID = \"1122334\"`.\n\nIf you want to parse multiple segments, set the option 'segment_filter' to a comma separated list of the segment values.\nFor example:\n```\n.option(\"segment_field\", \"SEG-ID\")\n.option(\"segment_filter\", \"1122334,1122335\")\n```\nwill only parse the records with `SEG-ID = \"1122334\" OR SEG-ID = \"1122335\"`\n\n### Custom record extractors\n\nCustom record extractors can be used to customize how input files are split into a set of records. Cobrix supports\ntext files, fixed length binary files and binary files with RDWs. If your input file is not in one of the supported\nformats you can implement a custom record extractor interface and provide it to `spark-cobol` as an option:\n\n```\n.option(\"record_extractor\", \"com.example.record.header.parser\")\n```\n\nA custom record extractor needs to be a class having this precise constructor signature:\n```scala\nclass TextRecordExtractor(ctx: RawRecordContext) extends Serializable with RawRecordExtractor {\n  // Your implementation\n}\n```\n\nA record extractor is essentially an iterator of records. 
Each returned record is an array of bytes parsable by the\ncopybook.  \n\nA record extractor is invoked two times. First, it is invoked at the beginning of each file to go through the file and\ncreate a sparse index. The second time it is invoked by parallel processes starting from different records in the file.\nThe starting record number is provided in the constructor. The starting file offset is available from `inputStream`.\n\nRawRecordContext consists of the following fields that the custom record extractor will get from Cobrix\nat runtime:\n* `startingRecordNumber` - A record number the input stream is pointing to.\n* `inputStream` - The input stream of bytes of the input file.\n* `copybook` - The parsed copybook of the input stream.\n* `additionalInfo` - Arbitrary info that can be passed as an option (see below).\n\nIf your record extractor needs additional information in order to extract records properly, you can provide\narbitrary additional info to the record extractor at runtime by specifying this option:\n\n```\n.option(\"re_additional_info\", \"some info\")\n```\n\nTake a look at `CustomRecordExtractorMock` inside the `spark-cobol` project to see how a custom record extractor can be built.\n\n### Custom record header parsers (deprecated)\n\nCustom record header parsers are deprecated. Use custom record extractors instead. They are more flexible and easier to use. 
\n\nIf your variable length file does not have RDW headers, but has fields that can be used for determining record lengths\nyou can provide a custom record header parser that takes starting bytes of each record and returns record lengths.\nIn order to do that you need to create a class inheriting `RecordHeaderParser` and `Serializable` traits and provide a\nfully qualified class name to the following option:\n```\n.option(\"record_header_parser\", \"com.example.record.header.parser\")\n```\n\n### RDDs\nCobrix provides helper methods to convert `RDD[String]` or `RDD[Array[Byte]]` to `DataFrame` using a copybook.\nThis can be used if you want to use a custom logic to split the input file into records as either ASCII strings\nor arrays of bytes, and then parse each record using a copybook.\n\nAn example of `RDD[Array[Byte]]`:\n```scala\nimport za.co.absa.cobrix.spark.cobol.Cobrix\n\nval rdd = ???\nval df = Cobrix.fromRdd\n    .copybookContents(copybook)\n    .option(\"encoding\", \"ebcdic\") // any supported option \n    .load(rdd)\n```\n\nAn example of ASCII Strings `RDD[String]`:\n```scala\nimport za.co.absa.cobrix.spark.cobol.Cobrix\n\nval rdd = ???\nval df = Cobrix.fromRdd\n    .copybookContents(copybook)\n    .option(\"variable_size_occurs\", \"true\") // any supported option \n    .loadText(rdd)\n```\n\nWhen converting from an RDD some of the options like `record_format` or `generate_record_id` cannot be used since the\ndata is assumed to be already split by records and the information about file names and relative order of records is not available.\n\n## EBCDIC code pages\n\nThe following code pages are supported:\n* `common` - (default) EBCDIC common characters\n* `common_extended` - EBCDIC common characters with special characters extension\n* `cp037` - IBM EBCDIC US-Canada\n* `cp037_extended` - IBM EBCDIC US-Canada with special characters extension\n* `cp300` - IBM EBCDIC Japanese Extended (2 byte code page)\n* `cp838` - IBM EBCDIC Thailand\n* `cp870` - IBM 
EBCDIC Multilingual Latin-2\n* `cp875` - IBM EBCDIC Greek\n* `cp1025` - IBM EBCDIC Multilingual Cyrillic\n* `cp1047` - IBM EBCDIC Latin-1/Open System\n* `cp1364` - (experimental support) IBM EBCDIC Korean (2 byte code page)\n* `cp1388` - (experimental support) IBM EBCDIC Simplified Chinese (2 byte code page)\n\nBy default, Cobrix uses the `common` EBCDIC code page, which contains only basic Latin characters, numbers, and punctuation.\nYou can specify the code page to use for all string fields by setting the `ebcdic_code_page` option to one of the\nvalues listed above, for example:\n\n```\n.option(\"ebcdic_code_page\", \"cp037\")\n```\n\nFor multi-codepage files, you can specify the code page to use for each field by setting the `field_code_page:\u003ccode page\u003e` option:\n```\n.option(\"ebcdic_code_page\", \"cp037\")\n.option(\"field_code_page:cp1256\", \"FIELD1\")\n.option(\"field_code_page:us-ascii\", \"FIELD-2, FIELD_3\")\n```\n\n## Reading ASCII text file\nCobrix is primarily designed to read binary files, but you can directly use some internal functions to read ASCII text files. 
In ASCII text files, records are separated with newlines.\n\nWorking example 1:\n```scala\n    // The recommended way\n    val df = spark\n      .read\n      .format(\"cobol\")\n      .option(\"copybook_contents\", copybook)\n      .option(\"ascii_charset\", \"ISO-8859-1\") // You can choose a charset, UTF-8 is used by default\n      .option(\"record_format\", \"D\")\n      .load(tmpFileName)\n````\n\nWorking example 2 - Using RDDs and helper methods:\n```scala\n    // This is the way if you have data converted to an RDD[String] already.\n    // You have full control on reading the input data records and converting them to `java.lang.String`.\n    val df = Cobrix.fromRdd\n        .copybookContents(copybook)\n        .option(\"variable_size_occurs\", \"true\") // any supported option \n        .loadText(rdd)\n````\n\nWorking example 3 - Using RDDs and record parsers directly:\n```scala\n    // This is the most verbose way - creating dataframes from RDDs. But it gives full control on how text files are\n    // processed before parsing actual records\n    val spark = SparkSession\n      .builder()\n      .appName(\"Spark-Cobol ASCII text file\")\n      .master(\"local[*]\")\n      .getOrCreate()\n\n    val copybook =\n      \"\"\"       01  COMPANY-DETAILS.\n        |            05  SEGMENT-ID\t\tPIC 9(1).\n        |            05  STATIC-DETAILS.\n        |               10  NAME      \tPIC X(2).\n        |\n        |            05  CONTACTS REDEFINES STATIC-DETAILS.\n        |               10  PERSON    \tPIC X(3).\n      \"\"\".stripMargin\n\n    val parsedCopybook = CopybookParser.parse(copybook, dataEnncoding = ASCII, stringTrimmingPolicy = StringTrimmingPolicy.TrimNone)\n    val cobolSchema = new CobolSchema(parsedCopybook, SchemaRetentionPolicy.CollapseRoot, \"\", false)\n    val sparkSchema = cobolSchema.getSparkSchema\n\n    val rddText = spark.sparkContext.textFile(\"src/main/resources/mini.txt\")\n\n    val recordHandler = new RowHandler()\n\n    val 
rddRow = rddText\n      .filter(str =\u003e str.length \u003e 0)\n      .map(str =\u003e {\n        val record = RecordExtractors.extractRecord[GenericRow](parsedCopybook.ast,\n          str.getBytes(),\n          0,\n          SchemaRetentionPolicy.CollapseRoot, handler = recordHandler)\n        Row.fromSeq(record)\n      })\n\n    val dfOut = spark.createDataFrame(rddRow, sparkSchema)\n\n    dfOut.printSchema()\n    dfOut.show()\n```\n\nCorresponding data sample in `mini.txt`:\n```\n1BB\n2CCC\n```\n\nOutput:\n```\nroot\n |-- SEGMENT_ID: integer (nullable = true)\n |-- STATIC_DETAILS: struct (nullable = true)\n |    |-- NAME: string (nullable = true)\n |-- CONTACTS: struct (nullable = true)\n |    |-- PERSON: string (nullable = true)\n\n ...\n\n +----------+--------------+--------+\n |SEGMENT_ID|STATIC_DETAILS|CONTACTS|\n +----------+--------------+--------+\n |         1|          [BB]|  [null]|\n |         2|          [CC]|   [CCC]|\n +----------+--------------+--------+\n```\n\nThere, Cobrix loaded all redefines for every record. Each record contains data from all of the segments. But only one redefine is valid for every segment. Filtering is described in the following section.\n\n## Automatic segment redefines filtering\n\nWhen reading a multisegment file you can use Spark to clean up redefines that do not match segment ids. Cobrix will parse\nevery redefined field for each segment. To increase performance you can specify which redefine corresponds to which\nsegment id. 
This way Cobrix will parse only the relevant segment redefined fields and leave the rest of the redefined fields null.\n\n```\n  .option(\"redefine-segment-id-map:0\", \"REDEFINED_FIELD1 =\u003e SegmentId1,SegmentId2,...\")\n  .option(\"redefine-segment-id-map:1\", \"REDEFINED_FIELD2 =\u003e SegmentId10,SegmentId11,...\")\n```\n\nFor the above example the load options will look like this (last 2 options):\n```scala\nval df = spark\n  .read\n  .format(\"cobol\")\n  .option(\"copybook_contents\", copybook)\n  .option(\"record_format\", \"V\")\n  .option(\"segment_field\", \"SEGMENT_ID\")\n  .option(\"segment_id_level0\", \"C\")\n  .option(\"segment_id_level1\", \"P\")\n  .option(\"redefine_segment_id_map:0\", \"STATIC-DETAILS =\u003e C\")\n  .option(\"redefine_segment_id_map:1\", \"CONTACTS =\u003e P\")\n  .load(\"examples/multisegment_data/COMP.DETAILS.SEP30.DATA.dat\")\n```\n\nThe filtered data will look like this:\n```\ndf.show(10)\n+----------+----------+--------------------+--------------------+\n|SEGMENT_ID|COMPANY_ID|      STATIC_DETAILS|            CONTACTS|\n+----------+----------+--------------------+--------------------+\n|         C|9377942526|[Joan Q \u0026 Z,10 Sa...|                    |\n|         P|9377942526|                    |[+(277) 944 44 55...|\n|         C|3483483977|[Robotrd Inc.,2 P...|                    |\n|         P|3483483977|                    |[+(174) 970 97 54...|\n|         P|3483483977|                    |[+(848) 832 61 68...|\n|         P|3483483977|                    |[+(455) 184 13 39...|\n|         C|7540764401|[Eqartion Inc.,87...|                    |\n|         C|4413124035|[Xingzhoug,74 Qin...|                    |\n|         C|9546291887|[ZjkLPj,5574, Tok...|                    |\n|         P|9546291887|                    |[+(300) 252 33 17...|\n+----------+----------+--------------------+--------------------+\n```\n\nIn the above example invalid fields became `null` and the parsing is done faster because Cobrix does not need to process\nevery redefine for each record.\n\n\n## Group Filler dropping\n\nA FILLER is an anonymous field that is usually used for reserving space for new fields in fixed record length data.\nIt is also used to remove a field from a copybook without affecting compatibility.\n\n```cobol\n      05  COMPANY.\n          10  NAME      PIC X(15).\n          10  FILLER    PIC X(5).\n          10  ADDRESS   PIC X(25).\n          10  FILLER    PIC X(125).\n``` \nSuch fields are dropped when imported into a Spark data frame by Cobrix. Some copybooks, however, have FILLER groups that\ncontain non-filler fields. For example,\n```cobol\n      05  FILLER.\n          10  NAME      PIC X(15).\n          10  ADDRESS   PIC X(25).\n      05  FILLER.\n          10  AMOUNT    PIC 9(10)V96.\n          10  COMMENT   PIC X(40).\n``` \nBy default Cobrix will retain such fields, but will rename each such filler to a unique name so that each individual struct\ncan be specified unambiguously. For example, in this case the filler groups will be renamed to `FILLER_1` and `FILLER_2`.\nYou can change this behaviour if you would like to drop such filler groups by providing this option:\n```\n.option(\"drop_group_fillers\", \"true\")\n```\n\nIn order to retain *value FILLERs* (i.e. non-group FILLERs) as well, use this option:\n```\n.option(\"drop_value_fillers\", \"false\")\n```\n\n\n## \u003ca id=\"ims\"/\u003eReading hierarchical data sets\n\nLet's imagine we have a multisegment file with 2 segments having parent-child relationships. Each segment has a different\nrecord type. The root record/segment contains company info, an address and a taxpayer number. The child segment contains\na contact person for a company. Each company can have zero or more contact persons. So each root record can be followed by\nzero or more child records.\n\nTo load such data in Spark the first thing you need to do is to create a copybook that contains all segment specific fields\nin redefined groups. 
Here is the copybook for our example:\n\n```cobol\n        01  COMPANY-DETAILS.\n            05  SEGMENT-ID        PIC X(5).\n            05  COMPANY-ID        PIC X(10).\n            05  STATIC-DETAILS.\n               10  COMPANY-NAME      PIC X(15).\n               10  ADDRESS           PIC X(25).\n               10  TAXPAYER.\n                  15  TAXPAYER-TYPE  PIC X(1).\n                  15  TAXPAYER-STR   PIC X(8).\n                  15  TAXPAYER-NUM  REDEFINES TAXPAYER-STR\n                                     PIC 9(8) COMP.\n\n            05  CONTACTS REDEFINES STATIC-DETAILS.\n               10  PHONE-NUMBER      PIC X(17).\n               10  CONTACT-PERSON    PIC X(28).\n```\n\nThe 'SEGMENT-ID' and 'COMPANY-ID' fields are present in all of the segments. The 'STATIC-DETAILS' group is present only in\nthe root record. The 'CONTACTS' group is present only in child record. Notice that 'CONTACTS' redefine 'STATIC-DETAILS'.\n\nBecause the records have different lengths use `record_format = V` or `record_format = VB` depending of the record format.\n\nIf you load this file as is you will get the schema and the data similar to this.\n\n#### Spark App:\n```scala\nval df = spark\n  .read\n  .format(\"cobol\")\n  .option(\"copybook\", \"/path/to/thecopybook\")\n  .option(\"record_format\", \"V\")\n  .load(\"examples/multisegment_data\")\n```\n\n#### Schema\n```\ndf.printSchema()\nroot\n |-- SEGMENT_ID: string (nullable = true)\n |-- COMPANY_ID: string (nullable = true)\n |-- STATIC_DETAILS: struct (nullable = true)\n |    |-- COMPANY_NAME: string (nullable = true)\n |    |-- ADDRESS: string (nullable = true)\n |    |-- TAXPAYER: struct (nullable = true)\n |    |    |-- TAXPAYER_TYPE: string (nullable = true)\n |    |    |-- TAXPAYER_STR: string (nullable = true)\n |    |    |-- TAXPAYER_NUM: integer (nullable = true)\n |-- CONTACTS: struct (nullable = true)\n |    |-- PHONE_NUMBER: string (nullable = true)\n |    |-- CONTACT_PERSON: string (nullable = 
true)\n```\n\n#### Data sample\n```\ndf.show(10)\n+----------+----------+--------------------+--------------------+\n|SEGMENT_ID|COMPANY_ID|      STATIC_DETAILS|            CONTACTS|\n+----------+----------+--------------------+--------------------+\n|         C|9377942526|[Joan Q \u0026 Z,10 Sa...|[Joan Q \u0026 Z     1...|\n|         P|9377942526|[+(277) 944 44 5,...|[+(277) 944 44 55...|\n|         C|3483483977|[Robotrd Inc.,2 P...|[Robotrd Inc.   2...|\n|         P|3483483977|[+(174) 970 97 5,...|[+(174) 970 97 54...|\n|         P|3483483977|[+(848) 832 61 6,...|[+(848) 832 61 68...|\n|         P|3483483977|[+(455) 184 13 3,...|[+(455) 184 13 39...|\n|         C|7540764401|[Eqartion Inc.,87...|[Eqartion Inc.  8...|\n|         C|4413124035|[Xingzhoug,74 Qin...|[Xingzhoug      7...|\n|         C|9546291887|[ZjkLPj,5574, Tok...|[ZjkLPj         5...|\n|         P|9546291887|[+(300) 252 33 1,...|[+(300) 252 33 17...|\n+----------+----------+--------------------+--------------------+\n```\n\nAs you can see, Cobrix loaded *all* redefines for *every* record. Each record contains data from all of the segments. But only\none redefine is valid for every segment. So we need to split the data set into 2 data sets or tables. The distinguishing field is\n'SEGMENT_ID'. All company details will go into one data set (segment id = 'C' [company]) while contacts will go into\nthe second data set (segment id = 'P' [person]). While doing the split we can also collapse the groups so the tables won't\ncontain nested structures. This can be helpful to simplify the analysis of the data.\n\nWhile doing it you might notice that the taxpayer number field is actually a redefine. Depending on the 'TAXPAYER_TYPE'\neither 'TAXPAYER_NUM' or 'TAXPAYER_STR' is used. 
We can resolve this in our Spark app as well.\n\n### \u003ca id=\"autoims\"/\u003eAutomatic reconstruction of hierarchical record structure\nStarting from `spark-cobol` version `1.1.0` hierarchical structure of multisegment records can be restored automatically. In order to do this you\nneed to provide:\n- A segment ID field that will be used to distinguish segment types.\n- A segmentId to redefine fields mapping that will be used to map each segment to a redefine field.\n- A parent-child relationship between segments identified by segment redefine fields.\n\nWhen all of the above is specified Cobrix can reconstruct hierarchical nature of records by making child segments nested\narrays of parent segments. Arbitrary levels of hierarchy and arbitrary number of segments is supported.\n\n```scala\nval df = spark\n  .read\n  .format(\"cobol\")\n  .option(\"copybook\", \"/path/to/thecopybook\")\n  .option(\"record_format\", \"V\")\n\n  // Specifies a field containing a segment id\n  .option(\"segment_field\", \"SEGMENT_ID\")\n  \n  // Specifies a mapping between segment ids and segment redefine fields\n  .option(\"redefine_segment_id_map:1\", \"STATIC-DETAILS =\u003e C\")\n  .option(\"redefine-segment-id-map:2\", \"CONTACTS =\u003e P\")\n  \n  // Specifies a parent-child relationship\n  .option(\"segment-children:1\", \"STATIC-DETAILS =\u003e CONTACTS\")\n  \n  .load(\"examples/multisegment_data\")\n```\n\nThe output schema will be\n\n```\nscala\u003e df.printSchema()\n\nroot\n |-- SEGMENT_ID: string (nullable = true)\n |-- COMPANY_ID: string (nullable = true)\n |-- STATIC_DETAILS: struct (nullable = true)\n |    |-- COMPANY_NAME: string (nullable = true)\n |    |-- ADDRESS: string (nullable = true)\n |    |-- TAXPAYER: struct (nullable = true)\n |    |    |-- TAXPAYER_TYPE: string (nullable = true)\n |    |    |-- TAXPAYER_STR: string (nullable = true)\n |    |    |-- TAXPAYER_NUM: integer (nullable = true)\n |    |-- CONTACTS: array (nullable = true)\n |    |    |-- 
element: struct (containsNull = true)\n |    |    |    |-- PHONE_NUMBER: string (nullable = true)\n |    |    |    |-- CONTACT_PERSON: string (nullable = true)\n\n```\n\nNotice that `CONTACTS` is now an array of structs. That is, a company's static details can contain zero or more contacts.\nA possible hierarchical record output is:\n```\nscala\u003e import za.co.absa.cobrix.spark.cobol.utils.SparkUtils\n\nscala\u003e println(SparkUtils.prettyJSON(df.toJSON.take(1).mkString(\"[\", \", \", \"]\")))\n{\n  \"SEGMENT_ID\" : \"C\",\n  \"COMPANY_ID\" : \"9377942526\",\n  \"STATIC_DETAILS\" : {\n    \"COMPANY_NAME\" : \"Joan Q \u0026 Z\",\n    \"ADDRESS\" : \"10 Sandton, Johannesburg\",\n    \"TAXPAYER\" : {\n      \"TAXPAYER_TYPE\" : \"A\",\n      \"TAXPAYER_STR\" : \"92714306\",\n      \"TAXPAYER_NUM\" : 959592241\n    },\n    \"CONTACTS\" : [ {\n      \"PHONE_NUMBER\" : \"+(174) 970 97 54\",\n      \"CONTACT_PERSON\" : \"Tyesha Debow\"\n    }, {\n      \"PHONE_NUMBER\" : \"+(848) 832 61 68\",\n      \"CONTACT_PERSON\" : \"Mindy Celestin\"\n    }, {\n      \"PHONE_NUMBER\" : \"+(455) 184 13 39\",\n      \"CONTACT_PERSON\" : \"Mabelle Winburn\"\n    } ]\n  }\n}\n```\n\nAn advanced hierarchical example with multiple levels of nesting and multiple segments on each level\nis available as a unit test `za/co/absa/cobrix/spark/cobol/source/integration/Test17HierarchicalSpec.scala`.\n \n### Manual reconstruction of hierarchical structure\n\nAlternatively, hierarchical record structure can be reconstructed manually by extracting each segment and joining\nsegments together. 
This is a more complicated process, but it provides more control.\n\n#### Getting the first segment\n```scala\nimport org.apache.spark.sql.functions.when\nimport org.apache.spark.sql.types.StringType\nimport spark.implicits._\n\nval dfCompanies = df\n  // Filtering the first segment by segment id\n  .filter($\"SEGMENT_ID\" === \"C\")\n  // Selecting fields that are only available in the first segment\n  .select($\"COMPANY_ID\", $\"STATIC_DETAILS.COMPANY_NAME\", $\"STATIC_DETAILS.ADDRESS\",\n  // Resolving the taxpayer redefine\n    when($\"STATIC_DETAILS.TAXPAYER.TAXPAYER_TYPE\" === \"A\", $\"STATIC_DETAILS.TAXPAYER.TAXPAYER_STR\")\n      .otherwise($\"STATIC_DETAILS.TAXPAYER.TAXPAYER_NUM\").cast(StringType).as(\"TAXPAYER\"))\n```\n\nThe resulting table looks like this:\n```\ndfCompanies.show(10, truncate = false)\n+----------+-------------+-------------------------+--------+\n|COMPANY_ID|COMPANY_NAME |ADDRESS                  |TAXPAYER|\n+----------+-------------+-------------------------+--------+\n|9377942526|Joan Q \u0026 Z   |10 Sandton, Johannesburg |92714306|\n|3483483977|Robotrd Inc. |2 Park ave., Johannesburg|31195396|\n|7540764401|Eqartion Inc.|871A Forest ave., Toronto|87432264|\n|4413124035|Xingzhoug    |74 Qing ave., Beijing    |50803302|\n|9546291887|ZjkLPj       |5574, Tokyo              |73538919|\n|9168453994|Test Bank    |1 Garden str., London    |82573513|\n|4225784815|ZjkLPj       |5574, Tokyo              |96136195|\n|8463159728|Xingzhoug    |74 Qing ave., Beijing    |17785468|\n|8180356010|Eqartion Inc.|871A Forest ave., Toronto|79054306|\n|7107728116|Xingzhoug    |74 Qing ave., Beijing    |70899995|\n+----------+-------------+-------------------------+--------+\n```\n\nThis looks like a valid and clean table containing the list of companies. 
Now let's do the same for the second segment.\n\n#### Getting the second segment\n```scala\n    val dfContacts = df\n      // Filtering the second segment by segment id\n      .filter($\"SEGMENT_ID\"===\"P\")\n      // Selecting the fields only valid for the second segment\n      .select($\"COMPANY_ID\", $\"CONTACTS.CONTACT_PERSON\", $\"CONTACTS.PHONE_NUMBER\")\n```\n\nThe resulting data looks like this:\n```\ndfContacts.show(10, truncate = false)\n+----------+--------------------+----------------+\n|COMPANY_ID|CONTACT_PERSON      |PHONE_NUMBER    |\n+----------+--------------------+----------------+\n|9377942526|Janiece Newcombe    |+(277) 944 44 55|\n|3483483977|Tyesha Debow        |+(174) 970 97 54|\n|3483483977|Mindy Celestin      |+(848) 832 61 68|\n|3483483977|Mabelle Winburn     |+(455) 184 13 39|\n|9546291887|Carrie Celestin     |+(300) 252 33 17|\n|9546291887|Edyth Deveau        |+(907) 101 70 64|\n|9546291887|Jene Norgard        |+(694) 918 17 44|\n|9168453994|Timika Bourke       |+(768) 691 44 85|\n|9168453994|Lynell Riojas       |+(695) 918 33 16|\n|4225784815|Jene Mackinnon      |+(540) 937 33 71|\n+----------+--------------------+----------------+\n```\n\nThis looks good as well. The table contains the list of contact persons for companies. This data set contains the\n'COMPANY_ID' field which we can use later to join the tables. But often there are no such fields in data imported from\nhierarchical databases. If that is the case Cobrix can help you craft such fields automatically. Use 'segment_field' to\nspecify a field that contains the segment id. Use 'segment_id_level0' to ask Cobrix to generate ids for the particular\nsegments. We can use 'segment_id_level1' to generate child ids as well. 
If children records can contain children of their\nown we can use 'segment_id_level2' etc.\n\n#### Generating segment ids\n\n```scala\nval df = spark\n  .read\n  .format(\"cobol\")\n  .option(\"copybook_contents\", copybook)\n  .option(\"record_format\", \"V\")\n  .option(\"segment_field\", \"SEGMENT_ID\")\n  .option(\"segment_id_level0\", \"C\")\n  .option(\"segment_id_level1\", \"P\")\n  .load(\"examples/multisegment_data/COMP.DETAILS.SEP30.DATA.dat\")\n```\n\nSometimes, the leaf level has many segments. In this case, you can use `_` as the list of segment ids to specify\n'the rest of segment ids', like this:\n\n```scala\nval df = spark\n  .read\n  .format(\"cobol\")\n  .option(\"copybook_contents\", copybook)\n  .option(\"record_format\", \"V\")\n  .option(\"segment_field\", \"SEGMENT_ID\")\n  .option(\"segment_id_level0\", \"C\")\n  .option(\"segment_id_level1\", \"_\")\n  .load(\"examples/multisegment_data/COMP.DETAILS.SEP30.DATA.dat\")\n```\n\nThe result of both above code snippets is the same.\n\nThe resulting table will look like this:\n```\ndf.show(10)\n+------------------+-----------------------+----------+----------+--------------------+--------------------+\n|           Seg_Id0|                Seg_Id1|SEGMENT_ID|COMPANY_ID|      STATIC_DETAILS|            CONTACTS|\n+------------------+-----------------------+----------+----------+--------------------+--------------------+\n|20181219130609_0_0|                   null|         C|9377942526|[Joan Q \u0026 Z,10 Sa...|[Joan Q \u0026 Z     1...|\n|20181219130609_0_0|20181219130723_0_0_L1_1|         P|9377942526|[+(277) 944 44 5,...|[+(277) 944 44 55...|\n|20181219130609_0_2|                   null|         C|3483483977|[Robotrd Inc.,2 P...|[Robotrd Inc.   
2...|\n|20181219130609_0_2|20181219130723_0_2_L1_1|         P|3483483977|[+(174) 970 97 5,...|[+(174) 970 97 54...|\n|20181219130609_0_2|20181219130723_0_2_L1_2|         P|3483483977|[+(848) 832 61 6,...|[+(848) 832 61 68...|\n|20181219130609_0_2|20181219130723_0_2_L1_3|         P|3483483977|[+(455) 184 13 3,...|[+(455) 184 13 39...|\n|20181219130609_0_6|                   null|         C|7540764401|[Eqartion Inc.,87...|[Eqartion Inc.  8...|\n|20181219130609_0_7|                   null|         C|4413124035|[Xingzhoug,74 Qin...|[Xingzhoug      7...|\n|20181219130609_0_8|                   null|         C|9546291887|[ZjkLPj,5574, Tok...|[ZjkLPj         5...|\n|20181219130609_0_8|20181219130723_0_8_L1_1|         P|9546291887|[+(300) 252 33 1,...|[+(300) 252 33 17...|\n+------------------+-----------------------+----------+----------+--------------------+--------------------+\n```\n\nThe data now contain 2 additional fields: 'Seg_Id0' and 'Seg_Id1'. The 'Seg_Id0' is an autogenerated id for each root\nrecord. It is also unique for a root record. After splitting the segments you can use Seg_Id0 to join both tables.\nThe 'Seg_Id1' field contains a unique child id. It is equal to 'null' for all root records but uniquely identifies\nchild records.\n\nYou can now split these 2 segments and join them by Seg_Id0. The full example is available at\n`spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/examples/CobolSparkExample2.scala`\n\nTo run it from an IDE you'll need to change Scala and Spark dependencies from 'provided' to 'compile' so the\njar file would contain all the dependencies. This is because Cobrix is a library to be used in Spark job projects.\nSpark jobs uber jars should not contain Scala and Spark dependencies since Hadoop clusters have their Scala and Spark\ndependencies provided by the infrastructure. 
Including Spark and Scala dependencies in an uber jar can produce\nbinary incompatibilities when these jars are used in `spark-submit` and `spark-shell`.\n\nHere is our example tables to join:\n\n##### Segment 1 (Companies)\n```\ndfCompanies.show(10, truncate = false)\n+--------------------+----------+-------------+-------------------------+--------+\n|Seg_Id0             |COMPANY_ID|COMPANY_NAME |ADDRESS                  |TAXPAYER|\n+--------------------+----------+-------------+-------------------------+--------+\n|20181219130723_0_0  |9377942526|Joan Q \u0026 Z   |10 Sandton, Johannesburg |92714306|\n|20181219130723_0_2  |3483483977|Robotrd Inc. |2 Park ave., Johannesburg|31195396|\n|20181219130723_0_6  |7540764401|Eqartion Inc.|871A Forest ave., Toronto|87432264|\n|20181219130723_0_7  |4413124035|Xingzhoug    |74 Qing ave., Beijing    |50803302|\n|20181219130723_0_8  |9546291887|ZjkLPj       |5574, Tokyo              |73538919|\n|20181219130723_0_12 |9168453994|Test Bank    |1 Garden str., London    |82573513|\n|20181219130723_0_15 |4225784815|ZjkLPj       |5574, Tokyo              |96136195|\n|20181219130723_0_20 |8463159728|Xingzhoug    |74 Qing ave., Beijing    |17785468|\n|20181219130723_0_24 |8180356010|Eqartion Inc.|871A Forest ave., Toronto|79054306|\n|20181219130723_0_27 |7107728116|Xingzhoug    |74 Qing ave., Beijing    |70899995|\n+--------------------+----------+-------------+-------------------------+--------+\n```\n\n##### Segment 2 (Contacts)\n```\ndfContacts.show(13, truncate = false)\n+-------------------+----------+-------------------+----------------+\n|Seg_Id0            |COMPANY_ID|CONTACT_PERSON     |PHONE_NUMBER    |\n+-------------------+----------+-------------------+----------------+\n|20181219130723_0_0 |9377942526|Janiece Newcombe    |+(277) 944 44 55|\n|20181219130723_0_2 |3483483977|Tyesha Debow        |+(174) 970 97 54|\n|20181219130723_0_2 |3483483977|Mindy Celestin      |+(848) 832 61 68|\n|20181219130723_0_2 |3483483977|Mabelle 
Winburn     |+(455) 184 13 39|\n|20181219130723_0_8 |9546291887|Carrie Celestin     |+(300) 252 33 17|\n|20181219130723_0_8 |9546291887|Edyth Deveau        |+(907) 101 70 64|\n|20181219130723_0_8 |9546291887|Jene Norgard        |+(694) 918 17 44|\n|20181219130723_0_12|9168453994|Timika Bourke       |+(768) 691 44 85|\n|20181219130723_0_12|9168453994|Lynell Riojas       |+(695) 918 33 16|\n|20181219130723_0_15|4225784815|Jene Mackinnon      |+(540) 937 33 71|\n|20181219130723_0_15|4225784815|Timika Concannon    |+(122) 216 11 25|\n|20181219130723_0_15|4225784815|Jene Godfrey        |+(285) 643 50 47|\n|20181219130723_0_15|4225784815|Gabriele Winburn    |+(489) 644 53 67|\n+-------------------+----------+-------------------+----------------+\n\n```\n\nLet's now join these tables.\n\n##### Joined datasets\n\nThe join statement in Spark:\n```scala\nval dfJoined = dfCompanies.join(dfContacts, \"Seg_Id0\")\n```\n\nThe joined data looks like this:\n\n```\ndfJoined.show(13, truncate = false)\n+--------------------+----------+-------------+-------------------------+--------+----------+--------------------+----------------+\n|Seg_Id0             |COMPANY_ID|COMPANY_NAME |ADDRESS                  |TAXPAYER|COMPANY_ID|CONTACT_PERSON      |PHONE_NUMBER    |\n+--------------------+----------+-------------+-------------------------+--------+----------+--------------------+----------------+\n|20181219130723_0_0  |9377942526|Joan Q \u0026 Z   |10 Sandton, Johannesburg |92714306|9377942526|Janiece Newcombe    |+(277) 944 44 55|\n|20181219131239_0_2  |3483483977|Robotrd Inc. |2 Park ave., Johannesburg|31195396|3483483977|Mindy Celestin      |+(848) 832 61 68|\n|20181219131239_0_2  |3483483977|Robotrd Inc. |2 Park ave., Johannesburg|31195396|3483483977|Tyesha Debow        |+(174) 970 97 54|\n|20181219131239_0_2  |3483483977|Robotrd Inc. 
|2 Park ave., Johannesburg|31195396|3483483977|Mabelle Winburn     |+(455) 184 13 39|\n|20181219131344_0_8  |9546291887|ZjkLPj       |5574, Tokyo              |73538919|9546291887|Jene Norgard        |+(694) 918 17 44|\n|20181219131344_0_8  |9546291887|ZjkLPj       |5574, Tokyo              |73538919|9546291887|Edyth Deveau        |+(907) 101 70 64|\n|20181219131344_0_8  |9546291887|ZjkLPj       |5574, Tokyo              |73538919|9546291887|Carrie Celestin     |+(300) 252 33 17|\n|20181219131344_0_12 |9168453994|Test Bank    |1 Garden str., London    |82573513|9168453994|Timika Bourke       |+(768) 691 44 85|\n|20181219131344_0_12 |9168453994|Test Bank    |1 Garden str., London    |82573513|9168453994|Lynell Riojas       |+(695) 918 33 16|\n|20181219131344_0_15 |4225784815|ZjkLPj       |5574, Tokyo              |96136195|4225784815|Jene Mackinnon      |+(540) 937 33 71|\n|20181219131344_0_15 |4225784815|ZjkLPj       |5574, Tokyo              |96136195|4225784815|Timika Concannon    |+(122) 216 11 25|\n|20181219131344_0_15 |4225784815|ZjkLPj       |5574, Tokyo              |96136195|4225784815|Jene Godfrey        |+(285) 643 50 47|\n|20181219131344_0_15 |4225784815|ZjkLPj       |5574, Tokyo              |96136195|4225784815|Gabriele Winburn    |+(489) 644 53 67|\n+--------------------+----------+-------------+-------------------------+--------+----------+--------------------+----------------+\n```\n\nAgain, the full example is available at\n`spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/examples/CobolSparkExample2.scala`\n\n## COBOL parser extensions\n\nSome encoding formats are not expressible by the standard copybook spec. Cobrix has extensions to help you decode \nfields encoded in this way.   
\n\n### Loading multiple paths\n\nLoading multiple paths in the standard way is not supported.\n```scala\n val df = spark\n   .read\n   .format(\"cobol\")\n   .option(\"copybook_contents\", copybook)\n   .load(\"/path1\", \"/paths2\")\n```\n\nBut there is a Cobrix extension that allows you to load multiple paths:\n```scala\n val df = spark\n   .read\n   .format(\"cobol\")\n   .option(\"copybook_contents\", copybook)\n   .option(\"data_paths\", \"/path1,/paths2\")\n   .load()\n```\n\n### Parsing little-endian binary numbers\n\nCobrix expects all binary numbers to be big-endian. If you have a binary number in the little-endian format, use \n`COMP-9` (Cobrix extension) instead of `COMP` or `COMP-5` for the affected fields.\n\nFor example, `0x01 0x02` is `1*256 + 2 = 258` in big-endian (`COMP`) and `1 + 2*256 = 513` in little-endian (`COMP-9`).   \n\n```\n  10 NUM  PIC S9(8) COMP.    ** Big-endian\n  10 NUM  PIC S9(8) COMP-9.  ** Little-endian\n```\n\n### Parsing 'unsigned packed' aka Easyextract numbers\nUnsigned packed numbers are encoded as BCD (`COMP-3`) without the sign nibble. For example, bytes `0x12 0x34` encode\nthe number `1234`. As of `2.6.2` Cobrix supports decoding such numbers using an extension. Use `COMP-3U` for unsigned\npacked numbers.\n\nThe `COMP-3U` usage:\n```\n  10 NUM  PIC X(4) COMP-3U.\n```\nNote that when using `X`, 4 refers to the number of bytes the field occupies. Here, the number of digits is 4*2 = 8. \n\n```\n  10 NUM  PIC 9(8) COMP-3U.\n```\nWhen using `9`, 8 refers to the number of digits the number has. 
Here, the size of the field in bytes is 8/2 = 4.\n\n```\n  10 NUM  PIC 9(6)V99 COMP-3U.\n```\nYou can have decimals when using `COMP-3U` as well.\n\n### Flattening schema with GROUPs and OCCURS\nFlattening can be helpful when migrating mainframe data with fields that have OCCURS (arrays) to relational\ndatabases that do not support nested arrays.\n\nCobrix has a method that can flatten the schema automatically given a DataFrame produced by `spark-cobol`.\n\nSpark Scala example:\n```scala\nval dfFlat = SparkUtils.flattenSchema(df, useShortFieldNames = false)\n```\n\nPySpark example:\n```python\nfrom pyspark.sql import SparkSession, DataFrame, SQLContext\nfrom pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType\nfrom py4j.java_gateway import java_import\n\nschema = StructType([\n   StructField(\"id\", IntegerType(), True),\n   StructField(\"name\", StringType(), True),\n   StructField(\"subjects\", ArrayType(StringType()), True)\n])\n\n# Sample data\ndata = [\n   (1, \"Alice\", [\"Math\", \"Science\"]),\n   (2, \"Bob\", [\"History\", \"Geography\"]),\n   (3, \"Charlie\", [\"English\", \"Math\", \"Physics\"])\n]\n\n# Create a test DataFrame\ndf = spark.createDataFrame(data, schema)\n\n# Show the Dataframe before flattening\ndf.show()\n\n# Flatten the schema using Cobrix Scala 'SparkUtils.flattenSchema' method\nsc = spark.sparkContext\njava_import(sc._gateway.jvm, \"za.co.absa.cobrix.spark.cobol.utils.SparkUtils\")\ndfFlatJvm = spark._jvm.SparkUtils.flattenSchema(df._jdf, False)\ndfFlat = DataFrame(dfFlatJvm, SQLContext(sc))\n\n# Show the Dataframe after flattening\ndfFlat.show(truncate=False)\ndfFlat.printSchema()\n```\n\nThe output looks like this:\n```\n# Before flattening\n+---+-------+------------------------+\n|id |name   |subjects                |\n+---+-------+------------------------+\n|1  |Alice  |[Math, Science]         |\n|2  |Bob    |[History, Geography]    |\n|3  |Charlie|[English, Math, 
Physics]|\n+---+-------+------------------------+\n\n# After flattening\n+---+-------+----------+----------+----------+\n|id |name   |subjects_0|subjects_1|subjects_2|\n+---+-------+----------+----------+----------+\n|1  |Alice  |Math      |Science   |null      |\n|2  |Bob    |History   |Geography |null      |\n|3  |Charlie|English   |Math      |Physics   |\n+---+-------+----------+----------+----------+\n```\n\n## Summary of all available options\n\n##### File reading options\n\n| Option (usage example)                 | Description                                                                                                    |\n|----------------------------------------|:---------------------------------------------------------------------------------------------------------------|\n| .option(\"data_paths\", \"/path1,/path2\") | Allows loading data from multiple unrelated paths on the same filesystem.                                      |\n| .option(\"file_start_offset\", \"0\")      | Specifies the number of bytes to skip at the beginning of each file.                                           |\n| .option(\"file_end_offset\", \"0\")        | Specifies the number of bytes to skip at the end of each file.                                                 |\n| .option(\"record_start_offset\", \"0\")    | Specifies the number of bytes to skip at the beginning of each record before applying copybook fields to data. |\n| .option(\"record_end_offset\", \"0\")      | Specifies the number of bytes to skip at the end of each record after applying copybook fields to data.        
|\n\n##### Copybook parsing options\n\n| Option (usage example)               | Description                                                                                                                                          |\n|--------------------------------------|:-----------------------------------------------------------------------------------------------------------------------------------------------------|\n| .option(\"truncate_comments\", \"true\") | Historically, the COBOL parser ignores the first 6 characters and all characters after 72. When this option is `false`, no truncation is performed.  |\n| .option(\"comments_lbound\", 6)        | By default each line starts with a 6 character comment. The exact number of characters can be tuned using this option.                               |\n| .option(\"comments_ubound\", 72)       | By default, characters after the 72nd on each line are ignored by the COBOL parser. The exact number of characters can be tuned using this option.   |\n\n##### Data parsing options\n\n| Option (usage example)                                    | Description                                                                                                                                                                                                                                                                       |\n|-----------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n| .option(\"string_trimming_policy\", \"both\")                 | Specifies if and how string fields should be trimmed. Available options: `both` (default), `none`, `left`, `right`, `keep_all`. 
`keep_all` - keeps control characters when decoding ASCII text files                                                                              |\n| .option(\"display_pic_always_string\", \"false\")             | If `true` fields that have `DISPLAY` format will always be converted to `string` type, even if such fields contain numbers, retaining leading and trailing zeros. Cannot be used together with `strict_integral_precision`.                                                       |\n| .option(\"ebcdic_code_page\", \"common\")                     | Specifies a code page for EBCDIC encoding. Currently supported values: `common` (default), `common_extended`, `cp037`, `cp037_extended`, and others (see the \"Currently supported EBCDIC code pages\" section).                                                                    |\n| .option(\"ebcdic_code_page_class\", \"full.class.specifier\") | Specifies a user provided class for a custom code page to UNICODE conversion.                                                                                                                                                                                                     |\n| .option(\"field_code_page:cp825\", \"field1, field2\")        | Specifies the code page for selected fields. You can add more than 1 such option for multiple code page overrides.                                                                                                                                                                |\n| .option(\"is_utf16_big_endian\", \"true\")                    | Specifies if UTF-16 encoded strings (`National` / `PIC N` format) are big-endian (default).                                                                                                                                                                                       |\n| .option(\"floating_point_format\", \"IBM\")                   | Specifies a floating-point format. 
Available options: `IBM` (default), `IEEE754`, `IBM_little_endian`, `IEEE754_little_endian`.                                                                                                                                                   |\n| .option(\"variable_size_occurs\", \"false\")                  | If `false` (default) fields that have `OCCURS 0 TO 100 TIMES DEPENDING ON` clauses always have the same size corresponding to the maximum array size (e.g. 100 in this example). If set to `true` the size of the field will shrink for each field that has less actual elements. |\n| .option(\"occurs_mapping\", \"{\\\"FIELD\\\": {\\\"X\\\": 1}}\")      | If specified, as a JSON string, allows for String `DEPENDING ON` fields with a corresponding mapping.                                                                                                                                                                             |\n| .option(\"strict_sign_overpunching\", \"true\")               | If `true` (default), sign overpunching will only be allowed for signed numbers. If `false`, overpunched positive sign will be allowed for unsigned numbers, but negative sign will result in null.                                                                                |\n| .option(\"improved_null_detection\", \"true\")                | If `true` (default), values that contain only 0x0 for DISPLAY strings and numbers will be considered `null`s instead of empty strings.                                                                                                                                            |\n| .option(\"strict_integral_precision\", \"true\")              | If `true`, Cobrix will not generate `short`/`integer`/`long` Spark data types, and always use `decimal(n)` with the exact precision that matches the copybook. Cannot be used together with `display_pic_always_string`.                                                          
|\n| .option(\"binary_as_hex\", \"false\")                         | By default fields that have `PIC X` and `USAGE COMP` are converted to `binary` Spark data type. If this option is set to `true`, such fields will be strings in HEX encoding.                                                                                                     |\n\n##### Modifier options\n\n| Option (usage example)                              | Description                                                                                                                                                                                                                                                                                           |\n|-----------------------------------------------------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n| .option(\"schema_retention_policy\", \"collapse_root\") | When `collapse_root` (default) the root level record will be removed from the Spark schema. When `keep_original`, the root level GROUP will be present in the Spark schema                                                                                                                            |\n| .option(\"drop_group_fillers\", \"false\")              | If `true`, all GROUP FILLERs will be dropped from the output schema. If `false` (default), such fields will be retained.                                                                                                                                                                              |\n| .option(\"drop_value_fillers\", \"false\")              | If `true` (default), all non-GROUP FILLERs will be dropped from the output schema. If `false`, such fields will be retained.                      
                                                                                                                                                    |\n| .option(\"filler_naming_policy\", \"sequence_numbers\") | Filler renaming strategy so that column names are not duplicated. Either `sequence_numbers` (default) or `previous_field_name` can be used.                                                                                                                                                           |\n| .option(\"non_terminals\", \"GROUP1,GROUP2\")           | Specifies groups to also be added to the schema as string fields. When this option is specified, the reader will add one extra data field after each matching group containing the string data for the group.                                                                                         |\n| .option(\"generate_record_id\", false)                | Generate autoincremental 'File_Id', 'Record_Id' and 'Record_Byte_Length' fields. This is used for processing record order dependent data.                                                                                                                                                             |\n| .option(\"generate_record_bytes\", false)             | Generate 'Record_Bytes', the binary field that contains raw contents of the original unparsed records.                                                                                                                                                                                                |\n| .option(\"generate_corrupt_fields\", false)           | Generate `_corrupt_fields` field that contains values of fields Cobrix was unable to decode.                                                                                                                                                                                                          
|\n| .option(\"with_input_file_name_col\", \"file_name\")    | Generates a column containing input file name for each record (Similar to Spark SQL `input_file_name()` function). The column name is specified by the value of the option. This option only works for variable record length files. For fixed record length and ASCII files use `input_file_name()`. |\n| .option(\"metadata\", \"basic\")                        | Specifies what kind of metadata to include in the Spark schema: `false`, `basic` (default), or `extended` (PIC, usage, etc).                                                                                                                                                                          |\n| .option(\"debug\", \"hex\")                             | If specified, each primitive field will be accompanied by a debug field containing raw bytes from the source file. Possible values: `none` (default), `hex`, `binary`, `string` (ASCII only). The legacy value `true` is supported and will generate debug fields in HEX.                             |\n\n##### Fixed length record format options (for record_format = F or FB)\n\n| Option (usage example)            | Description                                                                                                                                                                                                                                                                             |\n|-----------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n| .option(\"record_format\", \"F\")     | Record format from the [spec](https://www.ibm.com/docs/en/zos/2.3.0?topic=files-selecting-record-formats-non-vsam-data-sets). 
One of `F` (fixed length, default), `FB` (fixed block), `V` (variable length RDW), `VB` (variable block BDW+RDW), `D` (ASCII text).                       |\n| .option(\"record_length\", \"100\")   | Overrides the length of the record (in bytes). Normally, the size is derived from the copybook. But explicitly specifying record size can be helpful for debugging fixed-record length files.                                                                                           |\n| .option(\"block_length\", \"500\")    | Specifies the block length for FB records. It should be a multiple of 'record_length'. Cannot be used together with `records_per_block`                                                                                                                                                 |\n| .option(\"records_per_block\", \"5\") | Specifies the number of records per block for FB records. Cannot be used together with `block_length`                                                                                                                                                                                   |\n\n##### Variable record length files options (for record_format = V or VB)\n\n| Option (usage example)                                      | Description                                                                                                                                                                                                                                                                             |\n|-------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n| .option(\"record_format\", \"V\")                               | Record format from the 
[spec](https://www.ibm.com/docs/en/zos/2.3.0?topic=files-selecting-record-formats-non-vsam-data-sets). One of `F` (fixed length, default), `FB` (fixed block), `V` (variable length RDW), `VB` (variable block BDW+RDW), `D` (ASCII text).                       |\n| .option(\"is_record_sequence\", \"true\")                       | _[deprecated]_ If 'true' the parser will look for 4 byte RDW headers to read variable record length files. Use `.option(\"record_format\", \"V\")` instead.                                                                                                                                 |\n| .option(\"is_rdw_big_endian\", \"true\")                        | Specifies if RDW headers are big endian. They are considered little-endian by default.                                                                                                                                                                                                  |\n| .option(\"is_rdw_part_of_record_length\", false)              | Specifies if RDW headers count themselves as part of record length. By default RDW headers count only payload record in record length, not RDW headers themselves. This is equivalent to `.option(\"rdw_adjustment\", -4)`. For BDW use `.option(\"bdw_adjustment\", -4)`                   |\n| .option(\"rdw_adjustment\", 0)                                | If there is a mismatch between RDW and record length this option can be used to adjust the difference.                                                                                                                                                                                  |\n| .option(\"bdw_adjustment\", 0)                                | If there is a mismatch between BDW and record length this option can be used to adjust the difference.                                                                                                                                                                    
              |\n| .option(\"re_additional_info\", \"\")                           | Passes a string as an additional info parameter passed to a custom record extractor to its constructor.                                                                                                                                                                                 |\n| .option(\"record_length_field\", \"RECORD-LEN\")                | Specifies a record length field or expression to use instead of RDW. Use `rdw_adjustment` option if the record length field differs from the actual length by a fixed amount of bytes. The `record_format` should be set to `F`. This option is incompatible with `is_record_sequence`. |\n| .option(\"record_length_map\", \"\"\"{\"A\":100,\"B\":50}\"\"\")        | Specifies a mapping between record length field values and actual record lengths.                                                                                                                                                                                                       |\n| .option(\"record_extractor\", \"com.example.record.extractor\") | Specifies a class for parsing record in a custom way. The class must inherit `RawRecordExtractor` and `Serializable` traits. See the chapter on record extractors above.                                                                                                                |\n| .option(\"minimum_record_length\", 1)                         | Specifies the minimum length a record is considered valid, will be skipped otherwise.                                                                                                                                                                                                   |\n| .option(\"maximum_record_length\", 1000)                      | Specifies the maximum length a record is considered valid, will be skipped otherwise.                                                                       
                                                                                                                             |\n\n##### ASCII files options (for record_format = D or D2)\n\n| Option (usage example)                             | Description                                                                                                                                                                                                                                                                                 |\n|----------------------------------------------------|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n| .option(\"record_format\", \"D\")                      | Record format from the [spec](https://www.ibm.com/docs/en/zos/2.3.0?topic=files-selecting-record-formats-non-vsam-data-sets). One of `F` (fixed length, default), `FB` (fixed block), `V` (variable length RDW), `VB` (variable block BDW+RDW), `D` (ASCII text).                           |\n| .option(\"is_text\", \"true\")                         | If 'true' the file will be considered a text file where records are separated by an end-of-line character. Currently, only ASCII files having UTF-8 charset can be processed this way. If combined with `record_format = D`, multisegment and hierarchical text record files can be loaded. |\n| .option(\"ascii_charset\", \"US-ASCII\")               | Specifies a charset to use to decode ASCII data. The value can be any charset supported by `java.nio.charset`: `US-ASCII` (default), `UTF-8`, `ISO-8859-1`, etc.                                                                                                                            
|\n| .option(\"field_code_page:cp825\", \"field1, field2\") | Specifies the code page for selected fields. You can add more than 1 such option for multiple code page overrides.                                                                                                                                                                            |\n| .option(\"minimum_record_length\", 1)                | Specifies the minimum length a record is considered valid, will be skipped otherwise. It is used to skip ASCII lines that contains invalid records, an EOF character, for example.                                                                                                          |\n\n##### Multisegment files options\n\n| Option (usage example)                                                                | Description                                                                                                                                                                                                                                                                                                                                                                        |\n|---------------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n| .option(\"segment_field\", \"SEG-ID\")                                                    | Specify a segment id field name. This is to ensure the splitting is done using root record boundaries for hierarchical datasets. The first record will be considered a root segment record.                                                          
                                                                                                                               |\n| .option(\"redefine-segment-id-map:0\", \"REDEFINED_FIELD1 =\u003e SegmentId1,SegmentId2,...\") | Specifies a mapping between redefined field names and segment id values. Each option specifies a mapping for a single segment. The numeric value for each mapping option must be incremented so the option keys are unique.                                                                                                                                                        |\n| .option(\"segment-children:0\", \"COMPANY =\u003e EMPLOYEE,DEPARTMENT\")                       | Specifies a mapping between segment redefined fields and their children. Each option specifies a mapping for a single parent field. The numeric value for each mapping option must be incremented so the option keys are unique. If such a mapping is specified, hierarchical record structure is automatically reconstructed. This requires `redefine-segment-id-map` to be set.  | \n| .option(\"enable_indexes\", \"true\")                                                     | Turns on indexing of multisegment variable length files (on by default).                                                                                                                                                                                                                                                                                                           |\n| .option(\"enable_index_cache\", \"true\")                                                 | When true (default), calculated indexes are cached in memory for later use. This improves performance of processing when same files are processed more than once.                                                                                                                                                                                                         
         |\n| .option(\"input_split_records\", 50000)                                                 | Specifies how many records will be allocated to each split/partition. It will be processed by Spark tasks. (The default is not set and the split will happen according to size, see the next option)                                                                                                                                                                               |\n| .option(\"input_split_size_mb\", 100)                                                   | Specifies how many megabytes to allocate to each partition/split. (The default is 100 MB)                                                                                                                                                                                                                                                                                          |\n\n##### Helper fields generation options\n\n| Option (usage example)                     | Description                                                                                                                                                                         |\n|--------------------------------------------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n| .option(\"segment_field\", \"SEG-ID\")         | Specifies the field in the copybook containing values of segment ids.                                                                                                               |\n| .option(\"segment_filter\", \"S0001\")         | Adds a filter on the segment id that is pushed down to the reader. Use this to extract only the records of particular segments.                                                     
|\n| .option(\"segment_id_level0\", \"SEGID-ROOT\") | Specifies segment id value for root level records. When this option is specified the Seg_Id0 field will be generated for each root record                                           |\n| .option(\"segment_id_level1\", \"SEGID-CLD1\") | Specifies segment id value for child level records. When this option is specified the Seg_Id1 field will be generated for each root record                                          |\n| .option(\"segment_id_level2\", \"SEGID-CLD2\") | Specifies segment id value for child of a child level records. When this option is specified the Seg_Id2 field will be generated for each root record. You can use levels 3, 4 etc. |\n| .option(\"segment_id_prefix\", \"A_PREEFIX\")  | Specifies a prefix to be added to each segment id value. This is to mage generated IDs globally unique. By default the prefix is the current timestamp in form of '201811122345_'.  |\n\n##### Debug helper options\n\n| Option (usage example)                             | Description                                                                                                                                                                                                                                                                                                                         |\n|----------------------------------------------------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n| .option(\"pedantic\", \"false\")                       | If 'true' Cobrix will throw an exception is an unknown option is encountered. If 'false' (default), unknown options will be logged as an error without failing Spark Application.                             
                                                                                                                      |\n| .option(\"debug_layout_positions\", \"true\")          | If 'true' Cobrix will generate and log layout positions table when reading data.                                                                                                                                                                                                                                                    |\n| .option(\"debug_ignore_file_size\", \"true\")          | If 'true' no exception will be thrown if record size does not match file size. Useful for debugging copybooks to make them match a data file.                                                                                                                                                                                       |\n| .option(\"enable_self_checks\", \"true\")              | If 'true' Cobrix will run self-checks to validate internal consistency. Note: Enabling this option may impact performance, especially for large datasets. It is recommended to disable this option in performance-critical environments. The only check implemented so far is custom record extractor indexing compatibility check. |\n\n##### Currently supported EBCDIC code pages\n\n| Option                                | Code page   | Description                                                                                                 |\n|:--------------------------------------|-------------|:------------------------------------------------------------------------------------------------------------|\n| .option(\"ebcdic_code_page\", \"common\") | Common      | (Default) Only characters common across EBCDIC code pages are decoded.                                      |\n| .option(\"ebcdic_code_page\", \"cp037\")  | EBCDIC 037  | Australia, Brazil, Canada, New Zealand, Portugal, South Africa, USA.                                       
 |\n| .option(\"ebcdic_code_page\", \"cp273\")  | EBCDIC 273  | Germany, Austria.                                                                                           |\n| .option(\"ebcdic_code_page\", \"cp274\")  | EBCDIC 274  | Belgium.                                                                                                    |\n| .option(\"ebcdic_code_page\", \"cp275\")  | EBCDIC 275  | Brazil.                                                                                                     |\n| .option(\"ebcdic_code_page\", \"cp277\")  | EBCDIC 277  | Denmark and Norway.                                                                                         |\n| .option(\"ebcdic_code_page\", \"cp278\")  | EBCDIC 278  | Finland and Sweden.                                                                                         |\n| .option(\"ebcdic_code_page\", \"cp280\")  | EBCDIC 280  | Italy.                                                                                                      |\n| .option(\"ebcdic_code_page\", \"cp284\")  | EBCDIC 284  | Spain and Latin America.                                                                                    |\n| .option(\"ebcdic_code_page\", \"cp285\")  | EBCDIC 285  | United Kingdom.                                                                                             |\n| .option(\"ebcdic_code_page\", \"cp297\")  | EBCDIC 297  | France.                                                                                                     |\n| .option(\"ebcdic_code_page\", \"cp300\")  | EBCDIC 300  | Double-byte code page with Japanese and Latin characters.                                                   |\n| .option(\"ebcdic_code_page\", \"cp500\")  | EBCDIC 500  | Belgium, Canada, Switzerland, International.                                                                |\n| .option(\"ebcdic_code_page\", \"cp838\")  | EBCDIC 838  | Double-byte code page with Thai and Latin characters.  
                                                     |\n| .option(\"ebcdic_code_page\", \"cp870\")  | EBCDIC 870  | Albania, Bosnia and Herzegovina, Croatia, Czech Republic, Hungary, Poland, Romania, Slovakia, and Slovenia. |\n| .option(\"ebcdic_code_page\", \"cp875\")  | EBCDIC 875  | A code page with Greek characters.                                                                          |\n| .option(\"ebcdic_code_page\", \"cp1025\") | EBCDIC 1025 | A code page with Cyrillic alphabet.                                                                         |\n| .option(\"ebcdic_code_page\", \"cp1047\") | EBCDIC 1047 | A code page containing all of the Latin-1/Open System characters.                                           |\n| .option(\"ebcdic_code_page\", \"cp1140\") | EBCDIC 1140 | Same as code page 037 with € at the position of the international currency symbol ¤.                        |\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fabsaoss%2Fcobrix","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fabsaoss%2Fcobrix","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fabsaoss%2Fcobrix/lists"}