{"id":23122602,"url":"https://github.com/folio-org/mod-data-import","last_synced_at":"2026-01-16T18:02:48.798Z","repository":{"id":38486127,"uuid":"150272398","full_name":"folio-org/mod-data-import","owner":"folio-org","description":null,"archived":false,"fork":false,"pushed_at":"2026-01-14T12:02:56.000Z","size":16065,"stargazers_count":4,"open_issues_count":2,"forks_count":6,"subscribers_count":17,"default_branch":"master","last_synced_at":"2026-01-14T12:22:15.484Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"mIRC Script","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/folio-org.png","metadata":{"files":{"readme":"README.md","changelog":"NEWS.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2018-09-25T13:46:58.000Z","updated_at":"2025-12-16T13:50:09.000Z","dependencies_parsed_at":"2024-02-16T13:25:33.395Z","dependency_job_id":"21b9b0d5-b071-4afc-80e2-bc43ea0feb61","html_url":"https://github.com/folio-org/mod-data-import","commit_stats":null,"previous_names":[],"tags_count":68,"template":false,"template_full_name":null,"purl":"pkg:github/folio-org/mod-data-import","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/folio-org%2Fmod-data-import","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/folio-org%2Fmod-data-import/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/folio-org%2Fmod-data-import/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/Git
Hub/repositories/folio-org%2Fmod-data-import/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/folio-org","download_url":"https://codeload.github.com/folio-org/mod-data-import/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/folio-org%2Fmod-data-import/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28480513,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-16T11:59:17.896Z","status":"ssl_error","status_checked_at":"2026-01-16T11:55:55.838Z","response_time":107,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-12-17T07:28:32.069Z","updated_at":"2026-01-16T18:02:48.790Z","avatar_url":"https://github.com/folio-org.png","language":"mIRC Script","funding_links":[],"categories":[],"sub_categories":[],"readme":"# mod-data-import\n\nCopyright (C) 2018-2022 The Open Library Foundation\n\nThis software is distributed under the terms of the Apache License,\nVersion 2.0. 
See the file \"[LICENSE](LICENSE)\" for more information.\n\n\u003c!-- ../../okapi/doc/md2toc -l 2 -h 4 README.md --\u003e\n* [Introduction](#introduction)\n* [Compiling](#compiling)\n* [Docker](#docker)\n* [Installing the module](#installing-the-module)\n* [Deploying the module](#deploying-the-module)\n* [Maximum upload file size and java heap memory setups](#maximum-upload-file-size-and-java-heap-memory-setups)\n    * [Example](#example)\n* [File splitting configuration](#file-splitting-configuration)\n* [Interaction with AWS S3/Minio](#interaction-with-aws-s3minio)\n* [Queue prioritization algorithm](#queue-prioritization-algorithm)\n* [Interaction with Kafka](#interaction-with-kafka)\n* [Other system properties](#other-system-properties)\n* [Issue tracker](#issue-tracker)\n* [Additional information](#additional-information)\n* [Script to upload a batch of MARC records](#script-to-upload-a-batch-of-marc-records)\n\n## Introduction\n\nmod-data-import is responsible for uploading files (see [documentation for file uploading](FileUploadApi.md)), handling them initially, and sending records for further processing (see [documentation for file processing](FileProcessingApi.md)).\n\n## Compiling\n\n```\n   mvn install\n```\n\nCheck that the output says \"BUILD SUCCESS\" near the end.\n\n## Docker\n\nBuild the Docker container with:\n\n```\n   docker build -t mod-data-import .\n```\n\nTest that it runs with:\n\n```\n   docker run -t -i -p 8081:8081 mod-data-import\n```\n\n## Installing the module\n\nFollow the\n[Deploying Modules](https://github.com/folio-org/okapi/blob/master/doc/guide.md#example-1-deploying-and-using-a-simple-module)\nsection of the Okapi Guide and Reference, which describes the process in detail.\n\nFirst, you need a running Okapi instance.\n(Note that [specifying](../README.md#setting-things-up) an explicit 'okapiurl' might be needed.)\n\n```\n   cd .../okapi\n   java -jar okapi-core/target/okapi-core-fat.jar dev\n```\n\nWe need to declare the module 
to Okapi:\n\n```\ncurl -w '\\n' -X POST -D -   \\\n   -H \"Content-type: application/json\"   \\\n   -d @target/ModuleDescriptor.json \\\n   http://localhost:9130/_/proxy/modules\n```\n\nThat ModuleDescriptor tells Okapi what the module is called, what services it\nprovides, and how to deploy it.\n\n## Deploying the module\n\nNext we need to deploy the module. There is a deployment descriptor in\n`target/DeploymentDescriptor.json`. It tells Okapi to start the module on 'localhost'.\n\nDeploy it via Okapi discovery:\n\n```\ncurl -w '\\n' -D - -s \\\n  -X POST \\\n  -H \"Content-type: application/json\" \\\n  -d @target/DeploymentDescriptor.json  \\\n  http://localhost:9130/_/discovery/modules\n```\n\nThen we need to enable the module for the tenant:\n\n```\ncurl -w '\\n' -X POST -D -   \\\n    -H \"Content-type: application/json\"   \\\n    -d @target/TenantModuleDescriptor.json \\\n    http://localhost:9130/_/proxy/tenants/\u003ctenant_name\u003e/modules\n```\n\n## Maximum upload file size and java heap memory setups\n\nThe current implementation supports storing the file only in LOCAL_STORAGE (the file system of the module). This has two implications:\n1. the request for processing the file can be handled only by the same instance of the module, which prevents mod-data-import from scaling \n2. 
the file size that can be uploaded is limited by the Java heap memory allocated to the module.\nThe Java heap must be at least the expected maximum file size plus 10 percent.\n\n#### Example\n| File Size | Java Heap size |\n|:---------:|:--------------:|\n|  256 MB   |    270+ MB     |\n|  512 MB   |    560+ MB     |\n|   1 GB    |    1.1+ GB     |\n\n## File splitting configuration\n\nThe file-splitting process may be configured with the following environment variables:\n\n| Name                                | Type               | Required                 | Default | Description                                                                                |\n| ----------------------------------- | ------------------ | ------------------------ | ------- | ------------------------------------------------------------------------------------------ |\n| `SPLIT_FILES_ENABLED`               | `true` or `false`  | yes, if enabling feature | `false` | Whether files should be split into chunks and processed separately                         |\n| `RECORDS_PER_SPLIT_FILE`            | integer \u003e 0        | no                       | `1000`  | The maximum number of records to include in a single file                                  |\n| `ASYNC_PROCESSOR_POLL_INTERVAL_MS`  | integer (msec) ≥ 0 | no                       | `5000`  | The number of milliseconds between times when the module checks the queue for waiting jobs |\n| `ASYNC_PROCESSOR_MAX_WORKERS_COUNT` | integer ≥ 1        | no                       | `1`     | The maximum number of concurrent jobs to process at once, in this instance                 |\n\nFor the polling interval, a lower number reduces the latency between when a job is added to the queue and when it is processed. However, it also results in more frequent database queries, which may impact performance. 
Note that the number set here is the \"worst case\" — average waiting would be half of it — and that a few seconds delay on a large import is hardly noticeable.\n\nThe worker count is useful for production/multi-tenant environments, where you might want to provide more capacity without additional instances.  **However, note that this may cause some odd behavior when only one user is running a job, as multiple parts may appear to complete together.**\n\n\u003e [!NOTE]\n\u003e For full information about this feature, please view [the release notes](https://wiki.folio.org/display/FOLIOtips/Detailed+Release+Notes+for+Data+Import+Splitting+Feature)\n\n## Interaction with AWS S3/Minio\n\nThis module uses S3-compatible storage as part of the file upload process. The following environment variables must be set with values for your S3-compatible storage (AWS S3, Minio Server):\n\n| Name                   | Type              | Required           | Default                  | Description                                                                   |\n|------------------------| ----------------- | ------------------ | ------------------------ | ----------------------------------------------------------------------------- |\n| `S3_URL`               | URL as string     | yes                | `http://127.0.0.1:9000/` | URL of S3-compatible storage                                                  |\n| `S3_REGION`            | string            | yes                | _none_                   | S3 region                                                                     |\n| `S3_BUCKET`            | string            | yes                | _none_                   | Bucket to store and retrieve data                                             |\n| `S3_ACCESS_KEY_ID`     | string            | yes                | _none_                   | S3 access key                                                                 |\n| `S3_SECRET_ACCESS_KEY` | string            | yes          
      | _none_                   | S3 secret key                                                                 |\n| `S3_IS_AWS`            | `true` or `false` | no, if using MinIO | `false`                  | If AWS S3 is being used (`true` if so, `false` other platforms such as MinIO) |\n\nPath-style vs virtual-hosted style requests are described [on the AWS S3 documentation](https://docs.aws.amazon.com/AmazonS3/latest/userguide/VirtualHosting.html#path-style-access).\n\n\u003e [!WARNING]\n\u003e It is possible for files to be partially uploaded but abandoned in the UI. This module makes no effort to detect these cases and proactively delete them.\n\u003e\n\u003e Instead, use the retention policies built into AWS S3 and MinIO, as described [here](https://wiki.folio.org/display/FOLIOtips/Detailed+Release+Notes+for+Data+Import+Splitting+Feature#DetailedReleaseNotesforDataImportSplittingFeature-Garbagecollectionofcloudstoredfiles:).\n\n## Queue prioritization algorithm\n\nThis covers the following environment variables:\n\n\u003e [!NOTE]\n\u003e None of these are required; if not set, the following default values will be used.\n\n| Name                                  | Type (unit)       | Default  | Reasoning                                                                                                                                                                                    |\n| ------------------------------------- | ----------------- | -------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |\n| `SCORE_JOB_SMALLEST`                  | integer           | `40`     |                                                                                                                                                                                              |\n| `SCORE_JOB_LARGEST`                   | integer        
   | `-40`    | Larger jobs should be deprioritized                                                                                                                                                          |\n| `SCORE_JOB_REFERENCE`                 | integer (records) | `100000` |                                                                                                                                                                                              |\n| `SCORE_AGE_NEWEST`                    | integer           | `0`      | New jobs begin with no boost                                                                                                                                                                 |\n| `SCORE_AGE_OLDEST`                    | integer           | `50`     | As jobs age, their score increases rapidly, so this does not have to be too high. We want small jobs to \"cut\" in line effectively.                                                           |\n| `SCORE_AGE_EXTREME_THRESHOLD_MINUTES` | integer (minutes) | `480`    | 8 hours                                                                                                                                                                                      |\n| `SCORE_AGE_EXTREME_VALUE`             | integer           | `10000`  | Jump to the top of the queue if waiting more than 8 hours                                                                                                                                    |\n| `SCORE_TENANT_USAGE_MIN`              | integer           | `100`    | If the tenant has no jobs running, then it should be prioritized                                                                                                                             |\n| `SCORE_TENANT_USAGE_MAX`              | integer           | `-200`   | If the tenant is using all available workers, it should be **significantly** deprioritized. 
If no other tenants are competing, this will not matter (since all jobs would be offset by this) |\n| `SCORE_PART_NUMBER_FIRST`             | integer           | `1`      | Very small; we only want to order parts amongst others within a job (which would likely have the same score otherwise)                                                                       |\n| `SCORE_PART_NUMBER_LAST`              | integer           | `0`      |                                                                                                                                                                                              |\n| `SCORE_PART_NUMBER_LAST_REFERENCE`    | integer           | `100`    | Does not really matter due to the small range                                                                                                                                                |\n\nFor information on what these mean, how to configure them, how scores are calculated, and even a playground to experiment with different values, please see [this wiki page](https://wiki.folio.org/display/FOLIOtips/Detailed+Release+Notes+for+Data+Import+Splitting+Feature#DetailedReleaseNotesforDataImportSplittingFeature-QueuePrioritizationAlgorithm).\n\n\u003e [!IMPORTANT]\n\u003e To disable an individual metric (or the prioritization altogether), set the value(s) to `0`.\n\n\u003e [!NOTE]\n\u003e We recommend the suggested values above; however, there is a lot of room for customization and extension as needed.  Please see the doc for more information.\n\n## Interaction with Kafka\n\nAll modules involved in data import (mod-data-import, mod-source-record-manager, mod-source-record-storage, mod-inventory, mod-invoice) communicate via Kafka directly. 
Therefore, to enable data import, Kafka must be set up properly and all the necessary parameters must be set for these modules.\n\n**Properties that are required for mod-data-import to interact with Kafka:**\n\n* `KAFKA_HOST`\n* `KAFKA_PORT`\n* `OKAPI_URL`\n* `ENV` (unique env ID)\n\nTwo other important properties are the numbers of partitions for the topics `DI_INITIALIZATION_STARTED` and `DI_RAW_RECORDS_CHUNK_READ`,\nwhich are created during tenant initialization. These values can be customized with the\n `DI_INITIALIZATION_STARTED_PARTITIONS` and `DI_RAW_RECORDS_CHUNK_READ_PARTITIONS` environment variables respectively.\nThe default value for each is `1`.\n\n## Other system properties\n\nInitial handling of the uploaded file means chunking it and sending records for processing in other modules. The chunk size can be adjusted per file type; otherwise the following default values are used:\n\n* \"_file.processing.marc.raw.buffer.chunk.size_\": 50 - applicable to MARC files in binary format\n* \"_file.processing.marc.json.buffer.chunk.size_\": 50 - applicable to JSON files with MARC data in JSON format\n* \"_file.processing.marc.xml.buffer.chunk.size_\": 10 - applicable to XML files with MARC data in XML format\n* \"_file.processing.edifact.buffer.chunk.size_\": 10 - applicable to EDIFACT files\n\n## Issue tracker\n\nSee project [MODDATAIMP](https://issues.folio.org/browse/MODDATAIMP)\nat the [FOLIO issue tracker](https://dev.folio.org/guidelines/issue-tracker/).\n\n## Additional information\n\nThe [raml-module-builder](https://github.com/folio-org/raml-module-builder) framework.\n\nOther [modules](https://dev.folio.org/source-code/#server-side).\n\nSee project [MODDATAIMP](https://issues.folio.org/browse/MODDATAIMP) at the [FOLIO issue tracker](https://dev.folio.org/guidelines/issue-tracker).\n\nOther FOLIO Developer documentation is at [dev.folio.org](https://dev.folio.org/).\n\n## Script to upload a batch of MARC records\n\n[The `scripts` directory](scripts) contains a shell script, 
`load-marc-data-into-folio.sh`, and a file with a sample of 100 MARC records, `sample100.marc`. This script can be used to upload any batch of MARC files automatically, using the same sequence of WSAPI operations as the Secret Button. First, log in to a FOLIO backend service using [the Okapi command-line utility](https://github.com/thefrontside/okapi.rb) or any other means that leaves definitions of the Okapi URL, tenant and token in the `.okapi` file in the home directory. Then run the script, naming the MARC file as its argument:\n\n```\nscripts$ echo OKAPI_URL=https://folio-snapshot-stable-okapi.dev.folio.org \u003e ~/.okapi\nscripts$ echo OKAPI_TENANT=diku \u003e\u003e ~/.okapi\nscripts$ okapi login\nusername: diku_admin\npassword: ************\nLogin successful. Token saved to /Users/mike/.okapi\nscripts$ ./load-marc-data-into-folio.sh sample100.marc\n=== Stage 1 ===\n=== Stage 2 ===\n=== Stage 3 ===\nHTTP/2 204\ndate: Thu, 27 Aug 2020 11:55:28 GMT\nx-okapi-trace: POST mod-authtoken-2.6.0-SNAPSHOT.73 http://10.36.1.38:9178/data-import/uploadDefinitions/123a8d01-e389-4893-a53e-cc2de846471d/processFiles.. : 202 7078us\nx-okapi-trace: POST mod-data-import-1.11.0-SNAPSHOT.140 http://10.36.1.38:9175/data-import/uploadDefinitions/123a8d01-e389-4893-a53e-cc2de846471d/processFiles.. : 204 6354us\nscripts$\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffolio-org%2Fmod-data-import","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffolio-org%2Fmod-data-import","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffolio-org%2Fmod-data-import/lists"}