{"id":47895716,"url":"https://github.com/folio-org/mod-linked-data-import","last_synced_at":"2026-04-04T03:44:17.169Z","repository":{"id":312265700,"uuid":"1046867257","full_name":"folio-org/mod-linked-data-import","owner":"folio-org","description":"Module to import Bibframe RDF into FOLIO's data graph.","archived":false,"fork":false,"pushed_at":"2026-04-03T08:37:11.000Z","size":317,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2026-04-04T03:44:14.215Z","etag":null,"topics":["bibframe","builde","data-graph","linked-data","rdf"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/folio-org.png","metadata":{"files":{"readme":"README.md","changelog":"NEWS.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-08-29T11:00:26.000Z","updated_at":"2026-04-03T08:37:13.000Z","dependencies_parsed_at":"2025-12-30T23:01:57.999Z","dependency_job_id":"f1299359-9915-46c0-8dd9-b2d7a632234a","html_url":"https://github.com/folio-org/mod-linked-data-import","commit_stats":null,"previous_names":["folio-org/mod-linked-data-import"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/folio-org/mod-linked-data-import","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/folio-org%2Fmod-linked-data-import","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/folio-org%2Fmod-linked-data-import/tags","releases_url":"
https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/folio-org%2Fmod-linked-data-import/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/folio-org%2Fmod-linked-data-import/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/folio-org","download_url":"https://codeload.github.com/folio-org/mod-linked-data-import/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/folio-org%2Fmod-linked-data-import/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31387023,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-04T01:22:39.193Z","status":"online","status_checked_at":"2026-04-04T02:00:07.569Z","response_time":60,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bibframe","builde","data-graph","linked-data","rdf"],"created_at":"2026-04-04T03:44:11.860Z","updated_at":"2026-04-04T03:44:17.160Z","avatar_url":"https://github.com/folio-org.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"# mod-linked-data-import\nCopyright (C) 2025 The Open Library Foundation\n\nThis software is distributed under the terms of the Apache License, Version 2.0.\nSee the file \"[LICENSE](LICENSE)\" for more information.\n## Introduction\n\nThis module provides bulk import functionality for RDF data graphs into the 
[`mod-linked-data`](https://github.com/folio-org/mod-linked-data) application.\nIt reads RDF subgraphs in [Bibframe 2](https://id.loc.gov/ontologies/bibframe.html) format, transforms them into the\n[Builde](https://bibfra.me/) vocabulary, and delivers them to `mod-linked-data` via Kafka.\n\n## Third party libraries used in this software\nThis software uses the following Weak Copyleft (Eclipse Public License 1.0 / 2.0) licensed software libraries:\n\n- [jakarta.annotation-api](https://projects.eclipse.org/projects/ee4j.ca)\n- [jakarta.json-api](https://github.com/jakartaee/jsonp-api)\n- [junit](https://junit.org/)\n- [aspectjweaver](https://eclipse.dev/aspectj/)\n\n## How to Import Data\n1. Upload the RDF file to the S3 bucket specified by the `S3_BUCKET` environment variable.\n2. Inside that bucket, place the file within the subdirectory corresponding to the target tenant ID.\n3. Trigger the import by calling the following API:\n```\nPOST /linked-data-import/start?fileName={fileNameInS3}\u0026contentType=application/ld+json\nx-okapi-tenant: {tenantId}\nx-okapi-token: {token}\n```\nThe `fileName` parameter should contain the file name. 
The module retrieves the file from the S3 bucket (specified by the `S3_BUCKET` environment variable), from the subdirectory matching the tenant ID.\n\nThe response is a job execution ID, which can later be used to get the job status or the failed lines.\n## To check the import job status, use:\n```\nGET /linked-data-import/jobs/{jobExecutionId}\nx-okapi-tenant: {tenantId}\nx-okapi-token: {token}\n```\nThe response includes job information such as:\n- `startDate`: Job start date and time\n- `startedBy`: ID of the user who started the job\n- `status`: Current job status (COMPLETED, STARTED, FAILED, etc.)\n- `fileName`: Name of the imported file\n- `currentStep`: Current processing step\n- `linesRead`: Total lines read from the file\n- `linesMapped`: Lines successfully mapped\n- `linesFailedMapping`: Lines that failed during mapping\n- `linesCreated`: Resources created\n- `linesUpdated`: Resources updated\n- `linesFailedSaving`: Lines that failed during saving\n\n## To download failed RDF lines as a CSV file:\n```\nGET /linked-data-import/jobs/{jobExecutionId}/failed-lines\nx-okapi-tenant: {tenantId}\nx-okapi-token: {token}\n```\nThe CSV file contains:\n- `lineNumber`: Line number in the original file\n- `description`: Error description\n- `failedRdfLine`: The RDF line content that failed\n\n## To cancel a running import job:\n```\nPUT /linked-data-import/jobs/{jobExecutionId}/cancel\nx-okapi-tenant: {tenantId}\nx-okapi-token: {token}\n```\n**Note:** The job does not stop immediately. It stops gracefully after completing the current processing chunk or step.\n\n## File Format \u0026 Contents\n\n1. The file must be in **JSON Lines (jsonl)** format.\n2. Each line must contain a complete subgraph of a **Bibframe Instance** resource, as defined by the [Bibframe 2 ontology](https://id.loc.gov/ontologies/bibframe.html).\n\nFor an example of a valid import file containing two RDF instances, see [docs/example-import.jsonl](./docs/example-import.jsonl).\n\n## Limitations\n1. 
Only RDF data serialized as `application/ld+json` is supported.\n   Support for additional formats (e.g., XML, N-Triples) may be added in the future.\n2. Only **Bibframe Instances** and their connected resources can be imported.\n   Standalone resources, such as a [Person](https://id.loc.gov/ontologies/bibframe.html#c_Person) not linked to any Instance, cannot be processed.\n\n## Batch processing\nFile contents are processed in batches.\nYou can configure batch processing using the following environment variables:\n1. `CHUNK_SIZE`: Number of lines read from the input file per chunk\n2. `OUTPUT_CHUNK_SIZE`: Number of Graph resources sent to Kafka per chunk\n3. `PROCESS_FILE_MAX_POOL_SIZE`: Maximum threads used for parallel chunk processing\n\n## Interaction with mod-linked-data\n\n`mod-linked-data` uses the [Builde vocabulary](https://bibfra.me/) for representing graph data.\n\nDuring import:\n\n1. This module transforms Bibframe 2 subgraphs into the equivalent Builde subgraphs using the [`lib-linked-data-rdf4ld`](https://github.com/folio-org/lib-linked-data-rdf4ld) library.\n2. The transformed subgraphs are published to the Kafka topic specified by the `KAFKA_LINKED_DATA_IMPORT_OUTPUT_TOPIC` environment variable.\n3. `mod-linked-data` consumes messages from this topic, performs additional processing, and persists the graph to its database.\n\n## Dependencies on libraries\nThis module depends on the following libraries:\n- [lib-linked-data-dictionary](https://github.com/folio-org/lib-linked-data-dictionary)\n- [lib-linked-data-fingerprint](https://github.com/folio-org/lib-linked-data-fingerprint)\n- [lib-linked-data-rdf4ld](https://github.com/folio-org/lib-linked-data-rdf4ld)\n## Compiling\n```bash\nmvn clean install\n```\nSkip tests:\n```bash\nmvn clean install -DskipTests\n```\n\n### Environment variables\nThis module uses S3 storage for files. 
AWS S3 and MinIO Server are supported for file storage.\nSet the `S3_IS_AWS` variable to indicate whether AWS S3 is used as file storage. By default,\nthis variable is `false`, which means that a MinIO server is used as storage.\nSet it to `true` if AWS S3 is used.\n\n| Name                                                     | Default value             | Description                                                                 |\n|:---------------------------------------------------------|:--------------------------|:----------------------------------------------------------------------------|\n| SERVER_PORT                                              | 8081                      | Server port                                                                 |\n| DB_USERNAME                                              | postgres                  | Database username                                                           |\n| DB_PASSWORD                                              | postgres                  | Database password                                                           |\n| DB_HOST                                                  | postgres                  | Database host                                                               |\n| DB_PORT                                                  | 5432                      | Database port                                                               |\n| DB_DATABASE                                              | okapi_modules             | Database name                                                               |\n| DB_MAXPOOLSIZE                                           | 100                       | Maximum database connection pool size                                       |\n| KAFKA_HOST                                               | kafka                     | Kafka broker host                                                           |\n| KAFKA_PORT             
                                  | 9092                      | Kafka broker port                                                           |\n| KAFKA_CONSUMER_MAX_POLL_RECORDS                          | 100                       | Maximum number of records returned in a single poll                         |\n| KAFKA_SECURITY_PROTOCOL                                  | PLAINTEXT                 | Kafka security protocol                                                     |\n| KAFKA_SSL_KEYSTORE_PASSWORD                              | -                         | Kafka SSL keystore password                                                 |\n| KAFKA_SSL_KEYSTORE_LOCATION                              | -                         | Kafka SSL keystore location                                                 |\n| KAFKA_SSL_TRUSTSTORE_PASSWORD                            | -                         | Kafka SSL truststore password                                               |\n| KAFKA_SSL_TRUSTSTORE_LOCATION                            | -                         | Kafka SSL truststore location                                               |\n| ENV                                                      | folio                     | Environment name used in Kafka topic names                                  |\n| KAFKA_RETRY_INTERVAL_MS                                  | 2000                      | Kafka retry interval in milliseconds                                        |\n| KAFKA_RETRY_DELIVERY_ATTEMPTS                            | 6                         | Number of Kafka delivery retry attempts                                     |\n| KAFKA_IMPORT_RESULT_EVENT_CONCURRENCY                    | 1                         | Number of concurrent consumers for import result events                     |\n| KAFKA_IMPORT_RESULT_EVENT_TOPIC_PATTERN                  | (${ENV}\\\\.)(.\\*\\\\.)result | Kafka topic pattern for import result events                                |\n| 
KAFKA_LINKED_DATA_IMPORT_OUTPUT_TOPIC                    | linked_data_import.output | Kafka topic where the transformed subgraph is published for mod-linked-data |\n| KAFKA_LINKED_DATA_IMPORT_OUTPUT_TOPIC_PARTITIONS         | 3                         | Number of partitions for the output topic                                   |\n| KAFKA_LINKED_DATA_IMPORT_OUTPUT_TOPIC_REPLICATION_FACTOR | -                         | Replication factor for the output topic                                     |\n| KAFKA_LINKED_DATA_IMPORT_RESULT_TOPIC                    | linked_data_import.result | Kafka topic for import processing results                                   |\n| KAFKA_LINKED_DATA_IMPORT_RESULT_TOPIC_PARTITIONS         | 3                         | Number of partitions for the result topic                                   |\n| KAFKA_LINKED_DATA_IMPORT_RESULT_TOPIC_REPLICATION_FACTOR | -                         | Replication factor for the result topic                                     |\n| S3_URL                                                   | http://127.0.0.1:9000/    | S3 url                                                                      |\n| S3_REGION                                                | -                         | S3 region                                                                   |\n| S3_BUCKET                                                | -                         | S3 bucket                                                                   |\n| S3_ACCESS_KEY_ID                                         | -                         | S3 access key                                                               |\n| S3_SECRET_ACCESS_KEY                                     | -                         | S3 secret key                                                               |\n| S3_IS_AWS                                                | false                     | Specify if AWS S3 is used as files storage                       
           |\n| CHUNK_SIZE                                               | 1000                      | Number of lines read from the input file per chunk                          |\n| OUTPUT_CHUNK_SIZE                                        | 100                       | Number of Graph resources sent to Kafka per chunk                           |\n| PROCESS_FILE_MAX_POOL_SIZE                               | 1000                      | Maximum threads used for parallel chunk processing                          |\n| DATA_CLEANUP_CRON                                        | 0 0 2 * * *               | Cron expression for automatic cleanup of completed job data (daily at 2 AM) |\n| DATA_CLEANUP_AGE_DAYS                                    | 2                         | Number of days after which job data is eligible for cleanup                 |\n\n## Further information\n\n### Issue tracker\nProject [MODLDI](https://issues.folio.org/browse/MODLDI)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffolio-org%2Fmod-linked-data-import","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffolio-org%2Fmod-linked-data-import","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffolio-org%2Fmod-linked-data-import/lists"}