{"id":13760800,"url":"https://github.com/natliblux/BnLMetsExporter","last_synced_at":"2025-05-10T11:31:51.861Z","repository":{"id":170886085,"uuid":"130645091","full_name":"natliblux/BnLMetsExporter","owner":"natliblux","description":"Command Line Interface (CLI) to export METS/ALTO documents to other formats.","archived":false,"fork":false,"pushed_at":"2022-04-25T08:38:53.000Z","size":443,"stargazers_count":13,"open_issues_count":0,"forks_count":1,"subscribers_count":7,"default_branch":"master","last_synced_at":"2024-11-16T18:34:07.515Z","etag":null,"topics":["alto","alto-xml","mets","mets-xml","solr","solrcloud","xml"],"latest_commit_sha":null,"homepage":null,"language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/natliblux.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"COPYING","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":"AUTHORS"}},"created_at":"2018-04-23T05:36:38.000Z","updated_at":"2023-10-26T18:21:59.000Z","dependencies_parsed_at":"2024-01-15T03:57:34.269Z","dependency_job_id":"d731ce79-d2bb-467c-9f06-d43c554b2577","html_url":"https://github.com/natliblux/BnLMetsExporter","commit_stats":null,"previous_names":["natliblux/bnlmetsexporter"],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/natliblux%2FBnLMetsExporter","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/natliblux%2FBnLMetsExporter/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/natliblux%2FBnLMetsExporter/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/natliblux%2FBnLMetsExporter/manifests",
"owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/natliblux","download_url":"https://codeload.github.com/natliblux/BnLMetsExporter/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253410672,"owners_count":21904129,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["alto","alto-xml","mets","mets-xml","solr","solrcloud","xml"],"created_at":"2024-08-03T13:01:22.290Z","updated_at":"2025-05-10T11:31:51.517Z","avatar_url":"https://github.com/natliblux.png","language":"Java","funding_links":[],"categories":["Java"],"sub_categories":[],"readme":"# BnL Mets Exporter\n\nThis project is a command line tool used by the [National Library of Luxembourg (Bibliothèque nationale de Luxembourg - BnL)](http://www.bnl.public.lu/) to parse METS/ALTO documents and export them into another format. \n\nNote that this project is highly tailored to the BnL's METS/ALTO requirements and current workflows. \n\n# Requirements:\n\n- Java 8\n- Maven\n\n# Data\n\nThis program is tailored for the data of the National Library of Luxembourg.\nYou can download a small dataset [here](http://downloads.bnl.lu).\n\n# Getting Started\n\n## 1. Download and Build \n\nDownload or clone this repository, then go inside and build it using Maven: `mvn install`\n\n## 2. 
Config\n\nA default `config_example` folder is provided with this project.\nMake a copy and rename it to `config`.\n\nFor testing purposes, you can use the config as it is, but the URLs in `export.yml` should be adapted to your needs.\n\nTo test the configuration you can run:\n\n\tjava -jar BnLMetsExporter.jar --testconfig\n\n## 3. Run\n\nThe tool supports 2 modes:\n\n1. Primo: Parses METS/ALTO and exports each document unit to an XML document (OAI with Dublin Core). The result is saved directly into a `tar.gz`.\n2. Solr: Parses METS/ALTO and loads each document unit into Solr indices.\n\n**Note:** A document unit is, for example, a newspaper article, section, illustration, advertisement, ... or book chapter, ... and is defined in the `export.yml` file.\n\n\nCommand for exporting to `tar.gz` from a local folder (where your METS/ALTO files are):\n\n\tjava -jar BnLMetsExporter.jar -export primo -dir path/to/metsalto\n\nFor exporting to Solr, refer to the advanced section below.\n\n## 4. Output\n\nThe output will be saved in: `./import-for-primo.tar.gz`.\n\n## More Commands\n\nFor help and details on the available commands, just run:\n\n\tjava -jar BnLMetsExporter.jar -h\n\n# Build\n\nTo build, simply call `mvn install`.\nThe jar file will be `BnLMetsExporter.jar` in a folder named `target`.\n\n\n# License\n\n[![License: GPL v3](https://img.shields.io/badge/License-GPL%20v3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0)\n\nSee `COPYING` for the full text.\n\n# Contributions\n\nContributions to this project will be evaluated on a case-by-case basis.\nSome parts of this software are general, but this project is highly tailored to the needs of the National Library of Luxembourg. 
\nIf your needs are different, please fork your own version and modify it.\n\n\n# Advanced\n\n## Getting Mets files differently\n\nThe program can fetch Mets files from a URL in the following format: `http://127.0.0.1:8080/getMets?pid=`.\nYou have to provide a list of IDs in a line-by-line text file, e.g. `-pids 10pids.lst`.\nThe PID will be appended to the \"get Mets Url\".\nNote that Alto files must still be accessible on disk, so you have to provide the argument `-dir path/to/metsalto`, which points to the root from which the Alto file paths written in the METS file are resolved. \n\n**Custom**\n\nIn addition to the already implemented methods, you can write your own implementation of the abstract class `MetsGetter`.\nIt is located in the package `lu.bnl.reader`. \n\n## Getting Started with Solr Export\n\nIf you want to export to Solr, the following sections describe how to get started with a Solr installation.\n\n### Development Environment\n\nThe easiest way to create and manage a Solr Cloud on a development environment is using the Solr cloud example as explained here:\nhttps://lucene.apache.org/solr/guide/6_6/getting-started-with-solrcloud.html\n\n### Config\n\nOn the development environment, create a new configset in `solr/server/solr/configsets/`.\nFor this, just copy the `basic_configs/` to `my_config_set`.\nUnder `my_config_set/conf/`, upload your own configuration from the folder `config_example/solr` of this project.\n\nThen the config must be reloaded into ZooKeeper (upconfig):\n\n\t$ solr-6.6.0/server/scripts/cloud-scripts/zkcli.sh -zkhost localhost:9983 -cmd upconfig -confname my_config_set -confdir solr-6.6.0/server/solr/configsets/my_config_set/conf/\n\nAfter this step, all collections must be reloaded.\n\n**IMPORTANT: The Solr config in this repository is for development only.**\n\n### Solr Collections / Cores\n\nThere are 2 indices:\n\n1. 
Article Index: Stores the data of a single article, chapter, section, etc.\n2. Page Index: Stores the data of a complete page\n\nSo, you must create two cores or collections using the configsets above. The default names are `my_articles` and `my_pages`, as written in the default `export.yml`.\n\nIn the config, note that `articleURL` and `pageURL` are for Solr running in single host mode (not cloud).\n`articleCollection` and `pageCollection` along with `zkHost` must be specified for a Solr running in Cloud mode.\n\n### Cleaning the Solr Indices\n\nUse the command `-clean` to clean the Solr article and page indices. Add `-cloud` if it is a Solr Cloud environment.\n\n\t$ java -jar BnLMetsExporter.jar -clean -cloud\n\n## Solr FAQ\n\n### What are the key changes in the schema?\n\n- Solr is in `ClassicIndexSchemaFactory` mode. The code `\u003cschemaFactory class=\"ClassicIndexSchemaFactory\"/\u003e` has been added to the `config/solr/solrconfig.xml` file.\n- There is only 1 schema. Articles and Pages use the same schema.\n- Most dynamic fields are commented out.\n- The custom field `text_primo` tries to follow the same analysis and query rules as the internal system of the BnL (Primo).\n\n\n### How is the full text stored?\n\nThe full text is stored in 2 text fields, `text_lines` and `text_words`.\n\n- **text_lines**: Stores the full text, which contains HTML and custom XML tags and is meant for web display to a user. 
\n- **text_words**: Stores each word and its coordinates on the page in a custom format (using XML tags).\n\n### Why use HTML to store line ending information?\n\nThe current BnLViewer displays the full text and needs the line-by-line information.\nThis makes it possible to display the full text in the same layout as on the physical paper.\n\nThe easiest solution is to store the line ending information inside the full text directly.\nHTML tags provide a standard way to mark this information without interfering with the analyser, because it uses a `solr.HTMLStripCharFilterFactory` (only during indexing, not during querying).\n\nThis approach is simple to parse and manipulate, and it still works with Solr's built-in highlight feature.\n\n### How are line endings stored?\n\nLine endings are stored using custom XML tags in the `text_lines` field.\nAt the moment, each new line is marked with a `\u003cbr\u003e`.\nEvery word that is hyphenated across a line ending is stored as `\u003cle s=\"X-\" e=\"Y\"\u003eXY\u003c/le\u003e`.\n*s* contains *X-* (with hyphen), which is the start of the word.\n*e* contains *Y*, which is the end of the word.\n*X* is located in line *n* and *Y* is located in line *n+1*.\nThe word (without hyphen) *XY* is required so that the analyser can match the complete word. 
\nThis word is also what is highlighted, meaning that some post-processing is required to recreate the correct line-by-line text.\n\n### How are word coordinates stored?\n\nWord coordinates are stored using custom XML tags in the `text_words` field.\nThe full text is stored as a sequence of `\u003cw\u003e` tags like this:\n\n\t\u003cw a=\"DTL1\" b=\"BLOCK1\" p=\"ALTO00001\" x=\"300\" y=\"1400\" w=\"287\" h=\"68\"\u003eHello\u003c/w\u003e\n\t\u003cw a=\"DTL1\" b=\"BLOCK2\" p=\"ALTO00001\" x=\"600\" y=\"1400\" w=\"290\" h=\"67\"\u003eWorld\u003c/w\u003e\n\t...\n\t\nAgain, the `solr.HTMLStripCharFilterFactory` makes it possible to ignore all tags while still searching the words with the same analyser as if the field were plain full text. Solr highlighting works with the appropriate options, such as `hl.fragsize=0` and `hl.maxAnalyzedChars=10000000`.\nThis solution makes it easy to retrieve the coordinates of the words for any page and article.\n\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnatliblux%2FBnLMetsExporter","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnatliblux%2FBnLMetsExporter","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnatliblux%2FBnLMetsExporter/lists"}