{"id":20668627,"url":"https://github.com/vandlj/xmltojson","last_synced_at":"2026-05-18T04:11:10.341Z","repository":{"id":261764741,"uuid":"885268406","full_name":"VandlJ/XMLtoJSON","owner":"VandlJ","description":"Tool for converting XML files into JSON format from digital archive of historical documents. The project is divided into separate Python modules for handling different kinds of data, including documents, persons, and archives.","archived":false,"fork":false,"pushed_at":"2024-12-23T17:57:07.000Z","size":1342,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-10-13T01:32:06.830Z","etag":null,"topics":["digital-archive","digital-humanities","full-text-search","historical-documents","json","xml"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/VandlJ.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-11-08T09:20:16.000Z","updated_at":"2024-12-23T17:57:10.000Z","dependencies_parsed_at":"2024-12-23T18:41:14.142Z","dependency_job_id":null,"html_url":"https://github.com/VandlJ/XMLtoJSON","commit_stats":null,"previous_names":["vandlj/xmltojson"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/VandlJ/XMLtoJSON","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VandlJ%2FXMLtoJSON","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VandlJ%2FXMLtoJSON/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories
/VandlJ%2FXMLtoJSON/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VandlJ%2FXMLtoJSON/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/VandlJ","download_url":"https://codeload.github.com/VandlJ/XMLtoJSON/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VandlJ%2FXMLtoJSON/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279013897,"owners_count":26085326,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-12T02:00:06.719Z","response_time":53,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["digital-archive","digital-humanities","full-text-search","historical-documents","json","xml"],"created_at":"2024-11-16T20:10:03.425Z","updated_at":"2025-10-13T01:32:31.492Z","avatar_url":"https://github.com/VandlJ.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# XML to JSON Converter\n\n## Table of Contents\n- [XML to JSON Converter](#xml-to-json-converter)\n  - [Table of Contents](#table-of-contents)\n  - [Overview](#overview)\n  - [Features](#features)\n  - [Repository Structure](#repository-structure)\n  - [Usage](#usage)\n    - [Input Data Folder Structure](#input-data-folder-structure)\n  - [Setup Instructions](#setup-instructions)\n    - [Step 1: Clone the 
Repository](#step-1-clone-the-repository)\n    - [Step 2: Run the Conversion](#step-2-run-the-conversion)\n  - [Changes and Improvements](#changes-and-improvements)\n\n---\n\n## Overview\nThis project provides a set of scripts and tools for converting XML files into JSON format. It is designed to work with different XML data sources and is fully customizable, supporting multiple conversion modules. The project is divided into separate Python modules for handling different kinds of data, including documents, persons, and archives.\n\nThe solution includes:\n- Various XML parsing methods for extracting data from different types of XML files.\n- Tools to handle specific document types like archives, artwork, and persons.\n- A flexible structure for easy conversion and integration with other systems.\n\n## Features\n- **Custom Conversion Scripts**: Designed for different XML formats, including documents, persons, and archive links.\n- **Flexible Data Handling**: The ability to handle text, metadata, and specific attributes such as aliases and references.\n- **Modular Structure**: Each XML type is handled by separate scripts, making it easy to extend or modify.\n\n## Repository Structure\n\n```bash\nconvert/\n  archiveLinkConvert.py      # Handles conversion of archive link XMLs\n  artworkConvert.py          # Handles artwork XML data\n  commonConvert.py           # Contains common conversion utilities\n  personConvert.py           # Handles conversion of person-related XMLs\ndocs/\n  pictures/                  # Picture documentation related to the project\n  Analyza_SP.md              # Analysis related documentation\n  documentaria_rudolphina.md # Project-specific documentation\nmodel/\n  ArchiveLink.py             # Data model for archive links\n  Document.py                # Data model for documents\n  Person.py                  # Data model for person records\nscripts/\n  main_convert.py            # Main script to execute conversion\n  .gitignore                 
# Git ignore configuration\n  README.md                  # This documentation file\n```\n\n## Usage\n\nTo use this tool, you'll need Python and pip installed.\n\nThen, run the following command:\n\n```bash\npip install -r requirements.txt\n```\n\nThis will install the necessary libraries to run the script. Then simply run the `main_convert.py` script with the appropriate options. Here are the main commands to run the program from the `XMLtoJSON` directory:\n\n- Display help information:\n  ```bash\n  python3 scripts/main_convert.py --help\n  ```\n  or\n  ```bash\n  python3 scripts/main_convert.py --h\n  ```\n\n- Convert all types of XML files:\n  ```bash\n  python3 scripts/main_convert.py --type all --input_path \"path_for_input_data\" --output_path \"path_for_output_data\"\n  ```\n\n- Convert name-related XML files:\n  ```bash\n  python3 scripts/main_convert.py --type names --input_path \"path_for_input_data\" --output_path \"path_for_output_data\"\n  ```\n\n- Convert register-related XML files:\n  ```bash\n  python3 scripts/main_convert.py --type registers --input_path \"path_for_input_data\" --output_path \"path_for_output_data\"\n  ```\n\n- Convert archive-related XML files:\n  ```bash\n  python3 scripts/main_convert.py --type archive --input_path \"path_for_input_data\" --output_path \"path_for_output_data\"\n  ```\n\n### Input Data Folder Structure\n\nThe input data folder should be structured as follows:\n\n```bash\ninput_data/\n  Archiv/                    # Archive-related XML files\n  Regesten/                  # Register-related XML files\n  Namen/                     # Name-related XML files\n  Indicies/                  # Index-related XML files \n```\n\n## Setup Instructions\n\n### Step 1: Clone the Repository\n\n```bash\ngit clone https://github.com/VandlJ/XMLtoJSON.git\ncd XMLtoJSON\n```\n\n### Step 2: Run the Conversion\n\nTo begin the conversion, use the main conversion script. 
For example, to convert all XML files:\n```bash\npython3 scripts/main_convert.py --type all --input_path \"../test_data\" --output_path \"../test_data/output\"\n```\n\nYou can also check out all available options and get detailed information by running:\n```bash\npython3 scripts/main_convert.py --help\n```\n\nThis command will start processing the XML files in the specified `--input_path` directory and output the results to the `--output_path` directory.\n\n## Changes and Improvements\n\nThis project was inherited from another team, and we made several significant improvements and fixes to enhance its functionality and reliability:\n\n1. **Error Handling: Spaces/Blank Characters for Indentation in Text - in Regesten Files**\n   - Previously, the Regesten JSON files had issues with spaces and blank characters causing indentation errors. We addressed this by splitting the \"text\" field into two distinct key values:\n     - `display`: This field is used for displaying text on the frontend, ensuring it retains the original formatting for readability.\n     - `processable`: This field contains a cleaner version of the text, optimized for computer processing and analysis.\n\n2. **Metadata Handling: Problem Metadata in Regesten**\n   - There were inconsistencies in capturing metadata elements such as `.p` in the Regesten files. Some elements were missing or incorrectly captured. We conducted a thorough review and ensured that all metadata elements are now accurately captured and processed in our iteration of the program.\n\n3. **Enhanced Interactivity: Add Information `onmouseover=\"highlightWords(event, '...')\"` in Regesten**\n   - To improve the user experience, we added interactivity to the Regesten files. The `onmouseover` attribute was added to highlight words when hovered over. 
The processed data now includes:\n     ```json\n     \"names\": [\n       {\n         \"Aichholz_Johann\": \"Johann Aichholz\",\n         \"alias\": \"Johann Aichholz Ehrzney doctor\"\n       },\n       {\n         \"Strauben_Franz\": \"Franz Strauben\",\n         \"alias\": \"Frannzen Strauben\"\n       }\n     ]\n     ```\n\n4. **Name Processing: Splitting First Name and Last Name via External Tool - GettyULAN**\n   - We integrated the project with an external tool, GettyULAN, to enhance name processing. This tool or API provides URL links to authors and returns one request per person. The application queries the SPARQL endpoint Getty, where each name is validated and processed. This integration ensures accurate and enriched author information.\n   - Additionally, we made the API for name splitting run asynchronously with caching, significantly increasing performance by reducing redundant requests and improving response times.\n\n5. **Unified Main Script for Conversion**\n   - We streamlined the conversion process by consolidating the three main Python scripts (previously used for different document types) into a single, unified script. This main script is now configurable via terminal options, allowing users to specify `--type`, `--input_path`, and `--output_path`. This change simplifies the execution and enhances the flexibility of the conversion process.\n\n6. **Improved Documentation and Setup Instructions**\n   - Updated the documentation to reflect the new changes and provide clear setup instructions. This includes detailed usage examples and the expected input data folder structure to ensure users can easily get started with the project.\n\n7. **Performance Enhancements and Bug Fixes**\n   - Conducted a comprehensive review of the codebase to identify and fix bugs. Implemented performance enhancements to ensure the conversion process is efficient and reliable.\n\n8. 
**Fixes in Archiv Type JSON Output**\n   - Corrected the handling of `hasSublink`, `linkTo`, and `next_link` variables in the output JSON files for the Archiv type. This ensures that these variables are accurately represented and linked in the JSON output.\n\nThese improvements have significantly enhanced the functionality, usability, and reliability of the XML to JSON Converter project, making it more robust and user-friendly.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvandlj%2Fxmltojson","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvandlj%2Fxmltojson","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvandlj%2Fxmltojson/lists"}