{"id":15364181,"url":"https://github.com/kellnerd/gsoc-proposal","last_synced_at":"2025-06-13T11:07:47.957Z","repository":{"id":230390833,"uuid":"622976126","full_name":"kellnerd/gsoc-proposal","owner":"kellnerd","description":"BookBrainz: Import Other Open Databases (GSoC 2023 proposal)","archived":false,"fork":false,"pushed_at":"2024-04-02T13:21:09.000Z","size":273,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-06-13T11:07:19.898Z","etag":null,"topics":["bookbrainz","books","database-import","gsoc","gsoc-2023","gsoc-proposals","marc21","metabrainz","open-data"],"latest_commit_sha":null,"homepage":"","language":"Shell","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kellnerd.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-04-03T12:54:23.000Z","updated_at":"2024-03-29T11:37:50.000Z","dependencies_parsed_at":"2024-04-02T14:51:17.371Z","dependency_job_id":null,"html_url":"https://github.com/kellnerd/gsoc-proposal","commit_stats":{"total_commits":7,"total_committers":1,"mean_commits":7.0,"dds":0.0,"last_synced_commit":"bf44e3bbb84a289a983c51f2634d051dead51236"},"previous_names":["kellnerd/gsoc-proposal"],"tags_count":5,"template":false,"template_full_name":null,"purl":"pkg:github/kellnerd/gsoc-proposal","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kellnerd%2Fgsoc-proposal","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kellnerd%2Fgsoc-proposal/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kellnerd%2Fgsoc-proposal/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kellnerd%2Fgsoc-proposal/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kellnerd","download_url":"https://codeload.github.com/kellnerd/gsoc-proposal/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kellnerd%2Fgsoc-proposal/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259634349,"owners_count":22887698,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bookbrainz","books","database-import","gsoc","gsoc-2023","gsoc-proposals","marc21","metabrainz","open-data"],"created_at":"2024-10-01T13:10:30.406Z","updated_at":"2025-06-13T11:07:47.932Z","avatar_url":"https://github.com/kellnerd.png","language":"Shell","readme":"---\nauthor: David Kellner\ntitle: \"BookBrainz: Import Other Open Databases\"\nsubtitle: GSoC 2023 Proposal\nsubject: GSoC 2023 Proposal\nkeywords:\n- books\n- database import\n- GSoC\n- open data\n- MetaBrainz\n- BookBrainz\n- MARC records\n---\n\n# Contact Information\n\n- **Name**: David Kellner\n- **Email**: [redacted] \u003csmall\u003e(*I will not publish my email address in a public place, but mentors can find it in the PDF version of my proposal for the GSoC website.*)\u003c/small\u003e\n- **Libera.Chat IRC**: kellnerd\n- **GitHub**: [kellnerd](https://github.com/kellnerd)\n- **Timezone**: UTC+02:00 (Central European Summer Time)\n\n# Synopsis\n\nBookBrainz still has a relatively small community and contains less entities than other comparable databases.\nTherefore we want to provide a way to import available collections of library records into the database while still ensuring that they meet BookBrainz' high data quality standards.\n\nFrom a previous GSoC project, the database schema already contains additional tables set up for that purpose, where the imports will await a user's approval before becoming a fully accepted entity in the database.\n\nThe project will require processing very large data dumps (e.g. [MARC] records or [JSON] files) in a robust way and transforming entities from one database schema to the BookBrainz schema.\nAdditionally the whole process should be repeatable without creating duplicate entries.\n\n[JSON]: https://www.json.org/json-en.html\n[MARC]: https://www.loc.gov/marc/\n\n# Overview: The Import Process\n\nBefore I will start to explain the software architecture of the import process, let us agree on common terms for the different states of the imported data:\n\n**External Entity**\n: An entity which has been extracted from a certain external data source (i.e. a database dump or an API response) in its original format.\n\n**Parsed Entity**\n: A JSON representation of the external entity which is compatible with the BookBrainz ORM.\n\tThe JSON may also contain additional data which can not be represented within the current BookBrainz schema.\n\n**Pending Entity**\n: The entity has been imported into the BookBrainz database schema, but has not been approved (or discarded) by a user so far.\n\tThe additional data of the parsed entity will be kept in a freeform JSON column.\n\n**Accepted Entity**\n: The imported entity has been accepted by a user and now has the same status as a regularly added entity.\n\n## Existing Infrastructure\n\nInfrastructure to import entities into BookBrainz had already been developed in a previous GSoC project from 2018 [^blog-2018].\nThe architecture consists of two separate types of services which are connected by a [RabbitMQ] messaging queue which acts as a processing queue:\n\n1. **Producer**: Extracts external entities from a database dump (or an API) and emits parsed entities which are inserted into the queue.\n\tEach external data source has its own specific producer implementation although parts of it can be shared.\n\tSince the parsed entities are queued for insertion into the database, there can be multiple producer instances running in parallel.\n\n2. **Consumer**: Takes parsed entities from the queue and validates them before they are passed to the ORM which inserts them into the database as a pending entity.\n\nThe processing queue acts as a buffer to make the fast parsing logic independent from the slow database insertion process.\nIt also allows the consumer to be stopped or interrupted without losing data and the import process can be continued at any time.\n\n![Import process of an entity](import-process.drawio.svg)\n\nAs soon as a pending entity is available in the database, it can be browsed on the website [^site-pr].\nThe page looks similar to that of an accepted entity, but provides different action buttons at the bottom.\nHere the user can choose whether they want to discard, approve or edit and approve the entity.\n\nOnce an entity gets approved, the server assigns it a BBID and creates an initial entity revision to which the approving user's browser is redirected.\n\n![Display page of a pending work](page-pending-work.png)\n\nWhile a pending entity can be promoted to an accepted entity by a single vote, the pending import will only be deleted if it has been discarded by multiple users.\nThis is done in order to prevent losing pending entities (which have no previous revision that can be restored!) forever by accident.\nAnd usually we do not want to restore discarded entities when we decide to repeat the import process.\n\n[^blog-2018]: Blog post: [GSoC 2018: Developing infrastructure for importing data into BookBrainz](https://blog.metabrainz.org/2018/08/13/gsoc-2018-developing-infrastructure-for-importing-data-into-bookbrainz/)\n\n[^site-pr]: Once the open pull request with the website changes has been merged: [bookbrainz-site#201]\n\n[RabbitMQ]: https://www.rabbitmq.com/\n[bookbrainz-site#201]: https://github.com/metabrainz/bookbrainz-site/pull/201\n\n## Shortcomings of the Current Implementation\n\n### No Support for Relationships Between Pending Entities\n\nA producer reads one external entity at a time and only emits a single parsed entity without any relationships to other entities at that point.\nIt would be the consumer's task to resolve the available names and identifiers of the related entities to existing pending entities or even accepted entities.\nIf the entity resolution does not succeed, the consumer additionally has to create the related pending entities with the minimal information that is currently available.\n\nBut pending entities currently only have aliases, identifiers and basic properties (such as annotation, languages, dates) which are specific per entity type.\nAllowing them to also have relationships could potentially lead to relationships between pending and accepted entities, because BookBrainz uses *bidirectional relationships* (new relationships also have to be added to the respective relationship's target entity).\nIn order to prevent opening that can of worms, relationships are currently stored in the additional freeform data of pending entities.\n\n|                    | **Accepted Entity** (Standard) | **Pending Entity** (Import) |\n| ------------------ | :----------------------------: | :-------------------------: |\n|                    |       ![Accepted entity]       |   ![Pending entity 2018]    |\n| Basic properties   |               ✓                |              ✓              |\n| Aliases            |               ✓                |              ✓              |\n| Identifiers        |               ✓                |              ✓              |\n| Relationships [^1] |               ✓                |              ✗              |\n| BBID               |               ✓                |              ✗              |\n| Revision number    |               ✓                |              ✗              |\n| Freeform data      |               ✗                |              ✓              |\n\nTable: Current features of pending entities\n\n[^1]: Refers to *regular* bidirectional relationships between two entities as well as *implicit* unidirectional relationships (which are used for Author Credits, Publisher lists and the link between an Edition and its Edition Group).\n\nWhile this was an intentional decision to reduce the complexity of the project in 2018 [^proposal-2018], it is not a viable long term solution.\nImporting e.g. a work without at least a relationship to its author saves little work for the users, so we really need to support relationships this time.\n\n[^proposal-2018]: Proposal: [GSoC 2018: Importing data into BookBrainz](https://community.metabrainz.org/t/gsoc-2018-importing-data-into-bookbrainz/367714)\n\n### Performance Issues\n\nIf support for relationships would be the only thing that is currently lacking you might ask yourself why this feature is still not in production in 2023.\nPart of the answer is that the developer of the system was not happy with their own results [^irc-logs-2019] although the project could be considered finished.\n\nAs a consequence, the producer and consumer services, which reside in the `bookbrainz-utils` repository [^importer-code], have never been used in production.\nThe import tables in the database are still empty as of today and therefore it also made no sense to deploy the changes from [bookbrainz-site#201] to the website.\n\nOne major issue was already identified back then:\nThe self-developed [`asyncCluster` module](https://github.com/metabrainz/bookbrainz-utils/tree/master/importer/src/asyncCluster), which is used to spawn multiple parallel producer instances, proved to be neither performant nor really asynchronous.\nIt is based on the Node.js [cluster] API that creates a separate *process* for each instance (which also results in high memory consumption).\nSince 2018 the newer [worker threads] API that only creates separate *threads* has become stable with Node.js v12.\nThe conclusion was to drop the entire `asyncCluster` module in favour of worker threads since the data processing seemed to be rather CPU-intensive than I/O-bound.\n\n[^irc-logs-2019]: Discussion between *bukwurm* and *Mr_Monkey* in `#metabrainz` (IRC Logs: [June 3rd 2019](https://chatlogs.metabrainz.org/brainzbot/metabrainz/msg/4407776/) | [June 4th 2019](https://chatlogs.metabrainz.org/brainzbot/metabrainz/msg/4408831/))\n\n[^importer-code]: Code of the importer software infrastructure: [bookbrainz-utils/importer](https://github.com/metabrainz/bookbrainz-utils/tree/master/importer)\n\n[bookbrainz-site]: https://github.com/metabrainz/bookbrainz-site\n[cluster]: https://nodejs.org/api/cluster.html\n[worker threads]: https://nodejs.org/api/worker_threads.html\n\n### Outdated and Duplicated Code\n\nAlmost five years have passed since the initial implementation of the importer project for GSoC 2018.\nSince BookBrainz has evolved in the meantime, the importer is no longer compatible with the latest database schema.\n\nThe following changes should be necessary to get the importer project into a clean, working state again:\n\n- Entity type names have been changed in 2019 from *Creator* to *Author* and from *Publication* to *Edition Group* in order to make them more consistent. The remaining occurrences in the importer code have to be renamed too.\n\n- Add import tables for *Series* entities (which were introduced by a GSoC project in 2021).\n\n- Update the importer software infrastructure to support *Series* entities and *Author Credits* (which were introduced in 2022).\n\n- Entity [validators] from `bookbrainz-site` (which are used to validate entity editor form data) have been duplicated and adapted for the consumer. Ideally generalized versions of these validators should be moved into [bookbrainz-data-js] and used by both.\n\n- The UI to list recent imports uses its own kind of pagination component, which should be replaced by the pagination component that is used elsewhere (and which was introduced shortly after GSoC 2018).\n\n- The pending website changes [bookbrainz-site#201] have to be rebased onto the current `master` branch, which also involves using the new environment with Webpack and Docker (... and the code should probably be refactored a bit more).\n\n[validators]: https://github.com/metabrainz/bookbrainz-site/tree/master/src/client/entity-editor/validators\n[bookbrainz-data-js]: https://github.com/metabrainz/bookbrainz-data-js\n\n# Project Goals\n\nDuring GSoC I will try to achieve the following goals, only those marked as *stretch goal* are optional, the rest are required to consider this project a success:\n\n1. Update existing infrastructure to be compatible with the latest database schema\n2. Test the infrastructure by finishing the [OpenLibrary producer] (also from GSoC 2018)\n3. Update database schema to support relationships between pending entities\n4. Resolve an entity's external identifiers to a BBID\n5. Create relationships between pending entities\n6. Add a consumer option to update already imported entities when an import process is repeated (with updated source data or an improved producer)\n   1. Pending entities (automatically overwrite old data)\n   2. Accepted entities which have *not* been changed in BookBrainz (update has to be approved again)\n   3. Accepted entities which have been changed (requires merging) [stretch goal]\n7. Create a producer which parses MARC records\n8. Import MARC records from [LOC](#loc) (US Library of Congress)\n9. Import MARC records from [DNB](#dnb) (German National Library) [stretch goal]\n10. Create type definitions for the parsed entity JSON exchange format\n11. Improve the performance by replacing the custom `asyncCluster` implementation\n12. Document relevant new features while developing them and write test cases for critical code\n13. Dockerize the importer infrastructure (producers and consumer) [stretch goal]\n\n[OpenLibrary producer]: https://github.com/metabrainz/bookbrainz-utils/tree/master/importer/src/openLibrary/producer\n\n# Implementation\n\nIn this section I will focus on the weaknesses of the already existing importer infrastructure and how I intend to improve them.\nHence I will not write much about the parts which are already supposed to be working [^outdated-code] and will not propose to change the user-facing components or provide mockups for these.\n\nThe only user-facing change which I intend to do on top is displaying pending entities the same way as accepted entities, only with a special marker next to the entity name.\nIn listings such as the relationship section or entity tables, pending entities will be displayed after the accepted entities.\nIdeally they can also be filtered and sorted in the future, but these are general features which BookBrainz is lacking so far and probably out of this project's scope.\n\n[^outdated-code]: I can not guarantee or test that currently, because the code is outdated and has to be adapted to the latest schema version first.\n\n## Importing Entities With Relationships\n\n### General Considerations and Entity Resolution\n\nIdeally we want to keep relationships between external entities, especially to preserve the authors and publishers of editions and works.\nIn order to achieve this without creating duplicate target entities every time they appear in another relationship, we have to do a lookup of pending (and maybe even accepted) entities at some point.\n\nGenerally, let us keep the specific producer implementations dumb and have all the database lookups on the common consumer's side, because that way the mapping logic has to be implemented/used only in one place and not in every producer implementation.\nAlso, directly accessing the database from a producer would defeat the purpose of having a queue that reduces the database load.\n\nThe process of looking up relationship target entities is what I called **entity resolution** previously.\nIt will take all the available data about relationship target entities from the parsed entity, which includes names, identifiers and possibly other data (such as dates and areas).\nBased on this data, it tries to find a single pending or accepted entity which matches the given data with high certainty.\nIn case of doubt, it will suggest to create a new pending entity using the available data, which can be merged later (if necessary).\n\nOnce we have such an entity resolver, there are two possible approaches to model relationships between pending entities:\n\n1. Store **representations** of the pending entity's relationships as additional freeform data and create proper relationships only when the user accepts the import.\n\n2. Allow pending entities to also have **proper relationships** (and deal with the consequences of having relationships between pending and accepted entities).\n\n### Pending Entities with Relationship Representations\n\nFor this approach, the consumer stores relationship representations in the additional freeform data of the relationship source entity.\nEach representation includes the BB **relationship type** ID and the available data about the target entity, which has no BBID but an **external identifier**.\n\nWhen an import is being accepted, we can use the external identifier to resolve it to a target entity among all pending and accepted entities.\nIf the target entity is also pending, we inform the user that it will also be accepted (on the source entity's preview page).\nIn the rare case that there are multiple entities which have the external identifier, we ask the user to select one or merge the entities.\n\nOnce we have an accepted target entity, we can use its BBID to convert the relationship representation into a real relationship.\nThe resulting new relationship sets will be added to the data of each involved entity.\n\nThis approach avoids accepted entities ever having relationships to pending entities by simply not using bidirectional relationships.\nWhile this would simplify the consumer implementation, the code to promote pending to accepted entities would be complexer.\nWe would also need separate code to display pending relationships instead of reusing the existing code for regular relationships.\nHence we will not follow this approach further.\n\n### Pending Entities With Proper Relationships\n\nCurrently the data of pending entities is stored in the regular `bookbrainz.*_data` tables (e.g. `bookbrainz.edition_data`) that are also used for accepted entities.\nOnly the additional freeform data and other import metadata (such as the source of the data) are stored in the separate `bookbrainz.link_import` table.\n\nIn order to support relationships between two pending entities, we have to assign them BBIDs if we also want to store relationships in the regular tables.\nThis is necessary because the source and target entity columns of the `bookbrainz.relationship` table each contain a BBID.\n\n|                  | **Accepted Entity** (Standard) | **Pending Entity** (Import) |\n| ---------------- | :----------------------------: | :-------------------------: |\n|                  |       ![Accepted entity]       |   ![Pending entity 2023]    |\n| Basic properties |               ✓                |              ✓              |\n| Aliases          |               ✓                |              ✓              |\n| Identifiers      |               ✓                |              ✓              |\n| Relationships    |               ✓                |              ✓              |\n| BBID             |               ✓                |              ✓              |\n| Revision number  |               ✓                |              ✗              |\n| Freeform data    |               ✗                |              ✓              |\n\nTable: Proposed features of pending entities\n\n[Accepted entity]: entity-accepted.drawio.svg\n[Pending entity 2018]: entity-pending-2018.drawio.svg\n[Pending entity 2023]: entity-pending-2023.drawio.svg\n\nAssigning imports a BBID would require a schema change.\nCurrently the SQL schema of the entry table for pending entities (which are linked to their data via `bookbrainz.*_import_header` tables for each entity type) looks as follows:\n\n```sql\nCREATE TABLE IF NOT EXISTS bookbrainz.import (\n\tid SERIAL PRIMARY KEY,\n\ttype bookbrainz.entity_type NOT NULL\n);\n```\n\nWhen we alter the `id` column (and all foreign columns which refer to it) to be a UUID column, the `bookbrainz.import` table would basically be identical to the `bookbrainz.entity` table.\nSo I would suggest to combine both tables and use an additional column to distinguish imports and accepted entities:\n\n```sql\nCREATE TABLE bookbrainz.entity (\n\tbbid UUID PRIMARY KEY DEFAULT public.uuid_generate_v4(),\n\ttype bookbrainz.entity_type NOT NULL,\n\tis_import BOOLEAN NOT NULL DEFAULT FALSE -- new flag\n);\n```\n\nCombining the tables (and dropping the `bookbrainz.import` table) has two advantages:\n\n1. We no longer have to move pending entities into the `bookbrainz.entity` table once they have been accepted, we can simply update the new `is_import` flag.\n\n2. The `source_bbid` and `target_bbid` columns of the `bookbrainz.relationship` table have a foreign key constraint to the `bbid` column of `bookbrainz.entity`.\n\tHaving a separate table for imports would have violated that constraint.\n\tAlternatively we would have needed a new flag for both relationship columns in order to know whether the BBID belongs to an accepted entity or to a pending import.\n\nWhen the consumer now handles a parsed work which has an author, it can add a relationship to the created pending work that points to the resolved author entity.\nThis works regardless of whether the author is an accepted or a pending entity as both have a BBID now which can be looked up in the `bookbrainz.entity` table.\n\n![Relationship target resolution of a new pending work](pending-rel-resolution.drawio.svg)\n\nSince relationships are (usually) bidirectional, they have to be added to the relationship sets of both involved entities.\nWhile it is unproblematic to update the relationship sets of pending entities, changes to an accepted entity's relationship set would result in a new entity revision.\n\n**Bidirectional relationships** between pending and accepted entities cause some problems:\n\n1. Accepted entities might become cluttered with lots of relationships to pending entities of doubtful quality.\n\n2. We do not want to create new revisions of already accepted entities every time a new related pending entity (e.g. a new book by the same author) is imported.\n\nThe first problem can be considered a feature as it makes pending entities more discoverable when they are visible in the relationship section of accepted entities' pages.\nAfter all, we want our users to approve or discard imports related which are related to entities they are familiar with.\nIdeally we would provide a way to hide relationships to pending entities, of course.\n\nTo solve the second problem, we only create **unidirectional relationships** from pending entities to accepted entities initially, i.e. updating the accepted target entities' relationship sets will be delayed.\nThere are multiple times during the import flow when we can upgrade these unidirectional relationships to full bidirectional relationships:\n\n1. When the pending entity becomes an accepted entity.\n\tThis would be the simplest solution which would also avoid the first problem, but since we consider this to be a feature, we will not do that.\n\n2. When an import process run has finished and the consumer is done with importing all the parsed entities in the queue.\n\tThis way we will create one new entity revision for each accepted entity at most.\n\tFor this compromise we have to keep track of the relationship IDs which have to be added to (or removed from) the relationship set of each accepted entity which is affected.\n\n![Upgrading the pending relationships creates a new author revision](pending-rel-upgrade.drawio.svg)\n\n## Repeating the Import Process\n\nSince we do not want to import duplicates every time we rerun an import script (with updated dumps or after code changes, e.g. parser improvements), the pending entities need a unique identifier which is also available in the external entity data.\nUsing this identifier we will know which external entity already exists as a pending entity and to which entity it corresponds in case we want to update the data.\n\n### Identifying Already Imported Entities\n\nAlready imported external entities can be identified by the composite primary key of `bookbrainz.link_import`.\nBecause the naming of these columns is suboptimal and we already have to do a schema change change for these empty tables, I also suggest alternative names for them:\n\n- External source/origin of the data (e.g. OpenLibrary): `origin_source_id INT NOT NULL`\n\n\t→ origin and source are somewhat redundant terms, rename the column to `external_source_id` (and the referenced table to `external_source`)\n\n- External identifier (e.g. OpenLibrary ID): `origin_id TEXT NOT NULL`\n\n\t→ naming pattern `*_id` is usually used for foreign keys, rename to `external_identifier` (or `remote_identifier` or simply `identifier`)\n\nUsing this information, the **consumer** can easily identify parsed entities which have already been imported.\nThe current implementation simply skips these because they violate the condition that a primary key has to be unique.\n\nTaking the proposed relationships between pending entities into account, the consumer will now also create pending entities for missing relationship targets.\nSince these entities contain only a minimal amount of data, we want them to be updated as soon as the complete parsed entity comes out of the queue.\nWhen we are doing a full import of an external source, we can assume that it still contains the complete desired entity.\nThis means that updating pending imports is a desired behavior and we should not skip duplicates at the consumer.\n\nIf we want to do that, duplicates should be identified as early as possible to avoid wasting processing time, i.e. by the **producer** at the external entity level.\nSince we do not want the producer to permanently ask the database whether the currently processed entity is already pending, and as the entity could also still be queued, there is only one reliable way to do the duplicate detection:\nThe producer has to load the list of external identifiers, which have already been processed previously, during startup.\nThis should be sufficient as we can assume that there are no duplicates within in the source dump file itself.\n\n### Updating Pending Entities\n\nAs discussed in the previous section, we want the consumer to update at least the pending entities.\nThese pending imports can automatically be overwritten with the new data from the queue, except for the following cases:\n\n1. The import has already been *accepted* (`bookbrainz.link_import.entity_id` is not `NULL`).\n\n2. The pending entity has already been *discarded* (`bookbrainz.link_import.import_id` is `NULL`) and we do not want this data to be imported again.\n\n\t→ `bookbrainz.link_import` keeps track of discarded imports while these are deleted from `bookbrainz.import` (respectively `bookbrainz.entity` after the proposed schema change)\n\nOverwriting with new data is as trivial as it sounds, except for relationships (of course).\nHere we have to compare the pending entity's relationship set to the parsed entity's relationship data and make as little changes as possible (because every change also affects the respective target entity's relationship set).\n\n### Updating Accepted Entities\n\nNow that we can update pending entities, we want to try updating already accepted imports with external data changes too.\nTherefore we need a way to detect changes between...\n\n- freshly parsed entities and their pending entity counterparts to see if the parsed entity contains **relevant updates** (compared to its last import)\n\n- accepted entities and their pending entity counterparts to see if the accepted entity has been **changed in BookBrainz** since the import\n\nBoth of these checks require keeping a reference to the pending entity in `import_id`, even after it has already been accepted [^no-redundancy].\n\nIn order to **detect changes in BookBrainz** since the last import, we can simply compare the data IDs of the accepted entity's current revision and its pending entity equivalent.\nIf they are identical, no further revisions have been made or the accepted entity has been reverted to the state of the last import.\n\n**Detecting relevant updates** is harder, because we have to convert the parsed entity to a new (temporary) pending entity to be able to compare it to the old pending entity.\nWhile comparing we have to ignore the internal database IDs of pending entities and only pay attention to differences between the actual values of the old and the new one.\nThis will also be important later when we want to create the data with the necessary updates while keeping the amount of changes to the already accepted entity minimal.\n\nIf there are no relevant changes we will not save the new pending entity.\nOtherwise we assign a BBID to the new pending entity with which we will overwrite `import_id` while the BBID of the previously accepted entity will still be kept in `entity_id`.\nThe BBID of the new pending entity will be shown on the page of the accepted entity and suggests the user to review it and approve the changes (again).\n\nWhen the accepted entity has not been changed in the meantime, **approving the new pending entity** as a new revision works as follows:\n\n1. The new revision will simply replace the accepted entities data ID by the one of the new pending entity.\n\n2. The same happens for the original pending import, it should always point to the latest accepted data ID (since we do not want to store the history of imports).\n\n3. The new temporary import entry can now be discarded and its BBID in `import_id` will be overwritten with the accepted entity's BBID again.\n\tPending updates can easily be recognized by `import_id` and `entity_id` containing different values this way.\n\n4. Unfortunately relationship changes for pending entities also have to be reflected at the respective target entities.\n\nIf the **accepted entity has also been changed** in the meantime, it gets more complicated and we have to merge the new pending entity into the accepted one on approval.\nThe `bookbrainz.link_import` table will get a new boolean column to decide whether we can directly update (as described above) or whether we need to present the accepting user the merge tool first.\nIdeally we accept the duplicate entity (without creating a revision) and immediately merge it without having to setup a redirect for the throw-away BBID (which would be merged in its second revision).\n\nSince the merging feature might be quite complex, it is unclear whether it can be achieved during this project.\n\n[^no-redundancy]: This causes no redundant data, as both the accepted entity and the pending entity will point to the same `bookbrainz.*_data` column (for the initial revision).\nCurrently imports (pending entity and associated data) are discarded once they are accepted, but that has to be changed now.\n\n## Parsing MARC records\n\nThe original MARC (MAchine Readable Cataloging) standard was developed in the 1960s and was soon followed by many regional versions.\nToday MARC 21 is the predominant version, which was created in 1999 in order to harmonize and replace the existing versions [^marc-wiki].\n\nSince MARC records are a very old standard, their fields use numeric tags instead of descriptive property names to prevent wasting expensive storage space.\nMARC 21 records are available in binary (MARC8), plain text (UTF8) and XML file formats, for which there are existing parser implementations for Node.js such as [marcjs] (which uses the [stream] API).\n\nAdditionally the producer implementation would have to map the numeric tags (which additionally have single-letter subfield codes) to BookBrainz entity property types and relationship types.\nThis task surely will involve reading the available MARC documentations thoroughly to understand the meaning of the various fields and to review lots of real records to confirm the acquired knowledge.\n\n**Simple Example**: The field `300 $a 42 p. :` (tag `300`, subfield code `a`) of a bibliographical MARC record would be mapped to the `pages` property (value: `42`) of a BookBrainz edition.\n\nAs you can see in this example, parsing MARC records often involves trimming trailing separators (here: colon) and extracting data (here: number) from formatted text.\nThis should be implemented in a way that is not too restrictive in its assumptions about the formatting or could even strip relevant parts of the data.\n\nWhile the uncompressed XML files are much larger than their plain text counterparts they are more readable for humans, which is especially useful while testing and collecting good sample data during development.\nOnce the importer is ready, we can still decide to use the plain text or binary format to save storage, but that might not matter if we implement streaming of compressed dump files because the compressed versions have roughly the same size [^test-data].\n\n[^marc-wiki]: Source: \u003chttps://en.wikipedia.org/wiki/MARC_standards\u003e\n\n[^test-data]: Test data set: 24.1 MB / 7.1 MB uncompressed XML / binary file is only 1.8 MB / 1.7 MB gzipped\n\n[marcjs]: https://github.com/fredericd/marcjs\n[stream]: https://nodejs.org/api/stream.html\n\n# Datasets\n\nOnce we have a working parser for MARC records, we should be able to import entities from a variety of (national) libraries which use the MARC standard to catalog their collections.\n\nI have chosen the LOC, which provides a large collection of MARC records and is also the inventor of the standard, and the DNB, because they offer all of their data for free and I have already used it in a personal project.\n\n## LOC\n\nThe United States Library of Congress (LOC) aims to be the largest library in the world and includes publications in many languages, half of them in English [^loc-facts].\n\nThey provide [full dumps](https://loc.gov/cds/products/marcDist.php) in MARC 21 and MARC 21 XML format.\nUnfortunately the latest open-access dumps are from 2016 and a paid subscription is required to get later updates.\nUsers are encouraged to use the data for development purposes and name the LOC as source of the original data.\n\nAs of writing this proposal, the latest available dumps are from 2016:\n\n- **Total**: ~25 million records (as of 2014)\n- **Books All** (Editions etc.): 2016 Combined, ~2 GB zipped\n- **Name Authorities** (Authors, Publishers etc.): 2014 Combined, ~1.2 GB zipped (the dump from 2016 is split into 40 instead of 37 separate parts as in 2014)\n\n[^loc-facts]: Source: [Fascinating Facts | About the Library | Library of Congress](https://www.loc.gov/about/fascinating-facts/)\n\n## DNB\n\nThe German National Library (DNB = \"Deutsche Nationalbibliothek\") aims to include all publications in German language which have been issued since 1913 [^dnb-mandate].\n\nThey provide [full dumps](https://www.dnb.de/dumps) as well as [weekly updates](https://www.dnb.de/metadataupdate) in MARC 21, MARC 21 XML and various other formats.\nAdditionally they also offer small [test data sets]( https://data.dnb.de/testdat).\nAll of the mentioned data is available free of charge for general re-use under Creative Commons Zero terms (CC0 1.0).\n\nAs of writing this proposal, the latest available dumps are from February 2023:\n\n- **Bibliographic data** (Editions etc.): Full copy, ~26.7 million records, ~8 GB zipped\n- **Integrated Authority Files** (Authors, Publishers, Works etc.): Full copy, ~9.3 million records, ~1.8 GB zipped\n\n[^dnb-mandate]: Source: [DNB - Our collection mandate](https://www.dnb.de/sammelauftrag)\n\n# Timeline\n\n- **May 04 - May 28**: *Community Bonding Period*\n  - Read documentation about MARC records\n  - Have a closer look at the existing importer code\n  - Continue to work on improving the type definitions of our data models\n\n- **May 29 - June 04**: Coding Period, Week 1\n  - Generalize the entity data [validators] and move them into `bookbrainz-data-js`\n  - Update producer/consumer services to be compatible with the latest database schema\n\n- **June 05 - June 11**: Coding Period, Week 2\n  - Rebase [bookbrainz-site#201] and get it running using the current development environment with Docker and Webpack\n\n- **June 12 - June 18**: Coding Period, Week 3\n  - Verify that the current importer infrastructure (including the UI) is working as expected before proceeding\n  - Update the importer database schema to support series entities and relationships\n\n- **June 19 - June 25**: Coding Period, Week 4\n  - Resolve an entity's external identifiers to a BBID\n  - Create type definitions for the parsed entity JSON exchange format etc.\n\n- **June 26 - July 02**: Coding Period, Week 5\n  - Support creating unidirectional relationships (which are used for Author Credits and to link an edition to its publishers)\n  - Test the new infrastructure by adding these features to the [OpenLibrary producer]\n\n- **July 03 - July 09**: Coding Period, Week 6 (buffer week)\n\n- **July 10 - July 16**: Coding Period, Week 7\n  - Improve the performance by replacing the custom `asyncCluster` implementation\n\n- **July 17 - July 23**: Coding Period, Week 8\n  - Detect already imported entities at the producer's side (now that we finished our performance tests for which this feature would have been unfortunate)\n\n- **July 24 - July 30**: Coding Period, Week 9\n  - Update pending entities when the import process is repeated\n\n- **July 31 - August 06**: *Midterm evaluations*\n\n- **August 07 - August 13**: Coding Period, Week 10\n  - Create relationships between pending entities (now that we can create incomplete pending entities which will be updated later)\n\n- **August 14 - August 20**: Coding Period, Week 11 (buffer week)\n\n- **August 21 - August 27**: Coding Period, Week 12 (end of the standard coding period)\n  - Update website to handle pending relationships (in entity editor state and for display)\n  - Ask the community for good test cases from LOC / DNB catalogs\n\n- **September 04 - September 10**: Coding Period, Week 13\n  - Create a producer which parses MARC records\n  - Analyze MARC records test data (from LOC and DNB) and write test cases\n\n- **September 11 - September 17**: Coding Period, Week 14\n  - Fine-tune the MARC records parser\n  - Directly stream gzipped data dumps instead of decompressing the input files first\n\n- **September 18 - September 25**: Coding Period, Week 15\n  - Perform an extensive import test run with dumps from LOC\n  - Also perform an import with dumps from DNB and OpenLibrary (optional)\n\n- **September 26 - October 01**: Coding Period, Week 16\n  - Create updates for already accepted entities which have not been changed in BB\n  - Try to merge updates for already accepted entities which have been changed\n\n- **October 02 - October 08**: Coding Period, Week 17 (buffer week)\n\n- **October 09 - October 15**: Coding Period, Week 18\n  - Finish all started tasks and ensure that everything is in a working state\n  - Write a blog post about the project\n\n- **November 6 (18:00 UTC)**: *Final Submission and Final Evaluation*\n\nThe buffer weeks will be used for catching up, otherwise I will already begin with the tasks for the next week.\n\n# About Me\n\n## Biographical Information\n\nMy name is David Kellner and I am a MSc student of *Electrical Engineering and Information Technology* from Germany.\nWhile I am specialized in intelligent signal processing (machine learning, computer architecture, wireless communications) and automation technology, I have also attended software engineering lectures.\nDuring my bachelor studies, I had programming courses in C/C++ (with which I was already familiar) and also did other projects in Java, Python and Matlab.\n\nHowever, most of my coding skills which are relevant to manage this project have been obtained by self-study.\nI am experienced with HTML, CSS, JavaScript and SQL, which I have used for multiple of my personal projects over the last ten years.\nFor my bachelor thesis I developed a Node.js web application with the Express.js framework which is also used by BookBrainz, so I am also familiar with that.\nAbout two years ago I have started to learn TypeScript as I had noticed that I was writing lots of JSDoc type annotations since I am using VS code as my IDE.\nMy love for regular expressions might also prove useful when it comes to parsing records from external data sources.\n\nWhile I was still at school (which unfortunately did not offer real IT classes), I have given many different (programming) languages a try:\nFrom C++, JavaScript (back when HTML5 audio/video support was the new thing) and PHP (for my first website backends) over SQL, Java (for Android) and Python (for Raspberry Pi projects) to Assembler (for microcontrollers and the Game Boy), I've experimented with many areas of software development.\nTherefore I am very confident that I will quickly acquire any skills that I might notice myself to be lacking during this project (e.g. advanced knowledge about React, Docker and threading in Node.js).\n\nI am a member of the MetaBrainz community since 2019, when I first started editing on MusicBrainz.\nSince then I have contributed to MetaBrainz projects in various aspects:\nI have reported issues, helped translating the MB website and Picard, patched userscripts and wrote my own, submitted a few small pull requests and lurked in IRC meetings.\n\nThrough MetaBrainz I have learned about GSoC in 2020 and considered applying for it myself, but never had enough time during the summer so far.\nThis year that has finally changed and I have decided to take my (probably last) chance to participate.\n\n## Other Information\n\n\u003e Tell us about the computer(s) you have available for working on your SoC project!\n\nI have a custom built desktop PC with AMD Ryzen 5 2600 CPU, Nvidia GeForce GTX 1660 Ti GPU, 16 GB of RAM and 1 TB SSD which is running Manjaro Linux (and Windows 10).\n\n\u003e When did you first start programming?\n\nI did my first attempts at writing code when I was 13.\nNaive as I was, I decided to learn C++ because according to my research this was the programming language most of my favorite programs were written in.\n\n\u003e What type of music do you listen to? (Please list a series of MBIDs as examples.)\n\nMe owning multiple of their albums on vinyl is usually a good indicator that I like the artist's music.\nSo I definitely love 70s prog rock like *Pink Floyd* ([83d91898-7763-47d7-b03b-b92132375c47](https://musicbrainz.org/artist/83d91898-7763-47d7-b03b-b92132375c47)) and *Genesis* ([8e3fcd7d-bda1-4ca0-b987-b8528d2ee74e](https://musicbrainz.org/artist/8e3fcd7d-bda1-4ca0-b987-b8528d2ee74e)).\n\nOther artists whose music I regularly listen to include *Led Zeppelin* ([678d88b2-87b0-403b-b63d-5da7465aecc3](https://musicbrainz.org/artist/678d88b2-87b0-403b-b63d-5da7465aecc3)), *Billie Eilish* ([f4abc0b5-3f7a-4eff-8f78-ac078dbce533](https://musicbrainz.org/artist/f4abc0b5-3f7a-4eff-8f78-ac078dbce533)) and *Carsten Bohn's Bandstand* ([cc9bddd9-e4ad-4591-9d85-633deb35f33c](https://musicbrainz.org/artist/cc9bddd9-e4ad-4591-9d85-633deb35f33c)).\n\nSince I'm also a big movie fan and collector, of course I also listen to soundtrack albums by artists like *Hans Zimmer* ([e6de1f3b-6484-491c-88dd-6d619f142abc](https://musicbrainz.org/artist/e6de1f3b-6484-491c-88dd-6d619f142abc)) and *Ennio Morricone* ([a16e47f5-aa54-47fe-87e4-bb8af91a9fdd](https://musicbrainz.org/artist/a16e47f5-aa54-47fe-87e4-bb8af91a9fdd)).\n\n\u003e What type of books do you read? (Please list a series of BBIDs as examples.)\n\nMy all time favorites probably are the fantasy series *A Song of Ice and Fire* ([676bb144-ff07-4e79-aef1-51788ebc251a](https://bookbrainz.org/series/676bb144-ff07-4e79-aef1-51788ebc251a)) and *Harry Potter* ([e6f48cbd-26de-4c2e-a24a-29892f9eb3be](https://bookbrainz.org/series/e6f48cbd-26de-4c2e-a24a-29892f9eb3be)), of which I own all the books in German and English -- not to speak of the audio books and movie/TV adaptions.\n\nAdditionally I read thrillers such as Thomas Harris' *Red Dragon* ([df62f3a9-ffdb-4089-a867-2f2ad5c430af](https://bookbrainz.org/work/df62f3a9-ffdb-4089-a867-2f2ad5c430af)) or Daniel Suarez' *Daemon* ([2a4567cd-8e49-4b84-a2b5-b1e8718e15d7](https://bookbrainz.org/work/2a4567cd-8e49-4b84-a2b5-b1e8718e15d7)).\n\nOccasionally I also prefer less serious books like those by *Marc-Uwe Kling* ([c7152cbd-7db9-4cd5-95d6-1db9eb709992](https://bookbrainz.org/author/c7152cbd-7db9-4cd5-95d6-1db9eb709992)) and *Rita Falk* ([51033a32-a79a-466b-a74d-eab7213024d2](https://bookbrainz.org/author/51033a32-a79a-466b-a74d-eab7213024d2)).\n\nAnd since my childhood I'm a big fan of the Uncle Scrooge comics by *Carl Barks* ([28e0eba5-6c33-411a-baf7-a04a0611b418](https://bookbrainz.org/author/28e0eba5-6c33-411a-baf7-a04a0611b418)) and of Don Rosa's comic biography *The Life and Times of Scrooge McDuck* ([83cbfb59-328d-4211-a2d0-054e2460738b](https://bookbrainz.org/edition/83cbfb59-328d-4211-a2d0-054e2460738b)).\n\n\u003e What aspects of the project you're applying for (e.g. MusicBrainz, AcousticBrainz, etc.) interest you the most?\n\nThe git-like revision system and the possibility to undo wrong edits are definitely the most interesting feature of BookBrainz compared to MusicBrainz.\nSince I am also a big fan of MusicBrainz, I see a lot of potential in an open database which is following the Brainz philosophy and am looking forward to use it to manage my book collection soon.\n\n\u003e Have you ever used MusicBrainz to tag your files?\n\nOf course I have, but admittedly I've still only tagged a few test releases so far.\nOver three years after joining the MusicBrainz community, I'm still pushing the art of writing the perfect Picard tagger script for my needs to the extreme...\n\n\u003e Have you contributed to other Open Source projects? If so, which projects and can we see some of your code?\n\nSo far I've almost exclusively contributed to MetaBrainz projects, starting with a few minor bug fixes for MBS and features for Picard in 2020.\nYou can [find my pull requests on GitHub](https://github.com/pulls?q=author%3Akellnerd+user%3Ametabrainz), including my smaller and medium size contributions to BookBrainz this year.\n\n\u003e What sorts of programming projects have you done on your own time?\n\nBesides writing [userscripts for MusicBrainz](https://github.com/kellnerd/musicbrainz-scripts), I often create scripts (Bash, Node.js, Python) to automate tasks related to the maintenance of my personal audio and video file collections, e.g. [creating backups](https://github.com/kellnerd/backup) and tagging videos.\n\nSeveral years ago I've also written a JS/PHP/SQL application to manage my book collection.\nThe only remaining features that are still ahead of BookBrainz are cover art and seeding with data from the [DNB API](https://www.dnb.de/EN/sru) (MARC 21 XML records!), as collections and series have since been implemented in BookBrainz.\nAfter all these years I've learned a lot and consider my old code unmaintainable, so I'm looking forward to this GSoC project and finally migrate all the data from my database into BookBrainz!\n\n\u003e How much time do you have available, and how would you plan to use it?\n\nSince I will probably start writing my master thesis this summer, I plan to have about 20 hours per week available for GSoC.\nThere might be weeks during which I will be less available, but I think I can compensate the time during other weeks or over the weekend.\n\nWhile I'm focusing on the project which I've proposed above, I will also work on improving the TypeScript typings of the ORM and other minor improvements in all BookBrainz repositories which make the implementation of my project easier.\n\nMaybe I will even work on the replacement of the unmaintained Bookshelf.js with an alternative ORM with better TypeScript support ([BB-729](https://tickets.metabrainz.org/browse/BB-729)).\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkellnerd%2Fgsoc-proposal","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkellnerd%2Fgsoc-proposal","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkellnerd%2Fgsoc-proposal/lists"}