{"id":21699846,"url":"https://github.com/jonasraoni/xml-postgresql-import-template","last_synced_at":"2026-05-18T17:40:10.566Z","repository":{"id":124279324,"uuid":"143774343","full_name":"jonasraoni/xml-postgresql-import-template","owner":"jonasraoni","description":"High performance XML to PostgreSQL import template in C#","archived":false,"fork":false,"pushed_at":"2018-08-06T19:36:56.000Z","size":18,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2024-11-25T20:12:06.151Z","etag":null,"topics":["asynchronous","csharp","import","parallel","postgresql","template","xml"],"latest_commit_sha":null,"homepage":null,"language":"C#","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jonasraoni.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-08-06T19:36:47.000Z","updated_at":"2018-08-08T21:37:21.000Z","dependencies_parsed_at":null,"dependency_job_id":"dd9ee1ef-625d-41c3-b0e7-c6f2a43bfa44","html_url":"https://github.com/jonasraoni/xml-postgresql-import-template","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jonasraoni%2Fxml-postgresql-import-template","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jonasraoni%2Fxml-postgresql-import-template/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jonasraoni%2Fxml-postgresql-import-template/releases","manifests_url":"https://repos.ecosyste.ms/api/
v1/hosts/GitHub/repositories/jonasraoni%2Fxml-postgresql-import-template/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jonasraoni","download_url":"https://codeload.github.com/jonasraoni/xml-postgresql-import-template/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":235593359,"owners_count":19015139,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["asynchronous","csharp","import","parallel","postgresql","template","xml"],"created_at":"2024-11-25T20:11:45.524Z","updated_at":"2026-05-18T17:40:05.534Z","avatar_url":"https://github.com/jonasraoni.png","language":"C#","funding_links":[],"categories":[],"sub_categories":[],"readme":"# High performance XML \u003e PostgreSQL import template in C# \n\nI've written this to import a huge amount of data from XMLs into a PostgreSQL database and as I had the opportunity to try out several techniques, I've decided to keep it roughly documented here to reuse the ideas =)\n\n# Design\nThe application is designed in a \"producer \u003c=\u003e storage \u003c=\u003e consumer\" fashion. The producer starts out by taking an empty bag from the storage and then, asynchronously and concurrently, extracts records from the XML into it. 
Once the bag is full, it is sent to the storage, the first consumer on the waiting line is notified, takes the bag, inserts the records into the database and sends the bag back to the storage for recycling.\n\nThere may be N producers/consumers; they run in parallel and share a single storage instance.\n\n## Storage\nA generic concurrent queue, featuring recycling of buffers and a waiting line, which is done by blocking the calling thread. It's used to keep batches of records (represented by `Member` instances inherited from a Dictionary).\n\nIn order to reduce garbage collection calls I've tried to transform the `Member` class into a `struct` (which turns the class into a fixed size value type in C#), plain object and also to reuse the instances, but the improvements didn't pay off in my environment or brought too much complexity.\n\n## Consumer\nMoves data from the storage, while it's open, into the database; once it's done, it pauses until there's a new batch available. In case it fails, the batch is returned to the storage, so another consumer can handle it. The consumers also work asynchronously and concurrently.\n\n### Database Techniques\n- Batches: Data is inserted into the database in batches; on my system a batch of 10K items worked well. At first I tried to fill a temporary table, large enough to be kept in memory, with a prepared insert statement made of several `values` (e.g. 
VALUES (...), (...), ...), it was fast, but not faster than using the binary `COPY` format.\n- Storage: A non-indexed temporary table is used to save the batch (`temp_buffers` is set to batch-size/500 MB to accommodate the sample data in memory safely).\n- Transactions: I was torn between wrapping the whole import in one all-or-nothing transaction and issuing one transaction per batch. I didn't see much improvement with the former, so every batch is covered with its own transaction.\n- Indexing: The sample table has no primary key nor index, so there's an argument to create a unique key, which is a big performance winner.\n- Upsert: With an indexed table I'm able to use the new \"upsert\" syntax. Without it, I could use an UPDATE followed by an INSERT, which has a longer syntax and raises the problem of comparing NULLs (I've tried to use the new `fieldA IS NOT DISTINCT FROM fieldB` syntax, which checks for nulls, but it's ultimately slow; it looks like it doesn't use indexes).\n- Validation: Members are pre-validated locally to avoid spending database resources on invalid items.\n- \"Hash\" column: I was going to add a `hash` column to update the record only if the hash of its data was changed, but for the sample table it isn't worth it, as there's just one field to update.\n\n## Producer\nExtracts batches of data from the XML into the storage and pauses once the storage is full.\n\nTo work in parallel each producer has a starting point in the XML and they avoid collisions by skipping records, I just implemented it this way out of curiosity... 
But IF parsing XML was the bottleneck, maybe a RAID structure with several workers plus reverse parsers (starting to read from the end of the XML) could improve the performance.\n\n### XML Parser\nI wrote a generic XML finder using the standard SAX library of C#, which already has some integrity checks bundled.\nI thought about writing it using iterators `foreach(person in find(\"/person\"))`, but as it was slow I decided to go on with another approach... What came to mind was a functional `person.find(\"person\", () =\u003e find(\"first-name\"))` and a standard `person = find(\"person\")` =\u003e `person.find(\"first-name\")` way. I've implemented both ways and according to the profiler their performance was the same for the use case... In the end I decided to stay with what I consider the simplest: `context = find(\"person\")` =\u003e `find(\"first-name\", context)`, a kind of \"string.indexOf\".\n\nThe `find` method uses a query based on XPath (I've just added two types, queries based on the root `/node` and anywhere `person/first-name`) and drills down until the end of the file when no `context` is available. When a match is found, the current depth is returned, which can be used as a context to query children nodes.\n\n# Running\nThere's a sample XML file and a `CREATE TABLE` statement in the [/data](data) folder.\n\nThe application compiles smoothly on Windows and Linux (Mono [https://www.mono-project.com/] required).\nAs there's an official Docker image with Mono, below is a snippet to build and run it. 
It just requires sharing the application folder and setting the required arguments:\n\n```bash\ndocker run -it --rm -v **APPLICATION FOLDER**:/home mono:latest bash -c \"cd /home \\\n\u0026\u0026 nuget restore \\\n\u0026\u0026 msbuild /P:Configuration=Release \\\n\u0026\u0026 mono bin/Release/Synchronizer.exe user **USER** pass **PASSWORD** host **HOST** db **DATABASE** path /home/data/UPDATE.xml\"\n```\n\n## Parameters\nThe parameters for the executable are in the format `name value[ name value...]`:\n- host (localhost): Database host\n- port (5432): Database port\n- user: Database username\n- pass: Database password\n- db: Database name\n- path: Path of the \"update-file.xml\"\n- batch (10000): The amount of records in the batch\n- buffer (4): Amount of batches that will be kept in the storage\n- index (true): Whether to create a unique index\n- upsert (true): Whether to use the upsert syntax\n- maxlength (255): Max length of data that the Parser will attempt to read from a text node\n- timeout (120): Database communication timeout in seconds\n- p-workers (2): Amount of producer workers\n- c-workers (4): Amount of consumer workers\n\n# Considerations\n- Wherever possible, I used asynchronous operations/thread blocking to allow the CPUs to process other things.\n- The SQL matching is solely handled by the operator \"=\", there's no collation/case sensitivity enforcement.\n- There's a small defensive check to avoid memory exhaustion when reading data (the default for this sample is 255 characters and can be set to unlimited).\n- The default values for the arguments are based on my system (6-core processor + old SSD disk).","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjonasraoni%2Fxml-postgresql-import-template","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjonasraoni%2Fxml-postgresql-import-template","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjonasraoni%2Fxml-postgresql-import-template/lists"}