{"id":16196403,"url":"https://github.com/dspinellis/alexandria3k","last_synced_at":"2025-04-04T13:13:48.851Z","repository":{"id":64342907,"uuid":"559926599","full_name":"dspinellis/alexandria3k","owner":"dspinellis","description":"Local relational access to openly-available publication data sets","archived":false,"fork":false,"pushed_at":"2024-10-09T16:33:08.000Z","size":2467,"stargazers_count":81,"open_issues_count":3,"forks_count":14,"subscribers_count":4,"default_branch":"main","last_synced_at":"2024-10-11T08:47:30.061Z","etag":null,"topics":["bibliometric-analysis","crossref","data-science","orcid","scientometrics"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dspinellis.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-10-31T11:47:38.000Z","updated_at":"2024-10-09T16:32:36.000Z","dependencies_parsed_at":"2023-02-18T16:45:53.782Z","dependency_job_id":"731d5854-ee76-443a-90db-930ef75e3418","html_url":"https://github.com/dspinellis/alexandria3k","commit_stats":null,"previous_names":[],"tags_count":28,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dspinellis%2Falexandria3k","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dspinellis%2Falexandria3k/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dspinellis%2Falexandria3k/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dspinellis%2Falexandria
3k/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dspinellis","download_url":"https://codeload.github.com/dspinellis/alexandria3k/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247182349,"owners_count":20897380,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bibliometric-analysis","crossref","data-science","orcid","scientometrics"],"created_at":"2024-10-10T08:47:33.203Z","updated_at":"2025-04-04T13:13:48.844Z","avatar_url":"https://github.com/dspinellis.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![Alexandria3k CI](https://github.com/dspinellis/alexandria3k/actions/workflows/ci.yml/badge.svg)](https://github.com/dspinellis/alexandria3k/actions/workflows/ci.yml)\n\n## Alexandria3k\n\n\u003c!-- INTRO-BEGIN --\u003e\n\nThe _alexandria3k_ package supplies a library and a command-line tool\nproviding fast and space-efficient relational query access to the following\nlarge scientific publication open data sets.\nData are decompressed on the fly, thus allowing the package's use even on\nstorage-restricted laptops.\nThe _alexandria3k_ package supports the following large data sets.\n\n* [Crossref](https://www.nature.com/articles/d41586-022-02926-y)\n  (184 GiB compressed,\n  1.9 TiB uncompressed — as of March 2025).\n  This contains publication metadata from all major international publishers.\n  The Crossref data set is split into about 33 thousand files.\n  Each file contains JSON data for 5000 publications (works).\n  In total, Crossref 
contains data for 167 million works,\n  35 million abstracts, 465 million associated work authors,\n  and 2.5 billion references.\n\u003c!--. gzip -l * | awk '{s += $2}END{print s, s / 1024 / 1024 / 1024 / 1024}'\n 2081831841198 1.89342 --\u003e\n\n* [PubMed](https://pubmed.ncbi.nlm.nih.gov/)\n  (47 GiB compressed, 707 GiB uncompressed — as of April 2025).\n  This comprises more than 36 million citations\n  for biomedical literature from\n  [MEDLINE](https://www.nlm.nih.gov/medline/medline_overview.html),\n  life science journals, and online books,\n  with rich domain-specific metadata,\n  such as [MeSH](https://www.nlm.nih.gov/mesh/meshhome.html) indexing,\n  funding, genetic, and chemical details.\n\u003c!--. gzip -l * | awk '{s += $2}END{print s, s / 1024 / 1024 / 1024 }' --\u003e\n\n* [ORCID summary data set](https://support.orcid.org/hc/en-us/articles/360006897394-How-do-I-get-the-public-data-file-)\n  (37 GiB compressed, 651 GiB uncompressed — as of October 2024).\n  This contains about 22 million author details records.\n\u003c!-- tar tzvf ORCID_2024_10_summaries.tar.gz | wc -l --\u003e\n\n* [DataCite](https://datacite.org/)\n  (24 GiB compressed, 347 GiB uncompressed — as of 2024).\n  This comprises research outputs and resources,\n  such as data, pre-prints, images, and samples,\n  containing about 50 million work entries.\n\n* [United States Patent Office issued patents](https://bulkdata.uspto.gov/)\n  (12 GiB compressed, 128 GiB uncompressed — as of January 2025).\n  This  contains about 5.4 million records.\n\u003c!-- find . 
-name \\*.zip | xargs -n 1 unzip -v | awk '/files$/{ s+= $1}END{print s, s / 1024 / 1024 / 1024}' --\u003e\n\nFurther supported data sets include\nfunder bodies,\njournal names,\nopen access journals,\nand research organizations.\n\nThe _alexandria3k_ package installation contains all elements required\nto run it.\nIt does not require the installation, configuration, and maintenance\nof a third-party relational or graph database.\nIt can therefore be used out-of-the-box for performing reproducible\npublication research on the desktop.\n\nDatabases populated with _alexandria3k_ can be used by generative AI\napplications through the\n[Model Context Protocol](https://modelcontextprotocol.io/) and its\n[SQLite](https://github.com/modelcontextprotocol/servers/blob/main/src/sqlite)\nreference server.\nApplication examples include\ntopic modeling,\nsnowballing,\ntrend analysis,\nauthor disambiguation,\ncitation graph generation,\nresearch trend analysis,\npatent similarity detection,\ngrant and funding prediction,\nco-authorship network mapping,\ninstitutional collaboration analysis,\nknowledge graph augmentation,\nresearch impact prediction,\nacademic fraud detection,\ntechnology transfer mapping,\ninterdisciplinary research discovery, and\nresearch paper recommendations.\n\n\u003c!-- INTRO-END --\u003e\n\n## Installation and documentation\n\n* 📦 The _alexandria3k_ package is available on [PyPI](https://pypi.org/project/alexandria3k/).\n* 📄 Full reference and usage documentation for _alexandria3k_ is available [here](https://dspinellis.github.io/alexandria3k/).\n\n## Major contributors\n\n* [Aggelos Margkas](https://github.com/AggelosMargkas): US patents\n* [Bas Verlooy](https://github.com/BasVerlooy): PubMed\n* [Evgenia Pampidi](https://github.com/evgepab): DataCite\n\n## Publication\n\nDetails about the rationale, design, implementation, and use of this software\ncan be found in the following paper.\n\nDiomidis Spinellis. 
Open reproducible scientometric research with Alexandria3k. _PLoS ONE_ 18(11): e0294946. November 2023. [doi: 10.1371/journal.pone.0294946](https://doi.org/10.1371/journal.pone.0294946)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdspinellis%2Falexandria3k","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdspinellis%2Falexandria3k","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdspinellis%2Falexandria3k/lists"}