{"id":35619815,"url":"https://github.com/scieloorg/scielo_scholarly_data","last_synced_at":"2026-01-05T06:04:32.318Z","repository":{"id":38082565,"uuid":"389781590","full_name":"scieloorg/scielo_scholarly_data","owner":"scieloorg","description":"This repository contains a set of tools responsible for processing scientific publication data (also known, in part, as scholarly data). The methods we develop cover standardization, normalization, and deduplication processes.","archived":false,"fork":false,"pushed_at":"2022-06-10T15:38:24.000Z","size":178,"stargazers_count":1,"open_issues_count":2,"forks_count":3,"subscribers_count":7,"default_branch":"main","last_synced_at":"2024-04-14T20:25:31.267Z","etag":null,"topics":["deduplication","normalization","preprocessing","standardization"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/scieloorg.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-07-26T22:14:21.000Z","updated_at":"2022-01-22T21:59:10.000Z","dependencies_parsed_at":"2022-08-28T23:20:40.053Z","dependency_job_id":null,"html_url":"https://github.com/scieloorg/scielo_scholarly_data","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"purl":"pkg:github/scieloorg/scielo_scholarly_data","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scieloorg%2Fscielo_scholarly_data","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scieloorg%2Fscielo_scholarly_data/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scieloorg%2Fscielo_scholarly_data/releases",
"manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scieloorg%2Fscielo_scholarly_data/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/scieloorg","download_url":"https://codeload.github.com/scieloorg/scielo_scholarly_data/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scieloorg%2Fscielo_scholarly_data/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28214413,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2026-01-05T02:00:06.358Z","response_time":57,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deduplication","normalization","preprocessing","standardization"],"created_at":"2026-01-05T06:02:49.300Z","updated_at":"2026-01-05T06:04:32.313Z","avatar_url":"https://github.com/scieloorg.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# SciELO Scholarly Data\nThis repository contains a set of tools responsible for processing scientific publication data (also known, in part, as **scholarly data**). 
The methods we develop cover standardization, normalization, and deduplication processes.\n\n## Installation\n\n### Installing as a library\n```shell\npip install -e git+https://github.com/scieloorg/scielo_scholarly_data#egg=scielo_scholarly_data\n```\n\n### Installing as a standalone application\n\n_Create a virtual environment and install the application dependencies_\n```shell\n# Create a virtual environment\nvirtualenv -p python3 .venv\n\n# Activate the virtual environment\nsource .venv/bin/activate\n\n# Install dependencies\npip install -r requirements.txt\n\n# Install the package\npython setup.py install\n```\n\n_Run tests_\n```\npython setup.py test\n```\n\n\n## Usage\nThis section presents examples of using the standardizer and core modules.\n```python\nfrom scielo_scholarly_data import standardizer\n\n# Standardize a journal title\nstandardizer.journal_title_for_deduplication('Agrociencia \u0026amp;   (Uruguay)')\n\u003e 'agrociencia \u0026 uruguay'\n\nstandardizer.journal_title_for_visualization('Agrociencia   \u0026amp; (Uruguay)')\n\u003e 'Agrociencia \u0026 (Uruguay)'\n\n# Standardize a journal ISSN\nstandardizer.journal_issn('1387666x')\n\u003e '1387-666X'\n\n# Standardize an issue volume\nstandardizer.issue_volume(' .15,b ')\n\u003e '15b'\n\n# Standardize an issue number\nstandardizer.issue_number(' 123 a. 
')\n\u003e '123 a'\n\n# Standardize a document DOI\nstandardizer.document_doi('\u0026referrer=google*url=10.1590/1678-4766E2016006')\n\u003e '10.1590/1678-4766E2016006'\n\n# Standardize a document title\nstandardizer.document_title_for_deduplication(' INNOVACIÓN TECNOLÓGICA EN LA RESOLUCIÓN DE PROBLEMÁTICAS ')\n\u003e 'innovacion tecnologica en la resolucion de problematicas'\n\nstandardizer.document_title_for_visualization(' INNOVACIÓN TECNOLÓGICA EN LA RESOLUCIÓN DE PROBLEMÁTICAS ')\n\u003e 'INNOVACIÓN TECNOLÓGICA EN LA RESOLUCIÓN DE PROBLEMÁTICAS'\n\n# Standardize a document page\nstandardizer.document_first_page('120-10')\n\u003e '120'\n\nstandardizer.document_last_page('120-10')\n\u003e '130'\n\n# Standardize a document elocation\nstandardizer.document_elocation('e*277$2%1@')\n\u003e 'e27721'\n\n# Standardize a document publication date\nstandardizer.document_publication_date('19 de nov de 2020')\n\u003e datetime.date(2020, 11, 19)\n\nstandardizer.document_publication_date('19 de nov de 2020', only_year=True)\n\u003e datetime.date(2020)\n\n# Standardize a document author\nstandardizer.document_author_for_deduplication('John Fitzgerald Kennedy')\n\u003e 'kennedy, john fitzgerald'\n\nstandardizer.document_author_for_deduplication('John Fitzgerald Kennedy', surname_first=True)\n\u003e 'kennedy, john fitzgerald'\n\nstandardizer.document_author_for_visualization('John Fitzgerald Kennedy')\n\u003e 'Kennedy, John Fitzgerald'\n\nstandardizer.document_author_for_visualization('John Fitzgerald Kennedy', surname_first=True)\n\u003e 'Kennedy, John Fitzgerald'\n\n# Standardize a book title\nstandardizer.book_title_for_deduplication('O MODELO DE DESENVOLVIMENTO BRASILEIRO DAS PRIMEIRAS DÉCADAS DO SÉCULO XXI: \u0026#60; APORTES PARA O DEBATE', remove_special_char=False)\n\u003e 'o modelo de desenvolvimento brasileiro das primeiras decadas do seculo xxi: \u003c aportes para o debate'\n\nstandardizer.book_title_for_visualization('O MODELO DE DESENVOLVIMENTO BRASILEIRO DAS PRIMEIRAS DÉCADAS DO SÉCULO 
XXI: \u0026#60; APORTES PARA O DEBATE', remove_special_char=False)\n\u003e 'O MODELO DE DESENVOLVIMENTO BRASILEIRO DAS PRIMEIRAS DÉCADAS DO SÉCULO XXI: APORTES PARA O DEBATE'\n\nfrom scielo_scholarly_data import core\n# Remove accents from a text\ncore.remove_accents('Olá mundo')\n\u003e 'Ola mundo'\n\n# Remove double spaces from a text\ncore.remove_double_spaces('This  is a  sentence')\n\u003e 'This is a sentence'\n\n# Keep only alphabetic, numeric, and space characters in a text\ncore.keep_alpha_num_space('This$ ° [is]+- a´ (sentence) that contains numbers 1, 2, 3')\n\u003e 'This     is    a   sentence  that contains numbers 1  2  3'\n\n# Keep only alphabetic and space characters in a text\ncore.keep_alpha_space('This     is    a   sentence  that contains numbers 1  2  3')\n\u003e 'This     is    a   sentence  that contains numbers        '\n\n# Remove non-printable characters from a text\ncore.remove_non_printable_chars('\\nabc\\t123')\n\u003e 'abc123'\n\n# Remove end punctuation from a text\ncore.remove_end_punctuation_chars('abc123.,;')\n\u003e 'abc123'\n\n# Remove parentheses and their content from a text\ncore.remove_parenthesis('abc (123)')\n\u003e 'abc'\n\n# Convert a date to ISO format\ncore.convert_to_iso_date('20/feb/2021')\n\u003e datetime.date(2021, 2, 20)\n\ncore.convert_to_iso_date('2021')\n\u003e datetime.date(2021, 1, 1)\n\ncore.convert_to_iso_date('2021', month='06', day='15')\n\u003e datetime.date(2021, 6, 15)\n\n# Validate the checksum of an ORCID\ncore.check_sum_orcid('000000021694233X')\n\u003e True\n\n```\n\n## Documentation\nThis section aims to explain the scientific rationale behind the decisions we made in our processing methods.\n\n### Standardization processes\n- book_editor_address\n- book_editor_name\n- book_title\n- chapter_title\n- document_author_for_deduplication\n- document_author_for_visualization\n- document_doi\n- document_elocation\n- document_first_page\n- document_last_page\n- document_publication_date\n- document_title_for_deduplication\n- 
document_title_for_visualization\n- issue_number\n- issue_volume\n- journal_issn\n- journal_number\n- journal_title_for_deduplication\n- journal_title_for_visualization\n- journal_volume\n- orcid_validator\n\n\n### Normalization processes\n`To do`\n\n### Deduplication processes\n`To do`\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fscieloorg%2Fscielo_scholarly_data","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fscieloorg%2Fscielo_scholarly_data","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fscieloorg%2Fscielo_scholarly_data/lists"}