{"id":35619814,"url":"https://github.com/scieloorg/normalizations-experiments","last_synced_at":"2026-01-05T06:04:31.925Z","repository":{"id":32207562,"uuid":"130249337","full_name":"scieloorg/normalizations-experiments","owner":"scieloorg","description":"Exploratory experiments upon authors affiliations data.","archived":false,"fork":false,"pushed_at":"2022-12-08T05:13:02.000Z","size":4465,"stargazers_count":5,"open_issues_count":6,"forks_count":3,"subscribers_count":5,"default_branch":"master","last_synced_at":"2024-04-14T20:25:29.464Z","etag":null,"topics":["experiments","labs"],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/scieloorg.png","metadata":{"files":{"readme":"README.rst","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-04-19T17:23:00.000Z","updated_at":"2023-12-07T13:35:22.000Z","dependencies_parsed_at":"2023-01-14T20:45:21.037Z","dependency_job_id":null,"html_url":"https://github.com/scieloorg/normalizations-experiments","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/scieloorg/normalizations-experiments","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scieloorg%2Fnormalizations-experiments","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scieloorg%2Fnormalizations-experiments/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scieloorg%2Fnormalizations-experiments/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scieloorg%2Fnormalizations-experiments/manifests","owner_url":"https://repos.ecosyste.ms/
api/v1/hosts/GitHub/owners/scieloorg","download_url":"https://codeload.github.com/scieloorg/normalizations-experiments/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scieloorg%2Fnormalizations-experiments/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28214410,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2026-01-05T02:00:06.358Z","response_time":57,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["experiments","labs"],"created_at":"2026-01-05T06:02:49.289Z","updated_at":"2026-01-05T06:04:31.918Z","avatar_url":"https://github.com/scieloorg.png","language":"Jupyter Notebook","readme":"Data normalization/cleaning experiments\n=======================================\n\nThis repository contains analyses and experiments\nperformed with the goal of normalizing/cleaning the SciELO data,\nintended to find and fix unclean/inconsistent values\nin their raw format,\nas well as other similar issues,\nmainly in the fields related to the affiliations.\n\nContents of this repository, ordered by creation date:\n\n.. 
list-table::\n\n  * - **Date**\n    - **Description**\n    - **Link**\n\n  * - 2018-04-05\n    - Grabbing article ``\u003caff\u003e`` and ``\u003ccountry\u003e`` data\n      with BeautifulSoup 4\n    - `Notebook \u003cexperiments_2018-04-05.ipynb\u003e`_\n\n  * - 2018-04-19\n    - Article XML parsing with ``ElementTree``/``libxml2``/``lxml``,\n      using XPath/XSLT\n    - `Notebook \u003cexperiments_2018-04-19.ipynb\u003e`_ /\n      `XML pack \u003chttps://drive.google.com/open?id=1ek_18qnBaEEvOkUdateMHhA9FExOT4An\u003e`_\n\n  * - 2018-04-26\n    - Creating a table with data from ``\u003caff\u003e``-``\u003ccontrib\u003e`` pairs\n      (front matter) in 25 XML files using ``lxml``\n    - `Notebook \u003cexperiments_2018-04-26.ipynb\u003e`_ /\n      `CSV \u003caffs_table_25.csv\u003e`_\n\n  * - 2018-05-03\n    - Loading/cleaning/analyzing a table of manually normalized data,\n      including a DBSCAN clustering model for the institution name\n    - `Notebook \u003cexperiments_2018-05-03.ipynb\u003e`_ /\n      `Raw manual CSV \u003chttps://drive.google.com/open?id=1Y_5jtWKOBhBUXQIQBZSb4qz13nyOwdWO\u003e`_ /\n      `Manual CSV \u003chttps://drive.google.com/open?id=1-RImt4SMK1a2t_t4GfMWT5ciDNIDQvoQ\u003e`_\n\n  * - 2018-05-10\n    - Looking for alternatives to the CSS/XPath/XSLT-based XML parsing:\n      ``xmltodict`` on article XML and fuzzy regex on custom paths\n    - `Notebook \u003cexperiments_2018-05-10.ipynb\u003e`_\n\n  * - 2018-05-17\n    - Getting tags that look like\n      ``\u003carticle-id\u003e``, ``\u003caff\u003e`` and ``\u003ccontrib\u003e``\n      using fuzzy regex / Levenshtein distance\n    - `Notebook \u003cexperiments_2018-05-17.ipynb\u003e`_\n\n  * - 2018-06-04\n    - CSV generation with `Clea \u003chttps://github.com/scieloorg/clea\u003e`_\n    - `Notebook \u003cexperiments_2018-06-04.ipynb\u003e`_ /\n      `File list \u003chttps://drive.google.com/open?id=1bYP5DRzSS4BmDeEUA3mQrhH117LfPk5q\u003e`_ /\n      `CSV 
\u003chttps://drive.google.com/file/d/1XmBh6YlfPkB5WfYSolAMP1EA5e02jHQO/view?usp=sharing\u003e`_\n\n  * - 2018-06-07\n    - Analysis of the ``contrib_type`` field from Clea's CSV output\n    - `Notebook \u003cexperiments_2018-06-07.ipynb\u003e`_\n\n  * - 2018-06-14 to 2018-07-05\n    - Country analysis of Clea's CSV output using graphs (NetworkX),\n      including a substantial analysis of alternative libraries\n      for country normalization/cleaning in Python/R/Ruby,\n      resulting in a taxonomy/classification of techniques\n      (exact match, regex, fuzzy, graphs)\n    - `Notebook \u003cexperiments_2018-06_country.ipynb\u003e`_\n\n  * - 2018-07-05\n    - Analysis of the country in the manual normalization CSV data\n      using graphs\n    - `Notebook \u003cexperiments_2018-07-05.ipynb\u003e`_\n\n  * - 2018-07-12\n    - Creation of a CrossRef fetching script\n      for all articles in an ``article_doi`` CSV column\n      due to the presence of several empty DOI / PID fields\n    - `Notebook \u003cexperiments_2018-07-12.ipynb\u003e`_ /\n      `Script \u003cfetch_crossref.py\u003e`_\n\n  * - 2018-07-23\n    - Matching and normalizing PID/DOI using Crossref data,\n      besides a first experiment based on SciELO's \"XML debug\" API\n      to get the current article PID from its older PID\n    - `Notebook \u003cexperiments_2018-07-23.ipynb\u003e`_ /\n      `Script \u003cheaders_listener_tornado.py\u003e`_\n\n  * - 2018-07-26\n    - Crunching/crawling data from SciELO's search engine\n      and the XML debug API, looking for a specific DOI / PID\n    - `Notebook \u003cexperiments_2018-07-26.ipynb\u003e`_\n\n  * - 2018-08-02 to 2018-08-16\n    - Normalizing the USP institutions ``orgname`` (faculty name)\n      and ``orgdiv1`` (department name) fields\n      filled in Brazilian Portuguese\n    - `Notebook \u003cexperiments_2018-08_usp.ipynb\u003e`_\n\n  * - 2018-08-09\n    - Summarization of the affiliations report from SciELO Analytics\n    - `Notebook 
\u003c2018-08-09_affiliations_report_summary.ipynb\u003e`_ /\n      `Summary \u003chttps://drive.google.com/open?id=1TPlf5FmZeZuUVZI4QiEJFyyPS7f32v7g\u003e`_\n\n  * - 2018-08-23 to 2018-11-14\n    - Latent Semantic Analysis (LSA) on the CSV data\n      for predicting the country code,\n      using k-Means, k-NN and random forest\n    - `Notebook \u003cexperiments_2018-08_words_lsa.ipynb\u003e`_\n\n  * - 2018-11-22 to 2019-03-08\n    - Experiments with word2vec\n      to find the country code from a single string\n      having the merged information of an affiliation-contributor pair\n    - `Notebook \u003cexperiments_2018-11_word2vec.ipynb\u003e`_ /\n      `Example \u003c2019-03-08_rf_w2v_example.ipynb\u003e`_ /\n      `Dump Dictionary \u003chttps://drive.google.com/open?id=1z4vAm2m3ANp48b2XnRtSlNDM2Gp4vrMX\u003e`_ /\n      `Dump W2V 200 \u003chttps://drive.google.com/open?id=1EEI-sY-nprjzQ1yyS11F_fhocAKzRpIt\u003e`_ /\n      `Dump W2V 1000 \u003chttps://drive.google.com/open?id=1_HeYOyjPlM6s1taoXSpG48XjIWd6A921\u003e`_\n\n  * - 2018-12-06 to 2018-12-13\n    - Looking for articles' PIDs from USP/UNESP/UNICAMP (SciELO Brazil)\n      by analyzing the distinct values\n      that appear as the institution name\n    - `Notebook \u003cexperiments_2018-12_sao_paulo.ipynb\u003e`_ /\n      `XLSX \u003chttps://drive.google.com/file/d/1KwpXe-E-WET9CiPp8YZqRjor1JcJeuP6/view\u003e`_\n\n  * - 2019-01-10 to 2019-02-21\n    - Looking for articles from EMBRAPA\n      and public state universities in SP (USP/UNESP/Unicamp)\n      in the entire SciELO Network\n      by analyzing the institution name, country, state and city,\n      as well as the graph of authors and institutions\n    - `Notebook \u003cexperiments_2019-02_usp_unicamp_unesp_embrapa.ipynb\u003e`_ /\n      `XLSX \u003chttps://drive.google.com/file/d/1d3WIFoftk15uzGrPkSDzqaPqnSNeOfqq/view\u003e`_\n\n  * - 2019-05-13 to 2019-06-05\n    - Analysis of the trained \"W2V 200\" model using other XML files\n    - `Notebook 
\u003cexperiments_2019-05_w2v_evaluation.ipynb\u003e`_ /\n      `List of training files \u003chttps://drive.google.com/open?id=1bYP5DRzSS4BmDeEUA3mQrhH117LfPk5q\u003e`_ /\n      `Script requirements \u003crequirements.w2v_country.txt\u003e`_ /\n      `Script \u003cw2v_country.py\u003e`_ /\n      `W2V 200 results CSV \u003chttps://drive.google.com/open?id=1JTjUfYfYnspH1DL_mNVcGvIYJqIp-fta\u003e`_\n\n  * - 2019-08-15\n    - Number of days until the first access burst\n    - `Notebook \u003c2019-08-15_first_access_burst.ipynb\u003e`_\n\n  * - 2019-08-21\n    - Analyzing accesses of a single journal\n      with Ratchet and ArticleMeta\n    - `Notebook \u003c2019-08-21_ratchet_example.ipynb\u003e`_\n\n  * - 2019-11-14 onwards\n    - Applying FastText directly on ISIS ISO data\n    - `Notebook \u003c2019-08-21_ratchet_example.ipynb\u003e`_ /\n      `ISO files \u003chttps://drive.google.com/open?id=101-oKPeKF2LM0L2uO_dYL9fp0eKOCE_-\u003e`_\n\nList of files that aren't stored in this repository:\n\n* Dataset of manually normalized data:\n  `aff_norm_update.csv (raw) \u003chttps://drive.google.com/open?id=1Y_5jtWKOBhBUXQIQBZSb4qz13nyOwdWO\u003e`_,\n  `aff_n15.csv (fixed) \u003chttps://drive.google.com/open?id=1-RImt4SMK1a2t_t4GfMWT5ciDNIDQvoQ\u003e`_\n\n* `Clea \u003chttps://github.com/scieloorg/clea\u003e`_'s 2018-06-04 CSV\n  and the XML pack from which it was created:\n  `selecao_xml_br.tgz \u003chttps://drive.google.com/open?id=1ek_18qnBaEEvOkUdateMHhA9FExOT4An\u003e`_,\n  `inner_join_2018-06-04.csv \u003chttps://drive.google.com/open?id=1XmBh6YlfPkB5WfYSolAMP1EA5e02jHQO\u003e`_,\n  `inner_join_2018-06-04_filenames.txt \u003chttps://drive.google.com/open?id=1bYP5DRzSS4BmDeEUA3mQrhH117LfPk5q\u003e`_\n\n* ISIS ISO dump:\n  `2019-11-13_iso200.zip \u003chttps://drive.google.com/open?id=101-oKPeKF2LM0L2uO_dYL9fp0eKOCE_-\u003e`_\n\n* Random forest models based on Word2Vec:\n  `dictionary_w2v_both.dump 
\u003chttps://drive.google.com/open?id=1z4vAm2m3ANp48b2XnRtSlNDM2Gp4vrMX\u003e`_,\n  `rf_w2v_200.dump \u003chttps://drive.google.com/open?id=1EEI-sY-nprjzQ1yyS11F_fhocAKzRpIt\u003e`_,\n  `rf_w2v_1000.dump \u003chttps://drive.google.com/open?id=1_HeYOyjPlM6s1taoXSpG48XjIWd6A921\u003e`_\n\n* Results of applying the ``rf_w2v_200.dump`` model:\n  `2019-05_w2v_country.csv \u003chttps://drive.google.com/open?id=1JTjUfYfYnspH1DL_mNVcGvIYJqIp-fta\u003e`_\n\n* Country summary CSV based on the reports\n  from `SciELO Analytics \u003chttps://analytics.scielo.org/\u003e`_\n  (2018-06-10):\n  `documents_affiliations_country_summary.csv \u003chttps://drive.google.com/open?id=1TPlf5FmZeZuUVZI4QiEJFyyPS7f32v7g\u003e`_\n\n* XLSX with articles' PIDs based on the reports\n  from `SciELO Analytics \u003chttps://analytics.scielo.org/\u003e`_\n  (2018-12-10):\n  `pids_network_2018-12-10_usp_unesp_unicamp_embrapa.xlsx \u003chttps://drive.google.com/file/d/1d3WIFoftk15uzGrPkSDzqaPqnSNeOfqq/view\u003e`_,\n  `pids_2018-12-10_usp_unesp_unicamp.xlsx \u003chttps://drive.google.com/file/d/1KwpXe-E-WET9CiPp8YZqRjor1JcJeuP6/view\u003e`_\n\nPackages with old `reports \u003chttps://analytics.scielo.org/w/reports\u003e`_\nfrom SciELO Analytics on which some experiment was based:\n\n* `2018-06-10 (All) \u003chttps://drive.google.com/open?id=1-FMfu8e83uAjkAQUK8xhtm2L5hn10m51\u003e`_\n* `2018-11-10 (Brazil) \u003chttps://drive.google.com/open?id=1WItJXlNzrYkm9rUicsvenH5QgmU4n2MR\u003e`_\n* `2018-12-10 (Brazil and Network) \u003chttps://drive.google.com/open?id=1yxvrvFAy-L0ZV9Mm_NKXTV7ztA_nLAEh\u003e`_\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fscieloorg%2Fnormalizations-experiments","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fscieloorg%2Fnormalizations-experiments","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fscieloorg%2Fnormalizations-experiments/lists"}