https://github.com/scieloorg/normalizations-experiments

Exploratory experiments upon authors affiliations data.


Data normalization/cleaning experiments
=======================================

This repository contains analyses and experiments
aimed at normalizing/cleaning the SciELO data:
finding and fixing unclean/inconsistent values
in their raw format,
as well as other similar issues,
mainly in the fields regarding the affiliations.
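As a minimal illustration of what "fixing unclean/inconsistent values" can mean in practice, the sketch below maps a few raw country spellings to canonical codes. It is hypothetical code, not taken from any notebook in this repository; the variant table, function name and fallback value are all assumptions.

```python
# Illustrative sketch: normalize raw affiliation country values by
# trimming noise and mapping known spelling variants to a canonical code.
CANONICAL = {
    "brasil": "BR",
    "brazil": "BR",
    "united states": "US",
    "usa": "US",
    "u.s.a": "US",
}

def clean_country(raw):
    """Lower-case, strip surrounding punctuation/whitespace, look up a code."""
    key = raw.casefold().strip(" .,;")
    return CANONICAL.get(key, "?")  # "?" flags values needing manual review

print(clean_country("  Brasil. "))  # BR
print(clean_country("U.S.A."))      # US
```

Real data would need a much larger variant table (or fuzzy matching, as in the notebooks below); the point is only the normalize-then-lookup shape.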

Contents of this repository ordered by creation date:

.. list-table::

   * - **Date**
     - **Description**
     - **Link**

   * - 2018-04-05
     - Grabbing article ```` and ```` data
       with BeautifulSoup 4
     - `Notebook `_

   * - 2018-04-19
     - Article XML parsing with ``ElementTree``/``libxml2``/``lxml``,
       using XPath/XSLT
     - `Notebook `_ /
       `XML pack `_

   * - 2018-04-26
     - Creating a table with data from ````-```` pairs
       (front matter) in 25 XML files using ``lxml``
     - `Notebook `_ /
       `CSV `_

   * - 2018-05-03
     - Loading/cleaning/analyzing a table of manually normalized data,
       including a DBSCAN clustering model for the institution name
     - `Notebook `_ /
       `Raw manual CSV `_ /
       `Manual CSV `_

   * - 2018-05-10
     - Looking for alternatives to the CSS/XPath/XSLT-based XML parsing:
       ``xmltodict`` on article XML and fuzzy regex on custom paths
     - `Notebook `_

   * - 2018-05-17
     - Getting tags that look like
       ````, ```` and ````
       using fuzzy regex / Levenshtein distance
     - `Notebook `_

   * - 2018-06-04
     - CSV generation with `Clea `_
     - `Notebook `_ /
       `File list `_ /
       `CSV `_

   * - 2018-06-07
     - Analysis of the ``contrib_type`` field from Clea's CSV output
     - `Notebook `_

   * - 2018-06-14 to 2018-07-05
     - Country analysis of Clea's CSV output using graphs (NetworkX),
       including a substantial analysis of alternative libraries
       for country normalization/cleaning in Python/R/Ruby,
       resulting in a taxonomy/classification of techniques
       (exact match, regex, fuzzy, graphs)
     - `Notebook `_

   * - 2018-07-05
     - Analysis of the country in the manual normalization CSV data
       using graphs
     - `Notebook `_

   * - 2018-07-12
     - Creation of a CrossRef fetching script
       for all articles in an ``article_doi`` CSV column,
       motivated by the many empty DOI / PID fields
     - `Notebook `_ /
       `Script `_

   * - 2018-07-23
     - Matching and normalizing PID/DOI using Crossref data,
       plus a first experiment based on SciELO's "XML debug" API
       to get the current article PID from its older PID
     - `Notebook `_ /
       `Script `_

   * - 2018-07-26
     - Crunching/crawling data from SciELO's search engine
       and the XML debug API, looking for a specific DOI / PID
     - `Notebook `_

   * - 2018-08-02 to 2018-08-16
     - Normalizing the USP institutions' ``orgname`` (faculty name)
       and ``orgdiv1`` (department name) fields
       filled in Brazilian Portuguese
     - `Notebook `_

   * - 2018-08-09
     - Summary of the affiliations report from SciELO Analytics
     - `Notebook <2018-08-09_affiliations_report_summary.ipynb>`_ /
       `Summary `_

   * - 2018-08-23 to 2018-11-14
     - Latent Semantic Analysis (LSA) on the CSV data
       for predicting the country code,
       using k-means, k-NN and random forest
     - `Notebook `_

   * - 2018-11-22 to 2019-03-08
     - Experiments with word2vec
       to find the country code from a single string
       holding the merged information of an affiliation-contributor pair
     - `Notebook `_ /
       `Example <2019-03-08_rf_w2v_example.ipynb>`_ /
       `Dump Dictionary `_ /
       `Dump W2V 200 `_ /
       `Dump W2V 1000 `_

   * - 2018-12-06 to 2018-12-13
     - Looking for articles' PIDs from USP/UNESP/UNICAMP (SciELO Brazil)
       by analyzing the distinct values
       that appear as the institution name
     - `Notebook `_ /
       `XLSX `_

   * - 2019-01-10 to 2019-02-21
     - Looking for articles from EMBRAPA
       and public state universities in SP (USP/UNESP/Unicamp)
       in the entire SciELO Network
       by analyzing the institution name, country, state and city,
       as well as the graph of authors and institutions
     - `Notebook `_ /
       `XLSX `_

   * - 2019-05-13 to 2019-06-05
     - Analysis of the trained "W2V 200" model using other XML files
     - `Notebook `_ /
       `List of training files `_ /
       `Script requirements `_ /
       `Script `_ /
       `W2V 200 results CSV `_

   * - 2019-08-15
     - Number of days until the first access burst
     - `Notebook <2019-08-15_first_access_burst.ipynb>`_

   * - 2019-08-21
     - Analyzing accesses of a single journal
       with Ratchet and ArticleMeta
     - `Notebook <2019-08-21_ratchet_example.ipynb>`_

   * - 2019-11-14 onwards
     - Applying FastText directly on ISIS ISO data
     - `Notebook <2019-08-21_ratchet_example.ipynb>`_ /
       `ISO files `_
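Several of the experiments above (2018-05-17, 2018-06-14) rely on fuzzy matching via Levenshtein (edit) distance. The sketch below is a self-contained, hypothetical illustration of that technique, not the notebooks' actual code; the function names and the distance threshold are assumptions.

```python
# Hedged sketch: score candidates by Levenshtein (edit) distance and
# accept the closest known name when it falls within a threshold.
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def closest(value, candidates, max_dist=2):
    """Return the candidate nearest to ``value``, or None if too far."""
    best = min(candidates, key=lambda c: levenshtein(value, c))
    return best if levenshtein(value, best) <= max_dist else None

print(closest("contirb", ["aff", "contrib", "country"]))  # contrib
```

The same shape works for misspelled country names or mangled tag names; the notebooks also combine this with fuzzy regex, which tolerates a bounded number of errors inside a pattern.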

Files that are not stored in this repository:

* Dataset of manually normalized data:
`aff_norm_update.csv (raw) `_,
`aff_n15.csv (fixed) `_

* `Clea `_'s 2018-06-04 CSV
and the XML pack from which it was created:
`selecao_xml_br.tgz `_,
`inner_join_2018-06-04.csv `_,
`inner_join_2018-06-04_filenames.txt `_

* ISIS ISO dump:
`2019-11-13_iso200.zip `_

* Random forest models based on Word2Vec:
`dictionary_w2v_both.dump `_,
`rf_w2v_200.dump `_,
`rf_w2v_1000.dump `_

* Results of applying the ``rf_w2v_200.dump`` model:
`2019-05_w2v_country.csv `_

* Country summary CSV based on the reports
from `SciELO Analytics `_
(2018-06-10):
`documents_affiliations_country_summary.csv `_

* XLSX with articles' PIDs based on the reports
from `SciELO Analytics `_
(2018-12-10):
`pids_network_2018-12-10_usp_unesp_unicamp_embrapa.xlsx `_,
`pids_2018-12-10_usp_unesp_unicamp.xlsx `_

Packages with old `reports `_
from SciELO Analytics on which some experiments were based:

* `2018-06-10 (All) `_
* `2018-11-10 (Brazil) `_
* `2018-12-10 (Brazil and Network) `_