{"id":17293992,"url":"https://github.com/qetdr/research-data-pipeline","last_synced_at":"2026-04-12T00:04:44.999Z","repository":{"id":67692624,"uuid":"569160443","full_name":"qetdr/research-data-pipeline","owner":"qetdr","description":"Data Pipeline for Exploring Scientific Activities within the Computer Science Domain","archived":false,"fork":false,"pushed_at":"2023-01-24T07:52:07.000Z","size":147539,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2025-07-03T03:02:50.516Z","etag":null,"topics":["airflow","cypher","docker","neo4j","pandas","postgresql","python"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/qetdr.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-11-22T08:04:44.000Z","updated_at":"2023-01-25T07:42:34.000Z","dependencies_parsed_at":null,"dependency_job_id":"ff2c4b6a-94a9-4a3f-b74b-bfa0e9394048","html_url":"https://github.com/qetdr/research-data-pipeline","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/qetdr/research-data-pipeline","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/qetdr%2Fresearch-data-pipeline","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/qetdr%2Fresearch-data-pipeline/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/qetdr%2Fresearch-data-pipeline/releases","manifests_url":"https://repos.ecosyste.ms/api/v1
/hosts/GitHub/repositories/qetdr%2Fresearch-data-pipeline/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/qetdr","download_url":"https://codeload.github.com/qetdr/research-data-pipeline/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/qetdr%2Fresearch-data-pipeline/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":263250594,"owners_count":23437287,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["airflow","cypher","docker","neo4j","pandas","postgresql","python"],"created_at":"2024-10-15T10:50:11.390Z","updated_at":"2026-04-12T00:04:44.967Z","avatar_url":"https://github.com/qetdr.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Data Pipeline for Exploring the Scientific Activities within the Computer Science Domain\nA Capstone Project for the *Data Engineering* (LTAT.02.007) course.\n\nThe aim of the present project is to develop an end-to-end data pipeline for analytic queries.\n\nThe project repository is hosted on Github. In case of reading this from a pdf, we advise the reader to navigate to the repository: [https://github.com/qetdr/research-pipeline](https://github.com/qetdr/research-pipeline).\n\n## Team\nDmitri Rozgonjuk\u003cbr\u003e\nLisanne Siniväli\u003cbr\u003e\nEerik-Sven Puudist\u003cbr\u003e\nCheng-Han Chung\u003cbr\u003e\n\n# 1. Pipeline Overview\nThe general pipeline structure is presented in Figure 1. Below is the high-level overview of the steps:\n1. 
**Ingestion and Preliminary Preprocessing**: Data is ingested from Kaggle and first clean data tables in .csv format are saved.\n2. **Augmentation**: Clean data are augmented via APIs or static data files, and saved to another directory as clean and augmented .csv-s.\n3. **Loading to databases**: The cleaned and augmented .csv files are read into `pandas` and loaded to Postgres (data warehousing) as well as to Neo4J database.\n4. **Analytic queries**: The data are ready for analytic queries which are done from browser via Jupyter Notebook and Neo4J.\n\n|![[Figure 1. The macro-view of the data pipeline.]](images/pipeline_fig1.png)|\n|:--:|\n| \u003cb\u003eFigure 1. The macro-view of the data pipeline.\u003c/b\u003e|\n\nLargely, the process is automated via pipeline orchestration done with Apache Airflow. The data files are kept until the scheduled update. Then, cleaned and augmented data files are deleted to be replaced by updated data files. However, for the raw data files which might be needed for the initial run, one needs to run a `python` script from command line (see below).\n\n## 1.1. Relational Database Schema\nWe use Postgres as the relational database. The database schema is presented in Figure 2. The aim for this database was to allow us to run queries on author-related statistics (please see some example queries at the end of this document).\n\n|![[Figure 2. Entity-relationship diagram]](images/dwh_erd.png)|\n|:--:|\n| \u003cb\u003eFigure 2. Entity-relationship diagram.\u003c/b\u003e|\n\n## 1.2. Graph Database Schema\nWe use Neo4J as the graph database. We mostly use the data that are also present in the Data Warehouse. 
\n\nThe database schema is similar to the schema shown in Figure 2; however, the main difference is that the yellow tables in Figure 2 (`journal`, `article`, `author`, and `category`) are treated as nodes, whereas the other two tables, `authorship` and `article_category`, are relationships `AUTHORED` and `BELONGS_TO`, linking authors-articles and articles-categories, respectively, in Neo4J. We also created the `COAUTHORS` relationship, linking individual authors via joint publication(s), as well as `PUBLISHED_IN`, marking the relationship between a specific journal and article. \n\nThe general aim of this database was primarily to allow us to extract and visualize (in Neo4J) the ego-network of a given researcher. For instance, when given a researcher's name, we wanted to create the possibility to see with whom and on what the person has collaborated. However, the database also provides additional exploratory analysis options (not within the scope of this project).\n\n|![[Figure 3. Graph database schema]](images/graph_db_schema.png)|\n|:--:|\n| \u003cb\u003eFigure 3. Graph database schema.\u003c/b\u003e|\n\n## 1.3. Data Cleaning, Transformations, and Augmentations\nBelow is the step-by-step description of data cleaning and transformations done within the pipeline. The following steps are implemented in the `dags/scripts/raw_to_tables.py` module.\n\n1. Data are downloaded from Kaggle, unzipped, and only the necessary columns are extracted and converted to a `pandas DataFrame`.\n2. Initial cleaning is performed: records with a missing DOI are removed, duplicates (based on `article_id`) are dropped, and only the articles that include the category `cs` (stands for 'computer science') are included.\n3. Initial table extraction:\n    - `authorship` and `author`: names are parsed from the dataset that was extracted in the previous step. For each author, first, last, and middle names are extracted. Names are cleaned (removing non-alphabetical characters). 
Author identifier is created (in the form `LastnameF` where `'F'` stands for first name initial). `authorship` table is a long-format table with each author corresponding to each article. `author` table is extracted so that unique name identifiers create their own table.\n    - `article_category` and `category`: similarly, lists of article categories are parsed and `article_category` (long-format table with each category label corresponding to article) as well as `category` table (unique categories forming a table with super- and subdomain identifiers) are created.\n    - initial `article` and `journal` tables are created. \n4. Once these tables are prepared, they are then cleaned for missing values, NaN-values, etc. Authors with too short last names are removed. Tables are written to .csv format in the `dags/tables` directory.\n\nNext the clean data for use in databases are created and augmented. Here, the module `/dags/scripts/final_tables.py` is relevant. The process starts with preparing and augmenting the `article` table, since it defines what parts of other tables are included. \n\n5. We query the DOIs of articles against Crossref API to receive the work type, number of citations, and journal ISSN. For that, we use the helper-function `fetch_article_augments()` from `/dags/scripts/augmentations.py`. The querying is done in batches of 2000, where after each batch, the data are updated in the .csv file. **WARNING!** This process is very slow, since too many queries per second may result in the IP being blocked. Hence, we chose the stable but slow option over fast but highly risky. After the `article` table is augmented, we select only the works where type is `journal-article`. Other tables are updated accordingly.\n\n6. `journal` table is then augmented. We add the source-normalized impact factors from the CWTS website. However, for convenience, we have downloaded the Excel workbook and use this a source (from the local repository). 
The helper-functions `check_or_prepare_journal_metrics()` and `find_journal_stats()` from `/dags/scripts/augmentations.py` are used.\n\n7. Next, we augment the `author` table. We start by including genders for authors based on their first name. The names are retrieved from a static dataset which is included in the `dags/augmentation/` directory. First names from the `author` table are matched with the names in that dataset. If a match is found, gender is updated; otherwise, gender remains a `NaN`. Then, we also compute various statistics for each author. Additionally, we compute the h-index for each author based on the metrics from the database. For h-index computation, we use the `hindex()` function from `/dags/scripts/augmentations.py`.\n\n8. Finally, we update all tables to be consistent with each other (meaning that each entity/node has relations, etc).\n\n9. Clean data tables are saved in `.csv` format to the `dags/data_ready/` directory from where they can be used for loading to databases.\n\n## 1.4. Pipeline Orchestration\nWe use Airflow for pipeline orchestration. Airflow makes it convenient to schedule the pipeline tasks. For the needs of the present project, we want to update the data yearly. To meet this goal, here is how Airflow works (for a visual overview of tasks, please see Figure 4). Of note, the entire script for Airflow pipeline orchestration can be found in `dags/research_pipeline_dag.py`.\n\n1. Once Airflow is initiated, it will try to run the pipeline. **Warning!** It can happen that the pipeline run will not be successful, as different services are still being set up and initialized. In our experience, loading data to Neo4J may fail, and this is likely the biggest single point-of-failure of the pipeline, meaning that one might need to manually restart Neo4J. However, it is also mentioned on Docker that Neo4J may come with poor performance and volatile stability.\n\n2. 
The scheduling is done so that the start date of the pipeline in Airflow is 01.08.2022, meaning that the pipeline will definitely be initialized when it is first run (because the start date is in the past), and will then run again yearly (so, in next August). Yearly updates can be turned off (i.e., to manual triggering) by setting `'schedule_interval': None` in the `default_args`.\n\n3. There are **7 tasks**:\n    - `Begin_Execution`: starts the pipeline, a dummy/empty operator;\n    - `delete_for_update`: checks for existence and deletes the augmented/clean data files from `dags/data_ready` directory. This is necessary for yearly updates.\n    - `find_tables_or_ingest_raw`: checks if the non-cleansed tables have been prepared. If yes, the pipeline proceeds to next task. If no, the user is prompted to ingest the data by running the indicated script (from Terminal).\n    - `check_or_augment`: checks if the cleaned and augmented tables are present. If not, the tables from previous step are used for augmentation and cleaning.\n    - `pandas_to_dwh`: imports the cleaned .csv-s and loads to Postgres Data Warehouse.\n    - `pandas_to_neo`: imports the cleaned .csv-s and loads to Neo4J Graph Database.\n    - `Stop_Execution`: a dummy operator to indicate the status of pipeline execution end.\n\n|![[Figure 4. Airflow-orchestrated data pipeline]](images/airflow_tasks.png)|\n|:--:|\n| \u003cb\u003eFigure 4. Airflow-orchestrated data pipeline \u003c/b\u003e (a successful pipeline run).| \n\nWe would also like to note that we considered running the `pandas_to_dwh` and `pandas_to_neo` tasks in parallel. But because we want to keep it open with regards to how much data is used, we refactored the solution to sequential, since this reduces the risk of running out of memory.\n\n# 2. How to Run\n## 2.1. Prerequisites\u003cbr\u003e\n- It is assumed that the entire project is on a local machine (i.e., your computer). 
If not, clone it from github:\u003cbr\u003e\n`git clone https://github.com/eeriksp/research-pipeline`.\n- If you want to ingest the raw data from Kaggle, you will need to prepare a `kaggle.json` file and include it in the project root directory. The default file is included but it needs to be updated with the appropriate credentials. More information can be found here: https://pypi.org/project/opendatasets/.\n\n## 2.2. Run the Pipeline\n1. Start your `Docker Desktop` and make sure to allocate sufficient memory (`Settings -\u003e Resources -\u003e Memory`, say 6.75 GB).\n2. Navigate to the root directory of this project.\n3. From command line (or Terminal on a Mac), run \u003cbr\u003e\n`echo -e \"AIRFLOW_UID=$(id -u)\\nAIRFLOW_GID=0\" \u003e .env`  \u003cbr\u003e\nThis creates an environment file for Airflow to allow to run it as a superuser.\n4. Run `docker-compose up airflow-init`. This initializes Airflow with username and pwd 'airflow' to be used. Wait until the run is completed.\n5. Now, run `docker-compose up`. This runs all the services described in the `docker-compose.yaml` file. These services include Airflow, Jupyter Notebook, Postgres, and Neo4J. This may take some time, but it is suggested to keep an eye on the progress from command line, especially whether all services have properly started. This command also installs all necessary `python` modules and packages to Airflow.\n6. Make sure that all services are up and running. For that, try accessing the services from your browser:\n    - Airflow: http://localhost:8080/\n    - Jupyter Notebook: http://localhost:8888/\n    - Neo4J: http://localhost:7474/ \n\n    If it's not possible to access all services, wait a bit. Typically, the problem occurs with Neo4J, and if it's not possible to start the service(s), you can also try stopping the process (press twice `Ctrl (or Cmd) + C`) and `docker-compose down`. Then, repeat the process, starting from Step 4.\n7. 
The project folder includes the data tables by default. However, you can also test data ingestion, transformation, and augmentation yourself by deleting the .csv files. **WARNING!** Doing so will mean a very long runtime (can be more than half a day - not counting in potential issues with Neo4J). If you do choose to go without the default data tables (in the directories `dags/tables` and `dags/data_ready`, see below for description), navigate to the root directory of this project and run the following commands from Terminal: \u003cbr\u003e\n    - Install the necessary packages (on your local machine):\u003cbr\u003e \n    `python -m pip install -r requirements.txt`\n    - Run the script:\u003cbr\u003e\n    `python3 dags/scripts/raw_to_tables.py`\u003cbr\u003e\n    This script (1) downloads the Kaggle data to your machine, (2) unzips it (approx. 3.6+ GB), (3) extracts the necessary data, (4) performs the preliminary data cleaning, (5) creates the tables depicted in Figure 2, and (6) saves the tables to the `dags/tables` directory in .csv format. *Note*: you might be asked for your Kaggle credentials from the command line but usually it works also when the file is in the root directory of the project.\n\n8. Navigate to Airflow (http://localhost:8080/). If everything is correct, you should see a DAG called `research_pipeline_dag`. Click on it. Then click on `Graph`. You should now be able to see the DAG. \n9. If not already automatically triggered, to trigger the DAG, click on the button on the right that resembles 'Play'. Select `Trigger DAG` when prompted. This triggers the DAG. Wait for the execution to finish.\n10. Navigate to Jupyter Notebook (http://localhost:8888/) and run the analytic queries for the relational database. Of note, it is also possible to run Neo4J queries from the notebook, but for our purposes (well, mainly better visuals), we run the Neo4J queries from the Neo4J browser interface (http://localhost:7474/).\n11. And this should be it. 
Airflow should trigger the data update in summer in a year (at around 1st August), so nothing should change before that.\n\n# 3. Files and Directories\n## 3.1. Directory Tree and a Brief Functional Overview\nBelow is a high-level overview of the general directory structure. Some files (e.g., for caching) that are produced but will not be directly interacted with by the user are not presented below. Additionally, we present here the pipeline where data are ingested and cleaned, i.e., to its 'final' form. When the project is run for the very first time, the directories `dags/tables/` and `dags/data_ready` are empty.\n\n`research_pipe_container/`: the root directory\n- `docker-compose.yaml`: Docker container configuration file\n- `README.md`: the file you're reading now\n- `requirements.txt`: Python libraries/modules to be installed\n- `analytical_queries.ipynb`: example queries for Postgres, and the possibility to query Neo4J from Jupyter Notebook\n- `dags/`: main directory for scripts, etc\n    - `research_pipeline_dag.py`: the entire DAG configuration and scripts for Airflow\n    - `augmentation/`\n        - `article_journal.csv`: clean table with only journal articles\n        - `cwts2021.csv`: the data for journals' impact (SNIPs)\n        - `names_genders.csv`: most of the names matched with genders\n    - `data_ready/`: directory with clean and augmented data\n        - `article_augmented_raw.csv`: the augmented article table where data are not filtered based on column `type` value\n        - `article_category.csv`: each article linked to each category label\n        - `article.csv`: clean augmented table with only `type` `journal-article`\n        - `author.csv`: clean table with each unique author with their attributes (see Figure 2)\n        - `authorship.csv`: each individual author-article relationship table\n        - `category.csv`: each unique article category with super- and subdomains\n        - `journal.csv`: each unique journal with impact metrics 
(SNIPs)\n    - `scripts/`: ETL scripts\n        - `raw_to_tables.py`: a module written to primarily be run from command line. Includes data ingestion from source (Kaggle), preliminary pre-processing, and initial data tables (in Figure 2) preparation\n        - `augmentations.py`: augmentation scripts for article (CrossRef queries), journal metrics (from a static .csv prepared from a .xlsx file online), and authors (h-index computation using the binary search algorithm)\n        - `final_tables.py`: checking if clean tables exist; if not, augmentations are applied, statistics are computed, and tables are cleaned\n        - `sql_queries.py`: dropping, creating, and inserting into Postgres tables\n        - `neo4j_queries.py`: Neo4J connection class, data insertion (in batches) from pandas to Neo4J\n    - `tables/`: directory with preliminary data tables after ingestion and preliminary preprocessing\n        - `article_category.csv`: each article linked to each category label\n        - `article.csv`: table with extracted journal articles (missing `type`, etc)\n        - `author.csv`: each unique author with their attributes (see Figure 2)\n        - `authorship.csv`: each individual author-article relationship table\n        - `category.csv`: each unique article category with super- and subdomains\n        - `journal.csv`: an empty journal table placeholder to be filled in augmentation process\n- `images/`: directory for images used in this notebook\n- `logs/`: Airflow logs\n- `neo4j/`: Neo4J related files and folders\n- `plugins/`: Airflow plugins (not used in this project)\n\n# 4. Design Choices\n## 4.1. Subsetting the Entire Dataset\nAlthough it is possible to work with the entire arXiv dataset available on Kaggle, we decided to limit our data to indexed journal articles with a valid DOI for which at least one of the category tags included `'cs'`, or Computer Science. 
Hence, only eligible entities (i.e., authors, journals) were included in further databases and analyses.\n\n## 4.2. Data Augmentations\n### 4.2.1. `article`\nWe queried Crossref API with a given work's DOI. Works of type `journal-article` were updated with citation counts and journal ISSNs. We did not query more information about the scientific work, since there was no specific purpose for that, and not adding additional queries helps to save on runtime.\n \n### 4.2.2. `journal`\nIn order to get the journal information, we need the journal ISSN list from the article table. Although journal Impact Factor is a more known metric, it is trademarked and, hence, retrieving it is not open-source. The alternative is to use SNIP: the source-normalized impact per publication. This is the average number of citations per publication, corrected for differences in citation practice between research domains. Fortunately, the list of journals and their SNIP is available from the CWTS website (https://www.journalindicators.com/).\n\n### 4.2.3. `author`\nFor each author, we used a names-genders dataset to add supposed genders to each author. Although affiliation could also be of interest, there are several problems due to which we decided not to seek and extract affiliation data. First, the article augmentation source (the Crossref database) was largely missing affiliation data for authors. Second, in some cases, making queries based on author names were not possible (e.g., only author's first name initial was present - not allowing to identify the author properly or making it unjustifiably costly). Third, authors' affiliation may change dynamically (e.g., when changing an institution) and authors can also have multiple affiliations. Fourth, author affiliation in itself was not within the scope of the present project.\n\n# 5. Example Queries and Results\nIn the present pipeline, we had several analytic goals for which the pipeline was created. 
The aim of the Data Warehouse was to provide insights into the productivity (operationalized as the number of total publications) and influence (operationalized as citation count) of top researchers in the database. Additionally, we used a graph database that allowed us both to present relations between records and to produce appealing visuals.\n\n## 5.1. Data Warehouse Queries\nBelow are (1) analytic questions, (2) SQL-queries (via Python), and (3) pictures of the query results.\n\n### 5.1.1. Who are the top 0.01% scientists with the most publications in the sample?\n\u003cpre\u003e\nSELECT author_id, rank_total_pubs as rank, total_pubs as publications\nFROM author \nORDER BY rank_total_pubs \nLIMIT  0.01 * (SELECT COUNT(*) FROM author) / 100;\n\u003c/pre\u003e\n\n|![[Figure 5. Top researchers with most publications.]](images/dwh_q1.png)|\n|:--:|\n| \u003cb\u003eFigure 5. Top researchers with most publications. \u003c/b\u003e | \n\n### 5.1.2. Proportionally, in which journals have the top 0.01% of scientists (in terms of publication count) published their work the most?\n\n\u003cpre\u003e\nSELECT final.author_id, final.rank, final.publications, final.journal_title as top_journal,  TO_CHAR((final.number * 100 / final.publications), 'fm99%') as percentage_of_all_publications\nFROM (select a.author_id, rank, publications, mode() within group (order by j.journal_title) AS journal_title, COUNT(j.journal_title) as number\n      from (SELECT author_id, rank_total_pubs as rank, total_pubs as publications\n      FROM author \n      ORDER BY rank_total_pubs \n      LIMIT  0.01 * (SELECT COUNT(*) FROM author) / 100) AS a\n      INNER JOIN authorship au ON a.author_id = au.author_id\n      INNER JOIN article ar ON au.article_id = ar.article_id\n      INNER JOIN journal j ON ar.journal_issn = j.journal_issn\n      group by a.author_id, rank, publications,j.journal_title\n      having j.journal_title = mode() within group (order by j.journal_title)) as final\nLEFT JOIN (select 
a.author_id, rank, publications, mode() within group (order by j.journal_title) AS journal_title, COUNT(j.journal_title) as number\n      from (SELECT author_id, rank_total_pubs as rank, total_pubs as publications\n      FROM author \n      ORDER BY rank_total_pubs \n      LIMIT  0.01 * (SELECT COUNT(*) FROM author) / 100) AS a\n      INNER JOIN authorship au ON a.author_id = au.author_id\n      INNER JOIN article ar ON au.article_id = ar.article_id\n      INNER JOIN journal j ON ar.journal_issn = j.journal_issn\n      group by a.author_id, rank, publications,j.journal_title\n      having j.journal_title = mode() within group (order by j.journal_title)) as final1 ON \n    final.author_id = final1.author_id AND final.number \u003c final1.number\nWHERE final1.author_id IS NULL\nORDER BY final.rank \nLIMIT  0.01 * (SELECT COUNT(*) FROM author) / 100;\n\u003c/pre\u003e\n\n|![[Figure 6. Proportion of specific journals among all publications within the top authors.]](images/dwh_q2.png)|\n|:--:|\n| \u003cb\u003eFigure 6. Proportion of specific journals among all publications within the top authors. \u003c/b\u003e | \n\n### 5.1.3. 
What was the most productive year (N publications) for top 0.01% scientists?\n\n\u003cpre\u003e\nSELECT final.author_id, final.rank, final.year AS most_influential_year, final.pub AS count_of_pub, final.avg_cites\nFROM (SELECT a.author_id, rank, count(ar.year) as pub, ar.year, (sum(ar.n_cites::DECIMAL)::int) / count(ar.year) as avg_cites\n    FROM (SELECT author_id, rank_total_pubs as rank\n    FROM author\n    ORDER BY rank_total_pubs \n    LIMIT  0.01 * (SELECT COUNT(*) FROM author) / 100) AS a\n    INNER JOIN authorship au ON a.author_id = au.author_id\n    INNER JOIN article ar ON au.article_id = ar.article_id\n    GROUP BY a.author_id, rank, ar.year) as final\nLEFT JOIN (SELECT a.author_id, rank, count(ar.year) as pub, ar.year, (sum(ar.n_cites::DECIMAL)::int) / count(ar.year) as avg_cites\n    FROM (SELECT author_id, rank_total_pubs as rank\n    FROM author \n    ORDER BY rank_total_pubs \n    LIMIT  0.01 * (SELECT COUNT(*) FROM author) / 100) AS a\n    INNER JOIN authorship au ON a.author_id = au.author_id\n    INNER JOIN article ar ON au.article_id = ar.article_id\n    GROUP BY a.author_id, rank, ar.year) as final1 ON \n    final.author_id = final1.author_id AND final.avg_cites \u003c final1.avg_cites\nWHERE final1.author_id IS NULL\nORDER BY final.rank \nLIMIT  0.01 * (SELECT COUNT(*) FROM author) / 100;\n\u003c/pre\u003e\n\n|![[Figure 7. The most productive year for the top scientists.]](images/dwh_q3.png)|\n|:--:|\n| \u003cb\u003eFigure 7. The most productive year for the top scientists. \u003c/b\u003e | \n\n\n### 5.1.4. 
What was the most influential (in terms of N citations/ N publications) year for top 0.01% scientists?\n\n\u003cpre\u003e\nSELECT final.author_id, final.rank, final.hindex, final.pub, final.avg_cites, final.year\nFROM (SELECT a.author_id, rank, sum(hindex::DECIMAL) as hindex, sum(publications::DECIMAL) as pub, sum(avg_cites::DECIMAL) as avg_cites, ar.year\n    FROM (SELECT author_id, rank_total_pubs as rank, total_pubs as publications, hindex, avg_cites\n    FROM author \n    ORDER BY rank_total_pubs \n    LIMIT  0.01 * (SELECT COUNT(*) FROM author) / 100) AS a\n    INNER JOIN authorship au ON a.author_id = au.author_id\n    INNER JOIN article ar ON au.article_id = ar.article_id\n    GROUP BY a.author_id, rank, ar.year) as final\nLEFT JOIN (SELECT a.author_id, rank, sum(hindex::DECIMAL) as hindex, sum(publications::DECIMAL) as pub, sum(avg_cites::DECIMAL) as avg_cites, ar.year \n    FROM (SELECT author_id, rank_total_pubs as rank, total_pubs as publications, hindex, avg_cites\n    FROM author \n    ORDER BY rank_total_pubs \n    LIMIT  0.01 * (SELECT COUNT(*) FROM author) / 100) AS a\n    INNER JOIN authorship au ON a.author_id = au.author_id\n    INNER JOIN article ar ON au.article_id = ar.article_id\n    GROUP BY a.author_id, rank, ar.year) as final1 ON \n    final.author_id = final1.author_id AND final.hindex \u003c final1.hindex\nWHERE final1.author_id IS NULL\nORDER BY final.rank \nLIMIT  0.01 * (SELECT COUNT(*) FROM author) / 100;\n\u003c/pre\u003e\n\n|![[Figure 8. The most influential year for the top scientists.]](images/dwh_q4.png)|\n|:--:|\n| \u003cb\u003eFigure 8. The most influential year for the top scientists.\u003c/b\u003e | \n\n## 5.2. Graph Database Queries\nUsing the graph database, the aim of the queries was to provide information about a particular author's research activity - with compelling visuals. Specifically, we wanted to see with whom and on what a given author (e.g., based on name) has collaborated. 
Querying the graph database allows us to gain insights into the ego-network of a particular scientist. To that end, we can see the total network (with the scientist) as well as gain a first insight into how modularized the network is (when exploring the network without including the ego-node). In addition, we can query other entities and their relations - and explore more beyond the scope of the present project.\n\n## 5.2.1. Display the collaboration network of Lars Birkedal and the papers published with himself on it.\n\n\u003cpre\u003e\nMATCH (author1:Author)-[r:COAUTHORS]-(author2:Author)\nWHERE author1.id = \"BirkedalL\"\nRETURN author1, author2, r\n\u003c/pre\u003e\n\n|![[Figure 9. Ego-network of Lars Birkedal (with author)]](images/graph_ego1.png)|\n|:--:|\n| \u003cb\u003eFigure 9. Ego-network of Lars Birkedal \u003c/b\u003e (with author).| \n\n\n## 5.2.2. Display the collaboration network of Lars Birkedal's co-authors (without showing articles).\n\u003cpre\u003e\nMATCH (author1:Author)-[r:COAUTHORS]-(author2:Author)\nWHERE author1.id = \"BirkedalL\"\nRETURN author2, r\n\u003c/pre\u003e\n\n|![[Figure 10. Ego-network of Lars Birkedal (without the author)]](images/graph_ego2.png)|\n|:--:|\n| \u003cb\u003eFigure 10. Ego-network of Lars Birkedal \u003c/b\u003e (without the author).| \n\n## 5.2.3. Display all papers published in the journal 'Artificial Intelligence'\n\u003cpre\u003e\nMATCH p=(ar:Article)-[r:PUBLISHED_IN]-\u003e(j:Journal)\nWHERE j.title = 'Artificial Intelligence'\nRETURN p\n\u003c/pre\u003e\n\n|![[Figure 11. All papers published in 'Artificial Intelligence']](images/graph_fig3.png)|\n|:--:|\n| \u003cb\u003eFigure 11. All papers published in 'Artificial Intelligence' \u003c/b\u003e| \n\n## 5.2.4. 
Display all articles that are from data science domain ('DS') and are cited more than 100 times.\n\u003cpre\u003e\nMATCH q = (ar:Article)-[r:BELONGS_TO]-\u003e(c:Category) \nWHERE c.subdom = 'DS' AND ar.n_cites \u003e 100\nRETURN q\n\u003c/pre\u003e\n\n|![Figure 12. Papers from 'data science' domain with more than 100 citations](images/graph_fig4.png)|\n|:--:|\n| \u003cb\u003eFigure 12. Papers from 'data science' domain with more than 100 citations \u003c/b\u003e| \n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fqetdr%2Fresearch-data-pipeline","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fqetdr%2Fresearch-data-pipeline","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fqetdr%2Fresearch-data-pipeline/lists"}