{"id":22271209,"url":"https://github.com/coderham/data-512-a1","last_synced_at":"2025-08-19T01:08:37.104Z","repository":{"id":93309292,"uuid":"153050368","full_name":"CoderHam/data-512-a1","owner":"CoderHam","description":"The goal of this assignment is to construct, analyze, and publish a dataset of monthly traffic on English Wikipedia from January 1 2008 through September 30 2018. All of the analysis is performed in the Jupyter notebook and the data is made available in `.json` and `.csv` files. This ReadMe acts as the documentation.","archived":false,"fork":false,"pushed_at":"2018-10-15T04:12:20.000Z","size":230,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-06-09T22:04:55.603Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/CoderHam.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-10-15T03:52:56.000Z","updated_at":"2018-10-15T04:12:21.000Z","dependencies_parsed_at":"2023-06-28T05:08:35.929Z","dependency_job_id":null,"html_url":"https://github.com/CoderHam/data-512-a1","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/CoderHam/data-512-a1","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CoderHam%2Fdata-512-a1","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CoderHam%2Fdata-512-a1/tags","releases_url":"https://repos.ecosyste.ms/api/v
1/hosts/GitHub/repositories/CoderHam%2Fdata-512-a1/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CoderHam%2Fdata-512-a1/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/CoderHam","download_url":"https://codeload.github.com/CoderHam/data-512-a1/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CoderHam%2Fdata-512-a1/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":271083778,"owners_count":24696368,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-18T02:00:08.743Z","response_time":89,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-12-03T12:11:09.831Z","updated_at":"2025-08-19T01:08:37.078Z","avatar_url":"https://github.com/CoderHam.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Data 512 (Human Centered Data Science) Assignment 1 - Data curation\n\n## Goal of the Project\n\n- The goal of this assignment is to construct, analyze, and publish a dataset of monthly traffic on English Wikipedia from January 1 2008 through September 30 2018. 
All of the analysis is performed in the [Jupyter notebook](https://github.com/CoderHam/data-512-a1/blob/master/hcds-a1-data-curation.ipynb) and the data is made available in `.json` (the naming convention is `apiname_accesstype_firstmonth-lastmonth.json`) and `.csv` files. This ReadMe acts as the documentation.\n\n- The purpose of the assignment is to demonstrate that I can follow the best practices for open scientific research in designing and implementing my project, and make my project fully reproducible by others, from data collection to data analysis.\n\n- For this assignment, I have combined data about Wikipedia page traffic from two different Wikimedia REST API endpoints (Pageviews and Pagecounts) into a single dataset, performed some simple data processing steps on the data, and then analyzed the data by building a simple visualization.\n\n## Terms of Use and License of Source Data\n\n- Wikimedia has its own license and terms of use, as described below:\n\n    - The Wikimedia REST API offers access to Wikimedia's content and metadata in machine-readable formats. Focused on high-volume use cases, it tightly integrates with Wikimedia's globally distributed caching infrastructure. As a result, API users benefit from reduced latencies and support for high request volumes.\n\n    - The REST API along with its documentation is available for all major Wikimedia projects at the location /api/rest_v1/. For example, for the English Wikipedia it is available at [https://en.wikipedia.org/api/rest_v1/](https://en.wikipedia.org/api/rest_v1/).\n\n    - While the functionality offered by most projects closely matches that on English Wikipedia, there are some noteworthy exceptions:\n\n        - wikimedia.org offers cross-project information like page view metrics.\n        - en.wiktionary.org offers an experimental definition end point, exposing Wiktionary information as structured data. 
Support for other languages is under discussion.\n\n    More can be read at: [https://www.mediawiki.org/wiki/REST_API#Terms_and_conditions](https://www.mediawiki.org/wiki/REST_API#Terms_and_conditions)\n\n## APIs Used for Data Collection\n\n- Wikimedia Legacy Pagecounts API: [https://wikitech.wikimedia.org/wiki/Analytics/AQS/Legacy_Pagecounts](https://wikitech.wikimedia.org/wiki/Analytics/AQS/Legacy_Pagecounts)\n- Wikimedia Pageviews API: [https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews](https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews)\n\n## Curated Data\n\nThe final data is stored in a `csv` file for easy access. The schema and a description of each column are shown below:\n\n| Column | Value |\n| ------ | ------ |\n| year | YYYY |\n| month | MM |\n| pagecount_all_views | number of views for all clients reported by the (Legacy) Pagecounts API |\n| pagecount_desktop_views | number of views for desktop clients reported by the (Legacy) Pagecounts API |\n| pagecount_mobile_views | number of views for mobile clients reported by the (Legacy) Pagecounts API |\n| pageview_all_views | number of views for all clients reported by the Pageviews API |\n| pageview_desktop_views | number of views for desktop clients reported by the Pageviews API |\n| pageview_mobile_views | number of views for mobile clients reported by the Pageviews API |\n\nThe csv file is provided in this repository: [en-wikipedia_traffic_200712-201809.csv](https://github.com/CoderHam/data512-assignment-1/blob/master/en-wikipedia_traffic_200801-201709.csv)\n\n## Final Results\n\nThe data was visualized using `matplotlib` as shown below:\n\n![Wikipedia Page Views Monthly between 2008 and 2018](https://github.com/CoderHam/data512-assignment-1/blob/master/wikipedia_pagevies_monthly.png)\n\nThe sudden dip in values when moving from pagecounts to pageviews can be attributed to the filtering of web crawlers and spiders. 
\n\nThe two series overlap for a brief period, possibly because the Pagecounts API had not yet been designated legacy when the Pageviews API was introduced.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcoderham%2Fdata-512-a1","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcoderham%2Fdata-512-a1","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcoderham%2Fdata-512-a1/lists"}