{"id":20020267,"url":"https://github.com/multivacplatform/multivac-wikipedia","last_synced_at":"2026-04-29T20:05:30.169Z","repository":{"id":91345030,"uuid":"116053552","full_name":"multivacplatform/multivac-wikipedia","owner":"multivacplatform","description":"Wonderful reusable codes, libraries and scripts to process Wikipedia page views by using Apache Spark.","archived":false,"fork":false,"pushed_at":"2019-01-31T18:04:22.000Z","size":228,"stargazers_count":2,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-06-03T23:48:53.609Z","etag":null,"topics":["data-frame","multivac-wikipedia","spark","spark-sql","wikipedia"],"latest_commit_sha":null,"homepage":"https://multivac.iscpif.fr","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/multivacplatform.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-01-02T20:20:28.000Z","updated_at":"2022-12-05T18:20:18.000Z","dependencies_parsed_at":"2024-04-22T05:47:47.627Z","dependency_job_id":null,"html_url":"https://github.com/multivacplatform/multivac-wikipedia","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/multivacplatform/multivac-wikipedia","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/multivacplatform%2Fmultivac-wikipedia","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/multivacplatform%2Fmultivac-wikipedia/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repo
sitories/multivacplatform%2Fmultivac-wikipedia/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/multivacplatform%2Fmultivac-wikipedia/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/multivacplatform","download_url":"https://codeload.github.com/multivacplatform/multivac-wikipedia/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/multivacplatform%2Fmultivac-wikipedia/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32441468,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-29T18:12:22.909Z","status":"ssl_error","status_checked_at":"2026-04-29T18:11:33.322Z","response_time":110,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-frame","multivac-wikipedia","spark","spark-sql","wikipedia"],"created_at":"2024-11-13T08:30:48.356Z","updated_at":"2026-04-29T20:05:30.150Z","avatar_url":"https://github.com/multivacplatform.png","language":"Scala","funding_links":[],"categories":[],"sub_categories":[],"readme":"# multivac-wikipedia [![GitHub license](https://img.shields.io/badge/license-MIT-blue.svg)](https://github.com/multivacplatform/multivac-wikipedia/blob/master/LICENSE) [![Build 
Status](https://travis-ci.org/multivacplatform/multivac-wikipedia.svg?branch=master)](https://travis-ci.org/multivacplatform/multivac-wikipedia) [![Multivac Discuss](https://img.shields.io/badge/multivac-discuss-ff69b4.svg)](https://discourse.iscpif.fr/c/multivac) [![Multivac Channel](https://img.shields.io/badge/multivac-chat-ff69b4.svg)](https://chat.iscpif.fr/channel/multivac)\n\nWonderful reusable code, libraries and scripts to process Wikipedia dumps (page content, page views, etc.) by using Apache Spark (SQL, ML, and GraphX).\n\n## build_pageviews\nThis repo covers:\n* Downloading hourly pageviews for all Wikipedia projects (daily)\n* Cleaning up the data and creating a DataFrame\n* Saving the DataFrame as incrementally and dynamically partitioned Parquet files\n\n[Read more about this repo](https://github.com/multivacplatform/multivac-wikipedia/tree/master/build_pageviews)\n\n### Showcase\n#### Wikipedia PageViews in December 2017\n\n**Number of rows: 4,529,669,792** (4.5 billion)\n\n**Sum of requests: 15,278,050,138** (15.3 billion)\n\n```\n+---------+-------------+\n|project  |sum(requests)|\n+---------+-------------+\n|en.m     |3784911811   |\n|en       |3632828923   |\n|ja.m     |578906226    |\n|ru       |532707570    |\n|es.m     |507966307    |\n|de.m     |464186949    |\n|de       |463264619    |\n|ja       |379715338    |\n|ru.m     |369216509    |\n|fr.m     |361069999    |\n|it.m     |328056166    |\n|fr       |318697185    |\n|es       |314862963    |\n|zh       |206919597    |\n|pt.m     |172852499    |\n|it       |161235234    |\n|zh.m     |149878515    |\n|ar.m     |127827169    |\n|pl       |125353954    |\n|pl.m     |113954004    |\n|pt       |108576418    |\n|commons.m|105668930    |\n|id.m     |89284575     |\n|fa.m     |88369910     |\n|nl.m     |77441421     |\n|nl       |67609149     |\n|sv.m     |57038991     |\n|en.zero  |52135201     |\n|www.wd   |48210254     |\n|ar       |42496761     |\n+---------+-------------+\nonly showing top 30 
rows\n```\n[Read more](https://github.com/multivacplatform/multivac-wikipedia/tree/master/spark_wiki_pageviews)\n\n\n## Testing Environment\n\n* Spark 2.2 Local / IntelliJ\n* Spark 2.2 / Cloudera CDH 5.13 / YARN (cluster - client)\n\n## Code of Conduct\n\nThis, and all github.com/multivacplatform projects, are under the [Multivac Platform Open Source Code of Conduct](https://github.com/multivacplatform/code-of-conduct/blob/master/code-of-conduct.md). Additionally, see the [Typelevel Code of Conduct](http://typelevel.org/conduct) for specific examples of harassing behavior that are not tolerated.\n\n## Copyright and License\n\nCode and documentation copyright (c) 2017-2019 [ISCPIF - CNRS](http://iscpif.fr). Code released under the [MIT license](https://github.com/multivacplatform/multivac-wikipedia/blob/master/LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmultivacplatform%2Fmultivac-wikipedia","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmultivacplatform%2Fmultivac-wikipedia","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmultivacplatform%2Fmultivac-wikipedia/lists"}