https://github.com/multivacplatform/multivac-wikipedia
Reusable code, libraries, and scripts for processing Wikipedia page views with Apache Spark.
data-frame multivac-wikipedia spark spark-sql wikipedia
- Host: GitHub
- URL: https://github.com/multivacplatform/multivac-wikipedia
- Owner: multivacplatform
- License: mit
- Created: 2018-01-02T20:20:28.000Z (about 7 years ago)
- Default Branch: master
- Last Pushed: 2019-01-31T18:04:22.000Z (about 6 years ago)
- Last Synced: 2024-11-13T08:37:23.707Z (3 months ago)
- Topics: data-frame, multivac-wikipedia, spark, spark-sql, wikipedia
- Language: Scala
- Homepage: https://multivac.iscpif.fr
- Size: 223 KB
- Stars: 2
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
# multivac-wikipedia
[![GitHub license](https://img.shields.io/badge/license-MIT-blue.svg)](https://github.com/multivacplatform/multivac-wikipedia/blob/master/LICENSE) [![Build Status](https://travis-ci.org/multivacplatform/multivac-wikipedia.svg?branch=master)](https://travis-ci.org/multivacplatform/multivac-wikipedia) [![Multivac Discuss](https://img.shields.io/badge/multivac-discuss-ff69b4.svg)](https://discourse.iscpif.fr/c/multivac) [![Multivac Channel](https://img.shields.io/badge/multivac-chat-ff69b4.svg)](https://chat.iscpif.fr/channel/multivac)

Reusable code, libraries, and scripts for processing Wikipedia dumps (page content, page views, etc.) with Apache Spark (SQL, ML, and GraphX).
## build_pageviews
This module:
* Downloads the hourly pageviews for all Wikipedia projects (daily)
* Cleans the data and creates a DataFrame
* Saves the DataFrame as incrementally and dynamically partitioned Parquet files

[Read more about this module](https://github.com/multivacplatform/multivac-wikipedia/tree/master/build_pageviews)
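The clean-up and save steps listed above could be sketched roughly as follows. This is a minimal illustration and not the repository's actual code. The names `PageView`, `BuildPageviewsSketch`, `parseLine`, `load`, and `append` are hypothetical, as are the `year`/`month`/`day` partition columns. The sketch also assumes each dump line has four space-separated fields (project, page title, request count, response size).

```scala
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
import org.apache.spark.sql.functions.lit

// Hypothetical record type; "project" and "requests" mirror the column
// names shown in the showcase table of this README.
case class PageView(project: String, title: String, requests: Long, bytes: Long)

object BuildPageviewsSketch {

  // Parse one dump line, assumed to be "<project> <title> <requests> <bytes>".
  // Malformed lines yield None so they can be dropped during clean-up.
  def parseLine(line: String): Option[PageView] =
    line.split(' ') match {
      case Array(project, title, requests, bytes) =>
        scala.util.Try(PageView(project, title, requests.toLong, bytes.toLong)).toOption
      case _ => None
    }

  // Read one downloaded file, keep only well-formed rows, and tag every row
  // with the (assumed) partition columns.
  def load(spark: SparkSession, path: String, year: Int, month: Int, day: Int): DataFrame = {
    import spark.implicits._
    spark.read.textFile(path)
      .flatMap(line => parseLine(line).toList)
      .toDF()
      .withColumn("year", lit(year))
      .withColumn("month", lit(month))
      .withColumn("day", lit(day))
  }

  // Append to an existing Parquet dataset; Spark derives the partition
  // directories from the values in the partition columns.
  def append(df: DataFrame, outputDir: String): Unit =
    df.write
      .mode(SaveMode.Append)
      .partitionBy("year", "month", "day")
      .parquet(outputDir)
}
```

Calling `append` once per downloaded day adds new partition directories without rewriting the older ones, which is one way to get an incremental, dynamically partitioned layout.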
### Showcase
#### Wikipedia PageViews in December 2017
**Number of rows: 4,529,669,792** (4.5 billion)

**Sum of requests: 15,278,050,138** (15.3 billion)
```
+---------+-------------+
|project |sum(requests)|
+---------+-------------+
|en.m |3784911811 |
|en |3632828923 |
|ja.m |578906226 |
|ru |532707570 |
|es.m |507966307 |
|de.m |464186949 |
|de |463264619 |
|ja |379715338 |
|ru.m |369216509 |
|fr.m |361069999 |
|it.m |328056166 |
|fr |318697185 |
|es |314862963 |
|zh |206919597 |
|pt.m |172852499 |
|it |161235234 |
|zh.m |149878515 |
|ar.m |127827169 |
|pl |125353954 |
|pl.m |113954004 |
|pt |108576418 |
|commons.m|105668930 |
|id.m |89284575 |
|fa.m |88369910 |
|nl.m |77441421 |
|nl |67609149 |
|sv.m |57038991 |
|en.zero |52135201 |
|www.wd |48210254 |
|ar |42496761 |
+---------+-------------+
only showing top 30 rows
```
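A per-project table like the one above can be produced with a plain group-and-sum. The sketch below assumes the pageviews have already been saved as Parquet with `project` and `requests` columns. `PageviewsShowcaseSketch`, `requestsByProject`, and the input path argument are hypothetical names used only for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.desc

object PageviewsShowcaseSketch {

  // Total requests per project, largest first. Spark names the aggregated
  // column "sum(requests)", matching the header in the table above.
  def requestsByProject(pageviews: DataFrame): DataFrame =
    pageviews
      .groupBy("project")
      .sum("requests")
      .orderBy(desc("sum(requests)"))

  def main(args: Array[String]): Unit = {
    // The master is left unset so spark-submit (local or YARN) can supply it.
    val spark = SparkSession.builder().appName("pageviews-showcase-sketch").getOrCreate()

    // args(0): hypothetical path to the partitioned Parquet dataset.
    val pageviews = spark.read.parquet(args(0))

    println(s"Number of rows: ${pageviews.count()}")
    requestsByProject(pageviews).show(30, truncate = false)

    spark.stop()
  }
}
```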
[Read more](https://github.com/multivacplatform/multivac-wikipedia/tree/master/spark_wiki_pageviews)

## Testing Environment
* Spark 2.2 Local / IntelliJ
* Spark 2.2 / Cloudera CDH 5.13 / YARN (cluster - client)

## Code of Conduct
This project, like all github.com/multivacplatform projects, is governed by the [Multivac Platform Open Source Code of Conduct](https://github.com/multivacplatform/code-of-conduct/blob/master/code-of-conduct.md). See also the [Typelevel Code of Conduct](http://typelevel.org/conduct) for specific examples of harassing behavior that is not tolerated.
## Copyright and License
Code and documentation copyright (c) 2017-2019 [ISCPIF - CNRS](http://iscpif.fr). Code released under the [MIT license](https://github.com/multivacplatform/multivac-wikipedia/blob/master/LICENSE).