{"id":19152507,"url":"https://github.com/gesiscss/wiki-download-parse-page-views","last_synced_at":"2025-02-22T21:15:10.107Z","repository":{"id":104686260,"uuid":"130336171","full_name":"gesiscss/wiki-download-parse-page-views","owner":"gesiscss","description":"Pipeline for downloading, parsing and aggregating static page view dumps from Wikipedia.","archived":false,"fork":false,"pushed_at":"2018-04-20T12:25:49.000Z","size":17,"stargazers_count":3,"open_issues_count":0,"forks_count":1,"subscribers_count":14,"default_branch":"master","last_synced_at":"2025-01-03T18:47:12.340Z","etag":null,"topics":["aggregation","downloader","dumps","pageviews","parser","pipeline","python","script","wikipedia"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/gesiscss.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-04-20T08:51:28.000Z","updated_at":"2019-10-31T20:22:02.000Z","dependencies_parsed_at":null,"dependency_job_id":"cabd2bee-4643-4396-a029-627afbdb6be6","html_url":"https://github.com/gesiscss/wiki-download-parse-page-views","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gesiscss%2Fwiki-download-parse-page-views","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gesiscss%2Fwiki-download-parse-page-views/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gesiscss%2Fwiki-download-parse-page-views/releases","ma
nifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gesiscss%2Fwiki-download-parse-page-views/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/gesiscss","download_url":"https://codeload.github.com/gesiscss/wiki-download-parse-page-views/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":240236361,"owners_count":19769580,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["aggregation","downloader","dumps","pageviews","parser","pipeline","python","script","wikipedia"],"created_at":"2024-11-09T08:18:09.487Z","updated_at":"2025-02-22T21:15:10.102Z","avatar_url":"https://github.com/gesiscss.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Download, Parse, Aggregate Wikipedia Page View Dumps\nPipeline for downloading, parsing and aggregating static page view dumps from Wikipedia.\n\n# How it works\n\nIf you need the annual number of page views for specific Wikipedia pages before 2015, you unfortunately cannot rely on the API (at least not at the time of writing), as it only gives access to newer records (post 2015). However, a collection of [static dumps](https://dumps.wikimedia.org/other/pagecounts-raw/) is available.\n\nThis pipeline was made in order to:\n\n1. Fetch the names of all files to be downloaded\n2. Download the needed files (parallelized)\n3. Parse them after downloading (parallelized)\n4. 
Aggregate the files for each year to get the annual number of views for the selected pages\n\nThe following scripts need to be run in this order:\n\n1. [fetch_file_names.py](https://github.com/gesiscss/wiki-download-parse-page-views/blob/master/fetch_file_names.py)\n2. [downloader.py](https://github.com/gesiscss/wiki-download-parse-page-views/blob/master/downloader.py)\n3. [parser.py](https://github.com/gesiscss/wiki-download-parse-page-views/blob/master/parser.py)\n4. [group_by.py](https://github.com/gesiscss/wiki-download-parse-page-views/blob/master/group_by.py)\n\n# Fetching file names and URLs\n\nFirst, we need to get the names of the files we want to download. A set of files is available for every year, so it is also good to specify which years we are interested in.\n\n### fetch_file_names.py\n\nThe script generates a CSV file containing the file names, file sizes and URLs from which the files should be downloaded. Script parameters:\n* year_start - first year to be downloaded\n* year_end - last year to be downloaded (all years in between are downloaded)\n* output_dir - directory where files for each year will be stored\n\n```{r, engine='bash', count_lines}\npython fetch_file_names.py [year_start] [year_end] [output_dir]\n```\n### Output file\n\nfile | size |url |\n--- |--- |--- |\npagecounts-20140101-000000.gz |82| https://.. |\npagecounts-20140201-000000.gz |81| https://.. |\n... | ... | ... |\n\n# Downloading files\n\nNow that we have fetched the file names and URLs, we can download the files!\n\n### downloader.py\n\nThis script concurrently downloads [Wikipedia pagecount dumps](https://dumps.wikimedia.org/other/pagecounts-raw/) [gzip]. The previously generated **file.csv** contains the list of URLs for these files. **path_save** is the directory where the files should be downloaded. 
\n\n```{r, engine='bash', count_lines}\npython downloader.py [file.csv] [path_save] [thread_number]\n```\n**THE SERVER CURRENTLY BLOCKS CLIENTS THAT USE MORE THAN 3 THREADS**\n\n# Parsing files\n\nEach file contains information on every Wikipedia page that was accessed within the hour specified in the file name, so we should filter out the page names that we do not need.\n\n### Input file\n\nFor parsing, a CSV file containing Wikipedia page names has to be provided in the following format:\n\nnames_u | names_q |\n--- |--- |\nBarack_Obama |Barack_Obama| \nRené_Konen |Ren%C3%A9_Konen| \nZoran_Đinđić |Zoran_%C4%90in%C4%91i%C4%87|\n... | ... | \n\nThe column **names_u** holds the page name in standard UTF-8 (the unquoted representation). The dump files use percent-encoding instead, so we also need **names_q**, the 'quoted' representation. Both [quote and unquote](https://stackoverflow.com/questions/300445/how-to-unquote-a-urlencoded-unicode-string-in-python) can be done with [urllib](https://docs.python.org/2/library/urllib.html).\n\n### parser.py\n\nOpens the list of files in **files_dir**, filters them by the names in **page_names_file** and by **project_name** (\"en\" for English Wikipedia, \"de\" for German, etc.), and saves the filtered files in **save_dir** using the specified **num_threads**.\n\n```{r, engine='bash', count_lines}\npython parser.py [page_names_file] [files_dir] [save_dir] [project_name] [num_threads]\n```\n\n# Getting the aggregated pageviews\n\nAfter parsing the files, it is time to aggregate the page views!\n\nThe script loads the files from **file_dir** as pandas dataframes, concatenates them, aggregates the views and saves the result as CSV at **save_path**.\n\n```{r, engine='bash', count_lines}\npython group_by.py [file_dir] [save_path]\n```\n\n### Output file\n\nnames_u | names_q | views |\n--- |--- | --- |\nBarack_Obama |Barack_Obama| 3562998 | \nRené_Konen |Ren%C3%A9_Konen| 156456 |\nZoran_Đinđić |Zoran_%C4%90in%C4%91i%C4%87| 96846 |\n... | ... | ... 
|\n\n# Dependencies\n\n```{r, engine='bash', count_lines}\n#todo requirements.txt\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgesiscss%2Fwiki-download-parse-page-views","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgesiscss%2Fwiki-download-parse-page-views","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgesiscss%2Fwiki-download-parse-page-views/lists"}