{"id":20723579,"url":"https://github.com/networks-learning/wikipedia-protobuf","last_synced_at":"2025-12-06T19:02:13.833Z","repository":{"id":141875047,"uuid":"45068286","full_name":"Networks-Learning/wikipedia-protobuf","owner":"Networks-Learning","description":"This repository processes wikipedia and other datasets.","archived":false,"fork":false,"pushed_at":"2015-11-02T13:14:47.000Z","size":188,"stargazers_count":1,"open_issues_count":0,"forks_count":1,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-01-17T23:18:12.154Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Networks-Learning.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-10-27T20:32:24.000Z","updated_at":"2015-11-02T13:14:48.000Z","dependencies_parsed_at":"2023-03-13T15:52:52.389Z","dependency_job_id":null,"html_url":"https://github.com/Networks-Learning/wikipedia-protobuf","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Networks-Learning%2Fwikipedia-protobuf","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Networks-Learning%2Fwikipedia-protobuf/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Networks-Learning%2Fwikipedia-protobuf/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Networks-Learning%2Fwikipedia-protobuf/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Networks-Learning","download_url":"https://cod
eload.github.com/Networks-Learning/wikipedia-protobuf/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":242997980,"owners_count":20219270,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-17T04:09:08.697Z","updated_at":"2025-12-06T19:02:13.715Z","avatar_url":"https://github.com/Networks-Learning.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# wikipedia-protobuf\nThis repository processes the Wikipedia history dataset.\n\nThe library depends on the [Google protobuf](https://github.com/google/protobuf) library. The proto messages are stored in the messages library, and an already compiled version of these messages is stored in the code package. While you don't need to compile them again, you do need to install the protobuf Python library to use them.\n\n## Main functionalities\n\nThe main functionality provided by this library is exposed in three scripts:\n\n### Reading from Wikipedia history and converting to proto format\n\nTo convert a complete history into proto format, call [`extract_db.py`](code/extract_db.py). 
An example call would look like this:\n\n```bash\npython -m code.extract_db -i PATH/TO/WIKIDATASET/ -o OUTPUT/DIRECTORY\n```\n\nCalling help on the script would show something like the following explaining the arguments:\n\n```bash\npython -m code.extract_db -h\nusage: extract_db.py [-h] -i INPUT [-o OUTPUT] [-m MATCH] [--after AFTER]\n                     [--before BEFORE] [--space SPACE] [--mzip] [--temp TEMP]\n\nProcess wikipedia history and store meta data information.\n\noptional arguments:\n  -h, --help            show this help message and exit\n  -i INPUT, --input INPUT\n                        Input directory.\n  -o OUTPUT, --output OUTPUT\n                        Output directory\n  -m MATCH, --match MATCH\n                        Regular expression to match only some paths, this is\n                        useful if you would like to split the process into\n                        several parts.\n  --after AFTER         starts parsing files only with names matching after\n                        the given pattern, excluding the pattern itself. Note\n                        that this only skips one file, if multiple files match\n                        only the first is skipped.\n  --before BEFORE       ends before the matching path The boundary is\n                        exclusing.\n  --space SPACE         meta data size per file(before zip). The chunksize for\n                        each proto file. Note that this is a soft threshold,\n                        if a document is large the threshold may not be\n                        respected.\n  --mzip                zip metadata after processing.\n  --temp TEMP           a temp directory to unzip files, the files created ar\n                        removed after processing. 
If not provided the input\n                        directory is used.\n\n```\n\n### Extracting a subset of documents given an input list\n\nAssuming documents are stored in proto format, you can extract a subset of documents given a list of document names by calling [list.py](./code/list.py). Similar to the script above, this script can be used as follows:\n\n```bash\npython -m code.list -i PATH/TO/PROTO/FILES -o OUTPUT/DIRECTORY -l FILENAME/WITH/DOCUMENT/NAMES -s 150 --sep '\\t' --logging I --column 1\n```\n\nThis call reads data from the input directory and a list file. It assumes the list file is a CSV file separated by tabs and reads document names from column 1 (column 0 is the first column and the default value).\n\nCalling help on this script provides the arguments:\n\n```bash\npython -m code.list -h\nusage: list.py [-h] -i INPUT -l LIST -o OUTPUT -s SIZE [--logging LOGGING]\n               [--logging_dir LOGGING_DIR] [--column COLUMN] [--sep SEP]\n               [--has_header]\n\nThis scripts extract subset of documents from a list of wikipedia pages\n\noptional arguments:\n  -h, --help            show this help message and exit\n  -i INPUT, --input INPUT\n                        Input directory of wikipedia pages.\n  -l LIST, --list LIST  list of wikipedia pages.\n  -o OUTPUT, --output OUTPUT\n                        output directory\n  -s SIZE, --size SIZE  output file sizes\n  --logging LOGGING     logging level, can be\n                        [W]arning,[E]rror,[D]ebug,[I]nfo,[C]ritical\n  --logging_dir LOGGING_DIR\n                        path for storing log files\n  --column COLUMN       column number, 0 picks the first column\n  --sep SEP             separator for csv file\n  --has_header          input csv has header\n\n```\n\n### Extracting document statistics\n\nThis script demonstrates how the data processed by the scripts above can be used for other work. 
Calling [link.py](./code/link.py) reads the input data and outputs a dictionary in pickle format containing a document index, a list of web domains referenced in the dataset, and a list of links with the times they were inserted and removed (-1 if not removed). This script also needs a file called `effective_tld_names.dat.txt`, which contains top-level domains. You can read more about this [here](http://stackoverflow.com/questions/1066933/how-to-extract-top-level-domain-name-tld-from-url) and get a copy of a possible TLD list from [here](https://publicsuffix.org/list/effective_tld_names.dat).\n\n```bash\npython -m code.link -i INPUT/DIRECTORY -o OUTPUT/PICKLE/FILE\n```\n\nCalling help on this script lists its additional arguments:\n\n```bash\npython -m code.link -h\nusage: link.py [-h] -i INPUT -o OUTPUT [-m MATCH] [--after AFTER]\n               [--before BEFORE] [--ltd_names LTD_NAMES] [--count COUNT]\n\nProcess wikipedia data, extract links and life time.\n\noptional arguments:\n  -h, --help            show this help message and exit\n  -i INPUT, --input INPUT\n                        Input directory.\n  -o OUTPUT, --output OUTPUT\n                        Output file\n  -m MATCH, --match MATCH\n                        Regular expression to match only some paths, this is\n                        useful if you would like to split the process into\n                        several parts.\n  --after AFTER         starts parsing files only with names matching after\n                        the given pattern, excluding the pattern itself. 
Note\n                        that this only skips one file, if multiple files match\n                        only the first is skipped.\n  --before BEFORE       ends before the matching path The boundary is\n                        exclusing.\n  --ltd_names LTD_NAMES\n                        tld_list, this file contains the list of primary\n                        domains and is necessary to extract main domains.\n  --count COUNT         number of items per file\n\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnetworks-learning%2Fwikipedia-protobuf","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnetworks-learning%2Fwikipedia-protobuf","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnetworks-learning%2Fwikipedia-protobuf/lists"}