{"id":21187019,"url":"https://github.com/bitfunnel/workbench","last_synced_at":"2025-07-13T16:32:49.172Z","repository":{"id":91594508,"uuid":"58910970","full_name":"BitFunnel/Workbench","owner":"BitFunnel","description":"Java and Lucene based tools for BitFunnel corpus preparation","archived":false,"fork":false,"pushed_at":"2017-01-04T21:58:08.000Z","size":512,"stargazers_count":20,"open_issues_count":11,"forks_count":4,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-07-10T09:29:33.628Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"http://bitfunnel.org","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/BitFunnel.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2016-05-16T07:03:33.000Z","updated_at":"2025-02-12T01:27:17.000Z","dependencies_parsed_at":null,"dependency_job_id":"ecb13129-9a44-437b-9480-4b2e4ec07661","html_url":"https://github.com/BitFunnel/Workbench","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/BitFunnel/Workbench","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BitFunnel%2FWorkbench","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BitFunnel%2FWorkbench/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BitFunnel%2FWorkbench/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BitFunnel%2FWorkbench/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/BitFunnel","download_url":"https://codeload.github.com/BitFunnel/Workbench/tar.
gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BitFunnel%2FWorkbench/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":265173444,"owners_count":23722568,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-20T18:27:47.705Z","updated_at":"2025-07-13T16:32:49.125Z","avatar_url":"https://github.com/BitFunnel.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"# WorkBench: Tools for processing Wikipedia Dumps\n\nThe **org.bitfunnel.workbench** package provides tools for converting\n[Wikipedia](https://www.wikipedia.org/)\ndatabase dump files into [BitFunnel corpus files](http://bitfunnel.org/corpus-file-format/).\nWe designed BitFunnel corpus\nfiles with the goal of trivial, extremely low-overhead parsing.\n\nThe conversion process involves parsing the Wikipedia dump files, extracting\neach document, removing wiki markup, performing [Lucene](https://lucene.apache.org/) analysis for\ntokenization and stemming, and finally encoding and writing\nthe data in BitFunnel format.\n\nWhile the initial conversion from Wikipedia database dumps to BitFunnel corpus\nfiles may be slow, subsequent experiments with BitFunnel corpus files should be fast and reliable.\n\nThe expected workflow is to download a Wikipedia database dump, convert it\nonce, and then use the resulting BitFunnel corpus files many times over.\n\nBefore processing a Wikipedia database dump, follow [these instructions](BUILD.md) to \nbuild the 
**org.bitfunnel.workbench** package. Instructions for obtaining and processing\na Wikipedia database dump appear below.\n\n## Obtaining the Wikipedia Database Dump\n\n1. Obtain a Wikipedia database dump file.  These files are available at\n[https://dumps.wikimedia.org/](https://dumps.wikimedia.org/).\n\n1. Click on [Database backup dumps](https://dumps.wikimedia.org/backup-index.html).\nThe dumps of the English language Wikipedia pages are under\n[enwiki](https://dumps.wikimedia.org/enwiki/). Each folder here corresponds to\na dump on a particular day.\n\n1. The dump folder is organized into sections. Look for a section entitled\n**Articles, templates, media/file descriptions, and primary meta-pages.**\nThe links here should be of the form enwiki-DATE-pages-articlesN.xml-pXXXpYYY.bz2\nwhere DATE is of the form YYYYMMDD, N is the section number, and XXX and YYY\nare numbers specifying the page range.\n\n1. Download one of these files and use [7-Zip](http://www.7-zip.org/) or equivalent to decompress.\n\n## Preprocessing the Wikipedia Database Dump\n\nThe Wikipedia dump is XML data that looks something like\n\n~~~\n\u003cmediawiki xmlns=\"http://www.mediawiki.org/xml/export-0.10/\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xsi:schemaLocation=\"http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd\" version=\"0.10\" xml:lang=\"en\"\u003e\n  \u003csiteinfo\u003e\n    \u003csitename\u003eWikipedia\u003c/sitename\u003e\n    \u003cdbname\u003eenwiki\u003c/dbname\u003e\n    \u003cbase\u003ehttps://en.wikipedia.org/wiki/Main_Page\u003c/base\u003e\n    \u003cgenerator\u003eMediaWiki 1.27.0-wmf.19\u003c/generator\u003e\n    \u003ccase\u003efirst-letter\u003c/case\u003e\n    \u003cnamespaces\u003e\n      \u003cnamespace key=\"-2\" case=\"first-letter\"\u003eMedia\u003c/namespace\u003e\n      \u003cnamespace key=\"-1\" case=\"first-letter\"\u003eSpecial\u003c/namespace\u003e\n      ... 
many more namespace entries ...\n      \u003cnamespace key=\"2600\" case=\"first-letter\"\u003eTopic\u003c/namespace\u003e\n    \u003c/namespaces\u003e\n  \u003c/siteinfo\u003e\n  \u003cpage\u003e\n    \u003crevision\u003e\n      \u003ctitle\u003eTITLE\u003c/title\u003e\n      ... lots of other tags ...\n      \u003ctext xml:space=\"preserve\"\u003e\n        ... the wiki markup for the page ...\n      \u003c/text\u003e\n    \u003c/revision\u003e\n  \u003c/page\u003e\n  ... lots more pages ...\n\u003c/mediawiki\u003e\n~~~\n\nThis data must be preprocessed with an open source program called\n[WikiExtractor](https://github.com/attardi/wikiextractor)\nbefore converting to BitFunnel corpus format. The WikiExtractor program\nparses the XML dump file, extracts the title, URL, curid, and text for each\npage, and then strips all of the wiki markup tags from the text.\n\nWikiExtractor requires Python 2.7\n(note that Python 3 and later are not compatible with version 2.7).\nTo install Python on the Mac,\n~~~\nbrew install python\n~~~\n\nTo install Python on Linux,\n~~~\nsudo apt install python\n~~~\n\nOn Windows, run the Python 2.7.11 [installer](https://www.python.org/downloads/).\n\nYou can run WikiExtractor from the command line as follows:\n~~~\n./WikiExtractor.py input\n~~~\nwhere **input** is an uncompressed Wikipedia database dump file.\nIf you don't supply the \"-o\" option, the output will be written\nto the directory ./text.\n\nNote that some versions of WikiExtractor may fail on Windows because of a\nbug related to process spawning.\nYou can work around this bug by using the \"-a\" flag, but the extraction will be\nslower because it will be limited to a single thread.\n\nThe output of WikiExtractor looks something like\n\n~~~\n\u003cdoc id=\"ID\" url=\"https://en.wikipedia.org/wiki?curid=ID\" title=\"TITLE\"\u003e\n  text for this document.\n  ... more text ...\n\u003c/doc\u003e\n... 
more documents ...\n~~~\n\nWe successfully processed\n[enwiki-20160407-pages-meta-current1.xml-p000000010p000030303.bz2](https://dumps.wikimedia.org/enwiki/20160407/enwiki-20160407-pages-meta-current1.xml-p000000010p000030303.bz2)\nusing WikiExtractor [commit 60e40824](https://github.com/attardi/wikiextractor/commit/60e4082440b626465b2df30301ab00c3a04cd79b).\n\nNote that this version of WikiExtractor will not run on Windows\nwithout the \"-a\" flag because of a bug.\n\n## Generating the BitFunnel Corpus Files\n\nThe Java class **org.bitfunnel.workbench.MakeCorpusFile** converts\nthe WikiExtractor output to BitFunnel corpus format.\n\nThis repository includes a pair of sample input files in the **sample-input** directory.\n~~~\n% ls -l sample-input\ntotal 8\n-rw-r--r-- 1 Mike 197121 1283 May 15 17:05 Frost.txt\n-rw-r--r-- 1 Mike 197121 2769 May 15 17:09 Whitman.txt\n~~~\nThese sample files are in the format generated by WikiExtractor.\n\nHere's how to use MakeCorpusFile to generate the corresponding BitFunnel corpus files.\nOn OSX and Linux:\n~~~\n% java -cp target/corpus-tools-1.0-SNAPSHOT.jar \\\n       org.bitfunnel.workbench.MakeCorpusFile \\\n       sample-input \\\n       sample-output\n~~~\n\nOn Windows:\n~~~\n% java -cp target\\corpus-tools-1.0-SNAPSHOT.jar ^\n       org.bitfunnel.workbench.MakeCorpusFile ^\n       sample-input ^\n       sample-output\n~~~\n\nIn the above examples, **sample-input** is the name of a directory\ncontaining WikiExtractor output and **sample-output** is the name\nof the directory in which the BitFunnel corpus files are created.\n\nHere's the output:\n~~~\n$ ls -l sample-output/\ntotal 8\n-rw-r--r-- 1 Mike 197121  836 May 15 23:55 Frost.txt\n-rw-r--r-- 1 Mike 197121 1862 May 15 23:55 Whitman.txt\n~~~\n\nThe converter uses the [Lucene](https://lucene.apache.org/) Standard Analyzer\nto tokenize and stem each word in the extracted Wikipedia 
dump.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbitfunnel%2Fworkbench","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbitfunnel%2Fworkbench","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbitfunnel%2Fworkbench/lists"}