{"id":13741362,"url":"https://github.com/zseder/webcorpus","last_synced_at":"2025-09-10T22:44:35.736Z","repository":{"id":2399936,"uuid":"3366827","full_name":"zseder/webcorpus","owner":"zseder","description":"webcorpus pipeline","archived":false,"fork":false,"pushed_at":"2015-03-30T13:03:25.000Z","size":452,"stargazers_count":8,"open_issues_count":3,"forks_count":0,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-05-08T21:35:38.211Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"lgpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/zseder.png","metadata":{"files":{"readme":"README","changelog":null,"contributing":null,"funding":null,"license":"COPYING","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2012-02-06T12:37:16.000Z","updated_at":"2020-02-17T13:55:49.000Z","dependencies_parsed_at":"2022-08-06T12:15:17.787Z","dependency_job_id":null,"html_url":"https://github.com/zseder/webcorpus","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/zseder/webcorpus","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zseder%2Fwebcorpus","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zseder%2Fwebcorpus/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zseder%2Fwebcorpus/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zseder%2Fwebcorpus/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/zseder","download_url":"https://codeload.github.com/zseder/webcorpus/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zseder%2Fwebcorpus/sbom","scorecard":null,
"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":274537899,"owners_count":25304116,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-10T02:00:12.551Z","response_time":83,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-03T04:00:58.280Z","updated_at":"2025-09-10T22:44:35.701Z","avatar_url":"https://github.com/zseder.png","language":"C++","funding_links":[],"categories":["Software"],"sub_categories":["Utilities"],"readme":"This project is a collection of scripts and programs for creating a webcorpus\nfrom crawled data.\nThe input data is extracted by the wire crawler\n(http://www.cwr.cl/projects/WIRE/) and the output is a text file with document\nseparators and raw text\nA sample output data and the published article can be found at the homepage\nof our research group at http://hlt.sztaki.hu/resources/webcorpora.html\n\nDependencies:\n- gcc\n- flex\n- libtextcat\n  - http://software.wise-guys.nl/libtextcat/\n  - WARNING: libtextcat unfortunately uses a predefined confidence parameter,\n    which cannot be changed at runtime. 
We created a small patch that fixes this,\n    so before compiling and installing libtextcat, please apply our patch\n    at src/libtextcat-2.2.patch with\n    - patch -p1 \u003c /path/to/libtextcat-2.2.patch # while being in a directory\n      that contains original libtextcat-2.2; or\n    - patch -p2 \u003c /path/to/libtextcat-2.2.patch # while being in the actual\n      libtextcat-2.2 directory\n- libhunspell\n  - we only used libhunspell-1.3 for testing\n\nUsage:\n- src/Makefile builds everything and moves the binaries to the bin/ folder\n  make for building; make clean for cleaning\n- dat/Makefile is for processing data; see the Makefile for details\n  example:\n  make finnish.raw; make finnish.parsed; make finnish.senfiltered; etc.\n- some scenarios:\n  - crawl ended without a problem and data is extracted with wire-info-extract\n    - result is in mylang.wire\n    - run these commands in dat/:\n      - make mylang.raw\n      - make mylang.parsed\n      - make mylang.senfiltered\n      - make mylang.langfiltered\n      - make mylang.dedup\n      - make mylang.neardedup\n      - make mylang.tok\n      - make mylang.freq\n      - make mylang.stemdict\n    - OR simply (because make discovers dependencies):\n      - make mylang.tokenized\n  - there were some problems during crawling\n    - rename Data/text/storage.raw in the crawling directory to\n    dat/mylang.not_extractable_wire and run these commands:\n      - make mylang.raw\n      - make mylang.parsed\n      - make mylang.senfiltered\n      - make mylang.langfiltered\n      - make mylang.cleaned # see WIRE troubles later in this file\n      - make mylang.cleaned_langfiltered # see WIRE troubles later in this file\n      - make mylang.dedup\n      - make mylang.neardedup\n      - make mylang.tok\n      - make mylang.freq\n      - make mylang.stemdict\n    - this cannot be shortened with \"make mylang.tokenized\" because\n      then the optional cleaning won't run; this depends on user needs and data\n\n- if processing a 
language, one has to edit dat/Makefile and add hunspell\n  dictionaries in the same format as the existing entries,\n  or if there is no hunspell dictionary for a given language (like Finnish),\n  then use libtextcat (dat/Makefile is prepared to do that automatically\n  when no dictionary is given) and edit the TEXTCAT_CONFIG and\n  TEXTCAT_CONF_LIMIT variables in dat/Makefile.\n\nWIRE troubles\n- when maxdoc is set too high in the xml config, crawling slows down considerably\n  after a few rounds. See crawling tutorial on wiki page at github for details\n- There are some encoding issues in wire (especially when the run didn't end\n  correctly) such as:\n  - html is not in utf-8 even though it is supposed to be (set in wire.conf)\n  - some html pages declare the wrong encoding, so some characters\n    get corrupted, and this cannot be fixed with a simple iconv anymore.\n    The main problem is when a multibyte utf-8 character is handled as several\n    characters in a 1-byte encoding, which garbles the text\n- when WIRE crashes while running gatherer or seeder, crawling cannot be\n  continued and data can only be extracted the hard way.\n  wire-data/text/storage.raw has to be renamed to mylang.not_extractable_wire\n  and the data processed afterwards (see the scenario above)\n- because of these problems, there is a \"clean encoding\" phase in dat/Makefile\n  after language filtering. It is placed there because our cleaning\n  procedure uses clean data for character statistics, and that data is extracted\n  from data that is already in a specific language judged by hunspell or\n  textcat. 
After cleaning, a second language-filtering pass runs because it will now\n  yield more good data in the given language.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzseder%2Fwebcorpus","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fzseder%2Fwebcorpus","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzseder%2Fwebcorpus/lists"}