webcorpus pipeline
https://github.com/zseder/webcorpus

README

This project is a collection of scripts and programs for creating a web corpus
from crawled data.
The input data is extracted by the WIRE crawler
(http://www.cwr.cl/projects/WIRE/) and the output is a text file containing
the raw text with document separators.
Sample output data and the published article can be found on the homepage
of our research group at http://hlt.sztaki.hu/resources/webcorpora.html
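
Purely as a hypothetical illustration (the actual separator format is defined
by the pipeline's tools and may differ), the output is plain text along these
lines:

    <doc separator>                      # hypothetical marker
    raw text of the first document ...
    <doc separator>
    raw text of the next document ...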

Dependencies:
- gcc
- flex
- libtextcat
  - http://software.wise-guys.nl/libtextcat/
  - WARNING: libtextcat unfortunately uses a predefined confidence parameter
    that cannot be changed at runtime. We created a small patch that fixes
    this, so before compiling and installing libtextcat, please apply our
    patch at src/libtextcat-2.2.patch with
    - patch -p1 < /path/to/libtextcat-2.2.patch # from the directory that
      contains the original libtextcat-2.2; or
    - patch -p2 < /path/to/libtextcat-2.2.patch # from inside the
      libtextcat-2.2 directory itself
- libhunspell
  - we tested only with libhunspell-1.3
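
For reference, a typical build of the patched library might go as follows (a
sketch only: the tarball name and the standard autotools steps are assumptions
about the upstream package, not part of this repository):

    tar xzf libtextcat-2.2.tar.gz                    # unpack upstream source
    cd libtextcat-2.2
    patch -p2 < /path/to/src/libtextcat-2.2.patch    # apply our patch
    ./configure && make                              # usual autotools build
    sudo make install                                # install the library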

Usage:
- src/Makefile builds everything and moves the binaries to the bin/ folder:
  make to build; make clean to clean up
- dat/Makefile processes the data; see the Makefile for details.
  Example:
  make finnish.raw; make finnish.parsed; make finnish.senfiltered; etc.
- some scenarios (a complete session is sketched after this list):
  - the crawl ended without problems and the data was extracted with
    wire-info-extract
    - the result is in mylang.wire
    - run these commands in dat/:
      - make mylang.raw
      - make mylang.parsed
      - make mylang.senfiltered
      - make mylang.langfiltered
      - make mylang.dedup
      - make mylang.neardedup
      - make mylang.tok
      - make mylang.freq
      - make mylang.stemdict
    - OR simply (because make discovers the dependencies):
      - make mylang.tokenized
  - there were some troubles during crawling
    - rename Data/text/storage.raw in the crawling directory to
      dat/mylang.not_extractable_wire and run these commands:
      - make mylang.raw
      - make mylang.parsed
      - make mylang.senfiltered
      - make mylang.langfiltered
      - make mylang.cleaned # see WIRE troubles later in this file
      - make mylang.cleaned_langfiltered # see WIRE troubles later in this file
      - make mylang.dedup
      - make mylang.neardedup
      - make mylang.tok
      - make mylang.freq
      - make mylang.stemdict
    - this cannot be shortened to "make mylang.tokenized", because then the
      optional cleaning steps would not run; whether you need them depends on
      your data
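
A complete session for the first (trouble-free) scenario might look like this
(a sketch; "mylang" stands for your language code, and the path to the
extracted crawl data is an assumption):

    cd src && make                         # build all binaries into bin/
    cd ../dat
    cp /path/to/extracted/mylang.wire .    # output of wire-info-extract
    make mylang.tokenized                  # make resolves the whole
                                           # raw -> ... -> tok chain itself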

- when processing a new language, one has to edit dat/Makefile and add
  hunspell dictionaries in the same format as the existing ones; if there is
  no hunspell dictionary for the given language (as with Finnish), use
  libtextcat instead (dat/Makefile does that automatically when no dictionary
  is given) and edit the TEXTCAT_CONFIG and TEXTCAT_CONF_LIMIT variables in
  dat/Makefile, for example as sketched below.
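
Such an edit might look like the following (the path and the limit value are
illustrative assumptions; check dat/Makefile for the actual format and
defaults):

    # in dat/Makefile: where libtextcat's fingerprint config lives, and the
    # confidence limit that our libtextcat patch makes tunable
    TEXTCAT_CONFIG = /usr/local/share/libtextcat/conf.txt
    TEXTCAT_CONF_LIMIT = 1.05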

WIRE troubles
- when maxdoc in the XML config is set too high, crawling slows down
  dramatically after a few rounds; see the crawling tutorial on the GitHub
  wiki page for details
- there are some encoding issues in WIRE (especially when the run didn't end
  correctly), such as:
  - the html is not in utf-8 even though it is supposed to be (as set in
    wire.conf)
  - some html documents lie about their encoding, so some characters get
    mangled in a way that a simple iconv can no longer fix.
    The main problem arises when a multibyte utf-8 character is handled as
    several characters in a 1-byte encoding, which garbles the text (a short
    demonstration follows at the end of this file)
- when WIRE crashes while running the gatherer or the seeder, crawling cannot
  be continued and the data can only be extracted the hard way:
  wire-data/text/storage.raw has to be renamed to mylang.not_extractable_wire
  and processed afterwards (see the second scenario above)
- because of these problems there is a "clean encoding" phase in dat/Makefile
  after language filtering. It is placed there because our cleaning procedure
  builds its character statistics from clean data, i.e. data already judged to
  be in the target language by hunspell or textcat. After cleaning, a second
  language filtering pass runs, since it now recovers more good data in the
  given language.
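
To demonstrate the multibyte problem mentioned above (a generic shell example,
not output produced by WIRE): the utf-8 letter "é" is the byte pair 0xC3 0xA9;
read in a 1-byte encoding such as latin-1, those bytes become the two
characters "Ã©", and once that misreading is baked into the data, a single
iconv pass can no longer restore the original text:

    printf 'caf\xc3\xa9\n'                           # valid utf-8: "café"
    printf 'caf\xc3\xa9\n' | iconv -f latin1 -t utf8 # misread as "cafÃ©"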