Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
LOD-Laundromat/LOD-Laundromat
Cleaning other people's dirty data.
https://github.com/LOD-Laundromat/LOD-Laundromat
- Host: GitHub
- URL: https://github.com/LOD-Laundromat/LOD-Laundromat
- Owner: LOD-Laundromat
- License: MIT
- Created: 2014-05-07T07:37:05.000Z (over 10 years ago)
- Default Branch: master
- Last Pushed: 2021-06-17T21:43:49.000Z (over 3 years ago)
- Last Synced: 2024-04-22T03:18:49.007Z (7 months ago)
- Language: Prolog
- Homepage: http://lodlaundromat.org
- Size: 10.6 MB
- Stars: 50
- Watchers: 11
- Forks: 5
- Open Issues: 0
Metadata Files:
- Readme: README.org
- License: LICENSE
Awesome Lists containing this project
- awesome-starred - LOD-Laundromat/LOD-Laundromat - Cleaning other people's dirty data. (others)
README
#+TITLE: LOD-Laundromat: Cleaning other people's dirty data
#+AUTHOR: Wouter Beek

*Closed for maintenance! May be back soon*
LOD Laundromat is a standards-compliant implementation for crawling
and cleaning Linked Open Data (LOD). Since data cleaning is often
reported as comprising 80% of a data analyst's workload, it would be
great if we could automate at least a large part of it.

With the LOD Laundromat, data is cleaned in parallel by an arbitrary
number of /Washing Machine/ threads. The collection of washing
machines can be monitored from a main /Laundromat/ thread.

* Data discovery
** Sitemap
#+BEGIN_SRC xml
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>$(URI)</loc>
    <lastmod>$(DATE_TIME)</lastmod>
    <changefreq>$(DURATION)</changefreq>
    <priority>$(FLOAT)</priority>
  </url>
</urlset>
#+END_SRC
** Semantic sitemap
#+BEGIN_SRC xml
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:sc="http://sw.deri.org/2007/07/sitemapextension/scschema.xsd">
  <sc:dataset>
    <sc:datasetLabel>$(STRING)</sc:datasetLabel>
    <sc:datasetURI>$(IRI)</sc:datasetURI>
    <sc:dataDumpLocation>$(URI)</sc:dataDumpLocation>
    <changefreq>$(DURATION)</changefreq>
  </sc:dataset>
</urlset>
#+END_SRC
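For illustration, candidate seeds can be scraped from such a sitemap
with a few lines of SWI-Prolog (~sitemap_seed/2~ is an invented name,
and XML namespace handling is glossed over):

#+BEGIN_SRC prolog
:- use_module(library(sgml)).
:- use_module(library(xpath)).

%! sitemap_seed(+File:atom, -Uri:atom) is nondet.
%
% Extract each <loc> element from a plain sitemap as a candidate
% seed URI.
sitemap_seed(File, Uri) :-
  load_xml(File, Dom, []),        % parse the sitemap into a DOM term
  xpath(Dom, //loc(text), Uri).   % enumerate the <loc> contents
#+END_SRC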
* Job scheduling
* The cleaning process
** Adding a seedpoint
Adding a new URI to the seedlist results in the following
registration:

#+BEGIN_SRC prolog
$(URI_HASH){
  relative: $(BOOL),
  status: added,
  uri: $(URI)
}
#+END_SRC

We store seedlist registrations in the SWI dictionary data
structure. SWI dictionaries are very similar to the JSON format for
data exchange. The main difference is the tag ~$(HASH)~, which we
will use to link seedlist registrations to one another.
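For example, a concrete registration is an ordinary SWI dict whose
keys can be accessed with ~get_dict/3~ (the hash tag and URI below
are invented):

#+BEGIN_SRC prolog
?- Seed = c1f5e3a9{relative: false, status: added,
                   uri: 'http://example.org/dataset.ttl'},
   get_dict(status, Seed, Status).
#+END_SRC

The value of ~$(HASH)~ is computed in the following way: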
1. Take the value of ~$(URI)~
2. Map uppercase characters that appear in the scheme or host
   components to their corresponding lowercase characters[fn::See
   §6.2.2.1 of RFC 3986
   (https://tools.ietf.org/html/rfc3986#section-6.2.2.1)].
3. Map lowercase characters that denote hexadecimal digits within a
   percent-encoded octet to their corresponding uppercase
   characters[fn::See §6.2.2.1 of RFC 3986
   (https://tools.ietf.org/html/rfc3986#section-6.2.2.1)].
4. Decode percent-encoded octets that denote unreserved
   characters[fn::See §6.2.2.2 of RFC 3986
   (https://tools.ietf.org/html/rfc3986#section-6.2.2.2)].
5. Remove the relative path references ~.~ and ~..~ by applying
   reference resolution[fn::See §6.2.2.3 of RFC 3986
   (https://tools.ietf.org/html/rfc3986#section-6.2.2.3)].
6. Take the MD5 hash.
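Steps 2-5 amount to the syntax-based normalization of RFC 3986, which
SWI-Prolog's library(uri) provides as ~uri_normalized/2~; under that
assumption the whole recipe is a two-liner (~seed_hash/2~ is an
invented name):

#+BEGIN_SRC prolog
:- use_module(library(md5)).
:- use_module(library(uri)).

%! seed_hash(+Uri:atom, -Hash:atom) is det.
%
% Steps 2-5: RFC 3986 syntax-based normalization (uri_normalized/2).
% Step 6: the MD5 hash of the normalized URI, as a hexadecimal atom.
seed_hash(Uri, Hash) :-
  uri_normalized(Uri, Normalized),
  md5_hash(Normalized, Hash, []).
#+END_SRC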
We allow relative URIs to be added to the seedlist, denoted by the
Boolean property ~relative~. We cannot do anything useful with
relative URIs, because their download location is unknown due to a
missing host machine name. Still, we want to be able to quantify how
often a dataset is erroneously denoted by a relative URI.

The ~status~ property keeps track of the URI throughout the
data cleaning process. Its initial state is ~added~, which means that
it has been added to the seedlist.

The ~uri~ property stores the URI itself. ~interval~ denotes the time
between consecutive crawls, expressed in seconds. ~processed~ denotes
the time of the last crawl, expressed in seconds since the epoch.

#+BEGIN_SRC prolog
$(URI_HASH){
  added: $(ADDED),
  child: $(HASH),
  interval: $(INTERVAL),
  processed: $(PROCESSED),
  uri: $(URI)
}
#+END_SRC

** Downloading the file stream
This stage takes seeds that match pattern (1) and changes them to
match pattern (2) during the download process. If the download
fails, we only have metadata ~$(HTTP_META)~ about the TCP and/or HTTP
communication process, resulting in a seed record with pattern (3).
If the download succeeds, there is also content metadata
~$(CONTENT_META)~, resulting in a seed record with pattern (4).

A seed is stale, and therefore a candidate for re-downloading, if
~$(PROCESSED) + $(INTERVAL) < $(NOW)~.
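Expressed over the seed dicts, the staleness check could look as
follows (~is_stale/1~ is an invented name):

#+BEGIN_SRC prolog
%! is_stale(+Seed:dict) is semidet.
%
% Succeeds if the crawl interval has elapsed since the last crawl.
is_stale(Seed) :-
  get_time(Now),                          % current time, epoch seconds
  Now > Seed.processed + Seed.interval.
#+END_SRC

While downloading: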
#+BEGIN_SRC prolog
$(DOWNLOAD_HASH){
  parent: $(SEED_HASH),
  status: downloading
}
#+END_SRC

After downloading:
#+BEGIN_SRC prolog
$(DOWNLOAD_HASH){
  http: [$(HTTP_META)],
  newline: $(NEWLINE),
  number_of_bytes: $(NONNEG),
  number_of_chars: $(NONNEG),
  number_of_lines: $(NONNEG),
  parent: $(SEED_HASH),
  status: filed,
  timestamp: $(BEGIN)-$(END)
}
#+END_SRC

The record includes the ~$(BEGIN)~ and ~$(END)~ times of the
download.

~$(HTTP_META)~ has the following form:
#+BEGIN_SRC prolog
http{
  headers: $(HTTP_HEADERS),
  status: $(STATUS_CODE),
  uri: $(URI),
  version: version{major: $(NONNEG), minor: $(NONNEG)},
  walltime: $(FLOAT)
}
#+END_SRC
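Most of these fields correspond to options of SWI-Prolog's
~http_open/3~; a minimal sketch of collecting them into a dict of
this shape (~download_meta/2~ is an invented name, and ~walltime~ is
elided):

#+BEGIN_SRC prolog
:- use_module(library(http/http_open)).

%! download_meta(+Uri:atom, -Meta:dict) is det.
%
% Open the URI and collect the HTTP status code, headers, and
% version into an http{...} dict.
download_meta(Uri, Meta) :-
  http_open(Uri, In,
            [ status_code(Status),
              headers(Headers),
              version(Major-Minor)
            ]),
  close(In),
  Meta = http{headers: Headers,
              status: Status,
              uri: Uri,
              version: version{major: Major, minor: Minor}}.
#+END_SRC

** Unpacking the file stream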
This stage is started for each seed that matches [1]. If the seed
denotes a downloaded file that is an archive, the resulting seed
record will include a pointer to each directly enclosed ‘child’ file,
as in [3]. Status ~depleted~ denotes that no more files are enclosed
within this file. For each child, a new seed record of the form [4]
is added to the seedlist.

If the seed denotes a downloaded file that contains data, its seed
record is updated to have status ~unarchived~. We must determine the
character encoding of the data file in order to be able to read it.
Unfortunately, this can only be determined heuristically. We perform
the following steps:
1. We look for a Unicode Byte Order Mark (BOM), which indicates
   that the file has a Unicode encoding.
2. If no BOM is present, we use /uchardet/ in order to guess the
   encoding. If the encoding is incompatible with Unicode[fn::An
   example of a common encoding that is compatible with Unicode is
   (US-)ASCII.], we recode the entire file using /iconv/.
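A minimal sketch of the second step, assuming the external ~uchardet~
executable is available on the ~PATH~ (~guess_encoding/2~ is an
invented name):

#+BEGIN_SRC prolog
:- use_module(library(process)).
:- use_module(library(readutil)).

%! guess_encoding(+File:atom, -Encoding:atom) is det.
%
% Run uchardet on File and read the guessed encoding from its
% standard output.
guess_encoding(File, Encoding) :-
  process_create(path(uchardet), [file(File)], [stdout(pipe(Out))]),
  read_line_to_string(Out, Line),
  close(Out),
  atom_string(Encoding, Line).
#+END_SRC

Candidates for the unpacking stage have the following form: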
#+BEGIN_SRC prolog
$(ARCHIVE_HASH){status: filed}
#+END_SRC

While unpacking:
#+BEGIN_SRC prolog
$(ENTRY_HASH){parent: $(ARCHIVE_HASH), status: unarchiving}
#+END_SRC

After unpacking:
#+BEGIN_SRC prolog
$(ENTRY_HASH){status: unarchived} % leaf node
$(ARCHIVE_HASH){children: [$(ENTRY_HASH)], status: depleted} % non-leaf node
$(ENTRY_HASH){parent: $(ARCHIVE_HASH), status: filed} % future processing
#+END_SRC
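Enumerating the directly enclosed entries of an archive, each of
which becomes a new seed, is supported by SWI-Prolog's
library(archive); a minimal sketch (~file_children/2~ is an invented
name):

#+BEGIN_SRC prolog
:- use_module(library(archive)).

%! file_children(+File:atom, -Children:list(atom)) is det.
%
% Children are the pathnames of the entries enclosed in File.
file_children(File, Children) :-
  archive_entries(File, Children).
#+END_SRC

** Guess the Media Type / RDF serialization format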
#+BEGIN_SRC prolog
$(ENTRY_HASH){status: unarchived}
$(ENTRY_HASH){status: guessing}
$(ENTRY_HASH){format: $(FORMAT), status: guessed}
#+END_SRC

~$(FORMAT)~ is one of the following values:
1. JSON-LD
2. N-Quads
3. N-Triples
4. RDF/XML
5. RDFa
6. TriG
7. Turtle
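As an illustration of where these values come from, a guess based
purely on an advertised Media Type could be tabulated as follows (the
actual guesser also inspects the content; ~media_type_format/2~ is an
invented name):

#+BEGIN_SRC prolog
%! media_type_format(?MediaType:atom, ?Format:atom) is nondet.
%
% Illustrative mapping from Media Types to $(FORMAT) values.
media_type_format('application/ld+json',   'JSON-LD').
media_type_format('application/n-quads',   'N-Quads').
media_type_format('application/n-triples', 'N-Triples').
media_type_format('application/rdf+xml',   'RDF/XML').
media_type_format('text/html',             'RDFa').
media_type_format('application/trig',      'TriG').
media_type_format('text/turtle',           'Turtle').
#+END_SRC

** Parsing the RDF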
#+BEGIN_SRC prolog
$(ENTRY_HASH){format: $(FORMAT), status: guessed}
$(ENTRY_HASH){status: parsing}
$(CLEAN_HASH){dirty: $(ENTRY_HASH), status: cleaned} % clean file
$(ENTRY_HASH){clean: $(CLEAN_HASH), status: parsed} % dirty file
#+END_SRC