https://github.com/LOD-Laundromat/LOD-Laundromat


#+TITLE: LOD-Laundromat: Cleaning other people's dirty data
#+AUTHOR: Wouter Beek

*Closed for maintenance! May be back soon*

LOD Laundromat is a standards-compliant implementation for crawling and
cleaning Linked Open Data (LOD). Since data cleaning is often
reported as comprising 80% of a data analyst's workload, it would be
great if we could automate at least a large part of it.

With the LOD Laundromat, data is cleaned in parallel by an arbitrary
number of /Washing Machine/ threads. The collection of washing
machines can be monitored from a main /Laundromat/ thread.
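
The division of labour between washing machines and the laundromat
can be sketched as follows. This is only an illustration in Python,
not the project's own code (the snippets in this README are Prolog).
The names ~washing_machine~ and ~run_laundromat~, the ~None~ sentinel,
and the placeholder "cleaning" step are assumptions made for the
example.

#+BEGIN_SRC python
import queue
import threading


def washing_machine(jobs, reports):
    """Worker thread: take URIs from the job queue until told to stop."""
    while True:
        uri = jobs.get()
        if uri is None:  # hypothetical stop signal sent by the laundromat
            break
        # Placeholder for the real work (download, unpack, guess, parse).
        reports.put((threading.current_thread().name, uri))


def run_laundromat(uris, number_of_machines=3):
    """Main thread: start the machines, hand out jobs, collect reports."""
    jobs, reports = queue.Queue(), queue.Queue()
    machines = [
        threading.Thread(
            target=washing_machine,
            args=(jobs, reports),
            name=f"washing_machine_{i}",
        )
        for i in range(number_of_machines)
    ]
    for machine in machines:
        machine.start()
    for uri in uris:
        jobs.put(uri)
    for _ in machines:
        jobs.put(None)  # one stop signal per machine
    for machine in machines:
        machine.join()
    results = []
    while not reports.empty():
        results.append(reports.get())
    return results
#+END_SRC

Each report pairs the name of the washing machine with the URI it
handled, which is the kind of information a monitoring thread could
display.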

* Data discovery

** Sitemap

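#+BEGIN_SRC xml
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>$(URI)</loc>
    <lastmod>$(DATE_TIME)</lastmod>
    <changefreq>$(DURATION)</changefreq>
    <priority>$(FLOAT)</priority>
  </url>
</urlset>
#+END_SRC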

** Semantic sitemap

#+BEGIN_SRC xml


$(STRING)
$(URI)
$(IRI)
$(DURATION)

#+END_SRC

* Job scheduling

* The cleaning process

** Adding a seedpoint

Adding a new URI to the seedlist results in the following
registration:

#+BEGIN_SRC prolog
$(URI_HASH){
relative: $(BOOL),
status: added,
uri: $(URI)
}
#+END_SRC

We store seedlist registrations in SWI-Prolog dictionaries (dicts).
SWI-Prolog dicts are very similar to the JSON format for data
exchange. The main difference is the tag ~$(URI_HASH)~ in front of
the braces, which we will use to link seedlist registrations to one
another.

The value of ~$(URI_HASH)~ is computed in the following way:
1. Take the value of ~$(URI)~.
2. Map uppercase characters that appear in the scheme or host
   components to their corresponding lowercase characters[fn::See
   §6.2.2.1 of RFC 3986
   (https://tools.ietf.org/html/rfc3986#section-6.2.2.1)].
3. Map lowercase characters that denote hexadecimal digits within a
   percent-encoded octet to their corresponding uppercase
   characters[fn::See §6.2.2.1 of RFC 3986
   (https://tools.ietf.org/html/rfc3986#section-6.2.2.1)].
4. Decode percent-encoded octets that denote unreserved
   characters[fn::See §6.2.2.2 of RFC 3986
   (https://tools.ietf.org/html/rfc3986#section-6.2.2.2)].
5. Remove the dot-segments ~.~ and ~..~ from the path, as is done
   during reference resolution[fn::See §6.2.2.3 of RFC 3986
   (https://tools.ietf.org/html/rfc3986#section-6.2.2.3)].
6. Take the MD5 hash of the resulting URI.
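
The six steps can be sketched in Python as follows. This is a minimal
illustration, not the project's implementation. The function names,
the UTF-8 encoding of the URI before hashing, and the hexadecimal form
of the digest are assumptions. Dot-segments are removed with the
~remove_dot_segments~ routine of RFC 3986 §5.2.4.

#+BEGIN_SRC python
import hashlib
import re
from urllib.parse import urlsplit, urlunsplit

# Unreserved characters as defined in RFC 3986, section 2.3.
UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
PERCENT_OCTET = re.compile(r"%([0-9A-Fa-f]{2})")


def normalize_percent_encoding(text):
    """Steps 3 and 4: uppercase hex digits, decode unreserved characters."""
    def fix(match):
        char = chr(int(match.group(1), 16))
        return char if char in UNRESERVED else "%" + match.group(1).upper()
    return PERCENT_OCTET.sub(fix, text)


def remove_dot_segments(path):
    """Step 5: the remove_dot_segments routine of RFC 3986, section 5.2.4."""
    output = []
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./") or path.startswith("/./"):
            path = path[2:]
        elif path == "/.":
            path = "/"
        elif path.startswith("/../") or path == "/..":
            path = "/" + path[4:]
            if output:
                output.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment, including a leading "/", to the output.
            end = path.find("/", 1 if path.startswith("/") else 0)
            end = len(path) if end == -1 else end
            output.append(path[:end])
            path = path[end:]
    return "".join(output)


def normalize_uri(uri):
    """Steps 1 to 5: syntax-based normalization of a URI."""
    parts = urlsplit(uri)
    # Step 2: only the host (and port) is lowercased, not the userinfo.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    netloc = normalize_percent_encoding(userinfo + at + hostport.lower())
    path = remove_dot_segments(normalize_percent_encoding(parts.path))
    query = normalize_percent_encoding(parts.query)
    fragment = normalize_percent_encoding(parts.fragment)
    return urlunsplit((parts.scheme.lower(), netloc, path, query, fragment))


def uri_hash(uri):
    """Step 6: MD5 of the normalized URI (UTF-8 and hex are assumptions)."""
    return hashlib.md5(normalize_uri(uri).encode("utf-8")).hexdigest()
#+END_SRC

With this sketch, two spellings of the same URI that differ only in
case, percent-encoding of unreserved characters, or dot-segments
receive the same hash, which is what makes the hash usable as a key
for linking registrations.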

We allow relative URIs to be added to the seedlist, denoted by the
Boolean property ~relative~. We cannot do anything useful with
relative URIs, because their download location is unknown due to a
missing host machine name. Still, we want to be able to quantify how
often a dataset is erroneously denoted by a relative URI.

The ~status~ property is going to keep track of the URI throughout the
data cleaning process. Its initial state is ~added~, which means that
it is added to the seedlist.

The ~uri~ property stores the URI itself. ~interval~ denotes the time
between consecutive crawls, expressed in seconds. ~processed~ denotes
the time of the last crawl, expressed in seconds since the epoch.

#+BEGIN_SRC prolog
$(URI_HASH){
added: $(ADDED),
child: $(HASH),
interval: $(INTERVAL),
processed: $(PROCESSED),
uri: $(URI)
}
#+END_SRC

** Downloading the file stream

This stage takes seeds that match the pattern in (1), and changes them
to match pattern (2) during the download process. If the download
fails we only have metadata ~$(HTTP_META)~ about the TCP and/or HTTP
communication process, resulting in a seed record with pattern (3).
If the download succeeds, there is also content metadata
~$(CONTENT_META)~, resulting in a seed record with pattern (4).

A seed is stale, and therefore a candidate for re-downloading, if
~$(PROCESSED) + $(INTERVAL) < $(NOW)~.
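
As a small illustration, the staleness test can be written as a
Python predicate. The function name ~is_stale~ and the use of a plain
dictionary with ~processed~ and ~interval~ keys are assumptions made
for the example. Both values are assumed to be in seconds.

#+BEGIN_SRC python
import time


def is_stale(seed, now=None):
    """Return True if the seed is due for another download."""
    # Default to the current time in seconds since the epoch.
    now = time.time() if now is None else now
    return seed["processed"] + seed["interval"] < now
#+END_SRC

Note that the comparison is strict: a seed whose interval ends exactly
at ~now~ is not yet considered stale.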

While downloading:

#+BEGIN_SRC prolog
$(DOWNLOAD_HASH){
parent: $(SEED_HASH),
status: downloading
}
#+END_SRC

After downloading:

#+BEGIN_SRC prolog
$(DOWNLOAD_HASH){
http: [$(HTTP_META)],
newline: $(NEWLINE), %
number_of_bytes: $(NONNEG), %
number_of_chars: $(NONNEG), %
number_of_lines: $(NONNEG), %
parent: $(SEED_HASH),
status: filed,
timestamp: $(BEGIN)-$(END)
}
#+END_SRC

The record includes the ~$(BEGIN)~ and ~$(END)~ times of the
download.

~$(HTTP_META)~ has the following form:

#+BEGIN_SRC prolog
http{
headers: $(HTTP_HEADERS),
status: $(STATUS_CODE),
uri: $(URI),
version: version{major: $(NONNEG), minor: $(NONNEG)},
walltime: $(FLOAT)
}
#+END_SRC

** Unpacking the file stream

This stage is started for each seed that matches [1]. If the seed
denotes a downloaded file that is an archive, the resulting seed
record will include a pointer to each directly included ‘child’ file,
as in [3]. Status ~depleted~ denotes that no more files are enclosed
within this file. For each child, a new seed record of the form [4]
is added to the seedlist.

If the seed denotes a downloaded file that contains data, its seed
record is updated to have status ~unarchived~. We must determine the
character encoding of the data file in order to be able to read it.
Unfortunately, this can only be determined heuristically. We perform
the following steps:
1. We look for a Unicode Byte Order Mark (BOM), which indicates
   that the file has a Unicode encoding.
2. If no BOM is present, we use /uchardet/ to guess the
   encoding. If the encoding is incompatible with Unicode[fn::An
   example of a common encoding that is compatible with Unicode is
   (US-)ASCII.], we recode the entire file using /iconv/.
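
The first step can be sketched in Python with the BOM constants from
the standard ~codecs~ module. This only illustrates the BOM check.
The guessing and recoding of step 2 rely on external tools and are not
shown. The function name ~detect_bom~ and its return value are
assumptions made for the example.

#+BEGIN_SRC python
import codecs

# UTF-32 is listed before UTF-16 because the UTF-32-LE BOM (FF FE 00 00)
# starts with the UTF-16-LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(head):
    """Return (encoding, bom_length) for the first bytes of a file, or None."""
    for bom, encoding in BOMS:
        if head.startswith(bom):
            return encoding, len(bom)
    return None
#+END_SRC

A caller would pass the first four bytes of the file and fall back to
step 2 when the result is ~None~.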

Candidates for the unpacking stage have the following form:

#+BEGIN_SRC prolog
$(ARCHIVE_HASH){status: filed}
#+END_SRC

While unpacking:

#+BEGIN_SRC prolog
$(ENTRY_HASH){parent: $(ARCHIVE_HASH), status: unarchiving}
#+END_SRC

After unpacking:

#+BEGIN_SRC prolog
$(ENTRY_HASH){status: unarchived} % leaf node
$(ARCHIVE_HASH){children: [$(ENTRY_HASH)], status: depleted} % non-leaf node
$(ENTRY_HASH){parent: $(ARCHIVE_HASH), status: filed} % future processing
#+END_SRC

** Guessing the Media Type / RDF serialization format

#+BEGIN_SRC prolog
$(ENTRY_HASH){status: unarchived}
$(ENTRY_HASH){status: guessing}
$(ENTRY_HASH){format: $(FORMAT), status: guessed}
#+END_SRC

~$(FORMAT)~ is one of the following values:
1. JSON-LD
2. N-Quads
3. N-Triples
4. RDF/XML
5. RDFa
6. TriG
7. Turtle
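
This README does not describe how the guess is made. As one possible
starting point, the Media Type reported in an HTTP ~Content-Type~
header can be mapped to the format names above. The sketch below is
an assumption for illustration only: the function name and the
decision to map both HTML Media Types to RDFa (which has no Media Type
of its own) are not taken from the project.

#+BEGIN_SRC python
# Registered Media Types for the serialization formats listed above.
FORMAT_BY_MEDIA_TYPE = {
    "application/ld+json": "JSON-LD",
    "application/n-quads": "N-Quads",
    "application/n-triples": "N-Triples",
    "application/rdf+xml": "RDF/XML",
    "application/xhtml+xml": "RDFa",  # RDFa is embedded in a host language
    "text/html": "RDFa",
    "application/trig": "TriG",
    "text/turtle": "Turtle",
}


def format_from_content_type(value):
    """Map a Content-Type header value to a format name, or None."""
    # Drop parameters such as "; charset=UTF-8" and ignore case.
    media_type = value.split(";", 1)[0].strip().lower()
    return FORMAT_BY_MEDIA_TYPE.get(media_type)
#+END_SRC

Because servers often report an incorrect or generic Media Type, such
a lookup can only serve as a hint that is checked against the file
content.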

** Parsing the RDF

#+BEGIN_SRC prolog
$(ENTRY_HASH){format: $(FORMAT), status: guessed}
$(ENTRY_HASH){status: parsing}
$(CLEAN_HASH){dirty: $(ENTRY_HASH), status: cleaned} % clean file
$(ENTRY_HASH){clean: $(CLEAN_HASH), status: parsed} % dirty file
#+END_SRC