https://github.com/tanmaykm/commoncrawl.jl

Interface to common crawl dataset on Amazon S3

Host: GitHub
URL: https://github.com/tanmaykm/commoncrawl.jl
Owner: tanmaykm
License: mit
Created: 2013-09-24T10:08:49.000Z (about 12 years ago)
Default Branch: master
Last Pushed: 2015-09-08T04:55:00.000Z (about 10 years ago)
Last Synced: 2025-02-12T22:44:50.782Z (8 months ago)
Language: Julia
Size: 222 KB
Stars: 5
Watchers: 3
Forks: 5
Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE

README

CommonCrawl.jl
==============

[![Build Status](https://travis-ci.org/tanmaykm/CommonCrawl.jl.png)](https://travis-ci.org/tanmaykm/CommonCrawl.jl)

Interface to the [Common Crawl dataset on Amazon S3](http://aws.amazon.com/datasets/41740)

## Usage

An instance of the corpus is obtained as:

````
cc = CrawlCorpus(cache_location::String, debug::Bool=false)
````

Because the crawl corpus files are large, they are cached locally by default at `cache_location`. The first time a file is accessed, it is downloaded in full into the cache location; subsequent reads are served from the local copy.
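The download-once behaviour described above can be illustrated with a small, self-contained sketch. It does not use CommonCrawl.jl itself: `cached_path` and `fake_fetch` are hypothetical names introduced only for this illustration, and writing to a temporary `.part` file before renaming is an assumption of the sketch, not something the package is documented to do.

```julia
# Hypothetical helper, NOT part of CommonCrawl.jl: returns a local path for `name`,
# calling `fetch(name, dest)` only when the file is not in the cache yet.
function cached_path(fetch::Function, cache_location::AbstractString, name::AbstractString)
    mkpath(cache_location)
    local_path = joinpath(cache_location, name)
    if !isfile(local_path)
        # Assumption of this sketch: download under a temporary name first, so an
        # interrupted transfer never looks like a complete cache entry.
        tmp_path = local_path * ".part"
        fetch(name, tmp_path)
        mv(tmp_path, local_path; force=true)
    end
    return local_path
end

# Stand-in for a real download: writes placeholder bytes and counts its calls.
download_count = Ref(0)
function fake_fetch(name, dest)
    download_count[] += 1
    write(dest, "contents of $name")
end

cache_dir = mktempdir()
p1 = cached_path(fake_fetch, cache_dir, "example.arc.gz")  # first access: fetches
p2 = cached_path(fake_fetch, cache_dir, "example.arc.gz")  # second access: served locally
```

After both calls, `download_count[]` is `1` and `p1 == p2`: the second access never reaches the fetch function.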

All cached files, or a single cached archive file, can be deleted:

````
clear_cache(cc::CrawlCorpus)
clear_cache(cc::CrawlCorpus, archive::URI)
````

Segments, and the archive files within a segment, can be listed as:

````
segment_names = segments(cc::CrawlCorpus)
archive_uris = archives(cc::CrawlCorpus, segment::String)
````

Archive files across all segments can be listed as:

````
archive_uris = archives(cc::CrawlCorpus, count::Int=0)
````

Passing `count` as `0` lists all available archive files (the resulting list can be large).

A particular archive file can be opened as:

````
open(cc::CrawlCorpus, archive::URI)
````

Crawl entries can then be read from an opened archive as:

````
entry = read_entry(cc::CrawlCorpus, f::IO, mime_part::String="", metadata_only::Bool=false)
entries = read_entries(cc::CrawlCorpus, f::IO, mime_part::String="", num_entries::Int=0, metadata_only::Bool=false)
````

`read_entry` returns an `ArchiveEntry` instance for the next entry in the file whose MIME type begins with `mime_part`. `read_entries` returns an array of `ArchiveEntry` objects. If `num_entries` is `0`, all matching entries in the archive file are returned. If `metadata_only` is `true`, only the metadata (URL and MIME type) is populated in the entries.
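The selection rules for `mime_part`, `num_entries` and `metadata_only` can be sketched on in-memory data. This is only an illustration of the semantics described above, not the package's implementation: `SketchEntry`, its field names and `select_entries` are hypothetical, and the real `ArchiveEntry` layout may differ.

```julia
# Hypothetical stand-in for ArchiveEntry; the field names are assumptions.
struct SketchEntry
    url::String
    mime::String
    data::Vector{UInt8}
end

# Mirrors the documented rules: prefix match on the MIME type, 0 meaning "no limit",
# and metadata_only leaving the body empty.
function select_entries(all_entries::Vector{SketchEntry}, mime_part::AbstractString="",
                        num_entries::Int=0, metadata_only::Bool=false)
    selected = SketchEntry[]
    for e in all_entries
        # An empty mime_part is a prefix of every string, so it matches everything.
        startswith(e.mime, mime_part) || continue
        body = metadata_only ? UInt8[] : e.data
        push!(selected, SketchEntry(e.url, e.mime, body))
        # num_entries == 0 never triggers the break, so all matches are returned.
        num_entries > 0 && length(selected) >= num_entries && break
    end
    return selected
end

sample = [
    SketchEntry("http://example.com/a", "text/html", Vector{UInt8}("<html>a</html>")),
    SketchEntry("http://example.com/b.png", "image/png", UInt8[0x89, 0x50]),
    SketchEntry("http://example.com/c", "text/plain", Vector{UInt8}("c")),
]

text_entries = select_entries(sample, "text/")        # both text/* entries
first_only   = select_entries(sample, "text/", 1)     # stops after the first match
meta_only    = select_entries(sample, "", 0, true)    # every entry, bodies left empty
```

With the real package, the equivalent call would presumably take the form `read_entries(cc, f, "text/", 1)`, following the signature shown above.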
