https://github.com/tanmaykm/commoncrawl.jl
Interface to common crawl dataset on Amazon S3
- Host: GitHub
- URL: https://github.com/tanmaykm/commoncrawl.jl
- Owner: tanmaykm
- License: mit
- Created: 2013-09-24T10:08:49.000Z (about 11 years ago)
- Default Branch: master
- Last Pushed: 2015-09-08T04:55:00.000Z (about 9 years ago)
- Last Synced: 2023-05-05T19:51:38.546Z (over 1 year ago)
- Language: Julia
- Size: 222 KB
- Stars: 4
- Watchers: 2
- Forks: 5
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
README
CommonCrawl.jl
==============

[![Build Status](https://travis-ci.org/tanmaykm/CommonCrawl.jl.png)](https://travis-ci.org/tanmaykm/CommonCrawl.jl)

Interface to the [Common Crawl dataset on Amazon S3](http://aws.amazon.com/datasets/41740)
## Usage
An instance of the corpus is obtained as:
````
cc = CrawlCorpus(cache_location::String, debug::Bool=false)
````
Since the crawl corpus files are large, they are cached locally by default at `cache_location`. The first time a file is accessed, it is downloaded in full into the cache location. Subsequent reads are served from the local copy.

All cached files, or a particular cached archive file, can be deleted:
````
clear_cache(cc::CrawlCorpus)
clear_cache(cc::CrawlCorpus, archive::URI)
````

Segments and archive files in a segment can be listed as:
````
segment_names = segments(cc::CrawlCorpus)
archive_uris = archives(cc::CrawlCorpus, segment::String)
````

Archive files across all segments can be accessed easily as:
````
archive_uris = archives(cc::CrawlCorpus, count::Int=0)
````
Passing `count` as `0` lists all available archive files (the list can be large).

A particular archive file can be opened as:
````
open(cc::CrawlCorpus, archive::URI)
````

And crawl entries can be read from an opened archive as:
````
entry = read_entry(cc::CrawlCorpus, f::IO, mime_part::String="", metadata_only::Bool=false)
entries = read_entries(cc::CrawlCorpus, f::IO, mime_part::String="", num_entries::Int=0, metadata_only::Bool=false)
````
Method `read_entry` returns an `ArchiveEntry` instance corresponding to the next entry in the file whose mime type begins with `mime_part`. Method `read_entries` returns an array of `ArchiveEntry` objects. If `num_entries` is `0`, all matching entries in the archive file are returned. If `metadata_only` is `true`, only the file metadata (url and mime type) is populated in the entries.
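
The selection rules described above can be illustrated without the package or any network access. The following is only a sketch of the documented semantics (mime-type prefix match, `0` meaning "no limit", and metadata-only entries), written for a current Julia release. The `Entry` struct, its `data` field, and the `select_entries` function are hypothetical stand-ins invented for this example; they are not part of CommonCrawl.jl, and the real `ArchiveEntry` type may be laid out differently.

````julia
# Hypothetical stand-in for ArchiveEntry; the real type's fields may differ.
struct Entry
    url::String
    mime::String
    data::Union{Nothing,String}   # `nothing` when only metadata is populated
end

# Return the entries whose mime type begins with `mime_part`.
# `num_entries == 0` means "no limit"; `metadata_only` drops the payload.
function select_entries(entries, mime_part::AbstractString="",
                        num_entries::Integer=0, metadata_only::Bool=false)
    selected = Entry[]
    for e in entries
        startswith(e.mime, mime_part) || continue
        push!(selected, metadata_only ? Entry(e.url, e.mime, nothing) : e)
        # Stop early once the requested number of entries has been collected.
        (num_entries > 0 && length(selected) >= num_entries) && break
    end
    return selected
end

# Made-up sample data, not real crawl content.
sample = [
    Entry("http://example.com/a", "text/html", "<html></html>"),
    Entry("http://example.com/b", "image/png", "..."),
    Entry("http://example.com/c", "text/plain", "hello"),
]

text_entries = select_entries(sample, "text/")           # every entry with a text/* mime type
first_only   = select_entries(sample, "", 1)             # empty prefix matches any mime type
meta_only    = select_entries(sample, "text/", 0, true)  # payloads replaced by `nothing`
````

An empty `mime_part` matches every entry because every string starts with the empty string, which is consistent with `""` being the default in the signatures above.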