https://github.com/preciz/common_crawl

Work with Common Crawl data from Elixir.

Host: GitHub
URL: https://github.com/preciz/common_crawl
Owner: preciz
License: MIT
Created: 2021-12-02T15:11:40.000Z (almost 4 years ago)
Default Branch: main
Last Pushed: 2025-03-19T13:19:09.000Z (7 months ago)
Last Synced: 2025-03-19T14:26:48.419Z (7 months ago)
Topics: commoncrawl, elixir
Language: Elixir
Homepage: https://hex.pm/packages/common_crawl
Size: 183 KB
Stars: 2
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE

# CommonCrawl

[![test](https://github.com/preciz/common_crawl/actions/workflows/test.yml/badge.svg)](https://github.com/preciz/common_crawl/actions/workflows/test.yml)

Work with Common Crawl data from Elixir.

## Installation

Add `common_crawl` to your list of dependencies in `mix.exs`:

```elixir
def deps do
  [
    {:common_crawl, "~> 0.3.0"}
  ]
end
```

## Usage Examples

```elixir
# Get latest available crawl of a URL
{:ok, %{response: _, headers: _, warc: _}} = CommonCrawl.get_latest_for_url("https://example.com")

# Get list of available crawls
crawls = CommonCrawl.collinfo()

# Search for URLs in the index
crawl = List.first(crawls)

{:ok, results} = CommonCrawl.IndexAPI.get(crawl["cdx-api"], %{
  "url" => "example.com/*",
  "output" => "json"
})

# Download webpage content from WARC file
{url, timestamp, metadata} = List.first(results)

{:ok, segment} = CommonCrawl.WARC.get_segment(
  metadata["filename"],
  metadata["offset"],
  metadata["length"]
)

# Stream all entries from index files
CommonCrawl.Index.stream("CC-MAIN-2024-51")
|> Stream.filter(fn {_key, _timestamp, metadata} ->
  metadata["status"] == "200"
end)
|> Enum.take(10)

# Work with raw index files
{:ok, index_paths} = CommonCrawl.Index.get_all_paths("CC-MAIN-2021-43")
{:ok, index_file} = CommonCrawl.Index.get("CC-MAIN-2021-43", List.first(index_paths))
```
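
The WARC example passes `filename`, `offset`, and `length` from the index metadata to `get_segment/3`. A single record in a Common Crawl WARC file is typically fetched with an HTTP `Range` request covering exactly those bytes. The sketch below shows only that arithmetic. `WarcRangeSketch.header/1` is a hypothetical helper, not part of `common_crawl`, and it assumes `offset` and `length` arrive as decimal strings (as `status` does in the filter example). How the library builds its request internally is not confirmed here.

```elixir
defmodule WarcRangeSketch do
  @moduledoc false

  # Hypothetical helper for illustration; not part of the common_crawl package.
  # Assumes "offset" and "length" are decimal strings in the index metadata.
  def header(%{"offset" => offset, "length" => len}) do
    first = String.to_integer(offset)
    # HTTP byte ranges are inclusive, so the last byte is first + len - 1.
    last = first + String.to_integer(len) - 1
    {"range", "bytes=#{first}-#{last}"}
  end
end

WarcRangeSketch.header(%{"offset" => "1000", "length" => "250"})
# => {"range", "bytes=1000-1249"}
```

Keeping the range inclusive matters: requesting `offset` through `offset + length` would pull one extra byte from the next record.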

## Docs

Documentation can be found at [https://hexdocs.pm/common_crawl](https://hexdocs.pm/common_crawl).

## License

CommonCrawl is [MIT licensed](LICENSE).
