https://github.com/preciz/common_crawl
Work with Common Crawl data from Elixir.
https://github.com/preciz/common_crawl
commoncrawl elixir
Last synced: 6 months ago
JSON representation
Work with Common Crawl data from Elixir.
- Host: GitHub
- URL: https://github.com/preciz/common_crawl
- Owner: preciz
- License: mit
- Created: 2021-12-02T15:11:40.000Z (almost 4 years ago)
- Default Branch: main
- Last Pushed: 2025-03-19T13:19:09.000Z (7 months ago)
- Last Synced: 2025-03-19T14:26:48.419Z (7 months ago)
- Topics: commoncrawl, elixir
- Language: Elixir
- Homepage: https://hex.pm/packages/common_crawl
- Size: 183 KB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
Awesome Lists containing this project
README
# CommonCrawl
[](https://github.com/preciz/common_crawl/actions/workflows/test.yml)
Work with Common Crawl data from Elixir.
## Installation
Add `common_crawl` to your list of dependencies in `mix.exs`:
```elixir
def deps do
[
{:common_crawl, "~> 0.3.0"}
]
end
```## Usage Examples
```elixir
# Get latest available crawl of a URL
{:ok, %{response: _, headers: _, warc: _}} = CommonCrawl.get_latest_for_url("https://example.com")# Get list of available crawls
crawls = CommonCrawl.collinfo()# Search for URLs in the index
crawl = List.first(crawls)
{:ok, results} = CommonCrawl.IndexAPI.get(crawl["cdx-api"], %{
"url" => "example.com/*",
"output" => "json"
})# Download webpage content from WARC file
{url, timestamp, metadata} = List.first(results)
{:ok, segment} = CommonCrawl.WARC.get_segment(
metadata["filename"],
metadata["offset"],
metadata["length"]
)# Stream all entries from index files
CommonCrawl.Index.stream("CC-MAIN-2024-51")
|> Stream.filter(fn {_key, _timestamp, metadata} ->
metadata["status"] == "200"
end)
|> Enum.take(10)# Work with raw index files
{:ok, index_paths} = CommonCrawl.Index.get_all_paths("CC-MAIN-2021-43")
{:ok, index_file} = CommonCrawl.Index.get("CC-MAIN-2021-43", List.first(index_paths))
```## Docs
Documentation can be found at [https://hexdocs.pm/common_crawl](https://hexdocs.pm/common_crawl).## License
CommonCrawl is [MIT Licensed](LICENSE)