Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.
Awesome Lists | Featured Topics | Projects
https://github.com/webis-de/archive-query-log

📜 The Archive Query Log.
https://github.com/webis-de/archive-query-log
information-retrieval information-retrieval-history internet-archive query-log search-engine-result-page serp wayback-machine web-archive
Last synced: about 2 months ago
JSON representation
📜 The Archive Query Log.
Host: GitHub
URL: https://github.com/webis-de/archive-query-log
Owner: webis-de
License: mit
Created: 2023-02-22T10:37:35.000Z (almost 2 years ago)
Default Branch: main
Last Pushed: 2024-04-11T11:12:02.000Z (9 months ago)
Last Synced: 2024-04-14T10:11:33.094Z (9 months ago)
Topics: information-retrieval, information-retrieval-history, internet-archive, query-log, search-engine-result-page, serp, wayback-machine, web-archive
Language: Jupyter Notebook
Homepage: https://tira.io/task/archive-query-log
Size: 52.2 MB
Stars: 20
Watchers: 19
Forks: 0
Open Issues: 11
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project

README

        [![CI](https://img.shields.io/github/actions/workflow/status/webis-de/archive-query-log/ci.yml?branch=main&style=flat-square)](https://github.com/webis-de/archive-query-log/actions/workflows/ci.yml)

[![Code coverage](https://img.shields.io/codecov/c/github/webis-de/archive-query-log?style=flat-square)](https://codecov.io/github/webis-de/archive-query-log/)

[![arXiv preprint](https://img.shields.io/badge/arXiv-2304.00413-blue?style=flat-square)](https://arxiv.org/abs/2304.00413)

[![Papers with Code](https://img.shields.io/badge/papers%20with%20code-AQL--22-blue?style=flat-square)](https://paperswithcode.com/paper/the-archive-query-log-mining-millions-of)

[![Issues](https://img.shields.io/github/issues/webis-de/archive-query-log?style=flat-square)](https://github.com/webis-de/archive-query-log/issues)

[![Commit activity](https://img.shields.io/github/commit-activity/m/webis-de/archive-query-log?style=flat-square)](https://github.com/webis-de/archive-query-log/commits)

[![License](https://img.shields.io/github/license/webis-de/archive-query-log?style=flat-square)](LICENSE)

# 📜 The Archive Query Log

Mining Millions of Search Result Pages of Hundreds of Search Engines from 25 Years of Web Archives.

[![Queries TSNE](docs/queries-tsne-teaser.png)](docs/queries-tsne.png)

Start now by running [your custom analysis/experiment](#integrations), scraping [your own query log](#tldr), or just look at [our example files](data/examples).

## Contents

- [Integrations](#integrations)

- [Installation](#installation)

- [Usage](#tldr)

- [Development](#development)

- [Third-party Resources](#third-party-resources)

- [Contribute](#contribute)

- [Abstract](#abstract)

## Integrations

### Running Experiments on the AQL

The data in the Archive Query Log is highly sensitive (still, you can [re-crawl everything from the Wayback Machine](#usage)). For that reason, we ensure that custom experiments or analyises can not leak sensitive data (please [get in touch](#contribute) if you have questions) by using [TIRA](https://tira.io) as a platform for custom analyses/experiments. In TIRA, you submit a Docker image that implements your experiment. Your software is then executed in sandboxed mode (without internet connection) to ensure that your software does not leak sensitive information. After your software execution finished, administrators will review your submission and unblind it so that you can access the outputs.  

Please refer to our [dedicated TIRA tutorial](integrations/tira/) as starting point for your experiments.

## Installation

1. Install [Python 3.10](https://python.org/downloads/)

2. Create and activate virtual environment:

    ```shell

    python3.10 -m venv venv/

    source venv/bin/activate

    ```

4. Install dependencies:

    ```shell

    pip install -e .

    ```

## Usage

To quickly scrape a sample query log, jump to the [TL;DR](#tldr).

If you want to learn more about each step here are some more detailed guides:

1. [Search providers](#1-search-providers)

2. [Fetch archived URLs](#2-archived-urls)

3. [Parse archived query URLs](#3-archived-query-urls)

4. [Download archived raw SERPs](#4-archived-raw-serps)

5. [Parse archived SERPs](#5-archived-parsed-serps)

### TL;DR

Let's start with a small example and construct a query log for the [ChatNoir](https://chatnoir.eu) search engine:

1. `python -m archive_query_log make archived-urls chatnoir`

2. `python -m archive_query_log make archived-query-urls chatnoir`

3. `python -m archive_query_log make archived-raw-serps chatnoir`

4. `python -m archive_query_log make archived-parsed-serps chatnoir`

Got the idea? Now you're ready to scrape your own query logs! To scale things up and understand the data, just keep on reading. For more details on how to add more search providers, see [below](#contribute).

### 1. Search providers

Manually or semi-automatically collect a list of search providers that you would like to scrape query logs from.

The list of search providers should be stored in a single [YAML][yaml-spec] file at [`data/selected-services.yaml`](data/selected-services.yaml) and contain one entry per search provider, like shown below:

```yaml

- name: string               # search providers name (alexa_domain - alexa_public_suffix)

  public_suffix: string      # public suffix (https://publicsuffix.org/) of alexa_domain

  alexa_domain: string       # domain as it appears in Alexa top-1M ranks

  alexa_rank: int            # rank from fused Alexa top-1M rankings

  category: string           # manual annotation

  notes: string              # manual annotation

  input_field: bool          # manual annotation

  search_form: bool          # manual annotation

  search_div: bool           # manual annotation

  domains: # known domains of the search providers (including the main domain)

    - string

    - string

    - ...

  query_parsers: # query parsers in order of precedence

    - pattern: regex

      type: query_parameter    # for URLs like https://example.com/search?q=foo

      parameter: string

    - pattern: regex

      type: fragment_parameter # for URLs like https://example.com/search#q=foo

      parameter: string

    - pattern: regex

      type: query_parameter    # for URLs like https://example.com/search/foo

      path_prefix: string

    - ...

  page_parsers: # page number parsers in order of precedence

    - pattern: regex

      type: query_parameter    # for URLs like https://example.com/search?page=2

      parameter: string

    - ...

  offset_parsers: # page offset parsers in order of precedence

    - pattern: regex

      type: query_parameter    # for URLs like https://example.com/search?start=11

      parameter: string

    - ...

  interpreted_query_parsers: # interpreted query parsers in order of precedence

    - ...

  results_parsers: # search result and snippet parsers in order of precedence

    - ...

- ...

```

In the source code, a search provider corresponds to the Python class [`Service`](archive_query_log/model/__init__.py).

### 2. Archived URLs

Fetch all archived URLs for a search provider from the Internet Archive's Wayback Machine.

You can run this step with the following command line, where `` is the name of the search provider you want to fetch archived URLs from:

```shell:

python -m archive_query_log make archived-urls 

```

This will create multiple files in the `archived-urls` subdirectory under the [data directory](#pro-tip--specify-a-custom-data-directory), based on the search provider's name (``), domain (``), and the Wayback Machine's CDX [page number][cdx-pagination] (``) from which the URLs were originally fetched:

```

/archived-urls///.jsonl.gz

```

Here, the `` is a 10-digit number with leading zeros, e.g., `0000000001`.

Each individual file is a GZIP-compressed [JSONL][jsonl-spec] file with one archived URL per line, in arbitrary order. Each line contains the following fields:

```json

{

  "url": "string",

  // archived URL

  "timestamp": "int"

  // archive timestamp as POSIX integer

}

```

In the source code, an archived URL corresponds to the Python class [`ArchivedUrl`](archive_query_log/model/__init__.py).

### 3. Archived Query URLs

Parse and filter archived URLs that contain a query and may point to a search engine result page (SERP).

You can run this step with the following command line, where `` is the name of the search provider you want to parse query URLs from:

```shell:

python -m archive_query_log make archived-query-urls 

```

This will create multiple files in the `archived-query-urls` subdirectory under the [data directory](#pro-tip--specify-a-custom-data-directory), based on the search provider's name (``), domain (``), and the Wayback Machine's CDX [page number][cdx-pagination] (``) from which the URLs were originally fetched:

```

/archived-query-urls///.jsonl.gz

```

Here, the `` is a 10-digit number with leading zeros, e.g., `0000000001`.

Each individual file is a GZIP-compressed [JSONL][jsonl-spec] file with one archived query URL per line, in arbitrary order. Each line contains the following fields:

```json

{

  "url": "string",

  // archived URL

  "timestamp": "int",

  // archive timestamp as POSIX integer

  "query": "string",

  // parsed query

  "page": "int",

  // result page number (optional)

  "offset": "int"

  // result page offset (optional)

}

```

In the source code, an archived query URL corresponds to the Python class [`ArchivedQueryUrl`](archive_query_log/model/__init__.py).

### 4. Archived Raw SERPs

Download the raw HTML content of archived search engine result pages (SERPs).

You can run this step with the following command line, where `` is the name of the search provider you want to download raw SERP HTML contents from:

```shell:

python -m archive_query_log make archived-raw-serps 

```

This will create multiple files in the `archived-urls` subdirectory under the [data directory](#pro-tip--specify-a-custom-data-directory), based on the search provider's name (``), domain (``), and the Wayback Machine's CDX [page number][cdx-pagination] (``) from which the URLs were originally fetched. Archived raw SERPs are stored as 1GB-sized WARC chunk files, that is, WARC chunks are "filled" sequentially up to a size of 1GB each. If a chunk is full, a new chunk is created.

```

/archived-raw-serps////.jsonl.gz

```

Here, the `` and `` are both 10-digit numbers with leading zeros, e.g., `0000000001`.

Each individual file is a GZIP-compressed [WARC][warc-spec] file with one WARC request and one WARC response per archived raw SERP. WARC records are arbitrarily ordered within or across chunks, but the WARC request and response for the same archived query URL are kept together. The archived query URL is stored in the WARC request's and response's `Archived-URL` field in [JSONL][jsonl-spec] format (the same format as in the previous step):

```json

{

  "url": "string",

  // archived URL

  "timestamp": "int",

  // archive timestamp as POSIX integer

  "query": "string",

  // parsed query

  "page": "int",

  // result page number (optional)

  "offset": "int"

  // result page offset (optional)

}

```

In the source code, an archived raw SERP corresponds to the Python class [`ArchivedRawSerp`](archive_query_log/model/__init__.py).

### 5. Archived Parsed SERPs

Parse and filter archived SERPs from raw contents.

You can run this step with the following command line, where `` is the name of the search provider you want to parse SERPs from:

```shell:

python -m archive_query_log make archived-parsed-serps 

```

This will create multiple files in the `archived-serps` subdirectory under the [data directory](#pro-tip--specify-a-custom-data-directory), based on the search provider's name (``), domain (``), and the Wayback Machine's CDX [page number][cdx-pagination] (``) from which the URLs were originally fetched:

```

/archived-serps///.jsonl.gz

```

Here, the `` is a 10-digit number with leading zeros, e.g., `0000000001`.

Each individual file is a GZIP-compressed [JSONL][jsonl-spec] file with one archived parsed SERP per line, in arbitrary order. Each line contains the following fields:

```json

{

  "url": "string",

  // archived URL

  "timestamp": "int",

  // archive timestamp as POSIX integer

  "query": "string",

  // parsed query

  "page": "int",

  // result page number (optional)

  "offset": "int",

  // result page offset (optional)

  "interpreted_query": "string",

  // query displayed on the SERP (e.g. with spelling correction; optional)

  "results": [

    {

      "url": "string",

      // URL of the result

      "title": "string",

      // title of the result

      "snippet": "string"

      // snippet of the result (highlighting normalized to )

    },

    ...

  ]

}

```


In the source code, an archived parsed SERP corresponds to the Python class [`ArchivedParsedSerp`](archive_query_log/model/__init__.py).

### Pro Tip: Specify a Custom Data Directory

By default, the data directory is set to [`data/`](data). You can change this with the `--data-directory` option, e.g.:

```shell

python -m archive_query_log make archived-urls --data-directory /mnt/ceph/storage/data-in-progress/data-research/web-search/web-archive-query-log/

```

### Pro Tip: Limit Scraping for Testing

If the search provider you're scraping queries for is very large and has many domains, testing your settings on a smaller sample from that search provider can be helpful. You can specify a single domain to scrape from like this:

```shell

python -m archive_query_log make archived-urls  

```

If a domain is very popular and therefore has many archived URLs,

you can further limit the number of archived URLs to scrape by selecting

a [page](https://github.com/internetarchive/wayback/blob/master/wayback-cdx-server/README.md#pagination-api)

from the Wayback Machine's

[CDX API](https://github.com/internetarchive/wayback/blob/master/wayback-cdx-server/README.md#pagination-api):

```shell

python -m archive_query_log make archived-urls   

```

## Citation

If you use the Archive Query Log dataset or the code to generate it in your research, please cite the following paper describing the AQL and its use-cases:

> TODO

You can use the following BibTeX entry for citation:

```bibtex

% TODO

```

## Development

Run tests:

```shell

flake8 archive_query_log

pylint -E archive_query_log

pytest archive_query_log

```

Add new tests for parsers:

1. Select the number of tests to run per service and the number of services.

2. Auto-generate unit tests and download WARCs with [generate_tests.py](archive_query_log/results/test/generate_tests.py)

3. Run the tests.

4. Failing tests will open a diff editor with the approval and a web browser tab with the Wayback URL.

5. Use the web browser dev tools to find the query input field and search result CSS paths.

6. Close diffs and tabs and re-run tests.

## Third-party Resources

- [Kaggle dataset of the manual test SERPs](https://www.kaggle.com/datasets/federicominutoli/awesome-archive-query-log), thanks to @DiTo97

## Contribute

If you've found an important search provider to be missing from this query log, please suggest it by creating an [issue][repo-issues]. We also very gratefully accept [pull requests][repo-prs] for adding [search providers](#1-search-providers) or new parser configurations!

If you're unsure about anything, post an [issue][repo-issues], or contact us:

- [[email protected]](mailto:[email protected])

- [[email protected]](mailto:[email protected])

- [[email protected]](mailto:[email protected])

- [[email protected]](mailto:[email protected])

- [[email protected]](mailto:[email protected])

- [[email protected]](mailto:[email protected])

- [[email protected]](mailto:[email protected])

- [[email protected]](mailto:[email protected])

We're happy to help!

## License

This repository is released under the [MIT license](LICENSE). Files in the `data/` directory are exempt from this license.

If you use the AQL in your research, we'd be glad if you'd [cite us](#citation).

## Abstract

The Archive Query Log (AQL) is a previously unused, comprehensive query log collected at the Internet Archive over the last 25 years. Its first version includes 356 million queries, 166 million search result pages, and 1.7 billion search results across 550 search providers. Although many query logs have been studied in the literature, the search providers that own them generally do not publish their logs to protect user privacy and vital business data. Of the few query logs publicly available, none combines size, scope, and diversity. The AQL is the first to do so, enabling research on new retrieval models and (diachronic) search engine analyses. Provided in a privacy-preserving manner, it promotes open research as well as more transparency and accountability in the search industry.

[repo-issues]: https://git.webis.de/code-research/web-search/web-archive-query-log/-/issues

[repo-prs]: https://git.webis.de/code-research/web-search/web-archive-query-log/-/merge_requests

[cdx-pagination]: https://github.com/internetarchive/wayback/blob/master/wayback-cdx-server/README.md#pagination-api

[warc-spec]: https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/

[jsonl-spec]: https://jsonlines.org/

[yaml-spec]: https://yaml.org/