# SIGIR eCOM 2021 Data Challenge Dataset
_Public Data Release 1.0.0_

### Overview
Coveo hosted the [2021 SIGIR eCom](https://sigir-ecom.github.io) Data Challenge.
This repository contains utility scripts and the dataset, which is freely
available for research purposes (see below); the paper introducing the Challenge is
available as a [pre-print](https://arxiv.org/abs/2104.09423).

The original Data Challenge README (containing baseline information, design papers, solutions, etc.)
is archived in this repository as `README_DC_2021.md`. Background information
about the Challenge, the motivations behind the release, and some inspiring submissions
can be found in the original [paper](https://arxiv.org/abs/2104.09423), the archival section of `README_DC_2021.md`,
and the SIGIR [presentation](https://drive.google.com/file/d/1O0BSAhgJFzx1ddeExxAEGnP_836AftNT/view).

_Note: there have been some issues when downloading the file with Safari;
we suggest using Chrome for the download and sign-up process._

### License

The dataset is available for research and educational purposes at
[this page](https://www.coveo.com/en/ailabs/sigir-ecom-data-challenge).
To obtain the dataset, you are required to fill out a form with information about you
and your institution, and to agree to the Terms And Conditions for fair usage of the data.
For convenience, the Terms And Conditions are also included in plain `txt` format in this repo:
usage of the data implies acceptance of these Terms And Conditions.

### Dataset

#### Data Description

The dataset is provided as three big text files (`.csv`) - `browsing_train.csv`, `search_train.csv`, `sku_to_content.csv` -
inside a `zip` archive containing an additional copy of the _Terms And Conditions_. The final dataset contains 36M events,
and it is the first dataset of this kind to be released to the research community: please review the
[Data Challenge paper](https://arxiv.org/abs/2104.09423)
for a comparison with existing datasets and for the motivations behind the release format.
For your convenience, three sample files are included in the `start` folder, showcasing the data structure.
Below, you will find a detailed description for each file.
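For a quick first look, here is a minimal `pandas` sketch that loads each sample file and prints its columns and a few rows. It assumes the sample files in `start` mirror the dataset file names; adjust the paths if your checkout differs:

```python
import pandas as pd

# Sample files shipped in the `start` folder (assumed names; swap in
# the full dataset paths after unzipping the archive).
SAMPLE_FILES = [
    "start/browsing_train.csv",
    "start/search_train.csv",
    "start/sku_to_content.csv",
]

for path in SAMPLE_FILES:
    df = pd.read_csv(path)
    print(f"\n{path}: {len(df)} rows")
    print("columns:", list(df.columns))
    print(df.head(3))
```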

##### Browsing Events

The file `browsing_train.csv` contains almost 5M anonymized shopping [sessions](https://support.google.com/analytics/answer/2731565?hl=en).
The structure of this dataset is similar to our [Scientific Reports](https://github.com/coveooss/shopper-intent-prediction-nature-2020) data release:
each row corresponds to a browsing event in a session, containing session and timestamp information, as well as
(hashed) details on the interaction (was it a _purchase_ or a _detail_ event? Was it a simple _pageview_ or a specific
product action?). All data was collected and processed in an anonymized fashion through our standard [SDK](https://docs.coveo.com/en/3188/coveo-for-commerce/tracking-commerce-events):
remember that front-end tracking is by nature imperfect, so small inconsistencies are to be expected.

Field | Type | Description
------------ | ------------- | -------------
session_id_hash | string | Hashed identifier of the shopping session. A session groups together events that are at most 30 minutes apart: if the same user comes back to the target website after 31 minutes from the last interaction, a new session identifier is assigned.
event_type | enum | The type of event according to the [Google Protocol](https://developers.google.com/analytics/devguides/collection/protocol/v1), one of { _pageview_ , _event_ }; for example, an _add_ event can happen on a page load, or as a stand-alone event.
product_action | enum | One of { _detail_, _add_, _purchase_, _remove_ }. If the field is empty, the event is a simple page view (e.g. the `FAQ` page) without associated products. Please also note that an action involving removing a product from the cart might lead to several consecutive _remove_ events. Please note that _click_ events (that is, events generated by clicking on a search page) are included in the `search_train.csv` file.
product_sku_hash | string | If the event is a _product_ event, hashed identifier of the product in the event.
server_timestamp_epoch_ms | int | Epoch time, in milliseconds. As a further anonymization technique, the timestamp has been shifted by an unspecified amount of weeks, keeping intact the intra-week patterns.
hashed_url | string | Hashed url of the current web page.

Finally, please be aware that a PDP may generate both a _detail_ and a _pageview_ event, and that the order of the events in the
file is not strictly chronological (refer to the session identifier and the timestamp information to reconstruct the
actual chain of events for a given session).
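Since ordering is not guaranteed, a typical first step is to rebuild each session's chain of events by grouping on `session_id_hash` and sorting on the timestamp. A minimal sketch, assuming `pandas` and the field names from the table above:

```python
import pandas as pd

df = pd.read_csv("browsing_train.csv")

# Restore per-session chronological order.
df = df.sort_values(["session_id_hash", "server_timestamp_epoch_ms"])

# Rebuild each session as an ordered list of (event_type, product_action).
sessions = (
    df.groupby("session_id_hash")[["event_type", "product_action"]]
    .apply(lambda g: list(zip(g["event_type"], g["product_action"])))
)
print(sessions.head())
```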

##### Search Events

The file `search_train.csv` contains more than 800k search-based interactions. Each row is a search query event issued by a shopper and includes an array of (hashed) results returned to the client; we also report which result(s), if any, were clicked from the result set.
By also reporting products seen but not clicked, we hope to inspire clever ways to use negative feedback (see the parsing sketch after the table below).

Field | Type | Description
------------ | ------------- | -------------
session_id_hash | string | Hashed identifier of the shopping session. A session groups together events that are at most 30 minutes apart: if the same user comes back to the target website after 31 minutes from the last interaction, a new session identifier is assigned.
server_timestamp_epoch_ms | int | Epoch time, in milliseconds. As a further anonymization technique, the timestamp has been shifted by an unspecified amount of weeks, keeping intact the intra-week patterns.
query_vector | vector | A dense representation of the search query, obtained through standard pre-trained modeling and dimensionality reduction techniques.
product_skus_hash | list | Hashed identifiers of the products in the search response.
clicked_skus_hash | list | Hashed identifiers of the products clicked after issuing the search query.
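The `vector` and `list` fields arrive serialized inside the CSV. Here is a hedged sketch of parsing them and deriving the negative-feedback signal mentioned above; it assumes the fields are stored as JSON/Python-style literal strings, so adjust the parser if your download differs:

```python
import ast
import pandas as pd

df = pd.read_csv("search_train.csv")

# Assumption: vectors and SKU lists are literal strings like "[1.2, 3.4]";
# ast.literal_eval handles both single- and double-quoted variants.
for col in ["query_vector", "product_skus_hash", "clicked_skus_hash"]:
    df[col] = df[col].apply(
        lambda v: ast.literal_eval(v) if isinstance(v, str) else []
    )

# Negative feedback: products shown in the result set but never clicked.
df["unclicked_skus"] = df.apply(
    lambda row: [
        sku for sku in row["product_skus_hash"]
        if sku not in set(row["clicked_skus_hash"])
    ],
    axis=1,
)
```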

##### Catalog Metadata

The file `sku_to_content.csv` contains a mapping between (hashed) product identifiers (SKUs) and dense representation
of textual and image meta-data from the actual catalog, for all the SKUs in the training and the Challenge evaluation
dataset (when the information is available).

Field | Type | Description
------------ | ------------- | -------------
product_sku_hash | string | Hashed identifier of the product (SKU).
category_hash | string | Hashed representation of the product's category hierarchy, `/`-separated.
price_bucket | int | The product price, provided as a 10-quantile integer.
description_vector | vector | A dense representation of textual meta-data, obtained through standard pre-trained modeling and dimensionality reduction techniques. Please note that this representation is compatible with the one in the search file (see the similarity sketch below).
image_vector| vector | A dense representation of image meta-data, obtained through standard pre-trained modeling and dimensionality reduction techniques.
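Since the description vectors are compatible with the query vectors in `search_train.csv`, one natural use is scoring query-to-product relevance with cosine similarity. A minimal sketch, assuming both vectors have already been parsed into numeric lists (e.g. as in the search-events example above):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical usage, with `query_vec` taken from search_train.csv and
# `desc_vec` from sku_to_content.csv for some candidate product:
# score = cosine_similarity(query_vec, desc_vec)
```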

#### How to Start

Download the `zip` archive and unzip it on your local machine. To verify that all is well, you can run the simple
`start/dataset_stats.py` script in the folder: the script will parse the three files, show some sample rows and
print out some basic stats and counts (if you don't modify the three paths, it will run on the sample `csv` files).
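If you prefer to compute a few counts yourself, a rough sketch in the same spirit as the script (not the actual `dataset_stats.py` code) could look like this, here for the browsing sample:

```python
import pandas as pd

df = pd.read_csv("start/browsing_train.csv")  # assumed sample path

print("rows:", len(df))
print("distinct sessions:", df["session_id_hash"].nunique())
print("\nevents by type:")
print(df["event_type"].value_counts())
print("\nproduct actions (NaN = plain page views):")
print(df["product_action"].value_counts(dropna=False))
```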

Please remember that usage of this dataset implies acceptance of the Terms And Conditions: you agree to
not use the dataset for any other purpose than what is stated in the Terms and Conditions,
nor attempt to reverse engineer or de-anonymise the dataset by explicitly or implicitly linking the data
to any person, brand or legal entity.

### Contacts

For questions about the dataset, please reach out to [Jacopo Tagliabue](https://www.linkedin.com/in/jacopotagliabue/).

### Acknowledgments
The authors of the paper and organizers are:

* [Jacopo Tagliabue](https://www.linkedin.com/in/jacopotagliabue/) - Coveo AI Labs
* [Ciro Greco](https://www.linkedin.com/in/cirogreco/) - Coveo AI Labs
* [Jean-Francis Roy](https://www.linkedin.com/in/jeanfrancisroy/) - Coveo
* [Federico Bianchi](https://www.linkedin.com/in/federico-bianchi-3b7998121/) - Postdoctoral Researcher at Università Bocconi
* [Giovanni Cassani](https://giovannicassani.github.io/) - Tilburg University
* [Bingqing Yu](https://www.linkedin.com/in/bingqing-christine-yu/) - Coveo
* [Patrick John Chia](https://www.linkedin.com/in/patrick-john-chia-b0a34019b/) - Coveo

The authors wish to thank Coveo's entire legal team for supporting our research and believing in
this data sharing initiative; special thanks to [Luca Bigon](https://www.linkedin.com/in/bigluck/)
for help in data collection and preparation.

### How to Cite our Work

If you use this dataset, please cite our work:

```
@inproceedings{CoveoSIGIR2021,
  author = {Tagliabue, Jacopo and Greco, Ciro and Roy, Jean-Francis and Bianchi, Federico and Cassani, Giovanni and Yu, Bingqing and Chia, Patrick John},
  title = {SIGIR 2021 E-Commerce Workshop Data Challenge},
  year = {2021},
  booktitle = {SIGIR eCom 2021}
}
```