https://github.com/terrier-org/terrier-spark

A Spark API for the Terrier.org information retrieval platform

information-retrieval spark


Host: GitHub
URL: https://github.com/terrier-org/terrier-spark
Owner: terrier-org
Created: 2018-05-02T20:19:19.000Z (almost 7 years ago)
Default Branch: master
Last Pushed: 2019-01-23T20:31:29.000Z (about 6 years ago)
Last Synced: 2023-07-07T10:43:54.689Z (over 1 year ago)
Topics: information-retrieval, spark
Language: Jupyter Notebook
Homepage:
Size: 138 KB
Stars: 6
Watchers: 3
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# Terrier-Spark

Terrier-Spark is a Scala library for [Apache Spark](https://spark.apache.org/) that allows the [Terrier.org](http://terrier.org) information retrieval platform to be used from Spark.

To use it within a Jupyter notebook, [Apache Toree](https://toree.apache.org/) must be installed and working.

Requirements:

 - Terrier 5.0

 - Apache Spark version 2.0 or newer

 - Jupyter & Apache Toree (optional)

## Functionality

 - Retrieving a run from a Terrier index (local or remote)

 - Evaluating a run

 - Optimising the parameter of a retrieval run on a local index

 - Grid searching the parameter of a retrieval run on a local index

 - Learning a model using learning-to-rank

For known improvements/issues, see [TODO.md](TODO.md)
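The grid-search item in the list above can be illustrated without Terrier at all. The following is a minimal sketch in plain Scala, not terrier-spark's implementation: `evaluate` is a hypothetical stand-in for "run retrieval with this parameter value and return a mean evaluation score", and the parameter name `b` and the toy scoring function are assumptions made only for illustration.

```scala
// Grid search sketch in plain Scala (no Terrier or Spark dependency).
object GridSearchSketch {
  // Hypothetical stand-in for a retrieval run plus evaluation.
  // A toy function that peaks at b = 0.4.
  def evaluate(b: Double): Double = 1.0 - math.pow(b - 0.4, 2)

  // Score every candidate value and keep the best (value, score) pair.
  // Throws on an empty candidate list, since there is nothing to choose from.
  def gridSearch(candidates: Seq[Double], score: Double => Double): (Double, Double) =
    candidates.map(c => (c, score(c))).maxBy(_._2)

  def main(args: Array[String]): Unit = {
    // Candidate values 0.0, 0.1, ..., 1.0
    val grid = (0 to 10).map(_ / 10.0)
    val (best, bestScore) = gridSearch(grid, evaluate)
    println(s"best parameter = $best, score = $bestScore")
  }
}
```

In a real experiment, `evaluate` would be replaced by a function that runs the retrieval pipeline with the given setting and returns a measure such as mean NDCG.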

## Example

    // terrierHome, topicsFile, qrelsFile and model are assumed to be defined
    // by the caller; toDF needs spark.implicits._ in scope (automatic in
    // spark-shell).
    val indexref = IndexRef.of("/path/to/index/data.properties")
    val props = Map(
        "terrier.home" -> terrierHome)
    TopicSource.configureTerrier(props)
    val topics = TopicSource.extractTRECTopics(topicsFile)
        .toList.toDF("qid", "query")

    val queryTransform = new QueryingTransformer()
        .setTerrierProperties(props)
        .setIndexReference(indexref)
        .setSampleModel(model)

    // r1 is a dataframe with results for queries in topics
    val r1 = queryTransform.transform(topics)

    val qrelTransform = new QrelTransformer()
        .setQrelsFile(qrelsFile)

    // r2 is a dataframe as r1, but also includes a label column
    val r2 = qrelTransform.transform(r1)

    // per-query NDCG for the labelled results
    val ndcg = new RankingEvaluator(Measure.NDCG, 20).evaluateByQuery(r2).toList

More examples are provided in the [example notebooks](example_notebooks/toree/), or in our [SIGIR 2018 demo paper](http://www.dcs.gla.ac.uk/~craigm/publications/macdonald2018terriersparkdemo.pdf) [1].
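As a rough illustration of what the last two steps of the example compute, the sketch below attaches relevance labels to ranked results and derives a per-query NDCG value using plain Scala collections. It is not the terrier-spark implementation: the `Result` type, the `ndcgByQuery` helper, and the sample data are all hypothetical, and it assumes that the `20` in the example is a rank cutoff. It uses one common NDCG formulation (linear gain, log2 rank discount); the library may use a different variant.

```scala
// Labelling and NDCG sketch in plain Scala (no Terrier or Spark dependency).
object NdcgSketch {
  // One retrieved document: query id, document id, retrieval score.
  final case class Result(qid: String, docno: String, score: Double)

  // Discounted cumulative gain of relevance labels given in rank order.
  def dcg(labels: Seq[Int]): Double =
    labels.zipWithIndex.map { case (rel, i) =>
      rel / (math.log(i + 2.0) / math.log(2.0)) // rank i+1 is discounted by log2(i+2)
    }.sum

  // NDCG@k: DCG of the ranking divided by the DCG of the ideal ordering.
  def ndcgAtK(rankedLabels: Seq[Int], judgedLabels: Seq[Int], k: Int): Double = {
    val ideal = dcg(judgedLabels.sorted(Ordering[Int].reverse).take(k))
    if (ideal == 0.0) 0.0 else dcg(rankedLabels.take(k)) / ideal
  }

  // Attach a label to each result (0 when unjudged), then score each query.
  def ndcgByQuery(results: Seq[Result],
                  qrels: Map[(String, String), Int],
                  k: Int): Map[String, Double] =
    results.groupBy(_.qid).map { case (qid, rs) =>
      val ranked = rs.sortBy(-_.score).map(r => qrels.getOrElse((r.qid, r.docno), 0))
      val judged = qrels.collect { case ((q, _), rel) if q == qid => rel }.toSeq
      qid -> ndcgAtK(ranked, judged, k)
    }

  def main(args: Array[String]): Unit = {
    // Made-up data for illustration only.
    val results = Seq(
      Result("q1", "d1", 3.0), // judged non-relevant, ranked first
      Result("q1", "d2", 2.0), // judged relevant, ranked second
      Result("q1", "d3", 1.0)  // unjudged, treated as non-relevant
    )
    val qrels = Map(("q1", "d1") -> 0, ("q1", "d2") -> 1)
    ndcgByQuery(results, qrels, 20).foreach { case (qid, v) => println(s"$qid: $v") }
  }
}
```

Averaging the per-query values gives a single figure for the run, which is the kind of score a parameter search would compare across settings.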

## Use from the Spark Shell

	$ spark-shell --packages org.terrier:terrier-spark:0.0.1-SNAPSHOT

## Use within a Jupyter Notebook

Firstly, make sure you have a working installation of Toree. Next, import Terrier and terrier-spark using some `%AddDeps` "magic":

	%AddDeps org.terrier terrier-core 5.0 --transitive --exclude org.slf4j:slf4j-log4j12  

	%AddDeps org.terrier terrier-spark 0.0.1-SNAPSHOT --repository file:/home/user/.m2/repository --transitive

You can then use the terrier-spark code directly in your Scala notebooks.

We have provided several example notebooks:

 - Performing a simple run: [example_notebooks/toree/simple_run.ipynb](example_notebooks/toree/simple_run.ipynb)

 - Training the parameters of a weighting model: [example_notebooks/toree/train_bm25.ipynb](example_notebooks/toree/train_bm25.ipynb)

 - Training and evaluating a learning-to-rank model: [example_notebooks/toree/ltr.ipynb](example_notebooks/toree/ltr.ipynb)

## Bibliography

If you use this software, please cite one of:

1. [Combining Terrier with Apache Spark to create agile experimental information retrieval pipelines. Craig Macdonald. In Proceedings of SIGIR 2018.](http://www.dcs.gla.ac.uk/~craigm/publications/macdonald2018terriersparkdemo.pdf)

2. [Agile Information Retrieval Experimentation with Terrier Notebooks. Craig Macdonald, Richard McCreadie, Iadh Ounis. In Proceedings of DESIRES 2018.](http://ceur-ws.org/Vol-2167/paper12.pdf)

## Credits

Developed by Craig Macdonald, University of Glasgow