Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/terrier-org/terrier-spark
A Spark API for the Terrier.org information retrieval platform
https://github.com/terrier-org/terrier-spark
information-retrieval spark
Last synced: 4 months ago
JSON representation
A Spark API for the Terrier.org information retrieval platform
- Host: GitHub
- URL: https://github.com/terrier-org/terrier-spark
- Owner: terrier-org
- Created: 2018-05-02T20:19:19.000Z (almost 7 years ago)
- Default Branch: master
- Last Pushed: 2019-01-23T20:31:29.000Z (about 6 years ago)
- Last Synced: 2023-07-07T10:43:54.689Z (over 1 year ago)
- Topics: information-retrieval, spark
- Language: Jupyter Notebook
- Homepage:
- Size: 138 KB
- Stars: 6
- Watchers: 3
- Forks: 1
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Terrier-Spark
Terrier-Spark is a Scala library for [Apache Spark](https://spark.apache.org/) that allows the [Terrier.org](http://terrier.org) information retrieval platform to be installed and working.
To use within a notebook, this requires [Apache Toree](https://toree.apache.org/) to be installed and working.
Requirements:
- Terrier 5.0
- Apache Spark version 2.0 or newer
- Jupyter & Apache Tree (optional)## Functionality
- Retrieving a run from a Terrier index (local or remote)
- Evaluating a run
- Optimising the parameter of a retrieval run on a local index
- Grid searching the parameter of a retrieval run on a local index
- Learning a model using learning-to-rankFor known improvements/issues, see [TODO.md](TODO.md)
## Example
val indexref = IndexRef.of("/path/to/index/data.properties")
val props = Map(
"terrier.home" -> terrierHome)TopicSource.configureTerrier(props)
val topics = TopicSource.extractTRECTopics(topicsFile)
.toList.toDF("qid", "query")val queryTransform = new QueryingTransformer()
.setTerrierProperties(props)
.setIndexReference(indexref)
.setSampleModel(model)val r1 = queryTransform.transform(topics)
//r1 is a dataframe with results for queries in topics
val qrelTransform = new QrelTransformer()
.setQrelsFile(qrelsFile)val r2 = qrelTransform.transform(r1)
//r2 is a dataframe as r1, but also includes a label column
val ndcg = new RankingEvaluator(Measure.NDCG, 20).evaluateByQuery(r2).toListMore examples are provided in the [example notebooks](example_notebooks/toree/), or in our [SIGIR 2018 demo paper](http://www.dcs.gla.ac.uk/~craigm/publications/macdonald2018terriersparkdemo.pdf) [1].
## Use from the Spark Shell
$ spark-shell --packages org.terrier:terrier-spark:0.0.1-SNAPSHOT
## Use within a Jupyter Notebook
Firstly, make sure you have a working installation of Toree. Next, import Terrier and terrier-spark using some `%AddDeps` "magic":
%AddDeps org.terrier terrier-core 5.0 --transitive --exclude org.slf4j:slf4j-log4j12
%AddDeps org.terrier terrier-spark 0.0.1-SNAPSHOT --repository file:/home/user/.m2/repository --transitiveYou can then use the terrier-spark code directly in your Scala notebooks.
We have provided several example notebooks:
- Performing a simple run: [example_notebooks/toree/simple_run.ipynb](example_notebooks/toree/simple_run.ipynb)
- Training a weighting model parameters: [example_notebooks/toree/train_bm25.ipynb](example_notebooks/toree/train_bm25.ipynb)
- Training and evaluating a learning-to-rank model: [example_notebooks/toree/ltr.ipynb](example_notebooks/toree/train_bm25.ipynb)## Bibliography
If you use this software, please cite one of:
1. [Combining Terrier with Apache Spark to create agile experimental information retrieval pipelines. Craig Macdonald. In Proceedings of SIGIR 2018.](http://www.dcs.gla.ac.uk/~craigm/publications/macdonald2018terriersparkdemo.pdf)
2. [Agile Information Retrieval Experimentation with Terrier Notebooks. Craig Macdonald, Richard McCreadie, Iadh Ounis. In Proceedings of DESIRES 2018.](http://ceur-ws.org/Vol-2167/paper12.pdf)
## Credits
Developed by Craig Macdonald, University of Glasgow