https://github.com/stewid/xapr

R bindings to the Xapian search engine

[![Build Status](https://travis-ci.org/stewid/xapr.png)](https://travis-ci.org/stewid/xapr)

# Introduction

`xapr` is an R package that provides an interface to the
[Xapian](http://xapian.org/) search engine from R, allowing both
indexing and retrieval operations. A great introduction to
[Xapian](http://xapian.org/) is the
[Getting Started with Xapian](http://getting-started-with-xapian.readthedocs.org/en/latest/)
guide.

## Indexing

Index the content of a `data.frame` as documents in the Xapian
search engine. A `document` is the data returned from a search.

The index plan is specified symbolically. An index plan has the form
`data ~ terms` where `data` is the blob of data returned from a search
and the `terms` are the basis for a search in Xapian. A first-order
term indexes the text in the column as free text. A specification of the
form `prefix:term` indicates that the text in `term` should be
indexed with the prefix `prefix`.

The prefix is a short string at the beginning of the term to indicate
which field the term indexes. Valid prefixes are: 'A', 'D', 'E', 'G',
'H', 'I', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V',
'X', 'Y' and 'Z'. See http://xapian.org/docs/omega/termprefixes for a
list of conventional prefixes.
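
As a small illustration, the list of valid prefixes can be written down
as a character vector and checked in plain R. This is only a sketch:
`valid_prefixes` and `is_valid_prefix()` are hypothetical names and not
part of `xapr`, which may validate prefixes differently.

```r
## The prefixes listed above, as a character vector
valid_prefixes <- c("A", "D", "E", "G", "H", "I", "K", "L", "M", "N", "O",
                    "P", "Q", "R", "S", "T", "U", "V", "X", "Y", "Z")

## Hypothetical helper (not part of 'xapr'): TRUE for each element of
## 'prefix' that is one of the valid single-letter prefixes
is_valid_prefix <- function(prefix) prefix %in% valid_prefixes

is_valid_prefix(c("S", "Q", "B"))
```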

The specification `prefix*term` is the same as `term +
prefix:term`. The prefix `X` will create a user defined prefix by
appending the uppercase `term` to `X`. The prefix `Q` will use data
in the `term` column as a unique identifier for the document. `NA`
values in indexed columns are skipped.
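
The `prefix*term` shorthand resembles the way R itself expands `*` in a
formula. The sketch below uses only base R (`terms()`), not `xapr`, to
show that expansion and how a user-defined `X` prefix could be
assembled. The helper `user_prefix()` is hypothetical; it only
illustrates the rule described above and is not the package's
implementation.

```r
## Base R expands 'a*b' to 'a + b + a:b'. According to the text above,
## 'xapr' uses the free-text term ('TITLE') and the prefixed term
## ('S:TITLE') from such an expansion.
plan <- ~ S*TITLE
labels <- attr(terms(plan), "term.labels")
print(labels)

## Hypothetical helper (not part of 'xapr'): build a user-defined
## prefix by appending the uppercase column name to 'X'
user_prefix <- function(column) paste0("X", toupper(column))

user_prefix("description")
```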

With no response, e.g. `~ term + prefix:term`, the row number is
written as data to the document.

The specification `~X*.` indexes every column both as a prefixed term
and as free text.

If the response contains one or more columns, e.g. `col_1 + col_2 ~
X*.` the response is first converted to `JSON`. A compact form to
convert all fields to `JSON` is to use `. ~ terms`. It is also
possible to drop response fields e.g. `. - col_1 - col_2 ~ X*.` to
include all fields in the response except `col_1` and `col_2`.
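
To make the `.` and `-` notation concrete, the following base R sketch
resolves which columns a specification such as `. - col_1 - col_2`
selects. It uses `terms()` on a one-sided formula purely for
illustration; the data frame `df` and its columns are made up, and
`xapr` may parse the response side differently.

```r
## A made-up data.frame standing in for the data to index
df <- data.frame(id = 1:2,
                 col_1 = c("a", "b"),
                 col_2 = c("c", "d"),
                 TITLE = c("Watch", "Sundial"),
                 stringsAsFactors = FALSE)

## '.' expands to every column in 'df'; '- col_1 - col_2' then drops
## the two named columns
kept <- attr(terms(~ . - col_1 - col_2, data = df), "term.labels")
print(kept)

## The columns that would remain in the response
df[, kept]
```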

### Example

This is an `R` version of the `Python` example in the
[Getting Started with Xapian](http://getting-started-with-xapian.readthedocs.org/en/latest/practical_example/index.html)
guide.

We are going to build a simple search system based on museum catalogue
data released by the Science Museum in London, UK
(http://api.sciencemuseum.org.uk/documentation/collections/) under the
Creative Commons Attribution-NonCommercial-ShareAlike license
(http://creativecommons.org/licenses/by-nc-sa/3.0/).

```{r, index, message = FALSE, comment = "#>"}
library(xapr)

## The first 100 rows of the museum catalogue data are distributed
## with the 'xapr' package
filename <- system.file("extdata/NMSI_100.csv", package="xapr")
nmsi <- read.csv(filename, as.is = TRUE)

## Create a temporary directory to hold the database
path <- tempfile(pattern="xapr-")
dir.create(path)

## Index the 'TITLE' and 'DESCRIPTION' fields with both a suitable
## prefix and without a prefix for general search. Use the 'id_NUMBER'
## as unique identifier. Store all the fields as JSON for display
## purposes.
db <- xindex(. ~ S*TITLE + X*DESCRIPTION + Q:id_NUMBER, nmsi, path)

## Display a summary of the Xapian database
summary(db)

## Run a search and display docid (rowname) and TITLE from each match
xsearch(db, "watch", TITLE ~ .)

## Run a search with multiple words
xsearch(db, "Dent watch", TITLE ~ .)

## Run a search with prefix
xsearch(db, "title:sunwatch", TITLE ~ title:S)

## Run a search with multiple prefixes
xsearch(db,
        "description:\"leather case\" AND title:sundial",
        TITLE ~ title:S + description:XDESCRIPTION)
```

## Installation

The development files for the `Xapian` search engine must be
installed.

```
$ sudo apt-get install libxapian-dev
```

To install the development version of `xapr`, it's easiest to use the
devtools package:

```{r, install, eval = FALSE}
# install.packages("devtools")
library(devtools)
install_github("stewid/xapr")
```

Another alternative is to use `git` and `make`:

```
$ git clone https://github.com/stewid/xapr.git
$ cd xapr
$ make install
```

**NOTE:** The package is in a very early development phase. Functions
and documentation may be incomplete and subject to
change. Suggestions, bugs, forks and pull requests are
appreciated. Get in touch.

# License

The `xapr` package is licensed under the GPL (>= 2). See this file
for additional details:

- [LICENSE](LICENSE) - `xapr` package license