{"id":18723791,"url":"https://github.com/hscells/boogie","last_synced_at":"2025-04-12T15:20:23.510Z","repository":{"id":57547431,"uuid":"111039945","full_name":"hscells/boogie","owner":"hscells","description":"DSL front-end for groove query analysis pipeline","archived":false,"fork":false,"pushed_at":"2022-02-02T07:09:40.000Z","size":540,"stargazers_count":2,"open_issues_count":0,"forks_count":3,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-03-26T09:51:12.274Z","etag":null,"topics":["command-line","dsl","elasticsearch","groove-pipeline","measurements","medline","pipeline","pubmed","query","terrier"],"latest_commit_sha":null,"homepage":"https://github.com/hscells/groove","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hscells.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-11-17T00:59:02.000Z","updated_at":"2023-09-20T21:11:51.000Z","dependencies_parsed_at":"2022-09-26T18:40:54.135Z","dependency_job_id":null,"html_url":"https://github.com/hscells/boogie","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hscells%2Fboogie","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hscells%2Fboogie/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hscells%2Fboogie/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hscells%2Fboogie/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hscells","download_url":"https://codeload.github.com/hscells/boogie/tar.gz/re
fs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248586218,"owners_count":21128998,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["command-line","dsl","elasticsearch","groove-pipeline","measurements","medline","pipeline","pubmed","query","terrier"],"created_at":"2024-11-07T13:51:37.610Z","updated_at":"2025-04-12T15:20:23.480Z","avatar_url":"https://github.com/hscells.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cimg height=\"200px\" src=\"gopher.png\" alt=\"gopher\" align=\"right\"/\u003e\n\n# boogie\n\n[![GoDoc](https://godoc.org/github.com/hscells/boogie?status.svg)](https://godoc.org/github.com/hscells/boogie)\n[![Go Report Card](https://goreportcard.com/badge/github.com/hscells/boogie)](https://goreportcard.com/report/github.com/hscells/boogie)\n\n_DSL front-end for [groove](https://github.com/hscells/groove)_\n\nOften, we would like to abstract away the way we perform measurements (e.g. query performance prediction) or the way we\ntransform queries (e.g. query expansion/reduction); and we would like to do these things in a repeatable, reproducible\nmanner. This is where boogie comes in: an experiment is represented as a pipeline of operations that cover the format\nof queries, the source for statistics, the operations and measurements for each query, and how the experiment is to be\noutput. boogie translates a simple DSL syntax into a [groove](https://github.com/hscells/groove) pipeline. 
Both groove\nand boogie are designed to be easily extendable and offer sane, simple abstractions.\n\nThe most important abstraction is the statistic source. A boogie pipeline is not concerned with how documents are\nstored or the structure of the index; only how to retrieve documents. In this way, boogie separates\nhow you choose to store your documents from how you get your experiments done.\n\n## Installation\n\nboogie can be installed with `go install`.\n\n```bash\ngo install github.com/hscells/boogie/cmd/boogie\n```\n\n## Usage\n\nFor command line help, see `boogie --help`.\n\nCommand line usage:\n\n```bash\nboogie --pipeline pipeline.json\n```\n\n - `--pipeline`: the path to a boogie pipeline file which will be used to construct a groove pipeline.\n - `--logfile` (optional): the path to a logfile to output logs to.\n\n**Important:** Queries require a specific format that is used by groove. Each query file must contain one query, and the\nname of the file must be the topic for that query. For example, if topic 1 contains the query:\n\n```\ngreen eggs and ham\n```\n\nThen the file must be named `1`. Groove uses this to process evaluation and result files.\n\n## DSL\n\nboogie uses a domain-specific language (DSL) for creating [groove](https://github.com/hscells/groove) pipelines.\nThe boogie DSL looks like a regular JSON file (although JSON is notoriously bad for these types of things, I think in this case it is OK). 
The example below provides a pipeline for running a simple IR experiment - run some queries in a search engine and evaluate them:\n\n```json\n{\n  \"query\": {\n    \"format\": \"medline\",\n    \"path\": \"path/to/queries\"\n  },\n  \"statistic\": {\n    \"source\": \"elasticsearch\",\n    ...\n  },\n  \"evaluations\": [\n    \"precision\",\n    \"recall\",\n    \"f1\"\n  ],\n  \"output\": {\n    \"evaluations\": {\n      \"qrels\": \"medline.qrels\",\n      \"formats\": [\n        {\n          \"format\": \"json\",\n          \"filename\": \"medline_bool.json\"\n        }\n      ]\n    },\n    \"trec_results\": {\n      \"output\": \"medline_bool.results\"\n    }\n  }\n}\n```\n\n### Templates\n\nThe DSL files can include template arguments to prevent repeating yourself. Anything can be templated by the `template` keyword at the top of the file. The syntax of templating is as follows:\n\n```\ntemplate stats stats.btmpl.json\ntemplate path $0\n{\n    %stats,\n    \"query\": {\n        \"format\": \"medline\",\n        \"path\": \"%path\",\n    }\n}\n```\n\nIn this example, the file `stats.btmpl.json` contains the section of DSL that will be templated in, and `$0` refers to the first command line argument passed in.\n\n## Configuration Items\n\nThere are currently 11 different top-level configuration items that may or may not integrate with each other. I have tried my best to describe each of these items and how they can interact with each other.\n\n### Query (`query`)\n\nQuery formats are specified using `format`; the different query formats and options are detailed below. The path to\nyour queries should be specified using `path`.\n\n#### `medline`\n\n - `mapping`: Specify a field mapping in the same format as when loading a field mapping into\n [transmute](https://github.com/hscells/transmute).\n\n#### `pubmed`\n\nThe options for the `pubmed` format are the same as `medline`.\n\n#### `keyword`\n\nA keyword query (just one string of characters per file). 
No additional options may be specified.\n\n### Statistic (`statistic`)\n\nStatistic sources provide common information retrieval methods. They are specified using `source`. The source component\nand options are detailed below. groove/boogie does not attempt to configure information retrieval systems (e.g. sources\nfor statistics); it only attempts to wrap them in some way. For this reason, you should read how to set up these systems\nbefore using boogie. A statistic source can only be configured if `query` has been configured.\n\nThere are currently three configured statistic sources: Elasticsearch, Terrier, and Entrez.\n\n#### `elasticsearch`\n\n - `hosts`: Specify a list of Elasticsearch URLs (e.g. http://example.com:9200).\n - `index`: Elasticsearch index to run experiments on.\n - `document_type`: Elasticsearch document type.\n - `field`: Field to search on (for keyword queries).\n - `analyser`: Specify a preconfigured analyser for term vectors/analyse transformation.\n - `analyse_field`: Specify the field to be analysed for term vectors/analyse transformation.\n - `scroll`: Specify whether to scroll or not (true/false).\n\nNote: The `analyser` and `analyse_field` are to be used in the cases where you may have stemmed documents and stemmed\nqueries and wish to get a term vector for a pre-stemmed term in a query. To do this, point `analyse_field` to the\nanalysed field name (i.e. \"keyword\"). When only `analyser` is set, this defaults to normal behaviour. 
In this case, the\nrequest to the Elasticsearch term vectors API will look like this:\n\n```json\n{\n    \"doc\": {\n        \"text\": \"disease\"\n    },\n    \"term_statistics\": true,\n    \"field_statistics\": false,\n    \"offsets\": false,\n    \"positions\": false,\n    \"payloads\": false,\n    \"fields\": [\"text.keyword\"],\n    \"per_field_analyzer\": {\n        \"text.keyword\": \"\"\n    }\n}\n```\n\nIf `analyse_field` is not specified, the `fields` and `per_field_analyzer` keys will use whatever `field` is set to in the\npipeline (in the case above, this would be \"text\").\n\n#### `terrier`\n\n - `properties`: Location of the terrier properties file.\n\n#### `entrez`\n\n - `email`: Email of the account using Entrez.\n - `tool`: Tool name accessing Entrez.\n - `key`: (optional) Key parameter of Entrez (to increase rate limit).\n\n#### Universal options:\n\n - `params`: Map of parameter name to float value (e.g. k, lambda).\n - `search`: Search properties; `size` (maximum number of results to retrieve), `run_name` (name of the run for trec).\n\n### Query Preprocessing (`preprocess`)\n\nPreprocessing is performed before analysing a query. This component accepts a list of preprocessors:\n\n - `alphanum`: Remove non-alphanumeric characters.\n - `lowercase`: Transform uppercase characters to lowercase.\n - `strip_numbers`: Remove numbers.\n\n### Query Transformations (`transformations`)\n\nQuery transformations are operations that change queries beyond simple string manipulation. For instance, a\ntransformation can simplify a query, or replace Boolean operators. The output directory and a list of transformations\ncan be specified. 
If the directory is not present, no queries will be output.\n\n - `output`: Directory to output transformed queries to.\n - `operations`: List of transformations to apply (see below).\n\nThe possible query transformation operations are listed as follows:\n\n - `simplify`: Simplify a Boolean query to just \"and\" and \"or\" operators.\n - `analyse`: Use Elasticsearch to analyse the query strings in the query.\n\nAdditionally, the following transformation can be used in conjunction with the Elasticsearch statistics source:\n\n - `analyse`: Run the analyser specified in `statistic` on the query.\n\nOperations are applied in the order specified.\n\n### Query Rewrites (`rewrite`)\n\nRewrites are a different type of transformation in that they can be applied in multiple ways to a query. These are useful for creating query variations or exploring the space of possible queries. Currently, the only use for this is in the query chain machine learning model. But the variations could, for example, just be output to a directory (I'm too lazy to do this).\n\nThe available rewrites are:\n\n - `logical_operator_replacement`: Replace ORs with ANDs and ANDs with ORs.\n - `adj_range`: Modify the distance of adjacency operators.\n - `adj_replacement`: Replace adjacency operators with AND operators.\n - `mesh_explosion`: Explode/Unexplode a MeSH keyword.\n - `mesh_parent`: Move a MeSH keyword up one level in the ontology.\n - `field_restrictions`: Permute the fields being searched on.\n\n### Measurements (`measurements`)\n\nMeasurements are methods that apply a calculation to a query using a statistics source. All measurements return a\nfloating point number. 
This component accepts a list of measurements:\n\n - `term_count` - Total number of query terms.\n - `avg_ictf` - Average inverse collection term frequency.\n - `avg_idf` - Average inverse document frequency.\n - `sum_idf` - Sum inverse document frequency.\n - `max_idf` - Max inverse document frequency.\n - `std_idf` - Standard Deviation inverse document frequency.\n - `sum_cqs` - Sum Collection Query Similarity.\n - `max_cqs` - Max Collection Query Similarity.\n - `avg_cqs` - Average Collection Query Similarity.\n - `scs` - Simplified Clarity Score.\n - `query_scope` - Query Scope.\n - `wig` - Weighted Information Gain.\n - `weg` - Weighted Entropy Gain.\n - `ncq` - Normalised Query Commitment.\n - `clarity_score` - Clarity Score.\n - `retrieval_size` - Total number of documents retrieved.\n - `boolean_clauses` - Number of clauses in Boolean query.\n - `boolean_keywords` - Number of keywords in Boolean query.\n - `boolean_fields` - Number of fields in Boolean query.\n - `boolean_truncated` - Number of wildcard keywords in Boolean query.\n - `mesh_keywords` - Number of MeSH keywords in Boolean query.\n - `mesh_exploded` - Number of Exploded MeSH keywords in Boolean query.\n - `mesh_non_exploded` - Number of Non-Exploded MeSH keywords in Boolean query.\n - `mesh_avg_depth` - Average depth of MeSH keywords in ontology in Boolean query.\n - `mesh_max_depth` - Maximum depth of MeSH keywords in ontology in Boolean query.\n\nMeasurements can just be output to a file, or be used as inputs to machine learning (for example feature engineering; see below).\n\n### Evaluation (`evaluation`)\n\nQueries can be evaluated through different measures. To evaluate queries in the pipeline, use the `evaluation` key. 
Each\nevaluation measurement comprises:\n\n - `evaluate`: The measure to evaluate each topic with.\n\nThe list of measures is as follows:\n\n - `num_ret`: Total number of retrieved documents.\n - `num_rel`: Total number of relevant documents (from qrels).\n - `num_rel_ret`: Total number of relevant documents that were retrieved.\n - `precision`: Ratio of relevant retrieved documents to retrieved documents.\n - `recall`: Ratio of relevant retrieved documents to relevant documents.\n - `f05_measure`: F-beta 0.5.\n - `f1_measure`: F-beta 1.\n - `f3_measure`: F-beta 3.\n - `wss`: Work Saved over Sampling.\n\n### Output (`output`)\n\nAn output specifies how experiments are to be formatted and what file to write them to. The `output` component comprises\na list of outputs. Each output can be of type `measurements`, `trec_results`, or `evaluations`.\n\nFor `measurements`, each item contains a `format` field and a `filename` field. The `filename` field tells the pipeline\nwhere to write the file, and the `format` is the format of the file. The formats are described below.\n\n - `json`: JSON formatting.\n - `csv`: Comma separated formatting.\n\nFor `trec_results`, a filename must be specified using `output`:\n\n - `output`: Where to write trec-style results file to.\n\nFor `evaluations`, both a `qrels` file and a list of formats must be specified. The list of formats is similar to that of `measurements`, i.e.\na list of filename and format pairs:\n\n - `qrels`: Path to a trec-style qrels file.\n - `formats`: `format`, `filename` pairs.\n\nThe format of `evaluations` is currently only `json`.\n\n### Machine Learning (`learning`)\n\nMachine learning is kind of new in boogie and it's still not perfect, but at the moment there is some learning to rank being implemented. 
\n\n - `model`: Which machine learning model to use (currently available: `query_chain`).\n - `options`: Additional model-specific options (see below).\n - `train`: Options for training.\n - `test`: Options for testing.\n - `generate`: Options for generating data.\n\nEven if a model has no configuration options for training, testing, or generating, the presence of the corresponding key tells the pipeline which operation(s) to perform.\n\n#### `query_chain`\n\n##### Options:\n\n - `candidate_selector`: one of `ltr_svmrank`, `ltr_quickrank`, or `reinforcement` (only `ltr_quickrank` is fully implemented).\n    - `ltr_svmrank` requires `model_file` to be configured here.\n    - `ltr_quickrank` requires `binary` to be configured here, as well as any arguments to quickrank (see: https://github.com/hpclab/quickrank)\n\n##### Generate:\n\nQuery chain generate requires that a query, statistic source, measurements, rewrite, evaluation, and output (qrels) are configured.\n\n - `output`: Path to generate features to.\n\n## Extending\n\nAdding a query format, statistics source, preprocessing step, measurement, or output format first requires\nimplementing the corresponding [groove](https://github.com/hscells/groove) interface. 
Once an interface has been\nimplemented, it can be added to boogie by registering it in the [config](config.go).\n\nI am open to contributions, but having said that, I would not be contributing at this point in time unless it was to a really stable API like evaluation or measurements.\n\n## Citing\n\nIf you use this work for scientific publication, please reference\n\n```\n@inproceedings{scells2018framework,\n author = {Scells, Harrisen and Locke, Daniel and Zuccon, Guido},\n title = {An Information Retrieval Experiment Framework for Domain Specific Applications},\n booktitle = {The 41st International ACM SIGIR Conference on Research \\\u0026 Development in Information Retrieval},\n series = {SIGIR '18},\n year = {2018},\n}\n```\n\n## Logo\n\nThe Go gopher was created by [Renee French](https://reneefrench.blogspot.com/), licensed under the\n[Creative Commons Attribution 3.0 license](https://creativecommons.org/licenses/by/3.0/).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhscells%2Fboogie","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhscells%2Fboogie","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhscells%2Fboogie/lists"}