{"id":15681536,"url":"https://github.com/desilinguist/paraquery","last_synced_at":"2025-05-07T12:24:59.373Z","repository":{"id":9647797,"uuid":"11582824","full_name":"desilinguist/paraquery","owner":"desilinguist","description":"An Interactive Querying Tool for Pivot-based Paraphrase Databases","archived":false,"fork":false,"pushed_at":"2013-09-15T21:38:05.000Z","size":1678,"stargazers_count":10,"open_issues_count":1,"forks_count":4,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-05-06T12:17:25.580Z","etag":null,"topics":["analysis","paraphrase","sqlite"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/desilinguist.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2013-07-22T14:06:10.000Z","updated_at":"2022-11-25T11:10:34.000Z","dependencies_parsed_at":"2022-09-05T21:30:21.283Z","dependency_job_id":null,"html_url":"https://github.com/desilinguist/paraquery","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/desilinguist%2Fparaquery","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/desilinguist%2Fparaquery/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/desilinguist%2Fparaquery/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/desilinguist%2Fparaquery/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/desilinguist","download_url":"https://codeload.github.com/desilinguist/paraquery/tar.gz/refs/heads/
master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252875069,"owners_count":21817951,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["analysis","paraphrase","sqlite"],"created_at":"2024-10-03T16:56:00.542Z","updated_at":"2025-05-07T12:24:59.350Z","avatar_url":"https://github.com/desilinguist.png","language":"Python","readme":"[![Bitdeli Badge](https://d2weczhvl823v0.cloudfront.net/desilinguist/paraquery/trend.png)](https://bitdeli.com/free \"Bitdeli Badge\")\n\nWhat is ParaQuery?\n------------------\nParaQuery is a tool that helps a user interactively explore and characterize a given pivoted paraphrase collection, analyze its utility for a particular domain, and compare it to other popular lexical similarity resources – all within a single interface.\n\nRequirements\n------------\n\nFor running ParaQuery with the bundled databases, you need:\n\n- Python 2.7 or higher.\n- SQLite\n- [NLTK](http://www.nltk.org)\n- [Pyparsing](http://pyparsing.wikispaces.com)\n- [NumPy](http://www.numpy.org)\n- [SciPy](http://www.scipy.org)\n\nHowever, if you need to generate paraphrase rules from your own bilingual data, then you also need:\n\n - Java 1.5 or higher\n - [Apache Hadoop](http://hadoop.apache.org)\n\n\nFile Formats\n------------\nParaQuery works by taking a pivoted paraphrase database and converting it into an SQLite database which can then be queried. ParaQuery provides an `index` command that can take a gzipped file containing pivoted paraphrases and write out an SQLite database to disk. 
The gzipped file needs to contain _paraphrase rules_ in the following format on each line:\n\n```[X] ||| source ||| target ||| features ||| pivots```\n\nwhere `source` is the source English phrase being paraphrased, `target` is its paraphrase,  `features` is a whitespace-separated list of features associated with the paraphrase pair, and `pivots` is a list of the foreign-language pivots that were used in generating this particular paraphrase pair. The format is based on the format used by the [Joshua](http://joshua-decoder.org/) paraphrase extractor since that is what is currently used to produce the compressed paraphrase files (See section on Generating Paraphrase Rules below).\n\nThe features currently produced by the Joshua paraphrase extractor are as follows:\n\n1. `0` (always 0; indicates whether the rule is a glue rule which it never is for paraphrases)\n2. Ignored by ParaQuery.\n3. `1` if `source` and `target` are identical\n4. `-log p(target|source)`. This is what's currently used by ParaQuery as the score for a paraphrase pair.\n5. `-log p(source|target)`\n6. Ignored by ParaQuery.\n7. Ignored by ParaQuery.\n8. Number of words in `source`\n9. Number of words in `target`\n10. Difference in number of words between `target` and `source`\n11. Is the rule purely lexical? (Right now, ParaQuery works best with purely lexical rules so this is always 1. However, in the future it might be useful for cases where this is not true, e.g., hierarchical or syntactic paraphrases.)\n12. Ignored by ParaQuery.\n13. Ignored by ParaQuery.\n14. Ignored by ParaQuery.\n15. Ignored by ParaQuery.\n16. Ignored by ParaQuery.\n17. Ignored by ParaQuery.\n\nNote that we have made some modifications to the Joshua paraphrase extractor (e.g., some additional filtering options) and, therefore, we bundle our modified version with ParaQuery. 
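The field layout above can be illustrated with a minimal parsing sketch (the `parse_rule` helper is hypothetical, written for this illustration, and is not part of ParaQuery itself):

```python
# Sketch: parsing one paraphrase rule line in the
#   [X] ||| source ||| target ||| features ||| pivots
# format described above. parse_rule is a hypothetical helper,
# not part of ParaQuery.

def parse_rule(line):
    lhs, source, target, features, pivots = line.split(" ||| ")
    feats = [float(f) for f in features.split()]
    return {
        "source": source,
        "target": target,
        # Feature 4 (index 3) is -log p(target|source), the value
        # ParaQuery currently uses as the paraphrase score.
        "score": feats[3],
        # Feature 3 (index 2) is 1 if source and target are identical.
        "identical": feats[2] == 1.0,
        "pivots": pivots,
    }

# The example rule from the README, French as the pivot language:
rule = ('[X] ||| accidents at sea ||| maritime accidents ||| '
        '0.0 1.0 0.0 2.6342840508626013 2.5230584157523763 '
        '14.266144534541269 6.260026717543335 3.0 2.0 -1.0 1.0 '
        '0.0 0.0 3.833333333333333 0.1353352832366127 0.0 0.0 ||| '
        '["accidents maritimes:0.07177033492822964"]')
parsed = parse_rule(rule)
print(parsed["source"], "->", parsed["target"], parsed["score"])
```

The separator is the literal string ` ||| `, so a plain `split` is enough as long as phrases themselves never contain that token.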
Several fields produced by the paraphrase extractor are currently ignored by ParaQuery but may be useful in the future.\n\nThe `pivots` are a list of the pivoted phrases along with the score contributed to the pair by that pivot.\n\nHere's an example of a paraphrase rule generated using French as the pivot language:\n\n`[X] ||| accidents at sea ||| maritime accidents ||| 0.0 1.0 0.0 2.6342840508626013 2.5230584157523763 14.266144534541269 6.260026717543335 3.0 2.0 -1.0 1.0 0.0 0.0 3.833333333333333 0.1353352832366127 0.0 0.0 ||| [\"accidents maritimes:0.07177033492822964\"]`\n\n\nGenerating Paraphrase Rules\n---------------------------\nIf you would like to use ParaQuery right out of the box, four databases are already available. These databases are generated from the European Parliament [bilingual corpora](http://www.statmt.org/europarl). The four ParaQuery databases available use the following languages as pivots:\n\n1. French [.paradb](https://s3.amazonaws.com/paraquery-databases/fr-en/.paradb) (1.3 GB)\n2. German [.paradb](https://s3.amazonaws.com/paraquery-databases/de-en/.paradb) (1.0 GB)\n3. Spanish [.paradb](https://s3.amazonaws.com/paraquery-databases/es-en/.paradb) (1.4 GB)\n4. Finnish [.paradb](https://s3.amazonaws.com/paraquery-databases/fi-en/.paradb) (775 MB)\n\nNote that each of the above links is to a file called `.paradb` which is the SQLite database generated from the respective bilingual corpus for use with ParaQuery. Since they are all named `.paradb`, you should probably put them in separate directories. To generate your own databases from your own data, please read on.\n\nThe code to generate the gzipped paraphrase rules in the format that's currently readable by ParaQuery is also included here. 
To generate pivoted paraphrases, you need three files: the file containing the foreign language sentences `sentences.fr`, the file containing the corresponding English sentences `sentences.en`, and, finally, a file `sentences.align` containing the word alignments between the sentences. To generate the word alignments, you can probably use the [Berkeley Word Aligner](https://code.google.com/p/berkeleyaligner/). The alignments need to be in the following format:\n\n`0-0 1-1 1-2 2-3`\n\nwhere the first number is an index for the foreign language sentence and the second number for the English sentence. Note also that both `sentences.fr` and `sentences.en` must be tokenized.\n\nPlease note that if you want to generate paraphrases using one of the other languages in the Europarl corpus, you do not need to do much work. Chris Callison-Burch has the files from each of the 13 languages nicely processed and available for download [here](http://www.cs.jhu.edu/~ccb/howto-extract-paraphrases.html) as part of his paraphrasing software.\n\nOnce these files are ready, paraphrase rule files can be created as follows:\n\n- Prepare data to run through the Thrax offline grammar extractor (`create_thrax_data.sh` is bundled with ParaQuery under `scripts/`):\n`create_thrax_data.sh sentences.fr sentences.en sentences.align \u003e sentences.input`\n\n- Run Thrax to extract the paraphrase grammar:\n`hadoop jar thrax.jar hiero.conf \u003coutdir\u003e \u003e\u0026 thrax.log`, where the library `thrax.jar` comes bundled with ParaQuery under the `lib/` directory and so does the configuration file `hiero.conf`. The only option you should need to modify is the `input-file` in `hiero.conf` -- to point to `sentences.input`. If you want to modify the other options, read more about Thrax [here](http://cs.jhu.edu/~jonny/thrax/). 
`\u003coutdir\u003e` is your desired output directory.\n\n- Get the final hadoop output in the current directory:\n`hadoop fs -getmerge \u003coutdir\u003e/final ./rules.gz`\n\n- Sort the generated paraphrase rules by the source side:\n`zcat rules.gz | sort -t'|' -k1,4 | gzip \u003e rules-sorted.gz`\n\n- Run the paraphrase grammar builder (note that `joshua.jar` is bundled with ParaQuery under `lib/`):\n`(java -Dfile.encoding=UTF8 -Xmx8g -classpath joshua.jar joshua.tools.BuildParaphraseGrammarWithPivots -g rules-sorted.gz | gzip \u003e para-grammar.gz) 2\u003ebuild_para.log`.\n\n- Sort by both source and target side:\n`zcat para-grammar.gz | sort -t'|' -k4,7 | gzip \u003e para-grammar-sorted.gz`\n\n- Aggregate paraphrase rules (sum duplicate rules that you might get from different pivots):\n`java -Dfile.encoding=UTF8 -Xmx8g -classpath joshua.jar joshua.tools.AggregateParaphraseGrammarWithPivots -g para-grammar-sorted.gz | gzip \u003e final-para-grammar.gz`\n\n- Sort by the source side:\n`zcat final-para-grammar.gz | sort -t'|' -k1,4 | gzip \u003e final-para-grammar-sorted.gz`\n\nUsing ParaQuery\n---------------\n\nOnce the gzipped paraphrase file has been generated, it can easily be converted to an SQLite database from inside ParaQuery:\n\n - Run `paraquery` (the provided launching script)\n - At the resulting prompt, run the following command, which will create a `.paradb` file in the current directory:\n`index final-para-grammar-sorted.gz`\n - If a `.paradb` file exists in the current directory, `paraquery` will automatically attach it and output a message when starting up. Otherwise, the path to the `.paradb` file must be provided as an argument.\n\nOnce you have a database loaded up, you can use all the commands that ParaQuery supports. 
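For readers preparing their own bilingual data, the `i-j` word-alignment format described in the previous section (e.g. `0-0 1-1 1-2 2-3`) can be read into index pairs with a few lines of Python (a sketch for illustration only; `parse_alignment` is a hypothetical helper, not ParaQuery code):

```python
# Sketch: reading one line of word alignments, where each "i-j"
# token links foreign-sentence word i to English-sentence word j.
# Hypothetical helper, not part of ParaQuery.

def parse_alignment(line):
    pairs = []
    for token in line.split():
        f, e = token.split("-")
        pairs.append((int(f), int(e)))
    return pairs

# A single foreign word may align to several English words
# (here, foreign word 1 aligns to English words 1 and 2):
print(parse_alignment("0-0 1-1 1-2 2-3"))
```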
Please read the [user manual](manual.md) for a detailed explanation of how to use ParaQuery.\n\nAcknowledgments\n-----\nWe would like to thank [Juri Ganitkevitch](http://cs.jhu.edu/~juri/), [Jonny Weese](http://cs.jhu.edu/~jonny/), and [Chris Callison-Burch](http://www.cs.jhu.edu/~ccb/) for all their help and guidance during the development of ParaQuery.\n\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdesilinguist%2Fparaquery","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdesilinguist%2Fparaquery","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdesilinguist%2Fparaquery/lists"}