{"id":35015766,"url":"https://github.com/mdda/cryptic-wordplay","last_synced_at":"2026-05-19T12:07:16.028Z","repository":{"id":247633525,"uuid":"823068399","full_name":"mdda/cryptic-wordplay","owner":"mdda","description":"Dataset building tools for Cryptic Crossword Clue solutions","archived":false,"fork":false,"pushed_at":"2024-10-09T16:12:17.000Z","size":419,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-12-28T19:05:56.088Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mdda.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-07-02T11:10:46.000Z","updated_at":"2025-10-22T13:41:08.000Z","dependencies_parsed_at":"2024-08-01T04:29:51.839Z","dependency_job_id":"63c97441-291d-484f-80ea-b35d8040223f","html_url":"https://github.com/mdda/cryptic-wordplay","commit_stats":null,"previous_names":["mdda/cryptic-wordplay"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/mdda/cryptic-wordplay","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mdda%2Fcryptic-wordplay","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mdda%2Fcryptic-wordplay/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mdda%2Fcryptic-wordplay/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mdda%2Fcryptic-wordplay/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/ho
sts/GitHub/owners/mdda","download_url":"https://codeload.github.com/mdda/cryptic-wordplay/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mdda%2Fcryptic-wordplay/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33215622,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-19T07:54:09.561Z","status":"ssl_error","status_checked_at":"2026-05-19T07:54:08.508Z","response_time":58,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-12-27T05:19:39.385Z","updated_at":"2026-05-19T12:07:15.987Z","avatar_url":"https://github.com/mdda.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"## Wordplay Dataset - (for Cryptic Crossword clues)\n\nThis repo includes the tools for building the **Wordplay Dataset** - \na dataset of Cryptic Crossword Clue solutions \ncreated by enthusiastic solvers, \neach of whom has submitted (over the years) their \nwordplay breakdowns of many cryptic crosswords to sites such as:\n\n* [https://FifteenSquared.net/](https://www.fifteensquared.net/)\n* [https://TimesForTheTimes.co.uk/](https://timesforthetimes.co.uk/)\n\nIncluded in the repo are _scrapers_ that are known to be effective for :\n* the above two sites,\n* across a number of the authors,\n* which are robust to the wide variety of formats used\n* by each author and over 
a long period of time.\n\nSince permission has not been sought from these websites (yet), \nthe full dataset is not downloadable from here. The code for extracting the wordplay lists from author pages\nis included here in the `wordplay` module, and convenience scripts will soon be provided so that \ndata can be gathered automatically (though it is not clear that more than the 5000 wordplay samples provided in the dataset sample in `./prebuilt` are essential to train a useful model - see below).\n\n\n### LLM post-training warning\n\nIf you are looking for a post-training set for an LLM : Look elsewhere!  \nThis dataset is intended for experimentation on reasoning tasks, \nand simply trying to use the dataset as training material \nwould be as pointless as training on ARC Challenge tasks...  \n\n\n\n## Download\n\nThere is a sample dataset with ~5400 training examples in the `./prebuilt` directory, with two splits : \"train\" and \"val\".\n\nThis sample has the following characteristics:\n* Single author: `teacow` on [https://FifteenSquared.net/](https://FifteenSquared.net/author/teacow)\n  + chosen for their clear and consistent `wordplay` annotations across more than 6 years of submission\n* Financial Times clue solutions (predominantly) - typically of difficulty similar to the regular Times Cryptic\n* Retrieved using `custom` (i.e. manually coded) scraping tools \n  + should not suffer from partial captures\n\nEven with 'only' 5K examples, this sample dataset has been found sufficient to fine-tune ~7B models to guess at `definition` and `wordplay` pairs for new clues.  \n\n\n### Splits\n\nThe splits used for this Wordplay Dataset are the same as those first given in [Cryptonite](https://github.com/aviaefrat/cryptonite) - and we attempt to enforce that choice in the dataset generation tools provided here.  
For certainty, the \"val\" and \"test\" wordlists derived from Cryptonite are given\nin `./prebuilt`.\n\nIntentionally, the \"test\" version of the wordplay data is not provided, \nso that it won't be incorporated into web trawls (which could contaminate LLM training sets).\n\nTo preserve the integrity/usefulness of this Wordplay dataset, please: \n* don't even consider creating the 'test' split; and/or\n* be careful not to let a copy of any 'test' split leak onto the internet.\n\n\n### Dataset Format\n\nEach line of the `jsonl` file contains the following fields:\n* `clue` : The clue as given in the puzzle, but with the definition part(s) surrounded with '{}' brackets\n* `pattern` : The number of letters in the answer - as given in the puzzle\n* `ad` : {A,D} = Across / Down \n* `answer` : The uppercase answer that would be written in the grid - may contain spaces and '-'\n* `wordplay` : Wordplay 'analysis', which can be in a wide variety of formats/styles\n* `author` : this identifies the wordplay analysis author\n* `setter` : name of the puzzle creator\n* `publication` : where the puzzle originally appeared (simplistic)\n* `is_quick` : whether the puzzle was a 'Quick' variant (simplistic)\n\nNote that the lines in the dataset are ordered according to their extraction / scraping - so they\nare grouped by author / in date order / in puzzle clue-order.  
It is very likely that they \nrequire shuffling before use (or, practically speaking, an index list should be shuffled, so they\ncan be indexed into in a pre-defined 'random' order).\n\nEach clue/answer/wordplay data item is also:\n* Sanitised : \n  + For instance: if a given `wordplay` appears to be a `Double Definition`, it will start with that string exactly\n* Sanity-checked:\n  + Does the `answer` string match the `pattern` for the clue?\n  + Are a majority of the letters in the `answer` present as upper-case characters in the `wordplay`?\n  + Does the `clue` contain a span highlighted with '{}' as the definition (twice in the case of Double Definition wordplay)?\n* ... see [`./wordplay/__init__.py#L300`](/mdda/cryptic-wordplay/blob/main/wordplay/__init__.py#L300) for more details\n\n\n## Installation\n\nTo use the scrapers directly, ensure their dependencies are installed:\n\n```bash\npip install --upgrade pip\npip install requests bs4 OmegaConf\ngit clone https://github.com/mdda/cryptic-wordplay.git\n```\n\nImport the module (it looks up its own configuration from `./sites/config.yaml`, and caches website files in `./sites/SITENAME/`):\n\n```python\nimport sys\n\np='./cryptic-wordplay'\nif p not in sys.path:\n  sys.path.append(p)\n\nimport wordplay\nprint( wordplay.config )\n```\n\nNote that the scrapers will cache index pages for the authors specified, and then cache the referenced\nwebpages.  
Accesses are spaced apart so as not to inconvenience the sites' maintainers.\n\nThere are two kinds of scraping tools included: \n* The `custom` scrapers used for the sample dataset in `./prebuilt`\n  + Specifically built to capture `div.fts-group` and `p[]` styles of HTML pages\n* A more advanced `generic` scraper that (should) adaptively figure out how the list of clues/answers/wordplay annotations is formatted, and scrape those\n  + This is not perfect, but is able to gather a good percentage of available pages\n* When/if there is time available, the next avenue to improve things is probably to experiment with LLM-based parser generation\n  + Testing a parse is quick/cheap, and can be verified to some degree\n    - so testing all cached parse methods is also relatively cheap\n  + And if none works, then ask a commercial LLM (such as Gemini-Flash) to come up with a parsing scheme\n    - and loop until it works or exhaustion sets in\n\n\n### Assembling a dataset (with train/val splits)\n\nHere are some example invocations of the dataset creation utility that pull from several authors :\n\n```bash\npython create_dataset_with_splits.py  --author teacow --site fifteensquared --pages -1\npython create_dataset_with_splits.py  --author pipkirby --site timesforthetimes --pages -1\npython create_dataset_with_splits.py  --author chris-woods --site timesforthetimes --pages -1\n```\n\nOnce _enough_ data has been generated, find the files within the directory structure:\n\n```bash\nfor split in train val; do\n  find sites | grep author_aggregate_${split}.jsonl | sort \u003e list.${split}\ndone\n```\n\nThis will create `list.train` and `list.val` files with lists of files that can be combined.\nEdit these lists to select for the authors/sites required.\n\nThen, combine the `jsonl` files listed into `wordplay_DATE_SPLIT.jsonl` :\n```bash\ndt=`date --iso-8601=date`\nfor split in train val; do\n  { xargs cat \u003c list.${split} ; } | uniq \u003e 
wordplay_${dt}_${split}.jsonl\ndone\n```\n\nGood luck with the Cryptic Crossword solving!\n\n\n\n## Dataset Citation\n\nPlease cite this dataset as follows:\n```latex\n@software{Wordplay_dataset_repo,\n  author = {Andrews, Martin},\n  title = {{Wordplay Dataset}},\n  url = {https://github.com/mdda/cryptic-wordplay},\n  version = {0.0.1},\n  year = {2024}\n}\n```\n\n### Related Papers\n\nThe following paper(s) make use of the Wordplay dataset:\n\n* [\"Proving that Cryptic Crossword Clue Answers are Correct\"](https://arxiv.org/abs/2407.08824) - Andrews \u0026 Witteveen (2024)\n  + Accepted at the [ICML 2024 Workshop on LLMs and Cognition](https://llm-cognition.github.io/)\n  + [Explainer Video on YouTube](https://www.youtube.com/watch?v=vLITb6XDTQ8)\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmdda%2Fcryptic-wordplay","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmdda%2Fcryptic-wordplay","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmdda%2Fcryptic-wordplay/lists"}