{"id":40961920,"url":"https://github.com/ilkka/gtgcrawler","last_synced_at":"2026-01-22T05:44:40.752Z","repository":{"id":204637007,"uuid":"710952955","full_name":"ilkka/gtgcrawler","owner":"ilkka","description":"Playing around with Guess the Game data","archived":false,"fork":false,"pushed_at":"2026-01-10T06:25:42.000Z","size":279,"stargazers_count":0,"open_issues_count":8,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-01-11T01:51:16.129Z","etag":null,"topics":["data-scraping","elixir","livebook"],"latest_commit_sha":null,"homepage":"","language":"Julia","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ilkka.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2023-10-27T20:03:06.000Z","updated_at":"2023-10-31T14:21:25.000Z","dependencies_parsed_at":null,"dependency_job_id":"691c0ac7-449a-410b-b891-85237e877d7f","html_url":"https://github.com/ilkka/gtgcrawler","commit_stats":null,"previous_names":["ilkka/gtgcrawler"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/ilkka/gtgcrawler","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ilkka%2Fgtgcrawler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ilkka%2Fgtgcrawler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ilkka%2Fgtgcrawler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ilkka%2Fgtgcrawler/manifests","
owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ilkka","download_url":"https://codeload.github.com/ilkka/gtgcrawler/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ilkka%2Fgtgcrawler/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28656569,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-22T01:17:37.254Z","status":"online","status_checked_at":"2026-01-22T02:00:07.137Z","response_time":144,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-scraping","elixir","livebook"],"created_at":"2026-01-22T05:44:40.102Z","updated_at":"2026-01-22T05:44:40.747Z","avatar_url":"https://github.com/ilkka.png","language":"Julia","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Gtgcrawler\n\nThere was talk about [Guess the Game](https://guessthe.game) in one of my\ncommunities, and a person said that GtG has a bias towards more recent games.\nI was immediately interested in this claim, because it seems like it should be\npossible to find out if this is the case. My idea was that I would get the\npublication years for all the GtG answers and compare that to the overall\nnumber of games published in the same year.\n\nThe timing was also good because I wanted an excuse to play with Elixir. 
After\na bit of searching I found [an article about using Crawly](https://www.scrapingbee.com/blog/web-scraping-elixir/)\nwhich seemed to fit the bill.\n\nI initially wrote a scraper in a mix project for a results website that claimed\nto have GtG answers for one year, but later learned that GtG has an API from which\nI can just grab all the answers. I then integrated both that and the rest\nof the processing into a [livebook](https://livebook.dev), because I wanted\nto play more with that as well.\n\nNow the livebook does the following:\n\n1. Grab the answers from the GtG API\n2. Using game titles, try querying wikidata.org (the structured data service\n   behind Wikipedia) for the publication year\n3. Crawl a general videogame statistics website to get an idea of how many\n   games were published for a given year\n4. Plot the two datasets for visual comparison\n\nThe crawlers try to be Nice in a couple of ways:\n\n- They respect sites' [robots.txt](https://en.wikipedia.org/wiki/Robots.txt) files\n- They send a User-Agent string that includes identifying information\n\n## Running it\n\nRun it by installing [livebook](https://livebook.dev), opening [game-data-maelving.livemd](game-data-maelving.livemd) and executing the cells. It is not necessary to re-run the crawlers;\ninstead one can use the data files included in the repo.\n\nThe `start-livebook.sh` script can be used to run livebook in Docker.\n\n## Results\n\nThis is still far removed from actual science, but there seems to be a\nsomewhat weak correlation between the number of overall published games (green line)\nand the number of GtG puzzle answers (blue bars) for a given year. The hump in publications in the 90s\nis especially interesting, as it is not reflected in the GtG data. On the other hand, there are\na number of games missing from the GtG data, due to not being able to find the correct Wikipedia\npage title (or the page not existing at all). 
There have been about 530 days of GtG as of the time of\nwriting, and I was able to scrape publication years for about 350 of those, so about 1/3 of the\ngames are missing from the data.\n\nI can see how there might be a bias affecting which games even _get_ Wikipedia pages, but if we make\nthe not-too-outrageous-in-my-opinion assumption that the missing GtG year data is evenly distributed,\nI would be happy to conclude that there is no significant bias towards more recent games in GtG.\n\n![graph showing the distribution of GtG answers vs published games per year](docs/files/visualization.png)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Filkka%2Fgtgcrawler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Filkka%2Fgtgcrawler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Filkka%2Fgtgcrawler/lists"}