{"id":28347373,"url":"https://github.com/jglauber/wikipediastats","last_synced_at":"2026-03-07T11:03:55.088Z","repository":{"id":292963675,"uuid":"982521291","full_name":"jglauber/wikipediastats","owner":"jglauber","description":"A package to process data from Wikimedia using the server sent events (SSE) protocol.","archived":false,"fork":false,"pushed_at":"2025-07-10T13:57:24.000Z","size":32,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-11-28T00:33:17.175Z","etag":null,"topics":["aiohttp","sse-client","statistics","wikimedia","wikipedia"],"latest_commit_sha":null,"homepage":"https://github.com/jglauber/wikipediastats/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jglauber.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-05-13T02:37:39.000Z","updated_at":"2025-07-10T13:57:20.000Z","dependencies_parsed_at":"2025-05-13T03:01:35.338Z","dependency_job_id":"7a1b6786-cc0b-40b3-a032-03d7661292f2","html_url":"https://github.com/jglauber/wikipediastats","commit_stats":null,"previous_names":["jglauber/wikipediastats"],"tags_count":4,"template":false,"template_full_name":null,"purl":"pkg:github/jglauber/wikipediastats","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jglauber%2Fwikipediastats","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jglauber%2Fwikipediastats/tags","releases_url":
"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jglauber%2Fwikipediastats/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jglauber%2Fwikipediastats/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jglauber","download_url":"https://codeload.github.com/jglauber/wikipediastats/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jglauber%2Fwikipediastats/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30212103,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-07T09:02:10.694Z","status":"ssl_error","status_checked_at":"2026-03-07T09:02:08.429Z","response_time":53,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["aiohttp","sse-client","statistics","wikimedia","wikipedia"],"created_at":"2025-05-27T16:40:49.385Z","updated_at":"2026-03-07T11:03:55.029Z","avatar_url":"https://github.com/jglauber.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Welcome to StatSpEdia\n\nA tool written in Python 3.13 utilizing the async aiohttp package to grab and process data from Wikimedia using the server sent events (SSE) protocol.\n\nAll data is stored as individual documents in a local mongodb database.\n\n## Prerequisites\n\nPrior to installing the python package, please install mongodb community 
edition on your machine using the instructions here: [mongodb installation guide](https://www.mongodb.com/docs/manual/installation/)\n\n## Installation\n\nTo install a local copy please run:\n`pip install statspedia`\n\n## Example Usage\n\n### Create an Instance of the WikiStream Class\n\n```python\nfrom statspedia import WikiStream\nimport asyncio\n\nasync def main():\n    ws = WikiStream()\n    return await ws.stream()\n    \nasyncio.run(main())\n```\n\n### Program Console Output\n\nBy default, logs are printed to the console and stored in a logs/ folder at the root directory.\n\nA sample log output is as follows:\n\n```bash\n2025-06-08 14:25:13,787 - statspedia.wiki_stream - DEBUG - Buffer will be cleared when chunk completes object\n2025-06-08 14:25:31,036 - statspedia.wiki_stream - DEBUG - HTTP chunk does not contain full object.\n2025-06-08 14:25:31,036 - statspedia.wiki_stream - DEBUG - Buffer will be cleared when chunk completes object\n2025-06-08 14:25:37,384 - statspedia.wiki_stream - DEBUG - Wiki Edit List Count is 74. Clearing and Saving to MongoDB\n2025-06-08 14:25:37,387 - statspedia.wiki_stream - DEBUG - A new deep copy of Wiki Edit List was created successfully\n2025-06-08 14:25:37,393 - statspedia.wiki_stream - INFO - Wiki Edit List written to latest_edits collection in MongoDB\n2025-06-08 14:25:37,394 - statspedia.wiki_stream - DEBUG - Wiki Edit List succesfully cleared.\n2025-06-08 14:25:37,395 - statspedia.wiki_stream - DEBUG - Program started at: 2025-06-08 01:42:55.524709+00:00\n2025-06-08 14:25:37,395 - statspedia.wiki_stream - DEBUG - Current hour: 2025-06-08 21:00:00+00:00\n```\n\n### Data Schema and Basic Queries\n\nEvery server-sent event from the English Wikipedia is saved as a document in MongoDB, in a database named wiki_stream under the collection latest_changes. Every hour, the program summarizes the previous hour's data in the same database, in a collection named statistics. 
Each of these collections may be queried using mongosh shell commands or the Python driver [pymongo](https://pypi.org/project/pymongo/).\n\nThe schema for the latest_changes documents is:\n\n```json\n[\n  {\n    \"_id\": \"ObjectId()\",\n    \"$schema\": \"/mediawiki/recentchange/1.0.0\",\n    \"meta\": {\n      \"uri\": \"string\",\n      \"request_id\": \"string\",\n      \"id\": \"string\",\n      \"dt\": \"ISODate()\",\n      \"domain\": \"en.wikipedia.org\",\n      \"stream\": \"mediawiki.recentchange\",\n      \"topic\": \"eqiad.mediawiki.recentchange\",\n      \"partition\": \"int\",\n      \"offset\": \"Long()\"\n    },\n    \"id\": \"int\",\n    \"type\": \"edit\",\n    \"namespace\": \"int\",\n    \"title\": \"string\",\n    \"title_url\": \"string\",\n    \"comment\": \"string\",\n    \"timestamp\": \"int\",\n    \"user\": \"string\",\n    \"bot\": \"bool\",\n    \"notify_url\": \"string\",\n    \"minor\": \"bool\",\n    \"length\": { \"old\": \"int\", \"new\": \"int\" },\n    \"revision\": { \"old\": \"int\", \"new\": \"int\" },\n    \"server_url\": \"https://en.wikipedia.org\",\n    \"server_name\": \"en.wikipedia.org\",\n    \"server_script_path\": \"/w\",\n    \"wiki\": \"enwiki\",\n    \"parsedcomment\": \"string\",\n    \"bytes_change\": \"int\"\n  }\n]\n```\n\nThe schema for statistics is:\n\n```json\n{\n    \"most_data_added\": {},\n    \"most_data_removed\": {},\n    \"top_editors\": {},\n    \"top_editors_bots\": {},\n    \"all_editors\": {},\n    \"all_editors_bots\": {},\n    \"top_edited_articles\": {},\n    \"all_edited_articles\": {},\n    \"num_edited_articles\": \"int\",\n    \"num_editors\": \"int\",\n    \"num_editors_bots\": \"int\",\n    \"num_edits\": \"int\",\n    \"bytes_added\": \"int\",\n    \"bytes_removed\": \"int\",\n    \"total_bytes_change\": \"int\",\n    \"timestamp\": \"ISODate()\"\n\n}\n```\n\nBelow are some simple examples of how to query the database using pymongo. 
For more information on queries please see the documentation here: \n\n```python\nfrom pymongo import MongoClient\nfrom pymongo.cursor import Cursor\nfrom datetime import datetime, timezone\n\nclient = MongoClient(host='mongodb://127.0.0.1',port=27017)\ndb = client.wiki_stream\ncollection1 = db.statistics\ncollection2 = db.latest_changes\n\n\ndef create_cur(field: str, collection) -\u003e Cursor:\n    cur = collection.find({field: {'$exists': 1}},{'_id': 0, field: 1})\n    return cur\n\ndef edit_count_by_user(cur: Cursor, field: str):\n    user_edit_dict = {}\n\n    for i in cur:\n        user_generator = ((k,v) for (k,v) in i[field].items())\n        for user,edit_count in user_generator:\n            try:\n                user_edit_dict[user] += edit_count\n            except KeyError:\n                user_edit_dict[user] = edit_count\n    \n    total_unique_editors = len(user_edit_dict.keys())\n\n    sorted_user_edit_dict = dict(sorted(user_edit_dict.items(),\n                                        key=lambda item: item[1],\n                                        reverse=True)[0:10])\n\n    return sorted_user_edit_dict, total_unique_editors\n\n\ndef edit_count_by_document(cur: Cursor, field: str):\n    document_edit_dict = {}\n    count = 0\n    for i in cur:\n        document_title = i[field]\n        try:\n            document_edit_dict[document_title] += 1\n        except KeyError:\n            document_edit_dict[document_title] = 1\n        count += 1\n    \n    total_documents_edited = len(document_edit_dict.keys())\n\n    sorted_document_edit_dict = dict(sorted(document_edit_dict.items(),\n                                        key=lambda item: item[1],\n                                        reverse=True)[0:10])\n\n    return sorted_document_edit_dict, total_documents_edited, count\n\n\n\ndef sum_across_all_stats(cur: Cursor, field: str):\n    total = 0\n\n    for i in cur:\n        total += i[field]\n    \n    return total\n\ncur = 
create_cur('all_editors', collection1)\nusers, num_unique_users = edit_count_by_user(cur,'all_editors')\nprint(f\"Top Editors (Human) All Time: {users}\")\nprint(f\"Total Editors (Human) All Time: {num_unique_users}\")\n\ncur2 = create_cur('all_editors_bots', collection1)\nusers, num_unique_users_bots = edit_count_by_user(cur2,'all_editors_bots')\nprint(f\"Top Editors (Bots) All Time: {users}\")\nprint(f\"Total Editors (Bots) All Time: {num_unique_users_bots}\")\n\ncur3 = create_cur('num_edits', collection1)\nnum_edits = sum_across_all_stats(cur3,'num_edits')\nprint(f\"Total Edits All Time {num_edits}\")\n\ncur4 = create_cur('bytes_added', collection1)\nbytes_added = sum_across_all_stats(cur4,'bytes_added')\nprint(f\"Total MB Added All Time {bytes_added/1e6}\")\n\ncur5 = create_cur('bytes_removed', collection1)\nbytes_removed = sum_across_all_stats(cur5,'bytes_removed')\nprint(f\"Total MB Removed All Time {bytes_removed/1e6}\")\n\ncur6 = create_cur('total_bytes_change', collection1)\nbytes_change = sum_across_all_stats(cur6,'total_bytes_change')\nprint(f\"Total MB Change All Time {bytes_change/1e6}\")\n\ncur7 = create_cur('timestamp', collection1)\nfor i in cur7:\n    print(f\"Data recording started on: {i['timestamp']}\")\n    break\n\ncur8 = create_cur('title', collection2)\ntop_docs_edited, total_docs_edited, count = edit_count_by_document(cur8,'title')\nprint(f\"Most Edited Docs: {top_docs_edited}\")\nprint(f\"Total Edited Docs: {total_docs_edited}\")\nprint(count)\n\ncur9 = create_cur('all_edited_articles', collection1)\narticles, total_articles = edit_count_by_user(cur9,'all_edited_articles')\nprint(f\"Most Edited Docs: {articles}\")\nprint(f\"Total Edited Docs: {total_articles}\")\n```\n\n### Stopping the Program\n\nThe program may be safely stopped using `ctrl + c`, which cancels all active async tasks.\n\nThe following is printed to the console:\n\n```bash\nAll tasks cancelled.\nElapsed Time: 0.0 days 0.0 hours 0.0 mins 18.9 secs\n```\n\n## 
License\n\nMIT\n\n## Project Status\n\nIn development.\n\n## Authors\n\nJohn Glauber\n\n## Contact\n\nFor any questions, comments, or suggestions please reach out via email to:  \n  \nJohn Glauber  \n\u003cjohnbglauber@gmail.com\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjglauber%2Fwikipediastats","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjglauber%2Fwikipediastats","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjglauber%2Fwikipediastats/lists"}