{"id":19289781,"url":"https://github.com/ragibson/sms-mms-deduplication","last_synced_at":"2025-04-22T05:31:45.642Z","repository":{"id":219314440,"uuid":"626688629","full_name":"ragibson/SMS-MMS-deduplication","owner":"ragibson","description":"Tool to remove duplicate text messages (SMS/MMS/RCS). RCS support is available for some clients.","archived":false,"fork":false,"pushed_at":"2025-03-12T01:43:43.000Z","size":115,"stargazers_count":16,"open_issues_count":0,"forks_count":2,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-04-01T20:00:11.040Z","etag":null,"topics":["deduplication","mms","rcs","sms","text-message"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ragibson.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-04-12T01:10:05.000Z","updated_at":"2025-03-13T04:25:27.000Z","dependencies_parsed_at":null,"dependency_job_id":"8160fc30-9b38-4cb2-8e19-7cf44625ff36","html_url":"https://github.com/ragibson/SMS-MMS-deduplication","commit_stats":null,"previous_names":["ragibson/sms-mms-deduplication"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ragibson%2FSMS-MMS-deduplication","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ragibson%2FSMS-MMS-deduplication/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ragibson%2FSMS-MMS-deduplication/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub
/repositories/ragibson%2FSMS-MMS-deduplication/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ragibson","download_url":"https://codeload.github.com/ragibson/SMS-MMS-deduplication/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250175067,"owners_count":21387132,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deduplication","mms","rcs","sms","text-message"],"created_at":"2024-11-09T22:17:03.887Z","updated_at":"2025-04-22T05:31:45.635Z","avatar_url":"https://github.com/ragibson.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# SMS-MMS-deduplication\n\nThis is a simple tool to remove duplicate text messages from XML backups of\nthe \"SMS Backup \u0026 Restore\" format.\n\nIt also supports removal of more complicated duplicates than other tools while\ntaking extreme care not to identify any false positives.\n\nFor example, we handle instances where\n\n* One message contains a data attachment (e.g., images sent via text message),\n  but the other does not\n* The phone numbers inconsistently include or exclude country codes\n* The internal timestamps have inconsistent millisecond vs. 
second precision\n* The internal ordering of phone numbers is inconsistent between messages\n* The internal [SMIL data](https://en.wikipedia.org/wiki/Synchronized_Multimedia_Integration_Language)\n  format varies, but the message content and data are otherwise identical\n* The internal storage fields are inconsistently omitted or `null`\n\nThese conflicts tend to occur when using multiple backup agents over time or\nsimultaneously. For example, they can arise from accidentally recovering data\nfrom Google's backups *and* Samsung's backups, or simply from changing\nmanufacturers or carriers.\n\nIf you intend to use this to remove duplicated messages on your device (rather\nthan in your backup location), please read [\"An important warning about\ndeduplicating messages *on a device* in practice\"](#ImportantWarning).\n\n## Simple Usage\n\nThis tool is simple to use and can process files of several gigabytes in a\nfew seconds.\n\nFor example,\n\n```commandline\npython3 dedupe_texts.py example-input.xml example-output.xml deduplication-results.log\n```\n\n### Console Output\n\nThis will produce output of the following form.\n\n```\nReading 'example-input.xml'... Done in 8.1 s.\nPreparing log file 'deduplication-results.log'.\nSearching for duplicates... Done in 5.9 s.\nDeduplication Summary:\n    Message Type    |   Original Count   |      Removed       | Deduplicated Count \n        mms         |       24893        |       10325        |       14568        \n        sms         |       19828        |         0          |       19828        \nWriting 'example-output.xml'... Done in 3.8 s\n```\n\nor\n\n```\nReading 'example-input.xml'... Done in 8.1 s.\nPreparing log file 'deduplication-results.log'.\nSearching for duplicates... 
Done in 5.8 s.\nDeduplication Summary:\n    Message Type    |   Original Count   |      Removed       | Deduplicated Count \n        mms         |       14341        |         0          |       14341        \n        sms         |       19676        |         0          |       19676        \nNo duplicate messages found. Skipping writing of output file.\n```\n\nIf instead you get an `lxml.etree.XMLSyntaxError` like those below, please refer to\n[handling_extremely_large_text_messages.md](handling_extremely_large_text_messages.md).\n\n```\nlxml.etree.XMLSyntaxError: AttValue length too long, line 2, column 1000000xxx\nlxml.etree.XMLSyntaxError: Resource limit exceeded: Buffer size limit exceeded, try XML_PARSE_HUGE, line xxxxxx, column 99yyyyyyy\n```\n\n### Log File Output\n\nThe log file contains sections of the following form for each removed message.\n\n```\nRemoving mms:\n    date: 1680729606000\n address: \u003cREDACTED #1\u003e | \u003cREDACTED #2\u003e\n    text: look at this amazing picture!\n  m_type: 128\n    type: 137 | 151\n\nIn favor of keeping mms:\n    date: 1680729606000\n address: \u003cREDACTED #1\u003e | \u003cREDACTED #2\u003e\n    text: look at this amazing picture!\n  m_type: 128\n    type: 137 | 151\n    data: \u003cLENGTH 539706 OMISSION\u003e\n```\n\n### Full Usage Details\n\nThe full usage information with a few optional features is below.\n\n```\nusage: dedupe_texts.py [-h] [--default-country-code [DEFAULT_COUNTRY_CODE]]\n                       [--ignore-date-milliseconds]\n                       [--ignore-whitespace-differences] [--aggressive]\n                       input_file [output_file] [log_file]\n\nDeduplicate text messages from XML backup.\n\npositional arguments:\n  input_file            The input XML to deduplicate.\n  output_file           The output file to save deduplicated entries. 
Defaults\n                        to the input filepath with \"_deduplicated\" appended to\n                        the filename.\n  log_file              The log file to record details of each removed\n                        message. Defaults to the input filepath with\n                        \"_deduplication.log\" appended to the filename.\n\noptions:\n  -h, --help            show this help message and exit\n  --default-country-code [DEFAULT_COUNTRY_CODE]\n                        Default country code to assume if a phone number has\n                        no country code. Treat phone numbers as identical if\n                        they include this country code or none at all.\n                        Defaults to +1 (United States / Canada).\n  --ignore-date-milliseconds\n                        Ignore millisecond precision in dates if timestamps\n                        are slightly inconsistent. Treat identical messages as\n                        duplicates if received in the same second.\n  --ignore-whitespace-differences\n                        Ignore whitespace differences in text messages. Treat\n                        identical messages as duplicates if they differ only\n                        in the type of whitespace or leading/trailing spaces.\n  --aggressive          Only consider timestamp and body/text/data in\n                        identifying duplicates. 
Treat any matching messages as\n                        duplicates, regardless of address, messaging protocol\n                        (SMS, MMS, RCS, etc.), or other fields.\n```\n\n\u003ca name = \"ImportantWarning\"\u003e\u003c/a\u003e\n\n## An important warning about deduplicating messages *on a device* in practice\n\nNote that\n\n* SMS Backup \u0026 Restore avoids restoring duplicates by default, and\n* Most messaging clients/apps actually hide deleted conversations before they\n  are deleted internally (they continue the deletion work in the background)\n\nThus, if you flag conversations for deletion and then start restoring from\nbackup (without verifying the message deletion has completed internally),\n***you may lose messages!***\n\nIn these cases, the backup restoration essentially detects duplicates of\nmessages that were mid-deletion and only completes a partial restore. E.g.,\n\n* With duplicates where some messages have data attachments and others do not,\n  you may lose images, shared contacts, etc. from text messages\n* Some messaging clients may continue the deletion after the backup is\n  restored, in which case you will simply lose entire messages or conversations\n\nWith this in mind, to deduplicate messages on a device itself, you should\n\n1) Perform the backup and deduplicate it (retain both versions, just in case)\n2) Confirm you have not received any new messages in the meantime (consider\n   using airplane mode)\n3) Mass-delete your text messages\n4) Wait a few minutes (the time required depends on your phone's processing\n   speed, the number of messages, etc.)\n5) Clear your messaging client's data (`App Info \u003e Storage \u003e Clear Data`) to\n   force a refresh of the text message view\n6) Confirm that no messages appear in the view. 
Otherwise, return to step #2\n7) Restore from the deduplicated backup and verify that all messages were\n   restored before removing the original (non-deduplicated) backup file\n\nFor step #7, consider keeping the restoration's duplicate check enabled. If it\ndetects *any* duplicates when restoring to a phone that appears to have zero\nexisting text messages, that should be a *major* warning that something has\ngone wrong.\n\nMoreover, if you create a new backup afterward, it should not be much smaller\nthan the one you restored from!\n\n## Related Work\n\nThis tool is somewhat inspired by\n[legacy work by radj](https://github.com/radj/AndroidSMSBackupRestoreCleaner),\nbut is significantly simpler, much more up to date, requires far fewer\ndependencies/setup, and supports MMS messages.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fragibson%2Fsms-mms-deduplication","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fragibson%2Fsms-mms-deduplication","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fragibson%2Fsms-mms-deduplication/lists"}