{"id":14235966,"url":"https://github.com/icy/google-group-crawler","last_synced_at":"2025-04-10T03:55:31.404Z","repository":{"id":10761816,"uuid":"13024589","full_name":"icy/google-group-crawler","owner":"icy","description":"[Deprecated] Get (almost) original messages from google group archives. Your data is yours.","archived":false,"fork":false,"pushed_at":"2022-03-25T15:11:30.000Z","size":154,"stargazers_count":215,"open_issues_count":6,"forks_count":38,"subscribers_count":12,"default_branch":"master","last_synced_at":"2025-04-10T03:55:28.071Z","etag":null,"topics":["bash","cookie","crawler","curl","google","ownership","wget"],"latest_commit_sha":null,"homepage":"","language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/icy.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2013-09-23T01:48:47.000Z","updated_at":"2025-02-01T05:11:29.000Z","dependencies_parsed_at":"2022-09-13T14:11:15.781Z","dependency_job_id":null,"html_url":"https://github.com/icy/google-group-crawler","commit_stats":null,"previous_names":[],"tags_count":6,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/icy%2Fgoogle-group-crawler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/icy%2Fgoogle-group-crawler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/icy%2Fgoogle-group-crawler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/icy%2Fgoogle-group-crawler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/icy","download_url":"https://codeload.github.com/icy
/google-group-crawler/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248155002,"owners_count":21056542,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bash","cookie","crawler","curl","google","ownership","wget"],"created_at":"2024-08-20T21:02:35.100Z","updated_at":"2025-04-10T03:55:31.386Z","avatar_url":"https://github.com/icy.png","language":"Shell","funding_links":[],"categories":["Shell"],"sub_categories":[],"readme":"WARNING: This project doesn't work and it's deprecated. \n**Reason:** Ajax support is completely deprecated by Google\n  See also https://github.com/icy/google-group-crawler/issues/42#issuecomment-889013487\n\n[![Build Status](https://travis-ci.org/icy/google-group-crawler.svg?branch=master)](https://travis-ci.org/icy/google-group-crawler)\n\n## Download all messages from Google Group archive\n\n`google-group-crawler` is a `Bash-4` script to download all (original)\nmessages from a Google group archive.\nPrivate groups require some cookies string/file.\nGroups with adult contents haven't been supported yet.\n\n* [Installation](#installation)\n* [Usage](#usage)\n  * [The first run](#the-first-run)\n  * [Update your local archive thanks to rss feed](#update-your-local-archive-thanks-to-rss-feed)\n  * [Private group or Group hosted by an organization](#private-group-or-group-hosted-by-an-organization)\n  * [The hook](#the-hook)\n  * [What to do with your local archive](#what-to-do-with-your-local-archive)\n  * [Rescan the whole local archive](#rescan-the-whole-local-archive)\n  * [Known 
problems](#known-problems)\n* [Contributions](#contributions)\n* [Similar projects](#similar-projects)\n* [License](#license)\n* [Author](#author)\n* [For script hackers](#for-script-hackers)\n\n## Installation\n\nThe script requires `bash-4`, `sort`, `curl`, `sed`, `awk`.\n\nMake the script executable with `chmod 755` and put it in your path\n(e.g., `/usr/local/bin/`).\n\nThe script may not work in a `Windows` environment, as reported in\nhttps://github.com/icy/google-group-crawler/issues/26.\n\n## Usage\n\n### The first run\n\nFor a private group, please\n[prepare your cookies file](#private-group-or-group-hosted-by-an-organization).\n\n    # export _CURL_OPTIONS=\"-v\"       # use curl options to provide e.g, cookies\n    # export _HOOK_FILE=\"/some/path\"  # provide a hook file, see in #the-hook\n\n    # export _ORG=\"your.company\"      # required, if you are using Gsuite\n    export _GROUP=\"mygroup\"           # specify your group\n    ./crawler.sh -sh                  # first run for testing\n    ./crawler.sh -sh \u003e curl.sh        # save your script\n    bash curl.sh                      # downloading mbox files\n\nYou can execute the `curl.sh` script multiple times, as `curl` will quickly\nskip any fully downloaded files.\n\n### Update your local archive thanks to RSS feed\n\nAfter you have an archive from the first run you only need to add the latest\nmessages as shown in the feed. You can do that with the `-rss` option and the\nadditional `_RSS_NUM` environment variable:\n\n    export _RSS_NUM=50                # (optional. See Tips \u0026 Tricks.)\n    ./crawler.sh -rss \u003e update.sh     # using rss feed for updating\n    bash update.sh                    # download the latest posts\n\nRun this update regularly to keep your local archive current.\n\n### Private group or Group hosted by an organization\n\nTo download messages from a private group or a group hosted by your organization,\nyou need to provide some cookie information to the script. 
In the past,\nthe script used `wget` and the Netscape cookie file format;\nnow it uses `curl` with a cookie string and a configuration file.\n\n0. Open Firefox, press F12 to enable Debug mode and select the Network tab\n   from the Debug console of Firefox. (You may find a similar way for\n   your favorite browser.)\n1. Log in to your test Google account, and access your group.\n   For example\n     https://groups.google.com/forum/?_escaped_fragment_=categories/google-group-crawler-public\n   (replace `google-group-crawler-public` with your group name).\n   Make sure you can read some contents with your own group URI.\n2. Now from the Network tab in the Debug console, select the address\n   and select `Copy -\u003e Copy Request Headers`. You will have a lot of\n   things in the result, but please paste them in your text editor\n   and keep only the `Cookie` part.\n3. Now prepare a file `curl-options.txt` as below\n\n        user-agent = \"Mozilla/5.0 (X11; Linux x86_64; rv:74.0) Gecko/20100101 Firefox/74.0\"\n        header = \"Cookie: \u003csnip\u003e\"\n\n   Of course, replace the `\u003csnip\u003e` part with your own cookie string.\n   See `man curl` for more details of the file format.\n\n4. Specify your cookie file with `_CURL_OPTIONS`:\n\n        export _CURL_OPTIONS=\"-K /path/to/curl-options.txt\"\n\n   Now the private groups your account can access can be downloaded :)\n\n### The hook\n\nIf you want to execute a `hook` command after an `mbox` file is downloaded,\nproceed as follows.\n\n1. Prepare a Bash script file that contains a definition of the `__curl_hook`\n   command. The first argument is the output filename, and the\n   second argument is the URL. 
For example, here is a simple hook:\n\n        # $1: output file\n        # $2: url (https://groups.google.com/forum/message/raw?msg=foobar/topicID/msgID)\n        __curl_hook() {\n          if [[ \"$(stat -c %s \"$1\")\" == 0 ]]; then\n            echo \u003e\u00262 \":: Warning: empty output '$1'\"\n          fi\n        }\n\n    In this example, the `hook` will check if the output file is empty,\n    and send a warning to the standard error device.\n\n2. Set the environment variable `_HOOK_FILE` to the path\n   of your hook file. For example,\n\n        export _GROUP=archlinuxvn\n        export _HOOK_FILE=$HOME/bin/curl.hook.sh\n\n   The hook file will now be loaded by the scripts generated by\n   `crawler.sh -sh` or `crawler.sh -rss`.\n\n### What to do with your local archive\n\nThe downloaded messages are found under `$_GROUP/mbox/*`.\n\nThey are in `RFC 822` format (possibly with obfuscated email addresses)\nand they can be converted to `mbox` format easily before being imported\ninto your email client (`Thunderbird`, `claws-mail`, etc.).\n\nYou can also use the [mhonarc](https://www.mhonarc.org/) utility to convert\nthe downloaded messages to `HTML` files.\n\nSee also\n\n* https://github.com/icy/google-group-crawler/issues/15#issuecomment-221018338\n* https://github.com/icy/google-group-crawler/issues/35#issuecomment-580659966\n* My script https://github.com/icy/bashy/blob/master/libs/raw2mbox.sh\n\n### Rescan the whole local archive\n\nSometimes you may need to rescan / redownload all messages.\nThis can be done by removing all temporary files:\n\n    rm -fv $_GROUP/threads/t.*    # this is a must\n    rm -fv $_GROUP/msgs/m.*       # see also Tips \u0026 Tricks\n\nor you can use the `_FORCE` option:\n\n    _FORCE=\"true\" ./crawler.sh -sh\n\nAnother option is to delete all files under the `$_GROUP/` directory.\nAs usual, remember to back up before you delete anything.\n\n### Known problems\n\n1. 
Fails on groups with adult content (https://github.com/icy/google-group-crawler/issues/14)\n2. This script may not recover emails from public groups.\n  When you use valid cookies, you may see the original emails\n  if you are a manager of the group. See also https://github.com/icy/google-group-crawler/issues/16.\n3. When cookies are used, the original emails may be recovered\n  and you must filter them before making your archive public.\n4. The script can't fetch from a group whose name contains certain special characters (e.g., `+`).\n  See also https://github.com/icy/google-group-crawler/issues/30\n\n## Contributions\n\n1. `parallel` support: @Pikrass has a script to download messages in parallel.\n  It's discussed in the ticket https://github.com/icy/google-group-crawler/issues/32.\n  The script: https://gist.github.com/Pikrass/f8462ff8a9af18f97f08d2a90533af31\n2. `raw access denied`: @alexivkin mentioned he could use the `print` function\n  to work around the issue. See it here\n  https://github.com/icy/google-group-crawler/issues/29#issuecomment-468810786\n\n## Similar projects\n\n* (website) [Google Takeout - Download all info for any groups you own](https://takeout.google.com/)\n* (Shell/curl) [ggscrape - Download emails from a Google Group. Rescue your archives](https://git.scuttlebot.io/%25nkOkiGF0Dd321GmNqs6aW%2BWHaH9Uunq4m8dVfJuU%2Bps%3D.sha256)\n* (Python/Webdriver) [scrape_google_groups.py  - A simple script to scrape a google group](https://gist.github.com/punchagan/7947337)\n* (Python/webscraping.webkit) [gg-scrape - Liberate you data from google groups](https://github.com/jrholliday/gg-scrape)\n* (Python/urllib) [gg_scraper](https://gitlab.com/mcepl/gg_scraper)\n* (PHP/libcurl) [scraping-google-groups](http://saturnboy.com/2010/03/scraping-google-groups/)\n\n## License\n\nThis work is released under the terms of an MIT license.\n\n## Author\n\nThis script was written by Anh K. 
Huynh.\n\nHe wrote this script because he couldn't resolve the problem by using\n`nodejs`, `phantomjs`, `Watir`.\n\nNew web technology just makes life harder, doesn't it?\n\n## For script hackers\n\nPlease skip this section unless you really know how to work with `Bash` and shells.\n\n0. If you clean your files _(as below)_, you may notice that it will be\n   very slow when re-downloading all files. Consider using\n   the `-rss` option instead. This option will fetch data from an `rss` link.\n\n   It's recommended to use the `-rss` option for daily updates. By default,\n   the number of items is 50. You can change it with the `_RSS_NUM` variable.\n   However, don't use a very big number, because Google will ignore that.\n\n1. Because Topics is a FIFO list, you only need to remove the last file.\n   The script will re-download the last item, and if there is a new page,\n   that page will be fetched.\n\n        ls $_GROUP/msgs/m.* \\\n        | sed -e 's#\\.[0-9]\\+$##g' \\\n        | sort -u \\\n        | while read f; do\n            last_item=\"$f.$( \\\n              ls $f.* \\\n              | sed -e 's#^.*\\.\\([0-9]\\+\\)#\\1#g' \\\n              | sort -n \\\n              | tail -1 \\\n            )\";\n            echo $last_item;\n          done\n\n2. The list of threads is a LIFO list. If you want to rescan your list,\n   you will need to delete all files under `$_D_OUTPUT/threads/`.\n\n3. 
You can set the time for `mbox` output files, as below\n\n        ls $_GROUP/mbox/m.* \\\n        | while read FILE; do \\\n            date=\"$( \\\n              grep ^Date: $FILE\\\n              | head -1\\\n              | sed -e 's#^Date: ##g' \\\n            )\";\n            touch -d \"$date\" $FILE;\n          done\n\n    This will be very useful, for example, when you want to use the\n    `mbox` files with `mhonarc`.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ficy%2Fgoogle-group-crawler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ficy%2Fgoogle-group-crawler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ficy%2Fgoogle-group-crawler/lists"}