{"id":20478673,"url":"https://github.com/do-me/fast-instagram-scraper","last_synced_at":"2025-04-13T13:15:40.487Z","repository":{"id":52185687,"uuid":"315036109","full_name":"do-me/fast-instagram-scraper","owner":"do-me","description":"A fast Instagram Scraper based on Torpy.","archived":false,"fork":false,"pushed_at":"2023-11-26T17:37:09.000Z","size":406,"stargazers_count":35,"open_issues_count":1,"forks_count":7,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-04-13T13:15:29.298Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/do-me.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-11-22T12:48:39.000Z","updated_at":"2025-02-24T20:03:18.000Z","dependencies_parsed_at":"2024-11-15T15:40:33.689Z","dependency_job_id":"8051f923-f479-4adc-b033-eaff4a1198a3","html_url":"https://github.com/do-me/fast-instagram-scraper","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/do-me%2Ffast-instagram-scraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/do-me%2Ffast-instagram-scraper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/do-me%2Ffast-instagram-scraper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/do-me%2Ffast-instagram-scraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/do-me
","download_url":"https://codeload.github.com/do-me/fast-instagram-scraper/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248717238,"owners_count":21150389,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-15T15:38:40.357Z","updated_at":"2025-04-13T13:15:40.457Z","avatar_url":"https://github.com/do-me.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Fast Instagram Scraper\n\n**UPDATE 11/2023**\n\nAs [torpy is currently unmaintained and needs refactoring due to TOR changes from V2 to V3](https://github.com/torpyorg/torpy/issues/20#issuecomment-1826467859) fast-instagram-scraper won't work.\n\n**Important: Needs slight rework due to recent API changes! Not working at the moment, PRs welcome!** \n**Please read [this issue](https://github.com/do-me/fast-instagram-scraper/issues/4#issuecomment-1382752591) for information on Instagram's API**\n\nv2.0.0 (beta) - licensed under [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/?ref=chooser-v1) \n\n**Looking for quick data analysis? LBSN Dashboard will be the answer. Currently available as (frontend-only) prototype for [Bonn](https://geo.rocks/dashboards/bonn) and [Waynesboro](https://geo.rocks/dashboards/waynesboro).**\n\n## Downloads metadata and images *fast* over the Tor network. No login, no API-key needed.\n\nA fast Instagram Scraper based on Torpy. 
Downloads post metadata and images for multiple hashtags and [location ids](https://geo.rocks/post/mining-locations-ids/) sequentially, concurrently or combined. Multithreading supported.\n\n*Requirements: [Torpy](https://github.com/torpyorg/torpy) package installed but no login and no API-Key. Works on all operating systems.*\n\n## Please use responsibly and respect Instagram's terms of use! This tool is intended exclusively for research purposes, not for commercial use! If you enjoy Fast Instagram Scraper consider giving a ⭐.\n\n*Update December 2020*: Hashtags will be mined around 4-5 times faster due to larger batches even though Tor end nodes get blocked more often than for location IDs.\n\n*Update January 2021*: Downloading images is now supported - also in combination with multithreading! \n\n*Update April 2022*: Fixed new graphql hashtag endpoint. ~**Locations cannot be mined without login anymore**. Test [here](https://instagram.com/explore/locations) and see [this issue](https://github.com/do-me/fast-instagram-scraper/issues/3) for an alternative workflow.~\n\n*Update August 2022*: Locations **can** be mined without login again.\n\n*For newbies*: See the Helper_Functions.ipynb jupyter notebook to get a quick idea of how Fast Instagram Scraper works! \n\n## Command Line Version\n![Fast Instagram Scraper](https://github.com/do-me/fast-instagram-scraper/blob/main/fast-instagram-scraper-cli.gif)\n\n## Jupyter Version [deprecated]\n![Fast Instagram Scraper](https://github.com/do-me/fast-instagram-scraper/blob/main/Fast%20Instagram%20Scraper.gif)\n\nFor this scraper I had the same motivation as for [Simple Instagram Scraper](https://github.com/do-me/Simple-Instagram-Scraper):\nDue to the latest Instagram blocking policy changes [Instagram Scraper](https://github.com/arc298/instagram-scraper) is temporarily not performing well (as of November 2020). \nParticularly in comparison to this scraper it's too slow and struggles with getting blocked after a while. 
\n\n## Installation \nJust clone the repo or simply download either the jupyter notebook or the command line version.\nIt is best to create a virtual environment with conda first and install the necessary packages with:\n```bash\nconda create --name scrape python=3.9 \nconda activate scrape\npip install func-timeout pandas tqdm requests\npip install git+https://github.com/torpyorg/torpy@master\n```\nFor the jupyter notebook version you need to install ipython as well:\n```bash\npip install ipython\n```\nAfterwards clone the repo and you are good to go:\n```\ngit clone https://github.com/do-me/fast-instagram-scraper.git\n```\nFor jupyter start the notebook in your cloned repo:\n```\njupyter notebook\n```\nFor command line, you can call an [example command](https://github.com/do-me/fast-instagram-scraper#command-line-version-1).\n\n\n## Why not [Simple Instagram Scraper](https://github.com/do-me/Simple-Instagram-Scraper)?\n[Simple Instagram Scraper](https://github.com/do-me/Simple-Instagram-Scraper) can mine all of a post's information - technically everything being displayed on the page or in the DOM including location and accessibility caption. As it's literally looking at each post and needs to behave like a human in order not to get blocked, it needs to be relatively slow (a couple of seconds per post, depending on your parameters). [Fast Instagram Scraper](https://github.com/do-me/fast-instagram-scraper) aims at mining at scale but can only do so by accessing Instagram's JSON objects, which come in batches of 50 (for hashtags) or ca. 150 (for locations) posts and unfortunately do not include some information such as location and accessibility caption.\n\n### Scraper Comparison\n|Scraper|Pro|Con|\n|---|---|---|\n|[Simple Instagram Scraper](https://github.com/do-me/Simple-Instagram-Scraper)|+ all post information|- relatively slow\u003cbr\u003e- login required\u003cbr\u003e - max. 
8-12k posts|\n|[Fast Instagram Scraper](https://github.com/do-me/fast-instagram-scraper)|+ fast\u003cbr\u003e + no login required\u003cbr\u003e + theoretically no maximum|- not all post information|\n\n## Torpy\n[Torpy](https://github.com/torpyorg/torpy) makes use of the tor network to request pages.\nInstall torpy with: `pip3 install torpy` or `pip install torpy`. If you like Torpy consider giving a ⭐ or donating to https://donate.torproject.org/\n\nThe Torpy logic applied here unfortunately cannot be used to scrape all post information, as that requires being logged in. The number of requests would be associated with the account, which gets blocked no matter where the requests come from. Hence Torpy cannot be used for [Simple Instagram Scraper](https://github.com/do-me/Simple-Instagram-Scraper).\n\n## Idea\nUse one tor end node to get as many requests as possible. Experience shows that a normal end node can do 15-40 requests (50 posts each) when waiting around 10 seconds between requests. Let's do some [quick math](https://youtu.be/M3ujv8xdK2w): with a good node you'll get 40x50 = 2000 posts in 400 seconds, which gives you a rate of 5 posts per second, or even faster if you just want to scrape \u003c500 posts.\n\n## Jupyter Version\nYou will find detailed information in the notebook.\nAll future improvements will be available only for the command line version.\n\n## Command Line Version \n\nPositional Arguments:\n```\n  object_id_or_string         Location id or hashtag like 12345678 or truckfonalddump. 
\n                              If --list, enter the item list here comma separated like    \n                              loveyourlife,justdoit,truckfonalddump\n  location_or_hashtag         \"location\" or \"hashtag\"\n```\n\nOptional Arguments:\n```\n  -h, --help                  Show this help message and exit\n  --out_dir                   Path to store csv like scrape/ (default is working directory)\n  --max_posts                 Limit posts to scrape \n  --max_requests              Limit requests\n  --wait_between_requests     Waiting time between requests in seconds\n  --max_tor_renew             Max number of new tor sessions\n  --run_number                Additional file name part like \"_v2\" for \"1234567_v2.csv\"\n  --location_or_hashtag_list  For heterogenous hashtag/location list scraping only: provide another list with hashtag,location,...\n  --list                      Scrape for list\n  --last_cursor               Continue from where you quit before (last_cursor)\n  --tor_timeout               Set tor timeout when tor session gets blocked for some reason (default 600 seconds)\n  --user_agent                Change user agent if needed\n  --threads                   Number of concurrent threads\n  --save_as                   csv | json\n```  \nExample commands:\n```\n1. python fast-instagram-scraper.py byebyedonald hashtag \n2. python fast-instagram-scraper.py 123456789987 location --max_posts 10000 --max_tor_renew 100\n3. python fast-instagram-scraper.py 123456789987 location --last_cursor --out_dir \"/.../directory/folder/\"\n4. python fast-instagram-scraper.py byebyedonald,hellohereIam,georocks hashtag --list\n5. python fast-instagram-scraper.py byebyedonald,123456789987,georocks hashtag --list --location_or_hashtag_list hashtag,location,hashtag --max_posts 100 \n```\nFor the last command hashtag argument is a fallback in case the list passed after is not valid. 
If --location_or_hashtag_list is valid, hashtag will be overwritten by the respective value.\n\nNote that saving as json will be memory expensive as Instagram provides lots of different (unnecessary) image thumbnail URLs. Saving as csv is around 1 kb/post; json 10 kb/post.\n\n## Multithreading 🐙\nFast Instagram Scraper supports multithreading. Each thread has a different tor end node. Don't use the --list flag when multithreading. \nA basic example for 3 threads would look like this:\n```\npython fast-instagram-scraper.py byebyedonald,hellohereIam,georocks hashtag --threads 3\n```\nAll hashtags will be mined concurrently. The shell output will get quite messy as the threads' outputs will be printed in just one shell.🦥\n\nIf you would like a thread to run several commands sequentially, pass a list in square brackets. Each list runs on one thread with the parameters provided:\n```\npython fast-instagram-scraper.py byebyedonald,hellohereIam,[hereiam,goodlife,geography] hashtag --threads 3\n```\nYou can use all arguments as defined above, like `--last_cursor`.\n\nIf you want to monitor how many files per second are downloaded, use [New Files Monitor](https://github.com/do-me/New-Files-Monitor), which I created for this purpose. It gives you a live files/s rate and a total file count.\n\n```\n 6.62 files/s\n 1795 files total\n 2021-01-16 19:23:55.067400 start count\n 2021-01-16 19:24:11.363195 end count\n 16 seconds delta\n 106 files delta\n```\n\nNote that on Ubuntu you currently need to add the `shell=True` argument, i.e. `subprocess.run(cli_line, shell=True)`, in the `scrape_subprocess()` function in the source code. It's a [minor issue on Ubuntu](https://stackoverflow.com/questions/3172470/actual-meaning-of-shell-true-in-subprocess#:~:text=After%20reading%20the%20docs%2C%20I,the%20process%20is%20directly%20started.) but it works fine on Windows.\n\n## Parallelizing 👷‍♀️ 👷‍♂️\nThe above method is the preferred way to mine simultaneously for several hashtags/locations. 
If however you would like to monitor every process in its own shell, proceed as follows.\nYou can run several parallel tor sessions and hence run multiple instances of Fast Instagram Scraper. Let's say you have a list of location IDs and want to get a few posts of every location. When running the script sequentially, it will mine one location after another. \nYou can easily parallelize it by spawning multiple shells. For Powershell you could generate your commands in Python: \n``` python\nlocation_list = [1234567,1234564567,1234578765432]\nfor i in location_list:\n    print(\"start powershell {python fast-instagram-scraper.py \" + str(i) + \" location --max_posts 500};\")\n    \n# Result\nstart powershell {python fast-instagram-scraper.py 1234567 location --max_posts 500};\nstart powershell {python fast-instagram-scraper.py 1234564567 location --max_posts 500};\nstart powershell {python fast-instagram-scraper.py 1234578765432 location --max_posts 500};\n```\nCopy and paste these commands into a new Powershell window and execute them. The locations will be mined and the Powershell windows closed when finished. \nNote: This could also be done with jobs running in the background.\n\nOf course you shouldn't spawn an unlimited number of new processes. For a longer list of locations (e.g. 1000) the recommended method is to chunk your list so that each parallel process runs several commands sequentially. The following example chunks a list of 60 location IDs into 15 processes with 4 locations each. 
\n\n``` python\nlocation_list = [1234567,1234564567,1234578765432,1234567,1234564567,1234578765432,1234567,1234564567,1234578765432,1234567,1234564567,1234578765432,1234567,1234564567,1234578765432,1234567,1234564567,1234578765432,1234567,1234564567,1234578765432,1234567,1234564567,1234578765432,1234567,1234564567,1234578765432,1234567,1234564567,1234578765432,1234567,1234564567,1234578765432,1234567,1234564567,1234578765432,1234567,1234564567,1234578765432,1234567,1234564567,1234578765432,1234567,1234564567,1234578765432,1234567,1234564567,1234578765432,1234567,1234564567,1234578765432,1234567,1234564567,1234578765432,1234567,1234564567,1234578765432,1234567,1234564567,1234578765432]\n\n# chunking function - also works with uneven numbers\ndef chunks(lst, n): # https://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks\n    \"\"\"Yield successive n-sized chunks from lst.\"\"\"\n    for i in range(0, len(lst), n):\n        yield lst[i:i + n]\n        \nlocation_chunks = list(chunks(location_list,4))\nlocation_chunks\n\n# Result is a list of 15 lists with each 4 locations\n[[1234567, 1234564567, 1234578765432, 1234567],\n [1234564567, 1234578765432, 1234567, 1234564567],\n [1234578765432, 1234567, 1234564567, 1234578765432],\n [1234567, 1234564567, 1234578765432, 1234567],\n [1234564567, 1234578765432, 1234567, 1234564567],\n [1234578765432, 1234567, 1234564567, 1234578765432],\n [1234567, 1234564567, 1234578765432, 1234567],\n [1234564567, 1234578765432, 1234567, 1234564567],\n [1234578765432, 1234567, 1234564567, 1234578765432],\n [1234567, 1234564567, 1234578765432, 1234567],\n [1234564567, 1234578765432, 1234567, 1234564567],\n [1234578765432, 1234567, 1234564567, 1234578765432],\n [1234567, 1234564567, 1234578765432, 1234567],\n [1234564567, 1234578765432, 1234567, 1234564567],\n [1234578765432, 1234567, 1234564567, 1234578765432]]\n\n```\nEach list of 4 locations will now be put into a Fast Instagram Scraper sequential command 
which will be executed in a new Powershell window. \n\n```python\nfor i in location_chunks:\n    outcmd = \"\"\n    outcmd = \"start powershell {\"\n    pyth= \"\"\n    for loc in i:\n        pyth = pyth +\"python fast-instagram-scraper.py \" + str(loc) + \" location --max_posts 500;\"\n    outcmd = outcmd + pyth + \"};\"\n    print(outcmd)\n    \n# Result\nstart powershell {python fast-instagram-scraper.py 1234567 location --max_posts 500;python fast-instagram-scraper.py 1234564567 location --max_posts 500;python fast-instagram-scraper.py 1234578765432 location --max_posts 500;python fast-instagram-scraper.py 1234567 location --max_posts 500;};\nstart powershell {python fast-instagram-scraper.py 1234564567 location --max_posts 500;python fast-instagram-scraper.py 1234578765432 location --max_posts 500;python fast-instagram-scraper.py 1234567 location --max_posts 500;python fast-instagram-scraper.py 1234564567 location --max_posts 500;};\nstart powershell {python fast-instagram-scraper.py 1234578765432 location --max_posts 500;python fast-instagram-scraper.py 1234567 location --max_posts 500;python fast-instagram-scraper.py 1234564567 location --max_posts 500;python fast-instagram-scraper.py 1234578765432 location --max_posts 500;};\nstart powershell {python fast-instagram-scraper.py 1234567 location --max_posts 500;python fast-instagram-scraper.py 1234564567 location --max_posts 500;python fast-instagram-scraper.py 1234578765432 location --max_posts 500;python fast-instagram-scraper.py 1234567 location --max_posts 500;};\nstart powershell {python fast-instagram-scraper.py 1234564567 location --max_posts 500;python fast-instagram-scraper.py 1234578765432 location --max_posts 500;python fast-instagram-scraper.py 1234567 location --max_posts 500;python fast-instagram-scraper.py 1234564567 location --max_posts 500;};\nstart powershell {python fast-instagram-scraper.py 1234578765432 location --max_posts 500;python fast-instagram-scraper.py 1234567 location --max_posts 
500;python fast-instagram-scraper.py 1234564567 location --max_posts 500;python fast-instagram-scraper.py 1234578765432 location --max_posts 500;};\nstart powershell {python fast-instagram-scraper.py 1234567 location --max_posts 500;python fast-instagram-scraper.py 1234564567 location --max_posts 500;python fast-instagram-scraper.py 1234578765432 location --max_posts 500;python fast-instagram-scraper.py 1234567 location --max_posts 500;};\nstart powershell {python fast-instagram-scraper.py 1234564567 location --max_posts 500;python fast-instagram-scraper.py 1234578765432 location --max_posts 500;python fast-instagram-scraper.py 1234567 location --max_posts 500;python fast-instagram-scraper.py 1234564567 location --max_posts 500;};\nstart powershell {python fast-instagram-scraper.py 1234578765432 location --max_posts 500;python fast-instagram-scraper.py 1234567 location --max_posts 500;python fast-instagram-scraper.py 1234564567 location --max_posts 500;python fast-instagram-scraper.py 1234578765432 location --max_posts 500;};\nstart powershell {python fast-instagram-scraper.py 1234567 location --max_posts 500;python fast-instagram-scraper.py 1234564567 location --max_posts 500;python fast-instagram-scraper.py 1234578765432 location --max_posts 500;python fast-instagram-scraper.py 1234567 location --max_posts 500;};\nstart powershell {python fast-instagram-scraper.py 1234564567 location --max_posts 500;python fast-instagram-scraper.py 1234578765432 location --max_posts 500;python fast-instagram-scraper.py 1234567 location --max_posts 500;python fast-instagram-scraper.py 1234564567 location --max_posts 500;};\nstart powershell {python fast-instagram-scraper.py 1234578765432 location --max_posts 500;python fast-instagram-scraper.py 1234567 location --max_posts 500;python fast-instagram-scraper.py 1234564567 location --max_posts 500;python fast-instagram-scraper.py 1234578765432 location --max_posts 500;};\nstart powershell {python fast-instagram-scraper.py 1234567 
location --max_posts 500;python fast-instagram-scraper.py 1234564567 location --max_posts 500;python fast-instagram-scraper.py 1234578765432 location --max_posts 500;python fast-instagram-scraper.py 1234567 location --max_posts 500;};\nstart powershell {python fast-instagram-scraper.py 1234564567 location --max_posts 500;python fast-instagram-scraper.py 1234578765432 location --max_posts 500;python fast-instagram-scraper.py 1234567 location --max_posts 500;python fast-instagram-scraper.py 1234564567 location --max_posts 500;};\nstart powershell {python fast-instagram-scraper.py 1234578765432 location --max_posts 500;python fast-instagram-scraper.py 1234567 location --max_posts 500;python fast-instagram-scraper.py 1234564567 location --max_posts 500;python fast-instagram-scraper.py 1234578765432 location --max_posts 500;};\n```\nSame as above: Copy paste these commands in a new Powershell window and execute. The locations will be mined and the Powershell windows closed when finished. \n\nYou can raise the chunk size according to your system but be polite and don't exaggerate as it might affect the tor network. \n\nHowever, if for example you would like to visualize the approximately 1000 location IDs Instagram is displaying for each city of a country under https://www.instagram.com/explore/locations you could do this quite fast by first [mining the location IDs as described in my blog post](https://geo.rocks/post/mining-locations-ids/) in two simple steps with javascript and after chunking the locations i.e. to 20 chunks of 50 locations. Limit the max_posts parameter to a low number (technically anything between 1 and 50 will have the same effect) i.e. 20 and go for it! Depending on your luck with good tor connections you'll be done in around 10-20 minutes! 
\n\n## Recommendation for mining all posts from one location ID or hashtag \nWhen mining for locations or hashtags with a vast amount of posts it might be better to scrape with multiple commands by using --last_cursor instead of mining everything in one go. At the moment the saving logic appends all JSON data to a list, converts it to csv and saves the entire file, which becomes quite inefficient for big files. \nHowever, mining in smaller chunks has more advantages, so just go for e.g. a maximum of 20000 - 50000 posts, resulting in 16 - 38 MB files each. In my case one iteration (including the costly saving process) was still executed in a reasonable amount of time (\u003c15 seconds). For the very first iteration mine normally. Afterwards, don't forget the --last_cursor flag. A timeout of 600 seconds per iteration (default) proved to work well. Chaining commands with a semicolon for Powershell and bash helps to keep the process going, e.g.:\n```bash\npython fast-instagram-scraper.py 123456789987 location --max_posts 20000;\npython fast-instagram-scraper.py 123456789987 location --max_posts 20000 --last_cursor;\npython fast-instagram-scraper.py 123456789987 location --max_posts 20000 --last_cursor;\n# and so on ...\n```\n\n## CSV concatenation \nUse a simple Powershell command to concatenate all your freshly mined csv files. Manually create a folder \"merged\" first and execute this command:\n```powershell\nGet-ChildItem -Filter *.csv | Select-Object -ExpandProperty FullName | Import-Csv  -Encoding UTF8| Export-Csv .\\merged\\merged.csv -NoTypeInformation -Append -Encoding UTF8\n```\n\n## Data preprocessing \nSee the [jupyter notebook in this repo](https://github.com/do-me/fast-instagram-scraper/blob/main/A%20complete%20guide%20to%20preprocess%20Instagram%20post%20data%20mined%20with%20Fast%20Instagram%20Scraper.ipynb) to preprocess all the post data with pandas and create a geodataframe with geopandas for visualizing in ordinary GIS programs such as QGIS. 
\nHashtag extraction, reprojection to Web Mercator (EPSG:3857) and unique points filter included.\n\n## To Do\n- Create interface for [LBSNTransform](https://pypi.org/project/lbsntransform/)\n\n## More\n- [Blog article](https://geo.rocks/post/fast-instagram-scraper/) about Fast Instagram Scraper\n- Find me and stay tuned on [my blog](https://geo.rocks)!\n\n## Star \nStar this repo if you enjoy it! ⭐\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdo-me%2Ffast-instagram-scraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdo-me%2Ffast-instagram-scraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdo-me%2Ffast-instagram-scraper/lists"}