{"id":13826026,"url":"https://github.com/18520339/facebook-data-extraction","last_synced_at":"2025-04-03T03:12:11.910Z","repository":{"id":42396372,"uuid":"275714828","full_name":"18520339/facebook-data-extraction","owner":"18520339","description":"Experience for effectively fetching Facebook data by Querying Graph API with Account-based Token and Operating undetectable scraping Bots to extract Client/Server-side Rendered content","archived":false,"fork":false,"pushed_at":"2024-02-02T19:54:11.000Z","size":28279,"stargazers_count":191,"open_issues_count":0,"forks_count":60,"subscribers_count":9,"default_branch":"master","last_synced_at":"2025-03-24T09:17:37.378Z","etag":null,"topics":["automation","browser-fingerprinting","crawling","facebook","facebook-graph-api","proxy","scraping","selenium","tor-network"],"latest_commit_sha":null,"homepage":"https://www.youtube.com/watch?v=Q4oAsz__e_M","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/18520339.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-06-29T02:49:28.000Z","updated_at":"2025-03-18T10:35:02.000Z","dependencies_parsed_at":"2024-02-02T20:47:53.927Z","dependency_job_id":null,"html_url":"https://github.com/18520339/facebook-data-extraction","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/18520339%2Ffacebook-data-extraction","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/18520339%2Ffacebook-data-extraction/tags","r
eleases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/18520339%2Ffacebook-data-extraction/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/18520339%2Ffacebook-data-extraction/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/18520339","download_url":"https://codeload.github.com/18520339/facebook-data-extraction/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246895601,"owners_count":20851297,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["automation","browser-fingerprinting","crawling","facebook","facebook-graph-api","proxy","scraping","selenium","tor-network"],"created_at":"2024-08-04T09:01:30.970Z","updated_at":"2025-04-03T03:12:11.885Z","avatar_url":"https://github.com/18520339.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# Summary of Facebook data extraction approaches\n\n\u003e I'm finalizing everything to accommodate the latest major changes\n\n##  Overview\n\n\u003ctable\u003e\n    \u003ctr\u003e\n        \u003cth\u003eApproach\u003c/th\u003e\n        \u003cth\u003eSign-in required from the start\u003c/th\u003e\n        \u003cth\u003eRisk when sign-in (*)\u003c/th\u003e\n        \u003cth\u003eRisk when not sign-in\u003c/th\u003e\n        \u003cth\u003eDifficulty\u003c/th\u003e\n        \u003cth\u003eSpeed\u003c/th\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003e1️⃣ \u0026nbsp;\u003ca 
href=\"#approach-1-graph-api-with-full-permission-token\"\u003eGraph API + Full-permission Token\u003c/a\u003e\u003c/td\u003e\n        \u003ctd align=\"center\" rowspan=\"2\"\u003e✅\u003c/td\u003e\n        \u003ctd align=\"center\"\u003eAccess Token leaked + \u003ca href=\"https://developers.facebook.com/docs/graph-api/reference/page/feed/#limitations\"\u003eRate Limits\u003c/a\u003e\u003c/td\u003e\n        \u003ctd align=\"center\" rowspan=\"2\"\u003eNot working\u003c/td\u003e\n        \u003ctd align=\"center\"\u003eEasy\u003c/td\u003e\n        \u003ctd align=\"center\"\u003eFast\u003c/td\u003e\n    \u003c/tr\u003e\n        \u003ctr\u003e\n        \u003ctd\u003e2️⃣ \u0026nbsp;\u003ca href=\"#approach-2-ssr---server-side-rendering\"\u003eSSR - Server-side Rendering\u003c/a\u003e\u003c/td\u003e\n        \u003ctd align=\"center\" rowspan=\"3\"\u003eCheckpoint but less \u003cb\u003eloading more\u003c/b\u003e failure\u003c/td\u003e \u003c!-- Merged Cell --\u003e\n        \u003ctd align=\"center\" rowspan=\"2\"\u003eHard\u003c/td\u003e\n        \u003ctd align=\"center\"\u003eMedium\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003e3️⃣ \u0026nbsp;\u003ca href=\"#approach-3-csr---client-side-rendering\"\u003eCSR - Client-side Rendering\u003c/a\u003e\u003c/td\u003e\n        \u003ctd align=\"center\" rowspan=\"2\"\u003eWhen access private content\u003c/td\u003e\n        \u003ctd align=\"center\"\u003eSafest\u003c/td\u003e\n        \u003ctd align=\"center\" rowspan=\"2\"\u003eSlow\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003e4️⃣ \u0026nbsp;\u003ca href=\"#approach-4-devtools-console\"\u003eDevTools Console\u003c/a\u003e\u003c/td\u003e\n        \u003ctd align=\"center\"\u003eCan be banned if overused\u003c/td\u003e\n        \u003ctd align=\"center\"\u003eMedium\u003c/td\u003e\n    \u003c/tr\u003e\n\u003c/table\u003e\n\n### I. 
My general conclusion after many tries with different approaches\n\nWhen run in the **not sign-in** state, Facebook usually redirects to the login page or prevents you from loading more comments/replies.\n\n**(*)** For safety when testing in the **sign-in** state, I recommend creating a **fake account** (you can use a [Temporary Email Address](https://temp-mail.org/en/) to create one) and using it for the extraction, because:\n-   No matter which approach you use, fast or irregular activity sustained for a long time in the **sign-in** state is likely to get the account blocked at any time.\n\n- Authenticating through services that lack encryption, such as proxies using the **HTTP** protocol, carries security risks, especially if sensitive data is being transmitted: \n    - Therefore, if you are experimenting with your own account, it's advisable to use **HTTPS** proxies or other more secure methods like `VPNs`.\n    - I won't implement these risky types of authentication in the sign-in process for the approaches in this repo, but you can do it yourself if you want.\n\n### II. DISCLAIMER\n\nAll information provided in this repo and related articles is for educational purposes only. Use it at your own risk; I make no guarantees and am not responsible for any situations, including:\n\n-   Your Facebook account getting a Checkpoint due to repeated or rapid actions.\n-   Problems that may occur from use or abuse of the information or the code provided.\n-   Problems with your privacy while using [IP hiding techniques](#i-ip-hiding-techniques) or any malicious scripts.\n\n## APPROACH 1. Graph API with Full-permission Token\n\n👉 Check out my implementation for this approach with [Python](./graph-api/).\n\nYou will query [Facebook Graph API](https://developers.facebook.com/docs/graph-api) using your own Token with **full permission** for fetching data. 
This is the **MOST EFFECTIVE** approach.\n\n\u003e The knowledge and the way to get an **Access Token** below are translated from these 2 Vietnamese blogs:\n\u003e\n\u003e -   https://ahachat.com/help/blog/cach-lay-token-facebook\n\u003e -   https://alotoi.com/get-token-full-quyen\n\n### I. What is Facebook Token?\n\nA Facebook **Access Token** is a randomly generated code that contains data linked to a Facebook account. It carries the permissions to perform actions through the library (API) provided by Facebook. Each Facebook account has its own **Access Tokens**, and one account can hold one or more Tokens.\n\nEach Token is generated for use with particular features, so its permissions may be broad or narrow. Tokens can be used for various purposes, but the main goal is to automate manual operations. Some common applications include:\n\n- Increasing likes and subscriptions on Facebook.\n- Automatically posting on Facebook.\n- Automatically commenting and sharing posts.\n- Automatically interacting in groups and Pages.\n- ...\n\nThere are 2 types of Facebook Tokens: **App-based Token** and **Personal Account-based Token**. The Facebook **App-based Token** is the safest one, as it has a limited lifetime and only some basic permissions to interact with `Pages` and `Groups`. Our main focus will be on the Facebook **Personal Account-based Token**.\n\n### II. Personal Account-based Access Token\n\nThis is a **full permissions** Token represented by a string of characters starting with `EAA...`. The purpose of this Token is to act on behalf of your Facebook account to perform actions you can do on Facebook, such as sending messages, liking pages, and posting in groups through the `API`. \n\nCompared to an **App-based Token**, this type of Token has a longer lifespan and more permissions. 
Simply put, whatever an **App-based Token** can do, a **Personal Account-based Token** can do as well, but not vice versa.\n\nAn example of using this Facebook Token is when you want to simultaneously post to many `Groups` and `Pages`. Logging into each `Group` or `Page` to post by hand would be very time-consuming. Instead, you just need to fill in a list of `Group` and `Page` IDs, and then call an `API` to post to everything in this list. The tools for inflating likes and comments that you often see on the Internet use the same technique.\n\nNote that using a Facebook Token can save you time, but you should not reveal this Token to others as they can misuse it for malicious purposes:\n\n- Do not download extensions to get Tokens or log in with your phone number and password on websites that support Token retrieval, as your information can be compromised.\n- If you suspect your Token has been compromised, immediately change your Facebook password and delete the extensions installed in the browser. \n- If you want to be more careful, you can turn on **two-factor authentication** (2FA).\n\n👉 To stay safe when using the Facebook Token for personal purposes and saving time as mentioned above, you should obtain the Token directly from Facebook by following the steps below.\n\n### III. Get Access Token with full permissions\n\nObtaining Facebook Tokens used to be very simple, but as Facebook's services have evolved, getting them has become more difficult. Facebook also limits Full permission Tokens to prevent spam and abusive behavior. \n\nIt's possible to obtain a Token, but it might be limited to basic permissions that we do not need. This is not a big issue compared to sometimes having accounts locked (identity verification) on Facebook.\n\nCurrently, this is the most commonly used method, but it may require you to authenticate with 2FA (via app or SMS Code). 
With the following steps, you can get an **almost full permission** Token:\n\n-   Go to https://business.facebook.com/business_locations.\n-   Press `Ctrl + U` to view the page source, then `Ctrl + F` to search for the string `EAAG`. Copy the highlighted text; that's the Token you want to obtain.\n\n    ![](./assets/token-business.png)\n\n-   You can go to this [facebook link](https://developers.facebook.com/tools/debug/accesstoken) to check the permissions of the above Token.\n    ![](https://lh4.googleusercontent.com/0S64t2sjFXjkX8HUjo2GeEW8hyKL88G4lMXkpNF7RgtFCRm0oVPRT--vnoM1rkMyhrRvvHufW9J0ZeP8tPxfo4j5vYityQFM0m06NTI2hq4zk1JMp59W9voHXHYtOjE7zqDGMlhh)\n\n**Note**: I only share how to get an **Access Token** from Facebook itself. Revealing Tokens can seriously affect your Facebook account. Please don't get Tokens from unknown sources!\n\n\n## APPROACH 2. SSR - Server-side Rendering\n\n👉 Check out my implementation using 2 [scraping tools](#scraping-tools) for this approach: [Scrapy](./stealth-ssr-scrapy/) (Implementing) and [Puppeteer](./stealth-ssr-puppeteer/) (Implementing).\n\n### I. What is Server-side Rendering?\n\nThis is a popular technique for rendering a normally client-side only single-page application (`SPA`) on the **Server** and then sending a fully rendered page to the client. The client's `JavaScript` bundle can then take over and the `SPA` can operate as normal:\n\n```mermaid\n%%{init: {'theme': 'default', 'themeVariables': { 'primaryColor': '#333', 'lineColor': '#666', 'textColor': '#333', }}}%%\nsequenceDiagram\n    participant U as 🌐 User's Browser\n    participant S as 🔧 Server\n    participant SS as 📜 Server-side Scripts\n    participant D as 📚 Database\n    participant B as 🖥️ Browser Engine\n    participant H as 💧 Hydration (SPA)\n\n    rect rgb(235, 248, 255)\n    U-\u003e\u003eS: 1. Request 🌐 (URL/Link)\n    Note right of S: Server processes the request\n    S-\u003e\u003eSS: 2. 
Processing 🔄 (PHP, Node.js, Python, Ruby, Java)\n    SS-\u003e\u003eD: Execute Business Logic \u0026 Query DB\n    D--\u003e\u003eSS: Return Data\n    SS--\u003e\u003eS: 3. Rendering 📄 (Generate HTML)\n    S--\u003e\u003eU: 4. Response 📤 (HTML Page)\n    end\n\n    rect rgb(255, 243, 235)\n    U-\u003e\u003eB: 5. Display 🖥️\n    Note right of B: Browser parses HTML, CSS, JS\n    B--\u003e\u003eU: Page Displayed to User\n    end\n\n    alt 6. Hydration (if SPA)\n        rect rgb(235, 255, 235)\n        U-\u003e\u003eH: Hydration 💧\n        Note right of H: Attach event listeners\\nMake page interactive\n        H--\u003e\u003eU: Page now reactive\n        end\n    end\n```\n\n1. The **user's browser** requests a page.\n2. The **Server** receives and processes the request. This involves running the necessary **Server-side scripts**, which can be written in languages like *PHP*, *Node.js*, *Python*, ...\n3. These **Server-side scripts** dynamically generate the `HTML` content of the page. This may include executing business logic or querying a database.\n4. The **Server** responds by sending the **fully-rendered HTML** page back to the **user's browser**. This response also includes the `CSS` and `JS`, which will be processed once the `HTML` is loaded.\n5. The **user's browser** receives the `HTML` response and renders the page. The browser's rendering engine parses the `HTML` and `CSS` and executes the `JS` to display the page.\n6. 
(Optional) If the application is an `SPA` using a framework like *React*, *Vue*, or *Angular*, an additional process called **Hydration** may occur to attach event listeners to the existing **Server-rendered HTML**:\n    - This is where the client-side `JS` takes over and `binds` event handlers to the **Server-rendered HTML**, effectively turning a static page into a dynamic one.\n    - This allows the application to handle user interactions, manage `state`, and potentially update the `DOM` without the need to render a new page from scratch or return to the **Server** for every action.\n\n| Pros | Cons |\n| --------------------------- | --------------------------- |\n| - Improved initial load time as users see a **fully-rendered page** sooner, which is important for user experience, particularly on slow connections.   | - More **Server** resources are used to generate the **fully-rendered HTML**. |\n| - Improved SEO as search engine crawlers can see the **fully-rendered page**. | - More complex to implement than [CSR](#approach-3-csr---client-side-rendering), especially for dynamic sites where content changes frequently.     |\n\n### II. [Mbasic Facebook](https://mbasic.facebook.com) - A Facebook SSR version\n\nThis Facebook version is made for mobile browsers on slow internet connections and uses [SSR](#i-what-is-server-side-rendering) to deliver content in raw `HTML` format. You can access it without a modern smartphone. 
On modern devices, it improves the page loading time \u0026 the content is mainly rendered using raw `HTML` rather than relying heavily on `JS`:\n\nhttps://github.com/18520339/facebook-data-extraction/assets/50880271/ae2635ff-3f2a-4b84-a5b3-c126102a0118\n\n- You can leverage the power of many web scraping frameworks like [scrapy](https://scrapy.org), not just automation tools like [puppeteer](https://github.com/puppeteer/puppeteer) or [selenium](https://github.com/seleniumhq/selenium), and they become even more powerful when used with [IP hiding techniques](#i-ip-hiding-techniques). \n- You can fetch each part of the content through a different URL rather than only by scrolling the page ➔ You can, for example, use a different proxy for each request or enable [AutoThrottle](https://docs.scrapy.org/en/latest/topics/autothrottle.html) (a built-in [scrapy](https://scrapy.org) extension), ...\n\nUpdating...\n\n## APPROACH 3. CSR - Client-side Rendering\n\n👉 Check out my implementation using 2 [scraping tools](#scraping-tools) for this approach: [Selenium](./stealth-csr-selenium/) (Deprecated) and [Puppeteer](./stealth-csr-puppeteer/) (Implementing).\n\nUpdating...\n\n\n## APPROACH 4. 
DevTools Console\n\nThis is the simplest way: write \u0026 run JS code directly in the [DevTools Console](https://developer.chrome.com/docs/devtools/open) of your browser. It's quite convenient because there is nothing to set up.\n\n- You can take a look at this [extremely useful project](https://github.com/jayremnt/facebook-scripts-dom-manipulation), which includes many automation scripts for Facebook users (not just for data extraction) that need no Access Token because they manipulate the DOM directly.\n  \n- Here's my example script to collect comments on **a Facebook page when not signed in**:\n   \n```js\n// Go to the page you want to collect comments from and wait until it finishes loading.\n// Open the DevTools Console in the browser and run the following code\nlet csvContents = [['UserId', 'Name', 'Comment']];\nlet cmtsSelector = '.userContentWrapper .commentable_item';\n\n// 1. Click see more comments\n// If you want more, just wait until the loading finishes and run this again\nmoreCmts = document.querySelectorAll(cmtsSelector + ' ._4sxc._42ft');\nmoreCmts.forEach(btnMore =\u003e btnMore.click());\n\n// 2. 
Collect all comments\ncomments = document.querySelectorAll(cmtsSelector + ' ._72vr');\ncomments.forEach(cmt =\u003e {\n    // The author link may be missing on some nodes, so every lookup is null-safe\n    let info = cmt.querySelector('._6qw4');\n    let userId = info?.getAttribute('href')?.substring(1);\n    let content = cmt.querySelector('._3l3x\u003espan')?.innerText;\n    csvContents.push([userId, info?.innerText, content]);\n});\n// Output the rows as tab-separated text\ncsvContents.map(cmt =\u003e cmt.join('\\t')).join('\\n');\n```\n\n\u003cdetails\u003e\n    \u003csummary\u003e\n        \u003cb\u003e\n            \u003ca href=\"https://github.com/18520339/facebook-data-extraction/blob/master/devtool-data.xlsx\"\u003eExample\u003c/a\u003e result for the script above\n        \u003c/b\u003e\n    \u003c/summary\u003e\u003cbr/\u003e\n\n| UserId          | Name           | Comment                            |\n| --------------  | -------------- | ---------------------------------- |\n| freedomabcxyz   | Freedom        | Sau khi dùng                       |\n| baodendepzai123 | Bảo Huy Nguyễn | nhưng mà thua                      |\n| tukieu.2001     | Tú Kiều        | đang xem hài ai rãnh xem quãng cáo |\n| ABCDE2k4        | Maa Vănn Kenn  | Lê Minh Nhất                       |\n| buikhanhtoanpro | Bùi Khánh Toàn | Haha                               |\n\n\u003c/details\u003e\n\n## Scraping Tools \n\nUpdating...\n\n\n## Bypassing Bot Detection (When not sign-in)\n\nUpdating...\n\n👉 Highly recommended: https://github.com/niespodd/browser-fingerprinting\n\n### I. 
IP hiding techniques\n\n\u003ctable\u003e\n    \u003cthead\u003e\n        \u003ctr\u003e\n            \u003cth align=\"center\" width=\"2%\"\u003eTechnique\u003c/th\u003e\n            \u003cth align=\"center\" width=\"10%\" \u003eSpeed\u003c/th\u003e\n            \u003cth align=\"center\" width=\"10%\" \u003eCost\u003c/th\u003e\n            \u003cth align=\"center\" width=\"22%\" \u003eScale\u003c/th\u003e\n            \u003cth align=\"center\" width=\"22%\" \u003eAnonymity\u003c/th\u003e\n            \u003cth align=\"center\" width=\"12%\" \u003eOther Risks\u003c/th\u003e\n            \u003cth align=\"center\" width=\"22%\" \u003eAdditional Notes\u003c/th\u003e\n        \u003c/tr\u003e\n    \u003c/thead\u003e\n    \u003ctbody\u003e\n        \u003ctr\u003e\n            \u003ctd align=\"center\"\u003e\u003cb\u003eVPN Service\u003c/b\u003e\u003cbr\u003e⭐⭐\u003cbr\u003e⭐⭐\u003c/td\u003e\n            \u003ctd\u003eFast, offers a balance of anonymity and speed\u003c/td\u003e\n            \u003ctd align=\"center\"\u003eUsually paid\u003c/td\u003e\n            \u003ctd\u003e- Good for \u003cb\u003esmall-scale\u003c/b\u003e operations.\u003cbr\u003e- May not be suitable for high-volume scraping due to potential IP blacklisting.\u003c/td\u003e\n            \u003ctd\u003e- Provides good anonymity and can bypass geo-restriction.\u003cbr\u003e- Potential for IP blacklisting/blocks if the VPN's IP range is known to the target site.\u003c/td\u003e\n            \u003ctd\u003e- Service reliability varies.\u003cbr\u003e- Possible activity logs.\u003c/td\u003e\n            \u003ctd\u003eChoose a reputable provider to avoid security risks.\u003c/td\u003e\n        \u003c/tr\u003e\n        \u003ctr\u003e\n            \u003ctd align=\"center\"\u003e\u003cb\u003eTOR Network\u003c/b\u003e\u003cbr\u003e⭐⭐\u003c/td\u003e\n            \u003ctd\u003eVery slow due to onion routing\u003c/td\u003e\n            \u003ctd align=\"center\"\u003eFree\u003c/td\u003e\n            \u003ctd\u003e- Fine 
for \u003cb\u003esmall-scale\u003c/b\u003e, impractical for time-sensitive/ high-volume scraping due to very slow speed.\u003cbr\u003e- Consider only for research purposes, not scalable data collection.\u003c/td\u003e\n            \u003ctd\u003e- Offers excellent privacy.\u003cbr\u003e- Tor exit nodes can be blocked or malicious, like \u003ca href=\"https://support.torproject.org/https/https-1/\"\u003epotential for eavesdropping\u003c/a\u003e.\u003c/td\u003e\n            \u003ctd align=\"center\"\u003e-\u003c/td\u003e\n            \u003ctd align=\"center\"\u003eSlowest choice\u003c/td\u003e\n        \u003c/tr\u003e\n        \u003ctr\u003e\n            \u003ctd align=\"center\"\u003e\u003cb\u003ePublic\u003cbr\u003eWi-Fi\u003c/b\u003e\u003cbr\u003e⭐\u003c/td\u003e\n            \u003ctd align=\"center\"\u003eVary\u003c/td\u003e\n            \u003ctd align=\"center\"\u003eFree\u003c/td\u003e\n            \u003ctd\u003eFine for \u003cb\u003esmall-scale\u003c/b\u003e.\u003c/td\u003e\n            \u003ctd\u003ePotential for being banned by target sites if scraping is detected.\u003c/td\u003e\n            \u003ctd align=\"center\"\u003ePotential unsecured networks\u003c/td\u003e\n            \u003ctd align=\"center\"\u003eLong distance way solution.\u003cbr\u003e\u003c/td\u003e\n        \u003c/tr\u003e\n        \u003ctr\u003e\n            \u003ctd align=\"center\"\u003e\u003cb\u003eMobile Network\u003c/b\u003e\u003cbr\u003e⭐⭐\u003c/td\u003e\n            \u003ctd\u003eRelatively fast but slower speeds on some networks\u003c/td\u003e\n            \u003ctd\u003ePaid, potential for additional costs.\u003c/td\u003e\n            \u003ctd\u003eUsing mobile IPs can be effective for \u003cb\u003esmall-scale\u003c/b\u003e scraping, impractical for large-scale.\u003c/td\u003e\n            \u003ctd\u003eMobile IPs can change but not an anonymous option since it's tied to your personal account.\u003c/td\u003e\n            \u003ctd align=\"center\"\u003e-\u003c/td\u003e\n            
\u003ctd align=\"center\"\u003eUsing own data\u003c/td\u003e\n        \u003c/tr\u003e\n        \u003ctr\u003e\n            \u003ctd align=\"center\"\u003e\u003cb\u003ePrivate/\u003cbr\u003eDedicated Proxies\u003c/b\u003e\u003cbr\u003e⭐⭐⭐\u003cbr\u003e⭐⭐\u003cbr\u003e(Best)\u003c/td\u003e\n            \u003ctd align=\"center\"\u003eFast\u003c/td\u003e\n            \u003ctd align=\"center\"\u003ePaid\u003c/td\u003e\n            \u003ctd\u003e- Best for \u003cb\u003elarge-scale\u003c/b\u003e operations and professional scraping projects.\u003c/td\u003e\n            \u003ctd\u003eOffer better performance and reliability with lower risk of blacklisting.\u003c/td\u003e\n            \u003ctd align=\"center\"\u003eVary in quality\u003c/td\u003e\n            \u003ctd rowspan=\"2\"\u003e- \u003cb\u003eRotating Proxies\u003c/b\u003e are popular choices for scraping as they can offer better speed and a variety of IPs.\u003cbr\u003e- You can use this \u003ca href=\"https://addons.mozilla.org/en-US/firefox/addon/proxy-checker/\"\u003eproxy checker tool\u003c/a\u003e to assess your proxy quality\u003c/td\u003e\n        \u003c/tr\u003e\n        \u003ctr\u003e\n            \u003ctd align=\"center\"\u003e\u003cb\u003eShared Proxies\u003c/b\u003e\u003cbr\u003e⭐⭐⭐\u003cbr\u003e(Free)\u003cbr\u003e⭐⭐\u003cbr\u003e⭐⭐\u003cbr\u003e(Paid)\u003c/td\u003e\n            \u003ctd align=\"center\"\u003eSlow to Moderate\u003c/td\u003e\n            \u003ctd\u003eUsually Free or cost-effective for low-volume scraping.\u003c/td\u003e\n            \u003ctd\u003eGood for basic, \u003cb\u003esmall-scale\u003c/b\u003e, or non-critical scraping tasks.\u003c/td\u003e\n            \u003ctd\u003eCan be overloaded or blacklisted or, encountering already banned IPs.\u003c/td\u003e\n            \u003ctd\u003ePotential unreliable/ insecure proxies, especially Free ones.\u003c/td\u003e\n        \u003c/tr\u003e\n    \u003c/tbody\u003e\n\u003c/table\u003e\n\n**IMPORTANT**: Nothing above is absolutely safe and 
secure. _Caution is never superfluous_. You will need to research more about them if you want to enhance the security of your data and privacy.\n\n### II. Private/Dedicated Proxies (Most effective IP hiding technique)\n\nAs you can conclude from the table above, **Rotating Private/Dedicated Proxies** is the most effective IP hiding technique for **undetectable** and **large-scale** scraping. Below are 2 popular ways to effectively integrate this technique into your scraping process:\n\n\u003ctable\u003e\n    \u003cthead\u003e\n        \u003ctr\u003e\n            \u003cth align=\"center\" width=\"2%\"\u003eTechnique\u003c/th\u003e\n            \u003cth align=\"center\" width=\"10%\" \u003eSpeed\u003c/th\u003e\n            \u003cth align=\"center\" width=\"10%\" \u003eCost\u003c/th\u003e\n            \u003cth align=\"center\" width=\"23%\" \u003eScale\u003c/th\u003e\n            \u003cth align=\"center\" width=\"30%\" \u003eAnonymity\u003c/th\u003e\n            \u003cth align=\"center\" width=\"25%\" \u003eAdditional Notes\u003c/th\u003e\n        \u003c/tr\u003e\n    \u003c/thead\u003e\n    \u003ctbody\u003e\n        \u003ctr\u003e\n            \u003ctd align=\"center\"\u003e\u003cb\u003eResidential Rotating Proxies\u003c/b\u003e\u003cbr\u003e⭐⭐⭐\u003cbr\u003e⭐⭐\u003cbr\u003e(Best)\u003c/td\u003e\n            \u003ctd align=\"center\"\u003eFast\u003c/td\u003e\n            \u003ctd align=\"center\"\u003ePaid\u003c/td\u003e\n            \u003ctd\u003eIdeal for high-end, \u003cb\u003elarge-scale\u003c/b\u003e scraping tasks.\u003c/td\u003e\n            \u003ctd\u003e- Mimics real user IPs and auto-rotate IPs when using proxy gateways, making detection harder.\u003cbr\u003e- Provides high anonymity and low risk of blacklisting/blocks due to legitimate residential IPs.\u003c/td\u003e\n            \u003ctd\u003eConsider proxy quality, location targeting, and rotation speed.\u003c/td\u003e\n        \u003c/tr\u003e\n        \u003ctr\u003e\n            \u003ctd 
align=\"center\"\u003e\u003cb\u003eDatacenter Rotating Proxies\u003c/b\u003e\u003cbr\u003e⭐⭐\u003cbr\u003e⭐⭐\u003c/td\u003e\n            \u003ctd align=\"center\"\u003eFaster than \u003cb\u003eResidential Proxies\u003c/b\u003e\u003c/td\u003e\n            \u003ctd\u003eMore affordable than \u003cb\u003eResidential Proxies\u003c/b\u003e\u003c/td\u003e\n            \u003ctd\u003eGood for cost-effective, \u003cb\u003elarge-scale\u003c/b\u003e scraping.\u003c/td\u003e\n            \u003ctd\u003e- Less anonymous than Residential Proxies.\u003cbr\u003e- Higher risk of being blocked.\u003cbr\u003e- Easily detectable due to their datacenter IP ranges.\u003c/td\u003e\n            \u003ctd\u003eConsider the reputation of the provider and the frequency of IP rotation.\u003c/td\u003e\n        \u003c/tr\u003e\n    \u003c/tbody\u003e\n\u003c/table\u003e\n\nRecently, I tested my web scraping [npm package](https://www.npmjs.com/package/puppeteer-ecommerce-scraper) with [NodeMaven](https://nodemaven.com/?a_aid=quandang), a **Residential proxy provider** with a focus on IP quality as well as stability, and I think it worked quite well. Below is the proxy quality result from the [proxy checker tool](https://addons.mozilla.org/en-US/firefox/addon/proxy-checker/) I mentioned [above](#i-ip-hiding-techniques):\n\n![](./assets/nodemaven-test.jpg)\n\nAnd these are the performance measurements from an actual run with my [scraping package](https://www.npmjs.com/package/puppeteer-ecommerce-scraper):\n\n- **Successful scrape runs**: 96% (over 100 attempts). 
This result is quite good.\n    - [NodeMaven](https://nodemaven.com/?a_aid=quandang) already reduced the likelihood of encountering banned or blacklisted IPs through the `IP Quality Filtering` feature.\n    - Another reason is that it has access to over `5M residential IPs` across 150+ countries, which gives a broad range of geo-targeting options.\n    - I also used their `IP Rotation` feature to rotate IPs within a single gateway endpoint, which simplified my scraping setup and provided consistent anonymity.\n- **Average scrape time**: around 1-2 mins/10 pages for a complex, dynamically loading website (highly dependent on website complexity). While the proxy speeds were generally consistent, there were occasional fluctuations, which is expected in any proxy service.\n- **Sticky Sessions**: 24h+ session durations allowed me to maintain connections and complete scrapes efficiently.\n- **IP block rate** / **Redirect** / **Blank page** in the first run: \u003c4%.\n\nOverall, throughout many runs, the proxies proved to be reliable with minimal downtime or issues. For those interested in trying [NodeMaven](https://nodemaven.com/?a_aid=quandang), you can apply the code `QD2` for an additional 2GB of traffic free with your trial or package purchase.\n\n### III. Browser Settings \u0026 Plugins\n\nUpdating...\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2F18520339%2Ffacebook-data-extraction","html_url":"https://awesome.ecosyste.ms/projects/github.com%2F18520339%2Ffacebook-data-extraction","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2F18520339%2Ffacebook-data-extraction/lists"}