Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/javapuppteernodejs/curl_cffi


https://github.com/javapuppteernodejs/curl_cffi

captcha curl curl-library curlphp fingerprint ja3 ja3-fingerprint ja3-h2 tls web-scraping web-scraping-software web-scraping-tutorials

Last synced: about 2 months ago
JSON representation

Awesome Lists containing this project

README

        

# How to Solve TLS/JA3 Fingerprinting in Web Scraping with curl_cffi

![](https://assets.capsolver.com/prod/images/post/2024-08-27/f90e3486-092c-42fc-b0a7-e0b031892456.png)

Struggling with anti-bot measures while scraping websites? `curl_cffi`, an advanced Python library wrapping around the cURL tool, can help you bypass these barriers effectively. By mimicking browser behavior and leveraging cURL's features, `curl_cffi` enhances your scraper’s ability to avoid detection and perform smoothly. In this guide, we’ll explore how `curl_cffi` works, how to use it for various tasks, and address its limitations. We’ll also discuss potential solutions to overcome these limitations.

### What is curl_cffi?

`curl_cffi` is a Python library designed for network requests, similar to libraries like `requests` and `httpx`. However, unlike these libraries, `curl_cffi` can simulate browser TLS/JA3 and HTTP/2 fingerprints. [`curl-impersonate`](https://github.com/lwthiker/curl-impersonate) is a command-line tool that can simulate four major browsers and perform TLS and HTTP handshakes just like a real browser. `curl_cffi` wraps `curl-impersonate` into a Python library using `cffi`.

> Struggling with the repeated failure to completely solve the irritating captcha?
>
> Discover seamless automatic captcha solving with **CapSolver** AI-powered Auto Web Unblock technology!
>
> Claim Your **Bonus Code** for top captcha solutions; [CapSolver](https://www.capsolver.com/?utm_source=official&utm_medium=blog&utm_campaign=curlcurl_cffi): **WEBS**. After redeeming it, you will get an extra 5% bonus after each recharge, Unlimited
>
> ![](https://assets.capsolver.com/prod/images/post/2024-03-29/fbc29472-886c-45b2-9eb2-2b307f6d9700.png)

### What is TLS/JA3 Fingerprinting?

Today, most websites use HTTPS. To establish an HTTPS connection, a TLS handshake occurs between the server and client, exchanging information such as supported TLS versions and encryption algorithms. Different clients have varying characteristics, and these details are often stable, allowing servers to identify requests as coming from typical user browsers or automated scripts. JA3 is a common algorithm used to generate TLS fingerprints. It works by concatenating these characteristics and computing an MD5 hash.

### Using curl_cffi

The usage of `curl_cffi` is quite similar to `requests`. Here’s how you can use `requests` to get the JA3 fingerprint:

```python
import requests

url = "https://tls.browserleaks.com/json"
r = requests.get(url)
print(r.json())
```

You might get results like this:

```json
{
"user_agent": "python-requests/2.32.3",
"ja3_hash": "8d9f7747675e24454cd9b7ed35c58707",
"ja3_text": "771,4866-4867-4865-49196-49200-49195-49199-52393-52392-159-158-52394-49327-49325-49326-49324-49188-49192-49187-49191-49162-49172-49161-49171-49315-49311-49314-49310-107-103-57-51-157-156-49313-49309-49312-49308-61-60-53-47-255,0-11-10-16-22-23-49-13-43-45-51-21,29-23-30-25-24,0-1-2",
"ja3n_hash": "a790a1e311289ac1543f411f6ffceddf",
"ja3n_text": "771,4866-4867-4865-49196-49200-49195-49199-52393-52392-159-158-52394-49327-49325-49326-49324-49188-49192-49187-49191-49162-49172-49161-49171-49315-49311-49314-49310-107-103-57-51-157-156-49313-49309-49312-49308-61-60-53-47-255,0-10-11-13-16-21-22-23-43-45-49-51,29-23-30-25-24,0-1-2",
"akamai_hash": "",
"akamai_text": ""
}
```

If you make repeated requests, you’ll find that your JA3 hash remains the same. However, starting from Chrome version 110, the TLS ClientHello extension order is randomized, making it easier for website developers to block libraries like `requests` based on JA3 fingerprints. If your requests consistently show the same JA3 fingerprint, they might be identified as coming from a single user, increasing the likelihood of being flagged as a bot.

Here’s how to use `curl_cffi` to simulate a real JA3 fingerprint:

```python
from curl_cffi import requests

url = "https://tls.browserleaks.com/json"
r = requests.get(url, impersonate="chrome124")
print(r.json())
```

The `impersonate` parameter allows you to specify the browser and version to simulate. Supported browsers include Chrome, Chrome Android, Edge, and Safari, with versions continually updated. For detailed information, refer to the [curl_cffi GitHub repository](https://github.com/lexiforest/curl_cffi). With `curl_cffi`, the JA3 fingerprint will align with that of a real browser, and starting from Chrome 110, the JA3 fingerprint will change with each request:

```json
{
"user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
"ja3_hash": "c97c8dac4ca1de968fe230de54f3e0f3",
"ja3_text": "771,4865-4866-4867-49195-49199-49196-49200-52393-52392-49171-49172-156-157-47-53,16-10-27-18-5-51-23-17513-45-35-43-13-65281-0-11-65037,25497-29-23-24,0",
"ja3n_hash": "4c9ce26028c11d7544da00d3f7e4f45c",
"ja3n_text": "771,4865-4866-4867-49195-49199-49196-49200-52393-52392-49171-49172-156-157-47-53,0-5-10-11-13-16-18-23-27-35-43-45-51-17513-65037-65281,25497-29-23-24,0",
"akamai_hash": "52d84b11737d980aef856699f885ca86",
"akamai_text": "1:65536;2:0;4:6291456;6:262144|15663105|0|m,a,s,p"
}
```

### Addressing curl_cffi Limitations

While `curl_cffi` can simulate real JA3 fingerprints and potentially avoid bot challenges and blocks, it may not always be sufficient. Many sites implement advanced bot protection mechanisms such as hCaptcha, reCaptcha, Geetest, Cloudflare Turnstile, DataDome, and AWS WAF. These systems use complex images and difficult-to-read JavaScript challenges to differentiate between humans and bots. Sometimes, even with a genuine and randomized JA3 fingerprint, bypassing these challenges is unavoidable.

If you encounter CAPTCHA challenges, regardless of the request library you use, they may be unavoidable. However, there’s no need to worry. [CapSolver](https://www.capsolver.com/?utm_source=official&utm_medium=blog&utm_campaign=curlcurl_cffi) provides a solution to these problems. CapSolver uses AI-based automated web unlocking technology to solve various bot challenges in seconds. Whether dealing with images or complex problems, CapSolver can handle it efficiently. If the solution fails, you won’t incur any costs.

CapSolver also offers a browser extension that automatically resolves CAPTCHAs during data scraping with Selenium. Additionally, an API solution is available for resolving CAPTCHAs and obtaining tokens in frameworks like Scrapy. Everything can be accomplished in just a few seconds. For more details, refer to the [CapSolver documentation](https://docs.capsolver.com/en/guide/getting-started/).

### Conclusion

By integrating `curl_cffi` into your web scraping setup, you can effectively mimic real browser behavior to overcome TLS/JA3 fingerprinting challenges. While `curl_cffi` provides robust tools for handling these challenges, advanced CAPTCHA and bot detection systems still pose significant obstacles. CapSolver offers a complementary solution to tackle these CAPTCHA challenges seamlessly, ensuring your scraping activities run smoothly.

For further insights and resources, check out the [CapSolver website](https://capsolver.com) and explore the [curl_cffi GitHub repository](https://github.com/lexiforest/curl_cffi).