https://github.com/neuroradiology/InsideReCaptcha

Reverse-engineering the new “captchaless” ReCaptcha system...

# Summary

A few days ago, Google introduced a [new version of ReCaptcha](http://googleonlinesecurity.blogspot.com/2014/12/are-you-robot-introducing-no-captcha.html), theoretically allowing most users to complete it by only ticking a checkbox. If Google doesn't deem the user human, the old version with distorted text appears. Although I used a normal Firefox version, I still had to fill in the text captcha after clicking, so it didn't really work for me. My curiosity led me to look at the JavaScript to find out how all this really works...

# What happens on the wire

First, the browser makes the following requests:

* `https://www.google.com/recaptcha/api.js`, whose function is mainly to load the next one...
* `https://www.gstatic.com/recaptcha/api2/r20141202135649/recaptcha__en.js`, which contains common code.
* `https://apis.google.com/_/scs/apps-static/_/js/` (followed by a bunch of more or less cryptic parameters) which contains other common JavaScript code.

The browser then makes a request to `https://www.google.com/recaptcha/api2/anchor`, whose response contains the really interesting part: a call to a function named `recaptcha.anchor.Main.init`, which takes two base64-encoded parameters.

The first parameter points to a JavaScript file: [`https://www.google.com/js/bg/6yg-ggdQgQAg8SAADJkAjc-JMNnOnYuIGgH_iBV7uf8.js`](https://www.google.com/js/bg/6yg-ggdQgQAg8SAADJkAjc-JMNnOnYuIGgH_iBV7uf8.js). The second one contains *double-*base64-encoded binary data.
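As a rough illustration of the "double base64" layer, here is a minimal Python sketch that undoes two rounds of base64. The helper names are mine, and the tolerance for missing padding and for the URL-safe alphabet is an assumption on my part rather than something I verified on the real payload. The sample is synthetic, not actual ReCaptcha data.

```python
import base64


def b64decode_padded(data: bytes) -> bytes:
    """Decode base64, tolerating missing '=' padding and either alphabet."""
    data = data.strip()
    data += b"=" * (-len(data) % 4)
    # Normalise to the URL-safe alphabet so both variants decode the same way.
    return base64.urlsafe_b64decode(data.replace(b"+", b"-").replace(b"/", b"_"))


def decode_double_base64(param: str) -> bytes:
    """Undo two layers of base64 to recover the raw binary payload."""
    inner = b64decode_padded(param.encode("ascii"))
    return b64decode_padded(inner)


# Synthetic payload standing in for the real second parameter.
sample = base64.b64encode(base64.b64encode(b"\x00\x01bytecode\xff")).decode("ascii")
payload = decode_double_base64(sample)
```

The result is a `bytes` object that can then be handed to whatever decrypts and disassembles the bytecode.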

It turned out this new ReCaptcha system is heavily obfuscated, as **Google implemented a whole VM in JavaScript with a specific bytecode language**.

The first parameter is the bytecode interpreter. After trimming the `(function(){eval('` and `')})()`, and passing it to [JSBeautifier](http://jsbeautifier.org/), I finally dove into this mass of minified code.
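The trimming step can be sketched in a few lines of Python. The function name is hypothetical. Note that the result is still the body of a JavaScript string literal, so any escape sequences inside it are left untouched here and may need unescaping before beautifying.

```python
PREFIX = "(function(){eval('"
SUFFIX = "')})()"


def strip_eval_wrapper(source: str) -> str:
    """Return the code passed to eval(), without the surrounding wrapper."""
    source = source.strip()
    if not (source.startswith(PREFIX) and source.endswith(SUFFIX)):
        raise ValueError("unexpected wrapper around interpreter source")
    return source[len(PREFIX):-len(SUFFIX)]


# Tiny stand-in for the real (much larger) interpreter file.
inner = strip_eval_wrapper("(function(){eval('var a=1;')})()")
```

Raising on an unexpected wrapper is deliberate: if Google changes the packaging, it is better to notice immediately than to feed garbage to the beautifier.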

# The analysis

The interpreter has two entry points: the `M` function, which is executed when ReCaptcha is loaded, and `M.prototype.ha`, which is executed when you click the checkbox and returns the information destined for Google's servers.

I first discovered that the bytecode was encrypted using the [XTEA](https://en.wikipedia.org/wiki/XTEA) algorithm. Each block of 8 bytes is xored with a keystream (so the decryption and encryption functions are the same). The keystream block is obtained by running XTEA on a 64-bit input whose first 32-bit word is read from the bytecode file and whose second 32-bit word is the position in the bytecode file divided by 8. The key is *by default* `[0, 0, 0, 0]`.
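To make the scheme above concrete, here is a minimal Python sketch of XTEA used as a keystream generator. Several details are assumptions on my part and are not confirmed by the text: the standard 32 rounds, big-endian packing of the keystream words, and a caller-supplied `first_word` (the text says this word is read from the bytecode file, but not from where). Treat it as an illustration, not as the interpreter's exact routine.

```python
import struct

MASK = 0xFFFFFFFF
DELTA = 0x9E3779B9  # standard XTEA constant


def xtea_encrypt_block(v0, v1, key, rounds=32):
    """Encrypt one 64-bit block (two 32-bit words) with a 4-word key."""
    s = 0
    for _ in range(rounds):
        v0 = (v0 + ((((v1 << 4) ^ (v1 >> 5)) + v1) ^ (s + key[s & 3]))) & MASK
        s = (s + DELTA) & MASK
        v1 = (v1 + ((((v0 << 4) ^ (v0 >> 5)) + v0) ^ (s + key[(s >> 11) & 3]))) & MASK
    return v0, v1


def keystream_xor(data: bytes, first_word: int, key=(0, 0, 0, 0)) -> bytes:
    """XOR data with an XTEA keystream; the same call encrypts and decrypts."""
    out = bytearray()
    for offset in range(0, len(data), 8):
        # Second input word: position in the file divided by 8.
        k0, k1 = xtea_encrypt_block(first_word, offset // 8, key)
        pad = struct.pack(">II", k0, k1)  # byte order is an assumption
        chunk = data[offset:offset + 8]
        out.extend(b ^ p for b, p in zip(chunk, pad))
    return bytes(out)


# Round trip on synthetic data with the default all-zero key.
plain = b"example bytecode!"
cipher = keystream_xor(plain, first_word=0x12345678)
restored = keystream_xor(cipher, first_word=0x12345678)
```

Because the cipher output is only used as an XOR pad, XTEA's decryption direction is never needed.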

By default... because otherwise it would have been too simple: it turns out the bytecode has direct access to the JavaScript variables of its *own* interpreter, and changes its *own* decryption key and even its *own* opcode numbers at many points.
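The self-modifying behaviour is easier to picture with a toy. The Python sketch below is entirely hypothetical (the instruction set, class, and field names are invented for illustration and have nothing to do with the real opcodes). It only shows how bytecode with write access to its interpreter's state can change a key and renumber opcodes while running.

```python
class ToyVM:
    """Toy interpreter whose bytecode can rewrite the interpreter's own state."""

    def __init__(self):
        self.key = [0, 0, 0, 0]  # stands in for the decryption key
        self.stack = []
        # opcode number -> handler; the bytecode may renumber these entries
        self.opcodes = {0: self.op_push, 1: self.op_set_key, 2: self.op_remap}

    def op_push(self, code, pc):
        self.stack.append(code[pc])
        return pc + 1

    def op_set_key(self, code, pc):
        index, value = code[pc], code[pc + 1]
        self.key[index] = value
        return pc + 2

    def op_remap(self, code, pc):
        old, new = code[pc], code[pc + 1]
        self.opcodes[new] = self.opcodes.pop(old)
        return pc + 2

    def run(self, code):
        pc = 0
        while pc < len(code):
            handler = self.opcodes[code[pc]]
            pc = handler(code, pc + 1)


vm = ToyVM()
vm.run([
    0, 7,       # push 7
    1, 2, 99,   # key[2] = 99
    2, 0, 5,    # renumber opcode 0 (push) to 5
    5, 8,       # push 8 using the new opcode number
])
```

A static disassembler that assumes a fixed opcode table would misread everything after the remap, which is why the bytecode has to be followed step by step.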

Even niftier, at some points the bytecode key is generated by directly hashing JavaScript code from the interpreter (`Function.toString()` rocks, doesn't it?), or from the output of browser-specific functions and CSS rules, or from the hostname of the calling domain (www.google.com)...

**After about 2 days of work, I produced a working disassembler and then a decompiler for the ReCaptcha bytecode.** You can try it from this GitHub repository. However, it still has some hardcoded key values, so for now it will only work on the bytecode sample contained in the `enc` file.

Just execute the `./decomp.py` file to give it a try; it will output pseudo-JavaScript. `xhr1` and `xhr2` are byte arrays that contain the data later sent to Google's servers.

# Gathered information

Google servers will receive and process, at least, the following information:

* Plug-ins
* User-agent
* Screen resolution
* Execution time, timezone
* Number of click/keyboard/touch actions *in the `<iframe>` of the captcha*
* It tests the behavior of many browser-specific functions and CSS rules
* It checks the rendering of canvas elements
* Likely cookies server-side (it's executed on the www.google.com domain)
* And likely other stuff...

You can look at the decompiled bytecode for more details.

This information, along with numeric values hardcoded in the bytecode (forcing a potential bot to read all of it), is sent to the `https://www.google.com/recaptcha/api2/frame` page. Look at the `M.prototype.Q` function to see how the encoding is implemented. Some of the information is also encrypted with XTEA: the part I call `xhr2` in the decompiler, which is retrieved from the `this.c[this.g]` variable (`xhr1` is in `this.c[this.d]`).

# What's next...

We could:

* Gather statistics about when the checkbox captcha suffices and when it doesn't.
* Programmatically bypass the captcha by interpreting the bytecode.
* Programmatically bypass the captcha by simply running a rendering engine and automating mouse movements. But that would be slightly less fun.

Cheers and good reversing!