https://github.com/antony-jr/coepy
A Powerful CLI Tool to automatically scrape information from Controller Of Examination AU written in Python :snake:.
- Host: GitHub
- URL: https://github.com/antony-jr/coepy
- Owner: antony-jr
- License: mit
- Created: 2018-10-21T07:38:38.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2019-01-28T05:25:15.000Z (over 7 years ago)
- Last Synced: 2025-01-13T23:25:34.678Z (over 1 year ago)
- Topics: anna-university, automation, capcha-solver, coe-anna-university-scraper, python-cli, python3, scraper
- Language: Python
- Size: 1.45 MB
- Stars: 2
- Watchers: 3
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# CoePy
This is a simple and powerful CLI tool written in Python that automates student information extraction from the Controller of Examinations (COE) website of Anna University (AU), and eventually makes our lives better.
This script simply uses **selenium to fill in form fields**. The interesting part is that it **automatically solves the
captcha generated by the COE AU website.**
This saves a lot of time, and it also helps if you want to check marks in **bulk**. Just execute the script, drink
your coffee, and wait for it to show the information you requested.
**IMPORTANT**: Only the login process has been tested, and it works quite well. The scraping of information could not be tested yet, since
the website is not responding to anyone, neither humans nor bots.
**NOTE**: Only tested on Linux; it may or may not work on other platforms.
# Installation
I did not publish this on PyPI (the Python Package Index) because the code will be updated frequently and I cannot release it a gazillion times. Also, **this project can be discontinued at any time.**
Therefore you have to install it manually from source. Don't worry, it is easy. Before you do anything, make sure you have **Google Chrome** or **Chromium** installed on your computer. The script uses it to render the website, since the site depends heavily on a real browser. To be honest, the website is very cranky when scraped with requests.
Now execute the following commands in your terminal:
```
$ git clone https://github.com/antony-jr/CoePy
$ cd CoePy
$ sudo pip3 install -r requirements.txt
$ ./coepy.py --help
```
**Note**: This script has only been tested on **Python 3.7**.
# Cracking Captcha
The most interesting part of this project is cracking the captcha generated by the website. That's the biggest hurdle
in automating this process, right?
So here is how I cracked it.
First, let's take a look at the captchas generated by the COE AU website.
**Note**: In a sample of 1500 captchas generated by the COE AU website, the captchas seem to follow only the above two patterns.
By observation,
**the static properties of captchas generated by the COE AU website are:**
* The resolution of any arbitrary captcha is always **70x20 pixels (can be taken as a 70x20 matrix)**.
* Any arbitrary captcha is always **binary**, i.e. it uses only black and white.
* Only the **digits 0-9** and the **letters A-Z** are used.
* There is very little noise.
* Any arbitrary captcha has **6 characters inscribed in it**.
* A single character fits inside a **10x8 matrix**.
Now let's do some math.
Let **'R'** be the matrix of an arbitrary captcha image in **binary** form (i.e. a white pixel is 255 and a black pixel is 0).
From the **static properties** we know that **'R' must be a 70x20 matrix**.
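As a rough illustration of how such a matrix might be built, the sketch below thresholds grayscale pixel values into the 0/255 form described above. The names `binarize` and `has_captcha_shape`, the threshold of 128, and reading "70x20" as 70 columns by 20 rows are all assumptions made for this example; none of it is taken from CoePy's source.

```python
def binarize(gray_rows, threshold=128):
    """Map grayscale values (0-255) to 0 (black) or 255 (white)."""
    return [[255 if px >= threshold else 0 for px in row] for row in gray_rows]


def has_captcha_shape(matrix, width=70, height=20):
    """Check that the matrix has `height` rows of `width` pixels each."""
    return len(matrix) == height and all(len(row) == width for row in matrix)


# Hypothetical stand-in for a decoded captcha: a dark image with one light pixel.
gray = [[30] * 70 for _ in range(20)]
gray[5][10] = 200
R = binarize(gray)
print(has_captcha_shape(R), R[5][10], R[0][0])
```

In a real run the grayscale rows would come from an image library rather than a hand-built list.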
Let **'Cn'** be a matrix that represents a single character from the captcha, where **'n'** is the zero-based position of that
character in the captcha. Thus the range of **'n'** must be **0 <= n < number of characters in the captcha**.
From the **static properties** we know that **'Cn' must be a 10x8 matrix**.
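A minimal sketch of cutting the six character matrices out of the captcha matrix could look like this. The layout values `left`, `top` and `gap` are invented for illustration (the real offsets would have to be measured from sample captchas), and treating the 10x8 cell as 10 rows by 8 columns is likewise an assumption.

```python
def split_characters(R, count=6, char_h=10, char_w=8, left=5, top=5, gap=2):
    """Cut `count` character cells out of the captcha matrix R.

    `left`, `top` and `gap` are made-up layout values for illustration;
    the real offsets would have to be measured from sample captchas.
    """
    cells = []
    for n in range(count):
        x0 = left + n * (char_w + gap)  # left edge of the n-th cell
        cell = [row[x0:x0 + char_w] for row in R[top:top + char_h]]
        cells.append(cell)
    return cells


# Hypothetical all-black captcha matrix: 20 rows of 70 pixels.
R = [[0] * 70 for _ in range(20)]
R[5][5] = 255  # top-left pixel of the first cell under the assumed layout
C = split_characters(R)
print(len(C), len(C[0]), len(C[0][0]), C[0][0][0])
```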
Let **T0 T1 T2 ... T35** be a set of matrices, each of **order 10x8**. Call this set the **test set**. These matrices are obtained by converting every possible character in the captchas generated by the COE AU website into a matrix that
has the character in white pixels and the background in black pixels. The characters are de-noised and filtered from all
the samples. From the **static properties we know that the maximum number of distinct characters is 36.**
These matrices are stored for later use.
Now each matrix in the set {**C0 C1 C2 ... C5**} is compared against each element of the **test set**. The **Percentage of Match** is calculated between a single character (**Cn**) of the arbitrary captcha and every character in the **test set**. The element of the **test set** with the highest **Percentage of Match** is the most appropriate character for position **n** of the **resultant string**, which must have a length of **6** (since the captcha has exactly 6 characters inscribed in it).
**T0** corresponds to the character **'0'**, **T1** corresponds to the character **'1'**, and in general
**Tm** corresponds to the character at index **m** of the list of all possible characters **'0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ'**.
The **Percentage of Match** is calculated like so:
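One plausible definition, given here as an assumption rather than as the author's exact formula, is the share of pixel positions at which the two 10x8 matrices agree:

```latex
P(C_n, T_m) = \frac{100}{10 \times 8} \sum_{i=0}^{9} \sum_{j=0}^{7} \left[\, C_n(i,j) = T_m(i,j) \,\right]
```

Here the bracket is 1 when the two pixels are equal and 0 otherwise, so two identical matrices score 100 and two matrices that differ in every pixel score 0.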
**Thus the most appropriate character for the nth position of the resultant string is obtained. This process is repeated for n = 0 through 5 (since the captcha has exactly six characters), which yields the resultant string.**
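The whole matching step can be sketched as below. This is an illustration under stated assumptions, not CoePy's actual code: `percentage_of_match` counts equal pixels (one reasonable reading of the metric), and `toy_template` produces artificial stand-in templates, not real glyphs.

```python
CHARSET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def percentage_of_match(c, t):
    """Percentage of pixel positions where matrices c and t hold the same value."""
    total = sum(len(row) for row in c)
    same = sum(a == b for rc, rt in zip(c, t) for a, b in zip(rc, rt))
    return 100.0 * same / total


def decode(cells, test_set):
    """Pick, for each character cell, the character of the best-scoring template."""
    result = []
    for c in cells:
        scores = [percentage_of_match(c, t) for t in test_set]
        best_m = max(range(len(scores)), key=scores.__getitem__)
        result.append(CHARSET[best_m])
    return "".join(result)


def toy_template(m, h=10, w=8):
    """Artificial template (not a real glyph): the first m pixels are white."""
    flat = [255] * m + [0] * (h * w - m)
    return [flat[r * w:(r + 1) * w] for r in range(h)]


test_set = [toy_template(m) for m in range(len(CHARSET))]
cells = [toy_template(CHARSET.index(ch)) for ch in "A1B2C3"]
cells[0][9][7] = 255  # flip one pixel to simulate noise
print(decode(cells, test_set))  # prints A1B2C3
```

Because the best score wins even when it is below 100, a single noisy pixel in the first cell does not change the result.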
To see this in action, you can use the **coepy-cpt.py** script, which is used to test the **captcha parser**:
```
$ cd CoePy
$ pip3 install -r requirements.txt # Install the dependencies.
$ python coepy-cpt.py CaptchaSamples/CaptchaSample0.png # The first captcha
$ python coepy-cpt.py CaptchaSamples/CaptchaSample1.png # The second captcha
```
This can still give wrong results sometimes, but we can simply retry. We only need the most likely answer.
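The retry idea can be sketched as a small loop. The function `solve_with_retries` and its three callables (`fetch_captcha`, `solve`, `submit`) are hypothetical hooks invented for this example; they do not correspond to functions in CoePy.

```python
def solve_with_retries(fetch_captcha, solve, submit, max_attempts=5):
    """Fetch a fresh captcha, guess it, and submit until one guess is accepted.

    All three callables are hypothetical hooks supplied by the caller.
    Returns (accepted_guess, attempts_used), or (None, max_attempts) on failure.
    """
    for attempt in range(1, max_attempts + 1):
        guess = solve(fetch_captcha())
        if submit(guess):
            return guess, attempt
    return None, max_attempts


# Simulated site: the solver misreads the first captcha and gets the second right.
captchas = iter(["AB12CD", "XY34ZW"])
misreads = {"AB12CD": "A812CD"}
current = {}


def fetch_captcha():
    current["text"] = next(captchas)
    return current["text"]


def solve(image):
    return misreads.get(image, image)


def submit(guess):
    return guess == current["text"]


print(solve_with_retries(fetch_captcha, solve, submit))  # ('XY34ZW', 2)
```

Capping the number of attempts keeps the script from hammering the website when something other than the captcha is failing.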
# License
The MIT License.
Copyright (C) 2018 [Antony Jr.](https://github.com/antony-jr)