https://github.com/karambir/ugc-colleges
Python Script to extract college names from UGC, India website.
college crawler extract html-parser python python-script ugc
- Host: GitHub
- URL: https://github.com/karambir/ugc-colleges
- Owner: karambir
- Created: 2012-06-23T23:02:06.000Z (over 12 years ago)
- Default Branch: master
- Last Pushed: 2012-07-29T17:14:12.000Z (over 12 years ago)
- Last Synced: 2023-03-24T18:56:39.528Z (over 1 year ago)
- Topics: college, crawler, extract, html-parser, python, python-script, ugc
- Language: Python
- Size: 2.18 MB
- Stars: 5
- Watchers: 3
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
#Extracting college names and addresses from the UGC site
*Author: Karambir Singh Nain*
This repository includes a Python script that I wrote to extract college names from the main UGC site. It uses regular expressions.
It outputs a file named colleges.txt containing all the college names and addresses. I was able to extract 7758 colleges out of the 8000 in the list. Most of those I couldn't extract were bad data entries on UGC's site. I wanted to practice regex a bit.
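As a rough illustration of the regex approach, here is a minimal sketch in Python 3. It is not the script from this repository: the table markup in `SAMPLE_HTML`, the pattern `ROW_RE`, and the helper names `extract_colleges` and `write_colleges` are all assumptions made for the example, and the real UGC pages may be structured differently.

```python
import re

# Hypothetical markup for illustration only; the real UGC pages may differ.
SAMPLE_HTML = """
<table>
<tr><td>1</td><td>ABC College</td><td>Sector 1, Rohtak, Haryana</td></tr>
<tr><td>2</td><td>XYZ College of Arts</td><td>MG Road, Pune, Maharashtra</td></tr>
<tr><td>3</td><td></td><td>Missing name</td></tr>
</table>
"""

# One table row: serial number, college name, address (assumed layout).
ROW_RE = re.compile(
    r"<tr>\s*<td>\d+</td>\s*<td>([^<]+)</td>\s*<td>([^<]+)</td>\s*</tr>",
    re.IGNORECASE,
)


def extract_colleges(html):
    """Return (name, address) pairs for every row that matches ROW_RE."""
    return [(name.strip(), address.strip()) for name, address in ROW_RE.findall(html)]


def write_colleges(colleges, path="colleges.txt"):
    """Write one 'name, address' line per college to the output file."""
    with open(path, "w", encoding="utf-8") as out:
        for name, address in colleges:
            out.write(f"{name}, {address}\n")


if __name__ == "__main__":
    write_colleges(extract_colleges(SAMPLE_HTML))
```

The third sample row has an empty name cell, so the pattern skips it. Bad entries of this kind are one way rows can go missing from the output.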
**It can also be done with string find methods.**
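A minimal sketch of the string-find alternative, in Python 3. The `<td>` markup and the helper name `extract_cells` are assumptions for illustration; the actual page layout may differ.

```python
def extract_cells(row):
    """Collect the text of each <td>...</td> cell using str.find only."""
    open_tag, close_tag = "<td>", "</td>"
    cells = []
    pos = 0
    while True:
        start = row.find(open_tag, pos)
        if start == -1:
            break  # no more cells
        start += len(open_tag)
        end = row.find(close_tag, start)
        if end == -1:
            break  # unterminated cell: stop rather than guess
        cells.append(row[start:end].strip())
        pos = end + len(close_tag)
    return cells


if __name__ == "__main__":
    # Hypothetical row; prints the three cell values as a list.
    print(extract_cells("<tr><td>1</td><td>ABC College</td><td>Sector 1, Rohtak</td></tr>"))
```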
##Requirements:
1. urllib2 - for downloading HTML files from the UGC website.
2. re - the regular expressions module.
If you have any questions, open a pull request.
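`urllib2` exists only in Python 2; in Python 3 the same functionality lives in `urllib.request`. The download step might look roughly like the sketch below, written for Python 3. The helper name `fetch_html` and the timeout value are assumptions for the example, and no real UGC URL is shown because the script's target pages are not listed here.

```python
import urllib.request


def fetch_html(url, timeout=30):
    """Download a page and return its body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

The returned string can then be passed to whatever extraction step you use, whether regex or string find.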