{"id":25363553,"url":"https://github.com/tcd93/python-web-scraping","last_synced_at":"2025-04-09T04:19:40.832Z","repository":{"id":137012197,"uuid":"337641511","full_name":"tcd93/python-web-scraping","owner":"tcd93","description":"A short tutorial to perform scraping job data (via Python) from popular Vietnamese job sites such as Vietnamworks, itviec, jobhopin...","archived":false,"fork":false,"pushed_at":"2021-02-18T10:04:54.000Z","size":555,"stargazers_count":5,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-02-14T22:38:50.457Z","etag":null,"topics":["python","scraping-websites","tutorial"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tcd93.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-02-10T06:51:23.000Z","updated_at":"2023-12-08T06:50:02.000Z","dependencies_parsed_at":null,"dependency_job_id":"9edfda66-2038-4bab-b3de-f41931ae9e0b","html_url":"https://github.com/tcd93/python-web-scraping","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tcd93%2Fpython-web-scraping","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tcd93%2Fpython-web-scraping/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tcd93%2Fpython-web-scraping/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tcd93%2Fpython-web-scraping/ma
nifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tcd93","download_url":"https://codeload.github.com/tcd93/python-web-scraping/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247975154,"owners_count":21026818,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["python","scraping-websites","tutorial"],"created_at":"2025-02-14T22:34:49.398Z","updated_at":"2025-04-09T04:19:40.826Z","avatar_url":"https://github.com/tcd93.png","language":"Python","readme":"This is a tutorial (with code sample) for scraping job contents from job searching pages such as \n[_itviec_](itviec.com)\n, [_vietnamworks_](https://www.vietnamworks.com/)\n, [_jobhopin_](https://jobhopin.com/)\nusing Python's [`requests`](https://requests.readthedocs.io/en/master/)\n\n#### Before we begin\nReads the description of the `robots.txt` files of the sites we're scraping (for example: https://www.vietnamworks.com/robots.txt), make sure we play nice \u0026 don't violate the rules; also, don't make too many requests at the same time.\n\n---\n\n### Scraping Jobhopin.com\nFirst we go to the target url: https://jobhopin.com/viec-lam/vi?cities=ho-chi-minh\u0026type=job\n\nThis is what we'd see:\n\n![index.png](img/jobhopin/1.png)\n\nIf we open it via Chrome's Developer Console, we get an entirely different page:\n\n![chrome_dev.png](img/jobhopin/2.png)\n\n**Jobhopin.com** is built by a client-side-rendered framework (like _Reactjs_), meaning the web server just \nreturns a bunch of Javascript code to the browser instead of an HTML page 
In this case, we can check whether the job data is already embedded in the JavaScript code itself (a technique front-end developers call **dehydration**: the server embeds the initial state into the page). Open the search drawer (Ctrl+Shift+F) in Devtools and search by company name (company names are unlikely to be affected by translation libraries, so they're easily searchable):

![search_result.png](img/jobhopin/3.png)

Nothing's found, so the data is probably coming from an external API request; we need to investigate the Network tab more thoroughly (_tip: filter requests by __XHR___):

![portal.png](img/jobhopin/4.png)

As guessed, the info is easily retrievable by making `GET` requests to admin.jobhop.vn/api/public/jobs; open another browser tab and paste in
[this link](https://admin.jobhop.vn/api/public/jobs/?cities=79&industries=&levels=&jobTypes=&salaryMin=0&page=1&pageSize=10&ordering=)

![img_1.png](img/jobhopin/6.png)

Now we can easily get what we need:
```python
import requests

url = 'https://admin.jobhop.vn/api/public/jobs/?cities=79&format=json&industries=&jobTypes=&levels=&ordering=&page=1&pageSize=10&salaryMin=0'

print(requests.get(url).json()['data'])
```

**One more thing: the salary**

If we're not logged in, the API will not return `salary` information; `salaryMin` & `salaryMax` show as `null`, as in the image above.

Log in on the web page and capture the Network request again; this time the salary info is returned by the API:

![salary.png](img/jobhopin/7.png)

Comparing with the previous non-logged-in request, we see that this time the request header includes a **Bearer token** (see the OAuth 2.0 authorization [specification](https://tools.ietf.org/html/rfc6750)):

![bearer.png](img/jobhopin/8.png)

If we now use Postman to send the `GET` request with this token attached, we can retrieve the salary info just like normal; or via code:
```python
import requests

url = 'https://admin.jobhop.vn/api/public/jobs/?cities=79&format=json&industries=&jobTypes=&levels=&ordering=&page=1&pageSize=10&salaryMin=0'
token = 'eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJqdGkiOjE2MTQwMDU2NTgsInN1YiI6IjEzMGM0ZWNlLWI4NWItNGQzZC04Y2M0LTJjZjMzODVhMTVjMCIsImlhdCI6MTYxMzQ2MjA1OSwiZXhwIjoxNjIyMTAyMDU5fQ.mOicukGrkSTyHb1O1Dj10Wj3dKhOOw7WaO5zUV4faPM'

jobs = requests.get(url, headers={
    'Authorization': f'Bearer {token}'
}).json()['data']['collection']

# keep only jobs that actually expose a salary
result = filter(lambda v: v['salaryMin'] is not None, jobs)
print(*result)
```

The access token has an expiry date (normally a month or two), so if you're fine with manually "refreshing" the token every once in a while, we're basically done; if not, read on.

**How to get the access token?**

In the example above, I logged in with my Google account, so the token came from Google's OAuth 2.0 [service](https://developers.google.com/identity/protocols/oauth2/openid-connect#sendauthrequest); for simplicity's sake, we're going to retrieve the access token from Jobhopin's own authorization service instead.

Register a Jobhopin account, navigate to their login page, open the Network tab & log in again:

![img.png](img/jobhopin/9.png)

We can see that the token is returned from their server at the endpoint `/account/api/v1/login/` if we include the correct credentials in the request body:
```python
import requests

token = requests.post(
    'https://admin.jobhop.vn/account/api/v1/login/',
    json={'usernameOrEmail': '[your email]', 'password': '[your password]', 'role': 'ROLE_JOBSEEKER'},
).json()['data']['accessToken']

url = 'https://admin.jobhop.vn/api/public/jobs/?cities=79&format=json&industries=&jobTypes=&levels=&ordering=&page=1&pageSize=10&salaryMin=0'

jobs = requests.get(url, headers={
    'Authorization': f'Bearer {token}'
}).json()['data']['collection']

# keep only jobs that actually expose a salary
result = filter(lambda v: v['salaryMin'] is not None, jobs)
print(*result)
```
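By the way, you don't have to guess when the token expires: it is a standard JWT, so the expiry can be read locally by base64-decoding the payload. A small sketch (assumes a plain, unencrypted JWT whose payload carries an `exp` claim, which is what Jobhopin's token shows):

```python
import base64
import json
import time

def jwt_exp(token: str) -> int:
    """Return the `exp` claim (unix seconds) from an unencrypted JWT."""
    payload = token.split('.')[1]
    payload += '=' * (-len(payload) % 4)  # restore the base64 padding stripped by the JWT format
    return json.loads(base64.urlsafe_b64decode(payload))['exp']

token = 'eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJqdGkiOjE2MTQwMDU2NTgsInN1YiI6IjEzMGM0ZWNlLWI4NWItNGQzZC04Y2M0LTJjZjMzODVhMTVjMCIsImlhdCI6MTYxMzQ2MjA1OSwiZXhwIjoxNjIyMTAyMDU5fQ.mOicukGrkSTyHb1O1Dj10Wj3dKhOOw7WaO5zUV4faPM'
print(jwt_exp(token))                # 1622102059 (late May 2021)
print(jwt_exp(token) < time.time())  # becomes True once the token has expired
```

This way a scraper can decide on each run whether the cached token is still usable or a fresh login is needed.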
---
### Scraping Itviec.com
Target url: https://itviec.com/it-jobs/ho-chi-minh-hcm

![img.png](img/itviec/0.png)

Again, by checking the site from Devtools' _Preview_ tab, we can see that the content stays mostly the same, except for the right-hand side:

![img.png](img/itviec/1.png)

Job details are fetched after the main page loads, and most of what we need lives inside that details page, so we need a way to fetch that data.

This is the HTML structure of a job item from the list:

![img.png](img/itviec/2.png)

Notice the attribute `data-search--job-selection-job-url`; navigating to that [link](https://itviec.com/it-jobs/frontend-engineer-vuejs-reactjs-line-vietnam-5858/content) gives us a raw HTML page with all the details we need.

So, to scrape this page, there are two steps:
1. fetch the main page, parse the HTML, and get the link from the `data-search--job-selection-job-url` attribute
2. fetch the page at that link, parse the HTML, and extract the data
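The two steps above can be sketched like this (the HTML snippet is a made-up stand-in for one job item so the example runs without hitting the network; in the real flow the listing page comes from `requests.get`):

```python
from bs4 import BeautifulSoup

# stand-in for one job item from the listing page (step 1 would fetch the real page)
listing_html = '''
<div class="job" data-search--job-selection-job-url="/it-jobs/frontend-engineer-5858/content">
  <h3>Frontend Engineer</h3>
</div>
'''

soup = BeautifulSoup(listing_html, 'html.parser')
detail_urls = ['https://itviec.com' + tag['data-search--job-selection-job-url']
               for tag in soup.find_all(attrs={'data-search--job-selection-job-url': True})]
print(detail_urls)  # ['https://itviec.com/it-jobs/frontend-engineer-5858/content']

# step 2: fetch each detail page and parse it the same way, e.g.
# detail = BeautifulSoup(requests.get(detail_urls[0]).text, 'html.parser')
```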
Parsing HTML content is very easy in Python with [_BeautifulSoup_](https://pypi.org/project/beautifulsoup4/); check out the code in `scrapper.py` for a working example.

**Getting the salary**

Like the previous website, the salary info is hidden behind a login, so we need to identify which authorization technique is used.

By debugging the login workflow in the Network tab, you'll notice an ID stored in a cookie after the `/sign_in` request:

![img.png](img/itviec/3.png)

That ID is what lets the server know _who_ the client is; without it, the server treats the client as anonymous and does not return the salary information.

By attaching that ID to each request's cookies, you'll _trick_ the server into thinking the request is made by a valid, logged-in user (well, technically it is):

```python
import requests

session = '5j1C3ZA...'  # your session ID here
url = 'https://itviec.com/it-jobs/ho-chi-minh-hcm'
page = requests.get(url, cookies={'_ITViec_session': session})
```

Now you can also scrape the salary range from the returned HTML content.

**Automating stuff**

Just like the bearer token, the session ID has an expiry time, but you can use code to emulate a login; the steps are very similar to the previous example, except this time we need to include a __CSRF token__ (`authenticity_token`) in the login `POST` request. Here's valid form data from the `/sign-in` page:

![img.png](img/itviec/4.png)

This token's purpose is to prevent [_cross-site request forgery_ attacks](https://owasp.org/www-community/attacks/csrf); it's a randomly generated string created by the server upon first request and attached to the HTML page (usually as a hidden input):

![img.png](img/itviec/5.png)

With that, we can now use Python's `requests` package to "automate" logins and retrieve the session ID from the response header.
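If you do want to automate it, the flow looks roughly like this. A sketch only: the form snippet below is invented, and the `user[email]`/`user[password]` field names are illustrative placeholders, so copy the real names from the form data captured in devtools:

```python
from bs4 import BeautifulSoup

# illustrative fragment of the sign-in page; the real page embeds a similar hidden input
signin_html = '''
<form action="/sign_in" method="post">
  <input type="hidden" name="authenticity_token" value="Abc123xyz==">
</form>
'''

csrf = BeautifulSoup(signin_html, 'html.parser').find(
    'input', {'name': 'authenticity_token'})['value']
print(csrf)  # Abc123xyz==

# with a requests.Session, a login could then be emulated like so (field names
# are illustrative; use the ones from the captured form data):
# s = requests.Session()
# s.get('https://itviec.com/sign_in')   # sets the initial session cookie
# s.post('https://itviec.com/sign_in', data={
#     'authenticity_token': csrf,
#     'user[email]': '[your email]',
#     'user[password]': '[your password]',
# })
```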
\nI'm too lazy to include code here (because _itviec_'s session expiry time is actually quite long, and does not expire \nupon logout! no need to write extra codes, lol)\n\n---\n\n### Scraping Vietnamworks.com\n\nThis one is the easiest of the bunch, because they don't store job data at their server; instead, they delegate that task to a 3rd party service called [Algolia search](https://www.algolia.com/products/search/).\n\nSo we don't even need to touch their site to get the contents (the job list is loaded dynamically, `requests` would not work anyway). \nWhat we need is the `app_id` and `api_id` for the Algolia [search client](https://github.com/algolia/algoliasearch-client-python) to connect to their service, to catch those keys, Chrome's devtool is your best friend, but I'm going to simplify your work and write them out:\n```python\nfrom algoliasearch.search_client import SearchClient\nindex = SearchClient.create('JF8Q26WWUD', 'ecef10153e66bbd6d54f08ea005b60fc').init_index('vnw_job_v2')\nsearch_result = index.search(...)\n```\n\n---\n\n## Conclusion\n\nNo website is like another, understanding how it's made is key to scraping it effectively.\n\n`selenium` is (most of the time) overrated when you have basic knowledge about HTML \u0026 common authorization techniques.\n\nThis repo includes a sample Flask server which has two `GET` endpoints: `/itviec` \u0026 `/vietnamwork`, follow the intructions here to get it running.\n\n## Requirement\nPython 3.9\n\n## Install\n`pip install -r requirements.txt`\n\n## Start Server (port 8080)\n`python server.py`\n\n---\n\n### Deploying to cloud\n\nUsing `AWS Lambda` or `AWS Batch` is a good option because the \"scrapper\" is not on a fixed IP address (meaning harder to 
ban)\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftcd93%2Fpython-web-scraping","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftcd93%2Fpython-web-scraping","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftcd93%2Fpython-web-scraping/lists"}