{"id":22497177,"url":"https://github.com/chaudharypraveen98/stackoverflowscraper","last_synced_at":"2026-04-30T17:31:27.856Z","repository":{"id":48857029,"uuid":"282897612","full_name":"chaudharypraveen98/StackOverflowScraper","owner":"chaudharypraveen98","description":"This project aims to scraps questions depending on the fields, no of pages and question size. It makes a file with required tag. You can edit the filename easily.","archived":false,"fork":false,"pushed_at":"2023-05-23T00:11:28.000Z","size":216,"stargazers_count":0,"open_issues_count":1,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-03-27T21:33:24.539Z","etag":null,"topics":["pandas","requests","requests-html","scraping","scripting"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/chaudharypraveen98.png","metadata":{"files":{"readme":"readme.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-07-27T12:53:39.000Z","updated_at":"2024-08-01T19:09:55.000Z","dependencies_parsed_at":"2025-03-27T21:32:06.856Z","dependency_job_id":"4a93aa38-dd0c-4c46-a040-ec67bd71b37d","html_url":"https://github.com/chaudharypraveen98/StackOverflowScraper","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/chaudharypraveen98/StackOverflowScraper","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chaudharypraveen98%2FStackOverflowScraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chaudharypraveen98%2FStackOverflowScraper/t
ags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chaudharypraveen98%2FStackOverflowScraper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chaudharypraveen98%2FStackOverflowScraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/chaudharypraveen98","download_url":"https://codeload.github.com/chaudharypraveen98/StackOverflowScraper/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chaudharypraveen98%2FStackOverflowScraper/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32472396,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-30T13:12:12.517Z","status":"ssl_error","status_checked_at":"2026-04-30T13:12:06.837Z","response_time":57,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["pandas","requests","requests-html","scraping","scripting"],"created_at":"2024-12-06T20:17:08.342Z","updated_at":"2026-04-30T17:31:27.841Z","avatar_url":"https://github.com/chaudharypraveen98.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"## **Stack Overflow Question Scraper**\nThis scraper collects questions from Stack Overflow according to the sort order (votes, newest or active), the number of questions per page, the number of pages to search, and the field in which you want to search the 
question (topic).\n\n##### Level: Beginner\n\n\u003ch3\u003eTopics -\u003e requests, requests-html, pandas, scraping, scripting, StackOverflowScraper\u003c/h3\u003e\n\u003ch5\u003ePreview Link -\u003e \u003cu\u003e\u003ca href=\"https://drive.google.com/file/d/1HATnEAczmo3SlCD1ctq8nfH_QU0Mw-xc/preview\"\u003eStackOverflowScraper\u003c/a\u003e\u003c/u\u003e\u003c/h5\u003e\n\u003ch5\u003eSource Code Link -\u003e \u003cu\u003e\u003ca href=\"https://github.com/chaudharypraveen98/StackOverflowScraper\"\u003eGitHub\u003c/a\u003e\u003c/u\u003e\u003c/h5\u003e\n\u003ciframe src=\"https://drive.google.com/file/d/1HATnEAczmo3SlCD1ctq8nfH_QU0Mw-xc/preview\" width=\"640\" height=\"480\" allow=\"autoplay\"\u003e\u003c/iframe\u003e\n\u003cstrong\u003eWhat are we going to do?\u003c/strong\u003e\n\u003col\u003e\n    \u003cli\u003eFirst, we make a request to fetch the HTML page using the requests library.\u003c/li\u003e\n    \u003cli\u003eIf the response is OK, we feed it into the HTML parser from requests-html.\u003c/li\u003e\n    \u003cli\u003eWe then use selectors to get the required fields, such as the question title, tags, votes and answered status.\u003c/li\u003e\n\u003c/ol\u003e\n## Libraries Required\n\u003col\u003e\n    \u003cli\u003erequests-html\u003c/li\u003e\n    \u003cli\u003epandas\u003c/li\u003e\n    \u003cli\u003erequests\u003c/li\u003e\n\u003c/ol\u003e\n## Prerequisites  \n\n\u003cstrong\u003eWhat are selectors/locators?\u003c/strong\u003e\nA CSS selector is a combination of an element selector and a value that identifies a web element within a web page.\n\n\u003cb\u003eThe choice of locator depends largely on your Application Under Test.\u003c/b\u003e\n\n\n\u003cb\u003eId\u003c/b\u003e\nAn element’s id in XPath is defined using “[@id='example']” and in CSS using “#”. IDs must be unique within the DOM.\nExamples:\n```\nXPath: //div[@id='example']\nCSS: #example\n```\n\n\u003cb\u003eElement Type\u003c/b\u003e\nThe previous example showed //div in the XPath. 
That is the element type, which could be input for a text box or button, img for an image, or \"a\" for a link. \n\n```\nXPath: //input\nCSS: input\n```\n\n\u003cb\u003eDirect Child\u003c/b\u003e\nHTML pages are structured like XML, with children nested inside of parents. If you can locate, for example, the first link within a div, you can construct a string to reach it. A direct child in XPath is defined by the use of a “/”, while in CSS it’s defined using “\u003e”. \nExamples:\n```\nXPath: //div/a\nCSS: div \u003e a\n```\n\n\u003cb\u003eChild or Sub-Child\u003c/b\u003e\nWriting nested divs can get tiring and results in code that is brittle. Sometimes you expect the code to change, or want to skip layers. If an element could be inside another or one of its children, it’s defined in XPath using “//” and in CSS just by a whitespace.\nExamples:\n```\nXPath: //div//a\nCSS: div a\n```\n\n\u003cb\u003eClass\u003c/b\u003e\n\nFor classes, things are pretty similar. In XPath it’s “[@class='example']”, while in CSS it’s just “.”.\nExamples:\n```\nXPath: //div[@class='example']\nCSS: .example\n```\n\n## Understanding the code\n## Requesting the HTML webpage  \n\nWe will use the requests library to fetch the HTML code.\n```\nimport requests\n\ndef extract_from_url(url):\n    r = requests.get(url)\n    # any status outside the 2xx range is treated as an error\n    if r.status_code not in range(200, 300):\n        print(\"error\")\n        return []  # empty list, so the caller can still extend its results\n```\n\u003cb\u003er.status_code\u003c/b\u003e holds the response status code; anything outside the 2xx range is treated as an error. 
If it is valid, we proceed to the next part.\n## Parsing the HTML code using HTML from requests-html\n```\n    # continues extract_from_url; needs `from requests_html import HTML` at the top of the file\n    html_text = r.text\n    formatted_html = HTML(html=html_text)\n```\n\n## Scraping using the parsed HTML code  \n```\n    # still inside extract_from_url: each .question-summary element is one question\n    data_summary = formatted_html.find(\".question-summary\")\n    final_data = []\n    for question in data_summary:\n        question_votes = question.find('.vote-count-post', first=True).text\n        question_data = question.find('.question-hyperlink', first=True).text\n        question_tags = question.find('.tags', first=True).text\n        data = {}\n        data[\"question\"] = question_data\n        data[\"votes\"] = question_votes\n        data[\"tags\"] = question_tags\n        final_data.append(data)\n    return final_data\n```\nFirst, we find the question containers that hold all the information, using the class CSS selector (.question-summary).\nThen we loop through all the question containers. We can easily extract the other details using CSS selectors such as:\n\u003cul\u003e\n    \u003cli\u003e('.vote-count-post') selector for votes\u003c/li\u003e\n    \u003cli\u003e('.question-hyperlink') selector for the question title (the text of the question link)\u003c/li\u003e\n    \u003cli\u003e('.tags') selector for getting all the tags for the question\u003c/li\u003e\n\u003c/ul\u003e\n\n## Starting Scraper and Saving data into CSV format\n\n```\nimport pandas as pd\n\ndef scrape_stack(tag=\"python\", page=1, pagesize=\"20\", sortby=\"votes\"):\n    base_url = \"https://stackoverflow.com/questions/tagged/\"\n    all_page_data = []\n    # iterate through each page\n    for i in range(1, page + 1):\n        url = f\"{base_url}{tag}?tab={sortby}\u0026page={i}\u0026pagesize={pagesize}\"\n        all_page_data += extract_from_url(url)\n    df = pd.DataFrame(all_page_data)\n    df.to_csv(f\"{tag}.csv\", index=False)\n```\n\nTo scrape Stack Overflow questions, we have 4 keyword arguments:\n`scrape_stack(tag=\"python\", page=1, pagesize=\"20\", sortby=\"votes\")`\nwhere\n\u003col\u003e\n    
\u003cli\u003e\u003cb\u003etag\u003c/b\u003e : The field you want to search, such as c, javascript or html.\u003c/li\u003e\n    \u003cli\u003e\u003cb\u003epage\u003c/b\u003e : How many pages you want to search.\u003c/li\u003e\n    \u003cli\u003e\u003cb\u003epagesize\u003c/b\u003e : How many questions (threads) each page contains.\u003c/li\u003e\n    \u003cli\u003e\u003cb\u003esortby\u003c/b\u003e : The sort order for the questions: votes, newest, active or unanswered.\u003c/li\u003e\n\u003c/ol\u003e\nIf arguments are passed, we build the URL from them; otherwise the default arguments are used.\nOnce the scraping is done, we load the data into a pandas DataFrame, which can then be exported to a .csv file.\n\n## How to set up/run on a local machine\n\n\u003col\u003e\n\n\u003cli\u003eFirst, clone the repo with the following command: `git clone https://github.com/chaudharypraveen98/StackOverflowScraper.git`\u003c/li\u003e\n\n\u003cli\u003eThen install all the required dependencies with the following command: `pip3 install -r requirements.txt`\u003c/li\u003e\n\n\u003cli\u003eRun the file in Python interactive mode. Now you are ready to go. 
To scrape Stack Overflow questions, type:\n`scrape_stack(tag=\"python\", page=1, pagesize=\"20\", sortby=\"votes\")`\u003c/li\u003e\n\n\u003c/ol\u003e\n\n## Deployment\n\nFor deployment, we use \u003cstrong\u003eRepl\u003c/strong\u003e or \u003cstrong\u003eHeroku\u003c/strong\u003e to deploy our localhost to the web. \u003cspan\u003e\u003ca href=\"https://replit.com/\"\u003eFor More Info\u003c/a\u003e\u003c/span\u003e\n## Web Preview / Output\n\u003ca href=\"questions.JPG\"\u003e\u003cimg src=\"questions.JPG\" alt=\"web preview\" /\u003e\u003c/a\u003e\n\n\u003cspan\u003eWeb preview on deployment\u003c/span\u003e\n\n\u003cspan\u003ePlaceholder text by \u003ca href=\"https://chaudharypraveen98.github.io/\"\u003ePraveen Chaudhary\u003c/a\u003e \u0026middot; Images by \u003ca href=\"https://chaudharypraveen98.github.io/binarybeast/\"\u003eBinary Beast\u003c/a\u003e\u003c/span\u003e\n\n\n_**Note**_: Contributions are welcome. By default, the output is a CSV file named after the tag you used for scraping.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchaudharypraveen98%2Fstackoverflowscraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fchaudharypraveen98%2Fstackoverflowscraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchaudharypraveen98%2Fstackoverflowscraper/lists"}