{"id":25324127,"url":"https://github.com/adhardy/multi-webbing","last_synced_at":"2025-07-19T13:39:09.638Z","repository":{"id":57444056,"uuid":"348503165","full_name":"adhardy/Multi-Webbing","owner":"adhardy","description":"A multi-threaded libary for web scraping in python.","archived":false,"fork":false,"pushed_at":"2021-03-25T15:52:18.000Z","size":32,"stargazers_count":3,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-11-17T03:14:03.292Z","etag":null,"topics":["multithreading","requests","selenium","web-scraping"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/adhardy.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-03-16T21:58:26.000Z","updated_at":"2023-04-27T05:08:19.000Z","dependencies_parsed_at":"2022-09-10T18:50:59.473Z","dependency_job_id":null,"html_url":"https://github.com/adhardy/Multi-Webbing","commit_stats":null,"previous_names":["adhardy/multi-webbing"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/adhardy%2FMulti-Webbing","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/adhardy%2FMulti-Webbing/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/adhardy%2FMulti-Webbing/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/adhardy%2FMulti-Webbing/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/adhardy","download_url":"https://codeload.github.com/adhardy/Multi-Webbing/tar.gz/refs/heads/main","host
":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":238759654,"owners_count":19525873,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["multithreading","requests","selenium","web-scraping"],"created_at":"2025-02-14T00:56:41.634Z","updated_at":"2025-02-14T00:56:42.180Z","avatar_url":"https://github.com/adhardy.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Multi-Webbing\nA multi-threaded library for web scraping in Python, built on the Python threading module. Supports using requests and selenium for making web requests.\n\n## Set Up\n\n1. Install the module with pip\n\n        pip install multi_webbing\n\n2. Import the module into your Python file\n\n        from multi_webbing import multi_webbing as mw\n\n3. Set the number of threads and create a MultiWebbing object. By default this will use the requests module, but this can be changed to selenium by passing the web_module=\"selenium\" option to MultiWebbing.\n\n        num_threads = 4\n        my_threads = mw.MultiWebbing(num_threads) #initialize threading\n        \n4. Start the threads. The threads will now continuously check the job queue for work.\n\n        my_threads.start()\n\n5. To put a job in the queue, call the job_queue.put() method of the MultiWebbing object.\n\n        my_threads.job_queue.put(mw.Job(job_id, job_function, url, [job_data, job_type]))\n\n6. 
When you are ready, stop the threads\n\n        my_threads.finish()\n\nYou might find it useful to check the size of the queue in a loop before calling finish:\n       \n        while my_threads.job_queue.qsize() \u003e 0:\n                pass\n        my_threads.finish()\n\n## Job Function\n\nWhen creating a job, you need to pass a job function that the thread will call to do some work.\n\nA job has 3 required arguments and 2 optional ones:\n\n### Required Arguments\n\n1. url\n\nThe URL of the webpage to be worked on.\n\n2. job_function\n\nThe function the thread should call when it picks the job out of the queue. See [Job Function](#Job-Function).\n\n3. custom_data\n\nAn argument that can hold anything you need to access inside the job function.\n\n### Optional Arguments\n\n4. session\n\nA requests.Session object. If this is not set, the job will use the session set when the MultiWebbing object was instantiated.\n\n5. lock\n\nA threading.Lock object. If this is not set, the job will use the lock set when the MultiWebbing object was instantiated.\n                \n## Returning Data From Threads\n\nIt is not possible to directly return data from a thread to the main thread using the \"return\" statement.\n\nInstead you should create a list or dictionary in the main thread, then put this in the custom_data argument of the job. You can then use       \n        \n        dictionary.update() \n        \nor \n\n        list.append()\n        \nin the job function. The main thread will be able to access the updated/appended data. A note: while the update and append functions are thread safe, some other functions are not (e.g. 
json.dumps()) and you may need to wrap them in a lock to prevent a race condition.\n\nMultiple variables and data structures can be accessed in the job by placing them in a list and passing that list as custom_data.\n\n\n## Job Function\n\nThe job function will be called from a thread when it gets a job from the queue.\n\nAn example using the requests module:\n\n    def job_function(job):\n\n        job_data = job.custom_data[0] #in this example, a dictionary which contains the data processed from scraping\n        job_type = job.custom_data[1] #in this example, a string\n        \n        get_url_success = job.get_url() #get the URL\n        if get_url_success: #check the request connected\n            if job.request.status_code == 200: #check that the URL was received OK\n                job.lock.acquire() #update/append are thread safe but other operations elsewhere (e.g. json.dumps) might not be\n                if job_type == \"jobtype1\": #do something\n                    job_data.update({\"key1\":\"val3\", \"key2\":\"val4\"})\n                if job_type == \"jobtype2\": #do something different\n                    job_data.update({\"key1\":\"val3\", \"key2\":\"val4\"})\n                job.lock.release()\n\nUsing requests, you can access the request object through job.request. For example, to obtain the text attribute from the visited page:\n\n        text = job.request.text \n\nUsing selenium, you can access the webdriver through job.driver, for example:\n\n        element = job.driver.find_element_by_xpath('xpath_string')\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fadhardy%2Fmulti-webbing","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fadhardy%2Fmulti-webbing","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fadhardy%2Fmulti-webbing/lists"}