Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/jmgogo/webscraping-for-business
Webscraping with scrapy and playwright, prospecting local businesses in Doral, Florida for marketing opportunities. 🕷️
- Host: GitHub
- URL: https://github.com/jmgogo/webscraping-for-business
- Owner: jmgogo
- License: mit
- Created: 2024-01-04T03:08:09.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-01-18T18:42:23.000Z (about 1 year ago)
- Last Synced: 2024-11-06T02:19:28.312Z (3 months ago)
- Topics: playwright, scrapy, webscraping
- Language: Python
- Homepage:
- Size: 409 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 8
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Webscraping for Business
Foreword
Webscraping for business with Scrapy and Playwright, with an Ubuntu devcontainer for development in Visual Studio Code.
Report Bug · Request Feature
## Table of Contents
- [Webscraping for Business](#webscraping-for-business)
  - [Table of Contents](#table-of-contents)
  - [About](#about)
    - [Setup](#setup)
    - [Scrapy](#scrapy)
  - [Prerequisites](#prerequisites)
  - [Getting started](#getting-started)
  - [License](#license)

## About
### Setup
This project includes a Dev Container for Visual Studio Code. It serves as a full-featured development environment with everything needed to run the application. Dev Containers separate tools, libraries, or runtimes needed for working with a codebase. They aid in continuous integration and testing. Dev containers can be run locally or remotely, in a private or public cloud, in a variety of supporting tools and editors.
[This devcontainer](.devcontainer) is built in accordance with the [dev containers specification](https://containers.dev/implementors/spec/) and tailored for a build environment running Ubuntu 22.04 and Python. Software libraries are installed according to [requirements.txt](./.devcontainer/requirements.txt). The devcontainer also includes Git for version control and several Visual Studio Code extensions as development utilities.
### Scrapy
Scrapy is the bread and butter of this project. It is an application framework for crawling websites and extracting structured data, which can be used for a wide range of applications such as data mining, information processing, or historical archival. To understand Scrapy, it helps to reference its architecture. For more information, check the [docs](https://docs.scrapy.org/en/latest/topics/architecture.html#spider-middlewares) and the image below.
![Scrapy Architecture](imgs/scrapy_architecture.png)
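For orientation, here is a minimal, hypothetical Scrapy spider. It is not the project's `doral` spider; the spider name, start URL, and CSS selectors are placeholders rather than anything taken from this repository.

```python
import scrapy


class ExampleBusinessSpider(scrapy.Spider):
    """Illustrative sketch only; the name, URL, and selectors are placeholders."""

    name = "example_business"
    start_urls = ["https://example.com/businesses"]  # hypothetical listing page

    def parse(self, response):
        # Yield one item per listing entry (selectors are hypothetical)
        for entry in response.css("div.listing"):
            yield {
                "name": entry.css("h2::text").get(),
                "phone": entry.css("span.phone::text").get(),
            }

        # Follow pagination if a "next" link exists
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

A spider like this would be run from the project directory with `scrapy crawl example_business`.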
## Prerequisites
To start, you need to have Docker Engine and Docker Compose on your machine. You can either:
- Install Docker Desktop, which includes both Docker Engine and Docker Compose
- Install Docker Engine and Docker Compose as standalone binaries

This devcontainer is set up for development in Visual Studio Code. You should have it installed along with the [remote development pack](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.vscode-remote-extensionpack) to enable the IDE's devcontainer functionality.
Create a `credentials.txt` file to hold your Git configuration settings. This file is used by the devcontainer's `postCreateCommand`, as specified in `.devcontainer.json`. The format of the file should look like this:
```sh
User:"YOUR NAME"
Email:"YOUR EMAIL"
```

All dependencies listed in [requirements.txt](./.devcontainer/requirements.txt) are installed automatically during the Docker build.
## Getting started
Open Docker Desktop to start the Docker daemon, the background process that manages and coordinates Docker containers on your system. In VS Code, start the development container by running `Dev Containers: Rebuild and Reopen in Container` from the command palette, which you can open with `Ctrl+Shift+P`. The command shows up as follows:
![Rebuild and Reopen In Container](imgs/rebuildAndReopenInContainer.png)
Doing so will start the build process for the devcontainer. Visual Studio Code will establish a remote connection to the development container, with several common Python extensions installed in the IDE. When a successful connection is established, information about the Ubuntu OS is displayed as shown below:
![Operating System Information](imgs/osInfo.png)
Scrapy doesn't execute JavaScript by default, so [Playwright](https://playwright.dev/python/) is used to prerender pages in an installed browser, giving [Scrapy](https://docs.scrapy.org/en/latest/intro/overview.html) access to content that would otherwise be loaded dynamically via JavaScript.
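A common way to wire Scrapy to Playwright is the `scrapy-playwright` package. The snippet below is a sketch of that integration, not necessarily how this project is configured; whether `scrapy-playwright` is used here is an assumption.

```python
# settings.py -- sketch assuming the scrapy-playwright package is installed
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# In a spider, a request is routed through Playwright (and thus rendered in a
# real browser) by setting the "playwright" flag in the request meta:
#
#     yield scrapy.Request(url, meta={"playwright": True})
```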
When the container launches successfully, you will have a Chromium browser installed for [Playwright](https://playwright.dev/python/). Chromium is a free, open-source browser project led by Google, and several modern browsers - Edge, Chrome, Brave, Opera, Vivaldi, etc. - are built on it. You can also scrape with other browsers by installing them in your container: [Playwright](https://playwright.dev/python/) offers installation for chrome, chrome-beta, msedge, msedge-beta, msedge-dev, firefox, firefox-asan, and webkit. You can run the following commands or add them to the Dockerfile:
``` sh
# Install a subset of available browsers for Playwright
playwright install

# Complete the Playwright configuration by installing all system dependencies
playwright install-deps
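
# Optionally, install a single specific browser instead (standard Playwright CLI), e.g.:
playwright install firefox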
```

This project includes several spiders, found in `./doral/doral/spiders`. Some are for testing; others are still in development. Spiders are run by name; for example, run the following command to verify that the doral spider works:
```sh
scrapy crawl doral
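
# Optionally, export the scraped items to a file with Scrapy's built-in feed exports:
scrapy crawl doral -o results.json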
```

## License
Distributed under the MIT License. See `LICENSE` for more information.