https://github.com/karpetrosyan/scrapyio
https://github.com/karpetrosyan/scrapyio
Last synced: 3 months ago
JSON representation
- Host: GitHub
- URL: https://github.com/karpetrosyan/scrapyio
- Owner: karpetrosyan
- License: mit
- Created: 2023-03-30T14:50:41.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2023-05-31T13:42:46.000Z (about 2 years ago)
- Last Synced: 2025-03-10T18:16:20.279Z (3 months ago)
- Language: Python
- Size: 132 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
Scrapyio is a web scraping framework that is asynchronous, fast, and easy to use.
---
Install `scrapyio` using pip:
```shell
$ pip install scrapyio
```Let's begin with the most basic examples.
```python
from scrapyio import BaseSpiderclass Spider(BaseSpider):
start_requests = ["https://scrapyio-example.com"]async def parse(self, response):
print("I have received the response!!", response)
yield None```
First, create your `BaseSpider` subclass and name it (in this case, **Spider**).
You must instruct scrapyio on where to begin the requests, i.e. the urls to be scraped.**Things to keep in mind**
- `start_requests` is a class variable that instructs scrapyio to scrape these urls during the first `iteration`.
- `parse` is an asynchronous generator method with a single argument, `response`, which is an HTTP response object containing the page content to be scraped.
- `iteration` Scrapyio is an asynchronous framework, so if you populate the start requests variable with a thousand URLs, they will be handled concurrently by the Python asyncio framework. So let's call the url sending process an `iteration`.## Requests
As you know, you can pass the url string in start requests to tell `Scrapyio` which content you want to download; however, some websites require additional **headers** to function properly; therefore, we can pass `Request` objects explicitly rather than only strings.
so we can replace
```python
start_requests = ["https://scrapyio-example.com"]
```with
```python
start_requests = [Request(url="https://scrapyio-example.com", method="GET", headers={"custom-header": "test"})]
```If we use the `Request` object directly, we can specify many request-specific parameters such as `proxies`, `authorization`, `headers`, `cookies`, whether to `follow-redirect` or not, and so on.
There is more complex `Request` object.
```python
from scrapyio import Request
start_request = [
Request(
url="https://scrapyio-example.com",
headers={"custom-key": "custom-value"},
proxies={"http": "...", "https": "..."},
cookies={"cookie-key": "cookie-value"},
follow_redirects=True,
verify=False
)
]
```## Parsing
After you've built your `Request` with all of the necessary headers, data, and so on, you should write your parsing logic; every response that is successfully installed will call your spider's parse method, which will return the response that was downloaded.
```python
class Spider(BaseSpider):
start_requests = ["https://www.scrapethissite.com/pages/simple/"]async def parse(self, response):
print("I have received the response!!", response)
yield None
````Scrapyio` is a parser free framework and let's you use any parser you want, in our case, we will use `beautifulsoup` library
```python
class Spider(BaseSpider):
start_requests = ["https://www.scrapethissite.com/pages/simple/"]async def parse(self, response):
soup = BeautifulSoup(response.text, "html.parser")
countries = soup.select(".country")for country in countries:
country_name = country.select_one(".country-name").text.strip()
country_population = int(
country.select_one(".country-population").text.strip()
)
country_capital = country.select_one(".country-capital").text.strip()
country_area = float(country.select_one(".country-area").text.strip())yield Country(
name=country_name,
population=country_population,
capital=country_capital,
area=country_area,
)
```It's natural to be confused if you've never used `beautifulsoup` before, but the parsing process isn't particularly important; you can parse however you want, using python `regex`, parsers, or any other tool.
In this case, we're using Python's **yield** syntax to tell `Scrapyio` which Item to process and possibly save in the future.
So we get a `Country` instance, which is also a `pydantic` subclass.
Scrapyio validates and serializes the data parsed from the website using `Pydantic`.```python
from scrapyio import Item
class Country(Item):
tablename: str = "MyTable"
name: str
population: int
capital: str
area: float
```
**ANDDDDD..... THAT'S ALL**.## Running
Now you can run scrapyio and tell him to save the items as `JSON`, `CSV`, or in any `SQLAlchemy`-supported database.
For this, you can use the scrapyio command line tool.```shell
$ scrapyio run --help
Usage: scrapyio run [OPTIONS] SPIDEROptions:
-j, --json TEXT Json file path
-c, --csv TEXT Csv file path
-s, --sql TEXT SQL URI supported by SQLAlchemy
--help Show this message and exit.
```Let's run `scrapyio` and export data in `JSON` and `CSV` formats.
```shell
$ scrapyio run Spider --json data.json --csv data.csv
```data.json
```json
[
{
"name": "Andorra",
"population": 84000,
"capital": "Andorra la Vella",
"area": 468.0
},
{
"name": "United Arab Emirates",
"population": 4975593,
"capital": "Abu Dhabi",
"area": 82880.0
},
...
]
```data.csv
```csv
name,population,capital,area
Andorra,84000,Andorra la Vella,468.0
United Arab Emirates,4975593,Abu Dhabi,82880.0
Afghanistan,29121286,Kabul,647500.0
Antigua and Barbuda,86754,St. John's,443.0
Anguilla,13254,The Valley,102.0
```Alternatively, we can save the data to a PostgreSQL database.
```shell
$ scrapyio run Spider --sql postgresql+asyncpg://test:[email protected]:5432/postgres
```Then select it in the psql command tool.
```
postgres=# select * from "Mytable";
id | name | population | capital | area
-----+----------------------------------------------+------------+---------------------+----------
1 | Andorra | 84000 | Andorra la Vella | 468
2 | United Arab Emirates | 4975593 | Abu Dhabi | 82880
3 | Afghanistan | 29121286 | Kabul | 647500
4 | Antigua and Barbuda | 86754 | St. John's | 443
5 | Anguilla | 13254 | The Valley | 102
6 | Albania | 2986952 | Tirana | 28748
7 | Armenia | 2968000 | Yerevan | 29800
8 | Angola | 13068161 | Luanda | 1246700
9 | Antarctica | 0 | None | 14000000
```