Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/hayatiyrtgl/poems_scraper
- Host: GitHub
- URL: https://github.com/hayatiyrtgl/poems_scraper
- Owner: HayatiYrtgl
- Created: 2023-11-20T12:27:16.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2023-11-20T12:29:17.000Z (about 1 year ago)
- Last Synced: 2024-11-05T09:15:55.290Z (3 months ago)
- Topics: beautifulsoup, beautifulsoup4, poetry, python, scraper
- Language: Python
- Homepage:
- Size: 1.9 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# poems_scraper
# Web Scraping Poems

This Python script is designed for web scraping poems and poet information from a poetry website. It uses the BeautifulSoup and requests libraries.
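The snippets below call `re.get(...)` and `beauty(...)`. The README does not show the repository's import lines, but the names are consistent with the following aliases; treat these as an inference from the code, not something the source confirms:

```python
# Inferred imports: `re.get(...)` and `beauty(...)` in the snippets below
# match these (assumed) aliases for requests and BeautifulSoup.
import requests as re
from bs4 import BeautifulSoup as beauty
```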
## ScrapPoems Class
### Initialization
```python
class ScrapPoems:
    def __init__(self):
        # Index page listing the poets
        self.web_address = "https://siir.sitesi.web.tr/sairler.html"
        self.poet_list = []
        self.poem_list = []
```

### Requester Method
```python
def requester(self, address):
    # `re` here is presumably the requests library (e.g. `import requests as re`)
    req = re.get(address)
    status_code = req.status_code
    if status_code == 200:
        print("Connected to", address)
        return req.content
    else:
        # Note: implicitly returns None on non-200 responses
        print(f"{status_code} Error")
```

### Parser Methods
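Both parser methods follow the same BeautifulSoup pattern: find the target `div` elements, then walk their children. A tiny standalone illustration of that pattern, using hypothetical HTML and the stdlib `html.parser` backend in place of `lxml` (which requires a separate install):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment mimicking the site's markup (not taken from the real site)
html = """
<div class="siir">
<a href="/siir/poem-1.html">Poem 1</a>
<a href="/siir/poem-2.html">Poem 2</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
links = []
for div in soup.find_all("div", attrs={"class": "siir"}):
    for a in div.find_all("a"):
        links.append(a.get("href"))

print(links)  # ['/siir/poem-1.html', '/siir/poem-2.html']
```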
#### Link Parser
```python
def parser(self, content, tag=None, object=None):
    # `beauty` is presumably BeautifulSoup (e.g. `from bs4 import BeautifulSoup as beauty`)
    parsed_content = beauty(content, "lxml")
    parsed_content = parsed_content.findAll("div", attrs={"class": "siir"})
    for i in parsed_content:
        # Iterate direct child tags only, skipping bare text nodes
        for x in i.find_all(True, recursive=False):
            object.append(x.get(tag))
```

#### Text Parser
```python
def text_parser(self, content):
    parsed_content = beauty(content, "lxml")
    parsed_content = parsed_content.findAll("div", attrs={"class": "text"})
    for i in parsed_content:
        # Append each poem's text to a single output file
        with open("siirler.txt", "a", encoding="utf-8") as file:
            file.write(i.text)
```

## Usage
### Part 1
```python
c = ScrapPoems()

# Fetch the index page and collect the poet links
s = c.requester(c.web_address)
c.parser(s, tag="href", object=c.poet_list)

# Visit each poet page and collect the poem links
for i in c.poet_list:
    s2 = c.requester(i)
    c.parser(s2, tag="href", object=c.poem_list)

print(c.poet_list, c.poem_list, sep="\n")

# Persist the poem links for Part 2
with open("poems_list.txt", "a", encoding="utf-8") as file:
    file.write(str(c.poem_list))
```

### Part 2
```python
import ast

with open("poems_list.txt", "r", encoding="utf-8") as file:
    f = file.read()

# The file holds a Python-literal list; ast.literal_eval is a safer choice than eval
f = ast.literal_eval(f)

c = ScrapPoems()

# Parse and get text
progress = 0
for link in f:
    progress += 1
    if progress < 0:  # raise this threshold to skip already-scraped links when resuming
        pass
    else:
        s = c.requester(link)
        c.text_parser(s)
        print(f"{progress}/{len(f)}")
```

## Note
- Use this script responsibly and ensure compliance with the terms of service of the website you are scraping.
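One concrete way to honor that note is to consult the site's `robots.txt` before fetching. A minimal sketch using the standard library; the rules below are hypothetical and fed in directly, whereas a real run would load them with `rp.set_url(...)` and `rp.read()`:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules; a real check would fetch https://siir.sitesi.web.tr/robots.txt
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("*", "https://siir.sitesi.web.tr/sairler.html"))   # True
print(rp.can_fetch("*", "https://siir.sitesi.web.tr/private/x.html"))  # False
```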