https://github.com/wwwwwydev/crawlist

A universal solution for web crawling lists

Host: GitHub
URL: https://github.com/wwwwwydev/crawlist
Owner: WwwwwyDev
License: mit
Created: 2024-04-03T08:49:13.000Z (11 months ago)
Default Branch: main
Last Pushed: 2024-04-13T14:37:05.000Z (10 months ago)
Last Synced: 2024-04-14T04:23:31.380Z (10 months ago)
Topics: crawl, crawler, crawler-python, python, reptile
Language: Python
Homepage: https://wwydev.gitbook.io/crawlist/
Size: 193 KB
Stars: 23
Watchers: 1
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# crawlist

A universal solution for web crawling lists

## Introduction

crawlist crawls websites that present their content as lists. With a small amount of configuration, you can collect every item in a list.

## Installing

Install crawlist with pip or pip3:

`pip install crawlist` or `pip3 install crawlist`

If crawlist is already installed, you may need to upgrade to the latest version:

`pip install --upgrade crawlist`

## Quick start

The first demo crawls a static website, one that does not use JavaScript to load its data.

```python
import crawlist as cl

if __name__ == '__main__':
    # Initialize a pager to implement page flipping
    pager = cl.StaticRedirectPager(uri="https://www.douban.com/doulist/893264/?start=0&sort=seq&playable=0&sub_type=",
                                   uri_split="https://www.douban.com/doulist/893264/?start=%v&sort=seq&playable=0&sub_type=",
                                   start=0,
                                   offset=25)

    # Initialize a selector to select the list elements
    selector = cl.CssSelector(pattern=".doulist-item")

    # Initialize an analyzer to link the pager and the selector
    analyzer = cl.AnalyzerPrettify(pager, selector)
    res = []
    limit = 100
    # Iterate over up to `limit` results from the analyzer
    for tr in analyzer(limit):
        print(tr)
        res.append(tr)
    # If all the data has been collected, the length of the result will be less than the limit
    print(len(res))
```
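
The `uri_split`, `start`, and `offset` arguments in the demo suggest how the pager moves from page to page: the `%v` placeholder is presumably filled with `start`, then `start + offset`, and so on. The sketch below only illustrates that reading of the parameters. `page_urls` is a hypothetical helper, not part of crawlist, and the library's actual implementation may differ.

```python
def page_urls(template: str, start: int, offset: int, pages: int) -> list[str]:
    """Return the first `pages` URLs obtained by filling `%v` in `template`.

    Hypothetical helper for illustration only; it is not a crawlist API.
    """
    return [template.replace("%v", str(start + i * offset)) for i in range(pages)]


template = "https://www.douban.com/doulist/893264/?start=%v&sort=seq&playable=0&sub_type="
urls = page_urls(template, start=0, offset=25, pages=3)
for url in urls:
    print(url)
```

Under this assumption, an `offset` of 25 matches a site that shows 25 list items per page.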

The second demo crawls a dynamic website, one that uses JavaScript to load its data, so a Selenium webdriver is needed to run the JavaScript.

```python
import crawlist as cl

if __name__ == '__main__':
    # Initialize a pager to implement page flipping
    pager = cl.DynamicScrollPager(uri="https://ec.ltn.com.tw/list/international")

    # Initialize a selector to select the list elements
    selector = cl.CssSelector(pattern="#ec > div.content > section > div.whitecon.boxTitle.boxText > ul > li")

    # Initialize an analyzer to link the pager and the selector
    analyzer = cl.AnalyzerPrettify(pager=pager, selector=selector)
    res = []

    # Iterate over up to 100 results from the analyzer
    for tr in analyzer(100):
        print(tr)
        res.append(tr)
    print(len(res))
    # When finished, close the webdriver; otherwise it keeps occupying memory
    pager.webdriver.quit()
```
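
If the crawl raises an exception partway through, the final `pager.webdriver.quit()` in the demo is never reached. One way to guard against that is a small context manager. The sketch below assumes only what the demo shows, namely that the pager exposes a `webdriver` attribute with a `quit()` method. `quitting`, `FakeDriver`, and `FakePager` are hypothetical names used so the example runs without crawlist or a browser.

```python
from contextlib import contextmanager


@contextmanager
def quitting(pager):
    """Yield `pager`, then call `pager.webdriver.quit()` even if an error occurs."""
    try:
        yield pager
    finally:
        pager.webdriver.quit()


# Stand-ins so the sketch runs without crawlist or a browser installed.
class FakeDriver:
    def __init__(self):
        self.closed = False

    def quit(self):
        self.closed = True


class FakePager:
    def __init__(self):
        self.webdriver = FakeDriver()


pager = FakePager()
try:
    with quitting(pager):
        raise RuntimeError("simulated crawl failure")
except RuntimeError:
    pass

# Reports whether quit() ran despite the simulated failure.
print(pager.webdriver.closed)
```

With the real library, the same helper could wrap the `cl.DynamicScrollPager` from the demo, with the analyzer loop placed inside the `with` block.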

## Documentation

For more detailed documentation, follow one of the links below.

[中文 (Chinese)](https://wwydev.gitbook.io/crawlist-zh/ "Chinese documentation") | [English](https://wwydev.gitbook.io/crawlist "English documentation")

## Contributing

Please submit pull requests to the `develop` branch.