https://github.com/wwwwwydev/crawlist
A universal solution for web crawling lists
- Host: GitHub
- URL: https://github.com/wwwwwydev/crawlist
- Owner: WwwwwyDev
- License: mit
- Created: 2024-04-03T08:49:13.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2024-04-13T14:37:05.000Z (7 months ago)
- Last Synced: 2024-04-14T04:23:31.380Z (7 months ago)
- Topics: crawl, crawler, crawler-python, python, reptile
- Language: Python
- Homepage: https://wwydev.gitbook.io/crawlist/
- Size: 193 KB
- Stars: 23
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
## introduction
crawlist crawls websites that present their data as lists. With a few simple settings, you can collect all of the list data.
## installing
Install crawlist with pip or pip3: `pip install crawlist` or `pip3 install crawlist`
If crawlist is already installed, you may need to upgrade to the latest version:
`pip install --upgrade crawlist`
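To decide whether an upgrade is needed, it can help to check which version is currently installed. The following is a minimal sketch that uses only the Python standard library; the helper name `installed_version` is hypothetical and is not part of crawlist.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "crawlist" is the distribution name used in the pip commands above.
    found = installed_version("crawlist")
    print(found if found is not None else "crawlist is not installed")
```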
## quick start
This is a static website demo. The site does not use JavaScript to load its data.
```python
import crawlist as cl

if __name__ == '__main__':
    # Initialize a pager to implement page flipping
    pager = cl.StaticRedirectPager(uri="https://www.douban.com/doulist/893264/?start=0&sort=seq&playable=0&sub_type=",
                                   uri_split="https://www.douban.com/doulist/893264/?start=%v&sort=seq&playable=0&sub_type=",
                                   start=0,
                                   offset=25)
    # Initialize a selector to select the list element
    selector = cl.CssSelector(pattern=".doulist-item")
    # Initialize an analyzer to achieve linkage between pagers and selectors
    analyzer = cl.AnalyzerPrettify(pager, selector)
    res = []
    limit = 100
    # Iterate over a limited number of results from the analyzer
    for tr in analyzer(limit):
        print(tr)
        res.append(tr)
    # If all the data has been collected, the length of the result will be less than the limit
    print(len(res))
```
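The pager in the static demo takes a `uri_split` template containing `%v`, together with `start=0` and `offset=25`. One plausible reading, stated here as an assumption rather than as crawlist's actual implementation, is that `%v` is replaced by `start`, `start + offset`, `start + 2 * offset`, and so on to produce successive page URLs. The sketch below illustrates that reading in plain Python; `page_urls` is a hypothetical helper and does not use crawlist.

```python
from itertools import islice
from typing import Iterator


def page_urls(uri_split: str, start: int, offset: int) -> Iterator[str]:
    """Yield page URLs by substituting start, start + offset, ... for "%v".

    Assumption: this mirrors how the "%v" placeholder is meant to be read;
    it is an illustration, not code taken from crawlist.
    """
    value = start
    while True:
        yield uri_split.replace("%v", str(value))
        value += offset


if __name__ == "__main__":
    template = "https://www.douban.com/doulist/893264/?start=%v&sort=seq&playable=0&sub_type="
    # Print the first three page URLs implied by start=0 and offset=25.
    for url in islice(page_urls(template, start=0, offset=25), 3):
        print(url)
```

Under this reading, `offset` should match the number of list items the site shows per page, so that consecutive pages neither overlap nor skip entries.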
This is a dynamic website demo. The site uses JavaScript to load its data, so a Selenium webdriver is needed to execute the JavaScript.
```python
import crawlist as cl

if __name__ == '__main__':
    # Initialize a pager to implement page flipping
    pager = cl.DynamicScrollPager(uri="https://ec.ltn.com.tw/list/international")
    # Initialize a selector to select the list element
    selector = cl.CssSelector(pattern="#ec > div.content > section > div.whitecon.boxTitle.boxText > ul > li")
    # Initialize an analyzer to achieve linkage between pagers and selectors
    analyzer = cl.AnalyzerPrettify(pager=pager, selector=selector)
    res = []
    # Iterate over a limited number of results from the analyzer
    for tr in analyzer(100):
        print(tr)
        res.append(tr)
    print(len(res))
    # After completion, you need to close the webdriver, otherwise it will occupy your memory resources
    pager.webdriver.quit()
```
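In the dynamic demo the webdriver is closed only if the loop finishes without an error. A small context manager can make the cleanup unconditional. This is a sketch of a general Python pattern, not something crawlist provides: `quitting` is a hypothetical helper, and `FakeDriver` is a stand-in so that the example runs without Selenium.

```python
from contextlib import contextmanager


@contextmanager
def quitting(driver):
    """Yield `driver` and call its quit() method on exit, even after an error."""
    try:
        yield driver
    finally:
        driver.quit()


class FakeDriver:
    """Stand-in for a Selenium webdriver, used only to demonstrate the pattern."""

    def __init__(self):
        self.closed = False

    def quit(self):
        self.closed = True


if __name__ == "__main__":
    driver = FakeDriver()
    try:
        with quitting(driver):
            raise RuntimeError("simulated crawl failure")
    except RuntimeError:
        pass
    # quit() was still called although the body raised.
    print(driver.closed)
```

Applied to the demo above, the crawl loop would go inside `with quitting(pager.webdriver):`, which replaces the explicit `pager.webdriver.quit()` call at the end.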
## Documenting
If you would like more detailed documentation, see the links below.
[中文 (Chinese)](https://wwydev.gitbook.io/crawlist-zh/ "Chinese documentation") | [English](https://wwydev.gitbook.io/crawlist "English documentation")
## Contributing
Please submit pull requests to the develop branch.