https://github.com/fagci/parsee
Python tiny site parser
- Host: GitHub
- URL: https://github.com/fagci/parsee
- Owner: fagci
- Created: 2021-04-08T17:17:56.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2021-09-07T20:36:34.000Z (about 3 years ago)
- Last Synced: 2024-07-05T10:43:50.339Z (2 months ago)
- Language: Python
- Size: 51.8 KB
- Stars: 2
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Parsee
A sweet, tiny site parser written in Python.
[new] Now with Cloudflare bypass
[A short write-up in Russian](https://mikhail-yudin.ru/blog/frontend/parsee_prostoi_vkusniy_parser_saitov/)
## Lang syntax
```config
[a@ ] [% ]
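```
Note: `python_code` is evaluated relative to the last matched tag. Start it with `.` (dot) to read an attribute or call a method on that tag. Judging by the examples below, `@` after a selector follows the matched links, and `%` introduces the Python code.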
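To make the query shape more concrete, here is a minimal sketch of how such a string could be split into its parts. It is not Parsee's actual implementation: `split_query` is a hypothetical helper, and it assumes that `@` and `%` never appear inside a selector (for example in an attribute value), which a real parser would have to handle.

```python
def split_query(query):
    """Naively split a Parsee-style query (hypothetical helper, not part of Parsee).

    Returns (selectors_to_follow, final_selector, python_code).
    """
    # Everything after the first '%' is treated as the Python suffix.
    selector_part, sep, code = query.partition('%')
    # Each '@' closes a selector whose matched links should be followed.
    *follow, last = selector_part.split('@')
    follow = [s.strip() for s in follow]
    return follow, last.strip(), (code.strip() if sep else None)


print(split_query('a@a@title%.text'))
print(split_query('a%.get("href")'))
print(split_query('.links a@'))
```

The first query splits into two link-following steps (`a`, `a`), a final selector (`title`) and the code `.text`. The last one has only a link-following step, so the final selector is empty and the code is `None`.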
## Requirements
Make sure the libxml2 and libxslt system libraries that lxml needs are installed (on Termux, run the command without `sudo`):
```
sudo apt-get install libxml2 libxslt
```
Before use, install requirements:
```sh
pip3 install -r requirements.txt
```
## Examples
### Crawl the links on the first page and print the paragraph that follows a heading containing given text
```python
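# `parser` is assumed to be an existing Parsee parser for the start page.
# The trailing '@' follows each matched link and yields the linked page.
for page in parser / '.links a@':
    for p in page / 'h3:-soup-contains("Some title")+p':
        print(p.text)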
```
### Get titles of subpages
```sh
./parser.py http://site.org 'a@a@title%.text'
```
### Get links on the home page
```sh
./parser.py http://site.org 'a%.get("href")'
```
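For comparison, the result of the `a%.get("href")` query can be approximated with the standard library alone. This is only a rough sketch of the same outcome, not how Parsee works internally; `LinkCollector` and `extract_links` are hypothetical names, and the HTML is an inline sample rather than a fetched page.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            # Like .get("href"): yields None when the attribute is missing.
            self.links.append(dict(attrs).get('href'))


def extract_links(html):
    """Return the href of each <a> tag found in the given HTML string."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


sample = (
    '<a href="/about">About</a> '
    '<a name="top">Top</a> '
    '<a href="http://site.org/x">X</a>'
)
print(extract_links(sample))
```

Anchors without an `href` show up as `None`, so filter them out if only real links are wanted.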