https://github.com/lincvv/djparsing
- Host: GitHub
- URL: https://github.com/lincvv/djparsing
- Owner: lincvv
- License: mit
- Created: 2017-10-06T13:14:09.000Z (about 7 years ago)
- Default Branch: master
- Last Pushed: 2024-07-10T08:42:03.000Z (4 months ago)
- Last Synced: 2024-10-22T16:27:02.227Z (24 days ago)
- Topics: django, parsing, webdevelopment, website
- Language: Python
- Size: 7.13 MB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 23
Metadata Files:
- Readme: README.md
- License: LICENSE
djparsing
===========
##### Convenient parsing that can be used as a Django application or independently
Parses the first data block on a page (the newest by date) and saves it to the specified table.

Requirements
-----------
* python (3.4, 3.5, 3.6, 3.7)
* django (1.8, 1.9, 1.10, 1.11)
* lxml (4.1.1)
* cssselect (1.0.1)
* Pillow

Quick start
-----------
##### Install:
pip install djparsing
##### Using:
```python
from django.db import models
from tinymce.models import HTMLField  # assumption: HTMLField comes from django-tinymce


class MyModel(models.Model):
    title = models.CharField(max_length=256)
    text = HTMLField(blank=True)
    source = models.URLField(max_length=255, blank=True)
    create_date = models.DateTimeField(auto_now_add=True)
    img = models.ImageField(blank=True, null=True)
    flag = models.BooleanField(default=False)
```
```python
from djparsing.core import Parser, init
from djparsing import data


@init(model='MyModel', app='my_app')
class MyParserClass(Parser):
    body = data.BodyCSSSelect()
    text = data.TextContentCSSSelect()
    source = data.AttrCSSSelect(attr_data='href')  # or pass it as the first argument: AttrCSSSelect('href')
    title = data.TextCSSSelect()
    img = data.ImgCSSSelect('src')  # 'src' is the default, so ImgCSSSelect() also works

    class Meta:
        coincidence = ['Python', 'Django', 'Питон', 'ML']  # words the data must match to be kept
        field_coincidence = 'title'  # field that the word list is checked against


pars_obj = MyParserClass(
    body='.content-list__item',
    text='.post__body_crop > .post__text',
    source='a.post__title_link',
    title='a.post__title_link',
    img='.post__body_crop > .post__text img',
    url='http://site/'
)
pars_obj.run()
```
Note: the model for saving data can also be specified in the Meta class:
```python
class Meta:
    model = MyModel  # the @init decorator is not needed
```
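The `coincidence` and `field_coincidence` options in the quick-start `Meta` class describe a keyword filter on one field. The README does not say how djparsing matches the words, so the following is only a minimal standalone sketch, assuming a case-insensitive substring match; `matches_keywords` and the item dictionaries are hypothetical and not part of djparsing.

```python
def matches_keywords(item, field, keywords):
    """Return True if item[field] contains at least one keyword (case-insensitive)."""
    value = str(item.get(field, "")).casefold()
    return any(word.casefold() in value for word in keywords)


# Hypothetical parsed items; the dict layout is an assumption for illustration.
items = [
    {"title": "Django 1.11 released"},
    {"title": "Weekly cooking tips"},
    {"title": "Введение в Питон"},
]
keywords = ["Python", "Django", "Питон", "ML"]

# Keep only the items whose title mentions one of the keywords.
kept = [item for item in items if matches_keywords(item, "title", keywords)]
print([item["title"] for item in kept])
```

A substring match is simple but permissive: a short keyword such as `ML` would also match a title containing `HTML`, so a word-boundary check may be a better fit in practice.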
Inheritance:
```python
class MyChildParserClass(MyParserClass):
    my_field = data.TextCSSSelect()
```
Note: fields from the base class and its Meta class are inherited, and you can override them in the child class.
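As a rough illustration of what this inheritance means in plain Python (not djparsing's actual implementation), the sketch below uses a hypothetical `Field` stand-in and a hypothetical helper `declared_fields` to show how a child class sees both its own field and those of its base, and how a nested `Meta` is found through normal attribute lookup.

```python
class Field:
    """Stand-in for a declarative field such as TextCSSSelect (hypothetical)."""


class Base:
    title = Field()
    text = Field()

    class Meta:
        field_coincidence = "title"


class Child(Base):
    my_field = Field()  # new field; title and text come from Base


def declared_fields(cls):
    """Collect the names of Field attributes from the class and all of its bases."""
    names = []
    for klass in reversed(cls.__mro__):  # visit base classes first
        for name, value in vars(klass).items():
            if isinstance(value, Field) and name not in names:
                names.append(name)
    return names


print(declared_fields(Child))        # ['title', 'text', 'my_field']
print(Child.Meta.field_coincidence)  # title (inherited from Base.Meta)
```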
If you need to set an additional field in the database:
```python
pars_obj.add_field['flag'] = True
pars_obj.run()  # to print the data to the log instead of saving it to the database,
                # pass log=True, i.e. run(log=True), and override the method log_output(self, result)
```
Example:
```python
@init(model='MyModel', app='my_app')
class MyParserClass(Parser):
    body = data.BodyCSSSelect()
    text = data.TextContentCSSSelect()

    def log_output(self, result):  # if you do not override this method, the result is printed to the terminal
        pass  # work further with the result here
```
If you do not want to write data to the database or output it to the log, use:
```python
data = pars_obj.run(create=False)
```
Note: create=False is also required when you are not working with Django and a database.

Attributes
=========
##### start_url
```python
# CSS selector for the link that leads to the page with the data block.
# This is needed when the list of objects is on one page and the data itself is on another page.
BodyCSSSelect(start_url='div.description.float-right > a')
```
Note: the URL must be in the element's href attribute.

##### add_domain
```python
# if the URL in the attribute does not include a domain,
# set add_domain=True (the default is False)
BodyCSSSelect(start_url='div.description.float-right > a', add_domain=True)
```

##### save_start_url
When you need to save additional data in a field, such as the start URLs of objects, add an ExtraDataField field with save_start_url=True.

##### body_count
The number of objects to parse.

Example:
```python
class MyParserClass(Parser):
    start = BodyCSSSelect(start_url='ul.quest-tiles > li.quest-tile-1 > div.item-box > div.item-box-desc h4 a',
                          add_domain=True,
                          body_count=4)
    source = ExtraDataField(save_start_url=True)
```
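To make the effect of `add_domain` and `body_count` more concrete, here is a minimal standalone sketch using only the standard library. It assumes that `add_domain` amounts to resolving relative `href` values against the page URL and that `body_count` keeps only the first N links; `resolve_links` is a hypothetical helper, not a djparsing function, and the library's real behavior may differ.

```python
from urllib.parse import urljoin


def resolve_links(page_url, hrefs, add_domain=False, body_count=None):
    """Optionally make hrefs absolute against page_url, then keep the first body_count."""
    if add_domain:
        hrefs = [urljoin(page_url, href) for href in hrefs]
    # Slicing with None keeps every link, so body_count is optional.
    return list(hrefs[:body_count])


links = resolve_links(
    "http://site/list/",
    ["/post/1", "/post/2", "/post/3"],
    add_domain=True,
    body_count=2,
)
print(links)  # ['http://site/post/1', 'http://site/post/2']
```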
It runs on this [site](http://pythoff.com/), and all of it is published on the [channel](https://telegram.me/python_all)
-----