Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/lincvv/djparsing

django parsing webdevelopment website

Last synced: 4 months ago
JSON representation

Host: GitHub
URL: https://github.com/lincvv/djparsing
Owner: lincvv
License: mit
Created: 2017-10-06T13:14:09.000Z (over 7 years ago)
Default Branch: master
Last Pushed: 2024-07-10T08:42:03.000Z (7 months ago)
Last Synced: 2024-10-22T16:27:02.227Z (4 months ago)
Topics: django, parsing, webdevelopment, website
Language: Python
Size: 7.13 MB
Stars: 1
Watchers: 2
Forks: 0
Open Issues: 23
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

        djparsing 

===========

##### Convenient parsing, can be used as an application Django or independently

parser of the first data block (by date this is new) and saving in the specified table

Requirements

-----------

* python (3.4, 3.5, 3.6б 3.7)

* django (1.8, 1.9, 1.10, 1.11)

* lxml (4.1.1)

* cssselect (1.0.1)

* Pillow         

Quick start

-----------

##### Install:

        pip install djparsing

##### Using:

```python

class MyModel(models.Model):

    title = models.CharField(max_length=256)

    text = HTMLField(blank=True)

    source = models.URLField(max_length=255, blank=True)

    create_date = models.DateTimeField(auto_now_add=True)

    img = models.ImageField(blank=True, null=True)

    flag = models.BooleanField(default=False)

    

```

```python

from djparsing.core import Parser, init

from djparsing import data

@init(model='MyModel', app='my_app')

class MyParserClass(Parser):

    body = data.BodyCSSSelect()

    text = data.TextContentCSSSelect()

    source = data.AttrCSSSelect(attr_data='href') #Or set the first argument AttrCSSSelect('href')

    title = data.TextCSSSelect()

    img = data.ImgCSSSelect('src') #The default is src, so the argument is optional. can ImgCSSSelect()

    

    class Meta:

        coincidence = ['Python', 'Django', 'Питон', 'ML'] # a list of words for the condition that the data fit

        field_coincidence = 'title' # field to which a list of words is used

    

pars_obj = MyParserClass(

        body='.content-list__item',

        text='.post__body_crop > .post__text',

        source='a.post__title_link',

        title='a.post__title_link',

        img='.post__body_crop > .post__text img',

        url='http://site/'

        )

pars_obj.run()

```

Note: a model for saving data can be specified in Meta class

```python

class Meta:

    model = MyModel # decorator @init is not needed

```

Inheritance:

```python

class MyChildParserClass(MyParserClass):

    my_field = data.TextCSSSelect()

```

Note: fields from the base class, and also the Meta class is inherited. You can override

    

If you need to install an additional field in the database:

```python

pars_obj.add_field['flag'] = True

pars_obj.run()  #if you do not need to save to the database and print the data to the log, 

                # add the argument log -> run(log=True) and redefine the method log_output(self, result):

```

    Example:

```python

@init(model='MyModel', app='my_app')

class MyParserClass(Parser):

    body = data.BodyCSSSelect()

    text = data.TextContentCSSSelect()

    

    def log_output(self, result): # if you do not override the method, the result will be output to the terminal

        pass # and work further with the result

```

If you do not want to write data to the database or output to the log, use:

```python

data = pars_obj.run(create=False)

```

Note: Also a must create=False, when you are not working with django and base

Attributs

=========

##### start_url

```python

# initialize the path to the URL with the data block.

# This is needed when the list of objects is on the page, and the data is on another page 

BodyCSSSelect(start_url='div.description.float-right > a')

```

Note: in the attribute with the URL should be href

##### add_domain

```python

# if the URL in the attribute does not have a domain

# set add_domain=True, by default False

BodyCSSSelect(start_url='div.description.float-right > a', add_domain=True)

```

##### save_start_url

when you need to save additional data in the field, 

such as the start URLs of objects, add the ExtraDataField field (save_start_url = True)

##### body_count

how many objects are parsing

    Example:

```python

class MyParserClass(Parser):

    start = BodyCssSelect(start_url='ul.quest-tiles > li.quest-tile-1 > div.item-box > div.item-box-desc h4  a',

                          add_domain=True,

                          body_count=4)

    source = ExtraDataField(save_start_url=True)

```

It works on this [site](http://pythoff.com/), all this on the [channel](https://telegram.me/python_all)

-----