https://github.com/tekintian/spider-tools-python
spider tools with python3: a novel (小说) scraper and TXT packaging tool that collects every chapter of a novel, with support for custom list-page and detail-page extraction rules
python3 spider tools
- Host: GitHub
- URL: https://github.com/tekintian/spider-tools-python
- Owner: tekintian
- Created: 2022-07-12T10:43:39.000Z (about 3 years ago)
- Default Branch: main
- Last Pushed: 2022-07-12T12:58:43.000Z (about 3 years ago)
- Last Synced: 2025-02-15T05:28:34.644Z (8 months ago)
- Topics: python3, spider, tools
- Language: Python
- Homepage:
- Size: 8.79 KB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Metadata Files:
  - Readme: readme.md
README
# Python spider data-collection utility classes
An object-oriented Python spider module for data collection. Data matching is wrapped in efficient, flexible regular expressions: concise, efficient, and fast!

- list-page collection;
- detail-page collection;
- collects every page linked from a list into a single txt file in one pass;
- supports any page encoding;
- supports gzip-compressed pages;
- fully resolves garbled Chinese text (mojibake);
- random User-Agent rotation to avoid bans (sketched below), and more.
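The README does not show how the random User-Agent, gzip, and encoding handling are implemented inside the module. As a rough illustration only, a fetch helper with those features might look like the sketch below; `fetch_html`, `USER_AGENTS`, and every detail here are assumptions, not the project's actual `http_req`:

~~~py
import gzip
import random
import urllib.request

# Illustrative pool; a real scraper would use full, current UA strings.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)',
    'Mozilla/5.0 (X11; Linux x86_64)',
]

def fetch_html(url, encoding=None):
    # Pick a random User-Agent per request to make blocking harder
    req = urllib.request.Request(url, headers={
        'User-Agent': random.choice(USER_AGENTS),
        'Accept-Encoding': 'gzip',
    })
    with urllib.request.urlopen(req, timeout=30) as resp:
        raw = resp.read()
        # Transparently decompress gzip-encoded responses
        if resp.headers.get('Content-Encoding') == 'gzip':
            raw = gzip.decompress(raw)
        # Honour the declared charset, then fall back to common encodings
        charset = encoding or resp.headers.get_content_charset()
        for enc in (charset, 'utf-8', 'gbk'):
            if not enc:
                continue
            try:
                return raw.decode(enc)
            except UnicodeDecodeError:
                continue
        return raw.decode('utf-8', errors='replace')
~~~

Falling back through the declared charset, utf-8, and gbk is one simple way to get readable text out of mixed-encoding Chinese novel sites.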
## Novel scraping example

~~~py
import io
import os
import re
import sys

# Spider is the module's OO interface (per the original import); http_req is
# the HTTP helper used below, assumed to live in the same spider module.
from spider import Spider, http_req


def get_content_txt(url, re_title, re_content):
    # Fetch one detail page and clean it into plain text.
    # (Body partly reconstructed: the original lines were mangled when the
    # README's HTML-like spans were stripped during scraping.)
    data = http_req(url)
    title = ''
    m = re.search(re_title, data)
    if m:
        title = m.groupdict().get('title')
    m = re.search(re_content, data)
    if not m:
        return ''
    content = m.groupdict().get('content')
    # Turn <br> tags into line breaks
    content = re.sub(r'<br\s*/?>', "\r\n", content)
    # Collapse 2+ consecutive line breaks into one
    # content = re.sub(r'([\r\n]{1,})', "\r\n", content)
    # Drop runs of 2+ spaces and strip tags; note [^img] is a character class,
    # so this keeps <img ...> but also spares any tag starting with i, m, or g
    content = re.sub(r'([ ]{2,})|<([^img].*?)>', "", content)
    return title + content
# Collect every chapter URL from the list page and write the chapters to one txt file
def spider_start(url, re_title, re_list, re_content, re_intro='', save_dir='article/'):
    data = http_req(url)
    # Extract the book title
    m = re.search(re_title, data)
    if m:
        title = m.groupdict().get('title')
    else:
        title = '小说'
    # If save_dir is not absolute, resolve it against the current directory
    if not os.path.isabs(save_dir):
        save_dir = os.path.join(os.getcwd(), save_dir)
    # Make sure the output directory exists
    if not os.path.isdir(save_dir):
        os.makedirs(save_dir)
    filename = os.path.join(save_dir, title + '.txt')  # output file name
    intro = ''  # book introduction, empty by default
    if re_intro != '':
        m = re.search(re_intro, data)
        if m:
            intro = m.groupdict().get('intro')
            # Strip all HTML tags from the introduction
            intro = re.sub(r'<[\s\S]+?>', '', intro)
    # Collect the chapter URIs
    matches = re.findall(re_list, data)
    total = len(matches)
    if total > 0:
        base_url = ''
        uri0 = matches[0]  # inspect the first URI to work out the base URL
        if uri0.startswith('/'):  # absolute path relative to the site root
            m_u = re.search(r'(https?://(.*?))/', url)
            if m_u:
                base_url = m_u[1]
        elif not uri0.startswith('http'):  # relative to the current page
            base_url = url[0:url.rindex('/') + 1]
        # otherwise the URI is already a full URL and needs no base_url
        # Overwrite the file with the book title and introduction
        file = open(filename, 'w', encoding='utf-8')
        file.write('# ' + title + '\r\n' + intro)
        file.close()
        # Reopen in append mode for the chapters
        file = open(filename, 'a+', encoding='utf-8')
        for index in range(total):
            sys.stdout.write('\rCollecting chapter ' + str(index + 1) + ' of ' + str(total))
            sys.stdout.flush()
            uri = matches[index]  # next chapter link from the list
            url = base_url + uri  # build the full chapter URL
            txt = get_content_txt(url, re_title, re_content)
            file.write(txt)
            file.flush()
        file.close()
        print('\r\nCollection finished')
    else:
        # No list matched: treat the URL as a single detail page
        txt = get_content_txt(url, re_title, re_content)
        if txt != '':
            file = open(filename, 'w', encoding='utf-8')
            file.write(txt)
            file.flush()
            file.close()
            print('Collection finished')
        else:
            print('No data matched')


if __name__ == "__main__":
    print('Starting collection')
    # Make stdout explicitly UTF-8 so the progress line prints correctly
    sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf8')
    # Configure the regexes for the target site. NOTE: the concrete patterns in
    # the original README were mangled during scraping; the HTML delimiters
    # below are assumed placeholders, not verified rules for the site.
    url = 'https://www.yruan.com/article/41479.html'  # URL to collect: a single page or a list page
    re_title = r'<h1>(?P<title>[\s\S]+?)<'  # title pattern
    re_intro = r'<div class="intro">(?P<intro>[\s\S]+?)</div>'  # introduction pattern
    re_content = r'<div id="content">(?P<content>[\s\S]+?)</div>'  # chapter body pattern
    re_list = r'<dd><a href="(?P<url>[\s\S]+?)"'  # chapter-link pattern (single group: the URI)
    # Kick off the collection (call reconstructed; the original closing lines were lost)
    spider_start(url, re_title, re_list, re_content, re_intro)
~~~
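Because every field is pulled from a named group (`title`, `intro`, `content`, `url`), retargeting the tool is just a matter of rewriting the patterns; `re.findall` with a single group returns that group's text directly, which is why each match above is used as a chapter URI without further unpacking. A hypothetical rule set for a site whose chapter list is built from `<li class="chapter">` entries:

~~~py
# All markup below is invented for illustration; adapt it to the real site's HTML.
re_title = r'<h1 class="chapter-title">(?P<title>[\s\S]+?)</h1>'
re_content = r'<div class="read-content">(?P<content>[\s\S]+?)</div>'
re_list = r'<li class="chapter"><a href="(?P<url>[^"]+)"'
~~~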