https://github.com/tekintian/spider-tools-python
spider tools with python3: a novel (小说) scraper and TXT packaging tool that collects every chapter of a novel, with support for custom list-page and detail-page extraction rules
python3 spider tools
- Host: GitHub
- URL: https://github.com/tekintian/spider-tools-python
- Owner: tekintian
- Created: 2022-07-12T10:43:39.000Z (about 3 years ago)
- Default Branch: main
- Last Pushed: 2022-07-12T12:58:43.000Z (about 3 years ago)
- Last Synced: 2025-02-15T05:28:34.644Z (8 months ago)
- Topics: python3, spider, tools
- Language: Python
- Homepage:
- Size: 8.79 KB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Metadata Files:
  - Readme: readme.md
README
# Python spider data-collection utility classes
An object-oriented Python spider module for data collection. Data matching is wrapped in efficient, flexible regular expressions: concise, efficient, and fast!

- list-page collection;
- detail-page collection;
- collects every page linked from a list into a single txt file in one pass;
- supports any page encoding;
- supports gzip-compressed pages;
- fully resolves garbled Chinese text (mojibake);
- random User-Agent rotation to avoid bans (sketched below), and more.
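The README does not show how the random User-Agent, gzip, and encoding handling are implemented inside the module. As a rough illustration only, a fetch helper with those features might look like the sketch below; `fetch_html`, `USER_AGENTS`, and every detail here are assumptions, not the project's actual `http_req`:

~~~py
import gzip
import random
import urllib.request

# Illustrative pool; a real scraper would use full, current UA strings.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)',
    'Mozilla/5.0 (X11; Linux x86_64)',
]

def fetch_html(url, encoding=None):
    # Pick a random User-Agent per request to make blocking harder
    req = urllib.request.Request(url, headers={
        'User-Agent': random.choice(USER_AGENTS),
        'Accept-Encoding': 'gzip',
    })
    with urllib.request.urlopen(req, timeout=30) as resp:
        raw = resp.read()
        # Transparently decompress gzip-encoded responses
        if resp.headers.get('Content-Encoding') == 'gzip':
            raw = gzip.decompress(raw)
        # Honour the declared charset, then fall back to common encodings
        charset = encoding or resp.headers.get_content_charset()
        for enc in (charset, 'utf-8', 'gbk'):
            if not enc:
                continue
            try:
                return raw.decode(enc)
            except UnicodeDecodeError:
                continue
        return raw.decode('utf-8', errors='replace')
~~~

Falling back through the declared charset, utf-8, and gbk is one simple way to get readable text out of mixed-encoding Chinese novel sites.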
## Novel scraping example

~~~py
import io
import os
import re
import sys

# Spider is the module's OO interface (per the original import); http_req is
# the HTTP helper used below, assumed to live in the same spider module.
from spider import Spider, http_req


def get_content_txt(url, re_title, re_content):
    # Fetch one detail page and clean it into plain text.
    # (Body partly reconstructed: the original lines were mangled when the
    # README's HTML-like spans were stripped during scraping.)
    data = http_req(url)
    title = ''
    m = re.search(re_title, data)
    if m:
        title = m.groupdict().get('title')
    m = re.search(re_content, data)
    if not m:
        return ''
    content = m.groupdict().get('content')
    # Turn <br> tags into line breaks
    content = re.sub(r'<br\s*/?>', "\r\n", content)
    # Collapse 2+ consecutive line breaks into one
    # content = re.sub(r'([\r\n]{1,})', "\r\n", content)
    # Drop runs of 2+ spaces and strip tags; note [^img] is a character class,
    # so this keeps <img ...> but also spares any tag starting with i, m, or g
    content = re.sub(r'([ ]{2,})|<([^img].*?)>', "", content)
    return title + content
# Collect every chapter URL from the list page and write the chapters to one txt file
def spider_start(url, re_title, re_list, re_content, re_intro='', save_dir='article/'):
    data = http_req(url)
    # Extract the book title
    m = re.search(re_title, data)
    if m:
        title = m.groupdict().get('title')
    else:
        title = '小说'
    # If save_dir is not absolute, resolve it against the current directory
    if not os.path.isabs(save_dir):
        save_dir = os.path.join(os.getcwd(), save_dir)
    # Make sure the output directory exists
    if not os.path.isdir(save_dir):
        os.makedirs(save_dir)
    filename = os.path.join(save_dir, title + '.txt')  # output file name
    intro = ''  # book introduction, empty by default
    if re_intro != '':
        m = re.search(re_intro, data)
        if m:
            intro = m.groupdict().get('intro')
            # Strip all HTML tags from the introduction
            intro = re.sub(r'<[\s\S]+?>', '', intro)
    # Collect the chapter URIs
    matches = re.findall(re_list, data)
    total = len(matches)
    if total > 0:
        base_url = ''
        uri0 = matches[0]  # inspect the first URI to work out the base URL
        if uri0.startswith('/'):  # absolute path relative to the site root
            m_u = re.search(r'(https?://(.*?))/', url)
            if m_u:
                base_url = m_u[1]
        elif not uri0.startswith('http'):  # relative to the current page
            base_url = url[0:url.rindex('/') + 1]
        # otherwise the URI is already a full URL and needs no base_url
        # Overwrite the file with the book title and introduction
        file = open(filename, 'w', encoding='utf-8')
        file.write('# ' + title + '\r\n' + intro)
        file.close()
        # Reopen in append mode for the chapters
        file = open(filename, 'a+', encoding='utf-8')
        for index in range(total):
            sys.stdout.write('\rCollecting chapter ' + str(index + 1) + ' of ' + str(total))
            sys.stdout.flush()
            uri = matches[index]  # next chapter link from the list
            url = base_url + uri  # build the full chapter URL
            txt = get_content_txt(url, re_title, re_content)
            file.write(txt)
            file.flush()
        file.close()
        print('\r\nCollection finished')
    else:
        # No list matched: treat the URL as a single detail page
        txt = get_content_txt(url, re_title, re_content)
        if txt != '':
            file = open(filename, 'w', encoding='utf-8')
            file.write(txt)
            file.flush()
            file.close()
            print('Collection finished')
        else:
            print('No data matched')


if __name__ == "__main__":
    print('Starting collection')
    # Make stdout explicitly UTF-8 so the progress line prints correctly
    sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf8')
    # Configure the regexes for the target site. NOTE: the concrete patterns in
    # the original README were mangled during scraping; the HTML delimiters
    # below are assumed placeholders, not verified rules for the site.
    url = 'https://www.yruan.com/article/41479.html'  # URL to collect: a single page or a list page
    re_title = r'<h1>(?P<title>[\s\S]+?)<'  # title pattern
    re_intro = r'<div class="intro">(?P<intro>[\s\S]+?)</div>'  # introduction pattern
    re_content = r'<div id="content">(?P<content>[\s\S]+?)</div>'  # chapter body pattern
    re_list = r'<dd><a href="(?P<url>[\s\S]+?)"'  # chapter-link pattern (single group: the URI)
    # Kick off the collection (call reconstructed; the original closing lines were lost)
    spider_start(url, re_title, re_list, re_content, re_intro)
~~~
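Because every field is pulled from a named group (`title`, `intro`, `content`, `url`), retargeting the tool is just a matter of rewriting the patterns; `re.findall` with a single group returns that group's text directly, which is why each match above is used as a chapter URI without further unpacking. A hypothetical rule set for a site whose chapter list is built from `<li class="chapter">` entries:

~~~py
# All markup below is invented for illustration; adapt it to the real site's HTML.
re_title = r'<h1 class="chapter-title">(?P<title>[\s\S]+?)</h1>'
re_content = r'<div class="read-content">(?P<content>[\s\S]+?)</div>'
re_list = r'<li class="chapter"><a href="(?P<url>[^"]+)"'
~~~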