https://github.com/xiyuan-fengyu/ppspider_example

ppspider爬虫例子，B站视频信息及评论爬取，qq音乐信息及评论爬取，推特主题评论和用户信息爬取
https://github.com/xiyuan-fengyu/ppspider_example

bilibili cheerio crawler ppspider puppeteer qq-music spider twitter

Last synced: 4 days ago
JSON representation

ppspider爬虫例子，B站视频信息及评论爬取，qq音乐信息及评论爬取，推特主题评论和用户信息爬取

Host: GitHub
URL: https://github.com/xiyuan-fengyu/ppspider_example
Owner: xiyuan-fengyu
Created: 2018-06-09T01:28:23.000Z (over 7 years ago)
Default Branch: master
Last Pushed: 2020-04-07T06:46:54.000Z (over 5 years ago)
Last Synced: 2025-04-20T01:18:07.363Z (6 months ago)
Topics: bilibili, cheerio, crawler, ppspider, puppeteer, qq-music, spider, twitter
Language: TypeScript
Homepage:
Size: 219 KB
Stars: 23
Watchers: 2
Forks: 13
Open Issues: 3
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

# ppspider example
安装依赖
```
set PUPPETEER_DOWNLOAD_HOST=https://npm.taobao.org/mirrors/
npm install
# 或则用yarn安装依赖（需要通过npm提前全局安装yarn：npm install yarn -g）
# yarn install
```

## src/quickstart
演示了 @OnStart 装饰器的作用
在爬虫系统启动后，立即执行一个任务

## src/ontime
演示了 @OnTime 装饰器的作用
在爬虫启动后根据 cron 表达式周期性执行任务
这例子中仅仅是每个5秒钟打印一次时间

## src/queue
演示了 @AddToQueue @FromQueue 装饰器的作用

## src/requestMapping
演示了 @RequestMapping 声明 HTTP rest 接口，提供远程动态添加任务的能力
系统启动后，访问如下地址添加任务
```
curl http://localhost:9000/addJob/test?url=https%3A%2F%2Fwww.baidu.com&notifyUrl=http%3A%2F%2Flocalhost%3A9000%2FjobResult
```

## src/puppeteerUtil
演示了 PuppeteerUtil 工具类中一些方法的使用方式

## src/debug
演示了注入js的调试方法

## src/dataSave
演示了几种数据保存方案
由于抓到的大部分数据都是json格式的，建议使用1，然后根据实际数据需求，
后续再转存到其他存储介质中
1. 保存到本地文件中
2. 上传到服务器
3. 存入 mysql

## src/db
演示了 nedb / mongodb 的使用方式
这两种数据库是内置封装好的，直接通过 appInfo.db 使用

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/xiyuan-fengyu/ppspider_example

Awesome Lists containing this project

README