https://github.com/zeeklog/csdn-crawler
A Node.js crawler for csdn.com that fetches all of a CSDN user's articles. The code is for testing and learning only; please do not use it for improper purposes.
crawl-user-article csdn csdn-crawler csdn-docs csdnspider node-spider
Last synced: about 2 months ago
- Host: GitHub
- URL: https://github.com/zeeklog/csdn-crawler
- Owner: zeeklog
- License: mit
- Created: 2022-07-18T08:38:15.000Z (almost 4 years ago)
- Default Branch: master
- Last Pushed: 2024-08-23T02:00:54.000Z (over 1 year ago)
- Last Synced: 2025-08-09T04:05:29.259Z (8 months ago)
- Topics: crawl-user-article, csdn, csdn-crawler, csdn-docs, csdnspider, node-spider
- Language: JavaScript
- Homepage:
- Size: 135 KB
- Stars: 2
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# A Node.js crawler for fetching a user's articles from csdn.com
> For Node.js applications only; it does not work in the browser.
- Provide `options.username` to get that user's article list (the default length is 5).
- Provide `options.qiniu` to upload the articles' images to your own Qiniu Cloud storage.
- Provide `options.page` and `options.size` to set the page index and page size passed to the API.
### Why did I write this?
- > I wanted some data to fill my database for big-data testing, but writing it myself seemed too hard (because I am so lazy).
- > Many coders probably face the same problem, so let me make this job easier.
- > WARNING: This repo is only for testing and study. Do not use it to run stress tests against csdn.com.
> And CSDN sucks!
### How it works
```shell
# dependencies
cheerio
html-to-md
pinyin
request-promise
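# Fetch the target document with request-promise
# Parse the HTML document with cheerio and extract the article content
# Convert the HTML content to Markdown with html-to-md
# Generate the article alias with pinyin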
```
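The four steps listed above could be wired together roughly as follows. This is only a sketch, not the repository's actual code: `crawlArticle` and the step names `fetchHtml`, `extractContent`, `toMarkdown`, and `makeAlias` are hypothetical, and the assumption that the alias is built from the article title is mine. Each step is passed in as a function so the flow can be read and run without the dependencies installed.

```javascript
// Hypothetical wiring of the four steps; none of these names come from the repo.
async function crawlArticle(url, steps) {
  const { fetchHtml, extractContent, toMarkdown, makeAlias } = steps

  // 1. Fetch the target document (request-promise in the real project).
  const html = await fetchHtml(url)

  // 2. Parse the HTML and pull out the title and article body (cheerio).
  const { title, contentHtml } = extractContent(html)

  // 3. Convert the article HTML to Markdown (html-to-md).
  const markdown = toMarkdown(contentHtml)

  // 4. Build an alias for the article (pinyin); using the title is an assumption.
  const alias = makeAlias(title)

  return { url, title, alias, markdown }
}
```

Keeping each step behind a plain function also makes it easy to swap one library for another, or to stub the network call in tests.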
### Usage
#### 1. Fill in your own config
```javascript
// Example:
const options = {
  username: 'weixin_45534242', // target username
  page: 1, // index of the page to crawl
  size: 5, // page size
  link: '', // the user-center article list API; you can find it on csdn.com with the browser dev tools (F12)
  businessType: 'blog', // type of article to crawl; only 'blog' is supported for now
  sleepTime: null, // unit: ms. Sleep time while crawling; it may save your IP from being blocked
  supportImageType: ['jpg', 'png', 'jpeg', 'webp', 'gif', 'mp4', 'bmp', 'svg'], // file types supported for upload
  imagePrefixName: 'crawl-', // name prefix for uploaded images
  contentNodeIdentify: '#article_content', // id selector of the article content node in the HTML
  qiniu: {
    zone: '', // your Qiniu Cloud zone
    scope: '', // your Qiniu scope name (the storage name)
    useHttpsDomain: true, // use an HTTPS domain
    useCdnDomain: true, // use your CDN domain; applied to article list images
    baseQiNiuCdnApi: '', // your CDN domain name
    remoteFilePath: '/openStatic', // folder path where images are saved
    isNeedWaterMark: false, // if `true`, you must provide the Qiniu image style name below
    imageStyleSplitQuote: '&', // separator used in the image src link, e.g. https://qiniu.com/asd.png&scale-my-img
    imageStyleName: '', // your Qiniu image style name
    accessKey: '', // Qiniu Cloud accessKey
    secretKey: '', // Qiniu Cloud secretKey
    imageBaseAlt: '' // prefix for the image alt text
  }
}
```
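A config like the one above is usually merged over defaults before use. The sketch below shows one way that might look; `DEFAULT_OPTIONS`, `resolveOptions`, and `isSupportedImage` are hypothetical helpers, not part of this package. The default values are copied from the example, and treating `username` as required is an assumption.

```javascript
// Hypothetical defaults, copied from the example values above.
const DEFAULT_OPTIONS = {
  page: 1,
  size: 5,
  businessType: 'blog',
  sleepTime: null,
  supportImageType: ['jpg', 'png', 'jpeg', 'webp', 'gif', 'mp4', 'bmp', 'svg'],
  imagePrefixName: 'crawl-',
  contentNodeIdentify: '#article_content',
  qiniu: {}
}

// Merge user options over the defaults.
// Assumption: `username` is treated as the only required field.
function resolveOptions(userOptions = {}) {
  if (!userOptions.username) {
    throw new Error('options.username is required')
  }
  return {
    ...DEFAULT_OPTIONS,
    ...userOptions,
    qiniu: { ...DEFAULT_OPTIONS.qiniu, ...userOptions.qiniu }
  }
}

// Check a file URL against the allowed extensions,
// ignoring any query string or fragment and the letter case.
function isSupportedImage(url, supportedTypes) {
  const pathname = url.split(/[?#]/)[0]
  const ext = pathname.slice(pathname.lastIndexOf('.') + 1).toLowerCase()
  return supportedTypes.includes(ext)
}
```

With helpers like these, `resolveOptions({ username: 'weixin_45534242', size: 10 })` keeps the default `page` while overriding `size`, and `isSupportedImage(src, options.supportImageType)` can decide whether an image is worth uploading.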
#### 2. Start using csdnCrawler
```javascript
// You can find this code in `./demo.js`
const csdnCrawler = require('./index')
const exampleOptions = {
  username: 'weixin_45534242',
  page: 1,
  size: 5,
  link: '',
  businessType: 'blog',
  sleepTime: null, // unit: ms
  supportImageType: ['jpg', 'png', 'jpeg', 'webp', 'gif', 'mp4', 'bmp', 'svg'],
  imagePrefixName: 'crawl-',
  contentNodeIdentify: '#article_content',
  qiniu: {}
}
csdnCrawler(exampleOptions, data => {
  console.log(data)
  console.log(`==============================`)
  console.log(`=== Demo Crawl Succeed !!!===`)
  console.log(`==============================`)
  console.log(`Total Data length : ${data.length}`)
})
```
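If you prefer `async`/`await` over the callback shown in the demo, the crawler can be wrapped in a Promise and called page by page. This is a sketch under assumptions: `crawlAsync`, `sleep`, and `crawlPages` are hypothetical helpers, the callback is assumed to receive the data array as its only argument (as in the demo), and a page shorter than `size` is assumed to be the last one.

```javascript
// Wrap a callback-style crawler (as in the demo above) in a Promise.
// Assumes the callback receives the data array as its only argument.
function crawlAsync(crawler, options) {
  return new Promise(resolve => crawler(options, resolve))
}

// Resolve after `ms` milliseconds.
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms))

// Crawl pages 1..maxPages in sequence, pausing between pages and
// stopping early when a page returns fewer than `size` items.
async function crawlPages(crawler, baseOptions, maxPages, pauseMs = 1000) {
  const all = []
  for (let page = 1; page <= maxPages; page++) {
    const data = await crawlAsync(crawler, { ...baseOptions, page })
    all.push(...data)
    if (data.length < baseOptions.size) break
    if (page < maxPages) await sleep(pauseMs)
  }
  return all
}

// Usage (not executed here):
// const csdnCrawler = require('./index')
// crawlPages(csdnCrawler, exampleOptions, 3).then(all => console.log(all.length))
```

Crawling the pages one after another with a pause keeps the request rate low, in the same spirit as the `sleepTime` option.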
### Warning again (to save me from trouble)
- Don't use this for bad purposes.
- It may lead to bad consequences in China (it might even break the law...) and will drive you crazy.
- Please use this only for testing and study purposes.