https://github.com/thinkwee/nlp_spyders
Various spyders (crawlers) for collecting NLP corpora
- Host: GitHub
- URL: https://github.com/thinkwee/nlp_spyders
- Owner: thinkwee
- Created: 2019-06-02T08:01:31.000Z
- Default Branch: master
- Last Pushed: 2019-06-02T08:03:27.000Z
- Last Synced: 2025-02-14T11:33:34.938Z
- Language: Python
- Size: 14.6 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# NLP_Spyders
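- A collection of crawlers for gathering NLP corpora
- acl_link+acl_pdf: crawls ACL papers. Link collection and PDF download are implemented as separate scripts. Covers roughly 9,000+ papers from 2010 to 2018.
- exec_pdf2txt+pdf2txt: converts PDF papers to plain text
- china_news: crawls China News (中国新闻网). Suitable for Chinese text classification, with 27 categories; more than 3.5 million news articles from 2012 to 2018 can be crawled.
- douban_spyder: crawls Douban book reviews. Needs improvement to avoid being blocked by anti-crawler measures.
- sina_spyder: crawls Sina News. Suitable for text classification, though the class distribution is fairly imbalanced.
- wiki_spyder: crawls English Wikipedia. Suitable for link analysis and clustering.
- china_news_headline_generation: crawls article bodies and headlines from China News for the headline generation task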
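
The split between acl_link and acl_pdf (collect links first, download PDFs second) might look roughly like the sketch below. This is not the repository's code: the names `PdfLinkParser`, `extract_pdf_links`, and `pdf_filename`, as well as the `example.org` URL, are hypothetical and chosen only for illustration. It uses the Python standard library alone and shows only the first stage, pulling PDF links out of an index page that has already been fetched.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkParser(HTMLParser):
    """Collects absolute URLs of <a href="..."> targets that end in .pdf."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url  # used to resolve relative links
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Skip anchors without an href and anything that is not a PDF.
        if href and href.lower().endswith(".pdf"):
            self.links.append(urljoin(self.base_url, href))


def extract_pdf_links(html, base_url):
    """Stage one: return the unique PDF links on a page, in page order."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    # dict.fromkeys removes duplicates while preserving insertion order.
    return list(dict.fromkeys(parser.links))


def pdf_filename(url):
    """Derive a local file name from a PDF URL (its last path segment)."""
    return url.rstrip("/").rsplit("/", 1)[-1]


if __name__ == "__main__":
    # Hypothetical index page; a real run would fetch this over HTTP.
    page = '<a href="P18-1001.pdf">paper</a> <a href="about.html">about</a>'
    for link in extract_pdf_links(page, "https://example.org/anthology/"):
        print(link, "->", pdf_filename(link))
```

Writing the collected links to a file before downloading means the second stage can be restarted after a failure without crawling the index pages again, which is one plausible reason for keeping the two steps in separate scripts.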