https://github.com/riophae/lynda-video-transcripts

A crawler script for batch-downloading Lynda video transcripts.

Host: GitHub
URL: https://github.com/riophae/lynda-video-transcripts
Owner: riophae
Created: 2015-12-19T06:39:58.000Z (over 9 years ago)
Default Branch: master
Last Pushed: 2017-06-23T15:46:28.000Z (almost 8 years ago)
Last Synced: 2025-01-03T15:13:19.502Z (5 months ago)
Topics: cralwer, lynda, transcripts
Language: JavaScript
Homepage:
Size: 33.2 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# Lynda Video Transcripts

A crawler script for batch-downloading video transcripts from [Lynda](http://www.lynda.com/).

## Requirements

- Node.js
- Phantom.js 2.x

## Installation

```bash
$ git clone https://github.com/riophae/lynda-video-transcripts.git
$ cd lynda-video-transcripts
$ npm install # install dependencies
$ # set up config.yaml (see Configuration)
$ npm run build # rebuild every time the config changes
$ npm start # run the crawler script
```

## Configuration

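Copy `config.example.yaml`, rename the copy to `config.yaml`, and edit it:

- `detectNetworkCondition`: whether to check the network connection at startup (`yes`/`no`)
- `userAgent`: the user agent string; setting it to match the browser you normally use is recommended
- `captureScreenAutomatically`: whether to take screenshots automatically at regular intervals while the crawler runs (`yes`/`no`)
- `viewportSize`: the viewport size of the browser the crawler uses; any value works as long as it is not too small
- `username`, `password`: your lynda.com account name and password
- `courses`: the list of courses to crawl
- `intervalBetweenTutorialVisits`: the interval between visits to two consecutive tutorials; avoid setting it too short, or the crawler may trigger anti-abuse measures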
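
To catch a forgotten option before building, a small check along these lines may help. This is a minimal sketch, not part of the project: `findMissingKeys` is a hypothetical helper, it assumes the YAML has already been parsed into a plain object, and it assumes every documented key is required, which the README does not state.

```javascript
// Option names taken from the configuration list above.
const DOCUMENTED_KEYS = [
  'detectNetworkCondition',
  'userAgent',
  'captureScreenAutomatically',
  'viewportSize',
  'username',
  'password',
  'courses',
  'intervalBetweenTutorialVisits',
];

// Hypothetical helper (not part of the project): returns the documented keys
// that are absent (undefined or null) from a parsed config object.
function findMissingKeys(config) {
  return DOCUMENTED_KEYS.filter(
    (key) => config[key] === undefined || config[key] === null
  );
}

// Example: a config that only sets the account and the course list.
const partialConfig = { username: 'me@example.com', password: 'secret', courses: [] };
console.log(findMissingKeys(partialConfig));
```

An empty result means every documented option is present; it says nothing about whether the values themselves are valid.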

#### `courses`

Two forms are supported. You can specify both the output directory and the start point for each course:

```yaml
courses:
  - dirName:
    startPoint:
  - dirName: ...
    startPoint: ...
  - dirName: ...
    startPoint: ...
```

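Alternatively, specify only the start point of each course, and the program determines the output directory automatically from the course name: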

```yaml
courses:
-
-
- ...
```
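
One way to handle both forms uniformly is to normalize every entry into the same shape before crawling. The sketch below is illustrative only: `normalizeCourse` is a hypothetical name, the start point is assumed to be a string, the example values are placeholders, and a `dirName` of `null` stands for "derive the directory from the course name later".

```javascript
// Hypothetical helper (not part of the project): converts either documented
// form of a `courses` entry into { dirName, startPoint }.
function normalizeCourse(entry) {
  // Short form: the entry is just the start point.
  if (typeof entry === 'string') {
    return { dirName: null, startPoint: entry };
  }
  // Long form: an object with startPoint and an optional dirName.
  if (entry && typeof entry.startPoint === 'string') {
    return { dirName: entry.dirName || null, startPoint: entry.startPoint };
  }
  throw new TypeError('Each course entry needs a start point');
}

// Placeholder values; real start points depend on the courses you crawl.
const courses = [
  { dirName: 'my-course', startPoint: 'start-point-a' },
  'start-point-b',
].map(normalizeCourse);
console.log(courses);
```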

Internally, the crawler starts fetching transcripts at the specified start point and continues through the last tutorial of the course.
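
The behavior described above can be pictured roughly as follows. This is a simplified sketch under stated assumptions, not the project's actual code: it assumes the ordered list of tutorials is known up front, that `fetchTranscript` is a caller-supplied function, and that the interval is given in milliseconds (the README does not state the unit of `intervalBetweenTutorialVisits`). All names are hypothetical.

```javascript
// Hypothetical helper: the tutorials from the start point through the last one.
function tutorialsFrom(startPoint, tutorials) {
  const index = tutorials.indexOf(startPoint);
  if (index === -1) {
    throw new Error(`Start point not found: ${startPoint}`);
  }
  return tutorials.slice(index);
}

// Default wait implementation; can be replaced when testing.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical crawl loop: visits each remaining tutorial in order and pauses
// between consecutive visits so requests are not sent back to back.
async function crawlCourse(startPoint, tutorials, fetchTranscript, intervalMs, wait = sleep) {
  const transcripts = [];
  const pending = tutorialsFrom(startPoint, tutorials);
  for (let i = 0; i < pending.length; i += 1) {
    if (i > 0) {
      await wait(intervalMs); // pause only between visits, not before the first
    }
    transcripts.push(await fetchTranscript(pending[i]));
  }
  return transcripts;
}
```

Passing `wait` as a parameter keeps the pause configurable and lets the loop be exercised without real delays.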

## Caveats

Each time the crawler script starts, it empties the output directory (`output/`), so move your files elsewhere before the next run.
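
To avoid losing earlier results, you could move the output directory aside before each run. The following is a minimal sketch using only Node.js built-ins; `archiveOutput` and the timestamped naming scheme are assumptions for illustration, not something the project provides.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper (not part of the project): renames `outputDir` to a
// timestamped sibling directory. Returns the new path, or null when the
// directory is missing or empty and there is nothing to keep.
function archiveOutput(outputDir, now = new Date()) {
  if (!fs.existsSync(outputDir)) return null;
  if (fs.readdirSync(outputDir).length === 0) return null;
  // Replace ':' and '.' so the timestamp is safe in file names.
  const stamp = now.toISOString().replace(/[:.]/g, '-');
  const target = `${path.resolve(outputDir)}-${stamp}`;
  fs.renameSync(outputDir, target);
  return target;
}
```

Calling `archiveOutput('output')` from the project root before `npm start` would leave the previous transcripts in a directory such as `output-<timestamp>` next to the fresh `output/`.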
