https://github.com/delan/scrapetopia

scrape old lectopia lectures

Scrapetopia
===========

Preparation
-----------

Install Python 2.7, pip and a C compiler. Windows users are far better off
using Cygwin, though native Python may work, provided you have exactly MSVC++ 2008.

If you don't have pip, install it with `easy_install pip`.

Then install dependencies with `pip install -r requirements.txt`.

Finally, run `setup.py` to create the database and other directories.

Fetching lecture metadata
-------------------------

Run `getallmeta.py` to fetch lecture information.

There is no official list of units and no indication of the highest ID, so by
default it tries IDs 0 through 9999. This appears to be sufficient, as the
highest ID that exists in this range is 5151.
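The sweep can be sketched roughly as follows. This is a hypothetical illustration; the function names and fetch behaviour are assumptions, not the actual code in `getallmeta.py`:

```python
def candidate_ids(limit=10000):
    """Yield every unit ID to probe: 0 up to (but not including) limit.

    There is no official list of units, so the whole range is tried.
    """
    return range(limit)

def sweep(fetch, limit=10000):
    """Probe each candidate ID and collect the ones that exist.

    fetch(unit_id) is assumed to return lecture metadata, or None
    when the ID does not exist (e.g. the server returned 404).
    """
    found = {}
    for unit_id in candidate_ids(limit):
        meta = fetch(unit_id)
        if meta is not None:
            found[unit_id] = meta
    return found
```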

Extracting the media file list
------------------------------

Run `getlist.py` to dump all media file URLs from the database, except for those
which have a corresponding file saved in `data/media`.

The list of files will be at `data/list.txt`. Don't try to feed it into a
typical download manager like JDownloader; it won't cope, because there can be
over 100,000 files.
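The filtering step amounts to something like this rough sketch. The real script reads from the database, and the URL-to-filename mapping here is an assumption:

```python
import os

def urls_to_fetch(urls, media_dir="data/media"):
    """Return only the URLs whose file is not already in media_dir.

    Assumes the local filename is the last path component of the URL;
    getlist.py may map URLs to filenames differently.
    """
    remaining = []
    for url in urls:
        filename = url.rsplit("/", 1)[-1]
        if not os.path.exists(os.path.join(media_dir, filename)):
            remaining.append(url)
    return remaining
```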

Downloading media files
-----------------------

Run `getmedia.py` to download media files.

It supports retrying and continuing partial downloads, so feel free to ^C at any
time. If you restart the downloader, it'll go through and ping all of the files
to make sure they exist and are the right size.
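That verification pass can be illustrated by a small decision function. This is hypothetical; the actual logic in `getmedia.py` may differ:

```python
import os

def resume_offset(path, expected_size):
    """Decide what to do with a file before (re)downloading it.

    Returns 0 to download from scratch, the current size to resume a
    partial download (e.g. via an HTTP Range request), or None if the
    file already exists with the expected size.
    """
    if not os.path.exists(path):
        return 0
    size = os.path.getsize(path)
    if size < expected_size:
        return size
    return None
```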

Files that are definitely done will automatically go into `data/done.txt`, and
you can remove these from the main list by running `scalpel.sh`. Running this is
optional but makes restarts faster because the list is shortened.

If you have lost `data/done.txt` and want to remove completed files in
`data/media` from your current `data/list.txt`, run `redo.py` and a new
`list.txt` will be generated without the existing media files.

Backing up media files
----------------------

You can run `backup.py` with the main and backup directories supplied as
arguments to copy the media files to another location. Whether a particular
file needs copying is determined by a quick comparison of file sizes, as
matching file contents exactly would be very slow.
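A minimal sketch of that size-based copy, assuming a flat directory of media files (the actual `backup.py` may handle things like subdirectories differently):

```python
import os
import shutil

def backup(src_dir, dst_dir):
    """Copy files from src_dir to dst_dir, skipping any file whose
    backup already exists with the same size.
    """
    os.makedirs(dst_dir, exist_ok=True)
    copied = []
    for name in sorted(os.listdir(src_dir)):
        src = os.path.join(src_dir, name)
        dst = os.path.join(dst_dir, name)
        if not os.path.isfile(src):
            continue
        # Comparing sizes is fast; comparing contents byte-for-byte would
        # be very slow over a large media collection.
        if os.path.exists(dst) and os.path.getsize(dst) == os.path.getsize(src):
            continue
        shutil.copy2(src, dst)
        copied.append(name)
    return copied
```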

Hosting the web-based browser
-----------------------------

Run `web.py` to start the application.

Install nginx and use the provided `nginx.conf`, changing the `alias` path on
line 15 to point to the `data/media` directory. nginx proxies requests to the
application, but is still needed because it serves the media files directly.

Start nginx and the interface will be available on port 8653.

Screenshots
-----------

![statistics](http://i.imgur.com/Fic6u5F.png)

![units](http://i.imgur.com/Z5cCcGG.png)

![lectures](http://i.imgur.com/eNzehtk.png)

![files](http://i.imgur.com/Fa7tNDk.png)

![player](http://i.imgur.com/n4j8RqT.png)