https://github.com/phanikmr/linkcrawler



Host: GitHub
URL: https://github.com/phanikmr/linkcrawler
Owner: phanikmr
License: gpl-3.0
Created: 2018-03-28T05:53:24.000Z (over 7 years ago)
Default Branch: master
Last Pushed: 2025-01-16T05:31:24.000Z (6 months ago)
Last Synced: 2025-01-16T05:31:26.321Z (6 months ago)
Topics: async, crawler, linkcrawler, parse, python, scrapy, spider
Language: Python
Size: 54.7 KB
Stars: 1
Watchers: 2
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# LinkCrawler

LinkCrawler is a Python module that takes a URL (e.g. http://python.org), fetches the corresponding web page, and parses all the links on that page into a repository of links. It then fetches the content of a URL from that repository, parses the links found there into the same repository, and repeats this process for every link in the repository until it is stopped or a given number of links have been fetched.
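The loop described above can be sketched with the standard library alone. This is a minimal illustration of the fetch, parse, and repeat cycle, not LinkCrawler's actual implementation: the names `LinkParser`, `extract_links`, `fetch_page`, `crawl`, `max_links`, and `fetch` are all hypothetical and chosen for this example only.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(base_url, html):
    """Return the absolute http(s) links in html, resolved against base_url."""
    parser = LinkParser()
    parser.feed(html)
    links = []
    for href in parser.hrefs:
        # Resolve relative links and drop any "#fragment" part.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if absolute.startswith(("http://", "https://")):
            links.append(absolute)
    return links


def fetch_page(url):
    """Download a page and decode it as text (hypothetical default fetcher)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, max_links=100, fetch=fetch_page):
    """Fetch a page, queue its unseen links, and repeat until the limit."""
    repository = [start_url]    # every link discovered so far, in order
    seen = {start_url}
    queue = deque([start_url])  # links whose pages have not been fetched yet
    fetched = 0
    while queue and fetched < max_links:
        url = queue.popleft()
        try:
            html = fetch(url)
        except (OSError, ValueError):
            continue            # skip pages that cannot be fetched
        fetched += 1
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                repository.append(link)
                queue.append(link)
    return repository
```

Passing a custom `fetch` function (for example, one that reads from a dictionary of canned pages) makes the loop easy to exercise without network access. The sketch visits links in the order they were discovered; whether LinkCrawler itself uses that order is not stated here.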

# Requirements

* Python 3.5+

* Works on Linux, Windows, macOS, and BSD

# Install

The quick way:

    pip install dist/LinkCrawler-1.0.0-py2.py3-none-any.whl

# Logs

Logs are written to:

```bash
~user/.crawler/
```

# Usage

 

```python
from crawler import Crawler

# Crawl from the start URL until stopped.
with Crawler("https://www.python.org", output_path="D://links.txt", LOG=Crawler.INFO_LOG) as crawler:
    crawler.crawl()

# Crawl incrementally, printing each batch of links as it is produced.
with Crawler("https://www.python.org", output_path="D://links.txt", LOG=Crawler.INFO_LOG) as crawler:
    for links in crawler.crawl_next():
        print(links)

# Crawl with debug logging; the argument presumably caps the number of links fetched.
with Crawler("https://www.python.org", output_path="D://links.txt", LOG=Crawler.DEBUG_LOG) as crawler:
    crawler.crawl(1000)
```

