https://github.com/guillim/arachnida

App to scrape the web, for people without coding skills. Fully integrates web crawlers (Headless Chrome) and the interface to control them.
Host: GitHub
URL: https://github.com/guillim/arachnida
Owner: guillim
License: mit
Created: 2018-09-21T10:01:07.000Z (almost 7 years ago)
Default Branch: master
Last Pushed: 2020-10-05T07:16:54.000Z (over 4 years ago)
Last Synced: 2023-10-20T20:07:41.782Z (over 1 year ago)
Topics: crawler, crawling, framework, headless-chrome, javascipt, meteor, scraper, scrapping
Language: JavaScript
Size: 296 KB
Stars: 11
Watchers: 4
Forks: 12
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE

README

Arachnida: simple web interface to pilot crawlers (Under Construction)
=========

Scrape the web easily -> no need to be a coding expert.

Arachnida provides a simple web interface to pilot powerful crawlers (running Headless Chrome).

# Install (2 seconds) #

Open a terminal and run the following (Meteor must already be installed):

```
git clone https://github.com/guillim/Arachnida.git arachnida && cd arachnida && meteor
```

**Finished!**

# Use (1 minute) #  

Now open Google Chrome (or any other browser) and go to http://localhost:3000

You will be able to add a crawler, configure it, and run it in seconds!

### 1. Create a crawler on the main page: ###

First give it a name, and leave the function empty (unless you know what you're doing).

![screenshot](https://ibin.co/4GSHblERpQfn.png)

### 2. Configure your crawler: ###

This is the only step where a bit of coding knowledge is helpful. In the main part, you need to write a JavaScript function that will be executed on every page scraped by the crawler.

For instance, to extract the title of each page, write:

```
return {
  title: $('title').text(),
};
```

Yes, jQuery is already set up. You simply need to provide the selectors (id, class...).
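As a slightly larger illustration of the same idea, here is a minimal sketch that extracts several fields with jQuery selectors. The wrapper function `extractPage` and the field names `heading`, `description` and `linkCount` are hypothetical, chosen only for this example. The sketch assumes that the crawler accepts any flat object as a result, which this README does not confirm.

```javascript
// Hypothetical wrapper, used here only to make the sketch self-contained.
// In Arachnida you would paste just the `return { ... };` statement,
// since `$` (jQuery) is already available on each scraped page.
function extractPage($) {
  return {
    // Text of the <title> tag, without surrounding whitespace
    title: $('title').text().trim(),
    // Text of the first <h1> on the page (empty string if there is none)
    heading: $('h1').first().text().trim(),
    // Content of the meta description tag (undefined if the tag is missing)
    description: $('meta[name="description"]').attr('content'),
    // Number of links that carry an href attribute
    linkCount: $('a[href]').length,
  };
}
```

Only standard jQuery calls are used (`text`, `first`, `attr`, `length`), so the selectors can be swapped for the ids or classes of the site you want to scrape.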

![screenshot](https://ibin.co/4GSHWS9cgqUR.png)

### 3. View the results: ###

![screenshot](https://ibin.co/4GSJEILx9T9s.png)

## What's included ##

* View a screenshot of your running crawler

* Manually add URLs to be scraped, or upload a CSV

* Sign in / Sign up  

* Account management: Profile Page, Username, Change password, Delete account...

* Admin for the webmaster: go to `/admin`

* Router

* MongoDB as database

# Contribute #  

I am looking for people to make pull requests to improve Arachnida. Please do it :)  

TO DO:  

1. Set up a live queue of URLs to be scraped (e.g. at the moment, you can't click on a link and scrape it directly)

2. Live log from the server, shown in the interface to help with debugging

3. Results export functionality (CSV & JSON)
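For contributors interested in item 3, the following is one possible starting point rather than existing Arachnida code. The helpers `toJson` and `toCsv` are hypothetical, and the sketch assumes results are available as an array of flat objects such as `{ title: '...' }`.

```javascript
// Hypothetical helpers, not part of Arachnida. They assume `results` is an
// array of flat objects, e.g. [{ title: 'Home' }, { title: 'About' }].

// Serialize results as indented JSON.
function toJson(results) {
  return JSON.stringify(results, null, 2);
}

// Serialize results as CSV, with one column per key found in any result.
function toCsv(results) {
  // Collect every key seen across results so that no column is dropped.
  const columns = [...new Set(results.flatMap((row) => Object.keys(row)))];
  // Quote each cell and double any embedded quotes (RFC 4180 style).
  const escape = (value) => '"' + String(value ?? '').replace(/"/g, '""') + '"';
  const lines = results.map((row) =>
    columns.map((column) => escape(row[column])).join(',')
  );
  return [columns.map(escape).join(','), ...lines].join('\r\n');
}
```

Quoting every cell keeps the escaping logic short. Missing fields are written as empty cells so that all rows have the same number of columns.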

### Thanks ###  

Boilerplate: yogiben.  

HeadlessChrome layer: yujiosaka
