https://github.com/rafaelglikis/sinama

Web scraping library
crawler crawling scraper scraping

Host: GitHub
URL: https://github.com/rafaelglikis/sinama
Owner: rafaelglikis
Created: 2018-08-14T23:22:46.000Z (almost 8 years ago)
Default Branch: master
Last Pushed: 2018-08-23T00:16:15.000Z (almost 8 years ago)
Last Synced: 2025-07-26T16:52:28.374Z (11 months ago)
Topics: crawler, crawling, scraper, scraping
Language: PHP
Size: 45.9 KB
Stars: 3
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# Sinama

[![Build Status](https://travis-ci.org/rafaelglikis/sinama.svg?branch=master)](https://travis-ci.org/rafaelglikis/sinama)

Sinama is a simple web scraping library.

## Requirements

* PHP 7.0

## Installation

```shell
composer require rafaelglikis/sinama
```

## Usage

Create a Sinama Client (which extends Goutte\Client):

```php
use Sinama\Client;

$client = new Client();
```

Make requests with the request() method:

```php
// Go to the motherfuckingwebsite.com website
$crawler = $client->request('GET', 'https://motherfuckingwebsite.com/');
```

The method returns a Crawler object (which extends [Symfony\Component\DomCrawler\Crawler](https://api.symfony.com/4.1/Symfony/Component/DomCrawler/Crawler.html)).

To use your own Guzzle settings, create a Guzzle 6 instance and pass it to the Sinama Client. For example, to add a 60-second request timeout:

```php
use Sinama\Client;
use GuzzleHttp\Client as GuzzleClient;

$client = new Client(new GuzzleClient([
    'timeout' => 60
]));

$crawler = $client->request('GET', 'https://github.com/trending');
```

For more options, see the [Guzzle documentation](http://docs.guzzlephp.org/en/stable/request-options.html).

Click on links:

```php
$link = $crawler->selectLink('PHP')->link();
$crawler = $client->click($link);

echo $crawler->getUri()."\n";
```

Extract data the Symfony way:

```php
$crawler->filter('h3 > a')->each(function ($node) {
    print trim($node->text())."\n";
});
```

Or use Sinama's special methods:

```php
$crawler = $client->request('GET', 'https://github.com/trending');

echo 'Title: '.$crawler->findTitle()."\n";
echo 'Main Image: '.$crawler->findMainImage()."\n";
echo $crawler->findMainContent()."\n";

echo 'Links: ';
print_r($crawler->findLinks());
echo 'Emails: ';
print_r($crawler->findEmails());
echo 'Images: ';
print_r($crawler->findImages());
```
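To give a rough idea of what helpers such as `findLinks()` and `findEmails()` do, here is a minimal sketch built only on PHP's bundled DOM extension and a regular expression. It is not Sinama's implementation: the function names `sketchFindLinks` and `sketchFindEmails`, the email pattern, and the sample HTML are all assumptions made for illustration.

```php
<?php
// Hypothetical helpers, NOT Sinama's code: they only illustrate one way
// link and email extraction can be done with PHP's built-in DOM extension.

function sketchFindLinks(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is often malformed, so parser warnings are suppressed.
    @$doc->loadHTML($html);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = trim($a->getAttribute('href'));
        // Skip anchors without an href and mailto: links.
        if ($href !== '' && strpos($href, 'mailto:') !== 0) {
            $links[] = $href;
        }
    }

    // Drop duplicates and reindex from 0.
    return array_values(array_unique($links));
}

function sketchFindEmails(string $html): array
{
    // Deliberately simple pattern; it does not cover every valid address.
    preg_match_all('/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/', $html, $m);

    return array_values(array_unique($m[0]));
}

$html = '<html><body><a href="/about">About</a> <a href="/about">Again</a> '
      . '<a href="mailto:hi@example.com">Mail</a></body></html>';

print_r(sketchFindLinks($html));
print_r(sketchFindEmails($html));
```

For the sample markup, the first call yields the single link `/about` and the second the single address `hi@example.com`. Sinama's own methods may normalise, filter, or order results differently.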

    

Submit forms:

```php
$crawler = $client->request('GET', 'https://www.google.com/');

$form = $crawler->selectButton('Google Search')->form();
$crawler = $client->submit($form, ['q' => 'rafaelglikis/sinama']);

$crawler->filter('h3 > a')->each(function ($node) {
    print trim($node->text())."\n";
});
```

Now that we have covered the basics, let's scrape a site with the Sinama Spider:

```php
use Sinama\Crawler;
use Sinama\Spider as BaseSpider;

class Spider extends BaseSpider
{
    public function parse(Crawler $crawler)
    {
        // Scrape every article linked from the listing page.
        $crawler->filter('div.read-more > a')->each(function (Crawler $node) {
            $this->scrape($node->attr('href'));
        });

        // Follow the pagination links to further listing pages.
        $crawler->filter('div.blog-pagination > a')->each(function ($node) {
            $this->follow($node->attr('href'));
        });
    }

    public function scrape($url)
    {
        echo "*************************************************** ".$url."\n";

        $crawler = $this->client->request('GET', $url);

        echo "Title: " . $crawler->findTitle() . "\n";
        echo "Main Image: " . $crawler->findMainImage()."\n";
        echo "Main Content: \n" . $crawler->findMainContent()."\n";
        echo "Emails: \n";
        print_r($crawler->findEmails());
        echo "Links: \n";
        print_r($crawler->findLinks());
    }

    public function getStartUrls(): array
    {
        return [
            'https://blog.scrapinghub.com'
        ];
    }
}

$spider = new Spider([
    'start_urls' => [ 'https://blog.scrapinghub.com' ],
    'max_depth' => 2,
    'verbose' => true
]);

$spider->run();
```
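The `max_depth` option suggests a depth-bounded traversal. The sketch below shows one way such a bound can work, using a breadth-first walk over an in-memory map of pages instead of real HTTP requests. It is an assumption about the general technique, not a description of how Sinama's Spider is implemented: the `$site` map, the `sketchCrawl` function, and the choice to count start URLs as depth 0 are all hypothetical.

```php
<?php
// Hypothetical stand-in for a website: URL => links found on that page.
// NOT Sinama's code; it only illustrates a depth-limited crawl.
$site = [
    '/'       => ['/page/2', '/post/a'],
    '/page/2' => ['/page/3', '/post/b'],
    '/page/3' => ['/post/c'],
];

/**
 * Breadth-first traversal bounded by $maxDepth (start URLs are depth 0).
 * Returns the visited URLs in visiting order.
 */
function sketchCrawl(array $site, array $startUrls, int $maxDepth): array
{
    $queue = [];
    foreach ($startUrls as $url) {
        $queue[] = [$url, 0];
    }

    $visited = [];
    while ($queue) {
        list($url, $depth) = array_shift($queue);

        // Skip pages already seen and pages beyond the depth limit.
        if (isset($visited[$url]) || $depth > $maxDepth) {
            continue;
        }
        $visited[$url] = true;

        // Enqueue outgoing links one level deeper.
        foreach ($site[$url] ?? [] as $next) {
            $queue[] = [$next, $depth + 1];
        }
    }

    return array_keys($visited);
}

print_r(sketchCrawl($site, ['/'], 2));
```

With a limit of 2, `/post/c` is never visited because it sits three links away from the start URL. Tracking visited URLs also keeps the walk from looping when pages link back to each other.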

## TODO

* Crawler::findTags()
