{"id":21902443,"url":"https://github.com/krolow/marsvin","last_synced_at":"2025-10-13T20:39:18.764Z","repository":{"id":4697984,"uuid":"5845154","full_name":"krolow/Marsvin","owner":"krolow","description":"Structural Crawler framework written in PHP","archived":false,"fork":false,"pushed_at":"2013-08-05T03:21:26.000Z","size":178,"stargazers_count":12,"open_issues_count":1,"forks_count":4,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-10-13T20:39:18.001Z","etag":null,"topics":["crawler","framework","parser","php"],"latest_commit_sha":null,"homepage":"","language":"PHP","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/krolow.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2012-09-17T18:20:02.000Z","updated_at":"2016-12-06T11:16:47.000Z","dependencies_parsed_at":"2022-08-21T08:50:25.074Z","dependency_job_id":null,"html_url":"https://github.com/krolow/Marsvin","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/krolow/Marsvin","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/krolow%2FMarsvin","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/krolow%2FMarsvin/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/krolow%2FMarsvin/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/krolow%2FMarsvin/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/krolow","download_url":"https://codeload.github.com/krolow/Marsvin/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/krolow
%2FMarsvin/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279016929,"owners_count":26085910,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-13T02:00:06.723Z","response_time":61,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","framework","parser","php"],"created_at":"2024-11-28T15:18:44.335Z","updated_at":"2025-10-13T20:39:18.748Z","avatar_url":"https://github.com/krolow.png","language":"PHP","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Marsvin\n\n\n## What is it?\n\nHave you ever written a crawler or parser?\n\nIf so, you know that it is usually a trivial task, yet we always have to think about how to structure the code for it...\n\nMarsvin was created to solve that: it provides a simple API and a structure to follow when you create your parsers or crawlers. 
The main focus is to facilitate the task of parsing data from external resources: extracting data from websites or importing data from XML, CSV files, etc.\n\nAs a plus, it comes with process management that lets you open more than one PHP process, so you can do more than one thing at the same time.\n\n## How to use it?\n\n**Create a composer.json**\n\n```javascript\n{\n  \"name\" : \"your/projectname\",\n  \"require\" : {\n    \"cobaia/marsvin\" : \"dev-master\"\n  },\n  \"minimum-stability\" : \"dev\"\n}\n```\n\n**Run the command:**\n\n```bash\ncomposer.phar install\n```\n\n**Create your console command:**\n\n\n**File:** console.php\n\n```php\n\u003c?php\nrequire_once dirname(__DIR__) . DIRECTORY_SEPARATOR . 'vendor' . DIRECTORY_SEPARATOR . 'autoload.php';\n\nuse Symfony\\Component\\Console\\Application;\nuse Symfony\\Component\\Console\\Helper\\HelperSet;\n\n$console = new Application('Your Project', '1.0');\n$console-\u003eaddCommands(\n    array(\n            new Marsvin\\Command\\GenerateProviderCommand(),\n            new Marsvin\\Command\\RequestProviderCommand(),\n    )\n);\n\n// You can pass as many helper sets as you want, like Doctrine, Monolog, etc.\n//$console-\u003esetHelperSet($helperSet);\n$console-\u003erun();\n\n\n```\n\nAfter creating the console, you can already run the command: php app/console.php\n\nYou will see that two commands are available:\n\n```bash\nmarsvin\n  marsvin:generate:provider   Generate Provider code structure\n  marsvin:request:provider    Request one specific Provider\n```\n\n**Marsvin uses the following nomenclature:**\n\n- **Provider:** The name of the operation that you will be doing. Examples of providers would be Facebook, Github, Google, etc.\n- **Requester:** The layer responsible for making the requests of a provider. For Facebook that would be an HTTP request, for Github it may be a Git operation, for another provider it can be an FTP access 
etc.\n- **Parser:** Once the request has been done, the parser layer takes care of the data. If you want to set up some Doctrine entities, build arrays of data, or normalize the data somehow, this is the layer to use.\n- **Persister:** Once you have parsed your data, it goes to the Persister layer. This is where you do whatever you want with the data: with Doctrine, for example, you can persist and flush the data into a database; you can also persist to the file system, send an email, or anything else. The Persister layer handles that for you.\n\n\n### Creating our first provider\n\nTo create our provider, Marsvin has a command that creates the folder structure for you:\n\n```bash\nphp app/console.php marsvin:generate:provider MyProject\\\\Github ./src/\n```\n\nYou will see that Marsvin generates the following folder tree:\n\n```bash\n.\n└── MyProject\n    └── Github\n        ├── GithubParser.php\n        ├── GithubPersister.php\n        ├── GithubProvider.php\n        └── GithubRequester.php\n```\n\nNow it's time to set up the adapters. Marsvin uses the adapter pattern to define each of the layers, so if you want to make HTTP requests, for example, you can set up an HttpAdapter to use in the requester layer.\n\nBy default Marsvin comes with a few adapters:\n\n- **Requester:** DefaultAdapter.php BuzzAdapter.php\n- **Parser:** DomAdapter.php\n- **Persister:** DefaultAdapter.php DoctrineAdapter.php\n\nSo let's set up our provider:\n\n```php\n\u003c?php\nnamespace MyProject\\Github;\n\nuse Marsvin\\Provider\\AbstractProvider;\nuse Marsvin\\Provider\\ProviderInterface;\nuse Marsvin\\Requester\\Adapter\\BuzzAdapter;\nuse Marsvin\\Parser\\Adapter\\DomAdapter;\nuse Marsvin\\Persister\\Adapter\\DefaultAdapter;\n\nclass GithubProvider extends AbstractProvider implements ProviderInterface\n{\n\n    public function 
getRequesterAdapter()\n    {\n        return new BuzzAdapter();\n    }\n\n    public function getParserAdapter()\n    {\n        return new DomAdapter();\n    }\n\n    public function getPersisterAdapter()\n    {\n        return new DefaultAdapter();\n    }\n\n}\n```\n\nThe Requester:\n\n```php\n\u003c?php\nnamespace MyProject\\Github;\n\nuse Marsvin\\Requester\\AbstractRequester;\nuse Marsvin\\Requester\\RequesterInterface;\nuse Marsvin\\Response;\n\nclass GithubRequester extends AbstractRequester implements RequesterInterface\n{\n\n    const GITHUB_URL = 'https://github.com/%s?tab=repositories';\n\n    public function request()\n    {\n        $adapter = $this-\u003egetAdapter();\n\n        $profiles = array(\n            'krolow',\n            'gquental',\n            'moacirosa',\n            'fabpot',\n        );\n\n        $self = $this;\n\n        foreach ($profiles as $profile) {\n            $this-\u003eprocess(function () use ($self, $adapter, $profile) {\n                $self-\u003edone(\n                    new Response(\n                        $adapter-\u003erequest(\n                            sprintf(\n                                GithubRequester::GITHUB_URL,\n                                $profile\n                            )\n                        )\n                    )\n                );\n            });\n        }\n    }\n\n}\n```\n\nThe Parser:\n\n```php\n\u003c?php\nnamespace MyProject\\Github;\n\nuse Marsvin\\Parser\\AbstractParser;\nuse Marsvin\\Parser\\ParserInterface;\nuse Marsvin\\Response;\nuse Marsvin\\ResponseInterface;\nuse DOMXPath;\nuse DOMDocument;\n\nclass GithubParser extends AbstractParser implements ParserInterface\n{\n\n    public function parse(ResponseInterface $response)\n    {\n        $adapter = $this-\u003egetAdapter();\n\n        $dom = $adapter-\u003eparse($response-\u003eget());\n\n        $xpath = new DOMXPath($dom);\n        \n        $nodes = $xpath-\u003equery('//span[@itemprop=\"name\"]');\n        
$author = $nodes-\u003eitem(0)-\u003enodeValue;\n\n        $nodes = $xpath-\u003equery('//li[contains(@class, \"public\")]');\n\n        $projects = array();\n\n        foreach ($nodes as $node) {\n            array_push(\n                $projects,\n                $this-\u003eparseProject($node, $author)\n            );\n        }\n\n\n        $this-\u003edone(new Response($projects));\n    }\n\n    protected function parseProject($node, $author)\n    {\n        $doc = new DOMDocument();\n        $doc-\u003eappendChild($doc-\u003eimportNode($node, true));\n\n        $name = $doc-\u003egetElementsByTagName('h3');\n        $url = $doc-\u003egetElementsByTagName('a');\n\n        $project = array(\n            'name' =\u003e trim($name-\u003eitem(0)-\u003enodeValue),\n            'url' =\u003e 'http://github.com/' . $url-\u003eitem(2)-\u003egetAttribute('href'),\n            'author' =\u003e $author\n        );\n        \n        return $project;\n    }\n\n}\n```\n\nThe Persister:\n\n```php\n\u003c?php\nnamespace MyProject\\Github;\n\nuse Marsvin\\Persister\\AbstractPersister;\nuse Marsvin\\Persister\\PersisterInterface;\nuse Marsvin\\Response;\nuse Marsvin\\ResponseInterface;\n\nclass GithubPersister extends AbstractPersister implements PersisterInterface\n{\n\n    public function persists(ResponseInterface $response)\n    {\n        $adapter = $this-\u003egetAdapter();\n        $adapter-\u003epersist($response-\u003eget());\n        file_put_contents('/tmp/marsvin.log', var_export($adapter-\u003eflush(), true), FILE_APPEND);\n    }\n\n}\n```\n\nTo run the command, do the following:\n\n```bash\nphp app/console.php marsvin:request:provider MyProject\\\\Github\\\\GithubProvider\n```\n\nYou can check what is happening here:\n\n```bash\ntail -f 
/tmp/marsvin.log\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkrolow%2Fmarsvin","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkrolow%2Fmarsvin","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkrolow%2Fmarsvin/lists"}