{"id":16639970,"url":"https://github.com/kiddyuchina/beanbun-parser","last_synced_at":"2025-06-22T15:36:31.854Z","repository":{"id":57006411,"uuid":"94024699","full_name":"kiddyuchina/beanbun-parser","owner":"kiddyuchina","description":"beanbun-parser is the data extraction plugin for Beanbun. Its extraction rules use a jQuery-like selector syntax and are simple to use.","archived":false,"fork":false,"pushed_at":"2017-07-27T08:16:40.000Z","size":8,"stargazers_count":95,"open_issues_count":2,"forks_count":27,"subscribers_count":12,"default_branch":"master","last_synced_at":"2025-03-16T06:12:10.987Z","etag":null,"topics":["beanbun","parser","phpquery"],"latest_commit_sha":null,"homepage":"","language":"PHP","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kiddyuchina.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-06-11T18:46:47.000Z","updated_at":"2024-09-22T17:50:28.000Z","dependencies_parsed_at":"2022-08-21T14:30:50.347Z","dependency_job_id":null,"html_url":"https://github.com/kiddyuchina/beanbun-parser","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/kiddyuchina/beanbun-parser","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kiddyuchina%2Fbeanbun-parser","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kiddyuchina%2Fbeanbun-parser/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kiddyuchina%2Fbeanbun-parser/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kiddyuchina%2Fbeanbun-parser/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kiddyuchina","download_url":"https://codeload.g
ithub.com/kiddyuchina/beanbun-parser/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kiddyuchina%2Fbeanbun-parser/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":261315649,"owners_count":23140312,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["beanbun","parser","phpquery"],"created_at":"2024-10-12T07:07:31.486Z","updated_at":"2025-06-22T15:36:26.836Z","avatar_url":"https://github.com/kiddyuchina.png","language":"PHP","funding_links":[],"categories":[],"sub_categories":[],"readme":"# beanbun-parser \n\n### Introduction\nbeanbun-parser is the data extraction plugin for [Beanbun](https://github.com/kiddyuchina/Beanbun). Once extraction rules are set, it automatically extracts data from each crawled page into an array for later use. The rule selectors use a jQuery-like syntax and are simple to use.  \nThe plugin uses two packages: [phpQuery](https://github.com/jae-jae/phpQuery-single) and [querylist](https://github.com/jae-jae/QueryList). \n\n### Installation\nInstall with composer.\n```\n$ composer require kiddyu/beanbun-parser\n```\n\n### Usage\nInstantiate the plugin and load it with Beanbun::middleware(). The constructor can take one parameter, an array of beanbun-parser configuration options. After loading, the Beanbun instance gains a $parser property whose value is the beanbun-parser instance.  \nCurrently the only option is auto, which controls whether the plugin extracts data automatically according to the rules; it defaults to true.  \nWhen auto is true, the Beanbun instance gains two more properties, $fields and $data. $fields holds the extraction rules and $data holds the extracted data.  \n\n```php\n\u003c?php\nuse Beanbun\\Beanbun;\nuse Beanbun\\Middleware\\Parser;\n\n$beanbun = new Beanbun;\n$beanbun-\u003ename = '950d';\n$beanbun-\u003eseed = 'http://www.950d.com/';\n\n$parser = new Parser;\n$beanbun-\u003emiddleware($parser);\n```\n\n### Beanbun properties  \n#### Beanbun::$fields  \nEach extraction item in $fields can contain the following elements  \nname: the variable name for this item's data  \nselector: the extraction rule. It has two elements: the first is a jQuery-style selector, 
the second is the attribute to extract, which can be text, html, or any HTML tag attribute name such as src, href, name, or data-src  \nrepeated: whether the extracted content contains multiple items, default false  \nrequired: whether a value for this field is required, default false  \nchildren: defines child items for this field. The child definition is itself a fields array, so the whole thing forms a tree  \n```php\n$beanbun-\u003efields = [\n    [\n        'name' =\u003e 'title',\n        'selector' =\u003e ['title', 'text']\n    ],\n    [\n        'name' =\u003e 'template',\n        'children' =\u003e [\n            [\n                'name' =\u003e 'title',\n                'selector' =\u003e ['.js-course-list li h5', 'text'],\n                'repeated' =\u003e true,\n            ],\n            [\n                'name' =\u003e 'url',\n                'selector' =\u003e ['.js-course-list li .course-list-img a', 'href'],\n                'repeated' =\u003e true,\n            ],\n            [\n                'name' =\u003e 'image',\n                'selector' =\u003e ['.js-course-list li .course-list-img img', 'src'],\n                'repeated' =\u003e true,\n            ]\n        ]\n    ]\n];\n```\n\n#### Beanbun::$data \n$data holds the extracted data. It is available in Beanbun's afterDownloadPage callback and in every callback that runs after it \n```php\n$beanbun-\u003eafterDownloadPage = function($beanbun) {\n    print_r($beanbun-\u003edata);\n};\n\n// Data extracted by the example above\n$beanbun-\u003edata = [\n    'title' =\u003e '企业网站模板 - Finecms模板 Duxcms模板 Doccms模板 稻壳cms模板',\n    'template' =\u003e [\n        'title' =\u003e [\n            '旅游类通用型手机站模板',\n            '简洁高效多产品分类模板',\n            '虚拟商品销售网站Doccms模板',\n            '幼儿园幼儿教育Doccms网站模板',\n            '宠物会馆职业培训类Doccms模板',\n            '蓝色物流运输类Doccms模板',\n            '设计公司Duxcms手机网站模板',\n            '设计公司Duxcms网站模板',\n            'Doccms2016版大气简洁企业站模板',\n            '响应式红色企业网站模板',\n            '投资金融贷款类企业网站模板',\n            '投资贷款类企业手机模板'\n        ],\n        'url' =\u003e [\n            'http://www.950d.com/list/187.html',\n            'http://www.950d.com/list/184.html',\n            'http://www.950d.com/list/183.html',\n            
'http://www.950d.com/list/182.html',\n            'http://www.950d.com/list/181.html',\n            'http://www.950d.com/list/180.html',\n            'http://www.950d.com/list/179.html',\n            'http://www.950d.com/list/178.html',\n            'http://www.950d.com/list/177.html',\n            'http://www.950d.com/list/176.html',\n            'http://www.950d.com/list/175.html',\n            'http://www.950d.com/list/174.html'\n        ],\n        'image' =\u003e [\n            '/upload/2016-12-27/2c41a2b55cc1123a2909487e9c078969.jpg',\n            '/upload/2016-11-05/41bac823202e3f8b37dccb285f09b7ca.jpg',\n            '/upload/2016-11-05/336269e55db23da60e519d4806f6d2b0.jpg',\n            '/upload/2016-11-05/913ed6669b8cf2de0d366c55f0917002.jpg',\n            '/upload/2016-11-05/1760bd081855d178e48bd420a42d34d4.jpg',\n            '/upload/2016-11-05/614212d8bd4b4b7d2072300edb0e101d.jpg',\n            '/upload/2016-11-04/b5a2eae483169a602d6742ab383c772d.jpg',\n            '/upload/2016-11-04/62b40db4bd2ee13a0bcf4e49eae166aa.jpg',\n            '/upload/2016-03-22/21d397aa278643d7489533827d16bfa2.jpg',\n            '/upload/2016-10-12/d09c689ce01a525b631a5b2b56e052bc.jpg',\n            '/upload/2016-09-22/c2ad9f776f424309b89ff24bdefd152b.jpg',\n            '/upload/2016-09-22/d4b32be547ad65a9fd84a14e45e60180.jpg'\n        ]\n    ]\n];\n```\n\n### Beanbun::$parser methods  \ngetData  \nAccepts one parameter, $fields, in the same format as Beanbun::$fields described above. \n```php\n$beanbun-\u003eafterDownloadPage = function($beanbun) {\n    $data = $beanbun-\u003eparser-\u003egetData([\n        [\n            'name' =\u003e 'title',\n            'selector' =\u003e ['title', 'text']\n        ]\n    ]);\n    print_r($data);\n};\n\n```\n\n\n### Complete example\n``` php\nuse Beanbun\\Beanbun;\nuse Beanbun\\Middleware\\Parser;\n\nrequire_once(__DIR__ . 
'/vendor/autoload.php');\n\n$beanbun = new Beanbun;\n$beanbun-\u003ename = '950d';\n$beanbun-\u003ecount = 5;\n$beanbun-\u003eseed = 'http://www.950d.com/';\n$beanbun-\u003emax = 100;\n$beanbun-\u003eurlRegex = [\n    '/http:\\/\\/www.950d.com\\/list-1.html\\?page=(\\d*)/'\n];\n\n$beanbun-\u003emiddleware(new Parser());\n$beanbun-\u003efields = [\n    [\n        'name' =\u003e 'title',\n        'selector' =\u003e ['title', 'text']\n    ],\n    [\n        'name' =\u003e 'template',\n        'children' =\u003e [\n            [\n                'name' =\u003e 'title',\n                'selector' =\u003e ['.js-course-list li h5', 'text'],\n                'repeated' =\u003e true,\n            ],\n            [\n                'name' =\u003e 'url',\n                'selector' =\u003e ['.js-course-list li .course-list-img a', 'href'],\n                'repeated' =\u003e true,\n            ],\n            [\n                'name' =\u003e 'image',\n                'selector' =\u003e ['.js-course-list li .course-list-img img', 'src'],\n                'repeated' =\u003e true,\n            ]\n        ]\n    ]\n];\n\n$beanbun-\u003eafterDownloadPage = function($beanbun) {\n    print_r($beanbun-\u003edata);\n};\n$beanbun-\u003estart();\n```\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkiddyuchina%2Fbeanbun-parser","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkiddyuchina%2Fbeanbun-parser","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkiddyuchina%2Fbeanbun-parser/lists"}