{"id":18903836,"url":"https://github.com/bububa/lego","last_synced_at":"2026-03-04T08:30:19.310Z","repository":{"id":741544,"uuid":"392427","full_name":"bububa/Lego","owner":"bububa","description":"A python module which is based on bububa.SuperMario project provide a YAML style templates crawler, also provide some collective intellectual functions.","archived":false,"fork":false,"pushed_at":"2010-03-03T12:04:19.000Z","size":151,"stargazers_count":3,"open_issues_count":0,"forks_count":1,"subscribers_count":3,"default_branch":"master","last_synced_at":"2024-12-31T10:16:36.633Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"http://syd.todayclose.com/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bububa.png","metadata":{"files":{"readme":"README.txt","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2009-12-02T05:34:26.000Z","updated_at":"2019-08-18T06:34:13.000Z","dependencies_parsed_at":"2022-07-05T13:14:13.749Z","dependency_job_id":null,"html_url":"https://github.com/bububa/Lego","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bububa%2FLego","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bububa%2FLego/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bububa%2FLego/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bububa%2FLego/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bububa","download_url":"https://codeload.github.com/bububa/Lego/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https
://github.com","kind":"github","repositories_count":239888352,"owners_count":19713690,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-08T09:06:37.296Z","updated_at":"2026-03-04T08:30:17.251Z","avatar_url":"https://github.com/bububa.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"= About Lego =\n    Lego is an advanced web crawler library written in Python. It\nprovides a number of methods for mining data from various kinds of sites. You \ncan define crawler templates in YAML, so you don't need to write any \nPython code for a web crawl job.\nThis project is based on SuperMario (http://github.com/bububa/SuperMario/).\n\n== License ==\nBSD License\nSee 'LICENSE' for details.\n\n== Requirements ==\nPlatform: *nix-like system (Unix, Linux, Mac OS X, etc.)\nPython: 2.5+\nStorage: MongoDB\nOther Python modules:\n    - bububa.SuperMario\n\n== Features ==\n  + YAML-style templates\n  + keywords IDF calculation\n  + keywords COEF calculation\n  + Sitemanager\n  + Smart Crawling\n\n== DEMO ==\nSample YAML file to crawl snipplr.com\n\n#snipplr.yaml\nrun: !SiteCrawler\n    check: !Crawlable\n        yaml_file: sites.yaml\n        label: snipplr\n        logger:\n            filename: snipplr.log\n    crawler: !YAMLStorage\n        yaml_file: sites.yaml\n        label: snipplr\n        method: update_config\n        data: !DetailCrawler\n            sleep: 3\n            proxies: !File\n                method: readlines\n                filename: /proxies\n            pages: !PaginateCrawler\n                proxies: 
!File\n                    method: readlines\n                    filename: /proxies\n                url_pattern: http://snipplr.com/all/page/{NUMBER}\n                start_no: 0\n                end_no: 0\n                multithread: True\n                wrapper: '\u003col class=\"snippets marg\"\u003e([^^].*?)\u003c/ol\u003e'\n                logger:\n                    filename: snipplr.log\n            url_pattern: '/view/\\d+/[^^]*?/$'\n            wrapper:\n                title: '\u003ch1\u003e([^^]*?)\u003c/h1\u003e'\n                language: '\u003cp class=\"nomarg\"\u003e\u003cspan class=\"rgt\"\u003ePublished in: ([^^]*?)\u003c/span\u003e'\n                author: '\u003ch2\u003ePosted By\u003c/h2\u003e\\s+\u003cp\u003e\u003ca[^^]*?\u003e([^^]*?)\u003c/a\u003e'\n                code: '\u003ca rel=\"nofollow\" href=\"/view.php\\?codeview\u0026amp;id=(\\d+)\"\u003e'\n                comment: '\u003cdiv class=\"description\"\u003e([^^]*?)\u003c/div\u003e'\n                tag: '\u003ch2\u003eTagged\u003c/h2\u003e\\s+\u003cp\u003e([^^]*?)\u003c/p\u003e'\n            essential_fields:\n                - title\n                - language\n                - code\n            multithread: True\n            remove_external_duplicate: True\n            logger:\n                filename: snipplr.log\n            page_callback: !Document\n                label: snipplr\n                method: write\n                page: None\n                logger:\n                    filename: snipplr.log\n            furthure: \n                tag: \n                    parser: !Steps\n                        inputs: None\n                        steps: \n                            - !Dict\n                                dictionary: None\n                                method: member\n                                args: \n                                    - tag\n                            - !Regx\n                                string: None\n                    
            pattern: \u003ca[^^]*?\u003e([^^]*?)\u003c/a\u003e\n                                multiple: True\n                code: \n                    parser: !Steps\n                        inputs: None\n                        steps: \n                            - !Dict\n                                dictionary: None\n                                method: member\n                                args: \n                                    - code\n                            - !String\n                                args: None\n                                base_str: 'http://snipplr.com/view.php?codeview\u0026id=%s'\n                            - !Init \n                                inputs: None \n                                obj: !URLCrawler\n                                    urls: None\n                                params: \n                                    save_output: True\n                                    wrapper: '\u003ctextarea[^^]*?class=\"copysource\"\u003e([^^]*?)\u003c/textarea\u003e'\n                            - !Array\n                                arr: None\n                                method: member\n                                args: \n                                    - 0\n                            - !Dict\n                                dictionary: None\n                                method: member\n                                args: \n                                    - wrapper\n\n-------------------------------------------------------------------\n#sites.yaml\nsnipplr: \n    duration: 1800\n    end_no: 1\n    last_updated_at: 1263883143.0\n    start_no: 1\n    step: 1\n\n--------------------------------------------------------------------\n\nRun the crawler in shell\n\npython lego.py -c snipplr.yaml\n\nThe results are stored in mongodb. 
You can use the inserter module to insert the data into a MySQL database.\n\n\n\n  ","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbububa%2Flego","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbububa%2Flego","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbububa%2Flego/lists"}