{"id":13470099,"url":"https://github.com/est/cx-extractor","last_synced_at":"2025-03-26T09:32:25.561Z","repository":{"id":28728806,"uuid":"32249999","full_name":"est/cx-extractor","owner":"est","description":"Automatically exported from code.google.com/p/cx-extractor","archived":false,"fork":false,"pushed_at":"2015-03-15T07:52:52.000Z","size":3472,"stargazers_count":7,"open_issues_count":5,"forks_count":6,"subscribers_count":3,"default_branch":"master","last_synced_at":"2024-10-30T00:52:50.782Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":"CodeYourFuture/first-git-conflict","license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/est.png","metadata":{"files":{"readme":"Readme.txt","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-03-15T07:42:19.000Z","updated_at":"2018-03-26T06:30:56.000Z","dependencies_parsed_at":"2022-09-05T19:20:33.183Z","dependency_job_id":null,"html_url":"https://github.com/est/cx-extractor","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/est%2Fcx-extractor","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/est%2Fcx-extractor/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/est%2Fcx-extractor/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/est%2Fcx-extractor/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/est","download_url":"https://codeload.github.com/est/cx-extractor/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":2
45626165,"owners_count":20646312,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-07-31T16:00:23.145Z","updated_at":"2025-03-26T09:32:24.406Z","avatar_url":"https://github.com/est.png","language":"HTML","funding_links":[],"categories":["HTML"],"sub_categories":[],"readme":"\r\nRecommendations:\r\n\r\n1. To extract entertainment-style pages, especially ones where images break\r\n   the body text into scattered fragments, use the Java version. The Java\r\n   implementation merges multiple body-text fragments, so it handles this\r\n   case well. The drawback is that a small amount of noise may remain at the\r\n   end of the extracted text.\r\n\r\n\r\n2. The Perl and PHP versions make a single pass, keep only the largest line\r\n   block, and do not stitch fragments together. When the body text is badly\r\n   fragmented, part of it may be lost. The advantage is that noise at the\r\n   edges is removed very well.\r\n\r\n\r\n\r\n\r\nIf you have any questions, feel free to contact me at any time :)\r\n****************************************\r\n陈  鑫\r\nEmail: cx3180@gmail.com\r\nBlog:  http://hi.baidu.com/爱心同盟_陈鑫\r\n****************************************","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fest%2Fcx-extractor","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fest%2Fcx-extractor","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fest%2Fcx-extractor/lists"}