{"id":13446294,"url":"https://github.com/grangier/python-goose","last_synced_at":"2026-04-10T10:04:22.498Z","repository":{"id":2184443,"uuid":"3131959","full_name":"grangier/python-goose","owner":"grangier","description":"Html Content / Article Extractor, web scrapping lib in Python","archived":false,"fork":false,"pushed_at":"2026-03-10T10:24:55.000Z","size":1959,"stargazers_count":4071,"open_issues_count":108,"forks_count":782,"subscribers_count":196,"default_branch":"develop","last_synced_at":"2026-03-10T16:37:48.996Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":"antmicro/enclustra_zynq_linux","license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/grangier.png","metadata":{"files":{"readme":"README.rst","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2012-01-08T20:52:44.000Z","updated_at":"2026-03-10T10:21:09.000Z","dependencies_parsed_at":"2022-07-10T00:46:35.645Z","dependency_job_id":null,"html_url":"https://github.com/grangier/python-goose","commit_stats":null,"previous_names":[],"tags_count":27,"template":false,"template_full_name":null,"purl":"pkg:github/grangier/python-goose","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/grangier%2Fpython-goose","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/grangier%2Fpython-goose/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/grangier%2Fpython-goose/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/grangier%2Fpython-goose/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/grangier","download_url":"https://codeload.github.com/grangier/python-goose/tar.gz/refs/heads/develop","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/grangier%2Fpython-goose/sbom","scorecard":{"id":443500,"data":{"date":"2025-08-11","repo":{"name":"github.com/grangier/python-goose","commit":"09023ec9f5ef26a628a2365616c0a7c864f0ecea"},"scorecard":{"version":"v5.2.1-40-gf6ed084d","commit":"f6ed084d17c9236477efd66e5b258b9d4cc7b389"},"score":1.9,"checks":[{"name":"Packaging","score":-1,"reason":"packaging workflow not detected","details":["Warn: no GitHub/GitLab publishing workflow detected."],"documentation":{"short":"Determines if the project is published as a package that others can easily download, install, easily update, and uninstall.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#packaging"}},{"name":"Dangerous-Workflow","score":-1,"reason":"no workflows found","details":null,"documentation":{"short":"Determines if the project's GitHub Action workflows avoid dangerous patterns.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#dangerous-workflow"}},{"name":"Maintained","score":0,"reason":"0 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project is \"actively maintained\".","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#maintained"}},{"name":"Code-Review","score":1,"reason":"Found 3/18 approved changesets -- score normalized to 1","details":null,"documentation":{"short":"Determines if the project requires human code review before pull requests (aka merge requests) are merged.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#code-review"}},{"name":"Token-Permissions","score":-1,"reason":"No tokens found","details":null,"documentation":{"short":"Determines if the project's workflows follow the principle of least privilege.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#token-permissions"}},{"name":"Binary-Artifacts","score":10,"reason":"no binaries found in the repo","details":null,"documentation":{"short":"Determines if the project has generated executable (binary) artifacts in the source repository.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#binary-artifacts"}},{"name":"CII-Best-Practices","score":0,"reason":"no effort to earn an OpenSSF best practices badge detected","details":null,"documentation":{"short":"Determines if the project has an OpenSSF (formerly CII) Best Practices Badge.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#cii-best-practices"}},{"name":"Pinned-Dependencies","score":-1,"reason":"no dependencies found","details":null,"documentation":{"short":"Determines if the project has declared and pinned the dependencies of its build process.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#pinned-dependencies"}},{"name":"Security-Policy","score":0,"reason":"security policy file not detected","details":["Warn: no security policy file detected","Warn: no security file to analyze","Warn: no security file to analyze","Warn: no security file to analyze"],"documentation":{"short":"Determines if the project has published a security policy.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#security-policy"}},{"name":"Fuzzing","score":0,"reason":"project is not fuzzed","details":["Warn: no fuzzer integrations found"],"documentation":{"short":"Determines if the project uses fuzzing.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#fuzzing"}},{"name":"License","score":10,"reason":"license file detected","details":["Info: project has a license file: LICENSE.txt:0","Info: FSF or OSI recognized license: Apache License 2.0: LICENSE.txt:0"],"documentation":{"short":"Determines if the project has defined a license.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#license"}},{"name":"Signed-Releases","score":-1,"reason":"no releases found","details":null,"documentation":{"short":"Determines if the project cryptographically signs release artifacts.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#signed-releases"}},{"name":"Branch-Protection","score":0,"reason":"branch protection not enabled on development/release branches","details":["Warn: branch protection not enabled for branch 'develop'"],"documentation":{"short":"Determines if the default and release branches are protected with GitHub's branch protection settings.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#branch-protection"}},{"name":"SAST","score":0,"reason":"SAST tool is not run on all commits -- score normalized to 0","details":["Warn: 0 commits out of 16 are checked with a SAST tool"],"documentation":{"short":"Determines if the project uses static code analysis.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#sast"}},{"name":"Vulnerabilities","score":0,"reason":"64 existing vulnerabilities detected","details":["Warn: Project is vulnerable to: GHSA-55x5-fj6c-h6m8","Warn: Project is vulnerable to: PYSEC-2014-9 / GHSA-57qw-cc2g-pv5p","Warn: Project is vulnerable to: PYSEC-2021-19 / GHSA-jq4v-f5q6-mjqq","Warn: Project is vulnerable to: GHSA-pgww-xf46-h92r","Warn: Project is vulnerable to: PYSEC-2022-230 / GHSA-wrxv-2j5q-m38w","Warn: Project is vulnerable to: PYSEC-2018-12 / GHSA-xp26-p53h-6h2p","Warn: Project is vulnerable to: PYSEC-2021-356 / GHSA-2ww3-fxvq-293j","Warn: Project is vulnerable to: PYSEC-2024-167 / GHSA-cgvx-9447-vcch","Warn: Project is vulnerable to: PYSEC-2021-859 / GHSA-f8m6-h2c7-8h9x","Warn: Project is vulnerable to: PYSEC-2019-106 / GHSA-mr7p-25v2-35wr","Warn: Project is vulnerable to: PYSEC-2022-5 / GHSA-rqjh-jp2r-59cj","Warn: Project is vulnerable to: GHSA-3c5c-7235-994j","Warn: Project is vulnerable to: GHSA-3f63-hfp8-52jq","Warn: Project is vulnerable to: PYSEC-2021-41 / GHSA-3wvg-mj6g-m9cv","Warn: Project is vulnerable to: PYSEC-2020-77 / GHSA-3xv8-3j54-hgrp","Warn: Project is vulnerable to: PYSEC-2020-80 / GHSA-43fq-w8qq-v88h","Warn: Project is vulnerable to: GHSA-44wm-f244-xhp3","Warn: Project is vulnerable to: GHSA-4fx9-vc88-q2xc","Warn: Project is vulnerable to: PYSEC-2021-35 / GHSA-57h3-9rgr-c24m","Warn: Project is vulnerable to: PYSEC-2020-172 / GHSA-5gm3-px64-rw72","Warn: Project is vulnerable to: PYSEC-2021-331 / GHSA-7534-mm45-c74v","Warn: Project is vulnerable to: PYSEC-2021-92 / GHSA-7r7m-5h27-29hp","Warn: Project is vulnerable to: PYSEC-2020-78 / GHSA-8843-m7mw-mxqm","Warn: Project is vulnerable to: PYSEC-2023-227 / GHSA-8ghj-p4vj-mr35","Warn: Project is vulnerable to: PYSEC-2014-87 / GHSA-8m9x-pxwq-j236","Warn: Project is vulnerable to: PYSEC-2022-10 / GHSA-8vj2-vxx3-667w","Warn: Project is vulnerable to: PYSEC-2021-36 / GHSA-8xjq-8fcg-g5hw","Warn: Project is vulnerable to: PYSEC-2016-6 / GHSA-8xjv-v9xq-m5h9","Warn: Project is vulnerable to: PYSEC-2021-42 / GHSA-95q3-8gr9-gm8w","Warn: Project is vulnerable to: PYSEC-2022-168 / GHSA-9j59-75qj-795w","Warn: Project is vulnerable to: PYSEC-2014-10 / GHSA-cfmr-38g9-f2h7","Warn: Project is vulnerable to: PYSEC-2020-76 / GHSA-cqhg-xjhh-p8hf","Warn: Project is vulnerable to: PYSEC-2021-40 / GHSA-f4w8-cv6p-x6r5","Warn: Project is vulnerable to: PYSEC-2021-69 / GHSA-f5g8-5qq7-938w","Warn: Project is vulnerable to: PYSEC-2021-139 / GHSA-g6rj-rv7j-xwp4","Warn: Project is vulnerable to: PYSEC-2015-16 / GHSA-h5rf-vgqx-wjv2","Warn: Project is vulnerable to: PYSEC-2016-5 / GHSA-hggx-3h72-49ww","Warn: Project is vulnerable to: PYSEC-2020-84 / GHSA-hj69-c76v-86wr","Warn: Project is vulnerable to: PYSEC-2016-7 / GHSA-hvr8-466p-75rh","Warn: Project is vulnerable to: PYSEC-2015-15 / GHSA-j6f7-g425-4gmx","Warn: Project is vulnerable to: GHSA-j7hp-h8jx-5ppr","Warn: Project is vulnerable to: PYSEC-2019-110 / GHSA-j7mj-748x-7p78","Warn: Project is vulnerable to: GHSA-jgpv-4h4c-xhw3","Warn: Project is vulnerable to: PYSEC-2022-42979 / GHSA-m2vv-5vj5-2hm7","Warn: Project is vulnerable to: PYSEC-2021-37 / GHSA-mvg9-xffr-p774","Warn: Project is vulnerable to: PYSEC-2020-83 / GHSA-p49h-hjvm-jg3h","Warn: Project is vulnerable to: PYSEC-2022-8 / GHSA-pw3c-h7wp-cvhx","Warn: Project is vulnerable to: PYSEC-2021-93 / GHSA-q5hq-fp76-qmrc","Warn: Project is vulnerable to: PYSEC-2020-82 / GHSA-r7rm-8j6h-r933","Warn: Project is vulnerable to: PYSEC-2014-23 / GHSA-r854-96gq-rfg3","Warn: Project is vulnerable to: PYSEC-2016-8 / GHSA-rwr3-c2q8-gm56","Warn: Project is vulnerable to: PYSEC-2020-81 / GHSA-vcqg-3p29-xw73","Warn: Project is vulnerable to: PYSEC-2020-79 / GHSA-vj42-xq3r-hr3r","Warn: Project is vulnerable to: PYSEC-2021-70 / GHSA-vqcj-wrf2-7v73","Warn: Project is vulnerable to: PYSEC-2016-9 / GHSA-w4vg-rf63-f3j3","Warn: Project is vulnerable to: PYSEC-2014-22 / GHSA-x895-2wrm-hvp7","Warn: Project is vulnerable to: PYSEC-2022-9 / GHSA-xrcv-f9gm-v42c","Warn: Project is vulnerable to: PYSEC-2021-137","Warn: Project is vulnerable to: PYSEC-2021-138","Warn: Project is vulnerable to: PYSEC-2021-317","Warn: Project is vulnerable to: PYSEC-2021-38","Warn: Project is vulnerable to: PYSEC-2021-39","Warn: Project is vulnerable to: PYSEC-2021-94","Warn: Project is vulnerable to: PYSEC-2023-175"],"documentation":{"short":"Determines if the project has open, known unfixed vulnerabilities.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#vulnerabilities"}}]},"last_synced_at":"2025-08-19T06:08:13.718Z","repository_id":2184443,"created_at":"2025-08-19T06:08:13.718Z","updated_at":"2025-08-19T06:08:13.718Z"},"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31637749,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-10T07:40:12.752Z","status":"ssl_error","status_checked_at":"2026-04-10T07:40:11.664Z","response_time":98,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-07-31T05:00:50.736Z","updated_at":"2026-04-10T10:04:22.492Z","avatar_url":"https://github.com/grangier.png","language":"HTML","readme":"Python-Goose - Article Extractor \n===============================================\n\nIntro\n-----\n\nGoose was originally an article extractor written in Java that has most\nrecently (Aug2011) been converted to a `scala project \u003chttps://github.com/GravityLabs/goose\u003e`_.\n\nThis is a complete rewrite in Python. The aim of the software is to\ntake any news article or article-type web page and not only extract what\nis the main body of the article but also all meta data and most probable\nimage candidate.\n\nGoose will try to extract the following information:\n\n-  Main text of an article\n-  Main image of article\n-  Any YouTube/Vimeo movies embedded in article\n-  Meta Description\n-  Meta tags\n\nThe Python version was rewritten by:\n\n-  Xavier Grangier\n\nLicensing\n---------\n\nIf you find Goose useful or have issues please drop me a line. I'd love\nto hear how you're using it or what features should be improved.\n\nGoose is licensed by Gravity.com under the Apache 2.0 license; see the\nLICENSE file for more details.\n\nSetup\n-----\n\n::\n\n    mkvirtualenv --no-site-packages goose\n    git clone https://github.com/grangier/python-goose.git\n    cd python-goose\n    pip install -r requirements.txt\n    python setup.py install\n\nTake it for a spin\n------------------\n\n::\n\n    \u003e\u003e\u003e from goose import Goose\n    \u003e\u003e\u003e url = 'http://edition.cnn.com/2012/02/22/world/europe/uk-occupy-london/index.html?hpt=ieu_c2'\n    \u003e\u003e\u003e g = Goose()\n    \u003e\u003e\u003e article = g.extract(url=url)\n    \u003e\u003e\u003e article.title\n    u'Occupy London loses eviction fight'\n    \u003e\u003e\u003e article.meta_description\n    \"Occupy London protesters who have been camped outside the landmark St. Paul's Cathedral for the past four months lost their court bid to avoid eviction Wednesday in a decision made by London's Court of Appeal.\"\n    \u003e\u003e\u003e article.cleaned_text[:150]\n    (CNN) -- Occupy London protesters who have been camped outside the landmark St. Paul's Cathedral for the past four months lost their court bid to avoi\n    \u003e\u003e\u003e article.top_image.src\n    http://i2.cdn.turner.com/cnn/dam/assets/111017024308-occupy-london-st-paul-s-cathedral-story-top.jpg\n\nConfiguration\n-------------\n\nThere are two ways to pass configuration to goose. The first one is to\npass goose a Configuration() object. The second one is to pass a\nconfiguration dict.\n\nFor instance, if you want to change the userAgent used by Goose just\npass:\n\n::\n\n    \u003e\u003e\u003e g = Goose({'browser_user_agent': 'Mozilla'})\n\nSwitching parsers : Goose can now be used with lxml html parser or lxml\nsoup parser. By default the html parser is used. If you want to use the\nsoup parser pass it in the configuration dict :\n\n::\n\n    \u003e\u003e\u003e g = Goose({'browser_user_agent': 'Mozilla', 'parser_class':'soup'})\n\nGoose is now language aware\n---------------------------\n\nFor example, scraping a Spanish content page with correct meta language\ntags:\n\n::\n\n    \u003e\u003e\u003e from goose import Goose\n    \u003e\u003e\u003e url = 'http://sociedad.elpais.com/sociedad/2012/10/27/actualidad/1351332873_157836.html'\n    \u003e\u003e\u003e g = Goose()\n    \u003e\u003e\u003e article = g.extract(url=url)\n    \u003e\u003e\u003e article.title\n    u'Las listas de espera se agravan'\n    \u003e\u003e\u003e article.cleaned_text[:150]\n    u'Los recortes pasan factura a los pacientes. De diciembre de 2010 a junio de 2012 las listas de espera para operarse aumentaron un 125%. Hay m\\xe1s ciudad'\n\nSome pages don't have correct meta language tags, you can force it using\nconfiguration :\n\n::\n\n    \u003e\u003e\u003e from goose import Goose\n    \u003e\u003e\u003e url = 'http://www.elmundo.es/elmundo/2012/10/28/espana/1351388909.html'\n    \u003e\u003e\u003e g = Goose({'use_meta_language': False, 'target_language':'es'})\n    \u003e\u003e\u003e article = g.extract(url=url)\n    \u003e\u003e\u003e article.cleaned_text[:150]\n    u'Importante golpe a la banda terrorista ETA en Francia. La Guardia Civil ha detenido en un hotel de Macon, a 70 kil\\xf3metros de Lyon, a Izaskun Lesaka y '\n\nPassing {'use\\_meta\\_language': False, 'target\\_language':'es'} will\nforcibly select Spanish.\n\n\nVideo extraction\n----------------\n\n::\n\n    \u003e\u003e\u003e import goose\n    \u003e\u003e\u003e url = 'http://www.liberation.fr/politiques/2013/08/12/journee-de-jeux-pour-ayrault-dans-les-jardins-de-matignon_924350'\n    \u003e\u003e\u003e g = goose.Goose({'target_language':'fr'})\n    \u003e\u003e\u003e article = g.extract(url=url)\n    \u003e\u003e\u003e article.movies\n    [\u003cgoose.videos.videos.Video object at 0x25f60d0\u003e]\n    \u003e\u003e\u003e article.movies[0].src\n    'http://sa.kewego.com/embed/vp/?language_code=fr\u0026playerKey=1764a824c13c\u0026configKey=dcc707ec373f\u0026suffix=\u0026sig=9bc77afb496s\u0026autostart=false'\n    \u003e\u003e\u003e article.movies[0].embed_code\n    '\u003ciframe src=\"http://sa.kewego.com/embed/vp/?language_code=fr\u0026amp;playerKey=1764a824c13c\u0026amp;configKey=dcc707ec373f\u0026amp;suffix=\u0026amp;sig=9bc77afb496s\u0026amp;autostart=false\" frameborder=\"0\" scrolling=\"no\" width=\"476\" height=\"357\"/\u003e'\n    \u003e\u003e\u003e article.movies[0].embed_type\n    'iframe'\n    \u003e\u003e\u003e article.movies[0].width\n    '476'\n    \u003e\u003e\u003e article.movies[0].height\n    '357'\n\n\nGoose in Chinese\n----------------\n\nSome users want to use Goose for Chinese content. Chinese word\nsegmentation is way more difficult to deal with than occidental\nlanguages. Chinese needs a dedicated StopWord analyser that need to be\npassed to the config object.\n\n::\n\n    \u003e\u003e\u003e from goose import Goose\n    \u003e\u003e\u003e from goose.text import StopWordsChinese\n    \u003e\u003e\u003e url  = 'http://www.bbc.co.uk/zhongwen/simp/chinese_news/2012/12/121210_hongkong_politics.shtml'\n    \u003e\u003e\u003e g = Goose({'stopwords_class': StopWordsChinese})\n    \u003e\u003e\u003e article = g.extract(url=url)\n    \u003e\u003e\u003e print article.cleaned_text[:150]\n    香港行政长官梁振英在各方压力下就其大宅的违章建筑（僭建）问题到立法会接受质询，并向香港民众道歉。\n\n    梁振英在星期二（12月10日）的答问大会开始之际在其演说中道歉，但强调他在违章建筑问题上没有隐瞒的意图和动机。\n\n    一些亲北京阵营议员欢迎梁振英道歉，且认为应能获得香港民众接受，但这些议员也质问梁振英有\n\nGoose in Arabic\n---------------\n\nIn order to use Goose in Arabic you have to use the StopWordsArabic\nclass.\n\n::\n\n    \u003e\u003e\u003e from goose import Goose\n    \u003e\u003e\u003e from goose.text import StopWordsArabic\n    \u003e\u003e\u003e url = 'http://arabic.cnn.com/2013/middle_east/8/3/syria.clashes/index.html'\n    \u003e\u003e\u003e g = Goose({'stopwords_class': StopWordsArabic})\n    \u003e\u003e\u003e article = g.extract(url=url)\n    \u003e\u003e\u003e print article.cleaned_text[:150]\n    دمشق، سوريا (CNN) -- أكدت جهات سورية معارضة أن فصائل مسلحة معارضة لنظام الرئيس بشار الأسد وعلى صلة بـ\"الجيش الحر\" تمكنت من السيطرة على مستودعات للأسل\n\n\nGoose in Korean\n----------------\n\nIn order to use Goose in Korean you have to use the StopWordsKorean\nclass.\n\n::\n\n    \u003e\u003e\u003e from goose import Goose\n    \u003e\u003e\u003e from goose.text import StopWordsKorean\n    \u003e\u003e\u003e url='http://news.donga.com/3/all/20131023/58406128/1'\n    \u003e\u003e\u003e g = Goose({'stopwords_class':StopWordsKorean})\n    \u003e\u003e\u003e article = g.extract(url=url)\n    \u003e\u003e\u003e print article.cleaned_text[:150]\n    경기도 용인에 자리 잡은 민간 시험인증 전문기업 ㈜디지털이엠씨(www.digitalemc.com). \n    14년째 세계 각국의 통신·안전·전파 규격 시험과 인증 한 우물만 파고 있는 이 회사 박채규 대표가 만나기로 한 주인공이다. \n    그는 전기전자·무선통신·자동차 전장품 분야에\n\n\nKnown issues\n------------\n\n- There are some issues with unicode URLs.\n- Cookie handling : Some websites need cookie handling. At the moment the only work around is to use the raw_html extraction. For instance:\n\n    \u003e\u003e\u003e import urllib2\n    \u003e\u003e\u003e import goose\n    \u003e\u003e\u003e url = \"http://www.nytimes.com/2013/08/18/world/middleeast/pressure-by-us-failed-to-sway-egypts-leaders.html?hp\"\n    \u003e\u003e\u003e opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())\n    \u003e\u003e\u003e response = opener.open(url)\n    \u003e\u003e\u003e raw_html = response.read()\n    \u003e\u003e\u003e g = goose.Goose()\n    \u003e\u003e\u003e a = g.extract(raw_html=raw_html)\n    \u003e\u003e\u003e a.cleaned_text\n    u'CAIRO \\u2014 For a moment, at least, American and European diplomats trying to defuse the volatile standoff in Egypt thought they had a breakthrough.\\n\\nAs t'\n\nTODO\n----\n\n-  Video html5 tag extraction\n\n\n.. |Build Status| image:: https://travis-ci.org/grangier/python-goose.png?branch=develop   :target: https://travis-ci.org/grangier/python-goose\n","funding_links":[],"categories":["HTML","Web Content Extracting","资源列表","HarmonyOS","Awesome Python","Web Scraping \u0026 Crawling"],"sub_categories":["网页内容提取","Windows Manager","Web Content Extracting"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgrangier%2Fpython-goose","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgrangier%2Fpython-goose","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgrangier%2Fpython-goose/lists"}