{"id":26742736,"url":"https://github.com/scrapfly/python-scrapfly","last_synced_at":"2025-04-14T17:49:23.424Z","repository":{"id":52380555,"uuid":"301017362","full_name":"scrapfly/python-scrapfly","owner":"scrapfly","description":"Scrapfly Python SDK for headless browsers and proxy rotation","archived":false,"fork":false,"pushed_at":"2025-01-29T14:33:48.000Z","size":649,"stargazers_count":41,"open_issues_count":1,"forks_count":11,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-04-14T17:49:21.058Z","etag":null,"topics":["crawler","headless-browser","python","scraper","scraping","scraping-api","sdk","web-scraper","web-scraping"],"latest_commit_sha":null,"homepage":"https://scrapfly.io/docs/sdk/python","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/scrapfly.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-10-04T01:28:34.000Z","updated_at":"2025-04-09T17:39:09.000Z","dependencies_parsed_at":"2023-01-31T07:30:45.989Z","dependency_job_id":"dcefbf09-f92d-47b9-9057-a4f83a572231","html_url":"https://github.com/scrapfly/python-scrapfly","commit_stats":null,"previous_names":["scrapfly/python-sdk"],"tags_count":42,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scrapfly%2Fpython-scrapfly","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scrapfly%2Fpython-scrapfly/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scrapfly%2Fpython-scrapfly/releases","manifests_url":"https:
//repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scrapfly%2Fpython-scrapfly/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/scrapfly","download_url":"https://codeload.github.com/scrapfly/python-scrapfly/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248931064,"owners_count":21185109,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","headless-browser","python","scraper","scraping","scraping-api","sdk","web-scraper","web-scraping"],"created_at":"2025-03-28T06:19:46.471Z","updated_at":"2025-04-14T17:49:23.397Z","avatar_url":"https://github.com/scrapfly.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Scrapfly SDK\n\n## Installation\n\n`pip install scrapfly-sdk`\n\nYou can also install extra dependencies:\n\n* `pip install \"scrapfly-sdk[seepdup]\"` for performance improvement\n* `pip install \"scrapfly-sdk[concurrency]\"` for concurrency out of the box (asyncio / thread)\n* `pip install \"scrapfly-sdk[scrapy]\"` for scrapy integration\n* `pip install \"scrapfly-sdk[all]\"` for everything\n\nTo use the built-in HTML parser (via the `ScrapeApiResponse.selector` property), either [parsel](https://pypi.org/project/parsel/) or [scrapy](https://pypi.org/project/Scrapy/) must also be installed.\n\nFor usage reference and examples, see the `/examples` folder in this repository.\n\nThis SDK covers the following Scrapfly API endpoints:\n\n* [Web Scraping API](https://scrapfly.io/docs/onboarding#web-scraping-api)\n* 
[Extraction API](https://scrapfly.io/docs/onboarding#extraction-api)\n* [Screenshot API](https://scrapfly.io/docs/onboarding#screenshot-api)\n\n## Integrations  \n\nThe Scrapfly Python SDK is integrated with [LlamaIndex](https://www.llamaindex.ai/) and [LangChain](https://www.langchain.com/). Both frameworks allow training Large Language Models (LLMs) using augmented context.\n\nThis augmented context is approached by training LLMs on top of private or domain-specific data for common use cases:\n- Question-Answering Chatbots (commonly referred to as RAG systems, which stands for \"Retrieval-Augmented Generation\")\n- Document Understanding and Extraction\n- Autonomous Agents that can perform research and take actions\n\u003cbr\u003e  \n\nIn the context of web scraping, web page data can be extracted as Text or Markdown using [Scrapfly's format feature](https://scrapfly.io/docs/scrape-api/specification#api_param_format) to train LLMs with the scraped data.\n\n### LlamaIndex\n\n#### Installation\nInstall `llama-index`, `llama-index-readers-web`, and `scrapfly-sdk` using pip:\n```shell\npip install llama-index llama-index-readers-web scrapfly-sdk\n```\n\n#### Usage\nScrapfly is available in LlamaIndex as a [data connector](https://docs.llamaindex.ai/en/stable/module_guides/loading/connector/), known as a `Reader`. This reader gathers web page data into a `Document` representation, which can be used with the LLM directly. Below is an example of building a RAG system using LlamaIndex and scraped data. 
See the [LlamaIndex use cases](https://docs.llamaindex.ai/en/stable/use_cases/) for more.\n```python\nimport os\n\nfrom llama_index.readers.web import ScrapflyReader\nfrom llama_index.core import VectorStoreIndex\n\n# Initialize ScrapflyReader with your Scrapfly API key\nscrapfly_reader = ScrapflyReader(\n    api_key=\"Your Scrapfly API key\",  # Get your API key from https://www.scrapfly.io/\n    ignore_scrape_failures=True,  # Ignore unprocessable web pages and log their exceptions\n)\n\n# Load documents from URLs as markdown\ndocuments = scrapfly_reader.load_data(\n    urls=[\"https://web-scraping.dev/products\"]\n)\n\n# After creating the documents, index them for querying with an LLM\n# LlamaIndex uses OpenAI by default; other options can be found in the examples directory:\n# https://docs.llamaindex.ai/en/stable/examples/llm/openai/\n\n# Add your OpenAI key (requires a paid subscription) from: https://platform.openai.com/api-keys/\nos.environ['OPENAI_API_KEY'] = \"Your OpenAI Key\"\nindex = VectorStoreIndex.from_documents(documents)\nquery_engine = index.as_query_engine()\n\nresponse = query_engine.query(\"What is the flavor of the dark energy potion?\")\nprint(response)\n# \"The flavor of the dark energy potion is bold cherry cola.\"\n```\n\nThe `load_data` method accepts a `scrape_config` dictionary to set the desired Scrapfly API parameters:\n```python\nfrom llama_index.readers.web import ScrapflyReader\n\n# Initialize ScrapflyReader with your Scrapfly API key\nscrapfly_reader = ScrapflyReader(\n    api_key=\"Your Scrapfly API key\",  # Get your API key from https://www.scrapfly.io/\n    ignore_scrape_failures=True,  # Ignore unprocessable web pages and log their exceptions\n)\n\nscrapfly_scrape_config = {\n    \"asp\": True,  # Bypass scraping blocking and antibot solutions, like Cloudflare\n    \"render_js\": True,  # Enable JavaScript rendering with a cloud headless browser\n    \"proxy_pool\": \"public_residential_pool\",  # Select a proxy pool (datacenter or residential)\n   
 \"country\": \"us\",  # Select a proxy location\n    \"auto_scroll\": True,  # Auto scroll the page\n    \"js\": \"\",  # Execute custom JavaScript code by the headless browser\n}\n\n# Load documents from URLs as markdown\ndocuments = scrapfly_reader.load_data(\n    urls=[\"https://web-scraping.dev/products\"],\n    scrape_config=scrapfly_scrape_config,  # Pass the scrape config\n    scrape_format=\"markdown\",  # The scrape result format, either `markdown` (default) or `text`\n)\n```\n\n### LangChain\n\n#### Installation\nInstall `langchain`, `langchain-community`, and `scrapfly-sdk` using pip:\n```shell\npip install langchain langchain-community scrapfly-sdk\n```\n\n#### Usage\nScrapfly is available in LangChain as a [document loader](https://python.langchain.com/v0.2/docs/concepts/#document-loaders), known as a `Loader`. This loader gathers web page data into a `Document` representation, which can be used with the LLM after a few operations. Below is an example of building a RAG system with LangChain using scraped data. See the [LangChain tutorials](https://python.langchain.com/v0.2/docs/tutorials/) for further use cases.\n```python\nimport os\n\nfrom langchain import hub # pip install langchainhub\nfrom langchain_chroma import Chroma # pip install langchain_chroma\nfrom langchain_core.runnables import RunnablePassthrough\nfrom langchain_core.output_parsers import StrOutputParser\nfrom langchain_openai import OpenAIEmbeddings, ChatOpenAI # pip install langchain_openai\nfrom langchain_text_splitters import RecursiveCharacterTextSplitter # pip install langchain_text_splitters\nfrom langchain_community.document_loaders import ScrapflyLoader\n\n\nscrapfly_loader = ScrapflyLoader(\n    [\"https://web-scraping.dev/products\"],\n    api_key=\"Your Scrapfly API key\",  # Get your API key from https://www.scrapfly.io/\n    continue_on_failure=True,  # Ignore unprocessable web pages and log their exceptions\n)\n\n# Load documents from URLs as markdown\ndocuments = 
scrapfly_loader.load()\n\n# This example uses OpenAI. For more see: https://python.langchain.com/v0.2/docs/integrations/platforms/\nos.environ[\"OPENAI_API_KEY\"] = \"Your OpenAI key\"\n\n# Create a retriever\ntext_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)\nsplits = text_splitter.split_documents(documents)\nvectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())\nretriever = vectorstore.as_retriever()\n\ndef format_docs(docs):\n    return \"\\n\\n\".join(doc.page_content for doc in docs)\n\nmodel = ChatOpenAI()\nprompt = hub.pull(\"rlm/rag-prompt\")\n\nrag_chain = (\n    {\"context\": retriever | format_docs, \"question\": RunnablePassthrough()}\n    | prompt\n    | model\n    | StrOutputParser()\n)\n\nresponse = rag_chain.invoke(\"What is the flavor of the dark energy potion?\")\nprint(response)\n# \"The flavor of the Dark Energy Potion is bold cherry cola.\"\n```\n\nTo use the full set of Scrapfly features with LangChain, pass a `scrape_config` dictionary to the `ScrapflyLoader`:\n```python\nfrom langchain_community.document_loaders import ScrapflyLoader\n\nscrapfly_scrape_config = {\n    \"asp\": True,  # Bypass scraping blocking and antibot solutions, like Cloudflare\n    \"render_js\": True,  # Enable JavaScript rendering with a cloud headless browser\n    \"proxy_pool\": \"public_residential_pool\",  # Select a proxy pool (datacenter or residential)\n    \"country\": \"us\",  # Select a proxy location\n    \"auto_scroll\": True,  # Auto scroll the page\n    \"js\": \"\",  # Execute custom JavaScript code by the headless browser\n}\n\nscrapfly_loader = ScrapflyLoader(\n    [\"https://web-scraping.dev/products\"],\n    api_key=\"Your Scrapfly API key\",  # Get your API key from https://www.scrapfly.io/\n    continue_on_failure=True,  # Ignore unprocessable web pages and log their exceptions\n    scrape_config=scrapfly_scrape_config,  # Pass the scrape_config object\n    scrape_format=\"markdown\",  # The scrape 
result format, either `markdown` (default) or `text`\n)\n\n# Load documents from URLs as markdown\ndocuments = scrapfly_loader.load()\nprint(documents)\n```\n\n## Get Your API Key\n\nYou can create a free account on [Scrapfly](https://scrapfly.io/register) to get your API Key.\n\n* [Usage](https://scrapfly.io/docs/sdk/python)\n* [Python API](https://scrapfly.github.io/python-scrapfly/scrapfly)\n* [Open API 3 Spec](https://scrapfly.io/docs/openapi#get-/scrape)\n* [Scrapy Integration](https://scrapfly.io/docs/sdk/scrapy)\n\n## Migration\n\n### Migrate from 0.7.x to 0.8\n\nThe `asyncio-pool` dependency has been dropped.\n\n`scrapfly.concurrent_scrape` is now an async generator. If the concurrency is `None` or not defined, the max concurrency allowed by\nyour current subscription is used.\n\n```python\n    async for result in scrapfly.concurrent_scrape(concurrency=10, scrape_configs=[ScrapeConfig(...), ...]):\n        print(result)\n```\n\nThe `brotli` argument is deprecated and will be removed in the next minor release. In most cases it offers no size benefit\nover gzip and uses more CPU.\n\n### What's new\n\n#### 0.8.x\n\n* Better error logging\n* Improved concurrent scraping with asyncio\n* Scrapy media pipelines are now supported out of the box\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fscrapfly%2Fpython-scrapfly","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fscrapfly%2Fpython-scrapfly","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fscrapfly%2Fpython-scrapfly/lists"}