{"id":13576268,"url":"https://github.com/elixir-crawly/crawly","last_synced_at":"2025-12-11T23:53:38.231Z","repository":{"id":41300150,"uuid":"174680350","full_name":"elixir-crawly/crawly","owner":"elixir-crawly","description":"Crawly, a high-level web crawling \u0026 scraping framework for Elixir. ","archived":false,"fork":false,"pushed_at":"2024-09-09T07:43:38.000Z","size":2940,"stargazers_count":1020,"open_issues_count":8,"forks_count":117,"subscribers_count":19,"default_branch":"master","last_synced_at":"2025-05-14T08:41:25.398Z","etag":null,"topics":["crawler","crawling","elixir","erlang","extract-data","scraper","scraping","scraping-websites","spider"],"latest_commit_sha":null,"homepage":"https://hexdocs.pm/crawly","language":"Elixir","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/elixir-crawly.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-03-09T10:37:03.000Z","updated_at":"2025-05-13T11:29:32.000Z","dependencies_parsed_at":"2023-11-20T10:39:17.096Z","dependency_job_id":"df298496-253e-47d8-9d72-7306d0e02c09","html_url":"https://github.com/elixir-crawly/crawly","commit_stats":null,"previous_names":["oltarasenko/crawly"],"tags_count":17,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/elixir-crawly%2Fcrawly","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/elixir-crawly%2Fcrawly/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/elixir-crawly%2Fcrawly/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/elixir-crawly%2Fcrawly/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/elixir-crawly","download_url":"https://codeload.github.com/elixir-crawly/crawly/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254436944,"owners_count":22070946,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","crawling","elixir","erlang","extract-data","scraper","scraping","scraping-websites","spider"],"created_at":"2024-08-01T15:01:08.685Z","updated_at":"2025-12-11T23:53:38.171Z","avatar_url":"https://github.com/elixir-crawly.png","language":"Elixir","funding_links":[],"categories":["Elixir","Data Ingestion \u0026 ETL"],"sub_categories":["How to Join"],"readme":"# Crawly\n\n[![Module Version](https://img.shields.io/hexpm/v/crawly.svg)](https://hex.pm/packages/crawly)\n[![Hex Docs](https://img.shields.io/badge/hex-docs-lightgreen.svg)](https://hexdocs.pm/crawly/)\n[![Total Download](https://img.shields.io/hexpm/dt/crawly.svg)](https://hex.pm/packages/crawly)\n[![License](https://img.shields.io/hexpm/l/crawly.svg)](https://github.com/elixir-crawly/crawly/blob/master/LICENSE)\n[![Last Updated](https://img.shields.io/github/last-commit/elixir-crawly/crawly.svg)](https://github.com/elixir-crawly/crawly/commits/master)\n\n## Overview\n\nCrawly is an application framework for crawling web sites and\nextracting structured data which can be used for a wide range of\nuseful applications, like data mining, information processing or\nhistorical archival.\n\n## Requirements\n\n1. Elixir `~\u003e 1.14`\n2. Works on GNU/Linux, Windows, macOS X, and BSD.\n\n\n## Quickstart\n0. Create a new project: `mix new quickstart --sup`\n1. Add Crawly as a dependencies:\n\n   ```elixir\n   # mix.exs\n   defp deps do\n       [\n         {:crawly, \"~\u003e 0.17.2\"},\n         {:floki, \"~\u003e 0.33.0\"}\n       ]\n   end\n   ```\n2. Fetch dependencies: `$ mix deps.get`\n3. Create a spider\n\n   ```elixir\n    # lib/crawly_example/books_to_scrape.ex\n    defmodule BooksToScrape do\n      use Crawly.Spider\n\n      @impl Crawly.Spider\n      def base_url(), do: \"https://books.toscrape.com/\"\n\n      @impl Crawly.Spider\n      def init() do\n        [start_urls: [\"https://books.toscrape.com/\"]]\n      end\n\n      @impl Crawly.Spider\n      def parse_item(response) do\n        # Parse response body to document\n        {:ok, document} = Floki.parse_document(response.body)\n\n        # Create item (for pages where items exists)\n        items =\n          document\n          |\u003e Floki.find(\".product_pod\")\n          |\u003e Enum.map(fn x -\u003e\n            %{\n              title: Floki.find(x, \"h3 a\") |\u003e Floki.attribute(\"title\") |\u003e Floki.text(),\n              price: Floki.find(x, \".product_price .price_color\") |\u003e Floki.text(),\n              url: response.request_url\n            }\n          end)\n\n        next_requests =\n          document\n          |\u003e Floki.find(\".next a\")\n          |\u003e Floki.attribute(\"href\")\n          |\u003e Enum.map(fn url -\u003e\n            Crawly.Utils.build_absolute_url(url, response.request.url)\n            |\u003e Crawly.Utils.request_from_url()\n          end)\n\n        %Crawly.ParsedItem{items: items, requests: next_requests}\n      end\n    end\n   ```\n\n    **New in 0.15.0 :**\n\n    \u003e It's possible to use the command to speed up the spider creation,\n    so you will have a generated file with all needed callbacks:\n    `mix crawly.gen.spider --filepath ./lib/crawly_example/books_to_scrape.ex --spidername BooksToScrape`\n\n\n4. Configure Crawly\n\n   By default, Crawly does not require any configuration. But obviously you will need a configuration for fine tuning the crawls:\n   (in file: `config/config.exs`)\n\n   ```elixir\n\n    import Config\n\n    config :crawly,\n      closespider_timeout: 10,\n      concurrent_requests_per_domain: 8,\n      closespider_itemcount: 100,\n\n      middlewares: [\n        Crawly.Middlewares.DomainFilter,\n        Crawly.Middlewares.UniqueRequest,\n        {Crawly.Middlewares.UserAgent, user_agents: [\"Crawly Bot\"]}\n      ],\n      pipelines: [\n        {Crawly.Pipelines.Validate, fields: [:url, :title, :price]},\n        {Crawly.Pipelines.DuplicatesFilter, item_id: :title},\n        Crawly.Pipelines.JSONEncoder,\n        {Crawly.Pipelines.WriteToFile, extension: \"jl\", folder: \"/tmp\"}\n      ]\n\n   ```\n\n    **New in 0.15.0:**\n\n    \u003e You can generate  example config with the help of the following command:\n   `mix crawly.gen.config`\n\n5. Start the Crawl:\n\n   ```bash\n     iex -S mix run -e \"Crawly.Engine.start_spider(BooksToScrape)\"\n   ```\n\n6. Results can be seen with:\n\n   ```\n   $ cat /tmp/BooksToScrape_\u003ctimestamp\u003e.jl\n   ```\n\n## Running Crawly without Elixir or Elixir projects\n\nIt's possible to run Crawly in a standalone mode, when Crawly is running as a tiny docker container, and spiders are just YMLfiles or elixir modules that are mounted inside.\n\nPlease read more about it here:\n- [Standalone Crawly](./documentation/standalone_crawly.md)\n- [Spiders as YML](./documentation/spiders_in_yml.md)\n\n## Need more help?\n\nPlease use discussions for all conversations related to the project\n\n## Browser rendering\n\nCrawly can be configured in the way that all fetched pages will be browser rendered,\nwhich can be very useful if you need to extract data from pages which has lots\nof asynchronous elements (for example parts loaded by AJAX).\n\nYou can read more here:\n- [Browser Rendering](https://hexdocs.pm/crawly/basic_concepts.html#browser-rendering)\n\n## Simple management UI (New in 0.15.0) {#management-ui}\nCrawly provides a simple management UI by default on the `localhost:4001`\n\nIt allows to:\n - Start spiders\n - Stop spiders\n - Preview scheduled requests\n - View/Download items extracted\n - View/Download logs\n\nNOTE: It's possible to disable the Simple management UI (and rest API) with the\n`start_http_api?: false`  options of Crawly configuration.\n\nYou can choose to run the management UI as a plug in your application.\n\n```elixir\ndefmodule MyApp.Router do\n  use Plug.Router\n\n  ...\n  forward \"/admin\", Crawly.API.Router\n  ...\nend\n```\n\n![Crawly Management UI](docs/crawly_ui.gif)\n\n\n## Experimental UI [Deprecated]\n\nNow don't have a possibility to work on experimental UI built with Phoenix and LiveViews, and keeping it here for\nmainly demo purposes.\n\nThe CrawlyUI project is an add-on that aims to provide an interface for managing and rapidly developing spiders.\nCheckout the code from [GitHub](https://github.com/oltarasenko/crawly_ui)\n\n## Documentation\n\n- [API Reference](https://hexdocs.pm/crawly/api-reference.html#content)\n- [Quickstart](https://hexdocs.pm/crawly/readme.html#quickstart)\n\n## Roadmap\n\nTo be discussed\n\n## Articles\n\n1. Blog post on Erlang Solutions website: https://www.erlang-solutions.com/blog/web-scraping-with-elixir.html\n2. Blog post about using Crawly inside a machine learning project with Tensorflow (Tensorflex): https://www.erlang-solutions.com/blog/how-to-build-a-machine-learning-project-in-elixir.html\n3. Web scraping with Crawly and Elixir. Browser rendering: https://medium.com/@oltarasenko/web-scraping-with-elixir-and-crawly-browser-rendering-afcaacf954e8\n4. Web scraping with Elixir and Crawly. Extracting data behind authentication: https://oltarasenko.medium.com/web-scraping-with-elixir-and-crawly-extracting-data-behind-authentication-a52584e9cf13\n5. [What is web scraping, and why you might want to use it?](https://oltarasenko.medium.com/what-is-web-scraping-and-why-you-might-want-to-use-it-a0e4b621f6d0?sk=3145cceff095523c88e72e3ddb456016)\n6. [Using Elixir and Crawly for price monitoring](https://oltarasenko.medium.com/using-elixir-and-crawly-for-price-monitoring-7364d345fc64?sk=9788899eb8e1d1dd6614d022eda350e8)\n7. [Building a Chrome-based fetcher for Crawly](https://oltarasenko.medium.com/building-a-chrome-based-fetcher-for-crawly-a779e9a8d9d0?sk=2dbb4d39cdf319f01d0fa7c05f9dc9ec)\n\n## Example projects\n\n1. Blog crawler: https://github.com/oltarasenko/crawly-spider-example\n2. E-commerce websites: https://github.com/oltarasenko/products-advisor\n3. Car shops: https://github.com/oltarasenko/crawly-cars\n4. JavaScript based website (Splash example): https://github.com/oltarasenko/autosites\n\n## Contributors\n\nWe would gladly accept your contributions!\n\n## Documentation\nPlease find documentation on the [HexDocs](https://hexdocs.pm/crawly/)\n\n## Production usages\n\nUsing Crawly on production? Please let us know about your case!\n\n## Copyright and License\n\nCopyright (c) 2019 Oleg Tarasenko\n\nLicensed under the Apache License, Version 2.0 (the \"License\");\nyou may not use this file except in compliance with the License.\nYou may obtain a copy of the License at [http://www.apache.org/licenses/LICENSE-2.0](http://www.apache.org/licenses/LICENSE-2.0)\n\nUnless required by applicable law or agreed to in writing, software\ndistributed under the License is distributed on an \"AS IS\" BASIS,\nWITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\nSee the License for the specific language governing permissions and\nlimitations under the License.\n\n\n## How to release:\n\n1. Update version in mix.exs\n2. Update version in quickstart (README.md, this file)\n3. Commit and create a new tag: `git commit \u0026\u0026 git tag 0.xx.0 \u0026\u0026 git push origin master --follow-tags`\n4. Build docs: `mix docs`\n5. Publish hex release: `mix hex.publish`\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Felixir-crawly%2Fcrawly","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Felixir-crawly%2Fcrawly","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Felixir-crawly%2Fcrawly/lists"}