{"id":13413984,"url":"https://github.com/slotix/dataflowkit","last_synced_at":"2026-01-16T18:05:44.459Z","repository":{"id":38150903,"uuid":"81462386","full_name":"slotix/dataflowkit","owner":"slotix","description":"Extract structured data from web sites. Web sites scraping.  ","archived":false,"fork":false,"pushed_at":"2023-03-07T00:03:55.000Z","size":4834,"stargazers_count":654,"open_issues_count":4,"forks_count":80,"subscribers_count":24,"default_branch":"master","last_synced_at":"2024-07-31T20:53:13.500Z","etag":null,"topics":["cdp","chrome-fetcher","crawling","extract-data","go","golang","golang-library","headless","scraper","scraping","scraping-websites"],"latest_commit_sha":null,"homepage":"https://dataflowkit.com","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/slotix.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-02-09T15:08:15.000Z","updated_at":"2024-07-21T08:26:18.000Z","dependencies_parsed_at":"2024-06-18T17:11:08.934Z","dependency_job_id":null,"html_url":"https://github.com/slotix/dataflowkit","commit_stats":{"total_commits":741,"total_committers":5,"mean_commits":148.2,"dds":"0.17543859649122806","last_synced_commit":"d33463d17312d648b7f0445313183d1d3884f050"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/slotix/dataflowkit","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/slotix%2Fdataflowkit","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/slotix%2Fdataflo
wkit/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/slotix%2Fdataflowkit/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/slotix%2Fdataflowkit/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/slotix","download_url":"https://codeload.github.com/slotix/dataflowkit/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/slotix%2Fdataflowkit/sbom","scorecard":{"id":832017,"data":{"date":"2025-08-11","repo":{"name":"github.com/slotix/dataflowkit","commit":"d33463d17312d648b7f0445313183d1d3884f050"},"scorecard":{"version":"v5.2.1-40-gf6ed084d","commit":"f6ed084d17c9236477efd66e5b258b9d4cc7b389"},"score":1.6,"checks":[{"name":"Code-Review","score":0,"reason":"Found 0/30 approved changesets -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project requires human code review before pull requests (aka merge requests) are merged.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#code-review"}},{"name":"Packaging","score":-1,"reason":"packaging workflow not detected","details":["Warn: no GitHub/GitLab publishing workflow detected."],"documentation":{"short":"Determines if the project is published as a package that others can easily download, install, easily update, and uninstall.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#packaging"}},{"name":"Dangerous-Workflow","score":-1,"reason":"no workflows found","details":null,"documentation":{"short":"Determines if the project's GitHub Action workflows avoid dangerous patterns.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#dangerous-workflow"}},{"name":"Token-Permissions","score":-1,"reason":"No tokens found","details":null,"documentation":{"short":"Determines if the project's workflows follow 
the principle of least privilege.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#token-permissions"}},{"name":"Maintained","score":0,"reason":"0 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project is \"actively maintained\".","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#maintained"}},{"name":"SAST","score":0,"reason":"no SAST tool detected","details":["Warn: no pull requests merged into dev branch"],"documentation":{"short":"Determines if the project uses static code analysis.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#sast"}},{"name":"Binary-Artifacts","score":10,"reason":"no binaries found in the repo","details":null,"documentation":{"short":"Determines if the project has generated executable (binary) artifacts in the source repository.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#binary-artifacts"}},{"name":"CII-Best-Practices","score":0,"reason":"no effort to earn an OpenSSF best practices badge detected","details":null,"documentation":{"short":"Determines if the project has an OpenSSF (formerly CII) Best Practices Badge.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#cii-best-practices"}},{"name":"Security-Policy","score":0,"reason":"security policy file not detected","details":["Warn: no security policy file detected","Warn: no security file to analyze","Warn: no security file to analyze","Warn: no security file to analyze"],"documentation":{"short":"Determines if the project has published a security policy.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#security-policy"}},{"name":"License","score":10,"reason":"license file 
detected","details":["Info: project has a license file: LICENSE:0","Info: FSF or OSI recognized license: BSD 3-Clause \"New\" or \"Revised\" License: LICENSE:0"],"documentation":{"short":"Determines if the project has defined a license.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#license"}},{"name":"Fuzzing","score":0,"reason":"project is not fuzzed","details":["Warn: no fuzzer integrations found"],"documentation":{"short":"Determines if the project uses fuzzing.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#fuzzing"}},{"name":"Signed-Releases","score":-1,"reason":"no releases found","details":null,"documentation":{"short":"Determines if the project cryptographically signs release artifacts.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#signed-releases"}},{"name":"Branch-Protection","score":0,"reason":"branch protection not enabled on development/release branches","details":["Warn: branch protection not enabled for branch 'master'"],"documentation":{"short":"Determines if the default and release branches are protected with GitHub's branch protection settings.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#branch-protection"}},{"name":"Pinned-Dependencies","score":0,"reason":"dependency not pinned by hash detected -- score normalized to 0","details":["Warn: containerImage not pinned by hash: cmd/fetch.d/Dockerfile:1: pin your Docker image by updating alpine:latest to alpine:latest@sha256:4bcff63911fcb4448bd4fdacec207030997caf25e9bea4045fa6c8c44de311d1","Warn: containerImage not pinned by hash: cmd/parse.d/Dockerfile:1: pin your Docker image by updating alpine:latest to alpine:latest@sha256:4bcff63911fcb4448bd4fdacec207030997caf25e9bea4045fa6c8c44de311d1","Warn: containerImage not pinned by hash: testserver/Dockerfile:1: pin your Docker image 
by updating alpine:latest to alpine:latest@sha256:4bcff63911fcb4448bd4fdacec207030997caf25e9bea4045fa6c8c44de311d1","Info:   0 out of   3 containerImage dependencies pinned"],"documentation":{"short":"Determines if the project has declared and pinned the dependencies of its build process.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#pinned-dependencies"}},{"name":"Vulnerabilities","score":0,"reason":"19 existing vulnerabilities detected","details":["Warn: Project is vulnerable to: GO-2020-0019 / GHSA-3xh2-74w9-5vxm","Warn: Project is vulnerable to: GO-2022-0536 / GHSA-39qc-96h7-956f / GHSA-hgr8-6h9x-f7q9","Warn: Project is vulnerable to: GO-2022-0236 / GHSA-h86h-8ppg-mxmh","Warn: Project is vulnerable to: GO-2021-0238 / GHSA-83g2-8m93-v3w7","Warn: Project is vulnerable to: GO-2022-0288","Warn: Project is vulnerable to: GO-2022-0969 / GHSA-69cg-p879-7622","Warn: Project is vulnerable to: GO-2022-1144 / GHSA-xrjj-mj9h-534m","Warn: Project is vulnerable to: GO-2023-1571 / GHSA-vvpx-j8f3-3w6h","Warn: Project is vulnerable to: GO-2023-1988 / GHSA-2wrh-6pvc-2jm9","Warn: Project is vulnerable to: GO-2023-2102 / GHSA-4374-p667-p6c8","Warn: Project is vulnerable to: GHSA-qppj-fm5r-hxr3","Warn: Project is vulnerable to: GO-2024-2687 / GHSA-4v7x-pqxf-cx7m","Warn: Project is vulnerable to: GO-2024-3333","Warn: Project is vulnerable to: GO-2025-3503 / GHSA-qxp5-gwg8-xv66","Warn: Project is vulnerable to: GO-2025-3595 / GHSA-vvgc-356p-c3xw","Warn: Project is vulnerable to: GO-2022-0493 / GHSA-p782-xgp4-8hr8","Warn: Project is vulnerable to: GO-2020-0015 / GHSA-5rcv-m4m3-hfh7","Warn: Project is vulnerable to: GO-2021-0113 / GHSA-ppp9-7jff-5vj2","Warn: Project is vulnerable to: GO-2022-1059 / GHSA-69ch-w2m2-3vjp"],"documentation":{"short":"Determines if the project has open, known unfixed 
vulnerabilities.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#vulnerabilities"}}]},"last_synced_at":"2025-08-23T17:59:32.805Z","repository_id":38150903,"created_at":"2025-08-23T17:59:32.805Z","updated_at":"2025-08-23T17:59:32.805Z"},"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28480516,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-16T11:59:17.896Z","status":"ssl_error","status_checked_at":"2026-01-16T11:55:55.838Z","response_time":107,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cdp","chrome-fetcher","crawling","extract-data","go","golang","golang-library","headless","scraper","scraping","scraping-websites"],"created_at":"2024-07-30T20:01:54.426Z","updated_at":"2026-01-16T18:05:44.418Z","avatar_url":"https://github.com/slotix.png","language":"Go","readme":"# Dataflow kit\n\n![alt tag](https://raw.githubusercontent.com/slotix/dataflowkit/master/images/logo-whitebg.png)\n\n[![Build Status](https://travis-ci.org/slotix/dataflowkit.svg?branch=master)](https://travis-ci.org/slotix/dataflowkit)\n[![GoDoc](https://godoc.org/github.com/slotix/dataflowkit?status.svg)](https://godoc.org/github.com/slotix/dataflowkit)\n[![Go Report 
Card](https://goreportcard.com/badge/github.com/slotix/dataflowkit)](https://goreportcard.com/report/github.com/slotix/dataflowkit)\n[![codecov](https://codecov.io/gh/slotix/dataflowkit/branch/master/graph/badge.svg)](https://codecov.io/gh/slotix/dataflowkit)\n\n\nDataflow kit (\"DFK\") is a web scraping framework for Gophers. It extracts data from web pages following the specified CSS selectors.\n\nYou can use it in many ways for data mining, data processing or archiving.\n\n## The Web Scraping Pipeline\nA web scraping pipeline consists of three general components:\n\n- **Downloading** an HTML web page (Fetch Service)\n- **Parsing** an HTML page and retrieving the data we're interested in (Parse Service)\n- **Encoding** parsed data to CSV, MS Excel, JSON, [JSON Lines](https://hackernoon.com/json-lines-format-76353b4e588d) or XML format.\n\n## Fetch service\nThe **fetch.d** server downloads the content of HTML web pages. \nDepending on the fetcher type, web page content is downloaded using either the Base fetcher or the Chrome fetcher. \n\nThe Base fetcher uses the standard Go HTTP client to fetch pages as is. \nIt is faster than the Chrome fetcher, but it cannot render dynamic JavaScript-driven web pages. \n\nThe Chrome fetcher is intended for rendering dynamic JavaScript-based content. It sends requests to Chrome running in headless mode.  \n\nA fetched web page is passed to the parse.d service. \n\n## Parse service\n**parse.d** is the service that extracts data from a downloaded web page following the rules listed in a JSON configuration file. Extracted data is returned in CSV, MS Excel, JSON or XML format.\n\n*Note: Sometimes the Parse service cannot extract data from pages retrieved by the default Base fetcher. Empty results may be returned while parsing JavaScript-generated pages. The Parse service then automatically attempts to force the Chrome fetcher to render the same dynamic JavaScript-driven content. 
Have a look at https://scrape.dataflowkit.com/persons/page-0, which is a sample JavaScript-driven web page.*   \n\n\n## Dataflow kit benefits:\n\n- Scraping of JavaScript-generated pages;\n- Data extraction from paginated websites;\n- Processing of infinitely scrolled pages;\n- Scraping of websites behind a login form;\n- Cookie and session handling;\n- Following links and processing of detail pages;\n- Managing delays between requests per domain; \n- Following robots.txt directives; \n- Saving intermediate data in Diskv or MongoDB. The storage interface is flexible enough to add more storage types easily;\n- Encoding results to CSV, MS Excel, JSON (Lines) or XML format;\n\n- Dataflow kit is fast. It takes about 4-6 seconds to fetch and then parse 50 pages.\n- Dataflow kit is suitable for processing quite large volumes of data. Our tests show that parsing approximately 4 million pages takes about 7 hours. \n\n## Installation\n\n```\ngo get -u github.com/slotix/dataflowkit\n```\n\n## Usage\n\n### Docker\n1. Install [Docker](https://www.docker.com) and [Docker Compose](https://docs.docker.com/compose/install/)\n\n2. Start services.\n\n```\ncd $GOPATH/src/github.com/slotix/dataflowkit \u0026\u0026 docker-compose up\n```\nThis command fetches Docker images automatically and starts services.\n\n3. Launch parsing in a second terminal window by sending a POST request to the parse daemon. 
Some json configuration files for testing are available in /examples folder.\n```\ncurl -XPOST  127.0.0.1:8001/parse --data-binary \"@$GOPATH/src/github.com/slotix/dataflowkit/examples/books.toscrape.com.json\"\n```\nHere is the sample json configuration file:\n\n```\n{\n\t\"name\":\"collection\",\n\t\"request\":{\n\t   \"url\":\"https://example.com\"\n\t},\n\t\"fields\":[\n\t   {\n\t\t  \"name\":\"Title\",\n\t\t  \"selector\":\".product-container a\",\n\t\t  \"extractor\":{\n\t\t\t \"types\":[\"text\", \"href\"],\n\t\t\t \"filters\":[\n\t\t\t\t\"trim\",\n\t\t\t\t\"lowerCase\"\n\t\t\t ],\n\t\t\t \"params\":{\n\t\t\t\t\"includeIfEmpty\":false\n\t\t\t }\n\t\t  }\n\t   },\n\t   {\n\t\t  \"name\":\"Image\",\n\t\t  \"selector\":\"#product-container img\",\n\t\t  \"extractor\":{\n\t\t\t \"types\":[\"alt\",\"src\",\"width\",\"height\"],\n\t\t\t \"filters\":[\n\t\t\t\t\"trim\",\n\t\t\t\t\"upperCase\"\n\t\t\t ]\n\t\t  }\n\t   },\n\t   {\n\t\t  \"name\":\"Buyinfo\",\n\t\t  \"selector\":\".buy-info\",\n\t\t  \"extractor\":{\n\t\t\t \"types\":[\"text\"],\n\t\t\t \"params\":{\n\t\t\t\t\"includeIfEmpty\":false\n\t\t\t }\n\t\t  }\n\t   }\n\t],\n\t\"paginator\":{\n\t   \"selector\":\".next\",\n\t   \"attr\":\"href\",\n\t   \"maxPages\":3\n\t},\n\t\"format\":\"json\",\n\t\"fetcherType\":\"chrome\",\n\t\"paginateResults\":false\n}\n```\nRead more information about scraper configuration JSON files at our [GoDoc reference](https://godoc.org/github.com/slotix/dataflowkit/cmd/parse.d)\n\nExtractors and filters are described at  [https://godoc.org/github.com/slotix/dataflowkit/extract](https://godoc.org/github.com/slotix/dataflowkit/extract)\n\n4. 
To stop the services, press Ctrl+C and run \n``` \ncd $GOPATH/src/github.com/slotix/dataflowkit \u0026\u0026 docker-compose down --remove-orphans --volumes\n```\n\n[![IMAGE ALT CLI Dataflow kit web scraping framework](https://raw.githubusercontent.com/slotix/dataflowkit/master/images/CLI-DFK.png)](https://youtu.be/lqFz1CbWzRs)\n\nClick the image to see the CLI in action.\n\n### Manual way\n\n1. Start the Chrome docker container \n``` \ndocker run --init -it --rm -d --name chrome --shm-size=1024m -p=127.0.0.1:9222:9222 --cap-add=SYS_ADMIN \\\n  yukinying/chrome-headless-browser\n```\n\n\n[Headless Chrome](https://developers.google.com/web/updates/2017/04/headless-chrome) is used for fetching web pages to feed the Dataflow kit parser. \n\n2. Build and run the fetch.d service\n```\ncd $GOPATH/src/github.com/slotix/dataflowkit/cmd/fetch.d \u0026\u0026 go build \u0026\u0026 ./fetch.d\n```\n3. In a new terminal window, build and run the parse.d service\n```\ncd $GOPATH/src/github.com/slotix/dataflowkit/cmd/parse.d \u0026\u0026 go build \u0026\u0026 ./parse.d\n```\n4. Launch parsing. See step 3 from the previous section. \n\n### Run tests\n- ```docker-compose -f test-docker-compose.yml up -d```\n- ```./test.sh```\n- To stop the services, run ```docker-compose -f test-docker-compose.yml down```\n\n\n## Front-End\nTry https://dataflowkit.com/dfk, a front end with a point-and-click interface to Dataflow kit services. It generates a JSON config file and sends a POST request to the DFK parser. \n\n[![IMAGE ALT Dataflow kit web scraping framework](https://raw.githubusercontent.com/slotix/dataflowkit/master/images/dfk-screenshot1.png)](https://youtu.be/SKBkclf1FxA)\n\nClick the image to see Dataflow kit in action.\n\n## License\nThis is Free Software, released under the BSD 3-Clause License.\n\n## Contributing\nYou are welcome to contribute to our project. 
\n- Please submit [your issues](https://github.com/slotix/dataflowkit/issues) \n- Fork the [project](https://github.com/slotix/dataflowkit/fork)\n\n\n![alt tag](https://raw.githubusercontent.com/slotix/dataflowkit/master/images/Spider-White-BG.png)\n","funding_links":[],"categories":["Text Processing","All","文本处理","Go","Bot Building","Specific Formats","scraping","Template Engines","文本处理`解析和操作文本的代码库`"],"sub_categories":["HTTP Clients","交流","Scrapers","刮刀","查询语"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fslotix%2Fdataflowkit","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fslotix%2Fdataflowkit","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fslotix%2Fdataflowkit/lists"}