{"id":21586476,"url":"https://github.com/vida-nyu/ache","last_synced_at":"2025-04-04T13:13:51.902Z","repository":{"id":25371947,"uuid":"28800030","full_name":"VIDA-NYU/ache","owner":"VIDA-NYU","description":"ACHE is a web crawler for domain-specific search.","archived":false,"fork":false,"pushed_at":"2023-08-24T17:55:19.000Z","size":69862,"stargazers_count":464,"open_issues_count":42,"forks_count":135,"subscribers_count":34,"default_branch":"master","last_synced_at":"2025-03-28T12:09:56.638Z","etag":null,"topics":["domain-specific-search","focused-crawler","hacktoberfest","web-crawler","web-scraping","web-search","web-spider"],"latest_commit_sha":null,"homepage":"http://ache.readthedocs.io","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/VIDA-NYU.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2015-01-05T06:11:43.000Z","updated_at":"2025-03-22T19:24:48.000Z","dependencies_parsed_at":"2023-02-19T08:45:50.697Z","dependency_job_id":"6e7d2a59-8932-4cd5-94a7-74169af13676","html_url":"https://github.com/VIDA-NYU/ache","commit_stats":null,"previous_names":[],"tags_count":15,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VIDA-NYU%2Fache","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VIDA-NYU%2Fache/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VIDA-NYU%2Fache/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VIDA-NYU%2Fache/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/VIDA-NYU"
,"download_url":"https://codeload.github.com/VIDA-NYU/ache/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247182420,"owners_count":20897381,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["domain-specific-search","focused-crawler","hacktoberfest","web-crawler","web-scraping","web-search","web-spider"],"created_at":"2024-11-24T15:13:50.473Z","updated_at":"2025-04-04T13:13:51.879Z","avatar_url":"https://github.com/VIDA-NYU.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cimg src=\"https://raw.githubusercontent.com/ViDA-NYU/ache/master/ache-logo.png\" align=\"right\" height=\"90px\"/\u003e\n\n[![Build Status](https://github.com/VIDA-NYU/ache/actions/workflows/gradle-build.yml/badge.svg)](https://github.com/VIDA-NYU/ache/actions/workflows/gradle-build.yml)\n[![Docker Build](https://github.com/VIDA-NYU/ache/actions/workflows/docker-image.yml/badge.svg)](https://hub.docker.com/r/vidanyu/ache)\n[![Documentation Status](https://readthedocs.org/projects/ache/badge/?version=latest)](http://ache.readthedocs.io/en/latest/?badge=latest)\n[![License](https://img.shields.io/badge/license-Apache--2.0-blue.svg)](http://www.apache.org/licenses/LICENSE-2.0)\n\n# ACHE Focused Crawler\n\nACHE is a focused web crawler. 
It collects web pages that satisfy some specific criteria, e.g., pages that belong to a given domain or that contain a user-specified pattern.\nACHE differs from generic crawlers in the sense that it uses *page classifiers* to distinguish between relevant and irrelevant pages in a given domain. A page classifier can range from a simple regular expression (that matches every page that contains a specific word, for example) to a machine-learning-based classification model.\nACHE can also automatically learn how to prioritize links in order to efficiently locate relevant content while avoiding the retrieval of irrelevant content.\n\nACHE supports many features, such as:\n- Regular crawling of a fixed list of web sites\n- Discovery and crawling of new relevant web sites through automatic link prioritization\n- Configuration of different types of page classifiers (machine learning, regex, etc.)\n- Continuous re-crawling of sitemaps to discover new pages\n- Indexing of crawled pages using Elasticsearch\n- Web interface for searching crawled pages in real time\n- REST API and web-based user interface for crawler monitoring\n- Crawling of hidden services using Tor proxies\n\n## License\n\nStarting from version 0.11.0, ACHE is licensed under Apache 2.0.\nPrevious versions were licensed under the GNU GPL.\n\n## Documentation\n\nMore info is available in the project's [documentation](http://ache.readthedocs.io/en/latest/).\n\n## Installation\n\nYou can build ACHE from the source code, download the executable binary using `conda`, or use Docker to build an image and run ACHE in a container.\n\n### Build from source with Gradle\n\n**Prerequisite:** You will need to install a recent version of Java (JDK 8 or later).\n\nTo build ACHE from source, you can run the following commands in your terminal:\n\n```\ngit clone https://github.com/ViDA-NYU/ache.git\ncd ache\n./gradlew installDist\n```\n\nwhich will generate an installation package under `ache/build/install/`.\nYou 
can then make the `ache` command available in the terminal by adding the ACHE binaries to the `PATH` environment variable:\n\n```bash\nexport ACHE_HOME=\"{path-to-cloned-ache-repository}/ache/build/install/ache\"\nexport PATH=\"$ACHE_HOME/bin:$PATH\"\n```\n\n### Running using Docker\n\n**Prerequisite:** You will need to install a recent version of Docker. See https://docs.docker.com/engine/installation/ for details on how to install Docker for your platform.\n\nWe publish pre-built Docker images on [Docker Hub](https://hub.docker.com/r/vidanyu/ache/) for each released version.\nYou can run the latest image using:\n\n    docker run -p 8080:8080 vidanyu/ache:latest\n\nAlternatively, you can build the image yourself and run it:\n\n```\ngit clone https://github.com/ViDA-NYU/ache.git\ncd ache\ndocker build -t ache .\ndocker run -p 8080:8080 ache\n```\n\nThe [Dockerfile](https://github.com/ViDA-NYU/ache/blob/master/Dockerfile) exposes two data volumes so that you can mount a directory with your configuration files (at `/config`) and preserve the crawler's stored data (at `/data`) after the container stops.\n\n### Download with Conda\n\n**Prerequisite:** You need to have the Conda package manager installed on your system.\n\nIf you use Conda, you can install `ache` from Anaconda Cloud by running:\n\n```\nconda install -c vida-nyu ache\n```\n\n*NOTE: Only released tagged versions are published to Anaconda Cloud, so the version available through Conda may not be up-to-date.\nIf you want to try the most recent version, please clone the repository and build from source or use the Docker version.*\n\n## Running ACHE\n\nBefore starting a crawl, you need to create a configuration file named `ache.yml`.\nWe provide some configuration samples in the repository's [config](https://github.com/ViDA-NYU/ache/tree/master/config) directory that can help you get started.\n\nYou will also need a page classifier configuration file named `pageclassifier.yml`.\nFor details on how to configure a page 
classifier, refer to the [page classifiers documentation](http://ache.readthedocs.io/en/latest/page-classifiers.html).\n\nAfter you have configured a classifier, the last thing you will need is a seed file, i.e., a plain text file containing one URL per line. The crawler will use these URLs to bootstrap the crawl.\n\nFinally, you can start the crawler using the following command:\n\n```\nache startCrawl -o \u003cdata-output-path\u003e -c \u003cconfig-path\u003e -s \u003cseed-file\u003e -m \u003cmodel-path\u003e\n```\nwhere:\n- `\u003cconfig-path\u003e` is the path to the config directory that contains `ache.yml`.\n- `\u003cseed-file\u003e` is the seed file that contains the seed URLs.\n- `\u003cmodel-path\u003e` is the path to the model directory that contains the file `pageclassifier.yml`.\n- `\u003cdata-output-path\u003e` is the path to the data output directory.\n\nExample of running ACHE using the sample *pre-trained page classifier model* and the sample *seeds file* available in the repository:\n\n```\nache startCrawl -o output -c config/sample_config -s config/sample.seeds -m config/sample_model\n```\n\nThe crawler will run and print the logs to the console. Press ``Ctrl+C`` at any time to stop it (it may take some time to shut down).\nFor long crawls, you should run ACHE in the background using a tool like `nohup`.\n\n### Data Formats\n\nACHE can output data in multiple formats. 
The data formats currently available are:\n\n- FILES (default) - raw content and metadata are stored in rolling compressed files of fixed size.\n- ELASTICSEARCH - raw content and metadata are indexed in an Elasticsearch index.\n- KAFKA - pushes raw content and metadata to an Apache Kafka topic.\n- WARC - stores data using the standard format used by the Web Archive and Common Crawl.\n- FILESYSTEM_HTML - only raw page content is stored in plain text files.\n- FILESYSTEM_JSON - raw content and metadata are stored using JSON format in files.\n- FILESYSTEM_CBOR - raw content and some metadata are stored using [CBOR](http://cbor.io) format in files.\n\nFor more details on how to configure data formats, see the [data formats documentation](http://ache.readthedocs.io/en/latest/data-formats.html) page.\n\n## Bug Reports and Questions\n\nWe welcome user feedback. Please submit any suggestions, questions, or bug reports using the [GitHub issue tracker](https://github.com/ViDA-NYU/ache/issues).\n\nWe also have a chat room on [Gitter](https://gitter.im/ViDA-NYU/ache).\n\n## Contributing\n\nCode contributions are welcome. We use a code style derived from the [Google Style Guide](https://google.github.io/styleguide/javaguide.html), but with 4 spaces for indentation. An Eclipse Formatter configuration file is available in the [repository](https://github.com/ViDA-NYU/ache/blob/master/eclipse-code-style.xml).\n\n## Contact\n\n- Aécio Santos [aecio.santos@nyu.edu]\n- Kien Pham [kien.pham@nyu.edu]\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvida-nyu%2Fache","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvida-nyu%2Fache","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvida-nyu%2Fache/lists"}