{"id":13799477,"url":"https://github.com/TransparencyToolkit/Harvester","last_synced_at":"2025-05-13T08:31:13.714Z","repository":{"id":45424337,"uuid":"46875499","full_name":"TransparencyToolkit/Harvester","owner":"TransparencyToolkit","description":"Web crawling and document processing through a usable interface.","archived":false,"fork":false,"pushed_at":"2017-07-22T15:55:59.000Z","size":60036,"stargazers_count":71,"open_issues_count":3,"forks_count":15,"subscribers_count":16,"default_branch":"master","last_synced_at":"2024-11-18T14:58:45.440Z","etag":null,"topics":["api","crawling","document","interface","ocr","osint","web"],"latest_commit_sha":null,"homepage":"https://transparencytoolkit.org","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/TransparencyToolkit.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-11-25T17:06:20.000Z","updated_at":"2024-04-05T21:21:41.000Z","dependencies_parsed_at":"2022-07-14T01:00:39.207Z","dependency_job_id":null,"html_url":"https://github.com/TransparencyToolkit/Harvester","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TransparencyToolkit%2FHarvester","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TransparencyToolkit%2FHarvester/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TransparencyToolkit%2FHarvester/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TransparencyToolkit%2FHarvester/manifests","owner_url":"https://repos.ecosyste.ms/api
/v1/hosts/GitHub/owners/TransparencyToolkit","download_url":"https://codeload.github.com/TransparencyToolkit/Harvester/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253903711,"owners_count":21981736,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["api","crawling","document","interface","ocr","osint","web"],"created_at":"2024-08-04T00:01:03.174Z","updated_at":"2025-05-13T08:31:13.357Z","avatar_url":"https://github.com/TransparencyToolkit.png","language":"JavaScript","funding_links":[],"categories":["Fingerprint"],"sub_categories":["Web"],"readme":"Harvester\n=========\n\nHarvester is a tool to crawl websites and OCR/extract metadata from documents,\nall through a usable graphical interface. The goal is for journalists,\nactivists, and researchers to be able to rapidly collect open source\nintelligence (OSINT) from public websites and convert any set of documents\ninto machine readable form without programming or complex technical setup.\n\nHarvester requires\n[DocManager](https://github.com/TransparencyToolkit/DocManager) so that it can\nindex the data with Elasticsearch. 
Harvester can also be used with\n[LookingGlass](https://github.com/TransparencyToolkit/LookingGlass) to\nseamlessly generate searchable archives of crawled data and processed\ndocuments.\n\n# Installation\n\n## Dependencies\n\n* [DocManager](https://github.com/TransparencyToolkit/DocManager) and all of\n  its dependencies\n* Ruby 2.4.1\n* Rails 5\n* MongoDB\n* Curl\n* Redis\n* Tika and Tesseract\n* (optionally) [LookingGlass](https://github.com/TransparencyToolkit/LookingGlass)\n\n## Setup Instructions\n\n1. Install the dependencies\n\n* Download Elasticsearch (https://www.elastic.co/downloads/elasticsearch)\n* Download RVM (https://rvm.io/rvm/install)\n* Install Ruby: Run `rvm install 2.4.1` and `rvm use 2.4.1`\n* Install Rails: `gem install rails`\n* Install Debian dependencies: `sudo apt-get install libcurl3 libcurl3-gnutls libcurl4-openssl-dev libmagickcore-dev libmagickwand-dev mongodb`\n* Follow the installation instructions for [DocManager](https://github.com/TransparencyToolkit/DocManager)\n* Install Redis: [instructions for Debian](https://www.linode.com/docs/databases/redis/deploy-redis-on-ubuntu-or-debian#debian)\n\n2. Install Tika \u0026 Tesseract (optional)\n\nNOTE: By default, document conversion (PDF, docs, etc.) is handled by\n[GiveMeText](http://givemetext.okfnlabs.org), which sends your\ndocuments over the clear internet. 
*DO NOT USE THIS* with sensitive documents;\ninstead install Tika \u0026 Tesseract as described below.\n\n* Install dependencies: `apt-get install default-jdk maven unzip`\n* Download Tika: Run `curl https://codeload.github.com/apache/tika/zip/trunk -o trunk.zip` and `unzip trunk.zip`\n* Go into Tika directory: `cd tika-trunk`\n* Install Tika: Run `mvn -DskipTests=true clean install` and `cp tika-server/target/tika-server-1.*-SNAPSHOT.jar /srv/tika-server-1.*-SNAPSHOT.jar`\n* Install Tesseract: Run `apt-get -y -q install tesseract-ocr tesseract-ocr-deu tesseract-ocr-eng`\n* Run Tika: `java -jar tika-server/target/tika-server-*.jar` (use `--host=localhost --port=1234` for a custom host and port)\n\n3. Get Harvester\n\n* Clone repo: `git clone https://github.com/TransparencyToolkit/Harvester`\n* Go into Harvester directory: `cd Harvester`\n* Install Ruby gems: Run `bundle install`\n\n4. Run Harvester\n\n* Start DocManager: Follow the instructions on the\n  [DocManager](https://github.com/TransparencyToolkit/DocManager) repo\n* Configure Project: Edit the file in `config/initializers/project_config` so\n  that the PROJECT_INDEX value is the name of the index, as defined in the\n  [DocManager](https://github.com/TransparencyToolkit/DocManager) project\n  config, that Harvester should use\n* Start Harvester: Run `rails server -p 3333`\n* Start Resque: Run `QUEUE=* rake environment resque:work`\n* Use Harvester: Go to [http://0.0.0.0:3333](http://0.0.0.0:3333) in your\n  browser\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FTransparencyToolkit%2FHarvester","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FTransparencyToolkit%2FHarvester","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FTransparencyToolkit%2FHarvester/lists"}