Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/ukwa/webarchive-discovery
WARC and ARC indexing and discovery tools.
- Host: GitHub
- URL: https://github.com/ukwa/webarchive-discovery
- Owner: ukwa
- Created: 2012-12-20T12:17:14.000Z (about 12 years ago)
- Default Branch: master
- Last Pushed: 2024-08-09T10:57:54.000Z (7 months ago)
- Last Synced: 2024-08-10T10:28:17.349Z (7 months ago)
- Language: Java
- Homepage: https://github.com/ukwa/webarchive-discovery/wiki
- Size: 12.6 MB
- Stars: 113
- Watchers: 24
- Forks: 25
- Open Issues: 96
Metadata Files:
- Readme: README.md
- Changelog: CHANGES.md
README
Web Archive Discovery
=====================

These are the components we use to data-mine and index our ARC and WARC files and make the contents explorable and discoverable.
[Build status](https://github.com/ukwa/webarchive-discovery/actions/workflows/ci-build-and-push.yml)
[Maven Central](https://central.sonatype.com/namespace/uk.bl.wa.discovery)
Documentation
-------------

See the [wiki](https://github.com/ukwa/webarchive-discovery/wiki).
Running the development OpenSearch Server
-----------------------------------------

The OpenSearch setup also works with Elasticsearch 7.10.2 and may be usable with older versions (with minor modifications). You can start it with the provided docker-compose file. After checking out the repository, run the following steps in a shell:

    $ cd warc-indexer/src/main/opensearch/os1
    $ docker-compose up -d

## Initialize the index
To use the cluster you need to create an index. You can do so by calling

    $ curl --insecure --user admin:admin -H 'Content-Type: application/json' -XPUT https://localhost:9200/warcdiscovery/ -d @schema.json

This call creates the index from schema.json, which you can then use with the warc-indexer.

You can delete the index by calling

    $ curl --insecure --user admin:admin -XDELETE https://localhost:9200/warcdiscovery
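
When scripting the setup, it can help to check whether the index already exists before creating it. The following is a minimal sketch, not part of this repository: the variable names (OS_URL, OS_INDEX, OS_USER, OS_PASS) and the function names are illustrative assumptions, and the defaults simply mirror the development values used above. It relies on a HEAD request against the index returning 200 when the index exists and 404 when it does not.

```shell
#!/bin/sh
# Sketch only: all variable and function names here are hypothetical helpers,
# not something shipped with webarchive-discovery.
OS_URL="${OS_URL:-https://localhost:9200}"
OS_INDEX="${OS_INDEX:-warcdiscovery}"
OS_USER="${OS_USER:-admin}"
OS_PASS="${OS_PASS:-admin}"

# Print the index URL, tolerating a trailing slash on OS_URL.
index_url() {
    printf '%s/%s\n' "${OS_URL%/}" "$OS_INDEX"
}

# Print the HTTP status code of a HEAD request on the index
# (200 if it exists, 404 if it does not, 000 if the server is unreachable).
index_status() {
    curl --insecure --silent --head --output /dev/null \
         --write-out '%{http_code}\n' \
         --user "$OS_USER:$OS_PASS" "$(index_url)"
}

# Create the index from schema.json in the current directory,
# but only when it is not already present.
create_index_if_missing() {
    if [ "$(index_status)" = "200" ]; then
        echo "index $OS_INDEX already exists"
    else
        curl --insecure --user "$OS_USER:$OS_PASS" \
             -H 'Content-Type: application/json' \
             -XPUT "$(index_url)/" -d @schema.json
    fi
}

# Usage, from the directory that contains schema.json:
#   create_index_if_missing
```

The functions only define behaviour and do nothing until called, so the file can be sourced safely from another script.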
## Solr schema ported to OpenSearch
The Solr schema was ported to OpenSearch as closely as possible. There are just a few small differences:
* the default value "NOW" of index_time is set by the warc-indexer
* the default value "other" of content_type_norm is set by the warc-indexer
* the content field must be indexed, otherwise no position_increment_gap is possible in Elasticsearch
* only ssdeep_hash_bs_* was added as a dynamicField; the institution-specific fields were skipped, but they could easily be added

Indexing a WARC file
--------------------

Use the following command to populate the OpenSearch index:

    $ java -jar target/warc-indexer-*-jar-with-dependencies.jar -e https://localhost:9200/warcdiscovery/ --user admin --password admin src/test/resources/wikipedia-mona-lisa/flashfrozen-jwat-recompressed.warc.gz
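
To index several WARC files, the same command can be wrapped in a loop. The sketch below is an assumption-laden example rather than a documented feature: JAVA_BIN, OS_ENDPOINT, OS_USER, OS_PASS and index_warcs are hypothetical names, and it deliberately runs one file per invocation so that it uses only the options shown above (-e, --user, --password).

```shell
#!/bin/sh
# Sketch only: the variable names and index_warcs are illustrative helpers,
# not part of webarchive-discovery.
JAVA_BIN="${JAVA_BIN:-java}"
OS_ENDPOINT="${OS_ENDPOINT:-https://localhost:9200/warcdiscovery/}"
OS_USER="${OS_USER:-admin}"
OS_PASS="${OS_PASS:-admin}"

# index_warcs JAR FILE...
# Runs the indexer once per WARC file, keeps going after a failure,
# and returns non-zero if any file failed.
index_warcs() {
    jar="$1"
    shift
    failed=0
    for warc in "$@"; do
        if ! "$JAVA_BIN" -jar "$jar" -e "$OS_ENDPOINT" \
                --user "$OS_USER" --password "$OS_PASS" "$warc"; then
            echo "indexing failed: $warc" >&2
            failed=$((failed + 1))
        fi
    done
    [ "$failed" -eq 0 ]
}

# Usage (assumes the jar glob matches exactly one file; /path/to/warcs is
# a placeholder):
#   index_warcs target/warc-indexer-*-jar-with-dependencies.jar /path/to/warcs/*.warc.gz
```

Running one file per process is slower than a single long-lived JVM, but it keeps a failure on one file from stopping the rest of the batch.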
License
-------

Overall, [GNU General Public License Version 2](http://www.gnu.org/copyleft/gpl.html), but some sub-components are [Apache Software License, Version 2.0](http://www.apache.org/licenses/LICENSE-2.0.txt).