https://github.com/centic9/file-type-detection

A small tool to use Apache Tika to determine the mime-type of all files in a directory

[![Build Status](https://travis-ci.org/centic9/file-type-detection.svg)](https://travis-ci.org/centic9/file-type-detection) [![Gradle Status](https://gradleupdate.appspot.com/centic9/file-type-detection/status.svg?branch=master)](https://gradleupdate.appspot.com/centic9/file-type-detection/status)

This is a small tool to use [Apache Tika](http://tika.apache.org) to detect the mime-type of files in a
directory and produce JSON output that can be used for further processing.

The JSON is printed to stdout. Summary/Error information is printed to stderr.
A typical invocation therefore redirects stdout to a file, for example via `> file-types.txt`.

#### Getting started

##### Grab it

    git clone https://github.com/centic9/file-type-detection.git
    cd file-type-detection

##### Build it

    ./gradlew check installDist

##### Run it

    build/install/file-type-detection/bin/file-type-detection > file-types.txt

### How it works

The actual code is quite small. It uses the `DirectoryWalker` from
[Apache Commons IO](https://commons.apache.org/proper/commons-io/) to
search the provided directories and invokes a handler for each file that is found.

The handler submits a `Runnable` to an `Executor` backed by a thread pool, and that task performs the
detection of the file type via Apache Tika.

Because detection runs asynchronously, the file-system scan can proceed in
parallel with the file-type detection.
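
The walk-and-dispatch pattern described above can be sketched roughly as follows. This is not the project's actual code: to stay runnable with the JDK alone, it swaps in `Files.walkFileTree` for the Commons IO `DirectoryWalker` and `Files.probeContentType` for the Tika detection call. The class name `FileTypeScanSketch`, the pool size, the timeout and the tab-separated output are all assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: walk a directory on the calling thread and
// detect file types on a worker pool, so both run in parallel.
class FileTypeScanSketch {

    static Map<String, String> scan(Path root) throws IOException, InterruptedException {
        Map<String, String> types = new ConcurrentHashMap<>();
        // Pool size of 4 is an arbitrary choice for this sketch.
        ExecutorService executor = Executors.newFixedThreadPool(4);

        try {
            Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
                @Override
                public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                    if (attrs.isRegularFile()) {
                        // Hand detection to the pool; the walk continues immediately.
                        executor.execute(() -> types.put(file.toString(), detect(file)));
                    }
                    return FileVisitResult.CONTINUE;
                }
            });
        } finally {
            // Stop accepting new tasks; already queued tasks still run.
            executor.shutdown();
        }

        // Wait for the outstanding detection tasks before returning results.
        if (!executor.awaitTermination(10, TimeUnit.MINUTES)) {
            throw new IllegalStateException("Detection did not finish in time");
        }
        return types;
    }

    // Stand-in for the Tika detection call (assumption: JDK probing only).
    static String detect(Path file) {
        try {
            String type = Files.probeContentType(file);
            return type != null ? type : "unknown";
        } catch (IOException e) {
            return "error: " + e.getMessage();
        }
    }

    public static void main(String[] args) throws Exception {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        scan(root).forEach((path, type) -> System.out.println(path + "\t" + type));
    }
}
```

Collecting results in a `ConcurrentHashMap` keeps the worker threads from contending on a shared lock. Calling `shutdown()` followed by `awaitTermination` ensures that every queued detection has finished before the results are read.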

### Helper for extracting text from files

As Tika is very good at text-extraction as well, this project also provides a small
tool to extract text from any file-type which it supports.

Run the following Java application: `org.dstadler.filesearch.ExtractText`

### Support this project

If you find this tool useful and would like to support it, you can [sponsor the author](https://github.com/sponsors/centic9).

### Licensing

Copyright 2013-2022 Dominik Stadler

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at [http://www.apache.org/licenses/LICENSE-2.0](http://www.apache.org/licenses/LICENSE-2.0)

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.