https://github.com/rewong03/xtract_file_service
Code for XtractHub server that processes individual files.
- Host: GitHub
- URL: https://github.com/rewong03/xtract_file_service
- Owner: rewong03
- Created: 2019-08-16T15:17:06.000Z (almost 6 years ago)
- Default Branch: master
- Last Pushed: 2024-06-17T22:45:03.000Z (12 months ago)
- Last Synced: 2025-04-12T08:55:44.290Z (2 months ago)
- Topics: extractor, flask, globus, process-metadata, redis, xtracthub
- Language: Python
- Homepage:
- Size: 133 KB
- Stars: 0
- Watchers: 1
- Forks: 1
- Open Issues: 1
Metadata Files:
- Readme: README.md
README
# XtractHub File Service
XtractHub File Service is a user metadata file service based on [XtractHub](https://github.com/xtracthub)
from Globus Labs. XtractHub File Service allows users to manage files, process metadata from uploaded files using
XtractHub, and view extracted file metadata using a REST API.

## Getting Started
These instructions will get a copy of XtractHub File Service running on your local machine for development and testing
purposes.

### Prerequisites
- Redis (available [here](https://redis.io/download))
- Docker (available [here](https://docs.docker.com/install/))

### Installation
First, clone this repository and activate a virtual environment:
```
git clone https://github.com/rewong03/xtract_file_service
cd xtract_file_service
python3 -m venv venv
source venv/bin/activate
```
Next, install the requirements:
```
pip install -r requirements.txt
deactivate
```

### Running XtractHub File Service
First, open a terminal and start a redis server:
```
cd /path/to/redis/
src/redis-server
```
Then in a second terminal, start a celery worker:
```
cd /path/to/xtract_file_service/
source venv/bin/activate
venv/bin/celery -A app.celery_app worker -Q celery,priority
```
In a third terminal, start the flask app:
```
cd /path/to/xtract_file_service
source venv/bin/activate
venv/bin/flask run
```

## Interacting with the server
The server is a REST API with no GUI or HTML. All interactions are done using `curl` or the xtract_file_service command
line interface. This section documents how to interact with all of the server's features using `curl`.

**While you *can* use `curl` to interact with the server, it is recommended that you use the xtract_file_service CLI instead. To learn how to use the CLI, click [here](xfs_cli/README.md).**

### Creating a user:
To create a new user, run:
```
curl -X POST -d '{"Username": "your_username", "Email": "your_email", "Password": "your_password"}' http://localhost:5000
```
- This will either return a success message or a failure message if your input wasn't formatted correctly or if the username has already been taken.

### Logging in:
To log in, run:
```
curl -X GET -d '{"Username": "your_username", "Password": "your_password"}' http://localhost:5000/login
```
- This will return a user ID that will become your `"Authentication"` for other server interactions. Save or copy this ID so you won't have to log in again.
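If you prefer to script these calls, the same flow can be written with Python's `requests` library. This is a minimal sketch that mirrors the two curl calls above; the username, email, and password are placeholders, and parsing the user ID out of `login.text` is an assumption about the response format:
```
import json
import requests

BASE_URL = "http://localhost:5000"  # default address used by `flask run`

credentials = {"Username": "your_username", "Password": "your_password"}

# Create the account (mirrors the curl -X POST call above).
signup = requests.post(BASE_URL, data=json.dumps({**credentials, "Email": "your_email"}))
print(signup.text)

# Log in and keep the returned user ID; it becomes the "Authentication"
# header for every other request to the server.
login = requests.get(f"{BASE_URL}/login", data=json.dumps(credentials))
auth_id = login.text.strip()  # assumption: the ID comes back as plain text
headers = {"Authentication": auth_id}
```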
### Deleting an account:
To delete an account, run:
```
curl -X DELETE -H "Authentication: your_authentication" -d '{"Username": "your_username", "Password": "your_password"}' http://localhost:5000/delete_user
```

### Viewing, uploading, and deleting files:
To upload files for automatic metadata processing, run:
```
curl -X POST -H "Authentication: your_authentication" -H "Extractor: extractor_name" -F "file=@/local/file/path.txt" http://localhost:5000/files
```
- **Note: See the available extractors section below.**
- This will return a task ID for the metadata processing job, which can be used to view the status of your job. Omitting
`extractor_name` will still return a task ID but will not result in any metadata being processed.
- Compressed files will automatically be decompressed.
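The upload can also be scripted with Python's `requests` library using a multipart form, mirroring the `curl -F` call above. This is a minimal sketch: the extractor name, file path, and `your_authentication` value are placeholders, and reading the task ID from `response.text` is an assumption about the response format:
```
import requests

BASE_URL = "http://localhost:5000"
headers = {
    "Authentication": "your_authentication",  # ID returned by /login
    "Extractor": "tabular",                   # any extractor from the list below
}

# Multipart upload, equivalent to curl's -F "file=@/local/file/path.txt".
with open("/local/file/path.txt", "rb") as f:
    response = requests.post(f"{BASE_URL}/files", headers=headers, files={"file": f})

task_id = response.text.strip()  # assumption: the task ID comes back as plain text
print(task_id)
```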
To view uploaded files, run:
```
curl -X GET -H "Authentication: your_authentication" http://localhost:5000/files
```
- This will return a string containing the names of uploaded files as well as their sizes.

To delete files, run:
```
curl -X DELETE -H "Authentication: your_authentication" -d filename http://localhost:5000/files
```
- This will return a success message or an error message if the file doesn't exist.
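Both the listing and deletion calls translate directly to `requests`. This is a minimal sketch with placeholder names; note that, as in the curl examples, the filename is sent as the raw request body:
```
import requests

BASE_URL = "http://localhost:5000"
headers = {"Authentication": "your_authentication"}

# List uploaded files and their sizes.
listing = requests.get(f"{BASE_URL}/files", headers=headers)
print(listing.text)

# Delete one uploaded file; the filename is the raw request body.
deleted = requests.delete(f"{BASE_URL}/files", headers=headers, data="path.txt")
print(deleted.text)
```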
### Viewing, processing, and deleting metadata:
To view processed metadata, run:
```
curl -X GET -H "Authentication: your_authentication" -d filename http://localhost:5000/metadata
```
- This will return all metadata extracted for a given file.

To process metadata for an uploaded file, run:
```
curl -X POST -H "Authentication: your_authentication" -d '{"Filename": "filename", "Extractor": "extractor_name"}' http://localhost:5000/metadata
```
- **Note: See the available extractors section below.**
- **Note: You cannot process metadata for a file and extractor if you have already done so for that file and extractor
combination.**
- This will return a task ID for the metadata processing job, which can be used to view the status of your job. Omitting `extractor_name` will still return a task ID but will not result in any metadata being processed.
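These two calls can likewise be scripted with `requests`. The sketch below starts an extraction job for an already-uploaded file and then fetches whatever metadata has been stored for it; the filename and extractor are placeholders, and the JSON body mirrors the curl call above:
```
import json
import requests

BASE_URL = "http://localhost:5000"
headers = {"Authentication": "your_authentication"}

# Start a metadata-processing job for an uploaded file.
job = requests.post(
    f"{BASE_URL}/metadata",
    headers=headers,
    data=json.dumps({"Filename": "path.txt", "Extractor": "keyword"}),
)
print(job.text)  # task ID for the processing job

# Retrieve all metadata extracted so far for the file.
metadata = requests.get(f"{BASE_URL}/metadata", headers=headers, data="path.txt")
print(metadata.text)
```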
To delete all metadata for a file, run:
```
curl -X DELETE -H "Authentication: your_authentication" -d filename http://localhost:5000/metadata
```
- An additional `-H "Extractor: extractor_name"` header can be passed to delete only that extractor's metadata for `filename`. If omitted, all metadata for `filename` will be deleted.

### Viewing task status:
To view a task status, run:
```
curl -X GET -d task_id http://localhost:5000/tasks
```
- This returns a status message for the given task ID.
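Because extraction runs asynchronously through Celery, a client typically polls this endpoint until the job finishes. This is a minimal polling sketch, assuming the task ID returned by an upload or `/metadata` call and Celery-style status strings (an assumption about the response format):
```
import time
import requests

BASE_URL = "http://localhost:5000"

def wait_for_task(task_id, interval=2, attempts=30):
    """Poll /tasks until the task reports a final state or we give up."""
    status = ""
    for _ in range(attempts):
        # As in the curl example, the task ID is sent as the raw request body.
        status = requests.get(f"{BASE_URL}/tasks", data=task_id).text.strip()
        if "SUCCESS" in status or "FAILURE" in status:  # assumption: Celery-style states
            break
        time.sleep(interval)
    return status
```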
## Available Extractors:
`tabular`:
- Can process tabular/columnar files (.tsv, .csv, etc.).
- Returns preamble, headers, and the mean, median, mode, max, and min for each column.
- Tabular files containing preamble will automatically process the preamble using the `keyword` extractor.

`keyword`:
- Can process text files.
- Returns a list of keywords as well as scores for those keywords (calculated as degree/frequency).

`jsonxml`:
- Can process .json/.xml style files.
- Returns depth, headers, columns, and all text found in the file.
- JSON/XML files containing text will automatically process the text using the `keyword` extractor.

`netcdf`:
- Can process netcdf style files.
- Returns attributes, dimensions, size, and variables.

`image`:
- Can process any image file.
- Returns a classification of plot, map, map-plot, graphic, or photograph.

`map`:
- Can process a map image (works better for maps with text or coordinates).
- Returns any text found in the image as well as locations on the map.

`matio`:
- Can process material science files.
- `matio` was built using the [MaterialsIO](https://github.com/materials-data-facility/MaterialsIO) library. For a more
extensive list of accepted file types and outputs, please visit their [documentation](https://materialsio.readthedocs.io/en/latest/parsers.html).