
https://github.com/imperial-genomics-facility/limsmetadataparsing


apache-arrow apache-spark pandas pyodbc python-3-6 sparksql


# LimsMetadataParsing
A PySpark-based codebase for fetching and formatting metadata from a LIMS database for the Imperial Genomics Facility (IGF)

## Set up environment

* Step 1: Get Miniconda

      wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
      bash Miniconda3-latest-Linux-x86_64.sh

* Step 2: Clone git repo

      git clone https://github.com/imperial-genomics-facility/LimsMetadataParsing.git

* Step 3: Install conda env from the environment.yml file

      conda env create -n ENV_NAME --file environment.yml

* Step 4: Create egg file for LimsMetadataParsing repo

      python setup.py bdist_egg

## Get UCanAccess

Download UCanAccess from the following link and unzip the contents
- [http://ucanaccess.sourceforge.net/site.html](http://ucanaccess.sourceforge.net/site.html)
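
Once the archive is unzipped, it can help to check that the directory actually contains the jar files before passing it to the script. The following is a minimal sketch, not part of this repository: the helper names `collect_ucanaccess_jars` and `build_jdbc_url` are hypothetical, and how the repository's script consumes the directory internally is not shown here. The driver class name and the `jdbc:ucanaccess://` URL prefix are those documented by UCanAccess.

```python
from pathlib import Path

# JDBC driver class shipped with UCanAccess (per the UCanAccess documentation).
UCANACCESS_DRIVER = "net.ucanaccess.jdbc.UcanaccessDriver"


def collect_ucanaccess_jars(ucanaccess_dir):
    """Hypothetical helper: list every .jar under the unzipped UCanAccess directory."""
    root = Path(ucanaccess_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"Not a directory: {root}")
    # The distribution keeps its dependencies in sub-directories,
    # so search recursively and return a stable, sorted list.
    return sorted(str(p) for p in root.rglob("*.jar"))


def build_jdbc_url(access_db_path):
    """Hypothetical helper: build a UCanAccess JDBC URL for an Access db file."""
    return f"jdbc:ucanaccess://{access_db_path}"


# Example (paths are placeholders):
#   jars = collect_ucanaccess_jars("/path/UCanAccess-4.0.4-bin")
#   url = build_jdbc_url("/path/Database.accdb")
```

An empty list from `collect_ucanaccess_jars` usually means the path points at the wrong level of the unzipped archive.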

## Usage

    parseAccessDbForMetadata.py [-h] -a ACCESS_DB_PATH -q QUOTE_FILE_PATH
                                -o OUTPUT_PATH -k KNOWN_PROJECTS_LIST
                                -j UCANACCESS_JAR_PATH

    optional arguments:
      -h, --help            show this help message and exit
      -a ACCESS_DB_PATH, --access_db_path ACCESS_DB_PATH
                            Path to Access LIMS db
      -q QUOTE_FILE_PATH, --quote_file_path QUOTE_FILE_PATH
                            Path to quote xls file
      -o OUTPUT_PATH, --output_path OUTPUT_PATH
                            Output dir path for metadata files
      -k KNOWN_PROJECTS_LIST, --known_projects_list KNOWN_PROJECTS_LIST
                            File containing list of known projects
      -j UCANACCESS_JAR_PATH, --ucanaccess_jar_path UCANACCESS_JAR_PATH
                            Path to ucanaccess jar files
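
For readers who want to see how a command line like the one above is typically declared, here is a minimal `argparse` sketch that mirrors the listed options. It is an illustration reconstructed from the help text, not the script's actual source; the assumption that all five path options are required follows from the usage line, where none of them appears in square brackets.

```python
import argparse


def build_parser():
    """Sketch of a parser matching the options shown in the usage text."""
    parser = argparse.ArgumentParser(prog="parseAccessDbForMetadata.py")
    parser.add_argument("-a", "--access_db_path", required=True,
                        help="Path to Access LIMS db")
    parser.add_argument("-q", "--quote_file_path", required=True,
                        help="Path to quote xls file")
    parser.add_argument("-o", "--output_path", required=True,
                        help="Output dir path for metadata files")
    parser.add_argument("-k", "--known_projects_list", required=True,
                        help="File containing list of known projects")
    parser.add_argument("-j", "--ucanaccess_jar_path", required=True,
                        help="Path to ucanaccess jar files")
    return parser


# Parse an explicit argument list (placeholder paths) instead of sys.argv,
# so the example behaves the same wherever it is run.
args = build_parser().parse_args([
    "-a", "/path/Database.accdb",
    "-q", "/path/Quotes.xlsx",
    "-o", "/path/csv_dir",
    "-k", "/path/project_list.csv",
    "-j", "/path/UCanAccess-4.0.4-bin",
])
print(args.access_db_path, args.output_path)
```

Omitting any of the five options makes `argparse` print the usage message and exit with an error.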


## Run spark code

    spark-submit \
      --master local[NUMBER_OF_CPUS] \
      --py-files /path/igfLimsParsing-0.0.1-py3.6.egg \
      /path/LimsMetadataParsing/scripts/parseAccessDbForMetadata.py \
      -a /path/Database.accdb \
      -q /path/Quotes.xlsx \
      -o /path/csv_dir \
      -k /path/project_list.csv \
      -j /path/UCanAccess-4.0.4-bin
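
When the job is launched from another Python program (a scheduler or wrapper script, for instance), the same command can be assembled as an argument list, which avoids shell quoting and line-continuation problems. This is a sketch under assumptions: the function name `build_spark_submit_command` and its parameters are hypothetical, and all paths are the placeholders used above.

```python
import shlex


def build_spark_submit_command(n_cpus, egg_path, script_path, access_db,
                               quote_file, output_dir, known_projects,
                               ucanaccess_dir):
    """Hypothetical helper: assemble the spark-submit call as an argv list."""
    return [
        "spark-submit",
        "--master", f"local[{n_cpus}]",   # run Spark locally on n_cpus cores
        "--py-files", egg_path,           # egg built with `python setup.py bdist_egg`
        script_path,
        "-a", access_db,
        "-q", quote_file,
        "-o", output_dir,
        "-k", known_projects,
        "-j", ucanaccess_dir,
    ]


cmd = build_spark_submit_command(
    n_cpus=4,
    egg_path="/path/igfLimsParsing-0.0.1-py3.6.egg",
    script_path="/path/LimsMetadataParsing/scripts/parseAccessDbForMetadata.py",
    access_db="/path/Database.accdb",
    quote_file="/path/Quotes.xlsx",
    output_dir="/path/csv_dir",
    known_projects="/path/project_list.csv",
    ucanaccess_dir="/path/UCanAccess-4.0.4-bin",
)

# Print a shell-safe rendering of the command for inspection.
print(" ".join(shlex.quote(part) for part in cmd))
```

The list can then be passed to `subprocess.run(cmd, check=True)`, provided `spark-submit` is on the `PATH` of the activated conda environment.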