https://github.com/lazappi/task_dimensionality_reduction_copy

A copy of the dimensionality reduction task for testing
Host: GitHub
URL: https://github.com/lazappi/task_dimensionality_reduction_copy
Owner: lazappi
License: mit
Created: 2024-09-17T08:58:20.000Z (about 2 months ago)
Default Branch: main
Last Pushed: 2024-09-17T10:28:49.000Z (about 2 months ago)
Last Synced: 2024-09-18T11:47:32.285Z (about 2 months ago)
Language: Python
Size: 71.3 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 1
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE

# Dimensionality Reduction for Visualization

Reduction of high-dimensional datasets to 2D for visualization and
interpretation.

Repository:
[openproblems-bio/task_dimensionality_reduction](https://github.com/openproblems-bio/task_dimensionality_reduction)

## Description

Data visualisation is an important part of all stages of single-cell
analysis, from initial quality control to interpretation and
presentation of final results. For bulk RNA-seq studies, linear
dimensionality reduction techniques such as PCA and MDS are commonly
used to visualise the variation between samples. While these methods are
highly effective, they can only show the first few components of
variation, which cannot fully represent the increased complexity and
number of observations in single-cell datasets. For this reason,
non-linear techniques (most notably t-SNE and UMAP) have become the
standard for visualising single-cell studies. These methods attempt to
compress a dataset into a two-dimensional space while capturing as much
of the variance between observations as possible. Many methods for
solving this problem now exist. In general, these methods try to
preserve distances, while some additionally consider aspects such as
density within the embedded space or conservation of continuous
trajectories. Although almost every single-cell study uses one of these
visualisations, there has been debate as to whether they can effectively
capture the variation in single-cell datasets \[@chari2023speciousart\].

The dimensionality reduction task attempts to quantify the ability of
methods to embed the information present in complex single-cell studies
into a two-dimensional space. The task is therefore designed
specifically for dimensionality reduction for visualisation and does not
consider other uses of dimensionality reduction in standard single-cell
workflows, such as improving the signal-to-noise ratio (in fact, several
of the methods use PCA as a pre-processing step for this reason). Unlike
most tasks, methods for the dimensionality reduction task must accept a
matrix containing expression values normalised to 10,000 counts per cell
and log transformed (log-10k) and produce a two-dimensional coordinate
for each cell. Pre-normalised matrices are required to enforce
consistency between the metric evaluation (which generally requires
normalised data) and the method runs. When these are not consistent,
methods that use the same normalisation as the metric tend to score more
highly. For some methods we also evaluate the pre-processing recommended
by the method.
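
As a rough illustration of the log-10k normalisation described above, the following sketch scales each cell to 10,000 total counts and then log transforms the result. The function name `log10k_normalise` is hypothetical, and the use of the natural logarithm with a pseudocount of one (`log1p`) is an assumption: the description only says the values are "log transformed".

```python
import numpy as np


def log10k_normalise(counts: np.ndarray, target_sum: float = 10_000.0) -> np.ndarray:
    """Scale each cell (row) to `target_sum` total counts, then apply log(1 + x).

    Assumption: natural log with a pseudocount of 1; the task description
    does not state the base or the pseudocount.
    """
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    # Leave empty cells as all zeros instead of dividing by zero.
    totals[totals == 0] = 1.0
    return np.log1p(counts / totals * target_sum)


# Two toy cells (rows) with three genes (columns).
counts = np.array([[1, 3, 0], [10, 0, 10]])
normalised = log10k_normalise(counts)
```

Undoing the log transform with `np.expm1` should give rows that each sum to 10,000, which is a quick way to check that a matrix follows this convention.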

## Authors & contributors

| name                   | roles              |
|:-----------------------|:-------------------|
| Luke Zappia            | maintainer, author |
| Michal Klein           | author             |
| Scott Gigante          | author             |
| Ben DeMeo              | author             |
| Robrecht Cannoodt      | author             |
| Kai Waldrant           | contributor        |
| Sai Nirmayi Yasa       | contributor        |
| Juan A. Cordero Varela | contributor        |

## API

``` mermaid
flowchart LR
  file_common_dataset("Dataset")
  comp_process_dataset[/"Data processor"/]
  file_dataset("Dataset")
  file_solution("Test data")
  comp_control_method[/"Control method"/]
  comp_method[/"Method"/]
  comp_metric[/"Metric"/]
  file_embedding("Embedding")
  file_score("Score")
  file_common_dataset---comp_process_dataset
  comp_process_dataset-->file_dataset
  comp_process_dataset-->file_solution
  file_dataset---comp_control_method
  file_dataset---comp_method
  file_solution---comp_control_method
  file_solution---comp_metric
  comp_control_method-->file_embedding
  comp_method-->file_embedding
  comp_metric-->file_score
  file_embedding---comp_metric
```

## File format: Dataset

The dataset to pass to a method.

Example file: `resources_test/common/pancreas/dataset.h5ad`

Format:



    AnnData object
     obs: 'cell_type'
     var: 'hvg_score'
     layers: 'counts', 'normalized'
     uns: 'dataset_id', 'dataset_name', 'dataset_url', 'dataset_reference', 'dataset_summary', 'dataset_description', 'dataset_organism', 'normalization_id'



Data structure:



| Slot | Type | Description |
|:---|:---|:---|
| `obs["cell_type"]` | `string` | Classification of the cell type based on its characteristics and function within the tissue or organism. |
| `var["hvg_score"]` | `double` | High variability gene score (normalized dispersion). The greater, the more variable. |
| `layers["counts"]` | `integer` | Raw counts. |
| `layers["normalized"]` | `double` | Normalized expression values. |
| `uns["dataset_id"]` | `string` | A unique identifier for the dataset. |
| `uns["dataset_name"]` | `string` | Nicely formatted name. |
| `uns["dataset_url"]` | `string` | (*Optional*) Link to the original source of the dataset. |
| `uns["dataset_reference"]` | `string` | (*Optional*) Bibtex reference of the paper in which the dataset was published. |
| `uns["dataset_summary"]` | `string` | Short description of the dataset. |
| `uns["dataset_description"]` | `string` | Long description of the dataset. |
| `uns["dataset_organism"]` | `string` | (*Optional*) The organism of the sample in the dataset. |
| `uns["normalization_id"]` | `string` | Which normalization was used. |



## Component type: Data processor

A dimensionality reduction dataset processor.

Arguments:



| Name | Type | Description |
|:---|:---|:---|
| `--input` | `file` | The dataset to pass to a method. |
| `--output_dataset` | `file` | (*Output*) The dataset to pass to a method. |
| `--output_solution` | `file` | (*Output*) The data for evaluating a dimensionality reduction. |
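
To make the argument table above concrete, here is a minimal sketch of a command-line interface exposing the same three arguments with Python's standard `argparse` module. It only illustrates the argument names; how the components in the repository are actually wired up is not shown here, and the output file names in the example call are placeholders.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the data processor arguments listed in the table."""
    parser = argparse.ArgumentParser(
        description="A dimensionality reduction dataset processor."
    )
    parser.add_argument("--input", required=True,
                        help="The dataset to pass to a method.")
    parser.add_argument("--output_dataset", required=True,
                        help="(Output) The dataset to pass to a method.")
    parser.add_argument("--output_solution", required=True,
                        help="(Output) The data for evaluating a dimensionality reduction.")
    return parser


# Example invocation; the two output paths are placeholders.
args = build_parser().parse_args([
    "--input", "resources_test/common/pancreas/dataset.h5ad",
    "--output_dataset", "dataset.h5ad",
    "--output_solution", "solution.h5ad",
])
```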



## File format: Dataset

The dataset to pass to a method.

Example file: `resources_test/dimensionality_reduction/pancreas/dataset.h5ad`

Format:



    AnnData object
     var: 'hvg_score'
     layers: 'counts', 'normalized'
     uns: 'dataset_id', 'normalization_id'



Data structure:



| Slot | Type | Description |
|:---|:---|:---|
| `var["hvg_score"]` | `double` | High variability gene score (normalized dispersion). The greater, the more variable. |
| `layers["counts"]` | `integer` | Raw counts. |
| `layers["normalized"]` | `double` | Normalized expression values. |
| `uns["dataset_id"]` | `string` | A unique identifier for the dataset. |
| `uns["normalization_id"]` | `string` | Which normalization was used. |



## File format: Test data

The data for evaluating a dimensionality reduction.

Example file: `resources_test/dimensionality_reduction/pancreas/solution.h5ad`

Format:



    AnnData object
     obs: 'cell_type'
     var: 'hvg_score'
     layers: 'counts', 'normalized'
     uns: 'dataset_id', 'dataset_name', 'dataset_url', 'dataset_reference', 'dataset_summary', 'dataset_description', 'dataset_organism', 'normalization_id'



Data structure:



| Slot | Type | Description |
|:---|:---|:---|
| `obs["cell_type"]` | `string` | Classification of the cell type based on its characteristics and function within the tissue or organism. |
| `var["hvg_score"]` | `double` | High variability gene score (normalized dispersion). The greater, the more variable. |
| `layers["counts"]` | `integer` | Raw counts. |
| `layers["normalized"]` | `double` | Normalized expression values. |
| `uns["dataset_id"]` | `string` | A unique identifier for the dataset. |
| `uns["dataset_name"]` | `string` | Nicely formatted name. |
| `uns["dataset_url"]` | `string` | (*Optional*) Link to the original source of the dataset. |
| `uns["dataset_reference"]` | `string` | (*Optional*) Bibtex reference of the paper in which the dataset was published. |
| `uns["dataset_summary"]` | `string` | Short description of the dataset. |
| `uns["dataset_description"]` | `string` | Long description of the dataset. |
| `uns["dataset_organism"]` | `string` | (*Optional*) The organism of the sample in the dataset. |
| `uns["normalization_id"]` | `string` | Which normalization was used. |
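
The slot tables for the method dataset and the test data can be turned into a simple presence check. The sketch below is one possible way to do this and is not part of the repository: `REQUIRED_SLOTS` and `missing_slots` are hypothetical names, slots marked *Optional* are skipped, and a `SimpleNamespace` holding plain dictionaries stands in for an AnnData object so the example runs without extra dependencies. All values in the stand-in are placeholders.

```python
from types import SimpleNamespace

# Required (non-optional) slots per file format, copied from the tables above.
REQUIRED_SLOTS = {
    "dataset": {
        "var": ["hvg_score"],
        "layers": ["counts", "normalized"],
        "uns": ["dataset_id", "normalization_id"],
    },
    "solution": {
        "obs": ["cell_type"],
        "var": ["hvg_score"],
        "layers": ["counts", "normalized"],
        "uns": ["dataset_id", "dataset_name", "dataset_summary",
                "dataset_description", "normalization_id"],
    },
}


def missing_slots(adata, file_format):
    """Return the required slots of `file_format` that `adata` does not contain."""
    missing = []
    for container, keys in REQUIRED_SLOTS[file_format].items():
        present = getattr(adata, container, {})
        missing.extend(f'{container}["{key}"]' for key in keys if key not in present)
    return missing


# Stand-in object with placeholder values (not a real AnnData object).
stand_in = SimpleNamespace(
    obs={},
    var={"hvg_score": [0.1, 0.9]},
    layers={"counts": [[1, 0]], "normalized": [[0.7, 0.0]]},
    uns={"dataset_id": "pancreas", "normalization_id": "example_normalization"},
)

dataset_problems = missing_slots(stand_in, "dataset")
solution_problems = missing_slots(stand_in, "solution")
```

The stand-in satisfies the method dataset format but not the test data format, since it lacks `obs["cell_type"]` and the descriptive `uns` fields. Because the check relies only on the `in` operator, it should also work on a real AnnData object, where `in` on the `obs` and `var` data frames tests column labels.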



## Component type: Control method

Quality control methods for verifying the pipeline.

Arguments:



| Name | Type | Description |
|:---|:---|:---|
| `--input` | `file` | The dataset to pass to a method. |
| `--input_solution` | `file` | The data for evaluating a dimensionality reduction. |
| `--output` | `file` | (*Output*) A dataset with dimensionality reduction embedding. |



## Component type: Method

A dimensionality reduction method.

Arguments:



| Name | Type | Description |
|:---|:---|:---|
| `--input` | `file` | The dataset to pass to a method. |
| `--output` | `file` | (*Output*) A dataset with dimensionality reduction embedding. |
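
To show the shape of the computation a method performs, the sketch below reduces a normalised expression matrix to two dimensions with PCA computed via a singular value decomposition. PCA is used here only as a simple stand-in and is not claimed to be how any method in the repository is implemented; the function name `pca_embedding` and the random input matrix are illustrative assumptions. In a real component the input would come from `layers["normalized"]` and the result would be stored in `obsm["X_emb"]`.

```python
import numpy as np


def pca_embedding(normalized: np.ndarray, n_components: int = 2) -> np.ndarray:
    """Project cells (rows) onto their first `n_components` principal components."""
    # Centre each gene (column) so the SVD yields principal components.
    centered = normalized - normalized.mean(axis=0, keepdims=True)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    # Principal component scores are the left singular vectors scaled by
    # the singular values, which NumPy returns in descending order.
    return u[:, :n_components] * s[:n_components]


# Random stand-in for a normalised matrix: 50 cells by 20 genes.
rng = np.random.default_rng(0)
normalized = rng.random((50, 20))
x_emb = pca_embedding(normalized)  # one two-dimensional coordinate per cell
```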



## Component type: Metric

A dimensionality reduction metric.

Arguments:



| Name | Type | Description |
|:---|:---|:---|
| `--input_embedding` | `file` | A dataset with dimensionality reduction embedding. |
| `--input_solution` | `file` | The data for evaluating a dimensionality reduction. |
| `--output` | `file` | (*Output*) Metric score file. |
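
A metric compares an embedding with the test data and returns a score. Since the description notes that methods generally try to preserve distances, the sketch below scores an embedding by the Pearson correlation between pairwise Euclidean distances in the original space and in the embedding. This is a hypothetical example metric, not one taken from the repository, and the name `distance_preservation_score` is invented for illustration.

```python
import numpy as np


def pairwise_distances(x: np.ndarray) -> np.ndarray:
    """Dense matrix of Euclidean distances between all rows of `x`."""
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def distance_preservation_score(high_dim: np.ndarray, embedding: np.ndarray) -> float:
    """Pearson correlation between pairwise distances before and after embedding."""
    # Use each unordered pair of cells once (upper triangle, no diagonal).
    upper = np.triu_indices(high_dim.shape[0], k=1)
    d_high = pairwise_distances(high_dim)[upper]
    d_low = pairwise_distances(embedding)[upper]
    return float(np.corrcoef(d_high, d_low)[0, 1])


rng = np.random.default_rng(0)
points = rng.random((30, 5))           # stand-in for normalised expression
truncated = points[:, :2]              # naive "embedding": keep two columns
score_truncated = distance_preservation_score(points, truncated)
score_identity = distance_preservation_score(points, points)
```

An embedding that reproduces all distances exactly, as in the identity case, gets a score of 1. The dense distance matrix grows quadratically with the number of cells, so this form is only practical for small inputs.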



## File format: Embedding

A dataset with dimensionality reduction embedding.

Example file: `resources_test/dimensionality_reduction/pancreas/embedding.h5ad`

Format:



    AnnData object
     obsm: 'X_emb'
     uns: 'dataset_id', 'method_id', 'normalization_id'



Data structure:



| Slot                      | Type     | Description                          |
|:--------------------------|:---------|:-------------------------------------|
| `obsm["X_emb"]`           | `double` | The dimensionally reduced embedding. |
| `uns["dataset_id"]`       | `string` | A unique identifier for the dataset. |
| `uns["method_id"]`        | `string` | A unique identifier for the method.  |
| `uns["normalization_id"]` | `string` | Which normalization was used.        |



## File format: Score

Metric score file.

Example file: `resources_test/dimensionality_reduction/pancreas/score.h5ad`

Format:



    AnnData object
     uns: 'dataset_id', 'normalization_id', 'method_id', 'metric_ids', 'metric_values'



Data structure:



| Slot | Type | Description |
|:---|:---|:---|
| `uns["dataset_id"]` | `string` | A unique identifier for the dataset. |
| `uns["normalization_id"]` | `string` | Which normalization was used. |
| `uns["method_id"]` | `string` | A unique identifier for the method. |
| `uns["metric_ids"]` | `string` | One or more unique metric identifiers. |
| `uns["metric_values"]` | `double` | The metric values obtained for the given prediction. Must be of same length as `metric_ids`. |
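
The requirement that `metric_values` has the same length as `metric_ids` can be checked before a score file is written. The following sketch builds the `uns` fields from a mapping of metric identifiers to values and validates them. The helper names `make_score_uns` and `validate_score_uns` are hypothetical, a plain dictionary stands in for the `uns` slot of an AnnData object, and all identifiers in the example are placeholders.

```python
REQUIRED_KEYS = ["dataset_id", "normalization_id", "method_id",
                 "metric_ids", "metric_values"]


def make_score_uns(dataset_id, normalization_id, method_id, metrics):
    """Build the `uns` fields of a score file from a {metric_id: value} mapping."""
    metric_ids = list(metrics)
    return {
        "dataset_id": dataset_id,
        "normalization_id": normalization_id,
        "method_id": method_id,
        "metric_ids": metric_ids,
        "metric_values": [float(metrics[m]) for m in metric_ids],
    }


def validate_score_uns(uns):
    """Raise ValueError if a field is missing or the metric lists differ in length."""
    missing = [key for key in REQUIRED_KEYS if key not in uns]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if len(uns["metric_ids"]) != len(uns["metric_values"]):
        raise ValueError("metric_ids and metric_values must have the same length")


# Placeholder identifiers and value, for illustration only.
score_uns = make_score_uns(
    "pancreas", "example_normalization", "example_method",
    {"example_metric": 0.5},
)
validate_score_uns(score_uns)
```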