
UMAP dimension reduction

UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction.

UMAP is a novel manifold learning technique for dimension reduction. UMAP is constructed from a theoretical framework based in Riemannian geometry and algebraic topology. The result is a practical scalable algorithm that applies to real world data. The UMAP algorithm is competitive with t-SNE for visualization quality, and arguably preserves more of the global structure with superior run time performance. Furthermore, UMAP has no computational restrictions on embedding dimension, making it viable as a general purpose dimension reduction technique for machine learning.

Spectra selection, normalization and previsualization

The first step is to select the spectra:

Select spectra
How to select spectra.

Spectra selection

All the spectra analysis tools start with a phase of selection.

overview

Select samples

To facilitate the analysis, it is advised to have samples containing representative spectra, so that the intra-variability as well as the reproducibility can be evaluated.

The spectra to analyze are selected with one of these 3 methods:

At the sample level, either click on the + of a sample to add all the spectra related to this sample, or click on the + at the top of the sample box to add all the spectra of all the selected samples.

select sample

If you select a sample, you can also add a specific spectrum by clicking on the + in the spectra list.

select spectra

Once spectra have been selected, data normalization filters can be applied:

Normalization
How to normalize spectra.

Preprocessing

To compare spectra, a matrix must be created. In this matrix each row corresponds to a spectrum and each column holds the Y values at a specific X. To create this matrix we apply various preprocessing steps:

  • filter the data to reduce the impact of sample preparation or experimental artifacts
  • select the representative part of the spectra that is expected to be important for the analysis
  • remove large peaks that are not characteristic of the sample (like water in NMR spectra) and could interfere with the analysis
  • reduce the number of points to accelerate the analysis
  • apply matrix-level processing to normalize the columns

preferences

Filters

You may also apply various filters that normalize or transform the data. The available filters are:

  • Center mean
  • Divide by SD (standard deviation)
  • Rescale: set the min value to 0 and the max value to 1
  • Normalize: set the sum of all the points to 1
  • Align: run a peak picking between the from and to values and calculate the mean X value of the nbPeaks highest peaks. The spectrum is then shifted so that this mean falls on the targetX value.
  • Pareto: Pareto scaling, which uses the square root of the standard deviation as the scaling factor, circumvents the amplification of noise by retaining a small portion of magnitude information (10.1016/j.molstruc.2007.12.026)
  • Savitzky-Golay: smooth the spectra and calculate derivatives based on the following parameters:
    • windowSize: smoothing window, must be an odd number
    • derivative: enter 0, 1 or 2
    • polynomial: the degree of the polynomial used to calculate SG
  • X function: a function that modifies the X axis based on the x parameter, for example log(x)
  • Y function: a function that modifies the Y axis based on the y parameter, for example log10(y+1)
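As an illustration of the two simplest filters in the list, here is a minimal Python sketch of Rescale and Normalize applied to the Y values of one spectrum. The function names `rescale` and `normalize_sum` are hypothetical and are not taken from the tool. The handling of a flat spectrum is also an assumption.

```python
def rescale(ys):
    """Rescale: map the min value to 0 and the max value to 1."""
    lo, hi = min(ys), max(ys)
    if hi == lo:
        # Assumption: a flat spectrum is mapped to all zeros.
        return [0.0 for _ in ys]
    return [(y - lo) / (hi - lo) for y in ys]


def normalize_sum(ys):
    """Normalize: divide every point so that the sum of all points is 1."""
    total = sum(ys)
    if total == 0:
        raise ValueError("cannot normalize a spectrum whose sum is 0")
    return [y / total for y in ys]


spectrum = [2.0, 4.0, 6.0, 8.0]
rescaled = rescale(spectrum)
normalized = normalize_sum(spectrum)
```

After these calls, `rescaled` runs from 0 to 1 and the values of `normalized` add up to 1.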

One classical preprocessing algorithm is Standard Normal Variate (SNV). This preprocessing can be achieved by selecting the 2 options Center mean and Divide by SD.
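The combination of Center mean and Divide by SD can be sketched in a few lines of Python. This is only an illustration of the SNV idea, not the tool's code. The function name `snv` is hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption.

```python
import statistics


def snv(ys):
    """Standard Normal Variate: center on the mean, then divide by the SD."""
    mean = statistics.fmean(ys)
    sd = statistics.stdev(ys)  # assumption: sample SD (n - 1 denominator)
    if sd == 0:
        raise ValueError("SNV is undefined for a flat spectrum")
    return [(y - mean) / sd for y in ys]


spectrum = [1.0, 2.0, 3.0, 4.0, 5.0]
result = snv(spectrum)
```

Each spectrum is processed independently, so after SNV every row has a mean of 0 and a standard deviation of 1.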

Selecting the range

Only the information between the From and To values of the range will be considered.

Exclusions

Depending on the analysis, some regions should be removed in order to improve the results. For example, NMR spectroscopy in water yields a large peak around 4.5 ppm, which can be removed from the analysis with an exclusion zone.
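The effect of the range and of the exclusion zones can be sketched as a simple point filter. The function `select_range` and its parameters are hypothetical. Treating all bounds as inclusive is an assumption.

```python
def select_range(xs, ys, from_x, to_x, exclusions=()):
    """Keep the points inside [from_x, to_x] that fall in no exclusion zone.

    exclusions is a sequence of (low, high) pairs; bounds are treated
    as inclusive here, which is an assumption.
    """
    kept_x, kept_y = [], []
    for x, y in zip(xs, ys):
        if not (from_x <= x <= to_x):
            continue  # outside the selected range
        if any(low <= x <= high for low, high in exclusions):
            continue  # inside an exclusion zone
        kept_x.append(x)
        kept_y.append(y)
    return kept_x, kept_y


# Synthetic spectrum from 0 to 10 ppm in steps of 0.5 ppm
xs = [i * 0.5 for i in range(21)]
ys = [1.0] * len(xs)

# Keep 1 to 9 ppm and drop the region around the water peak
kept_x, kept_y = select_range(xs, ys, 1.0, 9.0, exclusions=[(4.0, 5.0)])
```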

Number of points

The data normalization process will select Nb equidistant points between the From and To values.
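One possible way to obtain equidistant points is linear interpolation between the two nearest original points. The sketch below assumes this approach; the method actually used by the tool is not described here, and the function name `resample` is hypothetical.

```python
from bisect import bisect_right


def resample(xs, ys, from_x, to_x, nb_points):
    """Return nb_points equidistant points between from_x and to_x.

    Assumptions: xs is sorted in ascending order with distinct values,
    and [from_x, to_x] lies inside [xs[0], xs[-1]].
    """
    if nb_points < 2:
        raise ValueError("nb_points must be at least 2")
    step = (to_x - from_x) / (nb_points - 1)
    new_x = [from_x + i * step for i in range(nb_points)]
    new_y = []
    for x in new_x:
        # Index of the right-hand neighbour, clamped to a valid segment
        i = min(max(bisect_right(xs, x), 1), len(xs) - 1)
        x0, x1 = xs[i - 1], xs[i]
        y0, y1 = ys[i - 1], ys[i]
        t = (x - x0) / (x1 - x0)
        new_y.append(y0 + t * (y1 - y0))
    return new_x, new_y


# Irregularly sampled input: there is no original point at x = 3
xs = [0.0, 1.0, 2.0, 4.0]
ys = [0.0, 10.0, 20.0, 40.0]
new_x, new_y = resample(xs, ys, 0.0, 4.0, 5)
```

Because every spectrum is resampled on the same X grid, the resulting rows can be stacked directly into a matrix.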

Matrix processing

Once all the previous filters have been applied, we obtain a matrix in which the rows represent the normalized spectra and the columns represent the intensities of the spectra at each X value.

Some filters use the columns for further processing:

  • PQN: Probabilistic Quotient Normalization (10.1021/ac051632c)
  • Center mean: for each column the mean is subtracted from the values, so that the column is centered on 0
  • Rescale (0 to 1): for each column the min value will be set to 0 and the max value to 1
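The three matrix filters can be sketched as follows on a matrix stored as a list of rows. The function names are hypothetical. The PQN version is simplified: it uses the column-wise median spectrum as the reference and skips the preliminary integral normalization described in the cited paper, both of which are assumptions of this sketch.

```python
import statistics


def center_columns(matrix):
    """Center mean: subtract the column mean from every value of the column."""
    means = [statistics.fmean(col) for col in zip(*matrix)]
    return [[v - m for v, m in zip(row, means)] for row in matrix]


def rescale_columns(matrix):
    """Rescale (0 to 1): per column, min becomes 0 and max becomes 1."""
    cols = list(zip(*matrix))
    mins = [min(c) for c in cols]
    spans = [max(c) - min(c) for c in cols]
    # Assumption: a constant column is mapped to 0.
    return [
        [(v - lo) / span if span else 0.0 for v, lo, span in zip(row, mins, spans)]
        for row in matrix
    ]


def pqn(matrix):
    """Simplified Probabilistic Quotient Normalization."""
    reference = [statistics.median(col) for col in zip(*matrix)]
    result = []
    for row in matrix:
        # Quotient of each point against the reference spectrum
        quotients = [v / r for v, r in zip(row, reference) if r != 0]
        factor = statistics.median(quotients)  # most probable dilution factor
        result.append([v / factor for v in row])
    return result


# Three spectra of the same sample at different dilutions
matrix = [
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.0],
    [4.0, 8.0, 12.0],
]
centered = center_columns(matrix)
rescaled = rescale_columns(matrix)
normalized = pqn(matrix)
```

In this example the three rows differ only by a dilution factor, so PQN brings them all to the same values.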

Large dataset

The list of the spectra in the dataset is displayed in the following table:

memory

In some cases it is not possible to keep the original spectra in memory, and the system will only keep the normalized spectra. It will then no longer be possible to change the normalization parameters.

Preview

A preview of the normalized spectra as well as the exclusion zones will be displayed. This allows you to fine-tune the processing.

preview

The superimposed spectra can be manipulated with numerous advanced features.

The superimposed spectra can be manipulated with numerous options:

Visualization
How to visualize spectra.

Spectra visualization

Numerous options are available to display either all the spectra in the dataset or only the selected spectra.

selection

Selection of spectra in the dataset

The toolbar on the top of the list of spectra in the dataset provides many options (from left to right):

selection tools

  • Remove all spectra from dataset
  • Select category: select which property contains the category description
  • Download normalized matrix
  • Recolor spectra based on category: a different color will be applied for each category. By default, the category is the sample reference
  • Select all spectra
  • Append to selected spectra
  • Select only current spectra
  • Remove spectra from current selection
  • Unselect all spectra

Graph options

It is possible to either display the selected spectra, all the spectra or various derived information.

display

Customization of the display is achieved using the chart toolbar:

graph tools

Display spectra

The first options allow you to display all the spectra, only the selected spectra, or nothing.

selected

Displaying no spectrum is useful when displaying other derived data.

Original / normalized

These options allow you to display either the original spectra or the normalized data. Most of the time we will display normalized data. Those are the data that will be analyzed, and normally they also take less room in memory.

original

Boxplot

The boxplot representation displays, for each X point, the range between the first and third quartiles as a dark gray zone. The min and max values are represented as a light gray zone, and the median is represented as a line whose color varies with the standard deviation (red: high variation, blue: small variation).

boxplot
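The statistics behind such a boxplot can be sketched per column of the matrix. The function name `column_summary` is hypothetical, and the quartile definition (Python's "inclusive" method) and the sample standard deviation are assumptions; the tool may compute them differently.

```python
import statistics


def column_summary(matrix):
    """For each X point (column), return min, quartiles, max and SD."""
    summary = []
    for col in zip(*matrix):
        q1, median, q3 = statistics.quantiles(col, n=4, method="inclusive")
        summary.append(
            {
                "min": min(col),
                "q1": q1,          # lower edge of the dark gray zone
                "median": median,  # colored line
                "q3": q3,          # upper edge of the dark gray zone
                "max": max(col),
                "sd": statistics.stdev(col),  # drives the line color
            }
        )
    return summary


# Five spectra, two X points: the first varies, the second is constant
matrix = [
    [1.0, 10.0],
    [2.0, 10.0],
    [3.0, 10.0],
    [4.0, 10.0],
    [5.0, 10.0],
]
summary = column_summary(matrix)
```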

Tracking information

By selecting the tracking information you will display the X values and the corresponding Y values for all the spectra.

tracking

Correlation

Correlating the vectors formed by the Y values at each X position across the spectra can be useful to determine which peaks are correlated in a complex mixture of products. This is known in NMR metabolomics as STOCSY (statistical total correlation spectroscopy).

Use SHIFT ⇧ + ALT + click to select the X value for which you would like to check the correlation. Strongly correlated signals will appear in red while non-correlated signals are blue.

correlation
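The idea can be sketched by computing the Pearson correlation between the column at the selected X value and every other column of the matrix. The function names are hypothetical and the choice of the Pearson coefficient is an assumption of this sketch.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # assumption: a constant column is reported as uncorrelated
    return cov / math.sqrt(var_a * var_b)


def correlate_with_column(matrix, index):
    """Correlate the column at the selected index with every column."""
    cols = list(zip(*matrix))
    driver = cols[index]
    return [pearson(driver, col) for col in cols]


# Four spectra, three X points: columns 0 and 1 vary together
matrix = [
    [1.0, 2.0, 9.0],
    [2.0, 4.0, 7.0],
    [3.0, 6.0, 8.0],
    [4.0, 8.0, 1.0],
]
correlations = correlate_with_column(matrix, 0)
```

Here the second column is proportional to the first one, so it is perfectly correlated with it, while the third column is negatively correlated.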

References