Construct mSet data container, read metabolomics data, filter data, and impute missing values

met.read_data is a wrapper function that constructs an mSet object, adds data from a table file or R dataframe object, applies unspecific and user-defined data filters, and imputes missing values.

Usage

met.read_data(
  data,
  data.type = "conc",
  anal.type = "stat",
  paired = FALSE,
  csvsep = ";",
  dec = ".",
  sheet = 1,
  data.format = "rowu",
  lbl.type = "disc",
  filt.feat = c(""),
  filt.smpl = c(""),
  filt.grp = c(""),
  filt.method = "none",
  remain.num = NULL,
  qcFilter = "F",
  qc.rsd = 25,
  all.rsd = NULL,
  imp.method = "lod",
  export = FALSE,
  img.format = "pdf",
  dpi = dpi
)

Arguments

data

Enter the name of an R dataframe object or the path (as a quoted string) of the CSV/TSV/XLS/XLSX/TXT file to read.

data.type

(Character) The type of data, either "list" (Compound lists), "conc" (Compound concentration data), "specbin" (Binned spectra data), "pktable" (Peak intensity table), "nmrpeak" (NMR peak lists), "mspeak" (MS peak lists), or "msspec" (MS spectra data).

anal.type

(Character) Indicate the analysis module to be performed: "stat", "pathora", "pathqea", "msetora", "msetssp", "msetqea", "ts", "cmpdmap", "smpmap", or "pathinteg".

paired

(Logical) Indicate if the data is paired (TRUE) or not (FALSE).

csvsep

(Character) Enter the separator used in the CSV file (only applicable if reading a ".csv" file).

dec

(Character) Decimal separator used in CSV, TSV, and TXT files.

sheet

(Integer or Character string) Number or name of the sheet with metabolomics data in XLS or XLSX files (optional).

data.format

(Character) Specify if samples are paired and in rows ("rowp"), unpaired and in rows ("rowu"), in columns and paired ("colp"), or in columns and unpaired ("colu").
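
As a rough illustration of how csvsep, dec, and data.format fit together, the following base R sketch writes a small semicolon-separated file with unpaired samples in rows. The column layout (sample names first, group labels second, features afterwards) and all names in the toy table are assumptions made for this example; they are not taken from the function's documentation.

```r
# Hypothetical toy table: samples in rows, features in columns ("rowu").
# Assumed layout: column 1 = sample names, column 2 = group labels.
toy <- data.frame(
  Sample  = c("s1", "s2", "s3", "s4"),
  Group   = c("ctrl", "ctrl", "treat", "treat"),
  glucose = c(1.2, 1.4, NA, 2.9),   # NA marks a missing value
  lactate = c(0.5, 0.6, 1.1, 1.3)
)

# Write with ";" as field separator and "." as decimal separator,
# matching the defaults csvsep = ";" and dec = ".".
csv_path <- tempfile(fileext = ".csv")
write.table(toy, csv_path, sep = ";", dec = ".",
            row.names = FALSE, quote = FALSE)

# Read the file back with base R to confirm the separators round-trip.
check <- read.csv(csv_path, sep = ";", dec = ".")

# With the package loaded, the file could then be passed on, for example:
# mset <- met.read_data(csv_path, csvsep = ";", dec = ".", data.format = "rowu")
```

The commented call at the end uses only arguments listed in the usage section above.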

lbl.type

(Character) Specify the group label type, either categorical ("disc") or continuous ("cont").

filt.feat

(Character Vector) Enter the names of features to remove from the dataset.

filt.smpl

(Character Vector) Enter the names of samples to remove from the dataset.

filt.grp

(Character Vector) Enter the names of groups to remove from the dataset.

filt.method

(Character) Select an option for unspecific filtering based on the following ranking criteria:

"none" apply no unspecific filtering.
"rsd" filters features with low relative standard deviation across the dataset.
"nrsd" filters features with low non-parametric relative standard deviation across the dataset.
"mean" filters features with low mean intensity value across the dataset.
"median" filters features with low median intensity value across the dataset.
"sd" filters features with low absolute standard deviation across the dataset.
"mad" filters features with low median absolute deviation across the dataset.
"iqr" filters features with a low inter-quartile range across the dataset.

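The ranking criteria listed for filt.method can be sketched in base R as follows. This is an illustration only, not the package's implementation: the toy matrix, the helper filter_features, and the definition of the non-parametric RSD as MAD divided by median are assumptions.

```r
# Toy intensity matrix: 6 samples (rows) x 4 features (columns).
x <- matrix(
  c( 10,  11,  9,  10, 12,   8,   # f1
    100, 300, 50, 400, 20, 250,   # f2
      5,   5,  5,   5,  5,   6,   # f3
      1,  40,  2,  80,  3,  60),  # f4
  nrow = 6,
  dimnames = list(paste0("s", 1:6), c("f1", "f2", "f3", "f4"))
)

# One score per feature for each ranking criterion.
# Note: mad() in R scales by 1.4826 by default.
scores <- list(
  rsd    = apply(x, 2, sd, na.rm = TRUE) / abs(colMeans(x, na.rm = TRUE)),
  nrsd   = apply(x, 2, mad, na.rm = TRUE) / apply(x, 2, median, na.rm = TRUE),
  mean   = colMeans(x, na.rm = TRUE),
  median = apply(x, 2, median, na.rm = TRUE),
  sd     = apply(x, 2, sd, na.rm = TRUE),
  mad    = apply(x, 2, mad, na.rm = TRUE),
  iqr    = apply(x, 2, IQR, na.rm = TRUE)
)

# Hypothetical helper: keep the keep_n features with the highest score,
# i.e. remove the features ranking lowest for the chosen criterion.
filter_features <- function(x, score, keep_n) {
  keep <- order(score, decreasing = TRUE)[seq_len(keep_n)]
  x[, sort(keep), drop = FALSE]
}

filtered <- filter_features(x, scores$rsd, keep_n = 3)
colnames(filtered)
```

In this toy data, f3 is nearly constant and therefore has the lowest relative standard deviation, so it is the feature dropped by the "rsd" criterion.
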
remain.num

(Numeric) Enter the number of variables to keep in your dataset. If NULL, the following empirical rules are applied during data filtering with the method specified in filt.method:

Less than 250 variables: 5% will be filtered
250 - 500 variables: 10% will be filtered
500 - 1000 variables: 25% will be filtered
More than 1000 variables: 40% will be filtered
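
The empirical rules above can be written as a small helper. The function name default_remain_num is hypothetical, and both the handling of the exact boundaries (250, 500, 1000) and the rounding are assumptions, since the rules as stated leave them open.

```r
# Hypothetical helper: number of variables kept when remain.num is NULL.
# Boundary values are assigned to the higher filtering tier (an assumption).
default_remain_num <- function(n_features) {
  frac_removed <- if (n_features < 250) {
    0.05
  } else if (n_features < 500) {
    0.10
  } else if (n_features < 1000) {
    0.25
  } else {
    0.40
  }
  # Rounding to a whole number of removed variables is also an assumption.
  n_features - round(n_features * frac_removed)
}

# Variables kept for datasets of different sizes.
kept <- vapply(c(200, 400, 800, 2000), default_remain_num, numeric(1))
kept
```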

qcFilter

(Logical) Filter the variables based on the relative standard deviation of features in quality control (QC) samples (TRUE), or not (FALSE). This filter can be applied in addition to other, unspecific filtering methods.

qc.rsd

(Numeric) Define the relative standard deviation cut-off in %. Variables with an RSD greater than this number in the QC samples will be removed from the dataset. This argument only needs to be specified if qcFilter is TRUE; otherwise, it is ignored.
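
A minimal base R sketch of a QC-based RSD filter is shown below. It assumes that QC samples can be recognized by a "QC" prefix in the sample name; that convention, the toy data, and the helper qc_rsd_filter are illustrative assumptions rather than the package's actual behavior.

```r
# Toy data: 3 QC samples and 2 study samples (rows), 3 features (columns).
x <- matrix(
  c(100, 102,  98, 150, 90,   # f1
     50,  80,  20,  60, 40,   # f2
     10, 10.5, 9.5, 30,  5),  # f3
  nrow = 5,
  dimnames = list(c("QC1", "QC2", "QC3", "S1", "S2"), c("f1", "f2", "f3"))
)

# Assumption: QC samples are identified by their name prefix.
is_qc <- grepl("^QC", rownames(x))

# Hypothetical helper: drop features whose RSD (in %) across the
# QC samples exceeds the cut-off qc.rsd.
qc_rsd_filter <- function(x, is_qc, qc.rsd = 25) {
  qc <- x[is_qc, , drop = FALSE]
  rsd_pct <- 100 * apply(qc, 2, sd, na.rm = TRUE) /
    abs(colMeans(qc, na.rm = TRUE))
  x[, rsd_pct <= qc.rsd, drop = FALSE]
}

filtered <- qc_rsd_filter(x, is_qc, qc.rsd = 25)
colnames(filtered)
```

Here f2 varies strongly across the QC samples (RSD of 60%) and is removed, while f1 and f3 stay below the 25% cut-off.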

all.rsd

(Numeric or NULL) Threshold (in %) for a filter based on the in-group relative standard deviation (RSD). If NULL, this filter is not applied. The RSD of every feature is calculated within every group in the dataset. If the RSD of a variable in any group exceeds the indicated threshold, it is removed from the dataset. This filter can be applied in addition to other filtering methods and is especially useful for data with technical replicates.
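
The in-group RSD filter can be sketched in base R as follows. The helper group_rsd_filter and the toy data are hypothetical and only illustrate the rule described above: a feature is dropped as soon as its RSD exceeds the threshold in any one group.

```r
# Toy data: 6 samples (rows) in 2 groups, 3 features (columns).
x <- matrix(
  c(10, 11,   9,   20, 21, 19,   # f1
    10, 30,   5,   20, 21, 19,   # f2
     5,  5.2, 4.8,  8,  2, 14),  # f3
  nrow = 6,
  dimnames = list(paste0("s", 1:6), c("f1", "f2", "f3"))
)
group <- rep(c("A", "B"), each = 3)

# Hypothetical helper: remove features whose RSD (in %) exceeds
# all.rsd in at least one group; NULL disables the filter.
group_rsd_filter <- function(x, group, all.rsd = NULL) {
  if (is.null(all.rsd)) {
    return(x)
  }
  keep <- rep(TRUE, ncol(x))
  for (g in unique(as.character(group))) {
    sub <- x[group == g, , drop = FALSE]
    rsd_pct <- 100 * apply(sub, 2, sd, na.rm = TRUE) /
      abs(colMeans(sub, na.rm = TRUE))
    keep <- keep & (rsd_pct <= all.rsd)
  }
  x[, keep, drop = FALSE]
}

filtered <- group_rsd_filter(x, group, all.rsd = 20)
colnames(filtered)
```

In this example, f2 is too variable in group A and f3 is too variable in group B, so only f1 passes a 20% threshold.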

imp.method

(Character) Select the option to replace missing values:

"lod" replaces missing values with 1/5 of the minimum value for the respective variable.
"rowmin" replaces missing values with the half sample minimum.
"colmin" replaces missing values with the half feature minimum.
"mean" replaces missing values with the mean value of the respective feature column.
"median" replaces missing values with the median value of the respective feature column.
"knn_var" imputes a missing value by finding the features in the training set “closest” to the affected feature and averaging these nearby points to fill in the value.
"knn_smp" imputes a missing value by finding the samples in the training set “closest” to the affected sample and averaging these nearby points to fill in the value.
"bpca" applies Bayesian PCA to impute missing values.
"ppca" applies probabilistic PCA to impute missing values.
"svdImpute" applies singular value decomposition to impute missing values.

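The simpler replacement rules ("lod", "rowmin", "colmin", "mean", "median") can be sketched in base R as shown below. The helper impute_simple is hypothetical and assumes samples in rows and features in columns; the kNN, PCA, and SVD based methods are not covered here.

```r
# Toy data: 3 samples (rows) x 2 features (columns), two missing values.
x <- matrix(
  c(10, NA, 30,   # f1
     4,  8, NA),  # f2
  nrow = 3,
  dimnames = list(c("s1", "s2", "s3"), c("f1", "f2"))
)

# Hypothetical helper: fill each NA with a per-feature (column) or,
# for "rowmin", per-sample (row) replacement value.
impute_simple <- function(x, method = c("lod", "rowmin", "colmin",
                                        "mean", "median")) {
  method <- match.arg(method)
  by_row <- method == "rowmin"
  fill_fun <- switch(method,
    lod    = function(v) min(v, na.rm = TRUE) / 5,   # 1/5 of feature minimum
    rowmin = function(v) min(v, na.rm = TRUE) / 2,   # half sample minimum
    colmin = function(v) min(v, na.rm = TRUE) / 2,   # half feature minimum
    mean   = function(v) mean(v, na.rm = TRUE),
    median = function(v) median(v, na.rm = TRUE)
  )
  fill <- apply(x, if (by_row) 1 else 2, fill_fun)
  idx <- which(is.na(x), arr.ind = TRUE)
  x[idx] <- fill[idx[, if (by_row) "row" else "col"]]
  x
}

imputed_lod <- impute_simple(x, "lod")
imputed_rowmin <- impute_simple(x, "rowmin")
```

The difference between "rowmin" and "colmin" is only the direction in which the minimum is taken: across the features of one sample, or across the samples of one feature.
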
export

(Logical, TRUE or FALSE) Shall the missing value detection plots be exported as PDF or PNG file?

img.format

(Character, "png" or "pdf") image file format (if export = TRUE).

dpi

(Numeric) The resolution (in dots per inch) of exported PNG and PDF images.

Value

An mSet object with (built in ascending order):

original data at mSetObj$dataSet$data_orig.
data after manual removal of features/samples/groups at mSetObj$dataSet$edit.
data after unspecific filtering at mSetObj$dataSet$filt.
data with imputed missing values at mSetObj$dataSet$data_proc.
missing value heatmap at mSetObj$imgSet$missval_heatmap.plot (see met.plot_missval).
density and CumSum plots of intensities of features with and without missing values at mSetObj$imgSet$missval_density.plot (see met.plot_detect).

Author

Nicolas T. Wirth <mail.nicowirth@gmail.com>
Technical University of Denmark
License: GNU GPL (>= 2)