Read proteomics data in table format and create SummarizedExperiment

prot.read_data reads proteomics data from a table file or an R dataframe, filters proteins based on missing values (see prot.filter_missing) and, optionally, a relative standard deviation threshold, and creates a SummarizedExperiment object.

Usage

prot.read_data(
  data = "dat_prot.csv",
  expdesign = NULL,
  csvsep = ";",
  dec = ".",
  na.strings = "",
  sheet = 1,
  filter = c("Reverse", "Potential contaminant"),
  rsd_thresh = NULL,
  name = "Gene Symbol",
  id = "Ensembl Gene ID",
  pfx = "abundances.",
  filt_type = c("condition", "complete", "fraction", NULL),
  filt_thr = 3,
  filt_min = NULL,
  log2_transform = TRUE
)

Arguments

data

An R dataframe object or a table file with extension '.xlsx', '.xls', '.csv', '.tsv', or '.txt' containing proteomics data. The table must contain:

A column with protein IDs (e.g., accession numbers). The header of this column is provided as argument id.
A column with protein names. The header of this column is provided as argument name.
X columns containing abundance values for X samples. The column headers must have a prefix (e.g., "abundances.") that is provided as argument pfx. Replicates are identified by identical column headers followed by an underscore and the replicate number (e.g., "abundances.ConditionA_1", "abundances.ConditionA_2", "abundances.ConditionA_3", ...).
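As an illustration of the layout described above, the following is a minimal sketch of a toy input table in base R. The protein IDs, names, and values are hypothetical, and the checks are only an assumption about what a valid table looks like, not the validation that prot.read_data itself performs.

```r
# Hypothetical toy table; column headers follow the defaults shown under Usage.
dat <- data.frame(
  "Ensembl Gene ID" = c("ID0001", "ID0002", "ID0003"),   # made-up IDs
  "Gene Symbol" = c("GeneA", "GeneB", "GeneC"),          # made-up names
  "abundances.ConditionA_1" = c(10.2, NA, 8.1),
  "abundances.ConditionA_2" = c(10.5, 7.9, 8.3),
  "abundances.ConditionB_1" = c(12.1, 8.2, NA),
  "abundances.ConditionB_2" = c(11.8, 8.0, 7.7),
  check.names = FALSE,          # keep spaces and dots in the headers as written
  stringsAsFactors = FALSE
)

id   <- "Ensembl Gene ID"
name <- "Gene Symbol"
pfx  <- "abundances."

# Abundance columns are those whose header starts with the prefix.
abund_cols <- colnames(dat)[startsWith(colnames(dat), pfx)]

# Sanity checks on the layout: ID and name columns exist, and every
# abundance header ends in "_<replicate number>".
stopifnot(all(c(id, name) %in% colnames(dat)))
stopifnot(length(abund_cols) > 0, all(grepl("_[0-9]+$", abund_cols)))
```

A dataframe of this shape could then be passed as the data argument instead of a file path.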

expdesign

(optional, if made previously) An R dataframe object or a table file containing the columns 'label', 'condition', and 'replicate' with label = "condition_replicate". If NULL, an experimental design table will be created automatically.
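For orientation, an experimental design table of the described shape could be derived from the abundance column headers roughly as follows. This is a base R sketch under the naming convention stated for data; the column types (for example, replicate as integer) are assumptions and may differ from what the function creates when expdesign is NULL.

```r
# Example abundance headers following the "<pfx><condition>_<replicate>" convention.
headers <- c("abundances.ConditionA_1", "abundances.ConditionA_2",
             "abundances.ConditionB_1", "abundances.ConditionB_2")
pfx <- "abundances."

# Strip the prefix to obtain labels of the form "condition_replicate".
label <- substring(headers, nchar(pfx) + 1)

expdesign <- data.frame(
  label = label,
  condition = sub("_[0-9]+$", "", label),          # drop the trailing "_<number>"
  replicate = as.integer(sub("^.*_", "", label)),  # keep only the trailing number
  stringsAsFactors = FALSE
)
```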

csvsep

(Character string) separator used in CSV files (ignored for other file types). Default: ";"

dec

(Character string) decimal separator used in CSV, TSV or TXT files (ignored for other file types). Default: "."

na.strings

A character vector of strings which are to be interpreted as NA values.

sheet

(Integer or Character string) Number or name of the sheet with proteomics data in XLS or XLSX files (optional).

filter

(Character string or vector of strings) Header(s) of one or more columns containing "+" or "-", where "+" marks proteins to be discarded and "-" marks proteins to be kept.
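The effect of such marker columns can be sketched in base R as below. The toy data are hypothetical, and discarding a protein as soon as any listed column contains "+" is an assumption about how several filter columns are combined.

```r
# Hypothetical table with two marker columns named as in the default for filter.
dat <- data.frame(
  "Gene Symbol" = c("GeneA", "GeneB", "GeneC", "GeneD"),
  "Reverse" = c("-", "+", "-", "-"),
  "Potential contaminant" = c("-", "-", "+", "-"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)
filter_cols <- c("Reverse", "Potential contaminant")

# A protein is flagged if any of the filter columns contains "+".
flagged <- rowSums(dat[, filter_cols, drop = FALSE] == "+", na.rm = TRUE) > 0

# Keep only unflagged proteins.
kept <- dat[!flagged, , drop = FALSE]
```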

rsd_thresh

(Numeric, optional) Provide a relative standard deviation (RSD) threshold in % for proteins. The RSD is calculated for each condition and if the maximum RSD value determined for a given protein exceeds rsd_thresh, the protein is discarded. The RSD filter is applied before further missing value filters based on the three filt_ arguments.
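The RSD criterion can be illustrated with a short base R sketch. It assumes RSD = 100 * sd / mean, computed on untransformed abundances within each condition; the toy values and the helper name rsd are hypothetical, and the package's own implementation may differ in detail.

```r
# Toy abundance matrix: 2 proteins, 2 conditions with 3 replicates each.
abund <- rbind(
  GeneA = c(100, 102, 98, 50, 51, 49),
  GeneB = c(100, 150, 50, 50, 52, 48)
)
colnames(abund) <- c("A_1", "A_2", "A_3", "B_1", "B_2", "B_3")
condition <- sub("_[0-9]+$", "", colnames(abund))

# Relative standard deviation in percent (hypothetical helper).
rsd <- function(x) 100 * sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)

# RSD per protein (rows) and condition (columns).
rsd_by_cond <- sapply(unique(condition), function(cond) {
  apply(abund[, condition == cond, drop = FALSE], 1, rsd)
})

# Discard proteins whose largest per-condition RSD exceeds the threshold.
rsd_thresh <- 20
max_rsd <- apply(rsd_by_cond, 1, max, na.rm = TRUE)
kept <- abund[max_rsd <= rsd_thresh, , drop = FALSE]
```

In this toy example GeneB has an RSD of 50 % in condition A and is removed, while GeneA stays at 2 % in both conditions and is kept.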

name

(Character string) Provide the header of the column containing protein names.

id

(Character string) Provide the header of the column containing protein IDs.

pfx

(Character string) Provide the common prefix for headers containing abundance values (e.g., "abundances.").

filt_type

(Character string) One of "complete", "condition", or "fraction". Sets the type of missing value filtering applied. "complete" keeps only proteins with valid values in all samples. "condition" keeps proteins that have a maximum of filt_thr missing values in at least one condition. "fraction" keeps proteins whose fraction of valid values across all samples is at least filt_min.

filt_thr

(Integer) Sets the threshold for the allowed number of missing values in at least one condition if filt_type = "condition". In other words: "keep proteins that have a maximum of 'filt_thr' missing values in at least one condition."

filt_min

(Numeric) Sets the minimum fraction of valid values, across all samples, that a protein must have to be kept if filt_type = "fraction".
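The three filter types can be compared side by side with a small base R sketch. The toy matrix is hypothetical, and the exact comparisons (for example ">=" for the fraction filter) are assumptions based on the descriptions above rather than the code of prot.filter_missing.

```r
# Toy abundance matrix: 4 proteins, 2 conditions with 3 replicates each.
abund <- rbind(
  P1 = c(1, 2, 3, 4, 5, 6),
  P2 = c(1, NA, NA, NA, NA, NA),
  P3 = c(1, 2, NA, NA, NA, NA),
  P4 = c(1, 2, 3, NA, NA, NA)
)
colnames(abund) <- c("A_1", "A_2", "A_3", "B_1", "B_2", "B_3")
condition <- sub("_[0-9]+$", "", colnames(abund))
is_missing <- is.na(abund)

# "complete": valid values in all samples.
keep_complete <- rowSums(is_missing) == 0

# "condition": at most filt_thr missing values in at least one condition.
filt_thr <- 1
missing_by_cond <- sapply(unique(condition), function(cond) {
  rowSums(is_missing[, condition == cond, drop = FALSE])
})
keep_condition <- apply(missing_by_cond <= filt_thr, 1, any)

# "fraction": fraction of valid values across all samples is at least filt_min.
filt_min <- 0.5
keep_fraction <- rowMeans(!is_missing) >= filt_min
```

With these settings, "complete" keeps only P1, "condition" keeps P1, P3, and P4, and "fraction" keeps P1 and P4.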

log2_transform

(Logical) Should the data be log2-transformed? Default: TRUE

Value

A filtered SummarizedExperiment object.