Read proteomics data in table format and create SummarizedExperiment
Source: R/proteomics_computation.R
prot.read_data takes a table file containing proteomics data, filters proteins based on missing values (see prot.filter_missing) and, optionally, a given relative standard deviation threshold, and creates a SummarizedExperiment object.
Usage
prot.read_data(
  data = "dat_prot.csv",
  expdesign = NULL,
  csvsep = ";",
  dec = ".",
  na.strings = "",
  sheet = 1,
  filter = c("Reverse", "Potential contaminant"),
  rsd_thresh = NULL,
  name = "Gene Symbol",
  id = "Ensembl Gene ID",
  pfx = "abundances.",
  filt_type = c("condition", "complete", "fraction", NULL),
  filt_thr = 3,
  filt_min = NULL,
  log2_transform = TRUE
)
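As a minimal sketch of an input that matches the defaults above, the following base R snippet builds a small table, writes it as a semicolon-separated CSV file, and derives a design table with the columns 'label', 'condition', and 'replicate' from the abundance headers. The protein values, the temporary file, and the objects `dat` and `design` are invented for illustration. The final call is commented out because it needs the package that provides `prot.read_data` to be attached.

```r
# Toy input table; all names and values are invented for illustration.
dat <- data.frame(
  "Gene Symbol"             = c("geneA", "geneB", "geneC"),
  "Ensembl Gene ID"         = c("ID0001", "ID0002", "ID0003"),
  "Reverse"                 = c("-", "-", "+"),
  "Potential contaminant"   = c("-", "-", "-"),
  "abundances.ConditionA_1" = c(1200, 530, 80),
  "abundances.ConditionA_2" = c(1100, NA, 95),
  "abundances.ConditionB_1" = c(2300, 610, NA),
  "abundances.ConditionB_2" = c(2500, 640, 70),
  check.names = FALSE  # keep headers with spaces exactly as written
)

# Write with the separators the defaults expect (csvsep = ";", dec = ".").
csv_file <- tempfile(fileext = ".csv")
write.table(dat, csv_file, sep = ";", dec = ".", na = "", row.names = FALSE)

# Derive a design table from the abundance headers:
# strip the prefix, then split "condition_replicate" at the last underscore.
pfx <- "abundances."
abundance_cols <- colnames(dat)[startsWith(colnames(dat), pfx)]
label <- sub(pfx, "", abundance_cols, fixed = TRUE)
design <- data.frame(
  label     = label,
  condition = sub("_[0-9]+$", "", label),
  replicate = as.integer(sub("^.*_", "", label)),
  stringsAsFactors = FALSE
)
print(design)

# With the package attached, a call using the file could look like this:
# se <- prot.read_data(data = csv_file, filt_type = "condition", filt_thr = 1)
```

Leaving `expdesign = NULL` lets the function create the design table itself; the `design` object only illustrates the shape such a table is documented to have.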
Arguments
- data
An R data frame object or a table file with extension '.xlsx', '.xls', '.csv', '.tsv', or '.txt' containing proteomics data. The table must contain:
  - A column with protein IDs (e.g., accession numbers). The header of this column is provided as argument id.
  - A column with protein names. The header of this column is provided as argument name.
  - X columns containing abundance values for X samples. The column headers must have a prefix (e.g., "abundances.") that is provided as argument pfx. Replicates are identified by identical column headers followed by an underscore and the replicate number (e.g., "abundances.ConditionA_1", "abundances.ConditionA_2", "abundances.ConditionA_3", ...).
- expdesign
(Optional, if made previously) An R data frame object or a table file containing the columns 'label', 'condition', and 'replicate', with label = "condition_replicate". If NULL, an experimental design table will be created automatically.
- csvsep
(Character string) Separator used in CSV files (ignored for other file types). Default: ";"
- dec
(Character string) Decimal separator used in CSV, TSV or TXT files (ignored for other file types). Default: "."
- na.strings
A character vector of strings which are to be interpreted as NA values.
- sheet
(Integer or Character string) Number or name of the sheet with proteomics data in XLS or XLSX files (optional).
- filter
(Character string or vector of strings) Provide the header of a column containing "+" or "-" to indicate if proteins should be discarded or kept, respectively.
- rsd_thresh
(Numeric, optional) Provide a relative standard deviation (RSD) threshold in % for proteins. The RSD is calculated for each condition, and if the maximum RSD value determined for a given protein exceeds rsd_thresh, the protein is discarded. The RSD filter is applied before further missing value filters based on the three filt_ arguments.
- name
(Character string) Provide the header of the column containing protein names.
- id
(Character string) Provide the header of the column containing protein IDs.
- pfx
(Character string) Provide the common prefix for headers containing abundance values (e.g., "abundances.").
- filt_type
(Character string) "complete", "condition" or "fraction". Sets the type of filtering applied. "complete" will only keep proteins with valid values in all samples. "condition" will keep proteins that have a maximum of filt_thr missing values in at least one condition. "fraction" will keep proteins that have at least a filt_min fraction of valid values across all samples.
- filt_thr
(Integer) Sets the threshold for the allowed number of missing values in at least one condition if filt_type = "condition". In other words: "keep proteins that have a maximum of 'filt_thr' missing values in at least one condition."
- filt_min
(Numeric) Sets the threshold for the minimum fraction of valid values allowed for any protein if filt_type = "fraction".
- log2_transform
(Logical) Should the data be log2 transformed?
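
The filter order described in the arguments (RSD threshold first, then the missing-value filter) can be sketched in base R as follows. This is an illustrative reimplementation under stated assumptions, not the package's code: the helper `sketch_filter()` and the toy matrix are hypothetical, the RSD is taken as 100 * sd / mean per condition on untransformed values, proteins whose RSD cannot be computed are kept, the log2 transform is assumed to come last, and "fraction" is read as "at least filt_min of all samples have a valid value". The flag-column filter (`filter`) is left out for brevity.

```r
# Hypothetical helper mirroring the documented filter order on a numeric matrix.
# x: proteins in rows, samples in columns; condition: one entry per column of x.
sketch_filter <- function(x, condition, rsd_thresh = NULL,
                          filt_type = c("condition", "complete", "fraction"),
                          filt_thr = 3, filt_min = NULL, log2_transform = TRUE) {
  filt_type <- match.arg(filt_type)
  groups <- split(seq_len(ncol(x)), condition)  # column indices per condition

  # 1. RSD filter: drop proteins whose largest per-condition RSD (%) exceeds the threshold.
  if (!is.null(rsd_thresh)) {
    rsd <- do.call(cbind, lapply(groups, function(cols) {
      vals <- x[, cols, drop = FALSE]
      100 * apply(vals, 1, sd, na.rm = TRUE) / rowMeans(vals, na.rm = TRUE)
    }))
    max_rsd <- apply(rsd, 1, function(r) if (all(is.na(r))) NA else max(r, na.rm = TRUE))
    # Assumption: proteins with an undefined RSD (fewer than two values) are kept.
    x <- x[is.na(max_rsd) | max_rsd <= rsd_thresh, , drop = FALSE]
  }

  # 2. Missing-value filter, selected by filt_type.
  missing <- is.na(x)
  keep <- switch(filt_type,
    complete = rowSums(missing) == 0,
    condition = {
      miss_per_cond <- do.call(cbind, lapply(groups, function(cols) {
        rowSums(missing[, cols, drop = FALSE])
      }))
      # Keep if at least one condition has no more than filt_thr missing values.
      rowSums(miss_per_cond <= filt_thr) >= 1
    },
    fraction = {
      stopifnot(!is.null(filt_min))
      rowMeans(!missing) >= filt_min
    }
  )
  x <- x[keep, , drop = FALSE]

  # 3. Optional log2 transform of the remaining abundances.
  if (log2_transform) x <- log2(x)
  x
}

# Toy abundance matrix: two conditions (A, B) with three replicates each.
x <- rbind(
  P1 = c(100, 110, 105, 200, 210, 190),
  P2 = c(100,  NA,  NA,  NA, 300,  NA),
  P3 = c( 50, 150, 400,  80,  85,  90),
  P4 = c( 10,  12,  NA,  20,  NA,  NA)
)
colnames(x) <- c("A_1", "A_2", "A_3", "B_1", "B_2", "B_3")
condition <- sub("_[0-9]+$", "", colnames(x))

kept <- sketch_filter(x, condition, rsd_thresh = 50,
                      filt_type = "condition", filt_thr = 1)
print(rownames(kept))
```

In this toy run, P3 is removed by the RSD step because its values in condition A vary widely, and P2 is removed by the "condition" filter because both conditions have two missing values.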