DecoID API Documentation

Core DecoID

class DecoID.DecoID.DecoID(libFile, mzCloud_lib, numCores=1, resolution=2, label='', api_key='none', mplus1PPM=15, numConcurrentGroups=20, scoringFunc='dot product')

Class for performing database-assisted deconvolution of MS/MS spectra. Deconvolution is done using LASSO regression.
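
The LASSO step mentioned above can be pictured as fitting an acquired (mixed) spectrum as a sparse, weighted sum of database spectra. The following is a minimal pure-Python sketch of that idea, not DecoID's own implementation: the function name lasso_deconvolve, the coordinate-descent solver, the fixed-length intensity vectors, and the non-negativity constraint on the weights are all assumptions made for illustration.

```python
def lasso_deconvolve(library, mixture, penalty, n_iter=200):
    """Fit mixture ~ sum_j coefs[j] * library[j] with an L1 penalty.

    Minimizes 0.5 * ||mixture - sum_j coefs[j] * library[j]||^2
              + penalty * sum_j coefs[j]
    subject to coefs[j] >= 0 (the non-negativity is an assumption of this sketch).
    All spectra are equal-length intensity vectors on a shared m/z grid.
    """
    coefs = [0.0] * len(library)
    # Residual of the current fit; equals the mixture while all coefs are 0.
    residual = list(mixture)
    for _ in range(n_iter):
        for j, spec in enumerate(library):
            norm_sq = sum(s * s for s in spec)
            if norm_sq == 0.0:
                continue  # an empty library spectrum cannot explain anything
            # Correlation of component j with the residual that excludes
            # component j's own current contribution.
            rho = sum(s * (r + coefs[j] * s) for s, r in zip(spec, residual))
            # Soft-threshold by the penalty, then clip at zero.
            new_coef = max(0.0, rho - penalty) / norm_sq
            delta = new_coef - coefs[j]
            if delta != 0.0:
                residual = [r - delta * s for r, s in zip(residual, spec)]
                coefs[j] = new_coef
    return coefs


# Hypothetical toy data: three library spectra on a 4-bin m/z grid.
library = [[1, 0, 1, 0], [0, 1, 1, 0], [0, 0, 0, 1]]
# The mixture is 2 x the first spectrum plus 1 x the second.
mixture = [2, 1, 3, 0]
weights = lasso_deconvolve(library, mixture, penalty=0.0)
```

With penalty=0.0 the fit is unregularized; raising the penalty shrinks the weights and sets weakly supported components to exactly zero, which is what makes the decomposition sparse.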

Parameters:
  • libFile – string, path to database file or “none” to use mzCloud
  • mzCloud_lib – str, only applies to mzCloud, specifies which library to search: reference (to use only the reference library) or autoprocessing
  • numCores – int, number of parallel processes to use.
  • resolution – int, Number of decimal places to consider for m/z values of MS/MS peaks
  • label – str, optional label to add to the end of output files
  • api_key – str, for use of mzCloud api, access key must be entered
  • mplus1PPM – float, ppm tolerance for finding subformulas in M+1 spectrum prediction. Set based on database
  • numConcurrentGroups – int, number of unique features to process at once; if memory consumption is high, try reducing this value
  • scoringFunc – str, function to score metabolite ID matches. Defaults to the normalized dot product

static combineResults(filenames, newFilename, endings=['_scanInfo.csv', '_decoID.csv', '.DecoID'])

Combine results from several files. Useful for merging results from many files. Only the .csv files are merged, for memory reasons.

Parameters:
  • filenames – filenames to merge
  • newFilename – output file name
  • endings – which file endings to use.
Returns:

None
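
As a rough illustration of what merging per-file CSV outputs involves, the sketch below concatenates files that share an ending and keeps a single header row. It is not the library's code: the helper name merge_csv_files is hypothetical, and it assumes that each entry in filenames is a base name to which the ending is appended and that all files with the same ending share the same columns.

```python
import csv
import os


def merge_csv_files(basenames, new_basename,
                    endings=("_scanInfo.csv", "_decoID.csv")):
    """Concatenate basename+ending files into new_basename+ending.

    Only endings that finish in ".csv" are merged; one header row is kept.
    Assumes (hypothetically) that every input file starts with a header row.
    """
    for ending in endings:
        if not ending.endswith(".csv"):
            continue  # this sketch skips non-CSV outputs
        header_written = False
        with open(new_basename + ending, "w", newline="") as out:
            writer = csv.writer(out)
            for base in basenames:
                path = base + ending
                if not os.path.exists(path):
                    continue  # tolerate missing inputs
                with open(path, newline="") as src:
                    rows = csv.reader(src)
                    header = next(rows, None)
                    if header is None:
                        continue  # empty file
                    if not header_written:
                        writer.writerow(header)
                        header_written = True
                    writer.writerows(rows)
```

Streaming row by row keeps memory use flat regardless of how many input files are merged.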

static combineResultsAutoAnnotate(filenames, newFilename, numHits=1, min_score=0)

Combine results from several files. Takes the best hit found across multiple files

Parameters:
  • filenames – list, filenames to merge; these should not include any file endings (e.g. .mzML, _decoID.csv)
  • newFilename – output file name
  • numHits – int, number of hits to return in the merged file for each feature
  • min_score – float, minimum dot product score for returned hits.
Returns:

None
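
The interplay of numHits and min_score can be sketched as follows. This is an illustrative stand-in, not the library's logic: the function best_hits and its in-memory input format (one dict per file, mapping a feature to a list of (compound, score) pairs) are assumptions, as is keeping only the best score per compound.

```python
def best_hits(per_file_hits, num_hits=1, min_score=0.0):
    """Merge hits from several files and keep the top hits per feature.

    per_file_hits: list with one dict per file, each mapping a feature ID
    to a list of (compound, score) pairs (hypothetical input format).
    """
    best = {}  # feature -> {compound: best score seen in any file}
    for hits in per_file_hits:
        for feature, candidates in hits.items():
            scores = best.setdefault(feature, {})
            for compound, score in candidates:
                # Drop hits below the minimum score, keep the best per compound.
                if score >= min_score and score > scores.get(compound, float("-inf")):
                    scores[compound] = score
    # Rank the surviving compounds by score and truncate to num_hits.
    return {
        feature: sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:num_hits]
        for feature, scores in best.items()
    }


# Hypothetical hits for two files.
file_1 = {"f1": [("A", 70.0)], "f2": [("B", 40.0)]}
file_2 = {"f1": [("A", 95.0), ("C", 60.0)]}
merged = best_hits([file_1, file_2], num_hits=2, min_score=50.0)
```

A feature whose hits all fall below min_score remains in the result with an empty hit list, which makes unmatched features easy to spot.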

identifyUnknowns(resPenalty=100, percentPeaks=0.01, iso=False, ppmThresh=10, dpThresh=80, rtTol=0.5)

Generate the on-the-fly unknown library by searching all spectra first and identifying those that are unknown. These spectra can then be used to deconvolve other spectra. Only applicable to DDA with MS1 collected.

Parameters:
  • resPenalty – float, lasso regression coefficient. Set to float(“inf”) for direct searching. Set to 0 for unregularized deconvolution. Recommend 1 for DDA, 100 for DIA
  • percentPeaks – float, filtering parameter, exclude spectra that do not match this fraction of the database peaks
  • iso – bool, if True, remove contamination from orphan isotopologues; if False, do not.
  • ppmThresh – float, mass error match tolerance for the resulting hits of the spectra.
  • dpThresh – float, dot product match tolerance for the resulting hits of the spectra.
  • rtTol – float, retention time tolerance (minutes) allowed between database and acquired RT in order for database spectrum to be used in deconvolution
Returns:

None
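
One way to read ppmThresh and dpThresh is as a joint test on each database hit: a spectrum counts as identified only if some hit is within the mass error tolerance and at or above the dot product threshold. The sketch below encodes that reading. It is an assumption, not the documented behavior: the helper names are hypothetical, the 0-100 score scale is inferred from the default dpThresh=80, and the retention time check is left out.

```python
def ppm_error(observed_mz, reference_mz):
    """Signed mass error of observed_mz relative to reference_mz, in ppm."""
    return (observed_mz - reference_mz) / reference_mz * 1e6


def is_unknown(precursor_mz, hits, ppm_thresh=10.0, dp_thresh=80.0):
    """Return True if no hit passes both thresholds.

    hits: iterable of (reference_mz, dot_product) pairs, with dot_product
    assumed to be on a 0-100 scale.
    """
    for reference_mz, dot_product in hits:
        within_mass = abs(ppm_error(precursor_mz, reference_mz)) <= ppm_thresh
        if within_mass and dot_product >= dp_thresh:
            return False  # at least one confident identification
    return True
```

For example, a precursor at m/z 200.0 with a single hit at m/z 200.0005 (about 2.5 ppm away) scoring 90 would be treated as identified under the defaults, whereas the same hit scoring 50 would leave the spectrum unknown.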

prepareForCluster(numberOfFiles)

Split a large dataset read into DecoID for searching on separate machines. Helpful for large data files that are to be processed on a compute cluster. Creates dill files of DecoID objects with partial data lists but the same parameters. These can be read with fromDill.
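
The partitioning idea can be sketched independently of DecoID: divide a list of spectra into numberOfFiles near-equal parts, each of which would then be serialized with the same parameters. The helper split_into_chunks below is hypothetical, and the contiguous, size-balanced split is an assumption about how the data might be divided.

```python
def split_into_chunks(items, number_of_files):
    """Split items into number_of_files contiguous chunks.

    Chunk sizes differ by at most one; trailing chunks may be empty when
    there are fewer items than chunks.
    """
    if number_of_files < 1:
        raise ValueError("number_of_files must be at least 1")
    base, extra = divmod(len(items), number_of_files)
    chunks, start = [], 0
    for i in range(number_of_files):
        # The first `extra` chunks take one additional item.
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


# Seven hypothetical spectra split across three workers.
parts = split_into_chunks(list(range(7)), 3)
```

Each part could then be written to its own file and searched on a separate machine, with results merged afterwards.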

Parameters:
  • numberOfFiles – int, number of files to split the data into
Returns:

None

readData(filename, resolution, peaks, DDA, massAcc, offset=0.65, peakDefinitions='', tic_cutoff=0, frag_cutoff=0)

Read in raw MS data into DecoID object.

Parameters:
  • filename – str, path to MS datafile
  • resolution – int, number of decimal places to consider in MS/MS peaks
  • peaks – bool, use MS1 data if available
  • DDA – bool, True for DDA data, False for DIA data
  • massAcc – float, mass accuracy of instrument in ppm
  • offset – float, isolation window width measured from center of window to outer edge.
  • peakDefinitions – str, path to peak definition file
  • tic_cutoff – float, TIC cutoff for MS/MS spectra. Spectra below this value will be ignored
  • frag_cutoff – float, Intensity cutoff for fragments to be included. Fragments in spectra below this intensity will be ignored.
Returns:

None
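
To make the roles of offset, resolution, tic_cutoff, and frag_cutoff concrete, here is a small sketch of the kind of preprocessing these parameters describe. The helper names and the order of operations (TIC computed before fragment filtering, intensities summed within an m/z bin) are assumptions, not DecoID's documented behavior.

```python
def isolation_window(center_mz, offset=0.65):
    """Lower and upper m/z bounds of a window extending offset either side."""
    return center_mz - offset, center_mz + offset


def clean_spectrum(peaks, resolution=2, tic_cutoff=0.0, frag_cutoff=0.0):
    """Apply TIC and fragment intensity cutoffs, then bin m/z values.

    peaks: iterable of (mz, intensity) pairs.
    Returns a dict of rounded m/z -> summed intensity, or None when the
    spectrum's total ion current is below tic_cutoff.
    """
    peaks = list(peaks)
    tic = sum(intensity for _, intensity in peaks)
    if tic < tic_cutoff:
        return None  # whole spectrum ignored
    binned = {}
    for mz, intensity in peaks:
        if intensity < frag_cutoff:
            continue  # fragment too weak to keep
        key = round(mz, resolution)  # keep `resolution` decimal places
        binned[key] = binned.get(key, 0.0) + intensity
    return binned


# Hypothetical spectrum: two peaks fall in the same 0.01 m/z bin.
spectrum = clean_spectrum(
    [(100.004, 50.0), (100.001, 30.0), (150.26, 2.0)],
    resolution=2,
    frag_cutoff=5.0,
)
```

A coarser resolution merges more neighboring peaks into one bin, which tolerates larger m/z error at the cost of specificity.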

readMS_DIAL_data(file, mode, massAcc, peakDataFile)

Load data from an MS-DIAL exported peak list text file of deconvoluted spectra. This enables library searching and combined usage of DecoID and MS-DIAL.

Parameters:
  • file – str, path to MS-DIAL output file
  • mode – str, polarity (Positive/Negative)
  • massAcc – int, mass accuracy of instrument (ppm)
  • peakDataFile – str, path to peak information file for features of interest. If doing a combined usage, this needs to be the same file used for DecoID.
Returns:

searchSpectra(verbose, resPenalty=100, percentPeaks=0.01, iso=False, threshold=0.0, rtTol=0.5, redundancyCheckThresh=0.9)

Search the spectra loaded into the DecoID object and write the output files

Parameters:
  • verbose – str or Queue, “y” to write the progress to stdout. Queue is used with the GUI to send updates dynamically.
  • resPenalty – float, lasso regression coefficient. Set to float(“inf”) for direct searching. Set to 0 for unregularized deconvolution. Recommend 1 for DDA, 100 for DIA
  • percentPeaks – float, filtering parameter, exclude spectra that do not match this fraction of the database peaks
  • iso – bool, if True, remove contamination from orphan isotopologues; if False, do not.
  • threshold – float, filtering parameter to remove hits with a spectral similarity less than this value
  • rtTol – float, retention time tolerance to use database spectra (minutes)
  • redundancyCheckThresh – float, dot product threshold to classify a component as redundant; set to > 100 to turn off
Returns:

None
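
Putting the documented calls together, a typical run might look like the sketch below. It uses only the signatures listed on this page; the wrapper names, the file paths passed in, the mass accuracy of 10 ppm, and the choice of 4 cores are hypothetical placeholders. The import is guarded so the snippet can be loaded without DecoID installed.

```python
try:
    # Import path taken from the class signature documented on this page.
    from DecoID.DecoID import DecoID
except ImportError:
    DecoID = None  # lets the sketch load without the package installed


def recommended_res_penalty(dda):
    """LASSO penalty suggested in the parameter notes: 1 for DDA, 100 for DIA."""
    return 1 if dda else 100


def run_search(data_file, lib_file, peak_file, dda=True, mass_acc=10):
    """Read one data file and search it (all argument values are placeholders)."""
    if DecoID is None:
        raise RuntimeError("DecoID is not installed")
    # libFile, mzCloud_lib, then keyword options from the constructor signature.
    decoid = DecoID(lib_file, "reference", numCores=4, resolution=2)
    # filename, resolution, peaks (use MS1), DDA flag, mass accuracy in ppm.
    decoid.readData(data_file, 2, True, dda, mass_acc, peakDefinitions=peak_file)
    # "y" writes progress to stdout; output files are written by searchSpectra.
    decoid.searchSpectra("y", resPenalty=recommended_res_penalty(dda),
                         iso=False, rtTol=0.5)
```

Keeping the penalty choice in a small helper makes the DDA/DIA recommendation explicit at the call site rather than relying on the default of 100.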

Other Helpful Functions

DecoID.DecoID.dotProductSpectra(foundSpectra, b, mz1=-1, mz2=-1, polarity=-1)

Computes the normalized dot product (cosine) similarity between two spectra.

Parameters:
  • foundSpectra – dict or array like, this is the first spectrum
  • b – dict or array like, this is the second spectrum
  • mz1 – not used for this method
  • mz2 – not used for this method
  • polarity – not used for this method
Returns:

Cosine similarity of the two spectra on a scale of 0-1
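
For reference, the normalized dot product of two spectra stored as m/z:intensity dicts can be computed as in the sketch below. This is a standalone illustration of the cosine formula, not the body of dotProductSpectra; the function name cosine_similarity and the exact-key matching of m/z values (which presumes both spectra were already rounded to the same number of decimals) are assumptions.

```python
import math


def cosine_similarity(spec_a, spec_b):
    """Cosine similarity of two spectra given as dicts of m/z -> intensity.

    Returns a value between 0 and 1 for non-negative intensities, and 0.0
    if either spectrum has no signal.
    """
    # Only m/z values present in both spectra contribute to the dot product.
    shared = set(spec_a) & set(spec_b)
    dot = sum(spec_a[mz] * spec_b[mz] for mz in shared)
    norm_a = math.sqrt(sum(v * v for v in spec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in spec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical spectra with m/z rounded to two decimals.
query = {100.0: 3.0, 150.0: 4.0}
reference = {100.0: 3.0, 150.0: 4.0}
score = cosine_similarity(query, reference)
```

Because the score depends only on relative intensities, scaling either spectrum by a constant leaves it unchanged.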

DecoID.DecoID.readRawDataFile(filename, maxMass, resolution, useMS1, ppmWidth=50, offset=0.65, tic_cutoff=5, frag_cutoff=0)

Read MS datafile and convert to mzML if necessary. Conversion performs vendor centroiding. MS/MS data along with MS1 data are extracted and returned. In addition, the contamination in the MS/MS spectra from co-isolated analytes is computed.

Parameters:
  • filename – str, path to MS datafile
  • maxMass – float, maximum mass to consider in MS/MS spectra
  • resolution – int, number of decimal places to consider for m/z values of MS/MS peaks
  • useMS1 – bool, read and return MS1 data contained in file
  • ppmWidth – float, mass accuracy of instrument in parts per million
  • offset – float, isolation window width/2. Necessary for non-Thermo data
  • tic_cutoff – float, Signal cutoff for MS/MS spectra. Spectra below this signal level will be ignored
  • frag_cutoff – float, intensity cutoff for MS/MS peaks. Fragments with absolute intensity below this threshold will be removed
Returns:

  • result – list of MS/MS spectra. Each spectrum is a dict giving the precursor m/z, retention time, isolation window upper and lower bounds, TIC, contamination level, and scanID
  • ms1Scans – dict where the key is the retention time and the value is another dict of m/z:intensity pairs
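
Given the MS1 structure described above (retention time mapped to a dict of m/z:intensity pairs), the sketch below shows one way such data could be queried. The helper names are hypothetical and the toy data is invented for illustration; only the nesting of the dict follows the description of the return value.

```python
def nearest_ms1_scan(ms1_scans, rt):
    """Return (scan_rt, scan) for the MS1 scan closest in retention time.

    ms1_scans: dict mapping retention time -> {mz: intensity}.
    """
    scan_rt = min(ms1_scans, key=lambda t: abs(t - rt))
    return scan_rt, ms1_scans[scan_rt]


def intensity_within_ppm(scan, target_mz, ppm_width=50.0):
    """Sum the intensity of all peaks within ppm_width of target_mz."""
    tolerance = target_mz * ppm_width / 1e6  # ppm converted to an m/z window
    return sum(i for mz, i in scan.items() if abs(mz - target_mz) <= tolerance)


# Invented MS1 data: two scans at retention times 1.0 and 2.0 minutes.
ms1_scans = {
    1.0: {100.0: 10.0, 100.004: 5.0, 100.02: 7.0},
    2.0: {100.0: 20.0},
}
scan_rt, scan = nearest_ms1_scan(ms1_scans, 1.2)
signal = intensity_within_ppm(scan, 100.0, ppm_width=50.0)
```

At m/z 100 a 50 ppm window is 0.005 wide on each side, so the peak at 100.004 is included and the one at 100.02 is not.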

Documentation is ongoing. All relevant functions will be added as more documentation is written.