DecoID Usage ================================== Input Files ----------- MS/MS datafile ++++++++++++++ DecoID supports all vendor file formats compatible with MS-Convert. Natively, DecoID accepts .mzML files that have been centroided. However, if MS-Convert is installed on the machine and msconvert.exe is in the system PATH, vendor files can be supplied. DecoID will read MS/MS metadata including polarity, targeted m/z's , and isolation windows (Thermo Fisher Scientific datafiles only). For non-Thermo data, the width of the isolation window must be provided. DecoID is compatable with multiple MS/MS datatypes including data dependent acquistion (DDA) and data independent acqusition (DIA). Further, less common experimental workflows where MS1 full scan data is not collected can also be used. Peak Information ++++++++++++++++ To facilitate metabolite information and improve the signal to noise of the MS/MS spectra and the resulting deconvolution, the m/z and retention time bounds of features of interest can be provided in a .csv file. If so, DecoID will average all acquired spectra within the retention time bounds. In addition, the simplifies the output datafile for downstream analysis. Peak information is required for DIA data. The format of this file is shown below: ====== ======== ===== mz rt_start rt_end ====== ======== ===== 123.45 1.67 2.43 145.78 2.94 3.55 90.08 7.83 9.23 etc. etc. etc. ====== ======== ===== Database ++++++++ DecoID uses either NIST msp formatted databases or a remote connection to the mzCloud (Thermo-Fisher Scientific) spectral database to perform metabolite identification and MS/MS deconvolution. For mzCloud based deconvolution, no input file is required. More details are given below in `Connection to mzCloud`_. MSP files of the Human Metabolome Database (HMDB) and the Mass Bank of North America and be downloaded from the `DecoID GitHub Site `_ Alternatively, these databases can be downloaded from the MoNA website `here `_. MSP files are simple text files. For compatibility with DecoID the following fields must be non-empty: * Name * DB# * Precursor_type OR Ion_mode * ExactMass OR PrecursorMZ An example MSP entry is shown below::: Name: GLYCEROL Synon: $:00in-source DB#: 0 InChiKey: 56-81-5 Precursor_type: [M-H]- Spectrum_type: MS2 PrecursorMZ: 91.04004 Ion_mode: N Collision_energy: HCD (NCE 40%) RetentionTime: 2.969 Formula: C3H8O3 MW: 92.0 ExactMass: 92.04734 Num Peaks: 40 50.003 0.002940881981667288 50.043 0.001125309959124438 52.92 0.0010507693668541853 54.262 0.0009862185551523176 54.741 0.0010954343864214948 55.606 0.0012054987503008479 55.851 0.0010793672216122406 58.039 0.0010972036230296137 58.056 0.0012052440367981193 58.173 0.0011118376593017905 59.013 0.1877307037191327 59.017 0.004254219849821239 59.967 0.9227487450965145 59.97 0.038360551398232204 59.972 0.023916371530628082 60.974 0.0013073255841957638 61.744 0.001150458036919167 62.524 0.0010455834798728655 62.724 0.001173758557930979 63.886 0.0012296895388148035 64.873 0.0010963706929665274 65.145 0.001130812508637816 72.444 0.0012381777855903992 72.449 0.001096971813758573 73.011 0.0046482667112914475 74.99 1.0 74.995 0.03924297796715474 74.997 0.026938366619882102 75.967 0.0011636634025248632 76.214 0.0012321326056216008 76.969 0.005576439025961694 91.021 0.002564739158271785 91.029 0.0013610635233254089 93.001 0.9799805233973448 93.007 0.027811413214189975 93.01 0.004905488457969213 93.011 0.004496375838517466 97.159 0.0011219222849101172 98.845 0.0012869953884785937 100.86 0.0010522699014690819 After DecoID parses the MSP file a binary .db file is generated (this is a Pickle file) for faster loading on future usage. Output Files ------------ After successful deconvolution of an MS/MS datafile, 3 output files are generated in the input file directory. * _scanInfo.csv * _decoID.csv * .DecoID **_scanInfo.csv** gives the purified spectra and all components for each acquired MS/MS spectrum. It is formatted as shown below: ========= ===================== ============= =========== ================== =========== =========== ================================== featureID Signal to Noise Ratio numComponents componentID componentAbundance componentRT componentMz spectrum ========= ===================== ============= =========== ================== =========== =========== ================================== 1 2.5 4 cpdID1 .8 1.8 138.6 84:12.6 72:45 1 2.5 4 cpdID2 .2 2.6 137.1 102:45 68:55 1 2.5 4 original 0 2.4 138.6 68:55 72:45 84:12.6 102:45 72:55 1 2.5 4 residual 0 2.4 138.6 72:10 etc etc etc etc etc etc etc etc ========= ===================== ============= =========== ================== =========== =========== ================================== * featureID: The row number from the peak information file that gives which feature this spectrum belongs or if peak information is not provided, this is the scanID of the MS/MS spectrum. * Signal to Noise Ratio: denotes the the signal attributed to a particular compound divided by the signal not attributed to any compound. * numComponents: The number of components used in the deconvolution. * componentID: compound ID for each component. If this field is "original" then this is the acquired spectrum. Residual is the error between the reconstructed spectrum and the acquired. * componentAbundance: mixing coefficient for each component in the deconvolution * componentRT: database retention time for the precursor of the component * ComponentMz: The m/z value of the precursor of the component. * spectrum: The spectrum for each component given in a m/z:intensity pairs separated by a space. **_decoID.csv** gives the metabolite identification results after the deconvolution. With a single match on each line. The format is given below: ========= ==================== ==== ============ =========== ================ =============== ============= ============== =========== ========= ========= =========== ========= featureID isolation_center_m/z rt compound_m/z compound_rt compound_formula DB_Compound_ID Compound_Name DB_Spectrum_ID dot_product ppm_Error Abundance ComponentID redundant ========= ==================== ==== ============ =========== ================ =============== ============= ============== =========== ========= ========= =========== ========= 1 133.014 2.4 133.014 2.1 C4H6O5 cpdID01 Malic Acid HMDB0031518 99.8 -1.3 0.8 cpdID01 FALSE etc etc etc etc etc etc etc etc etc etc etc etc etc etc ========= ==================== ==== ============ =========== ================ =============== ============= ============== =========== ========= ========= =========== ========= * featureID: The row number from the peak information file that gives which feature this spectrum belongs or if peak information is not provided, this is the scanID of the MS/MS spectrum. * isolation_center_m/z: The feature of interest m/z value. * rt: The retention time where the spectrum was acquired. * compound_m/z: The m/z value of the matched compound. * compound_rt: The retention time of the database compound. * compound_formula: The formula of the database precursor compound. * DB_Compound_ID: The compound ID of the matched compound. * Compound_Name: The name of the matched compound. * DB_Spectrum_ID: Spectrum ID or accession of the matched spectrum. Given by DB# in the input database. * dot_product: The normalized dot product similarity to the reference spectrum * ppm_Error: The mass error in parts per million (ppm) between the feature's m/z and the database match m/z. * Abundance: The normalized regression coefficient of this compound in the deconvolution. Note: this should not be used for comparative/quantitative purposes. * componentID: compound ID of the component matched to. If this field is "original" then this is the acquired spectrum. Residual is the error between the reconstructed spectrum and the acquired. * redundant: Result of redundancy check. If TRUE the matches component could have been a different database compound. This indicates a non-unique deconvolution and possibly an inconclusive identification **.DecoID** is a gzipped pickle file that contains all the information provided in the previous two output files but in a format that allows for easier analysis and visualation through the DecoID user interface. Example Usage -------------------- Regardless of data type the following parameters are required::: from DecoID.DecoID import DecoID libFile = "DecoID/databases/HMDB_experimental.db" #path to database numCores = 10 # of parallel processes to use file = "DecoID/exampleData/Asp-Mal_1uM_5Da.mzML" #path to datafile peakfile = "DecoID/exampleData/peak_table.csv" #path to peak information file useMS1 = True #use MS1 data if available massAcc = 10 #Mass accuracy of instrument res = 2 # # of decimal places to round MS/MS peaks. fragThresh = 0 #absolute intensity threshold for MS/MS peaks rtTol = 1 #retention time threshold With this the DecoID object can be instantiated and database parsed::: decID = DecoID(libFile, "reference", numCores) Before the raw data can be read-in some data-type specific parameters must be provided::: offset = .5 #half of the width of the MS/MS isolation window. Not required for Thermo data. DDA = True #true for DDA, False for DIA Now the raw MS/MS data can be read::: decID.readData(file, res, useMS1, DDA, massAcc,offset,peakDefinitions=peakfile,frag_cutoff=fragCutoff) With the data read, the search parameters can be defined::: lam = 5.0 # LASSO regression coefficient. The higher this is the more sparse a solution will be found. Recommend 5.0 for DDA and 50.0 for DIA. useIso = True # Predict M+1 isotopologue spectra to remove contamination from orphan isotopologues Optionally, acquired pure MS/MS spectra can be used to deconvolve spectra in the datafile if data is from a DDA experiment. To enable this the command below must be run::: decID.identifyUnknowns(iso=useIso,rtTol=rtTol,dpThresh=80,resPenalty=lam) Now, the datafile can be searched with the command below::: decID.searchSpectra("y", lam , iso=useIso,rtTol=rtTol) Advanced Usage -------------- Changing Deconvolution Parameters +++++++++++++++++++++++++++++++++ Changing the "lam" parameter in effect allows for a continuum of performance between direct library searching without deconvolution and non-regularized deconvolution. With:: lam = float("inf") You have no deconvolution and standard library searching. With:: lam = 0 There is no penalty for more complex solutions. High Performance Computing ++++++++++++++++++++++++++ DecoID can be used in a UNIX environment and is suitable and has been tested on HPC cluster. The easiest usage is to submit individual files for deconvolution as separate jobs. This can be very helpful for large datasets searched against multiple databases. See DecoID/HPC_scripts for examples. Connection to mzCloud +++++++++++++++++++++ Connection to mzCloud is dependent on an access key granted by Thermo-Fisher Scientific. If a key is granted, it must be entered during instantiation of the DecoID object and the libFile parameter must be "none". The library can be either "reference" or "autoprocessing"::: decID = DecoID("none", "reference", numCores,api_key="XXXXXXXXX) Parallel Usage with MS-DIAL +++++++++++++++++++++++++++ DecoID can be used in parallel with MS-DIAL on DIA MS/MS data. First, DecoID and MS-DIAL should be used individually. The MS-DIAL peak list should be exported as a .txt file with the deconvoluted spectra. Next, this txt file should be imported into DecoID with the following command::: decID.readMS_DIAL_data(fn,mode,massAcc,peakFile) Here fn in the path to the text file and mode is "Positive" or "Negative" depending on the polarity of the acquired data. Next, the MS-DIAL ouput can be searched with::: decID.searchSpectra("y",float("inf"),iso=useIso,rtTol = rtTol) This command will directly search (without deconvolution) the MS-DIAL deconvolved spectra against the referecne database using the same peak list as before. Lastly, the results can be combined across multiple datafiles with the following command. Note that this function can be used to merge the results of any datafile not just between MS-DIAL and DecoID. For instance, to combine the results of several MS/MS experiments of the same sample.:: decID.combineResultsAutoAnnotate([,,,numHits = 3) An example script is available at DecoID/examples/exampleUsage_msdial_decoID.py