A computational tool to discover chemical novelty in natural extract libraries
Version 1.0
To run Inventa, the minimum inputs needed are:
This will run at least the Feature component (FC).
Optionally, the following inputs can be added:
The standard format from GNPS is preferred:
metadata_filename: uses the GNPS format.
When creating the ‘metadata’ file, some columns are MANDATORY:
Taxonomy: the species, genus, and family are needed if the LC and CC are to be computed. The taxonomy should be cleaned to currently accepted names; you can use the Open Tree of Life for this.
The headers for each column can follow the GNPS format or the user’s preferences; however, the following parameters need to be indicated:
species_column = 'yourspeciesnamecolumn' #ATTRIBUTE_species
genus_column = 'yourgenusnamecolumn' #ATTRIBUTE_genus
family_column = 'yourfamilynamecolumn' #ATTRIBUTE_family
organe_column = 'yourorganenamecolumn' #ATTRIBUTE_organie
filename_header = 'yourfilenamecolumn' #filename
The organe_column should be specified if you have different parts (or solvents) from the same species. If you prefer to use only the filename as the identifier for the results, this can be specified directly in the notebook.
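Before running the notebook, it can help to confirm that the metadata header actually contains the columns named in the parameters above. The following is a minimal sketch, not part of Inventa: the helper missing_columns() is hypothetical, the column values are placeholders to replace with your own header names, and the tab delimiter is an assumption about how the metadata table is saved.

```python
import csv
import io

# Placeholder header names (assumptions); replace with your own.
species_column = "ATTRIBUTE_species"
genus_column = "ATTRIBUTE_genus"
family_column = "ATTRIBUTE_family"
organe_column = "ATTRIBUTE_organe"
filename_header = "filename"


def missing_columns(metadata_file, required, delimiter="\t"):
    """Return the required column names that are absent from the header row."""
    reader = csv.reader(metadata_file, delimiter=delimiter)
    header = next(reader, [])  # empty file -> empty header
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]


# Small in-memory stand-in for a metadata table (invented example data).
example = io.StringIO(
    "filename\tATTRIBUTE_species\tATTRIBUTE_genus\tATTRIBUTE_family\n"
    "sample_01.mzML\tGenus species\tGenus\tFamily\n"
)
required = [filename_header, species_column, genus_column,
            family_column, organe_column]
missing = missing_columns(example, required)
print(missing)  # the organ column is absent from this example header
```

With a real file, the same helper could be called on an open file handle, for example `with open(metadata_filename, newline="") as fh: missing_columns(fh, required)`.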
quantitative_data_filename: MZmine output format using only the ‘Peak area’, ‘row m/z’ and ‘row retention time’ columns.
- Inventa takes input directly from MZmine 2 or MZmine 3. Other processing software can be used; however, the input must then be manually formatted to the MZmine 2 format.
- Inventa can perform the calculations based on the results from Ion Identity, which reduces the total number of features.
If you prefer ‘Peak height’, go to src/process_data.py and change it inside the function quant_table(). ONLY ONE of the columns, ‘Peak height’ or ‘Peak area’, is considered at a time; if you want to consider both, run them one at a time.
If you exported any other columns (identities, etc.), please remove them manually or add the corresponding lines to the function quant_table():
df.drop('name of the column', axis=1, inplace=True)
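To find out which exported columns would need to be dropped, one option is to compare the header against the columns Inventa expects. The sketch below is illustrative only: columns_to_drop() is a hypothetical helper, not a function from src/process_data.py, and the example column names only assume the usual MZmine export pattern. Keeping ‘row ID’ is also an assumption (a feature identifier is usually required); remove it from keep if your workflow does not use it.

```python
def columns_to_drop(columns, measure="Peak area",
                    keep=("row ID", "row m/z", "row retention time")):
    """Return the columns that are neither key columns nor `measure` columns."""
    return [
        name for name in columns
        if name not in keep and not name.endswith(measure)
    ]


# Invented header resembling an MZmine quantification table export.
example_columns = [
    "row ID", "row m/z", "row retention time",
    "row identity (main ID)",
    "sample_01.mzML Peak area", "sample_01.mzML Peak height",
]
extra = columns_to_drop(example_columns)
print(extra)
```

The resulting list can then be removed in one call with pandas, for example `df.drop(columns=extra, inplace=True)`. Passing `measure="Peak height"` keeps the height columns instead of the area columns.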
tima_results_filename: timaR reweighted (reponderated) output format.
canopus_npc_summary_filename: Sirius CANOPUS recomputed output format.
This output needs an additional step after running Sirius. Please follow these instructions:
from canopus import Canopus
C = Canopus(sirius="sirius_projectspace")
C.npcSummary().to_csv("npc_summary.tsv")
The output canopus_npc_summary.tsv is the file needed to run Inventa.
Because the LOTUS database uses the NPClassifier ontology and Sirius uses the ClassyFire ontology, this step is absolutely necessary for a proper comparison of the proposed chemical classes.
sirius_annotations_filename: Sirius annotations output format, containing the ZODIAC and COSMIC results (compound_identification.tsv).
vectorized_data_filename: MEMO package output format.
Examples of all these inputs can be found in /format_examples