Input files

This page describes the input files of ODAMNet.

Target genes

Warning

  • Gene IDs have to be consistent between input data (target genes, GMT and networks)

  • When data are retrieved by queries, HGNC IDs are used.

Choose one of these input parameters according to your input data:

-c, --chemicalsFile FILENAME

Contains a list of chemicals. They have to be given as MeSH identifiers (e.g. D014801). Each line contains one or several chemical IDs, separated by “;”.

1. Chemicals file

By default, ODAMNet retrieves the list of chemical target genes from the Comparative Toxicogenomics Database [1] (CTD) using queries. This file contains a list of chemical IDs (MeSH, e.g. D014801). Each line contains one or several chemical IDs, separated by “;”.

D014801;D014807
D014212
C009166

ODAMNet approaches are applied to each line separately. If a line contains multiple chemicals, the target genes of each chemical are retrieved and merged into a single list of unique target genes.

Chemical target genes are retrieved in HGNC format.
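The line-by-line handling described above can be sketched in Python. This is only an illustration of the file layout, not ODAMNet's own code: the function names `parse_chemicals_lines` and `merge_targets` are hypothetical, and the gene names in `fake_targets` are placeholders made up for the example.

```python
def parse_chemicals_lines(lines):
    """Return one list of chemical IDs per non-empty line."""
    groups = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        # Several IDs on the same line are separated by ";"
        ids = [chem.strip() for chem in line.split(";") if chem.strip()]
        groups.append(ids)
    return groups


def merge_targets(chemical_ids, targets_by_chemical):
    """Merge the target genes of several chemicals into one sorted list of unique genes."""
    merged = set()
    for chem in chemical_ids:
        merged.update(targets_by_chemical.get(chem, ()))
    return sorted(merged)


lines = ["D014801;D014807", "D014212", "C009166"]
groups = parse_chemicals_lines(lines)

# Placeholder gene names, invented for illustration only (not real CTD results).
fake_targets = {
    "D014801": ["GENE_A", "GENE_B"],
    "D014807": ["GENE_B", "GENE_C"],
}
merged = merge_targets(groups[0], fake_targets)

print(groups)
print(merged)
```

Each inner list corresponds to one line of the file, so one analysis would be run per inner list.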

2. Target genes file

ODAMNet can also use input data provided by the user. This target genes file contains a list of genes, one gene per line.

AANAT
ABCB1
ABCC2
ABL1
ACADM

3. CTD file

This third way of retrieving target genes is well suited to reproducible analyses or to using a specific database version. The required file contains 9 columns:

  • Input: query input (e.g. chemical IDs from the chemicals file)

  • ChemicalName: name of the query input or its descendant chemicals

  • ChemicalId: MeSH ID of the query or its descendant chemicals

  • CasRN: CasRN ID of the query or its descendant chemicals

  • GeneSymbol: names of target genes that are connected to the query or its descendant chemicals

  • GeneId: target gene ID (HGNC)

  • Organism: organism name

  • OrganismId: organism ID

  • PubMedIds: PubMed IDs of the publications that report this connection

Input       ChemicalName    ChemicalId      CasRN   GeneSymbol      GeneId  Organism        OrganismId      PubMedIds
d014801     Tretinoin       D014212 302-79-4        ZYG11A  440590  Homo sapiens    9606    23724009|33167477
d014801     Tretinoin       D014212 302-79-4        ZYX     7791    Homo sapiens    9606    23724009
d014801     Tretinoin       D014212 302-79-4        ZZZ3    26009   Homo sapiens    9606    33167477
d014801     Vitamin A       D014801 11103-57-4      ACE2    59272   Homo sapiens    9606    32808185
d014801     Vitamin A       D014801 11103-57-4      AKR1B10 57016   Homo sapiens    9606    19014918

Files of this kind are created from the query results when ODAMNet is run in query mode.
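As a rough sketch of how such a file can be read, the snippet below groups gene symbols by the value of the Input column. It assumes the file is tab-separated and starts with the header row shown above; the function name `targets_by_input` is hypothetical and is not part of ODAMNet.

```python
import csv
from collections import defaultdict


def targets_by_input(lines):
    """Group unique gene symbols by the 'Input' column of a CTD-style table."""
    reader = csv.DictReader(lines, delimiter="\t")  # assumes tab-separated columns
    targets = defaultdict(set)
    for row in reader:
        targets[row["Input"]].add(row["GeneSymbol"])
    # Sort for a stable, readable result
    return {query: sorted(genes) for query, genes in targets.items()}


columns = ["Input", "ChemicalName", "ChemicalId", "CasRN", "GeneSymbol",
           "GeneId", "Organism", "OrganismId", "PubMedIds"]
ctd_lines = [
    "\t".join(columns),
    "\t".join(["d014801", "Tretinoin", "D014212", "302-79-4", "ZYG11A",
               "440590", "Homo sapiens", "9606", "23724009|33167477"]),
    "\t".join(["d014801", "Tretinoin", "D014212", "302-79-4", "ZYX",
               "7791", "Homo sapiens", "9606", "23724009"]),
    "\t".join(["d014801", "Vitamin A", "D014801", "11103-57-4", "ACE2",
               "59272", "Homo sapiens", "9606", "32808185"]),
]

result = targets_by_input(ctd_lines)
print(result)
```

To read a file on disk, the same function can be given an open file handle, e.g. `open(path, newline="")`, in place of the list of lines.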

Pathways/processes of interest

By default, ODAMNet retrieves all rare disease pathways and all human pathways from WikiPathways [2] using queries. Genes involved in rare disease pathways are retrieved in HGNC format.

Moreover, the user can also provide their own pathways/processes of interest. Two types of files are required by ODAMNet:

--GMT FILENAME

It’s a tab-delimited file that describes gene sets of pathways/processes of interest. Pathways can come from several sources. Each row represents a gene set.

--backgroundFile FILENAME

This file lists the source of the background file for each pathway. The sources have to be in the same order as the pathways appear in the GMT file. Each background file is itself a GMT file (see above).

GMT file

This file contains the gene composition of the pathways/processes of interest. There are at least three columns:

  • pathwayIDs: first column is pathway IDs

  • pathways: second column is pathway names (optional; you can fill it with a dummy value)

  • HGNC: all the other columns contain the genes inside the pathway. The number of columns differs between pathways and varies according to the number of genes inside.

The GMT file is organized as follows:

pathwayIDs  pathways        HGNC
WP5195      Disorders in ketolysis  ACAT1   HMGCS1  OXCT1   BDH1    ACAT2
WP5189      Copper metabolism       ATP7B   ATP7A   SLC11A2 SLC31A1
WP5190      Creatine pathway        GAMT    SLC6A8  GATM    OAT     CK

For more details, see GMT file format webpage.

Warning

The GMT file must not contain empty columns.
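The layout and the warning above can be checked with a few lines of Python. This is a minimal sketch, assuming a tab-delimited file whose first non-blank line is the header row shown in the example; `parse_gmt` is a hypothetical helper, not an ODAMNet function.

```python
def parse_gmt(lines, has_header=True):
    """Parse GMT-style lines into {pathway_id: (pathway_name, [genes])}.

    Raises ValueError on a row with fewer than three columns or with an empty column.
    """
    pathways = {}
    header_pending = has_header
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        if header_pending:
            header_pending = False  # skip the header row once
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            raise ValueError(f"line {number}: expected at least 3 columns")
        if any(field.strip() == "" for field in fields):
            raise ValueError(f"line {number}: empty column")
        pathway_id, name, *genes = fields
        pathways[pathway_id] = (name, genes)
    return pathways


gmt_lines = [
    "pathwayIDs\tpathways\tHGNC",
    "WP5195\tDisorders in ketolysis\tACAT1\tHMGCS1\tOXCT1\tBDH1\tACAT2",
    "WP5189\tCopper metabolism\tATP7B\tATP7A\tSLC11A2\tSLC31A1",
]
pathways = parse_gmt(gmt_lines)
print(pathways["WP5189"])
```

A trailing tab at the end of a row produces an empty last column, so it is reported as an error as well.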

Background file

In addition to the GMT file, ODAMNet needs another GMT file that provides the background genes for the statistical approaches. Different background gene sets can be used at the same time. So, instead of taking the background GMT file directly, ODAMNet takes as input a list of background file names.

hsapiens.GO-BP.name.gmt
hsapiens.REAC.name.gmt
hsapiens.REAC.name.gmt
hsapiens.GO-BP.name.gmt
hsapiens.WP.name.gmt

The background file contains the same number of lines as the GMT file, and the background file names are in the same order as the corresponding pathways in the GMT file.
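The correspondence between the two files can be sketched as a simple consistency check. It assumes that the GMT header row is not counted, as the examples on this page suggest (three background lines for three pathways); `pair_pathways_with_background` is a hypothetical name used only for illustration.

```python
def pair_pathways_with_background(gmt_lines, background_lines, has_header=True):
    """Pair each pathway ID with the background file name given on the same row."""
    gmt_rows = [line for line in gmt_lines if line.strip()]
    if has_header:
        gmt_rows = gmt_rows[1:]  # assumption: the header row has no background entry
    names = [line.strip() for line in background_lines if line.strip()]
    if len(names) != len(gmt_rows):
        raise ValueError(
            f"{len(gmt_rows)} pathways but {len(names)} background file names"
        )
    # The first GMT column is the pathway ID
    return [(row.split("\t")[0], name) for row, name in zip(gmt_rows, names)]


gmt_lines = [
    "pathwayIDs\tpathways\tHGNC",
    "WP5195\tDisorders in ketolysis\tACAT1\tHMGCS1",
    "WP5189\tCopper metabolism\tATP7B\tATP7A",
]
background_lines = ["hsapiens.WP.name.gmt", "hsapiens.WP.name.gmt"]

pairs = pair_pathways_with_background(gmt_lines, background_lines)
print(pairs)
```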

Examples

Background and GMT files need to be in the same folder.

Three lines of WP background file

hsapiens.WP.name.gmt
hsapiens.WP.name.gmt
hsapiens.WP.name.gmt

Three lines of WP pathways

pathwayIDs  pathways        HGNC
WP5195      Disorders in ketolysis  ACAT1   HMGCS1  OXCT1   BDH1    ACAT2
WP5189      Copper metabolism       ATP7B   ATP7A   SLC11A2 SLC31A1
WP5190      Creatine pathway        GAMT    SLC6A8  GATM    OAT     CK

Networks

In ODAMNet, two main network file formats are used:

  • Simple interaction file (SIF)

  • Graph file (GR)

SIF file

This network format is used in the Active Module Identification (AMI) approach. The SIF file contains three columns (source node, interaction type and target node) and has a header. It’s a tab-separated file.

node_1      link    node_2
AAMP        ppi     VPS52
AAMP        ppi     BHLHE40
AAMP        ppi     AEN
AAMP        ppi     C8orf33
AAMP        ppi     TK1

For more details, see SIF file format webpage.

GR file

This network format is used in the Random Walk with Restart (RWR) approach. The GR format contains two columns: source node and target node, without header. It’s a tab-separated file.

NFYA        NFYB
NFYA        NFYC
NFYB        NFYC
BTRC        CUL1
BTRC        SKP1
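The two layouts differ only in the header row and the interaction-type column, so the relation between them can be illustrated by a small conversion sketch. It assumes tab-separated SIF lines with a header, as described above; `sif_to_gr_lines` is a hypothetical helper and not something ODAMNet is known to provide.

```python
def sif_to_gr_lines(sif_lines):
    """Convert SIF rows (source, interaction, target) into GR rows (source, target)."""
    rows = [line.rstrip("\r\n") for line in sif_lines if line.strip()]
    gr_lines = []
    for row in rows[1:]:  # skip the SIF header; GR files have no header
        source, _interaction, target = row.split("\t")
        gr_lines.append(f"{source}\t{target}")
    return gr_lines


sif_lines = [
    "node_1\tlink\tnode_2",
    "AAMP\tppi\tVPS52",
    "AAMP\tppi\tBHLHE40",
]
gr_lines = sif_to_gr_lines(sif_lines)
print("\n".join(gr_lines))
```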

Configuration file

Warning

Follow the same folder tree as the one used in multiXrank.

To perform an RWR, multiXrank [3] needs a configuration file as input. This file contains the paths of the networks used. It can be short (see below) or very detailed, with parameters.

For more details about this file, see the multiXrank documentation: GitHub / Documentation.

This is an example of a short configuration file:

 multiplex:
     1:
         layers:
             - multiplex/1/Complexes_gene_names_190123.gr
             - multiplex/1/Pathways_reactome_gene_names_190123.gr
             - multiplex/1/PPI_HiUnion_LitBM_APID_gene_names_190123.gr
     2:
         layers:
             - multiplex/2/RareDiseasePathways_network_useCase1.gr
 bipartite:
     bipartite/Bipartite_RareDiseasePathways_geneSymbols_useCase1.gr:
         source: 2
         target: 1
 seed:
     seeds.txt

Tip

Whatever networks are used, the command line is the same. You only have to change the network names inside the configuration file.

References