

GGTC German Gene Trap Consortium







 Gene trap sequence tag analysis pipeline

(Jens Hansen, last updated October 1st, 2007)




The GGTC gene trap screen uses two different PCR methods to analyze the vector integration sites: Splinkerette (SPLK) PCR and 5' RACE. These methods yield genomic sequence tags and cDNA-type sequence tags, respectively. The analysis pipeline for the gene trap sequence tags (GTSTs) combines several bioinformatic techniques to localize the vector integration site on the genome and to identify the mutated gene:




1. Sequence processing:

In the first step of the pipeline the GTSTs are preprocessed for subsequent BLAST searches (fig.1). Analysis of mutant ES cell lines by PCR yields sequence tags that contain both a short stretch of vector sequence and sequence of the mutated gene. The key step in the sequence processing is the vector sequence clip. Successful identification of the transition zone between the vector and the gene sequence is important for the prediction of the true vector integration site (fig.2). The end of the sequence tag is determined by a PCR-specific SPLK adapter sequence. If the SPLK adapter sequence cannot be identified, the low-quality sequence end is clipped instead. The resulting sequence passes several filtering and masking steps, e.g. a repeat-masking step and a low-complexity filter.
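The clipping logic described above can be sketched as follows. This is only an illustration, not the GGTC implementation: the function name clip_tag, the exact-match search, and the quality threshold of 20 are assumptions, and the repeat-masking and low-complexity filters are omitted.

```python
def clip_tag(read, quals, vector_end, adapter, min_qual=20):
    """Clip vector and adapter sequence from a raw sequence tag.

    Returns (clipped_sequence, transition_offset), or None if the vector
    end cannot be found. Exact string matching and the min_qual threshold
    are illustrative assumptions, not parameters of the GGTC pipeline.
    """
    pos = read.find(vector_end)
    if pos < 0:
        return None  # no vector/gene transition zone detectable
    start = pos + len(vector_end)  # first base after the vector sequence

    end = read.find(adapter, start)
    if end < 0:
        # Adapter not found: trim trailing low-quality bases instead.
        end = len(read)
        while end > start and quals[end - 1] < min_qual:
            end -= 1
    return read[start:end], start


# Hypothetical read: 4 leading bases, vector end, gene sequence, adapter, 2 trailing bases.
read = "AAAA" + "GTTCGA" + "ACGTACGTAC" + "CCTAGG" + "TT"
clipped = clip_tag(read, [30] * len(read), "GTTCGA", "CCTAGG")
```

The returned offset marks the transition between vector and gene sequence within the raw read, which is needed later to deduce the insertion point.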




2. BLAST alignments:

BlastN is used to map the sequence tag to the mouse genome (ENSEMBL).
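As an illustration of how BlastN results might be collected for the later steps, the sketch below parses the standard 12-column tabular BLAST output and keeps only significant hits. Whether the GGTC pipeline uses tabular output is not stated here, and the e-value cutoff of 1e-10 and the names BlastHit and parse_blast_tab are assumptions.

```python
from typing import NamedTuple


class BlastHit(NamedTuple):
    query: str
    subject: str
    identity: float
    length: int
    q_start: int
    q_end: int
    s_start: int   # always the lower genomic coordinate
    s_end: int     # always the higher genomic coordinate
    strand: str
    evalue: float


def parse_blast_tab(lines, max_evalue=1e-10):
    """Parse 12-column tabular BLAST output and drop non-significant hits.

    Columns: qseqid sseqid pident length mismatch gapopen qstart qend
    sstart send evalue bitscore. max_evalue is an assumed cutoff.
    """
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        s_start, s_end = int(f[8]), int(f[9])
        # In tabular output a minus-strand hit is reported with sstart > send.
        strand = "+" if s_start <= s_end else "-"
        evalue = float(f[10])
        if evalue > max_evalue:
            continue
        hits.append(BlastHit(f[0], f[1], float(f[2]), int(f[3]),
                             int(f[6]), int(f[7]),
                             min(s_start, s_end), max(s_start, s_end),
                             strand, evalue))
    return hits


example = [
    "tag1\tchr1\t99.5\t200\t1\t0\t5\t204\t1000\t1199\t1e-100\t380",
    "tag1\tchr7\t98.0\t150\t3\t0\t5\t154\t9150\t9001\t1e-60\t250",
    "tag1\tchr3\t85.0\t30\t4\t0\t5\t34\t500\t529\t0.5\t30",
]
hits = parse_blast_tab(example)
```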


3. Identification of the vector insertion point:

To locate the vector insertion site as precisely as possible, the transition zone between the vector sequence and the endogenous sequence of the trapped gene must be identified. This transition zone is used to deduce the true vector insertion point from the alignment starting point (fig.2).
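One plausible way to deduce the insertion point is sketched below. It assumes that any bases between the transition zone and the first aligned base are contiguous genomic sequence, so the alignment start can simply be shifted back by that gap; the actual rule shown in fig.2 may differ, and the function name and coordinate conventions are illustrative.

```python
def insertion_point(genome_start, genome_end, strand, query_start, transition):
    """Estimate the genomic coordinate of the vector insertion.

    genome_start <= genome_end are the genomic ends of the alignment,
    query_start is the first aligned base of the tag, and transition is
    the first base after the vector sequence (both in tag coordinates).
    Assumes the unaligned bases in between are contiguous genomic sequence.
    """
    gap = query_start - transition
    if gap < 0:
        raise ValueError("alignment starts inside the vector sequence")
    if strand == "+":
        # Tag reads towards higher coordinates: step back from the start.
        return genome_start - gap
    # Minus strand: the tag's first aligned base lies at genome_end.
    return genome_end + gap


plus_site = insertion_point(1000, 1200, "+", query_start=15, transition=10)
minus_site = insertion_point(1000, 1200, "-", query_start=15, transition=10)
```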


4. Genomic localization:

After running BLAST searches with all sequence tags, two different pipelines are used for cDNA and genomic sequence tags respectively. Due to their origin from mRNA transcripts, cDNA tags result in gapped alignments when aligned to the mouse genome sequence, whereas genomic tags normally result in linear alignments:


(i) genomic sequence tags:

The crucial step in localizing a sequence tag is the selection of the best alignment. The alignments are filtered for significance and the remaining alignments are screened to identify the best matching alignment (fig.3). If both PCR analyses, 5' and 3' Splinkerette PCR, return significant results for a particular cell line, the genomic localization of the vector integration is verified by cross-comparing the resulting genomic coordinates, the chromosome matched, and the orientation. The distance between the vector integration points predicted by the 5' and 3' Splinkerette PCR analyses should not exceed a threshold of 1000 nucleotides. Whether a single end mapping (class III) or a paired end mapping (class I) is found determines the level of evidence of the mapping process (fig.3). If no best matching alignment has been found, the alignments of both the 3' SPLK tag and the 5' SPLK tag are screened for paired end mappings by an all-against-all approach and the best mapping by e-value is selected (class II). If no paired mappings can be found, both tags are omitted.
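The class assignment can be sketched as follows. The 1000-nucleotide threshold and the three classes come from the description above; everything else is an assumption made for illustration: the best alignment is taken to be the one with the lowest e-value, strands are assumed to be normalized to the vector orientation, and class II pairs are ranked by the worse of their two e-values.

```python
from dataclasses import dataclass
from itertools import product

MAX_DISTANCE = 1000  # nucleotides, threshold stated in the text


@dataclass(frozen=True)
class Hit:
    chrom: str
    strand: str      # assumed normalized to the vector orientation
    insertion: int   # predicted vector integration point
    evalue: float


def consistent(a, b, max_distance=MAX_DISTANCE):
    """True if two hits agree in chromosome, orientation and position."""
    return (a.chrom == b.chrom and a.strand == b.strand
            and abs(a.insertion - b.insertion) <= max_distance)


def best(hits):
    # Assumption: "best" means lowest e-value.
    return min(hits, key=lambda h: h.evalue) if hits else None


def classify(hits5, hits3):
    """Return (class, 5' hit, 3' hit), or None if the tags are omitted."""
    best5, best3 = best(hits5), best(hits3)
    if best5 is None and best3 is None:
        return None
    if best5 is None or best3 is None:
        return ("III", best5, best3)      # single end mapping
    if consistent(best5, best3):
        return ("I", best5, best3)        # paired end mapping
    # All-against-all screen for any consistent pair.
    pairs = [(a, b) for a, b in product(hits5, hits3) if consistent(a, b)]
    if not pairs:
        return None                       # both tags omitted
    a, b = min(pairs, key=lambda p: max(p[0].evalue, p[1].evalue))
    return ("II", a, b)


a = Hit("chr1", "+", 5000, 1e-50)
b = Hit("chr1", "+", 5400, 1e-40)
c = Hit("chr2", "+", 100, 1e-60)
result = classify([c, a], [b])
```

In this hypothetical example the best 5' hit lies on chr2 and disagrees with the 3' hit, so the all-against-all screen recovers the consistent chr1 pair as a class II mapping.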




(ii) cDNA sequence tags:

The SPLIGN software (NCBI) is used to map cDNA sequence tags to the mouse genome (fig.4). The gene models predicted by SPLIGN are compared to select the best matching model. Important selection parameters are the total length of the gene model, the identity of the longest exon, and a minimal deviation of the alignment starting point from the vector integration site (i.e. the transition between vector and gene sequence in terms of the GTST). The resulting best-matching gene model is used to predict the site where the splice acceptor of the vector has been spliced to the splice donor of the endogenous exon (most of the vectors used by the GGTC are intron trap vectors that depend on splicing). To reconcile the predicted gene model, the splice site, as deduced from the cDNA, is compared with the vector integration site, as deduced from the genomic sequence tag analysis. If the predicted splice site deviates by more than 1 Mb from the vector integration site, or is located on another chromosome or in the opposite orientation, the selected gene model is rejected.
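The final rejection rule can be written down compactly. The sketch below covers only that reconciliation check, not the SPLIGN model selection itself; the Site type and the function name are illustrative, and 1 Mb is taken as 1,000,000 nucleotides.

```python
from typing import NamedTuple

MAX_SPLICE_DISTANCE = 1_000_000  # 1 Mb, threshold stated in the text


class Site(NamedTuple):
    chrom: str
    strand: str
    position: int


def accept_gene_model(splice_site, integration_site,
                      max_distance=MAX_SPLICE_DISTANCE):
    """Return True if the cDNA-derived splice site is compatible with the
    integration site derived from the genomic sequence tag."""
    if splice_site.chrom != integration_site.chrom:
        return False  # different chromosome
    if splice_site.strand != integration_site.strand:
        return False  # opposite orientation
    distance = abs(splice_site.position - integration_site.position)
    return distance <= max_distance


integration = Site("chr5", "+", 30_000_000)
ok = accept_gene_model(Site("chr5", "+", 30_250_000), integration)
too_far = accept_gene_model(Site("chr5", "+", 31_500_000), integration)
```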




5. Annotation:

The genomic coordinate of the vector insertion site, the orientation of the alignment, and the chromosome matched are used to identify the trapped gene on the basis of the mouse genome annotation (ENSEMBL). The position of the vector insertion relative to structural elements of the trapped gene (e.g. exons, introns, UTRs, stop signals etc.) is also determined.
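A minimal sketch of the annotation lookup is given below. It uses a small in-memory gene list with a hypothetical gene in place of the ENSEMBL annotation, distinguishes only exons from introns, and assumes that the alignment orientation must match the gene strand; none of these details are taken from the GGTC implementation.

```python
def annotate(chrom, strand, position, genes):
    """Return (gene_name, feature) for an insertion site, or None.

    genes is a list of dicts with keys name, chrom, strand, start, end and
    exons (a list of inclusive (start, end) tuples). This layout is an
    illustrative stand-in for a real genome annotation.
    """
    for gene in genes:
        if gene["chrom"] != chrom or gene["strand"] != strand:
            continue
        if not gene["start"] <= position <= gene["end"]:
            continue
        for exon_start, exon_end in gene["exons"]:
            if exon_start <= position <= exon_end:
                return gene["name"], "exon"
        # Inside the gene but outside every exon.
        return gene["name"], "intron"
    return None


# Hypothetical annotation entry, for illustration only.
genes = [{
    "name": "GeneA", "chrom": "chr1", "strand": "+",
    "start": 1000, "end": 9000,
    "exons": [(1000, 1200), (4000, 4300), (8500, 9000)],
}]
in_intron = annotate("chr1", "+", 2500, genes)
in_exon = annotate("chr1", "+", 4100, genes)
```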


Relevant data from the genomic localization and annotation of the sequence tags are stored in a database and presented on the web interface of the GGTC.