genexpress: a computer system for description, analysis and recognition of regulatory ... · 2006....

GENEEXPRESS: A COMPUTER SYSTEM FOR DESCRIPTION,ANALYSIS, AND RECOGNITION OF REGULATORY SEQUENCES IN

EUKARYOTIC GENOMEN.A. Kolchanov, M.P. Ponomarenko, A.E. Kel, Yu.V. Kondrakhin, A.S. Frolov, F.A. Kolpakov,

T.N. Goryachkovsky, O.V. Kel, E.A. Ananko, E.V. Ignatieva, O.A. Podkolodnaya, V.N. Babenko,I.L. Stepanenko, A.G. Romashchenko, T.I. Merkulova, D.G. Vorobiev, S.V. Lavryushev,

Yu.V. Ponomarenko, A.V. Kochetov, G.B. Kolesov,Institute of Cytology and Genetics, Siberian Branch of the Russian Academy of Sciences, Novosibirsk, 630090; Siberian Branch

of the Russian Academy of Sciences, Novosibirsk, Russia, 630090; emaiI:[email protected]. V. Solovyev

The Sanger Centre Hinxton, Cambridge, CBIO ISA, UK emaiI: [email protected]. Milanesi

lstituto Di Tecnologie Biomediche Avanzate, Consiglio Nazionale Della Ricerche, Via Ampere 56, Milano, ltaly;email: [email protected]

N. L. PodkolodnyInstitute of Computational Mathematics and Mathematical Geophysics, Siberian Branch of the Russian Academy of Sciences,

Novosibirsk, Russia, 630090; email:[email protected]

E. Wingender, T. HeinemeyerGesellschaft fur Biotechnologische Forschung mbH, Mascheroder Weg 1, D-38124 Braunschweig, Germany; email: ewi@gbf deKeywords: Gene networks, transcription, translation, regulation, site, recognition, activity, databases.

via the particular signal transduction pathway toAbstract

GeneExpress system has been designed tointegrate description, analysis, and recognition ofeukaryotic regulatory sequences. The system includes 5basic units: (1) GeneNet contains an object-orienteddatabase for accumulation of data on gene networksand signal transduction pathways and a Java-basedviewer that allows an exploration and visualization ofthe GeneNet information; (2) TranscriptionRegulation combines the database on transcriptionregulatory regions of eukaryotic genes (TRRD) andTRRD Viewer; (3) Transcription Factor BindingSite Recognition contains a compilation oftranscription factor binding sites (TFBSC) andprograms for their analysis and recognition; (4)mRNA Translation is designed for analysis ofstructural and contextual features of mRNA 5’UTRsand prediction of their translation efficiency; and (5)ACTIVITY is the module for analysis and site activityprediction of a given nucleotide sequence. Integrationof the databases in the GeneExpress is based on theSequence Retrieval System (SRS) created in theEuropean Bioinformatics Institute.

GeneExpress is available athttp://wwwmgs, bioneL ns~ ru/systems/GeneExpress/.

IntroductionThe eukaryotic gene expression is one of the most

complex biological phenomena involving a number ofmolec~flar events. It may start with reception of adefinite stimulus by the cell, which is then conveyedCopyright © 1998, American Association for Artificial Intelligence(www.aaai.org). All rights reserved.

initiate transcription of the relevant genes. Their pre-mRNAs are processed by 3’ cutting/polyadenylation,capping, splicing, and finally the correspondingproteins are translated from these mature mRNAs.This totality of molecular events forms the particulargene network that provides the cell response to thestimulus. The cellular and organismic homeostases aswell as cell/tissue differentiation and development aremaintained by their gene networks. That is the reasonwhy molecular biologists investigating the geneexpression should be able to access databases on all thestages of gene expression as well as the relevantprograms for their analysis. Thus, investigation of thegene expression is an integrative problem of biology.

Currently, various experimental data on genomicregulatory sequences controlling the eukaryotic geneexpression are being rapidly accttmulated. Thetranscription regulator)" regions have been sequencedfor thousands of genes. A great number oftranscription regulatory elements have been localizedincluding transcription factor binding sites, enhancers,promoters, etc. (Kel’ A.E et al., 1997; Peter et at.,1998; Wingender et al., 1996). A wide range offunctional sites controlling other stages of geneexpression (splicing, processing-polyadenylation, andtranslation) have been isolated and studied. considerable volume of the experimental data on theactivity of various types of functional sites controllingthe gene expression has been generated (Kolchanov etal., 1998). The experimental data on the genenetworks, the ensembles of coordinately functioning

Kolchanov 95

From: ISMB-98 Proceedings. Copyright © 1998, AAAI (www.aaai.org). All rights reserved.

genes (Kolpakov et al., 1998), is growing. Computeranalysis of the genomic regulatory sequences becomeseven more important in case of functionalinterpretation of newly sequenced genomic fragmentsas well as for study of the molecular mechanisms ofgene expression regulation. The current number ofdatabases on various genomic regulatory regions isconsiderable. In addition to the general databases, suchas EMBL and GenBank, a number of specializeddatabases on gene expression regulation are available:EPD (Peter et al., 1998), TRANSFAC (Wingender al., 1996), TRRD (Kel’, A.E. et al., 1997), COMPEL(Kel, O.V. et al., 1995b), EpoDB (Salas et al., 1998),etc. Many computer methods for recognition ofregulatory genomic sequences (Waterman et al, 1984;Lawrence et al., 1993; Chen et al., 1995; Ulyanov &Stormo, 1995; Quandt et al., 1995; Fickett &Hatzigeorgiou, 1997 (review); Kel, A.E. et al., 1995;Prestridge, 1995; Pedersen et al., 1996; Solovyev &Salamov, 1997; Salamov & Solovyev, 1997) have beendeveloped. Thus, the challenging problem is to create aWWW-based environment capable of integrating theinformation coming from various databases onexpression regulation and make this informationaccessible by software for investigation and predictionof regulatory sequences.

We started this integration from cross-linking theTRANSFAC, TRRD, and COMPEL databasesthrough introduction of a common format table for allof them (Wingender et al., 1996). Appearance of SRSquery system (Etzold and Argos, 1993) opened a newera of web-integration. It provides unification ofqueries to various databases concealing any specificdetails of their realization; unified representation of thequeried information; flexible format of informationrepresentation (for example, FASTA, PIR, etc.); possibility to include additional modules for graphicrepresentation; a powerful reference and help systemsfor each of the databases; and a possibility of linkagewith the other databases and computer systems.

Using SRS, we have developed GeneExpress, theSRS-based integrator for the databases and programssupporting investigation of the gene expression. Thedatabase GeneNet on molecular events forming genenetworks was assigned its integrative core. To studytranscription, this core was supplemented with thedatabase TRRD on transcription regulatory regionsand the compilation TFBSC of the sequence sets oftranscription factor binding sites. The TRRD andTFBSC were linked to the system RgScan,recognizing the sites in DNA sequences. Fortranslation, the database LeaderRNA on mRNA

leaders was included and linked to the programpredicting the High/Low translation levels from agiven mRNA sequence. The gene expression is alsoquantitatively described by the system ACTMTYcompiling the functional site activity magnitudes andlinked with the programs predicting the activities fromsite sequences. Thus, the GeneExpress system isdesigned to integrate description, analysis, andrecognition of eukaryotic genomic sequences. Themodular and hierarchical organization of regulatorygenomic sequences and the network-organizedregulation of gene expression were taken intoconsideration during the system development.GeneExpress, is WWW-available athttp://wwwmgs.bionet.nsc.ru/systems/GeneExpress/.

GENE NETWORKS

The GeneNet database is designed foraccumulation of formalized description of genenetworks and signal transduction pathways. Using theobject-oriented approach, the following componentsare included in the description of a gene network:entities (any material objects), relations between theentities, and processes connected with them (forexample, viral infection, anemia, or erythroc~edifferentiation). Four classes of entities aredistinguished: (1) Cell (tissue, organ) entity, regardedas a definite compartment containing a certain set ofentities of other classes; (2) Protein; (3) Gene; and Substance (a nonprotein regulatory substance, forexample, metabolite). Two classes of relations betweenthe entities are described: (1) reaction of interactionbetween entities yielding a new entity or process; and(2) regulatory event as the effect of an entity on certain reaction. Instances of Cell (tissue, organ),Gene, Protein, Substance, State, and Relation classesare described in the separate tables CELL, GENRE,PROTEIN, STATE, and RELATION, respectively.The database is also supplemented with the SCHEMEtable. Thus, the database contains eight tables in theEMBL-Iike text format: (1) CELL (information on cell types and lines, including also the description oftissues and organs); (2) GENE (genes and theirregulatory features based on the information from theTRRD database); (3) PROTEIN (proteins and proteincomplexes); (4) SUBSTANCE (regulatory substancesand metabolites); (5) PROCESS (physiological processand the organismic state during the gene networkfunctioning); (6) RELATION (relations between gene network components); and (7) SCHEME(description of the gene network graph).

96 ISMB-98

PROTEEI~T TAJ~ZLE :~’:~i*~ Fit.isI D H s :I:,84

J :::::DIL TFF~.CTOIK, TOIS?3; J i7~7 7~!~ji!:; i ili~ii:::~!!:!!Ti:i ~}:::.: :! ::;i;i;JDR EMBL; M979~6; i i ::.::.::~i::ii ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

i :: ~i~;~:i:!ill .a.l held~ ........................................................" .... :: ::~::::::!.!I:I!I!~I~,,~,’L:!~ ......................!~,~:i:i:1~" ~i;!i;:i|Click the mouse ~ .... i~ B RO U S ER the e~try from the d4~abase, " i ::~::::|1"--................ I -~ :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

Figure 1. Example of automated construction of the diagram representing the gene network of the antiviralresponse at the cell level

The GeneNet database is also developed using theSRS. It supports the cross-references within theGeneNet database and with EMBL, SWISS-PROT,TRRD, TRANSFAC, and EPD databases. The currentversion of the GeneNet database contains thedescriptions of gene networks of antiviral response(Ananko et al., 1997) and erythropoiesis(Podkolodnaya and Stepanenko, 1997).

The GeneNet includes automatic construction of agene network diagram. The diagram is presented as agraph with the nodes corresponding to entities or statesand the edges reflecting the relations between the genenetwork components. Information on the graph

Inte2ral re2ulation of eukarvotic genes

&

Re~ulator~ uni! (promoter, enhanqer, or silencer)i

...7 ~ . ./ ~ .~.~

~ "%". Compositeelement[~ACCCrGAGGT

-136 -128

C is-e lement (~ns:no~ ~o’t~oa~ ate)

Figure 2. Structural and functional organization ofeukaryotic genes.

structure is taken from the SCHEME table. Each genenetwork component has its own image on the diagram,showing its features (Fig. 1). The GeneNet systemtakes into account that the gene network componentscan belong to different organs, tissues, cells, and cellcompartments. The three following hierarchical levelsare considered: (1) organism level, at which suchentities as organs, tissues, cell .types, and varioussubstances affect other organs, tissues, and cells; (2)the single cell level, where four compartments aredistinguished: the intercellular space,, cell membrane,cytoplasm, and nucleus; and (3) the single gene level,where the description of transcription regulationemploys the data from the TRRD database. Each levelcan be displayed in a separate window. The gene levelis visualized via the TRRD Viewer described above.

The GeneNet Viewer is a Java applet. It includesthe above-described generation of the gene networkdiagram and some tools for data navigation, on-linehelp, interactive cross-references within the GeneNetdatabase, and references to other databases. All imageson the diagram are interactive, i.e., if a user clicks theimage, the textual description of the correspondingentD’ is displayed in the special text window under thediagram (Fig. 1). Double clicking the gene image startsthe TRRD Viewer, and the regulatory map of the geneis visualized. The text window contains a formattedtext with hypertext references of three types: (1) thereference explaining the type of information describedin the field; (2) cross-references within the GeneNet

Kolchanov 97

database; and (3) references to other databases (EMBL,SWISS-PROT, TRRD, TRANSFAC, and EPD).

TRANSCRIPTION REGULATION

Transcription Regulatory Regions Database ~)The model of functional organization of eukaryotic

gene regulatory regions (Kel’, A.E. et al., 1997; Kel,O.V. et al., 1995a) was used as the basis for the TRRDdatabase. It takes into account a great diversity of theelements controlling gene transcription, their modularorganization, and the hierarchy of these elements,essential for their functioning. The TRRD format

TRROGENES contains a general description ofgenes, peculiarities of their transcription regulation(dependence on the cell cycle stage, developmentalstage, tissue-specificity, or effects of external factors),chromosomal location, description of regulatory units(promoters, enhancers, or silencers), compositeelements, and free text comments. This table is linkedto the TRANSFAC_GENE (Wingender et al., 1996),COMPEL (Kel, O.V., 1995b), and GeneNetCKolpakov et at., 1998) databases.

TRRDSITES accumulates information ontranscription factor binding sites: nucleotide sequences

and their location within the gene,the list of the relevant cell lines andcodes of experiments, and free textcomments. This table is linked to thefollowing computer systems:ACTIVITY for predicting site

i i activity (Kolchanov et al., 1998) andprograms for site recognition(Kondrakhin et at., 1998). The tablealso contains the references toTRANSFAC_SITE,TRANSFAC FACTOR(Wingender et al., 1996), and EMBLdatabases.

TRRDBIB is linked with the siteand gene description tables and

contains the complete references to the original articlesand to MEI)LINE references.

The current version, TRRI) 3.5, comprises thedescription of 427 genes, 607 regulatory units(promoters, enhancers, and silencers), and 2147transcription factor binding sites. Over 1500 scientificpublications have been processed to obtain these data.The TRRDGENES database includes information onhuman (185 entries), mouse (126), rat (69), chicken(29), and other genes. Most of the tissue-specific genesare expressed in liver (120), blood (67), or muscle cells(37). The major part of the genes compiled in TRRDare either interferon-induced (62) or glucocorticoid-regulated genes (30), or belong to lipid metabolism(41), erythroid differentiation (37), or cell cycleregulation (23) fimctional systems. TRRD is installedunder the SRS to provide easy information retrievaland integration with other databases and computersystems for information processing.

TRRD Viewer. The Java applet, TRRO-Viewer,allows to visualize the data on location of transcriptionfactor binding sites in a map form (Fig. 3) andoverlook their textual description. While working withthis applet, the user selects a gene identifier from thelist, and the textual description of the gene (from

I~or~ ,~,,~,~,,~,, r,~,r ~*,~ ~---~- : ! !~t~i ....................................~i~::2":~c._qq GOCIOO44 . iCR $14akkix~C~T,KclO ’~ 011¢ F;7 up

C_O ~1~51 "

Figure 3. Example of visualization of gene regulatory map by TRRDViewer. Boxes represent binding sites of transcription factors; the line

allows describing the modular structure oftranscription regulatory regions and the hierarchy oftheir constituent regulatory units. The hierarchy of thefollowing elements has been implemented (Fig. 2): (1)Cis-elements provide the interaction of transcriptionfactors with DNA (Wingender, 1993); (2) Compositeelements support the interactions between DNA sitesand the protein factors or the protein-proteininteractions, causing either synergistic or antagonisticregulatory effects (Kel, O.V. et at., 1995b; Kel’, A.E.et al., 1997); (3)Promoters, enhancers, and silencersat this level of the hierarchy provide the transcriptionregulation under certain conditions; (4) Transcriptionregulatory regions are represented by continuousregions of genomic DNA containing the regulatoryelements of the levels described above (Kel, O.V. et al.,1995a; Kolchanov, 1997) and located in the gene 3’-and 5’-flanking regions or introns; and (5) The systemof integral regulation of gene transcription comprisesall these regulatory elements (Kolchanov, 1997; Kel’,A.E. et al., 1997).

The TRRD database has three interconnectedtables: TRRDGENES (description of genes),TRRDSITES (description of sites), and TRRDBIB(references).

98 ISMB-98

TRRDGENES), its sites (from TRRDSITES), and relevant references (from TRRDBIB) appears in thetext window. Transcription factor binding sites andcomposite elements are presented graphically. If theuser clicks the site image, the description from theTRRDSITES table is displayed in the text window.Clicking the field title provides comments on theinformation described in the field. Several optionsallow a number of different site representations.

TRANSCRIPTION FACTOR BINDING SITERECOGNITION

This module includes two blocks: the database ontranscription factor binding site compilations (TFBSC)and programs for site analysis and recognitionSiteGroup and SiteScan (Kondrakhin et al., 1998). Thetraining samples from TFBSC, containingexperimentally determined sequences of a particularsite have been used in developing the recognition

Table 1. A set of realizations forAP-1 bindin siteN WeimarS) Realization0 26 tgactca1 10 tgactAa2 5 tgaAtca3 4 t~acGca4 2 tAactca5 2 tsacAca6 2 tgactGa7 1 tCactca8 1 T~ct~a9 1 ] T~actcG10 1 TgactcC$) Realization weight is the numberof binding sites from U0 containing agiven realization.

methods. The datainclude samples of41 transcriptionfactor binding sites(from 6 to 199sequences for eachfactor; 1496sequences totally)in EMBL-likeformat.

A simplerecognition methodis based on therepresentation oftranscription factorbinding sequences

as a set of site realizations R={Ro, R~, ..., 1%_1}. Inother words, the set of realizations in the form of

Table 2. Examples of the accuracy of binding site recognitionScan program

No Binding Errors.) for RGScan Errors for matrixmethod method

site0~1 0~1 ~2

1 API 0.188 0.004303 0.156 0.0074212 AP2 0.125 0.000872 0.063 0.0324773 ATF/CREB 0.147 0.000207 0.118 0.0009644 C/EBP 0.060 0.023392 0.096 0.0212845 COUP/RAR 0.025 0.003936 0.050 0.0578746 ETF 0.000 0.002229 0.333 <1047 GATA 0.127 0.000491 0.072 0.0157748 GR 0.118 0.003092 0.039 0.0332869 NF-1 0.038 0.000620 0.099 0.01369410 NF-kB 0.138 <104 0.069 0.00201411 NF-Y 0.045 0.001466 0.409 0.00025812 OCT 0.163 0.000505 0.571 <10413 Pit- 1 0.176 0.000522 0.059 0.00995914 Sp1 0.029 0.008347 0.194 0.001999

°) a~, false negatives; c~2; false positives

* Prediction ofpoteatial binding site* of transcription factors in a given* DNA sequence on the basis of* recogIfition groups.

*******************************************

The name of sequence: XLACTA2 standard;The number of all predicted binding sites = 14Name Position SiteAPF 5 agtaacNFIII 52 tcatttGT-2B 81 cagctgMLTF 107 gtcactAP-1 108 tcactcaSRF 221 ccatgtaaggGATA-1 262 cttatcaSRF 272 ccaaatalggNFIII 288 aaatgaAR 301 tcttctAR 356 agagcaNFIII 361 aaatgaPit-1 378 atgaataTFIID 385 tataaa

Figure 4. Prediction of transcription factor binding sites in thepromoter region ofXenopus laevis alpha2 gene (sarcomeric actin)by SiteScan program.

oligonucleotides x bases long coded in the IUPAC-IUB15 single-letter based codes is created by our algorithmfor any particular site. This approach avoids theaveraging of the nucleotide composition, as occurwhile describing the functional site by the weightmatrix or consensus. The program SiteGroup uses a setU0={ul,...,Um} of experimentally determined bindingsite sequences extracted from the TFBSC as the initialinformation to create the set of realizations R. It isdetermined by two parameters: (1) the length of theoligonucleotide z and (2) the maximally allowabledifference (distance) t (miss) between theseoligonucleotides. The main realization is theoligonucleotide 17,o of length x having the highestfrequency in the sample Uo. Then, all sequences ulcontaining Ro are removed from the sample Uo,producing the sample UI and so on. Analysis of thesample Ur, during which the rth realization Rr issearched for, is performed at the rth step. In thisiterative process, all oligonucleotide words of length xare considered. The distance t from the word Ro isestimated for each word. The word R~ that has theleast distance from the Ro word is selected. If a numberof words have the same minimal distance, the wordwith the maximal frequency in the set Ur is selected.The iteration is stopped when the set U~ either becomesempty or contains only the words that differ from Roby the value exceeding t ("a’’). Each set of realizationscreated by the method described may be characterizedby parameter fo: the proportion of the sequences fromthe initial set Uo represented in this set of realizations(covering of the set Uo). The set of realizations

Kolchanov 99

providing maximization of the functional is searchedfor by exhaustion of the pairs (t("i"),x)=(/d’).

qi,j=fi,j × [(fi, j-fi.l,j)+(f~j-fi,j+0]. (1)Examples of the first and second type errors in

recognition of several transcription factor binding sitesare listed in Table 2. For assessing the second typeerrors, the compilation of eukaryotic non-first exonswas used. Small first and second type errors wereobserved for the SiteScan recognition. It is especiallyimportant for analysis of transcription factor bindingsites in long genomic sequences.

COMPUTER SYSTEM FOR PREDICTING mRNATRANSLATION ACTIVITY

This part of the GeneExpress system is designedfor prediction of mRNA translation efficiency basingon analysis of structural and contextual features of 5’untranslated regions (5’UTRs). It has a program(Leader) for mRNA translation rate prediction and thedatabase containing 5’UTR sequences (Leader_Sq) andsome information on the effect of several 5’UTRfeatures on mRNA translation efficiency.

Eukaryotic mRNAs differ considerably in theirtranslation efficiency. This has been attributed todifferent efficiency of translation initiation. Thecontextual and structural features of 5’UTRs havestrong effect on translation initiation. To reveal thesecharacteristics, we have compared the mRNAsequences of several house-keeping gene groups,highly expressed in eukaryotic cells, and some groupsof regulatory genes, whose expression is low and understringent control. The group of highly expressedmRNAs consists of mRNAs of highly abundantproteins such as actins, tubulins, ribosomal proteins,lfistones, hsp70, etc.

Low expression mRNAs include mRNAs oftranscription factors, protein kinases, growth factors,protooncogenes, etc. We have found several featuresthat are different for these two groups (Fig. 5): 5’UTRlength, nucleotide composition, context of start AUG

P(F), frequency

0,2

0,1

0,0HH

, Ii, )//II I l l

-1 6 -5 -2 -1

Expert Weights (0-10 are valid; 5 employed automatically)1. Translation increases with decreasing the Leader length2. TranslatTon increases with decreasqng the G/C ratio3. Translation increases with increasing the G/C-imbalance4. TranslalT"on increases with decreasing the alt-AUG content5. Translation increases with decreasing the f?amed A UG content6. Translation increases depending on the "-3 positT"on" rule7. Translation increases with decreasing the A UG inside leader8. Translation increases with increasing the [C] content9. Translation increases with increasing the [TM] content10. Translation increases with increasing the[CnY] content11. Comparison with the weight matrices for nucl. content in 5 ’UTRs

of high expression mRNAs.12. Comparison with complex (high to low) weight matrices for nucl.

content in 5 ’UTRs of high and low expression mRNAs

Figure S. List of 5’UTR mRNA characteristics important forpredicting the level ofgene expression. Parameters 8-12 weredetermined for the (-3 5 ;- 1 ) 5 ’UTR fragment.

codon, and presence of AUGs within 5’UTRs(Ischenko et al., 1996; Kochetov A.V. et al., 1998).These 5’UTR features affect the 40S ribosomal subunitmovement along the leader and, therefore, theefficiency of the translation initiation.

The difference in these features was used fordiscrimination between high and low expressed genesThe program calculates the 5’LrrR features andevaluates the translation activity of mRNA. We createa simple discrimination function based on Penrosedistance. The discrimination between the controlsamples of the high and low expressed mRNAs of dicotplants showed that 84% of the high and 76% of thelow expressed mRNAs were classified correctly (Fig.6).

SEQUENCE-BASED PREDICTION OFFUNCTIONAL SITE ACTIVITY

Initial postulates. It is suggested that the siteactivity F is determined by context-dependentproperties of its nucleotide sequence S: statistical,physical, and conformational (Ponomarenko et al.,1997a, Kolchanov et al., 1998). These properties are oftwo types (Kel, A.E. et al., 1993): (1) obligatory, whichare invariant for all sequences S, of the site and

F(seq), predicted activity

IHighIS]Lowo~ < 0.001

Figure 6. The control results were obtained using a set of independent data. Broken line is the selected threshold toseparate low and high expressed mRNA.

I00 ISMB-98

DATABASE

of DNA/RNAsite activity

t

DATabaSE 11of conformational, [[physical and chemica~

DNA properties [[

tKNOWLEDGE DISCOVERY

DATABASEof contextual

featuresDNA/RNA sites

KNOWLEDGE BASESignificant ?[sOgrams for

properties for ite activity

site activity~Iredicti°n

Search for DNA/RNA propertiesdetermining site activities

Generating computer programsfor site activity prediction

\.

Figure 7. Principal scheme of the ACTIVITY computer system.

determine its basal activity; and (2) facultative, whichare individual in terms of their "number, size, andlocation" for each sequence of the site and modulatethe site activity with respect to the basal level. Hence,within the framework of the linear-additiveapproximation, the activity of the site ,Mth sequence Smay be described by the following equation:

KF(Sn) = F0 (Sn) + Z Fk x Xk(Sn) ; (2)

k=l

where F0(S) is the basal activity level determinedby the occurrence of the obligatory properties of this

site in the sequence S.;MI P0000001MN ConformationalMD B-DNAML dinucleotide stepHN SCI00001RN RF000012RN RF000017PN TwistPM CalculatedPV TwistCalcPU DegreeDINUCLEOTIDEAA 38.90AT 33.81AG32.15AC 31.12TA 33.28TT 38.90TG 41.41TC 41.31GA41.31GT 31.12GG 34.96GC 38.50CA41.41CT 32.15CG 32.91CC 34.96Figure 8. Description oconformational propert:"Helical twist angle of BDNA" in the ACTIVIT5system database.

{Xk}k=lX are the facultativeproperties; and Fk is thecontribution of the facultativeproperty X4, to the site activityF. The principal scheme of theACTIVITY system is shownin Fig. 7.

Database on functionalsite aetivities compiles theavailable data on functionalsites with the experimentallymeasured activities: over 240site samples of different types,such as promoters and bindingsites for E. coli regulatory.proteins, TATA boxes andbinding sites for eukaryotictranscription factors,translation starts, splicing and3’-processing sites, etc. Siteactivity characteristics includethe association/dissociationrates of DNA-proteincomplexes, their lifetimes,equilibrium constants,

transcription and translation efficiencies, etc.Database on conformational and

physical~chemical DNA properties compiles theinformation on context-dependent properties whichmay play a significant role in DNA-proteininteractions. The format of this database is illustratedin Fig. 8. The database currently contains over 40conformational parameters, determined eithercomputationally or by X-ray analysis. Over 10 physicalDNA properties -- melting temperature, persistentlength, bending-rigidity, entropy, etc. -- are alsoincluded. The system for knowledge discovery on siteactivity contains two blocks. The first is responsible forrevealing any site properties significant for predictingthe site activity and the second provides generation ofC-code programs to predict the activity of a givensequence.

One example of the context characteristics of thesequence S is the positional weighted concentrationX...~S) of mono-, di-, tri-, and tetranucleotides(Ponomarenko et al., 1997’0):

L-m+lXz, n~w (S) = X w(i) x 6z(S ± Si+l... Si+m_l), (3)

"i=l

where ?>z is the ’T’ or "0" indicator functiondepending on the match or mismatch between thesequence S and the oligonucleotide Z; w(i) is thefunction of position effect determined according to therule: "the more important is the position i for the siteactivity, the higher its assigned weight w(i)". Theactivity prediction employs 180 various weightfunctions w(i). They are stored in the database forcontextual characteristics of the ACTIVITY system(Fig. 9). Three examples of weight functions w(i) shown in Fig. 10. The nucleotide composition of anyoligonucleotide is presented in the 15 single-letterbased code.

We also use the mean values of site S propertieswithin the region [a;b] as site characteristics:

Kolchanov 101

b-1Pq(Si Si + i)

Xq, a,b (S) -- ~=~ (4)b - a

Search for statistically significant characteristics isimplemented as analysis of conformational andphysical characteristics for all oligonucleotides withlengths from 1 to M, checking each of the 180 positioneffect functions w(i). The total number of combinations<Z, m, w> is about 107 for m=4. Similarly, for everyconformational or physical property q, the analysis ofall possible locations of the region [a, b] within the sitesequence is performed. Xq~b(S~) is calculated for fixed combination <q, a, b> for each sequence of thesite. The total number of the combinations <q, a, b> is105

The significance of a property for site activityprediction is estimated (Ponomarenko et al., 19970)within the frames of the Utility Theory for DecisionMaking (Fishburn et al., 1970). Let’s calculate thefixed property Xz~,,(S.) for each sequence Sn with theknown activity F.. If the resulting pairs {Xz.,w(S.), F.}meet all the necessary, conditions of the linearregression applicability, then the activity F ispredictable from an arbitrary sequence S via the featureXZm,,. TO test these conditions of linear regressionapplicability, a simple regression is optimized for thepair {Xz~,w(S.), F,}:

FZ, m, w (Sn) = f0 -5 fl X XZ, m, w (Sn)

where fo and f, are the regression optimizedcoefficients.

To ensure the reliability of the regression betweenXz~w(S.) and F., 22 conditions of regression analysisare tested (the presence of linear, sign, and rankcorrelations between the predicted Fz~,(S.) and theexperimental F. activities; the equality of Xz~I(S.) andF. distributions, etc.).The significance level ct, atwhich the rth condition is met is estimated. Then, thepartial utility of the feature Xz..w in predicting theactivity F is calculated as follows:

2 1122 22 u~zt (Xz,~,,.

U(Xz,~,~, r) = t=~ ~=~ . (5)22

u. in the Utility Theory for Decision Making isdetermined as:

v~I~, if ~ < 0.0~

(6)tlct(Xz’mw’ 13 - 28.3 × cc~e + 55.6 × ct~e , if O.Ol_<O~t _< 0.i;

[ - i, if O~e > OJ.

Only the properties with u(x~.,,,D>o are selected for theactivity prediction and used to choose a limited set oflinearly independent properties

102 ISMB-98

Knowledgebase on W(ilfunctional 1 } j~site activity.All selected o,$charac-teristics are ostored in the 1 t-m*1

knowledge a) i, position

base of the 1 w(i)ACTIVITYsystem. The 0,5format of theconformation oal property 1 L -m + 1description is b)

i, position

illustrated in W (i)Fig. 9. Using

1 Tf

these

L/properties, 0,5

we generate0the program 1 k-m + 1

to predict the c)i, p o s itio n

site activityfrom its Figure 10. Examples of weightnucleotide functions w(i).sequencethrough optimization of equation (2). C-code programsfor such prediction are also stored in the knowledgebase (Ponomarenko et at., 19970).

Analysis of several site samples has demonstratedapplicability of this simple approach. For all these

MI K0000039CF SEQUENCE-DEPENDENT CONFORMATIONAL FEATURECT PROPERTY AVERAGED FOR REGION [A;B]DP P0000001PV TwistAt3 I0 18UT 0.234LC -0.859FG http://wwwmgs.bionetnsc.ru/.../USFTWIST.hlmC-CODE/*USFflDNA-binding increases with Twist decrease*/double TwistCalc_for SynthUSFbind (char *s){double X; char *seq; inl i,k, SiteLength=9;double DinucPar[16]={ 38.90 .......... 34.96 };seq:&s[0];if(strlen(seq) < SiteLength+ 1 )return(- 100 for (i=0, X=0.;i < SiteLength-1;i++) switch (seq[i ]) { case ’A’: = 0; break;

switch (seq[i+ 1 ]) { case ’A’: k+=0; break;

if(k > 15) return(-1004.); X+=DinucPar[k];return (X/(doubleX Site Length- 1));}

Figure 9. Description of a conformational property Helical twist forUSF-binding site in the ACTIVITY.

oo

-OO~o=-0’86

° oI I °° ~’[

1 2[VUK[’Q

34 35 36 3"7

Twist, Degreesa) b) c)

1 o ~ o.o

¯ _ o ,.0-1 E -1.5

-2.0c~ -2 -2.5

-5 [/o I ~ -3.o-3.5

o o2,5 0525.5

"? 0.0

.~ 24.5 1 ~ -0.523.5

o

,o/ ~ -l.o o o ,=o~9o-1.5 I

13.3 13.6 13.9 -1.5 -0.5 0.5Width, Angstrom tr. activity, predicted

d)Figure 11. (a) Dependence of the mature DNA yield in the 3’-processing of SV40 virus pre-mRNA on the weighted concentration of VUKK tetranucleotid,downstream of the mRNA cleavage point; (b) the dependence of the USF/DNA affinity on the helical twist angle of B-DNA; (c) the dependence of Crcrepressor/DNA affinity on the major groove width of B-DNA; (d) correlation of the experimenlal and calculated transcription activity for TATA/PE

containing promoter region of mouse etA-crystalline gene.

samples, the significant features have been identifiedand the linear-additive approximation for predictingthe site activity has been derived. For example, theweighted concentration of the tetranucleotide VUKKdownstream of the SV40 pre-mRNA cleavage pointwas found to be responsible for the 3’-processingefficiency (Fig. l la), The USF/DNA factor affinitycorrelates very well with the helical twist of B-DNA(Fig. 1 lb). The major groove width determines the Crorepressor/DNA affinity- (Fig. llc). The identifiedcharacteristics provide a first approximation inpredicting the value of the specific activity offunctional sites, These properties were used to generatethe method for predicting the transcription activity- ofmouse aA-crystalline gene by analyzing its promotersequence; the prediction shows a good agreement withthe experimental data (Fig. 11 d).

Reliability of the functional site activity valuespredicted by ACTIVITY from their sequences wasstudied by the authors earlier (Ponomarenko et al.,1997b) as well as ACTIVITY was compared withweight matrices (Stormo et al., 1986) and neuralnetworks (Jonson et al., 1993).

CONCLUSION

The first step in interpreting a human genomesequence involves finding and annotation of the allgenes it contains. The second step consists incharacterizing the biological function of the individualgenes, the way they are controlled, and their possibleinvolvement in human disease. Significant success hasbeen made in predicting and annotating protein codingregions (exons), although the exons account for only few percents of the genomic sequence. A considerablepart of the genome is occupied by regulatorysequences, which specify the tissue, developmentalstage, or biochemical context of gene expression.Recognition, interpretation, and annotation of genomeregulatory sequences should be one of the major tasksin the future progress of Human Genome Project. We

designed the GeneExpress computer system as a firstattempt to integrate the variety- of information ongenomic regulatory sequences and to use thisinformation in developing and running software fortheir analysis and recognition. GeneExpress integratesTRRB and GeneNet databases and providesreferences to external databases, such as TRANSFAC,COMPEL, and EMBL using the SRS query system.

It is essential that the GeneExpress can and haveto progress and expand continuously to update andintegrate new resources for investigating othermolecular events of the gene expression, such assplicing, DNA/protein interactions, etc. In the nearestfuture, a number of new basic modules will be added tothe system including programs for recognition ofeukaryotic promoters (Solovyev & Salamov, 1997) andcomposite regulatory elements (Kel, O.V. et al., 1995).A great number of software and information resourceson various aspects of gene expression regulation,developed by the bioinformatics community, arecurrently existing. However, representation diversity, ofthe data and the results of the data processing hindersthe access to these resources. This diversity- is and willbe the natural trait of bioinformatics, inherent for itsdevelopment. Hence, the problem is not to developuniform data formats but to succeed in integration ofthe already available software and informationresources in the formats developed by their authors tomake these resources maximally convenient forexperimenters. The advent of the SRS (Etzold andArgos, 1993) opens a way to solve this problem. Inaddition, the users should be provided with thepossibility- to arrange complex scenarios of step-by-steprunning programs in the course of data analysis usingthe integrated WWW resources. This may be realized,for example, by creating a virtual knowledge base, sothat the user can accumulate the results of analysis,visualize them, and compare to both one another andthe information available in the integrated databases.In the project AUTOGENE (Ptitsyn et al, 1996), the

Kolchanov 103

authors have already tried to integrate the analyzingprograms into flexible scenarios with input/outputtransfer of the results using the virtual knowledge baseand demonstrated that the approach is promising.

Acknowledgments

This work was supported partially by grants ofRussian National Human Genome Project, RussianMinistry of Science and Technical Politics, SiberianDepartment of Russian Academy of Sciences, andRussian Foundation for Basic Research (97-04-49740-a, 97-07-90309-a and 96-04-50006).

References:Ananko, E.A_, Bazhan, S.I., Belova, O.E., and Kel, A.E. (1997)

Mechanisms of transcription of the interferon-induced genes: adescription in the IIG-TRRD information system. Mol. Biol (Msk) 31,701-713.

Chen, Q, Hertz, G., and Stormo G. (1995) Matrix search 1.o: computer program that scans DNA sequences for transcriptionalelements using a database of weight matrices. CABIOS 1 I, 563-566.

Etzold, T. and Argos, P. (1993) SRS--an indexing and retrieval tool forflat file data libraries. CABIOS. 9, 49-57

Fickett, J.W. and Hatzigeorgiou, A_G. (1997) Eukaryotic promoterrecognition. Genome Res. 7, 861-878.

Fishburn, P.C. (1970) Utility Theory for Decision Making, New York:John Wiley & Sons.

Ischenko, I.V., Kochetov, A.V., Kel, ALE., Kisselev, L.L., and KolchanovN.A. (1996) Comparative analysis of the local secondary structure mRNAs encoded by high- and low-expression eukaryotic genes In:"Proceedings of the German Conference on Bioinformatics (GCB’96).Leipzig Univ.", 124-129.

Jonson, J., Norberg, T., Carlsson, L., Gustafsson, C., and Wold, S.(1993) Quantitative sequence-activity models (QSAM) - tools sequence design. Nucl. Acids Res., 21,733-739.

Kel, A.E. Kondrakhin, Y.V., Kolpakov, Ph.A., Kel, O.V., Romachenko,A,G., Wingender, E., Milanesi, L., and Kolchanov, N.A. (1995)Computer tool FUNSITE for analysis of eukaryotic regulatorygenomic sequences. Proceedings of the third international conferenceon intelligent systems for molecular biology. AA,M Press. California.197- 205.

Kel, A.E., Ponomarenko, M.P., Likhachev, E.A_, Orlov, Y.L., Ischenko,[.V., Milanesi, L., and Kolchanov, N.A. (1993) SITEVIDEO: computer system for functional site analysis and recognition.Investigation of the human splice sites. CABIOS. 9, 617-627.

Kel, OV., Romashchenko, A.G., Kel, ALE., NaumochkJn, A_N., andKolchanov, N.A. (1995a) Proceedings of the th Annual HawaiiInternational Conference on System Sciences [HICSS]. 5.Biotechnology Computing, IEE Computer Society Press: Los Alamos,Calitbmia. 42 -51

Kel, O.V., Romashchenko, A_G., Kel, A.E., Wingender, and E.,Kolchanov, N.A., (1995b) A compilation of composite regulatoryelements affecting gene transcription in vertebrates. Nucl. Acids Res.23, 4097-4103.

Kel’, ALE., Kolchanov, N.A,, Kel’, O.V., Romashchenko, A.G.,Anan’ko, E.A_, Ignati’eva, E.V., Merkulova, T.I., Podkolodnaya,O.A., Stepanenko, I.L., Kochetov, A.V, Kolpakov, F.A., Podkolodny,N.L., and Naumochkin A.N. (1997) TRRD: database on transcriptionregulatory regions of eukaryotic genes. Mol. Biol. (Msk)31, 521-530.

Kolchanov, N.A. (1997) Regulation of eukaryotic gene transcription:databases and computer analysis. Mol. Biol. (Msk), 31, 481-482.

Kolchanov, N.A., Ponomarenko, M.P., Ponomarenko, Yu.V.,Podkolodny, N.L., and Frolov, A.S. (1998). Functional sites in theprokaryotic and eukaryotic genomes: Computer models and activitypredictions. Mol. Biol. (Msk) (in press).

Kolpakov, F.A., Ananko, E.A_, Kolesov, G.B., and Kolchanov N.A_1998. GeneNet: a database for gene networks and its automatedvisualization through the Imemet. Bioinformatics. (in press.)

Kondrakhirg Yu.V., Babenko, V.N., Milanesi, L., Lavryushev, C.V., andKolchanov, N.A. (1998) Recognition groups: a new method fordescription and prediction and of transcription factor binding sites.CABIOS (in press).

Lawrence, C., Altschul, S., Boguski, M., Liu, I., Neuwald, A,, andWootton, J. (1993) Detecting subtle sequence signals: a Gibs samplingstrategy for multiple alignment. Science. 262, 208-214.

Peter, R.C., Juner, T., and Bucher, P. (1998) The eukaryotic promoterdatabase EPD. Nucl. Acids Res. 26, 353-357.

Podkolodnaya, O.A, and Stepanenko, I.L. (1997) Mechanisms transcription regulation of the erythroid-specific genes. Mol. Biol.(Msk), 31,540-547.

Ponomarenko, M.P., Kel, A.E., Orlov, Y.L., Benjuklx D.N., Ischenko,I.V., Bokhonov, V.B., Likhachev, E.A., and Kolchanov, N.A. (1994)Computer Analysis of Genetic Macromolecules: Structure, Functionand Evolution (Kolchanov, N. and Lim, H., Eds.) Singapore: WorldSci. 35-65

Ponomarenko, M.P., Kolchanova, A.N., and Kolchanov, N.A. (199713)Generating programs for predicting the activity of functional sites. J.Comput. Biol. 4, 83-90

Ponomarenko, M.P., Ponomarenko, Yu.V., Kel’, A-E., Kolchanov, N.A.,Karas, H., Wingender, E., and Sklenar, H. (1997c) Computer analysisof the DNA conformation features of the eukaryotic promoter TATAbox. Mol. Biol. (Msk), 31,733-740.

Ponomarenko, M.P., Savinkova, L.K., Ponomarenko, Yu.V., Kel’, A_E.,Titov, I.L, and Kolchanov, N.A. (1997a) Modeling TATA-boxsequences in eukaryotic genes. Mol. Biol. (Msk), 31,726-732.

Pedersen, A,, Baldi, P., Brunak, S., and Chauvin, Y. (1996)Characterization of eukaryotic and prokaryotic promoters using HidenMarkov models. Intel. Sys. Mol. Biol., 4, 182-191.

Ptitsyrg A.A., Rogozin, I.B., Grigorovich, D.A, Strelets, V.B., Kel’,A.E., Milanezi, L., and Kolchanov, N.A. (1996) Computer system"AUTOGENE" for automatic analysis of nucleotide sequences. MolBiol. (Msk), 30, 432-441.

Quandt, K., Frech, K., Karas, H., Wingender, E., and Wemer, T. (1995)Matlnd and MatInspector: new fast and versatile tools for detection ofconsensus matches in nucleotide sequence data. Nucl. Acids. Res. 23,4878-4884.

Salas, F., Haas, J., Brunk, B., Stoeckert Jr, C.J., and Overton, G.C.(1998) EpoDB: a database of genes expressed during vertebrateerythropoiesis. Nucleic Acids Res. 26, 290-292

Salamov, A_A. and Solovyev, V.V. (1997) Recognition of 3’-endcleavage and polyadenylation region of human rnRNA precursors.CABIOS 13, 23-28.

Solovyev, V.V. and Salamov, A.A. (1997) The Gene-Finder computertools for analysis of human and model organisms genome sequences.In Proceedings of the Fifth International Conference on IntelligentSystems for Molecular Biology (eds. Rawling C., Clark D., AirmanR., Hunter L., Lengauer T., Wodak S.), Halkidiki, Greece, AAAIPress, 294-302

Stormo, G.D., Schneider, T.D. and Gold L. (1986) Quantitative analysisof the relationship between nucleotide sequence and functionalactivity. Nucl. Acids Res., 14, 6661-6679.

Ulyanov, A. and Stormo, G. (1995) Multi-alphabet consensus algorithmfor identification of low specificity protein-DNA interactions. Nucl.Acids Res. 23, 1434-1440.

Waterman, M., Arriata, R., and Galas, D. (1984) Pattern recognition several sequences. Bull Math Biol., 46, 515-527.

Wingender, E., Dietze, P., Karas, H., and Kneuppel, R. (1996)TRANSFAC: a database on transcription factors and their DNAbinding sites. Nucl. Acids Res. 24. P. 238-241.

Wingender, E., Kel, A.E., Kel, O.V., Karas, H., Heinemeyer, T., Dietze,P., Knuppel, R., Romaschenko, A.G., and Kolchanov, N.A_ (1997)TRANSFAC, TRRD and COMPEL: towards a federated databasesystem on transcriptional regulation. Nucleic Acids Res. 25, 265-268

104 ISMB-98

genexpress: a computer system for description, analysis and recognition of regulatory ... · 2006....

Documents