Long-range correlations in genomic sequences:
relation to the structure and dynamics of nucleosomes
Multi-scale analysis of genomes
Study of long-range correlations in genomes
Study of the « global properties » of genomic sequences (Voss, 92; Peng et al., 92):
• the nature of a nucleotide depends on that of the others at large distances (up to kb)
• long-range correlations observed in introns (non-coding) but not in exons (coding)
• methodological controversy (compositional heterogeneity of genomes)
Proposed biological mechanisms:
• genome dynamics:
• replication-mutation (Li, 93);
• insertion-deletion (Buldyrev et al., 93);
• tandem repeats (Dokholyan et al., 97; Li et al., 98)
Construction of a random sequence
How the new nt is chosen:
- without correlations: …ACAGTACT G, the new nt does not depend on other nucleotides
- with short-range correlations: …CGATTAAC A, the new nt depends on a few neighbouring nucleotides (Markov chains)
- with long-range correlations: …TCCGACGG A, the new nt depends on all nucleotides over large distances, with a power-law correlation function ($1/d^{\gamma}$)
In genomic sequences these correlation properties can extend over tens to thousands of bp
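The first two constructions above can be sketched in a few lines. This is only an illustration: the function names, the `stay` parameter and the uniform base composition are assumptions made for the example, not values taken from the slides.

```python
import random

NUCLEOTIDES = "ACGT"

def uncorrelated_sequence(n, rng):
    """Each new nt is drawn independently of all previous ones."""
    return "".join(rng.choice(NUCLEOTIDES) for _ in range(n))

def markov_sequence(n, stay, rng):
    """First-order Markov chain (hypothetical rule): the new nt repeats the
    previous one with probability `stay`, otherwise it is drawn uniformly
    from the three other nucleotides."""
    seq = [rng.choice(NUCLEOTIDES)]
    for _ in range(n - 1):
        prev = seq[-1]
        if rng.random() < stay:
            seq.append(prev)
        else:
            seq.append(rng.choice([c for c in NUCLEOTIDES if c != prev]))
    return "".join(seq)

def repeat_fraction(seq):
    """Fraction of positions whose nt equals the previous nt."""
    same = sum(1 for a, b in zip(seq, seq[1:]) if a == b)
    return same / (len(seq) - 1)

rng = random.Random(0)
iid = uncorrelated_sequence(20000, rng)
markov = markov_sequence(20000, 0.7, rng)
```

Comparing `repeat_fraction(iid)` with `repeat_fraction(markov)` shows the dependence on the immediate neighbour that the Markov rule introduces and that the uncorrelated construction lacks.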
Long-range correlations are scale-invariant
What are Long-Range Correlations in DNA sequences ?
Nucleotide compositions are correlated to each other in the same manner whatever the scale
Scale-invariant processes in genomic sequences
[Figure: DNA viewed at three scales: L (10 nt), 10 L (100 nt), 100 L (1000 nt)]
Scale invariant processes
- « zooming » on the sequence does not change the shape of the correlation function
- the correlation function is a power law: $C(L) \sim L^{-\gamma}$, so that $C(kL) \sim k^{-\gamma} L^{-\gamma}$
- no characteristic scale
Long-range correlations
Invariance by dilation - Fractal structure
These particular correlation properties differ from repeated motifs or periodic patterns
Processes described by Markov chains
- a nucleotide depends only on the adjacent nucleotides
- the correlation function is: $C(L) \sim \exp(-L/L_0)$
- characteristic length $L_0$: at distances larger than $L_0$, the nucleotides have no influence
Short-range correlations
Invariance by translation
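The contrast between the two correlation functions can be checked numerically. In the sketch below the exponent `gamma`, the length `L0` and the dilation factor are arbitrary illustrative values, not values from the slides. Under a dilation by k, the ratio C(kL)/C(L) is the same at every L for the power law, but depends on L for the exponential.

```python
import math

def power_law(L, gamma=0.4):
    """Scale-invariant correlation function C(L) ~ L^(-gamma)."""
    return L ** (-gamma)

def exponential(L, L0=10.0):
    """Markov-type correlation function C(L) ~ exp(-L / L0)."""
    return math.exp(-L / L0)

def dilation_ratios(corr, k, lengths):
    """C(k L) / C(L) evaluated at each L."""
    return [corr(k * L) / corr(L) for L in lengths]

lengths = [1, 10, 100]
power_law_ratios = dilation_ratios(power_law, 10, lengths)
exponential_ratios = dilation_ratios(exponential, 10, lengths)
```

`power_law_ratios` holds the same value three times (k to the power -gamma), whereas `exponential_ratios` collapses quickly once L exceeds `L0`.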
What biological mechanisms ?
Long-range correlations :
- observed in introns (non-coding regions)
- absent in exons (coding regions)
Long-range correlations can be generated by genome dynamics :
- expansion-modification systems (duplication-mutation systems, Li, 1991)
- oligonucleotide repeats (Li and Kaneko, 1992)
- insertion-deletion of pseudogenes (Buldyrev et al., 1993)
- tandem repeats (Dokholyan et al., 97; Li et al., 98)
[Figure: duplications along a gene made of exons and introns]
FIRST HYPOTHESIS: long-range correlations are a consequence of genome dynamics
Processes without a characteristic length scale
If the duplication rate is high (e.g. 0.9) and the mutation rate is small (0.1), then as the sequence grows longer and longer it exhibits long-range correlations (1/f power spectra).
[Figure: duplication-mutation of a DNA fragment over time, with mutations marked]
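A duplication-mutation process of this kind can be sketched as follows, in the spirit of the expansion-modification systems cited above. The exact update rule (each symbol is either duplicated or flipped at every generation, on a two-letter alphabet) is an assumption made for the illustration, and the function names are hypothetical; only the rates 0.9 and 0.1 come from the slide.

```python
import random

def duplication_mutation_step(seq, p_dup, rng):
    """One generation: each symbol is duplicated with probability p_dup,
    otherwise it is mutated to the other symbol (probability 1 - p_dup)."""
    out = []
    for s in seq:
        if rng.random() < p_dup:
            out.extend((s, s))   # duplication: s -> s s
        else:
            out.append(1 - s)    # mutation: 0 <-> 1
    return out

def grow(p_dup, min_length, rng):
    """Iterate from a single random symbol until the sequence is long enough."""
    seq = [rng.randint(0, 1)]
    while len(seq) < min_length:
        seq = duplication_mutation_step(seq, p_dup, rng)
    return seq

# Duplication rate 0.9, mutation rate 0.1, as on the slide
sequence = grow(0.9, 10000, random.Random(1))
```

Because no length is built into the rule, blocks created by early duplications are stretched at every generation, which is how such a process can produce correlations without a characteristic scale.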
Quantification of LRC
Sequence length = 260 000 nt (50 % purines)
[Figure: purine content $C_{Pu}$ (Pu/Pyr coding) of an uncorrelated sequence and of a correlated sequence, in windows of w = 1 bp, w1 = 32 bp and w2 = 512 bp]
The fluctuations $\sigma$ of $C_{Pu}$ at two window sizes are related by
$\sigma_1 / \sigma_2 = (w_1 / w_2)^{H-1}$
- uncorrelated sequence: $(32/512)^{0.5-1} = 4$
- correlated sequence: $(32/512)^{0.9-1} = 1.3$
$\log \sigma = (H-1) \log w + \mathrm{Cte}$
roughness exponent H
H = 0.5: no LRC
H > 0.5: LRC
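The scaling relation above can be tried on a synthetic uncorrelated sequence. The sketch below assumes that the fluctuation is measured as the standard deviation of the window-averaged Pu/Pyr signal and that H is read from a least-squares fit of log sigma against log w; the function names and the choice of windows are illustrative.

```python
import math
import random

def window_means(signal, w):
    """Average of the signal in consecutive non-overlapping windows of size w."""
    n = len(signal) // w
    return [sum(signal[i * w:(i + 1) * w]) / w for i in range(n)]

def std(values):
    m = sum(values) / len(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))

def predicted_ratio(w1, w2, H):
    """sigma(w1) / sigma(w2) = (w1 / w2)^(H - 1)."""
    return (w1 / w2) ** (H - 1)

def roughness_exponent(signal, windows):
    """Slope of log sigma(w) versus log w, plus one."""
    xs = [math.log(w) for w in windows]
    ys = [math.log(std(window_means(signal, w))) for w in windows]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope + 1.0

rng = random.Random(0)
# Pu/Pyr coding of an uncorrelated sequence with 50 % purines: +1 or -1
signal = [rng.choice((1, -1)) for _ in range(260000)]
H = roughness_exponent(signal, [32, 64, 128, 256, 512])
```

For this uncorrelated signal the estimate should land close to H = 0.5, and `predicted_ratio(32, 512, 0.5)` reproduces the factor 4 quoted on the slide.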
Properties of LRC
Sequence length = 260 000 nt (50 % purines)
[Figure: uncorrelated sequence and correlated sequence]
H > 0.5: LRC, persistence (small “roughness”)
H = 0.5: no LRC
[Figure: log-log plot of the fluctuations versus $\log_2 w$, for H = 0.8 and H = 0.5]
1 - Straight line: scale invariance properties
2 - The slope gives the roughness exponent H
A unique way to display results
[Figure: the same plot after subtracting $0.6 \log_2 w$, for H = 0.8 and H = 0.5]
$\log \sigma = (H-1) \log w + \mathrm{Cte}$
$\sigma_1 / \sigma_2 = (w_1 / w_2)^{H-1}$
A WAY TO MEASURE: THE WAVELET TRANSFORM
A. Grossmann & J. Morlet 1984
Computation of the wavelet coefficients
$T_\psi[f](x_0, w) = \frac{1}{w} \int_{-\infty}^{+\infty} f(x)\, \psi\!\left(\frac{x - x_0}{w}\right) dx$
$x_0$: space parameter
$w$: scale parameter
Advantage: elimination of composition biases
The wavelet transform eliminates the composition biases
[Figure: analysing wavelets g(1) and g(2)]
[Figure: DNA -> coding -> signal -> WT -> H; the signal and its wavelet transform at w = 8 bp and w = 128 bp (wt small, wt large)]
[Figure: log-log plot versus $\log_2 w$ after subtracting $0.6 \log_2 w$, with H = 0.8 and H = 0.5]
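A discretised version of the wavelet coefficient can illustrate why a constant composition offset drops out. The sketch assumes that g(1) denotes the first derivative of a Gaussian and replaces the integral by a sum over integer positions; the truncation parameter `support` and the function names are choices made for the example.

```python
import math

def g1(u):
    """First derivative of the Gaussian exp(-u^2 / 2) (assumed form of g(1))."""
    return -u * math.exp(-0.5 * u * u)

def wavelet_coefficient(signal, x0, w, psi=g1, support=5.0):
    """Discrete approximation of (1/w) * sum_x f(x) psi((x - x0) / w).

    The sum is truncated to |x - x0| <= support * w. Away from the edges of
    the signal the window is symmetric, so the zero-mean wavelet cancels any
    constant offset of the signal."""
    half = int(support * w)
    lo = max(0, x0 - half)
    hi = min(len(signal) - 1, x0 + half)
    total = 0.0
    for x in range(lo, hi + 1):
        total += signal[x] * psi((x - x0) / w)
    return total / w

# A step in composition at position 500, analysed at scale w = 8
step = [0.0] * 500 + [1.0] * 501
at_step = wavelet_coefficient(step, 500, 8)
# Same step on top of a constant bias of 3
biased = wavelet_coefficient([v + 3.0 for v in step], 500, 8)
```

`at_step` and `biased` agree up to rounding, while a flat signal gives a coefficient of zero: the coefficient responds to the variation of the signal at scale w, not to its mean level.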
Quantification of LRC
Presence of LRC in exonic sequences (human)
[Figure: curves for introns, all exons and high-GC exons]
Presence of LRC in exonic sequences
[Figure: S. cerevisiae; regime I, H = 0.6; regime II, H = 0.8]
Two regimes of LRC
- E. coli: regime I, H = 0.5; regime II, H = 0.8
- Human: regime I, H = 0.6; regime II, H = 0.8
nucleosomes ?
STRUCTURAL HYPOTHESIS :
the LRC are associated with the bending of DNA in nucleosomes
Long-range correlations between DNA bending sites ?
Presence of LRC in exonic sequences
necessity of a new hypothesis
Test
Existence of long-range correlations between di- and tri-nucleotides associated with DNA
bending in nucleosomes ?
- nucleosomal DNA bending table (Pnuc) -> LRC ?
(Andrew & Travers, 1986)
Control:
- DNase bending table (DNase) -> no LRC ?
(Satchwell et al., 1995)
- eubacteria (no nucleosomes) -> no LRC ?
Nucleosome-based bending table (Pnuc)
- chromatin fiber
- nuclease digestion of linker DNA
- released nucleosomes
- dissociation of histones
- 146-nucleotide DNA fragments
- cloning and sequencing of nucleosomal DNA
- sequence analysis of aligned nucleosomal fragments (Fourier transform)
- Pnuc table
DNase I bending table (DNase)
- DNase I induces bending
- DNase activity is favoured by DNA flexibility
- digestion of known DNA fragments by DNase I
- measurement of cutting efficiency along the DNA molecule
- sequence analysis of the cutting profile
- DNase table
Analysis of nucleosomal DNA
[Figure: nucleosome structure (Luger et al., Nature, 1997); A-tracts preferred where the minor groove faces inside]
[Figure: AAA frequency versus position (0 to 60) along nucleosomal DNA]
Fourier analysis
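The kind of analysis named here, a positional frequency profile followed by a Fourier transform, might look like the sketch below. It is not the procedure used to build the Pnuc table: the function names are hypothetical, the transform is a plain discrete Fourier transform, and the demo profile is synthetic, with an arbitrarily chosen period of 10 positions.

```python
import cmath
import math

def positional_frequency(fragments, motif):
    """Fraction of aligned fragments in which `motif` starts at each position."""
    length = min(len(f) for f in fragments)
    n_pos = length - len(motif) + 1
    return [sum(f.startswith(motif, i) for f in fragments) / len(fragments)
            for i in range(n_pos)]

def power_spectrum(profile):
    """Squared modulus of the DFT of the mean-subtracted profile.
    Index k corresponds to the frequency k / len(profile)."""
    n = len(profile)
    mean = sum(profile) / n
    centred = [v - mean for v in profile]
    return [abs(sum(v * cmath.exp(-2j * cmath.pi * k * x / n)
                    for x, v in enumerate(centred))) ** 2
            for k in range(n // 2 + 1)]

def dominant_period(profile):
    """Period (in positions) of the strongest non-zero frequency."""
    spectrum = power_spectrum(profile)
    k = max(range(1, len(spectrum)), key=spectrum.__getitem__)
    return len(profile) / k

# Synthetic profile with an arbitrary period of 10 positions (illustration only)
demo_profile = [0.5 + 0.4 * math.cos(2 * math.pi * x / 10) for x in range(100)]
demo_period = dominant_period(demo_profile)
```

On real data, `positional_frequency` would be applied to the aligned nucleosomal fragments with a motif such as AAA, and the resulting profile passed to `dominant_period`.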
Different ways of coding sequences
DNA sequence -> coding -> signal -> treatment -> H
- Mononucleotide coding (text profile): A T G A T C -> +1 -1 -1 +1 -1 -1
- Pnuc coding (nucleosomal profile): A T G A T C -> 6.7 5.4
- DNase coding (flexibility profile): A T G A T C -> 8.7 10
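The codings on this slide can be sketched as simple lookups. The mononucleotide rule (+1 for A, -1 otherwise) is inferred from the example A T G A T C. The two small tables contain only the numbers printed on the slide, under the assumption that they belong to the first two overlapping trinucleotides (ATG and TGA); every other trinucleotide is left undefined rather than guessed.

```python
def mononucleotide_coding(seq, base="A"):
    """+1 where the nucleotide is `base`, -1 elsewhere."""
    return [1 if nt == base else -1 for nt in seq]

# Only the values visible on the slide, assumed to belong to ATG and TGA
PNUC_SLIDE = {"ATG": 6.7, "TGA": 5.4}
DNASE_SLIDE = {"ATG": 8.7, "TGA": 10.0}

def trinucleotide_coding(seq, table):
    """Table value of each overlapping trinucleotide along the sequence.
    Trinucleotides missing from the table are coded as None."""
    return [table.get(seq[i:i + 3]) for i in range(len(seq) - 2)]

text_profile = mononucleotide_coding("ATGATC")
nucleosomal_profile = trinucleotide_coding("ATGATC", PNUC_SLIDE)
flexibility_profile = trinucleotide_coding("ATGATC", DNASE_SLIDE)
```

With a complete 64-entry table in place of the two-entry stand-ins, the same function would turn any sequence into the numerical signal that is then analysed to obtain H.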
[Figure: Pnuc and DNase codings; regime I, H = 0.5; regime II, H = 0.8]
H (Pnuc) > H (DNase)
[Figure: Human, DNase coding; regime I, H = 0.6; regime II, H = 0.8]
[Figure: Human (chr 21) with random table, DNase and Pnuc codings]
EUKARYOTES: Human, C. elegans, D. melanogaster, A. thaliana
EUBACTERIA: B. subtilis, M. pneumoniae, H. influenzae, Synechocystis
DNA viruses
- Bacteriophages: T4, Lambda, SPBc2
- Animal viruses: Adenovirus, Herpesvirus, M. sanguinipes (Pox)
RNA viruses
- SS RNA (-), SS RNA (+), ds RNA
- Retroviruses: Spumavirus, MMTV, HIV (1,2)
New test of the structural hypothesis
- A's present LRC
- A tracts induce DNA curvature
- are these LRC specific to A tracts ?
A: LRC +
Test: A tracts (curvature) -> LRC ?
Control: isolated A -> LRC ?
Human (chr 21)
[Figure: codings A, Aiso, AA, Pnuc and DNase]
LRC are associated with A tracts, not with isolated A
Structural hypothesis: LRC are associated with DNA curvature
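The test and its control amount to splitting the A coding into two signals. The sketch below assumes that an A counts as part of a tract when at least one of its immediate neighbours is also an A, and as isolated otherwise; the slides do not give the exact definition, so this threshold and the function name are illustrative.

```python
def a_codings(seq):
    """Return (isolated, tract): two +1/-1 signals of the same length as seq.

    isolated[i] is +1 when seq[i] is an A with no adjacent A (Aiso coding).
    tract[i] is +1 when seq[i] is an A with at least one adjacent A (AA coding).
    """
    n = len(seq)
    isolated, tract = [], []
    for i, nt in enumerate(seq):
        is_a = nt == "A"
        has_a_neighbour = ((i > 0 and seq[i - 1] == "A")
                           or (i + 1 < n and seq[i + 1] == "A"))
        isolated.append(1 if is_a and not has_a_neighbour else -1)
        tract.append(1 if is_a and has_a_neighbour else -1)
    return isolated, tract

iso_signal, tract_signal = a_codings("ATAAG")
```

Each signal can then be analysed separately, so that the roughness exponent obtained from the tract coding can be compared with the one obtained from the isolated-A control.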
Question
- to what extent does the DNA sequence contribute to its own packaging into nucleosomes ?
Contradictory answers
- Nucleosomal DNA is « periodic »
(Drew & Travers, 1985, JMB; Bina, 1994, JMB)
- Affinity of Eukaryotic DNA for histone octamer
(Lowary & Widom, 1997, JMB) :
5 % of genomic sequences strong affinity
95 % of bulk genomic DNA ~ random DNA
Two types of nucleosomes:
I - strongly bound: periodic distribution of bending sites
5 % genomic DNA
II - weakly bound: same bending sites, « apparently random »
95 % genomic DNA
Model
For most nucleosomes (weakly bound) the bending sites are distributed with long-range correlations.
The persistent nature of the distribution of bending sites favours the dynamics of nucleosome formation
and diffusion: displacement requires less energy, as in super-diffusive processes.
This organisation of genome sequences favors dynamical processes.
Periodic: H not defined
Long-range correlations: persistence, H > 0.5
[Figure: weakly bound nucleosomes along DNA]
[Figure: Human globin locus (70 kb); globin genes; position in bp]
Presence of LRC in organelles
Few bacteria present LRC in the 0 - 200 nt range
Hypothesis: DNA packaging in the 0 - 200 nt range specific to these bacteria ?
Presence of LRC (in the 0 - 200 nt range) in archaebacteria
[Figure: Archaeoglobus fulgidus; curve labelled G]
The Pnuc coding does not best « extract » LRC in archaebacteria
[Figure: Archaeoglobus fulgidus, Aeropyrum pernix (56.3% GC), Sulfolobus solfataricus (35.8% GC); curve labelled G]
Conclusion
Long-range correlations between DNA bending sites, in the 10-200 nt range, are a signature of
nucleosomes.
Model
The persistent nature of the distribution of bending sites favours the dynamics of chromatin
Perspectives
Find the DNA structural codings (related to DNA packaging?) that better “extract” the LRC in
genomic sequences
Samuel Nicolay
Cédric Vaillant
Alain Arnéodo
ENS-Lyon
Benjamin Audit
EMBL-EBI, Cambridge
Marie Touchon
Yves d'Aubenton-Carafa
C. Thermes
CGM, Gif sur Yvette