Grappling with statistics
This is an attempt to work through Chapter 4 of the book "Stat Labs: Mathematical Statistics Through Applications" by Nolan and Speed.
The example uses palindrome patterns in DNA to introduce statistical concepts.
It is possible to download the file from here. If you prefer to work within R, then the data can be read directly using this code.
----
Location of palindromes
I reproduced Figure 4.1 of the book. You could do it too!
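For anyone who wants to try, here is a minimal sketch of one way to draw palindrome locations as tick marks on stacked rows. It is not necessarily the layout of Figure 4.1. The positions are a simulated stand-in (`posvec_sim`), not the real CMV data, and the row length of 40000 bp is an arbitrary choice of mine.

```r
set.seed(1)
dnalen <- 229354                           # sequence length quoted from the book
posvec_sim <- sort(sample(dnalen, 296))    # simulated stand-in, NOT the real data

row_len <- 40000                               # bases shown per row (arbitrary)
row_id  <- (posvec_sim - 1) %/% row_len + 1    # row that each site falls in
offset  <- (posvec_sim - 1) %% row_len + 1     # position of the site within its row
n_rows  <- ceiling(dnalen / row_len)

# Empty canvas with row 1 at the top, then one vertical tick per site
plot(0, 0, type = "n", xlim = c(1, row_len), ylim = c(n_rows + 0.5, 0.5),
     xlab = "Position within row (bp)", ylab = "Row", yaxt = "n")
axis(2, at = 1:n_rows, las = 1)
segments(offset, row_id - 0.3, offset, row_id + 0.3)
```

Swapping `posvec_sim` for the real position vector should give the picture for the actual data.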
----
Random Scatter
"A computer can simulate 296 palindrome sites chosen at random along a DNA sequence of 229,354 bases using a pseudo-random number generator." pp. 81-82
# Compare the observed locations with 10 simulated random scatters of the same size
for (i in 1:10) {
  posvec_rand <- sort(sample(1:dnalen, length(posvec), replace = FALSE))
  print(cor(posvec, posvec_rand))
}
[1] 0.9986788
[1] 0.997675
[1] 0.998048
[1] 0.9982345
[1] 0.9982026
[1] 0.9969843
[1] 0.9979355
[1] 0.9988115
[1] 0.9983922
[1] 0.9984165
The correlation is close to 1 in all 10 cases. Does that mean the palindrome locations on CMV DNA are indeed due to a random scatter?
--
"... pursue the point of view that structure in the data is indicated by departures from a uniform scatter of palindromes across the DNA." p. 81
It does seem to depart from a uniform scatter.
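One caveat about the correlation check, as far as I understand it: two sorted vectors covering the same range are both increasing, so their correlation will be high almost regardless of local structure. The sketch below uses made-up data to illustrate this. It compares a pure random scatter with a deliberately clustered one, where 44 of the 296 sites are packed into a 2000 bp stretch (the stretch and the 44 are my own invention, not taken from the CMV data). It then looks at the largest count in any 1000 bp bin, which is more sensitive to clusters.

```r
set.seed(42)
dnalen <- 229354   # sequence length quoted from the book
n <- 296           # number of palindromes quoted from the book

# Made-up clustered data: 44 sites (about 15%) inside a 2000 bp stretch,
# the remaining 252 scattered uniformly. NOT the real CMV data.
clustered <- sort(c(sample(93001:95000, 44), sample(dnalen, n - 44)))
uniform_a <- sort(sample(dnalen, n))
uniform_b <- sort(sample(dnalen, n))

cor_uniform   <- cor(uniform_a, uniform_b)   # two pure random scatters
cor_clustered <- cor(clustered, uniform_b)   # clustered vs. random scatter

# Largest number of sites falling in any 1000 bp bin
max_bin <- function(x) {
  max(tabulate(ceiling(x / 1000), nbins = ceiling(dnalen / 1000)))
}

print(c(cor_uniform = cor_uniform, cor_clustered = cor_clustered))
print(c(max_uniform = max_bin(uniform_a), max_clustered = max_bin(clustered)))
```

If both correlations come out high while the bin maxima differ a lot, then a correlation near 1 says little either way about random scatter, and counts or spacings are the better place to look.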
----
Locations and Spacings
> space <- diff(posvec)  # distances between consecutive palindromes
> hist(space, main="Histogram of spacings: consecutive palindromes", xlab="")
Why does it have a long right tail? What distribution does it come from?
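A partial answer, assuming the random-scatter model: if sites arrive like a homogeneous Poisson process with rate n / dnalen, the spacings are approximately exponential, and the exponential density has exactly this kind of long right tail. The sketch below checks this on a simulated scatter (`posvec_sim` is a stand-in, not the real data) by overlaying the exponential density on the histogram.

```r
set.seed(42)
dnalen <- 229354
n <- 296
posvec_sim <- sort(sample(dnalen, n))   # simulated random scatter, NOT real data
space_sim  <- diff(posvec_sim)          # spacings between consecutive sites
rate <- n / dnalen                      # sites per base under the model

hist(space_sim, freq = FALSE, breaks = 30,
     main = "Simulated spacings with exponential density", xlab = "Spacing (bp)")
curve(dexp(x, rate = rate), add = TRUE, lwd = 2)

# For an exponential, the median sits well below the mean (right skew)
print(c(mean = mean(space_sim), median = median(space_sim),
        expected_mean = 1 / rate))
```

Running the same overlay on the real `space` vector would show how far the observed spacings stray from the exponential shape.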
Next, I summed the spacings over consecutive pairs, triplets, etc.
Is this similar to the distribution of palindrome locations?
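Here is a sketch of how the sums can be computed, again on a simulated scatter rather than the real data. I use `diff()` with a lag, which gives overlapping sums of 2 or 3 consecutive spacings. That is my reading of "pairs" and "triplets", and non-overlapping sums would be an equally reasonable choice. Under the Poisson model, the sum of k independent exponential spacings follows a gamma distribution with shape k, so I overlay those densities for comparison.

```r
set.seed(42)
dnalen <- 229354
n <- 296
posvec_sim <- sort(sample(dnalen, n))   # simulated random scatter, NOT real data
rate <- n / dnalen

# Distance to the 2nd and 3rd next site = sum of 2 or 3 consecutive spacings
pair_sums    <- diff(posvec_sim, lag = 2)
triplet_sums <- diff(posvec_sim, lag = 3)

op <- par(mfrow = c(1, 2))
hist(pair_sums, freq = FALSE, main = "Sums of 2 spacings", xlab = "bp")
curve(dgamma(x, shape = 2, rate = rate), add = TRUE, lwd = 2)
hist(triplet_sums, freq = FALSE, main = "Sums of 3 spacings", xlab = "bp")
curve(dgamma(x, shape = 3, rate = rate), add = TRUE, lwd = 2)
par(op)
```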
----
Counts
I drew histograms of the counts of palindromes in segments of different sizes.

How do I choose the proper interval length? Why did the authors select the size 4000 (Table 4.2)?
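I do not know the authors' reason, but the trade-off seems to be this: very short intervals give mostly counts of 0 and 1, while very long intervals leave too few intervals to compare against a distribution. Below is a sketch of the count table for one interval size, compared with Poisson expectations. It runs on a simulated scatter (not the real data) and drops the incomplete last interval, which is an assumption on my part.

```r
set.seed(42)
dnalen <- 229354
posvec_sim <- sort(sample(dnalen, 296))   # simulated random scatter, NOT real data

size  <- 4000
n_int <- dnalen %/% size                  # number of complete intervals
bins  <- (posvec_sim - 1) %/% size + 1    # interval index of each site
counts <- tabulate(bins[bins <= n_int], nbins = n_int)

lambda <- mean(counts)                    # estimated Poisson mean per interval
obs  <- table(factor(counts, levels = 0:max(counts)))
expd <- n_int * dpois(0:max(counts), lambda)

print(round(cbind(count = 0:max(counts),
                  observed = as.vector(obs),
                  expected = expd), 1))
```

Changing `size` and rerunning shows how the table degenerates at the two extremes.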
----
The biggest cluster
I tried interval sizes from 500 to 12000 in increments of 500.
It looks like the biggest cluster is between 92501 and 93001.
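The scan over interval sizes can be sketched roughly as follows. It runs on a simulated scatter, so the intervals it reports mean nothing biologically. The interval convention (first interval is bases 1 to `size`) and the helper names are my own.

```r
set.seed(42)
dnalen <- 229354
posvec_sim <- sort(sample(dnalen, 296))   # simulated random scatter, NOT real data

sizes <- seq(500, 12000, by = 500)

# For each interval size, find the interval holding the most sites
res <- do.call(rbind, lapply(sizes, function(size) {
  counts <- tabulate((posvec_sim - 1) %/% size + 1,
                     nbins = ceiling(dnalen / size))
  k <- which.max(counts)                  # first interval with the maximum
  data.frame(size = size,
             start = (k - 1) * size + 1,
             end = k * size,
             max_count = counts[k])
}))

print(res)
```

A region that keeps showing up across many sizes is a more convincing cluster than one that appears for a single size only.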
----
Sliding window approach (Figure 4.5)
A 1000 bp window slides along the sequence with an overlap of 500 bp between consecutive windows.
The biggest cluster seems to be just before 100,000.
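A minimal sketch of the sliding-window count, again on a simulated scatter rather than the real data. The window start positions and the half-open window convention are my own choices and may differ from what was used for Figure 4.5.

```r
set.seed(42)
dnalen <- 229354
posvec_sim <- sort(sample(dnalen, 296))   # simulated random scatter, NOT real data

win  <- 1000                              # window width in bp
step <- 500                               # shift between windows (500 bp overlap)
starts <- seq(1, dnalen - win + 1, by = step)

# Number of sites in each window [start, start + win)
counts <- vapply(starts,
                 function(s) sum(posvec_sim >= s & posvec_sim < s + win),
                 integer(1))

plot(starts, counts, type = "h",
     xlab = "Window start (bp)", ylab = "Palindromes per window")
print(starts[which.max(counts)])          # start of the fullest window
```

Because the windows overlap, a cluster that straddles a fixed interval boundary is less likely to be split in two and missed.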
----
I still feel that I have not mastered the chapter. I am going back to some earlier chapters and other sources, and I will post those notes elsewhere.
I am always looking for someone who is willing to discuss statistical concepts in a biological context. If you are interested, please let me know here. Thanks for stopping by!