ANOVAを題材にＲの練習をする - ryamadaのコンピュータ・数学メモ

分散、平方和、総当たり variance, total sum of squares, pair-wise
- 分散は全要素のペアワイズな違いの足し合わせの指標 variance is an index of sum of pairwise difference of all elements

# Make a numeric vector with a given length 指定の長さの数値ベクトルを作る
n <- 100
x <- rnorm(n)
# Make a directional distance matrix by looping ループを使って向きあり(正負あり)の距離行列を作る
m1 <- matrix(0,n,n)
for(i in 1:n){
 for(j in 1:n){
  m1[i,j] <- x[i]-x[j]
 }
}
# Do same with outer() 同じことをouter()関数を使って行う
m2 <- outer(x,x,"-")
# Check m1 == m2 二つの行列m1,m2が等しいことを確かめるいくつかの方法
m1 == m2 # Many TRUEs たくさんのTRUE
prod(m1==m2) # ALL TRUEs is 1 全部がTRUEなら1
range(m1-m2) # Difference of all elements range from 0 to 0 全要素の範囲は0から0
sum(abs(m1-m2)) # Sum of absolute difference is 0 差の絶対値の和は0
sum((m1-m2)^2) # Sum of square of difference is 0 差の二乗の和は0
# Distance can be calculated with dist() 距離を返す関数dist()を使ってもよい
m3 <- as.matrix(dist(x)) # "as.matrix()" makes the return as a matrix 返り値を行列形式にするために"as.matrix()"を使う
range(m1) # positives and negative and zeros 値は正負に及ぶ
range(m3) # all >= 0 値は非負
range(abs(m1)-m3)
# distance matrix is symmetric 距離行列は対称行列
image(m3) # Symmetricity by imaging 画像で確認、対称性
# symmetric matrix 対称行列
M <- matrix(1:9,3,3) # make a square matrix 正方行列を作る
t(M) # transposed matrix 転置行列
M + t(M) # Self and its transposed 行列とその転置行列の和
M2 <- M + t(M)
M2 - t(M2) # means symmetric 対称行列とわかる
# m3 is symmetric
range(m3-t(m3))
# What is sum of distance^2 距離の2乗の和とは？
sum(m3^2)
# Because distance matrix is symmetric, divide by 2 対称行列なので２で割る
sum(m3^2)/2
# Standardize with length of x 値ベクトルの長さで標準化する
sum(m3^2)/2/n
# Calculate TSS (total sum of squares) 全平方和は平均との差の二乗の和
sum((x-mean(x))^2)
# Sample variance 標本分散
sum((x-mean(x))^2)/n
# Unbiased variance 不偏分散
var(x)
# Sample variance and Unbiased variance 不偏分散から標本分散に直す
var(x)*(n-1)/n
# Sample variance 標本分散の値と並べてみておく
sum((x-mean(x))^2)/n

全体・群内・群間 all/intra-group/inter-group

# Two groups ２群
d <- 2
# Make d numeric vectors and the tandem of them 群の数の数値ベクトルを作ってその連結ベクトルを作る
n1 <- 30
x1 <- rnorm(n1,1) # mean=1 平均1
n2 <- 20
x2 <- rnorm(n2,5) # mean=2 平均2
X <- c(x1,x2) # tandem 連結
# Pairwise distance matrix with outer() 総当たり距離行列をouter()方式で
m1 <- outer(x1,x1,"-") # intra-group of g1 群１の群内
m2 <- outer(x2,x2,"-") # intra-group of g2 群２の群内
M <- outer(X,X,"-") # intra-group(?) of total 全体(を群とした、その)群内
# How to calculate inter-group of x1 and x2 群間はどうやる？
m12 <- outer(x1,x2,"-") # inter-group 群間
m21 <- outer(x2,x1,"-") # inter-group of different order of groups 群の順序を変えてもう一つの群間
# Dimension and length of m* それぞれの行列のdimension と要素数
dim(m1)
dim(m2)
dim(M)
dim(m12)
dim(m21)
length(m1)
length(m2)
length(M)
length(m12)
length(m21)
length(m1)+length(m2)+length(m12)+length(m21)
length(M) # Same 同じになる
# sum of squared elements
sum(m1^2)+sum(m2^2)+sum(m12^2)+sum(m21^2)
sum(M^2) # 同じになる
# intra and inter 群内と群間
sum(m1^2)+sum(m2^2)
sum(m12^2)+sum(m21^2)
sum(M^2)-(sum(m1^2)+sum(m2^2)) # Another way 別の計算法

群内・群間のばらつきの指標と群の違い indices of inter/intra variations and "difference of groups"

# When two groups are different, inter-group distances tends to be longer than intra-group distances ２群に違いがあると群間距離は群内距離より長めになる
par(mfcol=c(1,2))
plot(c(m1,m2,m12),col=c(rep(1,length(m1)),rep(2,length(m2)),rep(3,length(m12))))
abline(h=0)
plot(c(m1^2,m2^2,m12^2),col=c(rep(1,length(m1)),rep(2,length(m2)),rep(3,length(m12))))
abline(h=0)
par(mfcol=c(1,1))
# No difference between two groups... ２群の差をなくすと
# Make d numeric vectors and the tandem of them 群の数の数値ベクトルを作ってその連結ベクトルを作る
n1 <- 30
x1 <- rnorm(n1,1) # mean=1 平均1
n2 <- 20
x2 <- rnorm(n2,1) # mean=2 平均2
# Pairwise distance matrix with outer() 総当たり距離行列をouter()方式で
m1 <- outer(x1,x1,"-") # intra-group of g1 群１の群内
m2 <- outer(x2,x2,"-") # intra-group of g2 群２の群内
# How to calculate inter-group of x1 and x2 群間はどうやる？
m12 <- outer(x1,x2,"-") # inter-group 群間
# When two groups are different, inter-group distances tends to be longer than intra-group distances ２群に違いがあると群間距離は群内距離より長めになる
par(mfcol=c(1,2))
plot(c(m1,m2,m12),col=c(rep(1,length(m1)),rep(2,length(m2)),rep(3,length(m12))))
abline(h=0)
plot(c(m1^2,m2^2,m12^2),col=c(rep(1,length(m1)),rep(2,length(m2)),rep(3,length(m12))))
abline(h=0)
par(mfcol=c(1,1))

群内のサンプルの値のばらつきと群間のサンプルの値のばらつきとに分けて考えると群の違いが見えてくるらしい separation of variations into intra-components and inter-components seems to tell something on the difference between (among) groups.
検定としてうまくいくためには、帰無仮説を仮定したときの珍しさをp値で表せるように、統計量とそれのp値化するための分布とが必要 When we use this idea as a statistical test, we need a statistics measuring rarity under the null hypothesis and a distribution which converts the statistics into p-value.
それがF統計量とF分布(Wiki)

n1 <- 30
x1 <- rnorm(n1,1) # mean=1 平均1
n2 <- 20
x2 <- rnorm(n2,2) # mean=2 平均2
X <- c(x1,x2) # tandem 連結
# Pairwise distance matrix with outer() 総当たり距離行列をouter()方式で
m1 <- outer(x1,x1,"-") # intra-group of g1 群１の群内
m2 <- outer(x2,x2,"-") # intra-group of g2 群２の群内
M <- outer(X,X,"-") # intra-group(?) of total 全体(を群とした、その)群内
# How to calculate inter-group of x1 and x2 群間はどうやる？
m12 <- outer(x1,x2,"-") # inter-group 群間
m21 <- outer(x2,x1,"-") # inter-group of different order of groups 群の順序を変えてもう一つの群間

# Make a tandem vector and give factor labels to them
X <- c(x1,x2)
Z <- factor(c(rep("A",n1),rep("B",n2)))
# Linear regression with lm() and its output object should be processed with anova() to obtain result of anova for means between two groups 線形回帰をlm()で行い、その出力をanova()関数に渡すことで2群の平均値の差に関するANOVA検定が実施できる
anova(lm(X~Z))

> anova(lm(X~Z))
Analysis of Variance Table

Response: X
          Df Sum Sq Mean Sq F value    Pr(>F)    
Z          1 11.800 11.7999  13.654 0.0005631 ***
Residuals 48 41.482  0.8642                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’

F value and Pr(>F)が「何か」だろう You can see "F value" and "Pr(>F)" and they must be something

help(df) # d*,p*,q* and r* are functions for distribution "*" in R 確率分布に関するRの関数作成ルールを思い出して検索する
# F value is "statistical value" and Pr(>F) is p.value 統計量とp値
# pf() should be the function to return p value of a F-statistic value F統計量を与えるとp値を返すのはpf()関数
# pf(q, df1, df2, ncp, lower.tail = TRUE, log.p = FALSE)
pf(13.654,1,48)
1-pf(13.654,1,48)
pf(13.654,1,48,lower.tail=FALSE)

> pf(13.654,1,48)
[1] 0.9994369
> 1-pf(13.654,1,48)
[1] 0.0005630661
> pf(13.654,1,48,lower.tail=FALSE)
[1] 0.0005630661

ここまででANOVAの出力のDf,F value Pr(>F)の関係は分かった We understood relation among Df, F value, and Pr(>F) already.
ANOVAのp値はF統計量と二つの自由度で決まる p-value of ANOVA is determined by F statistics and two DFs.

sum(m1^2)/2/n1 + sum(m2^2)/2/n2
sum(M^2)/2/(n1+n2) - (sum(m1^2)/2/n1 + sum(m2^2)/2/n2)

> sum(m1^2)/2/n1 + sum(m2^2)/2/n2
[1] 41.482
> sum(M^2)/2/(n1+n2) - (sum(m1^2)/2/n1 + sum(m2^2)/2/n2)
[1] 11.800

出力のSum Sqの計算法もわかった We now know Sum Sq in the output

S.intra <- sum(m1^2)/2/n1 + sum(m2^2)/2/n2
S.inter <- sum(M^2)/2/(n1+n2) - (sum(m1^2)/2/n1 + sum(m2^2)/2/n2)
S.intra/48
S.inter/1

出力のMean Sqもわかった We know Mean Sq in the output

M.inter <- S.inter/1
M.intra <- S.intra/48
M.inter/M.intra

出力のF valueがわかった。群間のばらつきと群内のばらつきの比。We know F value, which is a ratio of degree of variations between inter and intra.
２つのDfは、群の数-1、(総サンプル数-1)-(群の数-1) Two DFs are (number of groups -1) and (\(total number of samples -1\) - (number of groups-1))
全体の自由度はサンプル数-1、なぜなら、全サンプルの平均値は標本から求めたものを使っていて、そのように「前提としている値」が１つあるから DF in the whole is (number of samples -1), because we use the mean of the whole for the evaluation from sample data.
そのDFをを群間と群内にふりわけてある The DF should be divided into inter and intra.