3グループデータのモデル化
前回に続きましてRNA-seqの記事になります。まずは、3グループの場合のそれぞれのサンプル、遺伝子のカウントデータの平均パラメータを数式で簡単にモデル化していきたいと思います。こちらのページ(https://bi.biopapyrus.jp/rnaseq/analysis/de-analysis/3g-edger.html)にあるように、一般化線形モデルにおけるのリンク関数と線形予測子の考え方を用いて、ある遺伝子のサンプルごとの平均パラメータをベクトルで表すと、3つのグループの比較で各グループに3つの反復があるとき、以下のように考えることができます。
実際に計算して見るとについて、以下のようになります。
上から3つをGroup1、真ん中3つをGroup2、残り3つをGroup3のサンプルの平均とします。同じグループの反復は同じ平均を示していていい感じです。また、Group2はGroup1に対して倍だけ平均が異なり、Group3はGroup1に対して倍だけ異なる事がわかります。
2つのグループではGroup1とGroup2の発現の比(Fold change)を考えるだけでしたが、3つのグループを考えるときは2つのFold changeが考えられるということになります。
に関してはその遺伝子の全体的な発現レベルを決める役割を果たします。この値が小さければ全体的に発現が小さいということを表します。
では実際にこの式をもとに、前回の記事で考えた各発現パターンにおける平均パラメータを計算してみましょう。ここでは発現パターンについてのみ考えるので、で固定しています。
非発現変動遺伝子(nonDEG)
nonDEGの場合は簡単で、とすれば良いです。
alpha <- log(100) beta_1 <- log(1) beta_2 <- log(1) X <- model.matrix(~gl(3, 3)) lambda <- exp(X %*% c(alpha, beta_1, beta_2)) plot(lambda, type = "h", lwd = 30, lend = 1, col = "#696969", xlim = c(0, 10), ylim = c(-5, max(lambda)), axes = F, ann = F) text(c(2, 5, 8), c(-3, -3, -3), c("Group1", "Group2", "Group3"), adj = c(0.5, 1)) segments(c(0.6, 3.6, 6.6), c(-1, -1, -1), c(3.4, 6.4, 9.4), c(-1, -1, -1))
発現変動遺伝子(1対他)
続いてあるグループでその他2つのグループと比べて発現変動している場合について考えます。Fold changeについては2とします。
G1 > G2 = G3
Group1においてのみ高発現の場合。
beta_1 <- log(1/2) beta_2 <- log(1/2) X <- model.matrix(~gl(3, 3)) lambda <- exp(X %*% c(alpha, beta_1, beta_2)) plot(lambda, type = "h", lwd = 30, lend = 1, col = "#696969", xlim = c(0, 10), ylim = c(-5, max(lambda)), axes = F, ann = F) text(c(2, 5, 8), c(-3, -3, -3), c("Group1", "Group2", "Group3"), adj = c(0.5, 1)) segments(c(0.6, 3.6, 6.6), c(-1, -1, -1), c(3.4, 6.4, 9.4), c(-1, -1, -1))
G1 < G2 = G3
Group1においてのみ低発現の場合。
beta_1 <- log(2) beta_2 <- log(2) X <- model.matrix(~gl(3, 3)) lambda <- exp(X %*% c(alpha, beta_1, beta_2)) plot(lambda, type = "h", lwd = 30, lend = 1, col = "#696969", xlim = c(0, 10), ylim = c(-5, max(lambda)), axes = F, ann = F) text(c(2, 5, 8), c(-3, -3, -3), c("Group1", "Group2", "Group3"), adj = c(0.5, 1)) segments(c(0.6, 3.6, 6.6), c(-1, -1, -1), c(3.4, 6.4, 9.4), c(-1, -1, -1))
G2 > G1 = G3
Group2においてのみ高発現の場合。
beta_1 <- log(2) beta_2 <- log(1) X <- model.matrix(~gl(3, 3)) lambda <- exp(X %*% c(alpha, beta_1, beta_2)) plot(lambda, type = "h", lwd = 30, lend = 1, col = "#696969", xlim = c(0, 10), ylim = c(-5, max(lambda)), axes = F, ann = F) text(c(2, 5, 8), c(-3, -3, -3), c("Group1", "Group2", "Group3"), adj = c(0.5, 1)) segments(c(0.6, 3.6, 6.6), c(-1, -1, -1), c(3.4, 6.4, 9.4), c(-1, -1, -1))
G2 < G1 = G3
Group2においてのみ低発現の場合。
beta_1 <- log(1/2) beta_2 <- log(1) X <- model.matrix(~gl(3, 3)) lambda <- exp(X %*% c(alpha, beta_1, beta_2)) plot(lambda, type = "h", lwd = 30, lend = 1, col = "#696969", xlim = c(0, 10), ylim = c(-5, max(lambda)), axes = F, ann = F) text(c(2, 5, 8), c(-3, -3, -3), c("Group1", "Group2", "Group3"), adj = c(0.5, 1)) segments(c(0.6, 3.6, 6.6), c(-1, -1, -1), c(3.4, 6.4, 9.4), c(-1, -1, -1))
G3 > G1 = G2
Group3においてのみ高発現の場合。
beta_1 <- log(1) beta_2 <- log(2) X <- model.matrix(~gl(3, 3)) lambda <- exp(X %*% c(alpha, beta_1, beta_2)) plot(lambda, type = "h", lwd = 30, lend = 1, col = "#696969", xlim = c(0, 10), ylim = c(-5, max(lambda)), axes = F, ann = F) text(c(2, 5, 8), c(-3, -3, -3), c("Group1", "Group2", "Group3"), adj = c(0.5, 1)) segments(c(0.6, 3.6, 6.6), c(-1, -1, -1), c(3.4, 6.4, 9.4), c(-1, -1, -1))
G3 < G1 = G2
Group3においてのみ低発現の場合。
beta_1 <- log(1) beta_2 <- log(1/2) X <- model.matrix(~gl(3, 3)) lambda <- exp(X %*% c(alpha, beta_1, beta_2)) plot(lambda, type = "h", lwd = 30, lend = 1, col = "#696969", xlim = c(0, 10), ylim = c(-5, max(lambda)), axes = F, ann = F) text(c(2, 5, 8), c(-3, -3, -3), c("Group1", "Group2", "Group3"), adj = c(0.5, 1)) segments(c(0.6, 3.6, 6.6), c(-1, -1, -1), c(3.4, 6.4, 9.4), c(-1, -1, -1))
全てのグループ間で発現変動
さらに続けて、全てのグループの間で発現変動している遺伝子について考えます。この場合もFolc changeについては
2を考えますが、一番発現が小さいグループと一番発現が多いグループのFolc changeはとなります。
G1 < G2 < G3
beta_1 <- log(2) beta_2 <- log(4) X <- model.matrix(~gl(3, 3)) lambda <- exp(X %*% c(alpha, beta_1, beta_2)) plot(lambda, type = "h", lwd = 30, lend = 1, col = "#696969", xlim = c(0, 10), ylim = c(-5, max(lambda)), axes = F, ann = F) text(c(2, 5, 8), c(-3, -3, -3), c("Group1", "Group2", "Group3"), adj = c(0.5, 1)) segments(c(0.6, 3.6, 6.6), c(-1, -1, -1), c(3.4, 6.4, 9.4), c(-1, -1, -1))
G1 < G3 < G2
beta_1 <- log(4) beta_2 <- log(2) X <- model.matrix(~gl(3, 3)) lambda <- exp(X %*% c(alpha, beta_1, beta_2)) plot(lambda, type = "h", lwd = 30, lend = 1, col = "#696969", xlim = c(0, 10), ylim = c(-5, max(lambda)), axes = F, ann = F) text(c(2, 5, 8), c(-3, -3, -3), c("Group1", "Group2", "Group3"), adj = c(0.5, 1)) segments(c(0.6, 3.6, 6.6), c(-1, -1, -1), c(3.4, 6.4, 9.4), c(-1, -1, -1))
G2 < G1 < G3
beta_1 <- log(1/2) beta_2 <- log(2) X <- model.matrix(~gl(3, 3)) lambda <- exp(X %*% c(alpha, beta_1, beta_2)) plot(lambda, type = "h", lwd = 30, lend = 1, col = "#696969", xlim = c(0, 10), ylim = c(-5, max(lambda)), axes = F, ann = F) text(c(2, 5, 8), c(-3, -3, -3), c("Group1", "Group2", "Group3"), adj = c(0.5, 1)) segments(c(0.6, 3.6, 6.6), c(-1, -1, -1), c(3.4, 6.4, 9.4), c(-1, -1, -1))
G3 < G1 < G2
beta_1 <- log(2) beta_2 <- log(1/2) X <- model.matrix(~gl(3, 3)) lambda <- exp(X %*% c(alpha, beta_1, beta_2)) plot(lambda, type = "h", lwd = 30, lend = 1, col = "#696969", xlim = c(0, 10), ylim = c(-5, max(lambda)), axes = F, ann = F) text(c(2, 5, 8), c(-3, -3, -3), c("Group1", "Group2", "Group3"), adj = c(0.5, 1)) segments(c(0.6, 3.6, 6.6), c(-1, -1, -1), c(3.4, 6.4, 9.4), c(-1, -1, -1))
G2 < G3 < G1
beta_1 <- log(1/4) beta_2 <- log(1/2) X <- model.matrix(~gl(3, 3)) lambda <- exp(X %*% c(alpha, beta_1, beta_2)) plot(lambda, type = "h", lwd = 30, lend = 1, col = "#696969", xlim = c(0, 10), ylim = c(-5, max(lambda)), axes = F, ann = F) text(c(2, 5, 8), c(-3, -3, -3), c("Group1", "Group2", "Group3"), adj = c(0.5, 1)) segments(c(0.6, 3.6, 6.6), c(-1, -1, -1), c(3.4, 6.4, 9.4), c(-1, -1, -1))
G3 < G2 < G1
beta_1 <- log(1/2) beta_2 <- log(1/4) X <- model.matrix(~gl(3, 3)) lambda <- exp(X %*% c(alpha, beta_1, beta_2)) plot(lambda, type = "h", lwd = 30, lend = 1, col = "#696969", xlim = c(0, 10), ylim = c(-5, max(lambda)), axes = F, ann = F) text(c(2, 5, 8), c(-3, -3, -3), c("Group1", "Group2", "Group3"), adj = c(0.5, 1)) segments(c(0.6, 3.6, 6.6), c(-1, -1, -1), c(3.4, 6.4, 9.4), c(-1, -1, -1))
Fold changeをランダムに設定する
これまではFold change()を固定して考えてきましたが、現実のRNA-seqデータをシミュレーションすることが目標なので、このパラメータをランダムに設定していきたいと思います。気をつけるべきところは、DEGの場合にランダムに設定してたまたまnonDEGとほぼ同じような発現データの状況になったということを避けることでしょう。そのために、発現変動があるというグループ間には
最低でも1.5倍の差が生じるようにしました。
G1 > G2 = G3
alpha <- log(100) beta_1 <- beta_2 <- runif(1, min = -log(4), max = -log(1.5)) X <- model.matrix(~gl(3, 3)) lambda <- exp(X %*% c(alpha, beta_1, beta_2)) plot(lambda, type = "h", lwd = 30, lend = 1, col = "#696969", xlim = c(0, 10), ylim = c(-5, max(lambda)), axes = F, ann = F) text(c(2, 5, 8), c(-3, -3, -3), c("Group1", "Group2", "Group3"), adj = c(0.5, 1)) segments(c(0.6, 3.6, 6.6), c(-1, -1, -1), c(3.4, 6.4, 9.4), c(-1, -1, -1))
G1 < G2 = G3
alpha <- log(100) beta_1 <- beta_2 <- runif(1, min = log(1.5), max = log(4)) X <- model.matrix(~gl(3, 3)) lambda <- exp(X %*% c(alpha, beta_1, beta_2)) plot(lambda, type = "h", lwd = 30, lend = 1, col = "#696969", xlim = c(0, 10), ylim = c(-5, max(lambda)), axes = F, ann = F) text(c(2, 5, 8), c(-3, -3, -3), c("Group1", "Group2", "Group3"), adj = c(0.5, 1)) segments(c(0.6, 3.6, 6.6), c(-1, -1, -1), c(3.4, 6.4, 9.4), c(-1, -1, -1))
G2 > G1 = G3
alpha <- log(100) beta_1 <- runif(1, min = log(1.5), max = log(4)) beta_2 <- log(1) X <- model.matrix(~gl(3, 3)) lambda <- exp(X %*% c(alpha, beta_1, beta_2)) plot(lambda, type = "h", lwd = 30, lend = 1, col = "#696969", xlim = c(0, 10), ylim = c(-5, max(lambda)), axes = F, ann = F) text(c(2, 5, 8), c(-3, -3, -3), c("Group1", "Group2", "Group3"), adj = c(0.5, 1)) segments(c(0.6, 3.6, 6.6), c(-1, -1, -1), c(3.4, 6.4, 9.4), c(-1, -1, -1))
G2 < G1 = G3
alpha <- log(100) beta_1 <- runif(1, min = -log(4), max = -log(1.5)) beta_2 <- log(1) X <- model.matrix(~gl(3, 3)) lambda <- exp(X %*% c(alpha, beta_1, beta_2)) plot(lambda, type = "h", lwd = 30, lend = 1, col = "#696969", xlim = c(0, 10), ylim = c(-5, max(lambda)), axes = F, ann = F) text(c(2, 5, 8), c(-3, -3, -3), c("Group1", "Group2", "Group3"), adj = c(0.5, 1)) segments(c(0.6, 3.6, 6.6), c(-1, -1, -1), c(3.4, 6.4, 9.4), c(-1, -1, -1))
G3 > G1 = G2
alpha <- log(100) beta_1 <- log(1) beta_2 <- runif(1, min = log(1.5), max = log(4)) X <- model.matrix(~gl(3, 3)) lambda <- exp(X %*% c(alpha, beta_1, beta_2)) plot(lambda, type = "h", lwd = 30, lend = 1, col = "#696969", xlim = c(0, 10), ylim = c(-5, max(lambda)), axes = F, ann = F) text(c(2, 5, 8), c(-3, -3, -3), c("Group1", "Group2", "Group3"), adj = c(0.5, 1)) segments(c(0.6, 3.6, 6.6), c(-1, -1, -1), c(3.4, 6.4, 9.4), c(-1, -1, -1))
G3 < G1 = G2
alpha <- log(100) beta_1 <- log(1) beta_2 <- runif(1, min = -log(4), max = -log(1.5)) X <- model.matrix(~gl(3, 3)) lambda <- exp(X %*% c(alpha, beta_1, beta_2)) plot(lambda, type = "h", lwd = 30, lend = 1, col = "#696969", xlim = c(0, 10), ylim = c(-5, max(lambda)), axes = F, ann = F) text(c(2, 5, 8), c(-3, -3, -3), c("Group1", "Group2", "Group3"), adj = c(0.5, 1)) segments(c(0.6, 3.6, 6.6), c(-1, -1, -1), c(3.4, 6.4, 9.4), c(-1, -1, -1))
G1 < G2 < G3
beta_1 <- runif(1, min = log(1.5), max = log(4)) beta_2 <- runif(1, min = beta_1 + log(1.5), max = beta_1 + log(4)) X <- model.matrix(~gl(3, 3)) lambda <- exp(X %*% c(alpha, beta_1, beta_2)) plot(lambda, type = "h", lwd = 30, lend = 1, col = "#696969", xlim = c(0, 10), ylim = c(-5, max(lambda)), axes = F, ann = F) text(c(2, 5, 8), c(-3, -3, -3), c("Group1", "Group2", "Group3"), adj = c(0.5, 1)) segments(c(0.6, 3.6, 6.6), c(-1, -1, -1), c(3.4, 6.4, 9.4), c(-1, -1, -1))
G1 < G3 < G2
beta_2 <- runif(1, min = log(1.5), max = log(4)) beta_1 <- runif(1, min = beta_2 + log(1.5), max = beta_2 + log(4)) X <- model.matrix(~gl(3, 3)) lambda <- exp(X %*% c(alpha, beta_1, beta_2)) plot(lambda, type = "h", lwd = 30, lend = 1, col = "#696969", xlim = c(0, 10), ylim = c(-5, max(lambda)), axes = F, ann = F) text(c(2, 5, 8), c(-3, -3, -3), c("Group1", "Group2", "Group3"), adj = c(0.5, 1)) segments(c(0.6, 3.6, 6.6), c(-1, -1, -1), c(3.4, 6.4, 9.4), c(-1, -1, -1))
G2 < G1 < G3
beta_1 <- runif(1, min = -log(4), max = -log(1.5)) beta_2 <- runif(1, min = log(1.5), max = log(4)) X <- model.matrix(~gl(3, 3)) lambda <- exp(X %*% c(alpha, beta_1, beta_2)) plot(lambda, type = "h", lwd = 30, lend = 1, col = "#696969", xlim = c(0, 10), ylim = c(-5, max(lambda)), axes = F, ann = F) text(c(2, 5, 8), c(-3, -3, -3), c("Group1", "Group2", "Group3"), adj = c(0.5, 1)) segments(c(0.6, 3.6, 6.6), c(-1, -1, -1), c(3.4, 6.4, 9.4), c(-1, -1, -1))
G3 < G1 < G2
beta_1 <- runif(1, min = log(1.5), max = log(4)) beta_2 <- runif(1, min = -log(4), max = -log(1.5)) X <- model.matrix(~gl(3, 3)) lambda <- exp(X %*% c(alpha, beta_1, beta_2)) plot(lambda, type = "h", lwd = 30, lend = 1, col = "#696969", xlim = c(0, 10), ylim = c(-5, max(lambda)), axes = F, ann = F) text(c(2, 5, 8), c(-3, -3, -3), c("Group1", "Group2", "Group3"), adj = c(0.5, 1)) segments(c(0.6, 3.6, 6.6), c(-1, -1, -1), c(3.4, 6.4, 9.4), c(-1, -1, -1))
G2 < G3 < G1
beta_2 <- runif(1, min = -log(4), max = -log(1.5)) beta_1 <- beta_2 + runif(1, min = -log(4), max = -log(1.5)) X <- model.matrix(~gl(3, 3)) lambda <- exp(X %*% c(alpha, beta_1, beta_2)) plot(lambda, type = "h", lwd = 30, lend = 1, col = "#696969", xlim = c(0, 10), ylim = c(-5, max(lambda)), axes = F, ann = F) text(c(2, 5, 8), c(-3, -3, -3), c("Group1", "Group2", "Group3"), adj = c(0.5, 1)) segments(c(0.6, 3.6, 6.6), c(-1, -1, -1), c(3.4, 6.4, 9.4), c(-1, -1, -1))
G3 < G2 < G1
beta_1 <- runif(1, min = -log(4), max = -log(1.5)) beta_2 <- beta_1 + runif(1, min = -log(4), max = -log(1.5)) X <- model.matrix(~gl(3, 3)) lambda <- exp(X %*% c(alpha, beta_1, beta_2)) plot(lambda, type = "h", lwd = 30, lend = 1, col = "#696969", xlim = c(0, 10), ylim = c(-5, max(lambda)), axes = F, ann = F) text(c(2, 5, 8), c(-3, -3, -3), c("Group1", "Group2", "Group3"), adj = c(0.5, 1)) segments(c(0.6, 3.6, 6.6), c(-1, -1, -1), c(3.4, 6.4, 9.4), c(-1, -1, -1))
疑似RNA-seqカウントデータ(3グループ)の生成
最後に3グループRNA-seqカウントデータのシミュレーションデータを生成してみたいと思います。ここまで、各発現パターンについて細かく考えてきましたが、シミュレーションデータを生成するだけなら、適当にを定めてやって各グループのデータの平均パラメータを求めてからそのベクトルの要素を適切に並び替えれば良いと思います。
あとはその平均パラメータをもとに負の二項分布に基づくノイズを加えてあげれば擬似的カウントデータの完成です。
以下には汎用性にかけますが、RNA-seqカウントデータ(3グループ)の生成コードを示します。
generateCountMeanMatrix <- function(Ngene, z) { X <- model.matrix(~gl(3,1)) beta_1 <- runif(Ngene, min = log(1.5), max = log(4)) beta_2 <- beta_1 + runif(Ngene, min = log(1.5), log(4)) alpha <- rnorm(Ngene, 4, 1) tmp <- t(X %*% rbind(alpha, beta_1, beta_2)) logMeanMat <- matrix(NA, nrow = Ngene, ncol = 9) # nonDEG logMeanMat[z == 0, ] <- tmp[z == 0, 1] # DEG type 1 logMeanMat[z == 1, 1:3 ] <- tmp[z == 1, 1] logMeanMat[z == 1, -(1:3)] <- tmp[z == 1, 2] # DEG type 2 logMeanMat[z == 2, 1:3] <- tmp[z == 2, 2] logMeanMat[z == 2, -(1:3)] <- tmp[z == 2, 1] # DEG type 3 logMeanMat[z == 3, 4:6] <- tmp[z == 3, 1] logMeanMat[z == 3, -(4:6)] <- tmp[z == 3, 2] # DEG type 4 logMeanMat[z == 4, 4:6] <- tmp[z == 4, 2] logMeanMat[z == 4, -(4:6)] <- tmp[z == 4, 1] # DEG type 5 logMeanMat[z == 5, 7:9] <- tmp[z == 5, 1] logMeanMat[z == 5, -(7:9)] <- tmp[z == 5, 2] # DEG type 6 logMeanMat[z == 6, 7:9] <- tmp[z == 6, 2] logMeanMat[z == 6, -(7:9)] <- tmp[z == 6, 1] # DEG type 7 logMeanMat[z == 7, 1:3] <- tmp[z == 7, 1] logMeanMat[z == 7, 4:6] <- tmp[z == 7, 2] logMeanMat[z == 7, 7:9] <- tmp[z == 7, 3] # DEG type 8 logMeanMat[z == 8, 1:3] <- tmp[z == 8, 1] logMeanMat[z == 8, 4:6] <- tmp[z == 8, 3] logMeanMat[z == 8, 7:9] <- tmp[z == 8, 2] # DEG type 9 logMeanMat[z == 9, 1:3] <- tmp[z == 9, 2] logMeanMat[z == 9, 4:6] <- tmp[z == 9, 1] logMeanMat[z == 9, 7:9] <- tmp[z == 9, 3] # DEG type 10 logMeanMat[z == 10, 1:3] <- tmp[z == 10, 3] logMeanMat[z == 10, 4:6] <- tmp[z == 10, 1] logMeanMat[z == 10, 7:9] <- tmp[z == 10, 2] # DEG type 11 logMeanMat[z == 11, 1:3] <- tmp[z == 11, 2] logMeanMat[z == 11, 4:6] <- tmp[z == 11, 3] logMeanMat[z == 11, 7:9] <- tmp[z == 11, 1] # DEG type 12 logMeanMat[z == 12, 1:3] <- tmp[z == 12, 3] logMeanMat[z == 12, 4:6] <- tmp[z == 12, 2] logMeanMat[z == 12, 7:9] <- tmp[z == 12, 1] logFC <- logMeanMat[, c(1, 4, 7)] - logMeanMat[, 1] list(logFC = logFC, logMeanMat = logMeanMat) } x <- generateCountMeanMatrix(Ngene = 20000, z = rep(0:12, c(14000, rep(500, 12)))) counts <- matrix(rnbinom(20000 * 9, size = 10, mu = exp(x$logMeanMat)), 20000, 9) head(counts) ##> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] ##> [1,] 45 47 52 39 52 28 68 33 41 ##> [2,] 86 67 65 35 45 52 84 43 41 ##> [3,] 106 197 140 200 101 196 66 126 161 ##> [4,] 9 6 12 18 10 10 6 19 17 ##> [5,] 24 38 30 63 46 53 27 46 25 ##> [6,] 120 237 246 157 270 396 329 163 189 heatmap(counts, Rowv = NA, Colv = NA)
それっぽいデータが得られました。