草人 최광민: limma: design matrix w/ or w/o intercept

Scientist. Husband. Daddy. --- TOLLE. LEGE

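Quotations from external sources follow the principle of fair use of copyrighted works as defined in the Copyright Act of the Republic of Korea (Article 28) and the U.S. Copyright Act (17 U.S.C. §107), i.e. the U.S. fair use doctrine. For all writings and translations marked with a copyright notice (© 최광민), the following are prohibited: (1) reproduction and distribution, (2) arbitrary modification or selective excerpting of the text, and (3) screen capture for unauthorized distribution. (4) When citing, only the URL may be used.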

limma: design matrix w/ or w/o intercept

Labels: Informatics

https://stat.ethz.ch/pipermail/bioconductor/2006-April/012825.html

https://stat.ethz.ch/pipermail/bioconductor/2011-June/039777.html

The tilde has a different meaning within R, specifying the right
hand side of a model equation. The default in R is to fit an intercept
in all linear models (which in the context of ANOVA is better thought of
as a 'baseline' sample, to which all other samples are compared).

So when you do something like

f = factor(rep(c("A","B"), each = 3))
design = model.matrix(~f)

you are by default setting the 'A' samples as the baseline sample, and
the second coefficient in the model is the B - A comparison.

To eliminate the intercept, you add either a 0 or a -1 to the right hand
side of the equation:

design = model.matrix(~0+f)

which will then compute the average expression of the A and B samples
separately, so you have to explicitly create a contrasts matrix in order
to compute the B - A contrast.
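
As a quick check of the two parameterizations, here is a minimal sketch in base R. The expression values in `y` are made up for illustration, and the variable names other than `f` are assumptions; only `f` and the two `model.matrix()` calls come from the text above.

```r
# Toy data: 'y' is a hypothetical expression vector for one gene (not from the post)
f <- factor(rep(c("A", "B"), each = 3))
y <- c(1, 2, 3, 5, 6, 7)

# With intercept: coefficients are (mean of A, B - A)
design_int <- model.matrix(~f)
coef_int <- lm.fit(design_int, y)$coefficients

# Without intercept: coefficients are (mean of A, mean of B)
design_cm <- model.matrix(~0 + f)
coef_cm <- lm.fit(design_cm, y)$coefficients

# The B - A comparison has to be built explicitly from the cell means
contrast_BA <- c(fA = -1, fB = 1)
b_minus_a <- sum(contrast_BA * coef_cm[names(contrast_BA)])

print(coef_int)   # second coefficient 'fB' is the B - A difference
print(coef_cm)    # 'fA' and 'fB' are the two group means
print(b_minus_a)  # matches coef_int[["fB"]]
```

Both routes give the same B - A estimate; the difference is only in whether the design matrix or an explicit contrast vector carries the comparison.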

Without an intercept you are fitting a cell means model in which you are
estimating the mean expression for each factor level (e.g., the model is
y_ij = u_i + e_ij). In this case, doing the contrasts is quite
straightforward.

With an intercept you are fitting a factor effects model, in
which every factor level is specified relative to a reference value.
In this case, all the other groups are specified in relation to
the mean of the BASE samples (e.g., the model is y_ij = u. + t_i + e_ij).
Here u. is the mean of the BASE samples, and the t_i are the amounts
by which each of the other group means differs from the BASE mean.
Therefore, the contrasts are given by the t_i values themselves if
you are comparing to BASE, and by differences between coefficients
(e.g., groupPE minus the coefficient of another non-BASE group) for
the other contrasts.
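
To make the factor effects model concrete, the following base R sketch uses a three-group layout. The level names `BASE` and `PE` echo the names in the text; the third level `OTHER`, the values in `y`, and all variable names are invented for illustration.

```r
# Hypothetical three-group layout; 'OTHER' and the values in 'y' are assumptions
group <- factor(rep(c("BASE", "PE", "OTHER"), each = 2),
                levels = c("BASE", "PE", "OTHER"))
y <- c(1, 3, 4, 6, 9, 11)

# Factor effects model: (Intercept), groupPE, groupOTHER
b <- lm.fit(model.matrix(~group), y)$coefficients

# Plain group means, for comparison with the fitted coefficients
group_means <- sapply(split(y, group), mean)

# The intercept is the BASE mean; each t_i is a difference from BASE
pe_vs_base    <- b[["groupPE"]]
other_vs_base <- b[["groupOTHER"]]

# A comparison between two non-baseline groups is a difference of coefficients
other_vs_pe <- b[["groupOTHER"]] - b[["groupPE"]]

print(b)
print(group_means)
print(other_vs_pe)
```

Comparisons against the baseline can be read directly off the coefficients, while any comparison between two non-baseline groups needs a difference of two coefficients.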

See the limmaUsersGuide, and ?formula for more information.

2. design matrix w/ or w/o intercept

As for '~ 0 + Group' versus '~ Group', the first instance means that you
don't want an intercept term, whereas the second means you do (as that
is the default).

design matrix w/o intercept term

model.matrix( ~0 + factor)
I almost always use a cell means model (design matrix without an intercept term).
Cons

you cannot make any comparisons without specifying contrasts (which you might be able to do with a factor effects model, where there is an intercept).

Pros

I don't have to figure out each time which level is being used as the baseline.
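
For the cell means approach, a limma analysis typically pairs the design with an explicit contrast matrix. The sketch below is one way this could look, assuming the Bioconductor package limma is installed; the expression matrix `expr` is simulated noise, and the names `expr`, `cont`, `fit2` and `MUvsWT` are illustrative choices, not taken from the post.

```r
# Sketch only: needs the Bioconductor package 'limma'; 'expr' is simulated data
if (requireNamespace("limma", quietly = TRUE)) {
  set.seed(1)
  Group <- factor(c("WT", "WT", "MU", "MU", "MU"), levels = c("WT", "MU"))
  expr <- matrix(rnorm(20 * 5), nrow = 20,
                 dimnames = list(paste0("gene", 1:20), paste0("s", 1:5)))

  # Cell means design: one column per group, renamed to WT and MU
  design <- model.matrix(~0 + Group)
  colnames(design) <- levels(Group)

  # Fit group means per gene, then ask for the MU - WT comparison explicitly
  fit  <- limma::lmFit(expr, design)
  cont <- limma::makeContrasts(MUvsWT = MU - WT, levels = design)
  fit2 <- limma::eBayes(limma::contrasts.fit(fit, cont))

  print(cont)
  print(limma::topTable(fit2, coef = "MUvsWT", number = 5))
}
```

Naming the contrast (`MUvsWT`) makes the direction of the comparison explicit, which is the practical benefit described under Pros.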

design matrix w/ intercept term

model.matrix( ~factor)

As an example, using the two design matrices below, the first model is a
factor effects model where WT is used as the baseline, so the second
coefficient gives the difference between MU and WT. For this you don't
need a contrast, and for this simple comparison it is probably easier.
If you had two factors and were interested in the interaction, then you
would have to do the algebra to figure out the contrasts.

> Group <- factor(c("WT","WT","MU","MU","MU"), levels=c("WT","MU"))
> Group
[1] WT WT MU MU MU
Levels: WT MU
> design <- model.matrix(~Group)
> design
  (Intercept) GroupMU
1           1       0
2           1       0
3           1       1
4           1       1
5           1       1
attr(,"assign")
[1] 0 1
attr(,"contrasts")
attr(,"contrasts")$Group
[1] "contr.treatment"

The second model simply computes the mean for each factor level (hence,
cell means model), so you have to explicitly compute the contrast of
interest. However, in this case it would be easier to figure out
an interaction if you have two factors.

> design2 <- model.matrix(~0+Group)
> design2
  GroupWT GroupMU
1       1       0
2       1       0
3       0       1
4       0       1
5       0       1
attr(,"assign")
[1] 1 1
attr(,"contrasts")
attr(,"contrasts")$Group
[1] "contr.treatment"
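
The remark about interactions can be illustrated with a small two-factor example in base R. Everything here is hypothetical: the 2 x 2 layout (genotype by treatment), the level names `ctl` and `trt`, the values in `y`, and the variable names are assumptions made for the sketch.

```r
# Hypothetical 2 x 2 layout: genotype (WT/MU) by treatment (ctl/trt), 2 replicates each
geno <- factor(rep(c("WT", "MU"), each = 4), levels = c("WT", "MU"))
trt  <- factor(rep(rep(c("ctl", "trt"), each = 2), times = 2),
               levels = c("ctl", "trt"))
y    <- c(1, 3, 2, 4, 5, 7, 10, 12)   # invented values

# Cell means model: one coefficient per genotype/treatment combination
cell <- interaction(geno, trt, sep = "_")
design_cm <- model.matrix(~0 + cell)
colnames(design_cm) <- levels(cell)
m <- lm.fit(design_cm, y)$coefficients

# Interaction = (treatment effect in MU) - (treatment effect in WT)
inter_cm <- (m[["MU_trt"]] - m[["MU_ctl"]]) - (m[["WT_trt"]] - m[["WT_ctl"]])

# Factor effects model: the same quantity is the interaction coefficient
fe <- lm.fit(model.matrix(~geno * trt), y)$coefficients
inter_fe <- fe[["genoMU:trttrt"]]

print(m)
print(inter_cm)
print(inter_fe)
```

With cell means, the interaction is written as a difference of differences of named group means, which tends to be easier to read and verify than working out what each coefficient of the factor effects model stands for.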

The tilde is used to specify a model, separating the right hand side
(explanatory variables) from the left hand side (dependent variable). So
if you were fitting a model as above, but for just one gene, you would
do something like

lm(gene_expression_values ~ Group)

However, when you are using model.matrix, you are only specifying the
right hand side of that equation (i.e., the design matrix), so you just
use the tilde followed by your explanatory variables.

For a more complete explanation, see ?formula.
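
The relationship between the two-sided formula in `lm()` and the one-sided formula in `model.matrix()` can be checked directly. This is a minimal base R sketch; the values assigned to `gene_expression_values` are invented, and `X` and `same_design` are illustrative names.

```r
Group <- factor(c("WT", "WT", "MU", "MU", "MU"), levels = c("WT", "MU"))
gene_expression_values <- c(5.1, 4.9, 7.0, 7.2, 6.8)   # invented values for one gene

# Two-sided formula: response on the left, explanatory variable on the right
fit <- lm(gene_expression_values ~ Group)

# One-sided formula: only the right hand side, giving the design matrix
X <- model.matrix(~Group)

# lm() builds the same design matrix internally
same_design <- isTRUE(all.equal(model.matrix(fit), X, check.attributes = FALSE))

print(same_design)
print(coef(fit))   # (Intercept) = WT mean, GroupMU = MU mean - WT mean
```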
