
Dear Cynthia:  

Thanks for your email sent on Monday.  

Using the F-statistic to compare means is usually classified under ANOVA (analysis of variance). Excel has ANOVA tools for both single-factor and two-factor analysis, with or without replication. Its FTEST function tests the difference in variances between two samples. As you know, Excel only works on datasets with equal sample sizes, that is, an equal number of data points in each group. S-plus, R, and a few other software packages can carry out the ANOVA when sample sizes are unequal.

Below, I explain this in more detail. First, I describe how the F-statistic is calculated. Then, I describe the computational procedure in S-plus and in R. I think you will prefer S-plus or R to other software, as both handle this type of work very well. If you do not have access to S-plus, you can get R, which is free, from www.r-project.org. If you want a quick solution, you may skip the formulas and read only the implementation part.

----- formulas and math part (you may skip) ---  

The F-statistic is calculated as SSB / d.f.B (or SS treat / d.f. treat) divided by SSW / d.f.W. Here, B means "between groups", W means "within groups", and treat refers to the treatments in an experiment. So, SSB is the sum of squares between groups (or SS treat is the sum of squares between treatments) and SSW is the sum of squares within groups. As you may know, d.f. is the degrees of freedom. A sum of squares divided by its degrees of freedom is referred to as a mean square (MS). The MS between groups (or between treatments) divided by the MS within groups is the F-statistic. When sample sizes are equal, the calculation is very simple.

When sample sizes are not equal and there is only one factor, the calculation is less simple but still straightforward, as follows:

Assume the number of groups is a, and let n_i be the number of subjects in group i.

\begin{displaymath}N=\sum_{i=1}^a n_i\end{displaymath}

\begin{displaymath}\bar X_i = \sum_{j=1}^{n_i} X_{ij} / n_i,~~i=1,\ldots,a\end{displaymath}

\begin{displaymath}\bar X_{..} = \sum_{i=1}^a n_i \bar X_i / N\end{displaymath}

\begin{displaymath}SS_{treat} = \sum_{i=1}^a n_i (\bar X_i - \bar X_{..})^2\end{displaymath}

\begin{displaymath}DF_{treat} = a-1\end{displaymath}

\begin{displaymath}SS_{within} = \sum_{i=1}^a \sum_{j=1}^{n_i} (X_{ij} - \bar X_i)^2\end{displaymath}

\begin{displaymath}DF_{within} = \sum_i n_i -a = N-a\end{displaymath}

\begin{displaymath}MS_{treat} = SS_{treat} / DF_{treat}\end{displaymath}

\begin{displaymath}MS_{within} = SS_{within} / DF_{within}\end{displaymath}

\begin{displaymath}F = {{MS_{treat}}\over{MS_{within}}}\end{displaymath}  

Then, compare F to the F distribution with a-1 and N-a degrees of freedom for your decision-making.
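To make the one-factor formulas concrete, here is a minimal sketch in R that follows them step by step. The three groups and their values are made up purely for illustration, and the variable names are my own choices rather than anything required by R.

```r
# Hypothetical data: three groups with unequal sample sizes (made-up numbers).
x <- list(g1 = c(4, 5, 6), g2 = c(7, 8, 9, 10), g3 = c(1, 2))

a     <- length(x)            # number of groups
n     <- sapply(x, length)    # n_i, the size of each group
N     <- sum(n)               # total number of subjects
xbar  <- sapply(x, mean)      # group means
grand <- sum(n * xbar) / N    # grand mean, weighted by group size

ss_treat  <- sum(n * (xbar - grand)^2)                          # SS between groups
ss_within <- sum(sapply(x, function(v) sum((v - mean(v))^2)))   # SS within groups
df_treat  <- a - 1
df_within <- N - a

f_stat  <- (ss_treat / df_treat) / (ss_within / df_within)      # MS treat / MS within
p_value <- pf(f_stat, df_treat, df_within, lower.tail = FALSE)  # upper-tail probability

print(c(F = f_stat, p = p_value))
```

The p-value is the upper-tail probability of the F distribution with a-1 and N-a degrees of freedom, which is the comparison described above.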

 

When there are two factors, the calculation is a little more complicated, but the principle is the same as above. As we always use software to complete the calculation, the formulas above are mainly there to show the principle behind it.

--------- implementation starts here -----

 

To compute the F-statistics, I suggest using S-plus or R. Using R is the same as using S-plus. After you start R and import your data, type aov(Dependent ~ Factor1 + Factor2, dataset). If you want to test the interaction between the two factors, type aov(Dependent ~ Factor1 + Factor2 + Factor1*Factor2, dataset) instead. If there is only one factor, just type aov(Dependent ~ Factor, dataset). Applying summary() to the result gives the ANOVA table, which includes the F-statistics and p-values for your decision-making.
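As a rough illustration of these calls, the sketch below builds a small made-up dataset and fits the three models. The data values are hypothetical; the column names Dependent, Factor1 and Factor2 are the same placeholders used above. In R, Factor1 * Factor2 is shorthand for both main effects plus their interaction, so it fits the same model as the longer formula.

```r
# Hypothetical dataset with unequal cell sizes (made-up numbers).
dataset <- data.frame(
  Dependent = c(12, 15, 14, 18, 20, 19, 11, 13, 22, 24),
  Factor1   = factor(c("a", "a", "a", "a", "a", "b", "b", "b", "b", "b")),
  Factor2   = factor(c("x", "x", "y", "y", "y", "x", "x", "x", "y", "y"))
)

fit_one  <- aov(Dependent ~ Factor1, data = dataset)            # one factor
fit_main <- aov(Dependent ~ Factor1 + Factor2, data = dataset)  # two main effects
fit_int  <- aov(Dependent ~ Factor1 * Factor2, data = dataset)  # adds the interaction

# summary() prints the ANOVA table with F values and p-values.
print(summary(fit_one))
print(summary(fit_main))
print(summary(fit_int))
```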

For your convenience, this linked web page contains an example I created of using R to calculate F-statistics.

The implementation is rather simple. However, if you are working on a two-factor ANOVA and the sample sizes are unequal, the order in which the factors are analyzed becomes very important. If you have factors A and B, and factor A is analyzed first (that is, entered first in the statistical software), the ANOVA table gives the sum of squares explained by A, then the sum of squares explained by B after removing the effects of A. Similarly, if B is entered first, the table gives the sum of squares for B, then for A after removing B's effects. In other words, the ANOVA table needs to be interpreted sequentially: each row shows the contribution of adding that term to a design that already contains the previous terms.

Therefore, when the sample sizes are unequal, the results depend on which factor is entered first. In general, you want to enter the less important factor first. If all the factors are equally important to you, you may fit the model with each ordering and combine the tables, taking for each factor the row computed after the other factors have been removed. Also, the function drop1 in S-plus or R shows the effect of dropping each term from your design, and its results can be interpreted without the sequential consideration. I guess you may not want to deal with this issue now. If you need more assistance with factor order later, please contact me anytime.
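If you would like to see the order effect for yourself, here is a small sketch on made-up unbalanced data. The data frame d and the names y, A and B are hypothetical. The sum of squares for A changes depending on whether A is entered first or last, and drop1 reports each factor adjusted for the other.

```r
# Hypothetical unbalanced two-factor data (made-up numbers).
# Cell counts: a1/b1 = 4, a1/b2 = 2, a2/b1 = 1, a2/b2 = 3.
d <- data.frame(
  y = c(10, 12, 11, 15, 20, 22, 14, 16, 15, 25),
  A = factor(c("a1", "a1", "a1", "a1", "a1", "a1", "a2", "a2", "a2", "a2")),
  B = factor(c("b1", "b1", "b1", "b1", "b2", "b2", "b1", "b2", "b2", "b2"))
)

fit_ab <- aov(y ~ A + B, data = d)  # A first, then B after removing A
fit_ba <- aov(y ~ B + A, data = d)  # B first, then A after removing B

print(summary(fit_ab))  # the row for A ignores B
print(summary(fit_ba))  # the row for A is adjusted for B

# drop1 tests each term given all the others, so the order does not matter.
print(drop1(fit_ab, test = "F"))
```

In the drop1 table, the row for each factor matches the row you get in the sequential table when that factor is entered last.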

By the way, in experimental design, an experiment with unequal sample sizes is referred to as an "unbalanced experiment". In S-plus or R, you can use the functions replications or alias to investigate the pattern of imbalance if needed.
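For example, a quick check of balance in R might look like the sketch below. The data frame d and its columns are made up for illustration. As far as I know, replications returns a plain vector when the design is balanced and a list when it is not, so is.list gives a simple test.

```r
# Hypothetical unbalanced two-factor data (made-up numbers).
d <- data.frame(
  y = c(10, 12, 11, 15, 20, 22, 14, 16, 15, 25),
  A = factor(c("a1", "a1", "a1", "a1", "a1", "a1", "a2", "a2", "a2", "a2")),
  B = factor(c("b1", "b1", "b1", "b1", "b2", "b2", "b1", "b2", "b2", "b2"))
)

# Number of observations at each level of each term in the design.
reps <- replications(y ~ A * B, data = d)
print(reps)

# A list result signals an unbalanced design.
unbalanced <- is.list(reps)
print(unbalanced)
```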

I hope the above answers most of your questions for this research. If you need any more assistance or would like me to take a look at your dataset, please do not hesitate to contact me.

Sincerely,

 Alex Liu
