cor
Correlation, Variance and Covariance (Matrices)
Description
var, cov and cor compute the variance of x and the covariance or correlation of x and y if these are vectors. If x and y are matrices then the covariances (or correlations) between the columns of x and the columns of y are computed.
cov2cor scales a covariance matrix into the corresponding correlation matrix efficiently.
Usage
var(x, y = NULL, na.rm = FALSE, use)
cov(x, y = NULL, use = "everything", method = c("pearson", "kendall", "spearman"))
cor(x, y = NULL, use = "everything", method = c("pearson", "kendall", "spearman"))
cov2cor(V)
Arguments
x | a numeric vector, matrix or data frame. |
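y | NULL (default) or a vector, matrix or data frame with compatible dimensions to x. The default is equivalent to y = x (but more efficient). |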
na.rm | logical. Should missing values be removed? |
use | an optional character string giving a method for computing covariances in the presence of missing values. This must be (an abbreviation of) one of the strings "everything", "all.obs", "complete.obs", "na.or.complete", or "pairwise.complete.obs". |
method | a character string indicating which correlation coefficient (or covariance) is to be computed. One of "pearson" (default), "kendall", or "spearman". |
V | symmetric numeric matrix, usually positive definite such as a covariance matrix. |
Details
For cov and cor one must either give a matrix or data frame for x or give both x and y.
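The two accepted call forms can be illustrated with a small sketch. The toy matrix `m` and the variable names are hypothetical and chosen only for this example.

```r
# Hypothetical toy data, chosen only for illustration.
m <- cbind(a = c(1, 2, 3, 4), b = c(2, 1, 4, 3))

r_matrix <- cor(m)                    # matrix input: 2 x 2 correlation matrix
r_pair   <- cor(m[, "a"], m[, "b"])   # both x and y given: a single number

# A lone vector is neither matrix-like nor paired with y,
# so cor() signals an error, which is caught here.
lone_vector <- tryCatch(cor(m[, "a"]), error = function(e) "error")
```

Here `r_matrix["a", "b"]` and `r_pair` hold the same correlation, while `lone_vector` ends up as the string "error".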
The inputs must be numeric (as determined by is.numeric; logical values are also allowed for historical compatibility). The "kendall" and "spearman" methods make sense for ordered inputs, but xtfrm can be used to find a suitable prior transformation to numbers.
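As a minimal sketch of the transformation mentioned above, an ordered factor can be passed through xtfrm before a rank-based correlation is computed. The data (`sizes`, `weight`) are made up for illustration.

```r
# Hypothetical ordered input; the values are illustrative only.
sizes <- factor(c("small", "large", "medium", "small", "large"),
                levels = c("small", "medium", "large"), ordered = TRUE)
weight <- c(1.2, 3.4, 2.0, 1.1, 3.9)

# A factor is not numeric, so passing it directly is an error.
direct <- tryCatch(cor(sizes, weight, method = "spearman"),
                   error = function(e) "error")

# xtfrm() maps the ordered levels to numbers that sort the same way.
rho <- cor(xtfrm(sizes), weight, method = "spearman")
```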
var is just another interface to cov, where na.rm is used to determine the default for use when that is unspecified. If na.rm is TRUE then the complete observations (rows) are used (use = "na.or.complete") to compute the variance. Otherwise, by default use = "everything".
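The relationship between na.rm and use can be checked on a short vector. This is a sketch with made-up values, not part of the original examples.

```r
x <- c(1, 2, NA, 4)   # illustrative data with one missing value

v_default <- var(x)                 # use = "everything": the NA propagates
v_narm    <- var(x, na.rm = TRUE)   # complete observations only

# The same result through cov() with the corresponding 'use' value.
v_cov <- cov(x, x, use = "na.or.complete")
```

Both `v_narm` and `v_cov` equal the variance of the three observed values, while `v_default` is NA.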
If use is "everything", NAs will propagate conceptually, i.e., a resulting value will be NA whenever one of its contributing observations is NA.
If use is "all.obs", then the presence of missing observations will produce an error. If use is "complete.obs" then missing values are handled by casewise deletion (and if there are no complete cases, that gives an error). "na.or.complete" is the same, except that it gives NA when there are no complete cases. Finally, if use has the value "pairwise.complete.obs" then the correlation or covariance between each pair of variables is computed using all complete pairs of observations on those variables. This can result in covariance or correlation matrices which are not positive semi-definite, as well as NA entries if there are no complete pairs for that pair of variables. For cov and var, "pairwise.complete.obs" only works with the "pearson" method. Note that (the equivalent of) var(double(0), use = *) gives NA for use = "everything" and "na.or.complete", and gives an error in the other cases.
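The five use settings can be compared side by side on a small data frame. The sketch below uses a made-up data frame `d` with one NA in each of two columns; all names are illustrative.

```r
# Hypothetical data: column b is missing in row 2, column c in row 3.
d <- data.frame(a = c(1, 2, 3, 4, 5),
                b = c(2, NA, 6, 8, 11),
                c = c(5, 3, NA, 1, 0))

c_every    <- cov(d)                                 # NA wherever b or c is involved
c_allobs   <- try(cov(d, use = "all.obs"), silent = TRUE)   # error: missing values
c_complete <- cov(d, use = "complete.obs")           # rows 1, 4, 5 only
c_pairwise <- cov(d, use = "pairwise.complete.obs")  # per-pair complete rows

# With no complete pairs at all, "complete.obs" fails but "na.or.complete" gives NA.
no_pairs_err <- try(cov(c(1, NA), c(NA, 2), use = "complete.obs"), silent = TRUE)
no_pairs_na  <- cov(c(1, NA), c(NA, 2), use = "na.or.complete")
```

Casewise deletion matches `cov(d[complete.cases(d), ])`, whereas the pairwise entry for a and b uses every row in which both a and b are observed.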
The denominator n - 1 is used, which gives an unbiased estimator of the (co)variance for i.i.d. observations. These functions return NA when there is only one observation (whereas S-PLUS has been returning NaN).
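The n - 1 denominator can be verified by hand. This is a minimal sketch on made-up numbers.

```r
x <- c(2, 4, 4, 6)    # illustrative data
n <- length(x)

# Sum of squared deviations divided by n - 1, written out explicitly.
manual_var <- sum((x - mean(x))^2) / (n - 1)
builtin_var <- var(x)

# A single observation leaves a zero denominator; the result is NA.
single_obs <- var(5)
```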
For cor(), if method is "kendall" or "spearman", Kendall's tau or Spearman's rho statistic is used to estimate a rank-based measure of association. These are more robust and have been recommended if the data do not necessarily come from a bivariate normal distribution.
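One way to see the difference between the three methods is a relationship that is perfectly monotone but far from linear. The data below are made up for illustration.

```r
x <- 1:10
y <- exp(x)   # strictly increasing in x, but strongly non-linear

pearson  <- cor(x, y)                        # penalised by the curvature
spearman <- cor(x, y, method = "spearman")   # depends on ranks only
kendall  <- cor(x, y, method = "kendall")    # depends on pair orderings only
```

Because the ordering of y follows the ordering of x exactly, both rank-based measures reach 1, while the Pearson coefficient stays clearly below 1.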
For cov(), a non-Pearson method is unusual but available for the sake of completeness. Note that "spearman" basically computes cor(R(x), R(y)) (or cov(., .)) where R(u) := rank(u, na.last = "keep"). In the case of missing values, the ranks are calculated depending on the value of use, either based on complete observations, or based on pairwise completeness with reranking for each pair.
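The rank identity above can be checked directly for data without missing values. The helper `R` mirrors the notation in the text and the vectors are illustrative.

```r
# R(u) as defined in the text.
R <- function(u) rank(u, na.last = "keep")

x <- c(10, 20, 20, 35, 50)   # illustrative data, including a tie in x
y <- c(3, 1, 4, 1, 5)        # and a tie in y

rho_direct <- cor(x, y, method = "spearman")
rho_ranks  <- cor(R(x), R(y))                # Pearson correlation of the ranks

cov_direct <- cov(x, y, method = "spearman")
cov_ranks  <- cov(R(x), R(y))
```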
When there are ties, Kendall's tau_b is computed, as proposed by Kendall (1945).
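As a sketch of what tau_b means, the statistic can be written as the sum of products of pairwise signs, normalised by the number of untied pairs in each variable. The function `kendall_tau_b` is a hypothetical helper written for this illustration, not the code R uses internally, and it assumes complete data.

```r
# Hypothetical helper: tau_b from pairwise signs (O(n^2) memory, no NA handling).
kendall_tau_b <- function(x, y) {
  sx <- sign(outer(x, x, "-"))   # +1 / 0 / -1 for every ordered pair in x
  sy <- sign(outer(y, y, "-"))   # the same for y
  # Numerator: concordant minus discordant pairs (each counted twice).
  # Denominator: geometric mean of the untied pair counts (each counted twice).
  sum(sx * sy) / sqrt(sum(sx^2) * sum(sy^2))
}

x <- c(1, 2, 2, 3, 4)   # illustrative data with a tie in x
y <- c(1, 3, 2, 2, 5)   # and a tie in y

tau_manual  <- kendall_tau_b(x, y)
tau_builtin <- cor(x, y, method = "kendall")
```

Tied pairs contribute zero to the numerator and are excluded from the corresponding denominator term, which is what distinguishes tau_b from the untied tau_a.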
Scaling a covariance matrix into a correlation matrix can be achieved in many ways: mathematically most appealing is multiplication by a diagonal matrix from the left and the right; more efficient is using sweep(.., FUN = "/") twice. The cov2cor function is even a bit more efficient, and is provided mostly for didactic reasons.
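The three scaling routes described above can be compared on a small covariance matrix. The data and variable names below are made up for illustration.

```r
# Hypothetical data used only to obtain a covariance matrix.
X <- cbind(a = c(1, 2, 3, 4, 5),
           b = c(2, 1, 4, 3, 7),
           c = c(9, 7, 8, 4, 1))
V <- cov(X)
s <- sqrt(diag(V))   # standard deviations

# 1. Multiply by a diagonal matrix from the left and the right.
R_diag <- diag(1 / s) %*% V %*% diag(1 / s)
dimnames(R_diag) <- dimnames(V)   # %*% with an unnamed diag() drops the names

# 2. Divide rows, then columns, by the standard deviations.
R_sweep <- sweep(sweep(V, 1, s, FUN = "/"), 2, s, FUN = "/")

# 3. The dedicated function.
R_c2c <- cov2cor(V)
```

All three agree up to floating-point tolerance, and each matches `cor(X)`.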
Value
For r <- cor(*, use = "all.obs"), it is now guaranteed that all(abs(r) <= 1).
Note
Some people have noted that the code for Kendall's tau is slow for very large datasets (many more than 1000 cases). It rarely makes sense to do such a computation, but see function cor.fk in package pcaPP.
References
Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988). The New S Language. Wadsworth & Brooks/Cole.
Kendall, M. G. (1938). A new measure of rank correlation. Biometrika, 30, 81–93. doi: 10.1093/biomet/30.1-2.81.
Kendall, M. G. (1945). The treatment of ties in rank problems. Biometrika, 33, 239–251. doi: 10.1093/biomet/33.3.239.
See Also
cor.test for confidence intervals (and tests).
cov.wt for weighted covariance computation.
sd for standard deviation (vectors).
Examples
var(1:10)  # 9.166667

var(1:5, 1:5) # 2.5

## Two simple vectors
cor(1:10, 2:11) # == 1

## Correlation Matrix of Multivariate sample:
(Cl <- cor(longley))
## Graphical Correlation Matrix:
symnum(Cl) # highly correlated

## Spearman's rho and Kendall's tau
symnum(clS <- cor(longley, method = "spearman"))
symnum(clK <- cor(longley, method = "kendall"))
## How much do they differ?
i <- lower.tri(Cl)
cor(cbind(P = Cl[i], S = clS[i], K = clK[i]))

## cov2cor() scales a covariance matrix by its diagonal
## to become the correlation matrix.
cov2cor # see the function definition {and learn ..}
stopifnot(all.equal(Cl, cov2cor(cov(longley))),
          all.equal(cor(longley, method = "kendall"),
                    cov2cor(cov(longley, method = "kendall"))))

##--- Missing value treatment:
C1 <- cov(swiss)
range(eigen(C1, only.values = TRUE)$values) # 6.19 1921

## swM := "swiss" with 3 "missing"s :
swM <- swiss
colnames(swM) <- abbreviate(colnames(swiss), minlength=6)
swM[1,2] <- swM[7,3] <- swM[25,5] <- NA # create 3 "missing"

## Consider all 5 "use" cases :
(C. <- cov(swM)) # use="everything" quite a few NA's in cov.matrix
try(cov(swM, use = "all")) # Error: missing obs...
C2 <- cov(swM, use = "complete")
stopifnot(identical(C2, cov(swM, use = "na.or.complete")))
range(eigen(C2, only.values = TRUE)$values) # 6.46 1930
C3 <- cov(swM, use = "pairwise")
range(eigen(C3, only.values = TRUE)$values) # 6.19 1938

## Kendall's tau doesn't change much:
symnum(Rc <- cor(swM, method = "kendall", use = "complete"))
symnum(Rp <- cor(swM, method = "kendall", use = "pairwise"))
symnum(R. <- cor(swiss, method = "kendall"))

## "pairwise" is closer componentwise,
summary(abs(c(1 - Rp/R.)))
summary(abs(c(1 - Rc/R.)))

## but "complete" is closer in Eigen space:
EV <- function(m) eigen(m, only.values=TRUE)$values
summary(abs(1 - EV(Rp)/EV(R.)) / abs(1 - EV(Rc)/EV(R.)))
Copyright (©) 1999–2012 R Foundation for Statistical Computing.
Licensed under the GNU General Public License.