diff --git a/.gitignore b/.gitignore index 2529aa6..b3bceef 100644 --- a/.gitignore +++ b/.gitignore @@ -6,4 +6,5 @@ docs renv.lock .renvignore .Rprofile -renv \ No newline at end of file +renv +inst/doc diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md new file mode 100644 index 0000000..1138c0a --- /dev/null +++ b/CONTRIBUTING.md @@ -0,0 +1,88 @@ + +# Contributing to `corrp` + +Thank you for your interest in contributing to the `corrp` package! + +We welcome contributions and improvements from the community. To help make the process smooth, please follow the guidelines outlined below. + +## How to Contribute + +### Reporting Bugs + +If you encounter a bug or issue, please follow these steps: +1. **Search for existing issues** to see if the bug has already been reported. +2. If not, **open a new issue** on the [GitHub Issues page](https://github.com/meantrix/corrp/issues). +3. Be sure to include the following details: + - A description of the issue. + - Steps to reproduce the issue. + - Any error messages or warnings. + - The version of the package you're using. + - The R version. + - The operating system, if relevant. + + +### Submitting Code + +To submit code changes, please follow these steps: +1. **Fork the repository**. +2. **Create a new branch** for your changes. +3. **Implement your changes**. Be sure to: + - Write clear and concise commit messages. + - Document new code and functions using `roxygen2` comments. + - Ensure your code follows the existing code style (e.g., indentation, naming conventions). + +4. **Run tests** to ensure your changes work as expected. To do this, use the following command: + +```R +rcmdcheck::rcmdcheck() +``` + +The result should show **one note** only, like this: + +``` +── R CMD check results ────────────────────────────────────────────────────────────────────────────────────────────────────────────────── corrp 0.5.0 ──── +Duration: 57.5s + +❯ checking installed package size ... NOTE + installed size is 5.9Mb + sub-directories of 1Mb or more: + libs 5.6Mb + +0 errors ✔ | 0 warnings ✔ | 1 note ✖ +``` + +Make sure there are **no errors** or **warnings** in the output. If there are, please resolve them before submitting your changes. + +5. **Push your changes** to your fork and create a pull request to the main repository. + +### Code Style Guidelines + +- **Indentation**: Use 2 spaces for indentation. +- **Naming conventions**: Follow the existing naming conventions in the codebase. +- **Documentation**: Document all public functions with `roxygen2` comments. +- **Testing**: Ensure that your changes are covered by tests, especially if you're adding new functionality or fixing bugs. + + + +### Running Tests + +We use [testthat](https://cran.r-project.org/web/packages/testthat/index.html) for unit testing. To run the tests, use the following command: + +```R +devtools::test() +``` + +Make sure all tests pass before submitting your changes. + +### Documentation + +If your changes introduce new functionality, make sure to: +- Update the relevant documentation using `roxygen2` comments. +- Update the `README.md` and any relevant vignettes. +- Update the `NEWS.md` with your changes. +- Update the version of the package. + +### License + +By contributing, you agree that your contributions will be licensed under the terms of the [license](LICENSE.md).
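The `roxygen2` guideline above is easier to follow with a template in front of you. Below is a minimal sketch of the documentation style used throughout the package (bracketed type annotations, `@return`, `@examples`, `@export`); the function `scale_value` is purely illustrative and not part of `corrp`:

```r
#' @title Scale a numeric value
#' @description Multiplies a numeric value by a scaling factor.
#'
#' @param x \[\code{numeric(1)}]\cr value to be scaled.
#' @param factor \[\code{numeric(1)}]\cr scaling factor. Default is 2.
#'
#' @return \[\code{numeric(1)}]\cr the scaled value.
#'
#' @examples
#' scale_value(3, factor = 10)
#'
#' @export
scale_value <- function(x, factor = 2) {
  checkmate::assert_number(x)
  checkmate::assert_number(factor)
  x * factor
}
```

After adding or changing documentation like this, run `devtools::document()` so the `man/` pages and `NAMESPACE` stay in sync.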
+ diff --git a/DESCRIPTION b/DESCRIPTION index 449b872..e8c3a93 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -17,25 +17,29 @@ Description: Compute multiple types of correlation analyses, including the average correlation clustering algorithm and distance correlation t-test. Imports: - Rcpp (>= 1.0.4.6), - RcppArmadillo (>= 0.11.2.0.0), - lsr (>= 0.5), - parallel (>= 3.6.3), - stats (>= 3.6.3), - caret (>= 6.0-85), - minerva (>= 1.5.8), - checkmate (>= 2.0.0), + Rcpp (>= 1.0.13-1), + RcppArmadillo (>= 14.2.2-1), + corrplot (>= 0.95), + lsr (>= 0.5.2), + parallel (>= 4.4.1), + stats (>= 4.4.1), + caret (>= 7.0-1), + minerva (>= 1.5.10), + checkmate (>= 2.3.2), ppsr (>= 0.0.2), DescTools (>= 0.99.40) Suggests: - corrplot, energy, + knitr, + rmarkdown, testthat License: GPL (>= 3) Encoding: UTF-8 LazyData: true LinkingTo: Rcpp, RcppArmadillo -RoxygenNote: 7.2.3 +RoxygenNote: 7.3.2 +Roxygen: list(markdown = TRUE) URL: https://github.com/meantrix/corrp, https://meantrix.github.io/corrp/ BugReports: https://github.com/meantrix/corrp/issues +VignetteBuilder: knitr diff --git a/NAMESPACE b/NAMESPACE index 87fa870..2edd4ae 100644 --- a/NAMESPACE +++ b/NAMESPACE @@ -13,6 +13,7 @@ S3method(corr_rm,matrix) S3method(sil_acca,acca_list) S3method(sil_acca,list) export(acca) +export(assert_required_argument) export(best_acca) export(corr_fun) export(corr_matrix) @@ -20,7 +21,9 @@ export(corr_rm) export(corrp) export(dcorT_test) export(ptest) +export(set_arguments) export(sil_acca) importFrom(Rcpp,evalCpp) importFrom(RcppArmadillo,armadillo_version) +importFrom(corrplot,corrplot) useDynLib(corrp, .registration=TRUE) diff --git a/NEWS.md b/NEWS.md index a12aecd..209524f 100644 --- a/NEWS.md +++ b/NEWS.md @@ -1,5 +1,31 @@ # CHANGELOG + +## 0.6.0 + +- Add `VignetteBuilder: knitr` to DESCRIPTION +- Add useful error messages for required parameters. +- Fix C++ `Astar` method. +- Run benchmarks, expand the paper to include statements on resource-intensive options, and incorporate an enhanced version of `energy::dcorT.test`. Also, change the data used in the paper. + + +### Methods Added + +- Added method `set_arguments`: Assigns provided arguments from the `args_list` to the parent environment. If an argument is among the arguments of the methods that calculate statistics, it assigns it in the parent environment and removes the argument from the list. +- Added method `assert_required_argument`: Ensures that a required argument is provided. If the argument is missing, it throws an error with a clear message. + +### Methods Altered + +- Altered messages and made `*.args` lists able to override the arguments (`p.value`, `comp`, `alternative`, `num.s`, `rk`) of the methods: `.corlm`, `.cramersvp`, `.dcorp`, `.corperp`, `.micorp`, `.uncorp`, `.corpps`. +- Update the `.corpps` method to support p-value testing (`p-test`), which is disabled by default due to its slow performance. When `p-test` is not performed, the `isig` value is set to `NA`. `p-test` can be run by assigning an element `ptest = TRUE` to the `pps.args` argument. + +### Documentation + +- Enhanced the documentation for `corrp` by including examples, refining the pair type section with additional details and references, and providing a more comprehensive explanation of the output format and its interpretation. +- Improved the documentation for `corr_rm` by adding examples and providing a clearer explanation of the `c` parameter. +- Added examples of usage in the documentation for: `acca`, `best_acca`, `corrp`, `corr_rm`, `corr_matrix`, `corr_fun`, `ptest`, `sil_acca`.
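To make the `pps.args`/`*.args` behaviour described in these entries concrete, here is a usage sketch (argument values are illustrative; the interface is the 0.6.0 one described above):

```r
library(corrp)

# The PPS permutation test is off by default (isig is reported as NA);
# it can be enabled through pps.args.
res <- corrp(iris,
             cor.nn = "pps", cor.nc = "pps", cor.cc = "pps",
             pps.args = list(ptest = TRUE, num.s = 50),
             verbose = FALSE)

# Each *.args list can also override the test parameters of its own method,
# e.g. a stricter p-value cutoff for the Pearson pairs only.
res2 <- corrp(iris,
              cor.nn = "pearson",
              pearson.args = list(p.value = 0.01),
              verbose = FALSE)

head(res$data)
```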
+ + ## 0.5.0 - Creates the package website with the command: `usethis::use_pkgdown_github_page`; @@ -25,7 +51,7 @@ - p.value: p-value of the t-test. - data.name: description of data. -### Changes +### Methods Altered - `corr_fun`: Now uses C++ while using distance correlation. ## 0.3.0 diff --git a/R/acca.R b/R/acca.R index e4be71d..fd653b7 100644 --- a/R/acca.R +++ b/R/acca.R @@ -1,6 +1,7 @@ #' @useDynLib corrp, .registration=TRUE #' @importFrom Rcpp evalCpp #' @importFrom RcppArmadillo armadillo_version +#' @importFrom corrplot corrplot #' @title Average correlation clustering algorithm #' @@ -16,13 +17,13 @@ #' @param maxrep \[\code{integer(1)}]\cr maximum number of #' interactions without change in the clusters. #' @param maxiter \[\code{integer(1)}]\cr maximum number of interactions. -#' @param ... Additional arguments (TODO). +#' @param ... Additional arguments. #' #' @return \[\code{acca_list(k)}]\cr A list with the #' final result of the clustering method. #' That is, the name of the variables belonging to each cluster k. #' -#' @author Igor D.S. Siciliani +#' @author Igor D.S. Siciliani, Paulo H. dos Santos #' #' @keywords correlation , acca #' #' @references #' Bhattacharya, Anindya, and Rajat K. De. #' "Average correlation clustering algorithm (ACCA) for grouping of co-regulated #' genes with similar pattern of variation in their expression values." #' Journal of Biomedical Informatics 43.4 (2010): 560-568. +#' +#' @examples #' -#' +#' x <- corrp::corrp(iris) +#' m <- corrp::corr_matrix(x) +#' corrp::acca(m, 2) #' #' @export #' -acca <- function(m, ...) { +acca <- function(m, k, ...) { + assert_required_argument(m, "The 'm' argument must be a cmatrix object, which is the output from corr_matrix function, or it must be a matrix.") + assert_required_argument(k, "The 'k' argument must be the number of clusters considered.") UseMethod("acca", m) } diff --git a/R/best_acca.R b/R/best_acca.R index 40aef61..04c54ad 100644 --- a/R/best_acca.R +++ b/R/best_acca.R @@ -12,7 +12,7 @@ #' without change in the clusters in the ACCA method. #' @param maxiter \[\code{integer(1)}]\cr maximum number #' of interactions in the ACCA method. -#' @param ... Additional arguments (TODO). +#' @param ... Additional arguments. #' #' @return \[\code{list(3)}]\cr A list with: #' silhouette average with per k `$silhouette.ave`; #' the sequence of clusters tested `$k` and #' the optimal number of clusters `$best.k`. #' @seealso \code{\link{sil_acca}} #' -#' @author Igor D.S. Siciliani +#' @author Igor D.S. Siciliani, Paulo H. dos Santos #' #' @keywords silhouette , acca , optimal , k #' @@ -35,11 +35,16 @@ #' "International Conference on Artificial Intelligence #' and Soft Computing. Springer, Cham, 2015. +#' @examples #' +#' x <- corrp::corrp(iris) +#' m <- corrp::corr_matrix(x) +#' best_acca(m, 2, 6) #' #' @export #' best_acca <- function(m, ...) { + assert_required_argument(m, "The 'm' argument must be a cmatrix object, which is the output from corr_matrix function, or it must be a matrix.") UseMethod("best_acca", m) } diff --git a/R/corr_fun.R b/R/corr_fun.R index aa6ccbb..6a22b4a 100644 --- a/R/corr_fun.R +++ b/R/corr_fun.R @@ -8,38 +8,8 @@ #' #' @name corr_fun #' -#' @section Details (Types): +#' @inheritSection corrp Pair Types #' -#' - \code{integer/numeric pair} Pearson Correlation using -#' \code{\link[stats]{cor}} function. The -#' value lies between -1 and 1.\cr -#' - \code{integer/numeric pair} Distance Correlation -#' using \code{\link[energy]{dcorT.test}} function.
The -#' value lies between 0 and 1.\cr -#' - \code{integer/numeric pair} Maximal Information Coefficient using -#' \code{\link[minerva]{mine}} function. The -#' value lies between 0 and 1.\cr -#' - \code{integer/numeric pair} Predictive Power Score using -#' \code{\link[ppsr]{score}} function. The -#' value lies between 0 and 1.\cr\cr -#' - \code{integer/numeric - factor/categorical pair} correlation coefficient or -#' squared root of R^2 coefficient of linear regression of integer/numeric -#' variable over factor/categorical variable using -#' \code{\link[stats]{lm}} function. The value -#' lies between 0 and 1.\cr -#' - \code{integer/numeric - factor/categorical pair} -#' Predictive Power Score using \code{\link[ppsr]{score}} function. -#' The value lies between 0 and 1.\cr\cr -#' - \code{factor/categorical pair} Cramer's V value is -#' computed based on chisq test and using -#' \code{\link[lsr]{cramersV}} function. The value lies -#' between 0 and 1.\cr -#' - \code{factor/categorical pair} Uncertainty coefficient -#' using \code{\link[DescTools]{UncertCoef}} function. The -#' value lies between 0 and 1.\cr -#' - \code{factor/categorical pair} Predictive Power Score -#' using \code{\link[ppsr]{score}} function. -#' The value lies between 0 and 1.\cr #' #' @return list with all statistical results.\cr #' - All statistical tests are controlled by the confidence internal of @@ -52,45 +22,12 @@ #' default the association measure(`infer.value`) will be `NA`. #' #' -#' @param df \[\code{data.frame(1)}]\cr input data frame. -#' @param nx \[\code{character(1)}]\cr column name of -#' independent/predictor variable. -#' @param ny \[\code{character(1)}]\cr column name of dependent/target variable. -#' @param p.value \[\code{logical(1)}]\cr -#' P-value probability of obtaining the observed results of a test, -#' assuming that the null hypothesis is correct. By default p.value=0.05 (Cutoff value for p-value.). -#' @param comp \[\code{character(1)}]\cr The param \code{p.value} must be greater -#' or less than those estimated in tests and correlations. -#' @param alternative \[\code{character(1)}]\cr a character string specifying the alternative hypothesis for -#' the correlation inference. It must be one of "two.sided" (default), "greater" or "less". -#' You can specify just the initial letter. -#' @param verbose \[\code{logical(1)}]\cr Activate verbose mode. -#' @param num.s \[\code{numeric(1)}]\cr Used in permutation test. The number of samples with -#' replacement created with y numeric vector. -#' @param rk \[\code{logical(1)}]\cr Used in permutation test. -#' if its TRUE transform x, y numeric vectors with samples ranks. -#' @param cor.nn \[\code{character(1)}]\cr -#' Choose correlation type to be used in integer/numeric pair inference. -#' The options are `pearson: Pearson Correlation`,`mic: Maximal Information Coefficient`, -#' `dcor: Distance Correlation`,`pps: Predictive Power Score`.Default is `Pearson Correlation`. -#' @param cor.nc \[\code{character(1)}]\cr -#' Choose correlation type to be used in integer/numeric - factor/categorical pair inference. -#' The option are `lm: Linear Model`,`pps: Predictive Power Score`. Default is `Linear Model`. -#' @param cor.cc \[\code{character(1)}]\cr -#' Choose correlation type to be used in factor/categorical pair inference. -#' The option are `cramersV: Cramer's V`,`uncoef: Uncertainty coefficient`, -#' `pps: Predictive Power Score`. Default is ` Cramer's V`. -#' @param lm.args \[\code{list(1)}]\cr additional parameters for the specific method. 
-#' @param pearson.args \[\code{list(1)}]\cr additional parameters for the specific method. -#' @param dcor.args \[\code{list(1)}]\cr additional parameters for the specific method. -#' @param mic.args \[\code{list(1)}]\cr additional parameters for the specific method. -#' @param pps.args \[\code{list(1)}]\cr additional parameters for the specific method. -#' @param uncoef.args \[\code{list(1)}]\cr additional parameters for the specific method. -#' @param cramersV.args \[\code{list(1)}]\cr additional parameters for the specific method. -#' @param ... Additional arguments (TODO). +#' @inheritParams corrp +#' @param nx \[\code{character(1)}]\cr first variable column name: independent/predictor variable. +#' @param ny \[\code{character(1)}]\cr second variable column name: dependent/target variable. #' #' -#' @author Igor D.S. Siciliani +#' @author Igor D.S. Siciliani, Paulo H. dos Santos #' #' @keywords correlation , power predictive score , linear model , distance correlation , #' mic , point biserial , pearson , cramer'sV @@ -103,17 +40,21 @@ #' Paul van der Laken, ppsr,2021. #' URL \url{https://github.com/paulvanderlaken/ppsr}. #' -#' +#' @examples +#' +#' # since both `nx` and `ny` columns are numerical the method type is defined by `cor.nn` +#' corr_fun(iris, nx = "Sepal.Length", ny = "Sepal.Width", cor.nn = "dcor") +#' #' @export corr_fun <- function(df, nx, ny, p.value = 0.05, verbose = TRUE, - num.s = 1000, + num.s = 250, rk = FALSE, comp = c("greater", "less"), - alternative = c("two.sided", "less", "greater"), + alternative = c("greater", "less", "two.sided"), cor.nn = c("pearson", "mic", "dcor", "pps"), cor.nc = c("lm", "pps"), cor.cc = c("cramersV", "uncoef", "pps"), @@ -121,10 +62,19 @@ corr_fun <- function(df, pearson.args = list(), dcor.args = list(), mic.args = list(), - pps.args = list(), + pps.args = list(ptest = FALSE), cramersV.args = list(), uncoef.args = list(), ...) { + + + assert_required_argument(df, + "The 'df' argument must be a data.frame containing the data to analyze.") + assert_required_argument(nx, + "The 'nx' argument must be a character vector specifying a column name from 'df' for the independent variable(s).") + assert_required_argument(ny, + "The 'ny' argument must be a character string specifying a column name from 'df' for the dependent variable.") + alternative <- match.arg(alternative) cor.nn <- match.arg(cor.nn) cor.nc <- match.arg(cor.nc) @@ -240,15 +190,14 @@ corr_fun <- function(df, ) } - if ((class(r) %in% "try-error")) { - msg <- "" - + if (inherits(r, "try-error")) { + msg <- "" if (verbose) { warnings(cat( "ERROR: some operations produces Nas values.", "\n", ny, " FUN ", nx, "\n" )) - msg <- r[[1]] + msg <- attr(r, "condition")$message } r <- list( diff --git a/R/corr_matrix.R b/R/corr_matrix.R index f82f393..4a7695f 100644 --- a/R/corr_matrix.R +++ b/R/corr_matrix.R @@ -3,20 +3,24 @@ #' @description Through the results obtained from corrp function #' create a correlation matrix. #' -#' @param c \[\code{corrp.list(1)}]\cr output from the \code{\link{corrp}} function. +#' @param c \[\code{clist(1)}]\cr output from the \code{\link{corrp}} function. #' @param col \[\code{character(1)}]\cr choose the column to be used in the correlation matrix. #' @param isig \[\code{logical(1)}]\cr values that are not statistically significant will #' be represented by NA or FALSE in the correlation matrix. #' @param ... Additional arguments (TODO). #' -#' @author Igor D.S. Siciliani +#' @author Igor D.S. Siciliani, Paulo H. 
dos Santos #' #' @keywords correlation matrix , corrp #' +#' @examples #' -#' +#' iris_cor <- corrp(iris) +#' iris_m <- corr_matrix(iris_cor, isig = FALSE) +#' corrplot::corrplot(iris_m) #' @export corr_matrix <- function(c, ...) { + assert_required_argument(c, "The 'c' argument must be a clist object, which is the output from corrp.") UseMethod("corr_matrix", c) } @@ -34,9 +38,6 @@ corr_matrix.clist <- function(c, col = c("infer.value", "stat.value", "isig"), i .corr_matrix(c = c, col = col, isig = isig, ...) } - - - .corr_matrix <- function(c, col = c("infer.value", "stat.value", "isig"), isig = TRUE, ...) { checkmate::assert_names(names(c), identical.to = c("data", "index")) checkmate::assert_logical(isig, len = 1) @@ -64,6 +65,5 @@ corr_matrix.clist <- function(c, col = c("infer.value", "stat.value", "isig"), i rownames(m) <- mnames colnames(m) <- mnames - return(structure(m, class = c("cmatrix", "matrix"))) } diff --git a/R/corr_rm.R b/R/corr_rm.R index c1a6307..c810abd 100644 --- a/R/corr_rm.R +++ b/R/corr_rm.R @@ -5,22 +5,31 @@ #' #' #' @param df \[\code{data.frame(1)}]\cr input data frame. -#' @param c \[\code{clist(1)} | \code{cmatrix(1)}]\cr correlation list output from \code{\link{corrp}} or -#' correlation matrix output from \code{\link{corr_matrix}}. +#' @param c \[\code{clist(1)} | \code{cmatrix(1)}]\cr correlation list output from the function \code{\link[corrp]{corrp}} +#' with class \code{clist} or correlation matrix output +#' from \code{\link[corrp]{corr_matrix}} with class \code{cmatrix}. #' @param cutoff \[\code{numeric(1)}]\cr A numeric value for the pair-wise absolute correlation cutoff. #' The default values is 0.75. #' @param col \[\code{character(1)}]\cr choose the column to be used in the correlation matrix #' @param isig \[\code{logical(1)}]\cr values that are not statistically significant will #' be represented by NA or FALSE in the correlation matrix. -#' @param ... Additional arguments (TODO). +#' @param ... Additional arguments. #' +#' @examples #' -#' @author Igor D.S. Siciliani +#' iris_clist <- corrp(iris) +#' iris_cmatrix <- corr_matrix(iris_clist) +#' corr_rm(df = iris, c = iris_clist, cutoff = 0.75, col = "infer.value", isig = FALSE) +#' corr_rm(df = iris, c = iris_cmatrix, cutoff = 0.75, col = "infer.value", isig = FALSE) +#' +#' @author Igor D.S. Siciliani, Paulo H. dos Santos #' #' @keywords highly correlated , cmatrix , clist #' #' @export corr_rm <- function(df, c, ...) { + assert_required_argument(df, "The 'df' argument must be a data.frame which columns will be filtered.") + assert_required_argument(c, "The 'c' argument must be a clist object, which is the output from corrp, or a cmatrix object, which is the output from corr_matrix.") UseMethod("corr_rm", c) } diff --git a/R/corrp.R b/R/corrp.R index 0ac1ebb..6562da4 100644 --- a/R/corrp.R +++ b/R/corrp.R @@ -8,38 +8,53 @@ #' #' @name corrp #' -#' @section Details (Types): +#' @section Pair Types: #' -#' - \code{integer/numeric pair} Pearson Correlation using \code{\link[stats]{cor}} function. The -#' value lies between -1 and 1.\cr -#' - \code{integer/numeric pair} Distance Correlation using \code{\link[energy]{dcorT.test}} function. The -#' value lies between 0 and 1.\cr -#' - \code{integer/numeric pair} Maximal Information Coefficient using \code{\link[minerva]{mine}} function. The -#' value lies between 0 and 1.\cr -#' - \code{integer/numeric pair} Predictive Power Score using \code{\link[ppsr]{score}} function. 
The -#' value lies between 0 and 1.\cr\cr -#' - \code{integer/numeric - factor/categorical pair} correlation coefficient or -#' squared root of R^2 coefficient of linear regression of integer/numeric -#' variable over factor/categorical variable using \code{\link[stats]{lm}} function. The value -#' lies between 0 and 1.\cr -#' - \code{integer/numeric - factor/categorical pair} Predictive Power Score using \code{\link[ppsr]{score}} function. The -#' value lies between 0 and 1.\cr\cr -#' - \code{factor/categorical pair} Cramer's V value is -#' computed based on chisq test and using \code{\link[lsr]{cramersV}} function. The value lies -#' between 0 and 1.\cr -#' - \code{factor/categorical pair} Uncertainty coefficient using \code{\link[DescTools]{UncertCoef}} function. The -#' value lies between 0 and 1.\cr -#' - \code{factor/categorical pair} Predictive Power Score using \code{\link[ppsr]{score}} function. The -#' value lies between 0 and 1.\cr +#' **Numeric pairs (integer/numeric):** +#' +#' - **Pearson Correlation Coefficient:** A widely used measure of the strength and direction of linear relationships. Implemented using \code{\link[stats]{cor}}. For more details, see \url{https://doi.org/10.1098/rspl.1895.0041}. The value lies between -1 and 1.\cr +#' - **Distance Correlation:** Based on the idea of expanding covariance to distances, it measures both linear and nonlinear associations between variables. Implemented using \code{\link[energy]{dcorT.test}}. For more details, see \url{https://doi.org/10.1214/009053607000000505}. The value lies between 0 and 1.\cr +#' - **Maximal Information Coefficient (MIC):** An information-based nonparametric method that can detect both linear and non-linear relationships between variables. Implemented using \code{\link[minerva]{mine}}. For more details, see \url{https://doi.org/10.1126/science.1205438}. The value lies between 0 and 1.\cr +#' - **Predictive Power Score (PPS):** A metric used to assess predictive relations between variables. Implemented using \code{\link[ppsr]{score}}. For more details, see \url{https://zenodo.org/record/4091345}. The value lies between 0 and 1.\cr\cr #' -#' @return list with two tables: data and index.\cr -#' - The `$data` table contains all the statistical results;\cr -#' - The `$index` table contains the pairs of indices used in each inference of the data table. -#' - All statistical tests are controlled by the confidence internal of +#' **Numeric and categorical pairs (integer/numeric - factor/categorical):** +#' +#' - **Square Root of R² Coefficient:** From linear regression of the numeric variable over the categorical variable. Implemented using \code{\link[stats]{lm}}. For more details, see \url{https://doi.org/10.4324/9780203774441}. The value lies between 0 and 1.\cr +#' - **Predictive Power Score (PPS):** A metric used to assess predictive relations between numeric and categorical variables. Implemented using \code{\link[ppsr]{score}}. For more details, see \url{https://zenodo.org/record/4091345}. The value lies between 0 and 1.\cr\cr +#' +#' **Categorical pairs (factor/categorical):** +#' +#' - **Cramér's V:** A measure of association between nominal variables. Computed based on a chi-squared test and implemented using \code{\link[lsr]{cramersV}}. For more details, see \url{https://doi.org/10.1515/9781400883868}. The value lies between 0 and 1.\cr +#' - **Uncertainty Coefficient:** A measure of nominal association between two variables. Implemented using \code{\link[DescTools]{UncertCoef}}. 
For more details, see \url{https://doi.org/10.1016/j.jbi.2010.02.001}. The value lies between 0 and 1.\cr +#' - **Predictive Power Score (PPS):** A metric used to assess predictive relations between categorical variables. Implemented using \code{\link[ppsr]{score}}. For more details, see \url{https://zenodo.org/record/4091345}. The value lies between 0 and 1.\cr +#' +#' @return +#' A list with two tables: `data` and `index`. +#' +#' - **data**: A table containing all the statistical results. The columns of this table are as follows: +#' +#' - `infer`: The method or metric used to assess the relationship between the variables (e.g., Maximal Information Coefficient or Predictive Power Score). +#' - `infer.value`: The value or score obtained from the specified inference method, representing the strength or quality of the relationship between the variables. +#' - `stat`: The statistical test or measure associated with the inference method (e.g., P-value or F1_weighted). +#' - `stat.value`: The numerical value corresponding to the statistical test or measure, providing additional context about the inference (e.g., significance or performance score). +#' - `isig`: A logical value indicating whether the statistical result is significant (`TRUE`) or not, based on predefined criteria (e.g., threshold for P-value). +#' - `msg`: A message or error related to the inference process. +#' - `varx`: The name of the first variable in the analysis (independent variable or feature). +#' - `vary`: The name of the second variable in the analysis (dependent/target variable). +#' +#' +#' +#' - **index**: A table that contains the pairs of indices used in each inference of the `data` table. +#' +#' +#' All statistical tests are controlled by the confidence interval of #' p.value param. If the statistical tests do not obtain a significance greater/less #' than p.value the value of variable `isig` will be `FALSE`.\cr -#' - There is no statistical significance test for the pps algorithm. By default `isig` is TRUE.\cr -#' - If any errors occur during operations the association measure(`infer.value`) will be `NA`. + +#' If any errors occur during operations the association measure (`infer.value`) will be `NA`.\cr +#' The resulting `data` and `index` tables will have \eqn{N^2} rows, where N is the number of variables in the input data. +#' By default there is no statistical significance test for the pps algorithm and `isig` is `NA`; you can enable it by setting `ptest = TRUE` in `pps.args`.\cr +#' All the `*.args` lists can modify the parameters (`p.value`, `comp`, `alternative`, `num.s`, `rk`, `ptest`) for the respective method via their prefix. #' #' @param df \[\code{data.frame(1)}]\cr input data frame. #' @param parallel \[\code{logical(1)}]\cr If its TRUE run the operations in parallel backend. @@ -68,16 +83,16 @@ #' Choose correlation type to be used in factor/categorical pair inference. #' The option are `cramersV: Cramer's V`,`uncoef: Uncertainty coefficient`, #' `pps: Predictive Power Score`. Default is ` Cramer's V`. -#' @param lm.args \[\code{list(1)}]\cr additional parameters for the specific method. -#' @param pearson.args \[\code{list(1)}]\cr additional parameters for the specific method. -#' @param dcor.args \[\code{list(1)}]\cr additional parameters for the specific method. -#' @param mic.args \[\code{list(1)}]\cr additional parameters for the specific method. -#' @param pps.args \[\code{list(1)}]\cr additional parameters for the specific method.
-#' @param uncoef.args \[\code{list(1)}]\cr additional parameters for the specific method. -#' @param cramersV.args \[\code{list(1)}]\cr additional parameters for the specific method. -#' @param ... Additional arguments (TODO). +#' @param lm.args \[\code{list(1)}]\cr additional parameters for linear model to be passed to \code{\link[stats]{lm}}. +#' @param pearson.args \[\code{list(1)}]\cr additional parameters for Pearson correlation to be passed to \code{\link[stats]{cor.test}}. +#' @param dcor.args \[\code{list(1)}]\cr additional parameters for the distance correlation to be passed to \code{\link[corrp]{dcorT_test}}. +#' @param mic.args \[\code{list(1)}]\cr additional parameters for the maximal information coefficient to be passed to \code{\link[minerva]{mine}}. +#' @param pps.args \[\code{list(1)}]\cr additional parameters for the predictive power score to be passed to \code{\link[ppsr]{score}}. +#' @param uncoef.args \[\code{list(1)}]\cr additional parameters for the uncertainty coefficient to be passed to \code{\link[DescTools]{UncertCoef}}. +#' @param cramersV.args \[\code{list(1)}]\cr additional parameters for the Cramer's V to be passed to \code{\link[lsr]{cramersV}}. +#' @param ... Additional arguments. #' -#' @author Igor D.S. Siciliani +#' @author Igor D.S. Siciliani, Paulo H. dos Santos #' #' @keywords correlation , power predictive score , linear model , distance correlation , #' mic , point biserial , pearson , cramer'sV @@ -89,7 +104,12 @@ #' #' Paul van der Laken, ppsr,2021. #' URL \url{https://github.com/paulvanderlaken/ppsr}. -#' +#' +#' @examples +#' iris_c <- corrp(iris) +#' iris_m <- corr_matrix(iris_c, isig = FALSE) +#' corrplot::corrplot(iris_m) +#' #' #' @export corrp <- function(df, @@ -97,10 +117,10 @@ corrp <- function(df, n.cores = 1, p.value = 0.05, verbose = TRUE, - num.s = 1000, + num.s = 250, rk = FALSE, comp = c("greater", "less"), - alternative = c("two.sided", "less", "greater"), + alternative = c("greater", "less", "two.sided"), cor.nn = c("pearson", "mic", "dcor", "pps"), cor.nc = c("lm", "pps"), cor.cc = c("cramersV", "uncoef", "pps"), @@ -108,10 +128,12 @@ corrp <- function(df, pearson.args = list(), dcor.args = list(), mic.args = list(), - pps.args = list(), + pps.args = list(ptest = FALSE), cramersV.args = list(), uncoef.args = list(), ...) { + + assert_required_argument(df, "The 'df' argument must be a data.frame containing the data to analyze.") alternative <- match.arg(alternative) cor.nn <- match.arg(cor.nn) cor.nc <- match.arg(cor.nc) diff --git a/R/ptest.R b/R/ptest.R index 7854927..28e25c8 100644 --- a/R/ptest.R +++ b/R/ptest.R @@ -8,16 +8,22 @@ #' @param num.s \[\code{numeric(1)}]\cr number of samples with replacement created with y numeric vector. #' @param rk \[\code{logical(1)}]\cr if its TRUE transform x, y numeric vectors with samples ranks. #' @param alternative \[\code{character(1)}]\cr a character string specifying the alternative hypothesis, -#' must be one of "two.sided" (default), "greater" or "less". You can specify just the initial letter. -#' @param ... Additional arguments (TODO). +#' must be one of "greater" (default), "less" or "two.sided". You can specify just the initial letter. +#' @param ... Additional arguments. +#' +#' @examples +#' +#' x <- iris[[1]] +#' y <- iris[[2]] +#' ptest(x, y, FUN = function(x, y) cor(x, y), alternative = "t") #' #' @export #' ptest <- function(x, y, FUN, rk = FALSE, - alternative = c("two.sided", "less", "greater"), - num.s = 1000, ...) 
{ + alternative = c("greater", "less", "two.sided"), + num.s = 250, ...) { FUN <- match.fun(FUN) # check mandatory args fargs <- formals(FUN) @@ -34,7 +40,7 @@ ptest <- function(x, y, checkmate::assert_number(num.s) if (is.data.frame(x)) x <- x[[1]] if (is.data.frame(y)) y <- y[[1]] - if (!rk) stopifnot(is.numeric(x), is.numeric(y)) + # if (!rk) stopifnot(is.numeric(x), is.numeric(y)) stopifnot(length(x) == length(y)) @@ -57,8 +63,8 @@ ptest <- function(x, y, "g" = { p.value <- mean(est >= obs) }, - "t" = { - p.value <- mean(abs(est) >= abs(obs)) + "t" = { + p.value <- min(mean(est >= obs), mean(est <= obs)) * 2 } ) diff --git a/R/sil_acca.R b/R/sil_acca.R index 651b936..f4ce35a 100644 --- a/R/sil_acca.R +++ b/R/sil_acca.R @@ -4,9 +4,9 @@ #' of interpretation and validation of consistency within acca clusters of data. #' #' @param acca \[\code{acca_list(1)}]\cr Acca clustering results from \code{\link{acca}} -#' @param m \[\code{matrix(1)}]\cr correlation matrix from \code{\link{corr_matrix}}. +#' @param m \[\code{cmatrix(1)|matrix(1)}]\cr correlation matrix from \code{\link{corr_matrix}}. #' By default the distance matrix(dist) used in this method is given by `dist = 1 - m`. -#' @param ... Additional arguments (TODO). +#' @param ... Additional arguments. #' #' @return \[\code{numeric(1)}]\cr the average value of #' the silhouette width over all data of the entire dataset. @@ -14,7 +14,7 @@ #' are very well clustered. #' #' -#' @author Igor D.S. Siciliani +#' @author Igor D.S. Siciliani, Paulo H. dos Santos #' #' @keywords silhouette , acca #' @@ -26,11 +26,19 @@ #' Starczewski, Artur, and Adam Krzyżak. "Performance evaluation of the silhouette index. #' " International Conference on Artificial Intelligence and Soft Computing. Springer, Cham, 2015. #' +#' @examples + #' +#' x <- corrp::corrp(iris) +#' m <- corrp::corr_matrix(x) +#' acca <- corrp::acca(m, 2) +#' sil_acca(acca, m) #' #' @export #' -sil_acca <- function(acca, ...) { +sil_acca <- function(acca, m, ...) { + assert_required_argument(acca, "The 'acca' argument must be a acca_list object, which is the output from acca function, or it must be a list.") + assert_required_argument(m, "The 'm' argument must be a cmatrix object, which is the output from corr_matrix function or it must be a matrix.") UseMethod("sil_acca", acca) } diff --git a/R/utils.R b/R/utils.R index c43f20d..779dcfc 100644 --- a/R/utils.R +++ b/R/utils.R @@ -8,6 +8,8 @@ infer <- "Linear Model" stat <- "P-value" + set_arguments(lm.args) + args <- c(list(y ~ as.factor(x)), lm.args) sum.res <- summary( @@ -28,9 +30,9 @@ isig <- TRUE if (verbose) { - msg <- paste( - ny, "vs.", nx, ".", - "Alternative hypothesis: true ", infer, " is not equal to 0.", + msg <- paste0( + ny, " vs. ", nx, ". ", + "Alternative hypothesis: true ", infer, " is not equal to 0. ", "P-value: ", pv, "." ) @@ -40,10 +42,10 @@ isig <- FALSE if (verbose) { - msg <- paste( - ny, "vs.", nx, ".", - "There is no correlation at the confidence level p-value.", - "P-value:", p.value, compare$str, "estimated p-value:", pv + msg <- paste0( + ny, " vs. ", nx, ". ", + "There is no correlation at the confidence level p-value. ", + "P-value:", p.value, " ", compare$str, " estimated p-value: ", pv, "." 
) message(msg) @@ -65,10 +67,13 @@ infer <- "Cramer\'s V" stat <- "P-value" + set_arguments(cramersV.args) + args <- c(list(x), list(y), cramersV.args) pv <- stats::chisq.test(x, y, simulate.p.value = TRUE)$p.value r <- do.call(lsr::cramersV, args) + compare <- .comparepv(x = pv, pv = p.value, comp = comp) msg <- "" @@ -77,9 +82,9 @@ isig <- TRUE if (verbose) { - msg <- paste( - ny, "vs.", nx, ".", - "Alternative hypothesis: true ", infer, " is not equal to 0.", + msg <- paste0( + ny, " vs. ", nx, ". ", + "Alternative hypothesis: true ", infer, " is not equal to 0. ", "P-value: ", pv, "." ) @@ -89,10 +94,10 @@ isig <- FALSE if (verbose) { - msg <- paste( - ny, "vs.", nx, ".", - "There is no correlation at the confidence level p-value.", - "P-value:", p.value, compare$str, "estimated p-value:", pv + msg <- paste0( + ny, " vs. ", nx, ". ", + "There is no correlation at the confidence level p-value. ", + "P-value:", p.value, " ", compare$str, " estimated p-value: ", pv, "." ) message(msg) @@ -110,6 +115,8 @@ infer <- "Distance Correlation" stat <- "P-value" + set_arguments(dcor.args) + args <- c(list(x), list(y), dcor.args) @@ -124,9 +131,9 @@ isig <- TRUE if (verbose) { - msg <- paste( - ny, "vs.", nx, ".", - "Alternative hypothesis: true ", infer, " is not equal to 0.", + msg <- paste0( + ny, " vs. ", nx, ". ", + "Alternative hypothesis: true ", infer, " is not equal to 0. ", "P-value: ", pv, "." ) @@ -136,10 +143,10 @@ isig <- FALSE if (verbose) { - msg <- paste( - ny, "vs.", nx, ".", - "There is no correlation at the confidence level p-value.", - "P-value:", p.value, compare$str, "estimated p-value:", pv + msg <- paste0( + ny, " vs. ", nx, ". ", + "There is no correlation at the confidence level p-value. ", + "P-value:", p.value, " ", compare$str, " estimated p-value: ", pv, "." ) message(msg) @@ -159,6 +166,8 @@ infer <- "Pearson Correlation" stat <- "P-value" + set_arguments(pearson.args) + pearson.args$alternative <- alternative # from global pearson.args$method <- "pearson" args <- c(list(x), list(y), pearson.args) @@ -174,9 +183,9 @@ isig <- TRUE if (verbose) { - msg <- paste( - ny, "vs.", nx, ".", - "Alternative hypothesis: true ", infer, " is not equal to 0.", + msg <- paste0( + ny, " vs. ", nx, ". ", + "Alternative hypothesis: true ", infer, " is not equal to 0. ", "P-value: ", pv, "." ) @@ -186,10 +195,10 @@ isig <- FALSE if (verbose) { - msg <- paste( - ny, "vs.", nx, ".", - "There is no correlation at the confidence level p-value.", - "P-value:", p.value, compare$str, "estimated p-value:", pv + msg <- paste0( + ny, " vs. ", nx, ". ", + "There is no correlation at the confidence level p-value. ", + "P-value:", p.value, " ", compare$str, " estimated p-value: ", pv, "." ) message(msg) @@ -211,6 +220,8 @@ infer <- "Maximal Information Coefficient" stat <- "P-value" + set_arguments(mic.args) + args <- c(list(x), list(y), mic.args) pv <- ptest(x, y, FUN = function(y, x) { @@ -232,9 +243,9 @@ isig <- TRUE if (verbose) { - msg <- paste( - ny, "vs.", nx, ".", - "Alternative hypothesis: true ", infer, " is not equal to 0.", + msg <- paste0( + ny, " vs. ", nx, ". ", + "Alternative hypothesis: true ", infer, " is not equal to 0. ", "P-value: ", pv, "." ) @@ -244,10 +255,10 @@ isig <- FALSE if (verbose) { - msg <- paste( - ny, "vs.", nx, ".", - "There is no correlation at the confidence level p-value.", - "P-value:", p.value, compare$str, "estimated p-value:", pv + msg <- paste0( + ny, " vs. ", nx, ". ", + "There is no correlation at the confidence level p-value. 
", + "P-value:", p.value, " ", compare$str, " estimated p-value: ", pv, "." ) message(msg) @@ -266,12 +277,14 @@ if (is.data.frame(x)) x <- x[[1]] if (is.data.frame(y)) y <- y[[1]] + set_arguments(uncoef.args) + args <- c(list(x), list(y), uncoef.args) pv <- ptest(y, x, FUN = function(x, y) { args <- c(list(x), list(y), uncoef.args) do.call(DescTools::UncertCoef, args) - }, rk = TRUE, num.s = num.s, alternative = alternative) + }, rk = rk, num.s = num.s, alternative = alternative) # pv = ptest(y,x,FUN = function(y,x) DescTools::UncertCoef(y,x) ) infer <- "Uncertainty coefficient" @@ -284,9 +297,9 @@ isig <- TRUE if (verbose) { - msg <- paste( - ny, "vs.", nx, ".", - "Alternative hypothesis: true ", infer, " is not equal to 0.", + msg <- paste0( + ny, " vs. ", nx, ". ", + "Alternative hypothesis: true ", infer, " is not equal to 0. ", "P-value: ", pv, "." ) @@ -296,10 +309,10 @@ isig <- FALSE if (verbose) { - msg <- paste( - ny, "vs.", nx, ".", - "There is no correlation at the confidence level p-value.", - "P-value:", p.value, compare$str, "estimated p-value:", pv + msg <- paste0( + ny, " vs. ", nx, ". ", + "There is no correlation at the confidence level p-value. ", + "P-value:", p.value, " ", compare$str, " estimated p-value: ", pv, "." ) message(msg) @@ -313,24 +326,64 @@ } # Predictive Power Score Calculations -.corpps <- function(x, y, nx, ny, verbose, pps.args = list(), ...) { - args <- c(list(data.frame(x, y)), list(nx), list(ny), pps.args) - - r <- do.call(ppsr::score, args) +.corpps <- function(x, y, nx, ny, p.value, comp, verbose, alternative, num.s, rk, pps.args = list(ptest = FALSE), ...) { + + ptest = FALSE + set_arguments(pps.args) + args <- c(list(data.frame(x = unlist(x), y = unlist(y))), list("x", "y"), pps.args) + + if (!isFALSE(ptest)) { + pv <- ptest(y, x, FUN = function(x, y) { + args <- c(list(data.frame(x = x, y = y)), list("x", "y"), pps.args) + r = do.call(ppsr::score, args) + return(r$pps) + }, rk = rk, num.s = num.s, alternative = alternative) + + compare <- .comparepv(x = pv, pv = p.value, comp = comp) + } + r <- do.call(ppsr::score, args) msg <- "" infer <- "Predictive Power Score" - infer.value <- r$pps - stat <- r$metric - stat.value <- r$model_score - isig <- TRUE + infer.value <- r$pps + stat = r$metric + stat.value = r$model_score + isig <- NA + + if (!isFALSE(ptest)) { + stat <- "P-value" + stat.value <- pv + + if (compare$comp) { + isig <- TRUE + + if (verbose) { + msg <- paste0( + ny, " vs. ", nx, ". ", + "Alternative hypothesis: true ", infer, " is not equal to 0. ", + "P-value: ", pv, ".\n" + ) + } + } else { + isig <- FALSE + + if (verbose) { + msg <- paste0( + ny, " vs. ", nx, ". ", + "There is no correlation at the confidence level p-value. ", + "P-value:", p.value, " ", compare$str, " estimated p-value: ", pv, ".\n" + ) + } + } + } if (verbose) { - msg <- paste( - "Target: ", ny, "vs. Predicted: ", nx, ".", - "Anothers Outputs(baseline_score,cv_folds,algorithm,model_type):", - r$baseline_score, ";", r$cv_folds, ";", r$algorithm, ";", r$model_type - ) + msg = paste0(msg, + "Model Parameters:", + "\nbaseline_score: ", r$baseline_score, + "\ncv_folds: ", r$cv_folds, ";", + "\nalgorithm: ", r$algorithm, ";", + "\nmodel_type: ", r$model_type, ".") message(msg) } @@ -363,3 +416,53 @@ x[null_indices] <- NA return(x) } + + +#' @title Assert Required Argument +#' @description Ensures that a required argument is provided. If the argument is missing, it throws an error with a clear message. +#' +#' @param arg \[\code{any}]\cr +#' The argument to check. 
+#' +#' @param description \[\code{character(1)}]\cr +#' A description of the argument's purpose and requirements. +#' +#' @return Throws an error if the argument is missing; otherwise, returns \code{NULL}. +#' +#' +#' @export +assert_required_argument <- function(arg, description) { + arg_name <- deparse(substitute(arg)) + t = try(arg, silent = TRUE) + if (inherits(t, "try-error")) { + stop(simpleError(paste( + sprintf("\n Missing required argument: '%s'.", arg_name), + description, + sep = "\n " + ), sys.call(sys.parent()))) + } +} + +#' @title Set Argument +#' @description Assigns provided arguments from the `args_list` to the parent environment. If an argument is inside the arguments of the methods that calculate statistics, it assigns it on the parent environment, and removes the argument from the list. +#' +#' @param args_list \[\code{list}]\cr +#' A named list of arguments to be assigned to the parent environment. +#' +#' @return A modified \code{args_list} with the arguments that were assigned to the parent environment removed. +#' +#' @export +set_arguments = function(args_list) { + checkmate::assert_list(args_list) + list_name <- deparse(substitute(args_list)) + + for (name_arg in names(args_list)) { + if (name_arg %in% c("p.value", "comp", "alternative", "num.s", "rk", "ptest")) { + assign(name_arg, args_list[[name_arg]], envir = parent.frame()) + args_list[[name_arg]] = NULL + } + } + + assign(list_name, args_list, envir = parent.frame()) + return(invisible()) +} diff --git a/README.md b/README.md index 2b4166e..bc9f321 100644 --- a/README.md +++ b/README.md @@ -21,7 +21,7 @@ The data.frame is allowed to have columns of these four classes: integer, numeri In this new package the correlation is automatically computed according to the follow options: #### integer/numeric pair: -- [Pearson correlation test](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) ; +- [Pearson correlation test](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient); - [Distance Correlation](https://en.wikipedia.org/wiki/Distance_correlation); - [Maximal Information Coefficient](https://en.wikipedia.org/wiki/Maximal_information_coefficient); - [Predictive Power Score](https://github.com/paulvanderlaken/ppsr). @@ -65,17 +65,30 @@ remotes::install_github("meantrix/corrp@main") `corrp` Next, we calculate the correlations for the data set iris using: Maximal Information Coefficient for numeric pair, the Power Predictive Score algorithm for numeric/categorical pair and Uncertainty coefficient for categorical pair. 
```r -results = corrp::corrp(iris, cor.nn = 'mic',cor.nc = 'pps',cor.cc = 'uncoef', n.cores = 2 , verbose = FALSE) +# coorp with using iris using parallel processing +results = corrp::corrp(iris, cor.nn = 'mic', cor.nc = 'pps',cor.cc = 'uncoef', n.cores = 2 , verbose = FALSE) +# an sequential example with different correlation pair types +results_2 = corrp::corrp(mtcars, cor.nn = 'pps', cor.nc = 'lm', cor.cc = 'cramersV', parallel = FALSE, verbose = FALSE) head(results$data) # infer infer.value stat stat.value isig msg varx vary -# Maximal Information Coefficient 0.9994870 P-value 0.0000000 TRUE Sepal.Length Sepal.Length -# Maximal Information Coefficient 0.2770503 P-value 0.0000000 TRUE Sepal.Length Sepal.Width -# Maximal Information Coefficient 0.7682996 P-value 0.0000000 TRUE Sepal.Length Petal.Length -# Maximal Information Coefficient 0.6683281 P-value 0.0000000 TRUE Sepal.Length Petal.Width -# Predictive Power Score 0.5591864 F1_weighted 0.7028029 TRUE Sepal.Length Species +# Maximal Information Coefficient 0.9994870 P-value 0.0000000 TRUE Sepal.Length Sepal.Length +# Maximal Information Coefficient 0.2770503 P-value 0.0000000 TRUE Sepal.Length Sepal.Width +# Maximal Information Coefficient 0.7682996 P-value 0.0000000 TRUE Sepal.Length Petal.Length +# Maximal Information Coefficient 0.6683281 P-value 0.0000000 TRUE Sepal.Length Petal.Width +# Predictive Power Score 0.5591864 F1_weighted 0.7028029 NA Sepal.Length Species # Maximal Information Coefficient 0.2770503 P-value 0.0000000 TRUE Sepal.Width Sepal.Length +head(results_2$data) + +# infer infer.value stat stat.value isig msg varx vary +# Predictive Power Score 1.0000000 NA NA mpg mpg +# Predictive Power Score 0.3861810 MAE 0.8899206 NA mpg cyl +# Predictive Power Score 0.3141056 MAE 74.7816795 NA mpg disp +# Predictive Power Score 0.2311418 MAE 42.3961506 NA mpg hp +# Predictive Power Score 0.1646116 MAE 0.3992651 NA mpg drat +# Predictive Power Score 0.2075760 MAE 0.5768637 NA mpg wt + ``` `corr_matrix` Using the previous result we can create a correlation matrix as follows: diff --git a/dev/Rcpp/benchmark.R b/dev/Rcpp/benchmark.R new file mode 100644 index 0000000..33a5638 --- /dev/null +++ b/dev/Rcpp/benchmark.R @@ -0,0 +1,33 @@ + +library(corrp) +source("./dev/memory_time.R") + +test_data_a = data.frame(a = runif(1e4)) +test_data_b = data.frame(b = runif(1e4)) + +benchmark_cpp = calculate_memory_runtime({ + dcorT_test(test_data_a, test_data_b) +}) +# MEMORY PEAK(mb): 4701.44140625 +# TIME (S): 6.022 + +benchmark_r = calculate_memory_runtime({ + energy::dcorT.test(test_data_a, test_data_b) +}) +# MEMORY PEAK(mb): 7000.65234375 +# TIME (S): 13.846 + +test_data_a = data.frame(a = runif(2e4)) +test_data_b = data.frame(b = runif(2e4)) + +benchmark_cpp = calculate_memory_runtime({ + dcorT_test(test_data_a, test_data_b) +}) +# MEMORY PEAK(mb): 18440.53515625 +# TIME (S): 25.977 + +benchmark_r = calculate_memory_runtime({ + energy::dcorT.test(test_data_a, test_data_b) +}) +# MEMORY PEAK(mb): 27598.3828125 +# TIME (S): 60.264 \ No newline at end of file diff --git a/dev/Rcpp/dist.R b/dev/Rcpp/dist.R index 5a686bf..df10357 100644 --- a/dev/Rcpp/dist.R +++ b/dev/Rcpp/dist.R @@ -1,28 +1,40 @@ +## Check consistency with energy -# Test the distCpp function -set.seed(42) -gc() -test_data = readRDS("dev/Rcpp/x.rds") -test_data = as.matrix(data.frame(a = c(3, 2, 4, 4))) +R_Astar = function(d) { + if (inherits(d, "dist")) + d <- as.matrix(d) + n <- nrow(d) + if (n != ncol(d)) stop("Argument d should be distance") + m <- rowMeans(d) + M <- 
mean(d) + a <- sweep(d, 1, m) + b <- sweep(a, 2, m) + A <- b + M -Rcpp::sourceCpp("src/dcorT.cpp") -dist_cpp = distCpp(test_data) # Compute distances using distCpp -dist_r = as.matrix(dist(test_data)) + A <- A - d / n + diag(A) <- m - M + (n / (n - 1)) * A +} -all(dist_r == dist_cpp) +Rcpp::sourceCpp("src/dcort.cpp") -all(Astar(dist_cpp) == Astarcpp(dist_cpp)) +# test_data = readRDS("dev/Rcpp/x.rds") +test_data = as.matrix(data.frame(a = c(1, 2, 4, 4))) +# Compute distances using distCpp +r_dist = as.matrix(stats::dist(test_data)) +cpp_dist = dist(test_data) +all(r_dist == cpp_dist) # TRUE +r_astar = R_Astar(r_dist) +cpp_astar = Astar(cpp_dist) -res_cpp = dcorTcpp(as.matrix(data.frame(a = c(3, 2, 4, 4))), as.matrix(data.frame(b = c(5, 3, 3, 7)))) +all(r_astar == cpp_astar) -res_energy = energy::dcorT.test(as.matrix(data.frame(a = c(3, 2, 4, 4))), as.matrix(data.frame(b = c(5, 3, 3, 7)))) -class(res_energy) = "list" +res_cpp = dcorT_test(as.matrix(data.frame(a = c(3, 2, 4, 4))), as.matrix(data.frame(b = c(5, 3, 3, 7)))) -dcorT_test(as.matrix(test_data), as.matrix(test_data)) +res_energy = energy::dcorT.test(as.matrix(data.frame(a = c(3, 2, 4, 4))), as.matrix(data.frame(b = c(5, 3, 3, 7)))) -energy::dcorT(as.matrix(test_data), as.matrix(test_data)) diff --git a/dev/memory_time.R b/dev/memory_time.R new file mode 100644 index 0000000..5c39c8b --- /dev/null +++ b/dev/memory_time.R @@ -0,0 +1,81 @@ +future::plan(strategy = future::multicore) +options(future.globals.maxSize = Inf) + +#' @title Calculate Memory Peak +#' @param expr expression to evaluate +calculate_memory <- function(expr) { + # https://stackoverflow.com/questions/58250531/memory-profiling-in-r-how-to-find-the-place-of-maximum-memory-usage + node_size <- function() { + bit <- 8L * .Machine$sizeof.pointer + if (!(bit == 32L || bit == 64L)) { + stop("Unknown architecture", call. 
= FALSE) + } + if (bit == 32L) + 28L + else 56L + } + + promise = future::future(globals = FALSE, seed = TRUE, { + res = expr + gc(reset = TRUE) + mallinfo::malloc.trim() + res + }) + + print(paste("Se\u00e7\u00e3o de R de monitoramento:", Sys.getpid())) + print(paste("Se\u00e7\u00e3o de R de execu\u00e7\u00e3o:", promise$job$pid)) + max_mem_used = 0 + while (TRUE) { + Sys.sleep(0.001) + current_mem = as.numeric((system(paste("ps -p", promise$job$pid, "-o rss="), intern = TRUE))) / 1024 + + if (current_mem > max_mem_used) + max_mem_used = current_mem + + if (future::resolved(promise)) + break + } + + res = future::value(promise) + rm(promise) + + gc(reset = TRUE) + mallinfo::malloc.trim() + # cat(sprintf("mem: %.1fMb.\n", res)) + return( + list( + # Máximo de memória + max_mem_used = max_mem_used, + # Resultado da expressão + res = res + ) + ) +} + +#' @title Calculate runtime +#' @param expr expression to evaluate +#' @param msg \[\code{character(1)}\]\cr msg +#' @param quiet \[\code{logical(1)}\]\cr logical +calculate_runtime <- function(expr, msg = "Time", quiet = TRUE) { + tictoc::tic(msg, quiet = quiet) + res = expr + t <- tictoc::toc() + time <- round(t$toc - t$tic, 3) + list(res = res, runtime = time) +} + +#' @title Calculate runtime and memory peak +#' @param expr expression to evaluate +#' @param msg \[\code{character(1)}\]\cr msg +#' @export +calculate_memory_runtime <- function(expr, msg = deparse(substitute(expr))) { + m <- calculate_memory( + calculate_runtime(expr, msg = msg) + ) + m$runtime = m[[2]]$runtime + m$res = m[[2]]$res + message("MEMORY PEAK(mb): ", m[[1]]) + message("TIME (S): ", m$runtime) + + return(m) +} diff --git a/dev/tests/benchmark.R b/dev/tests/benchmark.R new file mode 100644 index 0000000..0eba7be --- /dev/null +++ b/dev/tests/benchmark.R @@ -0,0 +1,62 @@ +library(corrp) +library(laeken) +data(eusilc) +source("./dev/memory_time.R") + + +eusilc = eusilc[1:1000, c("eqSS", "eqIncome", "db040", "rb090")] +colnames(eusilc) = c("House_Size", "Income", "State", "Sex") + +benchmarks_nn = list() +for (nn in c("pearson", "mic", "dcor", "pps")) { + + message(nn) + + bench = calculate_memory_runtime({ + corrp( + eusilc[, c("House_Size", "Income")], + parallel = FALSE, + cor.nn = nn + ) + }) + bench$res = NULL + benchmarks_nn[[nn]] = bench +} + +benchmarks_nc = list() +for (nc in c("lm", "pps")) { + + message(nc) + + bench = calculate_memory_runtime({ + corrp( + eusilc[, c("Income", "State")], + parallel = FALSE, + cor.nc = nc + ) + }) + bench$res = NULL + benchmarks_nc[[nc]] = bench + +} + + +benchmarks_cc = list() +for (cc in c("cramersV", "uncoef", "pps")) { + + message(cc) + + bench = calculate_memory_runtime({ + corrp( + eusilc[, c("State", "Sex")], + parallel = FALSE, + cor.cc = cc + ) + }) + + bench$res = NULL + benchmarks_cc[[cc]] = bench + +} + + diff --git a/man/acca.Rd b/man/acca.Rd index 74da049..9b4fe70 100644 --- a/man/acca.Rd +++ b/man/acca.Rd @@ -6,29 +6,29 @@ \alias{acca.matrix} \title{Average correlation clustering algorithm} \usage{ -acca(m, ...) +acca(m, k, ...) \method{acca}{cmatrix}(m, k, maxrep = 2L, maxiter = 100L, ...) \method{acca}{matrix}(m, k, maxrep = 2L, maxiter = 100L, ...) 
} \arguments{ -\item{m}{\[\code{matrix(1)}]\cr correlation matrix from +\item{m}{[\code{matrix(1)}]\cr correlation matrix from \code{\link{corr_matrix}} or a distance matrix.} -\item{...}{Additional arguments (TODO).} +\item{k}{[\code{integer(1)}]\cr number of clusters considered.} -\item{k}{\[\code{integer(1)}]\cr number of clusters considered.} +\item{...}{Additional arguments .} -\item{maxrep}{\[\code{integer(1)}]\cr maximum number of +\item{maxrep}{[\code{integer(1)}]\cr maximum number of interactions without change in the clusters.} -\item{maxiter}{\[\code{integer(1)}]\cr maximum number of interactions.} +\item{maxiter}{[\code{integer(1)}]\cr maximum number of interactions.} } \value{ -\[\code{acca_list(k)}]\cr A list with the +[\code{acca_list(k)}]\cr A list with the final result of the clustering method. - That is, the name of the variables belonging to each cluster k. +That is, the name of the variables belonging to each cluster k. } \description{ A C++ implementation of the ACCA method @@ -36,6 +36,13 @@ that works directly with the correlation matrix derived from the \code{\link{corr_matrix}} function. In this sense, this implementation differs from the original, it works with mixed data and several correlation methods. +} +\examples{ + +x <- corrp::corrp(iris) +m <- corrp::corr_matrix(x) +corrp::acca(m, 2) + } \references{ Bhattacharya, Anindya, and Rajat K. De. @@ -44,7 +51,7 @@ genes with similar pattern of variation in their expression values." Journal of Biomedical Informatics 43.4 (2010): 560-568. } \author{ -Igor D.S. Siciliani +Igor D.S. Siciliani, Paulo H. dos Santos } \keyword{,} \keyword{acca} diff --git a/man/assert_required_argument.Rd b/man/assert_required_argument.Rd new file mode 100644 index 0000000..8866c22 --- /dev/null +++ b/man/assert_required_argument.Rd @@ -0,0 +1,21 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/utils.R +\name{assert_required_argument} +\alias{assert_required_argument} +\title{Assert Required Argument} +\usage{ +assert_required_argument(arg, description) +} +\arguments{ +\item{arg}{[\code{any}]\cr +The argument to check.} + +\item{description}{[\code{character(1)}]\cr +A description of the argument's purpose and requirements.} +} +\value{ +Throws an error if the argument is missing; otherwise, returns \code{NULL}. +} +\description{ +Ensures that a required argument is provided. If the argument is missing, it throws an error with a clear message. +} diff --git a/man/best_acca.Rd b/man/best_acca.Rd index 9479f26..899b684 100644 --- a/man/best_acca.Rd +++ b/man/best_acca.Rd @@ -13,31 +13,38 @@ best_acca(m, ...) \method{best_acca}{matrix}(m, mink, maxk, maxrep = 2L, maxiter = 100L, ...) 
} \arguments{ -\item{m}{\[\code{matrix(1)}]\cr correlation matrix +\item{m}{[\code{matrix(1)}]\cr correlation matrix from \code{\link{corr_matrix}}.} -\item{...}{Additional arguments (TODO).} +\item{...}{Additional arguments.} -\item{mink}{\[\code{integer(1)}]\cr minimum number of clusters considered.} +\item{mink}{[\code{integer(1)}]\cr minimum number of clusters considered.} -\item{maxk}{\[\code{integer(1)}]\cr maximum number of clusters considered.} +\item{maxk}{[\code{integer(1)}]\cr maximum number of clusters considered.} -\item{maxrep}{\[\code{integer(1)}]\cr maximum number of interactions +\item{maxrep}{[\code{integer(1)}]\cr maximum number of interactions without change in the clusters in the ACCA method.} -\item{maxiter}{\[\code{integer(1)}]\cr maximum number +\item{maxiter}{[\code{integer(1)}]\cr maximum number of interactions in the ACCA method.} } \value{ -\[\code{list(3)}]\cr A list with: -silhouette average with per k `$silhouette.ave`; -the sequence of clusters tested `$k` and -the optimal number of clusters `$best.k`. +[\code{list(3)}]\cr A list with: +silhouette average with per k \verb{$silhouette.ave}; +the sequence of clusters tested \verb{$k} and +the optimal number of clusters \verb{$best.k}. } \description{ Determining the optimal number of cluster in the ACCA clustering using the average silhouette aproach. +} +\examples{ + +x <- corrp::corrp(iris) +m <- corrp::corr_matrix(x) +best_acca(m, 2, 6) + } \references{ Leonard Kaufman; Peter J. Rousseeuw (1990). @@ -54,7 +61,7 @@ and Soft Computing. Springer, Cham, 2015. \code{\link{sil_acca}} } \author{ -Igor D.S. Siciliani +Igor D.S. Siciliani, Paulo H. dos Santos } \keyword{,} \keyword{acca} diff --git a/man/corr_fun.Rd b/man/corr_fun.Rd index 1904a64..afed51f 100644 --- a/man/corr_fun.Rd +++ b/man/corr_fun.Rd @@ -10,10 +10,10 @@ corr_fun( ny, p.value = 0.05, verbose = TRUE, - num.s = 1000, + num.s = 250, rk = FALSE, comp = c("greater", "less"), - alternative = c("two.sided", "less", "greater"), + alternative = c("greater", "less", "two.sided"), cor.nn = c("pearson", "mic", "dcor", "pps"), cor.nc = c("lm", "pps"), cor.cc = c("cramersV", "uncoef", "pps"), @@ -21,132 +21,128 @@ corr_fun( pearson.args = list(), dcor.args = list(), mic.args = list(), - pps.args = list(), + pps.args = list(ptest = FALSE), cramersV.args = list(), uncoef.args = list(), ... ) } \arguments{ -\item{df}{\[\code{data.frame(1)}]\cr input data frame.} +\item{df}{[\code{data.frame(1)}]\cr input data frame.} -\item{nx}{\[\code{character(1)}]\cr column name of -independent/predictor variable.} +\item{nx}{[\code{character(1)}]\cr first variable column name: independent/predictor variable.} -\item{ny}{\[\code{character(1)}]\cr column name of dependent/target variable.} +\item{ny}{[\code{character(1)}]\cr second variable column name: dependent/target variable.} -\item{p.value}{\[\code{logical(1)}]\cr +\item{p.value}{[\code{logical(1)}]\cr P-value probability of obtaining the observed results of a test, assuming that the null hypothesis is correct. By default p.value=0.05 (Cutoff value for p-value.).} -\item{verbose}{\[\code{logical(1)}]\cr Activate verbose mode.} +\item{verbose}{[\code{logical(1)}]\cr Activate verbose mode.} -\item{num.s}{\[\code{numeric(1)}]\cr Used in permutation test. The number of samples with +\item{num.s}{[\code{numeric(1)}]\cr Used in permutation test. The number of samples with replacement created with y numeric vector.} -\item{rk}{\[\code{logical(1)}]\cr Used in permutation test. 
+\item{rk}{[\code{logical(1)}]\cr Used in permutation test. if its TRUE transform x, y numeric vectors with samples ranks.} -\item{comp}{\[\code{character(1)}]\cr The param \code{p.value} must be greater +\item{comp}{[\code{character(1)}]\cr The param \code{p.value} must be greater or less than those estimated in tests and correlations.} -\item{alternative}{\[\code{character(1)}]\cr a character string specifying the alternative hypothesis for +\item{alternative}{[\code{character(1)}]\cr a character string specifying the alternative hypothesis for the correlation inference. It must be one of "two.sided" (default), "greater" or "less". You can specify just the initial letter.} -\item{cor.nn}{\[\code{character(1)}]\cr +\item{cor.nn}{[\code{character(1)}]\cr Choose correlation type to be used in integer/numeric pair inference. -The options are `pearson: Pearson Correlation`,`mic: Maximal Information Coefficient`, -`dcor: Distance Correlation`,`pps: Predictive Power Score`.Default is `Pearson Correlation`.} +The options are \verb{pearson: Pearson Correlation},\verb{mic: Maximal Information Coefficient}, +\verb{dcor: Distance Correlation},\verb{pps: Predictive Power Score}.Default is \verb{Pearson Correlation}.} -\item{cor.nc}{\[\code{character(1)}]\cr +\item{cor.nc}{[\code{character(1)}]\cr Choose correlation type to be used in integer/numeric - factor/categorical pair inference. -The option are `lm: Linear Model`,`pps: Predictive Power Score`. Default is `Linear Model`.} +The option are \verb{lm: Linear Model},\verb{pps: Predictive Power Score}. Default is \verb{Linear Model}.} -\item{cor.cc}{\[\code{character(1)}]\cr +\item{cor.cc}{[\code{character(1)}]\cr Choose correlation type to be used in factor/categorical pair inference. -The option are `cramersV: Cramer's V`,`uncoef: Uncertainty coefficient`, -`pps: Predictive Power Score`. Default is ` Cramer's V`.} +The option are \verb{cramersV: Cramer's V},\verb{uncoef: Uncertainty coefficient}, +\verb{pps: Predictive Power Score}. 
Default is \verb{ Cramer's V}.} -\item{lm.args}{\[\code{list(1)}]\cr additional parameters for the specific method.} +\item{lm.args}{[\code{list(1)}]\cr additional parameters for linear model to be passed to \code{\link[stats]{lm}}.} -\item{pearson.args}{\[\code{list(1)}]\cr additional parameters for the specific method.} +\item{pearson.args}{[\code{list(1)}]\cr additional parameters for Pearson correlation to be passed to \code{\link[stats]{cor.test}}.} -\item{dcor.args}{\[\code{list(1)}]\cr additional parameters for the specific method.} +\item{dcor.args}{[\code{list(1)}]\cr additional parameters for the distance correlation to be passed to \code{\link[corrp]{dcorT_test}}.} -\item{mic.args}{\[\code{list(1)}]\cr additional parameters for the specific method.} +\item{mic.args}{[\code{list(1)}]\cr additional parameters for the maximal information coefficient to be passed to \code{\link[minerva]{mine}}.} -\item{pps.args}{\[\code{list(1)}]\cr additional parameters for the specific method.} +\item{pps.args}{[\code{list(1)}]\cr additional parameters for the predictive power score to be passed to \code{\link[ppsr]{score}}.} -\item{cramersV.args}{\[\code{list(1)}]\cr additional parameters for the specific method.} +\item{cramersV.args}{[\code{list(1)}]\cr additional parameters for the Cramer's V to be passed to \code{\link[lsr]{cramersV}}.} -\item{uncoef.args}{\[\code{list(1)}]\cr additional parameters for the specific method.} +\item{uncoef.args}{[\code{list(1)}]\cr additional parameters for the uncertainty coefficient to be passed to \code{\link[DescTools]{UncertCoef}}.} -\item{...}{Additional arguments (TODO).} +\item{...}{Additional arguments.} } \value{ list with all statistical results.\cr -- All statistical tests are controlled by the confidence internal of - p.value param. If the statistical tests do not +\itemize{ +\item All statistical tests are controlled by the confidence internal of +p.value param. If the statistical tests do not obtain a significance greater/less - than p.value the value of variable `isig` will be `FALSE`.\cr -- There is no statistical significance test -for the pps algorithm. By default `isig` is TRUE.\cr -- If any errors occur during operations by -default the association measure(`infer.value`) will be `NA`. +than p.value the value of variable \code{isig} will be \code{FALSE}.\cr +\item There is no statistical significance test +for the pps algorithm. By default \code{isig} is TRUE.\cr +\item If any errors occur during operations by +default the association measure(\code{infer.value}) will be \code{NA}. +} } \description{ Compute correlation type analysis on two mixed classes columns of a given dataframe. - The dataframe is allowed to have columns of these four classes: integer, - numeric, factor and character. The character column is considered as - categorical variable. +The dataframe is allowed to have columns of these four classes: integer, +numeric, factor and character. The character column is considered as +categorical variable. +} +\section{Pair Types}{ + + +\strong{Numeric pairs (integer/numeric):} +\itemize{ +\item \strong{Pearson Correlation Coefficient:} A widely used measure of the strength and direction of linear relationships. Implemented using \code{\link[stats]{cor}}. For more details, see \url{https://doi.org/10.1098/rspl.1895.0041}. The value lies between -1 and 1.\cr +\item \strong{Distance Correlation:} Based on the idea of expanding covariance to distances, it measures both linear and nonlinear associations between variables. 
Implemented using \code{\link[energy]{dcorT.test}}. For more details, see \url{https://doi.org/10.1214/009053607000000505}. The value lies between 0 and 1.\cr +\item \strong{Maximal Information Coefficient (MIC):} An information-based nonparametric method that can detect both linear and non-linear relationships between variables. Implemented using \code{\link[minerva]{mine}}. For more details, see \url{https://doi.org/10.1126/science.1205438}. The value lies between 0 and 1.\cr +\item \strong{Predictive Power Score (PPS):} A metric used to assess predictive relations between variables. Implemented using \code{\link[ppsr]{score}}. For more details, see \url{https://zenodo.org/record/4091345}. The value lies between 0 and 1.\cr\cr +} + +\strong{Numeric and categorical pairs (integer/numeric - factor/categorical):} +\itemize{ +\item \strong{Square Root of R² Coefficient:} From linear regression of the numeric variable over the categorical variable. Implemented using \code{\link[stats]{lm}}. For more details, see \url{https://doi.org/10.4324/9780203774441}. The value lies between 0 and 1.\cr +\item \strong{Predictive Power Score (PPS):} A metric used to assess predictive relations between numeric and categorical variables. Implemented using \code{\link[ppsr]{score}}. For more details, see \url{https://zenodo.org/record/4091345}. The value lies between 0 and 1.\cr\cr +} + +\strong{Categorical pairs (factor/categorical):} +\itemize{ +\item \strong{Cramér's V:} A measure of association between nominal variables. Computed based on a chi-squared test and implemented using \code{\link[lsr]{cramersV}}. For more details, see \url{https://doi.org/10.1515/9781400883868}. The value lies between 0 and 1.\cr +\item \strong{Uncertainty Coefficient:} A measure of nominal association between two variables. Implemented using \code{\link[DescTools]{UncertCoef}}. For more details, see \url{https://doi.org/10.1016/j.jbi.2010.02.001}. The value lies between 0 and 1.\cr +\item \strong{Predictive Power Score (PPS):} A metric used to assess predictive relations between categorical variables. Implemented using \code{\link[ppsr]{score}}. For more details, see \url{https://zenodo.org/record/4091345}. The value lies between 0 and 1.\cr } -\section{Details (Types)}{ - - -- \code{integer/numeric pair} Pearson Correlation using -\code{\link[stats]{cor}} function. The - value lies between -1 and 1.\cr -- \code{integer/numeric pair} Distance Correlation -using \code{\link[energy]{dcorT.test}} function. The - value lies between 0 and 1.\cr -- \code{integer/numeric pair} Maximal Information Coefficient using -\code{\link[minerva]{mine}} function. The - value lies between 0 and 1.\cr -- \code{integer/numeric pair} Predictive Power Score using -\code{\link[ppsr]{score}} function. The - value lies between 0 and 1.\cr\cr -- \code{integer/numeric - factor/categorical pair} correlation coefficient or - squared root of R^2 coefficient of linear regression of integer/numeric - variable over factor/categorical variable using -\code{\link[stats]{lm}} function. The value - lies between 0 and 1.\cr -- \code{integer/numeric - factor/categorical pair} -Predictive Power Score using \code{\link[ppsr]{score}} function. -The value lies between 0 and 1.\cr\cr -- \code{factor/categorical pair} Cramer's V value is - computed based on chisq test and using -\code{\link[lsr]{cramersV}} function. The value lies - between 0 and 1.\cr -- \code{factor/categorical pair} Uncertainty coefficient -using \code{\link[DescTools]{UncertCoef}} function. 
The - value lies between 0 and 1.\cr -- \code{factor/categorical pair} Predictive Power Score -using \code{\link[ppsr]{score}} function. -The value lies between 0 and 1.\cr } +\examples{ + +# since both `nx` and `ny` columns are numerical the method type is defined by `cor.nn` +corr_fun(iris, nx = "Sepal.Length", ny = "Sepal.Width", cor.nn = "dcor") + +} \references{ KS Srikanth,sidekicks,cor2, 2020. URL \url{https://github.com/talegari/sidekicks/}. - Paul van der Laken, ppsr,2021. URL \url{https://github.com/paulvanderlaken/ppsr}. } \author{ -Igor D.S. Siciliani +Igor D.S. Siciliani, Paulo H. dos Santos } \keyword{,} \keyword{biserial} diff --git a/man/corr_matrix.Rd b/man/corr_matrix.Rd index 357369e..5104dc1 100644 --- a/man/corr_matrix.Rd +++ b/man/corr_matrix.Rd @@ -13,21 +13,27 @@ corr_matrix(c, ...) \method{corr_matrix}{clist}(c, col = c("infer.value", "stat.value", "isig"), isig = TRUE, ...) } \arguments{ -\item{c}{\[\code{corrp.list(1)}]\cr output from the \code{\link{corrp}} function.} +\item{c}{[\code{clist(1)}]\cr output from the \code{\link{corrp}} function.} \item{...}{Additional arguments (TODO).} -\item{col}{\[\code{character(1)}]\cr choose the column to be used in the correlation matrix.} +\item{col}{[\code{character(1)}]\cr choose the column to be used in the correlation matrix.} -\item{isig}{\[\code{logical(1)}]\cr values that are not statistically significant will +\item{isig}{[\code{logical(1)}]\cr values that are not statistically significant will be represented by NA or FALSE in the correlation matrix.} } \description{ Through the results obtained from corrp function create a correlation matrix. } +\examples{ + +iris_cor <- corrp(iris) +iris_m <- corr_matrix(iris_cor, isig = FALSE) +corrplot::corrplot(iris_m) +} \author{ -Igor D.S. Siciliani +Igor D.S. Siciliani, Paulo H. dos Santos } \keyword{,} \keyword{correlation} diff --git a/man/corr_rm.Rd b/man/corr_rm.Rd index 012651a..5471b42 100644 --- a/man/corr_rm.Rd +++ b/man/corr_rm.Rd @@ -33,27 +33,36 @@ corr_rm(df, c, ...) \method{corr_rm}{matrix}(df, c, cutoff = 0.75, ...) } \arguments{ -\item{df}{\[\code{data.frame(1)}]\cr input data frame.} +\item{df}{[\code{data.frame(1)}]\cr input data frame.} -\item{c}{\[\code{clist(1)} | \code{cmatrix(1)}]\cr correlation list output from \code{\link{corrp}} or -correlation matrix output from \code{\link{corr_matrix}}.} +\item{c}{[\code{clist(1)} | \code{cmatrix(1)}]\cr correlation list output from the function \code{\link[corrp]{corrp}} +with class \code{clist} or correlation matrix output +from \code{\link[corrp]{corr_matrix}} with class \code{cmatrix}.} -\item{...}{Additional arguments (TODO).} +\item{...}{Additional arguments.} -\item{col}{\[\code{character(1)}]\cr choose the column to be used in the correlation matrix} +\item{col}{[\code{character(1)}]\cr choose the column to be used in the correlation matrix} -\item{isig}{\[\code{logical(1)}]\cr values that are not statistically significant will +\item{isig}{[\code{logical(1)}]\cr values that are not statistically significant will be represented by NA or FALSE in the correlation matrix.} -\item{cutoff}{\[\code{numeric(1)}]\cr A numeric value for the pair-wise absolute correlation cutoff. +\item{cutoff}{[\code{numeric(1)}]\cr A numeric value for the pair-wise absolute correlation cutoff. The default values is 0.75.} } \description{ Remove highly correlated variables from a data.frame using the corrp functions outputs and the caret package function \code{\link[caret]{findCorrelation}}. 
+} +\examples{ + +iris_clist <- corrp(iris) +iris_cmatrix <- corr_matrix(iris_clist) +corr_rm(df = iris, c = iris_clist, cutoff = 0.75, col = "infer.value", isig = FALSE) +corr_rm(df = iris, c = iris_cmatrix, cutoff = 0.75, col = "infer.value", isig = FALSE) + } \author{ -Igor D.S. Siciliani +Igor D.S. Siciliani, Paulo H. dos Santos } \keyword{,} \keyword{clist} diff --git a/man/corrp.Rd b/man/corrp.Rd index 81a614c..a85c229 100644 --- a/man/corrp.Rd +++ b/man/corrp.Rd @@ -10,10 +10,10 @@ corrp( n.cores = 1, p.value = 0.05, verbose = TRUE, - num.s = 1000, + num.s = 250, rk = FALSE, comp = c("greater", "less"), - alternative = c("two.sided", "less", "greater"), + alternative = c("greater", "less", "two.sided"), cor.nn = c("pearson", "mic", "dcor", "pps"), cor.nc = c("lm", "pps"), cor.cc = c("cramersV", "uncoef", "pps"), @@ -21,77 +21,92 @@ corrp( pearson.args = list(), dcor.args = list(), mic.args = list(), - pps.args = list(), + pps.args = list(ptest = FALSE), cramersV.args = list(), uncoef.args = list(), ... ) } \arguments{ -\item{df}{\[\code{data.frame(1)}]\cr input data frame.} +\item{df}{[\code{data.frame(1)}]\cr input data frame.} -\item{parallel}{\[\code{logical(1)}]\cr If its TRUE run the operations in parallel backend.} +\item{parallel}{[\code{logical(1)}]\cr If its TRUE run the operations in parallel backend.} -\item{n.cores}{\[\code{numeric(1)}]\cr The number of cores to use for parallel execution.} +\item{n.cores}{[\code{numeric(1)}]\cr The number of cores to use for parallel execution.} -\item{p.value}{\[\code{logical(1)}]\cr +\item{p.value}{[\code{logical(1)}]\cr P-value probability of obtaining the observed results of a test, assuming that the null hypothesis is correct. By default p.value=0.05 (Cutoff value for p-value.).} -\item{verbose}{\[\code{logical(1)}]\cr Activate verbose mode.} +\item{verbose}{[\code{logical(1)}]\cr Activate verbose mode.} -\item{num.s}{\[\code{numeric(1)}]\cr Used in permutation test. The number of samples with +\item{num.s}{[\code{numeric(1)}]\cr Used in permutation test. The number of samples with replacement created with y numeric vector.} -\item{rk}{\[\code{logical(1)}]\cr Used in permutation test. +\item{rk}{[\code{logical(1)}]\cr Used in permutation test. if its TRUE transform x, y numeric vectors with samples ranks.} -\item{comp}{\[\code{character(1)}]\cr The param \code{p.value} must be greater +\item{comp}{[\code{character(1)}]\cr The param \code{p.value} must be greater or less than those estimated in tests and correlations.} -\item{alternative}{\[\code{character(1)}]\cr a character string specifying the alternative hypothesis for +\item{alternative}{[\code{character(1)}]\cr a character string specifying the alternative hypothesis for the correlation inference. It must be one of "two.sided" (default), "greater" or "less". You can specify just the initial letter.} -\item{cor.nn}{\[\code{character(1)}]\cr +\item{cor.nn}{[\code{character(1)}]\cr Choose correlation type to be used in integer/numeric pair inference. 
-The options are `pearson: Pearson Correlation`,`mic: Maximal Information Coefficient`, -`dcor: Distance Correlation`,`pps: Predictive Power Score`.Default is `Pearson Correlation`.} +The options are \verb{pearson: Pearson Correlation},\verb{mic: Maximal Information Coefficient}, +\verb{dcor: Distance Correlation},\verb{pps: Predictive Power Score}.Default is \verb{Pearson Correlation}.} -\item{cor.nc}{\[\code{character(1)}]\cr +\item{cor.nc}{[\code{character(1)}]\cr Choose correlation type to be used in integer/numeric - factor/categorical pair inference. -The option are `lm: Linear Model`,`pps: Predictive Power Score`. Default is `Linear Model`.} +The option are \verb{lm: Linear Model},\verb{pps: Predictive Power Score}. Default is \verb{Linear Model}.} -\item{cor.cc}{\[\code{character(1)}]\cr +\item{cor.cc}{[\code{character(1)}]\cr Choose correlation type to be used in factor/categorical pair inference. -The option are `cramersV: Cramer's V`,`uncoef: Uncertainty coefficient`, -`pps: Predictive Power Score`. Default is ` Cramer's V`.} +The option are \verb{cramersV: Cramer's V},\verb{uncoef: Uncertainty coefficient}, +\verb{pps: Predictive Power Score}. Default is \verb{ Cramer's V}.} -\item{lm.args}{\[\code{list(1)}]\cr additional parameters for the specific method.} +\item{lm.args}{[\code{list(1)}]\cr additional parameters for linear model to be passed to \code{\link[stats]{lm}}.} -\item{pearson.args}{\[\code{list(1)}]\cr additional parameters for the specific method.} +\item{pearson.args}{[\code{list(1)}]\cr additional parameters for Pearson correlation to be passed to \code{\link[stats]{cor.test}}.} -\item{dcor.args}{\[\code{list(1)}]\cr additional parameters for the specific method.} +\item{dcor.args}{[\code{list(1)}]\cr additional parameters for the distance correlation to be passed to \code{\link[corrp]{dcorT_test}}.} -\item{mic.args}{\[\code{list(1)}]\cr additional parameters for the specific method.} +\item{mic.args}{[\code{list(1)}]\cr additional parameters for the maximal information coefficient to be passed to \code{\link[minerva]{mine}}.} -\item{pps.args}{\[\code{list(1)}]\cr additional parameters for the specific method.} +\item{pps.args}{[\code{list(1)}]\cr additional parameters for the predictive power score to be passed to \code{\link[ppsr]{score}}.} -\item{cramersV.args}{\[\code{list(1)}]\cr additional parameters for the specific method.} +\item{cramersV.args}{[\code{list(1)}]\cr additional parameters for the Cramer's V to be passed to \code{\link[lsr]{cramersV}}.} -\item{uncoef.args}{\[\code{list(1)}]\cr additional parameters for the specific method.} +\item{uncoef.args}{[\code{list(1)}]\cr additional parameters for the uncertainty coefficient to be passed to \code{\link[DescTools]{UncertCoef}}.} -\item{...}{Additional arguments (TODO).} +\item{...}{Additional arguments.} } \value{ -list with two tables: data and index.\cr -- The `$data` table contains all the statistical results;\cr -- The `$index` table contains the pairs of indices used in each inference of the data table. -- All statistical tests are controlled by the confidence internal of - p.value param. If the statistical tests do not obtain a significance greater/less - than p.value the value of variable `isig` will be `FALSE`.\cr -- There is no statistical significance test for the pps algorithm. By default `isig` is TRUE.\cr -- If any errors occur during operations the association measure(`infer.value`) will be `NA`. +A list with two tables: \code{data} and \code{index}. 
+\itemize{ +\item \strong{data}: A table containing all the statistical results. The columns of this table are as follows: +\itemize{ +\item \code{infer}: The method or metric used to assess the relationship between the variables (e.g., Maximal Information Coefficient or Predictive Power Score). +\item \code{infer.value}: The value or score obtained from the specified inference method, representing the strength or quality of the relationship between the variables. +\item \code{stat}: The statistical test or measure associated with the inference method (e.g., P-value or F1_weighted). +\item \code{stat.value}: The numerical value corresponding to the statistical test or measure, providing additional context about the inference (e.g., significance or performance score). +\item \code{isig}: A logical value indicating whether the statistical result is significant (\code{TRUE}) or not, based on predefined criteria (e.g., threshold for P-value). +\item \code{msg}: A message or error related to the inference process. +\item \code{varx}: The name of the first variable in the analysis (independent variable or feature). +\item \code{vary}: The name of the second variable in the analysis (dependent/target variable). +} +\item \strong{index}: A table that contains the pairs of indices used in each inference of the \code{data} table. +} + +All statistical tests are controlled by the significance level given by the +\code{p.value} param. If a statistical test does not reach a significance greater/less +than \code{p.value}, the variable \code{isig} will be \code{FALSE}.\cr +If any errors occur during the operations, the association measure (\code{infer.value}) will be \code{NA}.\cr +The resulting \code{data} and \code{index} tables will have \eqn{N^2} rows, where N is the number of variables in the input data. +There is no statistical significance test for the pps algorithm by default, so \code{isig} is \code{NA}; it can be enabled by setting \code{ptest = TRUE} in \code{pps.args}.\cr +Each \verb{*.args} list can modify the parameters (\code{p.value}, \code{comp}, \code{alternative}, \code{num.s}, \code{rk}, \code{ptest}) of the method indicated by its prefix. } \description{ Compute correlations type analysis on mixed classes columns of larges dataframes @@ -100,42 +115,47 @@ The dataframe is allowed to have columns of these four classes: integer, numeric, factor and character. The character column is considered as categorical variable. } -\section{Details (Types)}{ - - -- \code{integer/numeric pair} Pearson Correlation using \code{\link[stats]{cor}} function. The - value lies between -1 and 1.\cr -- \code{integer/numeric pair} Distance Correlation using \code{\link[energy]{dcorT.test}} function. The - value lies between 0 and 1.\cr -- \code{integer/numeric pair} Maximal Information Coefficient using \code{\link[minerva]{mine}} function. The - value lies between 0 and 1.\cr -- \code{integer/numeric pair} Predictive Power Score using \code{\link[ppsr]{score}} function. The - value lies between 0 and 1.\cr\cr -- \code{integer/numeric - factor/categorical pair} correlation coefficient or - squared root of R^2 coefficient of linear regression of integer/numeric - variable over factor/categorical variable using \code{\link[stats]{lm}} function. The value - lies between 0 and 1.\cr -- \code{integer/numeric - factor/categorical pair} Predictive Power Score using \code{\link[ppsr]{score}} function. 
The - value lies between 0 and 1.\cr\cr -- \code{factor/categorical pair} Cramer's V value is - computed based on chisq test and using \code{\link[lsr]{cramersV}} function. The value lies - between 0 and 1.\cr -- \code{factor/categorical pair} Uncertainty coefficient using \code{\link[DescTools]{UncertCoef}} function. The - value lies between 0 and 1.\cr -- \code{factor/categorical pair} Predictive Power Score using \code{\link[ppsr]{score}} function. The - value lies between 0 and 1.\cr +\section{Pair Types}{ + + +\strong{Numeric pairs (integer/numeric):} +\itemize{ +\item \strong{Pearson Correlation Coefficient:} A widely used measure of the strength and direction of linear relationships. Implemented using \code{\link[stats]{cor}}. For more details, see \url{https://doi.org/10.1098/rspl.1895.0041}. The value lies between -1 and 1.\cr +\item \strong{Distance Correlation:} Based on the idea of expanding covariance to distances, it measures both linear and nonlinear associations between variables. Implemented using \code{\link[energy]{dcorT.test}}. For more details, see \url{https://doi.org/10.1214/009053607000000505}. The value lies between 0 and 1.\cr +\item \strong{Maximal Information Coefficient (MIC):} An information-based nonparametric method that can detect both linear and non-linear relationships between variables. Implemented using \code{\link[minerva]{mine}}. For more details, see \url{https://doi.org/10.1126/science.1205438}. The value lies between 0 and 1.\cr +\item \strong{Predictive Power Score (PPS):} A metric used to assess predictive relations between variables. Implemented using \code{\link[ppsr]{score}}. For more details, see \url{https://zenodo.org/record/4091345}. The value lies between 0 and 1.\cr\cr +} + +\strong{Numeric and categorical pairs (integer/numeric - factor/categorical):} +\itemize{ +\item \strong{Square Root of R² Coefficient:} From linear regression of the numeric variable over the categorical variable. Implemented using \code{\link[stats]{lm}}. For more details, see \url{https://doi.org/10.4324/9780203774441}. The value lies between 0 and 1.\cr +\item \strong{Predictive Power Score (PPS):} A metric used to assess predictive relations between numeric and categorical variables. Implemented using \code{\link[ppsr]{score}}. For more details, see \url{https://zenodo.org/record/4091345}. The value lies between 0 and 1.\cr\cr +} + +\strong{Categorical pairs (factor/categorical):} +\itemize{ +\item \strong{Cramér's V:} A measure of association between nominal variables. Computed based on a chi-squared test and implemented using \code{\link[lsr]{cramersV}}. For more details, see \url{https://doi.org/10.1515/9781400883868}. The value lies between 0 and 1.\cr +\item \strong{Uncertainty Coefficient:} A measure of nominal association between two variables. Implemented using \code{\link[DescTools]{UncertCoef}}. For more details, see \url{https://doi.org/10.1016/j.jbi.2010.02.001}. The value lies between 0 and 1.\cr +\item \strong{Predictive Power Score (PPS):} A metric used to assess predictive relations between categorical variables. Implemented using \code{\link[ppsr]{score}}. For more details, see \url{https://zenodo.org/record/4091345}. The value lies between 0 and 1.\cr +} } +\examples{ + iris_c <- corrp(iris) + iris_m <- corr_matrix(iris_c, isig = FALSE) + corrplot::corrplot(iris_m) + + +} \references{ KS Srikanth,sidekicks,cor2, 2020. URL \url{https://github.com/talegari/sidekicks/}. - Paul van der Laken, ppsr,2021. 
URL \url{https://github.com/paulvanderlaken/ppsr}. } \author{ -Igor D.S. Siciliani +Igor D.S. Siciliani, Paulo H. dos Santos } \keyword{,} \keyword{biserial} diff --git a/man/dcorT_test.Rd b/man/dcorT_test.Rd index f424201..69bd921 100644 --- a/man/dcorT_test.Rd +++ b/man/dcorT_test.Rd @@ -7,18 +7,18 @@ dcorT_test(x, y) } \arguments{ -\item{x}{\[\code{data.frame(1) | matrix(1)}]\cr A data of the first sample.} +\item{x}{[\code{data.frame(1) | matrix(1)}]\cr A data of the first sample.} -\item{y}{\[\code{data.frame(1) | matrix(1)}]\cr A data of the second sample.} +\item{y}{[\code{data.frame(1) | matrix(1)}]\cr A data of the second sample.} } \value{ returns a list containing - \item{method}{description of test} - \item{statistic}{observed value of the test statistic} - \item{parameter}{degrees of freedom} - \item{estimate}{(bias corrected) squared dCor(x,y)} - \item{p.value}{p-value of the t-test} - \item{data.name}{description of data} +\item{method}{description of test} +\item{statistic}{observed value of the test statistic} +\item{parameter}{degrees of freedom} +\item{estimate}{(bias corrected) squared dCor(x,y)} +\item{p.value}{p-value of the t-test} +\item{data.name}{description of data} } \description{ Distance correlation t-test of multivariate independence for high dimension. C++ version of energy::dcorT.test. diff --git a/man/ptest.Rd b/man/ptest.Rd index 96ab0b7..f5d846a 100644 --- a/man/ptest.Rd +++ b/man/ptest.Rd @@ -9,29 +9,36 @@ ptest( y, FUN, rk = FALSE, - alternative = c("two.sided", "less", "greater"), - num.s = 1000, + alternative = c("greater", "less", "two.sided"), + num.s = 250, ... ) } \arguments{ -\item{x}{\[\code{numeric(1)}]\cr a numeric vector.} +\item{x}{[\code{numeric(1)}]\cr a numeric vector.} -\item{y}{\[\code{numeric(1)}]\cr a numeric vector.} +\item{y}{[\code{numeric(1)}]\cr a numeric vector.} -\item{FUN}{\[\code{function(1)}]\cr the function to be applied} +\item{FUN}{[\code{function(1)}]\cr the function to be applied} -\item{rk}{\[\code{logical(1)}]\cr if its TRUE transform x, y numeric vectors with samples ranks.} +\item{rk}{[\code{logical(1)}]\cr if its TRUE transform x, y numeric vectors with samples ranks.} -\item{alternative}{\[\code{character(1)}]\cr a character string specifying the alternative hypothesis, -must be one of "two.sided" (default), "greater" or "less". You can specify just the initial letter.} +\item{alternative}{[\code{character(1)}]\cr a character string specifying the alternative hypothesis, +must be one of "greater" (default), "less" or "two.sided". You can specify just the initial letter.} -\item{num.s}{\[\code{numeric(1)}]\cr number of samples with replacement created with y numeric vector.} +\item{num.s}{[\code{numeric(1)}]\cr number of samples with replacement created with y numeric vector.} -\item{...}{Additional arguments (TODO).} +\item{...}{Additional arguments.} } \description{ Execute one-sample permutation test on two numeric vector. Its keep one vector constant and ‘shuffle’ the other by resampling. This approximates the null hypothesis — that there is no dependency/difference between the variables. 
} +\examples{ + +x <- iris[[1]] +y <- iris[[2]] +ptest(x, y, FUN = function(x, y) cor(x, y), alternative = "t") + +} diff --git a/man/set_arguments.Rd b/man/set_arguments.Rd new file mode 100644 index 0000000..06326be --- /dev/null +++ b/man/set_arguments.Rd @@ -0,0 +1,18 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/utils.R +\name{set_arguments} +\alias{set_arguments} +\title{Set Argument} +\usage{ +set_arguments(args_list) +} +\arguments{ +\item{args_list}{[\code{list}]\cr +A named list of arguments to be assigned to the parent environment.} +} +\value{ +A modified \code{args_list} with the arguments that were assigned to the parent environment removed. +} +\description{ +Assigns provided arguments from the \code{args_list} to the parent environment. If an argument is inside the arguments of the methods that calculate statistics, it assigns it on the parent environment, and removes the argument from the list. +} diff --git a/man/sil_acca.Rd b/man/sil_acca.Rd index ce20389..dd68c50 100644 --- a/man/sil_acca.Rd +++ b/man/sil_acca.Rd @@ -6,29 +6,37 @@ \alias{sil_acca.list} \title{Silhouette (clustering)} \usage{ -sil_acca(acca, ...) +sil_acca(acca, m, ...) \method{sil_acca}{acca_list}(acca, m, ...) \method{sil_acca}{list}(acca, m, ...) } \arguments{ -\item{acca}{\[\code{acca_list(1)}]\cr Acca clustering results from \code{\link{acca}}} +\item{acca}{[\code{acca_list(1)}]\cr Acca clustering results from \code{\link{acca}}} -\item{...}{Additional arguments (TODO).} +\item{m}{[\code{cmatrix(1)|matrix(1)}]\cr correlation matrix from \code{\link{corr_matrix}}. +By default the distance matrix(dist) used in this method is given by \code{dist = 1 - m}.} -\item{m}{\[\code{matrix(1)}]\cr correlation matrix from \code{\link{corr_matrix}}. -By default the distance matrix(dist) used in this method is given by `dist = 1 - m`.} +\item{...}{Additional arguments.} } \value{ -\[\code{numeric(1)}]\cr the average value of - the silhouette width over all data of the entire dataset. - Observations with a large average silhouette width (almost 1) - are very well clustered. +[\code{numeric(1)}]\cr the average value of +the silhouette width over all data of the entire dataset. +Observations with a large average silhouette width (almost 1) +are very well clustered. } \description{ A C++ implementation of the Silhouette method of interpretation and validation of consistency within acca clusters of data. +} +\examples{ + +x <- corrp::corrp(iris) +m <- corrp::corr_matrix(x) +acca <- corrp::acca(m, 2) +sil_acca(acca, m) + } \references{ Leonard Kaufman; Peter J. Rousseeuw (1990). @@ -39,7 +47,7 @@ Starczewski, Artur, and Adam Krzyżak. "Performance evaluation of the silhouette " International Conference on Artificial Intelligence and Soft Computing. Springer, Cham, 2015. } \author{ -Igor D.S. Siciliani +Igor D.S. Siciliani, Paulo H. dos Santos } \keyword{,} \keyword{acca} diff --git a/paper/paper.md b/paper/paper.md index f7e17e6..e22e594 100644 --- a/paper/paper.md +++ b/paper/paper.md @@ -60,7 +60,6 @@ As mentioned before, one can choose between the following options based on the t The `corrp` package provides seven main functions for correlation calculations, clustering, and basic data manipulation: - - **corrp**: Performs correlation-like analysis with user-specified methods for numeric, categorical, factor, interger and mixed pairs. - **corr_matrix**: Generates a correlation matrix from analysis results. - **corr_rm**: Removes variables based on p-value significance. 
@@ -68,39 +67,50 @@ The `corrp` package provides seven main functions for correlation calculations, - **sil_acca**: A C++ implementation of the Silhouette method for interpreting and validating the consistency of clusters within ACCA clusters of data. - **best_acca**: Determining the optimal number of clusters in ACCA clustering using the average silhouette approach. -First, we calculate the correlations for the *iris* dataset using the Maximal Information Coefficient for numeric pairs, the Predictive Power Score algorithm for numeric/categorical pairs, and the Uncertainty Coefficient for categorical pairs. +We calculate correlations for the *eusilc* dataset using the Maximal Information Coefficient for numeric pairs, Predictive Power Score for numeric/categorical pairs, and Uncertainty Coefficient for categorical pairs. This synthetic dataset represents Austrian EU-SILC data on income, demographics, and household characteristics. ```r +set.seed(2024) +library("laeken") library("corrp") +data(eusilc) + +eusilc = eusilc[, c("eqSS", "eqIncome", "db040", "rb090")] +colnames(eusilc) = c("House_Size", "Income", "State", "Sex") + results = corrp( - iris, - cor.nn = 'mic', cor.nc = 'pps', cor.cc = 'uncoef', - n.cores = 2, verbose = FALSE + eusilc, + cor.nn = 'dcor', cor.nc = 'lm', cor.cc = 'pps', + verbose = FALSE ) head(results$data) ``` -| | infer | infer.value | stat | stat.value | -|------|---------------------------------|-------------|------------|------------| -| 1 | Maximal Information Coefficient | 0.9994870 | P-value | 0.0000000 | -| 2 | Maximal Information Coefficient | 0.2770503 | P-value | 0.0000000 | -| 3 | Maximal Information Coefficient | 0.7682996 | P-value | 0.0000000 | -| 4 | Maximal Information Coefficient | 0.6683281 | P-value | 0.0000000 | -| 5 | Predictive Power Score | 0.5591864 | F1_weighted| 0.7028029 | -| 6 | Maximal Information Coefficient | 0.2770503 | P-value | 0.0000000 | +| | infer | infer.value | stat | stat.value | +|------|---------------------|-------------|---------|------------| +| 1 | Distance Correlation| 1.000 | P-value | 0.000 | +| 2 | Distance Correlation| 0.008 | P-value | 0.000 | +| 3 | Linear Model | 0.146 | P-value | 3.57e-64 | +| 4 | Linear Model | 0.071 | P-value | 4.79e-18 | +| 5 | Distance Correlation| 0.008 | P-value | 0.000 | +| 6 | Distance Correlation| 1.000 | P-value | 0.000 | +| | isig | msg | varx | vary | +|------|------|-----|------------|------------| +| 1 | TRUE | | House_Size | House_Size | +| 2 | TRUE | | House_Size | Income | +| 3 | TRUE | | House_Size | State | +| 4 | TRUE | | House_Size | Sex | +| 5 | TRUE | | Income | House_Size | +| 6 | TRUE | | Income | Income | -| | isig | msg | varx | vary | -|------|-------|-------|--------------|--------------| -| 1 | TRUE | | Sepal.Length | Sepal.Length | -| 2 | TRUE | | Sepal.Length | Sepal.Width | -| 3 | TRUE | | Sepal.Length | Petal.Length | -| 4 | TRUE | | Sepal.Length | Petal.Width | -| 5 | TRUE | | Sepal.Length | Species | -| 6 | TRUE | | Sepal.Width | Sepal.Length | +When choosing correlation methods, it's important to think about their performance for different pair types. For **numeric pairs**, **Pearson** is the quickest and most efficient option, while the **Maximal Information Coefficient (mic)** is significantly slower, making it less suitable for large datasets. **Distance correlation (dcor)** is a better performer than mic but still not the fastest choice, and **Predictive Power Score (pps)** is efficient but may take longer than Pearson. 
For **numeric-categorical pairs**, the **linear model (lm)** typically outperforms pps. In **categorical pairs**, **Cramér's V**, **Uncertainty Coefficient (uncoef)**, and **pps** are options, with **uncoef** being the slowest of the three. + +As you increase the number of columns, runtime will grow significantly due to the `N * N` scaling, so choose your methods wisely to ensure efficient performance. + Using the previous result, we can create a correlation matrix as follows: ```r @@ -108,22 +118,12 @@ m = corr_matrix(results, col = 'infer.value', isig = TRUE) m ``` -| | Sepal.Length | Sepal.Width | -|----------------|--------------|-------------| -| Sepal.Length | 0.9994870 | 0.2770503 | -| Sepal.Width | 0.2770503 | 0.9967831 | -| Petal.Length | 0.7682996 | 0.4391362 | -| Petal.Width | 0.6683281 | 0.4354146 | -| Species | 0.5591864 | 0.3134401 | - -| | Petal.Length | Petal.Width | Species | -|----------------|--------------|-------------|------------| -| Sepal.Length | 0.7682996 | 0.6683281 | 0.4075487 | -| Sepal.Width | 0.4391362 | 0.4354146 | 0.2012876 | -| Petal.Length | 1.0000000 | 0.9182958 | 0.7904907 | -| Petal.Width | 0.9182958 | 0.9995144 | 0.7561113 | -| Species | 0.9167580 | 0.9398532 | 0.9999758 | - +| | House_Size | Income | State | Sex | +|------------|------------|----------|---------|---------| +| House_Size | 1.000 | 0.008 | 0.146 | 0.071 | +| Income | 0.008 | 1.000 | 0.070 | 0.071 | +| State | 0.146 | 0.070 | 1.000 | 0.000 | +| Sex | 0.071 | 0.071 | 0.000 | 1.000 | ```r @@ -138,16 +138,28 @@ Finally, we can cluster the dataset variables using the ACCA algorithm and the c acca.res = acca(m, 2) acca.res # $cluster1 -# [1] "Species" "Sepal.Length" "Petal.Width" +# [1] "Sex" "Income" # # $cluster2 -# [1] "Petal.Length" "Sepal.Width" +# [1] "State" "House_Size" # # attr(,"class") -# [1] "acca_list" "list" - +# [1] "acca_list" "list" ``` +## Performance Improvements + +When using the `corrp` function with the `dcor` method for numeric pairs (i.e., `cor.nn = "dcor"`), significant improvements in both memory usage and runtime are observed. This is because the `corrp` package uses a C++ implementation of distance correlation (`dcorT_test`), which is more efficient than the `energy::dcorT.test` function from the `energy` package. + +For example, using two vector of length 10000 and 20000, the benchmarks show the following improvements: + +| Method | 10,000 | | 20,000 | | +|-----------------------|--------------|----------------|------------------|------------------| +| | Memory (MB) | Time (sec) | Memory (MB) | Time (sec) | +| **dcorT_test (C++)** | 4701.44 | 6.022 | 18440.54 | 25.977 | +| **energy::dcorT.test**| 7000.65 | 13.846 | 27598.38 | 60.264 | + +This highlights a substantial reduction in both memory usage and execution time, making the `corrp` package more scalable for larger datasets when applying distance correlation methods. The memory reduction is particularly important because calculating distance correlation requires constructing a distance matrix of size $N^2$, where $N$ is the length of the input vector. As $N$ grows, the memory demands can quickly become prohibitive. 
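A comparison along these lines can be reproduced with, for example, the `bench` package (an assumption here, not a dependency of `corrp`), using a smaller, illustrative vector length rather than the 10,000/20,000 settings reported above:

```r
# Sketch of the dcorT_test vs. energy::dcorT.test comparison; `bench` is
# assumed to be installed and n is kept small for illustration.
library(corrp)
library(energy)
library(bench)

set.seed(1)
n <- 2000
x <- as.matrix(rnorm(n))
y <- as.matrix(rnorm(n))

# bench::mark() reports both execution time and memory allocation.
bench::mark(
  corrp_cpp = corrp::dcorT_test(x, y)$p.value,
  energy    = energy::dcorT.test(x, y)$p.value,
  iterations = 5,
  check = FALSE # the two functions return differently structured objects
)
```
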
# Acknowledgements diff --git a/src/RcppExports.o b/src/RcppExports.o new file mode 100644 index 0000000..83e9db9 Binary files /dev/null and b/src/RcppExports.o differ diff --git a/src/corrp.so b/src/corrp.so index 6327a0f..6ac6cfd 100755 Binary files a/src/corrp.so and b/src/corrp.so differ diff --git a/src/dcort.cpp b/src/dcort.cpp index 1aa48b7..7df6bb4 100644 --- a/src/dcort.cpp +++ b/src/dcort.cpp @@ -9,8 +9,8 @@ NumericMatrix Astar(NumericMatrix d) { // Calculate overall mean double M = mean(m); - - d = (1 - 1 / n) * d + M; + + d = (1.0 - 1.0 / n) * d + M; for (int i = 0; i < n; ++i) { d(i, _) = d(i, _) - m[i]; diff --git a/src/dcort.o b/src/dcort.o index 861fe0d..7f2fa8f 100644 Binary files a/src/dcort.o and b/src/dcort.o differ diff --git a/tests/testthat/test-corr_rm.R b/tests/testthat/test-corr_rm.R index 510a275..d8fe4ed 100644 --- a/tests/testthat/test-corr_rm.R +++ b/tests/testthat/test-corr_rm.R @@ -12,6 +12,6 @@ test_that("Tests on corr_rm", { df3 <- corr_rm(df = df, c = m) expect_warning(df4 <- corr_rm(df = df, c = m2)) expect_equal(df2, df3) - expect_equal(df2,df4) + expect_equal(df2, df4) }) diff --git a/vignettes/.gitignore b/vignettes/.gitignore new file mode 100644 index 0000000..097b241 --- /dev/null +++ b/vignettes/.gitignore @@ -0,0 +1,2 @@ +*.html +*.R diff --git a/vignettes/usage-corrp.Rmd b/vignettes/usage-corrp.Rmd index 1b1c0c7..4287af1 100644 --- a/vignettes/usage-corrp.Rmd +++ b/vignettes/usage-corrp.Rmd @@ -24,6 +24,7 @@ We will use the built-in `iris` dataset, which includes 150 observations of 5 va ```{r} # Load the iris dataset +library(corrp) data(iris) head(iris) ``` @@ -60,7 +61,7 @@ To focus on significant correlations, you can filter the results based on signif ```{r} # Filter significant correlations (p-value < 0.05) -significant_results <- subset(results$data, isig == TRUE) +significant_results <- subset(results$data, isig %in% TRUE) significant_results ``` @@ -119,5 +120,5 @@ corr_fun( ## Conclusion The `corrp` package provides a simple way to compute correlations across different types of variables. If you are working with mixed data, `corrp` offers a solution for your correlation analysis needs. By leveraging parallel processing and C++ implementation, `corrp` can handle large datasets efficiently. -``` +
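To illustrate the parallel backend mentioned in the conclusion, a minimal sketch (the core count is an arbitrary choice for this example, not a recommended setting):

```r
# Same mixed-type analysis as in the vignette, run on the parallel backend.
# n.cores = 2 is illustrative; adjust it to the machine at hand.
library(corrp)

results_par <- corrp(iris, parallel = TRUE, n.cores = 2, verbose = FALSE)
head(results_par$data)
```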