Preface

Model-based clustering and classification methods provide a systematic statistical modeling framework for cluster analysis and classification. The model-based approach has gained in popularity because it allows the problems of choosing or developing an appropriate clustering or classification method to be understood within the context of statistical modeling.

mclust is a widely-used software package for the statistical environment R. It provides functionality for model-based clustering, classification, and density estimation, including methods for summarizing and visualizing the estimated models.

This book aims to give a detailed overview of mclust and its features. A description of the modeling underpinning the software is provided, along with examples of its usage. In addition to serving as a reference manual for mclust, the book will be particularly useful to readers who plan to employ these model-based techniques in their research or applications.

Who is this book for?

The book is written to appeal to quantitatively trained readers from a wide range of backgrounds. An understanding of basic statistical methods, including statistical inference and statistical computing, is required. Throughout the book, examples and code are used extensively in an expository style to demonstrate the use of mclust for model-based clustering, classification, and density estimation.

Additionally, the book can serve as a reference for courses in multivariate analysis, statistical learning, machine learning, and data mining. It would also be a useful reference for advanced quantitative courses in application areas, including social sciences, physical sciences, and business.

Companion website

A companion website for this book is available at https://mclust-org.github.io/book
The website contains the R code to reproduce the examples and figures presented in the book, as well as errata and various supplementary material.

Software information and conventions

The R session information when compiling this book is shown below:

sessionInfo()
## R version 4.4.0 (2024-04-24)
## Platform: x86_64-apple-darwin20
## Running under: macOS Ventura 13.6
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: Europe/Rome
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  utils     datasets  grDevices methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.5        jsonlite_1.8.8      dplyr_1.1.4        
##  [4] compiler_4.4.0      tidyselect_1.2.1    Rcpp_1.0.13        
##  [7] parallel_4.4.0      gridExtra_2.3       scales_1.3.0       
## [10] fastmap_1.2.0       ggplot2_3.5.1       R6_2.5.1           
## [13] generics_0.1.3      curl_5.2.1          knitr_1.48         
## [16] htmlwidgets_1.6.4   tibble_3.2.1        munsell_0.5.1      
## [19] pillar_1.9.0        rlang_1.1.4         utf8_1.2.4         
## [22] V8_4.4.2            inline_0.3.19       xfun_0.46          
## [25] rstan_2.32.6        RcppParallel_5.1.8  cli_3.6.3          
## [28] magrittr_2.0.3      digest_0.6.36       grid_4.4.0         
## [31] rstudioapi_0.16.0   lifecycle_1.0.4     StanHeaders_2.32.10
## [34] vctrs_0.6.5         evaluate_0.24.0     glue_1.7.0         
## [37] QuickJSR_1.3.1      codetools_0.2-20    stats4_4.4.0       
## [40] pkgbuild_1.4.4      fansi_1.0.6         colorspace_2.1-1   
## [43] rmarkdown_2.27      tools_4.4.0         matrixStats_1.3.0  
## [46] loo_2.8.0           pkgconfig_2.0.3     htmltools_0.5.8.1

Every R input command starts on a new line without any additional prompt (such as > or +). The corresponding output is shown on lines starting with two hashes (##), as can be seen from the R session information above. Package names are in bold text (e.g., mclust), and inline code and file names are formatted in a typewriter font (e.g., data("iris", package = "datasets")). Function names are followed by parentheses (e.g., Mclust()).
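As a small illustration of these conventions, the following sketch uses only base R and the iris data mentioned above. It is not taken from the book's own examples. The input commands appear without a prompt, and the printed result follows on a line starting with two hashes.

```r
# Load the iris data from the datasets package (shipped with base R)
data("iris", package = "datasets")
# Report the number of rows and columns of the data frame
dim(iris)
## [1] 150   5
```

Because output lines begin with ##, which R treats as a comment, a block like this can be copied and pasted into an R session as is.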

About the authors

Luca Scrucca
Associate Professor of Statistics at Università degli Studi di Perugia, his research interests include: mixture models, model-based clustering and classification, statistical learning, dimension reduction methods, and genetic and evolutionary algorithms. He is currently Associate Editor for the Journal of Statistical Software and Statistics and Computing. He has developed and maintains several high-profile R packages available on The Comprehensive R Archive Network (CRAN). His webpage is at https://luca-scr.github.io.

Chris Fraley
Most recently a research staff member at Tableau, she previously held research positions in Statistics at the University of Washington and at Insightful from its early days as Statistical Sciences. She has contributed to computational methods in a number of areas of applied statistics, and is the principal author of several widely-used R packages. She was the originator (at Statistical Sciences) of numerical functions such as nlminb that have long been available in the R core stats package.

T. Brendan Murphy
Professor of Statistics at University College Dublin, his research interests include: model-based clustering, classification, network modeling, and latent variable modeling. He is interested in applications in social science, political science, medicine, food science, and biology. He served as Associate Editor for the journal Statistics and Computing, and he is currently Editor for the Annals of Applied Statistics and Associate Editor for Statistical Analysis and Data Mining. His webpage is at http://mathsci.ucd.ie/~brendan.

Adrian Raftery
Boeing International Professor of Statistics and Sociology, and Adjunct Professor of Atmospheric Sciences at the University of Washington, Seattle. He is also a faculty affiliate of the Center for Statistics and the Social Sciences and the Center for Studies in Demography and Ecology at the University of Washington. He was one of the founding researchers in model-based clustering, having published in the area since 1984. His research interests include: model-based clustering, Bayesian statistics, social network analysis, and statistical demography. He is interested in applications in social, environmental, biological, and health sciences. He is a member of the U.S. National Academy of Sciences and was identified by Thomson Reuters as the most cited researcher in mathematics in the world for the decade 1995–2005. He served as Editor of the Journal of the American Statistical Association (JASA). His webpage is at http://www.stat.washington.edu/raftery.

Acknowledgments

The idea for writing this book arose during one of the yearly meetings of the Working Group on Model-Based Clustering, which provides a small but very active forum for scholars from all over the world interested in mixture modeling. We thank all of the participants for providing the stimulating environment in which we started this project.

We are also fortunate to have benefited from a thorough review contributed by Bettina Grün, a leading expert in mixture modeling.

We have many others to thank for their contributions to mclust as users, collaborators, and developers. Thanks also to the R core team, and to those responsible for the many packages we have leveraged.

The development of the mclust package was supported over many years by the U.S. Office of Naval Research (ONR), and we acknowledge the encouragement and enthusiasm of our successive ONR program officers, Julia Abrahams and Wendy Martinez.

Chris Fraley is indebted to Tableau for supporting her efforts as co-author.

Brendan Murphy’s research was supported by the Science Foundation Ireland (SFI) Insight Research Centre (SFI/12/RC/2289_P2), Vistamilk Research Centre (16/RC/3835) and Collegium de Lyon – Institut d’Études Avancées, Université de Lyon.

Adrian Raftery’s research was supported by the Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD) under grant number R01 HD070936, by the Blumstein-Jordan and Boeing International Professorships at the University of Washington, and by the Fondation des Sciences Mathématiques de Paris (FSMP) and Université Paris-Cité.

Finally, special thanks to Rob Calver, Senior Publisher at Chapman & Hall/CRC, for his encouragement and enthusiastic support for this book.