Title: | Mixtures of Exponential-Distance Models with Covariates |
---|---|
Description: | Implements a model-based clustering method for categorical life-course sequences relying on mixtures of exponential-distance models introduced by Murphy et al. (2021) <doi:10.1111/rssa.12712>. A range of flexible precision parameter settings corresponding to weighted generalisations of the Hamming distance metric are considered, along with the potential inclusion of a noise component. Gating covariates can be supplied in order to relate sequences to baseline characteristics and sampling weights are also accommodated. The models are fitted using the EM algorithm and tools for visualising the results are also provided. |
Authors: | Keefe Murphy [aut, cre] , Thomas Brendan Murphy [ctb] , Raffaella Piccarreta [ctb], Isobel Claire Gormley [ctb] |
Maintainer: | Keefe Murphy <[email protected]> |
License: | GPL (>= 3) |
Version: | 1.4.1 |
Built: | 2024-11-25 03:53:46 UTC |
Source: | https://github.com/keefe-murphy/medseq |
Fits MEDseq models: mixtures of Exponential-Distance models with gating covariates and sampling weights. Typically used for clustering categorical/longitudinal life-course sequences.
Fits _MEDseq_ models introduced by Murphy et al. (2021) <doi:10.1111/rssa.12712>, i.e. fits mixtures of exponential-distance models for clustering longitudinal life-course sequence data via the EM/CEM algorithm.
A family of parsimonious precision parameter constraints are accommodated. So too are sampling weights. Gating covariates can be supplied via formula interfaces.
The most important function in the MEDseq package is: MEDseq_fit
, for fitting the models via EM/CEM. This function requires the data to be in "stslist"
format; the function seqdef
is conveniently reexported from the TraMineR package for this purpose.
MEDseq_control
allows supplying additional arguments which govern, among other things, controls on the initialisation of the allocations for the EM/CEM algorithm and the various model selection options.
MEDseq_compare
is provided for conducting model selection between different results from using different covariate combinations &/or initialisation strategies, etc.
MEDseq_stderr
is provided for computing the standard errors of the coefficients for the covariates in the gating network.
A dedicated plotting function plot.MEDseq
exists for visualising various aspects of the results, using new methods as well as some existing methods adapted from the TraMineR package.
Finally, the package also contains two data sets: biofam
and mvad
.
Package
MEDseq
1.4.1
2023-12-12 (this version), 2019-08-24 (original release)
GPL (>= 3)
Further details and examples are given in the associated vignette document:vignette("MEDseq", package = "MEDseq")
Keefe Murphy [aut, cre], Thomas Brendan Murphy [ctb], Raffaella Piccarreta [ctb], Isobel Claire Gormley [ctb]
Maintainer: Keefe Murphy - <[email protected]>
Murphy, K., Murphy, T. B., Piccarreta, R., and Gormley, I. C. (2021). Clustering longitudinal life-course sequences using mixtures of exponential-distance models. Journal of the Royal Statistical Society: Series A (Statistics in Society), 184(4): 1414-1451. <doi:10.1111/rssa.12712>.
Useful links:
# Load the MVAD data data(mvad) mvad$Location <- factor(apply(mvad[,5:9], 1L, function(x) which(x == "yes")), labels = colnames(mvad[,5:9])) mvad <- list(covariates = mvad[c(3:4,10:14,87)], sequences = mvad[,15:86], weights = mvad[,2]) mvad.cov <- mvad$covariates # Create a state sequence object with the first two (summer) time points removed states <- c("EM", "FE", "HE", "JL", "SC", "TR") labels <- c("Employment", "Further Education", "Higher Education", "Joblessness", "School", "Training") mvad.seq <- seqdef(mvad$sequences[-c(1,2)], states=states, labels=labels) # Fit a range of unweighted models without covariates # Only consider models with a noise component # Supply some MEDseq_control() arguments mod1 <- MEDseq_fit(mvad.seq, G=9:10, modtype=c("CCN", "CUN", "UCN", "UUN"), algo="CEM", init.z="kmodes", criterion="icl") # Fit a model with weights and gating covariates # Have the probability of noise-component membership be constant mod2 <- MEDseq_fit(mvad.seq, G=11, modtype="UUN", weights=mvad$weights, gating=~ gcse5eq, covars=mvad.cov, noise.gate=FALSE) # Examine this model and its gating network summary(mod2, network=TRUE) plot(mod2, "clusters")
# Load the MVAD data data(mvad) mvad$Location <- factor(apply(mvad[,5:9], 1L, function(x) which(x == "yes")), labels = colnames(mvad[,5:9])) mvad <- list(covariates = mvad[c(3:4,10:14,87)], sequences = mvad[,15:86], weights = mvad[,2]) mvad.cov <- mvad$covariates # Create a state sequence object with the first two (summer) time points removed states <- c("EM", "FE", "HE", "JL", "SC", "TR") labels <- c("Employment", "Further Education", "Higher Education", "Joblessness", "School", "Training") mvad.seq <- seqdef(mvad$sequences[-c(1,2)], states=states, labels=labels) # Fit a range of unweighted models without covariates # Only consider models with a noise component # Supply some MEDseq_control() arguments mod1 <- MEDseq_fit(mvad.seq, G=9:10, modtype=c("CCN", "CUN", "UCN", "UUN"), algo="CEM", init.z="kmodes", criterion="icl") # Fit a model with weights and gating covariates # Have the probability of noise-component membership be constant mod2 <- MEDseq_fit(mvad.seq, G=11, modtype="UUN", weights=mvad$weights, gating=~ gcse5eq, covars=mvad.cov, noise.gate=FALSE) # Examine this model and its gating network summary(mod2, network=TRUE) plot(mod2, "clusters")
2000 16 year-long family life sequences built from the retrospective biographical survey carried out by the Swiss Household Panel (SHP) in 2002.
data(biofam)
data(biofam)
A data frame with 2000 rows, 16 state variables, 1 id variable and 7 covariates and 2 weights variables.
The biofam data set was constructed by Müller et al. (2007) from the data of the retrospective biographical survey carried out by the Swiss Household Panel (SHP) in 2002.
The data set contains (in columns 10 to 25) sequences of family life states from age 15 to 30 (sequence length is 16) and a series of covariates. The sequences are a sample of 2000 sequences of those created from the SHP biographical survey. It includes only individuals who were at least 30 years old at the time of the survey. The biofam data set describes family life courses of 2000 individuals born between 1909 and 1972.
The states numbered from 0 to 7 are defined from the combination of five basic states, namely Living with parents (Parent), Left home (Left), Married (Marr), Having Children (Child), Divorced:
0 = "Parent"
1 = "Left"
2 = "Married"
3 = "Left+Marr"
4 = "Child"
5 = "Left+Child"
6 = "Left+Marr+Child"
7 = "Divorced"
The covariates are:
sex |
|
birthyr |
(birth year) |
nat_1_02 |
(first nationality) |
plingu02 |
(language of questionnaire) |
p02r01 |
(religion) |
p02r04 |
(religious participation) |
cspfaj |
(father's social status) |
cspmoj |
(mother's social status) |
Two additional weights variables are inserted for illustrative purpose ONLY (since biofam
is a subsample of the original data, these weights are not adapted to the actual data):
wp00tbgp |
(weights inflating to the Swiss population) |
wp00tbgs |
(weights respecting sample size) |
Swiss Household Panel https://forscenter.ch/projects/swiss-household-panel/
Müller, N. S., Studer, M. and Ritschard, G. (2007). Classification de parcours de vie à l'aide de l'optimal matching. In XIVe Rencontre de la Société francophone de classification (SFC 2007), Paris, 5-7 septembre 2007, pp. 157-160.
data(biofam, package="MEDseq") biofam <- list(covariates = biofam[2L:9L], sequences = biofam[10L:25L] + 1L) biofam.cov <- biofam$covariates[,colSums(is.na(biofam$covariates)) == 0] biofam.seq <- seqdef(biofam$sequences, states = c("P", "L", "M", "L+M", "C", "L+C", "L+M+C", "D"), labels = c("Parent", "Left", "Married", "Left+Marr", "Child", "Left+Child", "Left+Marr+Child", "Divorced")) biofam.cov$age <- 2002 - biofam.cov$birthyr
data(biofam, package="MEDseq") biofam <- list(covariates = biofam[2L:9L], sequences = biofam[10L:25L] + 1L) biofam.cov <- biofam$covariates[,colSums(is.na(biofam$covariates)) == 0] biofam.seq <- seqdef(biofam$sequences, states = c("P", "L", "M", "L+M", "C", "L+C", "L+M+C", "D"), labels = c("Parent", "Left", "Married", "Left+Marr", "Child", "Left+Child", "Left+Marr+Child", "Divorced")) biofam.cov$age <- 2002 - biofam.cov$birthyr
Computes the Density-based Silhouette for a 'soft' clustering assignment matrix.
dbs(z, ztol = 1E-100, weights = NULL, summ = c("mean", "median"), clusters = NULL, ...)
dbs(z, ztol = 1E-100, weights = NULL, summ = c("mean", "median"), clusters = NULL, ...)
z |
A numeric matrix such that rows correspond to observations, columns correspond to clusters, and rows sum to |
ztol |
A small (single, numeric, non-negative) tolerance parameter governing whether small assignment probabilities are treated instead as crisp assignments. Defaults to |
weights |
An optional numeric vector giving observation-specific weights for computing the (weighted) mean/median DBS (see |
summ |
A single character string indicating whether the (possibly weighted) |
clusters |
Optional/experimental argument for giving the indicator labels of the cluster assignments. Defaults to the MAP assignment derived from |
... |
Catches unused arguments. |
A list with the following elements:
silvals
A matrix where each row contains the cluster to which each observation belongs in the first column and the observation-specific DBS width in the second column.
msw
Depending on the value of summ
, either the mean or median DBS width.
wmsw
Depending on the value of summ
, either the weighted mean or weighted median DBS width.
When calling MEDseq_fit
, the summ
argument can be passed via the ...
construct, in which case it governs both the dbs
and asw
criteria.
Keefe Murphy - <[email protected]>
Menardi, G. (2011). Density-based silhouette diagnostics for clustering methods. Statistics and Computing, 21(3): 295-308.
# Generate a toy z matrix z <- abs(matrix(rnorm(50), ncol=2)) z <- z/rowSums(z) # Return the median DBS width dbs(z, summ="median")$msw # For real sequence data data(mvad) mod <- MEDseq_fit(seqdef(mvad[,17:86]), G=11, modtype="UUN", weights=mvad$weight) dbs(mod$z, weights=mvad$weight)
# Generate a toy z matrix z <- abs(matrix(rnorm(50), ncol=2)) z <- z/rowSums(z) # Return the median DBS width dbs(z, summ="median")$msw # For real sequence data data(mvad) mod <- MEDseq_fit(seqdef(mvad[,17:86]), G=11, modtype="UUN", weights=mvad$weight) dbs(mod$z, weights=mvad$weight)
Computes the matrix of pairwise distance using a frequency-weighted variant of the Hamming distance often used in k-modes clustering.
dist_freqwH(data, full.matrix = TRUE)
dist_freqwH(data, full.matrix = TRUE)
data |
A matrix or data frame of categorical data. Objects have to be in rows, variables in columns. |
full.matrix |
Logical. If |
As per wKModes
, the frequency weights are computed within the function and are not user-specified. These frequency weights are assigned on a per-feature basis and derived from the categories represented in each column of data
.
The whole matrix of pairwise distances if full.matrix=TRUE
, otherwise the corresponding dist
object.
Keefe Murphy - <[email protected]>
Huang, Z. (1998). Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery, 2(3): 283-304.
wKModes
, wcAggregateCases
, wcSilhouetteObs
suppressMessages(require(WeightedCluster)) set.seed(99) # Load the MVAD data & aggregate the state sequences data(mvad) agg <- wcAggregateCases(mvad[,17:86], weights=mvad$weight) # Create a state sequence object without the first two (summer) time points states <- c("EM", "FE", "HE", "JL", "SC", "TR") labels <- c("Employment", "Further Education", "Higher Education", "Joblessness", "School", "Training") weights <- agg$aggWeights mvad.seq <- seqdef(mvad[agg$aggIndex, 17:86], states=states, labels=labels, weights=agg$aggWeights) # Run k-modes with weights resW <- wKModes(mvad.seq, 2, weights=agg$aggWeights) # Run k-modes with additional frequency weights resF <- wKModes(mvad.seq, 2, weights=agg$aggWeights, freq.weighted=TRUE) # Examine the average silhouette widths of both weighted solutions weighted.mean(wcSilhouetteObs(seqdist(mvad.seq, method="HAM"), resW$cluster, weights), weights) # weighted.mean(wcSilhouetteObs(seqdist(mvad.seq, method="HAM"), resF$cluster, weights), weights) weighted.mean(wcSilhouetteObs(dist_freqwH(mvad.seq), resF$cluster, weights), weights)
suppressMessages(require(WeightedCluster)) set.seed(99) # Load the MVAD data & aggregate the state sequences data(mvad) agg <- wcAggregateCases(mvad[,17:86], weights=mvad$weight) # Create a state sequence object without the first two (summer) time points states <- c("EM", "FE", "HE", "JL", "SC", "TR") labels <- c("Employment", "Further Education", "Higher Education", "Joblessness", "School", "Training") weights <- agg$aggWeights mvad.seq <- seqdef(mvad[agg$aggIndex, 17:86], states=states, labels=labels, weights=agg$aggWeights) # Run k-modes with weights resW <- wKModes(mvad.seq, 2, weights=agg$aggWeights) # Run k-modes with additional frequency weights resF <- wKModes(mvad.seq, 2, weights=agg$aggWeights, freq.weighted=TRUE) # Examine the average silhouette widths of both weighted solutions weighted.mean(wcSilhouetteObs(seqdist(mvad.seq, method="HAM"), resW$cluster, weights), weights) # weighted.mean(wcSilhouetteObs(seqdist(mvad.seq, method="HAM"), resF$cluster, weights), weights) weighted.mean(wcSilhouetteObs(dist_freqwH(mvad.seq), resF$cluster, weights), weights)
Utility function for extracting results of submodels from "MEDseq"
objects when a range of models were run via MEDseq_fit
.
get_MEDseq_results(x, what = c("z", "MAP", "DBS", "ASW"), rank = 1L, criterion = c("bic", "icl", "aic", "dbs", "asw", "cv", "nec", "loglik"), G = NULL, modtype = NULL, noise = TRUE, ...)
get_MEDseq_results(x, what = c("z", "MAP", "DBS", "ASW"), rank = 1L, criterion = c("bic", "icl", "aic", "dbs", "asw", "cv", "nec", "loglik"), G = NULL, modtype = NULL, noise = TRUE, ...)
x |
An object of class |
what |
A character string indicating the desired results to extract. |
rank |
A number indicating what |
criterion |
The |
G |
Optional argument giving the number of components in the model for which results are desired. Can be supplied with or without also specifying |
modtype |
Optional argument giving the desired model type for which results are desired. Can be supplied with or without also specifying |
noise |
A logical indicating whether models with a noise component should be considered. Defaults to |
... |
Catches unused arguments. |
The arguments rank
and criterion
are invoked when one or more of the arguments G
and modtype
are missing. Thus, supplying G
and modtype
allows rank
and criterion
to be bypassed entirely.
The desired results extracted from the MEDseq
model.
Arguments to this function can be supplied to plot.MEDseq
via the ...
construct.
Keefe Murphy - <[email protected]>
data(biofam) # mod <- MEDseq_fit(seqdef(biofam[10:25] + 1L), G=9:10) # Extract the MAP clustering of the best 9-cluster model according to the asw criterion # get_MEDseq_results(mod, what="MAP", G=9, criterion="asw") # Extract the DBS values of the best UUN model according to the dbs criterion # get_MEDseq_results(mod, what="DBS", modtype="UUN", criterion="dbs") # Plot the DBS values of this same model, by passing get_MEDseq_results arguments through plot # plot(mod, type="dbsvals", modtype="UUN", criterion="dbs")
data(biofam) # mod <- MEDseq_fit(seqdef(biofam[10:25] + 1L), G=9:10) # Extract the MAP clustering of the best 9-cluster model according to the asw criterion # get_MEDseq_results(mod, what="MAP", G=9, criterion="asw") # Extract the DBS values of the best UUN model according to the dbs criterion # get_MEDseq_results(mod, what="DBS", modtype="UUN", criterion="dbs") # Plot the DBS values of this same model, by passing get_MEDseq_results arguments through plot # plot(mod, type="dbsvals", modtype="UUN", criterion="dbs")
Calculates the per-component average posterior probabilities of a fitted MEDseq model.
MEDseq_AvePP(x, group = TRUE)
MEDseq_AvePP(x, group = TRUE)
x |
An object of class |
group |
A logical indicating whether the average posterior probabilities should be computed per component. Defaults to |
When group=TRUE
, this function calculates AvePP, the average posterior probabilities of membership for each component for the observations assigned to that component via MAP probabilities. Otherwise, an overall measure of clustering certainty is returned.
When group=TRUE
, a named vector of numbers, of length equal to the number of components (G), in the range [1/G,1], such that larger values indicate clearer separation of the clusters. When group=FALSE
, a single number in the same range is returned.
This function will always return values of 1
for all components for models fitted using the "CEM"
algorithm (see MEDseq_control
), or models with only one component.
Keefe Murphy - <[email protected]>
Murphy, K., Murphy, T. B., Piccarreta, R., and Gormley, I. C. (2021). Clustering longitudinal life-course sequences using mixtures of exponential-distance models. Journal of the Royal Statistical Society: Series A (Statistics in Society), 184(4): 1414-1451. <doi:10.1111/rssa.12712>.
MEDseq_fit
, MEDseq_control
, MEDseq_entropy
# Load the MVAD data data(mvad) mvad$Location <- factor(apply(mvad[,5:9], 1L, function(x) which(x == "yes")), labels = colnames(mvad[,5:9])) mvad <- list(covariates = mvad[c(3:4,10:14,87)], sequences = mvad[,15:86], weights = mvad[,2]) mvad.cov <- mvad$covariates # Create a state sequence object with the first two (summer) time points removed states <- c("EM", "FE", "HE", "JL", "SC", "TR") labels <- c("Employment", "Further Education", "Higher Education", "Joblessness", "School", "Training") mvad.seq <- seqdef(mvad$sequences[-c(1,2)], states=states, labels=labels) # Fit a model with weights and a gating covariate # Have the probability of noise-component membership be constant mod <- MEDseq_fit(mvad.seq, G=11, modtype="UUN", weights=mvad$weights, gating=~ gcse5eq, covars=mvad.cov, noise.gate=FALSE) # Calculate the AvePP per component MEDseq_AvePP(mod) # Calculte an overall measure of clustering certainty MEDseq_AvePP(mod, group=FALSE)
# Load the MVAD data data(mvad) mvad$Location <- factor(apply(mvad[,5:9], 1L, function(x) which(x == "yes")), labels = colnames(mvad[,5:9])) mvad <- list(covariates = mvad[c(3:4,10:14,87)], sequences = mvad[,15:86], weights = mvad[,2]) mvad.cov <- mvad$covariates # Create a state sequence object with the first two (summer) time points removed states <- c("EM", "FE", "HE", "JL", "SC", "TR") labels <- c("Employment", "Further Education", "Higher Education", "Joblessness", "School", "Training") mvad.seq <- seqdef(mvad$sequences[-c(1,2)], states=states, labels=labels) # Fit a model with weights and a gating covariate # Have the probability of noise-component membership be constant mod <- MEDseq_fit(mvad.seq, G=11, modtype="UUN", weights=mvad$weights, gating=~ gcse5eq, covars=mvad.cov, noise.gate=FALSE) # Calculate the AvePP per component MEDseq_AvePP(mod) # Calculte an overall measure of clustering certainty MEDseq_AvePP(mod, group=FALSE)
These functions extract names for clusters according to the SPS representation of their central sequences.
MEDseq_clustnames(x, cluster = TRUE, size = FALSE, weighted = FALSE, ...) MEDseq_nameclusts(names)
MEDseq_clustnames(x, cluster = TRUE, size = FALSE, weighted = FALSE, ...) MEDseq_nameclusts(names)
x |
An object of class |
cluster |
A logical indicating whether names should be prepended with the text " |
size |
A logical indicating whether the (typically 'soft') size of each cluster is appended to the label of each group, expressed as a percentage of the total number of observations. Defaults to |
weighted |
A logical indicating whether the sampling weights (if any) are used when appending the |
... |
Catches unused arguments. |
names |
The output of |
Unlike the seqclustname
function from the WeightedCluster package which inspired these functions, MEDseq_clustnames
only returns the names themselves, not the factor
variable indicating cluster membership with labels given by those names. Thus, MEDseq_nameclusts
is provided as a convenience function for precisely this purpose (see Examples
).
For MEDseq_clustnames
, a character vector containing the names for each component defined by their central sequence, and optionally the cluster name (see cluster
above) and cluster size (see size
above). The name for the noise component, if any, will always be simply "Noise"
(or "Cluster 0: Noise"
).
For MEDseq_nameclusts
, a factor version of x$MAP
with levels given by the output of MEDseq_clustnames
.
The main MEDseq_clustnames
function is used internally by plot.MEDseq
, MEDseq_meantime
, MEDseq_stderr
, and also other print
and summary
methods, where its invocation can typically controlled via a SPS
logical argument. However, the optional arguments cluster
, size
, and weighted
can only be passed through plot.MEDseq
; elsewhere cluster=TRUE
, size=FALSE
, and weighted=FALSE
are always assumed.
Keefe Murphy - <[email protected]>
Murphy, K., Murphy, T. B., Piccarreta, R., and Gormley, I. C. (2021). Clustering longitudinal life-course sequences using mixtures of exponential-distance models. Journal of the Royal Statistical Society: Series A (Statistics in Society), 184(4): 1414-1451. <doi:10.1111/rssa.12712>.
seqformat
, seqclustname
, plot.MEDseq
, MEDseq_meantime
, MEDseq_stderr
# Load the MVAD data data(mvad) mvad$Location <- factor(apply(mvad[,5:9], 1L, function(x) which(x == "yes")), labels = colnames(mvad[,5:9])) mvad <- list(covariates = mvad[c(3:4,10:14,87)], sequences = mvad[,15:86], weights = mvad[,2]) mvad.cov <- mvad$covariates # Create a state sequence object with the first two (summer) time points removed states <- c("EM", "FE", "HE", "JL", "SC", "TR") labels <- c("Employment", "Further Education", "Higher Education", "Joblessness", "School", "Training") mvad.seq <- seqdef(mvad$sequences[-c(1,2)], states=states, labels=labels) # Fit a model with weights and a gating covariate # Have the probability of noise-component membership depend on the covariate mod <- MEDseq_fit(mvad.seq, G=5, modtype="UUN", weights=mvad$weights, gating=~ gcse5eq, covars=mvad.cov, noise.gate=TRUE) # Extract the names names <- MEDseq_clustnames(mod, cluster=FALSE, size=TRUE) # Get the renamed MAP cluster membership indicator vector group <- MEDseq_nameclusts(names) # Use the output in plots plot(mod, type="d", soft=FALSE, weighted=FALSE, cluster=FALSE, size=TRUE, border=TRUE) # same as: # seqplot(mvad.seq, type="d", group=group) # Indeed, this function is invoked by default for certain plot types plot(mod, type="d", soft=TRUE, weighted=TRUE) plot(mod, type="d", soft=TRUE, weighted=TRUE, SPS=FALSE) # Invoke this function when printing the gating network coefficients print(mod$gating, SPS=FALSE) print(mod$gating, SPS=TRUE) # Invoke this function in a call to MEDseq_meantime MEDseq_meantime(mod, SPS=TRUE) # Invoke this function in other plots plot(mod, type="clusters", SPS=TRUE) plot(mod, type="precision", SPS=TRUE)
# Load the MVAD data data(mvad) mvad$Location <- factor(apply(mvad[,5:9], 1L, function(x) which(x == "yes")), labels = colnames(mvad[,5:9])) mvad <- list(covariates = mvad[c(3:4,10:14,87)], sequences = mvad[,15:86], weights = mvad[,2]) mvad.cov <- mvad$covariates # Create a state sequence object with the first two (summer) time points removed states <- c("EM", "FE", "HE", "JL", "SC", "TR") labels <- c("Employment", "Further Education", "Higher Education", "Joblessness", "School", "Training") mvad.seq <- seqdef(mvad$sequences[-c(1,2)], states=states, labels=labels) # Fit a model with weights and a gating covariate # Have the probability of noise-component membership depend on the covariate mod <- MEDseq_fit(mvad.seq, G=5, modtype="UUN", weights=mvad$weights, gating=~ gcse5eq, covars=mvad.cov, noise.gate=TRUE) # Extract the names names <- MEDseq_clustnames(mod, cluster=FALSE, size=TRUE) # Get the renamed MAP cluster membership indicator vector group <- MEDseq_nameclusts(names) # Use the output in plots plot(mod, type="d", soft=FALSE, weighted=FALSE, cluster=FALSE, size=TRUE, border=TRUE) # same as: # seqplot(mvad.seq, type="d", group=group) # Indeed, this function is invoked by default for certain plot types plot(mod, type="d", soft=TRUE, weighted=TRUE) plot(mod, type="d", soft=TRUE, weighted=TRUE, SPS=FALSE) # Invoke this function when printing the gating network coefficients print(mod$gating, SPS=FALSE) print(mod$gating, SPS=TRUE) # Invoke this function in a call to MEDseq_meantime MEDseq_meantime(mod, SPS=TRUE) # Invoke this function in other plots plot(mod, type="clusters", SPS=TRUE) plot(mod, type="precision", SPS=TRUE)
Takes one or more sets of "MEDseq"
models fitted by MEDseq_fit
and ranks them according to a specified model selection criterion. It's possible to respect the internal ranking within each set of models, or to discard models within each set which were already deemed sub-optimal. This function can help with model selection via exhaustive or stepwise searches.
MEDseq_compare(..., criterion = c("bic", "icl", "aic", "dbs", "asw", "cv", "nec"), pick = 10L, optimal.only = FALSE) ## S3 method for class 'MEDseqCompare' print(x, index = seq_len(x$pick), rerank = FALSE, digits = 3L, maxi = length(index), ...)
MEDseq_compare(..., criterion = c("bic", "icl", "aic", "dbs", "asw", "cv", "nec"), pick = 10L, optimal.only = FALSE) ## S3 method for class 'MEDseqCompare' print(x, index = seq_len(x$pick), rerank = FALSE, digits = 3L, maxi = length(index), ...)
... |
One or more objects of class This argument is only relevant for the |
criterion |
The criterion used to determine the ranking. Defaults to |
pick |
The (integer) number of models to be ranked and compared. Defaults to |
optimal.only |
Logical indicating whether to only rank models already deemed optimal within each |
x , index , rerank , digits , maxi
|
Arguments required for the associated
|
The purpose of this function is to conduct model selection on "MEDseq"
objects, fit to the same data set, with different combinations of gating network covariates or different initialisation settings.
Model selection will have already been performed in terms of choosing the optimal number of components and MEDseq model type within each supplied set of results, but MEDseq_compare
will respect the internal ranking of models when producing the final ranking if optimal.only
is FALSE
: otherwise only those models already deemed optimal within each "MEDseq"
object will be ranked.
As such if two sets of results are supplied when optimal.only
is FALSE
, the 1st, 2nd, and 3rd best models could all belong to the first set of results, meaning a model deemed suboptimal according to one set of covariates could be superior to one deemed optimal under another set of covariates.
A list of class "MEDseqCompare"
, for which a dedicated print function exists, containing the following elements (each of length pick
, and ranked according to criterion
, where appropriate):
data |
The name of the data set to which the models were fitted. |
optimal |
The single optimal model (an object of class |
pick |
The final number of ranked models. May be different (i.e. less than) the supplied |
MEDNames |
The names of the supplied |
modelNames |
The MEDseq model names (denoting the constraints or lack thereof on the precision parameters). |
G |
The optimal numbers of components. |
df |
The numbers of estimated parameters. |
iters |
The numbers of EM/CEM iterations. |
bic |
BIC values, ranked according to |
icl |
ICL values, ranked according to |
aic |
AIC values, ranked according to |
dbs |
(Weighted) mean/median DBS values, ranked according to |
asw |
(Weighted) mean/median ASW values, ranked according to |
cv |
Cross-validated log-likelihood values, ranked according to |
nec |
NEC values, ranked according to |
loglik |
Maximal log-likelihood values, ranked according to |
gating |
The gating formulas. |
algo |
The algorithm used for fitting the model - either |
equalPro |
Logical indicating whether mixing proportions were constrained to be equal across components. |
opti |
The method used for estimating the central sequence(s). |
weights |
Logical indicating whether the given model was fitted with sampling weights. |
noise |
Logical indicating the presence/absence of a noise component. Only displayed if at least one of the compared models has a noise component. |
noise.gate |
Logical indicating whether gating covariates were allowed to influence the noise component's mixing proportion. Only printed for models with a noise component, when at least one of the compared models has gating covariates. |
equalNoise |
Logical indicating whether the mixing proportion of the noise component for |
The criterion
argument here need not comply with the criterion used for model selection within each "MEDseq"
object, but be aware that a mismatch in terms of criterion
may require the optimal model to be re-fit in order to be extracted, thereby slowing down MEDseq_compare
.
If random starts had been used via init.z="random"
the optimal
model may not necessarily correspond to the highest-ranking model in the presence of a criterion mismatch, due to the randomness of the initialisation.
A dedicated print
function exists for objects of class "MEDseqCompare"
and plot.MEDseq
can also be called on objects of class "MEDseqCompare"
.
Keefe Murphy - <[email protected]>
Murphy, K., Murphy, T. B., Piccarreta, R., and Gormley, I. C. (2021). Clustering longitudinal life-course sequences using mixtures of exponential-distance models. Journal of the Royal Statistical Society: Series A (Statistics in Society), 184(4): 1414-1451. <doi:10.1111/rssa.12712>.
data(biofam) seqs <- seqdef(biofam[10:25] + 1L, states = c("P", "L", "M", "L+M", "C", "L+C", "L+M+C", "D")) covs <- cbind(biofam[2:3], age=2002 - biofam$birthyr) # Fit a range of models # m1 <- MEDseq_fit(seqs, G=9:10) # m2 <- MEDseq_fit(seqs, G=9:10, gating=~sex, covars=covs, noise.gate=FALSE) # m3 <- MEDseq_fit(seqs, G=9:10, gating=~age, covars=covs, noise.gate=FALSE) # m4 <- MEDseq_fit(seqs, G=9:10, gating=~sex + age, covars=covs, noise.gate=FALSE) # Rank only the optimal models (according to the dbs criterion) # Examine the best model in more detail # (comp <- MEDseq_compare(m1, m2, m3, m4, criterion="dbs", optimal.only=TRUE)) # (best <- comp$optimal) # (summ <- summary(best, parameters=TRUE)) # Examine all models visited, including those already deemed suboptimal # Only print models with gating covariates & 10 components # comp2 <- MEDseq_compare(comp, m1, m2, m3, m4, criterion="dbs", pick=Inf) # print(comp2, index=comp2$gating != "None" & comp2$G == 10)
data(biofam) seqs <- seqdef(biofam[10:25] + 1L, states = c("P", "L", "M", "L+M", "C", "L+C", "L+M+C", "D")) covs <- cbind(biofam[2:3], age=2002 - biofam$birthyr) # Fit a range of models # m1 <- MEDseq_fit(seqs, G=9:10) # m2 <- MEDseq_fit(seqs, G=9:10, gating=~sex, covars=covs, noise.gate=FALSE) # m3 <- MEDseq_fit(seqs, G=9:10, gating=~age, covars=covs, noise.gate=FALSE) # m4 <- MEDseq_fit(seqs, G=9:10, gating=~sex + age, covars=covs, noise.gate=FALSE) # Rank only the optimal models (according to the dbs criterion) # Examine the best model in more detail # (comp <- MEDseq_compare(m1, m2, m3, m4, criterion="dbs", optimal.only=TRUE)) # (best <- comp$optimal) # (summ <- summary(best, parameters=TRUE)) # Examine all models visited, including those already deemed suboptimal # Only print models with gating covariates & 10 components # comp2 <- MEDseq_compare(comp, m1, m2, m3, m4, criterion="dbs", pick=Inf) # print(comp2, index=comp2$gating != "None" & comp2$G == 10)
Supplies a list of arguments (with defaults) for use with MEDseq_fit
.
MEDseq_control(algo = c("EM", "CEM", "cemEM"), init.z = c("kmedoids", "kmodes", "kmodes2", "hc", "random", "list"), z.list = NULL, dist.mat = NULL, unique = TRUE, criterion = c("bic", "icl", "aic", "dbs", "asw", "cv", "nec"), tau0 = NULL, noise.gate = TRUE, random = TRUE, do.cv = FALSE, do.nec = FALSE, nfolds = 10L, nstarts = 1L, stopping = c("aitken", "relative"), equalPro = FALSE, equalNoise = FALSE, tol = c(1E-05, 1E-08), itmax = c(.Machine$integer.max, 1000L), opti = c("mode", "medoid", "first", "GA"), ordering = c("none", "decreasing", "increasing"), MaxNWts = 1000L, verbose = TRUE, ...)
MEDseq_control(algo = c("EM", "CEM", "cemEM"), init.z = c("kmedoids", "kmodes", "kmodes2", "hc", "random", "list"), z.list = NULL, dist.mat = NULL, unique = TRUE, criterion = c("bic", "icl", "aic", "dbs", "asw", "cv", "nec"), tau0 = NULL, noise.gate = TRUE, random = TRUE, do.cv = FALSE, do.nec = FALSE, nfolds = 10L, nstarts = 1L, stopping = c("aitken", "relative"), equalPro = FALSE, equalNoise = FALSE, tol = c(1E-05, 1E-08), itmax = c(.Machine$integer.max, 1000L), opti = c("mode", "medoid", "first", "GA"), ordering = c("none", "decreasing", "increasing"), MaxNWts = 1000L, verbose = TRUE, ...)
algo |
Switch controlling whether models are fit using the |
init.z |
The method used to initialise the cluster labels. All options respect the presence of sampling The |
z.list |
A user supplied list of initial cluster allocation matrices, with number of rows given by the number of observations, and numbers of columns given by the range of component numbers being considered. Only relevant if |
dist.mat |
An optional distance matrix to use for initialisation when |
unique |
A logical indicating whether the model is fit only to the unique observations (defaults to When When In both cases, the results will be unchanged, but setting |
criterion |
When either |
tau0 |
Prior mixing proportion for the noise component. If supplied, a noise component will be added to the model in the estimation, with |
noise.gate |
A logical indicating whether gating network covariates influence the mixing proportion for the noise component, if any. Defaults to |
random |
A logical governing how ties for estimated central sequence positions are handled. When Note that this argument is also passed to |
do.cv |
A logical indicating whether cross-validated log-likelihood scores should also be computed (see |
do.nec |
A logical indicating whether the normalised entropy criterion (NEC) should also be computed (for models with more than one component). Defaults to |
nfolds |
The number of folds to use when |
nstarts |
The number of random initialisations to use when |
stopping |
The criterion used to assess convergence of the EM/CEM algorithm. The default ( |
equalPro |
Logical variable indicating whether or not the mixing proportions are to be constrained to be equal in the model. Default: |
equalNoise |
Logical which is only invoked when |
tol |
A vector of length two giving relative convergence tolerances for 1) the log-likelihood of the EM/CEM algorithm, and 2) optimisation in the multinomial logistic regression in the gating network, respectively. The default is |
itmax |
A vector of length two giving integer limits on the number of iterations for 1) the EM/CEM algorithm, and 2) the multinomial logistic regression in the gating network, respectively. The default is If, for any model with gating covariates, the multinomial logistic regression in the gating network fails to converge in |
opti |
Character string indicating how central sequence parameters should be estimated. The default |
ordering |
Experimental feature that should only be tampered with by experienced users. Allows sequences to be reordered on the basis of the column-wise entropy when |
MaxNWts |
The maximum allowable number of weights in the call to |
verbose |
Logical indicating whether to print messages pertaining to progress to the screen during fitting. By default is |
... |
Catches unused arguments, and also allows the optional arguments |
MEDseq_control
is provided for assigning values and defaults within MEDseq_fit
. While the criterion
argument controls the choice of the optimal number of components and MEDseq model type (in terms of the constraints or lack thereof on the precision parameters), MEDseq_compare
is provided for choosing between fits with different combinations of covariates or different initialisation settings.
A named list in which the names are the names of the arguments and the values are the values supplied to the arguments.
Keefe Murphy - <[email protected]>
Murphy, K., Murphy, T. B., Piccarreta, R., and Gormley, I. C. (2021). Clustering longitudinal life-course sequences using mixtures of exponential-distance models. Journal of the Royal Statistical Society: Series A (Statistics in Society), 184(4): 1414-1451. <doi:10.1111/rssa.12712>.
Menardi, G. (2011). Density-based silhouette diagnostics for clustering methods. Statistics and Computing, 21(3): 295-308.
Hoos, H. and T. Stützle (2004). Stochastic Local Search: Foundations and Applications. The Morgan Kaufman Series in Artificial Intelligence. San Francisco, CA, USA: Morgan Kaufman Publishers Inc.
MEDseq_fit
, dbs
, wcKMedoids
, pam
, wKModes
, hclust
, seqdist
, multinom
, MEDseq_compare
# The CC MEDseq model is almost equivalent to k-medoids when the # CEM algorithm is employed, mixing proportions are constrained, # and the central sequences are restricted to the observed sequences ctrl <- MEDseq_control(algo="CEM", equalPro=TRUE, opti="medoid", criterion="asw") data(mvad) # Note that ctrl must be explicitly named 'ctrl' mod <- MEDseq_fit(seqdef(mvad[,17:86]), G=11, modtype="CC", weights=mvad$weight, ctrl=ctrl) # Alternatively, specify the control arguments directly mod <- MEDseq_fit(seqdef(mvad[,17:86]), G=11, modtype="CC", weights=mvad$weight, algo="CEM", equalPro=TRUE, opti="medoid", criterion="asw") # Note that supplying control arguments via a mix of the ... construct and the named argument # 'control' or supplying MEDseq_control output without naming it 'control' can throw an error
# The CC MEDseq model is almost equivalent to k-medoids when the # CEM algorithm is employed, mixing proportions are constrained, # and the central sequences are restricted to the observed sequences ctrl <- MEDseq_control(algo="CEM", equalPro=TRUE, opti="medoid", criterion="asw") data(mvad) # Note that ctrl must be explicitly named 'ctrl' mod <- MEDseq_fit(seqdef(mvad[,17:86]), G=11, modtype="CC", weights=mvad$weight, ctrl=ctrl) # Alternatively, specify the control arguments directly mod <- MEDseq_fit(seqdef(mvad[,17:86]), G=11, modtype="CC", weights=mvad$weight, algo="CEM", equalPro=TRUE, opti="medoid", criterion="asw") # Note that supplying control arguments via a mix of the ... construct and the named argument # 'control' or supplying MEDseq_control output without naming it 'control' can throw an error
Calculates the normalised entropy of a fitted MEDseq model.
MEDseq_entropy(x, group = FALSE)
MEDseq_entropy(x, group = FALSE)
x |
An object of class |
group |
A logical (defaults to |
When group
is FALSE
, this function calculates the normalised entropy via
,
where and
are the sample size and number of components, respectively, and
is the estimated posterior probability at convergence that observation
belongs to component
.
When group
is TRUE
,
is computed for each observation and averaged according to the MAP classification.
When group
is FALSE
, a single number, given by , in the range [0,1], such that larger values indicate clearer separation of the clusters. Otherwise, a vector of length
G
containing the per-component averages of the observation-specific entropies is returned.
This function will always return a normalised entropy of 1
for models fitted using the "CEM"
algorithm (see MEDseq_control
), or models with only one component.
Keefe Murphy - <[email protected]>
Murphy, K., Murphy, T. B., Piccarreta, R., and Gormley, I. C. (2021). Clustering longitudinal life-course sequences using mixtures of exponential-distance models. Journal of the Royal Statistical Society: Series A (Statistics in Society), 184(4): 1414-1451. <doi:10.1111/rssa.12712>.
MEDseq_fit
, MEDseq_control
, MEDseq_AvePP
# Load the MVAD data data(mvad) mvad$Location <- factor(apply(mvad[,5:9], 1L, function(x) which(x == "yes")), labels = colnames(mvad[,5:9])) mvad <- list(covariates = mvad[c(3:4,10:14,87)], sequences = mvad[,15:86], weights = mvad[,2]) mvad.cov <- mvad$covariates # Create a state sequence object with the first two (summer) time points removed states <- c("EM", "FE", "HE", "JL", "SC", "TR") labels <- c("Employment", "Further Education", "Higher Education", "Joblessness", "School", "Training") mvad.seq <- seqdef(mvad$sequences[-c(1,2)], states=states, labels=labels) # Fit a model with weights and a gating covariate # Have the probability of noise-component membership be constant mod <- MEDseq_fit(mvad.seq, G=11, modtype="UUN", weights=mvad$weights, gating=~ gcse5eq, covars=mvad.cov, noise.gate=FALSE) # Calculate the normalised entropy MEDseq_entropy(mod) # Calculate the normalised entropy per cluster MEDseq_entropy(mod, group=TRUE)
# Load the MVAD data data(mvad) mvad$Location <- factor(apply(mvad[,5:9], 1L, function(x) which(x == "yes")), labels = colnames(mvad[,5:9])) mvad <- list(covariates = mvad[c(3:4,10:14,87)], sequences = mvad[,15:86], weights = mvad[,2]) mvad.cov <- mvad$covariates # Create a state sequence object with the first two (summer) time points removed states <- c("EM", "FE", "HE", "JL", "SC", "TR") labels <- c("Employment", "Further Education", "Higher Education", "Joblessness", "School", "Training") mvad.seq <- seqdef(mvad$sequences[-c(1,2)], states=states, labels=labels) # Fit a model with weights and a gating covariate # Have the probability of noise-component membership be constant mod <- MEDseq_fit(mvad.seq, G=11, modtype="UUN", weights=mvad$weights, gating=~ gcse5eq, covars=mvad.cov, noise.gate=FALSE) # Calculate the normalised entropy MEDseq_entropy(mod) # Calculate the normalised entropy per cluster MEDseq_entropy(mod, group=TRUE)
Fits MEDseq models: mixtures of Exponential-Distance models with gating covariates and sampling weights. Typically used for clustering categorical/longitudinal life-course sequences. Additional arguments are available via the function MEDseq_control
.
MEDseq_fit(seqs, G = 1L:9L, modtype = c("CC", "UC", "CU", "UU", "CCN", "UCN", "CUN", "UUN"), gating = NULL, weights = NULL, ctrl = MEDseq_control(...), covars = NULL, ...) ## S3 method for class 'MEDseq' summary(object, classification = TRUE, parameters = FALSE, network = FALSE, SPS = FALSE, ...) ## S3 method for class 'MEDseq' print(x, digits = 3L, ...)
MEDseq_fit(seqs, G = 1L:9L, modtype = c("CC", "UC", "CU", "UU", "CCN", "UCN", "CUN", "UUN"), gating = NULL, weights = NULL, ctrl = MEDseq_control(...), covars = NULL, ...) ## S3 method for class 'MEDseq' summary(object, classification = TRUE, parameters = FALSE, network = FALSE, SPS = FALSE, ...) ## S3 method for class 'MEDseq' print(x, digits = 3L, ...)
seqs |
A state-sequence object of class |
G |
A positive integer vector specifying the numbers of mixture components (clusters) to fit. Defaults to |
modtype |
A vector of character strings indicating the type of MEDseq models to be fitted, in terms of the constraints or lack thereof on the precision parameters. By default, all valid model types are fitted (except some only where |
gating |
A |
weights |
Optional numeric vector containing observation-specific sampling weights, which are accounted for in the model fitting and other functions where applicable. |
ctrl |
A list of control parameters for the EM/CEM and other aspects of the algorithm. The defaults are set by a call to |
covars |
An optional data frame (or a matrix with named columns) in which to look for the covariates in the |
... |
Catches unused arguments (see |
x , object , digits , classification , parameters , network , SPS
|
Arguments required for the |
The function effectively allows 8 different MEDseq precision parameter settings for models with or without gating network covariates. By constraining the mixing proportions to be equal (see equalPro
in MEDseq_control
) an extra special case is facilitated in the latter case.
While model selection in terms of choosing the optimal number of components and the MEDseq model type is performed within MEDseq_fit
, using one of the criterion
options within MEDseq_control
, choosing between multiple fits with different combinations of covariates or different initialisation settings can be done by supplying objects of class "MEDseq"
to MEDseq_compare
.
A list (of class "MEDseq"
) with the following named entries (of which some may be missing, depending on the criterion
employed), mostly corresponding to the chosen optimal model (as determined by the criterion
within MEDseq_control
):
call |
The matched call. |
data |
The input data, |
modtype |
A character string denoting the MEDseq model type at which the optimal |
G |
The optimal number of mixture components according to |
params |
A list with the following named components:
|
gating |
An object of class |
z |
The final responsibility matrix whose |
MAP |
The vector of cluster labels for the chosen model corresponding to |
BIC |
A matrix of all BIC values with |
ICL |
A matrix of all ICL values with |
AIC |
A matrix of all AIC values with |
DBS |
A matrix of all (weighted) mean/median DBS values with |
DBSvals |
A list of lists giving the observation-specific DBS values for all fitted models. The first level of the list corresponds to numbers of components, the second to the MEDseq model types. |
dbs |
The (weighted) mean/median DBS value corresponding to the optimal model. May not necessarily be the optimal DBS. |
dbsvals |
Observation-specific DBS values corresponding to the optimum model, which may not be optimal in terms of DBS. |
ASW |
A matrix of all (weighted) mean/median ASW values with |
ASWvals |
A list of lists giving the observation-specific ASW values for all fitted models. The first level of the list corresponds to numbers of components, the second to the MEDseq model types. |
asw |
The (weighted) mean/median ASW value corresponding to the optimal model. May not necessarily be the optimal ASW. |
aswvals |
Observation-specific ASW values corresponding to the optimum model, which may not be optimal in terms of ASW. |
LOGLIK |
A matrix of all maximal log-likelihood values with |
DF |
A matrix giving the numbers of estimated parameters (i.e. the number of 'used' degrees of freedom) for all visited models, with |
ITERS |
A matrix giving the total number of EM/CEM iterations for all visited models, with |
CV |
A matrix of all cross-validated log-likelihood values with |
NEC |
A matrix of all NEC values with |
bic |
The BIC value corresponding to the optimal model. May not necessarily be the optimal BIC. |
icl |
The ICL value corresponding to the optimal model. May not necessarily be the optimal ICL. |
aic |
The AIC value corresponding to the optimal model. May not necessarily be the optimal AIC. |
loglik |
The vector of increasing log-likelihood values for every EM/CEM iteration under the optimal model. The last element of this vector is the maximum log-likelihood achieved by the parameters returned at convergence. |
df |
The number of estimated parameters in the optimal model (i.e. the number of 'used' degrees of freedom). Subtract this number from the sample size to get the degrees of freedom. |
iters |
The total number of EM/CEM iterations for the optimal model. |
cv |
The cross-validated log-likelihood value corresponding to the optimal model, if available. May not necessarily be the optimal one. |
nec |
The NEC value corresponding to the optimal model, if available. May not necessarily be the optimal NEC. |
ZS |
A list of lists giving the |
uncert |
The uncertainty associated with the |
covars |
A data frame gathering the set of covariates used in the |
Dedicated plot
, print
, and summary
functions exist for objects of class "MEDseq"
.
Where BIC
, ICL
, AIC
, DBS
, ASW
, LOGLIK
, DF
, ITERS
, CV
, and NEC
contain NA
entries, this corresponds to a model which was not run; for instance a UU model is never run for single-component models as it is equivalent to CU, while a UCN model is never run for two-component models as it is equivalent to CCN. As such, one can consider the value as not really missing, but equivalent to the corresponding value. On the other hand, -Inf
represents models which were terminated due to error, for which a log-likelihood could not be estimated. These objects all inherit the class "MEDCriterion"
for which dedicated print
and summary
methods exist. For plotting, please see plot
.
Keefe Murphy - <[email protected]>
Murphy, K., Murphy, T. B., Piccarreta, R., and Gormley, I. C. (2021). Clustering longitudinal life-course sequences using mixtures of exponential-distance models. Journal of the Royal Statistical Society: Series A (Statistics in Society), 184(4): 1414-1451. <doi:10.1111/rssa.12712>.
seqdef
(reexported by MEDseq for convenience), MEDseq_control
, MEDseq_compare
, plot.MEDseq
, predict.MEDgating
, MEDseq_stderr
, I
, MEDseq_clustnames
, seqformat
# Load the MVAD data data(mvad) mvad$Location <- factor(apply(mvad[,5:9], 1L, function(x) which(x == "yes")), labels = colnames(mvad[,5:9])) mvad <- list(covariates = mvad[c(3:4,10:14,87)], sequences = mvad[,15:86], weights = mvad[,2]) mvad.cov <- mvad$covariates # Create a state sequence object with the first two (summer) time points removed states <- c("EM", "FE", "HE", "JL", "SC", "TR") labels <- c("Employment", "Further Education", "Higher Education", "Joblessness", "School", "Training") mvad.seq <- seqdef(mvad$sequences[-c(1,2)], states=states, labels=labels) # Fit a range of exponential-distance models without clustering mod0 <- MEDseq_fit(mvad.seq, G=1) # Fit a range of unweighted mixture models without covariates # Only consider models with a noise component # Supply some MEDseq_control() arguments # mod1 <- MEDseq_fit(mvad.seq, G=9:10, modtype=c("CCN", "CUN", "UCN", "UUN"), # algo="CEM", init.z="kmodes", criterion="icl") # Fit a model with weights and a gating covariate # Have the probability of noise-component membership be constant mod2 <- MEDseq_fit(mvad.seq, G=11, modtype="UUN", weights=mvad$weights, gating=~ gcse5eq, covars=mvad.cov, noise.gate=FALSE) # Examine this model in greater detail summary(mod2, classification=TRUE, parameters=TRUE) summary(mod2$gating, SPS=TRUE) print(mod2$params$theta, SPS=TRUE) plot(mod2, "clusters")
# Load the MVAD data data(mvad) mvad$Location <- factor(apply(mvad[,5:9], 1L, function(x) which(x == "yes")), labels = colnames(mvad[,5:9])) mvad <- list(covariates = mvad[c(3:4,10:14,87)], sequences = mvad[,15:86], weights = mvad[,2]) mvad.cov <- mvad$covariates # Create a state sequence object with the first two (summer) time points removed states <- c("EM", "FE", "HE", "JL", "SC", "TR") labels <- c("Employment", "Further Education", "Higher Education", "Joblessness", "School", "Training") mvad.seq <- seqdef(mvad$sequences[-c(1,2)], states=states, labels=labels) # Fit a range of exponential-distance models without clustering mod0 <- MEDseq_fit(mvad.seq, G=1) # Fit a range of unweighted mixture models without covariates # Only consider models with a noise component # Supply some MEDseq_control() arguments # mod1 <- MEDseq_fit(mvad.seq, G=9:10, modtype=c("CCN", "CUN", "UCN", "UUN"), # algo="CEM", init.z="kmodes", criterion="icl") # Fit a model with weights and a gating covariate # Have the probability of noise-component membership be constant mod2 <- MEDseq_fit(mvad.seq, G=11, modtype="UUN", weights=mvad$weights, gating=~ gcse5eq, covars=mvad.cov, noise.gate=FALSE) # Examine this model in greater detail summary(mod2, classification=TRUE, parameters=TRUE) summary(mod2$gating, SPS=TRUE) print(mod2$params$theta, SPS=TRUE) plot(mod2, "clusters")
Computes the mean time (per cluster) spent in each sequence category (i.e. state value) for a fitted MEDseq
model.
MEDseq_meantime(x, MAP = FALSE, weighted = TRUE, norm = TRUE, prop = FALSE, map.size = FALSE, wt.size = FALSE, SPS = FALSE) ## S3 method for class 'MEDseqMeanTime' print(x, digits = 3L, ...)
MEDseq_meantime(x, MAP = FALSE, weighted = TRUE, norm = TRUE, prop = FALSE, map.size = FALSE, wt.size = FALSE, SPS = FALSE) ## S3 method for class 'MEDseqMeanTime' print(x, digits = 3L, ...)
x |
An object of class |
MAP |
A logical indicating whether to use the MAP classification in the computation of the averages, or the 'soft' clustering assignment probabilities given by |
weighted |
A logical indicating whether the sampling weights (if used during model fitting) are used to compute the weighted averages. These can be used alone (when |
norm |
A logical indicating whether the mean times (outputted values after the first column) are normalised to sum to the sequence length within each cluster (defaults to |
prop |
A logical (defaulting to |
map.size |
A logical (defaulting to |
wt.size |
A logical (defaults to |
SPS |
A logical indicating whether the output should be labelled according to the state-permanence-sequence representation of the central sequences. Defaults to |
digits |
Minimum number of significant digits to be printed in values. |
... |
Catches unused arguments. |
Models with weights, covariates, &/or a noise component are also accounted for.
A matrix with sequence category and cluster-specific mean times, giving clusters on the rows, corresponding cluster sizes (or weighted cluster sizes) in the first column, and sequence categories in the remaining columns.
The function plot.MEDseq
with the option type="mt"
can be used to visualise the mean times (by cluster). However, the results displayed therein (at present) always assume norm=TRUE
, prop=FALSE
, and wt.size=TRUE
, while the MAP
argument is renamed to soft
, where MAP=!soft
.
Keefe Murphy - <[email protected]>
Murphy, K., Murphy, T. B., Piccarreta, R., and Gormley, I. C. (2021). Clustering longitudinal life-course sequences using mixtures of exponential-distance models. Journal of the Royal Statistical Society: Series A (Statistics in Society), 184(4): 1414-1451. <doi:10.1111/rssa.12712>.
MEDseq_fit
, MEDseq_control
, plot.MEDseq
data(biofam) seqs <- seqdef(biofam[10:25] + 1L, states = c("P", "L", "M", "L+M", "C", "L+C", "L+M+C", "D")) mod <- MEDseq_fit(seqs, G=10, modtype="UUN") MEDseq_meantime(mod) MEDseq_meantime(mod, prop=TRUE) MEDseq_meantime(mod, map.size=TRUE) MEDseq_meantime(mod, MAP=TRUE, norm=FALSE, SPS=TRUE)
data(biofam) seqs <- seqdef(biofam[10:25] + 1L, states = c("P", "L", "M", "L+M", "C", "L+C", "L+M+C", "D")) mod <- MEDseq_fit(seqs, G=10, modtype="UUN") MEDseq_meantime(mod) MEDseq_meantime(mod, prop=TRUE) MEDseq_meantime(mod, map.size=TRUE) MEDseq_meantime(mod, MAP=TRUE, norm=FALSE, SPS=TRUE)
Show the NEWS
file of the MEDseq
package.
MEDseq_news()
MEDseq_news()
The MEDseq
NEWS
file, provided the session is interactive.
MEDseq_news()
MEDseq_news()
Computes standard errors of the gating network coefficients in a fitted MEDseq model using either the Weighted Likelihood Bootstrap or Jackknife methods.
MEDseq_stderr(mod, method = c("WLBS", "Jackknife"), N = 1000L, symmetric = TRUE, SPS = FALSE)
MEDseq_stderr(mod, method = c("WLBS", "Jackknife"), N = 1000L, symmetric = TRUE, SPS = FALSE)
mod |
An object of class |
method |
The method used to compute the standard errors (defaults to |
N |
The (integer) number of samples to use when the |
symmetric |
A logical indicating whether symmetric draws from the uniform Dirichlet distribution are used for the |
SPS |
A logical indicating whether the output should be labelled according to the state-permanence-sequence representation of the central sequences. Defaults to |
A progress bar is displayed as the function iterates over the N
samples. The function may take a long time to run for large N
. The function terminates immediately if mod$G == 1
.
A list with the following two elements:
Coefficients
The original matrix of estimated coefficients (coef(mod$gating)
).
Std. Errors
The matrix of corresponding standard error estimates.
The symmetric
argument is an experimental feature. More generally, caution is advised in interpreting the standard error estimates under either the "WLBS"
or the "Jackknife"
method
when there are existing sampling weights which arise from complex/stratified sampling designs.
Keefe Murphy - <[email protected]>
Murphy, K., Murphy, T. B., Piccarreta, R., and Gormley, I. C. (2021). Clustering longitudinal life-course sequences using mixtures of exponential-distance models. Journal of the Royal Statistical Society: Series A (Statistics in Society), 184(4): 1414-1451. <doi:10.1111/rssa.12712>.
O'Hagan, A., Murphy, T. B., Scrucca, L., and Gormley, I. C. (2019). Investigation of parameter uncertainty in clustering using a Gaussian mixture model via jackknife, bootstrap and weighted likelihood bootstrap. Computational Statistics, 34(4): 1779-1813.
MEDseq_fit
, MEDseq_clustnames
, seqformat
# Load the MVAD data data(mvad) mvad$Location <- factor(apply(mvad[,5:9], 1L, function(x) which(x == "yes")), labels = colnames(mvad[,5:9])) mvad <- list(covariates = mvad[c(3:4,10:14,87)], sequences = mvad[,15:86], weights = mvad[,2]) mvad.cov <- mvad$covariates # Create a state sequence object with the first two (summer) time points removed states <- c("EM", "FE", "HE", "JL", "SC", "TR") labels <- c("Employment", "Further Education", "Higher Education", "Joblessness", "School", "Training") mvad.seq <- seqdef(mvad$sequences[-c(1,2)], states=states, labels=labels) # Fit a model with weights and a gating covariate # Have the probability of noise-component membership be constant # mod <- MEDseq_fit(mvad.seq, G=11, modtype="UUN", weights=mvad$weights, # gating=~ gcse5eq, covars=mvad.cov, noise.gate=FALSE) # Estimate standard errors using 100 WLBS samples # (std <- MEDseq_stderr(mod, N=100))
# Load the MVAD data data(mvad) mvad$Location <- factor(apply(mvad[,5:9], 1L, function(x) which(x == "yes")), labels = colnames(mvad[,5:9])) mvad <- list(covariates = mvad[c(3:4,10:14,87)], sequences = mvad[,15:86], weights = mvad[,2]) mvad.cov <- mvad$covariates # Create a state sequence object with the first two (summer) time points removed states <- c("EM", "FE", "HE", "JL", "SC", "TR") labels <- c("Employment", "Further Education", "Higher Education", "Joblessness", "School", "Training") mvad.seq <- seqdef(mvad$sequences[-c(1,2)], states=states, labels=labels) # Fit a model with weights and a gating covariate # Have the probability of noise-component membership be constant # mod <- MEDseq_fit(mvad.seq, G=11, modtype="UUN", weights=mvad$weights, # gating=~ gcse5eq, covars=mvad.cov, noise.gate=FALSE) # Estimate standard errors using 100 WLBS samples # (std <- MEDseq_stderr(mod, N=100))
The data comes from a study by McVicar and Anyadike-Danes on transition from school to work. The data consist of static background characteristics and a time series sequence of 72 monthly labour market activities for each of a cohort of 712 individuals in the Status Zero Survey. The individuals were followed up from July 1993 to June 1999. The monthly states are recorded in columns 15 (Jul.93
) to 86 (Jun.99
).
data(mvad)
data(mvad)
A data frame containing 712 rows, 72 state variables, 1 id variable and 13 covariates.
States are:
employment |
(EM) |
FE |
further education (FE) |
HE |
higher education (HE) |
joblessness |
(JL) |
school |
(SC) |
training |
(TR) |
The data set contains also ids (id
) and sample weights (weights
) as well as the following binary covariates:male
catholic
Belfast
, N.Eastern
, Southern
, S.Eastern
, Western
(location of school, one of five Education and Library Board areas in Northern Ireland)Grammar
(type of secondary education, 1=grammar school)funemp
(father's employment status at time of survey, 1=father unemployed)gcse5eq
(qualifications gained by the end of compulsory education, 1=5+ GCSEs at grades A-C, or equivalent)fmpr
(SOC code of father's current or most recent job at time of survey, 1=SOC1 (professional, managerial or related))livboth
(living arrangements at time of first sweep of survey (June 1995), 1=living with both parents)
The first two months of the observation period coincide with summer holidays from school. Hence, documented examples throughout this package extract only the states in columns 17 to 86; i.e. sequences of length 70 from Sep.93
to Jun.99
.
McVicar and Anyadike-Danes (2002)
McVicar, D. (2000). Status 0 four years on: young people and social exclusion in Northern Ireland. Labour Market Bulletin, 14, 114-119.
McVicar, D. and Anyadike-Danes, M. (2002). Predicting successful and unsuccessful transitions from school to work by using sequence methods. Journal of the Royal Statistical Society: Series A (Statistics in Society), 165(2): 317-334.
data(mvad, package="MEDseq") mvad$Location <- factor(apply(mvad[,5:9], 1L, function(x) which(x == "yes")), labels = colnames(mvad[,5:9])) mvad <- list(covariates = mvad[c(3:4,10:14,87)], sequences = mvad[,15:86], weights = mvad[,2]) mvad.cov <- mvad$covariates # Create a state sequence object with the first two (summer) time points removed states <- c("EM", "FE", "HE", "JL", "SC", "TR") labels <- c("Employment", "Further Education", "Higher Education", "Joblessness", "School", "Training") mvad.seq <- seqdef(mvad$sequences[-c(1,2)], states=states, labels=labels, weights=mvad$weights)
data(mvad, package="MEDseq") mvad$Location <- factor(apply(mvad[,5:9], 1L, function(x) which(x == "yes")), labels = colnames(mvad[,5:9])) mvad <- list(covariates = mvad[c(3:4,10:14,87)], sequences = mvad[,15:86], weights = mvad[,2]) mvad.cov <- mvad$covariates # Create a state sequence object with the first two (summer) time points removed states <- c("EM", "FE", "HE", "JL", "SC", "TR") labels <- c("Employment", "Further Education", "Higher Education", "Joblessness", "School", "Training") mvad.seq <- seqdef(mvad$sequences[-c(1,2)], states=states, labels=labels, weights=mvad$weights)
Produces a range of plots of the results of fitted MEDseq
models.
## S3 method for class 'MEDseq' plot(x, type = c("clusters", "central", "precision", "gating", "bic", "icl", "aic", "dbs", "asw", "cv", "nec", "LOGLIK", "dbsvals", "aswvals", "similarity", "uncert.bar", "uncert.profile", "loglik", "d", "dH", "f", "Ht", "i", "I", "ms", "mt"), seriated = c("observations", "both", "clusters", "none"), soft = NULL, weighted = TRUE, SPS = NULL, smeth = "TSP", sortv = NULL, subset = NULL, quant.scale = FALSE, ...)
## S3 method for class 'MEDseq' plot(x, type = c("clusters", "central", "precision", "gating", "bic", "icl", "aic", "dbs", "asw", "cv", "nec", "LOGLIK", "dbsvals", "aswvals", "similarity", "uncert.bar", "uncert.profile", "loglik", "d", "dH", "f", "Ht", "i", "I", "ms", "mt"), seriated = c("observations", "both", "clusters", "none"), soft = NULL, weighted = TRUE, SPS = NULL, smeth = "TSP", sortv = NULL, subset = NULL, quant.scale = FALSE, ...)
x |
An object of class |
type |
A character string giving the type of plot requested:
Also available are the following options which act as wrappers to types of plots produced by the Note also that all of the plot types below can be made to either work with the hard MAP partition (as per
|
seriated |
Switch indicating whether seriation should be used to improve the visualisation by re-ordering the The Additionally, the Though all |
soft |
This argument is a single logical indicator which is only relevant for the Note that soft cluster membership probabilities will not be available if |
weighted |
This argument is a single logical indicator which is only relevant for the Additionally, for these plots and the |
SPS |
A logical indicating whether clusters should be labelled according to the state-permanence-sequence representation of their central sequence. See |
smeth |
A character string with the name of the seriation method to be used. Defaults to |
sortv |
A sorting method governing the ordering of observations for Additionally, when (and only when) |
subset |
An optional numeric vector giving the indices of the clusters to be plotted. For models with a noise component, values in |
quant.scale |
Logical indicating whether precision parameter heatmaps should use quantiles to determine non-linear colour break-points when |
... |
Catches unused arguments, and allows arguments to |
The type
options related to model selection criteria plot values for all fitted models in the "MEDseq"
object x
. The remaining type
options plot results for the optimal model, by default. However, arguments to get_MEDseq_results
can be passed via the ...
construct to plot corresponding results for suboptimal models in x
when type
is one of "clusters"
, "d"
, "dH"
, "f"
, "Ht"
, "i"
, "I"
, "ms"
, or "mt"
. See the examples below.
The visualisation according to type
of the results of a fitted MEDseq
model.
Every type
of plot respects the sampling weights, if any. However, those related to seqplot
plots from TraMineR ("d"
, "dH"
, "f"
, "Ht"
, "i"
, "I"
, "ms"
, "mt"
) do so only when weighted=TRUE
(the default).
For these plot types borrowed from TraMineR, when weighted=TRUE
, the y-axis labels (which can be suppressed using ylab=NA
) always display cluster sizes which correspond to the output of MEDseq_meantime(x, MAP=!soft, weighted=weighted, map.size=FALSE, wt.size=TRUE)
, where wt.size=TRUE
is NOT the default behaviour for MEDseq_meantime
.
Please note that type="dH"
will be unavailable if versions of TraMineR prior to 2.2-4
are in use. The colour of the entropy line(s) will be "blue"
for type="Ht"
and "black"
for type="dH"
. Finally, the plot types borrowed from TraMineR may be too wide to display in the preview panel. The same may also be true when type
is "dbsvals"
or "aswvals"
.
Keefe Murphy - <[email protected]>
Murphy, K., Murphy, T. B., Piccarreta, R., and Gormley, I. C. (2021). Clustering longitudinal life-course sequences using mixtures of exponential-distance models. Journal of the Royal Statistical Society: Series A (Statistics in Society), 184(4): 1414-1451. <doi:10.1111/rssa.12712>.
Studer, M. (2018). Divisive property-based and fuzzy clustering for sequence analysis. In G. Ritschard and M. Studer (Eds.), Sequence Analysis and Related Approaches: Innovative Methods and Applications, Volume 10 of Life Course Research and Social Policies, pp. 223-239. Cham, Switzerland: Springer.
Gabadinho, A., Ritschard, G., Mueller, N. S., and Studer, M. (2011). Analyzing and visualizing state sequences in R with TraMineR. Journal of Statistical Software, 40(4): 1-37.
MEDseq_fit
, seqplot
, dbs
, get_MEDseq_results
, seriate
, list_seriation_methods
, fuzzyseqplot
, MEDseq_meantime
, MEDseq_clustnames
, seqformat
# Load the MVAD data data(mvad) mvad$Location <- factor(apply(mvad[,5:9], 1L, function(x) which(x == "yes")), labels = colnames(mvad[,5:9])) mvad <- list(covariates = mvad[c(3:4,10:14,87)], sequences = mvad[,15:86], weights = mvad[,2]) mvad.cov <- mvad$covariates # Create a state sequence object with the first two (summer) time points removed states <- c("EM", "FE", "HE", "JL", "SC", "TR") labels <- c("Employment", "Further Education", "Higher Education", "Joblessness", "School", "Training") mvad.seq <- seqdef(mvad$sequences[-c(1,2)], states=states, labels=labels) # Fit a range of exponential-distance models without clustering mod0 <- MEDseq_fit(mvad.seq, G=1) # Show the central sequence and precision parameters of the optimal model plot(mod0, type="central") plot(mod0, type="ms") plot(mod0, type="precision") # Fit a range of unweighted mixture models without covariates # Only consider models with a noise component # mod1 <- MEDseq_fit(mvad.seq, G=9:11, modtype=c("CCN", "CUN", "UCN", "UUN")) # Plot the DBS values for all fitted models # plot(mod1, "dbs") # Plot the clusters of the optimal model (according to the dbs criterion) # plot(mod1, "clusters", criterion="dbs") # Use seriation to order the observations and the clusters # plot(mod1, "cluster", criterion="dbs", seriated="both") # Use a different seriation method # seriation::list_seriation_methods("dist") # plot(mod1, "cluster", criterion="dbs", seriated="both", smeth="Spectral") # Use the DBS values instead to sort the observations, and label the clusters # plot(mod1, "cluster", criterion="dbs", seriated="both", sortv="dbs", SPS=TRUE, size=TRUE) # Plot the observation-specific ASW values of the best CCN model (according to the asw criterion) # plot(mod1, "aswvals", modtype="CCN", criterion="asw") # Plot the similarity matrix (as a heatmap) of the best G=9 model (according to the icl criterion) # plot(mod1, "similarity", G=9, criterion="icl") # Fit a model with weights and gating covariates # mod2 <- MEDseq_fit(mvad.seq, G=10, modtype="UCN", weights=mvad$weights, # gating=~ fmpr + gcse5eq + livboth, covars=mvad.cov) # Plot the central sequences & precision parameters of this model # plot(mod2, "central") # plot(mod2, "precision") # Plot the clustering uncertainties in the form of a barplot # plot(mod2, "uncert.bar") # Plot the observation-specific DBS values # plot(mod2, "dbsvals") # Plot the transversal entropies by cluster & then the state-distributions by cluster # Note that these plots may not display properly in the preview panel # plot(mod2, "Ht", ylab=NA) # suppress the y-axis labels # plot(mod2, "d", border=TRUE) # add borders # plot(mod2, "dH", ylab=NA, border=TRUE) # both simultaneously (needs TraMineR >=2.2-4) # The plots above use the soft cluster membership probabilities # Discard this information and reproduce the per-cluster state-distributions plot # plot(mod2, "d", soft=FALSE) # The plots above use the observation-specific sampling weights # Discard this information and plot the mean times per state per cluster # plot(mod2, "mt", weighted=FALSE) # Use type="I" and subset=0 to examine the noise component # plot(mod2, "I", subset=0, border=TRUE, weighted=FALSE, seriated="none")
# Load the MVAD data data(mvad) mvad$Location <- factor(apply(mvad[,5:9], 1L, function(x) which(x == "yes")), labels = colnames(mvad[,5:9])) mvad <- list(covariates = mvad[c(3:4,10:14,87)], sequences = mvad[,15:86], weights = mvad[,2]) mvad.cov <- mvad$covariates # Create a state sequence object with the first two (summer) time points removed states <- c("EM", "FE", "HE", "JL", "SC", "TR") labels <- c("Employment", "Further Education", "Higher Education", "Joblessness", "School", "Training") mvad.seq <- seqdef(mvad$sequences[-c(1,2)], states=states, labels=labels) # Fit a range of exponential-distance models without clustering mod0 <- MEDseq_fit(mvad.seq, G=1) # Show the central sequence and precision parameters of the optimal model plot(mod0, type="central") plot(mod0, type="ms") plot(mod0, type="precision") # Fit a range of unweighted mixture models without covariates # Only consider models with a noise component # mod1 <- MEDseq_fit(mvad.seq, G=9:11, modtype=c("CCN", "CUN", "UCN", "UUN")) # Plot the DBS values for all fitted models # plot(mod1, "dbs") # Plot the clusters of the optimal model (according to the dbs criterion) # plot(mod1, "clusters", criterion="dbs") # Use seriation to order the observations and the clusters # plot(mod1, "cluster", criterion="dbs", seriated="both") # Use a different seriation method # seriation::list_seriation_methods("dist") # plot(mod1, "cluster", criterion="dbs", seriated="both", smeth="Spectral") # Use the DBS values instead to sort the observations, and label the clusters # plot(mod1, "cluster", criterion="dbs", seriated="both", sortv="dbs", SPS=TRUE, size=TRUE) # Plot the observation-specific ASW values of the best CCN model (according to the asw criterion) # plot(mod1, "aswvals", modtype="CCN", criterion="asw") # Plot the similarity matrix (as a heatmap) of the best G=9 model (according to the icl criterion) # plot(mod1, "similarity", G=9, criterion="icl") # Fit a model with weights and gating covariates # mod2 <- MEDseq_fit(mvad.seq, G=10, modtype="UCN", weights=mvad$weights, # gating=~ fmpr + gcse5eq + livboth, covars=mvad.cov) # Plot the central sequences & precision parameters of this model # plot(mod2, "central") # plot(mod2, "precision") # Plot the clustering uncertainties in the form of a barplot # plot(mod2, "uncert.bar") # Plot the observation-specific DBS values # plot(mod2, "dbsvals") # Plot the transversal entropies by cluster & then the state-distributions by cluster # Note that these plots may not display properly in the preview panel # plot(mod2, "Ht", ylab=NA) # suppress the y-axis labels # plot(mod2, "d", border=TRUE) # add borders # plot(mod2, "dH", ylab=NA, border=TRUE) # both simultaneously (needs TraMineR >=2.2-4) # The plots above use the soft cluster membership probabilities # Discard this information and reproduce the per-cluster state-distributions plot # plot(mod2, "d", soft=FALSE) # The plots above use the observation-specific sampling weights # Discard this information and plot the mean times per state per cluster # plot(mod2, "mt", weighted=FALSE) # Use type="I" and subset=0 to examine the noise component # plot(mod2, "I", subset=0, border=TRUE, weighted=FALSE, seriated="none")
Predicts mixing proportions from MEDseq gating networks. Effectively akin to predicting from a multinomial logistic regression via multinom
, although here the noise component (if any) is properly accounted for. So too are models with no gating covariates at all, or models with the equal mixing proportion constraint. Prior probabilities are returned by default.
## S3 method for class 'MEDgating' predict(object, newdata = NULL, type = c("probs", "class"), keep.noise = TRUE, droplevels = FALSE, ...) ## S3 method for class 'MEDgating' fitted(object, ...) ## S3 method for class 'MEDgating' residuals(object, ...)
## S3 method for class 'MEDgating' predict(object, newdata = NULL, type = c("probs", "class"), keep.noise = TRUE, droplevels = FALSE, ...) ## S3 method for class 'MEDgating' fitted(object, ...) ## S3 method for class 'MEDgating' residuals(object, ...)
object |
An object of class |
newdata |
A matrix or data frame of test examples. If omitted, the fitted values are used. |
type |
The type of output desired. The default ( |
keep.noise |
A logical indicating whether the output should acknowledge the noise component (if any). Defaults to |
droplevels |
A logical indicating whether unseen factor levels in categorical variables within |
... |
Catches unused arguments or allows the |
This function (unlike the predict
method for multinom
on which predict.MEDgating
is based) accounts for sampling weights and the various ways of treating gating covariates, equal mixing proportion constraints, and noise components, although its type
argument defaults to "probs"
rather than "class"
.
The return value depends on whether newdata
is supplied or not and whether the model includes gating covariates to begin with. When newdata
is not supplied, the fitted values are returned (as matrix if the model contained gating covariates, otherwise as a vector as per x$params$tau
). If newdata
is supplied, the output is always a matrix with the same number of rows as the newdata
.
Keefe Murphy - <[email protected]>
Murphy, K., Murphy, T. B., Piccarreta, R., and Gormley, I. C. (2021). Clustering longitudinal life-course sequences using mixtures of exponential-distance models. Journal of the Royal Statistical Society: Series A (Statistics in Society), 184(4): 1414-1451. <doi:10.1111/rssa.12712>.
# Load the MVAD data data(mvad) mvad$Location <- factor(apply(mvad[,5:9], 1L, function(x) which(x == "yes")), labels = colnames(mvad[,5:9])) mvad <- list(covariates = mvad[c(3:4,10:14,87)], sequences = mvad[,15:86], weights = mvad[,2]) mvad.cov <- mvad$covariates # Create a state sequence object with the first two (summer) time points removed states <- c("EM", "FE", "HE", "JL", "SC", "TR") labels <- c("Employment", "Further Education", "Higher Education", "Joblessness", "School", "Training") mvad.seq <- seqdef(mvad$sequences[-c(1,2)], states=states, labels=labels) # Fit a model with weights and a gating covariate # Have the probability of noise-component membership be constant mod <- MEDseq_fit(mvad.seq, G=11, modtype="UUN", weights=mvad$weights, gating=~ gcse5eq, covars=mvad.cov, noise.gate=FALSE) (preds <- predict(mod$gating, newdata=mvad.cov[1:5,])) # Note that the predictions are not the same as the multinom predict method # in this instance, owing to the invocation of noise.gate=FALSE above mod2 <- mod class(mod2$gating) <- c("multinom", "nnet") predict(mod2$gating, newdata=mvad.cov[1:5,], type="probs") # We can make this function behave in the same way by invoking keep.noise=FALSE predict(mod$gating, keep.noise=FALSE, newdata=mvad.cov[1:5,])
# Load the MVAD data data(mvad) mvad$Location <- factor(apply(mvad[,5:9], 1L, function(x) which(x == "yes")), labels = colnames(mvad[,5:9])) mvad <- list(covariates = mvad[c(3:4,10:14,87)], sequences = mvad[,15:86], weights = mvad[,2]) mvad.cov <- mvad$covariates # Create a state sequence object with the first two (summer) time points removed states <- c("EM", "FE", "HE", "JL", "SC", "TR") labels <- c("Employment", "Further Education", "Higher Education", "Joblessness", "School", "Training") mvad.seq <- seqdef(mvad$sequences[-c(1,2)], states=states, labels=labels) # Fit a model with weights and a gating covariate # Have the probability of noise-component membership be constant mod <- MEDseq_fit(mvad.seq, G=11, modtype="UUN", weights=mvad$weights, gating=~ gcse5eq, covars=mvad.cov, noise.gate=FALSE) (preds <- predict(mod$gating, newdata=mvad.cov[1:5,])) # Note that the predictions are not the same as the multinom predict method # in this instance, owing to the invocation of noise.gate=FALSE above mod2 <- mod class(mod2$gating) <- c("multinom", "nnet") predict(mod2$gating, newdata=mvad.cov[1:5,], type="probs") # We can make this function behave in the same way by invoking keep.noise=FALSE predict(mod$gating, keep.noise=FALSE, newdata=mvad.cov[1:5,])
Perform k-modes clustering on categorical data with observation-specific sampling weights and tie-breaking adjustments.
wKModes(data, modes, weights = NULL, iter.max = .Machine$integer.max, freq.weighted = FALSE, fast = TRUE, random = TRUE, ...)
wKModes(data, modes, weights = NULL, iter.max = .Machine$integer.max, freq.weighted = FALSE, fast = TRUE, random = TRUE, ...)
data |
A matrix or data frame of categorical data. Objects have to be in rows, variables in columns. |
modes |
Either the number of modes or a set of initial (distinct) cluster modes (where each mode is a row and |
weights |
Optional numeric vector containing non-negative observation-specific case weights. |
iter.max |
The maximum number of iterations allowed. Defaults to |
freq.weighted |
A logical indicating whether the usual simple-matching (Hamming) distance between objects is used, or a frequency weighted version of this distance. Defaults to |
fast |
A logical indicating whether a fast version of the algorithm should be applied. Defaults to |
random |
A logical indicating whether ties for the modal values &/or assignments are broken at random. Defaults to Regarding the modes, ties are broken at random when |
... |
Catches unused arguments. |
The k-modes algorithm (Huang, 1998) is an extension of the k-means algorithm by MacQueen (1967).
The data given by data
is clustered by the k-modes method (Huang, 1998) which aims to partition the objects into k groups such that the distance from objects to the assigned cluster modes is minimised.
By default, the simple-matching (Hamming) distance is used to determine the dissimilarity of two objects. It is computed by counting the number of mismatches in all variables. Alternatively, this distance can be weighted by the frequencies of the categories in data, using the freq.weighted
argument (see Huang, 1998, for details). For convenience, the function dist_freqwH
is provided for calculating the corresponding pairwise dissimilarity matrix for subsequent use.
If an initial matrix of modes is supplied, it is possible that no object will be closest to one or more modes. In this case, fewer clusters than the number of supplied modes will be returned and a warning will be printed.
If called using fast = TRUE
, the reassignment of the data to clusters is done for the entire data set before recomputation of the modes is done. For computational reasons, this option should be chosen for all but the most moderate of data sizes.
An object of class "wKModes"
which is a list with the following components:
cluster
A vector of integers indicating the cluster to which each object is allocated.
size
The number of objects in each cluster.
modes
A matrix of cluster modes.
withindiff
The within-cluster (weighted) simple-matching distance for each cluster.
tot.withindiff
The total within-cluster (weighted) distance over all clusters. tot.withindiff
can be used to guide the choice of the number of clusters, but beware of inherent randomness in the algorithm, which is liable to yield a jagged elbow plot (see examples).
iterations
The number of iterations the algorithm reached.
weighted
A logical indicating whether observation-level weights
were used or not throughout the algorithm.
freq.weighted
A logical indicating whether feature-level freq.weights
were used or not in the computation of the distances. For convenience, the function dist_freqwH
is provided for calculating the corresponding pairwise dissimilarity matrix for subsequent use.
random
A logical indicating whether ties were broken at random or not throughout the algorithm.
This code is adapted from the kmodes
function in the klaR package. Specifically, modifications were made to allow for random tie-breaking for the modes and assignments (see random
above) and the incorporation of observation-specific sampling weights
, with a view to using this function as a means to initialise the allocations for MEDseq models (see the MEDseq_control
argument init.z
and the related options "kmodes"
and "kmodes2"
).
Notably, the wKModes
function, when invoked inside MEDseq_fit
, is used regardless of whether the weights are true sampling weights, or the weights are merely aggregation weights, or there are no weights at all. Furthermore, the MEDseq_control
argument random
is also passed to wKModes
when it is invoked inside MEDseq_fit
.
Update: as of version 1.7-1
of klaR, klaR::kmodes
now breaks assignment ties at random only when fast=TRUE
. It still breaks assignment ties when fast=FALSE
and all ties for modal values in the non-random manner described above. Thus, the old behaviour of klaR::kmodes
can be recovered by specifying random=FALSE
here, but random=TRUE
allows random tie-breaking for both types of ties in all situations.
Keefe Murphy - <[email protected]>
(adapted from klaR::kmodes
)
Huang, Z. (1998). Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery, 2(3): 283-304.
MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In L. M. L. Cam and J. Neyman (Eds.), Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1, June 21-July 18, 1965 and December 27 1965-January 7, 1966, Statistical Laboratory of the University of California, Berkelely, CA, USA, pp. 281-297. University of California Press.
MEDseq_control
, MEDseq_fit
, dist_freqwH
, wcAggregateCases
, seqformat
suppressMessages(require(WeightedCluster)) set.seed(99) # Load the MVAD data & aggregate the state sequences data(mvad) agg <- wcAggregateCases(mvad[,17:86], weights=mvad$weight) # Create a state sequence object without the first two (summer) time points states <- c("EM", "FE", "HE", "JL", "SC", "TR") labels <- c("Employment", "Further Education", "Higher Education", "Joblessness", "School", "Training") mvad.seq <- seqdef(mvad[agg$aggIndex, 17:86], states=states, labels=labels, weights=agg$aggWeights) # Run k-modes without the weights resX <- wKModes(mvad.seq, 2) # Run k-modes with the weights resW <- wKModes(mvad.seq, 2, weights=agg$aggWeights) # Examine the modal sequences of both solutions seqformat(seqdef(resX$modes), from="STS", to="SPS", compress=TRUE) seqformat(seqdef(resW$modes), from="STS", to="SPS", compress=TRUE) # Using tot.withindiff to choose the number of clusters TWdiffs <- sapply(1:5, function(k) wKModes(mvad.seq, k, weights=agg$aggWeights)$tot.withindiff) plot(TWdiffs, type="b", xlab="K") # Use multiple random starts to account for inherent randomness TWDiff <- sapply(1:5, function(k) min(replicate(10, wKModes(mvad.seq, k, weights=agg$aggWeights)$tot.withindiff))) plot(TWDiff, type="b", xlab="K")
suppressMessages(require(WeightedCluster)) set.seed(99) # Load the MVAD data & aggregate the state sequences data(mvad) agg <- wcAggregateCases(mvad[,17:86], weights=mvad$weight) # Create a state sequence object without the first two (summer) time points states <- c("EM", "FE", "HE", "JL", "SC", "TR") labels <- c("Employment", "Further Education", "Higher Education", "Joblessness", "School", "Training") mvad.seq <- seqdef(mvad[agg$aggIndex, 17:86], states=states, labels=labels, weights=agg$aggWeights) # Run k-modes without the weights resX <- wKModes(mvad.seq, 2) # Run k-modes with the weights resW <- wKModes(mvad.seq, 2, weights=agg$aggWeights) # Examine the modal sequences of both solutions seqformat(seqdef(resX$modes), from="STS", to="SPS", compress=TRUE) seqformat(seqdef(resW$modes), from="STS", to="SPS", compress=TRUE) # Using tot.withindiff to choose the number of clusters TWdiffs <- sapply(1:5, function(k) wKModes(mvad.seq, k, weights=agg$aggWeights)$tot.withindiff) plot(TWdiffs, type="b", xlab="K") # Use multiple random starts to account for inherent randomness TWDiff <- sapply(1:5, function(k) min(replicate(10, wKModes(mvad.seq, k, weights=agg$aggWeights)$tot.withindiff))) plot(TWDiff, type="b", xlab="K")