15  Multivariate Analysis I

In this chapter we survey multivariate analysis with an emphasis on factor analysis — the technique most heavily used in psychological research.

15.1 A comprehensive view of multivariate analysis

As the name suggests, multivariate analysis concerns datasets with many variables, and its principal aim is information summarisation. Interpreting each variable individually when many are present is laborious; a well-chosen summary buys considerable insight at modest cost.

From that aim several substantive interpretations follow. Below we list the main interpretive flavours and the analytic techniques that correspond to them.

  • To summarise = to discard. We use some of the information and set aside the rest.
  • Reduce to a single composite variable. A one-dimensional overall index (principal component analysis).
  • Map to a few dimensions. Visualise the data in a low-dimensional space (multidimensional scaling).
  • Classify variables into a small number of groups. Describe and analyse each group separately (cluster analysis).
  • Reduce observed variables to a few latent variables. Measure constructs and assign scores (factor analysis).
  • Treat a binary observation as influenced by a single latent variable. Tests with right/wrong outcomes (item response theory).
  • Classify respondents from response patterns under a latent-variable assumption. Finding hidden customer segments and the like (latent class analysis).
  • Model structural relationships among latent variables as well. Apply regression- and factor-analysis-style linear relationships to the data as a whole (structural equation modelling).
  • Visualise the strength of variable-to-variable ties without assuming a latent variable. Draw topological relationships with nodes and ties, or model the mathematical structure (network analysis).

The essential common feature is discovering inter-variable relationships from data. The relationships are expressed in different ways, with corresponding differences in technique. Most often we use covariances (correlations) as the measure of relationship; this presupposes data at the interval level or higher. For ordinal data, polychoric or polyserial correlations replace Pearson; for binary data, tetrachoric correlations.

Inter-variable relationships are not limited to correlations. For categorical variables one can use the co-frequency of joint category occurrence as the relational index. Co-frequency techniques include dual scaling (西里 2010) (mathematically equivalent to correspondence analysis and Hayashi’s Quantification Method Type III). Such categorical analyses are applied, for instance, in text mining of free responses after morphological analysis.

Relationships can also be expressed as distances (i.e., similarities) between variables. When the data satisfy the distance axioms, they can be visualised by multidimensional scaling (高根 1980; 岡太 and 今泉 1994) or grouped by cluster analysis (新納 2007; 足立 2006).

Correlation captures the strength of a straight-line relationship between two variables but can also reflect spurious associations driven by a hidden third variable. The partial correlation controls for surrounding variables; this is the foundation of graphical modelling (宮川 1997) and network analysis (アデラ=マリア et al. [2022] 2024).

In any case, given inter-variable relationships, an external criterion (if available) is fitted to estimate unknowns, and without one, the data are structured by model-based assumptions. The underlying goal is information summarisation, but some models emphasise visualisation, others the estimation of latent scores; each technique has its strengths and theoretical commitments.

Slightly off the main thread: linear algebra — the calculus of vectors and matrices — is the mathematical foundation underlying multivariate analysis. Its strength is to unify computation and visualisation under one perspective. Readers who want a deeper grip on multivariate analysis are encouraged to study linear algebra alongside.

15.1.1 Cattell’s data cube

R. B. Cattell drew a covariation chart — a data cube — to classify data. Any observation or measurement, he argued, can be specified by three attributes:

  • the time or occasion at which the observation is taken
  • the variable being observed
  • the reference point or person to which the variable is attached

Cattell’s data cube

In psychology a “dataset” is typically a two-dimensional matrix of persons × variables (a spreadsheet), but that is a slice through the cube along one time point. Other slices and other reorderings of the axes give six possible covariation cross-sections.

R Q O P S T
Technique variable × person person × person occasion × occasion person × occasion occasion × variable variable × occasion
Design variable × person person × occasion occasion × variable variable × occasion person × variable occasion × person
Factor extracted variable person occasion person occasion variable

Cattell classified the factor analyses on each cross-section as separate techniques. The familiar factor analysis — computing a variable-to-variable correlation matrix from variable × person data and extracting factors among the variables — is R-technique. Computing a person × person correlation matrix and extracting “person factors” is Q-technique. Slicing person × occasion gives O-technique (factors among occasions) and P-technique (factors among persons viewed from the occasion side). S- and T-techniques are analogously defined. S-technique applications are scarce; the others each have a small literature.

Inter-variable relationships can equally well be distance matrices (similarity matrices) or co-frequency relationships. One could classify variables or persons by cluster analysis on a distance matrix. The choice of cross-section and the choice of perspective are both free.

Generalising the three-dimensional restriction, more recent classifications use the data types present (modes) and the number of indexings (ways). For instance, mutual ratings among members within several groups have two modes (members and groups) with member × member relationships indexed over groups, giving 2-mode 3-way data. Family-systems research handles such data routinely.

Multivariate analysis, then, is a general toolkit asking which combination, which perspective, and which elements to summarise. The constraint “this cross-section is forbidden” does not arise; analysts visualise the whole and freely choose the meaningful cross-sections. Mathematically and statistically there are no model-level restrictions, and with packages such as R any of these analyses is feasible.

With this panoramic view, we turn to factor analysis — the technique psychology most embraces.

15.2 Factor analysis

We outline factor analysis, one of psychology’s most widely used techniques. Factor analysis is a statistical model of measurement. A related technique, principal component analysis (PCA), is best described as a model of summarisation rather than measurement.

The factor model is

\[ z_{ij} = a_{j1}f_{i1} + a_{j2}f_{i2} + \cdots + a_{jm}f_{im} + d_j U_{ij}, \]

where \(z_{ij}\) is the standardised score of person \(i\) on item \(j\); \(a_{j\cdot}\) are the factor loadings of item \(j\); \(f_{i\cdot}\) are person \(i\)’s factor scores; \(d_j\) is the unique-factor loading for item \(j\); and \(U_{ij}\) is the unique factor for person \(i\) on item \(j\). The \(m\) factors carried by the \(a_{j\cdot}\) are called common factors. Typically \(m \ll M\), the number of items. The Big Five personality inventory has \(M = 25\) and \(m = 5\); the YG personality test has \(M = 120\) and \(m = 12\). Factor analysis is thus a model of information compression.

PCA, by contrast, can be written

\[ P_i = w_1 X_{i1} + w_2 X_{i2} + \cdots + w_M X_{iM}, \]

where \(X_{i\cdot}\) is person \(i\)’s response on item \(j\), and \(P_i\) is the principal component formed as a weighted linear combination. The unknowns are the weights \(w_j\), chosen to produce the most discriminating composite — i.e., the weights that maximise the variance of \(P_i\). A single composite cannot in general capture all the information across \(M\) variables, so further principal components are constructed in order.

What, then, distinguishes the \(m\) common factors of factor analysis from \(m\) principal components? Factor analysis is a measurement model: the data \(z_{ij} = (X_{ij} - \bar{X}_j)/\sigma_j\) are assumed to contain measurement error \(d_j U_{ij}\). PCA makes no such assumption and uses \(X_{ij}\) directly.

In practice, responses that plausibly contain measurement error — e.g., psychological scale responses — call for factor analysis; quantities recorded without error — official records, accounting data — call for PCA. The historical spread of factor analysis through psychology and test theory, and of PCA through economics, commerce, and sociology, reflects this division.

Computationally the two share the use of eigenvalue decomposition to extract the most-explanatory components from an inter-variable relationship, so many software packages bundle them in the same menu, with different output styles. The design-level difference is nonetheless worth knowing. Factor analysis typically operates on a correlation matrix, PCA on a variance–covariance matrix; this reflects that psychological measurement is on a relative scale (more extraverted, more introverted), whereas other social-science quantities (trade surpluses, etc.) are absolute. PCA, aimed at compression, typically focuses on the first component; factor analysis, motivated by measurement theory, typically considers several factors. The choice between a single-factor and multi-factor view of intelligence in early intelligence testing reflects exactly this theoretical disagreement.

Similar techniques, different premises: knowing the background helps you pick the right tool.

15.2.1 Exploratory factor analysis

When unqualified, “factor analysis” usually means exploratory factor analysis (EFA). “Exploratory” reflects that neither the factor loadings nor even the number of common factors are fixed in advance; both are inferred from the data.

EFA proceeds in three steps:

  1. Decide the number of factors.
  2. Estimate the loadings (and rotate the factor axes).
  3. Estimate factor scores.

Before any of this, you should have characterised the data with descriptive statistics and visualisation.

15.2.1.1 Determining the number of factors

Mathematically, factor analysis reduces to an eigenvalue decomposition of the correlation matrix \(\mathbb{R}\) among the \(M\) items. The default correlation is the Pearson product-moment; ordinal items call for polychoric correlations, binary items for tetrachoric correlations.

Eigenvalue decomposition exposes the dimensionality of the correlation matrix. Information in \(M\) items occupies \(M\) dimensions. With two variables \(X, Y\), the data live in the 2-D space with \(X\) and \(Y\) as axes. If \(X\) and \(Y\) are correlated, an orthogonal \((X, Y)\) representation may not be the most efficient; a rotated basis with larger variance along one axis may be available. This is the common idea behind factor analysis and PCA: PCA focuses on the axis with the largest variance; factor analysis distinguishes the “useful dimensions” — the common factors — from the rest, which are absorbed into error.

The number of factors is chosen by the analyst, and the determination of “useful dimensions” has a subjective component. Many objective criteria have been proposed and are routinely used today, but treating mathematically extracted dimensions as substantive common factors is the analyst’s responsibility.

A common rule for setting the number of factors is parallel analysis based on the scree plot. We illustrate with the psych package and its bfi sample data — five-item-per-factor measurements of the Big Five.

pacman::p_load(tidyverse, psych)
dat <- psych::bfi |> select(-gender, -education, -age)
# parallel analysis
fa.parallel(dat)

Parallel analysis suggests that the number of factors =  6  and the number of components =  6 

A scree plot plots the eigenvalues from largest to smallest as a line graph. By default, both PC (principal-component) and FA (factor-analysis) scree plots are shown. The two differ in whether measurement error is assumed: factor analysis posits per-item error, so the per-item information is below 1 (\(r_{jj} = 1 - h_j^2 = u^2 < 1\), with \(h_j^2\) the communality — the sum of squared common-factor loadings — and \(u_j^2\) the uniqueness). The FA curve therefore always lies below the PC curve.

The legend distinguishes Actual Data, Simulated Data, and Resampled Data. Real data have some semantic structure; their correlation matrix is unevenly distributed across dimensions, producing a gradually decaying scree curve. Simulated data are drawn from random numbers of the same size; resampled data are obtained by shuffling the original matrix. Neither carries semantic structure, so every dimension is uniformly uninformative, and the eigenvalues form a near-flat line. Parallel analysis compares the two: dimensions whose actual eigenvalue exceeds the flat line are meaningful. On this criterion, both the FA and PC solutions support six factors.

A horizontal line at eigenvalue 1 is also drawn, marking the older Kaiser–Guttman criterion that a factor must explain at least one variable’s worth of variance to qualify as a common factor. By this rule three factors are warranted. Yet another rule of thumb is “what proportion of total variance is explained”: if three factors capture less than 50%, that may be too much information discarded, and four or five factors might be retained instead.

15.2.1.2 Estimating factor loadings

Once the number of factors is set, we estimate the loadings. For instance:

result.fa <- fa(dat, nfactors = 6, fm = "ML", rotate = "geominQ")
要求されたパッケージ GPArotation をロード中です

psych::fa() offers many options; here we specify the number of factors (nfactors), the estimation method (fm), and the rotation method (rotate).

For estimation we chose maximum likelihood (ML). For samples of more than ~200, ML — assuming multivariate-normal data — is generally most appropriate. For small samples, the least-squares family (ULS, OLS, WLS, GLS, etc.) minimises the model–data discrepancy and is the safer choice. Without specification, minimum residual (minres) — the same as ULS in concept, but with an improved algorithm with better convergence — is the default. The principal-axis option (pa) corresponds to a model without estimated errors. The differences across algorithms are usually modest.

Rotation transforms the loadings to ease interpretation. Factor analysis and PCA identify new axes for the data, but those axes can be linearly transformed (rotated) about the origin to any orientation. In practice we want the orientation that is easiest to interpret. Mathematically, this is the rotation that yields simple structure: each item loads on one factor and minimally on the others. If items measuring Extraversion load heavily on the first factor, we prefer that they not also load on factors 2, 3, 4, and 5 — otherwise interpretation becomes a headache.

Within this guiding principle, many algorithms exist. The classical varimax rotation maximises the variance of squared loadings on each factor. Other choices include oblimin and geomin; for details see 小杉 (2018).

Rotations split into orthogonal and oblique. Orthogonal rotations keep the rotated axes orthogonal — i.e., assume no inter-factor correlation. Oblique rotations allow inter-factor correlations. The latter is mathematically the weaker assumption, so a sensible workflow is to do oblique first and, if the inter-factor correlations are small, re-do orthogonal. Here we used geominQ, the oblique version of geomin.

The (Bernaards and Jennrich 2005) package implements many rotations selectable via rotate; consult its help.

There is no absolute standard for rotation either; algorithms reflect different premises. Unlike estimation, however, loadings can change appreciably under different rotations. Choose whichever rotation makes interpretation easiest, but be ready to explain — in your own words — what the rotation is and what it assumes.

15.2.1.3 Inspecting the output

print(result.fa, sort = T, cut = 0.3)
Factor Analysis using method =  ml
Call: fa(r = dat, nfactors = 6, rotate = "geominQ", fm = "ML")
Standardized loadings (pattern matrix) based upon correlation matrix
   item   ML1   ML2   ML5   ML3   ML4   ML6   h2   u2 com
E2   12  0.70                               0.55 0.45 1.0
E1   11  0.58                               0.39 0.61 1.4
N4   19  0.51  0.35                         0.48 0.52 2.0
E4   14 -0.50                          0.33 0.56 0.44 2.2
E5   15 -0.41                               0.40 0.60 2.8
N2   17        0.84                         0.69 0.31 1.1
N1   16        0.83                         0.71 0.29 1.1
N3   18        0.61                         0.52 0.48 1.3
N5   20  0.33  0.37                         0.34 0.66 2.8
A2    2              0.70                   0.50 0.50 1.2
A3    3              0.65                   0.51 0.49 1.1
A1    1             -0.57              0.37 0.33 0.67 1.8
A5    5              0.50                   0.48 0.52 1.7
A4    4              0.42                   0.28 0.72 1.7
C2    7                    0.67             0.50 0.50 1.2
C4    9                   -0.60        0.35 0.55 0.45 1.9
C3    8                    0.54             0.31 0.69 1.1
C1    6                    0.53             0.35 0.65 1.4
C5   10                   -0.51             0.43 0.57 1.8
O3   23                          0.67       0.48 0.52 1.0
O1   21                          0.58       0.34 0.66 1.1
O5   25                         -0.49  0.41 0.37 0.63 2.0
O2   22                         -0.40  0.34 0.29 0.71 2.3
O4   24  0.40                    0.40       0.25 0.75 2.4
E3   13                          0.38       0.48 0.52 3.0

                       ML1  ML2  ML5  ML3  ML4  ML6
SS loadings           2.34 2.25 2.00 1.89 1.77 0.82
Proportion Var        0.09 0.09 0.08 0.08 0.07 0.03
Cumulative Var        0.09 0.18 0.26 0.34 0.41 0.44
Proportion Explained  0.21 0.20 0.18 0.17 0.16 0.07
Cumulative Proportion 0.21 0.41 0.59 0.77 0.93 1.00

 With factor correlations of 
      ML1   ML2   ML5   ML3   ML4   ML6
ML1  1.00  0.24 -0.36 -0.20 -0.28 -0.08
ML2  0.24  1.00 -0.01 -0.12  0.05  0.25
ML5 -0.36 -0.01  1.00  0.19  0.28  0.26
ML3 -0.20 -0.12  0.19  1.00  0.14  0.04
ML4 -0.28  0.05  0.28  0.14  1.00  0.11
ML6 -0.08  0.25  0.26  0.04  0.11  1.00

Mean item complexity =  1.7
Test of the hypothesis that 6 factors are sufficient.

df null model =  300  with the objective function =  7.23 with Chi Square =  20163.79
df of  the model are 165  and the objective function was  0.36 

The root mean square of the residuals (RMSR) is  0.02 
The df corrected root mean square of the residuals is  0.03 

The harmonic n.obs is  2762 with the empirical chi square  330.64  with prob <  3.7e-13 
The total n.obs was  2800  with Likelihood Chi Square =  1013.79  with prob <  4.6e-122 

Tucker Lewis Index of factoring reliability =  0.922
RMSEA index =  0.043  and the 90 % confidence intervals are  0.04 0.045
BIC =  -295.88
Fit based upon off diagonal values = 0.99
Measures of factor score adequacy             
                                                   ML1  ML2  ML5  ML3  ML4  ML6
Correlation of (regression) scores with factors   0.81 0.89 0.81 0.85 0.82 0.74
Multiple R square of scores with factors          0.66 0.80 0.66 0.72 0.67 0.54
Minimum correlation of possible factor scores     0.32 0.59 0.33 0.44 0.34 0.09

We set sort (sort items by loading) and cut (suppress loadings smaller than 0.3 in the display). These are display options; the underlying \(5 \times 25\) paths from each factor to each item are all computed.

The output starts with the loading matrix, the communality \(h_j^2\), the uniqueness \(u_j^2 = 1 - h_j^2\), and the complexity.1 These are the post-rotation pattern matrix. In an oblique rotation, the loadings split into the pattern (factor effects, projecting variables orthogonally onto an oblique factor system) and the structure (factor correlations: variable–factor simple correlations).

Below the loadings, the sum of squared loadings (SS loadings) shows the variance explained by each factor; the proportions and cumulative proportions follow. Here the cumulative explained variance is 44%, meaning 56% of the information is discarded — from a compression standpoint, potentially too much.

Because the rotation is oblique, inter-factor correlations are reported next. The largest absolute correlation is −0.36. If all inter-factor correlations are within ±0.3, an orthogonal rotation can be considered.

Goodness-of-fit indices follow; we omit detailed discussion.

15.2.1.4 Factor-score estimation

So far we have estimated relationships between factors and items. In psychological research one is also interested in person-by-factor relationships: who is high on Extraversion, and what characterises low-Neuroticism individuals?

Mathematically, the eigenvalue decomposition that produces the loadings also produces eigenvectors; the per-person information has been summarised away by the time the correlation matrix is constructed. After the factor structure is fixed, we therefore back-compute the per-person scores: with the factor loadings known on the right-hand side of the factor model and the observed values on the left, we solve for the factor scores \(f_{i\cdot}\).

psych::fa() returns factor scores by default:

head(result.fa$scores, 10)
              ML1         ML2         ML5         ML3        ML4         ML6
61617 -0.04010511 -0.13419381 -0.69145376 -1.29299334 -1.6430660 -0.12400705
61618 -0.35267122  0.09080773 -0.04506789 -0.60267233 -0.0566277  0.39616337
61620 -0.05290788  0.69698270 -0.68175428 -0.03704608  0.2334657  0.01280638
61621  0.22194371 -0.07729865 -0.15482554 -0.90538802 -0.9125738  0.95278898
61622 -0.39695952 -0.28825362 -0.61037363 -0.12382926 -0.5814368  0.21976607
61623 -1.05223173  0.41585860  0.29450029  1.35906961  0.8457775  0.45985066
61624 -0.43875292 -1.21974967  0.06875832  0.03201599  0.6001998 -0.69274350
61629  1.42563085  0.35294078 -2.23058559 -0.99435391 -1.2166723 -1.00627947
61630          NA          NA          NA          NA         NA          NA
61633 -0.62399161  1.11010731  0.53805411  1.05139495  0.4870489  0.04325653

Some scores are NA (e.g., ID 61630): this happens when any response is missing.

head(dat, 10)
      A1 A2 A3 A4 A5 C1 C2 C3 C4 C5 E1 E2 E3 E4 E5 N1 N2 N3 N4 N5 O1 O2 O3 O4
61617  2  4  3  4  4  2  3  3  4  4  3  3  3  4  4  3  4  2  2  3  3  6  3  4
61618  2  4  5  2  5  5  4  4  3  4  1  1  6  4  3  3  3  3  5  5  4  2  4  3
61620  5  4  5  4  4  4  5  4  2  5  2  4  4  4  5  4  5  4  2  3  4  2  5  5
61621  4  4  6  5  5  4  4  3  5  5  5  3  4  4  4  2  5  2  4  1  3  3  4  3
61622  2  3  3  4  5  4  4  5  3  2  2  2  5  4  5  2  3  4  4  3  3  3  4  3
61623  6  6  5  6  5  6  6  6  1  3  2  1  6  5  6  3  5  2  2  3  4  3  5  6
61624  2  5  5  3  5  5  4  4  2  3  4  3  4  5  5  1  2  2  1  1  5  2  5  6
61629  4  3  1  5  1  3  2  4  2  4  3  6  4  2  1  6  3  2  6  4  3  2  4  5
61630  4  3  6  3  3  6  6  3  4  5  5  3 NA  4  3  5  5  2  3  3  6  6  6  6
61633  2  5  6  6  5  6  5  6  2  1  2  2  4  5  5  5  5  5  2  4  5  1  5  5
      O5
61617  3
61618  3
61620  2
61621  5
61622  3
61623  1
61624  1
61629  3
61630  1
61633  2

Because factor scores are back-computed from the model, a single missing value blocks the calculation. The scores returned are standardised, hence unitless and only relatively comparable. Some authors argue against subsequent tests of mean differences on relative estimates of this kind.

In practice a simpler simple-sum factor score is widely used: identify the items associated with each factor and average their item-level responses. With our example, say the first factor reflects E2, E1, N4, E4, E5. Because E4 and E5 have negative loadings, their values are reverse-scored:

Fscore1.raw <- dat |>
  # pick out the items relevant to the first factor
  select(E2, E1, N4, E4, E5) |>
  # reverse-score
  mutate(
    E4 = 7 - E4,
    E5 = 7 - E5
  )

# row-wise mean, ignoring missings
Fscore1 <- apply(Fscore1.raw, 1, function(x) mean(x, na.rm = TRUE))
summary(Fscore1)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   2.000   2.800   2.892   3.600   6.000 

Reverse-scoring is done by subtracting from 7 (for a 6-point scale starting at 1). The advantage is that the score retains its scale-anchored meaning (above the midpoint = agreement, below = disagreement) and is computable as long as some responses are present.

The drawbacks: it ignores the per-item weighting given by the loadings, and the error variance that factor analysis was supposed to remove is reintroduced. The scale items themselves should ideally have been constructed by proper scaling methods; in practice that rarely happens, and the simple-sum score is consequently a rough estimate.

Even so, simple-sum and model-based scores correlate very highly. If you can accept that psychological data are not so precise as to justify finer measurement, the simple sum is often enough.

cor(result.fa$scores[, 1], Fscore1, use = "pairwise")
[1] 0.9595003
plot(result.fa$scores[, 1], Fscore1)

15.2.2 Confirmatory factor analysis

So far we have discussed exploratory factor analysis. There the number of factors and the loadings are not specified a priori; the data speak, and interpretation is post hoc. In our example the first factor consists mainly of Extraversion items, but a Neuroticism item (N4) also slips in, complicating interpretation. The Big Five name implies five factors, yet the data prefer six.

Sometimes the theory is firmer than that: personality inventories have well-grounded structures, and one prefers to test a hypothesised structure. Confirmatory factor analysis (CFA) is the appropriate tool: factor structure is specified in advance and fitted to the data. CFA sits inside structural equation modelling (SEM): the relationships between items and latent variables are expressed as equations and estimated.

SEM fits equations involving latent variables to a covariance/correlation structure. The relationships are often drawn as a path diagram: bidirectional arrows for correlations, unidirectional arrows for regressions, rectangles for observed variables, ovals for latent variables. The diagram makes the difference between factor analysis and PCA stark:

Path diagrams: PCA (left) and factor analysis (right)

And between EFA and CFA:

Path diagrams: EFA (left) and CFA (right). Error factors omitted for clarity.

In CFA, each factor’s effect on each item is specified individually — equivalently, the paths assumed absent are fixed to zero. The figure assumes zero inter-factor correlations, which is why no path is drawn between the factors (correlations can of course be drawn).

The model comes first; the covariance matrix implied by the model is fit to the observed covariance matrix. Estimation methods include OLS, ML, and Bayesian methods; in practice you should know the algorithm name as well. Post-estimation you assess goodness of fit, with several indices considered together.

A worked example in R, using the lavaan package.2 Specify a CFA model3:

pacman::p_load(lavaan)
# model specification
model <- "
Neuroticism =~ N1 + N2 + N3 + N4 + N5
Agreeableness =~ A1 + A2 + A3 + A4 + A5
Extraversion =~ E1 + E2 + E3 + E4 + E5
Openness =~ O1 + O2 + O3 + O4 + O5
Conscientiousness =~ C1 + C2 + C3 + C4 + C5
"

Note that the model is a string in quotes. Latent variables go on the left, indicators on the right, connected by =~ (the measurement equation). Correlations use ~~; regressions use ~. Equations between latent variables are called structural equations.

No inter-factor correlations are specified; by default a covariance between any two latent variables not declared zero is freely estimated. To force a zero, write Neuroticism ~~ 0 * Openness.

With the model specified, fit it. We use ML and request fit measures and standardised coefficients in the summary:

# model estimation
model.fit <- sem(model, estimator = "ML", data = dat)
summary(model.fit, fit.measures = TRUE, standardized = TRUE)
lavaan 0.6-21 ended normally after 55 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                        60

                                                  Used       Total
  Number of observations                          2436        2800

Model Test User Model:
                                                      
  Test statistic                              4165.467
  Degrees of freedom                               265
  P-value (Chi-square)                           0.000

Model Test Baseline Model:

  Test statistic                             18222.116
  Degrees of freedom                               300
  P-value                                        0.000

User Model versus Baseline Model:

  Comparative Fit Index (CFI)                    0.782
  Tucker-Lewis Index (TLI)                       0.754

Loglikelihood and Information Criteria:

  Loglikelihood user model (H0)             -99840.238
  Loglikelihood unrestricted model (H1)     -97757.504
                                                      
  Akaike (AIC)                              199800.476
  Bayesian (BIC)                            200148.363
  Sample-size adjusted Bayesian (SABIC)     199957.729

Root Mean Square Error of Approximation:

  RMSEA                                          0.078
  90 Percent confidence interval - lower         0.076
  90 Percent confidence interval - upper         0.080
  P-value H_0: RMSEA <= 0.050                    0.000
  P-value H_0: RMSEA >= 0.080                    0.037

Standardized Root Mean Square Residual:

  SRMR                                           0.075

Parameter Estimates:

  Standard errors                             Standard
  Information                                 Expected
  Information saturated (h1) model          Structured

Latent Variables:
                       Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  Neuroticism =~                                                            
    N1                    1.000                               1.300    0.825
    N2                    0.947    0.024   39.899    0.000    1.230    0.803
    N3                    0.884    0.025   35.919    0.000    1.149    0.721
    N4                    0.692    0.025   27.753    0.000    0.899    0.573
    N5                    0.628    0.026   24.027    0.000    0.816    0.503
  Agreeableness =~                                                          
    A1                    1.000                               0.484    0.344
    A2                   -1.579    0.108  -14.650    0.000   -0.764   -0.648
    A3                   -2.030    0.134  -15.093    0.000   -0.983   -0.749
    A4                   -1.564    0.115  -13.616    0.000   -0.757   -0.510
    A5                   -1.804    0.121  -14.852    0.000   -0.873   -0.687
  Extraversion =~                                                           
    E1                    1.000                               0.920    0.564
    E2                    1.226    0.051   23.899    0.000    1.128    0.699
    E3                   -0.921    0.041  -22.431    0.000   -0.847   -0.627
    E4                   -1.121    0.047  -23.977    0.000   -1.031   -0.703
    E5                   -0.808    0.039  -20.648    0.000   -0.743   -0.553
  Openness =~                                                               
    O1                    1.000                               0.635    0.564
    O2                   -1.020    0.068  -14.962    0.000   -0.648   -0.418
    O3                    1.373    0.072   18.942    0.000    0.872    0.724
    O4                    0.437    0.048    9.160    0.000    0.277    0.233
    O5                   -0.960    0.060  -16.056    0.000   -0.610   -0.461
  Conscientiousness =~                                                      
    C1                    1.000                               0.680    0.551
    C2                    1.148    0.057   20.152    0.000    0.781    0.592
    C3                    1.036    0.054   19.172    0.000    0.705    0.546
    C4                   -1.421    0.065  -21.924    0.000   -0.967   -0.702
    C5                   -1.489    0.072  -20.694    0.000   -1.012   -0.620

Covariances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  Neuroticism ~~                                                        
    Agreeableness     0.141    0.018    7.712    0.000    0.223    0.223
    Extraversion      0.292    0.032    9.131    0.000    0.244    0.244
    Openness         -0.093    0.022   -4.138    0.000   -0.112   -0.112
    Conscientisnss   -0.250    0.025  -10.117    0.000   -0.283   -0.283
  Agreeableness ~~                                                      
    Extraversion      0.304    0.025   12.293    0.000    0.683    0.683
    Openness         -0.093    0.011   -8.446    0.000   -0.303   -0.303
    Conscientisnss   -0.110    0.012   -9.254    0.000   -0.334   -0.334
  Extraversion ~~                                                       
    Openness         -0.265    0.021  -12.347    0.000   -0.453   -0.453
    Conscientisnss   -0.224    0.020  -11.121    0.000   -0.357   -0.357
  Openness ~~                                                           
    Conscientisnss    0.130    0.014    9.190    0.000    0.301    0.301

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .N1                0.793    0.037   21.575    0.000    0.793    0.320
   .N2                0.836    0.036   23.458    0.000    0.836    0.356
   .N3                1.222    0.043   28.271    0.000    1.222    0.481
   .N4                1.654    0.052   31.977    0.000    1.654    0.672
   .N5                1.969    0.060   32.889    0.000    1.969    0.747
   .A1                1.745    0.052   33.725    0.000    1.745    0.882
   .A2                0.807    0.028   28.396    0.000    0.807    0.580
   .A3                0.754    0.032   23.339    0.000    0.754    0.438
   .A4                1.632    0.051   31.796    0.000    1.632    0.740
   .A5                0.852    0.032   26.800    0.000    0.852    0.528
   .E1                1.814    0.058   31.047    0.000    1.814    0.682
   .E2                1.332    0.049   26.928    0.000    1.332    0.512
   .E3                1.108    0.038   29.522    0.000    1.108    0.607
   .E4                1.088    0.041   26.732    0.000    1.088    0.506
   .E5                1.251    0.040   31.258    0.000    1.251    0.694
   .O1                0.865    0.032   27.216    0.000    0.865    0.682
   .O2                1.990    0.063   31.618    0.000    1.990    0.826
   .O3                0.691    0.039   17.717    0.000    0.691    0.476
   .O4                1.346    0.040   34.036    0.000    1.346    0.946
   .O5                1.380    0.045   30.662    0.000    1.380    0.788
   .C1                1.063    0.035   30.073    0.000    1.063    0.697
   .C2                1.130    0.039   28.890    0.000    1.130    0.650
   .C3                1.170    0.039   30.194    0.000    1.170    0.702
   .C4                0.960    0.040   24.016    0.000    0.960    0.507
   .C5                1.640    0.059   27.907    0.000    1.640    0.615
    Neuroticism       1.689    0.073   23.034    0.000    1.000    1.000
    Agreeableness     0.234    0.030    7.839    0.000    1.000    1.000
    Extraversion      0.846    0.062   13.693    0.000    1.000    1.000
    Openness          0.404    0.033   12.156    0.000    1.000    1.000
    Conscientisnss    0.463    0.036   12.810    0.000    1.000    1.000

The output begins with a model summary (estimator, etc.) and then a battery of fit indices (CFI, TLI, AIC, BIC, RMSEA, SRMR, …). These fall into three categories:

First, comparative indices anchored at a null model (worst possible) of 0 and a saturated model (perfect fit) of 1 — CFI and TLI fall here. The saturated model fits the data perfectly with all possible paths; the null (independence) model assumes no inter-variable association whatsoever and is the most constrained.

Second, likelihood-based relative indices — AIC, BIC, SABIC. The likelihood expresses how close the model’s distribution is to the data, so it is meaningful only for comparing models on the same data. AIC (Akaike Information Criterion) combines \(-2\) log-likelihood with the number of parameters: \(-2\text{LL} + 2p\). Smaller is better. BIC (Bayesian Information Criterion) penalises parameters more heavily, scaled by the sample size. SABIC is a sample-size-adjusted variant.

Third, residual-based indices — RMSEA, SRMR. RMSEA (Root Mean Square Error of Approximation) targets the model’s approximation error; values below 0.05 are conventionally “good.” SRMR (Standardised Root Mean Square Residual) is the standardised root mean square of the residuals between observed and model-implied data; values below 0.08 are “good.”

When evaluating model fit, look at several indices together rather than rely on any single one.

Below the fit indices come the estimates and test statistics. In psychology, the column most often consulted is Std.all, with all variables standardised.4 Small or non-significant path coefficients can be dropped to improve fit — but improving fit should not be a research goal. If conventional thresholds cannot be met, revisit the assumptions; adding paths inconsistent with theory just to push fit up is the same kind of QRP as “engineering significance.”

Packages also automate path-diagram drawing. Several exist (lavaanExtra, tidySEM, lavaanPlot, lavaanPlot2); we illustrate with the classical semPlot:

pacman::p_load(semPlot)
semPaths(model.fit, what = "stand", style = "lisrel")

That concludes the overview of factor analysis.

Factor analysis is a model of measurement and is heavily used when designing psychological scales. But because it identifies common dimensions from inter-item correlations, neither “the construct has been measured directly” nor “the existence of the construct has been demonstrated” follows. For example, write items about ramen and have respondents rate them quantitatively, and you can extract a “ramen factor” or a “tonkotsu factor” with some interpretive content. That does not mean people have internalised a psychological “tonkotsu factor.”

SEM can specify regression or correlation paths between latent variables, but it pays to think about how those relationships actually manifest. Even a strongly fitting model with strong cross-factor influence may have small measurement-model coefficients: ask what concretely changes when a factor score increases by one unit, and how that shows up in behaviour or measurement. A statistically valid but substantively meaningless model is a paper exercise.

Issues of measurement and substantive impact run in parallel between factor analysis and item response theory (a relative of factor analysis), to which we now turn.

15.2.3 SEM applications

SEM is not confined to confirmatory factor analysis; it provides a flexible vocabulary for structural relationships between latent variables and so supports several useful extensions.

15.2.3.1 Mediation analysis

15.2.3.2 Multi-group SEM

15.2.3.3 Latent growth curve models

15.3 Item response theory

Next, item response theory (IRT). Sometimes called modern test theory in contrast to classical test theory (CTT), IRT is grounded in test theory and so assumes a binary outcome (0 = incorrect, 1 = correct). It can also be viewed as factor analysis with a binary indicator, and its mathematical equivalence to categorical factor analysis is established.

A further difference from factor analysis: factor analysis (rooted in personality psychology) emphasises exploring factor structure, while IRT (rooted in test theory) emphasises refining factor-score estimation. In personality psychology the number of factors is itself a research question; in test theory a single-factor “ability” structure is preferred. This is why scale construction via factor analysis cultivates “simple structure” and prunes items, while IRT — given a sufficiently large first-factor loading (around 30% variance is typical) — treats the data as essentially unidimensional and pools items rather than discarding them.

Computer adaptive testing (CAT) is the leading current application of IRT: items are served dynamically from a pool based on the respondent’s pattern of correct/incorrect answers, efficiently estimating the latent ability. CAT requires an IRT-based model, an item pool calibrated for various ability levels, and a database-and-delivery system tying them together.5

15.3.1 Logistic models

IRT is a single-factor model for binary data. Because of the binary outcome, dimensional analysis uses tetrachoric correlations rather than Pearson’s. Modelling treats item responses as regressed on a person parameter \(\theta\) that corresponds to the factor — a logistic-regression-like setup.

The person parameter is assumed standard normal; expressed via the normal CDF, this is well approximated by a logistic curve, and the linear placement of item parameters fits naturally inside a logistic model.6 The standard normal density, its CDF, and a logistic approximation thereof are plotted below. The logistic model conventionally uses the constant 1.702 in the exponent to better approximate the normal CDF:

\[ f(x) = \frac{1}{1 + \exp(-1.702 x)}. \]

pacman::p_load(ggplot2, patchwork)

x <- seq(-4, 4, length.out = 1000)
normal_df <- data.frame(
  x = x,
  density = dnorm(x),
  cdf = pnorm(x),
  logistic = 1 / (1 + exp(-1.702 * x))
)

# 1. standard normal density
p1 <- ggplot(normal_df, aes(x = x, y = density)) +
  geom_line() +
  labs(title = "Standard normal density", x = "x", y = "density") +
  theme_minimal()

# 2. standard normal CDF
p2 <- ggplot(normal_df, aes(x = x, y = cdf)) +
  geom_line() +
  labs(title = "Standard normal CDF", x = "x", y = "cumulative probability") +
  theme_minimal()

# 3. logistic approximation
p3 <- ggplot(normal_df, aes(x = x, y = logistic)) +
  geom_line() +
  labs(title = "Logistic approximation", x = "x", y = "probability") +
  theme_minimal()

p1 + p2 + p3

Using this logistic function we now characterise items via item parameters. IRT logistic models come with various numbers of parameters, with the more elaborate ones containing the simpler as special cases.

15.3.1.1 1PL model

The one-parameter logistic (1PL) model has a single item parameter \(b\):

\[ P(Y_{ij} = 1 \mid \theta_i, b_j) = \frac{1}{1 + \exp(-1.702 (\theta_i - b_j))}. \]

\(Y_{ij}\) is the binary indicator of correctness for person \(i\) on item \(j\); \(\theta_i\) is the person’s ability; \(b_j\) is the item’s difficulty. Larger \(b_j\) shifts the curve to the right (so the same probability of correct response requires higher ability), and smaller \(b_j\) shifts it left.

logistic_1pl <- function(theta, b) {
  1 / (1 + exp(-1.702 * (theta - b)))
}

x <- seq(-4, 4, length.out = 1000)
normal_df <- data.frame(
  x = x,
  default = logistic_1pl(x, 0),
  easy = logistic_1pl(x, -1),
  hard = logistic_1pl(x, 1)
)

ggplot(normal_df) +
  geom_line(aes(x = x, y = default, color = "default (b=0)")) +
  geom_line(aes(x = x, y = easy, color = "easy (b=-1)")) +
  geom_line(aes(x = x, y = hard, color = "hard (b=1)")) +
  scale_color_brewer(palette = "Set2") +
  labs(
    title = "1PL logistic model",
    x = "theta",
    y = "P(correct)",
    color = "difficulty"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")

15.3.1.2 2PL model

The 2PL model adds an item parameter \(a_j\) — the discrimination:

\[ P(Y_{ij} = 1 \mid \theta_i, a_j, b_j) = \frac{1}{1 + \exp(-1.702 a_j (\theta_i - b_j))}. \]

It is called the discrimination parameter because it controls the slope of the logistic curve. A steep slope means a sharp transition from incorrect to correct around a particular \(\theta\); a shallow slope means the item only weakly distinguishes ability levels. In categorical factor analysis, \(b_j\) corresponds to a threshold and \(a_j\) to a loading.

logistic_2pl <- function(theta, a, b) {
  1 / (1 + exp(-1.702 * a * (theta - b)))
}

x <- seq(-4, 4, length.out = 1000)
normal_df <- data.frame(
  x = x,
  default = logistic_2pl(x, 1, 0),
  easy = logistic_2pl(x, 1.5, -1),
  hard = logistic_2pl(x, 0.5, 1)
)

ggplot(normal_df) +
  geom_line(aes(x = x, y = default, color = "default (b=0, a=1)")) +
  geom_line(aes(x = x, y = easy, color = "b=-1, a=1.5")) +
  geom_line(aes(x = x, y = hard, color = "b=1, a=0.5")) +
  scale_color_brewer(palette = "Set2") +
  labs(
    title = "2PL logistic model",
    x = "theta",
    y = "P(correct)",
    color = "model and settings"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")

15.3.1.3 3PL, 4PL, 5PL models

In practice the 2PL is the most common, but 3PL, 4PL, and 5PL models exist:

\[ P(Y_{ij} = 1 \mid \theta_i, a_j, b_j, c_j) = c_j + \frac{1 - c_j}{1 + \exp(-1.702 a_j (\theta_i - b_j))}, \]

\[ P(Y_{ij} = 1 \mid \theta_i, a_j, b_j, c_j, d_j) = c_j + \frac{d_j - c_j}{1 + \exp(-1.702 a_j (\theta_i - b_j))}, \]

\[ P(Y_{ij} = 1 \mid \theta_i, a_j, b_j, c_j, d_j, e_j) = c_j + \frac{d_j - c_j}{\{1 + \exp(-1.702 a_j (\theta_i - b_j))\}^{e_j}}. \]

\(c_j\) is the lower-asymptote parameter (guessing), \(d_j\) the upper-asymptote parameter, and \(e_j\) the asymmetry parameter. More parameters demand larger samples for estimation and complicate operational use (e.g., equating across forms), so these are not widely used.

15.3.1.4 Practical fitting

The logistic curve characterising an item is the item response function (IRF) or item characteristic curve (ICC). Several R packages fit IRT models — ltm, exametrika, and others. We use the author’s own exametrika package and its sample data. J15S500 is data from 500 respondents answering 15 items.

pacman::p_load(exametrika)
result.2pl <- IRT(J15S500, model = 2, verbose = FALSE)
print(result.2pl)
Item Parameters
       slope location PSD(slope) PSD(location)
Item01 0.698   -1.683     0.1093         0.266
Item02 0.810   -1.552     0.1166         0.221
Item03 0.559   -1.838     0.0988         0.338
Item04 1.416   -1.178     0.1569         0.113
Item05 0.681   -2.242     0.1152         0.360
Item06 0.997   -2.162     0.1499         0.273
Item07 1.084   -1.039     0.1281         0.130
Item08 0.694   -0.558     0.1002         0.153
Item09 0.347    1.630     0.0766         0.427
Item10 0.492   -1.421     0.0907         0.306
Item11 1.122    1.020     0.1314         0.124
Item12 1.216    1.031     0.1385         0.117
Item13 0.875   -0.720     0.1111         0.133
Item14 1.200   -1.232     0.1407         0.134
Item15 0.823   -1.203     0.1127         0.180

Item Fit Indices
       model_log_like bench_log_like null_log_like model_Chi_sq null_Chi_sq
Item01       -263.524       -240.190      -283.343       46.669      86.307
Item02       -252.914       -235.436      -278.949       34.954      87.025
Item03       -281.083       -260.906      -293.598       40.353      65.383
Item04       -205.851       -192.072      -265.962       27.558     147.780
Item05       -232.072       -206.537      -247.403       51.070      81.732
Item06       -173.930       -153.940      -198.817       39.981      89.755
Item07       -252.039       -228.379      -298.345       47.320     139.933
Item08       -313.754       -293.225      -338.789       41.057      91.127
Item09       -325.692       -300.492      -327.842       50.399      54.700
Item10       -309.448       -288.198      -319.850       42.500      63.303
Item11       -250.836       -224.085      -299.265       53.501     150.360
Item12       -240.247       -214.797      -293.598       50.900     157.603
Item13       -291.816       -262.031      -328.396       59.571     132.730
Item14       -224.330       -204.953      -273.212       38.754     136.519
Item15       -273.120       -254.764      -302.847       36.713      96.166
       model_df null_df   NFI   RFI   IFI   TLI   CFI RMSEA    AIC    CAIC
Item01       12      13 0.459 0.414 0.533 0.488 0.527 0.076 22.669 -39.906
Item02       12      13 0.598 0.565 0.694 0.664 0.690 0.062 10.954 -51.621
Item03       12      13 0.383 0.331 0.469 0.414 0.459 0.069 16.353 -46.222
Item04       12      13 0.814 0.798 0.885 0.875 0.885 0.051  3.558 -59.017
Item05       12      13 0.375 0.323 0.440 0.384 0.432 0.081 27.070 -35.505
Item06       12      13 0.555 0.517 0.640 0.605 0.635 0.068 15.981 -46.595
Item07       12      13 0.662 0.634 0.724 0.699 0.722 0.077 23.320 -39.255
Item08       12      13 0.549 0.512 0.633 0.597 0.628 0.070 17.057 -45.518
Item09       12      13 0.079 0.002 0.101 0.002 0.079 0.080 26.399 -36.177
Item10       12      13 0.329 0.273 0.405 0.343 0.394 0.071 18.500 -44.076
Item11       12      13 0.644 0.615 0.700 0.673 0.698 0.083 29.501 -33.075
Item12       12      13 0.677 0.650 0.733 0.709 0.731 0.081 26.900 -35.675
Item13       12      13 0.551 0.514 0.606 0.570 0.603 0.089 35.571 -27.004
Item14       12      13 0.716 0.692 0.785 0.765 0.783 0.067 14.754 -47.822
Item15       12      13 0.618 0.586 0.706 0.678 0.703 0.064 12.713 -49.862
           BIC
Item01 -27.906
Item02 -39.621
Item03 -34.222
Item04 -47.017
Item05 -23.505
Item06 -34.595
Item07 -27.255
Item08 -33.518
Item09 -24.177
Item10 -32.076
Item11 -21.075
Item12 -23.675
Item13 -15.004
Item14 -35.822
Item15 -37.862

Model Fit Indices
                   value
model_log_like -3890.655
bench_log_like -3560.005
null_log_like  -4350.217
model_Chi_sq     661.300
null_Chi_sq     1580.424
model_df         180.000
null_df          195.000
NFI                0.582
RFI                0.547
IFI                0.656
TLI                0.624
CFI                0.653
RMSEA              0.073
AIC              301.300
CAIC            -637.330
BIC             -457.330

Numerically, the Item Parameters block reports the slope (discrimination) and the location (difficulty) with their standard errors. Item Fit Indices gives per-item fit; Model Fit Indices gives test-level fit. Viewed through the lens of SEM fit indices, IRT models tend to fit poorly — a price of modelling binary data, in part.

IRT’s strength is less in the numerical fit and more in the ease of visualising items. exametrika::plot() draws the IRFs:

plot(result.2pl, item = 1:5, type = "IRF", overlay = TRUE)

Summing the IRFs across the whole test gives the test response function (TRF):

plot(result.2pl, type = "TRF")

A transformation of the IRF yields the item information function (IIF). The IIF peaks where the variance is greatest — at \(P = 0.5\) — and is defined by

\[ I_j(\theta) = \frac{P_j^{\prime}(\theta)^2}{P_j(\theta)\bigl(1 - P_j(\theta)\bigr)}. \]

Roughly, the larger the gap between correct- and incorrect-response probabilities, the more information the item carries. Plot with type = "IIF":

plot(result.2pl, item = 4, type = "IRF")

plot(result.2pl, item = 4, type = "IIF")

The IIF encodes IRT’s notion of reliability: reliability is a function of \(\theta\), telling us where in the ability range the item works most efficiently.

For contrast, classical test theory expresses reliability as the proportion of true-score variance in the overall test; factor analysis expresses it via item-level communality \(h_j^2\). CTT goes from the test to the items; modern test theory goes the other way and evaluates per-function, per-item performance.

In IRT, even items that are too easy or too hard are not deleted: they are needed to assess high- or low-ability respondents. Such items will not contribute much variance, and may have low communalities, but they are kept — a philosophical contrast with the factor-analytic approach.

The whole-test information function is the sum of per-item IIFs. Use type = "TIF":

plot(result.2pl, type = "TIF")

This 15-item test peaks near \(\theta = -1\) — it offers its sharpest measurement for respondents of somewhat-below-average ability. When item parameters are known for a large pool, IRT lets the test designer engineer the precision in advance by selecting items.

The person parameter \(\theta\) is estimated from the response pattern. A standard-normal prior with Bayesian estimation is typical.7 exametrika returns these along with the item parameters:

head(result.2pl$ability)
          ID        EAP       PSD
1 Student001 -0.6645678 0.5457047
2 Student002 -0.1485370 0.5626979
3 Student003  0.0136253 0.5699764
4 Student004  0.5877568 0.6012839
5 Student005 -0.9779686 0.5415528
6 Student006  0.8589251 0.6187224

15.3.2 IRT extensions

IRT is fundamentally for binary data, but models for graded and multi-category responses exist. For Likert-style multi-step responses, the graded response model (GRM) and the partial credit model (PCM) are common. These let one assume an ordinal level of measurement on psychological-scale data.

Psychological scales’ Likert-style responses are typically thought to support, at best, ordinal-level measurement, yet they have long been treated as interval-level for mathematical convenience (and for the unflattering reason that anything more careful was thought unwarranted). One reason this “treat as interval” practice persisted was that GRM/PCM were unavailable in mainstream packages. Today the mathematical equivalence of GRM and the 2PL means common software (e.g., Mplus) can estimate them as the same latent-variable model, and R packages such as ltm provide GRM and PCM. “We don’t have the tools” is no longer a valid excuse.

A further advantage of graded-response modelling is in choosing the right number of response categories. Five-point and seven-point Likert items are common conventions but lack any theoretical justification; the more important question is whether respondents can actually discriminate between 5 or 7 categories.8

A concrete example using ltm::grm() on the Science dataset — science-attitude items on a 4-point scale.

pacman::p_load(ltm)
data(Science)
result.grm <- grm(Science)
print(result.grm)

Call:
grm(data = Science)

Coefficients:
             Extrmt1  Extrmt2  Extrmt3  Dscrmn
Comfort      -10.768   -5.645    3.097   0.411
Environment   -2.154   -0.790    0.627   1.570
Work          32.102    9.261  -24.402  -0.074
Future       -30.602  -11.806   10.455   0.108
Technology    -2.462   -0.885    0.642   1.650
Industry      -2.870   -1.529    0.286   1.642
Benefit      -21.232   -5.982   10.297   0.136

Log.Lik: -2998.129

The output reports three thresholds (Extrmt) and the corresponding discrimination (Dscrmn). Like the logistic model, GRM admits IRF/IIF/TIF plots; we draw the IRF (called ICC in ltm):

plot(result.grm, items = 2, type = "ICC")

Each category’s response probability is plotted as a function of \(\theta\). As \(\theta\) rises, the modal category shifts from 1 to 2 to 3, in order.

The next item, however, looks different:

plot(result.grm, items = 1, type = "ICC")

Categories 1 and 2 never become the modal response; nearly the whole range is dominated by category 3, with category 4 only beginning to appear above \(\theta = 3\). The plot spans \(-4 \leq \theta \leq +4\); expanding the negative side might recover category-1 and -2 modes, but that is not realistic. Some items have no clear category modes, suggesting that the response scale was poorly designed.

The IRT approach is increasingly recommended for designing psychological scales. The single-factor assumption can also be relaxed; multidimensional IRT models exist (the mirt package provides them). There is no longer a reason not to use the IRT approach.

15.4 Summary

We have covered the basics and the practice of exploratory factor analysis, confirmatory factor analysis (under SEM), and item response theory.

Their common thread is constructing measurement models with latent variables, grounded in covariance or correlation matrices. With tetrachoric/polychoric/polyserial correlations for ordinal scales, the analysis becomes a categorical factor analysis — which is also a (graded) IRT model. Software capable of computing correlations appropriate to the level of measurement can run these models with the same workflow (lavaan lets you set the scale level of each variable; Mplus is also well known for this).

Knowing the historical roots and theoretical lineage of each model informs its application, but as a user one should use whichever tool fits the task. Design research with the respondent in mind. Letting mathematical limitations, software constraints, or — worst of all — a researcher’s incomprehension or laziness force respondents into ill-fitting designs degrades the quality of the data.

15.5 Exercises

Practise the multivariate techniques of this chapter on the following problems, which aim to deepen understanding through hands-on analysis. As an example we use psych::small.msq — the Motivational State Questionnaire — a 14-variable, 200-case dataset.9

Its 14 variables comprise energetic arousal (active, alert, arousal, sleepy, tired, drowsy), tense arousal (anxious, jittery, nervous, calm, relaxed, at.ease), plus gender and drug (a drug-condition factor). Excluding gender and drug, the structure is presumed two-factor.

15.5.1 Exercise 1: exploratory factor analysis

Perform an exploratory factor analysis.

Steps:

  1. Decide the number of factors.
  2. Fit the factor model (oblique rotation).
  3. Interpret the results (factor loadings, communalities, uniquenesses).

15.5.2 Exercise 2: confirmatory factor analysis

Fit a two-factor model with lavaan. Use ordered = TRUE in the model specification to treat the indicators as ordinal.

Steps:

  1. Build the theoretical model (draw a path diagram).
  2. Fit with lavaan.
  3. Plot.
  4. Evaluate fit indices.
  5. Modify the model if needed.

15.5.3 Exercise 3: graded response model

Apply ltm’s GRM. Because GRM assumes unidimensionality, split the data into the energetic-arousal and tense-arousal item sets and fit a GRM to each.

Steps:

  1. Split the data.
  2. Fit a GRM.
  3. Plot ICCs.
  4. Plot IIFs.

15.5.4 Exercise 4: multidimensional graded IRT

Use mirt (multidimensional IRT) to fit a two-factor graded model:

pacman::p_load(mirt)
# two factors, graded response
result.mirt <- mirt(dat, model = 2, itemtype = "graded")
# oblique rotation in the summary
summary(result.mirt, rotate = "geominQ")

# multidimensional ICC
plot(result.mirt, type = "trace", which.items = 1)
# multidimensional information
plot(result.mirt, type = "info")

  1. Complexity is the index of how cleanly each item loads on a single factor: values near 1 mean the item loads essentially on one factor; larger values mean the item is spread across multiple factors. For loadings \(a_{jk}\), complexity is \((\sum_k a_{jk}^2)^2 / \sum_k a_{jk}^4\).↩︎

  2. An idiosyncratic name; “lavaan” stands for LAtent VAriable ANalysis.↩︎

  3. The lavaangui package lets you specify models via a GUI.↩︎

  4. Std.all standardises both observed and latent variables to unit variance; Std.lv standardises only the latent variables.↩︎

  5. Japan’s national university-entrance examinations briefly considered CAT but rejected it: the need for a huge item pool (requiring extensive pre-testing) and the difficulty of providing the same network and operating environment at remote and island sites as in urban areas made it impractical. The existing paper-and-pencil (mark-sheet) system runs an extraordinary error rate below one percent even with 500,000 simultaneous test-takers each year; given its social impact, CAT adoption demands extreme caution.↩︎

  6. It is also possible, of course, to model with the normal CDF directly.↩︎

  7. With MLE, a respondent who gets everything right or everything wrong has an infinite \(\theta\) estimate — not practically useful.↩︎

  8. Frequently asked: “If the prior literature uses 5 points, do we have to?” and “Can we mix 5-point and 7-point scales?” The correct answers: if you are using the prior literature’s scoring or standardisation, follow the prior literature; otherwise (e.g., when running EFA and setting your own loadings), design the scale around what respondents can actually distinguish. If you do not know in advance, use a graded-response IRT model to check responses to each category.↩︎

  9. A subset of the more than 2,500 sample MSQ; the full data are in the psychTools package.↩︎