Overview
This vignette defines the terms used throughout the swereg target
trial emulation (TTE) system. It serves as a quick reference for anyone
reading or writing code that uses TTEDesign,
TTEEnrollment, or TTEPlan.
Data levels
| Term | Meaning |
|---|---|
| skeleton | Person-week panel created by
create_skeleton() and enriched with registry data. One row
per person per ISO week. Input to the TTE pipeline. Stored as batched
.qs2 files. |
| person-week | Synonym for skeleton-level data before enrollment. The
data_level of a TTEEnrollment starts as
"person_week". |
| trial | After enrollment via
TTEEnrollment$new(..., ratio = ), data is expanded to trial
panels: one row per person per trial per time period.
data_level becomes "trial". |
| counting-process | The trial-level data uses counting-process format with
tstart/tstop columns (Andersen-Gill style),
suitable for time-varying Cox models and weighted Poisson
regression. |
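
As a rough illustration of the counting-process layout, the base R sketch below expands one person-trial into weekly intervals. Only `tstart` and `tstop` come from the table above; the helper `expand_to_intervals()` and the columns `id`, `trial_id`, and `event` are hypothetical and do not describe swereg's actual schema.

```r
# Hypothetical helper: expand one person-trial into (tstart, tstop] intervals.
# Only `tstart`/`tstop` are named in this vignette; everything else is assumed.
expand_to_intervals <- function(id, trial_id, follow_up, event_time = NA, width = 1) {
  tstart <- seq(0, follow_up - width, by = width)
  tstop <- tstart + width
  # Stop follow-up after the interval in which the event occurs.
  if (!is.na(event_time)) {
    keep <- tstart < event_time
    tstart <- tstart[keep]
    tstop <- tstop[keep]
  }
  data.frame(
    id = id,
    trial_id = trial_id,
    tstart = tstart,
    tstop = tstop,
    # The event indicator is 1 only in the interval containing the event.
    event = as.integer(!is.na(event_time) & tstop >= event_time)
  )
}

panel <- expand_to_intervals(id = 1, trial_id = 1, follow_up = 8, event_time = 3)
print(panel)
```

Each row covers one interval, so time-varying covariates and weights can differ from row to row for the same person and trial.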
Classes
| Class | Role |
|---|---|
| TTEDesign | Column name mappings that define the trial schema: person ID, exposure, outcome, confounder, and time variables. Created once via TTEDesign$new() and reused across the workflow. |
| TTEEnrollment | Enrollment data container (data.table + design + workflow state). Methods modify in-place via R6 reference semantics and return invisible(self) for $-chaining. |
| TTEPlan | Builder for trial generation. Holds the ETT grid, skeleton file paths, and per-ETT design parameters. Orchestrates Loop 1 via $s1_generate_enrollments_and_ipw(). |
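
The in-place modification and `$`-chaining pattern can be sketched with a toy R6 class. `ToyEnrollment` and its methods are invented for illustration and are not part of swereg; the sketch assumes the R6 package is installed.

```r
library(R6)

# Toy class, not swereg's TTEEnrollment: it only demonstrates the pattern.
ToyEnrollment <- R6Class("ToyEnrollment",
  public = list(
    data = NULL,
    steps_done = character(0),
    initialize = function(data) {
      self$data <- data
    },
    add_weight = function() {
      self$data$w <- 1                  # modifies the object's own state
      self$steps_done <- c(self$steps_done, "add_weight")
      invisible(self)                   # returning self enables $-chaining
    },
    double_weight = function() {
      self$data$w <- self$data$w * 2
      self$steps_done <- c(self$steps_done, "double_weight")
      invisible(self)
    }
  )
)

enr <- ToyEnrollment$new(data.frame(id = 1:3))
alias <- enr                            # same object, not a copy
enr$add_weight()$double_weight()
print(alias$data$w)
```

Because R6 objects have reference semantics, `alias` sees the changes made through `enr`; no reassignment such as `enr <- enr$add_weight()` is needed.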
ETT grid
| Term | Meaning |
|---|---|
| ETT (Emulated Target Trial) | One combination of outcome × follow-up duration × age
group. Each ETT produces one analysis-ready dataset. Corresponds to one
row in plan$ett. |
| enrollment_id | Groups ETTs that share the same trial panels. ETTs
within an enrollment_id have the same age group and
matching design parameters (confounders, exposure, eligibility). They
differ only in outcome and/or follow-up duration. |
| ett_id | Unique identifier for a single ETT (e.g.,
"ETT01"). Auto-assigned sequentially by
$add_one_ett(). |
| enrollment_spec | The metadata list returned by
plan$enrollment_spec(i). Contains design
(TTEDesign), enrollment_id,
age_range, and n_threads. Used internally by
the two-pass Loop 1 workers. |
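
To make the relationship between `ett_id` and `enrollment_id` concrete, the sketch below builds a small grid in base R. It does not use `$add_one_ett()` or any other swereg call; the outcomes, durations, age groups, and the `"01"`-style `enrollment_id` format are placeholder assumptions.

```r
# Hypothetical grid: these outcomes, durations, and age groups are placeholders.
grid <- expand.grid(
  outcome = c("outcome_a", "outcome_b"),
  follow_up = c(52, 104),
  age_group = c("40_49", "50_59"),
  stringsAsFactors = FALSE
)

# One ETT per row, labelled sequentially in the "ETT01" style.
grid$ett_id <- sprintf("ETT%02d", seq_len(nrow(grid)))

# ETTs sharing an age group (and, by assumption here, identical design
# parameters) share trial panels, so they get the same enrollment_id.
grid$enrollment_id <- sprintf("%02d", match(grid$age_group, unique(grid$age_group)))

print(grid[, c("ett_id", "enrollment_id", "outcome", "follow_up", "age_group")])
```

In this toy grid, eight ETTs map onto two enrollments, so the expensive enrollment step would run twice rather than eight times.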
Two-loop architecture
Loop 1: enrollment + IPW
One iteration per enrollment_id. Run by
plan$s1_generate_enrollments_and_ipw():
skeleton files ──(parallel callr workers)──► enroll (band-based match + collapse)
──► rbind ──► impute ──► IPW + truncate ──► save
Produces two files per enrollment_id:
- file_raw — post-enrollment, pre-imputation
- file_imp — post-imputation + IPW (input to Loop 2)
Loop 2: per-ETT outcome weighting
One iteration per ETT. Runs sequentially in the main process:
load file_imp ──► $s4_prepare_for_analysis() ──► save file_analysis
$s4_prepare_for_analysis() combines outcome preparation
and IPCW-PP into one call. It prepares outcome data, calculates IPCW-PP,
combines weights (ipw × ipcw_pp →
analysis_weight_pp), truncates, and drops intermediate IPCW
columns.
Weights
| Term | Meaning |
|---|---|
| IPW (Inverse Probability of treatment Weighting) | Baseline confounding adjustment. Computed once per
enrollment_id in Loop 1 via $s2_ipw(). |
| IPCW-PP (Inverse Probability of Censoring Weighting, Per-Protocol) | Time-varying weight for per-protocol analysis. Accounts
for treatment switching and loss to follow-up. Computed per ETT in Loop
2 via $s4_prepare_for_analysis(). |
| analysis_weight_pp | Final combined weight (ipw × ipcw_pp),
truncated. Created automatically by
$s4_prepare_for_analysis(). |
| truncation | Winsorization of extreme weights at the 1st and 99th
percentiles (by default) to reduce variance. Applied via
$s3_truncate_weights(). |
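
The combination and truncation steps can be illustrated in base R. This is a minimal sketch on simulated weights, not the code behind `$s3_truncate_weights()`; the helper `truncate_weights()` and the simulated distributions are assumptions.

```r
# Hypothetical helper: winsorize weights at the given lower/upper quantiles.
truncate_weights <- function(w, lower = 0.01, upper = 0.99) {
  bounds <- quantile(w, probs = c(lower, upper), na.rm = TRUE, names = FALSE)
  pmin(pmax(w, bounds[1]), bounds[2])
}

set.seed(1)
n <- 1000
ipw <- exp(rnorm(n, sd = 0.5))       # simulated baseline weights
ipcw_pp <- exp(rnorm(n, sd = 0.3))   # simulated censoring weights

# Combine, then truncate the combined weight.
analysis_weight_pp <- ipw * ipcw_pp
analysis_weight_pp_trunc <- truncate_weights(analysis_weight_pp)

# Truncation pulls in the extremes and leaves the bulk of the weights unchanged.
print(range(analysis_weight_pp))
print(range(analysis_weight_pp_trunc))
```

Values beyond the percentile bounds are set to the bounds rather than dropped, so every row keeps a weight.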
File naming
All output files live in the project-specific data directory.
| Column in plan$ett | Pattern | When created |
|---|---|---|
| file_raw | {prefix}_raw_{enrollment_id}.qs2 | Loop 1 (intermediate) |
| file_imp | {prefix}_imp_{enrollment_id}.qs2 | Loop 1 (output) |
| file_analysis | {prefix}_analysis_{ett_id}.qs2 | Loop 2 (output) |
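
The patterns above can be expanded with `sprintf()`. The helper `build_file_names()` and the example values for `prefix`, `enrollment_id`, and `ett_id` are hypothetical; swereg may construct these paths differently.

```r
# Hypothetical helper; `prefix` and the example IDs are placeholders.
build_file_names <- function(prefix, enrollment_id, ett_id) {
  list(
    file_raw = sprintf("%s_raw_%s.qs2", prefix, enrollment_id),
    file_imp = sprintf("%s_imp_%s.qs2", prefix, enrollment_id),
    file_analysis = sprintf("%s_analysis_%s.qs2", prefix, ett_id)
  )
}

files <- build_file_names(prefix = "myproject", enrollment_id = "01", ett_id = "ETT01")
str(files)
```

Note that the first two files are keyed by `enrollment_id` and the third by `ett_id`, which mirrors the split between Loop 1 and Loop 2.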
Variable prefixes
| Prefix | Convention |
|---|---|
| x_ | Loop iteration variables extracted from grid tables (e.g., x_outcome, x_follow_up, x_file_analysis). Used in generate and analysis scripts to distinguish loop variables from dataset columns. |
| rd_ | Row-dependent variables (e.g., rd_age_continuous, rd_exposed). Variables that can change value across rows (time points) for the same person. Counterpart of row-independent (rowind) variables, which are time-invariant. |
Analysis types
The target trial emulation literature describes three analysis strategies:
| Analysis | Description | swereg support |
|---|---|---|
| Intention-to-treat (ITT) | Compare initiators vs non-initiators regardless of subsequent adherence. No censoring at treatment switching. | Not implemented. Our pipeline censors at protocol deviation via $s4_prepare_for_analysis(), which fundamentally modifies the dataset for per-protocol analysis. A separate ITT pipeline would need to skip per-protocol censoring entirely. |
| Per-protocol | Censor at protocol deviation (treatment switching), adjust for informative censoring with IPCW. | Yes. $s4_prepare_for_analysis() applies per-protocol censoring, then estimates IPCW-PP weights. The final combined weight analysis_weight_pp = ipw × ipcw_pp is used in $irr(). |
| As-treated | Model time-varying treatment status with time-varying IPW. | Not implemented. Requires time-varying treatment weights P(A_t |
The standard swereg pipeline produces per-protocol estimates: use
$irr(weight_col = "analysis_weight_pp_trunc") after the
full pipeline.
Enrollment band width and residual immortal time bias
The period_width parameter in TTEDesign
(default: 4 weeks) controls the width of enrollment bands. This is a
critical methodological parameter that trades off between bias and
power:
- Narrower bands (e.g., period_width = 1): Less residual immortal time bias, but fewer events per trial and larger datasets.
- Wider bands (e.g., period_width = 4): More residual immortal time bias, but more events per trial and better computational efficiency.
Caniglia et al. (2023) explicitly discuss this trade-off:
“a one-week enrollment period for the target trial beginning at week 36 was inappropriate, and that defining trials on the day scale may have been necessary. Generally, defining trials on a shorter scale will reduce residual immortal time bias.”
The period_width also serves as the grace period — the
window during which treatment initiation is assumed to occur. For
studies where the exposure is defined at a specific time point,
period_width = 1 may be appropriate. For studies where
treatment initiation is gradual, a wider window is justified (Hernan
2016, Section 4.4).
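
The effect of `period_width` on how weeks are grouped can be sketched with integer division. The helper `assign_band()` and the 0-based `week_index` are assumptions for illustration; swereg's own band assignment may be anchored differently.

```r
# Hypothetical helper: map a 0-based week index to an enrollment band.
assign_band <- function(week_index, period_width = 4) {
  week_index %/% period_width
}

weeks <- 0:11
bands <- data.frame(
  week_index = weeks,
  band_w4 = assign_band(weeks, period_width = 4),
  band_w1 = assign_band(weeks, period_width = 1)
)
print(bands)

# If time zero is taken to be the start of the band (an assumption here),
# initiation can lag time zero by up to period_width - 1 weeks.
max_lag_weeks <- function(period_width) period_width - 1
print(max_lag_weeks(c(1, 4)))
```

With `period_width = 4`, twelve weeks collapse into three bands; with `period_width = 1`, each week is its own band, which removes the within-band lag at the cost of four times as many trials.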
Matching approach
swereg uses per-band stratified matching: within each enrollment band, unexposed individuals are sampled at a specified ratio (e.g., 1:5) relative to the number of exposed individuals in that band. Matching is on the band only, not on confounders. This is a design choice for computational efficiency in large registry datasets.
Alternative approaches described in the literature:
- No matching, IPW only (Danaei 2013, Hernan 2008): Include all eligible non-initiators, adjust via propensity score. More statistically efficient but computationally expensive with large registries.
- Propensity score matching (Danaei 2013, sensitivity analysis): Match on estimated propensity score. More complex but can achieve better balance.
Our approach combines matching (for computational tractability) with IPW (to correct residual confounding within the matched set). The IPW step re-weights the matched sample to balance measured confounders.
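
A minimal base R sketch of per-band sampling follows. It is not swereg's implementation: the helper `sample_within_bands()`, the column names `band` and `exposed`, and the reading of 1:5 as five unexposed per exposed are all assumptions.

```r
# Hypothetical helper: within each band, keep all exposed and sample up to
# `ratio` unexposed per exposed, without replacement.
sample_within_bands <- function(dat, ratio = 5) {
  pieces <- lapply(split(dat, dat$band), function(d) {
    exposed <- d[d$exposed == 1, , drop = FALSE]
    unexposed <- d[d$exposed == 0, , drop = FALSE]
    # Cap the draw at the number of unexposed actually available in the band.
    n_take <- min(nrow(unexposed), ratio * nrow(exposed))
    rbind(exposed, unexposed[sample.int(nrow(unexposed), n_take), , drop = FALSE])
  })
  do.call(rbind, pieces)
}

set.seed(42)
dat <- data.frame(
  id = 1:400,
  band = rep(1:4, each = 100),
  exposed = rbinom(400, 1, 0.05)   # simulated exposure, about 5% per band
)
matched <- sample_within_bands(dat, ratio = 5)
print(table(band = matched$band, exposed = matched$exposed))
```

Sampling only shrinks the comparison group; it does not balance confounders, which is why the IPW step is still applied to the matched set.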
