Overview

This vignette defines the terms used throughout the swereg target trial emulation (TTE) system. It serves as a quick reference for anyone reading or writing code that uses TTEDesign, TTEEnrollment, or TTEPlan.

Data levels

Term | Meaning
skeleton | Person-week panel created by create_skeleton() and enriched with registry data. One row per person per ISO week. Input to the TTE pipeline. Stored as batched .qs2 files.
person-week | Synonym for skeleton-level data before enrollment. The data_level of a TTEEnrollment starts as "person_week".
trial | After enrollment via TTEEnrollment$new(..., ratio = ), data is expanded to trial panels: one row per person per trial per time period. data_level becomes "trial".
counting-process | The trial-level data uses counting-process format with tstart/tstop columns (Andersen-Gill style), suitable for time-varying Cox models and weighted Poisson regression.
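
To make the counting-process layout concrete, here is a minimal base-R sketch of how period-indexed trial rows could be turned into tstart/tstop intervals. The data frame, the period column, and the 4-week width are illustrative assumptions; only the tstart/tstop column names come from the text above, and swereg may derive them differently.

```r
# Hypothetical trial-level rows for one person in one trial.
# All column names except tstart/tstop are made up for illustration.
trial_rows <- data.frame(
  id       = c(1, 1, 1),
  trial_id = c(10, 10, 10),
  period   = 0:2,            # periods elapsed since the trial started
  event    = c(0, 0, 1)      # outcome occurs in the last period
)
period_width <- 4            # weeks per period (assumed)

# Each row covers the interval (tstart, tstop], measured in weeks.
trial_rows$tstart <- trial_rows$period * period_width
trial_rows$tstop  <- trial_rows$tstart + period_width

# Person-time per row, e.g. as an offset in a weighted Poisson model.
trial_rows$person_weeks <- trial_rows$tstop - trial_rows$tstart
```

Consecutive rows for the same person and trial are contiguous (each tstop equals the next tstart), which is what interval-based Cox and Poisson models expect.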

Classes

Class | Role
TTEDesign | Column name mappings that define the trial schema: person ID, exposure, outcome, confounder, and time variables. Created once via TTEDesign$new() and reused across the workflow.
TTEEnrollment | Enrollment data container (data.table + design + workflow state). Methods modify in-place via R6 reference semantics and return invisible(self) for $-chaining.
TTEPlan | Builder for trial generation. Holds the ETT grid, skeleton file paths, and per-ETT design parameters. Orchestrates Loop 1 via $s1_generate_enrollments_and_ipw().

ETT grid

Term | Meaning
ETT (Emulated Target Trial) | One combination of outcome × follow-up duration × age group. Each ETT produces one analysis-ready dataset. Corresponds to one row in plan$ett.
enrollment_id | Groups ETTs that share the same trial panels. ETTs within an enrollment_id have the same age group and matching design parameters (confounders, exposure, eligibility). They differ only in outcome and/or follow-up duration.
ett_id | Unique identifier for a single ETT (e.g., "ETT01"). Auto-assigned sequentially by $add_one_ett().
enrollment_spec | The metadata list returned by plan$enrollment_spec(i). Contains design (TTEDesign), enrollment_id, age_range, and n_threads. Used internally by the two-pass Loop 1 workers.
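
The relationship between ett_id and enrollment_id can be illustrated with a plain data frame. This is a sketch of the idea only: the outcome names, follow-up values, age labels, and the enrollment_id format are hypothetical, and the real grid is built through TTEPlan rather than expand.grid().

```r
# Hypothetical outcomes, follow-up durations (weeks), and age groups.
grid <- expand.grid(
  outcome   = c("outcome_a", "outcome_b"),
  follow_up = c(52, 104),
  age_group = c("40_49", "50_59"),
  stringsAsFactors = FALSE
)

# One ett_id per row, assigned sequentially ("ETT01", "ETT02", ...).
grid$ett_id <- sprintf("ETT%02d", seq_len(nrow(grid)))

# ETTs in the same age group share trial panels, so they share an
# enrollment_id. This assumes all other design parameters are identical.
grid$enrollment_id <- sprintf(
  "%02d", match(grid$age_group, unique(grid$age_group))
)

# Number of ETTs per enrollment_id: Loop 1 runs once per enrollment_id,
# Loop 2 once per ETT.
table(grid$enrollment_id)
```

With two outcomes, two follow-up durations, and two age groups, this gives eight ETTs spread over two enrollment_ids, so the expensive enrollment step runs twice rather than eight times.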

Two-loop architecture

Loop 1: enrollment + IPW

One iteration per enrollment_id. Run by plan$s1_generate_enrollments_and_ipw():

skeleton files ──(parallel callr workers)──► enroll (band-based match + collapse)
  ──► rbind ──► impute ──► IPW + truncate ──► save

Produces two files per enrollment_id:

  • file_raw — post-enrollment, pre-imputation
  • file_imp — post-imputation + IPW (input to Loop 2)

Loop 2: per-ETT outcome weighting

One iteration per ETT. Runs sequentially in the main process:

load file_imp ──► $s4_prepare_for_analysis() ──► save file_analysis

$s4_prepare_for_analysis() combines outcome preparation and IPCW-PP into one call. It prepares outcome data, calculates IPCW-PP, combines the weights into analysis_weight_pp (ipw × ipcw_pp), truncates, and drops intermediate IPCW columns.

Weights

Term | Meaning
IPW (Inverse Probability of treatment Weighting) | Baseline confounding adjustment. Computed once per enrollment_id in Loop 1 via $s2_ipw().
IPCW-PP (Inverse Probability of Censoring Weighting, Per-Protocol) | Time-varying weight for per-protocol analysis. Accounts for treatment switching and loss to follow-up. Computed per ETT in Loop 2 via $s4_prepare_for_analysis().
analysis_weight_pp | Final combined weight (ipw × ipcw_pp), truncated. Created automatically by $s4_prepare_for_analysis().
truncation | Winsorization of extreme weights at the 1st and 99th percentiles (by default) to reduce variance. Applied via $s3_truncate_weights().
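
The combination and truncation steps amount to a product followed by clamping at two quantiles. The sketch below shows that arithmetic in base R on simulated weights; the winsorize() helper is hypothetical and is not the code behind $s3_truncate_weights(), whose exact quantile method and output column naming may differ.

```r
set.seed(1)
n <- 1000
w <- data.frame(
  ipw     = exp(rnorm(n, sd = 0.5)),  # simulated baseline weights
  ipcw_pp = exp(rnorm(n, sd = 0.3))   # simulated censoring weights
)

# Combined per-protocol weight: product of the two components.
w$analysis_weight_pp <- w$ipw * w$ipcw_pp

# Hypothetical helper: clamp values outside the given percentiles
# to the percentile values themselves (winsorization).
winsorize <- function(x, lower = 0.01, upper = 0.99) {
  q <- quantile(x, probs = c(lower, upper), na.rm = TRUE, names = FALSE)
  pmin(pmax(x, q[1]), q[2])
}

w$analysis_weight_pp_trunc <- winsorize(w$analysis_weight_pp)

# Truncation narrows the range without dropping any rows.
range(w$analysis_weight_pp)
range(w$analysis_weight_pp_trunc)
```

Unlike trimming, winsorization keeps every observation; only the most extreme weights are pulled in to the cut-off values.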

File naming

All output files live in the project-specific data directory.

Column in plan$ett | Pattern | When created
file_raw | {prefix}_raw_{enrollment_id}.qs2 | Loop 1 (intermediate)
file_imp | {prefix}_imp_{enrollment_id}.qs2 | Loop 1 (output)
file_analysis | {prefix}_analysis_{ett_id}.qs2 | Loop 2 (output)
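
As a quick illustration of the patterns above, the file names can be reproduced with sprintf(). The prefix and identifier values here are placeholders; in practice the names are stored in plan$ett rather than built by hand.

```r
prefix        <- "myproject"  # placeholder prefix
enrollment_id <- "01"         # placeholder enrollment_id
ett_id        <- "ETT01"

# Loop 1 files are keyed by enrollment_id, Loop 2 files by ett_id.
file_raw      <- sprintf("%s_raw_%s.qs2", prefix, enrollment_id)
file_imp      <- sprintf("%s_imp_%s.qs2", prefix, enrollment_id)
file_analysis <- sprintf("%s_analysis_%s.qs2", prefix, ett_id)

c(file_raw, file_imp, file_analysis)
```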

Variable prefixes

Prefix | Convention
x_ | Loop iteration variables extracted from grid tables (e.g., x_outcome, x_follow_up, x_file_analysis). Used in generate and analysis scripts to distinguish loop variables from dataset columns.
rd_ | Row-dependent variables (e.g., rd_age_continuous, rd_exposed). Variables that can change value across rows (time points) for the same person. Counterpart of row-independent (rowind) variables that are time-invariant.

Analysis types

The target trial emulation literature describes three analysis strategies:

Analysis | Description | swereg support
Intention-to-treat (ITT) | Compare initiators vs non-initiators regardless of subsequent adherence. No censoring at treatment switching. | Not implemented. Our pipeline censors at protocol deviation via $s4_prepare_for_analysis(), which fundamentally modifies the dataset for per-protocol analysis. A separate ITT pipeline would need to skip per-protocol censoring entirely.
Per-protocol | Censor at protocol deviation (treatment switching), adjust for informative censoring with IPCW. | Yes. $s4_prepare_for_analysis() applies per-protocol censoring, then estimates IPCW-PP weights. The final combined weight analysis_weight_pp = ipw × ipcw_pp is used in $irr().
As-treated | Model time-varying treatment status with time-varying IPW. | Not implemented. Requires time-varying treatment weights P(A_t

The standard swereg pipeline produces per-protocol estimates: use $irr(weight_col = "analysis_weight_pp_trunc") after the full pipeline.

Enrollment band width and residual immortal time bias

The period_width parameter in TTEDesign (default: 4 weeks) controls the width of enrollment bands. It is a critical methodological parameter that trades bias against statistical power:

  • Narrower bands (e.g., period_width = 1): Less residual immortal time bias, but fewer events per trial and larger datasets.
  • Wider bands (e.g., period_width = 4): More residual immortal time bias, but more events per trial and better computational efficiency.
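
The effect of period_width on the number of bands can be seen with integer division on a running week counter. This is a simplified sketch: assign_band() is a hypothetical helper, and swereg's actual band assignment (for example, how bands are anchored to ISO weeks) may differ.

```r
# Hypothetical helper, not part of swereg: map a running week counter
# (0, 1, 2, ...) to an enrollment band of the given width.
assign_band <- function(week_index, period_width) {
  week_index %/% period_width
}

week_index <- 0:11  # 12 consecutive weeks from an arbitrary origin

band_w4 <- assign_band(week_index, period_width = 4)
band_w1 <- assign_band(week_index, period_width = 1)

length(unique(band_w4))  # 3 bands of 4 weeks each
length(unique(band_w1))  # 12 bands of 1 week each
```

With period_width = 4, someone who initiates treatment in the last week of a band is grouped with someone who initiates in the first week, which is the source of the residual immortal time bias. With period_width = 1 that grouping disappears, at the cost of four times as many trials over the same calendar span.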

Caniglia et al. (2023) explicitly discuss this trade-off:

“a one-week enrollment period for the target trial beginning at week 36 was inappropriate, and that defining trials on the day scale may have been necessary. Generally, defining trials on a shorter scale will reduce residual immortal time bias.”

The period_width also serves as the grace period — the window during which treatment initiation is assumed to occur. For studies where the exposure is defined at a specific time point, period_width = 1 may be appropriate. For studies where treatment initiation is gradual, a wider window is justified (Hernan 2016, Section 4.4).

Matching approach

swereg uses per-band stratified matching: within each enrollment band, unexposed individuals are sampled at a specified ratio (e.g., 1:5) to match the number of exposed. This is a design choice for computational efficiency in large registry datasets.
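
A minimal base-R sketch of per-band sampling follows. It reads the ratio as five unexposed per exposed individual, which is an assumption about the 1:5 notation. The sample_band() helper and the simulated data are hypothetical and do not reproduce swereg's enrollment code.

```r
set.seed(42)

# Simulated eligible population: 2 bands, 5 exposed and 95 unexposed in each.
eligible <- data.frame(
  id      = 1:200,
  band    = rep(1:2, each = 100),
  exposed = rep(c(rep(1, 5), rep(0, 95)), times = 2)
)
ratio <- 5  # unexposed sampled per exposed individual (assumed meaning)

# Hypothetical helper: keep all exposed in a band and a random sample
# of unexposed, capped at the number available.
sample_band <- function(d, ratio) {
  exposed   <- d[d$exposed == 1, ]
  unexposed <- d[d$exposed == 0, ]
  n_keep    <- min(nrow(unexposed), ratio * nrow(exposed))
  keep      <- unexposed[sample.int(nrow(unexposed), n_keep), ]
  rbind(exposed, keep)
}

# Apply the sampling separately within each enrollment band.
matched <- do.call(
  rbind,
  lapply(split(eligible, eligible$band), sample_band, ratio = ratio)
)

table(band = matched$band, exposed = matched$exposed)
```

Sampling is on band membership only, not on covariates, so the matched set shrinks the data without balancing confounders by itself.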

Alternative approaches described in the literature:

  • No matching, IPW only (Danaei 2013, Hernan 2008): Include all eligible non-initiators, adjust via propensity score. More statistically efficient but computationally expensive with large registries.
  • Propensity score matching (Danaei 2013, sensitivity analysis): Match on estimated propensity score. More complex but can achieve better balance.

Our approach combines matching (for computational tractability) with IPW (to correct residual confounding within the matched set). The IPW step re-weights the matched sample to balance measured confounders.
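
The re-weighting idea can be sketched with a logistic propensity model and stabilized weights. This shows the general technique on simulated data under assumed names (age, exposed, ipw); it is not a description of how $s2_ipw() is implemented, and the choice of stabilized weights is an assumption.

```r
set.seed(7)
n <- 600

# Simulated matched sample with one confounder that drives exposure.
matched <- data.frame(age = rnorm(n, mean = 50, sd = 10))
matched$exposed <- rbinom(n, 1, plogis(-1.5 + 0.03 * (matched$age - 50)))

# Propensity score: probability of exposure given the confounder.
ps_model <- glm(exposed ~ age, family = binomial(), data = matched)
ps <- fitted(ps_model)

# Stabilized weights: marginal probability of the received treatment
# divided by its conditional probability.
p_marginal <- mean(matched$exposed)
matched$ipw <- ifelse(
  matched$exposed == 1,
  p_marginal / ps,
  (1 - p_marginal) / (1 - ps)
)

# Balance check: mean age by arm, unweighted and weighted.
wmean <- function(x, w) sum(x * w) / sum(w)
by_arm <- split(matched, matched$exposed)
sapply(by_arm, function(d) mean(d$age))
sapply(by_arm, function(d) wmean(d$age, d$ipw))
```

Comparing the unweighted and weighted means by arm is a simple way to see whether the weights move the measured confounders toward balance within the matched set.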