swereg 26.5.20
All-subprocess s1 architecture (OOM fix + clean dispatcher)
Background: the previous s1 design mixed parallel-pool work and main-thread work in the same loop. After parallel_pool() returned for the multi-scout, the main R process held ~41,686 (tuples, attrition) data.tables (2,194 skeletons x 19 enrollments) in RAM, then layered an rbindlist of ~2,194 panel chunks on top during the per-enrollment post-step. On a 003-iliadis-stroke run (19 enrollments, 2,194 skeleton files, 6 workers) this peaked high enough that the parent process either OOMed at the end of the multi-scout or starved the loop-3 workers when they spawned.
$s1_generate_enrollments_and_ipw() is now a pure dispatcher: every step that touches multiple skeletons’ worth of data runs in a subprocess and exits when done. The main R process holds only paths, status flags, and progressors – never a data.table.
Sub-step nomenclature
Loop 1 is split into four named sub-steps (s1a..s1d). Each sub-step runs in its own subprocess and communicates with the next via files in a per-project work directory:
{study$data_meta_dir}/s1_work/{project_prefix}/
| Sub-step | Mode | Worker script |
|---|---|---|
| s1a | parallel x skeleton (n_workers) | worker_s1a_multi.R |
| s1b | single subprocess per enrollment | worker_s1b.R |
| s1c | parallel x (enrollment x skeleton) | worker_s1c.R |
| s1d | single subprocess per enrollment | worker_s1d.R |
s1a writes s1a_cache_* + s1a_pre_* + a per-skeleton sentinel. s1b reads all s1a_pre_* for one enrollment, samples comparators, writes s1b_enrolled_ids_* + s1b_attrition_* + the enrollment counts sidecar + sentinel. s1c reads s1a_cache_* + s1b_enrolled_ids_*, builds the panel, writes s1c_panel_* + sentinel. s1d reads all s1c_panel_* for one enrollment, imputes, computes IPW, truncates, writes the final file_raw + file_imp + sentinel.
The work directory is removed automatically on a successful end-to-end run.
Renames (breaking for internal ::: callers; no public API impact)
- .s1b_worker() is now .s1c_worker() (panel build). The two-arg in-memory helper used by dev/verify_*.R and dev/profile_*.R is exposed as .s1c_worker_impl() (formerly the body of .s1b_worker).
- inst/worker_s1b.R previously dispatched panel build; that role moved to inst/worker_s1c.R. inst/worker_s1b.R is now the match worker.
- .s1a_worker() (single-enrollment scout) is unchanged. Used only by tests/verify/profile; the orchestrator does not call it.
- New internal helpers: .s1b_worker(), .s1d_worker(), .s1_work_dir(), path constructors (.s1a_cache_path(), .s1a_pre_path(), .s1a_done_path(), .s1b_enrolled_ids_path(), .s1b_attrition_path(), .s1b_done_path(), .s1c_panel_path(), .s1c_done_path(), .s1d_done_path()), and .touch_sentinel().
Resume
resume = TRUE is now sentinel-based across all four sub-steps. The master skips any sub-step whose sentinel file is present in the work directory, so a crash in s1c (for example) only requires redoing the missing panel chunks – not the upstream scout or the downstream post-step.
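The sentinel-based skip described above can be sketched in base R. This is a minimal illustration, not the package's implementation: `done_path()` and `run_step()` are hypothetical names, and the sentinel file layout is an assumption.

```r
# Hypothetical helpers; names and file layout are illustrative only.
done_path <- function(work_dir, step, unit) {
  file.path(work_dir, paste0(step, "_done_", unit))
}

run_step <- function(work_dir, step, units, work_fn, resume = TRUE) {
  ran <- character(0)
  for (u in units) {
    sentinel <- done_path(work_dir, step, u)
    if (resume && file.exists(sentinel)) next  # finished in an earlier run
    work_fn(u)                                 # worker writes its own outputs
    file.create(sentinel)                      # mark done only after success
    ran <- c(ran, u)
  }
  ran
}

work_dir <- file.path(tempdir(), "s1_work_demo")
unlink(work_dir, recursive = TRUE)
dir.create(work_dir, showWarnings = FALSE)

first <- run_step(work_dir, "s1c", c("a", "b"), function(u) NULL)

# Simulate a crash that never wrote the sentinel for "b", then resume.
file.remove(done_path(work_dir, "s1c", "b"))
second <- run_step(work_dir, "s1c", c("a", "b"), function(u) NULL)
```

On the second call only unit "b" is redone, because the sentinel for "a" is still present.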
parallel_pool(collect = FALSE) everywhere
All four sub-steps invoke parallel_pool() with collect = FALSE. Workers write their outputs directly to final paths in the work directory; no result data is shipped back to the master through qs2 tempfiles. This was the architectural change that eliminated the post-pool memory hump.
What didn’t change (by design)
- Math/semantics are identical – the match step uses the same set.seed(enrollment_spec$seed) and the same data.table group-by-sample logic; the post step uses the same tteenrollment_rbind + s2_ipw + s3_truncate_weights chain; IDs in enrolled_ids, enrollment_counts, file_raw, and file_imp are bit-identical to those produced by 26.5.19.
- swereg::tteplan_locate_and_load(...)$s1_generate_enrollments_and_ipw(...) – the user-facing entry point – has the same signature (output_dir, impute_fn, stabilize, n_workers, swereg_dev_path, resume).
Caveats / behaviour change to be aware of
- impute_fn is serialised via qs2 across the subprocess boundary into s1d. The default tteenrollment_impute_confounders is namespaced and round-trips cleanly. Custom imputation closures that capture unexported state from the caller’s session may not deserialise; pass them as either swereg::your_fn-style refs or as self-contained closures.
- The work directory consumes ~10-20 GB during a 003-sized run (cache + pre + panel chunks). Plan accordingly; on success it is removed automatically.
swereg 26.5.19
Performance
Large-scale flame-graph-driven optimisation of s1. End-to-end output is bit-identical (verified via A/B against the pre-patch reference on a real 003-iliadis-stroke skeleton: 25 columns x 4.16 M panel rows, plus direct comparison of (tuples, attrition) for all 19 enrollments on the first skeleton file). On a 003-sized study (19 enrollments x 2,194 skeleton files, 6 workers) the projected wall savings on a 10-day s1 run total ~21 hours.
Stage 1a / multi-enrollment scout
- s1_generate_enrollments_and_ipw() now does its scout pass per skeleton file across all enrollments at once, instead of per (enrollment x skeleton). Each canonical skeleton (~5 MB qs2, ~3.7 GB decompressed for 1,025 columns) is deserialised ONCE per s1 run instead of 19 times, saving ~9-11 hours wall on a 003-sized run. Driven by a new internal worker .s1a_worker_multi() and worker script inst/worker_s1a_multi.R.
- The multi-scout worker projects the canonical to the union of columns any enrollment uses (typically ~50-100 of ~1,025) immediately after load, dropping the rest in place via := NULL. Apply-exclusions etc. then operate on a much smaller working data.table. Between enrollment iterations we drop only the columns prepare/finalize added (instead of data.table::copy()-ing the canonical, which itself cost ~3 s per iteration). New helper: .tte_canonical_needed_cols().
- .s1a_worker_multi() writes a per-skeleton scout checkpoint file (s1_scout_<basename>.qs2, ~0.5 MB) containing all 19 enrollments’ (tuples, attrition). The outer dispatch checks for existing checkpoints + cache files and skips any skeleton whose scout is already complete, so a mid-scout crash on resume only redoes the skeletons that hadn’t finished. Checkpoint round-trip: 72 s scout vs 0.08 s read.
- Split .s1_prepare_skeleton() into .s1_load_skeleton() (qs2 read + setalloccol + setkey) and .s1_prepare_loaded() (exclusions + treatment + eligibility combine) so the two worker variants share internal logic.
- Split .s1a_worker()’s post-prep work into .s1a_finalize_on_skeleton() (attrition + tuples + cache write), shared with .s1a_worker_multi().
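The column projection described above can be illustrated with a toy data.table. This is a sketch under assumptions: the per-enrollment column lists and the `tmp_flag` working column are invented for the example, and the real worker derives the needed columns via .tte_canonical_needed_cols().

```r
library(data.table)

# Toy "canonical" skeleton with more columns than any enrollment needs.
canonical <- data.table(id = 1:3, isoyearweek = "2020-01",
                        a = 1, b = 2, c = 3, d = 4)

# Hypothetical per-enrollment column requirements.
needed_by_enrollment <- list(
  e1 = c("id", "isoyearweek", "a"),
  e2 = c("id", "isoyearweek", "c")
)

# Keep the union of needed columns; drop the rest in place (no copy).
keep_cols <- Reduce(union, needed_by_enrollment)
drop_cols <- setdiff(names(canonical), keep_cols)
if (length(drop_cols) > 0L) canonical[, (drop_cols) := NULL]

# Per enrollment: add working columns, use them, then drop only those,
# instead of copying the whole canonical for each iteration.
for (e in names(needed_by_enrollment)) {
  before <- names(canonical)
  canonical[, tmp_flag := TRUE]            # stand-in for prepare/finalize output
  added <- setdiff(names(canonical), before)
  canonical[, (added) := NULL]
}
```

After the loop the canonical is back to its projected shape and can be reused for the next enrollment.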
Stage 1b / cache projection
- .s1a_worker() and .s1a_worker_multi() write the per-enrollment cache projected to only the columns s1b actually consumes (~30 of ~1,025): id, isoyearweek, trial_id, treatment/rd_intervention cols, design$confounder_vars, design$outcome_vars, the eligible_* cols, plus source variables for any computed confounder. Cache file shrinks ~10x (~5 MB -> ~0.5 MB per file); s1b cache read drops from ~5 s to ~0.5 s per worker call. Across 19 x 2,194 s1b calls / 6 workers this saves ~10 hours wall. New helper: .tte_s1_cache_columns().
- Removed the redundant per-enrollment %in% filter in private$enroll() Phase B when .s1b_worker has already filtered the cache to enrolled persons upstream (gated by the new .tte_filtered_to_enrolled attribute on the skeleton). Avoids allocating a ~3 GB identity copy of the panel per stage-1b worker.
Batched per-id derivations (s1a)
- tteplan_apply_exclusions() and tteplan_apply_derived_confounders() now collect all per-person (by = id) grouped derivations into a single dt[, c(...) := list(...), by = id] call via the new .tte_apply_eligibility_batch() helper, instead of one dt[, col := f(x), by = id] call per criterion. With 12 exclusions + 4 computed confounders for 003, this collapses 16 separate radix walks of the 17 M-row skeleton into 1. .s1_prepare_skeleton() fuses both helpers into one combined batch so the skeleton is walked exactly once.
- .s1_compute_attrition(): fused filtered <- sk[mask]; pt_i <- filtered[, ..., by] into pt_i <- sk[mask, ..., by] inside the cumulative-criterion loop, eliminating ~220 MB allocation per criterion. Dropped redundant == TRUE on the cumulative mask; replaced sum(.tte_tx_any == TRUE) / sum(.tte_tx_any == FALSE) with sum(.tte_tx_any) / sum(!.tte_tx_any).
- .s1_eligible_tuples(): fused [i][, j, by=] and dropped the redundant setorderv() (caller already sorted; any() doesn’t need ordered input). Eliminates a 1.18 GB intermediate allocation.
- .s1_prepare_skeleton(): collapsed three sequential [, := ] calls for rd_intervention / baseline_intervention / eligible_valid_treatment into one multi-column assignment (evaluates fcase() once and writes three columns in one dispatch).
- any_events_prior_to(): replaced c(FALSE, cum[-n] > 0L) (three n-vector allocations per call – slice, compare, prepend) with data.table::shift(prior_counts, n = 1L, fill = 0L) > 0L (one allocation). The function is called once per person per exclusion criterion, so the saving multiplies.
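Two of the rewrites above can be checked on toy data: the batched by = id assignment and the shift() form of the "any prior event" test. This is only a sketch; the column names (`dx`, `age`, `ever_dx`, `min_age`) are invented for the example.

```r
library(data.table)

dt <- data.table(
  id  = c(1L, 1L, 1L, 2L, 2L),
  dx  = c(0L, 1L, 0L, 0L, 0L),
  age = c(40, 41, 42, 30, 31)
)

# One grouped call per derived column: the grouping work repeats each time.
a <- copy(dt)
a[, ever_dx := any(dx == 1L), by = id]
a[, min_age := min(age), by = id]

# Batched: every per-id derivation is computed in a single grouped call.
b <- copy(dt)
b[, c("ever_dx", "min_age") := list(any(dx == 1L), min(age)), by = id]

# "Any event strictly before this row": old form vs. shift() form.
cum <- cumsum(dt[id == 1L, dx])
n <- length(cum)
prior_old <- c(FALSE, cum[-n] > 0L)
prior_new <- shift(cum, n = 1L, fill = 0L) > 0L
```

Both pairs produce the same values; only the number of grouped passes and intermediate vectors differs.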
Faster post-rbind impute + IPW (s1 post)
- TTEEnrollment$s1_impute_confounders() is faster on production-scale panels. Three changes:
  - Pre-scan baseline rows for NAs and skip the full panel collapse + merge entirely when every confounder is complete.
  - When only some confounders need imputing, restrict the group-by collapse and merge-back to that subset (needs_impute) instead of the full confounder_vars.
  - Replaced the drop + merge.data.table round trip with an [, := mget(paste0("i.", needs_impute)), on = id_var] update join. Avoids allocating a new merged data.table for the 17 M-row panel.

  Measured on a 17 M-row 002-ozel-psychosis trial with 7 confounders: impute step dropped from 4.69 s to 4.17 s; full post-rbind block (impute + s2_ipw + s3_truncate_weights) 10.19 s -> 8.86 s.
- private$enroll() Phase B: replaced data[get(person_id_col) %in% enrolled_person_ids] with a binary-search keyed join data[.(unique(...)), on = person_id_col] on the existing (id, isoyearweek) key. Avoids the temporary hash allocation that drove GC pressure on this 17M-row filter.
- private$enroll() Phase B: dropped the setorderv() immediately preceding setkeyv() on the same columns (two sorts collapsed into one). Included isoyearweek in the key so first(isoyearweek) inside the aggregation remains deterministic.
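The two join idioms mentioned above can be sketched on toy data. The tables and column names are illustrative assumptions, and the update join names the `i.` column explicitly rather than using the mget() form from the entry.

```r
library(data.table)

# Update join: fill values from a small per-person table without
# allocating a new merged copy of the panel.
panel <- data.table(id = c(1L, 1L, 2L, 2L),
                    week = c(1L, 2L, 1L, 2L),
                    age = c(NA, NA, 40, 40))
imputed <- data.table(id = 1L, age = 35)
panel[imputed, on = "id", age := i.age]   # modifies panel by reference

# Keyed join as a row filter, compared with %in%.
data <- data.table(id = rep(1:4, each = 3),
                   isoyearweek = rep(c("2020-01", "2020-02", "2020-03"), 4),
                   v = 1:12)
setkey(data, id, isoyearweek)
enrolled <- c(3L, 1L, 3L)

by_in   <- data[id %in% enrolled]
by_join <- data[.(sort(unique(enrolled))), on = "id", nomatch = NULL]
```

`nomatch = NULL` drops ids that are absent from `data`, which keeps the join equivalent to the `%in%` filter.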
Misc
- Replaced paste0() with stringi::stri_c() at the three hot panel-ID construction sites (r6_tteenrollment.R Phase B + Phase C, enrollment_id prefix in r6_tteplan.R::.s1a_worker). stri_c() is ~50% faster than paste0() on long character vectors. Stage-1b paste0 self time dropped from 9.2% to 4.7%.
- Added stringi to Imports.
Progress reporting
- The single combined progressr::progressor() is replaced by four per-loop progressors, each created lazily right before its loop so the handler’s “active” bar matches the current phase:
  - p_eligibility (2194 steps, per skeleton, parallel) for the multi-enrollment scout.
  - p_match (19 steps, per enrollment, main process) for the comparator-sampling step.
  - p_panel (19 x 2194 steps, per enrollment x per skeleton, parallel) for worker_s1b.R.
  - p_post (19 steps, per enrollment, main process) for rbind + impute + IPW + truncate + save.

  Each loop is preceded by a cat() header explaining what work it does and over what unit.
swereg 26.5.18
Bug Fixes
- TTE per-protocol bias fix: TTEEnrollment$s4_prepare_for_analysis() now drops censoring-event rows (where censor_this_period == 1) from self$data after the IPCW model is fit. Previously these rows were retained with event = 0, which biased the downstream weighted outcome regression toward the null.

  On a synthetic dataset with a known true per-protocol log-OR of -0.49, the previous behavior produced an estimate of -0.38 (bias +0.11); after the fix, the estimate is -0.52 (bias -0.02), agreeing closely with the canonical CRAN TrialEmulation package on the same data.
New Tests
- tests/testthat/test-tte_simulation_correctness.R: end-to-end correctness tests using simulated data with a known true PP effect. Skipped on CRAN.
- tests/testthat/test-tte_vs_trialemulation.R: cross-package comparison against the CRAN TrialEmulation package. Skipped when TrialEmulation is not installed or on CRAN.
swereg 26.5.17
Breaking changes
- RegistryStudy$compute_population() and RegistryStudy$compute_summary() are no longer public methods. They are now internal and run automatically at the end of $process_skeletons().
- Population computation is declarative: pass population_by_specs to RegistryStudy$new() (a list of character vectors, one per desired by aggregation). Each registered spec is pre-computed into the per-batch meta_*.qs2 sidecar and reduced to population_<spec>.qs2 once per process_skeletons() run.
- New getter study$population(by = c(...)) reads the cached population table. Errors with a clear message when by is not in population_by_specs.
- New active binding study$summary reads back summary.qs2.
Performance
- study$population(by = ...) is now a sub-second meta-only walk instead of re-loading every full skeleton from disk on each call. First call after process_skeletons() is fast because the per-batch aggregations were computed in-memory while skeletons were already loaded.
- Adding a new spec to population_by_specs between runs triggers a meta-only refresh on the next $process_skeletons(): skeletons on disk are not rewritten; only the meta_*.qs2 sidecars are augmented with the missing aggregation.
Changed
- $delete_skeletons() now also removes cached population_*.qs2 and summary.qs2 files in data_skeleton_dir.
Fixed
- Excel excel_spec_summary() renderer: named criteria under each enrollment’s additional_inclusion / additional_exclusion blocks now render one indent deeper than their parent section header (previously they collided with it), and the Age range: row is bolded like the other criterion names. Adds a third indent level (sub-sub) with new styles st_*_sub_sub_item / st_*_sub_sub_label, a new add_sub_sub_item() helper, and a sub_sub argument on add_kv() / add_yellow() / add_var() / add_derived_var().
- Excel excel_spec_summary() renderer: surfaces spec$standing_methods$calendar_time as the first entry in the Confounders section when present. Calendar time at trial registration is auto-adjusted via the IPW/IPCW models, so rendering it explicitly stops protocol reviewers re-asking “what about calendar year?” on every TTE.
Changed
- self$spec_xlsx (and therefore excel_spec_summary() when called without an explicit path) now writes to spec_<version>.xlsx instead of a fixed spec.xlsx. The filename mirrors the YAML convention (spec_v003.yaml, spec_v004.yaml, …) so each spec iteration produces a non-overwriting Excel artefact alongside it. The previous FILENAME_SPEC_XLSX constant is replaced by filename_spec_xlsx(version).
swereg 26.5.15
Changed
- status.txt: each section (never-matched, rare) now leads with a bucket-count summary (one row per registry prefix, descending by size) before the per-bucket detail blocks. Per-bucket lists are never collapsed – every column name appears in full. Rare cutoff rendered as “1-9” (clearer than “< 10”).
swereg 26.5.14
Changed
- status.txt rendered by $compute_summary() is restructured:
  - The [ok] count appears first (above the noise) so the headline number (“how many variables look healthy”) is immediately readable.
  - Never-matched and rare-variable sections are grouped by the column’s registry-type prefix (everything up to the first underscore: dorsm, sv, os, osd, can, op, rx, …), sorted by bucket size descending so the dominant problem shows first. Avoids the 252-line flat alphabetical list that previously forced the reader to scroll past the entire never-matched dump to reach the actually-OK count.
  - All numbers comma-formatted (8,852,776 not 8852776).
  - Trailing pointer to summary.qs2 for full per-column detail.
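The prefix bucketing and comma formatting described above can be sketched in base R. The column names below are made up for illustration; only the "everything up to the first underscore" rule and the 8,852,776 example come from the entry.

```r
# Hypothetical column names; real ones come from the study's registered codes.
cols <- c("dorsm_f20", "dorsm_f31", "sv_i21", "rx_n05a", "dorsm_x60", "sv_i63")

# Registry-type prefix: everything up to the first underscore.
prefix <- sub("_.*$", "", cols)

# Bucket sizes, largest first, so the dominant problem shows first.
bucket_sizes <- sort(table(prefix), decreasing = TRUE)

# Comma-formatted counts for the human-readable report.
fmt <- function(n) format(n, big.mark = ",", scientific = FALSE, trim = TRUE)
headline <- fmt(8852776)
```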
swereg 26.5.13
Changed
- Meta sidecar now carries weekly/annual splits and the date range of weekly data: n_rows_weekly / n_rows_annual, n_persons_weekly / n_persons_annual, weekly_min_isoyearweek / weekly_max_isoyearweek, annual_min_isoyear / annual_max_isoyear. Per-column $counts now emit n_person_weeks_with (TRUE rows where is_isoyear == FALSE) and n_person_years_with (TRUE rows where is_isoyear == TRUE) separately; previously the single field was misleadingly named since it was actually nrow() (weekly + annual combined).
- $compute_summary() aggregates these splits; status.txt now prints the WEEKLY and ANNUAL time periods with their respective denominators.
- TSV audit-track gains an n_person_years_with column and header comments for both periods.
swereg 26.5.12
Breaking
- Removed the per-batch code-check warning machinery added in PR #4 + 26.5.10. Specifically removed: start_code_check_session() / end_code_check_session() exports, warn_unmatched_codes() / warn_empty_logical_cols() exports, the internal .swereg_codes_pre() / .swereg_codes_post() hooks, .code_check_snapshot() / .code_check_merge() / .code_check_emit(), and the code_check_state field on the meta sidecar. Rationale: this information is now exposed more usefully via RegistryStudy$compute_summary() (see below), so having two parallel mechanisms reporting the same data was redundant.
- RegistryStudy$save_skeleton() no longer takes a code_check_state argument (it had no other callers). The meta sidecar shape changed: code_check_state is dropped, n_persons is added, and every entry in applied_registry now carries a $counts sub-field. Older meta sidecars (built before this version) keep working but their $counts field is missing, so $compute_summary() will report zero per-column counts for those batches. To fix, delete the affected skeleton_*.qs2 + meta_*.qs2 files (or call study$delete_skeletons()) and re-run $process_skeletons().
New
- RegistryStudy$compute_summary(): aggregates per-batch meta_*.qs2 sidecars into a study-wide sanity report. Always writes summary.qs2 (binary; programmatic reload) and status.txt (human-readable flag report) to data_skeleton_dir. On full runs (every expected batch present), also writes a git-tracked summary_<UTC>_<git-sha>_<swereg-ver>.tsv to the new data_summaries_dir candidate, with counts below suppress_below (default 5) masked as "<N" (Swedish registry data convention). Partial runs explicitly skip the TSV.
- New data_summaries_dir constructor parameter (optional). Defaults to NULL; when NULL, $compute_summary() skips the TSV even on full runs.
- Skeleton$apply_code_entry() now computes per-column n_persons_with and n_person_weeks_with for every column it adds, stored on the entry’s applied_registry record. These are the primitive that $compute_summary() rolls up.
- Kept: expand_codes() / expand_code_list() (bracket / range expansion utilities from PR #4) remain. The expansion of code patterns happens inline in each add_*() function; the removed pieces were specifically the per-call warning emission machinery, not the code expansion machinery.
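The small-count masking applied to the TSV can be sketched as follows. `mask_small_counts()` is a hypothetical helper, and two details are assumptions rather than documented behaviour: that the "N" in "<N" is the suppress_below threshold, and that zero counts are masked like any other count below it.

```r
# Hypothetical helper; not part of the swereg API.
mask_small_counts <- function(n, suppress_below = 5L) {
  # Assumption: every count below the threshold (including 0) is masked.
  ifelse(n < suppress_below, paste0("<", suppress_below), as.character(n))
}

masked <- mask_small_counts(c(0L, 3L, 5L, 120L))
```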
swereg 26.5.11
Fix
- .swereg_dev_path() (helper that hands the package root to callr workers so they can devtools::load_all() the same dev source as the parent) was returning system.file(package="swereg") directly, which for a dev-loaded package is the inst/ subdirectory – not the package root that devtools::load_all() expects. Workers then loaded a broken namespace and $process_skeletons() failed on batch 1 with attempt to apply non-function. Now strips the trailing /inst so the worker receives the actual package root.
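The path fix amounts to removing a trailing /inst component. A minimal sketch, assuming forward-slash paths; `strip_inst()` is a hypothetical stand-in, not the package's helper, and the example paths are invented.

```r
# Hypothetical stand-in for the fix: drop a trailing "/inst" if present.
strip_inst <- function(path) sub("/inst$", "", path)

dev_root  <- strip_inst("/home/me/src/swereg/inst")   # dev-loaded package
installed <- strip_inst("/usr/lib/R/library/swereg")  # left unchanged
```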
swereg 26.5.10
Breaking (on-disk)
- Skeleton schema bumped from version 4 to version 5 to introduce a meta_%05d.qs2 sidecar next to every skeleton_%05d.qs2. Existing skeleton directories from earlier versions are not readable; run study$delete_skeletons() and re-run $process_skeletons() to regenerate. The sidecar is small (a few KB) and lets $process_skeletons() do an incremental-rebuild check by reading meta only – avoiding the full skeleton deserialise on the common no-change path. For a 2000-batch delivery with 500 MB skeletons this turns a ~1 TB read into a few MB.
New features
- RegistryStudy$process_skeletons() now emits a single consolidated code-check warning at the end of every run, covering every batch in scope. Sequential and parallel runs behave identically because the per-batch accumulators flow through the meta sidecars on disk rather than via in-memory state.
- RegistryStudy$skeleton_pipeline_hashes() now reads the small meta sidecars instead of deserialising every skeleton, with a transparent fallback to loading the skeleton when meta is missing. Significantly faster on large studies.
Internal
- The code-check session machinery introduced in 26.5.9 is now fully internal: the previously-exported start_code_check_session(), end_code_check_session(), expand_codes(), expand_code_list(), warn_unmatched_codes(), warn_empty_logical_cols() are no longer exported. Users running through $process_skeletons() get cross-batch aggregation automatically; users running manual loops outside RegistryStudy get per-call warnings (the pre-26.5.9 behaviour). If you need cross-batch aggregation in a manual loop, open an issue and we’ll re-export.
swereg 26.5.9
New features
- add_diagnoses(), add_operations(), add_cods(), add_rx(), add_icdo3s(), add_snomed3s() and add_snomedo10s() now accept bracket / character-class / range patterns directly (e.g. "I2[0-5]", "FN[ABCDEGW][0-9][0-9]", "!302[A-Z]"). Bracket expansion runs unconditionally; the matchers themselves continue to use startsWith() on literal prefixes.
- The same add_* family now runs pre-call (per-literal source-data) and post-call (column-level) sanity checks automatically. Bad patterns surface at run-time instead of producing silent empty columns. Both checks can be disabled via options(swereg.check_codes = FALSE).
- The pre-call check also runs a cheap, data-free syntax check on the expanded code list, firing in milliseconds at the first add_*() call rather than at hour 6 of a multi-hour batched pipeline. It warns when any expanded literal is empty or contains regex metacharacters (^ $ * + ? . ( ) | \ [ ]) that will not match under startsWith(). Skipped automatically for add_rx(source = "produkt") because product names are exact-matched via %chin% and may legitimately contain those characters.
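To make the bracket expansion and the syntax check concrete, here is a minimal base-R sketch. `expand_brackets()` and `has_metachar()` are hypothetical helpers written for this note; they are not the package's expand_codes() and may differ from it in edge cases.

```r
# Hypothetical expander: turns "I2[0-5]" into literal prefixes.
expand_brackets <- function(pattern) {
  m <- regexpr("\\[[^]]+\\]", pattern)
  start <- as.integer(m)
  if (start == -1L) return(pattern)          # already a literal
  len  <- attr(m, "match.length")
  head <- substr(pattern, 1L, start - 1L)
  tail <- substr(pattern, start + len, nchar(pattern))
  cls  <- strsplit(substr(pattern, start + 1L, start + len - 2L), "")[[1]]
  chars <- character(0)
  i <- 1L
  while (i <= length(cls)) {
    if (i + 2L <= length(cls) && cls[i + 1L] == "-") {
      # Range such as 0-5 or A-Z.
      rng <- utf8ToInt(cls[i]):utf8ToInt(cls[i + 2L])
      chars <- c(chars, vapply(rng, intToUtf8, character(1)))
      i <- i + 3L
    } else {
      chars <- c(chars, cls[i])
      i <- i + 1L
    }
  }
  # Recurse so later bracket groups in the same pattern are expanded too.
  unlist(lapply(chars, function(ch) expand_brackets(paste0(head, ch, tail))))
}

# Data-free syntax check: flag literals that still contain regex metacharacters.
meta <- c("^", "$", "*", "+", "?", ".", "(", ")", "|", "\\", "[", "]")
has_metachar <- function(x) {
  vapply(x, function(s) any(strsplit(s, "")[[1]] %in% meta),
         logical(1), USE.NAMES = FALSE)
}

i2   <- expand_brackets("I2[0-5]")
fn   <- expand_brackets("FN[ABCDEGW][0-9][0-9]")
flag <- has_metachar(c("I21", "^F640"))
```

The "!" veto prefix passes through the expansion untouched, so "!302[A-Z]" expands to 26 vetoed literals.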
Contributed by @alexengberg (PR #4).
swereg 26.5.8
Breaking
- File-naming format width bumped from 3 digits to 5 digits for batch / ETT identifiers. Rawbatch files become 00001_rawbatch_lmed.qs2 (was 001_rawbatch_lmed.qs2); skeleton files become skeleton_00001.qs2 (was skeleton_001.qs2); ETT identifiers become ETT00001 (was ETT001). Existing on-disk files using the old 3-digit width will not be recognised by the new code – callers must rename them via shell or regenerate. Affects RegistryStudy (rawbatch + skeleton) and TTEPlan (ett_id, and the file_analysis = "<prefix>_analysis_<ett_id>.qs2" derived name).
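The new names follow a %05d width, and old 3-digit names can be mapped to the new width with a pattern substitution. `widen_name()` below is a hypothetical helper that only computes the new name; an actual migration would pass the old and new names to file.rename().

```r
# Names under the new 5-digit width.
new_names <- c(
  sprintf("%05d_rawbatch_lmed.qs2", 1L),
  sprintf("skeleton_%05d.qs2", 1L),
  sprintf("ETT%05d", 1L)
)

# Hypothetical mapper from the old 3-digit names to the new width.
widen_name <- function(x) {
  x <- sub("^([0-9]{3})_rawbatch_", "00\\1_rawbatch_", x)
  sub("^skeleton_([0-9]{3})\\.qs2$", "skeleton_00\\1.qs2", x)
}

old <- c("001_rawbatch_lmed.qs2", "skeleton_001.qs2", "skeleton_00001.qs2")
widened <- widen_name(old)
```

Names that already use five digits do not match either pattern and are returned unchanged.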
swereg 26.5.7
Breaking-ish
- registrystudy_load(): parameter renamed candidate_dir_rawbatch -> candidate_dir_meta. Callers must update if they passed by name. Behaviour is unchanged for callers that passed the same path used at RegistryStudy$new(data_meta_dir = ...) (which defaults to the rawbatch dir, so existing scripts continue to work positionally).
New
- RegistryStudy$new() gains data_meta_dir: candidate paths for the directory holding registrystudy.qs2. Defaults to data_rawbatch_dir (full backward compatibility). Pass an explicit value – e.g. the parent of rawbatch – to keep the singleton control file out of the per-batch data directory. Exposed as a read-only active binding.
- $meta_file now resolves to file.path(self$data_meta_dir, "registrystudy.qs2").
swereg 26.5.6
Performance
- create_skeleton() is now ~8x faster and uses ~7.5x less memory at every cohort size tested (1k–25k IDs over an 11-year range; linear scaling). The win comes from sorting a single time spine once and replicating it per id, instead of expand.grid()-ing the full cartesian product and then sorting the result. Output is identical to the previous implementation modulo a hidden data.table secondary index attribute. Contributed by @gkaramanis (PR #3).
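The difference between the two construction strategies can be shown on a tiny grid. This is an illustrative sketch in base R, not the create_skeleton() code; the ids and week labels are invented, and ids are assumed to be already sorted.

```r
ids   <- c(11L, 12L, 13L)
weeks <- c("2020-03", "2020-01", "2020-02")

# Old shape: build the full cartesian product, then sort the whole result.
old <- expand.grid(isoyearweek = weeks, id = ids, stringsAsFactors = FALSE)
old <- old[order(old$id, old$isoyearweek), c("id", "isoyearweek")]

# New shape: sort the time spine once, then replicate it per id.
spine <- sort(weeks)
new <- data.frame(
  id = rep(ids, each = length(spine)),
  isoyearweek = rep(spine, times = length(ids)),
  stringsAsFactors = FALSE
)
```

The second form sorts a vector of length(weeks) once instead of sorting length(ids) x length(weeks) rows.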
Internal
- New dev/bench/ scaffold for tracking performance regressions: Rscript dev/bench/run_all.R runs all benchmarks against the current source, Rscript dev/bench/diff.R compares the run to a checked-in baseline.csv and exits non-zero on >20% time / >50% memory regressions. Excluded from the package build via .Rbuildignore.
- New tests/testthat/test-create_skeleton-parity.R (31 tests) pinning create_skeleton() output against an embedded reference oracle plus structural invariants and edge cases (empty / duplicate / NA ids, single-day range, ISO W53 year, year boundary).
swereg 26.4.28
Breaking changes
- RegistryStudy$process_skeletons() now fails fast on any batch error instead of swallowing it into a warning() and pushing through the remaining batches. These pipelines run unattended for days; if batch 1 fails 10 minutes in (e.g. a systematic bug, missing column, unreadable rawbatch file), the user wants to SSH in within minutes and see the failure – not at the end of a 4-day run with the remaining batches all failing for the same root cause. The failing batch’s underlying error is surfaced via a stop() that includes (a) which batch number failed, (b) the original error message, and (c) a hint that successful batches are persisted on disk and the user can rerun with batches = ... to retry from the failed one. In the n_workers > 1 (callr subprocess) path, in-flight workers are killed via kill_tree() before the stop() so they don’t keep burning compute on what is likely the same systematic failure. Callers who intentionally want the old “complete with warnings” behaviour can wrap the call in tryCatch(error = function(e) ...).
New features
- add_rx() now supports "!"-prefixed exclusion patterns, restoring parity with the add_diagnoses() family. Previously, add_rx’s matcher was a one-shot Reduce(|, ...) union with no per-pattern branching, so "!"-prefixed entries were silently treated as literal first characters of an ATC code (which never match) or literal product names (which would, accidentally, include a brand named with a leading !). Now, a positive antipsychotic prefix combined with "!"-prefixed vetoes for N05AA and N05AB matches any antipsychotic except the first-generation (N05AA / N05AB) classes, closing the spec-expressiveness gap that previously forced exhaustive enumeration of every desired sub-code. The veto is independent per named code (no leak across list entries) and applies whether source = "atc" (prefix match) or source = "produkt" (exact match).
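The veto semantics can be sketched with a small matcher over source rows. `match_codes()` is a hypothetical function written for this note, not the add_rx() internals, and the N05A positive prefix in the usage is an assumed example (N05A is the ATC group for antipsychotics).

```r
# Hypothetical row-level matcher: positive patterns unioned, "!" patterns veto.
match_codes <- function(values, patterns, exact = FALSE) {
  is_veto <- startsWith(patterns, "!")
  pos <- patterns[!is_veto]
  neg <- substring(patterns[is_veto], 2L)
  hit <- function(p) {
    if (length(p) == 0L) return(rep(FALSE, length(values)))
    if (exact) return(values %in% p)                    # produkt-style
    Reduce(`|`, lapply(p, function(q) startsWith(values, q)))  # atc-style
  }
  hit(pos) & !hit(neg)
}

atc <- c("N05AH03", "N05AA01", "N05AB02", "N06AB06")
keep_atc <- match_codes(atc, c("N05A", "!N05AA", "!N05AB"))

# Exact matching: "!Sertralin" does not mask "Sertralin Sandoz".
produkt <- c("Sertralin", "Sertralin Sandoz")
keep_produkt <- match_codes(
  produkt, c("Sertralin", "Sertralin Sandoz", "!Sertralin"), exact = TRUE
)

# All-negative pattern sets match nothing.
keep_none <- match_codes(atc, c("!N05AA"))
```

Because the veto is applied per source row, a week containing both a vetoed and a non-vetoed row would still aggregate to TRUE.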
Documentation
- Expanded add_rx() @param codes to spell out four nuances that the add_diagnoses family carried implicitly: vetoes are independent per named code (no leak across list entries); veto match style follows source (prefix for atc, exact for produkt – so "!Sertralin" does NOT mask "Sertralin Sandoz"); all-negative pattern sets produce empty columns; and the per-source-row veto interacts with the per-week aggregation such that a non-vetoed Rx still drives a week to TRUE even if a vetoed Rx overlaps in the same week.
- Fixed misleading pattern-syntax documentation on add_diagnoses(), add_cods(), add_icdo3s(), add_snomed3s(), add_snomedo10s(), and add_operations(). The previous text described patterns as regex with auto-prepended ^ anchors and example strings like "^F640" / "^8140" / "^80146002". The actual matcher is prefix-only via startsWith() – a literal ^ in a pattern is treated as an ordinary character and silently matches nothing. Updated each function’s @param codes to describe the real contract (prefix matching, no regex), and added a clear explanation of "!"-prefixed row-level vetoes (including the important detail that the veto operates on the raw source row, not on the (id, isoyearweek) bucket – a non-vetoed code in the same week still triggers TRUE).
- add_rx() documentation now explicitly notes that it does NOT support "!" exclusion patterns (its matcher is a simple union).
swereg 26.4.27
Maintenance
- Renamed user-facing “exposure” terminology in the results workbook to match the project’s TTE vocabulary, where the assignment is the treatment and the active arm is the intervention. The Sensitivity sheet’s identifier column header Exposure is now Intervention, and the per-arm measurement columns Events (exp), PY (exp), Rate/100k (exp) are now Events (int), PY (int), Rate/100k (int) (paired with the unchanged (cmp) columns). Existing workbooks already on disk are not rewritten – re-run plan$export_tables() to regenerate.
New features
- TTEPlan$s3_analyze() gains a force argument (default FALSE). When TRUE, cached results_enrollment and results_ett entries in the targeted scope are dropped before recomputation. Scope follows enrollment_ids / ett_ids: with both NULL, all cached results are cleared; otherwise only matching entries are dropped. Provides a supported way to recompute after a broken environment produced skipped = TRUE placeholders (e.g. missing survey package), without poking R6 internals from the calling script.
- TTEPlan$s3_analyze() gains an n_workers argument (default 1L). Both the enrollment loop and the per-ETT loop now dispatch through parallel_pool() with the requested concurrency (previously hardcoded to 1L at both call sites). Each subprocess loads its own analysis file, so peak RAM scales linearly with n_workers; CPU threads per worker are auto-partitioned as floor(detectCores() / n_workers).
Bug Fixes
- CONSORT cascade no longer double-counts the global cohort. In .s1_compute_attrition(), the per-trial summary tables (before_row and the per-criterion rows[[i]]) were grouped by trial_id over a pt0 that included person-weeks outside any trial period. Those rows collapsed into a spurious (trial_id = NA, criterion) group whose n_persons later got summed together with the legitimate before_global / global_rows[[i]] row during per-batch aggregation (by = .(trial_id, criterion) at the centralized matching step), roughly doubling the global cohort number in the CONSORT diagram and attrition tables. Per-trial counts and arm counts were unaffected. Fix filters pt0[!is.na(trial_id)] (and pt_i[!is.na(trial_id)]) before computing per-trial summaries; the global row is still computed off the unfiltered pt0 so it captures the full pre-exclusion uniqueN. Existing enrollment_counts_*.qs2 files need to be regenerated by re-running s1 to pick up the corrected counts.
- inst/worker_s2.R now passes sep_by_tx (matching .s2_worker() and the plan-side item field) instead of the stale sep_by_exp. Previously plan$s2_generate_analysis_files_and_ipcw_pp() failed at the very first ETT with unused argument (sep_by_exp = params$sep_by_exp).
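The double-counting mechanism can be reproduced on a four-row toy table. This is a simplified sketch with a single criterion and invented data; the real aggregation also groups by criterion.

```r
library(data.table)

# Person-weeks; trial_id is NA for weeks outside any trial period.
pt0 <- data.table(id = c(1L, 1L, 2L, 3L),
                  trial_id = c(10L, NA, 10L, NA))

# Global row: true unique persons, stored under trial_id = NA.
global <- data.table(trial_id = NA_integer_, n_persons = uniqueN(pt0$id))

# Buggy per-trial summary: NA rows form a spurious trial_id = NA group.
buggy <- pt0[, .(n_persons = uniqueN(id)), by = trial_id]

# Fixed: drop person-weeks outside any trial before grouping.
fixed <- pt0[!is.na(trial_id), .(n_persons = uniqueN(id)), by = trial_id]

# Downstream aggregation sums n_persons within each trial_id.
aggregate_counts <- function(per_trial) {
  rbind(global, per_trial)[, .(n_persons = sum(n_persons)), by = trial_id]
}

global_buggy <- aggregate_counts(buggy)[is.na(trial_id), n_persons]
global_fixed <- aggregate_counts(fixed)[is.na(trial_id), n_persons]
```

With the spurious group present, the global count is inflated from 3 to 5; after the filter it stays at the true 3.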
Maintenance
- Promoted runtime-required packages from Suggests to Imports so they install automatically: survey, survival, mgcv, MASS, scales, glue, openxlsx, patchwork, DiagrammeR, DiagrammeRsvg, rsvg. All are referenced unconditionally from R/ (no requireNamespace() guards or alternative code paths). Previously, missing survey caused every IRR call in s3_analyze to be silently captured as list(skipped = TRUE, reason = "there is no package called 'survey'"), which then produced an empty forest plot in export_tables with no visible error. Missing DiagrammeR / DiagrammeRsvg / rsvg similarly skipped CONSORT sidecars with only a warning.
swereg 26.4.22
New features
- CONSORT reporting surfaces unique-person counts alongside person-trial counts. Sequential TTE inflates the analytic denominator because one person enters many weekly trials; a cohort with 390k women can generate 22M person-weeks, and the 60x gap routinely confuses reviewers. Attrition bookkeeping now carries both numbers end-to-end:
  - .s1_compute_attrition() emits a global trial_id = NA row per criterion alongside the existing per-trial rows, with a true uniqueN(person_id) across the whole skeleton. Summing per-trial n_persons over-counted anyone entering more than one trial; the global row is the honest number. Per-trial rows are retained for diagnostic slicing.
  - .build_consort_dot() now renders one lumped red side-box with a bulleted list of every exclusion criterion (CONSORT-2010 convention) instead of a stacked red box per criterion, and each box reports N persons / M person-trials. Enrollment titles are split at " (" onto two lines so long spec labels don’t blow out the box width.
  - TTEEnrollment$rates() gains an n_persons column per treatment arm, using design$person_id_var.
  - Results workbook: each enrollment’s combined-baseline sheet opens with a one-line “Cohort: N persons contributed M sequential trial enrollments” summary. A companion Attrition_{id} sheet carries the tabular form of the CONSORT numbers (criterion / n_persons / n_person_trials / excluded counts / n_intervention / n_comparator) so reviewers can cite exact figures without measuring pixels on the PNG sidecar.
- $register_codes() auto-validates the add_* contract. Every call to a registered fn is now wrapped with a pre/post-state check: row count must be preserved, structural columns (id, isoyear, isoyearweek, is_isoyear) must still be present, and every column name in the registration’s codes list must actually exist on the skeleton after the call. Contract violations error loudly with a pointer back to the offending $register_codes(<label>) entry. Custom add_* functions plugged into the pipeline (Norwegian registries, regional Swedish cohorts, payer claims, …) get this enforcement for free; built-ins already pass the checks, so no behaviour change for existing registrations.
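The three contract rules listed for $register_codes() can be sketched in base R. This is an illustration only: check_add_contract(), add_ok() and add_broken() are hypothetical names, the sketch uses a plain data.frame that is returned by value, and swereg's actual wrapper may be structured differently.

```r
# Hypothetical checker (not swereg's internal wrapper): call an add_*-style
# function and verify the three contract rules on its result.
check_add_contract <- function(skeleton, fn, codes, label) {
  structural <- c("id", "isoyear", "isoyearweek", "is_isoyear")
  n_before <- nrow(skeleton)
  out <- fn(skeleton, codes)

  problems <- character(0)
  if (nrow(out) != n_before) {
    problems <- c(problems,
                  sprintf("row count changed (%d -> %d)", n_before, nrow(out)))
  }
  lost <- setdiff(structural, names(out))
  if (length(lost) > 0) {
    problems <- c(problems,
                  paste("structural columns dropped:", paste(lost, collapse = ", ")))
  }
  absent <- setdiff(names(codes), names(out))
  if (length(absent) > 0) {
    problems <- c(problems,
                  paste("code columns not created:", paste(absent, collapse = ", ")))
  }
  if (length(problems) > 0) {
    stop(sprintf("register_codes(%s): %s", label,
                 paste(problems, collapse = "; ")), call. = FALSE)
  }
  out
}

skeleton <- data.frame(
  id = c(1, 1, 2),
  isoyear = 2020L,
  isoyearweek = c("2020-01", "2020-02", "2020-01"),
  is_isoyear = FALSE
)
codes <- list(stroke = c("I63", "I64"))

# A well-behaved function adds one column per code name.
add_ok <- function(sk, codes) {
  for (nm in names(codes)) sk[[nm]] <- FALSE
  sk
}
# A deliberately broken function drops a row and adds nothing.
add_broken <- function(sk, codes) sk[-1, ]

ok  <- check_add_contract(skeleton, add_ok, codes, "stroke")
msg <- tryCatch(check_add_contract(skeleton, add_broken, codes, "stroke"),
                error = conditionMessage)
print(msg)
```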
Documentation
- New vignette builtin-add-functions: end-to-end walkthrough of every add_* function swereg ships (add_onetime, add_annual, add_diagnoses, add_operations, add_rx, add_cods, add_quality_registry), with pattern syntax, collision policies, and a typical end-to-end ordering.
- New vignette custom-add-functions: how to write your own add_* function for registries swereg doesn’t ship support for (non-Swedish registries, in-house data, quality registries without dedicated built-ins). Covers the contract, reusing built-ins via $register_codes(fn = swereg::add_diagnoses), fn_args for extra knobs like diag_type = "main", a complete add_vaccinations() worked example run through a real RegistryStudy, a demonstration of the auto-wrap firing on a deliberately broken function, and a design cheat sheet of lessons from the built-ins.
User experience
- RegistryStudy$skeleton_pipeline_hashes() now reports progress. After the last batch finishes, $process_skeletons() calls $write_pipeline_snapshot(), which in turn calls $skeleton_pipeline_hashes() to collect per-batch metadata. That function has to deserialize every skeleton_*.qs2 file from disk (via qs2::qs_read()) to read a handful of hash fields, which with many/large batches can take tens of minutes. Previously this ran silently, making $process_skeletons() look hung after dispatching the last batch. It now prints an explanatory message up front and ticks a progressr bar per file.
swereg 26.4.16
Documentation
- Consolidated hand-rolled vignettes from 3-step (skeleton1-create, skeleton2-clean, skeleton3-analyze) into a 2-step manual workflow (skeleton-create, skeleton-analyze) that matches the actual architecture.
- Renamed pkgdown “Hand-rolled” section to “Manual workflow”.
- Updated skeleton-concept, skeleton-pipeline, and CLAUDE.md to reflect the 2-step model.
swereg 26.4.20
Bug fixes
- add_cods() now accepts codes = like all sibling add_* functions (previously cods =). The old cods = name is kept as a deprecated alias with a warning. This mismatch caused every add_cods entry registered via RegistryStudy$register_codes() to fail silently with "unused argument (codes = list(...))": the dispatcher always passes codes = ... but add_cods was the only one not expecting that name.
- RegistryStudy$process_skeletons() progress bar now advances when batches fail. Previously, the parallel branch’s error handler emitted a warning but never ticked the progressor, so a single failing batch left the bar frozen and, combined with the options("warn") default of 0 buffering warnings until the function returns, produced the symptom “progress bar stuck at 0% indefinitely”. The bar now ticks on both success and failure, and failed-batch messages show up in the bar’s (last: ...) slot with a FAILED tag. Warnings are also emitted with immediate. = TRUE so failures surface in real time regardless of the session’s warn setting. The serial branch gained matching error handling (previously a single batch error aborted the whole run).
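The tick-on-failure behaviour described for $process_skeletons() follows a general pattern that can be sketched without progressr. In this sketch run_batches() and tick() are hypothetical stand-ins: tick() only counts calls and records the message, where the real code would call a progressr progressor.

```r
# Sketch: advance the progress counter on both success and failure, and
# surface failures immediately instead of buffering the warning.
run_batches <- function(batches, work_fn) {
  ticks <- 0L
  messages <- character(0)
  # Stand-in for a progressor: count the tick and keep the label.
  tick <- function(msg) {
    ticks <<- ticks + 1L
    messages <<- c(messages, msg)
  }
  results <- lapply(batches, function(b) {
    tryCatch({
      res <- work_fn(b)
      tick(sprintf("batch %d", b))
      res
    }, error = function(e) {
      warning(sprintf("batch %d failed: %s", b, conditionMessage(e)),
              immediate. = TRUE, call. = FALSE)
      tick(sprintf("batch %d FAILED", b))
      NULL
    })
  })
  list(results = results, ticks = ticks, messages = messages)
}

# Batch 2 fails; the counter still reaches 3 and the other batches complete.
out <- run_batches(1:3, function(b) if (b == 2) stop("boom") else b * 10)
print(out$ticks)
print(out$messages)
```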
BREAKING CHANGES
- TTE vocabulary rename (exposure / exposed / unexposed → treatment / intervention / comparator): the TTE system now uses PICO-aligned terminology. treatment is the umbrella concept (the variable naming the assignment); intervention and comparator are the two arm values. This matches how TTE papers are written up in the major medical journals (BMJ, NEJM) and Hernán’s canonical target-trial references. Migration:

  YAML spec:
  - exposure: → treatment:
  - exposed_value: → intervention_value:
  - arms.exposed: → arms.intervention:

  TTEDesign$new() arguments:
  - exposure_var = ... → treatment_var = ...
  - time_exposure_var = ... → time_treatment_var = ...

  Skeleton / trial-panel columns produced by the pipeline:
  - baseline_exposed → baseline_intervention
  - rd_exposed → rd_intervention
  - n_exposed / n_unexposed (in CONSORT / matching summaries) → n_intervention / n_comparator (and the _total / _enrolled variants)

  Logical semantics unchanged: TRUE = intervention arm, FALSE = comparator arm. Column names now describe what TRUE means, which is clearer than the prior “exposed” (ambiguous between an assignment and a time-varying status). User-chosen column names (e.g., a column named "exposed" in user data) are unaffected: the rename only touches package-provided API surface.
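For downstream scripts that still hold tables with the old package-provided column names, the column part of the migration can be expressed as a lookup. This is a hypothetical helper, not something swereg ships: migrate_column_names() and rename_map are names invented here, and the map covers only the four base names listed above, not the _total / _enrolled variants.

```r
# Old -> new names for package-provided columns, taken from the list above.
rename_map <- c(
  baseline_exposed = "baseline_intervention",
  rd_exposed       = "rd_intervention",
  n_exposed        = "n_intervention",
  n_unexposed      = "n_comparator"
)

# Hypothetical helper: rename only columns that appear in the map and leave
# every other column (including user-chosen names) untouched.
migrate_column_names <- function(x, map = rename_map) {
  hit <- names(x) %in% names(map)
  names(x)[hit] <- unname(map[names(x)[hit]])
  x
}

old <- data.frame(id = 1:2, rd_exposed = c(TRUE, FALSE), exposed = c(1, 0))
new <- migrate_column_names(old)
print(names(new))
```

The user column "exposed" is not in the map, so it passes through unchanged, which mirrors the rule that only package-provided names are affected.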
New features
- RegistryStudy$register_derived_codes(codes, from, as): registers a “derived” code entry that doesn’t read rawbatch data, but instead ORs together already-existing skeleton columns from earlier primary entries. For each nm in codes, writes <as>_<nm> := <from[1]>_<nm> | <from[2]>_<nm> | .... Use case: building combined outcome columns like osd_* = os_* | dorsu_* | dorsm_* where the hospital half comes from add_diagnoses on OV+SV and the death half comes from add_cods on DORS – two functions that can’t share a single combine_as argument because they search different raw-data columns (hdia / dia* vs ulorsak / morsak*).

  Derived entries are full first-class citizens of the code registry: they get their own fingerprint, participate in the per-entry incremental sync via Skeleton$sync_with_registry(), and fold upstream primary fingerprints into their own so that editing an upstream primary’s fn_args / groups / codes (e.g. flipping cod_type from "underlying" to "multiple") automatically cascades into a derived replay. Derived entries must be registered AFTER their upstream primaries (apply runs in registration order).
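The column arithmetic behind a derived entry (<as>_<nm> as the OR of the <from>_<nm> columns) can be sketched in base R. derive_or_columns() is a hypothetical function written for this illustration; it works on a data.frame copy, and the assumption that codes is a named list whose names supply nm is mine.

```r
# Sketch: for each code name nm, OR together <from>_<nm> columns into <as>_<nm>.
derive_or_columns <- function(df, codes, from, as) {
  for (nm in names(codes)) {
    src <- paste0(from, "_", nm)
    # Upstream columns must already exist (primaries are applied first).
    stopifnot(all(src %in% names(df)))
    df[[paste0(as, "_", nm)]] <- Reduce(`|`, df[src])
  }
  df
}

sk <- data.frame(
  os_stroke    = c(TRUE,  FALSE, FALSE, FALSE),
  dorsu_stroke = c(FALSE, TRUE,  FALSE, FALSE),
  dorsm_stroke = c(FALSE, FALSE, FALSE, TRUE)
)
sk <- derive_or_columns(sk, codes = list(stroke = "I63"),
                        from = c("os", "dorsu", "dorsm"), as = "osd")
print(sk$osd_stroke)
```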
swereg 26.4.19
BREAKING CHANGES
- New Skeleton R6 class: per-batch skeleton files on disk are now serialized Skeleton R6 objects instead of bare data.tables. They carry their own phase provenance (framework hash, applied code registry fingerprints keyed by their minimal descriptor, and an ordered named list of applied phase-3 “randvars” steps). Legacy bare-data.table skeleton files are auto-wrapped on first load for backwards compatibility. Downstream code that reads skeletons via qs2_read() now needs to unwrap via sk$data; the three swereg internal call sites (in .s1_prepare_skeleton(), .s1b_worker() cache reuse, and tteplan_from_spec_and_registrystudy()) have been updated.
- RegistryStudy$process_skeletons() signature changed: drops the process_fn callback argument. Callers must pre-register their pipeline functions via $register_framework(fn) and $register_randvars(name, fn) before calling $process_skeletons(). The new three-phase orchestration re-runs each phase only when its relevant part of the pipeline has changed:
  - Phase 1 (framework): full rebuild on function body/formals hash change.
  - Phase 3 (randvars): ordered per-step divergence-point rewind-and-replay. Editing one step replays that step and everything downstream of it; upstream steps untouched.
  - Phase 2 (codes): per-entry fingerprint diff. Adding/removing one code entry touches only that entry’s columns on existing skeletons. For the typical “edit one ICD-10 code” workflow, this drops the re-run cost from a full pipeline rebuild to roughly one-Nth of phase 2, where N is the number of registered code entries.
- RegistryStudy schema bump (3 → 4): existing registrystudy.qs2 files with an older schema error on load via $check_version() with an actionable message pointing at re-running the upstream generator script.
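The phase 2 per-entry fingerprint diff boils down to comparing two named sets of fingerprints. A minimal sketch, assuming fingerprints are plain strings keyed by an entry label (the real keys and digests differ); diff_fingerprints() is a hypothetical name.

```r
# Sketch: decide which code entries to drop and which to (re)apply on an
# existing skeleton, given applied vs. currently registered fingerprints.
diff_fingerprints <- function(applied, current) {
  shared  <- intersect(names(applied), names(current))
  changed <- shared[applied[shared] != current[shared]]
  list(
    drop  = c(setdiff(names(applied), names(current)), changed),
    apply = c(setdiff(names(current), names(applied)), changed)
  )
}

applied <- c(stroke = "a1", mi = "b2", old_entry = "c3")
current <- c(stroke = "a1", mi = "zz", new_entry = "d4")

plan <- diff_fingerprints(applied, current)
print(plan)
```

Here "stroke" is untouched, "mi" was edited so it is dropped and re-applied, "old_entry" is only dropped and "new_entry" is only applied, which is why editing one entry costs roughly one entry's worth of work.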
New features
- Skeleton R6 class (R/r6_skeleton.R). Public methods: initialize, check_version, pipeline_hash, apply_code_entry, drop_code_entry, sync_with_registry (phase 2 incremental diff), sync_randvars (phase 3 divergence-point rewind-and-replay), save, print. drop_code_entry uses metadata prediction via the new file-level .entry_columns(reg) helper rather than a runtime column map.
- New RegistryStudy methods:
  - $register_framework(fn) and $register_randvars(name, fn) for phase 1 / phase 3 registration.
  - $code_registry_fingerprints() returning xxhash64 digests of each code_registry entry’s (codes, label, groups, fn_args, combine_as) tuple.
  - $pipeline_hash() returning a single xxhash64 over (framework hash, ordered randvars sequence hashes, code registry fingerprints). Answer to “what would a freshly-built skeleton look like?”
  - $load_skeleton(batch_number) / $save_skeleton(sk) as thin wrappers around Skeleton$save() that supply self$data_skeleton_dir automatically, mirroring the existing $load_rawbatch() / $save_rawbatch() pattern.
  - $skeleton_pipeline_hashes() returning a per-batch summary data.table with each skeleton’s current pipeline_hash. Useful for spotting batches out of sync with each other.
  - $assert_skeletons_consistent() as a pre-flight check for downstream consumers: errors on mixed-hash or partially-rebuilt state.
  - $write_pipeline_snapshot() writing a one-row TSV to {data_pipeline_snapshot_dir}/{host_label}.tsv. Git-trackable, concurrency-safe (each host writes only its own file), silently skipped when the snapshot candidate directory is not configured or not mounted on the current host.
  - $adopt_runtime_state_from(other) copies runtime fields (n_ids, n_batches, batch_id_list, groups_saved) from another RegistryStudy without touching config fields. Used by generator scripts to reload disk state without silently adopting stale group_names or code_registry.
- New RegistryStudy public fields: framework_fn, randvars_fns (named ordered list), host_label, data_pipeline_snapshot_cp (CandidatePath).
- New active binding data_pipeline_snapshot_dir resolving from data_pipeline_snapshot_cp, parallel to the existing data_rawbatch_dir / data_skeleton_dir / data_raw_dir accessors.
- File-level helpers in r6_registrystudy.R: .hash_function(fn) (xxhash64 over list(body(fn), formals(fn)) – stable across R sessions because it excludes the function environment), .fingerprint_entry(reg), .entry_columns(reg) (vectorized wrapper around the previously-private .generated_columns_for_entry() which has been lifted to file level), .format_batch_range(batches), and .process_one_batch(study, i, ...) (the per-batch orchestration shared by process_skeletons()’s serial and callr-parallel branches).
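The choice to hash list(body(fn), formals(fn)) rather than the function object can be illustrated without a hashing package. The sketch below only compares the descriptor that would be fed to the hash; descriptor(), make_adder() and the example functions are hypothetical names for this illustration.

```r
# The descriptor that a body/formals-based hash would be computed over.
descriptor <- function(fn) list(body(fn), formals(fn))

# Two closures from the same factory: same code, different environments.
make_adder <- function(k) function(x) x + k
f <- make_adder(1)
g <- make_adder(2)

# A function with different formals.
h <- function(x, y) x + k

same_code  <- identical(descriptor(f), descriptor(g))
same_env   <- identical(environment(f), environment(g))
other_code <- identical(descriptor(f), descriptor(h))
print(c(same_code = same_code, same_env = same_env, other_code = other_code))
```

One consequence of excluding the environment, visible here, is that f and g would hash identically even though they capture different values of k, so a change that lives only in captured variables would not by itself register as a code change under this scheme.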
Memory footprint note
$load_skeleton() calls data.table::setalloccol(obj$data, n = getOption("datatable.alloccol", 4096L)) on the loaded Skeleton$data to restore data.table over-allocation slots that qs2 serialization drops. Memory overhead is a new data.table HEADER (~8-16 bytes per slot, so ~32-64 KB total) – NOT a full copy of the column data, which stays shared by reference. This is negligible compared to the per-batch skeleton size (typically multi-GB). Without the refresh, subsequent := mutations inside helper functions would silently reallocate and strand the R6 field pointing at the stale old-address data.table. Studies that need more than 4096 column slots can bump via options(datatable.alloccol = 8192L) at the top of their generator script.
Other changes
- 6 existing process_skeletons test cases were rewritten to use the new $register_framework() idiom.
- Added tests/testthat/test-r6_skeleton.R (74 tests), tests/testthat/test-process_skeletons_incremental.R (47 tests), tests/testthat/test-entry_columns_parity.R (10 tests), and 56 new unit tests in test-registrystudy.R for the new methods.
- Total test count: 800 tests passing.
swereg 26.4.18
BREAKING CHANGES
- TTEPlan schema bump (1 → 2) and new constructor signature: tteplan_from_spec_and_registrystudy() now requires candidate_dir_spec, candidate_dir_tteplan, candidate_dir_results, and spec_version. The previous positional spec argument is removed; the spec YAML is now read from inside the resolved candidate_dir_spec using filename_spec(spec_version). Existing _plan.qs2 files error on load via check_version(); regenerate by re-running s0_init.R for each project.
- RegistryStudy schema bump (2 → 3): private .data_{rawbatch,skeleton,raw}_dir_candidates and _cache fields are replaced by public data_{rawbatch,skeleton,raw}_cp fields of type CandidatePath. Existing registry_study_meta.qs2 files error on load; regenerate by re-running run_generic_create_datasets_v2.R.
- Stub-free on-disk filenames. Project-scoped directories no longer need a {project_id}_ prefix on files. The rename is:
  - registry_study_meta.qs2 → registrystudy.qs2
  - {project_id}_plan.qs2 → tteplan.qs2
  - {project_id}_spec.xlsx → spec.xlsx
  - {project_id}_tables.xlsx → tables.xlsx
  - study_spec_vXXX.yaml → spec_vXXX.yaml

  Use dev/rename_r6_files.sh in the downstream MHT repo for the on-disk migration.
New features
- CandidatePath R6 class (R/r6_candidate_path.R). First-class representation of “a directory that lives at one of several candidate locations depending on host”. Owns its candidate list, its resolution cache, and its $resolve() / $invalidate() / $is_resolved() / $print() methods. Both RegistryStudy and TTEPlan now hold CandidatePath instances via public *_cp fields, so multi-host path resolution is structurally identical across classes and cannot drift.
- first_existing_path(candidates, label) (exported). Generic, study-agnostic “first existing path” picker lifted from the old .resolve_path(). Auto-creates the first candidate whose parent directory exists (unchanged from the old behavior).
- invalidate_candidate_paths(obj) (exported). Walks an R6 object’s public fields and calls $invalidate() on every CandidatePath it finds, recursing into embedded R6 objects. Called by RegistryStudy$save_meta() and TTEPlan$save() before serialization so on-disk files never carry the saving host’s cached resolved paths.
- tteplan_locate_and_load(candidate_dir_tteplan) (exported). Stage scripts (s1.R, s2.R, s3.R, s4_export.R) use this one-liner to load a tteplan.qs2 from the first candidate directory that exists.
- registrystudy_load(candidate_dir_rawbatch) (exported). Paired with tteplan_from_spec_and_registrystudy() so s0_init.R reads registrystudy.qs2 from the first rawbatch directory that exists on the current host.
- TTEPlan active bindings for owned paths: dir_tteplan, dir_spec, dir_results_base, dir_results (appends spec_version), tteplan, spec_path, spec_xlsx, tables_xlsx, plus data_skeleton and data_rawbatch that delegate to the embedded registrystudy (no duplication). Stage methods default to these bindings, so s1 / s2 / s3 no longer need an explicit output_dir =.
- Host-portable skeleton_files after load. tteplan_load() now refreshes plan$skeleton_files from plan$registrystudy$skeleton_files on every load, reapplying the n_skeleton_files_limit stored on the plan. A plan saved on one host and loaded on another immediately points at the current host’s skeleton files.
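The resolution rule described for first_existing_path() (return the first candidate that exists, otherwise create the first one whose parent directory exists) can be sketched as follows. first_existing_dir_sketch() is a hypothetical re-implementation written from that description, not the exported function, and its error behaviour is an assumption.

```r
# Sketch of the "first existing path" rule described above.
first_existing_dir_sketch <- function(candidates, label = "directory") {
  # 1. Prefer a candidate that already exists.
  for (p in candidates) {
    if (dir.exists(p)) return(p)
  }
  # 2. Otherwise create the first candidate whose parent exists.
  for (p in candidates) {
    if (dir.exists(dirname(p))) {
      dir.create(p)
      return(p)
    }
  }
  stop(sprintf("No usable candidate found for %s", label), call. = FALSE)
}

# Demo in a temporary directory: the first candidate's parent is missing
# (think: a drive not mounted on this host), the second one's parent exists.
root <- tempfile("candidates_")
dir.create(root)
candidates <- c(file.path(root, "unmounted", "data"),
                file.path(root, "data"))

resolved_first  <- first_existing_dir_sketch(candidates, "skeleton dir")
resolved_second <- first_existing_dir_sketch(candidates, "skeleton dir")
print(basename(dirname(resolved_first)) == basename(root))
```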
swereg 26.4.17
New features
- Custom Table 1 engine: swereg now ships its own Table 1 builder (.swereg_table1()), replacing the optional tableone dependency. TTEEnrollment$table1() returns a long-format data.table with class swereg_table1 and supports new arguments:
  - arm_labels: display labels for the two exposure arms
  - include_smd: toggle SMD column (defaults TRUE)
  - show_missing: annotate variable names with "(missing X.X%)" instead of emitting a separate Missing row (defaults TRUE). Percentages for non-missing levels are computed against the non-missing denominator, so they sum to 100 within each column. Multi-level categorical SMDs follow the Yang & Dalton (2012) generalisation.
- Forest plot for Table 3: the workbook’s Table 3 sheet is now a forest plot rendered with ggplot2. Supplemental Table S{n+1} keeps the full tabular IRR for all ETTs (per-protocol truncated). The forest plot is delivered as a high-resolution PNG (300 dpi) embedded in the worksheet, plus a vector PDF sidecar saved next to the workbook.
- CONSORT flowcharts: .write_consort() now renders a Graphviz DOT flowchart via DiagrammeR + DiagrammeRsvg + rsvg, embeds the PNG into the worksheet, and saves PNG/PDF sidecars next to the workbook. Falls back to the legacy text-table layout when the optional packages are missing.
- featured_etts argument on TTEPlan$export_tables(): filters Tables 2 and 3 to a user-specified subset of ETT ids; the supplementary tabular IRR remains unfiltered. An “Exposure definitions” legend block is written above each main table, and the _Exposed / _Unexposed column suffixes are rewritten to spec-derived arm labels when all featured ETTs share a single enrollment.
- TTEPlan$reload_spec(): refreshes cosmetic spec fields (study title, enrollment names, exposure-arm labels, outcome names, ETT descriptions) from a YAML spec on disk WITHOUT re-running the upstream pipeline. Structural changes (confounders, exclusion criteria, follow-up windows, matching parameters, etc.) are detected and reported via a loud warning but NOT applied; cached results stay bound to the old definitions. The new fields spec_reloaded_at and spec_reload_skipped_diffs are surfaced on the Provenance sheet.
- TTEPlan$recompute_baselines(): re-runs the new Table 1 engine on cached enrollment files in-process, used to refresh stale baseline tables after upgrading swereg without re-running $s3_analyze(). $export_tables() calls this lazily when it detects pre-26.4.17 cached Table 1 results.
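The show_missing convention of TTEEnrollment$table1() in this release (missingness goes into the label, level percentages use the non-missing denominator) can be shown with a few lines of base R. summarise_levels() is a hypothetical helper for illustration and says nothing about how .swereg_table1() is implemented.

```r
# Sketch: missingness as a label annotation, level percentages computed
# against the non-missing denominator so they sum to 100.
summarise_levels <- function(x) {
  miss_pct <- 100 * mean(is.na(x))
  counts   <- table(x, useNA = "no")
  list(
    label = sprintf("(missing %.1f%%)", miss_pct),
    pct   = 100 * counts / sum(counts)
  )
}

smoking <- c("never", "never", "former", NA, "former", "never", NA, "current")
res <- summarise_levels(smoking)
print(res$label)
print(round(res$pct, 1))
```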
Workbook output changes
- Supplementary baseline panels are renamed:
  - Raw → Unimputed and unweighted
  - Unweighted → Imputed and unweighted
  - IPW → Imputed and IPW
  - IPW Truncated → Imputed and IPW truncated
- The main Table 1 sheet hides the SMD column and missingness annotations; supplementary panels include both.
Breaking changes
- TTEEnrollment$table1() now returns a data.table (class c("swereg_table1", "data.table", "data.frame")) instead of a tableone S3 object. Code that introspected TableOne fields will need to read from the long-format columns instead.
- tableone is removed from Suggests. Cached results_enrollment lists produced by older versions are recognised and refreshed lazily on the next $export_tables() call (an output_dir is required for the refresh).
- ggplot2 is now in Imports (was Suggests) since the forest plot is a mandatory part of $export_tables().
- DiagrammeR, DiagrammeRsvg, and rsvg are added to Suggests. They are required for CONSORT flowcharts; without them $export_tables() silently falls back to the legacy text CONSORT.
swereg 26.4.16
Improvements
- setup_progress_handlers(): Pick handler_progress() format based on interactive(). In interactive sessions use \r-based single-line repaint (clear = TRUE, no trailing newline) so the bar updates in place like a normal terminal progress bar. In non-interactive sessions (RStudio background jobs, Rscript, CI) use a trailing \n with clear = FALSE so each step is a new line in the log.
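The interactive / non-interactive switch can be sketched as a small function that only builds the handler settings. progress_handler_args() is a hypothetical name, and the exact format string is an assumption pieced together from the strings quoted in the older entries of this changelog; the sketch does not call progressr itself.

```r
# Sketch: choose progress-bar settings from the session type.
progress_handler_args <- function(is_interactive = interactive()) {
  # Assumed format string; the one used by the package may differ.
  bar <- "[:bar] :current/:total (:percent) in :elapsedfull, eta: :eta (last: :message)"
  if (is_interactive) {
    # In-place repaint: no trailing newline, bar cleared when done.
    list(format = bar, clear = TRUE)
  } else {
    # Log-friendly: every update becomes its own line and stays visible.
    list(format = paste0(bar, "\n"), clear = FALSE)
  }
}

args_console <- progress_handler_args(is_interactive = TRUE)
args_job     <- progress_handler_args(is_interactive = FALSE)
str(args_job)
```

Settings like these would then be passed on to the progress handler; keeping the decision in a pure function makes it testable outside RStudio.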
swereg 26.4.15
Improvements
- RegistryStudy$process_skeletons(): Pass the current timestamp as the progress message (both sequential and parallel paths), matching the convention already used in parallel_pool(). The (last: :message) suffix in the setup_progress_handlers() format string now shows the clock time of the last completed batch (e.g. (last: 14:35:22)) so you can tell at a glance whether the job is making progress or frozen. Previously it called p() with no message, so (last: ) was always blank.
swereg 26.4.14
Bug Fixes
- setup_progress_handlers(): The real reason progress never showed up in RStudio background jobs: progressr silently suppresses reporting in non-interactive sessions unless you set options("progressr.enable" = TRUE). Background jobs have interactive() == FALSE, so the global handler was being installed correctly but progressor() calls were emitting no output. Now forces the option on. Also restores (last: :message) in the format so you can tell the bar isn’t frozen by watching the item label advance.
swereg 26.4.13
Improvements
- setup_progress_handlers(): Drop the handler_rstudio / rstudioapi branch entirely. Use handler_progress() with format = "[:bar] :current/:total (:percent) in :elapsedfull, eta: :eta\n" and clear = FALSE in every context, the same recipe as cs9::set_progressr and plnr::Plan$run_all. The trailing \n makes each update a new log line (instead of a \r repaint that job logs can’t render), and clear = FALSE keeps finished bars in the scrollback. Works in interactive R, RStudio’s foreground console, and RStudio background-job subprocesses without any detection logic or handler switching. handler_rstudio has been ineffective for the background-job case in this codebase.
swereg 26.4.12
Improvements
- setup_progress_handlers(): Collapse the RStudio-detection logic to a single rstudioapi::isAvailable(child_ok = TRUE) call. The child_ok = TRUE parameter handles both the foreground RStudio console and background-job subprocesses (via IPC to the parent session), so all the earlier hasFun / exists / isJob gymnastics were unnecessary. Also drops the now-redundant feature-test for jobAdd / jobSetProgress / jobRemove, since progressr::handler_rstudio requires those to exist anyway.
swereg 26.4.11
Bug Fixes
- setup_progress_handlers(): Stop using rstudioapi::hasFun("jobAdd") to feature-test for the job wrappers. hasFun() (a) short-circuits to FALSE whenever isAvailable() is FALSE, and (b) looks the name up in the internal rstudio namespace where the function is actually called addJob (the rstudioapi::jobAdd() wrapper forwards to callFun("addJob", ...)). So hasFun("jobAdd") returned FALSE even on systems where jobAdd() works fine, which caused the helper to fall through to the text handler every time. Now checks the rstudioapi wrapper namespace directly via exists("jobAdd", envir = asNamespace("rstudioapi"), mode = "function").
swereg 26.4.10
Bug Fixes
- setup_progress_handlers(): Fix detection of RStudio background-job subprocesses. Previously relied on rstudioapi::isAvailable(), which returns FALSE inside a jobRunScript() subprocess (because .Platform$GUI is not “RStudio” there), so scripts launched via Source as Background Job fell through to the text handler and no Jobs-pane progress bar appeared. Now also accepts rstudioapi::isJob() as a valid context; in job subprocesses, rstudioapi::callFun() auto-delegates jobAdd / jobSetProgress / jobRemove back to the parent RStudio session via IPC, so handler_rstudio works correctly.
swereg 26.4.9
New Features
- setup_progress_handlers(): Helper for run scripts. Feature-detects rstudioapi::jobAdd() and installs progressr::handler_rstudio() when available, else falls back to handler_progress(). Fixes the “no good progress bar” problem when launching run scripts via RStudio’s Source as Background Job menu: the default text bar renders badly in job logs, while the RStudio handler draws a proper Jobs-pane progress bar. Automatically covers every progressr-emitting method (process_skeletons, s1_*, s2_*, s3_*) with no per-method changes.
swereg 26.4.8
New Features
- RegistryStudy$compute_population(by): Compute a population denominator table from saved skeleton files. Counts unique persons by isoyear and user-specified structural variables (e.g. sex, age, register tag). Handles both annual and weekly skeleton rows via uniqueN(id). Produces a complete grid with all combinations (missing cells filled with zero). Saves result as population.qs2 in the skeleton directory.
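The counting rule for the population denominator (unique persons per cell, with empty cells kept as zero) can be sketched in base R. The toy skeleton and column choice are made up, and this is not how $compute_population() is implemented; it only reproduces the described result shape.

```r
# Toy skeleton with repeated (weekly) rows per person and year.
skeleton <- data.frame(
  id      = c(1, 1, 2, 3, 3, 3, 1),
  isoyear = c(2020, 2020, 2020, 2021, 2021, 2021, 2021),
  sex     = c("f", "f", "m", "f", "f", "f", "f")
)

# One row per person per cell, so weekly rows are not counted repeatedly.
persons <- unique(skeleton[, c("id", "isoyear", "sex")])

# table() yields the complete grid; combinations with no persons get 0.
population <- as.data.frame(
  table(isoyear = persons$isoyear, sex = persons$sex),
  responseName = "n", stringsAsFactors = FALSE
)
print(population)
```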
swereg 26.4.7
Improvements
- s3_analyze() now prints the output directory path and count of .qs2 files before processing.
- s3_analyze() gains ett_ids parameter to run only specific ETTs (e.g. ett_ids = "ETT01").
- Remove heterogeneity test (het_test) from s3_analyze(): a single call consumed 42GB RAM and 40+ minutes CPU on real data, making full runs infeasible.
- Remove het_slot parameter from tteenrollment_irr_combine() (no longer needed).
swereg 26.4.6
Bug Fixes
- Fix latent bug in parallel_pool(): Filter(Negate(is.null), results) removed NULLs and shifted indices, breaking positional item_map indexing in s3_analyze. Workers that fail already raise errors before producing output, so NULL results cannot occur.
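The index shift caused by dropping NULLs is easy to see on a three-item list. The values and the item_map vector below are made up for illustration.

```r
# Positional results for three work items; the second produced NULL.
item_map <- c("ETT01", "ETT02", "ETT03")
results  <- list("result for ETT01", NULL, "result for ETT03")

# Dropping NULLs shortens the list, so position 2 now holds ETT03's result.
filtered <- Filter(Negate(is.null), results)

print(length(filtered))
print(paste(item_map[2], "->", filtered[[2]]))
```

Any code that pairs filtered[[i]] with item_map[i] would silently attach ETT03's result to ETT02, which is why the filtering step was removed rather than patched.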
Internal
- Deduplicate .write_combined_rates() and .write_combined_irr() into shared helper .prepare_combine_data().
- Extract .build_code_lookup() helper for code_lookup + fmt_var construction, shared by print_spec_summary() and .write_spec_summary().
- Call parallel::detectCores() once at top of s3_analyze() instead of twice in the enrollment and ETT loops.
swereg 26.4.5
New Features
- $s3_analyze(): New Loop 3 method on TTEPlan that computes all analysis results (baseline characteristics, rates, IRR, heterogeneity tests) and stores them on the plan. Split into enrollment-level results (table1 variants) and ETT-level results (outcome-specific). Degenerate ETTs (GLM failure) are caught and stored with a skip reason instead of crashing. Progress bars via progressr.
- $results_summary(): Print diagnostic table showing event counts and IRR/rates status per ETT.
- $export_tables(path): Export all results to a multi-sheet Excel workbook: enrollment overview, ETT overview, Table 1 (chosen enrollment), Tables 2-3 (combined rates/IRR), per-enrollment combined baselines (4-panel: Raw/Unweighted/IPW/IPW Truncated), CONSORT attrition sheets, supplemental rates/IRR.
- tteplan_load(path): Load a TTEPlan from disk with the current class definition, ensuring new methods are available on old serialized objects.
- $s1_generate_enrollments_and_ipw(resume = TRUE) and $s2_generate_analysis_files_and_ipcw_pp(resume = TRUE) skip completed work based on file timestamps (must be <24h old).
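The resume = TRUE rule (skip work whose output file exists and is less than 24 hours old) amounts to a modification-time check. is_fresh_output() is a hypothetical helper sketched from that one-line description; the package may apply further conditions.

```r
# Sketch: an output counts as "done" only if it exists and is recent enough.
is_fresh_output <- function(path, max_age_hours = 24) {
  if (!file.exists(path)) return(FALSE)
  age_hours <- as.numeric(difftime(Sys.time(), file.mtime(path), units = "hours"))
  age_hours < max_age_hours
}

# Demo: a file written just now is fresh; back-dating it by 25 hours is not.
out_file <- tempfile(fileext = ".qs2")
writeLines("placeholder", out_file)
fresh_now <- is_fresh_output(out_file)

Sys.setFileTime(out_file, Sys.time() - 25 * 3600)
fresh_after_backdating <- is_fresh_output(out_file)
print(c(fresh_now, fresh_after_backdating))
```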
Bug Fixes
- Fix 'from' must be of length 1 crash in enroll() when a skeleton file has no enrolled persons for the current enrollment. data.table evaluates j once on 0-row data even with by, giving by-variables length 0 instead of scalar. This also produced spurious -Inf warnings from max(logical(0), na.rm = TRUE) in Phase B. Fix: short-circuit enroll() with an empty panel when entry_dt has 0 rows.
- IPCW GLM fallback: fit_and_predict() in s6_ipcw_pp now uses tryCatch; falls back to marginal censoring rate when the model fails (e.g. near-zero events).
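The IPCW fallback pattern (try the model, use the marginal censoring rate if fitting errors) can be sketched with a plain logistic GLM. fit_and_predict_sketch(), the column names and the toy data are hypothetical; the failure here is triggered by a single-level factor, which is just one convenient way to make glm() error.

```r
# Sketch: fitted censoring probabilities, with a marginal-rate fallback.
fit_and_predict_sketch <- function(data) {
  fit <- tryCatch(
    glm(censored ~ group, data = data, family = binomial()),
    error = function(e) NULL
  )
  if (is.null(fit)) {
    # Fallback: everyone gets the overall (marginal) censoring rate.
    return(rep(mean(data$censored), nrow(data)))
  }
  as.numeric(predict(fit, newdata = data, type = "response"))
}

# Model fits: predictions equal the per-group censoring rates.
d_ok <- data.frame(
  censored = c(0, 0, 0, 1, 1, 1, 0, 1),
  group    = factor(rep(c("a", "b"), each = 4))
)
p_ok <- fit_and_predict_sketch(d_ok)

# Model fails (factor with one level): marginal rate is used instead.
d_bad <- data.frame(censored = c(0, 0, 1, 0), group = factor(rep("a", 4)))
p_bad <- fit_and_predict_sketch(d_bad)
print(round(p_ok, 3))
print(p_bad)
```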
swereg 26.4.3
Breaking Changes
- parallel_pool() rewritten to use processx + qs2 tempfiles instead of future.callr. Worker logic moved to standalone R scripts in inst/ (worker_s1a.R, worker_s1b.R, worker_s2.R), launched via processx::process$new(). All data passes through qs2 files on disk instead of R’s IPC serialization, fixing the loop 1b bottleneck where enrolled_ids was serialized N times through pipe buffers. enrolled_ids is now written once to a shared tempfile. Dependencies future, future.apply, future.callr removed; processx added.
swereg 26.4.2
Breaking Changes
- parallel_pool() rewritten to use future.callr instead of persistent callr::r_session workers. Each work item now runs in a fresh R subprocess, eliminating deadlocks caused by accumulated IPC socket state. New dependencies: future, future.apply, future.callr. The processx dependency is removed. callr_kill_workers() is removed (no longer needed).
swereg 26.3.30
Improvements
- callr_pool() gains a timeout_minutes parameter (default: 30). If a work item runs longer than the timeout, its worker is killed and the item is retried once. If the retry also times out, callr_pool() calls stop(). Disable with timeout_minutes = NULL.
CRAN compliance
- Move mgcv from Imports to Suggests (only used conditionally via requireNamespace()).
- Add @importFrom for progressr and utils::getFromNamespace to satisfy NAMESPACE checks.
- Replace swereg::: calls with getFromNamespace() in callr worker sessions.
- Replace assign(..., globalenv()) with a package-level environment (.swereg_env).
- Add var <- NULL declarations for all data.table NSE variables.
- Add .vscode to .Rbuildignore.
swereg 26.3.23
Improvements
- callr_pool() workers now self-terminate if the parent R session dies (e.g. OOM kill). Each worker spawns a lightweight shell watchdog that polls the parent PID every 5 seconds. Previously, orphaned workers ran indefinitely until manually cleaned up via callr_kill_workers().
Bug Fixes
- Critical: .s1_eligible_tuples() used first(rd_exposed) to classify exposure at each trial period, which only detected MHT initiation if it happened on the first week of a 4-week trial period. With period_width = 4, ~75% of exposed people start MHT mid-period and were silently dropped: their first trial period showed them as unexposed (week 1 was pre-initiation), and the next period excluded them for prior MHT. Fixed by using any(rd_exposed, na.rm = TRUE) instead. The existing no_prior_exposure exclusion correctly handles the new-user restriction. Verified: eligible exposed count on skeleton_001 went from 19 → 84, matching the old per-week pipeline.
- .s1_compute_attrition(): exposure classification now uses any() per person-trial instead of checking the first eligible row. Aligns attrition reporting with the any() fix in .s1_eligible_tuples(); previously the attrition flow underreported exposed counts by ~4x.
- tteplan_validate_spec(): missing variables (confounders, outcomes, exclusion criteria, exposure) now stop() instead of warning(). Previously, a misspelled or renamed variable would silently pass validation and break downstream (e.g. IPW model missing a confounder). Category mismatches (values in spec but not data) remain as warnings since they can occur in small batches.
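The difference between first() and any() for a 4-week trial period shows up as soon as initiation falls after week 1. A base R sketch on made-up data (two persons, one period each); the real code works on data.table groups, which this does not reproduce.

```r
# Weekly exposure flags within one 4-week trial period.
weeks <- data.frame(
  id = rep(c(1, 2), each = 4),
  rd_exposed = c(TRUE,  TRUE,  TRUE, TRUE,   # person 1 initiates in week 1
                 FALSE, FALSE, TRUE, TRUE)   # person 2 initiates in week 3
)
by_person <- split(weeks$rd_exposed, weeks$id)

# Old rule: classify by the first week only.
exposed_first <- vapply(by_person, function(x) x[1], logical(1))
# New rule: exposed if any week in the period is exposed.
exposed_any <- vapply(by_person, function(x) any(x, na.rm = TRUE), logical(1))

print(rbind(first = exposed_first, any = exposed_any))
```

Person 2 is classified as unexposed under the first-week rule and as exposed under any(), which is the group the entry describes as having been silently dropped.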
swereg 26.3.20
Bug Fixes
- .s1_compute_attrition(): fix undercounting of person-trials for row-level eligibility criteria (e.g. eligible_valid_exposure). The old code checked only the first row per person-trial, missing cases where exposure onset occurred after the first week. The new approach filters to eligible rows first, then counts, matching the logic used by .s1_eligible_tuples().
- .s1_compute_attrition(): fix negative exposed/comparator deltas in participant flow. The before_exclusions baseline now classifies exposure from the first row with non-NA exposure per person-trial, rather than the first overall row (which often has rd_exposed = NA). Total person-trial counts remain unfiltered.
Performance
- TTE s1 pipeline: add data.table::setkey() calls to eliminate redundant hash-based grouping. Skeleton reads in .s1_prepare_skeleton() and .s1b_worker() now set key on (id, isoyearweek) (metadata-only, no re-sort). enroll() Phase B collapse uses keyed grouping on (pid, trial_id), and Phase D panel expansion uses keyed binary join instead of merge().
Bug Fixes
- callr_pool() PID files now written to /tmp instead of tempdir() so that orphaned workers from crashed R sessions can be discovered and cleaned up by new sessions.
- callr_kill_workers() simplified to orphan-only cleanup: kills workers whose parent R process is dead and removes stale PID files. Own-session cleanup is already handled by callr_pool()’s on.exit() handler; this function is only needed after hard crashes (SIGKILL, OOM).
Performance
- callr_pool() now uses persistent callr::r_session workers instead of spawning a fresh callr::r_bg() process per work item. The swereg namespace is loaded once per worker slot rather than once per item, eliminating redundant startup overhead when scaling to large numbers of items.
- Orphan protection: callr_pool() writes a PID file per invocation and cleans up orphaned worker sessions from previous crashed runs (e.g. OOM kills) on the next invocation.
Bug Fixes
- Fixed 3 test failures in test-tte_spec.R caused by s1 pipeline changes: added missing rd_exposed column to .s1_compute_attrition test fixtures, added n_exposed / n_unexposed to mock attrition data, and updated matching output expectations.
Performance
- s1_generate_enrollments_and_ipw() now caches prepared skeletons between s1a (scout) and s1b (enrollment) passes, eliminating redundant file reads and exclusion processing. Expected ~30-40% reduction in per-enrollment wall-clock time.
- .s1b_worker() now subsets the skeleton to enrolled persons before computing derived confounders, avoiding expensive rolling-window operations on non-enrolled persons.
- TTEEnrollment$new() accepts own_data = TRUE to skip the defensive data.table::copy() when the caller will not reuse the data. Used in .s1b_worker() where the skeleton is discarded immediately after.
- enroll() Phase B now aggregates confounders, time-exposure, and outcome columns in a single groupby pass instead of four separate passes with merges.
Improvements
- “Valid exposure” (eligible_valid_exposure) is now the first exclusion criterion in the TTE attrition flow. Rows where rd_exposed is NA are explicitly accounted for rather than silently disappearing between the before-exclusions total and the first real criterion.
- TARGET Item 8 (participant flow) now shows a richer flow diagram with before-exclusion counts, per-step exposed/unexposed breakdown, delta (excluded) and remaining counts at each criterion, right-justified aligned columns, and color-coded output (red for exclusions, cyan for remaining). The post-matching line is also reformatted with an arrow indicator. The “Before exclusions” line no longer shows a meaningless exposed/comparator breakdown.
- enrollment_counts$attrition now includes n_exposed and n_unexposed columns and a "before_exclusions" row.
Bug Fixes
- Fixed trial_id missing error caused by attr<- breaking data.table’s internal self-reference. Replaced with data.table::setattr() in .s1_prepare_skeleton() and tteplan_apply_exclusions() to preserve in-place modification semantics.
- Fixed callr worker stale-namespace bug: after devtools::load_all() in a subprocess, worker functions still referenced the old (installed) swereg namespace. Now rebinds the worker function’s environment to the freshly-loaded namespace.
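The setattr() fix can be illustrated on a toy table. This sketch only demonstrates the by-reference behaviour of data.table::setattr(); the attribute name `source_file` and its value are hypothetical.

```r
library(data.table)

dt <- data.table(id = 1:3)
addr_before <- address(dt)

# setattr() attaches the attribute by reference: no copy is made, so
# data.table's internal self-reference stays valid.
setattr(dt, "source_file", "skeleton_001.qs2")   # hypothetical attribute

# A later by-reference column add operates on the same object.
dt[, trial_id := id * 10L]

addr_after <- address(dt)
```

The object address is unchanged after both operations, which is the property that `:=` relies on.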
Improvements
- Reorganized print_spec_summary() header layout: renamed “Study created” → “RegistryStudy”, merged “Skeletons created” + “Skeleton files” into a single nested line with tree connector, renamed “Plan created” → “TTEPlan”, and reordered to follow data pipeline order.
- Rewrote TARGET checklist items 6c, 6h, and 7a-h in print_target_checklist() as academic prose suitable for copy-pasting into a methods section. Item 6c now dynamically reflects per-enrollment matching ratios from the spec.
Breaking changes
- enrollment_counts structure changed: Each element of TTEPlan$enrollment_counts is now a list with $attrition and $matching sub-elements (was a single data.table). Code accessing plan$enrollment_counts[["01"]] directly as a data.table must update to plan$enrollment_counts[["01"]]$matching.
- person_trial_id renamed to enrollment_person_trial_id: The composite key column now has a 3-part name matching its 3-part format (enrollment_id.person_id.trial_id). All code referencing person_trial_id must be updated.
- process_fn parameter removed from $s1_generate_enrollments_and_ipw(): The two-pass spec-driven pipeline is now the only code path. self$spec is required (create plans with tteplan_from_spec_and_registrystudy()). The legacy single-pass .s1_worker() has been deleted.
- .s2_worker() renamed to .s3_worker(): Internal Loop 2 IPCW-PP worker renamed to avoid confusion with the two-pass Loop 1 pipeline.
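The enrollment_counts change amounts to one extra level of indexing. The sketch below uses a plain list as a stand-in for a TTEPlan object; the numbers and the columns of the `matching` table are invented, and data.frame is used only to keep the example dependency-free.

```r
# Mock of the new structure (a plain list standing in for a TTEPlan).
plan <- list(
  enrollment_counts = list(
    "01" = list(
      attrition = data.frame(trial_id = 1L, criterion = "before_exclusions",
                             n_persons = 100L, n_person_trials = 100L),
      matching  = data.frame(trial_id = 1L, n_total = 100L, n_enrolled = 30L)
    )
  )
)

# Old code: counts <- plan$enrollment_counts[["01"]]   # was a single table
# New code: pick the sub-element explicitly.
matching  <- plan$enrollment_counts[["01"]]$matching
attrition <- plan$enrollment_counts[["01"]]$attrition
```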
New features
- Two-pass enrollment pipeline: $s1_generate_enrollments_and_ipw() now uses a two-pass pipeline that fixes cross-batch matching ratio imbalance:
  - Pass 1a (scout): Lightweight parallel pass collecting eligible (person_id, trial_id, exposed) tuples from all batches.
  - Centralized matching: Combines all tuples and performs per-trial_id matching globally, ensuring the correct ratio across all batches.
  - Pass 1b (full enrollment): Parallel pass using pre-matched IDs to enroll without per-batch matching.
- enrollment_counts on TTEPlan: New field storing per-trial matching counts (total vs enrolled, exposed vs unexposed) for TARGET Item 8 reporting.
- .assign_trial_ids(): New shared helper function that is the single source of truth for isoyearweek -> trial_id mapping. Used consistently by both scout (s1a) and enrollment (s1b/enroll) phases.
- enrolled_ids parameter on TTEEnrollment$new(): When provided, enrollment skips the matching phase and uses pre-decided IDs directly, enabling the two-pass pipeline.
- Per-criterion attrition counts for TARGET Item 8: The scout pass (s1a) now computes cumulative person and person-trial counts at each eligibility step. Stored in plan$enrollment_counts[["01"]]$attrition as a long-format data.table with columns trial_id, criterion, n_persons, n_person_trials. $print_target_checklist() Item 8 auto-populates with these counts when available.
swereg 26.3.21
New features
- $heterogeneity_test(): New method on TTEEnrollment that tests for heterogeneity of treatment effects across trials via a Wald test on the trial_id × exposure interaction (Hernán 2008, Danaei 2013).
- $print_target_checklist(): New method on TTEPlan that generates a self-contained TARGET Statement (Cashin et al., JAMA 2025) 21-item reporting checklist. Auto-populates items from the study spec and provides [FILL IN] placeholders for PI completion.
Improvements
- $irr() calendar-time adjustment: Outcome model now includes trial_id as a covariate to adjust for calendar-time variation in outcome rates across enrollment bands (Caniglia 2023, Danaei 2013). Uses ns(trial_id, df=3) for ≥5 unique trial IDs, linear term for 2-4, omitted for 1.
- $irr() IPW-only guard: $irr() now rejects IPW-only weight columns (ipw, ipw_trunc) after per-protocol censoring has been applied. The swereg pipeline applies per-protocol censoring in $s4_prepare_for_analysis(), so only per-protocol weights (analysis_weight_pp_trunc) are valid for the censored dataset.
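The three-way rule for the trial_id term can be sketched as a small formula builder. This is a hedged illustration of the rule as stated above, not the code inside $irr(); the function names and the `event` and `exposed` variables are hypothetical.

```r
# Choose how trial_id enters the model from the number of distinct trials.
trial_term <- function(trial_id) {
  n_trials <- length(unique(trial_id))
  if (n_trials >= 5) {
    "splines::ns(trial_id, df = 3)"   # flexible calendar-time trend
  } else if (n_trials >= 2) {
    "trial_id"                        # linear term
  } else {
    NULL                              # single trial: nothing to adjust for
  }
}

# Assemble an outcome-model formula with the chosen term (if any).
build_formula <- function(trial_id) {
  terms <- c("exposed", trial_term(trial_id))
  reformulate(terms, response = "event")
}

f_many <- build_formula(1:6)   # spline term
f_few  <- build_formula(1:3)   # linear term
f_one  <- build_formula(1)     # no trial_id term
```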
Documentation
- Methodology vignette: New vignette("tte-methodology") maps the swereg TTE implementation to five reference papers (Hernán 2008/2016, Danaei 2013, Caniglia 2023, Cashin 2025). Documents which methods are implemented, which are not, and design rationale.
- Analysis types: vignette("tte-nomenclature") now documents that swereg supports per-protocol analysis only. ITT analysis is not supported because the pipeline censors at protocol deviation. As-treated analysis requires time-varying IPW (not implemented).
- period_width documentation: vignette("tte-nomenclature") now explains the enrollment band width / residual immortal time bias trade-off, citing Caniglia (2023) and Hernán (2016).
- Matching approach: vignette("tte-nomenclature") now documents the per-band stratified matching design choice and alternatives from the literature.
- $s2_ipw() documentation: Clarified that IPW estimates the propensity score for baseline treatment assignment only, not time-varying treatment weights.
- $irr() documentation: Documented IRR ≈ HR for rare events, ns(tstop) for flexible baseline hazard, quasipoisson for overdispersion, and computational equivalence to pooled logistic regression.
- IPCW stabilization: Documented the simplified marginal stabilization approach and its relationship to Danaei (2013).
swereg 26.3.20
Improvements
- Band-based enrollment: Added explicit isoyearweek ordering before band-level collapse to prevent silent misclassification when input data is not pre-sorted by time.
- IPCW-PP: Censoring model now includes trial_id to account for calendar-time variation in censoring patterns across enrollment bands.
- person_weeks: Now computed from actual source row counts during band collapse instead of hardcoded period_width. Partial-coverage bands (e.g., at data boundaries) now contribute accurate person-time.
Breaking changes
- $irr(): Removed the constant (no time adjustment) Poisson model. Only the flexible model with natural splines (splines::ns(tstop, df=3)) is retained. Output columns renamed: IRR_flex → IRR, IRR_flex_lower → IRR_lower, IRR_flex_upper → IRR_upper, IRR_flex_pvalue → IRR_pvalue, warn_flex → warn. All IRR_const* and warn_const columns removed.
- tteenrollment_irr_combine(): Updated to match new $irr() output. Columns renamed: IRR (flexible) → IRR, 95% CI (flexible) → 95% CI, p (flexible) → p. Constant-model columns removed.
- TTE ID semantics: The composite person-per-trial identifier column is now called person_trial_id (was trial_id). The actual trial identifier (the enrollment band) is now exposed as trial_id in enrollment output. This fixes the semantics so trial_id means the trial and person_trial_id identifies a person’s participation in a trial.
- TTEDesign default: id_var default changed from "trial_id" to "person_trial_id".
- s1_impute_confounders(): No longer hardcodes trial_id; uses design$id_var throughout.
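Downstream scripts that stored old $irr() output can be migrated with a small rename helper. The sketch below uses only the column names listed above; the helper name `migrate_irr` and the example values are hypothetical, and the package itself provides no such helper as far as this changelog shows.

```r
# Old-style result table (values invented for illustration).
old <- data.frame(IRR_flex = 1.2, IRR_flex_lower = 0.9, IRR_flex_upper = 1.6,
                  IRR_flex_pvalue = 0.21, warn_flex = FALSE,
                  IRR_const = 1.1, warn_const = FALSE)

# Old name -> new name, as listed in the changelog entry.
rename_map <- c(IRR_flex = "IRR", IRR_flex_lower = "IRR_lower",
                IRR_flex_upper = "IRR_upper", IRR_flex_pvalue = "IRR_pvalue",
                warn_flex = "warn")

migrate_irr <- function(df) {
  # Drop constant-model columns, which no longer exist.
  df <- df[, !grepl("^IRR_const|^warn_const$", names(df)), drop = FALSE]
  # Rename the flexible-model columns that are present.
  hit <- names(df) %in% names(rename_map)
  names(df)[hit] <- unname(rename_map[names(df)[hit]])
  df
}

new <- migrate_irr(old)
```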
Code quality
- Rename private methods prepare_outcome and ipcw_pp to s5_prepare_outcome and s6_ipcw_pp to signal their execution order within s4_prepare_for_analysis().
- Reorder TTEEnrollment public step methods to match their numeric sequence (s1 before s2).
Breaking changes
- Band-based enrollment: TTEEnrollment enrollment now uses N-week bands (controlled by period_width in TTEDesign, default 4). Calendar time is grouped into bands based on isoyearweek, matching is done per-band (stratified), and data is collapsed to band level during enrollment. This eliminates the separate $s1_collapse() step entirely.
- Step renumbering: Public workflow methods on TTEEnrollment have been renumbered after removing $s1_collapse():
  - $s2_impute_confounders() -> $s1_impute_confounders()
  - $s3_ipw() -> $s2_ipw()
  - $s4_truncate_weights() -> $s3_truncate_weights()
  - $s5_prepare_for_analysis() -> $s4_prepare_for_analysis()
- period_width parameter: Moved from TTEPlan$s1_generate_enrollments_and_ipw() to TTEDesign$new(period_width = 4L). Now part of the design contract.
- isoyearweek column required: Band-based enrollment requires an isoyearweek column in person-week data.
- Schema version bump: TTEDesign and TTEEnrollment schema versions bumped to 2. Objects saved with version 1 will warn on load.
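Grouping weeks into N-week bands can be pictured with a short sketch. How the package actually derives bands is not shown in this changelog, so everything here is an assumption: the helper name `assign_band`, the zero-padded "YYYY-WW" string format, and the rule that bands are consecutive runs of the weeks observed in the data (so gaps in the data are ignored).

```r
assign_band <- function(isoyearweek, period_width = 4L) {
  # Assumption: zero-padded "YYYY-WW" strings sort chronologically as text.
  weeks <- sort(unique(isoyearweek))
  week_index <- match(isoyearweek, weeks) - 1L   # 0-based position in time
  week_index %/% period_width                    # consecutive N-week bands
}

weeks <- sprintf("2020-%02d", 1:10)
bands <- assign_band(weeks, period_width = 4L)

# Rows per band: the final band is only partially covered.
rows_per_band <- table(bands)
```

With ten weeks and a width of 4, the last band holds two weeks rather than four, which is the kind of partial-coverage band that makes a hardcoded person-time of period_width inaccurate.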
New features
- TTEPlan provenance timestamps: TTEPlan now tracks created_at (stamped at construction), registry_study_created_at (from the source RegistryStudy), and skeleton_created_at (from the first skeleton file’s attribute). All three timestamps are shown in print() and print_spec_summary() when available, making it easy to detect stale plans.
- R6 schema versioning: All R6 classes (RegistryStudy, TTEPlan, TTEDesign, TTEEnrollment) now carry a .schema_version private field, stamped at construction time. A new $check_version() public method compares the stored version against the current class definition and warns when stale. qs2_read() automatically calls $check_version() on R6 objects after loading, so outdated serialized objects produce a clear warning instead of silently breaking.
- Deprecation warnings for old add_* parameter names: add_diagnoses(diags=), add_operations(ops=), add_rx(rxs=), add_icdo3s(icdo3s=), add_snomed3s(snomed3s=), and add_snomedo10s(snomedo10s=) now emit a deprecation warning when the old parameter name is used. Use codes= instead.
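A deprecation shim of this kind can be sketched in a few lines of base R. The function below is a hypothetical stand-in, not add_diagnoses() itself: it accepts both the new `codes` argument and the old `diags` argument, warns on the old one, and then proceeds with a placeholder body.

```r
add_codes_sketch <- function(data, codes = NULL, diags = NULL) {
  if (!is.null(diags)) {
    # Old name still works, but tells the caller to switch.
    warning("`diags` is deprecated; use `codes` instead.", call. = FALSE)
    if (is.null(codes)) codes <- diags
  }
  if (is.null(codes)) stop("`codes` must be supplied.")
  # Placeholder for the real work: report which code groups were given.
  names(codes)
}

# New-style call: no warning.
groups_new <- add_codes_sketch(NULL, codes = list(depression = "F32"))

# Old-style call: same result, plus a deprecation warning.
groups_old <- suppressWarnings(
  add_codes_sketch(NULL, diags = list(depression = "F32"))
)
```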
Breaking changes
- RegistryStudy: register_codes() now takes a declarative signature: register_codes(codes, fn, groups, fn_args, combine_as). Each call declares codes, the function to apply them, which data groups to use, and optional prefix/combine behavior. The old per-type fields (icd10_codes, rx_atc_codes, rx_produkt_codes, operation_codes, icdo3_codes) and the old register_codes(icd10_codes = ...) signature are removed. The single code_registry list field replaces them.
- summary_table(): The type parameter is removed. The type column is replaced by label. Use label to filter.
- add_diagnoses(), add_operations(), add_rx(), add_icdo3s(), add_snomed3s(), add_snomedo10s(): The codes parameter is renamed to codes (was diags, ops, rxs, icdo3s, snomed3s, snomedo10s). Old parameter names still work for backwards compatibility.
Refactoring
- Moved qs2_read() to its own file (R/qs2.R) and inlined the fallback logic directly. Removed pointless .qs_save wrapper (replaced with direct qs2::qs_save calls) and .qs_read internal helper.
Breaking changes
- skeleton_save() no longer splits batches into sub-files. It saves one file per batch as skeleton_NNN.qs2 (was skeleton_NNN_SS.qs2). The ids_per_file and id_col parameters have been removed.
- RegistryStudy: batch_sizes parameter (integer vector) replaced with batch_size (single integer, default 1000). The ids_per_skeleton_file parameter has been removed. All batches are now uniform size.
swereg 26.3.21
Breaking changes
- RENAMED: Standalone TTE functions renamed to signal which class they operate on:
  - tte_rbind() → tteenrollment_rbind()
  - tte_rates_combine() → tteenrollment_rates_combine()
  - tte_irr_combine() → tteenrollment_irr_combine()
  - tte_impute_confounders() → tteenrollment_impute_confounders()
  - tte_read_spec() → tteplan_read_spec()
  - tte_apply_exclusions() → tteplan_apply_exclusions()
  - tte_apply_derived_confounders() → tteplan_apply_derived_confounders()
  - tte_validate_spec() → tteplan_validate_spec()
  - tte_plan_from_spec_and_registrystudy() → tteplan_from_spec_and_registrystudy()
  - tte_callr_pool() → callr_pool()
- RENAMED: Eligibility helpers renamed from tte_eligible_* to skeleton_eligible_* to reflect that they operate on skeleton data.tables, not TTE classes:
  - tte_eligible_isoyears() → skeleton_eligible_isoyears()
  - tte_eligible_age_range() → skeleton_eligible_age_range()
  - tte_eligible_no_events_in_window_excluding_wk0() → skeleton_eligible_no_events_in_window_excluding_wk0()
  - tte_eligible_no_observation_in_window_excluding_wk0() → skeleton_eligible_no_observation_in_window_excluding_wk0()
  - tte_eligible_no_events_lifetime_before_and_after_baseline() → skeleton_eligible_no_events_lifetime_before_and_after_baseline()
  - tte_eligible_combine() → skeleton_eligible_combine()
File reorganization
- RENAMED: R/tte_enrollment_r6.R → R/r6_tteenrollment.R
- RENAMED: R/tte_plan_r6.R → R/r6_tteplan.R
- RENAMED: R/registry_study_r6.R → R/r6_registry_study.R
- EXTRACTED: callr_pool() to its own file R/callr_pool.R
- MOVED: Eligibility helpers to R/skeleton_utils.R
- MOVED: tteenrollment_impute_confounders() to R/r6_tteenrollment.R
swereg 26.3.20
Breaking changes
- RENAMED: TTEEnrollment public workflow methods now have step-number prefixes to signal execution order:
  - $collapse() → $s1_collapse()
  - $impute_confounders() → $s2_impute_confounders()
  - $ipw() → $s3_ipw()
  - $truncate() → $s4_truncate_weights()
  - $prepare_for_analysis() → $s5_prepare_for_analysis()
- RENAMED: $s4_truncate() → $s4_truncate_weights() for clarity.
- RENAMED: TTEPlan orchestration methods now have step-number prefixes:
  - $generate_enrollments_and_ipw() → $s1_generate_enrollments_and_ipw()
  - $generate_analysis_files_and_ipcw_pp() → $s2_generate_analysis_files_and_ipcw_pp()
- RENAMED: Internal worker functions for consistent naming:
  - .tte_process_skeleton() → .s1_worker()
  - .loop2_worker() → .s2_worker()
- REMOVED: Constructor wrapper functions tte_design(), tte_enrollment(), and tte_plan(). Use TTEDesign$new(), TTEEnrollment$new(), and TTEPlan$new() directly. The auto-detection and data-copy logic from tte_enrollment() has been moved into TTEEnrollment$new().
Improvements
- REFACTOR: Inlined 5 of 6 private helper methods into their single callers on TTEEnrollment (.calculate_ipw, .calculate_ipcw, .combine_weights_fn, .match_ratio, .collapse_periods). Kept .truncate_weights as private (used in 2 places). Reduces indirection for stateless methods that don’t use self.
- TESTS: Rewrote test-tte_weights.R to test through public API ($s1_collapse(), $s3_ipw(), $s4_truncate(), tte_enrollment(ratio=)) instead of accessing inlined private methods.
swereg 26.3.20
Improvements
- REFACTOR: Inlined 6 weight/matching functions as private methods on TTEEnrollment (tte_truncate_weights, tte_calculate_ipw, tte_calculate_ipcw, tte_combine_weights, tte_match_ratio, tte_collapse_periods). Removed 2 orphaned functions (tte_identify_censoring, tte_time_to_event). Users access this functionality through R6 methods ($collapse, $ipw, $truncate, etc.).
- REFACTOR: Consolidated TTE source files from 7 to 2 (+1 rename):
  - tte_design.R + tte_enrollment.R + tte_weights.R merged into tte_enrollment_r6.R (TTEDesign + TTEEnrollment + all weight/matching functions called by their methods)
  - tte_plan.R + tte_spec.R + tte_eligibility.R merged into tte_plan_r6.R (TTEPlan + spec functions + eligibility helpers)
  - registry_study.R renamed to registry_study_r6.R
  - Files containing R6 classes now have _r6 suffix for discoverability
- REORDER: TTEEnrollment public methods now follow workflow execution order: collapse -> ipw -> impute_confounders -> truncate -> prepare_for_analysis -> extract/summary/diagnostics -> analysis output.
- DOCS: Added inline comments documenting data flow in generate_enrollments_and_ipw() (Loop 1), .tte_process_skeleton(), private$enroll(), enrollment_spec(), and add_one_ett().
swereg 26.3.18
Improvements
- MHT: Added rd_approach3b_{single,multiple} exposure variables that collapse estrogen_progesterone_bioidentical and estrogen_progesterone_synthetic into a single estrogen_progesterone level. Derived by relabeling the finished approach3 columns, which is valid because switching between active MHT types never triggers “previous”.
- MHT: x2026_mht_add_lmed() now creates exposure variables (rd_approach{1,2,3}_{single,multiple}) internally via the new internal helper x2026_mht_create_exposure_variables(). This consolidates all MHT LMED logic in the package, eliminating the need for a separate step 14 in external workflow scripts.
- MHT: Removed 18 sensitivity columns (*_sensitivity_60p, *_sensitivity_under60censorallat60, *_sensitivity_under60censorrefat65) from x2026_mht_create_exposure_variables(). These had a logic issue where local_or_none_mht rows at age >= 65 produced NA instead of FALSE. The rd_age_continuous column is no longer required as input.
swereg 26.2.22
New features
- EXPORTED: tte_callr_pool(): generic callr::r_bg() worker pool, generalized from the internal .tte_callr_pool(). New API accepts items (list of arg-lists), worker_fn, item_labels, and collect (FALSE to discard results when workers save directly). Eliminates boilerplate when scripts need their own parallel loops (e.g., Loop 2 IPCW-PP).
- NEW: TTEPlan$generate_analysis_files_and_ipcw_pp(): Loop 2 method that runs per-ETT IPCW-PP calculation and saves analysis-ready files. Mirrors $generate_enrollments_and_ipw() (Loop 1). Parameters: output_dir, estimate_ipcw_pp_separately_by_exposure, estimate_ipcw_pp_with_gam, n_workers, swereg_dev_path.
Improvements
- MEMORY: tte_calculate_ipcw() now uses mgcv::bam(discrete = TRUE) instead of mgcv::gam() when use_gam = TRUE. bam() discretizes covariates to avoid forming the full model matrix, dramatically reducing peak memory for large datasets. Model objects are also explicitly freed (rm() + gc()) between exposed/unexposed fits.
- MEMORY: $irr() and $km() now subset to only the columns needed before creating survey::svydesign(). Previously the full data.table (all columns) was copied into the survey object. Model objects and intermediate data are freed between fits.
swereg 26.2.21
Breaking changes
- RENAMED: $prepare_for_analysis() parameters estimate_ipcw_separately_by_exposure → estimate_ipcw_pp_separately_by_exposure and estimate_ipcw_with_gam → estimate_ipcw_pp_with_gam for consistency with the IPCW-PP method they control.
- PRIVATE: $enroll(), $prepare_outcome(), $ipcw_pp(), and $combine_weights() are now private methods on TTEEnrollment.
  - Enrollment: use tte_enrollment(data, design, ratio = 2, seed = 4) instead of tte_enrollment(data, design)$enroll(ratio = 2, seed = 4).
  - Outcome prep + IPCW: use $prepare_for_analysis() (unchanged).
  - Weight combination: handled automatically by $ipcw_pp() (unchanged).
  - Tests can access private methods via enrollment$.__enclos_env__$private$method_name().
swereg 26.2.20
Breaking changes
- RENAMED: $prepare_analysis() → $prepare_for_analysis() on TTEEnrollment. The new name better communicates that this method prepares the enrollment for analysis (it is not the analysis itself).
Bug fixes
- FIXED: 3 remaining broken test calls (tte_extract(), tte_summary(), tte_weights()) migrated to R6 method syntax ($extract(), print(), $combine_weights()). Column assertion updated: "weight_pp" → "analysis_weight_pp".
- FIXED: $impute_confounders() now appends "impute" to steps_completed, consistent with all other mutating methods.
- FIXED: $ipcw_pp() IPW column guard moved from after IPCW computation to before it (fail-fast).
Documentation
- FIXED: Vignette truncation bounds corrected from “0.5th and 99.5th percentiles” to “1st and 99th percentiles” (matching code defaults lower = 0.01, upper = 0.99).
- FIXED: TTEDesign roxygen references to removed tte_match()/tte_expand() replaced with $enroll().
- FIXED: $weight_summary() moved from “Mutating” to “Non-mutating” section in TTEEnrollment roxygen (it only prints, never modifies data).
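Truncation at the 1st and 99th percentiles can be sketched in base R. This is an illustration of the technique with the defaults quoted above (lower = 0.01, upper = 0.99), not the package's truncation method; the function name and the simulated weights are hypothetical.

```r
truncate_weights_sketch <- function(w, lower = 0.01, upper = 0.99) {
  # Empirical quantiles of the weights define the truncation bounds.
  bounds <- quantile(w, probs = c(lower, upper), na.rm = TRUE, names = FALSE)
  # Clamp every weight into [lower bound, upper bound].
  pmin(pmax(w, bounds[1]), bounds[2])
}

set.seed(1)
w <- rlnorm(1000)                 # skewed toy weights
w_trunc <- truncate_weights_sketch(w)
bounds <- quantile(w, probs = c(0.01, 0.99), names = FALSE)
```

Weights inside the bounds are left untouched; only the extreme tails are pulled in, which trades a small bias for lower variance.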
swereg 26.2.13
New features
- NEW: $prepare_for_analysis() method on TTEEnrollment merges $prepare_outcome() + $ipcw_pp() into one step. Parameters: outcome, follow_up, separate_by_exposure, use_gam, censoring_var.
- NEW: $enrollment_stage active binding on TTEEnrollment. Derives lifecycle stage from existing state: "pre_enrollment" → "enrolled" → "analysis_ready". Zero maintenance: it reads data_level and steps_completed.
swereg 26.2.11
Breaking changes
- REMOVED: 19 standalone TTE functions moved to R6 methods on TTETrial (15 methods) and TTEPlan (4 methods). Pipe chaining (trial |> tte_ipw()) replaced with $-chaining (trial$ipw()).
  TTETrial methods: $enroll(), $collapse(), $ipw(), $ipcw_pp(), $combine_weights(), $truncate(), $prepare_outcome(), $impute_confounders(), $weight_summary(), $extract(), $summary(), $table1(), $rates(), $irr(), $km().
  TTEPlan methods: $add_one_ett(), $save(), $enrollment_spec(), $generate_enrollments_and_ipw().
- RENAMED: TTEPlan$task() → TTEPlan$enrollment_spec(). The method returns enrollment metadata (design, enrollment_id, age_range, n_threads), not a generic task. The process_fn callback parameter convention changes from function(task, file_path) to function(enrollment_spec, file_path).
  Removed exports: tte_enroll, tte_collapse, tte_ipw, tte_ipcw_pp, tte_weights, tte_truncate, tte_prepare_outcome, tte_extract, tte_summary, tte_weight_summary, tte_table1, tte_rates, tte_irr, tte_km, tte_plan_add_one_ett, tte_plan_save, tte_plan_task, tte_generate_enrollments_and_ipw.
  Kept standalone: tte_rbind(), tte_rates_combine(), tte_irr_combine(), tte_impute_confounders() (thin wrapper for callback default).
- CHANGED: TTE classes (TTEDesign, TTETrial, TTEPlan) migrated from S7 to R6. Property access changes from @ to $ (e.g., trial@data → trial$data, design@id_var → design$id_var). R6 reference semantics eliminate copy-on-write overhead from trial$data[, := ...], reducing peak RAM from ~3X to ~2X during the weight-calculation chain (Loop 2).
- FIXED: Three S7 @ accessor bugs that silently produced no-ops:
  - $ipcw_pp(): dropping intermediate IPCW columns (p_censor, etc.)
  - $collapse(): creating person_weeks column
  - $impute_confounders(): deleting old confounder columns before merge
  All fixed automatically by R6 (in-place modification works).
- CHANGED: $ipcw_pp() now inlines weight combination and truncation (was calling tte_combine_weights() and tte_truncate_weights() via function parameters that created extra refcount). Keeps data.table refcount=1 throughout.
swereg 26.2.10
Bug fixes
- FIXED: tte_ipw(), tte_ipcw_pp(): in-place joins via S7 @ accessor now use extract/modify/reassign pattern (dt <- trial@data; dt[...]; trial@data <- dt). The previous trial@data[i, := ...] silently modified a copy, leaving the S7 object’s data unchanged.
Performance
- IMPROVED: tte_ipw(), tte_ipcw_pp(), tte_calculate_ipcw(): replace merge() with in-place keyed joins (data[i, := ...]), reducing peak RAM from ~3x to ~2x panel size during the weight-calculation chain.
Breaking changes
- CHANGED: tte_ipcw_pp() now also combines weights (ipw * ipcw_pp → analysis_weight_pp), truncates analysis_weight_pp, and drops intermediate IPCW columns (p_censor, p_uncensored, cum_p_uncensored, marginal_p, cum_marginal). Callers no longer need tte_weights() + tte_truncate() after tte_ipcw_pp().
- RENAMED: tte_generate_enrollments() → tte_generate_enrollments_and_ipw(). Now computes IPW + truncation once on the full combined enrollment (after imputation), so the per-ETT Loop 2 no longer needs to call tte_ipw(). New stabilize parameter (default TRUE) controls IPW stabilization.
New features
- NEW: tte_plan_load() reads a .qs2 plan file and reconstructs the TTEPlan S7 object. Companion to tte_plan_save().
- CHANGED: tte_plan_save() now persists project_prefix and skeleton_files alongside ett and global_max_isoyearweek, so tte_plan_load() can fully reconstruct the object.
- NEW: skeleton_process() gains n_workers parameter for parallel batch processing. When > 1, uses callr::r() + parallel::mclapply() to process batches concurrently while avoiding fork() + data.table OpenMP segfaults.
swereg 26.2.9
Improvements
- CHANGED: Migrate serialization from qs (archived) to qs2. .qs_save/.qs_read wrappers now call qs2::qs_save/qs2::qs_read (standard format, preserves S7 objects). All file extensions changed from .qs to .qs2. The preset parameter is no longer used.
- IMPROVED: tte_rates() now sets swereg_type and exposure_var attributes on its output; tte_irr() sets swereg_type.
- RENAMED: tte_rates_table() → tte_rates_combine(), tte_irr_table() → tte_irr_combine(). New API accepts (results, slot, descriptions) and extracts the rates/irr slot internally, removing the need for lapply(results, `[[`, "table2") at call sites. Exposure column is now read from the exposure_var attribute instead of guessing via setdiff().
Breaking changes
- CHANGED: tte_plan_add_one_ett() now requires explicit enrollment_id parameter. Auto-assignment based on follow_up + age_group removed. Validation that design params match within an enrollment_id is preserved.
- IMPROVED: print(plan) now shows both enrollment grid and full ETT grid.
- CHANGED: tte_plan_add_one_ett() bundles age_group, age_min, age_max, person_id_var into an argset named list parameter. time_exposure_var and eligible_var no longer have defaults (must be explicit). exposure_var removed from interface (hardcoded to "baseline_exposed").
- RENAMED: file_id column in the ett data.table → enrollment_id. This makes explicit that ETTs sharing the same follow_up + age_group are processed together as one “enrollment” (shared eligibility, matching, collapse, imputation).
- RENAMED: tte_generate_trials() → tte_generate_enrollments(). The function generates enrollments (one per follow_up × age_group), not individual trials.
- RENAMED: tte_plan_task() return list key file_id → enrollment_id.
- UPDATED: print(plan) now shows “Enrollments: N x M skeleton files” instead of “Tasks: N file_id(s) x M skeleton files”.
swereg 26.2.8
Breaking changes
- CHANGED: tte_plan() is now infrastructure-only: it takes only project_prefix, skeleton_files, global_max_isoyearweek. Use tte_plan_add_one_ett() to add ETTs with per-ETT design parameters.
- REMOVED: TTEPlan plan-level properties confounder_vars, person_id_var, exposure_var, time_exposure_var, eligible_var. These are now per-ETT columns in the ett data.table.
- REMOVED: Internal .tte_grid() function. The ETT grid is now built incrementally via tte_plan_add_one_ett().
- ADDED: TTEPlan@project_prefix property (needed for file naming in tte_plan_add_one_ett()).
New features
- NEW: tte_plan_add_one_ett(): builder function that adds one ETT row to a plan. Stores design params (confounder_vars, person_id_var, exposure_var, time_exposure_var, eligible_var) per-ETT, allowing different ETTs to use different confounders. Validates that design params match within an enrollment_id (same follow_up + age_group).
- RENAMED: TTEPlan@files property → TTEPlan@skeleton_files for clarity.
swereg 26.2.7
Breaking changes
- REFACTORED: tte_generate_enrollments() (formerly tte_generate_trials()) now takes a TTEPlan object instead of separate parameters (ett, files, confounder_vars, global_max_isoyearweek). The process_fn callback signature changes from function(file_path, design, file_id, age_range, n_threads) to function(task, file_path) where task is a list with design, enrollment_id, age_range, and n_threads.
New features
- NEW: TTEPlan S7 class bundles ETT grid, skeleton file paths, confounder definitions, and design column names into a single object for trial generation.
  - tte_plan(): Constructor function
  - tte_plan_task(plan, i): Extract the i-th enrollment task as a list with design, enrollment_id, age_range, n_threads
  - plan[[i]]: Shorthand for tte_plan_task(plan, i)
  - length(plan): Number of unique enrollment_id groups
  - Supports interactive testing: task <- plan[[1]]; process_fn(task, plan@skeleton_files[1])
swereg 26.2.3
Breaking changes
- REPLACED: tte_match() and tte_expand() merged into single tte_enroll() function:
  - Old workflow: tte_trial(data, design) |> tte_match(ratio = 2, seed = 4) |> tte_expand(extra_cols = "isoyearweek")
  - New workflow: tte_trial(data, design) |> tte_enroll(ratio = 2, seed = 4, extra_cols = "isoyearweek")
  - The two operations were tightly coupled and always used together
  - tte_enroll() combines sampling (matching) and panel expansion in one step
  - Records “enroll” in steps_completed (previously recorded “match” then “expand”)
New features
- NEW: Trial eligibility helper functions for composable eligibility criteria:
  - tte_eligible_isoyears(): Check eligibility based on calendar years
  - tte_eligible_age_range(): Check eligibility based on age range
  - tte_eligible_no_events_in_window_excluding_wk0(): Check for no events in prior window (correctly excludes baseline week)
  - tte_eligible_no_observation_in_window_excluding_wk0(): Check for no specific value in prior window (for categorical variables)
  - tte_eligible_combine(): Combine multiple eligibility columns using AND logic
  - All functions modify data.tables by reference and return invisibly for method chaining
Documentation
- IMPROVED: Clarified that eligibility checks should EXCLUDE the baseline week. Using cumsum(x) == 0 is incorrect because it includes the current week. The new eligibility functions use any_events_prior_to() which correctly excludes the current row.
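The difference between the two checks is easy to see on a four-week toy series. This base-R sketch looks at all prior weeks rather than a fixed-size window, so it only mirrors the "exclude the current row" idea, not the behaviour of any_events_prior_to() itself.

```r
event <- c(FALSE, FALSE, TRUE, FALSE)

# Includes the current week: week 3 is flagged ineligible by its own event.
no_event_incl_current <- cumsum(event) == 0

# Excludes the current week: only events strictly before each row count.
events_before <- cumsum(event) - event
no_event_prior <- events_before == 0
```

Here `no_event_incl_current` is TRUE, TRUE, FALSE, FALSE, while `no_event_prior` is TRUE, TRUE, TRUE, FALSE: a person whose first event falls in the baseline week is still eligible at that baseline.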
swereg 26.1.31
New features
- NEW: S7 object-oriented API for target trial emulation workflows:
  - TTEDesign class: Define column name mappings once and reuse across all TTE functions
  - TTETrial class: Fluent method chaining with workflow state tracking
  - tte_design() / tte_trial(): Constructor functions for the S7 classes
  - tte_match(), tte_expand(), tte_collapse(), tte_ipw(): S7 methods for data preparation
  - tte_prepare_outcome(), tte_ipcw(): Outcome-specific per-protocol analysis
  - tte_weights(), tte_truncate(): Weight combination and truncation
  - tte_rbind(): Combine batched trial objects
  - tte_extract(), tte_summary(): Access data and diagnostics
  - tte_table1(), tte_rates(), tte_irr(), tte_km(): Analysis and visualization
Breaking changes
- REMOVED: Deprecated S7 methods replaced by tte_prepare_outcome():
  - tte_tte(): Use tte_prepare_outcome() which computes weeks_to_event internally
  - tte_set_outcome(): Use tte_prepare_outcome(outcome = "...") instead
  - tte_censoring(): Use tte_prepare_outcome() which handles censoring internally
swereg 26.1.30
New features
- NEW: Target trial emulation weight functions for causal inference in observational studies:
  - tte_calculate_ipw(): Calculate stabilized inverse probability of treatment weights (IPW) for baseline confounding adjustment using propensity scores
  - tte_calculate_ipcw(): Calculate time-varying inverse probability of censoring weights (IPCW) for per-protocol analysis using GAM or GLM
  - tte_identify_censoring(): Identify protocol deviation and loss to follow-up for per-protocol analysis
  - tte_combine_weights(): Combine IPW and IPCW weights for per-protocol effect estimation
  - tte_truncate_weights(): Truncate extreme weights at specified quantiles to reduce variance
- NEW: Target trial emulation data preparation functions:
  - tte_match_ratio(): Sample comparison group at specified ratio (e.g., 2:1 unexposed to exposed)
  - tte_collapse_periods(): Collapse fine-grained time intervals (e.g., weekly) to coarser periods (e.g., 4-week)
  - tte_time_to_event(): Calculate time to first event for each trial/person
swereg 25.12.24
API changes
- SIMPLIFIED: Removed validate_source_column() requirement from add_diagnoses(), add_operations(), add_icdo3s(), add_snomed3s(), and add_snomedo10s():
  - The source column is no longer required in diagnosis data
  - To track diagnoses by source (inpatient/outpatient/cancer), filter the dataset externally before calling add_diagnoses()
  - See ?add_diagnoses for the recommended pattern
New features
- NEW: any_events_prior_to() function for survival analysis:
  - Checks if any TRUE values exist in a preceding time window (excludes current row)
  - Useful for determining if an event occurred in a prior time period
  - Default window of 104 weeks (~2 years) with customizable size
  - Complements steps_to_first() for comprehensive time-to-event analysis
- ENHANCED: steps_to_first() function improvements:
  - Renamed parameter from window to window_including_wk0 for clarity
  - Default window is now 104 (inclusive of current week)
  - Added @family survival_analysis tag and cross-reference to any_events_prior_to()
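The backward-looking check described for any_events_prior_to() can be pictured with a small base-R sketch. This is not the package's code: the name any_prior_sketch() is hypothetical, and returning FALSE for the first row (where no prior rows exist) is an assumption of this sketch that may differ from the package's behavior.

```r
# Illustrative only: NOT the swereg implementation.
# For each row, is there any TRUE in the `window` rows before it
# (current row excluded)?
any_prior_sketch <- function(x, window = 104L) {
  n <- length(x)
  vapply(seq_len(n), function(i) {
    if (i == 1L) return(FALSE)          # assumption: no history means FALSE
    start <- max(1L, i - window)
    any(x[start:(i - 1L)], na.rm = TRUE)
  }, logical(1))
}

event <- c(FALSE, TRUE, FALSE, FALSE, FALSE)
prior <- any_prior_sketch(event, window = 2L)
# prior is FALSE FALSE TRUE TRUE FALSE
```

With a window of 2, the event in row 2 is visible from rows 3 and 4 but has dropped out of the window by row 5.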
Bug fixes
- FIXED: Added slider package to Imports in DESCRIPTION to fix R CMD check warning about undeclared import
Data
- BREAKING: Replaced separate fake_inpatient_diagnoses and fake_outpatient_diagnoses with unified fake_diagnoses dataset:
  - New SOURCE column identifies data origin: “inpatient”, “outpatient”, or “cancer”
  - ~2000 inpatient records, ~2000 outpatient records, ~1000 cancer records
  - Cancer records always have populated ICDO3 codes
  - Enables testing of source-based filtering and validation
- ENHANCED: Added ICD-O-3 and SNOMED-CT columns to fake diagnosis data:
  - ICDO3: ICD-O-3 morphology codes (always populated for cancer source)
  - SNOMED3: SNOMED-CT version 3 codes
  - SNOMEDO10: SNOMED-CT version 10 codes
Validation
- ENHANCED: SOURCE column validation is now optional - filter externally if needed (see API changes above)
Documentation
- IMPROVED: Examples for add_icdo3s(), add_snomed3s(), and add_snomedo10s() are now runnable using package fake data (previously wrapped in \dontrun{})
swereg 25.12.6
New features
- NEW: steps_to_first() function for survival analysis:
  - Calculates the number of steps (e.g., weeks) until the first TRUE value in a forward-looking window
  - Useful for time-to-event calculations in longitudinal registry data
  - Default window of 103 weeks (~2 years) with customizable size
  - Returns NA if no event occurs within the window
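The forward-looking calculation described for steps_to_first() can be sketched in base R as follows. This is an illustration, not the package's code: steps_to_first_sketch() and its window_rows argument are hypothetical, and the sketch assumes the window counts the current row and that an event on the current row yields 0 steps.

```r
# Illustrative only: NOT the swereg implementation.
# Steps until the first TRUE within the next `window_rows` rows
# (current row included); NA when no TRUE falls inside the window.
steps_to_first_sketch <- function(x, window_rows = 104L) {
  n <- length(x)
  vapply(seq_len(n), function(i) {
    end <- min(n, i + window_rows - 1L)
    hit <- which(x[i:end])
    if (length(hit) == 0L) NA_integer_ else hit[1L] - 1L
  }, integer(1))
}

event <- c(FALSE, FALSE, TRUE, FALSE, FALSE)
steps <- steps_to_first_sketch(event, window_rows = 2L)
# steps is NA 1 0 NA NA
```

Row 1 gets NA because the event in row 3 lies outside its two-row window, while row 2 sees it one step ahead.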
Bug fixes
- CRITICAL: Fixed add_snomed3s() and add_snomedo10s() calling non-existent internal functions
  - Both functions now correctly call add_diagnoses_or_operations_or_cods_or_icdo3_or_snomed()
  - These functions would have caused runtime errors before this fix
- FIXED: Removed erroneous icdo10 column references from add_diagnoses():
  - ICD-O only has editions 1, 2, and 3 (not 10)
  - ICD-O-3 codes should be handled via the dedicated add_icdo3s() function
- FIXED: Added icd7* and icd9* columns to diagnosis search in add_diagnoses():
  - Historical ICD-7 and ICD-9 columns are now properly searched when diag_type = "both"
  - Validation and helper function now consistent
- FIXED: Corrected error messages in add_icdo3s(), add_snomed3s(), and add_snomedo10s():
  - Messages now correctly reference the appropriate data types instead of “operation data”
Documentation
- ENHANCED: add_diagnoses() documentation now clearly lists which diagnosis columns are searched:
  - When diag_type = "both": hdia, dia*, ekod*, icd7*, icd9*
  - When diag_type = "main": hdia only
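To make the column rules above concrete, here is a minimal base-R sketch that picks the searched columns from a vector of column names. It only mirrors the rule as documented here; diag_columns_sketch() is a hypothetical helper, not a swereg function, and the example column names are invented.

```r
# Illustrative only: NOT the swereg implementation.
diag_columns_sketch <- function(col_names, diag_type = c("both", "main")) {
  diag_type <- match.arg(diag_type)
  if (diag_type == "main") {
    return(col_names[col_names == "hdia"])
  }
  # "both": hdia plus any column starting with dia, ekod, icd7 or icd9
  col_names[grepl("^(hdia$|dia|ekod|icd7|icd9)", col_names)]
}

cols <- c("lopnr", "indatum", "hdia", "dia1", "dia2", "ekod1", "icd7", "icd9", "op1")
both_cols <- diag_columns_sketch(cols, "both")
main_cols <- diag_columns_sketch(cols, "main")
# both_cols: "hdia" "dia1" "dia2" "ekod1" "icd7" "icd9"
# main_cols: "hdia"
```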
swereg 25.8.19
CRAN Submission Preparation
- CRAN READY: Package prepared for CRAN submission with comprehensive compliance improvements:
  - Fixed DESCRIPTION file author field duplication issue
  - Updated .Rbuildignore to exclude all development files (docs/, .git/, .Rhistory, etc.)
  - Removed non-portable files (@eaDir directories, .DS_Store files)
  - Added missing global variable declarations to prevent R CMD check warnings
  - Verified URL consistency between DESCRIPTION and package startup messages
- OPTIMIZED: Vignette structure significantly improved for CRAN submission:
  - Reduced total vignette content by 31% (626 lines removed)
  - Condensed cookbook-survival-analysis.Rmd (removed verbose descriptive statistics and redundant sections)
  - Simplified skeleton2-clean.Rmd (removed duplicated skeleton1_create workflow)
  - Streamlined skeleton3-analyze.Rmd (removed redundant data loading and best practices sections)
  - Fixed all vignette build errors by ensuring consistent data variable availability
  - All vignettes now compile successfully and use package synthetic data consistently
- VALIDATED: All examples are runnable using package fake data - no \dontrun sections without justification
Code Quality Improvements
- CONSISTENCY: Fixed date_columns parameter usage throughout package:
  - Updated all vignettes to use lowercase date_columns parameters (e.g., “indatum” instead of “INDATUM”)
  - Added warning to make_lowercase_names() function when uppercase date_columns are provided
  - Enhanced documentation to clarify that date_columns should use lowercase names
  - Improved user experience with clear guidance and automatic handling of uppercase inputs
- ELEGANCE: Enhanced vignette code patterns for better readability:
  - Replaced verbose data() loading patterns with elegant pipe syntax
  - Updated all data loading to use swereg::fake_* |> copy() |> make_lowercase_names() pattern
  - Eliminated clumsy multi-step data preparation code throughout vignettes
  - Improved code flow and professional appearance of package examples
- VERIFIED: Package builds successfully with R CMD build and passes CRAN compliance checks
- CONFIRMED: inst/ directory contains only files referenced by package functions
swereg 25.7.30
New Features
- NEW: make_rowind_first_occurrence() helper function for rowdep → rowind transformations:
  - Simplifies the common pattern of creating row-independent variables from first occurrence of conditions
  - Automatically handles temp variable creation and cleanup
  - Uses first_non_na() for robust aggregation across all variable types
  - Includes comprehensive input validation and clear error messages
- NEW: “Understanding rowdep and rowind Variables” vignette:
  - Explains the fundamental distinction between row-dependent and row-independent variables
  - Demonstrates common transformation patterns with practical examples
  - Shows integration with the swereg workflow (skeleton1_create → skeleton2_clean → skeleton3_analyze)
  - Includes best practices for longitudinal registry data analysis
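The rowdep → rowind pattern mentioned above (take a value from the first row where a condition holds and repeat it on every row of that person) can be sketched in base R. This does not use make_rowind_first_occurrence(), whose arguments are not given in this changelog. The function first_occurrence_sketch(), the column names, and the assumption that rows are already sorted by time within each id are all part of the illustration only.

```r
# Illustrative only: NOT the swereg implementation.
df <- data.frame(
  id = c(1, 1, 1, 2, 2, 2),
  week = c(1, 2, 3, 1, 2, 3),
  diagnosed = c(FALSE, TRUE, TRUE, FALSE, FALSE, FALSE)
)

first_occurrence_sketch <- function(df, id_col, condition, value_col) {
  # Temporary vector: the value where the condition holds, NA elsewhere
  tmp <- ifelse(condition, df[[value_col]], NA)
  # First non-NA value per id, repeated on every row of that id
  ave(tmp, df[[id_col]], FUN = function(v) v[!is.na(v)][1L])
}

df$week_first_diagnosed <- first_occurrence_sketch(df, "id", df$diagnosed, "week")
# id 1 gets 2 on all three rows; id 2 never meets the condition and gets NA
```

The new column is row-independent: it has the same value on every row of a person, which is what makes it usable as a baseline covariate.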
swereg 25.7.16
New Swedish Date Parsing and Enhanced Data Cleaning
- NEW: parse_swedish_date() function for handling Swedish registry dates with varying precision:
  - Handles 4-character (YYYY), 6-character (YYYYMM), and 8-character (YYYYMMDD) formats
  - Automatically replaces “0000” with “0701” and “00” with “15” for missing date components
  - Supports custom defaults for missing date parts
  - Includes comprehensive error handling and vectorized processing
- ENHANCED: make_lowercase_names() now supports automatic date cleaning:
  - New date_column parameter to specify which column contains dates
  - Automatically creates cleaned ‘date’ column using parse_swedish_date()
  - Works with both default and data.table methods
  - Maintains backward compatibility with existing code
- ENHANCED: All add_* functions now require cleaned date columns:
  - add_diagnoses(), add_operations(), add_rx(), add_cods() expect ‘date’ column
  - Clear error messages guide users to use make_lowercase_names(data, date_column = "...")
  - Improved validation ensures data preprocessing consistency
- ENHANCED: create_skeleton() now includes personyears column:
  - Annual rows (is_isoyear==TRUE) have personyears = 1
  - Weekly rows (is_isoyear==FALSE) have personyears = 1/52.25
  - Facilitates person-time calculations for survival analysis
- IMPROVED: Survival analysis cookbook vignette updated:
  - Uses weekly data instead of yearly data for more precise analyses
  - Age calculation based on isoyearweeksun instead of isoyear
  - Includes person-time in descriptive statistics
  - Demonstrates proper use of new date cleaning workflow
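The date-completion rules listed for parse_swedish_date() above (missing month and day become “0701”, missing day becomes “15”) can be sketched in base R. This is a simplified illustration, not the package's code: parse_date_sketch() and its two default arguments are hypothetical names, and the sketch skips the error handling the changelog mentions.

```r
# Illustrative only: NOT the swereg implementation.
parse_date_sketch <- function(x, default_month_day = "0701", default_day = "15") {
  x <- as.character(x)
  n <- nchar(x)
  # YYYY: append the default month and day
  x <- ifelse(!is.na(n) & n == 4, paste0(x, default_month_day), x)
  # YYYYMM: append the default day
  x <- ifelse(!is.na(n) & n == 6, paste0(x, default_day), x)
  # YYYYMMDD with unknown parts coded as zeros
  x <- sub("0000$", default_month_day, x)
  x <- sub("00$", default_day, x)
  as.Date(x, format = "%Y%m%d")
}

parsed <- parse_date_sketch(c("1985", "198503", "19850000", "19850300", "19850312"))
# parsed: 1985-07-01, 1985-03-15, 1985-07-01, 1985-03-15, 1985-03-12
```

Mid-year and mid-month defaults keep the expected error from the imputation roughly symmetric, which is the usual reason for choosing 1 July and the 15th.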
Enhanced error handling and validation
- ENHANCED: Comprehensive input validation for all add_* functions:
  - add_onetime(): Validates skeleton structure, ID column exists, checks for ID matches
  - add_annual(): Validates isoyear parameter, checks skeleton year coverage
  - add_diagnoses(): Validates diagnosis patterns, checks for diagnosis code columns
  - add_operations(): Validates operation patterns, checks for operation code columns
  - add_rx(): Validates prescription data structure, checks source columns
  - add_cods(): Validates death data structure, checks cause of death columns
- IMPROVED: User-friendly error messages with specific guidance:
  - Clear indication when make_lowercase_names() is forgotten
  - Helpful suggestions for column naming issues
  - Informative ID mismatch diagnostics with sample values
- NEW: Internal validation helper functions for consistent error handling
- ADDED: Input validation for pattern lists, data structures, and parameter ranges
New cookbook documentation
- NEW: Comprehensive survival analysis cookbook (cookbook-survival-analysis.Rmd):
  - Complete workflow from raw data to Cox proportional hazards model
  - Time-varying covariates (annual income) with heart attack outcome
  - Handles common challenges: missing data, multiple events, competing risks
  - Performance tips for large datasets
  - Practical solutions for real-world registry analysis
- ENHANCED: Updated _pkgdown.yml with new “Cookbooks” section
- ADDED: survival package to Suggests dependencies
swereg 25.7.16
Major documentation restructuring
- RESTRUCTURED: Complete vignette reorganization for clear learning progression:
  - NEW “Skeleton concept” vignette: Conceptual foundation explaining the skeleton approach without technical implementation
  - “Building the data skeleton (skeleton1_create)”: Pure data integration focus - raw data to time-structured skeleton
  - “Cleaning and deriving variables (skeleton2_clean)”: Pure data cleaning and variable derivation focus
  - “Production analysis workflows (skeleton3_analyze)”: Memory-efficient processing and final analysis datasets
- IMPROVED: Clear separation of concerns with focused, single-purpose tutorials
- ENHANCED: Systematic learning progression from concept to implementation to production
- UPDATED: _pkgdown.yml structure with logical vignette grouping
- PRESERVED: All existing technical content while improving organization
Content improvements
- NEW: Comprehensive conceptual introduction based on presentation content
- IMPROVED: Each vignette builds systematically on the previous one
- ENHANCED: Better explanation of three types of data integration (one-time, annual, event-based)
- CLARIFIED: Production workflow patterns with memory-efficient batching strategies
- STANDARDIZED: Consistent academic tone and sentence case throughout
swereg 25.7.15
Documentation and presentation improvements
- STANDARDIZED: Changed all titles and headings to normal sentence case throughout:
  - Vignette titles: “Basic Workflow” → “Basic workflow”, “Complete Workflow” → “Complete workflow”, etc.
  - README.md section headings: “Core Functions” → “Core functions”, “Data Integration” → “Data integration”, etc.
  - NEWS.md section headings: “Vignette Restructuring” → “Vignette restructuring”, etc.
  - CLAUDE.md section headings: “Project Overview” → “Project overview”, “Development Commands” → “Development commands”, etc.
- IMPROVED: Consistent normal sentence case for better readability and less formal appearance
- SIMPLIFIED: Removed subtitle text after colons in vignette titles for cleaner presentation
- ENHANCED: Improved Core Concept section in basic workflow vignette with clear explanation of three data types:
  - One-time data (demographics): Added to all rows for each person
  - Annual data (income, family status): Added to all rows for specific year
  - Event-based data (diagnoses, prescriptions, deaths): Added to rows where events occurred
- CLARIFIED: Step 1 documentation now properly explains all skeleton columns including isoyearweeksun
- VERIFIED: All vignettes compile successfully with improved content
Major documentation and vignette reorganization
- RESTRUCTURED: Complete vignette reorganization with improved naming and content flow:
  - swereg.Rmd → basic-workflow.Rmd: Focused introduction to skeleton1_create
  - advanced-workflow.Rmd → complete-workflow.Rmd: Two-stage workflow (skeleton1_create + skeleton2_clean)
  - memory-efficient-batching.Rmd: Maintained as comprehensive three-stage workflow guide
- IMPROVED: Eliminated content redundancy between vignettes for clearer learning progression
- ENHANCED: Updated _pkgdown.yml configuration to reflect new vignette structure
Function documentation improvements
- ENHANCED: Comprehensive documentation improvements for all exported functions:
  - Added @family tags for logical grouping (data_integration, skeleton_creation, data_preprocessing)
  - Added @seealso sections with cross-references to related functions and vignettes
  - Replaced placeholder examples with runnable code using synthetic data
  - Improved parameter documentation with detailed descriptions and expected formats
  - Enhanced return value documentation with explicit side effects description
- STANDARDIZED: Consistent academic tone throughout all documentation
Professional presentation updates
- IMPROVED: Removed informal elements and adopted academic tone across all documentation
- UPDATED: Changed terminology from “fake data” to “synthetic data” throughout
- ENHANCED: More professional language in README.md and vignettes
- STANDARDIZED: Consistent formal tone appropriate for scientific software
swereg 25.7.1
Vignette restructuring
- RESTRUCTURED: Reorganized vignettes for clearer learning progression:
  - swereg.Rmd: Clean skeleton1_create tutorial using full datasets (removed subset filtering)
  - advanced-workflow.Rmd: Focused skeleton1→skeleton2 workflow (removed batching and skeleton3 content)
  - memory-efficient-batching.Rmd: NEW comprehensive batching vignette with complete skeleton1→skeleton2→skeleton3 workflow for large-scale studies
- IMPROVED: GitHub Actions workflow optimization with dependency caching and binary packages for faster CI/CD
Batching vignette fixes
- FIXED: Updated memory-efficient-batching vignette with production-ready improvements:
  - Replace split() with csutil::easy_split for better batch handling
  - Replace saveRDS/readRDS with qs::qsave/qread for 2-10x faster file I/O
  - Fix skeleton3_analyze to properly aggregate weekly→yearly data using swereg::max_with_infinite_as_na
  - Remove incorrect is_isoyear == TRUE filter in skeleton3_analyze
  - Fix analysis results to avoid NaN outputs in treatment rate calculations
  - Add explanations for weekly→yearly data aggregation and qs package performance benefits
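The weekly→yearly aggregation issue mentioned above comes from how base R's max() behaves on groups with no observed values: max(x, na.rm = TRUE) returns -Inf with a warning. The sketch below shows one way to turn that into NA. The behavior of swereg::max_with_infinite_as_na is inferred here from its name only; max_inf_as_na_sketch() is a hypothetical stand-in, and the toy data are invented.

```r
# Illustrative only: NOT the swereg implementation.
max_inf_as_na_sketch <- function(x) {
  out <- suppressWarnings(max(x, na.rm = TRUE))
  if (is.infinite(out)) NA else out
}

# Weekly rows for one person: treatment observed in 2020, all missing in 2021
weekly <- data.frame(
  isoyear = c(2020, 2020, 2021, 2021),
  treated = c(FALSE, TRUE, NA, NA)
)

# One value per year: 1 if treated in any week, NA if nothing was observed
yearly_treated <- tapply(weekly$treated, weekly$isoyear, max_inf_as_na_sketch)
```

Without the conversion, the 2021 value would be -Inf, and downstream rate calculations on it can produce nonsensical or NaN results.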
New features
- NEW: Added isoyearweeksun variable to create_skeleton() function - provides Date representing the Sunday (last day) of each ISO week/year for easier date calculations
- NEW: Updated package logo
- IMPROVED: Updated all vignettes to not assume swereg is loaded - all functions use swereg:: prefix and data() calls use package="swereg" argument
- IMPROVED: Updated function documentation to clarify that pattern matching functions (add_diagnoses, add_cods, add_rx) automatically add “^” prefix - users should NOT include “^” in their patterns
- NEW: Added comprehensive fake Swedish registry datasets for development and vignettes:
  - fake_person_ids: 1000 synthetic personal identifiers
  - fake_demographics: Demographics data matching SCB format
  - fake_annual_family: Annual family status data
  - fake_inpatient_diagnoses and fake_outpatient_diagnoses: NPR diagnosis data with ICD-10 codes
  - fake_prescriptions: LMED prescription data with ATC codes and hormone therapy focus
  - fake_cod: Cause of death data
- NEW: Added two comprehensive vignettes:
  - swereg.Rmd: Basic skeleton1_create workflow tutorial
  - advanced-workflow.Rmd: Complete 3-phase workflow (skeleton1 → skeleton2 → skeleton3)
- NEW: Replaced magrittr pipe (%>%) with base pipe (|>) throughout codebase
- NEW: Added memory-efficient batched processing examples for large registry studies
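The note above about the automatic “^” prefix amounts to prefix matching on codes. A minimal base-R sketch of that idea follows. It is not the package's matching code: match_codes_sketch() is a hypothetical helper, and the example codes are chosen only to show the anchoring.

```r
# Illustrative only: NOT the swereg implementation.
match_codes_sketch <- function(codes, patterns) {
  # Anchor every pattern at the start of the code, then combine with "|"
  regex <- paste0("^", patterns, collapse = "|")
  grepl(regex, codes)
}

codes <- c("F640", "F649", "XF64", "I21")
matched <- match_codes_sketch(codes, c("F64", "I2"))
# matched is TRUE TRUE FALSE TRUE
```

Because the anchor is added for every pattern, "F64" matches "F640" and "F649" but not "XF64", and a caret typed by the user would be redundant.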
Bug fixes
- CRITICAL: Fixed incorrect variable names in fake_cod dataset - changed from non-Swedish underlying_cod/contributory_cod1/contributory_cod2 to correct Swedish registry names ulorsak/morsak1/morsak2
- VERIFIED: Confirmed all fake datasets use correct Swedish registry variable name conventions
- VERIFIED: All ICD-10 and ATC codes in fake datasets are properly formatted and realistic
Documentation improvements
- BREAKING: Fixed incorrect function descriptions that were copied from another package
- NEW: Added comprehensive roxygen2 documentation for all exported functions:
  - add_onetime(): Documents merging one-time/baseline data to skeleton
  - add_annual(): Documents merging annual data for specific ISO years
  - add_cods(): Documents cause of death analysis with ICD-10 codes
  - add_diagnoses(): Documents diagnosis analysis with main/secondary diagnoses
  - add_operations(): Documents surgical operation analysis including gender-affirming procedures
  - add_rx(): Documents prescription drug analysis with ATC/product codes
  - create_skeleton(): Documents longitudinal skeleton creation with detailed return structure
  - make_lowercase_names(): Documents generic function with S3 methods
  - x2023_mht_add_lmed(): Documents specialized MHT study function
- NEW: Added documentation for all helper functions:
- NEW: Added @param descriptions for all function parameters
- NEW: Added @return descriptions explaining function outputs
- NEW: Added @examples with practical usage demonstrations
- NEW: Added @details and @note sections for complex functions
- IMPROVED: Used proper roxygen2 practices including @rdname for S3 methods and @seealso cross-references
