A `Skeleton` is a single batch's person-week data.table plus its full provenance: the hash of the framework function that built the base time grid, an ordered record of every randvars function that has been applied to it, and a fingerprint map of every code_registry entry whose columns currently live in the data.
This is the on-disk unit produced by [RegistryStudy]`$process_skeletons()`, with one file written per batch.
`Skeleton` objects are rarely constructed directly. Use [RegistryStudy]`$load_skeleton(batch_number)` to read one from disk and [RegistryStudy]`$save_skeleton(sk)` to write one back.
Phase provenance fields
- `framework_fn_hash`
xxhash64 of `list(body(fn), formals(fn))` for the framework function that built `self$data`. Used by `$process_skeletons()` to decide whether to rebuild this batch from scratch (phase 1) when the framework code has changed.
- `applied_registry`
Named list keyed by code_registry entry fingerprint. Each value is a minimal descriptor sufficient to recompute the entry's column names via `.entry_columns()` at drop time, without re-running `fn`:
Primary entries (from `$register_codes()`) store `list(codes, groups, combine_as, label, fn_args)`.
Derived entries (from `$register_derived_codes()`) store `list(kind = "derived", codes, from, as, label)`. `.entry_columns()` branches on the entry's `kind` field (defaulting to `"primary"` when absent) so both shapes produce the right column predictions at drop time.
The entry's `fn` is NOT stored – serializing R function objects carries enclosing-environment bloat and we never call `fn` at drop time anyway.
- `randvars_state`
Named ordered list, one entry per phase-3 step that's been applied. Each value is `list(fn_hash = ..., added_columns = ...)`. `fn_hash` is the hash of the function that ran; `added_columns` is the character vector of column names it wrote, recorded via a before/after diff at apply time (since randvars functions are arbitrary user code whose outputs can't be predicted from metadata).
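The `fn_hash` / `added_columns` pair described above can be illustrated with a minimal sketch. This is not the package's implementation: `record_step()`, `hash_fn()`, and `add_age_flags()` are hypothetical helpers, a plain `data.frame` stands in for the skeleton's `data.table`, the step function uses a simplified one-argument signature, and using the `digest` package for xxhash64 is an assumption (the documentation does not name the hashing library).

```r
# Hypothetical helper: run one phase-3 step and record which columns it added.
# A plain data.frame stands in for the skeleton's data.table.
record_step <- function(data, fn, fn_hash = NA_character_) {
  before <- names(data)                  # column names before the step
  data <- fn(data)                       # arbitrary user code
  added <- setdiff(names(data), before)  # before/after diff
  list(data = data, state = list(fn_hash = fn_hash, added_columns = added))
}

# Hypothetical hashing helper; assumes the digest package is installed.
hash_fn <- function(fn) {
  digest::digest(list(body(fn), formals(fn)), algo = "xxhash64")
}

# Illustrative randvars-style function (simplified signature).
add_age_flags <- function(d) {
  d$is_adult <- d$age >= 18
  d$is_senior <- d$age >= 65
  d
}

grid <- data.frame(id = 1:3, age = c(12, 40, 70))

# Only hash when digest is available; the column diff works either way.
h <- if (requireNamespace("digest", quietly = TRUE)) {
  hash_fn(add_age_flags)
} else {
  NA_character_
}

res <- record_step(grid, add_age_flags, fn_hash = h)
res$state$added_columns
```

Because the diff is taken from the data itself, the recorded column set stays correct even when the function's outputs cannot be predicted from metadata.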
See also
[RegistryStudy] for the pipeline that produces and consumes `Skeleton` objects; [CandidatePath] for the directory resolution mechanism behind `study$load_skeleton()` / `$save_skeleton()`.
Other skeleton_pipeline:
RegistryStudy
Public fields
- `data`
The underlying `data.table` (time grid + derived columns).
- `batch_number`
Integer batch index.
- `framework_fn_hash`
xxhash64 of the framework function that built `self$data`.
- `applied_registry`
Named list (keyed by code_registry entry fingerprint). Each value is a minimal descriptor: for primary entries it's `list(codes, groups, combine_as, label, fn_args)`; for derived entries (from `$register_derived_codes()`) it's `list(kind = "derived", codes, from, as, label)`. See the class-level "Phase provenance fields" section for why both shapes omit `fn`.
- `randvars_state`
Named ordered list, one entry per phase-3 step that's been applied. Each value is `list(fn_hash = ..., added_columns = ...)`.
- `created_at`
POSIXct timestamp for when this `Skeleton` object was constructed.
Methods
Skeleton$new()
Construct a new `Skeleton` wrapping an existing `data.table`. Typically called by [RegistryStudy]`$process_skeletons()` after the framework function produces the base time grid.
Usage
Skeleton$new(data, batch_number)
Skeleton$check_version()
Check this object's schema version against the current `Skeleton` schema version. Errors with an actionable migration message on mismatch.
Skeleton$pipeline_hash()
Compute this skeleton's total pipeline hash from its own stored provenance. Invariant: `sk$pipeline_hash() == study$pipeline_hash()` iff the skeleton is fully synced with the study's currently-registered framework + randvars + codes.
Skeleton$apply_code_entry()
Apply one code_registry entry to `self$data`, mutating it in place, and record a minimal descriptor of the entry under its fingerprint so a future `$drop_code_entry(fingerprint)` call knows which columns to remove. The stored descriptor shape depends on `entry$kind`: primary entries store the `codes/groups/combine_as/label/fn_args` quintuple, derived entries store `list(kind = "derived", codes, from, as, label)`. For derived entries, `batch_data` is unused – the apply just ORs already-existing skeleton columns under new names.
Arguments
- `entry`
A code_registry entry (as constructed by [RegistryStudy]`$register_codes()` or [RegistryStudy]`$register_derived_codes()`).
- `batch_data`
Named list of data.tables from [RegistryStudy]`$load_rawbatch()`. Ignored for derived entries.
- `id_col`
Character. Person-ID column name.
- `fingerprint`
Character. The xxhash64 fingerprint for `entry` (computed by [RegistryStudy]`$code_registry_fingerprints()`).
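The derived-entry case, where already-existing skeleton columns are ORed together under a new name, might look roughly like the sketch below. It makes several assumptions: `or_columns()` is a hypothetical helper, the column names are invented for illustration, and a plain `data.frame` stands in for the skeleton's `data.table` (the real method mutates `self$data` in place and also records the descriptor).

```r
# Hypothetical helper: combine existing logical columns with OR under a new name.
or_columns <- function(data, new_name, source_cols) {
  # Reduce() folds `|` across the selected columns, element-wise per row.
  data[[new_name]] <- Reduce(`|`, data[source_cols])
  data
}

# Illustrative skeleton data with two already-applied code columns.
sk_data <- data.frame(
  id   = 1:4,
  dx_a = c(TRUE, FALSE, FALSE, TRUE),
  dx_b = c(FALSE, FALSE, TRUE, TRUE)
)

sk_data <- or_columns(sk_data, "dx_any", c("dx_a", "dx_b"))
sk_data$dx_any
```

No raw batch data appears anywhere in this sketch, which mirrors why `batch_data` is ignored for derived entries.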
Skeleton$drop_code_entry()
Drop every column that the registry entry with the given fingerprint contributed to `self$data`, and clear its descriptor from `self$applied_registry`. Columns are computed from the stored descriptor via `.entry_columns()` – no lookup map, no before/after diff.
Tolerates missing columns (e.g. after a partial-state crash): the column set is intersected with `names(self$data)` before dropping, so the method is idempotent and safe to call repeatedly.
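The intersect-before-drop behaviour can be sketched as follows. `drop_predicted_columns()` is a hypothetical helper, the column names are invented, and a plain `data.frame` stands in for the skeleton's `data.table`; the real method works in place and also clears the descriptor from `self$applied_registry`.

```r
# Hypothetical helper: drop the predicted columns, ignoring any that are
# already missing from the data.
drop_predicted_columns <- function(data, predicted_cols) {
  to_drop <- intersect(predicted_cols, names(data))
  data[, setdiff(names(data), to_drop), drop = FALSE]
}

sk_data <- data.frame(id = 1:2, dx_a = c(TRUE, FALSE), dx_b = c(FALSE, TRUE))

# "dx_c" was predicted by the descriptor but never written (partial state).
predicted <- c("dx_a", "dx_b", "dx_c")

once  <- drop_predicted_columns(sk_data, predicted)
twice <- drop_predicted_columns(once, predicted)

names(once)
identical(once, twice)
```

The second call finds nothing left to drop and returns the data unchanged.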
Skeleton$sync_with_registry()
Bring this skeleton into sync with the given code registry (phase 2 of `$process_skeletons()`). Entries in `stored - current` are dropped (their columns removed via `.entry_columns()` on the stored descriptor). Entries in `current - stored` are applied via `$apply_code_entry()`.
"Changed" entries – same `label` but different `codes` / `groups` / etc. – are handled automatically without special casing: their old fingerprint lives in `stored` (so the old descriptor's columns get dropped) and their new fingerprint lives in `current` (so the new entry gets freshly applied).
Rawbatches are loaded lazily via `batch_data_loader`: if no new entries need to be applied, the loader is never called.
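The set-difference logic and the lazy loader can be sketched in a few lines. All helper names here (`plan_registry_sync()`, `make_counting_loader()`, `sync_sketch()`) and the fingerprint strings are hypothetical; this illustrates the idea, not the package's code.

```r
# Hypothetical helper: decide what to drop and what to apply from the
# stored and current fingerprint sets.
plan_registry_sync <- function(stored_fps, current_fps) {
  list(
    to_drop  = setdiff(stored_fps, current_fps),
    to_apply = setdiff(current_fps, stored_fps)
  )
}

# A loader that counts its own calls, to make the laziness observable.
make_counting_loader <- function() {
  calls <- 0L
  list(
    load = function() {
      calls <<- calls + 1L
      list()  # stand-in for the named list of raw data.tables
    },
    n_calls = function() calls
  )
}

sync_sketch <- function(stored_fps, current_fps, batch_data_loader) {
  plan <- plan_registry_sync(stored_fps, current_fps)
  if (length(plan$to_apply) > 0L) {
    # Only load raw data when at least one entry must be applied.
    batch_data <- batch_data_loader()
  }
  plan
}

loader <- make_counting_loader()

# An edited entry appears as two fingerprints: "fp_old" stored, "fp_new" current.
changed <- sync_sketch(c("fp_keep", "fp_old"), c("fp_keep", "fp_new"), loader$load)

# Nothing to apply: the loader is not called again.
in_sync <- sync_sketch("fp_keep", "fp_keep", loader$load)

changed$to_drop
changed$to_apply
loader$n_calls()
```

An edited entry needs no special handling in this scheme: it simply shows up once in each difference.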
Skeleton$sync_randvars()
Bring this skeleton into sync with the currently registered phase-3 step sequence (phase 3 of `$process_skeletons()`).
Uses "divergence-point + rewind and replay" semantics:
1. Scan the stored step sequence (`names(self$randvars_state)` + stored `fn_hash`s) against the current sequence (`names(randvars_fns)` + `randvars_hashes`). Find the first position where the name or hash differs, or where one sequence ends.
2. Rewind: drop the stored `added_columns` of every step from the divergence point forward, in stored order.
3. Replay: run the current steps from the divergence point forward, in current order, recording each step's hash + new `added_columns`.
This handles add, remove, edit, and reorder uniformly because any of those operations changes either the name sequence or the hash sequence, and the first mismatch point is the divergence point. When no divergence exists, the method is a no-op and `batch_data_loader` is never called.
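The divergence-point search and the resulting rewind/replay plan can be sketched as below. `find_divergence()` and `plan_randvars_sync()` are hypothetical helpers, and the step names and hash strings are placeholders; the sketch only plans which steps to rewind and replay and does not touch any data.

```r
# Hypothetical helper: first index where name or hash differs, or where one
# sequence ends; NA when the two sequences are identical.
find_divergence <- function(stored_names, stored_hashes,
                            current_names, current_hashes) {
  n_stored <- length(stored_names)
  n_current <- length(current_names)
  n_common <- min(n_stored, n_current)
  for (i in seq_len(n_common)) {
    if (stored_names[i] != current_names[i] ||
        stored_hashes[i] != current_hashes[i]) {
      return(i)
    }
  }
  if (n_stored == n_current) return(NA_integer_)
  n_common + 1L
}

# Hypothetical helper: which stored steps to rewind, which current steps to replay.
plan_randvars_sync <- function(stored_names, stored_hashes,
                               current_names, current_hashes) {
  i <- find_divergence(stored_names, stored_hashes, current_names, current_hashes)
  if (is.na(i)) {
    return(list(rewind = character(0), replay = character(0)))  # no-op
  }
  list(
    rewind = stored_names[seq_along(stored_names) >= i],
    replay = current_names[seq_along(current_names) >= i]
  )
}

# Step "b" was edited and step "d" was appended.
edited <- plan_randvars_sync(
  c("a", "b", "c"),      c("h1", "h2", "h3"),
  c("a", "b", "c", "d"), c("h1", "h2_edited", "h3", "h4")
)

# Pure append: nothing to rewind, only the new step to replay.
appended <- plan_randvars_sync(
  c("a", "b"),      c("h1", "h2"),
  c("a", "b", "c"), c("h1", "h2", "h3")
)

edited$rewind
edited$replay
appended$replay
```

Note that step "c" is rewound and replayed even though its own hash is unchanged, because it sits after the divergence point.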
Arguments
- `randvars_fns`
Named ordered list of phase-3 functions (from `RegistryStudy$randvars_fns`).
- `randvars_hashes`
Character vector parallel to `randvars_fns` with the xxhash64 of each function's body + formals.
- `batch_data_loader`
Zero-argument closure returning the rawbatch data for this batch.
- `config`
The owning `RegistryStudy` (passed as the third argument to each randvars function).
Skeleton$save()
Save this `Skeleton` to disk as `skeleton_NNN.qs2` inside `dir`. Prefer [RegistryStudy]`$save_skeleton(sk)` which supplies `self$data_skeleton_dir` automatically.
Examples
if (FALSE) { # \dontrun{
# Load a persisted skeleton from disk and inspect its provenance.
sk <- study$load_skeleton(batch_number = 1L)
sk # print summary
sk$data # the underlying data.table
sk$framework_fn_hash # hash of the phase-1 fn that built it
names(sk$randvars_state) # applied phase-3 steps in order
length(sk$applied_registry) # applied code registry entries
sk$pipeline_hash() # rolled-up provenance scalar
# Check consistency with the study's current pipeline.
identical(sk$pipeline_hash(), study$pipeline_hash())
# Write back after manual editing (rare; process_skeletons handles
# this automatically).
study$save_skeleton(sk)
} # }
