discourse/migrations/tooling
Gerhard Schlager 420bced604
MT: Split converter steps into source and processor roles (#40816)
Previously, a converter `ProgressStep` was a single instance that
spanned the process boundary: `items` ran in the main process while
`process_item` ran in forked workers. Instance variables behaved
differently depending on where they were set — state set in `items`
never reached a worker, state set in `execute` was only a copy-on-write
snapshot, and state mutated in `process_item` accumulated per worker
and never came back. It was easy to write a step that looked correct
but lost data silently, and with the planned thread-based worker for
JRuby the same code would turn into a data race.

This change replaces the `items` / `process_item` contract with two
separate role objects, defined with a small DSL: a `source` that
enumerates the items in the main process (`items`, `max_progress`) and
a `processor` that handles one item at a time in a worker, with a
`setup` hook that runs once per worker after the fork. `helpers` mixes
pure functions into both roles. Because the roles are separate objects,
source state is structurally out of reach for the processor — a
cross-boundary call raises a `NameError` instead of reading state that
never crossed the fork. The worker lifecycle is now an explicit
`setup` -> `run` -> `cleanup`, so a future `ThreadWorker` on JRuby can
call the same hooks at thread start without changes to this code.

Also in this change:

- `Step` and `ProgressStep` are siblings below a minimal `StepBase`
  now; the progress step itself is only a coordinator without per-step
  state.
- Writes to the IntermediateDB during processor `setup` raise at the
  write site (`SetupGuard`), in serial and parallel mode alike. Before,
  such writes were silently discarded in parallel mode.
- A crashed worker process raises `Worker::CrashedError` in the parent
  instead of silently dropping the items that were queued for it.
- `ParallelJob` no longer registers a `ForkManager.after_fork_child`
  callback; `Worker` calls `job.setup` at the top of its fork block
  instead, and the unused `after_fork_child` API and `run_once` option
  were removed from `ForkManager`.
- The `SiteSettings` step queried the source DB in `execute` and kept
  the result in an instance variable for `process_item`, which only
  worked because of the fork snapshot. The uploads lookup is now part
  of the `items` query. Moving it also surfaced a latent bug: the
  `::int` casts relied on `AND`/`OR` short-circuiting, but PostgreSQL
  doesn't guarantee the evaluation order — with a production-sized
  `uploads` table the planner builds index probes from the casts and
  the query fails on the first non-numeric setting value. The casts are
  now guarded with `CASE`.
- The MRI performance path is untouched: same Oj/pipe IPC, same single
  DB writer thread, same fork batching.

New specs cover the role isolation, the worker lifecycle, and
serial/parallel parity (the same fixture step runs in both modes and
has to produce identical inserts), and the gem spec suites verify
partial doubles now. The IntermediateDB output is unchanged — running
the converter against the same source database before and after this
change produces identical contents.
2026-06-12 20:54:31 +02:00
..
config/schema
lib
scripts/benchmarks MT: Benchmark worker IPC serializers and pin the wire contract 2026-06-12 19:50:23 +02:00
spec MT: Split converter steps into source and processor roles (#40816) 2026-06-12 20:54:31 +02:00
.rspec
Gemfile
migrations-tooling.gemspec