mirror of
https://gh.wpcy.net/https://github.com/discourse/discourse.git
synced 2026-06-18 22:48:32 +08:00
Previously, a converter `ProgressStep` was a single instance that spanned the process boundary: `items` ran in the main process while `process_item` ran in forked workers. Instance variables behaved differently depending on where they were set — state set in `items` never reached a worker, state set in `execute` was only a copy-on-write snapshot, and state mutated in `process_item` accumulated per worker and never came back. It was easy to write a step that looked correct but lost data silently, and with the planned thread-based worker for JRuby the same code would turn into a data race. This change replaces the `items` / `process_item` contract with two separate role objects, defined with a small DSL: a `source` that enumerates the items in the main process (`items`, `max_progress`) and a `processor` that handles one item at a time in a worker, with a `setup` hook that runs once per worker after the fork. `helpers` mixes pure functions into both roles. Because the roles are separate objects, source state is structurally out of reach for the processor — a cross-boundary call raises a `NameError` instead of reading state that never crossed the fork. The worker lifecycle is now an explicit `setup` -> `run` -> `cleanup`, so a future `ThreadWorker` on JRuby can call the same hooks at thread start without changes to this code. Also in this change: - `Step` and `ProgressStep` are siblings below a minimal `StepBase` now; the progress step itself is only a coordinator without per-step state. - Writes to the IntermediateDB during processor `setup` raise at the write site (`SetupGuard`), in serial and parallel mode alike. Before, such writes were silently discarded in parallel mode. - A crashed worker process raises `Worker::CrashedError` in the parent instead of silently dropping the items that were queued for it. - `ParallelJob` no longer registers a `ForkManager.after_fork_child` callback; `Worker` calls `job.setup` at the top of its fork block instead, and the unused `after_fork_child` API and `run_once` option were removed from `ForkManager`. - The `SiteSettings` step queried the source DB in `execute` and kept the result in an instance variable for `process_item`, which only worked because of the fork snapshot. The uploads lookup is now part of the `items` query. Moving it also surfaced a latent bug: the `::int` casts relied on `AND`/`OR` short-circuiting, but PostgreSQL doesn't guarantee the evaluation order — with a production-sized `uploads` table the planner builds index probes from the casts and the query fails on the first non-numeric setting value. The casts are now guarded with `CASE`. - The MRI performance path is untouched: same Oj/pipe IPC, same single DB writer thread, same fork batching. New specs cover the role isolation, the worker lifecycle, and serial/parallel parity (the same fixture step runs in both modes and has to produce identical inserts), and the gem spec suites verify partial doubles now. The IntermediateDB output is unchanged — running the converter against the same source database before and after this change produces identical contents. |
||
|---|---|---|
| .. | ||
| intermediate_db | ||
| connection_spec.rb | ||
| intermediate_db_spec.rb | ||
| migrator_spec.rb | ||
| offline_connection_spec.rb | ||
| prepared_statement_cache_spec.rb | ||