Key Facts
- ✓ Most consensus libraries (Raft, Paxos) treat the state machine as a pure black box.
- ✓ If a leader crashes after the side effect but before committing it, duplicates occur.
- ✓ Chr2 uses a Replicated Outbox: side effects are stored as 'pending' in replicated state.
- ✓ Durable Fencing uses a manifest persisted via atomic tmp+fsync+rename to stop zombie leaders.
- ✓ Chr2 is a CP system, prioritizing safety over availability.
Quick Summary
Standard consensus libraries like Raft and Paxos generally ignore what happens after a log entry is committed. They treat the state machine as a black box, assuming the application handles the rest. This assumption breaks down when the application must trigger external actions, such as charging a credit card, firing a webhook, or sending an email. If a leader crashes after performing the action but before the commit is finalized, the system often loses track of the operation. When a new leader takes over, it may re-execute the same command, leading to duplicate side effects.
To solve this, a new library named Chr2 was developed to treat crash-safe side effects as a primary feature rather than an afterthought. The core philosophy is to ensure that side effects are not just logged but are managed through a strict execution lifecycle. The library introduces a Replicated Outbox mechanism. Instead of executing immediately, side effects are stored as 'pending' items within the replicated state. Execution is strictly controlled; only the active leader is allowed to execute these effects, and it must do so under a specific fencing token.
Preventing 'zombie' leaders—old leaders that come back online and try to act—is critical. Chr2 uses Durable Fencing to manage this. A manifest file persists the highest view number using atomic operations (tmp+fsync+rename). This ensures that an old leader cannot wake up and execute stale effects. To guarantee consistency during recovery or replay, the system provides a Deterministic Context. Application code receives a deterministic RNG seed and the block time directly from the log, ensuring that replaying the log produces identical state transitions. Finally, the Write-Ahead Log (WAL) is strict: entries are CRC’d and hash-chained. If corruption is detected, the system is designed to halt rather than guess. While these measures provide strong safety, the system is explicitly designed as a CP (Consistency/Partition Tolerance) system, prioritizing safety over availability and accepting that side effects will be at-least-once rather than strictly exactly-once.
The Problem with Standard Consensus
Consensus algorithms are the backbone of distributed systems, allowing multiple servers to agree on a sequence of commands. Libraries implementing Raft and Paxos are widely used for this purpose. However, these libraries typically focus solely on log replication and consistency. They ensure that all nodes agree on the order of operations, but they do not manage the consequences of those operations. This is often described as treating the state machine as a 'black box.' The consensus layer passes the command to the application layer and considers its job done.
This separation of concerns becomes problematic when the application needs to interact with the outside world. Common operations include:
- Charging a customer's credit card.
- Sending an email notification.
- Triggering a webhook to an external service.
The danger arises during a leader failure. Imagine a leader receives a command to 'Charge $50.' It passes this to the application, which contacts the payment gateway and successfully charges the card. However, before the leader can replicate the log entry to a majority of followers and commit it, it crashes. The followers do not know the operation was completed. When a new leader is elected, it sees the uncommitted entry and executes it again. The customer is charged twice. This is the 'exactly-once' lie mentioned in the documentation: true exactly-once execution is incredibly difficult to guarantee at the consensus layer without help from the application or a specialized mechanism.
"if you’ve ever had “exactly once” collapse the first time a leader died mid flight, you know exactly why I built this."
— Chr2 Creator
How Chr2 Ensures Crash Safety
Chr2 approaches the problem by integrating side effect management directly into the consensus mechanism. It moves away from the black-box model to a system where side effects are first-class citizens. The library achieves this through a combination of four specific technical mechanisms designed to work in unison.
Replicated Outbox
The fundamental shift in Chr2 is the introduction of a Replicated Outbox. Rather than executing a side effect immediately and hoping the log entry commits, Chr2 stores the intent as a 'pending' state in the replicated log. This means the request to perform an action is replicated to other nodes just like any other state change. However, the actual execution is decoupled from the initial log entry. Only the designated leader is permitted to execute these pending effects, and it does so under the protection of a fencing token. This token acts as a proof of authority, ensuring that only the current valid leader can trigger external actions.
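The source does not show Chr2's actual API, but the outbox idea can be sketched as follows. All names here (`ReplicatedOutbox`, `enqueue`, `take_pending`) are illustrative, not Chr2's real interface; the key points are that the apply phase only records intent, and that execution is gated on a fencing token:

```python
import itertools

class ReplicatedOutbox:
    """Illustrative sketch of an outbox living inside the replicated state machine."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self.pending = {}     # effect_id -> payload; part of replicated state
        self.executed = set()

    def enqueue(self, payload):
        """Apply phase: record the intent. Never perform the effect here."""
        effect_id = next(self._next_id)
        self.pending[effect_id] = payload
        return effect_id

    def take_pending(self, fencing_token, current_view):
        """Leader-only: yield pending effects iff the caller's token is current."""
        if fencing_token < current_view:
            raise PermissionError("stale fencing token; refusing to execute")
        for effect_id, payload in sorted(self.pending.items()):
            yield effect_id, payload

    def mark_done(self, effect_id):
        """After the external call succeeds, retire the effect."""
        self.executed.add(effect_id)
        self.pending.pop(effect_id, None)
```

Because the pending set is itself replicated, a new leader inherits the unexecuted effects and can retry them, while a deposed leader holding a stale token is refused at the gate.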
Durable Fencing and Zombie Prevention
A significant risk in distributed systems is the 'zombie leader.' This occurs when a leader is partitioned from the network, presumed dead, and then suddenly reappears. If it still believes it is the leader, it might try to execute operations that have already been handled by its successor. Chr2 prevents this using Durable Fencing.
The system maintains a manifest file that records the highest 'view' number (essentially the term of the leader). When a leader changes, this manifest is updated using a specific sequence of operations: writing to a temporary file, forcing a sync to disk (fsync), and then renaming the file to the final name (atomic replace). This ensures that even if power is lost during the update, the state remains consistent. An old leader attempting to wake up will find that its view number is lower than the persisted manifest and will refuse to execute effects.
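The tmp+fsync+rename sequence is a standard POSIX durability pattern, and it can be sketched in a few lines. This is an assumption-laden illustration (the manifest format and function names are invented), but the ordering of operations matches what the text describes:

```python
import json
import os
import tempfile

def persist_view(manifest_path, view):
    """Atomically persist the highest observed view via tmp + fsync + rename."""
    dir_name = os.path.dirname(os.path.abspath(manifest_path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, prefix=".manifest-")
    with os.fdopen(fd, "w") as f:
        json.dump({"view": view}, f)
        f.flush()
        os.fsync(f.fileno())             # contents are durable before the swap
    os.replace(tmp_path, manifest_path)  # atomic replace on POSIX
    dir_fd = os.open(dir_name, os.O_RDONLY)
    try:
        os.fsync(dir_fd)                 # make the rename itself survive power loss
    finally:
        os.close(dir_fd)

def may_execute(manifest_path, my_view):
    """A waking leader checks itself against the persisted fence."""
    with open(manifest_path) as f:
        highest = json.load(f)["view"]
    return my_view >= highest
```

Note the directory fsync at the end: without it, the rename can be lost on power failure even though the file contents were synced, which would silently roll the fence backwards.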
Deterministic Context for Replay
Replaying the log is a standard requirement for recovering from crashes. However, side effects often rely on variables like timestamps or random numbers. If these change during replay, the state machine might end up in a different state than before. Chr2 solves this by providing a Deterministic Context. When the application code needs to perform an action, it receives specific inputs from the consensus layer:
- A deterministic Random Number Generator (RNG) seed.
- The exact block time from the log.
Because these inputs are fixed by the log history, replaying the log will always produce the same result. This ensures 1:1 state transitions.
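A minimal sketch of the idea, assuming a context object whose fields come from the log entry rather than the wall clock (the names here are hypothetical, not Chr2's API):

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class DeterministicContext:
    """Inputs fixed by the log entry, not by the environment."""
    rng_seed: int      # recorded in the log entry
    block_time: float  # recorded in the log entry

    def rng(self):
        """Seeded generator: same seed, same sequence, every replay."""
        return random.Random(self.rng_seed)

def apply_entry(state, entry):
    """Replaying the same entry always yields the same state transition."""
    ctx = DeterministicContext(entry["seed"], entry["block_time"])
    jitter = ctx.rng().randint(0, 999)  # deterministic because the seed is logged
    return {**state,
            "last_applied": entry["id"],
            "applied_at": ctx.block_time,
            "jitter": jitter}
```

If `apply_entry` had called `time.time()` or an unseeded `random.randint` instead, a crash-recovery replay could diverge from the original run; pinning both inputs to the log closes that gap.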
Strict Write-Ahead Log (WAL)
Data integrity is paramount. Chr2 employs a Strict WAL. Every entry written to the log is protected by a CRC (Cyclic Redundancy Check) and hash-chained to the previous entry. This creates a verifiable chain of data. If corruption is detected in the middle of the log, the system is designed to stop immediately rather than attempting to guess what the missing data might have been. This 'fail-closed' approach prevents data corruption from propagating through the system.
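The CRC-plus-hash-chain structure can be sketched with standard library primitives. This is not Chr2's on-disk format (which the source does not specify), just an in-memory illustration of the fail-closed verification the text describes:

```python
import hashlib
import json
import zlib

def append_entry(wal, payload):
    """Append an entry that commits to both its payload and its predecessor."""
    prev_hash = wal[-1]["hash"] if wal else "0" * 64
    body = json.dumps({"prev": prev_hash, "payload": payload}, sort_keys=True)
    entry = {
        "prev": prev_hash,
        "payload": payload,
        "crc": zlib.crc32(body.encode()),                    # detects bit rot
        "hash": hashlib.sha256(body.encode()).hexdigest(),   # links the chain
    }
    wal.append(entry)
    return entry

def verify(wal):
    """Fail closed: raise at the first corrupt or broken-chain entry."""
    prev_hash = "0" * 64
    for i, entry in enumerate(wal):
        body = json.dumps({"prev": entry["prev"], "payload": entry["payload"]},
                          sort_keys=True)
        if entry["prev"] != prev_hash:
            raise ValueError(f"hash chain broken at entry {i}")
        if zlib.crc32(body.encode()) != entry["crc"]:
            raise ValueError(f"CRC mismatch at entry {i}")
        if hashlib.sha256(body.encode()).hexdigest() != entry["hash"]:
            raise ValueError(f"hash mismatch at entry {i}")
        prev_hash = entry["hash"]
```

The chain property is what makes truncation and reordering detectable: altering any entry invalidates every hash after it, so verification cannot accidentally skip past damage.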
The Trade-offs: Safety vs. Availability
Chr2 makes a deliberate architectural choice regarding the CAP theorem. The CAP theorem states that, in the presence of a network partition, a distributed system must choose between Consistency and Availability. Chr2 is explicitly a CP system (Consistency and Partition Tolerance). It prioritizes safety and correctness over availability. If a partition occurs or if the system cannot verify the safety of side effects, it will choose to halt or refuse requests rather than risk duplicating transactions or losing consistency.
This leads to a specific stance on the 'exactly-once' problem. The documentation notes that side effects in Chr2 are intentionally at-least-once. The library guarantees that an effect will be executed at least once, but it does not guarantee it will be executed exactly once. The reasoning is that 'exactly-once' usually requires stable effect IDs for sink-side deduplication (i.e., the external service, like a payment processor, must be able to recognize and ignore duplicates). Chr2 provides the tools to minimize duplicates (via fencing and outbox), but it acknowledges that the final guarantee often requires cooperation from the external system.
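The sink-side deduplication the text refers to can be illustrated with a toy external service. This is not part of Chr2; it shows the cooperation expected from the receiving system, assuming the outbox delivers each effect with a stable ID:

```python
class PaymentSink:
    """Toy external service that deduplicates on a stable effect ID."""

    def __init__(self):
        self.charges = {}  # effect_id -> amount

    def charge(self, effect_id, amount):
        """Idempotent receiver: a redelivered effect is recognized and ignored."""
        if effect_id in self.charges:
            return "duplicate-ignored"
        self.charges[effect_id] = amount
        return "charged"
```

With such a sink, at-least-once delivery from the outbox composes into effectively-once processing: the leader may retry after a crash, but the stable ID lets the sink collapse the duplicates.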
By accepting at-least-once semantics, Chr2 avoids the complexity and potential failure points of trying to enforce exactly-once execution at the consensus layer. This trade-off is often preferred in financial and critical systems, where halting on uncertainty (fail-stop) is better than silently reporting success for a duplicated transaction.
Conclusion
Chr2 represents a targeted evolution of consensus algorithms for modern application needs. By acknowledging that distributed systems frequently need to interact with the outside world, it moves beyond the limitations of traditional libraries like Raft and Paxos. Its combination of a Replicated Outbox, Durable Fencing, and Deterministic Contexts offers a robust solution to the problem of duplicate side effects. While it forces developers to accept at-least-once semantics and a CP availability model, it provides a demonstrably safer environment for critical operations. For developers who have experienced the frustration of 'exactly-once' guarantees failing during a leader crash, Chr2 offers a principled, crash-safe alternative.




