ARGON: Ticket Change Details

Overview

Artifact ID:	66f6ebefe395bd7ae4e2b125a20dc6684c8dc485
Ticket:	11776edc915f3dda05b476852bf1ddef5114de13 Consider support for replicated processing
User & Date:	alaric 2013-07-16 11:36:28

Changes

comment changed to:

WOLFRAM gives us replicated, fault-tolerant storage, with end-to-end checksums helping to protect against corruption of data on disk or in transit.

However, CPU/RAM errors during the processing of a LITHIUM handler that updates entity state via WOLFRAM will often remain undetected. Hardened processors and ECC RAM help somewhat, but it would be nice to have a software-level safeguard as well.

See if it's practical to execute handlers <i>twice</i> (or more) in parallel. To keep the instances exactly consistent, every call to an API that reads external mutable state (MERCURY, clocks, etc.) should be "cached" for the duration of the handler, so that both instances read exactly the same value. We can execute the two handler instances in lock-step: the first one to call an API performs the call, and the arguments and results are stored until the second one makes its next API call. That call should be exactly the same one (otherwise we have detected a consistency violation), and the second instance is then given the stored results. If the first instance tries to call another API before this happens, it should block to let the second instance catch up.
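
A rough sketch of the lock-step idea, in Python rather than anything ARGON-specific. All names here (LockstepApiCache, run_replicated, the api callback passed to the handler) are hypothetical and chosen only for illustration; the timeout value and the use of threads are assumptions, not part of the proposal.

```python
import threading


class ConsistencyViolation(Exception):
    """The two replicas disagreed on an API call or on the final result."""


class LockstepApiCache:
    """Hypothetical helper: shares each external API result between two replicas."""

    def __init__(self, timeout=5.0):
        self._cond = threading.Condition()
        self._pending = None  # (replica, name, args, result) awaiting the other replica
        self._failed = False
        self._timeout = timeout  # assumed bound on how long a replica may lag

    def abort(self):
        with self._cond:
            self._failed = True
            self._cond.notify_all()

    def has_unmatched_call(self):
        with self._cond:
            return self._pending is not None

    def call(self, replica, name, args, fn):
        with self._cond:
            # A replica that is one call ahead blocks until the other catches up.
            while (not self._failed and self._pending is not None
                   and self._pending[0] == replica):
                if not self._cond.wait(self._timeout):
                    self._failed = True
                    self._cond.notify_all()
            if self._failed:
                raise ConsistencyViolation("replicas diverged")
            if self._pending is None:
                # First to arrive: perform the real call and record it.
                result = fn(*args)
                self._pending = (replica, name, args, result)
                self._cond.notify_all()
                return result
            # Second to arrive: must be making exactly the same call.
            _, seen_name, seen_args, result = self._pending
            self._pending = None
            if (seen_name, seen_args) != (name, args):
                self._failed = True
                self._cond.notify_all()
                raise ConsistencyViolation(
                    f"expected {seen_name}{seen_args}, got {name}{args}")
            self._cond.notify_all()
            return result


def run_replicated(handler, timeout=5.0):
    """Run handler twice in parallel; return its result only if both runs agree."""
    cache = LockstepApiCache(timeout)
    results, errors = [None, None], [None, None]

    def worker(replica):
        def api(name, fn, *args):
            return cache.call(replica, name, args, fn)
        try:
            results[replica] = handler(api)
        except Exception as exc:
            errors[replica] = exc
            cache.abort()  # make sure the other replica does not wait forever

    threads = [threading.Thread(target=worker, args=(r,)) for r in (0, 1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    for exc in errors:
        if exc is not None:
            raise exc
    if cache.has_unmatched_call() or results[0] != results[1]:
        raise ConsistencyViolation("replicas finished in different states")
    return results[0]
```

A handler in this sketch receives an api callback and routes every read of external mutable state through it, for example api("clock", read_clock). The underlying function then runs once per call site, and both replicas observe the same value. Comparing the final results as well as the call sequence is an extra check added here on the assumption that a handler's output is comparable for equality.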

If there is any disparity in the sequence of API calls performed, then we've detected an inconsistency, and should abort the handler (and retry it).

Consistently failing handlers may reveal a bug in our consistent execution logic, or a CPU/RAM so broken that the error rate is approaching 100%. In either case, we should give up and reject the handler outright.
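
The abort, retry, and reject policy could look roughly like the following. This is only a sketch: run_with_retries, HandlerRejected, and the max_attempts limit of 3 are assumptions made for illustration, since the ticket does not say how many retries count as "consistently failing".

```python
class ConsistencyViolation(Exception):
    """Raised when the replicated runs of a handler disagree."""


class HandlerRejected(Exception):
    """Raised when a handler keeps failing the consistency check."""


def run_with_retries(run_once, max_attempts=3):
    """Call run_once until it completes without a consistency violation.

    run_once is any zero-argument callable that executes the handler under
    replicated processing and raises ConsistencyViolation on disagreement.
    max_attempts is an assumed limit, not a value taken from the ticket.
    """
    violations = 0
    for _ in range(max_attempts):
        try:
            return run_once()
        except ConsistencyViolation:
            # Transient CPU/RAM fault suspected: discard this attempt and retry.
            violations += 1
    # Every attempt failed: a logic bug or badly broken hardware is more likely.
    raise HandlerRejected(f"{violations} consecutive consistency violations")
```

Only consistency violations are retried here; any other exception from the handler propagates unchanged, on the assumption that ordinary handler errors are dealt with elsewhere.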

private_contact changed to: "edd852a1b86b4a3139e73d229e5a61a63d12b819"
severity changed to: "Critical"
status changed to: "Open"
title changed to: "Consider support for replicated processing"
type changed to: "Code_Defect"