Consider support for replicated processing
User & Date: alaric 2013-07-16 11:36:28
- Change comment to:
WOLFRAM gives us replicated, fault-tolerant storage, with end-to-end checksums helping to protect against corruption of data on disk or in transit.
However, CPU/RAM errors during the processing of a LITHIUM handler that updates entity state via WOLFRAM will often remain undetected. Hardened processors and ECC RAM help somewhat, but it would be nice to have a software-level defence as well.
See if it's practical to execute handlers twice (or more) in parallel. To keep the instances absolutely consistent, every access to an API that reads external mutable state (MERCURY, clocks, etc.) should be "cached" for the duration of the handler, so that both instances of the handler read exactly the same value. We can execute the two handler instances in lock-step: the first one to call an API actually performs the call, and the arguments and results are stored until the second one makes its call, which should be exactly the same call with the same arguments (otherwise we have detected a consistency violation); the second instance is then given the stored results. If the first instance tries to call another API before the second has caught up, it should block until it does.
If there is any disparity in the sequence of API calls performed, then we've detected an inconsistency, and should abort the handler (and retry it).
A handler that fails consistently may reveal a bug in our consistent execution logic, or a CPU/RAM so broken that the error rate is approaching 100%; either way, we should give up and reject the handler outright.
- Change private_contact to "edd852a1b86b4a3139e73d229e5a61a63d12b819"
- Change severity to "Critical"
- Change status to "Open"
- Change title to "Consider support for replicated processing"
- Change type to "Code_Defect"
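The lock-step scheme described in the comment can be sketched roughly as follows. This is only an illustration in Python using threads, not LITHIUM code: `LockStepGate`, `run_replicated`, `Inconsistency`, the `api(name, fn, *args)` calling convention and the `max_attempts` limit are all hypothetical names invented for the sketch. It also assumes that the handler's return value should match across replicas, which the comment does not state, and it performs the real API call while holding the lock for simplicity.

```python
import threading

class Inconsistency(Exception):
    """Replicas disagreed on the sequence of API calls or on the result."""

class LockStepGate:
    """Shared call log that every replica of one handler run goes through."""

    def __init__(self, replicas):
        self.cond = threading.Condition()
        self.log = []                  # (name, args, result) of each real call
        self.pos = [0] * replicas      # number of calls each replica has made
        self.done = [False] * replicas
        self.failed = None             # reason string once a disparity is seen

    def abort(self, reason):
        with self.cond:
            if self.failed is None:
                self.failed = reason
            self.cond.notify_all()     # wake any replica blocked in call()

    def call(self, rid, name, fn, *args):
        with self.cond:
            i = self.pos[rid]
            while True:
                if self.failed is None and any(
                        d and p <= i for d, p in zip(self.done, self.pos)):
                    self.abort("a replica finished before call %d" % i)
                if self.failed is not None:
                    raise Inconsistency(self.failed)
                if i < len(self.log) or min(self.pos) >= i:
                    break
                self.cond.wait()       # we are ahead: let the others catch up
            if i == len(self.log):
                # First replica to reach call i performs the real call, once.
                self.log.append((name, args, fn(*args)))
            rec_name, rec_args, result = self.log[i]
            if (rec_name, rec_args) != (name, args):
                self.abort("call %d differs between replicas" % i)
                raise Inconsistency(self.failed)
            self.pos[rid] = i + 1
            self.cond.notify_all()
            return result              # every replica sees the same value

    def finish(self, rid):
        with self.cond:
            self.done[rid] = True
            if self.pos[rid] < len(self.log):
                self.abort("a replica finished with fewer calls than its peers")
            self.cond.notify_all()

def run_replicated(handler, replicas=2, max_attempts=3):
    """Run handler(api) in several threads; retry on disparity, then reject."""
    for _ in range(max_attempts):
        gate = LockStepGate(replicas)
        results = [None] * replicas

        def worker(rid):
            try:
                results[rid] = handler(
                    lambda name, fn, *args: gate.call(rid, name, fn, *args))
                gate.finish(rid)
            except Exception as exc:   # includes Inconsistency
                gate.abort(repr(exc))

        threads = [threading.Thread(target=worker, args=(r,))
                   for r in range(replicas)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        if gate.failed is None and all(r == results[0] for r in results):
            return results[0]
    raise Inconsistency("handler failed %d attempts; rejecting it" % max_attempts)
```

A handler in this sketch receives an `api` callable and routes every read of external mutable state through it, for example `api("clock", read_clock)`. Note that a retry repeats the real API calls, so a real implementation would need those calls to be safe to repeat or would have to defer their side effects until the replicas agree.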