Overview
Comment: | Mainly NITROGEN - documenting the node lifecycle state machine. Updated other sections to refer properly to it. Removed the bootstrap code from the ARGON page as it's all been eaten up by NITROGEN. |
---|---|
SHA1: | 309ad96ecff9553c4527db4a376e8a2b |
User & Date: | alaric 2012-11-30 11:51:08 |
Context
2012-11-30
| ||
13:00 | Added more detail to NITROGEN about failure modes and special cases, sorted out the reference linking, and clarified what states real-time tasks and device drivers run in. check-in: 7ae135dc3e user: alaric tags: trunk | |
11:51 | Mainly NITROGEN - documenting the node lifecycle state machine. Updated other sections to refer properly to it. Removed the bootstrap code from the ARGON page as it's all been eaten up by NITROGEN. check-in: 309ad96ecf user: alaric tags: trunk | |
2012-07-26
| ||
08:36 | wolfram: FIXME about lightweight job generation interface check-in: e6c37344e9 user: alaric tags: trunk | |
Changes
Changes to README.wiki.
︙ | ︙ | |||
210 211 212 213 214 215 216 | entities with LITHIUM. The name is chosen not for chemical reasons, but because Mercury was the name of the Roman messenger god.</dd> <dt>[./intro/caesium.wiki|CAESIUM]</dt> <dd>Entities might also need to do things without being asked externally - so CAESIUM provides a distributed scheduler, invoking entity entry points using LITHIUM according to a schedule. The name is | | > | 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 | entities with LITHIUM. The name is chosen not for chemical reasons, but because Mercury was the name of the Roman messenger god.</dd> <dt>[./intro/caesium.wiki|CAESIUM]</dt> <dd>Entities might also need to do things without being asked externally - so CAESIUM provides a distributed scheduler, invoking entity entry points using LITHIUM according to a schedule. The name is a nod to the use of the element Caesium inside atomic clocks. [http://argon.org.uk/caesium.html|old site page]</dd> <dt>CARBON</dt> <dd>TUNGSTEN stores entity state as sets of tuples; CARBON adds an inference engine on top of that (like a PROLOG implementation) to allow high-level querying. CARBON can obtain data from in-memory temporary tuple sets, TUNGSTEN data (via WOLFRAM), or from tuple stores published via MERCURY. The publishing of tuple stores is such |
︙ | ︙ |
Changes to intro/argon.wiki.
1 2 | The system as a whole is called ARGON; but given all the other modules that comprise an ARGON node's kernel, all that's left to be "ARGON" | < | | < | < < < < < < < < < < < < | | | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 | The system as a whole is called ARGON; but given all the other modules that comprise an ARGON node's kernel, all that's left to be "ARGON" itself is the code shared by components that doesn't fit into any one component, yet doesn't deserve a component in its own right. The ARGON module comprises: * [./hydrogen.wiki|HYDROGEN code] for the first stage of the kernel startup sequence. This initialises [./helium.wiki|HELIUM], [./iron.wiki|IRON], and [./chrome.wiki|CHROME], then compiles and runs [./helium.wiki|HELIUM]. * The CHROME (with embedded HYDROGEN where necessary to use hardware crypto accelerators or for hand-coded inner loops) library of cryptographic primitives and classification / clearance operations referred to in the [./security.wiki|security model]. * Infrastructure for managing debugging mode in entities. CHROME needs to know if the entity is in debug mode so it can provide single-stepping, breakpoints, and so on for CHROME code. WOLFRAM and MERCURY need to know if tracing of uses of their services is requested, and if so, to be able to easily submit log messages down the debugging channel. Etc. ...and doubtless many other little utilities that are needed in time will accumulate. Here's [./argon-node.png|a diagram showing roughly how the components fit together on a running node]. <h1>Variations in Node Configuration</h1> I've written the above boot process for running a "full" ARGON node that handles both TUNGSTEN storage and runs LITHIUM handlers. However, a node might be configured without any TUNGSTEN mass storage |
︙ | ︙ | |||
101 102 103 104 105 106 107 | Mobile devices such as tablet computers and smartphones can provide full general NEON user interface capabilities like either of the above, or go for a more specialist interface, optionally connecting to a user agent entity through a MERCURY endpoint that lets them access private state such as address books and messages, much like modern smartphones connect to sync and messaging services. | < | 87 88 89 90 91 92 93 | Mobile devices such as tablet computers and smartphones can provide full general NEON user interface capabilities like either of the above, or go for a more specialist interface, optionally connecting to a user agent entity through a MERCURY endpoint that lets them access private state such as address books and messages, much like modern smartphones connect to sync and messaging services. |
Changes to intro/nitrogen.wiki.
︙ | ︙ | |||
73 74 75 76 77 78 79 | private key, used by WOLFRAM to protect in-cluster communications. These are stored in a special key-storage area provided by HYDROGEN, which is slightly different to the normal HYDROGEN bootstrap configuration storage in that it allows for rapid secure erasure of keys on demand. It is likely to be implemented using a trusted platform module or other such specialist key-storage hardware where available. | > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > | 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 
332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 | private key, used by WOLFRAM to protect in-cluster communications. These are stored in a special key-storage area provided by HYDROGEN, which is slightly different to the normal HYDROGEN bootstrap configuration storage in that it allows for rapid secure erasure of keys on demand. It is likely to be implemented using a trusted platform module or other such specialist key-storage hardware where available. <h1>Operational State Machine</h1> NITROGEN maintains the passage of each node through transitions of a state machine. The states are: <dl> <dt>Non-running states</dt> <dd>In the OFF, ADMIN and WIPING states, the system is in low-level administrative modes and is not considered to be "running". Kernel components above the level of HYDROGEN, HELIUM, IRON, CHROME and NITROGEN are not running, and HELIUM's threading is disabled. Only one CPU is active.</dd> <dt>OFF</dt> <dd>Either power is off, or the node is in the early stages of booting. This state can be entered manually from any other state by removing power from the node, or by requesting that the system remove power from itself where the hardware permits, or if a reboot has been requested; and when power is applied to the node, it will attempt to boot, either going to ADMIN if there's a configuration, installation, or hardware error, or a button is pressed or a configuration space directive is set to request administration mode upon boot, or proceeding to ISOLATED STANDBY, ISOLATED RUNNING, RECOVERING STANDBY, or RECOVERING RUNNING based upon a setting from the configuration space. Nodes that lack the hardware to power themselves off will go to the ADMIN state if a software transition to OFF is requested.</dd> <dt>ADMIN</dt> <dd>A manual configuration state entered if there's a problem booting, or upon a manual request to interrupt automatic booting, or entered manually from any other system state. 
Only one CPU is active, with no threading. The CPU is dedicated to providing the administrative console interface. The reason for entering the state should be available on the console, and options to repair the problem, alter configuration, and so on, and continue on to ISOLATED STANDBY, ISOLATED RUNNING, RECOVERING STANDBY, RECOVERING RUNNING or WIPING, or to switch to the OFF state, at the administrator's command.</dd> <dt>WIPING</dt> <dd>This state is used to decommission the node. Only one CPU is active, with no threading, and that one CPU proceeds to request a secure wipe of the key storage area from HYDROGEN, then a fast wipe of all attached TUNGSTEN storage volumes, then a secure wipe of all attached TUNGSTEN storage volumes. If all that completes, an automatic transition to OFF occurs (or ADMIN if the hardware does not allow a software OFF). The console can be used to manually abort the wipe, either by entering OFF or dropping into ADMIN mode.</dd> <dt>ISOLATED states</dt> <dd>In these states, the node is booted up and connected to the cluster, but any TUNGSTEN local storage on the node is disabled. Nodes without TUNGSTEN storage only have the ISOLATED states as their "running states"; RECOVERING and SYNCHRONISED are not applicable to them.</dd> <dt>ISOLATED STANDBY</dt> <dd>The node is remaining idle until administratively told to do otherwise. From this state it can be told to switch OFF or go into ADMIN, or to go to ISOLATED RUNNING to start LITHIUM, or into RECOVERING STANDBY to start recovery or into RECOVERING RUNNING to start both, or into WIPING to erase the node.</dd> <dt>ISOLATED RUNNING</dt> <dd>The node is accepting requests for LITHIUM from whatever kernel components feel like generating them (MERCURY, real-time tasks, device drivers, WOLFRAM, CAESIUM, etc). The local TUNGSTEN store (if any) is not being kept up to date by WOLFRAM, so any access to entity data has to be obtained from other nodes via WOLFRAM. 
From this state, it can go to OFF or ADMIN (for a hard shutdown), to WIPING (for a hard wipe), to RECOVERING RUNNING to start recovery, to RECOVERING STANDBY (starting recovery but doing a hard stop of LITHIUM), to ISOLATED STANDBY (for a hard stop) or to ISOLATED STOPPING (in which case a desired target state must be chosen).</dd> <dt>ISOLATED STOPPING</dt> <dd>This state is used to leave the ISOLATED RUNNING state cleanly. Unlike the direct transitions to OFF, ADMIN, ISOLATED STANDBY, RECOVERING STANDBY or WIPING, which terminate all currently running LITHIUM handlers immediately, the ISOLATED STOPPING state disables starting new LITHIUM handlers but waits until all existing ones have stopped normally, before transitioning to OFF, ADMIN, ISOLATED STANDBY, RECOVERING STANDBY or WIPING. However, the stop process can be manually cancelled by an immediate transition to one of those target states (terminating all currently running LITHIUM handlers), or to ISOLATED RUNNING or RECOVERING RUNNING to abort the shutdown and resume LITHIUM handling, optionally starting recovery at the same time.</dd> <dt>RECOVERING states</dt> <dd>In all of these states, WOLFRAM is attempting to bring the local TUNGSTEN storage up to date with the cluster. These states may only be entered by nodes with TUNGSTEN storage attached. Communication failures with the rest of the cluster that prevent recovery will result in the node remaining in the same state, retrying, rather than aborting to an ISOLATED state. Successful completion of recovery will cause an automatic transition to a corresponding SYNCHRONISED state.</dd> <dt>RECOVERING STANDBY</dt> <dd>Recovery is occurring without LITHIUM handlers being invoked. When it is up to date, an automatic transition occurs to SYNCHRONISED STANDBY. 
However, the recovery can be aborted by a manual transition to ISOLATED STANDBY, OFF, ADMIN or WIPING; or aborted while turning LITHIUM on with a manual transition to ISOLATED RUNNING.</dd> <dt>RECOVERING RUNNING</dt> <dd>Recovery is occurring while LITHIUM is enabled. As the local TUNGSTEN storage is not synchronised, access to entity state must be from other nodes via WOLFRAM, except where it can be proved that the local state required is up to date already. There can be manual transitions to RECOVERING STANDBY (to keep recovering but to do a hard stop of LITHIUM handlers), OFF or ADMIN (for a hard shutdown), WIPING (for a hard wipe), ISOLATED STANDBY (for a hard stop of LITHIUM and to abort recovery), ISOLATED RUNNING (to just abort recovery while keeping LITHIUM running), or to RECOVERING STOPPING (for a soft stop of LITHIUM, and then a transition to a chosen target state). If recovery completes, there is an automatic transition to SYNCHRONISED RUNNING.</dd> <dt>RECOVERING STOPPING</dt> <dd>This is used to offer an orderly stop of LITHIUM from RECOVERING RUNNING. LITHIUM does not accept new tasks, but existing handlers are allowed to complete. When they are all stopped, an automatic transition to a chosen target state is performed, or it can be performed manually to abort the clean stop (killing all pending LITHIUM handlers if the transition is not to a RUNNING state). The valid target states are OFF, ADMIN, WIPING, ISOLATED STANDBY, ISOLATED RUNNING, RECOVERING STANDBY or RECOVERING RUNNING. If recovery completes while in RECOVERING STOPPING, then an automatic transition to SYNCHRONISED STOPPING occurs.</dd> <dt>SYNCHRONISED states</dt> <dd>These states can only be entered if WOLFRAM is satisfied that the TUNGSTEN local storage is up to date through completing a RECOVERING state. 
They can only be maintained while connectivity to the cluster lets WOLFRAM be sure that the local TUNGSTEN storage is being kept synchronised; in the event of failure, an automatic transition to a corresponding RECOVERING state will occur.</dd> <dt>SYNCHRONISED STANDBY</dt> <dd>LITHIUM is not configured to start handlers in this state. Manual transitions to SYNCHRONISED RUNNING, OFF, ADMIN, WIPING, ISOLATED STANDBY or ISOLATED RUNNING are available; all but SYNCHRONISED RUNNING will abandon the synchronisation, requiring recovery to get it back. A synchronisation failure will automatically transition the node to RECOVERING STANDBY.</dd> <dt>SYNCHRONISED RUNNING</dt> <dd>LITHIUM is configured to start handlers. Manual hard transitions to SYNCHRONISED STANDBY, OFF, ADMIN, WIPING or ISOLATED STANDBY are available, which will kill current LITHIUM handlers in progress. Synchronisation can be stopped without stopping LITHIUM by a manual transition to ISOLATED RUNNING. Soft transitions are available by going to the SYNCHRONISED STOPPING state then on to a chosen target state. A synchronisation failure will automatically transition the node to RECOVERING RUNNING.</dd> <dt>SYNCHRONISED STOPPING</dt> <dd>This is used for a soft stop from SYNCHRONISED RUNNING. As usual, new LITHIUM handlers are not started, but existing ones are allowed to run to completion, then an automatic transition to SYNCHRONISED STANDBY, OFF, WIPING or ISOLATED STANDBY occurs. Or a manual transition to any of those states or back to SYNCHRONISED RUNNING or ISOLATED RUNNING may be triggered to cancel the clean shutdown. 
A synchronisation failure will cause an automatic transition to RECOVERING STOPPING.</dd> </dl> <h2>A table of valid state transitions</h2> <p>Key: - = no transition (we're already in that state), A = automatic transition will occur when required, M = manual transition is available, M* = manual transition is available but will terminate any currently running LITHIUM handlers, AM = automatic transition will occur when required, or can be manually triggered, AM* = an automatic transition will occur when all currently running LITHIUM handlers have terminated, or a manual transition is available but will terminate any currently running LITHIUM handlers.</p> <table> <tr> <th>From</th> <th>to O</th> <th>to A</th> <th>to IS</th> <th>to IR</th> <th>to IX</th> <th>to RS</th> <th>to RR</th> <th>to RX</th> <th>to SS</th> <th>to SR</th> <th>to SX</th> <th>to W</th> <th>Notes</th> </tr> <tr><th>O</th> <td>-</td><td>AM</td><td>A</td><td>A</td><td></td><td>A</td><td>A</td><td></td><td></td><td></td><td></td><td></td><td></td></tr> <tr><th>A</th> <td>M</td><td>-</td><td>M</td><td>M</td><td></td><td>M</td><td>M</td><td></td><td></td><td></td><td></td><td>M</td> <td>Administrative console is open.</td></tr> <tr><th>IS</th> <td>M</td><td>M</td><td>-</td><td>M</td><td></td><td>M</td><td>M</td><td></td><td></td><td></td><td></td><td>M</td><td></td></tr> <tr><th>IR</th> <td>M*</td><td>M*</td><td>M*</td><td>-</td><td>M</td><td>M*</td><td>M</td><td></td><td></td><td></td><td></td><td>M*</td> <td>LITHIUM is up</td></tr> <tr><th>IX</th> <td>AM*</td><td>AM*</td><td>AM*</td><td>M</td><td>-</td><td>AM*</td><td>M</td><td></td><td></td><td></td><td></td><td>AM*</td> <td>LITHIUM is cleanly stopping.</td></tr> <tr><th>RS</th> <td>M</td><td>M</td><td>M</td><td>M</td><td></td><td>-</td><td>M</td><td></td><td>A</td><td></td><td></td><td>M</td> <td>Recovery is in progress.</td></tr> <tr><th>RR</th> 
<td>M*</td><td>M*</td><td>M*</td><td>M</td><td></td><td>M*</td><td>-</td><td>M</td><td></td><td>A</td><td></td><td>M*</td> <td>Recovery is in progress, LITHIUM is running.</td></tr> <tr><th>RX</th> <td>AM*</td><td>AM*</td><td>AM*</td><td>M</td><td></td><td>AM*</td><td>M</td><td>-</td><td></td><td></td><td>A</td><td>AM*</td> <td>Recovery is in progress, LITHIUM is cleanly stopping.</td></tr> <tr><th>SS</th> <td>M</td><td>M</td><td>M</td><td>M</td><td></td><td>A</td><td></td><td></td><td>-</td><td>M</td><td></td><td>M</td> <td>Synchronised.</td></tr> <tr><th>SR</th> <td>M*</td><td>M*</td><td>M*</td><td>M</td><td></td><td></td><td>A</td><td></td><td>M*</td><td>-</td><td>M</td><td>M*</td> <td>Synchronised, LITHIUM is running.</td></tr> <tr><th>SX</th> <td>AM*</td><td>AM*</td><td>AM*</td><td>M</td><td></td><td></td><td></td><td>A</td><td>AM*</td><td>M</td><td>-</td><td>AM*</td> <td>Synchronised, LITHIUM is cleanly stopping.</td></tr> <tr><th>W</th> <td>AM</td><td>AM</td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td>-</td> <td>Secure erasure is in progress.</td></tr> </table> <h2>State transitions and WOLFRAM</h2> Note that there are no state transitions triggered by "gaining or losing a connection to the WOLFRAM cluster". The closest are the automatic transitions between corresponding SYNCHRONISED and RECOVERING states caused by the local TUNGSTEN store gaining or losing synchronisation with the cluster - and loss of synchronisation will usually be caused by failure to be able to communicate with the cluster (the other option being failure of TUNGSTEN storage so we can't write to it). And we will never be able to transition from RECOVERING to SYNCHRONISED if we can't connect to the cluster to gain synchronisation - with the trivial exception being where we are the only member of the cluster, in which case recovery completes immediately. However, we can never really say for sure "if the cluster is reachable". 
All we can say is that a particular attempt to communicate with it has failed or not. Failure in synchronisation causes a transition to a RECOVERING state, but that's the only network failure transition in our state diagram. Notably, failure to contact other nodes to obtain access to entity state in the RUNNING states will simply cause that LITHIUM handler to fail with an exception, aborting it, and will not trigger any node state transition. |
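The transition table added to intro/nitrogen.wiki in this check-in can be read as a lookup structure. The following is only a sketch of that reading, not NITROGEN code: the names `STATES`, `ROWS`, `transition_kind`, `manual_allowed` and `kills_handlers` are hypothetical, and the cell values are transcribed from the table in the diff, with `.` standing in for a blank cell.

```python
# Column order follows the table header: O A IS IR IX RS RR RX SS SR SX W.
STATES = ["O", "A", "IS", "IR", "IX", "RS", "RR", "RX", "SS", "SR", "SX", "W"]

# One row per source state, transcribed from the table in the diff.
# "." marks a blank cell (no transition), "-" marks the state itself.
ROWS = {
    "O":  "-   AM  A   A   .   A   A   .   .   .   .   .",
    "A":  "M   -   M   M   .   M   M   .   .   .   .   M",
    "IS": "M   M   -   M   .   M   M   .   .   .   .   M",
    "IR": "M*  M*  M*  -   M   M*  M   .   .   .   .   M*",
    "IX": "AM* AM* AM* M   -   AM* M   .   .   .   .   AM*",
    "RS": "M   M   M   M   .   -   M   .   A   .   .   M",
    "RR": "M*  M*  M*  M   .   M*  -   M   .   A   .   M*",
    "RX": "AM* AM* AM* M   .   AM* M   -   .   .   A   AM*",
    "SS": "M   M   M   M   .   A   .   .   -   M   .   M",
    "SR": "M*  M*  M*  M   .   .   A   .   M*  -   M   M*",
    "SX": "AM* AM* AM* M   .   .   .   A   AM* M   -   AM*",
    "W":  "AM  AM  .   .   .   .   .   .   .   .   .   -",
}

# Nested mapping: TABLE[source][target] -> raw cell text.
TABLE = {src: dict(zip(STATES, row.split())) for src, row in ROWS.items()}


def transition_kind(src, dst):
    """Return the table cell for src -> dst, or None if no transition is listed."""
    cell = TABLE[src][dst]
    return None if cell in (".", "-") else cell


def manual_allowed(src, dst):
    """True if the cell includes a manual (M) transition."""
    kind = transition_kind(src, dst)
    return kind is not None and "M" in kind


def kills_handlers(src, dst):
    """True if a manual transition here terminates running LITHIUM handlers (the * marker)."""
    kind = transition_kind(src, dst)
    return kind is not None and kind.endswith("*")
```

For example, `transition_kind("SR", "RR")` gives `"A"`, matching the automatic fall-back from SYNCHRONISED RUNNING to RECOVERING RUNNING on a synchronisation failure, while `manual_allowed("SS", "RS")` is false because that cell lists only an automatic transition.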
Changes to intro/wolfram.wiki.
︙ | ︙ | |||
187 188 189 190 191 192 193 194 195 196 197 198 199 200 | that have not been fully replicated, and upon seeing the missing node again, replays them. Also, a transaction may be flagged to "commit asynchronously", in which case it is simply replicated to every node without a distributed commit, meaning that it may appear on different nodes at different points in time. <h2>Distributed data model</h2> The use of distributed transactions means that, in any given group of currently-interconnected nodes, the shared state of a replicated entity will appear strongly consistent (except when asynchronous commits are used). However, the presence of link failures may cause | > > > | 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 | that have not been fully replicated, and upon seeing the missing node again, replays them. Also, a transaction may be flagged to "commit asynchronously", in which case it is simply replicated to every node without a distributed commit, meaning that it may appear on different nodes at different points in time. The distributed storage system cooperates closely with NITROGEN to manage the overall state of the node. <h2>Distributed data model</h2> The use of distributed transactions means that, in any given group of currently-interconnected nodes, the shared state of a replicated entity will appear strongly consistent (except when asynchronous commits are used). However, the presence of link failures may cause |
︙ | ︙ |