ARGON
Check-in [7ae135dc3e]
Overview
Comment: Added more detail to NITROGEN about failure modes and special cases, sorted out the reference linking, and clarified what states real-time tasks and device drivers run in.
SHA1: 7ae135dc3e3a31b1468636f2ecda61df6f960d97
User & Date: alaric 2012-11-30 13:00:42
Context
2012-12-06
13:29
Put in links to in-progress intro pages, so I remember they exist.

Elaborated somewhat on carbon and iodine. check-in: 5b2e596027 user: alaric tags: trunk

2012-11-30
13:00
Added more detail to NITROGEN about failure modes and special cases, sorted out the reference linking, and clarified what states real-time tasks and device drivers run in. check-in: 7ae135dc3e user: alaric tags: trunk
11:51
Mainly NITROGEN - documenting the node lifecycle state machine. Updated other sections to refer properly to it. Removed the bootstrap code from the ARGON page as it's all been eaten up by NITROGEN. check-in: 309ad96ecf user: alaric tags: trunk
Changes

Changes to intro/nitrogen.wiki.

Before this check-in:

NITROGEN is a kernel component that runs on every ARGON node and
implements the special "node entity" representing the node itself. The
node entity has no storage in [./tungsten.wiki|TUNGSTEN] like normal
entities as the node may not have any mass storage capability, entity,



and its behaviour is special-cased via a hook in
[./lithium.wiki|LITHIUM] that diverts requests to it straight to

NITROGEN, which runs as a kernel component rather than within the
normal sandboxed entity context.

As such, it has direct access to the hardware abstraction layer of
[./hydrogen.wiki|HYDROGEN]; the scheduling parameters and status
reporting from [./helium.wiki|HELIUM] and LITHIUM; the network stacks
of [./iridium.wiki|IRIDIUM], [./wolfram.wiki|WOLFRAM],
[./mercury.wiki|MERCURY] and [./fluorine.wiki|FLUORINE]; and the
storage management of TUNGSTEN and WOLFRAM.

The reason it is implemented this way, rather than as normal entity
handlers stored in TUNGSTEN but run with privileged access, is to
reduce the dependencies. If a mass storage system failure occurs so
TUNGSTEN cannot operate, then although the node entity's storage is
unavailable, the MERCURY interface to the node entity will still be
able to query the source of the problem from the entity and to tell it
................................................................................

<dl>

<dt>Non-running states</dt>

<dd>In the OFF, ADMIN and WIPING states, the system is in low-level
administrative modes and is not considered to be "running". Kernel
components above the level of HYDROGEN, HELIUM, IRON, CHROME and
NITROGEN are not running, and HELIUM's threading is disabled. Only one
CPU is active.</dd>

<dt>OFF</dt>

<dd>Either power is off, or the node is in the early stages of
booting. This state can be entered manually from any other state by
removing power from the node, or by requesting that the system remove
power from itself where the hardware permits, or if a reboot has been
................................................................................
space. Nodes that lack the hardware to power themselves off will go to
the ADMIN state if a software transition to OFF is requested.</dd>

<dt>ADMIN</dt>

<dd>A manual configuration state entered if there's a problem booting,
or upon a manual request to interrupt automatic booting, or entered

manually from any other system state. Only one CPU is active, with no
threading. The CPU is dedicated to providing the administrative
console interface. The reason for entering the state should be
available on the console, and options to repair the problem, alter
configuration, and so on, and continue on to ISOLATED STANDBY,
ISOLATED RUNNING, RECOVERING STANDBY, RECOVERING RUNNING or WIPING, or
to switch to the OFF state, at the administrator's command.</dd>

<dt>WIPING</dt>

<dd>This state is used to decommission the node. Only one CPU is
active, with no threading, and that one CPU proceeds to request a
secure wipe of the key storage area from HYDROGEN, then a fast wipe of

all attached TUNGSTEN storage volumes, then a secure wipe of all
attached TUNGSTEN storage volumes. If all that completes, an automatic
transition to OFF occurs (or ADMIN if the hardware does not allow a
software OFF). The console can be used to manually abort the wipe,
either by entering OFF or dropping into ADMIN mode.</dd>






<dt>ISOLATED states</dt>

<dd>In these states, the node is booted up and connected to the
cluster, but any TUNGSTEN local storage on the node is disabled. Nodes
without TUNGSTEN storage only have the ISOLATED states as their
"running states"; RECOVERING and SYNCHRONISED are not applicable to
them.</dd>



<dt>ISOLATED STANDBY</dt>

<dd>The node is remaining idle until administratively told to do
otherwise. From this state it can be told to switch OFF or go into
ADMIN, or to go to ISOLATED RUNNING to start LITHIUM, or into
RECOVERING STANDBY to start recovery or into RECOVERING RUNNING
to start both, or into WIPING to erase the node.</dd>

<dt>ISOLATED RUNNING</dt>

<dd>The node is accepting requests for LITHIUM from whatever kernel
components feel like generating them (MERCURY, real-time tasks, device
drivers, WOLFRAM, CAESIUM, etc). The local TUNGSTEN store (if any) is
not being kept up to date by WOLFRAM, so any access to entity data has
to be obtained from other nodes via WOLFRAM. From this state, it can
go to OFF or ADMIN (for a hard shutdown), to WIPING (for a hard wipe),
to RECOVERING RUNNING to start recovery, to RECOVERING STANDBY
(starting recovery but doing a hard stop of LITHIUM), to ISOLATED
STANDBY (for a hard stop) or to ISOLATED STOPPING (in which case a
desired target state must be chosen).</dd>

<dt>ISOLATED STOPPING</dt>

<dd>This state is used to leave the ISOLATED RUNNING state
cleanly. Unlike the direct transitions to OFF, ADMIN, ISOLATED
STANDBY, RECOVERING STANDBY or WIPING, which terminate all currently
running LITHIUM handlers immediately, the ISOLATED STOPPING state
................................................................................
<dd>In all of these states, WOLFRAM is attempting to bring the local
TUNGSTEN storage up to date with the cluster. These states may only be
entered by nodes with TUNGSTEN storage attached. Communication
failures with the rest of the cluster that prohibit recovery will
result in the node remaining in the same state, retrying, rather than
aborting to an ISOLATED state. Successful completion of recovery will
cause an automatic transition to a corresponding SYNCHRONISED
state.</dd>



<dt>RECOVERING STANDBY</dt>

<dd>Recovery is occurring without LITHIUM handlers being invoked. When
it is up to date, an automatic transition occurs to SYNCHRONISED
STANDBY. However, the recovery can be aborted by a manual transition
to ISOLATED STANDBY, OFF, ADMIN or WIPING; or aborted while turning
................................................................................

<dt>SYNCHRONISED states</dt>

<dd>These states can only be entered if WOLFRAM is satisfied that the
TUNGSTEN local storage is up to date through completing a RECOVERING
state. It can only be maintained while connectivity to the cluster
lets WOLFRAM be sure that the local TUNGSTEN storage is being kept
synchronised; in the event of failure, an automatic transition to
a corresponding RECOVERING state will occur.</dd>




<dt>SYNCHRONISED STANDBY</dt>

<dd>LITHIUM is not configured to start handlers in this state. Manual
transitions to SYNCHRONISED RUNNING, OFF, ADMIN, WIPING, ISOLATED
STANDBY or ISOLATED RUNNING are available; all but SYNCHRONISED RUNNING
will abandon the synchronisation, requiring recovery to get it back. A
................................................................................
available, M* = manual transition is available but will terminate any
currently running LITHIUM handlers, AM = automatic transition will
occur when required, or can be manually triggered, AM* = an automatic
transition will occur when all currently running LITHIUM handlers have
terminated, or a manual transition is available but will terminate any
currently running LITHIUM handlers.</p>






<table>
<tr>
<th>From</th>
<th>to O</th>
<th>to A</th>
<th>to IS</th>
<th>to IR</th>
................................................................................
<th>to SR</th>
<th>to SX</th>
<th>to W</th>
<th>Notes</th>
</tr>

<tr><th>O</th>
<td>-</td><td>AM</td><td>A</td><td>A</td><td></td><td>A</td><td>A</td><td></td><td></td><td></td><td></td><td></td><td></td></tr>
<tr><th>A</th>
<td>M</td><td>-</td><td>M</td><td>M</td><td></td><td>M</td><td>M</td><td></td><td></td><td></td><td></td><td>M</td>
<td>Administrative console is open.</td></tr>
<tr><th>IS</th>
<td>M</td><td>M</td><td>-</td><td>M</td><td></td><td>M</td><td>M</td><td></td><td></td><td></td><td></td><td>M</td><td></td></tr>
<tr><th>IR</th>
<td>M*</td><td>M*</td><td>M*</td><td>-</td><td>M</td><td>M*</td><td>M</td><td></td><td></td><td></td><td></td><td>M*</td>



After this check-in:

NITROGEN is a kernel component that runs on every ARGON node and
implements the special "node entity" representing the node itself. The
node entity has no storage in [./tungsten.wiki|TUNGSTEN] like normal
entities as the node may not have any mass storage capability and
[./wolfram.wiki|WOLFRAM] may not yet be available to allow access to
storage on other nodes in the cluster. Therefore, the normal LITHIUM
behaviour of handling requests to an entity by obtaining its code and
state from the distributed storage is special-cased via a hook

triggered by access to the node entity, that diverts requests to it
straight to NITROGEN, which runs as a kernel component rather than
within the normal sandboxed entity context.

As such, it has direct access to the hardware abstraction layer of
[./hydrogen.wiki|HYDROGEN]; the scheduling parameters and status
reporting from [./helium.wiki|HELIUM] and LITHIUM; the network stacks
of [./iridium.wiki|IRIDIUM], WOLFRAM, [./mercury.wiki|MERCURY] and
[./fluorine.wiki|FLUORINE]; and the storage management of TUNGSTEN and
WOLFRAM.
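
The request diversion described above can be pictured with a small sketch. It is only an illustration under stated assumptions: `NODE_ENTITY_ID`, `make_dispatcher`, `broken_storage` and `nitrogen` are hypothetical names, and the real LITHIUM hook is not specified in this text. The point it shows is that requests to the node entity never touch the storage path, so they still succeed when storage is unavailable.

```python
from typing import Callable

Handler = Callable[[str], str]

# Hypothetical identifier for the node entity; the real naming scheme is not given here.
NODE_ENTITY_ID = "node-entity"


class StorageUnavailable(Exception):
    """Raised by the stand-in storage layer when entity code cannot be fetched."""


def make_dispatcher(
    load_handler: Callable[[str], Handler],
    node_handler: Handler,
) -> Callable[[str, str], str]:
    """Build a dispatcher that special-cases the node entity."""

    def dispatch(entity_id: str, request: str) -> str:
        # The hook: requests to the node entity bypass storage entirely.
        if entity_id == NODE_ENTITY_ID:
            return node_handler(request)
        # Normal path: obtain the entity's handler from storage, then run it.
        return load_handler(entity_id)(request)

    return dispatch


def broken_storage(entity_id: str) -> Handler:
    """Stand-in for a failed mass storage system."""
    raise StorageUnavailable(entity_id)


def nitrogen(request: str) -> str:
    """Stand-in for the kernel component that implements the node entity."""
    return f"NITROGEN handled {request}"


dispatch = make_dispatcher(broken_storage, nitrogen)
print(dispatch(NODE_ENTITY_ID, "status"))  # prints "NITROGEN handled status"
```

With the stand-in storage failing on every lookup, a request to any other entity raises `StorageUnavailable`, while the node entity keeps answering.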

The reason it is implemented this way, rather than as normal entity
handlers stored in TUNGSTEN but run with privileged access, is to
reduce the dependencies. If a mass storage system failure occurs so
TUNGSTEN cannot operate, then although the node entity's storage is
unavailable, the MERCURY interface to the node entity will still be
able to query the source of the problem from the entity and to tell it
................................................................................

<dl>

<dt>Non-running states</dt>

<dd>In the OFF, ADMIN and WIPING states, the system is in low-level
administrative modes and is not considered to be "running". Kernel
components above the level of HYDROGEN, HELIUM, [./iron.wiki|IRON],
CHROME and NITROGEN are not running, and HELIUM's threading is
disabled. Only one CPU is active.</dd>

<dt>OFF</dt>

<dd>Either power is off, or the node is in the early stages of
booting. This state can be entered manually from any other state by
removing power from the node, or by requesting that the system remove
power from itself where the hardware permits, or if a reboot has been
................................................................................
space. Nodes that lack the hardware to power themselves off will go to
the ADMIN state if a software transition to OFF is requested.</dd>

<dt>ADMIN</dt>

<dd>A manual configuration state entered if there's a problem booting,
or upon a manual request to interrupt automatic booting, or entered
manually from any other system state, or automatically upon critical
system failure from any other system state. Only one CPU is active,
with no threading. The CPU is dedicated to providing the
administrative console interface. The reason for entering the state
should be available on the console, and options to repair the problem,
alter configuration, and so on, and continue on to ISOLATED STANDBY,
ISOLATED RUNNING, RECOVERING STANDBY, RECOVERING RUNNING or WIPING, or
to switch to the OFF state, at the administrator's command.</dd>

<dt>WIPING</dt>

<dd>This state is used to decommission the node. Only one CPU is
active, with no threading, and that one CPU proceeds to request a
secure wipe of the key storage area from HYDROGEN, a wipe of the rest
of the configuration and bootstrap space, then a fast wipe of all
attached TUNGSTEN storage volumes, then a secure wipe of all attached
TUNGSTEN storage volumes. If all that completes, an automatic
transition to OFF occurs (or ADMIN if the hardware does not allow a
software OFF). The console can be used to manually abort the wipe,
either by entering OFF or dropping into ADMIN mode. A post-WIPING node
will probably not be able to escape the OFF state in future due to the
destruction of bootstrap code and configuration, and the lack of
cryptographic data in the key storage area; it might at best make it
into ADMIN, but a reinstallation will probably be required to get it
booting again.</dd>

<dt>ISOLATED states</dt>

<dd>In these states, the node is booted up and connected to the
cluster, but any TUNGSTEN local storage on the node is disabled. Nodes
without TUNGSTEN storage only have the ISOLATED states as their
"running states"; RECOVERING and SYNCHRONISED are not applicable to
them. Real-time tasks and device drivers are running in these states,
but may be operating in a degraded mode of some kind outside of
ISOLATED RUNNING, due to LITHIUM being inactive.</dd>

<dt>ISOLATED STANDBY</dt>

<dd>The node is remaining idle until administratively told to do
otherwise. From this state it can be told to switch OFF or go into
ADMIN, or to go to ISOLATED RUNNING to start LITHIUM, or into
RECOVERING STANDBY to start recovery or into RECOVERING RUNNING
to start both, or into WIPING to erase the node.</dd>

<dt>ISOLATED RUNNING</dt>

<dd>The node is accepting requests for LITHIUM from whatever kernel
components feel like generating them (MERCURY, real-time tasks, device
drivers, WOLFRAM, [./caesium.wiki|CAESIUM], etc). The local TUNGSTEN
store (if any) is not being kept up to date by WOLFRAM, so any access
to entity data has to be obtained from other nodes via WOLFRAM. From
this state, it can go to OFF or ADMIN (for a hard shutdown), to WIPING
(for a hard wipe), to RECOVERING RUNNING to start recovery, to
RECOVERING STANDBY (starting recovery but doing a hard stop of
LITHIUM), to ISOLATED STANDBY (for a hard stop) or to ISOLATED
STOPPING (in which case a desired target state must be chosen).</dd>

<dt>ISOLATED STOPPING</dt>

<dd>This state is used to leave the ISOLATED RUNNING state
cleanly. Unlike the direct transitions to OFF, ADMIN, ISOLATED
STANDBY, RECOVERING STANDBY or WIPING, which terminate all currently
running LITHIUM handlers immediately, the ISOLATED STOPPING state
................................................................................
<dd>In all of these states, WOLFRAM is attempting to bring the local
TUNGSTEN storage up to date with the cluster. These states may only be
entered by nodes with TUNGSTEN storage attached. Communication
failures with the rest of the cluster that prohibit recovery will
result in the node remaining in the same state, retrying, rather than
aborting to an ISOLATED state. Successful completion of recovery will
cause an automatic transition to a corresponding SYNCHRONISED
state. Real-time tasks and device drivers are running in these states,
but may be operating in a degraded mode of some kind outside of
RECOVERING RUNNING, due to LITHIUM being inactive.</dd>

<dt>RECOVERING STANDBY</dt>

<dd>Recovery is occurring without LITHIUM handlers being invoked. When
it is up to date, an automatic transition occurs to SYNCHRONISED
STANDBY. However, the recovery can be aborted by a manual transition
to ISOLATED STANDBY, OFF, ADMIN or WIPING; or aborted while turning
................................................................................

<dt>SYNCHRONISED states</dt>

<dd>These states can only be entered if WOLFRAM is satisfied that the
TUNGSTEN local storage is up to date through completing a RECOVERING
state. It can only be maintained while connectivity to the cluster
lets WOLFRAM be sure that the local TUNGSTEN storage is being kept
synchronised; in the event of failure, an automatic transition to a
corresponding RECOVERING state will occur. Real-time tasks and device
drivers are running in these states, but may be operating in a
degraded mode of some kind outside of SYNCHRONISED RUNNING, due to LITHIUM
being inactive.</dd>

<dt>SYNCHRONISED STANDBY</dt>

<dd>LITHIUM is not configured to start handlers in this state. Manual
transitions to SYNCHRONISED RUNNING, OFF, ADMIN, WIPING, ISOLATED
STANDBY or ISOLATED RUNNING are available; all but SYNCHRONISED RUNNING
will abandon the synchronisation, requiring recovery to get it back. A
................................................................................
available, M* = manual transition is available but will terminate any
currently running LITHIUM handlers, AM = automatic transition will
occur when required, or can be manually triggered, AM* = an automatic
transition will occur when all currently running LITHIUM handlers have
terminated, or a manual transition is available but will terminate any
currently running LITHIUM handlers.</p>

<p>This table does not show the fact that the system may automatically
go into the ADMIN state from any other state in the event of a system
failure, as it's implicit and just made the table look a bit
messier. I left it out as it's an exceptional case.</p>
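
<p>The manual transitions spelled out in the state descriptions for
ADMIN, ISOLATED STANDBY and ISOLATED RUNNING can be sketched as a
lookup table. This is a minimal sketch covering only those three
source states; the enum and the function names are assumptions made
for illustration, not part of NITROGEN's interface.</p>

```python
from enum import Enum, auto


class State(Enum):
    OFF = auto()
    ADMIN = auto()
    WIPING = auto()
    ISOLATED_STANDBY = auto()
    ISOLATED_RUNNING = auto()
    ISOLATED_STOPPING = auto()
    RECOVERING_STANDBY = auto()
    RECOVERING_RUNNING = auto()


S = State

# Manual transitions named in the prose for three source states only.
MANUAL = {
    S.ADMIN: {S.ISOLATED_STANDBY, S.ISOLATED_RUNNING, S.RECOVERING_STANDBY,
              S.RECOVERING_RUNNING, S.WIPING, S.OFF},
    S.ISOLATED_STANDBY: {S.OFF, S.ADMIN, S.ISOLATED_RUNNING,
                         S.RECOVERING_STANDBY, S.RECOVERING_RUNNING, S.WIPING},
    S.ISOLATED_RUNNING: {S.OFF, S.ADMIN, S.WIPING, S.RECOVERING_RUNNING,
                         S.RECOVERING_STANDBY, S.ISOLATED_STANDBY,
                         S.ISOLATED_STOPPING},
}

# Direct exits from ISOLATED RUNNING that terminate running LITHIUM
# handlers immediately (the M* entries in the IR row).
HARD_FROM_ISOLATED_RUNNING = {S.OFF, S.ADMIN, S.ISOLATED_STANDBY,
                              S.RECOVERING_STANDBY, S.WIPING}


def can_transition(src: State, dst: State) -> bool:
    """True if the prose lists a manual transition from src to dst."""
    if src not in MANUAL:
        raise ValueError(f"{src.name} is not covered by this sketch")
    return dst in MANUAL[src]


def terminates_handlers(src: State, dst: State) -> bool:
    """True if the transition kills currently running LITHIUM handlers."""
    return (can_transition(src, dst)
            and src is S.ISOLATED_RUNNING
            and dst in HARD_FROM_ISOLATED_RUNNING)
```

<p>For example, <code>terminates_handlers(State.ISOLATED_RUNNING,
State.OFF)</code> is true, while the move to ISOLATED STOPPING is
allowed but leaves the handlers running.</p>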

<table>
<tr>
<th>From</th>
<th>to O</th>
<th>to A</th>
<th>to IS</th>
<th>to IR</th>
................................................................................
<th>to SR</th>
<th>to SX</th>
<th>to W</th>
<th>Notes</th>
</tr>

<tr><th>O</th>
<td>-</td><td>M</td><td>A</td><td>A</td><td></td><td>A</td><td>A</td><td></td><td></td><td></td><td></td><td></td><td></td></tr>
<tr><th>A</th>
<td>M</td><td>-</td><td>M</td><td>M</td><td></td><td>M</td><td>M</td><td></td><td></td><td></td><td></td><td>M</td>
<td>Administrative console is open.</td></tr>
<tr><th>IS</th>
<td>M</td><td>M</td><td>-</td><td>M</td><td></td><td>M</td><td>M</td><td></td><td></td><td></td><td></td><td>M</td><td></td></tr>
<tr><th>IR</th>
<td>M*</td><td>M*</td><td>M*</td><td>-</td><td>M</td><td>M*</td><td>M</td><td></td><td></td><td></td><td></td><td>M*</td>