Artifact e6d707a493c11bf75c41bc7fb8bb2c00e0eceb0a:
- File README.wiki — part of check-in [d2367e4f41] at 2013-08-16 15:01:34 on branch trunk — Better examples all round (user: alaric size: 28819)
Magic Pipes is a suite of tools to construct powerful Unix shell pipelines that operate on structured data.
Conventional shell pipeline tools - grep, sed, awk, and friends - work on lines, and have rather crude support for handling structure within the lines. This makes them fine for dealing with line-oriented data with simple structure within, but dealing with complex structure within the lines quickly descends into a hell of fragile separator handling; the shell pipelines become more minutae than meat.
Magic Pipes aims to fix that. The Magic Pipes tools read from standard input and write to standard output so they can be combined into pipelines, but rather than dealing with a line at a time, they deal with an s-expression at a time.
The tools fall into a few groups; input tools generate s-expressions from other data formats (CSV files, for instance, or traditional line-oriented data) or sources (directory or process listings). Output tools convert s-expressions into other formats. Data processing tools map s-expression to s-expressions in useful ways. Interface tools transform s-expressions by consulting external data sources, such as database files or Web services.
The original inspiration was my blog post at http://www.snell-pym.org.uk/archives/2009/06/25/magic-pipes/, but the spec has been refined since then.
General usage
In general, Magic Pipes tools read s-expressions from standard input and write them to standard output (although many do only one or the other of those, rather than both). They process the s-expressions one at a time, based on the command-line arguments, which will often include snippets of Scheme source code that evaluate to procedures which are applied to input s-expressions or intermediate results.
Supplied Scheme code has access to a rich standard environment, including:
- R5RS Scheme
- Core Chicken extensions
- The Chicken data-structures unit
- SRFI-1 (list utilities)
- SRFI-13 (string utilities)
- SRFI-69 (hash tables)
- alist-lib
- (mplog format args...) uses Chicken's printf formatting system to output strings to the standard error port.
In addition, any Chicken standard module or installed egg can easily be used by referencing the module name, and utility code can be included from files. The following command-line arguments can be used with any Magic Pipes tool to affect the environment in which supplied Scheme code is run, and to execute code before normal processing:
- -u module Use the specified unit.
- -d expr</> Evaluate the supplied expression.
- -i filename Evaluate the contents of the supplied file.
The options take effect in the order they are supplied. Code executed as a result of them has access to the standard input, output and error ports, so they can read some of the input before normal processing occurs, and likewise generate output before normal processing. Any definitions imported or created due to the use of the above options are available to the code in subsequent options from the above, and are available to user code executed during normal processing.
Finally, it is possible to run code after normal processing:
- -e expr Evaluate the supplied expression.
As you might expect, multiple expressions are executed in the order supplied, can access any definitions generated or imported by earlier code, and have access to standard output and error; standard input will have all been used up by this point, so attempting to read from it is an error.
Generally, user code executed during normal processing only has access to the standard error port; attempts to read from standard input or write to standard output are errors.
Data s-expressions read to and from standard input and output are read using a special reader that disables non-standard or unsafe read syntax. Also, srfi-38 syntax is read and written to express shared structure.
The Tools
Data processing
mpmap procedure-expr
Reads s-expressions from standard input, and applies the procedure expression procedure-expr to each of them. The results are then written to standard output. The procedure expression may return an arbitrary number of values; all non-undefined values are output in order (separated by a single space), and a newline is written after the values from each application of the procedure expression. It is legal for the procedure expression to return zero values.
Examples
Basic operation
$ echo '"Alaric" "was" "here"' | mpmap string-reverse "ciralA" "saw" "ereh"
Returning multiple values
$ echo "1 2 3" | mpmap '(lambda (x) (values x (+ x 1)))' 1 2 2 3 3 4
Note how the two values returned for each input s-expr are output on the same line, for clarity.
Defining utilities with -d
$ echo "1 2 3" | mpmap -d '(define (foo x) (values x (+ x 1)))' 'foo' 1 2 2 3 3 4
Using procedures from Chicken modules
$ echo "1 2 3" | mpmap -u srfi-4 'make-u8vector' #u8(112) #u8(74 61) #u8(136 51 131)
Omitting undefined values
$ echo "#t #f #t" | mpmap '(lambda (x) (values x (if x "Yes" (void)) x))' #t "Yes" #t #f #f #t "Yes" #t
Omitting entire inputs
$ echo "1 2 3 4 5 6" | mpmap '(lambda (x) (if (even? x) x (values)))'2
4
6
When we produce zero values with values, only a blank line is output.
mpfilter predicate-expr
Read s-expressions from standard input, applies the procedure-expr to them, and writes the s-expressions to standard output if the predicate returned a true value. The predicate may write to standard error, but standard input and output are inaccessible.
Examples
$ echo "1 2 3 4 5 6" | mpfilter 'even?' 2 4 6
mpfold [-n integer] [-a expr] [-t] [-o expr] expr [expr]
The first expression evaluates to a two-argument procedure, the second (the initial accumulator) to any value; an unspecified value is the default if none is specified. (-o) specifies a single-argument output procedure; the identity function is the default.
Applies the procedure to each s-expression from the input in turn, with the current accumulator as the second argument. At the end, outputs the result of applying the output procedure to the final accumulator.
If (-n) is specified, then rather than just the single "current input s-expression", the first expression's procedure is passed the specified number of previous s-expressions as well, in extra arguments. (-n) defaults to 0. The first call to the procedure will be performed when sufficient s-expressions have been read (eg, (-n)+1 s-expressions) unless (-a) is specified, when the procedure will be called for every s-expression, with the result of the expression passed to (-a) used as a "default" for slots that cannot yet be filled.
If (-t) is specified, then the current, and the specified number of previous, sexprs are passed as a single list, rather than as separate arguments to the procedure. The procedure is called for every input s-expression, with the list being truncated if sufficient previous values are not available; (-a) is ignored.
current-input-port is banned, but current-error-port and current-output-port are accessible.
Examples
Basic operation
$ echo "1 2 3" | mpfold + 0 6
Using -o
$ echo "1 2 3" | mpfold -o 'number->string' + 0 "6"
$ echo "1 2 3" | mpfold -o '(lambda (x) (* x 2))' + 0 12
Using mplog to demonstrate how the fold-proc is called
$ echo "1 2 3 4 5 6" | mpfold '(lambda x (mplog "~S" x) 0)' 0 (1 0) (2 0) (3 0) (4 0) (5 0) (6 0) 0
The fold-proc is called with two arguments - the current input s-expr and the accumulator value.
Adding context with -n
$ echo "1 2 3 4 5 6" | mpfold -n 2 '(lambda x (mplog "~S" x) 0)' 0 (3 2 1 0) (4 3 2 0) (5 4 3 0) (6 5 4 0) 0
Note how the fold-proc is not called for 1 or 2; the first value passed to it is 3 with 2 and 1 as the two requested context items. The fold-proc is now called with four arguments; the current item, the previous two items, and the accumulator value.
Using -a as the default context
$ echo "1 2 3 4 5 6" | mpfold -n 2 -a '#f' '(lambda x (mplog "~S" x) 0)' 0 (1 #f #f 0) (2 1 #f 0) (3 2 1 0) (4 3 2 0) (5 4 3 0) (6 5 4 0) 0
Note how the fold-proc is now called for all the input sexprs, with #f filling in for "missing context".
Using -t to pass a list into the fold-proc
$ echo "1 2 3 4 5 6" | mpfold -n 2 -t '(lambda x (mplog "~S" x) 0)' 0 ((1) 0) ((2 1) 0) ((3 2 1) 0) ((4 3 2) 0) ((5 4 3) 0) ((6 5 4) 0) 0
The fold-proc now receives two arguments again; the first is a list containing the current input sexpr and up to 2 (as requested by -n) previous ones. We are invoked for every input s-expression, and the list is simply shorter if there have not been two previous sexprs yet.
Moving average with final mean
$ echo "3 7 2 4 6 4 1" | mpfold -n 2 \ -o '(lambda (x) `(mean ,(/ (cdr x) (car x))))' \ '(lambda (x h1 h2 acc) (write `(moving-mean ,(/ (+ x h1 h2) 3))) (newline) (if acc (cons (+ (car acc) 1) (+ (cdr acc) x)) (cons 3 (+ x h1 h2))))' '#f' (moving-mean 4) (moving-mean 4.33333333333333) (moving-mean 4) (moving-mean 4.66666666666667) (moving-mean 3.66666666666667) (mean 3.85714285714286)
Computing the final mean correctly when using -n without -a is a little complicated.
Here's a simpler way:
$ echo "3 7 2 4 6 4 1" | mpfold -n 2 -a '#f' \ -o '(lambda (x) `(mean ,(/ (cdr x) (car x))))' \ '(lambda (x h1 h2 acc) (when (and h1 h2) (write `(moving-mean ,(/ (+ x h1 h2) 3))) (newline)) (cons (+ (car acc) 1) (+ (cdr acc) x)))' '(0 . 0)' (moving-mean 4) (moving-mean 4.33333333333333) (moving-mean 4) (moving-mean 4.66666666666667) (moving-mean 3.66666666666667) (mean 3.85714285714286)
mpsort (FIXME)
mpgroup (FIXME)
mptree (FIXME)
mprandom (FIXME)
mpshuffle (FIXME)
mphead (FIXME)
mptail (FIXME)
mpsxpath (FIXME)
Input
mpps (FIXME)
mpls (FIXME)
mpre {sre|-p pcre] [-o output-proc]
The argument, if present, must be a valid SRE; or, if (-p) is used, a POSIX regexp. If not present, it defaults to "(seq bos (* any) eos)"
Reads in lines of text from stdin and converts them to s-expressions by applying the regular expression. Lines that do not match the regexp are ignored.
If (-o) is specified, then the expression must be a single-argument procedure which is applied to each irregex match object to generate the output s-expression. If not, then a default is used which has the following behaviour:
If the regexp has no captures, then the entire matching string is returned.
If it has only numbered submatches, then a list of the submatches is returned.
If it has named (and maybe also numbered) submatches, then an alist of them is returned, with names used where available and numbers where not.
Examples
Default output, numbered submatches
$ printf 'Hello world\nHello world\nDear bob, I love you\n' |\ mpre '(seq ($ (* any)) "" ($ (* any)) "" ($ (* any)))' ("Hello " "world" "") ("Dear " "bob" ", I love you")
Default output, named submatches
$ printf 'Hello world\nHello world\nDear bob, I love you\n' |\ mpre '(seq (=> pre (* any)) "" (=> bracketed (* any)) "" (=> post (* any)))' ((pre . "Hello ") (bracketed . "world") (post . "")) ((pre . "Dear ") (bracketed . "bob") (post . ", I love you"))
Parsing output of ls (as if we didn't have mpls)
$ ls -l *.scm | \ mpre '(seq (=> perms (= 10 any)) (* space) (=> linkcount integer) (* space) (=> user symbol) (* space) (=> group symbol) (* space) (=> size integer) (* space) (=> modtime symbol (* space) integer (* space) integer ":" integer) (* space) (=> name (* any)))' ((perms . "-rw-r--r--") (linkcount . "1") (user . "alaric") (group . "users") (size . "7397") (modtime . "Aug 16 14:30") (name . "magic-pipes.scm")) ((perms . "-rw-r--r--") (linkcount . "1") (user . "alaric") (group . "users") (size . "5340") (modtime . "Aug 16 14:54") (name . "mpfold.scm")) ((perms . "-rw-r--r--") (linkcount . "1") (user . "alaric") (group . "users") (size . "2844") (modtime . "Aug 16 15:36") (name . "mpre.scm"))
mpjson-read (FIXME)
mpxml-read (FIXME)
mpcsv-read (FIXME)
Output
mpforeach (FIXME)
mpjson-write (FIXME)
mpxml-write (FIXME)
mpcsv-write (FIXME)
mpprintf (FIXME)
Interface
mplookup (FIXME)
mplookup-set (FIXME)
mplookup-delete (FIXME)
mplookup-dump (FIXME)
Parallelism
mpfork (FIXME)
Not yet implemented things
mpmerge, mpjoin, mpcogroup?
Do we need these more advanced operators from the database world, or can they be done in other ways?
mpmerge would need to accept a list of file names and read from them all (possibly including standard input as well), comparing already-sorted input elements using a supplied comparison expression similar to mpsort, and output the results in merged order.
mpcogroup would also accept a list of input file names (possibly including standard input) and, for each, an expression mapping an s-expression to a join key value. For each distinct join key value in the entire input, it would output a list starting with the join key value, followed by a (possibly empty) list of matching s-expressions from each input file in order.
mpjoin would work much like mpcogroup, except that the output would consist of the cross product of each group. Each s-expression in the output would be a list with the join key value followed by one element per input file, containing an s-expression from each file that produced the same join expression.
mpcogroup/mpjoin might build up a hash table internally, then if it reaches a certain limiting size, write it to a temporary sqlite file and then continue writing into that until it's time to generate output.
Safe reader
File objects in runtime module
These should include a set of tools for manipulating "directory entry alists" as created by mpls, but which will crop up elsewhere in pipelines fed from an upstream mpls. These alists always have the file name in the first element, followed by zero or more extra things about the file; a set of procedures accept such an alist, or a raw file name, as the first argument and return information about the file. If the required information is in the alist it is returned as-is, otherwise it is computed or otherwise obtained and the alist mutated to insert the information (which is why the filename, which never changes, is always first).
There is an accessor for the file's name (which may be a relative path), the absolute path, the directory name, the file name without the path, the file name without path or extension, and the extension; and for everything returned by stat(2). There is also a procedure that maps a file information alist (or just a raw string path) to a file information alist in order to canonicalise raw paths out, and a similar procedure which always returns an alist with an absolute path.
It is perfectly valid to add "custom" keys to the alist; Magic Pipes will always prefix its own with "mp", to avoid namespace issues.
A few utilities such as "(file-older <file> <age>)" can be built on top of basic tools such as "file-creation-time", too. To help with inputting ages (which are going to be in seconds), provide "(days <N>) -> (* 86400 N)" and similar for minutes, seconds, weeks, etc.
Useful UNIX information procedures in runtime library
uid->username (see posix unit) username->uid gid->groupname groupname->gid ip->hostnames (see hostinfo egg) hostname->ips get-environment-variable (alised to $)
** mptree <id-expr> <parent-expr> <children-expr> <output-expr>
Reads input s-expressions and organises them into a tree. For each s-expression, the single-argument procedures that the first three expressions evaluated to are called, yielding an identifier for the s-expression, an identifier for its parent (or #f if it cannot be obtained), and a list of identifiers of its children (or '() or #f if they cannot be obtained). Using what information becomes available, parent/child relationships are found between the s-expressions, forming one or more trees. If conflicts arise (multiple parents for the same s-expression), an error is signalled and processing stops. If no errors occur, then a set (hopefully singleton) of roots (s-expressions with no parents) is found, each at the head of a nice tree.
If the output expression is supplied, then it is applied to each tree in turn (in some arbitrary order). The trees are represented by "node" record instances, which have the following accessors:
- (node-id NODE) returns the ID of the node.
- (node-data NODE) returns the s-expression.
- (node-parent NODE) returns the parent node (or #f).
- (node-children NODE) returns a list of child nodes.
(These come from a magic-pipes-runtime-tree module which is automatically loaded.)
If no output expression is supplied, a default one is used which renders the nodes as s-expressions with the node data as the first element and the children thereafter, indented neatly to show the structure.
** mpsort -c -r -p <integer> <expr> [<expr>]
The first expression must produce a two-argument comparison procedure, and defaults to "smart<" if none is present. The second expression must produce a single-argument key extraction procedure, which defaults to the identity.
Reads in all the expressions from the input, sorts them by applying the comparison procedure to the results of applying the extraction procedure to the expressions, then returns the result.
If (-c) is specified, then the extraction procedure is assumed to be expensive, and its result computed and cached at load time.
If (-r) is specified, then the sort order is reversed.
Provide smart< and smart> procedures, which compare things in a type-agnostic way: < for numbers, string< for strings, recursive testing for pairs and vectors.
As usual, the procedures have no access to current input or output ports, but can write to the error port.
If (-p) is specified, then rather than sorting in-memory, we instead start the specified number of threads, each of which reads sexpressions from a bounded FIFO and sends them to a child mpsort process. A master thread then reads sexpressions from standard input and round-robins them to the FIFOs, skipping any FIFOs that are "full" and blocking if they all are. Each child process also has a reader thread that reads its sorted output and loads them into another FIFO, and a final output thread merges the sorted FIFO outputs into a final sorted output to standard output. #!eof is used as a marker in the FIFOs to record the actual end of the file, to distinguish EOF from an empty FIFO due to the source not having produced anything yet.
Or do we make a separate mpmerge tool that takes a list of filenames on the command line along with extract and compare procedure expressions, and invoke that using a set of FIFOs which the sub-mpsorts feed out to?
Is it worth having an option to go multi-machine by running mpsort from inetd (perhaps in parallel mode to use multiple cores) on remote machines and parallelising via TCP rather than running a child process? That would be kind of cool and not too hard.
Or for huge sorts (where there's not enough memory available), we could have a flag that splits the input into temporary files of up to a certain size, sorts them individually one by one, then merges the results together.
** mplookup -F {lookup <map file>|revlookup <map file>}
It would be convenient to have a simple command-line tool to handle look-up tables, mapping one s-expression to one or more other s-expressions. By default, each output s-expression is a list of results from the corresponding input s-expression, which is empty if the there is no mapping. If (-f) is specified then the first result is returned only, not wrapped in a list, and #f used if there is none. If (-F) is used then the first result is returned, and #f if there is none or more than one.
File type detection is performed on the map file. There is support for sqlite databases in a special format (ending .mbm; magic binary map), or plain text files with a sequence of (<key> . <value>) pairs (ending .msm; magic sexpr map) or /etc/aliases format files (default), which are treated as string->string mappings.
*** mplookup-set <map file> <expr1> <expr2>
In the given map file (which, if nonexistant, is created), set expr1 to map to expr2.
If the exprs are omitted, then sexprs are read from standard input, and must be pairs, the first element of which is treated as expr1 and the second as expr2, and are all set into the map in order.
*** mplookup-delete <map file> <expr>
Deletes the given mapping from the given map file. If the expression is omitted, then expressions are read from stdin and removed from the map file. If the map file does not exist, an error is raised.
*** mplookup-dump <map file>
Spits out the contents of the map file as a sequence of pairs, with the car being the key and the cdr the value. This can be piped into mplookup-set to effect map file format conversions.
** FIXME: mprandom ???
Take random samples of the input - either pick any s-expression with a given chance, or read all the s-expressions into RAM and pick N at random
** FIXME: mpshuffle ???
Read input s-expressions into a list, shuffle, and output the result.
** FIXME: mpsysinfo ???
Can we portably get hold of system info tables - mounted filesystems, network interfaces and addresses, process, memory usage, load average, that sort of thing - and present a tool or suite of tools that turns them into s-expressions on stdout? That would be handy for system administration scripting.
** mpps -u <user>... -p <pid>... -x <executable>... -c Outputs an information alist for every process matching the specified criteria.
(-u) restricts output to processed owned by a specific user (or users, if multiple instances of the option are used). Otherwise, all user's processes are listed.
(-p) restricts output to the specific pid listed. Again, multiple instances can be used to restrict to a set of pids.
(-x) restricts output to the specific executable listed. Once more, multiple instances can be used to restrict to a set of executables.
If the above are combined in any way, then the different restrictions are logically ANDed togather.
If (-c) is specified, then any process which is a child of a process matched by the above restrictions is also included.
The process alist contains information from /proc, such as
mpname mpexecutable mpcwd mpcmdline mppid mpppid mpuid mpgid mppgrp mpsid mpnice mpvsize mprss mpwchan
** FIXME: mphead/mptail ??? Picking (all but) the first or last N s-expressions is a useful operation. ** mpflatten
Reads input s-expressions, and if they are lists, writes the elements of the list as separate s-expressions, otherwise writes them as-is.
** mpfork -x <integer>... Runs the given list of shell commands in parallel, distributing input s-expressions to them atomically, and atomically merging their output s-expressions to standard output. If any commands terminate before their input is closed, mpfork terminates with an error.
(-x) specifies a multiplier factor; subsequent commands are "repeated" that many times. (-x) defaults to 1, in practice.
Implementation: a pair of threads is spawned for each command, one for input and one for output (standard error is left untouched). Each thread has a single-sexpr buffer.
A master input thread reads s-exprs from standard input and places them in the first empty input buffer in the list of command input threads, in round-robin fashion, blocking if none are available.
A master output thread blocks until at least one output buffer is full, then scans in round-robin fashion to find and empty it to standard output.
Once input is closed, all the subprocess standard inputs are closed; and once all the subprocesses have terminated, mpfork terminates. ** mpgroup -a -t -l <expr>
The expression must be a single-argument procedure. It is applied to each input s-expression to obtain a "key" for each input s-expression.
As usual, the procedure has no access to current input or output ports, but can write to the error port.
If (-a) is specified, then the s-expressions are accumulated in memory by their keys, into a hashtable. If (-f) is specified, the only the first s-expression for each key is kept; if (-l) is specified, the only the last is kept. At the end, the hash table is written out; if (-t) is specified, it is written as one list per key, the first element being the key value and the rest being the s-expressions with that key. If (-t) is not specified, then it is just one list per key, but without the key as the first element. The order of the keys listed in undefined, but if neither (-f) nor (-l) are specified, the s-expressions within a key are in the order they were read.
If (-a) is not specified, then the s-expressions are not accumulated and spat out in a single batch; instead, they are output in the same order that they were read in, but grouped into lists of s-expressions having the same key in a contiguous run. If (-t) is specified, the key value is prepended to the list. If (-f) is specified, then only the first s-expression in each run of the same key value is listed (and if (-t) is not specified, then it is output as-is rather than as a single-element list). Likewise, if (-l) is specified, the only the last s-expression in each run with of the same key value is listed, and unless (-t) is specified, it's written as-is without a single-element list enclosing it.
** mpforeach <expr>...
Run the supplied Scheme procedure(s) on each s-expression from the input. Ignore anything returned, and the Scheme procedure can access stdout/stderr if required, but has no access to stdin.
** FIXME: mpjson ??? Convert to/from JSON representation ** FIXME: mpxml ??? Parse xml->sxml, html->sxml (htmlprag), or sxml->xml ** FIXME: mpsxpath ??? Read in sexprs from stdin, treating them as sxml, and output sexprs matching the supplied sxpath/xpath selector. Each input sexpr might map to several output sexprs if there's multiple matches; have a flag to wrap each input's outputs in a list. ** mpprintf -n <string>...
Calls "printf" on each input sexpr, with the arguments (concatenated with spaces) as the format string. Appends a newline unless (-n) is specified.
** mpls -R <expr> -f <expr>... <filename>...
Write an "ls"-equivalent tool that outputs alists containing information about each file, and optional filter expression(s) (-f) which are ANDed together. By default, the filter accepts all files.
Give it the option (-r) to recurse, or (-R <expr>) to apply the procedure to each directory encountered; recursion within will occur if the expression is true. (-r) is really (-R 'any?').
Takes an optional list of files on the command line to just list those, a la "ls".
In the expressions, current-input-port and current-output-port are banned, but current-error-port is accessible.
mpls -r -f 'regular-file?' -f '(lambda (file) (file-older file (days 5)))' | mpforeach 'delete-file' ** FIXME: Some nice output plugins? mpprintf is a start, but something feeding into the fmt egg, or some kind of sxml template generation for HTML, might be cool in order to make report generation easier. Perhaps wrappers for GraphViz and gnuplot might be useful (converting sexpr syntax into that tool's native syntax).