Overview
Recently Ken Fallon did a show on HPR, number 3962, in which he used a Bash pipeline of multiple commands feeding their output into a while loop. In the loop he processed the lines produced by the pipeline and used what he found to download audio files belonging to a series with wget.
This was a great show and contained some excellent advice, but the use of the format:
    pipeline | while read variable; do ...
reminded me of the "gotcha" I mentioned in my own show 2699.
I thought it might be a good time to revisit this subject.
So, what's the problem?
The problem can be summarised as a side effect of pipelines.
What are pipelines?
Pipelines are an amazingly useful feature of Bash (and other shells). The general format is:
    command1 | command2 ...
Here command1 runs in a subshell and produces output (on its standard output) which is connected via the pipe symbol (|) to command2, where it becomes that command's standard input. Many commands can be linked together in this way to achieve some powerful combined effects.
A very simple example of a pipeline might be:
    $ printf 'World\nHello\n' | sort
    Hello
    World
The printf command (≡'command1') writes two lines (separated by newlines) on standard output and this is passed to the standard input of the sort command (≡'command2'), which then sorts these lines alphabetically.
Commands in the pipeline can be more complex than this, and in the case we are discussing we can include a loop command such as while.
For example:
    $ printf 'World\nHello\n' | sort | while read line; do echo "($line)"; done
    (Hello)
    (World)
Here, each line output by the sort command is read into the variable line in the while loop and is written out enclosed in parentheses.
Note that the loop is written on one line. The semi-colons are used instead of the equivalent newlines.
Variables and subshells
What if the lines output by the loop need to be numbered?
    $ i=0; printf 'World\nHello\n' | sort | while read line; do ((i++)); echo "$i) $line"; done
    1) Hello
    2) World
Here the variable 'i' is set to zero before the pipeline. It could have been done on the line before of course. In the while loop the variable is incremented on each iteration and included in the output.
You might expect 'i' to be 2 once the loop exits but it is not. It will be zero in fact.
The reason is that there are two 'i' variables. One is created when it's set to zero at the start, before the pipeline. The other is a "clone" that exists in the subshell running the loop. The subshell starts with a copy of the parent shell's variables, so the expression:
    ((i++))
increments the copy, not the original.
When the subshell in which the loop runs completes, its version of 'i' is discarded and the original one will simply contain the zero that it was originally set to.
You can see what happens in this slightly different example:
    $ i=1; printf 'World\nHello\n' | sort | while read line; do ((i++)); echo "$i) $line"; done
    2) Hello
    3) World
    $ echo $i
    1
These examples are fine, assuming the contents of variable 'i' incremented in the loop are not needed outside it.
The thing to remember is that the same variable name used in a subshell is a different variable; it is initialised with the value of the "parent" variable but any changes are not passed back.
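The copy-in, nothing-back behaviour can be seen without a pipeline at all. The following is a minimal sketch, assuming only that parentheses are used to force a subshell; the starting value 5 is arbitrary and chosen for illustration.

```bash
#!/usr/bin/env bash
# Sketch: a subshell starts with a copy of the parent's variables,
# and changes made inside it are not passed back.

i=5

# The parentheses force a subshell, much as a pipeline element would.
( ((i++)); echo "inside the subshell: i=$i" )   # prints i=6

# The parent's variable was never touched.
echo "back in the parent:  i=$i"                # prints i=5
```

The subshell sees the value 5 because it inherits a copy, but its increment is lost when it exits.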
How to avoid the loss of changes in the loop
To solve this the loop needs to be run in the original shell, not a subshell. The pipeline which is being read needs to be attached to the loop in a different way:
    $ i=0; while read line; do ((i++)); echo "$i) $line"; done < <(printf 'World\nHello\n' | sort)
    1) Hello
    2) World
    $ echo $i
    2
What is being used here is process substitution. A list of commands or pipelines is enclosed in parentheses and a 'less than' sign is prepended to the list (with no intervening spaces). This is functionally equivalent to a (temporary) file of data.
The redirection feature (the first 'less than' sign in the example above) allows data to be read from a file in a loop. The general format of the command is:
    while read variable
    do
        # Use the variable
    done < file
Using process substitution instead of a file will achieve what is required if computations are being done in the loop and the results are wanted after it has finished.
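To make the "equivalent to a file" idea more concrete, here is a small sketch. It first prints what the shell substitutes for the construct, then sums some numbers in a loop that reads from it. The variable names and the sample numbers are illustrative assumptions, and the exact name printed varies between systems (on Linux it is commonly something like /dev/fd/63).

```bash
#!/usr/bin/env bash
# Sketch: <(...) expands to a file name connected to the output
# of the enclosed commands.

# Show what the shell substitutes; the name differs between systems.
echo "Process substitution expands to:" <(true)

# Because it behaves like a file name, the loop can be redirected from it
# and still runs in the current shell, so 'total' survives the loop.
total=0
while read -r n; do
    ((total += n))
done < <(printf '3\n4\n5\n')

echo "total=$total"   # prints total=12
```

The loop body here uses `read -r`, which stops backslashes in the input being interpreted; that is a common precaution rather than something this technique requires.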
Beware of this type of construct
The following one-line command sequence looks similar to the version using process substitution, but is just another form of pipeline:
    $ i=0; while read line; do echo $line; ((i++)); done < /etc/passwd | head -n 5; echo $i
    root:x:0:0:root:/root:/bin/bash
    daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
    bin:x:2:2:bin:/bin:/usr/sbin/nologin
    sys:x:3:3:sys:/dev:/usr/sbin/nologin
    sync:x:4:65534:sync:/bin:/bin/sync
    0
This will display the first 5 lines of the file, but the loop itself has no 5-line limit. It keeps reading and writing lines until the file ends (or until head exits and the pipe is closed), and head shows only the first 5 lines of what is written by the loop.
What is more, because the while is in a subshell in a pipeline, changes to variable 'i' will be lost.
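One way to get both the 5-line limit and a usable counter is to apply head to the input rather than to the loop's output. The sketch below assumes that is what is wanted. It uses a temporary file of numbers (created with mktemp and seq) as a stand-in for /etc/passwd so that it is self-contained.

```bash
#!/usr/bin/env bash
# Sketch: limit the input before it reaches the loop, and keep the loop
# out of the pipeline so that 'i' is still available afterwards.

# Hypothetical stand-in for /etc/passwd: a temporary file of 20 lines.
datafile=$(mktemp)
seq 1 20 > "$datafile"

i=0
while read -r line; do
    echo "$line"
    ((i += 1))
done < <(head -n 5 "$datafile")

echo "lines processed: $i"   # prints lines processed: 5

rm -f "$datafile"
```

Here the loop only ever sees 5 lines, and because it runs in the current shell the final value of 'i' is kept.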
Advice
- Use the pipe-connected-to-loop layout if you're aware of the pitfalls, but will not be affected by them.
- Use the read-from-process-substitution format if you want your loop to be complex and to read and write variables in the script.
Personally, I always use the second form in scripts, but if I'm writing a temporary one-line thing on the command line I usually use the first form.
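For completeness, Bash has a further option that is not covered in the shows mentioned: the lastpipe shell option. The sketch below assumes Bash 4.2 or later. It also assumes the code runs in a script, because the option only takes effect when job control is not active, which is the normal case for scripts but not for an interactive shell.

```bash
#!/usr/bin/env bash
# Sketch: with lastpipe set (and job control off, as in a script),
# Bash runs the last element of a pipeline in the current shell.
shopt -s lastpipe

count=0
printf 'World\nHello\n' | sort | while read -r line; do
    ((count += 1))
    echo "$count) $line"
done

echo "count after the loop: $count"   # prints 2 when lastpipe is in effect
```

This keeps the familiar pipe-into-loop layout, at the cost of depending on a shell option that a reader of the script might not expect.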
Tracing pipelines (advanced)
I have always wondered about processes in Unix. The process you log in to, normally called a shell, runs a command language interpreter that executes commands read from the standard input or from a file. There are several such interpreters available, but we're dealing with bash here.
Processes are fairly lightweight entities in Unix/Linux. They can be created and destroyed quickly, with minimal overhead. I used to work with Digital Equipment Corporation's OpenVMS operating system, which also uses processes, but these are much more expensive to create and destroy, and therefore slow and less readily used!
Bash pipelines, as discussed, use subshells. The description in the Bash man page says:
"Each command in a multi-command pipeline, where pipes are created, is executed in a subshell, which is a separate process."
So a subshell in this context is basically another child process of the main login process (or other parent process), running Bash.
Processes (subshells) can be created in other ways. One is to place a collection of commands in parentheses. These can be simple Bash commands, separated by semi-colons, or pipelines. For example:
    $ (echo "World"; echo "Hello") | sort
    Hello
    World
Here the strings "World" and "Hello", each followed by a newline, are created in a subshell and written to standard output. These strings are piped to sort and the end result is as shown.
Note that this is different from this example:
    $ echo "World"; echo "Hello" | sort
    World
    Hello
In this case "World" is written in a separate command, then "Hello" is written to a pipeline. All sort sees is the output from the second echo, which explains the output.
Each process has a unique numeric id value (the process id or PID). These can be seen with tools like ps or htop. Each Bash process holds its own PID in a variable called BASHPID.
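BASHPID is used here rather than the more familiar $$ because $$ keeps the PID of the original shell even inside a subshell. The following sketch shows the difference. The variable names are illustrative, and the actual PID values will differ on every run.

```bash
#!/usr/bin/env bash
# Sketch: $$ keeps the PID of the original shell inside a subshell,
# whereas BASHPID reports the process that is actually running.

parent_pid=$BASHPID

# Command substitution runs in a subshell; capture both values from inside it.
sub_info=$(echo "$$ $BASHPID")
read -r dollars_in_sub bashpid_in_sub <<< "$sub_info"

echo "parent shell:  \$\$=$$ BASHPID=$parent_pid"
echo "in a subshell: \$\$=$dollars_in_sub BASHPID=$bashpid_in_sub"
```

In the second line of output the two numbers differ: $$ still names the parent, while BASHPID names the subshell.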
Knowing all of this, I decided to modify Ken's script from show 3962 to show the processes being created, mainly for my own interest, to get a better understanding of how Bash works. I am including it here in case it may be of interest to others.
    #!/bin/bash

    series_url="https://hackerpublicradio.org/hpr_mp3_rss.php?series=42&full=1&gomax=1"
    download_dir="./"

    pidfile="/tmp/hpr3962.sh.out"
    count=0

    echo "Starting PID is $BASHPID" > "$pidfile"

    (echo "[1] $BASHPID" >> "$pidfile"; wget -q "${series_url}" -O -) |\
        (echo "[2] $BASHPID" >> "$pidfile"; xmlstarlet sel -T -t -m 'rss/channel/item' -v 'concat(enclosure/@url, "→", title)' -n -) |\
        (echo "[3] $BASHPID" >> "$pidfile"; sort) |\
        while read -r episode; do

            [ "$count" -le 1 ] && echo "[4] $BASHPID" >> "$pidfile"
            ((count++))

            url="$( echo "${episode}" | awk -F '→' '{print $1}' )"
            ext="$( basename "${url}" )"
            title="$( echo "${episode}" | awk -F '→' '{print $2}' | sed -e 's/[^A-Za-z0-9]/_/g' )"
            #wget "${url}" -O "${download_dir}/${title}.${ext}"
        done

    echo "Final value of \$count = $count"
    echo "Run 'cat $pidfile' to see the PID numbers"
The point of doing this is to get information about the pipeline which feeds data into the while loop. I kept the rest intact but commented out the wget command.
For each component of the pipeline I added an echo command and enclosed it and the original command in parentheses, thus making a multi-command process. The echo commands write a fixed number so you can tell which one is being executed, and they also write the contents of BASHPID.
The whole thing writes to a temporary file /tmp/hpr3962.sh.out which can be examined once the script has finished.
When the script is run it writes the following:
    $ ./hpr3962.sh
    Final value of $count = 0
    Run 'cat /tmp/hpr3962.sh.out' to see the PID numbers
The file mentioned contains:
    Starting PID is 80255
    [1] 80256
    [2] 80257
    [3] 80258
    [4] 80259
    [4] 80259
Note that the PID values are consecutive. There is no guarantee that this will be so. It will depend on whatever else the machine is doing.
Message number 4 is the same for every loop iteration, so I stopped it being written after two instances.
The initial PID is the process running the script, not the login (parent) PID. You can see that each command in the pipeline runs in a separate process (subshell), including the loop.
Given that a standard pipeline generates a process per command, and that a parenthesised list asks for a subshell of its own, I was slightly surprised that the PID numbers were consecutive. It seems that Bash optimises things so that only one process is run for each element of the pipe. I expect that it would be possible for more processes to be created by having pipelines within these parenthesised lists, but I haven't tried it!
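The nested case can be tried with a small experiment. What follows is only a sketch of one way to do it, not part of the modified script: a parenthesised list reports its own BASHPID, then runs a pipeline whose last element reports its BASHPID too. The labels "outer" and "inner" and the use of awk to pick out the values are choices made for this example.

```bash
#!/usr/bin/env bash
# Sketch: a pipeline nested inside a parenthesised list.
# Each line of the trace reports the process that wrote it.

trace=$(
    (
        echo "outer $BASHPID"
        echo data | (echo "inner $BASHPID"; cat > /dev/null)
    )
)

echo "$trace"

# Pick the two PIDs out of the trace for comparison.
outer=$(awk '$1 == "outer" {print $2}' <<< "$trace")
inner=$(awk '$1 == "inner" {print $2}' <<< "$trace")

if [ "$outer" -ne "$inner" ]; then
    echo "The nested pipeline element ran in a different process"
fi
```

The two PIDs differ because the inner pipeline's elements each run in a subshell of their own, so nesting does create additional processes. The actual numbers, and whether they are consecutive, will depend on the machine.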
I found this test script quite revealing. I hope you find it useful too.
Links
- Bash pipelines:
- Bash loops:
- Bash process substitution:
- HPR shows referenced: 2699, 3962