Hands-off editing with sed, Part 1

The ABCs of the Unix stream editor

Summary
The Unix stream editor (sed) has some handy uses that other editing tools can't match: it can modify text across multiple files and make changes to files without opening them in an interactive editor. This month, Mo walks you through some of sed's unique abilities.


With the sed utility (otherwise known as the Unix stream editor), you can alter files without opening them in an interactive editor. Instead, you use sed to specify a series of rules (edits) or transformations that you want applied to lines of text, then apply them to your file.

sed is not suitable as a general purpose editor, and is best used to apply a set of modifications to text, particularly if you're managing more than one file. If you want to run a series of once-only edits, you are better off directly editing a file using vi or Emacs, as using sed will take much longer. But if you need to make systematic changes across multiple files, then sed is your best bet.

For example, suppose you've written a book using vi. In it, your heroine is named Brunhilda Mathewhowsenstern, to whom you occasionally apply the nickname Brunie. You have already saved each of the book's 66 chapters as separate text files -- chap01.txt, chap02.txt, etc. -- and have just received the go-ahead from your publisher when, horror of horrors, a real person named Brunhilda Mathewhowsenstern shows up and threatens a suit if you use her name in the book. After considerable research you discover that the name Mathilda Leapfrogandrun is unclaimed by any litigious miscreants, so you resolve to change your heroine's name to this, with the new nickname Mattie replacing Brunie. You might think that you have a lot of work ahead of you, since you have to open all 66 files in vi or some other editor to make this fix.

At this point, the real heroine, the sed utility, rides to the rescue, allowing you to devise a script of editing actions to be applied one after another to each of the 66 chapters. You will be able to effect these changes without opening a single file. The script will apply three rules:

  1. Locate all instances of "Brunhilda" and change them to "Mathilda"
  2. Locate all instances of "Mathewhowsenstern" and change them to "Leapfrogandrun"
  3. Locate all instances of "Brunie" and change them to "Mattie"

How will all this work? Let's cover some basic features of the sed utility to find out.

Some sed basics
We'll start our exploration of sed with the following command:

sed [options] script input_file

Here, script is a set of actions to perform on a file named input_file. The output of sed is sent to standard output; in order to save the result of the sed actions you need to redirect this output:

sed [options] script input_file >output_file

Do not redirect the output to the file you're editing or you'll clobber the file.
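
A minimal sketch of the safe pattern, assuming a hypothetical file called notes.txt in a scratch directory (both names are mine, for illustration only): send sed's output to a second file, and replace the original only if sed succeeded.

```shell
# Work in a scratch directory so nothing real is touched.
# (mktemp -d is widely available; its use here is an assumption.)
workdir=$(mktemp -d)
printf 'one\ntwo\n' > "$workdir/notes.txt"    # hypothetical sample file

# Write the edited text to a separate file, then move it over the
# original only if sed exited successfully.
sed 's/one/1/' "$workdir/notes.txt" > "$workdir/notes.new" &&
    mv "$workdir/notes.new" "$workdir/notes.txt"

cat "$workdir/notes.txt"
```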

sed can also be used on an input stream. The input_file argument is optional; if it is omitted, sed reads from standard input, which is the keyboard unless input has been redirected or piped from another command. The command

sed script >output_file

will take everything typed on the keyboard, apply the script to it, and place the result in output_file. This also provides a handy mechanism for testing sed scripts. If you use the command

sed script

everything typed at the keyboard is processed according to script and printed on the screen.
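
If you'd rather not type test input by hand each time, the same check can be run by piping sample lines into sed. This is only a sketch: the sample text is made up, and the script is a simple substitution of the kind explained later in this article.

```shell
# Feed two sample lines to sed on standard input instead of typing them.
result=$(printf '%s\n' 'first line' 'second line' | sed 's/line/row/')
printf '%s\n' "$result"
```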

In interactive editors like vi or Emacs, it's necessary to explicitly tell the editor when to apply commands to the entire text. The sed editor does this automatically by applying each script command to each line in the input file.

First, let's try a simple substitution script. The syntax of the substitution command is:

s/search text/replacement text/g

The optional g at the end of the line signals that the search-and-replace process is to be applied to all instances of the search text found in each line; without this option, the process would be applied only to the first instance found in each line of text.

Those of you familiar with vi will notice that the search and replace text uses the line-global option (g at the end of the line) but not the file-global option (g at the beginning of the line). sed commands are automatically applied globally to all lines.
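
Here is a small sketch of the difference the g makes, run on one made-up line of text. The variable names are mine and have no special meaning to sed.

```shell
line='maybe I am Mark'

# Without g, only the first "m" on the line is replaced.
first_only=$(printf '%s\n' "$line" | sed 's/m/x/')

# With g, every "m" on the line is replaced.
every_match=$(printf '%s\n' "$line" | sed 's/m/x/g')

printf 'without g: %s\nwith g:    %s\n' "$first_only" "$every_match"
```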

Type this command and press Enter to search for all occurrences of "m" and replace each with an "x":

sed s/m/x/g

Now type the following lines and press Enter at the end of each line:

I'm not Mary mostly
but maybe I'm Mark.

After pressing Enter, a new version of the typed line appears with the designated text having been replaced. Press Control-D to end the input. You should end up with a display that looks like Listing 1 below:

$ sed s/m/x/g
I'm not Mary mostly
I'x not Mary xostly
but maybe I'm Mark.
but xaybe I'x Mark.
$
Listing 1. A simple replacement

Every lowercase "m" has been replaced by an "x". The uppercase "M" was unaffected because sed is case sensitive.

sed options
By default, sed outputs every line, whether or not it was changed. The -n option suppresses this automatic output, so that only the lines you explicitly ask sed to print appear. Adding the p (print) flag to the end of a substitution command prints each line on which a substitution was made.

To demonstrate this, use the -n option, leave the g off the end of the sed command, and put a p in its place. Listing 2 shows the results, with only the first instance of a lowercase "m" in each line being replaced, and only altered lines being output. In this listing, the line containing "or not" is not echoed to the screen because it wasn't altered by sed.

$ sed -n s/m/x/p
I'm not Mary mostly
I'x not Mary mostly
but maybe I'm Mark
but xaybe I'm Mark
or not
as the case may be.
as the case xay be.
$
Listing 2. A replacement without the line-global option

The -n option combined with the p flag is useful when you're processing large files and only want to preview changed lines. If the preview looks okay, simply remove the -n switch and the p flag.
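
A short sketch of such a preview, using made-up input lines. In POSIX sed, -n turns off the automatic printing of every line, and the p flag on the s command prints a line only when a substitution was made on it, so the two together show just the changed lines.

```shell
# Three sample lines; the middle one contains no "m" and is not printed.
changed=$(printf '%s\n' "I'm not Mary mostly" 'or not' 'as the case may be.' |
    sed -n 's/m/x/p')
printf '%s\n' "$changed"
```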

It is common in sed scripts to enclose the script commands within single quotes, because many elements within a sed script include characters interpreted in a special way by the Unix shell. The quotes protect them from shell interpretation. For instance, a Unix shell might have problems running a search and replace involving any text that contained a space, as in the following command.

sed s/one for all/all for one/g

The shell would split this command into separate arguments at each space, so the script that sed received wouldn't be what you intended. Here is the correct way to enter this command:

sed 's/one for all/all for one/g'

The need for single quotes is so common in sed script commands that it's a fairly safe habit to use them all the time.
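
The sketch below runs both forms on one line of sample input so you can see the difference. The status variable is my own bookkeeping, not part of sed.

```shell
# Quoted, the whole script reaches sed as a single argument.
quoted=$(printf '%s\n' 'one for all' | sed 's/one for all/all for one/g')
printf '%s\n' "$quoted"

# Unquoted, the shell splits the script at the spaces; sed receives the
# incomplete script "s/one" and exits with an error.
if printf '%s\n' 'one for all' | sed s/one for all/all for one/g 2>/dev/null
then unquoted_status=ok
else unquoted_status=failed
fi
printf 'unquoted attempt: %s\n' "$unquoted_status"
```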

Using the output of other commands
Because sed works on a stream, the output of another command can be used as the input to sed. For example, although my Unix login is my initials, mjb, I much prefer to be known as Most Exalted One. Instead of the usual boring output from an ls command, such as Listing 3, I can use a sed script to modify the output, as shown in Listing 4.

 

$ ls -l
-rw-r--r-x  1 mjb  group  384 Jul 27 1992   file.cpio
-rw-r--r-x  1 mjb  group  3584 Nov 30 22:38  file.tar
-rw-r--r-x  1 mjb  group  26 Nov 30 17:26  file1.txt
Listing 3. Ordinary, boring output

 

$ ls -l|sed 's/mjb/Most Exalted One/'
-rw-r--r-x  1 Most Exalted One  group  384 Jul 27 1992   file.cpio
-rw-r--r-x  1 Most Exalted One  group  3584 Nov 30 22:38  file.tar
-rw-r--r-x  1 Most Exalted One  group  26 Nov 30 17:26  file1.txt
Listing 4. Modified output

Multiple lines of search-and-replace text can be entered in several ways, the simplest of which is to use the multiline-entry capability of non-C shells. Taking our ls example again, let's say that we now want to execute two commands. The first is the one you've already seen; the second will search for the word "group" and replace it with nothing, as in:

s/group//

To do this, type the original command, but omit the closing single quote. Press Enter and the shell will offer you a continuation prompt (>). Now type the second command and the closing single quote and press Enter again. The result should be something like Listing 5.

 
$ ls -l|sed 's/mjb/Most Exalted One/
> s/group//'
-rw-r--r-x  1 Most Exalted One  384 Jul 27 1992   file.cpio
-rw-r--r-x  1 Most Exalted One  3584 Nov 30 22:38  file.tar
-rw-r--r-x  1 Most Exalted One  26 Nov 30 17:26  file1.txt
$
Listing 5. Multiple lines of search-and-replace processes

In the C shell, you can force multiline entry by adding a backslash to the end of the line, as in Listing 6. The default continuation prompt in the C shell is a question mark (?).

% ls -l|sed 's/mjb/Most Exalted One/ \
? s/group//'
-rw-r--r-x  1 Most Exalted One  384 Jul 27 1992   file.cpio
-rw-r--r-x  1 Most Exalted One  3584 Nov 30 22:38  file.tar
-rw-r--r-x  1 Most Exalted One  26 Nov 30 17:26  file1.txt
%
Listing 6. Multiple lines of search-and-replace processes in C shell
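
Another way to supply more than one command, which works the same in any shell, is sed's -e option: each -e adds one command to the script. The sketch below uses a single made-up line standing in for the ls output.

```shell
# Each -e adds one command; the commands are applied in order to every line.
sample='mjb group file1.txt'
edited=$(printf '%s\n' "$sample" |
    sed -e 's/mjb/Most Exalted One/' -e 's/group//')
printf '%s\n' "$edited"
```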

As multiline sed scripts grow, it becomes more convenient to put them in a sed script file, which contains one or more lines of sed commands. Single quotes aren't needed, because the shell never interprets the contents of the file; only sed reads it. Listing 7 is exalted.sed, a sed script file containing the two commands used in the previous example. It can be created using vi or any other simple editor.

s/mjb/Most Exalted One/
s/group//
Listing 7. A sed script file called exalted.sed

To use a script file with sed, replace the script argument with the -f option, followed by the script filename, as in Listing 8.

$ ls -l|sed  -f exalted.sed
-rw-r--r-x  1  Most Exalted One   384 Jul 27 1992   file.cpio
-rw-r--r-x  1  Most Exalted One   3584 Nov 30 22:38  file.tar
-rw-r--r-x  1  Most Exalted One   26 Nov 30 17:26  file1.txt
$
Listing 8. Using exalted.sed
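
If you want to try this without opening an editor, the sketch below builds the script file with a shell here-document in a scratch directory (the directory and the sample input line are my own assumptions). It also shows one advantage of script files: a line beginning with # is treated by sed as a comment.

```shell
workdir=$(mktemp -d)

# Create the script file; quoting 'EOF' stops the shell from expanding
# anything inside the here-document.
cat > "$workdir/exalted.sed" <<'EOF'
# Replace the login name with a grander title.
s/mjb/Most Exalted One/
# Remove the group column.
s/group//
EOF

renamed=$(printf '%s\n' 'mjb group file.tar' | sed -f "$workdir/exalted.sed")
printf '%s\n' "$renamed"
```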

The use of sed script files is probably more common than creating a sed command with a script in the command line. The sed editor can perform some amazing, and sometimes destructive, transformations on text files, so it is more common to develop and test a sed script step by step than it is to unleash a command-line script on a set of text files.

Let's return to the perils of Brunhilda. By now, you can probably envision the first part of the solution. Listing 9 is heroine.sed, the script commands to make the needed substitutions. Note that each command uses the g option to handle any lines in which the search text appears more than once.

s/Brunhilda/Mathilda/g
s/Mathewhowsenstern/Leapfrogandrun/g
s/Brunie/Mattie/g
Listing 9. heroine.sed

To process all chapters in the book, you would run sed, applying heroine.sed to each file. Next, you would capture the output in a temporary file, and then rename the temporary file with the original filename, as in Listing 10.

$
sed -f heroine.sed <chap01.txt >temp.txt
mv temp.txt chap01.txt
sed -f heroine.sed <chap02.txt >temp.txt
mv temp.txt chap02.txt
. . . 
. . . 
. . . 
sed -f heroine.sed <chap66.txt >temp.txt
mv temp.txt chap66.txt
$
Listing 10. Using heroine.sed

Of course, this isn't very practical -- the sed utility is more commonly used with a shell script that allows it to process multiple files. For this example, all 66 chapters of the book have been copied to a separate directory where they can be safely mangled. A wise old programmer once told me that global search and replace would be more correctly named global search and destroy. The sed editor compounds this danger by potentially wreaking havoc on more than one file at a time; therefore, backing up master copies before unleashing sed is highly advised.

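Once the files are safely copied or backed up, a script is created to modify all the chapter files in that directory, as in Listing 11.

#!/bin/sh

# Apply heroine.sed to every chapter file in the current directory.
for name in chap*.txt
do
    # Replace the original only if sed succeeded.
    sed -f heroine.sed <"$name" >temp.txt &&
        mv temp.txt "$name"
done

# Clean up a temp.txt left behind by a failed run.
rm -f temp.txt
Listing 11. A script to use heroine.sed

This shell script assigns each chapter file name in turn to the variable name (referenced as $name). The pattern chap*.txt is used rather than * so that heroine.sed, temp.txt, and the shell script itself are never fed through sed if they live in the same directory. For each file, the script applies heroine.sed, captures the output in temp.txt, and then, provided sed succeeded, renames temp.txt to the original file name. The variable is quoted so that file names containing spaces are handled correctly. Finally, any leftover temp.txt file is removed.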
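
Given the warning about global search and destroy, you may prefer a variation that keeps a backup of each file as it goes. The following is only a sketch under my own assumptions: it builds two tiny stand-in chapters in a scratch directory, and the .bak suffix is an arbitrary choice.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two stand-in chapters and the script file from Listing 9.
printf 'Brunie smiled.\n' > chap01.txt
printf 'Brunhilda Mathewhowsenstern\n' > chap02.txt
cat > heroine.sed <<'EOF'
s/Brunhilda/Mathilda/g
s/Mathewhowsenstern/Leapfrogandrun/g
s/Brunie/Mattie/g
EOF

for name in chap*.txt
do
    # Keep the original as a .bak copy, then rewrite the chapter from it.
    cp "$name" "$name.bak" &&
        sed -f heroine.sed "$name.bak" > "$name"
done

cat chap01.txt chap02.txt
```

If a run goes wrong, the untouched originals are still there under the .bak names.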

Regular expressions in a search and replace
The sed utility also allows regular expressions to be used in search-and-replace processes. Regular expressions are used in the grep family of search utilities, as well as the sed, ed, vi, and Emacs editors, to name just a few. I will attempt to give you some of the rudiments of regular expressions here by illustrating a simple problem. For more detailed treatments of regular expressions, see the Resources section below.

First, a regular expression is a way of specifying a text string, usually one for which you wish to search. In regular expressions, letters of the alphabet, digits, and most punctuation marks represent nothing but themselves. However, several punctuation marks are used to represent special characters (called metacharacters) in a regular expression, and the meaning of letters and digits can be changed by using metacharacters with them.

Let's look at a simple problem that seems to have a complex solution. Assume that a text file containing addresses has many variations of the common address phrase "P.O. Box." Over the years, it's been entered as "po box," "PO Box," "P. O. BOX," and so on. At long last, a decision has been made to standardize the mailing list so that it will always read "P.O. Box." At first blush it would seem that a sed script file like the one in Listing 12 might achieve the desired result.

s/po box/P.O. Box/
s/PO BOX/P.O. Box/
s/PO Box/P.O. Box/
. . . .
. . . . 
Listing 12. po.sed, matching all possible combinations of P.O. Box

However, after entering about 25 possible variations on "P.O. Box," frustration sets in and you pine for a simpler way. Enter regular expressions.

The first simple rule in a regular expression is that two or more characters may be enclosed in brackets to indicate that any one of them is acceptable at that position. I'll start with Listing 13, which matches "PO Box" in any mixture of upper- and lowercase letters.

s/[Pp][Oo] [Bb][Oo][Xx]/P.O. Box/
Listing 13. po.sed, matching upper- or lower-case letters

Now, what about the possibility of multiple spaces appearing between letters and words? In a regular expression, any character followed by an asterisk matches zero or more occurrences of that character. So, if we insert a space and an asterisk after the "P" and add an asterisk to the space after the "O", we arrive at Listing 14.

s/[Pp] *[Oo] *[Bb][Oo][Xx]/P.O. Box/
Listing 14. po.sed, allowing multiple spaces between words

Listing 14 might be translated into plain English as, "Search for 'P' or 'p' followed by zero or more spaces followed by 'O' or 'o' followed by zero or more spaces followed by 'B' or 'b' followed by 'O' or 'o' followed by 'X' or 'x'; if you find something that meets these criteria, replace it with 'P.O. Box.'"
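
To check that reading, here is a quick sketch that runs the command from Listing 14 over a few made-up addresses with different capitalization and spacing.

```shell
# Each sample differs in case or in the number of spaces.
results=$(
    for address in 'po box 12' 'PO  BOX 12' 'p o Box 12'
    do
        printf '%s\n' "$address" |
            sed 's/[Pp] *[Oo] *[Bb][Oo][Xx]/P.O. Box/'
    done
)
printf '%s\n' "$results"
```

All three lines should come out reading "P.O. Box 12".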

What about the possibility of periods appearing after "P" and "O"? A period in a regular expression is a metacharacter that matches any single character, so we can't simply insert a bare period into the search text. What we want is an actual period. To remove the special meaning of any metacharacter, precede it with a backslash. So, in Listing 15, we have:

s/[Pp]\. *[Oo]\. *[Bb][Oo][Xx]/P.O. Box/
Listing 15. po.sed, locating a period
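
A small sketch of why the backslash matters, using a made-up three-character input that contains no period at all. The replacement text MATCH is just a marker of my own.

```shell
# Unescaped, the period matches any single character, including "x".
loose=$(printf '%s\n' 'Pxo' | sed 's/P.o/MATCH/')

# Escaped, it matches only a literal period, so nothing is replaced.
strict=$(printf '%s\n' 'Pxo' | sed 's/P\.o/MATCH/')

printf 'unescaped: %s\nescaped:   %s\n' "$loose" "$strict"
```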

The period may not exist in the phrase, so we need to specify that zero or more periods are acceptable. Thus, in Listing 16 we add two more asterisks, one after each period. I have also added a final g to ensure that more than one occurrence of "P.O. Box" in a line is correctly handled.

s/[Pp]\.* *[Oo]\.* *[Bb][Oo][Xx]/P.O. Box/g
Listing 16. po.sed, matching an optional period

You would translate Listing 16 as, "Search for 'P' or 'p' followed by zero or more periods followed by zero or more spaces followed by 'O' or 'o' followed by zero or more periods followed by zero or more spaces followed by 'B' or 'b' followed by 'O' or 'o' followed by 'X' or 'x'; if you find something that meets these criteria, replace it with 'P.O. Box.'"

You now have a po.sed file that will correct 90 percent of the addresses in the file. (It won't fix actual misspellings, such as "PO Bux," or dropped-out letters, such as "P. Box").
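
As a rough check, the sketch below applies the command from Listing 16 to four made-up addresses. The first three are variations it should standardize; the last contains a misspelling and should pass through untouched.

```shell
# The command from Listing 16, held in a variable for readability.
pattern='s/[Pp]\.* *[Oo]\.* *[Bb][Oo][Xx]/P.O. Box/g'

fixed=$(printf '%s\n' 'po box 7' 'P. O. BOX 7' 'P.O.Box 7' 'PO Bux 7' |
    sed "$pattern")
printf '%s\n' "$fixed"
```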

This should give you a start with sed, but you can do much more with it than search and replace. Assuming that civilization as we know it doesn't come to an end at midnight on New Year's Eve, you'll be able to read Part 2 of this article in the January 2000 issue, where I'll cover other sed options and commands.


Copyright©2001 King Computer Services Inc. All rights reserved.