Processing files with awk

The awk processing utility can practically be used as a programming language -- but first you need to learn its simpler features. In the first of two columns on awk, we show you how it breaks records into fields and how to execute more than one set of commands on a record.

Summary
Awk is a text processing utility that runs through a text file by reading and processing a record at a time. We start with the basics and move to executing more than one set of commands with awk. We give you multiple examples of awk processes. (2,700 words)

 


There have been several requests for information on awk, and I happen to like it as a utility, so this column and next month's column will cover awk.

Awk is a flexible text processing utility that can be used almost as a programming language. You can do a great deal with awk once you learn just a few of its simple features.

Awk runs through a text file by reading and processing one record at a time (by default, a record is one line). Its commands are written to act repeatedly on each record as it is read into awk. Each record that awk reads is broken into separate fields, and actions can be performed on the separate fields as well as on the whole record.

The actions or steps to be performed on the fields in each record or on the whole record make up an awk program or an awk script.

When you type awk as a command, you must also provide two additional pieces of information, or arguments. The first is the program or script to be executed, and the second identifies the file on which to perform the actions. Awk can also be used in a pipeline, in which case it reads its input from the pipe and no file needs to be named on the command line.
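The two ways of supplying input can be compared side by side. The following is only a minimal sketch: the temporary file and its two lines are made up for illustration, mktemp is assumed to be available on your system, and the program {print} simply prints each record unchanged.

```shell
# Build a small throwaway input file (hypothetical contents).
sample=$(mktemp)
printf 'first line\nsecond line\n' > "$sample"

# Form 1: the program is the first argument, the file name the second.
from_file=$(awk '{print}' "$sample")

# Form 2: no file name; awk reads its records from the pipe instead.
from_pipe=$(cat "$sample" | awk '{print}')

rm -f "$sample"

# Both forms should print the same two lines.
echo "$from_file"
echo "$from_pipe"
```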

Starting with the basics
Let's start with a simple awk command in Figure 1 to get a better idea of how it works.

Figure 1

ls -l|awk '{print}'

The output of the ls -l command has been piped into awk and is the "file" to be processed. There is no need to name a file in the awk portion of the command line. The awk program or script is one command, {print}. This example doesn't do much. It takes the whole record that was sent to awk and prints it on the screen. This simple command does partially illustrate the record-by-record action of awk. For each record received by the awk program (each line of the output of the ls -l command), the print instruction is executed. It is important to remember this action by awk. Each record is read, then for each record, the instructions in the program are executed.

The output of this program is pretty uninteresting and will look something like Figure 2 depending on the contents of your directory.

Figure 2

-rw-r--r--   1 mjb     group       109  Mar 09 18:32 store.dat
-rw-r--r--   1 mjb     group        93  Mar 09 18:31 store.sav
-rwxr-xr-x   1 mjb     group      3058  Mar 09 18:29 store.txt
-rw-r--r--   1 mjb     group        89  Mar 09 18:32 sort.dat
-rw-r--r--   1 mjb     group       193  Mar 09 18:31 sort.sav
-rwxr-xr-x   1 mjb     group      2068  Mar 09 18:29 sort.txt
-rw-r--r--   1 mjb     group        20  Mar 09 18:31 palet.txt

So far nothing very exciting has happened. In fact, this is exactly the same output as the simple ls -l command. Obviously there must be more to awk.

Onward! Breaking down into fields
Awk automatically breaks a record into fields. The default delimiter that awk assumes between fields is white space (one or more spaces or tabs). In Figure 2, field 1 is "-rw-r--r--" for the first record, field 2 is "1," field 3 is "mjb," and so on.

When awk reads in a record and breaks the contents of the record into fields, it assigns a variable name to each field. These variable names are a dollar sign ($) followed by the number of the field, counting from left to right. The variable $1 represents the contents of field 1, which for the first record in Figure 2 would be "-rw-r--r--." $2 represents field 2, which is "1" in Figure 2, and so on. The awk variables $1, $2, and so on through $nn represent the fields of each record and should not be confused with shell variables that use the same style of names. Inside an awk script $1 refers to field 1 of a record; $2 to field 2 of a record.
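As a quick check of the field numbering, the sketch below sends a single record, copied from the first line of Figure 2, through awk three times and picks out a different field each time. The shell variable names (record, perms, links, name) are invented for this illustration.

```shell
# One record copied from Figure 2; awk splits it on runs of spaces.
record='-rw-r--r--   1 mjb     group       109  Mar 09 18:32 store.dat'

perms=$(printf '%s\n' "$record" | awk '{print $1}')   # field 1
links=$(printf '%s\n' "$record" | awk '{print $2}')   # field 2
name=$(printf '%s\n' "$record" | awk '{print $9}')    # field 9

echo "$perms"
echo "$links"
echo "$name"
```

Because the awk programs are inside single quotes, the shell passes $1, $2, and $9 to awk untouched.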

In the first awk example, the print command on its own caused the entire record to be printed. The print command followed by specific field variables will print only those fields named by the variables, instead of the entire record. Let's look at an example. To extract the owner, size, and file name from the output of an ls -l file listing, you would need to print only fields 3, 5, and 9. The command for doing this is illustrated in Figure 3. Note that $3, $5, and $9 appear inside the awk script '{print $3 $5 $9}' and are therefore interpreted by awk as awk field variables. The single quotes protect the awk field variables from the shell, so there is no attempt to expand them. It is good practice to get into the habit of enclosing awk commands in opening and closing single quotes to protect them from shell expansion.

Figure 3

ls -l|awk '{print $3 $5 $9}'

The problem with the output of this command is shown in Figure 4. There are no spaces between fields.

Figure 4

mjb109store.dat
mjb93store.sav
mjb3058store.txt
mjb89sort.dat
mjb193sort.sav
mjb2068sort.txt
mjb20palet.txt

One way around this is to embed literals in the print line as in Figure 5, which puts spaces in the output lines, producing the output shown in Figure 6.

Figure 5

ls -l|awk '{print $3 " " $5 " " $9}'

Figure 6

mjb 109 store.dat
mjb 93 store.sav
mjb 3058 store.txt
mjb 89 sort.dat
mjb 193 sort.sav
mjb 2068 sort.txt
mjb 20 palet.txt

This provides some spacing, but the fields don't line up very well. One simple way to improve alignment is to embed tabs in the literals instead of spaces. Repeat the command line in Figure 5, but instead of pressing the space bar between the double quotes, press the TAB key. You will not see any characters on the screen, but the double quotes will be separated by what appear to be more spaces. These "more" spaces are actually a tab character. The result will look something like Figure 7. Figure 8 is an example of the output.

Figure 7

ls -l|awk '{print $3 "      " $5 "  "$9}'

Figure 8

mjb    109     store.dat
mjb    93      store.sav
mjb    3058    store.txt
mjb    89      sort.dat
mjb    193     sort.sav
mjb    2068    sort.txt
mjb    20      palet.txt

In one more variation, we can switch the order of the fields during printing as in the listing in Figure 9 and the output in Figure 10. In this and subsequent examples I will use the ^ (caret) character to indicate where the TAB key is pressed.

Figure 9

ls -l|awk '{print $9 "     ^" $5 " ^"$3}'        (<-- note ^ = TAB key)

Figure 10

store.dat    109     mjb
store.sav    93      mjb
store.txt    3058    mjb
sort.dat     89      mjb
sort.sav     193     mjb
sort.txt     2068    mjb
palet.txt    20      mjb
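Many current shells use the TAB key for completion, which makes a literal tab awkward to type on the command line. One possible alternative is the escape sequence "\t", which awk reads as a tab inside a string. The sketch below assumes that approach and feeds awk a fixed copy of the Figure 2 lines through a helper function (sample_listing is a made-up name for this illustration) so that the result does not depend on your directory.

```shell
# Hypothetical helper: prints a fixed copy of the Figure 2 listing.
sample_listing() {
cat <<'EOF'
-rw-r--r--   1 mjb     group       109  Mar 09 18:32 store.dat
-rw-r--r--   1 mjb     group        93  Mar 09 18:31 store.sav
-rwxr-xr-x   1 mjb     group      3058  Mar 09 18:29 store.txt
-rw-r--r--   1 mjb     group        89  Mar 09 18:32 sort.dat
-rw-r--r--   1 mjb     group       193  Mar 09 18:31 sort.sav
-rwxr-xr-x   1 mjb     group      2068  Mar 09 18:29 sort.txt
-rw-r--r--   1 mjb     group        20  Mar 09 18:31 palet.txt
EOF
}

# "\t" is a tab written as an escape instead of a TAB key press.
sample_listing | awk '{print $9 "\t" $5 "\t" $3}'
```

With a real directory, replace sample_listing with ls -l.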

Executing more than one set of commands
Figure 11 adds two more features of awk. You may execute more than one command on a record by separating the commands with a semicolon (;), and awk allows flexible use of user-defined variables within scripts. In this example a variable is used to keep a running total of the file sizes, in bytes, listed so far. As each record is processed, field $5 is added to the variable ttl before the printing takes place; then, as the fields are printed, ttl is printed on each line as a running total of the file sizes.

The variable ttl is initialized to zero the first time it is used. Since ttl is accessed once each time a record is read, it is accessed for the first time when the first record is read. At that first reference, ttl is automatically set to zero. The syntax "ttl += $5" is borrowed from C. In other programming languages it would be necessary to write something like this:

add $5 to ttl
or
ttl = ttl + $5

Awk uses += as a shorthand for "add to."

Awk initializes all variables to 0 when they are used for numbers and to "" when they are used for string storage. Awk is flexible about its variables, and you do not have to identify them as numeric or string types before using them. The ttl variable could have been used as a string holder, but since it is used for numeric information it starts life as a zero when the first record is read, and thereafter immediately has the contents of field $5 added to it.
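The automatic initialization can be observed directly. In this small sketch the input is just two made-up numbers, one per line, so the value being added is field $1 rather than $5; the brackets are there only to make an empty string visible.

```shell
# Print ttl as a string before it is touched, then add to it as a number.
out=$(printf '109\n93\n' | awk '{print "before: [" ttl "]"; ttl += $1; print "after: " ttl}')
echo "$out"
# Expected output:
# before: []
# after: 109
# before: [109]
# after: 202
```

On the first record ttl prints as an empty string, yet the addition that follows treats it as zero.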

As a note on Figure 11, press the TAB key after the double quote but before "Total."

Figure 11

ls -l|awk '{ttl+=$5; print $9 "     ^" $5 " ^"$3 " ^Total " ttl " bytes"}'

Figure 12 is the output of Figure 11.

Figure 12

store.dat    109     mjb      Total 109 bytes
store.sav    93      mjb      Total 202 bytes
store.txt    3058    mjb      Total 3260 bytes
sort.dat     89      mjb      Total 3349 bytes
sort.sav     193     mjb      Total 3542 bytes
sort.txt     2068    mjb      Total 5610 bytes
palet.txt    20      mjb      Total 5630 bytes
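To try the running total without depending on the contents of your own directory, the same script can be fed a fixed copy of the Figure 2 lines. This is a sketch under two assumptions: sample_listing is a made-up helper name, and the escape "\t" is used in place of the TAB key presses marked with ^ in the figures.

```shell
# Hypothetical helper: prints a fixed copy of the Figure 2 listing.
sample_listing() {
cat <<'EOF'
-rw-r--r--   1 mjb     group       109  Mar 09 18:32 store.dat
-rw-r--r--   1 mjb     group        93  Mar 09 18:31 store.sav
-rwxr-xr-x   1 mjb     group      3058  Mar 09 18:29 store.txt
-rw-r--r--   1 mjb     group        89  Mar 09 18:32 sort.dat
-rw-r--r--   1 mjb     group       193  Mar 09 18:31 sort.sav
-rwxr-xr-x   1 mjb     group      2068  Mar 09 18:29 sort.txt
-rw-r--r--   1 mjb     group        20  Mar 09 18:31 palet.txt
EOF
}

# Two commands per record: add the size to ttl, then print the line.
sample_listing | awk '{ttl+=$5; print $9 "\t" $5 "\t" $3 "\tTotal " ttl " bytes"}'
```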

Line splitting
Awk examples are gradually getting too long for a single line, so we will have to start splitting the lines. If you are not using the C shell, one way to do this is to press enter after you have typed the initial opening single quote before the awk commands. The line will be continued allowing you to enter one or more commands until the final closing single quote is typed. This can be used to break an awk script or program into several separate lines. Figure 13 is an example. In Figure 14 the output is identical to Figure 12.

Figure 13

ls -l|awk '	<- once the open quote is typed, press enter
{ttl+=$5;	<- and continue on the next lines
print $9 "	^" $5 " ^"$3 " ^Total " ttl " bytes"}
'		<- until the final closing quote

Figure 14

store.dat    109     mjb      Total 109 bytes
store.sav    93      mjb      Total 202 bytes
store.txt    3058    mjb      Total 3260 bytes
sort.dat     89      mjb      Total 3349 bytes
sort.sav     193     mjb      Total 3542 bytes
sort.txt     2068    mjb      Total 5610 bytes
palet.txt    20      mjb      Total 5630 bytes

For the C shell, use the backslash as the line continuation character as shown in Figure 15. Further examples will assume that you are using sh, ksh, or one of their derivatives. If you are using csh, then be sure to include the backslash continuation characters.

Figure 15

ls -l|awk ' \      <- use the backslash to force a continuation
{ttl+=$5; \        <- on each line
print $9 "         ^" $5 " ^"$3 " ^Total " ttl " bytes"} \
'                  <- until the final closure, then press enter

A running total is fine, but what I really wanted here was a total bytes count at the end of the listing.

Although the awk default is to perform all commands on each record, awk also allows actions to be performed before the first record is read and/or after the last record is processed. Commands to be executed before the first record or after the last record are set off by the keywords BEGIN and END. Figure 16 is an example of the END keyword. The values in field $5 are still accumulated in the ttl variable, but the total in ttl is printed as part of the END action instead of with each record.

Figure 16

ls -l|awk '
{ttl+=$5;
print $9 "  ^" $5 " ^"$3}
END{print "Total " ttl " bytes"}'

Figure 17 is the output of Figure 16 and you will see that the total is printed as a final line after the last directory entry.

Figure 17

store.dat    109     mjb
store.sav    93      mjb
store.txt    3058    mjb
sort.dat     89      mjb
sort.sav     193     mjb
sort.txt     2068    mjb
palet.txt    20      mjb
Total 5630 bytes
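The per-record action does not have to print anything. As a sketch, the variant below only accumulates the sizes and leaves all the printing to END, so a single summary line comes out. It reads a fixed copy of the Figure 2 lines from sample_listing, a helper name invented for this illustration.

```shell
# Hypothetical helper: prints a fixed copy of the Figure 2 listing.
sample_listing() {
cat <<'EOF'
-rw-r--r--   1 mjb     group       109  Mar 09 18:32 store.dat
-rw-r--r--   1 mjb     group        93  Mar 09 18:31 store.sav
-rwxr-xr-x   1 mjb     group      3058  Mar 09 18:29 store.txt
-rw-r--r--   1 mjb     group        89  Mar 09 18:32 sort.dat
-rw-r--r--   1 mjb     group       193  Mar 09 18:31 sort.sav
-rwxr-xr-x   1 mjb     group      2068  Mar 09 18:29 sort.txt
-rw-r--r--   1 mjb     group        20  Mar 09 18:31 palet.txt
EOF
}

# Accumulate silently for every record; print once after the last one.
sample_listing | awk '
{ttl+=$5}
END{print "Total " ttl " bytes"}'
```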

Figure 18 adds the BEGIN keyword, and Figure 19 shows the output with the heading created by the BEGIN statement.

Figure 18

ls -l|awk '
BEGIN{print "Custom Directory Listing"}
{ttl+=$5;
print $9 "  ^" $5 " ^"$3}
END{print "Total " ttl " bytes"}'

Figure 19

Custom Directory Listing

store.dat    109     mjb
store.sav    93      mjb
store.txt    3058    mjb
sort.dat     89      mjb
sort.sav     193     mjb
sort.txt     2068    mjb
palet.txt    20      mjb
Total 5630 bytes

Figure 20 is a pseudo-listing of the three parts of the awk script. The middle section is marked "each record," but this is not an awk keyword. It is inserted to make the pseudo-listing clearer.

Figure 20

ls -l|awk '
BEGIN             {print "Custom Directory Listing"}
each record       {ttl+=$5;print $9 "  ^" $5 " ^"$3}
END               {print "Total " ttl " bytes"}'
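The order in which the three parts run can be seen with a deliberately tiny input. This sketch uses two made-up one-word records and a counter variable n, which is not part of the figures and is introduced here only for the illustration.

```shell
# BEGIN runs once before any input, the middle block once per record,
# and END once after the last record.
out=$(printf 'a\nb\n' | awk '
BEGIN {print "start"}
{n += 1; print "record " n ": " $1}
END {print "end after " n " records"}')
echo "$out"
# Expected output:
# start
# record 1: a
# record 2: b
# end after 2 records
```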

Take another look at Figure 19 for an additional problem that can be fixed with a feature of awk. There is a blank line between "Custom Directory Listing" and the line containing the first file. Why? I fudged a bit in the earlier part of this article. The real result of an ls -l actually looks more like Figure 21. The total blocks are listed on the first line.

Figure 21

total 18
-rw-r--r--   1 mjb     group       109  Mar 09 18:32 store.dat
-rw-r--r--   1 mjb     group        93  Mar 09 18:31 store.sav
-rwxr-xr-x   1 mjb     group      3058  Mar 09 18:29 store.txt
-rw-r--r--   1 mjb     group        89  Mar 09 18:32 sort.dat
-rw-r--r--   1 mjb     group       193  Mar 09 18:31 sort.sav
-rwxr-xr-x   1 mjb     group      2068  Mar 09 18:29 sort.txt
-rw-r--r--   1 mjb     group        20  Mar 09 18:31 palet.txt

Awk sees the line containing "total 18" as the first record that it processes. This first record only has fields $1 and $2, so fields $3, $5, and $9 are blank for the first record. The print command on this first record is actually printing 3 blank fields from the first record. These show up as a single blank line, but this single line provides an opportunity to show another part of the awk language.
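The empty fields are easy to confirm by sending that one line through awk on its own. In this sketch the square brackets are added only to make the empty values visible.

```shell
# The record "total 18" has two fields, so $3, $5 and $9 are empty strings.
out=$(printf 'total 18\n' | awk '{print "[" $3 "][" $5 "][" $9 "]"}')
echo "$out"
# Expected output:
# [][][]
```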

"If" tests and conditions
An if test can be used to eliminate an unwanted record. Figure 22 includes an if test, on line 3, which uses the next statement. The if test is straightforward except that awk uses "==" (equal equal) for "is equal to." In English this would read, "If the first field is equal to 'total'..."

The next statement causes awk to skip all further actions on this record and to loop back to the top of the logic that reads the next input record.

Figure 22

ls -l|awk '
BEGIN{print "Custom Directory Listing"}
{if($1 == "total") next;
ttl+=$5;
print $9 "  ^" $5 " ^"$3}
END{print "Total " ttl " bytes"}'

Figure 23 is an illustration of the steps that happen in awk record processing as the if condition is tested, and what the next statement does. Note that step 1 in the illustration, read a record, is the automatic default action of awk, and there is no awk command to read a record.

Figure 23 The logic in an if-next statement

1.	read a record                 < the automatic action in awk
2.	{ if ($1 == "total")          < test the first field
3.	next;                         < if true go to step 1
4.	ttl += $5;                    < otherwise continue
5.	(rest of the code)
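The if and next logic can be tried against a fixed copy of the Figure 21 lines, including the leading "total 18" record. This is a sketch under two assumptions: ls_output is a made-up helper name, and the escape "\t" stands in for the TAB key presses marked with ^ in the figures.

```shell
# Hypothetical helper: prints a fixed copy of the Figure 21 listing.
ls_output() {
cat <<'EOF'
total 18
-rw-r--r--   1 mjb     group       109  Mar 09 18:32 store.dat
-rw-r--r--   1 mjb     group        93  Mar 09 18:31 store.sav
-rwxr-xr-x   1 mjb     group      3058  Mar 09 18:29 store.txt
-rw-r--r--   1 mjb     group        89  Mar 09 18:32 sort.dat
-rw-r--r--   1 mjb     group       193  Mar 09 18:31 sort.sav
-rwxr-xr-x   1 mjb     group      2068  Mar 09 18:29 sort.txt
-rw-r--r--   1 mjb     group        20  Mar 09 18:31 palet.txt
EOF
}

# The "total" record is skipped by next, so it is neither summed nor printed.
ls_output | awk '
BEGIN{print "Custom Directory Listing"}
{if($1 == "total") next;
ttl+=$5;
print $9 "\t" $5 "\t" $3}
END{print "Total " ttl " bytes"}'
```

The heading is now followed directly by the store.dat line, with no blank line in between.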

Even simple if tests such as the one shown here can add a powerful tool to awk processes.

This is about all I have space for in this edition, so join me next month for some more advanced features in awk, including better formatting and processing of files whose fields are not separated by spaces.


Copyright©2001 King Computer Services Inc. All rights reserved.