Awk by example, Part 1

Expressions and blocks

$1 == "fred" { print $3 }

$5 ~ /root/ { print $3 }

1.1 xml/htdocs/doc/en/articles/l-awk2.xml file : http://www.gentoo.org/cgi-bin/viewcvs.cgi/xml/htdocs/doc/en/articles/l-awk2.xml?rev=1.1&content-type=text/x-cvsweb-markup&cvsroot=gentoo plain: http://www.gentoo.org/cgi-bin/viewcvs.cgi/xml/htdocs/doc/en/articles/l-awk2.xml?rev=1.1&content-type=text/plain&cvsroot=gentoo Index: l-awk2.xml =================================================================== Awk by example, Part 2 Daniel Robbins Łukasz Damentko In this sequel to his previous intro to awk, Daniel Robbins continues to explore awk, a great language with a strange name. Daniel will show you how to handle multi-line records, use looping constructs, and create and use awk arrays. By the end of this article, you'll be well versed in a wide range of awk features, and you'll be ready to write your own powerful awk scripts. 1.0 2005-07-27 Records, loops, and arrays

Multi-line records The original version of this article was published on IBM developerWorks, and is property of Westtech Information Services. This document is an updated version of the original article, and contains various improvements made by the Gentoo Linux Documentation team.

Awk is an excellent tool for reading in and processing structured data, such as the system's /etc/passwd file. /etc/passwd is the UNIX user database, and is a colon-delimited text file, containing a lot of important information, including all existing user accounts and user IDs, among other things. In my previous article, I showed you how awk could easily parse this file. All we had to do was to set the FS (field separator) variable to ":".

By setting the FS variable correctly, awk can be configured to parse almost any kind of structured data, as long as there is one record per line. However, just setting FS won't do us any good if we want to parse a record that exists over multiple lines. In these situations, we also need to modify the RS record separator variable. The RS variable tells awk when the current record ends and a new record begins.

As an example, let's look at how we'd handle the task of processing an address list of Federal Witness Protection Program participants:

Jimmy the Weasel
100 Pleasant Drive
San Francisco, CA 12345
Big Tony
200 Incognito Ave.
Suburbia, WA 67890

Ideally, we'd like awk to recognize each 3-line address as an individual record, rather than as three separate records. It would make our code a lot simpler if awk would recognize the first line of the address as the first field ($1), the street address as the second field ($2), and the city, state, and zip code as field $3. The following code will do just what we want:

BEGIN {
    FS="\n"
    RS=""
}

Above, setting FS to "\n" tells awk that each field appears on its own line. By setting RS to "", we also tell awk that each address record is separated by a blank line. Once awk knows how the input is formatted, it can do all the parsing work for us, and the rest of the script is simple. Let's look at a complete script that will parse this address list and print out each address record on a single line, separating each field with a comma.

BEGIN {
    FS="\n"
    RS=""
}
{ print $1 ", " $2 ", " $3 }

If this script is saved as address.awk, and the address data is stored in a file called address.txt, you can execute this script by typing awk -f address.awk address.txt. This code produces the following output:

Jimmy the Weasel, 100 Pleasant Drive, San Francisco, CA 12345
Big Tony, 200 Incognito Ave., Suburbia, WA 67890

OFS and ORS

In address.awk's print statement, you can see that awk concatenates (joins) strings that are placed next to each other on a line. We used this feature to insert a comma and a space (", ") between the three address fields that appeared on the line. While this method works, it's a bit ugly looking. Rather than inserting literal ", " strings between our fields, we can have awk do it for us by setting a special awk variable called OFS. Take a look at this code snippet.

print "Hello", "there", "Jim!"

The commas on this line are not part of the actual literal strings. Instead, they tell awk that "Hello", "there", and "Jim!" are separate fields, and that the OFS variable should be printed between each string. By default, awk produces the following output:

Hello there Jim!

This shows us that by default, OFS is set to " ", a single space. However, we can easily redefine OFS so that awk will insert our favorite field separator. Here's a revised version of our original address.awk program that uses OFS to output those intermediate ", " strings:

BEGIN {
    FS="\n"
    RS=""
    OFS=", "
}
{ print $1, $2, $3 }

Awk also has a special variable called ORS, called the "output record separator". By setting ORS, which defaults to a newline ("\n"), we can control the character that's automatically printed at the end of a print statement. The default ORS value causes awk to output each new print statement on a new line. If we wanted to make the output double-spaced, we would set ORS to "\n\n". Or, if we wanted records to be separated by a single space (and no newline), we would set ORS to " ".

Multi-line to tabbed

Let's say that we wrote a script that converted our address list to a single-line per record, tab-delimited format for import into a spreadsheet. After using a slightly modified version of address.awk, it would become clear that our program only works for three-line addresses. If awk encountered the following address, the fourth line would be thrown away and not printed:

Cousin Vinnie
Vinnie's Auto Shop
300 City Alley
Sosueme, OR 76543

1.1 xml/htdocs/doc/en/articles/l-awk3.xml file : http://www.gentoo.org/cgi-bin/viewcvs.cgi/xml/htdocs/doc/en/articles/l-awk3.xml?rev=1.1&content-type=text/x-cvsweb-markup&cvsroot=gentoo plain: http://www.gentoo.org/cgi-bin/viewcvs.cgi/xml/htdocs/doc/en/articles/l-awk3.xml?rev=1.1&content-type=text/plain&cvsroot=gentoo Index: l-awk3.xml =================================================================== Awk by example, Part 3 Daniel Robbins Łukasz Damentko In this sequel to his previous intro to awk, Daniel Robbins continues to explore awk, a great language with a strange name. Daniel will show you how to handle multi-line records, use looping constructs, and create and use awk arrays. By the end of this article, you'll be well versed in a wide range of awk features, and you'll be ready to write your own powerful awk scripts. 1.0 2005-07-27 String functions and ... checkbooks?

Formatting output

While awk's print statement does do the job most of the time, sometimes more is needed. For those times, awk offers two good old friends called printf() and sprintf(). Yes, these functions, like so many other awk parts, are identical to their C counterparts. printf() will print a formatted string to stdout, while sprintf() returns a formatted string that can be assigned to a variable. If you're not familiar with printf() and sprintf(), an introductory C text will quickly get you up to speed on these two essential printing functions. You can view the printf() man page by typing "man 3 printf" on your Linux system.

Here's some sample awk sprintf() and printf() code. As you can see, everything looks almost identical to C.

x=1
b="foo"
printf("%s got a %d on the last test\n","Jim",83)
myout=("%s-%d",b,x)
print myout

This code will print:

Jim got a 83 on the last test
foo-1

String functions

Awk has a plethora of string functions, and that's a good thing. In awk, you really need string functions, since you can't treat a string as an array of characters as you can in other languages like C, C++, and Python. For example, if you execute the following code:

mystring="How are you doing today?"
print mystring[3]

You'll receive an error that looks something like this:

awk: string.gawk:59: fatal: attempt to use scalar as array

Oh, well. While not as convenient as Python's sequence types, awk's string functions get the job done. Let's take a look at them.

First, we have the basic length() function, which returns the length of a string. Here's how to use it:

print length(mystring)

This code will print the value:

OK, let's keep going. The next string function is called index, and will return the position of the occurrence of a substring in another string, or it will return 0 if the string isn't found. Using mystring, we can call it this way:

print index(mystring,"you")

Awk prints:

We move on to two more easy functions, tolower() and toupper(). As you might guess, these functions will return the string with all characters converted to lowercase or uppercase respectively. Notice that tolower() and toupper() return the new string, and don't modify the original. This code:

print tolower(mystring)
print toupper(mystring)
print mystring

....will produce this output:

how are you doing today?
HOW ARE YOU DOING TODAY?
How are you doing today?

So far so good, but how exactly do we select a substring or even a single character from a string? That's where substr() comes in. Here's how to call substr():

mysub=substr(mystring,startpos,maxlen)

mystring should be either a string variable or a literal string from which you'd like to extract a substring. startpos should be set to the starting character position, and maxlen should contain the maximum length of the string you'd like to extract. Notice that I said maximum length; if length(mystring) is shorter than startpos+maxlen, your result will be truncated. substr() won't modify the original string, but returns the substring instead. Here's an example:

print substr(mystring,9,3)

Awk will print:

you

If you regularly program in a language that uses array indices to access parts of a string (and who doesn't), make a mental note that substr() is your awk substitute. You'll need to use it to extract single characters and substrings; because awk is a string-based language, you'll be using it often.

Now, we move on to some meatier functions, the first of which is called match(). match() is a lot like index(), except instead of searching for a substring like -- gentoo-doc-cvs@gentoo.org mailing list