Arrays and array elements need not be declared, nor is there any need to specify how many elements an array has. Like variables, array elements spring into existence by being mentioned; at birth, they have the numeric value 0 and the string value "". In fact, it is easy (though perhaps slow) to read the entire input into an array, then process it in any convenient order; for example, a variant of an earlier program reads its whole input into an array and prints the lines in reverse order. The characteristic that sets awk arrays apart from those in most other languages is that subscripts are strings.
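The read-it-all-then-process idea can be sketched as a reverse-printing program; the sample lines are piped in here only to make the sketch self-contained.

```shell
# Store every input line in an array, then print the lines in reverse order.
printf 'one\ntwo\nthree\n' |
awk '    { line[NR] = $0 }              # each line lands in line[1..NR]
     END { for (i = NR; i > 0; i--)     # walk the subscripts backward
               print line[i] }'
```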
This gives awk a capability like the associative memory of SNOBOL4 tables, and for this reason, arrays in awk are called associative arrays. The following program accumulates the populations of Asia and Europe in the array pop.
The END action prints the total populations of these two continents, in lines of the form "Asian population is ... million." Note that the subscripts are the string constants "Asia" and "Europe". If we had written pop[Asia] instead of pop["Asia"], the expression would have used the value of the variable Asia as the subscript, and since that variable is uninitialized, the values would have been accumulated in pop[""].
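A sketch of the accumulation program just described. The four-column sample rows (country, area, population in millions, continent) are invented for illustration; they only mimic the layout of the book's countries file.

```shell
# pop["Asia"] and pop["Europe"] spring into existence on first mention.
printf 'Japan\t146\t120\tAsia\nIndia\t1269\t746\tAsia\nFrance\t211\t55\tEurope\n' |
awk '/Asia/   { pop["Asia"] += $3 }
     /Europe/ { pop["Europe"] += $3 }
     END { print "Asian population is", pop["Asia"], "million."
           print "European population is", pop["Europe"], "million." }'
```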
This example doesn't really need an associative array, since there are only two elements, both named explicitly. Suppose instead that our task is to determine the total population for each continent. Associative arrays are ideally suited for this kind of aggregation: the continent names themselves serve as subscripts, so the code works regardless of the number of continents, and running it on the countries file produces one total apiece for North America, South America, Asia, and Europe. That program uses a form of the for statement, for (variable in array), that loops over all subscripts of an array. The order in which the subscripts are considered is implementation dependent.
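The per-continent aggregation can be sketched as follows. The sample data is invented (same hypothetical four-column layout as before), and the trailing sort only makes the implementation-dependent for-in order predictable.

```shell
# One array element per distinct continent; the continent name is the subscript.
printf 'Japan\t146\t120\tAsia\nIndia\t1269\t746\tAsia\nFrance\t211\t55\tEurope\n' |
awk '    { pop[$4] += $3 }                   # works for any number of continents
     END { for (c in pop) print c, pop[c] }' |
sort
```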
Results are unpredictable if new elements are added to the array by the body of such a loop. Thus, to test whether Africa is a subscript of the array pop you can say if ("Africa" in pop); unlike testing pop["Africa"] != "", this does not create pop["Africa"] as a side effect. Note that neither form is a test of whether the array pop contains an element with value "Africa"; both concern subscripts only.
The delete Statement. An array element may be deleted with delete array[subscript]; for example, a loop over the subscripts of pop can remove all of its elements. The function split(str, arr, fs) splits the string value of str into fields and stores them in the array arr. The number of fields produced is returned as the value of split. The string value of the third argument, fs, determines the field separator. If there is no third argument, FS is used.
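A small sketch of both facilities together: split on a slash-separated date (an invented example), then a loop that deletes the elements one by one.

```shell
awk 'BEGIN {
    n = split("12/25/2024", d, "/")      # d[1]="12", d[2]="25", d[3]="2024"
    print n, d[1], d[2], d[3]
    for (i = 1; i <= n; i++)
        delete d[i]                      # remove every element
    print ((1 in d) ? "not empty" : "empty")
}'
```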
In either case, the rules are as for input field splitting, which is discussed in Section 2. Multidimensional Arrays. Awk does not support multidimensional arrays directly, but it provides a simulation using one-dimensional arrays. Although you can write multidimensional subscripts like [i, j] or [s, p, q, r], awk concatenates the components of the subscripts, with a separator between them, to synthesize a single subscript out of the multiple subscripts you write.
The built-in variable SUBSEP contains the value of the subscript-component separator; its default value is not a comma but a nonprinting character ("\034") unlikely to appear in data. To loop over such an array, you would write for (k in arr), where each k is a concatenated subscript. Array elements cannot themselves be arrays. User-Defined Functions. A function is defined by a statement of the form function name(parameter-list) { statements }. Thus, the general form of an awk program is a sequence of pattern-action statements and function definitions separated by newlines or semicolons. In a function definition, newlines are optional after the left brace and before the right brace of the function body.
The parameter list is a sequence of variable names separated by commas; within the body of the function these variables refer to the arguments with which the function was called. The body of a function definition may contain a return statement that returns control and perhaps a value to the caller.
It has the form return expression The expression is optional, and so is the return statement itself, but the returned value is undefined if none is provided or if the last statement executed is not a return.
For example, this function computes the maximum of its arguments: A user-defined function can be used in any expression in any pattern-action statement or the body of any function definition.
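The max function mentioned above can be sketched like this, with nested calls computing the maximum of three fields; the two-row sample input is invented.

```shell
printf '3 9 4\n10 2 7\n' |
awk '
function max(m, n) {
    return m > n ? m : n
}
{ print max($1, max($2, $3)) }          # maximum of the first three fields
'
```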
Each use is a call of the func- tion. If a user-defined function is called in the body of its own definition, that function is said to be recursive.
For example, the max function might be called as max($1, max($2, $3)), which computes the maximum of the first three fields. When a scalar is passed to a function, the function receives a copy of its value; this means that the function cannot affect the value of the variable outside the function. The jargon is that such variables, called "scalars," are passed "by value."
Array arguments are different: the function works on the array itself, not a copy, so changes the function makes to elements are visible to its caller. This is called passing "by reference." To repeat, within a function definition, the parameters are local variables - they last only as long as the function is executing, and they are unrelated to variables of the same name elsewhere in the program. But all other variables are global; if a variable is not named in the parameter list, it is visible and accessible throughout the program. This means that the way to provide local variables for the private use of a function is to include them at the end of the parameter list in the function definition.
Any variable in the parameter list for which no actual parameter is supplied in a call is a local variable, with null initial value. This is not a very elegant language design but it at least provides the necessary facility. We put several blanks between the arguments and the local variables so they can be distinguished.
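The local-variable convention can be sketched with a made-up function, sumto (the name is ours, not the book's): i and total are locals because the caller supplies no values for them, and the global i is untouched by the call.

```shell
awk '
function sumto(n,   i, total) {      # n is a real parameter; i, total are locals
    for (i = 1; i <= n; i++)
        total += i
    return total
}
BEGIN { i = 99                       # a global i, unrelated to the local one
        print sumto(4), i }'
```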
These statements can be used in any mixture; the output comes out in the order in which it is generated. (Pipes and system may not be available on non-Unix systems.) The print Statement. The print statement has two forms: print, which prints $0, and print followed by a list of expressions. For example, the one-line program print $1 ":" $2 "\n" prints the first and second fields of each line with a colon between the fields and two newlines after the second field. The printf statement gives precise control over format; like print, it has both an unparenthesized and a parenthesized form. Output can also be redirected with >, so a program can put the first and third fields of all input lines into two different files; filenames can be variables or expressions as well as string constants. Output Into Pipes. It is also possible to direct output into a pipe instead of a file, on systems that support pipes.
The statement print expressions | command causes the output of print to be piped into the command. Suppose we want to create a list of continent-population pairs, sorted in reverse numeric order by population.
The program below accumulates in an array pop the population values in the third field for each of the distinct continent names in the fourth field. The END action prints each continent name and its population, and pipes this output into a suitable sort command.
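A sketch of that program, with invented sample data; the sort flags shown (-k2 -rn, reverse numeric on the second field) are one common way to spell the ordering, not necessarily the book's exact command.

```shell
printf 'Japan\t146\t120\tAsia\nIndia\t1269\t746\tAsia\nFrance\t211\t55\tEurope\n' |
awk '    { pop[$4] += $3 }
     END { for (c in pop)
               printf("%s %d\n", c, pop[c]) | "sort -k2 -rn" }'
```

The pipe is flushed when awk exits, so the sorted lines appear on awk's standard output.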
There are several idioms for writing on the standard error file, such as piping the message through a command with print message | "cat 1>&2". If a file or pipe is explicitly closed and then reused, it will be reopened. Closing Files and Pipes. The statement close(expr) closes a file or pipe denoted by expr; the string value of expr must be the same as the string used to create the file or pipe in the first place.
There are also system-defined limits on the number of files and pipes that can be open at the same time. The most common arrangement is to put input data in a file, say data, and then type awk 'program' data. Awk reads its standard input if no filenames are given; thus, a second common arrangement is to have another program pipe its output into awk.
For example, the program egrep selects input lines containing a specified regular expression, but it does this much faster than awk does. Input Separators. The default value of the built-in variable FS is " ", that is, a single blank. The field separator can be changed by assigning a string to the built-in variable FS. If the string is longer than one character, it is taken to be a regular expression. The leftmost longest nonnull and nonoverlapping substrings matched by that regular expression become the field separators in the current input line.
When FS is set to a single character other than blank, that character becomes the field separator, taken literally; this convention makes it easy to use regular expression metacharacters as field separators. FS can also be set on the command line with the -F argument.
Multiline Records By default, records are separated by newlines, so the terms "line" and "record" are normally synonymous.
The default record separator can be changed in a limited way, however, by assigning a new value to the built-in record-separator variable RS; in particular, setting RS to the null string "" makes a blank line the record separator, so a record can occupy several lines. With multiline records, no matter what value FS has, newline is always one of the field separators. There is an implementation-dependent limit on how long a record can be. Chapter 3 contains more discussion of how to handle multiline records. The getline Function. The function getline can be used to read input either from the current input or from a file or pipe.
By itself, getline fetches the next input record and performs the normal field-splitting operations on it. It sets NF, NR, and FNR; it returns 1 if there was a record present, 0 if end-of-file was encountered, and -1 if some error occurred (such as failure to open a file). The form getline var instead reads the next record into the variable var; no splitting is done, and NF is not set. In every form, the value of the getline expression is the value getline returns. As an example, this program copies its input to its output, except that each line like include "filename" is replaced by the contents of the file filename.
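The file-inclusion idea can be sketched as follows; the scratch files and their contents are invented for the demonstration, and the > 0 test guards against read errors as the text advises.

```shell
cd "$(mktemp -d)"                      # scratch directory for the demo files
printf 'alpha\nbeta\n' > insert.txt
printf 'start\ninclude "insert.txt"\nend\n' > main.txt
awk '
$1 == "include" { gsub(/"/, "", $2)               # strip the quotes
                  while ((getline line < $2) > 0) # copy the named file
                      print line
                  close($2)
                  next }
{ print }' main.txt
```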
For example, the statement while ("who" | getline) n++ pipes the output of the who command into getline. The output of who is a list of the users logged in. Each iteration of the while loop reads one more line from this list and increments the variable n, so after the while loop terminates, n contains a count of the number of users. Similarly, the expression "date" | getline d pipes the output of the date command into the variable d, thus setting d to the current date and time. Again, input pipes may not be available on non-Unix systems.
In all cases involving getline, you should be aware of the possibility of an error return if the file can't be accessed. If a filename on the command line has the form var=text, however, it is treated as an assignment of text to var, performed at the time when that argument would otherwise be accessed as a file. This type of assignment allows variables to be changed before and after a file is read.
The value of the built-in variable ARGC is one more than the number of command-line arguments, because awk, the name of the command, is counted as argument zero, as it is in C programs; the arguments themselves are available as elements of the built-in array ARGV.
If the awk program appears on the command line, however, the program is not treated as an argument, nor is -f filename or any -F option. With ARGV, a program can echo its command-line arguments, or implement a command like seq, which generates sequences of integers from its arguments. Setting an element of ARGV to the null string means that it will not be treated as an input file.
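The argument-echoing program can be sketched like this; since the whole program is a BEGIN action, awk exits before trying to read hello and world as files.

```shell
awk 'BEGIN {
    for (i = 1; i < ARGC; i++)                         # ARGV[0] is "awk"
        printf("%s%s", ARGV[i], i < ARGC - 1 ? " " : "\n")
}' hello world
```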
The name "-" may be used for the standard input. The discussion that follows applies primarily to the Unix operating system; the examples here may fail or work differently on non-Unix systems. The system Function. The built-in function system(expression) executes the command given by the string value of expression. The value returned by system is the status returned by the command executed.
For example, we can build another version of the file-inclusion program of Section 2: on each include line, system runs a command to print the included file, and other lines are just copied. Both methods of invoking the awk program require some typing.
To reduce the number of keystrokes, we might want to put both the command and the program into an executable file, and invoke the command by typing just the name of the file. Suppose we want to create a command field 1 that will print the first field of each line of input. This is easy: an executable file containing the one-line awk invocation does it. Now, consider writing a more general command field that will print an arbitrary combination of fields from each line of its input; in other words, the command will print the specified fields in the specified order.
How do we get the value of each n into the awk program each time it is run, and how do we distinguish the n's from the filename arguments? There are several ways to do this if one is adept in shell programming. The simplest way that uses only awk, however, is to scan through the built-in array ARGV to process the n's, resetting each such argument to the null string so that it is not treated as a filename.
You will find that it pays to go back and re-read sections from time to time, either to see precisely how something works, or because one of the examples in later chapters suggests a construction that you might not have tried before. Awk, like any language, is best learned by experience and practice, so we encourage you to go off and write your own programs. They don't have to be big or complicated - you can usually learn how some feature works or test some crucial point with only a couple of lines of code, and you can just type in data to see how the program behaves.
There are numerous books on how to use the Unix system; The Unix Programming Environment, by Brian Kernighan and Rob Pike (Prentice-Hall), has an extensive discussion of how to create shell programs that include awk. We have already seen simple examples of these in Chapters 1 and 2. In this chapter, we will consider more complex tasks of a similar nature.
Most of the examples deal with the usual line-at-a-time processing, but the final section describes how to handle data where an input record may occupy several lines. Awk programs are often developed incrementally: a few lines are written and tested, then a few more added, and so on. Many of the longer programs in this book were developed in this way.
It's also possible to write awk programs in the traditional way, sketching the outline of the program, consulting the language manual, and so forth. But modifying an existing program to get the desired effect is frequently easier. The programs in this book thus serve another purpose, providing useful models for programming by example. Another use is selection of relevant data from a larger data set, often with reformatting and the preparation of summary information.
This section contains a variety of examples of these topics. Summing Columns. We have already seen several variants of the two-line awk program that adds up all the numbers in a single field. The following program performs a somewhat more complicated but still representative data-reduction task. Every input line has several fields, each containing numbers, and the task is to compute the sum of each column of numbers, regardless of how many columns there are.
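The column-summing task can be sketched as follows; the two rows of numbers are invented, and maxfld tracks the widest row seen so the program need not be told how many columns there are.

```shell
printf '1 2 3\n4 5 6\n' |
awk '    { for (i = 1; i <= NF; i++)
               sum[i] += $i
           if (NF > maxfld) maxfld = NF       # remember the widest row
         }
     END { for (i = 1; i <= maxfld; i++)
               printf("%g%s", sum[i], i < maxfld ? " " : "\n")
         }'
```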
It's also worth noting that the program prints nothing if the input file is empty. It's convenient that the program doesn't need to be told how many fields a row has, but it doesn't check that the entries are all numbers, nor that each row has the same number of entries.
The following program does the same job, but also checks that each row has the same number of entries as the first. Now suppose that some of the fields are nonnumeric, so they shouldn't be included in the sums. The strategy is to add an array numcol to keep track of which fields are numeric, and a function isnum to check if an entry is a number.
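The numcol-and-isnum strategy can be sketched like this. The sample rows are invented; isnum uses the deliberately simple integer test the text implies (a more general regular expression for numbers is mentioned below).

```shell
printf 'apple 3 7\npear 10 1\n' |
awk '
function isnum(n) { return n ~ /^[+-]?[0-9]+$/ }
NR == 1 { nfld = NF
          for (i = 1; i <= NF; i++)
              numcol[i] = isnum($i)          # trust the first line
        }
        { for (i = 1; i <= nfld; i++)
              if (numcol[i])
                  sum[i] += $i               # sum only the numeric columns
        }
END     { for (i = 1; i <= nfld; i++)
              if (numcol[i])
                  printf("%g%s", sum[i], i < nfld ? " " : "\n")
        }'
```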
If the program can trust its input, it need only look at the first line to tell if a field will be numeric. A more general definition for numbers can be found in the discussion of regular expressions in Section 2. Exercises: Modify the program sum3 to ignore blank lines. Add the more general regular expression for a number; how does it affect the running time? What is the effect of removing the test of numcol in the second for statement? Write a program that reads a list of item and quantity pairs and for each item on the list accumulates the total quantity; at the end, it prints the items and total quantities, sorted alphabetically by item. Suppose next that we want each number expressed as a percentage of the total; this requires two passes over the data. If there's only one column of numbers and not too much data, the easiest way is to store the numbers in an array on the first pass, then compute the percentages on the second pass as the values are being printed. Once grades have been computed as numbers between 0 and 100, it might be interesting to see a histogram, and we can test the histogram maker with some randomly generated grades.
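A histogram sketch along the lines described: grades from 0 to 100 are bucketed by tens, with a row of stars per bucket. The five sample grades are invented, and rep is a small helper (ours) that repeats a string.

```shell
printf '5\n15\n17\n95\n100\n' |
awk '
function rep(n, s,   t) {        # return n copies of s; t is a local
    while (n-- > 0)
        t = t s
    return t
}
{ x[int($1/10)]++ }              # bucket by tens
END { for (i = 0; i < 10; i++)
          printf(" %2d - %2d: %3d %s\n", 10*i, 10*i+9, x[i], rep(x[i], "*"))
      printf("100:      %3d %s\n", x[10], rep(x[10], "*"))
    }'
```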
The first program in the pipeline below generates random numbers between 0 and 100, and pipes them into the histogram maker. Exercises: Scale the rows of stars so they don't overflow the line length when there's a lot of data. Make a version of the histogram code that divides the input into a specified number of buckets, adjusting the ranges according to the data seen. Numbers with Commas. Suppose we have a list of numbers that contain commas and decimal points, like 12,345.67. Since awk thinks that the first comma terminates a number, these numbers cannot be summed directly.
The commas must first be erased, for instance with gsub(/,/, ""). This program doesn't check that the commas are in the right places, nor does it print commas in its answer. Putting commas into numbers requires only a little effort, as the next program shows. It formats numbers with commas and two digits after the decimal point.
The structure of this program is a useful one to emulate: the commafication is done by a function that can be developed and tested separately; after it's been tested and is working, the new function can be included in the final program. The basic idea is to insert commas from the decimal point to the left in a loop; each iteration puts a comma in front of the leftmost three digits that are followed by a comma or decimal point, provided there will be at least one additional digit in front of the comma. The algorithm uses recursion to handle negative numbers. Exercise: Modify sumcomma, the program that adds numbers with commas, to check that the commas in the numbers are properly positioned.
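The commafication function can be reconstructed from that description; this sketch works leftward from the decimal point, inserting a comma before each group of three digits that is followed by a comma or decimal point, and recurses for negative numbers.

```shell
awk '
function addcomma(x,   num) {
    if (x < 0)
        return "-" addcomma(-x)                # recursion handles the sign
    num = sprintf("%.2f", x)                   # now dddddd.dd
    while (num ~ /[0-9][0-9][0-9][0-9]/)       # four digits in a row left?
        sub(/[0-9][0-9][0-9][,.]/, ",&", num)  # comma before the last three
    return num
}
BEGIN { print addcomma(1234567.89)
        print addcomma(-42.5) }'
```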
Fixed-Field Input. Information appearing in fixed-width fields often requires some kind of preprocessing before it can be used directly. Some programs, such as spreadsheets, put out numbers in fixed columns, rather than with field separators; if the numbers are too wide, the columns abut. Fixed-field data is best handled with substr, which can be used to pick apart any combination of columns. For example, suppose the first six characters of each line contain a date in the form mmddyy.
The easiest way to sort this by date is to use substr to convert the dates into the form yymmdd, which sorts correctly as a string. Exercise: How would you convert dates into a form in which you can do arithmetic, like computing the number of days between two dates? Awk is also useful for processing the output of other programs. Sometimes that output is merely a set of homogeneous lines, in which case field-splitting or substr operations are quite adequate.
Sometimes, however, the upstream program thinks its output is intended for people. In that case, the task of the awk program is to undo careful formatting, so as to separate the information from the irrelevant. The next example is a simple instance. Large programs are built from many files. It is convenient and sometimes vital to know which file defines which function, and where the function is used. To that end, the Unix program nm prints a neatly formatted list of the names, definitions, and uses of the names in a set of object files.
A typical fragment of its output lists each object filename on a line by itself, followed by its symbols: T indicates that the name is a text symbol (a function) defined in that file, and U indicates that the name is undefined there.
Using this raw output to determine what file defines or uses a particular symbol can be a nuisance, since the filename is not attached to each symbol. A three-line awk program, however, can add the filename to each item, so subsequent programs can retrieve the useful information from one line; the converted output consists of lines like

    file.o: T _addroot
    file.o: T _checkdev
    file.o: T _checkdupl
    file.o: U _chown
    file.o: U _client
    funmount.o: U _close
    funmount.o: T _funmount

This technique does not provide line number information, nor tell how many times a name is used in a file, but these things can be found by a text editor or another awk program.
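The three-line program can be sketched as follows. The nm-like input is typed in by hand (real nm output also carries addresses, which vary from system to system); a lone field names the file, and symbol lines have two fields (undefined) or three (defined).

```shell
printf 'file.o:\n00000c80 T _addroot\nU _chown\n' |
awk '
NF == 1 { file = $1 }                  # a lone field names the current file
NF == 2 { print file, $1, $2 }         # undefined symbol: U name
NF == 3 { print file, $2, $3 }'        # defined symbol: addr T name
```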
Nor does it depend on which language the programs are written in, so it is much more flexible than the usual run of cross-referencing tools, and shorter and simpler too. Formatted Output. As another example we'll use awk to make money, or at least to print checks. The input consists of lines, each containing a check number, an amount, and a payee, separated by tabs. The output goes on check forms, eight lines high. The second and third lines have the check number and date indented 45 spaces, the fourth line contains the payee in a field 45 characters long, followed by three blanks, followed by the amount.
The fifth line contains the amount in words, and the other lines are blank. Note also how we combine line continuation and string concatenation to create the string argument to split in the function initnum; this is a useful idiom. The date comes from the system via "date" | getline date, which runs the date command and pipes its output into getline.
A little processing converts the date from the form produced by the date command (beginning Wed Jun 17) into a month-day-year form. The functions numtowords and intowords convert numbers to words. They are straightforward, although about half the program is devoted to them.
The function intowords is recursive. This is the second example of recursion in this chapter, and we will see others later on. In each case, recursion is an effective way to break a big job into smaller, more manageable pieces. Exercise: Use the function addcomma from a previous example to include commas in the printed amount.
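The recursive intowords might look like the sketch below, reconstructed from the description: each level peels off thousands, then hundreds, then tens, and initnum loads the word tables using the line-continuation-plus-concatenation idiom mentioned above. (The occasional doubled blank in its output is the blemish a later exercise asks you to fix.)

```shell
awk '
function initnum() {
    split("one two three four five six seven eight nine " \
          "ten eleven twelve thirteen fourteen fifteen " \
          "sixteen seventeen eighteen nineteen", nums, " ")
    split("ten twenty thirty forty fifty sixty " \
          "seventy eighty ninety", tens, " ")
}
function intowords(n) {
    n = int(n)
    if (n >= 1000)
        return intowords(n/1000) " thousand " intowords(n%1000)
    if (n >= 100)
        return intowords(n/100) " hundred " intowords(n%100)
    if (n >= 20)
        return tens[int(n/10)] " " intowords(n%10)
    return nums[n]
}
BEGIN { initnum()
        print intowords(1234) }'
```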
The program prchecks does not deal with negative quantities or very long amounts in a graceful way. Modify the program to reject requests for checks for negative amounts and to split very long amounts onto two lines. The function numtowords sometimes puts out two blanks in a row, and it also produces blunders like "one dollars." Modify the program to fix these defects, and to put hyphens into the proper places in spelled-out amounts, as in "twenty-one dollars."
This section contains several small programs that check input for validity. For example, consider the column-summing programs in the previous section: are there numeric fields where there should be nonnumeric ones, or vice versa? Such a checking program is very close to one we saw before, with the summing removed. A second example: the programs in this book are delimited by .P1 and .P2 lines, text-formatting commands that make the programs come out in their distinctive font when the text is typeset.
Since programs cannot be nested, these text-formatting commands must form an alternating sequence: .P1, .P2, .P1, .P2, and so on. If one or the other of these delimiters is omitted, the output will be badly mangled by our text formatter.
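The delimiter checker can be reconstructed along these lines: p records whether we are inside a .P1/.P2 pair, and each delimiter line checks that it arrives in turn. The four-line sample input (with a deliberate stray .P2) is invented.

```shell
printf '.P1\ncode\n.P2\n.P2\n' |
awk '
/^\.P1/ { if (p != 0)
              print ".P1 after .P1, line", NR
          p = 1 }
/^\.P2/ { if (p != 1)
              print ".P2 with no preceding .P1, line", NR
          p = 0 }
END     { if (p != 0)
              print "missing .P2 at end" }'
```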
To make sure that the programs would be typeset properly, we wrote this tiny delimiter checker, which is typical of a large class of such programs: it reports a .P1 coming after a .P1, and a .P2 with no preceding .P1. Exercise: What is the best way to extend this program to handle multiple sets of delimiter pairs? Password-File Checking. The password file on a Unix system contains the name of and other information about authorized users.
Each line of the password file has 7 fields, separated by colons. The first field is the user's login name, which should be alphanumeric. The second is an encrypted version of the password; if this field is empty, anyone can log in pretending to be that user, while if there is a password, only people who know the password can log in. The third and fourth fields, the numeric user id and group id, are supposed to be numeric. The following program prints all lines that fail to satisfy these criteria, along with the number of the erroneous line and an appropriate diagnostic message.
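A sketch of those checks, with the exact message wording ours and a single invented sample line (its empty second field should trip the no-password test):

```shell
printf 'root::0:0:super-user:/:/bin/sh\n' |
awk '
BEGIN { FS = ":" }
NF != 7 {
    printf("line %d, does not have 7 fields: %s\n", NR, $0) }
$1 ~ /[^A-Za-z0-9]/ {
    printf("line %d, nonalphanumeric login name: %s\n", NR, $0) }
$2 == "" {
    printf("line %d, no password: %s\n", NR, $0) }
$3 ~ /[^0-9]/ {
    printf("line %d, nonnumeric user id: %s\n", NR, $0) }
$4 ~ /[^0-9]/ {
    printf("line %d, nonnumeric group id: %s\n", NR, $0) }'
```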
Running this program every night is a small part of keeping a system healthy and safe from intruders. Here is a small set of error conditions and messages, where each condition is a pattern from the program above; the error message is to be printed for each input line where the condition is true. From such condition-message pairs, an awk program called checkgen generates the checking program itself. Note that in checkgen, some of the special characters in the printf format string must be quoted to produce a valid generated program.
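A sketch of checkgen: each input line holds a pattern, a tab, and a message, and the output is an awk program that prints the message for matching lines. The doubled %% and the \\n are the quoting the text alludes to; the condition shown is borrowed from the password checks, but any pattern-message pair works.

```shell
printf '$2 == ""\tno password\n' |
awk '
BEGIN { FS = "\t" }
{ printf("%s {\n\tprintf(\"line %%d, %s: %%s\\n\", NR, $0) }\n", $1, $2) }'
```

The generated text is itself a runnable awk program, so a good test is to feed it back to awk.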
This technique in which one awk program creates another is broadly applicable, and of course it's not restricted to awk programs.
We will see several more examples of its use throughout this book. Exercise: Add a facility to checkgen so that pieces of code can be passed through verbatim, for example, to create a BEGIN action to set the field separator. Awk is often useful for inspecting programs, or for organizing the activities of other testing programs. This section contains a somewhat incestuous example: an awk program, compat, that examines other awk programs for constructs of the new language that are absent from the old one. The program does a reasonable job of detecting such problems in old programs, although the job isn't done perfectly, so some lines may not be properly processed.
The third argument of the first split function is a string that is interpreted as a regular expression. The leftmost longest substrings matched by this regular expression in the input line become the field separators. The function asplit is just like split, except that it creates an array whose subscripts are the words within the string.
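asplit can be sketched directly from that description: the words become subscripts, so membership can be tested with in. The three words loaded here are an invented sample.

```shell
awk '
function asplit(str, arr,   temp, i, n) {    # temp, i, n are locals
    n = split(str, temp)
    for (i = 1; i <= n; i++)
        arr[temp[i]]++                       # word becomes a subscript
    return n
}
BEGIN { asplit("func delete in", keyword)
        print ("delete" in keyword), ("print" in keyword) }'
```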
Incoming words can then be tested for membership in this array. Exercises: Rewrite compat to identify keywords and the like with regular expressions, and compare the two versions on complexity and speed. Because awk variables are not declared, a misspelled name will not be detected; write a program to identify names that are used only once. To make it truly useful, you will have to handle function declarations and variables used in functions.
The problem is to combine ("bundle") a set of ASCII files into one file in such a way that they can be easily separated ("unbundled") into the original files. This section contains two tiny awk programs that do this pair of operations. They can be used for bundling small files together to save disk space, or to package a collection of files for convenient electronic mailing. The bundle program is trivial, so short that you can just type it on a command line. There are other ways to write bundle and unbundle, but the versions here are the easiest, and for short files, reasonably space efficient.
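The pair can be reconstructed along these lines: bundle prefixes every line with its FILENAME, and unbundle strips the prefix and writes each line back to the named file, closing the previous file whenever the name changes. The scratch files f1 and f2 are invented for the round-trip demonstration.

```shell
cd "$(mktemp -d)"                      # work in a scratch directory
printf 'a1\na2\n' > f1
printf 'b1\n' > f2

awk '{ print FILENAME, $0 }' f1 f2 > bundle.out     # bundle
rm f1 f2

awk '$1 != prev { close(prev); prev = $1 }          # unbundle
     { print substr($0, index($0, " ") + 1) > $1 }' bundle.out

cat f1 f2
```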
Another organization is to add a distinctive line with the filename before each file, so the filename appears only once. Compare the speed and space requirements of these versions of bundle and unbundle with variations that use headers and perhaps trailers.
Evaluate the tradeoff between performance and program complexity. Many other kinds of data, however, come in multiline chunks.
Examples include address lists, where each entry is a name (such as Adam Smith) followed by street and city lines, and inventories, where an item like Chateau Lafite Rothschild is followed by a quantity like 12 bottles. Dealing with such data in awk requires only a bit more work than single-line data does; we'll show several approaches. Suppose first that records are separated by a single blank line; processing would be easy if each line of a record were a field.
Setting RS to "" makes blank lines the record separators, and when RS is "", the field separator by default is any sequence of blanks and tabs, or newline; conveniently, each line of a multiline record is then a field. Processing Multiline Records. If an existing program can process its input only by lines, we may still be able to use it for multiline records by writing two awk programs.
The first combines the multiline records into single-line records that can be processed by the existing program. Then, the second transforms the processed output back into the original multiline format. We'll assume that limits on line lengths are not a problem. To illustrate, let's sort our address list with the Unix sort command.
The following pipeline sorts the address list by last name (this assumes that the last word on the first line really is the last name). For each multiline record the first program creates a single line consisting of the last name, followed by a separator string, followed by all the fields in the record separated by that same string. Any separator that does not occur in the data and that sorts earlier than the data will do.
Exercise: Modify the first awk program to detect occurrences of the magic separator string in the data. Records with Headers and Trailers. Sometimes records are identified by a header and trailer, rather than by a record separator. Consider a simple example, again an address list, but this time each record begins with a header that indicates some characteristic, such as occupation, of the person whose name follows (a line like lawyer before a name and address such as Will Seymour, Maple Blvd.), and each record except possibly the last is terminated by a trailer consisting of a blank line.
Suppose we want to print all records with a given header. A variable p controls the printing: when a line containing the desired header is found, p is set to one; a subsequent line containing a trailer resets p to zero, its default initial value.
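The p-flag idiom can be sketched as follows; "lawyer" stands in for whatever header is wanted, and the two sample records are invented. The header line itself is skipped with next, so only the body and trailer print.

```shell
printf 'lawyer\nWill Seymour\nMaple Blvd.\n\nchef\nAnna Lee\n\n' |
awk '
/^lawyer$/ { p = 1; next }     # desired header: start printing after it
p == 1     { print }
/^$/       { p = 0 }           # blank-line trailer: stop'
```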
Since lines are printed only when p is set to one, only the body and trailer of each record are printed; other combinations are easily selected instead. Sometimes data has more structure than fields alone can capture: addresses might include a country name, or might not have a street address. One way to deal with structured data is to add an identifying name or keyword to each field of each record. For example, a checkbook might be organized with one name-value pair per line: a line identifying the record as a check or a deposit, then lines giving the amount, the date, and the payee. With this format, different records can contain different fields, or similar fields in arbitrary order.
One way to process this kind of data is to treat it as single lines, with occasional blank lines as separators. Each line identifies the value it corresponds to, but the lines are not otherwise connected. So to accumulate the sums of deposits and checks, for example, we could simply scan the input for deposit, check, and amount lines. That works, but it is delicate, requiring careful initialization, reinitialization, and end-of-file processing.
Thus an appealing alternative is to read each record as a unit, then pick it apart as needed. The following program computes the same sums of deposits and checks, using a function to extract the value associated with an item of a given name: A third possibility is to split each field into an associative array and access that for the values.
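The record-as-a-unit alternative can be sketched as below. The record layout (a type line, then name-tab-value lines) and the sample amounts are our guess at the checkbook format described; field is the value-extraction function the text mentions.

```shell
printf 'deposit\namount\t500.00\n\ncheck\namount\t123.10\n\ncheck\namount\t50.00\n' |
awk '
BEGIN { RS = ""; FS = "\n" }                 # one multiline record at a time
$1 == "deposit" { deposits += field("amount") }
$1 == "check"   { checks   += field("amount") }
END { printf("deposits %.2f, checks %.2f\n", deposits, checks) }
function field(name,   i, f) {               # value of the line labeled name
    for (i = 1; i <= NF; i++) {
        split($i, f, "\t")
        if (f[1] == name)
            return f[2]
    }
    return 0
}'
```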
Exercise: Write a command lookup x y that will print from a known file all multiline records having the item name x with value y. There are several reasons why such diverse tasks are fairly easy to do in awk; the pattern-action model is a good match to this kind of processing. In the following chapters, we'll see further applications of these facilities.
The emphasis is on tabular data, but the techniques apply to more complex forms as well. The theme is the development of programs that can be used with one another. We will see a number of common data-processing problems that are hard to solve in one step, but easily handled by making several passes over the data. The first part of the chapter deals with generating reports by scanning a single file.
Although the format of the final report is of primary interest, there are complexities in the scanning too. The second part of the chapter describes one approach to collecting data from several interrelated files.
We've chosen to do this in a fairly general way, by thinking of the group of files as a relational database.
One of the advantages is that fields can have names instead of numbers. We will use a three-step process to generate reports: prepare, sort, and format. The preparation step involves selecting data and perhaps performing computations on it to obtain the desired information.
The sort step is necessary if we want to order the data in some particular fashion. To perform this step we pass the output of the preparation program into the system sort command. The formatting step is done by a second awk program that generates the desired report from the sorted data. In this section we will generate a few reports from the countries file of Chapter 2 to illustrate the approach.
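The prepare-sort-format pipeline can be sketched in miniature. The tab-separated columns (country, population in millions, continent) are an assumed layout for illustration, not the exact countries file:

```shell
printf 'USSR\t275\tAsia\nCanada\t25\tNorth America\nIndia\t746\tAsia\n' |
awk -F'\t' '{ print $3 "\t" $1 "\t" $2 }' |  # prepare: put the sort key first
sort |                                       # sort: lexicographic on continent
awk -F'\t' '{ printf "%-15s %-10s %5d\n", $1, $2, $3 }'   # format
```

Each stage is a small program that does one job; changing the report means changing only the stage responsible for it.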
A Simple Report. Suppose we want a report giving the population, area, and population density of each country. The -t option, discussed in Section 6, tells sort what character separates the fields; this applies to all the examples in this chapter.
We have completed the preparation and sort steps; all we need now is to format this information into the desired report. The program form1 does the job. By default, the sort command sorts its input lexicographically. In the final report, the output needs to be sorted alphabetically by continent and in reverse numerical order by population density. To avoid arguments to sort, the preparation program can put at the beginning of each line a quantity depending on continent and population density that, when sorted lexicographically, will automatically order the output correctly.
A format with several digits after the decimal point covers a wide range of reciprocal densities. The final formatting program is like form1 but skips the manufactured second field. The trick of manufacturing a sort key that simplifies the sorting options is quite general. If we would like a slightly fancier report in which only the first occurrence of each continent name is printed, we can use the formatting program form2 in place of form1. The variable prev keeps track of the value of the continent field; only when it changes is the continent name printed.
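The form2 idea can be sketched with assumed input, already sorted by continent:

```shell
# Print the continent name only on its first occurrence; prev
# remembers the previous group.  Input columns are assumed.
printf 'Asia\tJapan\nAsia\tIndia\nEurope\tFrance\n' |
awk -F'\t' '
  $1 != prev { print $1; prev = $1 }   # control break: new continent
  { print "\t" $2 }
'
```

The comparison against `prev` is the whole control-break mechanism: one line of state, tested on every input line.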
In the next section, we will see a more complicated example of control-break programming. A More Complex Report. Typical business reports have more substance, or at least more form, than what we have seen so far. We would also like to add a title and more column headers.
In the first pass it accumulates the area and population of each continent in the arrays area and pop, and also the totals areatot and poptot. The two passes are controlled by the value of the variable pass, which can be changed on the command line between passes. The form3 program prints a total after all of the entries for each continent have been seen. But naturally it doesn't know that all the entries have been seen until a new continent is encountered.
Dealing with this "we've gone too far" situation is the classic example of control-break programming. The solution here is to test each input line before printing, to see whether a total has to be produced for the previous group; the same test has to be included in the END action as well, which means it's best to use a function for the computation. Control breaks are easy enough when there is only one level, but get messier when there are multiple levels. As these examples suggest, complex formatting tasks can often be done by the composition of awk programs.
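A minimal sketch of that pattern, with an assumed two-column input; the function (here called `brk`, our own name) is invoked both at each break and in the END action:

```shell
# Classic "we've gone too far" fix: emit the previous group's total
# when a new group starts, and once more at end of input.
printf 'Asia\t120\nAsia\t750\nEurope\t55\n' |
awk -F'\t' '
  $1 != prev { brk(); prev = $1 }
  { sum += $2 }
  END { brk() }
  function brk() {          # print the total for the finished group
      if (prev != "") print prev, sum
      sum = 0
  }
'
```

Putting the break logic in one function guarantees the main loop and END stay in step, which is the point the text is making.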
An alternative is to let a program compute how big things are, then do the positioning for you. It would be quite feasible to write an awk program to format simple tables for printers; we'll come back to that in a moment. Since we are using Unix and a typesetter, however, we can use what already exists. The program form4 is very similar to form3, except that it contains no magic numbers for column widths.
Instead, it generates some tbl commands and the table data in columns separated by tabs; tbl does the rest. If you are not familiar with tbl, you can safely ignore the details. Implementing a program as sophisticated as tbl is too ambitious, but let's make something smaller: a table formatter. In other words, given a header and the countries file as input, it would print a neatly aligned table. The second pass in the END action prints each item in the proper position. Left-justifying alphabetic items is easy; it's a bit more work for numeric items. Exercise: Modify form3 and form4.
Because of rounding, the printed values may not be exactly consistent. How would you correct this? The table formatter assumes that all numbers have the same number of digits after the decimal point.
Modify it to work properly if this assumption is not true. Enhance table to permit a sequence of specification lines that tell how the subsequent data is to be formatted in each column. This is how tbl is controlled. Suppose we want to determine the population, area, and population density of various countries. Now, if we want to invoke this same command on different countries, we would get tired of substituting the new country name into the awk program every time we executed the command.
We would find it more convenient to put the program into an executable file, say info, and answer queries by typing info Canada or info USA. We can use the technique from Section 2. What's happening is that the shell makes up the awk program by concatenating three strings. Notice that any regular expression can be passed to info; in particular, it is possible to retrieve information by specifying only a part of a country name or by specifying several countries at once, as in info 'Can|USA'. Exercise: Revise the info program so the regular expression is passed in through ARGV instead of by shell manipulations.
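A sketch of the shell-concatenation idea behind info; the file layout and the output format here are assumptions for illustration:

```shell
country="Can"    # stands in for the argument a user would type: info Can
# The shell splices $country between two fixed pieces of awk program,
# so the pattern the user typed becomes part of a regular expression.
printf 'Canada\t3852\t25\tNorth America\nChina\t3705\t1032\tAsia\n' |
awk -F'\t' '$1 ~ /'"$country"'/ {
    printf "%s: %d million people\n", $1, $3
}'
```

Because the argument lands inside `/.../`, any regular expression works, which is why `info 'Can|USA'` retrieves two countries at once.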
The text contains parameters that will be replaced by a set of parameter values for each form letter that is generated. Demographic Information About #1. From: AWK Demographics, Inc.
In response to your request for information about #1, our latest research has revealed that its population is #2 million people and its area is #3 million square miles. This gives #1 a population density of #4 people per square mile. From input values for Canada it produces: Demographic Information About Canada. In response to your request for information about Canada, our latest research has revealed that its population is 25 million people and its area is 3.
This gives Canada a population density of 6. The program form.gen does the job. Notice how string concatenation is used to create the first argument of gsub. This system extends awk as a database language in three ways: fields are referred to by name rather than by number; the database can be spread over several files rather than just one; and a sequence of queries can be made interactively. A multifile database is easier to maintain, primarily because it is easier to edit a file with a small number of fields than one that contains all of them.
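Returning to the form-letter step for a moment: the gsub substitution with a concatenated first argument can be sketched as below. The `#`-numbered markers follow the text; the template wording and the colon-separated input are invented for illustration:

```shell
# Replace each numbered parameter #1, #2, ... in the template with
# the corresponding input field.  "#" i concatenates the marker.
printf 'Canada:25:3.8\n' |
awk -F: '
{
    t = "Dear reader: #1 has #2 million people and #3 million sq. mi."
    for (i = 1; i <= NF; i++)
        gsub("#" i, $i, t)    # string concatenation builds the pattern
    print t
}'
```
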
Also, with the database system of this section it is possible to restructure the database without having to change the programs that access it. Finally, for simple queries it is more efficient to access a small file than a large one. On the other hand, we have to be careful to change all relevant files whenever we add information to the database, so that it remains consistent.
Up to this point, our database has consisted of a single file named countries in which each line has four fields, named country, area, population, and continent. Suppose we add to this database a second file called capitals where each entry contains the name of a country and its capital city. From these two files, if we want to print the names of the countries in Asia along with their populations and capitals, we have to scan both files and then piece together the results. A command along those lines would work if there is not too much input data; shortly we will see how to phrase this query in q, the language described below.
Natural Joins It's time for a bit of terminology. In relational databases, a file is called a table or relation and the columns are called attributes. So we might say that the capitals table has the attributes country and capital. A natural join, or join for short, is an operator that combines two tables into one on the basis of their common attributes.
The attributes of the resulting table are all the attributes of the two tables being joined, with duplicates removed. If we join the two tables countries and capitals, we get a single table, let's call it cc, that has the attributes country, area, population, continent, capital. For each country that appears in both tables, we get a row in the cc table that has the name of the country, followed by its area, population, continent, and then its capital. To answer a query involving attributes from several tables, we will first join the tables and then apply the query to the resulting table.
That is, when necessary, we create a temporary file. The trick is how, in general, to decide which tables to join. The actual joining operation can be done by the Unix command join, but if you don't have that available, here is a basic version in awk.
It joins two files on the attribute in the first field of each. It makes an output line for each possible pairing of matching input fields.
Groups of lines with a common first attribute value are read from the second file. If the prefix of the line read from the first file matches the common attribute value of some group, each line of the group gives rise to a joined output line. The function getgroup puts the next group of lines with a common prefix into the array gp; it calls getone to get each line, and unget to put a line back if it is not part of the group.
We have localized the extraction of the first attribute value into the function prefix so it's easy to change. You should examine the way in which the functions getone and unget implement a pushback or "un-read" of an input line. Before reading a new line, getone checks to see if there is a line that has already been read and stored by unget, and if there is, returns that instead of reading a new one.
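The pushback pair can be sketched as follows. The function names follow the text (getone, unget), but the bodies are our own minimal versions, reading whitespace-separated lines whose first field names the group:

```shell
# unget saves a line; getone returns the saved line, if any, before
# reading a new one.  The group prefix "a" is assumed test data.
printf 'a 1\na 2\nb 3\n' |
awk '
BEGIN {
    while ((line = getone()) != "") {
        split(line, f)
        if (f[1] != "a") { unget(line); break }  # gone too far: push back
        print "group a:", line
    }
    print "next:", getone()    # the pushed-back line comes out again
}
function getone(   line) {
    if (saved != "") { line = saved; saved = ""; return line }
    if ((getline line) <= 0) return ""
    return line
}
function unget(line) { saved = line }'
```

The one-slot buffer `saved` is all the state required: the group reader can look one line past the end of a group and then pretend it never saw it.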
Pushback is a different way of dealing with a problem that we encountered earlier, reading one too many inputs. In the control-break programs early in this chapter, we delayed processing; here we pretend, through a pair of functions, that we never even saw the extra input.
This version of join does not check for errors or whether the files are sorted. Remedy these defects. How much bigger is the program? Implement a version of join that reads one file entirely into memory, then does the join. Which is simpler?
Modify join so it can join on any field or fields of the input files, and output any selected subset of fields in any order. We store this information in a file called the relfile ("rel" is for relation). The relfile contains the names of the tables in the database, the attributes they contain, and the rules for constructing a table if it does not exist.
The relfile is a sequence of table descriptors. After the tablename comes a list of the names of the attributes for that table, each prefixed by blanks or tabs. Following the attributes is an optional sequence of commands prefixed by exclamation points that tell how this table is to be constructed.
If a table has no commands, a file with that name containing the data of that table is assumed to exist already. Such a table is called a base table. Data is entered and updated in the base tables. A table with a sequence of commands appearing after its name in the relfile is a derived table. Derived tables are constructed when they are needed.
We will use the following relfile for our expanded countries database. The table cc is a universal relation for the countries-capitals database: it ensures that there is one table that contains any combination of attributes. A good design for a complex database should take into account the kinds of queries that are likely to be asked and the dependencies that exist among the attributes, but the small databases for which q is likely to be fast enough, with only a few tables, are unlikely to uncover subtleties in relfile design.
The query processor qawk answers a query as follows: It determines the set of attributes in the query. Starting from the beginning of the relfile, it finds the first table whose attributes include all the attributes in the query.
If this table is a base table, it uses that table as the input for the query. If the table is a derived table, it constructs the derived table and uses it as the input. This means that every combination of attributes that might appear in a query must also appear in either a base or derived table in the relfile.
It transforms the q query into an awk program by replacing the symbolic field references by the appropriate numeric field references. This program is then applied to the table determined in step 2. We have been using the word "query," but it's certainly possible to use qawk to compute as well, as in a computation of the average area. First, qawk reads the relfile and collects the table names into the array relname.
It collects any commands needed to construct the i-th table and stores them into the array cmd beginning at location cmd[i, 1].
It also collects the attributes of each table into the two-dimensional array attr; the entry attr[i, a] holds the index of the attribute named a in the i-th table. Using the subset function, it determines Ti, the first table whose attributes include all of the attributes present in the query.
It substitutes the indexes of these attributes into the original query to generate an awk program, issues whatever commands are needed to create Ti, then executes the newly generated awk program with Ti as input. The second step is repeated for each subsequent query.
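The heart of that transformation, turning symbolic field references into numeric ones, might be sketched like this; the attribute positions, the `$name` query syntax, and the query itself are assumptions for illustration:

```shell
# Rewrite a q-style query into plain awk by substituting $n for
# each symbolic field name, using an attribute-position table.
echo '$continent == "Asia" { print $country, $population }' |
awk '
BEGIN {    # attribute positions in the chosen table (assumed)
    attr["country"] = 1; attr["area"] = 2
    attr["population"] = 3; attr["continent"] = 4
}
{
    for (a in attr)
        gsub("\\$" a, "$" attr[a])   # $continent -> $4, and so on
    print
}'
```

The rewritten text is itself a complete awk program, ready to be run against the chosen table.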
Exercise: If your operating system doesn't support awk's system function, modify qawk to write the appropriate sequence of commands in a file or files that can be executed separately. As it constructs a derived table, qawk calls system once for each command. Modify qawk to collect all of the commands for building a table into one string and to execute them with a single call to system.
Modify qawk to check whether a derived file that is going to be used as input has already been computed. If this file has been computed and the base files from which it was derived have not been modified since, then we can use the derived file without recomputing it. Look at the program make presented in Chapter 7.
Provide a way to enter and edit multiline queries. Multiline queries can be collected with minimal changes to qawk. One possibility for editing is a way to invoke your favorite text editor; another is to write a very simple editor in awk itself.
For generating reports, a "divide-and-conquer" strategy is often best: prepare, sort, and format. Control breaks can be handled either by looking behind or, often more elegantly, by an input pushback mechanism. They can also sometimes be done with a pipeline, although we didn't show that in this chapter.
For the details of formatting, a good alternative to counting characters by hand is to use some program that does all the mechanical parts. Although awk is not a tool for production databases, it is quite reasonable for small personal databases, and it also serves well for illustrating some of the fundamental notions.
The qawk processor is an attempt to demonstrate both of these aspects. Bibliographic Notes: there are many good books on databases; you might try J. The examples include programs that generate random words and sentences, that carry on limited dialogues with the user, and that do text processing. Most are toys, of value mainly as illustrations, but some of the document preparation programs are in regular use. Such programs can be created using the built-in function rand, which returns a pseudo-random number each time it is called.
The rand function starts generating random numbers from the same seed each time a program using it is invoked, so if you want a different sequence each time, you must call srand once, which will initialize rand with a seed computed from the current time. Random Choices Each time it is called, rand returns a random floating point number greater than or equal to 0 and less than 1, but often what is wanted is a random integer between 1 and n.
That's easy to compute from rand. We can use randint to select random letters. The function choose prints k random elements in order from the first n elements of an array A. Exercise: Test rand to see how random its output really is.
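A sketch of randint built from rand, as described in the text; since rand's output varies by seed, the demonstration checks only that results fall in range:

```shell
# randint(n): a random integer between 1 and n, built from rand.
# srand() seeds from the time of day so runs differ.
awk '
BEGIN {
    srand()
    for (i = 1; i <= 5; i++) {
        r = randint(10)
        if (r < 1 || r > 10) { print "out of range"; exit 1 }
    }
    print "all in range"
}
function randint(n) { return int(n * rand()) + 1 }'
```

Since rand returns a value in [0, 1), `n * rand()` lies in [0, n), and taking int and adding 1 yields 1 through n, each equally likely.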
Exercise: Write a program to generate k distinct random integers between 1 and n in time proportional to k. Exercise: Write a program to generate random bridge hands. Cliche Generation. Our next example is a cliche generator, which creates new cliches out of old ones. The input is a set of sentences, each with a subject and a predicate separated by a colon, such as "A rolling stone: gathers no moss"; pairing random subjects with random predicates produces output like: A rolling stone repeats itself.
History abhors a vacuum. Nature repeats itself. All's well that gathers no moss. He who lives by the sword has a price. The code is straightforward. Random Sentences. A context-free grammar is a set of rules that defines how to generate or analyze a set of sentences. Each rule, called a production, has the form A → B C D. The meaning of this production is that any A can be "rewritten" as B C D. The symbol on the left-hand side, A, is called a nonterminal, because it can be expanded further.
The symbols on the right-hand side can be nonterminals (including more A's) or terminals, so called because they do not get expanded. There can be several rules with the same left side; terminals and nonterminals can be repeated in right sides. In Chapter 6 we will show a grammar for a part of awk itself, and use that to write a parser that analyzes awk programs. In this chapter, however, our interest is in generation, not analysis.
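Grammar-driven generation can be sketched as below. The tiny grammar is an assumption for illustration, not the book's awk grammar, and the arrays rhs/nrhs are our own representation of productions:

```shell
# Expand a nonterminal by picking one of its right-hand sides at
# random; symbols with no production are terminals.
awk '
BEGIN {
    # Grammar (assumed): Sentence -> Noun Verb; Noun -> time | awk; Verb -> flies
    rhs["Sentence", 1] = "Noun Verb"; nrhs["Sentence"] = 1
    rhs["Noun", 1] = "time"; rhs["Noun", 2] = "awk"; nrhs["Noun"] = 2
    rhs["Verb", 1] = "flies"; nrhs["Verb"] = 1
    srand()
    print gen("Sentence")
}
function gen(sym,   i, n, nw, w, out) {
    if (!((sym, 1) in rhs)) return sym      # terminal: emit as-is
    i = int(nrhs[sym] * rand()) + 1         # pick a production at random
    nw = split(rhs[sym, i], w, " ")
    for (n = 1; n <= nw; n++)
        out = out (out == "" ? "" : " ") gen(w[n])
    return out
}'
```

Recursion mirrors the grammar directly: each call to `gen` rewrites one symbol, exactly as a production does.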
Originally developed by Alfred Aho, Brian Kernighan, and Peter Weinberger, AWK is a pattern-matching language for writing short programs to perform common data-manipulation tasks. Later, a new version of the language was developed, incorporating additional features such as multiple input files, dynamic regular expressions, and user-defined functions.
Paperback, published January 11th by Pearson; first published January 1st.
Community Reviews. Nov 10, Tom (five stars): Most of my early programming was with Pascal. With the discovery of this book, and the awk program provided by Thompson Automation Software (a super-set of the traditional awk), it's been a great ride. I'm 80 now, and still at it.
Nov 29, David (five stars): Wow, what a gem of a book! I read it for free in PDF form online (you can find it at archive.). Despite its age, this remains a shining example of how to write the perfect programming book. It begins with a brief introduction to the language to get you going. Chapter Two is dry reference which you should absolutely skim as the authors suggest, lest you die of boredom.
Then all of the other chapters show increasingly virtuoso uses of the language. The authors of the book are the authors of the language, and it was wonderful to read about their experiences with the evolution and success of their creation in their own words.
As for the AWK language: I've certainly come away with a new respect for its capabilities. It's a full language with conditionals, functions, associative arrays, etc. In conclusion, the book is wonderful: I cannot recommend this book highly enough.
Likewise, the AWK language itself is also wonderful for its intended purpose; for longer scripts, you will find many worthy successors. Jul 22, Madhura Parikh (five stars): Just read some parts of all the chapters; the first chapter pretty much covers whatever working knowledge you need to get started. The remaining chapters are more detailed, and probably serve more as a reference. I love Kernighan's writing style and this book didn't disappoint.
Feb 25, Emile Mercier (five stars). Jan 29, Berin Larson (five stars): Surprisingly good. Sep 10, Pat Rondon (five stars): This is another classic language manual in the same tradition as Kernighan and Ritchie's "The C Programming Language".
As with "The C Programming Language", this book is a compact, lucid, tightly-written guide to its subject. There's a lot of knowledge packed into a small number of pages here, not just about Awk, but about computing in general; among the examples are toy or skeletal implementations of an awk-subset-to-C compiler, a column-oriented database, and the "make" utility.
Unfortunately, the book is mostly worth reading for its style and these examples, rather than as a guide to the language itself.