Performance Monitoring
"System Accounting," teaches about the UNIX accounting system, and the tools that the accounting system provides. Some of these utilities and reports give you information about system utilization and performance. Some of these can be used when investigating performance problems.
In this portion of the book, you will learn all about performance monitoring. There are a series of commands that enable system administrators, programmers, and users to examine each of the resources that a UNIX system uses. By examining these resources you can determine if the system is operating properly or poorly. More important than the commands themselves, you will also learn strategies and procedures that can be used to search for performance problems. Armed with both the commands and the overall methodologies with which to use them, you will understand the factors that are affecting system performance, and what can be done to optimize them so that the system performs at its best.
Although this chapter is helpful for users, it is particularly directed at new system administrators that are actively involved in keeping the system they depend on healthy, or trying to diagnose what has caused its performance to deteriorate.
This chapter introduces several new tools to use in your system investigations.
The sequence of the chapter is not based on particular commands. It is instead based on the steps and the strategies that you will use during your performance investigations. In other words, the chapter is organized to mirror the logical progression that a system administrator uses to determine the state of the overall system and the status of each of its subsystems.
You will frequently start your investigations by quickly looking at the overall state of the system load, as described in the section "Monitoring the Overall System Status." To do this you see how the commands uptime and sar can be used to examine the system load and the general level of Central Processing Unit (CPU) loading. You also see how tools such as SunOS's perfmeter can be helpful in gaining a graphic, high-level view of several components at once.
Next, in the section "Monitoring Processes with ps," you learn how ps can be used to determine the characteristics of the processes that are running on your system. This is a natural next step after you have determined that the overall system status reflects a heavier-than-normal loading. You will learn how to use ps to look for processes that are consuming inordinate amounts of resources and the steps to take after you have located them.
After you have looked at the snapshot of system utilization that ps gives you, you may well have questions about how to use the memory or disk subsystems. So, in the next section, "Monitoring Memory Utilization," you learn how to monitor memory performance with tools such as vmstat and sar, and how to detect when paging and swapping have become excessive (thus indicating that memory must be added to the system).
In the section "Monitoring Disk Subsystem Performance," you see how tools such as iostat, sar, and df can be used to monitor disk Input/Output (I/O) performance. You will see how to determine when your disk subsystem is unbalanced and what to do to alleviate disk performance problems.
After the section on disk I/O performance is a related section on network performance. (It is related to the disk I/O discussion because of the prevalent use of networks to provide extensions of local disk service through such facilities as NFS.) Here you learn to use netstat, nfsstat, and spray to determine the condition of your network.
This is followed by a brief discussion of CPU performance monitoring, and finally a section on kernel tuning. In this final section, you will learn about the underlying tables that reside within the UNIX operating system and how they can be tuned to customize your system's UNIX kernel and optimize its use of resources.
You have seen before in this book that the diversity of UNIX systems makes it important to check each vendor's documentation for specific details about their particular implementation. The same thing applies here as well. Furthermore, modern developments such as symmetric multiprocessor support and relational databases add new characteristics and problems to the challenge of performance monitoring. These are touched on briefly in the discussions that follow.
Performance and Its Impact on Users
Before you get into the technical side of UNIX performance monitoring, there are a few guidelines that can help system administrators avoid performance problems and maximize their overall effectiveness.
All too typically, the UNIX system administrator learns about performance when there is a critical problem with the system. Perhaps the system is taking too long to process jobs or is far behind on the number of jobs that it normally processes. Perhaps the response times for users have deteriorated to the point where users are becoming distracted and unproductive (which is a polite way of saying frustrated and angry!). In any case, if the system isn't actually failing to help its users attain their particular goals, it is at least failing to meet their expectations.
It may seem obvious that when user productivity is being affected, money and time, and sometimes a great deal of both, are being lost. Simple measurements of the amount of time lost can often provide the cost justification for upgrades to the system. In this chapter you learn how to identify which components of the system are the best candidates for such an upgrade. (If you think people were unhappy to begin with, try talking to them after an expensive upgrade has produced no discernible improvement in performance!)
Often, it is only when users begin complaining that people begin to examine the variables that are affecting performance. This in itself is somewhat of a problem. The system administrator should have a thorough understanding of the activities on the system before users are affected by a crisis. He should know the characteristics of each group of users on the system. This includes the type of work that they submit while they are present during the day, as well as the jobs that are to be processed during the evening. What is the size of the CPU requirement, the I/O requirement, and the memory requirement of the most frequently occurring and/or the most important jobs? What impact do these jobs have on the networks connected to the machine? Also important is the time-sensitivity of the jobs, the classic example being payrolls that must be completed by a given time and date.
These profiles of system activity and user requirements can help the system administrator acquire a holistic understanding of the activity on the system. That knowledge will not only be of assistance if there is a sudden crisis in performance, but also if there is a gradual erosion of it. Conversely, if the system administrator has not compiled a profile of his various user groups, and examined the underlying loads that they impose on the system, he will be at a serious disadvantage in an emergency when it comes to figuring out where all the CPU cycles, or memory, have gone. This chapter examines the tools that can be used to gain this knowledge, and demonstrates their value.
Finally, although all users may have been created equal, the work of some users inevitably will have more impact on corporate profitability than the work of other users. Perhaps, given UNIX's academic heritage, running the system in a completely democratic manner should be the goal of the system administrator. However, the system administrator will sooner or later find out, either politely or painfully, who the most important and the most influential groups are. This set of characteristics should also somehow be factored into the user profiles the system administrator develops before the onset of crises, which by their nature obscure the reasoning process of all involved.
Introduction to UNIX Performance
While the system is running, UNIX maintains several counters to keep track of critical system resources. The relevant resources that are tracked are the following:
CPU utilization | Buffer usage |
Disk I/O activity | Tape I/O activity |
Terminal activity | System call activity |
Context switching activity | File access utilization |
Queue activity | Interprocess communication (IPC) |
Paging activity | Free memory and swap space |
Kernel memory allocation (KMA) | Kernel tables |
Remote file sharing (RFS) |
By looking at reports based on these counters you can determine how the three major subsystems are performing. These subsystems are the following:
CPU | The CPU processes instructions and programs. Each time you submit a job to the system, it makes demands on the CPU. Usually, the CPU can service all demands in a timely manner. However, there is only so much available processing power, which must be shared by all users and the internal programs of the operating system, too. |
Memory | Every program that runs on the system makes some demand on the physical memory on the machine. Like the CPU, it is a finite resource. When the active processes and programs that are running on the system request more memory than the machine actually has, paging is used to move parts of the processes to disk and reclaim their memory pages for use by other processes. If further shortages occur, the system may also have to resort to swapping, which moves entire processes to disk to make room. |
I/O | The I/O subsystem(s) transfers data into and out of the machine. I/O subsystems comprise devices such as disks, printers, terminals/keyboards, and other relatively slow devices, and are a common source of resource contention problems. In addition, there is a rapidly increasing use of network I/O devices. When programs are doing a lot of I/O, they can get bogged down waiting for data from these devices. Each subsystem has its own limitations with respect to the bandwidth that it can effectively use for I/O operations, as well as its own peculiar problems. |
Performance monitoring and tuning is not always an exact science. In the displays that follow, there is a great deal of variety in the system/subsystem loadings, even for the small sample of systems used here. In addition, different user groups have widely differing requirements. Some users will put a strain on the I/O resources, some on the CPU, and some will stress the network. Performance tuning is always a series of trade-offs. As you will see, increasing the kernel size to alleviate one problem may aggravate memory utilization. Increasing NFS performance to satisfy one set of users may reduce performance in another area and thereby aggravate another set of users. The goal of the task is often to find an optimal compromise that will satisfy the majority of user and system resource needs.
Monitoring the Overall System Status
The examination of specific UNIX performance monitoring techniques begins with a look at three basic tools that give you a snapshot of the overall performance of the system. After getting this high-level view, you will normally proceed to examine each of the subsystems in detail.
Monitoring System Status Using uptime
One of the simplest reports that you use to monitor UNIX system performance measures the number of processes in the UNIX run queue during given intervals. It comes from the command uptime. It is both a high-level view of the system's workload and a handy starting place when the system seems to be performing slowly. In general, processes in the run queue are active programs (that is, not sleeping or waiting) that require system resources. Here is an example:
% uptime
2:07pm up 11 day(s), 4:54, 15 users, load average: 1.90, 1.98, 2.01
The useful parts of the display are the three load-average figures. The 1.90 load average was measured over the last minute. The 1.98 average was measured over the last 5 minutes. The 2.01 load average was measured over the last 15 minutes.
TIP: What you are usually looking for is the trend of the averages. This particular example shows a system that is under a fairly consistent load. However, if a system is having problems, but the load averages seem to be declining steadily, then you may want to wait a while before you take any action that might affect the system and possibly inconvenience users. While you are doing some ps commands to determine what caused the problem, the imbalance may correct itself.
NOTE: uptime has certain limitations. For example, high-priority jobs are not distinguished from low-priority jobs although their impact on the system can be much greater.
Run uptime periodically and observe both the numbers and the trend. When there is a problem it will often show up here, and tip you off to begin serious investigations. As system loads increase, more demands will be made on your memory and I/O subsystems, so keep an eye out for paging, swapping, and disk inefficiencies. System loads of 2 or 3 usually indicate light loads. System loads of 5 or 6 are usually medium-grade loads. Loads above 10 are often heavy loads on large UNIX machines. However, there is wide variation among types of machines as to what constitutes a heavy load. Therefore, the best technique is the one just mentioned: sample your system regularly until you have your own reference points for light, medium, and heavy loads.
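If you want to build that reference history without watching the screen all day, a simple loop can record the load averages for later review. The following is a minimal sketch; the log location and the 10-minute interval are arbitrary choices, not part of any standard tool.

#!/bin/sh
# Append an uptime sample to a daily log every 10 minutes.
# uptime's own output begins with the current time of day.
LOG=/var/tmp/loadavg.`date +%y%m%d`
while :
do
    uptime >> $LOG
    sleep 600
done

Let the script run during normal operation for a few days and the log becomes your baseline for what light, medium, and heavy loads look like on that particular machine.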
Monitoring System Status Using perfmeter
Because the goal of this first section is to give you the tools to view your overall system performance, a brief discussion of graphical performance meters is appropriate. SUN Solaris users are provided with an OpenWindows XView tool called perfmeter, which summarizes overall system performance values in multiple dials or strip charts. Strip charts are the default. Not all UNIX systems come with such a handy tool. That's too bad because in this case a picture is worth, if not a thousand words, at least 30 or 40 man pages. In this concise format, you get information about the system resources shown in Table 22.1:
Table 22.1. System resources and their descriptions.
Resources | Description |
cpu | Percent of CPU being utilized |
pkts | EtherNet activity, in packets per second |
page | Paging, in pages per second |
swap | Jobs swapped per second |
intr | Number of device interrupts per second |
disk | Disk traffic, in transfers per second |
cntxt | Number of context switches per second |
load | Average number of runnable processes over the last minute |
colls | Collisions per second detected on the EtherNet |
errs | Errors per second on receiving packets |
The charts of the perfmeter are not a source for precise measurements of subsystem performance, but they are graphic representations of them. However, the chart can be very useful for monitoring several aspects of the system at the same time. When you start a particular job, the graphics can demonstrate the impact of that job on the CPU, on disk transfers, and on paging. Many developers like to use the tool to assess the efficiency of their work for this very reason. Likewise, system administrators use the tool to get valuable clues about where to start their investigations. As an example, when faced with intermittent and transitory problems, glancing at a perfmeter and then going directly to the proper display may increase the odds that you can catch the process that is degrading the system in the act.
The scale value for the strip chart changes automatically when the chart refreshes to accommodate increasing or decreasing values on the system. You add values to be monitored by clicking the right mouse button and selecting from the menu. From the same menu you can select properties, which will let you modify what the perfmeter is monitoring, the format (dials/graphs, direction of the displays, and solid/lined display), remote/local machine choice, and the frequency of the display.
You can also set a ceiling value for a particular strip chart. If the value goes beyond the ceiling value, this portion of the chart will be displayed in red. Thus, a system administrator who knows that someone periodically runs a job that eats up all the CPU can set a ceiling that signals when that job may be running again. The system administrator can also use this to monitor the condition of critical values from several feet away from the monitor. If he or she sees red, other users may be seeing red, too.
The perfmeter is a utility provided with SunOS. You should check your own particular UNIX operating system to determine if similar performance tools are provided.
Monitoring System Status Using sar -q
If your machine does not support uptime, there is an option for sar that can provide the same type of quick, high-level snapshot of the system. The -q option reports the average queue length and the percentage of time that the queue is occupied.
% sar -q 5 5
07:28:37 runq-sz %runocc swpq-sz %swpocc
07:28:42 5.0 100
07:28:47 5.0 100
07:28:52 4.8 100
07:28:57 4.8 100
07:29:02 4.6 100
Average 4.8 100
The fields in this report are the following:
runq-sz | This is the length of the run queue during the interval. The run queue list doesn't include jobs that are sleeping or waiting for I/O, but does include jobs that are in memory and ready to run. |
%runocc | This is the percentage of time that the run queue is occupied. |
swpq-sz | This is the average length of the swap queue during the interval. Jobs or threads that have been swapped out and are therefore unavailable to run are shown here. |
%swpocc | This is the percentage of time that there are swapped jobs or threads. |
The run queue length is used in a similar way to the load averages of uptime. Typically the number is less than 2 if the system is operating properly. Consistently higher values indicate that the system is under heavier loads, and is quite possibly CPU bound. When the run queue length is high and the run queue percentage is occupied 100% of the time, as it is in this example, the system's idle time is minimized, and it is good to be on the lookout for performance-related problems in the memory and disk subsystems. However, there is still no activity indicated in the swapping columns in the example. You will learn about swapping in the next section, and see that although this system is obviously busy, the lack of swapping is a partial vote of confidence that it may still be functioning properly.
Monitoring System Status Using sar -u
Another quick and easy tool to use to determine overall system utilization is sar with the -u option. CPU utilization is shown by -u, and sar without any options defaults on most versions of UNIX to this option. The CPU is either busy or idle. When it is busy, it is either working on user work or system work. When it is not busy, it is either waiting on I/O or it is idle.
% sar -u 5 5
13:16:58 %usr %sys %wio %idle
13:17:03 40 10 13 38
13:17:08 31 6 48 14
13:17:13 42 15 9 34
13:17:18 41 15 10 35
13:17:23 41 15 11 33
Average 39 12 18 31
The fields in the report are the following:
%usr | This is the percentage of time that the processor is in user mode (that is, executing code requested by a user). |
%sys | This is the percentage of time that the processor is in system mode, servicing system calls. Users can cause this percentage to increase above normal levels by using system calls inefficiently. |
%wio | This is the percentage of time that the processor is waiting on completion of I/O, from disk, NFS, or RFS. If the percentage is regularly high, check the I/O systems for inefficiencies. |
%idle | This is the percentage of time the processor is idle. If the percentage is high and the system is heavily loaded, there is probably a memory or an I/O problem. |
In this example, you see a system with ample CPU capacity left (that is, the average idle percentage is 31 percent). The system is spending most of its time on user tasks, so user programs are probably not too inefficient with their use of system calls. The I/O wait percentage indicates an application that is making a fair amount of demands on the I/O subsystem.
Most administrators would argue that %idle should be in the low teens rather than 0, at least when the system is under load. If it is 0, it doesn't necessarily mean that the machine is operating poorly. However, it is usually a good bet that the machine is out of spare computational capacity and should be upgraded to the next level of CPU speed. The reason to upgrade the CPU is in anticipation of future growth of user processing requirements. If the system work load is increasing, even if the users haven't yet encountered the problem, why not anticipate the requirement? On the other hand, if the CPU idle time is high under heavy load, a CPU upgrade will probably not help improve performance much.
Idle time will generally be higher when the load average is low.
A high load average combined with high idle time is a symptom of potential problems. Either the memory or the I/O subsystems, or both, are hindering the swift dispatch and completion of the jobs. You should review the following sections that show how to look for paging, swapping, disk, or network-related problems.
Monitoring Processes with ps
You have probably noticed that, while throughout the rest of this chapter the commands are listed under the topic in which they are used (for example, nfsstat is listed in the section "Monitoring Network Performance"), this section is dedicated to just one command. What's so special about ps? It is singled out in this manner because of the way that it is used in the performance monitoring process. It is a starting point for generating theories (for example, processes are using up so much memory that you are paging and that is slowing down the system). Conversely, it is an ending point for confirming theories (for example, here is a burst of network activity--I wonder if it is caused by that communications test job that the programmers keep running?). Since it is so pivotal, and provides a unique snapshot of the processes on the system, ps is given its own section.
One of the most valuable commands for performance monitoring is the ps command. It enables you to monitor the status of the active processes on the system. Remember the words from the movie Casablanca, "round up the usual suspects"? Well, ps helps to identify the usual suspects (that is, suspect processes that could be using inordinate resources). Then you can proceed to determine which of the suspects is actually guilty of causing the performance degradation. It is at once a powerful tool and a source of overhead for the system itself. Using various options, the following information is shown:
Current status of the process | Process ID |
Parent process ID | User ID |
Scheduling class | Priority |
Address of process | Memory used |
CPU time used |
Using ps provides you a snapshot of the system's active processes. It is used in conjunction with other commands throughout this section. Frequently, you will look at a report from a command, for example vmstat, and then look to ps either to confirm or to deny a theory you have come up with about the nature of your system's problem. The particular performance problem that motivated you to look at ps in the first place may have been caused by a process that is already off the list. It provides a series of clues to use in generating theories that can then be tested by detailed analysis of the particular subsystem.
The following are the fields from the output of the ps command that are important in terms of performance tuning:
Field | Description |
F | Flags that indicate the process's current state and are calculated by adding each of the hexadecimal values: |
00 | Process has terminated |
01 | System process, always in memory |
02 | Process is being traced by its parent |
04 | Process is being traced by parent, and is stopped |
08 | Process cannot be awakened by a signal |
10 | Process is in memory and locked, pending an event |
20 | Process cannot be swapped |
S | The current state of the process, as indicated by one of the following letters: |
O | Process is currently running on the processor |
S | Process is sleeping, waiting for an I/O event (including terminal I/O) to complete |
R | Process is ready to run |
I | Process is idle |
Z | Process is a zombie (it has terminated, its parent is not waiting for it, and it is still in the process table) |
T | Process is stopped because of parent tracing it |
X | Process is waiting for more memory |
UID | User ID of the process's owner |
PID | Process ID number |
PPID | Parent process ID number |
C | CPU utilization for scheduling (not shown when -c is used) |
CLS | Scheduling class, real-time, time sharing, or system (only shown when the -c option is used) |
PRI | Process scheduling priority (higher numbers mean lower priorities). |
NI | Process nice number (used in scheduling priorities--raising the number lowers the priority so the process gets less CPU time) |
SZ | The amount of virtual memory required by the process (This is a good indication of the memory load the process places on the system's memory.) |
TTY | The terminal that started the process, or its parent (A ? indicates that no terminal exists.) |
TIME | The total amount of CPU time used by the process since it began |
COMD | The command that generated the process |
If your problem is immediate performance, you can disregard processes that are sleeping, stopped, or waiting on terminal I/O, as these will probably not be the source of the degradation. Look instead for the jobs that are ready to run, blocked for disk I/O, or paging.
% ps -el
F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME COMD
19 T 0 0 0 80 0 SY e00ec978 0 ? 0:01 sched
19 S 0 2 0 80 0 SY f5735000 0 e00eacdc ? 0:05 pageout
8 S 1001 1382 1 80 40 20 f5c6a000 1227 e00f887c console 0:02 mailtool
8 S 1001 1386 1 80 40 20 f60ed000 819 e00f887c console 0:28 perfmete
8 S 1001 28380 28377 80 40 20 f67c0000 5804 f5cfd146 ? 85:02 sqlturbo
8 S 1001 28373 1 80 40 20 f63c6000 1035 f63c61c8 ? 0:07 cdrl_mai
8 S 1001 28392 1 80 40 20 f67ce800 1035 f67ce9c8 ? 0:07 cdrl_mai
8 S 1001 28391 28388 80 40 20 f690a800 5804 f60dce46 ? 166:39 sqlturbo
8 S 1001 28361 1 80 60 20 f67e1000 30580 e00f887c ? 379:35 mhdms
8 S 1001 28360 1 80 40 20 f68e1000 12565 e00f887c ? 182:22 mhharris
8 O 1001 10566 10512 19 70 20 f6abb800 152 pts/14 0:00 ps
8 S 1001 28388 1 80 40 20 f6384800 216 f60a0346 ? 67:51 db_write
8 S 1000 7750 7749 80 40 20 f6344800 5393 f5dad02c pts/2 31:47 tbinit
8 O 1001 9538 9537 80 81 22 f6978000 5816 ? 646:57 sqlturbo
8 S 1033 3735 3734164 40 20 f63b8800 305 f60e0d46 pts/9 0:00 ksh
8 S 1033 5228 5227 80 50 20 f68a8800 305 f60dca46 pts/7 0:00 ksh
8 S 1001 28337 1 80 99 20 f6375000 47412 f63751c8 ? 1135:50 velox_ga
The following are tips for using ps to determine why system performance is suffering.
Look at the UID (user ID) fields for a number of identical jobs that are being submitted by the same user. This is often caused by a user who runs a script that starts a lot of background jobs without waiting for any of the jobs to complete. Sometimes you can safely use kill to terminate some of the jobs. Whenever you can, you should discuss this with the user before you take action. In any case, be sure the user is educated in the proper use of the system to avoid a replication of the problem. In the example, User ID 1001 has multiple instances of the same process running. In this case, it is a normal situation, in which multiple processes are spawned at the same time for searching through database tables to increase interactive performance.
Look at the TIME fields for a process that has accumulated a large amount of CPU time. In the example, you can see the large amount of time acquired by the processes whose command is shown as velox_ga. This may indicate that the process is in an infinite loop, or that something else is wrong with its logic. Check with the user to determine whether it is appropriate to terminate the job. If something is wrong, ask the user if a dump of the process would assist in debugging it (check your UNIX system's reference material for commands, such as gcore, that can dump a process).
Request the -l option and look at the SZ fields for processes that are consuming too much memory. In the example you can see the large amount of memory acquired by the processes whose command is shown as velox_ga. You could check with the user of this process to try to determine why it behaves this way. Attempting to renice the process may simply prolong the problem that it is causing, so you may have to kill the job instead. SZ fields may also give you a clue as to memory shortage problems caused by this particular combination of jobs. You can use vmstat or sar -wpgr to check the paging and swapping statistics, which are examined later in this chapter.
Look for processes that are consuming inordinate CPU resources. Request the -c option and look at the CLS fields for processes that are running at inappropriately high priorities. Use the nice command to adjust the nice value of the process. Beware in particular of any real-time (RT) process, which can often dominate the system. If the priority is higher than you expected, you should check with the user to determine how it was set. If he is resetting the priority because he has figured out the superuser password, dissuade him from doing this. (See Chapter 19 to find out more about using the nice command to modify the priorities of processes.)
If the processes that are running are simply long-running, CPU-intensive jobs, ask the users if you can nice them to a lower priority or if they can run them at night, when other users will not be affected by them.
Look for processes that are blocking on I/O. Many of the example processes are in this state. When that is the case, the disk subsystem probably requires tuning. The section "Monitoring Disk Subsystem Performance" examines how to investigate problems with your disk I/O. If the processes are trying to read/write over NFS, this may be a symptom that the NFS server to which they are attached is down, or that the network itself is hung.
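As a quick way of rounding up those suspects, you can sort the ps output yourself. The pipelines below are a sketch only: they assume the ps -el column layout shown above (SZ in the tenth column, with TIME and COMD as the last two fields), and column positions vary among UNIX implementations, so adjust the field numbers to match your own report.

# Ten largest processes by virtual memory size (SZ, the tenth column).
ps -el | sort -nr -k 10,10 | head -10

# Processes that have accumulated an hour or more of CPU time.
# TIME and COMD are the last two fields; PID is the fourth.
ps -el | awk 'NR > 1 { split($(NF-1), t, ":")
                       if (t[1] + 0 >= 60) print $4, $(NF-1), $NF }'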
Monitoring Memory Utilization
You could say that one can never have too much money, be too thin, or have too much system memory. Memory sometimes becomes a problematic resource when programs that are running require more physical memory than is available. When this occurs UNIX systems begin a process called paging. During paging the system copies pages of physical memory to disk, and then allows the now-vacated memory to be used by the process that required the extra space. Occasional paging can be tolerated by most systems, but frequent and excessive paging is usually accompanied by poor system performance and unhappy users.
UNIX Memory Management
Paging uses an algorithm that selects portions, or pages, of memory that are not being used frequently and displaces them to disk. The more frequently used portions of memory, which may be the most active parts of a process, thus remain in memory, while other portions of the process that are idle get paged out.
In addition to paging, there is a similar technique used by the memory management system called swapping. Swapping moves entire processes, rather than just pages, to disk in order to free up memory resources. Some swapping may occur under normal conditions. That is, some processes may just be idle enough (for example, due to sleeping) to warrant their return to disk until they become active once more. Swapping can become excessive, however, when severe memory shortages develop. Interactive performance can degrade quickly when swapping increases since it often depends on keyboard-dependent processes (for example, editors) that are likely to be considered idle as they wait for you to start typing again.
As the condition of your system deteriorates, paging and swapping make increasing demands on disk I/O. This, in turn, may further slow down the execution of jobs submitted to the system. Thus, memory resource inadequacies may result in I/O resource problems.
By now, it should be apparent that it is important to be able to know if the system has enough memory for the applications that are being used on it.
TIP: A rule of thumb is to allocate twice as much swap space as you have physical memory. For example, if you have 32 MB of physical Random Access Memory (RAM) installed on your system, you would set up 64 MB of swap space when configuring the system. The system would then use this disk space for its memory management when displacing pages or processes to disk.
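To check how your own system measures up against that rule of thumb, you can compare the installed memory with the configured swap space. The commands below are a Solaris-style sketch (prtconf reports the physical memory size and swap -l lists the configured swap devices); other UNIX variants report the same information through different commands, such as pstat -s, so check your vendor's documentation.

# Physical memory installed (look for the "Memory size" line).
prtconf | grep "Memory size"
# Configured swap devices and their sizes, in 512-byte blocks.
swap -l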
Both vmstat and sar provide information about the paging and swapping characteristics of a system. Let's start with vmstat. On the vmstat reports you will see information about page-ins, or pages moved from disk to memory, and page-outs, or pages moved from memory to disk. Further, you will see information about swap-ins, or processes moved from disk to memory, and swap-outs, or processes moved from memory to disk.
Monitoring Memory Performance Using vmstat
The vmstat command is used to examine virtual memory statistics, and presents data on process status, free and swap memory, paging activity, disk reports, CPU load, swapping, cache flushing, and interrupts. The format of the command is:
vmstat t [n]
This command takes n samples, at t second intervals. For example, the following frequently used version of the command takes samples at 5-second intervals without stopping until canceled:
vmstat 5
The following screen shows the output from the SunOS variant of the command
vmstat -S 5
which provides extra information regarding swapping.
procs memory page disk faults cpu
r b w swap free si so pi po fr de sr s0 s3 s5 s5 in sy cs us sy id
0 2 0 16516 9144 0 0 0 0 0 0 0 1 4 34 12 366 1396 675 14 9 76
0 3 0 869384 29660 0 0 0 0 0 0 0 0 4 63 15 514 10759 2070 19 17 64
0 2 0 869432 29704 0 0 0 0 0 0 0 4 3 64 11 490 2458 2035 16 13 72
0 3 0 869448 29696 0 0 0 0 0 0 0 0 3 65 13 464 2528 2034 17 12 71
0 3 0 869384 29684 0 0 0 0 0 0 0 1 3 68 18 551 2555 2136 16 14 70
0 2 0 869188 29644 0 0 0 2 2 0 0 2 3 65 10 432 2495 2013 18 9 73
0 3 0 869176 29612 0 0 0 0 0 0 0 0 3 61 16 504 2527 2053 17 11 71
0 2 0 869156 29600 0 0 0 0 0 0 0 0 3 69 8 438 15820 2027 20 18 62
The fields in the vmstat report are the following:
procs | Reports the number of processes in each of the following states |
r | In the Run queue |
b | Blocked, waiting for resources |
w | Swapped, waiting for processing resources |
memory | Reports on real and virtual memory |
swap | Available swap space |
free | Size of free list |
page | Reports on page faults and paging, averaged over an interval (typically 5 seconds) and provided in units per second |
re | Pages reclaimed from the free list (not shown when the -S option is requested) |
mf | Minor faults (not shown when -S option is requested) |
si | Number of pages swapped in (only shown with the -S option) |
so | Number of pages swapped out (only shown with the -S option) |
pi | Kilobytes paged in |
po | Kilobytes paged out |
fr | Kilobytes freed |
de | Anticipated short-term memory shortfall |
sr | Pages scanned by clock algorithm, per second |
disk | Shows the number of disk operations per second |
faults | Shows the per-second trap/interrupt rates |
in | Device interrupts |
sy | System calls per second |
cs | CPU context switches |
cpu | Shows the use of CPU time |
us | User time |
sy | System time |
id | Idle time |
NOTE: The vmstat command's first line is rarely of any use. When reviewing the output from the command, always start at the second line and go forward for pertinent data.
Let's look at some of these fields for clues about system performance. As far as memory performance goes, po and w are very important. For people using the -S option, so is similarly important. These fields all clearly show when a system is paging and swapping. If w is non-zero and so continually indicates swapping, the system probably has a serious memory problem. If, likewise, po consistently has large numbers present, the system probably has a significant memory resource problem.
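Rather than watching the samples scroll by, you can have a small filter flag the condition for you. This sketch assumes the SunOS vmstat -S column layout shown above (w in the third column, so in the seventh, po in the ninth); adjust the field numbers for your own variant.

# Warn whenever a vmstat -S sample shows swapped-out processes (w),
# swap-outs (so), or page-outs (po).  NR > 3 skips the header lines
# and the first, since-boot sample.
vmstat -S 5 | awk 'NR > 3 && ($3 > 0 || $7 > 0 || $9 > 0) {
    print "memory pressure:", "w=" $3, "so=" $7, "po=" $9
}'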
TIP: If your version of vmstat doesn't specifically provide swapping information, you can infer the swapping by watching the relationship between the w and the free fields. An increase in w, the swapped-out processes, followed by an increase in free, the number of pages on the free list, can provide the same information in a different manner.
Other fields from the vmstat output are helpful, as well. The number of runnable and blocked processes can provide a good indication of the flow of processes, or lack thereof, through the system. Similarly, comparing each percentage CPU idle versus CPU in system state, and versus CPU in user state, can provide information about the overall composition of the workload. As the load increases on the system, it is a good sign if the CPU is spending the majority of the time in the user state. Loads of 60 or 70 percent for CPU user state are ok. Idle CPU should drop as the user load picks up, and under heavy load may well fall to 0.
If paging and swapping are occurring at an unusually high rate, it may be due to the number and types of jobs that are running. Usually you can turn to ps to determine what those jobs are.
Imagine that ps shows a large number of jobs that require significant memory resources. (You saw how to determine this in the ps discussion in the previous section.) That would confirm the vmstat report. To resolve the problem, you would have to restrict memory-intensive jobs, or the use of memory, or physically add more memory.
TIP: You can see that having a history of several vmstat and ps reports during normal system operation can be extremely helpful in determining what the usual conditions are, and, subsequently, what the unusual ones are. Also, one or two vmstat reports may indicate a temporary condition, rather than a permanent problem. Sample the system multiple times before deciding that you have the answer to your system's performance problems.
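One way to accumulate that history is to save dated snapshots while the system is behaving normally, and again when it is not. The following is a minimal sketch; the directory, file names, and sample counts are arbitrary choices, and the script could just as easily be run from cron at representative times of day.

#!/bin/sh
# Save a dated snapshot of vmstat and ps output for later comparison.
DIR=/var/tmp/perfhistory
STAMP=`date +%y%m%d.%H%M`
mkdir -p $DIR
vmstat 5 5 > $DIR/vmstat.$STAMP
ps -el > $DIR/ps.$STAMP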
If you are using HP-UX, you would get a slightly different output from vmstat. For example, if you run vmstat 5 3, you would get something similar to the following output:
procs memory page faults cpu
r b w avm free re at pi po fr de sr in sy cs us sy id
4 0 0 1161 2282 6 22 48 0 0 0 0 429 289 65 44 18 18
9 0 0 1161 1422 4 30 59 0 0 0 0 654 264 181 18 20 62
6 0 0 1409 1247 2 19 37 0 0 0 0 505 316 130 47 10 43
If you compare the two outputs, you see that there are three new metrics (avm, re, and at), three metrics not included (swap, si, and so), and one category not included here (disk).
In the fourth column, you see the new metric avm. This is the number of virtual memory pages owned by processes that have run within the last 20 seconds. Should this number grow to roughly the size of physical memory minus your kernel, then your system is near paging.
The next new metric, re, shows the pages that were reclaimed. If this number gets very high, the system is spending valuable time salvaging paging space, which is a good indicator that it does not have adequate memory installed. The metric at is not very useful.
This version of vmstat is missing the swap, si, and so metrics. swap is effectively replaced by avm, which shows the number of active virtual memory pages. The si and so metrics are related to the swap metric and are likewise absent.
The disk category is not included with this version of vmstat, as most disk I/O information is already shown with the iostat utility.
Monitoring Memory Performance with sar -wpgr
More information about the system's utilization of memory resources can be obtained by using sar -wpgr.
% sar -wpgr 5 5
07:42:30 swpin/s pswin/s swpot/s bswot/s pswch/s
atch/s pgin/s ppgin/s pflt/s vflt/s slock/s
pgout/s ppgout/s pgfree/s pgscan/s %s5ipf
freemem freeswp
07:42:35 0.00 0.0 0.00 0.0 504
0.00 0.00 0.00 0.00 6.20 11.78
0.00 0.00 0.00 0.00 0.00
33139 183023
...
Average 0.00 0.0 0.00 0.0 515
Average 0.00 0.32 0.40 2.54 5.56 16.83
Average 0.00 0.00 0.00 0.00 0.00
Average 32926 183015
The fields in the report are the following:
swpin/s | Number of transfers into memory per second. |
bswin/s | Number of blocks transferred for swap-ins per second. |
swpot/s | Number of transfers from memory to swap area per second. (More memory may be needed if the value is greater than 1.) |
bswot/s | Number of blocks transferred for swap-outs per second. |
pswch/s | Number of process switches per second. |
atch/s | Number of attaches per second (that is, page faults where the page is reclaimed from memory). |
pgin/s | Number of times per second that file systems get page-in requests. |
ppgin/s | Number of pages paged in per second. |
pflt/s | Number of page faults from protection errors per second. |
vflt/s | Number of address translation page (validity) faults per second. |
slock/s | Number of faults per second caused by software lock requests requiring I/O. |
pgout/s | Number of times per second that file systems get page-out requests. |
ppgout/s | Number of pages paged out per second. |
pgfree/s | Number of pages that are put on the free list by the page-stealing daemon. (More memory may be needed if this is a large value.) |
pgscan/s | Number of pages scanned by the page-stealing daemon. (More memory may be needed if this is a large value, because it shows that the daemon is checking for free memory more than it should need to.) |
%ufs_ipf | Percentage of the ufs inodes that were taken off the free list that had reusable pages associated with them. (Large values indicate that ufs inodes should be increased, so that the free list of inodes will not be page bound.) This will be %s5ipf for System V file systems, like in the example. |
freemem | The average number of pages, over this interval, of memory available to user processes. |
freeswp | The number of disk blocks available for page swapping. |
You should use the report to examine each of the following conditions. Any one of them would imply that you may have a memory problem. Combinations of them increase the likelihood all the more.
Check for page-outs, and watch for their consistent occurrence. Look for a high incidence of address translation faults. Check for swap-outs. If they are occasional, it may not be a cause for concern as some number of them is normal (for example, inactive jobs). However, consistent swap-outs are usually bad news, indicating that the system is very low on memory and is probably sacrificing active jobs. If you find memory shortage evidence in any of these, you can use ps to look for memory-intensive jobs, as you saw in the section on ps.
Multiprocessor Implications of vmstat
In the CPU columns of the report, the vmstat command summarizes the performance of multiprocessor systems. If you have a two-processor system and the CPU load is reflected as 50 percent, it doesn't necessarily mean that both processors are equally busy. Rather, depending on the multiprocessor implementation, it can indicate that one processor is almost completely busy and the other is almost idle.
The first column of vmstat output also has implications for multiprocessor systems. If the number of runnable processes is not consistently greater than the number of processors, it is less likely that you can get significant performance increases from adding more CPUs to your system.
Monitoring Disk Subsystem Performance
Disk operations are the slowest of all operations that must be completed to enable most programs to complete. Furthermore, as more and more UNIX systems are being used for commercial applications, and particularly those that utilize relational database systems, the subject of disk performance has become increasingly significant with regard to overall system performance. Therefore, probably more than ever before, UNIX system tuning activities often turn out to be searches for unnecessary and inefficient disk I/O. Before you learn about the commands that can help you monitor your disk I/O performance, some background is appropriate.
Some of the major disk performance variables are the hard disk activities themselves (that is, rotation and arm movement), the I/O controller card, the I/O firmware and software, and the I/O backplane of the system.
For example, for a given disk operation to be completed successfully, the disk controller must be directed to access the information from the proper part of the disk. This results in a delay known as a queuing delay. When it has located the proper part of the disk, the disk arm must begin to position itself over the correct cylinder. This results in a delay called seek latency. The read/write head must then wait for the relevant data to pass underneath it as the disk rotates. This is known as rotational latency. The data must then be transferred to the controller. Finally, the data must be transferred over the I/O backplane of the system to be used by the application that requested the information.
If you think about your use of a compact disk, many of the operations are similar in nature. The CD platter contains information, and is spinning all the time. When you push 5 to request the fifth track of the CD, a controller positions the head that reads the information at the correct area of the disk (similar to the queuing delay and seek latency of disk drives). The rotational latency occurs as the CD spins around until the start of your music passes under the reading head. The data--in this case your favorite song--is then transferred to a controller and then to some digital-to-analog converters that transform it into amplified musical information that is playable by your stereo.
Seek time is the time required to move the head of the disk from one location of data, or track, to another. Moving from one track to another track that is adjacent to it takes very little time and is called minimum seek time. Moving the head between the two furthest tracks on a disk is measured as the maximum seek time. The average seek time approximates the average amount of time a seek takes.
As data access becomes more random in nature, seek time can become more important. In most commercial database applications that feature relational databases, for example, the data is often being accessed in a random manner, at a high rate, and in relatively small packets (for example, 512 bytes). The disk heads are therefore moving back and forth all the time looking for the pertinent data, so choosing disks that have small seek times for those systems can increase I/O performance.
Many drives have roughly the same rotational speed, measured as revolutions per minute, or RPMs. However, some manufacturers are stepping up the RPM rates of their drives. This can have a positive influence on performance by reducing the rotational delay, which is the time that the disk head has to wait for the information to get to it (that is, on average one-half of a rotation). It also reduces the amount of time required to transfer the read/write information.
Disk I/O Performance Optimization
While reviewing the use of the commands to monitor disk performance, you will see how these clearly show which disks and disk subsystems are being the most heavily used. However, before examining those commands, there are some basic hardware-oriented approaches to this problem that can help increase performance significantly. The main idea is to put the hardware where the biggest disk problem is, and to evenly spread the disk work load over available I/O controllers and disk drives.
If your I/O work load is heavy (for example, with many users constantly accessing large volumes of data from the same set of files), you can probably get significant performance increases by reducing the number of disk drives that are daisy chained off one I/O controller from five or six to two or three. Perhaps doing this will force another daisy chain to increase in size past a total of four or five, but if the disks on that I/O controller are only used intermittently, system performance will be increased overall.
Another example of this type of technique: if one group of users pounds on one set of files all day long, you could locate that most frequently used data on the fastest disks.
Notice that, once again, the more thorough your knowledge of the characteristics of the work being done on your system, the greater the chance that your disk architecture will answer those needs.
NOTE: Remember, distributing a work load evenly across all disks and controllers is not the same thing as distributing the disks evenly across all controllers, or the files evenly across all disks. You must know which applications make the heaviest I/O demands, and understand the work load itself, to distribute it effectively.
TIP: As you build file systems for user groups, remember to factor in the I/O work load. Make sure your high-disk I/O groups are put on their own physical disks and preferably their own I/O controllers as well. If possible, keep them, and /usr, off the root disk as well.
Disk-striping software frequently can help in cases where the majority of disk access goes to a handful of disks. Where a large amount of data is making heavy demands on one disk or one controller, striping distributes the data across multiple disks and/or controllers. When the data is striped across multiple disks, the accesses to it are averaged over all the I/O controllers and disks, thus optimizing overall disk throughput. Some disk-striping software also provides Redundant Array of Inexpensive Disks (RAID) support and the ability to keep one disk in reserve as a hot standby (that is, a disk that can be automatically rebuilt and used when one of the production disks fails). When thought of in this manner, this can be a very useful feature in terms of performance because a system that has been crippled by the failure of a hard drive will be viewed by your user community as having pretty bad performance.
This information may seem obvious, but it is important to the overall performance of a system. Frequently, the answer to disk performance simply rests on matching the disk architecture to the use of the system.
Relational Databases
With the increasing use of relational database technologies on UNIX systems, I/O subsystem performance is more important than ever. While analyzing all the relational database systems and making recommendations is beyond the scope of this chapter, some basic concepts are in order.
More and more often these days an application based on a relational database product is the fundamental reason for the procurement of the UNIX system itself. If that is the case in your installation, and if you have relatively little experience in terms of database analysis, you should seek professional assistance. In particular, insist on a database analyst that has had experience tuning your database system on your operating system. Operating systems and relational databases are both complex systems, and the performance interactions between them is difficult for the inexperienced to understand.
The database expert will spend a great deal of time looking at the effectiveness of your allocation of indexes. Large improvements in performance due to the addition or adjustment of a few indexes are quite common.
You should use raw disks rather than file systems for the greatest performance. File systems incur more overhead (for example, inode and update block overhead on writes) than do raw devices. Most relational databases clearly reflect this performance advantage in their documentation.
If the database system is extremely active, or if the activity is unbalanced, you should try to distribute the load more evenly across all the I/O controllers and disks that you can. You will see how to determine this in the following section.
Checking Disk Performance with iostat and sar
The two original commands for system monitoring, iostat and sar, are still in very wide use today as reliable, simple, and free tools. As a matter of fact, most system monitoring tools that you can buy today are simply extensions of these programs.
The iostat Command The iostat command is used to examine disk input and output, and produces throughput, utilization, queue length, transaction rate, and service time data. It is similar both in format and in use to vmstat. The format of the command is:
iostat t [n]
This command takes n samples, at t second intervals. For example, the following frequently used version of the command takes samples at 5-second intervals without stopping, until canceled:
iostat 5
For example, the following shows disk statistics sampled at 5-second intervals.
tty sd0 sd30 sd53 sd55 cpu
tin tout Kps tps serv Kps tps serv Kps tps serv Kps tps serv us sy wt id
0 26 8 1 57 36 4 20 77 34 24 31 12 30 14 9 47 30
0 51 0 0 0 0 0 0 108 54 36 0 0 0 14 7 78 0
0 47 72 10 258 0 0 0 102 51 38 0 0 0 15 9 76 0
0 58 5 1 9 1 1 23 112 54 33 0 0 0 14 8 77 1
0 38 0 0 0 25 0 90 139 70 17 9 4 25 14 8 73 6
0 43 0 0 0 227 10 23 127 62 32 45 21 20 20 15 65 0
The first line of the report shows the statistics since the last reboot. The subsequent lines show the interval data that is gathered. The default format of the command shows statistics for terminals (tty), for disks (fd and sd), and CPU.
For each terminal, iostat shows the following: | |
tin | Characters in the terminal input queue |
tout | Characters in the terminal output queue |
For each disk, iostat shows the following: | |
Kps | Kilobytes transferred per second |
tps | Transfers per second |
serv | Average service time, in milliseconds |
For the CPU, iostat displays the CPU time spent in the following modes: | |
us | User mode |
sy | System mode |
wt | Waiting for I/O |
id | Idle mode |
The first two fields, tin and tout, have no relevance to disk subsystem performance, as these fields describe the number of characters waiting in the input and output terminal buffers. The next fields are relevant to disk subsystem performance over the preceding interval. The Kps field indicates the amount of data transferred (read or written) to the drive, in kilobytes per second. The tps field describes the transfers (that is, I/O requests) per second that were issued to the physical disk. Note that one transfer can combine multiple logical requests. The serv field is for the length of time, in milliseconds, that the I/O subsystem required to service the transfer. In the last set of fields, note that I/O waiting is displayed under the wt heading.
You can look at the data within the report for information about system performance. As with vmstat, the first line of data is usually irrelevant to your immediate investigation. Looking at the first disk, sd0, you see that it is not being utilized as the other three disks are. Disk 0 is the root disk, and often will show the greatest activity. This system is a commercial relational database implementation, however, and the activity that is shown here is often typical of online transaction processing, or OLTP, requirements. Notice that the activity is mainly on disks sd53 and sd55. The database is being exercised by a high volume of transactions that are updating it (in this case over 100 updates per second).
Disks 30, 53, and 55 are three database disks that are being pounded with updates from the application through the relational database system. Notice that the transfers per second, the kilobytes per second, and the service times are all reflecting a heavier load on disk 53 than on disks 30 and 55. Disk 30's use is more intermittent but can be quite heavy at times, while 53's is more consistent. Ideally, over longer sample periods, the three disks should have roughly equivalent utilization rates. If they continue to show disparities in use like these, you may be able to get a performance increase by determining why the load is unbalanced and taking corrective action.
You can use iostat -xtc to show the measurements across all of the drives in the system.
% iostat -xtc 10 5
extended disk statistics tty cpu
disk r/s w/s Kr/s Kw/s wait actv svc_t %w %b tin tout us sy wt id
sd0 0.0 0.9 0.1 6.3 0.0 0.0 64.4 0 1 0 26 12 11 21 56
sd30 0.2 1.4 0.4 20.4 0.0 0.0 21.5 0 3
sd53 2.6 2.3 5.5 4.6 0.0 0.1 23.6 0 9
sd55 2.7 2.4 5.6 4.7 0.0 0.1 24.2 0 10
...
extended disk statistics tty cpu
disk r/s w/s Kr/s Kw/s wait actv svc_t %w %b tin tout us sy wt id
sd0 0.0 0.3 0.0 3.1 0.0 0.0 20.4 0 1 0 3557 5 8 14 72
sd30 0.0 0.2 0.1 0.9 0.0 0.0 32.2 0 0
sd53 0.1 0.2 0.4 0.5 0.0 0.0 14.6 0 0
sd55 0.1 0.2 0.3 0.4 0.0 0.0 14.7 0 0
This example shows five samples of all disks at 10-second intervals.
Each line shows the following:
r/s | Reads per second |
w/s | Writes per second |
Kr/s | KB read per second |
Kw/s | KB written per second |
wait | Average transactions waiting for service (that is, queue length) |
actv | Average active transactions being serviced |
svc_t | Average time, in milliseconds, of service |
%w | Percentage of time that the queue isn't empty |
%b | Percentage of time that the disk is busy |
Once again, you can check to make sure that all disks are sharing the load equally, or if this is not the case, that the most active disk is also the fastest.
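If you are scanning a long run of samples for overworked drives, a small filter can save you from reading every line. The following is only a sketch, not part of iostat itself; it assumes the extended-statistics column layout shown above (disk name in column 1, %b in column 10), and the 50 percent threshold is purely illustrative:
% iostat -x 30 | awk '$1 ~ /^sd/ && $10+0 > 50 { print $1 " is " $10 "% busy" }'
This prints a line whenever a disk is more than 50 percent busy during a 30-second sample; interrupt it with Ctrl+C when you have seen enough.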
The sar -d Command
The sar -d option also reports on the disk I/O activity of a system.
% sar -d 5 5
20:44:26 device %busy avque r+w/s blks/s avwait avserv
...
20:44:46 sd0 1 0.0 1 5 0.0 20.1
sd1 0 0.0 0 0 0.0 0.0
sd15 0 0.0 0 0 0.0 0.0
sd16 1 0.0 0 1 0.0 27.1
sd17 1 0.0 0 1 0.0 26.8
sd3 0 0.0 0 0 0.0 0.0
Average sd0 1 0.0 0 3 0.0 20.0
sd1 0 0.0 0 2 0.0 32.6
sd15 0 0.0 0 1 0.0 13.6
sd16 0 0.0 0 0 0.0 27.6
sd17 0 0.0 0 0 0.0 26.1
sd3 2 0.1 1 14 0.0 102.6
Information about each disk is shown as follows:
device | Names the disk device that is measured |
%busy | Percentage of time that the device is busy servicing transfers |
avque | Average number of requests outstanding during the period |
r+w/s | Read/write transfers to the device per second |
blks/s | Number of blocks transferred to the device per second |
avwait | Average number of milliseconds that a transfer request spends waiting in the queue for service |
avserv | Average number of milliseconds for a transfer to be completed, including seek, rotational delay, and data transfer time. |
You can see from the example that this system is lightly loaded, since %busy is a small number and the queue lengths and wait times are small as well. The average service times for most of the disks are consistent; however, notice that SCSI disk 3, sd3, has a larger service time than the other disks. Perhaps the arrangement of data on the disk is not organized properly (a condition known as fragmentation), or perhaps the organization is fine but the disproportionate access of sd3 (see the blks/s column) is bogging it down in comparison to the other drives.
TIP: You should double-check vmstat before you draw any conclusions based on these reports. If your system is paging or swapping with any consistency, you have a memory problem, and you need to address that first because it is surely aggravating your I/O performance.
As this chapter has shown, you should distribute the disk load over I/O controllers and drives, and you should use your fastest drive to support your most frequently accessed data. You should also try to increase the size of your buffer cache if your system has sufficient memory. You can eliminate fragmentation by rebuilding your file systems. Also, make sure that the file system that you are using is the fastest type supported with your UNIX system (for example, UFS) and that the block size is the appropriate size.
Monitoring File System Use with df
One of the biggest and most frequent problems that systems have is running out of disk space, particularly in /tmp or /usr. There is no magic answer to the question "How much space should be allocated to these?" but a good rule of thumb is between 1500KB and 3000KB for /tmp and roughly twice that for /usr. Other file systems should have about 5 or 10 percent of the system's available capacity.
The df Command
The df command shows the free disk space on each disk that is mounted. The -k option displays the information about each file system in columns, with the allocations in KB.
% df -k
Filesystem kbytes used avail capacity Mounted on
/dev/dsk/c0t0d0s0 38111 21173 13128 62% /
/dev/dsk/c0t0d0s6 246167 171869 49688 78% /usr
/proc 0 0 0 0% /proc
fd 0 0 0 0% /dev/fd
swap 860848 632 860216 0% /tmp
/dev/dsk/c0t0d0s7 188247 90189 79238 53% /home
/dev/dsk/c0t0d0s5 492351 179384 263737 40% /opt
gs:/home/prog/met 77863 47127 22956 67% /home/met
From this display you can see the following information (all entries are in KB):
kbytes | Total size of usable space in file system (size is adjusted by allotted head room) |
used | Space used |
avail | Space available for use |
capacity | Percentage of total capacity used |
mounted on | The mount point of the file system |
The usable space has been adjusted to take into account a 10 percent reserve head room adjustment, and thus reflects only 90 percent of the actual capacity. The percentage shown under capacity is therefore used space divided by the adjusted usable space.
TIP: For best performance, file systems should be cleaned regularly to protect the 10 percent head room allocation. Remove excess files with rm, or use tar or cpio to archive older files that are no longer used to tape or to less frequently used disks.
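For example, one way to locate candidates for removal or archiving is with find. This is only a sketch; the starting directory, age, and size thresholds are illustrative and should be adjusted to your own environment:
% find /tmp -type f -mtime +30 -size +2000 -exec ls -l {} \;
Here -mtime +30 selects files untouched for more than 30 days, and -size +2000 is measured in 512-byte blocks, so it selects files larger than about 1MB. Files the command turns up can then be removed with rm or archived with tar or cpio as described above.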
Monitoring Network Performance
"The network is the computer" is an appropriate saying these days. What used to be simple ASCII terminals connected over serial ports have been replaced by networks of workstations, Xterminals, and PCs, connected, for example, over 10 BASE-T EtherNet networks. Networks are impressive information transmission media when they work properly. However, troubleshooting is not always as straightforward as it should be. In other words, he who lives by the network can die by the network without the proper procedures.
The two most prevalent standards that you will have to contend with in the UNIX world are TCP/IP (a communications protocol suite) and NFS (a popular network file system). Each can be a source of problems. In addition, you need to keep an eye on the implementation of the network, which can also be a problem area. Each network topology has different capacities, and each implementation (for example, using thin-net instead of 10BASE-T twisted pair, or using intelligent hubs, and so on) has advantages and problems inherent in its design. The good news is that even a simple Ethernet network has a large amount of bandwidth for transporting data. The bad news is that with every day that passes users and programmers are coming up with new methods of using up as much of that bandwidth as possible.
Most networks are still based on Ethernet technologies. Ethernet is referred to as a 10 Mbps medium, but the throughput that users and applications can use effectively is usually significantly less than that. Often, for various reasons, the effective capacity falls to around 4 Mbps. That may still seem like a lot of capacity, but as the network grows it can disappear fast. When the capacity is used up, Ethernet is very democratic: if it has a capacity problem, all users suffer equally. Furthermore, one person can bring an Ethernet network to its knees with relative ease. Accessing and transferring large files across the network, running programs that test transfer rates between two machines, running a program with a loop in it that happens to be dumping data to another machine, and so on, can affect all the users on the network. Like other resources (that is, CPU, disk capacity, and so on), the network is a finite resource.
If given the proper instruction, users can quite easily detect capacity problems on the network by which they are supported. A quick comparison of a simple command executed on the local machine versus the same command executed on a remote machine (for example, login and rlogin) can indicate that the network has a problem.
A little education can help your users and your network at the same time. NFS is a powerful tool, in both the good and the bad sense. Users should be taught that it will be slower to access the file over the network using NFS, particularly if the file is sizable, than it will be to read or write the data directly on the remote machine by using a remote login. However, if the files are of reasonable size, and the use is reasonable (editing, browsing, moving files back and forth), it is a fine tool to use. Users should understand when they are using NFS appropriately or not.
Monitoring Network Performance with netstat -i
One of the most straightforward checks you can make of the network's operation is with netstat -i. This command can give you some insight into the integrity of the network. All the workstations and the computers on a given network share it. When more than one of these entities try to use the network at the same time, the data from one machine "collides" with that of the other. (Despite the sound of the term, in moderation this is actually a normal occurrence, but too many collisions can be a problem.) In addition, various technical problems can cause errors in the transmission and reception of the data. As the errors and the collisions increase in frequency, the performance of the network degrades because the sender of the data retransmits the garbled data, thus further increasing the activity on the network.
Using netstat -i you can find out how many packets the computer has sent and received, and you can examine the levels of errors and collisions that it has detected on the network. Here is an example of the use of netstat:
% netstat -i
Name Mtu Net/Dest Address Ipkts Ierrs Opkts Oerrs Collis Queue _
lo0 8232 loopback localhost 1031780 0 1031780 0 0 0
le0 1500 100.0.0.0 SCAT 13091430 6 12221526 4 174250 0
The fields in the report are the following:
Name | The name of the network interface. The name indicates the type of interface (for example, le0 here is an Ethernet interface, and lo0 is a loopback interface used for testing the network software). |
Mtu | The maximum transfer unit, also known as the packet size, of the interface. |
Net/Dest | The network to which the interface is connected. |
Address | The Internet address of the interface. (The Internet address for this name may be referenced in /etc/hosts.) |
Ipkts | The number of packets the system has received since the last boot. |
Ierrs | The number of input errors that have occurred since the last boot. This should be a very low number relative to the Ipkts field (that is, less than 0.25 percent, or there is probably a significant network problem). |
Opkts | Same as Ipkts, but for sent packets. |
Oerrs | Same as Ierrs, but for output errors. |
Collis | The number of collisions that have been detected. This number should not be more than 5 or 10 percent of the output packets (Opkts) number or the network is having too many collisions and capacity is reduced. |
In this example you see that the collision ratio shows a network without too many collisions (approximately 1 percent). If collisions are constantly averaging 10 percent or more, the network is probably being over utilized.
The example also shows that input and output error ratios are negligible. Input errors usually mean that the network is feeding the system bad input packets, and the internal calculations that verify the integrity of the data (called checksums) are failing. In other words, this normally indicates that the problem is somewhere out on the network, not on your machine. Conversely, rapidly increasing output errors probably indicates a local problem with your computer's network adapters, connectors, interface, and so on.
If you suspect network problems you should repeat this command several times. An active machine should show Ipkts and Opkts consistently incrementing. If Ipkts changes and Opkts doesn't, the host is not responding to the client requesting data. You should check the addressing in the hosts database. If Ipkts doesn't change, the machine is not receiving the network data at all.
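If you check these counters regularly, you can let awk do the ratio arithmetic for you. The following one-liner is only a sketch; it assumes the column layout shown above (Opkts, Oerrs, and Collis in columns 7 through 9) and skips interfaces that have not sent any packets:
% netstat -i | awk 'NR > 1 && $7 > 0 { printf "%s: collisions %.2f%%, output errors %.2f%%\n", $1, 100*$9/$7, 100*$8/$7 }'
Compare the results against the collision and error guidelines given above, and run it again later to see whether the ratios are stable or growing.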
Monitoring Network Performance Using spray
It is quite possible that you will not detect collisions and errors when you use netstat -i, and yet will still have slow access across the network. Perhaps the other machine that you are trying to use is bogged down and cannot respond quickly enough. Use spray to send a burst of packets to the other machine and record how many of them actually made the trip successfully. The results will tell you if the other machine is failing to keep up. Here is an example of a frequently used test:
% spray SCAT
sending 1162 packets of length 86 to SCAT ...
no packets dropped by SCAT
3321 packets/sec, 285623 bytes/sec
This shows a test burst sent from the source machine to the destination machine called SCAT. No packets were dropped. If SCAT were badly overloaded some probably would have been dropped. The example defaulted to sending 1162 packets of 86 bytes each. Another example of the same command uses the -c option to specify the number of packets to send, the -d option to specify the delay so that you don't overrun your buffers, and the -l option to specify the length of the packet. This example of the command is a more realistic test of the network:
% spray -c 100 -d 20 -l 2048 SCAT
sending 100 packets of length 2048 to SCAT ...
no packets dropped by SCAT
572 packets/sec, 1172308 bytes/sec
Had you seen significant numbers (for example, 5 to 10 percent or more) of packets dropped in these displays, you would next try looking at the remote system. For example, using commands such as uptime, vmstat, sar, and ps as described earlier in this section, you would check on the status of the remote machine. Does it have memory or CPU problems, or is there some other problem that is degrading its performance so it can't keep up with its network traffic?
Monitoring Network Performance with nfsstat -c
Systems running NFS can skip spray and instead use nfsstat -c. The -c option specifies the client statistics, and -s can be used for server statistics. As the name implies, client statistics summarize this system's use of another machine as a server. The NFS service uses synchronous procedures called RPCs (remote procedure calls). This means that the client waits for the server to complete the file activity before it proceeds. If the server fails to respond, the client retransmits the request. Just as with collisions, the worse the condition of the communication, the more traffic that is generated. The more traffic that is generated, the slower the network and the greater the possibility of collisions. So if the retransmission rate is large, you should look for servers that are under heavy loads, high collision rates that are delaying the packets en route, or Ethernet interfaces that are dropping packets.
% nfsstat -c
Client rpc:
calls badcalls retrans badxid timeout wait newcred timers
74107 0 72 0 72 0 0 82
Client nfs:
calls badcalls nclget nclcreate
73690 0 73690 0
null getattr setattr root lookup readlink read
0 0% 4881 7% 1 0% 0 0% 130 0% 0 0% 465 1%
wrcache write create remove rename link symlink
0 0% 68161 92% 16 0% 1 0% 0 0% 0 0% 0 0%
mkdir rmdir readdir statfs
0 0% 0 0% 32 0% 3 0%
The report shows the following fields:
calls | The number of calls sent |
badcalls | The number of calls rejected by the RPC |
retrans | The number of retransmissions |
badxid | The number of duplicate replies received for requests that had to be retransmitted |
timeout | The number of time-outs |
wait | The number of times a call had to wait because no client handle was available |
newcred | The number of refreshed authentications |
timers | The number of times the time-out value is reached or exceeded |
readlink | The number of reads made to a symbolic link |
If the timeout ratio is high, the problem can be unresponsive NFS servers or slow networks that are impeding the timely delivery and response of the packets. In the example, there are relatively few time-outs compared to the number of calls (72 out of 74107, or about 1/10 of 1 percent), each of which forces a retransmission. As the percentage grows toward 5 percent, system administrators begin to take a closer look at it. If badxid is roughly the same as retrans, the problem is probably an NFS server that is falling behind in servicing NFS requests, since duplicate acknowledgments are being received for NFS requests in roughly the same amounts as the retransmissions that are required. (The same thing is true if badxid is roughly the same as timeout.) However, if badxid is a much smaller number than retrans and timeout, then it follows that the network is more likely to be the problem.
TIP: nfsstat enables you to reset the applicable counters to 0 by using the -z option (executed as root). This can be particularly handy when trying to determine if something has caused a problem in the immediate time frame, rather than looking at the numbers collected since the last reboot.
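For example, to measure just the period in which a user reports slow NFS response, you might zero the counters, reproduce the slow file activity, and then take a fresh report. This is a sketch of the sequence, run on the NFS client:
# nfsstat -z
(reproduce the slow NFS activity here)
% nfsstat -c
The retrans, badxid, and timeout values in the second report then cover only the problem window, rather than everything since the last reboot.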
Monitoring Network Performance with netstat
One way to check for network loading is to use netstat without any parameters:
% netstat
TCP
Local Address Remote Address Swind Send-Q Rwind Recv-Q State
-------------------- -------------------- ----- ------ ----- ------ -------
AAA1.1023 bbb2.login 8760 0 8760 0 ESTABLISHED
AAA1.listen Cccc.32980 8760 0 8760 0 ESTABLISHED
AAA1.login Dddd.1019 8760 0 8760 0 ESTABLISHED
AAA1.32782 AAA1.32774 16384 0 16384 0 ESTABLISHED
...
In the report, the important field is the Send-Q field, which indicates the depth of the send queue for packets. If the numbers in Send-Q are large and increasing in size across several of the connections, the network is probably bogged down.
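To watch the send queues over time, you can filter the report rather than scanning it by eye. The following is only a sketch; it assumes the column layout shown above, with Send-Q as the fourth column of the TCP section:
% netstat | awk '$4 ~ /^[0-9]+$/ && $4 > 0 { print $1, $2, "Send-Q =", $4 }'
Run it a few times, a minute or so apart; if the same connections keep showing growing Send-Q values, the network (or the machine at the other end) is not draining the data fast enough.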
Looking for Network Data Corruption with netstat -s
The netstat -s command displays statistics for each of several protocols supported on the system (that is, UDP, IP, TCP, and ICMP). The information can be used to locate problems for the protocol. Here is an example:
% netstat -s
UDP
udpInDatagrams =2152316 udpInErrors = 0
udpOutDatagrams =2151810
TCP
tcpRtoAlgorithm = 4 tcpRtoMin = 200
tcpRtoMax = 60000 tcpMaxConn = -1
tcpActiveOpens =1924360 tcpPassiveOpens = 81
tcpAttemptFails =584963 tcpEstabResets =1339431
tcpCurrEstab = 25 tcpOutSegs =7814776
tcpOutDataSegs =1176484 tcpOutDataBytes =501907781
tcpRetransSegs =1925164 tcpRetransBytes =444395
tcpOutAck =6767853 tcpOutAckDelayed =1121866
tcpOutUrg = 363 tcpOutWinUpdate =129604
tcpOutWinProbe = 25 tcpOutControl =3263985
tcpOutRsts = 47 tcpOutFastRetrans = 23
tcpInSegs =11769363
tcpInAckSegs =2419522 tcpInAckBytes =503241539
tcpInDupAck =3589621 tcpInAckUnsent = 0
tcpInInorderSegs =4871078 tcpInInorderBytes =-477578953
tcpInUnorderSegs =910597 tcpInUnorderBytes =826772340
tcpInDupSegs = 60545 tcpInDupBytes =46037645
tcpInPartDupSegs = 44879 tcpInPartDupBytes =10057185
tcpInPastWinSegs = 0 tcpInPastWinBytes = 0
tcpInWinProbe =704105 tcpInWinUpdate =4470040
tcpInClosed = 11 tcpRttNoUpdate = 907
tcpRttUpdate =1079220 tcpTimRetrans = 1974
tcpTimRetransDrop = 2 tcpTimKeepalive = 577
tcpTimKeepaliveProbe= 343 tcpTimKeepaliveDrop = 2
IP
ipForwarding = 2 ipDefaultTTL = 255
ipInReceives =12954953 ipInHdrErrors = 0
ipInAddrErrors = 0 ipInCksumErrs = 0
ipForwDatagrams = 0 ipForwProhibits = 0
ipInUnknownProtos = 0 ipInDiscards = 0
ipInDelivers =13921597 ipOutRequests =12199190
ipOutDiscards = 0 ipOutNoRoutes = 0
ipReasmTimeout = 60 ipReasmReqds = 0
ipReasmOKs = 0 ipReasmFails = 0
ipReasmDuplicates = 0 ipReasmPartDups = 0
ipFragOKs = 3267 ipFragFails = 0
ipFragCreates = 19052 ipRoutingDiscards = 0
tcpInErrs = 0 udpNoPorts = 64760
udpInCksumErrs = 0 udpInOverflows = 0
rawipInOverflows = 0
ICMP
icmpInMsgs = 216 icmpInErrors = 0
icmpInCksumErrs = 0 icmpInUnknowns = 0
icmpInDestUnreachs = 216 icmpInTimeExcds = 0
icmpInParmProbs = 0 icmpInSrcQuenchs = 0
icmpInRedirects = 0 icmpInBadRedirects = 0
icmpInEchos = 0 icmpInEchoReps = 0
icmpInTimestamps = 0 icmpInTimestampReps = 0
icmpInAddrMasks = 0 icmpInAddrMaskReps = 0
icmpInFragNeeded = 0 icmpOutMsgs = 230
icmpOutDrops = 0 icmpOutErrors = 0
icmpOutDestUnreachs = 230 icmpOutTimeExcds = 0
icmpOutParmProbs = 0 icmpOutSrcQuenchs = 0
icmpOutRedirects = 0 icmpOutEchos = 0
icmpOutEchoReps = 0 icmpOutTimestamps = 0
icmpOutTimestampReps= 0 icmpOutAddrMasks = 0
icmpOutAddrMaskReps = 0 icmpOutFragNeeded = 0
icmpInOverflows = 0
IGMP:
0 messages received
0 messages received with too few bytes
0 messages received with bad checksum
0 membership queries received
0 membership queries received with invalid field(s)
0 membership reports received
0 membership reports received with invalid field(s)
0 membership reports received for groups to which we belong
0 membership reports sent
The checksum error fields should always show extremely small values relative to the total traffic sent and received on the interface.
By using netstat -s on the remote system in combination with spray on your own, you can determine whether data corruption (as opposed to network corruption) is impeding the movement of your network data. Alternate between the two displays, observing the differences, if any, between the reports. If the two reports agree on the number of dropped packets, the file server is probably not keeping up. If they don't, suspect network integrity problems. Use netstat -i on the remote machine to confirm this.
Corrective Network Actions
If you suspect that there are problems with the integrity of the network itself, you must try to determine where the faulty piece of equipment is. If you lack the test equipment and experience to do this yourself, hire network consultants, who will use network diagnostic scopes to locate and correct the problems.
If the problem is that the network is extremely busy, thus increasing collisions, time-outs, retransmissions, and so on, you may need to redistribute the work load more appropriately. This is a good example of the "divide and conquer" concept as it applies to computers. By partitioning and segmenting the network nodes into subnetworks that more clearly reflect the underlying work loads, you can maximize the overall performance of the network. This can be accomplished by installing additional network interfaces in your gateway and adjusting the addressing on the gateway to reflect the new subnetworks. Altering your cabling and implementing some of the more advanced intelligent hubs may be needed as well. By reorganizing your network, you will maximize the amount of bandwidth that is available for access to the local subnetwork. Make sure that systems that regularly perform NFS mounts of each other are on the same subnetwork.
If you have an older network and are having to rework your network topology, consider replacing the older coax-based networks with the more modern twisted-pair types, which are generally more reliable and flexible.
Make sure that the work load is on the appropriate machine(s). Use the machine with the best network performance to do its proper share of network file service tasks.
Check your network for diskless workstations. These require large amounts of network resources to boot up, swap, page, and so on. With the cost of local storage constantly dropping, it is getting harder to believe that diskless workstations are still cost-effective when compared to regular workstations. Consider upgrading the workstations so that they support their users locally, or at least minimize their use of the network.
If your network server has been acquiring more clients, check its memory and its kernel buffer allocations for proper sizing.
If the problem is that I/O-intensive programs are being run over the network, work with the users to determine what can be done to make that requirement a local, rather than a network, one. Educate your users to make sure they understand when they are using the network appropriately and when they are being wasteful with this valuable resource.
Monitoring CPU Performance
The biggest problem a system administrator faces when examining performance is sorting through all the relevant information to determine which subsystem is really in trouble. Frequently, users complain about the need to upgrade a processor that is assumed to be causing slow execution, when in fact it is the I/O subsystem or memory that is the problem. To make matters even more difficult, all of the subsystems interact with one another, thus complicating the analysis.
You already looked at the three most handy tools for assessing CPU load in the section "Monitoring the Overall System Status." As stated in that section, processor idle time can, under certain conditions, imply that I/O or memory subsystems are degrading the system. It can also, under other conditions, imply that a processor upgrade is appropriate. Using the tools that have been reviewed in this chapter, you can by now piece together a competent picture of the overall activities of your system and its subsystems. You should use the tools to make absolutely sure that the I/O and the memory subsystems are indeed optimized properly before you spend the money to upgrade your CPU.
If you have determined that your CPU has just run out of gas, and you cannot upgrade your system, all is not lost. CPUs are extremely powerful machines that are frequently underutilized for long spans of time in any 24-hour period. If you can rearrange the schedule of the work that must be done to use the CPU as efficiently as possible, you can often overcome most problems. This can be done by getting users to run all appropriate jobs at off-hours (off work load hours, that is, not necessarily 9 to 5). You can also get your users to run selected jobs at lower priorities. You can educate some of your less efficient users and programmers. Finally, you can carefully examine the work load and eliminate some jobs, daemons, and so on, that are not needed.
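For example, a long-running report does not have to compete with interactive users during the day. Using the standard at and nice commands you might defer it to the early morning and lower its priority; the report program named here is purely hypothetical:
% echo "nice -19 /usr/local/bin/monthly_report" | at 0200
The job runs at 2:00 a.m. at a reduced priority, and at mails its output back to the submitting user.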
The following is a brief list of jobs and daemons that deserve review, and possibly elimination, based on the severity of the problem and their use, or lack thereof, on the system. Check each of the following and ask yourself whether you actually use or need it: accounting services, printer daemons, the mountd remote mount daemon, the sendmail daemon, the talk daemon, the remote who daemon, the NIS server, and database daemons.
Monitoring Multiprocessor Performance with mpstat
One of the most recent developments of significance in the UNIX server world is the rapid deployment of symmetric multiprocessor (SMP) servers. Of course, having multiple CPUs can mean that you may desire a more discrete picture of what is actually happening on the system than sar -u can provide.
You learned about some multiprocessor issues in the examination of vmstat, but there are other tools for examining multiprocessor utilization. The mpstat command reports the per-processor statistics for the machine. Each row of the report shows the activity of one processor.
% mpstat
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 1 0 0 201 71 164 22 34 147 0 942 10 10 23 57
1 1 0 0 57 37 171 23 34 144 1 975 10 11 23 56
2 1 0 0 77 56 158 22 33 146 0 996 11 11 21 56
3 1 0 0 54 33 169 23 34 156 0 1139 12 11 21 56
4 1 0 0 21 0 180 23 33 159 0 1336 14 10 20 56
5 1 0 0 21 0 195 23 31 163 0 1544 17 10 18 55
All values are in terms of events per second, unless otherwise noted. You may specify a sample interval, and a number of samples, with the command, just as you would with sar. The fields of the report are the following:
CPU | CPU processor ID |
minf | Minor faults |
mjf | Major faults |
xcal | Interprocessor cross calls |
intr | Interrupts |
ithr | Interrupts as threads (not counting clock interrupt) |
csw | Context switches |
icsw | Involuntary context switches |
migr | Thread migrations (to another processor) |
smtx | Spins on mutexes (lock not acquired on first try) |
srw | Spins on reader/writer locks (lock not acquired on first try) |
syscl | System calls |
usr | Percentage of user time |
sys | Percentage of system time |
wt | Percentage of wait time |
idl | Percentage of idle time |
Don't be intimidated by the technical nature of the display. It is included here just as an indication that multiprocessor systems can be more complex than uniprocessor systems to examine for their performance. Some multiprocessor systems actually can bias work to be done to a particular CPU. That is not done here, as you can see. The user, system, wait, and idle times are all relatively evenly distributed across all the available CPUs.
Kernel Tuning
Kernel tuning is a complex topic, and the space that can be devoted to it in this section is limited. In order to fit this discussion into the space allowed, the focus is on kernel tuning for SunOS in general, and Solaris 2.x in particular. In addition, the section focuses mostly on memory tuning. Your version of UNIX may differ in several respects from the version described here, and you may be involved in other subsystems, but you should get a good idea of the overall concepts and generally how the parameters are tuned.
The most fundamental component of the UNIX operating system is the kernel. It manages all the major subsystems, including memory, disk I/O, utilization of the CPU, process scheduling, and so on. In short, it is the controlling agent that enables the system to perform work for you.
As you can imagine from that introduction, the configuration of the kernel can dramatically affect system performance, either positively or negatively. There are parameters that you can tune for the various kernel modules. A couple of reasons could motivate you to do this. First, by tuning the kernel you can reduce the amount of memory required for the kernel, thus increasing the efficiency of the use of memory and increasing the throughput of the system. Second, you can increase the capacity of the system to accommodate new requirements (users, processing, or both).
This is a classic case of software compromise. It would be nice to increase the capacity of the system to accommodate all users that would ever be put on the system, but that would have a deleterious effect on performance. Likewise, it would be nice to tune the kernel down to its smallest possible size, but that would have negative side-effects as well. As in most software, the optimal solution is somewhere between the extremes.
Some people think that you only need to change the kernel when the number of people on the system increases. This is not true. You may need to alter the kernel when the nature of your processing changes. If your users are increasing their use of X Windows, increasing their utilization of file systems, running more memory-intensive jobs, and so on, you may need to adjust some of these parameters to optimize the throughput of the system.
Two trends are changing the nature of kernel tuning. First, in an effort to make UNIX a commercially viable product in terms of administration and deployment, most manufacturers are trying to minimize the complexity of the kernel configuration process. As a result, many of the tables that were once allocated in a fixed manner are now allocated dynamically, or else are linked to the value of a handful of fields. Solaris 2.x takes this approach by calculating many kernel values based on the maxusers field. Second, as memory is dropping in price and CPU power is increasing dramatically, the relative importance of precise kernel tuning for most systems is gradually diminishing. However, for high-performance systems, or systems with limited memory, it is still a pertinent topic.
Your instruction in UNIX kernel tuning begins with an overview of the kernel tables that are changed by it, and how to display them. It continues with some examples of kernel parameters that are modified to adjust the kernel to current system demands, and it concludes with a detailed example of paging and swapping parameters under SunOS.
CAUTION: Kernel tuning can actually adversely affect memory subsystem performance. As you adjust the parameters upward, the kernel often expands in size. This can affect memory performance, particularly if your system is already beginning to experience a memory shortage problem under normal utilization. As the kernel tables grow, the internal processing related to them may take longer, too, so there may be some minor degradation related to the greater time required for internal operating system activities. Once again, with a healthy system this may be transparent, but with a marginal system the problems may become apparent or more pronounced.
CAUTION: In general you should be very careful with kernel tuning. People that don't understand what they are doing can cripple their systems. Many UNIX versions come with utility programs that help simplify configuration. It's best to use them. It also helps to read the manual, and to procure the assistance of an experienced system administrator, before you begin.
CAUTION: Finally, always make sure that you have a copy of your working kernel before you begin altering it. Some experienced system administrators actually make backup copies even if the utility automatically makes one. And it is always a good idea to do a complete backup before installing a new kernel. Don't assume that your disk drives are safe because you are "just making a few minor adjustments," or that the upgrade that you are installing "doesn't seem to change much with respect to the I/O subsystem." Make sure you can get back to your original system state if things go wrong.
Kernel Tables
When should you consider modifying the kernel tables? You should review your kernel parameters in several cases, such as before you add new users, before you increase your X Window activity significantly, or before you increase your NFS utilization markedly. Also review them before the makeup of the programs that are running is altered in a way that will significantly increase the number of processes that are run or the demands they will make on the system.
Some people believe that you always increase kernel parameters when you add more memory, but this is not necessarily so. If you have a thorough knowledge of your system's parameters and know that they are already adjusted to take into account both current loads and some future growth, then adding more memory, in itself, is not necessarily a reason to increase kernel parameters.
Some of the tables are described as follows:
- Process table The process table sets the number of processes that the system can run at a time. These processes include daemon processes, processes that local users are running, and processes that remote users are running. It also includes forked or spawned processes of users--it may be a little more trouble for you to accurately estimate the number of these. If the system is trying to start system daemon processes and is prevented from doing so because the process table has reached its limit, you may experience intermittent problems (possibly without any direct notification of the error).
- User process table The user process table controls the number of processes per user that the system can run.
- Inode table The inode table lists entries for such things as the following:
- Each open pipe
- Each current user directory
- Mount points on each file system
- Each active I/O device
When the inode table is full, performance will degrade, and error messages will be written to the console when the condition occurs. This table is also relevant to the open file table, since they are both concerned with the same subsystem.
- Open file table This table determines the number of files that can be open on the system at the same time. When a program tries to open a file and the table is full, the program gets an error indication and an error is logged to the console.
- Quota table If your system is configured to support disk quotas, this table contains the number of structures that have been set aside for that use. The quota table will have an entry for each user who has a file system that has quotas turned on. As with the inode table, performance suffers when the table fills up, and errors are written to the console.
- Callout table This table controls the number of timers that can be active concurrently. Timers are critical to many kernel-related and I/O activities. If the callout table overflows, the system is likely to crash.
Checking System Tables with sar -v
The -v option enables you to see the current process table, inode table, open file table, and shared memory record table.
The fields in the report are as follows:
proc-sz | The number of process table entries in use/the number allocated |
inod-sz | The number of inode table entries in use/the number allocated |
file-sz | The number of file table entries currently in use/0, where 0 indicates that space for this table is allocated dynamically |
lock-sz | The number of shared memory record table entries in use/0, where 0 indicates that space for this table is allocated dynamically |
ov | The overflow field, showing the number of times the field to the immediate left has had to overflow |
Any non-zero entry in the ov field is an obvious indication that you need to adjust your kernel parameters relevant to that field. This is one performance report where you can request historical information, for the last day, the last week, or since last reboot, and actually get meaningful data out of it.
This is also another good report to use intermittently during the day to sample how much reserve capacity you have.
Here is an example:
% sar -v 5 5
18:51:12 proc-sz ov inod-sz ov file-sz ov lock-sz
18:51:17 122/4058 0 3205/4000 0 488/0 0 11/0
18:51:22 122/4058 0 3205/4000 0 488/0 0 11/0
18:51:27 122/4058 0 3205/4000 0 488/0 0 11/0
18:51:32 122/4058 0 3205/4000 0 488/0 0 11/0
18:51:37 122/4058 0 3205/4000 0 488/0 0 11/0
Since all the ov fields are 0, you can see that the system tables are healthy for this interval. In this display, for example, there are 122 process table entries in use, and there are 4058 process table entries allocated.
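If the standard sar data collection scripts (sa1 and sa2, run from cron) are enabled on your system, you can also pull this report from a daily data file instead of sampling live. This is a hedged example that assumes the default /var/adm/sa location and that today is the 15th:
% sar -v -s 09:00 -e 17:00 -f /var/adm/sa/sa15
This shows the table usage recorded between 9:00 a.m. and 5:00 p.m., which is an easy way to see how close you came to the limits during the busiest part of the day.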
Displaying Tunable Kernel Parameters
To display a comprehensive list of tunable kernel parameters, you can use the nm command. For example, applying the command to the appropriate kernel module reports the name list of that file:
% nm /kernel/unix
Symbols from /kernel/unix:
[Index] Value Size Type Bind Other Shndx Name
...
[15]| 0| 0|FILE |LOCL |0 |ABS |unix.o
[16]|3758124752| 0|NOTY |LOCL |0 |1 |vhwb_nextset
[17]|3758121512| 0|NOTY |LOCL |0 |1 |_intr_flag_table
[18]|3758124096| 0|NOTY |LOCL |0 |1 |trap_mon
[19]|3758121436| 0|NOTY |LOCL |0 |1 |intr_set_spl
[20]|3758121040| 0|NOTY |LOCL |0 |1 |intr_mutex_panic
[21]|3758121340| 0|NOTY |LOCL |0 |1 |intr_thread_exit
[22]|3758124768| 0|NOTY |LOCL |0 |1 |vhwb_nextline
[23]|3758124144| 0|NOTY |LOCL |0 |1 |trap_kadb
[24]|3758124796| 0|NOTY |LOCL |0 |1 |vhwb_nextdword
[25]|3758116924| 0|NOTY |LOCL |0 |1 |firsthighinstr
[26]|3758121100| 132|NOTY |LOCL |0 |1 |intr_thread
[27]|3758118696| 0|NOTY |LOCL |0 |1 |fixfault
[28]| 0| 0|FILE |LOCL |0 |ABS |confunix.c
...
(Portions of display deleted for brevity)
The relevant fields in the report are the following:
Index | The index of the symbol (appears in brackets). |
Value | The value of the symbol. |
Size | The size, in bytes, of the associated object. |
Type | A symbol is one of the following types: NOTYPE (no type was specified), OBJECT (a data object such as an array or variable), FUNC (a function or other executable code), SECTION (a section symbol), or FILE (name of the source file). |
Bind | The symbol's binding attributes. LOCAL symbols have a scope limited to the object file containing their definition;GLOBAL symbols are visible to all object files being combined; and WEAK symbols are essentially global symbols with a lower precedence than GLOBAL. |
Shndx | Except for three special values, this is the section header table index in relation to which the symbol is defined. The following special values exist: ABS indicates that the symbol's value will not change through relocation; COMMONindicates an allocated block and the value provides alignment constraints; and UNDEF indicates an undefined symbol. |
Name | The name of the symbol. |
On HP-UX 10.x systems, there is a text file that is used as the configuration file for the kernel at compile time. This file is the /stand/system file.
To get the most recent version of the kernel configuration, this file needs to be rebuilt. To do this, cd into the /stand/build directory and run the command /usr/lbin/sysadm/system_prep -s system. This will create a new system file in the /stand/build directory, which can then be edited for the desired changes.
Displaying Current Values of Tunable Parameters
To display a list of the current values assigned to the tunable kernel parameters, you can use the sysdef -i command:
% sysdef -i
... (portions of display are deleted for brevity)
*
* System Configuration
*
swapfile dev swaplo blocks free
/dev/dsk/c0t3d0s1 32,25 8 547112 96936
*
* Tunable Parameters
*
5316608 maximum memory allowed in buffer cache (bufhwm)
4058 maximum number of processes (v.v_proc)
99 maximum global priority in sys class (MAXCLSYSPRI)
4053 maximum processes per user id (v.v_maxup)
30 auto update time limit in seconds (NAUTOUP)
25 page stealing low water mark (GPGSLO)
5 fsflush run rate (FSFLUSHR)
25 minimum resident memory for avoiding deadlock (MINARMEM)
25 minimum swapable memory for avoiding deadlock (MINASMEM)
*
* Utsname Tunables
*
5.3 release (REL)
DDDD node name (NODE)
SunOS system name (SYS)
Generic_101318-31 version (VER)
*
* Process Resource Limit Tunables (Current:Maximum)
*
Infinity:Infinity cpu time
Infinity:Infinity file size
7ffff000:7ffff000 heap size
800000:7ffff000 stack size
Infinity:Infinity core file size
40: 400 file descriptors
Infinity:Infinity mapped memory
*
* Streams Tunables
*
9 maximum number of pushes allowed (NSTRPUSH)
65536 maximum stream message size (STRMSGSZ)
1024 max size of ctl part of message (STRCTLSZ)
*
* IPC Messages
*
200 entries in msg map (MSGMAP)
2048 max message size (MSGMAX)
65535 max bytes on queue (MSGMNB)
25 message queue identifiers (MSGMNI)
128 message segment size (MSGSSZ)
400 system message headers (MSGTQL)
1024 message segments (MSGSEG)
SYS system class name (SYS_NAME)
As stated earlier, over the years there have been many enhancements that have tried to minimize the complexity of the kernel configuration process. As a result, many of the tables that were once allocated in a fixed manner are now allocated dynamically, or else linked to the value of the maxusers field. The next step in understanding the nature of kernel tables is to look at the maxusers parameter and its impact on UNIX system configuration.
Modifying the Configuration Information File
SunOS uses the /etc/system file for modification of kernel-tunable variables. The basic format is this:
set parameter = value
It can also have this format:
set [module:]variablename = value
The /etc/system file can also be used for other purposes (for example, to force modules to be loaded at boot time, to specify a root device, and so on). The /etc/system file is used for permanent changes to the operating system values. Temporary changes can be made using the adb kernel debugging tools. The system must be rebooted for changes made in /etc/system to become active; with adb, the changes take effect as soon as they are applied.
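For illustration, a short /etc/system fragment using both formats is shown below. The variable names are real SunOS/Solaris tunables, but the values are examples only and should not be copied blindly:
* illustrative /etc/system entries -- values are examples only
set maxusers = 64
set ufs:ufs_ninode = 5000
After editing the file and rebooting, you can use sysdef -i, described earlier, to confirm the values the running kernel is actually using.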
CAUTION: Be very careful with set commands in the /etc/system file! They basically cause patches to be performed on the kernel itself, and there is a great deal of potential for dire consequences from misunderstood settings. Make sure you have handy the relevant system administrators' manuals for your system, as well as a reliable and experienced system administrator for guidance.
As mentioned earlier in the chapter, HP-UX 10.x has a similar /etc/system file, which can be modified and re-compiled.
Once you have made your changes to this file, you can recompile to make a new UNIX kernel. The command is mkkernel -s system. This new kernel, called vmunix.test, is placed in the /stand/build directory. Next, you move the present /stand/system file to /stand/system.prev; then you can move the modified file /stand/build/system to /stand/system. Then you move the currently running kernel /stand/vmunix to /stand/vmunix.prev, and then move the new kernel, /stand/build/vmunix.test, into place in /stand/vmunix (that is, mv /stand/build/vmunix.test /stand/vmunix). The final step is to reboot the machine to make your changes take effect.
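Putting those HP-UX 10.x steps together, the whole procedure looks roughly like this (run as root; this is a sketch of the sequence just described, not a substitute for your vendor documentation):
# cd /stand/build
# /usr/lbin/sysadm/system_prep -s system
# vi system
# mkkernel -s system
# mv /stand/system /stand/system.prev
# mv /stand/build/system /stand/system
# mv /stand/vmunix /stand/vmunix.prev
# mv /stand/build/vmunix.test /stand/vmunix
Then reboot the machine to activate the new kernel. Keeping the .prev copies gives you a way back to the old configuration if the new kernel misbehaves.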
The maxusers Parameter
Many of the tables are dynamically updated either upward or downward by the operating system, based on the value assigned to the maxusers parameter, which is an approximation of the number of users the system will have to support. The quickest and, more importantly, safest way to modify the table sizes is by modifying maxusers, and letting the system perform the adjustments to the tables for you.
The maxusers parameter can be adjusted by placing commands in the /etc/system file of your UNIX system:
set maxusers=24
A number of kernel parameters adjust their values according to the setting of the maxusers parameter. For example, Table 22.2 lists the settings for various kernel parameters, where maxusers is utilized in their calculation.
Table 22.2. Kernel parameters affected by maxusers.
Table | Parameter | Setting |
Process | max_nprocs | 10 + 16 * maxusers (sets the size of the process table) |
User process | maxuprc | max_nprocs-5 (sets the number of user processes) |
Callout | ncallout | 16 + max_nprocs (sets the size of the callout table) |
Name cache | ncsize | max_nprocs + 16 + maxusers + 64 (sets size of the directory lookup cache) |
Inode | ufs_ninode | max_nprocs + 16 + maxusers + 64 (sets the size of the inode table) |
Quota table | ndquot | (maxusers * NMOUNT) / 4 + max_nprocs (sets the number of disk quota structures) |
The directory name lookup cache (dnlc) is also based on maxusers in SunOS systems. With the increasing usage of NFS, this can be an important performance tuning parameter. Networks that have many clients can be helped by an increased name cache parameter ncsize (that is, a greater amount of cache). By using vmstat with the -s option, you can determine the directory name lookup cache hit rate. A cache miss indicates that disk I/O was probably needed to access the directory when traversing the path components to get to a file. If the hit rate falls below 70 percent, this parameter should be checked.
% vmstat -s
0 swap ins
0 swap outs
0 pages swapped in
0 pages swapped out
1530750 total address trans. faults taken
39351 page ins
22369 page outs
45565 pages paged in
114923 pages paged out
73786 total reclaims
65945 reclaims from free list
0 micro (hat) faults
1530750 minor (as) faults
38916 major faults
88376 copy-on-write faults
120412 zero fill page faults
634336 pages examined by the clock daemon
10 revolutions of the clock hand
122233 pages freed by the clock daemon
4466 forks
471 vforks
6416 execs
45913303 cpu context switches
28556694 device interrupts
1885547 traps
665339442 system calls
622350 total name lookups (cache hits 94%)
4 toolong
2281992 user cpu
3172652 system cpu
62275344 idle cpu
967604 wait cpu
In this example, you can see that the cache hits are 94 percent, and therefore enough directory name lookup cache is allocated on the system.
By the way, if your NFS traffic is heavy and irregular in nature, you should increase the number of nfsd NFS daemons. Some system administrators recommend that this should be set between 40 and 60 on dedicated NFS servers. This will increase the speed with which the nfsd daemons take the requests off the network and pass them on to the I/O subsystem. Conversely, decreasing this value can throttle the NFS load on a server when that is appropriate.
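On Solaris 2.x systems, for example, the daemons are normally started from the NFS server startup script, and the count is simply the argument given to nfsd there. A hypothetical change (the exact file name and default count vary by release) would be to edit /etc/init.d/nfs.server and raise the number on the nfsd line, for example from
/usr/lib/nfs/nfsd -a 16
to
/usr/lib/nfs/nfsd -a 40
The change takes effect the next time the NFS server startup script is run.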
The monitor utility
Monitor is a shareware utility which can be obtained from various ftp sites. This utility is actually a very handy tool for getting live updates on the status of your system.
So what does it show? The better question is what it doesn't show. Monitor will give you real-time updates on CPU utilization, CPU wait states, disk I/O, a list of the top running processes, and much more.
As you bring the utility up, you can see a number of things in the first screen. CPU utilization is shown as a text-based meter, breaking down CPU time into system, user, and idle time. Different load statistics are also displayed, such as disk I/O, swapping statistics, free memory, and a breakdown of memory metrics.
There are also two screen switches that show further details on disk I/O and process statistics.
To find a full breakdown of disk activity by disk, simply hit the "d" key. You can see the disk transfer rates in KB/s, I/Os per second, disk wait times, and much more. To get back to the main screen, just hit the "d" key again.
To see a full breakdown of the most active processes, hit the "t" key. This will show you a detailed listing of system processes, in descending order from highest to lowest in compute time. This is a good way to see if you have any runaway or hung processes. Here you can see how long a process has been running, who started and owns it, which process spawned it, and much more. To get back to the main screen, just hit the "t" key again.
To quit monitor, all you have to do is hit the "q" key.
Parameters That Influence Paging and Swapping
The section isn't large enough to review in detail how tuning can affect each of the kernel tables. However, for illustration purposes, this section describes how kernel parameters influence paging and swapping activities in a SunOS system. Other tables affecting other subsystems can be tuned in much the same manner as these.
As processes make demands on memory, pages are allocated from the free list. When the UNIX system decides that there is no longer enough free memory--less than the lotsfree parameter--it searches for pages that haven't been used lately to add them to the free list. The page daemon will be scheduled to run. It begins at a slow rate, based on the slowscan parameter, and increases to a faster rate, based on the fastscan parameter, as free memory continues toward depletion. If there is less memory than desfree, and there are two or more processes in the run queue, and the system stays in that condition for more than 30 seconds, the system will begin to swap. If the system gets to a minimum level of required memory, specified by the minfree parameter, swapping will begin without delay. When swapping begins, entire processes will be swapped out as described earlier.
NOTE: If you have your swapping spread over several disks, increasing the maxpgio parameter may be beneficial. This parameter limits the number of pages that can be scheduled for page-out per second, and its default assumes a single swap disk. Increasing it may improve paging performance. You can compare the po field from vmstat, as described earlier, against maxpgio and the page size to examine the volumes involved.
The kernel swaps out the oldest and the largest processes when it begins to swap. The maxslp parameter is used in determining which processes have exceeded the maximum sleeping period, and can thus be swapped out as well. The smallest higher-priority processes that have been sleeping the longest will then be swapped back in.
The most pertinent kernel parameters for paging and swapping are the following:
- minfree This is the absolute minimum memory level that the system will tolerate. Once past minfree, the system immediately resorts to swapping.
- desfree This is the desperation level. After 30 seconds at this level, paging is abandoned and swapping is begun.
- lotsfree Once below this memory limit, the page daemon is activated to begin freeing memory.
- fastscan This is the number of pages scanned per second when free memory is nearly exhausted.
- slowscan This is the number of pages scanned per second when scanning first begins, that is, when free memory drops just below lotsfree. As memory decreases from lotsfree, the scanning speed increases from slowscan toward fastscan.
- maxpgio This is the maximum number of page out I/O operations per second that the system will schedule. This is normally set at approximately 40 under SunOS, which is appropriate for a single 3600 RPM disk. It can be increased with more or faster disks.
Newer versions of UNIX, such as Solaris 2.x, do such a good job of setting paging parameters that tuning is usually not required.
Increasing lotsfree will help on systems on which there is a continuing need to allocate new processes. Heavily used interactive systems with many Windows users often force this condition as users open multiple windows and start processes. By increasing lotsfree you create a large enough pool of free memory that you will not run out when most of the processes are initially starting up.
For servers that have a defined set of users and a more steady-state condition to their underlying processes, the normal default values are usually appropriate.
However, for servers with large, stable work loads that are also short of memory, increasing lotsfree is the wrong idea, because more pages will be taken from the applications and put on the free list.
Some system administrators recommend that you disable the maxslp parameter on systems where the overhead of swapping normally sleeping processes (such as clock icons and update processes) isn't offset by any measurable gain due to forcing the processes out. This parameter is no longer used in Solaris 2.x releases, but is used on older versions of UNIX.
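As an illustration only (the right values depend entirely on your memory size and work load, and should be verified with vmstat before and after the change), entries such as the following in /etc/system would raise the free-memory target and the page-out ceiling discussed above:
* illustrative paging entries -- lotsfree is in pages
set lotsfree = 512
set maxpgio = 60
On releases such as Solaris 2.x that calculate these values automatically, you should have a measured reason before overriding the defaults.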
Conclusion of Kernel Tuning
You have now seen how to optimize memory subsystem performance by tuning a system's kernel parameters. Other subsystems can be tuned by similar modifications to the relevant kernel parameters. When such changes correct existing kernel configurations that have become obsolete and inefficient due to new requirements, the result can sometimes dramatically increase performance even without a hardware upgrade. It's not quite the same as getting a hardware upgrade for free, but it's about as close as you're likely to get in today's computer industry.
Third Party Solutions
In addition to the standard text-based utilities we have been talking about, there are a number of third party, enterprise-wide products that are available to monitor your servers. I will focus on two here: ServerVision from Platinum Technologies and EcoTools from Compuware.
Usually, these products are located on their own separate server, but due to budget considerations and other concerns, the monitor server is often placed on a production box. This defeats the purpose of a monitoring system, since in that situation the monitoring machine is just as likely to go down as the servers that it monitors. Therefore, I strongly recommend that you push as hard as you can for a small workstation that acts as the monitor for your systems, and nothing else.
EcoTools has a number of neat features that make it a nice solution for many shops. According to their marketing literature (for what it's worth), it boasts an open and robust architecture, heterogeneous support, support for process automation, extensive monitoring and analysis, security for sensitive system information, out-of-the-box functionality and easy customization.
In reality, it is on a par with most every other system monitoring tool on the market, with a slight advantage because it's a GUI. It really does have a nice fuzzy display that will show most of what you need to see on your systems. The graphs you can get from its logs could be shown in any boardroom, if that's what you're after.
What it lacks in warm fuzziness, ServerVision makes up for in pure kitchen sink monitoring metrics and logging tools. Platinum boasts over 200 system metrics that can be monitored on your systems, in addition to another 200 database metrics that you can use if you get their DBVision product.
NOTE: If you do have DBVision installed, particularly version 3.1.3 and version 3.1.6, you must turn off the lock_waits metric if it is installed on an AIX system. If you don't, your system will slow down to an unusable crawl! This is a bug in version 3.1.3 on AIX, and, at the time of this writing, it has yet to be fixed in version 3.1.6.
With ServerVision, as with any such tool, you must run the default settings for a short time before moving it into production to get a feel for where to set your alarm settings. If you have your server set up to page you and you don't modify these values, you will be getting paged late at night on a regular basis. Not much fun.
The paging function often comes in handy, if you like having a live system when you come in to the office in the morning. With this advanced warning, you would have the ability to get online and save a dying system well before it crashes.
No matter what software you buy for your system, ask your salesman for a complete demonstration, and bring a list of questions about your requirements. These solutions can be very expensive, so be sure you are getting what you pay for before you buy.