Real, User and Sys process time statistics
One of these things is not like the other. Real refers to actual elapsed time; User and Sys refer to CPU time used only by the process.
Real is wall clock time - time from start to finish of the call. This is all elapsed time including time slices used by other processes and time the process spends blocked (for example if it is waiting for I/O to complete).
User is the amount of CPU time spent in user-mode code (outside the kernel) within the process. This is only actual CPU time used in executing the process. Other processes and time the process spends blocked do not count towards this figure.
Sys is the amount of CPU time spent in the kernel within the process. This means executing CPU time spent in system calls within the kernel, as opposed to library code, which is still running in user-space. Like 'user', this is only CPU time used by the process. See below for a brief description of kernel mode (also known as 'supervisor' mode) and the system call mechanism.
User+Sys will tell you how much actual CPU time your process used. Note that this is across all CPUs, so if the process has multiple threads (and this process is running on a computer with more than one processor) it could potentially exceed the wall clock time reported by
Real (which usually occurs). Note that in the output these figures include the
Sys time of all child processes (and their descendants) as well when they could have been collected, e.g. by
waitpid(2), although the underlying system calls return the statistics for the process and its children separately.
Origins of the statistics reported by
The statistics reported by
time are gathered from various system calls. 'User' and 'Sys' come from
wait (2) (POSIX) or
times (2) (POSIX), depending on the particular system. 'Real' is calculated from a start and end time gathered from the
gettimeofday (2) call. Depending on the version of the system, various other statistics such as the number of context switches may also be gathered by
On a multi-processor machine, a multi-threaded process or a process forking children could have an elapsed time smaller than the total CPU time - as different threads or processes may run in parallel. Also, the time statistics reported come from different origins, so times recorded for very short running tasks may be subject to rounding errors, as the example given by the original poster shows.
A brief primer on Kernel vs. User mode
On Unix, or any protected-memory operating system, 'Kernel' or 'Supervisor' mode refers to a privileged mode that the CPU can operate in. Certain privileged actions that could affect security or stability can only be done when the CPU is operating in this mode; these actions are not available to application code. An example of such an action might be manipulation of the MMU to gain access to the address space of another process. Normally, user-mode code cannot do this (with good reason), although it can request shared memory from the kernel, which could be read or written by more than one process. In this case, the shared memory is explicitly requested from the kernel through a secure mechanism and both processes have to explicitly attach to it in order to use it.
The privileged mode is usually referred to as 'kernel' mode because the kernel is executed by the CPU running in this mode. In order to switch to kernel mode you have to issue a specific instruction (often called a trap) that switches the CPU to running in kernel mode and runs code from a specific location held in a jump table. For security reasons, you cannot switch to kernel mode and execute arbitrary code - the traps are managed through a table of addresses that cannot be written to unless the CPU is running in supervisor mode. You trap with an explicit trap number and the address is looked up in the jump table; the kernel has a finite number of controlled entry points.
The 'system' calls in the C library (particularly those described in Section 2 of the man pages) have a user-mode component, which is what you actually call from your C program. Behind the scenes, they may issue one or more system calls to the kernel to do specific services such as I/O, but they still also have code running in user-mode. It is also quite possible to directly issue a trap to kernel mode from any user space code if desired, although you may need to write a snippet of assembly language to set up the registers correctly for the call.
More about 'sys'
There are things that your code cannot do from user mode - things like allocating memory or accessing hardware (HDD, network, etc.). These are under the supervision of the kernel, and it alone can do them. Some operations like
fwrite will invoke these kernel functions and that then will count as 'sys' time. Unfortunately it's not as simple as "every call to malloc will be counted in 'sys' time". The call to
malloc will do some processing of its own (still counted in 'user' time) and then somewhere along the way it may call the function in kernel (counted in 'sys' time). After returning from the kernel call, there will be some more time in 'user' and then
malloc will return to your code. As for when the switch happens, and how much of it is spent in kernel mode... you cannot say. It depends on the implementation of the library. Also, other seemingly innocent functions might also use
malloc and the like in the background, which will again have some time in 'sys' then.
With a clustered index the rows are stored physically on the disk in the same order as the index. Therefore, there can be only one clustered index.
With a non clustered index there is a second list that has pointers to the physical rows. You can have many non clustered indices, although each new index will increase the time it takes to write new records.
It is generally faster to read from a clustered index if you want to get back all the columns. You do not have to go first to the index and then to the table.
Writing to a table with a clustered index can be slower, if there is a need to rearrange the data.
It's a pretty decent measure of performance, as long as you understand exactly what it measures.
FLOPS is, as the name implies FLoating point OPerations per Second, exactly what constitutes a FLOP might vary by CPU. (Some CPU's can perform addition and multiplication as one operation, others can't, for example). That means that as a performance measure, it is fairly close to the hardware, which means that 1) you have to know your hardware to compute the ideal FLOPS on the given architecture, and you have to know your algorithm and implementation to figure out how many floating point ops it actually consists of.
In any case, it's a useful tool for examining how well you utilize the CPU. If you know the CPU's theoretical peak performance in FLOPS, you can work out how efficiently you use the CPU's floating point units, which are often one of the hard to utilize efficiently. A program which runs 30% of the FLOPS the CPU is capable of, has room for optimization. One which runs at 70% is probably not going to get much more efficient unless you change the basic algorithm. For math-heavy algorithms like yours, that is pretty much the standard way to measure performance. You could simply measure how long a program takes to run, but that varies wildly depending on CPU. But if your program has a 50% CPU utilization (relative to the peak FLOPS count), that is a somewhat more constant value (it'll still vary between radically different CPU architectures, but it's a lot more consistent than execution time).
But knowing that "My CPU is capable of X GFLOPS, and I'm only actually achieving a throughput of, say, 20% of that" is very valuable information in high-performance software. It means that something other than the floating point ops is holding you back, and preventing the FP units from working efficiently. And since the FP units constitute the bulk of the work, that means your software has a problem.
It's easy to measure "My program runs in X minutes", and if you feel that is unacceptable then sure, you can go "I wonder if I can chop 30% off that", but you don't know if that is possible unless you work out exactly how much work is being done, and exactly what the CPU is capable of at peak. How much time do you want to spend optimizing this, if you don't even know whether the CPU is fundamentally capable of running any more instructions per second?
It's very easy to prevent the CPU's FP unit from being utilized efficiently, by having too many dependencies between FP ops, or by having too many branches or similar preventing efficient scheduling. And if that is what is holding your implementation back, you need to know that. You need to know that "I'm not getting the FP throughput that should be possible, so clearly other parts of my code are preventing FP instructions from being available when the CPU is ready to issue one".
Why do you need other ways to measure performance? What's wrong with just working out the FLOPS count as your boss asked you to? ;)