The actual cost of context switching varies across platforms, but a good rule of thumb is that a context switch costs the equivalent of 5,000 to 10,000 clock cycles, or several microseconds on most current processors.

The vmstat command on Unix systems and the perfmon tool on Windows systems report the number of context switches and the percentage of time spent in the kernel. High kernel usage (over 10%) often indicates heavy scheduling activity, which may be caused by blocking due to I/O or lock contention.
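The effect described above is easy to provoke. The following demo (ours, not from the text; all names are hypothetical) makes two threads hand a lock back and forth via wait/notify, forcing a voluntary context switch on nearly every handoff. Running it while watching `vmstat 1` on Unix should show the context-switch column (cs) climb sharply.

```java
// Hypothetical demo: two threads ping-pong on one lock, each handoff
// blocking the other thread and forcing the scheduler to switch contexts.
public class PingPong {
    private final Object lock = new Object();
    private boolean pingTurn = true;   // whose turn it is, guarded by lock
    private int handoffs = 0;          // guarded by lock

    public int run(int rounds) {
        Thread t1 = new Thread(() -> bounce(true, rounds));
        Thread t2 = new Thread(() -> bounce(false, rounds));
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return handoffs;   // each thread contributes `rounds` handoffs
    }

    private void bounce(boolean isPing, int rounds) {
        for (int i = 0; i < rounds; i++) {
            synchronized (lock) {
                while (pingTurn != isPing) {   // not our turn: block
                    try {
                        lock.wait();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
                pingTurn = !isPing;            // pass the baton
                handoffs++;
                lock.notifyAll();
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(new PingPong().run(10_000));
    }
}
```

Because the threads strictly alternate, almost every acquisition blocks, which is exactly the "heavy scheduling activity" pattern vmstat would surface as high kernel time.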
11.3.2. Memory Synchronization
The performance cost of synchronization comes from several sources. The visibility guarantees provided by synchronized and volatile may entail using special instructions called memory barriers that can flush or invalidate caches, flush hardware write buffers, and stall execution pipelines. Memory barriers may also have indirect performance consequences because they inhibit other compiler optimizations; most operations cannot be reordered with memory barriers.
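The visibility guarantee that those barriers buy can be sketched with the standard volatile-flag pattern (the class and field names below are ours, not from the text). The volatile write publishes everything the writer did before it, and the volatile read in the spin loop prevents the reader from caching the flag in a register.

```java
// Minimal sketch of volatile's visibility guarantee: the write to `done`
// happens-before any subsequent read of `done`, so the reader is also
// guaranteed to see the ordinary write to `answer` made before it.
public class VolatileFlag {
    private volatile boolean done = false;
    private int answer = 0;   // ordinary field, published via the volatile write

    public int compute() {
        Thread writer = new Thread(() -> {
            answer = 42;      // ordinary write...
            done = true;      // ...made visible by the volatile write
        });
        writer.start();
        while (!done) { }     // volatile read each iteration; loop must terminate
        return answer;        // guaranteed to observe 42
    }

    public static void main(String[] args) {
        System.out.println(new VolatileFlag().compute());
    }
}
```

If `done` were not volatile, the JIT compiler would be free to hoist the read out of the loop and spin forever, and the reader could see `done == true` while still observing a stale `answer`.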
When assessing the performance impact of synchronization, it is important to distinguish between contended and uncontended synchronization. The synchronized mechanism is optimized for the uncontended case (volatile is always uncontended), and at this writing, the performance cost of a "fast-path" uncontended synchronization ranges from 20 to 250 clock cycles for most systems. While this is certainly not zero, the effect of needed, uncontended synchronization is rarely significant in overall application performance, and the alternative involves compromising safety and potentially signing yourself (or your successor) up for some very painful bug hunting later.
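To get a feel for the fast-path cost on your own hardware, a rough single-threaded loop like the one below (ours, not the book's) is suggestive — though numbers from a naive nanoTime loop are only ballpark figures, and a real measurement would use a benchmark harness such as JMH to account for warm-up and dead-code elimination.

```java
// Rough sketch: time a burst of uncontended synchronized blocks from a
// single thread. The lock is private and only one thread runs, so every
// acquisition takes the optimized fast path.
public class UncontendedCost {
    private final Object lock = new Object();
    private long counter = 0;   // side effect so the loop isn't optimized away

    public long timePerOpNanos(int ops) {
        long start = System.nanoTime();
        for (int i = 0; i < ops; i++) {
            synchronized (lock) {   // never contended here
                counter++;
            }
        }
        return (System.nanoTime() - start) / ops;
    }

    public static void main(String[] args) {
        UncontendedCost u = new UncontendedCost();
        u.timePerOpNanos(1_000_000);   // warm-up pass for the JIT
        System.out.println(u.timePerOpNanos(1_000_000) + " ns/op");
    }
}
```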
Modern JVMs can reduce the cost of incidental synchronization by optimizing away locking that can be proven never to contend. If a lock object is accessible only to the current thread, the JVM is permitted to optimize away a lock acquisition because there is no way another thread could synchronize on the same lock. For example, the lock acquisition in Listing 11.2 can always be eliminated by the JVM.
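Listing 11.2 itself is not reproduced in this excerpt; the pattern it refers to is synchronizing on an object no other thread can ever see. A minimal sketch of that shape (class and method names are ours):

```java
// Locking on a freshly allocated object: no other thread can ever reach
// this lock, so the acquisition has no synchronization effect and the JVM
// is free to elide it entirely.
public class UselessLock {
    public int increment(int x) {
        synchronized (new Object()) {   // thread-confined lock: elidable
            return x + 1;
        }
    }
}
```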
More sophisticated JVMs can use escape analysis to identify when a local object reference is never published to the heap and is therefore thread-local. In getStoogeNames in Listing 11.3, the only reference to the Vector is the local variable stooges, and stack-confined variables are automatically thread-local. A naive execution of getStoogeNames would acquire and release the lock on the Vector four times, once for each call to add or toString. However, a smart runtime compiler can inline these calls and then see that
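Listing 11.3 is not included in this excerpt; the following is our reconstruction of the shape the text describes — a Vector confined to one stack frame, so all four lock acquisitions (three add calls plus toString) are candidates for elision once the runtime compiler proves the reference never escapes.

```java
import java.util.Vector;

// Reconstruction (not the book's exact listing): `stooges` never leaves
// this method, so escape analysis can treat it as thread-local and the
// JVM may elide all four monitor acquisitions on the Vector.
public class Stooges {
    public String getStoogeNames() {
        Vector<String> stooges = new Vector<>();   // confined to this frame
        stooges.add("Moe");                        // lock acquisitions 1-3
        stooges.add("Larry");
        stooges.add("Curly");
        return stooges.toString();                 // lock acquisition 4
    }
}
```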