FIGURE 32.4: Using on-memory deduplication provides extra performance
without any administration cost (i.e., installing a new file system).
Deduplication is becoming increasingly important. It is a technique that has been mainly
used to reduce the size of data in various cases, such as in file systems
(ZFS [11]), virtual machines (KVM, XEN [5], VMware [12]), and special
tagged memory zones (KSM [2]). However, using deduplication to improve
I/O caching by effectively increasing the size of the I/O cache incurs signifi-
cant CPU overhead due to the cost of deduplication techniques. On the other
hand, with the increasing number of cores, it makes sense to examine related
trade-offs when trading CPU cycles to improve the cache hit ratio and to
achieve better I/O performance.
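The core idea of using deduplication to enlarge an I/O cache can be sketched as follows. This is a minimal illustration, not the implementation discussed in the text: identical blocks are detected by a content hash and stored once, so the same amount of memory caches more distinct logical blocks. All class and method names here are illustrative.

```python
import hashlib

class DedupCache:
    """Sketch of a deduplicated block cache: blocks with identical
    contents are stored once and shared via their content hash."""

    def __init__(self):
        self.block_map = {}   # logical block id -> content hash
        self.store = {}       # content hash -> (data, refcount)

    def put(self, block_id, data):
        digest = hashlib.sha256(data).hexdigest()
        old = self.block_map.get(block_id)
        if old == digest:
            return                      # same contents already cached
        if old is not None:
            self._release(old)          # drop reference to old contents
        self.block_map[block_id] = digest
        if digest in self.store:
            stored, refs = self.store[digest]
            self.store[digest] = (stored, refs + 1)   # share existing copy
        else:
            self.store[digest] = (data, 1)            # first occurrence

    def get(self, block_id):
        digest = self.block_map.get(block_id)
        return self.store[digest][0] if digest is not None else None

    def _release(self, digest):
        data, refs = self.store[digest]
        if refs <= 1:
            del self.store[digest]
        else:
            self.store[digest] = (data, refs - 1)
```

Hashing every block is exactly the CPU cost the text refers to: each `put` pays for a digest computation in exchange for a potentially higher cache hit ratio.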
Deduplication is a powerful tool: combined with NUMA-aware algorithms and adaptive CPU partitioning, it can significantly reduce latency and overhead. The results in Figure 32.4 confirm that using deduplication for a VM file server improves I/O performance by 30%, as Jin et al. [6] show, offering better quality of service on the same hardware.
An important aspect of trading CPU efficiency for I/O efficiency is to ensure that the CPU cycles being used are not stolen from running applications.
Ideally, only idle resources should be used to perform I/O-related tasks, such
as deduplication. To examine this issue, a framework [10] was designed that is
able to dynamically assign CPUs to I/O or application tasks. The framework
sends tasks to idle cores (including GPUs) and partitions the cores into two categories: those that run specialized tasks and those that run application tasks.
The framework is able to dynamically decide the number of cores that should
be used for executing I/O-related tasks. As there is no monotonic relation
between performance obtained and cores used, all possible partitions should
be examined to find the best one, for example, using Berry et al.'s Armed
Bandit [4] technique. Figure 32.5 shows that any static CPU partitioning can
degrade performance significantly, while using a dynamic algorithm allows
executing I/O tasks and keeping application performance above 90%.
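The dynamic partition search described above can be illustrated with a generic bandit-style loop. The sketch below uses epsilon-greedy selection, which is an assumption on our part; the framework [10] cites Berry et al.'s Armed Bandit technique [4], and its actual algorithm may differ. All function names and the reward definition are illustrative.

```python
import random

def mean(rewards):
    """Average observed reward; 0.0 for an untried partition."""
    return sum(rewards) / len(rewards) if rewards else 0.0

def choose_partition(stats, epsilon=0.1):
    """Epsilon-greedy choice of how many cores to dedicate to
    I/O-related tasks. `stats` maps each candidate core count to the
    rewards observed when running with that partition (e.g., combined
    application and I/O throughput)."""
    if random.random() < epsilon:
        return random.choice(list(stats))                    # explore
    return max(stats, key=lambda cores: mean(stats[cores]))  # exploit best so far

def record(stats, cores, reward):
    """Feed a measured reward back so future choices improve."""
    stats.setdefault(cores, []).append(reward)
```

Because there is no monotonic relation between core count and performance, the exploration step keeps re-sampling partitions that currently look suboptimal, which is what lets the dynamic scheme avoid the worst cases of any fixed static partitioning.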
Overall, this preliminary investigation shows that performing I/O-related tasks in parallel, without hurting application performance, can lead to better I/O performance.