Tuning the Middleware Services (JBoss AS 5) Part 4

Tuning HornetQ (JBoss AS 6.x)

HornetQ is the new high performance messaging provider for the JBoss AS 6.x release. Its core has been designed as a simple set of Plain Old Java Objects (POJOs) without dependencies on any other JBoss classes, which is why it can be used stand-alone, embedded in your applications, or integrated with the application server.

The main configuration file is located at <server>/deploy/hornetq/hornetq-configuration.xml.

The two key elements that can influence your application's performance are the journal and the transport configuration. The journal handles the persistence of messages and transactions using a high performance algorithm implemented by the HornetQ team.

The transport configuration is based on Netty, which is an asynchronous event-driven network application framework.

How do you configure the HornetQ journal for optimal performance?

In order to tune HornetQ’s journal you have to check its configuration in the file hornetq-configuration.xml.

This is the part of the configuration related to the journal.

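A minimal sketch of that journal section, assuming the element names of the HornetQ 2.x schema and the same directory layout as the shipped paging directory (the values shown are illustrative, not a recommendation):

<!-- journal settings in hornetq-configuration.xml (illustrative values) -->
<journal-directory>${jboss.server.data.dir}/hornetq/journal</journal-directory>
<journal-min-files>10</journal-min-files>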


• journal-directory is the filesystem directory where the journal is written. Put the journal on its own physical volume. If the disk is shared with other processes (for example, a transaction coordinator, a database, or other journals) that are also reading from and writing to it, performance may drop considerably, since the disk head will be seeking back and forth between the different files.

• journal-min-files sets the minimum number of journal files to be created. If you see new files being created in the journal data directory too often (that is, lots of data is being persisted), you need to increase this minimum number of files; this way the journal will reuse more files instead of creating new data files.

Thanks to its paging feature, HornetQ transparently supports huge queues containing millions of messages even while the server is running with limited memory.

In such a situation, it’s not possible to store all of the queues in memory at any one time, so HornetQ transparently pages messages into and out of memory as they are needed, thus allowing massive queues with a low memory footprint.

HornetQ will start paging messages to disk when the size of all messages in memory for an address exceeds a configured maximum size.

The paging directory is configured by default as follows:

<paging-directory>${jboss.server.data.dir}/hornetq/paging</paging-directory>

Other parameters which are not specified in the default configuration are:

• journal-file-size: The journal file size should be aligned to the capacity of a cylinder on the disk. The default value of 10 MB should be enough on most systems.

• journal-type: This parameter lets you choose the I/O library which is used to append data to the journal. Valid values are NIO or ASYNCIO.

Choosing NIO selects the Java NIO journal. Choosing ASYNCIO selects the Linux asynchronous I/O journal. If you choose ASYNCIO but are not running Linux, or you do not have libaio installed, then HornetQ will detect this and automatically fall back to NIO.

The Java NIO journal gives very good performance, but if you are running HornetQ on Linux kernel 2.6 or later, we highly recommend the ASYNCIO journal for the best persistence performance, especially under high concurrency.

• journal-aio-flush-on-sync: HornetQ, by default, is optimized for the case where you have many producers, and thus it combines multiple writes into a single OS operation. Setting this option to false therefore gives you a performance boost and makes your system scale much better. However, if you have only a few producers, it might be worth setting this property to true, which means that your system will flush every sync with an OS call (see the sketch below).
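Putting these together, a sketch of how the optional parameters discussed above could be added to hornetq-configuration.xml; the element names are the ones referenced in this section, and the values (a 10 MB file size, the ASYNCIO journal, no flush on sync) are illustrative:

<!-- optional journal parameters (illustrative values) -->
<journal-file-size>10485760</journal-file-size>
<journal-type>ASYNCIO</journal-type>
<journal-aio-flush-on-sync>false</journal-aio-flush-on-sync>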

Do you need message security and persistence?

If you don’t need message persistence, you might greatly increase the performance of your JMS applications by setting persistence-enabled to false in hornetq-configuration.xml.

You may also get a small performance boost by disabling security, that is, by setting the security-enabled parameter to false.
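Both switches are top-level elements of hornetq-configuration.xml; a minimal sketch, to be used only if your application can really tolerate lost messages and needs no access control:

<persistence-enabled>false</persistence-enabled>
<security-enabled>false</security-enabled>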

How do you configure HornetQ transport for optimal performance?

HornetQ's transport is based on the Netty framework, which greatly simplifies and streamlines network programming such as TCP and UDP socket servers.

The basic configuration comes with a preconfigured connector/acceptor pair (netty-throughput) in hornetq-configuration.xml and a JMS connection factory (ThroughputConnectionFactory) which can be used to obtain the very best throughput, especially for small messages. Acceptors are used on the server to define which connections can be made to the HornetQ server, while connectors are used by a client to define how it connects to a server. The most relevant parameters for transport configuration are the TCP buffer sizes and the tcp-no-delay property.

TCP buffer sizes. These properties can be used to change the TCP send and receive buffer sizes. If you have a fast network and fast machines, you may get a performance boost by increasing these buffers, which default to 32768 bytes (32 KB).

Here's how to set the TCP send and receive buffers to 64 KB.

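A sketch of a connector carrying the relevant Netty parameters; the connector name, host, and port are placeholders, and the same two buffer keys would also be set on the matching acceptor (65536 bytes = 64 KB):

<connector name="netty">
   <factory-class>org.hornetq.core.remoting.impl.netty.NettyConnectorFactory</factory-class>
   <param key="host" value="${jboss.bind.address:localhost}"/>
   <param key="port" value="5445"/>
   <param key="tcp-send-buffer-size" value="65536"/>
   <param key="tcp-receive-buffer-size" value="65536"/>
</connector>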

However, note that TCP buffer sizes should be tuned according to the bandwidth and latency of your network. You can estimate your optimal TCP buffer size with the following formula:

buffer size (bytes) = bandwidth (bytes/sec) × RTT (sec)

Where bandwidth is measured in bytes per second and network round trip time (RTT) is in seconds. RTT can be easily measured using the ping utility.

If you are interested in low-level network details, here’s how you calculate this parameter:

Bandwidth can be measured with OS tools; for example, Solaris/Linux users can measure it with the iperf utility.

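A sketch of a typical run, with a placeholder host name; iperf prints the measured bandwidth when the test completes:

# on the machine hosting HornetQ (placeholder host name: jmshost)
iperf -s

# on the client machine, measure the bandwidth towards jmshost
iperf -c jmshost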

And the RTT can be roughly measured with the ping utility.

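Sketched with the same placeholder host name; the round trip time is reported in the time= field of each reply:

ping jmshost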

Multiplying the two factors, with a little math, we estimate an optimal TCP buffer size of about 165 KB.


Another important parameter, well known to network administrators, is tcp-no-delay, which is set as a transport parameter.

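A one-line sketch using the Netty parameter key; like the buffer sizes, it is set on both the connector and the matching acceptor:

<param key="tcp-no-delay" value="true"/>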

This parameter relates to Nagle’s algorithm, which is a low-level algorithm that tries to minimize the number of TCP packets on the network, by trying to fill a TCP packet before sending it. TCP packets have a 40-byte header, so if you try to send a single byte you incur a lot of overhead as you are sending 41 bytes to represent 1 byte of information. (This situation often occurs in Telnet sessions, where most key presses generate a single byte of data that is transmitted immediately.)

For enterprise applications, however, the TCP segment includes a larger data section, so enabling Nagle's algorithm would delay transmission, increasing bandwidth at the expense of latency. For this reason you should always set the tcp-no-delay property to true.

Before writing packets to the transport, HornetQ can be configured to batch up writes for a maximum of batch-delay milliseconds. This can increase overall throughput for very small messages, at the expense of an increase in average latency for message transfer. If not specified, the default value for the batch-delay property is 0 ms.
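As a sketch, the batching window is just another Netty transport parameter; the 50 ms value below is illustrative and roughly in line with what the shipped netty-throughput pair uses:

<param key="batch-delay" value="50"/>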

Basic JMS tuning

Tuning the JMS provider is essential to reach the maximum performance of your message-based applications. However, a poorly written JMS application will never reach a decent throughput if you don’t pay attention to some basic rules.

The first important rule is to reduce as much as possible the size of the messages being sent. A JMS message is basically composed of a header, a set of properties, and the body of the message.


So, first of all, you should get rid of the properties which you don't need and which inflate the message. For example, use the setDisableMessageID method on the MessageProducer class to disable message IDs if you don't need them. This decreases the size of the message and also avoids the overhead of creating a unique ID.

Likewise, invoking the setDisableMessageTimestamp method on the MessageProducer class disables message timestamps and contributes to making the message smaller.

On the other hand, you should use setTimeToLive, which controls the amount of time (in milliseconds) after which the message expires. By default, messages never expire, so setting an optimal message age will reduce memory overhead, thus improving performance.
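Putting the three producer settings together, a minimal sketch against the standard javax.jms API; the session, queue, and 60-second expiry are illustrative:

// assumes an existing javax.jms.Session session and javax.jms.Queue queue
MessageProducer producer = session.createProducer(queue);
producer.setDisableMessageID(true);        // skip generating a unique JMSMessageID
producer.setDisableMessageTimestamp(true); // skip setting the JMSTimestamp header
producer.setTimeToLive(60000);             // unconsumed messages expire after 60 seconds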

As far as the message body is concerned, you should avoid messages with a large data section. Verbose formats such as XML take up a lot of space on the wire and performance will suffer as a result. Consider using BytesMessage if you need to transfer XML payloads, to ensure that the message is transmitted efficiently and unnecessary data conversion is avoided.

Also, be careful with the ObjectMessage type. ObjectMessage is convenient but it comes at a cost: the body of an ObjectMessage uses Java serialization to serialize it to bytes. The Java serialized form of even small objects is very verbose, so it takes up a lot of space on the wire; Java serialization is also slow compared to custom marshalling techniques. Only use ObjectMessage if you really can't use one of the other message types, for example if you really don't know the type of the payload until runtime.

Another element which influences the performance of your messaging is the acknowledge mode:

• client_acknowledge mode is the least feasible option since the JMS server cannot send subsequent messages till it receives an acknowledgement from the client.

• auto_acknowledge mode follows the policy of delivering the message once-and-only-once, but this incurs an overhead on the server to maintain this policy and requires an acknowledgement to be sent to the server for each message received by the client.

• dups_ok_acknowledge mode follows a different policy which allows the message to be delivered more than once, thereby reducing the overhead on the server, but it might impose an overhead on network traffic by occasionally sending a message more than once.

From a performance point of view, dups_ok_acknowledge usually gives better performance than auto_acknowledge. As an alternative, you might consider creating a transacted session and batching up many acknowledgements with one acknowledge/commit.
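As a sketch of the two alternatives, using the standard javax.jms API (the connection variable is assumed to already exist):

// lazy acknowledgements: duplicates are possible after a failure
Session dupsOkSession = connection.createSession(false, Session.DUPS_OK_ACKNOWLEDGE);

// or a transacted session, batching many messages into a single commit
Session txSession = connection.createSession(true, Session.SESSION_TRANSACTED);
// ... send or receive a batch of messages with txSession ...
txSession.commit();   // acknowledges the whole batch at once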

Another thing you should always consider is re-using your JMS resources, such as connections, sessions, consumers, and producers. Creating JMS resources is expensive, so you should absolutely avoid creating them anew for every message sent or consumed.

You should also re-use temporary queues across many requests. As a matter of fact, when you issue a message using the temporary queue request-response pattern, the message is sent to the target and a reply-to header is set with the address of a local temporary queue. Instead of creating a new temporary queue each time, you should just send the response back to the address specified in the reply-to header.
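A minimal sketch of the requester side of this pattern, with illustrative variable names; the temporary queue is created once and reused for every request:

// created once, outside the request loop
TemporaryQueue replyQueue = session.createTemporaryQueue();
MessageConsumer replyConsumer = session.createConsumer(replyQueue);
MessageProducer producer = session.createProducer(serviceQueue);

// per request: point the responder back at the shared temporary queue
TextMessage request = session.createTextMessage("stock quote request");
request.setJMSReplyTo(replyQueue);
producer.send(request);
Message reply = replyConsumer.receive(2000);   // wait up to two seconds for the answer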

An example use case with HornetQ

You have been hired by Acme Ltd to improve the performance of a Stock Trading application, which uses a JMS system for querying stock values, and to issue orders. The company has recently migrated to the HornetQ messaging system and has a performance goal of delivering 3 K (3000) messages per second. The average size of JMS messages is 1 KB.

The system specifications require persisting JMS messages for stock orders, but not for stock quotation queries (which account for 80% of the traffic), where it is acceptable for messages to be lost in case of a system crash.

The system architecture is described as follows:

• JBoss 6.0.0 M5 with HornetQ 2.1.1 installed

• Linux System Fedora running on Xeon 4 Dual Core 32 Mb

The Acme project team has installed HornetQ with default values and, after a system warm-up, has launched a first batch of orders:

[STDOUT] 10000 Messages in 49,263 secs
[STDOUT] 20000 Messages in 97,212 secs
[STDOUT] 100000 Messages in 472,924 secs (472 seconds and 924 ms)

The system is delivering about 211 msg/sec.

We are far away from the performance expectations; however don’t despair, there’s much room for improvement.

The first, drastic remedy will be to differentiate message persistence depending on the type of operation. Since most JMS messages are stock queries, these can be sent as non-persistent messages; JMS messages bearing orders, on the other hand, will be persisted.

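A sketch of how the sender could differentiate the two message types with the standard javax.jms API; the isStockQuery flag and the variable names are illustrative:

MessageProducer producer = session.createProducer(destination);
if (isStockQuery) {
    // stock quotes may be lost on a crash, so bypass the journal
    producer.send(message, DeliveryMode.NON_PERSISTENT, Message.DEFAULT_PRIORITY, 0);
} else {
    // stock orders must survive a crash
    producer.send(message, DeliveryMode.PERSISTENT, Message.DEFAULT_PRIORITY, 0);
}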

Here’s the new benchmark output:

[STDOUT] 10000 Messages in 5,353 secs
[STDOUT] 20000 Messages in 9,039 secs
[STDOUT] 100000 Messages in 50,853 secs

The system is now delivering 1968 msg/sec.

We have made a huge leap towards our target. The next change on the list is installing libaio on our Linux system, so that the journal can actually use the default ASYNCIO mode:

$ sudo yum install libaio

Here's the new benchmark, using ASYNCIO:

[STDOUT] 10000 Messages in 4,512 secs
[STDOUT] 20000 Messages in 8,592 secs
[STDOUT] 100000 Messages in 42,735 secs

The system is now delivering 2340 msg/sec.

We are getting closer to our goal. The next optimization we will try is setting the tcp-no-delay parameter to true, which means disabling Nagle's algorithm. As a matter of fact, this algorithm can yield some benefit for very small packets (in the range of a few bytes, think of a telnet session); however, a typical JMS message is much bigger, and bypassing this algorithm gives a definite performance boost.

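Sketched as before, the parameter is added to the Netty connector and acceptor in use, for example the netty-throughput pair of the default configuration:

<param key="tcp-no-delay" value="true"/>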

Here’s the new benchmark:

[STDOUT] 10000 Messages in 4,093 secs
[STDOUT] 20000 Messages in 7,932 secs
[STDOUT] 100000 Messages in 38,381 secs

The system is now delivering 2605 msg/sec.

The next optimization we will include is tuning tcp-send-buffer-size and tcp-receive-buffer-size. By multiplying network bandwidth by latency (RTT), we have estimated the optimal TCP buffer at 256 KB, so we will change the configuration accordingly.

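A sketch of the corresponding transport parameters (262144 bytes = 256 KB), again applied to both the connector and the acceptor:

<param key="tcp-send-buffer-size" value="262144"/>
<param key="tcp-receive-buffer-size" value="262144"/>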

Running a new benchmark does not reveal any improvement in the message throughput:

[STDOUT] 10000 Messages in 4,512 secs
[STDOUT] 20000 Messages in 8,592 secs
[STDOUT] 100000 Messages in 42,735 secs

The system is still delivering 2340 msg/sec.

The likely cause is that the operating system is not configured to use a TCP buffer size of 256 KB. By default, most Linux operating systems are tuned for a general-purpose, low-latency network environment.

As we have plenty of memory on this machine, we will set the maximum OS send buffer size (wmem) and receive buffer size (rmem) to 12 MB for queues on all protocols. In other words, we will increase the amount of memory that can be allocated to each TCP socket when it is opened or created while transferring data.

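A sketch of the corresponding kernel settings, run as root; 12582912 bytes is 12 MB, and the sysctl keys are the standard Linux ones (check your distribution's defaults before raising them):

sysctl -w net.core.rmem_max=12582912
sysctl -w net.core.wmem_max=12582912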

You also need to set the minimum, initial, and maximum sizes, in bytes.

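Again as a sketch; the three values are the minimum, initial (default), and maximum buffer sizes in bytes, and the specific numbers are illustrative:

sysctl -w net.ipv4.tcp_rmem="10240 87380 12582912"
sysctl -w net.ipv4.tcp_wmem="10240 87380 12582912"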

Note: on Solaris you can use the ndd shell command to set the send (xmit) and receive (recv) buffer sizes. For example, the following commands set both buffers to 75,000 bytes.

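A sketch using the standard Solaris TCP tunables, run as root:

ndd -set /dev/tcp tcp_xmit_hiwat 75000
ndd -set /dev/tcp tcp_recv_hiwat 75000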

The new benchmark, with modified system kernel settings, shows a much-improved throughput:

[STDOUT] 10000 Messages in 3,524 secs
[STDOUT] 20000 Messages in 6,947 secs
[STDOUT] 100000 Messages in 32,937 secs

We have reached 3,036 msg/sec, meeting the desired target.

It is theoretically possible to obtain some additional improvement by reducing the size of the JMS messages (for example, by removing properties which are not required). The journal can also be further optimized by putting its log files on a different physical volume from the paging directory. If you are frequently writing to these two locations on the same physical volume, the hard disk head has the tough job of continuously skipping between the two directories. The same rule applies to the transaction logs, in case you're using XA transactions in the application server.

Summary

In this topic, we have covered two core middleware services: the EJB and JMS services.

With proper tuning, you can greatly improve the performance of your stateless and stateful session beans.

The two elements you can use to fine-tune the performance of stateless session beans (SLSBs) are the pool size and the locking strategy.

• The optimal maximum pool size can be determined by inspecting the value of the concurrentCalls attribute in the component statistics.

• The locking strategy can be chosen between ThreadLocalPool (the default) and StrictMaxPool. The default ThreadLocalPool strategy delivers better throughput because it uses thread-local variables instead of synchronization. StrictMaxPool can be used to gain exact control over the maximum number of concurrent EJB instances.

• Stateful session beans (SFSBs) have higher CPU and memory requirements, depending on the size of the non-transient objects they contain and on the time span of the session. If you keep sessions short, the performance of SFSBs can come close to that of stateless session beans.

• Additional SFSB performance can be gained by disabling passivation, at the expense of higher memory consumption. You should therefore raise the JVM memory threshold accordingly, to avoid the garbage collector stealing this benefit.

• An extreme tuning measure is to modify the EJB container's interceptor stack by removing unnecessary AOP classes. Transaction interceptors are a potential candidate.

JBoss JMS is served by different providers in releases 5.x and 6.x.

• The JBoss AS 5.x provider is the JBoss Messaging service. You can fine-tune its ConnectionFactory by setting these two attributes:

° PrefetchSize: indicates how many messages client-side message consumers will buffer locally. The default value for PrefetchSize is 150. Larger values give better throughput but require more memory.

° You should ensure that the SlowConsumers attribute is set to false, otherwise client-side message buffering will be disabled, which badly degrades your performance.

• For a single destination, you should consider tuning the following parameters:

° PageSize: indicates the maximum number of messages to pre-load in one single operation.

° FullSize: this is the maximum number of messages held by the queue or topic subscriptions in memory at any one time. If you have a very high rate of messages to be delivered, you should increase the default value, which is 75,000. Increase DownCacheSize as well, so that messages are flushed to storage sparingly.

• The JBoss AS 6.x default JMS provider is HornetQ, which can also be used embedded in your application or as a standalone JMS server. You can tune HornetQ in two major areas: the journal (where messages are persisted) and the transport, which uses the Netty libraries.

° If you are frequently writing to the journal, it's important to keep its log files on a separate hard disk volume to improve hard disk transfer rates. You are strongly advised to use the Linux asynchronous I/O (libaio) journal if your operating system supports it.

° HornetQ transport tuning requires the setting of an appropriate TCP send/receive buffer size. You should modify these properties according to your operating system settings.

• Set the tcp-no-delay property to true to disable Nagle's algorithm and improve performance.

• Follow JMS best practices, which include reducing the size of the messages, re-using your JMS resources (connections, sessions, consumers, and producers), and considering whether you really need message persistence, which is the most expensive factor of a JMS session.
