Tuning the Persistence Layer (JBoss AS 5) Part 1

Data persistence is a key ingredient of any enterprise application, and it has been part of the JDK almost from the beginning (the JDBC API debuted in JDK 1.1). Most readers will certainly agree that data persistence is the most common cause of bottlenecks in their applications. Unfortunately, isolating the root of the problem is not a straightforward affair: it requires investigating many areas, from the SQL code to the interfaces used to issue SQL statements, as well as the database configuration and, finally, the underlying network and database hardware.

For this reason, we have divided this topic into three main sections, in order to cover all the factors that can drag down the performance of your data persistence:

• The first section introduces some principles of good database design. If your database design doesn’t conform to some commonly agreed rules, chances are high that you will find it hard to achieve the performance needed for a successful application.

• The next section illustrates how to improve the performance of the Java Database Connectivity (JDBC) API, which allows connecting to a legacy database by means of drivers released by database vendors.

• The last section covers the core concepts of the Java Persistence API and the Hibernate framework, which JBoss AS uses as its persistence engine.

Designing a good database

Before you begin the development of your application, both the logical and the physical database design must be in place. Once the application has been implemented, it’s simply too late to fix an inaccurate database design, leaving no choice other than buying larger amounts of fast, expensive hardware to cope with the problem.


Designing a database structure is usually the task of a database administrator. However, in today’s tight-budget world it is not rare for the software architect to design the database schema as well. That’s why you should be aware of a few basic concepts like database normalization, database partitioning, and good column indexing.

Reducing the space needed by your database

One of the most basic optimizations is to design your tables to take as little space on the disk as possible. This makes disk reads faster and uses less memory for query processing.

In order to reduce the amount of space used by your tables, you should first choose the smallest possible data type for each column. The smaller your data types, the more index (and data) entries fit into a single block of memory, and the faster your queries will be.

Secondly, you should carefully choose between a normalized database and a de-normalized database.

Database normalization eliminates redundant data, which usually makes updates faster since there is less data to change. However, a normalized schema forces queries to perform joins, which makes reads slower. The following image shows database normalization versus database de-normalization:

[Figure: a normalized schema compared with its de-normalized equivalent]

On the other hand, a de-normalized database replicates most of the information in one or several tables, thus speeding up data retrieval but reducing the performance of updates.

The DBA point of view

The primary objective of normalization is data integrity. So, if you choose to de-normalize your database, be aware that multiple copies of the same data can easily get out of sync due to programmer or system errors.

Applications involving many transactions, also known as Online Transaction Processing (OLTP) applications, generally perform better with a normalized schema, which guarantees fast updates and inserts.

As a matter of fact, most applications involving extremely large numbers of transactions are slowed down by the row locks that are necessary to guarantee the integrity of those transactions. On the other hand, reporting applications (Online Analytical Processing, or OLAP) that perform massive amounts of selects are generally faster with a de-normalized schema.

Your first goal in the design process should be to normalize your data. Next, you can test your design with realistic data and transactions. At this point, if you see that de-normalization will help, then by all means you can apply it. But don’t assume that you need to de-normalize data until you can prove (through testing) that it’s the right thing to do.

With JPA O/R mapping, you can use the @Embedded annotation for de-normalized columns: it specifies a persistent field whose @Embeddable type is stored as an intrinsic part of the owning entity, sharing the identity of that entity.
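For example, here is a minimal sketch (the Customer and Address names are illustrative): the Address component maps to columns of the owning entity’s own table, so reading it requires no join.

import javax.persistence.Embeddable;
import javax.persistence.Embedded;
import javax.persistence.Entity;
import javax.persistence.Id;

@Embeddable
public class Address {
    private String street;
    private String city;
    private String zipCode;
    // getters and setters omitted for brevity
}

@Entity
public class Customer {
    @Id
    private Long id;

    // the Address fields become extra columns of the Customer table,
    // sharing the identity of the owning entity (no join required)
    @Embedded
    private Address address;
}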

Partitioning the database

If you are designing a database that potentially could be very large, holding millions or billions of rows, you should consider the option of partitioning your database tables. You can partition your database in two ways:

• Horizontal partitioning divides what would typically be a single table into multiple tables, creating many smaller tables instead of a single, large table. The advantage of this is that it is generally much faster to query a single small table than a single large table.

The next image shows an example of how you could split your Sales table into different sub-tables, depending on the region the sales are related to:

[Figure: the Sales table split horizontally into per-region sub-tables]

Also, you could split tables on a time basis; for example, you might partition the table into separate sub-tables for each year of transactions.

• Vertical Partitioning distributes a table with many columns into multiple tables with fewer columns, so that only certain columns are included in a particular dataset, with each partition including all rows.

Although partitioning is considered a dangerous premature optimization for small-to-medium scale projects, it is often a prudent choice when you are dealing with huge sets of data: adapting your application to a different partitioning logic later can require major architectural changes, which are far more expensive in time and money.

For example, a table that contains a number of very wide BLOB columns that aren’t referenced often can be split into two tables, with the most-referenced columns in one table and the seldom-referenced text or BLOB columns in another. Here’s an example of vertical partitioning:

[Figure: vertical partitioning of the Product table, with the frequently accessed columns kept in Product and the BLOB photo data moved to a separate ProductPhoto table]

By removing the large data columns from the table, you get a faster query response time for the more frequently accessed Product data. Wide tables can slow down queries, so you should always ensure that all columns defined in a table are actually needed.

Using JPA/Hibernate mapping, you can easily map the above case with a lazy one-to-many relationship between the Product table and the ProductPhoto table. The ProductPhoto entity contains the less frequently accessed BLOB data, which can be lazy loaded, that is, queried only when the client actually requests the fields of the relationship.
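The following sketch shows one possible mapping (the entity and field names are illustrative; FetchType.LAZY is already the default for @OneToMany and is spelled out here only for emphasis):

import java.util.List;
import javax.persistence.Entity;
import javax.persistence.FetchType;
import javax.persistence.Id;
import javax.persistence.Lob;
import javax.persistence.ManyToOne;
import javax.persistence.OneToMany;

@Entity
public class Product {
    @Id
    private Long id;

    private String name;

    // the photos are loaded only when the collection is actually accessed
    @OneToMany(mappedBy = "product", fetch = FetchType.LAZY)
    private List<ProductPhoto> photos;
}

@Entity
public class ProductPhoto {
    @Id
    private Long id;

    @Lob
    private byte[] image; // the wide, seldom-referenced BLOB column

    @ManyToOne
    private Product product;
}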

Using indexes

Effective indexes are one of the best ways to improve the performance of a database schema. Just like a reader searching for a word in a book, an index helps when you are looking for a specific record or set of records with a where clause.

Since index entries are stored in sorted order, indexes also help when processing order by clauses. Without an index, the database has to load the records and sort them during execution.

Though indexes are indispensable for fast queries, they can have some drawbacks as well, in particular when the time comes to modify records. As a matter of fact, any time a query modifies the data in a table, the indexes on that data must change too. Try to use a maximum of four or five indexes on one table, not more. If you have a read-only table, then the number of indexes may be safely increased.

There are a number of guidelines to building the most effective indexes for your application, valid for every database; in particular:

• Index on appropriate columns: In order to achieve the maximum benefit from an index, you should index the columns that most commonly appear in the where clauses of your queries against the table.

• Keep indexes small: Short indexes are faster to process in terms of I/O and are faster to compare. An index on an integer field provides optimal results.

• Choose distinct keys: If the fields bearing the index have few or no duplicate values, the index is highly selective and provides the best performance results.

• Structure the composite index correctly: If you create a composite (multicolumn) index, the order of the columns in the key is very important. Put the most selective data element, the one that has the greatest variety of values, first in the index; the index will then find the correct page faster.

Not many Hibernate/JPA users know that you can also define indexes in Hibernate’s table configuration. For example, if you need to define an index named index1 on the columns column1 and column2, you can use this simple annotation:

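A sketch of the mapping (the entity and table names are illustrative; note that @Table and @Index here come from org.hibernate.annotations, not javax.persistence, since this index support is a Hibernate 3 extension):

import javax.persistence.Entity;
import javax.persistence.Id;
import org.hibernate.annotations.Index;

@Entity
@javax.persistence.Table(name = "mytable")
@org.hibernate.annotations.Table(appliesTo = "mytable",
        indexes = { @Index(name = "index1",
                           columnNames = { "column1", "column2" }) })
public class MyEntity {
    @Id
    private Long id;

    private String column1;
    private String column2;
}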

Tuning JDBC

The Java Database Connectivity (JDBC) API was the first interface used by Java programmers to connect to database systems. Even though many newer database persistence methods have been developed for Java programmers (for example, entity beans, JDO, Hibernate, and many others), most database access code running in today’s systems is still written as plain JDBC.

The reason for the popularity of JDBC, besides the obvious fact that it was the first API released, is that it does not require any special knowledge to get started with it. Just feed the native SQL to your JDBC interfaces and collect the result.

JDBC tuning follows the same pattern as any tuning process: you should first measure the performance of your application in production and then evaluate possible corrections. However, it’s worth knowing some basic tuning techniques, which are available from the outset without sacrificing good design and coding practices.

These basic principles, which are described in the following sections, can also be used as a foundation for frameworks built on top of JDBC, for example, Hibernate. More precisely, you should:

• Introduce a database Connection Pool that reuses your connections

• Make use of proper JDBC features, such as fetch size and batch size

• Use prepared statements in your application, and configure a prepared statement cache at the application server level (see the short sketch after this list)
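As a quick illustration of the last point, here is a minimal sketch (the SQL and the class and variable names are illustrative): the SQL text of a prepared statement never changes between calls, so the driver and the server-side statement cache can reuse the parsed statement and its execution plan.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class CustomerDao {
    public String findCustomerName(Connection conn, long customerId)
            throws SQLException {
        PreparedStatement ps = conn.prepareStatement(
                "SELECT name FROM Customer WHERE id = ?");
        try {
            ps.setLong(1, customerId); // only the bind value varies per call
            ResultSet rs = ps.executeQuery();
            return rs.next() ? rs.getString(1) : null;
        } finally {
            ps.close();
        }
    }
}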

Using a Connection Pool

The first basic rule you need to follow when programming JDBC is to use a Connection Pool when accessing a database. Establishing database connections, depending upon the platform, can take from a few milliseconds up to one second. This can be a meaningful amount of time for many applications, if done frequently. As depicted in the following image, the Connection Pool is located in the path of a JDBC connection:

[Figure: the Connection Pool sits in the path of a JDBC connection, between the application and the database]

A Connection Pool allows the reuse of physical connections and minimizes expensive operations in the creation and closure of sessions. Also, maintaining many idle connections is expensive for a database management system, and the pool can optimize the usage of idle connections (or disconnect them if there are no requests).

From the programmer’s point of view, what the application server provides you is a Connection object implemented by JBoss AS as org.jboss.resource.adapter.jdbc.WrappedConnection.

For example, if you are interested in retrieving the underlying OracleConnection implementation from the Connection Pool of JBoss AS, you can use code like the following:

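A sketch of the unwrapping steps (the JNDI name java:/OracleDS is illustrative):

import java.sql.Connection;
import javax.naming.InitialContext;
import javax.sql.DataSource;
import oracle.jdbc.OracleConnection;
import org.jboss.resource.adapter.jdbc.WrappedConnection;

public class ConnectionUnwrapExample {
    public OracleConnection getOracleConnection() throws Exception {
        // look up the JBoss AS datasource and open a pooled connection
        InitialContext ctx = new InitialContext();
        DataSource ds = (DataSource) ctx.lookup("java:/OracleDS");
        Connection conn = ds.getConnection();

        // JBoss AS hands out a WrappedConnection; unwrap it to reach
        // the driver's own Connection implementation
        WrappedConnection wrappedConn = (WrappedConnection) conn;
        Connection underlyingConn = wrappedConn.getUnderlyingConnection();
        return (OracleConnection) underlyingConn;
    }
}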

You might be wondering what the advantage of retrieving the underlying Connection implementation is. One good reason is that you may need to access custom properties which are not available through the base Connection interface. For example, if you are using an Oracle thin driver and you need to debug your prepared statement cache size, you can use the getStatementCacheSize() method of the OracleConnection object:

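Continuing the sketch above:

OracleConnection oracleConn = new ConnectionUnwrapExample().getOracleConnection();
// query the Oracle thin driver's implicit statement cache size
int cacheSize = oracleConn.getStatementCacheSize();
System.out.println("Prepared statement cache size: " + cacheSize);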

For more details about configuring the Connection Pool, you can refer to the next topic, Tuning the JBoss AS, where we have dissected all of its performance details.

Setting the proper fetch size

The fetch size is the number of rows physically retrieved from the database at one time by the JDBC driver as you scroll through a ResultSet with the next() method.

If you set the query fetch size to 100, when you retrieve the first row, the JDBC driver retrieves at once the first 100 rows (or all of them if fewer than 100 rows satisfy the query). When you retrieve the second row, the JDBC driver merely returns the row from local memory—it doesn’t have to retrieve that row from the database. This feature improves performance by reducing the number of calls (which are frequently network transmissions) to the database.

To set the query fetch size, use the setFetchSize() method on the Statement (or PreparedStatement or CallableStatement) before execution.
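For example (a minimal sketch, assuming conn is an open java.sql.Connection; the query text and the fetch size of 100 are illustrative):

PreparedStatement stmt = conn.prepareStatement("SELECT id, name FROM Product");
stmt.setFetchSize(100); // the driver fetches 100 rows per database round trip
ResultSet rs = stmt.executeQuery();
while (rs.next()) {
    // rows after the first of each batch come from the driver's local
    // buffer, so next() needs no extra network round trip
    String name = rs.getString("name");
}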

The optimal fetch size is not always obvious. Usually, a fetch size of one half or one quarter of the total expected result size is optimal.

As a general rule, setting the query fetch size is mostly effective for a large ResultSet. If you set the fetch size much larger than the number of rows retrieved, it’s likely that you’ll get a performance decrease, not an increase.

Setting a higher default fetch size on JBoss AS

If you plan to increase the default row prefetch for all your statements, you can do it by means of the defaultRowPrefetch connection property. If you acquire the Connection from a JBoss AS Datasource, then you need to add the following property to your -ds.xml file:

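A sketch of the datasource descriptor (the JNDI name, connection URL, and the value of 50 are illustrative; defaultRowPrefetch is a property of the Oracle thin driver, so check your own driver’s documentation for its equivalent):

<datasources>
  <local-tx-datasource>
    <jndi-name>OracleDS</jndi-name>
    <connection-url>jdbc:oracle:thin:@localhost:1521:XE</connection-url>
    <driver-class>oracle.jdbc.driver.OracleDriver</driver-class>
    <!-- fetch 50 rows per round trip for every statement on this datasource -->
    <connection-property name="defaultRowPrefetch">50</connection-property>
  </local-tx-datasource>
</datasources>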

The following micro-benchmark shows the different timings when retrieving a set of 500 records using a fetch size of 10 (the default), 250, and 1000 records respectively. As you can see, using a fetch size of around ½ to ¼ of the expected number of records usually produces the best results, while using an exaggeratedly high value is counterproductive both from the performance point of view and in terms of memory usage.

[Figure: micro-benchmark timings for retrieving 500 records with fetch sizes of 10, 250, and 1000]

Caution! The memory used by row fetching does not depend only on the number of rows!

The cost of maintaining the local cache of fetched rows depends on the sum of many factors: the number of columns, the defined size of each column, and the fetch size. For example, if you are using an Oracle database, for each column assume 2 bytes per char in the defined size for character types, 4 KB for LOBs and RAWs, and 22 bytes for everything else. Sum the sizes of the columns, and then multiply by the fetch size. As you can imagine, the cost of row fetching can be much higher for certain data types like blob and image, so you are advised to use smaller fetch sizes for these types, while you can try bigger numbers for simpler column types.
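To make the arithmetic concrete (a hypothetical calculation based on the Oracle figures above): a query returning ten VARCHAR2(100) columns costs roughly 10 × 100 × 2 = 2,000 bytes per row, so a fetch size of 500 would reserve about 500 × 2,000 = 1,000,000 bytes, roughly 1 MB of driver-side memory, for that single statement.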
