A Volatile Look At HSQLDB

Recently, I had cause to look at HSQLDB, one of several pure Java databases out there. At my job, we use RDBMS technology to back our deductive database systems. As such, we have need for a small-scale development RDBMS to accompany larger enterprise-grade RDBMS’s. For several years we have used H2, and it has been good to us. It is small, fast, very complete and reliable.

However, a couple years ago H2 changes it’s on-disk storage system. The new storage system is neater and cleaner, but it had a side effect: H2 is now significant less concurrent. Our deductive engine is highly concurrent (we implemented a form of Actor concurrency before we knew what Actors were), and we routinely run hundreds of SQL queries simultaneously. We treat the RDBMS backend much like a NoSQL database, and really thrash it at times, especially before the caching and materialization layers kick in. In recent years, as every machine gained cores like mad, this has become a problem. H2’s lack of concurrent execution sometimes slows the system down to the point where even on small datasets, Postgres is faster, despite the network hop. (This is not a knock on H2; highly concurrent access is not a design goal that seems to be emphasized. Fair enough.)

So this is a problem, so I decided it was time to survey embedded DB’s again. In the past, I had looked at HSQLDB and Derby and rejected them for performance reasons. Perhaps, in the intervening three years something had changed?

Derby: Sorry, Just Slow
Derby was pretty much like I left it: just too slow. Faster now, but still slow. It was very helpful to have SQL SEQUENCE’s; we need to generate a lot of unique identifiers, and sequences are perfect for this.

It took about four hours to get our 1700+ test framework to run over Derby. It worked fine. But it was slooooow. It was more concurrent than H2 (judging from the CPU usage), but it was just slower at getting to answers. I wonder if the whole SQL -> bytecode path is just flawed. I don’t know, and I am not going to dig too deep. Plus, Derby databases are slow to deploy, comparatively, and we create and delete about a hundred databases during a single test run, so that matters.

Finally, what a pain to programmatically manipulate. Shutdown by JDBC url? Set system vars to specify db location? All this stuff makes it harder to create, connect to, shutdown, and destroy multiple databases while in the same JVM. You can do it, but it requires a lot of careful management.

So while Derby was faster than last time I looked at it, it was still too slow.

HSQLDB
My other easy option was HSQLDB, which is all Java, shares some distant heritage with H2, and is quite popular. Plus, HSQLDB claims the engine is fully multi-threaded. Cool! Silver Bullet time!

Because HSQLDB, like H2, uses a fairly modern variant of SQL it was simple to deploy our system over it. In a couple hours I had the system running, and running quickly. Magic fast. It bothered me how fast it was.

Then I uncovered HSQLDB’s odd design choice of MEMORY tables. By default, HSQLDB keeps tables in memory and persists them to disk seperately. This leads to really fast execution (the data is in memory after all). But … our application needs the memory. Sad, but true. With H2, we our servers explicitly manage the amount of memory available for H2’s cache. Changing the default table type to CACHED moves the storage to disk. So I did that, and I likewise explicitly managed the cache size for HSQLDB.

Now the raw SQL performance was close to H2, just a shade slower in some cases, a shade faster in others . But if it was really concurrent, we’d get a big boost in deductive query evaluation. And indeed we did! For a certain class of query plan, the engine really flew.

Until everything hung inside HSQLDB.

Grrrr. Argh. Gnashing of teeth.

The hang was repeatable on my quad-core laptop under Linux (eight CPUs seen), but I couldn’t repeat it with a simple test case. The confluence of events and timing was just too complex to extract.

So I got the source, and tried it again, looking hard at the stack traces. Eventually I came to the HSQLDB class CountUpDownLatch, which implements a, well, counting up or down latch. This class has been around in HSQLDB a long time, and can be found in various source code repositories around the internet. The code looks like this:

import java.util.concurrent.CountDownLatch;

public class CountUpDownLatch {

    CountDownLatch latch;
    int            count;

    public CountUpDownLatch() {
        latch      = new CountDownLatch(1);
        this.count = count;
    }

    public void await() throws InterruptedException {

        if (count == 0) {
            return;
        }

        latch.await();
    }

    public void countDown() {

        count--;

        if (count == 0) {
            latch.countDown();
        }
    }

    public long getCount() {
        return count;
    }

    public void countUp() {

        if (latch.getCount() == 0) {
            latch = new CountDownLatch(1);
        }

        count++;
    }

    public void setCount(int count) {

        if (count == 0) {
            if (latch.getCount() != 0) {
                latch.countDown();
            }
        } else if (latch.getCount() == 0) {
            latch = new CountDownLatch(1);
        }

        this.count = count;
    }
}

What I saw, repeatably, was a hang in await(). As I looked at it I noticed two things.

First, for all that it is in a .concurrent package, it isn’t actually thread-safe. Most of the methods would fail miserably if they weren’t executed atomically. I ran around the source code for a while, but it looked like there was just one instance of the class, and that instance was always accessed via an exclusive lock. So, for all that CountUpDownLatch was not thread-safe, it seemed to be used in a thread-safe manner. As an aside, this happens via a ReentrantLock, not by a synchronized block. So, that is OK, I guess, though it goes against the grain that the code isn’t inherently thread-safe.

The second thing I noticed was subtler.

That ‘count’ variable — it kinda, sorta ought to be volatile. I would say it is the prototypical volatile use case. It is a variable whose value can be updated by another thread. The spec on volatile is here; a good explanation is here. Somewhat simplified, volatile is used to denote a variable whose value can be changed by another thread. If you don’t specify a variable as volatile, the compiler can cache the value (I believe in a register) and not go back to RAM every time. There are other events which cause this cached value to be flushed, but accessing a variable marked volatile forces the compiler to go back to RAM. (It is, of course, a good bit more complicated than this but the preceding is a useful working definition.)

In the code above, for example, if the value of count in the current thread was cached as 3, then another thread increments it to 4 , then the current thread decrements using 3 as it’s cached value, as 3-1 to 2, things are screwed up. Basically, the value of count is not guaranteed to be manipulated consistently, even though it is within the ReentrantLock.

So I wrote a bug, #3153989. But, you know, the code was just lying there in an Eclipse workspace, so I changed the variable to volatile, generated a jarfile, and rerun my tests.

Bingo, it all worked. Nifty.

And unsettling.

You see, the optimization of caching variable values into registers per thread is not one that has been commonly implemented; I believe newer JVM’s do it a lot more nowadays, particularly on server JVM’s on multi-core machines. That means that this code appeared to work fine on many machines, under many load scenarios. Only on a multi-core machine under concurrent load, with a JVM which does this caching optimization, could this show up.

— UPDATED —
Several posters cleared up some things. It seems that Lock.lock() does in fact flush the cache, thereby avoiding the problem in the manner I noted. Thus, volatile would not be needed, if every access to every method in CountUpDownLatch was under a lock. However, it turns out that the await() method in particular is often called outside of any locks, which is what subjects it to the volatile/caching issue.

So CountUpDownLatch is really specialized, and should probably be called “HSQLDBSpecificNotInherentlyConcurrentAtAllCountUpDownLatch”. Or some such.
— UPDATED —
The response to my bug was pretty swift, although the code available via SVN has not been updated yet. I suspect by the release of RC4, it will be.

But it did unsettle me — what other miscues like this are in there? So far I haven’t found any other problems, but we’ve got thousands and thousands of testing hours and end-user usage for H2. Obviously, it’ll be a while until I am prepared to trust HSQLDB to that level. That is not a knock on HSQLDB, just a fact of H2’s lead in our usage.

Wrapping Up
I did look at some other embedded RDBMS’s, notably SQLite, which is easily available from Java. But it is a much more primitive SQL, and thus lacks some things I’d like. And, well, it just wasn’t that fast with say, a hundred thousand rows. Not surprising, I am not sure that is the design goal for SQLite. Other, larger embedded DBs were not Java, and were going to be a royal pain to manage across 32/64-bit Linux/Windows/Mac.

So, we’ll probably at least put HSQLDB in the rotation for testing, and see how it holds up.

About these ads
Explore posts in the same categories: Software

Tags: , , , ,

You can comment below, or link to this permanent URL from your own site.

20 Comments on “A Volatile Look At HSQLDB”

  1. Patrick Says:

    Hi,

    I don’t think volatile is the correct solution. volatile is useful if you can guarantee that only one thread ever updates the count variable. See Java Concurrency in Practice by Brian Goetz, Josh Bloch and others.

    Even with that what about the race conditions when replacing the countdown latch??

    Cheers,
    Patrick

  2. Patrick Says:

    I think this may be a better solution

    import java.util.concurrent.locks.Condition;
    import java.util.concurrent.locks.ReentrantLock;

    public class CountLatch
    {
    private int count;

    private final ReentrantLock lock = new ReentrantLock();
    private final Condition condition = lock.newCondition();

    public CountLatch(int count)
    {
    this.count = count;
    }

    public void await()
    {
    while (true)
    {
    try
    {
    lock.lockInterruptibly();

    if (count == 0)
    {
    break;
    }

    condition.await();

    if (count == 0)
    {
    break;
    }
    }
    catch (InterruptedException ignore)
    {
    // loop around and test again inside the lock
    }
    finally
    {
    lock.unlock();
    }
    }
    }

    public void increment()
    {
    try
    {
    lock.lock();
    count += 1;
    }
    finally
    {
    lock.unlock();
    }
    }

    public void decrement()
    {
    try
    {
    lock.lock();
    if (count == 0)
    {
    return;
    }

    count -= 1;
    if (count == 0)
    {
    condition.signalAll();
    }
    }
    finally
    {
    lock.unlock();
    }
    }

    • Chris Says:

      Your solution is better, if the goal is to make the class thread-safe, but.it isn’t correct. As far as I can tell, the instance of the class in question is always used (in HSQLDB) under an exclusive lock already. (I mention this as problem #1 in the post.)

      So for the HSQLDB code base, it is not intended to be thread safe, really. The locking is done, correctly, at a higher level. No different from your solution, the lock is just in a different place.

      But the locking, either theirs or yours, won’t work if ‘count’ is not volatile.

      Note that if you make the blocks synchronized, it forces the cache flush to happen automatically, at the cost of some speed.

      • Campbell Burnet Says:

        Actually, probably better to simply reuse the same potentially lock-free implementation to which java’s provided CountDownLatch delegates:

        /**
        * A variation on {@link java.util.CountDownLatch} to allow counting up.
        *
        * Unlike {@link CountUpDownLatch}, which is a wrapper for
        * {@link java.util.concurrent.CountDownLatch}, this class directly mimics
        * CountDownLatch by delegating to {@link AbstractQueuedSynchronizer}.
        *
        * The primary advantage of mimicry is that volatile variables are not used,
        * meaning that the underlying java libraries and java runtime are free to
        * fully and unrestrictedly capitalize upon whatever lock-free / wait-free
        * algorithms and hardware primitives are available (see:
        * http://www.ibm.com/developerworks/java/library/j-jtp11234/).
        *
        * The secondary advantage is that by eliminating the HSQLDB CountUpDownLatch use of an inner CountDownLatch
        * instance that must be discarded and replaced every time the count increments
        * from zero, heap memory object burn rate is reduced.
        *
        *
        * @author Campbell Burnet (campbell-burnet@users dot sourceforge.net)
        */
        public class CountUpDownLatch2 {

        /**
        * Synchronization control For CountUpDownLatch2.
        *
        * Uses AQS state property to represent count.
        */
        private static final class Sync extends AbstractQueuedSynchronizer {
        private static final long serialVersionUID = 1L;

        Sync(int count) {
        setState(count);
        }

        int getCount() {
        return getState();
        }

        /**
        * Queries if the state of this synchronizer permits it to be acquired
        * in the shared mode, and if so to acquire it.
        *
        * @param ignored
        * @return -1 on failure; 1 if acquisition in shared mode succeeded and
        * subsequent shared-mode acquires might also succeed, in which
        * case a subsequent waiting thread must check availability.
        */
        protected int tryAcquireShared(int ignored) {
        return getState() == 0 ? 1 : -1;
        }

        /**
        * Updates the state of this object to the {@code requestedCount} value,
        * returning {@code true} on transition to zero.
        *
        * Note that negative {@code requestedCount} values are silently
        * converted to zero.
        *
        * @param requestedCount the value of the count property to be set
        * @return {@code true} if this release of shared mode may permit a
        * waiting acquire (shared or exclusive) to succeed; and
        * {@code false} otherwise
        */
        protected boolean tryReleaseShared(int requestedCount) {

        final int newCount = Math.max(0, requestedCount);
        final boolean result = (newCount == 0);

        for (;;) {
        if (compareAndSetState(getState(), newCount)) {
        return result;
        }
        }
        }
        }
        private final Sync sync;

        /**
        * Constructs a {@code CountUpDownLatch2} initialized with the given value.
        *
        * @param initialCount the initial value representing the number of times
        * {@link #countDown} must be invoked before threads can pass
        * through {@link #await}
        * @throws IllegalArgumentException if {@code initialCount} is negative
        */
        public CountUpDownLatch2(int initialCount) {
        if (initialCount < 0) {
        throw new IllegalArgumentException("count < 0");
        }
        this.sync = new Sync(initialCount);
        }

        /**
        * Causes the current thread to wait until the latch has counted down to
        * zero, unless the thread is {@linkplain Thread#interrupt interrupted}.
        *
        * If the current count is zero then this method returns immediately.
        *
        * If the current count is greater than zero then the current
        * thread becomes disabled for thread scheduling purposes and lies
        * dormant until one of two things happen:
        *
        * The count reaches zero due to invocations of the
        * {@link #countDown} method; or
        * Some other thread {@linkplain Thread#interrupt interrupts}
        * the current thread.
        *
        *
        * If the current thread:
        *
        * has its interrupted status set on entry to this method; or
        * is {@linkplain Thread#interrupt interrupted} while waiting,
        *
        * then {@link InterruptedException} is thrown and the current thread’s
        * interrupted status is cleared.
        *
        * @throws InterruptedException if the current thread is interrupted
        * while waiting
        */
        public void await() throws InterruptedException {
        sync.acquireSharedInterruptibly(1);
        }

        /**
        * Causes the current thread to wait until the latch has counted down to
        * zero, unless the thread is {@linkplain Thread#interrupt interrupted},
        * or the specified waiting time elapses.
        *
        * If the current count is zero then this method returns immediately
        * with the value {@code true}.
        *
        * If the current count is greater than zero then the current
        * thread becomes disabled for thread scheduling purposes and lies
        * dormant until one of three things happen:
        *
        * The count reaches zero due to invocations of the
        * {@link #countDown} method; or
        * Some other thread {@linkplain Thread#interrupt interrupts}
        * the current thread; or
        * The specified waiting time elapses.
        *
        *
        * If the count reaches zero then the method returns with the
        * value {@code true}.
        *
        * If the current thread:
        *
        * has its interrupted status set on entry to this method; or
        * is {@linkplain Thread#interrupt interrupted} while waiting,
        *
        * then {@link InterruptedException} is thrown and the current thread’s
        * interrupted status is cleared.
        *
        * If the specified waiting time elapses then the value {@code false}
        * is returned. If the time is less than or equal to zero, the method
        * will not wait at all.
        *
        * @param timeout the maximum time to wait
        * @param unit the time unit of the {@code timeout} argument
        * @return {@code true} if the count reached zero and {@code false}
        * if the waiting time elapsed before the count reached zero
        * @throws InterruptedException if the current thread is interrupted
        * while waiting
        */
        public boolean await(long timeout, TimeUnit unit)
        throws InterruptedException {
        return sync.tryAcquireSharedNanos(1, unit.toNanos(timeout));
        }

        /**
        * Decrements the count of the latch, releasing all waiting threads if
        * the count reaches zero.
        *
        * If the current count is greater than zero then it is decremented.
        * If the new count is zero then all waiting threads are re-enabled for
        * thread scheduling purposes.
        *
        * If the current count equals zero then nothing happens.
        */
        public void countUp() {
        sync.releaseShared(sync.getCount() + 1);
        }

        /**
        * Decrements the count of the latch, releasing all waiting threads if
        * the count reaches zero.
        *
        * If the current count is greater than zero then it is decremented.
        * If the new count is zero then all waiting threads are re-enabled for
        * thread scheduling purposes.
        *
        * If the current count equals zero then nothing happens.
        */
        public void countDown() {
        sync.releaseShared(sync.getCount() – 1);
        }

        /**
        * Returns the current count.
        *
        * This method is typically used only for debugging and testing purposes.
        *
        * @return the current count
        */
        public long getCount() {
        return sync.getCount();
        }

        public void setCount(int newCount) {
        sync.releaseShared(newCount);
        }

        /**
        * Returns a string identifying this latch, as well as its state.
        * The state, in brackets, includes the String {@code “Count =”}
        * followed by the current count.
        *
        * @return a string identifying this latch, as well as its state
        */
        public String toString() {
        return super.toString() + “[Count = " + sync.getCount() + "]“;
        }
        }

  3. Patrick Says:

    In the solution that I presented volatile is not necessary, the lock ensures that the changes are visible to all threads, the same way a synchronized block would.

    The solution you proposed with volatile will likely work if a single thread is used to count up, count down and set the count, and any number of threads calling await.

    — Patrick

    • designbygravity Says:

      I think, based on the JLS (links above), you are incorrect. ReentrantLock (and friends) only guarantee locking — they explicitly do not guarantee a consistent memory picture for variable which is accessed by two threads, even if the access is done under an exclusive lock.

      The solution I proposed works *only for HSQLDB*, because every access to every method in the single CountUpDownLatch which is allocated is done inside ReentrantLocks, exactly as you have it above. However, simply because they happen under a lock does not guarantee a consistent memory picture. To be more precise:

      T1 locks with an external ReentrantLock
      T1 reads the value of count, count=3
      T1 increments count, count=4
      T1 releases lock
      T2 locks
      T2 reads the value of count, count=4
      T2 increments count, count=5
      T2 releases lock
      T1 locks
      T1 “reads” the value of count, but since it is non-volatile, it uses the value cached in it’s registers.
      So, count=4 is what T1 sees
      T1 increments count, count=5. AGAIN.
      T1 releases lock

      And now count has the wrong value.

      Look at the spec; that caching is what is allowed to happen with non-volatile vars. It is subtler than that, because any instance field which is volatile will force all of the same instance fields to have their cache value flushed.

      Prior to 1.5, volatile did not work this way on 64-bit values, and this is why Double Checked Locking did not work perfectly under Java. See: http://en.wikipedia.org/wiki/Double-checked_locking#Usage_in_Java for a useful explanation, and why volatile was necessary to fix it.

      • Cristian Malinescu Says:

        This is for sure a VM bug or if not bug a very weird optimization design decision which can be translated as a design bug. However, looking at the behavior of the variable when used as volatile tells me this is a bug. Naturally, when a thread writes a variable after releasing the lock the new value of the variable has to be visible to the next thread acquiring the lock so T1 should have been read 5. What happened is that the VM has missed the acquire/release cycles of the two threads and as of a weird optimization logic has seen T1 as requiring a lock previously acquired by itself – the T2 acquire didn’t recorded, so instead of updating the T1 local cache it returned the old cache value – BANG. This didn’t happen with volatile because, oh well, volatile IS about not using the thread local cache optimization algorithm :). At the end, it proves once again that stateful code/Turing design is always prone to errors even at the most subtle and discrete levels like the VM itself


  4. [...] This post was mentioned on Twitter by Chris Schanck, HN Firehose. HN Firehose said: A Volatile Look into HSQLDB: http://bit.ly/hMEDUL [...]

  5. Joel Says:

    The javadoc for the java.util.concurrent.locks.Lock interface (http://download.oracle.com/javase/6/docs/api/java/util/concurrent/locks/Lock.html) indicates that implementations of the Lock interface should have the same member synchronization semantics as the built-in monitor lock. Thus I think externally synchronizing the the CountUpDownLatch instance with a ReentrantLock should work. Perhaps there’s a bug in the JRE that you’re using?

    • designbygravity Says:

      Interesting; I did not know that. However, the explanation is simpler — the await() call appears to be used outside of any locking, meaning volatile does come into play.


  6. [...] A Volatile Look At HSQLDB (designbygravity.wordpress.com) [...]

  7. J Yu Says:

    memory visibility shouldn’t be the problem. the original code contains obvious deadlock, once latch.await() is entered, no other thread can wake it up, because the outter lock is held by the await thread. it is a mystery, like most concurrent issues, why after adding volatile, latch.await() is never reached. in any case, the original code is too bad, it should be rewritten.

  8. bnode Says:

    Looking at the HSQLDB 2.0 source, it isn’t clear to me that the calls to latch.await() are made under the reentrant writeLock. That would explain why volatile would solve the problem even if Joel is correct (which I believe he is) and avoid the deadlock case that J Yu points out (which was perplexing me as well).

    • designbygravity Says:

      I think that is the root accidental reason why volatile seemingly solves it — the usage is mixed between mostly being accessed under an exclusive lock vs. not.

      In general, I hate the class, because it purports to be a concurrent class, and as it is written it is at best a concurrency “supporting” package. It is in no way concurrent as it stands.

      I suspect the usage (from further looking at the source) is for all the modification calls to be made under exclusive lock, which the await() is not, which works, if it is volatile. But it seems lousy programming practice.


  9. [...] A Volatile Look At HSQLDB (designbygravity.wordpress.com) [...]


  10. Looking at the latest trunk (r4141), it is still riddled with race-conditions and wtfs.

  11. Solegaonkar Says:

    Eager to know if HSQLDB passed your test? Actually I am looking around trying to choose between traditional MySQL and HSQL db for an application. It is not data intensive – might accumulate ~50 mb per month. But, expecting a lot of read transactions every morning. Your article does provide a lot of clarity on this topic. Thank you!

  12. Fred Toussi Says:

    The reported issue was fixed in the first quarter of 2011. HSQLDB uses locks and latches to perform the highly complex orchestration of multithreaded MVCC transactions. More routine multithreaded tests have since been introduced and no problems have been reported this year.


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s


Follow

Get every new post delivered to your Inbox.

%d bloggers like this: