CPU-intensive Java application stalls/hangs when increasing the number of workers. Where is the bottleneck, and how can I deduce/monitor it on an Ubuntu server?


I'm running a nightly CPU-intensive Java application on an EC2 server (c1.xlarge) which has eight cores and 7.5 GB RAM (running 64-bit Linux / Ubuntu 9.10 (Karmic Koala)).

The application is architected in such a way that a variable number of workers are constructed (each in their own thread) and fetch messages from a queue to process them.

Throughput is the main concern here, and performance is measured in processed messages per second. The application is not RAM-bound, and as far as I can see not I/O-bound either (although I'm no Linux expert; I'm using dstat, which shows low I/O load and almost non-existent CPU wait).

I'm seeing the following when spawning different numbers of workers (worker threads):

  • 1 worker: throughput 1.3 messages/sec/worker

  • 2 workers: throughput ~0.8 messages/sec/worker

  • 3 workers: throughput ~0.5 messages/sec/worker

  • 4 workers: throughput ~0.05 messages/sec/worker

I was expecting a near-linear increase in throughput, but reality proves otherwise.

Three questions:

  1. What might be causing the sub-linear scaling going from one worker to two, and from two workers to three?

  2. What might be causing the (almost) complete halt when going from three workers to four? It looks like some kind of deadlock situation. (Can this happen due to heavy context switching?)

  3. How would I start measuring where the problems occur? My development box has two CPUs and runs Windows. I normally attach a GUI profiler and check for threading issues, but the problem only really starts to manifest itself with more than two threads.

Some more background information:

  • Workers are spawned using an Executors.newScheduledThreadPool

  • A worker thread does calculations based on the message (CPU-intensive). Each worker thread contains a separate persistQueue used for offloading writes to disk (thus making use of CPU / I/O concurrency):

    persistQueue = new ThreadPoolExecutor(1, 1, 100, TimeUnit.MILLISECONDS, new ArrayBlockingQueue<Runnable>(maxAsyncQueueSize), new ThreadPoolExecutor.AbortPolicy());

The flow (per worker) goes like this:

  1. The worker thread puts the result of a message in persistQueue and gets on with processing the next message.

  2. The ThreadPoolExecutor (of which we have one per worker thread) contains only one thread, which processes all incoming data (waiting in persistQueue) and writes it to disk (Berkeley DB + Apache Lucene).

  3. The idea is that 1. and 2. can run concurrently for the most part, since 1. is CPU-heavy and 2. is I/O-heavy.

  4. It's possible for persistQueue to become full. The queue is bounded because otherwise a slow I/O system could flood the queues and cause OOM errors (yes, it's a lot of data). In that case the worker thread pauses until it can write its content to persistQueue. A full queue hasn't happened yet on this setup (which is another reason I think the application is definitely not I/O-bound).
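The flow above can be sketched as follows. This is a hypothetical reconstruction, not the actual code: the class name, the `MAX_ASYNC_QUEUE_SIZE` value, the `compute`/`persist` bodies, and the `persisted` counter are all placeholders for illustration; only the `ThreadPoolExecutor` construction mirrors the snippet given in the question.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the per-worker flow described above: a CPU-heavy
// worker hands each result to its own single-threaded, bounded executor,
// which does the I/O-heavy persistence concurrently.
public class WorkerSketch {
    private static final int MAX_ASYNC_QUEUE_SIZE = 1000; // assumed value

    // One single-threaded executor per worker; the bounded queue applies
    // back-pressure instead of letting pending results pile up (OOM risk).
    private final ThreadPoolExecutor persistQueue = new ThreadPoolExecutor(
            1, 1, 100, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<Runnable>(MAX_ASYNC_QUEUE_SIZE),
            new ThreadPoolExecutor.AbortPolicy());

    final AtomicInteger persisted = new AtomicInteger(); // for illustration only

    public void process(String message) {
        String result = compute(message);            // step 1: CPU-heavy
        persistQueue.execute(() -> persist(result)); // step 2: I/O-heavy, async
    }

    private String compute(String message) {
        return message.toUpperCase(); // stand-in for the real calculation
    }

    private void persist(String result) {
        persisted.incrementAndGet(); // stand-in for Berkeley DB + Lucene writes
    }

    public boolean shutdown() throws InterruptedException {
        persistQueue.shutdown();
        return persistQueue.awaitTermination(10, TimeUnit.SECONDS);
    }
}
```

Note that with `AbortPolicy` a full queue makes `execute` throw `RejectedExecutionException`; `CallerRunsPolicy` would instead make the worker do the write itself, which matches the "worker pauses" behavior the question describes.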

Some final information:

  • Workers are isolated from each other concerning their data, except:

    • They share some heavily used static final maps (used as caches; the maps are memory-intensive, so I can't keep them local to a worker even if I wanted to). The operations workers perform on these caches are iterations, lookups, and contains checks (no writes, deletes, etc.)

    • These shared maps are accessed without synchronization (there's no need, right?)

    • Workers populate their local data by selecting data from MySQL (based on keys in the received message), so this is a potential bottleneck. However, most of the queries are reads, the queried tables are optimized with indexes, and, again, the system isn't I/O-bound.

    • I have to admit that I haven't done much MySQL server optimizing yet (in terms of config params), but I just don't think that is the problem.

  • Output is written to:

    • Berkeley DB (using a memcached(b) client). All workers share one server.
    • Lucene (using a home-grown low-level indexer). Each worker has a separate indexer.
  • Even when output writing is disabled, the problems occur.
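On the "no synchronization needed, right?" point: unsynchronized reads of a shared map are only safe if the map is fully populated before its reference is published to the worker threads and is never mutated afterwards. A `static final` field initialized in a static initializer gives exactly that guarantee. A minimal sketch, with hypothetical class and key names (the question doesn't show the real cache code):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a read-only cache shared by all workers.
// The static initializer populates the map single-threaded, and the
// static final field safely publishes it; after that, concurrent
// unsynchronized reads (get/contains/iteration) are safe.
public class SharedCaches {
    public static final Map<String, String> CACHE;

    static {
        Map<String, String> m = new HashMap<>();
        m.put("key1", "value1"); // population happens once, before publication
        m.put("key2", "value2");
        // Wrapping makes accidental later writes fail fast with
        // UnsupportedOperationException instead of corrupting the map.
        CACHE = Collections.unmodifiableMap(m);
    }
}
```

If the caches were ever written to after startup, unsynchronized access would no longer be safe and could itself cause stalls (e.g. an infinite loop inside a concurrently resized `HashMap`), so it's worth double-checking that assumption.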

This is a huge post, I realize that, but I hope you can give me some pointers as to what this might be, or how to start monitoring/deducing where the problem lies.

Best Solution

If I were you, I wouldn't put much faith in anybody's guesswork about what the problem is. I hate to sound like a broken record, but there's a very simple way to find out: stackshots. For example, in your 4-worker case that runs 20 times slower, every time you take a sample of a worker's call stack, the probability is 19/20 that it will be in the stalled state, and you can see why just by examining the stack.
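You can take stackshots externally with `jstack <pid>` (or `kill -3 <pid>`, which dumps all thread stacks to stdout) on the Ubuntu box, or programmatically from inside the JVM. A minimal in-process sketch (hypothetical class name; frame-count cutoff is arbitrary), assuming you run it from a sampler thread while the 4-worker run is stalled:

```java
import java.util.Map;

// In-process "stackshot": dump every thread's state and its top stack
// frames. Frames (and BLOCKED/WAITING states) that show up in most
// samples point straight at the bottleneck.
public class StackShot {
    public static String sample() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> e
                : Thread.getAllStackTraces().entrySet()) {
            Thread t = e.getKey();
            sb.append(t.getName()).append(" [").append(t.getState()).append("]\n");
            StackTraceElement[] frames = e.getValue();
            for (int i = 0; i < Math.min(5, frames.length); i++) {
                sb.append("    at ").append(frames[i]).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(sample());
    }
}
```

Take several samples a few seconds apart: if the worker threads are repeatedly BLOCKED on the same monitor or parked in the same lock, that stack is your answer.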
