R – Concurrently read from a single file


I have the following problem. A bunch of data is split across 10k small files (approx. 8-16 KiB each). Depending on user input, I have to load these as fast as possible and process them. More precisely, each data packet can be split into 100-100k files, and there are approximately 1k data packets. Most of them are the smaller ones, though.

Right now, I'm using a thread pool: on each file access, the next free thread opens the file, reads it, and returns the data prepared for displaying. As the number of files is going to grow in the future, I'm not so happy with this approach, especially if it's likely to end up with something around 100k or more files (deploying this would surely be fun 😉).

So, the idea is to combine all of these tiny files for one data packet into a large one, and read from that. I can guarantee that it's going to be read-only, but I don't know the number of threads that will be accessing one file concurrently up front (I do know the maximum number). This would give me around 1000 good sized files, and I can easily add new data packets.
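To make the combined-file idea concrete, here is a minimal sketch of one possible layout: the small files are concatenated into a single packet file, and a separate index records where each one starts and how long it is. The names `pack_files`, `save_index` and `load_index`, the JSON index format, and the use of the base file name as the key are all assumptions for illustration, not something the question specifies.

```python
import json
from pathlib import Path


def pack_files(sources, packet_path):
    """Concatenate small files into one packet file.

    Returns an index mapping each file's base name to (offset, length).
    Assumes base names are unique within one packet.
    """
    index = {}
    with open(packet_path, "wb") as out:
        for src in sources:
            data = Path(src).read_bytes()
            # Record where this file starts in the packet and how long it is.
            index[Path(src).name] = (out.tell(), len(data))
            out.write(data)
    return index


def save_index(index, index_path):
    """Store the index next to the packet (JSON is just a convenient choice)."""
    Path(index_path).write_text(json.dumps(index))


def load_index(index_path):
    """Load the index back, restoring (offset, length) tuples."""
    raw = json.loads(Path(index_path).read_text())
    return {name: tuple(entry) for name, entry in raw.items()}
```

Since the packets are read-only once written, the index can be loaded once per packet and shared by all threads without locking.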

The question is: How can I allow 1..N threads to read efficiently from a single file in this scenario? I can use asynchronous I/O on Windows, but it's supposed to become synchronous for reads smaller than 64k. Memory mapping the file is not an option, as the expected size is > 1.6 GiB, and I still need to be able to run on x86 (unless I can efficiently map some tiny part, read it, unmap it again — my experience with memory mapping was that it brings quite some overhead compared to a single read).
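For reference, mapping only a small window of a large file looks roughly like the sketch below, using Python's `mmap` module. The helper name `read_via_window` is made up for this example. The offset passed to `mmap` has to be a multiple of `mmap.ALLOCATIONGRANULARITY`, so the window starts at the aligned offset just below the requested one. This says nothing about whether the map/unmap overhead is acceptable for 8-16 KiB reads; that would have to be measured.

```python
import mmap


def read_via_window(path, offset, length):
    """Map only the part of the file that covers [offset, offset + length).

    Assumes the requested range lies entirely inside the file; mmap raises
    ValueError otherwise.
    """
    granularity = mmap.ALLOCATIONGRANULARITY
    # Round the start of the mapping down to an allowed offset.
    aligned = (offset // granularity) * granularity
    delta = offset - aligned
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), delta + length,
                       access=mmap.ACCESS_READ, offset=aligned) as window:
            # Slicing copies the bytes out before the window is unmapped.
            return window[delta:delta + length]
```

Because only `delta + length` bytes are mapped at a time, the address-space limit of a 32-bit process is not an issue with this approach.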

I thought about opening each of the data packets N times and giving each thread a handle in a round-robin fashion, but the problem is that this can end up with (number of data files) x (maximum number of threads) open handles (easily 8-16k), and I would have to synchronize on each access to a data packet, or use some lock-free magic, to get the next free file handle.

This does not seem to be an original problem: I guess any database engine has a similar one, where you can have M tables (data packets) with N rows (files in my case) and you want to allow as many threads as possible to read rows concurrently. So what's the recommended practice here? BTW, it should run on Windows and Linux, so portable approaches are welcome (or at least approaches which work on both platforms, even if they use different underlying APIs; as long as they can be wrapped, I'm happy).

[EDIT] This is not about the speed, this is about hiding the latency. That is, I read maybe 100 of those tiny files per second, so I'm at 1 MiB/s at most. My main concern is the seek times (as my access pattern is not predictable), and I want to hide them by firing off the reading while displaying the old data to the user. The question is how to allow several threads to issue I/O requests over several files, with possibly more than one thread accessing a single file.

It's really no problem if one of the calls takes 70 ms or so to finish, but I can't afford to have the read call block.
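One way to get this behaviour is a positional read, which takes the offset as an argument instead of using the handle's shared file position, so several threads can share one handle per packet without locking. The sketch below uses `os.pread`, which is only available on POSIX systems; on Windows the comparable mechanism would be `ReadFile` with the offset supplied in an `OVERLAPPED` structure, which is not shown here. The class name `PacketReader` and its methods are hypothetical.

```python
import os
from concurrent.futures import ThreadPoolExecutor


class PacketReader:
    """One shared read-only descriptor per packet plus a small thread pool."""

    def __init__(self, path, max_workers=4):
        self._fd = os.open(path, os.O_RDONLY)
        self._pool = ThreadPoolExecutor(max_workers=max_workers)

    def read_async(self, offset, length):
        """Queue a read and return a Future immediately; the caller never blocks."""
        return self._pool.submit(self._read_exact, offset, length)

    def _read_exact(self, offset, length):
        # pread does not touch the descriptor's file position, so concurrent
        # calls on the same descriptor do not interfere with each other.
        chunks = []
        while length > 0:
            chunk = os.pread(self._fd, length, offset)
            if not chunk:
                break  # reached end of file
            chunks.append(chunk)
            offset += len(chunk)
            length -= len(chunk)
        return b"".join(chunks)

    def close(self):
        self._pool.shutdown(wait=True)
        os.close(self._fd)
```

The display code can keep showing the old data and either poll `future.done()` or attach a callback with `future.add_done_callback(...)`. The number of open handles stays at one per packet, regardless of how many threads are reading.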

Best Solution

I don't think that multi-threading will help you very much with the disk reads. Assuming the file is on one disk platter, you have only one set of read heads to access it with, so you are serialised right there.

In this situation I think I would have one disk read process that reads the file sequentially into buffers (this would hopefully maximise read performance, as the read heads would not need to move about too much, assuming a fairly unfragmented data file) and a number of processing threads that read the buffers, marking them as free when they have completed the processing.
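A rough sketch of that arrangement, using a reader thread rather than a separate process, might look like this. A semaphore stands in for the fixed pool of buffers: the reader takes a free slot before each read, and a processing thread gives the slot back once it has finished with the chunk. The function name `process_file` and its parameters are invented for this example, and error handling is kept to a minimum.

```python
import queue
import threading


def process_file(path, process, num_workers=4, num_buffers=8, chunk_size=64 * 1024):
    """Read `path` sequentially in one thread; process chunks in worker threads.

    Returns the results of `process(chunk)` in file order.
    """
    free = threading.Semaphore(num_buffers)  # number of free buffers
    filled = queue.Queue()                   # buffers waiting to be processed
    results = {}
    lock = threading.Lock()

    def reader():
        with open(path, "rb") as f:
            seq = 0
            while True:
                free.acquire()               # wait for a free buffer
                chunk = f.read(chunk_size)
                if not chunk:
                    free.release()
                    break
                filled.put((seq, chunk))
                seq += 1
        for _ in range(num_workers):
            filled.put(None)                 # one stop marker per worker

    def worker():
        while True:
            item = filled.get()
            if item is None:
                return
            seq, chunk = item
            try:
                out = process(chunk)
                with lock:
                    results[seq] = out
            finally:
                free.release()               # mark the buffer as free again

    threads = [threading.Thread(target=reader)]
    threads += [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [results[i] for i in sorted(results)]
```

With `num_buffers` fixed, memory use stays bounded even if the processing threads fall behind the reader.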

However you choose to proceed, can I suggest that you make sure that your code is structured in such a way that the number of different types of threads is easily configurable, ideally from the executable's command line. In situations like this, you will want to experiment with different thread configurations to find the optimal numbers for your specific situation.
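As an illustration of keeping the thread counts configurable, the sketch below reads them from the command line with `argparse`. The flag names (`--reader-threads`, `--worker-threads`, `--buffers`) and their defaults are placeholders chosen for this example, not recommended values.

```python
import argparse


def parse_thread_config(argv=None):
    """Parse thread and buffer counts; argv=None means use sys.argv."""
    parser = argparse.ArgumentParser(
        description="Thread configuration for the packet loader (example)")
    parser.add_argument("--reader-threads", type=int, default=1,
                        help="number of threads issuing disk reads")
    parser.add_argument("--worker-threads", type=int, default=4,
                        help="number of threads processing loaded data")
    parser.add_argument("--buffers", type=int, default=8,
                        help="number of read buffers shared between them")
    return parser.parse_args(argv)
```

For example, `parse_thread_config(["--reader-threads", "2"])` returns a namespace with `reader_threads == 2` and the other values at their defaults, which makes it easy to script a series of benchmark runs with different settings.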