Parallel MATLAB and logging

logging, matlab, parallel-processing

I am running an experiment distributed over several computers using the Parallel computing toolbox. I want to be able to produce a log of the progress of the experiment (or of any error occurring) and save this info in a file while the processes are running. What is the standard way to do it?

EDIT:

  1. My computation is embarrassingly parallel
  2. I want only one file for all the workers (I have a network drive that can be accessed from all the machines)

My main concern is having a file opened for append by several workers. Do I risk losing messages, or having an error opening the file?

Best Solution

When multiple processes write to a single file, you can run into problems such as messages being overwritten or interleaved. I've had this happen with programs in other languages (like C), and I assume the same problem could arise in MATLAB, but I freely admit I could be wrong about this. Assuming I'm not wrong...

If you want to reliably output data from multiple worker processes to a single log file while the processes are running, one way to do this is to make a single process responsible for all the file operations (i.e. a "master" process). The "master" process collects messages from the other workers (i.e. "slaves") and writes them to the log file.

Since I don't know specifically what each process is doing, it's hard to suggest exact code changes. Here are some steps and sample code for how you might do this in MATLAB. These code samples assume you are running the same function (process_fcn) on each process:

  • The "master" process first has to open the file. This code (using the labindex function) should be run at the beginning of process_fcn:

    if (labindex == 1),
      fid = fopen('log.txt','at');  %# Open text file for appending
    end
    
  • While each process is running, you can collect any data that needs to be output to the log file in a variable called data, which stores a string or character array. This data could be error messages captured within a try-catch block or any other data that you want in the log file (see the sketch just below).
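
    For example, here's a minimal sketch of how a worker might fill data inside its computation; do_some_work and taskIndex are hypothetical placeholders for whatever your experiment actually does:

    data = '';  %# No log message yet
    try
      do_some_work(taskIndex);  %# Hypothetical unit of work
      data = sprintf('Worker %d finished task %d',labindex,taskIndex);
    catch err
      data = sprintf('Worker %d error: %s',labindex,err.message);  %# Capture the error text
    end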

  • At periodic points in process_fcn (either when major tasks are completed or within a loop of computation), each process should check whether it has data to output (i.e. data is not empty) and, if so, send that data to the "master" process. The "master" process then collects and prints these messages from the other processes, along with any of its own. Here's a sample of how this might be done (using the functions labBarrier, labProbe, labSend, and labReceive):

    labBarrier;  %# All processes are synchronized here
    if (labindex == 1),  %# This is done by the "master"
      if ~isempty(data),
        fprintf(fid,'%s\n',data);  %# Print "master" data
      end
      pause(1);  %# Wait a moment for "slaves" to send messages
      while labProbe,  %# Loop while messages are available
        data = labReceive;  %# Get data from "slaves"
        fprintf(fid,'%s\n',data);
      end
    else  %# This is done by the "slaves"
      if ~isempty(data),
        labSend(data,1);  %# Send data to the "master"
      end
    end
    data = '';  %# Clear data
    

    The call to PAUSE is there to ensure that the calls to labSend for each "slave" process occur before the "master" starts looking for sent messages.

  • Finally, the "master" process has to close the file. This code should be run at the end of process_fcn:

    if (labindex == 1),
      fclose(fid);
    end
    
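Putting the pieces together, here's a rough sketch of what process_fcn might look like and one way you might launch it with spmd, assuming a pool of workers is already open; do_some_work and numTasks are hypothetical placeholders for your actual experiment:

    %# --- process_fcn.m ---
    function process_fcn(numTasks)
      if (labindex == 1),
        fid = fopen('log.txt','at');   %# "Master" opens the shared log file
      end
      for taskIndex = 1:numTasks
        try
          do_some_work(taskIndex);     %# Hypothetical unit of work
          data = sprintf('Worker %d finished task %d',labindex,taskIndex);
        catch err
          data = sprintf('Worker %d error: %s',labindex,err.message);
        end
        labBarrier;                    %# Synchronize before logging
        if (labindex == 1),            %# "Master" writes its own and received messages
          if ~isempty(data), fprintf(fid,'%s\n',data); end
          pause(1);                    %# Give the "slaves" time to send
          while labProbe,
            fprintf(fid,'%s\n',labReceive);
          end
        else                           %# "Slaves" send their messages to the "master"
          if ~isempty(data), labSend(data,1); end
        end
      end
      if (labindex == 1),
        fclose(fid);                   %# "Master" closes the log file
      end
    end

    %# --- Launching it (with a pool of workers already open) ---
    spmd
      process_fcn(10);   %# Every worker runs the same function
    end

The labBarrier at the start of each logging step keeps the workers in step, so the "master" only starts looking for messages once every "slave" has reached the same point.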