Perl script to search pattern and concat lines in a file

perl regex string text

I have a text file (basically an error log with date, timestamp and some data) in the following pattern:

mm/dd/yy 12:00:00:0001  
This is line 1
This is line 2

mm/dd/yy 12:00:00:0004  
This is line 3
This is line 4
This is line 5


mm/dd/yy 12:00:00:0004
This is line 6
This is line 7

I'm new to Perl and need to write a script that searches the file for timestamps and merges the entries that share the same timestamp.

I'm expecting the following output for the above sample.

mm/dd/yy 12:00:00:0001  
This is line 1
This is line 2

mm/dd/yy 12:00:00:0004  
This is line 3
This is line 4
This is line 5
This is line 6
This is line 7

What's the best way to get this done?

Best Solution

I've had to do this task before on some very large files where the timestamps did not come in order, and I didn't want to store it all in memory. I accomplished the task with a three-pass solution:

  • Tag each input line with its timestamp and save in temp file
  • Sort the temp file with a fast sorter, like sort(1)
  • Turn the sorted file back into the starting format

This was fast enough for my task, since I could let it run while I went for a cup of coffee, but you might need something fancier if you need the results really quickly.

use strict;
use warnings;
use File::Temp qw(tempfile);

my( $temp_fh, $temp_filename )  = tempfile( UNLINK => 1 );

# read each line, tag with timestamp, and write to temp file
# will sort and undo later.
my $current_timestamp = '';
LINE: while( <DATA> )
    {
    chomp;

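    # timestamp line; the sample input has trailing blanks on some of these,
    # so allow them but keep them out of the stored timestamp
    if( m|^(\d\d/\d\d/\d\d \d\d:\d\d:\d\d:\d\d\d\d)\s*$| )
        {
        $current_timestamp = $1;
        next LINE;
        }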
    elsif( m|\S| ) # line with non-whitespace (not a "blank line")
        {
        print $temp_fh "[$current_timestamp] $_\n";
        }
    else # blank lines
        {
        next LINE;
        }
    }

close $temp_fh;

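# sort the file by lines using some very fast sorter
# note: this compares whole lines, so lines sharing a timestamp
# also come out in text order
system( "sort", qw(-o sorted.txt), $temp_filename ) == 0
    or die "sort failed: $?";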

# read the sorted file and turn back into starting format
open my($in), "<", 'sorted.txt' or die "Could not read sorted.txt: $!";

$current_timestamp = '';
while( <$in> )
    {
    my( $timestamp, $line ) = m/\[(.*?)] (.*)/;
    if( $timestamp ne $current_timestamp )
        {
        $current_timestamp = $timestamp;
        print $/, $timestamp, $/;
        }

    print $line, $/;
    }

close $in;

# the variable declared above is $temp_filename; $temp_file fails under strict
unlink $temp_filename, 'sorted.txt';

__END__
01/01/70 12:00:00:0004
This is line 3
This is line 4
This is line 5

01/01/70 12:00:00:0001
This is line 1
This is line 2


01/01/70 12:00:00:0004
This is line 6
This is line 7
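
For comparison, when the log is small enough to fit in memory, the temp file and the external sort are not strictly needed. The following is only a minimal sketch of that alternative, not the approach used above. The name merge_by_timestamp is hypothetical, and it assumes the same timestamp format as the script. Lines that appear before any timestamp are silently dropped.

```perl
use strict;
use warnings;

# Hypothetical helper: group log lines under their timestamp in memory.
# Takes the lines of the log and returns the merged text as one string.
sub merge_by_timestamp
    {
    my @lines = @_;

    my %lines_for;  # timestamp => array ref of its data lines
    my @order;      # timestamps in the order they were first seen
    my $current;    # undef until the first timestamp line shows up

    foreach my $line ( @lines )
        {
        $line =~ s/\s+\z//;  # drop the newline and any trailing blanks

        if( $line =~ m|^\d\d/\d\d/\d\d \d\d:\d\d:\d\d:\d\d\d\d$| )
            {
            $current = $line;
            unless( exists $lines_for{$current} )
                {
                push @order, $current;
                $lines_for{$current} = [];
                }
            }
        elsif( defined $current and $line =~ m|\S| )
            {
            push @{ $lines_for{$current} }, $line;
            }
        # blank lines, and anything before the first timestamp, are skipped
        }

    # one block per timestamp, with a blank line between blocks
    return join "\n",
        map { join( "\n", $_, @{ $lines_for{$_} } ) . "\n" } @order;
    }
```

Calling it as print merge_by_timestamp( <$fh> ) on an opened filehandle prints the merged log. Blocks come out in first-seen order, and lines within a block keep their original order. Replacing @order with sort @order in the last statement gives the same string ordering of timestamps that sort(1) produces in the script above.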