C# – Extreme performance difference when using DataTable.Add

.netc++performance

Take a look at the program below. It's pretty self-explanatory, but I'll explain anyway 🙂

I have two methods, one fast and one slow. These methods do the exact same thing: they create a table with 50,000 rows and 1000 columns. I write to a variable number of columns in the table. In the code below I've picked 10 (NUM_COLS_TO_WRITE_TO).

In other words, only 10 columns out of the 1000 will actually contain data. OK. The only difference between the two methods is that the fast populates the columns and then calls DataTable.AddRow, whereas the slow one does it after. That's it.

The performance difference however is shocking (to me anyway). The fast version is almost completely unaffected by changing the number of columns we write to, whereas the slow one goes up linearly. For example, when the number of columns I write to is 20, the fast version takes 2.8 seconds, but the slow version takes over a minute.

What in the world could possibly be going on here?

I thought that maybe adding dt.BeginLoadData would make a difference, and it did to some extent, it brought the time down from 61 seconds to ~50 seconds, but that's still a huge difference.

Of course, the obvious answer is, "Well, don't do it that way." OK. Sure. But what in world is causing this? Is this expected behavior? I sure didn't expect it. 🙂

public class Program
{
    private const int NUM_ROWS = 50000;
    private const int NUM_COLS_TO_WRITE_TO = 10;
    private const int NUM_COLS_TO_CREATE = 1000;

    private static void AddRowFast() {
        DataTable dt = new DataTable();            
        //add a table with 1000 columns
        for (int i = 0; i < NUM_COLS_TO_CREATE; i++) {
            dt.Columns.Add("x" + i, typeof(string));
        }
        for (int i = 0; i < NUM_ROWS; i++) {                
            var theRow = dt.NewRow();
            for (int j = 0; j < NUM_COLS_TO_WRITE_TO; j++) {
                theRow[j] = "whatever";
            }

            //add the row *after* populating it
            dt.Rows.Add(theRow);                
        }
    }

    private static void AddRowSlow() {
        DataTable dt = new DataTable();
        //add a table with 1000 columns
        for (int i = 0; i < NUM_COLS_TO_CREATE; i++) {
            dt.Columns.Add("x" + i, typeof(string));
        }
        for (int i = 0; i < NUM_ROWS; i++) {
            var theRow = dt.NewRow();
            //add the row *before* populating it
            dt.Rows.Add(theRow);

            for (int j=0; j< NUM_COLS_TO_WRITE_TO; j++){
                theRow[j] = "whatever";
            }                
        }
    }

    static void Main(string[] args)
    {
        var sw = Stopwatch.StartNew();
        AddRowFast();
        sw.Stop();
        Console.WriteLine(sw.Elapsed.TotalMilliseconds);

        sw.Restart();
        AddRowSlow();
        sw.Stop();
        Console.WriteLine(sw.Elapsed.TotalMilliseconds);

        //When NUM_COLS is 5
        //FAST: 2754.6782
        //SLOW: 15794.1378

        //When NUM_COLS is 10
        //FAST: 2777.431  ms
        //SLOW 32004.7203 ms

        //When NUM_COLS is 20
        //FAST:  2831.1733 ms
        //SLOW: 61246.2243 ms
    }
}

Update

Calling theRow.BeginEdit and theRow.EndEdit in the slow version makes the slow version more or less constant (~4 seconds on my machine). If I actually had some constraints on the table, I guess this might make sense to me.

Best Solution

When attached to table, much more work is being done to record and track the state on every change.

For example, if you do this,

theRow.BeginEdit();

for (int j = 0; j < NUM_COLS_TO_WRITE_TO; j++)
{
   theRow[j] = "whatever";
}

theRow.CancelEdit();

Then in BeginEdit() , internally it's taking a copy of the contents of the row, so that at any point, you can rollback - and the end result of the above is an empty row again without whatever. This is still possible, even when in BeginLoadData mode. Following the path ofBeginEdit if attached to a DataTable, eventually you get into DataTable.NewRecord() which shows that it is just copying each value for every column to store the original state incase a cancel is needed - not much magic here. On the other hand, if not attached to a datatable, not much is happening in BeginEdit at all and it is exited quickly.

EndEdit() is similarly pretty heavy (when attached), as here all the constraints etc are checked (max length, does column allow nulls etc). Also it fires a bunch of events, explictly frees storage used incase the edit was cancelled, and makes available for recall with DataTable.GetChanges(), which is still possible in BeginLoadData. Infact looking at source all BeginLoadData seems to do is turn off constraint checking and indexing.

So that describes what BeginEdit and EditEdit does, and they are completely different when attached or not attached in terms of what is stored. Now consider that a single theRow[j] = "whatever" you can see on the indexer setter for DataRow it calls BeginEditInternal and then EditEdit on every single call (if it is not already in an edit because you explicitly called BeginEdit earlier). So that means it's copying and storing every single value for each column in the row, every time you do this call. So you're doing it 10 times, that means with your 1,000 column DataTable, over 50,000 rows, it means it is allocating 500,000,000 objects. On top of this there is all the other versioning, checks and events being fired after every single change, and so, overall, it's much slower when the row is attached to a DataTable than when not.