ETL with Entity Framework 6.0

In this post I’m going to walk step-by-step through a number of performance improvements to a recent ETL process I worked on. The scenario here is loading folder metadata into a reporting database. Note that the use of EF or any ORM will add some overhead when compared to pure ADO.NET, although its ease of use is clearly beneficial.

Process overview

The ETL process works as follows.

1 Extract

  • The source data is read into a strongly-typed in-memory collection of SourceFolderData items (sketched at the end of this overview)
  • The collection is called folders
  • The extract phase is not included in this post, so consider the folders object to be prepopulated with folder metadata for around 10,000 folders, enough to see the impact of some performance tweaks

2 Transform

  • The wider context of this application used EF Code First to build the reporting database. As a result there is an Entities project which includes classes for each of the entities relating to the application
  • The source type is SourceFolderData
  • The destination type is Folder

3 Load

  • The DbContext used by EF is called context (a minimal sketch of these types follows below)
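
Before getting into the iterations, here is a minimal sketch of the types assumed throughout the rest of the post. The real classes aren’t shown here, so the property names are taken from the load code below, while the property types, the Id key and the FolderContext shape are assumptions based on the usual Code First conventions (the context derives from System.Data.Entity.DbContext).

public class SourceFolderData
{
    // Shape inferred from how each source item is read in the load loop;
    // the property types are assumptions.
    public string FolderPath { get; set; }
    public string Owner { get; set; }
    public DateTime LastChangeDate { get; set; }
    public DateTime LastAccessDate { get; set; }
    public DateTime CreationDate { get; set; }
    public long SizeIncSubfolders { get; set; }
    public int FileCount { get; set; }
    public int SubFolderCount { get; set; }
    public string Permissions { get; set; }
}

public class Folder
{
    public int Id { get; set; }  // key assumed by Code First convention
    public string FolderPath { get; set; }
    public string Owner { get; set; }
    public DateTime LastChangeDate { get; set; }
    public DateTime LastAccessDate { get; set; }
    public DateTime CreationDate { get; set; }
    public long SizeIncSubfolders { get; set; }
    public int FileCount { get; set; }
    public int SubFolderCount { get; set; }
    public string PermissionsBlob { get; set; }
    public DateTime CreatedOn { get; set; }
}

public class FolderContext : DbContext
{
    public DbSet<Folder> Folders { get; set; }
}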

Iteration 1: the basic approach

Here the code simply attempts to load all the data in one transaction.

foreach (var f in folders)
{
    var folder = new Folder
    {
        FolderPath = f.FolderPath.Substring(0,f.FolderPath.Length - 1),
        Owner = f.Owner,
        LastChangeDate = f.LastChangeDate,
        LastAccessDate = f.LastAccessDate,
        CreationDate = f.CreationDate,
        SizeIncSubfolders = f.SizeIncSubfolders,
        FileCount = f.FileCount,
        SubFolderCount = f.SubFolderCount,
        PermissionsBlob = f.Permissions,
        CreatedOn = DateTime.Now
    };
    context.Folders.Add(folder);
}
context.SaveChanges();

As you’d expect, this works fine for a handful of items, but quickly breaks down when a significant number of items is loaded. For the 10,000 test items, the performance was as follows.

Time (milliseconds)    Max memory (MB)    Memory Trend
610,602                166.8              Slowly increasing

This equates to over 10 minutes, which is not acceptable.

Iteration 2: turning off AutoDetectChanges

The next improvement is to set the DbContext’s AutoDetectChangesEnabled property to false. From MSDN:

“Detect Changes works by detecting differences between the current property values of the entity and the original property values that are stored in a snapshot when the entity was queried or attached.”

By default EF runs this detection every time an entity is added to the context, and the check compares every entity the context is already tracking, so the cost grows with each item added. Significant performance improvements can be seen by turning it off.

context.Configuration.AutoDetectChangesEnabled = false;

foreach (var f in folders)  
{
    ...
}
context.SaveChanges();  

The performance improvement is immediately obvious.

Time (milliseconds)    Max memory (MB)    Memory Trend
17,124                 92.7               Slowly increasing

That’s down from over 10 minutes to 17 seconds, with a reduction in memory usage of over 44%.
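
One caveat worth noting before moving on (an aside, not part of the original process): with change detection switched off, EF no longer picks up modifications made to entities it is already tracking. That doesn’t matter for this insert-only load, because Add() explicitly marks each entity as Added, but if your ETL also updates existing rows you would need to trigger detection yourself, roughly like this:

// Sketch only: updating an already-tracked entity while
// AutoDetectChangesEnabled is false. Inserts are unaffected because
// DbSet.Add marks new entities as Added explicitly.
using (var ctx = new FolderContext())
{
    ctx.Configuration.AutoDetectChangesEnabled = false;

    var folder = ctx.Folders.First();   // any tracked entity, for illustration
    folder.Owner = @"DOMAIN\newowner";  // snapshot tracking won't see this yet

    ctx.ChangeTracker.DetectChanges();  // run change detection manually
    ctx.SaveChanges();                  // now the update is persisted
}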

Iteration 3: batching inserts

Another improvement is to batch the inserts and call SaveChanges() periodically instead of just once at the end.

context.Configuration.AutoDetectChangesEnabled = false;  
int folderCount = 0;  
foreach (var f in folders)  
{
    ...
    context.Folders.Add(folder);

    if (folderCount % 500 == 0 && folderCount > 0)
    {        
        context.SaveChanges();
    }
    folderCount++;
}
context.SaveChanges();  

The improvement shows in the overall execution time, but memory usage worsened, and the trend was for usage to continue rising as the number of items processed increased.

Time (milliseconds)    Max memory (MB)    Memory Trend
9,144                  117.2              Increasing
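
It’s also worth remembering (another aside, not from the original process) that each SaveChanges() call now commits its own transaction, so a failure part-way through leaves the earlier batches in the database. If the load needs to be all-or-nothing, the batched saves can be wrapped in a single explicit transaction using EF6’s Database.BeginTransaction():

// Sketch: wrap the batched saves in one explicit transaction so a failure
// rolls back every batch. The batched Add/SaveChanges loop goes where shown.
using (var transaction = context.Database.BeginTransaction())
{
    try
    {
        // ... batched Add/SaveChanges loop from above ...
        context.SaveChanges();
        transaction.Commit();   // commit all batches together
    }
    catch
    {
        transaction.Rollback(); // undo every batch on failure
        throw;
    }
}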

Iteration 4: recreating the DbContext

The next improvement to show in this post is disposing of and recreating the DbContext instance after each SaveChanges(). A fresh context starts with an empty change tracker, so EF is no longer holding references to the thousands of entities added in earlier batches.

context.Configuration.AutoDetectChangesEnabled = false;  
int folderCount = 0;  
foreach (var f in folders)  
{
    ...

    context.Folders.Add(folder);

    if (folderCount % 500 == 0 && folderCount > 0)
    {
        context.SaveChanges();
        context.Dispose();
        context = new FolderContext();
        // Remember to turn AutoDetectChangesEnabled to false again!
        context.Configuration.AutoDetectChangesEnabled = false;
    }
    folderCount++;
}
context.SaveChanges();  

An improvement on the overall time is seen along with a mild memory improvement.

Time (milliseconds)    Max memory (MB)    Memory Trend
6,514                  112.4              Increasing

Memory is seen to be increasing, however, so the performance would degrade as more items are processed.
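
Before moving on to batch size, the same pattern can be written a little more tidily by creating one short-lived context per batch inside a using block, which avoids the dispose/recreate/re-disable dance in the middle of the loop. This is just a sketch, assuming folders is a List<SourceFolderData> and that the mapping from iteration 1 has been pulled into a MapToFolder helper (both of those are illustrative, not from the original code):

// Sketch: one short-lived FolderContext per batch.
const int batchSize = 500;

for (int i = 0; i < folders.Count; i += batchSize)
{
    using (var batchContext = new FolderContext())
    {
        batchContext.Configuration.AutoDetectChangesEnabled = false;

        foreach (var f in folders.Skip(i).Take(batchSize))
        {
            batchContext.Folders.Add(MapToFolder(f)); // same mapping as iteration 1
        }

        batchContext.SaveChanges(); // one save per batch, then dispose
    }
}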

Iteration 5: adjusting the batch size

Finally, by adjusting the batch size, an optimal level can be found where execution time, memory usage and the memory trend are all acceptable. The following table shows the performance as the batch size increases.

Batch Size    Time (milliseconds)    Max memory (MB)    Memory Trend
100           6,489                  86.5               Stable
200           6,227                  87.2               Stable
300           6,485                  93.1               Stable
400           6,248                  101.0              Stable
500           6,514                  112.4              Increasing
600           6,587                  107.2              Increasing
700           6,717                  117.8              Increasing
800           6,780                  114.9              Increasing
900           6,503                  114.6              Increasing
1000          6,847                  124.9              Increasing

The best setting for this particular process (on my dev laptop) was a batch size of 200, which gave the fastest run while keeping memory usage stable. The interesting thing to see was the stability in memory usage at the smaller batch sizes. This gives me confidence to throw larger data sets at the process without memory usage getting out of control.
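
Since the sweet spot depends on the hardware and the data, it may also be worth making the batch size configurable rather than hard-coding it. A minimal sketch using App.config, where the "FolderEtl.BatchSize" key and the fallback value are purely illustrative:

// Sketch: read the batch size from configuration so it can be tuned
// per environment without recompiling. Requires System.Configuration.
int batchSize;
if (!int.TryParse(ConfigurationManager.AppSettings["FolderEtl.BatchSize"], out batchSize))
{
    batchSize = 200; // fallback: the best value found on my dev laptop
}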


Conclusion

Hopefully this post illustrates, with some real numbers, the difference that small adjustments to how EF is used can make to overall performance. These results will vary on different hardware, so it’s worth adjusting the batch size to find the optimal level for your process in your environment. Finally, I’d recommend reading Tips on writing an EF based ETL by Isaac Abraham for some other good practices.