Comparing the Performance of `for` vs. `foreach` Loops

Published on 05 Mar 2022.

We had a discussion on Discord about performance, and the nuances around for and foreach loops came up. Since that discussion, I’ve spent some time profiling and analyzing the compiled code for these two loop types in various circumstances, and I wanted to write up what I learned.

At a surface level, you might expect the following two chunks of code to have very similar performance characteristics:

List<int> numbers = new List<int> { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
int sum = 0;
for (int index = 0; index < numbers.Length; index++)
    sum += numbers[index];

and:

List<int> numbers = new List<int> { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
int sum = 0;
foreach (int number in numbers)
    sum += number;

They do the same thing at a philosophical level: walk down the list and add up all the numbers. So shouldn’t they have similar performance and be more or less interchangeable?

In the general case, a foreach loop actually compiles to something more like this:

List<int> numbers = new List<int> { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
int sum = 0;

IEnumerator<int> enumerator = _numbers.GetEnumerator();
while (enumerator.MoveNext())
    sum += enumerator.Current;

That is, a foreach loop will use the IEnumerable<T> infrastructure behind the scenes. That requires creating extra objects and extra method and property calls beyond what a simple for loop will do.

Which begs the question:

Is there a speed difference between the two?

The real answer to these performance questions is always the same.

You don’t know what the difference is unless you measure it. You should never just assume something is slower or faster based on your intuition, and you should never assume you made an improvement by making a change unless you measure the results.

The best tool I’ve found for this type of CPU performance comparisons is an extremely well-made library called Benchmark.NET. So I made a new project and added a benchmark to compare simple iteration over a List<int> with a for loop vs. a foreach loop. If you’re not familiar with Benchmark.NET, it is an extremely useful tool for comparing the CPU performance of different variations of the code.

Here was my code for this test:

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkRunner.Run<ForVsForeachList>();

public class ForVsForeachList
{
    private readonly List<int> _numbers = new List<int> { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 };

    [Benchmark]
    public int ForLoop()
    {
        int sum = 0;
        for (int index = 0; index < _numbers.Count; index++)
            sum += _numbers[index];
        return sum;
    }

    [Benchmark]
    public int ForeachLoop()
    {
        int sum = 0;
        foreach (int number in _numbers)
            sum += number;
        return sum;
    }
}

I intentionally got the List creation out of the benchmark, since I don’t want that to be part of the comparison. (It would be the same between the two, anyway.)

And the results were:

|       Method |      Mean |     Error |    StdDev |
|------------- |----------:|----------:|----------:|
|      ForLoop |  6.683 ns | 0.0523 ns | 0.0463 ns |
|  ForeachLoop | 12.228 ns | 0.2287 ns | 0.2139 ns |

In this specific scenario, the foreach loop took nearly twice as long as the for loop! That’s substantial, but also note the units: ns. That’s nanoseconds. They’re both extremely fast, and frankly, it would be rare for this to make much of a difference in practice.

I think that last statement is an important point and I want to reiterate it in different words:

Even though for loops tend to be faster than foreach loops, foreach loops often are more readable. You should prefer readability over minor improvements in speed, and only optimize for speed when you have a known CPU performance problem.

The next question I had was:

Does it matter how many items are in the list?

So I ran it with 1000 items in the array instead of 10:

|       Method |       Mean |   Error |  StdDev |
|------------- |-----------:|--------:|--------:|
|      ForLoop |   663.3 ns | 6.09 ns | 5.70 ns |
|  ForeachLoop |   876.4 ns | 5.73 ns | 5.08 ns |

With 100x the number of items, we got about 100x times the CPU time. (663 ns is still blazingly fast.) The overhead of the foreach loop isn’t quite as prominent, but it isn’t just a flat or one-time cost. This, again, confirms that a foreach loop is likely to be a bit slower than a for loop.

Manually using `IEnumerator<T>`

The next thing I wanted to try was to compare the foreach loop against manually writing out the IEnumerator code, which I added as another method in the benchmark:

[Benchmark]
public int IEnumerator()
{
    int sum = 0;
    IEnumerator<int> enumerator = _numbers.GetEnumerator();
    while (enumerator.MoveNext())
        sum += enumerator.Current;
    return sum;
}

And the results:

|       Method |      Mean |     Error |    StdDev |
|------------- |----------:|----------:|----------:|
|      ForLoop |  6.760 ns | 0.0636 ns | 0.0563 ns |
|  ForeachLoop | 12.026 ns | 0.1150 ns | 0.1020 ns |
|  IEnumerator | 31.302 ns | 0.6320 ns | 0.6207 ns |

Hold up. Why is my code so dang slow?! My handmade version is clearly inferior, at something like 2.5x the time.

Again, I’ll reiterate: don’t just assume you know what’s happening, because you’re probably going to be wrong a lot.

I went to the IL disassembler to see what the foreach loop was actually doing and spotted the likely culprit. It turns out that List<T>’s implementation of IEnumerable<T> returns a List<T>.Iterator instance, which is a struct. And the compiler is generating specific code to use a List<T>.Iterator instead of an IEnumerator<int>. That change saves us from a boxing conversion and allows the compiler to know specifically which method will be used at runtime, so we can do a normal method call instead of a virtual method call.

My next step was to do a similar thing and benchmark that:

[Benchmark]
public int Enumerator()
{
    int sum = 0;
    List<int>.Enumerator enumerator = _numbers.GetEnumerator();
    while (enumerator.MoveNext())
        sum += enumerator.Current;
    return sum;
}

(I have a preference for explicitly writing out my variable types, but in this case, had I used var, the compiler would have inferred a List<int>.Enumerator and I never would have even seen this problem. I suppose that’s a point in var’s favor.)

Let’s see how this compares:

|      Method |      Mean |     Error |    StdDev |
|------------ |----------:|----------:|----------:|
|     ForLoop |  6.797 ns | 0.0848 ns | 0.0794 ns |
| ForeachLoop | 12.399 ns | 0.2409 ns | 0.2253 ns |
|  Enumerator | 12.232 ns | 0.1980 ns | 0.1852 ns |
| IEnumerator | 33.124 ns | 0.6967 ns | 0.8023 ns |

That Enumerator version is the new version. And it’s much better. Nearly identical to ForeachLoop.

It is actually slightly faster. I could chalk that up to rounding error, though I also saw that the code the compiler generates for a foreach loop, in this case, actually put in a try/finally block and disposes of the List<int>.Enumerator instance, which… is probably the right thing to do, even if it slows you down by 0.17 nanoseconds.

Trying Again with Arrays

The compiler took advantage of some knowledge it had about lists–optimizations it may not be able to make in all cases. But if there’s a type the compiler knows a ton about, it is the humble array. I wanted to try this with arrays to see what the compiler did, and I was rather surprised with what I saw.

Here’s my code for the array-based version:

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkRunner.Run<ForVsForeachArray>();

public class ForVsForeachArray
{
    private readonly int[] _numbers = new int[] { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 };

    [Benchmark]
    public int ForLoop()
    {
        int sum = 0;
        for (int index = 0; index < _numbers.Length; index++)
            sum += _numbers[index];
        return sum;
    }

    [Benchmark]
    public int ForeachLoop()
    {
        int sum = 0;
        foreach (int number in _numbers)
            sum += number;
        return sum;
    }
}

Simple enough.

The results boggled my mind:

|           Method |     Mean |     Error |    StdDev |
|----------------- |---------:|----------:|----------:|
|          ForLoop | 6.591 ns | 0.0350 ns | 0.0292 ns |
|      ForeachLoop | 3.299 ns | 0.0390 ns | 0.0304 ns |

What’s going on now?! The foreach loop is suddenly twice as fast?

Once again, I decompiled the code with the IL disassembler. The most obvious difference was that the compiler didn’t even use the IEnumerable stuff! Array types have it, but the compiler sidestepped it entirely, and basically generated the same code as a for loop.

This makes sense to me; the compiler already knows a ton about this type, and it can make a bunch of guarantees about arrays, including that arrays can’t change size, once created, and that the way arrays implement IEnumerable<T> will always work the same. The compiler knew enough about the type being iterated over and chose the faster code. It is a compiler-level optimization.

That can easily explain why the foreach loop is not slower, but what’s the explanation for why it is faster? I had to pick through the disassembled code very carefully to spot the differences.

There were two things that I had to change in my for loop code to make it as efficient as the generated foreach version. The first, which was adding about two-thirds of the overhead, was the fact that the for loop version had to look up the _numbers variable from the containing instance. When you see code like _numbers.Length, it doesn’t immediately jump out to you that this is equivalent to this._numbers.Length, and any time it appears, the code must retrieve _numbers from the instance first. The for loop version has two occurrences of _numbers, including one in the loop. The foreach loop version only has one, and it isn’t in the loop.

This is easily fixed by making a local variable (say, numbers) and assign the array to this local variable once before the loop (int[] numbers = _numbers;).

The second difference is that the foreach loop version has an extra variable that it uses to store the retrieved value in before adding it to sum. We can manually add in that extra varable ourselves.

With both of those changes, our micromanaged for loop, designed to compete with the compiler-generated foreach loop, looks like this:

[Benchmark]
public int ForLoopFineTuned()
{
    int sum = 0;
    int[] numbers = _numbers;
    for (int index = 0; index < numbers.Length; index++)
    {
        int total = numbers[index];
        sum += total;
    }
    return sum;
}

The results are now nearly identical:

|       Method |     Mean |     Error |    StdDev |
|------------- |---------:|----------:|----------:|
| ForLoopLocal | 3.292 ns | 0.0352 ns | 0.0329 ns |
|  ForeachLoop | 3.432 ns | 0.0998 ns | 0.1068 ns |

The two differ by a tiny bit, but not by enough to matter. (And I checked. The IL code is precisely identical. The error we see from Benchmark.NET is probably caused by other things going on on my computer or something. I ran it a second time and got similar but reversed numbers.)

Manually Unrolling the Loop

If we’re trying to squeeze every drop out of our CPU performance, there’s one more tool we could use. For reasons beyond what this post can reasonably cover, conditional branching–jumping around in the code based on values, as happens with a loop’s condition and an if statement–is a source of poorer performance. You don’t generally need to worry about it, because most situations don’t demand such nuanced control, but in our specific case, we could use a tool called loop unrolling. Since we know we will run through the loop exactly 10 times, we could simply write our code that way:

[Benchmark]
public int ManuallyUnrolled()
{
    return
        _numbers[0] + _numbers[1] + _numbers[2] +
        _numbers[3] + _numbers[4] + _numbers[5] +
        _numbers[6] + _numbers[7] + _numbers[8] +
        _numbers[9];
}

Does this lead to better performance?

Yes:

|           Method |     Mean |     Error |    StdDev |
|----------------- |---------:|----------:|----------:|
|          ForLoop | 6.593 ns | 0.0774 ns | 0.0724 ns |
|      ForeachLoop | 3.249 ns | 0.0486 ns | 0.0455 ns |
| ForLoopFineTuned | 3.392 ns | 0.0997 ns | 0.1109 ns |
| ManuallyUnrolled | 2.045 ns | 0.0183 ns | 0.0163 ns |

This is actually quite a bit faster. But it only works if (a) you know how many times the loop will run at compile time, and (b) you’re willing to copy/paste the code that many times, and (c) you’re willing to live with larger executables because of more instructions (which is usually negligible).

Summary

This has been a long post, so let’s summarize:

Computers are fast; usually you want to optimize for readability, not CPU performance.
Your intuitive understanding of what will be slow and fast is not always right.
Before you optimize for CPU, make sure you’re optimizing the slow part by measuring.
After optimizing for CPU, make sure it had the intended effect by measuring.
for loops are, indeed, usually faster than foreach loops…
… but the compiler is quite smart and has a ton of optimizations in there that may surprise you.
When the compiler’s optimizations don’t need to use the IEnumerable<T>/IEnumerator<T> infrastructure, but can convert it to a for loop, it will generally be as performant as a for loop.
In some cases, the compiler-optimized code will be faster than a non-micromanaged for loop.

Comparing the Performance of for vs. foreach Loops

Manually using IEnumerator<T>

Trying Again with Arrays

Manually Unrolling the Loop

Summary

Comparing the Performance of `for` vs. `foreach` Loops

Manually using `IEnumerator<T>`