Except in Linq

I think that every programmer sooner or later comes across a code that works “not as you expect from it.” This is what prompted me to write the next article, in which I try to understand why Except in Linq works the way it is written, and not the way I want it to.


What do you think the following code should output:

var documentsDir = new DirectoryInfo(Environment.GetFolderPath(Environment.SpecialFolder.MyDocuments));

FileInfo[] FirstDirrectoryFiles = documentsDir.GetFiles();

FileInfo[] SecondDirrectoryFiles = documentsDir.GetFiles();

foreach (FileInfo item in FirstDirrectoryFiles.Except(SecondDirrectoryFiles))

{

Console.WriteLine(item.Name);

}

I’ve assumed that nothing, because Except must subtract the set (IEnumerable) of the right argument from the set (IEnumerable) of the left argument. However, contrary to my expectations, I got:

Without a doubt, it does not look like an empty set. Let’s try to figure out why this happens (the result in .NET 5 and in .NET 6 is equivalent). To understand why this is happening, and what can be done about it, let’s turn to the documentation of the method Except… It actually says that this method “Finds the difference of sets represented by two sequences” and has two overloads:

Except (IEnumerable , IEnumerable )

Finds the set difference of two sequences using the default equality comparer to compare values.

Except (IEnumerable , IEnumerable , IEqualityComparer )

Finds the set difference of two sequences by using the specified IEqualityComparer to compare values.

Note the statement that the default comparator is used for comparison. To understand why our code worked the way it did, we need to understand the default comparator behavior. To do this, I suggest going to https://github.com/dotnet/runtime/ and analyze the work of the Except method.

Our point of entry:

The code in the picture
public static IEnumerable<TSource> Except<TSource>(this IEnumerable<TSource> first, IEnumerable<TSource> second)
{
            if (first == null)
            {
                ThrowHelper.ThrowArgumentNullException(ExceptionArgument.first);
            }

            if (second == null)
            {
                ThrowHelper.ThrowArgumentNullException(ExceptionArgument.second);
            }

            return ExceptIterator(first, second, null);
}

The method checks that it has received two objects with a non-null pointer and passes the arguments to ExceptIterator

Strictly speaking, the overloading method could also be used, since it literally works the same way:

The code in the picture
public static IEnumerable<TSource> Except<TSource>(this IEnumerable<TSource> first, IEnumerable<TSource> second, IEqualityComparer<TSource>? comparer)
{
            if (first == null)
            {
                ThrowHelper.ThrowArgumentNullException(ExceptionArgument.first);
            }

            if (second == null)
            {
                ThrowHelper.ThrowArgumentNullException(ExceptionArgument.second);
            }

            return ExceptIterator(first, second, comparer);
}

Why overload and not the default parameter? Forces community it was suggested that the case in this and this

The actual code ExceptIterator:

The code in the picture
private static IEnumerable<TSource> ExceptIterator<TSource>(IEnumerable<TSource> first, IEnumerable<TSource> second, IEqualityComparer<TSource>? comparer)
{
            var set = new HashSet<TSource>(second, comparer);

            foreach (TSource element in first)
            {
                if (set.Add(element))
                {
                    yield return element;
                }
            }
}

From the elements of the second argument, a lots of… The elements of the first argument that were successfully added to the set are returned as an iterator.

The set constructor takes a comparator interface, which is used to compare the elements of the set:

Specifically, in our case, the comparator is null, so we fall into the Default property of the generic EqualityComparer class.

The code in the picture
public HashSet(IEqualityComparer<T>? comparer)
{
            if (comparer is not null && comparer != EqualityComparer<T>.Default) // first check for null to avoid forcing default comparer instantiation unnecessarily
            {
                _comparer = comparer;
            }

            // Special-case EqualityComparer<string>.Default, StringComparer.Ordinal, and StringComparer.OrdinalIgnoreCase.
            // We use a non-randomized comparer for improved perf, falling back to a randomized comparer if the
            // hash buckets become unbalanced.
            if (typeof(T) == typeof(string))
            {
                IEqualityComparer<string>? stringComparer = NonRandomizedStringEqualityComparer.GetStringComparer(_comparer);
                if (stringComparer is not null)
                {
                    _comparer = (IEqualityComparer<T>?)stringComparer;
                }
            }
}

Here, in my humble opinion, everything is obvious:

The code in the picture
public abstract partial class EqualityComparer<T> : IEqualityComparer, IEqualityComparer<T>
{
        // To minimize generic instantiation overhead of creating the comparer per type, we keep the generic portion of the code as small
        // as possible and define most of the creation logic in a non-generic class.
        public static EqualityComparer<T> Default { [Intrinsic] get; } = (EqualityComparer<T>)ComparerHelpers.CreateDefaultEqualityComparer(typeof(T));
}

The comment says that in order to minimize the overhead of creating a generic instance for each type, the code of generic classes is reduced as much as possible by transferring its logic to a non-generic class (in our case, ComparerHelpers).

Let’s take a look at what how the process of creating the default comparator works:

The code in the picture
internal static object CreateDefaultEqualityComparer(Type type)
{
            Debug.Assert(type != null && type is RuntimeType);

            object? result = null;
            var runtimeType = (RuntimeType)type;

            if (type == typeof(byte))
            {
                // Specialize for byte so Array.IndexOf is faster.
                result = new ByteEqualityComparer();
            }
            else if (type == typeof(string))
            {
                // Specialize for string, as EqualityComparer<string>.Default is on the startup path
                result = new GenericEqualityComparer<string>();
            }
            else if (type.IsAssignableTo(typeof(IEquatable<>).MakeGenericType(type)))
            {
                // If T implements IEquatable<T> return a GenericEqualityComparer<T>
                result = CreateInstanceForAnotherGenericParameter((RuntimeType)typeof(GenericEqualityComparer<string>), runtimeType);
            }
            else if (type.IsGenericType)
            {
                // Nullable does not implement IEquatable<T?> directly because that would add an extra interface call per comparison.
                // Instead, it relies on EqualityComparer<T?>.Default to specialize for nullables and do the lifted comparisons if T implements IEquatable.
                if (type.GetGenericTypeDefinition() == typeof(Nullable<>))
                {
                    result = TryCreateNullableEqualityComparer(runtimeType);
                }
            }
            else if (type.IsEnum)
            {
                // The equality comparer for enums is specialized to avoid boxing.
                result = TryCreateEnumEqualityComparer(runtimeType);
            }

            return result ?? CreateInstanceForAnotherGenericParameter((RuntimeType)typeof(ObjectEqualityComparer<object>), runtimeType);
}

Let’s go in order:

1. If the argument type is byte, then a comparator is returned specifically for this type (ByteEqualityComparer)

2. If it is a string, then GenericEqualityComparer () is returned;

3. If the type implements IEquatable, then based on GenericEqualityComparer , the GenericEqualityComparer is returned for the argument type (don’t even ask);

4. If the argument is a generic type (generic) and if this generic type is Nullable <>, then based on NullableEqualityComparer is generated by the NullableEqualityComparer for the argument type;

5. If the argument is an enumeration, then based on EnumEqualityComparer <> is created by EnumEqualityComparer;

6. In all other cases, an ObjectEqualityComparer is created based on the ObjectEqualityComparer .

With the help of such a simple code (I wanted to rewrite it via string builder, but I think you can score here: D) let’s try to understand which of the listed cases is ours:

The code in the picture
var documentsDir = new DirectoryInfo(Environment.GetFolderPath(Environment.SpecialFolder.MyDocuments));
FileInfo[] result1 = documentsDir.GetFiles();
FileInfo[] result2 = documentsDir.GetFiles();

Type type = result1[0].GetType();
Console.Write(
    $"type == typeof(byte): {type == typeof(byte)}n" +
    $"type == typeof(string): {type == typeof(string)}n" +
    $"type.IsAssignableTo(typeof(IEquatable<>).MakeGenericType(type)): {type.IsAssignableTo(typeof(IEquatable<>).MakeGenericType(type))}n" +
    $"type.IsGenericType: {type.IsGenericType}n" +
    $"type.IsEnum: {type.IsEnum}nn");

Which is to be expected:

This means that now our path lies in ObjectEqualityComparer… Here, in fact, he is:

The code in the picture
public sealed partial class ObjectEqualityComparer<T> : EqualityComparer<T>
{
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public override bool Equals(T? x, T? y)
        {
            if (x != null)
            {
                if (y != null) return x.Equals(y);
                return false;
            }
            if (y != null) return false;
            return true;
        }

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public override int GetHashCode([DisallowNull] T obj) => obj?.GetHashCode() ?? 0;

        // Equals method for the comparer itself.
        public override bool Equals([NotNullWhen(true)] object? obj) =>
            obj != null && GetType() == obj.GetType();

        public override int GetHashCode() =>
            GetType().GetHashCode();
}

ObjectEqualityComparer defines an Equals method for two objects as follows:

· Objects are equal if they are both null (which is logical);

· Objects are not equal if only one of them is null;

· If both objects are not null, then their equivalence is determined by the Equals method of the “left” argument.

If you turn to documentation, you can see that our FileInfo class does indeed have an Equals method marked “Inherited from Object”. What, there and our path lies! There, in the “comments” section, we can find out that:

If the current instance is a reference type, the Equals (Object) method checks for reference equality, and calling the Equals (Object) method is equivalent to calling the ReferenceEquals method. Reference equality means that the object variables being compared refer to the same object.

At this, it seemed, it would be possible to end our journey, but lastly, let’s figure out how to make Except stop showing files in the directory with my documents.

Option from the category “so far, then I’ll fix it”:

The code in the picture
var documentsDir = new DirectoryInfo(Environment.GetFolderPath(Environment.SpecialFolder.MyDocuments));
FileInfo[] result1 = documentsDir.GetFiles();
FileInfo[] result2 = documentsDir.GetFiles();
foreach (FileInfo item in result1
                                 .Select(i => i.FullName)
                                 .Except(result2.Select(i => i.FullName))
                                 .Select(i => new FileInfo(i)))
{
    Console.WriteLine(item.Name);
}

We essentially call Except on two IEnumerate s, and then reassemble the IEnumerate from the result.

In the fresh .net6, you can still use ExceptBy:

The code in the picture
var documentsDir = new DirectoryInfo(Environment.GetFolderPath(Environment.SpecialFolder.MyDocuments));
FileInfo[] result1 = documentsDir.GetFiles();
FileInfo[] result2 = documentsDir.GetFiles();
foreach (FileInfo item in result1.ExceptBy(result2.Select(i => i.FullName), ks => ks.FullName))
{
    Console.WriteLine(item.Name);
}

This code already looks nicer and even saves us from having to parse and re-assemble the original array, but you can do it differently.

The next idea that came into my head was to inherit FileInfo and override Equals, but unfortunately FileInfo is a sealed class, which means it cannot be inherited, so this path is cut off for us.

Let’s approach the question from the other side. Recall that Equals has an overload that takes IEqualityComparer as its second argument, which allows us to create something like this:

var documentsDir = new DirectoryInfo(Environment.GetFolderPath(Environment.SpecialFolder.MyDocuments));
FileInfo[] result1 = documentsDir.GetFiles();
FileInfo[] result2 = documentsDir.GetFiles();

foreach (FileInfo item in result1.Except(result2, new CustomFileInfoComparer()))
{
    Console.WriteLine(item.Name);
}

public class CustomFileInfoComparer : IEqualityComparer<FileInfo>
{
    bool IEqualityComparer<FileInfo>.Equals(FileInfo? lhv, FileInfo? rhv)
       => lhv?.FullName == rhv?.FullName;
    int IEqualityComparer<FileInfo>.GetHashCode(FileInfo obj) => obj.FullName.GetHashCode();
}

This solution works well if you have to compare IEnumerable at least once further down the code.

That’s all for me, I hope that the reader found my humble work interesting.

I would like to express my deep gratitude to my wife for help in preparing this post, as well as to the community .NET Talks🎄

Similar Posts

Leave a Reply