jordan.terrell
Just trying to make sense of things...

LINQ: Distinct() does not work as expected

Thursday, 2 October 2008 10:20 by jordan.terrell

A few days ago I was using LINQ to filter a distinct (only unique items) list of custom objects.  I needed the distinction to be based on a property of type System.String.  Prior to this, I have never used or looked at the Distinct query operator - thus I expected it to have overloads that let me pass in a selector (Func<T, TRet>) to select the property to make the distinction on.

Unfortunately, it has no such overload.

However, what it did have was an overload that took an IEqualityComparer<T>.  As much as that felt like a cop-out (not a very LINQ-ish approach), I figured it would have to do the job.  There was one other overload that basically defaulted to EqualityComparer<T>.Default - this goes through a series of type checks to determine the best comparer for the type T being compared.  If the type T implements the IEquatable<T> interface, it will use that for the comparison.  Great!  I'll just implement that interface on my custom type, and be on my way!

Not so fast! It didn't work - for some reason it never invoked the IEquatable<T>.Equals method implemented on my custom type.  My next step was to implement as custom IEqualityComparer<T> to make the comparison based on the the property value.

That didn't work either! At this point I was quite confused, and then out of sheer desperation I put a breakpoint on the other method defined on IEqualityComparer<T>: GetHashCode.

"You've got to be kidding me!" - I believe those were the words going through my mind.  The Distinct query operator uses hash codes provided by an IEqualityComparer<T> to determine the distinct set of items.  Upon further inspection, the Distinct operator doesn't really use GetHashCode, it defers to another Set class to filter the items - that class uses the GetHashCode method.

Not what I expected.  Needless to say I was a bit disappointed.  Not only was there not an standard overload for me to pass in a selection lambda, but now I couldn't even use the Equals method on IEquatable<T> OR IEqualityComparer<T>!

I wasn't about to implement a custom GetHashCode method - I believe the annotations in the book Framework Design Guidelines basically says that the current implementation of GetHashCode is flawed.  Plus, for it to work in my case I would have to return the hash code for the property I wanted compared, not the hash code for the custom type.  All of this left a very sour taste in my mouth.

I did find an alternative though - using a combination of the GroupBy, First, and Select query operators, I was able to get what I wanted:

   1: var distinctItems = items
   2:     .GroupBy(x => x.PropertyToCompare)
   3:     .Select(x => x.First());

 

It looses quite a bit in meaning, but it accomplishes the end goal.  At some point I think I'm going to create an Extension Method that adds another Distinct() overload that takes a selection lambda - something that should have been there to being with!

Tags:   ,
Categories:   .NET | Programming
Actions:   E-mail | del.icio.us | Permalink | Comments (7) | Comment RSSRSS comment feed

Comments (7) -

October 3. 2008 02:45

David Meyer

Hahaha, I ran into that same problem.  That is definitely messed up.  I expected an overload with a selector too.  Or at least an equality comparer in the form of a delegate (like Func<T, T, bool>).  But an equality comparer interface?!  That should be illegal.  Lol.  Since lambda expressions in C# 3.0, that should at least be considered a very poor design anyway...

One of the great things about LINQ is how well it communicates intent, and your group by, etc. solution does circumvent that (losing meaning, like you said).  I agree there should be another Distinct overload.  items.Distinct(x => x.PropertyToCompare) would be much more readable.  Let me know if you write such an extension method; I would want to add that to my common library as well! Smile

David Meyer

October 20. 2008 09:26

Anders Ljusberg

I too was looking for a solution to this, but I have to say that I can understand why Distinct works as it does. Basically, what you want to do is not to get distinct object instances, you actually want to get the first object that has a property set to a specific value. So the "workaround" you propose (GroupBy, then Select First) actually makes more sense.

Anders Ljusberg

November 13. 2008 17:17

Marcelo

Very Useful.  THANK YOU.

Marcelo

January 19. 2009 07:22

Andy Wilson

It makes sense to me that the implementation of Distinct should use GetHashCode for optimization.  Note that writing a custom Equals method without a matching GetHashCode is bas practice (as noted in msdn.microsoft.com/en-us/library/bsc2ak47.aspx and msdn.microsoft.com/.../...litycomparer.equals.aspx)

It sounds like the implementation of Distinct will use GetHashCode first and if two items have different hash-codes, it will assume they are distinct and it doesn't have to bother using Equals (seems perfectly reasonable to me!).  I guess this is why Distinct() takes an IEqualityComparer (Equals plus matching GetHashCode) rather than a simple comparison delegate.

If it didn't use this approach, then the number of comparisons required grows exponentially with sequence length, i.e. the performance of this will be bad for long sequences.

I think your GroupBy alternative will work as long as the type of the PropertyToCompare implements Equals and GetHashCode correctly.

Andy Wilson

January 20. 2009 03:57

joniba

I just wrote my own extensions, I found it to be much easier:


/// <summary>
        /// Removes duplicate entries from a sequence of elements by comparing them using the Equals method.
        /// Provides an alternative to the Distinct method, which compares elements using their GetHashCode method.
        /// </summary>
        /// <typeparam name="T">The type of the elements of the sequence.</typeparam>
        /// <param name="source">The sequence of elements to filter.</param>
        /// <returns>The original sequence with duplicate entries removed.</returns>
        public static IEnumerable<T> RemoveDuplicates<T>(this IEnumerable<T> source)
        {
            return RemoveDuplicates(source, (t1, t2) => t1.Equals(t2));
        }

        /// <summary>
        /// Removes duplicate entries from a sequence of elements by using the specified Func to compare elements and determine
        /// whether they are equal.
        /// Provides an alternative to the Distinct method, which compares elements using their GetHashCode method.
        /// </summary>
        /// <typeparam name="T">The type of the elements of the sequence.</typeparam>
        /// <param name="source">The sequence of elements to filter.</param>
        /// <param name="equater">A Func that takes 2 elements of type T and returns true if they are equal, else false.</param>
        /// <returns>The original sequence with duplicate entries removed.</returns>
        public static IEnumerable<T> RemoveDuplicates<T>(this IEnumerable<T> source, Func<T, T, bool> equater)
        {
            // copy the source array
            List<T> result = new List<T>();

            foreach (T item in source)
            {
                if (result.All(t => !equater(item, t)))
                {
                    // Doesn't exist already: Add it
                    result.Add(item);
                }
            }

            return result;
        }

joniba

January 27. 2009 01:36

Ben Cherry

Thanks for the tip!  It works great... except when I want to select distinct on multiple conditions...

No worries, I've figured it out!  I'll post it here to help others in the same situation:
[code]
var distinctByMultiple = items
  .GroupBy(x => new {x.Property1, x.Property2})
  .Select(g => g.Key);
[/code]
This will give you a collection of the distinct items.  Each item in the collection will have Property1, Property2, etc.  The main drawback is that you are losing the original collection and any other properties or methods that came with the original type.  However, as long as you specify the properties you need in the GroupBy line, they'll be accessible.

Again, thanks for the help!

Ben Cherry

February 8. 2009 01:34

Jonathan Holland

I know this is an older post, but I came up with an extended version of Distinct() that allows you to pass in lambdas. You can see it here:

www.gfilter.net/

Jonathan Holland

Pingbacks and trackbacks (1)+

Comments are closed