Friday, February 4, 2011

Fastest way to find common items across multiple lists in C#

Given the following:

List<List<Option>> optionLists;

what would be a quick way to determine the subset of Option objects that appear in all N lists? Equality is determined through some string property such as option1.Value == option2.Value.

So we should end up with List<Option> where each item appears only once.

  • Ok, this will find the list of Option objects that have a Value appearing in every list.

    var x = from list in optionLists
            from option in list
            where optionLists.All(l => l.Any(o => o.Value == option.Value))
            orderby option.Value
            select option;
    

    It doesn't do a "distinct" select so it'll return multiple Option objects, some of them with the same Value.

  • Sort, then do something akin to a merge-sort.

    Basically you would do this:

    1. Retrieve the first item from each list
    2. Compare the items, if equal, output
    3. If any of the items are before the others, sort-wise, retrieve a new item from the corresponding list to replace it, otherwise, retrieve new items to replace them all, from all the list
    4. As long as you still got items, go back to 2.
  • what about using a hashSet? that way you can do what you want in O(n) where n is the number of items in all the lists combined, and I think that's the fastest way to do it.

    you just have to iterate over every list and insert the values you find into the hashset When you insert a key that already exists you will receive false as the return value of the .add method, otherwise true is returned

    From Sven
  • Here's a much more efficent implementation:

    static SortedDictionary<T,bool>.KeyCollection FindCommon<T> (List<List<T>> items)
    {
      SortedDictionary<T, bool>
        current_common = new SortedDictionary<T, bool> (),
        common = new SortedDictionary<T, bool> ();
    
      foreach (List<T> list in items)
      {
        if (current_common.Count == 0)
        {
          foreach (T item in list)
          {
            common [item] = true;
          }
        }
        else
        {
          foreach (T item in list)
          {
            if (current_common.ContainsKey (item))
            {
              common [item] = true;
            }
          }
        }
    
        if (common.Count == 0)
        {
          current_common.Clear ();
          break;
        }
    
        SortedDictionary<T, bool>
          swap = current_common;
    
        current_common = common;
        common = swap;
        common.Clear ();
      }
    
      return current_common.Keys;
    }
    

    It works by creating a set of all items common to all lists processed so far and comparing each list with this set, creating a temporary set of the items common to the current list and the list of common items so far. Effectively an O(n.m) where n is the number of lists and m the number of items in the lists.

    An example of using it:

    static void Main (string [] args)
    {
      Random
        random = new Random();
    
      List<List<int>>
        items = new List<List<int>>();
    
      for (int i = 0 ; i < 10 ; ++i)
      {
        List<int>
          list = new List<int> ();
    
        items.Add (list);
    
        for (int j = 0 ; j < 100 ; ++j)
        {
          list.Add (random.Next (70));
        }
      }
    
      SortedDictionary<int, bool>.KeyCollection
        common = FindCommon (items);
    
      foreach (List<int> list in items)
      {
        list.Sort ();
      }
    
      for (int i = 0 ; i < 100 ; ++i)
      {
        for (int j = 0 ; j < 10 ; ++j)
        {
          System.Diagnostics.Trace.Write (String.Format ("{0,-4:D} ", items [j] [i]));
        }
    
        System.Diagnostics.Trace.WriteLine ("");
      }
    
      foreach (int item in common)
      {
        System.Diagnostics.Trace.WriteLine (String.Format ("{0,-4:D} ", item));
      }
    }
    

    Skizz

    From Skizz
  • I don't have the performance stats, but if you don't want to roll your own method, various collections libraries have a 'Set' or 'Set(T)' object that offer the usual set procedures. (listed in the order I would use them).

    1. IESI Collections (literally just Set classes)
    2. PowerCollections (not updated in a while)
    3. C5 (never personally used)
  • Building on Matt's answer, since we are only interested in options that all lists have in common, we can simply check for any options in the first list that the others share:

    var sharedOptions =
        from option in optionLists.First( ).Distinct( )
        where optionLists.Skip( 1 ).All( l => l.Contains( option ) )
        select option;
    

    If an option list cannot contain duplicate entires, the Distinct call is unnecessary. If the lists vary greatly in size, it would be better to iterate over the options in the shortest list, rather than whatever list happens to be First. Sorted or hashed collections could be used to improve the lookup time of the Contains call, though it should not make much difference for a moderate number of items.

0 comments:

Post a Comment