Google
 

Monday, December 31, 2007

The perfect C# string enumerator

Introduction

I really love C# and do my utmost at kicking myself in the rear for not having gotten into it sooner than I did. In spite of that, though, I sorely miss some things from other languages.

Take string enumerators for example. Enumerators are an amazing thing when it comes to keep things clean and easy to maintain. You define them once, assign them their values, and use their names throughout the program. Behind the scene each enumeration is translated to the value you set it up with, and that's what you end up storing on disk, using for calculations, etc.

String enumerations are no different. You pick a set of names, assign them their text values, and in code you always refer to the names. When it comes time to using them for disk storage, display, etc., you refer to the enumeration's value.

But in C#, for whatever reason (anyone know why?), this was left out. I've looked around for quite a while, and found proposed solutions that ranged from simply creating a class with static strings to some that use reflection. They all seemed either too simple or too complex, and none gave me all the functionality an enumerator should have.

So I set out to make my own, and I think I've done it. I would say that it mimics a regular enumerator to 100%, and provides exactly the same functionality, usage and behaviour as regular enumerators. I cannot think of anything that has been missed, and am depending on my readers to help point those things out :)


Background

When I set out to create this, I had to sit down and think about all the ways in which an enumerator could be used in order to make sure the string enumerator is as true to its name as possible. I've created a number of tests using nUnit, and so far so good. I can't really think of any more good tests for this, but if anyone else can please let me know.

Usage

You can refer to the tests in the nUnit project to see the enumerator's capabilities, but here they are in brief:

  • Assigning a static enumerator's value to a string

  • Instantiating an enumerator object, setting it to a valid enumerator value, passing it as a parameter to a method, and assigning its value to a string.

  • Instantiating an enumerator object, setting it to a valid enumerator value, passing it as a parameter to a method, copying that enumerator to a new one, and assigning the new enumerator's value to a string.

  • Same as the above, but right after copying the first enumerator to the second one, we change the first one's value and ensure the second one has not been affected. Note that although we are dealing with a reference item, the enumerator class is immutable so we are always creating a new copy when assigning it a new value. Assigning an existing object to another one is done just with the "=" sign, without a dependency on the iCloneable interface, and the copy is not a shallow one (although there isn't much depth to this class).

  • Testing enumerated strings that contain spaces

  • Assigning a valid string value to an enumerator object

  • Assigning an invalid string value to an enumerator object, and getting an exception

  • Creating two enumerator objects, and enssuring that ==, != and Equals() perform properly.

Once you create the enumerator class, you use it exactly the way you would use a regular enumerator, with one slight difference: When you want to deal with the actual string value of either the enumerator object or of one of the static enumerator values, you must use the ToString() method. I would dearly love to find a way around this, but even if I can't, I think it's a pretty small price to pay.

The EStringEnum base class

To create a string enumerator you'll need two things. The base EStringEnum class, and a class that you define (which inherits from EStringEnum) with the actual string values.

The base class exposes two items that must be overwritten, as well as logic for validating string values that the user is trying to assign to an enumerator object (i.e. the value must be one of the valid enumerated strings), and some overloads for ToString() and various operators:


/// 
/// A string enumerator base class. Must be inherited.
///

public abstract class EStringEnum
{
#region Data
protected string mCurrentEnumValue = "";

protected abstract string EnumName { get; }

protected abstract List PossibleEnumValues { get; }

#endregion

#region Constructors
public EStringEnum() { }

///
/// A string enumerator
///

/// A valid enumerator value.
/// An exception is raised if the value is invalid.
public EStringEnum(string value)
{
if (PossibleEnumValues.Contains(value))
{
mCurrentEnumValue = value;
}
else
{
string errorMessage = string.Format(
"{0} is an invalid {1} enumerator value",
value,
EnumName);
throw new Exception(errorMessage);
}
}
#endregion

#region Overloads
///
/// Returns the enumerator's current value
///

///
public override string ToString()
{
return mCurrentEnumValue;
}

///
/// Test for equality
///

///
///
///
public static bool operator ==(EStringEnum stringEnum1, EStringEnum stringEnum2)
{
return (stringEnum1.ToString().Equals(stringEnum2.ToString()));
}

///
/// Test for inequality
///

///
///
///
public static bool operator !=(EStringEnum stringEnum1, EStringEnum stringEnum2)
{
return (!stringEnum1.ToString().Equals(stringEnum2.ToString()));
}

///
/// Test for equality
///

///
///
public override bool Equals (object o)
{
EStringEnum stringEnum = o as EStringEnum;
return (this.ToString().Equals(stringEnum.ToString()));
}

///
/// Retrieve the hashcode
///

///
public override int GetHashCode()
{
return base.GetHashCode();
}
#endregion
}
Only two things from this base class must be implemented when inheriting it:
protected abstract string EnumName { get; }
protected abstract List PossibleEnumValues { get; }
EnumName must return a string with the name of the string enumerator (e.g. ECarManufacturers), and PossibleEnumValues must return a List of the enumerated strings (e.g. "Ford", "Toyota", etc).

The actual enumerator class

The enumerator must inherit from EStringEnum and do the following:
  • implement the two properties mentioned above,

  • provide static accessors for each enumerated string,

  • provide a constructor (which just calles the base constructor), and

  • provide an implicit conversion from a string type to an object type of the actual enumerator (e.g. ECarManufacturers).

It may sound complex, but as the example below shows, it really is quite simple to implement:


/// 
/// A car manufacturer enumerator
///

public class ECarManufacturers: EStringEnum
{
#region Data
///
/// Used by EStringEnum to identify the current class
///

protected override string EnumName { get { return "ECarManufacturers"; } }

protected override List PossibleEnumValues { get
{ return mPossibleEnumValues; } }

///
/// Complete list of string values that this enumerator can hold
///

private static List mPossibleEnumValues = new List()
{
"Toyota",
"Honda",
"Ford",
"Chrysler",
"Volvo",
"General Motors"
};

///
/// CarManufacturers type
///

static public ECarManufacturers Toyota
{
get { return new ECarManufacturers(mPossibleEnumValues[0]); }
}

///
/// CarManufacturers type
///

static public ECarManufacturers Honda
{
get { return new ECarManufacturers(mPossibleEnumValues[1]); }
}

///
/// CarManufacturers type
///

static public ECarManufacturers Ford
{
get { return new ECarManufacturers(mPossibleEnumValues[2]); }
}

///
/// CarManufacturers type
///

static public ECarManufacturers Chrysler
{
get { return new ECarManufacturers(mPossibleEnumValues[3]); }
}

///
/// CarManufacturers type
///

static public ECarManufacturers Volvo
{
get { return new ECarManufacturers(mPossibleEnumValues[4]); }
}

///
/// CarManufacturers type
///

static public ECarManufacturers GeneralMotors
{
get { return new ECarManufacturers(mPossibleEnumValues[5]); }
}
#endregion

#region Constructor
///
/// A car manufacturer enumerator
///

/// A valid enumerator value.
/// An exception is raised if the value is invalid.
private ECarManufacturers(string value): base(value)
{ }
#endregion

#region Misc methods
///
/// Implicitly convert a string to a CarManufacturers object
///

/// A string value to convert to an ECarManufacturers enum value.
/// An exception is raised if the value is invalid.
///
public static implicit operator ECarManufacturers(string value)
{
return new ECarManufacturers(value);
}

///
/// Implicitly convert an ECarManufacturers object to a string
///

/// A ECarManufacturers object whose value is to
/// be returned as a string
///
public static implicit operator string(ECarManufacturers carManufacturers)
{
return carManufacturers.ToString();
}
#endregion
}
Examples

The following tests are included in the nUnit test project (available in the download), but I'm posting them here as well to give you a quick look at how true-to-life this implementation is:

[TestFixture]
public class TestClass1
{
[Test]
public void Test01()
{
string result = ECarManufacturers.GeneralMotors.ToString();
string expected = "General Motors";
Assert.IsTrue(result == expected);
}
[Test]
public void Test02()
{
ECarManufacturers carManufacturer = ECarManufacturers.Honda;
string result = carManufacturer.ToString();
string expected = "Honda";
Assert.IsTrue(result == expected);
}
[Test]
public void Test03()
{
Test03A(ECarManufacturers.Ford);
}
private void Test03A(ECarManufacturers carManufacturer)
{
string expected = "Ford";
string result = carManufacturer.ToString();
Assert.IsTrue(result == expected);
}
[Test]
public void Test04()
{
Test04A(ECarManufacturers.Ford);
}
private void Test04A(ECarManufacturers carManufacturer)
{
ECarManufacturers tempCarManufacturers2 = carManufacturer;
string result = tempCarManufacturers2.ToString();
string expected = "Ford";
Assert.IsTrue(result == expected);
}
[Test]
public void Test05()
{
Test05A(ECarManufacturers.Ford);
}
private void Test05A(ECarManufacturers carManufacturer)
{
ECarManufacturers carManufacturer2 = carManufacturer;
carManufacturer = ECarManufacturers.Chrysler;
string expected1 = "Chrysler";
string result1 = carManufacturer.ToString();
string expected2 = "Ford";
string result2 = carManufacturer2.ToString();
Assert.IsTrue(result1 == expected1 && result2 == expected2);
}
[Test]
public void Test06()
{
ECarManufacturers tempCarManufacturers = "Ford";
string result = tempCarManufacturers.ToString();
string expected = "Ford";
Assert.IsTrue(result == expected);
}
[Test]
public void Test07()
{
ECarManufacturers tempCarManufacturers2 = null;
try
{
tempCarManufacturers2 = "Orion";
Assert.Fail();
}
catch (Exception ex)
{
Assert.IsTrue(ex.Message.Equals(
"Orion is an invalid ECarManufacturers enumerator value"));
}
}
[Test]
public void Test08()
{
ECarManufacturers tempCarManufacturers = "General Motors";
Assert.IsTrue(tempCarManufacturers.ToString().Equals(
ECarManufacturers.GeneralMotors.ToString()));
}
[Test]
public void Test09()
{
ECarManufacturers carManufacturers1 = ECarManufacturers.GeneralMotors;
ECarManufacturers carManufacturers2 = ECarManufacturers.GeneralMotors;
Assert.IsTrue(carManufacturers1 == carManufacturers2);
}
[Test]
public void Test10()
{
ECarManufacturers carManufacturers1 = ECarManufacturers.GeneralMotors;
ECarManufacturers carManufacturers2 = ECarManufacturers.Ford;
Assert.IsTrue(carManufacturers1 != carManufacturers2);
}
[Test]
public void Test11()
{
ECarManufacturers carManufacturers1 = ECarManufacturers.GeneralMotors;
int number = carManufacturers1.GetHashCode();
}

[Test]
public void Test12()
{
ECarManufacturers carManufacturers1 = ECarManufacturers.GeneralMotors;
string value = carManufacturers1;
ECarManufacturers carManufacturers2 = carManufacturers1;
string value2 = carManufacturers2;
Assert.IsTrue(value.Equals("General Motors"));
}
}
In conclusion

Not a very complex solution when you think about it, and I think it covers all the behaviours an enumerator is expected to have. When it comes time to use the enumerator, you use it exactly like you would use a regular enumerator. If you can think of anything I've missed or can improve upon, feel free to tell me.

Something I'd like to do to improve on this are:
  • Extract as much logic as possible and put it in the base class (such as the static accessors). I could use something like Spring.Net to inject the accessors at runtime, but this would have two major drawbacks: intellisense would no longer work, and I would be adding a dependency to the project.

A downloadable copy of the source code is available here: http://www.codeproject.com/KB/cs/csharpstringenumerator.aspx

History
December 31 2007 - Initial post (happy new year!).
January 01 2008 - Added examples.

Saturday, December 29, 2007

The search for string enumerators

Of all the languages I've worked with so far, C# is by far my favourite. My list of reasons for this is quite long, but in spite of all that, I still have things I enjoyed very much in other languages that C# lacks. Simple examples are Cobol's EVALUATE statement and Visual Basic's string enumerators.

Under the hood, the EVALUATE statement is the equivalent of a series of else if statements, except that the debugger does a very elegant job of jumping straight to the else path that is true instead of stepping through every single one until it finds one that it likes. Not a big deal, but certainly something I miss.

The string enumerator, on the other hand, is sorely missed. I have yet to find a good way to implement one in C#, and so am embarking on a mission. I've seen several suggestions out there, ranging from simple ones that use string formatting to extract the names of an integer enumerator to more complex ones that use reflection.

Suffice it to say that I didn't like any of them. Too simple, too inelegant, etc. There has to be a good way out of this that will not be too complex to implement, will be elegent to code and use, and will resemble the normal enumerators as much as possible (preferably to 100%).

Friday, December 28, 2007

If you need to do asynchronous work, this is for you ...

I read a good article by Mike Peretz on the codeproject about delegates and asynchronous method invocation. It's a good thing to know, especially if you have work that takes a while to complete, have a UI that needs to remain fresh while doing some heavy work (although you should probably be using an MVP pattern - which takes care of this - if your app has a front end and complex business logic), etc.

You can read the article here: http://www.codeproject.com/KB/cs/AsyncMethodInvocation.aspx

Thursday, December 27, 2007

Excellent article on distributed computing

I came across this article by Daniel Vaughan in the code project about making your own distributed computing application, similar to what BOINC does with Seti@home only through a browser. Well written and pretty interesting. Should be a good read to anyone interested in this stuff, or is just looking for something new and cool.

Check it out at http://www.codeproject.com/KB/silverlight/gridcomputing.aspx.

Wednesday, December 26, 2007

Ok, seems like it's all fixed now.

These templates can be a pain to work with, but it seems to be over now. On to the next article, then!

Major blog issues ...

I have no idea why, but blogspot has serious issues with their templates. The site's proper layout will be up and running in the next day or two.

Monday, December 10, 2007

Faster batch processing of large amounts of data

Introduction

I frequently come across the need to process large amounts (i.e. in the hundreds of thousands or even millions) of repetitive pieces of data such as records from a database table or a collection of objects, apply some business transformation, and them save them (e.g. rewrite them to the database, generate a report, etc.)

In a simple database example you could just write an SQL statement that does an UPDATE of the affected records based on certain selection criteria (as specified by the WHERE clause). This is a pretty limited way of doing things, however, and if you start getting into complex situations you might end up using cursors which are a whole other can of worms. So if you start performing the update in code, you are automatically obliged to get a DataReader, read the records one by one, apply your business logic, and the rewrite the results. For various reasons, this is very time consuming.

A non-database related example could be a collection holding several million objects that all need to be manipulated somehow. You will undoubtedly use a foreach() loop and handle the objects one by one, but this again is an inefficient use of CPU power. You can do better.

This article describes a very efficient and scalable mechanism for processing such large amounts of data. It uses multiple concurrent threads to handle small batches of work, and can take advantage of multiple core CPUs by creating more threads depending on how many cores there are.

A demo project is available at the codeproject. It creates 5,000,000 customer objects that hold basic information such as an ID and a name, updates a discount percentage for each, and reverses the customer name 50 times (just so we have something time consuming to do). It was run on my machine (which has a quad-core CPU) for a benchmark and produced the following processing times:

  • Single thread execution: 5 minutes and 20 seconds.
  • Multiple thread execution (running two threads per CPU): 2 minutes and 40 seconds.

You pick the winner.

Batch processing the old fashioned way

As stated in the introduction, you could process a batch of objects (or records) one at a time, by using a foreach() loop and handling the objects in that loop one at a time. Some of the advantages to this method are that it’s easier to understand the code and you don’t have to deal with threading. That having being said, the code can be perfectly understood if it’s well documented (which it should be!), and if you’re processing large amounts of data you should already be familiar with threading and should be using it by now. If you’re not, then get on it.

A simple one-batch example:


class CustomerMaintenance
{
// a collection of Custoemr objects that we need to process
List mCustomerList;

// Constructor. User passes in the collection of Customer
// objects to process.
public CustomerMaintenance (List customerList)
{
mCustomerList = customerList;
}

// Update method applies all the business transformations
// to the Custoemr objects, one at a time.
public void Update()
{
foreach (Customer customerToUpdate in mCustomerList)
{
customerToUpdate.Discount *= 1.10;
}
}
}
In this example, a list of Customer objects is passed down to a CustomerMaintenance class. The Update() method is then executed, and it just applies some changes to every single object in the list. This is a very simple example since the foreach() loop could very well contain much more complex code. It is that update code which will slow you down if you’re doing something complex.

Batch processing with concurrent threads

The biggest flaw in the example above is that it does not take advantage of multiple core CPUs, and is most definitely not scalable. If you compare the execution time between a single core and a multiple core machine running this code you might find a faster time on the latter, but it will not be a big difference. Moreover, the difference would most probably be due to a faster CPU, more RAM, or some other such factor. The biggest weapon in your arsenal - the extra CPUs - will just remain unused and untapped.

So here’s how you do it. Processing you data in concurrent threads involves several steps:


Step 1 - Assigning key values to each object

The first thing we need is the ability to refer to each object by a key value. When we launch the thread workers, we will not be passing them a batch containing all the objects to be handled. Rather, all those objects will remain in one collection visible to all the thread workers, and each worker will be working on distinct objects within this list, which they will access by a key.

Coming up with these keys is slightly different when working with objects as opposed to database records. If you’re dealing with a list of objects, you can either choose a property that will be unique amongst all those objects (such as the customer ID), or if you do not have such a property, you can simply use an incremental counter (which does not necessarily need to be a property of the objects).

To work with objects, simply declare a SortedList where the key is that unique property and the object is the type of object you will be working with. Load the SortedList with all the keys and values, and you’re set to go. This is the shared collection of data that all worker threads will be updating. Next, create a List that contains all the keys from that shared collection, and this becomes the pool from which all threads will receive their individual batches of work.

For example:


// Shared collection, containing all the objects to update
SortedList mCustomerList mCustomerList =
new SortedList();

// code to populate the dictionary with the Customer objects goes here

// Create a list of keys containing all key values to the shared
// collection. This becomes a sort of index.
List allCustomerIDs = new List(mCustomerList.Keys);
If, however, you’re dealing with database records, you will need to use a key field in your SELECT statement. Assuming you are going to issue an UPDATE on all customer records in a customer table, you should first issue a SELECT statement retrieving all the customer IDs that will be affected, and then store those IDs in a List. Just like with the object example above, this list then becomes the pool from which all threads will receive their individual batches of work. Contrary to the previous example, though, the individual threads will be updating records directly in the database, which is the equivalent of the shared collection of data used above.

Step 2 - Preparing a semaphore
A semaphore (you will need to reference System.Threading for this one) is a very simple yet crucial element. It will control how many running thread workers we currently have, and when one has finished its work and exited, the semaphore will let us know that we can launch another worker. Semaphores are quite configurable, and you can easily specify how many requests it can handle.

For example:


// This will create a semaphore that helps control
// as many thread launches as we need to.
Semaphore mSemaphore = new Semaphore(numberOfThreadsToUse, numberOfThreadsToUse)
Step 3 - Looping through the list of keys and dispatching work in batches

Now comes the fun part. We loop while the list of keys created in step 1 contains data, and do the following:
  • Wait until the semaphore has a free resource.
  • Reserve a resource in the semaphore.
  • Copy a predetermined number of items from keys list to a work list that will be passed down to the thread worker.
  • Remove those same keys from the keys list so they are not dispatched for processing a second time.
  • Launch the thread worker, passing it the work list. The thread will apply the business rules and modifications to the objects in the shared main list that are indexed by the key values passed down to it.
  • When the thread worker is done, it releases the semaphore resource (thereby enabling the launch of another thread worker) and exits.

For example:
private void UpdateAllCustomersInConcurrentBatches()
{
// retrieve the number of CPUs on this machine, and calculate the total number
// of threads we should run.
ManagementObjectSearcher managementObjectSearcher = new ManagementObjectSearcher("select * from Win32_Processor");
ManagementObjectCollection managementObjectCollection = managementObjectSearcher.Get();
int numberOfCpus = managementObjectCollection.Count;
int numberOfThreadsToUse = numberOfCpus * mMaxNumberOfThreadsPerCpu;
int batchSize = 5000;

// get a list of all the key values to process
List allCustomerIDs = new List(mCustomerList.Keys);
while (allCustomerIDs.Count > 0)
{
// make of list of customer IDs to process in the next batch
List customerIDsToProcess = allCustomerIDs.GetRange(0, System.Math.Min(batchSize, allCustomerIDs.Count));
// remove those customer IDs from the master list so they are not processed again
allCustomerIDs.RemoveRange(0, System.Math.Min(batchSize, allCustomerIDs.Count));

// wait for the semaphore to let us launch another thread
mSemaphore.WaitOne();

// launch a thread worker and give it the list of customer IDs to process
ThreadPool.QueueUserWorkItem(new WaitCallback(UpdateAllCustomersInSubBatch), customerIDsToProcess);
}

// ensure all threads have exited by waiting until we can get all the semaphore requests
for (int ctr = 0; ctr < numberOfThreadsToUse; ctr++)
{
mSemaphore.WaitOne();
}
mSemaphore.Release(numberOfThreadsToUse);
}
Step 4 - processing a batch of records in a thread worker
The method to process the records will be launched in step 3, and will receive the list of keys to work with. It will then use a foreach() loop to go through them, and using each key in the loop, access a Customer object in the shared collection and apply the appropriate business rules and changes to it.
Similarly, if you are working with database records you would use this key value to issue a SELECT statement for one record in the table, fetch it, update it, and write it back (or maybe just issue an UPDATE statement).
For example:


private void UpdateAllCustomersInSubBatch(object state)
{
try
{
List customerIDsToProcess = state as List;
foreach (long customerID in customerIDsToProcess)
{
Customer tempCustomer = mCustomerList[customerID];
// a foreach item cannot be passed down by reference, so pass
// a copy.
ApplyBusinessRulesToCustomerObject(ref tempCustomer);
Lock (mLock)
{
mCustomerList[customerID].Discount = tempCustomer.Discount;
mCustomerList[customerID].Name = tempCustomer.Name;
}
}
}
catch (Exception ex)
{
lock (mLock)
{
// An exception was raised. This thread has no access to the UI, so store the exception in
// mExceptions and get out.
mExceptions.Add(ex);
}
}
finally
{
// The work in this thread is complete. Release the semaphore request so that it can be reused to
// launch another thread worker for the next batch.
mSemaphore.Release();
}
}
A note about exceptions

An important thing to note about threads is that you have to be careful with exceptions. In a single threaded application, exceptions will be available to all the objects up the execution path, and you can therefore trap and handle them at any point.

In a multithreaded application, however, exceptions will only go up as high as the first method to execute in that thread. If you do not have any try/catch blocks to handle the exceptions before (or on) that point, you’ll get an “unhandled exception” error, and the application will stop when the exception is raised.

To get around this, the demo uses a List collection to store any and all exceptions raised during the thread’s execution. When we’re done looping through the main list of object keys, we can check this collection for content, and if there are any exceptions they can be sent back to form, logged, etc. Additionally, we could alter the main dispatching foreach() loop (described in step 3) to stop dispatching new batches of work if the exception collection’s count is greater than zero.

A note about thread safety and locking

It’s also important to note that since multiple threads will be accessing and modifying the same objects (e.g. the shared object list and the exception collection), we need to protect against corruption and ensure that only one thread at a time has access to the shared resources. To do this, we use the lock() command which ensures that only one thread at a time can access the code in the lock block and therefore only one thread at a time can update this hared resource.

In conclusion

Splitting up massive amounts of repetitive work into smaller batches and processing them in parallel is simply a must-have when it comes to processing large amounts of data. The benchmark I show in the introduction shows how you can easily save 50% of your execution time, although the savings will largely be determined by the exact type of transformations you need to make to the data, and how the code is implemented.

Still, I think this is a time saver that cannot be ignored, and I hope it helps someone out there.

A sample project is available for testing here: http://www.codeproject.com/KB/cs/FasterBatchProcessingWith.aspx




Monday, December 3, 2007

Upcoming article - speeding up processing of large amounts of data

One of the things I come across quite often is the need to process large amounts of data, apply certain business rules to it, and then write it back to the database it came from.

If the business rules to be applied are not too complex, you can either generate an SQL UPDATE statement that will do the job or use a stored procedure with that same UPDATE statement. That's your best bet.

If, however, the logic is complex and better represented in code, then you're probably stuck creating a data reader, looping through the records, and changing them one at a time. You could simply do this as described and let it take as long as it takes, or you could take advantage of threading to split the job into mutilple concurrent jobs.

Managing these worker threads through a semaphore, creating and maintaining a work queue, dispatching the work units, handling exceptions ... all this will be in the next article.