Monday, December 31, 2007

The perfect C# string enumerator

Introduction

I really love C#, and I keep kicking myself for not having gotten into it sooner. In spite of that, though, I sorely miss some things from other languages.

Take enumerators, for example. Enumerators are an amazing thing when it comes to keeping things clean and easy to maintain. You define them once, assign them their values, and use their names throughout the program. Behind the scenes, each enumeration is translated to the value you set it up with, and that's what you end up storing on disk, using for calculations, etc.

String enumerations are no different. You pick a set of names, assign them their text values, and in code you always refer to the names. When it comes time to use them for disk storage, display, etc., you refer to the enumeration's value.
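
To make the contrast concrete, here is a minimal sketch of a regular enum next to the kind of string-backed usage I'm after (CarManufacturer is just a hypothetical example; ECarManufacturers is the class built later in this article):

// A regular enum: define it once, assign the values, and use the names everywhere.
public enum CarManufacturer
{
    Toyota = 1,
    Honda = 2,
    Ford = 3
}

// The string equivalent I'm after works the same way, except that each name
// maps to a piece of text rather than an integer:
//     string name = ECarManufacturers.Toyota.ToString();   // "Toyota"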

But in C#, for whatever reason (anyone know why?), this was left out. I've looked around for quite a while, and found proposed solutions that ranged from simply creating a class with static strings to some that use reflection. They all seemed either too simple or too complex, and none gave me all the functionality an enumerator should have.

So I set out to make my own, and I think I've done it. I would say that it mimics a regular enumerator 100%, and provides exactly the same functionality, usage and behaviour as regular enumerators. I cannot think of anything that has been missed, and I'm depending on my readers to point out anything I have :)


Background

When I set out to create this, I had to sit down and think about all the ways in which an enumerator could be used, in order to make sure the string enumerator is as true to its name as possible. I've created a number of tests using NUnit, and so far so good. I can't really think of any more good tests for this, but if anyone else can, please let me know.

Usage

You can refer to the tests in the nUnit project to see the enumerator's capabilities, but here they are in brief:

  • Assigning a static enumerator's value to a string

  • Instantiating an enumerator object, setting it to a valid enumerator value, passing it as a parameter to a method, and assigning its value to a string.

  • Instantiating an enumerator object, setting it to a valid enumerator value, passing it as a parameter to a method, copying that enumerator to a new one, and assigning the new enumerator's value to a string.

  • Same as the above, but right after copying the first enumerator to the second one, we change the first one's value and ensure the second one has not been affected. Note that although we are dealing with a reference type, the enumerator class is immutable, so we are always creating a new copy when assigning it a new value. Assigning an existing object to another one is done just with the "=" sign, without a dependency on the ICloneable interface, and the copy is not a shallow one (although there isn't much depth to this class).

  • Testing enumerated strings that contain spaces

  • Assigning a valid string value to an enumerator object

  • Assigning an invalid string value to an enumerator object, and getting an exception

  • Creating two enumerator objects, and ensuring that ==, != and Equals() perform properly.

Once you create the enumerator class, you use it exactly the way you would use a regular enumerator, with one slight difference: When you want to deal with the actual string value of either the enumerator object or of one of the static enumerator values, you must use the ToString() method. I would dearly love to find a way around this, but even if I can't, I think it's a pretty small price to pay.

The EStringEnum base class

To create a string enumerator you'll need two things: the base EStringEnum class, and a class that you define (which inherits from EStringEnum) with the actual string values.

The base class exposes two items that must be overridden, as well as logic for validating string values that the user is trying to assign to an enumerator object (i.e. the value must be one of the valid enumerated strings), and some overloads for ToString() and various operators:


/// <summary>
/// A string enumerator base class. Must be inherited.
/// </summary>
public abstract class EStringEnum
{
    #region Data
    protected string mCurrentEnumValue = "";

    protected abstract string EnumName { get; }

    protected abstract List<string> PossibleEnumValues { get; }
    #endregion

    #region Constructors
    public EStringEnum() { }

    /// <summary>
    /// A string enumerator
    /// </summary>
    /// <param name="value">A valid enumerator value.
    /// An exception is raised if the value is invalid.</param>
    public EStringEnum(string value)
    {
        if (PossibleEnumValues.Contains(value))
        {
            mCurrentEnumValue = value;
        }
        else
        {
            string errorMessage = string.Format(
                "{0} is an invalid {1} enumerator value",
                value,
                EnumName);
            throw new Exception(errorMessage);
        }
    }
    #endregion

    #region Overloads
    /// <summary>
    /// Returns the enumerator's current value
    /// </summary>
    public override string ToString()
    {
        return mCurrentEnumValue;
    }

    /// <summary>
    /// Test for equality
    /// </summary>
    public static bool operator ==(EStringEnum stringEnum1, EStringEnum stringEnum2)
    {
        return (stringEnum1.ToString().Equals(stringEnum2.ToString()));
    }

    /// <summary>
    /// Test for inequality
    /// </summary>
    public static bool operator !=(EStringEnum stringEnum1, EStringEnum stringEnum2)
    {
        return (!stringEnum1.ToString().Equals(stringEnum2.ToString()));
    }

    /// <summary>
    /// Test for equality
    /// </summary>
    public override bool Equals(object o)
    {
        EStringEnum stringEnum = o as EStringEnum;
        return (this.ToString().Equals(stringEnum.ToString()));
    }

    /// <summary>
    /// Retrieve the hashcode
    /// </summary>
    public override int GetHashCode()
    {
        return base.GetHashCode();
    }
    #endregion
}
Only two things from this base class must be implemented when inheriting it:

protected abstract string EnumName { get; }
protected abstract List<string> PossibleEnumValues { get; }

EnumName must return a string with the name of the string enumerator (e.g. ECarManufacturers), and PossibleEnumValues must return a List<string> of the enumerated strings (e.g. "Ford", "Toyota", etc.).

The actual enumerator class

The enumerator must inherit from EStringEnum and do the following:
  • implement the two properties mentioned above,

  • provide static accessors for each enumerated string,

  • provide a constructor (which just calls the base constructor), and

  • provide an implicit conversion from a string type to an object type of the actual enumerator (e.g. ECarManufacturers).

It may sound complex, but as the example below shows, it really is quite simple to implement:


/// <summary>
/// A car manufacturer enumerator
/// </summary>
public class ECarManufacturers : EStringEnum
{
    #region Data
    /// <summary>
    /// Used by EStringEnum to identify the current class
    /// </summary>
    protected override string EnumName { get { return "ECarManufacturers"; } }

    protected override List<string> PossibleEnumValues
    {
        get { return mPossibleEnumValues; }
    }

    /// <summary>
    /// Complete list of string values that this enumerator can hold
    /// </summary>
    private static List<string> mPossibleEnumValues = new List<string>()
    {
        "Toyota",
        "Honda",
        "Ford",
        "Chrysler",
        "Volvo",
        "General Motors"
    };

    /// <summary>
    /// CarManufacturers type
    /// </summary>
    public static ECarManufacturers Toyota
    {
        get { return new ECarManufacturers(mPossibleEnumValues[0]); }
    }

    /// <summary>
    /// CarManufacturers type
    /// </summary>
    public static ECarManufacturers Honda
    {
        get { return new ECarManufacturers(mPossibleEnumValues[1]); }
    }

    /// <summary>
    /// CarManufacturers type
    /// </summary>
    public static ECarManufacturers Ford
    {
        get { return new ECarManufacturers(mPossibleEnumValues[2]); }
    }

    /// <summary>
    /// CarManufacturers type
    /// </summary>
    public static ECarManufacturers Chrysler
    {
        get { return new ECarManufacturers(mPossibleEnumValues[3]); }
    }

    /// <summary>
    /// CarManufacturers type
    /// </summary>
    public static ECarManufacturers Volvo
    {
        get { return new ECarManufacturers(mPossibleEnumValues[4]); }
    }

    /// <summary>
    /// CarManufacturers type
    /// </summary>
    public static ECarManufacturers GeneralMotors
    {
        get { return new ECarManufacturers(mPossibleEnumValues[5]); }
    }
    #endregion

    #region Constructor
    /// <summary>
    /// A car manufacturer enumerator
    /// </summary>
    /// <param name="value">A valid enumerator value.
    /// An exception is raised if the value is invalid.</param>
    private ECarManufacturers(string value) : base(value)
    { }
    #endregion

    #region Misc methods
    /// <summary>
    /// Implicitly convert a string to a CarManufacturers object
    /// </summary>
    /// <param name="value">A string value to convert to an ECarManufacturers enum value.
    /// An exception is raised if the value is invalid.</param>
    public static implicit operator ECarManufacturers(string value)
    {
        return new ECarManufacturers(value);
    }

    /// <summary>
    /// Implicitly convert an ECarManufacturers object to a string
    /// </summary>
    /// <param name="carManufacturers">An ECarManufacturers object whose value is to
    /// be returned as a string</param>
    public static implicit operator string(ECarManufacturers carManufacturers)
    {
        return carManufacturers.ToString();
    }
    #endregion
}
Examples

The following tests are included in the nUnit test project (available in the download), but I'm posting them here as well to give you a quick look at how true-to-life this implementation is:

[TestFixture]
public class TestClass1
{
    [Test]
    public void Test01()
    {
        string result = ECarManufacturers.GeneralMotors.ToString();
        string expected = "General Motors";
        Assert.IsTrue(result == expected);
    }

    [Test]
    public void Test02()
    {
        ECarManufacturers carManufacturer = ECarManufacturers.Honda;
        string result = carManufacturer.ToString();
        string expected = "Honda";
        Assert.IsTrue(result == expected);
    }

    [Test]
    public void Test03()
    {
        Test03A(ECarManufacturers.Ford);
    }

    private void Test03A(ECarManufacturers carManufacturer)
    {
        string expected = "Ford";
        string result = carManufacturer.ToString();
        Assert.IsTrue(result == expected);
    }

    [Test]
    public void Test04()
    {
        Test04A(ECarManufacturers.Ford);
    }

    private void Test04A(ECarManufacturers carManufacturer)
    {
        ECarManufacturers tempCarManufacturers2 = carManufacturer;
        string result = tempCarManufacturers2.ToString();
        string expected = "Ford";
        Assert.IsTrue(result == expected);
    }

    [Test]
    public void Test05()
    {
        Test05A(ECarManufacturers.Ford);
    }

    private void Test05A(ECarManufacturers carManufacturer)
    {
        ECarManufacturers carManufacturer2 = carManufacturer;
        carManufacturer = ECarManufacturers.Chrysler;
        string expected1 = "Chrysler";
        string result1 = carManufacturer.ToString();
        string expected2 = "Ford";
        string result2 = carManufacturer2.ToString();
        Assert.IsTrue(result1 == expected1 && result2 == expected2);
    }

    [Test]
    public void Test06()
    {
        ECarManufacturers tempCarManufacturers = "Ford";
        string result = tempCarManufacturers.ToString();
        string expected = "Ford";
        Assert.IsTrue(result == expected);
    }

    [Test]
    public void Test07()
    {
        ECarManufacturers tempCarManufacturers2 = null;
        try
        {
            tempCarManufacturers2 = "Orion";
            Assert.Fail();
        }
        catch (Exception ex)
        {
            Assert.IsTrue(ex.Message.Equals(
                "Orion is an invalid ECarManufacturers enumerator value"));
        }
    }

    [Test]
    public void Test08()
    {
        ECarManufacturers tempCarManufacturers = "General Motors";
        Assert.IsTrue(tempCarManufacturers.ToString().Equals(
            ECarManufacturers.GeneralMotors.ToString()));
    }

    [Test]
    public void Test09()
    {
        ECarManufacturers carManufacturers1 = ECarManufacturers.GeneralMotors;
        ECarManufacturers carManufacturers2 = ECarManufacturers.GeneralMotors;
        Assert.IsTrue(carManufacturers1 == carManufacturers2);
    }

    [Test]
    public void Test10()
    {
        ECarManufacturers carManufacturers1 = ECarManufacturers.GeneralMotors;
        ECarManufacturers carManufacturers2 = ECarManufacturers.Ford;
        Assert.IsTrue(carManufacturers1 != carManufacturers2);
    }

    [Test]
    public void Test11()
    {
        ECarManufacturers carManufacturers1 = ECarManufacturers.GeneralMotors;
        int number = carManufacturers1.GetHashCode();
    }

    [Test]
    public void Test12()
    {
        ECarManufacturers carManufacturers1 = ECarManufacturers.GeneralMotors;
        string value = carManufacturers1;
        ECarManufacturers carManufacturers2 = carManufacturers1;
        string value2 = carManufacturers2;
        Assert.IsTrue(value.Equals("General Motors"));
    }
}
In conclusion

Not a very complex solution when you think about it, and I think it covers all the behaviours an enumerator is expected to have. When it comes time to use the enumerator, you use it exactly like you would use a regular enumerator. If you can think of anything I've missed or can improve upon, feel free to tell me.

Something I'd like to do to improve on this:
  • Extract as much logic as possible and put it in the base class (such as the static accessors). I could use something like Spring.Net to inject the accessors at runtime, but this would have two major drawbacks: intellisense would no longer work, and I would be adding a dependency to the project.

A downloadable copy of the source code is available here: http://www.codeproject.com/KB/cs/csharpstringenumerator.aspx

History
December 31 2007 - Initial post (happy new year!).
January 01 2008 - Added examples.

Saturday, December 29, 2007

The search for string enumerators

Of all the languages I've worked with so far, C# is by far my favourite. My list of reasons for this is quite long, but in spite of all that, I still have things I enjoyed very much in other languages that C# lacks. Simple examples are Cobol's EVALUATE statement and Visual Basic's string enumerators.

Under the hood, the EVALUATE statement is the equivalent of a series of else if statements, except that the debugger does a very elegant job of jumping straight to the else path that is true instead of stepping through every single one until it finds one that it likes. Not a big deal, but certainly something I miss.

The string enumerator, on the other hand, is sorely missed. I have yet to find a good way to implement one in C#, and so am embarking on a mission. I've seen several suggestions out there, ranging from simple ones that use string formatting to extract the names of an integer enumerator to more complex ones that use reflection.

Suffice it to say that I didn't like any of them. Too simple, too inelegant, etc. There has to be a good way out of this that will not be too complex to implement, will be elegant to code and use, and will resemble normal enumerators as much as possible (preferably 100%).

Friday, December 28, 2007

If you need to do asynchronous work, this is for you ...

I read a good article by Mike Peretz on CodeProject about delegates and asynchronous method invocation. It's a good thing to know, especially if you have work that takes a while to complete, have a UI that needs to remain responsive while doing some heavy work (although you should probably be using an MVP pattern - which takes care of this - if your app has a front end and complex business logic), etc.

You can read the article here: http://www.codeproject.com/KB/cs/AsyncMethodInvocation.aspx

Thursday, December 27, 2007

Excellent article on distributed computing

I came across this article by Daniel Vaughan on CodeProject about making your own distributed computing application, similar to what BOINC does with SETI@home, only through a browser. Well written and pretty interesting. It should be a good read for anyone interested in this stuff, or anyone just looking for something new and cool.

Check it out at http://www.codeproject.com/KB/silverlight/gridcomputing.aspx.

Wednesday, December 26, 2007

Ok, seems like it's all fixed now.

These templates can be a pain to work with, but it seems to be over now. On to the next article, then!

Major blog issues ...

I have no idea why, but blogspot has serious issues with their templates. The site's proper layout will be up and running in the next day or two.

Monday, December 10, 2007

Faster batch processing of large amounts of data

Introduction

I frequently come across the need to process large amounts (i.e. in the hundreds of thousands or even millions) of repetitive pieces of data, such as records from a database table or a collection of objects, apply some business transformation, and then save them (e.g. rewrite them to the database, generate a report, etc.).

In a simple database example you could just write an SQL statement that does an UPDATE of the affected records based on certain selection criteria (as specified by the WHERE clause). This is a pretty limited way of doing things, however, and if you start getting into complex situations you might end up using cursors, which are a whole other can of worms. So if you perform the update in code instead, you are obliged to get a DataReader, read the records one by one, apply your business logic, and then rewrite the results. For various reasons, this is very time consuming.

A non-database related example could be a collection holding several million objects that all need to be manipulated somehow. You will undoubtedly use a foreach() loop and handle the objects one by one, but this again is an inefficient use of CPU power. You can do better.

This article describes a very efficient and scalable mechanism for processing such large amounts of data. It uses multiple concurrent threads to handle small batches of work, and can take advantage of multi-core CPUs by creating more threads depending on how many cores there are.

A demo project is available at the codeproject. It creates 5,000,000 customer objects that hold basic information such as an ID and a name, updates a discount percentage for each, and reverses the customer name 50 times (just so we have something time consuming to do). It was run on my machine (which has a quad-core CPU) for a benchmark and produced the following processing times:

  • Single thread execution: 5 minutes and 20 seconds.
  • Multiple thread execution (running two threads per CPU): 2 minutes and 40 seconds.

You pick the winner.

Batch processing the old fashioned way

As stated in the introduction, you could process a batch of objects (or records) one at a time, by using a foreach() loop and handling the objects in that loop one at a time. Some of the advantages to this method are that it’s easier to understand the code and you don’t have to deal with threading. That having been said, the code can be perfectly understood if it’s well documented (which it should be!), and if you’re processing large amounts of data you should already be familiar with threading and should be using it by now. If you’re not, then get on it.

A simple one-batch example:


class CustomerMaintenance
{
    // a collection of Customer objects that we need to process
    List<Customer> mCustomerList;

    // Constructor. User passes in the collection of Customer
    // objects to process.
    public CustomerMaintenance(List<Customer> customerList)
    {
        mCustomerList = customerList;
    }

    // Update method applies all the business transformations
    // to the Customer objects, one at a time.
    public void Update()
    {
        foreach (Customer customerToUpdate in mCustomerList)
        {
            customerToUpdate.Discount *= 1.10;
        }
    }
}
In this example, a list of Customer objects is passed down to a CustomerMaintenance class. The Update() method is then executed, and it just applies some changes to every single object in the list. This is a very simple example since the foreach() loop could very well contain much more complex code. It is that update code which will slow you down if you’re doing something complex.

Batch processing with concurrent threads

The biggest flaw in the example above is that it does not take advantage of multi-core CPUs, and is most definitely not scalable. If you compare the execution time between a single-core and a multi-core machine running this code you might find a faster time on the latter, but it will not be a big difference. Moreover, the difference would most probably be due to a faster CPU, more RAM, or some other such factor. The biggest weapon in your arsenal - the extra cores - will just remain unused and untapped.

So here’s how you do it. Processing your data in concurrent threads involves several steps:


Step 1 - Assigning key values to each object

The first thing we need is the ability to refer to each object by a key value. When we launch the thread workers, we will not be passing them a batch containing all the objects to be handled. Rather, all those objects will remain in one collection visible to all the thread workers, and each worker will be working on distinct objects within this list, which they will access by a key.

Coming up with these keys is slightly different when working with objects as opposed to database records. If you’re dealing with a list of objects, you can either choose a property that will be unique amongst all those objects (such as the customer ID), or if you do not have such a property, you can simply use an incremental counter (which does not necessarily need to be a property of the objects).

To work with objects, simply declare a SortedList<TKey, TValue> where the key is that unique property and the value is the type of object you will be working with. Load the SortedList with all the keys and values, and you’re set to go. This is the shared collection of data that all worker threads will be updating. Next, create a List<TKey> that contains all the keys from that shared collection; this becomes the pool from which all threads will receive their individual batches of work.

For example:


// Shared collection, containing all the objects to update
SortedList<long, Customer> mCustomerList =
    new SortedList<long, Customer>();

// code to populate the collection with the Customer objects goes here

// Create a list of keys containing all key values to the shared
// collection. This becomes a sort of index.
List<long> allCustomerIDs = new List<long>(mCustomerList.Keys);
If, however, you’re dealing with database records, you will need to use a key field in your SELECT statement. Assuming you are going to issue an UPDATE on all customer records in a customer table, you should first issue a SELECT statement retrieving all the customer IDs that will be affected, and then store those IDs in a List. Just like with the object example above, this list then becomes the pool from which all threads will receive their individual batches of work. Contrary to the previous example, though, the individual threads will be updating records directly in the database, which is the equivalent of the shared collection of data used above.
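
Just to illustrate that, here is a minimal sketch of building the key list from the database; the mConnectionString field, table and column names are assumptions made up for the example, and it needs a using directive for System.Data.SqlClient:

// Retrieve only the key values of the records that will be updated.
// The records themselves stay in the database; the thread workers
// will update them directly using these keys.
List<long> allCustomerIDs = new List<long>();
using (SqlConnection connection = new SqlConnection(mConnectionString))
{
    connection.Open();
    using (SqlCommand command = new SqlCommand(
        "SELECT CustomerID FROM Customers WHERE Discount > 0", connection))
    using (SqlDataReader reader = command.ExecuteReader())
    {
        while (reader.Read())
        {
            allCustomerIDs.Add(reader.GetInt64(0));
        }
    }
}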

Step 2 - Preparing a semaphore
A semaphore (you will need a using directive for System.Threading for this one) is a very simple yet crucial element. It will control how many thread workers are currently running, and when one has finished its work and exited, the semaphore will let us know that we can launch another worker. Semaphores are quite configurable, and you can easily specify how many requests they can handle.

For example:


// This will create a semaphore that lets us control
// how many threads we launch at a time.
Semaphore mSemaphore = new Semaphore(numberOfThreadsToUse, numberOfThreadsToUse);
Step 3 - Looping through the list of keys and dispatching work in batches

Now comes the fun part. We loop while the list of keys created in step 1 contains data, and do the following:
  • Wait until the semaphore has a free resource.
  • Reserve a resource in the semaphore.
  • Copy a predetermined number of items from the keys list to a work list that will be passed down to the thread worker.
  • Remove those same keys from the keys list so they are not dispatched for processing a second time.
  • Launch the thread worker, passing it the work list. The thread will apply the business rules and modifications to the objects in the shared main list that are indexed by the key values passed down to it.
  • When the thread worker is done, it releases the semaphore resource (thereby enabling the launch of another thread worker) and exits.

For example:
private void UpdateAllCustomersInConcurrentBatches()
{
    // Retrieve the number of CPUs on this machine, and calculate the total number
    // of threads we should run. (The WMI query requires a reference to System.Management.)
    ManagementObjectSearcher managementObjectSearcher =
        new ManagementObjectSearcher("select * from Win32_Processor");
    ManagementObjectCollection managementObjectCollection = managementObjectSearcher.Get();
    int numberOfCpus = managementObjectCollection.Count;
    int numberOfThreadsToUse = numberOfCpus * mMaxNumberOfThreadsPerCpu;
    int batchSize = 5000;

    // get a list of all the key values to process
    List<long> allCustomerIDs = new List<long>(mCustomerList.Keys);
    while (allCustomerIDs.Count > 0)
    {
        // make a list of customer IDs to process in the next batch
        List<long> customerIDsToProcess =
            allCustomerIDs.GetRange(0, System.Math.Min(batchSize, allCustomerIDs.Count));
        // remove those customer IDs from the master list so they are not processed again
        allCustomerIDs.RemoveRange(0, System.Math.Min(batchSize, allCustomerIDs.Count));

        // wait for the semaphore to let us launch another thread
        mSemaphore.WaitOne();

        // launch a thread worker and give it the list of customer IDs to process
        ThreadPool.QueueUserWorkItem(new WaitCallback(UpdateAllCustomersInSubBatch), customerIDsToProcess);
    }

    // ensure all threads have exited by waiting until we can acquire all the semaphore slots
    for (int ctr = 0; ctr < numberOfThreadsToUse; ctr++)
    {
        mSemaphore.WaitOne();
    }
    mSemaphore.Release(numberOfThreadsToUse);
}
Step 4 - Processing a batch of records in a thread worker

The method to process the records is launched in step 3, and receives the list of keys to work with. It then uses a foreach() loop to go through them and, using each key in the loop, accesses a Customer object in the shared collection and applies the appropriate business rules and changes to it.

Similarly, if you are working with database records, you would use this key value to issue a SELECT statement for one record in the table, fetch it, update it, and write it back (or maybe just issue an UPDATE statement); a sketch of that variant follows the example below.
For example:


private void UpdateAllCustomersInSubBatch(object state)
{
    try
    {
        List<long> customerIDsToProcess = state as List<long>;
        foreach (long customerID in customerIDsToProcess)
        {
            Customer tempCustomer = mCustomerList[customerID];
            // a foreach item cannot be passed down by reference, so pass
            // a copy.
            ApplyBusinessRulesToCustomerObject(ref tempCustomer);
            lock (mLock)
            {
                mCustomerList[customerID].Discount = tempCustomer.Discount;
                mCustomerList[customerID].Name = tempCustomer.Name;
            }
        }
    }
    catch (Exception ex)
    {
        lock (mLock)
        {
            // An exception was raised. This thread has no access to the UI, so store the exception in
            // mExceptions and get out.
            mExceptions.Add(ex);
        }
    }
    finally
    {
        // The work in this thread is complete. Release the semaphore request so that it can be reused to
        // launch another thread worker for the next batch.
        mSemaphore.Release();
    }
}
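
And for the database-records variant mentioned in the step 4 description, the worker's loop issues its updates directly against the database rather than touching a shared collection. A minimal sketch, assuming a hypothetical Customers table with a CustomerID key column, a mConnectionString field, and a using directive for System.Data.SqlClient:

private void UpdateAllCustomersInSubBatch(object state)
{
    try
    {
        List<long> customerIDsToProcess = state as List<long>;
        // Each thread opens its own connection; connections are not shared across threads.
        using (SqlConnection connection = new SqlConnection(mConnectionString))
        {
            connection.Open();
            foreach (long customerID in customerIDsToProcess)
            {
                // Apply the business transformation directly in the database,
                // one record (key) at a time.
                using (SqlCommand command = new SqlCommand(
                    "UPDATE Customers SET Discount = Discount * 1.10 WHERE CustomerID = @id",
                    connection))
                {
                    command.Parameters.AddWithValue("@id", customerID);
                    command.ExecuteNonQuery();
                }
            }
        }
    }
    catch (Exception ex)
    {
        lock (mLock)
        {
            mExceptions.Add(ex);
        }
    }
    finally
    {
        // Free the semaphore slot so the dispatcher can launch the next batch.
        mSemaphore.Release();
    }
}
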
A note about exceptions

An important thing to note about threads is that you have to be careful with exceptions. In a single threaded application, exceptions will be available to all the objects up the execution path, and you can therefore trap and handle them at any point.

In a multithreaded application, however, exceptions will only go up as high as the first method to execute in that thread. If you do not have any try/catch blocks to handle the exceptions before (or on) that point, you’ll get an “unhandled exception” error, and the application will stop when the exception is raised.

To get around this, the demo uses a List<Exception> collection to store any and all exceptions raised during the thread’s execution. When we’re done looping through the main list of object keys, we can check this collection for content, and if there are any exceptions they can be sent back to the form, logged, etc. Additionally, we could alter the main dispatching loop (described in step 3) to stop dispatching new batches of work if the exception collection’s count is greater than zero.
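
As a rough sketch of what that could look like once the dispatch loop and the final semaphore waits have completed (mExceptions being the shared List<Exception>, and StringBuilder needing a using directive for System.Text):

// Inside the dispatching loop in step 3, stop handing out new batches
// as soon as a worker has reported a problem:
//     if (mExceptions.Count > 0) break;

// Once everything has finished, surface whatever the workers caught.
if (mExceptions.Count > 0)
{
    StringBuilder report = new StringBuilder();
    foreach (Exception ex in mExceptions)
    {
        report.AppendLine(ex.Message);
    }
    // Send it back to the form, write it to a log, rethrow, etc.
    throw new Exception("One or more batches failed:\r\n" + report.ToString());
}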

A note about thread safety and locking

It’s also important to note that since multiple threads will be accessing and modifying the same objects (e.g. the shared object list and the exception collection), we need to protect against corruption and ensure that only one thread at a time has access to the shared resources. To do this, we use the lock() statement, which ensures that only one thread at a time can execute the code in the lock block, and therefore only one thread at a time can update the shared resource.

In conclusion

Splitting up massive amounts of repetitive work into smaller batches and processing them in parallel is simply a must-have when it comes to handling large amounts of data. The benchmark in the introduction shows how you can easily save 50% of your execution time, although the savings will largely be determined by the exact type of transformations you need to make to the data, and how the code is implemented.

Still, I think this is a time saver that cannot be ignored, and I hope it helps someone out there.

A sample project is available for testing here: http://www.codeproject.com/KB/cs/FasterBatchProcessingWith.aspx




Monday, December 3, 2007

Upcoming article - speeding up processing of large amounts of data

One of the things I come across quite often is the need to process large amounts of data, apply certain business rules to it, and then write it back to the database it came from.

If the business rules to be applied are not too complex, you can either generate an SQL UPDATE statement that will do the job or use a stored procedure with that same UPDATE statement. That's your best bet.

If, however, the logic is complex and better represented in code, then you're probably stuck creating a data reader, looping through the records, and changing them one at a time. You could simply do this as described and let it take as long as it takes, or you could take advantage of threading to split the job into multiple concurrent jobs.

Managing these worker threads through a semaphore, creating and maintaining a work queue, dispatching the work units, handling exceptions ... all this will be in the next article.

Thursday, November 29, 2007

Dynamically creating DSN entries

There's a good article about creating dynamic DSN entries, which can be put together with the blog I wrote about copying tables to MS Access. This way you don't need to create the DSN manually, and can also clean up when the copy is done by automatically removing the DSN entry.

The article is here:
http://www.codeproject.com/cs/database/DSNAdmin.asp

Tuesday, November 27, 2007

How to quickly copy tables from an ODBC source to MS Access within a C# project

Introduction

In a recent project I worked on, I ran across a pretty serious problem that seemingly had no solution. The problem was simple: I needed to copy a fairly large amount of data from a table in an SQL Server database to a table in an MS Access database. No matter which approach I followed, however, the speeds at which the copy was being executed were far too low and simply unacceptable.

My frame of reference was MS Access itself, which has the ability to import tables from an ODBC source. When I compared the time it took to perform this import, it consistently beat the other tests by a wide margin of at least twenty to one.

It sounds huge. I know. I do not have the exact times it took for each of the tests, but they were all much longer than doing a simple import through MS Access. The comparison tests I conducted were:

  • Using Enterprise Manager
  • Using SQL Server Management Studio
  • Using SQL Server Integration Services
  • Using BCP (although this can only export to a text file or SQL Server binary file). The BCP process alone was able to match the time record achieved by the MS Access import, but it was useless since it could not export to MS Access.

The ultimate solution, then, was to find a way to have MS Access perform the copy all the while running within a C# project. The biggest problem with this was that including support for MS Access in a .NET project automatically means including COM, and a lot of people cringe at the thought. This article, then, discusses the implementation of an SQL server to MS Access copy process that is performed by MS Access, runs within a C# project, and minimizes the impact COM has on the overall application by using late binding.

Embedding a COM object in .NET

It isn’t pretty, but it’s possible. It will make your setup and installation more complex, but will add functionality that may otherwise not be available from within the .NET framework. The bottom line is that it is up to you on whether you want to include COM support in your project, and you’ll have to weigh the pros and cons and then decide.

In my case, there wasn’t much of a choice. I could either leave my project spic and span and let the copy take almost an hour, or find an elegant way to implement COM and shrink the copy process down to a couple of minutes. The choice was pretty clear.

There are basically two ways to use a COM object in your project:

Early binding

In this case, you add a reference to the COM object in question to your project, and benefit from the fact that Visual Studio can (in some cases) offer intellisense on the classes, methods and properties of this object. For screen controls, you also get the ability to add them to your forms in the screen designer. On the down side, however, you can no longer just copy the project files from your bin\debug or bin\release folder to another machine and run the program. You now need a setup project that will copy and register the COM controls on the target machine as well as copy your project files.

Late binding

If you are guaranteed that the COM object in question is already installed on the target computer, and can do away with such niceties as intellisense, you can write your code so that it hooks up to the already-installed COM control and use it in pretty much the same way. The down side, obviously, is that you are not guaranteed your code will always work (because you cannot be 100% certain the COM object is already installed on the target machine), and you will not have access to intellisense or the screen designer.

Importing a table

When importing a table into MS Access, the application lets you do so by going to File\Get external data\Import. You then choose a source such as an ODBC connection, select the database and table, and that’s it. The table gets copied over (albeit without any keys or indexes, which you then have to rebuild).

This functionality is very much available through COM automation, and can therefore be used in a C# project. The problem is that the code is only available from within an Access module (although I’m sure one of the readers will find a way to access this function directly from C#). So to get to it, you need to write a VB function that will import a table. The C# project will then need to load the Access database and execute the function, which in turn performs the import.

The only drawback to this method is that it is not asynchronous (although you could add threading on the C# side) and has no way to indicate progress.

The VB function you’ll need to put into MS Access is quite simple, and basically calls the TransferDatabase method by passing it a DSN (pointing to the source database), a source table name and target table name. The code is as follows:

Public Function Import(dsnName As String, sourceTableName As String, targetTableName As String)
    ' If the table already exists, delete it.
    On Error GoTo CopyTable
    DoCmd.DeleteObject acTable, targetTableName
CopyTable:
    DoCmd.TransferDatabase _
        acImport, _
        "ODBC Database", _
        "ODBC;DSN=" + dsnName, _
        acTable, _
        sourceTableName, _
        targetTableName
End Function
And then the C# code:
object accessObject = null;
try
{
    accessObject = Activator.CreateInstance(Type.GetTypeFromProgID("Access.Application"));

    accessObject.GetType().InvokeMember(
        "OpenCurrentDatabase",
        System.Reflection.BindingFlags.Default | System.Reflection.BindingFlags.InvokeMethod,
        null,
        accessObject,
        new Object[] { "AccessDbase.mdb" });

    accessObject.GetType().InvokeMember(
        "Run",
        System.Reflection.BindingFlags.Default | System.Reflection.BindingFlags.InvokeMethod,
        null,
        accessObject,
        new Object[] { "Import", "DSN Name", "Source table name", "Target table name" });

    accessObject.GetType().InvokeMember(
        "CloseCurrentDatabase",
        System.Reflection.BindingFlags.Default | System.Reflection.BindingFlags.InvokeMethod,
        null,
        accessObject,
        null);

    MessageBox.Show("Copy succeeded.");
}
catch (Exception ex)
{
    string message = ex.Message;
    while (ex.InnerException != null)
    {
        ex = ex.InnerException;
        message += "\r\n----\r\n" + ex.Message;
    }
    MessageBox.Show(message);
}
finally
{
    if (accessObject != null)
    {
        System.Runtime.InteropServices.Marshal.ReleaseComObject(accessObject);
        accessObject = null;
    }
}
All we’re doing here is creating an instance of Access.Application, opening a database, executing the VB module to perform the import, closing the database and cleaning up.

Note that for the copy to work, you need to specify several pieces of information:


  • The fully qualified name of the Access database
  • The name of an ODBC DSN created in your control panel that points to the source database (i.e. where the table will be copied from)
  • The name of the table as it appears in the source database
  • The name you want to assign to the table once it’s been copied (can be the same as the source).

Finally, it should be noted that the copy process creates the table with no keys or indexes, so these will have to be rebuilt after the copy is complete. This is very easily achieved, however, by executing SQL "ALTER TABLE" commands through ADO. There are plenty of articles out there that describe this so I won’t get into it here.
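
As a rough illustration of that rebuild step, here is a minimal sketch assuming an OLE DB connection to the copied database and hypothetical table, key and index names (it needs a using directive for System.Data.OleDb):

using (OleDbConnection connection = new OleDbConnection(
    @"Provider=Microsoft.Jet.OLEDB.4.0;Data Source=AccessDbase.mdb"))
{
    connection.Open();

    // Recreate the primary key on the freshly copied table.
    using (OleDbCommand addKey = new OleDbCommand(
        "ALTER TABLE Customers ADD CONSTRAINT PK_Customers PRIMARY KEY (CustomerID)",
        connection))
    {
        addKey.ExecuteNonQuery();
    }

    // Recreate any secondary indexes the source table had.
    using (OleDbCommand addIndex = new OleDbCommand(
        "CREATE INDEX IX_Customers_Name ON Customers (Name)", connection))
    {
        addIndex.ExecuteNonQuery();
    }
}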

Credits and further info
This article may not seem very big, but it is the culmination of a massive headache and a lot of needle hunting (in a haystack of needles) on the net. I found several articles out there that helped out, and these are some of them:
http://www.thescripts.com/forum/thread255310.html - executing an MS Access module from C#
http://support.microsoft.com/kb/306683 - executing MS Access modules from within C#. Note that the method described here uses early binding, which means you’re adding COM references to your project and you are therefore complicating your setup/installation package.
http://www.codeproject.com/cs/database/mdbcompact_latebind.asp - very good article by Alexander Yumashev describing how to compact and repair an MS Access database from within C#. More importantly, it overcomes the problem in the Microsoft article mentioned above by using late binding.

In conclusion
MS Access is being put out to pasture for many good reasons, but it’s still very much in use. So if you’re one of those people who still has to copy large amounts of data from another database source to MS Access, needs to do it in a C# project, and wants a clean way to do it, then I think this article will help. If you don’t fall into that category, though, then I hope you at least had an interesting read and will keep this in the back of your mind for future use.

A sample project is available for testing here: http://www.codeproject.com/useritems/MSAccessTableCopyInCSharp.asp

Wednesday, November 21, 2007

A very fast table copy from SQL Server to MS Access

New article coming up ...

I've recently fought (and won) a battle to get the fastest possible copy of a table from SQL Server to MS Access within a C# project. I tried the usual things like SSIS and ADO.NET, but the fastest thing was the import utility within Access itself. I finally got it to work properly, and it flies (it's an average of 20 - yes, 20 - times faster than copying through SSIS or other widely used methods). I'll post the article in the coming days.

Sunday, November 18, 2007

A C# central logging mechanism using the Observer and Singleton patterns

Introduction

Logging is something any respectable application needs to do. Whether it's to keep track of what the application is doing under the hood (for possible troubleshooting), to show a list of events that we might need to inform the user of, or simply to keep a history of events, logging is as necessary to the programming world as note taking is to a classroom. If you don't do it, you might be able to get by for the moment. Just hope you never need to refer back to something that happened a few minutes, hours or days ago.

This article shows an excellent mechanism for adding logging to a C# project. It is extremely powerful and versatile, yet very simple to implement and use. It uses two very common programming patterns: the singleton and the observer.

Background

The ideas and code shown in here are not rocket science, so as long as you have at least a basic idea of object oriented programming, and have been playing around with C# for a while, I'd say you're safe.

This having been said, it is always a good idea to brush up on a couple of things if you want to get the most out of this. I would suggest reading up on patterns, and focusing on some of the more common ones such as the Singleton and Observer.

To keep it brief and concise, however, I'll delve into these two for a bit and give you an idea of what they're about.

The Singleton pattern

In layman's terms ...

Probably the most widely known and one of the simplest patterns, the singleton is a mechanism that ensures all objects in your process have access to a shared and single set of data and methods.

Imagine an intersection where several roads meet, and where the cars need to ensure they do not crash into each other when crossing this intersection. Assuming there were no traffic lights to guide them, we would need a police officer standing in the middle to direct traffic. This officer would be a real-life approximation of a singleton.

The officer would receive visual cues from all roads, telling him which one is filling up the most with cars. With all this information, he would be able to decide which road to let cars come through from, and which roads to block.

This would be significantly more complex if there were several officers controlling the intersection, where they would all be receiving visual cues from the roads, would have to communicate with each other and then collectively decide which road to let traffic come from.

Similarly, a singleton object is something shared by all the objects in the system. For the purposes of the logging mechanism, the singleton is the one and only object to which all other objects will send the information they wish to have logged.

Not only does this centralize and simplify control of the logging mechanism, this also gives the developer an excellent way to provide uniform formatting to the log output. You can add timestamps, titles, parameter information and more in just one place. None of the objects that need information to be logged need be concerned with this, as the singleton logger takes care of it on its own.

Putting this in code ...

When a regular class needs to be used, it is instantiated as an object and then used. A singleton class, however, does not allow anyone except itself to instantiate it. This guarantees that only one copy of this object will ever exist, and all objects in the program will be accessing this one copy.

To accomplish this, the Logger has only one private constructor and a private static variable of its own type. The only way to get hold of this object from the outside is through the static Instance property exposed by the class. The Instance property checks to see if the private mLogger object was ever created, and if it was not, this is the only place where Logger will ever get instantiated.

class Logger
{
    // The lock object must be static (and created up front) so it can be used
    // from the static Instance property before any Logger exists.
    private static object mLock = new object();
    private static Logger mLogger = null;

    // the public Instance property everyone uses to access the Logger
    public static Logger Instance
    {
        get
        {
            // If this is the first time we're referring to the
            // singleton object, the private variable will be null.
            if (mLogger == null)
            {
                // for thread safety, lock an object when
                // instantiating the new Logger object. This prevents
                // other threads from performing the same block at the
                // same time.
                lock (mLock)
                {
                    // Two or more threads might have found a null
                    // mLogger and are therefore trying to create a
                    // new one. One thread will get to lock first, and
                    // the other one will wait until mLock is released.
                    // Once the second thread can get through, mLogger
                    // will have already been instantiated by the first
                    // thread so test the variable again.
                    if (mLogger == null)
                    {
                        mLogger = new Logger();
                    }
                }
            }
            return mLogger;
        }
    }

    // the constructor. usually public, this time it is private to ensure
    // no one except this class can use it.
    private Logger()
    {
    }
}

public class SomeWorkerClass
{
    // any class wanting to use the Logger just has to create a
    // Logger object by pointing it to the one and only Logger
    // instance.
    private Logger mLogger = Logger.Instance;
}
The Observer pattern

In layman's terms again ...

This is where things get a bit more complex, but a bit more fun as well. The observer pattern is simply a mechanism where one object (such as our police officer from the example above) has the ability to dispatch a message to one or more observers. What they do with this message is completely up to them, and in fact, the officer doesn't even know who they are or what they do. He just knows how many observers there are, and has a predefined method for giving them this information.

Going back to the traffic example: whenever the officer needs to make traffic on one road stop moving, he shows his open palm in that road's direction. He is in effect sending all traffic, which is observing him for input, a message. The traffic facing him will interpret this message and stop, and all other traffic will interpret the message and subsequently ignore it, as it has no effect on them.

Next, he waves to traffic on another road, indicating that the drivers should start moving. Again, he has sent another message, and all the drivers will receive it in the same way, but act on it in different ways. Those facing him will start moving, and all others will simply ignore the message.

So far, we have one observed subject (the officer) and several observers (the drivers). The beauty of this pattern, however, is that there could be observers of other types as well, and as long as they can receive the officer's messages in the same way as the drivers (i.e. as long as they have eyesight and are looking at him), they can act on the messages as well. A simple example would be for the officer to actually be a cadet in training, and a senior officer to be sitting in his car watching this cadet and taking notes of the stop and go messages. Again, the senior officer is observing the cadet and receiving his messages in the same way as the drivers, but is acting on them in a different manner.

In programming terms, the officer would be sending his messages to the observers via an interface. If a given object implements an interface, any other object can interact with it through the properties and methods exposed in that interface without even knowing what the true nature of the object really is.

Putting this in code as well ...

The first part of this pattern is the use of an interface which will allow the Logger to dispatch new log entries to the objects observing it. The beauty of this is that those objects could be of literally any type and offer all sorts of functionality. They could be forms, file writers, database writers, etc. But the Logger will only know that they implement this interface and will be able to communicate with them through it.


interface ILogger
{
    void ProcessLogMessage(string logMessage);
}
Next is the management of these observers within Logger. We need an array (or List<>) to store them (technically speaking, though, we're only storing a reference to them), and a public method by which they can be added to the array:

class Logger
{
    private List<ILogger> mObservers;

    private Logger()
    {
        mObservers = new List<ILogger>();
    }

    public void RegisterObserver(ILogger observer)
    {
        if (!mObservers.Contains(observer))
        {
            mObservers.Add(observer);
        }
    }
}
Whenever we want to implement a new observer, we just need to ensure it implements the ILogger interface and does something meaningful when the ProcessLogMessage method is executed. An example would be a FileLogger object that writes the log messages to a file:

class FileLogger : ILogger
{
    private string mFileName;
    private StreamWriter mLogFile;

    public string FileName
    {
        get
        {
            return mFileName;
        }
    }

    public FileLogger(string fileName)
    {
        mFileName = fileName;
    }

    public void Init()
    {
        mLogFile = new StreamWriter(mFileName);
    }

    public void Terminate()
    {
        mLogFile.Close();
    }

    public void ProcessLogMessage(string logMessage)
    {
        // FileLogger implements the ProcessLogMessage method by
        // writing the incoming message to a file.
        mLogFile.WriteLine(logMessage);
    }
}
This class would then be instantiated and passed to Logger as follows:

public partial class Form1 : Form
{
    // the main Logger object
    private Logger mLogger;
    // a logger observer that will write the log entries to a file
    private FileLogger mFileLogger;

    private void Form1_Load(object sender, EventArgs e)
    {
        // instantiate the logger
        mLogger = Logger.Instance;
        // instantiate the log observer that will write to disk
        mFileLogger = new FileLogger(@"c:\temp\log.txt");
        mFileLogger.Init();
        // Register mFileLogger as a Logger observer.
        mLogger.RegisterObserver(mFileLogger);
    }

    private void Form1_FormClosing(object sender, FormClosingEventArgs e)
    {
        // The application is shutting down, so ensure the file
        // logger closes the file it's been writing to.
        mFileLogger.Terminate();
    }
}
Putting the patterns together

Our logger will use both of these patterns, and the attached sample code uses two observers: the FileLogger mentioned above, and the form itself. The FileLogger logs the messages to a file, and the form shows the messages in a textbox control. This is obviously a simple implementation, but it could be used for much more creative purposes. We could have an observer that writes the log entries to a database table, another that concatenates all the entries and then emails the log, etc.

Whenever the form needs to have something logged, it simply executes the AddLogMessage() on the logger and passes it the log entry:

class Logger
{
    public void AddLogMessage(string message)
    {
        // Apply some basic formatting like the current timestamp
        string formattedMessage = string.Format("{0} - {1}", DateTime.Now.ToString(), message);
        foreach (ILogger observer in mObservers)
        {
            observer.ProcessLogMessage(formattedMessage);
        }
    }
}

public partial class Form1 : Form
{
    private void button1_Click(object sender, EventArgs e)
    {
        mLogger.AddLogMessage("The button was clicked.");
    }
}
Using the code

The sample project does not perform any complex operations, but showcases the logging discussed in this article. The main form has a button that, when clicked, increments a private counter and displays the value in a text box.

Every time the counter is increased, however, the logger is informed. In turn, the logger formats the message it receives and dispatches it to all the observers for processing.

As a nice plus, the Logger's AddLogMessage() method is overloaded to accept an exception. If the application throws an exception and we want it properly logged, we just pass the exception to the Logger and it extracts all the messages from the exception and its inner exceptions (if applicable), puts them together, adds the stack trace, and then logs the whole thing. Very useful.
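
The real overload is in the download, but a minimal sketch of what it might look like is below (StringBuilder needs a using directive for System.Text):

public void AddLogMessage(Exception exception)
{
    // Collect the message from the exception and every inner exception.
    StringBuilder fullMessage = new StringBuilder();
    Exception currentException = exception;
    while (currentException != null)
    {
        fullMessage.AppendLine(currentException.Message);
        currentException = currentException.InnerException;
    }

    // Add the stack trace, then hand everything to the string overload
    // so it gets the same formatting and is dispatched to all observers.
    fullMessage.AppendLine(exception.StackTrace);
    AddLogMessage(fullMessage.ToString());
}
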
In conclusion

I hope this article and attached sample prove useful. If you have any suggestions for improvement I'm all ears, so post a message and let us know what you think!


A sample project is available for testing here: http://www.codeproject.com/useritems/ObserverLogging.asp

So I'm joining the crowds and blogging...

Never thought I'd be one to just follow the crowds, but I guess some things just make sense. Blogging sounds like an easy way to write and reach the masses easily, so here I am, doing just that.

Anyway, most of my entries will be programming-oriented. I do a lot of research on the net to figure out how to solve complex issues, and once I have it all working, I sometimes write it up in an easy to read article that attempts to spare the reader from the pain and hard work involved in doing all the research. I know I'd like to find more articles like this, so I'm sure there are others like me out there.

So here goes. I'll be posting my first article really soon.