May 04, 2009

My little baby’s all grown up

In my previous post I began refactoring the BlogEngine.NET Search.cs file, specifically a portion of the “catalog” functionality. I’m going to continue my refactoring here …

But first, a little context. It seems that the searching algorithm (a method called BuildResultSet() ) works on “entries” in a collection called “catalog”, returning a set of “results”. An Entry is a struct defined right in the Search.cs file:

/// <summary>

/// A search optimized post object cleansed from HTML and stop words.

/// </summary>

internal struct Entry

{

/// <summary>The post object reference</summary>

internal IPublishable Item;

/// <summary>The title of the post cleansed for stop words</summary>

internal string Title;

/// <summary>The content of the post cleansed for stop words and HTML</summary>

internal string Content;

}

A Result is actually a completely separate class, contained also in the Search.cs file:

/// <summary>

/// A result is a search result which contains a post and its ranking.

/// </summary>

internal class Result : IComparable<Result>

{

/// <summary>

/// The rank of the post based on the search term. The higher the rank, the higher the post is in the result set.

/// </summary>

internal int Rank;

 

/// <summary>

/// The post of the result.

/// </summary>

internal IPublishable Item;

 

/// <summary>

/// Compares the current object with another object of the same type.

/// </summary>

/// <param name="other">An object to compare with this object.</param>

/// <returns>

/// A 32-bit signed integer that indicates the relative order of the objects being compared. The return value 

/// has the following meanings: Value Meaning Less than zero This object is less than the other parameter.Zero 

/// This object is equal to other. Greater than zero This object is greater than other.

/// </returns>

public int CompareTo(Result other)

{

return other.Rank.CompareTo(Rank);

}

 

public override int GetHashCode()

{

return Item.Id.GetHashCode();

}

}

Catalog, meanwhile, is defined as a local variable:

private static Collection<Entry> _Catalog = new Collection<Entry>();

Each of these items have a few methods set aside for them, such as “BuildCatalog()”, “AddItem()”, “CleanContent()”, and so forth. These pieces of code are all folded up and hidden away within regions in the Search.cs file.

The searching itself is done in a few methods, primarily BuildResultSet():

/// <summary>

/// Builds the results set and ranks it.

/// </summary>

private static List<Result> BuildResultSet(string searchTerm, bool includeComments)

{

List<Result> results = new List<Result>();

string term = CleanContent(searchTerm.ToLowerInvariant().Trim(), false);

string[] terms = term.Split(new char[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

string regex = string.Format(System.Globalization.CultureInfo.InvariantCulture, "({0})", string.Join("|", terms));

 

foreach (Entry entry in _Catalog)

{

Result result = new Result();

if (!(entry.Item is Comment))

{

int titleMatches = Regex.Matches(entry.Title, regex).Count;

result.Rank = titleMatches * 20;

 

int postMatches = Regex.Matches(entry.Content, regex).Count;

result.Rank += postMatches;

 

int descriptionMatches = Regex.Matches(entry.Item.Description, regex).Count;

result.Rank += descriptionMatches * 2;

}

else if (includeComments)

{

int commentMatches = Regex.Matches(entry.Content + entry.Title, regex).Count;

result.Rank += commentMatches;

}

 

if (result.Rank > 0)

{

result.Item = entry.Item;

results.Add(result);

}

}

 

results.Sort();

return results;

}

So, where am I going with this? I’m glad you asked. :-)

It seems to me that this one file, Search.cs, is trying to do too many complex things. It’s defining several entities (Entry, Result, Catalog), managing them (BuildCatalog(), BuildResultSet()), searching them (APMLMatches(), Hits()), manipulating them (CleanContent(), FindRelatedItems(), ResultToPost()), and handling events (Post_Saved(), Post_CommentAdded(), OnSearching()). It’s probable that the search functionality started out pretty simple, and grew more complex over time.

When that happens to our code, we should practice a little “ruthless refactoring” and extract these groups of functionality into their own classes where appropriate. In other words, when your baby methods grow up, make them move out and get their own home. In this way, we can keep the “Search” class concerned with searching, not maintaining the content catalog or sanitizing the result’s text. I don’t think enough of that was done here, so I’m going to do it.

The first thing I did when I opened up this file (even before what I did in the previous post), was extract the “catalog”, and give it some freedom. My new Catalog.cs, all grown up now (and with a few more refactorings which I’ll explain later), looks like this:

public static class Catalog

{

private static List<Entry> entries = new List<Entry>();

private readonly static object syncRoot = new object();

 

internal static List<Entry> Entries

{

get { return entries; }

}

 

internal static void AddItem(IPublishable item)

{

item.AddToCatalog();

}

 

internal static void Build()

{

OnIndexBuilding();

lock (syncRoot)

{

entries.Clear();

Page.PublishedPages.ForEach(page => page.AddToCatalog());

 

bool commentSearchEnabled = BlogSettings.Instance.EnableCommentSearch;

if (commentSearchEnabled)

{

Post.PublishedPosts.ForEach(post => post.AddSelfAndCommentsToCatalog());

}

else

{

Post.PublishedPosts.ForEach(post => post.AddToCatalog());

}

}

OnIndexBuilt();

}

 

private static void AddSelfAndCommentsToCatalog(this Post post)

{

post.AddToCatalog();

post.ApprovedComments.ForEach(c => c.AddToCatalog());

}

 

private static void AddToCatalog(this IPublishable item)

{

var entry = new Entry

{

Item = item,

Title = TextAgent.CleanContent(item.Title, false),

Content = HttpUtility.HtmlDecode(TextAgent.CleanContent(item.Content, true))

};

 

if (item is Comment)

{

entry.Content += HttpUtility.HtmlDecode(TextAgent.CleanContent(item.Author, false));

}

entries.Add(entry);

}

 

public static event EventHandler<EventArgs> IndexBuilding;

private static void OnIndexBuilding()

{

if (IndexBuilding != null)

{

IndexBuilding(null, EventArgs.Empty);

}

}

 

public static event EventHandler<EventArgs> IndexBuilt;

private static void OnIndexBuilt()

{

if (IndexBuilt != null)

{

IndexBuilt(null, EventArgs.Empty);

}

}

}

Similarly, I’ll do the same type of extraction to other pieces of Search.cs, leaving each of the individual pieces simpler, more single-purpose, and ultimately (I hope) easier to understand and manipulate.

Coming up next, I take advantage of some .NET 3+ functionality to further refine the code.