Video Screencast Help
Symantec to Separate Into Two Focused, Industry-Leading Technology Companies. Learn more.
Endpoint Management Community Blog

{CWoC} Performance Issues with the String Cache > It's Becoming a Drag... So Let's Look Into Indexing Strategies

Created: 29 Dec 2009 • Updated: 29 Dec 2009 • 1 comment
Ludovic Ferre's picture
0 0 Votes
Login to vote

Argh... with most feature comes a cost, and the great benefits from the string cache (which does a huge simplification, storing data from hundred of thousand of lines into 7~16,000 entries between 0 and 36 bytes long) comes a huge performance hit as it is right now (illustrated below).

Illustration 1: Simple linked list implementation currently in use.

Working with the string cache to store guids and hit counts for them we saw some interesting statistics whilst there was only a minor performance cost with up to 7,600 cached entries.

But as I added the Client IP Addresses (c_ip) yesterday the string cache doubled in size reaching upward of 16,000 entries. Knowing that the string cache is implemented with a simple list (which is a bad circular list implementation that acts much like a stack form which we never pop) we have hit the limit whereby traversal of the list is so costly at the end of a file parse (once we hit the 500,000 line regions, probably close to the 10,000 cache entries) that the parsing of the largest file I have here (950K lines, 200MiB+) now takes anything between 60 and 90 seconds! Gasp.

I had considered this potential problem before, knowing that linear searches aren't efficient in arrays and lists alike past a given size, and made room for an index in the string cache list. So now comes the time to implement a string cache index, aka sci.

But how should we do this? First of all it's really important to look at the cache content. With 9,000 ip address entries in store, and with a consistent ip addressing scheme we can't index the ip address using less than 4~5 chars (take a case where the customer networks are in the range 10.x.x.x, 192.168.x.x). It is clear that creating a binary tree to store the index would be quite costly, and probably too much work for now I am wondering if using a double single char index isn't the best solution, with a first index (named front index) based on the first char in the string and a second index (named rear index) base don the last char in the string.

This rear index would resolve the ip address caching issue reported above (assuming an even distribution we would reduce search on 9,000 ip address to 900 entries on average) , whilst the front index would cater nicely for guids (with a larger distribution across 26 chars our 7,000 entries would perform much better with 270 entries per char on average). Using a case-less index we could implement the front and rear data on a pair of 36 byte arrays ( 10 digit, + 26 letters) with a pointer to the first entry of the given type and insertions being made to this point.

Illustration2: Front and rear indexing
This shows that the rear ~ front indexing may only be implemented with two separate indexes or a doubly linked list but it looks rather messy. In the end two separate indexes, one for Guids and one for IP entries may be the best solution (the guid being a front index and ip a rear one, bt both would be independent).

Another note on the performance side of the {CWoC} house (still in the aila room ;): I have come to realize that my current search strategy(ies) for uri data, guids and more was not the most efficient one. So I implemented a single parse_line function that ensures the data read from the file is parsed and stored in a parsed_line structure and that the searches are done on fragments of the string instead of the full string.

This proved to be very beneficial, as the largest file I have parses in ~15s when previous versions of aila took 23~25 seconds to return the same output! In this manner we can implement a file translation straight out to output the data to a csv or sql format (sql for bulked insert in a db).

Comments 1 CommentJump to latest comment

Ludovic Ferre's picture

Just tought about this one now.

It would make a lot more sense than bothering with complex indexing. Indexing could be good for guid entries, but IP addresses would be lighter and easier to handle intheir natural form i.e. 32-bit integer!

In the end I think I'll end up with two list used in the cache, one for guids, one for ip's. Each with their indexing or search functions (int ip's would be good for binary search albeit my list isn't up to it yet).

I am currently off-net, on a retreat of some kind. I'll be back real soon, and you sure will hear from me then ;-).

Ludovic FERRÉ
Principal Remote Product Specialist

Login to vote