(note: even if you aren't in edev, this might be an interesting read, as it has some stats on what causes some lag, and on user searching).

Looking for ways to optimize E2, I turned my attention to the user search. This is going to be kind of a hack of the way we do things, so I wanted to put this out for ideas of both edev and the higher-ups, as I might be missing things in my evaluation. User searches are one of the biggest eaters of SQL time we have here at E2. So I set about to figure out what the deal is with them.

So to start my study on user searches, i took some profiling data over a time of roughly two weeks. Data was collected in such a way to maintain privacy; only the search types and who was being searched on (for better data for prediction of cache invalidation stats) was taken. This is the data that was collected:

User search stats for a two week period:

47421 total searches

  • title DESC: 32 (0.07% of search space)
  • title ASC: 214 (0.45% of search space)
  • title: 36 (0.08% of search space)
  • createtime ASC: 739 (1.56% of search space)
  • reputation DESC: 2777 (5.86% of search space)
  • reputation ASC: 2004 (4.23% of search space)
  • createtime DESC: 41619 (87.76% of search space)
(this was done by dumpsearchstats.pl, a script run on the db server).
We can obviously see that 88% of seraches are "createtime DESC", which as we all know are the default user search when you click on homenodes. This brings up two questions.
  1. Is this the smartest type of search we could be doing?
  2. Are there ways that we could cache this data, seeing as it does not change very often




SQL timing:
Well, let's do a user search on me and see:
    SELECT * FROM node WHERE author_user=459692 AND type_nodetype=117 ORDER BY node_id DESC\g

411 rows in set (0.02-0.04 sec)



    SELECT * FROM node WHERE author_user=459692 AND type_nodetype=117 ORDER BY createtime DESC\g

411 rows in set (0.02-0.04 sec)



It seems that the searches are about even. Ten searches of each seems to yield about the same results. Because user searches have to sort through large amounts of data, this can be troublesome in laggy conditions. We'd rather shift the weight of the computation over to the web server machine, so let's look at caching.



Caching:
Seeing as we have 88 % of the same basic search type as the default search, what about managing a search cache for a user, at least for the top fifty nodes ordered by createtime. We could store them in an independant table and just pick it out when we need it. There are two concerns with caching, and that is invalidation, and generation. Assuming we only store the top fifty writeups, the general algorithm will look like this:

On User Search for Person
  Search for Cache (by user and searchtype)
     if found use cache
     if not found generate cache, store list, use cache.

On Writeup deletion
  If in first 50 writeup (delete is slow anyways), invalidate cache.

On Writeup addition
  Invalidate cache

**Note: are there any other places where we'd have to watch out for cache out of sync-ness?

Logically, we'd probably store them in a fast array or hash setting so that all we'd have to do is eval the data to get it out fast, then iterate through it. We can't use the standard setting, because those are already taken for $VARS. That's pretty much the easy part.

Invalidation We could set a quick bit to do that, so that we ignore the cache when we search. We'd have to overload writeup maintenance create and writeup maintenance delete (and I could get that last writeup cache working at the same time in those), to invalidate the caches. Re-validation would occur on the first user search of the user.

Initial Data We'd need to compile the intial data here to put in everyone's caches. We could do this over a time of about ten minutes one night, and start all of them rolling. This seeding process no one would really notice, and it's all one time CPU usage.

Other applications We could try to cache the cool archive, but it would be much harder to do so, seeing as C!s get tossed system wide more often than individual users put in writeups.

Generally I'm looking for comments. There might be a smarter way to tackle this problem, and if so, I'm open to suggestions. This is a big force of lag as wel lock the table to do the user searches now. Thanks.