Improve performance of useragent lookup
mrfelton - June 7, 2009 - 13:09
| Project: | Browscap |
| Version: | 6.x-1.0 |
| Component: | Code |
| Category: | bug report |
| Priority: | normal |
| Assigned: | Unassigned |
| Status: | postponed |
Jump to:
Description
After examining my MySQL slow query log, there seem to be a lot of entries from browscap like this
SELECT * from browscap WHERE 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; GTB6; DS_desktopsmiley; InfoPath.1)' LIKE useragent ORDER BY LENGTH(useragent) DESC LIMIT 0, 1;Perhaps this module could benefit from some MySQL indexing?

#1
ie. probably an index is needed on the useragent column if you are going to use it in the lookups so frequently.
#2
I looked into this a little bit and this is likely going to be an interesting problem.
The query is using "LIKE" which, according to the docs
I think the major difficulty is that the browscap useragent data looks like:
+-----------------------------------------------------------------------+| useragent |
+-----------------------------------------------------------------------+
| !Susie (http://www.sync2it.com/susie) |
| % |
| % (compatible; Googlebot-Mobile/2.1; %http://www.google.com/bot.html) |
| % (compatible;YahooSeeker/M1A1-R2D2; %) |
| %(compatible; MSIE %.%; Windows CE; PPC; %) |
| %AgentName/% |
| %Amiga% |
| %avantbrowser% |
| %BecomeBot/% |
| %BecomeBot@exava.com% |
+-----------------------------------------------------------------------+
So - lots of wildcards at the beginning of the data. I've got a request in to have some database folks look at ways we can potentially improve this.
#3
I just noticed in browscap_get_browser that it is leveraging the cache keyed off of the full useragent which is very likely an indexed operation that should be very fast. So, the slow LIKE query is only performed when the useragent isn't a known useragent or the cache has expired which happens once a day.
Given that the data is only updated once a week I think we could consider a couple things:
1) Moving this data to its own cache table so we can actively clear the cache when the data is updated.
2) Only expiring items from the cache based on updated data (i.e. step 1 of this plan) and never based on time.
OR a simpler solution is to just increase the time that we allow data to be cached to a week or something. I think having slightly inaccurate data would be acceptable on most sites in exchange for performance improvements.
#4
So, I discussed this with Paul - the sql guru on the project - and he suggested that given the nature of those fields we are unlikely to be able to drastically improve that particular query, but also that the cache idea I presented in comment #3 could work well.
I'm marking this postponed - it's not a problem for me so I don't plan to work on it.