As the customers and products table gets larger the Customer Reports (admin/store/reports/customers) gets slower and slower so that by the time you have 50,000 orders and 25,000 customers (my data set) the query never finishes. See this thread for discussion on the issue. As mentioned in the thread the slow part of the query is summing up the number of products in a user's combined orders. The fix, as it turns out, is easy: add an index to qty and the report becomes quite usable again. Patch is attached that does just that.

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

jrust’s picture

FileSize
740 bytes

Just did some more testing with the reports module and large data sets and it turns out that the product report also slows to a crawl because of a lack of an index on nid in uc_order_products. I've updated the patch to contain both indexes and now all reports are speedy on my semi-big data set!

jrust’s picture

Title: Customer Report is Unusable with Large Data Sets » Customer & Product Reports are Unusable with Large Data Sets
torgosPizza’s picture

This is a problem we also noticed in UC1.x, I'll see if your indexes need to be created on our site as well. I'm pretty sure we got them all but it won't hurt to double-check.

Sorry for bringing up a necro-thread. Seems that it hasn't gotten looked at by anyone else either...

Island Usurper’s picture

I got to digging around, and while those indexes are good for the products report, it doesn't do a thing for customers. I've rewritten the query that's used for it, and now that it's actually using the indexes that are available, it should go a lot faster.

rszrama’s picture

Issue tags: +Scaling, +ubercamp sprint

Tagging.

cha0s’s picture

Status: Needs review » Needs work

Latest patch breaks using pgsql.

sammys’s picture

Status: Needs work » Needs review
FileSize
5.1 KB

Here is the revised patch for this issue. I've removed the CONCAT() calls since we only need to uniquely identify customers down to the user account in Drupal rather than a uid.firstname.lastname combination. This should speed up the queries and get rid of the nastiest PostgreSQL compatibility problem. Also added the non-aggregate fields into the GROUP BY clause.

Not reviewed on MySQL so please review on that.

sammys’s picture

After some further discussions we had to simplify the report output and remove the customer name. Please read on if you want to know the reason.

Ubercart allows different billing/shipping recipients to be entered for the same user account. While this is a great feature to have it makes reporting customer statistics a little tricky.

On one hand we could group statistics using the user ID, first name and surname. This would show more results on the screen and have a finer granularity. Unfortunately, it makes the query take longer. Lyle timed the query on a small dataset and it ended up as 70ms.

On the other hand we can simply decide to group the statistics by user accounts. This makes sense in the accounting paradigm. An account is a customer not a person. Using user account grouping of statistics records reduces the query time to 7ms for the same small dataset used above. For those "DB block cache needs to be cold" purists out there, we ran this user account grouped query (7ms) before the user ID, first name and surname grouped query (70ms). I reckon that's pretty good.

Patch attached!

cha0s’s picture

Cool, though I'm not sure if people are going to be upset that they don't get as much information from the report as they used to.

Well anyways, it looks good and makes the code prettier and works on pgsql, so I'm happy about that. :)

jrust’s picture

I tested it out on my dataset which is decently large (37,000 orders) and the difference between grouping by u.uid, u.name vs. u.uid, o.billing_first_name, o.billing_last_name was negligible (.95 seconds vs. 1.3 seconds). I'm not sure losing valuable information (and I think name is quite valuable to store managers who work with customers with names, not user accounts with just usernames) is worth shaving off .3 seconds from an already somewhat slow query. The real solution might be to create an aggregate index that covers the columns we GROUP BY. However, I tried creating an index of o.uid, o.billing_last_name, and o.billing_first_name and couldn't get MySQL to actually use it...

cha0s’s picture

I plan on benchmarking this properly soon... we definitely need to get some more data here because if it really is as negligible as you say to include the info, then I think it's definitely a serious mistake to remove the name stuff. Only if we're talking a 10x slowdown as was implied in sammys's post.

Can we see the benchmarking code used here?

jrust’s picture

sure, here are the queries and their times that I'm using to test:

SELECT u.uid, u.name, COUNT( DISTINCT o.order_id ) AS orders, SUM( op.qty ) AS products, SUM( o.order_total ) AS total, AVG( o.order_total ) AS average
FROM uc_orders AS o
LEFT JOIN users AS u ON o.uid = u.uid
JOIN (
  SELECT order_id, SUM( qty ) AS qty
  FROM uc_order_products
  GROUP BY order_id
) AS op ON o.order_id = op.order_id
WHERE o.order_status
IN (
'completed'
)
GROUP BY u.uid, u.name
-- (20,754 total, Query took 0.6667 sec)
SELECT u.uid, u.name, COUNT( DISTINCT o.order_id ) AS orders, SUM( op.qty ) AS products, SUM( o.order_total ) AS total, AVG( o.order_total ) AS average
FROM uc_orders AS o
LEFT JOIN users AS u ON o.uid = u.uid
JOIN (
  SELECT order_id, SUM( qty ) AS qty
  FROM uc_order_products
  GROUP BY order_id
) AS op ON o.order_id = op.order_id
WHERE o.order_status
IN (
'completed'
)
GROUP BY o.uid, o.billing_last_name, o.billing_first_name
-- (20,772 total, Query took 1.1164 sec)
sammys’s picture

Thanks for running that on a larger dataset.

I understand that names are important when dealing with clients. The code we're talking about relates to statistics rather than dealings with the payer or recipient. In addition, discounts and other operations are related to the account holder rather than payer/recipient. This report is not meant to be used to find an order either. Grouping by name makes Ubercart reporting different to normal business practices.

Unfortunately, Drupal does not provide a name field in the user account (i.e for a default account contact in the case of Ubercart) for us to display in the report result. Would have put that there otherwise.

Perhaps we need to make these reports use db_rewrite_sql(). hehe

jrust’s picture

Good point sammys. I agree that the most useful statistics are likely based around a single account, not around multiple addresses a person has. The only downside is just that the report then does not easily show the store admin who are these top purchasers on the report are since username jxdoe29 is not as likely to be known by the admin as John Doe. Could we just show the billing name that the GROUP BY comes up with the understanding that the name could, in some (rare) cases, not be the same as what they used on other orders?

Island Usurper’s picture

Yeah, I think it's a good idea to keep the customer names on the report. I've made some adjustments to my earlier patch to make it Postgres compatible and to take some unnecessary calculations out of the count query.

I do want to point out that it's possible for the queries to be cached by the database, so it only takes .5 milliseconds to return the data on average. I did try to write the queries a couple of other ways, but they were always slower (nearly half a minute in some instances) and wouldn't cache.

A proper benchmark is still a good idea if anyone wants to. However, if it loads fast enough for you without knowing exactly how much time it takes, then that's good enough for me.

Island Usurper’s picture

...and like a doofus, I forgot to upload my latest patch.

itsmahitha’s picture

Assigned: Unassigned » itsmahitha
Status: Needs review » Reviewed & tested by the community

Reviewed and Tested.

TR’s picture

Assigned: itsmahitha » Unassigned
Status: Reviewed & tested by the community » Needs review
jasonabc’s picture

I think the attached patch contains an error - it says to add "function uc_order_update_6011()" - but this already exists in uc_order/uc_order.install...? I changed it to "function uc_order_update_6015()" and that seems to work well.

With a large dataset - Store Admin > Reports > Product Reports now loads much faster - but clicking on the "Custom Product Report" tab to run a report on a certain start/end date is still unacceptably slow. It takes forever to execute the report.

3dloco’s picture

+1

isaac.niebeling’s picture

Just tested this on a DB with 1M+ rows in uc_order_products -- it definitely helps. Still took somewhere between 2:30 and 4:00 (I went and got a bowl of cereal while it was running :p) but the report is at least functional now.

Any chance this will be included in UC any time soon?

longwave’s picture

Status: Needs review » Needs work

Variant on #1 committed to add indexes to nid and qty, to both branches.

When I tested the updated customer SQL on MySQL it gave somewhat different results on a large dataset, as some customers with many orders haven't always entered the exact same name; I think maybe grouping by just uid may be better, but I see this has problems on Postgres.

longwave’s picture

TR’s picture

Version: 6.x-2.0-rc3 » 6.x-2.x-dev
Component: Code » Reports
Issue tags: -Scaling, -ubercamp sprint
longwave’s picture

Status: Needs work » Closed (fixed)

I guess adding the indexes improved things enough, let's not change things unnecessarily if nobody is having further issues with this.