Customer & Product Reports are Unusable with Large Data Sets [#465536]

Comment	File	Size	Author
#16	customer-products_reports.patch	4.27 KB	Island Usurper
#8	465536_200907031919-0400.patch	6.67 KB	sammys
#7	465536_200907031741-0400.patch	5.1 KB	sammys
#4	customer-products_reports.patch	4.28 KB	Island Usurper
#1	uc_order_qty_index.patch	740 bytes	jrust
	uc_order_qty_index.patch	633 bytes	jrust

Comment #1

jrust commented 18 May 2009 at 01:49

File	Size
uc_order_qty_index.patch	740 bytes

Just did some more testing with the reports module and large data sets and it turns out that the product report also slows to a crawl because of a lack of an index on nid in uc_order_products. I've updated the patch to contain both indexes and now all reports are speedy on my semi-big data set!
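As a rough illustration of the index fix described above, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for MySQL. The table is a cut-down, hypothetical version of uc_order_products, and the index name idx_order_products_nid is an assumption, not the name used in the patch. It only shows how to confirm that a per-product query starts using the index once it exists.

```python
import sqlite3

# Hypothetical, minimal stand-in for uc_order_products; only the columns
# named in the thread (order_id, nid, qty) are modelled.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE uc_order_products ("
    "order_product_id INTEGER PRIMARY KEY, order_id INTEGER, nid INTEGER, qty INTEGER)"
)
conn.executemany(
    "INSERT INTO uc_order_products (order_id, nid, qty) VALUES (?, ?, ?)",
    [(i // 3, i % 50, 1 + i % 4) for i in range(3000)],
)


def query_plan(sql):
    """Return the planner's description of how it will execute `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The last column of each row holds the human-readable plan detail.
    return " | ".join(row[-1] for row in rows)


per_product = "SELECT SUM(qty) FROM uc_order_products WHERE nid = 7"

plan_before = query_plan(per_product)
# Assumed index name; the real patch may name and define it differently.
conn.execute("CREATE INDEX idx_order_products_nid ON uc_order_products (nid)")
plan_after = query_plan(per_product)

print("before:", plan_before)
print("after: ", plan_after)
```

On MySQL the equivalent check would be an EXPLAIN on the report query before and after adding the index; the planner output format differs, so treat the above as a way to reason about the change rather than a measurement.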

Comment #2

jrust commented 18 May 2009 at 01:49

Title:	Customer Report is Unusable with Large Data Sets	» Customer & Product Reports are Unusable with Large Data Sets

Comment #3

torgosPizza commented 15 June 2009 at 06:36

This is a problem we also noticed in UC1.x. I'll see if your indexes need to be created on our site as well. I'm pretty sure we got them all, but it won't hurt to double-check.

Sorry for bringing up a necro-thread. Seems that it hasn't gotten looked at by anyone else either...

Comment #4

Island Usurper commented 19 June 2009 at 21:18

File	Size
customer-products_reports.patch	4.28 KB

I got to digging around, and while those indexes are good for the products report, they don't do a thing for the customers report. I've rewritten the query that's used for it, and now that it's actually using the indexes that are available, it should go a lot faster.

Comment #5

rszrama commented 29 June 2009 at 16:47

Issue tags:	+Scaling, +ubercamp sprint

Tagging.

Comment #6

cha0s commented 2 July 2009 at 23:40

Status:	Needs review	» Needs work

Latest patch breaks using pgsql.

Comment #7

sammys commented 3 July 2009 at 21:44

Status:	Needs work	» Needs review

File	Size
465536_200907031741-0400.patch	5.1 KB

Here is the revised patch for this issue. I've removed the CONCAT() calls since we only need to uniquely identify customers down to the user account in Drupal rather than a uid.firstname.lastname combination. This should speed up the queries and get rid of the nastiest PostgreSQL compatibility problem. Also added the non-aggregate fields into the GROUP BY clause.

Not reviewed on MySQL so please review on that.

Comment #8

sammys commented 3 July 2009 at 23:30

File	Size
465536_200907031919-0400.patch	6.67 KB

After some further discussions we had to simplify the report output and remove the customer name. Please read on if you want to know the reason.

Ubercart allows different billing/shipping recipients to be entered for the same user account. While this is a great feature to have it makes reporting customer statistics a little tricky.

On one hand we could group statistics using the user ID, first name and surname. This would show more results on the screen and have a finer granularity. Unfortunately, it makes the query take longer. Lyle timed the query on a small dataset and it ended up as 70ms.

On the other hand, we can simply group the statistics by user account. This makes sense in the accounting paradigm: an account is a customer, not a person. Grouping the statistics by user account reduces the query time to 7ms for the same small dataset used above. For those "DB block cache needs to be cold" purists out there, we ran the query grouped by user account (7ms) before the query grouped by user ID, first name and surname (70ms). I reckon that's pretty good.

Patch attached!
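To make the granularity trade-off concrete, here is a small sketch using Python's sqlite3 module with a hypothetical, cut-down uc_orders table and made-up rows. It is not the patched Ubercart query; it only shows that grouping by account yields one row per account, while adding the billing name to the grouping splits an account wherever the name differs between orders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE uc_orders (
    order_id INTEGER PRIMARY KEY, uid INTEGER, order_total REAL,
    billing_first_name TEXT, billing_last_name TEXT);
INSERT INTO uc_orders VALUES
    (1, 1, 10.0, 'John', 'Doe'),
    (2, 1, 30.0, 'Jane', 'Doe'),   -- same account, different billing name
    (3, 2, 20.0, 'Ann',  'Smith'),
    (4, 2, 25.0, 'Ann',  'Smith');
""")


def report(group_by):
    """Run a simplified customer totals query grouped by the given columns."""
    sql = (
        f"SELECT {group_by}, COUNT(*) AS orders, SUM(order_total) AS total "
        f"FROM uc_orders GROUP BY {group_by} ORDER BY {group_by}"
    )
    return conn.execute(sql).fetchall()


by_account = report("uid")
by_account_and_name = report("uid, billing_first_name, billing_last_name")

print(len(by_account), "rows when grouped by account")
print(len(by_account_and_name), "rows when grouped by account and billing name")
```

The timing difference reported above depends on the database and dataset, so this sketch says nothing about speed; it only illustrates why the two reports can disagree on row counts and per-row totals.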

Comment #9

cha0s commented 4 July 2009 at 07:39

Cool, though I'm not sure if people are going to be upset that they don't get as much information from the report as they used to.

Well anyways, it looks good and makes the code prettier and works on pgsql, so I'm happy about that. :)

Comment #10

jrust commented 6 July 2009 at 18:21

I tested it out on my dataset which is decently large (37,000 orders) and the difference between grouping by u.uid, u.name vs. u.uid, o.billing_first_name, o.billing_last_name was negligible (.95 seconds vs. 1.3 seconds). I'm not sure losing valuable information (and I think name is quite valuable to store managers who work with customers with names, not user accounts with just usernames) is worth shaving off .3 seconds from an already somewhat slow query. The real solution might be to create an aggregate index that covers the columns we GROUP BY. However, I tried creating an index of o.uid, o.billing_last_name, and o.billing_first_name and couldn't get MySQL to actually use it...
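For anyone wanting to experiment with the composite-index idea, here is a hedged sketch using Python's sqlite3 module. The table is a simplified, hypothetical uc_orders, the index name idx_orders_customer is made up, and appending order_total to the index (so the query can be answered from the index alone) is an assumption of this sketch rather than something tried in the thread. It shows how to check whether the planner picks the index for a GROUP BY. SQLite's planner is not MySQL's, so a positive result here does not mean MySQL will behave the same way.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE uc_orders (order_id INTEGER PRIMARY KEY, uid INTEGER, "
    "order_total REAL, billing_first_name TEXT, billing_last_name TEXT)"
)
conn.executemany(
    "INSERT INTO uc_orders (uid, order_total, billing_first_name, billing_last_name) "
    "VALUES (?, ?, ?, ?)",
    [(i % 200, 5.0 + i % 40, f"First{i % 230}", f"Last{i % 200}") for i in range(2000)],
)

# Simplified single-table version of the grouping used in the customer report.
grouped = (
    "SELECT uid, billing_last_name, billing_first_name, SUM(order_total) "
    "FROM uc_orders GROUP BY uid, billing_last_name, billing_first_name"
)


def query_plan(sql):
    """Return the planner's description of how it will execute `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)


plan_before = query_plan(grouped)
# Index columns follow the GROUP BY order; order_total is appended so the
# index covers every column the query reads (an assumption of this sketch).
conn.execute(
    "CREATE INDEX idx_orders_customer ON uc_orders "
    "(uid, billing_last_name, billing_first_name, order_total)"
)
plan_after = query_plan(grouped)

print("before:", plan_before)
print("after: ", plan_after)
```

The real report query also joins users and a per-order subquery, which gives the optimizer more reasons to ignore such an index, so running EXPLAIN against the full query on the target database is the only reliable check.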

Comment #11

cha0s commented 7 July 2009 at 16:52

I plan on benchmarking this properly soon. We definitely need to get some more data here, because if including the info really costs as little as you say, then I think it's a serious mistake to remove the name stuff. Removing it would only be justified if we're talking about a 10x slowdown, as was implied in sammys's post.

Can we see the benchmarking code used here?

Comment #12

jrust commented 7 July 2009 at 17:32

Sure, here are the queries I'm using to test, along with their times:

-- Grouped by user account (uid and username).
SELECT u.uid, u.name,
  COUNT(DISTINCT o.order_id) AS orders,
  SUM(op.qty) AS products,
  SUM(o.order_total) AS total,
  AVG(o.order_total) AS average
FROM uc_orders AS o
LEFT JOIN users AS u ON o.uid = u.uid
JOIN (
  SELECT order_id, SUM(qty) AS qty
  FROM uc_order_products
  GROUP BY order_id
) AS op ON o.order_id = op.order_id
WHERE o.order_status IN ('completed')
GROUP BY u.uid, u.name
-- (20,754 total, Query took 0.6667 sec)

-- Grouped by user account plus billing name.
SELECT u.uid, u.name,
  COUNT(DISTINCT o.order_id) AS orders,
  SUM(op.qty) AS products,
  SUM(o.order_total) AS total,
  AVG(o.order_total) AS average
FROM uc_orders AS o
LEFT JOIN users AS u ON o.uid = u.uid
JOIN (
  SELECT order_id, SUM(qty) AS qty
  FROM uc_order_products
  GROUP BY order_id
) AS op ON o.order_id = op.order_id
WHERE o.order_status IN ('completed')
GROUP BY o.uid, o.billing_last_name, o.billing_first_name
-- (20,772 total, Query took 1.1164 sec)
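Since the thread asks for benchmarking code, here is a minimal timing harness sketch in Python with sqlite3 and synthetic data. Everything about the data (20,000 orders, 5,000 accounts, the "Alt" first name) is invented for illustration, and the queries are simplified single-table versions of the two above. Repeating each query and taking the median is one way to reduce the effect of caching on a single run; the numbers it prints are not comparable to the MySQL timings quoted here.

```python
import sqlite3
import statistics
import time

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE uc_orders (order_id INTEGER PRIMARY KEY, uid INTEGER, "
    "order_total REAL, billing_first_name TEXT, billing_last_name TEXT)"
)
# Synthetic data: 20,000 orders spread over 5,000 accounts. Even-numbered
# orders above 15,000 use an alternate billing first name, so some accounts
# have orders under two different names.
conn.executemany(
    "INSERT INTO uc_orders VALUES (?, ?, ?, ?, ?)",
    [
        (
            i,
            i % 5000,
            10.0 + i % 90,
            "Alt" if i > 15000 and i % 2 == 0 else "First",
            f"Last{i % 5000}",
        )
        for i in range(1, 20001)
    ],
)


def time_query(sql, repeats=5):
    """Return (row_count, median_seconds) over `repeats` executions."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        rows = conn.execute(sql).fetchall()
        timings.append(time.perf_counter() - start)
    return len(rows), statistics.median(timings)


base = "SELECT COUNT(*), SUM(order_total), AVG(order_total) FROM uc_orders GROUP BY "
results = {
    "by account": time_query(base + "uid"),
    "by account and billing name": time_query(
        base + "uid, billing_last_name, billing_first_name"
    ),
}
for label, (row_count, seconds) in results.items():
    print(f"{label}: {row_count} rows, median {seconds * 1000:.2f} ms")
```

Reporting the row count next to the timing helps when comparing runs, because the two groupings return different numbers of rows.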

Comment #13

sammys commented 7 July 2009 at 18:20

Thanks for running that on a larger dataset.

I understand that names are important when dealing with clients. The code we're talking about relates to statistics rather than dealings with the payer or recipient. In addition, discounts and other operations are related to the account holder rather than payer/recipient. This report is not meant to be used to find an order either. Grouping by name makes Ubercart reporting different to normal business practices.

Unfortunately, Drupal does not provide a name field in the user account (i.e., for a default account contact in the case of Ubercart) for us to display in the report result. We would have put that there otherwise.

Perhaps we need to make these reports use db_rewrite_sql(). hehe

Comment #14

jrust commented 8 July 2009 at 16:24

Good point, sammys. I agree that the most useful statistics are likely based around a single account, not around the multiple addresses a person has. The only downside is that the report then does not easily show the store admin who these top purchasers are, since the username jxdoe29 is not as likely to be known by the admin as John Doe. Could we just show the billing name that the GROUP BY comes up with, with the understanding that the name could, in some (rare) cases, not be the same as what they used on other orders?
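One possible way to show a name while still grouping by account is sketched below with Python's sqlite3 module. This is not what the patches in this issue do, and it swaps the "whatever name GROUP BY picks" idea for a deterministic choice: the billing name from each account's most recent order, taken to be the highest order_id. The table layout and rows are hypothetical, and the approach has not been benchmarked on a large dataset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE uc_orders (
    order_id INTEGER PRIMARY KEY, uid INTEGER, order_total REAL,
    billing_first_name TEXT, billing_last_name TEXT);
INSERT INTO uc_orders VALUES
    (1, 1, 10.0, 'John', 'Doe'),
    (2, 1, 30.0, 'Jon',  'Doe'),
    (3, 2, 20.0, 'Ann',  'Smith');
""")

# "latest" finds each account's newest order; "l" supplies the billing name
# from that one order. Because l contributes exactly one row per account,
# listing its name columns in GROUP BY does not split the account's totals,
# and every non-aggregate column in the SELECT list appears in GROUP BY.
rows = conn.execute("""
    SELECT o.uid, l.billing_first_name, l.billing_last_name,
           COUNT(*) AS orders, SUM(o.order_total) AS total
    FROM uc_orders AS o
    JOIN (
        SELECT uid, MAX(order_id) AS order_id
        FROM uc_orders
        GROUP BY uid
    ) AS latest ON latest.uid = o.uid
    JOIN uc_orders AS l ON l.order_id = latest.order_id
    GROUP BY o.uid, l.billing_first_name, l.billing_last_name
    ORDER BY o.uid
""").fetchall()

for uid, first, last, orders, total in rows:
    print(uid, first, last, orders, total)
```

The extra join costs something, so whether this is acceptable on a store with many orders would need to be measured against the target database.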

Comment #15

Island Usurper commented 10 July 2009 at 19:09

Yeah, I think it's a good idea to keep the customer names on the report. I've made some adjustments to my earlier patch to make it Postgres compatible and to take some unnecessary calculations out of the count query.

I do want to point out that it's possible for the queries to be cached by the database, so it only takes .5 milliseconds to return the data on average. I did try to write the queries a couple of other ways, but they were always slower (nearly half a minute in some instances) and wouldn't cache.

A proper benchmark is still a good idea if anyone wants to run one. However, if it loads fast enough for you without knowing exactly how much time it takes, then that's good enough for me.

Comment #16

Island Usurper commented 14 July 2009 at 18:54

File	Size
customer-products_reports.patch	4.27 KB

...and like a doofus, I forgot to upload my latest patch.

Comment #17

itsmahitha commented 2 March 2010 at 21:59

Assigned:	Unassigned	» itsmahitha
Status:	Needs review	» Reviewed & tested by the community

Reviewed and Tested.

Comment #18

TR commented 2 March 2010 at 23:49

Assigned:	itsmahitha	» Unassigned
Status:	Reviewed & tested by the community	» Needs review

Comment #19

jasonabc commented 4 April 2010 at 20:54

I think the attached patch contains an error: it says to add "function uc_order_update_6011()", but this already exists in uc_order/uc_order.install. I changed it to "function uc_order_update_6015()" and that seems to work well.

With a large dataset - Store Admin > Reports > Product Reports now loads much faster - but clicking on the "Custom Product Report" tab to run a report on a certain start/end date is still unacceptably slow. It takes forever to execute the report.

Comment #20

3dloco commented 24 June 2010 at 02:05

+1

Comment #21

isaac.niebeling commented 14 September 2010 at 14:51

Just tested this on a DB with 1M+ rows in uc_order_products -- it definitely helps. Still took somewhere between 2:30 and 4:00 (I went and got a bowl of cereal while it was running :p) but the report is at least functional now.

Any chance this will be included in UC any time soon?

Comment #22

longwave commented 29 July 2011 at 21:23

Status:	Needs review	» Needs work

Variant on #1 committed to add indexes to nid and qty, to both branches.

When I tested the updated customer SQL on MySQL, it gave somewhat different results on a large dataset, as some customers with many orders haven't always entered the exact same name. I think grouping by just uid may be better, but I see this has problems on Postgres.