#303087: Move "Sent Messages" to privatemsg_filter, Add inbox, improve user filtering. has shown that some of the assumptions I made when changing the query builder may be wrong.
In short, the current query builder tries to do a GROUP BY across a UNION and that does not work.
Attached patch changes this and adds a UNION_GROUP_BY to do a a separate GROUP BY for the UNION than the one done on the former section of the query.
@litwol: is there a way to do a GROUP BY across a UNION? without that ability, it will be really difficult creating an "all messages" link which combines the current inbox and sent messages views. (by "difficult" I mean that I do not know how.)
| Comment | File | Size | Author |
|---|---|---|---|
| #39 | privatemsg_nbz.patch | 16.26 KB | naheemsays |
| #38 | privatemsg_nbz.patch | 15.24 KB | naheemsays |
| #37 | privatemsg_nbz.patch | 14.26 KB | naheemsays |
| #30 | privatemsg_nbz.patch | 12.44 KB | naheemsays |
| #29 | privatemsg_mrtoner.patch | 15.2 KB | naheemsays |
Comments
Comment #1
litwol commented@nbz: i am in favor of changing database schema to avoid using UNION. UNION makes our queries too complex and makes it hard for other modules to interact with.
Comment #2
naheemsays commentedIf it can be done, I would also favour that - I am finding keeping track of UNION's a pain and with current schema we will probably need more going forward...
Comment #3
litwol commentedLets do it. now is the best chance to rethink schema before we get too deep into features that will prevent us from improving it.
Comment #4
naheemsays commentedMost of the UNION's come from having to process the index table twice - once for the author and then once again for the recipients.
Combining them into one UID column should fix that, so when sending a message to N users, there will be N+1 rows created in the index table - the extra row having the UID = author UID and new = 0.
This would also simplify delete architecture as all you need is a single column - "deleted" instead of two, one for author, the other for recipient.
Another fringe benefit - the inbox automatically turns into an "all messages" folder.
Comment #5
naheemsays commentedPatch attached which does the above and a little bit more.
For good measure, I have also included a reworked patch from #291206: Delete Architecture. and removed the from/to columns - they will need to be reworked to fit in with conversations anyway and would complicate this patch too much if they were a part of it. IMO a separate issue.
I have NOT renamed the inbox to "All Messages" as I dislike that title and it could cause problems further down the line when we have tagging support. I will leave the renaming and any potential minefields to you. :P
(we can also get rid of "sent messages pretty simply now - just remove the menu item and make a few simple simple deletions in the privatemsg_list function.)
Comment #6
litwol commentedThis will be a tough cookie to review. we need to brainstorm how the new schema will affect our future development. delete architecture, mail tagging, mail forwarding, mail this, mail that.
Comment #7
naheemsays commentedI have included delete architecture.
As for tagging, I would assume another JOIN in the query and two additional tables: pm_tags (tag_name, tag_id) and pm_tags_index (tag_id, thread_id, uid)
and then some syntax in the list_alter functions along the lines of
No idea about forwarding and the rest... forwarding would have to involve a new thread_id (and be per message rather than per conversation) but probably not much else.
EDIT - forgot to mention the main reason for this approach - it feels right. :P
Comment #8
mrtoner commentedThis is where I was going in this comment. However, as pointed out later in that topic:
- It's necessary to keep the message around even it's been deleted, to maintain participant data (nbz)
- The message needs to be deleted, not just marked for deletion (litwol)
We need to resolve those two conflicting points.
Comment #9
litwol commentedMaybe instead of 'deleting' messages. we can describe the functionality as 'unsubscribe from message'. where the message remains to exist, but you cannot receive new messages. and it gets moved somewhere you dont have to see it unless you really want to.
Comment #10
mrtoner commentedHow about three database tables:
When a thread is started, a record for each participant is created in pm_threads.
When a message is sent, a record is created in pm_messages.
When a message is sent, a record is created in pm_instances for each recipient and the author (n+1).
When a message is replied to, the participant data is pulled from pm_threads.
When a message is deleted, the record is deleted from pm_instances.
When all instances of a message are deleted, the record is deleted from pm_messages.
When all messages are deleted from a thread, all corresponding records are deleted from pm_threads.
Comment #11
litwol commentedHow does it handle a scenario when: user A deletes the whole thread from his messages list, but user B still supposed to know that he previously communicated with user A.
Comment #12
mrtoner commentedThe user is only deleting messages. In that case, all of User A's messages are deleted from pm_instances. Records for User A and User B are still in pm_threads, so a reply by User B will still go to User A.
Comment #13
litwol commentedok very good. it does seem to simplify a lot. however, consider scenario when messages database grows very big. having to do 3 joints just to get some messages sounds like a performance hit. but at the same time i'm not too concerned with it. sites that will generate THAT many messages will probably be running on very very fat boxes to support the necessary performance .
Comment #14
naheemsays commented@ comment 8 - marking for deletion works with the proposed patch. Actual deletion would cause issues, but even with the three table situation, the actual message is not deleted from pm_messages. I don't think we can delete the stuff in pm_messages until all participants have marked the message as deleted. (the problem when I first proposed moving (back) to this schema was the intent to remove the row containing the data for the delete action.)
As for the proposed schema in comment 10, the number of JOIN's I do not think that is an issue - when we have tagging, we will probably have to have more JOIN's anyway.
I don't think the proposal in 10 adds enough yet (seeing the actual queries may change my mind, as one of the hurdles is overcoming unforseen issues :P). An improvement in it that I like is moving the timestamp to the pm_messages table, but apart from that, the queries will probably not be much simpler - just involving different tables for different actions.
Comment #15
mrtoner commentedI see no need to "mark for deletion." Instead, actually delete the instance of the message from pm_instances when the participant (author/recipient) wants it deleted. The remaining instances still exist in the other participants' message store.
When all instances of a message have been deleted (all participants have deleted their copy), the message is deleted from pm_messages. When all messages have been deleted from a thread (all participants have deleted all messages), the thread is deleted from pm_threads.
Comment #16
litwol commented@mrtoner: perhaps a patch will clarify all the outstanding questions?
Comment #17
mrtoner commentedWell, I'm likely to screw it up because when I get into JOINS and such I'm going to be a bit out of my element, but I'll give it a go. Let's see what I can do this weekend.
Comment #18
naheemsays commentedI might give this a go too - but a couple of concerns first:
1. thread_id in pm_messages. Currently we calculate the thread_id AFTER inserting into that table (if it already exists, we keep that value, but if not, thread_id = mid.). Doing this with the proposed schema means we will write to pm_messages, and then update that same record. No idea how bad this is or even whether it is worse than what we do now, but the concern should be noted. On the other hand, getting participants will be much simpler.
I think getting rid of thread_id from pm_messages will cause the read queries needing thread_id to become more complicated.
2. pm_thread assumes that a conversation does not allow the participants to change. Good for new users, but does the same apply when updating from drupal 5 where the recipients can be arbitrary? (to be fair, my approach also ignores this). This is probably ok.
Comment #19
naheemsays commentedQuestion: is it acceptable to use string matching instead of integer matching? (ie how big is the hit?)
Reason: I can extend my above patch and make the UID column in pm_index be smalltext or soemthing similar - that way we can prefix user id's with u_ and later expand and add g_{group_id} for OG support, r_{role_id} for roles etc etc.
Would this be scalable for larger sites/installations?
Comment #20
litwol commentedI dont see the reason for this right now. lets hold it off. if you want to extend functionality to other types then you can just create another join using the query builder.
Comment #21
naheemsays commentedjust making sure the stuff is expandable in the future - from what I see, both approaches can be explianed in the mentioned way - but we probably will not want to as there would be other downsides (person leaving a group/role will lose messages etc etc...)
Comment #22
naheemsays commentedupdated version of my patch in comment #5 to move timestamp from index to messages table.
(Now time to modify this to get as close as possible to mrtoner's proposed schema so that there is a choice of paths.)
Comment #23
naheemsays commentedsame as above, but with upgrade function to move data from drupal 5 version of privatemsg (there is still a need to test for duplicate messages and also to remove the other tables such as privatemsg_block_user and privatemsg_folder).
My thinking - it is easier to upgrade from privatemsg for drupal 5 to my version of this patch for drupal 6 as there is no dataloss.
If we decide to go mrtoner's way, moving from here to his schema should be trivial and we can also provide an upgrade path (create pm_threads, fill in participants and then drop all rows in pm_index where deleted = 1 and then drop the deleted column).
A curious thing - I migrated a privatemsg install with the privatemsg for being approx 2.4 MB, and privatemsg_archive being 14KB. The new combined pm_message and pm_index is 5.2 MB (3.9 for the former, 1.3 for the latter.)
(I would also like to be the place from where we give support changes to the schema)
Comment #24
naheemsays commenteduntested update function from above patch to mrtoners layout:
The following would also be added to the schema
and apart from this, the only functions that need to be altered are privatemsg_privatemsg_participants_alter, privatemsg_send, privatemsg_form_alter and privatemsg_preprocess_privatemsg_to.
I will be honest and think this is mostly unneeded as the thing it simplifies is getting all the participants in a thread, but we already have a working function for that.
(still as CNW as I have not created a patch for this, nor have I finished the code writing, but I need sleep.)
Comment #25
mrtoner commentedMy thought was that there would be a row in pm_threads for each participant, rather than a single row for all participants. That may not be the best practice, but a length of 255 for 'participants' as is currently suggested will be too small for conversations with more than, say, 40 participants.
Yes, 40 sounds like a lot, but if, for example, we allow privileged users to PM a site's entire user base, 40 will be too little.
Re: #18-2 -- I agree that changing the participants mid-thread (since we're now threading) is bad.
Re: #18-1 -- I remember thinking earlier that adding thread_id to pm_instances might be make for faster displaying of the message list. It violates data normalization rules, but are there benefits that outweigh the additional data being saved?
Comment #26
naheemsays commented"data normalisation rules"? hehe, I have no idea what that is. The reason I moved thread_id back to pm_index was that it just felt wrong doing an INSERT INTO followed straight up by an UPDATE into the same row.
as for the size of the field - I cheated and copied the schema from the subject, hence the current limit. No real reason. as for storing them all in one field vs one user per row, I misunderstood what you meant. However, going the latter way, it is no different from the current pm_index and is just duplicating the data.
What do you say? which way should I take the patch?
As for speed, I doubt it makes a difference (but i am totally clueless when it comes to databases - I just try things and either they go boom or they don't.) - we will still need to load the subject from the pm_message table.
PS can we keep pm_index called that instead of pm_instances? If we do keep that, the changes for your scheme are pretty minor over my patch.
(PPS - patch in comment 25 is almost CNR - just need to change the update function to use update_sql instead of db_query...)
Comment #27
mrtoner commentedFancy terminology so I can look smart. One rule in creating databases is that you don't duplicate data between tables. For example, moving 'timestamp' to pm_messages keeps that data from being duplicated when that data doesn't change between instances.
So, putting 'thread_id' in pm_instances duplicates the value in pm_messages. The relation between the tables is 'message_id', so a JOIN would give you the 'thread_id'. However, I'm just wondering if duplicating the 'thread_id' would make for a quicker query -- and would make violating the rule a good thing. Maybe not, because -- as you point out -- we still need to grab the subject and body from pm_messages.
I'm all for making things easy, but I'm also for making things understandable. Someone coming in later might wonder why a table is named "index" when it's not really an index. I'd like to go with the name change. Easy for me to say, though, when I'm not making the changes. :-)
Comment #28
naheemsays commentedThanks for the explanations. There is a problem with putting thread_id in pm_messages - for new threads it is derrived AFTER writing the message to pm_messages. TO get around this you need a hack - add dummy therad data i first time around, calculate the thread_id and then update the field.
I don't think there are (m)any cases where pm_messages can be used without a JOIN with pm_index/instances, so IMO thread_id should only be in the latter table.
The problem is not a read problem, but a write one - there is also a similar issue with writing to pm_thread where you would only write to it if the post starts a new thread, so there would need to be too loops - one if a new thread, and then another for writing to pm_instances.
Comment #29
naheemsays commenteduntested version of mrtoner's proposal - with the difference that the thread_id is staying in pm_instances instead of pm_message.
Now that both proposals are here, time to review and select one.
Comment #30
naheemsays commentedand my competing patch - without the update function this time.
Comment #31
naheemsays commentedJust a guide, the differences between the two patches:
1. pm_message is currently identical in both. mrtoner DID want thread_id in there, but I resisted in my version of his architecture. :P
Also a question. me and mrtoner have both moved the timestamp from pm_index/pm_instances into pm_message. Was there a reason for the original design? such as speed issues? If so, that bit can be trivially reverted in both patches as pm_message and pm_[index|instances] are rarely used without a JOIN between the two.
2. In my version, delete is indicated by a "deleted" column = 1 in pm_index. On mrtoner's, it is indicated by lack of the data in pm_instances (making that table lossy, but the can probably be recreated by using the data in pm_thread).
3. With mrtoners way, deleting the row from pm_instances means that if a participant deletes a message, the instances table no longer knows that the user was a participant. To get around this, there is another table pm_thread which stores all participants in a thread.
Query wise, the list alter functions lack a where deleted = 0 in mrtoner's version (making his query a little simpler) and also the participants alter is different at it reads from pm_thread instead of pm_index as in mine.
4. The only major difference is in pm_send where there is extra logic in mrtoner's version to insert the relevant records into pm_thread on thread creation.
5. For upgrading from Drupal 5, I don't see how we can easily get to mrtoner's version of the schema without going through an intermediary step similar to my proposed version.
On the whole, I still prefer mine (call it ego), but both are IMO an improvement over what we currently have and extending either to eg add tag support should be pretty similar.
Comment #32
naheemsays commentedso... any reviews?
I have pretty extensively tested the patch in comment 30 and it works well.
There is an acid to see how well the architecture(s) works: #314327: Tagging architecture. however, I cannot figure out how to make that work - the framework is there, but the query does not give any results.
Comment #33
litwol commentedweekend is coming up and i should be reviewing our patches tomorrow.
Comment #34
litwol commentedI had no time at all to do it this weekend :( i will try to do it in the upcoming days.
Comment #35
naheemsays commentedNo probs - you're not the only one who has suddenly run out of free time :(
Comment #36
naheemsays commentedThe "acid test" #314327: Tagging architecture. patch is now also ready (for the single tag case) and may be best to test at the same time as this patch.
(I would also like #315325: Userblocking architecture + other fixes patch to go in soon as these three should complete the schema rearchitecture.)
Comment #37
naheemsays commentedAdded a delete button - not very elegant.
Comment #38
naheemsays commentedand a slightly less ugly patch - this approach can be extended and made less ugly when needed.
Comment #39
naheemsays commentedand updated again - this time to add a function to add a function for per message actions which will allow a row of buttons.
Also, made changes to only show the reply box if there are messages that have been displayed and give an error message if there are no (undeleted/readable) messages in the thread.
Comment #40
litwol commentedWeird / incomplete behaviors:
1) When author deletes a message, it is still visible in sent messages list. however when i try to view the deleted message i get notice saying "There are no messages available to display".
Overall this make sense because you cant view something that is deleted, maybe we need to add some kind of visual representation in sent folder that message was deleted. or maybe not even show it in the sent list (easiest..). of course the message will need to be undeleted before it can be viwed again. that is out of the scope of this patch.
2) When all recipients delete the message, it dissapears from the author's sent messages list.
Overall i like what i see. i am committing it so we can work on improving it and new features.
good job :)
Comment #41
litwol commentedcommited: http://drupal.org/cvs?commit=146261
lets do a quick round of code cleanup and keep a close eye on the database. we need to freeze schema now and make a beta release.
Comment #42
naheemsays commentedWahey!
As for the remaining issues:
For number 1, I would prefer to go the route of #303087: Move "Sent Messages" to privatemsg_filter, Add inbox, improve user filtering. - atleast for the beta and then stay that way unless there is some fuss over the lack of sent items.
If you want to actually fix the buy, you can change the following (around line 679):
to
As for number 2, I cannot replicate that behaviour.
Comment #43
Anonymous (not verified) commentedAutomatically closed -- issue fixed for two weeks with no activity.