Good afternoon.

Our organization is serving many web applications from a 'cluster' of LAMP servers. They are sharing storage on a RAID with GFS and Red Hat Cluster Suite is providing high-availability failover. All three machines are running apache. A pair of load-balancing directors are forwarding requests to them using LVS (ipvsadm and such). The idea is to make the three machines in the cluster behave as one, transparent to the end user.

Two of these machines are configured with Master-Master MySQL replication. This means that each is the other's slave, so that any write operation to either database server is [almost] immediately duplicated to the other. Each of our sites are pointed to localhost, assuming that data will be consistant across servers. For all of our in-house web applications this works fine, providing both load-balancing and high-availability.

Drupal, however, is able to 'break' this replication. Every once in a while, replication encounters errors and stops. They're always duplicate key errors for INSERT statements, and they look something like this one:

Error 'Duplicate entry '21:9dda7f11b555fa99b9c9c6b5a2b5b639' for key 1' on query. Default database: 'drupal_thissite'. Query: 'INSERT INTO cache_filter (cid, data, created, expire, headers) VALUES ('21:9dda7f11b555fa99b9996cb5a2b5b639', '<p><font face=\"Arial,Helvetica\" size=\"3\" class=\"lineheight\"><font face=\"Arial,Helvetica\" size=\"3\" class=\"lineheight\"> Welcome to our County's website. </font></font></p><p><img src=\"/sites/thissite.com/files/u2/portrait.jpg\" width=\"214\" height=\"340\" /> </p><p><font face=\"Arial,Helvetica\" size=\"3\" class=\"lineheight\"><font face=\"Arial,Helvetica\" size=\"3\" class=\"lineheight\"> ... etc. etc.

Sometimes the errors occur on other tables as well. I can post more error messages if that will help.
Also, this occures more often under heavier load. One of our larger Drupal sites will be going live soon, and we'd like to resolve this before that acid test.

I've read around the forums and groups here, and done my share of googling. I haven't found anyone else detailing an implementation of multi-master MySQL replication with Drupal. I've seen a few discussions about modifying Drupal to send read and write requests to different database servers, but nothing that seemed to come to fruition.

Are there any large site operators that have encountered similar problems? Any developers care to speculate on why these tables are affected?

Thank you for your time,
Adam Vollrath
BIS Network Services

Comments

papile’s picture

I noticed too on our site, we have problems with caching. A few times the cache_set function broke mysql with the tremendous serialized queries it placed in there. So I completely rewrote everything, and replaced caching with memcached. It is faster also! We have a read only DB powering our site for users linked to a write master. So yes, we have drupal running successfully on a RO database. The memcached module should be live for drupal6. Unfortunately, I have not found it possible to replace ALL caching in drupal 5 without doing core hacks. So our drupal 5 is really hacked now, but when we upgrade to 6, I will be able to implement caching without core hacks, so I can just discard my old code.

High performance scaling in drupal is still in its infancy. With big players coming on board, they will change this soon.

pwolanin’s picture

Since it looks like this is an issue with the cache code, this would be the responsible function:

http://api.drupal.org/api/5/function/cache_set

Maybe a problem with either the table lock or a problem with the call to if db_affected_rows()?

---
Work: BioRAFT

adam_vollrath@groups.drupal.org’s picture

Thank you, pwolanin.

We're not only having issues with the cache* tables, but also the variable table, and a few others, like those created by some modules. I don't think any of Drupal's code is able to handle the multi-master setup we're using. (It's not specified to handle it.)

adam_vollrath@groups.drupal.org’s picture

I've done some more digging, and this seems rather relevant:
From: http://www.mysql.org/doc/refman/5.0/en/replication-faq.html (emphasis mine)

Q: What issues should I be aware of when setting up two-way replication?

A: MySQL replication currently does not support any locking protocol between master and slave to guarantee the atomicity of a distributed (cross-server) update. In other words, it is possible for client A to make an update to co-master 1, and in the meantime, before it propagates to co-master 2, client B could make an update to co-master 2 that makes the update of client A work differently than it did on co-master 1. Thus, when the update of client A makes it to co-master 2, it produces tables that are different from what you have on co-master 1, even after all the updates from co-master 2 have also propagated. This means that you should not chain two servers together in a two-way replication relationship unless you are sure that your updates can safely happen in any order, or unless you take care of mis-ordered updates somehow in the client code.

Would it be fair to say that, for Drupal, I cannot be 'sure that updates can safely happen in any order'? This arrangement works fine for our home-brew web-apps, but none of them are so dependant on MySQL for writing dynamic, time-sensitive, and globally read data. Does this mean that I shouldn't be expecting the tables to stay consistant when I throw replication into the mix?

papile’s picture

Yeah this would describe why there is a problem more pronounced in cache because it is being flushed, written and read frequently. Not to mention the write is huge, so it give plenty of time for the example given to happen. Mysql is not really in the big leagues yet when it comes to scalability. There are many ways this can go wrong. Maybe you can try a write and read slave.

adam_vollrath@groups.drupal.org’s picture

So, because I'd rather not receive any txt messages this weekend about our replication breaking, I've made an attempt to fix this problem. I've added the following line to my.cnf on each of my servers:
replicate-wild-ignore-table=drupal_%.cache%
((Note that all of my Drupal sites are in separate databases prefixed by 'drupal_'.))

This will not replicate the cache, cache_menu, cache_page, and cache_filter tables. These are the ones we've been having problems out of. Hopefully this will work, and Drupal won't break our replication anymore, and it won't break anything else.

Can anyone tell me what these tables do exactly? And is it ok if each server is generating them independently?

anisotropic’s picture

@Adam: did your above attempt fix the problems with Master <==> Master replication?

We had the same problems and backed out of the idea of redundant servers for Drupal because we didn't need the performance, we just wanted the redundancy. This was back before we upgraded to Drupal 5, and while there are moves afoot [1] to support Master / Slave replication but I don't see how that fixes scaling past the first level.

Otis Wildflower’s picture

BTW, for multimaster replication to work properly for unique key IDs that are automatically generated, mysql provides for preventing collision by giving each autoincrement its own base index to start:

auto_increment_increment = 10
auto_increment_offset = 2

Drupal (and any other other DB-reliant app), if it's not using mysql's own internal autoincrement, needs to have a similar per-DB mechanism to prevent key collision. Unless there's some sort of clustering exposed to Drupal via mysql DB libraries, you'll probably have to rely on mysql's autoincrement and stop using any Drupal-generated ID keys.

yang_yi_cn’s picture

That's very useful for tables like node, taxonomy, which have auto-increment primary id. However, it doesn't work for cache% table, which uses a hashed id as primary key, varaible table, which uses name, sessions table, which uses a sid, i18n_strings, which uses lid. All these will cause duplicated id still.

In MySQL, we can change the INSERT in theses files to 'INSERT ... UPDATE ON DUPLICATE ID ...' to fix the problem, but the patch is not being accepted to Drupal core because the PostgreSQL doesn't have something like that.

mrfelton’s picture

I had a similar problem in a master -> slave replication setup. Duplicate errors stoping things. Setting the following in my.cnf on the slave fixed:

slave-skip-errors=1062

--
Tom
www.systemseed.com - drupal development. drupal training. drupal support.

jyamada’s picture

A bit of an old thread, but you have recently posted Tom, so that's a good signal to me to pose this question ..
Does anyone know if this happens in Drupal 6.x too.?

Thanks
Joe

anonymous07’s picture

I realize this is a year old and about D5 but just in case any additional posts to this (especially on D6) :-)

sir_squall’s picture

Hello,

you can find the patch at:
http://www.cestfait.ch/content/drupal-6-patch-masterslave

is the same system master / slave on pressflow.

Enjoi

anonymous07’s picture

There are Drupal 6 specific tweaks for MySQL Replication posted here:
http://www.widwad.com/drupal6-mysql-replication

  • It works for both "Master-Slave" and "Master-Master" or "Multi-Master" Drupal MySQL Replication
  • And, this method does not require any code changes or patches to Drupal!