Archive for March, 2009

The importance of data integrity

By Michael Simms (CEO and head of Development)
Tuesday, March 31st, 2009

We had a few hours downtime over the last weekend, and I thought some information about it may help others with resolving possible similar errors that may happen in their own systems.

So, taking the clock back to just after midday Saturday, our warning monitors alert us that the main website has gone down. A quick look at the status terminal shows that we have had a kernel panic on the main website server, and the machine is locked. We put in a request for a reboot with the hosting company, and within minutes the system has been power cycled, and everything is running again.

This is where the work begins!

The big problem, we note, is that our backup slave mysql database on a different server in a different building is not receiving replication updates. It just sits there, and stopping and starting the slave achieves nothing. Checking the logs we find the following:

090328 14:37:53 [Note] Slave: connected to master 'repl@87.117.204.74:3306',replication resumed in log 'mysql-bin.000057' at position 263780655
090328 14:37:53 [ERROR] Error reading packet from server: Client requested master to start replication from impossible position ( server_errno=1236)

Hrm, doesn’t look good.

A little bit of information about mysql replication. It works by storing binary logs on the server, logs of commands executed, so that a slave can execute the same commands and have a database in the same state. The commands are sent via a TCP/IP network connection from master to slave.The slave keeps track of where it is and the master sequentially sends the commands to it.

Now, when you power cycle a machine (in this case when the system kernel panics and doesn’t respond to any typed commands, there isn’t much to do other than power cycle it), you do not always end up with everything saved to disc. To improve speed, a number of systems will cache data in memory instead of writing it to disc, and then write in chunks when it is efficient to do so. The problem comes that if some data is being held in memory in preparation of writing to disc, then that data will be lost if you just cut the power. It cannot magically write the data to the disc, it has no electricity to do it. Data will be lost. There are ways to stop this happening, but that is the topic for another day. The fact is that on any modern Linux distro, this will happen.

And so skip to the mysql replication, and the slave is requesting to start its replication from a point that the master said does not exist. This is simply because the data was never written to the physical disc, and the processes before reboot were accessing data that the kernel was caching in memory.

So, mysql replication becomes a victim of efficiency. When this happens, there are two ways to go about fixing it. First, we could just say ’start from the next valid position’ in the binary logs. This would get us up and running, but there is ‘a risk of data loss’ that we deemed to be unacceptable in a backup database. The data loss would be from any transactions that have been made but never written to the binary log because of the power outage. The second option, and the one we choose to take, is the option of creating an entire dump of the mysql database and starting replication again from a known good position.

So, we do this. Taking a dump of the database requires our web and email to be down for an hour, and that bites, but it is really the only option. We dump the database, restart the services, and then, we’re done, it should now be fine. All that is left now is to rebuild the slave from the masters data, and start replication running again.

Or that is how it should have gone.

We transfer the dump across, and run a rebuild of the database. All goes well for a few hours (a rebuild can take a LONG time as it rebuilds indexes on huge tables), until suddenly….

ERROR at line 1963239: Unknown command '\"'.

This was obviously not what we expected. Importing the dump again revealed the same problem, and we realised there was a problem we hadn’t encountered before.

We did all the usual checks, ensuring the database versions were the same, ensuring the dump on the server and master were the same using md5sum, and all seemed OK, except that the database dump wouldn’t import.

We were left wondering if something was wrong with mysqldump.

To be sure, we did another dump. The site had been running for a few more hours and so the database would be a little different than when we did the last dump. Stopping the servers for another hour, and repeating the process. The second time we had a different error message, but a fatal one nonetheless. And the same a third time.

We were becomming convinced that the database was dumping invalid information, and that if that was the case, we were in a bit of trouble because, well, you cannot really go through a dump in excess of 20GB by hand to look for errors. The line numbers of errors were worse than useless, with some of the lines being many MB in size.

A process of methodical deduction finally solved the problem. We spotted that /var/log/messages was showing a number of ATA errors as we were doing the dump. Not encouraging! So, we tried the dump on another hard drive, dumping once onto the location we had been writing to all afternoon, and at the same time dumping to a location on a different drive. The location we had been writing to was significantly different, when we diffed the two files, than the new location. And upon loading the new data dump into the database, suddenly we had a working slave again.

So, the mystery was solved. The data we were dumping was being written onto a hard drive that was seeming to just change random bits on its files. Never a good thing. We were left thinking that we would have had a MUCH easier time if mysql would check the integrity of the files it dumped.

So now this leaves us back up and running, but with some things to look at. We feel that mysql desperately needs to be able to check the integrity of files on a disc, to be sure that a dump, whether from mysqldump, or the binary logs, is valid. Without that the dump is just pointless.

As such, for one of our next contributions to the open source community we will be looking at either adding ourselves, or setting up a bounty for someone else, to add this useful piece of integrity checking, which we feel is vital to mysql. More information to follow later on this, once we have worked out some of the details!

Share/Save/Bookmark

Customer Services Update for March 2009

By Melissa Kealey-Bennett (Customer Services Manager)
Thursday, March 19th, 2009

Hi,

Welcome to a new monthly report for the LGP Blog. Here, I’m going to run though the common problems people have reported over the last month here at the customer services team, and report on how these have been resolved, or how to work around the problems. We are not addressing all the issues, just the top two or three that lots of people have had problems with, those that we think it would be helpful for people to be able to read up on!

X3: Reunion - Problems on a number of ATI cards - Patch Available

This problem manifests itself with the game crashing after the loading screen when starting a new game on a small number of ATI based graphics cards.

Finally after weeks of trying, we have managed to reproduce the error on one of our test servers. Our technical team have finished working on the solution and the patch is now available for everyone to download in our updates section.

X3: Reunion - Problem with the game not starting

This problem is one reported by a number of people who have been trying to run X3 using open source drivers for their graphics cards. Unfortunately, at this point in time, neither the ATI or Nvidia open source graphics drivers are capable of running X3, due to the requirement of Shader Model 3.

You can find out if your current setup can run X3 by downloading our Test Tool and checking that the Shader Level reported by the test tool is at least 3.0, and that Direct Rendering is possible.

Share/Save/Bookmark

Special Edition, while they last!

By Michael Simms (CEO and head of Development)
Saturday, March 14th, 2009

I thought today that I’d just post a picture here, of something that in a few weeks will probably never be seen again.

So, without further ado, here is a photo of just over 200 copies of X3 Special Edition all in one place!

x3specialedition

Share/Save/Bookmark

Sacred Gold is now, well, gold!

By Michael Simms (CEO and head of Development)
Friday, March 6th, 2009

sacredIt took a little longer than we had expected, but we can finally announce that Sacred Gold has been sent to the production company, and we expect it to be ready to ship in 2-3 weeks.

We expected to have had it ready weeks ago, but some last minute issues caused some unexpected delays. The network layer, Grapple, had an error in how it communicated through certain types of firewalls and NAT routers, which meant that only around 30% of players would be able to access the multiplayer features of the game on the PenguinPlay servers.

But that is all past us now, and the game is in production! The game also has a nice new large map insert being produced with the game, just as a small bonus extra! We are also working on the finalising of the demo, which we hope to have available before stock of the game arrives, so you can try it out before buying.

Share/Save/Bookmark

A little-known trick with C switches

By Michael Simms (CEO and head of Development)
Tuesday, March 3rd, 2009

OK, so this little programming post is nothing like rocket science, it really isn’t. It is, however, REALLY useful, and something that almost nobody knows about the C language. We came across it when we were porting a game a few years ago, and none of us had ever seen it before. We tried it, and it works!

So, lets take a C switch statement

switch (somenumber)
{
case 1:
case 2:
case 3:
  do_something();
  break;
case 4:
case 5:
case 6:
  do_something else();
  break;
}

Fairly standard, nothing too unusual here. However there is another way to do it!

switch (somenumber)
{
case 1 ... 3:
  do_something();
  break;
case 4 ... 6:
  do_something else();
  break;
}

Unfortunately, it doesn’t seem to work on anything other than C/C++, which is a shame!

Enjoy!

Share/Save/Bookmark