Posts Tagged ‘data corruption’

The trouble with storing binary data structures

Thursday, January 29th, 2009

To start off our series of Programming posts, I’d like to start you off on a technical issue we bumped into yesterday. This isn’t a new issue for us, but running into it again made us think ‘Hey, this would be a great topic for our first technical article’.

Assumptions: You know some C, You know what a struct in C is.

So, as we were working yesterday on a patch for Majesty, we bumped into an issue

We had the following data structure (this is an abbreviation, the real structure is code we aren’t really allowed to just post on a website!)

struct datastruct
{
  char ltr;
  short key;
  int value;
};

Now, we were using this to read in a blob of binary data from the games datafiles. These data blobs had been stored from Windows when the game was made, and on testing, loaded just fine into Windows.

On Linux, however, reading the data failed.

struct datastruct datastuff;

//src is a data stream that is the same on Windows and Linux
memcpy(&datastuff,src,sizeof(datastuff));

The same code on Windows and Linux produces different results! Why can this be?

The Answer

The answer lies in how the struct is stored.

Windows was being told to ‘pack’ its data structures, to save memory. So the data in the structure was held as follows

Byte     0    1    2    3    4    5    6
Data  |-ltr-||--key--||-----value--------|

When we were using Linux to read this data back in, it was not packed in the same way. On Linux, the default alignment of a 32 bit machine is to align values on 32 bit boundaries, like so

Byte     0    1    2    3    4    5    6    7    8    9    10   11
Data   |-ltr-|              |--key--|          |-----value--------|

As you can see, if you are simply reading in a data stream, you will find that the ltr will be correct, the key will be reading bytes from the middle of the value, and the value could be absolutely anything!

So, how do you fix this?

gcc uses a pragma to resolve this. Use

#pragma pack(n)

on a line of itsown before the struct is defined, where n is the number of bytes you want to pack to. n must be a power of 2 (so 1,2,4,8,16…).

When you are finished defining things that need to be packed in a certain way restore it using

#pragma pack()

So, if you did, at the start of the file defining the structure

#pragma pack(1)

Then the datastructure will look the same as in the first example, all scrunched up into 7 bytes. If you use

#pragma pack(2)

Then the data structure will be aligned so that each element starts on a 2 byte boundry. This means that it will take up 8 bytes, and there will be a 1 byte gap between ltr and key, which would again cause problems.

The second packing example (the one with all the gaps) is a

#pragma pack(4)

example.

So, how do you detect this when you find your data is corrupted?

It isnt that hard to detect when this has happened. If your data is not the same when you read it in, and you are reading a whole struct in from a binary stream or blob, then chances are, it is a packing issue. Look at the bytes in the stream, try and match them up with the bytes you see in your struct, and see if you can see a pattern, see where bits are missing from the data stream when you look in your struct.

If the data in the struct matches the data in the stream, but the data when you read is different from the data you have saved, don’t forget that packing works both ways. If you have a struct that is packed using 32 bit (4 byte) boundries, and you write this to a stream, it will look like this

Byte     0    1    2    3    4    5    6    7    8    9    10   11
Data   |-ltr-|              |--key--|          |-----value--------|

The bits in the gaps (bytes 1,2,3,6,7) will still be saved, but they can be ANYTHING. Do not rely on them being 0, it isnt always the case.

So if you read this into a packed data structure, you will find that you read in the first byte correctly, you then read the key as 2 completely random bytes, and the value will be made up of bits of the key and random bytes!

We hope that this little tutorial has been helpful to you, and given you a bit of an understanding of this problem. If you spot any mistakes, or see ways to improve it, please drop us a comment on the article!

  • Share/Bookmark