User:JesseW/wdb format
Yesterday, I was presented with a file format reverse engineering puzzle. My mom had a old address book database that she wanted to recover. It had been created in Microsoft Works 6.0 (I think) database format (the files with the .wdb extension). My first thought was to run strings(1) on it, and see if the data was in connected ascii text. This worked, mostly, as the text did show up; but it was not organized into fields and records, which would make it harder to recreate a database from it. When she explained that she was going to need the whole thing, not just a few entries, I said I would work on recovering more of the format.
My first attempts at finding a wdb parser involved a web search. Google can parse wdb files - but they don't publish their translation programs, and I didn't want to upload our personal address book to the web and wait for google to index it. I found various commerical programs that claimed to be able to recover wdb files, but I wanted to understand the format, not just buy a product which, might, or might not, work. I was unable to find any discussion in the free software/open source world on the details of the format(which is the main reason I am writing this right now). I did find various tutorials on reverse engineering, including a nice page on reversing file formats by John Rittenhouse, and the Reverse Engineering Wikibook. Both of these helped give me some pointers in my work. I also found a few other wdb files to examine via google, which I used a little in my work on the format.
Since I was unable to find a already written tool to do the job, I decided to investigate the format and write one myself. I used Python, along with Emacs, (and a little use of hexdump) as my tools of choice, as they are what I am most familiar with. I examined the files fruitlessly for a while - identifying various often used characters, but not getting a sense of what they meant. The breakthrough happened when I realized there was a recurring, 10 byte header before each of the ascii text strings representing the values of the fields(which were terminated by a NULL). This header always began with a \x0f byte and NULL, followed by 4 data bytes alternated with NULLs.
The format, as I currently understand it, is as follows:
- A header, from the beginning of the file, ending with \x25\x00\xf2 and 244 NULL bytes
- A 2nd header-like piece beginning with \xffT - i.e. \xff\x54 and extending for 4096 bytes, that contains column/field names and other things, and seems to begin at byte position 6144, i.e. 2**12+2**11.
- Each field value has a 10 byte header, with the following format: {type byte} {type byte, part 2} {data byte 1} \x00 {db 2} \x00 {db 3} {db 3, part 2} {db 4} \x00. data byte 2 is the field number, data byte 3 is the record number (adding data byte 3, part 2 when it goes over 256 records). I don't know what data bytes 1 or 4 do; although in the address book, data byte 1 has 35 different values, and data byte 4 is nearly always \x03, most of the rest of it is \x02, and \x00 in only three lines.
- The type byte's I've seen are:
- "6\x00" (0x36)
- which serves as a continuation line. I'm not sure when it is used.
- "\x0f\x00" (0x0f)
- strings, null terminated.
- "\x0e\x00" (0x0e)
- standard double float (I think, although this may be wrong...)
- "[T" (0x5b)
- propritary varient of standard float. The exponent, rather than being 7 bits, is extended to 11 bits, shrinking the mantissa. If this is taken into account, integers expressed in this form work. However, decimals do not. I don't yet know why.
- The type byte's I've seen are:
I've written a program to parse the format and translate it to CSV format; it is available here. If you know anything else about the format, or have any questions, please let me know. Any feedback gratefully accepted.
This tool works fine for me! thanks for the job. --eldraco (talk) 00:01, 11 May 2008 (UTC)