Pythian Blog: Technical Track

Analyzing SQL Server Data File Anatomy

A database is a defined space stored in the system, grouping user objects. This space can be split into several files organized in groups.

In SQL Server, you have a limit of 32,767 databases per instance, and each database can store more than 2 billion objects, reaching a physical size up to 524,272 terabytes!!!

In this post, I will explain the physical structure of a SQL Server data file in greater depth. Unfortunately, I cannot write all the assumptions, behaviors, and details on this topic, but you will read a good overview of a file structure and a good starting point to better understand index internals, file fragmentation, transaction log internals, and so on…

Before we start, let’s remember some vital concepts.

Database Files

You can see a database file as a normal operating system file. A database must have a least two files: one for data and one for t-logs.

The data files are categorized into two types: Master Data File and Secondary Data File(s). Only one Master Data File is possible for each database, and it keeps track of all the other files in the database. By convention, the Master Data File has the .mdf extension and the secondary files have .ndf.

For the log files, the extension is .ldf. Each database must have at least one.

Page Structure

In Microsoft, the fundamental unit of data storage in SQL Server is the page. The disk space allocated to a data file (.mdf or .ndf) in a database is logically divided into pages numbered contiguously from 0 to n. Disk I/O operations are performed at the page level. That is, SQL Server reads or writes whole data pages.

All the pages in SQL Server have the same structure. At the top of the page there’s a 96 bytes header, and it includes the PageID, the kind of structure the page belongs to, the number of records in the page, and pointers to the previous and next pages. So there are 8096 bytes available on which we can store records. However, the maximum length of data records is 8060 bytes, as in the bottom of the page (latest 36 bytes), resides the slot array containing information about the offset of the rows (2 bytes per row). This array can grow from bottom to top based on the size of the records, so more records can be accommodated in a page and offset table will take more space. These slots are stored in the order defined by index key. For heaps, there’s no special order.

Space Allocation

The space allocation in SQL Server is managed in portions called “extents”, which are basically a group of eight logically contiguous pages. So, a lot of 64 KB (8*8 KB/Page).

The use of extents makes the allocation system more efficient. These units can be distinguished in two types:

  • Uniform extents: Extent units composed by pages owned by a single object
  • Mixed extents: Extent units composed by pages shared to up to eight objects.

The pages are allocated for a new table or index starting from a mixed extent. When the object grows to eight pages, all future allocations will use uniform extents.

All the information about the extents is tracked in GAM, SGAM, and IAM pages, also known as “allocation bitmaps”.

File Structure

Basically, a SQL Server data file has the following basic structure, which we will analyze:

  • Page 0: Header
  • Page 1: First PFS
  • Page 2: First GAM
  • Page 3: First SGAM
  • Page 4: Unused
  • Page 5: Unused
  • Page 6: First DCM
  • Page 7: First BCM

File Header

All the files have a header into the page number 0. This header stores metadata about that particular file and is not recoverable by checkdb. In case of damage, you must restore the entire file.

You can explore the header of some database using the following DBCC command:

DBCC fileheader [( {‘dbname’ | dbid} [, fileid])

You will find useful information on RedoStartLSN, BindingId, SectorSize, Status, and Growth records.

Page Free Space

The PFS identifies the allocation status and determines the amount of free space. These pages contain 1 byte for each page and cover a range of 8,088 pages in a file.

The basic PFS structure is:

  • Bit 1: Indicates whether the page is allocated or not.
  • Bit 2: Indicates if the page is from a mixed extent.
  • Bit 3: Indicates that this page is an IAM page.
  • Bit 4: Indicates that this page contains ghost records
  • Bits 5 to 7: A combined three-bit value, which indicate the page fullness as follows:
    • 0: The page is empty
    • 1: The page is 1–50% full
    • 2: The page is 51–80% full
    • 3: The page is 81–95% full
    • 4: The page is 96–100% full

The first page of each file is a PFS, and another PFS is allocated for every 8,000 pages after the first PFS.

After an extent has been allocated to an object, the Database Engine uses the PFS pages to record which pages in the extent are allocated or free. The  SQL Server allocates a new extent only when it cannot find a page with sufficient space to hold the data.

GAM and SGAM

SQL Server uses two special types of pages to record which extents have been allocated and for which type of use (mixed or uniform) the extent is available:

Global Allocation Map (GAM) pages: Tracks the allocation of any type of extents. A GAM has a bit for each extent in the interval this covers. The bit 0 means that the corresponding extent is in use, and in other hands, the bit is 1 tracks free extents. A GAM can cover about 64,000 extents, or almost 4 GB of data. So, there is one GAM page for every 4 GB of file size.

Shared Global Allocation Map (SGAM) pages: This is the same as GAM, but it refers only to mixed extents. The SGAM has a bit for each extent in the interval it covers. If the bit is 1, the used extent is a mixed extent and has free pages, and the 0 bit represents either a non-mixed extent or a mixed extent whose pages are all in use.

Differential Changed Map (DCM)

It is the seventh page (page 6). It keeps track of which extents in a file have been modified since the last full database backup.

Like GAM and SGAM, BCM pages have 1 bit for each extent in its covered sector of the file. The first BCM page is on the seventh page of every data file and at every 511,230 pages thereafter.

Bulk Changed Map (BCM)

Also known as Minimally Logged Map (ML Map), it is the eighth page (page 7) and is used when an extent in the file is used in a minimally or bulk-logged operation.

Each bit on a BCM page represents an extent, and if the bit is 1, this extent has been changed by a minimally-logged bulk operation since the last transaction log backup. The first BCM page is on the eighth page of every data file and at every 511,230 pages thereafter. All the bits on a BCM page are reset to 0 every time a log backup occurs.

The Boot Page

The page 9 in file 1 is perhaps the most important page in the database – the Boot Page. The base metadata about the whole database is stored in this page. If this page is corrupt, checkdb cannot repair this page, and a restore of file number 1 will be needed to recover the database.

To get information about the database boot page, you can execute DBCC PAGE pointing to page 9 in file 1 or DBCC dbinfo [(‘dbname’)], always activating the trace flag 3604 (DBCC TRACEON(3604)).

Index Allocation Map – IAM Pages

The function of IAM pages is to track the extents that belong to tables or indexes. The IAM is a bitmap that refers extents to objects. This kind of page covers a 4-GB range, belongs to a GAM interval, and is not located at known locations in a file.

For each 4-GB range of data, partition, and allocation unit type, an IAM page is required to track.

By the way, an Allocation Unit is a group of pages that manage data based on their page type and can be from three types:

  • IN_ROW_DATA:  Data or index rows that contain all data, except large object (LOB) data.
  • LOB_DATA: Large object data stored in one or more of these data types: text, ntext, image, xml, varchar(max), nvarchar(max), varbinary(max), or CLR user-defined types.
  • ROW_OVERFLOW_DATA: Variable length data stored in varchar, nvarchar, varbinary, or sql_variant columns that exceed the 8,060 byte row size limit.

For example, a table on four partitions that has all three types of data (in-row, LOB, and row-overflow) has at least 12 IAM pages.

There are four possible combinations of the above pages that are valid for an extent:

IAM Chains

Knowing that a single IAM page covers a 4-GB range (512.000 pages) in a file and belongs to a single GAM interval, what happens if the allocation unit contains extents from more than one file or more than one 4-GB range of a file? There will be multiple IAM pages linked in an IAM chain, an unordered sequence of IAM pages.

Basically, an IAM page has a header containing information about which GAM interval this belongs to, the sequence number, and the linkage information.

Demo

For these demos, I will use the AdventureWorks2008R2 database to make a simpler reproduction. You can download this sample database, and others, on https://msftdbprodsamples.codeplex.com/.

Demo 1: Page Anatomy and PFS, GAM, SGAM, DCM and BCM demonstration

So let’s start demonstrating the Allocation Bitmaps. The %%lockres%% and %%physloc%% mechanisms will return the physical location of the row. So I’m going to use the following query to return the first 5 rows of databaselog table, showing the physical location of each row.


SELECT TOP(5) a .%% physloc %% as Address,

a
.%% lockres %% as LockHashValue,
a
.*
FROM
dbo.databaselog AS a

I got the following rows, and I will explore the second row.

To explore the second row internals, I will use the Database Console Command (DBCC) PAGE for it. Notice that 1:150:1 represents FILE:PAGE:SLOT. So here is the command, using page 150:


DBCC traceon(3604) — We need to activate this flag before execute the DBCC PAGE

DBCC page (adventureworks2008r2, 1, 150, 2)

DBCC traceoff(3604)

The execution returned a long result, but for now let’s concentrate on Allocation Status in the Header section:

Allocation Status

GAM (1:2) = ALLOCATED                SGAM (1:3) = NOT ALLOCATED
PFS (1:1) = 0x63 MIXED_EXT ALLOCATED  95_PCT_FULL                         DIFF (1:6) = CHANGED
ML (1:7) = NOT MIN_LOGGED

Basically we can get all the allocation information about this page and the PFS, GAM, SGAM, DCM, and BCM, which is tracking this page.

After the page Header, we have the data. Here is a sample:

lot 0, Offset 0x60, Length 2732, DumpStyle BYTE
Record Type = PRIMARY_RECORD         Record Attributes =  NULL_BITMAP VARIABLE_COLUMNS
Record Size = 2732

Memory Dump @0x0000000009ABA060

0000000000000000:   30001000 01000000 5d85e400 23a00000 †0…….]ä.# ..
0000000000000010:   08000006 0027003f 00450055 00db03ac †…..’.?.E.U.Û.¬
0000000000000020:   0a640062 006f0043 00520045 00410054 †.d.b.o.C.R.E.A.T
0000000000000030:   0045005f 00540041 0042004c 00450064 †.E._.T.A.B.L.E.d
0000000000000040:   0062006f 00450072 0072006f 0072004c †.b.o.E.r.r.o.r.L
0000000000000050:   006f0067 00430052 00450041 00540045 †.o.g.C.R.E.A.T.E
0000000000000060:   00200054 00410042 004c0045 0020005b †. .T.A.B.L.E. .[
0000000000000070:   00640062 006f005d 002e005b 00450072 †.d.b.o.]…[.E.r
0000000000000080:   0072006f 0072004c 006f0067 005d0028 †.r.o.r.L.o.g.].(
0000000000000090:   000d000a 00200020 00200020 005b0045 †….. . . . .[.E
00000000000000A0:   00720072 006f0072 004c006f 00670049 †.r.r.o.r.L.o.g.I

And to conclude, the row offset. This way, you can see all the three parts of the page anatomy, as explained earlier in this article:

OFFSET TABLE:

Row – Offset

5 (0x5) – 7017 (0x1b69)
4 (0x4) – 6774 (0x1a76)
3 (0x3) – 5713 (0x1651)
2 (0x2) – 4560 (0x11d0)
1 (0x1) – 2828 (0xb0c)
0 (0x0) – 96 (0x60)

Demo 2: Looking to IAM

Now I will execute a query that will return all allocation units for the table DatabaseLog. For this I will relate sys.partitions and sys.system_internals_allocation_units DMV, and for each allocation unit we can get the page info, which guides us to the IAM info/chain.

I’m using a page conversion algorithm done by Kimberley Tripp (https://sqlskills.com/blogs/paul/post/Inside-The-Storage-Engine-sp_AllocationMetadata-putting-undocumented-system-catalog-views-to-work.aspx).

So, here’s the query:
SELECT Object_name(p.object_id) AS ‘tableName’,
i.name AS ‘indexName’,
p.partition_number,
au.type_desc,
CONVERT (VARCHAR(6), CONVERT (INT, Substring (au.first_page, 6, 1) +
Substring (
au.first_page, 5, 1)))
+ ‘:’
+ CONVERT (VARCHAR(20), CONVERT (INT, Substring (au.first_page, 4, 1) +
Substring (au.first_page, 3, 1) + Substring (au.first_page, 2, 1) +
Substring (
au.first_page, 1, 1))) AS ‘firstPage’,
CONVERT (VARCHAR(6), CONVERT (INT, Substring (au.root_page, 6, 1) +
Substring (
au.root_page, 5, 1)))
+ ‘:’
+ CONVERT (VARCHAR(20), CONVERT (INT, Substring (au.root_page, 4, 1) +
Substring
(au.root_page, 3, 1) + Substring (au.root_page, 2, 1) + Substring
(au.root_page,
1, 1))) AS ‘rootPage’,
CONVERT (VARCHAR(6), CONVERT (INT, Substring (au.first_iam_page, 6, 1) +
Substring (au.first_iam_page, 5, 1)))
+ ‘:’
+ CONVERT (VARCHAR(20), CONVERT (INT, Substring (au.first_iam_page, 4, 1)
+
Substring (au.first_iam_page, 3, 1) + Substring (au.first_iam_page, 2, 1)
+
Substring (au.first_iam_page, 1, 1))) AS ‘firstIamPage’
FROM sys.indexes AS i
JOIN sys.partitions AS p
ON i.object_id = p.object_id
AND i.index_id = p.index_id
JOIN sys.system_internals_allocation_units AS au
ON p.hobt_id = au.container_id
WHERE p.object_id = Object_id(‘databaselog’)
ORDER BY tablename;

And here is the result (Notice that when executing the DBCC IND, you can have the same info in a more fragmented layout):

If we execute the DBCC PAGE into the root page in option 3, we can get all the IAM chain belonging to the allocation unit. In this case, I choose the IN_ROW_DATA belonging to the clustered index PK_DatabaseLog_DatabaseLogID.

DBCC traceon(3604)
DBCC page (adventureworks2008r2, 1, 1270, 3)
DBCC traceoff(3604)

In the output, checking the ChildPageID, it’s possible to observe all the pages owned by the chain, in the proper sequence:

So now you can explore all the data belonging to this allocation unit using the DBCC PAGE. In this IAM demo, it was possible to verify all the allocation units of the referred table, in this case four, and check the IAM Chain of one of those allocation units.

Demo 3: We know about tables now, but… Where are the other objects?

Where are the Stored Procedures, Views, functions? Obviously, stored into the same file structure! But…how do we reach this?

To demonstrate this, we will need to connect using the Dedicated Administrator Connection (DAC) in order to access sys.sysobjvalues. In this case, I’m crossing sys.sysobjvalues with sysobjects table and just showing the objects with type “P” (Stored Procedures). Here is the query:


SELECT v .%% physloc %% as address,

CONVERT
(VARCHAR(6), CONVERT (INT, Substring (v.%%physloc%%, 6, 1) +
Substring
(
v
.%%physloc%%, 5, 1)))
+
‘:’
+
CONVERT (VARCHAR(20), CONVERT (INT, Substring (v.%%physloc%%, 4, 1) +
Substring
(v.%%physloc%%, 3, 1) + Substring (v.%%physloc%%, 2, 1) +
Substring
(v
.%%
physloc%%, 1, 1))) AS ‘rootPage’,
o
.name objectName,
Cast
(v.imageval AS VARCHAR(max)) code
FROM
sys.objects o
INNER
JOIN sys.sysobjvalues v
ON
o.object_id = v.objid
WHERE
o.type = ‘P’

Here is the output:

Voilà! Now you know the exact page on which your object is stored, and you can explore it. It’s also possible to decrypt the code, if the procedure is encrypted. I recommend reading this article: https://williamsorellana.org/page/2/

Let’s explore the object “uspPrintError”, which is stored on page 257:

DBCC traceon(3604)
DBCC
page (adventureworks2008r2, 1, 257, 3)
DBCC
traceoff(3604)

So, in the middle of that huge output, we can find the object on slot 13 (I will export just a part):

Slot 13 Offset 0x1a6b Length 765
Record Type = PRIMARY_RECORD         Record Attributes =  NULL_BITMAP VARIABLE_COLUMNS
Record Size = 765

Memory Dump @0x0000000009ABBA6B

0000000000000000:   30001100 0102797f 4d010000 00000000 †0…..y.M…….
0000000000000010:   000600c0 02002000 fd023801 02000000 †…À.. .ý.8…..
0000000000000020:   0d0a0d0a 2d2d2075 73705072 696e7445 †….– uspPrintE
0000000000000030:   72726f72 20707269 6e747320 6572726f †rror prints erro
0000000000000040:   7220696e 666f726d 6174696f 6e206162 †r information ab
0000000000000050:   6f757420 74686520 6572726f 72207468 †out the error th
0000000000000060:   61742063 61757365 64200d0a 2d2d2065 †at caused ..– e
0000000000000070:   78656375 74696f6e 20746f20 6a756d70 †xecution to jump
0000000000000080:   20746f20 74686520 43415443 4820626c † to the CATCH bl
0000000000000090:   6f636b20 6f662061 20545259 2e2e2e43 †ock of a TRY…C
00000000000000A0:   41544348 20636f6e 73747275 63742e20 †ATCH construct.
00000000000000B0:   0d0a2d2d 2053686f 756c6420 62652065 †..– Should be e
00000000000000C0:   78656375 74656420 66726f6d 20776974 †xecuted from wit
00000000000000D0:   68696e20 74686520 73636f70 65206f66 †hin the scope of
00000000000000E0:   20612043 41544348 20626c6f 636b206f † a CATCH block o
00000000000000F0:   74686572 77697365 200d0a2d 2d206974 †therwise ..– it
0000000000000100:   2077696c 6c207265 7475726e 20776974 † will return wit
0000000000000110:   686f7574 20707269 6e74696e 6720616e †hout printing an
0000000000000120:   79206572 726f7220 696e666f 726d6174 †y error informat
0000000000000130:   696f6e2e 0d0a4352 45415445 2050524f †ion…CREATE PRO
0000000000000140:   43454455 5245205b 64626f5d 2e5b7573 †CEDURE [dbo].[us
0000000000000150:   70507269 6e744572 726f725d 200d0a41 †pPrintError] ..A

So now, what if we…….

DROP PROCEDURE uspprinterror

… and quickly read the page content header:


DBCC traceon(3604)

DBCC
page (adventureworks2008r2, 1, 257, 3)
DBCC
traceoff(3604)

We will have the following output:

PAGE HEADER:

Page @0x0000000082582000

m_pageId = (1:257)                   m_headerVersion = 1                  m_type = 1
m_typeFlagBits = 0x4                 m_level = 0                          m_flagBits = 0x0
m_objId (AllocUnitId.idObj) = 60     m_indexId (AllocUnitId.idInd) = 1    Metadata: AllocUnitId = 281474980642816
Metadata: PartitionId = 281474980642816                                   Metadata: IndexId = 1
Metadata: ObjectId = 60              m_prevPage = (1:23236)               m_nextPage = (1:22959)
pminlen = 17                         m_slotCnt = 35                       m_freeCnt = 2889
m_freeData = 7528                    m_reservedCnt = 0                    m_lsn = (554:35134:9)
m_xactReserved = 0                   m_xdesId = (0:1586506)               m_ghostRecCnt = 1

We found that there’s a Ghost record (m_ghostRecCnt = 1), and we the have the following info in slot 13:

Slot 13 Offset 0x1a6b Length 765

Record Type = GHOST_DATA_RECORD      Record Attributes =  NULL_BITMAP VARIABLE_COLUMNS
Record Size = 765

Memory Dump @0x0000000009ABBA6B

0000000000000000:   3c001100 0102797f 4d010000 00000000 †<…..y.M…….
0000000000000010:   000600c0 02002000 fd023801 02000000 †…À.. .ý.8…..
0000000000000020:   0d0a0d0a 2d2d2075 73705072 696e7445 †….– uspPrintE
0000000000000030:   72726f72 20707269 6e747320 6572726f †rror prints erro
0000000000000040:   7220696e 666f726d 6174696f 6e206162 †r information ab
0000000000000050:   6f757420 74686520 6572726f 72207468 †out the error th
0000000000000060:   61742063 61757365 64200d0a 2d2d2065 †at caused ..– e
0000000000000070:   78656375 74696f6e 20746f20 6a756d70 †xecution to jump
0000000000000080:   20746f20 74686520 43415443 4820626c † to the CATCH bl
0000000000000090:   6f636b20 6f662061 20545259 2e2e2e43 †ock of a TRY…C
00000000000000A0:   41544348 20636f6e 73747275 63742e20 †ATCH construct.
00000000000000B0:   0d0a2d2d 2053686f 756c6420 62652065 †..– Should be e
00000000000000C0:   78656375 74656420 66726f6d 20776974 †xecuted from wit
00000000000000D0:   68696e20 74686520 73636f70 65206f66 †hin the scope of
00000000000000E0:   20612043 41544348 20626c6f 636b206f † a CATCH block o

Let’s wait a minute and check again…


DBCC traceon(3604)

DBCC page (adventureworks2008r2, 1, 257, 3)

DBCC traceoff(3604)

Output:

PAGE HEADER:

Page @0x0000000082582000

m_pageId = (1:257)                   m_headerVersion = 1                  m_type = 1
m_typeFlagBits = 0x4                 m_level = 0                          m_flagBits = 0x0
m_objId (AllocUnitId.idObj) = 60     m_indexId (AllocUnitId.idInd) = 1    Metadata: AllocUnitId = 281474980642816

Metadata: PartitionId = 281474980642816                                   Metadata: IndexId = 1

Metadata: ObjectId = 60              m_prevPage = (1:23236)               m_nextPage = (1:22959)

pminlen = 17                         m_slotCnt = 34                       m_freeCnt = 3656
m_freeData = 7528                    m_reservedCnt = 0                    m_lsn = (554:35140:3)
m_xactReserved = 0                   m_xdesId = (0:1586506)               m_ghostRecCnt = 0
m_tornBits = -1445838435

And for the Slot 13:

Slot 13 Offset 0x330 Length 40
Record Type = PRIMARY_RECORD         Record Attributes =  NULL_BITMAP VARIABLE_COLUMNS
Record Size = 40

Memory Dump @0x0000000009ABA330

0000000000000000:   30001100 01aaa153 4e000000 00000000 †0….ª¡SN…….
0000000000000010:   00060000 02002000 28003801 02000000 †…… .(.8…..
0000000000000020:   2828302e 30302929 †††††††††††††††††††((0.00))

The record m_ghostRecCnt is now 0, and we cannot find the procedure code anymore. Also the objid record is pointing to another item.

Anyway, if you run DBCC PAGE in option 2, you will still be able to find some object track!

Basically, the pages are “free” now, but the bytes stay untouched until something is written into that space. So, with luck, you can recover the object in an emergency situation. Awesome! :)

Friends, I hope you like this article and use it as a base for further learning because this is just a small part of how SQL Server works internally, and a DBA should be learning every single day.

If you have any question, suggestion, or other topic you want to talk about, feel free to contact me!

No Comments Yet

Let us know what you think

Subscribe by email