Oracle RAC Cache Fusion Mechanism Explained

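Cache Fusion is the core mechanism of RAC. I touched on it briefly in an earlier blog post:

Oracle RAC Concurrency and Architecture

I found a very detailed article on it online and am reposting it here. The link is: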

http://avdeo.com/2008/07/21/oracle-rac-10g-cache-fusion/

1. Introduction

This post is about Oracle Cache Fusion technology, which is implemented in Oracle Database 10g RAC. We will discuss only cache fusion, so you should already know the RAC architecture; please check the Oracle documentation for that. You can also read my previous post about Oracle RAC installation for some basic information and installation details.

Cache fusion was partially implemented in Oracle 8i in OPS (Oracle Parallel Server). Before Oracle 8i the situation was different. In a multi-instance Oracle Parallel Server, if one instance asked for a block of data that was currently modified by another instance of the same database, the holding instance had to write the block to disk so that the requesting instance could read it. This is called a “disk ping”, and it greatly affected database performance. With Oracle 8i, partial cache fusion was implemented.

Oracle 8i (Oracle Parallel Server) had a background process called the “Block Server Process”, which was responsible for cache fusion in Oracle 8i OPS. The following table shows the scenarios in which cache fusion worked in Oracle 8i OPS and those in which it did not. Of course, these limitations are not present in Oracle 10g RAC.

[Table: scenarios in which cache fusion does and does not work in Oracle 8i OPS]

So when a requesting instance asked for a block that was present in the holding instance in read or write mode, and the block was dirty, cache fusion worked and the block was copied from the cache of the holding instance to the requesting instance. But if the block was present in the holding instance and was not dirty, the requesting instance had to read it from the datafile. Also, if the block was open for write in the holding instance and another instance wanted to update the same block, the holding instance had to write the block to disk so that the requesting instance could read it.

2. Concept of cache fusion

Cache fusion means fusing the memory buffer caches of multiple instances into one single cache.


This behavior is possible because of the high-speed interconnect that exists in the cluster between the instances.

This high-speed interconnect is a private connection made just for the instances to send data blocks and more to each other. External users cannot use this connection. It is this interconnect that makes multiple servers behave like a cluster. The servers are bound together by this interconnect.

Moving on, we now know how the cluster is formed, what its backbone is, and what exactly we call “cache fusion”. Next we will see how cache fusion works. But before that we need to cover a few important topics that are necessary to understand it.

We will discuss the following topics before discussing cache fusion:

（1）Cache Coherency

（2）Multi-Version consistency model

（3）Resource Co-ordination – Synchronization

（4）Global Cache Service (GCS)

（5）Global Enqueue Service

（6）Global Resource Directory

（7）GCS resource modes and roles

（8）Past Images

（9）Block access modes and buffer states

I promise this won't be too heavy. Let's look at an overview of these concepts. I won't go into the details, just enough for you to understand cache fusion.

2.1 Cache Coherency

This is called maintaining consistency of data.
Now consider a multi-instance RAC system, where a data block might not be present in the local instance; a user might be updating that data block in some other instance. If the data blocks are already available in the local instance, they are immediately available to the user. If they are present in some other instance within the cluster, they are transferred into the local buffer cache.
Maintaining the consistency of data blocks in the buffer caches of multiple instances is called “cache coherency”.

2.2 Multi-Version consistency model

The multi-version consistency model distinguishes between the current version of a data block and one or more read-consistent versions of it. The current block is the one that contains all the changes, committed as well as uncommitted. For example, a user fires a DML statement on a data block that is not present in any of the instances. The block is read from disk into the buffer cache, where the value gets changed. The user then commits and fires another DML statement on the same data block. Now that data block is dirty and contains committed as well as uncommitted changes.
Suppose this data block is requested by another user for reading. Oracle will then make a copy, apply undo information to produce a consistent read (“CR”) copy of the block, and ship it to the requesting instance. Thus we have multiple versions of the same data block, each of them consistent with respect to the user who requested it.
During the course of operation there can be many more versions of the same data block, each consistent with respect to some point in time.
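To make the idea of a CR copy concrete, here is a minimal sketch in Python. It is a toy model, not how Oracle stores blocks or undo: the block holds a single value, the `UndoRecord` and `Block` classes are names I made up, and commit status is ignored (a change is treated as visible to any reader whose SCN is at or after the change).

```python
from dataclasses import dataclass, field


@dataclass
class UndoRecord:
    scn: int          # SCN at which the change was made
    old_value: int    # value before that change


@dataclass
class Block:
    value: int                                  # current version of the block
    undo: list = field(default_factory=list)    # undo records, oldest first

    def update(self, new_value, scn):
        """Change the block at `scn`, saving the old value as undo."""
        self.undo.append(UndoRecord(scn, self.value))
        self.value = new_value

    def consistent_read(self, query_scn):
        """Return the value as of `query_scn` without touching the current block."""
        value = self.value
        # Walk the undo newest-first and roll back every change made after query_scn.
        for rec in reversed(self.undo):
            if rec.scn > query_scn:
                value = rec.old_value
        return value


block = Block(value=10)        # block as read from disk
block.update(24, scn=100)
block.update(30, scn=200)

print(block.consistent_read(150))   # reader whose snapshot is SCN 150
print(block.consistent_read(250))   # reader whose snapshot is after both changes
```

The current block is never modified by a read; each reader gets its own rolled-back copy, which is why several consistent versions of one block can exist at the same time.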

For more on CR, see my blog post:

CR (consistent read) blocks create explained

2.3 Resource Co-ordination – Synchronization

Within the shared cache, co-ordination of concurrent tasks is called synchronization. The synchronization provided by Oracle RAC gives cluster-wide concurrency control over resources and in turn ensures the integrity of shared data. Although this synchronization happens within the cache, it has a cost. At a low level, synchronization is just a data copy or data transfer operation.
According to Oracle studies, accessing a block in the local cache is much faster than accessing it from another instance's cache within the cluster. With the local cache it is an in-memory copy, whereas with another instance's cache the data has to be transferred over the high-speed interconnect, which is obviously slower than an in-memory copy. Worst of all is a read from disk, which is much slower than either of the two.

For example:

Block access in local cache ~ 0.01 msec

Block access in remote cache ~ 2.5 msec

Block access on disk ~ 14 msec+
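To get a feel for what these figures imply, here is a small sketch that computes the average block access time for a mix of accesses. The latencies are the approximate ones listed above; the workload mix is a made-up example, not a measurement.

```python
# Approximate access times from the list above, in milliseconds.
LATENCY_MS = {"local": 0.01, "remote": 2.5, "disk": 14.0}


def average_access_ms(fractions):
    """Weighted average access time for a mix of block accesses.

    `fractions` maps "local"/"remote"/"disk" to the share of accesses
    served from that source; the shares must add up to 1.
    """
    total = sum(fractions.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    return sum(LATENCY_MS[src] * share for src, share in fractions.items())


# Hypothetical workload: 90% local hits, 8% remote cache, 2% disk reads.
mix = {"local": 0.90, "remote": 0.08, "disk": 0.02}
avg = average_access_ms(mix)
print(f"average access time: {avg:.3f} ms")
```

With this mix the average comes to roughly 0.49 ms, and the 10% of accesses that are not local account for almost all of it.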

2.4 Global Cache Service

GCS is represented by the background process LMSn. There can be a maximum of 10 LMS processes per instance. The main function of GCS is to track the status and location of data blocks. The status of a data block means its mode and role (I will explain mode and role further on). GCS is the main mechanism by which cache coherency among the multiple caches is maintained. GCS is also responsible for block transfers between the instances.

2.5 Global Enqueue Service

GES handles all non-cache-fusion inter-instance operations. It performs concurrency control on dictionary cache locks, library cache locks and transactions, for resources that are accessed by more than one instance.
Enqueue services are also present in a single-instance database. They are responsible for locking the rows of a table using different locking modes. To understand more about enqueues, check the Oracle documentation about locking.

2.6 Global Resource Directory

GES and GCS together maintain the Global Resource Directory (GRD). The GRD is like an in-memory database that contains details about all the blocks present in the cache. The GRD knows the location of the latest version of a block, the mode of the block, the role of the block (mode and role will be discussed shortly), and so on. Whenever a user asks for a data block, GCS gets all the information from the GRD. The GRD is a distributed resource, meaning that each instance maintains some part of it. This distributed nature of the GRD is key to the fault tolerance of RAC. The GRD is stored in the SGA.

Typically the GRD contains the following information, among other things:

（1）Data Block Address – This is the address of data block being modified

（2）Location of most current version of data block

（3）Modes of data block

（4）Roles of data block

（5）SCN number of data block

（6）Image of data block – could be the current image or a past image
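As a rough mental model, a GRD entry can be pictured as a record keyed by data block address, with the directory split across the instances. The sketch below is only that, a mental model: the field names, the `ResourceDirectory` class and the modulo rule for picking the mastering instance are my own assumptions for illustration, not Oracle's internal structures.

```python
from dataclasses import dataclass


@dataclass
class GrdEntry:
    dba: int               # data block address
    current_holder: int    # instance holding the most current version
    mode: str              # "N", "S" or "X"
    role: str              # "L" (local) or "G" (global)
    scn: int               # SCN of the block


class ResourceDirectory:
    """Toy directory split across instances (hypothetical layout)."""

    def __init__(self, instance_ids):
        self.instance_ids = list(instance_ids)
        # Each instance keeps only the entries it masters.
        self.parts = {i: {} for i in self.instance_ids}

    def master_of(self, dba):
        # Assumption: pick the mastering instance with a simple modulo hash.
        return self.instance_ids[dba % len(self.instance_ids)]

    def register(self, entry):
        self.parts[self.master_of(entry.dba)][entry.dba] = entry

    def lookup(self, dba):
        return self.parts[self.master_of(dba)].get(dba)


grd = ResourceDirectory([1, 2, 3])
grd.register(GrdEntry(dba=1001, current_holder=3, mode="S", role="L", scn=500))
entry = grd.lookup(1001)
print(grd.master_of(1001), entry.current_holder, entry.mode + entry.role)
```

Because every instance holds only its own slice, losing one instance loses only part of the directory, which fits the point above about the distributed GRD and fault tolerance.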

2.7 GCS resource modes and roles

Mode of data block

（1）Null (N) Mode: Null mode is the least restrictive mode. It indicates no access rights.

（2）Shared (S) Mode: Shared mode indicates that the database block is being read and not modified. However, another session can read the data block.

（3）Exclusive (X) Mode: Exclusive mode indicates exclusive access to the block. Other resources cannot write to this data block. However, they can perform a consistent read on it.

GCS resources also have roles. The following roles exist:

（1）Local: When a data block is first read into an instance from disk, it has a local role, meaning that only one copy of the data block exists in the cache.

（2）Global: The global role indicates that multiple copies of the data block exist in the clustered instances. For example, a user connected to one of the instances requests a data block. The data block is read from disk into that instance, and the role granted is local. If another instance requests the same block, the block is copied to the requesting instance and the role becomes global.

This role and mode information is maintained in the GRD (Global Resource Directory) by GCS (Global Cache Service).
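The three modes can be read as a small compatibility rule: any number of instances may hold a block in shared mode, an exclusive holder excludes other current-mode holders, and null conflicts with nothing. Here is a minimal sketch of that rule. The `Mode` enum and the function names are mine, and consistent reads are left out because, as noted above, they do not need a current-mode grant.

```python
from enum import Enum


class Mode(Enum):
    NULL = "N"
    SHARED = "S"
    EXCLUSIVE = "X"


def compatible(held, requested):
    """True if `requested` can be granted while another instance holds `held`."""
    # NULL conveys no access rights, so it never conflicts with anything.
    if Mode.NULL in (held, requested):
        return True
    # Readers can share a block; a writer excludes every other current-mode holder.
    return held is Mode.SHARED and requested is Mode.SHARED


def can_grant(requested, held_by_others):
    """Check a request against the modes currently held by other instances."""
    return all(compatible(held, requested) for held in held_by_others)


print(can_grant(Mode.SHARED, [Mode.SHARED, Mode.NULL]))
print(can_grant(Mode.EXCLUSIVE, [Mode.SHARED]))
```

When a request is not compatible, the existing holder has to be downgraded first, which is what happens to the SHARED lock in stage 3 of the walkthrough later in this post.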

2.8 Past Images

When a dirty block is requested by another instance, an image of the block is created in the owning instance and the block is then shipped to the requesting instance. This image copy of the block is called a Past Image (PI). In the event of a failure, Oracle can reconstruct the block by reading PIs. It is also possible to have more than one PI of a block, depending on how many times the block was requested while dirty.

A past image of a block is different from a CR (consistent read) image. A past image is required to create a CR image by applying undo data.

“Juggling” Data with Multiple Past Images

（1）Multiple Past Image versions of a data block may be kept by different instances

（2）Upon a checkpoint, only the current image is written to disk; Past Images are discarded

（3）In the event of a failure, current version of block can be reconstructed from PIs

（4）Since PIs are kept in memory, they aid in avoiding frequent disk writes

（5）This avoids “disk pinging” experienced with 8i OPS due to frequent writes to disk

（6）Data is “juggled” in memory, without touching down on the disk
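The “juggling” described in this list can be sketched as simple bookkeeping: shipping a dirty block leaves a PI behind and costs no disk write, and a checkpoint writes the current image once and throws the PIs away. The `ClusterBlock` class below is a hypothetical toy model of that bookkeeping only; it does not model lock modes or recovery.

```python
class ClusterBlock:
    """Tracks which instance holds the current image and which hold past images."""

    def __init__(self, holder):
        self.current_holder = holder
        self.dirty = False
        self.past_images = {}      # instance id -> number of PIs kept there
        self.disk_writes = 0

    def modify(self):
        self.dirty = True

    def ship_to(self, requester):
        """Transfer the current image over the interconnect, without a disk write."""
        if self.dirty:
            # The old holder keeps a past image of the dirty block.
            old = self.current_holder
            self.past_images[old] = self.past_images.get(old, 0) + 1
        self.current_holder = requester

    def checkpoint(self):
        """Write the current image to disk and discard all past images."""
        self.disk_writes += 1
        self.dirty = False
        self.past_images.clear()


blk = ClusterBlock(holder=2)
blk.modify()          # dirtied on instance 2
blk.ship_to(1)        # instance 2 keeps a PI
blk.modify()          # dirtied again on instance 1
blk.ship_to(3)        # instance 1 keeps a PI
print(blk.past_images, blk.disk_writes)
blk.checkpoint()
print(blk.past_images, blk.disk_writes)
```

Two transfers of a dirty block happen here with zero disk writes, which is exactly the “disk ping” that 8i OPS could not avoid.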

See also: Oracle RAC Past Image (PI) explained

2.9 Block access modes and buffer states

The buffer state is the state of a buffer in the local cache of an instance. The buffer state of a block relates to the access mode of the block. For example, if a buffer state is exclusive current (XCUR), the instance owns the resource in exclusive mode.
To see a buffer's state, query the STATUS column of the V$BH dynamic performance view.

The V$BH view provides information about the block access mode and their buffer state names as follows:

With a block access mode of NULL the buffer state name is CR

With a block access mode of S the buffer state name is SCUR

With a block access mode of X the buffer state name is XCUR

With a block access mode of NULL the buffer state name is PI
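The mapping above is small enough to write down as a lookup table. The sketch below goes from a buffer state name back to the access mode it implies, and tallies a list of states by mode. The sample list is made up for illustration; it is not output from a real V$BH query, and V$BH reports other states as well that are not covered here.

```python
from collections import Counter

# Access mode -> buffer state names, as listed above.
# NULL maps to two states: a consistent-read copy or a past image.
BUFFER_STATES = {
    "NULL": ("CR", "PI"),
    "S": ("SCUR",),
    "X": ("XCUR",),
}


def access_mode_of(status):
    """Return the block access mode implied by a buffer state name."""
    for mode, states in BUFFER_STATES.items():
        if status.upper() in states:
            return mode
    raise ValueError(f"buffer state not covered here: {status}")


# Made-up STATUS values, as if collected from V$BH for one object.
sample_status = ["xcur", "scur", "scur", "cr", "pi"]
by_mode = Counter(access_mode_of(s) for s in sample_status)
print(dict(by_mode))
```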

For more on the V$BH view, see the official documentation:

http://download.oracle.com/docs/cd/B19306_01/server.102/b14237/dynviews_1051.htm#REFRN30029

Only the SCUR and PI buffer states are Real Application Clusters-specific. There can be only one copy of any one block buffered in the XCUR state in the cluster database at any time. To perform modifications on a block, a process must assign an XCUR buffer state to the buffer containing the data block.
For example, if another instance requests read access to the most current version of the same block, then Oracle changes the access mode from exclusive to shared, sends a current read version of the block to the requesting instance, and keeps a PI buffer if the buffer contained a dirty block.

Therefore, the role of the resource becomes global. There can be multiple shared current (SCUR) versions of this block cached throughout the cluster database at any time.

3. Block transfer using Cache Fusion

In the stages that follow, when I say that a user executed an update, I mean that the user executed update + commit. No checkpoint takes place until the final stage.

Stage 1

In stage 1 a data block is requested by user C, who is connected to instance 3. So the data block is read into the buffer cache of instance 3.

SQL> select sales_rank from salesman where salesid = 10;

The block is held in SL0 mode (SHARED, LOCAL, 0 PAST IMAGES).

These lock modes are explained in my blog post:

Oracle RAC Past Image (PI) explained


Stage 2

There is no disk read at this time. Both instances hold the block in SHARED mode (S) and the role is LOCAL (L). Note that even though the block is present in more than one instance, we still say the role is local because the block is not yet dirty. Had the block been dirty when the other instance requested it, the role would have changed to global.

Stage 3

The SHARED lock on instance 3 is downgraded to a NULL lock.

SQL> update salesman set sales_rank = 24 where salesid = 10;

SQL> commit;

Also, the role of the block is still LOCAL because the block has been dirtied for the first time, only on instance 2, and no other instance has a dirty copy of it.

Stage 4

A PAST IMAGE of the data block is created on instance 2, and the lock mode on instance 2 is downgraded to NULL with a GLOBAL role.

Stage 5

As a result, the lock on instance 1 is converted to SHARED GLOBAL with 1 PAST IMAGE. The lock changes to SHARED and not NULL because instance 3 asked for a shared lock (for reading data) and not an exclusive lock (for updating data).

Instance 3 will now hold SG0 (SHARED, GLOBAL with 0 PAST IMAGES).


Stage 6

User B issues the same select statement against the salesman table on instance 2. Instance 2 requests a consistent copy of the buffer from another instance, which happens to be the current master.
Therefore instance 1 ships the block to instance 2, where it is acquired as SG1 (SHARED, GLOBAL with 1 PAST IMAGE). So the mode on instance 2 becomes SG1.

Stage 7

Instance 3 will hold an EXCLUSIVE lock with GLOBAL role and no PAST IMAGES (XG0).

Stage 8

The checkpoint is initiated and a “write to disk” takes place at instance 3. As a result, the previous past images are discarded (as they are no longer required for recovery) and instance 3 holds the block with an EXCLUSIVE lock and LOCAL role with no PAST IMAGES (XL0).

From here on, if any instance wants to read or write the same block, a copy will again be shipped from instance 3.
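The three-character states used in the stages above (SL0, SG1, XG0 and so on) all follow the same pattern: lock mode, role, number of past images. Here is a small decoder for that notation, applied to the states that the stages name or describe. The function and class names are my own; the (stage, instance) pairs are taken from the text above.

```python
from typing import NamedTuple

MODES = {"N": "NULL", "S": "SHARED", "X": "EXCLUSIVE"}
ROLES = {"L": "LOCAL", "G": "GLOBAL"}


class LockState(NamedTuple):
    mode: str
    role: str
    past_images: int


def parse_lock_state(code):
    """Decode a state such as 'SG1' into (mode, role, number of past images)."""
    if (len(code) != 3 or code[0] not in MODES
            or code[1] not in ROLES or not code[2].isdigit()):
        raise ValueError(f"not a valid lock state: {code!r}")
    return LockState(MODES[code[0]], ROLES[code[1]], int(code[2]))


# States named or described in the stages above, keyed by (stage, instance).
walkthrough = {
    (1, 3): "SL0",
    (5, 1): "SG1",
    (5, 3): "SG0",
    (6, 2): "SG1",
    (7, 3): "XG0",
    (8, 3): "XL0",
}

for (stage, instance), code in walkthrough.items():
    state = parse_lock_state(code)
    print(f"stage {stage}, instance {instance}: {code} = "
          f"{state.mode}, {state.role}, {state.past_images} past image(s)")
```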

Figures:

[Figure: Read/Read Cache Fusion – GCS Processing]

[Figure: Write/Write Cache Fusion – GCS Processing]

[Figure: Blocks to Disk – GCS Processing]

4. Online Instance Recovery Steps

The steps are as follows:

（1）Instance Failure detected by Cluster Manager and GCS

（2）Reconfiguration of GES resources (enqueues); global resource directory is frozen

（3）Reconfiguration of GCS resources; involves redistribution among surviving instances

（4）One of the surviving instances becomes the “recovering instance”

（5）SMON process of recovering instance starts first pass of redo log read of the failed instance’s redo log thread

（6）SMON finds BWR (block written records) in the redo and removes them as their PI is already written to disk

（7）SMON prepares recovery set of the blocks modified by the failed instance but not written to disk

（8）Entries in the recovery list are sorted by first dirty SCN

（9）SMON informs each block’s master node to take ownership of the block for recovery

（10）Second pass of log read begins.

（11）Redo is applied to the data files.

（12）Global Resource Directory is unfrozen
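Steps (5) to (8) above, the first pass over the failed instance's redo, can be sketched roughly as follows. This is a heavily simplified model under my own assumptions: the `Redo` record layout is hypothetical, a BWR is treated as meaning that every earlier change to that block is already on disk, and nothing here reflects Oracle's actual redo format.

```python
from collections import namedtuple

# Hypothetical, simplified redo records: kind is "CHANGE" or "BWR".
Redo = namedtuple("Redo", "scn kind dba")


def build_recovery_set(redo_thread):
    """First pass: find blocks changed by the failed instance but not yet on disk.

    Returns a list of (first_dirty_scn, dba), sorted by first dirty SCN.
    """
    first_dirty = {}
    for rec in sorted(redo_thread, key=lambda r: r.scn):
        if rec.kind == "CHANGE":
            # Remember only the first change since the block was last written.
            first_dirty.setdefault(rec.dba, rec.scn)
        elif rec.kind == "BWR":
            # The block reached disk, so earlier changes need no recovery.
            first_dirty.pop(rec.dba, None)
    return sorted((scn, dba) for dba, scn in first_dirty.items())


redo = [
    Redo(100, "CHANGE", 7),
    Redo(110, "CHANGE", 9),
    Redo(120, "BWR", 7),      # block 7 written to disk
    Redo(130, "CHANGE", 5),
    Redo(140, "CHANGE", 7),   # block 7 dirtied again after the write
]
recovery_set = build_recovery_set(redo)
print(recovery_set)
```

Block 7 still ends up in the recovery set, but only from SCN 140 onward, since the change at SCN 100 was covered by the block written record.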