Reposted from:
http://www.oracle.com/technetwork/articles/servers-storage-dev/silent-data-corruption-1911480.html
How to Prevent Silent Data Corruption
by Martin Petersen and Sonny Singh, Best Practices from Emulex and Oracle
An explanation of Emulex and Oracle's Linux-based data integrity solution, which protects against silent data corruption.
Published February 2013
Data loss and data corruption can be catastrophic, but a new standards-based, end-to-end data integrity solution, the result of a joint effort by EMC, Emulex, and Oracle, mitigates episodes of silent data corruption by supporting the T10 Protection Information (T10 PI) standard. The T10 PI standard provides for end-to-end data integrity checking. This article describes the perils associated with silent data corruption and explains how to implement the data integrity solution.
What Is Silent Data Corruption?
One of the most common areas in which data corruption occurs is writing to disk drives. There are two basic kinds of disk drive corruption:
- The first is latent sector errors, which are typically the result of a physical disk drive malfunction. An example would be a file system read error reported from a disk array. This type of corruption is usually detected by error correcting code (ECC) or cyclic redundancy checks (CRC) in the I/O path, and often it is corrected automatically.
- The second is silent data corruption, which can happen without warning and can be defined as the non-malicious loss of data resulting from component failure or inadvertent administrative action. Silent data corruption occurs when invalid data is read or written without the I/O operation being reported as failed. This type of corruption is by far the most damaging, and there are no effective ways to detect it without end-to-end integrity checking.
With virtualization servers and multicore processors, the probability that a faulty memory cell will cause an error increases. When such an error occurs without the knowledge of the application or the data center staff, this is called silent data corruption. Although silent data corruption is relatively rare, it can go undetected for long periods and result in costly downtime for business-critical functions.
Common perpetrators of silent data corruption include the following:
- The operating system, including the core OS and device drivers
- Storage hardware and firmware
- Administrative errors
What Is Data Integrity Protection?
Data integrity protection is not new. ECC and CRC are available on most, if not all, servers, storage arrays, and Fibre Channel host bus adapters (HBAs). But these checks protect the data only temporarily within a single component. They do not ensure that the data you intended to write does not become corrupt as it travels down the data path from the application running in the server to the HBA, the switch, the storage array, and then the physical disk drive. When data corruption occurs, most applications are unaware that the data that was stored on the disk is not the data that was intended to be stored.
Over the last several years, EMC, Emulex, and Oracle have worked together to drive and implement the Protection Information additions to the T10 SBC standard, which enable the validation of data as it moves through the data path to ensure that silent data corruption does not occur.
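The difference between per-component checks and end-to-end checking can be illustrated with a minimal Python sketch. It is not part of the solution described in this article: the function names and the simulated bit flip are hypothetical, and zlib.crc32 is used only as a convenient stand-in for whatever checksum each component applies internally.

```python
import zlib


def hop_with_local_crc(payload: bytes) -> tuple[bytes, int]:
    """Model a component that protects data only while it holds it:
    it computes a fresh CRC over whatever bytes it received."""
    return payload, zlib.crc32(payload)


def corrupt(payload: bytes) -> bytes:
    """Simulate a single bit flip between two components (hypothetical fault)."""
    damaged = bytearray(payload)
    damaged[0] ^= 0x01
    return bytes(damaged)


original = b"A" * 512
end_to_end_crc = zlib.crc32(original)  # computed once, when the data is created

# The corruption happens after hop 1 and before hop 2.
data, crc_hop1 = hop_with_local_crc(original)
data = corrupt(data)
data, crc_hop2 = hop_with_local_crc(data)

# Hop 2 checksummed the already-damaged bytes, so its own check still passes.
per_hop_check_passes = zlib.crc32(data) == crc_hop2
# The checksum created with the data no longer matches, so the damage is caught.
end_to_end_check_passes = zlib.crc32(data) == end_to_end_crc
```

In this sketch the second component's local check passes even though the data is wrong, while the checksum that travelled with the data from its creation does not.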
How Does This Solution Provide End-to-End Data Integrity?
The ultimate goal is to provide protection against silent data corruption from application to disk by creating integrity metadata, also known as protection information, coincident with data creation, and then validating the metadata throughout the data path and directing errors to the application for remediation, as shown in Figure 1.
Figure 1
The following steps occur when data is written:
- First: The Oracle Automatic Storage Management library adds protection information for each 512-byte sector as it is written to memory.
- Second: The protection information is attached to the I/O request and passed through the layers in the Oracle Linux operating system kernel to the Emulex driver.
- Third: The Emulex LightPulse LPe16000B Fibre Channel HBA collects the information from memory buffers, verifies the data integrity, merges the data and the protection information, and then sends out 520-byte sectors in accordance with the T10 PI model.
- Fourth: The EMC VMAX array firmware verifies the protection information and writes the data to disk.
- Fifth and last: The disk drive firmware verifies the protection information before committing the data to physical media.
The steps are done in reverse when data is read.
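The 520-byte sector mentioned in the third step is the 512 data bytes followed by an 8-byte protection information tuple. The Python sketch below models that tuple under stated assumptions: it uses the T10 PI Type 1 layout (a 2-byte CRC-16 guard tag with polynomial 0x8BB7, a 2-byte application tag, and a 4-byte reference tag holding the low 32 bits of the logical block address), and the function names protect_sector and verify_sector are hypothetical. It is not the ASMLib, kernel, or HBA implementation, and it appends the tuple directly rather than keeping it in a separate memory buffer as the host does.

```python
import struct

SECTOR_SIZE = 512
PI_SIZE = 8  # guard tag (2) + application tag (2) + reference tag (4)


def crc16_t10dif(data: bytes) -> int:
    """Bitwise CRC-16 with polynomial 0x8BB7, initial value 0, no reflection."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x8BB7) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def protect_sector(data: bytes, lba: int, app_tag: int = 0) -> bytes:
    """Return a 520-byte sector: the data followed by its big-endian PI tuple."""
    if len(data) != SECTOR_SIZE:
        raise ValueError("expected a 512-byte sector")
    guard = crc16_t10dif(data)
    ref_tag = lba & 0xFFFFFFFF  # Type 1: low 32 bits of the target LBA
    return data + struct.pack(">HHI", guard, app_tag, ref_tag)


def verify_sector(protected: bytes, lba: int) -> bytes:
    """Check the guard and reference tags; return the data or raise IOError."""
    if len(protected) != SECTOR_SIZE + PI_SIZE:
        raise ValueError("expected a 520-byte protected sector")
    data, pi = protected[:SECTOR_SIZE], protected[SECTOR_SIZE:]
    guard, _app_tag, ref_tag = struct.unpack(">HHI", pi)
    if guard != crc16_t10dif(data):
        raise IOError("guard tag mismatch: data was corrupted")
    if ref_tag != (lba & 0xFFFFFFFF):
        raise IOError("reference tag mismatch: sector is for another LBA")
    return data
```

Any component in the path can run a check like verify_sector without knowing anything about the application: the guard tag catches changed data, and the reference tag catches data written to or read from the wrong block.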
Deployment Requirements and Benefits
The key to implementing this solution is making sure that you are using the correct software releases and hardware equipment that support the data integrity enhancements and have been fully tested.
End-to-end data integrity is supported when each component meets the minimum version requirements listed in Table 1.
Table 1

Application/Operating System Layer | |
---|---|
Application | Any Oracle Database application |
Database | Oracle Database 11g with Oracle Automatic Storage Management |
Operating System | Oracle Linux 5.x or 6.x with Unbreakable Enterprise Kernel version 2.6.39-200.24.1 or later; Oracle ASMLib 2.0.8 or later |
Storage Layer | |
HBA Models (driver/firmware) | Emulex LPe12000-E or LPe12002-E with firmware 2.01a10 or later and driver 8.3.5.68.6p or later; Emulex LightPulse LPe16000 or LPe16000B with firmware 1.1.21 or later and driver 8.3.5.68.6p or later. Note: Equivalent OEM HBA (-E) models are also supported. |
Array | EMC Symmetrix VMAX Series with Enginuity 5876.82.57 |
The Emulex LightPulse LPe16000B 16G FC HBAs support protection information offload, which improves overall performance by 30%. This data integrity solution gives you the ability to protect your data and resources while maximizing service level agreements. The latest Emulex LightPulse LPe16000B 16G FC HBAs provide the industry's highest level of data integrity with full line-rate performance and no system overhead via vEngine CPU offload technology. In addition, Emulex's BlockGuard data integrity feature makes storage area network (SAN) deployments that use Oracle products operate faster and better.
The data integrity solution is EMC E-Lab certified, which helps ensure complete data integrity with the Oracle stack, from the application through the enterprise storage array, for all I/O operations. EMC is the first enterprise storage array vendor to join with Emulex and Oracle in implementing this end-to-end data integrity solution.
Configuration Requirements
This data integrity solution requires no additional configuration. If the relevant storage, HBA, firmware, driver, Oracle Automatic Storage Management, and Oracle ASMLib components are in place, data protection is automatically enabled. This means you can conduct day-to-day system administration tasks without having to tune any parameters.
You can verify that data protection is enabled by using the oracleasm-discover command, which will show the data integrity profile associated with each Oracle Automatic Storage Management disk, as shown in the example below. This command takes no arguments and simply prints a list of discovered devices.
# oracleasm-discover
Using ASMLib from /opt/oracle/extapi/64/asm/orcl/1/libasm.so
[ASM Library - Generic Linux, version 2.0.8 (KABI_V2)]
Discovered disk: ORCL:P00 [20971520 blocks (10737418240 bytes), maxio 512, integrity DIX1-512/512-IP]
Discovered disk: ORCL:P01 [20971520 blocks (10737418240 bytes), integrity DIX1-512/512-IP]
[...]
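If you have many disks, the check can be scripted. The following Python sketch extracts the integrity profile for each disk from text shaped like the sample output above. The line format is assumed from that sample alone, not from a documented specification, and the function name integrity_profiles is hypothetical; a disk whose line carries no integrity attribute is reported as None.

```python
import re

# Matches lines such as:
#   Discovered disk: ORCL:P00 [... , integrity DIX1-512/512-IP]
DISK_LINE = re.compile(
    r"Discovered\s+disk:\s*(?P<name>\S+)\s*\[(?P<attrs>[^\]]*)\]"
)


def integrity_profiles(output: str) -> dict:
    """Map each discovered disk name to its integrity profile, or None."""
    profiles = {}
    for match in DISK_LINE.finditer(output):
        profile = None
        # Attributes inside the brackets are comma-separated.
        for attr in match.group("attrs").split(","):
            attr = attr.strip()
            if attr.startswith("integrity "):
                profile = attr[len("integrity "):].strip()
        profiles[match.group("name")] = profile
    return profiles


sample = """\
Discovered disk: ORCL:P00 [20971520 blocks (10737418240 bytes), maxio 512, integrity DIX1-512/512-IP]
Discovered disk: ORCL:P01 [20971520 blocks (10737418240 bytes), integrity DIX1-512/512-IP]
"""

unprotected = [name for name, prof in integrity_profiles(sample).items() if prof is None]
```

On a real system you would pass the captured output of oracleasm-discover to integrity_profiles instead of the sample string, and treat a non-empty unprotected list as a sign that some component in the path does not meet the requirements in Table 1.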
Exception Handling
T10 PI prevents several common silent data corruption scenarios. Steps are taken by the HBA, the storage device, and the operating system to make sure that corrupted data is not written.
If a data integrity error is encountered, Oracle ASMLib will attempt to retry the I/O command. A data integrity error message and a retry message are recorded in the Oracle Automatic Storage Management trace log.
Conclusion
The end-to-end Oracle Linux-based data integrity solution has come to fruition through the joint efforts of EMC, Emulex, and Oracle and is a result of many years of development and collaboration.
The data protection information generated by Oracle Automatic Storage Management is validated by the Oracle Linux operating system, and then passed on to the Emulex HBA and the EMC VMAX array, thus enabling protection throughout the I/O stack.
About the Authors
Martin Petersen has been involved in Linux kernel development since the early 1990s. He works in Oracle's Linux Engineering group where he focuses on future I/O and storage technologies.
Sonny Singh has been with Emulex's Marketing group for three years. He is responsible for the inbound management and co-marketing of all Emulex solutions sold through Oracle and focuses on branding, go-to-market strategy, and development of co-branded solutions.
Revision 1.0, 02/15/2013