shareVM - Share insights about using VMs

Simplify the use of virtualization in everyday life

Posts Tagged ‘mySQL’

Database Virtualization = Location Transparency. Old Wine in a New Bottle?


When I think about “database virtualization” as

the use of multiple instances of a DBMS, or different DBMS platforms, simultaneously and in a transparent fashion regardless of their physical location


with the goal of providing “data virtualization”,

to view data from disparate sources without knowing or caring where the data actually resides.

this reminds me of location transparency (ref: Jim Gray and Andreas Reuter):

The database must insulate the application from the location of the data and the exact representation of the data.

The real issue is that there is a lot of hype surrounding virtualization today and database virtualization, at least in the prevalent use of the term, is attempting to shoe horn itself into this space in order to leverage the hype cycle and stock valuations associated with this market segment.

xkoto has built a SQL mediation layer above commodity DBMSs for the mid-market, and IBM has been licensing their technology as a load balancer since 2006. Oracle’s acquisition of Tangosol and Microsoft’s Project Velocity follow HP NeoView’s use of distributed caches for solving large BI queries. Strictly speaking these are not virtualization-related innovations – they simply replace shared-memory clustering with distributed, shared-nothing caches that provide location transparency for data. Tandem was doing this all along from the early ’80s through the mid ’90s. It seems to be old wine in a new bottle – the vintage may still be good, but let us acknowledge it for what it is.
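To make the “old wine” concrete, here is a minimal, hypothetical Python sketch of what these products package: the client routes a key to whichever cache node owns it, so the application neither knows nor cares where the data physically lives. The class and node names are mine, not any vendor’s API.

import hashlib

# Hypothetical sketch of location transparency in a shared-nothing cache:
# the caller never learns which node holds a key.
class ShardedCache:
    def __init__(self, node_names):
        # each "node" is an in-process dict standing in for a cache server
        self.nodes = {name: {} for name in node_names}
        self.names = sorted(self.nodes)

    def _node_for(self, key):
        # deterministic hash -> node; real products use consistent hashing
        # so that adding or removing nodes relocates only a fraction of keys
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return self.names[h % len(self.names)]

    def put(self, key, value):
        self.nodes[self._node_for(key)][key] = value

    def get(self, key):
        return self.nodes[self._node_for(key)].get(key)

cache = ShardedCache(["node-a", "node-b", "node-c"])
cache.put("customer:42", {"name": "Acme"})
print(cache.get("customer:42"))  # the caller is oblivious to the physical node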

The positioning and messaging innovation in recasting location transparency as “database virtualization” is an incremental one. What groundbreaking technology innovations can we expect in database virtualization that truly differentiate it and help startups unveil new disruptive business models?

You may well ask, “OK Paul, what’s your beef? We are solving real-world business problems for our customers.” Your point is well taken; ultimately it is indeed about the customer. If they can map your solution as meeting the technical requirements for solving their business problem, the marketing labels are a moot point. My concern is whether the positioning obfuscates instead of providing that sort of clarity to customers.

Written by paule1s

February 4, 2009 at 10:47 pm

Database Virtualization


I was intrigued by an article on database virtualization that caught my eye early this morning, and I wanted to find out what it is all about.

The business driver for database virtualization is the globalized economy, where business transactions happen 24x7x365 and business-critical data must be available within the network boundary of a corporation, or through the Internet, spanning application downtime and IT maintenance windows.

Data virtualization is defined here as

to view data from disparate sources without knowing or caring where the data actually resides.

Data virtualization obviously leads to database virtualization, which is defined here as

the use of multiple instances of a DBMS, or different DBMS platforms, simultaneously and in a transparent fashion regardless of their physical location

James Kobielus, a Senior Analyst with Forrester Research, is predicting that real-time information needs will drive database virtualization:

the database as we know it is disappearing into a virtualization fabric of its own. In this emerging paradigm, data will not physically reside anywhere in particular. Instead, it will be transparently persisted, in a growing range of physical and logical formats, to an abstract, seamless grid of interconnected memory and disk resources; and delivered with subsecond delay to consuming applications.

He is making an interesting case that

Real-time is the most exciting new frontier in business intelligence, and virtualization will facilitate low-latency analytics more powerfully than traditional approaches. Database virtualization will enable real-time business intelligence through a policy-driven, latency-agile, distributed-caching memory grid that permeates an infrastructure at all levels.

As this new approach takes hold, it will provide a convergence architecture for diverse approaches to real-time business intelligence, such as trickle-feed extract transform load (ETL), changed-data capture (CDC), event-stream processing and data federation. Traditionally deployed as stovepipe infrastructures, these approaches will become alternative integration patterns in a virtualized information fabric for real-time business intelligence.

The convergence of real-time business-intelligence approaches onto a unified, in-memory, distributed-caching infrastructure may take more than a decade to come to fruition because of the immaturity of the technology; lack of multivendor standards; and spotty, fragmented implementation of its enabling technologies among today’s business-intelligence and data-warehouse vendors. However, all signs point to its inevitability.

Proof Points

Oracle acquired Tangosol in May 2007 and possesses a well-developed in-memory, distributed-caching technology called Coherence.

Microsoft announced Project Velocity a year later, in June 2008:

a distributed cache that allows any type of data (CLR object, XML document, or binary data) to be cached. “Velocity” fuses large numbers of cache nodes in a cluster into a single unified cache and provides transparent access to cache items from any client connected to the cluster.

xkoto was selling GRIDSCALE as a database load balancer in 2006. However, it is very smartly capitalizing on virtualization being a hot segment and has repositioned GRIDSCALE as a database virtualization product. This is a valid repositioning, not only in the context of the definitions cited above but, more importantly, due to its validation by noted industry analysts.

Gigaspaces XAP can be used for implementing a distributed cache (in-memory data grid)

GemStone Systems offers the GemFire Enterprise as a data fabric (distributed cache)

Scaleout Software also has a distributed cache offering

Written by paule1s

February 3, 2009 at 5:32 pm


Oracle, SQL Server, performance and manageability in a VM


I have been reading about SQL performance issues within the VM and have come to the realization that the performance of SQL queries on any database – Oracle, SQL Server, MySQL, etc. – running in a VM can never match the performance of the same DB running on the native host OS. I have seen questions about this performance issue in several Google searches, particularly for SQL Server within VMware, over the past couple of years. I would like to share my insight about the root causes.

SQL Query Optimizer Support

The database’s cost-based optimizer creates and evaluates several query execution plans and chooses the one that meets the optimization goal (e.g., fastest response time for returning the first row, or least overall time to compute the entire result) at the lowest cost. The cost is based on several parameters that govern the query plan execution environment. For example, the query optimizer can examine parameters such as the number of processors on the system, the processor type, the amount of physical memory, the amount of swap space available, the disk throughput in I/Os per second, CPU load factors, etc. The query optimizers of the current generation of DBs – Oracle 10g, Oracle 11g, SQL Server 2005, etc. – are not designed for computing query plans for the VM environment; they see a single system image inside the VM and do not understand that the underlying hardware resources will actually be shared by several VMs.
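A toy illustration of the problem, with made-up cost formulas rather than any vendor’s actual optimizer: the plan chosen on the resources the guest OS reports can differ from the plan that would be cheapest given what the oversubscribed host actually delivers.

# Made-up cost formulas, purely illustrative of a cost-based plan choice.
def plan_costs(cpus, memory_gb):
    # a parallel scan benefits from CPUs; a hash join benefits from memory
    return {
        "parallel_scan": 8000 / cpus,
        "hash_join": 16000 / memory_gb,
    }

reported  = plan_costs(cpus=8, memory_gb=8)   # what the guest's single system image reports
effective = plan_costs(cpus=2, memory_gb=8)   # host oversubscribed: only ~2 cores' worth delivered

print("plan chosen on reported resources :", min(reported, key=reported.get))   # parallel_scan
print("cheapest at actual runtime        :", min(effective, key=effective.get)) # hash_join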

VMware seems to have begun affording a view of the underlying host system for the query optimizer’s benefit, although at the time of writing I have not researched what Microsoft and Xen offer. It is important for you to understand the tuning knobs VMware and Hyper-V offer for deploying Oracle and SQL Server; Xen is interesting for me because it is the underlying virtualization platform for Amazon’s EC2, which I am currently using. However, such a view is necessary but not sufficient, because it does not address the possibility of running large queries in parallel (parallel query execution) over multiple, distributed databases.
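For completeness, here is a rough sketch of what “parallel query execution over multiple, distributed databases” means in practice: fan a query out to several shards concurrently and merge the partial results. SQLite stands in for the per-shard DBMS instances; the table and shard data are invented.

import sqlite3
from concurrent.futures import ThreadPoolExecutor

def query_shard(rows):
    # each worker opens its own in-memory database acting as one shard
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE orders (region TEXT, amount REAL)")
    con.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    (total,) = con.execute("SELECT SUM(amount) FROM orders").fetchone()
    con.close()
    return total

shards = [
    [("east", 10.0), ("east", 5.0)],
    [("west", 7.5)],
    [("north", 2.5), ("north", 4.0)],
]

# scatter the query to all shards in parallel, then gather and combine
with ThreadPoolExecutor(max_workers=len(shards)) as pool:
    partials = list(pool.map(query_shard, shards))
print("per-shard sums:", partials, "grand total:", sum(partials))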

Lower Transaction Throughput

The pathlengths are, of course, longer. Let me state at the outset that I am willing to learn from experts who write such code for a living and are interested in providing clarity in depth. My goal here is to paraphrase and simplify to get the point across:

Imagine that the database has a relational storage manager that provides methods to open/close tables and set-oriented APIs that iterate over a set of rows for reading or updating them individually, or inserting a new row. The storage manager encapsulates a file system driver for managing reads from and writes to its proprietary data store. Imagine further that the file system driver is a kernel-mode component that performs the low-level I/O on NTFS or a Linux file system, as the case may be. The nominal read/write call sequences and pathlengths are optimized so that the databases perform well for routine use as well as for competitive TPC benchmarks. I am illustrating a notional call stack below (a toy sketch of such a storage-manager API follows the stack):

SQL application
|
DB query run time (user mode)
|
DB storage manager (user mode)
|
DB file system driver (kernel mode)
|
Host OS (kernel mode)
|
Disk driver
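Before adding the VM, here is the toy sketch of the storage-manager API described above. Everything in it – class names, methods, the plain list standing in for data pages – is invented for illustration.

class StorageManager:
    """Toy relational storage manager: open/close tables, set-oriented access."""
    def __init__(self):
        self._tables = {}

    def create_table(self, name):
        self._tables[name] = []

    def open_table(self, name):
        return Table(self._tables[name])

class Table:
    def __init__(self, rows):
        self._rows = rows  # a plain list stands in for the proprietary data store

    def insert(self, row):
        self._rows.append(dict(row))

    def scan(self, predicate=lambda r: True):
        # set-oriented read: yield every row matching the predicate
        for row in self._rows:
            if predicate(row):
                yield row

    def update(self, predicate, changes):
        # set-oriented update applied to every matching row
        for row in self._rows:
            if predicate(row):
                row.update(changes)

sm = StorageManager()
sm.create_table("accounts")
t = sm.open_table("accounts")
t.insert({"id": 1, "balance": 100})
t.update(lambda r: r["id"] == 1, {"balance": 150})
print(list(t.scan()))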

Now let’s see what happens when we run the DB within the VM.

SQL application
|
DB query run time (user mode)
|
DB storage manager (user mode)
|
DB file system driver (kernel mode)
|
Guest OS (kernel mode) <= additional I/O path length
|
Hypervisor (kernel mode) <= additional path length
|
Host OS (kernel mode)
|
Disk driver

In the VM, the OS kernel-mode I/O path is traversed more than once. (I am sure VMware and Microsoft are furiously working on optimizations to reduce the overhead from 2x to somewhere in the range of 1.6x – 1.8x, including the cost of the hypervisor code.)
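A back-of-the-envelope way to think about it, with my own assumed numbers rather than measurements: the slowdown you actually see depends on how much of a query’s elapsed time is spent in the kernel I/O path to begin with.

def vm_slowdown(kernel_io_fraction, path_multiplier):
    # kernel_io_fraction: share of native elapsed time spent in kernel-mode I/O
    # path_multiplier: how much longer that path becomes under the hypervisor
    return (1 - kernel_io_fraction) + kernel_io_fraction * path_multiplier

for mult in (2.0, 1.8, 1.6):  # raw doubling, and the hoped-for optimized cases
    factor = vm_slowdown(kernel_io_fraction=0.7, path_multiplier=mult)
    print(f"I/O-heavy query, {mult}x kernel path -> {factor:.2f}x elapsed time")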

Manageability Overhead

The database manages its own proprietary storage structures that differ substantially from the files supported by the native OS. For example, most databases allocate large extents (think, collection of disk blocks) which contain several pages (think, disk blocks), and rows within pages. The management complexity lies in the fact that extents can contain pages of different sizes, e.g., 64K, 32K, 4K, 512 bytes, a set of pages may be linked in a list, some pages may be data pages, some only index pages, whereas some others may contain both data and index rows.  The rows in each page may either be of fixed length or variable lengths. Index pages split or combine based on the fill factors, rows get forwarded when they are updated but cannot fit within the current page, there are shadow rows that get deleted asynchronously after a transaction commits. The point is that the database buries a lot of complexity in dealing with storage efficiently in its proprietary internal representation.
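A deliberately simplified sketch of those structures – sizes and layout are illustrative, not any product’s on-disk format – showing extents that group pages, pages that hold rows, and what happens when a row no longer fits.

PAGE_SIZE = 8192  # bytes; real systems mix several page sizes, as noted above

class Page:
    def __init__(self, kind="data"):
        self.kind = kind    # "data", "index", or a mix of both
        self.rows = []
        self.used = 0

    def fits(self, row_bytes):
        return self.used + row_bytes <= PAGE_SIZE

    def add(self, row, row_bytes):
        self.rows.append(row)
        self.used += row_bytes

class Extent:
    """A contiguous collection of pages allocated in one shot."""
    def __init__(self, pages_per_extent=8):
        self.pages = [Page() for _ in range(pages_per_extent)]

    def insert(self, row, row_bytes):
        for page in self.pages:
            if page.fits(row_bytes):
                page.add(row, row_bytes)
                return page
        # no room anywhere: a real engine would split a page or allocate a
        # new extent (and perhaps forward the row); here we just add a page
        page = Page()
        self.pages.append(page)
        page.add(row, row_bytes)
        return page

ext = Extent()
for i in range(5):
    ext.insert({"id": i}, row_bytes=3000)
print("pages:", len(ext.pages), "fill (bytes):", [p.used for p in ext.pages])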

All databases grow. The current best practice that I see on the Web is to pre-allocate the largest storage size you expect your DB to grow to. Think about what will happen if your DB is sharing that storage with another app and outgrows it faster than anticipated.
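The arithmetic behind that advice, with assumed numbers: size the pre-allocation for the growth you expect over your planning horizon, plus a safety margin for “faster than anticipated”.

current_gb       = 40     # assumed current database size
growth_gb_per_mo = 6      # assumed monthly growth
horizon_months   = 18     # planning horizon
headroom         = 0.25   # safety margin for faster-than-expected growth

preallocate_gb = (current_gb + growth_gb_per_mo * horizon_months) * (1 + headroom)
print(f"pre-allocate roughly {preallocate_gb:.0f} GB")  # ~185 GB with these numbers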

OS and third-party tools for de-fragmentation and compression are marginally useful at best. You must use the database’s own administration utilities, e.g., SQL Server stored procedures, for de-fragmenting the database and recovering space. Remember that any I/O operations performed within the VM suffer the additional kernel I/O pathlengths of the guest and host OS, together with that of the hypervisor, as discussed in the earlier section.

From the manageability perspective, none of the databases except Tandem’s can perform an online reorganization of data. SQL Server, and to the best of my knowledge Oracle too, requires you to take the database offline and unload and reload the data in order to remove fragmentation. This is a real bummer.

Epilogue

Ultimately it is all about planning. Do your capacity planning and then validate it in the lab, at some scale, with the database residing in a VM. Read the manuals and ensure that you truly understand and can justify the performance and manageability trade-offs of moving the DB to a VM.

Written by paule1s

December 7, 2008 at 10:26 pm