
Huge Data Bases

If your data base is going to be tens of GBs in size, perhaps even hundreds of GBs, then PostgreSQL will not be enough. In this case you may have to process the data base on multiple CPUs in parallel, and PostgreSQL is not a parallel data base.

You will find an example of a very powerful parallel data base described in the article ``DB2 Parallel Edition for AIX: Concepts and Facilities''.

We may provide parallel DB2 to AVIDD users on specific request, but it is a very large and very involved system. Think twice before making such a request. Yet, if you have a good justification for it, don't hesitate. This is what the AVIDD cluster is for.

What is a ``good justification''? Data bases do not really live entirely on the disk when the system is up. A data base server always tries to load as much of the data base into memory as possible, because otherwise responding to queries would be too slow. So you should think of an active data base as living partially on the disk and partially in memory. A very small data base may live entirely in the memory of a computer, with only occasional updates and logs written to the disk. But a very large data base, whose size is well in excess of the available memory, will have to live mostly on the disk.
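
To make this concrete, here is how you would tell PostgreSQL, which we met in the previous section, how much memory it may use for caching. This is a minimal sketch of a postgresql.conf fragment; the values themselves are hypothetical and would have to be tuned to the machine:

    # postgresql.conf -- hypothetical values for a generously
    # equipped node; shared_buffers is the pool PostgreSQL uses
    # to cache table and index pages, counted in 8kB pages, so
    # 65536 pages = 512MB of the data base kept in memory
    shared_buffers = 65536
    # per-sort memory, in kB; sorts larger than this spill to disk
    # (this parameter is called work_mem in later releases)
    sort_mem = 8192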

Computers still have rather limited memory, because memory is expensive. I can't wait for the day when memory will be so cheap and persistent that disks, which are clumsy, mechanical, and energy-hungry devices, will no longer be necessary. Disks slow down computing enormously. They are also a primary point of failure. The computational nodes of the AVIDD cluster have 2GBs of memory each, of which you can probably get about a GB for your program, the rest being taken by the operating system, PBS, and a ``safety margin''. This means that you can probably serve a data base of up to 4 times that size, i.e., a 4GB data base, from a single node. This assumes a 25% memory foothold, i.e., that a quarter of the data base is resident in memory at any time, which is fairly reasonable.
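
In numbers:

    1GB of available memory / 0.25 (the required foothold) = 4GB of data base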

Well, what, then, if the data base is 40GBs? Then you should think about distributing it over 10 AVIDD nodes for truly satisfactory performance. If the nodes were more powerful and, just as importantly, if they had more memory, you could run a larger data base on 10 nodes. Consider a system of 4-way SMPs, for example, each SMP with, say, 8GBs of memory, of which 1GB is reserved for the OS. This would give you a 7GB foothold per node, and then you could run a 28GB data base from a single node, or a 280GB data base from 10 such nodes. This would be nice, and it is how large commercial data bases are usually configured. But let us get back to AVIDD.

You should not expect the performance of a parallel data base to scale linearly forever. There is a lot of synchronization and communication involved in the functioning of a data base, and so data bases running on more than 64 CPUs are pretty rare. This could translate into, say, a system of 16 4-way SMPs, or into a single powerful SMP such as the Sun Starfire or the HP Superdome.

Since the AVIDD nodes are 2-way SMPs, you could think of deploying a data base on up to 32 of them. But the AVIDD nodes have little memory and cannot have much more, because they are 32-bit systems, so even 32 nodes would probably be too many. But 16 to 24 nodes should be quite OK. This would give you a data base of up to 96GBs, which is plenty as far as data bases go: more than anything I have seen at Indiana University so far, and more than the largest file we have ever moved to our HPSS (although HPSS operators at some DoE laboratories routinely deal with files larger than a TB).

Remember what I said right at the beginning of this chapter: data bases are normally populated by human typists. 96GBs would roughly translate into nearly 30 million pages of text, assuming 60 characters per line and 60 lines per page. Assuming 300 pages per book, this would be 100,000 books. You'd need a lot of typists to type this many pages!
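
The arithmetic behind these figures:

    1 page = 60 lines x 60 characters = 3,600 bytes
    96GB / 3,600 bytes per page = about 27 million pages, call it 30 million
    30 million pages / 300 pages per book = 100,000 books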

DB2 Parallel Edition is a data base that is designed to live on a cluster. Nowadays we actually no longer use the expression ``Parallel Edition'', because every new version of DB2 ``Enterprise Edition'' can be used in this way. It's simply one of the installation options.

Parallelism can manifest itself in various ways on such a system.

The first and most obvious is inter-transaction parallelism. Transactions don't always get in each other's way. If they don't, they can be executed in parallel on separate nodes. An SQL lookup with SELECT doesn't alter the tables, so you can have multiple lookups going on at the same time. If updates are performed on separate parts of a table, they can be performed in parallel too. Inter-transaction parallelism is especially welcome on relatively small data bases that are accessed very frequently.
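
For example, nothing stops the following two transactions from running at the same time on different nodes: the first only reads, and the second updates a different part of the table. The table and column names are made up for illustration:

    -- transaction 1: a pure lookup, alters nothing
    SELECT name, balance FROM accounts
        WHERE branch = 'Bloomington';

    -- transaction 2: touches only the Indianapolis rows, so it
    -- does not get in the way of the lookup above
    UPDATE accounts SET balance = balance + 100
        WHERE branch = 'Indianapolis';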

A less trivial form of parallelism is intra-query parallelism. Here a single query may get split across many processors. This type of parallelism would be used on a very large data base, where a single table may have been striped over multiple nodes. Intra-query parallelism can be implemented in various ways too, for example as partition parallelism, also called query decomposition, or as pipelined parallelism, where data flows between processes in a pipeline.
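
In DB2 such striping is declared when the table is created, by naming a partitioning key: rows are hashed on that key and scattered over the nodes. The following is a sketch only; the exact syntax varies between DB2 versions, and the table is made up:

    -- spread the rows over the nodes by hashing on the account id
    CREATE TABLE accounts (
        id      INTEGER NOT NULL,
        branch  VARCHAR(32),
        balance DECIMAL(12,2)
    )
    PARTITIONING KEY (id) USING HASHING;

    -- a full scan of this table can now proceed on all the
    -- partitions at once: this is partition parallelism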

Then we may classify parallel operations depending on how we deal with data. In some cases we may ship functions from one process to where the data itself resides; in others we may have to fetch the data from other processes and feed it into a function that runs on a selected processor. Function shipping is usually preferable, especially on clusters. Functions are data too, as every Lisp programmer knows, but they are usually rather small compared to the data they operate on in large data bases. So it is usually cheaper to send a function across than a whole table or a portion of it.

The architecture of DB2 Parallel Edition is pretty complex. There is a system controller, a watchdog, a parallel system controller, a communication manager, a master data base logger, a deadlock detector (sic!), and numerous other agents. You will normally have several DB2 listeners running on various nodes, so that connections made to the data base can be distributed over the cluster in order to level the load on the whole system. There is a central agent that maintains maps of how tables are distributed amongst the nodes, but even this function may well be replicated for better performance. In this respect the functioning of DB2 is likely to be quite similar to the functioning of, e.g., AFS.

The parallel DB2 supports pretty much normal SQL and adds a couple of extensions. Most operations are quite transparent, and you don't need to think about parallelism. Whatever is parallelizable, DB2 will parallelize automatically (business people wouldn't put up with all the parallel programming fuss otherwise!). In some cases operations will run locally on the nodes. When some central action is required, the operations may be forwarded to the coordinator node. This is somewhat similar to collective operations in MPI, e.g., MPI_Reduce or MPI_Gather. There is always a coordinating node, even if the operation is of the MPI_Allreduce type, in which case the node is not explicit, but it is still there.
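
An aggregate query illustrates this nicely. In the following sketch, run against the made-up accounts table from before, each node can sum its own partition of the table in parallel, and only the per-node partial sums travel to the coordinator, which combines them, much as MPI_Reduce combines partial results on the root process:

    -- each node computes its own partial sums in parallel;
    -- the coordinator merely merges them and returns the answer
    SELECT branch, SUM(balance)
        FROM accounts
        GROUP BY branch;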

Parallelism comes to the surface at the level of SQL optimization. Here the optimizer has been extended to generate parallel plans. The optimizer then selects the lowest-cost plan for a query. The cost depends on how the table has been partitioned amongst the nodes, on table statistics, on index statistics, on data base configuration parameters, and on the SQL query itself.

The actions of the data base can be scrutinized with various so-called explain tools. These will usually explain why the optimizer made its choices for the query, and how a particular operation is going to be executed. Explain reports can be very detailed, which is important if you work with a huge commercial data base and need to optimize its every aspect.
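
With DB2 you would typically capture the plan with the EXPLAIN statement and then format the result with a tool such as db2exfmt. A rough sketch, assuming the explain tables have already been created:

    -- record the plan the optimizer chose for this query
    -- in the explain tables, for later inspection
    EXPLAIN PLAN FOR
        SELECT branch, SUM(balance)
            FROM accounts
            GROUP BY branch;

The formatted report then shows, operator by operator, where the table is scanned, where partial results are combined, and what each step is estimated to cost.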

DB2 provides numerous utilities for executing commands on multiple nodes, managing segments, splitting data, load leveling, adding and dropping nodes, redistributing tables, backing up and restoring the data base, recovering the data base, and maintaining the data base directory.
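
For a flavor of these, here are two representative commands as one might issue them from the DB2 command line processor. They are a sketch only; the exact names and options depend on the DB2 version:

    -- back up the data base; on a parallel system each node
    -- backs up its own partition
    BACKUP DATABASE mydb TO /backup/mydb;

    -- spread the data of a nodegroup evenly over its nodes,
    -- e.g., after a node has been added or dropped
    REDISTRIBUTE NODEGROUP mygroup UNIFORM;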

IBM DB2 is a very flexible and quite universal tool. It is used not only in business, but nowadays increasingly in science, engineering, and even in computer management itself. For example, the HPSS data base is DB2 (as of version 5.1 and higher). Another example is the Blue Gene/L supercomputer, which is configured and managed by DB2.

Even though data base pundits predicted a slow demise of relational data bases many years ago, expecting them to be replaced by object-oriented data bases, this has never happened. The reason is that a table is actually a very universal device, and with suitable extensions it can be used to handle a great variety of data and queries. For example, you may think of storing whole images, even videos, as table entries. Why not?
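
DB2, for example, has had large object types for a long time, so a table of images is perfectly ordinary SQL. The table below is, again, made up:

    -- each row stores a whole image; a BLOB column can hold
    -- up to 2GB in DB2
    CREATE TABLE images (
        id      INTEGER NOT NULL PRIMARY KEY,
        title   VARCHAR(128),
        picture BLOB(10M)
    );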

Because data bases hold so much useful information, surprising discoveries can sometimes be made by mining this information, looking for unusual and unexpected correlations amongst the millions of rows and columns. Such activities are very difficult and time-consuming. Ready-made data mining tools can be purchased from data base companies. Such tools are extremely expensive, and people who can operate them skillfully and effectively command extremely high salaries.

Perhaps one day, you'll find yourself amongst them!


Zdzislaw Meglicki
2004-04-29