next up previous index
Next: Huge Data Bases Up: Databases: From Small to Previous: Talking to a PostgreSQL

Talking to a PostgreSQL Data Base from C

Suppose you have the following project. You have a C-language program that can read documents, e.g., memos, articles, books, and that can generate automatically abstracts, keywords and bibliographic entries for these documents - without any human intervention. If you have a lot of documents to process you may consider running the program on the AVIDD farm. You would simply submit a PBS job that would distribute multiple instances of the program over, say, 32 or 48 nodes. Because the instances of the program have no need to communicate with each other, this is all that you really need to do to ``parallelise'' the application.

The only snag is coordination. Suppose that all the documents you need to process live on GPFS in one of your directories. How are you to decide, which instance of your program should get which document? The other issue is how you are going to collect the results of the processing. Each instance of the program is going to generate bibliographic entries for the documents it has processed, where are these entries to go?

This kind of a problem is not restricted to generation of bibliographic entries. When high energy physicists process their experimental data, what they do is almost exactly the same. The only difference in the type of calculations their programs do. They store their data on multiple files, which are then distributed, somehow, to instances of data reduction programs that run on a CPU farm. The programs process the data and return the results on various files, which are then collated and prepared for further processing, e.g., visualization.

In order to coordinate the exercise you could write an MPI program and then incorporate your C program as a subroutine in the MPI program. But this is really quite unnecessary, because you already have a first class coordinator and data collector in the form of $\ldots$ yes, in the form of a PostgreSQL data base.

PostgreSQL data base transactions, i.e., operations such as  update, or  insert, or  delete, are  atomic. Various complex combinations of these and other transactions can be made atomic too using a special SQL bracketing in the form of  begin, commit  and rollback  commands. Atomicity of a transaction means that the transaction, however complex and lengthy, runs to completion or not at all. If anything happens while the transaction runs that makes it impossible to complete it, the transaction is unwound and the state of the data base returns to the condition the data base was in before the transaction began. This is accomplished by the means of sophisticated locking, queuing, logging, backtracking and various other devices that would be quite difficult to implement in a small MPI program.

Additionally SQL provides the command lock , which lets you lock the table you work on explicitly and in various ways. The way to use this command is to lock the table in the first command line of the transaction. Then you proceed with the transaction and the moment you commit it, the lock is automatically lifted. There is no unlock command in SQL.

In summary you could begin by constructing a table that would represent the job queue. Your workers, once they've been instantiated by PBS on various nodes of the AVIDD cluster would contact the data base server running on one of the AVIDD nodes or even on your laboratory server and would perform a transaction with the data base. They could, for example, ask the data base for the top entry in the job queue table that has not been worked on yet. The way to ask for such table entry is similar to the following:

neighbors=# select * from dogs
neighbors-#    where age not like 3
neighbors-#    limit 1;
 name  |       owner       |     race      | age | unit 
 Fiote | Margaret Thatcher | Border Collie |   4 | 2385
(1 row)

Now think of age = 3 representing a ``worked-on'' flag and here you have the solution that gives you the next available entry in the table. Once the worker gets the entry, the worker would set the ``worked-on'' flag on this row by doing something like:
neighbors=# update dogs
neighbors-#    set age = 3
neighbors-#    where name = 'Fiote' and unit = 2385;
Having reserved the row, the worker would then go to process the corresponding document. Then once the bibliographic entry for the document has been constructed, the worker would again contact the data base and it would update the relevant row in this or some other table, by adding the full bibliographic entry to it.

The only thing you need to learn in order to implement your system this way is how to talk to a PostgreSQL data base from your C-language program. The easiest way to do this is to use the ecpg  preprocessor  for C with embedded SQL statements. The preprocessor recognizes the SQL statements and converts them automatically to C-language constructs. The program is then compiled and linked with the ecpg library and $\ldots$ you have it.

Here is a simple example. Consider the following program:

gustav@WOODLANDS:../gustav 18:08:23 !525 $ cat tryme.pgc
#include <stdio.h>
   printf ("Connecting to the data base ... \n");
EXEC SQL UPDATE dogs SET age = 7 WHERE unit = 2383 AND name = 'Cedric';
   printf ("Done it.\n");
gustav@WOODLANDS:../gustav 18:09:40 !526 $
The SQL commands begin with the EXEC SQL string, which is then followed by a command itself and the terminating semicolon. You can interleave SQL commands with normal C-language statements. SQL transactions executed this way are not committed unless COMMIT is requrested explicitly.

I am going to preprocess this program first in order to generate a genuine C-language program:

gustav@WOODLANDS:../gustav 18:09:40 !526 $ ecpg tryme.pgc
gustav@WOODLANDS:../gustav 18:10:30 !527 $
The C-language program I have just generated looks as follows:
gustav@WOODLANDS:../gustav 18:10:30 !527 $ cat tryme.c
/* Processed by ecpg (2.10.0) */
/* These four include files are added by the preprocessor */
#include <ecpgtype.h>
#include <ecpglib.h>
#include <ecpgerrno.h>
#include <sqlca.h>
#line 1 "tryme.pgc"
#include <stdio.h>
   printf ("Connecting to the data base ... \n");
{ ECPGconnect(__LINE__, "neighbors" , NULL,NULL , NULL, 0); }
#line 5 "tryme.pgc"

{ ECPGdo(__LINE__, NULL, "update dogs set age  = 7  where unit  = 2383 and name 
 = 'Cedric'", ECPGt_EOIT, ECPGt_EORT);}
#line 6 "tryme.pgc"

{ ECPGtrans(__LINE__, NULL, "commit");}
#line 7 "tryme.pgc"

{ ECPGdisconnect(__LINE__, "CURRENT");}
#line 8 "tryme.pgc"

   printf ("Done it.\n");
gustav@WOODLANDS:../gustav 18:11:02 !528 $
Now I have to link this program with the ecpg library:
gustav@WOODLANDS:../gustav 18:11:02 !528 $ gcc -o tryme tryme.c -lecpg
gustav@WOODLANDS:../gustav 18:11:57 !529 $
and I have the binary.

Let me demonstrate to you that it actually works!

gustav@WOODLANDS:../gustav 18:08:18 !559 $ psql --command "select * from dogs;" neighbors
  name  |       owner       |     race      | age | unit 
 Barky  | Abraham Lincoln   | Poodle        |   3 | 2387
 Cedric | Woodrow Wilson    | Mongrel       |   8 | 2383
 Fiote  | Margaret Thatcher | Border Collie |   3 | 2385
(3 rows)

gustav@WOODLANDS:../gustav 18:08:25 !560 $ ./tryme
Connecting to the data base ... 
Done it.
gustav@WOODLANDS:../gustav 18:08:32 !561 $ psql --command "select * from dogs;" neighbors
  name  |       owner       |     race      | age | unit 
 Barky  | Abraham Lincoln   | Poodle        |   3 | 2387
 Fiote  | Margaret Thatcher | Border Collie |   3 | 2385
 Cedric | Woodrow Wilson    | Mongrel       |   7 | 2383
(3 rows)

gustav@WOODLANDS:../gustav 18:08:37 !562 $
Also, observe the use of the --command (or -c) command  line switch to psql. It lets you execute psql commands without actually entering an interactive session with it.

In the example above I have performed certain SQL action, but I have not really interfaced SQL variables with C variables and this, clearly, is going to be necessary if you want to do anything more elaborate with your C program. ecpg lets you declare variables that are going to be used both in the SQL and C parts of the program. This is done in the following way, e.g.,

   int  x;
   char foo[16], bar[16];
But here things may get tricky, e.g., if you want to capture the output of a select operation into a C-language string. The ecpg documentation is not very clear on this point.

So here you may wish to turn to the full C-language  interface, called libpq, which  is also provided with PostgreSQL. The full interface is, of course, much more complex and more difficult to use than ecpg, but here you will have no problems with capturing the output of select. There is the whole section, ``1.3.4. Retrieving SELECT Result Information'', in the ``PostgreSQL 7.3.4 Programmer's Guide, I. Client Interfaces'' that talks about it. This specific section's location on my PC is


Briefly speaking, there is a function PQexec, which is used to submit queries to the data base. The function returns the result as an object of type PGresult. This object is really a table. It can be queried with various functions provided by libpq in order to find about the number of rows and columns, column names, column types, the actual values in the fields of the table. There is also a special function that can be used to print the returned table on a file - including standard output.

The same manual, i.e., the Programmer's Guide, provides  example programs. On my PC the examples live in:

and they are quite invaluable if you want to make sense of it all.

If you are a Java programmer, then you'll find Java interface to PostgreSQL discussed  in Chapter 5 of the Programmer's Guide, still part I, Client Interfaces. This chapter lives on my PC in

This chapter is also illustrated with some examples.

next up previous index
Next: Huge Data Bases Up: Databases: From Small to Previous: Talking to a PostgreSQL
Zdzislaw Meglicki