Suppose you have the following project. You have a C-language program that can read documents, e.g., memos, articles, books, and that can generate automatically abstracts, keywords and bibliographic entries for these documents - without any human intervention. If you have a lot of documents to process you may consider running the program on the AVIDD farm. You would simply submit a PBS job that would distribute multiple instances of the program over, say, 32 or 48 nodes. Because the instances of the program have no need to communicate with each other, this is all that you really need to do to ``parallelise'' the application.
The only snag is coordination. Suppose that all the documents you need to process live on GPFS in one of your directories. How are you to decide, which instance of your program should get which document? The other issue is how you are going to collect the results of the processing. Each instance of the program is going to generate bibliographic entries for the documents it has processed, where are these entries to go?
This kind of a problem is not restricted to generation of bibliographic entries. When high energy physicists process their experimental data, what they do is almost exactly the same. The only difference in the type of calculations their programs do. They store their data on multiple files, which are then distributed, somehow, to instances of data reduction programs that run on a CPU farm. The programs process the data and return the results on various files, which are then collated and prepared for further processing, e.g., visualization.
In order to coordinate the exercise you could write an MPI program and
then incorporate your C program as a subroutine in the MPI program. But
this is really quite unnecessary, because you already have a first
class coordinator and data collector in the form of
yes,
in the form of a PostgreSQL data base.
PostgreSQL data base transactions, i.e., operations such as
update,
or insert, or delete,
are atomic. Various complex combinations of these
and other transactions can be made atomic too using a special SQL bracketing in
the form of begin, commit
and rollback commands. Atomicity of
a transaction means that the transaction, however complex and lengthy,
runs to completion or not at all. If anything happens while the transaction runs that
makes it impossible to complete it, the transaction is unwound and the state of the
data base returns to the condition the data base was in before the transaction began.
This is accomplished by the means of sophisticated locking, queuing, logging, backtracking
and various other devices that would be quite difficult to implement in a small
MPI program.
Additionally SQL provides the command lock ,
which lets you lock the table you work on explicitly and in various ways.
The way to use this command is to lock the table in the first command
line of the transaction. Then you proceed with the transaction
and the moment you commit it, the lock is automatically lifted.
There is no unlock command in SQL.
In summary you could begin by constructing a table that would represent the job queue. Your workers, once they've been instantiated by PBS on various nodes of the AVIDD cluster would contact the data base server running on one of the AVIDD nodes or even on your laboratory server and would perform a transaction with the data base. They could, for example, ask the data base for the top entry in the job queue table that has not been worked on yet. The way to ask for such table entry is similar to the following:
neighbors=# select * from dogs neighbors-# where age not like 3 neighbors-# limit 1; name | owner | race | age | unit -------+-------------------+---------------+-----+------ Fiote | Margaret Thatcher | Border Collie | 4 | 2385 (1 row) neighbors=#Now think of
age = 3 representing a ``worked-on'' flag
and here you have the solution
that gives you the next available entry in the table. Once the
worker gets the entry, the worker would set the ``worked-on'' flag
on this row by
doing something like:neighbors=# update dogs neighbors-# set age = 3 neighbors-# where name = 'Fiote' and unit = 2385; UPDATE 1 neighbors=#Having reserved the row, the worker would then go to process the corresponding document. Then once the bibliographic entry for the document has been constructed, the worker would again contact the data base and it would update the relevant row in this or some other table, by adding the full bibliographic entry to it.
The only thing you need to learn in order to implement your system this way
is how to talk to a PostgreSQL data base from your C-language program.
The easiest way to do this is to use the ecpg
preprocessor for
C with embedded SQL statements. The preprocessor recognizes the SQL statements
and converts them automatically to C-language constructs. The program
is then compiled and linked with the ecpg library and
you
have it.
Here is a simple example. Consider the following program:
gustav@WOODLANDS:../gustav 18:08:23 !525 $ cat tryme.pgc
#include <stdio.h>
main()
{
printf ("Connecting to the data base ... \n");
EXEC SQL CONNECT TO neighbors;
EXEC SQL UPDATE dogs SET age = 7 WHERE unit = 2383 AND name = 'Cedric';
EXEC SQL COMMIT;
EXEC SQL DISCONNECT CURRENT;
printf ("Done it.\n");
}
gustav@WOODLANDS:../gustav 18:09:40 !526 $
The SQL commands begin with the EXEC SQL string, which is then followed
by a command itself and the terminating semicolon. You can interleave
SQL commands with normal C-language statements. SQL transactions executed this
way are not committed unless COMMIT is requrested explicitly.
I am going to preprocess this program first in order to generate a genuine C-language program:
gustav@WOODLANDS:../gustav 18:09:40 !526 $ ecpg tryme.pgc gustav@WOODLANDS:../gustav 18:10:30 !527 $The C-language program I have just generated looks as follows:
gustav@WOODLANDS:../gustav 18:10:30 !527 $ cat tryme.c
/* Processed by ecpg (2.10.0) */
/* These four include files are added by the preprocessor */
#include <ecpgtype.h>
#include <ecpglib.h>
#include <ecpgerrno.h>
#include <sqlca.h>
#line 1 "tryme.pgc"
#include <stdio.h>
main()
{
printf ("Connecting to the data base ... \n");
{ ECPGconnect(__LINE__, "neighbors" , NULL,NULL , NULL, 0); }
#line 5 "tryme.pgc"
{ ECPGdo(__LINE__, NULL, "update dogs set age = 7 where unit = 2383 and name
= 'Cedric'", ECPGt_EOIT, ECPGt_EORT);}
#line 6 "tryme.pgc"
{ ECPGtrans(__LINE__, NULL, "commit");}
#line 7 "tryme.pgc"
{ ECPGdisconnect(__LINE__, "CURRENT");}
#line 8 "tryme.pgc"
printf ("Done it.\n");
}
gustav@WOODLANDS:../gustav 18:11:02 !528 $
Now I have to link this program with the ecpg library:gustav@WOODLANDS:../gustav 18:11:02 !528 $ gcc -o tryme tryme.c -lecpg gustav@WOODLANDS:../gustav 18:11:57 !529 $and I have the binary.
Let me demonstrate to you that it actually works!
gustav@WOODLANDS:../gustav 18:08:18 !559 $ psql --command "select * from dogs;" neighbors name | owner | race | age | unit --------+-------------------+---------------+-----+------ Barky | Abraham Lincoln | Poodle | 3 | 2387 Cedric | Woodrow Wilson | Mongrel | 8 | 2383 Fiote | Margaret Thatcher | Border Collie | 3 | 2385 (3 rows) gustav@WOODLANDS:../gustav 18:08:25 !560 $ ./tryme Connecting to the data base ... Done it. gustav@WOODLANDS:../gustav 18:08:32 !561 $ psql --command "select * from dogs;" neighbors name | owner | race | age | unit --------+-------------------+---------------+-----+------ Barky | Abraham Lincoln | Poodle | 3 | 2387 Fiote | Margaret Thatcher | Border Collie | 3 | 2385 Cedric | Woodrow Wilson | Mongrel | 7 | 2383 (3 rows) gustav@WOODLANDS:../gustav 18:08:37 !562 $Also, observe the use of the
--command (or -c)
command line
switch to psql. It lets you execute psql commands without actually
entering an interactive session with it.
In the example above I have performed certain SQL action, but I have not
really interfaced SQL variables with C variables and this, clearly, is going to
be necessary if you want to do anything more elaborate with your C program.
ecpg lets you declare variables that are going to be used
both in the SQL and C parts of the program. This is done in the
following way, e.g.,
EXEC SQL BEGIN DECLARE SECTION; int x; char foo[16], bar[16]; EXEC SQL END DECLARE SECTION;But here things may get tricky, e.g., if you want to capture the output of a
select operation into a C-language string. The ecpg documentation
is not very clear on this point.
So here you may wish to turn to the full C-language
interface, called libpq,
which is
also provided with PostgreSQL. The full interface is, of course, much more
complex and more difficult to use than ecpg, but here you will have
no problems with capturing the output of select. There is the whole
section, ``1.3.4. Retrieving SELECT Result Information'',
in the ``PostgreSQL 7.3.4 Programmer's Guide, I. Client Interfaces''
that talks about it. This specific section's location on my PC is
C:\cygwin\usr\doc\postgresql-7.3.4\html\libpq-exec.html
Briefly speaking, there is a function PQexec, which is used to
submit queries to the data base. The function returns the result as an
object of type PGresult. This object is really a table. It can be queried with
various functions provided by libpq in order to find about the number of
rows and columns, column names, column types, the actual values in the fields
of the table. There is also a special function that can be used to print
the returned table on a file - including standard output.
The same manual, i.e., the Programmer's Guide, provides example programs. On my PC the examples live in:
C:\cygwin\usr\doc\postgresql-7.3.4\html\libpq-example.htmland they are quite invaluable if you want to make sense of it all.
If you are a Java programmer, then you'll find Java interface to PostgreSQL discussed in Chapter 5 of the Programmer's Guide, still part I, Client Interfaces. This chapter lives on my PC in
C:\cygwin\usr\doc\postgresql-7.3.4\html\jdbc.htmlThis chapter is also illustrated with some examples.