July 19th, 2008 admin
A couple of days ago an individual on ##pentaho wanted to run a single kettle transformation against multiple databases on multiple database servers. After a little work we figured how to do it and this i would like to share with you.
What if you had a setup where you had many database servers running the same database at different locations with different user access credentials and you wanted to get all this data into the one place?
You could create a transformation that works for one database server using a static host, username and password and such then copy that transformation for as many database servers you have, changing the access credentials for each transformation. This would create a maintenance nightmare!
Instead, we are going to create a Kettle job that will loop our transformation and replace the host, username and password with a variable from an XML document for each database server.
To save time I have created a sample of these Kettle transformations / jobs that work with the Pentaho Sample data and attached them to this post for you to examine for yourself.
kettle_example.tar.gz
Please note that you will have to edit the db_connections.xml file to fit your database connection details for the Pentaho sampleData. Its also worth mentioning that the transformation get_database_data.ktr assumes that the sampleData is stored in a MySQL database. If this is not the case you will have to change this too.
Lets briefly go over whats happening in these Kettle jobs, sub jobs and transformations in the order they are executed.
The sample should contain six files. Two jobs, three transformations and an XML document.
db_connections.xml
This XML document contains all the connection credentials for each individual database server. For this example you can see that i have specified the same database server twice but in a real life example you would list ALL your database servers credentials here.
Main.kjb
This is the Main job that will execute all the sub jobs and transformations. It is our starting point. This job will run a transformation called get_db_connection_information.ktr and then run the sub job called sub_job.ktr
get_db_connection_information.ktr
This transformation will read our db_connections.xml document and pass the resulting data back to the stream.
sub_job.kjb
Now this sub job will loop for every set of connection details we get from the XML document thus running the set_enviroment_variables.ktr and get_database_data.ktr multiple times.
set_enviroemt_variables.ktr
This will take the first et of database credentials from the XML document and set them as variables.
get_database_data.ktr
And finally this transformation will use a table input step and the recently set variables for the connection information. This transformation will run and get one row of data rom the first database and output this into a text document.
So to sum up the whole job will take the two database connections from the xml document and loop the sub job twice thus outputting a text document with two rows of data (one rom for each time the get_database_data.ktr is run).
Hopefully this has made sense and someone will benefit from it.
Harris
Posted in ETL | No Comments »