Analytics Trends


ETL and Python Integration

Erik Chen
Posted 2/18/16

My background is in computer science, and I started here at Dunn Solutions back in September. This is my first contribution to the BI blog, so I was excited to pick a topic that relates back to my major. Since starting, I've had a chance to work with a variety of tools, and I've noticed that if you work in BI, you've likely worked with SAP Data Services. It is a visually driven program with a drag-and-drop interface, which lets you build solutions quickly without having to dive into writing SQL to accomplish what you need. While such interfaces may be simple and powerful enough to get the job done, it isn't always easy to coax a program into doing exactly what you want.

Let me give an example I recently came across while working with a client. Say that in SAP Data Services you need to process a large dataset. This dataset potentially contains unprintable characters, remnants of HTML tags, or profanities. The data will ultimately be loaded into a report, and you wouldn't want HTML tags, strange symbols, or profanities appearing in it.
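To make the scenario concrete, here is a minimal sketch of the kind of scrubbing involved. The tag pattern, profanity list, and function name are all made up for illustration; a real job would use a proper word list and character policy.

```python
import re
import string

# Hypothetical cleanup helpers -- a sketch of the scrubbing described above,
# not code from the actual client job.
TAG_RE = re.compile(r"<[^>]+>")   # leftover HTML tags like <b> or </td>
PROFANITY = {"darn", "heck"}      # placeholder word list

def clean_value(value):
    value = TAG_RE.sub("", value)  # strip HTML remnants
    # drop unprintable characters
    value = "".join(ch for ch in value if ch in string.printable)
    # mask words found in the profanity list
    words = ["***" if w.lower() in PROFANITY else w for w in value.split()]
    return " ".join(words)

print(clean_value("<b>What the\x00 heck</b> is this?"))
# -> What the *** is this?
```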

SAP Data Services offers plenty of table transforms and scripting functions. Of special note is the ability to create a user-defined transform (UDT) using Python scripting. As a programmer, however, working with Python from within Data Services is frustrating. To use data from columns in a Python UDT, each column must be mapped as an input to the transform. There is also no support for third-party modules: I have not found anything indicating how one could install and use third-party modules in the built-in Python editor. In a regular Python development environment, it would be as simple as installing the module with pip and adding an import line to the top of your file. Compatibility is also limited to Python 2.7, as that is what the Data Services environment supports.

If you do decide to do your Python scripting outside of Data Services, there is a great third-party module called pymssql that lets you connect to a Microsoft SQL Server database and execute queries. With pymssql and the built-in Python modules, it is possible to perform complex processing of your dataset while interacting with the database.
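The original code sample appears to have been lost in extraction, so here is a minimal sketch matching the description that follows. The server, credentials, table, and column names are placeholders; only the pymssql calls themselves (connect, cursor, execute, iteration) are real API.

```python
import pymssql

# Placeholder connection details -- substitute your own.
conn = pymssql.connect(
    server="myserver",
    user="myuser",
    password="mypassword",
    database="mydb",
    autocommit=True,  # commit each successful statement immediately
)
cursor = conn.cursor(as_dict=True)  # optional: access columns by name

# One tuple element per "%s" placeholder, in order.
data = ("Chicago",)
cursor.execute("SELECT id, name FROM customers WHERE city = %s", data)

for row in cursor:  # iterate the cursor to read results
    print(row["id"], row["name"])

conn.close()
```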

The above is all it takes to connect to a database, execute a query, and view the results using pymssql. The "autocommit" flag indicates that changes to the database will be committed immediately if the query succeeds. The "data" parameter of execute() is a tuple of values that maps to the placeholders in the SQL statement: every "%s" in the SQL corresponds to one element of the "data" tuple. You can also pass the "as_dict" parameter when creating the cursor if you wish to access result columns by name. After the query executes, results are obtained by iterating over the cursor.

While it is nice to be able to freely manipulate your data, you take a performance hit when large numbers of records must be persisted to the database. pymssql does not support statement caching, though it does support cooperative multitasking. The only difference between execute() and executemany() is that the latter accepts a list of tuples and calls execute() once for each tuple in the list. In other words, you would have to build your own SQL if you wished to aggregate multiple inserts into a single statement. A good tradeoff between performance and versatility is to use your ETL tool of choice in conjunction with an external Python script: the ETL tool handles the major transforms and database interactions, while all the heavy processing is done in your script.
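Building your own aggregated SQL might look like the following sketch: a hypothetical helper that collapses many rows into one multi-row INSERT, so the driver sends a single statement instead of one round trip per row. The table and column names are made up for illustration.

```python
# Hypothetical helper -- aggregate many rows into one INSERT statement.
def build_bulk_insert(table, columns, rows):
    # One "(%s, ...)" group per row, one "%s" per column.
    placeholders = "(" + ", ".join(["%s"] * len(columns)) + ")"
    sql = "INSERT INTO {} ({}) VALUES {}".format(
        table, ", ".join(columns), ", ".join([placeholders] * len(rows)))
    # Flatten the rows so each value lines up with its "%s" placeholder.
    params = tuple(v for row in rows for v in row)
    return sql, params

sql, params = build_bulk_insert("clean_data", ["id", "text"],
                                [(1, "foo"), (2, "bar")])
# sql    -> INSERT INTO clean_data (id, text) VALUES (%s, %s), (%s, %s)
# params -> (1, "foo", 2, "bar")
```

The resulting pair can then be passed to a single cursor.execute(sql, params) call, avoiding the per-row overhead of executemany().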

This is but a peek at what is possible when integrating Python into your ETL jobs. I hope you find the Python language as great as I do.

Comments
Graeme Morgan
Very nice article. I also have in mind to combine Python and DS. Instead of using the UDT, I'm considering using a DS script and calling my Python program inside it, but I'm not sure if this works. I assume Python would need to be installed on the Job Server. Would there be any issues with that approach, or with needing to install a specific version of Python so as not to interfere with the built-in Python of the UDT?
Posted on 9/19/19 2:21 AM.
Graeme Morgan
Realised that Python 2.7 is already installed on the Job Server, although no pip installed.
I did however identify two methods to install 3rd party modules on the job server without pip.
Option 1: download the 3rd party module and extract it to a folder on the job server, then run "python setup.py install" from that folder.
Option 2: Install the module on your local machine, then copy the folder from your Python27\Lib\site-packages folder to the job server's Python \Lib\site-packages folder (write access required).

This can be done for each module you require.

FYI, you can call Python in two ways: either via the UDT (as part of your transform) or with a DS script that calls your Python program directly.
Posted on 10/7/19 1:51 AM.
