The skills needed to operationalize a data science solution are typically divided between data engineers and data scientists. It is rare to find a single individual with every skill needed to build and deploy a data science solution. Take a look at the following chart from a Stitch Data blog post:
Data scientists are great at developing analytical models to achieve specific business results. However, different skills are needed to deploy a model from the data scientist's development environment to a scalable production environment. To bring a data science solution to production, the following functions are typically distributed between data scientists and data engineers:
- Data Scientist
- Model exploration
- Model selection
- Model tuning/training
- Data Engineer
- Data Prep/Cleansing/Normalizing
- Data Blending
- Scaling solution
- Production deployment, management, and monitoring
You can significantly reduce the time it takes to bring a data science solution to market, and improve the quality of the end-to-end solution, by allowing each type of developer to perform the tasks they are best suited for in an environment that best meets their needs. By using Pentaho Data Integration (PDI) with Jupyter and Python, data scientists can spend their time developing and tuning data science models while data engineers handle the data prep tasks. Using all of these tools together also makes it easier for the two groups to collaborate and share applications. Here are the highlights of how the collaboration can work:
- Allow data engineers to perform all data prep activities in PDI. Use PDI to perform the following tasks:
- Utilize the available connectors to a variety of data sources that can be easily configured instead of coded
- Blend data from multiple sources
- Cleanse and normalize the data
- Tailor data sets for consumption by the data scientist’s application by implementing the following types of tasks in PDI:
- Feature engineering
- Statistical analytics
- Classes and predictors identification
- Easily migrate PDI applications from development to production environments with minimal changes
- Easily scale applications to handle production big data volumes
- Allow the data scientist to use the prepared data from PDI applications to feed into Jupyter and Python scripts. Using the data engineer’s prepared data, the data scientist can focus on the following tasks in Jupyter/Python:
- Model Exploration
- Model Tuning
- Model Training
- Easily share PDI applications between data engineers and data scientists. The output of the PDI application can easily be fed into Jupyter/Python. This significantly reduces the amount of time the data scientist spends on data prep and integration tasks.
This post will demonstrate how to use these tools together. You will need the following components:
- Pentaho PDI 8.1+ needs to be installed on the same machine as the Jupyter/Python execution environment.
- Pentaho Server with Pentaho Data Service. The Pentaho Server can either be running remotely in a shared environment or locally on your development machine. The PDI transformation developed using the Pentaho Data Service must be stored in the Pentaho Server as required by the Pentaho Data Service feature. For details about Pentaho Data Service see the Pentaho help docs here.
Setting up the Jupyter and Python environment is beyond the scope of this article. However, you will need to make sure that the following dependencies are met in your environment:
- Python 2.7.x or Python 3.5.x
- Jupyter Notebook 5.6.0+
- Python JDBC dependencies, i.e. JayDeBeApi and jpype
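Assuming a pip-based Python environment, these JDBC dependencies can typically be installed like so (note that the JPype package is published on PyPI as `JPype1`):

```shell
# Install the Python-to-JDBC bridge and its JVM binding
pip install JayDeBeApi JPype1
```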
How to use PDI, Jupyter, and Python Together
3. In Jupyter Notebook, implement the following as a Python script. First, include the appropriate Python libraries, then create a connection to the PDI Data Service. The sample script below assumes you have installed the Pentaho Server on your local machine. If you are running the Pentaho Server on a remote shared server, change the JDBC connection information appropriately.
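A minimal sketch of such a script is shown below, using the JayDeBeApi bridge and the Pentaho Data Service thin JDBC driver. The data service name, credentials, and driver jar path are placeholder assumptions; replace them with the values for your own Pentaho Server.

```python
import pandas as pd

# Thin JDBC driver details for the Pentaho Data Service.
# Hostname, port, webapp name, credentials, and jar path below are
# assumptions for a default local Pentaho Server install; adjust as needed.
DRIVER_CLASS = "org.pentaho.di.trans.dataservice.jdbc.ThinDriver"
JDBC_URL = "jdbc:pdi://localhost:8080/kettle?webappname=pentaho"
DRIVER_JARS = ["/path/to/pdi-dataservice-client.jar"]  # hypothetical path


def fetch_prepared_data(service_name, user="admin", password="password"):
    """Query a PDI Data Service and return the result as a pandas DataFrame."""
    # Imported lazily so this module can load even before JayDeBeApi is installed.
    import jaydebeapi

    conn = jaydebeapi.connect(DRIVER_CLASS, JDBC_URL, [user, password], DRIVER_JARS)
    try:
        cursor = conn.cursor()
        # A data service is queried like a table, named after the service itself.
        cursor.execute("SELECT * FROM " + service_name)
        columns = [desc[0] for desc in cursor.description]
        rows = cursor.fetchall()
    finally:
        conn.close()
    return pd.DataFrame(rows, columns=columns)


# Example usage (requires a running Pentaho Server with a data service):
# df = fetch_prepared_data("my_data_service")
# print(df.head())
```

If your Pentaho Server runs remotely, only `JDBC_URL` and the credentials need to change; the rest of the script is independent of where the server lives.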
5. Now that the data prepared by your PDI transformation is in a Python DataFrame, you can experiment with it using various Python data science models, libraries, and engines (such as scikit-learn, TensorFlow, and MATLAB). The example below shows a scikit-learn decision tree.
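As an illustration, here is a minimal decision-tree sketch in that spirit. Because it must run standalone, a small synthetic DataFrame stands in for the data returned by the PDI Data Service; in practice you would use the DataFrame produced in the previous step, with your own feature and label columns.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the DataFrame returned by the PDI Data Service.
# Here "label" is fully determined by "feature_a", so the tree can learn it exactly.
df = pd.DataFrame({
    "feature_a": [0, 1, 0, 1, 0, 1, 0, 1] * 10,
    "feature_b": [1, 1, 0, 0, 1, 1, 0, 0] * 10,
    "label":     [0, 1, 0, 1, 0, 1, 0, 1] * 10,
})

X = df[["feature_a", "feature_b"]]
y = df["label"]

# Hold out a test set so we can evaluate the fitted model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, predictions))
```

The same pattern applies unchanged to real PDI output: select the feature columns and the target column from the DataFrame, split, fit, and score.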
The above PDI application and Jupyter/Python code is available here.