Friday, May 31, 2013

ETL Testing

1.0  Introduction

Many organizations today are challenged to do more with fewer resources and to make cost reduction a strategic priority. Most Testing teams constantly face the challenge of innovating solutions that add value to their customers, who in turn are looking at reducing costs and increasing coverage.
DWH-ETL testing is one such area. In general, the maturity level of DWH-ETL testing with respect to Testing Methodology is very low.

Typically a team should first move towards standardisation of its Manual Testing processes; only after those processes are standardised should the team move towards Automation.

2.0  What is Data Warehouse?

A data warehouse is the main repository of an organization's historical data, its corporate memory. For example, a Credit Card company would use the information stored in its data warehouse to find out which months of the year its Customers have a very high rate of defaulting on their Credit Card payments, or the spending habits of different segments of society and age groups. In other words, the data warehouse contains the raw material for management's decision support system.


3.0  What is DWH-ETL?

Extract, transform, and load (ETL) is a process in data warehousing that involves
  • Extracting data from outside sources,
  • Transforming it to fit business needs, and ultimately
  • Loading it into the data warehouse.
In fact, ETL can refer more generally to a process that loads any database.
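As an illustration, the sketch below (in Python) shows the three stages on a toy feed. The file name, table and column names, and the transformation rule are hypothetical placeholders, not part of any real application.

    # Minimal extract-transform-load sketch (illustrative only).
    # Assumes a pipe-delimited source file "customers.txt" and a SQLite target;
    # all file, table and column names here are hypothetical.
    import csv
    import sqlite3

    def extract(path):
        with open(path, newline="") as f:
            return list(csv.DictReader(f, delimiter="|"))

    def transform(rows):
        # Example business rule: tidy up names and standardise country codes.
        for row in rows:
            row["name"] = row["name"].strip().title()
            row["country"] = row["country"].upper()
        return rows

    def load(rows, conn):
        conn.executemany(
            "INSERT INTO dw_customer (cust_id, name, country) "
            "VALUES (:cust_id, :name, :country)",
            rows,
        )
        conn.commit()

    if __name__ == "__main__":
        conn = sqlite3.connect("dwh.db")
        conn.execute(
            "CREATE TABLE IF NOT EXISTS dw_customer (cust_id TEXT, name TEXT, country TEXT)"
        )
        load(transform(extract("customers.txt")), conn)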

4.0  DWH-ETL Testing

There is a need for more focused testing regarding DWH-ETL processes.
When testing ETL:
  • Validate that all the data specified gets extracted.
  • Check that the transformation and cleansing processes are working correctly.
  • Test the data loaded into each target field, and its count, to ensure correct execution of business rules such as valid-value constraints, cleansing and calculations (a minimal sketch of two such checks appears below).
  • If an application has multiple jobs, the job dependencies should also be verified.
ETL Applications can be language-based or tool-based. If the capability of the ETL tool is fully utilised, it reduces some of the Testing effort otherwise required in a language-based application.
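As a minimal sketch of two such checks (row-count reconciliation and a valid-value constraint), the Python snippet below queries a SQLite target; the staging and target table names, the column and the valid-value list are all assumptions made for the example.

    # Sketch of two basic post-load checks (hypothetical table/column names):
    # 1) row-count reconciliation between source staging and target,
    # 2) a valid-value constraint on a target column.
    import sqlite3

    conn = sqlite3.connect("dwh.db")

    src_count = conn.execute("SELECT COUNT(*) FROM stg_customer").fetchone()[0]
    tgt_count = conn.execute("SELECT COUNT(*) FROM dw_customer").fetchone()[0]
    assert src_count == tgt_count, f"Row count mismatch: source={src_count}, target={tgt_count}"

    # Valid-value constraint: country must be one of the agreed codes.
    bad = conn.execute(
        "SELECT COUNT(*) FROM dw_customer WHERE country NOT IN ('IN', 'US', 'GB')"
    ).fetchone()[0]
    assert bad == 0, f"{bad} rows violate the valid-value constraint on country"
    print("Row count and valid-value checks passed")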

4.1   Manual Testing Strategy for DWH-ETL
In this section the Manual System Testing and Regression Testing strategies are discussed in detail.
Most of the processes mentioned below have been streamlined in order to make the transition to Automation as smooth as possible.
For System Testing:
At a very high level, any DWH-ETL application has inputs which undergo a transformation and produce an output. Hence for system testing, the general steps to follow for success and coverage are:
1)      Create Scenarios based on the Documents
2)      Prepare Test data based on the Scenarios
3)      Derive Expected results from the Transformation rules and the Test data
4)      Run the Application using the Test data
5)      Capture the Actual output
6)      Analyse the results by comparing the Expected to the Actual
Pre-requisites:
1)      Business Requirement document
2)      Functional Requirement document
3)      Technical Requirement document
4)      Source to Target Mapping document
5)      Code diff
6)      Traceability Matrix
How to come up with Scenario Sheet?
Based on the Functional Requirement document, Technical Requirement document and Source to Target Mapping document, scenarios should be created to check the new functionality going in. Scenarios should cover both positive and negative cases.
Typically scenarios can be captured in an Excel sheet, with one of the worksheets listing them in plain English.
If it is a complex application where intermediate files are being created, the application should be broken into sub-applications. A sub-application is any set of jobs clubbed together which has a physical input and a physical output.
The Scenario sheet should be created at the sub-application level and then merged into a final application-level Scenario sheet.
Each Scenario should list the corresponding source file/table and field/column names as well as the target file/table and field/column names, even at the sub-application level.
The values put in these source fields/columns can be hypothetical, but they should be unique and should have the correct corresponding (post-transformation) values put in the target fields/columns.
The above scenarios should be discussed with the Business Analyst, Development Manager and Test Lead with respect to coverage, duplication and relevance.
Once they are agreed upon by all Stakeholders, 2 new worksheets should be created in the same Excel sheet – Input and Expected Output.
The Input worksheet should have real-world values, with field-level details that can be processed through the application.
The Expected Output worksheet should have the expected target values when the Input worksheet values are passed through the Application.
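A minimal sketch of such a workbook, built in Python with the openpyxl library (assumed to be available), is shown below; the scenario descriptions, field names and values are hypothetical placeholders.

    # Sketch of the scenario workbook described above, built with openpyxl.
    # Scenario text, field names and values are hypothetical placeholders.
    from openpyxl import Workbook

    wb = Workbook()

    scenarios = wb.active
    scenarios.title = "Scenarios"
    scenarios.append(["Scenario ID", "Description", "Source field", "Target field"])
    scenarios.append(["SC01", "Valid country code is passed through unchanged",
                      "SRC_FILE.country", "DW_CUSTOMER.COUNTRY"])
    scenarios.append(["SC02", "Unknown country code is rejected to the error file",
                      "SRC_FILE.country", "ERR_FILE.reason"])

    inp = wb.create_sheet("Input")
    inp.append(["cust_id", "name", "country"])
    inp.append(["1001", "alice", "in"])
    inp.append(["1002", "bob", "zz"])     # negative case for SC02

    exp = wb.create_sheet("Expected Output")
    exp.append(["cust_id", "name", "country"])
    exp.append(["1001", "Alice", "IN"])   # only the valid record reaches the target

    wb.save("etl_scenarios.xlsx")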

How to create Test Data?
If the Input used by the Application is files, then the input files should be mocked up based on the “Input” worksheet of the Excel sheet and the “File layout”.
If the Input used by the Application comes from tables, then again based on the “Input” worksheet of the Excel sheet, the corresponding values should be inserted into the columns of the concerned tables.
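For the file case, a sketch along these lines could generate the mock file from the Input worksheet; the workbook name, output file name and pipe-delimited layout are assumptions carried over from the earlier sketch and would be replaced by the real File layout.

    # Sketch: turn the "Input" worksheet into a mocked-up flat file that matches
    # a hypothetical pipe-delimited layout (adapt to the real file layout).
    from openpyxl import load_workbook

    wb = load_workbook("etl_scenarios.xlsx")
    ws = wb["Input"]

    with open("mock_customers.txt", "w") as out:
        rows = ws.iter_rows(values_only=True)
        next(rows)                            # skip the column-name row
        for row in rows:
            out.write("|".join("" if v is None else str(v) for v in row) + "\n")

If the input comes from tables instead, the same worksheet rows can be inserted into the test schema with simple INSERT statements (for example via an executemany call).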

How to ensure the Mocked up Test Data file gets picked up when application runs?
At a very high level there are 3 ways to ensure it, depending on the application:
  1. Renaming the file
  2. Changing the Header and Trailer
  3. Getting the name from the Control DB and naming the file based on that (see the sketch below).
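A sketch of the third option is given below; the control database, table and column names are purely illustrative assumptions.

    # Sketch of option 3: look up the expected file name in a control table and
    # rename the mocked-up file to match (control table/column names are hypothetical).
    import os
    import sqlite3

    conn = sqlite3.connect("control.db")
    expected_name = conn.execute(
        "SELECT next_file_name FROM file_control WHERE feed = 'CUSTOMER'"
    ).fetchone()[0]

    os.rename("mock_customers.txt", expected_name)
    print(f"Mock file renamed to {expected_name}")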

How to capture the Actual Output for Analysis?
The best way to do this is to import the data from the output file or table into a new Excel worksheet.
Excel has built-in functionality for importing delimited or fixed-column-width flat text files. Also, most DB querying tools like Toad, SQL Navigator and SQL Developer can save the results to Excel.
Storing the output this way simplifies the comparison between expected and actual, and also keeps the results in a readable format for future reference.
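As an alternative to importing by hand, a small script can pull the output into the same workbook; the sketch below assumes a pipe-delimited output file named dw_customer_output.txt and the workbook created earlier.

    # Sketch: import a pipe-delimited output file into a new "Actual Output"
    # worksheet of the scenario workbook (file name and delimiter are assumptions).
    import csv
    from openpyxl import load_workbook

    wb = load_workbook("etl_scenarios.xlsx")
    actual = wb.create_sheet("Actual Output")

    with open("dw_customer_output.txt", newline="") as f:
        for row in csv.reader(f, delimiter="|"):
            actual.append(row)

    wb.save("etl_scenarios.xlsx")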

How to analyse the results?
A field-to-field comparison should be done between the Expected and the Actual, and any discrepancies should be highlighted. The scenario showing a difference should first be double-checked for the correctness of its Expected value, and once satisfied, a defect should be raised for that Scenario.
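A sketch of such a comparison between the Expected Output and Actual Output worksheets is shown below; it assumes both sheets exist in the workbook created earlier and reports every mismatching cell.

    # Sketch of a field-to-field comparison between the "Expected Output" and
    # "Actual Output" worksheets; every mismatch is reported with its cell position.
    from openpyxl import load_workbook

    wb = load_workbook("etl_scenarios.xlsx")
    expected = wb["Expected Output"]
    actual = wb["Actual Output"]

    mismatches = []
    for r in range(1, max(expected.max_row, actual.max_row) + 1):
        for c in range(1, max(expected.max_column, actual.max_column) + 1):
            exp_val = expected.cell(row=r, column=c).value
            act_val = actual.cell(row=r, column=c).value
            if str(exp_val or "") != str(act_val or ""):
                mismatches.append((r, c, exp_val, act_val))

    for r, c, exp_val, act_val in mismatches:
        print(f"Row {r}, column {c}: expected {exp_val!r}, got {act_val!r}")
    print(f"{len(mismatches)} discrepancies found")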

For Regression Testing:
Generally, regression testing of DWH-ETL applications seems a very daunting and time-consuming task. Typically, due to resource/time constraints:
  • Random sampling of the unchanged high-criticality functionality is targeted
  • System testing of the randomly sampled functionality is done
This leaves plenty of code going into production with a very low confidence level:
  • Most of the unchanged medium- and low-criticality functionality never gets tested, release after release.
  • Sometimes, due to time constraints and the large amount of critical functionality in the Application, some of the critical functions never get verified.
Regression testing is done to ensure that the new code did not affect anything other than what it was supposed to affect.
This purpose can be achieved by taking a snapshot of the Application before the code gets implemented and a snapshot after the code gets implemented. A difference between the two snapshots will show what got changed, and anything which was not supposed to change will also be highlighted.
The important thing to remember is that the above can only work if the input remains constant. Thus, for Regression testing, the test file should remain constant.
There are 2 ways in which a Regression Input file can be created:
1)      By taking a sample of the production Input file and desensitising it (see the sketch after this list).
2)      By incrementally adding the System Test Input Data file to a Regression Test file with each release. Progressively with each release, the Regression Test file will get better in its scope and coverage.
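A sketch of the first option is given below; which fields are sensitive, and how they are masked, are assumptions that would come from the real file layout and data-protection rules.

    # Sketch of option 1: desensitise a sampled production file by masking the
    # fields that carry sensitive data (field positions are hypothetical).
    import csv

    with open("prod_sample.txt", newline="") as src, \
         open("regression_input.txt", "w", newline="") as dst:
        reader = csv.reader(src, delimiter="|")
        writer = csv.writer(dst, delimiter="|")
        for i, row in enumerate(reader, start=1):
            row[1] = f"CUSTOMER_{i:06d}"               # mask the customer name
            row[3] = "XXXX-XXXX-XXXX-" + row[3][-4:]   # keep only the last 4 card digits
            writer.writerow(row)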
Once the above is achieved, to regression test a release, the same file should be run through the application before and after the code is implemented. The output of the before and after runs should be captured and stored, preferably in different worksheets of an Excel sheet.
A simple comparison will highlight the differences between the 2 worksheets. There will be differences either because
1)      It is supposed to be different - the new change for this release or
2)      New code broke something it was not supposed to affect.
Segregating the type 2 differences from the type 1 differences and analysing the type 2s will pinpoint what the new code broke.
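A sketch of such a before/after comparison is given below; the workbook name, the "Before" and "After" sheet names, the key column used to match records and the list of keys expected to change are all assumptions for the example.

    # Sketch: compare the "Before" and "After" regression worksheets and separate
    # the differences that were expected for this release from those that were not.
    from openpyxl import load_workbook

    wb = load_workbook("regression_run.xlsx")
    before = wb["Before"]
    after = wb["After"]
    expected_changes = {"1002", "1007"}   # record keys the new code was supposed to alter

    rows = max(before.max_row, after.max_row)
    cols = max(before.max_column, after.max_column)
    for r in range(1, rows + 1):
        key = str(before.cell(row=r, column=1).value or after.cell(row=r, column=1).value)
        for c in range(1, cols + 1):
            b = before.cell(row=r, column=c).value
            a = after.cell(row=r, column=c).value
            if b != a:
                label = "expected" if key in expected_changes else "REGRESSION?"
                print(f"{label}: row {r} ({key}), column {c}: before={b!r}, after={a!r}")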

This is typically what any regression testing aims for.