1.0 Introduction
Many organizations today are challenged to do more with fewer
resources and to make cost reduction a strategic priority. Most testing teams
constantly face the challenge of innovating solutions that add value to their
customers, who in turn are looking to reduce costs and increase coverage.
DWH-ETL testing is one such area. The maturity level of DWH-ETL
testing, in general, with respect to Testing Methodology is very low.
Typically a team should first move towards standardisation of its Manual
Testing processes; once those processes are standardised, the team should
move towards Automation.
2.0 What is a Data Warehouse?
A data warehouse is the main repository of an organization's
historical data, its corporate memory. For example, a Credit Card company would
use the information stored in its data warehouse to find out in which
months of the year its customers have a very high rate of defaulting on their
Credit Card payments, or the spending habits of different segments of society
and age groups. In other words, the data warehouse contains the raw material for
management's decision support system.


3.0 What is DWH-ETL?
Extract, transform, and load (ETL) is a process in data warehousing that involves:
- Extracting data from outside sources,
- Transforming it to fit business needs, and ultimately
- Loading it into the data warehouse.
ETL can in fact generally refer to a process that loads any
database.
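To make the three steps concrete, below is a minimal Python sketch of an ETL run. The customers.csv source file, the customer_dim target table and the spend-segment rule are hypothetical stand-ins, not part of any specific tool.

    import csv
    import sqlite3

    def extract(path):
        # Extract: read raw rows from the outside source file.
        with open(path, newline="") as f:
            return list(csv.DictReader(f))

    def transform(rows):
        # Transform: fit the data to business needs, e.g. normalise the
        # name and derive a spend segment (an illustrative rule).
        for row in rows:
            row["name"] = row["name"].strip().title()
            row["segment"] = "HIGH" if float(row["spend"]) > 1000 else "LOW"
        return rows

    def load(rows, conn):
        # Load: insert the transformed rows into the warehouse table.
        conn.executemany(
            "INSERT INTO customer_dim (name, spend, segment) VALUES (?, ?, ?)",
            [(r["name"], r["spend"], r["segment"]) for r in rows],
        )
        conn.commit()

    conn = sqlite3.connect("warehouse.db")
    conn.execute("CREATE TABLE IF NOT EXISTS customer_dim (name, spend, segment)")
    load(transform(extract("customers.csv")), conn)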
4.0 DWH-ETL Testing
There is a need for more focused testing of DWH-ETL processes.
When testing ETL:
- Validate that all the specified data gets extracted.
- Check that the transformation and cleansing processes are working correctly.
- Test the data loaded into each target field, and its count, to ensure correct execution of business rules such as valid-value constraints, cleansing and calculations.
- If there are different jobs in an application, verify the job dependencies as well.
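As a minimal sketch of the first and third checks above, assuming hypothetical source_stage and target_fact tables sitting in a SQLite test database:

    import sqlite3

    conn = sqlite3.connect("etl_test.db")

    # Check 1: all specified data got extracted and loaded - compare counts.
    src_count = conn.execute("SELECT COUNT(*) FROM source_stage").fetchone()[0]
    tgt_count = conn.execute("SELECT COUNT(*) FROM target_fact").fetchone()[0]
    assert src_count == tgt_count, f"Row count mismatch: {src_count} vs {tgt_count}"

    # Check 2: a valid-value constraint on a target field (segment is a
    # hypothetical column with an assumed domain of HIGH/LOW).
    bad = conn.execute(
        "SELECT COUNT(*) FROM target_fact WHERE segment NOT IN ('HIGH', 'LOW')"
    ).fetchone()[0]
    assert bad == 0, f"{bad} rows violate the valid-value constraint on segment"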
ETL applications can be language based or tool based. If the
capability of the ETL tool is fully utilised, it reduces some of the testing
effort otherwise required in a language-based application.
4.1 Manual Testing Strategy for DWH-ETL
This section discusses the Manual System Testing and Regression
Testing strategies in detail.
Most of the processes mentioned below have been streamlined to make
the transition to Automation as smooth as possible.
For System Testing:
At a very high level, any DWH-ETL application has inputs which
undergo a transformation and result in an output. Hence, for system testing,
the general steps which need to be followed for success and coverage are:
1) Create scenarios based on the documents
2) Prepare test data based on the scenarios
3) Derive expected results based on the transformation and the test data
4) Run the application using the test data
5) Capture the actual output
6) Analyse the results after comparing the expected to the actual
Pre-requisites:
1) Business Requirement document
2) Functional Requirement document
3) Technical Requirement document
4) Source to Target Mapping document
5) Code diff
6) Traceability Matrix
How to come up with a Scenario Sheet?
Based on the Functional Requirement document, Technical Requirement
document and Source to Target Mapping document, scenarios should be created to
check the new functionalities going in. Scenarios should have both positive and
negative cases.
Typically, scenarios can be created using an Excel sheet, with one of
the worksheets listing them in plain English.
If it is a complex application where intermediate files are being
created, the application should be broken into sub-applications. A
sub-application is any set of jobs grouped together which has a physical
input and a physical output.
A Scenario sheet should be created at the sub-application level and
then merged into a final application-level Scenario sheet.
Each scenario should have the corresponding source file/table, source
field/column names and target file/table, target field/column names, even at
the sub-application level.
The values put in the above source fields/columns can be hypothetical
ones, but they should be unique and should have the correct corresponding
values (after transformation) put in the target fields/columns.
The above scenarios should be discussed with the Business Analyst,
Development Manager and Test Lead with respect to coverage, duplication and
relevancy.
After being agreed upon by all stakeholders, 2 new worksheets should
be created in the same Excel sheet: Input and Expected Output.
The Input worksheet should have real-world values, with field-level
details, which can be processed through the application.
The Expected Output worksheet should have the expected target values
produced if the Input worksheet values are passed through the application.
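As an illustration, here is a small Python sketch that fills the Expected Output worksheet from the Input worksheet. The scenarios.xlsx workbook, its column positions and the single transformation rule (amount rounded to 2 decimals) are all hypothetical, chosen only to show the mechanics.

    from openpyxl import load_workbook

    wb = load_workbook("scenarios.xlsx")
    inputs = wb["Input"]
    expected = wb["Expected Output"]

    # Walk the two worksheets in lockstep, skipping the header row.
    for in_row, exp_row in zip(inputs.iter_rows(min_row=2),
                               expected.iter_rows(min_row=2)):
        source_amount = in_row[1].value
        # The expected cell holds the value after the documented
        # transformation (illustrative rule: round to 2 decimals).
        exp_row[1].value = round(float(source_amount), 2)

    wb.save("scenarios.xlsx")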
How to create Test Data?
If the input used by the application is files, then the input files
used by the application should be mocked up based on the “Input” worksheet of
the Excel sheet and the file layout.
If the input used by the application comes from a table, then again,
based on the “Input” worksheet of the Excel sheet, the corresponding values
should be inserted into the columns of the concerned tables.
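A minimal sketch of both cases, reusing the hypothetical scenarios.xlsx from above and assuming a pipe-delimited file layout with three columns and a source_stage table:

    import csv
    import sqlite3
    from openpyxl import load_workbook

    # Pull the mocked-up values out of the "Input" worksheet.
    rows = [
        [cell.value for cell in row]
        for row in load_workbook("scenarios.xlsx")["Input"].iter_rows(min_row=2)
    ]

    # File-fed application: write a delimited input file matching the layout.
    with open("CUST_INPUT.dat", "w", newline="") as f:
        csv.writer(f, delimiter="|").writerows(rows)

    # Table-fed application: insert the same values into the source table
    # (the three-column layout is illustrative).
    conn = sqlite3.connect("etl_test.db")
    conn.executemany("INSERT INTO source_stage VALUES (?, ?, ?)", rows)
    conn.commit()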
How to ensure the mocked-up Test Data file gets picked up when the application runs?
At a very high level, there are 3 ways to ensure this, depending on
the application:
- Renaming the file
- Changing the Header and Trailer
- Getting the name from the Control DB and naming the file based on that.
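For the third option, a minimal sketch, assuming a hypothetical control.db with a file_control table that holds the next expected file name per feed:

    import os
    import sqlite3

    # Look up the name the application expects, then rename the mocked-up
    # file to match so it gets picked up on the next run.
    conn = sqlite3.connect("control.db")
    expected_name = conn.execute(
        "SELECT next_file_name FROM file_control WHERE feed = ?", ("CUSTOMER",)
    ).fetchone()[0]
    os.rename("CUST_INPUT.dat", expected_name)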
How to capture the Actual Output for Analysis?
The best way to do so is to import the data from the output file or
table into a new Excel sheet.
Excel has built-in functionality for importing delimited or
fixed-column-width flat text files. Also, most DB querying tools, like Toad,
SQL Navigator and SQL Developer, have the capability to save results in Excel.
Storing the output simplifies the comparison between expected and
actual, and also keeps the results in a readable format for future reference.
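A minimal sketch of this capture step, assuming the application writes a hypothetical pipe-delimited CUST_OUTPUT.dat file (pandas.read_fwf could be used instead for fixed-width layouts):

    import pandas as pd

    # Import the delimited output file and store it as a worksheet for
    # comparison and future reference.
    actual = pd.read_csv("CUST_OUTPUT.dat", delimiter="|")
    actual.to_excel("results.xlsx", sheet_name="Actual", index=False)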
How to analyse the results?
A field-to-field comparison should be done between the Expected and
the Actual. Any discrepancies should be highlighted. First, the scenario having
the difference should be double-checked against the expected value; once
satisfied, a defect should be raised for that scenario.
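A minimal sketch of the field-to-field comparison, assuming the Expected and Actual worksheets of a hypothetical results.xlsx share the same layout and row order:

    import pandas as pd

    expected = pd.read_excel("results.xlsx", sheet_name="Expected")
    actual = pd.read_excel("results.xlsx", sheet_name="Actual")

    # compare() returns only the cells that differ, ready to be
    # double-checked and raised as defects.
    differences = expected.compare(actual)
    print(differences if not differences.empty else "All fields match")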
For Regression Testing:
Regression testing of DWH-ETL applications generally seems a very
daunting and time-consuming task. Typically, due to resource/time constraints:
- Random sampling of unchanged, high-criticality functionalities is targeted.
- System testing of the randomly sampled functionality is done.
This leaves plenty of code going into production with a very low
confidence level:
- Most of the unchanged medium- and low-criticality functionality never gets tested, release after release.
- Sometimes, due to time constraints and the high number of critical functionalities in the application, some of the critical functions never get verified.
Regression testing is done to ensure that the new code did not affect
anything other than what it was supposed to affect.
This purpose can be achieved by taking a snapshot of the application
before the code gets implemented and a snapshot after the code gets
implemented. A difference of the two snapshots will show what got changed.
Anything which was not supposed to change will also be highlighted.
The important thing to remember is that the above can only work if the
input remains constant. Thus, for regression testing, the test file should
remain constant.
There are 2 ways in which a Regression Input file can be created:
1) By taking a sample of a production input file and desensitising it.
2) By incrementally adding the System Test input data file to a Regression Test file with each release (a small sketch of this option follows below). Progressively, with each release, the Regression Test file will get better in its scope and coverage.
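A minimal sketch of the second, incremental option, assuming hypothetical regression_input.dat (the cumulative file, already created) and system_test_input.dat files:

    # Append this release's system test input to the cumulative
    # regression file, skipping records already present.
    with open("regression_input.dat") as f:
        existing = set(f.read().splitlines())

    with open("system_test_input.dat") as f:
        new_rows = [line for line in f.read().splitlines() if line not in existing]

    with open("regression_input.dat", "a") as f:
        f.writelines(row + "\n" for row in new_rows)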
Once the above is achieved, to regression test a release, the same
file should be run through the application before and after the code is
implemented. The output should be captured for the before and after runs and
stored, preferably in an Excel sheet, in different worksheets.
A simple comparison will highlight the differences between the 2
worksheets. There will be differences either because:
1) It is supposed to be different - the new change for this release, or
2) The new code broke something it was not supposed to affect.
Segregating the type-2 differences from the type-1 differences, and
then analysing the type-2s, will pinpoint what the new code broke.
This is typically what any regression testing aims for.
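A minimal sketch of this before/after segregation, assuming the two snapshots sit in a hypothetical regression.xlsx and that the release was only supposed to change a discount_rate field:

    import pandas as pd

    before = pd.read_excel("regression.xlsx", sheet_name="Before")
    after = pd.read_excel("regression.xlsx", sheet_name="After")

    # Diff the snapshots; compare() reports only the cells that changed.
    diff = before.compare(after)
    changed_fields = {col for col, _ in diff.columns}

    # Fields the release is supposed to change (hypothetical example).
    PLANNED_CHANGES = {"discount_rate"}
    unexpected = changed_fields - PLANNED_CHANGES
    if unexpected:
        print("New code broke unchanged functionality:", sorted(unexpected))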