Testing Big Data: A Blueprint

In my current role, I’m looking after a team that deals with Big Data. This is a new area for me, and when I tried to do some research on testing in this discipline, it became apparent that I’m not the only one finding my feet. This is a brave new world, and it seems that, beyond the old data testing standard of getting your test data set in order, no-one has really figured out how to do this properly, or offered any kind of industry standard for testing in this area.

So, I challenged myself to have a closer look at Big Data, and to put together a blueprint for how to test it. None of this is gospel, but it’s how I intend to get my teams to start thinking about what they’re doing, and the approach I’d like to see them taking.

Best Practices

Data Quality: First of all, the tester should establish the data quality requirements for the different forms of data involved (e.g. traditional data sources, data from social media, data from sensors, etc.). If that’s done properly, the transformation logic can be tested in isolation by executing tests against representative data sets for each of those sources.
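
As a rough illustration only, those quality expectations could be captured as executable rules per source type, so the same checks run against whichever source feeds the transformation under test. The source categories, field names and thresholds below are assumptions for the sketch, not a standard.

    # Illustrative only: quality rules per source type, so the transformation logic
    # can be tested against any source that satisfies its declared rules.
    QUALITY_RULES = {
        "rdbms":        {"required_fields": ["customer_id", "amount", "created_at"], "max_null_ratio": 0.0},
        "social_media": {"required_fields": ["user", "text", "timestamp"], "max_null_ratio": 0.05},
        "sensor":       {"required_fields": ["device_id", "reading", "timestamp"], "max_null_ratio": 0.01},
    }

    def check_quality(records, source_type):
        """Return a list of rule violations for a batch of dict records."""
        rules = QUALITY_RULES[source_type]
        violations = []
        for field in rules["required_fields"]:
            nulls = sum(1 for r in records if r.get(field) in (None, ""))
            if records and nulls / len(records) > rules["max_null_ratio"]:
                violations.append(f"{field}: null ratio {nulls / len(records):.2%} exceeds limit")
        return violations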

Data Sampling / Risk Based Testing: Data sampling becomes hugely important in Big Data implementation, and it’s the tester’s job to identify suitable sampling techniques, and to establish appropriate levels of risk based testing to include all critical business scenarios and the right test data set(s). Whether this is done with handcrafted data, or a sample of production data is down to the circumstances you’re working in, but do think carefully about security / confidentiality constraints if using real data.
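
Here’s one possible sketch of risk-weighted sampling with a simple masking step, assuming a Python environment and illustrative field names (amount, customer_id); the risk rule would come from your own business scenarios.

    import hashlib
    import random

    def mask(value):
        """Replace a sensitive value with a stable, non-reversible token."""
        return hashlib.sha256(str(value).encode()).hexdigest()[:12]

    def risk_of(row):
        # Assumed business rule purely for the sketch: high-value rows carry more risk.
        return 10 if row.get("amount", 0) > 10_000 else 1

    def sample_rows(rows, k, seed=42):
        """Draw a risk-weighted sample and mask the assumed sensitive field."""
        if not rows:
            return []
        random.seed(seed)
        weights = [risk_of(r) for r in rows]
        picked = random.choices(rows, weights=weights, k=min(k, len(rows)))
        return [{**r, "customer_id": mask(r.get("customer_id"))} for r in picked]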

Automation: Automate the test suites as much as possible. Big Data regression tests must be run regularly, because the underlying database will be updated regularly. Automated regression suites should be created with a view to being run after each iteration.

Parallel test execution: Hopefully, this is obvious – the volume of data being checked will probably require parallel execution. But here’s a talking point – if data sampling is good enough, is it even necessary?

Pairing with developers: This is vital for understanding the system under test. Testers will require knowledge on par with the developers about Hadoop / HDFS (Hadoop Distributed File System) / Hive.

Make things simpler: If possible, the data warehouse should be organised into smaller units that are easier to test. This will offer improved test coverage, and optimisation of the test data set.

Normalise design and tests: Effective generation of normalised test data can be achieved by normalising the dynamic schemas at the design level.
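
For example, a minimal sketch of normalising dynamically-shaped records into a single flat column set might look like this (plain Python, with nested dicts assumed as the input shape):

    def flatten(record, parent_key="", sep="."):
        """Flatten nested dicts into dotted column names, e.g. user.address.city."""
        items = {}
        for key, value in record.items():
            new_key = f"{parent_key}{sep}{key}" if parent_key else key
            if isinstance(value, dict):
                items.update(flatten(value, new_key, sep))
            else:
                items[new_key] = value
        return items

    def normalise(records):
        """Give every record the same column set, filling gaps with None."""
        flat = [flatten(r) for r in records]
        columns = sorted({c for r in flat for c in r})
        return [{c: r.get(c) for c in columns} for r in flat]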

Test Design
One thing I’ve seen repeated in many places is the mantra that test design should centre around measurement of the four Vs of data – Variety, Velocity, Volume and Veracity.

Variety:
Different forms of data. The variety of data types is increasing, as we must now consider structured data, unstructured text-based data, and semi-structured data like social media data, location-based data, log-file data etc. They break down as follows:

  • Structured Data comes in a defined format from RDBMS tables or structured files. Transactional data can be handled in files or tables for validation purposes.
  • Semi-structured Data does not have any defined format, but structure can be determined based on data patterns – for example, data scraped from other websites for analysis purposes. For validation, the data needs to be transformed into a structured format using custom-built scripts: first the patterns are identified, then copy books or pattern outlines are prepared, then the copy books are used in scripts to convert the incoming data into a structured format, and finally validations are performed using comparison tools (see the sketch after this list).
  • Unstructured Data is data that does not have any format and is stored in documents or web content, etc., so testing it can be complex and time consuming. A level of automation could be achieved by converting the unstructured data into structured data using PIG scripting or something similar – but the overall coverage of automation will be affected by any unexpected behaviour in the data, because the input data can be in any form and could potentially change every time a new test is performed. So a business scenario validation strategy should be employed for unstructured data: identify the different scenarios that could occur in data analysis, and create handcrafted test data based on those scenarios.
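
As a rough sketch of the copy book idea above, a named pattern can turn raw semi-structured lines into structured rows that a comparison tool (or a plain diff) can then validate; the same approach applies to unstructured data once it has been converted. The pattern and field names here are illustrative assumptions.

    import re

    # Hypothetical "copy book": a named pattern describing one line of scraped data.
    COPY_BOOK = re.compile(
        r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+"
        r"(?P<level>[A-Z]+)\s+"
        r"(?P<user>\S+)\s+"
        r"(?P<message>.*)"
    )

    def to_structured(raw_lines):
        """Convert raw lines into dict rows; unparseable lines are kept for review."""
        rows, rejects = [], []
        for line in raw_lines:
            match = COPY_BOOK.match(line)
            if match:
                rows.append(match.groupdict())
            else:
                rejects.append(line)
        return rows, rejects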

Velocity:
The speed at which new data is created. Speed – and the need for real-time analytics to derive business value from it – is increasing thanks to the digitisation of transactions, mobile computing and the sheer number of internet and mobile device users. Data speed needs to be considered when implementing any Big Data appliance to overcome performance problems. Performance testing plays an important role in the identification of any performance bottlenecks in the system, and in ensuring the system can handle high velocity streaming data.
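
Purely as an illustration, a throughput check against an assumed target rate could look like the sketch below; ingest_batch() is a hypothetical stand-in for whatever pushes events into the pipeline under test.

    import time

    def measure_throughput(events, ingest_batch, batch_size=1000):
        """Return events per second achieved while pushing all events through ingest_batch."""
        start = time.monotonic()
        for i in range(0, len(events), batch_size):
            ingest_batch(events[i:i + batch_size])
        elapsed = time.monotonic() - start
        return len(events) / elapsed if elapsed else float("inf")

    # Example assertion against an assumed target of 50,000 events per second:
    # assert measure_throughput(test_events, ingest_batch) >= 50_000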

Volume:
Scale of data. Comparison scripts must be run in parallel across multiple nodes. As data in HDFS is stored as files, scripts can be written to compare two files and extract the differences. Data is converted into the expected result format, then compared against the actual data using comparison tools. This approach requires an up-front time investment in scripting, but it is faster overall and reduces the regression testing time required. When there isn’t time to validate the complete data set, risk-based sampling should be used for validation. Depending on your circumstances, there could potentially be a case for building tools for E2E testing across the cluster.
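
A minimal sketch of the parallel comparison idea, assuming the expected and actual outputs are available as local file copies (in practice they might be pulled out of HDFS first):

    from concurrent.futures import ProcessPoolExecutor

    def diff_files(pair):
        """Return the lines that appear in only one of the two files."""
        expected_path, actual_path = pair
        with open(expected_path) as e, open(actual_path) as a:
            expected, actual = set(e), set(a)
        return expected_path, expected ^ actual

    def compare_all(pairs, workers=8):
        """Compare many (expected, actual) file pairs in parallel."""
        with ProcessPoolExecutor(max_workers=workers) as pool:
            return {path: diff for path, diff in pool.map(diff_files, pairs) if diff}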

Veracity:
Accuracy of data. This is the assurance that the final data provided to EDW has been processed correctly and matches the original data file, regardless of its type. Accuracy of subsequent analysis is dependent on the veracity of data. This also means ensuring that “data preparation” processes such as removing duplicates, fixing partial entries, eliminating null / blank entries, concatenating data, collapsing columns or splitting columns, aggregating results into buckets etc. are not onerous manual tasks.
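
As an illustrative sketch, the duplicate and blank-entry checks mentioned above can be automated along these lines; record_id is an assumed key field.

    from collections import Counter

    def preparation_report(rows, key_field="record_id"):
        """Flag duplicate keys and rows containing blank or null values."""
        keys = [r.get(key_field) for r in rows]
        duplicates = [k for k, n in Counter(keys).items() if n > 1]
        rows_with_blanks = [i for i, r in enumerate(rows)
                            if any(v in (None, "") for v in r.values())]
        return {"duplicate_keys": duplicates, "rows_with_blanks": rows_with_blanks}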

Getting these right allows us to use the data to offer two more Vs – Visibility and Value.

Potential Issues:
Test planning and design:
Existing automated scripts generally cannot be scaled to test Big Data. Trying to scale up test data sets without proper planning and design will lead to delayed response times, time outs etc. during test execution. However, performing action-based testing (ABT), treating the tests as actions driven by keywords and appropriate parameters in a test module, will help mitigate this issue.
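
A bare-bones sketch of that keyword-driven shape, with hypothetical action names and placeholder implementations:

    ACTIONS = {}

    def action(name):
        """Register a function as the implementation of a test keyword."""
        def register(fn):
            ACTIONS[name] = fn
            return fn
        return register

    @action("load_file")
    def load_file(path, target):
        print(f"loading {path} into {target}")          # placeholder for the real loader

    @action("check_row_count")
    def check_row_count(table, expected):
        print(f"checking {table} has {expected} rows")  # placeholder for the real check

    def run_module(test_module):
        """A test module is just rows of (keyword, parameters), e.g. from a spreadsheet."""
        for keyword, params in test_module:
            ACTIONS[keyword](**params)

    run_module([
        ("load_file", {"path": "/data/in.csv", "target": "staging"}),
        ("check_row_count", {"table": "staging", "expected": 1000}),
    ])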

When To Test

Testing should be performed at each of the three phases of Big Data processing to ensure that data is getting processed without any errors.

Functional Testing should include:

  • Validation of pre-Hadoop processing
  • Validation of Hadoop Map Reduce process data output
  • Validation of data extract, and load into EDW

Apart from these functional validations, non-functional testing including performance testing and failover testing should be performed.

Validation of Pre-Hadoop Processing
Data from various sources like weblogs, social network sites, call logs, transactional data etc., is extracted based on the requirements and loaded into HDFS before further processing.

Validations:
1. Comparing the input data file against source system data to ensure the data is extracted correctly (see the sketch after this list)

2. Validating the data requirements and ensuring the right data is extracted

3. Validating that the files are loaded into HDFS correctly

4. Validating that the input files are split, moved and replicated across different data nodes.
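
A minimal sketch of validation 1, reconciling row counts and a content hash between the source extract and the copy landed in HDFS; reading the HDFS copy from a local path is an assumption (it could equally be streamed out with hdfs dfs -cat):

    import hashlib

    def file_fingerprint(path):
        """Row count plus a running hash of the file's contents."""
        count, digest = 0, hashlib.sha256()
        with open(path, "rb") as f:
            for line in f:
                count += 1
                digest.update(line)
        return count, digest.hexdigest()

    def validate_landing(source_path, hdfs_copy_path):
        source = file_fingerprint(source_path)
        landed = file_fingerprint(hdfs_copy_path)
        assert source[0] == landed[0], "row counts differ between source and HDFS"
        assert source[1] == landed[1], "content differs between source and HDFS"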

Potential Issues:
Incorrect data captured from source systems
Incorrect storage of data
Incomplete or incorrect replication

Validation of Hadoop Map Reduce Process
Once the data is loaded into HDFS, the Hadoop map-reduce process is run to process the data coming from different sources.

Validations:
1. Validating that data processing is completed and output file is generated

2. Validating the business logic on a standalone node, and then validating it after running against the test cluster (a local sketch follows this list)

3. Validating the map reduce process to verify that key value pairs are generated correctly

4. Validating the aggregation and consolidation of data after reduce process

5. Validating the output data against the source files and ensuring the data processing is completed correctly

6. Validating the output data file format and ensuring that the format is per the requirement
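
As a sketch of running the business logic standalone first, the map and reduce functions can be exercised locally before the same job goes to the test cluster. A word-count style job is assumed here purely for illustration; the real mapper and reducer would be the code under test.

    from itertools import groupby
    from operator import itemgetter

    def mapper(line):
        for word in line.split():
            yield word.lower(), 1                  # key/value pairs out of the map step

    def reducer(key, values):
        return key, sum(values)                    # aggregation in the reduce step

    def run_local(lines):
        """Simulate the shuffle/sort between map and reduce on a single node."""
        pairs = sorted(kv for line in lines for kv in mapper(line))
        return dict(reducer(k, (v for _, v in group))
                    for k, group in groupby(pairs, key=itemgetter(0)))

    assert run_local(["big data", "Big Data testing"]) == {"big": 2, "data": 2, "testing": 1}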

Potential Issues:
Coding issues in map-reduce jobs
Jobs working correctly when run on a standalone node, but not on multiple nodes
Incorrect aggregations
Node configurations
Incorrect output format

Validation of Data Extract, and Load into EDW
Once the map-reduce process is complete and the output data files have been generated, the processed data is moved to the enterprise data warehouse or to other transactional systems, depending on the requirement.

Validations:
1. Validating that transformation rules are applied correctly

2. Validating that there is no data corruption by comparing target table data against the HDFS file data (see the sketch after this list)

3. Validating the data load in target system

4. Validating the aggregation of data

5. Validating the data integrity in the target system
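
A sketch of validation 2, comparing the HDFS output against the target table. sqlite3 stands in for the warehouse purely so the example is self-contained; in practice the connection would use the EDW’s own driver.

    import csv
    import sqlite3

    def rows_from_hdfs_export(path):
        with open(path, newline="") as f:
            return {tuple(row) for row in csv.reader(f)}

    def rows_from_warehouse(conn, table):
        # e.g. conn = sqlite3.connect("edw_stand_in.db") in this self-contained sketch
        cur = conn.execute(f"SELECT * FROM {table}")
        return {tuple(str(col) for col in row) for row in cur.fetchall()}

    def reconcile(hdfs_export_path, conn, table):
        hdfs_rows = rows_from_hdfs_export(hdfs_export_path)
        edw_rows = rows_from_warehouse(conn, table)
        return {"missing_in_edw": hdfs_rows - edw_rows,
                "unexpected_in_edw": edw_rows - hdfs_rows}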

Potential Issues:
Incorrectly applied transformation rules
Incorrect load of HDFS files into EDW
Incomplete data extract from Hadoop HDFS

Validation of Reports
Analytical reports are generated using reporting tools by fetching the data from EDW or running queries on Hive.

Validations:
1. Reports Validation: Reports are tested after the ETL/transformation workflows have been executed for all the source systems and the data has been loaded into the DW tables. The metadata layer of the reporting tool provides an intuitive business view of the data available for report authoring. Checks are performed by writing queries to verify whether the views are getting the exact data needed for the generation of the reports (a query-comparison sketch follows this list).

2. Cube Testing: Cubes are tested to verify that dimension hierarchies with pre-aggregated values are calculated correctly and displayed in the report.

3. Dashboard Testing: Dashboard testing consists of testing the individual web parts and reports placed in a dashboard. Testing involves ensuring all objects are rendered properly and the resources on the page are up to date. The data fetched from the various web parts is validated against the databases.
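
As a sketch of the query checks in point 1, a figure surfaced by a report can be reconciled against an independently written query on the underlying tables; the table names, SQL and sqlite3 connection are illustrative stand-ins for the EDW or Hive layer.

    import sqlite3

    def scalar(conn, sql):
        return conn.execute(sql).fetchone()[0]

    def check_report_total(conn):
        """The figure on the report should match an independent query on the base tables."""
        # conn = sqlite3.connect("reporting_stand_in.db")  # stand-in for the EDW / Hive connection
        report_total = scalar(conn, "SELECT SUM(amount) FROM report_sales_view")
        source_total = scalar(conn, "SELECT SUM(amount) FROM sales WHERE status = 'complete'")
        assert report_total == source_total, (
            f"report shows {report_total}, source data says {source_total}")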

Potential Issues:
Report definition not set as per the requirement
Report data issues
Layout and format issues