Wednesday, April 13, 2016

Simplify Big Data Testing through a Spark Library




Testing Big Data pipelines is becoming increasingly complex. This complexity has two main factors: maintaining the test setup, and defining or deriving the expected results.

Maintaining Setup
The Hadoop ecosystem is growing rapidly, and different teams use whichever components suit their needs. This increases the number of ecosystem components that must be maintained in the test setup: correct versions, required directories with correct ownership, local and HDFS users, and the status of services. It is estimated that 60-70% of the test cycle is spent on deployments and configuration.

Defining or deriving Expected Results
After setup, the biggest challenge is figuring out a reference against which the output can be compared. This reference is known as the expected result. There are different ways to obtain it; we will discuss three major variants.

  1. Golden Dataset
A predefined dataset, mostly handcrafted, whose expected result is derived by going through the data manually.
  2. Production Dataset
A subset of data copied from the production system. The expected result is derived by running hand-written SQL, HQL, or Pig scripts (a small sketch of this follows the list).
  3. Generated Dataset
Data generated with a variety of tools/programs using a predefined set of rules. The expected result is derived by running SQL, HQL, or Pig scripts produced by the tool/program.
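
For the Production and Generated Dataset variants, the expected result comes from running queries against a copy of the input data. Below is a minimal sketch of that idea using an in-memory HSQLDB instance; the table, rows, and query are illustrative assumptions, not part of any existing framework.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class ExpectedResultSketch {
    public static void main(String[] args) throws SQLException {
        // In-memory HSQLDB instance holding a small copy of the test dataset.
        try (Connection conn = DriverManager.getConnection("jdbc:hsqldb:mem:testdb", "SA", "");
             Statement st = conn.createStatement()) {
            st.execute("CREATE TABLE clicks (user_id INT, page VARCHAR(100))");
            st.execute("INSERT INTO clicks VALUES (1, 'home')");
            st.execute("INSERT INTO clicks VALUES (1, 'cart')");
            st.execute("INSERT INTO clicks VALUES (2, 'home')");

            // Hand-written query standing in for the pipeline's logic;
            // its output is recorded as the expected result.
            try (ResultSet rs = st.executeQuery(
                    "SELECT user_id, COUNT(*) AS events FROM clicks GROUP BY user_id")) {
                while (rs.next()) {
                    System.out.println(rs.getInt("user_id") + " -> " + rs.getLong("events"));
                }
            }
        }
    }
}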

Proposal
The proposed system will address these complexities by providing several REST services:

  1. Infrastructure Service
This service provides APIs to deploy a given set of machines with different components, configure them, and validate the deployments. Deploying, configuring, and validating are three independent services, so each can be used on its own depending on the kind of setup one has.
  2. Data Service
This service provides APIs to generate data and ingest it into HDFS according to given rules. Generating and ingesting data are two independent services, so each can be used on its own as required (a sample REST call follows this list).
  3. Execution Service
This service provides APIs to run pipelines through various executors and to monitor them. Again, execution and monitoring are independent. This service also provides APIs to retrieve performance-related metrics.
  4. Validation Service
This service provides APIs to invoke various validators based on the dataset.
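
Because each capability is exposed over REST, a test can drive it with a plain HTTP call. The sketch below shows what invoking the Data Service might look like; the host name, endpoint path, and JSON fields are assumptions for illustration, not the framework's actual API.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class DataServiceClient {
    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint and payload; the real paths and fields
        // would be defined by the framework's Data Service.
        URL url = new URL("http://dataqa-framework.example.com/api/v1/data/generate");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);

        String payload = "{\"rows\": 100000, \"rules\": \"clickstream.ben.xml\", "
                + "\"target\": \"hdfs:///data/test/clicks\"}";
        try (OutputStream os = conn.getOutputStream()) {
            os.write(payload.getBytes(StandardCharsets.UTF_8));
        }

        System.out.println("Data Service responded with HTTP " + conn.getResponseCode());
    }
}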



Each service will use utility methods provided in a separate repository.

Utilities
The following utilities will be provided as part of this framework.

  1. SSH Utils
These help connect to a remote machine and execute commands on it. They also provide capabilities to transfer files to and from the remote machine (a rough sketch of what these wrap appears after this list).
  2. Benerator Utils
These help generate data using the Benerator tool, create schemas for HSQLDB, and derive expected results.
  3. Hadoop Utils
These help execute Hadoop commands and copy files to and from HDFS.
  4. String Utils
These help with all kinds of string operations.
  5. JSON Utils
These help with JSON-related complexities such as getting the value of a given element in a complex JSON document and creating JSON from objects.
  6. Database Utils
These help maintain a connection pool, connect to any database through a JDBC connector, execute statements, and retrieve results.
  7. LZO Utils
These help compress and uncompress files using lzop.
  8. Pig Utils
These help execute Pig scripts and monitor their execution.
  9. Falcon Utils
These help create clusters, submit feeds, submit processes, and monitor them.
  10. Storm Utils
These help submit Storm topologies and monitor them.
  11. Kafka Utils
These help start, stop, and restart producers, consumers, and Kafka servers.
  12. Ambari Utils
These help deploy through Ambari blueprints, start/stop/restart and check the status of services and components, configure services and components, and sync configs from another cluster.
  13. REST Utils
These help create GET/POST/PUT REST requests, submit them, and get results as JSON.
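
As a rough illustration of what SSH Utils and Hadoop Utils would hide behind a single method call, the sketch below runs an HDFS command on a remote gateway using the JSch library; the host, user, key path, and HDFS path are placeholders, and the real utility wrappers would expose simpler signatures.

import com.jcraft.jsch.ChannelExec;
import com.jcraft.jsch.JSch;
import com.jcraft.jsch.Session;
import java.io.ByteArrayOutputStream;

public class SshCommandSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder credentials and host for whatever the test setup uses.
        JSch jsch = new JSch();
        jsch.addIdentity("/home/testuser/.ssh/id_rsa");
        Session session = jsch.getSession("testuser", "gateway.example.com", 22);
        session.setConfig("StrictHostKeyChecking", "no");
        session.connect();

        // Run a Hadoop command on the remote gateway and capture its output.
        ChannelExec channel = (ChannelExec) session.openChannel("exec");
        channel.setCommand("hdfs dfs -ls /data/test");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        channel.setOutputStream(out);
        channel.connect();
        while (!channel.isClosed()) {
            Thread.sleep(100);
        }
        channel.disconnect();
        session.disconnect();

        System.out.println(out.toString("UTF-8"));
    }
}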

All utility methods will be documented with Javadoc.

This framework will be provided as a hosted service by the Data QA team. However, the infrastructure that runs the pipelines will not be hosted; it should be registered with this framework when calling the REST APIs.

Advantages

  1. The existing framework works only with in-memory datasets due to tight coupling with Benerator. With the new design, one can use any dataset and any validation.
  2. In the current system there is no way to test with a production dataset or a golden dataset. With the new design, one can use any dataset.
  3. The current system cannot be used for performance or stability testing because it works with in-memory datasets. With the new design, one can pump in large datasets and measure the performance, stability, or reliability of the platform.
  4. During a test cycle, reusing an operation available in the framework currently requires gathering a lot of information to segregate that code and use it. The new design takes care of this through the introduction of REST APIs.
  5. As this framework will be a hosted service, adoption will be higher and resistance lower.


Implementation
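
One possible shape for a service endpoint, sketched here with JAX-RS; the annotations, paths, and class names are assumptions for illustration and not the framework's actual implementation.

import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;

// Illustrative resource only: the real Validation Service may use a
// different stack and different paths.
@Path("/api/v1/validate")
public class ValidationResource {

    @GET
    @Path("/{runId}")
    @Produces(MediaType.APPLICATION_JSON)
    public String validate(@PathParam("runId") String runId) {
        // Look up the pipeline run, compare actual output against the
        // expected result, and report the outcome as JSON.
        boolean passed = compareWithExpected(runId);
        return "{\"runId\": \"" + runId + "\", \"passed\": " + passed + "}";
    }

    private boolean compareWithExpected(String runId) {
        // Placeholder: a real implementation would invoke the validators
        // appropriate for the dataset behind this run.
        return true;
    }
}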



Execution
mvn exec:java -Dexec.mainClass=<Fully_Qualified_Class_Name_Main_Method>
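
For example, assuming the test driver class were named com.dataqa.framework.TestRunner (an illustrative name, not the framework's actual class), the entry point and its invocation would look like:

package com.dataqa.framework;

public class TestRunner {
    public static void main(String[] args) {
        // Entry point launched by the mvn exec:java command above;
        // a real runner would call the framework's REST services from here.
        System.out.println("Starting pipeline test run...");
    }
}

mvn exec:java -Dexec.mainClass=com.dataqa.framework.TestRunner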
~Yagna
