There are several ways to test Big Data pipelines.
1. Golden Data Set
In this approach, a data set is created either by hand or by copying it from production. The expected output is determined manually, by working through the pipeline logic in one's head. Once the expected output is determined, the data set together with its expected output is called the golden set.
This is a good approach to begin with. But as more columns are added (a regular activity in data teams) and those columns have relations with existing ones, maintaining this data becomes really tedious. Also, because the expected output has to be determined by running the logic manually, this approach does not scale to larger data sets or to complex logic where many input columns determine a single output column.
In that case, one solution is to write a parallel implementation of the dev code in a different technology stack to determine the expected output. The biggest disadvantages of this approach are that QA needs knowledge of the alternative stack (a steep learning curve) and that QA can make the same mistakes the developers made while building the pipeline.
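The golden-set check itself can be sketched as a comparison between the pipeline's actual output and the stored expected output. Below is a minimal Python sketch; the column names, key columns, and row format are illustrative assumptions, not part of any real pipeline:

```python
import csv
from io import StringIO


def load_rows(csv_text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(StringIO(csv_text)))


def diff_against_golden(actual_rows, golden_rows, key_columns):
    """Compare pipeline output rows to the golden expected rows,
    matching rows on the given key columns. Returns three lists of
    row keys: missing (in golden but not actual), unexpected
    (in actual but not golden), and mismatched (present in both
    but with differing values)."""
    def index(rows):
        return {tuple(r[k] for k in key_columns): r for r in rows}

    actual, golden = index(actual_rows), index(golden_rows)
    missing = sorted(set(golden) - set(actual))
    unexpected = sorted(set(actual) - set(golden))
    mismatched = sorted(
        k for k in set(golden) & set(actual) if golden[k] != actual[k]
    )
    return missing, unexpected, mismatched
```

A test run then simply asserts that all three lists are empty; any non-empty list pinpoints exactly which rows drifted from the golden set.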
2. Controlled Data Generation Using Contracts (Based on Use-Case Testing)
In this approach, every input logline and every output logline/table is defined either as a POJO (Plain Old Java Object) or reused from existing contracts such as Thrift definitions. A test case writer defines the input columns and values, and the expected output columns and values, in CSV form using the logline definitions.
The generator pads all remaining columns with valid values taken from the logline definitions. The biggest benefit is the extensibility of the same test cases even if 100 new columns are added. Because the test case writer handles one column, or one relation spanning multiple columns, at a time, only the test cases whose columns are affected by the new additions need to change.
In the normal case, all old test cases remain unchanged: new test cases are added for the new columns, and the definitions are updated with the new columns. This also works for situations involving optional columns.
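The padding step described above can be sketched as follows. This is a minimal Python sketch under assumed column names; in a real pipeline the definition would be derived from the POJO or Thrift contract rather than hand-written:

```python
# Toy logline definition: column name -> a valid default value.
# In practice this would be generated from a POJO or Thrift contract.
CLICK_LOG_DEFINITION = {
    "user_id": "0",
    "page": "home",
    "country": "US",
    "timestamp": "0",
}


def pad_from_definition(partial_row, definition):
    """Build a full logline: keep the columns the test case writer
    specified, and pad every other column with a valid default value
    from the logline definition."""
    unknown = set(partial_row) - set(definition)
    if unknown:
        raise KeyError(f"columns not in definition: {sorted(unknown)}")
    row = dict(definition)      # start from valid defaults
    row.update(partial_row)     # overlay the test-case values
    return row
```

This is where the extensibility comes from: when a new column is added, only the definition changes, and every existing test case keeps working because the generator pads the new column automatically.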