The importance of realistic data in tests

There are many kinds of tests that we can run on Business Intelligence (BI) systems, or on any system that uses uncontrolled data: extract, transform, load (ETL) tests, query tests, performance tests, and so on. In this article, I will concentrate on the data itself, and more specifically on using realistic data in the tests.

Testing should be done under conditions as close as possible to those in which actual users will use the system in production. One of the keys to success here is the data used during the testing process.

Some applications use only data that they produce themselves, like alarm clocks. Other apps use only predefined data, like weather apps. Those cases are relatively straightforward. However, when your application or system uses many types of data, including various external data and sometimes unstructured data, as big data systems do, the data might be corrupted or unexpected (in other words, the code can’t handle it). Unexpected data is not the only cause of data integrity issues, though. The processing of items that are supposed to be supported can also malfunction, for example when the system fails to process a certain picture format or a variation of it.

Another kind of malfunction concerns the BI system's ability to filter or query correctly. For example, querying for items up to 2k and also getting items of 2.1k. You can think of such a test, but you can't think of all the tests. Running the tests you did think of against a large amount of data increases the chance of finding more issues.
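As a rough illustration of the "up to 2k" example, here is a minimal Python sketch. Everything in it is an assumption made for the illustration: `buggy_filter_up_to` is a hypothetical stand-in for the system's filter (with a deliberate bug), and the items are plain dictionaries with a `size` field. The point is the shape of the check: a lot of generated data, a few values placed right around the boundary, and an independently computed expectation to compare against.

```python
import random

def buggy_filter_up_to(items, limit):
    # Hypothetical stand-in for the system under test, with a deliberate bug:
    # it lets through items up to 5% above the limit.
    return [item for item in items if item["size"] <= limit + limit // 20]

def check_filter(filter_fn, items, limit):
    # Compute the expected result independently and compare the two by item id.
    expected = {item["id"] for item in items if item["size"] <= limit}
    actual = {item["id"] for item in filter_fn(items, limit)}
    return {"unexpected": actual - expected, "missing": expected - actual}

# A large amount of generated data (fixed seed so the run is repeatable).
rng = random.Random(42)
items = [{"id": i, "size": rng.randint(0, 5000)} for i in range(10_000)]

# Make sure the values around the boundary are present as well.
boundary_sizes = (1999, 2000, 2001, 2100)
items += [{"id": 10_000 + k, "size": size} for k, size in enumerate(boundary_sizes)]

report = check_filter(buggy_filter_up_to, items, 2000)
print(len(report["unexpected"]), "items above the limit leaked into the result")
```

With a single item of size 1,500 in the database, the same filter would pass the test without anyone noticing the bug.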

The number of possibilities is unlimited, and so is the number of data types and tests. This carries many risks, from data loss or data that is not processed correctly, to system downtime.

Regarding the last point, it is also the programmer's responsibility to handle unexpected data in the code.

So how can we as test engineers reduce the risks?

Always use as much data as you can in your tests

Fill the database with items similar to the ones you are testing, alongside all other supported and unsupported data. For example, if you test emails, make sure you have a lot of emails in the system: in different languages, of different lengths, with and without attachments, with different attachment sizes, etc. If you don’t have enough data, try to write test code that produces a large amount of data whose content you control.
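Test code for producing such data can be quite small. The following Python sketch is only one possible shape for it: the field names, the sample subjects, and the size choices are all assumptions made for the illustration, and a real generator would produce whatever format your system ingests.

```python
import random
import string

# Illustrative sample subjects in a few languages (an assumption, extend as needed).
SAMPLE_SUBJECTS = {
    "en": "Quarterly report",
    "de": "Quartalsbericht",
    "he": "דוח רבעוני",
    "ja": "四半期報告書",
}

def generate_emails(count, seed=0):
    """Produce `count` email-like records with varied language, length and attachments."""
    rng = random.Random(seed)  # fixed seed: the same data can be regenerated later
    emails = []
    for i in range(count):
        lang = rng.choice(sorted(SAMPLE_SUBJECTS))
        # Include edge cases: empty body, one character, and a very long body.
        body_length = rng.choice([0, 1, 50, 1_000, 10_000])
        body = "".join(rng.choices(string.ascii_letters + " ", k=body_length))
        attachment_count = rng.choice([0, 0, 1, 3])
        attachments = [
            {"name": f"file_{i}_{n}.bin", "size_bytes": rng.choice([0, 1_024, 5_000_000])}
            for n in range(attachment_count)
        ]
        emails.append({
            "id": i,
            "lang": lang,
            "subject": SAMPLE_SUBJECTS[lang],
            "body": body,
            "attachments": attachments,
        })
    return emails

emails = generate_emails(1_000, seed=7)
print(len(emails), "emails generated")
```

Because the generator is seeded, a failure found with this data can be reproduced by generating exactly the same set again.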

Without enough varied data, a test can pass for the wrong reasons. Suppose the test says: apply filter X and make sure you see item A, a document with the title “I am a document”. If the only document in the database is item A, you might miss a bug that retrieves all documents that start with “I”, or that lets other types of data into the result (a picture named “I am a document”, for example).
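The “I am a document” case can be shown in a few lines of Python. The filter below is hypothetical and deliberately buggy (it matches on the first letter only and ignores the item type), and the item layout is an assumption made for the illustration.

```python
def find_documents(items, title):
    # Hypothetical buggy filter: matches the first letter only
    # and does not check that the item is a document.
    return [item for item in items if item["title"].startswith(title[0])]

target = {"id": "A", "type": "document", "title": "I am a document"}
decoys = [
    {"id": "B", "type": "document", "title": "I am another document"},
    {"id": "C", "type": "picture", "title": "I am a document"},
    {"id": "D", "type": "document", "title": "Unrelated"},
]

# Database with item A only: the result looks correct.
sparse_ids = [item["id"] for item in find_documents([target], "I am a document")]

# Database with similar items around A: the bug shows up.
rich_ids = [item["id"] for item in find_documents([target] + decoys, "I am a document")]

print(sparse_ids)  # ['A']
print(rich_ids)    # ['A', 'B', 'C']
```

The test step is the same in both runs. Only the surrounding data makes the difference between a pass and a found bug.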

Customer Data

Using real data from the customer or from production is not a nice-to-have but a very important factor in the success of the tests, and it can uncover all the kinds of issues described above. True, it is not always available, but because of its importance, we must do our best to get it.

Feel free to abuse

Feel free to abuse the data, from corrupted data packets to pictures. Cut the data, add unexpected data, make it too long or too short, etc. Fill the database with unsupported data.
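Here is a minimal Python sketch of a few such abuses applied to raw bytes: cutting the data, corrupting some bytes, and appending junk. The function names are made up for the illustration, and the input is assumed to be non-empty. Feed the results to the system in place of the valid original.

```python
import random

def truncate(data, rng):
    # Cut the data at a random point, so it is always shorter than the original.
    return data[: rng.randrange(len(data))]

def flip_bytes(data, rng, count=3):
    # Invert a few bytes at distinct random positions.
    corrupted = bytearray(data)
    for pos in rng.sample(range(len(corrupted)), k=min(count, len(corrupted))):
        corrupted[pos] ^= 0xFF
    return bytes(corrupted)

def append_junk(data, rng, max_extra=64):
    # Add between 1 and max_extra random bytes after the valid content.
    extra = rng.randrange(1, max_extra + 1)
    return data + bytes(rng.randrange(256) for _ in range(extra))

def abuse(data, seed=0):
    """Return several damaged variants of non-empty `data`, keyed by kind of abuse."""
    rng = random.Random(seed)  # fixed seed so a failing variant can be reproduced
    return {
        "truncated": truncate(data, rng),
        "flipped": flip_bytes(data, rng),
        "padded": append_junk(data, rng),
        "empty": b"",
    }

original = b"valid file content that the system is expected to process"
variants = abuse(original, seed=3)
for kind, payload in variants.items():
    print(kind, len(payload))
```

What counts as correct handling depends on the system. A reasonable minimum is that it rejects or quarantines the damaged item without crashing and without damaging the items around it.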

Know your customer

If you have a specific customer, do your research. If they are in a specific country and the data comes from the web, find out which websites and which languages are most popular in that country. If the data is related to apps or phone types, find out which apps and phones are most common in the country, and base your tests on that.


The last tip about the data is that ETL and executing queries are not enough to validate that everything works. You need to make sure the data travels correctly through the whole system, all the way to the export. Check the FedEx tour by James Whittaker for more ideas, for example here: https://youtu.be/fNkYz1hB7r0?t=35m2s.
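One simple way to check the whole path is to compare what went in with what came out at the export. The Python sketch below assumes that every record has a unique `id` and that the fields being compared should come out unchanged. Both are assumptions made for the illustration. In a real ETL flow, compare only the fields that are supposed to survive the transformations.

```python
import hashlib
import json

def fingerprint(record):
    # Stable hash of a record: keys are sorted so field order does not matter.
    canonical = json.dumps(record, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def reconcile(source_records, exported_records, key="id"):
    """Report records that were lost, added, or altered between input and export."""
    source = {record[key]: fingerprint(record) for record in source_records}
    exported = {record[key]: fingerprint(record) for record in exported_records}
    return {
        "lost": sorted(source.keys() - exported.keys()),
        "unexpected": sorted(exported.keys() - source.keys()),
        "altered": sorted(
            k for k in source.keys() & exported.keys() if source[k] != exported[k]
        ),
    }

source_records = [
    {"id": 1, "title": "a"},
    {"id": 2, "title": "b"},
    {"id": 3, "title": "c"},
]
exported_records = [
    {"id": 1, "title": "a"},
    {"id": 2, "title": "B"},  # changed somewhere along the way
    {"id": 4, "title": "d"},  # appeared from nowhere
]

report = reconcile(source_records, exported_records)
print(report)  # {'lost': [3], 'unexpected': [4], 'altered': [2]}
```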

