
Do you know who your respondents are? Measuring data hygiene using respondent validation

White Paper
By Susan Frede, Vice President, Research, Lightspeed Research & Nallan Suresh, Director, Analytics, TrueSample

Introduction

A growing trend among Kantar’s and the industry’s largest clients is to shift internal marketing research departments from being tactical to being more strategic. Currently, the tactical focus is largely retrospective, with little influence outside of marketing. The strategic focus will enable foresight and learning to support business decision-making at the highest levels of the company. To be strategic, researchers need to be able to synthesize multiple sources of data. As the focus changes, companies also want to understand consumers in new and deeper ways.

Fueling these client trends is the recognition that business growth can only be sustained by transforming the role of knowledge and information within the company. This knowledge, in turn, is used by senior executives to manage risk and to make proactive business decisions quickly, allowing the company to grow share in new markets, including underserved subgroups and emerging geographies. Knowledge and insight need to help companies anticipate unmet needs, increasing the success rate of new products/services and proactively warding off share erosion. Companies need to be able to sense, understand, and respond to early signals of weakness in consumer commitment to existing products/services.

The changing role of knowledge and insight within these companies has research quality implications. The emphasis on synthesis and building a body of knowledge means that variability in survey measurement and quality practices can lead to differences that are hard to reconcile and synthesize. Similarly, the importance of making go/no-go decisions as soon as possible means that reducing the potential for variability in processes or data is critical for managing the risk of making an incorrect business decision.

At the same time as research departments become more strategic, there is also an emphasis on reaching a diverse array of consumers. Targets may no longer be nationally representative. It is imperative that research suppliers help clients understand how research quality practices might improve, or even reduce, the diversity of the consumers from whom insights are drawn.

With this backdrop in mind, the current research-on-research project is designed to further the industry’s understanding of advances in data quality improvement techniques. The research strives to understand the effects of certain data hygiene practices on the diversity of survey samples. It builds on the work of Miller and Courtright (“Respondent validation: So many choices!”, CASRO Online Research Conference, Las Vegas, 2011), an unpublished TrueSample white paper (“What Impact Do Bad Respondents Have on Business Decisions”), and a body of unpublished Kantar work that examines the effects of respondent validation techniques on sample diversity.

Background and Approach

‘One size fits all’ may work for sock selection, but it does not hold for data quality solutions in market research. The more applicable adage is ‘everybody is different’, which brings us to the subject of this paper: the advantages of a holistic approach to improving survey data quality.

Typical pre-survey cleansing methods, such as TrueSample, perform name/address verification of individuals: the name and address of the survey-takers are checked against a database and validated. Those who are not validated are considered to have provided incorrect or unverifiable information and are excluded from surveys. Past TrueSample research on research has shown that there is an attitudinal difference, or bias, between validated and non-validated respondents, indicating that non-validated respondents can affect the data.

In Figure 1, we show a breakdown of the individuals who pass this name/address validation test, by demographic, on the Lightspeed MySurvey Panel. While this shows an overall validation rate of 89% on this panel (red line), TrueSample results have generally seen average rates of about 80%. What is critical to note is that both in the panel results illustrated in the figure and in the typical panel, this validation rate drops by about 10-15 points for certain demographic groups. These include groups where verifiable information is generally harder to obtain: 18-24 year olds, Hispanics, and African Americans.

Figure 1. There is an opportunity to increase validation rates among key demographic groups.

Two types of error need to be considered as we explore ways to increase the validation rates for key demographic groups:

  • Type I error occurs when valid respondents are erroneously classified as non-validated.
  • Type II error occurs when non-validated respondents are erroneously classified as validated.

Our goal needs to be to minimize Type I error without increasing Type II error. Currently, higher rates of erroneous rejection reduce sample availability in already hard-to-reach demographics, and this reduced sample may also introduce bias through over-exclusion of respondents. A simple modification to increase the validation rate and reduce Type I error would be to lower the validation threshold. However, this may or may not be a desirable approach depending on the population being validated; in some situations it would provide too loose a criterion and thereby introduce Type II error. Instead, we propose in this paper a method that uses a secondary source of data to validate a large percentage of the previously non-validated individuals. The scope of this research is therefore twofold. The first goal is to revisit the name/address validation process and provide a baseline for the data quality standard with this method. The second is to show that the use of a secondary database provides a more holistic approach to individual validation without compromising data quality.
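To make the threshold tradeoff concrete, the following is a minimal sketch on synthetic data. The match-score distributions, the assumed 90% share of truly valid panelists, and the threshold values are illustrative assumptions, not the actual validation model.

```python
# Minimal sketch (synthetic data): lowering the validation threshold
# reduces Type I error but increases Type II error. All distributions
# and thresholds here are assumptions for illustration only.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground truth: True = truly valid respondent (~90%).
truth = rng.random(10_000) < 0.9
# Hypothetical database match scores: valid respondents score higher.
scores = np.where(truth,
                  rng.normal(0.8, 0.15, truth.size),
                  rng.normal(0.4, 0.15, truth.size))

def error_rates(threshold):
    validated = scores >= threshold
    type_i = np.mean(~validated[truth])    # valid but rejected
    type_ii = np.mean(validated[~truth])   # invalid but accepted
    return type_i, type_ii

for t in (0.5, 0.6, 0.7):
    t1, t2 = error_rates(t)
    print(f"threshold {t:.1f}: Type I {t1:.1%}, Type II {t2:.1%}")
```

Running this shows the tension directly: each lower threshold recovers more truly valid respondents but admits more invalid ones, which is why a second, independent data source is preferable to simply loosening the criterion.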

This research will reinforce the theory that individuals are different and not everyone can be validated through, or be present in, a single database. Some may have credit information or public record information, while others may have to be sourced through alternate means such as social media data, phone records, etc. In this research, we show that a second database, which validates individuals via their email address, helps increase the validation rate in total and for hard-to-reach demographic groups. Even after the secondary validation process, there could be legitimate respondents who have no information in commercial databases for matching purposes and are therefore erroneously classified as non-validated. The idea is to show that the methodology goes a long way toward eliminating most of the Type I error. We further show that this increase in validation rate, which reduces Type I error, can in many cases be achieved with minimal increase in Type II error.

Research Design

A 10-12 minute concept test on a granola bar product was used in this research on research. The survey started with standard screening questions on age, gender, ethnicity and confidentiality, followed by the main questionnaire. Concept questions included intent to buy, concept rating questions and attitudinal questions. In addition, some quality measures were included to capture instances of bad behavior. The survey concluded with a survey-rating question.

We selected a set of Lightspeed MySurvey Panelists who had been validated through the name/address validation, and a set that had not been validated during the process. The panelists who had not been validated in the name/address process were sent through the secondary validation process on email addresses, and a subset of those was revalidated by this process. The revalidated panelists were then moved to the validated group. This process is shown in Figure 2.

Figure 2. In the secondary validation process, the panelists who failed validation in the standard name/address validation process are then sent through a secondary source for validation.

Samples were drawn with the intent to meet minimum quotas of 200 responses in the target demographics (18-24, Over 24, Hispanics, and African Americans) who passed validation and similarly for those who failed validation. The overall base size was 4,986.

Success criteria for the secondary validation process were based on two considerations:

  1. Yield or percentage of sample revalidated from the name/address non-validated sample (i.e. reduction in Type I error), and
  2. Improvement, or at a minimum, lack of degradation in data quality (i.e. no increase in Type II error).

For analysis purposes, respondents were classified into one of three categories:

  • (A) Validated in the name/address check
  • (B) Non-validated in name/address check but validated in the secondary validation process
  • (C) Non-validated by both processes.

The impact of the secondary validation process on the data was evaluated by comparing the survey results before and after the process of moving the respondents in the (B) category into the (A) category (Figure 3).
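As a concrete illustration of this before/after comparison, here is a minimal sketch; the category contents and the intent-to-buy values are hypothetical, invented purely for illustration.

```python
# Minimal sketch (hypothetical data) of the before/after comparison.
# Moving category (B) into the validated group should preserve or
# widen the gap between validated and non-validated respondents.
import statistics

# Hypothetical top-two-box intent-to-buy flags (1 = def./prob. will buy).
cat_a = [1, 1, 0, 1, 1, 0, 1, 1]   # (A) validated by name/address check
cat_b = [1, 0, 1, 1, 1]            # (B) recouped by secondary validation
cat_c = [0, 1, 0, 0]               # (C) non-validated by both processes

def gap(validated, non_validated):
    """Difference in mean intent to buy between the two groups."""
    return statistics.mean(validated) - statistics.mean(non_validated)

print("before:", gap(cat_a, cat_b + cat_c))   # (A) vs. (B) + (C)
print("after: ", gap(cat_a + cat_b, cat_c))   # (A) + (B) vs. (C)
```

In this toy example, the gap widens after reclassification, which is the signature of a successful secondary process discussed in the Results section below.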

Figure 3. Reclassification of individuals as valid through a secondary process can help increase capacity in hard-to-reach demographics if we can show that data quality is not compromised.

Results

Yield

In terms of the yield or percentage of previously non-validated sample revalidated by the secondary validation process, the method is very successful. As seen in Figure 4, the overall validation rate increases from 89% with the standard name/address validation process to 95% using the secondary validation process. This translates to, on average, 55% of the non-validated sample being recouped. More importantly, the secondary validation process has brought nearly all the key demographic groups to a 90% validation rate. Similar results have been seen on other panels tested, where, on average, for a typical panel with an overall validation rate of 80% we can expect to see an increase in the validation rate to 90%.
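As a rough arithmetic check on these figures: with an 89% initial validation rate, 11% of the panel fails the name/address check, and recouping 55% of that 11% adds roughly six points, since 89% + 0.55 × 11% ≈ 95%. The same arithmetic applied to a typical panel with an 80% rate gives 80% + 0.55 × 20% ≈ 91%, consistent with the roughly 90% rate cited above.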

Figure 4. The secondary validation process reduces Type I error by 55% on average and increases the overall validation rate to 95%.

Data Quality

For the second portion of the experiment, assessing the impact on data quality, we analyze the data from the survey. We look at the responses in terms of overall bias in results, quality measures (e.g., trap question errors and speeding), and differences in intent to buy between the two categories of respondents (validated and non-validated).

The analysis first looks at whether the data from the respondents who are validated and non-validated in the standard name/address validation process are different from each other. The second part of the analysis looks at the same differences between the two categories after the respondents who are revalidated by secondary validation are moved into the validated category from the non-validated category.

In examining the bias or difference between validated and non-validated sample, we use the following approach:

1. We start with the name/address validation process. We first perform a visual analysis of scatterplots that display the scores (i.e. the answer choices) from the respondents in the valid category on the x-axis against those from the respondents in the non-validated category on the y-axis to see if we can detect a consistent bias or difference (see scatterplot in Figure 5). A consistent deviation of the points from the diagonal indicates that the two categories of respondents differ from each other. The direction of the difference is unimportant and may vary from panel to panel and from demographic to demographic, but its presence indicates that the name/address validation process successfully distinguishes between the two categories of respondents.

2. We then look at the risk ratio, a metric designed to quantify the differences seen in the visual examination of the scatterplots. It is defined as the ratio of the probability of getting a wrong answer to the baseline probability of 5%, based on sampling theory. To compute the ratio, we first evaluate the response distribution of the valid respondents. We then infuse this distribution with an increasing percentage of responses from non-validated respondents and measure the shift in the distribution due to the addition of these responses. The shift is measured by the percentage of responses that fall outside the confidence interval, and the ratio is computed as this number relative to the baseline of 5%. If there is a bias or difference between the two categories, the risk ratio will rise steeply as more and more non-validated respondents are added to the sample of valid respondents, because the resulting distribution of responses will change. An example for illustrative purposes is depicted in Figure 5, and a computational sketch of this procedure follows this list.

Figure 5. As shown in this example, the responses from the validated and non-validated respondents differ (scatterplot). This bias can result in increased error when validated respondents are mixed with non-validated respondents (risk ratio).

3. When the secondary validation process is introduced, it is supposed to move truly valid respondents from the non-validated category to the validated category. This will cause the bias to increase or, at a minimum, stay the same, because in effect we are further purifying the categories. When this happens, the risk ratio curve will rise more steeply because the remaining non-validated respondents are even more different from the validated ones in terms of responses. By contrast, if the secondary validation process is not successful and instead moves truly non-valid respondents into the valid category, the bias will decrease and the risk ratio curve will get flatter (this is illustrated in Figure 6).

Figure 6. As shown in this example, if the secondary validation process is successful bias should increase or at a minimum, stay the same. If it is not successful, bias will decrease and the risk ratio curve will get flatter.
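To make the risk ratio concrete, here is one plausible Monte Carlo implementation of the procedure described in step 2. The response distributions, sample size, trial count, and the use of a mean-based 95% confidence interval are assumptions for illustration; the production metric may be computed differently.

```python
# Monte Carlo sketch of the risk ratio: the share of mixed-sample means
# falling outside the valid-only 95% confidence interval, expressed as
# a ratio to the 5% baseline. Distributions are synthetic.
import numpy as np

rng = np.random.default_rng(1)
BASELINE = 0.05                      # baseline probability from sampling theory

# Hypothetical 5-point-scale response pools for the two groups.
valid = rng.choice([1, 2, 3, 4, 5], size=5000, p=[.05, .10, .20, .35, .30])
nonvalid = rng.choice([1, 2, 3, 4, 5], size=5000, p=[.20, .25, .25, .20, .10])

def risk_ratio(mix_frac, n=300, trials=2000):
    """Probability of a 'wrong answer' relative to the 5% baseline."""
    mu = valid.mean()
    se = valid.std(ddof=1) / np.sqrt(n)
    lo, hi = mu - 1.96 * se, mu + 1.96 * se   # valid-only 95% CI
    k = int(n * mix_frac)                     # non-validated responses mixed in
    outside = 0
    for _ in range(trials):
        sample = np.concatenate([rng.choice(valid, n - k),
                                 rng.choice(nonvalid, k)])
        outside += not (lo <= sample.mean() <= hi)
    return (outside / trials) / BASELINE

for f in (0.0, 0.1, 0.2, 0.3):
    print(f"{f:.0%} non-validated mixed in -> risk ratio {risk_ratio(f):.1f}")
```

With no mixing the ratio is about 1 by construction; the more the two pools differ, the faster the ratio climbs as non-validated responses are infused, which is exactly the curve behavior described in steps 2 and 3.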

When we examine the survey data to see if the secondary validation process did indeed reduce Type I error while keeping Type II error in check, we see that the actual distributions are as expected for the various demographic categories. Figure 7 and Figure 8 present the results for the Lightspeed MySurvey Panel. We can clearly see that the biases or differences increase dramatically (as does the steepness of the risk ratio curve) after the secondary validation process is implemented. This indicates that the process is very successful in selecting the truly valid respondents from the originally non-validated category. Similar results have also been seen for two other test panels, although the extent of the increase in bias varies from panel to panel.

Figure 7. The secondary validation process clearly produces an increase in bias or difference between the validated and non-validated respondents in the scatterplots. This is coupled with a corresponding increase in risk ratio in the graphs on the right. The scatterplots represent the answers to questions in the survey with the x-axis showing the scores for the validated respondents and the y-axis for the non-validated respondents. The risk ratio plots on the right show the increase in the risk of bad data when more and more non-validated respondents are mixed in with the valid ones.
Figure 8. An increase in bias similar to the one seen in Figure 7 is also seen after implementing the secondary validation process in the target ethnic groups of African Americans and Hispanics.

We also compare the data from just the respondents who are validated by the secondary validation process with the others. The scatterplots in Figure 9 show two examples from the Lightspeed MySurvey Panel. It is interesting to note that the respondents who have been validated by the secondary validation process tend to respond more like the respondents who passed the standard name/address validation process. Correspondingly, they also look different from the respondents who are not validated by either process, as seen in the scatterplots to the right. Similar results have been seen on other panels.

Figure 9. Respondents validated via the secondary process look more like those validated by the standard name/address validation process (left) and different from the ones who are not validated by either process (right).

When we look at individual survey metrics, such as intent to buy and instances of “bad” behavior (speeding, straightlining, etc.), we see very similar trends. Figure 10 compares the validated respondents to the non-validated respondents before and after the secondary validation process. For intent to buy, we report the percentage giving a top-two-box rating (definitely or probably will buy). The bad-behavior metric flags respondents with more than two instances of bad behavior, such as speeding, straightlining, or missing the trap question.
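A minimal sketch of how such flags might be computed follows. The field names and the speeding threshold are hypothetical, as the paper does not specify them; only the more-than-two-instances rule comes from the text.

```python
# Minimal sketch of per-respondent bad-behavior flags. Field names and
# the 30%-of-median speeding threshold are assumptions for illustration.
def bad_behavior_count(resp: dict) -> int:
    """Count instances of bad behavior for one respondent."""
    flags = 0
    if resp["duration_sec"] < 0.3 * resp["median_duration_sec"]:
        flags += 1                              # speeding
    if len(set(resp["grid_answers"])) == 1:
        flags += 1                              # straightlining a rating grid
    if resp["trap_answer"] != resp["trap_expected"]:
        flags += 1                              # missed the trap question
    return flags

def is_flagged(resp: dict) -> bool:
    # The paper's metric counts respondents with more than two instances.
    return bad_behavior_count(resp) > 2

resp = {"duration_sec": 95, "median_duration_sec": 600,
        "grid_answers": [4, 4, 4, 4, 4],
        "trap_answer": 2, "trap_expected": 5}
print(bad_behavior_count(resp), is_flagged(resp))   # -> 3 True
```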

If the secondary process has worked successfully, we should expect the separation between the validated and non-validated classes to stay the same or increase after the secondary validation. If we had incorrectly moved truly non-valid respondents into the valid category, the differences between the valid and non-valid respondents would shrink or disappear. As seen in the results in Figure 10 for the Lightspeed MySurvey Panel, the name/address validation does show that valid and non-valid respondents behave differently, indicating that the process produces separation between the two classes of people. The secondary validation process either enhances or at least maintains these differences between validated and non-validated sample in most of the demographics, indicating that it is performing successfully.

Figure 10. The name/address validation process on the left shows that the valid and non-valid respondents differ in their behavior. The secondary validation process preserves or increases these differences between the validated and non-validated respondents in most cases, implying that the secondary validation process did not move truly non-valid respondents into the valid category.

Conclusion

Based on this research, we have concluded that the name/address validation does indeed provide separation between the two classes of respondents – those who pass validation and those who fail. We also conclude that the secondary validation process increases the validation rates in hard-to-reach demographics, and helps recoup over 50% of the overall sample lost in the name/address validation process. In addition, the secondary validation process achieves two critical objectives, namely:

  • Reducing the validation failure rate in the hard-to-reach demographics of 18-24 year olds, Hispanics and African-Americans by about 50%
  • Enhancing, or at a minimum maintaining, the data quality level achieved by the name/address validation process

From a Kantar perspective the secondary validation process has helped us meet our clients’ needs. It provides our clients a consistent quality standard without dramatic losses to capacity. We now have a process that does a better job of trimming those who are truly different while preserving our ability to reach respondents who show no evidence of undesirable behavior.

About the Authors

Susan Frede joined Lightspeed Research in January 2010, bringing nearly 23 years of Market Research experience. As Vice President of Research, Susan designs, conducts and analyzes Research on Research projects to improve panel performance, respondent quality and survey data quality. She has published numerous research-on-research papers and is a well-respected speaker at key industry events. In addition, Susan works closely with clients on quality initiatives and survey integrity, offering insight and consultation on research design, applications, execution and delivery. Susan graduated Magna Cum Laude from Northern Kentucky University with a B.S. degree in Marketing and a Minor in Mathematics. You can contact Susan at sfrede@lightspeedresearch.com.

Nallan Suresh was Director of Analytics at TrueSample. At TrueSample, Suresh applied analytical techniques and scoring methods to optimize processes ranging from panelist utilization and recruitment to survey design. Prior to TrueSample, Suresh worked in advanced analytics and data mining in a variety of industries. His experience includes working on missile defense systems for the defense industry, fraud detection in the healthcare industry, and credit scoring and fraud detection in the financial services industry. He holds a Doctorate in Engineering from the University of Michigan and a Masters in Engineering from Cornell University. You can contact TrueSample at support@truesample.com.



