There has been so much talk lately about the importance of harvesting "big data" that people may feel the game has already been lost if they don’t have big data, cannot process it, or don’t know how to recognize it when they have it. While big data is giving rise to profound insights into everything from human behavior to the origins of the universe, not harvesting it today doesn’t mean you can’t still collect data and make critical decisions about how to achieve your quality mission.
First off, let’s define what we mean by "big data." According to Amit Kara, director of technical marketing for Databricks, big data is just like any other data; there is simply more of it. It also accommodates more variables from more sources, which may present security risks depending on its content. And with more data and more variables, reliability becomes an issue. Big data like this is all around us, so the challenge lies in leveraging it.
There is an interesting difference of opinion over how much data should be collected based on the market in which a company operates, and this in itself challenges the relevance of making data "big." While conventional wisdom may be to gather as much data as possible, psychologist Gerd Gigerenzer cautions that it is better to simplify, use heuristics, and rely on fewer variables, especially during times of uncertainty. That caution gets to the heart of this article, particularly when you are trying to use data to understand how your own processes are performing.
Big data does not mean good data. A respected data scientist once told me that "half of all measurement systems are incapable of providing accurate measurements." I was painfully reminded of this when transporting luggage between airports a few years ago. I went to great lengths to ensure my bag was under 50 pounds to avoid extra charges, and it was confirmed to be under the maximum allowed weight when I placed it on the scale at the ticket counter. However, when I was preparing to return home, the agent at the next airport placed the very same bag on their scale (with identical contents) and, behold, it had gained 1.5 pounds! I did not feed the luggage while we traveled together, so you might pause to wonder about the data being fed to you.
So it is fair to say that the real challenge is getting good data in the first place. The key is to remember that data is "guilty until proven innocent." The good news is that you do not need to collect millions of lines of big data to draw useful conclusions. That alone challenges the need for executives to obsess over becoming big data barons.
Data reliability is established through a variety of tests, from simple to esoteric. The easiest tests center on repeatability and reproducibility, things you likely do very often without even thinking. Check your oil: wipe off the stick, dip it, and read it again. Take the temperature of your child who may have the flu; then check it again. "Measure twice, cut once." These are all tests of repeatability, and to the extent you get similar results each time you check, you are validating the reliability of the data you collect.
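If you prefer to see that idea spelled out, here is a minimal sketch of a repeatability check in Python; the readings and the half-pound tolerance are made up for illustration, not taken from any real gauge study.

```python
# A repeatability check in miniature: one gauge measures the same item
# several times. The readings and the tolerance below are hypothetical.

readings_lb = [49.2, 49.4, 49.3]   # repeated weighings of the same bag
tolerance_lb = 0.5                 # how much spread we are willing to accept

spread = max(readings_lb) - min(readings_lb)
if spread <= tolerance_lb:
    print(f"Repeatable enough: spread of {spread:.1f} lb is within {tolerance_lb} lb")
else:
    print(f"Suspect gauge: spread of {spread:.1f} lb exceeds {tolerance_lb} lb")
```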
Reproducibility compares results when the same thing is measured by different operators. Have members of a team each measure the same item, or compare a manual observation with an automated result. If the results agree at least 95% of the time, the measurement system is not perfect, but in most cases you can feel comfortable using the data. Anything less than 95% agreement, and you are taking the chance of making bad decisions based on bad data.
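Here is a similarly minimal sketch of a reproducibility check built on that 95% agreement rule; the two operators and their pass/fail calls are hypothetical.

```python
# A reproducibility check in miniature: two operators judge the same parts,
# and we compute how often their pass/fail calls agree. The data are made up;
# the 95% threshold is the rule of thumb from the article.

operator_a = ["pass", "pass", "fail", "pass", "pass", "fail", "pass", "pass", "pass", "pass"]
operator_b = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass", "pass", "pass"]

matches = sum(a == b for a, b in zip(operator_a, operator_b))
agreement = matches / len(operator_a)

print(f"Agreement: {agreement:.0%}")
print("OK to use the data" if agreement >= 0.95 else "Fix the measurement system first")
```

Ten parts is far too few for a formal gauge study, but the arithmetic is the same at any scale.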
The process of analyzing small and big data alike is far easier if you can simply commit to a focus on attribute data. Attribute (a.k.a. discrete) data are expressed as whole numbers, not fractions, which reduces reliability concerns to a degree. The litmus test for attribute data is to ask, "Can I subdivide a data point?" For example, you cannot have half a wrong-side surgery, so tracking wrong-side surgeries is a matter of attribute data. How many wrong-side surgeries have to occur before you agree you have a quality-control problem?
Whether you are dealing with attribute data or continuous data, recasting your analysis as a "pass versus fail" comparison greatly simplifies the discussion and can produce quick, reliable results. In fact, by adopting pass/fail as the comparison protocol you are effectively converting continuous data (e.g., 98.7°F) into attribute data ("pass," not a fever).
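As a concrete illustration, the sketch below converts a handful of temperature readings into pass/fail attribute data. The 100.4°F fever cutoff is a common rule of thumb that I have assumed here for the example; substitute whatever limit your own process dictates.

```python
# Converting continuous data to attribute data with a pass/fail rule.
# The cutoff and the readings are illustrative assumptions.

FEVER_CUTOFF_F = 100.4

temps_f = [98.7, 99.1, 101.2, 98.4, 100.6]                   # continuous readings
results = ["fail" if t >= FEVER_CUTOFF_F else "pass" for t in temps_f]

failures = results.count("fail")
print(results)                       # ['pass', 'pass', 'fail', 'pass', 'fail']
print(f"{failures} of {len(temps_f)} readings fail")
```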
How many times do you need to endure customers bringing back defective products before you will admit that you cannot prevent failures from reaching your customers? Once you’ve established your own pain point (which should be the same as your customers’), you can easily set about determining whether the underlying process is broken.
After you have slain the beast of making data reliable and making decisions about the quality of your process, it’s time to translate that data into a sigma level so that you can establish the "capability" of your process. Capability is a metric expressed in terms of its relationship to "Six Sigma," which is, for all practical purposes, a perfect process. I’ll peel the onion on this topic a bit more in a future column, where we will discuss various approaches to determining the capability of your process.
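As a small preview, here is one common way to turn a pass/fail defect count into a sigma level. The 1.5-sigma shift is a widely quoted Six Sigma convention, and the sample numbers are illustrative; this is an assumption about the arithmetic, not the approach the future column will necessarily take.

```python
# Turning a simple pass/fail defect count into an approximate sigma level.
# Uses the conventional 1.5-sigma shift; numbers below are hypothetical.

from statistics import NormalDist

def sigma_level(defects: int, opportunities: int, shift: float = 1.5) -> float:
    """Approximate process sigma from a pass/fail defect count."""
    defect_rate = defects / opportunities
    return NormalDist().inv_cdf(1 - defect_rate) + shift

# Hypothetical example: 7 returned units out of 10,000 shipped.
print(f"Sigma level: {sigma_level(7, 10_000):.2f}")   # roughly 4.7
```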
The bottom line is that you can still be a big shot with small data, making decisions that offer breakthrough improvement while your competitors drown in a cesspool of marginally useful data.