I have been teaching Lean and Six Sigma tools to Black Belt and Green Belt candidates for about 20 years. The most difficult parts of the training for many students are the more intensive statistical methods such as ANOVA (analysis of variance) and regression. A typical Black Belt class is 16 to 20 days long and covers a variety of Six Sigma, Lean, and project management tools. The statistical tools are taught and practiced with example data sets and data generated from in-class exercises. Students often need help from a coach the first time they attempt to apply the ANOVA general linear model or multiple regression analysis in practice. Collecting and formatting the data, verifying that the data assumptions are met, successfully running the analysis, and interpreting the results can be tricky.
Data comes in two varieties: continuous (also called variable) data and attribute data. Continuous data can be divided into smaller and smaller pieces down to the resolution of the measurement system. Examples include length, weight, and time. Attribute data falls into discrete buckets. Examples are geographic region, material, and supplier. In graphing and analyzing data, there are input (X) variables and output (Y) variables, each of which may be continuous or attribute. This gives rise to four different combinations, which in turn give rise to the many types of graphs and analysis methods that are available. Tools that relate one input to one output are simpler and easier to think about than tools that relate multiple inputs to multiple outputs. Additional complexity is added when sets of inputs or outputs contain both continuous and attribute variables. The more complex the analysis, the more potential caveats exist when interpreting the results. A skilled practitioner is needed to understand these caveats and create useful models that help people make better decisions. This is part of the art of problem-solving.
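To make the distinction concrete, the small sketch below (written in Python with pandas purely for illustration; the column names and values are invented) labels each variable as continuous or attribute and as an input (X) or an output (Y), and lists the four X and Y combinations in comments.

```python
import pandas as pd

# Invented example data: each column is either continuous or attribute,
# and either an input (X) or an output (Y).
data = pd.DataFrame({
    "cure_time_min":  [12.1, 11.8, 12.6, 12.0],          # continuous input (X)
    "supplier":       ["A", "B", "A", "C"],               # attribute input (X)
    "coating_weight": [4.02, 3.97, 4.11, 4.05],           # continuous output (Y)
    "pass_fail":      ["pass", "pass", "fail", "pass"],   # attribute output (Y)
})

# The four X/Y combinations that drive the choice of graph and analysis:
#   continuous X -> continuous Y   (cure_time_min vs. coating_weight)
#   attribute  X -> continuous Y   (supplier vs. coating_weight)
#   continuous X -> attribute  Y   (cure_time_min vs. pass_fail)
#   attribute  X -> attribute  Y   (supplier vs. pass_fail)
print(data.dtypes)
```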
In the years before easy access to computers and user-friendly statistical software packages, running a complex analysis required many hours of study and practice, not to mention the time needed to perform the necessary calculations. Hardware and software advances have made these analyses accessible to a much larger audience. AI (artificial intelligence) promises another quantum leap in the accessibility of complex analyses. The latest wave of tools involves several machine learning methods, including classification and regression trees, or CART analysis. CART regression relates one continuous output variable to multiple input variables; CART classification relates one categorical output variable to multiple input variables. In both cases the input variables may be continuous or categorical (attribute).
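As a minimal sketch of that distinction, the snippet below uses scikit-learn's decision trees as an open-source stand-in for a CART engine (the analysis in this article was run in Minitab). The tiny arrays are invented, and in scikit-learn any attribute inputs must first be encoded as numbers.

```python
# Invented toy arrays; scikit-learn's decision trees stand in for a CART engine.
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Inputs: a continuous variable (viscosity, sec) and an attribute variable
# already encoded as 0/1 (scikit-learn requires numeric inputs).
X = [[14.2, 0], [16.8, 1], [17.5, 0], [15.1, 1]]

y_continuous = [92.0, 88.5, 81.0, 90.2]          # continuous output -> CART regression
y_attribute = ["pass", "pass", "fail", "pass"]   # attribute output  -> CART classification

reg = DecisionTreeRegressor(max_depth=2).fit(X, y_continuous)
clf = DecisionTreeClassifier(max_depth=2).fit(X, y_attribute)

print(reg.predict([[16.0, 1]]))   # predicted continuous value
print(clf.predict([[16.0, 1]]))   # predicted class label
```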
Consider a painting process where the goal is to reduce defects of all types and maximize the pass rate. The important input variables, as determined by the problem-solving team, include air pressure, ambient temperature, paint viscosity, production shift, part type, and paint supplier. Pressure, temperature, and viscosity are continuous predictors (inputs). Shift, part, and supplier are attribute. A few rows of the data are displayed in figure 1. The variable input dialog box is shown in figure 2. The application used for the analysis is Minitab version 21.4 (a rough open-source equivalent is sketched after figure 2). Purists will note that in this example pass rate is being treated as a continuous variable, although percentages resulting from count data are technically attribute.
Figure 1
Figure 2
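For readers who want to experiment outside Minitab, the sketch below shows roughly how the same model could be set up in Python with scikit-learn. The data rows are invented stand-ins for the data in figure 1, not the actual values, and the one-hot encoding step is needed only because scikit-learn trees require numeric inputs; Minitab's CART handles attribute predictors directly.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Invented rows standing in for the data shown in figure 1.
df = pd.DataFrame({
    "pressure":    [42.0, 45.5, 43.1, 44.8, 41.6, 46.2],
    "temperature": [21.5, 23.0, 22.1, 24.4, 20.9, 23.7],
    "viscosity":   [16.3, 17.4, 16.8, 17.1, 16.1, 17.6],
    "shift":       ["1", "2", "3", "3", "1", "2"],
    "part":        ["A", "A", "B", "B", "C", "A"],
    "supplier":    ["S1", "S2", "S1", "S2", "S1", "S2"],
    "pass_rate":   [88.5, 79.2, 92.1, 90.4, 86.7, 77.9],
})

# One-hot encode the attribute predictors; the continuous columns pass through.
X = pd.get_dummies(df[["pressure", "temperature", "viscosity",
                       "shift", "part", "supplier"]])
y = df["pass_rate"]

tree = DecisionTreeRegressor(min_samples_leaf=2, random_state=0).fit(X, y)
print(dict(zip(X.columns, tree.feature_importances_)))  # rough analogue of figure 4
```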
The optimal tree diagram from the CART regression analysis is shown in figure 3. Node 1 at the top of the tree diagram shows the average and standard deviation for all 145 pass rates. The analysis determines which input variables have the most influence on pass rate, as well as the split points that produce the largest differential. The first split was made by shift. In node 2, the 24 pass rates from shifts 1 and 2 average 79.9%, versus an average of 90.3% for the 121 pass rates from shift 3, shown in node 4. Note that the overall average pass rate is 88.6%, as seen in node 1.
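A quick check ties the first split back to the root node: weighting the two node averages by their counts reproduces the overall mean reported in node 1.

```python
# Weighted recombination of the first split (counts and means quoted above).
n2, mean2 = 24, 79.9    # node 2: shifts 1 and 2
n4, mean4 = 121, 90.3   # node 4: shift 3
overall = (n2 * mean2 + n4 * mean4) / (n2 + n4)
print(round(overall, 1))  # 88.6, matching node 1
```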
Node 2 is next split by viscosity, with the largest differential occurring at a viscosity of 17.2 sec. The 12 pass rates with viscosity below 17.2 sec average 83.9%; that node is not split further in the analysis and is labeled terminal node 1. Node 3 contains the 12 pass rates with viscosity above 17.2 sec and is split further by shift: the shift 2 average is 72.6% (terminal node 2) and the shift 1 average is 78.4% (terminal node 3).
Returning to node 4, which contains the pass rates for shift 3, the data is next split by viscosity at a split point of 17.0 sec. The higher pass rates in node 5 are further split by viscosity at a split point of 16.5 sec. Comparing all the terminal nodes, the largest average pass rate is 93.3% in terminal node 4 and the smallest is 72.6% in terminal node 2, a difference of 20.7 percentage points. The best pass rates are found in the data from shift 3 with low viscosity; the worst pass rates are found in the data from shift 2 with high viscosity. Read this way, the tree is simply a short set of if-then rules (the shift 1 and 2 branch is sketched after figure 4). The overall Relative Variable Importance is shown in figure 4.
Figure 3
Figure 4
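Written out as code, the shift 1 and 2 branch of the tree reduces to a few if-then rules. The sketch below uses the split point and terminal-node averages quoted above; the function name is ours, not Minitab output, and the shift 3 branch is omitted because only its best terminal node (93.3%) is quoted in the text.

```python
def predicted_pass_rate(shift, viscosity):
    """Terminal-node averages for the shift 1 and 2 branch of figure 3."""
    if shift not in ("1", "2"):
        raise ValueError("the shift 3 branch is not reproduced in this sketch")
    if viscosity < 17.2:                    # split point found for node 2
        return 83.9                         # terminal node 1
    return 72.6 if shift == "2" else 78.4   # terminal nodes 2 and 3

print(predicted_pass_rate("2", 17.5))  # 72.6, the worst-performing terminal node
```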
A more conventional approach would be to perform a multiple regression analysis using both continuous and categorical predictors and go through a series of refinements. Alternatively, an ANOVA general linear model approach could be taken using a combination of categorical factors and continuous covariates. While it is possible these methods could result in several different competitive models, this author used both approaches and arrived at the same model. The resulting analysis of variance table and regression equations are shown in figure 5 (a rough open-source sketch of the same type of model follows the figure). The low p-values in the ANOVA table show viscosity and shift as important drivers of pass rate, the same conclusion reached in the CART analysis. The three regression equations confirm that shift 3 has the highest relative pass rates (constant of 190.8) and shift 2 the lowest (constant of 179.0). The negative viscosity coefficients reflect that lower viscosity results in higher pass rates. Again, this is the same conclusion as the CART analysis.
Figure 5
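For comparison, a rough open-source counterpart to the figure 5 model can be fit with statsmodels, with viscosity as a continuous covariate and shift as a categorical factor. The data rows below are invented stand-ins, so the fitted coefficients and p-values will not match figure 5.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Invented stand-in rows; the fitted numbers will not match figure 5.
df = pd.DataFrame({
    "viscosity": [16.3, 17.4, 16.8, 17.1, 16.1, 17.6, 16.5, 17.0],
    "shift":     ["1", "2", "3", "3", "1", "2", "3", "1"],
    "pass_rate": [88.5, 79.2, 92.1, 90.4, 86.7, 77.9, 93.0, 84.2],
})

# General-linear-model style fit: viscosity as a continuous covariate and
# shift as a categorical factor, mirroring the structure of the figure 5 model.
model = smf.ols("pass_rate ~ viscosity + C(shift)", data=df).fit()
print(model.params)    # per-shift constants (intercept plus shift contrasts)
                       # and the viscosity coefficient
print(model.pvalues)   # p-values analogous to the ANOVA table in figure 5
```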
For the CART analysis, once the data is collected and formatted, the only knowledge needed is whether each variable is an input or an output, and whether it is continuous or attribute. To understand the results, the user needs to know how to read the optimal tree diagram, which is not greatly different from other types of tree diagrams. In the background, the prediction calculations are handled automatically, so the user can easily predict the pass rate for any set of conditions. To run the multiple regression analysis and the ANOVA general linear model, the user needs to be able to run multiple refinements, understand several goodness-of-fit criteria, run the tests for data assumptions, understand the assumption caveats, and, at the end, be able to interpret the results.
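To make the "predict for any set of conditions" point concrete, here is a minimal sketch using scikit-learn on invented stand-in data: fit the tree once, then hand it a new row of conditions and read off the predicted pass rate (the average of the matching terminal node). The values and column names are ours, not the article's data.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Invented stand-in data: fit once, then predict for new conditions.
df = pd.DataFrame({
    "viscosity": [16.3, 17.4, 16.8, 17.1, 16.1, 17.6],
    "shift":     ["1", "2", "3", "3", "1", "2"],
    "pass_rate": [88.5, 79.2, 92.1, 90.4, 86.7, 77.9],
})
X = pd.get_dummies(df[["viscosity", "shift"]])
tree = DecisionTreeRegressor(min_samples_leaf=2, random_state=0).fit(X, df["pass_rate"])

# New conditions: shift 3 at a viscosity of 16.4 sec (arbitrary values).
new = pd.get_dummies(pd.DataFrame({"viscosity": [16.4], "shift": ["3"]}))
new = new.reindex(columns=X.columns, fill_value=0)  # align the one-hot columns
print(tree.predict(new))  # predicted pass rate = the matching terminal-node mean
```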
To summarize, CART (classification and regression trees) is a machine learning tool that can handle both continuous and attribute data to identify important variables and assess their impact. It is more accessible to users than traditional approaches.