Classification of Breast Cancer Cells Using JMP®
By Marie Gaudard, Phil Ramsey and Mia Stephens, North Haven Group
Mi-Ling is a Six Sigma Black Belt who has just learned that, in a month, her team will be asked to develop a predictive model. Her company produces a medication in two forms, and the model must predict which form future customers will be more likely to purchase. She will have economic and demographic information on a set of about 1,000 customers, as well as information on which form of the medication they purchased over the last six months.
Knowing that she will have to undertake this analysis quickly, Mi-Ling finds a published data set, the Wisconsin Breast Cancer Diagnostic data, which she will analyze for practice. Her plan is to use various JMP techniques to fit classification models to this data set.
This data set arises in connection with diagnosing breast tumors based on a fine needle aspirate (Mangasarian, OL, et al., 1994). In this study, a small-gauge needle was used to remove fluid from a lump. This fluid was placed on a glass slide and stained to reveal the nuclei of the cells. A software program was used to compute 10 characteristics for each nucleus: radius, perimeter, area, texture, smoothness, compactness, number of concave regions, size of concavities, symmetry and fractal dimension of the boundary (Street, WN, et al., 1993).
A set of 569 images was processed as described above. Since a typical image can contain from 10 to 40 nuclei, the data were summarized: For each of the 10 characteristics, the Mean, Max and standard error of the mean (SE) were computed. These are the 30 predictor variables in Mi-Ling’s data set, BreastCancerClassification_Excerpt.jmp.
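The summarization step can be illustrated with a minimal sketch. It assumes one characteristic has been measured on each nucleus in a single image and that the SE is the usual sample standard deviation divided by the square root of the number of nuclei. The function name `summarize_nuclei` and the radius values are hypothetical and are not taken from the study.

```python
import math
import statistics

def summarize_nuclei(values):
    """Return (mean, max, SE of the mean) for one characteristic
    measured on every nucleus in a single image."""
    n = len(values)
    mean = statistics.fmean(values)
    largest = max(values)
    # Standard error of the mean: sample standard deviation / sqrt(n)
    se = statistics.stdev(values) / math.sqrt(n)
    return mean, largest, se

# Hypothetical radius measurements for the nuclei in one image
radii = [12.0, 14.0, 16.0, 18.0]
mean_radius, max_radius, se_radius = summarize_nuclei(radii)
print(mean_radius, max_radius, round(se_radius, 4))
```

Applying the same function to each of the 10 characteristics would yield the 30 summary values for one image.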
The case study follows Mi-Ling through her analysis. She begins by using visualization techniques to help build an understanding of the data set. After dividing the data into a training set, a validation set and a test set, she fits four models to the training data: a logistic model, a partition model and two neural net models. After comparing their performance on the validation set, she chooses one of these as her final model and uses the test set to assess its performance.
In this excerpt from the case study, we will illustrate two techniques that Mi-Ling uses in her study of this data set:
- As part of data exploration, she uses Graph Builder, a platform new in JMP 8, to compare her training, validation and test sets.
- To visualize one of her models, she uses the JMP Surface Plot to see the predicted surface, her data and the classification cut.
Comparing Analysis Data Sets Using Graph Builder
Mi-Ling draws values from a uniform random distribution to assign rows to training, validation and test sets in a 60%, 20%, 20% apportionment. The assignment is made by formulas stored in the columns Random Unif and Data Set Indicator.
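One way such an assignment might work is sketched below in Python rather than as JMP column formulas. The cut points 0.6 and 0.8, the order of the sets, the seed and the labels "Validation Set" and "Test Set" are assumptions made for illustration. The article does not show the actual formulas.

```python
import random

def assign_data_set(u):
    """Map a uniform draw u in [0, 1) to an analysis set (60/20/20)."""
    if u < 0.6:
        return "Training Set"
    elif u < 0.8:
        return "Validation Set"
    return "Test Set"

rng = random.Random(2009)  # arbitrary fixed seed so the split is repeatable
n_rows = 569               # number of images in the data set

# Analogue of the Random Unif column: one uniform draw per row
random_unif = [rng.random() for _ in range(n_rows)]
# Analogue of the Data Set Indicator column
data_set_indicator = [assign_data_set(u) for u in random_unif]

counts = {
    name: data_set_indicator.count(name)
    for name in ("Training Set", "Validation Set", "Test Set")
}
print(counts)
```

Because each row is assigned independently, the realized proportions will only approximate 60%, 20% and 20%.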
Although this is done randomly, Mi-Ling wants to verify that the split into the three analysis data sets is representative. To this end, she uses Graph Builder, which provides an intuitive interface for constructing trellis displays used in comparing the distributions of multiple variables.
The graph in Figure 1 utilizes box plots to compare the distributions of six of the measured characteristics: Smoothness, Compactness, Concavity, Concave Points, Symmetry and Fractal Dim for each of the Mean, Max and SE summary measures. Mi-Ling sees at a glance that these 18 variables are comparable across the three constructed sets, with the exception of two outlying values for SE Concavity in the Training Set. Mi-Ling saves this analysis to the data table as a script called Graph Builder – Smoothness to Fractal Dimension, which team members can rerun later to repeat this analysis.
Figure 1. Box plots for Mean, Max and SE of Smoothness through Fractal Dim, by Analysis Data Set.
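The comparison that the box plots make visually can also be approximated numerically. The sketch below computes a five-number summary (minimum, quartiles, maximum) of one variable for each analysis set. The rows, the values and the helper names are hypothetical. Graph Builder may use a different quartile definition than the "inclusive" method chosen here.

```python
import statistics
from collections import defaultdict

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the values a box plot displays."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

def summarize_by_set(rows, column, set_column="Data Set Indicator"):
    """Group one column by analysis set and summarize each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[set_column]].append(row[column])
    return {name: five_number_summary(vals) for name, vals in groups.items()}

# Hypothetical rows; the real table has 569 rows and 30 predictors
rows = (
    [{"Data Set Indicator": "Training Set", "Mean Smoothness": v}
     for v in (0.08, 0.09, 0.10, 0.11, 0.12)]
    + [{"Data Set Indicator": "Validation Set", "Mean Smoothness": v}
       for v in (0.09, 0.10, 0.11)]
    + [{"Data Set Indicator": "Test Set", "Mean Smoothness": v}
       for v in (0.08, 0.10, 0.12)]
)

summary = summarize_by_set(rows, "Mean Smoothness")
for name, stats in summary.items():
    print(name, stats)
```

Similar quartiles across the three sets would suggest that the split is representative for that variable.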
Mi-Ling could have obtained box plots for all 10 characteristics in the format shown in Figure 1. But because the characteristics are measured on very different scales, the box plots for most of them would have appeared as horizontal lines, showing no detail.
Because of this scaling issue, Mi-Ling plots Texture separately from the six characteristics shown in Figure 1. Figure 2 shows a Graph Builder view of the Texture variables, using histograms instead of box plots. Again, the distributions appear consistent across analysis sets. This script is called Graph Builder – Texture.
Figure 2. Histograms of Mean, Max and SE of Texture, by Analysis Data Set.
Mi-Ling repeats the analysis for the remaining three characteristics: Radius, Perimeter and Area. She saves this analysis for future use as a script called Graph Builder – Radius, Perimeter, Area.
Visualizing a Logistic Model Using Surface Plot
In the case study, Mi-Ling’s first modeling approach involves logistic regression. She decides that she would like to see what a logistic model based on only two predictors might look like. She copies the Training Set row states to the data table row states; this has the effect of selecting only the training data. She fits a logistic model to Diagnosis, using only the two predictors: Mean Perimeter and Mean Smoothness. The script is Logistic – Two Predictors. She saves Prob[M], the formula used to estimate the probability that a lump is malignant, to the data table.
Using Surface Plot, she constructs the plot in Figure 3. The script is called Surface Plot. The surface shows the probability of a malignant diagnosis as modeled by Mean Perimeter and Mean Smoothness. Note the S-shaped logistic surface. The values of Prob[M] are plotted for the points in the training set; these points are plotted as red circles for malignant tumors and blue plus signs for benign tumors.
Note that a grid has been inserted at Prob[M] = 0.5. Mi-Ling sees that if she were to classify observations with Prob[M] values above 0.5 as malignant and those below 0.5 as benign, the rule would correctly classify most of the training observations: 312 of 347, or about 90%.
Figure 3. Logistic Model Based on Mean Perimeter and Mean Smoothness.
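The surface and the 0.5 cut can be sketched in a few lines of Python. The coefficient values below are hypothetical placeholders, not the estimates from Mi-Ling's fitted model, and the "B" label for benign is assumed. Only the counts 312 and 347 come from the text.

```python
import math

def prob_malignant(mean_perimeter, mean_smoothness, b0, b_perimeter, b_smoothness):
    """Logistic surface: estimated probability that a lump is malignant."""
    linear = b0 + b_perimeter * mean_perimeter + b_smoothness * mean_smoothness
    return 1.0 / (1.0 + math.exp(-linear))

def classify(prob_m, cutoff=0.5):
    """Classify as malignant ("M") above the cutoff, benign ("B") otherwise."""
    return "M" if prob_m > cutoff else "B"

# Hypothetical coefficients, chosen only to give an S-shaped surface
B0, B_PERIMETER, B_SMOOTHNESS = -20.0, 0.15, 60.0

# Two hypothetical observations: (Mean Perimeter, Mean Smoothness)
large_lump = prob_malignant(120.0, 0.11, B0, B_PERIMETER, B_SMOOTHNESS)
small_lump = prob_malignant(70.0, 0.09, B0, B_PERIMETER, B_SMOOTHNESS)
print(classify(large_lump), classify(small_lump))

# Training accuracy implied by the counts reported in the text
training_accuracy = 312 / 347
print(round(training_accuracy, 3))
```

Moving the cutoff away from 0.5 would trade one kind of misclassification for the other, which is what the grid in the surface plot helps to visualize.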
Conclusion
These are only two of the many visualization tools that Mi-Ling uses in exploring her data and building models. Her understanding is augmented by the ability to see the data both on its own and in relation to models that she constructs. She realizes that these JMP tools will greatly enhance her ability to share her findings with members of her team and with management.
What’s Next?
After completing this exercise, Mi-Ling will be ready to develop a model to predict which of the two forms of her company’s medication is more likely to be purchased by future customers based on their background. All she needs now is the customer economic and demographic data.
References
Mangasarian, OL, Street, WN, and Wolberg, WH, “Breast Cancer Diagnosis and Prognosis via Linear Programming,” Mathematical Programming Technical Report 94-10, Dec. 19, 1994, pp. 1-9.
Street, WN, Wolberg, WH, and Mangasarian, OL, “Nuclear Feature Extraction for Breast Tumor Diagnosis,” 1993 International Symposium on Electronic Imaging: Science and Technology, Vol. 1905, 1993, pp. 861-870.
About this article
This is an excerpt from the upcoming book Visual Six Sigma: Making Data Analysis Lean: A Practitioner’s Guide Using Case Studies and JMP® Software by Ian Cox (SAS), Marie Gaudard (North Haven Group), Phil Ramsey (North Haven Group), Mia Stephens (North Haven Group) and Leo Wright (SAS). The book is scheduled to be published by SAS® Press in 2009.
This excerpt shows only two of the many visual tools used by Mi-Ling to explore and model the cancer data. The book contains five other case studies featuring traditional transactional and manufacturing Six Sigma projects. Each case study utilizes a Visual Six Sigma approach, illustrating a variety of visualization and modeling tools, in describing how the project was successfully completed.




