Classification of Breast Cancer Using Logistic Regression


CHAPTER ONE

Research Aim and Objectives

The aim is to develop a prediction system for detecting breast cancer.

The main objectives are:

Study and apply logistic regression to the classification of breast cancer.
Compare logistic regression with other existing machine learning classification models on the same data set.
Analyze the models' performance and draw conclusions.

CHAPTER TWO

LITERATURE REVIEW

This chapter presents basic concepts and terminology, such as data mining and classification techniques. It then reviews previous related work on this research topic in order to identify the techniques other authors have employed for the classification of breast cancer. The review is not limited to logistic regression: it also covers other machine learning algorithms that have been used for the classification of breast cancer. Prediction accuracy is discussed, as well as the techniques used to improve it.

Basic Terminologies and Concepts

Machine Learning (ML) is the science (and art) of programming computers so they can learn from data (Géron, 2017). In more general terms, ML is the field of study that gives computers the ability to learn without being explicitly programmed (Arthur Samuel, 1959). More technically, a computer program is said to learn from experience E with respect to some task T and some performance measure P if its performance on T, as measured by P, improves with experience E (Tom Mitchell, 1997).

There are several applications for ML, the most significant of which is data mining. People are often prone to making mistakes during analyses or when trying to establish relationships between multiple features (Kotsiantis, Zaharakis, & Pintelas, 2006), and ML can be applied to such problems.

Data Pre-processing

Data pre-processing is one of the most important data mining tasks. It covers the preparation and transformation of data into a form suitable for the mining procedure. Data pre-processing aims to reduce the size of the data, find relations between data items, normalize the data, remove outliers and extract features. It includes several techniques, such as data cleaning, integration, transformation and reduction (Alasadi & Bhaya, 2017).

Feature scaling

Feature scaling is a technique used to normalize the range of the independent variables (features) of a data set. It is also known as data normalization and is usually applied during the data pre-processing step.
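As an illustration of what normalizing the range of a feature means, the following minimal sketch shows two common scaling schemes, min-max scaling and standardization, written with the Python standard library only. The function names and the sample values are hypothetical and are not taken from this work.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no range information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical raw values of a single feature (made up for illustration).
feature = [1.0, 3.0, 5.0, 10.0]
scaled = min_max_scale(feature)
z_scores = standardize(feature)
```

Min-max scaling keeps the shape of the distribution but bounds it, while standardization centers the feature, which is often preferred by models that are sensitive to the magnitude of their inputs.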

Supervised Learning

Supervised machine learning is the search for algorithms that reason from externally supplied instances to produce general hypotheses, which then make predictions about future instances. In other words, the goal of supervised learning is to build a concise model of the distribution of class labels in terms of predictor features. The resulting classifier is then used to assign class labels to the testing instances, where the values of the predictor features are known but the value of the class label is unknown (Kotsiantis et al., 2006).

In supervised learning, as shown in Fig. 2.1, the learner is provided with two sets of data: a training set and a test set. The idea is for the learner to “learn” from the labelled examples in the training set so that it can identify the unlabelled examples in the test set with the highest possible accuracy (Learned-Miller, 2014).
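The division of labelled data into a training set and a test set can be sketched as follows. This is a minimal, standard-library illustration; the function name, the fixed seed, the 20% held-out fraction and the toy examples are assumptions made for the sketch.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows reproducibly and return (train, test) lists."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(rows)                  # copy so the input is left untouched
    random.Random(seed).shuffle(shuffled)  # seeded shuffle for reproducibility
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical labelled examples: (feature vector, class label).
examples = [([i, i + 1], i % 2) for i in range(10)]
train, test = train_test_split(examples, test_fraction=0.2)
```

The learner only ever sees `train` while fitting; `test` is held back so that accuracy is measured on examples the model has not encountered.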

CHAPTER THREE

MATERIALS AND METHOD

This chapter discusses the framework and the algorithm used, and explains the various stages of the framework.

Concept of Classification Technique

Classification is one of the ways in which a machine learns. Its specific goal is to accurately predict the unknown value of the target attribute from the known values of the other attributes (Jhajharia et al., 2016; Aggarwal & Zhai, 2012; Mitra & Acharya, 2004). Classification is crucial in data mining and machine learning because it draws a clear distinction between the various classes by modelling the relationship between the variables and the class attribute (Aggarwal, 2015; Guo, Huang, & Zhang, 2014; Kriegel et al., 2007; T. Li, Ma, & Ogihara, 2005; Uppal, 2016).

Some attributes differ only slightly from one another, or the difference between them is insignificant. If such insignificant attributes are ignored, results can be obtained in less time (Garg, Beg, & Ansari, 2009). We developed a model for classifying breast cancer using a logistic regression classifier.

The model was trained and tested using the Wisconsin Breast Cancer Dataset (WBCD), obtained from the UCI Machine Learning Repository.
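To make the idea of a logistic regression classifier concrete, the sketch below fits one from scratch: a weighted sum of the features is passed through the sigmoid function to give a class probability, and the weights are adjusted by batch gradient descent on the log-loss. It is an illustration under stated assumptions, not the implementation used in this work. The function names, the learning rate, the number of epochs and the toy one-feature data (already centered around zero) are all hypothetical.

```python
import math


def sigmoid(z):
    """Map any real number to the interval (0, 1) in a numerically stable way."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(weights, bias, x):
    """Probability that feature vector x belongs to the positive class."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def fit(X, y, learning_rate=0.1, epochs=200):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n, d = len(X), len(X[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, target in zip(X, y):
            error = predict_proba(weights, bias, x) - target
            for j in range(d):
                grad_w[j] += error * x[j]
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


# Toy, made-up data: negative feature values are class 0, positive are class 1.
X = [[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]]
y = [0, 0, 0, 1, 1, 1]
weights, bias = fit(X, y)
predictions = [int(predict_proba(weights, bias, x) >= 0.5) for x in X]
```

A probability of at least 0.5 is read as the positive class; in a medical setting this threshold could be lowered to favor sensitivity over precision.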

Software Design Phase

The proposed model was implemented in Python using Jupyter Notebook as the programming environment, together with the scikit-learn machine learning library. scikit-learn has built-in support for many widely used classification algorithms, as well as a good number of utilities for data pre-processing and for measuring model performance. Python has major advantages over other languages because of its flexibility, the output it makes available once the learning stage has converged, and the ease with which graphs and charts can be plotted.

Hardware Requirement

The hardware requirements are:

Windows 7, 8 or 10 (64-bit) for PC, or iOS 8, 10 for Macintosh operating system
Any CPU
4 GB RAM and 40 GB free HDD space

CHAPTER FOUR

RESULTS AND DISCUSSIONS

This chapter presents the implemented framework and the results obtained, from the pre-processing phase through to the training and validation of the prediction models. Screenshots of the results are presented in support of the proposed framework.

Presentation of Results

All the steps taken in this research work are presented: handling missing values in the data, feature scaling, training the prediction models, and evaluating the models' performance in terms of accuracy, precision and sensitivity. Every stage (pre-processing, normalization, training and testing, classification, and performance measurement) was implemented in Python using the scikit-learn library. The results are displayed and analyzed.

CHAPTER FIVE

SUMMARY, CONCLUSION AND FUTURE WORK

Summary

The need for an accurate predictor of breast cancer cannot be overemphasized. Breast cancer has the second highest mortality rate among cancers, after lung cancer, and it mostly affects women. For its detection and classification, physicians use mammography for the diagnosis and prognosis of their patients. However, the accuracy of mammography is limited, so the need for a better prediction aid remains pressing.

Many researchers have employed machine learning and artificial intelligence techniques for the prediction and classification of breast cancer. These techniques take data as input, learn from it, and are then able to make predictions on new data that has the same dimensions as the data they learned from. In this research, the machine learning technique employed is logistic regression. Logistic regression is a probabilistic statistical model that uses the sigmoid function as its activation function. The data used in this work is the Wisconsin breast cancer dataset from the UCI online machine learning repository. The data has 11 attributes and 699 data points, with 16 missing values. The missing values, all in the ‘Bare_Nuclei’ feature, were filled with the mean of the non-missing values of that feature. For the classification task, the data was split into two sets: a training set (80% of the data) and a test set (20% of the data).
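The mean imputation described above can be sketched as follows. This is a minimal, standard-library illustration: the function name, the use of `None` as the missing-value marker and the sample column are assumptions made for the sketch and are not values from the dataset.

```python
from statistics import mean


def impute_mean(column, missing=None):
    """Replace every missing entry with the mean of the observed entries."""
    observed = [v for v in column if v != missing]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = mean(observed)
    return [fill if v == missing else v for v in column]


# Hypothetical 'Bare_Nuclei' column with two missing entries (made-up values).
bare_nuclei = [1, 10, None, 4, None, 5]
filled = impute_mean(bare_nuclei)
```

Because the fill value is computed only from the observed entries, the mean of the column is unchanged by the imputation, although its variance is reduced.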

The training set is used to train the logistic regression model and, subsequently, the test set is used to test the trained model. In this work, we examined the behavior of our model in two cases: when the features of the training and test data are scaled, and when they are not. The performance of our model in both cases was compared with that of other existing prediction models, namely support vector machine (SVM), naive Bayes (NB), and multilayer perceptron (MLP, an artificial neural network). From our analysis of the results, SVM performed slightly better than our LR model. The notable observation, however, is that the performance of our LR model remained the same in both cases, which was not true of the other models. The performance metrics used for our evaluation are the confusion matrix, precision score, recall score and F1-score.
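The evaluation metrics named above are all derived from the four counts of the confusion matrix: precision is TP / (TP + FP), recall (sensitivity) is TP / (TP + FN), and the F1-score is the harmonic mean of the two. The sketch below computes them with the standard library; the function names and the label vectors are made up for illustration and are not results from this work.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification problem."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1, returning 0.0 where undefined."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical true and predicted labels (1 = malignant, 0 = benign).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
counts = confusion_counts(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

In a diagnostic setting, recall is usually the metric to watch most closely, since a false negative corresponds to a missed malignant case.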

Conclusion

A logistic regression model does not necessarily require feature scaling, nor is it greatly affected by unbalanced data or by dependency among the features of the data set. Hence, for medium-sized data, logistic regression is a good probabilistic prediction model for a binary classification problem because of its simplicity and low time complexity. It can therefore be used for the prediction of breast cancer, which can greatly help physicians to make a proper and early diagnosis and, in turn, go a long way towards increasing the survival rate of breast cancer patients.

Future work

For future work, we propose the development of an ensemble learning model comprising logistic regression, an artificial neural network, and a support vector machine for cancer prediction.

REFERENCES

Aaltonen, L. A., Salovaara, R., Kristo, P., Canzian, F., Hemminki, A., Peltomäki, P., …de la Chapelle, A. (1998). Incidence of hereditary nonpolyposis colorectal cancer and the feasibility of molecular screening for the disease. The New England Journal of Medicine. https://doi.org/10.1056/NEJM199805213382101
Abedin, T., Chowdhury, M. Z. I., & Afzal, A. (2016). Review Article Application of Binary Logistic Regression in Clinical Research. Journal of National Heart Foundation of Bangladesh, 5(1), 8–11.
Agarap, A. F. (2017). On Breast Cancer Detection: An Application of Machine Learning Algorithms on the Wisconsin Diagnostic Dataset. (1), 5–9. https://doi.org/10.1145/3184066.3184080
Aggarwal, C. C. (2015). Data Mining. In Journal of Visual Languages & Computing (Vol. 11).
Aggarwal, C. C., & Zhai, C. (2012). A survey of text clustering algorithms. Mining Text Data, 8, 77–128. https://doi.org/10.1007/978-1-4614-3223-4
Agrawal, R., Gunopulos, D., & Leymann, F. (n.d.). Workflow and Scientific Databases Mining Process Models from Workflow Logs. Retrieved from https://link.springer.com/content/pdf/10.1007%2FBFb0101003.pdf
Alasadi, S. A., & Bhaya, W. S. (2017). Review of data preprocessing techniques in data mining. Journal of Engineering and Applied Sciences. https://doi.org/10.3923/jeasci.2017.4102.4107
Amrane, M., Oukid, S., Gagaoua, I., & Ensari, T. (2018). Breast cancer classification using machine learning. 2018 Electric Electronics, Computer Science, Biomedical Engineerings’ Meeting, EBBT 2018, 1–. https://doi.org/10.1109/EBBT.2018.8391453
Bazazeh, D., & Shubair, R. (2016). Comparative Study of Machine Learning Algorithms for Breast Cancer Detection and Diagnosis. 2–5.
