How Do Car Features Affect User Acceptance? Using Data Mining to Estimate the Acceptance of Cars Based on Their Features.

Abrar Salman

Information Technology Department University of Bahrain

Sakheer, Kingdom of Bahrain

Fatima Ali Information Technology Department

University of Bahrain Sakheer, Kingdom of Bahrain

Supervised by: Dr. Ahmed Zeki


The features of a car have a great impact on its level of user acceptance. The aim of this study is to find the hidden relations and patterns between the features of a car and user acceptance using data mining. This study will benefit car dealers in deciding what will be in demand, as well as manufacturers in improving their final products. The data mining techniques used are Naïve Bayes, Simple K-Means and association rules.

Index Terms— car, acceptance, features.


Car manufacturers aim to be the best in the market by providing competitive features in many aspects, such as price, appearance, comfort, performance and safety, to make their products more acceptable to users.

Data mining is the extraction of interesting patterns or knowledge from huge amounts of data. In this project, data mining is used to measure how acceptance is affected by the features.


To find the relation between the features and the acceptance level, we looked for an existing dataset and found the Car Evaluation Database, created by Marko Bohanec in 1997 [3]. This dataset was suitable for our research since it has 6 attributes and 1728 instances, which provides more reliable results.


The dataset was already in the Attribute-Relation File Format (.arff); the data was clean, nothing was missing, and all the attributes were nominal. We made some basic improvements by renaming the class attribute to “accept” and merging two of its values (vgood and good) into one, since both are very small and refer to almost the same thing.
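The merging step described above can be sketched as a simple text transformation on the .arff content. The lines below are an illustrative fragment, not the real file:

```python
# Illustrative .arff fragment (not the actual Car Evaluation file).
lines = [
    "@relation car",
    "@attribute accept {unacc,acc,good,vgood}",
    "@data",
    "vhigh,vhigh,2,2,small,low,unacc",
    "low,low,5more,more,big,high,vgood",
]

cleaned = []
for line in lines:
    if line.startswith("@attribute accept"):
        # Collapse vgood into good in the declared value domain.
        line = "@attribute accept {unacc,acc,good}"
    elif not line.startswith("@"):
        # Relabel vgood instances as good in the data rows.
        line = line.replace("vgood", "good")
    cleaned.append(line)

print(cleaned[-1])   # → low,low,5more,more,big,high,good
```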


The dataset has seven nominal attributes including the class, shown in Table 1.1, which lists the names of the attributes, a brief description and the possible values for each.

Table 1.1 Attributes

Attribute | Description                            | Values
buying    | Buying price.                          | vhigh, high, med, low
maint     | Price of the maintenance.              | vhigh, high, med, low
doors     | Number of doors.                       | 2, 3, 4, 5more
persons   | Capacity in terms of persons to carry. | 2, 4, more
lug_boot  | The size of the luggage boot.          | small, med, big
safety    | Estimated safety of the car.           | low, med, high
accept    | Car acceptability (class).             | unacc, acc, good


The main objective of this project is to use three different data mining techniques to analyze the data and extract any possible patterns or knowledge. The techniques used in this project are Naïve Bayes classification, association rules (Apriori) and clustering using Simple K-Means.

A. Classification: Naïve Bayes

Naïve Bayes is a probabilistic classification algorithm that combines class priors with per-attribute conditional probabilities under the assumption that the attributes are independent given the class. Since this independence assumption rarely holds in reality, the method is considered naive. [1]
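As a minimal sketch of this idea, the classifier below estimates class priors and per-attribute conditional probabilities from a handful of hypothetical rows (values invented in the spirit of the dataset, not taken from it) and predicts the most probable class:

```python
from collections import Counter, defaultdict

# Illustrative rows: (attribute values, class label). Hypothetical data.
rows = [
    ({"safety": "low",  "persons": "2"},    "unacc"),
    ({"safety": "high", "persons": "4"},    "acc"),
    ({"safety": "high", "persons": "more"}, "acc"),
    ({"safety": "med",  "persons": "2"},    "unacc"),
    ({"safety": "high", "persons": "4"},    "acc"),
    ({"safety": "low",  "persons": "4"},    "unacc"),
]

priors = Counter(label for _, label in rows)   # class counts
conds = defaultdict(Counter)                   # (class, attr) -> value counts
domains = defaultdict(set)                     # attr -> observed values
for feats, label in rows:
    for attr, value in feats.items():
        conds[(label, attr)][value] += 1
        domains[attr].add(value)

def predict(feats):
    """Pick the class maximizing P(c) * prod over attrs of P(value | c)."""
    best, best_p = None, -1.0
    for label, n in priors.items():
        p = n / len(rows)
        for attr, value in feats.items():
            # Laplace smoothing so an unseen value does not zero the product.
            p *= (conds[(label, attr)][value] + 1) / (n + len(domains[attr]))
        if p > best_p:
            best, best_p = label, p
    return best

print(predict({"safety": "low", "persons": "2"}))   # → unacc
```

In Weka's terms this corresponds to the NaiveBayes classifier applied to nominal attributes; the real run in figure 1.1 uses all six attributes and 10-fold cross-validation.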

figure 1.1 Naïve Bayes summary using Cross-Validation

As shown in figure 1.1, the classifier correctly estimated 1500 out of 1728 instances (86.8056%), which means the accuracy of the model is about 86.8%. The remaining 228 out of 1728 instances (13.1944%) were incorrectly classified.

figure 1.2 Detailed accuracy by class

There are three classes: unacc (unacceptable), acc (acceptable) and good, which refer to the car's acceptability.

TP Rate: the rate of true positives (instances correctly classified as their actual class).

FP Rate: the rate of false positives (instances incorrectly classified as a class they do not belong to).

Precision: the number of true positives for a class divided by the total number of instances classified as that class.

Recall: the number of true positives for a class divided by the actual number of instances in that class.

F-Measure: the harmonic mean of precision and recall, calculated as 2 × precision × recall / (precision + recall).
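These definitions can be checked directly against a confusion matrix. The sketch below computes the per-class metrics for an illustrative 3×3 matrix (not the exact cells of figure 1.3):

```python
def per_class_metrics(cm):
    """Per-class (precision, recall, F-measure) from a confusion matrix,
    where cm[i][j] = instances of true class i predicted as class j."""
    n = len(cm)
    out = []
    for i in range(n):
        tp = cm[i][i]
        predicted = sum(cm[r][i] for r in range(n))   # column sum
        actual = sum(cm[i])                           # row sum
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f = 2 * precision * recall / denom if denom else 0.0
        out.append((precision, recall, f))
    return out

# Illustrative matrix for three classes (a, b, c).
cm = [[90,  8, 2],
      [10, 70, 5],
      [ 0,  6, 9]]
for p, r, f in per_class_metrics(cm):
    print(round(p, 3), round(r, 3), round(f, 3))
```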

figure 1.3 Naïve Bayes Confusion Matrix

The figure above shows that 1161 instances of class a (unacc) were correctly classified as a, while 49 instances of class a were incorrectly classified as b or c; 272 instances of class b (acc) were correctly classified as b, while 112 instances of class b were incorrectly classified as a or c; and 67 instances of class c (good) were correctly classified as c, while 67 instances of class c were incorrectly classified as b.

B. Clustering: Simple K-Means

The second technique used in this report is clustering with K-means. K-means is an unsupervised statistical clustering technique that is simple yet has proved to be an effective tool.

The aim of this step is to show that the better the features a car has, the more accepted by users it will be.

In this step, the data will be grouped into different clusters to see the relations between the features and the acceptance. We decided on seven clusters after using the elbow method (see the appendix).
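Because all attributes are nominal, they must be turned into numeric vectors before K-means can measure distances. A common approach, sketched below with two hypothetical attributes and a deterministic initialization (real K-means, including Weka's SimpleKMeans, seeds randomly), is one-hot encoding followed by Lloyd's iterations:

```python
def one_hot(row, domains):
    """Encode one nominal row as a flat 0/1 vector."""
    vec = []
    for (attr, values), value in zip(domains, row):
        vec.extend(1.0 if value == v else 0.0 for v in values)
    return vec

def kmeans(points, k, iters=10):
    """Minimal Lloyd's algorithm; initialized from the first k points
    to keep the sketch deterministic."""
    centroids = [list(p) for p in points[:k]]
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest centroid (squared distance).
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        # Recompute each centroid as the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = [sum(col) / len(members)
                                for col in zip(*members)]
    return centroids, clusters

# Two hypothetical attributes and four illustrative rows.
domains = [("safety", ["low", "med", "high"]),
           ("persons", ["2", "4", "more"])]
rows = [("low", "2"), ("high", "more"), ("low", "2"), ("high", "more")]
points = [one_hot(r, domains) for r in rows]
centroids, clusters = kmeans(points, k=2)
print([len(c) for c in clusters])   # → [2, 2]
```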

figure 2.1 Simple K-Means

After applying K-means to the dataset, the clusters are labeled as follows: clusters 0, 2, 3 and 5 are unacceptable, clusters 4 and 6 are acceptable, and cluster 1 is good.

figure 2.2 Clustered Instances

For the clustered instances output, it shows that 27% of the data was clustered in cluster 0, 12% in cluster 1, 14% in cluster 2, 15% in cluster 3, 11% in cluster 4, 12% in cluster 5, and finally 9% in cluster 6.

figure 2.3 Final cluster centroids

Simple K-means results:

Table 1.2 Cluster 0 model

Attribute | Value
buying    | vhigh
maint     | med
doors     | 5more
persons   | 2
lug_boot  | small
safety    | low
accept    | unacc

Table 1.3 Cluster 4 model

Attribute | Value
buying    | high
maint     | high
doors     | 3
persons   | more
lug_boot  | med
safety    | high
accept    | acc

Clusters 0 and 4 are a great example of how the features can affect user acceptance. As shown in cluster 0, if the prices are high while the features are poor, users will refuse the car because they will feel their money is not well spent; on the other hand, in cluster 4, users accept paying a high amount to buy and maintain the car if they get better options in return.

C. Association Rules: Apriori

The Apriori algorithm is a classical data mining algorithm used to mine frequent itemsets and derive relevant association rules. [2]

This method shows the relations between the attributes and the class, and finds the best rules for the data based on a minimum confidence threshold, which was set to 90%.
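A minimal sketch of the idea, using only itemsets of size one and two on a handful of illustrative rows (the full Apriori algorithm iterates to larger itemsets and prunes by support at each level), looks like this:

```python
from itertools import combinations
from collections import Counter

# Each instance as a set of attribute=value items (hypothetical rows).
transactions = [
    {"persons=2", "safety=low", "accept=unacc"},
    {"persons=2", "safety=med", "accept=unacc"},
    {"persons=2", "safety=high", "accept=unacc"},
    {"persons=4", "safety=high", "accept=acc"},
    {"persons=more", "safety=low", "accept=unacc"},
    {"persons=4", "safety=low", "accept=unacc"},
]

min_support, min_conf = 2, 0.9

# Count 1- and 2-item sets across all transactions.
counts = Counter()
for t in transactions:
    for size in (1, 2):
        for combo in combinations(sorted(t), size):
            counts[combo] += 1

# Turn frequent pairs into rules that meet the confidence threshold:
# confidence(A => B) = support(A and B) / support(A).
rules = []
for itemset, sup in counts.items():
    if len(itemset) == 2 and sup >= min_support:
        for i in range(2):
            antecedent, consequent = itemset[i], itemset[1 - i]
            conf = sup / counts[(antecedent,)]
            if conf >= min_conf:
                rules.append((antecedent, consequent, conf))

for a, c, conf in sorted(rules):
    print(f"{a} => {c}  (conf={conf:.2f})")
```

On these toy rows the two surviving rules mirror the flavor of the real output: persons=2 and safety=low each imply accept=unacc with 100% confidence.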

We extracted 13 rules for this data set as shown below in figure 3.1:

figure 3.1 Apriori

The confidence level for all the rules is 100%. The rules show the cases where the cars are unacceptable to users when:

1. The capacity is 2 persons only.
2. The car has low safety.
3. 2 persons with a small luggage boot.
4. 2 persons with a medium luggage boot.
5. 2 persons with a big luggage boot.
6. 2 persons and low safety.
7. 2 persons and medium safety.
8. 2 persons and high safety.
9. 4 persons and low safety.
10. More persons with low safety.
11. Small luggage boot with low safety.
12. Medium luggage boot with low safety.
13. Big luggage boot with low safety.

CONCLUSION: From the findings and outputs of applying the three data mining techniques, several points can be concluded:

The accuracy of the Naïve Bayes estimates is approximately 87%, which is considered a good result.

K-means labeled the clusters as follows: clusters 0, 2, 3 and 5 are unacceptable, clusters 4 and 6 are acceptable, and cluster 1 is good.

The best rule and the strongest relationship found by the association rules method is between the number of persons and the unacceptability of the car.


We greatly thank the creator Marko Bohanec and the donor Blaz Zupan for their useful database, which helped us complete our project. We also thank our instructor Dr. Ahmed Zeki and lab assistant Miss Hajer Khalifa for their efforts and the time they spent explaining the material of the course in an interesting way.


[1] "IBM Knowledge Center", 2018. [Online]. Available: [Accessed: 30-Dec-2018].
[2] "Rashmi Jain, Author at HackerEarth Blog", HackerEarth Blog, 2018. [Online]. Available: [Accessed: 30-Dec-2018].
[3] "renatopp/arff-datasets", GitHub, 2018. [Online]. Available: [Accessed: 31-Dec-2018].


1. Choosing the number of clusters by using the elbow method.

Number of clusters | Sum of squared errors
2  | 6577.0
3  | 6073.0
4  | 5727.0
5  | 5596.0
6  | 5303.0
7  | 5077.0
8  | 4974.0
9  | 4889.0
10 | 4802.0
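The elbow choice can be read off numerically: compute how much the sum of squared errors drops with each extra cluster and look for where the drop flattens out.

```python
# SSE values copied from the table above.
sse = {2: 6577.0, 3: 6073.0, 4: 5727.0, 5: 5596.0, 6: 5303.0,
       7: 5077.0, 8: 4974.0, 9: 4889.0, 10: 4802.0}

# Per-step decrease in SSE when adding one more cluster.
drops = {k: sse[k - 1] - sse[k] for k in range(3, 11)}
for k, d in sorted(drops.items()):
    print(f"{k - 1} -> {k} clusters: SSE drops by {d:.0f}")
```

After k = 7 the per-step drop stays around 100 or less, which is one way to read seven clusters as the elbow.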







