INTRODUCTION TO DATA MINING
SECOND EDITION
Pang-Ning Tan, Michigan State University
Michael Steinbach, University of Minnesota
Anuj Karpatne, University of Minnesota
Vipin Kumar, University of Minnesota
330 Hudson Street, NY NY 10013
Director, Portfolio Management: Engineering, Computer Science & Global Editions: Julian Partridge
Specialist, Higher Ed Portfolio Management: Matt Goldstein
Portfolio Management Assistant: Meghan Jacoby
Managing Content Producer: Scott Disanno
Content Producer: Carole Snyder
Web Developer: Steve Wright
Rights and Permissions Manager: Ben Ferrini
Manufacturing Buyer, Higher Ed, Lake Side Communications Inc (LSC): Maura Zaldivar-Garcia
Inventory Manager: Ann Lam
Product Marketing Manager: Yvonne Vannatta
Field Marketing Manager: Demetrius Hall
Marketing Assistant: Jon Bryant
Cover Designer: Joyce Wells, jWells Design
Full-Service Project Management: Chandrasekar Subramanian, SPi Global
Copyright © 2019 Pearson Education, Inc. All rights reserved. Manufactured in the United States of America. This publication is protected by Copyright, and permission should be obtained from the publisher prior to any prohibited reproduction, storage in a retrieval system, or transmission in any form or by any means, electronic, mechanical, photocopying, recording, or likewise. For information regarding permissions, request forms and the appropriate contacts within the Pearson Education Global Rights & Permissions department, please visit www.pearsonhighed.com/permissions/.
Many of the designations by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and the publisher was aware of a trademark claim, the designations have been printed in initial caps or all caps.
Library of Congress Cataloging-in-Publication Data on File
Names: Tan, Pang-Ning, author. | Steinbach, Michael, author. | Karpatne, Anuj, author. | Kumar, Vipin, 1956- author.
Title: Introduction to Data Mining / Pang-Ning Tan, Michigan State University, Michael Steinbach, University of Minnesota, Anuj Karpatne, University of Minnesota, Vipin Kumar, University of Minnesota.
Description: Second edition. | New York, NY: Pearson Education, [2019] | Includes bibliographical references and index.
Identifiers: LCCN 2017048641 | ISBN 9780133128901 | ISBN 0133128903
Subjects: LCSH: Data mining.
Classification: LCC QA76.9.D343 T35 2019 | DDC 006.3/12–dc23 LC record available at https://lccn.loc.gov/2017048641
ISBN-10: 0133128903
ISBN-13: 9780133128901
To our families…
Preface to the Second Edition
Since the first edition, roughly 12 years ago, much has changed in the field of data analysis. The volume and variety of data being collected continue to increase, as has the rate (velocity) at which it is being collected and used to make decisions. Indeed, the term Big Data has been used to refer to the massive and diverse data sets now available. In addition, the term data science has been coined to describe an emerging area that applies tools and techniques from various fields, such as data mining, machine learning, statistics, and many others, to extract actionable insights from data, often big data.
The growth in data has created numerous opportunities for all areas of data analysis. The most dramatic developments have been in the area of predictive modeling, across a wide range of application domains. For instance, recent advances in neural networks, known as deep learning, have shown impressive results in a number of challenging areas, such as image classification, speech recognition, and text categorization and understanding. While not as dramatic, other areas, e.g., clustering, association analysis, and anomaly detection, have also continued to advance. This new edition is in response to those advances.
Overview
As with the first edition, the second edition of the book provides a comprehensive introduction to data mining and is designed to be accessible and useful to students, instructors, researchers, and professionals. Areas covered include data preprocessing, predictive modeling, association analysis, cluster analysis, anomaly detection, and avoiding false discoveries. The goal is to present fundamental concepts and algorithms for each topic, thus providing the reader with the necessary background for the application of data mining to real problems. As before, classification, association analysis, and cluster analysis are each covered in a pair of chapters. The introductory chapter covers basic concepts, representative algorithms, and evaluation techniques, while the following chapter discusses advanced concepts and algorithms. As before, our objective is to provide the reader with a sound understanding of the foundations of data mining, while still covering many important advanced topics. Because of this approach, the book is useful both as a learning tool and as a reference.
To help readers better understand the concepts that have been presented, we provide an extensive set of examples, figures, and exercises. The solutions to the original exercises, which are already circulating on the web, will be made public. The exercises are mostly unchanged from the last edition, with the exception of new exercises in the chapter on avoiding false discoveries. New exercises for the other chapters and their solutions will be available to instructors via the web. Bibliographic notes are included at the end of each chapter for readers who are interested in more advanced topics, historically important papers, and recent trends. These have also been significantly updated. The book also contains a comprehensive subject and author index.
What is New in the Second Edition?
Some of the most significant improvements in the text have been in the two chapters on classification. The introductory chapter uses the decision tree classifier for illustration, but the discussion on many topics—those that apply across all classification approaches—has been greatly expanded and clarified, including topics such as overfitting, underfitting, the impact of training size, model complexity, model selection, and common pitfalls in model evaluation. Almost every section of the advanced classification chapter has been significantly updated. The material on Bayesian networks, support vector machines, and artificial neural networks has been significantly expanded. We have added a separate section on deep networks to address the current developments in this area. The discussion of evaluation, which occurs in the section on imbalanced classes, has also been updated and improved.
The changes in association analysis are more localized. We have completely reworked the section on the evaluation of association patterns (introductory chapter), as well as the sections on sequence and graph mining (advanced chapter). Changes to cluster analysis are also localized. The introductory chapter added the K-means initialization technique and an updated discussion of cluster evaluation. The advanced clustering chapter adds a new section on spectral graph clustering. Anomaly detection has been greatly revised and expanded. Existing approaches—statistical, nearest neighbor/density-based, and clustering-based—have been retained and updated, while new approaches have been added: reconstruction-based, one-class classification, and information-theoretic. The reconstruction-based approach is illustrated using autoencoder networks that are part of the deep learning paradigm. The data chapter has been updated to include discussions of mutual information and kernel-based techniques.
The last chapter, which discusses how to avoid false discoveries and produce valid results, is completely new and is novel among other contemporary textbooks on data mining. It supplements the discussions in the other chapters with a discussion of the statistical concepts (statistical significance, p-values, false discovery rate, permutation testing, etc.) relevant to avoiding spurious results, and then illustrates these concepts in the context of data mining techniques. This chapter addresses the increasing concern over the validity and reproducibility of results obtained from data analysis. The addition of this last chapter is a recognition of the importance of this topic and an acknowledgment that a deeper understanding of this area is needed for those analyzing data.
The data exploration chapter has been deleted, as have the appendices, from the print edition of the book, but they will remain available on the web. A new appendix provides a brief discussion of scalability in the context of big data.
To the Instructor
As a textbook, this book is suitable for a wide range of students at the advanced undergraduate or graduate level. Since students come to this subject with diverse backgrounds that may not include extensive knowledge of statistics or databases, our book requires minimal prerequisites. No database knowledge is needed, and we assume only a modest background in statistics or mathematics, although such a background will make for easier going in some sections. As before, the book, and more specifically, the chapters covering major data mining topics, are designed to be as self-contained as possible. Thus, the order in which topics can be covered is quite flexible. The core material is covered in chapters 2 (data), 3 (classification), 5 (association analysis), 7 (clustering), and 9 (anomaly detection). We recommend at least a cursory coverage of Chapter 10 (Avoiding False Discoveries) to instill in students some caution when interpreting the results of their data analysis. Although the introductory data chapter (2) should be covered first, the basic classification (3), association analysis (5), and clustering chapters (7) can be covered in any order. Because of the relationship of anomaly detection (9) to classification (3) and clustering (7), these chapters should precede Chapter 9.
Various topics can be selected from the advanced classification, association analysis, and clustering chapters (4, 6, and 8, respectively) to fit the schedule and interests of the instructor and students. We also advise that the lectures be augmented by projects or practical exercises in data mining. Although they are time consuming, such hands-on assignments greatly enhance the value of the course.
Support Materials
Support materials available to all readers of this book are available at http://www-users.cs.umn.edu/~kumar/dmbook.
PowerPoint lecture slides
Suggestions for student projects
Data mining resources, such as algorithms and data sets
Online tutorials that give step-by-step examples for selected data mining techniques described in the book using actual data sets and data analysis software
Additional support materials, including solutions to exercises, are available only to instructors adopting this textbook for classroom use. The book's resources will be mirrored at www.pearsonhighered.com/cs-resources. Comments and suggestions, as well as reports of errors, can be sent to the authors through dmbook@cs.umn.edu.
Acknowledgments
Many people contributed to the first and second editions of the book. We begin by acknowledging our families to whom this book is dedicated. Without their patience and support, this project would have been impossible.
We would like to thank the current and former students of our data mining groups at the University of Minnesota and Michigan State for their contributions. Eui-Hong (Sam) Han and Mahesh Joshi helped with the initial data mining classes. Some of the exercises and presentation slides that they created can be found in the book and its accompanying slides. Students in our data mining groups who provided comments on drafts of the book or who contributed in other ways include Shyam Boriah, Haibin Cheng, Varun Chandola, Eric Eilertson, Levent Ertöz, Jing Gao, Rohit Gupta, Sridhar Iyer, Jung-Eun Lee, Benjamin Mayer, Aysel Ozgur, Uygar Oztekin, Gaurav Pandey, Kashif Riaz, Jerry Scripps, Gyorgy Simon, Hui Xiong, Jieping Ye, and Pusheng Zhang. We would also like to thank the students of our data mining classes at the University of Minnesota and Michigan State University who worked with early drafts of the book and provided invaluable feedback. We specifically note the helpful suggestions of Bernardo Craemer, Arifin Ruslim, Jamshid Vayghan, and Yu Wei.
Joydeep Ghosh (University of Texas) and Sanjay Ranka (University of Florida) class tested early versions of the book. We also received many useful suggestions directly from the following UT students: Pankaj Adhikari, Rajiv Bhatia, Frederic Bosche, Arindam Chakraborty, Meghana Deodhar, Chris Everson, David Gardner, Saad Godil, Todd Hay, Clint Jones, Ajay Joshi, Joonsoo Lee, Yue Luo, Anuj Nanavati, Tyler Olsen, Sunyoung Park, Aashish Phansalkar, Geoff Prewett, Michael Ryoo, Daryl Shannon, and Mei Yang.
Ronald Kostoff (ONR) read an early version of the clustering chapter and offered numerous suggestions. George Karypis provided invaluable LaTeX assistance in creating an author index. Irene Moulitsas also provided assistance with LaTeX and reviewed some of the appendices. Musetta Steinbach was very helpful in finding errors in the figures.
We would like to acknowledge our colleagues at the University of Minnesota and Michigan State who have helped create a positive environment for data mining research. They include Arindam Banerjee, Dan Boley, Joyce Chai, Anil Jain, Ravi Janardan, Rong Jin, George Karypis, Claudia Neuhauser, Haesun Park, William F. Punch, György Simon, Shashi Shekhar, and Jaideep Srivastava. The collaborators on our many data mining projects, who also have our gratitude, include Ramesh Agrawal, Maneesh Bhargava, Steve Cannon, Alok Choudhary, Imme Ebert-Uphoff, Auroop Ganguly, Piet C. de Groen, Fran Hill, Yongdae Kim, Steve Klooster, Kerry Long, Nihar Mahapatra, Rama Nemani, Nikunj Oza, Chris Potter, Lisiane Pruinelli, Nagiza Samatova, Jonathan Shapiro, Kevin Silverstein, Brian Van Ness, Bonnie Westra, Nevin Young, and Zhi-Li Zhang.
The departments of Computer Science and Engineering at the University of Minnesota and Michigan State University provided computing resources and a supportive environment for this project. ARDA, ARL, ARO, DOE, NASA, NOAA, and NSF provided research support for Pang-Ning Tan, Michael Steinbach, Anuj Karpatne, and Vipin Kumar. In particular, Kamal Abdali, Mitra Basu, Dick Brackney, Jagdish Chandra, Joe Coughlan, Michael Coyle, Stephen Davis, Frederica Darema, Richard Hirsch, Chandrika Kamath, Tsengdar Lee, Raju Namburu, N. Radhakrishnan, James Sidoran, Sylvia Spengler, Bhavani Thuraisingham, Walt Tiernin, Maria Zemankova, Aidong Zhang, and Xiaodong Zhang have been supportive of our research in data mining and high-performance computing.
It was a pleasure working with the helpful staff at Pearson Education. In particular, we would like to thank Matt Goldstein, Kathy Smith, Carole Snyder, and Joyce Wells. We would also like to thank George Nichols, who helped with the artwork, and Paul Anagnostopoulos, who provided LaTeX support.
We are grateful to the following Pearson reviewers: Leman Akoglu (Carnegie Mellon University), Chien-Chung Chan (University of Akron), Zhengxin Chen (University of Nebraska at Omaha), Chris Clifton (Purdue University), Joydeep Ghosh (University of Texas, Austin), Nazli Goharian (Illinois Institute of Technology), J. Michael Hardin (University of Alabama), Jingrui He (Arizona State University), James Hearne (Western Washington University), Hillol Kargupta (University of Maryland, Baltimore County and Agnik, LLC), Eamonn Keogh (University of California-Riverside), Bing Liu (University of Illinois at Chicago), Mariofanna Milanova (University of Arkansas at Little Rock), Srinivasan Parthasarathy (Ohio State University), Zbigniew W. Ras (University of North Carolina at Charlotte), Xintao Wu (University of North Carolina at Charlotte), and Mohammed J. Zaki (Rensselaer Polytechnic Institute).
Over the years since the first edition, we have also received numerous comments from readers and students who have pointed out typos and various other issues. We are unable to mention these individuals by name, but their input is much appreciated and has been taken into account for the second edition.
Contents
Preface to the Second Edition
1 Introduction
1.1 What Is Data Mining?
1.2 Motivating Challenges
1.3 The Origins of Data Mining
1.4 Data Mining Tasks
1.5 Scope and Organization of the Book
1.6 Bibliographic Notes
1.7 Exercises
2 Data
2.1 Types of Data
2.1.1 Attributes and Measurement
2.1.2 Types of Data Sets
2.2 Data Quality
2.2.1 Measurement and Data Collection Issues
2.2.2 Issues Related to Applications
2.3 Data Preprocessing
2.3.1 Aggregation
2.3.2 Sampling
2.3.3 Dimensionality Reduction
2.3.4 Feature Subset Selection
2.3.5 Feature Creation
2.3.6 Discretization and Binarization
2.3.7 Variable Transformation
2.4 Measures of Similarity and Dissimilarity
2.4.1 Basics
2.4.2 Similarity and Dissimilarity between Simple Attributes
2.4.3 Dissimilarities between Data Objects
2.4.4 Similarities between Data Objects
2.4.5 Examples of Proximity Measures
2.4.6 Mutual Information
2.4.7 Kernel Functions*
2.4.8 Bregman Divergence*
2.4.9 Issues in Proximity Calculation
2.4.10 Selecting the Right Proximity Measure
2.5 Bibliographic Notes
2.6 Exercises
3 Classification: Basic Concepts and Techniques
3.1 Basic Concepts
3.2 General Framework for Classification
3.3 Decision Tree Classifier
3.3.1 A Basic Algorithm to Build a Decision Tree
3.3.2 Methods for Expressing Attribute Test Conditions
3.3.3 Measures for Selecting an Attribute Test Condition
3.3.4 Algorithm for Decision Tree Induction
3.3.5 Example Application: Web Robot Detection
3.3.6 Characteristics of Decision Tree Classifiers
3.4 Model Overfitting
3.4.1 Reasons for Model Overfitting
3.5 Model Selection
3.5.1 Using a Validation Set
3.5.2 Incorporating Model Complexity
3.5.3 Estimating Statistical Bounds
3.5.4 Model Selection for Decision Trees
3.6 Model Evaluation
3.6.1 Holdout Method
3.6.2 Cross-Validation
3.7 Presence of Hyper-parameters
3.7.1 Hyper-parameter Selection
3.7.2 Nested Cross-Validation
3.8 Pitfalls of Model Selection and Evaluation
3.8.1 Overlap between Training and Test Sets
3.8.2 Use of Validation Error as Generalization Error
3.9 Model Comparison
3.9.1 Estimating the Confidence Interval for Accuracy
3.9.2 Comparing the Performance of Two Models
3.10 Bibliographic Notes
3.11 Exercises
4 Classification: Alternative Techniques
4.1 Types of Classifiers
4.2 Rule-Based Classifier
4.2.1 How a Rule-Based Classifier Works
4.2.2 Properties of a Rule Set
4.2.3 Direct Methods for Rule Extraction
4.2.4 Indirect Methods for Rule Extraction
4.2.5 Characteristics of Rule-Based Classifiers
4.3 Nearest Neighbor Classifiers
4.3.1 Algorithm
4.3.2 Characteristics of Nearest Neighbor Classifiers
4.4 Naïve Bayes Classifier
4.4.1 Basics of Probability Theory
4.4.2 Naïve Bayes Assumption
4.5 Bayesian Networks
4.5.1 Graphical Representation
4.5.2 Inference and Learning
4.5.3 Characteristics of Bayesian Networks
4.6 Logistic Regression
4.6.1 Logistic Regression as a Generalized Linear Model
4.6.2 Learning Model Parameters
4.6.3 Characteristics of Logistic Regression
4.7 Artificial Neural Network (ANN)
4.7.1 Perceptron
4.7.2 Multi-layer Neural Network
4.7.3 Characteristics of ANN
4.8 Deep Learning
4.8.1 Using Synergistic Loss Functions
4.8.2 Using Responsive Activation Functions
4.8.3 Regularization
4.8.4 Initialization of Model Parameters
4.8.5 Characteristics of Deep Learning
4.9 Support Vector Machine (SVM)
4.9.1 Margin of a Separating Hyperplane
4.9.2 Linear SVM
4.9.3 Soft-margin SVM
4.9.4 Nonlinear SVM
4.9.5 Characteristics of SVM
4.10 Ensemble Methods
4.10.1 Rationale for Ensemble Method
4.10.2 Methods for Constructing an Ensemble Classifier
4.10.3 Bias-Variance Decomposition
4.10.4 Bagging
4.10.5 Boosting
4.10.6 Random Forests
4.10.7 Empirical Comparison among Ensemble Methods
4.11 Class Imbalance Problem
4.11.1 Building Classifiers with Class Imbalance
4.11.2 Evaluating Performance with Class Imbalance
4.11.3 Finding an Optimal Score Threshold
4.11.4 Aggregate Evaluation of Performance
4.12 Multiclass Problem
4.13 Bibliographic Notes
4.14 Exercises
5 Association Analysis: Basic Concepts and Algorithms
5.1 Preliminaries
5.2 Frequent Itemset Generation
5.2.1 The Apriori Principle
5.2.2 Frequent Itemset Generation in the Apriori Algorithm
5.2.3 Candidate Generation and Pruning
5.2.4 Support Counting
5.2.5 Computational Complexity
5.3 Rule Generation
5.3.1 Confidence-Based Pruning
5.3.2 Rule Generation in Apriori Algorithm
5.3.3 An Example: Congressional Voting Records
5.4 Compact Representation of Frequent Itemsets
5.4.1 Maximal Frequent Itemsets
5.4.2 Closed Itemsets
5.5 Alternative Methods for Generating Frequent Itemsets*
5.6 FP-Growth Algorithm*
5.6.1 FP-Tree Representation
5.6.2 Frequent Itemset Generation in FP-Growth Algorithm
5.7 Evaluation of Association Patterns
5.7.1 Objective Measures of Interestingness
5.7.2 Measures beyond Pairs of Binary Variables
5.7.3 Simpson's Paradox
5.8 Effect of Skewed Support Distribution
5.9 Bibliographic Notes
5.10 Exercises
6 Association Analysis: Advanced Concepts
6.1 Handling Categorical Attributes
6.2 Handling Continuous Attributes
6.2.1 Discretization-Based Methods
6.2.2 Statistics-Based Methods
6.2.3 Non-discretization Methods
6.3 Handling a Concept Hierarchy
6.4 Sequential Patterns
6.4.1 Preliminaries
6.4.2 Sequential Pattern Discovery
6.4.3 Timing Constraints*
6.4.4 Alternative Counting Schemes*
6.5 Subgraph Patterns
6.5.1 Preliminaries
6.5.2 Frequent Subgraph Mining
6.5.3 Candidate Generation
6.5.4 Candidate Pruning
6.5.5 Support Counting
6.6 Infrequent Patterns*
6.6.1 Negative Patterns
6.6.2 Negatively Correlated Patterns
6.6.3 Comparisons among Infrequent Patterns, Negative Patterns, and Negatively Correlated Patterns
6.6.4 Techniques for Mining Interesting Infrequent Patterns
6.6.5 Techniques Based on Mining Negative Patterns
6.6.6 Techniques Based on Support Expectation
6.7 Bibliographic Notes
6.8 Exercises
7 Cluster Analysis: Basic Concepts and Algorithms
7.1 Overview
7.1.1 What Is Cluster Analysis?
7.1.2 Different Types of Clusterings
7.1.3 Different Types of Clusters
7.2 K-means
7.2.1 The Basic K-means Algorithm
7.2.2 K-means: Additional Issues
7.2.3 Bisecting K-means
7.2.4 K-means and Different Types of Clusters
7.2.5 Strengths and Weaknesses
7.2.6 K-means as an Optimization Problem
7.3 Agglomerative Hierarchical Clustering
7.3.1 Basic Agglomerative Hierarchical Clustering Algorithm
7.3.2 Specific Techniques
7.3.3 The Lance-Williams Formula for Cluster Proximity
7.3.4 Key Issues in Hierarchical Clustering
7.3.5 Outliers
7.3.6 Strengths and Weaknesses
7.4 DBSCAN
7.4.1 Traditional Density: Center-Based Approach
7.4.2 The DBSCAN Algorithm
7.4.3 Strengths and Weaknesses
7.5 Cluster Evaluation
7.5.1 Overview
7.5.2 Unsupervised Cluster Evaluation Using Cohesion and Separation
7.5.3 Unsupervised Cluster Evaluation Using the Proximity Matrix
7.5.4 Unsupervised Evaluation of Hierarchical Clustering
7.5.5 Determining the Correct Number of Clusters
7.5.6 Clustering Tendency
7.5.7 Supervised Measures of Cluster Validity
7.5.8 Assessing the Significance of Cluster Validity Measures
7.5.9 Choosing a Cluster Validity Measure
7.6 Bibliographic Notes
7.7 Exercises
8 Cluster Analysis: Additional Issues and Algorithms
8.1 Characteristics of Data, Clusters, and Clustering Algorithms
8.1.1 Example: Comparing K-means and DBSCAN
8.1.2 Data Characteristics
8.1.3 Cluster Characteristics
8.1.4 General Characteristics of Clustering Algorithms
8.2 Prototype-Based Clustering
8.2.1 Fuzzy Clustering
8.2.2 Clustering Using Mixture Models
8.2.3 Self-Organizing Maps (SOM)
8.3 Density-Based Clustering
8.3.1 Grid-Based Clustering
8.3.2 Subspace Clustering
8.3.3 DENCLUE: A Kernel-Based Scheme for Density-Based Clustering
8.4 Graph-Based Clustering
8.4.1 Sparsification
8.4.2 Minimum Spanning Tree (MST) Clustering
8.4.3 OPOSSUM: Optimal Partitioning of Sparse Similarities Using METIS
8.4.4 Chameleon: Hierarchical Clustering with Dynamic Modeling
8.4.5 Spectral Clustering
8.4.6 Shared Nearest Neighbor Similarity
8.4.7 The Jarvis-Patrick Clustering Algorithm
8.4.8 SNN Density
8.4.9 SNN Density-Based Clustering
8.5 Scalable Clustering Algorithms
8.5.1 Scalability: General Issues and Approaches
8.5.2 BIRCH
8.5.3 CURE
8.6 Which Clustering Algorithm?
8.7 Bibliographic Notes
8.8 Exercises
9 Anomaly Detection
9.1 Characteristics of Anomaly Detection Problems
9.1.1 A Definition of an Anomaly
9.1.2 Nature of Data
9.1.3 How Anomaly Detection is Used
9.2 Characteristics of Anomaly Detection Methods
9.3 Statistical Approaches
9.3.1 Using Parametric Models
9.3.2 Using Non-parametric Models
9.3.3 Modeling Normal and Anomalous Classes
9.3.4 Assessing Statistical Significance
9.3.5 Strengths and Weaknesses
9.4 Proximity-based Approaches
9.4.1 Distance-based Anomaly Score
9.4.2 Density-based Anomaly Score
9.4.3 Relative Density-based Anomaly Score
9.4.4 Strengths and Weaknesses
9.5 Clustering-based Approaches
9.5.1 Finding Anomalous Clusters
9.5.2 Finding Anomalous Instances
9.5.3 Strengths and Weaknesses
9.6 Reconstruction-based Approaches
9.6.1 Strengths and Weaknesses
9.7 One-class Classification
9.7.1 Use of Kernels
9.7.2 The Origin Trick
9.7.3 Strengths and Weaknesses
9.8 Information Theoretic Approaches
9.8.1 Strengths and Weaknesses
9.9 Evaluation of Anomaly Detection
9.10 Bibliographic Notes
9.11 Exercises
10 Avoiding False Discoveries
10.1 Preliminaries: Statistical Testing
10.1.1 Significance Testing
10.1.2 Hypothesis Testing
10.1.3 Multiple Hypothesis Testing
10.1.4 Pitfalls in Statistical Testing
10.2 Modeling Null and Alternative Distributions
10.2.1 Generating Synthetic Data Sets
10.2.2 Randomizing Class Labels
10.2.3 Resampling Instances
10.2.4 Modeling the Distribution of the Test Statistic
10.3 Statistical Testing for Classification
10.3.1 Evaluating Classification Performance
10.3.2 Binary Classification as Multiple Hypothesis Testing
10.3.3 Multiple Hypothesis Testing in Model Selection
10.4 Statistical Testing for Association Analysis
10.4.1 Using Statistical Models
10.4.2 Using Randomization Methods
10.5 Statistical Testing for Cluster Analysis
10.5.1 Generating a Null Distribution for Internal Indices
10.5.2 Generating a Null Distribution for External Indices
10.5.3 Enrichment
10.6 Statistical Testing for Anomaly Detection
10.7 Bibliographic Notes
10.8 Exercises
Author Index
Subject Index
Copyright Permissions
1 Introduction
Rapid advances in data collection and storage technology, coupled with the ease with which data can be generated and disseminated, have triggered the explosive growth of data, leading to the current age of big data. Deriving actionable insights from these large data sets is increasingly important in decision making across almost all areas of society, including business and industry; science and engineering; medicine and biotechnology; and government and individuals. However, the amount of data (volume), its complexity (variety), and the rate at which it is being collected and processed (velocity) have simply become too great for humans to analyze unaided. Thus, there is a great need for automated tools for extracting useful information from the big data despite the challenges posed by its enormity and diversity.
Data mining blends traditional data analysis methods with sophisticated algorithms for processing this abundance of data. In this introductory chapter, we present an overview of data mining and outline the key topics to be covered in this book. We start with a description of some applications that require more advanced techniques for data analysis.
Business and Industry
Point-of-sale data collection (bar code scanners, radio frequency identification (RFID), and smart card technology) have allowed retailers to collect up-to-the-minute data about customer purchases at the checkout counters of their stores. Retailers can utilize this information, along with other business-critical data, such as web server logs from e-commerce websites and customer service records from call centers, to help them better understand the needs of their customers and make more informed business decisions.
Data mining techniques can be used to support a wide range of business intelligence applications, such as customer profiling, targeted marketing, workflow management, store layout, fraud detection, and automated buying and selling. An example of the last application is high-speed stock trading, where decisions on buying and selling have to be made in less than a second using data about financial transactions. Data mining can also help retailers answer important business questions, such as “Who are the most profitable customers?” “What products can be cross-sold or up-sold?” and “What is the revenue outlook of the company for next year?” These questions have inspired the development of such data mining techniques as association analysis (Chapters 5 and 6).
As the Internet continues to revolutionize the way we interact and make decisions in our everyday lives, we are generating massive amounts of data about our online experiences, e.g., web browsing, messaging, and posting on social networking websites. This has opened several opportunities for business applications that use web data. For example, in the e-commerce sector, data about our online viewing or shopping preferences can be used to provide personalized recommendations of products. Data mining also plays a prominent role in supporting several other Internet-based services, such as filtering spam messages, answering search queries, and suggesting social updates and connections. The large corpus of text, images, and videos available on the Internet has enabled a number of advancements in data mining methods, including deep learning, which is discussed in Chapter 4. These developments have led to great advances in a number of applications, such as object recognition, natural language translation, and autonomous driving.
Another domain that has undergone a rapid big data transformation is the use of mobile sensors and devices, such as smart phones and wearable computing devices. With better sensor technologies, it has become possible to collect a variety of information about our physical world using low-cost sensors embedded on everyday objects that are connected to each other, termed the Internet of Things (IOT). This deep integration of physical sensors in digital systems is beginning to generate large amounts of diverse and distributed data about our environment, which can be used for designing convenient, safe, and energy-efficient home systems, as well as for urban planning of smart cities.
Medicine, Science, and Engineering
Researchers in medicine, science, and engineering are rapidly accumulating data that is key to significant new discoveries. For example, as an important step toward improving our understanding of the Earth's climate system, NASA has deployed a series of Earth-orbiting satellites that continuously generate global observations of the land surface, oceans, and atmosphere. However, because of the size and spatio-temporal nature of the data, traditional methods are often not suitable for analyzing these data sets. Techniques developed in data mining can aid Earth scientists in answering questions such as the following: “What is the relationship between the frequency and intensity of ecosystem disturbances such as droughts and hurricanes to global warming?” “How is land surface precipitation and temperature affected by ocean surface temperature?” and “How well can we predict the beginning and end of the growing season for a region?”
As another example, researchers in molecular biology hope to use the large amounts of genomic data to better understand the structure and function of genes. In the past, traditional methods in molecular biology allowed scientists to study only a few genes at a time in a given experiment. Recent breakthroughs in microarray technology have enabled scientists to compare the behavior of thousands of genes under various situations. Such comparisons can help determine the function of each gene, and perhaps isolate the genes responsible for certain diseases. However, the noisy, high-dimensional nature of data requires new data analysis methods. In addition to analyzing gene expression data, data mining can also be used to address other important biological challenges such as protein structure prediction, multiple sequence alignment, the modeling of biochemical pathways, and phylogenetics.
Another example is the use of data mining techniques to analyze electronic health record (EHR) data, which has become increasingly available. Not very long ago, studies of patients required manually examining the physical records of individual patients and extracting very specific pieces of information pertinent to the particular question being investigated. EHRs allow for a faster and broader exploration of such data. However, there are significant challenges since the observations on any one patient typically occur during their visits to a doctor or hospital and only a small number of details about the health of the patient are measured during any particular visit.
Currently, EHR analysis focuses on simple types of data, e.g., a patient's blood pressure or the diagnosis code of a disease. However, large amounts of more complex types of medical data are also being collected, such as electrocardiograms (ECGs) and neuroimages from magnetic resonance imaging (MRI) or functional Magnetic Resonance Imaging (fMRI). Although challenging to analyze, this data also provides vital information about patients. Integrating and analyzing such data, with traditional EHR and genomic data, is one of the capabilities needed to enable precision medicine, which aims to provide more personalized patient care.
1.1 What Is Data Mining?
Data mining is the process of automatically discovering useful information in large data repositories. Data mining techniques are deployed to scour large data sets in order to find novel and useful patterns that might otherwise remain unknown. They also provide the capability to predict the outcome of a future observation, such as the amount a customer will spend at an online or a brick-and-mortar store.
Not all information discovery tasks are considered to be data mining. Examples include queries, e.g., looking up individual records in a database or finding web pages that contain a particular set of keywords. This is because such tasks can be accomplished through simple interactions with a database management system or an information retrieval system. These systems rely on traditional computer science techniques, which include sophisticated indexing structures and query processing algorithms, for efficiently organizing and retrieving information from large data repositories. Nonetheless, data mining techniques have been used to enhance the performance of such systems by improving the quality of the search results based on their relevance to the input queries.
Data Mining and Knowledge Discovery in Databases
Data mining is an integral part of knowledge discovery in databases (KDD), which is the overall process of converting raw data into useful information, as shown in Figure 1.1. This process consists of a series of steps, from data preprocessing to postprocessing of data mining results.
Figure 1.1. The process of knowledge discovery in databases (KDD).
The input data can be stored in a variety of formats (flat files, spreadsheets, or relational tables) and may reside in a centralized data repository or be distributed across multiple sites. The purpose of preprocessing is to transform the raw input data into an appropriate format for subsequent analysis. The steps involved in data preprocessing include fusing data from multiple sources, cleaning data to remove noise and duplicate observations, and selecting records and features that are relevant to the data mining task at hand. Because of the many ways data can be collected and stored, data preprocessing is perhaps the most laborious and time-consuming step in the overall knowledge discovery process.
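To make the preprocessing step concrete, the following is a minimal sketch in Python (using the pandas library) of fusing two sources, cleaning the result, and selecting relevant records and features. The file names and column names (store_sales.csv, web_sessions.csv, customer_id, purchase_amount, and so on) are hypothetical placeholders, not part of any example in this book.

```python
import pandas as pd

# Hypothetical sources: point-of-sale records and a web-session summary.
sales = pd.read_csv("store_sales.csv")
sessions = pd.read_csv("web_sessions.csv")

# Fuse the two sources on a shared customer identifier.
data = sales.merge(sessions, on="customer_id", how="inner")

# Clean: drop exact duplicate observations and rows missing the target value.
data = data.drop_duplicates()
data = data.dropna(subset=["purchase_amount"])

# Select only the records and features relevant to the task at hand.
recent = data[data["year"] >= 2018]
features = recent[["customer_id", "num_visits", "avg_basket_size", "purchase_amount"]]
print(features.head())
```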
“Closing the loop” is a phrase often used to refer to the process of integrating data mining results into decision support systems. For example, in business applications, the insights offered by data mining results can be integrated with campaign management tools so that effective marketing promotions can be conducted and tested. Such integration requires a postprocessing step to ensure that only valid and useful results are incorporated into the decision support system. An example of postprocessing is visualization, which allows analysts to explore the data and the data mining results from a variety of viewpoints. Hypothesis testing methods can also be applied during postprocessing to eliminate spurious data mining results. (See Chapter 10.)
1.2 Motivating Challenges
As mentioned earlier, traditional data analysis techniques have often encountered practical difficulties in meeting the challenges posed by big data applications. The following are some of the specific challenges that motivated the development of data mining.
Scalability
Because of advances in data generation and collection, data sets with sizes of terabytes, petabytes, or even exabytes are becoming common. If data mining algorithms are to handle these massive data sets, they must be scalable. Many data mining algorithms employ special search strategies to handle exponential search problems. Scalability may also require the implementation of novel data structures to access individual records in an efficient manner. For instance, out-of-core algorithms may be necessary when processing data sets that cannot fit into main memory. Scalability can also be improved by using sampling or developing parallel and distributed algorithms. A general overview of techniques for scaling up data mining algorithms is given in Appendix F.
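As a small illustration of the out-of-core idea mentioned above, the following sketch streams a file through pandas in fixed-size chunks and aggregates it incrementally, so only one chunk needs to fit in main memory at a time. The file name and column names are made up for illustration.

```python
import pandas as pd

# Sum transaction amounts per store without loading the whole file at once.
totals = {}
for chunk in pd.read_csv("transactions.csv", chunksize=1_000_000):
    partial = chunk.groupby("store_id")["amount"].sum()
    for store, amount in partial.items():
        totals[store] = totals.get(store, 0.0) + amount

print(sorted(totals.items())[:5])  # a few of the accumulated per-store totals
```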
High Dimensionality
It is now common to encounter data sets with hundreds or thousands of attributes instead of the handful common a few decades ago. In bioinformatics, progress in microarray technology has produced gene expression data involving thousands of features. Data sets with temporal or spatial components also tend to have high dimensionality. For example, consider a data set that contains measurements of temperature at various locations. If the temperature measurements are taken repeatedly for an extended period, the number of dimensions (features) increases in proportion to the number of measurements taken. Traditional data analysis techniques that were developed for low-dimensional data often do not work well for such high-dimensional data due to issues such as the curse of dimensionality (to be discussed in Chapter 2). Also, for some data analysis algorithms, the computational complexity increases rapidly as the dimensionality (the number of features) increases.
Heterogeneous and Complex Data
Traditional data analysis methods often deal with data sets containing attributes of the same type, either continuous or categorical. As the role of data mining in business, science, medicine, and other fields has grown, so has the need for techniques that can handle heterogeneous attributes. Recent years have also seen the emergence of more complex data objects. Examples of such non-traditional types of data include web and social media data containing text, hyperlinks, images, audio, and videos; DNA data with sequential and three-dimensional structure; and climate data that consists of measurements (temperature, pressure, etc.) at various times and locations on the Earth's surface. Techniques developed for mining such complex objects should take into consideration relationships in the data, such as temporal and spatial autocorrelation, graph connectivity, and parent-child relationships between the elements in semi-structured text and XML documents.
Data Ownership and Distribution
Sometimes, the data needed for an analysis is not stored in one location or owned by one organization. Instead, the data is geographically distributed among resources belonging to multiple entities. This requires the development of distributed data mining techniques. The key challenges faced by distributed data mining algorithms include the following: (1) how to reduce the amount of communication needed to perform the distributed computation, (2) how to effectively consolidate the data mining results obtained from multiple sources, and (3) how to address data security and privacy issues.
Non-traditional Analysis
The traditional statistical approach is based on a hypothesize-and-test paradigm. In other words, a hypothesis is proposed, an experiment is designed to gather the data, and then the data is analyzed with respect to the hypothesis. Unfortunately, this process is extremely labor-intensive. Current data analysis tasks often require the generation and evaluation of thousands of hypotheses, and consequently, the development of some data mining techniques has been motivated by the desire to automate the process of hypothesis generation and evaluation. Furthermore, the data sets analyzed in data mining are typically not the result of a carefully designed experiment and often represent opportunistic samples of the data, rather than random samples.
1.3 The Origins of Data Mining
While data mining has traditionally been viewed as an intermediate process within the KDD framework, as shown in Figure 1.1, it has emerged over the years as an academic field within computer science, focusing on all aspects of KDD, including data preprocessing, mining, and postprocessing. Its origin can be traced back to the late 1980s, following a series of workshops organized on the topic of knowledge discovery in databases. The workshops brought together researchers from different disciplines to discuss the challenges and opportunities in applying computational techniques to extract actionable knowledge from large databases. The workshops quickly grew into hugely popular conferences that were attended by researchers and practitioners from both academia and industry. The success of these conferences, along with the interest shown by businesses and industry in recruiting new hires with a data mining background, has fueled the tremendous growth of this field.
The field was initially built upon the methodology and algorithms that researchers had previously used. In particular, data mining researchers draw upon ideas, such as (1) sampling, estimation, and hypothesis testing from statistics and (2) search algorithms, modeling techniques, and learning theories from artificial intelligence, pattern recognition, and machine learning. Data mining has also been quick to adopt ideas from other areas, including optimization, evolutionary computing, information theory, signal processing, visualization, and information retrieval, and to extend them to solve the challenges of mining big data.
A number of other areas also play key supporting roles. In particular, database systems are needed to provide support for efficient storage, indexing, and query processing. Techniques from high performance (parallel) computing are often important in addressing the massive size of some data sets. Distributed techniques can also help address the issue of size and are essential when the data cannot be gathered in one location. Figure 1.2 shows the relationship of data mining to other areas.
Figure 1.2. Data mining as a confluence of many disciplines.
Data Science and Data-Driven Discovery
Data science is an interdisciplinary field that studies and applies tools and techniques for deriving useful insights from data. Although data science is regarded as an emerging field with a distinct identity of its own, the tools and techniques often come from many different areas of data analysis, such as data mining, statistics, AI, machine learning, pattern recognition, database technology, and distributed and parallel computing. (See Figure 1.2.)
The emergence of data science as a new field is a recognition that, often, none of the existing areas of data analysis provides a complete set of tools for the data analysis tasks that are often encountered in emerging applications. Instead, a broad range of computational, mathematical, and statistical skills is often required. To illustrate the challenges that arise in analyzing such data, consider the following example. Social media and the Web present new opportunities for social scientists to observe and quantitatively measure human behavior on a large scale. To conduct such a study, social scientists work with analysts who possess skills in areas such as web mining, natural language processing (NLP), network analysis, data mining, and statistics. Compared to more traditional research in social science, which is often based on surveys, this analysis requires a broader range of skills and tools, and involves far larger amounts of data. Thus, data science is, by necessity, a highly interdisciplinary field that builds on the continuing work of many fields.
The data-driven approach of data science emphasizes the direct discovery of patterns and relationships from data, especially in large quantities of data, often without the need for extensive domain knowledge. A notable example of the success of this approach is represented by advances in neural networks, i.e., deep learning, which have been particularly successful in areas which have long proved challenging, e.g., recognizing objects in photos or videos and words in speech, as well as in other application areas. However, note that this is just one example of the success of data-driven approaches, and dramatic improvements have also occurred in many other areas of data analysis. Many of these developments are topics described later in this book.
Some cautions on potential limitations of a purely data-driven approach are given in the Bibliographic Notes.
1.4 Data Mining Tasks
Data mining tasks are generally divided into two major categories:
Predictive tasks The objective of these tasks is to predict the value of a particular attribute based on the values of other attributes. The attribute to be predicted is commonly known as the target or dependent variable, while the attributes used for making the prediction are known as the explanatory or independent variables.
Descriptive tasks Here, the objective is to derive patterns (correlations, trends, clusters, trajectories, and anomalies) that summarize the underlying relationships in data. Descriptive data mining tasks are often exploratory in nature and frequently require postprocessing techniques to validate and explain the results.
Figure 1.3 illustrates four of the core data mining tasks that are described in the remainder of this book.
Figure 1.3. Four of the core data mining tasks.
Predictive modeling refers to the task of building a model for the target variable as a function of the explanatory variables. There are two types of predictive modeling tasks: classification, which is used for discrete target variables, and regression, which is used for continuous target variables. For example, predicting whether a web user will make a purchase at an online bookstore is a classification task because the target variable is binary-valued. On the other hand, forecasting the future price of a stock is a regression task because price is a continuous-valued attribute. The goal of both tasks is to learn a model that minimizes the error between the predicted and true values of the target variable. Predictive modeling can be used to identify customers who will respond to a marketing campaign, predict disturbances in the Earth's ecosystem, or judge whether a patient has a particular disease based on the results of medical tests.
Example 1.1 (Predicting the Type of a Flower). Consider the task of predicting a species of flower based on the characteristics of the flower. In particular, consider classifying an Iris flower as one of the following three Iris species: Setosa, Versicolour, or Virginica. To perform this task, we need a data set containing the characteristics of various flowers of these three species. A data set with this type of information is the well-known Iris data set from the UCI Machine Learning Repository at http://www.ics.uci.edu/~mlearn. In addition to the species of a flower, this data set contains four other attributes: sepal width, sepal length, petal length, and petal width. Figure 1.4 shows a plot of petal width versus petal length for the 150 flowers in the Iris data set. Petal width is broken into the categories low, medium, and high, which correspond to the intervals [0, 0.75), [0.75, 1.75), and [1.75, ∞), respectively. Also, petal length is broken into categories low, medium, and high, which correspond to the intervals [0, 2.5), [2.5, 5), and [5, ∞), respectively. Based on these categories of petal width and length, the following rules can be derived:
Petal width low and petal length low implies Setosa.
Petal width medium and petal length medium implies Versicolour.
Petal width high and petal length high implies Virginica.
While these rules do not classify all the flowers, they do a good (but not perfect) job of classifying most of the flowers. Note that flowers from the Setosa species are well separated from the Versicolour and Virginica species with respect to petal width and length, but the latter two species overlap somewhat with respect to these attributes.
Figure 1.4. Petal width versus petal length for 150 Iris flowers.
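The three rules of Example 1.1 can be written down directly as a short program. The sketch below is one possible rendering in Python, applied to the copy of the Iris data set that ships with scikit-learn (which spells the second species "versicolor"); it only illustrates how such hand-derived rules would be applied, not the classification methods developed in Chapters 3 and 4.

```python
from sklearn.datasets import load_iris

# Feature order in the bundled Iris data: sepal length, sepal width,
# petal length, petal width (all in centimeters).
iris = load_iris()

def rule_classify(petal_length, petal_width):
    """Apply the petal-width/petal-length rules from Example 1.1."""
    if petal_width < 0.75 and petal_length < 2.5:
        return "setosa"
    if 0.75 <= petal_width < 1.75 and 2.5 <= petal_length < 5.0:
        return "versicolor"
    if petal_width >= 1.75 and petal_length >= 5.0:
        return "virginica"
    return "unclassified"  # the rules do not cover every flower

predicted = [rule_classify(pl, pw) for _, _, pl, pw in iris.data]
actual = [iris.target_names[t] for t in iris.target]
correct = sum(p == a for p, a in zip(predicted, actual))
print(f"{correct} of {len(actual)} flowers classified correctly")
```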
Association analysis is used to discover patterns that describe strongly associated features in the data. The discovered patterns are typically represented in the form of implication rules or feature subsets. Because of the exponential size of its search space, the goal of association analysis is to extract the most interesting patterns in an efficient manner. Useful applications of association analysis include finding groups of genes that have related functionality, identifying web pages that are accessed together, or understanding the relationships between different elements of Earth's climate system.
Example 1.2 (Market Basket Analysis). The transactions shown in Table 1.1 illustrate point-of-sale data collected at the checkout counters of a grocery store. Association analysis can be applied to find items that are frequently bought together by customers. For example, we may discover the rule {Diapers} → {Milk}, which suggests that customers who buy diapers also tend to buy milk. This type of rule can be used to identify potential cross-selling opportunities among related items.
Table 1.1. Market basket data.
Transaction ID   Items
1    {Bread, Butter, Diapers, Milk}
2    {Coffee, Sugar, Cookies, Salmon}
3    {Bread, Butter, Coffee, Diapers, Milk, Eggs}
4    {Bread, Butter, Salmon, Chicken}
5    {Eggs, Bread, Butter}
6    {Salmon, Diapers, Milk}
7    {Bread, Tea, Sugar, Eggs}
8    {Coffee, Sugar, Chicken, Eggs}
9    {Bread, Diapers, Milk, Salt}
10   {Tea, Eggs, Cookies, Diapers, Milk}
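The strength of a rule such as {Diapers} → {Milk} is commonly summarized by its support (the fraction of all transactions that contain both items) and its confidence (the fraction of transactions containing Diapers that also contain Milk). The following minimal sketch computes both quantities directly from the ten transactions of Table 1.1; efficient algorithms for discovering such rules in large data sets are the subject of Chapter 5.

```python
# The transactions of Table 1.1, each represented as a set of items.
transactions = [
    {"Bread", "Butter", "Diapers", "Milk"},
    {"Coffee", "Sugar", "Cookies", "Salmon"},
    {"Bread", "Butter", "Coffee", "Diapers", "Milk", "Eggs"},
    {"Bread", "Butter", "Salmon", "Chicken"},
    {"Eggs", "Bread", "Butter"},
    {"Salmon", "Diapers", "Milk"},
    {"Bread", "Tea", "Sugar", "Eggs"},
    {"Coffee", "Sugar", "Chicken", "Eggs"},
    {"Bread", "Diapers", "Milk", "Salt"},
    {"Tea", "Eggs", "Cookies", "Diapers", "Milk"},
]

antecedent, consequent = {"Diapers"}, {"Milk"}
n_antecedent = sum(antecedent <= t for t in transactions)           # transactions with Diapers
n_both = sum((antecedent | consequent) <= t for t in transactions)  # with Diapers and Milk

support = n_both / len(transactions)
confidence = n_both / n_antecedent
print(f"support = {support:.2f}, confidence = {confidence:.2f}")
```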
Cluster analysis seeks to find groups of closely related observations so that observations that belong to the same cluster are more similar to each other than observations that belong to other clusters. Clustering has been used to group sets of related customers, find areas of the ocean that have a significant impact on the Earth's climate, and compress data.
Example 1.3 (Document Clustering). The collection of news articles shown in Table 1.2 can be grouped based on their respective topics. Each article is represented as a set of word-frequency pairs (w: c), where w is a word and c is the number of times the word appears in the article. There are two natural clusters in the data set. The first cluster consists of the first four articles, which correspond to news about the economy, while the second cluster contains the last four articles, which correspond to news about health care. A good clustering algorithm should be able to identify these two clusters based on the similarity between words that appear in the articles.
Table 1.2. Collection of news articles.
Article   Word-frequency pairs
1   dollar: 1, industry: 4, country: 2, loan: 3, deal: 2, government: 2
2   machinery: 2, labor: 3, market: 4, industry: 2, work: 3, country: 1
3   job: 5, inflation: 3, rise: 2, jobless: 2, market: 3, country: 2, index: 3
4   domestic: 3, forecast: 2, gain: 1, market: 2, sale: 3, price: 2
5   patient: 4, symptom: 2, drug: 3, health: 2, clinic: 2, doctor: 2
6   pharmaceutical: 2, company: 3, drug: 2, vaccine: 1, flu: 3
7   death: 2, cancer: 4, drug: 3, public: 4, health: 3, director: 2
8   medical: 2, cost: 3, increase: 2, patient: 2, health: 3, care: 1
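As a rough illustration, the sketch below converts the word-frequency pairs of Table 1.2 into a document-term matrix and groups the articles into two clusters with K-means from scikit-learn. Whether the economy and health care groups are recovered exactly depends on details such as normalization of the counts; K-means itself and the evaluation of clustering results are covered in Chapter 7.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.cluster import KMeans

# The eight articles of Table 1.2 as word-frequency dictionaries.
articles = [
    {"dollar": 1, "industry": 4, "country": 2, "loan": 3, "deal": 2, "government": 2},
    {"machinery": 2, "labor": 3, "market": 4, "industry": 2, "work": 3, "country": 1},
    {"job": 5, "inflation": 3, "rise": 2, "jobless": 2, "market": 3, "country": 2, "index": 3},
    {"domestic": 3, "forecast": 2, "gain": 1, "market": 2, "sale": 3, "price": 2},
    {"patient": 4, "symptom": 2, "drug": 3, "health": 2, "clinic": 2, "doctor": 2},
    {"pharmaceutical": 2, "company": 3, "drug": 2, "vaccine": 1, "flu": 3},
    {"death": 2, "cancer": 4, "drug": 3, "public": 4, "health": 3, "director": 2},
    {"medical": 2, "cost": 3, "increase": 2, "patient": 2, "health": 3, "care": 1},
]

# Build a document-term matrix, then assign each article to one of two clusters.
X = DictVectorizer(sparse=False).fit_transform(articles)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # articles with the same label are placed in the same cluster
```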
Anomaly detection is the task of identifying observations whose characteristics are significantly different from the rest of the data. Such observations are known as anomalies or outliers. The goal of an anomaly detection algorithm is to discover the real anomalies and avoid falsely labeling normal objects as anomalous. In other words, a good anomaly detector must have a high detection rate and a low false alarm rate. Applications of anomaly detection include the detection of fraud, network intrusions, unusual patterns of disease, and ecosystem disturbances, such as droughts, floods, fires, hurricanes, etc.
Example 1.4 (Credit Card Fraud Detection). A credit card company records the transactions made by every credit card holder, along with personal information such as credit limit, age, annual income, and address. Since the number of fraudulent cases is relatively small compared to the number of legitimate transactions, anomaly detection techniques can be applied to build a profile of legitimate transactions for the users. When a new transaction arrives, it is compared against the profile of the user. If the characteristics of the transaction are very different from the previously created profile, then the transaction is flagged as potentially fraudulent.
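A minimal sketch of the profile idea in Example 1.4 is shown below, using made-up transaction amounts: the user's legitimate behavior is summarized by the mean and standard deviation of past amounts, and a new transaction is flagged when it lies many standard deviations away from that profile. This is only one simple statistical approach; anomaly detection methods are treated in depth in Chapter 9.

```python
import statistics

# Hypothetical amounts of a user's past (legitimate) transactions.
past_amounts = [23.50, 41.20, 18.75, 37.90, 29.10, 45.00, 33.25, 27.80]
mean = statistics.mean(past_amounts)
stdev = statistics.stdev(past_amounts)

def is_suspicious(amount, threshold=3.0):
    """Flag a transaction more than `threshold` standard deviations from the profile."""
    z_score = abs(amount - mean) / stdev
    return z_score > threshold

for amount in (31.40, 950.00):
    verdict = "flag for review" if is_suspicious(amount) else "looks normal"
    print(f"{amount:.2f} -> {verdict}")
```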
1.5 Scope and Organization of the Book
This book introduces the major principles and techniques used in data mining from an algorithmic perspective. A study of these principles and techniques is essential for developing a better understanding of how data mining technology can be applied to various kinds of data. This book also serves as a starting point for readers who are interested in doing research in this field.
We begin the technical discussion of this book with a chapter on data (Chapter 2), which discusses the basic types of data, data quality, preprocessing techniques, and measures of similarity and dissimilarity. Although this material can be covered quickly, it provides an essential foundation for data analysis. Chapters 3 and 4 cover classification. Chapter 3 provides a foundation by discussing decision tree classifiers and several issues that are important to all classification: overfitting, underfitting, model selection, and performance evaluation. Using this foundation, Chapter 4 describes a number of other important classification techniques: rule-based systems, nearest neighbor classifiers, Bayesian classifiers, artificial neural networks, including deep learning, support vector machines, and ensemble classifiers, which are collections of classifiers. The multiclass and imbalanced class problems are also discussed. These topics can be covered independently.
Association analysis is explored in Chapters 5 and 6. Chapter 5 describes the basics of association analysis: frequent itemsets, association rules, and some of the algorithms used to generate them. Specific types of frequent itemsets—maximal, closed, and hyperclique—that are important for data mining are also discussed, and the chapter concludes with a discussion of evaluation measures for association analysis. Chapter 6 considers a variety of more advanced topics, including how association analysis can be applied to categorical and continuous data or to data that has a concept hierarchy. (A concept hierarchy is a hierarchical categorization of objects, e.g., store items → clothing → shoes → sneakers.) This chapter also describes how association analysis can be extended to find sequential patterns (patterns involving order), patterns in graphs, and negative relationships (if one item is present, then the other is not).
Cluster analysis is discussed in Chapters 7 and 8. Chapter 7 first describes the different types of clusters, and then presents three specific clustering techniques: K-means, agglomerative hierarchical clustering, and DBSCAN. This is followed by a discussion of techniques for validating the results of a clustering algorithm. Additional clustering concepts and techniques are explored in Chapter 8, including fuzzy and probabilistic clustering, Self-Organizing Maps (SOM), graph-based clustering, spectral clustering, and density-based clustering. There is also a discussion of scalability issues and factors to consider when selecting a clustering algorithm.
Chapter 9 is on anomaly detection. After some basic definitions, several different types of anomaly detection are considered: statistical, distance-based, density-based, clustering-based, reconstruction-based, one-class classification, and information theoretic. The last chapter, Chapter 10, supplements the discussions in the other chapters with a discussion of the statistical concepts important for avoiding spurious results, and then discusses those concepts in the context of data mining techniques studied in the previous chapters. These techniques include statistical hypothesis testing, p-values, the false discovery rate, and permutation testing. Appendices A through F give a brief review of important topics that are used in portions of the book: linear algebra, dimensionality reduction, statistics, regression, optimization, and scaling up data mining techniques for big data.
The subject of data mining, while relatively young compared to statistics or machine learning, is already too large to cover in a single book. Selected references to topics that are only briefly covered, such as data quality, are provided in the Bibliographic Notes section of the appropriate chapter. References to topics not covered in this book, such as mining streaming data and privacy-preserving data mining, are provided in the Bibliographic Notes of this chapter.
1.6 Bibliographic Notes
The topic of data mining has inspired many textbooks. Introductory textbooks include those by Dunham [16], Han et al. [29], Hand et al. [31], Roiger and Geatz [50], Zaki and Meira [61], and Aggarwal [2]. Data mining books with a stronger emphasis on business applications include the works by Berry and Linoff [5], Pyle [47], and Parr Rud [45]. Books with an emphasis on statistical learning include those by Cherkassky and Mulier [11], and Hastie et al. [32]. Similar books with an emphasis on machine learning or pattern recognition are those by Duda et al. [15], Kantardzic [34], Mitchell [43], Webb [57], and Witten and Frank [58]. There are also some more specialized books: Chakrabarti [9] (web mining), Fayyad et al. [20] (collection of early articles on data mining), Fayyad et al. [18] (visualization), Grossman et al. [25] (science and engineering), Kargupta and Chan [35] (distributed data mining), Wang et al. [56] (bioinformatics), and Zaki and Ho [60] (parallel data mining).
There are several conferences related to data mining. Some of the main conferences dedicated to this field include the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), the IEEE International Conference on Data Mining (ICDM), the SIAM International Conference on Data Mining (SDM), the European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), and the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD). Data mining papers can also be found in other major conferences such as the Conference and Workshop on Neural Information Processing Systems (NIPS), the International Conference on Machine Learning (ICML), the ACM SIGMOD/PODS conference, the International Conference on Very Large Data Bases (VLDB), the Conference on Information and Knowledge Management (CIKM), the International Conference on Data Engineering (ICDE), the National Conference on Artificial Intelligence (AAAI), the IEEE International Conference on Big Data, and the IEEE International Conference on Data Science and Advanced Analytics (DSAA).
Journal publications on data mining include IEEE Transactions on Knowledge and Data Engineering, Data Mining and Knowledge Discovery, Knowledge and Information Systems, ACM Transactions on Knowledge Discovery from Data, Statistical Analysis and Data Mining, and Information Systems. Various open-source data mining software packages are available, including Weka [27] and Scikit-learn [46]. More recently, data mining software such as Apache Mahout and Apache Spark has been developed for large-scale problems on distributed computing platforms.
There have been a number of general articles on data mining that define the field or its relationship to other fields, particularly statistics. Fayyad et al. [19] describe data mining and how it fits into the total knowledge discovery process. Chen et al. [10] give a database perspective on data mining. Ramakrishnan and Grama [48] provide a general discussion of data mining and present several viewpoints. Hand [30] describes how data mining differs from statistics, as does Friedman [21]. Lambert [40] explores the use of statistics for large data sets and provides some comments on the respective roles of data mining and statistics. Glymour et al. [23] consider the lessons that statistics may have for data mining. Smyth et al. [53] describe how the evolution of data mining is being driven by new types of data and applications, such as those involving streams, graphs, and text. Han et al. [28] consider emerging applications in data mining and Smyth [52] describes some research challenges in data mining. Wu et al. [59] discuss how developments in data mining research can be turned into practical tools. Data mining standards are the subject of a paper by Grossman et al. [24]. Bradley [7] discusses how data mining algorithms can be scaled to large data sets.
The emergence of new data mining applications has produced new challenges that need to be addressed. For instance, concerns about privacy breaches as a result of data mining have escalated in recent years, particularly in application domains such as web commerce and health care. As a result, there is growing interest in developing data mining algorithms that maintain user privacy. Developing techniques for mining encrypted or randomized data is known as privacy-preserving data mining. Some general references in this area include papers by Agrawal and Srikant [3], Clifton et al. [12], and Kargupta et al. [36]. Vassilios et al. [55] provide a survey. Another area of concern is the bias in predictive models that may be used for some applications, e.g., screening job applicants or deciding prison parole [39]. Assessing whether such applications are producing biased results is made more difficult by the fact that the predictive models used for such applications are often black box models, i.e., models that are not interpretable in any straightforward way.
Data science, its constituent fields, and more generally, the new paradigm of knowledge discovery they represent [33], have great potential, some of which has been realized. However, it is important to emphasize that data science works mostly with observational data, i.e., data that was collected by various organizations as part of their normal operation. The consequence of this is that sampling biases are common and the determination of causal factors becomes more problematic. For this and a number of other reasons, it is often hard to interpret the predictive models built from this data [42, 49]. Thus, theory, experimentation, and computational simulations will continue to be the methods of choice in many areas, especially those related to science.
More importantly, a purely data-driven approach often ignores the existing knowledge in a particular field. Such models may perform poorly, for example, predicting impossible outcomes or failing to generalize to new situations. However, if the model does work well, e.g., has high predictive accuracy, then this approach may be sufficient for practical purposes in some fields. But in many areas, such as medicine and science, gaining insight into the underlying domain is often the goal. Some recent work attempts to address these issues in order to create theory-guided data science, which takes pre-existing domain knowledge into account [17, 37].
Recent years have witnessed a growing number of applications that rapidly generate continuous streams of data. Examples of stream data include network traffic, multimedia streams, and stock prices. Several issues must be considered when mining data streams, such as the limited amount of memory available, the need for online analysis, and the change of the data over time. Data mining for stream data has become an important area in data mining. Some selected publications are Domingos and Hulten [14] (classification), Giannella et al. [22] (association analysis), Guha et al. [26] (clustering), Kifer et al. [38] (change detection), Papadimitriou et al. [44] (time series), and Law et al. [41] (dimensionality reduction).
Another area of interest is recommender and collaborative filtering systems [1, 6, 8, 13, 54], which suggest movies, television shows, books, products, etc. that a person might like. In many cases, this problem, or at least a component of it, is treated as a prediction problem and thus, data mining techniques can be applied [4, 51].
Bibliography[1]G.AdomaviciusandA.Tuzhilin.Towardthenextgenerationof
recommendersystems:Asurveyofthestate-of-the-artandpossibleextensions.IEEEtransactionsonknowledgeanddataengineering,17(6):734–749,2005.
[2]C.Aggarwal.Datamining:TheTextbook.Springer,2009.
[3]R.AgrawalandR.Srikant.Privacy-preservingdatamining.InProc.of2000ACMSIGMODIntl.Conf.onManagementofData,pages439–450,Dallas,Texas,2000.ACMPress.
[4]X.AmatriainandJ.M.Pujol.Dataminingmethodsforrecommendersystems.InRecommenderSystemsHandbook,pages227–262.Springer,2015.
[5]M.J.A.BerryandG.Linoff.DataMiningTechniques:ForMarketing,Sales,andCustomerRelationshipManagement.WileyComputerPublishing,2ndedition,2004.
[6]J.Bobadilla,F.Ortega,A.Hernando,andA.Gutiérrez.Recommendersystemssurvey.Knowledge-basedsystems,46:109–132,2013.
[7]P.S.Bradley,J.Gehrke,R.Ramakrishnan,andR.Srikant.Scalingminingalgorithmstolargedatabases.CommunicationsoftheACM,45(8):38–43,2002.
[8]R.Burke.Hybridrecommendersystems:Surveyandexperiments.Usermodelinganduser-adaptedinteraction,12(4):331–370,2002.
[9]S.Chakrabarti.MiningtheWeb:DiscoveringKnowledgefromHypertextData.MorganKaufmann,SanFrancisco,CA,2003.
[10]M.-S.Chen,J.Han,andP.S.Yu.DataMining:AnOverviewfromaDatabasePerspective.IEEETransactionsonKnowledgeandDataEngineering,8(6):866–883,1996.
[11]V.CherkasskyandF.Mulier.LearningfromData:Concepts,Theory,andMethods.Wiley-IEEEPress,2ndedition,1998.
[12]C.Clifton,M.Kantarcioglu,andJ.Vaidya.Definingprivacyfordatamining.InNationalScienceFoundationWorkshoponNextGenerationDataMining,pages126–133,Baltimore,MD,November2002.
[13]C.DesrosiersandG.Karypis.Acomprehensivesurveyofneighborhood-basedrecommendationmethods.Recommendersystemshandbook,pages107–144,2011.
[14]P.DomingosandG.Hulten.Mininghigh-speeddatastreams.InProc.ofthe6thIntl.Conf.onKnowledgeDiscoveryandDataMining,pages71–80,
Boston,Massachusetts,2000.ACMPress.
[15] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley & Sons, Inc., New York, 2nd edition, 2001.
[16]M.H.Dunham.DataMining:IntroductoryandAdvancedTopics.PrenticeHall,2006.
[17]J.H.Faghmous,A.Banerjee,S.Shekhar,M.Steinbach,V.Kumar,A.R.Ganguly,andN.Samatova.Theory-guideddatascienceforclimatechange.Computer,47(11):74–78,2014.
[18]U.M.Fayyad,G.G.Grinstein,andA.Wierse,editors.InformationVisualizationinDataMiningandKnowledgeDiscovery.MorganKaufmannPublishers,SanFrancisco,CA,September2001.
[19]U.M.Fayyad,G.Piatetsky-Shapiro,andP.Smyth.FromDataMiningtoKnowledgeDiscovery:AnOverview.InAdvancesinKnowledgeDiscoveryandDataMining,pages1–34.AAAIPress,1996.
[20]U.M.Fayyad,G.Piatetsky-Shapiro,P.Smyth,andR.Uthurusamy,editors.AdvancesinKnowledgeDiscoveryandDataMining.AAAI/MITPress,1996.
[21]J.H.Friedman.DataMiningandStatistics:What’stheConnection?Unpublished.www-stat.stanford.edu/~jhf/ftp/dm-stat.ps,1997.
[22]C.Giannella,J.Han,J.Pei,X.Yan,andP.S.Yu.MiningFrequentPatternsinDataStreamsatMultipleTimeGranularities.InH.Kargupta,A.Joshi,K.Sivakumar,andY.Yesha,editors,NextGenerationDataMining,pages191–212.AAAI/MIT,2003.
[23]C.Glymour,D.Madigan,D.Pregibon,andP.Smyth.StatisticalThemesandLessonsforDataMining.DataMiningandKnowledgeDiscovery,1(1):11–28,1997.
[24]R.L.Grossman,M.F.Hornick,andG.Meyer.Dataminingstandardsinitiatives.CommunicationsoftheACM,45(8):59–61,2002.
[25]R.L.Grossman,C.Kamath,P.Kegelmeyer,V.Kumar,andR.Namburu,editors.DataMiningforScientificandEngineeringApplications.KluwerAcademicPublishers,2001.
[26]S.Guha,A.Meyerson,N.Mishra,R.Motwani,andL.O’Callaghan.ClusteringDataStreams:TheoryandPractice.IEEETransactionsonKnowledgeandDataEngineering,15(3):515–528,May/June2003.
[27]M.Hall,E.Frank,G.Holmes,B.Pfahringer,P.Reutemann,andI.H.Witten.TheWEKADataMiningSoftware:AnUpdate.SIGKDDExplorations,11(1),2009.
[28]J.Han,R.B.Altman,V.Kumar,H.Mannila,andD.Pregibon.Emergingscientificapplicationsindatamining.CommunicationsoftheACM,45(8):54–58,2002.
[29]J.Han,M.Kamber,andJ.Pei.DataMining:ConceptsandTechniques.MorganKaufmannPublishers,SanFrancisco,3rdedition,2011.
[30]D.J.Hand.DataMining:StatisticsandMore?TheAmericanStatistician,52(2):112–118,1998.
[31]D.J.Hand,H.Mannila,andP.Smyth.PrinciplesofDataMining.MITPress,2001.
[32]T.Hastie,R.Tibshirani,andJ.H.Friedman.TheElementsofStatisticalLearning:DataMining,Inference,Prediction.Springer,2ndedition,2009.
[33]T.Hey,S.Tansley,K.M.Tolle,etal.Thefourthparadigm:data-intensivescientificdiscovery,volume1.MicrosoftresearchRedmond,WA,2009.
[34]M.Kantardzic.DataMining:Concepts,Models,Methods,andAlgorithms.Wiley-IEEEPress,Piscataway,NJ,2003.
[35]H.KarguptaandP.K.Chan,editors.AdvancesinDistributedandParallelKnowledgeDiscovery.AAAIPress,September2002.
[36]H.Kargupta,S.Datta,Q.Wang,andK.Sivakumar.OnthePrivacyPreservingPropertiesofRandomDataPerturbationTechniques.InProc.ofthe2003IEEEIntl.Conf.onDataMining,pages99–106,Melbourne,Florida,December2003.IEEEComputerSociety.
[37]A.Karpatne,G.Atluri,J.Faghmous,M.Steinbach,A.Banerjee,A.Ganguly,S.Shekhar,N.Samatova,andV.Kumar.Theory-guidedDataScience:ANewParadigmforScientificDiscoveryfromData.IEEETransactionsonKnowledgeandDataEngineering,2017.
[38]D.Kifer,S.Ben-David,andJ.Gehrke.DetectingChangeinDataStreams.InProc.ofthe30thVLDBConf.,pages180–191,Toronto,Canada,2004.MorganKaufmann.
[39]J.Kleinberg,J.Ludwig,andS.Mullainathan.AGuidetoSolvingSocialProblemswithMachineLearning.HarvardBusinessReview,December2016.
[40]D.Lambert.WhatUseisStatisticsforMassiveData?InACMSIGMODWorkshoponResearchIssuesinDataMiningandKnowledgeDiscovery,pages54–62,2000.
[41]M.H.C.Law,N.Zhang,andA.K.Jain.NonlinearManifoldLearningforDataStreams.InProc.oftheSIAMIntl.Conf.onDataMining,LakeBuenaVista,Florida,April2004.SIAM.
[42]Z.C.Lipton.Themythosofmodelinterpretability.arXivpreprintarXiv:1606.03490,2016.
[43]T.Mitchell.MachineLearning.McGraw-Hill,Boston,MA,1997.
[44]S.Papadimitriou,A.Brockwell,andC.Faloutsos.Adaptive,unsupervisedstreammining.VLDBJournal,13(3):222–239,2004.
[45] O. Parr Rud. Data Mining Cookbook: Modeling Data for Marketing, Risk and Customer Relationship Management. John Wiley & Sons, New York, NY, 2001.
[46]F.Pedregosa,G.Varoquaux,A.Gramfort,V.Michel,B.Thirion,O.Grisel,M.Blondel,P.Prettenhofer,R.Weiss,V.Dubourg,J.Vanderplas,A.Passos,D.Cournapeau,M.Brucher,M.Perrot,andE.Duchesnay.Scikit-learn:MachineLearninginPython.JournalofMachineLearningResearch,12:2825–2830,2011.
[47]D.Pyle.BusinessModelingandDataMining.MorganKaufmann,SanFrancisco,CA,2003.
[48]N.RamakrishnanandA.Grama.DataMining:FromSerendipitytoScience—GuestEditors’Introduction.IEEEComputer,32(8):34–37,1999.
[49]M.T.Ribeiro,S.Singh,andC.Guestrin.Whyshoulditrustyou?:Explainingthepredictionsofanyclassifier.InProceedingsofthe22ndACMSIGKDDInternationalConferenceonKnowledgeDiscoveryandDataMining,pages1135–1144.ACM,2016.
[50]R.RoigerandM.Geatz.DataMining:ATutorialBasedPrimer.Addison-Wesley,2002.
[51]J.Schafer.TheApplicationofData-MiningtoRecommenderSystems.Encyclopediaofdatawarehousingandmining,1:44–48,2009.
[52]P.Smyth.BreakingoutoftheBlack-Box:ResearchChallengesinDataMining.InProc.ofthe2001ACMSIGMODWorkshoponResearchIssuesinDataMiningandKnowledgeDiscovery,2001.
[53]P.Smyth,D.Pregibon,andC.Faloutsos.Data-drivenevolutionofdataminingalgorithms.CommunicationsoftheACM,45(8):33–37,2002.
[54]X.SuandT.M.Khoshgoftaar.Asurveyofcollaborativefilteringtechniques.Advancesinartificialintelligence,2009:4,2009.
[55]V.S.Verykios,E.Bertino,I.N.Fovino,L.P.Provenza,Y.Saygin,andY.Theodoridis.State-of-the-artinprivacypreservingdatamining.SIGMODRecord,33(1):50–57,2004.
[56]J.T.L.Wang,M.J.Zaki,H.Toivonen,andD.E.Shasha,editors.DataMininginBioinformatics.Springer,September2004.
[57] A. R. Webb. Statistical Pattern Recognition. John Wiley & Sons, 2nd edition, 2002.
[58]I.H.WittenandE.Frank.DataMining:PracticalMachineLearningToolsandTechniques.MorganKaufmann,3rdedition,2011.
[59]X.Wu,P.S.Yu,andG.Piatetsky-Shapiro.DataMining:HowResearchMeetsPracticalDevelopment?KnowledgeandInformationSystems,5(2):248–261,2003.
[60]M.J.ZakiandC.-T.Ho,editors.Large-ScaleParallelDataMining.Springer,September2002.
[61]M.J.ZakiandW.MeiraJr.DataMiningandAnalysis:FundamentalConceptsandAlgorithms.CambridgeUniversityPress,NewYork,2014.
1.7 Exercises
1. Discuss whether or not each of the following activities is a data mining task.
a. Dividingthecustomersofacompanyaccordingtotheirgender.
b. Dividingthecustomersofacompanyaccordingtotheirprofitability.
c. Computingthetotalsalesofacompany.
d. Sortingastudentdatabasebasedonstudentidentificationnumbers.
e. Predictingtheoutcomesoftossinga(fair)pairofdice.
f. Predictingthefuturestockpriceofacompanyusinghistoricalrecords.
g. Monitoringtheheartrateofapatientforabnormalities.
h. Monitoringseismicwavesforearthquakeactivities.
i. Extractingthefrequenciesofasoundwave.
2.SupposethatyouareemployedasadataminingconsultantforanInternetsearchenginecompany.Describehowdataminingcanhelpthecompanybygivingspecificexamplesofhowtechniques,suchasclustering,classification,associationrulemining,andanomalydetectioncanbeapplied.
3.Foreachofthefollowingdatasets,explainwhetherornotdataprivacyisanimportantissue.
a. Censusdatacollectedfrom1900–1950.
b. IPaddressesandvisittimesofwebuserswhovisityourwebsite.
c. ImagesfromEarth-orbitingsatellites.
d. Namesandaddressesofpeoplefromthetelephonebook.
e. NamesandemailaddressescollectedfromtheWeb.
2 Data
Thischapterdiscussesseveraldata-relatedissuesthatareimportantforsuccessfuldatamining:
TheTypeofDataDatasetsdifferinanumberofways.Forexample,theattributesusedtodescribedataobjectscanbeofdifferenttypes—quantitativeorqualitative—anddatasetsoftenhavespecialcharacteristics;e.g.,somedatasetscontaintimeseriesorobjectswithexplicitrelationshipstooneanother.Notsurprisingly,thetypeofdatadetermineswhichtoolsandtechniquescanbeusedtoanalyzethedata.Indeed,newresearchindataminingisoftendrivenbytheneedtoaccommodatenewapplicationareasandtheirnewtypesofdata.
TheQualityoftheDataDataisoftenfarfromperfect.Whilemostdataminingtechniquescantoleratesomelevelofimperfectioninthedata,afocusonunderstandingandimprovingdataqualitytypicallyimprovesthequalityoftheresultinganalysis.Dataqualityissuesthatoftenneedtobeaddressedincludethepresenceofnoiseandoutliers;missing,inconsistent,orduplicatedata;anddatathatisbiasedor,insomeotherway,unrepresentativeofthephenomenonorpopulationthatthedataissupposedtodescribe.
PreprocessingStepstoMaketheDataMoreSuitableforDataMiningOften,therawdatamustbeprocessedinordertomakeitsuitablefor
analysis.Whileoneobjectivemaybetoimprovedataquality,othergoalsfocusonmodifyingthedatasothatitbetterfitsaspecifieddataminingtechniqueortool.Forexample,acontinuousattribute,e.g.,length,sometimesneedstobetransformedintoanattributewithdiscretecategories,e.g.,short,medium,orlong,inordertoapplyaparticulartechnique.Asanotherexample,thenumberofattributesinadatasetisoftenreducedbecausemanytechniquesaremoreeffectivewhenthedatahasarelativelysmallnumberofattributes.
AnalyzingDatainTermsofItsRelationshipsOneapproachtodataanalysisistofindrelationshipsamongthedataobjectsandthenperformtheremaininganalysisusingtheserelationshipsratherthanthedataobjectsthemselves.Forinstance,wecancomputethesimilarityordistancebetweenpairsofobjectsandthenperformtheanalysis—clustering,classification,oranomalydetection—basedonthesesimilaritiesordistances.Therearemanysuchsimilarityordistancemeasures,andtheproperchoicedependsonthetypeofdataandtheparticularapplication.
Example2.1(AnIllustrationofData-RelatedIssues).Tofurtherillustratetheimportanceoftheseissues,considerthefollowinghypotheticalsituation.Youreceiveanemailfromamedicalresearcherconcerningaprojectthatyouareeagertoworkon.
Hi,
I’veattachedthedatafilethatImentionedinmypreviousemail.Eachlinecontainsthe
informationforasinglepatientandconsistsoffivefields.Wewanttopredictthelastfieldusing
theotherfields.Idon’thavetimetoprovideanymoreinformationaboutthedatasinceI’mgoing
outoftownforacoupleofdays,buthopefullythatwon’tslowyoudowntoomuch.Andifyou
don’tmind,couldwemeetwhenIgetbacktodiscussyourpreliminaryresults?Imightinvitea
fewothermembersofmyteam.
Thanksandseeyouinacoupleofdays.
Despitesomemisgivings,youproceedtoanalyzethedata.Thefirstfewrowsofthefileareasfollows:
012 232 33.5 0 10.7
020 121 16.9 2 210.1
027 165 24.0 0 427.6
⋮
Abrieflookatthedatarevealsnothingstrange.Youputyourdoubtsasideandstarttheanalysis.Thereareonly1000lines,asmallerdatafilethanyouhadhopedfor,buttwodayslater,youfeelthatyouhavemadesomeprogress.Youarriveforthemeeting,andwhilewaitingforotherstoarrive,youstrikeupaconversationwithastatisticianwhoisworkingontheproject.Whenshelearnsthatyouhavealsobeenanalyzingthedatafromtheproject,sheasksifyouwouldmindgivingherabriefoverviewofyourresults.
Statistician:So,yougotthedataforallthepatients?
DataMiner:Yes.Ihaven’thadmuchtimeforanalysis,butIdohaveafewinterestingresults.
Statistician:Amazing.ThereweresomanydataissueswiththissetofpatientsthatIcouldn’tdomuch.
DataMiner:Oh?Ididn’thearaboutanypossibleproblems.
Statistician:Well,firstthereisfield5,thevariablewewanttopredict.
It’scommonknowledgeamongpeoplewhoanalyzethistypeofdatathatresultsarebetterifyouworkwiththelogofthevalues,butIdidn’tdiscoverthisuntillater.Wasitmentionedtoyou?
DataMiner:No.
Statistician:Butsurelyyouheardaboutwhathappenedtofield4?It’ssupposedtobemeasuredonascalefrom1to10,with0indicatingamissingvalue,butbecauseofadataentryerror,all10’swerechangedinto0’s.Unfortunately,sincesomeofthepatientshavemissingvaluesforthisfield,it’simpossibletosaywhethera0inthisfieldisareal0ora10.Quiteafewoftherecordshavethatproblem.
DataMiner:Interesting.Werethereanyotherproblems?
Statistician:Yes,fields2and3arebasicallythesame,butIassumethatyouprobablynoticedthat.
DataMiner:Yes,butthesefieldswereonlyweakpredictorsoffield5.
Statistician:Anyway,givenallthoseproblems,I’msurprisedyouwereabletoaccomplishanything.
DataMiner:True,butmyresultsarereallyquitegood.Field1isaverystrongpredictoroffield5.I’msurprisedthatthiswasn’tnoticedbefore.
Statistician:What?Field1isjustanidentificationnumber.
DataMiner:Nonetheless,myresultsspeakforthemselves.
Statistician:Oh,no!Ijustremembered.WeassignedIDnumbersafterwesortedtherecordsbasedonfield5.Thereisastrongconnection,butit’smeaningless.Sorry.
Althoughthisscenariorepresentsanextremesituation,itemphasizestheimportanceof“knowingyourdata.”Tothatend,thischapterwilladdresseach
ofthefourissuesmentionedabove,outliningsomeofthebasicchallengesandstandardapproaches.
2.1TypesofDataAdatasetcanoftenbeviewedasacollectionofdataobjects.Othernamesforadataobjectarerecord,point,vector,pattern,event,case,sample,instance,observation,orentity.Inturn,dataobjectsaredescribedbyanumberofattributesthatcapturethecharacteristicsofanobject,suchasthemassofaphysicalobjectorthetimeatwhichaneventoccurred.Othernamesforanattributearevariable,characteristic,field,feature,ordimension.
Example2.2(StudentInformation).Often,adatasetisafile,inwhichtheobjectsarerecords(orrows)inthefileandeachfield(orcolumn)correspondstoanattribute.Forexample,Table2.1 showsadatasetthatconsistsofstudentinformation.Eachrowcorrespondstoastudentandeachcolumnisanattributethatdescribessomeaspectofastudent,suchasgradepointaverage(GPA)oridentificationnumber(ID).
Table 2.1. A sample data set containing student information.

Student ID    Year         Grade Point Average (GPA)    ...
   ⋮            ⋮                    ⋮
1034262       Senior       3.24                          ...
1052663       Freshman     3.51                          ...
1082246       Sophomore    3.62                          ...
Althoughrecord-baseddatasetsarecommon,eitherinflatfilesorrelationaldatabasesystems,thereareotherimportanttypesofdatasetsandsystemsforstoringdata.InSection2.1.2 ,wewilldiscusssomeofthetypesofdatasetsthatarecommonlyencounteredindatamining.However,wefirstconsiderattributes.
2.1.1AttributesandMeasurement
Inthissection,weconsiderthetypesofattributesusedtodescribedataobjects.Wefirstdefineanattribute,thenconsiderwhatwemeanbythetypeofanattribute,andfinallydescribethetypesofattributesthatarecommonlyencountered.
WhatIsanAttribute?Westartwithamoredetaileddefinitionofanattribute.
Definition2.1.Anattributeisapropertyorcharacteristicofanobjectthatcanvary,eitherfromoneobjecttoanotherorfromonetimetoanother.
Forexample,eyecolorvariesfrompersontoperson,whilethetemperatureofanobjectvariesovertime.Notethateyecolorisasymbolicattributewitha
smallnumberofpossiblevalues{brown,black,blue,green,hazel,etc.},whiletemperatureisanumericalattributewithapotentiallyunlimitednumberofvalues.
Atthemostbasiclevel,attributesarenotaboutnumbersorsymbols.However,todiscussandmorepreciselyanalyzethecharacteristicsofobjects,weassignnumbersorsymbolstothem.Todothisinawell-definedway,weneedameasurementscale.
Definition2.2.Ameasurementscaleisarule(function)thatassociatesanumericalorsymbolicvaluewithanattributeofanobject.
Formally,theprocessofmeasurementistheapplicationofameasurementscaletoassociateavaluewithaparticularattributeofaspecificobject.Whilethismayseemabitabstract,weengageintheprocessofmeasurementallthetime.Forinstance,westeponabathroomscaletodetermineourweight,weclassifysomeoneasmaleorfemale,orwecountthenumberofchairsinaroomtoseeiftherewillbeenoughtoseatallthepeoplecomingtoameeting.Inallthesecases,the“physicalvalue”ofanattributeofanobjectismappedtoanumericalorsymbolicvalue.
Withthisbackground,wecandiscussthetypeofanattribute,aconceptthatisimportantindeterminingifaparticulardataanalysistechniqueisconsistentwithaspecifictypeofattribute.
TheTypeofanAttributeItiscommontorefertothetypeofanattributeasthetypeofameasurementscale.Itshouldbeapparentfromthepreviousdiscussionthatanattributecanbedescribedusingdifferentmeasurementscalesandthatthepropertiesofanattributeneednotbethesameasthepropertiesofthevaluesusedtomeasureit.Inotherwords,thevaluesusedtorepresentanattributecanhavepropertiesthatarenotpropertiesoftheattributeitself,andviceversa.Thisisillustratedwithtwoexamples.
Example2.3(EmployeeAgeandIDNumber).TwoattributesthatmightbeassociatedwithanemployeeareIDandage(inyears).Bothoftheseattributescanberepresentedasintegers.However,whileitisreasonabletotalkabouttheaverageageofanemployee,itmakesnosensetotalkabouttheaverageemployeeID.Indeed,theonlyaspectofemployeesthatwewanttocapturewiththeIDattributeisthattheyaredistinct.Consequently,theonlyvalidoperationforemployeeIDsistotestwhethertheyareequal.Thereisnohintofthislimitation,however,whenintegersareusedtorepresenttheemployeeIDattribute.Fortheageattribute,thepropertiesoftheintegersusedtorepresentageareverymuchthepropertiesoftheattribute.Evenso,thecorrespondenceisnotcompletebecause,forexample,ageshaveamaximum,whileintegersdonot.
Example2.4(LengthofLineSegments).ConsiderFigure2.1 ,whichshowssomeobjects—linesegments—andhowthelengthattributeoftheseobjectscanbemappedtonumbersintwodifferentways.Eachsuccessivelinesegment,goingfromthetoptothebottom,isformedbyappendingthetopmostlinesegmenttoitself.Thus,
thesecondlinesegmentfromthetopisformedbyappendingthetopmostlinesegmenttoitselftwice,thethirdlinesegmentfromthetopisformedbyappendingthetopmostlinesegmenttoitselfthreetimes,andsoforth.Inaveryreal(physical)sense,allthelinesegmentsaremultiplesofthefirst.Thisfactiscapturedbythemeasurementsontherightsideofthefigure,butnotbythoseontheleftside.Morespecifically,themeasurementscaleontheleftsidecapturesonlytheorderingofthelengthattribute,whilethescaleontherightsidecapturesboththeorderingandadditivityproperties.Thus,anattributecanbemeasuredinawaythatdoesnotcaptureallthepropertiesoftheattribute.
Figure2.1.Themeasurementofthelengthoflinesegmentsontwodifferentscalesofmeasurement.
Knowingthetypeofanattributeisimportantbecauseittellsuswhichpropertiesofthemeasuredvaluesareconsistentwiththeunderlying
propertiesoftheattribute,andtherefore,itallowsustoavoidfoolishactions,suchascomputingtheaverageemployeeID.
The Different Types of Attributes
A useful (and simple) way to specify the type of an attribute is to identify the properties of numbers that correspond to underlying properties of the attribute. For example, an attribute such as length has many of the properties of numbers. It makes sense to compare and order objects by length, as well as to talk about the differences and ratios of length. The following properties (operations) of numbers are typically used to describe attributes.

1. Distinctness  =  and  ≠
2. Order  <, ≤, >, and ≥
3. Addition  +  and  −
4. Multiplication  ×  and  /

Given these properties, we can define four types of attributes: nominal, ordinal, interval, and ratio. Table 2.2 gives the definitions of these types, along with information about the statistical operations that are valid for each type. Each attribute type possesses all of the properties and operations of the attribute types above it. Consequently, any property or operation that is valid for nominal, ordinal, and interval attributes is also valid for ratio attributes. In other words, the definition of the attribute types is cumulative. However, this does not mean that the statistical operations appropriate for one attribute type are appropriate for the attribute types above it.

Table 2.2. Different attribute types.

Categorical (Qualitative)
  Nominal: The values of a nominal attribute are just different names; i.e., nominal values provide only enough information to distinguish one object from another (=, ≠). Examples: zip codes, employee ID numbers, eye color, gender. Operations: mode, entropy, contingency correlation, χ2 test.
  Ordinal: The values of an ordinal attribute provide enough information to order objects (<, >). Examples: hardness of minerals, {good, better, best}, grades, street numbers. Operations: median, percentiles, rank correlation, run tests, sign tests.

Numeric (Quantitative)
  Interval: For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists (+, −). Examples: calendar dates, temperature in Celsius or Fahrenheit. Operations: mean, standard deviation, Pearson's correlation, t and F tests.
  Ratio: For ratio variables, both differences and ratios are meaningful (×, /). Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current. Operations: geometric mean, harmonic mean, percent variation.

Nominal and ordinal attributes are collectively referred to as categorical or qualitative attributes. As the name suggests, qualitative attributes, such as employee ID, lack most of the properties of numbers. Even if they are represented by numbers, i.e., integers, they should be treated more like symbols. The remaining two types of attributes, interval and ratio, are collectively referred to as quantitative or numeric attributes. Quantitative attributes are represented by numbers and have most of the properties of numbers. Note that quantitative attributes can be integer-valued or continuous.
The types of attributes can also be described in terms of transformations that do not change the meaning of an attribute. Indeed, S. S. Stevens, the psychologist who originally defined the types of attributes shown in Table 2.2, defined them in terms of these permissible transformations. For example, the meaning of a length attribute is unchanged if it is measured in meters instead of feet.
Thestatisticaloperationsthatmakesenseforaparticulartypeofattributearethosethatwillyieldthesameresultswhentheattributeistransformedbyusingatransformationthatpreservestheattribute’smeaning.Toillustrate,theaveragelengthofasetofobjectsisdifferentwhenmeasuredinmetersratherthaninfeet,butbothaveragesrepresentthesamelength.Table2.3 showsthemeaning-preservingtransformationsforthefourattributetypesofTable2.2 .
Table 2.3. Transformations that define attribute levels.

Categorical (Qualitative)
  Nominal: Any one-to-one mapping, e.g., a permutation of values. Comment: If all employee ID numbers are reassigned, it will not make any difference.
  Ordinal: An order-preserving change of values, i.e., new_value = f(old_value), where f is a monotonic function. Comment: An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.

Numeric (Quantitative)
  Interval: new_value = a × old_value + b, where a and b are constants. Comment: The Fahrenheit and Celsius temperature scales differ in the location of their zero value and the size of a degree (unit).
  Ratio: new_value = a × old_value. Comment: Length can be measured in meters or feet.
Example 2.5 (Temperature Scales). Temperature provides a good illustration of some of the concepts that have been described. First, temperature can be either an interval or a ratio attribute, depending on its measurement scale. When measured on the Kelvin scale, a temperature of 2° is, in a physically meaningful way, twice that of a temperature of 1°. This is not true when temperature is measured on either the Celsius or Fahrenheit scales, because, physically, a temperature of 1° Fahrenheit (Celsius) is not much different than a temperature of 2° Fahrenheit (Celsius). The problem is that the zero points of the Fahrenheit and Celsius scales are, in a physical sense, arbitrary, and therefore, the ratio of two Celsius or Fahrenheit temperatures is not physically meaningful.
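The point of Example 2.5 can be checked numerically. The short sketch below (Python, with made-up temperature readings; the variable names are not part of the original example) converts two Celsius values to Kelvin and compares differences and ratios on the two scales.

# Ratios are meaningful on the Kelvin (ratio) scale but not on the
# Celsius (interval) scale, because Celsius has an arbitrary zero point.
def celsius_to_kelvin(c):
    return c + 273.15

t1_c, t2_c = 10.0, 20.0          # two illustrative Celsius readings
t1_k, t2_k = celsius_to_kelvin(t1_c), celsius_to_kelvin(t2_c)

print(t2_c / t1_c)               # 2.0 -- but "twice as hot" is not physically meaningful
print(t2_k / t1_k)               # about 1.035 -- the physically meaningful ratio
print(t2_c - t1_c, t2_k - t1_k)  # differences (10.0 in both) agree, as expected for interval data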
DescribingAttributesbytheNumberofValuesAnindependentwayofdistinguishingbetweenattributesisbythenumberofvaluestheycantake.
DiscreteAdiscreteattributehasafiniteorcountablyinfinitesetofvalues.Suchattributescanbecategorical,suchaszipcodesorIDnumbers,ornumeric,suchascounts.Discreteattributesareoftenrepresentedusingintegervariables.Binaryattributesareaspecialcaseofdiscreteattributesandassumeonlytwovalues,e.g.,true/false,yes/no,male/female,or0/1.
BinaryattributesareoftenrepresentedasBooleanvariables,orasintegervariablesthatonlytakethevalues0or1.
ContinuousAcontinuousattributeisonewhosevaluesarerealnumbers.Examplesincludeattributessuchastemperature,height,orweight.Continuousattributesaretypicallyrepresentedasfloating-pointvariables.Practically,realvaluescanbemeasuredandrepresentedonlywithlimitedprecision.
Intheory,anyofthemeasurementscaletypes—nominal,ordinal,interval,andratio—couldbecombinedwithanyofthetypesbasedonthenumberofattributevalues—binary,discrete,andcontinuous.However,somecombinationsoccuronlyinfrequentlyordonotmakemuchsense.Forinstance,itisdifficulttothinkofarealisticdatasetthatcontainsacontinuousbinaryattribute.Typically,nominalandordinalattributesarebinaryordiscrete,whileintervalandratioattributesarecontinuous.However,countattributes,whicharediscrete,arealsoratioattributes.
AsymmetricAttributesForasymmetricattributes,onlypresence—anon-zeroattributevalue—isregardedasimportant.Consideradatasetinwhicheachobjectisastudentandeachattributerecordswhetherastudenttookaparticularcourseatauniversity.Foraspecificstudent,anattributehasavalueof1ifthestudenttookthecourseassociatedwiththatattributeandavalueof0otherwise.Becausestudentstakeonlyasmallfractionofallavailablecourses,mostofthevaluesinsuchadatasetwouldbe0.Therefore,itismoremeaningfulandmoreefficienttofocusonthenon-zerovalues.Toillustrate,ifstudentsarecomparedonthebasisofthecoursestheydon’ttake,thenmoststudentswouldseemverysimilar,atleastifthenumberofcoursesislarge.Binaryattributeswhereonlynon-zerovaluesareimportantarecalledasymmetric
binaryattributes.Thistypeofattributeisparticularlyimportantforassociationanalysis,whichisdiscussedinChapter5 .Itisalsopossibletohavediscreteorcontinuousasymmetricfeatures.Forinstance,ifthenumberofcreditsassociatedwitheachcourseisrecorded,thentheresultingdatasetwillconsistofasymmetricdiscreteorcontinuousattributes.
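To make the notion of asymmetric binary attributes concrete, the following sketch compares two hypothetical students (names, courses, and values invented for illustration) once by simple matching over all attributes and once using only the courses that at least one of them took; only the latter reflects what the students actually have in common.

# Each entry records whether a course was taken (1) or not (0).
courses = ["c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8"]
alice = [1, 0, 0, 0, 1, 0, 0, 0]
bob   = [0, 0, 0, 0, 1, 0, 0, 1]

# Simple matching counts 0-0 matches, so sparse students look very similar.
matches = sum(a == b for a, b in zip(alice, bob))
simple_matching = matches / len(courses)

# For asymmetric binary attributes, ignore 0-0 matches (a Jaccard-style comparison).
both = sum(a == 1 and b == 1 for a, b in zip(alice, bob))
either = sum(a == 1 or b == 1 for a, b in zip(alice, bob))
jaccard = both / either

print(simple_matching)  # 0.75 -- inflated by the many courses neither student took
print(jaccard)          # about 0.33 -- based only on the non-zero (present) values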
GeneralCommentsonLevelsofMeasurementAsdescribedintherestofthischapter,therearemanydiversetypesofdata.Thepreviousdiscussionofmeasurementscales,whileuseful,isnotcompleteandhassomelimitations.Weprovidethefollowingcommentsandguidance.
Distinctness,order,andmeaningfulintervalsandratiosareonlyfourpropertiesofdata—manyothersarepossible.Forinstance,somedataisinherentlycyclical,e.g.,positiononthesurfaceoftheEarthortime.Asanotherexample,considersetvaluedattributes,whereeachattributevalueisasetofelements,e.g.,thesetofmoviesseeninthelastyear.Defineonesetofelements(movies)tobegreater(larger)thanasecondsetifthesecondsetisasubsetofthefirst.However,sucharelationshipdefinesonlyapartialorderthatdoesnotmatchanyoftheattributetypesjustdefined.Thenumbersorsymbolsusedtocaptureattributevaluesmaynotcaptureallthepropertiesoftheattributesormaysuggestpropertiesthatarenotthere.AnillustrationofthisforintegerswaspresentedinExample2.3 ,i.e.,averagesofIDsandoutofrangeages.Dataisoftentransformedforthepurposeofanalysis—seeSection2.3.7 .Thisoftenchangesthedistributionoftheobservedvariabletoadistributionthatiseasiertoanalyze,e.g.,aGaussian(normal)distribution.Often,suchtransformationsonlypreservetheorderoftheoriginalvalues,andotherpropertiesarelost.Nonetheless,ifthedesiredoutcomeisa
statisticaltestofdifferencesorapredictivemodel,suchatransformationisjustified.Thefinalevaluationofanydataanalysis,includingoperationsonattributes,iswhethertheresultsmakesensefromadomainpointofview.
Insummary,itcanbechallengingtodeterminewhichoperationscanbeperformedonaparticularattributeoracollectionofattributeswithoutcompromisingtheintegrityoftheanalysis.Fortunately,establishedpracticeoftenservesasareliableguide.Occasionally,however,standardpracticesareerroneousorhavelimitations.
2.1.2TypesofDataSets
Therearemanytypesofdatasets,andasthefieldofdataminingdevelopsandmatures,agreatervarietyofdatasetsbecomeavailableforanalysis.Inthissection,wedescribesomeofthemostcommontypes.Forconvenience,wehavegroupedthetypesofdatasetsintothreegroups:recorddata,graph-baseddata,andordereddata.Thesecategoriesdonotcoverallpossibilitiesandothergroupingsarecertainlypossible.
GeneralCharacteristicsofDataSetsBeforeprovidingdetailsofspecifickindsofdatasets,wediscussthreecharacteristicsthatapplytomanydatasetsandhaveasignificantimpactonthedataminingtechniquesthatareused:dimensionality,distribution,andresolution.
Dimensionality
Thedimensionalityofadatasetisthenumberofattributesthattheobjectsinthedatasetpossess.Analyzingdatawithasmallnumberofdimensionstendstobequalitativelydifferentfromanalyzingmoderateorhigh-dimensionaldata.Indeed,thedifficultiesassociatedwiththeanalysisofhigh-dimensionaldataaresometimesreferredtoasthecurseofdimensionality.Becauseofthis,animportantmotivationinpreprocessingthedataisdimensionalityreduction.TheseissuesarediscussedinmoredepthlaterinthischapterandinAppendixB.
Distribution
Thedistributionofadatasetisthefrequencyofoccurrenceofvariousvaluesorsetsofvaluesfortheattributescomprisingdataobjects.Equivalently,thedistributionofadatasetcanbeconsideredasadescriptionoftheconcentrationofobjectsinvariousregionsofthedataspace.Statisticianshaveenumeratedmanytypesofdistributions,e.g.,Gaussian(normal),anddescribedtheirproperties.(SeeAppendixC.)Althoughstatisticalapproachesfordescribingdistributionscanyieldpowerfulanalysistechniques,manydatasetshavedistributionsthatarenotwellcapturedbystandardstatisticaldistributions.
Asaresult,manydataminingalgorithmsdonotassumeaparticularstatisticaldistributionforthedatatheyanalyze.However,somegeneralaspectsofdistributionsoftenhaveastrongimpact.Forexample,supposeacategoricalattributeisusedasaclassvariable,whereoneofthecategoriesoccurs95%ofthetime,whiletheothercategoriestogetheroccuronly5%ofthetime.ThisskewnessinthedistributioncanmakeclassificationdifficultasdiscussedinSection4.11.(Skewnesshasotherimpactsondataanalysisthatarenotdiscussedhere.)
Aspecialcaseofskeweddataissparsity.Forsparsebinary,countorcontinuousdata,mostattributesofanobjecthavevaluesof0.Inmanycases,fewerthan1%ofthevaluesarenon-zero.Inpracticalterms,sparsityisanadvantagebecauseusuallyonlythenon-zerovaluesneedtobestoredandmanipulated.Thisresultsinsignificantsavingswithrespecttocomputationtimeandstorage.Indeed,somedataminingalgorithms,suchastheassociationruleminingalgorithmsdescribedinChapter5 ,workwellonlyforsparsedata.Finally,notethatoftentheattributesinsparsedatasetsareasymmetricattributes.
Resolution
Itisfrequentlypossibletoobtaindataatdifferentlevelsofresolution,andoftenthepropertiesofthedataaredifferentatdifferentresolutions.Forinstance,thesurfaceoftheEarthseemsveryunevenataresolutionofafewmeters,butisrelativelysmoothataresolutionoftensofkilometers.Thepatternsinthedataalsodependonthelevelofresolution.Iftheresolutionistoofine,apatternmaynotbevisibleormaybeburiedinnoise;iftheresolutionistoocoarse,thepatterncandisappear.Forexample,variationsinatmosphericpressureonascaleofhoursreflectthemovementofstormsandotherweathersystems.Onascaleofmonths,suchphenomenaarenotdetectable.
RecordDataMuchdataminingworkassumesthatthedatasetisacollectionofrecords(dataobjects),eachofwhichconsistsofafixedsetofdatafields(attributes).SeeFigure2.2(a) .Forthemostbasicformofrecorddata,thereisnoexplicitrelationshipamongrecordsordatafields,andeveryrecord(object)hasthesamesetofattributes.Recorddataisusuallystoredeitherinflatfilesorinrelationaldatabases.Relationaldatabasesarecertainlymorethana
collectionofrecords,butdataminingoftendoesnotuseanyoftheadditionalinformationavailableinarelationaldatabase.Rather,thedatabaseservesasaconvenientplacetofindrecords.DifferenttypesofrecorddataaredescribedbelowandareillustratedinFigure2.2 .
Figure2.2.Differentvariationsofrecorddata.
TransactionorMarketBasketData
Transactiondataisaspecialtypeofrecorddata,whereeachrecord(transaction)involvesasetofitems.Consideragrocerystore.Thesetofproductspurchasedbyacustomerduringoneshoppingtripconstitutesatransaction,whiletheindividualproductsthatwerepurchasedaretheitems.Thistypeofdataiscalledmarketbasketdatabecausetheitemsineachrecordaretheproductsinaperson’s“marketbasket.”Transactiondataisacollectionofsetsofitems,butitcanbeviewedasasetofrecordswhosefieldsareasymmetricattributes.Mostoften,theattributesarebinary,indicatingwhetheranitemwaspurchased,butmoregenerally,theattributescanbediscreteorcontinuous,suchasthenumberofitemspurchasedortheamountspentonthoseitems.Figure2.2(b) showsasampletransactiondataset.Eachrowrepresentsthepurchasesofaparticularcustomerataparticulartime.
TheDataMatrix
Ifallthedataobjectsinacollectionofdatahavethesamefixedsetofnumericattributes,thenthedataobjectscanbethoughtofaspoints(vectors)inamultidimensionalspace,whereeachdimensionrepresentsadistinctattributedescribingtheobject.Asetofsuchdataobjectscanbeinterpretedasanmbynmatrix,wheretherearemrows,oneforeachobject,andncolumns,oneforeachattribute.(Arepresentationthathasdataobjectsascolumnsandattributesasrowsisalsofine.)Thismatrixiscalledadatamatrixorapatternmatrix.Adatamatrixisavariationofrecorddata,butbecauseitconsistsofnumericattributes,standardmatrixoperationcanbeappliedtotransformandmanipulatethedata.Therefore,thedatamatrixisthestandarddataformatformoststatisticaldata.Figure2.2(c) showsasampledatamatrix.
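Because a data matrix contains only numeric attributes, standard matrix operations can be applied to it directly. The sketch below (NumPy, with invented values) stores four objects with three attributes as a 4-by-3 matrix and applies a simple column-wise transformation; the data is illustrative only.

import numpy as np

# A 4 x 3 data matrix: 4 objects (rows), 3 numeric attributes (columns).
X = np.array([[1.0, 200.0, 3.5],
              [2.0, 180.0, 2.9],
              [1.5, 220.0, 3.8],
              [3.0, 150.0, 2.1]])

# Column means and standard deviations, then standardize each attribute
# so that attributes measured on different scales become comparable.
means = X.mean(axis=0)
stds = X.std(axis=0)
Z = (X - means) / stds

print(Z.mean(axis=0))  # approximately 0 for every attribute
print(Z.std(axis=0))   # approximately 1 for every attribute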
TheSparseDataMatrix
Asparsedatamatrixisaspecialcaseofadatamatrixwheretheattributesareofthesametypeandareasymmetric;i.e.,onlynon-zerovaluesareimportant.Transactiondataisanexampleofasparsedatamatrixthathasonly0–1entries.Anothercommonexampleisdocumentdata.Inparticular,iftheorderoftheterms(words)inadocumentisignored—the“bagofwords”approach—thenadocumentcanberepresentedasatermvector,whereeachtermisacomponent(attribute)ofthevectorandthevalueofeachcomponentisthenumberoftimesthecorrespondingtermoccursinthedocument.Thisrepresentationofacollectionofdocumentsisoftencalledadocument-termmatrix.Figure2.2(d) showsasampledocument-termmatrix.Thedocumentsaretherowsofthismatrix,whilethetermsarethecolumns.Inpractice,onlythenon-zeroentriesofsparsedatamatricesarestored.
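A document-term matrix can be built by storing only the non-zero entries. The sketch below (plain Python, with three tiny example "documents" invented for illustration) represents each document as a dictionary from term to count, which is one simple way to keep a sparse data matrix.

from collections import Counter

docs = ["the cat sat on the mat",
        "the dog sat",
        "data mining finds patterns in data"]

# Bag-of-words representation: ignore word order, keep only term counts.
# Each row of the (implicit) document-term matrix is stored sparsely as a dict,
# so the many zero entries are never materialized.
rows = [Counter(doc.split()) for doc in docs]

vocabulary = sorted(set(term for row in rows for term in row))
for i, row in enumerate(rows):
    print(i, {term: row[term] for term in vocabulary if row[term] > 0})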
Graph-BasedDataAgraphcansometimesbeaconvenientandpowerfulrepresentationfordata.Weconsidertwospecificcases:(1)thegraphcapturesrelationshipsamongdataobjectsand(2)thedataobjectsthemselvesarerepresentedasgraphs.
DatawithRelationshipsamongObjects
Therelationshipsamongobjectsfrequentlyconveyimportantinformation.Insuchcases,thedataisoftenrepresentedasagraph.Inparticular,thedataobjectsaremappedtonodesofthegraph,whiletherelationshipsamongobjectsarecapturedbythelinksbetweenobjectsandlinkproperties,suchasdirectionandweight.ConsiderwebpagesontheWorldWideWeb,whichcontainbothtextandlinkstootherpages.Inordertoprocesssearchqueries,websearchenginescollectandprocesswebpagestoextracttheircontents.Itiswell-known,however,thatthelinkstoandfromeachpageprovideagreatdealofinformationabouttherelevanceofawebpagetoaquery,andthus,mustalsobetakenintoconsideration.Figure2.3(a) showsasetoflinked
webpages.Anotherimportantexampleofsuchgraphdataarethesocialnetworks,wheredataobjectsarepeopleandtherelationshipsamongthemaretheirinteractionsviasocialmedia.
DatawithObjectsThatAreGraphs
Ifobjectshavestructure,thatis,theobjectscontainsubobjectsthathaverelationships,thensuchobjectsarefrequentlyrepresentedasgraphs.Forexample,thestructureofchemicalcompoundscanberepresentedbyagraph,wherethenodesareatomsandthelinksbetweennodesarechemicalbonds.Figure2.3(b) showsaball-and-stickdiagramofthechemicalcompoundbenzene,whichcontainsatomsofcarbon(black)andhydrogen(gray).Agraphrepresentationmakesitpossibletodeterminewhichsubstructuresoccurfrequentlyinasetofcompoundsandtoascertainwhetherthepresenceofanyofthesesubstructuresisassociatedwiththepresenceorabsenceofcertainchemicalproperties,suchasmeltingpointorheatofformation.Frequentgraphmining,whichisabranchofdataminingthatanalyzessuchdata,isconsideredinSection6.5.
Figure2.3.Differentvariationsofgraphdata.
OrderedDataForsometypesofdata,theattributeshaverelationshipsthatinvolveorderintimeorspace.DifferenttypesofordereddataaredescribednextandareshowninFigure2.4 .
SequentialTransactionData
Sequentialtransactiondatacanbethoughtofasanextensionoftransactiondata,whereeachtransactionhasatimeassociatedwithit.Consideraretailtransactiondatasetthatalsostoresthetimeatwhichthetransactiontookplace.Thistimeinformationmakesitpossibletofindpatternssuchas“candysalespeakbeforeHalloween.”Atimecanalsobeassociatedwitheachattribute.Forexample,eachrecordcouldbethepurchasehistoryofa
customer,withalistingofitemspurchasedatdifferenttimes.Usingthisinformation,itispossibletofindpatternssuchas“peoplewhobuyDVDplayerstendtobuyDVDsintheperiodimmediatelyfollowingthepurchase.”
Figure2.4(a) showsanexampleofsequentialtransactiondata.Therearefivedifferenttimes—t1,t2,t3,t4,andt5;threedifferentcustomers—C1,C2,andC3;andfivedifferentitems—A,B,C,D,andE.Inthetoptable,eachrowcorrespondstotheitemspurchasedataparticulartimebyeachcustomer.Forinstance,attimet3,customerC2purchaseditemsAandD.Inthebottomtable,thesameinformationisdisplayed,buteachrowcorrespondstoaparticularcustomer.Eachrowcontainsinformationabouteachtransactioninvolvingthecustomer,whereatransactionisconsideredtobeasetofitemsandthetimeatwhichthoseitemswerepurchased.Forexample,customerC3boughtitemsAandCattimet2.
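The two views of sequential transaction data are easy to convert between. The sketch below uses the customers, items, and times named in the text; the rows for (t3, C2) and (t2, C3) come from the example above, while the remaining transactions are hypothetical. It groups the time-ordered rows of the top table into one purchase history per customer, as in the bottom table.

from collections import defaultdict

# Top-table view: (time, customer, items bought at that time).
transactions = [
    ("t1", "C1", {"A", "B"}),
    ("t2", "C3", {"A", "C"}),
    ("t2", "C1", {"C", "D"}),
    ("t3", "C2", {"A", "D"}),
    ("t4", "C2", {"E"}),
    ("t5", "C1", {"A", "E"}),
]

# Bottom-table view: one row per customer, holding that customer's
# time-ordered sequence of (time, items) transactions.
history = defaultdict(list)
for time, customer, items in sorted(transactions):
    history[customer].append((time, sorted(items)))

for customer in sorted(history):
    print(customer, history[customer])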
TimeSeriesData
Timeseriesdataisaspecialtypeofordereddatawhereeachrecordisatimeseries,i.e.,aseriesofmeasurementstakenovertime.Forexample,afinancialdatasetmightcontainobjectsthataretimeseriesofthedailypricesofvariousstocks.Asanotherexample,considerFigure2.4(c) ,whichshowsatimeseriesoftheaveragemonthlytemperatureforMinneapolisduringtheyears1982to1994.Whenworkingwithtemporaldata,suchastimeseries,itisimportanttoconsidertemporalautocorrelation;i.e.,iftwomeasurementsarecloseintime,thenthevaluesofthosemeasurementsareoftenverysimilar.
Figure2.4.Differentvariationsofordereddata.
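Temporal autocorrelation can be quantified directly. The sketch below generates a smooth synthetic monthly series (illustrative values, not the Minneapolis temperatures of Figure 2.4(c)) and computes its lag-1 autocorrelation with NumPy.

import numpy as np

# A smooth synthetic "monthly temperature" series: a seasonal cycle plus mild noise.
rng = np.random.default_rng(0)
months = np.arange(120)
series = 10 + 15 * np.sin(2 * np.pi * months / 12) + rng.normal(0, 1, size=months.size)

# Lag-1 autocorrelation: correlation between the series and itself shifted by one month.
lag1 = np.corrcoef(series[:-1], series[1:])[0, 1]
print(lag1)  # high, since adjacent months have similar values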
SequenceData
Sequencedataconsistsofadatasetthatisasequenceofindividualentities,suchasasequenceofwordsorletters.Itisquitesimilartosequentialdata,exceptthattherearenotimestamps;instead,therearepositionsinanorderedsequence.Forexample,thegeneticinformationofplantsandanimalscanberepresentedintheformofsequencesofnucleotidesthatareknownasgenes.Manyoftheproblemsassociatedwithgeneticsequencedatainvolvepredictingsimilaritiesinthestructureandfunctionofgenesfromsimilaritiesinnucleotidesequences.Figure2.4(b) showsasectionofthehumangeneticcodeexpressedusingthefournucleotidesfromwhichallDNAisconstructed:A,T,G,andC.
SpatialandSpatio-TemporalData
Someobjectshavespatialattributes,suchaspositionsorareas,inadditiontoothertypesofattributes.Anexampleofspatialdataisweatherdata(precipitation,temperature,pressure)thatiscollectedforavarietyofgeographicallocations.Oftensuchmeasurementsarecollectedovertime,andthus,thedataconsistsoftimeseriesatvariouslocations.Inthatcase,werefertothedataasspatio-temporaldata.Althoughanalysiscanbeconductedseparatelyforeachspecifictimeorlocation,amorecompleteanalysisofspatio-temporaldatarequiresconsiderationofboththespatialandtemporalaspectsofthedata.
Animportantaspectofspatialdataisspatialautocorrelation;i.e.,objectsthatarephysicallyclosetendtobesimilarinotherwaysaswell.Thus,twopointsontheEarththatareclosetoeachotherusuallyhavesimilarvaluesfortemperatureandrainfall.Notethatspatialautocorrelationisanalogoustotemporalautocorrelation.
Important examples of spatial and spatio-temporal data are the science and engineering data sets that are the result of measurements or model output taken at regularly or irregularly distributed points on a two- or three-dimensional grid or mesh. For instance, Earth science data sets record the temperature or pressure measured at points (grid cells) on latitude–longitude spherical grids of various resolutions, e.g., 1° by 1°. See Figure 2.4(d). As another example, in the simulation of the flow of a gas, the speed and direction of flow at various instants in time can be recorded for each grid point in the simulation. A different type of spatio-temporal data arises from tracking the trajectories of objects, e.g., vehicles, in time and space.
HandlingNon-RecordDataMostdataminingalgorithmsaredesignedforrecorddataoritsvariations,suchastransactiondataanddatamatrices.Record-orientedtechniquescanbeappliedtonon-recorddatabyextractingfeaturesfromdataobjectsandusingthesefeaturestocreatearecordcorrespondingtoeachobject.Considerthechemicalstructuredatathatwasdescribedearlier.Givenasetofcommonsubstructures,eachcompoundcanberepresentedasarecordwithbinaryattributesthatindicatewhetheracompoundcontainsaspecificsubstructure.Sucharepresentationisactuallyatransactiondataset,wherethetransactionsarethecompoundsandtheitemsarethesubstructures.
Insomecases,itiseasytorepresentthedatainarecordformat,butthistypeofrepresentationdoesnotcapturealltheinformationinthedata.Considerspatio-temporaldataconsistingofatimeseriesfromeachpointonaspatialgrid.Thisdataisoftenstoredinadatamatrix,whereeachrowrepresentsalocationandeachcolumnrepresentsaparticularpointintime.However,sucharepresentationdoesnotexplicitlycapturethetimerelationshipsthatarepresentamongattributesandthespatialrelationshipsthatexistamongobjects.Thisdoesnotmeanthatsucharepresentationisinappropriate,butratherthattheserelationshipsmustbetakenintoconsiderationduringtheanalysis.Forexample,itwouldnotbeagoodideatouseadatamining
techniquethatignoresthetemporalautocorrelationoftheattributesorthespatialautocorrelationofthedataobjects,i.e.,thelocationsonthespatialgrid.
2.2DataQualityDataminingalgorithmsareoftenappliedtodatathatwascollectedforanotherpurpose,orforfuture,butunspecifiedapplications.Forthatreason,dataminingcannotusuallytakeadvantageofthesignificantbenefitsof“ad-dressingqualityissuesatthesource.”Incontrast,muchofstatisticsdealswiththedesignofexperimentsorsurveysthatachieveaprespecifiedlevelofdataquality.Becausepreventingdataqualityproblemsistypicallynotanoption,dataminingfocuseson(1)thedetectionandcorrectionofdataqualityproblemsand(2)theuseofalgorithmsthatcantoleratepoordataquality.Thefirststep,detectionandcorrection,isoftencalleddatacleaning.
Thefollowingsectionsdiscussspecificaspectsofdataquality.Thefocusisonmeasurementanddatacollectionissues,althoughsomeapplication-relatedissuesarealsodiscussed.
2.2.1MeasurementandDataCollectionIssues
Itisunrealistictoexpectthatdatawillbeperfect.Theremaybeproblemsduetohumanerror,limitationsofmeasuringdevices,orflawsinthedatacollectionprocess.Valuesorevenentiredataobjectscanbemissing.Inothercases,therecanbespuriousorduplicateobjects;i.e.,multipledataobjectsthatallcorrespondtoasingle“real”object.Forexample,theremightbetwodifferentrecordsforapersonwhohasrecentlylivedattwodifferentaddresses.Evenif
allthedataispresentand“looksfine,”theremaybeinconsistencies—apersonhasaheightof2meters,butweighsonly2kilograms.
Inthenextfewsections,wefocusonaspectsofdataqualitythatarerelatedtodatameasurementandcollection.Webeginwithadefinitionofmeasurementanddatacollectionerrorsandthenconsideravarietyofproblemsthatinvolvemeasurementerror:noise,artifacts,bias,precision,andaccuracy.Weconcludebydiscussingdataqualityissuesthatinvolvebothmeasurementanddatacollectionproblems:outliers,missingandinconsistentvalues,andduplicatedata.
MeasurementandDataCollectionErrorsThetermmeasurementerrorreferstoanyproblemresultingfromthemeasurementprocess.Acommonproblemisthatthevaluerecordeddiffersfromthetruevaluetosomeextent.Forcontinuousattributes,thenumericaldifferenceofthemeasuredandtruevalueiscalledtheerror.Thetermdatacollectionerrorreferstoerrorssuchasomittingdataobjectsorattributevalues,orinappropriatelyincludingadataobject.Forexample,astudyofanimalsofacertainspeciesmightincludeanimalsofarelatedspeciesthataresimilarinappearancetothespeciesofinterest.Bothmeasurementerrorsanddatacollectionerrorscanbeeithersystematicorrandom.
Wewillonlyconsidergeneraltypesoferrors.Withinparticulardomains,certaintypesofdataerrorsarecommonplace,andwell-developedtechniquesoftenexistfordetectingand/orcorrectingtheseerrors.Forexample,keyboarderrorsarecommonwhendataisenteredmanually,andasaresult,manydataentryprogramshavetechniquesfordetectingand,withhumanintervention,correctingsucherrors.
Noise and Artifacts
Noise is the random component of a measurement error. It typically involves the distortion of a value or the addition of spurious objects. Figure 2.5 shows a time series before and after it has been disrupted by random noise. If a bit more noise were added to the time series, its shape would be lost. Figure 2.6 shows a set of data points before and after some noise points (indicated by '+'s) have been added. Notice that some of the noise points are intermixed with the non-noise points.

Figure 2.5. Noise in a time series context.

Figure 2.6. Noise in a spatial context.
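A figure like Figure 2.5 can be reproduced with a few lines of code. The sketch below (NumPy, synthetic data chosen only for illustration) takes a clean periodic signal and adds Gaussian random noise; increasing the noise level eventually hides the shape of the series, which is the point made above.

import numpy as np

rng = np.random.default_rng(42)
t = np.linspace(0, 4 * np.pi, 200)
clean = np.sin(t)                                    # the underlying pattern (signal)

noisy = clean + rng.normal(0.0, 0.3, size=t.size)    # moderate noise: shape still visible
drowned = clean + rng.normal(0.0, 3.0, size=t.size)  # heavy noise: shape is lost

# The signal-to-noise situation can be summarized by comparing variances.
print(np.var(clean), np.var(noisy - clean), np.var(drowned - clean))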
Thetermnoiseisoftenusedinconnectionwithdatathathasaspatialortemporalcomponent.Insuchcases,techniquesfromsignalorimageprocessingcanfrequentlybeusedtoreducenoiseandthus,helptodiscoverpatterns(signals)thatmightbe“lostinthenoise.”Nonetheless,theeliminationofnoiseisfrequentlydifficult,andmuchworkindataminingfocusesondevisingrobustalgorithmsthatproduceacceptableresultsevenwhennoiseispresent.
Dataerrorscanbetheresultofamoredeterministicphenomenon,suchasastreakinthesameplaceonasetofphotographs.Suchdeterministicdistortionsofthedataareoftenreferredtoasartifacts.
Precision,Bias,andAccuracyInstatisticsandexperimentalscience,thequalityofthemeasurementprocessandtheresultingdataaremeasuredbyprecisionandbias.Weprovidethe
standarddefinitions,followedbyabriefdiscussion.Forthefollowingdefinitions,weassumethatwemakerepeatedmeasurementsofthesameunderlyingquantity.
Definition2.3(Precision).Theclosenessofrepeatedmeasurements(ofthesamequantity)tooneanother.
Definition2.4(Bias).Asystematicvariationofmeasurementsfromthequantitybeingmeasured.
Precisionisoftenmeasuredbythestandarddeviationofasetofvalues,whilebiasismeasuredbytakingthedifferencebetweenthemeanofthesetofvaluesandtheknownvalueofthequantitybeingmeasured.Biascanbedeterminedonlyforobjectswhosemeasuredquantityisknownbymeansexternaltothecurrentsituation.Supposethatwehaveastandardlaboratoryweightwithamassof1gandwanttoassesstheprecisionandbiasofournewlaboratoryscale.Weweighthemassfivetimes,andobtainthefollowingfivevalues:{1.015,0.990,1.013,1.001,0.986}.Themeanofthesevaluesis
1.001,andhence,thebiasis0.001.Theprecision,asmeasuredbythestandarddeviation,is0.013.
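The laboratory-scale example can be reproduced directly; the sketch below uses only the five measurements given above and the Python standard library.

import statistics

measurements = [1.015, 0.990, 1.013, 1.001, 0.986]
true_value = 1.0                      # mass of the standard laboratory weight, in grams

mean = statistics.mean(measurements)
bias = mean - true_value              # systematic variation from the true quantity
precision = statistics.stdev(measurements)  # closeness of repeated measurements

print(round(mean, 3), round(bias, 3), round(precision, 3))  # 1.001, 0.001, 0.013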
Itiscommontousethemoregeneralterm,accuracy,torefertothedegreeofmeasurementerrorindata.
Definition2.5(Accuracy)Theclosenessofmeasurementstothetruevalueofthequantitybeingmeasured.
Accuracydependsonprecisionandbias,butthereisnospecificformulaforaccuracyintermsofthesetwoquantities.
One important aspect of accuracy is the use of significant digits. The goal is to use only as many digits to represent the result of a measurement or calculation as are justified by the precision of the data. For example, if the length of an object is measured with a meter stick whose smallest markings are millimeters, then we should record the length of data only to the nearest millimeter. The precision of such a measurement would be ±0.5 mm. We do not review the details of working with significant digits because most readers will have encountered them in previous courses and they are covered in considerable depth in science, engineering, and statistics textbooks.

Issues such as significant digits, precision, bias, and accuracy are sometimes overlooked, but they are important for data mining as well as statistics and science. Many times, data sets do not come with information about the precision of the data, and furthermore, the programs used for analysis return results without any such information. Nonetheless, without some understanding of the accuracy of the data and the results, an analyst runs the risk of committing serious data analysis blunders.
OutliersOutliersareeither(1)dataobjectsthat,insomesense,havecharacteristicsthataredifferentfrommostoftheotherdataobjectsinthedataset,or(2)valuesofanattributethatareunusualwithrespecttothetypicalvaluesforthatattribute.Alternatively,theycanbereferredtoasanomalousobjectsorvalues.Thereisconsiderableleewayinthedefinitionofanoutlier,andmanydifferentdefinitionshavebeenproposedbythestatisticsanddataminingcommunities.Furthermore,itisimportanttodistinguishbetweenthenotionsofnoiseandoutliers.Unlikenoise,outlierscanbelegitimatedataobjectsorvaluesthatweareinterestedindetecting.Forinstance,infraudandnetworkintrusiondetection,thegoalistofindunusualobjectsoreventsfromamongalargenumberofnormalones.Chapter9 discussesanomalydetectioninmoredetail.
MissingValuesItisnotunusualforanobjecttobemissingoneormoreattributevalues.Insomecases,theinformationwasnotcollected;e.g.,somepeopledeclinetogivetheirageorweight.Inothercases,someattributesarenotapplicabletoallobjects;e.g.,often,formshaveconditionalpartsthatarefilledoutonlywhenapersonanswersapreviousquestioninacertainway,butforsimplicity,allfieldsarestored.Regardless,missingvaluesshouldbetakenintoaccountduringthedataanalysis.
Thereareseveralstrategies(andvariationsonthesestrategies)fordealingwithmissingdata,eachofwhichisappropriateincertaincircumstances.Thesestrategiesarelistednext,alongwithanindicationoftheiradvantagesanddisadvantages.
EliminateDataObjectsorAttributes
Asimpleandeffectivestrategyistoeliminateobjectswithmissingvalues.However,evenapartiallyspecifieddataobjectcontainssomeinformation,andifmanyobjectshavemissingvalues,thenareliableanalysiscanbedifficultorimpossible.Nonetheless,ifadatasethasonlyafewobjectsthathavemissingvalues,thenitmaybeexpedienttoomitthem.Arelatedstrategyistoeliminateattributesthathavemissingvalues.Thisshouldbedonewithcaution,however,becausetheeliminatedattributesmaybetheonesthatarecriticaltotheanalysis.
EstimateMissingValues
Sometimesmissingdatacanbereliablyestimated.Forexample,consideratimeseriesthatchangesinareasonablysmoothfashion,buthasafew,widelyscatteredmissingvalues.Insuchcases,themissingvaluescanbeestimated(interpolated)byusingtheremainingvalues.Asanotherexample,consideradatasetthathasmanysimilardatapoints.Inthissituation,theattributevaluesofthepointsclosesttothepointwiththemissingvalueareoftenusedtoestimatethemissingvalue.Iftheattributeiscontinuous,thentheaverageattributevalueofthenearestneighborsisused;iftheattributeiscategorical,thenthemostcommonlyoccurringattributevaluecanbetaken.Foraconcreteillustration,considerprecipitationmeasurementsthatarerecordedbygroundstations.Forareasnotcontainingagroundstation,theprecipitationcanbeestimatedusingvaluesobservedatnearbygroundstations.
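Both estimation strategies just described are straightforward to sketch. The code below (illustrative values only) interpolates an isolated missing value in a smooth time series and fills a missing categorical value with the most common value among the nearest neighbors; the data and the simple distance function are assumptions made for the example, not a prescribed method.

# 1. Interpolating a missing value in a reasonably smooth time series.
series = [2.0, 2.1, None, 2.3, 2.4]
i = series.index(None)
series[i] = (series[i - 1] + series[i + 1]) / 2   # linear interpolation -> 2.2

# 2. Filling a missing categorical value from the nearest neighbors.
# Each point: (x, y, label); the last point is missing its label.
points = [(1.0, 1.0, "A"), (1.1, 0.9, "A"), (5.0, 5.0, "B"), (1.05, 1.1, None)]
qx, qy, _ = points[-1]
neighbors = sorted(points[:-1], key=lambda p: (p[0] - qx) ** 2 + (p[1] - qy) ** 2)
labels = [label for _, _, label in neighbors[:2]]   # the two closest points
estimate = max(set(labels), key=labels.count)       # most common label -> "A"
print(series, estimate)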
IgnoretheMissingValueduringAnalysis
Manydataminingapproachescanbemodifiedtoignoremissingvalues.Forexample,supposethatobjectsarebeingclusteredandthesimilaritybetweenpairsofdataobjectsneedstobecalculated.Ifoneorbothobjectsofapairhavemissingvaluesforsomeattributes,thenthesimilaritycanbecalculatedbyusingonlytheattributesthatdonothavemissingvalues.Itistruethatthesimilaritywillonlybeapproximate,butunlessthetotalnumberofattributesissmallorthenumberofmissingvaluesishigh,thisdegreeofinaccuracymaynotmattermuch.Likewise,manyclassificationschemescanbemodifiedtoworkwithmissingvalues.
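The following sketch shows one way a similarity can be computed using only the attributes for which both objects have values; the data and the particular inverse-distance similarity function are assumptions chosen just for illustration.

import math

x = [2.0, None, 4.0, 1.0]
y = [1.5, 3.0, None, 1.0]

# Use only attributes where neither object has a missing value.
pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
dist = math.sqrt(sum((a - b) ** 2 for a, b in pairs))
similarity = 1.0 / (1.0 + dist)   # an approximate similarity over the shared attributes

print(len(pairs), round(similarity, 3))   # only 2 of 4 attributes are used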
InconsistentValuesDatacancontaininconsistentvalues.Consideranaddressfield,wherebothazipcodeandcityarelisted,butthespecifiedzipcodeareaisnotcontainedinthatcity.Itispossiblethattheindividualenteringthisinformationtransposedtwodigits,orperhapsadigitwasmisreadwhentheinformationwasscannedfromahandwrittenform.Regardlessofthecauseoftheinconsistentvalues,itisimportanttodetectand,ifpossible,correctsuchproblems.
Sometypesofinconsistencesareeasytodetect.Forinstance,aperson’sheightshouldnotbenegative.Inothercases,itcanbenecessarytoconsultanexternalsourceofinformation.Forexample,whenaninsurancecompanyprocessesclaimsforreimbursement,itchecksthenamesandaddressesonthereimbursementformsagainstadatabaseofitscustomers.
Onceaninconsistencyhasbeendetected,itissometimespossibletocorrectthedata.Aproductcodemayhave“check”digits,oritmaybepossibletodouble-checkaproductcodeagainstalistofknownproductcodes,andthen
correctthecodeifitisincorrect,butclosetoaknowncode.Thecorrectionofaninconsistencyrequiresadditionalorredundantinformation.
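Simple consistency checks of the kind just described can be automated. The sketch below validates hypothetical records against a small lookup table of known zip code/city pairs and flags impossible heights; the field names and data are invented for illustration.

# Known zip code -> city pairs (hypothetical reference data).
zip_to_city = {"55455": "Minneapolis", "48824": "East Lansing"}

records = [
    {"name": "A", "zip": "55455", "city": "Minneapolis", "height_cm": 180},
    {"name": "B", "zip": "48824", "city": "Minneapolis", "height_cm": 170},  # zip/city mismatch
    {"name": "C", "zip": "55455", "city": "Minneapolis", "height_cm": -5},   # impossible height
]

for r in records:
    problems = []
    if zip_to_city.get(r["zip"]) != r["city"]:
        problems.append("zip code and city are inconsistent")
    if r["height_cm"] <= 0:
        problems.append("height should be positive")
    if problems:
        print(r["name"], problems)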
Example2.6(InconsistentSeaSurfaceTemperature).Thisexampleillustratesaninconsistencyinactualtimeseriesdatathatmeasurestheseasurfacetemperature(SST)atvariouspointsontheocean.SSTdatawasoriginallycollectedusingocean-basedmeasurementsfromshipsorbuoys,butmorerecently,satelliteshavebeenusedtogatherthedata.Tocreatealong-termdataset,bothsourcesofdatamustbeused.However,becausethedatacomesfromdifferentsources,thetwopartsofthedataaresubtlydifferent.ThisdiscrepancyisvisuallydisplayedinFigure2.7 ,whichshowsthecorrelationofSSTvaluesbetweenpairsofyears.Ifapairofyearshasapositivecorrelation,thenthelocationcorrespondingtothepairofyearsiscoloredwhite;otherwiseitiscoloredblack.(Seasonalvariationswereremovedfromthedatasince,otherwise,alltheyearswouldbehighlycorrelated.)Thereisadistinctchangeinbehaviorwherethedatahasbeenputtogetherin1983.Yearswithineachofthetwogroups,1958–1982and1983–1999,tendtohaveapositivecorrelationwithoneanother,butanegativecorrelationwithyearsintheothergroup.Thisdoesnotmeanthatthisdatashouldnotbeused,onlythattheanalystshouldconsiderthepotentialimpactofsuchdiscrepanciesonthedatamininganalysis.
Figure2.7.CorrelationofSSTdatabetweenpairsofyears.Whiteareasindicatepositivecorrelation.Blackareasindicatenegativecorrelation.
DuplicateDataAdatasetcanincludedataobjectsthatareduplicates,oralmostduplicates,ofoneanother.Manypeoplereceiveduplicatemailingsbecausetheyappearinadatabasemultipletimesunderslightlydifferentnames.Todetectandeliminatesuchduplicates,twomainissuesmustbeaddressed.First,iftherearetwoobjectsthatactuallyrepresentasingleobject,thenoneormorevaluesofcorrespondingattributesareusuallydifferent,andtheseinconsistentvaluesmustberesolved.Second,careneedstobetakentoavoidaccidentallycombiningdataobjectsthataresimilar,butnotduplicates,such
astwodistinctpeoplewithidenticalnames.Thetermdeduplicationisoftenusedtorefertotheprocessofdealingwiththeseissues.
Insomecases,twoormoreobjectsareidenticalwithrespecttotheattributesmeasuredbythedatabase,buttheystillrepresentdifferentobjects.Here,theduplicatesarelegitimate,butcanstillcauseproblemsforsomealgorithmsifthepossibilityofidenticalobjectsisnotspecificallyaccountedforintheirdesign.AnexampleofthisisgiveninExercise13 onpage108.
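A very small illustration of deduplication: the sketch below normalizes names and addresses before comparing records, so that near-duplicate entries (hypothetical mailing-list rows) are grouped together while records that merely share a name are kept apart because their addresses differ.

def normalize(record):
    # Normalize case, spacing, and a common abbreviation before comparison.
    name = " ".join(record["name"].lower().split())
    addr = " ".join(record["address"].lower().replace("st.", "street").split())
    return (name, addr)

mailing_list = [
    {"name": "Pat Smith", "address": "12 Oak St."},
    {"name": "PAT SMITH", "address": "12 Oak Street"},   # near duplicate of the first
    {"name": "Pat Smith", "address": "99 Elm Street"},   # same name, different person
]

groups = {}
for r in mailing_list:
    groups.setdefault(normalize(r), []).append(r)

for key, rows in groups.items():
    print(len(rows), key)   # the first two rows collapse into one group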
2.2.2IssuesRelatedtoApplications
Dataqualityissuescanalsobeconsideredfromanapplicationviewpointasexpressedbythestatement“dataisofhighqualityifitissuitableforitsintendeduse.”Thisapproachtodataqualityhasprovenquiteuseful,particularlyinbusinessandindustry.Asimilarviewpointisalsopresentinstatisticsandtheexperimentalsciences,withtheiremphasisonthecarefuldesignofexperimentstocollectthedatarelevanttoaspecifichypothesis.Aswithqualityissuesatthemeasurementanddatacollectionlevel,manyissuesarespecifictoparticularapplicationsandfields.Again,weconsideronlyafewofthegeneralissues.
Timeliness
Somedatastartstoageassoonasithasbeencollected.Inparticular,ifthedataprovidesasnapshotofsomeongoingphenomenonorprocess,suchasthepurchasingbehaviorofcustomersorwebbrowsingpatterns,thenthissnapshotrepresentsrealityforonlyalimitedtime.Ifthedataisoutofdate,thensoarethemodelsandpatternsthatarebasedonit.
Relevance
The available data must contain the information necessary for the application. Consider the task of building a model that predicts the accident rate for drivers. If information about the age and gender of the driver is omitted, then it is likely that the model will have limited accuracy unless this information is indirectly available through other attributes.

Making sure that the objects in a data set are relevant is also challenging. A common problem is sampling bias, which occurs when a sample does not contain different types of objects in proportion to their actual occurrence in the population. For example, survey data describes only those who respond to the survey. (Other aspects of sampling are discussed further in Section 2.3.2.) Because the results of a data analysis can reflect only the data that is present, sampling bias will typically lead to erroneous results when applied to the broader population.

Knowledge about the Data

Ideally, data sets are accompanied by documentation that describes different aspects of the data; the quality of this documentation can either aid or hinder the subsequent analysis. For example, if the documentation identifies several attributes as being strongly related, these attributes are likely to provide highly redundant information, and we usually decide to keep just one. (Consider sales tax and purchase price.) If the documentation is poor, however, and fails to tell us, for example, that the missing values for a particular field are indicated with a -9999, then our analysis of the data may be faulty. Other important characteristics are the precision of the data, the type of features (nominal, ordinal, interval, ratio), the scale of measurement (e.g., meters or feet for length), and the origin of the data.

2.3 Data Preprocessing

In this section, we consider which preprocessing steps should be applied to make the data more suitable for data mining. Data preprocessing is a broad area and consists of a number of different strategies and techniques that are interrelated in complex ways. We will present some of the most important ideas and approaches, and try to point out the interrelationships among them. Specifically, we will discuss the following topics:

Aggregation
Sampling
Dimensionality reduction
Feature subset selection
Feature creation
Discretization and binarization
Variable transformation

Roughly speaking, these topics fall into two categories: selecting data objects and attributes for the analysis, or creating/changing the attributes. In both cases, the goal is to improve the data mining analysis with respect to time, cost, and quality. Details are provided in the following sections.

A quick note about terminology: In the following, we sometimes use synonyms for attribute, such as feature or variable, in order to follow common usage.

2.3.1 Aggregation

Sometimes "less is more," and this is the case with aggregation, the combining of two or more objects into a single object. Consider a data set consisting of transactions (data objects) recording the daily sales of products in various store locations (Minneapolis, Chicago, Paris, ...) for different days over the course of a year. See Table 2.4. One way to aggregate transactions for this data set is to replace all the transactions of a single store with a single storewide transaction. This reduces the hundreds or thousands of transactions that occur daily at a specific store to a single daily transaction, and the number of data objects per day is reduced to the number of stores.

Table 2.4. Data set containing information about customer purchases.

Transaction ID | Item | Store Location | Date | Price | ...
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮
101123 | Watch | Chicago | 09/06/04 | $25.99 | ...
101123 | Battery | Chicago | 09/06/04 | $5.99 | ...
101124 | Shoes | Minneapolis | 09/06/04 | $75.00 | ...

An obvious issue is how an aggregate transaction is created; i.e., how the values of each attribute are combined across all the records corresponding to a particular location to create the aggregate transaction that represents the sales of a single store or date. Quantitative attributes, such as price, are typically aggregated by taking a sum or an average. A qualitative attribute, such as item, can either be omitted or summarized in terms of a higher level category, e.g., televisions versus electronics.

The data in Table 2.4 can also be viewed as a multidimensional array, where each attribute is a dimension. From this viewpoint, aggregation is the process of eliminating attributes, such as the type of item, or reducing the number of values for a particular attribute; e.g., reducing the possible values for date from 365 days to 12 months. This type of aggregation is commonly used in Online Analytical Processing (OLAP). References to OLAP are given in the Bibliographic Notes.

There are several motivations for aggregation. First, the smaller data sets resulting from data reduction require less memory and processing time, and hence, aggregation often enables the use of more expensive data mining algorithms. Second, aggregation can act as a change of scope or scale by providing a high-level view of the data instead of a low-level view. In the previous example, aggregating over store locations and months gives us a monthly, per store view of the data instead of a daily, per item view. Finally, the behavior of groups of objects or attributes is often more stable than that of individual objects or attributes. This statement reflects the statistical fact that aggregate quantities, such as averages or totals, have less variability than the individual values being aggregated. For totals, the actual amount of variation is larger than that of individual objects (on average), but the percentage of the variation is smaller, while for means, the actual amount of variation is less than that of individual objects (on average). A disadvantage of aggregation is the potential loss of interesting details. In the store example, aggregating over months loses information about which day of the week has the highest sales.
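The following is a minimal sketch of this kind of aggregation using pandas. The DataFrame, the column names, and the choice of sum and count as aggregation functions are illustrative assumptions loosely based on Table 2.4, not part of the original text.

```python
# Minimal aggregation sketch (hypothetical column names based on Table 2.4).
import pandas as pd

sales = pd.DataFrame({
    "transaction_id": [101123, 101123, 101124],
    "item": ["Watch", "Battery", "Shoes"],
    "store_location": ["Chicago", "Chicago", "Minneapolis"],
    "date": pd.to_datetime(["2004-09-06", "2004-09-06", "2004-09-06"]),
    "price": [25.99, 5.99, 75.00],
})

# Replace all transactions of a store on a given day with one storewide transaction:
# the quantitative attribute (price) is summed, the qualitative attribute (item) is counted.
daily_per_store = sales.groupby(["store_location", "date"]).agg(
    total_sales=("price", "sum"),
    items_sold=("item", "count"),
).reset_index()

# Aggregating dates from days to months gives the monthly, per-store view mentioned above.
monthly_per_store = (sales.assign(month=sales["date"].dt.to_period("M"))
                          .groupby(["store_location", "month"])["price"].sum())
```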
Example 2.7 (Australian Precipitation). This example is based on precipitation in Australia from the period 1982-1993. Figure 2.8(a) shows a histogram for the standard deviation of average monthly precipitation for 3,030 0.5° by 0.5° grid cells in Australia, while Figure 2.8(b) shows a histogram for the standard deviation of the average yearly precipitation for the same locations. The average yearly precipitation has less variability than the average monthly precipitation. All precipitation measurements (and their standard deviations) are in centimeters.

Figure 2.8. Histograms of standard deviation for monthly and yearly precipitation in Australia for the period 1982-1993.

2.3.2 Sampling

Sampling is a commonly used approach for selecting a subset of the data objects to be analyzed. In statistics, it has long been used for both the preliminary investigation of the data and the final data analysis. Sampling can also be very useful in data mining. However, the motivations for sampling in statistics and data mining are often different. Statisticians use sampling because obtaining the entire set of data of interest is too expensive or time consuming, while data miners usually sample because it is too computationally expensive in terms of the memory or time required to process all the data. In some cases, using a sampling algorithm can reduce the data size to the point where a better, but more computationally expensive algorithm can be used.

The key principle for effective sampling is the following: Using a sample will work almost as well as using the entire data set if the sample is representative. In turn, a sample is representative if it has approximately the same property (of interest) as the original set of data. If the mean (average) of the data objects is the property of interest, then a sample is representative if it has a mean that is close to that of the original data. Because sampling is a statistical process, the representativeness of any particular sample will vary, and the best that we can do is choose a sampling scheme that guarantees a high probability of getting a representative sample. As discussed next, this involves choosing the appropriate sample size and sampling technique.

Sampling Approaches

There are many sampling techniques, but only a few of the most basic ones and their variations will be covered here. The simplest type of sampling is simple random sampling. For this type of sampling, there is an equal probability of selecting any particular object. There are two variations on random sampling (and other sampling techniques as well): (1) sampling without replacement—as each object is selected, it is removed from the set of all objects that together constitute the population, and (2) sampling with replacement—objects are not removed from the population as they are selected for the sample. In sampling with replacement, the same object can be picked more than once. The samples produced by the two methods are not much different when samples are relatively small compared to the data set size, but sampling with replacement is simpler to analyze because the probability of selecting any object remains constant during the sampling process.

When the population consists of different types of objects, with widely different numbers of objects, simple random sampling can fail to adequately represent those types of objects that are less frequent. This can cause problems when the analysis requires proper representation of all object types. For example, when building classification models for rare classes, it is critical that the rare classes be adequately represented in the sample. Hence, a sampling scheme that can accommodate differing frequencies for the object types of interest is needed. Stratified sampling, which starts with prespecified groups of objects, is such an approach. In the simplest version, equal numbers of objects are drawn from each group even though the groups are of different sizes. In another variation, the number of objects drawn from each group is proportional to the size of that group.
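A minimal sketch of these sampling variations with NumPy follows. The data, the group labels, and the sample sizes are made up for illustration and are not taken from the text.

```python
# Minimal sketch of simple random and stratified sampling (made-up data).
import numpy as np

rng = np.random.default_rng(seed=0)
data = np.arange(1000)                                    # 1000 data objects
labels = rng.choice(["A", "B", "C"], size=1000, p=[0.85, 0.10, 0.05])

# Simple random sampling.
without_repl = rng.choice(data, size=50, replace=False)   # each object picked at most once
with_repl = rng.choice(data, size=50, replace=True)       # same object can be picked twice

# Stratified sampling (simplest version): draw an equal number of objects
# from each group, even though the groups have very different sizes.
per_group = 10
strata_sample = np.concatenate([
    rng.choice(data[labels == g], size=per_group, replace=False)
    for g in np.unique(labels)
])
```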
Example 2.8 (Sampling and Loss of Information). Once a sampling technique has been selected, it is still necessary to choose the sample size. Larger sample sizes increase the probability that a sample will be representative, but they also eliminate much of the advantage of sampling. Conversely, with smaller sample sizes, patterns can be missed or erroneous patterns can be detected. Figure 2.9(a) shows a data set that contains 8000 two-dimensional points, while Figures 2.9(b) and 2.9(c) show samples from this data set of size 2000 and 500, respectively. Although most of the structure of this data set is present in the sample of 2000 points, much of the structure is missing in the sample of 500 points.

Figure 2.9. Example of the loss of structure with sampling.

Example 2.9 (Determining the Proper Sample Size). To illustrate that determining the proper sample size requires a methodical approach, consider the following task.

Given a set of data consisting of a small number of almost equal-sized groups, find at least one representative point for each of the groups. Assume that the objects in each group are highly similar to each other, but not very similar to objects in different groups. Figure 2.10(a) shows an idealized set of clusters (groups) from which these points might be drawn.

Figure 2.10. Finding representative points from 10 groups.

This problem can be efficiently solved using sampling. One approach is to take a small sample of data points, compute the pairwise similarities between points, and then form groups of points that are highly similar. The desired set of representative points is then obtained by taking one point from each of these groups. To follow this approach, however, we need to determine a sample size that would guarantee, with a high probability, the desired outcome; that is, that at least one point will be obtained from each cluster. Figure 2.10(b) shows the probability of getting one object from each of the 10 groups as the sample size runs from 10 to 60. Interestingly, with a sample size of 20, there is little chance (20%) of getting a sample that includes all 10 clusters. Even with a sample size of 30, there is still a moderate chance (almost 40%) of getting a sample that doesn't contain objects from all 10 clusters. This issue is further explored in the context of clustering by Exercise 4 on page 603.
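A small Monte Carlo sketch, not part of the original example, can reproduce the shape of the curve in Figure 2.10(b). It assumes, as an approximation, that each selected point is equally likely to come from any of the 10 groups.

```python
# Simulation sketch: estimate the probability that a sample covers all 10 groups.
import numpy as np

rng = np.random.default_rng(seed=0)
n_groups, trials = 10, 10000

for sample_size in (10, 20, 30, 40, 50, 60):
    hits = sum(
        len(np.unique(rng.integers(0, n_groups, size=sample_size))) == n_groups
        for _ in range(trials)
    )
    # For a sample of size 20 the estimate is roughly 0.21, consistent with the
    # "little chance (20%)" quoted above.
    print(sample_size, hits / trials)
```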
Progressive Sampling

The proper sample size can be difficult to determine, so adaptive or progressive sampling schemes are sometimes used. These approaches start with a small sample, and then increase the sample size until a sample of sufficient size has been obtained. While this technique eliminates the need to determine the correct sample size initially, it requires that there be a way to evaluate the sample to judge if it is large enough.

Suppose, for instance, that progressive sampling is used to learn a predictive model. Although the accuracy of predictive models increases as the sample size increases, at some point the increase in accuracy levels off. We want to stop increasing the sample size at this leveling-off point. By keeping track of the change in accuracy of the model as we take progressively larger samples, and by taking other samples close to the size of the current one, we can get an estimate of how close we are to this leveling-off point, and thus, stop sampling.
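The loop below is a minimal sketch of this idea. The train_and_score function is a hypothetical stand-in for fitting a model on a sample and returning its accuracy, and the starting size, growth factor, and tolerance are illustrative choices, not values from the text.

```python
# Minimal progressive sampling sketch (train_and_score is a hypothetical callback).
import numpy as np

def progressive_sample(X, y, train_and_score, start=100, growth=2.0, tol=0.005, seed=0):
    rng = np.random.default_rng(seed)
    size, prev_acc = start, -np.inf
    while size <= len(X):
        idx = rng.choice(len(X), size=size, replace=False)
        acc = train_and_score(X[idx], y[idx])
        if acc - prev_acc < tol:            # accuracy has leveled off: stop sampling
            return idx, acc
        prev_acc, size = acc, int(size * growth)
    return np.arange(len(X)), prev_acc      # fell back to the full data set
```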
2.3.3 Dimensionality Reduction

Data sets can have a large number of features. Consider a set of documents, where each document is represented by a vector whose components are the frequencies with which each word occurs in the document. In such cases, there are typically thousands or tens of thousands of attributes (components), one for each word in the vocabulary. As another example, consider a set of time series consisting of the daily closing price of various stocks over a period of 30 years. In this case, the attributes, which are the prices on specific days, again number in the thousands.

There are a variety of benefits to dimensionality reduction. A key benefit is that many data mining algorithms work better if the dimensionality—the number of attributes in the data—is lower. This is partly because dimensionality reduction can eliminate irrelevant features and reduce noise, and partly because of the curse of dimensionality, which is explained below. Another benefit is that a reduction of dimensionality can lead to a more understandable model because the model usually involves fewer attributes. Also, dimensionality reduction may allow the data to be more easily visualized. Even if dimensionality reduction doesn't reduce the data to two or three dimensions, data is often visualized by looking at pairs or triplets of attributes, and the number of such combinations is greatly reduced. Finally, the amount of time and memory required by the data mining algorithm is reduced with a reduction in dimensionality.

The term dimensionality reduction is often reserved for those techniques that reduce the dimensionality of a data set by creating new attributes that are a combination of the old attributes. The reduction of dimensionality by selecting attributes that are a subset of the old is known as feature subset selection or feature selection. It will be discussed in Section 2.3.4.

In the remainder of this section, we briefly introduce two important topics: the curse of dimensionality and dimensionality reduction techniques based on linear algebra approaches such as principal components analysis (PCA). More details on dimensionality reduction can be found in Appendix B.

The Curse of Dimensionality

The curse of dimensionality refers to the phenomenon that many types of data analysis become significantly harder as the dimensionality of the data increases. Specifically, as dimensionality increases, the data becomes increasingly sparse in the space that it occupies. Thus, the data objects we observe are quite possibly not a representative sample of all possible objects. For classification, this can mean that there are not enough data objects to allow the creation of a model that reliably assigns a class to all possible objects. For clustering, the differences in density and in the distances between points, which are critical for clustering, become less meaningful. (This is discussed further in Sections 8.1.2, 8.4.6, and 8.4.8.) As a result, many clustering and classification algorithms (and other data analysis algorithms) have trouble with high-dimensional data, leading to reduced classification accuracy and poor quality clusters.

Linear Algebra Techniques for Dimensionality Reduction

Some of the most common approaches for dimensionality reduction, particularly for continuous data, use techniques from linear algebra to project the data from a high-dimensional space into a lower-dimensional space. Principal Components Analysis (PCA) is a linear algebra technique for continuous attributes that finds new attributes (principal components) that (1) are linear combinations of the original attributes, (2) are orthogonal (perpendicular) to each other, and (3) capture the maximum amount of variation in the data. For example, the first two principal components capture as much of the variation in the data as is possible with two orthogonal attributes that are linear combinations of the original attributes. Singular Value Decomposition (SVD) is a linear algebra technique that is related to PCA and is also commonly used for dimensionality reduction. For additional details, see Appendices A and B.
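As a minimal sketch, not from the text, PCA can be carried out by centering the data and applying an SVD. The data matrix and the choice of two components here are purely illustrative.

```python
# Minimal PCA-via-SVD sketch (made-up data matrix).
import numpy as np

rng = np.random.default_rng(seed=0)
X = rng.normal(size=(200, 10))              # 200 objects, 10 continuous attributes

Xc = X - X.mean(axis=0)                     # center each attribute
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2                                       # keep the first two principal components
X_reduced = Xc @ Vt[:k].T                   # project onto the top-k components
explained = (S[:k] ** 2).sum() / (S ** 2).sum()
print(X_reduced.shape, explained)           # reduced data and fraction of variation captured
```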
2.3.4 Feature Subset Selection

Another way to reduce the dimensionality is to use only a subset of the features. While it might seem that such an approach would lose information, this is not the case if redundant and irrelevant features are present. Redundant features duplicate much or all of the information contained in one or more other attributes. For example, the purchase price of a product and the amount of sales tax paid contain much of the same information. Irrelevant features contain almost no useful information for the data mining task at hand. For instance, students' ID numbers are irrelevant to the task of predicting students' grade point averages. Redundant and irrelevant features can reduce classification accuracy and the quality of the clusters that are found.

While some irrelevant and redundant attributes can be eliminated immediately by using common sense or domain knowledge, selecting the best subset of features frequently requires a systematic approach. The ideal approach to feature selection is to try all possible subsets of features as input to the data mining algorithm of interest, and then take the subset that produces the best results. This method has the advantage of reflecting the objective and bias of the data mining algorithm that will eventually be used. Unfortunately, since the number of subsets involving n attributes is 2^n, such an approach is impractical in most situations and alternative strategies are needed. There are three standard approaches to feature selection: embedded, filter, and wrapper.

Embedded approaches

Feature selection occurs naturally as part of the data mining algorithm. Specifically, during the operation of the data mining algorithm, the algorithm itself decides which attributes to use and which to ignore. Algorithms for building decision tree classifiers, which are discussed in Chapter 3, often operate in this manner.

Filter approaches

Features are selected before the data mining algorithm is run, using some approach that is independent of the data mining task. For example, we might select sets of attributes whose pairwise correlation is as low as possible so that the attributes are non-redundant.

Wrapper approaches

These methods use the target data mining algorithm as a black box to find the best subset of attributes, in a way similar to that of the ideal algorithm described above, but typically without enumerating all possible subsets.

Because the embedded approaches are algorithm-specific, only the filter and wrapper approaches will be discussed further here.
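The following is a minimal sketch of a wrapper-style greedy forward selection, not something taken from the text. The decision tree classifier, the cross-validation scoring, and the stopping rule are illustrative stand-ins for whatever target algorithm and evaluation criterion are actually used.

```python
# Minimal wrapper-style greedy forward selection sketch.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def forward_select(X, y, max_features):
    """Greedily add the feature that most improves the black-box model's score."""
    selected, remaining = [], list(range(X.shape[1]))
    best_score = -np.inf
    while remaining and len(selected) < max_features:
        scores = []
        for f in remaining:
            cols = selected + [f]
            score = cross_val_score(DecisionTreeClassifier(), X[:, cols], y, cv=5).mean()
            scores.append((score, f))
        score, f = max(scores)
        if score <= best_score:          # stopping criterion: no further improvement
            break
        best_score = score
        selected.append(f)
        remaining.remove(f)
    return selected, best_score
```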
An Architecture for Feature Subset Selection

It is possible to encompass both the filter and wrapper approaches within a common architecture. The feature selection process is viewed as consisting of four parts: a measure for evaluating a subset, a search strategy that controls the generation of a new subset of features, a stopping criterion, and a validation procedure. Filter methods and wrapper methods differ only in the way in which they evaluate a subset of features. For a wrapper method, subset evaluation uses the target data mining algorithm, while for a filter approach, the evaluation technique is distinct from the target data mining algorithm. The following discussion provides some details of this approach, which is summarized in Figure 2.11.

Figure 2.11. Flowchart of a feature subset selection process.

Conceptually, feature subset selection is a search over all possible subsets of features. Many different types of search strategies can be used, but the search strategy should be computationally inexpensive and should find optimal or near optimal sets of features. It is usually not possible to satisfy both requirements, and thus, trade-offs are necessary.

An integral part of the search is an evaluation step to judge how the current subset of features compares to others that have been considered. This requires an evaluation measure that attempts to determine the goodness of a subset of attributes with respect to a particular data mining task, such as classification or clustering. For the filter approach, such measures attempt to predict how well the actual data mining algorithm will perform on a given set of attributes. For the wrapper approach, where evaluation consists of actually running the target data mining algorithm, the subset evaluation function is simply the criterion normally used to measure the result of the data mining.

Because the number of subsets can be enormous and it is impractical to examine them all, some sort of stopping criterion is necessary. This strategy is usually based on one or more conditions involving the following: the number of iterations, whether the value of the subset evaluation measure is optimal or exceeds a certain threshold, whether a subset of a certain size has been obtained, and whether any improvement can be achieved by the options available to the search strategy.

Finally, once a subset of features has been selected, the results of the target data mining algorithm on the selected subset should be validated. A straightforward validation approach is to run the algorithm with the full set of features and compare the full results to results obtained using the subset of features. Hopefully, the subset of features will produce results that are better than or almost as good as those produced when using all features. Another validation approach is to use a number of different feature selection algorithms to obtain subsets of features and then compare the results of running the data mining algorithm on each subset.

Feature Weighting

Feature weighting is an alternative to keeping or eliminating features. More important features are assigned a higher weight, while less important features are given a lower weight. These weights are sometimes assigned based on domain knowledge about the relative importance of features. Alternatively, they can sometimes be determined automatically. For example, some classification schemes, such as support vector machines (Chapter 4), produce classification models in which each feature is given a weight. Features with larger weights play a more important role in the model. The normalization of objects that takes place when computing the cosine similarity (Section 2.4.5) can also be regarded as a type of feature weighting.

2.3.5 Feature Creation

It is frequently possible to create, from the original attributes, a new set of attributes that captures the important information in a data set much more effectively. Furthermore, the number of new attributes can be smaller than the number of original attributes, allowing us to reap all the previously described benefits of dimensionality reduction. Two related methodologies for creating new attributes are described next: feature extraction and mapping the data to a new space.

Feature Extraction

The creation of a new set of features from the original raw data is known as feature extraction. Consider a set of photographs, where each photograph is to be classified according to whether it contains a human face. The raw data is a set of pixels, and as such, is not suitable for many types of classification algorithms. However, if the data is processed to provide higher-level features, such as the presence or absence of certain types of edges and areas that are highly correlated with the presence of human faces, then a much broader set of classification techniques can be applied to this problem.

Unfortunately, in the sense in which it is most commonly used, feature extraction is highly domain-specific. For a particular field, such as image processing, various features and the techniques to extract them have been developed over a period of time, and often these techniques have limited applicability to other fields. Consequently, whenever data mining is applied to a relatively new area, a key task is the development of new features and feature extraction methods.

Although feature extraction is often complicated, Example 2.10 illustrates that it can be relatively straightforward.

Example 2.10 (Density). Consider a data set consisting of information about historical artifacts, which, along with other information, contains the volume and mass of each artifact. For simplicity, assume that these artifacts are made of a small number of materials (wood, clay, bronze, gold) and that we want to classify the artifacts with respect to the material of which they are made. In this case, a density feature constructed from the mass and volume features, i.e., density = mass/volume, would most directly yield an accurate classification. Although there have been some attempts to automatically perform such simple feature extraction by exploring basic mathematical combinations of existing attributes, the most common approach is to construct features using domain expertise.

Mapping the Data to a New Space

A totally different view of the data can reveal important and interesting features. Consider, for example, time series data, which often contains periodic patterns. If there is only a single periodic pattern and not much noise, then the pattern is easily detected. If, on the other hand, there are a number of periodic patterns and a significant amount of noise, then these patterns are hard to detect. Such patterns can, nonetheless, often be detected by applying a Fourier transform to the time series in order to change to a representation in which frequency information is explicit. In Example 2.11, it will not be necessary to know the details of the Fourier transform. It is enough to know that, for each time series, the Fourier transform produces a new data object whose attributes are related to frequencies.

Example 2.11 (Fourier Analysis). The time series presented in Figure 2.12(b) is the sum of three other time series, two of which are shown in Figure 2.12(a) and have frequencies of 7 and 17 cycles per second, respectively. The third time series is random noise. Figure 2.12(c) shows the power spectrum that can be computed after applying a Fourier transform to the original time series. (Informally, the power spectrum is proportional to the square of each frequency attribute.) In spite of the noise, there are two peaks that correspond to the periods of the two original, non-noisy time series. Again, the main point is that better features can reveal important aspects of the data.

Figure 2.12. Application of the Fourier transform to identify the underlying frequencies in time series data.
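A minimal sketch in the spirit of Example 2.11 follows. The sampling rate, noise level, and two-second duration are illustrative assumptions; only the two underlying frequencies (7 and 17 cycles per second) come from the example.

```python
# Minimal Fourier-transform sketch: recover the frequencies of a noisy periodic series.
import numpy as np

fs = 200                                    # assumed sampling rate (samples per second)
t = np.arange(0, 2, 1 / fs)                 # two seconds of data
series = (np.sin(2 * np.pi * 7 * t)
          + np.sin(2 * np.pi * 17 * t)
          + np.random.default_rng(0).normal(scale=0.5, size=t.size))   # random noise

spectrum = np.fft.rfft(series)
freqs = np.fft.rfftfreq(series.size, d=1 / fs)
power = np.abs(spectrum) ** 2               # power spectrum, as in Figure 2.12(c)

print(freqs[np.argsort(power)[-2:]])        # the two dominant frequencies: ~7 and ~17
```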
Many other sorts of transformations are also possible. Besides the Fourier transform, the wavelet transform has also proven very useful for time series and other types of data.

2.3.6 Discretization and Binarization

Some data mining algorithms, especially certain classification algorithms, require that the data be in the form of categorical attributes. Algorithms that find association patterns require that the data be in the form of binary attributes. Thus, it is often necessary to transform a continuous attribute into a categorical attribute (discretization), and both continuous and discrete attributes may need to be transformed into one or more binary attributes (binarization). Additionally, if a categorical attribute has a large number of values (categories), or some values occur infrequently, then it can be beneficial for certain data mining tasks to reduce the number of categories by combining some of the values.

As with feature selection, the best discretization or binarization approach is the one that "produces the best result for the data mining algorithm that will be used to analyze the data." It is typically not practical to apply such a criterion directly. Consequently, discretization or binarization is performed in a way that satisfies a criterion that is thought to have a relationship to good performance for the data mining task being considered. In general, the best discretization depends on the algorithm being used, as well as the other attributes being considered. Typically, however, the discretization of each attribute is considered in isolation.

Binarization

A simple technique to binarize a categorical attribute is the following: If there are m categorical values, then uniquely assign each original value to an integer in the interval [0, m−1]. If the attribute is ordinal, then order must be maintained by the assignment. (Note that even if the attribute is originally represented using integers, this process is necessary if the integers are not in the interval [0, m−1].) Next, convert each of these m integers to a binary number. Since n = ⌈log2(m)⌉ binary digits are required to represent these integers, represent these binary numbers using n binary attributes. To illustrate, a categorical variable with 5 values {awful, poor, OK, good, great} would require three binary variables x1, x2, and x3. The conversion is shown in Table 2.5.

Table 2.5. Conversion of a categorical attribute to three binary attributes.

Categorical Value | Integer Value | x1 | x2 | x3
awful | 0 | 0 | 0 | 0
poor | 1 | 0 | 0 | 1
OK | 2 | 0 | 1 | 0
good | 3 | 0 | 1 | 1
great | 4 | 1 | 0 | 0

Such a transformation can cause complications, such as creating unintended relationships among the transformed attributes. For example, in Table 2.5, attributes x2 and x3 are correlated because information about the good value is encoded using both attributes. Furthermore, association analysis requires asymmetric binary attributes, where only the presence of the attribute (value = 1) is important. For association problems, it is therefore necessary to introduce one asymmetric binary attribute for each categorical value, as shown in Table 2.6. If the number of resulting attributes is too large, then the techniques described in the following sections can be used to reduce the number of categorical values before binarization.

Table 2.6. Conversion of a categorical attribute to five asymmetric binary attributes.

Categorical Value | Integer Value | x1 | x2 | x3 | x4 | x5
awful | 0 | 1 | 0 | 0 | 0 | 0
poor | 1 | 0 | 1 | 0 | 0 | 0
OK | 2 | 0 | 0 | 1 | 0 | 0
good | 3 | 0 | 0 | 0 | 1 | 0
great | 4 | 0 | 0 | 0 | 0 | 1

Likewise, for association problems, it can be necessary to replace a single binary attribute with two asymmetric binary attributes. Consider a binary attribute that records a person's gender, male or female. For traditional association rule algorithms, this information needs to be transformed into two asymmetric binary attributes, one that is a 1 only when the person is male and one that is a 1 only when the person is female. (For asymmetric binary attributes, the information representation is somewhat inefficient in that two bits of storage are required to represent each bit of information.)
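Both conversions can be sketched in a few lines of pandas and NumPy. This is an illustrative sketch rather than the book's procedure verbatim; the bit extraction and the use of get_dummies are simply one way to obtain the encodings of Tables 2.5 and 2.6.

```python
# Minimal binarization sketch for the 5-value ordinal attribute of Table 2.5.
import numpy as np
import pandas as pd

values = pd.Categorical(
    ["good", "awful", "great", "OK", "poor"],
    categories=["awful", "poor", "OK", "good", "great"], ordered=True)

codes = values.codes.astype(int)                 # integers 0..4, order preserved

# Table 2.5 style: n = ceil(log2(5)) = 3 ordinary binary attributes (bits of the integer code).
binary3 = pd.DataFrame({f"x{i + 1}": (codes >> (2 - i)) & 1 for i in range(3)})

# Table 2.6 style: one asymmetric binary attribute per categorical value.
asym5 = pd.get_dummies(pd.Series(values)).astype(int)

print(binary3)
print(asym5)
```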
Discretization of Continuous Attributes

Discretization is typically applied to attributes that are used in classification or association analysis. Transformation of a continuous attribute to a categorical attribute involves two subtasks: deciding how many categories, n, to have and determining how to map the values of the continuous attribute to these categories. In the first step, after the values of the continuous attribute are sorted, they are then divided into n intervals by specifying n−1 split points. In the second, rather trivial step, all the values in one interval are mapped to the same categorical value. Therefore, the problem of discretization is one of deciding how many split points to choose and where to place them. The result can be represented either as a set of intervals {(x_0, x_1], (x_1, x_2], \ldots, (x_{n-1}, x_n)}, where x_0 and x_n can be −∞ and +∞, respectively, or equivalently, as a series of inequalities x_0 < x ≤ x_1, \ldots, x_{n-1} < x < x_n.

Unsupervised Discretization

A basic distinction between discretization methods for classification is whether class information is used (supervised) or not (unsupervised). If class information is not used, then relatively simple approaches are common. For instance, the equal width approach divides the range of the attribute into a user-specified number of intervals each having the same width. Such an approach can be badly affected by outliers, and for that reason, an equal frequency (equal depth) approach, which tries to put the same number of objects into each interval, is often preferred. As another example of unsupervised discretization, a clustering method, such as K-means (see Chapter 7), can also be used. Finally, visually inspecting the data can sometimes be an effective approach.

Example 2.12 (Discretization Techniques). This example demonstrates how these approaches work on an actual data set. Figure 2.13(a) shows data points belonging to four different groups, along with two outliers—the large dots on either end. The techniques of the previous paragraph were applied to discretize the x values of these data points into four categorical values. (Points in the data set have a random y component to make it easy to see how many points are in each group.) Visually inspecting the data works quite well, but is not automatic, and thus, we focus on the other three approaches. The split points produced by the techniques equal width, equal frequency, and K-means are shown in Figures 2.13(b), 2.13(c), and 2.13(d), respectively. The split points are represented as dashed lines.
Figure 2.13. Different discretization techniques.

In this particular example, if we measure the performance of a discretization technique by the extent to which different objects that clump together have the same categorical value, then K-means performs best, followed by equal frequency, and finally, equal width. More generally, the best discretization will depend on the application and often involves domain-specific discretization. For example, the discretization of people into low income, middle income, and high income is based on economic factors.
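The three unsupervised techniques can be sketched as follows. The one-dimensional data set and the number of intervals are made up, and scikit-learn's KMeans stands in for whatever clustering method is actually used.

```python
# Minimal sketch of equal width, equal frequency, and K-means discretization.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(seed=0)
x = np.concatenate([rng.normal(loc, 0.5, 50) for loc in (0, 4, 8, 12)])   # four groups
n_bins = 4

# Equal width: intervals of identical length (sensitive to outliers).
width_edges = np.linspace(x.min(), x.max(), n_bins + 1)
equal_width = np.digitize(x, width_edges[1:-1])

# Equal frequency (equal depth): roughly the same number of objects per interval.
freq_edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
equal_freq = np.digitize(x, freq_edges[1:-1])

# K-means: one categorical value per cluster of x values.
kmeans = KMeans(n_clusters=n_bins, n_init=10, random_state=0).fit(x.reshape(-1, 1))
kmeans_bins = kmeans.labels_
```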
Supervised Discretization

If classification is our application and class labels are known for some data objects, then discretization approaches that use class labels often produce better classification. This should not be surprising, since an interval constructed with no knowledge of class labels often contains a mixture of class labels. A conceptually simple approach is to place the splits in a way that maximizes the purity of the intervals, i.e., the extent to which an interval contains a single class label. In practice, however, such an approach requires potentially arbitrary decisions about the purity of an interval and the minimum size of an interval.

To overcome such concerns, some statistically based approaches start with each attribute value in a separate interval and create larger intervals by merging adjacent intervals that are similar according to a statistical test. An alternative to this bottom-up approach is a top-down approach that starts by bisecting the initial values so that the resulting two intervals give minimum entropy. This technique only needs to consider each value as a possible split point, because it is assumed that intervals contain ordered sets of values. The splitting process is then repeated with another interval, typically choosing the interval with the worst (highest) entropy, until a user-specified number of intervals is reached, or a stopping criterion is satisfied.

Entropy-based approaches are one of the most promising approaches to discretization, whether bottom-up or top-down. First, it is necessary to define entropy. Let k be the number of different class labels, m_i be the number of values in the i-th interval of a partition, and m_{ij} be the number of values of class j in interval i. Then the entropy e_i of the i-th interval is given by the equation

e_i = -\sum_{j=1}^{k} p_{ij} \log_2 p_{ij},

where p_{ij} = m_{ij}/m_i is the probability (fraction of values) of class j in the i-th interval. The total entropy, e, of the partition is the weighted average of the individual interval entropies, i.e.,

e = \sum_{i=1}^{n} w_i e_i,

where m is the number of values, w_i = m_i/m is the fraction of values in the i-th interval, and n is the number of intervals. Intuitively, the entropy of an interval is a measure of the purity of an interval. If an interval contains only values of one class (is perfectly pure), then the entropy is 0 and it contributes nothing to the overall entropy. If the classes of values in an interval occur equally often (the interval is as impure as possible), then the entropy is a maximum.

Example 2.13 (Discretization of Two Attributes). The top-down method based on entropy was used to independently discretize both the x and y attributes of the two-dimensional data shown in Figure 2.14. In the first discretization, shown in Figure 2.14(a), the x and y attributes were both split into three intervals. (The dashed lines indicate the split points.) In the second discretization, shown in Figure 2.14(b), the x and y attributes were both split into five intervals.

Figure 2.14. Discretizing x and y attributes for four groups (classes) of points.

This simple example illustrates two aspects of discretization. First, in two dimensions, the classes of points are well separated, but in one dimension, this is not so. In general, discretizing each attribute separately often guarantees suboptimal results. Second, five intervals work better than three, but six intervals do not improve the discretization much, at least in terms of entropy. (Entropy values and results for six intervals are not shown.) Consequently, it is desirable to have a stopping criterion that automatically finds the right number of partitions.
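A minimal sketch of the entropy e and of a single top-down split that minimizes it is given below. The tiny data set and class labels are made up for illustration.

```python
# Minimal sketch of entropy-based (supervised) splitting of one attribute.
import numpy as np

def entropy(labels):
    """e_i = -sum_j p_ij log2 p_ij for one interval."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def best_split(x, y):
    """Return (total entropy e, split point) for the bisection that minimizes e."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best = (np.inf, None)
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue
        w_left, w_right = i / len(x), (len(x) - i) / len(x)   # w_i = m_i / m
        e = w_left * entropy(y[:i]) + w_right * entropy(y[i:])
        if e < best[0]:
            best = (e, (x[i - 1] + x[i]) / 2)
    return best

x = np.array([1.0, 1.2, 1.5, 4.0, 4.2, 4.5])
y = np.array([0, 0, 0, 1, 1, 1])
print(best_split(x, y))   # perfectly pure intervals: entropy 0, split near 2.75
```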
Categorical Attributes with Too Many Values

Categorical attributes can sometimes have too many values. If the categorical attribute is an ordinal attribute, then techniques similar to those for continuous attributes can be used to reduce the number of categories. If the categorical attribute is nominal, however, then other approaches are needed. Consider a university that has a large number of departments. Consequently, a department name attribute might have dozens of different values. In this situation, we could use our knowledge of the relationships among different departments to combine departments into larger groups, such as engineering, social sciences, or biological sciences. If domain knowledge does not serve as a useful guide or such an approach results in poor classification performance, then it is necessary to use a more empirical approach, such as grouping values together only if such a grouping results in improved classification accuracy or achieves some other data mining objective.

2.3.7 Variable Transformation

A variable transformation refers to a transformation that is applied to all the values of a variable. (We use the term variable instead of attribute to adhere to common usage, although we will also refer to attribute transformation on occasion.) In other words, for each object, the transformation is applied to the value of the variable for that object. For example, if only the magnitude of a variable is important, then the values of the variable can be transformed by taking the absolute value. In the following section, we discuss two important types of variable transformations: simple functional transformations and normalization.

Simple Functions

For this type of variable transformation, a simple mathematical function is applied to each value individually. If x is a variable, then examples of such transformations include x^k, log x, e^x, √x, 1/x, sin x, or |x|. In statistics, variable transformations, especially sqrt, log, and 1/x, are often used to transform data that does not have a Gaussian (normal) distribution into data that does. While this can be important, other reasons often take precedence in data mining. Suppose the variable of interest is the number of data bytes in a session, and the number of bytes ranges from 1 to 1 billion. This is a huge range, and it can be advantageous to compress it by using a log10 transformation. In this case, sessions that transferred 10^8 and 10^9 bytes would be more similar to each other than sessions that transferred 10 and 1000 bytes (9 − 8 = 1 versus 3 − 1 = 2). For some applications, such as network intrusion detection, this may be what is desired, since the first two sessions most likely represent transfers of large files, while the latter two sessions could be two quite distinct types of sessions.

Variable transformations should be applied with caution because they change the nature of the data. While this is what is desired, there can be problems if the nature of the transformation is not fully appreciated. For instance, the transformation 1/x reduces the magnitude of values that are 1 or larger, but increases the magnitude of values between 0 and 1. To illustrate, the values {1, 2, 3} go to {1, 1/2, 1/3}, but the values {1, 1/2, 1/3} go to {1, 2, 3}. Thus, for all sets of values, the transformation 1/x reverses the order. To help clarify the effect of a transformation, it is important to ask questions such as the following: What is the desired property of the transformed attribute? Does the order need to be maintained? Does the transformation apply to all values, especially negative values and 0? What is the effect of the transformation on the values between 0 and 1? Exercise 17 on page 109 explores other aspects of variable transformation.

Normalization or Standardization

The goal of standardization or normalization is to make an entire set of values have a particular property. A traditional example is that of "standardizing a variable" in statistics. If x̄ is the mean (average) of the attribute values and s_x is their standard deviation, then the transformation x′ = (x − x̄)/s_x creates a new variable that has a mean of 0 and a standard deviation of 1. If different variables are to be used together, e.g., for clustering, then such a transformation is often necessary to avoid having a variable with large values dominate the results of the analysis. To illustrate, consider comparing people based on two variables: age and income. For any two people, the difference in income will likely be much higher in absolute terms (hundreds or thousands of dollars) than the difference in age (less than 150). If the differences in the range of values of age and income are not taken into account, then the comparison between people will be dominated by differences in income. In particular, if the similarity or dissimilarity of two people is calculated using the similarity or dissimilarity measures defined later in this chapter, then in many cases, such as that of Euclidean distance, the income values will dominate the calculation.

The mean and standard deviation are strongly affected by outliers, so the above transformation is often modified. First, the mean is replaced by the median, i.e., the middle value. Second, the standard deviation is replaced by the absolute standard deviation. Specifically, if x is a variable, then the absolute standard deviation of x is given by

σ_A = \sum_{i=1}^{m} |x_i − μ|,

where x_i is the i-th value of the variable, m is the number of objects, and μ is either the mean or median. Other approaches for computing estimates of the location (center) and spread of a set of values in the presence of outliers are described in statistics books. These more robust measures can also be used to define a standardization transformation.
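A minimal sketch of the two kinds of transformation discussed above follows. The byte counts and the small data set with an outlier are made up, and the sum-based absolute standard deviation simply mirrors the definition given in the text.

```python
# Minimal sketch of a log transformation and of classic vs. robust standardization.
import numpy as np

bytes_per_session = np.array([10, 1000, 1e8, 1e9])
log_bytes = np.log10(bytes_per_session)          # 1, 3, 8, 9: the huge range is compressed

x = np.array([12.0, 15.0, 14.0, 13.0, 200.0])    # 200 is an outlier
z_classic = (x - x.mean()) / x.std(ddof=1)       # mean and std are pulled by the outlier

mu = np.median(x)                                # replace the mean by the median
sigma_A = np.abs(x - mu).sum()                   # absolute standard deviation (as defined above)
z_robust = (x - mu) / sigma_A
```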
2.4 Measures of Similarity and Dissimilarity

Similarity and dissimilarity are important because they are used by a number of data mining techniques, such as clustering, nearest neighbor classification, and anomaly detection. In many cases, the initial data set is not needed once these similarities or dissimilarities have been computed. Such approaches can be viewed as transforming the data to a similarity (dissimilarity) space and then performing the analysis. Indeed, kernel methods are a powerful realization of this idea. These methods are introduced in Section 2.4.7 and are discussed more fully in the context of classification in Section 4.9.4.

We begin with a discussion of the basics: high-level definitions of similarity and dissimilarity, and a discussion of how they are related. For convenience, the term proximity is used to refer to either similarity or dissimilarity. Since the proximity between two objects is a function of the proximity between the corresponding attributes of the two objects, we first describe how to measure the proximity between objects having only one attribute.

We then consider proximity measures for objects with multiple attributes. This includes measures such as the Jaccard and cosine similarity measures, which are useful for sparse data, such as documents, as well as correlation and Euclidean distance, which are useful for non-sparse (dense) data, such as time series or multi-dimensional points. We also consider mutual information, which can be applied to many types of data and is good for detecting nonlinear relationships. In this discussion, we restrict ourselves to objects with relatively homogeneous attribute types, typically binary or continuous.

Next, we consider several important issues concerning proximity measures. This includes how to compute proximity between objects when they have heterogeneous types of attributes, and approaches to account for differences of scale and correlation among variables when computing distance between numerical objects. The section concludes with a brief discussion of how to select the right proximity measure.

Although this section focuses on the computation of proximity between data objects, proximity can also be computed between attributes. For example, for the document-term matrix of Figure 2.2(d), the cosine measure can be used to compute similarity between a pair of documents or a pair of terms (words). Knowing that two variables are strongly related can, for example, be helpful for eliminating redundancy. In particular, the correlation and mutual information measures discussed later are often used for that purpose.

2.4.1 Basics

Definitions

Informally, the similarity between two objects is a numerical measure of the degree to which the two objects are alike. Consequently, similarities are higher for pairs of objects that are more alike. Similarities are usually non-negative and are often between 0 (no similarity) and 1 (complete similarity).

The dissimilarity between two objects is a numerical measure of the degree to which the two objects are different. Dissimilarities are lower for more similar pairs of objects. Frequently, the term distance is used as a synonym for dissimilarity, although, as we shall see, distance often refers to a special class of dissimilarities. Dissimilarities sometimes fall in the interval [0, 1], but it is also common for them to range from 0 to ∞.
Transformations

Transformations are often applied to convert a similarity to a dissimilarity, or vice versa, or to transform a proximity measure to fall within a particular range, such as [0, 1]. For instance, we may have similarities that range from 1 to 10, but the particular algorithm or software package that we want to use may be designed to work only with dissimilarities, or it may work only with similarities in the interval [0, 1]. We discuss these issues here because we will employ such transformations later in our discussion of proximity. In addition, these issues are relatively independent of the details of specific proximity measures.

Frequently, proximity measures, especially similarities, are defined or transformed to have values in the interval [0, 1]. Informally, the motivation for this is to use a scale in which a proximity value indicates the fraction of similarity (or dissimilarity) between two objects. Such a transformation is often relatively straightforward. For example, if the similarities between objects range from 1 (not at all similar) to 10 (completely similar), we can make them fall within the range [0, 1] by using the transformation s′ = (s − 1)/9, where s and s′ are the original and new similarity values, respectively. In the more general case, the transformation of similarities to the interval [0, 1] is given by the expression s′ = (s − min_s)/(max_s − min_s), where max_s and min_s are the maximum and minimum similarity values, respectively. Likewise, dissimilarity measures with a finite range can be mapped to the interval [0, 1] by using the formula d′ = (d − min_d)/(max_d − min_d). This is an example of a linear transformation, which preserves the relative distances between points. In other words, if points x1 and x2 are twice as far apart as points x3 and x4, the same will be true after a linear transformation.

However, there can be complications in mapping proximity measures to the interval [0, 1] using a linear transformation. If, for example, the proximity measure originally takes values in the interval [0, ∞], then max_d is not defined and a nonlinear transformation is needed. Values will not have the same relationship to one another on the new scale. Consider the transformation d′ = d/(1 + d) for a dissimilarity measure that ranges from 0 to ∞. The dissimilarities 0, 0.5, 2, 10, 100, and 1000 will be transformed into the new dissimilarities 0, 0.33, 0.67, 0.90, 0.99, and 0.999, respectively. Larger values on the original dissimilarity scale are compressed into the range of values near 1, but whether this is desirable depends on the application.

Note that mapping proximity measures to the interval [0, 1] can also change the meaning of the proximity measure. For example, correlation, which is discussed later, is a measure of similarity that takes values in the interval [−1, 1]. Mapping these values to the interval [0, 1] by taking the absolute value loses information about the sign, which can be important in some applications. See Exercise 22 on page 111.

Transforming similarities to dissimilarities and vice versa is also relatively straightforward, although we again face the issues of preserving meaning and changing a linear scale into a nonlinear scale. If the similarity (or dissimilarity) falls in the interval [0, 1], then the dissimilarity can be defined as d = 1 − s (s = 1 − d). Another simple approach is to define similarity as the negative of the dissimilarity (or vice versa). To illustrate, the dissimilarities 0, 1, 10, and 100 can be transformed into the similarities 0, −1, −10, and −100, respectively.

The similarities resulting from the negation transformation are not restricted to the range [0, 1], but if that is desired, then transformations such as s = 1/(d + 1), s = e^{−d}, or s = 1 − (d − min_d)/(max_d − min_d) can be used. For the transformation s = 1/(d + 1), the dissimilarities 0, 1, 10, 100 are transformed into 1, 0.5, 0.09, 0.01, respectively. For s = e^{−d}, they become 1.00, 0.37, 0.00, 0.00, respectively, while for s = 1 − (d − min_d)/(max_d − min_d), they become 1.00, 0.99, 0.90, 0.00, respectively. In this discussion, we have focused on converting dissimilarities to similarities. Conversion in the opposite direction is considered in Exercise 23 on page 111.

In general, any monotonic decreasing function can be used to convert dissimilarities to similarities, or vice versa. Of course, other factors also must be considered when transforming similarities to dissimilarities, or vice versa, or when transforming the values of a proximity measure to a new scale. We have mentioned issues related to preserving meaning, distortion of scale, and requirements of data analysis tools, but this list is certainly not exhaustive.
2.4.2 Similarity and Dissimilarity between Simple Attributes

The proximity of objects with a number of attributes is typically defined by combining the proximities of individual attributes, and thus, we first discuss proximity between objects having a single attribute. Consider objects described by one nominal attribute. What would it mean for two such objects to be similar? Because nominal attributes convey only information about the distinctness of objects, all we can say is that two objects either have the same value or they do not. Hence, in this case similarity is traditionally defined as 1 if attribute values match, and as 0 otherwise. A dissimilarity would be defined in the opposite way: 0 if the attribute values match, and 1 if they do not.

For objects with a single ordinal attribute, the situation is more complicated because information about order should be taken into account. Consider an attribute that measures the quality of a product, e.g., a candy bar, on the scale {poor, fair, OK, good, wonderful}. It would seem reasonable that a product, P1, which is rated wonderful, would be closer to a product P2, which is rated good, than it would be to a product P3, which is rated OK. To make this observation quantitative, the values of the ordinal attribute are often mapped to successive integers, beginning at 0 or 1, e.g., {poor = 0, fair = 1, OK = 2, good = 3, wonderful = 4}. Then, d(P1, P2) = 4 − 3 = 1 or, if we want the dissimilarity to fall between 0 and 1, d(P1, P2) = (4 − 3)/4 = 0.25. A similarity for ordinal attributes can then be defined as s = 1 − d.

This definition of similarity (dissimilarity) for an ordinal attribute should make the reader a bit uneasy since this assumes equal intervals between successive values of the attribute, and this is not necessarily so. Otherwise, we would have an interval or ratio attribute. Is the difference between the values fair and good really the same as that between the values OK and wonderful? Probably not, but in practice, our options are limited, and in the absence of more information, this is the standard approach for defining proximity between ordinal attributes.

For interval or ratio attributes, the natural measure of dissimilarity between two objects is the absolute difference of their values. For example, we might compare our current weight and our weight a year ago by saying "I am ten pounds heavier." In cases such as these, the dissimilarities typically range from 0 to ∞, rather than from 0 to 1. The similarity of interval or ratio attributes is typically expressed by transforming a dissimilarity into a similarity, as previously described.

Table 2.7 summarizes this discussion. In this table, x and y are two objects that have one attribute of the indicated type. Also, d(x, y) and s(x, y) are the dissimilarity and similarity between x and y, respectively. Other approaches are possible; these are the most common ones.
Table 2.7. Similarity and dissimilarity for simple attributes.

Attribute Type | Dissimilarity | Similarity
Nominal | d = 0 if x = y, d = 1 if x ≠ y | s = 1 if x = y, s = 0 if x ≠ y
Ordinal | d = |x − y|/(n − 1) (values mapped to integers 0 to n − 1, where n is the number of values) | s = 1 − d
Interval or Ratio | d = |x − y| | s = −d, s = 1/(1 + d), s = e^{−d}, s = 1 − (d − min_d)/(max_d − min_d)

The following two sections consider more complicated measures of proximity between objects that involve multiple attributes: (1) dissimilarities between data objects and (2) similarities between data objects. This division allows us to more naturally display the underlying motivations for employing various proximity measures. We emphasize, however, that similarities can be transformed into dissimilarities and vice versa using the approaches described earlier.

2.4.3 Dissimilarities between Data Objects

In this section, we discuss various kinds of dissimilarities. We begin with a discussion of distances, which are dissimilarities with certain properties, and then provide examples of more general kinds of dissimilarities.

Distances
We first present some examples, and then offer a more formal description of distances in terms of the properties common to all distances. The Euclidean distance, d, between two points, x and y, in one-, two-, three-, or higher-dimensional space, is given by the following familiar formula:

d(x, y) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2},   (2.1)

where n is the number of dimensions and x_k and y_k are, respectively, the k-th attributes (components) of x and y. We illustrate this formula with Figure 2.15 and Tables 2.8 and 2.9, which show a set of points, the x and y coordinates of these points, and the distance matrix containing the pairwise distances of these points.

Figure 2.15. Four two-dimensional points.

The Euclidean distance measure given in Equation 2.1 is generalized by the Minkowski distance metric shown in Equation 2.2,

d(x, y) = \left(\sum_{k=1}^{n} |x_k - y_k|^r\right)^{1/r},   (2.2)

where r is a parameter. The following are the three most common examples of Minkowski distances.

r = 1. City block (Manhattan, taxicab, L1 norm) distance. A common example is the Hamming distance, which is the number of bits that is different between two objects that have only binary attributes, i.e., between two binary vectors.

r = 2. Euclidean distance (L2 norm).

r = ∞. Supremum (Lmax or L∞ norm) distance. This is the maximum difference between any attribute of the objects. More formally, the L∞ distance is defined by Equation 2.3:

d(x, y) = \lim_{r \to \infty} \left(\sum_{k=1}^{n} |x_k - y_k|^r\right)^{1/r}.   (2.3)

The r parameter should not be confused with the number of dimensions (attributes) n. The Euclidean, Manhattan, and supremum distances are defined for all values of n: 1, 2, 3, ..., and specify different ways of combining the differences in each dimension (attribute) into an overall distance.

Tables 2.10 and 2.11, respectively, give the proximity matrices for the L1 and L∞ distances using data from Table 2.8. Notice that all these distance matrices are symmetric; i.e., the ij-th entry is the same as the ji-th entry. In Table 2.9, for instance, the fourth row of the first column and the fourth column of the first row both contain the value 5.1.

Table 2.8. x and y coordinates of four points.

point | x coordinate | y coordinate
p1 | 0 | 2
p2 | 2 | 0
p3 | 3 | 1
p4 | 5 | 1

Table 2.9. Euclidean distance matrix for Table 2.8.

 | p1 | p2 | p3 | p4
p1 | 0.0 | 2.8 | 3.2 | 5.1
p2 | 2.8 | 0.0 | 1.4 | 3.2
p3 | 3.2 | 1.4 | 0.0 | 2.0
p4 | 5.1 | 3.2 | 2.0 | 0.0

Table 2.10. L1 distance matrix for Table 2.8.

L1 | p1 | p2 | p3 | p4
p1 | 0.0 | 4.0 | 4.0 | 6.0
p2 | 4.0 | 0.0 | 2.0 | 4.0
p3 | 4.0 | 2.0 | 0.0 | 2.0
p4 | 6.0 | 4.0 | 2.0 | 0.0

Table 2.11. L∞ distance matrix for Table 2.8.

L∞ | p1 | p2 | p3 | p4
p1 | 0.0 | 2.0 | 3.0 | 5.0
p2 | 2.0 | 0.0 | 1.0 | 3.0
p3 | 3.0 | 1.0 | 0.0 | 2.0
p4 | 5.0 | 3.0 | 2.0 | 0.0
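The distance matrices of Tables 2.9-2.11 can be reproduced in a few lines. The use of SciPy's cdist is an illustrative choice; the coordinates are exactly those of Table 2.8.

```python
# Minimal sketch reproducing the Euclidean, L1, and L-infinity distance matrices.
import numpy as np
from scipy.spatial.distance import cdist

points = np.array([[0, 2], [2, 0], [3, 1], [5, 1]])     # p1..p4 from Table 2.8

euclidean = cdist(points, points, metric="euclidean")    # Table 2.9
manhattan = cdist(points, points, metric="cityblock")    # Table 2.10 (L1)
supremum = cdist(points, points, metric="chebyshev")     # Table 2.11 (L-infinity)

print(np.round(euclidean, 1))    # symmetric matrix; entry (1, 4) and (4, 1) are both 5.1
```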
Distances, such as the Euclidean distance, have some well-known properties. If d(x, y) is the distance between two points, x and y, then the following properties hold.

1. Positivity
   a. d(x, y) ≥ 0 for all x and y,
   b. d(x, y) = 0 only if x = y.
2. Symmetry
   d(x, y) = d(y, x) for all x and y.
3. Triangle Inequality
   d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z.

Measures that satisfy all three properties are known as metrics. Some people use the term distance only for dissimilarity measures that satisfy these properties, but that practice is often violated. The three properties described here are useful, as well as mathematically pleasing. Also, if the triangle inequality holds, then this property can be used to increase the efficiency of techniques (including clustering) that depend on distances possessing this property. (See Exercise 25.) Nonetheless, many dissimilarities do not satisfy one or more of the metric properties. Example 2.14 illustrates such a measure.

Example 2.14 (Non-metric Dissimilarities: Set Differences). This example is based on the notion of the difference of two sets, as defined in set theory. Given two sets A and B, A − B is the set of elements of A that are not in B. For example, if A = {1, 2, 3, 4} and B = {2, 3, 4}, then A − B = {1} and B − A = ∅, the empty set. We can define the distance d between two sets A and B as d(A, B) = size(A − B), where size is a function returning the number of elements in a set. This distance measure, which is an integer value greater than or equal to 0, does not satisfy the second part of the positivity property, the symmetry property, or the triangle inequality. However, these properties can be made to hold if the dissimilarity measure is modified as follows: d(A, B) = size(A − B) + size(B − A). See Exercise 21 on page 110.
2.4.4 Similarities between Data Objects

For similarities, the triangle inequality (or the analogous property) typically does not hold, but symmetry and positivity typically do. To be explicit, if s(x, y) is the similarity between points x and y, then the typical properties of similarities are the following:

1. s(x, y) = 1 only if x = y. (0 ≤ s ≤ 1)
2. s(x, y) = s(y, x) for all x and y. (Symmetry)

There is no general analog of the triangle inequality for similarity measures. It is sometimes possible, however, to show that a similarity measure can easily be converted to a metric distance. The cosine and Jaccard similarity measures, which are discussed shortly, are two examples. Also, for specific similarity measures, it is possible to derive mathematical bounds on the similarity between two objects that are similar in spirit to the triangle inequality.

Example 2.15 (A Non-symmetric Similarity Measure). Consider an experiment in which people are asked to classify a small set of characters as they flash on a screen. The confusion matrix for this experiment records how often each character is classified as itself, and how often each is classified as another character. Using the confusion matrix, we can define a similarity measure between a character x and a character y as the number of times that x is misclassified as y, but note that this measure is not symmetric. For example, suppose that "0" appeared 200 times and was classified as a "0" 160 times, but as an "o" 40 times. Likewise, suppose that "o" appeared 200 times and was classified as an "o" 170 times, but as "0" only 30 times. Then, s(0, o) = 40, but s(o, 0) = 30. In such situations, the similarity measure can be made symmetric by setting s′(x, y) = s′(y, x) = (s(x, y) + s(y, x))/2, where s′ indicates the new similarity measure.

2.4.5 Examples of Proximity Measures

This section provides specific examples of some similarity and dissimilarity measures.

Similarity Measures for Binary Data

Similarity measures between objects that contain only binary attributes are called similarity coefficients, and typically have values between 0 and 1. A value of 1 indicates that the two objects are completely similar, while a value of 0 indicates that the objects are not at all similar. There are many rationales for why one coefficient is better than another in specific instances.

Let x and y be two objects that consist of n binary attributes. The comparison of two such objects, i.e., two binary vectors, leads to the following four quantities (frequencies):

f00 = the number of attributes where x is 0 and y is 0
f01 = the number of attributes where x is 0 and y is 1
f10 = the number of attributes where x is 1 and y is 0
f11 = the number of attributes where x is 1 and y is 1

Simple Matching Coefficient

One commonly used similarity coefficient is the simple matching coefficient (SMC), which is defined as

SMC = (number of matching attribute values)/(number of attributes) = (f11 + f00)/(f01 + f10 + f11 + f00).   (2.4)

This measure counts both presences and absences equally. Consequently, the SMC could be used to find students who had answered questions similarly on a test that consisted only of true/false questions.

Jaccard Coefficient

Suppose that x and y are data objects that represent two rows (two transactions) of a transaction matrix (see Section 2.1.2). If each asymmetric binary attribute corresponds to an item in a store, then a 1 indicates that the item was purchased, while a 0 indicates that the product was not purchased. Because the number of products not purchased by any customer far outnumbers the number of products that were purchased, a similarity measure such as SMC would say that all transactions are very similar. As a result, the Jaccard coefficient is frequently used to handle objects consisting of asymmetric binary attributes. The Jaccard coefficient, which is often symbolized by J, is given by the following equation:

J = (number of matching presences)/(number of attributes not involved in 00 matches) = f11/(f01 + f10 + f11).   (2.5)
Example 2.16 (The SMC and Jaccard Similarity Coefficients). To illustrate the difference between these two similarity measures, we calculate SMC and J for the following two binary vectors.

x = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
y = (0, 0, 0, 0, 0, 0, 1, 0, 0, 1)

f01 = 2   (the number of attributes where x was 0 and y was 1)
f10 = 1   (the number of attributes where x was 1 and y was 0)
f00 = 7   (the number of attributes where x was 0 and y was 0)
f11 = 0   (the number of attributes where x was 1 and y was 1)

SMC = (f11 + f00)/(f01 + f10 + f11 + f00) = (0 + 7)/(2 + 1 + 0 + 7) = 0.7

J = f11/(f01 + f10 + f11) = 0/(2 + 1 + 0) = 0

Cosine Similarity

Documents are often represented as vectors, where each component (attribute) represents the frequency with which a particular term (word) occurs in the document. Even though documents have thousands or tens of thousands of attributes (terms), each document is sparse since it has relatively few nonzero attributes. Thus, as with transaction data, similarity should not depend on the number of shared 0 values because any two documents are likely to "not contain" many of the same words, and therefore, if 0-0 matches are counted, most documents will be highly similar to most other documents. Therefore, a similarity measure for documents needs to ignore 0-0 matches like the Jaccard measure, but also must be able to handle non-binary vectors. The cosine similarity, defined next, is one of the most common measures of document similarity. If x and y are two document vectors, then

cos(x, y) = ⟨x, y⟩/(∥x∥ ∥y∥) = x′y/(∥x∥ ∥y∥),   (2.6)

where ′ indicates vector or matrix transpose and ⟨x, y⟩ indicates the inner product of the two vectors,

⟨x, y⟩ = \sum_{k=1}^{n} x_k y_k = x′y,   (2.7)

and ∥x∥ is the length of vector x, ∥x∥ = \sqrt{\sum_{k=1}^{n} x_k^2} = \sqrt{⟨x, x⟩} = \sqrt{x′x}.

The inner product of two vectors works well for asymmetric attributes since it depends only on components that are non-zero in both vectors. Hence, the similarity between two documents depends only upon the words that appear in both of them.

Example 2.17 (Cosine Similarity between Two Document Vectors). This example calculates the cosine similarity for the following two data objects, which might represent document vectors:

x = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
y = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)

⟨x, y⟩ = 3×1 + 2×0 + 0×0 + 5×0 + 0×0 + 0×0 + 0×0 + 2×1 + 0×0 + 0×2 = 5
∥x∥ = \sqrt{3×3 + 2×2 + 0×0 + 5×5 + 0×0 + 0×0 + 0×0 + 2×2 + 0×0 + 0×0} = 6.48
∥y∥ = \sqrt{1×1 + 0×0 + 0×0 + 0×0 + 0×0 + 0×0 + 0×0 + 1×1 + 0×0 + 2×2} = 2.45
cos(x, y) = 0.31

As indicated by Figure 2.16, cosine similarity really is a measure of the (cosine of the) angle between x and y. Thus, if the cosine similarity is 1, the angle between x and y is 0°, and x and y are the same except for length. If the cosine similarity is 0, then the angle between x and y is 90°, and they do not share any terms (words).

Figure 2.16. Geometric illustration of the cosine measure.

Equation 2.6 can also be written as Equation 2.8:

cos(x, y) = ⟨x/∥x∥, y/∥y∥⟩ = ⟨x′, y′⟩,   (2.8)

where x′ = x/∥x∥ and y′ = y/∥y∥. Dividing x and y by their lengths normalizes them to have a length of 1. This means that cosine similarity does not take the length of the two data objects into account when computing similarity. (Euclidean distance might be a better choice when length is important.) For vectors with a length of 1, the cosine measure can be calculated by taking a simple inner product. Consequently, when many cosine similarities between objects are being computed, normalizing the objects to have unit length can reduce the time required.

Extended Jaccard Coefficient (Tanimoto Coefficient)

The extended Jaccard coefficient can be used for document data and reduces to the Jaccard coefficient in the case of binary attributes. This coefficient, which we shall represent as EJ, is defined by the following equation:

EJ(x, y) = ⟨x, y⟩/(∥x∥² + ∥y∥² − ⟨x, y⟩) = x′y/(∥x∥² + ∥y∥² − x′y).   (2.9)
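A minimal sketch computing these coefficients for the vectors of Examples 2.16 and 2.17 follows. Only the vectors come from the text; the helper functions are illustrative.

```python
# Minimal sketch of SMC, Jaccard, cosine, and extended Jaccard similarities.
import numpy as np

def smc(x, y):
    return np.mean(x == y)                                   # (f11 + f00) / n

def jaccard(x, y):
    f11 = np.sum((x == 1) & (y == 1))
    f01 = np.sum((x == 0) & (y == 1))
    f10 = np.sum((x == 1) & (y == 0))
    return f11 / (f01 + f10 + f11)

def cosine(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def extended_jaccard(x, y):
    dot = np.dot(x, y)
    return dot / (np.dot(x, x) + np.dot(y, y) - dot)

xb = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0]); yb = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])
print(smc(xb, yb), jaccard(xb, yb))           # 0.7 and 0.0, as in Example 2.16

xd = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0]); yd = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2])
print(round(cosine(xd, yd), 2))               # 0.31, as in Example 2.17
```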
CorrelationCorrelationisfrequentlyusedtomeasurethelinearrelationshipbetweentwosetsofvaluesthatareobservedtogether.Thus,correlationcanmeasuretherelationshipbetweentwovariables(heightandweight)orbetweentwoobjects(apairoftemperaturetimeseries).Correlationisusedmuchmorefrequentlytomeasurethesimilaritybetweenattributessincethevaluesintwodataobjectscomefromdifferentattributes,whichcanhaveverydifferentattributetypesandscales.Therearemanytypesofcorrelation,andindeedcorrelationissometimesusedinageneralsensetomeantherelationshipbetweentwosetsofvaluesthatareobservedtogether.Inthisdiscussion,wewillfocusonameasureappropriatefornumericalvalues.
Specifically, Pearson's correlation between two sets of numerical values, i.e., two vectors, x and y, is defined by the following equation:

corr(x, y) = covariance(x, y) / (standard_deviation(x) × standard_deviation(y)) = s_xy / (s_x s_y),    (2.10)

where we use the following standard statistical notation and definitions:

covariance(x, y) = s_xy = (1/(n−1)) Σ_{k=1}^{n} (x_k − x̄)(y_k − ȳ)    (2.11)

standard_deviation(x) = s_x = sqrt( (1/(n−1)) Σ_{k=1}^{n} (x_k − x̄)² )

standard_deviation(y) = s_y = sqrt( (1/(n−1)) Σ_{k=1}^{n} (y_k − ȳ)² )

x̄ = (1/n) Σ_{k=1}^{n} x_k is the mean of x

ȳ = (1/n) Σ_{k=1}^{n} y_k is the mean of y
Example 2.18 (Perfect Correlation). Correlation is always in the range −1 to 1. A correlation of 1 (−1) means that x and y have a perfect positive (negative) linear relationship; that is, x_k = a y_k + b, where a and b are constants. The following two vectors x and y illustrate cases where the correlation is −1 and +1, respectively. In the first case, the means of x and y were chosen to be 0, for simplicity.

x = (−3, 6, 0, 3, −6)
y = (1, −2, 0, −1, 2)
corr(x, y) = −1, since x_k = −3 y_k

x = (3, 6, 0, 3, 6)
y = (1, 2, 0, 1, 2)
corr(x, y) = 1, since x_k = 3 y_k

Example 2.19 (Nonlinear Relationships). If the correlation is 0, then there is no linear relationship between the two sets of values. However, nonlinear relationships can still exist. In the following example, y_k = x_k², but their correlation is 0.

x = (−3, −2, −1, 0, 1, 2, 3)
y = (9, 4, 1, 0, 1, 4, 9)

Example 2.20 (Visualizing Correlation). It is also easy to judge the correlation between two vectors x and y by plotting pairs of corresponding values of x and y in a scatter plot. Figure 2.17 shows a number of these scatter plots when x and y consist of a set of 30 pairs of values that are randomly generated (with a normal distribution) so that the correlation of x and y ranges from −1 to 1. Each circle in a plot represents one of the 30 pairs of x and y values; its x coordinate is the value of that pair for x, while its y coordinate is the value of the same pair for y.

Figure 2.17. Scatter plots illustrating correlations from −1 to 1.
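As a complement to these examples, the following small Python sketch (an illustration, not taken from the text) computes Pearson's correlation exactly as in Equations 2.10 and 2.11 and reproduces the correlations of Examples 2.18 and 2.19.

def mean(v):
    return sum(v) / len(v)

def correlation(x, y):
    n, mx, my = len(x), mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)    # covariance
    sx = (sum((a - mx) ** 2 for a in x) / (n - 1)) ** 0.5             # std. deviation of x
    sy = (sum((b - my) ** 2 for b in y) / (n - 1)) ** 0.5             # std. deviation of y
    return sxy / (sx * sy)

print(round(correlation([-3, 6, 0, 3, -6], [1, -2, 0, -1, 2]), 4))            # -1.0
print(round(correlation([3, 6, 0, 3, 6], [1, 2, 0, 1, 2]), 4))                #  1.0
print(round(correlation([-3, -2, -1, 0, 1, 2, 3], [9, 4, 1, 0, 1, 4, 9]), 4)) #  0.0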
If we transform x and y by subtracting off their means and then normalizing them so that their lengths are 1, then their correlation can be calculated by taking the dot product. Let us refer to these transformed vectors of x and y as x′ and y′, respectively. (Notice that this transformation is not the same as the standardization used in other contexts, where we subtract the means and divide by the standard deviations, as discussed in Section 2.3.7.) This transformation highlights an interesting relationship between the correlation measure and the cosine measure. Specifically, the correlation between x and y is identical to the cosine between x′ and y′. However, the cosine between x and y is not the same as the cosine between x′ and y′, even though they both have the same correlation measure. In general, the correlation between two vectors is equal to the cosine measure only in the special case when the means of the two vectors are 0.
Differences Among Measures for Continuous Attributes
In this section, we illustrate the differences among the three proximity measures for continuous attributes that we have just defined: cosine, correlation, and Minkowski distance. Specifically, we consider two types of data transformations that are commonly used, namely, scaling (multiplication) by a constant factor and translation (addition) by a constant value. A proximity measure is considered to be invariant to a data transformation if its value remains unchanged even after performing the transformation. Table 2.12 compares the behavior of cosine, correlation, and Minkowski distance measures regarding their invariance to scaling and translation operations. It can be seen that while correlation is invariant to both scaling and translation, cosine is invariant only to scaling but not to translation. Minkowski distance measures, on the other hand, are sensitive to both scaling and translation and are thus invariant to neither.
Table2.12.Propertiesofcosine,correlation,andMinkowskidistancemeasures.
Property Cosine Correlation MinkowskiDistance
Invarianttoscaling(multiplication) Yes Yes No
Invarianttotranslation(addition) No Yes No
Let us consider an example to demonstrate the significance of these differences among different proximity measures.

Example 2.21 (Comparing Proximity Measures). Consider the following two vectors x and y with seven numeric attributes:

x = (1, 2, 4, 3, 0, 0, 0)
y = (1, 2, 3, 4, 0, 0, 0)

It can be seen that both x and y have 4 non-zero values, and the values in the two vectors are mostly the same, except for the third and the fourth components. The cosine, correlation, and Euclidean distance between the two vectors can be computed as follows:

cos(x, y) = 29 / (sqrt(30) × sqrt(30)) = 0.9667
correlation(x, y) = 2.3571 / (1.5811 × 1.5811) = 0.9429
Euclidean distance ∥x − y∥ = 1.4142

Not surprisingly, x and y have a cosine and correlation measure close to 1, while the Euclidean distance between them is small, indicating that they are quite similar. Now let us consider the vector y_s, which is a scaled version of y (multiplied by a constant factor of 2), and the vector y_t, which is constructed by translating y by 5 units, as follows:

y_s = 2 × y = (2, 4, 6, 8, 0, 0, 0)
y_t = y + 5 = (6, 7, 8, 9, 5, 5, 5)

We are interested in finding whether y_s and y_t show the same proximity with x as shown by the original vector y. Table 2.13 shows the different measures of proximity computed for the pairs (x, y), (x, y_s), and (x, y_t). It can be seen that the value of correlation between x and y remains unchanged even after replacing y with y_s or y_t. However, the value of cosine remains equal to 0.9667 when computed for (x, y) and (x, y_s), but significantly reduces to 0.7940 when computed for (x, y_t). This highlights the fact that cosine is invariant to the scaling operation but not to the translation operation, in contrast with the correlation measure. The Euclidean distance, on the other hand, shows different values for all three pairs of vectors, as it is sensitive to both scaling and translation.

Table 2.13. Similarity between (x, y), (x, y_s), and (x, y_t).
Measure | (x, y) | (x, y_s) | (x, y_t)
Cosine | 0.9667 | 0.9667 | 0.7940
Correlation | 0.9429 | 0.9429 | 0.9429
Euclidean Distance | 1.4142 | 5.8310 | 14.2127

We can observe from this example that different proximity measures behave differently when scaling or translation operations are applied on the data. The choice of the right proximity measure thus depends on the desired notion of similarity between data objects that is meaningful for a given application. For example, if x and y represented the frequencies of different words in a document-term matrix, it would be meaningful to use a proximity measure that remains unchanged when y is replaced by y_s, because y_s is just a scaled version of y with the same distribution of words occurring in the document. However, y_t is different from y, since it contains a large number of words with non-zero frequencies that do not occur in y. Because cosine is invariant to scaling but not to translation, it will be an ideal choice of proximity measure for this application.

Consider a different scenario in which x represents a location's temperature measured on the Celsius scale for seven days. Let y, y_s, and y_t be the temperatures measured on those days at a different location, but using three different measurement scales. Note that different units of temperature have different offsets (e.g., Celsius and Kelvin) and different scaling factors (e.g., Celsius and Fahrenheit). It is thus desirable to use a proximity measure that captures the proximity between temperature values without being affected by the measurement scale. Correlation would then be the ideal choice of proximity measure for this application, as it is invariant to both scaling and translation.

As another example, consider a scenario where x represents the amount of precipitation (in cm) measured at seven locations. Let y, y_s, and y_t be estimates of the precipitation at these locations, which are predicted using three different models. Ideally, we would like to choose a model that accurately reconstructs the measurements in x without making any error. It is evident that y provides a good approximation of the values in x, whereas y_s and y_t provide poor estimates of precipitation, even though they do capture the trend in precipitation across locations. Hence, we need to choose a proximity measure that penalizes any difference in the model estimates from the actual observations, and is sensitive to both the scaling and translation operations. The Euclidean distance satisfies this property and thus would be the right choice of proximity measure for this application. Indeed, the Euclidean distance is commonly used in computing the accuracy of models, which will be discussed later in Chapter 3.
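The following Python sketch (illustrative only; it recomputes the measures from scratch rather than relying on any library) applies the scaling and translation of Example 2.21 and shows the qualitative behavior summarized in Table 2.12: correlation is unchanged for all three pairs, cosine changes only under translation, and Euclidean distance changes under both transformations.

from math import sqrt

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))

def correlation(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return sxy / (sx * sy)      # the 1/(n-1) factors cancel in the ratio

def euclidean(x, y):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

x  = [1, 2, 4, 3, 0, 0, 0]
y  = [1, 2, 3, 4, 0, 0, 0]
ys = [2 * v for v in y]         # scaled version of y
yt = [v + 5 for v in y]         # translated version of y

for name, v in (("y", y), ("ys", ys), ("yt", yt)):
    print(name, round(cosine(x, v), 4), round(correlation(x, v), 4),
          round(euclidean(x, v), 4))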
2.4.6 Mutual Information

Like correlation, mutual information is a measure of the relationship between two sets of paired values that is sometimes used as an alternative to correlation, particularly when a nonlinear relationship is suspected between the pairs of values. This measure comes from information theory, which is the study of how to formally define and quantify information. Indeed, mutual information is a measure of how much information one set of values provides about another, given that the values come in pairs, e.g., height and weight. If the two sets of values are independent, i.e., the value of one tells us nothing about the other, then their mutual information is 0. On the other hand, if the two sets of values are completely dependent, i.e., knowing the value of one tells us the value of the other and vice-versa, then they have maximum mutual information. Mutual information does not have a maximum value, but we will define a normalized version of it that ranges between 0 and 1.
To define mutual information, we consider two sets of values, X and Y, which occur in pairs (X, Y). We need to measure the average information in a single set of values, i.e., either in X or in Y, and in the pairs of their values. This is commonly measured by entropy. More specifically, assume X and Y are discrete, that is, X can take m distinct values, u_1, u_2, ..., u_m, and Y can take n distinct values, v_1, v_2, ..., v_n. Then their individual and joint entropy can be defined in terms of the probabilities of each value and pair of values as follows:

H(X) = −Σ_{j=1}^{m} P(X = u_j) log_2 P(X = u_j)    (2.12)

H(Y) = −Σ_{k=1}^{n} P(Y = v_k) log_2 P(Y = v_k)    (2.13)

H(X, Y) = −Σ_{j=1}^{m} Σ_{k=1}^{n} P(X = u_j, Y = v_k) log_2 P(X = u_j, Y = v_k)    (2.14)

where, if the probability of a value or combination of values is 0, then 0 log_2(0) is conventionally taken to be 0.

The mutual information of X and Y can now be defined straightforwardly:

I(X, Y) = H(X) + H(Y) − H(X, Y)    (2.15)

Note that H(X, Y) is symmetric, i.e., H(X, Y) = H(Y, X), and thus mutual information is also symmetric, i.e., I(X, Y) = I(Y, X).

Practically, X and Y are either the values in two attributes or two rows of the same data set. In Example 2.22, we will represent those values as two vectors x and y and calculate the probability of each value or pair of values from the frequency with which values or pairs of values occur in x, y, and (x_i, y_i), where x_i is the ith component of x and y_i is the ith component of y. Let us illustrate using a previous example.

Example 2.22 (Evaluating Nonlinear Relationships with Mutual Information). Recall Example 2.19 where y_k = x_k², but their correlation was 0.

x = (−3, −2, −1, 0, 1, 2, 3)
y = (9, 4, 1, 0, 1, 4, 9)

From Figure 2.18, I(x, y) = H(x) + H(y) − H(x, y) = 1.9502. Although a variety of approaches to normalize mutual information are possible—see the Bibliographic Notes—for this example, we will apply one that divides the mutual information by log_2(min(m, n)) and produces a result between 0 and 1. This yields a value of 1.9502/log_2(4) = 0.9751. Thus, we can see that x and y are strongly related. They are not perfectly related because, given a value of y, there is, except for y = 0, some ambiguity about the value of x. Notice that for y = −x, the normalized mutual information would be 1.

Figure 2.18. Computation of mutual information.

Table 2.14. Entropy for x.
x_j | P(x = x_j) | −P(x = x_j) log_2 P(x = x_j)
−3 | 1/7 | 0.4011
−2 | 1/7 | 0.4011
−1 | 1/7 | 0.4011
0 | 1/7 | 0.4011
1 | 1/7 | 0.4011
2 | 1/7 | 0.4011
3 | 1/7 | 0.4011
H(x) | | 2.8074

Table 2.15. Entropy for y.
y_k | P(y = y_k) | −P(y = y_k) log_2 P(y = y_k)
9 | 2/7 | 0.5164
4 | 2/7 | 0.5164
1 | 2/7 | 0.5164
0 | 1/7 | 0.4011
H(y) | | 1.9502

Table 2.16. Joint entropy for x and y.
x_j | y_k | P(x = x_j, y = y_k) | −P(x = x_j, y = y_k) log_2 P(x = x_j, y = y_k)
−3 | 9 | 1/7 | 0.4011
−2 | 4 | 1/7 | 0.4011
−1 | 1 | 1/7 | 0.4011
0 | 0 | 1/7 | 0.4011
1 | 1 | 1/7 | 0.4011
2 | 4 | 1/7 | 0.4011
3 | 9 | 1/7 | 0.4011
H(x, y) | | | 2.8074
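The entropy and mutual information values above can be checked with the short Python sketch below (an illustration using only the standard library; the helper names are not from the text).

from collections import Counter
from math import log2

def entropy(values):
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def mutual_information(x, y):
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))    # Equation 2.15

x = [-3, -2, -1, 0, 1, 2, 3]
y = [v * v for v in x]
mi = mutual_information(x, y)
print(round(entropy(x), 4), round(entropy(y), 4), round(mi, 4))   # 2.8074 1.9502 1.9502
m, n = len(set(x)), len(set(y))
print(round(mi / log2(min(m, n)), 4))                             # normalized: 0.9751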
2.4.7 Kernel Functions*
Itiseasytounderstandhowsimilarityanddistancemightbeusefulinanapplicationsuchasclustering,whichtriestogroupsimilarobjectstogether.Whatismuchlessobviousisthatmanyotherdataanalysistasks,includingpredictivemodelinganddimensionalityreduction,canbeexpressedintermsofpairwise“proximities”ofdataobjects.Morespecifically,manydataanalysisproblemscanbemathematicallyformulatedtotakeasinput,akernelmatrix,K,whichcanbeconsideredatypeofproximitymatrix.Thus,aninitialpreprocessingstepisusedtoconverttheinputdataintoakernelmatrix,whichistheinputtothedataanalysisalgorithm.
More formally, if a data set has m data objects, then K is an m by m matrix. If x_i and x_j are the ith and jth data objects, respectively, then k_ij, the ijth entry of K, is computed by a kernel function:

k_ij = κ(x_i, x_j)    (2.16)
Aswewillseeinthematerialthatfollows,theuseofakernelmatrixallowsbothwiderapplicabilityofanalgorithmtovariouskindsofdataandanabilitytomodelnonlinearrelationshipswithalgorithmsthataredesignedonlyfordetectinglinearrelationships.
Kernelsmakeanalgorithmdataindependent
Ifanalgorithmusesakernelmatrix,thenitcanbeusedwithanytypeofdataforwhichakernelfunctioncanbedesigned.ThisisillustratedbyAlgorithm2.1.Althoughonlysomedataanalysisalgorithmscanbemodifiedtouseakernelmatrixasinput,thisapproachisextremelypowerfulbecauseitallowssuchanalgorithmtobeusedwithalmostanytypeofdataforwhichanappropriatekernelfunctioncanbedefined.Thus,aclassificationalgorithmcanbeused,forexample,withrecorddata,stringdata,orgraphdata.Ifanalgorithmcanbereformulatedtouseakernelmatrix,thenitsapplicabilitytodifferenttypesofdataincreasesdramatically.Aswewillseeinlaterchapters,manyclustering,classification,andanomalydetectionalgorithmsworkonlywithsimilaritiesordistances,andthus,canbeeasilymodifiedtoworkwithkernels.
Algorithm 2.1 Basic kernel algorithm.
1. Read in the m data objects in the data set.
2. Compute the kernel matrix, K, by applying the kernel function, κ, to each pair of data objects.
3. Run the data analysis algorithm with K as input.
4. Return the analysis result, e.g., predicted class or cluster labels.
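A minimal Python sketch of Algorithm 2.1 is shown below. It assumes numpy is available, uses a simple linear kernel as the kernel function, and leaves the choice of the downstream data analysis algorithm open; all names are illustrative rather than part of the text.

import numpy as np

def linear_kernel(xi, xj):
    return float(np.dot(xi, xj))                # kappa(xi, xj) = <xi, xj>

def kernel_matrix(data, kernel=linear_kernel):
    m = len(data)
    K = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            K[i, j] = kernel(data[i], data[j])  # step 2 of Algorithm 2.1
    return K

data = np.array([[3.0, 2.0], [1.0, 0.0], [0.0, 1.0]])   # step 1: read the m data objects
K = kernel_matrix(data)                                  # step 2: compute the kernel matrix
print(K)
# Steps 3 and 4 would pass K to any analysis algorithm that accepts a
# precomputed kernel matrix and return its result (e.g., cluster labels).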
Mapping data into a higher dimensional data space can allow modeling of nonlinear relationships
There is yet another, equally important, aspect of kernel-based data algorithms—their ability to model nonlinear relationships with algorithms that model only linear relationships. Typically, this works by first transforming (mapping) the data from a lower dimensional data space to a higher dimensional space.

Example 2.23 (Mapping Data to a Higher Dimensional Space). Consider the relationship between two variables x and y given by the following equation, which defines an ellipse in two dimensions (Figure 2.19(a)):

4x² + 9xy + 7y² = 10    (2.17)

Figure 2.19. Mapping data to a higher dimensional space: two to three dimensions.

We can map our two-dimensional data to three dimensions by creating three new variables, u, v, and w, which are defined as follows:

u = x²
v = xy
w = y²

As a result, we can now express Equation 2.17 as a linear one:

4u + 9v + 7w = 10    (2.18)

This equation describes a plane in three dimensions. Points on the ellipse will lie on that plane, while points inside and outside the ellipse will lie on opposite sides of the plane. See Figure 2.19(b). The viewpoint of this 3D plot is along the surface of the separating plane so that the plane appears as a line.

The Kernel Trick
The approach illustrated above shows the value in mapping data to a higher dimensional space, an operation that is integral to kernel-based methods. Conceptually, we first define a function φ that maps data points x and y to data points φ(x) and φ(y) in a higher dimensional space such that the inner product ⟨φ(x), φ(y)⟩ gives the desired measure of proximity of x and y. It may seem that we have potentially sacrificed a great deal by using such an approach, because we can greatly expand the size of our data, increase the computational complexity of our analysis, and encounter problems with the curse of dimensionality by computing similarity in a high-dimensional space. However, this is not the case, since these problems can be avoided by defining a kernel function κ that can compute the same similarity value, but with the data points in the original space, i.e., κ(x, y) = ⟨φ(x), φ(y)⟩. This is known as the kernel trick. Despite the name, the kernel trick has a very solid mathematical foundation and is a remarkably powerful approach for data analysis.
Not every function of a pair of data objects satisfies the properties needed for a kernel function, but it has been possible to design many useful kernels for a wide variety of data types. For example, three common kernel functions are the polynomial, Gaussian (radial basis function (RBF)), and sigmoid kernels. If x and y are two data objects, specifically, two data vectors, then these three kernel functions can be expressed as follows, respectively:

κ(x, y) = (x′y + c)^d    (2.19)

κ(x, y) = exp(−∥x − y∥² / (2σ²))    (2.20)

κ(x, y) = tanh(αx′y + c)    (2.21)

where α and c ≥ 0 are constants, d is an integer parameter that gives the polynomial degree, ∥x − y∥ is the length of the vector x − y, and σ > 0 is a parameter that governs the "spread" of the Gaussian.

Example 2.24 (The Polynomial Kernel). Note that the kernel functions presented above compute the same similarity value as would be computed if we actually mapped the data to a higher dimensional space and then computed an inner product there. For example, for the polynomial kernel of degree 2, let φ be the function that maps a two-dimensional data vector x = (x_1, x_2) to the higher dimensional space. Specifically, let

φ(x) = (x_1², x_2², √2 x_1 x_2, √(2c) x_1, √(2c) x_2, c).    (2.22)

For the higher dimensional space, let the proximity be defined as the inner product of φ(x) and φ(y), i.e., ⟨φ(x), φ(y)⟩. Then, as previously mentioned, it can be shown that

κ(x, y) = ⟨φ(x), φ(y)⟩,    (2.23)

where κ is defined by Equation 2.19 above. Specifically, if x = (x_1, x_2) and y = (y_1, y_2), then

κ(x, y) = ⟨φ(x), φ(y)⟩ = x_1²y_1² + x_2²y_2² + 2x_1x_2y_1y_2 + 2c x_1y_1 + 2c x_2y_2 + c².    (2.24)

More generally, the kernel trick depends on defining κ and φ so that Equation 2.23 holds. This has been done for a wide variety of kernels.
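The identity in Equation 2.23 can be checked numerically. The Python sketch below (illustrative; it uses the radicals reconstructed in Equation 2.22) compares the degree-2 polynomial kernel computed in the original two-dimensional space with the inner product of the explicitly mapped vectors.

from math import sqrt

def poly_kernel(x, y, c=1.0, d=2):              # Equation 2.19
    return (sum(a * b for a, b in zip(x, y)) + c) ** d

def phi(x, c=1.0):                               # Equation 2.22, degree 2
    x1, x2 = x
    return (x1 * x1, x2 * x2, sqrt(2) * x1 * x2,
            sqrt(2 * c) * x1, sqrt(2 * c) * x2, c)

x, y, c = (1.0, 2.0), (3.0, -1.0), 1.0
lhs = poly_kernel(x, y, c)                                   # kappa(x, y)
rhs = sum(a * b for a, b in zip(phi(x, c), phi(y, c)))       # <phi(x), phi(y)>
print(lhs, rhs)    # both equal (x'y + c)^2 = (1*3 + 2*(-1) + 1)^2 = 4.0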
Thisdiscussionofkernel-basedapproacheswasintendedonlytoprovideabriefintroductiontothistopicandhasomittedmanydetails.Afullerdiscussionofthekernel-basedapproachisprovidedinSection4.9.4,whichdiscussestheseissuesinthecontextofnonlinearsupportvectormachinesforclassification.MoregeneralreferencesforthekernelbasedanalysiscanbefoundintheBibliographicNotesofthischapter.
2.4.8 Bregman Divergence*
ThissectionprovidesabriefdescriptionofBregmandivergences,whichareafamilyofproximityfunctionsthatsharesomecommonproperties.Asaresult,itispossibletoconstructgeneraldataminingalgorithms,suchasclusteringalgorithms,thatworkwithanyBregmandivergence.AconcreteexampleistheK-meansclusteringalgorithm(Section7.2).Notethatthissectionrequiresknowledgeofvectorcalculus.
Bregmandivergencesarelossordistortionfunctions.Tounderstandtheideaofalossfunction,considerthefollowing.Letxandybetwopoints,whereyisregardedastheoriginalpointandxissomedistortionorapproximationofit.Forexample,xmaybeapointthatwasgeneratedbyaddingrandomnoisetoy.Thegoalistomeasuretheresultingdistortionorlossthatresultsifyisapproximatedbyx.Ofcourse,themoresimilarxandyare,thesmallerthelossordistortion.Thus,Bregmandivergencescanbeusedasdissimilarityfunctions.
Moreformally,wehavethefollowingdefinition.
Definition 2.6 (Bregman Divergence). Given a strictly convex function ϕ (with a few modest restrictions that are generally satisfied), the Bregman divergence (loss function) D(x, y) generated by that function is given by the following equation:

D(x, y) = ϕ(x) − ϕ(y) − ⟨∇ϕ(y), (x − y)⟩    (2.25)

where ∇ϕ(y) is the gradient of ϕ evaluated at y, x − y is the vector difference between x and y, and ⟨∇ϕ(y), (x − y)⟩ is the inner product between ∇ϕ(y) and (x − y). For points in Euclidean space, the inner product is just the dot product.

D(x, y) can be written as D(x, y) = ϕ(x) − L(x), where L(x) = ϕ(y) + ⟨∇ϕ(y), (x − y)⟩ represents the equation of a plane that is tangent to the function ϕ at y. Using calculus terminology, L(x) is the linearization of ϕ around the point y, and the Bregman divergence is just the difference between a function and a linear approximation to that function. Different Bregman divergences are obtained by using different choices for ϕ.

Example 2.25. We provide a concrete example using squared Euclidean distance, but restrict ourselves to one dimension to simplify the mathematics. Let x and y be real numbers and ϕ(t) be the real-valued function ϕ(t) = t². In that case, the gradient reduces to the derivative, and the dot product reduces to multiplication. Specifically, Equation 2.25 becomes Equation 2.26:

D(x, y) = x² − y² − 2y(x − y) = (x − y)²    (2.26)

The graph for this example, with y = 1, is shown in Figure 2.20. The Bregman divergence is shown for two values of x: x = 2 and x = 3.
Figure2.20.IllustrationofBregmandivergence.
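The one-dimensional case of Example 2.25 can be reproduced with the following Python sketch (illustrative; the caller supplies ϕ and its derivative).

def bregman(x, y, phi, grad_phi):
    # D(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>   (Equation 2.25, one dimension)
    return phi(x) - phi(y) - grad_phi(y) * (x - y)

phi = lambda t: t * t          # strictly convex function phi(t) = t^2
grad_phi = lambda t: 2 * t     # its derivative

print(bregman(2.0, 1.0, phi, grad_phi))   # 1.0 = (2 - 1)^2
print(bregman(3.0, 1.0, phi, grad_phi))   # 4.0 = (3 - 1)^2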
2.4.9 Issues in Proximity Calculation

This section discusses several important issues related to proximity measures: (1) how to handle the case in which attributes have different scales and/or are correlated, (2) how to calculate proximity between objects that are composed of different types of attributes, e.g., quantitative and qualitative, and (3) how to handle proximity calculations when attributes have different weights, i.e., when not all attributes contribute equally to the proximity of objects.
Standardization and Correlation for Distance Measures
An important issue with distance measures is how to handle the situation when attributes do not have the same range of values. (This situation is often described by saying that "the variables have different scales.") In a previous example, Euclidean distance was used to measure the distance between people based on two attributes: age and income. Unless these two attributes are standardized, the distance between two people will be dominated by income.
A related issue is how to compute distance when there is correlation between some of the attributes, perhaps in addition to differences in the ranges of values. A generalization of Euclidean distance, the Mahalanobis distance, is useful when attributes are correlated, have different ranges of values (different variances), and the distribution of the data is approximately Gaussian (normal). Correlated variables have a large impact on standard distance measures, since a change in any of the correlated variables is reflected in a change in all the correlated variables. Specifically, the Mahalanobis distance between two objects (vectors) x and y is defined as

Mahalanobis(x, y) = (x − y)′ Σ⁻¹ (x − y),    (2.27)

where Σ⁻¹ is the inverse of the covariance matrix of the data. Note that the covariance matrix Σ is the matrix whose ijth entry is the covariance of the ith and jth attributes as defined by Equation 2.11.

Example 2.26. In Figure 2.21, there are 1000 points, whose x and y attributes have a correlation of 0.6. The distance between the two large points at the opposite ends of the long axis of the ellipse is 14.7 in terms of Euclidean distance, but only 6 with respect to Mahalanobis distance. This is because the Mahalanobis distance gives less emphasis to the direction of largest variance. In practice, computing the Mahalanobis distance is expensive, but can be worthwhile for data whose attributes are correlated. If the attributes are relatively uncorrelated, but have different ranges, then standardizing the variables is sufficient.
Figure2.21.Setoftwo-dimensionalpoints.TheMahalanobisdistancebetweenthetwopointsrepresentedbylargedotsis6;theirEuclideandistanceis14.7.
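The quantity in Equation 2.27 can be computed as in the Python sketch below (illustrative; it assumes numpy is available, estimates the covariance matrix from the data itself, and follows Equation 2.27 as printed; some references report the square root of this quantity).

import numpy as np

def mahalanobis(x, y, data):
    S_inv = np.linalg.inv(np.cov(data, rowvar=False))   # inverse covariance matrix of the data
    d = x - y
    return float(d @ S_inv @ d)                         # Equation 2.27

rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.6], [0.6, 1.0]])                # correlated attributes, as in Figure 2.21
data = rng.multivariate_normal([0.0, 0.0], cov, size=1000)
x, y = np.array([2.0, 2.0]), np.array([-2.0, -2.0])     # two points along the long axis
print(mahalanobis(x, y, data), float(np.linalg.norm(x - y)))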
Combining Similarities for Heterogeneous Attributes
Thepreviousdefinitionsofsimilaritywerebasedonapproachesthatassumedalltheattributeswereofthesametype.Ageneralapproachisneededwhentheattributesareofdifferenttypes.OnestraightforwardapproachistocomputethesimilaritybetweeneachattributeseparatelyusingTable2.7 ,andthencombinethesesimilaritiesusingamethodthatresultsinasimilaritybetween0and1.Onepossibleapproachistodefinetheoverallsimilarityastheaverageofalltheindividualattributesimilarities.Unfortunately,thisapproachdoesnotworkwellifsomeoftheattributesareasymmetricattributes.Forexample,ifalltheattributesareasymmetricbinaryattributes,thenthesimilaritymeasuresuggestedpreviouslyreducestothesimplematchingcoefficient,ameasurethatisnotappropriateforasymmetricbinaryattributes.Theeasiestwaytofixthisproblemistoomitasymmetricattributesfromthesimilaritycalculationwhentheirvaluesare0forbothoftheobjectswhosesimilarityisbeingcomputed.Asimilarapproachalsoworkswellforhandlingmissingvalues.
Insummary,Algorithm2.2iseffectiveforcomputinganoverallsimilaritybetweentwoobjects,xandy,withdifferenttypesofattributes.Thisprocedurecanbeeasilymodifiedtoworkwithdissimilarities.
Algorithm 2.2 Similarities of heterogeneous objects.
1: For the kth attribute, compute a similarity, s_k(x, y), in the range [0, 1].
2: Define an indicator variable, δ_k, for the kth attribute as follows:
   δ_k = 0 if the kth attribute is an asymmetric attribute and both objects have a value of 0, or if one of the objects has a missing value for the kth attribute;
   δ_k = 1 otherwise.
3: Compute the overall similarity between the two objects using the following formula:

similarity(x, y) = Σ_{k=1}^{n} δ_k s_k(x, y) / Σ_{k=1}^{n} δ_k    (2.28)

Using Weights
In much of the previous discussion, all attributes were treated equally when computing proximity. This is not desirable when some attributes are more important to the definition of proximity than others. To address these situations, the formulas for proximity can be modified by weighting the contribution of each attribute.

With attribute weights, w_k, Equation 2.28 becomes

similarity(x, y) = Σ_{k=1}^{n} w_k δ_k s_k(x, y) / Σ_{k=1}^{n} w_k δ_k.    (2.29)

The definition of the Minkowski distance can also be modified as follows:

d(x, y) = ( Σ_{k=1}^{n} w_k |x_k − y_k|^r )^{1/r}.    (2.30)
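A small Python sketch of Algorithm 2.2, extended with the weights of Equation 2.29, is given below (illustrative; the per-attribute similarity functions and the asymmetric-attribute flags are supplied by the caller).

def overall_similarity(x, y, sim_fns, asymmetric, weights=None):
    n = len(x)
    weights = weights if weights is not None else [1.0] * n
    num = den = 0.0
    for k in range(n):
        # delta_k = 0 for an asymmetric attribute where both objects have a value of 0
        delta = 0.0 if (asymmetric[k] and x[k] == 0 and y[k] == 0) else 1.0
        num += weights[k] * delta * sim_fns[k](x[k], y[k])
        den += weights[k] * delta
    return num / den if den > 0 else 0.0

# toy usage: one symmetric binary attribute and two asymmetric binary attributes
match = lambda a, b: 1.0 if a == b else 0.0
x, y = [1, 0, 1], [1, 0, 0]
print(overall_similarity(x, y, [match, match, match], [False, True, True]))   # 0.5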
2.4.10 Selecting the Right Proximity Measure
Afewgeneralobservationsmaybehelpful.First,thetypeofproximitymeasureshouldfitthetypeofdata.Formanytypesofdense,continuousdata,metricdistancemeasuressuchasEuclideandistanceareoftenused.Proximitybetweencontinuousattributesismostoftenexpressedintermsof
differences,anddistancemeasuresprovideawell-definedwayofcombiningthesedifferencesintoanoverallproximitymeasure.Althoughattributescanhavedifferentscalesandbeofdifferingimportance,theseissuescanoftenbedealtwithasdescribedearlier,suchasnormalizationandweightingofattributes.
Forsparsedata,whichoftenconsistsofasymmetricattributes,wetypicallyemploysimilaritymeasuresthatignore0–0matches.Conceptually,thisreflectsthefactthat,forapairofcomplexobjects,similaritydependsonthenumberofcharacteristicstheybothshare,ratherthanthenumberofcharacteristicstheybothlack.Thecosine,Jaccard,andextendedJaccardmeasuresareappropriateforsuchdata.
Thereareothercharacteristicsofdatavectorsthatoftenneedtobeconsidered.Invariancetoscaling(multiplication)andtotranslation(addition)werepreviouslydiscussedwithrespecttoEuclideandistanceandthecosineandcorrelationmeasures.Thepracticalimplicationsofsuchconsiderationsarethat,forexample,cosineismoresuitableforsparsedocumentdatawhereonlyscalingisimportant,whilecorrelationworksbetterfortimeseries,wherebothscalingandtranslationareimportant.EuclideandistanceorothertypesofMinkowskidistancearemostappropriatewhentwodatavectorsaretomatchascloselyaspossibleacrossallcomponents(features).
Insomecases,transformationornormalizationofthedataisneededtoobtainapropersimilaritymeasure.Forinstance,timeseriescanhavetrendsorperiodicpatternsthatsignificantlyimpactsimilarity.Also,apropercomputationofsimilarityoftenrequiresthattimelagsbetakenintoaccount.Finally,twotimeseriesmaybesimilaronlyoverspecificperiodsoftime.Forexample,thereisastrongrelationshipbetweentemperatureandtheuseofnaturalgas,butonlyduringtheheatingseason.
Practicalconsiderationcanalsobeimportant.Sometimes,oneormoreproximitymeasuresarealreadyinuseinaparticularfield,andthus,otherswillhaveansweredthequestionofwhichproximitymeasuresshouldbeused.Othertimes,thesoftwarepackageorclusteringalgorithmbeingusedcandrasticallylimitthechoices.Ifefficiencyisaconcern,thenwemaywanttochooseaproximitymeasurethathasaproperty,suchasthetriangleinequality,thatcanbeusedtoreducethenumberofproximitycalculations.(SeeExercise25 .)
However,ifcommonpracticeorpracticalrestrictionsdonotdictateachoice,thentheproperchoiceofaproximitymeasurecanbeatime-consumingtaskthatrequirescarefulconsiderationofbothdomainknowledgeandthepurposeforwhichthemeasureisbeingused.Anumberofdifferentsimilaritymeasuresmayneedtobeevaluatedtoseewhichonesproduceresultsthatmakethemostsense.
2.5BibliographicNotesItisessentialtounderstandthenatureofthedatathatisbeinganalyzed,andatafundamentallevel,thisisthesubjectofmeasurementtheory.Inparticular,oneoftheinitialmotivationsfordefiningtypesofattributeswastobepreciseaboutwhichstatisticaloperationswerevalidforwhatsortsofdata.WehavepresentedtheviewofmeasurementtheorythatwasinitiallydescribedinaclassicpaperbyS.S.Stevens[112].(Tables2.2 and2.3 arederivedfromthosepresentedbyStevens[113].)Whilethisisthemostcommonviewandisreasonablyeasytounderstandandapply,thereis,ofcourse,muchmoretomeasurementtheory.Anauthoritativediscussioncanbefoundinathree-volumeseriesonthefoundationsofmeasurementtheory[88,94,114].Alsoofinterestisawide-rangingarticlebyHand[77],whichdiscussesmeasurementtheoryandstatistics,andisaccompaniedbycommentsfromotherresearchersinthefield.NumerouscritiquesandextensionsoftheapproachofStevenshavebeenmade[66,97,117].Finally,manybooksandarticlesdescribemeasurementissuesforparticularareasofscienceandengineering.
Dataqualityisabroadsubjectthatspanseverydisciplinethatusesdata.Discussionsofprecision,bias,accuracy,andsignificantfigurescanbefoundinmanyintroductoryscience,engineering,andstatisticstextbooks.Theviewofdataqualityas“fitnessforuse”isexplainedinmoredetailinthebookbyRedman[103].ThoseinterestedindataqualitymayalsobeinterestedinMIT’sInformationQuality(MITIQ)Program[95,118].However,theknowledgeneededtodealwithspecificdataqualityissuesinaparticulardomainisoftenbestobtainedbyinvestigatingthedataqualitypracticesofresearchersinthatfield.
Aggregationisalesswell-definedsubjectthanmanyotherpreprocessingtasks.However,aggregationisoneofthemaintechniquesusedbythedatabaseareaofOnlineAnalyticalProcessing(OLAP)[68,76,102].Therehasalsobeenrelevantworkintheareaofsymbolicdataanalysis(BockandDiday[64]).Oneofthegoalsinthisareaistosummarizetraditionalrecorddataintermsofsymbolicdataobjectswhoseattributesaremorecomplexthantraditionalattributes.Specifically,theseattributescanhavevaluesthataresetsofvalues(categories),intervals,orsetsofvalueswithweights(histograms).Anothergoalofsymbolicdataanalysisistobeabletoperformclustering,classification,andotherkindsofdataanalysisondatathatconsistsofsymbolicdataobjects.
Samplingisasubjectthathasbeenwellstudiedinstatisticsandrelatedfields.Manyintroductorystatisticsbooks,suchastheonebyLindgren[90],havesomediscussionaboutsampling,andentirebooksaredevotedtothesubject,suchastheclassictextbyCochran[67].AsurveyofsamplingfordataminingisprovidedbyGuandLiu[74],whileasurveyofsamplingfordatabasesisprovidedbyOlkenandRotem[98].Thereareanumberofotherdatamininganddatabase-relatedsamplingreferencesthatmaybeofinterest,includingpapersbyPalmerandFaloutsos[100],Provostetal.[101],Toivonen[115],andZakietal.[119].
Instatistics,thetraditionaltechniquesthathavebeenusedfordimensionalityreductionaremultidimensionalscaling(MDS)(BorgandGroenen[65],KruskalandUslaner[89])andprincipalcomponentanalysis(PCA)(Jolliffe[80]),whichissimilartosingularvaluedecomposition(SVD)(Demmel[70]).DimensionalityreductionisdiscussedinmoredetailinAppendixB.
Discretizationisatopicthathasbeenextensivelyinvestigatedindatamining.Someclassificationalgorithmsworkonlywithcategoricaldata,andassociationanalysisrequiresbinarydata,andthus,thereisasignificant
motivationtoinvestigatehowtobestbinarizeordiscretizecontinuousattributes.Forassociationanalysis,wereferthereadertoworkbySrikantandAgrawal[111],whilesomeusefulreferencesfordiscretizationintheareaofclassificationincludeworkbyDoughertyetal.[71],ElomaaandRousu[72],FayyadandIrani[73],andHussainetal.[78].
Featureselectionisanothertopicwellinvestigatedindatamining.AbroadcoverageofthistopicisprovidedinasurveybyMolinaetal.[96]andtwobooksbyLiuandMotada[91,92].OtherusefulpapersincludethosebyBlumandLangley[63],KohaviandJohn[87],andLiuetal.[93].
Itisdifficulttoprovidereferencesforthesubjectoffeaturetransformationsbecausepracticesvaryfromonedisciplinetoanother.Manystatisticsbookshaveadiscussionoftransformations,buttypicallythediscussionisrestrictedtoaparticularpurpose,suchasensuringthenormalityofavariableormakingsurethatvariableshaveequalvariance.Weoffertworeferences:Osborne[99]andTukey[116].
Whilewehavecoveredsomeofthemostcommonlyuseddistanceandsimilaritymeasures,therearehundredsofsuchmeasuresandmorearebeingcreatedallthetime.Aswithsomanyothertopicsinthischapter,manyofthesemeasuresarespecifictoparticularfields,e.g.,intheareaoftimeseriesseepapersbyKalpakisetal.[81]andKeoghandPazzani[83].Clusteringbooksprovidethebestgeneraldiscussions.Inparticular,seethebooksbyAnderberg[62],JainandDubes[79],KaufmanandRousseeuw[82],andSneathandSokal[109].
Information-basedmeasuresofsimilarityhavebecomemorepopularlatelydespitethecomputationaldifficultiesandexpenseofcalculatingthem.AgoodintroductiontoinformationtheoryisprovidedbyCoverandThomas[69].Computingthemutualinformationforcontinuousvariablescanbe
straightforward if they follow a well-known distribution, such as Gaussian. However, this is often not the case, and many techniques have been developed. As one example, the article by Khan et al. [85] compares various methods in the context of comparing short time series. See also the information and mutual information packages for R and MATLAB. Mutual information has been the subject of considerable recent attention due to a paper by Reshef et al. [104, 105] that introduced an alternative measure, albeit one based on mutual information, which was claimed to have superior properties. Although this approach had some early support, e.g., [110], others have pointed out various limitations [75, 86, 108].
Twopopularbooksonthetopicofkernelmethodsare[106]and[107].Thelatteralsohasawebsitewithlinkstokernel-relatedmaterials[84].Inaddition,manycurrentdatamining,machinelearning,andstatisticallearningtextbookshavesomematerialaboutkernelmethods.FurtherreferencesforkernelmethodsinthecontextofsupportvectormachineclassifiersareprovidedinthebibliographicNotesofSection4.9.4.
Bibliography[62]M.R.Anderberg.ClusterAnalysisforApplications.AcademicPress,New
York,December1973.
[63]A.BlumandP.Langley.SelectionofRelevantFeaturesandExamplesinMachineLearning.ArtificialIntelligence,97(1–2):245–271,1997.
[64]H.H.BockandE.Diday.AnalysisofSymbolicData:ExploratoryMethodsforExtractingStatisticalInformationfromComplexData(StudiesinClassification,DataAnalysis,andKnowledgeOrganization).Springer-VerlagTelos,January2000.
[65]I.BorgandP.Groenen.ModernMultidimensionalScaling—TheoryandApplications.Springer-Verlag,February1997.
[66]N.R.Chrisman.Rethinkinglevelsofmeasurementforcartography.CartographyandGeographicInformationSystems,25(4):231–242,1998.
[67]W.G.Cochran.SamplingTechniques.JohnWiley&Sons,3rdedition,July1977.
[68]E.F.Codd,S.B.Codd,andC.T.Smalley.ProvidingOLAP(On-lineAnalyticalProcessing)toUser-Analysts:AnITMandate.WhitePaper,E.F.CoddandAssociates,1993.
[69]T.M.CoverandJ.A.Thomas.Elementsofinformationtheory.JohnWiley&Sons,2012.
[70]J.W.Demmel.AppliedNumericalLinearAlgebra.SocietyforIndustrial&AppliedMathematics,September1997.
[71]J.Dougherty,R.Kohavi,andM.Sahami.SupervisedandUnsupervisedDiscretizationofContinuousFeatures.InProc.ofthe12thIntl.Conf.onMachineLearning,pages194–202,1995.
[72]T.ElomaaandJ.Rousu.GeneralandEfficientMultisplittingofNumericalAttributes.MachineLearning,36(3):201–244,1999.
[73]U.M.FayyadandK.B.Irani.Multi-intervaldiscretizationofcontinuousvaluedattributesforclassificationlearning.InProc.13thInt.JointConf.onArtificialIntelligence,pages1022–1027.MorganKaufman,1993.
[74]F.H.GaohuaGuandH.Liu.SamplingandItsApplicationinDataMining:ASurvey.TechnicalReportTRA6/00,NationalUniversityofSingapore,Singapore,2000.
[75]M.Gorfine,R.Heller,andY.Heller.CommentonDetectingnovelassociationsinlargedatasets.Unpublished(availableathttp://emotion.technion.ac.il/gorfinm/files/science6.pdfon11Nov.2012),2012.
[76]J.Gray,S.Chaudhuri,A.Bosworth,A.Layman,D.Reichart,M.Venkatrao,F.Pellow,andH.Pirahesh.DataCube:ARelationalAggregationOperatorGeneralizingGroup-By,Cross-Tab,andSub-Totals.JournalDataMiningandKnowledgeDiscovery,1(1):29–53,1997.
[77]D.J.Hand.StatisticsandtheTheoryofMeasurement.JournaloftheRoyalStatisticalSociety:SeriesA(StatisticsinSociety),159(3):445–492,1996.
[78]F.Hussain,H.Liu,C.L.Tan,andM.Dash.TRC6/99:Discretization:anenablingtechnique.Technicalreport,NationalUniversityofSingapore,Singapore,1999.
[79]A.K.JainandR.C.Dubes.AlgorithmsforClusteringData.PrenticeHallAdvancedReferenceSeries.PrenticeHall,March1988.
[80]I.T.Jolliffe.PrincipalComponentAnalysis.SpringerVerlag,2ndedition,October2002.
[81]K.Kalpakis,D.Gada,andV.Puttagunta.DistanceMeasuresforEffectiveClusteringofARIMATime-Series.InProc.ofthe2001IEEEIntl.Conf.onDataMining,pages273–280.IEEEComputerSociety,2001.
[82]L.KaufmanandP.J.Rousseeuw.FindingGroupsinData:AnIntroductiontoClusterAnalysis.WileySeriesinProbabilityandStatistics.JohnWileyandSons,NewYork,November1990.
[83]E.J.KeoghandM.J.Pazzani.Scalingupdynamictimewarpingfordataminingapplications.InKDD,pages285–289,2000.
[84]KernelMethodsforPatternAnalysisWebsite.http://www.kernel-methods.net/,2014.
[85]S.Khan,S.Bandyopadhyay,A.R.Ganguly,S.Saigal,D.J.EricksonIII,V.Protopopescu,andG.Ostrouchov.Relativeperformanceofmutualinformationestimationmethodsforquantifyingthedependenceamongshortandnoisydata.PhysicalReviewE,76(2):026209,2007.
[86]J.B.KinneyandG.S.Atwal.Equitability,mutualinformation,andthemaximalinformationcoefficient.ProceedingsoftheNationalAcademyofSciences,2014.
[87]R.KohaviandG.H.John.WrappersforFeatureSubsetSelection.ArtificialIntelligence,97(1–2):273–324,1997.
[88]D.Krantz,R.D.Luce,P.Suppes,andA.Tversky.FoundationsofMeasurements:Volume1:Additiveandpolynomialrepresentations.AcademicPress,NewYork,1971.
[89]J.B.KruskalandE.M.Uslaner.MultidimensionalScaling.SagePublications,August1978.
[90]B.W.Lindgren.StatisticalTheory.CRCPress,January1993.
[91]H.LiuandH.Motoda,editors.FeatureExtraction,ConstructionandSelection:ADataMiningPerspective.KluwerInternationalSeriesinEngineeringandComputerScience,453.KluwerAcademicPublishers,July1998.
[92]H.LiuandH.Motoda.FeatureSelectionforKnowledgeDiscoveryandDataMining.KluwerInternationalSeriesinEngineeringandComputerScience,454.KluwerAcademicPublishers,July1998.
[93]H.Liu,H.Motoda,andL.Yu.FeatureExtraction,Selection,andConstruction.InN.Ye,editor,TheHandbookofDataMining,pages22–41.LawrenceErlbaumAssociates,Inc.,Mahwah,NJ,2003.
[94]R.D.Luce,D.Krantz,P.Suppes,andA.Tversky.FoundationsofMeasurements:Volume3:Representation,Axiomatization,andInvariance.AcademicPress,NewYork,1990.
[95]MITInformationQuality(MITIQ)Program.http://mitiq.mit.edu/,2014.
[96]L.C.Molina,L.Belanche,andA.Nebot.FeatureSelectionAlgorithms:ASurveyandExperimentalEvaluation.InProc.ofthe2002IEEEIntl.Conf.onDataMining,2002.
[97]F.MostellerandJ.W.Tukey.Dataanalysisandregression:asecondcourseinstatistics.Addison-Wesley,1977.
[98]F.OlkenandD.Rotem.RandomSamplingfromDatabases—ASurvey.Statistics&Computing,5(1):25–42,March1995.
[99]J.Osborne.NotesontheUseofDataTransformations.PracticalAssessment,Research&Evaluation,28(6),2002.
[100]C.R.PalmerandC.Faloutsos.Densitybiasedsampling:Animprovedmethodfordataminingandclustering.ACMSIGMODRecord,29(2):82–92,2000.
[101]F.J.Provost,D.Jensen,andT.Oates.EfficientProgressiveSampling.InProc.ofthe5thIntl.Conf.onKnowledgeDiscoveryandDataMining,pages23–32,1999.
[102]R.RamakrishnanandJ.Gehrke.DatabaseManagementSystems.McGraw-Hill,3rdedition,August2002.
[103]T.C.Redman.DataQuality:TheFieldGuide.DigitalPress,January2001.
[104]D.Reshef,Y.Reshef,M.Mitzenmacher,andP.Sabeti.Equitabilityanalysisofthemaximalinformationcoefficient,withcomparisons.arXivpreprintarXiv:1301.6314,2013.
[105]D.N.Reshef,Y.A.Reshef,H.K.Finucane,S.R.Grossman,G.McVean,P.J.Turnbaugh,E.S.Lander,M.Mitzenmacher,andP.C.
Sabeti.Detectingnovelassociationsinlargedatasets.science,334(6062):1518–1524,2011.
[106]B.SchölkopfandA.J.Smola.Learningwithkernels:supportvectormachines,regularization,optimization,andbeyond.MITpress,2002.
[107]J.Shawe-TaylorandN.Cristianini.Kernelmethodsforpatternanalysis.Cambridgeuniversitypress,2004.
[108]N.SimonandR.Tibshirani.Commenton”DetectingNovelAssociationsInLargeDataSets”byReshefEtAl,ScienceDec16,2011.arXivpreprintarXiv:1401.7645,2014.
[109]P.H.A.SneathandR.R.Sokal.NumericalTaxonomy.Freeman,SanFrancisco,1971.
[110]T.Speed.Acorrelationforthe21stcentury.Science,334(6062):1502–1503,2011.
[111]R.SrikantandR.Agrawal.MiningQuantitativeAssociationRulesinLargeRelationalTables.InProc.of1996ACM-SIGMODIntl.Conf.onManagementofData,pages1–12,Montreal,Quebec,Canada,August1996.
[112]S.S.Stevens.OntheTheoryofScalesofMeasurement.Science,103(2684):677–680,June1946.
[113]S.S.Stevens.Measurement.InG.M.Maranell,editor,Scaling:ASourcebookforBehavioralScientists,pages22–41.AldinePublishingCo.,Chicago,1974.
[114]P.Suppes,D.Krantz,R.D.Luce,andA.Tversky.FoundationsofMeasurements:Volume2:Geometrical,Threshold,andProbabilisticRepresentations.AcademicPress,NewYork,1989.
[115]H.Toivonen.SamplingLargeDatabasesforAssociationRules.InVLDB96,pages134–145.MorganKaufman,September1996.
[116]J.W.Tukey.OntheComparativeAnatomyofTransformations.AnnalsofMathematicalStatistics,28(3):602–632,September1957.
[117]P.F.VellemanandL.Wilkinson.Nominal,ordinal,interval,andratiotypologiesaremisleading.TheAmericanStatistician,47(1):65–72,1993.
[118]R.Y.Wang,M.Ziad,Y.W.Lee,andY.R.Wang.DataQuality.TheKluwerInternationalSeriesonAdvancesinDatabaseSystems,Volume23.KluwerAcademicPublishers,January2001.
[119]M.J.Zaki,S.Parthasarathy,W.Li,andM.Ogihara.EvaluationofSamplingforDataMiningofAssociationRules.TechnicalReportTR617,RensselaerPolytechnicInstitute,1996.
2.6 Exercises

1. In the initial example of Chapter 2, the statistician says, "Yes, fields 2 and 3 are basically the same." Can you tell from the three lines of sample data that are shown why she says that?
2.Classifythefollowingattributesasbinary,discrete,orcontinuous.Alsoclassifythemasqualitative(nominalorordinal)orquantitative(intervalorratio).Somecasesmayhavemorethanoneinterpretation,sobrieflyindicateyourreasoningifyouthinktheremaybesomeambiguity.
Example:Ageinyears.Answer:Discrete,quantitative,ratio
a. TimeintermsofAMorPM.
b. Brightnessasmeasuredbyalightmeter.
c. Brightnessasmeasuredbypeople’sjudgments.
d. Anglesasmeasuredindegreesbetween0and360.
e. Bronze,Silver,andGoldmedalsasawardedattheOlympics.
f. Heightabovesealevel.
g. Numberofpatientsinahospital.
h. ISBNnumbersforbooks.(LookuptheformatontheWeb.)
i. Abilitytopasslightintermsofthefollowingvalues:opaque,translucent,transparent.
j. Militaryrank.
k. Distancefromthecenterofcampus.
l. Densityofasubstanceingramspercubiccentimeter.
m. Coatchecknumber.(Whenyouattendanevent,youcanoftengiveyourcoattosomeonewho,inturn,givesyouanumberthatyoucanusetoclaimyourcoatwhenyouleave.)
3.Youareapproachedbythemarketingdirectorofalocalcompany,whobelievesthathehasdevisedafoolproofwaytomeasurecustomersatisfaction.Heexplainshisschemeasfollows:“It’ssosimplethatIcan’tbelievethatnoonehasthoughtofitbefore.Ijustkeeptrackofthenumberofcustomercomplaintsforeachproduct.Ireadinadataminingbookthatcountsareratioattributes,andso,mymeasureofproductsatisfactionmustbearatioattribute.ButwhenIratedtheproductsbasedonmynewcustomersatisfactionmeasureandshowedthemtomyboss,hetoldmethatIhadoverlookedtheobvious,andthatmymeasurewasworthless.Ithinkthathewasjustmadbecauseourbestsellingproducthadtheworstsatisfactionsinceithadthemostcomplaints.Couldyouhelpmesethimstraight?”
a. Whoisright,themarketingdirectororhisboss?Ifyouanswered,hisboss,whatwouldyoudotofixthemeasureofsatisfaction?
b. Whatcanyousayabouttheattributetypeoftheoriginalproductsatisfactionattribute?
4.Afewmonthslater,youareagainapproachedbythesamemarketingdirectorasinExercise3 .Thistime,hehasdevisedabetterapproachtomeasuretheextenttowhichacustomerprefersoneproductoverothersimilarproducts.Heexplains,“Whenwedevelopnewproducts,wetypicallycreateseveralvariationsandevaluatewhichonecustomersprefer.Ourstandardprocedureistogiveourtestsubjectsalloftheproductvariationsatonetimeandthenaskthemtoranktheproductvariationsinorderofpreference.However,ourtestsubjectsareveryindecisive,especiallywhenthereare
morethantwoproducts.Asaresult,testingtakesforever.Isuggestedthatweperformthecomparisonsinpairsandthenusethesecomparisonstogettherankings.Thus,ifwehavethreeproductvariations,wehavethecustomerscomparevariations1and2,then2and3,andfinally3and1.Ourtestingtimewithmynewprocedureisathirdofwhatitwasfortheoldprocedure,buttheemployeesconductingthetestscomplainthattheycannotcomeupwithaconsistentrankingfromtheresults.Andmybosswantsthelatestproductevaluations,yesterday.Ishouldalsomentionthathewasthepersonwhocameupwiththeoldproductevaluationapproach.Canyouhelpme?”
a. Isthemarketingdirectorintrouble?Willhisapproachworkforgeneratinganordinalrankingoftheproductvariationsintermsofcustomerpreference?Explain.
b. Isthereawaytofixthemarketingdirector’sapproach?Moregenerally,whatcanyousayabouttryingtocreateanordinalmeasurementscalebasedonpairwisecomparisons?
c. Fortheoriginalproductevaluationscheme,theoverallrankingsofeachproductvariationarefoundbycomputingitsaverageoveralltestsubjects.Commentonwhetheryouthinkthatthisisareasonableapproach.Whatotherapproachesmightyoutake?
5.Canyouthinkofasituationinwhichidentificationnumberswouldbeusefulforprediction?
6.Aneducationalpsychologistwantstouseassociationanalysistoanalyzetestresults.Thetestconsistsof100questionswithfourpossibleanswerseach.
a. Howwouldyouconvertthisdataintoaformsuitableforassociationanalysis?
b. Inparticular,whattypeofattributeswouldyouhaveandhowmanyofthemarethere?
7.Whichofthefollowingquantitiesislikelytoshowmoretemporalautocorrelation:dailyrainfallordailytemperature?Why?
8.Discusswhyadocument-termmatrixisanexampleofadatasetthathasasymmetricdiscreteorasymmetriccontinuousfeatures.
9.Manysciencesrelyonobservationinsteadof(orinadditionto)designedexperiments.Comparethedataqualityissuesinvolvedinobservationalsciencewiththoseofexperimentalscienceanddatamining.
10.Discussthedifferencebetweentheprecisionofameasurementandthetermssingleanddoubleprecision,astheyareusedincomputerscience,typicallytorepresentfloating-pointnumbersthatrequire32and64bits,respectively.
11.Giveatleasttwoadvantagestoworkingwithdatastoredintextfilesinsteadofinabinaryformat.
12.Distinguishbetweennoiseandoutliers.Besuretoconsiderthefollowingquestions.
a. Isnoiseeverinterestingordesirable?Outliers?
b. Cannoiseobjectsbeoutliers?
c. Arenoiseobjectsalwaysoutliers?
d. Areoutliersalwaysnoiseobjects?
e. Cannoisemakeatypicalvalueintoanunusualone,orviceversa?
Algorithm 2.3 Algorithm for finding k-nearest neighbors.
1: for i = 1 to number of data objects do
2:   Find the distances of the ith object to all other objects.
3:   Sort these distances in decreasing order. (Keep track of which object is associated with each distance.)
4:   return the objects associated with the first k distances of the sorted list
5: end for

13. Consider the problem of finding the K-nearest neighbors of a data object. A programmer designs Algorithm 2.3 for this task.

a. Describe the potential problems with this algorithm if there are duplicate objects in the data set. Assume the distance function will return a distance of 0 only for objects that are the same.

b. How would you fix this problem?

14. The following attributes are measured for members of a herd of Asian elephants: weight, height, tusk length, trunk length, and ear area. Based on these measurements, what sort of proximity measure from Section 2.4 would you use to compare or group these elephants? Justify your answer and explain any special circumstances.

15. You are given a set of m objects that is divided into k groups, where the ith group is of size m_i. If the goal is to obtain a sample of size n < m, what is the difference between the following two sampling schemes? (Assume sampling with replacement.)
a. We randomly select n × m_i / m elements from each group.

b. We randomly select n elements from the data set, without regard for the group to which an object belongs.

16. Consider a document-term matrix, where tf_ij is the frequency of the ith word (term) in the jth document and m is the number of documents. Consider the variable transformation that is defined by

tf′_ij = tf_ij × log(m / df_i),    (2.31)

where df_i is the number of documents in which the ith term appears, which is known as the document frequency of the term. This transformation is known as the inverse document frequency transformation.

a. What is the effect of this transformation if a term occurs in one document? In every document?

b. What might be the purpose of this transformation?

17. Assume that we apply a square root transformation to a ratio attribute x to obtain the new attribute x*. As part of your analysis, you identify an interval (a, b) in which x* has a linear relationship to another attribute y.

a. What is the corresponding interval (A, B) in terms of x?

b. Give an equation that relates y to x.

18. This exercise compares and contrasts some similarity and distance measures.

a. For binary data, the L1 distance corresponds to the Hamming distance; that is, the number of bits that are different between two binary vectors. The Jaccard similarity is a measure of the similarity between two binary vectors. Compute the Hamming distance and the Jaccard similarity between the following two binary vectors.

x = 0101010001
y = 0100011000

b. Which approach, Jaccard or Hamming distance, is more similar to the Simple Matching Coefficient, and which approach is more similar to the cosine measure? Explain. (Note: The Hamming measure is a distance, while the other three measures are similarities, but don't let this confuse you.)

c. Suppose that you are comparing how similar two organisms of different species are in terms of the number of genes they share. Describe which measure, Hamming or Jaccard, you think would be more appropriate for comparing the genetic makeup of two organisms. Explain. (Assume that each animal is represented as a binary vector, where each attribute is 1 if a particular gene is present in the organism and 0 otherwise.)

d. If you wanted to compare the genetic makeup of two organisms of the same species, e.g., two human beings, would you use the Hamming distance, the Jaccard coefficient, or a different measure of similarity or distance? Explain. (Note that two human beings share >99.9% of the same genes.)

19. For the following vectors, x and y, calculate the indicated similarity or distance measures.

a. x = (1, 1, 1, 1), y = (2, 2, 2, 2); cosine, correlation, Euclidean

b. x = (0, 1, 0, 1), y = (1, 0, 1, 0); cosine, correlation, Euclidean, Jaccard

c. x = (0, −1, 0, 1), y = (1, 0, −1, 0); cosine, correlation, Euclidean

d. x = (1, 1, 0, 1, 0, 1), y = (1, 1, 1, 0, 0, 1); cosine, correlation, Jaccard

e. x = (2, −1, 0, 2, 0, −3), y = (−1, 1, −1, 0, 0, −1); cosine, correlation

20. Here, we further explore the cosine and correlation measures.

a. What is the range of values possible for the cosine measure?

b. If two objects have a cosine measure of 1, are they identical? Explain.

c. What is the relationship of the cosine measure to correlation, if any? (Hint: Look at statistical measures such as mean and standard deviation in cases where cosine and correlation are the same and different.)

d. Figure 2.22(a) shows the relationship of the cosine measure to Euclidean distance for 100,000 randomly generated points that have been normalized to have an L2 length of 1. What general observation can you make about the relationship between Euclidean distance and cosine similarity when vectors have an L2 norm of 1?

Figure 2.22. Graphs for Exercise 20.

e. Figure 2.22(b) shows the relationship of correlation to Euclidean distance for 100,000 randomly generated points that have been standardized to have a mean of 0 and a standard deviation of 1. What general observation can you make about the relationship between Euclidean distance and correlation when the vectors have been standardized to have a mean of 0 and a standard deviation of 1?

f. Derive the mathematical relationship between cosine similarity and Euclidean distance when each data object has an L2 length of 1.

g. Derive the mathematical relationship between correlation and Euclidean distance when each data point has been standardized by subtracting its mean and dividing by its standard deviation.
21. Show that the set difference metric given by

d(A, B) = size(A − B) + size(B − A)    (2.32)

satisfies the metric axioms given on page 77. A and B are sets and A − B is the set difference.

22. Discuss how you might map correlation values from the interval [−1, 1] to the interval [0, 1]. Note that the type of transformation that you use might depend on the application that you have in mind. Thus, consider two applications: clustering time series and predicting the behavior of one time series given another.

23. Given a similarity measure with values in the interval [0, 1], describe two ways to transform this similarity value into a dissimilarity value in the interval [0, ∞].

24. Proximity is typically defined between a pair of objects.
a. Definetwowaysinwhichyoumightdefinetheproximityamongagroupofobjects.
b. HowmightyoudefinethedistancebetweentwosetsofpointsinEuclideanspace?
c. Howmightyoudefinetheproximitybetweentwosetsofdataobjects?(Makenoassumptionaboutthedataobjects,exceptthataproximitymeasureisdefinedbetweenanypairofobjects.)
25. You are given a set of points S in Euclidean space, as well as the distance of each point in S to a point x. (It does not matter if x ∈ S.)

a. If the goal is to find all points within a specified distance ε of point y, y ≠ x, explain how you could use the triangle inequality and the already calculated distances to x to potentially reduce the number of distance calculations necessary. Hint: The triangle inequality, d(x, z) ≤ d(x, y) + d(y, z), can be rewritten as d(x, y) ≥ d(x, z) − d(y, z).

b. In general, how would the distance between x and y affect the number of distance calculations?

c. Suppose that you can find a small subset of points S′ from the original data set, such that every point in the data set is within a specified distance ε of at least one of the points in S′, and that you also have the pairwise distance matrix for S′. Describe a technique that uses this information to compute, with a minimum of distance calculations, the set of all points within a distance of β of a specified point from the data set.

26. Show that 1 minus the Jaccard similarity is a distance measure between two data objects, x and y, that satisfies the metric axioms given on page 77. Specifically, d(x, y) = 1 − J(x, y).

27. Show that the distance measure defined as the angle between two data vectors, x and y, satisfies the metric axioms given on page 77. Specifically, d(x, y) = arccos(cos(x, y)).

28. Explain why computing the proximity between two attributes is often simpler than computing the similarity between two objects.
3Classification:BasicConceptsandTechniques
Humanshaveaninnateabilitytoclassifythingsintocategories,e.g.,mundanetaskssuchasfilteringspamemailmessagesormorespecializedtaskssuchasrecognizingcelestialobjectsintelescopeimages(seeFigure3.1 ).Whilemanualclassificationoftensufficesforsmallandsimpledatasetswithonlyafewattributes,largerandmorecomplexdatasetsrequireanautomatedsolution.
Figure3.1.ClassificationofgalaxiesfromtelescopeimagestakenfromtheNASAwebsite.
Thischapterintroducesthebasicconceptsofclassificationanddescribessomeofitskeyissuessuchasmodeloverfitting,modelselection,andmodelevaluation.Whilethesetopicsareillustratedusingaclassificationtechniqueknownasdecisiontreeinduction,mostofthediscussioninthischapterisalsoapplicabletootherclassificationtechniques,manyofwhicharecoveredinChapter4 .
3.1 Basic Concepts

Figure 3.2 illustrates the general idea behind classification. The data for a classification task consists of a collection of instances (records). Each such instance is characterized by the tuple (x, y), where x is the set of attribute values that describe the instance and y is the class label of the instance. The attribute set x can contain attributes of any type, while the class label y must be categorical.
Figure3.2.Aschematicillustrationofaclassificationtask.
A classification model is an abstract representation of the relationship between the attribute set and the class label. As will be seen in the next two chapters, the model can be represented in many ways, e.g., as a tree, a probability table, or simply, a vector of real-valued parameters. More formally, we can express it mathematically as a target function f that takes as input the attribute set x and produces an output corresponding to the predicted class label. The model is said to classify an instance (x, y) correctly if f(x) = y.

Table 3.1 shows examples of attribute sets and class labels for various classification tasks. Spam filtering and tumor identification are examples of binary classification problems, in which each data instance can be categorized into one of two classes. If the number of classes is larger than 2, as in the galaxy classification example, then it is called a multiclass classification problem.
Table 3.1. Examples of classification tasks.
Task | Attribute set | Class label
Spam filtering | Features extracted from email message header and content | spam or non-spam
Tumor identification | Features extracted from magnetic resonance imaging (MRI) scans | malignant or benign
Galaxy classification | Features extracted from telescope images | elliptical, spiral, or irregular-shaped
Weillustratethebasicconceptsofclassificationinthischapterwiththefollowingtwoexamples.
Example 3.1 (Vertebrate Classification). Table 3.2 shows a sample data set for classifying vertebrates into mammals, reptiles, birds, fishes, and amphibians. The attribute set includes characteristics of the vertebrate such as its body temperature, skin cover, and ability to fly. The data set can also be used for a binary classification task such as mammal classification, by grouping the reptiles, birds, fishes, and amphibians into a single category called non-mammals.
Table 3.2. A sample data set for the vertebrate classification problem.
Vertebrate Name | Body Temperature | Skin Cover | Gives Birth | Aquatic Creature | Aerial Creature | Has Legs | Hibernates | Class Label
human | warm-blooded | hair | yes | no | no | yes | no | mammal
python | cold-blooded | scales | no | no | no | no | yes | reptile
salmon | cold-blooded | scales | no | yes | no | no | no | fish
whale | warm-blooded | hair | yes | yes | no | no | no | mammal
frog | cold-blooded | none | no | semi | no | yes | yes | amphibian
komodo dragon | cold-blooded | scales | no | no | no | yes | no | reptile
bat | warm-blooded | hair | yes | no | yes | yes | yes | mammal
pigeon | warm-blooded | feathers | no | no | yes | yes | no | bird
cat | warm-blooded | fur | yes | no | no | yes | no | mammal
leopard shark | cold-blooded | scales | yes | yes | no | no | no | fish
turtle | cold-blooded | scales | no | semi | no | yes | no | reptile
penguin | warm-blooded | feathers | no | semi | no | yes | no | bird
porcupine | warm-blooded | quills | yes | no | no | yes | yes | mammal
eel | cold-blooded | scales | no | yes | no | no | no | fish
salamander | cold-blooded | none | no | semi | no | yes | yes | amphibian

Example 3.2 (Loan Borrower Classification). Consider the problem of predicting whether a loan borrower will repay the loan or default on the loan payments. The data set used to build the classification model is shown in Table 3.3. The attribute set includes personal information of the borrower such as marital status and annual income, while the class label indicates whether the borrower had defaulted on the loan payments.
Table3.3.Asampledatafortheloanborrowerclassificationproblem.
ID HomeOwner MaritalStatus AnnualIncome Defaulted?
1 Yes Single 125000 No
2 No Married 100000 No
3 No Single 70000 No
4 Yes Married 120000 No
5 No Divorced 95000 Yes
6 No Single 60000 No
7 Yes Divorced 220000 No
8 No Single 85000 Yes
9 No Married 75000 No
10 No Single 90000 Yes
Aclassificationmodelservestwoimportantrolesindatamining.First,itisusedasapredictivemodeltoclassifypreviouslyunlabeledinstances.Agoodclassificationmodelmustprovideaccuratepredictionswithafastresponsetime.Second,itservesasadescriptivemodeltoidentifythecharacteristicsthatdistinguishinstancesfromdifferentclasses.Thisisparticularlyusefulforcriticalapplications,suchasmedicaldiagnosis,whereit
isinsufficienttohaveamodelthatmakesapredictionwithoutjustifyinghowitreachessuchadecision.
For example, a classification model induced from the vertebrate data set shown in Table 3.2 can be used to predict the class label of the following vertebrate:

Vertebrate Name | Body Temperature | Skin Cover | Gives Birth | Aquatic Creature | Aerial Creature | Has Legs | Hibernates | Class Label
gila monster | cold-blooded | scales | no | no | no | yes | yes | ?

In addition, it can be used as a descriptive model to help determine characteristics that define a vertebrate as a mammal, a reptile, a bird, a fish, or an amphibian. For example, the model may identify mammals as warm-blooded vertebrates that give birth to their young.

There are several points worth noting regarding the previous example. First, although all the attributes shown in Table 3.2 are qualitative, there are no restrictions on the type of attributes that can be used as predictor variables. The class label, on the other hand, must be of nominal type. This distinguishes classification from other predictive modeling tasks such as regression, where the predicted value is often quantitative. More information about regression can be found in Appendix D.

Another point worth noting is that not all attributes may be relevant to the classification task. For example, the average length or weight of a vertebrate may not be useful for classifying mammals, as these attributes can show the same value for both mammals and non-mammals. Such an attribute is typically discarded during preprocessing. The remaining attributes might not be able to distinguish the classes by themselves, and thus, must be used in concert with other attributes. For instance, the Body Temperature attribute is insufficient to distinguish mammals from other vertebrates. When it is used together with Gives Birth, the classification of mammals improves significantly. However, when additional attributes, such as Skin Cover, are included, the model becomes overly specific and no longer covers all mammals. Finding the optimal combination of attributes that best discriminates instances from different classes is the key challenge in building classification models.
3.2 General Framework for Classification
Classification is the task of assigning labels to unlabeled data instances, and a classifier is used to perform such a task. A classifier is typically described in terms of a model, as illustrated in the previous section. The model is created using a given set of instances, known as the training set, which contains attribute values as well as class labels for each instance. The systematic approach for learning a classification model given a training set is known as a learning algorithm. The process of using a learning algorithm to build a classification model from the training data is known as induction. This process is also often described as "learning a model" or "building a model." The process of applying a classification model to unseen test instances to predict their class labels is known as deduction. Thus, the process of classification involves two steps: applying a learning algorithm to training data to learn a model, and then applying the model to assign labels to unlabeled instances. Figure 3.3 illustrates the general framework for classification.
Figure 3.3. General framework for building a classification model.
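To make the induction and deduction steps concrete, the following is a minimal sketch in Python, assuming the scikit-learn and pandas libraries are available (they are not part of the text): a decision tree model is induced from the training set of Table 3.3 and then applied to a hypothetical unlabeled instance.

```python
# Minimal sketch of the framework in Figure 3.3 (our illustration, not the book's code).
# Assumes pandas and scikit-learn are installed.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Training set from Table 3.3: attribute values plus class labels.
train = pd.DataFrame({
    "Home Owner":     ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"],
    "Marital Status": ["Single", "Married", "Single", "Married", "Divorced",
                       "Single", "Divorced", "Single", "Married", "Single"],
    "Annual Income":  [125000, 100000, 70000, 120000, 95000,
                       60000, 220000, 85000, 75000, 90000],
    "Defaulted":      ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
})

X = pd.get_dummies(train.drop(columns="Defaulted"))   # one-hot encode qualitative attributes
y = train["Defaulted"]

model = DecisionTreeClassifier().fit(X, y)            # induction: learn a model from the training set

# Deduction: apply the learned model to a previously unseen, unlabeled instance (hypothetical).
test = pd.DataFrame({"Home Owner": ["No"], "Marital Status": ["Married"], "Annual Income": [120000]})
test_X = pd.get_dummies(test).reindex(columns=X.columns, fill_value=0)
print(model.predict(test_X))
```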
A classification technique refers to a general approach to classification, e.g., the decision tree technique that we will study in this chapter. This classification technique, like most others, consists of a family of related models and a number of algorithms for learning these models. In Chapter 4, we will study additional classification techniques, including neural networks and support vector machines.

A couple of notes on terminology. First, the terms "classifier" and "model" are often taken to be synonymous. If a classification technique builds a single, global model, then this is fine. However, while every model defines a classifier, not every classifier is defined by a single model. Some classifiers, such as k-nearest neighbor classifiers, do not build an explicit model (Section 4.3), while other classifiers, such as ensemble classifiers, combine the output of a collection of models (Section 4.10). Second, the term "classifier" is often used in a more general sense to refer to a classification technique. Thus, for example, "decision tree classifier" can refer to the decision tree classification technique or a specific classifier built using that technique. Fortunately, the meaning of "classifier" is usually clear from the context.
In the general framework shown in Figure 3.3, the induction and deduction steps should be performed separately. In fact, as will be discussed later in Section 3.6, the training and test sets should be independent of each other to ensure that the induced model can accurately predict the class labels of instances it has never encountered before. Models that deliver such predictive insights are said to have good generalization performance. The performance of a model (classifier) can be evaluated by comparing the predicted labels against the true labels of instances. This information can be summarized in a table called a confusion matrix. Table 3.4 depicts the confusion matrix for a binary classification problem. Each entry $f_{ij}$ denotes the number of instances from class i predicted to be of class j. For example, $f_{01}$ is the number of instances from class 0 incorrectly predicted as class 1. The number of correct predictions made by the model is $(f_{11} + f_{00})$ and the number of incorrect predictions is $(f_{10} + f_{01})$.

Table 3.4. Confusion matrix for a binary classification problem.

                             Predicted Class
                          Class = 1   Class = 0
Actual Class  Class = 1     f11         f10
              Class = 0     f01         f00

Although a confusion matrix provides the information needed to determine how well a classification model performs, summarizing this information into a single number makes it more convenient to compare the relative performance of different models. This can be done using an evaluation metric such as accuracy, which is computed in the following way:

$$\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}. \tag{3.1}$$

For binary classification problems, the accuracy of a model is given by

$$\text{Accuracy} = \frac{f_{11} + f_{00}}{f_{11} + f_{10} + f_{01} + f_{00}}. \tag{3.2}$$

Error rate is another related metric, which is defined as follows for binary classification problems:

$$\text{Error rate} = \frac{\text{Number of wrong predictions}}{\text{Total number of predictions}} = \frac{f_{10} + f_{01}}{f_{11} + f_{10} + f_{01} + f_{00}}. \tag{3.3}$$

The learning algorithms of most classification techniques are designed to learn models that attain the highest accuracy, or equivalently, the lowest error rate when applied to the test set. We will revisit the topic of model evaluation in Section 3.6.
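As a quick illustration of Equations 3.1-3.3, the following short Python sketch (ours, not from the text) tallies the confusion-matrix entries for a small set of hypothetical binary predictions and computes the accuracy and error rate from them.

```python
# Count the confusion-matrix entries f_ij and evaluate Equations 3.2 and 3.3.
def confusion_counts(y_true, y_pred):
    """Return (f11, f10, f01, f00) for binary labels 1 and 0."""
    f11 = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    f10 = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    f01 = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    f00 = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return f11, f10, f01, f00

# Hypothetical true and predicted labels for eight test instances.
y_true = [1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1, 1, 0, 1]

f11, f10, f01, f00 = confusion_counts(y_true, y_pred)
total = f11 + f10 + f01 + f00
accuracy = (f11 + f00) / total        # Equation 3.2
error_rate = (f10 + f01) / total      # Equation 3.3
print(accuracy, error_rate)           # 0.75 0.25
```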
3.3 Decision Tree Classifier
This section introduces a simple classification technique known as the decision tree classifier. To illustrate how a decision tree works, consider the classification problem of distinguishing mammals from non-mammals using the vertebrate data set shown in Table 3.2. Suppose a new species is discovered by scientists. How can we tell whether it is a mammal or a non-mammal? One approach is to pose a series of questions about the characteristics of the species. The first question we may ask is whether the species is cold- or warm-blooded. If it is cold-blooded, then it is definitely not a mammal. Otherwise, it is either a bird or a mammal. In the latter case, we need to ask a follow-up question: Do the females of the species give birth to their young? Those that do give birth are definitely mammals, while those that do not are likely to be non-mammals (with the exception of egg-laying mammals such as the platypus and spiny anteater).
The previous example illustrates how we can solve a classification problem by asking a series of carefully crafted questions about the attributes of the test instance. Each time we receive an answer, we could ask a follow-up question until we can conclusively decide on its class label. The series of questions and their possible answers can be organized into a hierarchical structure called a decision tree. Figure 3.4 shows an example of the decision tree for the mammal classification problem. The tree has three types of nodes:

A root node, with no incoming links and zero or more outgoing links.
Internal nodes, each of which has exactly one incoming link and two or more outgoing links.
Leaf or terminal nodes, each of which has exactly one incoming link and no outgoing links.

Every leaf node in the decision tree is associated with a class label. The non-terminal nodes, which include the root and internal nodes, contain attribute test conditions that are typically defined using a single attribute. Each possible outcome of the attribute test condition is associated with exactly one child of this node. For example, the root node of the tree shown in Figure 3.4 uses the attribute Body Temperature to define an attribute test condition that has two outcomes, warm and cold, resulting in two child nodes.
Figure 3.4. A decision tree for the mammal classification problem.
Given a decision tree, classifying a test instance is straightforward. Starting from the root node, we apply its attribute test condition and follow the appropriate branch based on the outcome of the test. This will lead us either to another internal node, for which a new attribute test condition is applied, or to a leaf node. Once a leaf node is reached, we assign the class label associated with the node to the test instance. As an illustration, Figure 3.5 traces the path used to predict the class label of a flamingo. The path terminates at a leaf node labeled as Non-mammals.

Figure 3.5. Classifying an unlabeled vertebrate. The dashed lines represent the outcomes of applying various attribute test conditions on the unlabeled vertebrate. The vertebrate is eventually assigned to the Non-mammals class.
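The traversal procedure just described can be sketched in a few lines of Python. The nested-dictionary encoding below is our own rendering of the mammal-classification tree of Figure 3.4, not code from the text.

```python
# Decision tree of Figure 3.4 encoded as nested dictionaries:
# internal nodes carry an attribute test, leaves carry a class label.
tree = {
    "attribute": "Body Temperature",
    "children": {
        "cold": {"label": "Non-mammals"},
        "warm": {
            "attribute": "Gives Birth",
            "children": {
                "no": {"label": "Non-mammals"},
                "yes": {"label": "Mammals"},
            },
        },
    },
}

def classify(node, instance):
    """Follow attribute test outcomes from the root until a leaf is reached."""
    while "label" not in node:
        outcome = instance[node["attribute"]]
        node = node["children"][outcome]
    return node["label"]

flamingo = {"Body Temperature": "warm", "Gives Birth": "no"}
print(classify(tree, flamingo))   # Non-mammals
```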
3.3.1 A Basic Algorithm to Build a Decision Tree

Many possible decision trees can be constructed from a particular data set. While some trees are better than others, finding an optimal one is computationally expensive due to the exponential size of the search space. Efficient algorithms have been developed to induce a reasonably accurate, albeit suboptimal, decision tree in a reasonable amount of time. These algorithms usually employ a greedy strategy to grow the decision tree in a top-down fashion by making a series of locally optimal decisions about which attribute to use when partitioning the training data. One of the earliest methods is Hunt's algorithm, which is the basis for many current implementations of decision tree classifiers, including ID3, C4.5, and CART. This subsection presents Hunt's algorithm and describes some of the design issues that must be considered when building a decision tree.
Hunt's Algorithm
In Hunt's algorithm, a decision tree is grown in a recursive fashion. The tree initially contains a single root node that is associated with all the training instances. If a node is associated with instances from more than one class, it is expanded using an attribute test condition that is determined using a splitting criterion. A child leaf node is created for each outcome of the attribute test condition and the instances associated with the parent node are distributed to the children based on the test outcomes. This node expansion step can then be recursively applied to each child node, as long as it has labels of more than one class. If all the instances associated with a leaf node have identical class labels, then the node is not expanded any further. Each leaf node is assigned a class label that occurs most frequently in the training instances associated with the node.
To illustrate how the algorithm works, consider the training set shown in Table 3.3 for the loan borrower classification problem. Suppose we apply Hunt's algorithm to fit the training data. The tree initially contains only a single leaf node as shown in Figure 3.6(a). This node is labeled as Defaulted = No, since the majority of the borrowers did not default on their loan payments. The training error of this tree is 30%, as three out of the ten training instances have the class label Defaulted = Yes. The leaf node can therefore be further expanded because it contains training instances from more than one class.

Figure 3.6. Hunt's algorithm for building decision trees.

Let Home Owner be the attribute chosen to split the training instances. The justification for choosing this attribute as the attribute test condition will be discussed later. The resulting binary split on the Home Owner attribute is shown in Figure 3.6(b). All the training instances for which Home Owner = Yes are propagated to the left child of the root node and the rest are propagated to the right child. Hunt's algorithm is then recursively applied to each child. The left child becomes a leaf node labeled Defaulted = No, since all instances associated with this node have identical class labels. The right child has instances from each class label. Hence, we split it further. The resulting subtrees after recursively expanding the right child are shown in Figures 3.6(c) and (d).
Hunt's algorithm, as described above, makes some simplifying assumptions that are often not true in practice. In the following, we describe these assumptions and briefly discuss some of the possible ways for handling them.

1. Some of the child nodes created in Hunt's algorithm can be empty if none of the training instances have the particular attribute values. One way to handle this is by declaring each of them as a leaf node with a class label that occurs most frequently among the training instances associated with their parent nodes.

2. If all training instances associated with a node have identical attribute values but different class labels, it is not possible to expand this node any further. One way to handle this case is to declare it a leaf node and assign it the class label that occurs most frequently in the training instances associated with this node.
Design Issues of Decision Tree Induction
Hunt's algorithm is a generic procedure for growing decision trees in a greedy fashion. To implement the algorithm, there are two key design issues that must be addressed.

1. What is the splitting criterion? At each recursive step, an attribute must be selected to partition the training instances associated with a node into smaller subsets associated with its child nodes. The splitting criterion determines which attribute is chosen as the test condition and how the training instances should be distributed to the child nodes. This will be discussed in Sections 3.3.2 and 3.3.3.

2. What is the stopping criterion? The basic algorithm stops expanding a node only when all the training instances associated with the node have the same class labels or have identical attribute values. Although these conditions are sufficient, there are reasons to stop expanding a node much earlier even if the leaf node contains training instances from more than one class. This process is called early termination and the condition used to determine when a node should be stopped from expanding is called a stopping criterion. The advantages of early termination are discussed in Section 3.4.
3.3.2 Methods for Expressing Attribute Test Conditions

Decision tree induction algorithms must provide a method for expressing an attribute test condition and its corresponding outcomes for different attribute types.

Binary Attributes
The test condition for a binary attribute generates two potential outcomes, as shown in Figure 3.7.

Figure 3.7. Attribute test condition for a binary attribute.
Nominal Attributes
Since a nominal attribute can have many values, its attribute test condition can be expressed in two ways, as a multiway split or a binary split, as shown in Figure 3.8. For a multiway split (Figure 3.8(a)), the number of outcomes depends on the number of distinct values for the corresponding attribute. For example, if an attribute such as marital status has three distinct values (single, married, or divorced), its test condition will produce a three-way split. It is also possible to create a binary split by partitioning all values taken by the nominal attribute into two groups. For example, some decision tree algorithms, such as CART, produce only binary splits by considering all $2^{k-1} - 1$ ways of creating a binary partition of k attribute values. Figure 3.8(b) illustrates three different ways of grouping the attribute values for marital status into two subsets.

Figure 3.8. Attribute test conditions for nominal attributes.
Ordinal Attributes
Ordinal attributes can also produce binary or multiway splits. Ordinal attribute values can be grouped as long as the grouping does not violate the order property of the attribute values. Figure 3.9 illustrates various ways of splitting training records based on the Shirt Size attribute. The groupings shown in Figures 3.9(a) and (b) preserve the order among the attribute values, whereas the grouping shown in Figure 3.9(c) violates this property because it combines the attribute values Small and Large into the same partition while Medium and Extra Large are combined into another partition.

Figure 3.9. Different ways of grouping ordinal attribute values.
Continuous Attributes
For continuous attributes, the attribute test condition can be expressed as a comparison test (e.g., $A < v$) producing a binary split, or as a range query of the form $v_i \le A < v_{i+1}$, for $i = 1, \ldots, k$, producing a multiway split. The difference between these approaches is shown in Figure 3.10. For the binary split, any possible value v between the minimum and maximum attribute values in the training data can be used for constructing the comparison test $A < v$. However, it is sufficient to only consider distinct attribute values in the training set as candidate split positions. For the multiway split, any possible collection of attribute value ranges can be used, as long as they are mutually exclusive and cover the entire range of attribute values between the minimum and maximum values observed in the training set. One approach for constructing multiway splits is to apply the discretization strategies described in Section 2.3.6 on page 63. After discretization, a new ordinal value is assigned to each discretized interval, and the attribute test condition is then defined using this newly constructed ordinal attribute.

Figure 3.10. Test condition for continuous attributes.
3.3.3 Measures for Selecting an Attribute Test Condition

There are many measures that can be used to determine the goodness of an attribute test condition. These measures try to give preference to attribute test conditions that partition the training instances into purer subsets in the child nodes, which mostly have the same class labels. Having purer nodes is useful since a node that has all of its training instances from the same class does not need to be expanded further. In contrast, an impure node containing training instances from multiple classes is likely to require several levels of node expansions, thereby increasing the depth of the tree considerably. Larger trees are less desirable as they are more susceptible to model overfitting, a condition that may degrade the classification performance on unseen instances, as will be discussed in Section 3.4. They are also difficult to interpret and incur more training and test time as compared to smaller trees.

In the following, we present different ways of measuring the impurity of a node and the collective impurity of its child nodes, both of which will be used to identify the best attribute test condition for a node.
Impurity Measure for a Single Node
The impurity of a node measures how dissimilar the class labels are for the data instances belonging to a common node. Following are examples of measures that can be used to evaluate the impurity of a node t:

$$\text{Entropy} = -\sum_{i=0}^{c-1} p_i(t) \log_2 p_i(t), \tag{3.4}$$

$$\text{Gini index} = 1 - \sum_{i=0}^{c-1} p_i(t)^2, \tag{3.5}$$

$$\text{Classification error} = 1 - \max_i [p_i(t)], \tag{3.6}$$

where $p_i(t)$ is the relative frequency of training instances that belong to class i at node t, c is the total number of classes, and $0 \log_2 0 = 0$ in entropy calculations. All three measures give a zero impurity value if a node contains instances from a single class and maximum impurity if the node has an equal proportion of instances from multiple classes.

Figure 3.11 compares the relative magnitude of the impurity measures when applied to binary classification problems. Since there are only two classes, $p_0(t) + p_1(t) = 1$. The horizontal axis p refers to the fraction of instances that belong to one of the two classes. Observe that all three measures attain their maximum value when the class distribution is uniform (i.e., $p_0(t) = p_1(t) = 0.5$) and minimum value when all the instances belong to a single class (i.e., either $p_0(t)$ or $p_1(t)$ equals 1). The following examples illustrate how the values of the impurity measures vary as we alter the class distribution.

Figure 3.11. Comparison among the impurity measures for binary classification problems.
Node N1 (Class = 0 count: 0, Class = 1 count: 6):
$\text{Gini} = 1 - (0/6)^2 - (6/6)^2 = 0$
$\text{Entropy} = -(0/6)\log_2(0/6) - (6/6)\log_2(6/6) = 0$
$\text{Error} = 1 - \max[0/6, 6/6] = 0$

Node N2 (Class = 0 count: 1, Class = 1 count: 5):
$\text{Gini} = 1 - (1/6)^2 - (5/6)^2 = 0.278$
$\text{Entropy} = -(1/6)\log_2(1/6) - (5/6)\log_2(5/6) = 0.650$
$\text{Error} = 1 - \max[1/6, 5/6] = 0.167$

Node N3 (Class = 0 count: 3, Class = 1 count: 3):
$\text{Gini} = 1 - (3/6)^2 - (3/6)^2 = 0.5$
$\text{Entropy} = -(3/6)\log_2(3/6) - (3/6)\log_2(3/6) = 1$
$\text{Error} = 1 - \max[3/6, 3/6] = 0.5$

Based on these calculations, node $N_1$ has the lowest impurity value, followed by $N_2$ and $N_3$. This example, along with Figure 3.11, shows the consistency among the impurity measures, i.e., if a node $N_1$ has lower entropy than node $N_2$, then the Gini index and error rate of $N_1$ will also be lower than that of $N_2$. Despite their agreement, the attribute chosen as splitting criterion by the impurity measures can still be different (see Exercise 6 on page 187).

Collective Impurity of Child Nodes
Consider an attribute test condition that splits a node containing N training instances into k children, $\{v_1, v_2, \ldots, v_k\}$, where every child node represents a partition of the data resulting from one of the k outcomes of the attribute test condition. Let $N(v_j)$ be the number of training instances associated with a child node $v_j$, whose impurity value is $I(v_j)$. Since a training instance in the parent node reaches node $v_j$ for a fraction of $N(v_j)/N$ times, the collective impurity of the child nodes can be computed by taking a weighted sum of the impurities of the child nodes, as follows:

$$I(\text{children}) = \sum_{j=1}^{k} \frac{N(v_j)}{N} I(v_j), \tag{3.7}$$

Figure 3.12. Examples of candidate attribute test conditions.

Example 3.3. Weighted Entropy
Consider the candidate attribute test conditions shown in Figures 3.12(a) and (b) for the loan borrower classification problem. Splitting on the Home Owner attribute will generate two child nodes whose weighted entropy can be calculated as follows:

$$I(\text{Home Owner} = \text{yes}) = -\tfrac{0}{3}\log_2\tfrac{0}{3} - \tfrac{3}{3}\log_2\tfrac{3}{3} = 0$$
$$I(\text{Home Owner} = \text{no}) = -\tfrac{3}{7}\log_2\tfrac{3}{7} - \tfrac{4}{7}\log_2\tfrac{4}{7} = 0.985$$
$$I(\text{Home Owner}) = \tfrac{3}{10} \times 0 + \tfrac{7}{10} \times 0.985 = 0.690$$

Splitting on Marital Status, on the other hand, leads to three child nodes with a weighted entropy given by

$$I(\text{Marital Status} = \text{Single}) = -\tfrac{2}{5}\log_2\tfrac{2}{5} - \tfrac{3}{5}\log_2\tfrac{3}{5} = 0.971$$
$$I(\text{Marital Status} = \text{Married}) = -\tfrac{0}{3}\log_2\tfrac{0}{3} - \tfrac{3}{3}\log_2\tfrac{3}{3} = 0$$
$$I(\text{Marital Status} = \text{Divorced}) = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1.000$$
$$I(\text{Marital Status}) = \tfrac{5}{10} \times 0.971 + \tfrac{3}{10} \times 0 + \tfrac{2}{10} \times 1 = 0.686$$

Thus, Marital Status has a lower weighted entropy than Home Owner.
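The impurity values above and the weighted entropies of Example 3.3 can be checked with a short Python sketch (ours, not the book's code), using the class counts taken directly from Table 3.3.

```python
# Single-node impurity measures (Equations 3.4-3.6) and the weighted
# impurity of child nodes (Equation 3.7) from raw class counts.
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)   # 0 log2 0 = 0

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def error(counts):
    return 1.0 - max(counts) / sum(counts)

# Nodes N1, N2, N3 from the example: (count of class 0, count of class 1).
for name, counts in [("N1", [0, 6]), ("N2", [1, 5]), ("N3", [3, 3])]:
    print(name, round(gini(counts), 3), round(entropy(counts), 3), round(error(counts), 3))

def weighted_entropy(partitions):
    """Equation 3.7 with entropy as the impurity measure I."""
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * entropy(p) for p in partitions)

# Class counts (Defaulted = Yes, Defaulted = No) per child node, from Table 3.3.
home_owner = [[0, 3], [3, 4]]               # Home Owner = Yes, No
marital    = [[2, 3], [0, 3], [1, 1]]       # Single, Married, Divorced
print(round(weighted_entropy(home_owner), 3))   # ~0.690
print(round(weighted_entropy(marital), 3))      # ~0.685 (0.686 in the text after rounding)
```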
Identifying the Best Attribute Test Condition
To determine the goodness of an attribute test condition, we need to compare the degree of impurity of the parent node (before splitting) with the weighted degree of impurity of the child nodes (after splitting). The larger their difference, the better the test condition. This difference, $\Delta$, also termed the gain in purity of an attribute test condition, can be defined as follows:

$$\Delta = I(\text{parent}) - I(\text{children}), \tag{3.8}$$

where I(parent) is the impurity of a node before splitting and I(children) is the weighted impurity measure after splitting. It can be shown that the gain is non-negative since $I(\text{parent}) \ge I(\text{children})$ for any reasonable measure such as those presented above. The higher the gain, the purer are the classes in the child nodes relative to the parent node. The splitting criterion in the decision tree learning algorithm selects the attribute test condition that shows the maximum gain. Note that maximizing the gain at a given node is equivalent to minimizing the weighted impurity measure of its children since I(parent) is the same for all candidate attribute test conditions. Finally, when entropy is used as the impurity measure, the difference in entropy is commonly known as the information gain, $\Delta_{\text{info}}$.

In the following, we present illustrative approaches for identifying the best attribute test condition given qualitative or quantitative attributes.

Splitting of Qualitative Attributes
Consider the first two candidate splits shown in Figure 3.12 involving the qualitative attributes Home Owner and Marital Status. The initial class distribution at the parent node is (0.3, 0.7), since there are 3 instances of class Yes and 7 instances of class No in the training data. Thus,

$$I(\text{parent}) = -\tfrac{3}{10}\log_2\tfrac{3}{10} - \tfrac{7}{10}\log_2\tfrac{7}{10} = 0.881.$$

The information gains for Home Owner and Marital Status are each given by

$$\Delta_{\text{info}}(\text{Home Owner}) = 0.881 - 0.690 = 0.191,$$
$$\Delta_{\text{info}}(\text{Marital Status}) = 0.881 - 0.686 = 0.195.$$

The information gain for Marital Status is thus higher due to its lower weighted entropy, which will thus be considered for splitting.

Binary Splitting of Qualitative Attributes
Consider building a decision tree using only binary splits and the Gini index as the impurity measure. Figure 3.13 shows examples of four candidate splitting criteria for the Home Owner and Marital Status attributes. Since there are 3 borrowers in the training set who defaulted and 7 others who repaid their loan (see the table in Figure 3.13), the Gini index of the parent node before splitting is

$$1 - \left(\tfrac{3}{10}\right)^2 - \left(\tfrac{7}{10}\right)^2 = 0.420.$$

Figure 3.13. Splitting criteria for the loan borrower classification problem using Gini index.

If Home Owner is chosen as the splitting attribute, the Gini index for the child nodes $N_1$ and $N_2$ are 0 and 0.490, respectively. The weighted average Gini index for the children is

$$(3/10) \times 0 + (7/10) \times 0.490 = 0.343,$$

where the weights represent the proportion of training instances assigned to each child. The gain using Home Owner as the splitting attribute is $0.420 - 0.343 = 0.077$. Similarly, we can apply a binary split on the Marital Status attribute. However, since Marital Status is a nominal attribute with three outcomes, there are three possible ways to group the attribute values into a binary split. The weighted average Gini index of the children for each candidate binary split is shown in Figure 3.13. Based on these results, Home Owner and the last binary split using Marital Status are clearly the best candidates, since they both produce the lowest weighted average Gini index. Binary splits can also be used for ordinal attributes, if the binary partitioning of the attribute values does not violate the ordering property of the values.
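The following Python sketch (ours) mirrors this binary-split search on the data of Table 3.3: it scores the Home Owner split and each binary partition of the Marital Status values with the weighted Gini index. For the three-valued Marital Status attribute this enumerates all $2^{k-1} - 1 = 3$ partitions, and the computation shows the {Married} versus {Single, Divorced} grouping tying Home Owner at 0.343.

```python
# Weighted Gini index of binary splits on the qualitative attributes of Table 3.3.
from itertools import combinations

rows = [("Yes", "Single", "No"), ("No", "Married", "No"), ("No", "Single", "No"),
        ("Yes", "Married", "No"), ("No", "Divorced", "Yes"), ("No", "Single", "No"),
        ("Yes", "Divorced", "No"), ("No", "Single", "Yes"), ("No", "Married", "No"),
        ("No", "Single", "Yes")]           # (Home Owner, Marital Status, Defaulted)

def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def weighted_gini(children):
    """children: list of label lists, one per child node (Equation 3.7 with Gini)."""
    n = sum(len(g) for g in children)
    return sum(len(g) / n * gini(g) for g in children)

# Binary split on Home Owner.
owner_split = [[r[2] for r in rows if r[0] == "Yes"], [r[2] for r in rows if r[0] == "No"]]
print("Home Owner:", round(weighted_gini(owner_split), 3))          # 0.343

# Binary partitions of the Marital Status values: {v} vs. the rest.
values = ["Single", "Married", "Divorced"]
for group in combinations(values, 1):
    left = [r[2] for r in rows if r[1] in group]
    right = [r[2] for r in rows if r[1] not in group]
    print(set(group), "vs rest:", round(weighted_gini([left, right]), 3))
```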
Binary Splitting of Quantitative Attributes
Consider the problem of identifying the best binary split Annual Income ≤ τ for the preceding loan approval classification problem. As discussed previously, even though τ can take any value between the minimum and maximum values of annual income in the training set, it is sufficient to only consider the annual income values observed in the training set as candidate split positions. For each candidate τ, the training set is scanned once to count the number of borrowers with annual income less than or greater than τ, along with their class proportions. We can then compute the Gini index at each candidate split position and choose the τ that produces the lowest value. Computing the Gini index at each candidate split position requires O(N) operations, where N is the number of training instances. Since there are at most N possible candidates, the overall complexity of this brute-force method is $O(N^2)$. It is possible to reduce the complexity of this problem to $O(N \log N)$ by using a method described as follows (see the illustration in Figure 3.14). In this method, we first sort the training instances based on their annual income, a one-time cost that requires $O(N \log N)$ operations. The candidate split positions are given by the midpoints between every two adjacent sorted values: $55,000, $65,000, $72,500, and so on. For the first candidate, since none of the instances has an annual income less than or equal to $55,000, the Gini index for the child node with Annual Income < $55,000 is equal to zero. In contrast, there are 3 training instances of class Yes and 7 instances of class No with annual income greater than $55,000. The Gini index for this node is 0.420. The weighted average Gini index for the first candidate split position, τ = $55,000, is equal to 0 × 0 + 1 × 0.420 = 0.420.

Figure 3.14. Splitting continuous attributes.

For the next candidate, τ = $65,000, the class distribution of its child nodes can be obtained with a simple update of the distribution for the previous candidate. This is because, as τ increases from $55,000 to $65,000, there is only one training instance affected by the change. By examining the class label of the affected training instance, the new class distribution is obtained. For example, as τ increases to $65,000, there is only one borrower in the training set, with an annual income of $60,000, affected by this change. Since the class label for the borrower is No, the count for class No increases from 0 to 1 (for Annual Income ≤ $65,000) and decreases from 7 to 6 (for Annual Income > $65,000), as shown in Figure 3.14. The distribution for the Yes class remains unaffected. The updated Gini index for this candidate split position is 0.400.

This procedure is repeated until the Gini index for all candidates is found. The best split position corresponds to the one that produces the lowest Gini index, which occurs at τ = $97,500. Since the Gini index at each candidate split position can be computed in O(1) time, the complexity of finding the best split position is O(N) once all the values are kept sorted, a one-time operation that takes $O(N \log N)$ time. The overall complexity of this method is thus $O(N \log N)$, which is much smaller than the $O(N^2)$ time taken by the brute-force method. The amount of computation can be further reduced by considering only candidate split positions located between two adjacent sorted instances with different class labels. For example, we do not need to consider candidate split positions located between $60,000 and $75,000 because all three instances with annual income in this range ($60,000, $70,000, and $75,000) have the same class labels. Choosing a split position within this range only increases the degree of impurity, compared to a split position located outside this range. Therefore, the candidate split positions at τ = $65,000 and τ = $72,500 can be ignored. Similarly, we do not need to consider the candidate split positions at $87,500, $92,500, $110,000, $122,500, and $172,500 because they are located between two adjacent instances with the same labels. This strategy reduces the number of candidate split positions to consider from 9 to 2 (excluding the two boundary cases τ = $55,000 and τ = $230,000).
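The split-position search can be sketched as follows in Python (our illustration; for clarity it recomputes the Gini index at every midpoint rather than applying the incremental O(N log N) update described above). On the Annual Income values of Table 3.3 it recovers the best split at τ = $97,500.

```python
# Find the best binary split position tau for a continuous attribute by
# scoring the midpoints between adjacent sorted values with the weighted Gini index.
incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]          # Annual Income, in thousands
defaulted = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

def gini(labels):
    if not labels:
        return 0.0
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

pairs = sorted(zip(incomes, defaulted))
candidates = [(pairs[i][0] + pairs[i + 1][0]) / 2 for i in range(len(pairs) - 1)]

best_tau, best_score = None, float("inf")
for tau in candidates:
    left = [d for x, d in pairs if x <= tau]
    right = [d for x, d in pairs if x > tau]
    score = len(left) / len(pairs) * gini(left) + len(right) / len(pairs) * gini(right)
    if score < best_score:
        best_tau, best_score = tau, score

print(best_tau, round(best_score, 3))   # 97.5 (i.e., tau = $97,500) with weighted Gini 0.3
```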
Gain Ratio
One potential limitation of impurity measures such as entropy and the Gini index is that they tend to favor qualitative attributes with a large number of distinct values. Figure 3.12 shows three candidate attributes for partitioning the data set given in Table 3.3. As previously mentioned, the attribute Marital Status is a better choice than the attribute Home Owner, because it provides a larger information gain. However, if we compare them against Customer ID, the latter produces the purest partitions with the maximum information gain, since the weighted entropy and Gini index are equal to zero for its children. Yet, Customer ID is not a good attribute for splitting because it has a unique value for each instance. Even though a test condition involving Customer ID will accurately classify every instance in the training data, we cannot use such a test condition on new test instances with Customer ID values that haven't been seen before during training. This example suggests that having a low impurity value alone is insufficient to find a good attribute test condition for a node. As we will see later in Section 3.4, having a larger number of child nodes can make a decision tree more complex and consequently more susceptible to overfitting. Hence, the number of children produced by the splitting attribute should also be taken into consideration while deciding the best attribute test condition.

There are two ways to overcome this problem. One way is to generate only binary decision trees, thus avoiding the difficulty of handling attributes with varying number of partitions. This strategy is employed by decision tree classifiers such as CART. Another way is to modify the splitting criterion to take into account the number of partitions produced by the attribute. For example, in the C4.5 decision tree algorithm, a measure known as gain ratio is used to compensate for attributes that produce a large number of child nodes. This measure is computed as follows:

$$\text{Gain ratio} = \frac{\Delta_{\text{info}}}{\text{Split Info}} = \frac{\text{Entropy(Parent)} - \sum_{i=1}^{k} \frac{N(v_i)}{N}\text{Entropy}(v_i)}{-\sum_{i=1}^{k} \frac{N(v_i)}{N}\log_2\frac{N(v_i)}{N}} \tag{3.9}$$

where $N(v_i)$ is the number of instances assigned to node $v_i$ and k is the total number of splits. The split information measures the entropy of splitting a node into its child nodes and evaluates if the split results in a larger number of equally-sized child nodes or not. For example, if every partition has the same number of instances, then $\forall i: N(v_i)/N = 1/k$ and the split information would be equal to $\log_2 k$. Thus, if an attribute produces a large number of splits, its split information is also large, which in turn, reduces the gain ratio.

Example 3.4. Gain Ratio
Consider the data set given in Exercise 2 on page 185. We want to select the best attribute test condition among the following three attributes: Gender, Car Type, and Customer ID. The entropy before splitting is

$$\text{Entropy(parent)} = -\tfrac{10}{20}\log_2\tfrac{10}{20} - \tfrac{10}{20}\log_2\tfrac{10}{20} = 1.$$

If Gender is used as the attribute test condition:

$$\text{Entropy(children)} = \tfrac{10}{20}\left[-\tfrac{6}{10}\log_2\tfrac{6}{10} - \tfrac{4}{10}\log_2\tfrac{4}{10}\right] \times 2 = 0.971$$
$$\text{Gain Ratio} = \frac{1 - 0.971}{-\tfrac{10}{20}\log_2\tfrac{10}{20} - \tfrac{10}{20}\log_2\tfrac{10}{20}} = \frac{0.029}{1} = 0.029$$

If Car Type is used as the attribute test condition:

$$\text{Entropy(children)} = \tfrac{4}{20}\left[-\tfrac{1}{4}\log_2\tfrac{1}{4} - \tfrac{3}{4}\log_2\tfrac{3}{4}\right] + \tfrac{8}{20} \times 0 + \tfrac{8}{20}\left[-\tfrac{1}{8}\log_2\tfrac{1}{8} - \tfrac{7}{8}\log_2\tfrac{7}{8}\right] = 0.380$$
$$\text{Gain Ratio} = \frac{1 - 0.380}{-\tfrac{4}{20}\log_2\tfrac{4}{20} - \tfrac{8}{20}\log_2\tfrac{8}{20} - \tfrac{8}{20}\log_2\tfrac{8}{20}} = \frac{0.620}{1.52} = 0.41$$

Finally, if Customer ID is used as the attribute test condition:

$$\text{Entropy(children)} = \tfrac{1}{20}\left[-\tfrac{1}{1}\log_2\tfrac{1}{1} - \tfrac{0}{1}\log_2\tfrac{0}{1}\right] \times 20 = 0$$
$$\text{Gain Ratio} = \frac{1 - 0}{-\tfrac{1}{20}\log_2\tfrac{1}{20} \times 20} = \frac{1}{4.32} = 0.23$$

Thus, even though Customer ID has the highest information gain, its gain ratio is lower than Car Type since it produces a larger number of splits.
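A small Python sketch (ours) of Equation 3.9, applied to the Home Owner and Marital Status attributes of Table 3.3, shows how the split information can change the ranking of candidate attributes: information gain alone slightly favors Marital Status, but its larger split information gives Home Owner the higher gain ratio.

```python
# Gain ratio (Equation 3.9) computed from raw attribute values and class labels.
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(labels.count(c) / n * log2(labels.count(c) / n) for c in set(labels))

def gain_ratio(values, labels):
    n = len(labels)
    groups = {}
    for v, l in zip(values, labels):
        groups.setdefault(v, []).append(l)
    children = sum(len(g) / n * entropy(g) for g in groups.values())
    delta_info = entropy(labels) - children                          # numerator of Eq. 3.9
    split_info = -sum(len(g) / n * log2(len(g) / n) for g in groups.values())
    return delta_info / split_info

home_owner = ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"]
marital = ["Single", "Married", "Single", "Married", "Divorced",
           "Single", "Divorced", "Single", "Married", "Single"]
defaulted = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

print(round(gain_ratio(home_owner, defaulted), 3))   # ~0.217
print(round(gain_ratio(marital, defaulted), 3))      # ~0.132
```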
3.3.4 Algorithm for Decision Tree Induction

Algorithm 3.1 presents a pseudocode for a decision tree induction algorithm. The input to this algorithm is a set of training instances E along with the attribute set F. The algorithm works by recursively selecting the best attribute to split the data (Step 7) and expanding the nodes of the tree (Steps 11 and 12) until the stopping criterion is met (Step 1). The details of this algorithm are explained below.

Algorithm 3.1. A skeleton decision tree induction algorithm.

1. The createNode() function extends the decision tree by creating a new node. A node in the decision tree either has a test condition, denoted as node.test_cond, or a class label, denoted as node.label.

2. The find_best_split() function determines the attribute test condition for partitioning the training instances associated with a node. The splitting attribute chosen depends on the impurity measure used. The popular measures include entropy and the Gini index.

3. The Classify() function determines the class label to be assigned to a leaf node. For each leaf node t, let $p(i|t)$ denote the fraction of training instances from class i associated with the node t. The label assigned to the leaf node is typically the one that occurs most frequently in the training instances that are associated with this node:

$$\text{leaf.label} = \underset{i}{\operatorname{argmax}}\ p(i|t), \tag{3.10}$$

where the argmax operator returns the class i that maximizes $p(i|t)$. Besides providing the information needed to determine the class label of a leaf node, $p(i|t)$ can also be used as a rough estimate of the probability that an instance assigned to the leaf node t belongs to class i. Sections 4.11.2 and 4.11.4 in the next chapter describe how such probability estimates can be used to determine the performance of a decision tree under different cost functions.

4. The stopping_cond() function is used to terminate the tree-growing process by checking whether all the instances have identical class labels or attribute values. Since decision tree classifiers employ a top-down, recursive partitioning approach for building a model, the number of training instances associated with a node decreases as the depth of the tree increases. As a result, a leaf node may contain too few training instances to make a statistically significant decision about its class label. This is known as the data fragmentation problem. One way to avoid this problem is to disallow splitting of a node when the number of instances associated with the node falls below a certain threshold. A more systematic way to control the size of a decision tree (number of leaf nodes) will be discussed in Section 3.5.4.
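The skeleton of Algorithm 3.1 can be rendered as a short, runnable Python sketch (our own rendering, not the book's code): the roles of the helper functions appear as comments, splits are chosen with the Gini-based gain in purity, and the loan data of Table 3.3 (with Annual Income omitted for simplicity) is used to grow a small tree.

```python
# A compact TreeGrowth-style sketch: recursive top-down induction with Gini-based
# multiway splits on qualitative attributes and majority-vote labels at the leaves.
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def find_best_split(records, labels, attributes):
    """Return the attribute whose multiway split gives the largest gain in purity."""
    parent = gini(labels)
    best_attr, best_gain = None, 0.0
    for attr in attributes:
        groups = {}
        for rec, lab in zip(records, labels):
            groups.setdefault(rec[attr], []).append(lab)
        weighted = sum(len(g) / len(labels) * gini(g) for g in groups.values())
        if parent - weighted > best_gain:
            best_attr, best_gain = attr, parent - weighted
    return best_attr

def tree_growth(records, labels, attributes):
    """TreeGrowth(E, F): returns a decision tree as nested dictionaries."""
    if len(set(labels)) == 1 or not attributes:                    # stopping_cond(E, F)
        return {"label": Counter(labels).most_common(1)[0][0]}     # Classify(E)
    attr = find_best_split(records, labels, attributes)            # find_best_split(E, F)
    if attr is None:                                               # no attribute improves purity
        return {"label": Counter(labels).most_common(1)[0][0]}
    node = {"attribute": attr, "children": {}}                     # createNode()
    for value in {rec[attr] for rec in records}:
        subset = [(r, l) for r, l in zip(records, labels) if r[attr] == value]
        sub_records, sub_labels = zip(*subset)
        node["children"][value] = tree_growth(list(sub_records), list(sub_labels),
                                              [a for a in attributes if a != attr])
    return node

records = [
    {"Home Owner": "Yes", "Marital Status": "Single"},
    {"Home Owner": "No",  "Marital Status": "Married"},
    {"Home Owner": "No",  "Marital Status": "Single"},
    {"Home Owner": "Yes", "Marital Status": "Married"},
    {"Home Owner": "No",  "Marital Status": "Divorced"},
    {"Home Owner": "No",  "Marital Status": "Single"},
    {"Home Owner": "Yes", "Marital Status": "Divorced"},
    {"Home Owner": "No",  "Marital Status": "Single"},
    {"Home Owner": "No",  "Marital Status": "Married"},
    {"Home Owner": "No",  "Marital Status": "Single"},
]
labels = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(tree_growth(records, labels, ["Home Owner", "Marital Status"]))
```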
3.3.5 Example Application: Web Robot Detection

Consider the task of distinguishing the access patterns of web robots from those generated by human users. A web robot (also known as a web crawler) is a software program that automatically retrieves files from one or more websites by following the hyperlinks extracted from an initial set of seed URLs. These programs have been deployed for various purposes, from gathering web pages on behalf of search engines to more malicious activities such as spamming and committing click fraud in online advertisements.

Figure 3.15. Input data for web robot detection.
The web robot detection problem can be cast as a binary classification task. The input data for the classification task is a web server log, a sample of which is shown in Figure 3.15(a). Each line in the log file corresponds to a request made by a client (i.e., a human user or a web robot) to the web server. The fields recorded in the web log include the client's IP address, timestamp of the request, URL of the requested file, size of the file, and user agent, which is a field that contains identifying information about the client.

For human users, the user agent field specifies the type of web browser or mobile device used to fetch the files, whereas for web robots, it should technically contain the name of the crawler program. However, web robots may conceal their true identities by declaring their user agent fields to be identical to known browsers. Therefore, user agent is not a reliable field to detect web robots.

The first step toward building a classification model is to precisely define a data instance and its associated attributes. A simple approach is to consider each log entry as a data instance and use the appropriate fields in the log file as its attribute set. This approach, however, is inadequate for several reasons. First, many of the attributes are nominal-valued and have a wide range of domain values. For example, the number of unique client IP addresses, URLs, and referrers in a log file can be very large. These attributes are undesirable for building a decision tree because their split information is extremely high (see Equation (3.9)). In addition, it might not be possible to classify test instances containing IP addresses, URLs, or referrers that are not present in the training data. Finally, by considering each log entry as a separate data instance, we disregard the sequence of web pages retrieved by the client, which is a critical piece of information that can help distinguish web robot accesses from those of a human user.

A better alternative is to consider each web session as a data instance. A web session is a sequence of requests made by a client during a given visit to the website. Each web session can be modeled as a directed graph, in which the nodes correspond to web pages and the edges correspond to hyperlinks connecting one web page to another. Figure 3.15(b) shows a graphical representation of the first web session given in the log file. Every web session can be characterized using some meaningful attributes about the graph that contain discriminatory information. Figure 3.15(c) shows some of the attributes extracted from the graph, including the depth and breadth of its corresponding tree rooted at the entry point to the website. For example, the depth and breadth of the tree shown in Figure 3.15(b) are both equal to two.
The derived attributes shown in Figure 3.15(c) are more informative than the original attributes given in the log file because they characterize the behavior of the client at the website. Using this approach, a data set containing 2916 instances was created, with equal numbers of sessions due to web robots (class 1) and human users (class 0). 10% of the data were reserved for training while the remaining 90% were used for testing. The induced decision tree is shown in Figure 3.16, which has an error rate equal to 3.8% on the training set and 5.3% on the test set. In addition to its low error rate, the tree also reveals some interesting properties that can help discriminate web robots from human users:

1. Accesses by web robots tend to be broad but shallow, whereas accesses by human users tend to be more focused (narrow but deep).

2. Web robots seldom retrieve the image pages associated with a web page.

3. Sessions due to web robots tend to be long and contain a large number of requested pages.

4. Web robots are more likely to make repeated requests for the same web page than human users since the web pages retrieved by human users are often cached by the browser.
3.3.6 Characteristics of Decision Tree Classifiers

The following is a summary of the important characteristics of decision tree induction algorithms.

1. Applicability: Decision trees are a nonparametric approach for building classification models. This approach does not require any prior assumption about the probability distribution governing the class and attributes of the data, and thus, is applicable to a wide variety of data sets. It is also applicable to both categorical and continuous data without requiring the attributes to be transformed into a common representation via binarization, normalization, or standardization. Unlike some binary classifiers described in Chapter 4, it can also deal with multiclass problems without the need to decompose them into multiple binary classification tasks. Another appealing feature of decision tree classifiers is that the induced trees, especially the shorter ones, are relatively easy to interpret. The accuracies of the trees are also quite comparable to other classification techniques for many simple data sets.
2. Expressiveness: A decision tree provides a universal representation for discrete-valued functions. In other words, it can encode any function of discrete-valued attributes. This is because every discrete-valued function can be represented as an assignment table, where every unique combination of discrete attributes is assigned a class label. Since every combination of attributes can be represented as a leaf in the decision tree, we can always find a decision tree whose label assignments at the leaf nodes match with the assignment table of the original function. Decision trees can also help in providing compact representations of functions when some of the unique combinations of attributes can be represented by the same leaf node. For example, Figure 3.17 shows the assignment table of the Boolean function $(A \wedge B) \vee (C \wedge D)$ involving four binary attributes, resulting in a total of $2^4 = 16$ possible assignments. The tree shown in Figure 3.17 shows a compressed encoding of this assignment table. Instead of requiring a fully-grown tree with 16 leaf nodes, it is possible to encode the function using a simpler tree with only 7 leaf nodes. Nevertheless, not all decision trees for discrete-valued attributes can be simplified. One notable example is the parity function, whose value is 1 when there is an even number of true values among its Boolean attributes, and 0 otherwise. Accurate modeling of such a function requires a full decision tree with $2^d$ nodes, where d is the number of Boolean attributes (see Exercise 1 on page 185).

Figure 3.16. Decision tree model for web robot detection.

Figure 3.17. Decision tree for the Boolean function $(A \wedge B) \vee (C \wedge D)$.
3. Computational Efficiency: Since the number of possible decision trees can be very large, many decision tree algorithms employ a heuristic-based approach to guide their search in the vast hypothesis space. For example, the algorithm presented in Section 3.3.4 uses a greedy, top-down, recursive partitioning strategy for growing a decision tree. For many data sets, such techniques quickly construct a reasonably good decision tree even when the training set size is very large. Furthermore, once a decision tree has been built, classifying a test record is extremely fast, with a worst-case complexity of O(w), where w is the maximum depth of the tree.
4. Handling Missing Values: A decision tree classifier can handle missing attribute values in a number of ways, both in the training and the test sets. When there are missing values in the test set, the classifier must decide which branch to follow if the value of a splitting node attribute is missing for a given test instance. One approach, known as the probabilistic split method, which is employed by the C4.5 decision tree classifier, distributes the data instance to every child of the splitting node according to the probability that the missing attribute has a particular value. In contrast, the CART algorithm uses the surrogate split method, where the instance whose splitting attribute value is missing is assigned to one of the child nodes based on the value of another non-missing surrogate attribute whose splits most resemble the partitions made by the missing attribute. Another approach, known as the separate class method, is used by the CHAID algorithm, where the missing value is treated as a separate categorical value distinct from other values of the splitting attribute. Figure 3.18 shows an example of the three different ways for handling missing values in a decision tree classifier. Other strategies for dealing with missing values are based on data preprocessing, where the instance with a missing value is either imputed with the mode (for categorical attributes) or mean (for continuous attributes) value or discarded before the classifier is trained.

Figure 3.18. Methods for handling missing attribute values in a decision tree classifier.

During training, if an attribute v has missing values in some of the training instances associated with a node, we need a way to measure the gain in purity if v is used for splitting. One simple way is to exclude instances with missing values of v in the counting of instances associated with every child node, generated for every possible outcome of v. Further, if v is chosen as the attribute test condition at a node, training instances with missing values of v can be propagated to the child nodes using any of the methods described above for handling missing values in test instances.
5. Handling Interactions among Attributes: Attributes are considered interacting if they are able to distinguish between classes when used together, but individually they provide little or no information. Due to the greedy nature of the splitting criteria in decision trees, such attributes could be passed over in favor of other attributes that are not as useful. This could result in more complex decision trees than necessary. Hence, decision trees can perform poorly when there are interactions among attributes. To illustrate this point, consider the three-dimensional data shown in Figure 3.19(a), which contains 2000 data points from one of two classes, denoted as + and ∘ in the diagram. Figure 3.19(b) shows the distribution of the two classes in the two-dimensional space involving attributes X and Y, which is a noisy version of the XOR Boolean function. We can see that even though the two classes are well-separated in this two-dimensional space, neither of the two attributes contains sufficient information to distinguish between the two classes when used alone. For example, the entropies of the following attribute test conditions, $X \le 10$ and $Y \le 10$, are close to 1, indicating that neither X nor Y provides any reduction in the impurity measure when used individually. X and Y thus represent a case of interaction among attributes. The data set also contains a third attribute, Z, in which both classes are distributed uniformly, as shown in Figures 3.19(c) and 3.19(d), and hence, the entropy of any split involving Z is close to 1. As a result, Z is as likely to be chosen for splitting as the interacting but useful attributes, X and Y. For further illustration of this issue, readers are referred to Example 3.7 in Section 3.4.1 and Exercise 7 at the end of this chapter.

Figure 3.19. Example of XOR data involving X and Y, along with an irrelevant attribute Z.
6. Handling Irrelevant Attributes: An attribute is irrelevant if it is not useful for the classification task. Since irrelevant attributes are poorly associated with the target class labels, they will provide little or no gain in purity and thus will be passed over by other, more relevant features. Hence, the presence of a small number of irrelevant attributes will not impact the decision tree construction process. However, not all attributes that provide little to no gain are irrelevant (see Figure 3.19). Hence, if the classification problem is complex (e.g., involving interactions among attributes) and there are a large number of irrelevant attributes, then some of these attributes may be accidentally chosen during the tree-growing process, since they may provide a better gain than a relevant attribute just by random chance. Feature selection techniques can help to improve the accuracy of decision trees by eliminating the irrelevant attributes during preprocessing. We will investigate the issue of too many irrelevant attributes in Section 3.4.1.

7. Handling Redundant Attributes: An attribute is redundant if it is strongly correlated with another attribute in the data. Since redundant attributes show similar gains in purity if they are selected for splitting, only one of them will be selected as an attribute test condition in the decision tree algorithm. Decision trees can thus handle the presence of redundant attributes.
8. Using Rectilinear Splits: The test conditions described so far in this chapter involve using only a single attribute at a time. As a consequence, the tree-growing procedure can be viewed as the process of partitioning the attribute space into disjoint regions until each region contains records of the same class. The border between two neighboring regions of different classes is known as a decision boundary. Figure 3.20 shows the decision tree as well as the decision boundary for a binary classification problem. Since the test condition involves only a single attribute, the decision boundaries are rectilinear, i.e., parallel to the coordinate axes. This limits the expressiveness of decision trees in representing decision boundaries of data sets with continuous attributes. Figure 3.21 shows a two-dimensional data set involving binary classes that cannot be perfectly classified by a decision tree whose attribute test conditions are defined based on single attributes. The binary classes in the data set are generated from two skewed Gaussian distributions, centered at (8,8) and (12,12), respectively. The true decision boundary is represented by the diagonal dashed line, whereas the rectilinear decision boundary produced by the decision tree classifier is shown by the thick solid line. In contrast, an oblique decision tree may overcome this limitation by allowing the test condition to be specified using more than one attribute. For example, the binary classification data shown in Figure 3.21 can be easily represented by an oblique decision tree with a single root node with the test condition $x + y < 20$.

Figure 3.20. Example of a decision tree and its decision boundaries for a two-dimensional data set.

Figure 3.21. Example of a data set that cannot be partitioned optimally using a decision tree with single attribute test conditions. The true decision boundary is shown by the dashed line.

Although an oblique decision tree is more expressive and can produce more compact trees, finding the optimal test condition is computationally more expensive.
9. Choice of Impurity Measure: It should be noted that the choice of impurity measure often has little effect on the performance of decision tree classifiers since many of the impurity measures are quite consistent with each other, as shown in Figure 3.11 on page 129. Instead, the strategy used to prune the tree has a greater impact on the final tree than the choice of impurity measure.
3.4 Model Overfitting
The methods presented so far try to learn classification models that show the lowest error on the training set. However, as we will show in the following example, even if a model fits well over the training data, it can still show poor generalization performance, a phenomenon known as model overfitting.

Figure 3.22. Examples of training and test sets of a two-dimensional classification problem.

Figure 3.23. Effect of varying tree size (number of leaf nodes) on training and test errors.
Example 3.5. Overfitting and Underfitting of Decision Trees
Consider the two-dimensional data set shown in Figure 3.22(a). The data set contains instances that belong to two separate classes, represented as + and ∘, respectively, where each class has 5400 instances. All instances belonging to the ∘ class were generated from a uniform distribution. For the + class, 5000 instances were generated from a Gaussian distribution centered at (10,10) with unit variance, while the remaining 400 instances were sampled from the same uniform distribution as the ∘ class. We can see from Figure 3.22(a) that the + class can be largely distinguished from the ∘ class by drawing a circle of appropriate size centered at (10,10). To learn a classifier using this two-dimensional data set, we randomly sampled 10% of the data for training and used the remaining 90% for testing. The training set, shown in Figure 3.22(b), looks quite representative of the overall data. We used the Gini index as the impurity measure to construct decision trees of increasing sizes (number of leaf nodes), by recursively expanding a node into child nodes till every leaf node was pure, as described in Section 3.3.4.

Figure 3.23(a) shows changes in the training and test error rates as the size of the tree varies from 1 to 8. Both error rates are initially large when the tree has only one or two leaf nodes. This situation is known as model underfitting. Underfitting occurs when the learned decision tree is too simplistic, and thus, incapable of fully representing the true relationship between the attributes and the class labels. As we increase the tree size from 1 to 8, we can observe two effects. First, both the error rates decrease since larger trees are able to represent more complex decision boundaries. Second, the training and test error rates are quite close to each other, which indicates that the performance on the training set is fairly representative of the generalization performance. As we further increase the size of the tree from 8 to 150, the training error continues to steadily decrease till it eventually reaches zero, as shown in Figure 3.23(b). However, in a striking contrast, the test error rate ceases to decrease any further beyond a certain tree size, and then it begins to increase. The training error rate thus grossly under-estimates the test error rate once the tree becomes too large. Further, the gap between the training and test error rates keeps on widening as we increase the tree size. This behavior, which may seem counter-intuitive at first, can be attributed to the phenomenon of model overfitting.
3.4.1 Reasons for Model Overfitting

Model overfitting is the phenomenon where, in the pursuit of minimizing the training error rate, an overly complex model is selected that captures specific patterns in the training data but fails to learn the true nature of the relationships between attributes and class labels in the overall data. To illustrate this, Figure 3.24 shows decision trees and their corresponding decision boundaries (shaded rectangles represent regions assigned to the + class) for two trees of sizes 5 and 50. We can see that the decision tree of size 5 appears quite simple and its decision boundaries provide a reasonable approximation to the ideal decision boundary, which in this case corresponds to a circle centered around the Gaussian distribution at (10,10). Although its training and test error rates are non-zero, they are very close to each other, which indicates that the patterns learned in the training set should generalize well over the test set. On the other hand, the decision tree of size 50 appears much more complex than the tree of size 5, with complicated decision boundaries. For example, some of its shaded rectangles (assigned the + class) attempt to cover narrow regions in the input space that contain only one or two + training instances. Note that the prevalence of + instances in such regions is highly specific to the training set, as these regions are mostly dominated by instances of the ∘ class in the overall data. Hence, in an attempt to perfectly fit the training data, the decision tree of size 50 starts fine-tuning itself to specific patterns in the training data, leading to poor performance on an independently chosen test set.

Figure 3.24. Decision trees with different model complexities.
Figure 3.25. Performance of decision trees using 20% data for training (twice the original training size).

There are a number of factors that influence model overfitting. In the following, we provide brief descriptions of two of the major factors: limited training size and high model complexity. Though they are not exhaustive, the interplay between them can help explain most of the common model overfitting phenomena in real-world applications.

Limited Training Size
Note that a training set consisting of a finite number of instances can only provide a limited representation of the overall data. Hence, it is possible that the patterns learned from a training set do not fully represent the true patterns in the overall data, leading to model overfitting. In general, as we increase the size of a training set (number of training instances), the patterns learned from the training set start resembling the true patterns in the overall data. Hence, the effect of overfitting can be reduced by increasing the training size, as illustrated in the following example.
Example 3.6. Effect of Training Size
Suppose that we use twice the number of training instances than what we had used in the experiments conducted in Example 3.5. Specifically, we use 20% of the data for training and use the remainder for testing. Figure 3.25(b) shows the training and test error rates as the size of the tree is varied from 1 to 150. There are two major differences in the trends shown in this figure and those shown in Figure 3.23(b) (using only 10% of the data for training). First, even though the training error rate decreases with increasing tree size in both figures, its rate of decrease is much smaller when we use twice the training size. Second, for a given tree size, the gap between the training and test error rates is much smaller when we use twice the training size. These differences suggest that the patterns learned using 20% of the data for training are more generalizable than those learned using 10% of the data for training.

Figure 3.25(a) shows the decision boundaries for the tree of size 50, learned using 20% of the data for training. In contrast to the tree of the same size learned using 10% of the data for training (see Figure 3.24(d)), we can see that the decision tree is not capturing specific patterns of noisy + instances in the training set. Instead, the high model complexity of 50 leaf nodes is being effectively used to learn the boundaries of the + instances centered at (10,10).

High Model Complexity
Generally, a more complex model has a better ability to represent complex patterns in the data. For example, decision trees with a larger number of leaf nodes can represent more complex decision boundaries than decision trees with fewer leaf nodes. However, an overly complex model also has a tendency to learn specific patterns in the training set that do not generalize well over unseen instances. Models with high complexity should thus be judiciously used to avoid overfitting.
One measure of model complexity is the number of "parameters" that need to be inferred from the training set. For example, in the case of decision tree induction, the attribute test conditions at internal nodes correspond to the parameters of the model that need to be inferred from the training set. A decision tree with a larger number of attribute test conditions (and consequently more leaf nodes) thus involves more "parameters" and hence is more complex.

Given a class of models with a certain number of parameters, a learning algorithm attempts to select the best combination of parameter values that maximizes an evaluation metric (e.g., accuracy) over the training set. If the number of parameter value combinations (and hence the complexity) is large, the learning algorithm has to select the best combination from a large number of possibilities, using a limited training set. In such cases, there is a high chance for the learning algorithm to pick a spurious combination of parameters that maximizes the evaluation metric just by random chance. This is similar to the multiple comparisons problem (also referred to as the multiple testing problem) in statistics.
As an illustration of the multiple comparisons problem, consider the task of predicting whether the stock market will rise or fall in the next ten trading days. If a stock analyst simply makes random guesses, the probability that her prediction is correct on any trading day is 0.5. However, the probability that she will predict correctly at least nine out of ten times is

$$\frac{\binom{10}{9} + \binom{10}{10}}{2^{10}} = 0.0107,$$

which is extremely low.

Suppose we are interested in choosing an investment advisor from a pool of 200 stock analysts. Our strategy is to select the analyst who makes the most number of correct predictions in the next ten trading days. The flaw in this strategy is that even if all the analysts make their predictions in a random fashion, the probability that at least one of them makes at least nine correct predictions is

$$1 - (1 - 0.0107)^{200} = 0.8847,$$

which is very high. Although each analyst has a low probability of predicting at least nine times correctly, considered together, we have a high probability of finding at least one analyst who can do so. However, there is no guarantee in the future that such an analyst will continue to make accurate predictions by random guessing.
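Both probabilities are easy to verify with a few lines of Python (ours, not from the text):

```python
# Verify the two probabilities in the stock-analyst illustration.
from math import comb

# P(a single random guesser is correct on at least 9 of 10 days), with p = 0.5 per day.
p_single = (comb(10, 9) + comb(10, 10)) / 2 ** 10
print(round(p_single, 4))                 # 0.0107

# P(at least one of 200 independent random guessers achieves this).
p_any = 1 - (1 - p_single) ** 200
print(round(p_any, 4))                    # 0.8847
```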
How does the multiple comparisons problem relate to model overfitting? In the context of learning a classification model, each combination of parameter values corresponds to an analyst, while the number of training instances corresponds to the number of days. Analogous to the task of selecting the best analyst who makes the most accurate predictions on consecutive days, the task of a learning algorithm is to select the best combination of parameters that results in the highest accuracy on the training set. If the number of parameter combinations is large but the training size is small, it is highly likely for the learning algorithm to choose a spurious parameter combination that provides high training accuracy just by random chance. In the following example, we illustrate the phenomenon of overfitting due to multiple comparisons in the context of decision tree induction.
Figure 3.26. Example of a two-dimensional (X-Y) data set.

Figure 3.27. Training and test error rates illustrating the effect of the multiple comparisons problem on model overfitting.

Example 3.7. Multiple Comparisons and Overfitting
Consider the two-dimensional data set shown in Figure 3.26 containing 500 + and 500 ∘ instances, which is similar to the data shown in Figure 3.19. In this data set, the distributions of both classes are well-separated in the two-dimensional (X-Y) attribute space, but neither of the two attributes (X or Y) is sufficiently informative to be used alone for separating the two classes. Hence, splitting the data set based on any value of the X or Y attribute will provide close to zero reduction in an impurity measure. However, if X and Y are used together in the splitting criterion (e.g., splitting X at 10 and Y at 10), the two classes can be effectively separated.

Figure 3.28. Decision tree with 6 leaf nodes using X and Y as attributes. Splits have been numbered from 1 to 5 in order of their occurrence in the tree.

Figure 3.27(a) shows the training and test error rates for learning decision trees of varying sizes, when 30% of the data is used for training and the remainder of the data for testing. We can see that the two classes can be separated using a small number of leaf nodes. Figure 3.28 shows the decision boundaries for the tree with six leaf nodes, where the splits have been numbered according to their order of appearance in the tree. Note that even though splits 1 and 3 provide trivial gains, their consequent splits (2, 4, and 5) provide large gains, resulting in effective discrimination of the two classes.

Assume we add 100 irrelevant attributes to the two-dimensional X-Y data. Learning a decision tree from this resultant data will be challenging because the number of candidate attributes to choose for splitting at every internal node will increase from two to 102. With such a large number of candidate attribute test conditions to choose from, it is quite likely that spurious attribute test conditions will be selected at internal nodes because of the multiple comparisons problem. Figure 3.27(b) shows the training and test error rates after adding 100 irrelevant attributes to the training set. We can see that the test error rate remains close to 0.5 even after using 50 leaf nodes, while the training error rate keeps on declining and eventually becomes 0.
3.5ModelSelectionTherearemanypossibleclassificationmodelswithvaryinglevelsofmodelcomplexitythatcanbeusedtocapturepatternsinthetrainingdata.Amongthesepossibilities,wewanttoselectthemodelthatshowslowestgeneralizationerrorrate.Theprocessofselectingamodelwiththerightlevelofcomplexity,whichisexpectedtogeneralizewelloverunseentestinstances,isknownasmodelselection.Asdescribedintheprevioussection,thetrainingerrorratecannotbereliablyusedasthesolecriterionformodelselection.Inthefollowing,wepresentthreegenericapproachestoestimatethegeneralizationperformanceofamodelthatcanbeusedformodelselection.Weconcludethissectionbypresentingspecificstrategiesforusingtheseapproachesinthecontextofdecisiontreeinduction.
3.5.1UsingaValidationSet
Notethatwecanalwaysestimatethegeneralizationerrorrateofamodelbyusing“out-of-sample”estimates,i.e.byevaluatingthemodelonaseparatevalidationsetthatisnotusedfortrainingthemodel.Theerrorrateonthevalidationset,termedasthevalidationerrorrate,isabetterindicatorofgeneralizationperformancethanthetrainingerrorrate,sincethevalidationsethasnotbeenusedfortrainingthemodel.Thevalidationerrorratecanbeusedformodelselectionasfollows.
GivenatrainingsetD.train,wecanpartitionD.trainintotwosmallersubsets,D.trandD.val,suchthatD.trisusedfortrainingwhileD.valisusedasthevalidationset.Forexample,two-thirdsofD.traincanbereservedasD.trfor
training,whiletheremainingone-thirdisusedasD.valforcomputingvalidationerrorrate.ForanychoiceofclassificationmodelmthatistrainedonD.tr,wecanestimateitsvalidationerrorrateonD.val, .Themodelthatshowsthelowestvalueof canthenbeselectedasthepreferredchoiceofmodel.
Theuseofvalidationsetprovidesagenericapproachformodelselection.However,onelimitationofthisapproachisthatitissensitivetothesizesofD.trandD.val,obtainedbypartitioningD.train.IfthesizeofD.tristoosmall,itmayresultinthelearningofapoorclassificationmodelwithsub-standardperformance,sinceasmallertrainingsetwillbelessrepresentativeoftheoveralldata.Ontheotherhand,ifthesizeofD.valistoosmall,thevalidationerrorratemightnotbereliableforselectingmodels,asitwouldbecomputedoverasmallnumberofinstances.
Figure 3.29. Class distribution of validation data for the two decision trees shown in Figure 3.30.
Example 3.8. Validation Error. In the following example, we illustrate one possible approach for using a validation set in decision tree induction. Figure 3.29 shows the predicted labels at the leaf nodes of the decision trees generated in Figure 3.30. The counts given beneath the leaf nodes represent the proportion of data instances in the validation set that reach each of the nodes. Based on the predicted labels of the nodes, the validation error rate for the left tree is errval(TL) = 6/16 = 0.375, while the validation error rate for the right tree is errval(TR) = 4/16 = 0.25. Based on their validation error rates, the right tree is preferred over the left one.

3.5.2 Incorporating Model Complexity

Since the chance for model overfitting increases as the model becomes more complex, a model selection approach should not only consider the training error rate but also the model complexity. This strategy is inspired by a well-known principle known as Occam's razor or the principle of parsimony, which suggests that given two models with the same errors, the simpler model is preferred over the more complex model. A generic approach to account for model complexity while estimating generalization performance is formally described as follows.

Given a training set D.train, let us consider learning a classification model m that belongs to a certain class of models, M. For example, if M represents the set of all possible decision trees, then m can correspond to a specific decision tree learned from the training set. We are interested in estimating the generalization error rate of m, gen.error(m). As discussed previously, the training error rate of m, train.error(m, D.train), can under-estimate gen.error(m) when the model complexity is high. Hence, we represent gen.error(m) as a function of not just the training error rate but also the model complexity of M, complexity(M), as follows:

gen.error(m) = train.error(m, D.train) + α × complexity(M),    (3.11)

where α is a hyper-parameter that strikes a balance between minimizing training error and reducing model complexity. A higher value of α gives more emphasis to the model complexity in the estimation of generalization performance. To choose the right value of α, we can make use of the validation set in a similar way as described in Section 3.5.1. For example, we can iterate through a range of values of α and, for every possible value, learn a model on a subset of the training set, D.tr, and compute its validation error rate on a separate subset, D.val. We can then select the value of α that provides the lowest validation error rate.

Equation 3.11 provides one possible approach for incorporating model complexity into the estimate of generalization performance. This approach is at the heart of a number of techniques for estimating generalization performance, such as the structural risk minimization principle, the Akaike Information Criterion (AIC), and the Bayesian Information Criterion (BIC). The structural risk minimization principle serves as the building block for learning support vector machines, which will be discussed later in Chapter 4. For more details on AIC and BIC, see the Bibliographic Notes.

In the following, we present two different approaches for estimating the complexity of a model, complexity(M). While the former is specific to decision trees, the latter is more generic and can be used with any class of models.
Estimating the Complexity of Decision Trees

In the context of decision trees, the complexity of a decision tree can be estimated as the ratio of the number of leaf nodes to the number of training instances. Let k be the number of leaf nodes and Ntrain be the number of training instances. The complexity of a decision tree can then be described as k/Ntrain. This reflects the intuition that for a larger training size, we can learn a decision tree with a larger number of leaf nodes without it becoming overly complex. The generalization error rate of a decision tree T can then be computed using Equation 3.11 as follows:

errgen(T) = err(T) + Ω × k/Ntrain,

where err(T) is the training error of the decision tree and Ω is a hyper-parameter that makes a trade-off between reducing training error and minimizing model complexity, similar to the use of α in Equation 3.11. Ω can be viewed as the relative cost of adding a leaf node relative to incurring a training error. In the literature on decision tree induction, the above approach for estimating the generalization error rate is also termed the pessimistic error estimate. It is called pessimistic as it assumes the generalization error rate to be worse than the training error rate (by adding a penalty term for model complexity). On the other hand, simply using the training error rate as an estimate of the generalization error rate is called the optimistic error estimate or the resubstitution estimate.

Example 3.9. Generalization Error Estimates. Consider the two binary decision trees, TL and TR, shown in Figure 3.30. Both trees are generated from the same training data, and TL is generated by expanding three leaf nodes of TR. The counts shown in the leaf nodes of the trees represent the class distribution of the training instances. If each leaf node is labeled according to the majority class of training instances that reach the node, the training error rate for the left tree will be err(TL) = 4/24 = 0.167, while the training error rate for the right tree will be err(TR) = 6/24 = 0.25. Based on their training error rates alone, TL would be preferred over TR, even though TL is more complex (contains a larger number of leaf nodes) than TR.
Figure 3.30. Example of two decision trees generated from the same training data.
Now,assumethatthecostassociatedwitheachleafnodeis .Then,thegeneralizationerrorestimatefor willbe
andthegeneralizationerrorestimatefor willbe
err(TL)=4/24=0.167err(TR)=6/24=0.25
TL TR TLTR
Ω=0.5TL
errgen(TL)=424+0.5×724=7.524=0.3125
TR
errgen(TR)=624+0.5×424=824=0.3333.
Since hasalowergeneralizationerrorrate,itwillstillbepreferredover.Notethat impliesthatanodeshouldalwaysbeexpandedinto
itstwochildnodesifitimprovesthepredictionofatleastonetraininginstance,sinceexpandinganodeislesscostlythanmisclassifyingatraininginstance.Ontheotherhand,if ,thenthegeneralizationerrorratefor is andfor is
.Inthiscase, willbepreferredoverbecauseithasalowergeneralizationerrorrate.Thisexampleillustratesthatdifferentchoicesof canchangeourpreferenceofdecisiontreesbasedontheirgeneralizationerrorestimates.However,foragivenchoiceof ,thepessimisticerrorestimateprovidesanapproachformodelingthegeneralizationperformanceonunseentestinstances.Thevalueof canbeselectedwiththehelpofavalidationset.
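A small sketch of the pessimistic error estimate is given below; the helper simply evaluates errgen(T) = err(T) + Ω × k/Ntrain for given counts, and the printed values reproduce Example 3.9 under the two choices of Ω discussed above. The function name is illustrative.

```python
# Pessimistic generalization error estimate: err_gen(T) = err(T) + Omega * k / N_train.
def pessimistic_error(num_errors, n_train, num_leaves, omega):
    return (num_errors + omega * num_leaves) / n_train

# Values from Example 3.9 (TL: 4 errors, 7 leaves; TR: 6 errors, 4 leaves; 24 instances).
print(pessimistic_error(4, 24, 7, omega=0.5))   # 0.3125 -> TL preferred when Omega = 0.5
print(pessimistic_error(6, 24, 4, omega=0.5))   # 0.3333
print(pessimistic_error(4, 24, 7, omega=1.0))   # 0.4583 -> TR preferred when Omega = 1
print(pessimistic_error(6, 24, 4, omega=1.0))   # 0.4167
```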
Minimum Description Length Principle

Another way to incorporate model complexity is based on an information-theoretic approach known as the minimum description length or MDL principle. To illustrate this approach, consider the example shown in Figure 3.31. In this example, both person A and person B are given a set of instances with known attribute values x. Assume person A knows the class label y for every instance, while person B has no such information. A would like to share the class information with B by sending a message containing the labels. The message would contain Θ(N) bits of information, where N is the number of instances.
Figure 3.31. An illustration of the minimum description length principle.
Alternatively, instead of sending the class labels explicitly, A can build a classification model from the instances and transmit it to B. B can then apply the model to determine the class labels of the instances. If the model is 100% accurate, then the cost for transmission is equal to the number of bits required to encode the model. Otherwise, A must also transmit information about which instances are misclassified by the model so that B can reproduce the same class labels. Thus, the overall transmission cost, which is equal to the total description length of the message, is

Cost(model, data) = Cost(data|model) + α × Cost(model),    (3.12)

where the first term on the right-hand side is the number of bits needed to encode the misclassified instances, while the second term is the number of bits required to encode the model. There is also a hyper-parameter α that trades off the relative costs of the misclassified instances and the model.

Notice the similarity between this equation and the generic equation for the generalization error rate presented in Equation 3.11. A good model must have a total description length less than the number of bits required to encode the entire sequence of class labels. Furthermore, given two competing models, the model with the lower total description length is preferred. An example showing how to compute the total description length of a decision tree is given in Exercise 10 on page 189.
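As an illustration of Equation 3.12, the sketch below compares the description length of a tree-plus-mistakes message against sending the class labels directly. The particular encoding costs (log2 m bits per internal node, log2 k bits per leaf label, log2 N bits per misclassified instance) and the numbers in the usage example are assumptions chosen for illustration, not a prescription from the text.

```python
# A rough MDL comparison under assumed encoding costs (illustrative only).
import math

def description_length(n_instances, n_misclassified, n_internal, n_leaves,
                       n_attributes, n_classes, alpha=1.0):
    # Cost(data | model): identify each misclassified instance (log2 N bits each).
    cost_data = n_misclassified * math.log2(n_instances)
    # Cost(model): encode the splitting attribute at internal nodes and the label at leaves.
    cost_model = (n_internal * math.log2(n_attributes)
                  + n_leaves * math.log2(n_classes))
    return cost_data + alpha * cost_model

def cost_of_sending_labels(n_instances, n_classes):
    return n_instances * math.log2(n_classes)   # Theta(N) bits

# A hypothetical tree with 3 internal nodes and 4 leaves on 100 instances,
# 16 binary attributes, 2 classes, and 10 misclassified instances.
print(description_length(100, 10, 3, 4, 16, 2))   # about 82.4 bits
print(cost_of_sending_labels(100, 2))             # 100 bits
```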
3.5.3 Estimating Statistical Bounds

Instead of using Equation 3.11 to estimate the generalization error rate of a model, an alternative way is to apply a statistical correction to the training error rate of the model that is indicative of its model complexity. This can be done if the probability distribution of training error is available or can be assumed. For example, the number of errors committed by a leaf node in a decision tree can be assumed to follow a binomial distribution. We can thus compute an upper bound to the observed training error rate that can be used for model selection, as illustrated in the following example.

Example 3.10. Statistical Bounds on Training Error. Consider the left-most branch of the binary decision trees shown in Figure 3.30. Observe that the left-most leaf node of TR has been expanded into two child nodes in TL. Before splitting, the training error rate of the node is 2/7 = 0.286. By approximating a binomial distribution with a normal distribution, the following upper bound of the training error rate e can be derived:

e_upper(N, e, α) = (e + z²_{α/2}/(2N) + z_{α/2} √(e(1−e)/N + z²_{α/2}/(4N²))) / (1 + z²_{α/2}/N),    (3.13)

where α is the confidence level, z_{α/2} is the standardized value from a standard normal distribution, and N is the total number of training instances used to compute e. By replacing α = 25%, N = 7, and e = 2/7, the upper bound for the error rate is e_upper(7, 2/7, 0.25) = 0.503, which corresponds to 7 × 0.503 = 3.521 errors. If we expand the node into its child nodes as shown in TL, the training error rates for the child nodes are 1/4 = 0.250 and 1/3 = 0.333, respectively. Using Equation (3.13), the upper bounds of these error rates are e_upper(4, 1/4, 0.25) = 0.537 and e_upper(3, 1/3, 0.25) = 0.650, respectively. The overall training error of the child nodes is 4 × 0.537 + 3 × 0.650 = 4.098, which is larger than the estimated error for the corresponding node in TR, suggesting that it should not be split.
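The following sketch evaluates the upper bound of Equation 3.13 and reproduces the numbers in Example 3.10. Using scipy's normal quantile function for z_{α/2} is an implementation assumption; the function name is illustrative.

```python
# Upper bound on the training error rate (Equation 3.13), assuming scipy is available.
from math import sqrt
from scipy.stats import norm

def e_upper(n, e, alpha):
    z = norm.ppf(1 - alpha / 2)            # z_{alpha/2} for a two-sided bound
    z2 = z * z
    num = e + z2 / (2 * n) + z * sqrt(e * (1 - e) / n + z2 / (4 * n * n))
    return num / (1 + z2 / n)

# Example 3.10: before splitting vs. after splitting the left-most node.
print(e_upper(7, 2/7, 0.25))                                   # ~0.503 -> 7 * 0.503 = 3.52 errors
print(4 * e_upper(4, 1/4, 0.25) + 3 * e_upper(3, 1/3, 0.25))   # ~4.10 errors after splitting
```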
3.5.4 Model Selection for Decision Trees

Building on the generic approaches presented above, we present two commonly used model selection strategies for decision tree induction.

Prepruning (Early Stopping Rule)

In this approach, the tree-growing algorithm is halted before generating a fully grown tree that perfectly fits the entire training data. To do this, a more restrictive stopping condition must be used; e.g., stop expanding a leaf node when the observed gain in the generalization error estimate falls below a certain threshold. This estimate of the generalization error rate can be computed using any of the approaches presented in the preceding three subsections, e.g., by using pessimistic error estimates, by using validation error estimates, or by using statistical bounds. The advantage of prepruning is that it avoids the computations associated with generating overly complex subtrees that overfit the training data. However, one major drawback of this method is that, even if no significant gain is obtained using one of the existing splitting criteria, subsequent splitting may result in better subtrees. Such subtrees would not be reached if prepruning is used, because of the greedy nature of decision tree induction.
Post-pruning

In this approach, the decision tree is initially grown to its maximum size. This is followed by a tree-pruning step, which proceeds to trim the fully grown tree in a bottom-up fashion. Trimming can be done by replacing a subtree with (1) a new leaf node whose class label is determined from the majority class of instances affiliated with the subtree (an approach known as subtree replacement), or (2) the most frequently used branch of the subtree (an approach known as subtree raising). The tree-pruning step terminates when no further improvement in the generalization error estimate is observed beyond a certain threshold. Again, the estimates of the generalization error rate can be computed using any of the approaches presented in the previous three subsections. Post-pruning tends to give better results than prepruning because it makes pruning decisions based on a fully grown tree, unlike prepruning, which can suffer from premature termination of the tree-growing process. However, for post-pruning, the additional computations needed to grow the full tree may be wasted when the subtree is pruned.
Figure 3.32 illustrates the simplified decision tree model for the web robot detection example given in Section 3.3.5. Notice that the subtree rooted at breadth <= 7 has been replaced by one of its branches corresponding to depth = 1, width > 3, and MultiP = 1, using subtree raising. On the other hand, the subtree corresponding to depth > 1 and MultiAgent = 0 has been replaced by a leaf node assigned to class 0, using subtree replacement. The subtree for depth > 1 and MultiAgent = 1 remains intact.

Figure 3.32. Post-pruning of the decision tree for web robot detection.
3.6 Model Evaluation

The previous section discussed several approaches for model selection that can be used to learn a classification model from a training set D.train. Here we discuss methods for estimating its generalization performance, i.e., its performance on unseen instances outside of D.train. This process is known as model evaluation.

Note that the model selection approaches discussed in Section 3.5 also compute an estimate of the generalization performance using the training set D.train. However, these estimates are biased indicators of the performance on unseen instances, since they were used to guide the selection of the classification model. For example, if we use the validation error rate for model selection (as described in Section 3.5.1), the resulting model would be deliberately chosen to minimize the errors on the validation set. The validation error rate may thus under-estimate the true generalization error rate, and hence cannot be reliably used for model evaluation.

A correct approach for model evaluation would be to assess the performance of a learned model on a labeled test set that has not been used at any stage of model selection. This can be achieved by partitioning the entire set of labeled instances D into two disjoint subsets: D.train, which is used for model selection, and D.test, which is used for computing the test error rate, errtest. In the following, we present two different approaches for partitioning D into D.train and D.test, and computing the test error rate, errtest.

3.6.1 Holdout Method
The most basic technique for partitioning a labeled data set is the holdout method, where the labeled set D is randomly partitioned into two disjoint sets, called the training set D.train and the test set D.test. A classification model is then induced from D.train using the model selection approaches presented in Section 3.5, and its error rate on D.test, errtest, is used as an estimate of the generalization error rate. The proportion of data reserved for training and for testing is typically at the discretion of the analysts, e.g., two-thirds for training and one-third for testing.

Similar to the trade-off faced while partitioning D.train into D.tr and D.val in Section 3.5.1, choosing the right fraction of labeled data to be used for training and testing is non-trivial. If the size of D.train is small, the learned classification model may be improperly learned using an insufficient number of training instances, resulting in a biased estimation of generalization performance. On the other hand, if the size of D.test is small, errtest may be less reliable as it would be computed over a small number of test instances. Moreover, errtest can have a high variance as we change the random partitioning of D into D.train and D.test.

The holdout method can be repeated several times to obtain a distribution of the test error rates, an approach known as random subsampling or the repeated holdout method. This method produces a distribution of the error rates that can be used to understand the variance of errtest.
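A minimal sketch of the holdout method and its repeated (random subsampling) variant is given below, assuming scikit-learn and NumPy arrays X and y; the one-third test fraction, the decision tree learner, and the function name are illustrative choices.

```python
# Holdout and repeated holdout (random subsampling), illustrative sketch.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

def repeated_holdout(X, y, n_repeats=10, test_size=1/3):
    errors = []
    for seed in range(n_repeats):
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, random_state=seed)
        model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
        errors.append(1.0 - model.score(X_test, y_test))   # err_test for this split
    return np.mean(errors), np.std(errors)   # summary of the distribution of test error rates
```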
3.6.2 Cross-Validation

Cross-validation is a widely-used model evaluation method that aims to make effective use of all labeled instances in D for both training and testing. To illustrate this method, suppose that we are given a labeled set that we have randomly partitioned into three equal-sized subsets, S1, S2, and S3, as shown in Figure 3.33. For the first run, we train a model using subsets S2 and S3 (shown as empty blocks) and test the model on subset S1. The test error rate on S1, denoted as err(S1), is thus computed in the first run. Similarly, for the second run, we use S1 and S3 as the training set and S2 as the test set, to compute the test error rate, err(S2), on S2. Finally, we use S1 and S2 for training in the third run, while S3 is used for testing, thus resulting in the test error rate err(S3) for S3. The overall test error rate is obtained by summing up the number of errors committed in each test subset across all runs and dividing it by the total number of instances. This approach is called three-fold cross-validation.

Figure 3.33. Example demonstrating the technique of 3-fold cross-validation.
The k-fold cross-validation method generalizes this approach by segmenting the labeled data D (of size N) into k equal-sized partitions (or folds). During the i-th run, one of the partitions of D is chosen as D.test(i) for testing, while the rest of the partitions are used as D.train(i) for training. A model m(i) is learned using D.train(i) and applied on D.test(i) to obtain the sum of test errors, errsum(i). This procedure is repeated k times. The total test error rate, errtest, is then computed as

errtest = (1/N) Σ_{i=1}^{k} errsum(i).    (3.14)

Every instance in the data is thus used for testing exactly once and for training exactly (k−1) times. Also, every run uses (k−1)/k fraction of the data for training and 1/k fraction for testing.

The right choice of k in k-fold cross-validation depends on a number of characteristics of the problem. A small value of k will result in a smaller training set at every run, which will result in a larger estimate of the generalization error rate than what is expected of a model trained over the entire labeled set. On the other hand, a high value of k results in a larger training set at every run, which reduces the bias in the estimate of the generalization error rate. In the extreme case, when k = N, every run uses exactly one data instance for testing and the remainder of the data for training. This special case of k-fold cross-validation is called the leave-one-out approach. This approach has the advantage of utilizing as much data as possible for training. However, leave-one-out can produce quite misleading results in some special scenarios, as illustrated in Exercise 11. Furthermore, leave-one-out can be computationally expensive for large data sets as the cross-validation procedure needs to be repeated N times. For most practical applications, a choice of k between 5 and 10 provides a reasonable approach for estimating the generalization error rate, because each fold is able to make use of 80% to 90% of the labeled data for training.

The k-fold cross-validation method, as described above, produces a single estimate of the generalization error rate, without providing any information about the variance of the estimate. To obtain this information, we can run k-fold cross-validation for every possible partitioning of the data into k partitions, and obtain a distribution of test error rates computed for every such partitioning. The average test error rate across all possible partitionings serves as a more robust estimate of the generalization error rate. This approach of estimating the generalization error rate and its variance is known as the complete cross-validation approach. Even though such an estimate is quite robust, it is usually too expensive to consider all possible partitionings of a large data set into k partitions. A more practical solution is to repeat the cross-validation approach multiple times, using a different random partitioning of the data into k partitions every time, and use the average test error rate as the estimate of the generalization error rate. Note that since there is only one possible partitioning for the leave-one-out approach, it is not possible to estimate the variance of the generalization error rate, which is another limitation of this method.
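The sketch below computes errtest as in Equation 3.14 by pooling the errors committed on every test fold. It assumes scikit-learn's KFold for the partitioning, a decision tree as the model being evaluated, and NumPy arrays X and y; these are illustrative assumptions.

```python
# k-fold cross-validation estimate of err_test (Equation 3.14), illustrative sketch.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

def kfold_error(X, y, k=10):
    total_errors = 0
    for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
        model = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
        predictions = model.predict(X[test_idx])
        total_errors += np.sum(predictions != y[test_idx])   # err_sum(i) for this fold
    return total_errors / len(y)   # sum of fold errors divided by N
```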
The k-fold cross-validation does not guarantee that the fraction of positive and negative instances in every partition of the data is equal to the fraction observed in the overall data. A simple solution to this problem is to perform a stratified sampling of the positive and negative instances into the k partitions, an approach called stratified cross-validation.
In k-fold cross-validation, a different model is learned at every run, and the performance of every one of the k models on their respective test folds is then aggregated to compute the overall test error rate, errtest. Hence, errtest does not reflect the generalization error rate of any of the k models. Instead, it reflects the expected generalization error rate of the model selection approach when applied on a training set of the same size as one of the training folds (N(k−1)/k). This is different than the errtest computed in the holdout method, which exactly corresponds to the specific model learned over D.train. Hence, although effectively utilizing every data instance in D for training and testing, the errtest computed in the cross-validation method does not represent the performance of a single model learned over a specific D.train.

Nonetheless, in practice, errtest is typically used as an estimate of the generalization error of a model built on D. One motivation for this is that when the size of the training folds is closer to the size of the overall data (when k is large), then errtest resembles the expected performance of a model learned over a data set of the same size as D. For example, when k is 10, every training fold is 90% of the overall data. The errtest then should approach the expected performance of a model learned over 90% of the overall data, which will be close to the expected performance of a model learned over D.
3.7 Presence of Hyper-parameters

Hyper-parameters are parameters of learning algorithms that need to be determined before learning the classification model. For instance, consider the hyper-parameter α that appeared in Equation 3.11, which is repeated here for convenience:

gen.error(m) = train.error(m, D.train) + α × complexity(M).

This equation was used for estimating the generalization error for a model selection approach that used an explicit representation of model complexity. (See Section 3.5.2.)

For other examples of hyper-parameters, see Chapter 4.

Unlike regular model parameters, such as the test conditions in the internal nodes of a decision tree, hyper-parameters such as α do not appear in the final classification model that is used to classify unlabeled instances. However, the values of hyper-parameters need to be determined during model selection, a process known as hyper-parameter selection, and must be taken into account during model evaluation. Fortunately, both tasks can be effectively accomplished via slight modifications of the cross-validation approach described in the previous section.

3.7.1 Hyper-parameter Selection

In Section 3.5.2, a validation set was used to select α, and this approach is generally applicable for hyper-parameter selection. Let p be the hyper-parameter that needs to be selected from a finite range of values, P = {p1, p2, …, pn}. Partition D.train into D.tr and D.val. For every choice of hyper-parameter value pi, we can learn a model mi on D.tr, and apply this model on D.val to obtain the validation error rate errval(pi). Let p* be the hyper-parameter value that provides the lowest validation error rate. We can then use the model m* corresponding to p* as the final choice of classification model.
The above approach, although useful, uses only a subset of D.train (namely, D.tr) for training and only a subset (D.val) for validation. The framework of cross-validation, presented in Section 3.6.2, addresses both of those issues, albeit in the context of model evaluation. Here we indicate how to use a cross-validation approach for hyper-parameter selection. To illustrate this approach, let us partition D.train into three folds as shown in Figure 3.34. At every run, one of the folds is used as D.val for validation, and the remaining two folds are used as D.tr for learning a model, for every choice of hyper-parameter value pi. The overall validation error rate corresponding to each pi is computed by summing the errors across all three folds. We then select the hyper-parameter value p* that provides the lowest validation error rate, and use it to learn a model m* on the entire training set D.train.

Figure 3.34. Example demonstrating the 3-fold cross-validation framework for hyper-parameter selection using D.train.
Algorithm 3.2 generalizes the above approach using a k-fold cross-validation framework for hyper-parameter selection. At the i-th run of cross-validation, the data in the i-th fold is used as D.val(i) for validation (Step 4), while the remainder of the data in D.train is used as D.tr(i) for training (Step 5). Then, for every choice of hyper-parameter value pi, a model is learned on D.tr(i) (Step 7), which is applied on D.val(i) to compute its validation error (Step 8). This is used to compute the validation error rate corresponding to models learned using pi over all the folds (Step 11). The hyper-parameter value p* that provides the lowest validation error rate (Step 12) is then used to learn the final model m* on the entire training set D.train (Step 13). Hence, at the end of this algorithm, we obtain the best choice of the hyper-parameter value as well as the final classification model (Step 14), both of which are obtained by making an effective use of every data instance in D.train.

Algorithm 3.2 Procedure model-select(k, P, D.train)
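As a concrete illustration of this procedure, the sketch below performs hyper-parameter selection with k-fold cross-validation over D.train. It assumes scikit-learn and NumPy arrays; the decision tree's max_depth plays the role of the hyper-parameter p, and the candidate set P and function name model_select are illustrative.

```python
# Hyper-parameter selection by k-fold cross-validation over D.train (cf. Algorithm 3.2).
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

def model_select(k, P, X_train, y_train):
    fold_errors = {p: 0 for p in P}
    for tr_idx, val_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X_train):
        for p in P:                                   # learn a model on D.tr(i) for each p_i
            m = DecisionTreeClassifier(max_depth=p).fit(X_train[tr_idx], y_train[tr_idx])
            fold_errors[p] += np.sum(m.predict(X_train[val_idx]) != y_train[val_idx])
    # p* is the value with the lowest overall validation error across all folds.
    p_star = min(fold_errors, key=fold_errors.get)
    # Learn the final model m* on the entire training set D.train using p*.
    m_star = DecisionTreeClassifier(max_depth=p_star).fit(X_train, y_train)
    return p_star, m_star
```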
3.7.2 Nested Cross-Validation

The approach of the previous section provides a way to effectively use all the instances in D.train to learn a classification model when hyper-parameter selection is required. This approach can be applied over the entire data set D to learn the final classification model. However, applying Algorithm 3.2 on D would only return the final classification model m*, but not an estimate of its generalization performance, errtest. Recall that the validation error rates used in Algorithm 3.2 cannot be used as estimates of generalization performance, since they are used to guide the selection of the final model m*. However, to compute errtest, we can again use a cross-validation framework for evaluating the performance on the entire data set D, as described originally in Section 3.6.2. In this approach, D is partitioned into D.train (for training) and D.test (for testing) at every run of cross-validation. When hyper-parameters are involved, we can use Algorithm 3.2 to train a model using D.train at every run, thus "internally" using cross-validation for model selection. This approach is called nested cross-validation or double cross-validation. Algorithm 3.3 describes the complete approach for estimating errtest using nested cross-validation in the presence of hyper-parameters.

As an illustration of this approach, see Figure 3.35, where the labeled set D is partitioned into D.train and D.test using a 3-fold cross-validation method.

Figure 3.35. Example demonstrating 3-fold nested cross-validation for computing errtest.
At the i-th run of this method, one of the folds is used as the test set, D.test(i), while the remaining two folds are used as the training set, D.train(i). This is represented in Figure 3.35 as the i-th "outer" run. In order to select a model using D.train(i), we again use an "inner" 3-fold cross-validation framework that partitions D.train(i) into D.tr and D.val at every one of the three inner runs (iterations). As described in Section 3.7, we can use the inner cross-validation framework to select the best hyper-parameter value p*(i) as well as its resulting classification model m*(i) learned over D.train(i). We can then apply m*(i) on D.test(i) to obtain the test error at the i-th outer run. By repeating this process for every outer run, we can compute the average test error rate, errtest, over the entire labeled set D. Note that in the above approach, the inner cross-validation framework is being used for model selection while the outer cross-validation framework is being used for model evaluation.

Algorithm 3.3 The nested cross-validation approach for computing errtest.
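The sketch below illustrates nested cross-validation in this spirit: the outer loop estimates errtest while the inner loop performs hyper-parameter selection on each D.train(i). It assumes scikit-learn and NumPy arrays, uses max_depth as the hyper-parameter, and the function name is illustrative; it is a sketch of the idea, not a reproduction of Algorithm 3.3.

```python
# Nested cross-validation for err_test with hyper-parameter selection (illustrative sketch).
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

def nested_cv_error(X, y, P, k_outer=3, k_inner=3):
    total_errors = 0
    for train_idx, test_idx in KFold(k_outer, shuffle=True, random_state=1).split(X):
        X_tr_all, y_tr_all = X[train_idx], y[train_idx]     # D.train(i) for this outer run
        # Inner cross-validation over D.train(i) to pick the best hyper-parameter p*(i).
        inner_err = {p: 0 for p in P}
        for tr_idx, val_idx in KFold(k_inner, shuffle=True, random_state=2).split(X_tr_all):
            for p in P:
                m = DecisionTreeClassifier(max_depth=p).fit(X_tr_all[tr_idx], y_tr_all[tr_idx])
                inner_err[p] += np.sum(m.predict(X_tr_all[val_idx]) != y_tr_all[val_idx])
        p_star = min(inner_err, key=inner_err.get)
        # Train m*(i) on all of D.train(i) and evaluate it on D.test(i).
        m_star = DecisionTreeClassifier(max_depth=p_star).fit(X_tr_all, y_tr_all)
        total_errors += np.sum(m_star.predict(X[test_idx]) != y[test_idx])
    return total_errors / len(y)
```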
3.8 Pitfalls of Model Selection and Evaluation

Model selection and evaluation, when used effectively, serve as excellent tools for learning classification models and assessing their generalization performance. However, when using them in practical settings, there are several pitfalls that can result in improper and often misleading conclusions. Some of these pitfalls are simple to understand and easy to avoid, while others are quite subtle in nature and difficult to catch. In the following, we present two of these pitfalls and discuss best practices to avoid them.

3.8.1 Overlap between Training and Test Sets

One of the basic requirements of a clean model selection and evaluation setup is that the data used for model selection (D.train) must be kept separate from the data used for model evaluation (D.test). If there is any overlap between the two, the test error rate errtest computed over D.test cannot be considered representative of the performance on unseen instances. Comparing the effectiveness of classification models using errtest can then be quite misleading, as an overly complex model can show an inaccurately low value of errtest due to model overfitting (see Exercise 12 at the end of this chapter).

To illustrate the importance of ensuring no overlap between D.train and D.test, consider a labeled data set where all the attributes are irrelevant, i.e., they have no relationship with the class labels. Using such attributes, we should expect no classification model to perform better than random guessing. However, if the test set involves even a small number of data instances that were used for training, there is a possibility for an overly complex model to show better performance than random, even though the attributes are completely irrelevant. As we will see later in Chapter 10, this scenario can actually be used as a criterion to detect overfitting due to an improper experimental setup. If a model shows better performance than a random classifier even when the attributes are irrelevant, it is an indication of a potential feedback between the training and test sets.
3.8.2 Use of Validation Error as Generalization Error

The validation error rate errval serves an important role during model selection, as it provides "out-of-sample" error estimates of models on D.val, which is not used for training the models. Hence, errval serves as a better metric than the training error rate for selecting models and hyper-parameter values, as described in Sections 3.5.1 and 3.7, respectively. However, once the validation set has been used for selecting a classification model m*, errval no longer reflects the performance of m* on unseen instances.

To realize the pitfall of using the validation error rate as an estimate of generalization performance, consider the problem of selecting a hyper-parameter value p from a range of values P, using a validation set D.val. If the number of possible values in P is quite large and the size of D.val is small, it is possible to select a hyper-parameter value p* that shows favorable performance on D.val just by random chance. Notice the similarity of this problem with the multiple comparisons problem discussed in Section 3.4.1. Even though the classification model m* learned using p* would show a low validation error rate, it would lack generalizability on unseen test instances.

The correct approach for estimating the generalization error rate of a model m* is to use an independently chosen test set D.test that hasn't been used in any way to influence the selection of m*. As a rule of thumb, the test set should never be examined during model selection, to ensure the absence of any form of overfitting. If the insights gained from any portion of a labeled data set help in improving the classification model even in an indirect way, then that portion of data must be discarded during testing.
3.9 Model Comparison

One difficulty when comparing the performance of different classification models is whether the observed difference in their performance is statistically significant. For example, consider a pair of classification models, MA and MB. Suppose MA achieves 85% accuracy when evaluated on a test set containing 30 instances, while MB achieves 75% accuracy on a different test set containing 5000 instances. Based on this information, is MA a better model than MB? This example raises two key questions regarding the statistical significance of a performance metric:

1. Although MA has a higher accuracy than MB, it was tested on a smaller test set. How much confidence do we have that the accuracy for MA is actually 85%?

2. Is it possible to explain the difference in accuracies between MA and MB as a result of variations in the composition of their test sets?

The first question relates to the issue of estimating the confidence interval of model accuracy. The second question relates to the issue of testing the statistical significance of the observed deviation. These issues are investigated in the remainder of this section.

3.9.1 Estimating the Confidence Interval for Accuracy
To determine its confidence interval, we need to establish the probability distribution for sample accuracy. This section describes an approach for deriving the confidence interval by modeling the classification task as a binomial random experiment. The following describes the characteristics of such an experiment:

1. The random experiment consists of N independent trials, where each trial has two possible outcomes: success or failure.

2. The probability of success, p, in each trial is constant.

An example of a binomial experiment is counting the number of heads that turn up when a coin is flipped N times. If X is the number of successes observed in N trials, then the probability that X takes a particular value is given by a binomial distribution with mean Np and variance Np(1−p):

P(X = υ) = (N choose υ) p^υ (1−p)^(N−υ).

For example, if the coin is fair (p = 0.5) and is flipped fifty times, then the probability that the head shows up 20 times is

P(X = 20) = (50 choose 20) 0.5^20 (1−0.5)^30 = 0.0419.

If the experiment is repeated many times, then the average number of heads expected to show up is 50 × 0.5 = 25, while its variance is 50 × 0.5 × 0.5 = 12.5.

The task of predicting the class labels of test instances can also be considered as a binomial experiment. Given a test set that contains N instances, let X be the number of instances correctly predicted by a model and p be the true accuracy of the model. If the prediction task is modeled as a binomial experiment, then X has a binomial distribution with mean Np and variance Np(1−p). It can be shown that the empirical accuracy, acc = X/N, also has a binomial distribution with mean p and variance p(1−p)/N (see Exercise 14). The binomial distribution can be approximated by a normal distribution when N is sufficiently large. Based on the normal distribution, the confidence interval for acc can be derived as follows:

P(−Z_{α/2} ≤ (acc − p) / √(p(1−p)/N) ≤ Z_{1−α/2}) = 1 − α,    (3.15)

where Z_{α/2} and Z_{1−α/2} are the upper and lower bounds obtained from a standard normal distribution at confidence level (1−α). Since a standard normal distribution is symmetric around Z = 0, it follows that Z_{α/2} = Z_{1−α/2}. Rearranging this inequality leads to the following confidence interval for p:

(2 × N × acc + Z²_{α/2} ± Z_{α/2} √(Z²_{α/2} + 4 × N × acc − 4 × N × acc²)) / (2(N + Z²_{α/2})).    (3.16)

The following table shows the values of Z_{α/2} at different confidence levels:

1−α       0.99  0.98  0.95  0.9   0.8   0.7   0.5
Z_{α/2}   2.58  2.33  1.96  1.65  1.28  1.04  0.67

Example 3.11. Confidence Interval for Accuracy. Consider a model that has an accuracy of 80% when evaluated on 100 test instances. What is the confidence interval for its true accuracy at a 95% confidence level? The confidence level of 95% corresponds to Z_{α/2} = 1.96 according to the table given above. Inserting this term into Equation 3.16 yields a confidence interval between 71.1% and 86.7%. The following table shows the confidence interval when the number of instances, N, increases:

N                     20            50            100           500           1000          5000
Confidence Interval   0.584−0.919   0.670−0.888   0.711−0.867   0.763−0.833   0.774−0.824   0.789−0.811

Note that the confidence interval becomes tighter when N increases.
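The sketch below evaluates Equation 3.16 and reproduces the interval of Example 3.11. Using scipy's normal quantile function in place of the table of Z_{α/2} values is an implementation assumption, and the function name is illustrative.

```python
# Confidence interval for the true accuracy p (Equation 3.16), assuming scipy is available.
from math import sqrt
from scipy.stats import norm

def accuracy_interval(acc, n, confidence=0.95):
    z = norm.ppf(1 - (1 - confidence) / 2)          # Z_{alpha/2}, e.g., 1.96 at 95%
    z2 = z * z
    center = 2 * n * acc + z2
    margin = z * sqrt(z2 + 4 * n * acc - 4 * n * acc * acc)
    denom = 2 * (n + z2)
    return (center - margin) / denom, (center + margin) / denom

# Example 3.11: accuracy of 80% on 100 test instances at 95% confidence.
print(accuracy_interval(0.80, 100))   # approximately (0.711, 0.867)
```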
3.9.2 Comparing the Performance of Two Models

Consider a pair of models, M1 and M2, which are evaluated on two independent test sets, D1 and D2. Let n1 denote the number of instances in D1 and n2 denote the number of instances in D2. In addition, suppose the error rate for M1 on D1 is e1 and the error rate for M2 on D2 is e2. Our goal is to test whether the observed difference between e1 and e2 is statistically significant.

Assuming that n1 and n2 are sufficiently large, the error rates e1 and e2 can be approximated using normal distributions. If the observed difference in the error rate is denoted as d = e1 − e2, then d is also normally distributed with mean dt, its true difference, and variance σ²_d. The variance of d can be computed as follows:

σ²_d ≃ σ̂²_d = e1(1−e1)/n1 + e2(1−e2)/n2,    (3.17)

where e1(1−e1)/n1 and e2(1−e2)/n2 are the variances of the error rates. Finally, at the (1−α)% confidence level, it can be shown that the confidence interval for the true difference dt is given by the following equation:

dt = d ± z_{α/2} σ̂_d.    (3.18)

Example 3.12. Significance Testing. Consider the problem described at the beginning of this section. Model MA has an error rate of e1 = 0.15 when applied to N1 = 30 test instances, while model MB has an error rate of e2 = 0.25 when applied to N2 = 5000 test instances. The observed difference in their error rates is d = |0.15 − 0.25| = 0.1. In this example, we are performing a two-sided test to check whether dt = 0 or dt ≠ 0. The estimated variance of the observed difference in error rates can be computed as follows:

σ̂²_d = 0.15(1 − 0.15)/30 + 0.25(1 − 0.25)/5000 = 0.0043,

or σ̂_d = 0.0655. Inserting this value into Equation 3.18, we obtain the following confidence interval for dt at the 95% confidence level:

dt = 0.1 ± 1.96 × 0.0655 = 0.1 ± 0.128.

As the interval spans the value zero, we can conclude that the observed difference is not statistically significant at a 95% confidence level.

At what confidence level can we reject the hypothesis that dt = 0? To do this, we need to determine the value of Z_{α/2} such that the confidence interval for dt does not span the value zero. We can reverse the preceding computation and look for the value Z_{α/2} such that d > Z_{α/2} σ̂_d. Replacing the values of d and σ̂_d gives Z_{α/2} < 1.527. This value first occurs when (1−α) ≲ 0.936 (for a two-sided test). The result suggests that the null hypothesis can be rejected at a confidence level of 93.6% or lower.
3.10BibliographicNotesEarlyclassificationsystemsweredevelopedtoorganizevariouscollectionsofobjects,fromlivingorganismstoinanimateones.Examplesabound,fromAristotle'scataloguingofspeciestotheDeweyDecimalandLibraryofCongressclassificationsystemsforbooks.Suchatasktypicallyrequiresconsiderablehumanefforts,bothtoidentifypropertiesoftheobjectstobeclassifiedandtoorganizethemintowelldistinguishedcategories.
Withthedevelopmentofstatisticsandcomputing,automatedclassificationhasbeenasubjectofintensiveresearch.Thestudyofclassificationinclassicalstatisticsissometimesknownasdiscriminantanalysis,wheretheobjectiveistopredictthegroupmembershipofanobjectbasedonitscorrespondingfeatures.Awell-knownclassicalmethodisFisher'slineardiscriminantanalysis[142],whichseekstofindalinearprojectionofthedatathatproducesthebestseparationbetweenobjectsfromdifferentclasses.
Manypatternrecognitionproblemsalsorequirethediscriminationofobjectsfromdifferentclasses.Examplesincludespeechrecognition,handwrittencharacteridentification,andimageclassification.ReaderswhoareinterestedintheapplicationofclassificationtechniquesforpatternrecognitionmayrefertothesurveyarticlesbyJainetal.[150]andKulkarnietal.[157]orclassicpatternrecognitionbooksbyBishop[125],Dudaetal.[137],andFukunaga[143].Thesubjectofclassificationisalsoamajorresearchtopicinneuralnetworks,statisticallearning,andmachinelearning.Anin-depthtreatmentonthetopicofclassificationfromthestatisticalandmachinelearningperspectivescanbefoundinthebooksbyBishop[126],CherkasskyandMulier[132],Hastieetal.[148],Michieetal.[162],Murphy[167],andMitchell[165].Recentyearshavealsoseenthereleaseofmanypubliclyavailable
softwarepackagesforclassification,whichcanbeembeddedinprogramminglanguagessuchasJava(Weka[147])andPython(scikit-learn[174]).
An overview of decision tree induction algorithms can be found in the survey articles by Buntine [129], Moret [166], Murthy [168], and Safavian et al. [179]. Examples of some well-known decision tree algorithms include CART [127], ID3 [175], C4.5 [177], and CHAID [153]. Both ID3 and C4.5 employ the entropy measure as their splitting function. An in-depth discussion of the C4.5 decision tree algorithm is given by Quinlan [177]. The CART algorithm was developed by Breiman et al. [127] and uses the Gini index as its splitting function. CHAID [153] uses the statistical χ² test to determine the best split during the tree-growing process.
Thedecisiontreealgorithmpresentedinthischapterassumesthatthesplittingconditionateachinternalnodecontainsonlyoneattribute.Anobliquedecisiontreecanusemultipleattributestoformtheattributetestconditioninasinglenode[149,187].Breimanetal.[127]provideanoptionforusinglinearcombinationsofattributesintheirCARTimplementation.OtherapproachesforinducingobliquedecisiontreeswereproposedbyHeathetal.[149],Murthyetal.[169],Cantú-PazandKamath[130],andUtgoffandBrodley[187].Althoughanobliquedecisiontreehelpstoimprovetheexpressivenessofthemodelrepresentation,thetreeinductionprocessbecomescomputationallychallenging.Anotherwaytoimprovetheexpressivenessofadecisiontreewithoutusingobliquedecisiontreesistoapplyamethodknownasconstructiveinduction[161].Thismethodsimplifiesthetaskoflearningcomplexsplittingfunctionsbycreatingcompoundfeaturesfromtheoriginaldata.
Besidesthetop-downapproach,otherstrategiesforgrowingadecisiontreeincludethebottom-upapproachbyLandeweerdetal.[159]andPattipatiandAlexandridis[173],aswellasthebidirectionalapproachbyKimand
Landgrebe[154].SchuermannandDoster[181]andWangandSuen[193]proposedusingasoftsplittingcriteriontoaddressthedatafragmentationproblem.Inthisapproach,eachinstanceisassignedtodifferentbranchesofthedecisiontreewithdifferentprobabilities.
Modeloverfittingisanimportantissuethatmustbeaddressedtoensurethatadecisiontreeclassifierperformsequallywellonpreviouslyunlabeleddatainstances.ThemodeloverfittingproblemhasbeeninvestigatedbymanyauthorsincludingBreimanetal.[127],Schaffer[180],Mingers[164],andJensenandCohen[151].Whilethepresenceofnoiseisoftenregardedasoneoftheprimaryreasonsforoverfitting[164,170],JensenandCohen[151]viewedoverfittingasanartifactoffailuretocompensateforthemultiplecomparisonsproblem.
Bishop[126]andHastieetal.[148]provideanexcellentdiscussionofmodeloverfitting,relatingittoawell-knownframeworkoftheoreticalanalysis,knownasbias-variancedecomposition[146].Inthisframework,thepredictionofalearningalgorithmisconsideredtobeafunctionofthetrainingset,whichvariesasthetrainingsetischanged.Thegeneralizationerrorofamodelisthendescribedintermsofitsbias(theerroroftheaveragepredictionobtainedusingdifferenttrainingsets),itsvariance(howdifferentarethepredictionsobtainedusingdifferenttrainingsets),andnoise(theirreducibleerrorinherenttotheproblem).Anunderfitmodelisconsideredtohavehighbiasbutlowvariance,whileanoverfitmodelisconsideredtohavelowbiasbuthighvariance.Althoughthebias-variancedecompositionwasoriginallyproposedforregressionproblems(wherethetargetattributeisacontinuousvariable),aunifiedanalysisthatisapplicableforclassificationhasbeenproposedbyDomingos[136].ThebiasvariancedecompositionwillbediscussedinmoredetailwhileintroducingensemblelearningmethodsinChapter4 .
Variouslearningprinciples,suchastheProbablyApproximatelyCorrect(PAC)learningframework[188],havebeendevelopedtoprovideatheoreticalframeworkforexplainingthegeneralizationperformanceoflearningalgorithms.Inthefieldofstatistics,anumberofperformanceestimationmethodshavebeenproposedthatmakeatrade-offbetweenthegoodnessoffitofamodelandthemodelcomplexity.MostnoteworthyamongthemaretheAkaike'sInformationCriterion[120]andtheBayesianInformationCriterion[182].Theybothapplycorrectivetermstothetrainingerrorrateofamodel,soastopenalizemorecomplexmodels.Anotherwidely-usedapproachformeasuringthecomplexityofanygeneralmodelistheVapnikChervonenkis(VC)Dimension[190].TheVCdimensionofaclassoffunctionsCisdefinedasthemaximumnumberofpointsthatcanbeshattered(everypointcanbedistinguishedfromtherest)byfunctionsbelongingtoC,foranypossibleconfigurationofpoints.TheVCdimensionlaysthefoundationofthestructuralriskminimizationprinciple[189],whichisextensivelyusedinmanylearningalgorithms,e.g.,supportvectormachines,whichwillbediscussedindetailinChapter4 .
TheOccam'srazorprincipleisoftenattributedtothephilosopherWilliamofOccam.Domingos[135]cautionedagainstthepitfallofmisinterpretingOccam'srazorascomparingmodelswithsimilartrainingerrors,insteadofgeneralizationerrors.Asurveyondecisiontree-pruningmethodstoavoidoverfittingisgivenbyBreslowandAha[128]andEspositoetal.[141].Someofthetypicalpruningmethodsincludereducederrorpruning[176],pessimisticerrorpruning[176],minimumerrorpruning[171],criticalvaluepruning[163],cost-complexitypruning[127],anderror-basedpruning[177].QuinlanandRivestproposedusingtheminimumdescriptionlengthprinciplefordecisiontreepruningin[178].
Thediscussionsinthischapteronthesignificanceofcross-validationerrorestimatesisinspiredfromChapter7 inHastieetal.[148].Itisalsoan
excellentresourceforunderstanding“therightandwrongwaystodocross-validation”,whichissimilartothediscussiononpitfallsinSection3.8 ofthischapter.Acomprehensivediscussionofsomeofthecommonpitfallsinusingcross-validationformodelselectionandevaluationisprovidedinKrstajicetal.[156].
Theoriginalcross-validationmethodwasproposedindependentlybyAllen[121],Stone[184],andGeisser[145]formodelassessment(evaluation).Eventhoughcross-validationcanbeusedformodelselection[194],itsusageformodelselectionisquitedifferentthanwhenitisusedformodelevaluation,asemphasizedbyStone[184].Overtheyears,thedistinctionbetweenthetwousageshasoftenbeenignored,resultinginincorrectfindings.Oneofthecommonmistakeswhileusingcross-validationistoperformpre-processingoperations(e.g.,hyper-parametertuningorfeatureselection)usingtheentiredatasetandnot“within”thetrainingfoldofeverycross-validationrun.Ambroiseetal.,usinganumberofgeneexpressionstudiesasexamples,[124]provideanextensivediscussionoftheselectionbiasthatariseswhenfeatureselectionisperformedoutsidecross-validation.UsefulguidelinesforevaluatingmodelsonmicroarraydatahavealsobeenprovidedbyAllisonetal.[122].
Theuseofthecross-validationprotocolforhyper-parametertuninghasbeendescribedindetailbyDudoitandvanderLaan[138].Thisapproachhasbeencalled“grid-searchcross-validation.”Thecorrectapproachinusingcross-validationforbothhyper-parameterselectionandmodelevaluation,asdiscussedinSection3.7 ofthischapter,isextensivelycoveredbyVarmaandSimon[191].Thiscombinedapproachhasbeenreferredtoas“nestedcross-validation”or“doublecross-validation”intheexistingliterature.Recently,TibshiraniandTibshirani[185]haveproposedanewapproachforhyper-parameterselectionandmodelevaluation.Tsamardinosetal.[186]comparedthisapproachtonestedcross-validation.Theexperimentsthey
performedfoundthat,onaverage,bothapproachesprovideconservativeestimatesofmodelperformancewiththeTibshiraniandTibshiraniapproachbeingmorecomputationallyefficient.
Kohavi[155]hasperformedanextensiveempiricalstudytocomparetheperformancemetricsobtainedusingdifferentestimationmethodssuchasrandomsubsamplingandk-foldcross-validation.Theirresultssuggestthatthebestestimationmethodisten-fold,stratifiedcross-validation.
An alternative approach for model evaluation is the bootstrap method, which was presented by Efron in 1979 [139]. In this method, training instances are sampled with replacement from the labeled set, i.e., an instance previously selected to be part of the training set is equally likely to be drawn again. If the original data has N instances, it can be shown that, on average, a bootstrap sample of size N contains about 63.2% of the instances in the original data. Instances that are not included in the bootstrap sample become part of the test set. The bootstrap procedure for obtaining training and test sets is repeated b times, resulting in a different error rate on the test set, err(i), at the i-th run. To obtain the overall error rate, errboot, the .632 bootstrap approach combines err(i) with the error rate obtained from a training set containing all the labeled examples, errs, as follows:

errboot = (1/b) Σ_{i=1}^{b} (0.632 × err(i) + 0.368 × errs).    (3.19)

Efron and Tibshirani [140] provided a theoretical and empirical comparison between cross-validation and a bootstrap method known as the .632+ rule.

While the .632 bootstrap method presented above provides a robust estimate of the generalization performance with low variance in its estimate, it may produce misleading results for highly complex models in certain conditions, as demonstrated by Kohavi [155]. This is because the overall error rate is not truly an out-of-sample error estimate, as it depends on the training error rate, errs, which can be quite small if there is overfitting.
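A sketch of the .632 bootstrap estimate described above is given below, assuming scikit-learn and NumPy arrays; errs is the resubstitution error on all labeled examples, and the decision tree learner and function name are illustrative choices.

```python
# .632 bootstrap estimate of the error rate (Equation 3.19), illustrative sketch.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bootstrap_632(X, y, b=100):
    n = len(y)
    rng = np.random.default_rng(0)
    # Resubstitution (optimistic) error err_s on the full labeled set.
    full_model = DecisionTreeClassifier(random_state=0).fit(X, y)
    err_s = np.mean(full_model.predict(X) != y)
    errors = []
    for _ in range(b):
        idx = rng.integers(0, n, size=n)                 # sample N instances with replacement
        oob = np.setdiff1d(np.arange(n), idx)            # instances left out form the test set
        if len(oob) == 0:
            continue
        m = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
        err_i = np.mean(m.predict(X[oob]) != y[oob])
        errors.append(0.632 * err_i + 0.368 * err_s)     # combine as in Equation 3.19
    return np.mean(errors)
```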
Current techniques such as C4.5 require that the entire training data set fit into main memory. There has been considerable effort to develop parallel and scalable versions of decision tree induction algorithms. Some of the proposed algorithms include SLIQ by Mehta et al. [160], SPRINT by Shafer et al. [183], CMP by Wang and Zaniolo [192], CLOUDS by Alsabti et al. [123], RainForest by Gehrke et al. [144], and ScalParC by Joshi et al. [152]. A survey of parallel algorithms for classification and other data mining tasks is given in [158]. More recently, there has been extensive research to implement large-scale classifiers on the compute unified device architecture (CUDA) [131, 134] and MapReduce [133, 172] platforms.
Bibliography[120]H.Akaike.Informationtheoryandanextensionofthemaximum
likelihoodprinciple.InSelectedPapersofHirotuguAkaike,pages199–213.Springer,1998.
[121]D.M.Allen.Therelationshipbetweenvariableselectionanddataagumentationandamethodforprediction.Technometrics,16(1):125–127,1974.
[122]D.B.Allison,X.Cui,G.P.Page,andM.Sabripour.Microarraydataanalysis:fromdisarraytoconsolidationandconsensus.Naturereviewsgenetics,7(1):55–65,2006.
[123]K.Alsabti,S.Ranka,andV.Singh.CLOUDS:ADecisionTreeClassifierforLargeDatasets.InProc.ofthe4thIntl.Conf.onKnowledgeDiscoveryandDataMining,pages2–8,NewYork,NY,August1998.
[124]C.AmbroiseandG.J.McLachlan.Selectionbiasingeneextractiononthebasisofmicroarraygene-expressiondata.Proceedingsofthenationalacademyofsciences,99(10):6562–6566,2002.
[125]C.M.Bishop.NeuralNetworksforPatternRecognition.OxfordUniversityPress,Oxford,U.K.,1995.
[126]C.M.Bishop.PatternRecognitionandMachineLearning.Springer,2006.
[127]L.Breiman,J.H.Friedman,R.Olshen,andC.J.Stone.ClassificationandRegressionTrees.Chapman&Hall,NewYork,1984.
[128]L.A.BreslowandD.W.Aha.SimplifyingDecisionTrees:ASurvey.KnowledgeEngineeringReview,12(1):1–40,1997.
[129]W.Buntine.Learningclassificationtrees.InArtificialIntelligenceFrontiersinStatistics,pages182–201.Chapman&Hall,London,1993.
[130]E.Cantú-PazandC.Kamath.Usingevolutionaryalgorithmstoinduceobliquedecisiontrees.InProc.oftheGeneticandEvolutionaryComputationConf.,pages1053–1060,SanFrancisco,CA,2000.
[131]B.Catanzaro,N.Sundaram,andK.Keutzer.Fastsupportvectormachinetrainingandclassificationongraphicsprocessors.InProceedingsofthe25thInternationalConferenceonMachineLearning,pages104–111,2008.
[132]V.CherkasskyandF.M.Mulier.LearningfromData:Concepts,Theory,andMethods.Wiley,2ndedition,2007.
[133]C.Chu,S.K.Kim,Y.-A.Lin,Y.Yu,G.Bradski,A.Y.Ng,andK.Olukotun.Map-reduceformachinelearningonmulticore.Advancesinneuralinformationprocessingsystems,19:281,2007.
[134]A.Cotter,N.Srebro,andJ.Keshet.AGPU-tailoredApproachforTrainingKernelizedSVMs.InProceedingsofthe17thACMSIGKDDInternationalConferenceonKnowledgeDiscoveryandDataMining,pages805–813,SanDiego,California,USA,2011.
[135]P.Domingos.TheRoleofOccam'sRazorinKnowledgeDiscovery.DataMiningandKnowledgeDiscovery,3(4):409–425,1999.
[136]P.Domingos.Aunifiedbias-variancedecomposition.InProceedingsof17thInternationalConferenceonMachineLearning,pages231–238,2000.
[137]R.O.Duda,P.E.Hart,andD.G.Stork.PatternClassification.JohnWiley&Sons,Inc.,NewYork,2ndedition,2001.
[138]S.DudoitandM.J.vanderLaan.Asymptoticsofcross-validatedriskestimationinestimatorselectionandperformanceassessment.StatisticalMethodology,2(2):131–154,2005.
[139]B.Efron.Bootstrapmethods:anotherlookatthejackknife.InBreakthroughsinStatistics,pages569–593.Springer,1992.
[140]B.EfronandR.Tibshirani.Cross-validationandtheBootstrap:EstimatingtheErrorRateofaPredictionRule.Technicalreport,StanfordUniversity,1995.
[141]F.Esposito,D.Malerba,andG.Semeraro.AComparativeAnalysisofMethodsforPruningDecisionTrees.IEEETrans.PatternAnalysisandMachineIntelligence,19(5):476–491,May1997.
[142]R.A.Fisher.Theuseofmultiplemeasurementsintaxonomicproblems.AnnalsofEugenics,7:179–188,1936.
[143]K.Fukunaga.IntroductiontoStatisticalPatternRecognition.AcademicPress,NewYork,1990.
[144]J.Gehrke,R.Ramakrishnan,andV.Ganti.RainForest—AFrameworkforFastDecisionTreeConstructionofLargeDatasets.DataMiningandKnowledgeDiscovery,4(2/3):127–162,2000.
[145]S.Geisser.Thepredictivesamplereusemethodwithapplications.JournaloftheAmericanStatisticalAssociation,70(350):320–328,1975.
[146]S.Geman,E.Bienenstock,andR.Doursat.Neuralnetworksandthebias/variancedilemma.Neuralcomputation,4(1):1–58,1992.
[147]M.Hall,E.Frank,G.Holmes,B.Pfahringer,P.Reutemann,andI.H.Witten.TheWEKADataMiningSoftware:AnUpdate.SIGKDDExplorations,11(1),2009.
[148]T.Hastie,R.Tibshirani,andJ.Friedman.TheElementsofStatisticalLearning:DataMining,Inference,andPrediction.Springer,2ndedition,2009.
[149]D.Heath,S.Kasif,andS.Salzberg.InductionofObliqueDecisionTrees.InProc.ofthe13thIntl.JointConf.onArtificialIntelligence,pages1002–1007,Chambery,France,August1993.
[150]A.K.Jain,R.P.W.Duin,andJ.Mao.StatisticalPatternRecognition:AReview.IEEETran.Patt.Anal.andMach.Intellig.,22(1):4–37,2000.
[151]D.JensenandP.R.Cohen.MultipleComparisonsinInductionAlgorithms.MachineLearning,38(3):309–338,March2000.
[152]M.V.Joshi,G.Karypis,andV.Kumar.ScalParC:ANewScalableandEfficientParallelClassificationAlgorithmforMiningLargeDatasets.InProc.of12thIntl.ParallelProcessingSymp.(IPPS/SPDP),pages573–579,Orlando,FL,April1998.
[153]G.V.Kass.AnExploratoryTechniqueforInvestigatingLargeQuantitiesofCategoricalData.AppliedStatistics,29:119–127,1980.
[154]B.KimandD.Landgrebe.Hierarchicaldecisionclassifiersinhigh-dimensionalandlargeclassdata.IEEETrans.onGeoscienceandRemoteSensing,29(4):518–528,1991.
[155]R.Kohavi.AStudyonCross-ValidationandBootstrapforAccuracyEstimationandModelSelection.InProc.ofthe15thIntl.JointConf.onArtificialIntelligence,pages1137–1145,Montreal,Canada,August1995.
[156]D.Krstajic,L.J.Buturovic,D.E.Leahy,andS.Thomas.Cross-validationpitfallswhenselectingandassessingregressionandclassificationmodels.Journalofcheminformatics,6(1):1,2014.
[157]S.R.Kulkarni,G.Lugosi,andS.S.Venkatesh.LearningPatternClassification—ASurvey.IEEETran.Inf.Theory,44(6):2178–2206,1998.
[158]V.Kumar,M.V.Joshi,E.-H.Han,P.N.Tan,andM.Steinbach.HighPerformanceDataMining.InHighPerformanceComputingforComputationalScience(VECPAR2002),pages111–125.Springer,2002.
[159]G.Landeweerd,T.Timmers,E.Gersema,M.Bins,andM.Halic.Binarytreeversussingleleveltreeclassificationofwhitebloodcells.PatternRecognition,16:571–577,1983.
[160]M.Mehta,R.Agrawal,andJ.Rissanen.SLIQ:AFastScalableClassifierforDataMining.InProc.ofthe5thIntl.Conf.onExtendingDatabaseTechnology,pages18–32,Avignon,France,March1996.
[161]R.S.Michalski.Atheoryandmethodologyofinductivelearning.ArtificialIntelligence,20:111–116,1983.
[162]D.Michie,D.J.Spiegelhalter,andC.C.Taylor.MachineLearning,NeuralandStatisticalClassification.EllisHorwood,UpperSaddleRiver,NJ,1994.
[163]J.Mingers.ExpertSystems—RuleInductionwithStatisticalData.JOperationalResearchSociety,38:39–47,1987.
[164]J.Mingers.Anempiricalcomparisonofpruningmethodsfordecisiontreeinduction.MachineLearning,4:227–243,1989.
[165]T.Mitchell.MachineLearning.McGraw-Hill,Boston,MA,1997.
[166]B.M.E.Moret.DecisionTreesandDiagrams.ComputingSurveys,14(4):593–623,1982.
[167]K.P.Murphy.MachineLearning:AProbabilisticPerspective.MITPress,2012.
[168]S.K.Murthy.AutomaticConstructionofDecisionTreesfromData:AMulti-DisciplinarySurvey.DataMiningandKnowledgeDiscovery,2(4):345–389,1998.
[169]S.K.Murthy,S.Kasif,andS.Salzberg.Asystemforinductionofobliquedecisiontrees.JofArtificialIntelligenceResearch,2:1–33,1994.
[170]T.Niblett.Constructingdecisiontreesinnoisydomains.InProc.ofthe2ndEuropeanWorkingSessiononLearning,pages67–78,Bled,Yugoslavia,May1987.
[171]T.NiblettandI.Bratko.LearningDecisionRulesinNoisyDomains.InResearchandDevelopmentinExpertSystemsIII,Cambridge,1986.
CambridgeUniversityPress.
[172]I.PalitandC.K.Reddy.Scalableandparallelboostingwithmapreduce.IEEETransactionsonKnowledgeandDataEngineering,24(10):1904–1916,2012.
[173]K.R.PattipatiandM.G.Alexandridis.Applicationofheuristicsearchandinformationtheorytosequentialfaultdiagnosis.IEEETrans.onSystems,Man,andCybernetics,20(4):872–887,1990.
[174]F.Pedregosa,G.Varoquaux,A.Gramfort,V.Michel,B.Thirion,O.Grisel,M.Blondel,P.Prettenhofer,R.Weiss,V.Dubourg,J.Vanderplas,A.Passos,D.Cournapeau,M.Brucher,M.Perrot,andE.Duchesnay.Scikit-learn:MachineLearninginPython.JournalofMachineLearningResearch,12:2825–2830,2011.
[175]J.R.Quinlan.Discoveringrulesbyinductionfromlargecollectionofexamples.InD.Michie,editor,ExpertSystemsintheMicroElectronicAge.EdinburghUniversityPress,Edinburgh,UK,1979.
[176]J.R.Quinlan.SimplifyingDecisionTrees.Intl.J.Man-MachineStudies,27:221–234,1987.
[177]J.R.Quinlan.C4.5:ProgramsforMachineLearning.Morgan-KaufmannPublishers,SanMateo,CA,1993.
[178]J.R.QuinlanandR.L.Rivest.InferringDecisionTreesUsingtheMinimumDescriptionLengthPrinciple.InformationandComputation,80(3):227–248,1989.
[179]S.R.SafavianandD.Landgrebe.ASurveyofDecisionTreeClassifierMethodology.IEEETrans.Systems,ManandCybernetics,22:660–674,May/June1998.
[180]C.Schaffer.Overfittingavoidenceasbias.MachineLearning,10:153–178,1993.
[181]J.SchuermannandW.Doster.Adecision-theoreticapproachinhierarchicalclassifierdesign.PatternRecognition,17:359–369,1984.
[182]G.Schwarzetal.Estimatingthedimensionofamodel.Theannalsofstatistics,6(2):461–464,1978.
[183]J.C.Shafer,R.Agrawal,andM.Mehta.SPRINT:AScalableParallelClassifierforDataMining.InProc.ofthe22ndVLDBConf.,pages544–555,Bombay,India,September1996.
[184]M.Stone.Cross-validatorychoiceandassessmentofstatisticalpredictions.JournaloftheRoyalStatisticalSociety.SeriesB(Methodological),pages111–147,1974.
[185]R.J.TibshiraniandR.Tibshirani.Abiascorrectionfortheminimumerrorrateincross-validation.TheAnnalsofAppliedStatistics,pages822–
829,2009.
[186]I.Tsamardinos,A.Rakhshani,andV.Lagani.Performance-estimationpropertiesofcross-validation-basedprotocolswithsimultaneoushyper-parameteroptimization.InHellenicConferenceonArtificialIntelligence,pages1–14.Springer,2014.
[187]P.E.UtgoffandC.E.Brodley.Anincrementalmethodforfindingmultivariatesplitsfordecisiontrees.InProc.ofthe7thIntl.Conf.onMachineLearning,pages58–65,Austin,TX,June1990.
[188]L.Valiant.Atheoryofthelearnable.CommunicationsoftheACM,27(11):1134–1142,1984.
[189]V.N.Vapnik.StatisticalLearningTheory.Wiley-Interscience,1998.
[190]V.N.VapnikandA.Y.Chervonenkis.Ontheuniformconvergenceofrelativefrequenciesofeventstotheirprobabilities.InMeasuresofComplexity,pages11–30.Springer,2015.
[191]S.VarmaandR.Simon.Biasinerrorestimationwhenusingcross-validationformodelselection.BMCbioinformatics,7(1):1,2006.
[192]H.WangandC.Zaniolo.CMP:AFastDecisionTreeClassifierUsingMultivariatePredictions.InProc.ofthe16thIntl.Conf.onDataEngineering,pages449–460,SanDiego,CA,March2000.
[193]Q.R.WangandC.Y.Suen.Largetreeclassifierwithheuristicsearchandglobaltraining.IEEETrans.onPatternAnalysisandMachineIntelligence,9(1):91–102,1987.
[194]Y.ZhangandY.Yang.Cross-validationforselectingamodelselectionprocedure.JournalofEconometrics,187(1):95–112,2015.
3.11 Exercises

1. Draw the full decision tree for the parity function of four Boolean attributes, A, B, C, and D. Is it possible to simplify the tree?

2. Consider the training examples shown in Table 3.5 for a binary classification problem.

Table 3.5. Data set for Exercise 2.
CustomerID Gender CarType ShirtSize Class
1 M Family Small C0
2 M Sports Medium C0
3 M Sports Medium C0
4 M Sports Large C0
5 M Sports ExtraLarge C0
6 M Sports ExtraLarge C0
7 F Sports Small C0
8 F Sports Small C0
9 F Sports Medium C0
10 F Luxury Large C0
11 M Family Large C1
12 M Family ExtraLarge C1
13 M Family Medium C1
14 M Luxury ExtraLarge C1
15 F Luxury Small C1
16 F Luxury Small C1
17 F Luxury Medium C1
18 F Luxury Medium C1
19 F Luxury Medium C1
20 F Luxury Large C1
a. Compute the Gini index for the overall collection of training examples.

b. Compute the Gini index for the Customer ID attribute.

c. Compute the Gini index for the Gender attribute.

d. Compute the Gini index for the Car Type attribute using multiway split.

e. Compute the Gini index for the Shirt Size attribute using multiway split.

f. Which attribute is better, Gender, Car Type, or Shirt Size?

g. Explain why Customer ID should not be used as the attribute test condition even though it has the lowest Gini.
3. Consider the training examples shown in Table 3.6 for a binary classification problem.

Table 3.6. Data set for Exercise 3.

Instance  a1  a2  a3   Target Class
1         T   T   1.0  +
2         T   T   6.0  +
3         T   F   5.0  −
4         F   F   4.0  +
5         F   T   7.0  −
6         F   T   3.0  −
7         F   F   8.0  −
8         T   F   7.0  +
9         F   T   5.0  −

a. What is the entropy of this collection of training examples with respect to the class attribute?

b. What are the information gains of a1 and a2 relative to these training examples?

c. For a3, which is a continuous attribute, compute the information gain for every possible split.

d. What is the best split (among a1, a2, and a3) according to the information gain?

e. What is the best split (between a1 and a2) according to the misclassification error rate?

f. What is the best split (between a1 and a2) according to the Gini index?
4. Show that the entropy of a node never increases after splitting it into smaller successor nodes.
5. Consider the following data set for a binary class problem.

A  B  Class Label
T  F  +
T  T  +
T  T  +
T  F  −
T  T  +
F  F  −
F  F  −
F  F  −
T  T  −
T  F  −

a. Calculate the information gain when splitting on A and B. Which attribute would the decision tree induction algorithm choose?
b. Calculate the gain in the Gini index when splitting on A and B. Which attribute would the decision tree induction algorithm choose?
c. Figure 3.11 shows that entropy and the Gini index are both monotonically increasing on the range [0, 0.5] and they are both monotonically decreasing on the range [0.5, 1]. Is it possible that information gain and the gain in the Gini index favor different attributes? Explain.
6. Consider splitting a parent node P into two child nodes, C1 and C2, using some attribute test condition. The composition of labeled training instances at every node is summarized in the table below.

         P   C1  C2
Class 0  7   3   4
Class 1  3   0   3

a. Calculate the Gini index and misclassification error rate of the parent node P.
b. Calculate the weighted Gini index of the child nodes. Would you consider this attribute test condition if Gini is used as the impurity measure?
c. Calculate the weighted misclassification rate of the child nodes. Would you consider this attribute test condition if misclassification rate is used as the impurity measure?
7. Consider the following set of training examples.

X  Y  Z  No. of Class C1 Examples  No. of Class C2 Examples
0  0  0  5   40
0  0  1  0   15
0  1  0  10  5
0  1  1  45  0
1  0  0  10  5
1  0  1  25  0
1  1  0  5   20
1  1  1  0   15

a. Compute a two-level decision tree using the greedy approach described in this chapter. Use the classification error rate as the criterion for splitting. What is the overall error rate of the induced tree?
b. Repeat part (a) using X as the first splitting attribute and then choose the best remaining attribute for splitting at each of the two successor nodes. What is the error rate of the induced tree?
c. Compare the results of parts (a) and (b). Comment on the suitability of the greedy heuristic used for splitting attribute selection.
8. The following table summarizes a data set with three attributes A, B, C and two class labels +, −. Build a two-level decision tree.

A  B  C  Number of + Instances  Number of − Instances
T  T  T  5   0
F  T  T  0   20
T  F  T  20  0
F  F  T  0   5
T  T  F  0   0
F  T  F  25  0
T  F  F  0   0
F  F  F  0   25

a. According to the classification error rate, which attribute would be chosen as the first splitting attribute? For each attribute, show the contingency table and the gains in classification error rate.
b. Repeat for the two children of the root node.
c. How many instances are misclassified by the resulting decision tree?
d. Repeat parts (a), (b), and (c) using C as the splitting attribute.
e. Use the results in parts (c) and (d) to conclude about the greedy nature of the decision tree induction algorithm.
9. Consider the decision tree shown in Figure 3.36.

Figure 3.36. Decision tree and data sets for Exercise 9.

a. Compute the generalization error rate of the tree using the optimistic approach.
b. Compute the generalization error rate of the tree using the pessimistic approach. (For simplicity, use the strategy of adding a factor of 0.5 to each leaf node.)
c. Compute the generalization error rate of the tree using the validation set shown above. This approach is known as reduced error pruning.
10. Consider the decision trees shown in Figure 3.37. Assume they are generated from a data set that contains 16 binary attributes and 3 classes, C1, C2, and C3.

Figure 3.37. Decision trees for Exercise 10.

Compute the total description length of each decision tree according to the following formulation of the minimum description length principle.

The total description length of a tree is given by Cost(tree, data) = Cost(tree) + Cost(data|tree).

Each internal node of the tree is encoded by the ID of the splitting attribute. If there are m attributes, the cost of encoding each attribute is log2 m bits.

Each leaf is encoded using the ID of the class it is associated with. If there are k classes, the cost of encoding a class is log2 k bits.

Cost(tree) is the cost of encoding all the nodes in the tree. To simplify the computation, you can assume that the total cost of the tree is obtained by adding up the costs of encoding each internal node and each leaf node.

Cost(data|tree) is encoded using the classification errors the tree commits on the training set. Each error is encoded by log2 n bits, where n is the total number of training instances.

Which decision tree is better, according to the MDL principle?
11. This exercise, inspired by the discussions in [155], highlights one of the known limitations of the leave-one-out model evaluation procedure. Let us consider a data set containing 50 positive and 50 negative instances, where the attributes are purely random and contain no information about the class labels. Hence, the generalization error rate of any classification model learned over this data is expected to be 0.5. Let us consider a classifier that assigns the majority class label of training instances (ties resolved by using the positive label as the default class) to any test instance, irrespective of its attribute values. We can call this approach the majority inducer classifier. Determine the error rate of this classifier using the following methods.

a. Leave-one-out.
b. 2-fold stratified cross-validation, where the proportion of class labels at every fold is kept the same as that of the overall data.
c. From the results above, which method provides a more reliable evaluation of the classifier's generalization error rate?
12. Consider a labeled data set containing 100 data instances, which is randomly partitioned into two sets A and B, each containing 50 instances. We use A as the training set to learn two decision trees, T10 with 10 leaf nodes and T100 with 100 leaf nodes. The accuracies of the two decision trees on data sets A and B are shown in Table 3.7.

Table 3.7. Comparing the test accuracy of decision trees T10 and T100.

          Accuracy
Data Set  T10   T100
A         0.86  0.97
B         0.84  0.77

a. Based on the accuracies shown in Table 3.7, which classification model would you expect to have better performance on unseen instances?
b. Now, you tested T10 and T100 on the entire data set (A + B) and found that the classification accuracy of T10 on data set (A + B) is 0.85, whereas the classification accuracy of T100 on the data set (A + B) is 0.87. Based on this new information and your observations from Table 3.7, which classification model would you finally choose for classification?
13. Consider the following approach for testing whether a classifier A beats another classifier B. Let N be the size of a given data set, pA be the accuracy of classifier A, pB be the accuracy of classifier B, and p = (pA + pB)/2 be the average accuracy for both classifiers. To test whether classifier A is significantly better than B, the following Z-statistic is used:

Z = (pA − pB) / √(2p(1 − p)/N).

Classifier A is assumed to be better than classifier B if Z > 1.96.

Table 3.8 compares the accuracies of three different classifiers, decision tree classifiers, naïve Bayes classifiers, and support vector machines, on various data sets. (The latter two classifiers are described in Chapter 4.)

Summarize the performance of the classifiers given in Table 3.8 using the following 3 × 3 table:

win-loss-draw           Decision tree  Naïve Bayes  Support vector machine
Decision tree           0-0-23
Naïve Bayes                            0-0-23
Support vector machine                              0-0-23

Table 3.8. Comparing the accuracy of various classification methods.

Data Set  Size (N)  Decision Tree (%)  naïve Bayes (%)  Support vector machine (%)
Anneal 898 92.09 79.62 87.19
Australia 690 85.51 76.81 84.78
Auto 205 81.95 58.05 70.73
Breast 699 95.14 95.99 96.42
Cleve 303 76.24 83.50 84.49
Credit 690 85.80 77.54 85.07
Diabetes 768 72.40 75.91 76.82
German 1000 70.90 74.70 74.40
Glass 214 67.29 48.59 59.81
Heart 270 80.00 84.07 83.70
Hepatitis 155 81.94 83.23 87.10
Horse 368 85.33 78.80 82.61
Ionosphere 351 89.17 82.34 88.89
Iris 150 94.67 95.33 96.00
Labor 57 78.95 94.74 92.98
Led7 3200 73.34 73.16 73.56
Lymphography 148 77.03 83.11 86.49
Pima 768 74.35 76.04 76.95
Sonar 208 78.85 69.71 76.92
Tic-tac-toe 958 83.72 70.04 98.33
Vehicle 846 71.04 45.04 74.94
Wine 178 94.38 96.63 98.88
Zoo 101 93.07 93.07 96.04
Each cell in the table contains the number of wins, losses, and draws when comparing the classifier in a given row to the classifier in a given column.
14. Let X be a binomial random variable with mean Np and variance Np(1 − p). Show that the ratio X/N also has a binomial distribution with mean p and variance p(1 − p)/N.
4 Classification: Alternative Techniques

The previous chapter introduced the classification problem and presented a technique known as the decision tree classifier. Issues such as model overfitting and model evaluation were also discussed. This chapter presents alternative techniques for building classification models, from simple techniques such as rule-based and nearest neighbor classifiers to more sophisticated techniques such as artificial neural networks, deep learning, support vector machines, and ensemble methods. Other practical issues such as the class imbalance and multiclass problems are also discussed at the end of the chapter.
4.1 Types of Classifiers

Before presenting specific techniques, we first categorize the different types of classifiers available. One way to distinguish classifiers is by considering the characteristics of their output.
Binary versus Multiclass

Binary classifiers assign each data instance to one of two possible labels, typically denoted as +1 and −1. The positive class usually refers to the category we are more interested in predicting correctly compared to the negative class (e.g., the spam category in email classification problems). If there are more than two possible labels available, then the technique is known as a multiclass classifier. As some classifiers were designed for binary classes only, they must be adapted to deal with multiclass problems. Techniques for transforming binary classifiers into multiclass classifiers are described in Section 4.12.

Deterministic versus Probabilistic

A deterministic classifier produces a discrete-valued label for each data instance it classifies, whereas a probabilistic classifier assigns a continuous score between 0 and 1 to indicate how likely it is that an instance belongs to a particular class, where the probability scores for all the classes sum up to 1. Some examples of probabilistic classifiers include the naïve Bayes classifier, Bayesian networks, and logistic regression. Probabilistic classifiers provide additional information about the confidence in assigning an instance to a class compared to deterministic classifiers. A data instance is typically assigned to the class with the highest probability score, except when the cost of misclassifying the class with lower probability is significantly higher. We will discuss the topic of cost-sensitive classification with probabilistic outputs in Section 4.11.2.
Another way to distinguish the different types of classifiers is based on their technique for discriminating instances from different classes.

Linear versus Nonlinear

A linear classifier uses a linear separating hyperplane to discriminate instances from different classes, whereas a nonlinear classifier enables the construction of more complex, nonlinear decision surfaces. We illustrate an example of a linear classifier (perceptron) and its nonlinear counterpart (multi-layer neural network) in Section 4.7. Although the linearity assumption makes the model less flexible in terms of fitting complex data, it also makes linear classifiers less susceptible to model overfitting compared to nonlinear classifiers. Furthermore, one can transform the original set of attributes, x = (x1, x2, ⋯, xd), into a more complex feature set, e.g., Φ(x) = (x1, x2, x1x2, x1², x2², ⋯), before applying the linear classifier. Such feature transformation allows the linear classifier to fit data sets with nonlinear decision surfaces (see Section 4.9.4).

Global versus Local

A global classifier fits a single model to the entire data set. Unless the model is highly nonlinear, this one-size-fits-all strategy may not be effective when the relationship between the attributes and the class labels varies over the input space. In contrast, a local classifier partitions the input space into smaller regions and fits a distinct model to training instances in each region. The k-nearest neighbor classifier (see Section 4.3) is a classic example of local classifiers. While local classifiers are more flexible in terms of fitting complex decision boundaries, they are also more susceptible to the model overfitting problem, especially when the local regions contain few training examples.
Generative versus Discriminative

Given a data instance x, the primary objective of any classifier is to predict the class label, y, of the data instance. However, apart from predicting the class label, we may also be interested in describing the underlying mechanism that generates the instances belonging to every class label. For example, in the process of classifying spam email messages, it may be useful to understand the typical characteristics of email messages that are labeled as spam, e.g., specific usage of keywords in the subject or the body of the email. Classifiers that learn a generative model of every class in the process of predicting class labels are known as generative classifiers. Some examples of generative classifiers include the naïve Bayes classifier and Bayesian networks. In contrast, discriminative classifiers directly predict the class labels without explicitly describing the distribution of every class label. They solve a simpler problem than generative models since they do not have the onus of deriving insights about the generative mechanism of data instances. They are thus sometimes preferred over generative models, especially when it is not crucial to obtain information about the properties of every class. Some examples of discriminative classifiers include decision trees, rule-based classifiers, nearest neighbor classifiers, artificial neural networks, and support vector machines.
4.2 Rule-Based Classifier

A rule-based classifier uses a collection of "if…then…" rules (also known as a rule set) to classify data instances. Table 4.1 shows an example of a rule set generated for the vertebrate classification problem described in the previous chapter. Each classification rule in the rule set can be expressed in the following way:

ri: (Conditioni) → yi.   (4.1)

The left-hand side of the rule is called the rule antecedent or precondition. It contains a conjunction of attribute test conditions:

Conditioni = (A1 op v1) ∧ (A2 op v2) ∧ ⋯ ∧ (Ak op vk),   (4.2)

where (Aj, vj) is an attribute-value pair and op is a comparison operator chosen from the set {=, ≠, <, >, ≤, ≥}. Each attribute test (Aj op vj) is also known as a conjunct. The right-hand side of the rule is called the rule consequent, which contains the predicted class yi.

A rule r covers a data instance x if the precondition of r matches the attributes of x. r is also said to be fired or triggered whenever it covers a given instance. For an illustration, consider the rule r1 given in Table 4.1 and the following attributes for two vertebrates: hawk and grizzly bear.

Table 4.1. Example of a rule set for the vertebrate classification problem.

r1: (Gives Birth = no) ∧ (Aerial Creature = yes) → Birds
r2: (Gives Birth = no) ∧ (Aquatic Creature = yes) → Fishes
r3: (Gives Birth = yes) ∧ (Body Temperature = warm-blooded) → Mammals
r4: (Gives Birth = no) ∧ (Aerial Creature = no) → Reptiles
r5: (Aquatic Creature = semi) → Amphibians
Name | Body Temperature | Skin Cover | Gives Birth | Aquatic Creature | Aerial Creature | Has Legs | Hibernates
hawk | warm-blooded | feather | no | no | yes | yes | no
grizzly bear | warm-blooded | fur | yes | no | no | yes | yes

r1 covers the first vertebrate because its precondition is satisfied by the hawk's attributes. The rule does not cover the second vertebrate because grizzly bears give birth to their young and cannot fly, thus violating the precondition of r1.
The quality of a classification rule can be evaluated using measures such as coverage and accuracy. Given a data set D and a classification rule r: A → y, the coverage of the rule is the fraction of instances in D that trigger the rule r. On the other hand, its accuracy or confidence factor is the fraction of instances triggered by r whose class labels are equal to y. The formal definitions of these measures are

Coverage(r) = |A| / |D|,
Accuracy(r) = |A ∩ y| / |A|,   (4.3)

where |A| is the number of instances that satisfy the rule antecedent, |A ∩ y| is the number of instances that satisfy both the antecedent and consequent, and |D| is the total number of instances.

Example 4.1. Consider the data set shown in Table 4.2. The rule

(Gives Birth = yes) ∧ (Body Temperature = warm-blooded) → Mammals

has a coverage of 33% since five of the fifteen instances support the rule antecedent. The rule accuracy is 100% because all five vertebrates covered by the rule are mammals.
Table 4.2. The vertebrate data set.

Name | Body Temperature | Skin Cover | Gives Birth | Aquatic Creature | Aerial Creature | Has Legs | Hibernates | Class Label
human | warm-blooded | hair | yes | no | no | yes | no | Mammals
python | cold-blooded | scales | no | no | no | no | yes | Reptiles
salmon | cold-blooded | scales | no | yes | no | no | no | Fishes
whale | warm-blooded | hair | yes | yes | no | no | no | Mammals
frog | cold-blooded | none | no | semi | no | yes | yes | Amphibians
komodo dragon | cold-blooded | scales | no | no | no | yes | no | Reptiles
bat | warm-blooded | hair | yes | no | yes | yes | yes | Mammals
pigeon | warm-blooded | feathers | no | no | yes | yes | no | Birds
cat | warm-blooded | fur | yes | no | no | yes | no | Mammals
guppy | cold-blooded | scales | yes | yes | no | no | no | Fishes
alligator | cold-blooded | scales | no | semi | no | yes | no | Reptiles
penguin | warm-blooded | feathers | no | semi | no | yes | no | Birds
porcupine | warm-blooded | quills | yes | no | no | yes | yes | Mammals
eel | cold-blooded | scales | no | yes | no | no | no | Fishes
salamander | cold-blooded | none | no | semi | no | yes | yes | Amphibians
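To make Equation 4.3 concrete, the following Python sketch (not part of the original text) computes the coverage and accuracy of a rule represented as a dictionary of equality conjuncts; the three-row data set and the function names are hypothetical stand-ins for Table 4.2.

# A rule is an (antecedent, consequent) pair; the antecedent maps each
# attribute to the value it must equal (equality conjuncts only).
def covers(antecedent, instance):
    return all(instance.get(attr) == val for attr, val in antecedent.items())

def coverage_and_accuracy(rule, data):
    antecedent, consequent = rule
    covered = [x for x in data if covers(antecedent, x)]        # |A|
    correct = [x for x in covered if x["Class"] == consequent]  # |A ∩ y|
    coverage = len(covered) / len(data)                         # |A| / |D|
    accuracy = len(correct) / len(covered) if covered else 0.0  # |A ∩ y| / |A|
    return coverage, accuracy

# Toy data set in the spirit of Table 4.2 (not the full table).
data = [
    {"Gives Birth": "yes", "Body Temperature": "warm-blooded", "Class": "Mammals"},
    {"Gives Birth": "no",  "Body Temperature": "cold-blooded", "Class": "Reptiles"},
    {"Gives Birth": "no",  "Body Temperature": "warm-blooded", "Class": "Birds"},
]
r = ({"Gives Birth": "yes", "Body Temperature": "warm-blooded"}, "Mammals")
print(coverage_and_accuracy(r, data))   # (0.333..., 1.0)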
4.2.1 How a Rule-Based Classifier Works

A rule-based classifier classifies a test instance based on the rule triggered by the instance. To illustrate how a rule-based classifier works, consider the rule set shown in Table 4.1 and the following vertebrates:

Name | Body Temperature | Skin Cover | Gives Birth | Aquatic Creature | Aerial Creature | Has Legs | Hibernates
lemur | warm-blooded | fur | yes | no | no | yes | yes
turtle | cold-blooded | scales | no | semi | no | yes | no
dogfish shark | cold-blooded | scales | yes | yes | no | no | no

The first vertebrate, which is a lemur, is warm-blooded and gives birth to its young. It triggers the rule r3, and thus, is classified as a mammal. The second vertebrate, which is a turtle, triggers the rules r4 and r5. Since the classes predicted by the rules are contradictory (reptiles versus amphibians), their conflicting classes must be resolved. None of the rules are applicable to a dogfish shark. In this case, we need to determine what class to assign to such a test instance.
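As a minimal sketch, the Python snippet below shows one way a test instance could be classified with the rule set of Table 4.1 when the rules are treated as an ordered list with a default class; the dictionary-based rule representation and the default class label are illustrative assumptions, not part of the original text. With this ordering, the turtle's conflict between r4 and r5 is resolved in favor of the higher-ranked rule, anticipating the ordered rule sets discussed next.

# Classify an instance with an ordered rule set (decision list).
def classify(instance, ordered_rules, default_class):
    for antecedent, label in ordered_rules:
        if all(instance.get(a) == v for a, v in antecedent.items()):
            return label            # first matching rule determines the class
    return default_class            # no rule fired: fall back to the default

rules = [
    ({"Gives Birth": "no",  "Aerial Creature": "yes"},  "Birds"),     # r1
    ({"Gives Birth": "no",  "Aquatic Creature": "yes"}, "Fishes"),    # r2
    ({"Gives Birth": "yes", "Body Temperature": "warm-blooded"}, "Mammals"),  # r3
    ({"Gives Birth": "no",  "Aerial Creature": "no"},   "Reptiles"),  # r4
    ({"Aquatic Creature": "semi"},                      "Amphibians") # r5
]
turtle = {"Body Temperature": "cold-blooded", "Gives Birth": "no",
          "Aquatic Creature": "semi", "Aerial Creature": "no"}
print(classify(turtle, rules, default_class="Unknown"))   # "Reptiles" (r4 fires first)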
4.2.2 Properties of a Rule Set

The rule set generated by a rule-based classifier can be characterized by the following two properties.

Definition 4.1 (Mutually Exclusive Rule Set). The rules in a rule set R are mutually exclusive if no two rules in R are triggered by the same instance. This property ensures that every instance is covered by at most one rule in R.

Definition 4.2 (Exhaustive Rule Set). A rule set R has exhaustive coverage if there is a rule for each combination of attribute values. This property ensures that every instance is covered by at least one rule in R.
Table 4.3. Example of a mutually exclusive and exhaustive rule set.

r1: (Body Temperature = cold-blooded) → Non-mammals
r2: (Body Temperature = warm-blooded) ∧ (Gives Birth = yes) → Mammals
r3: (Body Temperature = warm-blooded) ∧ (Gives Birth = no) → Non-mammals

Together, these two properties ensure that every instance is covered by exactly one rule. An example of a mutually exclusive and exhaustive rule set is shown in Table 4.3. Unfortunately, many rule-based classifiers, including the one shown in Table 4.1, do not have such properties. If the rule set is not exhaustive, then a default rule, rd: () → yd, must be added to cover the remaining cases. A default rule has an empty antecedent and is triggered when all other rules have failed. yd is known as the default class and is typically assigned to the majority class of training instances not covered by the existing rules. If the rule set is not mutually exclusive, then an instance can be covered by more than one rule, some of which may predict conflicting classes.

Definition 4.3 (Ordered Rule Set). The rules in an ordered rule set R are ranked in decreasing order of their priority. An ordered rule set is also known as a decision list.

The rank of a rule can be defined in many ways, e.g., based on its accuracy or total description length. When a test instance is presented, it will be classified by the highest-ranked rule that covers the instance. This avoids the problem of having conflicting classes predicted by multiple classification rules if the rule set is not mutually exclusive.
An alternative way to handle a non-mutually exclusive rule set without ordering the rules is to consider the consequent of each rule triggered by a test instance as a vote for a particular class. The votes are then tallied to determine the class label of the test instance. The instance is usually assigned to the class that receives the highest number of votes. The vote may also be weighted by the rule's accuracy. Using unordered rules to build a rule-based classifier has both advantages and disadvantages. Unordered rules are less susceptible to errors caused by the wrong rule being selected to classify a test instance, unlike classifiers based on ordered rules, which are sensitive to the choice of rule-ordering criteria. Model building is also less expensive because the rules do not need to be kept in sorted order. Nevertheless, classifying a test instance can be quite expensive because the attributes of the test instance must be compared against the precondition of every rule in the rule set.

In the next two sections, we present techniques for extracting an ordered rule set from data. A rule-based classifier can be constructed using (1) direct methods, which extract classification rules directly from data, and (2) indirect methods, which extract classification rules from more complex classification models, such as decision trees and neural networks. Detailed discussions of these methods are presented in Sections 4.2.3 and 4.2.4, respectively.
4.2.3 Direct Methods for Rule Extraction

To illustrate the direct method, we consider a widely-used rule induction algorithm called RIPPER. This algorithm scales almost linearly with the number of training instances and is particularly suited for building models from data sets with imbalanced class distributions. RIPPER also works well with noisy data because it uses a validation set to prevent model overfitting.

RIPPER uses the sequential covering algorithm to extract rules directly from data. Rules are grown in a greedy fashion one class at a time. For binary class problems, RIPPER chooses the majority class as its default class and learns the rules to detect instances from the minority class. For multiclass problems, the classes are ordered according to their prevalence in the training set. Let (y1, y2, …, yc) be the ordered list of classes, where y1 is the least prevalent class and yc is the most prevalent class. All training instances that belong to y1 are initially labeled as positive examples, while those that belong to other classes are labeled as negative examples. The sequential covering algorithm learns a set of rules to discriminate the positive from negative examples. Next, all training instances from y2 are labeled as positive, while those from classes y3, y4, ⋯, yc are labeled as negative. The sequential covering algorithm would learn the next set of rules to distinguish y2 from other remaining classes. This process is repeated until we are left with only one class, yc, which is designated as the default class.

Algorithm 4.1 Sequential covering algorithm.
1: Let E be the training instances and A be the set of attribute-value pairs {(Aj, vj)}.
2: Let Yo = (y1, y2, …, yc) be an ordered set of classes.
3: Let R = { } be the initial rule list.
4: for each class y ∈ Yo − {yc} do
5:   while stopping condition is not met do
6:     r ← Learn-One-Rule(E, A, y).
7:     Remove training instances from E that are covered by r.
8:     Add r to the bottom of the rule list: R ⟵ R ∨ r.
9:   end while
10: end for
11: Insert the default rule, { } → yc, to the bottom of the rule list R.
A summary of the sequential covering algorithm is shown in Algorithm 4.1. The algorithm starts with an empty decision list, R, and extracts rules for each class based on the ordering specified by the class prevalence. It iteratively extracts the rules for a given class y using the Learn-One-Rule function. Once such a rule is found, all the training instances covered by the rule are eliminated. The new rule is added to the bottom of the decision list R. This procedure is repeated until the stopping criterion is met. The algorithm then proceeds to generate rules for the next class.

Figure 4.1 demonstrates how the sequential covering algorithm works for a data set that contains a collection of positive and negative examples. The rule R1, whose coverage is shown in Figure 4.1(b), is extracted first because it covers the largest fraction of positive examples. All the training instances covered by R1 are subsequently removed and the algorithm proceeds to look for the next best rule, which is R2.
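As a rough illustration, the Python sketch below mirrors the loop structure of Algorithm 4.1; learn_one_rule, the rule object's covers method, and the stopping test are assumed helpers (for example, a greedy rule grower based on FOIL's information gain), not part of the original text.

# Sketch of the sequential covering loop; the last class in ordered_classes
# becomes the default class, as in Algorithm 4.1.
def sequential_covering(instances, ordered_classes, learn_one_rule, stop):
    decision_list = []
    remaining = list(instances)
    for y in ordered_classes[:-1]:
        while not stop(remaining, y):
            rule = learn_one_rule(remaining, y)                 # grow one rule for class y
            remaining = [x for x in remaining if not rule.covers(x)]  # drop covered instances
            decision_list.append(rule)                          # add rule to bottom of R
    default_class = ordered_classes[-1]
    return decision_list, default_class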
Learn-One-Rule Function

Finding an optimal rule is computationally expensive due to the exponential search space to explore. The Learn-One-Rule function addresses this problem by growing the rules in a greedy fashion. It generates an initial rule r: { } → +, where the left-hand side is an empty set and the right-hand side corresponds to the positive class. It then refines the rule until a certain stopping criterion is met. The accuracy of the initial rule may be poor because some of the training instances covered by the rule belong to the negative class. A new conjunct must be added to the rule antecedent to improve its accuracy.

Figure 4.1. An example of the sequential covering algorithm.
RIPPER uses the FOIL's information gain measure to choose the best conjunct to be added into the rule antecedent. The measure takes into consideration both the gain in accuracy and support of a candidate rule, where support is defined as the number of positive examples covered by the rule. For example, suppose the rule r: A → + initially covers p0 positive examples and n0 negative examples. After adding a new conjunct B, the extended rule r′: A ∧ B → + covers p1 positive examples and n1 negative examples. The FOIL's information gain of the extended rule is computed as follows:

FOIL's information gain = p1 × (log2 (p1 / (p1 + n1)) − log2 (p0 / (p0 + n0))).   (4.4)

RIPPER chooses the conjunct with highest FOIL's information gain to extend the rule, as illustrated in the next example.

Example 4.2. [FOIL's Information Gain] Consider the training set for the vertebrate classification problem shown in Table 4.2. Suppose the target class for the Learn-One-Rule function is mammals. Initially, the antecedent of the rule { } → Mammals covers 5 positive and 10 negative examples. Thus, the accuracy of the rule is only 0.333. Next, consider the following three candidate conjuncts to be added to the left-hand side of the rule: Skin Cover = hair, Body Temperature = warm-blooded, and Has Legs = no. The number of positive and negative examples covered by the rule after adding each conjunct, along with their respective accuracy and FOIL's information gain, are shown in the following table.

Candidate rule | p1 | n1 | Accuracy | Info Gain
{Skin Cover = hair} → Mammals | 3 | 0 | 1.000 | 4.755
{Body Temperature = warm-blooded} → Mammals | 5 | 2 | 0.714 | 5.498
{Has Legs = no} → Mammals | 1 | 4 | 0.200 | −0.737

Although Skin Cover = hair has the highest accuracy among the three candidates, the conjunct Body Temperature = warm-blooded has the highest FOIL's information gain. Thus, it is chosen to extend the rule (see Figure 4.2).

This process continues until adding new conjuncts no longer improves the information gain measure.
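The FOIL's information gain values in Example 4.2 can be reproduced with a few lines of Python; the function below is a direct transcription of Equation 4.4, and the candidate counts are those listed in the table above.

from math import log2

def foil_gain(p0, n0, p1, n1):
    # Gain of extending a rule covering p0 pos / n0 neg examples to one
    # covering p1 pos / n1 neg examples (Equation 4.4).
    return p1 * (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0)))

# Initial rule { } -> Mammals covers p0 = 5 positives and n0 = 10 negatives.
print(round(foil_gain(5, 10, 3, 0), 3))   # Skin Cover = hair              ->  4.755
print(round(foil_gain(5, 10, 5, 2), 3))   # Body Temperature = warm-blooded ->  5.498
print(round(foil_gain(5, 10, 1, 4), 3))   # Has Legs = no                  -> -0.737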
Rule Pruning

The rules generated by the Learn-One-Rule function can be pruned to improve their generalization errors. RIPPER prunes the rules based on their performance on the validation set. The following metric is computed to determine whether pruning is needed: (p − n)/(p + n), where p (n) is the number of positive (negative) examples in the validation set covered by the rule. This metric is monotonically related to the rule's accuracy on the validation set. If the metric improves after pruning, then the conjunct is removed. Pruning is done starting from the last conjunct added to the rule. For example, given a rule ABCD → y, RIPPER checks whether D should be pruned first, followed by CD, BCD, etc. While the original rule covers only positive examples, the pruned rule may cover some of the negative examples in the training set.

Building the Rule Set

After generating a rule, all the positive and negative examples covered by the rule are eliminated. The rule is then added into the rule set as long as it does not violate the stopping condition, which is based on the minimum description length principle. If the new rule increases the total description length of the rule set by at least d bits, then RIPPER stops adding rules into its rule set (by default, d is chosen to be 64 bits). Another stopping condition used by RIPPER is that the error rate of the rule on the validation set must not exceed 50%.
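As a minimal sketch, the validation-set pruning criterion described under Rule Pruning above can be expressed as follows; the example counts are hypothetical.

def pruning_metric(p, n):
    # p (n) = positive (negative) validation examples covered by the rule.
    return (p - n) / (p + n)

def should_prune(metric_before, metric_after):
    # Drop the most recently added conjunct if the validation metric improves.
    return metric_after > metric_before

# Hypothetical counts: the full rule covers 8 pos / 4 neg on the validation
# set; without its last conjunct it covers 10 pos / 4 neg.
print(should_prune(pruning_metric(8, 4), pruning_metric(10, 4)))   # True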
Figure 4.2. General-to-specific and specific-to-general rule-growing strategies.

RIPPER also performs additional optimization steps to determine whether some of the existing rules in the rule set can be replaced by better alternative rules. Readers who are interested in the details of the optimization method may refer to the reference cited at the end of this chapter.

Instance Elimination

After a rule is extracted, RIPPER eliminates the positive and negative examples covered by the rule. The rationale for doing this is illustrated in the next example.

Figure 4.3 shows three possible rules, R1, R2, and R3, extracted from a training set that contains 29 positive examples and 21 negative examples. The accuracies of R1, R2, and R3 are 12/15 (80%), 7/10 (70%), and 8/12 (66.7%), respectively. R1 is generated first because it has the highest accuracy. After generating R1, the algorithm must remove the examples covered by the rule so that the next rule generated by the algorithm is different than R1. The question is, should it remove the positive examples only, negative examples only, or both? To answer this, suppose the algorithm must choose between generating R2 or R3 after R1. Even though R2 has a higher accuracy than R3 (70% versus 66.7%), observe that the region covered by R2 is disjoint from R1, while the region covered by R3 overlaps with R1. As a result, R1 and R3 together cover 18 positive and 5 negative examples (resulting in an overall accuracy of 78.3%), whereas R1 and R2 together cover 19 positive and 6 negative examples (resulting in a lower overall accuracy of 76%). If the positive examples covered by R1 are not removed, then we may overestimate the effective accuracy of R3. If the negative examples covered by R1 are not removed, then we may underestimate the accuracy of R3. In the latter case, we might end up preferring R2 over R3 even though half of the false positive errors committed by R3 have already been accounted for by the preceding rule, R1. This example shows that the effective accuracy after adding R2 or R3 to the rule set becomes evident only when both positive and negative examples covered by R1 are removed.

Figure 4.3. Elimination of training instances by the sequential covering algorithm. R1, R2, and R3 represent regions covered by three different rules.
4.2.4 Indirect Methods for Rule Extraction

This section presents a method for generating a rule set from a decision tree. In principle, every path from the root node to the leaf node of a decision tree can be expressed as a classification rule. The test conditions encountered along the path form the conjuncts of the rule antecedent, while the class label at the leaf node is assigned to the rule consequent. Figure 4.4 shows an example of a rule set generated from a decision tree. Notice that the rule set is exhaustive and contains mutually exclusive rules. However, some of the rules can be simplified as shown in the next example.

Figure 4.4. Converting a decision tree into classification rules.

Example 4.3. Consider the following three rules from Figure 4.4:

r2: (P = No) ∧ (Q = Yes) → +
r3: (P = Yes) ∧ (R = No) → +
r5: (P = Yes) ∧ (R = Yes) ∧ (Q = Yes) → +.

Observe that the rule set always predicts a positive class when the value of Q is Yes. Therefore, we may simplify the rules as follows:

r2′: (Q = Yes) → +
r3: (P = Yes) ∧ (R = No) → +.

r3 is retained to cover the remaining instances of the positive class. Although the rules obtained after simplification are no longer mutually exclusive, they are less complex and are easier to interpret.

In the following, we describe an approach used by the C4.5rules algorithm to generate a rule set from a decision tree. Figure 4.5 shows the decision tree and resulting classification rules obtained for the data set given in Table 4.2.
Rule Generation

Classification rules are extracted for every path from the root to one of the leaf nodes in the decision tree. Given a classification rule r: A → y, we consider a simplified rule, r′: A′ → y, where A′ is obtained by removing one of the conjuncts in A. The simplified rule with the lowest pessimistic error rate is retained provided its error rate is less than that of the original rule. The rule-pruning step is repeated until the pessimistic error of the rule cannot be improved further. Because some of the rules may become identical after pruning, the duplicate rules are discarded.

Figure 4.5. Classification rules extracted from a decision tree for the vertebrate classification problem.

Rule Ordering

After generating the rule set, C4.5rules uses the class-based ordering scheme to order the extracted rules. Rules that predict the same class are grouped together into the same subset. The total description length for each subset is computed, and the classes are arranged in increasing order of their total description length. The class that has the smallest description length is given the highest priority because it is expected to contain the best set of rules. The total description length for a class is given by Lexception + g × Lmodel, where Lexception is the number of bits needed to encode the misclassified examples, Lmodel is the number of bits needed to encode the model, and g is a tuning parameter whose default value is 0.5. The tuning parameter depends on the number of redundant attributes present in the model. The value of the tuning parameter is small if the model contains many redundant attributes.

4.2.5 Characteristics of Rule-Based Classifiers

1. Rule-based classifiers have very similar characteristics to decision trees. The expressiveness of a rule set is almost equivalent to that of a decision tree because a decision tree can be represented by a set of mutually exclusive and exhaustive rules. Both rule-based and decision tree classifiers create rectilinear partitions of the attribute space and assign a class to each partition. However, a rule-based classifier can allow multiple rules to be triggered for a given instance, thus enabling the learning of more complex models than decision trees.
2. Like decision trees, rule-based classifiers can handle varying types of categorical and continuous attributes and can easily work in multiclass classification scenarios. Rule-based classifiers are generally used to produce descriptive models that are easier to interpret but give comparable performance to the decision tree classifier.

3. Rule-based classifiers can easily handle the presence of redundant attributes that are highly correlated with one another. This is because once an attribute has been used as a conjunct in a rule antecedent, the remaining redundant attributes would show little to no FOIL's information gain and would thus be ignored.

4. Since irrelevant attributes show poor information gain, rule-based classifiers can avoid selecting irrelevant attributes if there are other relevant attributes that show better information gain. However, if the problem is complex and there are interacting attributes that can collectively distinguish between the classes but individually show poor information gain, it is likely for an irrelevant attribute to be accidentally favored over a relevant attribute just by random chance. Hence, rule-based classifiers can show poor performance in the presence of interacting attributes, when the number of irrelevant attributes is large.

5. The class-based ordering strategy adopted by RIPPER, which emphasizes giving higher priority to rare classes, is well suited for handling training data sets with imbalanced class distributions.

6. Rule-based classifiers are not well-suited for handling missing values in the test set. This is because the position of rules in a rule set follows a certain ordering strategy and even if a test instance is covered by multiple rules, they can assign different class labels depending on their position in the rule set. Hence, if a certain rule involves an attribute that is missing in a test instance, it is difficult to ignore the rule and proceed to the subsequent rules in the rule set, as such a strategy can result in incorrect class assignments.
4.3 Nearest Neighbor Classifiers

The classification framework shown in Figure 3.3 involves a two-step process: (1) an inductive step for constructing a classification model from data, and (2) a deductive step for applying the model to test examples. Decision tree and rule-based classifiers are examples of eager learners because they are designed to learn a model that maps the input attributes to the class label as soon as the training data becomes available. An opposite strategy would be to delay the process of modeling the training data until it is needed to classify the test instances. Techniques that employ this strategy are known as lazy learners. An example of a lazy learner is the Rote classifier, which memorizes the entire training data and performs classification only if the attributes of a test instance match one of the training examples exactly. An obvious drawback of this approach is that some test instances may not be classified because they do not match any training example.

One way to make this approach more flexible is to find all the training examples that are relatively similar to the attributes of the test instances. These examples, which are known as nearest neighbors, can be used to determine the class label of the test instance. The justification for using nearest neighbors is best exemplified by the following saying: "If it walks like a duck, quacks like a duck, and looks like a duck, then it's probably a duck." A nearest neighbor classifier represents each example as a data point in a d-dimensional space, where d is the number of attributes. Given a test instance, we compute its proximity to the training instances according to one of the proximity measures described in Section 2.4 on page 71. The k-nearest neighbors of a given test instance z refer to the k training examples that are closest to z.

Figure 4.6 illustrates the 1-, 2-, and 3-nearest neighbors of a test instance located at the center of each circle. The instance is classified based on the class labels of its neighbors. In the case where the neighbors have more than one label, the test instance is assigned to the majority class of its nearest neighbors. In Figure 4.6(a), the 1-nearest neighbor of the instance is a negative example. Therefore, the instance is assigned to the negative class. If the number of nearest neighbors is three, as shown in Figure 4.6(c), then the neighborhood contains two positive examples and one negative example. Using the majority voting scheme, the instance is assigned to the positive class. In the case where there is a tie between the classes (see Figure 4.6(b)), we may randomly choose one of them to classify the data point.

Figure 4.6. The 1-, 2-, and 3-nearest neighbors of an instance.

The preceding discussion underscores the importance of choosing the right value for k. If k is too small, then the nearest neighbor classifier may be susceptible to overfitting due to noise, i.e., mislabeled examples in the training data. On the other hand, if k is too large, the nearest neighbor classifier may misclassify the test instance because its list of nearest neighbors includes training examples that are located far away from its neighborhood (see Figure 4.7).

Figure 4.7. k-nearest neighbor classification with large k.
4.3.1 Algorithm

A high-level summary of the nearest neighbor classification method is given in Algorithm 4.2. The algorithm computes the distance (or similarity) between each test instance z = (x′, y′) and all the training examples (x, y) ∈ D to determine its nearest neighbor list, Dz. Such computation can be costly if the number of training examples is large. However, efficient indexing techniques are available to reduce the computation needed to find the nearest neighbors of a test instance.

Algorithm 4.2 The k-nearest neighbor classifier.
1: Let k be the number of nearest neighbors and D be the set of training examples.
2: for each test instance z = (x′, y′) do
3:   Compute d(x′, x), the distance between z and every example (x, y) ∈ D.
4:   Select Dz ⊆ D, the set of k closest training examples to z.
5:   y′ = argmax_v ∑_(xi, yi)∈Dz I(v = yi)
6: end for

Once the nearest neighbor list is obtained, the test instance is classified based on the majority class of its nearest neighbors:

Majority Voting: y′ = argmax_v ∑_(xi, yi)∈Dz I(v = yi),   (4.5)

where v is a class label, yi is the class label for one of the nearest neighbors, and I(⋅) is an indicator function that returns the value 1 if its argument is true and 0 otherwise.

In the majority voting approach, every neighbor has the same impact on the classification. This makes the algorithm sensitive to the choice of k, as shown in Figure 4.6. One way to reduce the impact of k is to weight the influence of each nearest neighbor xi according to its distance: wi = 1/d(x′, xi)². As a result, training examples that are located far away from z have a weaker impact on the classification compared to those that are located close to z. Using the distance-weighted voting scheme, the class label can be determined as follows:

Distance-Weighted Voting: y′ = argmax_v ∑_(xi, yi)∈Dz wi × I(v = yi).   (4.6)
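A compact Python sketch of the k-nearest neighbor classifier in Algorithm 4.2 is given below, supporting both the majority voting of Equation 4.5 and the distance-weighted voting of Equation 4.6; Euclidean distance and the tiny training set are illustrative assumptions, not part of the original text.

import math
from collections import Counter

def knn_predict(test_x, training_data, k, weighted=False):
    # training_data: list of (attribute_vector, class_label) pairs.
    dist = lambda a, b: math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    neighbors = sorted(training_data, key=lambda xy: dist(test_x, xy[0]))[:k]  # D_z
    votes = Counter()
    for x_i, y_i in neighbors:
        d = dist(test_x, x_i)
        w = 1.0 / (d ** 2) if (weighted and d > 0) else 1.0   # w_i = 1 / d(x', x_i)^2
        votes[y_i] += w
    return votes.most_common(1)[0][0]

train = [((1.0, 1.0), "+"), ((1.2, 0.8), "+"), ((3.0, 3.2), "-"), ((3.1, 2.9), "-")]
print(knn_predict((1.1, 1.0), train, k=3))                 # "+"
print(knn_predict((1.1, 1.0), train, k=3, weighted=True))  # "+"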
4.3.2 Characteristics of Nearest Neighbor Classifiers

1. Nearest neighbor classification is part of a more general technique known as instance-based learning, which does not build a global model, but rather uses the training examples to make predictions for a test instance. (Thus, such classifiers are often said to be "model free.") Such algorithms require a proximity measure to determine the similarity or distance between instances and a classification function that returns the predicted class of a test instance based on its proximity to other instances.

2. Although lazy learners, such as nearest neighbor classifiers, do not require model building, classifying a test instance can be quite expensive because we need to compute the proximity values individually between the test and training examples. In contrast, eager learners often spend the bulk of their computing resources for model building. Once a model has been built, classifying a test instance is extremely fast.

3. Nearest neighbor classifiers make their predictions based on local information. (This is equivalent to building a local model for each test instance.) By contrast, decision tree and rule-based classifiers attempt to find a global model that fits the entire input space. Because the classification decisions are made locally, nearest neighbor classifiers (with small values of k) are quite susceptible to noise.

4. Nearest neighbor classifiers can produce decision boundaries of arbitrary shape. Such boundaries provide a more flexible model representation compared to decision tree and rule-based classifiers that are often constrained to rectilinear decision boundaries. The decision boundaries of nearest neighbor classifiers also have high variability because they depend on the composition of training examples in the local neighborhood. Increasing the number of nearest neighbors may reduce such variability.

5. Nearest neighbor classifiers have difficulty handling missing values in both the training and test sets since proximity computations normally require the presence of all attributes. Although the subset of attributes present in two instances can be used to compute a proximity, such an approach may not produce good results since the proximity measures may be different for each pair of instances and thus hard to compare.

6. Nearest neighbor classifiers can handle the presence of interacting attributes, i.e., attributes that have more predictive power taken in combination than by themselves, by using appropriate proximity measures that can incorporate the effects of multiple attributes together.

7. The presence of irrelevant attributes can distort commonly used proximity measures, especially when the number of irrelevant attributes is large. Furthermore, if there are a large number of redundant attributes that are highly correlated with each other, then the proximity measure can be overly biased toward such attributes, resulting in improper estimates of distance. Hence, the presence of irrelevant and redundant attributes can adversely affect the performance of nearest neighbor classifiers.

8. Nearest neighbor classifiers can produce wrong predictions unless the appropriate proximity measure and data preprocessing steps are taken. For example, suppose we want to classify a group of people based on attributes such as height (measured in meters) and weight (measured in pounds). The height attribute has a low variability, ranging from 1.5 m to 1.85 m, whereas the weight attribute may vary from 90 lb. to 250 lb. If the scale of the attributes is not taken into consideration, the proximity measure may be dominated by differences in the weights of a person.
4.4 Naïve Bayes Classifier

Many classification problems involve uncertainty. First, the observed attributes and class labels may be unreliable due to imperfections in the measurement process, e.g., due to the limited preciseness of sensor devices. Second, the set of attributes chosen for classification may not be fully representative of the target class, resulting in uncertain predictions. To illustrate this, consider the problem of predicting a person's risk for heart disease based on a model that uses their diet and workout frequency as attributes. Although most people who eat healthily and exercise regularly have less chance of developing heart disease, they may still be at risk due to other latent factors, such as heredity, excessive smoking, and alcohol abuse, that are not captured in the model. Third, a classification model learned over a finite training set may not be able to fully capture the true relationships in the overall data, as discussed in the context of model overfitting in the previous chapter. Finally, uncertainty in predictions may arise due to the inherent random nature of real-world systems, such as those encountered in weather forecasting problems.

In the presence of uncertainty, there is a need to not only make predictions of class labels but also provide a measure of confidence associated with every prediction. Probability theory offers a systematic way for quantifying and manipulating uncertainty in data, and thus, is an appealing framework for assessing the confidence of predictions. Classification models that make use of probability theory to represent the relationship between attributes and class labels are known as probabilistic classification models. In this section, we present the naïve Bayes classifier, which is one of the simplest and most widely-used probabilistic classification models.
4.4.1 Basics of Probability Theory

Before we discuss how the naïve Bayes classifier works, we first introduce some basics of probability theory that will be useful in understanding the probabilistic classification models presented in this chapter. This involves defining the notion of probability and introducing some common approaches for manipulating probability values.

Consider a variable X, which can take any discrete value from the set {x1, …, xk}. When we have multiple observations of that variable, such as in a data set where the variable describes some characteristic of data objects, then we can compute the relative frequency with which each value occurs. Specifically, suppose that X has the value xi for ni data objects. The relative frequency with which we observe the event X = xi is then ni/N, where N denotes the total number of occurrences (N = ∑_{i=1}^k ni). These relative frequencies characterize the uncertainty that we have with respect to what value X may take for an unseen observation and motivate the notion of probability.

More formally, the probability of an event e, e.g., P(X = xi), measures how likely it is for the event e to occur. The most traditional view of probability is based on relative frequency of events (frequentist), while the Bayesian viewpoint (described later) takes a more flexible view of probabilities. In either case, a probability is always a number between 0 and 1. Further, the sum of probability values of all possible events, e.g., outcomes of a variable X, is equal to 1. Variables that have probabilities associated with each possible outcome (values) are known as random variables.

Now, let us consider two random variables, X and Y, that can each take k discrete values. Let nij be the number of times we observe X = xi and Y = yj, out of a total number of N occurrences. The joint probability of observing X = xi and Y = yj together can be estimated as

P(X = xi, Y = yj) = nij / N.   (4.7)

(This is an estimate since we typically have only a finite subset of all possible observations.) Joint probabilities can be used to answer questions such as "what is the probability that there will be a surprise quiz today and I will be late for the class." Joint probabilities are symmetric, i.e., P(X = x, Y = y) = P(Y = y, X = x). For joint probabilities, it is useful to consider their sum with respect to one of the random variables, as described in the following equation:

∑_{j=1}^k P(X = xi, Y = yj) = ∑_{j=1}^k nij / N = ni / N = P(X = xi),   (4.8)

where ni is the total number of times we observe X = xi irrespective of the value of Y. Notice that ni/N is essentially the probability of observing X = xi. Hence, by summing out the joint probabilities with respect to a random variable Y, we obtain the probability of observing the remaining variable X. This operation is called marginalization and the probability value P(X = xi) obtained by marginalizing out Y is sometimes called the marginal probability of X. As we will see later, joint probability and marginal probability form the basic building blocks of a number of probabilistic classification models discussed in this chapter.

Notice that in the previous discussions, we used P(X = xi) to denote the probability of a particular outcome of a random variable X. This notation can easily become cumbersome when a number of random variables are involved. Hence, in the remainder of this section, we will use P(X) to denote the probability of any generic outcome of the random variable X, while P(xi) will be used to represent the probability of the specific outcome xi.
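The estimates in Equations 4.7 and 4.8 amount to counting co-occurrences, as the following Python sketch shows for a hypothetical set of observations (the outcome names are illustrative only).

from collections import Counter

observations = [("x1", "y1"), ("x1", "y2"), ("x2", "y1"), ("x1", "y1"), ("x2", "y2")]
N = len(observations)
# Joint probability estimates P(X = xi, Y = yj) = nij / N (Equation 4.7).
joint = {pair: c / N for pair, c in Counter(observations).items()}

# Marginalize out Y: P(X = xi) = sum over j of P(X = xi, Y = yj) (Equation 4.8).
marginal_x = Counter()
for (x, _), p in joint.items():
    marginal_x[x] += p

print(joint[("x1", "y1")])            # 0.4
print(round(marginal_x["x1"], 2))     # 0.6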
Bayes Theorem

Suppose you have invited two of your friends Alex and Martha to a dinner party. You know that Alex attends 40% of the parties he is invited to. Further, if Alex is going to a party, there is an 80% chance of Martha coming along. On the other hand, if Alex is not going to the party, the chance of Martha coming to the party is reduced to 30%. If Martha has responded that she will be coming to your party, what is the probability that Alex will also be coming?

Bayes theorem presents the statistical principle for answering questions like the previous one, where evidence from multiple sources has to be combined with prior beliefs to arrive at predictions. Bayes theorem can be briefly described as follows.

Let P(Y|X) denote the conditional probability of observing the random variable Y whenever the random variable X takes a particular value. P(Y|X) is often read as the probability of observing Y conditioned on the outcome of X. Conditional probabilities can be used for answering questions such as "given that it is going to rain today, what will be the probability that I will go to the class." Conditional probabilities of X and Y are related to their joint probability in the following way:

P(Y|X) = P(X, Y) / P(X),   which implies   (4.9)

P(X, Y) = P(Y|X) × P(X) = P(X|Y) × P(Y).   (4.10)

Rearranging the last two expressions in Equation 4.10 leads to Equation 4.11, which is known as Bayes theorem:

P(Y|X) = P(X|Y) P(Y) / P(X).   (4.11)

Bayes theorem provides a relationship between the conditional probabilities P(Y|X) and P(X|Y). Note that the denominator in Equation 4.11 involves the marginal probability of X, which can also be represented as

P(X) = ∑_{i=1}^k P(X, yi) = ∑_{i=1}^k P(X|yi) × P(yi).

Using the previous expression for P(X), we can obtain the following equation for P(Y|X) solely in terms of P(X|Y) and P(Y):

P(Y|X) = P(X|Y) P(Y) / ∑_{i=1}^k P(X|yi) P(yi).   (4.12)

Example 4.4. [Bayes Theorem] Bayes theorem can be used to solve a number of inferential questions about random variables. For example, consider the problem stated at the beginning on inferring whether Alex will come to the party. Let P(A = 1) denote the probability of Alex going to a party, while P(A = 0) denotes the probability of him not going to a party. We know that

P(A = 1) = 0.4, and P(A = 0) = 1 − P(A = 1) = 0.6.

Further, let P(M = 1|A) denote the conditional probability of Martha going to a party conditioned on whether Alex is going to the party. P(M = 1|A) takes the following values:

P(M = 1|A = 1) = 0.8, and P(M = 1|A = 0) = 0.3.

We can use the above values of P(M|A) and P(A) to compute the probability of Alex going to the party given Martha is going to the party, P(A = 1|M = 1), as follows:

P(A = 1|M = 1) = P(M = 1|A = 1) P(A = 1) / [P(M = 1|A = 0) P(A = 0) + P(M = 1|A = 1) P(A = 1)] = (0.8 × 0.4) / (0.3 × 0.6 + 0.8 × 0.4) = 0.64.   (4.13)

Notice that even though the prior probability P(A) of Alex going to the party is low, the observation that Martha is going, M = 1, affects the conditional probability P(A = 1|M = 1). This shows the value of Bayes theorem in combining prior assumptions with observed outcomes to make predictions. Since P(A = 1|M = 1) > 0.5, it is more likely for Alex to join if Martha is going to the party.
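The arithmetic in Example 4.4 can be checked with a short Python computation; the variable names are ours.

p_a1 = 0.4                      # P(A = 1): Alex attends
p_a0 = 1 - p_a1                 # P(A = 0)
p_m1_given_a1 = 0.8             # P(M = 1 | A = 1)
p_m1_given_a0 = 0.3             # P(M = 1 | A = 0)

p_m1 = p_m1_given_a1 * p_a1 + p_m1_given_a0 * p_a0   # P(M = 1) by marginalization
p_a1_given_m1 = p_m1_given_a1 * p_a1 / p_m1          # Bayes theorem (Equation 4.12)
print(round(p_m1, 2), round(p_a1_given_m1, 2))       # 0.5 0.64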
Using Bayes Theorem for Classification

For the purpose of classification, we are interested in computing the probability of observing a class label y for a data instance given its set of attribute values x. This can be represented as P(y|x), which is known as the posterior probability of the target class. Using the Bayes theorem, we can represent the posterior probability as

P(y|x) = P(x|y) P(y) / P(x).   (4.14)

Note that the numerator of the previous equation involves two terms, P(x|y) and P(y), both of which contribute to the posterior probability P(y|x). We describe both of these terms in the following.

The first term P(x|y) is known as the class-conditional probability of the attributes given the class label. P(x|y) measures the likelihood of observing x from the distribution of instances belonging to y. If x indeed belongs to class y, then we should expect P(x|y) to be high. From this point of view, the use of class-conditional probabilities attempts to capture the process from which the data instances were generated. Because of this interpretation, probabilistic classification models that involve computing class-conditional probabilities are known as generative classification models.
Apart from their use in computing posterior probabilities and making predictions, class-conditional probabilities also provide insights about the underlying mechanism behind the generation of attribute values.

The second term in the numerator of Equation 4.14 is the prior probability P(y). The prior probability captures our prior beliefs about the distribution of class labels, independent of the observed attribute values. (This is the Bayesian viewpoint.) For example, we may have a prior belief that the likelihood of any person to suffer from a heart disease is α, irrespective of their diagnosis reports. The prior probability can either be obtained using expert knowledge, or inferred from the historical distribution of class labels.

The denominator in Equation 4.14 involves the probability of evidence, P(x). Note that this term does not depend on the class label and thus can be treated as a normalization constant in the computation of posterior probabilities. Further, the value of P(x) can be calculated as P(x) = ∑_i P(x|yi) P(yi).

Bayes theorem provides a convenient way to combine our prior beliefs with the likelihood of obtaining the observed attribute values. During the training phase, we are required to learn the parameters for P(y) and P(x|y). The prior probability P(y) can be easily estimated from the training set by computing the fraction of training instances that belong to each class. To compute the class-conditional probabilities, one approach is to consider the fraction of training instances of a given class for every possible combination of attribute values. For example, suppose that there are two attributes X1 and X2 that can each take a discrete value from c1 to ck. Let n0 denote the number of training instances belonging to class 0, out of which nij0 training instances have X1 = ci and X2 = cj. The class-conditional probability can then be given as

P(X1 = ci, X2 = cj | Y = 0) = nij0 / n0.

This approach can easily become computationally prohibitive as the number of attributes increases, due to the exponential growth in the number of attribute value combinations. For example, if every attribute can take k discrete values, then the number of attribute value combinations is equal to k^d, where d is the number of attributes. The large number of attribute value combinations can also result in poor estimates of class-conditional probabilities, since every combination will have fewer training instances when the size of the training set is small.

In the following, we present the naïve Bayes classifier, which makes a simplifying assumption about the class-conditional probabilities, known as the naïve Bayes assumption. The use of this assumption significantly helps in obtaining reliable estimates of class-conditional probabilities, even when the number of attributes is large.

4.4.2 Naïve Bayes Assumption

The naïve Bayes classifier assumes that the class-conditional probability of all attributes x can be factored as a product of class-conditional probabilities of every attribute xi, as described in the following equation:

P(x|y) = ∏_{i=1}^d P(xi|y),   (4.15)

where every data instance x consists of d attributes, {x1, x2, …, xd}. The basic assumption behind the previous equation is that the attribute values are conditionally independent of each other, given the class label y. This means that the attributes are influenced only by the target class and if we know the class label, then we can consider the attributes to be independent of each other. The concept of conditional independence can be formally stated as follows.
Conditional Independence

Let X1, X2, and Y denote three sets of random variables. The variables in X1 are said to be conditionally independent of X2, given Y, if the following condition holds:

P(X1|X2, Y) = P(X1|Y).   (4.16)

This means that conditioned on Y, the distribution of X1 is not influenced by the outcomes of X2, and hence X1 is conditionally independent of X2. To illustrate the notion of conditional independence, consider the relationship between a person's arm length (X1) and his or her reading skills (X2). One might observe that people with longer arms tend to have higher levels of reading skills, and thus consider X1 and X2 to be related to each other. However, this relationship can be explained by another factor, which is the age of the person (Y). A young child tends to have short arms and lacks the reading skills of an adult. If the age of a person is fixed, then the observed relationship between arm length and reading skills disappears. Thus, we can conclude that arm length and reading skills are not directly related to each other and are conditionally independent when the age variable is fixed.

Another way of describing conditional independence is to consider the joint conditional probability, P(X1, X2|Y), as follows:

P(X1, X2|Y) = P(X1, X2, Y) / P(Y)
            = [P(X1, X2, Y) / P(X2, Y)] × [P(X2, Y) / P(Y)]
            = P(X1|X2, Y) × P(X2|Y)
            = P(X1|Y) × P(X2|Y),   (4.17)

where Equation 4.16 was used to obtain the last line of Equation 4.17. The previous description of conditional independence is quite useful from an operational perspective. It states that the joint conditional probability of X1 and X2 given Y can be factored as the product of conditional probabilities of X1 and X2 considered separately. This forms the basis of the naïve Bayes assumption stated in Equation 4.15.
How a Naïve Bayes Classifier Works

Using the naïve Bayes assumption, we only need to estimate the conditional probability of each xi given Y separately, instead of computing the class-conditional probability for every combination of attribute values. For example, if ni0 and nj0 denote the number of training instances belonging to class 0 with X1 = ci and X2 = cj, respectively, then the class-conditional probability can be estimated as

P(X1 = ci, X2 = cj | Y = 0) = (ni0 / n0) × (nj0 / n0).

In the previous equation, we only need to count the number of training instances for every one of the k values of an attribute X, irrespective of the values of other attributes. Hence, the number of parameters needed to learn class-conditional probabilities is reduced from k^d to dk. This greatly simplifies the expression for the class-conditional probability and makes it more amenable to learning parameters and making predictions, even in high-dimensional settings.

The naïve Bayes classifier computes the posterior probability for a test instance x by using the following equation:

P(y|x) = P(y) ∏_{i=1}^d P(xi|y) / P(x).   (4.18)

Since P(x) is fixed for every y and only acts as a normalizing constant to ensure that P(y|x) ∈ [0, 1], we can write

P(y|x) ∝ P(y) ∏_{i=1}^d P(xi|y).

Hence, it is sufficient to choose the class that maximizes P(y) ∏_{i=1}^d P(xi|y).

One of the useful properties of the naïve Bayes classifier is that it can easily work with incomplete information about data instances, when only a subset of attributes are observed at every instance. For example, if we only observe p out of d attributes at a data instance, then we can still compute P(y) ∏_{i=1}^p P(xi|y) using those p attributes and choose the class with the maximum value. The naïve Bayes classifier can thus naturally handle missing values in test instances. In fact, in the extreme case where no attributes are observed, we can still use the prior probability P(y) as an estimate of the posterior probability. As we observe more attributes, we can keep refining the posterior probability to better reflect the likelihood of observing the data instance.

In the next two subsections, we describe several approaches for estimating the conditional probabilities P(xi|y) for categorical and continuous attributes from the training set.

Estimating Conditional Probabilities for Categorical Attributes

For a categorical attribute Xi, the conditional probability P(Xi = c|y) is estimated according to the fraction of training instances in class y where Xi takes on a particular categorical value c:

P(Xi = c|y) = nc / n,

where n is the number of training instances belonging to class y, out of which nc instances have Xi = c. For example, in the training set given in Figure 4.8, seven people have the class label Defaulted Borrower = No, out of which three people have Home Owner = Yes while the remaining four have Home Owner = No. As a result, the conditional probability for P(Home Owner = Yes|Defaulted Borrower = No) is equal to 3/7. Similarly, the conditional probability for defaulted borrowers with Marital Status = Single is given by P(Marital Status = Single|Defaulted Borrower = Yes) = 2/3. Note that the sum of conditional probabilities over all possible outcomes of Xi is equal to one, i.e., ∑_c P(Xi = c|y) = 1.

Figure 4.8. Training set for predicting the loan default problem.
Estimating Conditional Probabilities for Continuous Attributes

There are two ways to estimate the class-conditional probabilities for continuous attributes:

1. We can discretize each continuous attribute and then replace the continuous values with their corresponding discrete intervals. This approach transforms the continuous attributes into ordinal attributes, and the simple method described previously for computing the conditional probabilities of categorical attributes can be employed. Note that the estimation error of this method depends on the discretization strategy (as described in Section 2.3.6 on page 63), as well as the number of discrete intervals. If the number of intervals is too large, every interval may have an insufficient number of training instances to provide a reliable estimate of P(Xi|Y). On the other hand, if the number of intervals is too small, then the discretization process may lose information about the true distribution of continuous values, and thus result in poor predictions.

2. We can assume a certain form of probability distribution for the continuous variable and estimate the parameters of the distribution using the training data. For example, we can use a Gaussian distribution to represent the conditional probability of continuous attributes. The Gaussian distribution is characterized by two parameters, the mean, μ, and the variance, σ². For each class yj, the class-conditional probability for attribute Xi is

P(Xi = xi|Y = yj) = (1 / (√(2π) σij)) exp[−(xi − μij)² / (2σij²)].   (4.19)

The parameter μij can be estimated using the sample mean of Xi (x̄) for all training instances that belong to yj. Similarly, σij² can be estimated from the sample variance (s²) of such training instances. For example, consider the annual income attribute shown in Figure 4.8. The sample mean and variance for this attribute with respect to the class No are

x̄ = (125 + 100 + 70 + ⋯ + 75) / 7 = 110,
s² = [(125 − 110)² + (100 − 110)² + ⋯ + (75 − 110)²] / 6 = 2975,
s = √2975 = 54.54.

Given a test instance with taxable income equal to $120K, we can use the following value as its conditional probability given class No:

P(Income = 120|No) = (1 / (√(2π) × 54.54)) exp[−(120 − 110)² / (2 × 2975)] = 0.0072.

Example 4.5. [Naïve Bayes Classifier] Consider the data set shown in Figure 4.9(a), where the target class is Defaulted Borrower, which can take two values Yes and No. We can compute the class-conditional probability for each categorical attribute and the sample mean and variance for the continuous attribute, as summarized in Figure 4.9(b).

We are interested in predicting the class label of a test instance x = (Home Owner = No, Marital Status = Married, Annual Income = $120K). To do this, we first compute the prior probabilities by counting the number of training instances belonging to every class. We thus obtain P(Yes) = 0.3 and P(No) = 0.7. Next, we can compute the class-conditional probability as follows:
Figure4.9.ThenaïveBayesclassifierfortheloanclassificationproblem.
P(x|No) = P(Home Owner = No|No) × P(Marital Status = Married|No) × P(Annual Income = $120K|No) = 0.0024,
P(x|Yes) = P(Home Owner = No|Yes) × P(Marital Status = Married|Yes) × P(Annual Income = $120K|Yes) = 0.

Notice that the class-conditional probability for class Yes has become 0 because there are no instances belonging to class Yes with Marital Status = Married in the training set. Using these class-conditional probabilities, we can estimate the posterior probabilities as

P(No|x) = (0.7 × 0.0024)/P(x) = 0.0016α,
P(Yes|x) = (0.3 × 0)/P(x) = 0,

where α = 1/P(x) is a normalizing constant. Since P(No|x) > P(Yes|x), the instance is classified as No.
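To make the arithmetic of Example 4.5 easy to reproduce, here is a small Python sketch. The income parameters (mean 110, variance 2975 for class No) and the priors are quoted from the text above; the categorical factors (4/7 and 4/7 for class No, 1 for Home Owner = No given Yes) are assumptions read off Figure 4.9(b) that are consistent with the value P(x|No) = 0.0024 quoted above, and 1.2 × 10⁻⁹ is the income density for class Yes used later in this section.

import math

def gaussian_density(x, mean, var):
    # Class-conditional density of Equation 4.19.
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

p_income_no = gaussian_density(120, 110, 2975)        # ~0.0072, as computed above

# Unnormalized posteriors P(y) * P(x|y); the constant 1/P(x) is dropped.
score_no  = 0.7 * (4 / 7) * (4 / 7) * p_income_no      # ~0.7 * 0.0024 = 0.0016
score_yes = 0.3 * 1.0 * 0.0 * 1.2e-9                   # zero: no married defaulters observed
print("No" if score_no > score_yes else "Yes")         # the instance is classified as No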
Handling Zero Conditional Probabilities

The preceding example illustrates a potential problem with using the naïve Bayes assumption in estimating class-conditional probabilities. If the conditional probability for any of the attributes is zero, then the entire expression for the class-conditional probability becomes zero. Note that zero conditional probabilities arise when the number of training instances is small and the number of possible values of an attribute is large. In such cases, it may happen that a combination of attribute values and class labels is never observed, resulting in a zero conditional probability.

In a more extreme case, if the training instances do not cover some combinations of attribute values and class labels, then we may not be able to even classify some of the test instances. For example, if P(Marital Status = Divorced|No) is zero instead of 1/7, then a data instance with attribute set x = (Home Owner = Yes, Marital Status = Divorced, Income = $120K) has the following class-conditional probabilities:

P(x|No) = 3/7 × 0 × 0.0072 = 0,
P(x|Yes) = 0 × 1/3 × 1.2 × 10⁻⁹ = 0.

Since both of the class-conditional probabilities are 0, the naïve Bayes classifier will not be able to classify the instance. To address this problem, it is important to adjust the conditional probability estimates so that they are not as brittle as simply using fractions of training instances. This can be achieved by using the following alternate estimates of conditional probability:

Laplace estimate: P(X_i = c|y) = (n_c + 1)/(n + v),   (4.20)

m-estimate: P(X_i = c|y) = (n_c + mp)/(n + m),   (4.21)

where n is the number of training instances belonging to class y, n_c is the number of training instances with X_i = c and Y = y, v is the total number of attribute values that X_i can take, p is some initial estimate of P(X_i = c|y) that is known a priori, and m is a hyper-parameter that indicates our confidence in using p when the fraction of training instances is too brittle. Note that even if n_c = 0, both the Laplace and m-estimates provide non-zero values of conditional probabilities. Hence, they avoid the problem of vanishing class-conditional probabilities and thus generally provide more robust estimates of posterior probabilities.
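A minimal sketch of the two smoothed estimators (Equations 4.20 and 4.21), in Python. The counts below correspond to the hypothetical Marital Status = Divorced | No case discussed above (n_c = 0, n = 7, and v = 3 marital-status values); the choices p = 1/3 and m = 3 are illustrative assumptions.

def laplace_estimate(n_c, n, v):
    # Equation 4.20: add-one smoothing over the v possible attribute values.
    return (n_c + 1) / (n + v)

def m_estimate(n_c, n, p, m):
    # Equation 4.21: shrink the raw fraction n_c/n toward the prior guess p.
    return (n_c + m * p) / (n + m)

print(laplace_estimate(0, 7, 3))        # 0.1 instead of 0
print(m_estimate(0, 7, p=1/3, m=3))     # 0.1 instead of 0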
Characteristics of Naïve Bayes Classifiers

1. Naïve Bayes classifiers are probabilistic classification models that are able to quantify the uncertainty in predictions by providing posterior probability estimates. They are also generative classification models as they treat the target class as the causative factor for generating the data instances. Hence, apart from computing posterior probabilities, naïve Bayes classifiers also attempt to capture the underlying mechanism behind the generation of data instances belonging to every class. They are thus useful for gaining predictive as well as descriptive insights.

2. By using the naïve Bayes assumption, they can easily compute class-conditional probabilities even in high-dimensional settings, provided that the attributes are conditionally independent of each other given the class labels. This property makes the naïve Bayes classifier a simple and effective classification technique that is commonly used in diverse application problems, such as text classification.

3. Naïve Bayes classifiers are robust to isolated noise points because such points are not able to significantly impact the conditional probability estimates, as they are often averaged out during training.

4. Naïve Bayes classifiers can handle missing values in the training set by ignoring the missing values of every attribute while computing its conditional probability estimates. Further, naïve Bayes classifiers can effectively handle missing values in a test instance, by using only the non-missing attribute values while computing posterior probabilities. However, if the frequency of missing values for a particular attribute value depends on the class label, then this approach will not accurately estimate posterior probabilities.

5. Naïve Bayes classifiers are robust to irrelevant attributes. If X_i is an irrelevant attribute, then P(X_i|Y) becomes almost uniformly distributed for every class y. The class-conditional probabilities for every class thus receive similar contributions of P(X_i|Y), resulting in negligible impact on the posterior probability estimates.

6. Correlated attributes can degrade the performance of naïve Bayes classifiers because the naïve Bayes assumption of conditional independence no longer holds for such attributes. For example, consider the following probabilities:

P(A = 0|Y = 0) = 0.4,  P(A = 1|Y = 0) = 0.6,
P(A = 0|Y = 1) = 0.6,  P(A = 1|Y = 1) = 0.4,

where A is a binary attribute and Y is a binary class variable. Suppose there is another binary attribute B that is perfectly correlated with A when Y = 0, but is independent of A when Y = 1. For simplicity, assume that the conditional probabilities for B are the same as for A. Given an instance with attributes A = 0, B = 0, and assuming conditional independence, we can compute its posterior probabilities as follows:

P(Y = 0|A = 0, B = 0) = P(A = 0|Y = 0) P(B = 0|Y = 0) P(Y = 0)/P(A = 0, B = 0) = 0.16 × P(Y = 0)/P(A = 0, B = 0),
P(Y = 1|A = 0, B = 0) = P(A = 0|Y = 1) P(B = 0|Y = 1) P(Y = 1)/P(A = 0, B = 0) = 0.36 × P(Y = 1)/P(A = 0, B = 0).

If P(Y = 0) = P(Y = 1), then the naïve Bayes classifier would assign the instance to class 1. However, the truth is,

P(A = 0, B = 0|Y = 0) = P(A = 0|Y = 0) = 0.4,

because A and B are perfectly correlated when Y = 0. As a result, the posterior probability for Y = 0 is

P(Y = 0|A = 0, B = 0) = P(A = 0, B = 0|Y = 0) P(Y = 0)/P(A = 0, B = 0) = 0.4 × P(Y = 0)/P(A = 0, B = 0),

which is larger than that for Y = 1. The instance should have been classified as class 0. Hence, the naïve Bayes classifier can produce incorrect results when the attributes are not conditionally independent given the class labels. Naïve Bayes classifiers are thus not well-suited for handling redundant or interacting attributes.
4.5BayesianNetworksTheconditionalindependenceassumptionmadebynaïveBayesclassifiersmayseemtoorigid,especiallyforclassificationproblemswheretheattributesaredependentoneachotherevenafterconditioningontheclasslabels.WethusneedanapproachtorelaxthenaïveBayesassumptionsothatwecancapturemoregenericrepresentationsofconditionalindependenceamongattributes.
Inthissection,wepresentaflexibleframeworkformodelingprobabilisticrelationshipsbetweenattributesandclasslabels,knownasBayesianNetworks.Bybuildingonconceptsfromprobabilitytheoryandgraphtheory,Bayesiannetworksareabletocapturemoregenericformsofconditionalindependenceusingsimpleschematicrepresentations.Theyalsoprovidethenecessarycomputationalstructuretoperforminferencesoverrandomvariablesinanefficientway.Inthefollowing,wefirstdescribethebasicrepresentationofaBayesiannetwork,andthendiscussmethodsforperforminginferenceandlearningmodelparametersinthecontextofclassification.
4.5.1GraphicalRepresentation
Bayesiannetworksbelongtoabroaderfamilyofmodelsforcapturingprobabilisticrelationshipsamongrandomvariables,knownasprobabilisticgraphicalmodels.Thebasicconceptbehindthesemodelsistousegraphicalrepresentationswherethenodesofthegraphcorrespondtorandomvariablesandtheedgesbetweenthenodesexpressprobabilistic
relationships.Figures4.10(a) and4.10(b) showexamplesofprobabilisticgraphicalmodelsusingdirectededges(witharrows)andundirectededges(withoutarrows),respectively.DirectedgraphicalmodelsarealsoknownasBayesiannetworkswhileundirectedgraphicalmodelsareknownasMarkovrandomfields.Thetwoapproachesusedifferentsemanticsforexpressingrelationshipsamongrandomvariablesandarethususefulindifferentcontexts.Inthefollowing,webrieflydescribeBayesiannetworksthatareusefulinthecontextofclassification.
ABayesiannetwork(alsoreferredtoasabeliefnetwork)involvesdirectededgesbetweennodes,whereeveryedgerepresentsadirectionofinfluenceamongrandomvariables.Forexample,Figure4.10(a) showsaBayesiannetworkwherevariableCdependsuponthevaluesofvariablesAandB,asindicatedbythearrowspointingtowardCfromAandB.Consequently,thevariableCinfluencesthevaluesofvariablesDandE.EveryedgeinaBayesiannetworkthusencodesadependencerelationshipbetweenrandomvariableswithaparticulardirectionality.
Figure4.10.Illustrationsoftwobasictypesofgraphicalmodels.
Bayesiannetworksaredirectedacyclicgraphs(DAG)becausetheydonotcontainanydirectedcyclessuchthattheinfluenceofanodeloopsbacktothesamenode.Figure4.11 showssomeexamplesofBayesiannetworksthatcapturedifferenttypesofdependencestructuresamongrandomvariables.Inadirectedacyclicgraph,ifthereisadirectededgefromXtoY,thenXiscalledtheparentofYandYiscalledthechildofX.NotethatanodecanhavemultipleparentsinaBayesiannetwork,e.g.,nodeDhastwoparentnodes,BandC,inFigure4.11(a) .Furthermore,ifthereisadirectedpathinthenetworkfromXtoZ,thenXisanancestorofZ,whileZisadescendantofX.Forexample,inthediagramshowninFigure4.11(b) ,AisadescendantofDandDisanancestorofB.Notethattherecanbemultipledirectedpathsbetweentwonodesofadirectedacyclicgraph,asisthecasefornodesAandDinFigure4.11(a) .
Figure4.11.ExamplesofBayesiannetworks.
ConditionalIndependenceAnimportantpropertyofaBayesiannetworkisitsabilitytorepresentvaryingformsofconditionalindependenceamongrandomvariables.ThereareseveralwaysofdescribingtheconditionalindependenceassumptionscapturedbyBayesiannetworks.Oneofthemostgenericwaysofexpressingconditionalindependenceistheconceptofd-separation,whichcanbeusedtodetermineifanytwosetsofnodesAandBareconditionallyindependentgivenanothersetofnodesC.AnotherusefulconceptisthatoftheMarkovblanketofanodeY,whichdenotestheminimalsetofnodesXthatmakesYindependentoftheothernodesinthegraph,whenconditionedonX.(SeeBibliographicNotesformoredetailsond-separationandMarkovblanket.)However,forthepurposeofclassification,itissufficienttodescribeasimplerexpressionofconditionalindependenceinBayesiannetworks,knownasthelocalMarkovproperty.
Property1(LocalMarkovProperty).AnodeinaBayesiannetworkisconditionallyindependentofitsnon-descendants,ifitsparentsareknown.
ToillustratethelocalMarkovproperty,considertheBayesnetworkshowninFigure4.11(b) .WecanstatethatAisconditionallyindependentofbothBandDgivenC,becauseCistheparentofAandnodesBandDarenon-descendantsofA.ThelocalMarkovpropertyhelpsininterpretingparent-childrelationshipsinBayesiannetworksasrepresentationsofconditionalprobabilities.Sinceanodeisconditionallyindependentofitsnon-descendants
givenitparents,theconditionalindependenceassumptionsimposedbyaBayesiannetworkisoftensparseinstructure.Nonetheless,BayesiannetworksareabletoexpressaricherclassofconditionalindependencestatementsamongattributesandclasslabelsthanthenaïveBayesclassifier.Infact,thenaïveBayesclassifiercanbeviewedasaspecialtypeofBayesiannetwork,wherethetargetclassYisattherootofatreeandeveryattributeisconnectedtotherootnodebyadirectededge,asshowninFigure4.12(a) .
Figure4.12.ComparingthegraphicalrepresentationofanaïveBayesclassifierwiththatofagenericBayesiannetwork.
Note that in a naïve Bayes classifier, every directed edge points from the target class to the observed attributes, suggesting that the class label is a factor behind the generation of attributes. Inferring the class label can thus be viewed as diagnosing the root cause behind the observed attributes. On the other hand, Bayesian networks provide a more generic structure of probabilistic relationships, since the target class is not required to be at the root of a tree but can appear anywhere in the graph, as shown in Figure 4.12(b). In this diagram, inferring Y not only helps in diagnosing the factors influencing X_3 and X_4, but also helps in predicting the influence of X_1 and X_2.
Joint Probability

The local Markov property can be used to succinctly express the joint probability of the set of random variables involved in a Bayesian network. To realize this, let us first consider a Bayesian network consisting of d nodes, X_1 to X_d, where the nodes have been numbered in such a way that X_i is an ancestor of X_j only if i < j. The joint probability of X = {X_1, …, X_d} can be generically factorized using the chain rule of probability as

P(X) = P(X_1) P(X_2|X_1) P(X_3|X_1, X_2) … P(X_d|X_1, …, X_{d−1}) = ∏_{i=1}^{d} P(X_i|X_1, …, X_{i−1}).   (4.22)

By the way we have constructed the graph, note that the set {X_1, …, X_{i−1}} contains only non-descendants of X_i. Hence, by using the local Markov property, we can write P(X_i|X_1, …, X_{i−1}) as P(X_i|pa(X_i)), where pa(X_i) denotes the parents of X_i. The joint probability can then be represented as

P(X) = ∏_{i=1}^{d} P(X_i|pa(X_i)).   (4.23)

It is thus sufficient to represent the probability of every node X_i in terms of its parent nodes, pa(X_i), for computing P(X). This is achieved with the help of probability tables that associate every node to its parent nodes as follows:

1. The probability table for node X_i contains the conditional probability values P(X_i|pa(X_i)) for every combination of values in X_i and pa(X_i).

2. If X_i has no parents (pa(X_i) = ∅), then the table contains only the prior probability P(X_i).
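As a small illustration of Equation 4.23, the sketch below (Python; the three-node chain A → B → C and its probability tables are invented for this example) evaluates the joint probability of one assignment by multiplying each node's table entry given its parents.

# A toy Bayesian network A -> B -> C over binary variables, with made-up tables.
p_A = {0: 0.7, 1: 0.3}                                   # prior table, A has no parents
p_B_given_A = {(0, 0): 0.9, (1, 0): 0.1,
               (0, 1): 0.4, (1, 1): 0.6}                 # P(B = b | A = a), keyed by (b, a)
p_C_given_B = {(0, 0): 0.8, (1, 0): 0.2,
               (0, 1): 0.3, (1, 1): 0.7}                 # P(C = c | B = b), keyed by (c, b)

def joint(a, b, c):
    # Equation 4.23: product of every node's probability given its parents.
    return p_A[a] * p_B_given_A[(b, a)] * p_C_given_B[(c, b)]

print(joint(1, 1, 0))   # 0.3 * 0.6 * 0.3 = 0.054
print(sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)))  # sums to 1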
Example4.6.[ProbabilityTables]Figure4.13 showsanexampleofaBayesiannetworkformodelingtherelationshipsbetweenapatient'ssymptomsandriskfactors.Theprobabilitytablesareshownatthesideofeverynodeinthefigure.Theprobabilitytablesassociatedwiththeriskfactors(ExerciseandDiet)containonlythepriorprobabilities,whereasthetablesforheartdisease,heartburn,bloodpressure,andchestpain,containtheconditionalprobabilities.
Figure4.13.ABayesiannetworkfordetectingheartdiseaseandheartburninpatients.
UseofHiddenVariables
ABayesiannetworktypicallyinvolvestwotypesofvariables:observedvariablesthatareclampedtospecificobservedvalues,andunobservedvariables,whosevaluesarenotknownandneedtobeinferredfromthenetwork.Todistinguishbetweenthesetwotypesofvariables,observedvariablesaregenerallyrepresentedusingshadednodeswhileunobservedvariablesarerepresentedusingemptynodes.Figure4.14 showsanexampleofaBayesiannetworkwithobservedvariables(A,B,andE)andunobservedvariables(CandD).
Figure4.14.Observedandunobservedvariablesarerepresentedusingunshadedandshadedcircles,respectively.
Inthecontextofclassification,theobservedvariablescorrespondtothesetofattributesX,whilethetargetclassisrepresentedusinganunobservedvariableYthatneedstobeinferredduringtesting.However,notethatagenericBayesiannetworkmaycontainmanyotherunobservedvariablesapartfromthetargetclass,asrepresentedinFigure4.15 asthesetofvariablesH.Theseunobservedvariablesrepresenthiddenorconfoundingfactorsthataffecttheprobabilitiesofattributesandclasslabels,althoughtheyareneverdirectlyobserved.TheuseofhiddenvariablesenhancestheexpressivepowerofBayesiannetworksinrepresentingcomplexprobabilistic
relationshipsbetweenattributesandclasslabels.ThisisoneofthekeydistinguishingpropertiesofBayesiannetworksascomparedtonaïveBayesclassifiers.
4.5.2InferenceandLearning
Given the probability tables corresponding to every node in a Bayesian network, the problem of inference corresponds to computing the probabilities of different sets of random variables. In the context of classification, one of the key inference problems is to compute the probability of a target class Y taking on a specific value y, given the set of observed attributes at a data instance, x. This can be represented using the following conditional probability:

P(Y = y|x) = P(y, x)/P(x) = P(y, x)/∑_{y′} P(y′, x).   (4.24)

The previous equation involves marginal probabilities of the form P(y, x). They can be computed by marginalizing out the hidden variables H from the joint probability as follows:

P(y, x) = ∑_H P(y, x, H),   (4.25)

where the joint probability P(y, x, H) can be obtained by using the factorization described in Equation 4.23. To understand the nature of computations involved in estimating P(y, x), consider the example Bayesian network shown in Figure 4.15, which involves a target class, Y, three observed attributes, X_1 to X_3, and four hidden variables, H_1 to H_4. For this network, we can express P(y, x) as

P(y, x) = ∑_{h_1} ∑_{h_2} ∑_{h_3} ∑_{h_4} P(y, x_1, x_2, x_3, h_1, h_2, h_3, h_4)
        = ∑_{h_1} ∑_{h_2} ∑_{h_3} ∑_{h_4} [P(h_1) P(h_2) P(x_2) P(h_4) P(x_1|h_1, h_2) × P(h_3|x_2, h_2) P(y|x_1, h_3) P(x_3|h_3, h_4)]   (4.26)
        = ∑_{h_1} ∑_{h_2} ∑_{h_3} ∑_{h_4} f(h_1, h_2, h_3, h_4),   (4.27)

Figure 4.15. An example of a Bayesian network with four hidden variables, H_1 to H_4, three observed attributes, X_1 to X_3, and one target class Y.

where f is a factor that depends on the values of h_1 to h_4. In the previous simplistic expression of P(y, x), a different summand is considered for every combination of values, h_1 to h_4, of the hidden variables, H_1 to H_4. If we assume that every variable in the network can take k discrete values, then the summation has to be carried out a total of k⁴ times. The computational complexity of this approach is thus O(k⁴). Moreover, the number of computations grows exponentially with the number of hidden variables, making it difficult to use this approach with networks that have a large number of hidden variables. In the following, we present different computational techniques for efficiently performing inferences in Bayesian networks.
Variable Elimination

To reduce the number of computations involved in estimating P(y, x), let us closely examine the expressions in Equations 4.26 and 4.27. Notice that although f(h_1, h_2, h_3, h_4) depends on the values of all four hidden variables, it can be decomposed as a product of several smaller factors, where every factor involves only a small number of hidden variables. For example, the factor P(h_4) depends only on the value of h_4, and thus acts as a constant multiplicative term when summations are performed over h_1, h_2, or h_3. Hence, if we place P(h_4) outside the summations of h_1 to h_3, we can save some repeated multiplications occurring inside every summand.

In general, we can push every summation as far inside as possible, so that the factors that do not depend on the summing variable are placed outside the summation. This will help reduce the number of wasteful computations by using smaller factors at every summation. To illustrate this process, consider the following sequence of steps for computing P(y, x), by rearranging the order of summations in Equation 4.26:

P(y, x) = P(x_2) ∑_{h_4} P(h_4) ∑_{h_3} P(y|x_1, h_3) P(x_3|h_3, h_4) × ∑_{h_2} P(h_2) P(h_3|x_2, h_2) ∑_{h_1} P(h_1) P(x_1|h_1, h_2)   (4.28)
        = P(x_2) ∑_{h_4} P(h_4) ∑_{h_3} P(y|x_1, h_3) P(x_3|h_3, h_4) × ∑_{h_2} P(h_2) P(h_3|x_2, h_2) f_1(h_2)   (4.29)
        = P(x_2) ∑_{h_4} P(h_4) ∑_{h_3} P(y|x_1, h_3) P(x_3|h_3, h_4) f_2(h_3)   (4.30)
        = P(x_2) ∑_{h_4} P(h_4) f_3(h_4),   (4.31)

where f_i represents the intermediate factor term obtained by summing out h_i. To check if the previous rearrangements provide any improvements in computational efficiency, let us count the number of computations occurring at every step of the process. At the first step (Equation 4.28), we perform a summation over h_1 using factors that depend on h_1 and h_2. This requires considering every pair of values in h_1 and h_2, resulting in O(k²) computations. Similarly, the second step (Equation 4.29) involves summing out h_2 using factors of h_2 and h_3, leading to O(k²) computations. The third step (Equation 4.30) again requires O(k²) computations as it involves summing out h_3 over factors depending on h_3 and h_4. Finally, the fourth step (Equation 4.31) involves summing out h_4 using factors depending on h_4, resulting in O(k) computations.

The overall complexity of the previous approach is thus O(k²), which is considerably smaller than the O(k⁴) complexity of the basic approach. Hence, by merely rearranging summations and using algebraic manipulations, we are able to improve the computational efficiency in computing P(y, x). This procedure is known as variable elimination.

The basic concept that variable elimination exploits to reduce the number of computations is the distributive nature of multiplication over addition operations. For example, consider the following multiplication and addition operations:

a·(b + c + d) = a·b + a·c + a·d.

Notice that the right-hand side of the previous equation involves three multiplications and three additions, while the left-hand side involves only one multiplication and three additions, thus saving two arithmetic operations. This property is utilized by variable elimination in pushing constant terms outside the summation, such that they are multiplied only once.
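A tiny numerical sketch of this idea (Python; the factors and the value k = 4 are invented, and for simplicity the summand factorizes completely into single-variable terms): summing the fully expanded product over all k⁴ combinations gives the same answer as pushing each sum inside, but the second form touches far fewer terms.

import itertools, random

k = 4
random.seed(0)
# Invented single-variable factors; their product plays the role of a factorized
# summand f(h1, h2, h3, h4).
g = [[random.random() for _ in range(k)] for _ in range(4)]

# Brute force: k^4 summands, each a product of four factors.
brute = sum(g[0][h1] * g[1][h2] * g[2][h3] * g[3][h4]
            for h1, h2, h3, h4 in itertools.product(range(k), repeat=4))

# Pushing every sum inside (the distributive law): four independent O(k) sums.
pushed = 1.0
for factor in g:
    pushed *= sum(factor)

print(abs(brute - pushed) < 1e-9)   # True: same value, far fewer multiplications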
Notethattheefficiencyofvariableeliminationdependsontheorderofhiddenvariablesusedforperformingsummations.Hence,wewouldideallyliketofindtheoptimalorderofvariablesthatresultinthesmallestnumberofcomputations.Unfortunately,findingtheoptimalorderofsummationsforagenericBayesiannetworkisanNP-Hardproblem,i.e.,theredoesnotexistanefficientalgorithmforfindingtheoptimalorderingthatcanruninpolynomialtime.However,thereexistsefficienttechniquesforhandlingspecialtypesofBayesiannetworks,e.g.,thoseinvolvingtree-likegraphs,asdescribedinthefollowing.
Sum-Product Algorithm for Trees

Note that in Equations 4.28 and 4.29, whenever a variable h_i is eliminated during marginalization, it results in the creation of a factor f_i that depends on the neighboring nodes of h_i. f_i is then absorbed in the factors of neighboring variables and the process is repeated until all unobserved variables have been marginalized. This phenomenon of variable elimination can be viewed as transmitting a local message from the variable being marginalized to its neighboring nodes. This idea of message passing utilizes the structure of the graph for performing computations, thus making it possible to use graph-theoretic approaches for making effective inferences. The sum-product algorithm builds on the concept of message passing for computing marginal and conditional probabilities on tree-based graphs.

Figure 4.16 shows an example of a tree involving five variables, X_1 to X_5. A key characteristic of a tree is that every node in the tree has exactly one parent, and there is only one directed edge between any two nodes in the tree. For the purpose of illustration, let us consider the problem of estimating the marginal probability of X_2, P(X_2). This can be obtained by marginalizing out every variable in the graph except X_2 and rearranging the summations to obtain the following expression:

Figure 4.16. An example of a Bayesian network with a tree structure.

P(x_2) = ∑_{x_1} ∑_{x_3} ∑_{x_4} ∑_{x_5} P(x_1) P(x_2|x_1) P(x_3|x_2) P(x_4|x_3) P(x_5|x_3)
       = (∑_{x_1} P(x_1) P(x_2|x_1)) × (∑_{x_3} P(x_3|x_2) (∑_{x_4} P(x_4|x_3)) (∑_{x_5} P(x_5|x_3))),

where the first parenthesized term is the message m_12(x_2), the inner sums over x_4 and x_5 are the messages m_43(x_3) and m_53(x_3), and the entire second parenthesized term is the message m_32(x_2). Here m_ij(x_j) has been conveniently chosen to represent the factor of x_j that is obtained by summing out x_i. We can view m_ij(x_j) as a local message passed from node x_i to node x_j, as shown using arrows in Figure 4.17(a). These local messages capture the influence of eliminating nodes on the marginal probabilities of neighboring nodes.

Before we formally describe the formula for computing m_ij(x_j) and P(x_j), we first define a potential function ψ(·) that is associated with every node and edge of the graph. We can define the potential of a node X_i as

ψ(X_i) = { P(X_i), if X_i is the root node; 1, otherwise. }   (4.32)
Figure 4.17. Illustration of message passing in the sum-product algorithm.

Similarly, we can define the potential of an edge between nodes X_i and X_j (where X_i is the parent of X_j) as

ψ(X_i, X_j) = P(X_j|X_i).

Using ψ(X_i) and ψ(X_i, X_j), we can represent m_ij(x_j) using the following equation:

m_ij(x_j) = ∑_{x_i} (ψ(x_i) ψ(x_i, x_j) ∏_{k∈N(i)\{j}} m_ki(x_i)),   (4.33)

where N(i) represents the set of neighbors of node X_i. The message m_ij that is transmitted from X_i to X_j can thus be recursively computed using the messages incident on X_i from its neighboring nodes excluding X_j. Note that the formula for m_ij involves taking a sum over all possible values of x_i, after multiplying the factors obtained from the neighbors of X_i. This approach of message passing is thus called the "sum-product" algorithm. Further, since m_ij represents a notion of "belief" propagated from X_i to X_j, this algorithm is also known as belief propagation. The marginal probability of a node X_i is then given as

P(x_i) = ψ(x_i) ∏_{j∈N(i)} m_ji(x_i).   (4.34)

A useful property of the sum-product algorithm is that it allows the messages to be reused for computing a different marginal probability in the future. For example, if we had to compute the marginal probability for node X_3, we would require the following messages from its neighboring nodes: m_23(x_3), m_43(x_3), and m_53(x_3). However, note that m_43(x_3) and m_53(x_3) have already been computed in the process of computing the marginal probability of X_2 and thus can be reused.

Notice that the basic operations of the sum-product algorithm resemble a message passing protocol over the edges of the network. A node sends out a message to all its neighboring nodes only after it has received incoming messages from all its neighbors. Hence, we can initialize the message passing protocol from the leaf nodes, and transmit messages till we reach the root node. We can then run a second pass of messages from the root node back to the leaf nodes. In this way, we can compute the messages for every edge in both directions, using just O(2|E|) operations, where |E| is the number of edges. Once we have transmitted all possible messages as shown in Figure 4.17(b), we can easily compute the marginal probability of every node in the graph using Equation 4.34.

In the context of classification, the sum-product algorithm can be easily modified for computing the conditional probability of the class label y given the set of observed attributes x̂, i.e., P(y|x̂). This basically amounts to computing P(y, X = x̂) in Equation 4.24, where X is clamped to the observed values x̂. To handle the scenario where some of the random variables are fixed and do not need to be normalized, we consider the following modification. If X_i is a random variable that is fixed to a specific value x̂_i, then we can simply modify ψ(X_i) and ψ(X_i, X_j) as follows:

ψ(X_i) = { 1, if X_i = x̂_i; 0, otherwise. }   (4.35)

ψ(X_i, X_j) = { P(X_j|x̂_i), if X_i = x̂_i; 0, otherwise. }   (4.36)

We can run the sum-product algorithm using these modified values for every observed variable and thus compute P(y, X = x̂).
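A compact sketch of message passing on the tree of Figure 4.16 (Python/NumPy; the binary probability tables are randomly invented for illustration). It computes P(X_2) once by brute-force marginalization of the joint and once from the two incoming messages m_12 and m_32, and the two results agree.

import numpy as np
from itertools import product

rng = np.random.default_rng(0)
def cpt():
    # Random conditional table t[parent, child] with rows summing to 1 (invented numbers).
    t = rng.random((2, 2))
    return t / t.sum(axis=1, keepdims=True)

p1 = rng.random(2); p1 /= p1.sum()                     # P(X1)
p2_1, p3_2, p4_3, p5_3 = cpt(), cpt(), cpt(), cpt()    # P(X2|X1), P(X3|X2), P(X4|X3), P(X5|X3)

# Brute force: sum the joint over every variable except X2.
brute = np.zeros(2)
for x1, x2, x3, x4, x5 in product(range(2), repeat=5):
    brute[x2] += p1[x1] * p2_1[x1, x2] * p3_2[x2, x3] * p4_3[x3, x4] * p5_3[x3, x5]

# Sum-product messages toward X2 (leaves first), as in the derivation above.
m43 = p4_3.sum(axis=1)                    # m_43(x3) = sum_x4 P(x4|x3)
m53 = p5_3.sum(axis=1)                    # m_53(x3) = sum_x5 P(x5|x3)
m32 = (p3_2 * (m43 * m53)).sum(axis=1)    # m_32(x2) = sum_x3 P(x3|x2) m_43(x3) m_53(x3)
m12 = p1 @ p2_1                           # m_12(x2) = sum_x1 P(x1) P(x2|x1)
print(np.allclose(brute, m12 * m32))      # True: P(x2) = m_12(x2) * m_32(x2)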
Figure4.18.Exampleofapoly-treeanditscorrespondingfactorgraph.
GeneralizationsforNon-TreeGraphsThesum-productalgorithmisguaranteedtooptimallyconvergeinthecaseoftreesusingasinglerunofmessagepassinginbothdirectionsofeveryedge.Thisisbecauseanytwonodesinatreehaveauniquepathforthetransmissionofmessages.Furthermore,sinceeverynodeinatreehasasingleparent,thejointprobabilityinvolvesonlyfactorsofatmosttwovariables.Hence,itissufficienttoconsiderpotentialsoveredgesandnotothergenericsubstructuresinthegraph.
Bothofthepreviouspropertiesareviolatedingraphsthatarenottrees,thusmakingitdifficulttodirectlyapplythesum-productalgorithmformakinginferences.However,anumberofvariantsofthesum-productalgorithmhavebeendevisedtoperforminferencesonabroaderfamilyofgraphsthantrees.Manyofthesevariantstransformtheoriginalgraphintoanalternativetree-basedrepresentation,andthenapplythesum-productalgorithmonthetransformedtree.Inthissection,webrieflydiscussonesuchtransformationsknownasfactorgraphs.
Factorgraphsareusefulformakinginferencesovergraphsthatviolatetheconditionthateverynodehasasingleparent.Nonetheless,theystillrequiretheabsenceofmultiplepathsbetweenanytwonodes,toguaranteeconvergence.Suchgraphsareknownaspoly-trees.Anexampleofapoly-treeisshowninFigure4.18(a) .
Apoly-treecanbetransformedintoatree-basedrepresentationwiththehelpoffactorgraphs.Thesegraphsconsistoftwotypesofnodes,variablesnodes(thatarerepresentedusingcircles)andfactornodes(thatarerepresented
usingsquares).Thefactornodesrepresentconditionalindependencerelationshipsamongthevariablesofthepoly-tree.Inparticular,everyprobabilitytablecanberepresentedasafactornode.Theedgesinafactorgraphareundirectedinnatureandrelateavariablenodetoafactornodeifthevariableisinvolvedintheprobabilitytablecorrespondingtothefactornode.Figure4.18(b) presentsthefactorgraphrepresentationofthepoly-treeshowninFigure4.18(a) .
Notethatthefactorgraphofapoly-treealwaysformsatree-likestructure,wherethereisauniquepathofinfluencebetweenanytwonodesinthefactorgraph.Hence,wecanapplyamodifiedformofsum-productalgorithmtotransmitmessagesbetweenvariablenodesandfactornodes,whichisguaranteedtoconvergetooptimalvalues.
LearningModelParametersInallourpreviousdiscussionsonBayesiannetworks,wehadassumedthatthetopologyoftheBayesiannetworkandthevaluesintheprobabilitytablesofeverynodewerealreadyknown.Inthissection,wediscussapproachesforlearningboththetopologyandtheprobabilitytablevaluesofaBayesiannetworkfromthetrainingdata.
Let us first consider the case where the topology of the network is known and we are only required to compute the probability tables. If there are no unobserved variables in the training data, then we can easily compute the probability table for P(X_i|pa(X_i)) by counting the fraction of training instances for every value of X_i and every combination of values in pa(X_i). However, if there are unobserved variables in X_i or pa(X_i), then computing the fraction of training instances for such variables is non-trivial and requires the use of advanced techniques such as the Expectation-Maximization algorithm (described later in Chapter 8).
LearningthestructureoftheBayesiannetworkisamuchmorechallengingtaskthanlearningtheprobabilitytables.Althoughtherearesomescoringapproachesthatattempttofindagraphstructurethatmaximizesthetraininglikelihood,theyareoftencomputationallyinfeasiblewhenthegraphislarge.Hence,acommonapproachforconstructingBayesiannetworksistousethesubjectiveknowledgeofdomainexperts.
4.5.3CharacteristicsofBayesianNetworks
1. Bayesiannetworksprovideapowerfulapproachforrepresentingprobabilisticrelationshipsbetweenattributesandclasslabelswiththehelpofgraphicalmodels.Theyareabletocapturecomplexformsofdependenciesamongvariables.Apartfromencodingpriorbeliefs,theyarealsoabletomodelthepresenceoflatent(unobserved)factorsashiddenvariablesinthegraph.Bayesiannetworksarethusquiteexpressiveandprovidepredictiveaswellasdescriptiveinsightsaboutthebehaviorofattributesandclasslabels.
2. Bayesiannetworkscaneasilyhandlethepresenceofcorrelatedorredundantattributes,asopposedtothenaïveBayesclassifier.ThisisbecauseBayesiannetworksdonotusethenaïveBayesassumptionaboutconditionalindependence,butinsteadareabletoexpressricherformsofconditionalindependence.
3. Similar to the naïve Bayes classifier, Bayesian networks are also quite robust to the presence of noise in the training data. Further, they can handle missing values during training as well as testing. If a test instance contains an attribute X_i with a missing value, then a Bayesian network can perform inference by treating X_i as an unobserved node and marginalizing out its effect on the target class. Hence, Bayesian networks are well-suited for handling incompleteness in the data, and can work with partial information. However, unless the pattern with which missing values occur is completely random, their presence will likely introduce some degree of error and/or bias into the analysis.
4. Bayesiannetworksarerobusttoirrelevantattributesthatcontainnodiscriminatoryinformationabouttheclasslabels.Suchattributesshownoimpactontheconditionalprobabilityofthetargetclass,andarethusrightfullyignored.
5. LearningthestructureofaBayesiannetworkisacumbersometaskthatoftenrequiresassistancefromexpertknowledge.However,oncethestructurehasbeendecided,learningtheparametersofthenetworkcanbequitestraightforward,especiallyifallthevariablesinthenetworkareobserved.
6. Duetoitsadditionalabilityofrepresentingcomplexformsofrelationships,BayesiannetworksaremoresusceptibletooverfittingascomparedtothenaïveBayesclassifier.Furthermore,BayesiannetworkstypicallyrequiremoretraininginstancesforeffectivelylearningtheprobabilitytablesthanthenaïveBayesclassifier.
7. Althoughthesum-productalgorithmprovidescomputationallyefficienttechniquesforperforminginferenceovertree-likegraphs,thecomplexityoftheapproachincreasesignificantlywhendealingwithgenericgraphsoflargesizes.Insituationswhereexactinferenceiscomputationallyinfeasible,itisquitecommontouseapproximateinferencetechniques.
4.6 Logistic Regression

The naïve Bayes and the Bayesian network classifiers described in the previous sections provide different ways of estimating the conditional probability of an instance x given class y, P(x|y). Such models are known as probabilistic generative models. Note that the conditional probability P(x|y) essentially describes the behavior of instances in the attribute space that are generated from class y. However, for the purpose of making predictions, we are finally interested in computing the posterior probability P(y|x). For example, computing the following ratio of posterior probabilities is sufficient for inferring class labels in a binary classification problem:

P(y = 1|x) / P(y = 0|x).

This ratio is known as the odds. If this ratio is greater than 1, then x is classified as y = 1. Otherwise, it is assigned to class y = 0. Hence, one may simply learn a model of the odds based on the attribute values of training instances, without having to compute P(x|y) as an intermediate quantity in the Bayes theorem.

Classification models that directly assign class labels without computing class-conditional probabilities are called discriminative models. In this section, we present a probabilistic discriminative model known as logistic regression, which directly estimates the odds of a data instance x using its attribute values. The basic idea of logistic regression is to use a linear predictor, z = wᵀx + b, for representing the odds of x as follows:

P(y = 1|x) / P(y = 0|x) = e^z = e^{wᵀx + b},   (4.37)

where w and b are the parameters of the model and aᵀ denotes the transpose of a vector a. Note that if wᵀx + b > 0, then x belongs to class 1 since its odds is greater than 1. Otherwise, x belongs to class 0.

Figure 4.19. Plot of the sigmoid (logistic) function, σ(z).

Since P(y = 0|x) + P(y = 1|x) = 1, we can re-write Equation 4.37 as

P(y = 1|x) / (1 − P(y = 1|x)) = e^z.

This can be further simplified to express P(y = 1|x) as a function of z:

P(y = 1|x) = 1/(1 + e^{−z}) = σ(z),   (4.38)

where the function σ(·) is known as the logistic or sigmoid function. Figure 4.19 shows the behavior of the sigmoid function as we vary z. We can see that σ(z) ≥ 0.5 only when z ≥ 0. We can also derive P(y = 0|x) using σ(z) as follows:
P(y = 0|x) = 1 − σ(z) = 1/(1 + e^{z}).   (4.39)

Hence, if we have learned a suitable value of parameters w and b, we can use Equations 4.38 and 4.39 to estimate the posterior probabilities of any data instance x and determine its class label.

4.6.1 Logistic Regression as a Generalized Linear Model

Since the posterior probabilities are real-valued, their estimation using the previous equations can be viewed as solving a regression problem. In fact, logistic regression belongs to a broader family of statistical regression models, known as generalized linear models (GLM). In these models, the target variable y is considered to be generated from a probability distribution P(y|x), whose mean μ can be estimated using a link function g(·) as follows:

g(μ) = z = wᵀx + b.   (4.40)

For binary classification using logistic regression, y follows a Bernoulli distribution (y can either be 0 or 1) and μ is equal to P(y = 1|x). The link function of logistic regression, called the logit function, can thus be represented as

g(μ) = log(μ/(1 − μ)).

Depending on the choice of link function g(·) and the form of probability distribution P(y|x), GLMs are able to represent a broad family of regression models, such as linear regression and Poisson regression. They require different approaches for estimating their model parameters, (w, b). In this chapter, we will only discuss approaches for estimating the model parameters of logistic regression, although methods for estimating parameters of other types of GLMs are often similar (and sometimes even simpler). (See Bibliographic Notes for more details on GLMs.)
Notethateventhoughlogisticregressionhasrelationshipswithregressionmodels,itisaclassificationmodelsincethecomputedposteriorprobabilitiesareeventuallyusedtodeterminetheclasslabelofadatainstance.
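As a quick illustration of Equations 4.37-4.39 (Python; the weight vector, bias, and test instance are invented), the sketch below turns a linear predictor into posterior probabilities and a class label.

import math

def sigmoid(z):
    # Equation 4.38
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b    # linear predictor z = w^T x + b
    p1 = sigmoid(z)                                 # P(y = 1 | x), Equation 4.38
    return (1 if p1 >= 0.5 else 0), p1, 1.0 - p1    # P(y = 0 | x) follows Equation 4.39

# Invented parameters and test instance, purely for illustration.
label, p_one, p_zero = predict(w=[0.8, -1.2], b=0.3, x=[2.0, 1.0])
print(label, round(p_one, 3), round(p_zero, 3))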
4.6.2LearningModelParameters
The parameters of logistic regression, (w, b), are estimated during training using a statistical approach known as the maximum likelihood estimation (MLE) method. This method involves computing the likelihood of observing the training data given (w, b), and then determining the model parameters (w*, b*) that yield the maximum likelihood.

Let D.train = {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)} denote a set of n training instances, where y_i is a binary variable (0 or 1). For a given training instance x_i, we can compute its posterior probabilities using Equations 4.38 and 4.39. We can then express the likelihood of observing y_i given x_i, w, and b as

P(y_i|x_i, w, b) = P(y = 1|x_i)^{y_i} × P(y = 0|x_i)^{1−y_i}
                 = (σ(z_i))^{y_i} × (1 − σ(z_i))^{1−y_i}
                 = (σ(wᵀx_i + b))^{y_i} × (1 − σ(wᵀx_i + b))^{1−y_i},   (4.41)

where σ(·) is the sigmoid function as described above. Equation 4.41 basically means that the likelihood P(y_i|x_i, w, b) is equal to P(y = 1|x_i) when y_i = 1, and equal to P(y = 0|x_i) when y_i = 0. The likelihood of all training instances, L(w, b), can then be computed by taking the product of individual likelihoods (assuming independence among training instances) as follows:

L(w, b) = ∏_{i=1}^{n} P(y_i|x_i, w, b) = ∏_{i=1}^{n} P(y = 1|x_i)^{y_i} × P(y = 0|x_i)^{1−y_i}.   (4.42)

The previous equation involves multiplying a large number of probability values, each of which is smaller than or equal to 1. Since this naïve computation can easily become numerically unstable when n is large, a more practical approach is to consider the negative logarithm (to base e) of the likelihood function, also known as the cross entropy function:

−log L(w, b) = −∑_{i=1}^{n} [y_i log(P(y = 1|x_i)) + (1 − y_i) log(P(y = 0|x_i))]
             = −∑_{i=1}^{n} [y_i log(σ(wᵀx_i + b)) + (1 − y_i) log(1 − σ(wᵀx_i + b))].

The cross entropy is a loss function that measures how unlikely it is for the training data to be generated from the logistic regression model with parameters (w, b). Intuitively, we would like to find model parameters that result in the lowest cross entropy, −log L(w*, b*):

(w*, b*) = argmin_{(w, b)} E(w, b) = argmin_{(w, b)} −log L(w, b),   (4.43)

where E(w, b) = −log L(w, b) is the loss function. It is worth emphasizing that E(w, b) is a convex function, i.e., any minima of E(w, b) will be a global minima. Hence, we can use any of the standard convex optimization techniques to solve Equation 4.43, which are mentioned in Appendix E. Here, we briefly describe the Newton-Raphson method that is commonly used for estimating the parameters of logistic regression. For ease of representation, we will use a single vector w̃ = (wᵀ b)ᵀ to describe (w, b), which is of size one greater than w. Similarly, we will consider the concatenated feature vector x̃ = (xᵀ 1)ᵀ, such that the linear predictor z = wᵀx + b can be succinctly written as z = w̃ᵀx̃. Also, the concatenation of all training labels, y_1 to y_n, will be represented as y, the set consisting of σ(z_1) to σ(z_n) will be represented as σ, and the concatenation of x̃_1 to x̃_n will be represented as X̃.

The Newton-Raphson method is an iterative method for finding w̃* that uses the following equation to update the model parameters at every iteration:

w̃^(new) = w̃^(old) − H^{−1} ∇E(w̃),   (4.44)

where ∇E(w̃) and H are the first- and second-order derivatives of the loss function E(w̃) with respect to w̃, respectively. The key intuition behind Equation 4.44 is to move the model parameters in the direction of maximum gradient, such that w̃ takes larger steps when ∇E(w̃) is large. When w̃ arrives at a minima after some number of iterations, ∇E(w̃) would become equal to 0 and thus result in convergence. Hence, we start with some initial values of w̃ (either randomly assigned or set to 0) and use Equation 4.44 to iteratively update w̃ till there are no significant changes in its value (beyond a certain threshold).

The first-order derivative of E(w̃) is given by

∇E(w̃) = −∑_{i=1}^{n} [y_i x̃_i (1 − σ(w̃ᵀx̃_i)) − (1 − y_i) x̃_i σ(w̃ᵀx̃_i)]
       = ∑_{i=1}^{n} (σ(w̃ᵀx̃_i) − y_i) x̃_i
       = X̃ᵀ(σ − y),   (4.45)

where we have used the fact that dσ(z)/dz = σ(z)(1 − σ(z)). Using ∇E(w̃), we can compute the second-order derivative of E(w̃) as

H = ∇∇E(w̃) = ∑_{i=1}^{n} σ(w̃ᵀx̃_i)(1 − σ(w̃ᵀx̃_i)) x̃_i x̃_iᵀ = X̃ᵀRX̃,   (4.46)

where R is a diagonal matrix whose i-th diagonal element is R_ii = σ_i(1 − σ_i). We can now use the first- and second-order derivatives of E(w̃) in Equation 4.44 to obtain the following update equation at the k-th iteration:

w̃^(k+1) = w̃^(k) − (X̃ᵀR_kX̃)^{−1} X̃ᵀ(σ_k − y),   (4.47)

where the subscript k under R and σ refers to using w̃^(k) to compute both terms.
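A minimal NumPy sketch of the Newton-Raphson update in Equation 4.47 (the synthetic data, iteration cap, and stopping threshold are invented for illustration; the design matrix already carries the appended column of 1s for the bias term):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_newton(X, y, n_iter=25, tol=1e-8):
    # Newton-Raphson (Equation 4.47) on the augmented design matrix [x | 1].
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])    # x~ = (x^T 1)^T
    w = np.zeros(X_aug.shape[1])                         # w~ initialized to 0
    for _ in range(n_iter):
        s = sigmoid(X_aug @ w)                           # sigma_k
        R = np.diag(s * (1.0 - s))                       # R_k, diagonal
        grad = X_aug.T @ (s - y)                         # Equation 4.45
        H = X_aug.T @ R @ X_aug                          # Equation 4.46
        step = np.linalg.solve(H, grad)
        w -= step                                        # Equations 4.44 / 4.47
        if np.max(np.abs(step)) < tol:
            break
    return w

# Tiny synthetic example (invented): the class depends on the first attribute plus noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(float)
print(fit_logistic_newton(X, y))    # learned (w, b)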
4.6.3 Characteristics of Logistic Regression

1. Logistic regression is a discriminative model for classification that directly computes the posterior probabilities without making any assumption about the class-conditional probabilities. Hence, it is quite generic and can be applied in diverse applications. It can also be easily extended to multiclass classification, where it is known as multinomial logistic regression. However, its expressive power is limited to learning only linear decision boundaries.

2. Because there are different weights (parameters) for every attribute, the learned parameters of logistic regression can be analyzed to understand the relationships between attributes and class labels.

3. Because logistic regression does not involve computing densities and distances in the attribute space, it can work more robustly even in high-dimensional settings than distance-based methods such as nearest neighbor classifiers. However, the objective function of logistic regression does not involve any term relating to the complexity of the model. Hence, logistic regression does not provide a way to make a trade-off between model complexity and training performance, as compared to other classification models such as support vector machines. Nevertheless, variants of logistic regression can easily be developed to account for model complexity, by including appropriate terms in the objective function along with the cross entropy function.

4. Logistic regression can handle irrelevant attributes by learning weight parameters close to 0 for attributes that do not provide any gain in performance during training. It can also handle interacting attributes since the learning of model parameters is achieved in a joint fashion by considering the effects of all attributes together. Furthermore, if there are redundant attributes that are duplicates of each other, then logistic regression can learn equal weights for every redundant attribute, without degrading classification performance. However, the presence of a large number of irrelevant or redundant attributes in high-dimensional settings can make logistic regression susceptible to model overfitting.

5. Logistic regression cannot handle data instances with missing values, since the posterior probabilities are only computed by taking a weighted sum of all the attributes. If there are missing values in a training instance, it can be discarded from the training set. However, if there are missing values in a test instance, then logistic regression would fail to predict its class label.
4.7ArtificialNeuralNetwork(ANN)Artificialneuralnetworks(ANN)arepowerfulclassificationmodelsthatareabletolearnhighlycomplexandnonlineardecisionboundariespurelyfromthedata.Theyhavegainedwidespreadacceptanceinseveralapplicationssuchasvision,speech,andlanguageprocessing,wheretheyhavebeenrepeatedlyshowntooutperformotherclassificationmodels(andinsomecasesevenhumanperformance).Historically,thestudyofartificialneuralnetworkswasinspiredbyattemptstoemulatebiologicalneuralsystems.Thehumanbrainconsistsprimarilyofnervecellscalledneurons,linkedtogetherwithotherneuronsviastrandsoffibercalledaxons.Wheneveraneuronisstimulated(e.g.,inresponsetoastimuli),ittransmitsnerveactivationsviaaxonstootherneurons.Thereceptorneuronscollectthesenerveactivationsusingstructurescalleddendrites,whichareextensionsfromthecellbodyoftheneuron.Thestrengthofthecontactpointbetweenadendriteandanaxon,knownasasynapse,determinestheconnectivitybetweenneurons.Neuroscientistshavediscoveredthatthehumanbrainlearnsbychangingthestrengthofthesynapticconnectionbetweenneuronsuponrepeatedstimulationbythesameimpulse.
Thehumanbrainconsistsofapproximately100billionneuronsthatareinter-connectedincomplexways,makingitpossibleforustolearnnewtasksandperformregularactivities.Notethatasingleneurononlyperformsasimplemodularfunction,whichistorespondtothenerveactivationscomingfromsenderneuronsconnectedatitsdendrite,andtransmititsactivationtoreceptorneuronsviaaxons.However,itisthecompositionofthesesimplefunctionsthattogetherisabletoexpresscomplexfunctions.Thisideaisatthebasisofconstructingartificialneuralnetworks.
Analogoustothestructureofahumanbrain,anartificialneuralnetworkiscomposedofanumberofprocessingunits,callednodes,thatareconnectedwitheachotherviadirectedlinks.Thenodescorrespondtoneuronsthatperformthebasicunitsofcomputation,whilethedirectedlinkscorrespondtoconnectionsbetweenneurons,consistingofaxonsanddendrites.Further,theweightofadirectedlinkbetweentwoneuronsrepresentsthestrengthofthesynapticconnectionbetweenneurons.Asinbiologicalneuralsystems,theprimaryobjectiveofANNistoadapttheweightsofthelinksuntiltheyfittheinput-outputrelationshipsoftheunderlyingdata.
ThebasicmotivationbehindusinganANNmodelistoextractusefulfeaturesfromtheoriginalattributesthataremostrelevantforclassification.Traditionally,featureextractionhasbeenachievedbyusingdimensionalityreductiontechniquessuchasPCA(introducedinChapter2),whichshowlimitedsuccessinextractingnonlinearfeatures,orbyusinghand-craftedfeaturesprovidedbydomainexperts.Byusingacomplexcombinationofinter-connectednodes,ANNmodelsareabletoextractmuchrichersetsoffeatures,resultingingoodclassificationperformance.Moreover,ANNmodelsprovideanaturalwayofrepresentingfeaturesatmultiplelevelsofabstraction,wherecomplexfeaturesareseenascompositionsofsimplerfeatures.Inmanyclassificationproblems,modelingsuchahierarchyoffeaturesturnsouttobeveryuseful.Forexample,inordertodetectahumanfaceinanimage,wecanfirstidentifylow-levelfeaturessuchassharpedgeswithdifferentgradientsandorientations.Thesefeaturescanthenbecombinedtoidentifyfacialpartssuchaseyes,nose,ears,andlips.Finally,anappropriatearrangementoffacialpartscanbeusedtocorrectlyidentifyahumanface.ANNmodelsprovideapowerfularchitecturetorepresentahierarchicalabstractionoffeatures,fromlowerlevelsofabstraction(e.g.,edges)tohigherlevels(e.g.,facialparts).
Artificialneuralnetworkshavehadalonghistoryofdevelopmentsspanningoverfivedecadesofresearch.AlthoughclassicalmodelsofANNsufferedfromseveralchallengesthathinderedprogressforalongtime,theyhavere-emergedwithwidespreadpopularitybecauseofanumberofrecentdevelopmentsinthelastdecade,collectivelyknownasdeeplearning.Inthissection,weexamineclassicalapproachesforlearningANNmodels,startingfromthesimplestmodelcalledperceptronstomorecomplexarchitecturescalledmulti-layerneuralnetworks.Inthenextsection,wediscusssomeoftherecentadvancementsintheareaofANNthathavemadeitpossibletoeffectivelylearnmodernANNmodelswithdeeparchitectures.
4.7.1Perceptron
A perceptron is a basic type of ANN model that involves two types of nodes: input nodes, which are used to represent the input attributes, and an output node, which is used to represent the model output. Figure 4.20 illustrates the basic architecture of a perceptron that takes three input attributes, x_1, x_2, and x_3, and produces a binary output y. The input node corresponding to an attribute x_i is connected via a weighted link w_i to the output node. The weighted link is used to emulate the strength of a synaptic connection between neurons.

Figure 4.20. Basic architecture of a perceptron.

The output node is a mathematical device that computes a weighted sum of its inputs, adds a bias factor b to the sum, and then examines the sign of the result to produce the output ŷ as follows:

ŷ = { 1, if wᵀx + b > 0; −1, otherwise. }   (4.48)

To simplify notations, w and b can be concatenated to form w̃ = (wᵀ b)ᵀ, while x can be appended with 1 at the end to form x̃ = (xᵀ 1)ᵀ. The output of the perceptron ŷ can then be written as

ŷ = sign(w̃ᵀx̃),

where the sign function acts as an activation function by providing an output value of +1 if the argument is positive and −1 if its argument is negative.

Learning the Perceptron

Given a training set, we are interested in learning parameters w̃ such that ŷ closely resembles the true y of training instances. This is achieved by using the perceptron learning algorithm given in Algorithm 4.3. The key computation for this algorithm is the iterative weight update formula given in Step 8 of the algorithm:

w_j^(k+1) = w_j^(k) + λ(y_i − ŷ_i^(k)) x_ij,   (4.49)

where w_j^(k) is the weight parameter associated with the j-th input link after the k-th iteration, λ is a parameter known as the learning rate, and x_ij is the value of the j-th attribute of the training example x_i. The justification for Equation 4.49 is rather intuitive. Note that (y_i − ŷ_i) captures the discrepancy between y_i and ŷ_i, such that its value is 0 only when the true label and the predicted output match. Assume x_ij is positive. If ŷ = 0 and y = 1, then w_j is increased at the next iteration so that w̃ᵀx_i can become positive. On the other hand, if ŷ = 1 and y = 0, then w_j is decreased so that w̃ᵀx_i can become negative. Hence, the weights are modified at every iteration to reduce the discrepancies between ŷ and y across all training instances. The learning rate λ, a parameter whose value is between 0 and 1, can be used to control the amount of adjustments made in each iteration. The algorithm halts when the average number of discrepancies is smaller than a threshold γ.

Algorithm 4.3 Perceptron learning algorithm.
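Since the pseudocode listing of Algorithm 4.3 is not reproduced in this text, the following Python sketch implements an epoch-based perceptron consistent with the update rule of Equation 4.49; it uses labels in {+1, −1} to match Equation 4.48, and the toy data, learning rate, and stopping threshold are illustrative choices.

import numpy as np

def train_perceptron(X, y, lam=0.1, gamma=0.0, max_epochs=100):
    # Perceptron updates in the spirit of Equation 4.49; labels are +1/-1.
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])   # append 1 for the bias term
    w = np.zeros(X_aug.shape[1])                        # w~ initialized to 0
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X_aug, y):
            y_hat = 1 if w @ xi > 0 else -1             # Equation 4.48
            if y_hat != yi:
                w += lam * (yi - y_hat) * xi             # Equation 4.49
                errors += 1
        if errors / len(y) <= gamma:                     # stop when discrepancies are rare
            break
    return w

# Linearly separable toy data (invented): positive class roughly where x1 + x2 > 1.
X = np.array([[0.2, 0.1], [0.4, 0.3], [0.9, 0.8], [1.0, 0.7]])
y = np.array([-1, -1, 1, 1])
print(train_perceptron(X, y))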
The perceptron is a simple classification model that is designed to learn linear decision boundaries in the attribute space. Figure 4.21 shows the decision boundary obtained by applying the perceptron learning algorithm to the data set provided on the left of the figure. However, note that there can be multiple decision boundaries that can separate the two classes, and the perceptron arbitrarily learns one of these boundaries depending on the random initial values of parameters. (The selection of the optimal decision boundary is a problem that will be revisited in the context of support vector machines in Section 4.9.) Further, the perceptron learning algorithm is only guaranteed to converge when the classes are linearly separable. However, if the classes are not linearly separable, the algorithm fails to converge. Figure 4.22 shows an example of nonlinearly separable data given by the XOR function. The perceptron cannot find the right solution for this data because there is no linear decision boundary that can perfectly separate the training instances. Thus, the stopping condition at line 12 of Algorithm 4.3 would never be met and hence, the perceptron learning algorithm would fail to converge. This is a major limitation of perceptrons since real-world classification problems often involve nonlinearly separable classes.

Figure 4.21. Perceptron decision boundary for the data given on the left (+ represents a positively labeled instance while o represents a negatively labeled instance).

Figure 4.22. XOR classification problem. No linear hyperplane can separate the two classes.
4.7.2Multi-layerNeuralNetwork
Amulti-layerneuralnetworkgeneralizesthebasicconceptofaperceptrontomorecomplexarchitecturesofnodesthatarecapableoflearningnonlineardecisionboundaries.Agenericarchitectureofamulti-layerneuralnetworkisshowninFigure4.23 wherethenodesarearrangedingroupscalledlayers.Theselayersarecommonlyorganizedintheformofachainsuchthateverylayeroperatesontheoutputsofitsprecedinglayer.Inthisway,thelayersrepresentdifferentlevelsofabstractionthatareappliedontheinputfeaturesinasequentialmanner.Thecompositionoftheseabstractionsgeneratesthefinaloutputatthelastlayer,whichisusedformakingpredictions.Inthefollowing,webrieflydescribethethreetypesoflayersusedinmulti-layerneuralnetworks.
Figure4.23.Exampleofamulti-layerartificialneuralnetwork(ANN).
The first layer of the network, called the input layer, is used for representing inputs from attributes. Every numerical or binary attribute is typically represented using a single node on this layer, while a categorical attribute is either represented using a different node for each categorical value, or by encoding the k-ary attribute using ⌈log₂ k⌉ input nodes. These inputs are fed into intermediary layers known as hidden layers, which are made up of processing units known as hidden nodes. Every hidden node operates on signals received from the input nodes or hidden nodes at the preceding layer, and produces an activation value that is transmitted to the next layer. The final layer is called the output layer and processes the activation values from its preceding layer to produce predictions of output variables. For binary classification, the output layer contains a single node representing the binary class label. In this architecture, since the signals are propagated only in the forward direction from the input layer to the output layer, these networks are also called feedforward neural networks.
Amajordifferencebetweenmulti-layerneuralnetworksandperceptronsistheinclusionofhiddenlayers,whichdramaticallyimprovestheirabilitytorepresentarbitrarilycomplexdecisionboundaries.Forexample,considertheXORproblemdescribedintheprevioussection.Theinstancescanbeclassifiedusingtwohyperplanesthatpartitiontheinputspaceintotheirrespectiveclasses,asshowninFigure4.24(a) .Becauseaperceptroncancreateonlyonehyperplane,itcannotfindtheoptimalsolution.However,thisproblemcanbeaddressedbyusingahiddenlayerconsistingoftwonodes,asshowninFigure4.24(b) .Intuitively,wecanthinkofeachhiddennodeasaperceptronthattriestoconstructoneofthetwohyperplanes,whiletheoutputnodesimplycombinestheresultsoftheperceptronstoyieldthedecisionboundaryshowninFigure4.24(a) .
Figure4.24.Atwo-layerneuralnetworkfortheXORproblem.
Thehiddennodescanbeviewedaslearninglatentrepresentationsorfeaturesthatareusefulfordistinguishingbetweentheclasses.Whilethefirsthiddenlayerdirectlyoperatesontheinputattributesandthuscapturessimplerfeatures,thesubsequenthiddenlayersareabletocombinethemand
constructmorecomplexfeatures.Fromthisperspective,multi-layerneuralnetworkslearnahierarchyoffeaturesatdifferentlevelsofabstractionthatarefinallycombinedattheoutputnodestomakepredictions.Further,therearecombinatoriallymanywayswecancombinethefeatureslearnedatthehiddenlayersofANN,makingthemhighlyexpressive.ThispropertychieflydistinguishesANNfromotherclassificationmodelssuchasdecisiontrees,whichcanlearnpartitionsintheattributespacebutareunabletocombinetheminexponentialways.
Figure 4.25. Schematic illustration of the parameters of an ANN model with (L − 1) hidden layers.

To understand the nature of computations happening at the hidden and output nodes of ANN, consider the i-th node at the l-th layer of the network (l > 0), where the layers are numbered from 0 (input layer) to L (output layer), as shown in Figure 4.25. The activation value generated at this node, a_i^l, can be represented as a function of the inputs received from nodes at the preceding layer. Let w_ij^l represent the weight of the connection from the j-th node at layer (l − 1) to the i-th node at layer l. Similarly, let us denote the bias term at this node as b_i^l. The activation value a_i^l can then be expressed as

a_i^l = f(z_i^l) = f(∑_j w_ij^l a_j^{l−1} + b_i^l),

where z is called the linear predictor and f(·) is the activation function that converts z to a. Further, note that, by definition, a_j^0 = x_j at the input layer and a^L = ŷ at the output node.

There are a number of alternate activation functions apart from the sign function that can be used in multi-layer neural networks. Some examples include linear, sigmoid (logistic), and hyperbolic tangent functions, as shown in Figure 4.26. These functions are able to produce real-valued and nonlinear activation values. Among these activation functions, the sigmoid σ(·) has been widely used in many ANN models, although the use of other types of activation functions in the context of deep learning will be discussed in Section 4.8. We can thus represent a_i^l as

a_i^l = σ(z_i^l) = 1/(1 + e^{−z_i^l}).   (4.50)

Figure 4.26. Types of activation functions used in multi-layer neural networks.

Learning Model Parameters

The weights and bias terms (w, b) of the ANN model are learned during training so that the predictions on training instances match the true labels. This is achieved by using a loss function

E(w, b) = ∑_{k=1}^{n} Loss(y_k, ŷ_k),   (4.51)
where y_k is the true label of the k-th training instance and ŷ_k is equal to a^L, produced by using x_k. A typical choice of the loss function is the squared loss function:

Loss(y_k, ŷ_k) = (y_k − ŷ_k)².   (4.52)

Note that E(w, b) is a function of the model parameters (w, b) because the output activation value a^L depends on the weights and bias terms. We are interested in choosing (w, b) that minimizes the training loss E(w, b). Unfortunately, because of the use of hidden nodes with nonlinear activation functions, E(w, b) is not a convex function of w and b, which means that E(w, b) can have local minima that are not globally optimal. However, we can still apply standard optimization techniques such as the gradient descent method to arrive at a locally optimal solution. In particular, the weight parameter w_ij^l and the bias term b_i^l can be iteratively updated using the following equations:

w_ij^l ← w_ij^l − λ ∂E/∂w_ij^l,   (4.53)
b_i^l ← b_i^l − λ ∂E/∂b_i^l,   (4.54)

where λ is a hyper-parameter known as the learning rate. The intuition behind this equation is to move the weights in a direction that reduces the training loss. If we arrive at a minima using this procedure, the gradient of the training loss will be close to 0, eliminating the second term and resulting in the convergence of weights. The weights are commonly initialized with values drawn randomly from a Gaussian or a uniform distribution.

A necessary tool for updating weights in Equation 4.53 is to compute the partial derivative of E with respect to w_ij^l. This computation is nontrivial especially at hidden layers (l < L), since w_ij^l does not directly affect ŷ = a^L (and hence the training loss), but has complex chains of influences via activation values at subsequent layers. To address this problem, a technique known as backpropagation was developed, which propagates the derivatives backward from the output layer to the hidden layers. This technique can be described as follows.

Recall that the training loss E is simply the sum of individual losses at training instances. Hence, the partial derivative of E can be decomposed as a sum of partial derivatives of individual losses:

∂E/∂w_ij^l = ∑_{k=1}^{n} ∂Loss(y_k, ŷ_k)/∂w_ij^l.

To simplify discussions, we will consider only the derivatives of the loss at the k-th training instance, which will be generically represented as Loss(y, a^L). By using the chain rule of differentiation, we can represent the partial derivatives of the loss with respect to w_ij^l as

∂Loss/∂w_ij^l = ∂Loss/∂a_i^l × ∂a_i^l/∂z_i^l × ∂z_i^l/∂w_ij^l.   (4.55)

The last term of the previous equation can be written as

∂z_i^l/∂w_ij^l = ∂(∑_j w_ij^l a_j^{l−1} + b_i^l)/∂w_ij^l = a_j^{l−1}.

Also, if we use the sigmoid activation function, then

∂a_i^l/∂z_i^l = ∂σ(z_i^l)/∂z_i^l = a_i^l (1 − a_i^l).

Equation 4.55 can thus be simplified as

∂Loss/∂w_ij^l = δ_i^l × a_i^l (1 − a_i^l) × a_j^{l−1},   where δ_i^l = ∂Loss/∂a_i^l.   (4.56)

A similar formula for the partial derivatives with respect to the bias terms b_i^l is given by

∂Loss/∂b_i^l = δ_i^l × a_i^l (1 − a_i^l).   (4.57)

Hence, to compute the partial derivatives, we only need to determine δ_i^l. Using a squared loss function, we can easily write δ^L at the output node as

δ^L = ∂Loss/∂a^L = ∂(y − a^L)²/∂a^L = 2(a^L − y).   (4.58)

However, the approach for computing δ_j^l at hidden nodes (l < L) is more involved. Notice that a_j^l affects the activation values a_i^{l+1} of all nodes at the next layer, which in turn influence the loss. Hence, again using the chain rule of differentiation, δ_j^l can be represented as

δ_j^l = ∂Loss/∂a_j^l = ∑_i (∂Loss/∂a_i^{l+1} × ∂a_i^{l+1}/∂a_j^l)
      = ∑_i (∂Loss/∂a_i^{l+1} × ∂a_i^{l+1}/∂z_i^{l+1} × ∂z_i^{l+1}/∂a_j^l)
      = ∑_i (δ_i^{l+1} × a_i^{l+1}(1 − a_i^{l+1}) × w_ij^{l+1}).   (4.59)

The previous equation provides a concise representation of the δ values at layer l in terms of the δ values computed at layer l + 1. Hence, proceeding backward from the output layer L to the hidden layers, we can recursively apply Equation 4.59 to compute δ_i^l at every hidden node. δ_i^l can then be used in Equations 4.56 and 4.57 to compute the partial derivatives of the loss with respect to w_ij^l and b_i^l, respectively. Algorithm 4.4 summarizes the complete approach for learning the model parameters of ANN using backpropagation and the gradient descent method.

Algorithm 4.4 Learning ANN using backpropagation and gradient descent.
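Since the listing of Algorithm 4.4 is not reproduced in this text, the following NumPy sketch puts Equations 4.50 and 4.53-4.59 together for a tiny two-layer sigmoid network trained with squared loss on the XOR data of Figure 4.22. The architecture (4 hidden nodes), learning rate, iteration count, and random seed are all illustrative choices; because the loss is non-convex, training can occasionally stall in a local minimum, as noted above.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# XOR data; labels in {0, 1}.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))   # hidden layer (4 nodes)
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))   # output layer (1 node)
lam = 0.5                                             # learning rate (illustrative)

for _ in range(20000):
    # Forward pass: Equation 4.50 at every layer.
    a1 = sigmoid(X @ W1 + b1)          # hidden activations
    a2 = sigmoid(a1 @ W2 + b2)         # output activation a^L = y_hat

    # Backward pass: delta at the output (Eq. 4.58), then at the hidden layer (Eq. 4.59).
    d2 = 2 * (a2 - y)                  # dLoss/da^L
    g2 = d2 * a2 * (1 - a2)            # delta * a(1 - a), reused in Eqs. 4.56-4.57
    d1 = g2 @ W2.T                     # dLoss/da^1, Equation 4.59
    g1 = d1 * a1 * (1 - a1)

    # Gradient descent updates, Equations 4.53-4.54, summed over the four instances.
    W2 -= lam * a1.T @ g2;  b2 -= lam * g2.sum(axis=0, keepdims=True)
    W1 -= lam * X.T @ g1;   b1 -= lam * g1.sum(axis=0, keepdims=True)

print(np.round(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2), 2))  # typically close to [0, 1, 1, 0]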
4.7.3CharacteristicsofANN
1. Multi-layer neural networks with at least one hidden layer are universal approximators; i.e., they can be used to approximate any target function. They are thus highly expressive and can be used to learn complex decision boundaries in diverse applications. ANN can also be used for multiclass classification and regression problems by appropriately modifying the output layer. However, the high model complexity of classical ANN models makes them susceptible to overfitting, which can be overcome to some extent by using the deep learning techniques discussed in Section 4.8.3.

2. ANN provides a natural way to represent a hierarchy of features at multiple levels of abstraction. The outputs at the final hidden layer of the ANN model thus represent features at the highest level of abstraction that are most useful for classification. These features can also be used as inputs in other supervised classification models, e.g., by replacing the output node of the ANN with any generic classifier.

3. ANN represents complex high-level features as compositions of simpler lower-level features that are easier to learn. This gives ANN the ability to gradually increase the complexity of representations by adding more hidden layers to the architecture. Further, since simpler features can be combined in combinatorial ways, the number of complex features learned by an ANN is much larger than in traditional classification models. This is one of the main reasons behind the high expressive power of deep neural networks.

4. ANN can easily handle irrelevant attributes, by using zero weights for attributes that do not help in improving the training loss. Also, redundant attributes receive similar weights and do not degrade the quality of the classifier. However, if the number of irrelevant or redundant attributes is large, the learning of the ANN model may suffer from overfitting, leading to poor generalization performance.

5. Since the learning of an ANN model involves minimizing a non-convex function, the solutions obtained by gradient descent are not guaranteed to be globally optimal. For this reason, ANN has a tendency to get stuck in local minima, a challenge that can be addressed by using the deep learning techniques discussed in Section 4.8.4.

6. Training an ANN is a time-consuming process, especially when the number of hidden nodes is large. Nevertheless, test examples can be classified rapidly.

7. Just like logistic regression, ANN can learn in the presence of interacting variables, since the model parameters are jointly learned over all variables together. However, ANN cannot handle instances with missing values in the training or testing phase.
4.8 Deep Learning

As described above, the use of hidden layers in ANN is based on the general belief that complex high-level features can be constructed by combining simpler lower-level features. Typically, the greater the number of hidden layers, the deeper the hierarchy of features learned by the network. This motivates the learning of ANN models with long chains of hidden layers, known as deep neural networks. In contrast to "shallow" neural networks that involve only a small number of hidden layers, deep neural networks are able to represent features at multiple levels of abstraction and often require far fewer nodes per layer to achieve generalization performance similar to shallow networks.
Despite the huge potential of deep neural networks, it has remained challenging to learn ANN models with a large number of hidden layers using classical approaches. Apart from reasons related to limited computational resources and hardware architectures, there have been a number of algorithmic challenges in learning deep neural networks. First, learning a deep neural network with low training error has been a daunting task because of the saturation of sigmoid activation functions, resulting in slow convergence of gradient descent. This problem becomes even more serious as we move away from the output node toward the hidden layers, because of the compounded effects of saturation at multiple layers, known as the vanishing gradient problem. For this reason, classical ANN models have suffered from slow and ineffective learning, leading to poor training and test performance. Second, the learning of deep neural networks is quite sensitive to the initial values of model parameters, chiefly because of the non-convex nature of the optimization function and the slow convergence of gradient descent. Third, deep neural networks with a large number of hidden layers have high model complexity, making them susceptible to overfitting. Hence, even if a deep neural network has been trained to show low training error, it can still suffer from poor generalization performance.

These challenges deterred progress in building deep neural networks for several decades, and it is only recently that we have started to unlock their immense potential with the help of a number of advances being made in the area of deep learning. Although some of these advances have been around for some time, they have only gained mainstream attention in the last decade, with deep neural networks continually beating records in various competitions and solving problems that were too difficult for other classification approaches.

Two factors have played a major role in the emergence of deep learning techniques. First, the availability of larger labeled data sets, e.g., the ImageNet data set contains more than 10 million labeled images, has made it possible to learn more complex ANN models than ever before, without falling easily into the traps of model overfitting. Second, advances in computational abilities and hardware infrastructures, such as the use of graphics processing units (GPUs) for distributed computing, have greatly helped in experimenting with deep neural networks with larger architectures that would not have been feasible with traditional resources.

In addition to the previous two factors, there have been a number of algorithmic advancements to overcome the challenges faced by classical methods in learning deep neural networks. Some examples include the use of more responsive combinations of loss functions and activation functions, better initialization of model parameters, novel regularization techniques, more agile architecture designs, and better techniques for model learning and hyper-parameter selection. In the following, we describe some of the deep learning advances made to address the challenges in learning deep neural networks. Further details on recent developments in deep learning can be obtained from the Bibliographic Notes.
4.8.1 Using Synergistic Loss Functions

One of the major realizations leading to deep learning has been the importance of choosing appropriate combinations of activation and loss functions. Classical ANN models commonly made use of the sigmoid activation function at the output layer, because of its ability to produce real-valued outputs between 0 and 1, which was combined with a squared loss objective to perform gradient descent. It was soon noticed that this particular combination of activation and loss function resulted in the saturation of output activation values, which can be described as follows.
Saturation of Outputs
Although the sigmoid has been widely used as an activation function, it easily saturates at high and low values of inputs that are far away from 0. Observe from Figure 4.27(a) that $\sigma(z)$ shows variance in its values only when z is close to 0. For this reason, $\partial\sigma(z)/\partial z$ is non-zero for only a small range of z around 0, as shown in Figure 4.27(b). Since $\partial\sigma(z)/\partial z$ is one of the components in the gradient of the loss (see Equation 4.55), we get a diminishing gradient value when the activation values are far from 0.

Figure 4.27. Plots of the sigmoid function and its derivative.

To illustrate the effect of saturation on the learning of model parameters at the output node, consider the partial derivative of the loss with respect to the weight $w_j^L$ at the output node. Using the squared loss function, we can write this as

$$\frac{\partial \mathrm{Loss}}{\partial w_j^L} = 2(a^L - y) \times \sigma(z^L)(1 - \sigma(z^L)) \times a_j^{L-1}. \qquad (4.60)$$

In the previous equation, notice that when $z^L$ is highly negative, $\sigma(z^L)$ (and hence the gradient) is close to 0. On the other hand, when $z^L$ is highly positive, $(1 - \sigma(z^L))$ becomes close to 0, nullifying the value of the gradient. Hence, irrespective of whether the prediction $a^L$ matches the true label y or not, the gradient of the loss with respect to the weights is close to 0 whenever $z^L$ is highly positive or negative. This causes an unnecessarily slow convergence of the model parameters of the ANN model, often resulting in poor learning.

Note that it is the combination of the squared loss function and the sigmoid activation function at the output node that together result in diminishing gradients (and thus poor learning) upon saturation of outputs. It is thus important to choose a synergistic combination of loss function and activation function that does not suffer from the saturation of outputs.
Cross Entropy Loss Function
The cross entropy loss function, which was described in the context of logistic regression in Section 4.6.2, can significantly avoid the problem of saturating outputs when used in combination with the sigmoid activation function. The cross entropy loss function of a real-valued prediction $\hat{y} \in (0,1)$ on a data instance with binary label $y \in \{0,1\}$ can be defined as

$$\mathrm{Loss}(y, \hat{y}) = -y \log(\hat{y}) - (1-y)\log(1-\hat{y}), \qquad (4.61)$$

where log represents the natural logarithm (to base e) and $0\log(0) = 0$ for convenience. The cross entropy function has foundations in information theory and measures the amount of disagreement between y and $\hat{y}$. The partial derivative of this loss function with respect to $\hat{y} = a^L$ can be given as

$$\delta^L = \frac{\partial \mathrm{Loss}}{\partial a^L} = -\frac{y}{a^L} + \frac{1-y}{1-a^L} = \frac{a^L - y}{a^L(1 - a^L)}. \qquad (4.62)$$

Using this value of $\delta^L$ in Equation 4.56, we can obtain the partial derivative of the loss with respect to the weight $w_j^L$ at the output node as

$$\frac{\partial \mathrm{Loss}}{\partial w_j^L} = \frac{a^L - y}{a^L(1-a^L)} \times a^L(1 - a^L) \times a_j^{L-1} = (a^L - y) \times a_j^{L-1}. \qquad (4.63)$$

Notice the simplicity of the previous formula using the cross entropy loss function. The partial derivatives of the loss with respect to the weights at the output node depend only on the difference between the prediction $a^L$ and the true label y. In contrast to Equation 4.60, it does not involve terms such as $\sigma(z^L)(1 - \sigma(z^L))$ that can be impacted by saturation of $z^L$. Hence, the gradients are high whenever $(a^L - y)$ is large, promoting effective learning of the model parameters at the output node. This has been a major breakthrough in the learning of modern ANN models, and it is now common practice to use the cross entropy loss function with sigmoid activations at the output node.
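The contrast between Equations 4.60 and 4.63 is easy to verify numerically. The following small sketch, written only for illustration, compares the two output-node gradients (with the previous-layer activation fixed at 1) for a badly wrong prediction on a positive instance:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_squared(z, y):
    a = sigmoid(z)
    return 2 * (a - y) * a * (1 - a)   # Equation 4.60 (a_j^{L-1} taken as 1)

def grad_cross_entropy(z, y):
    return sigmoid(z) - y              # Equation 4.63 (a_j^{L-1} taken as 1)

for z in [-8.0, 0.0, 8.0]:
    # With true label y = 1, a highly negative z is a badly wrong prediction,
    # yet the squared-loss gradient nearly vanishes while the cross-entropy
    # gradient stays large.
    print(z, grad_squared(z, y=1), grad_cross_entropy(z, y=1))
```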
4.8.2 Using Responsive Activation Functions

Even though the cross entropy loss function helps in overcoming the problem of saturating outputs, it still does not solve the problem of saturation at hidden layers, arising due to the use of sigmoid activation functions at hidden nodes. In fact, the effect of saturation on the learning of model parameters is even more aggravated at hidden layers, a problem known as the vanishing gradient problem. In the following, we describe the vanishing gradient problem and the use of a more responsive activation function, called the rectified linear unit (ReLU), to overcome this problem.

Vanishing Gradient Problem
The impact of saturating activation values on the learning of model parameters increases at deeper hidden layers that are farther away from the output node. Even if the activation in the output layer does not saturate, the repeated multiplications performed as we backpropagate the gradients from the output layer to the hidden layers may lead to decreasing gradients in the hidden layers. This is called the vanishing gradient problem, which has been one of the major hindrances in learning deep neural networks.

To illustrate the vanishing gradient problem, consider an ANN model that consists of a single node at every hidden layer of the network, as shown in Figure 4.28. This simplified architecture involves a single chain of hidden nodes, where a single weighted link $w^l$ connects the node at layer $l-1$ to the node at layer l. Using Equations 4.56 and 4.59, we can represent the partial derivative of the loss with respect to $w^l$ as

$$\frac{\partial \mathrm{Loss}}{\partial w^l} = \delta^l \times a^l(1 - a^l) \times a^{l-1}, \quad \text{where } \delta^l = 2(a^L - y) \times \prod_{r=l}^{L-1} \left( a^{r+1}(1 - a^{r+1}) \times w^{r+1} \right). \qquad (4.64)$$

Figure 4.28. An example of an ANN model with only one node at every hidden layer.

Notice that if any of the linear predictors $z^{r+1}$ saturates at subsequent layers, then the term $a^{r+1}(1 - a^{r+1})$ becomes close to 0, thus diminishing the overall gradient. The saturation of activations thus gets compounded and has multiplicative effects on the gradients at hidden layers, making them highly unstable and thus unsuitable for use with gradient descent. Even though the previous discussion only pertains to the simplified architecture involving a single chain of hidden nodes, a similar argument can be made for any generic ANN architecture involving multiple chains of hidden nodes. Note that the vanishing gradient problem primarily arises because of the use of the sigmoid activation function at hidden nodes, which is known to easily saturate, especially after repeated multiplications.
Figure 4.29. Plot of the rectified linear unit (ReLU) activation function.

Rectified Linear Units (ReLU)
To overcome the vanishing gradient problem, it is important to use an activation function f(z) at the hidden nodes that provides a stable and significant value of the gradient whenever a hidden node is active, i.e., $z > 0$. This is achieved by using rectified linear units (ReLU) as activation functions at hidden nodes, which can be defined as

$$a = f(z) = \begin{cases} z, & \text{if } z > 0, \\ 0, & \text{otherwise.} \end{cases} \qquad (4.65)$$

The idea of ReLU has been inspired by biological neurons, which are either in an inactive state $(f(z) = 0)$ or show an activation value proportional to the input. Figure 4.29 shows a plot of the ReLU function. We can see that it is linear with respect to z when $z > 0$. Hence, the gradient of the activation value with respect to z can be written as

$$\frac{\partial a}{\partial z} = \begin{cases} 1, & \text{if } z > 0, \\ 0, & \text{if } z < 0. \end{cases} \qquad (4.66)$$

Although f(z) is not differentiable at 0, it is common practice to use $\partial a/\partial z = 0$ when $z = 0$. Since the gradient of the ReLU activation function is equal to 1 whenever $z > 0$, it avoids the problem of saturation at hidden nodes, even after repeated multiplications. Using ReLU, the partial derivatives of the loss with respect to the weight and bias parameters can be given by

$$\frac{\partial \mathrm{Loss}}{\partial w_{ij}^l} = \delta_i^l \times I(z_i^l) \times a_j^{l-1}, \qquad (4.67)$$

$$\frac{\partial \mathrm{Loss}}{\partial b_i^l} = \delta_i^l \times I(z_i^l), \quad \text{where } \delta_i^l = \sum_{k} \left( \delta_k^{l+1} \times I(z_k^{l+1}) \times w_{ki}^{l+1} \right), \quad \text{and } I(z) = \begin{cases} 1, & \text{if } z > 0, \\ 0, & \text{otherwise.} \end{cases} \qquad (4.68)$$

Notice that ReLU shows a linear behavior in the activation values whenever a node is active, as compared to the nonlinear properties of the sigmoid function. This linearity promotes better flow of gradients during backpropagation, and thus simplifies the learning of ANN model parameters. ReLU is also highly responsive at large values of z away from 0, as opposed to the sigmoid activation function, making it more suitable for gradient descent. These differences give ReLU a major advantage over the sigmoid function. Indeed, ReLU is the preferred choice of activation function at hidden layers in most modern ANN models.
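A brief sketch of Equations 4.65 and 4.66 in NumPy, provided here purely as an illustration, makes the non-saturating behavior for large positive inputs explicit:

```python
import numpy as np

def relu(z):
    """Rectified linear unit (Equation 4.65): elementwise max(0, z)."""
    return np.maximum(0.0, z)

def relu_grad(z):
    """Derivative of ReLU (Equation 4.66), using the convention da/dz = 0 at z = 0."""
    return (z > 0).astype(float)

z = np.array([-8.0, -0.5, 0.0, 0.5, 8.0])
print(relu(z))       # [0.  0.  0.  0.5 8. ]
print(relu_grad(z))  # [0. 0. 0. 1. 1.]  -- gradient stays 1 for large positive z
```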
4.8.3 Regularization

A major challenge in learning deep neural networks is the high model complexity of ANN models, which grows with the addition of hidden layers in the network. This can become a serious concern, especially when the training set is small, due to the phenomenon of model overfitting. To overcome this challenge, it is important to use techniques that can help in reducing the complexity of the learned model, known as regularization techniques. Classical approaches for learning ANN models did not have an effective way to promote regularization of the learned model parameters. Hence, they had often been sidelined by other classification methods, such as support vector machines (SVM), which have built-in regularization mechanisms. (SVMs will be discussed in more detail in Section 4.9.)
One of the major advancements in deep learning has been the development of novel regularization techniques for ANN models that are able to offer significant improvements in generalization performance. In the following, we discuss one of the regularization techniques for ANN, known as the dropout method, that has gained a lot of attention in several applications.

Dropout
The main objective of dropout is to avoid the learning of spurious features at hidden nodes, occurring due to model overfitting. It uses the basic intuition that spurious features often "co-adapt" themselves such that they show good training performance only when used in highly selective combinations. On the other hand, relevant features can be used in a diversity of feature combinations and hence are quite resilient to the removal or modification of other features. The dropout method uses this intuition to break complex "co-adaptations" in the learned features by randomly dropping input and hidden nodes in the network during training.

Dropout belongs to a family of regularization techniques that use the criterion of resilience to random perturbations as a measure of the robustness (and hence, simplicity) of a model. For example, one approach to regularization is to inject noise in the input attributes of the training set and learn a model with the noisy training instances. If a feature learned from the training data is indeed generalizable, it should not be affected by the addition of noise. Dropout can be viewed as a similar regularization approach that perturbs the information content of the training set not only at the level of attributes but also at multiple levels of abstraction, by dropping input and hidden nodes.

The dropout method draws inspiration from the biological process of gene swapping in sexual reproduction, where half of the genes from both parents are combined together to create the genes of the offspring. This favors the selection of parent genes that are not only useful but can also inter-mingle with diverse combinations of genes coming from the other parent. On the other hand, co-adapted genes that function only in highly selective combinations are soon eliminated in the process of evolution. This idea is used in the dropout method for eliminating spurious co-adapted features. A simplified description of the dropout method is provided in the rest of this section.

Let $(\mathbf{w}^k, \mathbf{b}^k)$ represent the model parameters of the ANN model at the k-th iteration of the gradient descent method. At every iteration, we randomly select a fraction $\gamma$ of input and hidden nodes to be dropped from the network, where $\gamma \in (0,1)$ is a hyper-parameter that is typically chosen to be 0.5. The weighted links and bias terms involving the dropped nodes are then eliminated, resulting in a "thinned" sub-network of smaller size. The model parameters of the sub-network $(\mathbf{w}_s^k, \mathbf{b}_s^k)$ are then updated by computing activation values and performing backpropagation on this smaller sub-network. These updated values are then added back in the original network to obtain the updated model parameters, $(\mathbf{w}^{k+1}, \mathbf{b}^{k+1})$, to be used in the next iteration.

Figure 4.30. Examples of sub-networks generated in the dropout method using $\gamma = 0.5$.

Figure 4.30 shows some examples of sub-networks that can be generated at different iterations of the dropout method, by randomly dropping input and hidden nodes. Since every sub-network has a different architecture, it is difficult to learn complex co-adaptations in the features that can result in overfitting. Instead, the features at the hidden nodes are learned to be more agile to random modifications in the network structure, thus improving their generalization ability. The model parameters are updated using a different random sub-network at every iteration, till the gradient descent method converges.

Let $(\mathbf{w}^{k_{\max}}, \mathbf{b}^{k_{\max}})$ denote the model parameters at the last iteration of the gradient descent method. These parameters are finally scaled down by a factor of $(1 - \gamma)$ to produce the weights and bias terms of the final ANN model, as follows:

$$(\mathbf{w}^*, \mathbf{b}^*) = \left( (1-\gamma) \times \mathbf{w}^{k_{\max}}, \; (1-\gamma) \times \mathbf{b}^{k_{\max}} \right).$$

We can now use the complete neural network with model parameters $(\mathbf{w}^*, \mathbf{b}^*)$ for testing. The dropout method has been shown to provide significant improvements in the generalization performance of ANN models in a number of applications. It is computationally cheap and can be applied in combination with any of the other deep learning techniques. It also has a number of similarities with a widely-used ensemble learning method known as bagging, which learns multiple models using random subsets of the training set, and then uses the average output of all the models to make predictions. (Bagging will be presented in more detail later in Section 4.10.4.) In a similar vein, it can be shown that the predictions of the final network learned using dropout approximate the average output of all possible $2^n$ sub-networks that can be formed using n nodes. This is one of the reasons behind the superior regularization abilities of dropout.
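The node-dropping and rescaling steps described above can be sketched in a few lines of code. The fragment below is only an illustration of the masking idea for a single hidden layer, not the full training loop; the function name and the use of a ReLU hidden layer are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.5   # fraction of nodes to drop, as described in the text

def hidden_layer_with_dropout(x, W, b, training=True):
    """One ReLU hidden layer with dropout applied to its activations.
    At training time a random mask drops a fraction gamma of the hidden
    nodes (a "thinned" sub-network); at test time the full layer is used
    with the weights and biases scaled by (1 - gamma)."""
    if training:
        h = np.maximum(0.0, W @ x + b)
        mask = rng.random(h.shape) >= gamma   # keep each node with prob. 1 - gamma
        return h * mask
    else:
        return np.maximum(0.0, (1 - gamma) * W @ x + (1 - gamma) * b)
```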
4.8.4 Initialization of Model Parameters

Because of the non-convex nature of the loss function used by ANN models, it is possible to get stuck in locally optimal but globally inferior solutions. Hence, the initial choice of model parameter values plays a significant role in the learning of ANN by gradient descent. The impact of poor initialization is even more aggravated when the model is complex, the network architecture is deep, or the classification task is difficult. In such cases, it is often advisable to first learn a simpler model for the problem, e.g., using a single hidden layer, and then incrementally increase the complexity of the model, e.g., by adding more hidden layers. An alternate approach is to train the model for a simpler task and then use the learned model parameters as initial parameter choices in the learning of the original task. The process of initializing ANN model parameters before the actual training process is known as pretraining.

Pretraining helps in initializing the model to a suitable region in the parameter space that would otherwise be inaccessible by random initialization. Pretraining also reduces the variance in the model parameters by fixing the starting point of gradient descent, thus reducing the chances of overfitting due to multiple comparisons. The models learned by pretraining are thus more consistent and provide better generalization performance.

Supervised Pretraining
A common approach for pretraining is to incrementally train the ANN model in a layer-wise manner, by adding one hidden layer at a time. This approach, known as supervised pretraining, ensures that the parameters learned at every layer are obtained by solving a simpler problem, rather than learning all model parameters together. These parameter values thus provide a good choice for initializing the ANN model. The approach for supervised pretraining can be briefly described as follows.

We start the supervised pretraining process by considering a reduced ANN model with only a single hidden layer. By applying gradient descent on this simple model, we are able to learn the model parameters of the first hidden layer. At the next run, we add another hidden layer to the model and apply gradient descent to learn the parameters of the newly added hidden layer, while keeping the parameters of the first layer fixed. This procedure is recursively applied such that while learning the parameters of the l-th hidden layer, we consider a reduced model with only l hidden layers, whose first (l-1) hidden layers are not updated on the l-th run but are instead fixed using pretrained values from previous runs. In this way, we are able to learn the model parameters of all (L-1) hidden layers. These pretrained values are used to initialize the hidden layers of the final ANN model, which is fine-tuned by applying a final round of gradient descent over all the layers.

Unsupervised Pretraining
Supervised pretraining provides a powerful way to initialize model parameters, by gradually growing the model complexity from shallower to deeper networks. However, supervised pretraining requires a sufficient number of labeled training instances for effective initialization of the ANN model. An alternate pretraining approach is unsupervised pretraining, which initializes model parameters by using unlabeled instances that are often abundantly available. The basic idea of unsupervised pretraining is to initialize the ANN model in such a way that the learned features capture the latent structure in the unlabeled data.

Unsupervised pretraining relies on the assumption that learning the distribution of the input data can indirectly help in learning the classification model. It is most helpful when the number of labeled examples is small and the features for the supervised problem bear resemblance to the factors generating the input data. Unsupervised pretraining can be viewed as a different form of regularization, where the focus is not explicitly toward finding simpler features but instead toward finding features that can best explain the input data. Historically, unsupervised pretraining has played an important role in reviving the area of deep learning, by making it possible to train any generic deep neural network without requiring specialized architectures.

Figure 4.31. The basic architecture of a single-layer autoencoder.
Use of Autoencoders
One simple and commonly used approach for unsupervised pretraining is to use an unsupervised ANN model known as an autoencoder. The basic architecture of an autoencoder is shown in Figure 4.31. An autoencoder attempts to learn a reconstruction of the input data by mapping the attributes x to a set of latent features, and then re-projecting the latent features back to the original attribute space to create the reconstruction $\hat{\mathbf{x}}$. The latent features are represented using a hidden layer of nodes, while the input and output layers represent the attributes and contain the same number of nodes. During training, the goal is to learn an autoencoder model that provides the lowest reconstruction error, $RE(\mathbf{x}, \hat{\mathbf{x}})$, on all input data instances. A typical choice of the reconstruction error is the squared loss function:

$$RE(\mathbf{x}, \hat{\mathbf{x}}) = \| \mathbf{x} - \hat{\mathbf{x}} \|^2.$$

The model parameters of the autoencoder can be learned by using a similar gradient descent method as the one used for learning supervised ANN models for classification. The key difference is the use of the reconstruction error on all training instances as the training loss. Autoencoders that have multiple hidden layers are known as stacked autoencoders.

Autoencoders are able to capture complex representations of the input data by the use of hidden nodes. However, if the number of hidden nodes is large, it is possible for an autoencoder to learn the identity relationship, where the input x is just copied and returned as the output $\hat{\mathbf{x}}$, resulting in a trivial solution. For example, if we use as many hidden nodes as the number of attributes, then it is possible for every hidden node to copy an attribute and simply pass it along to an output node, without extracting any useful information. To avoid this problem, it is common practice to keep the number of hidden nodes smaller than the number of input attributes. This forces the autoencoder to learn a compact and useful encoding of the input data, similar to a dimensionality reduction technique. An alternate approach is to corrupt the input instances by adding random noise, and then learn the autoencoder to reconstruct the original instance from the noisy input. This approach is known as the denoising autoencoder, which offers strong regularization capabilities and is often used to learn complex features even in the presence of a large number of hidden nodes.
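To make the reconstruction-error objective concrete, the following is a minimal sketch of a single-layer autoencoder trained by gradient descent on the squared reconstruction error. It uses a linear encoder and decoder to keep the gradient expressions short; the function name and settings are illustrative assumptions, not a reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_autoencoder(X, n_hidden, lr=0.1, epochs=200):
    """Single-layer autoencoder minimizing 0.5 * ||x - x_hat||^2 by gradient
    descent.  X has one instance per row; n_hidden < X.shape[1] forces a
    compact encoding, as discussed in the text."""
    n, d = X.shape
    W_enc = rng.normal(scale=0.1, size=(n_hidden, d))   # attributes -> latent features
    W_dec = rng.normal(scale=0.1, size=(d, n_hidden))   # latent features -> reconstruction
    for _ in range(epochs):
        H = X @ W_enc.T            # latent features for all instances
        X_hat = H @ W_dec.T        # reconstructions x_hat
        err = X_hat - X            # gradient of 0.5 * ||x - x_hat||^2 w.r.t. x_hat
        W_dec -= lr * (err.T @ H) / n
        W_enc -= lr * ((err @ W_dec).T @ X) / n
    return W_enc, W_dec
```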
To use an autoencoder for unsupervised pretraining, we can follow a similar layer-wise approach as in supervised pretraining. In particular, to pretrain the model parameters of the l-th hidden layer, we can construct a reduced ANN model with only l hidden layers and an output layer that contains the same number of nodes as the attributes and is used for reconstruction. The parameters of the l-th hidden layer of this network are then learned using a gradient descent method to minimize the reconstruction error. The use of unlabeled data can be viewed as providing hints to the learning of parameters at every layer that aid in generalization. The final model parameters of the ANN model are then learned by applying gradient descent over all the layers, using the initial values of parameters obtained from pretraining.

Hybrid Pretraining
Unsupervised pretraining can also be combined with supervised pretraining by using two output layers at every run of pretraining, one for reconstruction and the other for supervised classification. The parameters of the l-th hidden layer are then learned by jointly minimizing the losses on both output layers, usually weighted by a trade-off hyper-parameter $\alpha$. Such a combined approach often shows better generalization performance than either of the approaches, since it provides a way to balance between the competing objectives of representing the input data and improving classification performance.
4.8.5 Characteristics of Deep Learning

Apart from the basic characteristics of ANN discussed in Section 4.7.3, the use of deep learning techniques provides the following additional characteristics:

1. An ANN model trained for some task can easily be re-used for a different task that involves the same attributes, by using pretraining strategies. For example, we can use the learned parameters of the original task as initial parameter choices for the target task. In this way, ANN promotes re-usability of learning, which can be quite useful when the target application has a smaller number of labeled training instances.

2. Deep learning techniques for regularization, such as the dropout method, help in reducing the model complexity of ANN and thus promote good generalization performance. The use of regularization techniques is especially useful in high-dimensional settings, where the number of training labels is small but the classification problem is inherently difficult.

3. The use of an autoencoder for pretraining can help eliminate irrelevant attributes that are not related to other attributes. Further, it can help reduce the impact of redundant attributes by representing them as copies of the same attribute.

4. Although the learning of an ANN model can succumb to finding inferior and locally optimal solutions, there are a number of deep learning techniques that have been proposed to ensure adequate learning of an ANN. Apart from the methods discussed in this section, some other techniques involve novel architecture designs such as skip connections between the output layer and lower layers, which aid the easy flow of gradients during backpropagation.

5. A number of specialized ANN architectures have been designed to handle a variety of input data sets. Some examples include convolutional neural networks (CNNs) for two-dimensional gridded objects such as images, and recurrent neural networks (RNNs) for sequences. While CNNs have been extensively used in the area of computer vision, RNNs have found applications in processing speech and language.
4.9 Support Vector Machine (SVM)

A support vector machine (SVM) is a discriminative classification model that learns linear or nonlinear decision boundaries in the attribute space to separate the classes. Apart from maximizing the separability of the two classes, SVM offers strong regularization capabilities, i.e., it is able to control the complexity of the model in order to ensure good generalization performance. Due to its unique ability to innately regularize its learning, SVM is able to learn highly expressive models without suffering from overfitting. It has thus received considerable attention in the machine learning community and is commonly used in several practical applications, ranging from handwritten digit recognition to text categorization. SVM has strong roots in statistical learning theory and is based on the principle of structural risk minimization. Another unique aspect of SVM is that it represents the decision boundary using only a subset of the training examples that are most difficult to classify, known as the support vectors. Hence, it is a discriminative model that is impacted only by training instances near the boundary of the two classes, in contrast to learning the generative distribution of every class.

To illustrate the basic idea behind SVM, we first introduce the concept of the margin of a separating hyperplane and the rationale for choosing such a hyperplane with maximum margin. We then describe how a linear SVM can be trained to explicitly look for this type of hyperplane. We conclude by showing how the SVM methodology can be extended to learn nonlinear decision boundaries by using kernel functions.
4.9.1 Margin of a Separating Hyperplane

The generic equation of a separating hyperplane can be written as

$$\mathbf{w}^T \mathbf{x} + b = 0,$$

where $\mathbf{x}$ represents the attributes and ($\mathbf{w}$, b) represent the parameters of the hyperplane. A data instance $\mathbf{x}_i$ can belong to either side of the hyperplane depending on the sign of $(\mathbf{w}^T \mathbf{x}_i + b)$. For the purpose of binary classification, we are interested in finding a hyperplane that places instances of both classes on opposite sides of the hyperplane, thus resulting in a separation of the two classes. If there exists a hyperplane that can perfectly separate the classes in the data set, we say that the data set is linearly separable. Figure 4.32 shows an example of linearly separable data involving two classes, squares and circles. Note that there can be infinitely many hyperplanes that separate the classes, two of which are shown in Figure 4.32 as lines $B_1$ and $B_2$. Even though every such hyperplane will have zero training error, they can provide different results on previously unseen instances. Which separating hyperplane should we thus finally choose to obtain the best generalization performance? Ideally, we would like to choose a simple hyperplane that is robust to small perturbations. This can be achieved by using the concept of the margin of a separating hyperplane, which can be briefly described as follows.

Figure 4.32. Margin of a hyperplane in a two-dimensional data set.
For every separating hyperplane $B_i$, let us associate a pair of parallel hyperplanes, $b_{i1}$ and $b_{i2}$, such that they touch the closest instances of both classes, respectively. For example, if we move $B_1$ parallel to its direction, we can touch the first square using $b_{11}$ and the first circle using $b_{12}$. $b_{i1}$ and $b_{i2}$ are known as the margin hyperplanes of $B_i$, and the distance between them is known as the margin of the separating hyperplane $B_i$. From the diagram shown in Figure 4.32, notice that the margin for $B_1$ is considerably larger than that for $B_2$. In this example, $B_1$ turns out to be the separating hyperplane with the maximum margin, known as the maximum margin hyperplane.

Rationale for Maximum Margin
Hyperplanes with large margins tend to have better generalization performance than those with small margins. Intuitively, if the margin is small, then any slight perturbation in the hyperplane or the training instances located at the boundary can have quite an impact on the classification performance. Small-margin hyperplanes are thus more susceptible to overfitting, as they are barely able to separate the classes with very narrow room to allow perturbations. On the other hand, a hyperplane that is farther away from training instances of both classes has sufficient leeway to be robust to minor modifications in the data, and thus shows superior generalization performance.
The idea of choosing the maximum margin separating hyperplane also has strong foundations in statistical learning theory. It can be shown that the margin of such a hyperplane is inversely related to the VC-dimension of the classifier, which is a commonly used measure of the complexity of a model. As discussed in Section 3.4 of the last chapter, a simpler model should be preferred over a more complex model if they both show similar training performance. Hence, maximizing the margin results in the selection of a separating hyperplane with the lowest model complexity, which is expected to show better generalization performance.
4.9.2 Linear SVM

A linear SVM is a classifier that searches for a separating hyperplane with the largest margin, which is why it is often known as a maximal margin classifier. The basic idea of SVM can be described as follows.

Consider a binary classification problem consisting of n training instances, where every training instance $\mathbf{x}_i$ is associated with a binary label $y_i \in \{-1, 1\}$. Let $\mathbf{w}^T \mathbf{x} + b = 0$ be the equation of a separating hyperplane that separates the two classes by placing them on opposite sides. This means that

$$\mathbf{w}^T \mathbf{x}_i + b > 0 \;\; \text{if } y_i = 1, \qquad \mathbf{w}^T \mathbf{x}_i + b < 0 \;\; \text{if } y_i = -1.$$

The distance of any point $\mathbf{x}$ from the hyperplane is then given by

$$D(\mathbf{x}) = \frac{|\mathbf{w}^T \mathbf{x} + b|}{\|\mathbf{w}\|},$$

where $|\cdot|$ denotes the absolute value and $\|\cdot\|$ denotes the length of a vector. Let the distance of the closest point from the hyperplane with $y = 1$ be $k_+ > 0$. Similarly, let $k_- > 0$ denote the distance of the closest point from class $-1$. This can be represented using the following constraints:

$$\frac{\mathbf{w}^T \mathbf{x}_i + b}{\|\mathbf{w}\|} \geq k_+ \;\; \text{if } y_i = 1, \qquad \frac{\mathbf{w}^T \mathbf{x}_i + b}{\|\mathbf{w}\|} \leq -k_- \;\; \text{if } y_i = -1. \qquad (4.69)$$

The previous equations can be succinctly represented by using the product of $y_i$ and $(\mathbf{w}^T \mathbf{x}_i + b)$ as

$$y_i (\mathbf{w}^T \mathbf{x}_i + b) \geq M \|\mathbf{w}\|, \qquad (4.70)$$

where M is a parameter related to the margin of the hyperplane, i.e., if $k_+ = k_- = M$, then margin $= k_+ + k_- = 2M$. In order to find the maximum margin hyperplane that adheres to the previous constraints, we can consider the following optimization problem:

$$\max_{\mathbf{w}, b} \; M \quad \text{subject to} \quad y_i (\mathbf{w}^T \mathbf{x}_i + b) \geq M \|\mathbf{w}\|. \qquad (4.71)$$

To find the solution to the previous problem, note that if $\mathbf{w}$ and b satisfy the constraints of the previous problem, then any scaled version of $\mathbf{w}$ and b would
satisfy them too. Hence, we can conveniently choose $\|\mathbf{w}\| = 1/M$ to simplify the right-hand side of the inequalities. Furthermore, maximizing M amounts to minimizing $\|\mathbf{w}\|^2$. Hence, the optimization problem of SVM is commonly represented in the following form:

$$\min_{\mathbf{w}, b} \;\; \frac{\|\mathbf{w}\|^2}{2} \quad \text{subject to} \quad y_i (\mathbf{w}^T \mathbf{x}_i + b) \geq 1. \qquad (4.72)$$

Learning Model Parameters
Equation 4.72 represents a constrained optimization problem with linear inequalities. Since the objective function is convex and quadratic with respect to $\mathbf{w}$, it is known as a quadratic programming problem (QPP), which can be solved using standard optimization techniques, as described in Appendix E. In the following, we present a brief sketch of the main ideas for learning the model parameters of SVM.

First, we rewrite the objective function in a form that takes into account the constraints imposed on its solutions. The new objective function is known as the Lagrangian primal problem, which can be represented as follows:

$$L_P = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{n} \lambda_i \left( y_i (\mathbf{w}^T \mathbf{x}_i + b) - 1 \right), \qquad (4.73)$$

where the parameters $\lambda_i \geq 0$ correspond to the constraints and are called the Lagrange multipliers. Next, to minimize the Lagrangian, we take the derivatives of $L_P$ with respect to $\mathbf{w}$ and b and set them equal to zero:

$$\frac{\partial L_P}{\partial \mathbf{w}} = 0 \;\; \Rightarrow \;\; \mathbf{w} = \sum_{i=1}^{n} \lambda_i y_i \mathbf{x}_i, \qquad (4.74)$$

$$\frac{\partial L_P}{\partial b} = 0 \;\; \Rightarrow \;\; \sum_{i=1}^{n} \lambda_i y_i = 0. \qquad (4.75)$$
Note that using Equation 4.74, we can represent $\mathbf{w}$ completely in terms of the Lagrange multipliers. There is another relationship between ($\mathbf{w}$, b) and $\lambda_i$ that is derived from the Karush-Kuhn-Tucker (KKT) conditions, a commonly used technique for solving QPP. This relationship can be described as

$$\lambda_i \left[ y_i (\mathbf{w}^T \mathbf{x}_i + b) - 1 \right] = 0. \qquad (4.76)$$

Equation 4.76 is known as the complementary slackness condition, which sheds light on a valuable property of SVM. It states that the Lagrange multiplier $\lambda_i$ is strictly greater than 0 only when $\mathbf{x}_i$ satisfies the equation $y_i (\mathbf{w}^T \mathbf{x}_i + b) = 1$, which means that $\mathbf{x}_i$ lies exactly on a margin hyperplane. However, if $\mathbf{x}_i$ is farther away from the margin hyperplanes such that $y_i (\mathbf{w}^T \mathbf{x}_i + b) > 1$, then $\lambda_i$ is necessarily 0. Hence, $\lambda_i > 0$ for only a small number of instances that are closest to the separating hyperplane, which are known as support vectors. Figure 4.33 shows the support vectors of a hyperplane as filled circles and squares. Further, if we look at Equation 4.74, we will observe that training instances with $\lambda_i = 0$ do not contribute to the weight parameter $\mathbf{w}$. This suggests that $\mathbf{w}$ can be concisely represented only in terms of the support vectors in the training data, which are far fewer than the overall number of training instances. This ability to represent the decision function only in terms of the support vectors is what gives this classifier the name support vector machine.

Figure 4.33. Support vectors of a hyperplane shown as filled circles and squares.
Using Equations 4.74, 4.75, and 4.76 in Equation 4.73, we obtain the following optimization problem in terms of the Lagrange multipliers $\lambda_i$:

$$\max_{\lambda_i} \;\; \sum_{i=1}^{n} \lambda_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \lambda_i \lambda_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j \quad \text{subject to} \quad \sum_{i=1}^{n} \lambda_i y_i = 0, \;\; \lambda_i \geq 0. \qquad (4.77)$$

The previous optimization problem is called the dual optimization problem. Maximizing the dual problem with respect to $\lambda_i$ is equivalent to minimizing the primal problem with respect to $\mathbf{w}$ and b.

The key differences between the dual and primal problems are as follows:
1. Solving the dual problem helps us identify the support vectors in the data, which have non-zero values of $\lambda_i$. Further, the solution of the dual problem is influenced only by the support vectors that are closest to the decision boundary of SVM. This helps in summarizing the learning of SVM solely in terms of its support vectors, which are easier to manage computationally. Further, it represents a unique ability of SVM to depend only on the instances closest to the boundary, which are harder to classify, rather than the distribution of instances farther away from the boundary.

2. The objective of the dual problem involves only terms of the form $\mathbf{x}_i^T \mathbf{x}_j$, which are basically inner products in the attribute space. As we will see later in Section 4.9.4, this property will prove to be quite useful in learning nonlinear decision boundaries using SVM.

Because of these differences, it is useful to solve the dual optimization problem using any of the standard solvers for QPP. Having found an optimal solution for $\lambda_i$, we can use Equation 4.74 to solve for $\mathbf{w}$. We can then use Equation 4.76 on the support vectors to solve for b as follows:

$$b = \frac{1}{n_S} \sum_{i \in S} \frac{1 - y_i \mathbf{w}^T \mathbf{x}_i}{y_i}, \qquad (4.78)$$

where S represents the set of support vectors $(S = \{i \,|\, \lambda_i > 0\})$ and $n_S$ is the number of support vectors. The maximum margin hyperplane can then be expressed as

$$f(\mathbf{x}) = \left( \sum_{i=1}^{n} \lambda_i y_i \mathbf{x}_i^T \mathbf{x} \right) + b = 0. \qquad (4.79)$$

Using this separating hyperplane, a test instance $\mathbf{x}$ can be assigned a class label using the sign of $f(\mathbf{x})$.
Example 4.7. Consider the two-dimensional data set shown in Figure 4.34, which contains eight training instances. Using quadratic programming, we can solve the optimization problem stated in Equation 4.77 to obtain the Lagrange multiplier $\lambda_i$ for each training instance. The Lagrange multipliers are depicted in the last column of the table. Notice that only the first two instances have non-zero Lagrange multipliers. These instances correspond to the support vectors for this data set.

Figure 4.34. Example of a linearly separable data set.

Let $\mathbf{w} = (w_1, w_2)$ and b denote the parameters of the decision boundary. Using Equation 4.74, we can solve for $w_1$ and $w_2$ in the following way:

$$w_1 = \sum_i \lambda_i y_i x_{i1} = 65.5261 \times 1 \times 0.3858 + 65.5261 \times (-1) \times 0.4871 = -6.64,$$
$$w_2 = \sum_i \lambda_i y_i x_{i2} = 65.5261 \times 1 \times 0.4687 + 65.5261 \times (-1) \times 0.611 = -9.32.$$

The bias term b can be computed using Equation 4.76 for each support vector:

$$b^{(1)} = 1 - \mathbf{w} \cdot \mathbf{x}_1 = 1 - (-6.64)(0.3858) - (-9.32)(0.4687) = 7.9300,$$
$$b^{(2)} = -1 - \mathbf{w} \cdot \mathbf{x}_2 = -1 - (-6.64)(0.4871) - (-9.32)(0.611) = 7.9289.$$

Averaging these values, we obtain $b = 7.93$. The decision boundary corresponding to these parameters is shown in Figure 4.34.
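In practice, the quadratic program of Equation 4.77 is solved by library routines. The sketch below, using scikit-learn on a small made-up data set (not the data set of Figure 4.34, whose coordinates are given only in the book's table), shows how the learned quantities in this example ($\mathbf{w}$, b, the support vectors, and the nonzero $\lambda_i y_i$) can be read off a fitted linear SVM:

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data, used only for illustration.
X = np.array([[0.4, 0.5], [0.5, 0.6], [0.9, 0.4], [1.0, 1.0],
              [0.1, 0.1], [0.2, 0.3], [0.0, 0.4], [0.3, 0.0]])
y = np.array([1, 1, 1, 1, -1, -1, -1, -1])

# A very large C approximates the hard-margin formulation of Equation 4.72.
svm = SVC(kernel="linear", C=1e6).fit(X, y)

print("w =", svm.coef_[0])                     # weight vector (Equation 4.74)
print("b =", svm.intercept_[0])                # bias term (Equation 4.78)
print("support vectors:", svm.support_vectors_)
print("lambda_i * y_i:", svm.dual_coef_[0])    # nonzero only for support vectors
print("prediction for (0.6, 0.6):", svm.predict([[0.6, 0.6]]))
```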
4.9.3 Soft-margin SVM

Figure 4.35 shows a data set that is similar to Figure 4.32, except it has two new examples, P and Q. Although the decision boundary $B_1$ misclassifies the new examples while $B_2$ classifies them correctly, this does not mean that $B_2$ is a better decision boundary than $B_1$, because the new examples may correspond to noise in the training data. $B_1$ should still be preferred over $B_2$ because it has a wider margin, and thus is less susceptible to overfitting. However, the SVM formulation presented in the previous section only constructs decision boundaries that are mistake-free.

Figure 4.35. Decision boundary of SVM for the non-separable case.

This section examines how the formulation of SVM can be modified to learn a separating hyperplane that is tolerant to a small number of training errors, using a method known as the soft-margin approach. More importantly, the method presented in this section allows SVM to learn linear hyperplanes even in situations where the classes are not linearly separable. To do this, the learning algorithm in SVM must consider the trade-off between the width of the margin and the number of training errors committed by the linear hyperplane.
To introduce the concept of training errors in the SVM formulation, let us relax the inequality constraints to accommodate some violations on a small number of training instances. This can be done by introducing a slack variable $\xi_i \geq 0$ for every training instance $\mathbf{x}_i$ as follows:

$$y_i (\mathbf{w}^T \mathbf{x}_i + b) \geq 1 - \xi_i. \qquad (4.80)$$

The variable $\xi_i$ allows for some slack in the inequalities of the SVM such that every instance $\mathbf{x}_i$ does not need to strictly satisfy $y_i (\mathbf{w}^T \mathbf{x}_i + b) \geq 1$. Further, $\xi_i$ is non-zero only if the margin hyperplanes are not able to place $\mathbf{x}_i$ on the same side as the rest of the instances belonging to $y_i$. To illustrate this, Figure 4.36 shows a circle P that falls on the opposite side of the separating hyperplane as the rest of the circles, and thus satisfies $\mathbf{w}^T \mathbf{x} + b = -1 + \xi$. The distance between P and the margin hyperplane $\mathbf{w}^T \mathbf{x} + b = -1$ is equal to $\xi / \|\mathbf{w}\|$. Hence, $\xi_i$ provides a measure of the error of SVM in representing $\mathbf{x}_i$ using soft inequality constraints.

Figure 4.36. Slack variables used in soft-margin SVM.
In the presence of slack variables, it is important to learn a separating hyperplane that jointly maximizes the margin (ensuring good generalization performance) and minimizes the values of the slack variables (ensuring low training error). This can be achieved by modifying the optimization problem of SVM as follows:

$$\min_{\mathbf{w}, b, \xi_i} \;\; \frac{\|\mathbf{w}\|^2}{2} + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i (\mathbf{w}^T \mathbf{x}_i + b) \geq 1 - \xi_i, \;\; \xi_i \geq 0, \qquad (4.81)$$

where C is a hyper-parameter that makes a trade-off between maximizing the margin and minimizing the training error. A large value of C places more emphasis on minimizing the training error than on maximizing the margin. Notice the similarity of the previous equation with the generic formula of generalization error rate introduced in Section 3.4 of the previous chapter. Indeed, SVM provides a natural way to balance between model complexity and training error in order to maximize generalization performance.

To solve Equation 4.81, we apply the Lagrange multiplier method and convert the primal problem to its corresponding dual problem, similar to the approach described in the previous section. The Lagrangian primal problem of Equation 4.81 can be written as follows:

$$L_P = \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i - \sum_{i=1}^{n} \lambda_i \left( y_i (\mathbf{w}^T \mathbf{x}_i + b) - 1 + \xi_i \right) - \sum_{i=1}^{n} \mu_i \xi_i, \qquad (4.82)$$

where $\lambda_i \geq 0$ and $\mu_i \geq 0$ are the Lagrange multipliers corresponding to the inequality constraints of Equation 4.81. Setting the derivatives of $L_P$ with respect to $\mathbf{w}$, b, and $\xi_i$ equal to 0, we obtain the following equations:

$$\frac{\partial L_P}{\partial \mathbf{w}} = 0 \;\; \Rightarrow \;\; \mathbf{w} = \sum_{i=1}^{n} \lambda_i y_i \mathbf{x}_i. \qquad (4.83)$$

$$\frac{\partial L_P}{\partial b} = 0 \;\; \Rightarrow \;\; \sum_{i=1}^{n} \lambda_i y_i = 0. \qquad (4.84)$$
$$\frac{\partial L_P}{\partial \xi_i} = 0 \;\; \Rightarrow \;\; \lambda_i + \mu_i = C. \qquad (4.85)$$

We can also obtain the complementary slackness conditions by using the following KKT conditions:

$$\lambda_i \left( y_i (\mathbf{w}^T \mathbf{x}_i + b) - 1 + \xi_i \right) = 0, \qquad (4.86)$$

$$\mu_i \xi_i = 0. \qquad (4.87)$$

Equation 4.86 suggests that $\lambda_i$ is zero for all training instances except those that reside on the margin hyperplanes $\mathbf{w}^T \mathbf{x}_i + b = \pm 1$, or have $\xi_i > 0$. The instances with $\lambda_i > 0$ are known as support vectors. On the other hand, $\mu_i$ given in Equation 4.87 is zero for any training instance that is misclassified, i.e., has $\xi_i > 0$. Further, $\lambda_i$ and $\mu_i$ are related to each other by Equation 4.85. This results in the following three configurations of $(\lambda_i, \mu_i)$:

1. If $\lambda_i = 0$ and $\mu_i = C$, then $\mathbf{x}_i$ does not reside on the margin hyperplanes and is correctly classified on the same side as other instances belonging to $y_i$.

2. If $\lambda_i = C$ and $\mu_i = 0$, then $\mathbf{x}_i$ is misclassified and has a non-zero slack variable $\xi_i$.

3. If $0 < \lambda_i < C$ and $0 < \mu_i < C$, then $\mathbf{x}_i$ resides on one of the margin hyperplanes.

Substituting Equations 4.83 to 4.87 into Equation 4.82, we obtain the following dual optimization problem:

$$\max_{\lambda_i} \;\; \sum_{i=1}^{n} \lambda_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \lambda_i \lambda_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j \quad \text{subject to} \quad \sum_{i=1}^{n} \lambda_i y_i = 0, \;\; 0 \leq \lambda_i \leq C. \qquad (4.88)$$

Notice that the previous problem looks almost identical to the dual problem of SVM for the linearly separable case (Equation 4.77), except that $\lambda_i$ is
required to not only be greater than 0 but also smaller than a constant value C. Clearly, when C reaches infinity, the previous optimization problem becomes equivalent to Equation 4.77, where the learned hyperplane perfectly separates the classes (with no training errors). However, by capping the values of $\lambda_i$ at C, the learned hyperplane is able to tolerate a few training errors that have $\xi_i > 0$.

Figure 4.37. Hinge loss as a function of $y\hat{y}$.

As before, Equation 4.88 can be solved by using any of the standard solvers for QPP, and the optimal value of $\mathbf{w}$ can be obtained by using Equation 4.83. To solve for b, we can use Equation 4.86 on the support vectors that reside on the margin hyperplanes as follows:

$$b = \frac{1}{n_S} \sum_{i \in S} \frac{1 - y_i \mathbf{w}^T \mathbf{x}_i}{y_i}, \qquad (4.89)$$

where S represents the set of support vectors residing on the margin hyperplanes $(S = \{i \,|\, 0 < \lambda_i < C\})$ and $n_S$ is the number of elements in S.
SVM as a Regularizer of Hinge Loss
SVM belongs to a broad class of regularization techniques that use a loss function to represent the training errors and a norm of the model parameters to represent the model complexity. To realize this, notice that the slack variable $\xi$, used for measuring the training errors in SVM, is equivalent to the hinge loss function, which can be defined as follows:

$$\mathrm{Loss}(y, \hat{y}) = \max(0, 1 - y\hat{y}),$$

where $y \in \{+1, -1\}$. In the case of SVM, $\hat{y}$ corresponds to $\mathbf{w}^T \mathbf{x} + b$. Figure 4.37 shows a plot of the hinge loss function as we vary $y\hat{y}$. We can see that the hinge loss is equal to 0 as long as y and $\hat{y}$ have the same sign and $|\hat{y}| \geq 1$. However, the hinge loss grows linearly with $|\hat{y}|$ whenever y and $\hat{y}$ are of the opposite sign or $|\hat{y}| < 1$. This is similar to the notion of the slack variable $\xi$, which is used to measure the distance of a point from its margin hyperplane. Hence, the optimization problem of SVM can be represented in the following equivalent form:

$$\min_{\mathbf{w}, b} \;\; \frac{\|\mathbf{w}\|^2}{2} + C \sum_{i=1}^{n} \mathrm{Loss}(y_i, \mathbf{w}^T \mathbf{x}_i + b). \qquad (4.90)$$

Note that using the hinge loss ensures that the optimization problem is convex and can be solved using standard optimization techniques. However, if we use a different loss function, such as the squared loss function that was introduced in Section 4.7 on ANN, it will result in a different optimization problem that may or may not remain convex. Nevertheless, different loss functions can be explored to capture varying notions of training error, depending on the characteristics of the problem.
Another interesting property of SVM that relates it to a broader class of regularization techniques is the concept of a margin. Although minimizing $\|\mathbf{w}\|^2$ has the geometric interpretation of maximizing the margin of a separating hyperplane, it is essentially the squared $L_2$ norm of the model parameters, $\|\mathbf{w}\|_2^2$. In general, the $L_q$ norm of $\mathbf{w}$, $\|\mathbf{w}\|_q$, is equal to the Minkowski distance of order q from $\mathbf{w}$ to the origin, i.e.,

$$\|\mathbf{w}\|_q = \left( \sum_{i=1}^{p} |w_i|^q \right)^{1/q}.$$

Minimizing the $L_q$ norm of $\mathbf{w}$ to achieve lower model complexity is a generic regularization concept that has several interpretations. For example, minimizing the $L_2$ norm amounts to finding a solution on a hypersphere of smallest radius that shows suitable training performance. To visualize this in two dimensions, Figure 4.38(a) shows the plot of a circle with constant radius r, where every point has the same $L_2$ norm. On the other hand, using the $L_1$ norm ensures that the solution lies on the surface of a hypercube of smallest size, with vertices along the axes. This is illustrated in Figure 4.38(b) as a square with vertices on the axes at a distance of r from the origin. The $L_1$ norm is commonly used as a regularizer to obtain sparse model parameters with only a small number of non-zero parameter values, such as the use of Lasso in regression problems (see Bibliographic Notes).

Figure 4.38. Plots showing the behavior of two-dimensional solutions with constant $L_2$ and $L_1$ norms.

In general, depending on the characteristics of the problem, different combinations of $L_q$ norms and training loss functions can be used for learning the model parameters, each requiring a different optimization solver. This forms the backbone of a wide range of modeling techniques that attempt to improve the generalization performance by jointly minimizing training error and model complexity. However, in this section, we focus only on the squared $L_2$ norm and the hinge loss function, resulting in the classical formulation of SVM.
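The regularized hinge-loss form of Equation 4.90 can also be minimized directly by (sub)gradient descent rather than by a QPP solver. The following is a minimal sketch of that idea, written only for illustration; dedicated solvers are used in practice and the function name and settings are assumptions.

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.001, epochs=1000):
    """Subgradient descent on Equation 4.90:
    minimize ||w||^2 / 2 + C * sum(max(0, 1 - y_i (w.x_i + b)))."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1                      # instances with non-zero hinge loss
        grad_w = w - C * (y[active, None] * X[active]).sum(axis=0)
        grad_b = -C * y[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```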
4.9.4 Nonlinear SVM
The SVM formulations described in the previous sections construct a linear decision boundary to separate the training examples into their respective classes. This section presents a methodology for applying SVM to data sets that have nonlinear decision boundaries. The basic idea is to transform the data from its original attribute space $\mathbf{x}$ into a new space $\varphi(\mathbf{x})$ so that a linear hyperplane can be used to separate the instances in the transformed space, using the SVM approach. The learned hyperplane can then be projected back to the original attribute space, resulting in a nonlinear decision boundary.

Figure 4.39. Classifying data with a nonlinear decision boundary.

Attribute Transformation
To illustrate how attribute transformation can lead to a linear decision boundary, Figure 4.39(a) shows an example of a two-dimensional data set consisting of squares (classified as $y = 1$) and circles (classified as $y = -1$). The
data set is generated in such a way that all the circles are clustered near the center of the diagram and all the squares are distributed farther away from the center. Instances of the data set can be classified using the following equation:

$$y = \begin{cases} 1 & \text{if } \sqrt{(x_1 - 0.5)^2 + (x_2 - 0.5)^2} > 0.2, \\ -1 & \text{otherwise.} \end{cases} \qquad (4.91)$$

The decision boundary for the data can therefore be written as follows:

$$\sqrt{(x_1 - 0.5)^2 + (x_2 - 0.5)^2} = 0.2,$$

which can be further simplified into the following quadratic equation:

$$x_1^2 - x_1 + x_2^2 - x_2 = -0.46.$$

A nonlinear transformation $\varphi$ is needed to map the data from its original attribute space into a new space such that a linear hyperplane can separate the classes. This can be achieved by using the following simple transformation:

$$\varphi : (x_1, x_2) \rightarrow (x_1^2 - x_1, \; x_2^2 - x_2). \qquad (4.92)$$

Figure 4.39(b) shows the points in the transformed space, where we can see that all the circles are located in the lower left-hand side of the diagram. A linear hyperplane with parameters $\mathbf{w}$ and b can therefore be constructed in the transformed space to separate the instances into their respective classes.

One may think that because the nonlinear transformation possibly increases the dimensionality of the input space, this approach can suffer from the curse of dimensionality that is often associated with high-dimensional data. However, as we will see in the following section, nonlinear SVM is able to avoid this problem by using kernel functions.
Learning a Nonlinear SVM Model
Using a suitable function $\varphi(\cdot)$, we can transform any data instance $\mathbf{x}$ to $\varphi(\mathbf{x})$. (The details on how to choose $\varphi(\cdot)$ will become clear later.) The linear hyperplane in the transformed space can be expressed as $\mathbf{w}^T \varphi(\mathbf{x}) + b = 0$. To learn the optimal separating hyperplane, we can substitute $\varphi(\mathbf{x})$ for $\mathbf{x}$ in the formulation of SVM to obtain the following optimization problem:

$$\min_{\mathbf{w}, b, \xi_i} \;\; \frac{\|\mathbf{w}\|^2}{2} + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i (\mathbf{w}^T \varphi(\mathbf{x}_i) + b) \geq 1 - \xi_i, \;\; \xi_i \geq 0. \qquad (4.93)$$

Using Lagrange multipliers $\lambda_i$, this can be converted into a dual optimization problem:

$$\max_{\lambda_i} \;\; \sum_{i=1}^{n} \lambda_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \lambda_i \lambda_j y_i y_j \langle \varphi(\mathbf{x}_i), \varphi(\mathbf{x}_j) \rangle \quad \text{subject to} \quad \sum_{i=1}^{n} \lambda_i y_i = 0, \;\; 0 \leq \lambda_i \leq C, \qquad (4.94)$$

where $\langle \mathbf{a}, \mathbf{b} \rangle$ denotes the inner product between vectors $\mathbf{a}$ and $\mathbf{b}$. Also, the equation of the hyperplane in the transformed space can be represented using $\lambda_i$ as follows:

$$\sum_{i=1}^{n} \lambda_i y_i \langle \varphi(\mathbf{x}_i), \varphi(\mathbf{x}) \rangle + b = 0. \qquad (4.95)$$

Further, b is given by

$$b = \frac{1}{n_S} \left( \sum_{i \in S} \frac{1}{y_i} - \sum_{i \in S} \sum_{j=1}^{n} \lambda_j y_j \langle \varphi(\mathbf{x}_j), \varphi(\mathbf{x}_i) \rangle \right), \qquad (4.96)$$

where $S = \{i \,|\, 0 < \lambda_i < C\}$ is the set of support vectors residing on the margin hyperplanes and $n_S$ is the number of elements in S.
Note that in order to solve the dual optimization problem in Equation 4.94, or to use the learned model parameters to make predictions using Equations 4.95 and 4.96, we need only inner products of $\varphi(\mathbf{x})$. Hence, even though $\varphi(\mathbf{x})$ may be nonlinear and high-dimensional, it suffices to use a function of the inner products of $\varphi(\mathbf{x})$ in the transformed space. This can be achieved by using a kernel trick, which can be described as follows.

The inner product between two vectors is often regarded as a measure of similarity between the vectors. For example, the cosine similarity described in Section 2.4.5 on page 79 can be defined as the dot product between two vectors that are normalized to unit length. Analogously, the inner product $\langle \varphi(\mathbf{x}_i), \varphi(\mathbf{x}_j) \rangle$ can also be regarded as a measure of similarity between two instances, $\mathbf{x}_i$ and $\mathbf{x}_j$, in the transformed space. The kernel trick is a method for computing this similarity as a function of the original attributes. Specifically, the kernel function K(u, v) between two instances u and v can be defined as follows:

$$K(\mathbf{u}, \mathbf{v}) = \langle \varphi(\mathbf{u}), \varphi(\mathbf{v}) \rangle = f(\mathbf{u}, \mathbf{v}), \qquad (4.97)$$

where $f(\cdot)$ is a function that follows certain conditions as stated by Mercer's Theorem. Although the details of this theorem are outside the scope of the book, we provide a list of some of the commonly used kernel functions:

Polynomial kernel: $K(\mathbf{u}, \mathbf{v}) = (\mathbf{u}^T \mathbf{v} + 1)^p$  (4.98)
Radial Basis Function kernel: $K(\mathbf{u}, \mathbf{v}) = e^{-\|\mathbf{u} - \mathbf{v}\|^2 / (2\sigma^2)}$  (4.99)
Sigmoid kernel: $K(\mathbf{u}, \mathbf{v}) = \tanh(k \mathbf{u}^T \mathbf{v} - \delta)$  (4.100)

By using a kernel function, we can directly work with inner products in the transformed space without dealing with the exact forms of the nonlinear transformation function $\varphi$. Specifically, this allows us to use high-dimensional transformations (sometimes even involving infinitely many dimensions), while performing calculations only in the original attribute space. Computing the inner products using kernel functions is also considerably cheaper than using the transformed attribute set $\varphi(\mathbf{x})$. Hence, the use of kernel functions provides a significant advantage in representing nonlinear decision boundaries, without suffering from the curse of dimensionality. This has been one of the major reasons behind the widespread usage of SVM in highly complex and nonlinear problems.
Figure 4.40. Decision boundary produced by a nonlinear SVM with polynomial kernel.

Figure 4.40 shows the nonlinear decision boundary obtained by SVM using the polynomial kernel function given in Equation 4.98. We can see that the learned decision boundary is quite close to the true decision boundary shown in Figure 4.39(a). Although the choice of kernel function depends on the characteristics of the input data, a commonly used kernel function is the radial basis function (RBF) kernel, which involves a single hyper-parameter $\sigma$, known as the standard deviation of the RBF kernel.
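As an illustration of a kernelized SVM in code, the sketch below fits an RBF-kernel SVM to synthetic data generated in the spirit of Figure 4.39(a); the data and parameter values are assumptions made for the example, not the book's data set, and scikit-learn parameterizes the RBF kernel as gamma rather than sigma.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)

# Points in the unit square: circles near (0.5, 0.5), squares farther away.
X = rng.uniform(0, 1, size=(300, 2))
y = np.where(np.sqrt(((X - 0.5) ** 2).sum(axis=1)) > 0.2, 1, -1)

# RBF kernel of Equation 4.99; in scikit-learn, gamma = 1 / (2 * sigma^2).
sigma = 0.2
clf = SVC(kernel="rbf", C=10.0, gamma=1.0 / (2 * sigma**2)).fit(X, y)
print("training accuracy:", clf.score(X, y))
print("support vectors per class:", clf.n_support_)
```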
4.9.5 Characteristics of SVM

1. The SVM learning problem can be formulated as a convex optimization problem, in which efficient algorithms are available to find the global minimum of the objective function. Other classification methods, such as rule-based classifiers and artificial neural networks, employ a greedy strategy to search the hypothesis space. Such methods tend to find only locally optimum solutions.

2. SVM provides an effective way of regularizing the model parameters by maximizing the margin of the decision boundary. Furthermore, it is able to create a balance between model complexity and training errors by using a hyper-parameter C. This trade-off is generic to a broader class of model learning techniques that capture the model complexity and the training loss using different formulations.

3. A linear SVM can handle irrelevant attributes by learning zero weights corresponding to such attributes. It can also handle redundant attributes by learning similar weights for the duplicate attributes. Furthermore, the ability of SVM to regularize its learning makes it more robust to the presence of a large number of irrelevant and redundant attributes than other classifiers, even in high-dimensional settings. For this reason, nonlinear SVMs are less impacted by irrelevant and redundant attributes than other highly expressive classifiers that can learn nonlinear decision boundaries, such as decision trees.

To compare the effect of irrelevant attributes on the performance of nonlinear SVMs and decision trees, consider the two-dimensional data set shown in Figure 4.41(a), containing 500 instances of class '+' and 500 instances of class 'o', where the two classes can be easily separated using a nonlinear decision boundary. We incrementally add irrelevant attributes to this data set and compare the performance of two classifiers: a decision tree and a nonlinear SVM (using the radial basis function kernel), using 70% of the data for training and the rest for testing. Figure 4.41(b) shows the test error rates of the two classifiers as we increase the number of irrelevant attributes. We can see that the test error rate of the decision tree swiftly reaches 0.5 (the same as random guessing) in the presence of even a small number of irrelevant attributes. This can be attributed to the problem of multiple comparisons while choosing splitting attributes at internal nodes, as discussed in Example 3.7 of the previous chapter. On the other hand, the nonlinear SVM shows a more robust and steady performance even after adding a moderately large number of irrelevant attributes. Its test error rate degrades gradually and reaches close to 0.5 only after adding 125 irrelevant attributes, at which point it becomes difficult to discern the discriminative information in the original two attributes from the noise in the remaining attributes for learning nonlinear decision boundaries.

Figure 4.41. Comparing the effect of adding irrelevant attributes on the performance of nonlinear SVMs and decision trees.

4. SVM can be applied to categorical data by introducing dummy variables for each categorical attribute value present in the data. For example, if a categorical attribute has three values {Single, Married, Divorced}, we can introduce a binary variable for each of the attribute values.

5. The SVM formulation presented in this chapter is for binary class problems. However, multiclass extensions of SVM have also been proposed.

6. Although the training time of an SVM model can be large, the learned parameters can be succinctly represented with the help of a small number of support vectors, making the classification of test instances quite fast.
4.10 Ensemble Methods

This section presents techniques for improving classification accuracy by aggregating the predictions of multiple classifiers. These techniques are known as ensemble or classifier combination methods. An ensemble method constructs a set of base classifiers from training data and performs classification by taking a vote on the predictions made by each base classifier. This section explains why ensemble methods tend to perform better than any single classifier and presents techniques for constructing the classifier ensemble.

4.10.1 Rationale for Ensemble Method

The following example illustrates how an ensemble method can improve a classifier's performance.
Example 4.8. Consider an ensemble of 25 binary classifiers, each of which has an error rate of $\epsilon = 0.35$. The ensemble classifier predicts the class label of a test example by taking a majority vote on the predictions made by the base classifiers. If the base classifiers are identical, then all the base classifiers will commit the same mistakes. Thus, the error rate of the ensemble remains 0.35. On the other hand, if the base classifiers are independent, i.e., their errors are uncorrelated, then the ensemble makes a wrong prediction only if more than half of the base classifiers predict incorrectly. In this case, the error rate of the ensemble classifier is

$$e_{\text{ensemble}} = \sum_{i=13}^{25} \binom{25}{i} \epsilon^i (1 - \epsilon)^{25-i} = 0.06, \qquad (4.101)$$

which is considerably lower than the error rate of the base classifiers.

Figure 4.42 shows the error rate of an ensemble of 25 binary classifiers $(e_{\text{ensemble}})$ for different base classifier error rates $(\epsilon)$. The diagonal line represents the case in which the base classifiers are identical, while the solid line represents the case in which the base classifiers are independent. Observe that the ensemble classifier performs worse than the base classifiers when $\epsilon$ is larger than 0.5.

The preceding example illustrates two necessary conditions for an ensemble classifier to perform better than a single classifier: (1) the base classifiers should be independent of each other, and (2) the base classifiers should do better than a classifier that performs random guessing. In practice, it is difficult to ensure total independence among the base classifiers. Nevertheless, improvements in classification accuracies have been observed in ensemble methods in which the base classifiers are somewhat correlated.
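The value 0.06 in Equation 4.101 can be checked with a few lines of Python:

```python
from math import comb

# Probability that a majority of 25 independent base classifiers
# (each with error rate 0.35) is wrong, as in Equation 4.101.
eps = 0.35
e_ensemble = sum(comb(25, i) * eps**i * (1 - eps)**(25 - i) for i in range(13, 26))
print(round(e_ensemble, 2))   # 0.06
```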
4.10.2 Methods for Constructing an Ensemble Classifier

A logical view of the ensemble method is presented in Figure 4.43. The basic idea is to construct multiple classifiers from the original data and then aggregate their predictions when classifying unknown examples. The ensemble of classifiers can be constructed in many ways:
1. Bymanipulatingthetrainingset.Inthisapproach,multipletrainingsetsarecreatedbyresamplingtheoriginaldataaccordingtosomesamplingdistributionandconstructingaclassifierfromeachtrainingset.Thesamplingdistributiondetermineshowlikelyitisthatanexamplewillbeselectedfortraining,anditmayvaryfromonetrialtoanother.Baggingandboostingaretwoexamplesofensemblemethodsthatmanipulatetheirtrainingsets.ThesemethodsaredescribedinfurtherdetailinSections4.10.4 and4.10.5 .
Figure4.42.Comparisonbetweenerrorsofbaseclassifiersanderrorsoftheensembleclassifier.
Figure4.43.Alogicalviewoftheensemblelearningmethod.
2. Bymanipulatingtheinputfeatures.Inthisapproach,asubsetofinputfeaturesischosentoformeachtrainingset.Thesubsetcanbeeitherchosenrandomlyorbasedontherecommendationofdomainexperts.Somestudieshaveshownthatthisapproachworksverywellwithdatasetsthatcontainhighlyredundantfeatures.Randomforest,whichisdescribedinSection4.10.6 ,isanensemblemethodthatmanipulatesitsinputfeaturesandusesdecisiontreesasitsbaseclassifiers.
3. Bymanipulatingtheclasslabels.Thismethodcanbeusedwhenthenumberofclassesissufficientlylarge.Thetrainingdataistransformedintoabinaryclassproblembyrandomlypartitioningtheclasslabelsintotwodisjointsubsets, and .TrainingexampleswhoseclassA0 A1
labelbelongstothesubset areassignedtoclass0,whilethosethatbelongtothesubset areassignedtoclass1.Therelabeledexamplesarethenusedtotrainabaseclassifier.Byrepeatingthisprocessmultipletimes,anensembleofbaseclassifiersisobtained.Whenatestexampleispresented,eachbaseclassifier isusedtopredictitsclasslabel.Ifthetestexampleispredictedasclass0,thenalltheclassesthatbelongto willreceiveavote.Conversely,ifitispredictedtobeclass1,thenalltheclassesthatbelongto willreceiveavote.Thevotesaretalliedandtheclassthatreceivesthehighestvoteisassignedtothetestexample.Anexampleofthisapproachistheerror-correctingoutputcodingmethoddescribedonpage331.
4. Bymanipulatingthelearningalgorithm.Manylearningalgorithmscanbemanipulatedinsuchawaythatapplyingthealgorithmseveraltimesonthesametrainingdatawillresultintheconstructionofdifferentclassifiers.Forexample,anartificialneuralnetworkcanchangeitsnetworktopologyortheinitialweightsofthelinksbetweenneurons.Similarly,anensembleofdecisiontreescanbeconstructedbyinjectingrandomnessintothetree-growingprocedure.Forexample,insteadofchoosingthebestsplittingattributeateachnode,wecanrandomlychooseoneofthetopkattributesforsplitting.
The first three approaches are generic methods that are applicable to any classifier, whereas the fourth approach depends on the type of classifier used. The base classifiers for most of these approaches can be generated sequentially (one after another) or in parallel (all at once). Once an ensemble of classifiers has been learned, a test example x is classified by combining the predictions made by the base classifiers Ci(x):

C*(x) = f(C1(x), C2(x), …, Ck(x)),

where f is the function that combines the ensemble responses. One simple approach for obtaining C*(x) is to take a majority vote of the individual predictions. An alternate approach is to take a weighted majority vote, where the weight of a base classifier denotes its accuracy or relevance.
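To make the combination step concrete, the following minimal Python sketch implements both a plain majority vote and a weighted majority vote over base-classifier outputs. The function name, example labels, and weights are illustrative assumptions, not part of the book's algorithms.

```python
def combine_predictions(base_predictions, weights=None):
    """Combine base-classifier predictions C_i(x) for one test instance.

    base_predictions : list of class labels, one per base classifier.
    weights          : optional non-negative weights (e.g., accuracies);
                       if None, a plain majority vote is taken.
    Returns the class label C*(x) with the largest (weighted) vote.
    """
    if weights is None:
        weights = [1.0] * len(base_predictions)
    votes = {}
    for label, w in zip(base_predictions, weights):
        votes[label] = votes.get(label, 0.0) + w
    return max(votes, key=votes.get)

# Hypothetical usage: three base classifiers vote on one instance.
print(combine_predictions(["+", "-", "+"]))                            # majority vote -> "+"
print(combine_predictions(["+", "-", "-"], weights=[0.9, 0.6, 0.55]))  # weighted vote -> "-"
```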
Ensemble methods show the most improvement when used with unstable classifiers, i.e., base classifiers that are sensitive to minor perturbations in the training set, because of high model complexity. Although unstable classifiers may have a low bias in finding the optimal decision boundary, their predictions have a high variance for minor changes in the training set or model selection. This trade-off between bias and variance is discussed in detail in the next section. By aggregating the responses of multiple unstable classifiers, ensemble learning attempts to minimize their variance without worsening their bias.
4.10.3 Bias-Variance Decomposition
Bias-variance decomposition is a formal method for analyzing the generalization error of a predictive model. Although the analysis is slightly different for classification than for regression, we first discuss the basic intuition of this decomposition by using an analogue of a regression problem.
Consider the illustrative task of reaching a target y by firing projectiles from a starting position x, as shown in Figure 4.44. The target corresponds to the desired output at a test instance, while the starting position corresponds to its observed attributes. In this analogy, the projectile represents the model used for predicting the target using the observed attributes. Let ŷ denote the point where the projectile hits the ground, which is analogous to the prediction of the model.

Figure 4.44. Bias-variance decomposition.

Ideally, we would like our predictions to be as close to the true target as possible. However, note that different trajectories of projectiles are possible based on differences in the training data or in the approach used for model selection. Hence, we can observe a variance in the predictions ŷ over different runs of the projectile. Further, the target in our example is not fixed but has some freedom to move around, resulting in a noise component in the true target. This can be understood as the non-deterministic nature of the output variable, where the same set of attributes can have different output values. Let ŷ_avg represent the average prediction of the projectile over multiple runs, and y_avg denote the average target value. The difference between ŷ_avg and y_avg is known as the bias of the model.
In the context of classification, it can be shown that the generalization error of a classification model m can be decomposed into terms involving the bias, variance, and noise components of the model in the following way:

gen.error(m) = c1 × noise + bias(m) + c2 × variance(m),

where c1 and c2 are constants that depend on the characteristics of the training and test sets. Note that while the noise term is intrinsic to the target class, the bias and variance terms depend on the choice of the classification model. The bias of a model represents how close the average prediction of the model is to the average target. Models that are able to learn complex decision boundaries, e.g., models produced by k-nearest neighbor and multi-layer ANN, generally show low bias. The variance of a model captures the stability of its predictions in response to minor perturbations in the training set or the model selection approach.
We can say that a model shows better generalization performance if it has a lower bias and lower variance. However, if the complexity of a model is high but the training size is small, we generally expect to see a lower bias but higher variance, resulting in the phenomenon of overfitting. This phenomenon is pictorially represented in Figure 4.45(a). On the other hand, an overly simplistic model that suffers from underfitting may show a lower variance but would suffer from a high bias, as shown in Figure 4.45(b). Hence, the trade-off between bias and variance provides a useful way for interpreting the effects of underfitting and overfitting on the generalization performance of a model.
Figure 4.45. Plots showing the behavior of two-dimensional solutions with constant L2 and L1 norms.
The bias-variance trade-off can be used to explain why ensemble learning improves the generalization performance of unstable classifiers. If a base classifier shows low bias but high variance, it can become susceptible to overfitting, as even a small change in the training set will result in different predictions. However, by combining the responses of multiple base classifiers, we can expect to reduce the overall variance. Hence, ensemble learning methods show better performance primarily by lowering the variance in the predictions, although they can even help in reducing the bias. One of the simplest approaches for combining predictions and reducing their variance is to compute their average. This forms the basis of the bagging method, described in the following subsection.
4.10.4 Bagging
Bagging, which is also known as bootstrap aggregating, is a technique that repeatedly samples (with replacement) from a data set according to a uniform probability distribution. Each bootstrap sample has the same size as the original data. Because the sampling is done with replacement, some instances may appear several times in the same training set, while others may be omitted from the training set. On average, a bootstrap sample Di contains approximately 63% of the original training data because each sample has a probability 1 − (1 − 1/N)^N of being selected in each Di. If N is sufficiently large, this probability converges to 1 − 1/e ≈ 0.632. The basic procedure for bagging is summarized in Algorithm 4.5. After training the k classifiers, a test instance is assigned to the class that receives the highest number of votes.

To illustrate how bagging works, consider the data set shown in Table 4.4. Let x denote a one-dimensional attribute and y denote the class label. Suppose we use only one-level binary decision trees, with a test condition x ≤ k, where k is a split point chosen to minimize the entropy of the leaf nodes. Such a tree is also known as a decision stump.

Table 4.4. Example of data set used to construct an ensemble of bagging classifiers.

x | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 1.0
y |  1  |  1  |  1  | −1  | −1  | −1  | −1  |  1  |  1  |  1

Without bagging, the best decision stump we can produce splits the instances at either x ≤ 0.35 or x ≤ 0.75. Either way, the accuracy of the tree is at most 70%. Suppose we apply the bagging procedure on the data set using 10 bootstrap samples. The examples chosen for training in each bagging round are shown
in Figure 4.46. On the right-hand side of each table, we also describe the decision stump being used in each round.
We classify the entire data set given in Table 4.4 by taking a majority vote among the predictions made by each base classifier. The results of the predictions are shown in Figure 4.47. Since the class labels are either −1 or +1, taking the majority vote is equivalent to summing up the predicted values of y and examining the sign of the resulting sum (refer to the second to last row in Figure 4.47). Notice that the ensemble classifier perfectly classifies all 10 examples in the original data.

Algorithm 4.5 Bagging algorithm.
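The pseudocode of Algorithm 4.5 is not reproduced in this extraction, so the following is a minimal Python sketch of the bagging procedure under stated assumptions: scikit-learn decision stumps as base classifiers, labels in {−1, +1}, and the Table 4.4 data reused purely for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, k, random_state=0):
    """Train k base classifiers, each on a bootstrap sample of (X, y)."""
    rng = np.random.RandomState(random_state)
    n = len(X)
    classifiers = []
    for _ in range(k):
        idx = rng.randint(0, n, size=n)              # sample n instances with replacement
        stump = DecisionTreeClassifier(max_depth=1)  # one-level decision tree (stump)
        stump.fit(X[idx], y[idx])
        classifiers.append(stump)
    return classifiers

def bagging_predict(classifiers, X):
    """Majority vote; with labels in {-1, +1}, the vote is the sign of the sum."""
    votes = sum(clf.predict(X) for clf in classifiers)
    return np.where(votes >= 0, 1, -1)

# The one-dimensional data set of Table 4.4.
X = np.arange(0.1, 1.05, 0.1).reshape(-1, 1)
y = np.array([1, 1, 1, -1, -1, -1, -1, 1, 1, 1])
ensemble = bagging_fit(X, y, k=10)
print(bagging_predict(ensemble, X))
```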
Figure 4.46. Example of bagging.
The preceding example illustrates another advantage of using ensemble methods in terms of enhancing the representation of the target function. Even though each base classifier is a decision stump, combining the classifiers can lead to a decision boundary that mimics a decision tree of depth 2.
Bagging improves generalization error by reducing the variance of the base classifiers. The performance of bagging depends on the stability of the base classifier. If a base classifier is unstable, bagging helps to reduce the errors associated with random fluctuations in the training data. If a base classifier is stable, i.e., robust to minor perturbations in the training set, then the error of the ensemble is primarily caused by bias in the base classifier. In this situation, bagging may not be able to improve the performance of the base classifiers significantly. It may even degrade the classifier's performance because the effective size of each training set is about 37% smaller than the original data.
Figure 4.47. Example of combining classifiers constructed using the bagging approach.
4.10.5 Boosting
Boosting is an iterative procedure used to adaptively change the distribution of training examples for learning base classifiers so that they increasingly focus on examples that are hard to classify. Unlike bagging, boosting assigns a weight to each training example and may adaptively change the weight at the end of each boosting round. The weights assigned to the training examples can be used in the following ways:
1. They can be used to inform the sampling distribution used to draw a set of bootstrap samples from the original data.
2. They can be used to learn a model that is biased toward examples with higher weight.
This section describes an algorithm that uses weights of examples to determine the sampling distribution of its training set. Initially, the examples are assigned equal weights, 1/N, so that they are equally likely to be chosen for training. A sample is drawn according to the sampling distribution of the training examples to obtain a new training set. Next, a classifier is built from the training set and used to classify all the examples in the original data. The weights of the training examples are updated at the end of each boosting round. Examples that are classified incorrectly will have their weights increased, while those that are classified correctly will have their weights decreased. This forces the classifier to focus on examples that are difficult to classify in subsequent iterations.
The following table shows the examples chosen during each boosting round, when applied to the data shown in Table 4.4.
Boosting (Round 1): 7 3 2 8 7 9 4 10 6 3
Boosting (Round 2): 5 4 9 4 2 5 1 7 4 2
Boosting (Round 3): 4 4 8 10 4 5 4 6 3 4
Initially, all the examples are assigned the same weights. However, some examples may be chosen more than once, e.g., examples 3 and 7, because the sampling is done with replacement. A classifier built from the data is then used to classify all the examples. Suppose example 4 is difficult to classify. The weight for this example will be increased in future iterations as it gets misclassified repeatedly. Meanwhile, examples that were not chosen in the previous round, e.g., examples 1 and 5, also have a better chance of being selected in the next round since their predictions in the previous round were likely to be wrong. As the boosting rounds proceed, examples that are the hardest to classify tend to become even more prevalent. The final ensemble is obtained by aggregating the base classifiers obtained from each boosting round.
Over the years, several implementations of the boosting algorithm have been developed. These algorithms differ in terms of (1) how the weights of the training examples are updated at the end of each boosting round, and (2) how the predictions made by each classifier are combined. An implementation called AdaBoost is explored in the next section.
AdaBoost. Let {(x_j, y_j) | j = 1, 2, …, N} denote a set of N training examples. In the AdaBoost algorithm, the importance of a base classifier Ci depends on its error rate, which is defined as

εi = (1/N) [ Σ_{j=1}^{N} w_j I(Ci(x_j) ≠ y_j) ],   (4.102)

where I(p) = 1 if the predicate p is true, and 0 otherwise. The importance of a classifier Ci is given by the following parameter:

αi = (1/2) ln((1 − εi)/εi).

Note that αi has a large positive value if the error rate is close to 0 and a large negative value if the error rate is close to 1, as shown in Figure 4.48.

Figure 4.48. Plot of α as a function of training error ε.

The αi parameter is also used to update the weight of the training examples. To illustrate, let w_i^(j) denote the weight assigned to example (x_i, y_i) during the j-th boosting round. The weight update mechanism for AdaBoost is given by the equation:

w_i^(j+1) = (w_i^(j) / Z_j) × { e^(−αj) if Cj(x_i) = y_i;  e^(αj) if Cj(x_i) ≠ y_i },   (4.103)

where Z_j is the normalization factor used to ensure that Σ_i w_i^(j+1) = 1. The weight update formula given in Equation 4.103 increases the weights of incorrectly classified examples and decreases the weights of those classified correctly.

Instead of using a majority voting scheme, the prediction made by each classifier Cj is weighted according to αj. This approach allows AdaBoost to penalize models that have poor accuracy, e.g., those generated at the earlier boosting rounds. In addition, if any intermediate round produces an error rate higher than 50%, the weights are reverted back to their original uniform values, w_i = 1/N, and the resampling procedure is repeated. The AdaBoost algorithm is summarized in Algorithm 4.6.

Algorithm 4.6 AdaBoost algorithm.
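Algorithm 4.6 itself is not reproduced in this extraction; the sketch below shows one way to implement the updates of Equations 4.102–4.103 in Python. It is a hedged illustration only: it uses scikit-learn decision stumps, passes the weights to the learner through sample_weight (the second use of weights listed above) rather than resampling, and assumes labels in {−1, +1}.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, k):
    """AdaBoost with decision stumps; y must take values in {-1, +1}."""
    n = len(X)
    w = np.full(n, 1.0 / n)                       # initial weights w_i = 1/N
    classifiers, alphas = [], []
    for _ in range(k):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)          # weight-aware learner instead of resampling
        pred = stump.predict(X)
        err = np.sum(w * (pred != y))             # Equation 4.102 (weights already sum to 1)
        if err >= 0.5:                            # revert to uniform weights and try again
            w = np.full(n, 1.0 / n)
            continue
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))
        w = w * np.exp(-alpha * y * pred)         # Equation 4.103: up-weight the mistakes
        w = w / w.sum()                           # normalize by Z_j
        classifiers.append(stump)
        alphas.append(alpha)
    return classifiers, alphas

def adaboost_predict(classifiers, alphas, X):
    """Weighted vote: sign of the alpha-weighted sum of base predictions."""
    agg = sum(a * clf.predict(X) for a, clf in zip(alphas, classifiers))
    return np.where(agg >= 0, 1, -1)

X = np.arange(0.1, 1.05, 0.1).reshape(-1, 1)
y = np.array([1, 1, 1, -1, -1, -1, -1, 1, 1, 1])
clfs, alphas = adaboost_fit(X, y, k=3)
print(adaboost_predict(clfs, alphas, X))
```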
Let us examine how the boosting approach works on the data set shown in Table 4.4. Initially, all the examples have identical weights. After three boosting rounds, the examples chosen for training are shown in Figure 4.49(a). The weights for each example are updated at the end of each boosting round using Equation 4.103, as shown in Figure 4.50(b).
Without boosting, the accuracy of the decision stump is, at best, 70%. With AdaBoost, the results of the predictions are given in Figure 4.50(b). The final prediction of the ensemble classifier is obtained by taking a weighted average of the predictions made by each base classifier, which is shown in the last row of Figure 4.50(b). Notice that AdaBoost perfectly classifies all the examples in the training data.
Figure 4.49. Example of boosting.
An important analytical result of boosting shows that the training error of the ensemble is bounded by the following expression:

e_ensemble ≤ Π_i [ √(εi (1 − εi)) ],   (4.104)

where εi is the error rate of each base classifier i. If the error rate of the base classifier is less than 50%, we can write εi = 0.5 − γi, where γi measures how much better the classifier is than random guessing. The bound on the training error of the ensemble becomes

e_ensemble ≤ Π_i √(1 − 4γi²) ≤ exp(−2 Σ_i γi²).   (4.105)

Hence, the training error of the ensemble decreases exponentially, which leads to the fast convergence of the algorithm. By focusing on examples that are difficult to classify by base classifiers, it is able to reduce the bias of the final predictions along with the variance. AdaBoost has been shown to provide significant improvements in performance over base classifiers on a range of data sets. Nevertheless, because of its tendency to focus on training examples that are wrongly classified, the boosting technique can be susceptible to overfitting, resulting in poor generalization performance in some scenarios.

Figure 4.50. Example of combining classifiers constructed using the AdaBoost approach.
4.10.6 Random Forests
Random forests attempt to improve the generalization performance by constructing an ensemble of decorrelated decision trees. Random forests build on the idea of bagging to use a different bootstrap sample of the training data for learning decision trees. However, a key distinguishing feature of random forests from bagging is that at every internal node of a tree, the best splitting criterion is chosen among a small set of randomly selected attributes. In this way, random forests construct ensembles of decision trees by not only manipulating training instances (by using bootstrap samples similar to bagging), but also the input attributes (by using different subsets of attributes at every internal node).
Given a training set D consisting of n instances and d attributes, the basic procedure of training a random forest classifier can be summarized using the following steps:
1. Construct a bootstrap sample Di of the training set by randomly sampling n instances (with replacement) from D.
2. Use Di to learn a decision tree Ti as follows. At every internal node of Ti, randomly sample a set of p attributes and choose an attribute from this subset that shows the maximum reduction in an impurity measure for splitting. Repeat this procedure till every leaf is pure, i.e., contains instances from the same class.

Once an ensemble of decision trees has been constructed, their average prediction (majority vote) on a test instance is used as the final prediction of the random forest. Note that the decision trees involved in a random forest are unpruned trees, as they are allowed to grow to their largest possible size till every leaf is pure. Hence, the base classifiers of a random forest represent unstable classifiers that have low bias but high variance, because of their large size.
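A compact Python sketch of these two steps is given below. It assumes scikit-learn is available and that class labels are encoded as non-negative integers; setting max_features=p makes each tree consider a random subset of p attributes at every node, which approximates step 2, and leaving the depth unrestricted keeps the trees unpruned.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_forest_fit(X, y, k, p, random_state=0):
    """Train k unpruned trees; each tree sees a bootstrap sample and considers
    only p randomly chosen attributes at every internal node."""
    rng = np.random.RandomState(random_state)
    n = len(X)
    trees = []
    for i in range(k):
        idx = rng.randint(0, n, size=n)                 # step 1: bootstrap sample D_i
        tree = DecisionTreeClassifier(max_features=p,   # step 2: random attribute subset per node
                                      random_state=i)   # grown until leaves are pure (no pruning)
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def random_forest_predict(trees, X):
    """Majority vote over the ensemble of trees (labels assumed to be 0, 1, ...)."""
    all_preds = np.array([t.predict(X) for t in trees])   # shape (k, n_test)
    return np.array([np.bincount(col).argmax() for col in all_preds.T.astype(int)])
```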
Another property of the base classifiers learned in random forests is the lack of correlation among their model parameters and test predictions. This can be attributed to the use of an independently sampled data set Di for learning every decision tree Ti, similar to the bagging approach. However, random forests have the additional advantage of choosing a splitting criterion at every internal node using a different (and randomly selected) subset of attributes. This property significantly helps in breaking the correlation structure, if any, among the decision trees Ti.

To realize this, consider a training set involving a large number of attributes, where only a small subset of attributes are strong predictors of the target class, whereas other attributes are weak indicators. Given such a training set, even if we consider different bootstrap samples Di for learning Ti, we would mostly be choosing the same attributes for splitting at internal nodes, because the weak attributes would be largely overlooked when compared with the strong predictors. This can result in a considerable correlation among the trees. However, if we restrict the choice of attributes at every internal node to a random subset of attributes, we can ensure the selection of both strong and weak predictors, thus promoting diversity among the trees. This principle is utilized by random forests for creating decorrelated decision trees.

By aggregating the predictions of an ensemble of strong and decorrelated decision trees, random forests are able to reduce the variance of the trees without negatively impacting their low bias. This makes random forests quite robust to overfitting. Additionally, because of their ability to consider only a small subset of attributes at every internal node, random forests are computationally fast and robust even in high-dimensional settings.
The number of attributes to be selected at every node, p, is a hyper-parameter of the random forest classifier. A small value of p can reduce the correlation among the classifiers but may also reduce their strength. A large value can improve their strength but may result in correlated trees similar to bagging. Although common suggestions for p in the literature include √d and log₂ d + 1, a suitable value of p for a given training set can always be selected by tuning it over a validation set, as described in the previous chapter. However, there is an alternative way for selecting hyper-parameters in random forests, which does not require using a separate validation set. It involves computing a reliable estimate of the generalization error rate directly during training, known as the out-of-bag (oob) error estimate. The oob estimate can be computed for any generic ensemble learning method that builds independent base classifiers using bootstrap samples of the training set, e.g., bagging and random forests. The approach for computing the oob estimate can be described as follows.

Consider an ensemble learning method that uses an independent base classifier Ti built on a bootstrap sample Di of the training set. Since every training instance x will be used for training approximately 63% of the base classifiers, we can call x an out-of-bag sample for the remaining 37% of base classifiers that did not use it for training. If we use these remaining 37% of classifiers to make predictions on x, we can obtain the oob error on x by taking their majority vote and comparing it with its class label. Note that the oob error estimates the error of classifiers on an instance that was not used for training those classifiers. Hence, the oob error can be considered a reliable estimate of generalization error. By taking the average of the oob errors of all training instances, we can compute the overall oob error estimate. This can be used as an alternative to the validation error rate for selecting hyper-parameters. Hence, random forests do not need to use a separate partition of the training set for validation, as they can simultaneously train the base classifiers and compute generalization error estimates on the same data set.
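The oob computation can be sketched as follows in Python (an illustrative assumption, not the book's pseudocode): each base classifier records which instances its bootstrap sample missed, and every training instance is then scored only by those classifiers. Labels are assumed to be integers 0, …, C−1.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def oob_error(X, y, k=50, random_state=0):
    """Train a bagged ensemble and estimate its generalization error from
    out-of-bag predictions: each instance is judged only by the base
    classifiers whose bootstrap sample did not contain it."""
    rng = np.random.RandomState(random_state)
    n = len(X)
    votes = np.zeros((n, len(np.unique(y))))      # per-instance votes from oob classifiers
    for i in range(k):
        idx = rng.randint(0, n, size=n)           # bootstrap sample
        oob_mask = np.ones(n, dtype=bool)
        oob_mask[idx] = False                     # instances NOT drawn into this sample
        tree = DecisionTreeClassifier(random_state=i).fit(X[idx], y[idx])
        if oob_mask.any():
            preds = tree.predict(X[oob_mask])
            votes[np.where(oob_mask)[0], preds] += 1
    covered = votes.sum(axis=1) > 0               # instances that were oob at least once
    oob_pred = votes[covered].argmax(axis=1)      # majority vote of the oob classifiers
    return np.mean(oob_pred != y[covered])        # overall oob error estimate
```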
Random forests have been empirically found to provide significant improvements in generalization performance that are often comparable, if not superior, to the improvements provided by the AdaBoost algorithm. Random forests are also more robust to overfitting and run much faster than the AdaBoost algorithm.
4.10.7 Empirical Comparison among Ensemble Methods
Table 4.5 shows the empirical results obtained when comparing the performance of a decision tree classifier against bagging, boosting, and random forest. The base classifiers used in each ensemble method consist of 50 decision trees. The classification accuracies reported in this table are obtained from tenfold cross-validation. Notice that the ensemble classifiers generally outperform a single decision tree classifier on many of the data sets.
Table 4.5. Comparing the accuracy of a decision tree classifier against three ensemble methods.

Data Set | Number of (Attributes, Classes, Instances) | Decision Tree (%) | Bagging (%) | Boosting (%) | RF (%)
Anneal (39,6,898) 92.09 94.43 95.43 95.43
Australia (15,2,690) 85.51 87.10 85.22 85.80
Auto (26,7,205) 81.95 85.37 85.37 84.39
Breast (11,2,699) 95.14 96.42 97.28 96.14
Cleve (14,2,303) 76.24 81.52 82.18 82.18
Credit (16,2,690) 85.8 86.23 86.09 85.8
Diabetes (9,2,768) 72.40 76.30 73.18 75.13
German (21,2,1000) 70.90 73.40 73.00 74.5
Glass (10,7,214) 67.29 76.17 77.57 78.04
Heart (14,2,270) 80.00 81.48 80.74 83.33
Hepatitis (20,2,155) 81.94 81.29 83.87 83.23
Horse (23,2,368) 85.33 85.87 81.25 85.33
Ionosphere (35,2,351) 89.17 92.02 93.73 93.45
Iris (5,3,150) 94.67 94.67 94.00 93.33
Labor (17,2,57) 78.95 84.21 89.47 84.21
Led7 (8,10,3200) 73.34 73.66 73.34 73.06
Lymphography (19,4,148) 77.03 79.05 85.14 82.43
Pima (9,2,768) 74.35 76.69 73.44 77.60
Sonar (61,2,208) 78.85 78.85 84.62 85.58
Tic-tac-toe (10,2,958) 83.72 93.84 98.54 95.82
Vehicle (19,4,846) 71.04 74.11 78.25 74.94
Waveform (22,3,5000) 76.44 83.30 83.90 84.04
Wine (14,3,178) 94.38 96.07 97.75 97.75
Zoo (17,7,101) 93.07 93.07 95.05 97.03
4.11 Class Imbalance Problem

In many data sets there are a disproportionate number of instances that belong to different classes, a property known as skew or class imbalance. For example, consider a health-care application where diagnostic reports are used to decide whether a person has a rare disease. Because of the infrequent nature of the disease, we can expect to observe a smaller number of subjects who are positively diagnosed. Similarly, in credit card fraud detection, fraudulent transactions are greatly outnumbered by legitimate transactions.
The degree of imbalance between the classes varies across different applications and even across different data sets from the same application. For example, the risk for a rare disease may vary across different populations of subjects depending on their dietary and lifestyle choices. However, despite their infrequent occurrences, a correct classification of the rare class often has greater value than a correct classification of the majority class. For example, it may be more dangerous to ignore a patient suffering from a disease than to misdiagnose a healthy person.
More generally, class imbalance poses two challenges for classification. First, it can be difficult to find sufficiently many labeled samples of a rare class. Note that many of the classification methods discussed so far work well only when the training set has a balanced representation of both classes. Although some classifiers are more effective at handling imbalance in the training data than others, e.g., rule-based classifiers and k-NN, they are all impacted if the minority class is not well-represented in the training set. In general, a classifier trained over an imbalanced data set shows a bias toward improving its performance over the majority class, which is often not the desired behavior.
As a result, many existing classification models, when trained on an imbalanced data set, may not effectively detect instances of the rare class.
Second, accuracy, which is the traditional measure for evaluating classification performance, is not well-suited for evaluating models in the presence of class imbalance in the test data. For example, if 1% of the credit card transactions are fraudulent, then a trivial model that predicts every transaction as legitimate will have an accuracy of 99% even though it fails to detect any of the fraudulent activities. Thus, there is a need to use alternative evaluation metrics that are sensitive to the skew and can capture different criteria of performance than accuracy.
In this section, we first present some of the generic methods for building classifiers when there is class imbalance in the training set. We then discuss methods for evaluating classification performance and adapting classification decisions in the presence of a skewed test set. In the remainder of this section, we will consider binary classification problems for simplicity, where the minority class is referred to as the positive (+) class while the majority class is referred to as the negative (−) class.
4.11.1 Building Classifiers with Class Imbalance
There are two primary considerations for building classifiers in the presence of class imbalance in the training set. First, we need to ensure that the learning algorithm is trained over a data set that has adequate representation of both the majority as well as the minority classes. Some common approaches for ensuring this include the methodologies of oversampling and undersampling the training set. Second, having learned a classification model, we need a way to adapt its classification decisions (and thus create an appropriately tuned classifier) to best match the requirements of the imbalanced test set. This is typically done by converting the outputs of the classification model to real-valued scores, and then selecting a suitable threshold on the classification score to match the needs of a test set. Both these considerations are discussed in detail in the following.
Oversampling and Undersampling. The first step in learning with imbalanced data is to transform the training set to a balanced training set, where both classes have nearly equal representation. The balanced training set can then be used with any of the existing classification techniques (without making any modifications in the learning algorithm) to learn a model that gives equal emphasis to both classes. In the following, we present some of the common techniques for transforming an imbalanced training set to a balanced one.
A basic approach for creating balanced training sets is to generate a sample of training instances where the rare class has adequate representation. There are two types of sampling methods that can be used to enhance the representation of the minority class: (a) undersampling, where the frequency of the majority class is reduced to match the frequency of the minority class, and (b) oversampling, where artificial examples of the minority class are created to make them equal in proportion to the number of negative instances.
To illustrate undersampling, consider a training set that contains 100 positive examples and 1000 negative examples. To overcome the skew among the classes, we can select a random sample of 100 examples from the negative class and use them with the 100 positive examples to create a balanced training set. A classifier built over the resultant balanced set will then be unbiased toward both classes. However, one limitation of undersampling is that some of the useful negative examples (e.g., those closer to the actual decision boundary) may not be chosen for training, therefore resulting in an inferior classification model. Another limitation is that the smaller sample of 100 negative instances may have a higher variance than the larger set of 1000.
Oversampling attempts to create a balanced training set by artificially generating new positive examples. A simple approach for oversampling is to duplicate every positive instance n−/n+ times, where n+ and n− are the numbers of positive and negative training instances, respectively. Figure 4.51 illustrates the effect of oversampling on the learning of a decision boundary using a classifier such as a decision tree. Without oversampling, only the positive examples at the bottom right-hand side of Figure 4.51(a) are classified correctly. The positive example in the middle of the diagram is misclassified because there are not enough examples to justify the creation of a new decision boundary to separate the positive and negative instances. Oversampling provides the additional examples needed to ensure that the decision boundary surrounding the positive example is not pruned, as illustrated in Figure 4.51(b). Note that duplicating a positive instance is analogous to doubling its weight during the training stage. Hence, the effect of oversampling can be alternatively achieved by assigning higher weights to positive instances than negative instances. This method of weighting instances can be used with a number of classifiers such as logistic regression, ANN, and SVM.
Figure 4.51. Illustrating the effect of oversampling of the rare class.
One limitation of the duplication method for oversampling is that the replicated positive examples have an artificially lower variance when compared with their true distribution in the overall data. This can bias the classifier to the specific distribution of training instances, which may not be representative of the overall distribution of test instances, leading to poor generalizability. To overcome this limitation, an alternative approach for oversampling is to generate synthetic positive instances in the neighborhood of existing positive instances. In this approach, called the Synthetic Minority Oversampling Technique (SMOTE), we first determine the k-nearest positive neighbors of every positive instance x, and then generate a synthetic positive instance at some intermediate point along the line segment joining x to one of its randomly chosen k-nearest neighbors, xk. This process is repeated until the desired number of positive instances is reached. However, one limitation of this approach is that it can only generate new positive instances in the convex hull of the existing positive class. Hence, it does not help improve the representation of the positive class outside the boundary of existing positive instances. Despite their complementary strengths and weaknesses, undersampling and oversampling provide useful directions for generating balanced training sets in the presence of class imbalance.
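The sketch below illustrates, under simplifying assumptions, the three sampling strategies just described: random undersampling of the majority class, oversampling by duplication, and SMOTE-style interpolation between a positive instance and one of its k nearest positive neighbors. It is a toy illustration on synthetic data, not the reference SMOTE implementation.

```python
import numpy as np

def undersample(X_pos, X_neg, rng):
    """Randomly drop negatives until both classes have equal size."""
    idx = rng.choice(len(X_neg), size=len(X_pos), replace=False)
    return X_pos, X_neg[idx]

def oversample_duplicate(X_pos, X_neg, rng):
    """Replicate positives (with replacement) until both classes have equal size."""
    idx = rng.choice(len(X_pos), size=len(X_neg), replace=True)
    return X_pos[idx], X_neg

def oversample_smote(X_pos, n_new, k, rng):
    """Generate n_new synthetic positives on segments joining a chosen positive x
    to one of its k nearest positive neighbors x_k."""
    synthetic = []
    for _ in range(n_new):
        i = rng.randint(len(X_pos))
        x = X_pos[i]
        dists = np.linalg.norm(X_pos - x, axis=1)
        neighbors = np.argsort(dists)[1:k + 1]          # skip x itself
        x_k = X_pos[rng.choice(neighbors)]
        synthetic.append(x + rng.rand() * (x_k - x))    # random point on the segment
    return np.vstack([X_pos, np.array(synthetic)])

rng = np.random.RandomState(0)
X_pos = rng.randn(100, 2) + 2        # 100 positive instances (toy data)
X_neg = rng.randn(1000, 2)           # 1000 negative instances
print(oversample_smote(X_pos, n_new=900, k=5, rng=rng).shape)   # -> (1000, 2)
```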
Assigning Scores to Test Instances. If a classifier returns an ordinal score s(x) for every test instance x such that a higher score denotes a greater likelihood of x belonging to the positive class, then for every possible value of score threshold, sT, we can create a new binary classifier where a test instance x is classified positive only if s(x) > sT. Thus, every choice of sT can potentially lead to a different classifier, and we are interested in finding the classifier that is best suited for our needs.

Ideally, we would like the classification score to vary monotonically with the actual posterior probability of the positive class, i.e., if s(x1) and s(x2) are the scores of any two instances, x1 and x2, then s(x1) ≥ s(x2) ⇒ P(y = 1|x1) ≥ P(y = 1|x2). However, this is difficult to guarantee in practice as the properties of the classification score depend on several factors such as the complexity of the classification algorithm and the representative power of the training set. In general, we can only expect the classification score of a reasonable algorithm to be weakly related to the actual posterior probability of the positive class, even though the relationship may not be strictly monotonic. Most classifiers can be easily modified to produce such a real-valued score. For example, the signed distance of an instance from the positive margin hyperplane of an SVM can be used as a classification score. As another example, test instances belonging to a leaf in a decision tree can be assigned a score based on the fraction of training instances labeled as positive in the leaf. Also, probabilistic classifiers such as naïve Bayes, Bayesian networks, and logistic regression naturally output estimates of posterior probabilities, P(y = 1|x). Next, we discuss some
evaluation measures for assessing the goodness of a classifier in the presence of class imbalance.
Table 4.6. A confusion matrix for a binary classification problem in which the classes are not equally important.

         | Predicted + | Predicted −
Actual + | f++ (TP)    | f+− (FN)
Actual − | f−+ (FP)    | f−− (TN)

4.11.2 Evaluating Performance with Class Imbalance

The most basic approach for representing a classifier's performance on a test set is to use a confusion matrix, as shown in Table 4.6. This table is essentially the same as Table 3.4, which was introduced in the context of evaluating classification performance in Section 3.2. A confusion matrix summarizes the number of instances predicted correctly or incorrectly by a classifier using the following four counts:

True positive (TP) or f++, which corresponds to the number of positive examples correctly predicted by the classifier.
False positive (FP) or f−+ (also known as Type I error), which corresponds to the number of negative examples wrongly predicted as positive by the classifier.
False negative (FN) or f+− (also known as Type II error), which corresponds to the number of positive examples wrongly predicted as negative by the classifier.
True negative (TN) or f−−, which corresponds to the number of negative examples correctly predicted by the classifier.
The confusion matrix provides a concise representation of classification performance on a given test data set. However, it is often difficult to interpret and compare the performance of classifiers using the four-dimensional representations (corresponding to the four counts) provided by their confusion matrices. Hence, the counts in the confusion matrix are often summarized using a number of evaluation measures. Accuracy is an example of one such measure that combines these four counts into a single value, which is used extensively when classes are balanced. However, the accuracy measure is not suitable for handling data sets with imbalanced class distributions as it tends to favor classifiers that correctly classify the majority class. In the following, we describe other possible measures that capture different criteria of performance when working with imbalanced classes.
A basic evaluation measure is the true positive rate (TPR), which is defined as the fraction of positive test instances correctly predicted by the classifier:

TPR = TP/(TP + FN).

In the medical community, TPR is also known as sensitivity, while in the information retrieval literature, it is also called recall (r). A classifier with a high TPR has a high chance of correctly identifying the positive instances of the data.

Analogously to TPR, the true negative rate (TNR) (also known as specificity) is defined as the fraction of negative test instances correctly
predicted by the classifier, i.e.,

TNR = TN/(FP + TN).

A high TNR value signifies that the classifier correctly classifies any randomly chosen negative instance in the test set. A commonly used evaluation measure that is closely related to TNR is the false positive rate (FPR), which is defined as 1 − TNR:

FPR = FP/(FP + TN).

Similarly, we can define the false negative rate (FNR) as 1 − TPR:

FNR = FN/(FN + TP).

Note that the evaluation measures defined above do not take into account the skew among the classes, which can be formally defined as α = P/(P + N), where P and N denote the number of actual positives and actual negatives, respectively. As a result, changing the relative numbers of P and N will have no effect on TPR, TNR, FPR, or FNR, since they depend only on the fraction of correct classifications for every class, independently of the other class. Furthermore, knowing the values of TPR and TNR (and consequently FNR and FPR) does not by itself help us uniquely determine all four entries of the confusion matrix. However, together with information about the skew factor, α, and the total number of instances, N, we can compute the entire confusion matrix using TPR and TNR, as shown in Table 4.7.

Table 4.7. Entries of the confusion matrix in terms of the TPR, TNR, skew, α, and total number of instances, N.

         | Predicted +             | Predicted −         | Total
Actual + | TPR × α × N             | (1 − TPR) × α × N   | α × N
Actual − | (1 − TNR) × (1 − α) × N | TNR × (1 − α) × N   | (1 − α) × N
         |                         |                     | N

An evaluation measure that is sensitive to the skew is precision, which can be defined as the fraction of correct predictions of the positive class over the total number of positive predictions, i.e.,

Precision, p = TP/(TP + FP).

Precision is also referred to as the positive predicted value (PPV). A classifier that has a high precision is likely to have most of its positive predictions correct. Precision is a useful measure for highly skewed test sets where the positive predictions, even though small in numbers, are required to be mostly correct. A measure that is closely related to precision is the false discovery rate (FDR), which can be defined as 1 − p:

FDR = FP/(TP + FP).

Although both FDR and FPR focus on FP, they are designed to capture different evaluation objectives and thus can take quite contrasting values, especially in the presence of class imbalance. To illustrate this, consider a classifier with the following confusion matrix.

         | Predicted + | Predicted −
Actual + | 100         | 0
Actual − | 100         | 900

Since half of the positive predictions made by the classifier are incorrect, it has an FDR value of 100/(100 + 100) = 0.5. However, its FPR is equal to 100/(100 + 900) = 0.1, which is quite low. This example shows that in the presence of high skew (i.e., a very small value of α), even a small FPR can result in a high FDR. See Section 10.6 for further discussion of this issue.

Note that the evaluation measures defined above provide an incomplete representation of performance, because they either only capture the effect of false positives (e.g., FPR and precision) or the effect of false negatives (e.g., TPR or recall), but not both. Hence, if we optimize only one of these evaluation measures, we may end up with a classifier that shows low FN but high FP, or vice-versa. For example, a classifier that declares every instance to be positive will have a perfect recall, but high FPR and very poor precision. On the other hand, a classifier that is very conservative in classifying an instance as positive (to reduce FP) may end up having high precision but very poor recall. We thus need evaluation measures that account for both types of misclassifications, FP and FN. Some examples of such evaluation measures are summarized by the following definitions.

Positive Likelihood Ratio = TPR/FPR.
F1 measure = 2rp/(r + p) = 2 × TP/(2 × TP + FP + FN).
G measure = √(rp) = TP/√((TP + FP)(TP + FN)).

While some of these evaluation measures are invariant to the skew (e.g., the positive likelihood ratio), others (e.g., precision and the F1 measure) are sensitive to skew. Further, different evaluation measures capture the effects of different types of misclassification errors in various ways. For example, the F1 measure represents a harmonic mean between recall and precision, i.e.,

F1 = 2/(1/r + 1/p).
Because the harmonic mean of two numbers tends to be closer to the smaller of the two numbers, a high value of the F1-measure ensures that both precision and recall are reasonably high. Similarly, the G measure represents the geometric mean between recall and precision. A comparison among harmonic, geometric, and arithmetic means is given in the next example.
Example 4.9. Consider two positive numbers a = 1 and b = 5. Their arithmetic mean is μa = (a + b)/2 = 3 and their geometric mean is μg = √(ab) = 2.236. Their harmonic mean is μh = (2 × 1 × 5)/6 = 1.667, which is closer to the smaller value between a and b than the arithmetic and geometric means.

A generic extension of the F1 measure is the Fβ measure, which can be defined as follows:

Fβ = (β² + 1)rp/(β²p + r) = (β² + 1) × TP/((β² + 1)TP + β²FN + FP).   (4.106)

Both precision and recall can be viewed as special cases of Fβ by setting β = 0 and β = ∞, respectively. Low values of β make Fβ closer to precision, and high values make it closer to recall.

A more general measure that captures Fβ as well as accuracy is the weighted accuracy measure, which is defined by the following equation:

Weighted accuracy = (w1 TP + w4 TN)/(w1 TP + w2 FP + w3 FN + w4 TN).   (4.107)

The relationship between weighted accuracy and other performance measures is summarized in the following table:

Measure   | w1     | w2 | w3 | w4
Recall    | 1      | 0  | 1  | 0
Precision | 1      | 1  | 0  | 0
Fβ        | β² + 1 | 1  | β² | 0
Accuracy  | 1      | 1  | 1  | 1
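A small Python helper, given below as an illustrative sketch, computes the measures defined in this section directly from the four confusion-matrix counts; the numeric example reuses the skewed confusion matrix discussed above (TP = 100, FN = 0, FP = 100, TN = 900).

```python
def rates(TP, FP, FN, TN):
    """Evaluation measures from this section, computed from the confusion matrix."""
    recall    = TP / (TP + FN)                    # TPR, sensitivity
    precision = TP / (TP + FP)                    # PPV
    fpr       = FP / (FP + TN)
    fdr       = FP / (TP + FP)
    f1        = 2 * TP / (2 * TP + FP + FN)       # harmonic mean of recall and precision
    return recall, precision, fpr, fdr, f1

def f_beta(TP, FP, FN, beta):
    """F_beta: beta -> 0 approaches precision, beta -> infinity approaches recall."""
    b2 = beta ** 2
    return (b2 + 1) * TP / ((b2 + 1) * TP + b2 * FN + FP)

# The skewed example from the text: FDR is 0.5 even though FPR is only 0.1.
print(rates(TP=100, FP=100, FN=0, TN=900))
print(f_beta(TP=100, FP=100, FN=0, beta=1))       # equals the F1 value above
```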
4.11.3 Finding an Optimal Score Threshold
Given a suitably chosen evaluation measure E and a distribution of classification scores, s(x), on a validation set, we can obtain the optimal score threshold s* on the validation set using the following approach:

1. Sort the scores in increasing order of their values.
2. For every unique value of score, s, consider the classification model that assigns an instance x as positive only if s(x) > s. Let E(s) denote the performance of this model on the validation set.
3. Find s* that maximizes the evaluation measure E(s):

s* = argmax_s E(s).

Note that s* can be treated as a hyper-parameter of the classification algorithm that is learned during model selection. Using s*, we can assign a positive label to a future test instance x only if s(x) > s*. If the evaluation measure E is skew invariant (e.g., the positive likelihood ratio), then we can select s* without knowing the skew of the test set, and the resultant classifier formed using s* can be expected to show optimal performance on the test set (with respect to the evaluation measure E). On the other hand, if E is sensitive to the skew (e.g., precision or the F1-measure), then we need to ensure that the skew of the validation set used for selecting s* is similar to that of the test set, so that the classifier formed using s* shows optimal test performance with respect to E. Alternatively, given an estimate of the skew of the test data, α, we can use it along with the TPR and TNR on the validation set to estimate all entries of the confusion matrix (see Table 4.7), and thus the estimate of any evaluation measure E on the test set. The score threshold s* selected using this estimate of E can then be expected to produce optimal test performance with respect to E. Furthermore, the methodology of selecting s* on the validation set can help in comparing the test performance of different classification algorithms, by using the optimal values of s* for each algorithm.
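The following Python sketch implements this threshold search for an arbitrary evaluation measure E supplied as a function of the confusion-matrix counts; the validation scores and labels in the usage example are made up for illustration.

```python
import numpy as np

def best_threshold(scores, labels, measure):
    """Scan every unique validation score s and keep the threshold s* that
    maximizes measure(TP, FP, FN, TN) for the rule 'positive iff score > s'."""
    best_s, best_val = None, -np.inf
    for s in np.unique(scores):
        pred_pos = scores > s
        TP = np.sum(pred_pos & (labels == 1))
        FP = np.sum(pred_pos & (labels == 0))
        FN = np.sum(~pred_pos & (labels == 1))
        TN = np.sum(~pred_pos & (labels == 0))
        val = measure(TP, FP, FN, TN)
        if val > best_val:
            best_s, best_val = s, val
    return best_s, best_val

# Example: maximize the F1 measure on a tiny, hypothetical validation set.
f1 = lambda TP, FP, FN, TN: 2 * TP / (2 * TP + FP + FN) if TP else 0.0
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.54, 0.53, 0.51, 0.4, 0.3])
labels = np.array([1,   1,   0,   1,   1,    1,    0,    0,    1,   0])
print(best_threshold(scores, labels, f1))
```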
4.11.4 Aggregate Evaluation of Performance
Although the above approach helps in finding a score threshold s* that provides optimal performance with respect to a desired evaluation measure and a certain amount of skew, α, sometimes we are interested in evaluating the performance of a classifier on a number of possible score thresholds, each corresponding to a different choice of evaluation measure and skew value. Assessing the performance of a classifier over a range of score thresholds is called aggregate evaluation of performance. In this style of analysis, the emphasis is not on evaluating the performance of a single classifier corresponding to the optimal score threshold, but to assess the general quality of ranking produced by the classification scores on the test set. In general, this helps in obtaining robust estimates of classification performance that are not sensitive to specific choices of score thresholds.
ROC Curve. One of the widely-used tools for aggregate evaluation is the receiver operating characteristic (ROC) curve. An ROC curve is a graphical approach for displaying the trade-off between TPR and FPR of a classifier, over varying score thresholds. In an ROC curve, the TPR is plotted along the y-axis and the FPR is shown on the x-axis. Each point along the curve corresponds to a classification model generated by placing a threshold on the test scores produced by the classifier. The following procedure describes the generic approach for computing an ROC curve:
1. Sort the test instances in increasing order of their scores.
2. Select the lowest ranked test instance (i.e., the instance with lowest score). Assign the selected instance and those ranked above it to the positive class. This approach is equivalent to classifying all the test instances as the positive class. Because all the positive examples are classified correctly and the negative examples are misclassified, TPR = FPR = 1.
3. Select the next test instance from the sorted list. Classify the selected instance and those ranked above it as positive, while those ranked below it as negative. Update the counts of TP and FP by examining the actual class label of the selected instance. If this instance belongs to the positive class, the TP count is decremented and the FP count remains the same as before. If the instance belongs to the negative class, the FP count is decremented and the TP count remains the same as before.
4. Repeat Step 3 and update the TP and FP counts accordingly until the highest ranked test instance is selected. At this final threshold, TPR = FPR = 0, as all instances are labeled as negative.
5. Plot the TPR against FPR of the classifier.
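The same sweep can be written compactly in Python; the sketch below records one (FPR, TPR) point per step and estimates the AUC with the trapezoidal rule. The ten scores and labels are hypothetical and merely stand in for the kind of test set used in Example 4.10.

```python
import numpy as np

def roc_points(scores, labels):
    """Follow the procedure above: sweep the threshold from 'everything positive'
    to 'everything negative' and record (FPR, TPR) at each step."""
    order = np.argsort(scores)                  # increasing order of score
    labels = np.asarray(labels)[order]
    P, N = np.sum(labels == 1), np.sum(labels == 0)
    TP, FP = P, N                               # lowest threshold: all predicted positive
    points = [(FP / N, TP / P)]                 # (FPR, TPR) = (1, 1)
    for lab in labels:                          # move the threshold past one instance at a time
        if lab == 1:
            TP -= 1                             # a positive is now predicted negative
        else:
            FP -= 1                             # a negative is now predicted negative
        points.append((FP / N, TP / P))
    return points                               # ends at (FPR, TPR) = (0, 0)

def auc(points):
    """Area under the ROC curve via the trapezoidal rule."""
    pts = sorted(points)
    return sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

# Five positive and five negative test instances with hypothetical scores.
scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]
labels = [1,    1,    0,    1,    1,    0,    0,    1,    0,    0]
print(round(auc(roc_points(scores, labels)), 3))
```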
Example 4.10. [Generating ROC Curve] Figure 4.52 shows an example of how to compute the TPR and FPR values for every choice of score threshold. There are five positive examples and five negative examples in the test set. The class labels of the test instances are shown in the first row of the table, while the second row corresponds to the sorted score values for each instance. The next six rows contain the counts of TP, FP, TN, and FN, along with their corresponding TPR and FPR. The table is then filled from left to right. Initially, all the instances are predicted to be positive. Thus, TP = FP = 5 and TPR = FPR = 1. Next, we assign the test instance with the lowest score as the negative class. Because the selected instance is actually a positive example, the TP count decreases from 5 to 4 and the FP count is the same as before. The FPR and TPR are updated accordingly. This process is repeated until we reach the end of the list, where TPR = 0 and FPR = 0. The ROC curve for this example is shown in Figure 4.53.

Figure 4.52. Computing the TPR and FPR at every score threshold.
Figure 4.53. ROC curve for the data shown in Figure 4.52.
Note that in an ROC curve, the TPR monotonically increases with FPR, because the inclusion of a test instance in the set of predicted positives can either increase the TPR or the FPR. The ROC curve thus has a staircase pattern. Furthermore, there are several critical points along an ROC curve that have well-known interpretations:

(TPR = 0, FPR = 0): the model predicts every instance to be a negative class.
(TPR = 1, FPR = 1): the model predicts every instance to be a positive class.
(TPR = 1, FPR = 0): the perfect model with zero misclassifications.

A good classification model should be located as close as possible to the upper left corner of the diagram, while a model that makes random guesses should reside along the main diagonal, connecting the points (TPR = 0, FPR = 0) and (TPR = 1, FPR = 1). Random guessing means that an instance is classified as a positive class with a fixed probability p, irrespective of its attribute set.
For example, consider a data set that contains n+ positive instances and n− negative instances. The random classifier is expected to correctly classify pn+ of the positive instances and to misclassify pn− of the negative instances. Therefore, the TPR of the classifier is (pn+)/n+ = p, while its FPR is (pn−)/n− = p. Hence, this random classifier will reside at the point (p, p) in the ROC curve along the main diagonal.

Figure 4.54. ROC curves for two different classifiers.

Since every point on the ROC curve represents the performance of a classifier generated using a particular score threshold, they can be viewed as different operating points of the classifier. One may choose one of these operating points depending on the requirements of the application. Hence, an ROC curve facilitates the comparison of classifiers over a range of operating points. For example, Figure 4.54 compares the ROC curves of two classifiers, M1 and M2, generated by varying the score thresholds. We can see that M1 is better than M2 when FPR is less than 0.36, as M1 shows better TPR than M2 for this range of operating points. On the other hand, M2 is superior when FPR is greater than 0.36, since the TPR of M2 is higher than that of M1 for this range. Clearly, neither of the two classifiers dominates (is strictly better than) the other, i.e., shows higher values of TPR and lower values of FPR over all operating points.
To summarize the aggregate behavior across all operating points, one of the commonly used measures is the area under the ROC curve (AUC). If the classifier is perfect, then its area under the ROC curve will be equal to 1. If the algorithm simply performs random guessing, then its area under the ROC curve will be equal to 0.5.
Although the AUC provides a useful summary of aggregate performance, there are certain caveats in using the AUC for comparing classifiers. First, even if the AUC of algorithm A is higher than the AUC of another algorithm B, this does not mean that algorithm A is always better than B, i.e., that the ROC curve of A dominates that of B across all operating points. For example, even though M1 shows a slightly lower AUC than M2 in Figure 4.54, we can see that both M1 and M2 are useful over different ranges of operating points and neither of them is strictly better than the other across all possible operating points. Hence, we cannot use the AUC to determine which algorithm is better, unless we know that the ROC curve of one of the algorithms dominates the other.

Second, although the AUC summarizes the aggregate performance over all operating points, we are often interested in only a small range of operating points in most applications. For example, even though M1 shows slightly lower AUC than M2, it shows higher TPR values than M2 for small FPR values (smaller than 0.36). In the presence of class imbalance, the behavior of an algorithm over small FPR values (also termed early retrieval) is often more meaningful for comparison than the performance over all FPR values. This is because, in many applications, it is important to assess the TPR achieved by a classifier in the first few instances with highest scores, without incurring a large FPR. Hence, in Figure 4.54, due to the high TPR values of M1 during early retrieval (FPR < 0.36), we may prefer M1 over M2 for imbalanced test sets, despite the lower AUC of M1. Hence, care must be taken while comparing the AUC values of different classifiers, usually by visualizing their ROC curves rather than just reporting their AUC.

A key characteristic of ROC curves is that they are agnostic to the skew in the test set, because both the evaluation measures used in constructing ROC curves (TPR and FPR) are invariant to class imbalance. Hence, ROC curves are not suitable for measuring the impact of skew on classification performance. In particular, we will obtain the same ROC curve for two test data sets that have very different skew.
Figure 4.55. PR curves for two different classifiers.
Precision-Recall Curve. An alternate tool for aggregate evaluation is the precision-recall curve (PR curve). The PR curve plots the precision and recall values of a classifier on the y and x axes respectively, by varying the threshold on the test scores. Figure 4.55 shows an example of PR curves for two hypothetical classifiers, M1 and M2. The approach for generating a PR curve is similar to the approach described above for generating an ROC curve. However, there are some key distinguishing features in the PR curve:

1. PR curves are sensitive to the skew factor α = P/(P + N), and different PR curves are generated for different values of α.
2. When the score threshold is lowest (every instance is labeled as positive), the precision is equal to α while the recall is 1. As we increase the score threshold, the number of predicted positives can stay the same or decrease. Hence, the recall monotonically declines as the score threshold increases. In general, the precision may increase or decrease for the same value of recall, upon addition of an instance into the set of predicted positives. For example, if the k-th ranked instance belongs to the negative class, then including it will result in a drop in the precision without affecting the recall. The precision may improve at the next step, which adds the (k+1)-th ranked instance, if this instance belongs to the positive class. Hence, the PR curve is not a smooth, monotonically increasing curve like the ROC curve, and generally has a zigzag pattern. This pattern is more prominent in the left part of the curve, where even a small change in the number of false positives can cause a large change in precision.
3. As we increase the imbalance among the classes (reduce the value of α), the rightmost points of all PR curves will move downwards. At and near the leftmost point on the PR curve (corresponding to larger values of score threshold), the recall is close to zero, while the precision is equal to the fraction of positives in the top ranked instances of the algorithm. Hence, different classifiers can have different values of precision at the leftmost points of the PR curve. Also, if the classification score of an algorithm monotonically varies with the posterior probability of the positive class, we can expect the PR curve to gradually decrease from a high value of precision on the leftmost point to a constant value of α at the rightmost point, albeit with some ups and downs. This can be observed in the PR curve of algorithm M1 in Figure 4.55, which starts from a higher value of precision on the left that gradually decreases as we move towards the right. On the other hand, the PR curve of algorithm M2 starts from a lower value of precision on the left and shows more drastic ups and downs as we
move right, suggesting that the classification score of M2 shows a weaker monotonic relationship with the posterior probability of the positive class.
4. A random classifier that assigns an instance to be positive with a fixed probability p has a precision of α and a recall of p. Hence, a classifier that performs random guessing has a horizontal PR curve with y = α, as shown using a dashed line in Figure 4.55. Note that the random baseline in PR curves depends on the skew in the test set, in contrast to the fixed main diagonal of ROC curves that represents random classifiers.
5. Note that the precision of an algorithm is impacted more strongly by false positives in the top ranked test instances than the FPR of the algorithm. For this reason, the PR curve generally helps to magnify the differences between classifiers in the left portion of the PR curve. Hence, in the presence of class imbalance in the test data, analyzing the PR curves generally provides a deeper insight into the performance of classifiers than the ROC curves, especially in the early retrieval range of operating points.
6. The classifier corresponding to (precision = 1, recall = 1) represents the perfect classifier. Similar to AUC, we can also compute the area under the PR curve of an algorithm, known as AUC-PR. The AUC-PR of a random classifier is equal to α, while that of a perfect algorithm is equal to 1. Note that AUC-PR varies with changing skew in the test set, in contrast to the area under the ROC curve, which is insensitive to the skew. The AUC-PR helps in accentuating the differences between classification algorithms in the early retrieval range of operating points. Hence, it is more suited for evaluating classification performance in the presence of class imbalance than the area under the ROC curve. However, similar to ROC curves, a higher value of AUC-PR does not guarantee the superiority of a classification algorithm over another. This is because the PR curves of two algorithms can easily cross each
other, such that they both show better performances in different ranges of operating points. Hence, it is important to visualize the PR curves before comparing their AUC-PR values, in order to ensure a meaningful evaluation.
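As with the ROC curve, the PR points and an AUC-PR estimate can be computed with a short sweep over the ranked test scores. The Python sketch below is illustrative only (trapezoidal integration over recall is one of several conventions for approximating AUC-PR), and the scores and labels are hypothetical.

```python
import numpy as np

def pr_points(scores, labels):
    """Precision and recall at every threshold, from one predicted positive
    (highest score) down to 'everything positive' (recall 1, precision = alpha)."""
    order = np.argsort(scores)[::-1]            # decreasing order of score
    labels = np.asarray(labels)[order]
    P = np.sum(labels == 1)
    TP = FP = 0
    points = []
    for lab in labels:                          # add one predicted positive at a time
        TP += lab == 1
        FP += lab == 0
        points.append((TP / P, TP / (TP + FP)))  # (recall, precision)
    return points

def auc_pr(points):
    """Area under the PR curve (trapezoidal approximation over recall)."""
    pts = sorted(points)
    return sum((r2 - r1) * (p1 + p2) / 2 for (r1, p1), (r2, p2) in zip(pts, pts[1:]))

scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]
labels = [1,    1,    0,    1,    1,    0,    0,    1,    0,    0]
print(round(auc_pr(pr_points(scores, labels)), 3))
```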
4.12 Multiclass Problem

Some of the classification techniques described in this chapter are originally designed for binary classification problems. Yet there are many real-world problems, such as character recognition, face identification, and text classification, where the input data is divided into more than two categories. This section presents several approaches for extending the binary classifiers to handle multiclass problems. To illustrate these approaches, let Y = {y1, y2, …, yK} be the set of classes of the input data.

The first approach decomposes the multiclass problem into K binary problems. For each class yi ∈ Y, a binary problem is created where all instances that belong to yi are considered positive examples, while the remaining instances are considered negative examples. A binary classifier is then constructed to separate instances of class yi from the rest of the classes. This is known as the one-against-rest (1-r) approach.

The second approach, which is known as the one-against-one (1-1) approach, constructs K(K − 1)/2 binary classifiers, where each classifier is used to distinguish between a pair of classes, (yi, yj). Instances that do not belong to either yi or yj are ignored when constructing the binary classifier for (yi, yj). In both 1-r and 1-1 approaches, a test instance is classified by combining the predictions made by the binary classifiers. A voting scheme is typically employed to combine the predictions, where the class that receives the highest number of votes is assigned to the test instance. In the 1-r approach, if an instance is classified as negative, then all classes except for the positive class receive a vote. This approach, however, may lead to ties among the different classes. Another possibility is to transform the outputs of the binary classifiers into probability estimates and then assign the test instance to the class that has the highest probability.
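A minimal Python sketch of the two decompositions is given below, using scikit-learn decision trees as the underlying binary learners (an arbitrary choice for illustration) and the 1-r voting rule described above; it is not the only way to implement these schemes.

```python
import numpy as np
from itertools import combinations
from sklearn.tree import DecisionTreeClassifier

def one_against_rest(X, y, classes):
    """One binary classifier per class y_i: instances of y_i are positive (1),
    all remaining instances are negative (0)."""
    return {c: DecisionTreeClassifier().fit(X, (y == c).astype(int)) for c in classes}

def one_against_one(X, y, classes):
    """K(K-1)/2 classifiers, one per pair (y_i, y_j), ignoring other classes."""
    models = {}
    for ci, cj in combinations(classes, 2):
        mask = (y == ci) | (y == cj)
        models[(ci, cj)] = DecisionTreeClassifier().fit(X[mask], (y[mask] == ci).astype(int))
    return models

def predict_1r(models, x, classes):
    """1-r voting: a positive prediction votes for the class; a negative
    prediction votes for every other class."""
    votes = {c: 0 for c in classes}
    for c, m in models.items():
        if m.predict(x.reshape(1, -1))[0] == 1:
            votes[c] += 1
        else:
            for other in classes:
                if other != c:
                    votes[other] += 1
    return max(votes, key=votes.get)
```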
Example 4.11. Consider a multiclass problem where Y = {y1, y2, y3, y4}. Suppose a test instance is classified as (+, −, −, −) according to the 1-r approach. In other words, it is classified as positive when y1 is used as the positive class and negative when y2, y3, and y4 are used as the positive class. Using a simple majority vote, notice that y1 receives the highest number of votes, which is four, while the remaining classes receive only three votes. The test instance is therefore classified as y1.

Example 4.12. Suppose the test instance is classified using the 1-1 approach as follows:

(+) class      | y1 | y1 | y1 | y2 | y2 | y3
(−) class      | y2 | y3 | y4 | y3 | y4 | y4
Classification | +  | +  | −  | +  | −  | +

The first two rows in this table correspond to the pair of classes (yi, yj) chosen to build the classifier and the last row represents the predicted class for the test instance. After combining the predictions, y1 and y4 each receive two votes, while y2 and y3 each receive only one vote. The test instance is therefore classified as either y1 or y4, depending on the tie-breaking procedure.

Error-Correcting Output Coding
A potential problem with the previous two approaches is that they may be sensitive to binary classification errors. For the 1-r approach given in Example 4.11, if at least one of the binary classifiers makes a mistake in its prediction, then the classifier may end up declaring a tie between classes or making a wrong prediction. For example, suppose the test instance is classified as (+, −, +, −) due to a misclassification by the third classifier. In this case, it will be difficult to tell whether the instance should be classified as y1 or y3, unless the probability associated with each class prediction is taken into account.
The error-correcting output coding (ECOC) method provides a more robust way for handling multiclass problems. The method is inspired by an information-theoretic approach for sending messages across noisy channels.

The idea behind this approach is to add redundancy into the transmitted message by means of a codeword, so that the receiver may detect errors in the received message and perhaps recover the original message if the number of errors is small.
For multiclass learning, each class yi is represented by a unique bit string of length n known as its codeword. We then train n binary classifiers to predict each bit of the codeword string. The predicted class of a test instance is given by the codeword whose Hamming distance is closest to the codeword produced by the binary classifiers. Recall that the Hamming distance between a pair of bit strings is given by the number of bits that differ.

Example 4.13. Consider a multiclass problem where Y = {y1, y2, y3, y4}. Suppose we encode the classes using the following seven-bit codewords:

Class | Codeword
y1    | 1 1 1 1 1 1 1
y2    | 0 0 0 0 1 1 1
y3    | 0 0 1 1 0 0 1
y4    | 0 1 0 1 0 1 0

Each bit of the codeword is used to train a binary classifier. If a test instance is classified as (0,1,1,1,1,1,1) by the binary classifiers, then the Hamming distance between this codeword and y1 is 1, while the Hamming distance to the remaining classes is 3. The test instance is therefore classified as y1.
An interesting property of an error-correcting code is that if the minimum Hamming distance between any pair of codewords is d, then any ⌊(d − 1)/2⌋ errors in the output code can be corrected using its nearest codeword. In Example 4.13, because the minimum Hamming distance between any pair of codewords is 4, the classifier may tolerate errors made by one of the seven binary classifiers. If there is more than one classifier that makes a mistake, then the classifier may not be able to compensate for the error.
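Decoding an ECOC prediction is a nearest-codeword lookup under the Hamming distance; the short Python sketch below reproduces Example 4.13 and is purely illustrative.

```python
import numpy as np

# Codewords of Example 4.13 (one row per class, one column per binary classifier).
codewords = {
    "y1": np.array([1, 1, 1, 1, 1, 1, 1]),
    "y2": np.array([0, 0, 0, 0, 1, 1, 1]),
    "y3": np.array([0, 0, 1, 1, 0, 0, 1]),
    "y4": np.array([0, 1, 0, 1, 0, 1, 0]),
}

def ecoc_decode(bits, codewords):
    """Assign the class whose codeword has the smallest Hamming distance
    to the bit string produced by the n binary classifiers."""
    bits = np.asarray(bits)
    return min(codewords, key=lambda c: int(np.sum(codewords[c] != bits)))

# The test output from the text: one of the seven classifiers erred.
print(ecoc_decode([0, 1, 1, 1, 1, 1, 1], codewords))   # -> 'y1'
```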
An important issue is how to design the appropriate set of codewords for different classes. From coding theory, a vast number of algorithms have been developed for generating n-bit codewords with bounded Hamming distance. However, the discussion of these algorithms is beyond the scope of this book. It is worthwhile mentioning that there is a significant difference between the design of error-correcting codes for communication tasks compared to those used for multiclass learning. For communication, the codewords should maximize the Hamming distance between the rows so that error correction
can be performed. Multiclass learning, however, requires that both the row-wise and column-wise distances of the codewords must be well separated. A larger column-wise distance ensures that the binary classifiers are mutually independent, which is an important requirement for ensemble learning methods.
4.13 Bibliographic Notes

Mitchell [278] provides excellent coverage on many classification techniques from a machine learning perspective. Extensive coverage on classification can also be found in Aggarwal [195], Duda et al. [229], Webb [307], Fukunaga [237], Bishop [204], Hastie et al. [249], Cherkassky and Mulier [215], Witten and Frank [310], Hand et al. [247], Han and Kamber [244], and Dunham [230].
Direct methods for rule-based classifiers typically employ the sequential covering scheme for inducing classification rules. Holte's 1R [255] is the simplest form of a rule-based classifier because its rule set contains only a single rule. Despite its simplicity, Holte found that for some data sets that exhibit a strong one-to-one relationship between the attributes and the class label, 1R performs just as well as other classifiers. Other examples of rule-based classifiers include IREP [234], RIPPER [218], CN2 [216, 217], AQ [276], RISE [224], and ITRULE [296]. Table 4.8 shows a comparison of the characteristics of four of these classifiers.
Table 4.8. Comparison of various rule-based classifiers.

                                    | RIPPER                                | CN2 (unordered)     | CN2 (ordered)                | AQR
Rule-growing strategy               | General-to-specific                   | General-to-specific | General-to-specific          | General-to-specific (seeded by a positive example)
Evaluation metric                   | FOIL's Info gain                      | Laplace             | Entropy and likelihood ratio | Number of true positives
Stopping condition for rule-growing | All examples belong to the same class | No performance gain | No performance gain          | Rules cover only positive class
Rule pruning                        | Reduced error pruning                 | None                | None                         | None
Instance elimination                | Positive and negative                 | Positive only       | Positive only                | Positive and negative
Stopping condition for adding rules | Error > 50% or based on MDL           | No performance gain | No performance gain          | All positive examples are covered
Rule set pruning                    | Replace or modify rules               | Statistical tests   | None                         | None
Search strategy                     | Greedy                                | Beam search         | Beam search                  | Beam search
For rule-based classifiers, the rule antecedent can be generalized to include any propositional or first-order logical expression (e.g., Horn clauses). Readers who are interested in first-order logic rule-based classifiers may refer to references such as [278] or the vast literature on inductive logic programming [279]. Quinlan [287] proposed the C4.5rules algorithm for extracting classification rules from decision trees. An indirect method for extracting rules from artificial neural networks was given by Andrews et al. in [198].
Cover and Hart [220] presented an overview of the nearest neighbor classification method from a Bayesian perspective. Aha provided both theoretical and empirical evaluations for instance-based methods in [196]. PEBLS, which was developed by Cost and Salzberg [219], is a nearest neighbor classifier that can handle data sets containing nominal attributes.
Each training example in PEBLS is also assigned a weight factor that depends on the number of times the example helps make a correct prediction. Han et al. [243] developed a weight-adjusted nearest neighbor algorithm, in which the feature weights are learned using a greedy, hill-climbing optimization algorithm. A more recent survey of k-nearest neighbor classification is given by Steinbach and Tan [298].
NaïveBayesclassifiershavebeeninvestigatedbymanyauthors,includingLangleyetal.[267],RamoniandSebastiani[288],Lewis[270],andDomingosandPazzani[227].AlthoughtheindependenceassumptionusedinnaïveBayesclassifiersmayseemratherunrealistic,themethodhasworkedsurprisinglywellforapplicationssuchastextclassification.Bayesiannetworksprovideamoreflexibleapproachbyallowingsomeoftheattributestobeinterdependent.AnexcellenttutorialonBayesiannetworksisgivenbyHeckermanin[252]andJensenin[258].Bayesiannetworksbelongtoabroaderclassofmodelsknownasprobabilisticgraphicalmodels.AformalintroductiontotherelationshipsbetweengraphsandprobabilitieswaspresentedinPearl[283].OthergreatresourcesonprobabilisticgraphicalmodelsincludebooksbyBishop[205],andJordan[259].Detaileddiscussionsofconceptssuchasd-separationandMarkovblanketsareprovidedinGeigeretal.[238]andRussellandNorvig[291].
Generalizedlinearmodels(GLM)arearichclassofregressionmodelsthathavebeenextensivelystudiedinthestatisticalliterature.TheywereformulatedbyNelderandWedderburnin1972[280]tounifyanumberofregressionmodelssuchaslinearregression,logisticregression,andPoissonregression,whichsharesomesimilaritiesintheirformulations.AnextensivediscussionofGLMsisprovidedinthebookbyMcCullaghandNelder[274].
Artificialneuralnetworks(ANN)havewitnessedalongandwindinghistoryofdevelopments,involvingmultiplephasesofstagnationandresurgence.The
ideaofamathematicalmodelofaneuralnetworkwasfirstintroducedin1943byMcCullochandPitts[275].Thisledtoaseriesofcomputationalmachinestosimulateaneuralnetworkbasedonthetheoryofneuralplasticity[289].Theperceptron,whichisthesimplestprototypeofmodernANNs,wasdevelopedbyRosenblattin1958[290].Theperceptronusesasinglelayerofprocessingunitsthatcanperformbasicmathematicaloperationssuchasadditionandmultiplication.However,theperceptroncanonlylearnlineardecisionboundariesandisguaranteedtoconvergeonlywhentheclassesarelinearlyseparable.Despitetheinterestinlearningmulti-layernetworkstoovercomethelimitationsofperceptron,progressinthisarearemainhalteduntiltheinventionofthebackpropagationalgorithmbyWerbosin1974[309],whichallowedforthequicktrainingofmulti-layerANNsusingthegradientdescentmethod.Thisledtoanupsurgeofinterestintheartificialintelligence(AI)communitytodevelopmulti-layerANNmodels,atrendthatcontinuedformorethanadecade.Historically,ANNsmarkaparadigmshiftinAIfromapproachesbasedonexpertsystems(whereknowledgeisencodedusingif-thenrules)tomachinelearningapproaches(wheretheknowledgeisencodedintheparametersofacomputationalmodel).However,therewerestillanumberofalgorithmicandcomputationalchallengesinlearninglargeANNmodels,whichremainedunresolvedforalongtime.ThishinderedthedevelopmentofANNmodelstothescalenecessaryforsolvingreal-worldproblems.Slowly,ANNsstartedgettingoutpacedbyotherclassificationmodelssuchassupportvectormachines,whichprovidedbetterperformanceaswellastheoreticalguaranteesofconvergenceandoptimality.Itisonlyrecentlythatthechallengesinlearningdeepneuralnetworkshavebeencircumvented,owingtobettercomputationalresourcesandanumberofalgorithmicimprovementsinANNssince2006.Thisre-emergenceofANNhasbeendubbedas“deeplearning,”whichhasoftenoutperformedexistingclassificationmodelsandgainedwide-spreadpopularity.
Deep learning is a rapidly evolving area of research with a number of potentially impactful contributions being made every year. Some of the landmark advancements in deep learning include the use of large-scale restricted Boltzmann machines for learning generative models of data [201, 253], the use of autoencoders and their variants (denoising autoencoders) for learning robust feature representations [199, 305, 306], and sophisticated architectures to promote sharing of parameters across nodes, such as convolutional neural networks for images [265, 268] and recurrent neural networks for sequences [241, 242, 277]. Other major improvements include the approach of unsupervised pretraining for initializing ANN models [232], the dropout technique for regularization [254, 297], batch normalization for fast learning of ANN parameters [256], and maxout networks for effective usage of the dropout technique [240]. Even though the discussions in this chapter on learning ANN models were centered around the gradient descent method, most of the modern ANN models involving a large number of hidden layers are trained using the stochastic gradient descent method since it is highly scalable [207]. An extensive survey of deep learning approaches has been presented in review articles by Bengio [200], LeCun et al. [269], and Schmidhuber [293]. An excellent summary of deep learning approaches can also be obtained from recent books by Goodfellow et al. [239] and Nielsen [281].
Vapnik [303, 304] has written two authoritative books on Support Vector Machines (SVM). Other useful resources on SVM and kernel methods include the books by Cristianini and Shawe-Taylor [221] and Schölkopf and Smola [294]. There are several survey articles on SVM, including those written by Burges [212], Bennett et al. [202], Hearst [251], and Mangasarian [272]. SVM can also be viewed as an L2 norm regularizer of the hinge loss function, as described in detail by Hastie et al. [249]. The L1 norm regularizer of the square loss function can be obtained using the least absolute shrinkage and selection operator (Lasso), which was introduced by Tibshirani in 1996 [301]. The Lasso has several interesting properties, such as the ability to simultaneously perform feature selection as well as regularization, so that only a subset of features are selected in the final model. An excellent review of the Lasso can be obtained from a book by Hastie et al. [250].
AsurveyofensemblemethodsinmachinelearningwasgivenbyDiet-terich[222].ThebaggingmethodwasproposedbyBreiman[209].FreundandSchapire[236]developedtheAdaBoostalgorithm.Arcing,whichstandsforadaptiveresamplingandcombining,isavariantoftheboostingalgorithmproposedbyBreiman[210].Itusesthenon-uniformweightsassignedtotrainingexamplestoresamplethedataforbuildinganensembleoftrainingsets.UnlikeAdaBoost,thevotesofthebaseclassifiersarenotweightedwhendeterminingtheclasslabeloftestexamples.TherandomforestmethodwasintroducedbyBreimanin[211].Theconceptofbias-variancedecompositionisexplainedindetailbyHastieetal.[249].Whilethebias-variancedecompositionwasinitiallyproposedforregressionproblemswithsquaredlossfunction,aunifiedframeworkforclassificationproblemsinvolving0–1losseswasintroducedbyDomingos[226].
RelatedworkonminingrareandimbalanceddatasetscanbefoundinthesurveypaperswrittenbyChawlaetal.[214]andWeiss[308].Sampling-basedmethodsforminingimbalanceddatasetshavebeeninvestigatedbymanyauthors,suchasKubatandMatwin[266],Japkowitz[257],andDrummondandHolte[228].Joshietal.[261]discussedthelimitationsofboostingalgorithmsforrareclassmodeling.OtheralgorithmsdevelopedforminingrareclassesincludeSMOTE[213],PNrule[260],andCREDOS[262].
Various alternative metrics that are well-suited for class imbalanced problems are available. The precision, recall, and F1-measure are widely-used metrics in information retrieval [302]. ROC analysis was originally used in signal detection theory for performing aggregate evaluation over a range of score thresholds. A method for comparing classifier performance using the convex hull of ROC curves was suggested by Provost and Fawcett in [286]. Bradley [208] investigated the use of the area under the ROC curve (AUC) as a performance metric for machine learning algorithms. Despite the vast body of literature on optimizing the AUC measure in machine learning models, it is well-known that AUC suffers from certain limitations. For example, the AUC can be used to compare the quality of two classifiers only if the ROC curve of one classifier strictly dominates the other. However, if the ROC curves of two classifiers intersect at any point, then it is difficult to assess the relative quality of classifiers using the AUC measure. An in-depth discussion of the pitfalls in using AUC as a performance measure can be obtained in works by Hand [245, 246] and Powers [284]. The AUC has also been considered to be an incoherent measure of performance, i.e., it uses different scales while comparing the performance of different classifiers, although a coherent interpretation of AUC has been provided by Ferri et al. [235]. Berrar and Flach [203] describe some of the common caveats in using the ROC curve for clinical microarray research. An alternate approach for measuring the aggregate performance of a classifier is the precision-recall (PR) curve, which is especially useful in the presence of class imbalance [292].
Anexcellenttutorialoncost-sensitivelearningcanbefoundinareviewarticlebyLingandSheng[271].ThepropertiesofacostmatrixhadbeenstudiedbyElkanin[231].MargineantuandDietterich[273]examinedvariousmethodsforincorporatingcostinformationintotheC4.5learningalgorithm,includingwrappermethods,classdistribution-basedmethods,andloss-basedmethods.Othercost-sensitivelearningmethodsthatarealgorithm-independentincludeAdaCost[233],MetaCost[225],andcosting[312].
Extensiveliteratureisalsoavailableonthesubjectofmulticlasslearning.ThisincludestheworksofHastieandTibshirani[248],Allweinetal.[197],KongandDietterich[264],andTaxandDuin[300].Theerror-correctingoutput
coding(ECOC)methodwasproposedbyDietterichandBakiri[223].Theyhadalsoinvestigatedtechniquesfordesigningcodesthataresuitableforsolvingmulticlassproblems.
Apartfromexploringalgorithmsfortraditionalclassificationsettingswhereeveryinstancehasasinglesetoffeatureswithauniquecategoricallabel,therehasbeenalotofrecentinterestinnon-traditionalclassificationparadigms,involvingcomplexformsofinputsandoutputs.Forexample,theparadigmofmulti-labellearningallowsforaninstancetobeassignedmultipleclasslabelsratherthanjustone.Thisisusefulinapplicationssuchasobjectrecognitioninimages,whereaphotoimagemayincludemorethanoneclassificationobject,suchas,grass,sky,trees,andmountains.Asurveyonmulti-labellearningcanbefoundin[313].Asanotherexample,theparadigmofmulti-instancelearningconsiderstheproblemwheretheinstancesareavailableintheformofgroupscalledbags,andtraininglabelsareavailableatthelevelofbagsratherthanindividualinstances.Multi-instancelearningisusefulinapplicationswhereanobjectcanexistasmultipleinstancesindifferentstates(e.g.,thedifferentisomersofachemicalcompound),andevenifasingleinstanceshowsaspecificcharacteristic,theentirebagofinstancesassociatedwiththeobjectneedstobeassignedtherelevantclass.Asurveyonmulti-instancelearningisprovidedin[314].
Inanumberofreal-worldapplications,itisoftenthecasethatthetraininglabelsarescarceinquantity,becauseofthehighcostsassociatedwithobtaininggold-standardsupervision.However,wealmostalwayshaveabundantaccesstounlabeledtestinstances,whichdonothavesupervisedlabelsbutcontainvaluableinformationaboutthestructureordistributionofinstances.Traditionallearningalgorithms,whichonlymakeuseofthelabeledinstancesinthetrainingsetforlearningthedecisionboundary,areunabletoexploittheinformationcontainedinunlabeledinstances.Incontrast,learningalgorithmsthatmakeuseofthestructureintheunlabeleddataforlearningthe
classificationmodelareknownassemi-supervisedlearningalgorithms[315,316].Theuseofunlabeleddataisalsoexploredintheparadigmofmulti-viewlearning[299,311],whereeveryobjectisobservedinmultipleviewsofthedata,involvingdiversesetsoffeatures.Acommonstrategyusedbymulti-viewlearningalgorithmsisco-training[206],whereadifferentmodelislearnedforeveryviewofthedata,butthemodelpredictionsfromeveryviewareconstrainedtobeidenticaltoeachotherontheunlabeledtestinstances.
Anotherlearningparadigmthatiscommonlyexploredinthepaucityoftrainingdataistheframeworkofactivelearning,whichattemptstoseekthesmallestsetoflabelannotationstolearnareasonableclassificationmodel.Activelearningexpectstheannotatortobeinvolvedintheprocessofmodellearning,sothatthelabelsarerequestedincrementallyoverthemostrelevantsetofinstances,givenalimitedbudgetoflabelannotations.Forexample,itmaybeusefultoobtainlabelsoverinstancesclosertothedecisionboundarythatcanplayabiggerroleinfine-tuningtheboundary.Areviewonactivelearningapproachescanbefoundin[285,295].
Insomeapplications,itisimportanttosimultaneouslysolvemultiplelearningtaskstogether,wheresomeofthetasksmaybesimilartooneanother.Forexample,ifweareinterestedintranslatingapassagewritteninEnglishintodifferentlanguages,thetasksinvolvinglexicallysimilarlanguages(suchasSpanishandPortuguese)wouldrequiresimilarlearningofmodels.Theparadigmofmulti-tasklearninghelpsinsimultaneouslylearningacrossalltaskswhilesharingthelearningamongrelatedtasks.Thisisespeciallyusefulwhensomeofthetasksdonotcontainsufficientlymanytrainingsamples,inwhichcaseborrowingthelearningfromotherrelatedtaskshelpsinthelearningofrobustmodels.Aspecialcaseofmulti-tasklearningistransferlearning,wherethelearningfromasourcetask(withsufficientnumberoftrainingsamples)hastobetransferredtoadestinationtask(withpaucityof
trainingdata).AnextensivesurveyoftransferlearningapproachesisprovidedbyPanetal.[282].
Mostclassifiersassumeeverydatainstancemustbelongtoaclass,whichisnotalwaystrueforsomeapplications.Forexample,inmalwaredetection,duetotheeaseinwhichnewmalwaresarecreated,aclassifiertrainedonexistingclassesmayfailtodetectnewonesevenifthefeaturesforthenewmalwaresareconsiderablydifferentthanthoseforexistingmalwares.Anotherexampleisincriticalapplicationssuchasmedicaldiagnosis,wherepredictionerrorsarecostlyandcanhavesevereconsequences.Inthissituation,itwouldbebetterfortheclassifiertorefrainfrommakinganypredictiononadatainstanceifitisunsureofitsclass.Thisapproach,knownasclassifierwithrejectoption,doesnotneedtoclassifyeverydatainstanceunlessitdeterminesthepredictionisreliable(e.g.,iftheclassprobabilityexceedsauser-specifiedthreshold).Instancesthatareunclassifiedcanbepresentedtodomainexpertsforfurtherdeterminationoftheirtrueclasslabels.
Classifierscanalsobedistinguishedintermsofhowtheclassificationmodelistrained.Abatchclassifierassumesallthelabeledinstancesareavailableduringtraining.Thisstrategyisapplicablewhenthetrainingsetsizeisnottoolargeandforstationarydata,wheretherelationshipbetweentheattributesandclassesdoesnotvaryovertime.Anonlineclassifier,ontheotherhand,trainsaninitialmodelusingasubsetofthelabeleddata[263].Themodelisthenupdatedincrementallyasmorelabeledinstancesbecomeavailable.Thisstrategyiseffectivewhenthetrainingsetistoolargeorwhenthereisconceptdriftduetochangesinthedistributionofthedataovertime.
Bibliography[195]C.C.Aggarwal.Dataclassification:algorithmsandapplications.CRC
Press,2014.
[196]D.W.Aha.Astudyofinstance-basedalgorithmsforsupervisedlearningtasks:mathematical,empirical,andpsychologicalevaluations.PhDthesis,UniversityofCalifornia,Irvine,1990.
[197]E.L.Allwein,R.E.Schapire,andY.Singer.ReducingMulticlasstoBinary:AUnifyingApproachtoMarginClassifiers.JournalofMachineLearningResearch,1:113–141,2000.
[198]R.Andrews,J.Diederich,andA.Tickle.ASurveyandCritiqueofTechniquesForExtractingRulesFromTrainedArtificialNeuralNetworks.KnowledgeBasedSystems,8(6):373–389,1995.
[199]P.Baldi.Autoencoders,unsupervisedlearning,anddeeparchitectures.ICMLunsupervisedandtransferlearning,27(37-50):1,2012.
[200]Y.Bengio.LearningdeeparchitecturesforAI.FoundationsandtrendsRinMachineLearning,2(1):1–127,2009.
[201]Y.Bengio,A.Courville,andP.Vincent.Representationlearning:Areviewandnewperspectives.IEEEtransactionsonpatternanalysisand
machineintelligence,35(8):1798–1828,2013.
[202]K.BennettandC.Campbell.SupportVectorMachines:HypeorHallelujah.SIGKDDExplorations,2(2):1–13,2000.
[203]D.BerrarandP.Flach.CaveatsandpitfallsofROCanalysisinclinicalmicroarrayresearch(andhowtoavoidthem).Briefingsinbioinformatics,pagebbr008,2011.
[204]C.M.Bishop.NeuralNetworksforPatternRecognition.OxfordUniversityPress,Oxford,U.K.,1995.
[205]C.M.Bishop.PatternRecognitionandMachineLearning.Springer,2006.
[206]A.BlumandT.Mitchell.Combininglabeledandunlabeleddatawithco-training.InProceedingsoftheeleventhannualconferenceonComputationallearningtheory,pages92–100.ACM,1998.
[207]L.Bottou.Large-scalemachinelearningwithstochasticgradientdescent.InProceedingsofCOMPSTAT'2010,pages177–186.Springer,2010.
[208]A.P.Bradley.TheuseoftheareaundertheROCcurveintheEvaluationofMachineLearningAlgorithms.PatternRecognition,30(7):1145–1149,1997.
[209]L.Breiman.BaggingPredictors.MachineLearning,24(2):123–140,1996.
[210]L.Breiman.Bias,Variance,andArcingClassifiers.TechnicalReport486,UniversityofCalifornia,Berkeley,CA,1996.
[211]L.Breiman.RandomForests.MachineLearning,45(1):5–32,2001.
[212]C.J.C.Burges.ATutorialonSupportVectorMachinesforPatternRecognition.DataMiningandKnowledgeDiscovery,2(2):121–167,1998.
[213]N.V.Chawla,K.W.Bowyer,L.O.Hall,andW.P.Kegelmeyer.SMOTE:SyntheticMinorityOver-samplingTechnique.JournalofArtificialIntelligenceResearch,16:321–357,2002.
[214]N.V.Chawla,N.Japkowicz,andA.Kolcz.Editorial:SpecialIssueonLearningfromImbalancedDataSets.SIGKDDExplorations,6(1):1–6,2004.
[215]V.CherkasskyandF.Mulier.LearningfromData:Concepts,Theory,andMethods.WileyInterscience,1998.
[216]P.ClarkandR.Boswell.RuleInductionwithCN2:SomeRecentImprovements.InMachineLearning:Proc.ofthe5thEuropeanConf.(EWSL-91),pages151–163,1991.
[217]P.ClarkandT.Niblett.TheCN2InductionAlgorithm.MachineLearning,3(4):261–283,1989.
[218]W.W.Cohen.FastEffectiveRuleInduction.InProc.ofthe12thIntl.Conf.onMachineLearning,pages115–123,TahoeCity,CA,July1995.
[219]S.CostandS.Salzberg.AWeightedNearestNeighborAlgorithmforLearningwithSymbolicFeatures.MachineLearning,10:57–78,1993.
[220] T. M. Cover and P. E. Hart. Nearest Neighbor Pattern Classification. IEEE Transactions on Information Theory, 13(1):21–27, 1967.
[221]N.CristianiniandJ.Shawe-Taylor.AnIntroductiontoSupportVectorMachinesandOtherKernel-basedLearningMethods.CambridgeUniversityPress,2000.
[222]T.G.Dietterich.EnsembleMethodsinMachineLearning.InFirstIntl.WorkshoponMultipleClassifierSystems,Cagliari,Italy,2000.
[223]T.G.DietterichandG.Bakiri.SolvingMulticlassLearningProblemsviaError-CorrectingOutputCodes.JournalofArtificialIntelligenceResearch,2:263–286,1995.
[224]P.Domingos.TheRISEsystem:Conqueringwithoutseparating.InProc.ofthe6thIEEEIntl.Conf.onToolswithArtificialIntelligence,pages704–707,NewOrleans,LA,1994.
[225]P.Domingos.MetaCost:AGeneralMethodforMakingClassifiersCost-Sensitive.InProc.ofthe5thIntl.Conf.onKnowledgeDiscoveryandDataMining,pages155–164,SanDiego,CA,August1999.
[226]P.Domingos.Aunifiedbias-variancedecomposition.InProceedingsof17thInternationalConferenceonMachineLearning,pages231–238,2000.
[227]P.DomingosandM.Pazzani.OntheOptimalityoftheSimpleBayesianClassifierunderZero-OneLoss.MachineLearning,29(2-3):103–130,1997.
[228]C.DrummondandR.C.Holte.C4.5,Classimbalance,andCostsensitivity:Whyunder-samplingbeatsover-sampling.InICML'2004WorkshoponLearningfromImbalancedDataSetsII,Washington,DC,August2003.
[229]R.O.Duda,P.E.Hart,andD.G.Stork.PatternClassification.JohnWiley&Sons,Inc.,NewYork,2ndedition,2001.
[230]M.H.Dunham.DataMining:IntroductoryandAdvancedTopics.PrenticeHall,2006.
[231]C.Elkan.TheFoundationsofCost-SensitiveLearning.InProc.ofthe17thIntl.JointConf.onArtificialIntelligence,pages973–978,Seattle,WA,August2001.
[232]D.Erhan,Y.Bengio,A.Courville,P.-A.Manzagol,P.Vincent,andS.Bengio.Whydoesunsupervisedpre-traininghelpdeeplearning?JournalofMachineLearningResearch,11(Feb):625–660,2010.
[233]W.Fan,S.J.Stolfo,J.Zhang,andP.K.Chan.AdaCost:misclassificationcost-sensitiveboosting.InProc.ofthe16thIntl.Conf.onMachineLearning,pages97–105,Bled,Slovenia,June1999.
[234]J.FürnkranzandG.Widmer.Incrementalreducederrorpruning.InProc.ofthe11thIntl.Conf.onMachineLearning,pages70–77,NewBrunswick,NJ,July1994.
[235]C.Ferri,J.Hernández-Orallo,andP.A.Flach.AcoherentinterpretationofAUCasameasureofaggregatedclassificationperformance.InProceedingsofthe28thInternationalConferenceonMachineLearning(ICML-11),pages657–664,2011.
[236]Y.FreundandR.E.Schapire.Adecision-theoreticgeneralizationofon-linelearningandanapplicationtoboosting.JournalofComputerandSystemSciences,55(1):119–139,1997.
[237]K.Fukunaga.IntroductiontoStatisticalPatternRecognition.AcademicPress,NewYork,1990.
[238]D.Geiger,T.S.Verma,andJ.Pearl.d-separation:Fromtheoremstoalgorithms.arXivpreprintarXiv:1304.1505,2013.
[239]I.Goodfellow,Y.Bengio,andA.Courville.DeepLearning.BookinpreparationforMITPress,2016.
[240]I.J.Goodfellow,D.Warde-Farley,M.Mirza,A.C.Courville,andY.Bengio.Maxoutnetworks.ICML(3),28:1319–1327,2013.
[241]A.Graves,M.Liwicki,S.Fernández,R.Bertolami,H.Bunke,andJ.Schmidhuber.Anovelconnectionistsystemforunconstrainedhandwritingrecognition.IEEEtransactionsonpatternanalysisandmachineintelligence,31(5):855–868,2009.
[242]A.GravesandJ.Schmidhuber.Offlinehandwritingrecognitionwithmultidimensionalrecurrentneuralnetworks.InAdvancesinneuralinformationprocessingsystems,pages545–552,2009.
[243]E.-H.Han,G.Karypis,andV.Kumar.TextCategorizationUsingWeightAdjustedk-NearestNeighborClassification.InProc.ofthe5thPacific-AsiaConf.onKnowledgeDiscoveryandDataMining,Lyon,France,2001.
[244]J.HanandM.Kamber.DataMining:ConceptsandTechniques.MorganKaufmannPublishers,SanFrancisco,2001.
[245]D.J.Hand.Measuringclassifierperformance:acoherentalternativetotheareaundertheROCcurve.Machinelearning,77(1):103–123,2009.
[246]D.J.Hand.Evaluatingdiagnostictests:theareaundertheROCcurveandthebalanceoferrors.Statisticsinmedicine,29(14):1502–1510,2010.
[247]D.J.Hand,H.Mannila,andP.Smyth.PrinciplesofDataMining.MITPress,2001.
[248]T.HastieandR.Tibshirani.Classificationbypairwisecoupling.AnnalsofStatistics,26(2):451–471,1998.
[249]T.Hastie,R.Tibshirani,andJ.Friedman.TheElementsofStatisticalLearning:DataMining,Inference,andPrediction.Springer,2ndedition,2009.
[250]T.Hastie,R.Tibshirani,andM.Wainwright.Statisticallearningwithsparsity:thelassoandgeneralizations.CRCPress,2015.
[251]M.Hearst.Trends&Controversies:SupportVectorMachines.IEEEIntelligentSystems,13(4):18–28,1998.
[252]D.Heckerman.BayesianNetworksforDataMining.DataMiningandKnowledgeDiscovery,1(1):79–119,1997.
[253]G.E.HintonandR.R.Salakhutdinov.Reducingthedimensionalityofdatawithneuralnetworks.Science,313(5786):504–507,2006.
[254]G.E.Hinton,N.Srivastava,A.Krizhevsky,I.Sutskever,andR.R.Salakhutdinov.Improvingneuralnetworksbypreventingco-adaptationoffeaturedetectors.arXivpreprintarXiv:1207.0580,2012.
[255]R.C.Holte.VerySimpleClassificationRulesPerformWellonMostCommonlyUsedDatasets.MachineLearning,11:63–91,1993.
[256]S.IoffeandC.Szegedy.Batchnormalization:Acceleratingdeepnetworktrainingbyreducinginternalcovariateshift.arXivpreprintarXiv:1502.03167,2015.
[257]N.Japkowicz.TheClassImbalanceProblem:SignificanceandStrategies.InProc.ofthe2000Intl.Conf.onArtificialIntelligence:SpecialTrackonInductiveLearning,volume1,pages111–117,LasVegas,NV,June2000.
[258]F.V.Jensen.AnintroductiontoBayesiannetworks,volume210.UCLpressLondon,1996.
[259]M.I.Jordan.Learningingraphicalmodels,volume89.SpringerScience&BusinessMedia,1998.
[260]M.V.Joshi,R.C.Agarwal,andV.Kumar.MiningNeedlesinaHaystack:ClassifyingRareClassesviaTwo-PhaseRuleInduction.InProc.of2001ACM-SIGMODIntl.Conf.onManagementofData,pages91–102,SantaBarbara,CA,June2001.
[261]M.V.Joshi,R.C.Agarwal,andV.Kumar.Predictingrareclasses:canboostingmakeanyweaklearnerstrong?InProc.ofthe8thIntl.Conf.onKnowledgeDiscoveryandDataMining,pages297–306,Edmonton,Canada,July2002.
[262]M.V.JoshiandV.Kumar.CREDOS:ClassificationUsingRippleDownStructure(ACaseforRareClasses).InProc.oftheSIAMIntl.Conf.onDataMining,pages321–332,Orlando,FL,April2004.
[263]J.Kivinen,A.J.Smola,andR.C.Williamson.Onlinelearningwithkernels.IEEEtransactionsonsignalprocessing,52(8):2165–2176,2004.
[264]E.B.KongandT.G.Dietterich.Error-CorrectingOutputCodingCorrectsBiasandVariance.InProc.ofthe12thIntl.Conf.onMachineLearning,pages313–321,TahoeCity,CA,July1995.
[265]A.Krizhevsky,I.Sutskever,andG.E.Hinton.Imagenetclassificationwithdeepconvolutionalneuralnetworks.InAdvancesinneuralinformationprocessingsystems,pages1097–1105,2012.
[266]M.KubatandS.Matwin.AddressingtheCurseofImbalancedTrainingSets:OneSidedSelection.InProc.ofthe14thIntl.Conf.onMachineLearning,pages179–186,Nashville,TN,July1997.
[267]P.Langley,W.Iba,andK.Thompson.AnanalysisofBayesianclassifiers.InProc.ofthe10thNationalConf.onArtificialIntelligence,pages223–228,1992.
[268]Y.LeCunandY.Bengio.Convolutionalnetworksforimages,speech,andtimeseries.Thehandbookofbraintheoryandneuralnetworks,3361(10):1995,1995.
[269]Y.LeCun,Y.Bengio,andG.Hinton.Deeplearning.Nature,521(7553):436–444,2015.
[270]D.D.Lewis.NaiveBayesatForty:TheIndependenceAssumptioninInformationRetrieval.InProc.ofthe10thEuropeanConf.onMachineLearning(ECML1998),pages4–15,1998.
[271]C.X.LingandV.S.Sheng.Cost-sensitivelearning.InEncyclopediaofMachineLearning,pages231–235.Springer,2011.
[272]O.Mangasarian.DataMiningviaSupportVectorMachines.TechnicalReportTechnicalReport01-05,DataMiningInstitute,May2001.
[273]D.D.MargineantuandT.G.Dietterich.LearningDecisionTreesforLossMinimizationinMulti-ClassProblems.TechnicalReport99-30-03,OregonStateUniversity,1999.
[274]P.McCullaghandJ.A.Nelder.Generalizedlinearmodels,volume37.CRCpress,1989.
[275]W.S.McCullochandW.Pitts.Alogicalcalculusoftheideasimmanentinnervousactivity.Thebulletinofmathematicalbiophysics,5(4):115–133,1943.
[276]R.S.Michalski,I.Mozetic,J.Hong,andN.Lavrac.TheMulti-PurposeIncrementalLearningSystemAQ15andItsTestingApplicationtoThree
MedicalDomains.InProc.of5thNationalConf.onArtificialIntelligence,Orlando,August1986.
[277]T.Mikolov,M.Karafiát,L.Burget,J.Cernock`y,andS.Khudanpur.Recurrentneuralnetworkbasedlanguagemodel.InInterspeech,volume2,page3,2010.
[278]T.Mitchell.MachineLearning.McGraw-Hill,Boston,MA,1997.
[279]S.Muggleton.FoundationsofInductiveLogicProgramming.PrenticeHall,EnglewoodCliffs,NJ,1995.
[280]J.A.NelderandR.J.Baker.Generalizedlinearmodels.Encyclopediaofstatisticalsciences,1972.
[281]M.A.Nielsen.Neuralnetworksanddeeplearning.Publishedonline:http://neuralnetworksanddeeplearning.com/.(visited:10.15.2016),2015.
[282]S.J.PanandQ.Yang.Asurveyontransferlearning.IEEETransactionsonknowledgeanddataengineering,22(10):1345–1359,2010.
[283]J.Pearl.Probabilisticreasoninginintelligentsystems:networksofplausibleinference.MorganKaufmann,2014.
[284]D.M.Powers.Theproblemofareaunderthecurve.In2012IEEEInternationalConferenceonInformationScienceandTechnology,pages
567–573.IEEE,2012.
[285]M.Prince.Doesactivelearningwork?Areviewoftheresearch.Journalofengineeringeducation,93(3):223–231,2004.
[286]F.J.ProvostandT.Fawcett.AnalysisandVisualizationofClassifierPerformance:ComparisonunderImpreciseClassandCostDistributions.InProc.ofthe3rdIntl.Conf.onKnowledgeDiscoveryandDataMining,pages43–48,NewportBeach,CA,August1997.
[287]J.R.Quinlan.C4.5:ProgramsforMachineLearning.Morgan-KaufmannPublishers,SanMateo,CA,1993.
[288]M.RamoniandP.Sebastiani.RobustBayesclassifiers.ArtificialIntelligence,125:209–226,2001.
[289]N.Rochester,J.Holland,L.Haibt,andW.Duda.Testsonacellassemblytheoryoftheactionofthebrain,usingalargedigitalcomputer.IRETransactionsoninformationTheory,2(3):80–93,1956.
[290]F.Rosenblatt.Theperceptron:aprobabilisticmodelforinformationstorageandorganizationinthebrain.Psychologicalreview,65(6):386,1958.
[291]S.J.Russell,P.Norvig,J.F.Canny,J.M.Malik,andD.D.Edwards.Artificialintelligence:amodernapproach,volume2.PrenticehallUpperSaddleRiver,2003.
[292]T.SaitoandM.Rehmsmeier.Theprecision-recallplotismoreinformativethantheROCplotwhenevaluatingbinaryclassifiersonimbalanceddatasets.PloSone,10(3):e0118432,2015.
[293]J.Schmidhuber.Deeplearninginneuralnetworks:Anoverview.NeuralNetworks,61:85–117,2015.
[294]B.SchölkopfandA.J.Smola.LearningwithKernels:SupportVectorMachines,Regularization,Optimization,andBeyond.MITPress,2001.
[295]B.Settles.Activelearningliteraturesurvey.UniversityofWisconsin,Madison,52(55-66):11,2010.
[296]P.SmythandR.M.Goodman.AnInformationTheoreticApproachtoRuleInductionfromDatabases.IEEETrans.onKnowledgeandDataEngineering,4(4):301–316,1992.
[297]N.Srivastava,G.E.Hinton,A.Krizhevsky,I.Sutskever,andR.Salakhutdinov.Dropout:asimplewaytopreventneuralnetworksfromoverfitting.JournalofMachineLearningResearch,15(1):1929–1958,2014.
[298]M.SteinbachandP.-N.Tan.kNN:k-NearestNeighbors.InX.WuandV.Kumar,editors,TheTopTenAlgorithmsinDataMining.ChapmanandHall/CRCReference,1stedition,2009.
[299]S.Sun.Asurveyofmulti-viewmachinelearning.NeuralComputingandApplications,23(7-8):2031–2038,2013.
[300]D.M.J.TaxandR.P.W.Duin.UsingTwo-ClassClassifiersforMulticlassClassification.InProc.ofthe16thIntl.Conf.onPatternRecognition(ICPR2002),pages124–127,Quebec,Canada,August2002.
[301]R.Tibshirani.Regressionshrinkageandselectionviathelasso.JournaloftheRoyalStatisticalSociety.SeriesB(Methodological),pages267–288,1996.
[302]C.J.vanRijsbergen.InformationRetrieval.Butterworth-Heinemann,Newton,MA,1978.
[303]V.Vapnik.TheNatureofStatisticalLearningTheory.SpringerVerlag,NewYork,1995.
[304]V.Vapnik.StatisticalLearningTheory.JohnWiley&Sons,NewYork,1998.
[305]P.Vincent,H.Larochelle,Y.Bengio,andP.-A.Manzagol.Extractingandcomposingrobustfeatureswithdenoisingautoencoders.InProceedingsofthe25thinternationalconferenceonMachinelearning,pages1096–1103.ACM,2008.
[306]P.Vincent,H.Larochelle,I.Lajoie,Y.Bengio,andP.-A.Manzagol.Stackeddenoisingautoencoders:Learningusefulrepresentationsina
deepnetworkwithalocaldenoisingcriterion.JournalofMachineLearningResearch,11(Dec):3371–3408,2010.
[307]A.R.Webb.StatisticalPatternRecognition.JohnWiley&Sons,2ndedition,2002.
[308]G.M.Weiss.MiningwithRarity:AUnifyingFramework.SIGKDDExplorations,6(1):7–19,2004.
[309] P. Werbos. Beyond regression: new tools for prediction and analysis in the behavioral sciences. PhD thesis, Harvard University, 1974.
[310]I.H.WittenandE.Frank.DataMining:PracticalMachineLearningToolsandTechniqueswithJavaImplementations.MorganKaufmann,1999.
[311]C.Xu,D.Tao,andC.Xu.Asurveyonmulti-viewlearning.arXivpreprintarXiv:1304.5634,2013.
[312]B.Zadrozny,J.C.Langford,andN.Abe.Cost-SensitiveLearningbyCost-ProportionateExampleWeighting.InProc.ofthe2003IEEEIntl.Conf.onDataMining,pages435–442,Melbourne,FL,August2003.
[313]M.-L.ZhangandZ.-H.Zhou.Areviewonmulti-labellearningalgorithms.IEEEtransactionsonknowledgeanddataengineering,26(8):1819–1837,2014.
[314]Z.-H.Zhou.Multi-instancelearning:Asurvey.DepartmentofComputerScience&Technology,NanjingUniversity,Tech.Rep,2004.
[315]X.Zhu.Semi-supervisedlearning.InEncyclopediaofmachinelearning,pages892–897.Springer,2011.
[316]X.ZhuandA.B.Goldberg.Introductiontosemi-supervisedlearning.Synthesislecturesonartificialintelligenceandmachinelearning,3(1):1–130,2009.
4.14 Exercises
1. Consider a binary classification problem with the following set of attributes and attribute values:
Air Conditioner = {Working, Broken}
Engine = {Good, Bad}
Mileage = {High, Medium, Low}
Rust = {Yes, No}
Suppose a rule-based classifier produces the following rule set:
Mileage = High → Value = Low
Mileage = Low → Value = High
Air Conditioner = Working, Engine = Good → Value = High
Air Conditioner = Working, Engine = Bad → Value = Low
Air Conditioner = Broken → Value = Low
a. Are the rules mutually exclusive?
b. Is the rule set exhaustive?
c. Is ordering needed for this set of rules?
d. Do you need a default class for the rule set?
2. The RIPPER algorithm (by Cohen [218]) is an extension of an earlier algorithm called IREP (by Fürnkranz and Widmer [234]). Both algorithms apply the reduced-error pruning method to determine whether a rule needs to be pruned. The reduced error pruning method uses a validation set to estimate the generalization error of a classifier. Consider the following pair of rules:
R1: A → C
R2: A ∧ B → C
R2 is obtained by adding a new conjunct, B, to the left-hand side of R1. For this question, you will be asked to determine whether R2 is preferred over R1 from the perspectives of rule-growing and rule-pruning. To determine whether a rule should be pruned, IREP computes the following measure:
v_IREP = (p + (N − n)) / (P + N),
where P is the total number of positive examples in the validation set, N is the total number of negative examples in the validation set, p is the number of positive examples in the validation set covered by the rule, and n is the number of negative examples in the validation set covered by the rule. v_IREP is actually similar to classification accuracy for the validation set. IREP favors rules that have higher values of v_IREP. On the other hand, RIPPER applies the following measure to determine whether a rule should be pruned:
v_RIPPER = (p − n) / (p + n).
a. Suppose R1 is covered by 350 positive examples and 150 negative examples, while R2 is covered by 300 positive examples and 50 negative examples. Compute the FOIL's information gain for the rule R2 with respect to R1.
b. Consider a validation set that contains 500 positive examples and 500 negative examples. For R1, suppose the number of positive examples covered by the rule is 200, and the number of negative examples covered by the rule is 50. For R2, suppose the number of positive examples covered by the rule is 100 and the number of negative examples is 5. Compute v_IREP for both rules. Which rule does IREP prefer?
c. Compute v_RIPPER for the previous problem. Which rule does RIPPER prefer?
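For readers who want to verify their hand calculations for Exercise 2, the measures defined above can be computed in a few lines. This is a minimal sketch of the stated formulas, together with the standard FOIL's information gain for comparing R2 against R1; it is not part of the original exercise.

```python
import math

def v_irep(p, n, P, N):
    """IREP pruning measure: (p + (N - n)) / (P + N)."""
    return (p + (N - n)) / (P + N)

def v_ripper(p, n):
    """RIPPER pruning measure: (p - n) / (p + n)."""
    return (p - n) / (p + n)

def foil_gain(p0, n0, p1, n1):
    """FOIL's information gain of a rule covering (p1 pos, n1 neg)
    relative to its predecessor covering (p0 pos, n0 neg)."""
    return p1 * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

# Example usage with the counts given in parts (a) and (b):
print(foil_gain(p0=350, n0=150, p1=300, n1=50))
print(v_irep(p=200, n=50, P=500, N=500), v_irep(p=100, n=5, P=500, N=500))
print(v_ripper(p=200, n=50), v_ripper(p=100, n=5))
```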
3.C4.5rulesisanimplementationofanindirectmethodforgeneratingrulesfromadecisiontree.RIPPERisanimplementationofadirectmethodforgeneratingrulesdirectlyfromdata.
a. Discussthestrengthsandweaknessesofbothmethods.
b. Consideradatasetthathasalargedifferenceintheclasssize(i.e.,someclassesaremuchbiggerthanothers).Whichmethod(betweenC4.5rulesandRIPPER)isbetterintermsoffindinghighaccuracyrulesforthesmallclasses?
4. Consider a training set that contains 100 positive examples and 400 negative examples. For each of the following candidate rules,
R1: A → + (covers 4 positive and 1 negative examples),
R2: B → + (covers 30 positive and 10 negative examples),
R3: C → + (covers 100 positive and 90 negative examples),
determine which is the best and worst candidate rule according to:
a. Rule accuracy.
b. FOIL's information gain.
c. The likelihood ratio statistic.
d. The Laplace measure.
e. The m-estimate measure (with k = 2 and p+ = 0.2).
5. Figure 4.3 illustrates the coverage of the classification rules R1, R2, and R3. Determine which is the best and worst rule according to:
a. The likelihood ratio statistic.
b. The Laplace measure.
c. The m-estimate measure (with k = 2 and p+ = 0.58).
d. The rule accuracy after R1 has been discovered, where none of the examples covered by R1 are discarded.
e. The rule accuracy after R1 has been discovered, where only the positive examples covered by R1 are discarded.
f. The rule accuracy after R1 has been discovered, where both positive and negative examples covered by R1 are discarded.
6.
a. Supposethefractionofundergraduatestudentswhosmokeis15%andthefractionofgraduatestudentswhosmokeis23%.Ifone-fifthofthecollegestudentsaregraduatestudentsandtherestareundergraduates,whatistheprobabilitythatastudentwhosmokesisagraduatestudent?
b. Giventheinformationinpart(a),isarandomlychosencollegestudentmorelikelytobeagraduateorundergraduatestudent?
c. Repeatpart(b)assumingthatthestudentisasmoker.
d. Suppose30%ofthegraduatestudentsliveinadormbutonly10%oftheundergraduatestudentsliveinadorm.Ifastudentsmokesandlivesinthedorm,isheorshemorelikelytobeagraduateorundergraduatestudent?Youcanassumeindependencebetweenstudentswholiveinadormandthosewhosmoke.
7. Consider the data set shown in Table 4.9.
Table 4.9. Data set for Exercise 7.
Instance A B C Class
1 0 0 0 +
2 0 0 1 −
3 0 1 1 −
4 0 1 1 −
5 0 0 1 +
6 1 0 1 +
7 1 0 1 −
8 1 0 1 −
9 1 1 1 +
10 1 0 1 +
a. Estimate the conditional probabilities for P(A|+), P(B|+), P(C|+), P(A|−), P(B|−), and P(C|−).
b. Use the estimate of conditional probabilities given in the previous question to predict the class label for a test sample (A=0, B=1, C=0) using the naïve Bayes approach.
c. Estimate the conditional probabilities using the m-estimate approach, with p = 1/2 and m = 4.
d. Repeat part (b) using the conditional probabilities given in part (c).
e. Compare the two methods for estimating probabilities. Which method is better and why?
8. Consider the data set shown in Table 4.10.
Table 4.10. Data set for Exercise 8.
Instance A B C Class
1 0 0 1 −
2 1 0 1 +
3 0 1 0 −
4 1 0 0 −
5 1 0 1 +
6 0 0 1 +
7 1 1 0 −
8 0 0 0 −
9 0 1 0 +
10 1 1 1 +
a. Estimate the conditional probabilities for P(A=1|+), P(B=1|+), P(C=1|+), P(A=1|−), P(B=1|−), and P(C=1|−) using the same approach as in the previous problem.
b. Use the conditional probabilities in part (a) to predict the class label for a test sample (A=1, B=1, C=1) using the naïve Bayes approach.
c. Compare P(A=1), P(B=1), and P(A=1, B=1). State the relationships between A and B.
d. Repeat the analysis in part (c) using P(A=1), P(B=0), and P(A=1, B=0).
e. Compare P(A=1, B=1 | Class=+) against P(A=1 | Class=+) and P(B=1 | Class=+). Are the variables conditionally independent given the class?
9.
a. ExplainhownaïveBayesperformsonthedatasetshowninFigure4.56 .
b. Ifeachclassisfurtherdividedsuchthattherearefourclasses(A1,A2,B1,andB2),willnaïveBayesperformbetter?
c. Howwilladecisiontreeperformonthisdataset(forthetwo-classproblem)?Whatiftherearefourclasses?
Figure 4.56. Data set for Exercise 9.
10. Figure 4.57 illustrates the Bayesian network for the data set shown in Table 4.11. (Assume that all the attributes are binary.)
a. Draw the probability table for each node in the network.
b. Use the Bayesian network to compute P(Engine = Bad, Air Conditioner = Broken).
Table 4.11. Data set for Exercise 10.
Mileage | Engine | Air Conditioner | Number of Instances with Car Value = Hi | Number of Instances with Car Value = Lo
Hi | Good | Working | 3 | 4
Hi | Good | Broken | 1 | 2
Hi | Bad | Working | 1 | 5
Hi | Bad | Broken | 0 | 4
Lo | Good | Working | 9 | 0
Lo | Good | Broken | 5 | 1
Lo | Bad | Working | 1 | 2
Lo | Bad | Broken | 0 | 2
Figure 4.57. Bayesian network.
11. Given the Bayesian network shown in Figure 4.58, compute the following probabilities:
a. P(B = good, F = empty, G = empty, S = yes).
b. P(B = bad, F = empty, G = not empty, S = no).
c. Given that the battery is bad, compute the probability that the car will start.
Figure 4.58. Bayesian network for Exercise 11.
12. Consider the one-dimensional data set shown in Table 4.12.
Table 4.12. Data set for Exercise 12.
x: 0.5 3.0 4.5 4.6 4.9 5.2 5.3 5.5 7.0 9.5
y: −   −   +   +   +   −   −   +   −   −
a. Classify the data point x = 5.0 according to its 1-, 3-, 5-, and 9-nearest neighbors (using majority vote).
b. Repeat the previous analysis using the distance-weighted voting approach described in Section 4.3.1.
13. The nearest neighbor algorithm described in Section 4.3 can be extended to handle nominal attributes. A variant of the algorithm called PEBLS (Parallel Exemplar-Based Learning System) by Cost and Salzberg [219] measures the distance between two values of a nominal attribute using the modified value difference metric (MVDM). Given a pair of nominal attribute values, V1 and V2, the distance between them is defined as follows:
d(V1, V2) = Σ_{i=1}^{k} |n_{i1}/n_1 − n_{i2}/n_2|,   (4.108)
where n_{ij} is the number of examples from class i with attribute value V_j and n_j is the number of examples with attribute value V_j.
Consider the training set for the loan classification problem shown in Figure 4.8. Use the MVDM measure to compute the distance between every pair of attribute values for the Home Owner and Marital Status attributes.
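The MVDM computation in Equation 4.108 is easy to implement; the sketch below is a minimal illustration in which the (attribute value, class label) pairs are made up rather than taken from Figure 4.8.

```python
from collections import defaultdict

def mvdm(value1, value2, examples):
    """Modified value difference metric between two nominal values.

    `examples` is a list of (attribute_value, class_label) pairs; the
    distance sums |n_i1/n_1 - n_i2/n_2| over all classes i.
    """
    counts = defaultdict(lambda: defaultdict(int))  # value -> class -> count
    totals = defaultdict(int)                       # value -> count
    classes = set()
    for value, label in examples:
        counts[value][label] += 1
        totals[value] += 1
        classes.add(label)
    return sum(abs(counts[value1][c] / totals[value1] -
                   counts[value2][c] / totals[value2]) for c in classes)

# Illustrative data: (attribute value, class label) pairs.
data = [("Single", "Yes"), ("Single", "No"), ("Married", "No"),
        ("Married", "No"), ("Divorced", "Yes"), ("Single", "No")]
print(mvdm("Single", "Married", data))  # -> 0.666..., i.e., |1/3 - 0| + |2/3 - 1|
```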
14.ForeachoftheBooleanfunctionsgivenbelow,statewhethertheproblemislinearlyseparable.
a. AANDBANDC
b. NOTAANDB
c. (AORB)AND(AORC)
d. (AXORB)AND(AORB)
15.
a. DemonstratehowtheperceptronmodelcanbeusedtorepresenttheANDandORfunctionsbetweenapairofBooleanvariables.
b. Commentonthedisadvantageofusinglinearfunctionsasactivationfunctionsformulti-layerneuralnetworks.
16. You are asked to evaluate the performance of two classification models, M1 and M2. The test set you have chosen contains 26 binary attributes, labeled as A through Z. Table 4.13 shows the posterior probabilities obtained by applying the models to the test set. (Only the posterior probabilities for the positive class are shown.) As this is a two-class problem, P(−) = 1 − P(+) and P(−|A, …, Z) = 1 − P(+|A, …, Z). Assume that we are mostly interested in detecting instances from the positive class.
a. Plot the ROC curve for both M1 and M2. (You should plot them on the same graph.) Which model do you think is better? Explain your reasons.
b. For model M1, suppose you choose the cutoff threshold to be t = 0.5. In other words, any test instances whose posterior probability is greater than t will be classified as a positive example. Compute the precision, recall, and F-measure for the model at this threshold value.
c. Repeat the analysis for part (b) using the same cutoff threshold on model M2. Compare the F-measure results for both models. Which model is better? Are the results consistent with what you expect from the ROC curve?
d. Repeat part (b) for model M1 using the threshold t = 0.1. Which threshold do you prefer, t = 0.5 or t = 0.1? Are the results consistent with what you expect from the ROC curve?
Table 4.13. Posterior probabilities for Exercise 16.
Instance | True Class | P(+|A, …, Z, M1) | P(+|A, …, Z, M2)
1 | + | 0.73 | 0.61
2 | + | 0.69 | 0.03
3 | − | 0.44 | 0.68
4 | − | 0.55 | 0.31
5 | + | 0.67 | 0.45
6 | + | 0.47 | 0.09
7 | − | 0.08 | 0.38
8 | − | 0.15 | 0.05
9 | + | 0.45 | 0.01
10 | − | 0.35 | 0.04
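Exercise 16(b)–(d) asks for precision, recall, and the F-measure at a fixed score threshold. The following minimal sketch shows how these quantities are computed for the positive class; the labels and scores in the usage example are arbitrary and are not the values from Table 4.13.

```python
def precision_recall_f1(labels, scores, threshold):
    """Precision, recall, and F1 for the positive class when every
    instance with score > threshold is predicted positive."""
    tp = sum(1 for y, s in zip(labels, scores) if s > threshold and y == "+")
    fp = sum(1 for y, s in zip(labels, scores) if s > threshold and y == "-")
    fn = sum(1 for y, s in zip(labels, scores) if s <= threshold and y == "+")
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1

# Arbitrary example: five instances scored by some model.
labels = ["+", "+", "-", "-", "+"]
scores = [0.9, 0.4, 0.6, 0.2, 0.7]
print(precision_recall_f1(labels, scores, threshold=0.5))
```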
17. Following is a data set that contains two attributes, X and Y, and two class labels, "+" and "−". Each attribute can take three different values: 0, 1, or 2.
X  Y  Number of + instances  Number of − instances
0  0  0   100
1  0  0   0
2  0  0   100
0  1  10  100
1  1  10  0
2  1  10  100
0  2  0   100
1  2  0   0
2  2  0   100
The concept for the "+" class is Y = 1 and the concept for the "−" class is X = 0 ∨ X = 2.
a. Build a decision tree on the data set. Does the tree capture the "+" and "−" concepts?
b. What are the accuracy, precision, recall, and F1-measure of the decision tree? (Note that precision, recall, and F1-measure are defined with respect to the "+" class.)
c. Build a new decision tree with the following cost function:
C(i, j) = 0 if i = j;  1 if i = +, j = −;  (Number of − instances)/(Number of + instances) if i = −, j = +.
(Hint: only the leaves of the old decision tree need to be changed.) Does the decision tree capture the "+" concept?
d. What are the accuracy, precision, recall, and F1-measure of the new decision tree?
18. Consider the task of building a classifier from random data, where the attribute values are generated randomly irrespective of the class labels. Assume the data set contains instances from two classes, "+" and "−". Half of the data set is used for training while the remaining half is used for testing.
a. Suppose there are an equal number of positive and negative instances in the data and the decision tree classifier predicts every test instance to be positive. What is the expected error rate of the classifier on the test data?
b. Repeat the previous analysis assuming that the classifier predicts each test instance to be positive class with probability 0.8 and negative class with probability 0.2.
c. Suppose two-thirds of the data belong to the positive class and the remaining one-third belong to the negative class. What is the expected error of a classifier that predicts every test instance to be positive?
d. Repeat the previous analysis assuming that the classifier predicts each test instance to be positive class with probability 2/3 and negative class with probability 1/3.
19. Derive the dual Lagrangian for the linear SVM with non-separable data where the objective function is
f(w) = ‖w‖²/2 + C(Σ_{i=1}^{N} ξ_i)².
20. Consider the XOR problem where there are four training points:
(1, 1, −), (1, 0, +), (0, 1, +), (0, 0, −).
Transform the data into the following feature space:
Φ = (1, √2 x1, √2 x2, √2 x1x2, x1², x2²).
Find the maximum margin linear decision boundary in the transformed space.
21. Given the data sets shown in Figure 4.59, explain how the decision tree, naïve Bayes, and k-nearest neighbor classifiers would perform on these data sets.
Figure 4.59. Data set for Exercise 21.
5AssociationAnalysis:BasicConceptsandAlgorithms
Manybusinessenterprisesaccumulatelargequantitiesofdatafromtheirday-to-dayoperations.Forexample,hugeamountsofcustomerpurchasedataarecollecteddailyatthecheckoutcountersofgrocerystores.Table5.1 givesanexampleofsuchdata,commonlyknownasmarketbaskettransactions.Eachrowinthistablecorrespondstoatransaction,whichcontainsauniqueidentifierlabeledTIDandasetofitemsboughtbyagivencustomer.Retailersareinterestedinanalyzingthedatatolearnaboutthepurchasingbehavioroftheircustomers.Suchvaluableinformationcanbeusedtosupportavarietyofbusiness-relatedapplicationssuchasmarketingpromotions,inventorymanagement,andcustomerrelationshipmanagement.
Table5.1.Anexampleofmarketbaskettransactions.
TID Items
1 {Bread,Milk}
2 {Bread,Diapers,Beer,Eggs}
3 {Milk,Diapers,Beer,Cola}
4 {Bread,Milk,Diapers,Beer}
5 {Bread,Milk,Diapers,Cola}
This chapter presents a methodology known as association analysis, which is useful for discovering interesting relationships hidden in large data sets. The uncovered relationships can be represented in the form of sets of items present in many transactions, which are known as frequent itemsets, or association rules, that represent relationships between two itemsets. For example, the following rule can be extracted from the data set shown in Table 5.1:
{Diapers} → {Beer}.
The rule suggests a relationship between the sale of diapers and beer because many customers who buy diapers also buy beer. Retailers can use these types of rules to help them identify new opportunities for cross-selling their products to the customers.
Besidesmarketbasketdata,associationanalysisisalsoapplicabletodatafromotherapplicationdomainssuchasbioinformatics,medicaldiagnosis,webmining,andscientificdataanalysis.IntheanalysisofEarthsciencedata,forexample,associationpatternsmayrevealinterestingconnectionsamongtheocean,land,andatmosphericprocesses.SuchinformationmayhelpEarthscientistsdevelopabetterunderstandingofhowthedifferentelementsoftheEarthsysteminteractwitheachother.Eventhoughthetechniquespresentedherearegenerallyapplicabletoawidervarietyofdatasets,forillustrativepurposes,ourdiscussionwillfocusmainlyonmarketbasketdata.
Therearetwokeyissuesthatneedtobeaddressedwhenapplyingassociationanalysistomarketbasketdata.First,discoveringpatternsfromalargetransactiondatasetcanbecomputationallyexpensive.Second,someofthediscoveredpatternsmaybespurious(happensimplybychance)andevenfornon-spuriouspatterns,somearemoreinterestingthanothers.Theremainderofthischapterisorganizedaroundthesetwoissues.Thefirstpartofthechapterisdevotedtoexplainingthebasicconceptsofassociationanalysisandthealgorithmsusedtoefficientlyminesuchpatterns.Thesecondpartofthechapterdealswiththeissueofevaluatingthediscoveredpatternsinordertohelppreventthegenerationofspuriousresultsandtorankthepatternsintermsofsomeinterestingnessmeasure.
5.1 Preliminaries
This section reviews the basic terminology used in association analysis and presents a formal description of the task.
Binary Representation
Market basket data can be represented in a binary format as shown in Table 5.2, where each row corresponds to a transaction and each column corresponds to an item. An item can be treated as a binary variable whose value is one if the item is present in a transaction and zero otherwise. Because the presence of an item in a transaction is often considered more important than its absence, an item is an asymmetric binary variable. This representation is a simplistic view of real market basket data because it ignores important aspects of the data such as the quantity of items sold or the price paid to purchase them. Methods for handling such non-binary data will be explained in Chapter 6.
Table 5.2. A binary 0/1 representation of market basket data.
TID Bread Milk Diapers Beer Eggs Cola
1 1 1 0 0 0 0
2 1 0 1 1 1 0
3 0 1 1 1 0 1
4 1 1 1 1 0 0
5 1 1 1 0 0 1
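The 0/1 representation in Table 5.2 can be generated mechanically from the raw transactions. The sketch below assumes the five transactions of Table 5.1 and simply marks each item's presence or absence; it is an illustration, not a prescribed preprocessing routine.

```python
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]

# Fix a column order for the items, then emit one 0/1 row per transaction.
items = sorted(set().union(*transactions))
binary_rows = [[1 if item in t else 0 for item in items] for t in transactions]

print(items)
for tid, row in enumerate(binary_rows, start=1):
    print(tid, row)
```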
Itemset and Support Count
Let I = {i1, i2, …, id} be the set of all items in a market basket data and T = {t1, t2, …, tN} be the set of all transactions. Each transaction ti contains a subset of items chosen from I. In association analysis, a collection of zero or more items is termed an itemset. If an itemset contains k items, it is called a k-itemset. For instance, {Beer, Diapers, Milk} is an example of a 3-itemset. The null (or empty) set is an itemset that does not contain any items.
A transaction tj is said to contain an itemset X if X is a subset of tj. For example, the second transaction shown in Table 5.2 contains the itemset {Bread, Diapers} but not {Bread, Milk}. An important property of an itemset is its support count, which refers to the number of transactions that contain a particular itemset. Mathematically, the support count, σ(X), for an itemset X can be stated as follows:
σ(X) = |{ti | X ⊆ ti, ti ∈ T}|,
where the symbol |·| denotes the number of elements in a set. In the data set shown in Table 5.2, the support count for {Beer, Diapers, Milk} is equal to two because there are only two transactions that contain all three items.
Often, the property of interest is the support, which is the fraction of transactions in which an itemset occurs:
s(X) = σ(X)/N.
An itemset X is called frequent if s(X) is greater than some user-defined threshold, minsup.
Association Rule
An association rule is an implication expression of the form X → Y, where X and Y are disjoint itemsets, i.e., X ∩ Y = ∅. The strength of an association rule can be measured in terms of its support and confidence. Support determines how often a rule is applicable to a given data set, while confidence determines how frequently items in Y appear in transactions that contain X. The formal definitions of these metrics are
Support, s(X → Y) = σ(X ∪ Y)/N;   (5.1)
Confidence, c(X → Y) = σ(X ∪ Y)/σ(X).   (5.2)
Example 5.1. Consider the rule {Milk, Diapers} → {Beer}. Because the support count for {Milk, Diapers, Beer} is 2 and the total number of transactions is 5, the rule's support is 2/5 = 0.4. The rule's confidence is obtained by dividing the support count for {Milk, Diapers, Beer} by the support count for {Milk, Diapers}. Since there are 3 transactions that contain milk and diapers, the confidence for this rule is 2/3 = 0.67.
Why Use Support and Confidence?
Support is an important measure because a rule that has very low support might occur simply by chance. Also, from a business perspective a low support rule is unlikely to be interesting because it might not be profitable to promote items that customers seldom buy together (with the exception of the situation described in Section 5.8). For these reasons, we are interested in finding rules whose support is greater than some user-defined threshold. As will be shown in Section 5.2.1, support also has a desirable property that can be exploited for the efficient discovery of association rules.
Confidence, on the other hand, measures the reliability of the inference made by a rule. For a given rule X → Y, the higher the confidence, the more likely it is for Y to be present in transactions that contain X. Confidence also provides an estimate of the conditional probability of Y given X.
Association analysis results should be interpreted with caution. The inference made by an association rule does not necessarily imply causality. Instead, it can sometimes suggest a strong co-occurrence relationship between items in the antecedent and consequent of the rule. Causality, on the other hand, requires knowledge about which attributes in the data capture cause and effect, and typically involves relationships occurring over time (e.g., greenhouse gas emissions lead to global warming). See Section 5.7.1 for additional discussion.
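A minimal sketch of Equations 5.1 and 5.2: the support and confidence of a rule such as {Milk, Diapers} → {Beer} can be computed directly from the transactions of Table 5.1 by counting the transactions that contain the relevant itemsets.

```python
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]

def support_count(itemset, transactions):
    """sigma(X): number of transactions that contain every item in X."""
    return sum(1 for t in transactions if itemset <= t)

def rule_support_confidence(X, Y, transactions):
    """Support and confidence of the rule X -> Y (Equations 5.1 and 5.2)."""
    n = len(transactions)
    sigma_xy = support_count(X | Y, transactions)
    support = sigma_xy / n
    confidence = sigma_xy / support_count(X, transactions)
    return support, confidence

print(rule_support_confidence({"Milk", "Diapers"}, {"Beer"}, transactions))
# -> (0.4, 0.666...), matching Example 5.1
```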
Formulation of the Association Rule Mining Problem
The association rule mining problem can be formally stated as follows:
Definition 5.1. (Association Rule Discovery.) Given a set of transactions T, find all the rules having support ≥ minsup and confidence ≥ minconf, where minsup and minconf are the corresponding support and confidence thresholds.
A brute-force approach for mining association rules is to compute the support and confidence for every possible rule. This approach is prohibitively expensive because there are exponentially many rules that can be extracted from a data set. More specifically, assuming that neither the left nor the right-hand side of the rule is an empty set, the total number of possible rules, R, extracted from a data set that contains d items is
R = 3^d − 2^(d+1) + 1.   (5.3)
The proof for this equation is left as an exercise to the readers (see Exercise 5 on page 440). Even for the small data set shown in Table 5.1, this approach requires us to compute the support and confidence for 3^6 − 2^7 + 1 = 602 rules. More than 80% of the rules are discarded after applying minsup = 20% and minconf = 50%, thus wasting most of the computations. To avoid performing needless computations, it would be useful to prune the rules early without having to compute their support and confidence values.
An initial step toward improving the performance of association rule mining algorithms is to decouple the support and confidence requirements. From Equation 5.1, notice that the support of a rule X → Y is the same as the support of its corresponding itemset, X ∪ Y. For example, the following rules have identical support because they involve items from the same itemset, {Beer, Diapers, Milk}:
{Beer, Diapers} → {Milk},   {Beer, Milk} → {Diapers},   {Diapers, Milk} → {Beer},
{Beer} → {Diapers, Milk},   {Milk} → {Beer, Diapers},   {Diapers} → {Beer, Milk}.
Iftheitemsetisinfrequent,thenallsixcandidaterulescanbeprunedimmediatelywithoutourhavingtocomputetheirconfidencevalues.
Therefore,acommonstrategyadoptedbymanyassociationruleminingalgorithmsistodecomposetheproblemintotwomajorsubtasks:
1. FrequentItemsetGeneration,whoseobjectiveistofindalltheitemsetsthatsatisfytheminsupthreshold.
2. RuleGeneration,whoseobjectiveistoextractallthehighconfidencerulesfromthefrequentitemsetsfoundinthepreviousstep.Theserulesarecalledstrongrules.
Thecomputationalrequirementsforfrequentitemsetgenerationaregenerallymoreexpensivethanthoseofrulegeneration.EfficienttechniquesforgeneratingfrequentitemsetsandassociationrulesarediscussedinSections5.2 and5.3 ,respectively.
5.2 Frequent Itemset Generation
A lattice structure can be used to enumerate the list of all possible itemsets. Figure 5.1 shows an itemset lattice for I = {a, b, c, d, e}. In general, a data set that contains k items can potentially generate up to 2^k − 1 frequent itemsets, excluding the null set. Because k can be very large in many practical applications, the search space of itemsets that needs to be explored is exponentially large.
Figure 5.1. An itemset lattice.
A brute-force approach for finding frequent itemsets is to determine the support count for every candidate itemset in the lattice structure. To do this, we need to compare each candidate against every transaction, an operation that is shown in Figure 5.2. If the candidate is contained in a transaction, its support count will be incremented. For example, the support for {Bread, Milk} is incremented three times because the itemset is contained in transactions 1, 4, and 5. Such an approach can be very expensive because it requires O(NMw) comparisons, where N is the number of transactions, M = 2^k − 1 is the number of candidate itemsets, and w is the maximum transaction width. Transaction width is the number of items present in a transaction.
Figure 5.2. Counting the support of candidate itemsets.
There are three main approaches for reducing the computational complexity of frequent itemset generation.
1. Reduce the number of candidate itemsets (M). The Apriori principle, described in the next section, is an effective way to eliminate some of the candidate itemsets without counting their support values.
2. Reduce the number of comparisons. Instead of matching each candidate itemset against every transaction, we can reduce the number of comparisons by using more advanced data structures, either to store the candidate itemsets or to compress the data set. We will discuss these strategies in Sections 5.2.4 and 5.6, respectively.
3. Reduce the number of transactions (N). As the size of candidate itemsets increases, fewer transactions will be supported by the itemsets. For instance, since the width of the first transaction in Table 5.1 is 2, it would be advantageous to remove this transaction before searching for frequent itemsets of size 3 and larger. Algorithms that employ such a strategy are discussed in the Bibliographic Notes.
5.2.1TheAprioriPrinciple
ThisSectiondescribeshowthesupportmeasurecanbeusedtoreducethenumberofcandidateitemsetsexploredduringfrequentitemsetgeneration.Theuseofsupportforpruningcandidateitemsetsisguidedbythefollowingprinciple.
Theorem5.1(AprioriPrinciple).Ifanitemsetisfrequent,thenallofitssubsetsmustalsobefrequent.
ToillustratetheideabehindtheAprioriprinciple,considertheitemsetlatticeshowninFigure5.3 .Suppose{c,d,e}isafrequentitemset.Clearly,anytransactionthatcontains{c,d,e}mustalsocontainitssubsets,{c,d},{c,e},{d,e},{c},{d},and{e}.Asaresult,if{c,d,e}isfrequent,thenallsubsetsof{c,d,e}(i.e.,theshadeditemsetsinthisfigure)mustalsobefrequent.
Figure5.3.AnillustrationoftheAprioriprinciple.If{c,d,e}isfrequent,thenallsubsetsofthisitemsetarefrequent.
Conversely,ifanitemsetsuchas{a,b}isinfrequent,thenallofitssupersetsmustbeinfrequenttoo.AsillustratedinFigure5.4 ,theentiresubgraph
containingthesupersetsof{a,b}canbeprunedimmediatelyonce{a,b}isfoundtobeinfrequent.Thisstrategyoftrimmingtheexponentialsearchspacebasedonthesupportmeasureisknownassupport-basedpruning.Suchapruningstrategyismadepossiblebyakeypropertyofthesupportmeasure,namely,thatthesupportforanitemsetneverexceedsthesupportforitssubsets.Thispropertyisalsoknownastheanti-monotonepropertyofthesupportmeasure.
Figure5.4.Anillustrationofsupport-basedpruning.If{a,b}isinfrequent,thenallsupersetsof{a,b}areinfrequent.
Definition 5.2. (Anti-monotone Property.) A measure f possesses the anti-monotone property if for every itemset X that is a proper subset of itemset Y, i.e., X ⊂ Y, we have f(Y) ≤ f(X).
More generally, a large number of measures (see Section 5.7.1) can be applied to itemsets to evaluate various properties of itemsets. As will be shown in the next section, any measure that has the anti-monotone property can be incorporated directly into an itemset mining algorithm to effectively prune the exponential search space of candidate itemsets.
5.2.2 Frequent Itemset Generation in the Apriori Algorithm
Apriori is the first association rule mining algorithm that pioneered the use of support-based pruning to systematically control the exponential growth of candidate itemsets. Figure 5.5 provides a high-level illustration of the frequent itemset generation part of the Apriori algorithm for the transactions shown in Table 5.1. We assume that the support threshold is 60%, which is equivalent to a minimum support count equal to 3.
Figure5.5.IllustrationoffrequentitemsetgenerationusingtheApriorialgorithm.
Initially, every item is considered as a candidate 1-itemset. After counting their supports, the candidate itemsets {Cola} and {Eggs} are discarded because they appear in fewer than three transactions. In the next iteration, candidate 2-itemsets are generated using only the frequent 1-itemsets because the Apriori principle ensures that all supersets of the infrequent 1-itemsets must be infrequent. Because there are only four frequent 1-itemsets, the number of candidate 2-itemsets generated by the algorithm is (4 choose 2) = 6. Two of these six candidates, {Beer, Bread} and {Beer, Milk}, are subsequently found to be infrequent after computing their support values. The remaining four candidates are frequent, and thus will be used to generate candidate 3-itemsets. Without support-based pruning, there are (6 choose 3) = 20 candidate 3-itemsets that can be formed using the six items given in this example. With the Apriori principle, we only need to keep candidate 3-itemsets whose subsets are frequent. The only candidate that has this property is {Bread, Diapers, Milk}. However, even though the subsets of {Bread, Diapers, Milk} are frequent, the itemset itself is not.
The effectiveness of the Apriori pruning strategy can be shown by counting the number of candidate itemsets generated. A brute-force strategy of enumerating all itemsets (up to size 3) as candidates will produce (6 choose 1) + (6 choose 2) + (6 choose 3) = 6 + 15 + 20 = 41 candidates. With the Apriori principle, this number decreases to (6 choose 1) + (4 choose 2) + 1 = 6 + 6 + 1 = 13 candidates, which represents a 68% reduction in the number of candidate itemsets even in this simple example.
The pseudocode for the frequent itemset generation part of the Apriori algorithm is shown in Algorithm 5.1. Let Ck denote the set of candidate k-itemsets and Fk denote the set of frequent k-itemsets.
Algorithm 5.1 Frequent itemset generation of the Apriori algorithm.
1: k = 1.
2: Fk = {i | i ∈ I ∧ σ({i}) ≥ N × minsup}. {Find all frequent 1-itemsets}
3: repeat
4:   k = k + 1.
5:   Ck = candidate-gen(Fk−1). {Generate candidate itemsets}
6:   Ck = candidate-prune(Ck, Fk−1). {Prune candidate itemsets}
7:   for each transaction t ∈ T do
8:     Ct = subset(Ck, t). {Identify all candidates that belong to t}
9:     for each candidate itemset c ∈ Ct do
10:      σ(c) = σ(c) + 1. {Increment support count}
11:    end for
12:  end for
13:  Fk = {c | c ∈ Ck ∧ σ(c) ≥ N × minsup}. {Extract the frequent k-itemsets}
14: until Fk = ∅
15: Result = ∪ Fk.
The algorithm initially makes a single pass over the data set to determine the support of each item. Upon completion of this step, the set of all frequent 1-itemsets, F1, will be known (steps 1 and 2). Next, the algorithm will iteratively generate new candidate k-itemsets and prune unnecessary candidates that are guaranteed to be infrequent given the frequent (k−1)-itemsets found in the previous iteration (steps 5 and 6). Candidate generation and pruning is implemented using the functions candidate-gen and candidate-prune, which are described in Section 5.2.3.
To count the support of the candidates, the algorithm needs to make an additional pass over the data set (steps 7–12). The subset function is used to determine all the candidate itemsets in Ck that are contained in each transaction t. The implementation of this function is described in Section 5.2.4. After counting their supports, the algorithm eliminates all candidate itemsets whose support counts are less than N × minsup (step 13). The algorithm terminates when there are no new frequent itemsets generated, i.e., Fk = ∅ (step 14).
The frequent itemset generation part of the Apriori algorithm has two important characteristics. First, it is a level-wise algorithm; i.e., it traverses the itemset lattice one level at a time, from frequent 1-itemsets to the maximum size of frequent itemsets. Second, it employs a generate-and-test strategy for finding frequent itemsets. At each iteration (level), new candidate itemsets are generated from the frequent itemsets found in the previous iteration. The support for each candidate is then counted and tested against the minsup threshold. The total number of iterations needed by the algorithm is kmax + 1, where kmax is the maximum size of the frequent itemsets.
5.2.3CandidateGenerationandPruning
Thecandidate-genandcandidate-prunefunctionsshowninSteps5and6ofAlgorithm5.1 generatecandidateitemsetsandprunesunnecessaryonesbyperformingthefollowingtwooperations,respectively:
Ck
N×minsup
Fk=∅
kmax+1kmax
1. Candidate Generation. This operation generates new candidate k-itemsets based on the frequent (k−1)-itemsets found in the previous iteration.

2. Candidate Pruning. This operation eliminates some of the candidate k-itemsets using support-based pruning, i.e., by removing k-itemsets whose subsets are known to be infrequent in previous iterations. Note that this pruning is done without computing the actual support of these k-itemsets (which could have required comparing them against each transaction).

Algorithm 5.1 Frequent itemset generation of the Apriori algorithm.
1: k = 1.
2: F_k = { i | i ∈ I ∧ σ({i}) ≥ N × minsup }.   {Find all frequent 1-itemsets}
3: repeat
4:   k = k + 1.
5:   C_k = candidate-gen(F_{k−1}).   {Generate candidate itemsets}
6:   C_k = candidate-prune(C_k, F_{k−1}).   {Prune candidate itemsets}
7:   for each transaction t ∈ T do
8:     C_t = subset(C_k, t).   {Identify all candidates that belong to t}
9:     for each candidate itemset c ∈ C_t do
10:      σ(c) = σ(c) + 1.   {Increment support count}
11:    end for
12:  end for
13:  F_k = { c | c ∈ C_k ∧ σ(c) ≥ N × minsup }.   {Extract the frequent k-itemsets}
14: until F_k = ∅
15: Result = ⋃ F_k.
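To make the loop structure of Algorithm 5.1 concrete, the following is a minimal Python sketch, not the book's implementation. The helper logic uses the F_{k−1} × F_{k−1} merging strategy described later in Section 5.2.3, and the toy transactions are only in the spirit of the running six-item example; all names are illustrative.

from itertools import combinations

def apriori_frequent_itemsets(transactions, minsup):
    """Return {itemset (sorted tuple): support count} for all frequent itemsets."""
    N = len(transactions)
    min_count = minsup * N

    # Steps 1-2: find all frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    frequent = {iset: c for iset, c in counts.items() if c >= min_count}
    result = dict(frequent)

    k = 1
    while frequent:
        k += 1
        # Steps 5-6: generate and prune candidate k-itemsets (F_{k-1} x F_{k-1} merge).
        prev = sorted(frequent)                  # lexicographically ordered (k-1)-itemsets
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                a, b = prev[i], prev[j]
                if a[:-1] == b[:-1]:             # first k-2 items identical
                    cand = a + (b[-1],)
                    # candidate pruning: every (k-1)-subset must be frequent
                    if all(sub in frequent for sub in combinations(cand, k - 1)):
                        candidates.add(cand)

        # Steps 7-12: count the support of the surviving candidates.
        counts = dict.fromkeys(candidates, 0)
        for t in transactions:
            for cand in candidates:
                if set(cand) <= t:
                    counts[cand] += 1

        # Step 13: keep only the frequent k-itemsets.
        frequent = {iset: c for iset, c in counts.items() if c >= min_count}
        result.update(frequent)
    return result

# Toy run (hypothetical market-basket data with six items, minsup = 60%).
if __name__ == "__main__":
    T = [{"Bread", "Milk"},
         {"Bread", "Diapers", "Beer", "Eggs"},
         {"Milk", "Diapers", "Beer", "Cola"},
         {"Bread", "Milk", "Diapers", "Beer"},
         {"Bread", "Milk", "Diapers", "Cola"}]
    for itemset, count in sorted(apriori_frequent_itemsets(T, 0.6).items()):
        print(itemset, count)

On this toy data the sketch finds four frequent 1-itemsets, four frequent 2-itemsets, and a single candidate 3-itemset that turns out to be infrequent, mirroring the counts discussed above.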
Candidate Generation  In principle, there are many ways to generate candidate itemsets. An effective candidate generation procedure must be complete and non-redundant. A candidate generation procedure is said to be complete if it does not omit any frequent itemsets. To ensure completeness, the set of candidate itemsets must subsume the set of all frequent itemsets, i.e., ∀k: F_k ⊆ C_k. A candidate generation procedure is non-redundant if it does not generate the same candidate itemset more than once. For example, the candidate itemset {a, b, c, d} can be generated in many ways: by merging {a, b, c} with {d}, {b, d} with {a, c}, {c} with {a, b, d}, etc. Generation of duplicate candidates leads to wasted computations and thus should be avoided for efficiency reasons. Also, an effective candidate generation procedure should avoid generating too many unnecessary candidates. A candidate itemset is unnecessary if at least one of its subsets is infrequent, and thus, eliminated in the candidate pruning step.

Next, we will briefly describe several candidate generation procedures, including the one used by the candidate-gen function.

Brute-Force Method

The brute-force method considers every k-itemset as a potential candidate and then applies the candidate pruning step to remove any unnecessary candidates whose subsets are infrequent (see Figure 5.6). The number of candidate itemsets generated at level k is equal to $\binom{d}{k}$, where d is the total number of items. Although candidate generation is rather trivial, candidate pruning becomes extremely expensive because a large number of itemsets must be examined.
Figure 5.6. A brute-force method for generating candidate 3-itemsets.
F_{k−1} × F_1 Method

An alternative method for candidate generation is to extend each frequent (k−1)-itemset with frequent items that are not part of the (k−1)-itemset. Figure 5.7 illustrates how a frequent 2-itemset can be augmented with a frequent item that is not already one of its members to produce a candidate 3-itemset.

Figure 5.7. Generating and pruning candidate k-itemsets by merging a frequent (k−1)-itemset with a frequent item. Note that some of the candidates are unnecessary because their subsets are infrequent.
The procedure is complete because every frequent k-itemset is composed of a frequent (k−1)-itemset and a frequent 1-itemset. Therefore, all frequent k-itemsets are part of the candidate k-itemsets generated by this procedure. Figure 5.7 shows that the F_{k−1} × F_1 candidate generation method only produces four candidate 3-itemsets, instead of the $\binom{6}{3} = 20$ itemsets produced by the brute-force method. The F_{k−1} × F_1 method generates a lower number of candidates because every candidate is guaranteed to contain at least one frequent (k−1)-itemset. While this procedure is a substantial improvement over the brute-force method, it can still produce a large number of unnecessary candidates, as the remaining subsets of a candidate itemset can still be infrequent.

Note that the approach discussed above does not prevent the same candidate itemset from being generated more than once. For instance, {Bread, Diapers, Milk} can be generated by merging {Bread, Diapers} with {Milk}, {Bread, Milk} with {Diapers}, or {Diapers, Milk} with {Bread}. One way to avoid generating duplicate candidates is by ensuring that the items in each frequent itemset are kept sorted in their lexicographic order. For example, itemsets such as {Bread, Diapers}, {Bread, Diapers, Milk}, and {Diapers, Milk} follow lexicographic order as the items within every itemset are arranged alphabetically. Each frequent (k−1)-itemset X is then extended with frequent items that are lexicographically larger than the items in X. For example, the itemset {Bread, Diapers} can be augmented with {Milk} because Milk is lexicographically larger than Bread and Diapers. However, we should not augment {Diapers, Milk} with {Bread} nor {Bread, Milk} with {Diapers} because they violate the lexicographic ordering condition. Every candidate k-itemset is thus generated exactly once, by merging the lexicographically largest item with the remaining k − 1 items in the itemset. If the F_{k−1} × F_1 method is used in conjunction with lexicographic ordering, then only two candidate 3-itemsets will be produced in the example illustrated in Figure 5.7. The other two candidate 3-itemsets will not be generated because the 2-itemset they would extend is not frequent.
F_{k−1} × F_{k−1} Method

This candidate generation procedure, which is used in the candidate-gen function of the Apriori algorithm, merges a pair of frequent (k−1)-itemsets only if their first k − 2 items, arranged in lexicographic order, are identical. Let A = {a_1, a_2, …, a_{k−1}} and B = {b_1, b_2, …, b_{k−1}} be a pair of frequent (k−1)-itemsets, arranged lexicographically. A and B are merged if they satisfy the following conditions:

a_i = b_i (for i = 1, 2, …, k − 2).

Note that in this case, a_{k−1} ≠ b_{k−1} because A and B are two distinct itemsets. The candidate k-itemset generated by merging A and B consists of the first k − 2 common items followed by a_{k−1} and b_{k−1} in lexicographic order. This candidate generation procedure is complete, because for every lexicographically ordered frequent k-itemset, there exist two lexicographically ordered frequent (k−1)-itemsets that have identical items in the first k − 2 positions.
In Figure 5.8, the frequent itemsets {Bread, Diapers} and {Bread, Milk} are merged to form a candidate 3-itemset {Bread, Diapers, Milk}. The algorithm does not have to merge {Beer, Diapers} with {Diapers, Milk} because the first item in both itemsets is different. Indeed, if {Beer, Diapers, Milk} were a viable candidate, it would have been obtained by merging {Beer, Diapers} with {Beer, Milk} instead. This example illustrates both the completeness of the candidate generation procedure and the advantages of using lexicographic ordering to prevent duplicate candidates. Also, if we order the frequent (k−1)-itemsets according to their lexicographic rank, itemsets with identical first k − 2 items would take consecutive ranks. As a result, the F_{k−1} × F_{k−1} candidate generation method would consider merging a frequent itemset only with ones that occupy the next few ranks in the sorted list, thus saving some computations.
Figure 5.8. Generating and pruning candidate k-itemsets by merging pairs of frequent (k−1)-itemsets.

Figure 5.8 shows that the F_{k−1} × F_{k−1} candidate generation procedure results in only one candidate 3-itemset. This is a considerable reduction from the four candidate 3-itemsets generated by the F_{k−1} × F_1 method. This is because the F_{k−1} × F_{k−1} method ensures that every candidate k-itemset contains at least two frequent (k−1)-itemsets, thus greatly reducing the number of candidates that are generated in this step.

Note that there can be multiple ways of merging two frequent (k−1)-itemsets in the F_{k−1} × F_{k−1} procedure, one of which is merging if their first k − 2 items are identical. An alternate approach could be to merge two frequent (k−1)-itemsets A and B if the last k − 2 items of A are identical to the first k − 2 items of B. For example, {Bread, Diapers} and {Diapers, Milk} could be merged using this approach to generate the candidate 3-itemset {Bread, Diapers, Milk}. As we will see later, this alternate F_{k−1} × F_{k−1} procedure is useful in generating sequential patterns, which will be discussed in Chapter 6.
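The two operations of this method can be isolated in a short hedged Python sketch, assuming frequent (k−1)-itemsets are kept as sorted tuples; the function names candidate_gen and candidate_prune simply echo the terminology of Algorithm 5.1 and are not the book's code.

from itertools import combinations

def candidate_gen(freq_k_minus_1):
    """F_{k-1} x F_{k-1} merge: combine two frequent (k-1)-itemsets whose
    first k-2 items are identical."""
    prev = sorted(freq_k_minus_1)
    cands = set()
    for i in range(len(prev)):
        for j in range(i + 1, len(prev)):
            a, b = prev[i], prev[j]
            if a[:-1] == b[:-1]:                 # identical first k-2 items
                cands.add(a + (b[-1],))          # keep lexicographic order
    return cands

def candidate_prune(cands, freq_k_minus_1, k):
    """Drop any candidate with an infrequent (k-1)-subset (Apriori principle)."""
    return {c for c in cands
            if all(s in freq_k_minus_1 for s in combinations(c, k - 1))}

# Frequent 2-itemsets of the running example; only one candidate 3-itemset survives.
F2 = {("Beer", "Diapers"), ("Bread", "Diapers"),
      ("Bread", "Milk"), ("Diapers", "Milk")}
C3 = candidate_prune(candidate_gen(F2), F2, 3)
print(C3)    # {('Bread', 'Diapers', 'Milk')}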
Candidate Pruning  To illustrate the candidate pruning operation for a candidate k-itemset, X = {i_1, i_2, …, i_k}, consider its k proper subsets, X − {i_j} (∀ j = 1, 2, …, k). If any of them are infrequent, then X is immediately pruned by using the Apriori principle. Note that we don't need to explicitly ensure that all subsets of X of size less than k − 1 are frequent (see Exercise 7). This approach greatly reduces the number of candidate itemsets considered during support counting. For the brute-force candidate generation method, candidate pruning requires checking only k subsets of size k − 1 for each candidate k-itemset. However, since the F_{k−1} × F_1 candidate generation strategy ensures that at least one of the (k−1)-size subsets of every candidate k-itemset is frequent, we only need to check for the remaining k − 1 subsets. Likewise, the F_{k−1} × F_{k−1} strategy requires examining only k − 2 subsets of every candidate k-itemset, since two of its (k−1)-size subsets are already known to be frequent in the candidate generation step.

5.2.4 Support Counting

Support counting is the process of determining the frequency of occurrence for every candidate itemset that survives the candidate pruning step. Support counting is implemented in steps 7 through 12 of Algorithm 5.1. A brute-force approach for doing this is to compare each transaction against every candidate itemset (see Figure 5.2) and to update the support counts of candidates contained in a transaction. This approach is computationally expensive, especially when the numbers of transactions and candidate itemsets are large.
An alternative approach is to enumerate the itemsets contained in each transaction and use them to update the support counts of their respective candidate itemsets. To illustrate, consider a transaction t that contains five items, {1, 2, 3, 5, 6}. There are $\binom{5}{3} = 10$ itemsets of size 3 contained in this transaction. Some of the itemsets may correspond to the candidate 3-itemsets under investigation, in which case, their support counts are incremented. Other subsets of t that do not correspond to any candidates can be ignored.

Figure 5.9 shows a systematic way for enumerating the 3-itemsets contained in t. Assuming that each itemset keeps its items in increasing lexicographic order, an itemset can be enumerated by specifying the smallest item first, followed by the larger items. For instance, given t = {1, 2, 3, 5, 6}, all the 3-itemsets contained in t must begin with item 1, 2, or 3. It is not possible to construct a 3-itemset that begins with items 5 or 6 because there are only two items in t whose labels are greater than or equal to 5. The number of ways to specify the first item of a 3-itemset contained in t is illustrated by the Level 1 prefix tree structure depicted in Figure 5.9. For instance, 1 2356 represents a 3-itemset that begins with item 1, followed by two more items chosen from the set {2, 3, 5, 6}.

Figure 5.9. Enumerating subsets of three items from a transaction t.
After fixing the first item, the prefix tree structure at Level 2 represents the number of ways to select the second item. For example, 1 2 356 corresponds to itemsets that begin with the prefix (1 2) and are followed by the items 3, 5, or 6. Finally, the prefix tree structure at Level 3 represents the complete set of 3-itemsets contained in t. For example, the 3-itemsets that begin with prefix {1 2} are {1, 2, 3}, {1, 2, 5}, and {1, 2, 6}, while those that begin with prefix {2 3} are {2, 3, 5} and {2, 3, 6}.

The prefix tree structure shown in Figure 5.9 demonstrates how itemsets contained in a transaction can be systematically enumerated, i.e., by specifying their items one by one, from the leftmost item to the rightmost item. We still have to determine whether each enumerated 3-itemset corresponds to an existing candidate itemset. If it matches one of the candidates, then the support count of the corresponding candidate is incremented. In the next section, we illustrate how this matching operation can be performed efficiently using a hash tree structure.
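Before turning to the hash tree, the enumeration-based counting strategy itself can be sketched in a few lines of hypothetical Python; itertools.combinations enumerates the k-subsets of a sorted transaction in exactly the smallest-item-first order of Figure 5.9, and only subsets that match a candidate are counted.

from itertools import combinations

def count_supports_by_enumeration(transactions, candidates, k):
    """Count candidate k-itemsets (sorted tuples) by enumerating the k-subsets
    of each transaction."""
    counts = dict.fromkeys(candidates, 0)
    for t in transactions:
        items = sorted(t)
        if len(items) < k:
            continue
        # Enumerate subsets in lexicographic order, smallest item first.
        for subset in combinations(items, k):
            if subset in counts:
                counts[subset] += 1
    return counts

# Transaction t = {1, 2, 3, 5, 6} contains C(5, 3) = 10 itemsets of size 3.
cands = {(1, 2, 3), (1, 2, 6), (2, 3, 6), (3, 5, 7)}
print(count_supports_by_enumeration([{1, 2, 3, 5, 6}], cands, 3))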
Support Counting Using a Hash Tree*  In the Apriori algorithm, candidate itemsets are partitioned into different buckets and stored in a hash tree. During support counting, itemsets contained in each transaction are also hashed into their appropriate buckets. That way, instead of comparing each itemset in the transaction with every candidate itemset, it is matched only against candidate itemsets that belong to the same bucket, as shown in Figure 5.10.

Figure 5.10. Counting the support of itemsets using hash structure.

Figure 5.11 shows an example of a hash tree structure. Each internal node of the tree uses the following hash function, h(p) = (p − 1) mod 3, where mod refers to the modulo (remainder) operator, to determine which branch of the current node should be followed next. For example, items 1, 4, and 7 are hashed to the same branch (i.e., the leftmost branch) because they have the same remainder after dividing the number by 3. All candidate itemsets are stored at the leaf nodes of the hash tree. The hash tree shown in Figure 5.11 contains 15 candidate 3-itemsets, distributed across 9 leaf nodes.

Figure 5.11. Hashing a transaction at the root node of a hash tree.

Consider the transaction t = {1, 2, 3, 5, 6}. To update the support counts of the candidate itemsets, the hash tree must be traversed in such a way that all the leaf nodes containing candidate 3-itemsets belonging to t must be visited at least once. Recall that the 3-itemsets contained in t must begin with items 1, 2, or 3, as indicated by the Level 1 prefix tree structure shown in Figure 5.9. Therefore, at the root node of the hash tree, the items 1, 2, and 3 of the transaction are hashed separately. Item 1 is hashed to the left child of the root node, item 2 is hashed to the middle child, and item 3 is hashed to the right child. At the next level of the tree, the transaction is hashed on the second item listed in the Level 2 tree structure shown in Figure 5.9. For example, after hashing on item 1 at the root node, items 2, 3, and 5 of the transaction are hashed. Based on the hash function, items 2 and 5 are hashed to the middle child, while item 3 is hashed to the right child, as shown in Figure 5.12. This process continues until the leaf nodes of the hash tree are reached. The candidate itemsets stored at the visited leaf nodes are compared against the transaction. If a candidate is a subset of the transaction, its support count is incremented. Note that not all the leaf nodes are visited while traversing the hash tree, which helps in reducing the computational cost. In this example, 5 out of the 9 leaf nodes are visited and 9 out of the 15 itemsets are compared against the transaction.
Figure 5.12. Subset operation on the leftmost subtree of the root of a candidate hash tree.
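The bucketing idea can be mimicked on a small scale with a hedged Python sketch: candidate 3-itemsets are placed in leaf buckets keyed by the hashes of their three items under h(p) = (p − 1) mod 3, and each 3-subset of a transaction is compared only against the candidates in its own bucket. This is a simplification, not the exact structure of Figure 5.11 (leaves here never split further), and all names are illustrative.

from collections import defaultdict
from itertools import combinations

def h(item):
    """Hash function used at every level of the tree."""
    return (item - 1) % 3

def build_tree(candidates):
    """Bucket each sorted candidate 3-itemset by the hashes of its three items."""
    leaves = defaultdict(list)
    for c in candidates:
        leaves[(h(c[0]), h(c[1]), h(c[2]))].append(c)
    return leaves

def count_with_tree(leaves, transaction):
    """Hash each 3-subset of the transaction down the tree and count matches."""
    counts = defaultdict(int)
    for subset in combinations(sorted(transaction), 3):
        key = (h(subset[0]), h(subset[1]), h(subset[2]))
        for cand in leaves.get(key, []):      # only candidates in the same bucket
            if cand == subset:
                counts[cand] += 1
    return dict(counts)

leaves = build_tree({(1, 2, 3), (1, 5, 9), (2, 3, 6), (4, 5, 8)})
print(count_with_tree(leaves, {1, 2, 3, 5, 6}))   # {(1, 2, 3): 1, (2, 3, 6): 1}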
5.2.5 Computational Complexity

The computational complexity of the Apriori algorithm, which includes both its runtime and storage, can be affected by the following factors.

Support Threshold

Lowering the support threshold often results in more itemsets being declared as frequent. This has an adverse effect on the computational complexity of the algorithm because more candidate itemsets must be generated and counted at every level, as shown in Figure 5.13. The maximum size of frequent itemsets also tends to increase with lower support thresholds. This increases the total number of iterations to be performed by the Apriori algorithm, further increasing the computational cost.

Figure 5.13. Effect of support threshold on the number of candidate and frequent itemsets obtained from a benchmark data set.
Number of Items (Dimensionality)

As the number of items increases, more space will be needed to store the support counts of items. If the number of frequent items also grows with the dimensionality of the data, the runtime and storage requirements will increase because of the larger number of candidate itemsets generated by the algorithm.

Number of Transactions

Because the Apriori algorithm makes repeated passes over the transaction data set, its runtime increases with a larger number of transactions.

Average Transaction Width

For dense data sets, the average transaction width can be very large. This affects the complexity of the Apriori algorithm in two ways. First, the maximum size of frequent itemsets tends to increase as the average transaction width increases. As a result, more candidate itemsets must be examined during candidate generation and support counting, as illustrated in Figure 5.14. Second, as the transaction width increases, more itemsets are contained in the transaction. This will increase the number of hash tree traversals performed during support counting.

A detailed analysis of the time complexity for the Apriori algorithm is presented next.

Figure 5.14. Effect of average transaction width on the number of candidate and frequent itemsets obtained from a synthetic data set.
Generation of frequent 1-itemsets

For each transaction, we need to update the support count for every item present in the transaction. Assuming that w is the average transaction width, this operation requires O(Nw) time, where N is the total number of transactions.

Candidate generation

To generate candidate k-itemsets, pairs of frequent (k−1)-itemsets are merged to determine whether they have at least k − 2 items in common. Each merging operation requires at most k − 2 equality comparisons. Every merging step can produce at most one viable candidate k-itemset, while in the worst case, the algorithm must try to merge every pair of frequent (k−1)-itemsets found in the previous iteration. Therefore, the overall cost of merging frequent itemsets is

$$\sum_{k=2}^{w}(k-2)|C_k| \;<\; \text{Cost of merging} \;<\; \sum_{k=2}^{w}(k-2)|F_{k-1}|^2,$$

where w is the maximum transaction width. A hash tree is also constructed during candidate generation to store the candidate itemsets. Because the maximum depth of the tree is k, the cost for populating the hash tree with candidate itemsets is $O\big(\sum_{k=2}^{w} k\,|C_k|\big)$. During candidate pruning, we need to verify that the k − 2 subsets of every candidate k-itemset are frequent. Since the cost for looking up a candidate in a hash tree is O(k), the candidate pruning step requires $O\big(\sum_{k=2}^{w} k(k-2)\,|C_k|\big)$ time.

Support counting

Each transaction of width |t| produces $\binom{|t|}{k}$ itemsets of size k. This is also the effective number of hash tree traversals performed for each transaction. The cost for support counting is $O\big(N \sum_k \binom{w}{k}\alpha_k\big)$, where w is the maximum transaction width and $\alpha_k$ is the cost for updating the support count of a candidate k-itemset in the hash tree.
5.3 Rule Generation

This section describes how to extract association rules efficiently from a given frequent itemset. Each frequent k-itemset, Y, can produce up to 2^k − 2 association rules, ignoring rules that have empty antecedents or consequents (∅ → Y or Y → ∅). An association rule can be extracted by partitioning the itemset Y into two non-empty subsets, X and Y − X, such that X → Y − X satisfies the confidence threshold. Note that all such rules must have already met the support threshold because they are generated from a frequent itemset.

Example 5.2. Let X = {a, b, c} be a frequent itemset. There are six candidate association rules that can be generated from X: {a, b} → {c}, {a, c} → {b}, {b, c} → {a}, {a} → {b, c}, {b} → {a, c}, and {c} → {a, b}. As each of their support is identical to the support for X, all the rules satisfy the support threshold.

Computing the confidence of an association rule does not require additional scans of the transaction data set. Consider the rule {1, 2} → {3}, which is generated from the frequent itemset X = {1, 2, 3}. The confidence for this rule is σ({1, 2, 3})/σ({1, 2}). Because {1, 2, 3} is frequent, the anti-monotone property of support ensures that {1, 2} must be frequent, too. Since the support counts for both itemsets were already found during frequent itemset generation, there is no need to read the entire data set again.
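As a small hedged sketch of this observation, the snippet below derives rule confidences purely from a dictionary of support counts of the kind produced during frequent itemset generation; the counts shown are illustrative.

def confidence(support, antecedent, consequent):
    """conf(X -> Y-X) = sigma(Y) / sigma(X), using precomputed support counts."""
    union = tuple(sorted(set(antecedent) | set(consequent)))
    return support[union] / support[tuple(sorted(antecedent))]

# Illustrative support counts gathered during frequent itemset generation.
support = {(1,): 4, (2,): 5, (1, 2): 4, (1, 2, 3): 3}
print(confidence(support, (1, 2), (3,)))   # sigma({1,2,3}) / sigma({1,2}) = 0.75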
5.3.1 Confidence-Based Pruning
Confidence does not show the anti-monotone property in the same way as the support measure. For example, the confidence for X → Y can be larger, smaller, or equal to the confidence for another rule X̃ → Ỹ, where X̃ ⊆ X and Ỹ ⊆ Y (see Exercise 3 on page 439). Nevertheless, if we compare rules generated from the same frequent itemset Y, the following theorem holds for the confidence measure.

Theorem 5.2. Let Y be an itemset and X a subset of Y. If a rule X → Y − X does not satisfy the confidence threshold, then any rule X̃ → Y − X̃, where X̃ is a subset of X, must not satisfy the confidence threshold as well.

To prove this theorem, consider the following two rules: X̃ → Y − X̃ and X → Y − X, where X̃ ⊂ X. The confidence of the rules are σ(Y)/σ(X̃) and σ(Y)/σ(X), respectively. Since X̃ is a subset of X, σ(X̃) ≥ σ(X). Therefore, the former rule cannot have a higher confidence than the latter rule.

5.3.2 Rule Generation in Apriori Algorithm

The Apriori algorithm uses a level-wise approach for generating association rules, where each level corresponds to the number of items that belong to the
rule consequent. Initially, all the high confidence rules that have only one item in the rule consequent are extracted. These rules are then used to generate new candidate rules. For example, if {acd} → {b} and {abd} → {c} are high confidence rules, then the candidate rule {ad} → {bc} is generated by merging the consequents of both rules. Figure 5.15 shows a lattice structure for the association rules generated from the frequent itemset {a, b, c, d}. If any node in the lattice has low confidence, then according to Theorem 5.2, the entire subgraph spanned by the node can be pruned immediately. Suppose the confidence for {bcd} → {a} is low. All the rules containing item a in its consequent, including {cd} → {ab}, {bd} → {ac}, {bc} → {ad}, and {d} → {abc}, can be discarded.

Figure 5.15. Pruning of association rules using the confidence measure.

A pseudocode for the rule generation step is shown in Algorithms 5.2 and 5.3. Note the similarity between the ap-genrules procedure given in Algorithm 5.3 and the frequent itemset generation procedure given in Algorithm 5.1. The only difference is that, in rule generation, we do not have to make additional passes over the data set to compute the confidence of the candidate rules. Instead, we determine the confidence of each rule by using the support counts computed during frequent itemset generation.

Algorithm 5.2 Rule generation of the Apriori algorithm.
for each frequent k-itemset f_k, k ≥ 2 do
  H_1 = { i | i ∈ f_k }   {1-item consequents of the rule}
  call ap-genrules(f_k, H_1)
end for

Algorithm 5.3 Procedure ap-genrules(f_k, H_m).
k = |f_k|   {size of the frequent itemset}
m = |H_m|   {size of the rule consequent}
if k > m + 1 then
  H_{m+1} = candidate-gen(H_m)
  for each h_{m+1} ∈ H_{m+1} do
    conf = σ(f_k)/σ(f_k − h_{m+1})
    if conf ≥ minconf then
      output the rule (f_k − h_{m+1}) → h_{m+1}
    else
      delete h_{m+1} from H_{m+1}
    end if
  end for
  call ap-genrules(f_k, H_{m+1})
end if
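Under the same assumptions as the earlier sketches (support counts stored in a dict keyed by sorted tuples), the following hypothetical Python fragment mirrors the level-wise spirit of Algorithms 5.2 and 5.3: it grows rule consequents one item at a time and, as justified by Theorem 5.2, only extends consequents whose rules met the confidence threshold. The pairwise-union merge is a simplified stand-in for the candidate-gen step.

from itertools import combinations

def generate_rules(support, minconf):
    """Yield (antecedent, consequent, confidence) for every high-confidence rule
    extractable from the itemsets in `support` (dict of sorted tuples)."""
    for itemset in support:
        if len(itemset) < 2:
            continue
        items = set(itemset)
        consequents = [frozenset([i]) for i in items]   # level 1: single-item consequents
        m = 1
        while consequents and m < len(items):           # stop before the consequent equals Y
            kept = []
            for cons in consequents:
                ante = tuple(sorted(items - cons))
                conf = support[itemset] / support[ante]
                if conf >= minconf:
                    yield ante, tuple(sorted(cons)), conf
                    kept.append(cons)                    # Theorem 5.2: only these are extended
            # merge surviving consequents to form level m+1 (simplified candidate-gen)
            consequents = list({a | b for a, b in combinations(kept, 2)
                                if len(a | b) == m + 1})
            m += 1

# Illustrative support counts for the frequent itemset {1, 2, 3} and its subsets.
support = {(1,): 4, (2,): 5, (3,): 3, (1, 2): 4, (1, 3): 3, (2, 3): 3, (1, 2, 3): 3}
for rule in generate_rules(support, 0.75):
    print(rule)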
5.3.3 An Example: Congressional Voting Records

This section demonstrates the results of applying association analysis to the voting records of members of the United States House of Representatives. The data is obtained from the 1984 Congressional Voting Records Database, which is available at the UCI machine learning data repository. Each transaction contains information about the party affiliation for a representative along with his or her voting record on 16 key issues. There are 435 transactions and 34 items in the data set. The set of items is listed in Table 5.3.

Table 5.3. List of binary attributes from the 1984 United States Congressional Voting Records. Source: The UCI machine learning repository.

1. Republican
2. Democrat
3. handicapped-infants = yes
4. handicapped-infants = no
5. water project cost sharing = yes
6. water project cost sharing = no
7. budget-resolution = yes
8. budget-resolution = no
9. physician fee freeze = yes
10. physician fee freeze = no
11. aid to El Salvador = yes
12. aid to El Salvador = no
13. religious groups in schools = yes
14. religious groups in schools = no
15. anti-satellite test ban = yes
16. anti-satellite test ban = no
17. aid to Nicaragua = yes
18. aid to Nicaragua = no
19. MX-missile = yes
20. MX-missile = no
21. immigration = yes
22. immigration = no
23. synfuel corporation cutback = yes
24. synfuel corporation cutback = no
25. education spending = yes
26. education spending = no
27. right-to-sue = yes
28. right-to-sue = no
29. crime = yes
30. crime = no
31. duty-free-exports = yes
32. duty-free-exports = no
33. export administration act = yes
34. export administration act = no

The Apriori algorithm is then applied to the data set with minsup = 30% and minconf = 90%. Some of the high confidence rules extracted by the algorithm are shown in Table 5.4. The first two rules suggest that most of the members who voted yes for aid to El Salvador and no for budget resolution and MX missile are Republicans; while those who voted no for aid to El Salvador and yes for budget resolution and MX missile are Democrats. These high confidence rules show the key issues that divide members from both political parties.
Table 5.4. Association rules extracted from the 1984 United States Congressional Voting Records.

Association Rule | Confidence
{budget resolution = no, MX-missile = no, aid to El Salvador = yes} → {Republican} | 91.0%
{budget resolution = yes, MX-missile = yes, aid to El Salvador = no} → {Democrat} | 97.5%
{crime = yes, right-to-sue = yes, physician fee freeze = yes} → {Republican} | 93.5%
{crime = no, right-to-sue = no, physician fee freeze = no} → {Democrat} | 100%
5.4 Compact Representation of Frequent Itemsets

In practice, the number of frequent itemsets produced from a transaction data set can be very large. It is useful to identify a small representative set of frequent itemsets from which all other frequent itemsets can be derived. Two such representations are presented in this section in the form of maximal and closed frequent itemsets.

5.4.1 Maximal Frequent Itemsets

Definition 5.3. (Maximal Frequent Itemset.) A frequent itemset is maximal if none of its immediate supersets are frequent.

To illustrate this concept, consider the itemset lattice shown in Figure 5.16. The itemsets in the lattice are divided into two groups: those that are frequent and those that are infrequent. A frequent itemset border, which is represented by a dashed line, is also illustrated in the diagram. Every itemset located above the border is frequent, while those located below the border (the shaded nodes) are infrequent. Among the itemsets residing near the border, {a, d}, {a, c, e}, and {b, c, d, e} are maximal frequent itemsets because all of their immediate supersets are infrequent. For example, the itemset {a, d} is maximal frequent because all of its immediate supersets, {a, b, d}, {a, c, d}, and {a, d, e}, are infrequent. In contrast, {a, c} is non-maximal because one of its immediate supersets, {a, c, e}, is frequent.

Figure 5.16. Maximal frequent itemset.

Maximal frequent itemsets effectively provide a compact representation of frequent itemsets. In other words, they form the smallest set of itemsets from which all frequent itemsets can be derived. For example, every frequent itemset in Figure 5.16 is a subset of one of the three maximal frequent itemsets, {a, d}, {a, c, e}, and {b, c, d, e}. If an itemset is not a proper subset of any of the maximal frequent itemsets, then it is either infrequent (e.g., {a, d, e}) or maximal frequent itself (e.g., {b, c, d, e}). Hence, the maximal frequent itemsets {a, c, e}, {a, d}, and {b, c, d, e} provide a compact representation of the frequent itemsets shown in Figure 5.16. Enumerating all the subsets of maximal frequent itemsets generates the complete list of all frequent itemsets.

Maximal frequent itemsets provide a valuable representation for data sets that can produce very long, frequent itemsets, as there are exponentially many frequent itemsets in such data. Nevertheless, this approach is practical only if an efficient algorithm exists to explicitly find the maximal frequent itemsets. We briefly describe one such approach in Section 5.5.

Despite providing a compact representation, maximal frequent itemsets do not contain the support information of their subsets. For example, the supports of the maximal frequent itemsets {a, c, e}, {a, d}, and {b, c, d, e} do not provide any information about the support of their subsets except that it meets the support threshold. An additional pass over the data set is therefore needed to determine the support counts of the non-maximal frequent itemsets. In some cases, it is desirable to have a minimal representation of itemsets that preserves the support information. We describe such a representation in the next section.
5.4.2 Closed Itemsets

Closed itemsets provide a minimal representation of all itemsets without losing their support information. A formal definition of a closed itemset is presented below.

Definition 5.4. (Closed Itemset.) An itemset X is closed if none of its immediate supersets has exactly the same support count as X.

Put another way, X is not closed if at least one of its immediate supersets has the same support count as X. Examples of closed itemsets are shown in Figure 5.17. To better illustrate the support count of each itemset, we have associated each node (itemset) in the lattice with a list of its corresponding transaction IDs. For example, since the node {b, c} is associated with transaction IDs 1, 2, and 3, its support count is equal to three. From the transactions given in this diagram, notice that the support for {b} is identical to {b, c}. This is because every transaction that contains b also contains c. Hence, {b} is not a closed itemset. Similarly, since c occurs in every transaction that contains both a and d, the itemset {a, d} is not closed as it has the same support as its superset {a, c, d}. On the other hand, {b, c} is a closed itemset because it does not have the same support count as any of its supersets.

Figure 5.17. An example of the closed frequent itemsets (with minimum support equal to 40%).

An interesting property of closed itemsets is that if we know their support counts, we can derive the support count of every other itemset in the itemset lattice without making additional passes over the data set. For example, consider the 2-itemset {b, e} in Figure 5.17. Since {b, e} is not closed, its support must be equal to the support of one of its immediate supersets, {a, b, e}, {b, c, e}, and {b, d, e}. Further, none of the supersets of {b, e} can have a support greater than the support of {b, e}, due to the anti-monotone nature of the support measure. Hence, the support of {b, e} can be computed by examining the support counts of all of its immediate supersets of size three and taking their maximum value. If an immediate superset is closed (e.g., {b, c, e}), we would know its support count. Otherwise, we can recursively compute its support by examining the supports of its immediate supersets of size four. In general, the support count of any non-closed (k−1)-itemset can be determined as long as we know the support counts of all k-itemsets. Hence, one can devise an iterative algorithm that computes the support counts of itemsets at level k − 1 using the support counts of itemsets at level k, starting from the level k_max, where k_max is the size of the largest closed itemset.

Even though closed itemsets provide a compact representation of the support counts of all itemsets, they can still be exponentially large in number. Moreover, for most practical applications, we only need to determine the support count of all frequent itemsets. In this regard, closed frequent itemsets provide a compact representation of the support counts of all frequent itemsets, which can be defined as follows.

Definition 5.5. (Closed Frequent Itemset.) An itemset is a closed frequent itemset if it is closed and its support is greater than or equal to minsup.

In the previous example, assuming that the support threshold is 40%, {b, c} is a closed frequent itemset because its support is 60%. In Figure 5.17, the closed frequent itemsets are indicated by the shaded nodes.

Algorithms are available to explicitly extract closed frequent itemsets from a given data set. Interested readers may refer to the Bibliographic Notes at the
end of this chapter for further discussions of these algorithms. We can use closed frequent itemsets to determine the support counts for all non-closed frequent itemsets. For example, consider the frequent itemset {a, d} shown in Figure 5.17. Because this itemset is not closed, its support count must be equal to the maximum support count of its immediate supersets, {a, b, d}, {a, c, d}, and {a, d, e}. Also, since {a, d} is frequent, we only need to consider the support of its frequent supersets. In general, the support count of every non-closed frequent k-itemset can be obtained by considering the support of all its frequent supersets of size k + 1. For example, since the only frequent superset of {a, d} is {a, c, d}, its support is equal to the support of {a, c, d}, which is 2. Using this methodology, an algorithm can be developed to compute the support for every frequent itemset. The pseudocode for this algorithm is shown in Algorithm 5.4. The algorithm proceeds in a specific-to-general fashion, i.e., from the largest to the smallest frequent itemsets. This is because, in order to find the support for a non-closed frequent itemset, the support for all of its supersets must be known. Note that the set of all frequent itemsets can be easily computed by taking the union of all subsets of frequent closed itemsets.

Algorithm 5.4 Support counting using closed frequent itemsets.
Let C denote the set of closed frequent itemsets and k_max the size of the largest itemset in C.
F_{k_max} = { f | f ∈ C, |f| = k_max }   {Find all frequent itemsets of size k_max}
for k = k_max − 1 down to 1 do
  F_k = { f | f ⊂ f′, f′ ∈ F_{k+1}, |f| = k }   {Find all frequent itemsets of size k}
  for each f ∈ F_k do
    if f ∉ C then
      f.support = max{ f′.support | f′ ∈ F_{k+1}, f ⊂ f′ }
    end if
  end for
end for
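A minimal Python sketch of this specific-to-general derivation, assuming the closed frequent itemsets and their supports are supplied as a dict; the input below is a small hypothetical example (it corresponds to a four-transaction data set, not to Figure 5.17), and the function name is illustrative.

from itertools import combinations

def supports_from_closed(closed):
    """Derive the support of every frequent itemset from the closed frequent
    itemsets, working from the largest itemsets down to the smallest.

    closed : dict mapping frozenset -> support count
    """
    support = dict(closed)
    kmax = max(len(c) for c in closed)
    for k in range(kmax - 1, 0, -1):
        level_up = [s for s in support if len(s) == k + 1]
        for sup_set in level_up:
            for sub in combinations(sup_set, k):
                sub = frozenset(sub)
                if sub not in closed:
                    # support of a non-closed itemset = max support of its frequent supersets
                    support[sub] = max(support.get(sub, 0), support[sup_set])
    return support

# Hypothetical closed frequent itemsets of a tiny data set.
closed = {frozenset({"b", "c"}): 4, frozenset({"a", "b", "c"}): 2}
for itemset, count in supports_from_closed(closed).items():
    print(sorted(itemset), count)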
To illustrate the advantage of using closed frequent itemsets, consider the data set shown in Table 5.5, which contains ten transactions and fifteen items. The items can be divided into three groups: (1) Group A, which contains items a1 through a5; (2) Group B, which contains items b1 through b5; and (3) Group C, which contains items c1 through c5. Assuming that the support threshold is 20%, itemsets involving items from the same group are frequent, but itemsets involving items from different groups are infrequent. The total number of frequent itemsets is thus 3 × (2^5 − 1) = 93. However, there are only four closed frequent itemsets in the data: {a3, a4}, {a1, a2, a3, a4, a5}, {b1, b2, b3, b4, b5}, and {c1, c2, c3, c4, c5}. It is often sufficient to present only the closed frequent itemsets to the analysts instead of the entire set of frequent itemsets.

Table 5.5. A transaction data set for mining closed itemsets.

TID | a1 a2 a3 a4 a5 | b1 b2 b3 b4 b5 | c1 c2 c3 c4 c5
 1  |  1  1  1  1  1 |  0  0  0  0  0 |  0  0  0  0  0
 2  |  1  1  1  1  1 |  0  0  0  0  0 |  0  0  0  0  0
 3  |  1  1  1  1  1 |  0  0  0  0  0 |  0  0  0  0  0
 4  |  0  0  1  1  0 |  1  1  1  1  1 |  0  0  0  0  0
 5  |  0  0  0  0  0 |  1  1  1  1  1 |  0  0  0  0  0
 6  |  0  0  0  0  0 |  1  1  1  1  1 |  0  0  0  0  0
 7  |  0  0  0  0  0 |  0  0  0  0  0 |  1  1  1  1  1
 8  |  0  0  0  0  0 |  0  0  0  0  0 |  1  1  1  1  1
 9  |  0  0  0  0  0 |  0  0  0  0  0 |  1  1  1  1  1
10  |  0  0  0  0  0 |  0  0  0  0  0 |  1  1  1  1  1
Finally, note that all maximal frequent itemsets are closed because none of the maximal frequent itemsets can have the same support count as their immediate supersets. The relationships among frequent, closed, closed frequent, and maximal frequent itemsets are shown in Figure 5.18.

Figure 5.18. Relationships among frequent, closed, closed frequent, and maximal frequent itemsets.
5.5 Alternative Methods for Generating Frequent Itemsets*

Apriori is one of the earliest algorithms to have successfully addressed the combinatorial explosion of frequent itemset generation. It achieves this by applying the Apriori principle to prune the exponential search space. Despite its significant performance improvement, the algorithm still incurs considerable I/O overhead since it requires making several passes over the transaction data set. In addition, as noted in Section 5.2.5, the performance of the Apriori algorithm may degrade significantly for dense data sets because of the increasing width of transactions. Several alternative methods have been developed to overcome these limitations and improve upon the efficiency of the Apriori algorithm. The following is a high-level description of these methods.

Traversal of Itemset Lattice

A search for frequent itemsets can be conceptually viewed as a traversal on the itemset lattice shown in Figure 5.1. The search strategy employed by an algorithm dictates how the lattice structure is traversed during the frequent itemset generation process. Some search strategies are better than others, depending on the configuration of frequent itemsets in the lattice. An overview of these strategies is presented next.

General-to-Specific versus Specific-to-General: The Apriori algorithm uses a general-to-specific search strategy, where pairs of frequent (k−1)-itemsets are merged to obtain candidate k-itemsets. This general-to-specific search strategy is effective, provided the maximum length of a frequent itemset is not too long. The configuration of frequent itemsets that works best with this strategy is shown in Figure 5.19(a), where the darker nodes represent infrequent itemsets. Alternatively, a specific-to-general search strategy looks for more specific frequent itemsets first, before finding the more general frequent itemsets. This strategy is useful to discover maximal frequent itemsets in dense transactions, where the frequent itemset border is located near the bottom of the lattice, as shown in Figure 5.19(b). The Apriori principle can be applied to prune all subsets of maximal frequent itemsets. Specifically, if a candidate k-itemset is maximal frequent, we do not have to examine any of its subsets of size k − 1. However, if the candidate k-itemset is infrequent, we need to check all of its k − 1 subsets in the next iteration. Another approach is to combine both general-to-specific and specific-to-general search strategies. This bidirectional approach requires more space to store the candidate itemsets, but it can help to rapidly identify the frequent itemset border, given the configuration shown in Figure 5.19(c).

Figure 5.19. General-to-specific, specific-to-general, and bidirectional search.
Equivalence Classes: Another way to envision the traversal is to first partition the lattice into disjoint groups of nodes (or equivalence classes). A frequent itemset generation algorithm searches for frequent itemsets within a particular equivalence class first before moving to another equivalence class. As an example, the level-wise strategy used in the Apriori algorithm can be considered to be partitioning the lattice on the basis of itemset sizes; i.e., the algorithm discovers all frequent 1-itemsets first before proceeding to larger-sized itemsets. Equivalence classes can also be defined according to the prefix or suffix labels of an itemset. In this case, two itemsets belong to the same equivalence class if they share a common prefix or suffix of length k. In the prefix-based approach, the algorithm can search for frequent itemsets starting with the prefix a before looking for those starting with prefixes b, c, and so on. Both prefix-based and suffix-based equivalence classes can be demonstrated using the tree-like structure shown in Figure 5.20.

Figure 5.20. Equivalence classes based on the prefix and suffix labels of itemsets.

Breadth-First versus Depth-First: The Apriori algorithm traverses the lattice in a breadth-first manner, as shown in Figure 5.21(a). It first discovers all the frequent 1-itemsets, followed by the frequent 2-itemsets, and so on, until no new frequent itemsets are generated. The itemset lattice can also be traversed in a depth-first manner, as shown in Figures 5.21(b) and 5.22. The algorithm can start from, say, node a in Figure 5.22, and count its support to determine whether it is frequent. If so, the algorithm progressively expands the next level of nodes, i.e., ab, abc, and so on, until an infrequent node is reached, say, abcd. It then backtracks to another branch, say, abce, and continues the search from there.

Figure 5.21. Breadth-first and depth-first traversals.

Figure 5.22. Generating candidate itemsets using the depth-first approach.

The depth-first approach is often used by algorithms designed to find maximal frequent itemsets. This approach allows the frequent itemset border to be detected more quickly than using a breadth-first approach. Once a maximal frequent itemset is found, substantial pruning can be performed on its subsets. For example, if the node bcde shown in Figure 5.22 is maximal frequent, then the algorithm does not have to visit the subtrees rooted at bd, be, c, d, and e because they will not contain any maximal frequent itemsets. However, if abc is maximal frequent, only the nodes such as ac and bc are not maximal frequent (but the subtrees of ac and bc may still contain maximal frequent itemsets). The depth-first approach also allows a different kind of pruning based on the support of itemsets. For example, suppose the support for {a, b, c} is identical to the support for {a, b}. The subtrees rooted at abd and abe can be skipped because they are guaranteed not to have any maximal frequent itemsets. The proof of this is left as an exercise to the readers.

Representation of Transaction Data Set

There are many ways to represent a transaction data set. The choice of representation can affect the I/O costs incurred when computing the support of candidate itemsets. Figure 5.23 shows two different ways of representing market basket transactions. The representation on the left is called a horizontal data layout, which is adopted by many association rule mining algorithms, including Apriori. Another possibility is to store the list of transaction identifiers (TID-list) associated with each item. Such a representation is known as the vertical data layout. The support for each candidate itemset is obtained by intersecting the TID-lists of its subset items. The length of the TID-lists shrinks as we progress to larger sized itemsets. However, one problem with this approach is that the initial set of TID-lists might be too large to fit into main memory, thus requiring more sophisticated techniques to compress the TID-lists. We describe another effective approach to represent the data in the next section.

Figure 5.23. Horizontal and vertical data format.
5.6 FP-Growth Algorithm*

This section presents an alternative algorithm called FP-growth that takes a radically different approach to discovering frequent itemsets. The algorithm does not subscribe to the generate-and-test paradigm of Apriori. Instead, it encodes the data set using a compact data structure called an FP-tree and extracts frequent itemsets directly from this structure. The details of this approach are presented next.

5.6.1 FP-Tree Representation

An FP-tree is a compressed representation of the input data. It is constructed by reading the data set one transaction at a time and mapping each transaction onto a path in the FP-tree. As different transactions can have several items in common, their paths might overlap. The more the paths overlap with one another, the more compression we can achieve using the FP-tree structure. If the size of the FP-tree is small enough to fit into main memory, this will allow us to extract frequent itemsets directly from the structure in memory instead of making repeated passes over the data stored on disk.

Figure 5.24 shows a data set that contains ten transactions and five items. The structures of the FP-tree after reading the first three transactions are also depicted in the diagram. Each node in the tree contains the label of an item along with a counter that shows the number of transactions mapped onto the given path. Initially, the FP-tree contains only the root node represented by the null symbol. The FP-tree is subsequently extended in the following way:

Figure 5.24. Construction of an FP-tree.

1. The data set is scanned once to determine the support count of each item. Infrequent items are discarded, while the frequent items are sorted in decreasing support counts inside every transaction of the data set. For the data set shown in Figure 5.24, a is the most frequent item, followed by b, c, d, and e.

2. The algorithm makes a second pass over the data to construct the FP-tree. After reading the first transaction, {a, b}, the nodes labeled as a and b are created. A path is then formed from null → a → b to encode the transaction. Every node along the path has a frequency count of 1.

3. After reading the second transaction, {b, c, d}, a new set of nodes is created for items b, c, and d. A path is then formed to represent the transaction by connecting the nodes null → b → c → d. Every node along this path also has a frequency count equal to one. Although the first two transactions have an item in common, which is b, their paths are disjoint because the transactions do not share a common prefix.

4. The third transaction, {a, c, d, e}, shares a common prefix item (which is a) with the first transaction. As a result, the path for the third transaction, null → a → c → d → e, overlaps with the path for the first transaction, null → a → b. Because of their overlapping path, the frequency count for node a is incremented to two, while the frequency counts for the newly created nodes, c, d, and e, are equal to one.

5. This process continues until every transaction has been mapped onto one of the paths given in the FP-tree. The resulting FP-tree after reading all the transactions is shown at the bottom of Figure 5.24.

The size of an FP-tree is typically smaller than the size of the uncompressed data because many transactions in market basket data often share a few items in common. In the best-case scenario, where all the transactions have the same set of items, the FP-tree contains only a single branch of nodes. The worst-case scenario happens when every transaction has a unique set of items. As none of the transactions have any items in common, the size of the FP-tree is effectively the same as the size of the original data. However, the physical storage requirement for the FP-tree is higher because it requires additional space to store pointers between nodes and counters for each item.
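A minimal Python sketch of this two-pass construction, assuming the transactions are lists of item labels; the class and field names are illustrative, and a header table of node lists stands in for the dashed node-links of Figures 5.24 and 5.25.

from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}                  # item label -> child FPNode

def build_fptree(transactions, min_count):
    # Pass 1: support count of each item; infrequent items are discarded.
    support = defaultdict(int)
    for t in transactions:
        for item in set(t):
            support[item] += 1
    frequent = {i for i, c in support.items() if c >= min_count}

    root = FPNode(None, None)
    header = defaultdict(list)              # item -> list of nodes with that label
    # Pass 2: insert each transaction with items sorted by decreasing support count.
    for t in transactions:
        items = sorted((i for i in set(t) if i in frequent),
                       key=lambda i: (-support[i], i))
        node = root
        for item in items:
            if item not in node.children:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            node = node.children[item]
            node.count += 1                 # one more transaction mapped onto this path
    return root, header

# Toy usage on the first three transactions described above (hypothetical data).
tree, header = build_fptree([["a", "b"], ["b", "c", "d"], ["a", "c", "d", "e"]], 1)
print({item: [n.count for n in nodes] for item, nodes in header.items()})

After the third insertion the node for a carries a count of 2 while the newly created nodes carry counts of 1, matching the description of steps 2 through 4.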
The size of an FP-tree also depends on how the items are ordered. The notion of ordering items in decreasing order of support counts relies on the possibility that the high support items occur more frequently across all paths and hence must be used as most commonly occurring prefixes. For example, if the ordering scheme in the preceding example is reversed, i.e., from lowest to highest support item, the resulting FP-tree is shown in Figure 5.25. The tree appears to be denser because the branching factor at the root node has increased from 2 to 5 and the number of nodes containing the high support items such as a and b has increased from 3 to 12. Nevertheless, ordering by decreasing support counts does not always lead to the smallest tree, especially when the high support items do not occur frequently together with the other items. For example, suppose we augment the data set given in Figure 5.24 with 100 transactions that contain {e}, 80 transactions that contain {d}, 60 transactions that contain {c}, and 40 transactions that contain {b}. Item e is now most frequent, followed by d, c, b, and a. With the augmented transactions, ordering by decreasing support counts will result in an FP-tree similar to Figure 5.25, while a scheme based on increasing support counts produces a smaller FP-tree similar to Figure 5.24(iv).

Figure 5.25. An FP-tree representation for the data set shown in Figure 5.24 with a different item ordering scheme.

An FP-tree also contains a list of pointers connecting nodes that have the same items. These pointers, represented as dashed lines in Figures 5.24 and 5.25, help to facilitate the rapid access of individual items in the tree. We explain how to use the FP-tree and its corresponding pointers for frequent itemset generation in the next section.
5.6.2 Frequent Itemset Generation in FP-Growth Algorithm

FP-growth is an algorithm that generates frequent itemsets from an FP-tree by exploring the tree in a bottom-up fashion. Given the example tree shown in Figure 5.24, the algorithm looks for frequent itemsets ending in e first, followed by d, c, b, and finally, a. This bottom-up strategy for finding frequent itemsets ending with a particular item is equivalent to the suffix-based approach described in Section 5.5. Since every transaction is mapped onto a path in the FP-tree, we can derive the frequent itemsets ending with a particular item, say, e, by examining only the paths containing node e. These paths can be accessed rapidly using the pointers associated with node e. The extracted paths are shown in Figure 5.26(a). Similar paths for itemsets ending in d, c, b, and a are shown in Figures 5.26(b), (c), (d), and (e), respectively.

Figure 5.26. Decomposing the frequent itemset generation problem into multiple subproblems, where each subproblem involves finding frequent itemsets ending in e, d, c, b, and a.

FP-growth finds all the frequent itemsets ending with a particular suffix by employing a divide-and-conquer strategy to split the problem into smaller subproblems. For example, suppose we are interested in finding all frequent itemsets ending in e. To do this, we must first check whether the itemset {e} itself is frequent. If it is frequent, we consider the subproblem of finding frequent itemsets ending in de, followed by ce, be, and ae. In turn, each of these subproblems are further decomposed into smaller subproblems. By merging the solutions obtained from the subproblems, all the frequent itemsets ending in e can be found. Finally, the set of all frequent itemsets can be generated by merging the solutions to the subproblems of finding frequent itemsets ending in e, d, c, b, and a. This divide-and-conquer approach is the key strategy employed by the FP-growth algorithm.
For a more concrete example on how to solve the subproblems, consider the task of finding frequent itemsets ending with e.

1. The first step is to gather all the paths containing node e. These initial paths are called prefix paths and are shown in Figure 5.27(a).

Figure 5.27. Example of applying the FP-growth algorithm to find frequent itemsets ending in e.

2. From the prefix paths shown in Figure 5.27(a), the support count for e is obtained by adding the support counts associated with node e. Assuming that the minimum support count is 2, {e} is declared a frequent itemset because its support count is 3.

3. Because {e} is frequent, the algorithm has to solve the subproblems of finding frequent itemsets ending in de, ce, be, and ae. Before solving these subproblems, it must first convert the prefix paths into a conditional FP-tree, which is structurally similar to an FP-tree, except it is used to find frequent itemsets ending with a particular suffix. A conditional FP-tree is obtained in the following way:

a. First, the support counts along the prefix paths must be updated because some of the counts include transactions that do not contain item e. For example, the rightmost path shown in Figure 5.27(a) includes a transaction {b, c} that does not contain item e. The counts along the prefix path must therefore be adjusted to 1 to reflect the actual number of transactions containing {b, c, e}.

b. The prefix paths are truncated by removing the nodes for e. These nodes can be removed because the support counts along the prefix paths have been updated to reflect only transactions that contain e, and the subproblems of finding frequent itemsets ending in de, ce, be, and ae no longer need information about node e.

c. After updating the support counts along the prefix paths, some of the items may no longer be frequent. For example, the node b appears only once and has a support count equal to 1, which means that there is only one transaction that contains both b and e. Item b can be safely ignored from subsequent analysis because all itemsets ending in be must be infrequent.

The conditional FP-tree for e is shown in Figure 5.27(b). The tree looks different than the original prefix paths because the frequency counts have been updated and the nodes b and e have been eliminated.

4. FP-growth uses the conditional FP-tree for e to solve the subproblems of finding frequent itemsets ending in de, ce, and ae. To find the frequent itemsets ending in de, the prefix paths for d are gathered from the conditional FP-tree for e (Figure 5.27(c)). By adding the frequency counts associated with node d, we obtain the support count for {d, e}. Since the support count is equal to 2, {d, e} is declared a frequent itemset. Next, the algorithm constructs the conditional FP-tree for de using the approach described in step 3. After updating the support counts and removing the infrequent item c, the conditional FP-tree for de is shown in Figure 5.27(d). Since the conditional FP-tree contains only one item, a, whose support is equal to minsup, the algorithm extracts the frequent itemset {a, d, e} and moves on to the next subproblem, which is to generate frequent itemsets ending in ce. After processing the prefix paths for c, {c, e} is found to be frequent. However, the conditional FP-tree for c will have no frequent items and thus will be eliminated. The algorithm proceeds to solve the next subproblem and finds {a, e} to be the only frequent itemset remaining.
This example illustrates the divide-and-conquer approach used in the FP-growth algorithm. At each recursive step, a conditional FP-tree is constructed by updating the frequency counts along the prefix paths and removing all infrequent items. Because the subproblems are disjoint, FP-growth will not generate any duplicate itemsets. In addition, the counts associated with the nodes allow the algorithm to perform support counting while generating the common suffix itemsets.

FP-growth is an interesting algorithm because it illustrates how a compact representation of the transaction data set helps to efficiently generate frequent itemsets. In addition, for certain transaction data sets, FP-growth outperforms the standard Apriori algorithm by several orders of magnitude. The run-time performance of FP-growth depends on the compaction factor of the data set. If the resulting conditional FP-trees are very bushy (in the worst case, a full prefix tree), then the performance of the algorithm degrades significantly because it has to generate a large number of subproblems and merge the results returned by each subproblem.
5.7 Evaluation of Association Patterns

Although the Apriori principle significantly reduces the exponential search space of candidate itemsets, association analysis algorithms still have the potential to generate a large number of patterns. For example, although the data set shown in Table 5.1 contains only six items, it can produce hundreds of association rules at particular support and confidence thresholds. As the size and dimensionality of real commercial databases can be very large, we can easily end up with thousands or even millions of patterns, many of which might not be interesting. Identifying the most interesting patterns from the multitude of all possible ones is not a trivial task because "one person's trash might be another person's treasure." It is therefore important to establish a set of well-accepted criteria for evaluating the quality of association patterns.

The first set of criteria can be established through a data-driven approach to define objective interestingness measures. These measures can be used to rank patterns (itemsets or rules) and thus provide a straightforward way of dealing with the enormous number of patterns that are found in a data set. Some of the measures can also provide statistical information, e.g., itemsets that involve a set of unrelated items or cover very few transactions are considered uninteresting because they may capture spurious relationships in the data and should be eliminated. Examples of objective interestingness measures include support, confidence, and correlation.

The second set of criteria can be established through subjective arguments. A pattern is considered subjectively uninteresting unless it reveals unexpected information about the data or provides useful knowledge that can lead to profitable actions. For example, the rule {Butter} → {Bread} may not be interesting, despite having high support and confidence values, because the relationship represented by the rule might seem rather obvious. On the other hand, the rule {Diapers} → {Beer} is interesting because the relationship is quite unexpected and may suggest a new cross-selling opportunity for retailers. Incorporating subjective knowledge into pattern evaluation is a difficult task because it requires a considerable amount of prior information from domain experts. Readers interested in subjective interestingness measures may refer to resources listed in the bibliography at the end of this chapter.
5.7.1 Objective Measures of Interestingness

An objective measure is a data-driven approach for evaluating the quality of association patterns. It is domain-independent and requires only that the user specifies a threshold for filtering low-quality patterns. An objective measure is usually computed based on the frequency counts tabulated in a contingency table. Table 5.6 shows an example of a contingency table for a pair of binary variables, A and B. We use the notation A¯ (B¯) to indicate that A (B) is absent from a transaction. Each entry f_ij in this 2 × 2 table denotes a frequency count. For example, f_11 is the number of times A and B appear together in the same transaction, while f_01 is the number of transactions that contain B but not A. The row sum f_1+ represents the support count for A, while the column sum f_+1 represents the support count for B. Finally, even though our discussion focuses mainly on asymmetric binary variables, note that contingency tables are also applicable to other attribute types such as symmetric binary, nominal, and ordinal variables.

Table 5.6. A 2-way contingency table for variables A and B.

       B      B¯
A     f_11   f_10   f_1+
A¯    f_01   f_00   f_0+
      f_+1   f_+0    N

Limitations of the Support-Confidence Framework  The classical association rule mining formulation relies on the support and confidence measures to eliminate uninteresting patterns. The drawback of support, which is described more fully in Section 5.8, is that many potentially interesting patterns involving low support items might be eliminated by the support threshold. The drawback of confidence is more subtle and is best demonstrated with the following example.

Example 5.3. Suppose we are interested in analyzing the relationship between people who drink tea and coffee. We may gather information about the beverage preferences among a group of people and summarize their responses into a contingency table such as the one shown in Table 5.7.

Table 5.7. Beverage preferences among a group of 1000 people.

        Coffee   Coffee¯
Tea        150        50    200
Tea¯       650       150    800
           800       200   1000
The information given in this table can be used to evaluate the association rule {Tea} → {Coffee}. At first glance, it may appear that people who drink tea also tend to drink coffee because the rule's support (15%) and confidence (75%) values are reasonably high. This argument would have been acceptable except that the fraction of people who drink coffee, regardless of whether they drink tea, is 80%, while the fraction of tea drinkers who drink coffee is only 75%. Thus knowing that a person is a tea drinker actually decreases her probability of being a coffee drinker from 80% to 75%! The rule {Tea} → {Coffee} is therefore misleading despite its high confidence value.

Now consider a similar problem where we are interested in analyzing the relationship between people who drink tea and people who use honey in their beverage. Table 5.8 summarizes the information gathered over the same group of people about their preferences for drinking tea and using honey. If we evaluate the association rule {Tea} → {Honey} using this information, we will find that the confidence value of this rule is merely 50%, which might be easily rejected using a reasonable threshold on the confidence value, say 70%. One thus might consider that the preference of a person for drinking tea has no influence on her preference for using honey. However, the fraction of people who use honey, regardless of whether they drink tea, is only 12%. Hence, knowing that a person drinks tea significantly increases her probability of using honey from 12% to 50%. Further, the fraction of people who do not drink tea but use honey is only 2.5%! This suggests that there is definitely some information in the preference of a person for using honey given that she drinks tea. The rule {Tea} → {Honey} may therefore be falsely rejected if confidence is used as the evaluation measure.

Table 5.8. Information about people who drink tea and people who use honey in their beverage.

        Honey   Honey¯
Tea       100      100    200
Tea¯       20      780    800
          120      880   1000
Note that if we take the support of coffee drinkers into account, we would not be surprised to find that many of the people who drink tea also drink coffee, since the overall number of coffee drinkers is quite large by itself. What is more surprising is that the fraction of tea drinkers who drink coffee is actually less than the overall fraction of people who drink coffee, which points to an inverse relationship between tea drinkers and coffee drinkers. Similarly, if we account for the fact that the support of using honey is inherently small, it is easy to understand that the fraction of tea drinkers who use honey will naturally be small. Instead, what is important to measure is the change in the fraction of honey users, given the information that they drink tea.

The limitations of the confidence measure are well-known and can be understood from a statistical perspective as follows. The support of a variable measures the probability of its occurrence, while the support s(A, B) of a pair of variables A and B measures the probability of the two variables occurring together. Hence, the joint probability P(A, B) can be written as

$$P(A, B) = s(A, B) = \frac{f_{11}}{N}.$$

If we assume A and B are statistically independent, i.e., there is no relationship between the occurrences of A and B, then P(A, B) = P(A) × P(B). Hence, under the assumption of statistical independence between A and B, the support s_indep(A, B) of A and B can be written as

$$s_{indep}(A, B) = s(A) \times s(B), \quad \text{or equivalently,} \quad s_{indep}(A, B) = \frac{f_{1+}}{N} \times \frac{f_{+1}}{N}. \qquad (5.4)$$

If the support between two variables, s(A, B), is equal to s_indep(A, B), then A and B can be considered to be unrelated to each other. However, if s(A, B) is widely different from s_indep(A, B), then A and B are most likely dependent. Hence, any deviation of s(A, B) from s(A) × s(B) can be seen as an indication of a statistical relationship between A and B. Since the confidence measure only considers the deviance of s(A, B) from s(A) and not from s(A) × s(B), it fails to account for the support of the consequent, namely s(B). This results in the detection of spurious patterns (e.g., {Tea} → {Coffee}) and the rejection of truly interesting patterns (e.g., {Tea} → {Honey}), as illustrated in the previous example.

Various objective measures have been used to capture the deviance of s(A, B) from s_indep(A, B) that are not susceptible to the limitations of the confidence measure. Below, we provide a brief description of some of these measures and discuss some of their properties.

Interest Factor  The interest factor, which is also called the "lift," can be defined as follows:

$$I(A, B) = \frac{s(A, B)}{s(A) \times s(B)} = \frac{N f_{11}}{f_{1+} f_{+1}}. \qquad (5.5)$$

Notice that s(A) × s(B) = s_indep(A, B). Hence, the interest factor measures the ratio of the support of a pattern s(A, B) against its baseline support s_indep(A, B) computed under the statistical independence assumption. Using Equations 5.5 and 5.4, we can interpret the measure as follows:

$$I(A, B) \begin{cases} = 1, & \text{if } A \text{ and } B \text{ are independent;} \\ > 1, & \text{if } A \text{ and } B \text{ are positively related;} \\ < 1, & \text{if } A \text{ and } B \text{ are negatively related.} \end{cases} \qquad (5.6)$$

For the tea-coffee example shown in Table 5.7, I = 0.15/(0.2 × 0.8) = 0.9375, thus suggesting a slight negative relationship between tea drinkers and coffee drinkers. Also, for the tea-honey example shown in Table 5.8, I = 0.1/(0.12 × 0.2) = 4.1667, suggesting a strong positive relationship between people who drink tea and people who use honey in their beverage. We can thus see that the interest factor is able to detect meaningful patterns in the tea-coffee and tea-honey examples. Indeed, the interest factor has a number of statistical advantages over the confidence measure that make it a suitable measure for analyzing statistical independence between variables.
Piatetsky-Shapiro (PS) Measure  Instead of computing the ratio between s(A, B) and s(A) × s(B), the PS measure considers the difference between s(A, B) and s(A) × s(B) as follows.

$$PS = s(A, B) - s(A) \times s(B) = \frac{f_{11}}{N} - \frac{f_{1+} f_{+1}}{N^2} \qquad (5.7)$$

The PS value is 0 when A and B are mutually independent of each other. Otherwise, PS > 0 when there is a positive relationship between the two variables, and PS < 0 when there is a negative relationship.

Correlation Analysis  Correlation analysis is one of the most popular techniques for analyzing relationships between a pair of variables. For continuous variables, correlation is defined using Pearson's correlation coefficient (see Equation 2.10 on page 83). For binary variables, correlation can be measured using the ϕ-coefficient, which is defined as

$$\phi = \frac{f_{11} f_{00} - f_{01} f_{10}}{\sqrt{f_{1+} f_{+1} f_{0+} f_{+0}}}. \qquad (5.8)$$

If we rearrange the terms in Equation 5.8, we can show that the ϕ-coefficient can be rewritten in terms of the support measures of A, B, and {A, B} as follows:

$$\phi = \frac{s(A, B) - s(A) \times s(B)}{\sqrt{s(A) \times (1 - s(A)) \times s(B) \times (1 - s(B))}}. \qquad (5.9)$$

Note that the numerator in the above equation is identical to the PS measure. Hence, the ϕ-coefficient can be understood as a normalized version of the PS measure, where the value of the ϕ-coefficient ranges from −1 to +1. From a statistical viewpoint, the correlation captures the normalized difference between s(A, B) and s_indep(A, B). A correlation value of 0 means no relationship, while a value of +1 suggests a perfect positive relationship and a value of −1 suggests a perfect negative relationship. The correlation measure has a statistical meaning and hence is widely used to evaluate the strength of statistical independence among variables. For instance, the correlation between tea and coffee drinkers in Table 5.7 is −0.0625, which is slightly less than 0. On the other hand, the correlation between people who drink tea and people who use honey in Table 5.8 is 0.5847, suggesting a positive relationship.
ISMeasureISisanalternativemeasureforcapturingtherelationshipbetweens(A,B)and
.TheISmeasureisdefinedasfollows:
AlthoughthedefinitionofISlooksquitesimilartotheinterestfactor,theysharesomeinterestingdifferences.SinceISisthegeometricmeanbetweentheinterestfactorandthesupportofapattern,ISislargewhenboththeinterestfactorandsupportarelarge.Hence,iftheinterestfactoroftwopatternsareidentical,theIShasapreferenceofselectingthepatternwithhighersupport.ItisalsopossibletoshowthatISismathematicallyequivalent
ϕ-coefficient
ϕ=s(A,B)−s(A)×s(B)s(A)×(1−s(A))×s(B)×(1−s(B)). (5.9)
ϕ-coefficientϕ-coefficient −1 +1
sindep+1
−1
−0.0625
s(A)×s(B)
IS(A,B)=I(A,B)×s(A,B)=s(A,B)s(A)s(B)=f11f1+f+1. (5.10)
tothecosinemeasureforbinaryvariables(seeEquation2.6 onpage81 ).ThevalueofISthusvariesfrom0to1,whereanISvalueof0correspondstonoco-occurrenceofthetwovariables,whileanISvalueof1denotesperfectrelationship,sincetheyoccurinexactlythesametransactions.Forthetea-coffeeexampleshowninTable5.7 ,thevalueofISisequalto0.375,whilethevalueofISforthetea-honeyexampleinTable5.8 is0.6455.TheISmeasurethusgivesahighervalueforthe
rulethanthe rule,whichisconsistentwithourunderstandingofthetworules.
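The following Python sketch (not part of the original text, and only a minimal illustration) computes the four measures above from the cells of a 2×2 contingency table. The counts used in the example calls are the ones implied by the tea-coffee and tea-honey examples (e.g., 150 of 1000 transactions contain both tea and coffee, 200 contain tea, and 800 contain coffee).

```python
import math

def measures_2x2(f11, f10, f01, f00):
    """Interest factor, PS, phi-coefficient, and IS (cosine) for a 2x2 contingency table."""
    N = f11 + f10 + f01 + f00
    f1p, fp1 = f11 + f10, f11 + f01          # row and column margins for A = 1 and B = 1
    f0p, fp0 = f01 + f00, f10 + f00
    s_ab, s_a, s_b = f11 / N, f1p / N, fp1 / N
    interest = s_ab / (s_a * s_b)                                     # Equation 5.5
    ps = s_ab - s_a * s_b                                             # Equation 5.7
    phi = (f11 * f00 - f01 * f10) / math.sqrt(f1p * fp1 * f0p * fp0)  # Equation 5.8
    cosine_is = f11 / math.sqrt(f1p * fp1)                            # Equation 5.10
    return interest, ps, phi, cosine_is

# Tea-coffee table: I ~= 0.9375, phi ~= -0.0625, IS = 0.375
print(measures_2x2(150, 50, 650, 150))
# Tea-honey table: I ~= 4.1667, phi ~= 0.5847, IS ~= 0.6455
print(measures_2x2(100, 100, 20, 780))
```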
Alternative Objective Interestingness Measures  Note that all of the measures defined above use different techniques to capture the deviance between s(A,B) and s_indep(A,B). Some measures use the ratio between s(A,B) and s_indep(A,B), e.g., the interest factor and IS, while some other measures consider the difference between the two, e.g., the PS and the ϕ-coefficient. Some measures are bounded in a particular range, e.g., the IS and the ϕ-coefficient, while others are unbounded and do not have a defined maximum or minimum value, e.g., the interest factor. Because of such differences, these measures behave differently when applied to different types of patterns. Indeed, the measures defined above are not exhaustive and there exist many alternative measures for capturing different properties of relationships between pairs of binary variables. Table 5.9 provides the definitions of some of these measures in terms of the frequency counts of a 2×2 contingency table.

Table 5.9. Examples of objective measures for the itemset {A,B}.

Measure (Symbol)          Definition
Correlation (ϕ)           (N f_{11} − f_{1+} f_{+1}) / \sqrt{f_{1+} f_{+1} f_{0+} f_{+0}}
Odds ratio (α)            (f_{11} f_{00}) / (f_{10} f_{01})
Kappa (κ)                 (N f_{11} + N f_{00} − f_{1+} f_{+1} − f_{0+} f_{+0}) / (N^2 − f_{1+} f_{+1} − f_{0+} f_{+0})
Interest (I)              (N f_{11}) / (f_{1+} f_{+1})
Cosine (IS)               f_{11} / \sqrt{f_{1+} f_{+1}}
Piatetsky-Shapiro (PS)    f_{11}/N − (f_{1+} f_{+1})/N^2
Collective strength (S)   (f_{11} + f_{00}) / (f_{1+} f_{+1} + f_{0+} f_{+0}) \times (N − f_{1+} f_{+1} − f_{0+} f_{+0}) / (N − f_{11} − f_{00})
Jaccard (ζ)               f_{11} / (f_{1+} + f_{+1} − f_{11})
All-confidence (h)        min[f_{11}/f_{1+}, f_{11}/f_{+1}]

Consistency among Objective Measures  Given the wide variety of measures available, it is reasonable to question whether the measures can produce similar ordering results when applied to a set of association patterns. If the measures are consistent, then we can choose any one of them as our evaluation metric. Otherwise, it is important to understand what their differences are in order to determine which measure is more suitable for analyzing certain types of patterns.

Suppose the measures defined in Table 5.9 are applied to rank the ten contingency tables shown in Table 5.10. These contingency tables are chosen to illustrate the differences among the existing measures. The ordering produced by these measures is shown in Table 5.11 (with 1 as the most interesting and 10 as the least interesting table). Although some of the measures appear to be consistent with each other, others produce quite different ordering results. For example, the rankings given by the ϕ-coefficient agree mostly with those provided by κ and collective strength, but are quite different from the rankings produced by the interest factor. Furthermore, a contingency table such as E10 is ranked lowest according to the ϕ-coefficient, but highest according to the interest factor.
Table 5.10. Example of contingency tables.

Example   f_{11}   f_{10}   f_{01}   f_{00}
E1        8123     83       424      1370
E2        8330     2        622      1046
E3        3954     3080     5        2961
E4        2886     1363     1320     4431
E5        1500     2000     500      6000
E6        4000     2000     1000     3000
E7        9481     298      127      94
E8        4000     2000     2000     2000
E9        7450     2483     4        63
E10       61       2483     4        7452

Table 5.11. Rankings of contingency tables using the measures given in Table 5.9.

       ϕ    α    κ    I    IS   PS   S    ζ    h
E1     1    3    1    6    2    2    1    2    2
E2     2    1    2    7    3    5    2    3    3
E3     3    2    4    4    5    1    3    6    8
E4     4    8    3    3    7    3    4    7    5
E5     5    7    6    2    9    6    6    9    9
E6     6    9    5    5    6    4    5    5    7
E7     7    6    7    9    1    8    7    1    1
E8     8    10   8    8    8    7    8    8    7
E9     9    4    9    10   4    9    9    4    4
E10    10   5    10   1    10   10   10   10   10
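As a rough illustration of how such a comparison can be carried out, the following Python sketch (assumed, not from the text) ranks the ten contingency tables of Table 5.10 under a few of the measures in Table 5.9; the resulting orderings should match the corresponding columns of Table 5.11.

```python
import math

# Contingency tables E1-E10 from Table 5.10, given as (f11, f10, f01, f00)
tables = {
    "E1": (8123, 83, 424, 1370),   "E2": (8330, 2, 622, 1046),
    "E3": (3954, 3080, 5, 2961),   "E4": (2886, 1363, 1320, 4431),
    "E5": (1500, 2000, 500, 6000), "E6": (4000, 2000, 1000, 3000),
    "E7": (9481, 298, 127, 94),    "E8": (4000, 2000, 2000, 2000),
    "E9": (7450, 2483, 4, 63),     "E10": (61, 2483, 4, 7452),
}

def phi(f11, f10, f01, f00):
    N = f11 + f10 + f01 + f00
    f1p, fp1, f0p, fp0 = f11 + f10, f11 + f01, f01 + f00, f10 + f00
    return (N * f11 - f1p * fp1) / math.sqrt(f1p * fp1 * f0p * fp0)

def interest(f11, f10, f01, f00):
    N = f11 + f10 + f01 + f00
    return N * f11 / ((f11 + f10) * (f11 + f01))

def cosine_is(f11, f10, f01, f00):
    return f11 / math.sqrt((f11 + f10) * (f11 + f01))

for name, measure in [("phi", phi), ("I", interest), ("IS", cosine_is)]:
    # Sort tables from most to least interesting according to this measure
    ranking = sorted(tables, key=lambda t: measure(*tables[t]), reverse=True)
    print(name, ranking)
```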
Properties of Objective Measures  The results shown in Table 5.11 suggest that the measures greatly differ from each other and can provide conflicting information about the quality of a pattern. In fact, no measure is universally best for all applications. In the following, we describe some properties of the measures that play an important role in determining whether they are suited for a certain application.

Inversion Property

Consider the binary vectors shown in Figure 5.28. The 0/1 value in each column vector indicates whether a transaction (row) contains a particular item (column). For example, the vector A indicates that the item appears in the first and last transactions, whereas the vector B indicates that the item is contained only in the fifth transaction. The vectors Ā and B̄ are the inverted versions of A and B, i.e., their 1 values have been changed to 0 values (presence to absence) and vice versa. Applying this transformation to a binary vector is called inversion. If a measure is invariant under the inversion operation, then its value for the vector pair {Ā, B̄} should be identical to its value for {A, B}. The inversion property of a measure can be tested as follows.

Figure 5.28. Effect of the inversion operation. The vectors Ā and B̄ are inversions of vectors A and B, respectively.

Definition 5.6. (Inversion Property.) An objective measure M is invariant under the inversion operation if its value remains the same when exchanging the frequency counts f_{11} with f_{00} and f_{10} with f_{01}.
Measures that are invariant under the inversion operation include the correlation (ϕ-coefficient), odds ratio, κ, and collective strength. These measures are especially useful in scenarios where the presence (1's) of a variable is as important as its absence (0's). For example, if we compare two sets of answers to a series of true/false questions where 0's (true) and 1's (false) are equally meaningful, we should use a measure that gives equal importance to occurrences of 0-0's and 1-1's. For the vectors shown in Figure 5.28, the ϕ-coefficient is equal to −0.1667 regardless of whether we consider the pair {A, B} or the pair {Ā, B̄}. Similarly, the odds ratio for both pairs of vectors is equal to a constant value of 0. Note that even though the ϕ-coefficient and the odds ratio are invariant to inversion, they can still show different results, as will be shown later.

Measures that do not remain invariant under the inversion operation include the interest factor and the IS measure. For example, the IS value for the pair {Ā, B̄} in Figure 5.28 is 0.825, which reflects the fact that the 1's in Ā and B̄ occur frequently together. However, the IS value of its inverted pair {A, B} is equal to 0, since A and B do not have any co-occurrence of 1's. For asymmetric binary variables, e.g., the occurrence of words in documents, this is indeed the desired behavior. A desirable similarity measure between asymmetric variables should not be invariant to inversion, since for these variables it is more meaningful to capture relationships based on the presence of a variable rather than its absence. On the other hand, if we are dealing with symmetric binary variables where the relationships between 0's and 1's are equally meaningful, care should be taken to ensure that the chosen measure is invariant to inversion.

Although the values of the interest factor and IS change with the inversion operation, they can still be inconsistent with each other. To illustrate this, consider Table 5.12, which shows the contingency tables for two pairs of variables, {p,q} and {r,s}. Note that r and s are inverted transformations of p and q, respectively, where the roles of 0's and 1's have just been reversed. The interest factor for {p,q} is 1.02 and for {r,s} is 4.08, which means that the interest factor finds the inverted pair {r,s} more related than the original pair {p,q}. On the contrary, the IS value decreases upon inversion from 0.9346 for {p,q} to 0.286 for {r,s}, suggesting quite an opposite trend to that of the interest factor. Even though these measures conflict with each other for this example, each may be the right choice of measure in different applications.
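A small Python sketch can make the inversion test concrete. The vectors below are written out from the description of Figure 5.28 (ten transactions, A present in the first and last, B only in the fifth); the exact layout of the figure is assumed.

```python
import math

def counts(a, b):
    """Return (f11, f10, f01, f00) for two binary vectors of equal length."""
    f11 = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    f10 = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    f01 = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    f00 = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)
    return f11, f10, f01, f00

def phi(f11, f10, f01, f00):
    return (f11 * f00 - f01 * f10) / math.sqrt(
        (f11 + f10) * (f11 + f01) * (f01 + f00) * (f10 + f00))

def cosine_is(f11, f10, f01, f00):
    return f11 / math.sqrt((f11 + f10) * (f11 + f01))

# Vectors as described for Figure 5.28
A = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
B = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
A_inv = [1 - x for x in A]   # inversion: 1's become 0's and vice versa
B_inv = [1 - x for x in B]

print(phi(*counts(A, B)), phi(*counts(A_inv, B_inv)))              # both -0.1667: invariant
print(cosine_is(*counts(A, B)), cosine_is(*counts(A_inv, B_inv)))  # 0.0 vs 0.825: not invariant
```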
Table 5.12. Contingency tables for the pairs {p,q} and {r,s}.

       p      p̄
q      880    50     930
q̄      50     20     70
       930    70     1000

       r      r̄
s      20     50     70
s̄      50     880    930
       70     930    1000
Scaling Property

Table 5.13 shows two contingency tables for gender and the grades achieved by students enrolled in a particular course. These tables can be used to study the relationship between gender and performance in the course. The second contingency table has data from the same population but has twice as many males and three times as many females. The actual number of males or females can depend upon the samples available for study, but the relationship between gender and grade should not change just because of differences in sample sizes. Similarly, if the number of students with high and low grades is changed in a new study, a measure of association between gender and grades should remain unchanged. Hence, we need a measure that is invariant to the scaling of rows or columns. The process of multiplying a row or column of a contingency table by a constant value is called a row or column scaling operation. A measure that is invariant to scaling does not change its value after any row or column scaling operation.
Table 5.13. The grade-gender example. (a) Sample data of size 100.

        Male   Female
High    30     20       50
Low     40     10       50
        70     30       100

(b) Sample data of size 230.

        Male   Female
High    60     60       120
Low     80     30       110
        140    90       230
Definition 5.7. (Scaling Invariance Property.) Let T be a contingency table with frequency counts [f_{11}; f_{10}; f_{01}; f_{00}]. Let T′ be the transformed contingency table with scaled frequency counts [k_1 k_3 f_{11}; k_2 k_3 f_{10}; k_1 k_4 f_{01}; k_2 k_4 f_{00}], where k_1, k_2, k_3, k_4 are positive constants used to scale the two rows and the two columns of T. An objective measure M is invariant under the row/column scaling operation if M(T) = M(T′).

Note that the use of the term 'scaling' here should not be confused with the scaling operation for continuous variables introduced in Chapter 2 on page 23, where all the values of a variable are multiplied by a constant factor, instead of scaling a row or column of a contingency table.

Scaling of rows and columns in contingency tables occurs in multiple ways in different applications. For example, if we are measuring the effect of a particular medical procedure on two sets of subjects, healthy and diseased, the ratio of healthy to diseased subjects can vary widely across different studies involving different groups of participants. Further, the fraction of healthy and diseased subjects chosen for a controlled study can be quite different from the true fraction observed in the complete population. These differences can result in a row or column scaling in the contingency tables for different populations of subjects. In general, the frequencies of items in a contingency table closely depend on the sample of transactions used to generate the table. Any change in the sampling procedure may result in a row or column scaling transformation. A measure that is expected to be invariant to differences in the sampling procedure must not change with row or column scaling.

Of all the measures introduced in Table 5.9, only the odds ratio (α) is invariant to row and column scaling operations. For example, the value of the odds ratio for both the tables in Table 5.13 is equal to 0.375. All other measures, such as the ϕ-coefficient, κ, IS, interest factor, and collective strength (S), change their values when the rows and columns of the contingency table are rescaled. Indeed, the odds ratio is a preferred choice of measure in the medical domain, where it is important to find relationships that do not change with differences in the population sample chosen for a study.
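The following sketch (an illustration only) checks this behavior on the two grade-gender tables of Table 5.13: the odds ratio is unchanged by the scaling, while a measure such as cosine (IS) is not.

```python
import math

def odds_ratio(f11, f10, f01, f00):
    return (f11 * f00) / (f10 * f01)

def cosine_is(f11, f10, f01, f00):
    return f11 / math.sqrt((f11 + f10) * (f11 + f01))

# Grade-gender tables from Table 5.13, cells ordered as
# (High-Male, High-Female, Low-Male, Low-Female)
t_a = (30, 20, 40, 10)                        # sample of size 100
t_b = (60, 60, 80, 30)                        # males doubled, females tripled

print(odds_ratio(*t_a), odds_ratio(*t_b))     # 0.375 and 0.375: invariant to scaling
print(cosine_is(*t_a), cosine_is(*t_b))       # changes when the columns are rescaled
```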
Null Addition Property

Suppose we are interested in analyzing the relationship between a pair of words, such as data and mining, in a set of documents. If a collection of articles about ice fishing is added to the data set, should the association between data and mining be affected? This process of adding unrelated data (in this case, documents) to a given data set is known as the null addition operation.

Definition 5.8. (Null Addition Property.) An objective measure M is invariant under the null addition operation if it is not affected by increasing f_{00}, while all other frequencies in the contingency table stay the same.

For applications such as document analysis or market basket analysis, we would like to use a measure that remains invariant under the null addition operation. Otherwise, the relationship between words can be made to change simply by adding enough documents that do not contain both words! Examples of measures that satisfy this property include the cosine (IS) and Jaccard (ζ) measures, while those that violate this property include the interest factor, PS, odds ratio, and the ϕ-coefficient.

To demonstrate the effect of null addition, consider the two contingency tables T1 and T2 shown in Table 5.14. Table T2 has been obtained from T1 by adding 1000 extra transactions with both A and B absent. This operation only affects the f_{00} entry of Table T2, which has increased from 100 to 1100, whereas all the other frequencies in the table (f_{11}, f_{10}, and f_{01}) remain the same. Since IS is invariant to null addition, it gives a constant value of 0.875 to both tables. However, the addition of 1000 extra transactions with occurrences of 0-0's changes the value of the interest factor from 0.972 for T1 (denoting a slightly negative correlation) to 1.944 for T2 (positive correlation). Similarly, the value of the odds ratio increases from 7 for T1 to 77 for T2. Hence, when the interest factor or odds ratio is used as the association measure, the relationship between variables changes with the addition of null transactions where both variables are absent. In contrast, the IS measure is invariant to null addition, since it considers two variables to be related only if they frequently occur together. Indeed, the IS measure (cosine measure) is widely used to measure similarity among documents, which is expected to depend only on the joint occurrences (1's) of words in documents, but not on their absences (0's).
Table 5.14. An example demonstrating the effect of null addition. (a) Table T1.

        B      B̄
A       700    100     800
Ā       100    100     200
        800    200     1000

(b) Table T2.

        B      B̄
A       700    100     800
Ā       100    1100    1200
        800    1200    2000
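The effect of null addition on these two measures can be verified with a few lines of Python (a sketch, using the counts of Table 5.14):

```python
import math

def odds_ratio(f11, f10, f01, f00):
    return (f11 * f00) / (f10 * f01)

def cosine_is(f11, f10, f01, f00):
    return f11 / math.sqrt((f11 + f10) * (f11 + f01))

t1 = (700, 100, 100, 100)        # Table 5.14(a)
t2 = (700, 100, 100, 1100)       # Table 5.14(b): 1000 null transactions added to f00

print(cosine_is(*t1), cosine_is(*t2))     # 0.875 and 0.875: invariant to null addition
print(odds_ratio(*t1), odds_ratio(*t2))   # 7.0 and 77.0: not invariant
```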
Table 5.15 provides a summary of the properties of the measures defined in Table 5.9. Even though this list of properties is not exhaustive, it can serve as a useful guide for selecting the right choice of measure for an application. Ideally, if we know the specific requirements of a certain application, we can ensure that the selected measure shows properties that adhere to those requirements. For example, if we are dealing with asymmetric variables, we would prefer to use a measure that is not invariant to null addition or inversion. On the other hand, if we require the measure to remain invariant to changes in the sample size, we would like to use a measure that does not change with scaling.

Table 5.15. Properties of symmetric measures.

Symbol   Measure               Inversion   Null Addition   Scaling
ϕ        ϕ-coefficient         Yes         No              No
α        odds ratio            Yes         No              Yes
κ        Cohen's               Yes         No              No
I        Interest              No          No              No
IS       Cosine                No          Yes             No
PS       Piatetsky-Shapiro's   Yes         No              No
S        Collective strength   Yes         No              No
ζ        Jaccard               No          Yes             No
h        All-confidence        No          Yes             No
s        Support               No          No              No

Asymmetric Interestingness Measures  Note that in the discussion so far, we have only considered measures that do not change their value when the order of the variables is reversed. More specifically, if M is a measure and A and B are two variables, then M(A,B) is equal to M(B,A) if the order of the variables does not matter. Such measures are called symmetric. On the other hand, measures that depend on the order of the variables (M(A,B) ≠ M(B,A)) are called asymmetric measures. For example, the interest factor is a symmetric measure because its value is identical for the rules A → B and B → A. In contrast, confidence is an asymmetric measure since the confidence for A → B and B → A may not be the same. Note that the use of the term 'asymmetric' to describe a particular type of measure of relationship, one in which the order of the variables is important, should not be confused with the use of 'asymmetric' to describe a binary variable for which only 1's are important. Asymmetric measures are more suitable for analyzing association rules, since the items in a rule do have a specific order. Even though we only considered symmetric measures to discuss the different properties of association measures, the above discussion is also relevant for asymmetric measures. See the Bibliographic Notes for more information about different kinds of asymmetric measures and their properties.
5.7.2 Measures beyond Pairs of Binary Variables

The measures shown in Table 5.9 are defined for pairs of binary variables (e.g., 2-itemsets or association rules). However, many of them, such as support and all-confidence, are also applicable to larger-sized itemsets. Other measures, such as the interest factor, IS, PS, and the Jaccard coefficient, can be extended to more than two variables using the frequency counts tabulated in a multidimensional contingency table. An example of a three-dimensional contingency table for a, b, and c is shown in Table 5.16. Each entry f_{ijk} in this table represents the number of transactions that contain a particular combination of items a, b, and c. For example, f_{101} is the number of transactions that contain a and c, but not b. On the other hand, a marginal frequency such as f_{1+1} is the number of transactions that contain a and c, irrespective of whether b is present in the transaction.

Table 5.16. Example of a three-dimensional contingency table.

c = 1:
        b          b̄
a       f_{111}    f_{101}    f_{1+1}
ā       f_{011}    f_{001}    f_{0+1}
        f_{+11}    f_{+01}    f_{++1}

c = 0:
        b          b̄
a       f_{110}    f_{100}    f_{1+0}
ā       f_{010}    f_{000}    f_{0+0}
        f_{+10}    f_{+00}    f_{++0}
Given a k-itemset {i_1, i_2, \ldots, i_k}, the condition for statistical independence can be stated as follows:

f_{i_1 i_2 \ldots i_k} = \frac{f_{i_1 + \ldots +} \times f_{+ i_2 \ldots +} \times \cdots \times f_{+ + \ldots i_k}}{N^{k-1}}.   (5.11)

With this definition, we can extend objective measures such as the interest factor and PS, which are based on deviations from statistical independence, to more than two variables:

I = \frac{N^{k-1} \times f_{i_1 i_2 \ldots i_k}}{f_{i_1 + \ldots +} \times f_{+ i_2 \ldots +} \times \cdots \times f_{+ + \ldots i_k}},
PS = \frac{f_{i_1 i_2 \ldots i_k}}{N} − \frac{f_{i_1 + \ldots +} \times f_{+ i_2 \ldots +} \times \cdots \times f_{+ + \ldots i_k}}{N^k}.

Another approach is to define the objective measure as the maximum, minimum, or average value of the associations between pairs of items in a pattern. For example, given a k-itemset X = {i_1, i_2, \ldots, i_k}, we may define the ϕ-coefficient for X as the average ϕ-coefficient(i_p, i_q) between every pair of items in X. However, because the measure considers only pairwise associations, it may not capture all the underlying relationships within a pattern. Also, care should be taken in using such alternative measures for more than two variables, since they may not always show the anti-monotone property in the same way as the support measure, making them unsuitable for mining patterns using the Apriori principle.

Analysis of multidimensional contingency tables is more complicated because of the presence of partial associations in the data. For example, some associations may appear or disappear when conditioned upon the value of certain variables. This problem is known as Simpson's paradox and is described in Section 5.7.3. More sophisticated statistical techniques are available to analyze such relationships, e.g., loglinear models, but these techniques are beyond the scope of this book.
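As a sketch of how such an extension can be computed, the following Python function evaluates the multi-way interest factor from a k-dimensional array of counts; the 2×2×2 table used in the example is hypothetical and is only meant to show the indexing.

```python
import numpy as np

def interest_multiway(f):
    """Multi-way interest factor for a k-dimensional contingency table f
    (an array of counts indexed by 0 = absent, 1 = present for each item)."""
    f = np.asarray(f, dtype=float)
    N = f.sum()
    k = f.ndim
    joint = f[(1,) * k]                 # transactions containing all k items
    margins = 1.0
    for axis in range(k):
        # marginal count of transactions containing the item on this axis
        margins *= f.sum(axis=tuple(a for a in range(k) if a != axis))[1]
    return N ** (k - 1) * joint / margins

# Hypothetical 2x2x2 table for items (a, b, c)
f = np.zeros((2, 2, 2))
f[1, 1, 1], f[1, 1, 0], f[1, 0, 1], f[1, 0, 0] = 20, 5, 5, 10
f[0, 1, 1], f[0, 1, 0], f[0, 0, 1], f[0, 0, 0] = 5, 10, 10, 35
print(interest_multiway(f))
```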
5.7.3 Simpson's Paradox

It is important to exercise caution when interpreting the association between variables because the observed relationship may be influenced by the presence of other confounding factors, i.e., hidden variables that are not included in the analysis. In some cases, the hidden variables may cause the observed relationship between a pair of variables to disappear or reverse its direction, a phenomenon that is known as Simpson's paradox. We illustrate the nature of this paradox with the following example.

Consider the relationship between the sale of high-definition televisions (HDTV) and exercise machines, as shown in Table 5.17. The rule {HDTV = Yes} → {Exercise machine = Yes} has a confidence of 99/180 = 55% and the rule {HDTV = No} → {Exercise machine = Yes} has a confidence of 54/120 = 45%. Together, these rules suggest that customers who buy high-definition televisions are more likely to buy exercise machines than those who do not buy high-definition televisions.

Table 5.17. A two-way contingency table between the sale of high-definition television and exercise machine.

Buy HDTV    Buy Exercise Machine
            Yes         No
Yes         99          81          180
No          54          66          120
            153         147         300
However, a deeper analysis reveals that the sales of these items depend on whether the customer is a college student or a working adult. Table 5.18 summarizes the relationship between the sale of HDTVs and exercise machines among college students and working adults. Notice that the support counts given in the table for college students and working adults sum up to the frequencies shown in Table 5.17. Furthermore, there are more working adults than college students who buy these items.

Table 5.18. Example of a three-way contingency table.

Customer Group      Buy HDTV    Buy Exercise Machine      Total
                                Yes         No
College Students    Yes         1           9             10
                    No          4           30            34
Working Adult       Yes         98          72            170
                    No          50          36            86

For college students:

c({HDTV = Yes} → {Exercise machine = Yes}) = 1/10 = 10%,
c({HDTV = No} → {Exercise machine = Yes}) = 4/34 = 11.8%,

while for working adults:

c({HDTV = Yes} → {Exercise machine = Yes}) = 98/170 = 57.7%,
c({HDTV = No} → {Exercise machine = Yes}) = 50/86 = 58.1%.
The rules suggest that, within each group, customers who do not buy high-definition televisions are more likely to buy exercise machines, which contradicts the previous conclusion when data from the two customer groups are pooled together. Even if alternative measures such as correlation, odds ratio, or interest are applied, we still find that the sale of HDTVs and exercise machines is positively related in the combined data but is negatively related in the stratified data (see Exercise 21 on page 449). The reversal in the direction of association is known as Simpson's paradox.

The paradox can be explained in the following way. First, notice that most customers who buy HDTVs are working adults. This is reflected in the high confidence of the rule {HDTV = Yes} → {Working Adult} (170/180 = 94.4%). Second, the high confidence of the rule {Exercise machine = Yes} → {Working Adult} (148/153 = 96.7%) suggests that most customers who buy exercise machines are also working adults. Since working adults form the largest fraction of customers for both HDTVs and exercise machines, the two items look related, and the rule {HDTV = Yes} → {Exercise machine = Yes} turns out to be stronger in the combined data than it would have been if the data were stratified. Hence, customer group acts as a hidden variable that affects both the fraction of customers who buy HDTVs and the fraction who buy exercise machines. If we factor out the effect of the hidden variable by stratifying the data, we see that the relationship between buying HDTVs and buying exercise machines is not direct, but shows up as an indirect consequence of the effect of the hidden variable.

Simpson's paradox can also be illustrated mathematically as follows. Suppose

a/b < c/d  and  p/q < r/s,

where a/b and p/q may represent the confidence of the rule A → B in two different strata, while c/d and r/s may represent the confidence of the rule Ā → B in the two strata. When the data is pooled together, the confidence values of the rules in the combined data are (a + p)/(b + q) and (c + r)/(d + s), respectively. Simpson's paradox occurs when

\frac{a+p}{b+q} > \frac{c+r}{d+s},

thus leading to the wrong conclusion about the relationship between the variables. The lesson here is that proper stratification is needed to avoid generating spurious patterns resulting from Simpson's paradox. For example, market basket data from a major supermarket chain should be stratified according to store locations, while medical records from various patients should be stratified according to confounding factors such as age and gender.
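The reversal can be reproduced directly from the counts in Tables 5.17 and 5.18. The following Python sketch (illustrative only) computes the confidence of the two rules within each stratum and in the pooled data:

```python
def confidence(both, antecedent):
    return both / antecedent

# Per-stratum counts from Table 5.18:
# ((exercise buyers among HDTV buyers, HDTV buyers),
#  (exercise buyers among non-HDTV buyers, non-HDTV buyers))
strata = {
    "college students": ((1, 10), (4, 34)),
    "working adults":   ((98, 170), (50, 86)),
}

pooled_yes = [0, 0]   # [both, HDTV buyers]
pooled_no = [0, 0]    # [exercise buyers among non-HDTV buyers, non-HDTV buyers]
for group, ((a, b), (c, d)) in strata.items():
    print(group, confidence(a, b), confidence(c, d))   # HDTV = No wins in every stratum
    pooled_yes[0] += a; pooled_yes[1] += b
    pooled_no[0] += c;  pooled_no[1] += d

# Pooling the strata reverses the direction of the association (Simpson's paradox)
print("combined", confidence(*pooled_yes), confidence(*pooled_no))   # 0.55 vs 0.45
```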
5.8 Effect of Skewed Support Distribution

The performances of many association analysis algorithms are influenced by properties of their input data. For example, the computational complexity of the Apriori algorithm depends on properties such as the number of items in the data, the average transaction width, and the support threshold used. This section examines another important property that has significant influence on the performance of association analysis algorithms as well as the quality of extracted patterns. More specifically, we focus on data sets with skewed support distributions, where most of the items have relatively low to moderate frequencies, but a small number of them have very high frequencies.

Figure 5.29. A transaction data set containing three items, p, q, and r, where p is a high support item and q and r are low support items.

Figure 5.29 shows an illustrative example of a data set that has a skewed support distribution of its items. While p has a high support of 83.3% in the data, q and r are low-support items with a support of 16.7%. Despite their low support, q and r always occur together in the limited number of transactions in which they appear and hence are strongly related. A pattern mining algorithm therefore should report {q, r} as interesting.

However, note that choosing the right support threshold for mining itemsets such as {q, r} can be quite tricky. If we set the threshold too high (e.g., 20%), then we may miss many interesting patterns involving low-support items such as {q, r}. Conversely, setting the support threshold too low can be detrimental to the pattern mining process for the following reasons. First, the computational and memory requirements of existing association analysis algorithms increase considerably with low support thresholds. Second, the number of extracted patterns also increases substantially with low support thresholds, which makes their analysis and interpretation difficult. In particular, we may extract many spurious patterns that relate a high-frequency item such as p to a low-frequency item such as q. Such patterns, which are called cross-support patterns, are likely to be spurious because the association between p and q is largely influenced by the frequent occurrence of p rather than the joint occurrence of p and q together. Because the support of {p, q} is quite close to the support of {q, r}, we may easily select {p, q} if we set the support threshold low enough to include {q, r}.

An example of a real data set that exhibits a skewed support distribution is shown in Figure 5.30. The data, taken from the PUMS (Public Use Microdata Sample) census data, contains 49,046 records and 2113 asymmetric binary variables. We shall treat the asymmetric binary variables as items and the records as transactions. While more than 80% of the items have support less than 1%, a handful of them have support greater than 90%. To understand the effect of a skewed support distribution on frequent itemset mining, we divide the items into three groups, G1, G2, and G3, according to their support levels, as shown in Table 5.19. We can see that more than 82% of the items belong to G1 and have a support less than 1%. In market basket analysis, such low-support items may correspond to expensive products (such as jewelry) that are seldom bought by customers, but whose patterns are still interesting to retailers. Patterns involving such low-support items, though meaningful, can easily be rejected by a frequent pattern mining algorithm with a high support threshold. On the other hand, setting a low support threshold may result in the extraction of spurious patterns that relate a high-frequency item in G3 to a low-frequency item in G1. For example, at a support threshold equal to 0.05%, there are 18,847 frequent pairs involving items from G1 and G3. Out of these, 93% are cross-support patterns; i.e., the patterns contain items from both G1 and G3.
Figure 5.30. Support distribution of items in the census data set.

Table 5.19. Grouping the items in the census data set based on their support values.

Group             G1       G2         G3
Support           <1%      1%-90%     >90%
Number of Items   1735     358        20
This example shows that a large number of weakly related cross-support patterns can be generated when the support threshold is sufficiently low. Note that finding interesting patterns in data sets with skewed support distributions is not just a challenge for the support measure; similar statements can be made about many of the other objective measures discussed in the previous sections. Before presenting a methodology for finding interesting patterns and pruning spurious ones, we formally define the concept of cross-support patterns.

Definition 5.9. (Cross-support Pattern.) Let us define the support ratio, r(X), of an itemset X = {i_1, i_2, \ldots, i_k} as

r(X) = \frac{\min[s(i_1), s(i_2), \ldots, s(i_k)]}{\max[s(i_1), s(i_2), \ldots, s(i_k)]}.   (5.12)

Given a user-specified threshold h_c, an itemset X is a cross-support pattern if r(X) < h_c.

Example 5.4. Suppose the support for milk is 70%, while the support for sugar is 10% and caviar is 0.04%. Given h_c = 0.01, the frequent itemset {milk, sugar, caviar} is a cross-support pattern because its support ratio is

r = \frac{\min[0.7, 0.1, 0.0004]}{\max[0.7, 0.1, 0.0004]} = \frac{0.0004}{0.7} = 0.00058 < 0.01.
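A direct translation of Definition 5.9 into Python might look as follows (a sketch; the function names are ours, not from the text):

```python
def support_ratio(supports):
    """Support ratio r(X) of an itemset, given the supports of its items (Equation 5.12)."""
    return min(supports) / max(supports)

def is_cross_support(supports, hc):
    return support_ratio(supports) < hc

# Example 5.4: milk 70%, sugar 10%, caviar 0.04%
print(support_ratio([0.7, 0.1, 0.0004]))           # ~0.00057
print(is_cross_support([0.7, 0.1, 0.0004], 0.01))  # True -> cross-support pattern
```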
Existing measures such as support and confidence may not be sufficient to eliminate cross-support patterns. For example, if we assume h_c = 0.3 for the data set presented in Figure 5.29, the itemsets {p,q}, {p,r}, and {p,q,r} are cross-support patterns because their support ratios, which are equal to 0.2, are less than the threshold h_c. However, their supports are comparable to that of {q,r}, making it difficult to eliminate cross-support patterns without losing interesting ones using a support-based pruning strategy. Confidence pruning also does not help because the confidence of the rules extracted from cross-support patterns can be very high. For example, the confidence for {q} → {p} is 80% even though {p,q} is a cross-support pattern. The fact that a cross-support pattern can produce a high-confidence rule should not come as a surprise because one of its items (p) appears very frequently in the data. Therefore, p is expected to appear in many of the transactions that contain q. Meanwhile, the rule {q} → {r} also has high confidence even though {q,r} is not a cross-support pattern. This example demonstrates the difficulty of using the confidence measure to distinguish between rules extracted from cross-support patterns and rules extracted from interesting patterns involving strongly connected but low-support items.

Even though the rule {q} → {p} has very high confidence, notice that the rule {p} → {q} has very low confidence because most of the transactions that contain p do not contain q. In contrast, the rule {r} → {q}, which is derived from {q,r}, has very high confidence. This observation suggests that cross-support patterns can be detected by examining the lowest confidence rule that can be extracted from a given itemset. An approach for finding the rule with the lowest confidence given an itemset can be described as follows.

1. Recall the following anti-monotone property of confidence:

conf({i_1, i_2} → {i_3, i_4, \ldots, i_k}) ≤ conf({i_1, i_2, i_3} → {i_4, i_5, \ldots, i_k}).

This property suggests that confidence never increases as we shift more items from the left- to the right-hand side of an association rule. Because of this property, the lowest confidence rule extracted from a frequent itemset contains only one item on its left-hand side. We denote the set of all rules with only one item on the left-hand side as R1.

2. Given a frequent itemset {i_1, i_2, \ldots, i_k}, the rule

{i_j} → {i_1, i_2, \ldots, i_{j-1}, i_{j+1}, \ldots, i_k}

has the lowest confidence in R1 if s(i_j) = \max[s(i_1), s(i_2), \ldots, s(i_k)]. This follows directly from the definition of confidence as the ratio between the rule's support and the support of the rule antecedent. Hence, the confidence of a rule is lowest when the support of the antecedent is highest.

3. Summarizing the previous points, the lowest confidence attainable from a frequent itemset {i_1, i_2, \ldots, i_k} is

\frac{s(\{i_1, i_2, \ldots, i_k\})}{\max[s(i_1), s(i_2), \ldots, s(i_k)]}.

This expression is also known as the h-confidence or all-confidence measure. Because of the anti-monotone property of support, the numerator of the h-confidence measure is bounded by the minimum support of any item that appears in the frequent itemset. In other words, the h-confidence of an itemset X = {i_1, i_2, \ldots, i_k} must not exceed the following expression:

h\text{-confidence}(X) ≤ \frac{\min[s(i_1), s(i_2), \ldots, s(i_k)]}{\max[s(i_1), s(i_2), \ldots, s(i_k)]}.

Note that the upper bound of h-confidence in the above equation is exactly the same as the support ratio (r) given in Equation 5.12. Because the support ratio for a cross-support pattern is always less than h_c, the h-confidence of the pattern is also guaranteed to be less than h_c. Therefore, cross-support patterns can be eliminated by ensuring that the h-confidence values of the patterns exceed h_c. As a final note, the advantages of using h-confidence go beyond eliminating cross-support patterns. The measure is also anti-monotone, i.e.,

h\text{-confidence}(\{i_1, i_2, \ldots, i_k\}) ≥ h\text{-confidence}(\{i_1, i_2, \ldots, i_{k+1}\}),

and thus can be incorporated directly into the mining algorithm. Furthermore, h-confidence ensures that the items contained in an itemset are strongly associated with each other. For example, suppose the h-confidence of an itemset X is 80%. If one of the items in X is present in a transaction, there is at least an 80% chance that the rest of the items in X also belong to the same transaction. Such strongly associated patterns involving low-support items are called hyperclique patterns.

Definition 5.10. (Hyperclique Pattern.) An itemset X is a hyperclique pattern if h-confidence(X) > h_c, where h_c is a user-specified threshold.
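The computation of h-confidence, and its use for pruning cross-support patterns, can be sketched in a few lines of Python. The transactions below are hypothetical counts chosen to be consistent with the description of Figure 5.29 (p in 25 of 30 transactions, q and r together in 5, and {q} → {p} having 80% confidence):

```python
def h_confidence(itemset, transactions):
    """h-confidence (all-confidence): support of the whole itemset divided by
    the largest support of any single item it contains."""
    n = len(transactions)
    s_itemset = sum(1 for t in transactions if set(itemset) <= t) / n
    s_max = max(sum(1 for t in transactions if item in t) / n for item in itemset)
    return s_itemset / s_max

# Hypothetical 30-transaction data set in the spirit of Figure 5.29
transactions = ([{"p"}] * 21 + [{"p", "q", "r"}] * 4 +
                [{"q", "r"}] * 1 + [{"other"}] * 4)

hc = 0.3
for itemset in [("p", "q"), ("q", "r")]:
    hconf = h_confidence(itemset, transactions)
    print(itemset, round(hconf, 3), "kept" if hconf >= hc else "pruned as cross-support")
```

With these counts, {p,q} has an h-confidence of 0.16 and is pruned, while {q,r} has an h-confidence of 1.0 and is retained, mirroring the behavior described above.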
5.9 Bibliographic Notes

The association rule mining task was first introduced by Agrawal et al. [324, 325] to discover interesting relationships among items in market basket transactions. Since its inception, extensive research has been conducted to address the various issues in association rule mining, from its fundamental concepts to its implementation and applications. Figure 5.31 shows a taxonomy of the various research directions in this area, which is generally known as association analysis. As much of the research focuses on finding patterns that appear significantly often in the data, the area is also known as frequent pattern mining. A detailed review of some of the research topics in this area can be found in [362] and in [319].

Figure 5.31. An overview of the various research directions in association analysis.

Conceptual Issues

Research on the conceptual issues of association analysis has focused on developing a theoretical formulation of association analysis and extending the formulation to new types of patterns and beyond asymmetric binary attributes.

Following the pioneering work by Agrawal et al. [324, 325], there has been a vast amount of research on developing a theoretical formulation for the association analysis problem. In [357], Gunopulos et al. showed the connection between finding maximal frequent itemsets and the hypergraph transversal problem. An upper bound on the complexity of the association analysis task was also derived. Zaki et al. [454, 456] and Pasquier et al. [407] have applied formal concept analysis to study the frequent itemset generation problem. More importantly, such research has led to the development of a class of patterns known as closed frequent itemsets [456]. Friedman et al. [355] have studied the association analysis problem in the context of bump hunting in multidimensional space. Specifically, they consider frequent itemset generation as the task of finding high-density regions in multidimensional space. Formalizing association analysis in a statistical learning framework is another active research direction [414, 435, 444], as it can help address issues related to identifying statistically significant patterns and dealing with uncertain data [320, 333, 343].

Over the years, the association rule mining formulation has been expanded to encompass other rule-based patterns, such as profile association rules [321], cyclic association rules [403], fuzzy association rules [379], exception rules [431], negative association rules [336, 418], weighted association rules [338, 413], dependence rules [422], peculiar rules [462], inter-transaction association rules [353, 440], and partial classification rules [327, 397]. Additionally, the concept of the frequent itemset has been extended to other types of patterns, including closed itemsets [407, 456], maximal itemsets [330], hyperclique patterns [449], support envelopes [428], emerging patterns [347], contrast sets [329], high-utility itemsets [340, 390], approximate or error-tolerant itemsets [358, 389, 451], and discriminative patterns [352, 401, 430]. Association analysis techniques have also been successfully applied to sequential [326, 426], spatial [371], and graph-based [374, 380, 406, 450, 455] data.

Substantial research has been conducted to extend the original association rule formulation to nominal [425], ordinal [392], interval [395], and ratio [356, 359, 425, 443, 461] attributes. One of the key issues is how to define the support measure for these attributes. A methodology was proposed by Steinbach et al. [429] to extend the traditional notion of support to more general patterns and attribute types.
Implementation Issues

Research activities in this area revolve around (1) integrating the mining capability into existing database technology, (2) developing efficient and scalable mining algorithms, (3) handling user-specified or domain-specific constraints, and (4) post-processing the extracted patterns.

There are several advantages to integrating association analysis into existing database technology. First, it can make use of the indexing and query processing capabilities of the database system. Second, it can also exploit the DBMS support for scalability, check-pointing, and parallelization [415]. The SETM algorithm developed by Houtsma et al. [370] was one of the earliest algorithms to support association rule discovery via SQL queries. Since then, numerous methods have been developed to provide capabilities for mining association rules in database systems. For example, the DMQL [363] and M-SQL [373] query languages extend the basic SQL with new operators for mining association rules. The MineRule operator [394] is an expressive SQL operator that can handle both clustered attributes and item hierarchies. Tsur et al. [439] developed a generate-and-test approach called query flocks for mining association rules. A distributed OLAP-based infrastructure was developed by Chen et al. [341] for mining multilevel association rules.

Despite its popularity, the Apriori algorithm is computationally expensive because it requires making multiple passes over the transaction database. Its runtime and storage complexities were investigated by Dunkel and Soparkar [349]. The FP-growth algorithm was developed by Han et al. in [364]. Other algorithms for mining frequent itemsets include the DHP (dynamic hashing and pruning) algorithm proposed by Park et al. [405] and the Partition algorithm developed by Savasere et al. [417]. A sampling-based frequent itemset generation algorithm was proposed by Toivonen [436]. The algorithm requires only a single pass over the data, but it can produce more candidate itemsets than necessary. The Dynamic Itemset Counting (DIC) algorithm [337] makes only 1.5 passes over the data and generates fewer candidate itemsets than the sampling-based algorithm. Other notable algorithms include the tree-projection algorithm [317] and H-Mine [408]. Survey articles on frequent itemset generation algorithms can be found in [322, 367]. A repository of benchmark data sets and software implementations of association rule mining algorithms is available at the Frequent Itemset Mining Implementations (FIMI) repository (http://fimi.cs.helsinki.fi).

Parallel algorithms have been developed to scale up association rule mining for handling big data [318, 360, 399, 420, 457]. A survey of such algorithms can be found in [453]. Online and incremental association rule mining algorithms have also been proposed by Hidber [365] and Cheung et al. [342]. More recently, new algorithms have been developed to speed up frequent itemset mining by exploiting the processing power of GPUs [459] and the MapReduce/Hadoop distributed computing framework [382, 384, 396]. For example, an implementation of frequent itemset mining for the Hadoop framework is available in the Apache Mahout software (http://mahout.apache.org).
Srikant et al. [427] have considered the problem of mining association rules in the presence of Boolean constraints such as the following:

(Cookies ∧ Milk) ∨ (descendants(Cookies) ∧ ¬ancestors(Wheat Bread))

Given such a constraint, the algorithm looks for rules that contain both cookies and milk, or rules that contain the descendant items of cookies but not ancestor items of wheat bread. Singh et al. [424] and Ng et al. [400] have also developed alternative techniques for constraint-based association rule mining. Constraints can also be imposed on the support for different itemsets. This problem was investigated by Wang et al. [442], Liu et al. in [387], and Seno et al. [419]. In addition, constraints arising from privacy concerns when mining sensitive data have led to the development of privacy-preserving frequent pattern mining techniques [334, 350, 441, 458].

One potential problem with association analysis is the large number of patterns that can be generated by current algorithms. To overcome this problem, methods to rank, summarize, and filter patterns have been developed. Toivonen et al. [437] proposed the idea of eliminating redundant rules using structural rule covers and grouping the remaining rules using clustering. Liu et al. [388] applied the statistical chi-square test to prune spurious patterns and summarized the remaining patterns using a subset of the patterns called direction setting rules. The use of objective measures to filter patterns has been investigated by many authors, including Brin et al. [336], Bayardo and Agrawal [331], Aggarwal and Yu [323], and DuMouchel and Pregibon [348]. The properties of many of these measures were analyzed by Piatetsky-Shapiro [410], Kamber and Shinghal [376], Hilderman and Hamilton [366], and Tan et al. [433]. The grade-gender example used to highlight the importance of the row and column scaling invariance property was heavily influenced by the discussion given in [398] by Mosteller. Meanwhile, the tea-coffee example illustrating the limitation of confidence was motivated by an example given in [336] by Brin et al. Because of the limitation of confidence, Brin et al. [336] proposed the idea of using the interest factor as a measure of interestingness. The all-confidence measure was proposed by Omiecinski [402]. Xiong et al. [449] introduced the cross-support property and showed that the all-confidence measure can be used to eliminate cross-support patterns. A key difficulty in using alternative objective measures besides support is their lack of a monotonicity property, which makes it difficult to incorporate the measures directly into the mining algorithms. Xiong et al. [447] have proposed an efficient method for mining correlations by introducing an upper bound function for the ϕ-coefficient. Although the measure is non-monotone, it has an upper bound expression that can be exploited for the efficient mining of strongly correlated item pairs.
Application of subjective measures to association analysis has been investigated by many authors. Silberschatz and Tuzhilin [421] presented two principles in which a rule can be considered interesting from a subjective point of view. The concept of unexpected condition rules was introduced by Liu et al. in [385]. Cooley et al. [344] analyzed the idea of combining soft belief sets using the Dempster-Shafer theory and applied this approach to identify contradictory and novel association patterns in web data. Alternative approaches include using Bayesian networks [375] and neighborhood-based information [346] to identify subjectively interesting patterns.

Visualization also helps the user to quickly grasp the underlying structure of the discovered patterns. Many commercial data mining tools display the complete set of rules (which satisfy both support and confidence threshold criteria) as a two-dimensional plot, with each axis corresponding to the antecedent or consequent itemsets of the rule. Hofmann et al. [368] proposed using Mosaic plots and Double Decker plots to visualize association rules. This approach can visualize not only a particular rule, but also the overall contingency table between itemsets in the antecedent and consequent parts of the rule. Nevertheless, this technique assumes that the rule consequent consists of only a single attribute.

Application Issues

Association analysis has been applied to a variety of application domains such as web mining [409, 432], document analysis [369], telecommunication alarm diagnosis [377], network intrusion detection [328, 345, 381], and bioinformatics [416, 446]. Applications of association and correlation pattern analysis to Earth Science studies have been investigated in [411, 412, 434]. Trajectory pattern mining [339, 372, 438] is another application of spatio-temporal association analysis, used to identify frequently traversed paths of moving objects.

Association patterns have also been applied to other learning problems such as classification [383, 386], regression [404], and clustering [361, 448, 452]. A comparison between classification and association rule mining was made by Freitas in his position paper [354]. The use of association patterns for clustering has been studied by many authors including Han et al. [361], Kosters et al. [378], Yang et al. [452], and Xiong et al. [448].
Bibliography[317]R.C.Agarwal,C.C.Aggarwal,andV.V.V.Prasad.ATreeProjection
AlgorithmforGenerationofFrequentItemsets.JournalofParallelandDistributedComputing(SpecialIssueonHighPerformanceDataMining),61(3):350–371,2001.
[318]R.C.AgarwalandJ.C.Shafer.ParallelMiningofAssociationRules.IEEETransactionsonKnowledgeandDataEngineering,8(6):962–969,March1998.
[319]C.AggarwalandJ.Han.FrequentPatternMining.Springer,2014.
[320]C.C.Aggarwal,Y.Li,J.Wang,andJ.Wang.Frequentpatternminingwithuncertaindata.InProceedingsofthe15thACMSIGKDDInternationalConferenceonKnowledgeDiscoveryandDataMining,pages29–38,Paris,France,2009.
[321]C.C.Aggarwal,Z.Sun,andP.S.Yu.OnlineGenerationofProfileAssociationRules.InProc.ofthe4thIntl.Conf.onKnowledgeDiscoveryandDataMining,pages129—133,NewYork,NY,August1996.
[322]C.C.AggarwalandP.S.Yu.MiningLargeItemsetsforAssociationRules.DataEngineeringBulletin,21(1):23–31,March1998.
[323]C.C.AggarwalandP.S.Yu.MiningAssociationswiththeCollectiveStrengthApproach.IEEETrans.onKnowledgeandDataEngineering,13(6):863–873,January/February2001.
[324]R.Agrawal,T.Imielinski,andA.Swami.Databasemining:Aperformanceperspective.IEEETransactionsonKnowledgeandDataEngineering,5:914–925,1993.
[325]R.Agrawal,T.Imielinski,andA.Swami.Miningassociationrulesbetweensetsofitemsinlargedatabases.InProc.ACMSIGMODIntl.Conf.ManagementofData,pages207–216,Washington,DC,1993.
[326]R.AgrawalandR.Srikant.MiningSequentialPatterns.InProc.ofIntl.Conf.onDataEngineering,pages3–14,Taipei,Taiwan,1995.
[327]K.Ali,S.Manganaris,andR.Srikant.PartialClassificationusingAssociationRules.InProc.ofthe3rdIntl.Conf.onKnowledgeDiscoveryandDataMining,pages115—118,NewportBeach,CA,August1997.
[328]D.Barbará,J.Couto,S.Jajodia,andN.Wu.ADAM:ATestbedforExploringtheUseofDataMininginIntrusionDetection.SIGMODRecord,30(4):15–24,2001.
[329]S.D.BayandM.Pazzani.DetectingGroupDifferences:MiningContrastSets.DataMiningandKnowledgeDiscovery,5(3):213–246,2001.
[330]R.Bayardo.EfficientlyMiningLongPatternsfromDatabases.InProc.of1998ACM-SIGMODIntl.Conf.onManagementofData,pages85–93,Seattle,WA,June1998.
[331]R.BayardoandR.Agrawal.MiningtheMostInterestingRules.InProc.ofthe5thIntl.Conf.onKnowledgeDiscoveryandDataMining,pages145–153,SanDiego,CA,August1999.
[332]Y.BenjaminiandY.Hochberg.ControllingtheFalseDiscoveryRate:APracticalandPowerfulApproachtoMultipleTesting.JournalRoyalStatisticalSocietyB,57(1):289–300,1995.
[333]T.Bernecker,H.Kriegel,M.Renz,F.Verhein,andA.Züle.Probabilisticfrequentitemsetmininginuncertaindatabases.InProceedingsofthe15thACMSIGKDDInternationalConferenceonKnowledgeDiscoveryandDataMining,pages119–128,Paris,France,2009.
[334]R.Bhaskar,S.Laxman,A.D.Smith,andA.Thakurta.Discoveringfrequentpatternsinsensitivedata.InProceedingsofthe16thACMSIGKDDInternationalConferenceonKnowledgeDiscoveryandDataMining,pages503–512,Washington,DC,2010.
[335]R.J.Bolton,D.J.Hand,andN.M.Adams.DeterminingHitRateinPatternSearch.InProc.oftheESFExploratoryWorkshoponPatternDetectionandDiscoveryinDataMining,pages36–48,London,UK,September2002.
[336]S.Brin,R.Motwani,andC.Silverstein.Beyondmarketbaskets:Generalizingassociationrulestocorrelations.InProc.ACMSIGMODIntl.Conf.ManagementofData,pages265–276,Tucson,AZ,1997.
[337]S.Brin,R.Motwani,J.Ullman,andS.Tsur.DynamicItemsetCountingandImplicationRulesformarketbasketdata.InProc.of1997ACM-SIGMODIntl.Conf.onManagementofData,pages255–264,Tucson,AZ,June1997.
[338]C.H.Cai,A.Fu,C.H.Cheng,andW.W.Kwong.MiningAssociationRuleswithWeightedItems.InProc.ofIEEEIntl.DatabaseEngineeringandApplicationsSymp.,pages68–77,Cardiff,Wales,1998.
[339]H.Cao,N.Mamoulis,andD.W.Cheung.MiningFrequentSpatio-TemporalSequentialPatterns.InProceedingsofthe5thIEEEInternationalConferenceonDataMining,pages82–89,Houston,TX,2005.
[340]R.Chan,Q.Yang,andY.Shen.MiningHighUtilityItemsets.InProceedingsofthe3rdIEEEInternationalConferenceonDataMining,pages19–26,Melbourne,FL,2003.
[341]Q.Chen,U.Dayal,andM.Hsu.ADistributedOLAPinfrastructureforE-Commerce.InProc.ofthe4thIFCISIntl.Conf.onCooperativeInformationSystems,pages209—220,Edinburgh,Scotland,1999.
[342]D.C.Cheung,S.D.Lee,andB.Kao.AGeneralIncrementalTechniqueforMaintainingDiscoveredAssociationRules.InProc.ofthe5thIntl.Conf.
onDatabaseSystemsforAdvancedApplications,pages185–194,Melbourne,Australia,1997.
[343]C.K.Chui,B.Kao,andE.Hung.MiningFrequentItemsetsfromUncertainData.InProceedingsofthe11thPacific-AsiaConferenceonKnowledgeDiscoveryandDataMining,pages47–58,Nanjing,China,2007.
[344]R.Cooley,P.N.Tan,andJ.Srivastava.DiscoveryofInterestingUsagePatternsfromWebData.InM.SpiliopoulouandB.Masand,editors,AdvancesinWebUsageAnalysisandUserProfiling,volume1836,pages163–182.LectureNotesinComputerScience,2000.
[345]P.Dokas,L.Ertöz,V.Kumar,A.Lazarevic,J.Srivastava,andP.N.Tan.DataMiningforNetworkIntrusionDetection.InProc.NSFWorkshoponNextGenerationDataMining,Baltimore,MD,2002.
[346]G.DongandJ.Li.Interestingnessofdiscoveredassociationrulesintermsofneighborhood-basedunexpectedness.InProc.ofthe2ndPacific-AsiaConf.onKnowledgeDiscoveryandDataMining,pages72–86,Melbourne,Australia,April1998.
[347]G.DongandJ.Li.EfficientMiningofEmergingPatterns:DiscoveringTrendsandDifferences.InProc.ofthe5thIntl.Conf.onKnowledgeDiscoveryandDataMining,pages43–52,SanDiego,CA,August1999.
[348]W.DuMouchelandD.Pregibon.EmpiricalBayesScreeningforMulti-ItemAssociations.InProc.ofthe7thIntl.Conf.onKnowledgeDiscovery
andDataMining,pages67–76,SanFrancisco,CA,August2001.
[349]B.DunkelandN.Soparkar.DataOrganizationandAccessforEfficientDataMining.InProc.ofthe15thIntl.Conf.onDataEngineering,pages522–529,Sydney,Australia,March1999.
[350]A.V.Evfimievski,R.Srikant,R.Agrawal,andJ.Gehrke.Privacypreservingminingofassociationrules.InProceedingsoftheEighthACMSIGKDDInternationalConferenceonKnowledgeDiscoveryandDataMining,pages217–228,Edmonton,Canada,2002.
[351]C.C.FabrisandA.A.Freitas.DiscoveringsurprisingpatternsbydetectingoccurrencesofSimpson'sparadox.InProc.ofthe19thSGESIntl.Conf.onKnowledge-BasedSystemsandAppliedArtificialIntelligence),pages148–160,Cambridge,UK,December1999.
[352]G.Fang,G.Pandey,W.Wang,M.Gupta,M.Steinbach,andV.Kumar.MiningLow-SupportDiscriminativePatternsfromDenseandHigh-DimensionalData.IEEETrans.Knowl.DataEng.,24(2):279–294,2012.
[353]L.Feng,H.J.Lu,J.X.Yu,andJ.Han.Mininginter-transactionassociationswithtemplates.InProc.ofthe8thIntl.Conf.onInformationandKnowledgeManagement,pages225–233,KansasCity,Missouri,Nov1999.
[354]A.A.Freitas.Understandingthecrucialdifferencesbetweenclassificationanddiscoveryofassociationrules—apositionpaper.SIGKDDExplorations,2(1):65–69,2000.
[355]J.H.FriedmanandN.I.Fisher.Bumphuntinginhigh-dimensionaldata.StatisticsandComputing,9(2):123–143,April1999.
[356]T.Fukuda,Y.Morimoto,S.Morishita,andT.Tokuyama.MiningOptimizedAssociationRulesforNumericAttributes.InProc.ofthe15thSymp.onPrinciplesofDatabaseSystems,pages182–191,Montreal,Canada,June1996.
[357]D.Gunopulos,R.Khardon,H.Mannila,andH.Toivonen.DataMining,HypergraphTransversals,andMachineLearning.InProc.ofthe16thSymp.onPrinciplesofDatabaseSystems,pages209–216,Tucson,AZ,May1997.
[358]R.Gupta,G.Fang,B.Field,M.Steinbach,andV.Kumar.Quantitativeevaluationofapproximatefrequentpatternminingalgorithms.InProceedingsofthe14thACMSIGKDDInternationalConferenceonKnowledgeDiscoveryandDataMining,pages301–309,LasVegas,NV,2008.
[359]E.Han,G.Karypis,andV.Kumar.Min-apriori:Analgorithmforfindingassociationrulesindatawithcontinuousattributes.DepartmentofComputerScienceandEngineering,UniversityofMinnesota,Tech.Rep,1997.
[360]E.-H.Han,G.Karypis,andV.Kumar.ScalableParallelDataMiningforAssociationRules.InProc.of1997ACM-SIGMODIntl.Conf.onManagementofData,pages277–288,Tucson,AZ,May1997.
[361]E.-H.Han,G.Karypis,V.Kumar,andB.Mobasher.ClusteringBasedonAssociationRuleHypergraphs.InProc.ofthe1997ACMSIGMODWorkshoponResearchIssuesinDataMiningandKnowledgeDiscovery,Tucson,AZ,1997.
[362]J.Han,H.Cheng,D.Xin,andX.Yan.Frequentpatternmining:currentstatusandfuturedirections.DataMiningandKnowledgeDiscovery,15(1):55–86,2007.
[363]J.Han,Y.Fu,K.Koperski,W.Wang,andO.R.Zaïane.DMQL:Adataminingquerylanguageforrelationaldatabases.InProc.ofthe1996ACMSIGMODWorkshoponResearchIssuesinDataMiningandKnowledgeDiscovery,Montreal,Canada,June1996.
[364]J.Han,J.Pei,andY.Yin.MiningFrequentPatternswithoutCandidateGeneration.InProc.ACM-SIGMODInt.Conf.onManagementofData(SIGMOD'00),pages1–12,Dallas,TX,May2000.
[365]C.Hidber.OnlineAssociationRuleMining.InProc.of1999ACM-SIGMODIntl.Conf.onManagementofData,pages145–156,Philadelphia,PA,1999.
[366]R.J.HildermanandH.J.Hamilton.KnowledgeDiscoveryandMeasuresofInterest.KluwerAcademicPublishers,2001.
[367]J.Hipp,U.Guntzer,andG.Nakhaeizadeh.AlgorithmsforAssociationRuleMining—AGeneralSurvey.SigKDDExplorations,2(1):58–64,June
2000.
[368]H.Hofmann,A.P.J.M.Siebes,andA.F.X.Wilhelm.VisualizingAssociationRuleswithInteractiveMosaicPlots.InProc.ofthe6thIntl.Conf.onKnowledgeDiscoveryandDataMining,pages227–235,Boston,MA,August2000.
[369]J.D.HoltandS.M.Chung.EfficientMiningofAssociationRulesinTextDatabases.InProc.ofthe8thIntl.Conf.onInformationandKnowledgeManagement,pages234–242,KansasCity,Missouri,1999.
[370]M.HoutsmaandA.Swami.Set-orientedMiningforAssociationRulesinRelationalDatabases.InProc.ofthe11thIntl.Conf.onDataEngineering,pages25–33,Taipei,Taiwan,1995.
[371]Y.Huang,S.Shekhar,andH.Xiong.DiscoveringCo-locationPatternsfromSpatialDatasets:AGeneralApproach.IEEETrans.onKnowledgeandDataEngineering,16(12):1472–1485,December2004.
[372]S.Hwang,Y.Liu,J.Chiu,andE.Lim.MiningMobileGroupPatterns:ATrajectory-BasedApproach.InProceedingsofthe9thPacific-AsiaConferenceonKnowledgeDiscoveryandDataMining,pages713–718,Hanoi,Vietnam,2005.
[373]T.Imielinski,A.Virmani,andA.Abdulghani.DataMine:ApplicationProgrammingInterfaceandQueryLanguageforDatabaseMining.InProc.ofthe2ndIntl.Conf.onKnowledgeDiscoveryandDataMining,pages256–262,Portland,Oregon,1996.
[374]A.Inokuchi,T.Washio,andH.Motoda.AnApriori-basedAlgorithmforMiningFrequentSubstructuresfromGraphData.InProc.ofthe4thEuropeanConf.ofPrinciplesandPracticeofKnowledgeDiscoveryinDatabases,pages13–23,Lyon,France,2000.
[375]S.JaroszewiczandD.Simovici.InterestingnessofFrequentItemsetsUsingBayesianNetworksasBackgroundKnowledge.InProc.ofthe10thIntl.Conf.onKnowledgeDiscoveryandDataMining,pages178–186,Seattle,WA,August2004.
[376]M.KamberandR.Shinghal.EvaluatingtheInterestingnessofCharacteristicRules.InProc.ofthe2ndIntl.Conf.onKnowledgeDiscoveryandDataMining,pages263–266,Portland,Oregon,1996.
[377]M.Klemettinen.AKnowledgeDiscoveryMethodologyforTelecommunicationNetworkAlarmDatabases.PhDthesis,UniversityofHelsinki,1999.
[378]W.A.Kosters,E.Marchiori,andA.Oerlemans.MiningClusterswithAssociationRules.InThe3rdSymp.onIntelligentDataAnalysis(IDA99),pages39–50,Amsterdam,August1999.
[379]C.M.Kuok,A.Fu,andM.H.Wong.MiningFuzzyAssociationRulesinDatabases.ACMSIGMODRecord,27(1):41–46,March1998.
[380]M.KuramochiandG.Karypis.FrequentSubgraphDiscovery.InProc.ofthe2001IEEEIntl.Conf.onDataMining,pages313–320,SanJose,CA,
November2001.
[381]W.Lee,S.J.Stolfo,andK.W.Mok.AdaptiveIntrusionDetection:ADataMiningApproach.ArtificialIntelligenceReview,14(6):533–567,2000.
[382]N.Li,L.Zeng,Q.He,andZ.Shi.ParallelImplementationofAprioriAlgorithmBasedonMapReduce.InProceedingsofthe13thACISInternationalConferenceonSoftwareEngineering,ArtificialIntelligence,NetworkingandParallel/DistributedComputing,pages236–241,Kyoto,Japan,2012.
[383]W.Li,J.Han,andJ.Pei.CMAR:AccurateandEfficientClassificationBasedonMultipleClass-associationRules.InProc.ofthe2001IEEEIntl.Conf.onDataMining,pages369–376,SanJose,CA,2001.
[384]M.Lin,P.Lee,andS.Hsueh.Apriori-basedfrequentitemsetminingalgorithmsonMapReduce.InProceedingsofthe6thInternationalConferenceonUbiquitousInformationManagementandCommunication,pages26–30,KualaLumpur,Malaysia,2012.
[385]B.Liu,W.Hsu,andS.Chen.UsingGeneralImpressionstoAnalyzeDiscoveredClassificationRules.InProc.ofthe3rdIntl.Conf.onKnowledgeDiscoveryandDataMining,pages31–36,NewportBeach,CA,August1997.
[386]B.Liu,W.Hsu,andY.Ma.IntegratingClassificationandAssociationRuleMining.InProc.ofthe4thIntl.Conf.onKnowledgeDiscoveryandDataMining,pages80–86,NewYork,NY,August1998.
[387]B.Liu,W.Hsu,andY.Ma.Miningassociationruleswithmultipleminimumsupports.InProc.ofthe5thIntl.Conf.onKnowledgeDiscoveryandDataMining,pages125—134,SanDiego,CA,August1999.
[388]B.Liu,W.Hsu,andY.Ma.PruningandSummarizingtheDiscoveredAssociations.InProc.ofthe5thIntl.Conf.onKnowledgeDiscoveryandDataMining,pages125–134,SanDiego,CA,August1999.
[389]J.Liu,S.Paulsen,W.Wang,A.B.Nobel,andJ.Prins.MiningApproximateFrequentItemsetsfromNoisyData.InProceedingsofthe5thIEEEInternationalConferenceonDataMining,pages721–724,Houston,TX,2005.
[390]Y.Liu,W.-K.Liao,andA.Choudhary.Atwo-phasealgorithmforfastdiscoveryofhighutilityitemsets.InProceedingsofthe9thPacific-AsiaConferenceonKnowledgeDiscoveryandDataMining,pages689–695,Hanoi,Vietnam,2005.
[391]F.Llinares-López,M.Sugiyama,L.Papaxanthos,andK.M.Borgwardt.FastandMemory-EfficientSignificantPatternMiningviaPermutationTesting.InProceedingsofthe21thACMSIGKDDInternationalConferenceonKnowledgeDiscoveryandDataMining,pages725–734,Sydney,Australia,2015.
5.10 Exercises

1. For each of the following questions, provide an example of an association rule from the market basket domain that satisfies the following conditions. Also, describe whether such rules are subjectively interesting.
a. A rule that has high support and high confidence.
b. A rule that has reasonably high support but low confidence.
c. A rule that has low support and low confidence.
d. A rule that has low support and high confidence.
2.ConsiderthedatasetshowninTable5.20 .
Table5.20.Exampleofmarketbaskettransactions.
CustomerID TransactionID ItemsBought
1 0001 {a,d,e}
1 0024 {a,b,c,e}
2 0012 {a,b,d,e}
2 0031 {a,c,d,e}
3 0015 {b,c,e}
3 0022 {b,d,e}
4 0029 {c,d}
4 0040 {a,b,c}
5 0033 {a,d,e}
5 0038 {a,b,e}
a. Compute the support for itemsets {e}, {b,d}, and {b,d,e} by treating each transaction ID as a market basket.
b. Use the results in part (a) to compute the confidence for the association rules {b,d} → {e} and {e} → {b,d}. Is confidence a symmetric measure?
c. Repeat part (a) by treating each customer ID as a market basket. Each item should be treated as a binary variable (1 if an item appears in at least one transaction bought by the customer, and 0 otherwise).
d. Use the results in part (c) to compute the confidence for the association rules {b,d} → {e} and {e} → {b,d}.
e. Suppose s1 and c1 are the support and confidence values of an association rule r when treating each transaction ID as a market basket. Also, let s2 and c2 be the support and confidence values of r when treating each customer ID as a market basket. Discuss whether there are any relationships between s1 and s2 or c1 and c2.

3.
a. What is the confidence for the rules ∅ → A and A → ∅?
b. Let c1, c2, and c3 be the confidence values of the rules {p} → {q}, {p} → {q,r}, and {p,r} → {q}, respectively. If we assume that c1, c2, and c3 have different values, what are the possible relationships that may exist among c1, c2, and c3? Which rule has the lowest confidence?
c. Repeat the analysis in part (b) assuming that the rules have identical support. Which rule has the highest confidence?
d. Transitivity: Suppose the confidence of the rules A → B and B → C are larger than some threshold, minconf. Is it possible that A → C has a confidence less than minconf?
4. For each of the following measures, determine whether it is monotone, anti-monotone, or non-monotone (i.e., neither monotone nor anti-monotone).

Example: Support, s = σ(X)/|T|, is anti-monotone because s(X) ≥ s(Y) whenever X ⊂ Y.

a. A characteristic rule is a rule of the form {p} → {q1, q2, …, qn}, where the rule antecedent contains only a single item. An itemset of size k can produce up to k characteristic rules. Let ζ be the minimum confidence of all characteristic rules generated from a given itemset:

ζ({p1, p2, …, pk}) = min[ c({p1} → {p2, p3, …, pk}), …, c({pk} → {p1, p2, …, pk−1}) ]

Is ζ monotone, anti-monotone, or non-monotone?

b. A discriminant rule is a rule of the form {p1, p2, …, pn} → {q}, where the rule consequent contains only a single item. An itemset of size k can produce up to k discriminant rules. Let η be the minimum confidence of all discriminant rules generated from a given itemset:

η({p1, p2, …, pk}) = min[ c({p2, p3, …, pk} → {p1}), …, c({p1, p2, …, pk−1} → {pk}) ]

Is η monotone, anti-monotone, or non-monotone?

c. Repeat the analysis in parts (a) and (b) by replacing the min function with a max function.
5. Prove Equation 5.3. (Hint: First, count the number of ways to create an itemset that forms the left-hand side of the rule. Next, for each size-k itemset selected for the left-hand side, count the number of ways to choose the remaining d − k items to form the right-hand side of the rule.) Assume that neither of the itemsets of a rule are empty.

6. Consider the market basket transactions shown in Table 5.21.
a. What is the maximum number of association rules that can be extracted from this data (including rules that have zero support)?
b. What is the maximum size of frequent itemsets that can be extracted (assuming minsup > 0)?

Table 5.21. Market basket transactions.
Transaction ID   Items Bought
1 {Milk,Beer,Diapers}
2 {Bread,Butter,Milk}
3 {Milk,Diapers,Cookies}
4 {Bread,Butter,Cookies}
5 {Beer,Cookies,Diapers}
6 {Milk,Diapers,Bread,Butter}
7 {Bread,Butter,Diapers}
8 {Beer,Diapers}
9 {Milk,Diapers,Bread,Butter}
10 {Beer,Cookies}
c. Write an expression for the maximum number of size-3 itemsets that can be derived from this data set.
d. Find an itemset (of size 2 or larger) that has the largest support.
e. Find a pair of items, a and b, such that the rules {a} → {b} and {b} → {a} have the same confidence.
7. Show that if a candidate k-itemset X has a subset of size less than k − 1 that is infrequent, then at least one of the (k−1)-size subsets of X is necessarily infrequent.

8. Consider the following set of frequent 3-itemsets:
{1,2,3}, {1,2,4}, {1,2,5}, {1,3,4}, {1,3,5}, {2,3,4}, {2,3,5}, {3,4,5}.
Assume that there are only five items in the data set.
a. List all candidate 4-itemsets obtained by a candidate generation procedure using the Fk−1 × F1 merging strategy.
b. List all candidate 4-itemsets obtained by the candidate generation procedure in Apriori.
c. List all candidate 4-itemsets that survive the candidate pruning step of the Apriori algorithm.

9. The Apriori algorithm uses a generate-and-count strategy for deriving frequent itemsets. Candidate itemsets of size k + 1 are created by joining a pair of frequent itemsets of size k (this is known as the candidate generation step). A candidate is discarded if any one of its subsets is found to be infrequent during the candidate pruning step. Suppose the Apriori algorithm is applied to the data set shown in Table 5.22 with minsup = 30%, i.e., any itemset occurring in less than 3 transactions is considered to be infrequent.
Table 5.22. Example of market basket transactions.
Transaction ID   Items Bought
1 {a,b,d,e}
2 {b,c,d}
3 {a,b,d,e}
4 {a,c,d,e}
5 {b,c,d,e}
6 {b,d,e}
7 {c,d}
8 {a,b,c}
9 {a,d,e}
10 {b,d}
a. Draw an itemset lattice representing the data set given in Table 5.22. Label each node in the lattice with the following letter(s):
N: If the itemset is not considered to be a candidate itemset by the Apriori algorithm. There are two reasons for an itemset not to be considered as a candidate itemset: (1) it is not generated at all during the candidate generation step, or (2) it is generated during the candidate generation step but is subsequently removed during the candidate pruning step because one of its subsets is found to be infrequent.
F: If the candidate itemset is found to be frequent by the Apriori algorithm.
I: If the candidate itemset is found to be infrequent after support counting.
b. What is the percentage of frequent itemsets (with respect to all itemsets in the lattice)?
c. What is the pruning ratio of the Apriori algorithm on this data set? (Pruning ratio is defined as the percentage of itemsets not considered to be a candidate because (1) they are not generated during candidate generation or (2) they are pruned during the candidate pruning step.)
d. What is the false alarm rate (i.e., the percentage of candidate itemsets that are found to be infrequent after performing support counting)?
10. The Apriori algorithm uses a hash tree data structure to efficiently count the support of candidate itemsets. Consider the hash tree for candidate 3-itemsets shown in Figure 5.32.

Figure 5.32. An example of a hash tree structure.

a. Given a transaction that contains items {1,3,4,5,8}, which of the hash tree leaf nodes will be visited when finding the candidates of the transaction?
b. Use the visited leaf nodes in part (a) to determine the candidate itemsets that are contained in the transaction {1,3,4,5,8}.

11. Consider the following set of candidate 3-itemsets:
{1,2,3}, {1,2,6}, {1,3,4}, {2,3,4}, {2,4,5}, {3,4,6}, {4,5,6}
a. Construct a hash tree for the above candidate 3-itemsets. Assume the tree uses a hash function where all odd-numbered items are hashed to the left child of a node, while the even-numbered items are hashed to the right child. A candidate k-itemset is inserted into the tree by hashing on each successive item in the candidate and then following the appropriate branch of the tree according to the hash value. Once a leaf node is reached, the candidate is inserted based on one of the following conditions:
Condition 1: If the depth of the leaf node is equal to k (the root is assumed to be at depth 0), then the candidate is inserted regardless of the number of itemsets already stored at the node.
Condition 2: If the depth of the leaf node is less than k, then the candidate can be inserted as long as the number of itemsets stored at the node is less than maxsize. Assume maxsize = 2 for this question.
Condition 3: If the depth of the leaf node is less than k and the number of itemsets stored at the node is equal to maxsize, then the leaf node is converted into an internal node. New leaf nodes are created as children of the old leaf node. Candidate itemsets previously stored in the old leaf node are distributed to the children based on their hash values. The new candidate is also hashed to its appropriate leaf node.
b. How many leaf nodes are there in the candidate hash tree? How many internal nodes are there?
c. Consider a transaction that contains the following items: {1,2,3,5,6}. Using the hash tree constructed in part (a), which leaf nodes will be checked against the transaction? What are the candidate 3-itemsets contained in the transaction?
12. Given the lattice structure shown in Figure 5.33 and the transactions given in Table 5.22, label each node with the following letter(s):

Figure 5.33. An itemset lattice.

M if the node is a maximal frequent itemset,
C if it is a closed frequent itemset,
N if it is frequent but neither maximal nor closed, and
I if it is infrequent.
Assume that the support threshold is equal to 30%.
13. The original association rule mining formulation uses the support and confidence measures to prune uninteresting rules.
a. Draw a contingency table for each of the following rules using the transactions shown in Table 5.23.

Table 5.23. Example of market basket transactions.
Transaction ID   Items Bought
1 {a,b,d,e}
2 {b,c,d}
3 {a,b,d,e}
4 {a,c,d,e}
5 {b,c,d,e}
6 {b,d,e}
7 {c,d}
8 {a,b,c}
9 {a,d,e}
10 {b,d}
Rules: {b} → {c}, {a} → {d}, {b} → {d}, {e} → {c}, {c} → {a}.

b. Use the contingency tables in part (a) to compute and rank the rules in decreasing order according to the following measures.
i. Support.
ii. Confidence.
iii. Interest(X → Y) = P(X,Y) / (P(X)P(Y)).
iv. IS(X → Y) = P(X,Y) / √(P(X)P(Y)).
v. Klosgen(X → Y) = √P(X,Y) × max(P(Y|X) − P(Y), P(X|Y) − P(X)), where P(Y|X) = P(X,Y)/P(X).
vi. Odds ratio(X → Y) = (P(X,Y) P(X̄,Ȳ)) / (P(X,Ȳ) P(X̄,Y)).

14. Given the rankings you had obtained in Exercise 13, compute the correlation between the rankings of confidence and the other five measures. Which measure is most highly correlated with confidence? Which measure is least correlated with confidence?

15. Answer the following questions using the data sets shown in Figure 5.34. Note that each data set contains 1000 items and 10,000 transactions. Dark cells indicate the presence of items and white cells indicate the absence of items. We will apply the Apriori algorithm to extract frequent itemsets with minsup = 10% (i.e., itemsets must be contained in at least 1000 transactions).
Figure 5.34. Figures for Exercise 15.

a. Which data set(s) will produce the most number of frequent itemsets?
b. Which data set(s) will produce the fewest number of frequent itemsets?
c. Which data set(s) will produce the longest frequent itemset?
d. Which data set(s) will produce frequent itemsets with highest maximum support?
e. Which data set(s) will produce frequent itemsets containing items with wide-varying support levels (i.e., items with mixed support, ranging from less than 20% to more than 70%)?
16.
a. Prove that the ϕ coefficient is equal to 1 if and only if f11 = f1+ = f+1.
b. Show that if A and B are independent, then P(A,B) × P(Ā,B̄) = P(A,B̄) × P(Ā,B).
c. Show that Yule's Q and Y coefficients

Q = [f11 f00 − f10 f01] / [f11 f00 + f10 f01]
Y = [√(f11 f00) − √(f10 f01)] / [√(f11 f00) + √(f10 f01)]

are normalized versions of the odds ratio.
d. Write a simplified expression for the value of each measure shown in Table 5.9 when the variables are statistically independent.

17. Consider the interestingness measure, M = (P(B|A) − P(B)) / (1 − P(B)), for an association rule A → B.
a. What is the range of this measure? When does the measure attain its maximum and minimum values?
b. How does M behave when P(A,B) is increased while P(A) and P(B) remain unchanged?
c. How does M behave when P(A) is increased while P(A,B) and P(B) remain unchanged?
d. How does M behave when P(B) is increased while P(A,B) and P(A) remain unchanged?
e. Is the measure symmetric under variable permutation?
f. What is the value of the measure when A and B are statistically independent?
g. Is the measure null-invariant?
h. Does the measure remain invariant under row or column scaling operations?
i. How does the measure behave under the inversion operation?
18. Suppose we have market basket data consisting of 100 transactions and 20 items. Assume the support for item a is 25%, the support for item b is 90%, and the support for itemset {a,b} is 20%. Let the support and confidence thresholds be 10% and 60%, respectively.
a. Compute the confidence of the association rule {a} → {b}. Is the rule interesting according to the confidence measure?
b. Compute the interest measure for the association pattern {a,b}. Describe the nature of the relationship between item a and item b in terms of the interest measure.
c. What conclusions can you draw from the results of parts (a) and (b)?
d. Prove that if the confidence of the rule {a} → {b} is less than the support of {b}, then:
i. c({ā} → {b}) > c({a} → {b}),
ii. c({ā} → {b}) > s({b}),
where c(⋅) denotes the rule confidence and s(⋅) denotes the support of an itemset.
19. Table 5.24 shows a 2 × 2 × 2 contingency table for the binary variables A and B at different values of the control variable C.

Table 5.24. A Contingency Table.
                      A
                      1     0
C = 0    B    1       0     15
              0       15    30
C = 1    B    1       5     0
              0       0     15

a. Compute the ϕ coefficient for A and B when C = 0, C = 1, and C = 0 or 1. Note that
ϕ({A,B}) = (P(A,B) − P(A)P(B)) / √(P(A)P(B)(1 − P(A))(1 − P(B))).
b. What conclusions can you draw from the above result?

20. Consider the contingency tables shown in Table 5.25.
a. For table I, compute support, the interest measure, and the ϕ correlation coefficient for the association pattern {A, B}. Also, compute the confidence of rules A → B and B → A.
b. For table II, compute support, the interest measure, and the ϕ correlation coefficient for the association pattern {A, B}. Also, compute the confidence of rules A → B and B → A.

Table 5.25. Contingency tables for Exercise 20.
(a) Table I.
         B     B̄
A        9     1
Ā        1     89

(b) Table II.
         B     B̄
A        89    1
Ā        1     9

c. What conclusions can you draw from the results of (a) and (b)?

21. Consider the relationship between customers who buy high-definition televisions and exercise machines as shown in Tables 5.17 and 5.18.
a. Compute the odds ratios for both tables.
b. Compute the ϕ-coefficient for both tables.
c. Compute the interest factor for both tables.
For each of the measures given above, describe how the direction of association changes when data is pooled together instead of being stratified.
6 Association Analysis: Advanced Concepts

The association rule mining formulation described in the previous chapter assumes that the input data consists of binary attributes called items. The presence of an item in a transaction is also assumed to be more important than its absence. As a result, an item is treated as an asymmetric binary attribute and only frequent patterns are considered interesting.

This chapter extends the formulation to data sets with symmetric binary, categorical, and continuous attributes. The formulation will also be extended to incorporate more complex entities such as sequences and graphs. Although the overall structure of association analysis algorithms remains unchanged, certain aspects of the algorithms must be modified to handle the non-traditional entities.
6.1 Handling Categorical Attributes

There are many applications that contain symmetric binary and nominal attributes. The Internet survey data shown in Table 6.1 contains symmetric binary attributes such as Gender, Computer at Home, Chat Online, Shop Online, and Privacy Concerns, as well as nominal attributes such as Level of Education and State. Using association analysis, we may uncover interesting information about the characteristics of Internet users such as

{Shop Online = Yes} → {Privacy Concerns = Yes}.

Table 6.1. Internet survey data with categorical attributes.
Gender   Level of Education   State   Computer at Home   Chat Online   Shop Online   Privacy Concerns
Female Graduate Illinois Yes Yes Yes Yes
Male College California No No No No
Male Graduate Michigan Yes Yes Yes Yes
Female College Virginia No No Yes Yes
Female Graduate California Yes No No Yes
Male College Minnesota Yes Yes Yes Yes
Male College Alaska Yes Yes Yes No
Male High School Oregon Yes No No No
Female Graduate Texas No Yes No No
… … … … … … …
This rule suggests that most Internet users who shop online are concerned about their personal privacy.

To extract such patterns, the categorical and symmetric binary attributes are transformed into "items" first, so that existing association rule mining algorithms can be applied. This type of transformation can be performed by creating a new item for each distinct attribute-value pair. For example, the nominal attribute Level of Education can be replaced by three binary items: Education = College, Education = Graduate, and Education = High School. Similarly, symmetric binary attributes such as Gender can be converted into a pair of binary items, Male and Female. Table 6.2 shows the result of binarizing the Internet survey data.
Table 6.2. Internet survey data after binarizing categorical and symmetric binary attributes.
Male   Female   Education = Graduate   Education = College   …   Privacy = Yes   Privacy = No
0 1 1 0 … 1 0
1 0 0 1 … 0 1
1 0 1 0 … 1 0
0 1 0 1 … 1 0
0 1 1 0 … 1 0
1 0 0 1 … 1 0
1 0 0 1 … 0 1
1 0 0 0 … 0 1
0 1 1 0 … 0 1
… … … … … … …
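The attribute-to-item transformation just described is straightforward to script. Below is a minimal sketch in Python, assuming the survey records are available as dictionaries of attribute-value pairs; the helper name binarize_records and the small sample records are ours, for illustration only.

```python
# Minimal sketch: turn categorical / symmetric binary attributes into "items"
# by creating one binary item per distinct attribute-value pair (cf. Table 6.2).
# Column names follow Table 6.1; the helper name is ours, for illustration only.

def binarize_records(records):
    """records: list of dicts mapping attribute name -> value."""
    # Collect every distinct attribute-value pair observed in the data.
    pairs = sorted({(attr, val) for rec in records for attr, val in rec.items()})
    item_names = [f"{attr}={val}" for attr, val in pairs]
    # Represent each record as the set of items (attribute=value) it contains.
    transactions = [{f"{attr}={val}" for attr, val in rec.items()} for rec in records]
    # Optional 0/1 matrix: one row per record, one column per item.
    matrix = [[1 if name in t else 0 for name in item_names] for t in transactions]
    return item_names, matrix

records = [
    {"Gender": "Female", "Education": "Graduate", "Privacy": "Yes"},
    {"Gender": "Male", "Education": "College", "Privacy": "No"},
]
names, matrix = binarize_records(records)
print(names)
print(matrix)
```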
There are several issues to consider when applying association analysis to the binarized data:

1. Some attribute values may not be frequent enough to be part of a frequent pattern. This problem is more evident for nominal attributes that have many possible values, e.g., state names. Lowering the support threshold does not help because it exponentially increases the number of frequent patterns found (many of which may be spurious) and makes the computation more expensive. A more practical solution is to group related attribute values into a small number of categories. For example, each state name can be replaced by its corresponding geographical region. Another possibility is to aggregate the less frequent attribute values into a single category called Others, as shown in Figure 6.1.

Figure 6.1. A pie chart with a merged category called Others.

2. Some attribute values may have considerably higher frequencies than others. For example, suppose 85% of the survey participants own a home computer. By creating a binary item for each attribute value that appears frequently in the data, we may potentially generate many redundant patterns, as illustrated by the following example:

{Computer at Home = Yes, Shop Online = Yes} → {Privacy Concerns = Yes}.

The rule is redundant because it is subsumed by the more general rule given at the beginning of this section. Because the high-frequency items correspond to the typical values of an attribute, they seldom carry any new information that can help us to better understand the pattern. It may therefore be useful to remove such items before applying standard association analysis algorithms. Another possibility is to apply the techniques presented in Section 5.8 for handling data sets with a wide range of support values.

3. Although the width of every transaction is the same as the number of attributes in the original data, the computation time may increase, especially when many of the newly created items become frequent. This is because more time is needed to deal with the additional candidate itemsets generated by these items (see Exercise 1 on page 510). One way to reduce the computation time is to avoid generating candidate itemsets that contain more than one item from the same attribute. For example, we do not have to generate a candidate itemset such as {State = Texas, State = Michigan} because the support count of the itemset is zero.
6.2 Handling Continuous Attributes

The Internet survey data described in the previous section may also contain continuous attributes such as the ones shown in Table 6.3. Mining the continuous attributes may reveal useful insights about the data such as "users whose annual income is more than $120K belong to the 45–60 age group" or "users who have more than 3 email accounts and spend more than 15 hours online per week are often concerned about their personal privacy." Association rules that contain continuous attributes are commonly known as quantitative association rules.

Table 6.3. Internet survey data with continuous attributes.
Gender   …   Age   Annual Income   No. of Hours Spent Online per Week   No. of Email Accounts   Privacy Concern
Female … 26 90K 20 4 Yes
Male … 51 135K 10 2 No
Male … 29 80K 10 3 Yes
Female … 45 120K 15 3 Yes
Female … 31 95K 20 5 Yes
Male … 25 55K 25 5 Yes
Male … 37 100K 10 1 No
Male … 41 65K 8 2 No
Female … 26 85K 12 1 No
… … … … … … …
This section describes the various methodologies for applying association analysis to continuous data. We will specifically discuss three types of methods: (1) discretization-based methods, (2) statistics-based methods, and (3) non-discretization methods. The quantitative association rules derived using these methods are quite different in nature.

6.2.1 Discretization-Based Methods

Discretization is the most common approach for handling continuous attributes. This approach groups the adjacent values of a continuous attribute into a finite number of intervals. For example, the Age attribute can be divided into the following intervals: Age ∈ [12,16), Age ∈ [16,20), Age ∈ [20,24), …, Age ∈ [56,60), where [a,b) represents an interval that includes a but not b. Discretization can be performed using any of the techniques described in Section 2.3.6 (equal interval width, equal frequency, entropy-based, or clustering). The discrete intervals are then mapped into asymmetric binary attributes so that existing association analysis algorithms can be applied. Table 6.4 shows the Internet survey data after discretization and binarization.
Table 6.4. Internet survey data after binarizing categorical and continuous attributes.
Male   Female   …   Age < 13   Age ∈ [13,21)   Age ∈ [21,30)   …   Privacy = Yes   Privacy = No
0 1 … 0 0 1 … 1 0
1 0 … 0 0 0 … 0 1
1 0 … 0 0 1 … 1 0
0 1 … 0 0 0 … 1 0
0 1 … 0 0 0 … 1 0
1 0 … 0 0 1 … 1 0
1 0 … 0 0 0 … 0 1
1 0 … 0 0 0 … 0 1
0 1 … 0 0 1 … 0 1
… … … … … … … … …
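Equal interval width discretization followed by binarization, as used to produce Table 6.4, can be sketched as follows; the helper names and the choice of a 4-year bin width are our own illustrative assumptions.

```python
# Minimal sketch of equal-width discretization followed by binarization,
# in the spirit of Table 6.4. The helper names and the 4-year bin width are
# our own choices for illustration, not part of the text.

def equal_width_bins(lo, hi, width):
    """Return intervals [lo, lo+width), [lo+width, lo+2*width), ... covering [lo, hi)."""
    edges = list(range(lo, hi + width, width))
    return list(zip(edges[:-1], edges[1:]))

def binarize_age(age, bins):
    """Map a single Age value to one binary item per interval."""
    return {f"Age=[{a},{b})": int(a <= age < b) for a, b in bins}

bins = equal_width_bins(12, 60, 4)   # [12,16), [16,20), ..., [56,60)
print(binarize_age(26, bins))        # exactly one interval item is set to 1
```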
A key parameter in attribute discretization is the number of intervals used to partition each attribute. This parameter is typically provided by the users and can be expressed in terms of the interval width (for the equal interval width approach), the average number of transactions per interval (for the equal frequency approach), or the number of desired clusters (for the clustering-based approach). The difficulty in determining the right number of intervals can be illustrated using the data set shown in Table 6.5, which summarizes the responses of 250 users who participated in the survey. There are two strong rules embedded in the data:

R1: Age ∈ [16,24) → Chat Online = Yes   (s = 8.8%, c = 81.5%).
R2: Age ∈ [44,60) → Chat Online = No   (s = 16.8%, c = 70%).

Table 6.5. A breakdown of Internet users who participated in online chat according to their age group.
Age Group   Chat Online = Yes   Chat Online = No
[12,16) 12 13
[16,20) 11 2
[20,24) 11 3
[24,28) 12 13
[28,32) 14 12
[32,36) 15 12
[36,40) 16 14
[40,44) 16 14
[44,48) 4 10
[48,52) 5 11
[52,56) 5 10
[56,60) 4 11
These rules suggest that most of the users from the age group of 16–24 often participate in online chatting, while those from the age group of 44–60 are less likely to chat online. In this example, we consider a rule to be interesting only if its support (s) exceeds 5% and its confidence (c) exceeds 65%. One of the problems encountered when discretizing the Age attribute is how to determine the interval width.

1. If the interval is too wide, then we may lose some patterns because of their lack of confidence. For example, when the interval width is 24 years, R1 and R2 are replaced by the following rules:

R′1: Age ∈ [12,36) → Chat Online = Yes   (s = 30%, c = 57.7%).
R′2: Age ∈ [36,60) → Chat Online = No   (s = 28%, c = 58.3%).

Despite their higher supports, the wider intervals have caused the confidence for both rules to drop below the minimum confidence threshold. As a result, both patterns are lost after discretization.

2. If the interval is too narrow, then we may lose some patterns because of their lack of support. For example, if the interval width is 4 years, then R1 is broken up into the following two subrules:

R11(4): Age ∈ [16,20) → Chat Online = Yes   (s = 4.4%, c = 84.6%).
R12(4): Age ∈ [20,24) → Chat Online = Yes   (s = 4.4%, c = 78.6%).

Since the supports for the subrules are less than the minimum support threshold, R1 is lost after discretization. Similarly, the rule R2, which is broken up into four subrules, will also be lost because the support of each subrule is less than the minimum support threshold.

3. If the interval width is 8 years, then the rule R2 is broken up into the following two subrules:

R21(8): Age ∈ [44,52) → Chat Online = No   (s = 8.4%, c = 70%).
R22(8): Age ∈ [52,60) → Chat Online = No   (s = 8.4%, c = 70%).

Since R21(8) and R22(8) have sufficient support and confidence, R2 can be recovered by aggregating both subrules. Meanwhile, R1 is broken up into the following two subrules:

R11(8): Age ∈ [12,20) → Chat Online = Yes   (s = 9.2%, c = 60.5%).
R12(8): Age ∈ [20,28) → Chat Online = Yes   (s = 9.2%, c = 59.0%).

Unlike R2, we cannot recover the rule R1 by aggregating the subrules because both subrules fail the confidence threshold.

One way to address these issues is to consider every possible grouping of adjacent intervals. For example, we can start with an interval width of 4 years and then merge the adjacent intervals into wider intervals: Age ∈ [12,16), Age ∈ [12,20), …, Age ∈ [12,60), Age ∈ [16,20), Age ∈ [16,24), etc. This approach enables the detection of both R1 and R2 as strong rules. However, it also leads to the following computational issues:

1. The computation becomes extremely expensive. If the range is initially divided into k intervals, then k(k−1)/2 binary items must be generated to represent all possible intervals. Furthermore, if an item corresponding to the interval [a,b) is frequent, then all other items corresponding to intervals that subsume [a,b) must be frequent too. This approach can therefore generate far too many candidate and frequent itemsets. To address these problems, a maximum support threshold can be applied to prevent the creation of items corresponding to very wide intervals and to reduce the number of itemsets.

2. Many redundant rules are extracted. For example, consider the following pair of rules:

R3: {Age ∈ [16,20), Gender = Male} → {Chat Online = Yes},
R4: {Age ∈ [16,24), Gender = Male} → {Chat Online = Yes}.

R4 is a generalization of R3 (and R3 is a specialization of R4) because R4 has a wider interval for the Age attribute. If the confidence values for both rules are the same, then R4 should be more interesting because it covers more examples, including those for R3. R3 is therefore a redundant rule.

6.2.2 Statistics-Based Methods

Quantitative association rules can be used to infer the statistical properties of a population. For example, suppose we are interested in finding the average age of certain groups of Internet users based on the data provided in Tables 6.1 and 6.3.
Using the statistics-based method described in this section, quantitative association rules such as the following can be extracted:

{Annual Income > $100K, Shop Online = Yes} → Age: Mean = 38.

The rule states that the average age of Internet users whose annual income exceeds $100K and who shop online regularly is 38 years old.

Rule Generation
To generate the statistics-based quantitative association rules, the target attribute used to characterize interesting segments of the population must be specified. By withholding the target attribute, the remaining categorical and continuous attributes in the data are binarized using the methods described in the previous section. Existing algorithms such as Apriori or FP-growth are then applied to extract frequent itemsets from the binarized data. Each frequent itemset identifies an interesting segment of the population. The distribution of the target attribute in each segment can be summarized using descriptive statistics such as mean, median, variance, or absolute deviation. For example, the preceding rule is obtained by averaging the age of Internet users who support the frequent itemset {Annual Income > $100K, Shop Online = Yes}.

The number of quantitative association rules discovered using this method is the same as the number of extracted frequent itemsets. Because of the way the quantitative association rules are defined, the notion of confidence is not applicable to such rules. An alternative method for validating the quantitative association rules is presented next.

Rule Validation
A quantitative association rule is interesting only if the statistics computed from transactions covered by the rule are different than those computed from transactions not covered by the rule. For example, the rule given at the beginning of this section is interesting only if the average age of Internet users who do not support the frequent itemset {Annual Income > $100K, Shop Online = Yes} is significantly higher or lower than 38 years old. To determine whether the difference in their average ages is statistically significant, statistical hypothesis testing methods should be applied.

Consider the quantitative association rule, A → t: μ, where A is a frequent itemset, t is the continuous target attribute, and μ is the average value of t among transactions covered by A. Furthermore, let μ′ denote the average value of t among transactions not covered by A. The goal is to test whether the difference between μ and μ′ is greater than some user-specified threshold, Δ. In statistical hypothesis testing, two opposite propositions, known as the null hypothesis and the alternative hypothesis, are given. A hypothesis test is performed to determine which of these two hypotheses should be accepted, based on evidence gathered from the data (see Appendix C).

In this case, assuming that μ < μ′, the null hypothesis is H0: μ′ = μ + Δ, while the alternative hypothesis is H1: μ′ > μ + Δ. To determine which hypothesis should be accepted, the following Z-statistic is computed:

Z = (μ′ − μ − Δ) / √(s1²/n1 + s2²/n2),   (6.1)

where n1 is the number of transactions supporting A, n2 is the number of transactions not supporting A, s1 is the standard deviation for t among transactions that support A, and s2 is the standard deviation for t among transactions that do not support A. Under the null hypothesis, Z has a standard normal distribution with mean 0 and variance 1. The value of Z computed using Equation 6.1 is then compared against a critical value, Zα, which is a threshold that depends on the desired confidence level. If Z > Zα, then the null hypothesis is rejected and we may conclude that the quantitative association rule is interesting. Otherwise, there is not enough evidence in the data to show that the difference in mean is statistically significant.

Example 6.1. Consider the quantitative association rule

{Income > 100K, Shop Online = Yes} → Age: μ = 38.

Suppose there are 50 Internet users who supported the rule antecedent. The standard deviation of their ages is 3.5. On the other hand, the average age of the 200 users who do not support the rule antecedent is 30 and their standard deviation is 6.5. Assume that a quantitative association rule is considered interesting only if the difference between μ and μ′ is more than 5 years. Using Equation 6.1 we obtain

Z = (38 − 30 − 5) / √(3.5²/50 + 6.5²/200) = 4.4414.

For a one-sided hypothesis test at a 95% confidence level, the critical value for rejecting the null hypothesis is 1.64. Since Z > 1.64, the null hypothesis can be rejected. We therefore conclude that the quantitative association rule is interesting because the difference between the average ages of users who support and do not support the rule antecedent is more than 5 years.
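As a quick check of Example 6.1, the Z-statistic of Equation 6.1 can be evaluated directly from the summary statistics. The sketch below takes the absolute difference of the two group means, matching the way the example plugs in the larger mean first; the function name is ours.

```python
import math

# Sketch of the Z-statistic in Equation 6.1 for validating a quantitative
# association rule A -> t: mu. Inputs are summary statistics of the covered
# and uncovered transactions; the helper name is ours.
def z_statistic(mean_a, std_a, n_a, mean_not_a, std_not_a, n_not_a, delta):
    # Difference between the two group means, reduced by the user-specified
    # threshold delta, scaled by the combined standard error.
    diff = abs(mean_a - mean_not_a) - delta
    return diff / math.sqrt(std_a**2 / n_a + std_not_a**2 / n_not_a)

# Example 6.1: 50 covered users (mean age 38, std 3.5) vs. 200 uncovered
# users (mean age 30, std 6.5), with delta = 5 years.
z = z_statistic(38, 3.5, 50, 30, 6.5, 200, 5)
print(round(z, 4))   # ~4.4414, which exceeds the 95% one-sided critical value of 1.64
```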
6.2.3 Non-discretization Methods
There are certain applications in which analysts are more interested in finding associations among the continuous attributes, rather than associations among discrete intervals of the continuous attributes. For example, consider the problem of finding word associations in text documents. Table 6.6 shows a document-word matrix where every entry represents the number of times a word appears in a given document. Given such a data matrix, we are interested in finding associations between words (e.g., word1 and word2) instead of associations between ranges of word counts (e.g., word1 ∈ [1,4] and word2 ∈ [2,3]). One way to do this is to transform the data into a 0/1 matrix, where the entry is 1 if the count exceeds some threshold t, and 0 otherwise. While this approach allows analysts to apply existing frequent itemset generation algorithms to the binarized data set, finding the right threshold for binarization can be quite tricky. If the threshold is set too high, it is possible to miss some interesting associations. Conversely, if the threshold is set too low, there is a potential for generating a large number of spurious associations.

Table 6.6. Document-word matrix.
Document   word1   word2   word3   word4   word5   word6
d1 3 6 0 0 0 2
d2 1 2 0 0 0 2
d3 4 2 7 0 0 2
d4 2 0 3 0 0 1
d5 0 0 0 1 1 0
This section presents another methodology for finding associations among continuous attributes, known as the min-Apriori approach. Analogous to traditional association analysis, an itemset is considered to be a collection of continuous attributes, while its support measures the degree of association among the attributes, across multiple rows of the data matrix. For example, a collection of words in Table 6.6 can be referred to as an itemset, whose support is determined by the co-occurrence of words across documents. In min-Apriori, the association among attributes in a given row of the data matrix is obtained by taking the minimum value of the attributes. For example, the association between words word1 and word2 in the document d1 is given by min(3,6) = 3. The support of an itemset is then computed by aggregating its association over all the documents:

s({word1, word2}) = min(3,6) + min(1,2) + min(4,2) + min(2,0) = 6.

The support measure defined in min-Apriori has the following desired properties, which makes it suitable for finding word associations in documents:

1. Support increases monotonically as the number of occurrences of a word increases.

2. Support increases monotonically as the number of documents that contain the word increases.

3. Support has an anti-monotone property. For example, consider a pair of itemsets {A,B} and {A,B,C}. Since min({A,B}) ≥ min({A,B,C}) in every row, s({A,B}) ≥ s({A,B,C}). Therefore, support decreases monotonically as the number of words in an itemset increases.

The standard Apriori algorithm can be modified to find associations among words using the new support definition.
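The min-Apriori support computation is easy to express directly. The sketch below reproduces the counts of Table 6.6 and the support of {word1, word2} computed above; the helper name is ours.

```python
# Minimal sketch of the min-Apriori support: the association of an itemset
# (a set of word columns) within one document is the minimum of its counts,
# and the support aggregates this minimum over all documents.
# The counts below reproduce Table 6.6; the helper name is ours.

doc_word = {
    "d1": {"word1": 3, "word2": 6, "word3": 0, "word4": 0, "word5": 0, "word6": 2},
    "d2": {"word1": 1, "word2": 2, "word3": 0, "word4": 0, "word5": 0, "word6": 2},
    "d3": {"word1": 4, "word2": 2, "word3": 7, "word4": 0, "word5": 0, "word6": 2},
    "d4": {"word1": 2, "word2": 0, "word3": 3, "word4": 0, "word5": 0, "word6": 1},
    "d5": {"word1": 0, "word2": 0, "word3": 0, "word4": 1, "word5": 1, "word6": 0},
}

def min_apriori_support(itemset, matrix):
    """Sum over documents of the minimum count among the itemset's words."""
    return sum(min(row[w] for w in itemset) for row in matrix.values())

print(min_apriori_support({"word1", "word2"}, doc_word))           # 6, as in the text
print(min_apriori_support({"word1", "word2", "word3"}, doc_word))  # never larger (anti-monotone)
```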
6.3 Handling a Concept Hierarchy

A concept hierarchy is a multilevel organization of the various entities or concepts defined in a particular domain. For example, in market basket analysis, a concept hierarchy has the form of an item taxonomy describing the "is-a" relationships among items sold at a grocery store, e.g., milk is a kind of food and DVD is a kind of home electronics equipment (see Figure 6.2). Concept hierarchies are often defined according to domain knowledge or based on a standard classification scheme defined by certain organizations (e.g., the Library of Congress classification scheme is used to organize library materials based on their subject categories).

Figure 6.2. Example of an item taxonomy.

A concept hierarchy can be represented using a directed acyclic graph, as shown in Figure 6.2. If there is an edge in the graph from a node p to another node c, we call p the parent of c and c the child of p. For example, milk is the parent of skim milk because there is a directed edge from the node milk to the node skim milk. X̂ is called an ancestor of X (and X a descendent of X̂) if there is a path from node X̂ to node X in the directed acyclic graph. In the diagram shown in Figure 6.2, food is an ancestor of skim milk and AC adaptor is a descendent of electronics.

The main advantages of incorporating concept hierarchies into association analysis are as follows:

1. Items at the lower levels of a hierarchy may not have enough support to appear in any frequent itemset. For example, although the sale of AC adaptors and docking stations may be low, the sale of laptop accessories, which is their parent node in the concept hierarchy, may be high. Also, rules involving high-level categories may have lower confidence than the ones generated using low-level categories. Unless the concept hierarchy is used, there is a potential to miss interesting patterns at different levels of categories.

2. Rules found at the lower levels of a concept hierarchy tend to be overly specific and may not be as interesting as rules at the higher levels. For example, staple items such as milk and bread tend to produce many low-level rules such as {skim milk} → {wheat bread}, {2% milk} → {wheat bread}, and {skim milk} → {white bread}. Using a concept hierarchy, they can be summarized into a single rule, {milk} → {bread}. Considering only items residing at the top level of their hierarchies also may not be good enough, because such rules may not be of any practical use. For example, although the rule {electronics} → {food} may satisfy the support and confidence thresholds, it is not informative because the combination of electronics and food items that are frequently purchased by customers are unknown. If milk and batteries are the only items sold together frequently, then the pattern {food, electronics} may have overgeneralized the situation.

Standard association analysis can be extended to incorporate concept hierarchies in the following way. Each transaction t is initially replaced with its extended transaction t′, which contains all the items in t along with their corresponding ancestors. For example, the transaction {DVD, wheat bread} can be extended to {DVD, wheat bread, home electronics, electronics, bread, food}, where home electronics and electronics are the ancestors of DVD, while bread and food are the ancestors of wheat bread.

We can then apply existing algorithms, such as Apriori, to the database of extended transactions. Although such an approach would find rules that span different levels of the concept hierarchy, it would suffer from several obvious limitations as described below:

1. Items residing at the higher levels tend to have higher support counts than those residing at the lower levels of a concept hierarchy. As a result, if the support threshold is set too high, then only patterns involving the high-level items are extracted. On the other hand, if the threshold is set too low, then the algorithm generates far too many patterns (most of which may be spurious) and becomes computationally inefficient.

2. Introduction of a concept hierarchy tends to increase the computation time of association analysis algorithms because of the larger number of items and wider transactions. The number of candidate patterns and frequent patterns generated by these algorithms may also grow exponentially with wider transactions.

3. Introduction of a concept hierarchy may produce redundant rules. A rule X → Y is redundant if there exists a more general rule X̂ → Ŷ, where X̂ is an ancestor of X, Ŷ is an ancestor of Y, and both rules have very similar confidence. For example, suppose {bread} → {milk}, {bread} → {2% milk}, {wheat bread} → {2% milk}, {bread} → {skim milk}, and {wheat bread} → {skim milk} have very similar confidence. The rules involving items from the lower levels of the hierarchy are considered redundant because they can be summarized by a rule involving the ancestor items. An itemset such as {skim milk, milk, food} is also redundant because milk and food are ancestors of skim milk. Fortunately, it is easy to eliminate such redundant itemsets during frequent itemset generation, given the knowledge of the hierarchy.
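The extended-transaction construction described above can be sketched as a short routine that walks up parent pointers; the toy hierarchy and helper names below are illustrative assumptions, not the exact taxonomy of Figure 6.2.

```python
# Minimal sketch of the extended-transaction trick for concept hierarchies:
# every item is replaced by itself plus all of its ancestors, after which a
# standard frequent itemset algorithm (e.g., Apriori) can be run unchanged.
# The toy hierarchy below only illustrates the idea.

parent = {
    "skim milk": "milk", "milk": "food",
    "wheat bread": "bread", "bread": "food",
    "DVD": "home electronics", "home electronics": "electronics",
}

def ancestors(item):
    """Walk up the parent pointers until a root is reached."""
    result = []
    while item in parent:
        item = parent[item]
        result.append(item)
    return result

def extend_transaction(transaction):
    """Return the transaction together with the ancestors of each item."""
    extended = set(transaction)
    for item in transaction:
        extended.update(ancestors(item))
    return extended

print(sorted(extend_transaction({"DVD", "wheat bread"})))
# ['DVD', 'bread', 'electronics', 'food', 'home electronics', 'wheat bread']
```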
6.4 Sequential Patterns

Market basket data often contains temporal information about when an item was purchased by customers. Such information can be used to piece together the sequence of transactions made by a customer over a certain period of time. Similarly, event-based data collected from scientific experiments or the monitoring of physical systems, such as telecommunications networks, computer networks, and wireless sensor networks, have an inherent sequential nature to them. This means that an ordinal relation, usually based on temporal precedence, exists among events occurring in such data. However, the concepts of association patterns discussed so far emphasize only "co-occurrence" relationships and disregard the sequential information of the data. The latter information may be valuable for identifying recurring features of a dynamic system or predicting future occurrences of certain events. This section presents the basic concept of sequential patterns and the algorithms developed to discover them.

6.4.1 Preliminaries

The input to the problem of discovering sequential patterns is a sequence dataset, an example of which is shown on the left-hand side of Figure 6.3. Each row records the occurrences of events associated with a particular object at a given time. For example, the first row contains the set of events occurring at timestamp t = 10 for object A. Note that if we only consider the last column of this sequence dataset, it would look similar to a market basket data where every row represents a transaction containing a set of events (items). The traditional concept of association patterns in this data would correspond to common co-occurrences of events across transactions. However, a sequence dataset also contains information about the object and the timestamp of a transaction of events in the first two columns. These columns add context to every transaction, which enables a different style of association analysis for sequence datasets. The right-hand side of Figure 6.3 shows a different representation of the sequence dataset where the events associated with every object appear together, sorted in increasing order of their timestamps. In a sequence dataset, we can look for association patterns of events that commonly occur in a sequential order across objects. For example, in the sequence data shown in Figure 6.3, event 6 is followed by event 1 in all of the sequences. Note that such a pattern cannot be inferred if we treat this as a market basket data by ignoring information about the object and timestamp.

Figure 6.3. Example of a sequence database.

Before presenting a methodology for finding sequential patterns, we provide a brief description of sequences and subsequences.
Sequences
Generally speaking, a sequence is an ordered list of elements (transactions). A sequence can be denoted as s = ⟨e1 e2 e3 … en⟩, where each element ej is a collection of one or more events (items), i.e., ej = {i1, i2, …, ik}. The following is a list of examples of sequences:

Sequence of web pages viewed by a website visitor:
⟨{Homepage} {Electronics} {Cameras and Camcorders} {Digital Cameras} {Shopping Cart} {Order Confirmation} {Return to Shopping}⟩

Sequence of events leading to the nuclear accident at Three-Mile Island:
⟨{clogged resin} {outlet valve closure} {loss of feedwater} {condenser polisher outlet valve shut} {booster pump strip} {main water pump trips} {main turbine trips} {reactor pressure increases}⟩

Sequence of classes taken by a computer science major student in different semesters:
⟨{Algorithms and Data Structures, Introduction to Operating Systems} {Database Systems, Computer Architecture} {Computer Networks, Software Engineering} {Computer Graphics, Parallel Programming}⟩

A sequence can be characterized by its length and the number of occurring events. The length of a sequence corresponds to the number of elements present in the sequence, while we refer to a sequence that contains k events as a k-sequence. The web sequence in the previous example contains 7 elements and 7 events; the event sequence at Three-Mile Island contains 8 elements and 8 events; and the class sequence contains 4 elements and 8 events.

Figure 6.4 provides examples of sequences, elements, and events defined for a variety of application domains. Except for the last row, the ordinal attribute associated with each of the first three domains corresponds to calendar time. For the last row, the ordinal attribute corresponds to the location of the bases (A, C, G, T) in the gene sequence. Although the discussion on sequential patterns is primarily focused on temporal events, it can be extended to the case where the events have non-temporal ordering, such as spatial ordering.

Figure 6.4. Examples of elements and events in sequence datasets.
Subsequences
A sequence t is a subsequence of another sequence s if it is possible to derive t from s by simply deleting some events from elements in s or even deleting some elements in s completely. Formally, the sequence t = ⟨t1 t2 … tm⟩ is a subsequence of s = ⟨s1 s2 … sn⟩ if there exist integers 1 ≤ j1 < j2 < ⋯ < jm ≤ n such that t1 ⊆ sj1, t2 ⊆ sj2, …, tm ⊆ sjm. If t is a subsequence of s, then we say that t is contained in s. Table 6.7 gives examples illustrating the idea of subsequences for various sequences.

Table 6.7. Examples illustrating the concept of a subsequence.
Sequence, s                    Sequence, t           Is t a subsequence of s?
⟨{2,4} {3,5,6} {8}⟩            ⟨{2} {3,6} {8}⟩       Yes
⟨{2,4} {3,5,6} {8}⟩            ⟨{2} {8}⟩             Yes
⟨{1,2} {3,4}⟩                  ⟨{1} {2}⟩             No
⟨{2,4} {2,4} {2,5}⟩            ⟨{2} {4}⟩             Yes
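The containment test behind Table 6.7 can be implemented as a single greedy scan: match the elements of t, in order, against elements of s using a subset test. A minimal sketch, assuming sequences are represented as lists of Python sets (the function name is ours):

```python
# Sketch of the subsequence test used in Table 6.7: t is contained in s if
# the elements of t can be matched, in order, to distinct elements of s such
# that each element of t is a subset of the element of s it is matched to.
def is_subsequence(t, s):
    j = 0                                    # index of the next element of t to match
    for element in s:
        if j < len(t) and t[j] <= element:   # set subset test
            j += 1
    return j == len(t)

s1 = [{2, 4}, {3, 5, 6}, {8}]
print(is_subsequence([{2}, {3, 6}, {8}], s1))         # True
print(is_subsequence([{1}, {2}], [{1, 2}, {3, 4}]))   # False: 2 must follow 1 in a later element
```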
6.4.2 Sequential Pattern Discovery

Let D be a dataset that contains one or more data sequences. The term data sequence refers to an ordered list of elements associated with a single data object. For example, the dataset shown in Figure 6.5 contains five data sequences, one for each object A, B, C, D, and E.
Figure 6.5. Sequential patterns derived from a dataset that contains five data sequences.

The support of a sequence s is the fraction of all data sequences that contain s. If the support for s is greater than or equal to a user-specified threshold minsup, then s is declared to be a sequential pattern (or frequent sequence).

Definition 6.1 (Sequential Pattern Discovery). Given a sequence dataset D and a user-specified minimum support threshold minsup, the task of sequential pattern discovery is to find all sequences with support ≥ minsup.

In Figure 6.5, the support for the sequence ⟨{1}{2}⟩ is equal to 80% because it is contained in four of the five data sequences (every object except for D). Assuming that the minimum support threshold is 50%, any sequence that is contained in at least three data sequences is considered to be a sequential pattern. Examples of sequential patterns extracted from the given dataset include ⟨{1}{2}⟩, ⟨{1,2}⟩, ⟨{2,3}⟩, ⟨{1,2}{2,3}⟩, etc.

Sequential pattern discovery is a computationally challenging task because the set of all possible sequences that can be generated from a collection of events is exponentially large and difficult to enumerate. For example, a collection of n events can result in the following examples of 1-sequences, 2-sequences, and 3-sequences:

1-sequences: ⟨i1⟩, ⟨i2⟩, …, ⟨in⟩
2-sequences: ⟨{i1,i2}⟩, ⟨{i1,i3}⟩, …, ⟨{in−1,in}⟩, …, ⟨{i1}{i1}⟩, ⟨{i1}{i2}⟩, …, ⟨{in}{in}⟩
3-sequences: ⟨{i1,i2,i3}⟩, ⟨{i1,i2,i4}⟩, …, ⟨{in−2,in−1,in}⟩, …, ⟨{i1}{i1,i2}⟩, ⟨{i1}{i1,i3}⟩, …, ⟨{in−1}{in−1,in}⟩, …, ⟨{i1,i2}{i2}⟩, ⟨{i1,i2}{i3}⟩, …, ⟨{in−1,in}{in}⟩, …, ⟨{i1}{i1}{i1}⟩, ⟨{i1}{i1}{i2}⟩, …, ⟨{in}{in}{in}⟩

The above enumeration is similar in some ways to the itemset lattice introduced in Chapter 5 for market basket data. However, note that the above enumeration is not exhaustive; it only shows some sequences and omits a large number of remaining ones by the use of ellipses (…). This is because the number of candidate sequences is substantially larger than the number of candidate itemsets, which makes their enumeration difficult. There are three reasons for the additional number of candidate sequences:

1. An item can appear at most once in an itemset, but an event can appear more than once in a sequence, in different elements of the sequence. For example, given a pair of items, i1 and i2, only one candidate 2-itemset, {i1,i2}, can be generated. In contrast, there are many candidate 2-sequences that can be generated using only two events: ⟨{i1}{i1}⟩, ⟨{i1}{i2}⟩, ⟨{i2}{i1}⟩, ⟨{i2}{i2}⟩, and ⟨{i1,i2}⟩.

2. Order matters in sequences, but not for itemsets. For example, {i1,i2} and {i2,i1} refer to the same itemset, whereas ⟨{i1}{i2}⟩, ⟨{i2}{i1}⟩, and ⟨{i1,i2}⟩ correspond to different sequences, and thus must be generated separately.

3. For market basket data, the number of distinct items n puts an upper bound on the number of candidate itemsets (2^n − 1), whereas for sequence data, even two events a and b can lead to infinitely many candidate sequences (see Figure 6.6 for an illustration).

Figure 6.6. Comparing the number of itemsets with the number of sequences generated using two events (items). We only show 1-sequences, 2-sequences, and 3-sequences for illustration.

Because of the above reasons, it is challenging to create a sequence lattice that enumerates all possible sequences even when the number of events in the data is small. It is thus difficult to use a brute-force approach for generating sequential patterns that enumerates all possible sequences by traversing the sequence lattice. Despite these challenges, the Apriori principle still holds for sequential data because any data sequence that contains a particular k-sequence must also contain all of its (k−1)-subsequences. As we will see later, even though it is challenging to construct the sequence lattice, it is possible to generate candidate k-sequences from the frequent (k−1)-sequences using the Apriori principle. This allows us to extract sequential patterns from a sequence dataset using an Apriori-like algorithm. The basic structure of this algorithm is shown in Algorithm 6.1.
Algorithm 6.1 Apriori-like algorithm for sequential pattern discovery.
1: k = 1.
2: Fk = { i | i ∈ I ∧ σ({i})/N ≥ minsup }. {Find all frequent 1-subsequences.}
3: repeat
4:   k = k + 1.
5:   Ck = candidate-gen(Fk−1). {Generate candidate k-subsequences.}
6:   Ck = candidate-prune(Ck, Fk−1). {Prune candidate k-subsequences.}
7:   for each data sequence t ∈ T do
8:     Ct = subsequence(Ck, t). {Identify all candidates contained in t.}
9:     for each candidate k-subsequence c ∈ Ct do
10:      σ(c) = σ(c) + 1. {Increment the support count.}
11:    end for
12:  end for
13:  Fk = { c | c ∈ Ck ∧ σ(c)/N ≥ minsup }. {Extract the frequent k-subsequences.}
14: until Fk = ∅
15: Answer = ∪ Fk.

Notice that the structure of the algorithm is almost identical to the Apriori algorithm for frequent itemset discovery, presented in the previous chapter. The algorithm iteratively generates new candidate k-sequences, prunes candidates whose (k−1)-sequences are infrequent, and then counts the supports of the remaining candidates to identify the sequential patterns. The detailed aspects of these steps are given next.

Candidate Generation

We generate candidate k-sequences by merging a pair of frequent (k−1)-sequences. Although this approach is similar to the Fk−1 × Fk−1 strategy introduced in Chapter 5 for generating candidate itemsets, there are certain differences. First, in the case of generating sequences, we can merge a (k−1)-sequence with itself to produce a k-sequence, since an event can appear more than once in a sequence. For example, we can merge the 1-sequence ⟨{a}⟩ with itself to produce a candidate 2-sequence, ⟨{a}{a}⟩. Second, recall that in order to avoid generating duplicate candidates, the traditional Apriori algorithm merges a pair of frequent k-itemsets only if their first k−1 items, arranged in lexicographic order, are identical. In the case of generating sequences, we still use the lexicographic order for arranging events within an element. However, the arrangement of elements in a sequence may not follow the lexicographic order. For example, ⟨{b,c}{a}{d}⟩ is a viable representation of a 4-sequence, even though the elements in the sequence are not arranged according to their lexicographic ranks. On the other hand, ⟨{c,b}{a}{d}⟩ is not a viable representation of the same 4-sequence, since the events in the first element violate the lexicographic order.
Given a sequence s = ⟨e1 e2 e3 … en⟩, where the events in every element are arranged lexicographically, we can refer to the first event of e1 as the first event of s and the last event of en as the last event of s. The criteria for merging sequences can then be stated in the form of the following procedure.

Sequence Merging Procedure

A sequence s(1) is merged with another sequence s(2) only if the subsequence obtained by dropping the first event in s(1) is identical to the subsequence obtained by dropping the last event in s(2). The resulting candidate is given by extending the sequence s(1) as follows:

1. If the last element of s(2) has only one event, append the last element of s(2) to the end of s(1) and obtain the merged sequence.

2. If the last element of s(2) has more than one event, append the last event from the last element of s(2) (that is absent in the last element of s(1)) to the last element of s(1) and obtain the merged sequence.
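The merging procedure can be sketched directly from the two conditions above. The code below represents a sequence as a list of lists of events kept in lexicographic order; the helper names are ours, and the examples anticipate the candidates discussed for Figure 6.7.

```python
# Sketch of the sequence merging procedure for candidate generation.
# A sequence is a list of lists of events, with events inside an element
# kept in lexicographic order.

def drop_first_event(seq):
    """Remove the first event of the first element (dropping empty elements)."""
    head = seq[0][1:]
    return ([head] if head else []) + [list(e) for e in seq[1:]]

def drop_last_event(seq):
    """Remove the last event of the last element (dropping empty elements)."""
    tail = seq[-1][:-1]
    return [list(e) for e in seq[:-1]] + ([tail] if tail else [])

def merge(s1, s2):
    """Merge two frequent (k-1)-sequences into a candidate k-sequence, or return None."""
    if drop_first_event(s1) != drop_last_event(s2):
        return None
    merged = [list(e) for e in s1]
    if len(s2[-1]) == 1:
        merged.append(list(s2[-1]))      # condition 1: append the whole last element
    else:
        merged[-1].append(s2[-1][-1])    # condition 2: append only the last event
    return merged

print(merge([[1], [2], [3]], [[2], [3], [4]]))   # [[1], [2], [3], [4]]
print(merge([[1], [5], [3]], [[5], [3, 4]]))     # [[1], [5], [3, 4]]
print(merge([[1], [2], [3]], [[1], [2, 5]]))     # None (merging criteria not met)
```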
Figure 6.7 illustrates examples of candidate 4-sequences obtained by merging pairs of frequent 3-sequences. The first candidate, ⟨{1}{2}{3}{4}⟩, is obtained by merging ⟨{1}{2}{3}⟩ with ⟨{2}{3}{4}⟩. Since the last element of the second sequence ({4}) contains only one event, it is simply appended to the first sequence to generate the merged sequence. On the other hand, merging ⟨{1}{5}{3}⟩ with ⟨{5}{3,4}⟩ produces the candidate 4-sequence ⟨{1}{5}{3,4}⟩. In this case, the last element of the second sequence ({3,4}) contains two events. Hence, the last event in this element (4) is added to the last element of the first sequence ({3}) to obtain the merged sequence.

Figure 6.7. Example of the candidate generation and pruning steps of a sequential pattern mining algorithm.
It can be shown that the sequence merging procedure is complete, i.e., it generates every frequent k-subsequence. This is because every frequent k-subsequence s includes a frequent (k−1)-sequence s1 that does not contain the first event of s, and a frequent (k−1)-sequence s2 that does not contain the last event of s. Since s1 and s2 are frequent and follow the criteria for merging sequences, they will be merged to produce every frequent k-subsequence s as one of the candidates. Also, the sequence merging procedure ensures that there is a unique way of generating s only by merging s1 and s2. For example, in Figure 6.7, the sequences ⟨{1}{2}{3}⟩ and ⟨{1}{2,5}⟩ do not have to be merged because removing the first event from the first sequence does not give the same subsequence as removing the last event from the second sequence. Although ⟨{1}{2,5}{3}⟩ is a viable candidate, it is generated by merging a different pair of sequences, ⟨{1}{2,5}⟩ and ⟨{2,5}{3}⟩. This example illustrates that the sequence merging procedure does not generate duplicate candidate sequences.

Candidate Pruning

A candidate k-sequence is pruned if at least one of its (k−1)-sequences is infrequent. For example, consider the candidate 4-sequence ⟨{1}{2}{3}{4}⟩. We need to check if any of the 3-sequences contained in this 4-sequence is infrequent. Since the sequence ⟨{1}{2}{4}⟩ is contained in this sequence and is infrequent, the candidate ⟨{1}{2}{3}{4}⟩ can be eliminated. Readers should be able to verify that the only candidate 4-sequence that survives the candidate pruning step in Figure 6.7 is ⟨{1}{2,5}{3}⟩.

Support Counting

During support counting, the algorithm identifies all candidate k-sequences belonging to a particular data sequence and increments their support counts. After performing this step for each data sequence, the algorithm identifies the frequent k-sequences and discards all candidate sequences whose support values are less than the minsup threshold.
6.4.3 Timing Constraints*

This section presents a sequential pattern formulation where timing constraints are imposed on the events and elements of a pattern. To motivate the need for timing constraints, consider the following sequence of courses taken by two students who enrolled in a data mining class:
Student A: ⟨{Statistics}{Database Systems}{Data Mining}⟩.
Student B: ⟨{Database Systems}{Statistics}{Data Mining}⟩.

The sequential pattern of interest is ⟨{Statistics, Database Systems}{Data Mining}⟩, which means that students who are enrolled in the data mining class must have previously taken a course in statistics and database systems. Clearly, the pattern is supported by both students even though they do not take statistics and database systems at the same time. In contrast, a student who took a statistics course ten years earlier should not be considered as supporting the pattern because the time gap between the courses is too long. Because the formulation presented in the previous section does not incorporate these timing constraints, a new sequential pattern definition is needed.

Figure 6.8 illustrates some of the timing constraints that can be imposed on a pattern. The definition of these constraints and the impact they have on sequential pattern discovery algorithms will be discussed in the following sections. Note that each element of the sequential pattern is associated with a time window [l, u], where l is the earliest occurrence of an event within the time window and u is the latest occurrence of an event within the time window. Note that in this discussion, we allow events within an element to occur at different times. Hence, the actual timing of the event occurrences may not be the same as the lexicographic ordering.

Figure 6.8. Timing constraints of a sequential pattern.
The maxspan Constraint

The maxspan constraint specifies the maximum allowed time difference between the latest and the earliest occurrences of events in the entire sequence. For example, suppose the following data sequences contain elements that occur at consecutive timestamps (1, 2, 3, …), i.e., the ith element in the sequence occurs at the ith timestamp. Assuming that maxspan = 3, the following table contains sequential patterns that are supported and not supported by a given data sequence.

Data Sequence, s | Sequential Pattern, t | Does s support t?
⟨{1,3}{3,4}{4}{5}{6,7}{8}⟩ | ⟨{3}{4}⟩ | Yes
⟨{1,3}{3,4}{4}{5}{6,7}{8}⟩ | ⟨{3}{6}⟩ | Yes
⟨{1,3}{3,4}{4}{5}{6,7}{8}⟩ | ⟨{1,3}{6}⟩ | No
In general, the longer the maxspan, the more likely it is to detect a pattern in a data sequence. However, a longer maxspan can also capture spurious patterns because it increases the chance for two unrelated events to be temporally related. In addition, the pattern may involve events that are already obsolete.

The maxspan constraint affects the support counting step of sequential pattern discovery algorithms. As shown in the preceding examples, some data sequences no longer support a candidate pattern when the maxspan constraint is imposed. If we simply apply Algorithm 6.1, the support counts for some patterns may be overestimated. To avoid this problem, the algorithm must be modified to ignore cases where the interval between the first and last occurrences of events in a given pattern is greater than maxspan.
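One way to incorporate the maxspan check, consistent with the examples in the table above, is sketched below. It assumes each element of the data sequence carries a timestamp (here, consecutive integers) and that an occurrence of a pattern maps its elements, in order, to data elements with increasing timestamps; the function names are illustrative.

# A minimal sketch of support checking under a maxspan constraint.
# The data sequence is given as (timestamp, element) pairs.

from itertools import product

def occurrences(timed_sequence, pattern):
    """All ways of matching the pattern's elements, in order, to timestamps."""
    slots = [[t for t, element in timed_sequence if set(p) <= set(element)] for p in pattern]
    return [times for times in product(*slots) if all(a < b for a, b in zip(times, times[1:]))]

def supports_maxspan(timed_sequence, pattern, maxspan):
    return any(times[-1] - times[0] <= maxspan for times in occurrences(timed_sequence, pattern))

d = list(enumerate([(1, 3), (3, 4), (4,), (5,), (6, 7), (8,)], start=1))
print(supports_maxspan(d, [(3,), (4,)], 3))     # True
print(supports_maxspan(d, [(3,), (6,)], 3))     # True  (3 at t=2 and 6 at t=5 give a span of 3)
print(supports_maxspan(d, [(1, 3), (6,)], 3))   # False (the span is 4, which exceeds maxspan)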
The mingap and maxgap Constraints

Timing constraints can also be specified to restrict the time difference between two consecutive elements of a sequence. If the maximum time difference (maxgap) is one week, then events in one element must occur within a week's time of the events occurring in the previous element. If the minimum time difference (mingap) is zero, then events in one element must occur after the events occurring in the previous element. (See Figure 6.8.) The following table shows examples of patterns that pass or fail the maxgap and mingap constraints, assuming that maxgap = 3 and mingap = 1. These examples assume each element occurs at consecutive time steps.

Data Sequence, s | Sequential Pattern, t | maxgap | mingap
⟨{1,3}{3,4}{4}{5}{6,7}{8}⟩ | ⟨{3}{6}⟩ | Pass | Pass
⟨{1,3}{3,4}{4}{5}{6,7}{8}⟩ | ⟨{6}{8}⟩ | Pass | Fail
⟨{1,3}{3,4}{4}{5}{6,7}{8}⟩ | ⟨{1,3}{6}⟩ | Fail | Pass
⟨{1,3}{3,4}{4}{5}{6,7}{8}⟩ | ⟨{1}{3}{8}⟩ | Fail | Fail
As with maxspan, these constraints will affect the support counting step of sequential pattern discovery algorithms because some data sequences no longer support a candidate pattern when mingap and maxgap constraints are present. These algorithms must be modified to ensure that the timing constraints are not violated when counting the support of a pattern. Otherwise, some infrequent sequences may mistakenly be declared as frequent patterns.
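The mingap and maxgap checks can be sketched in the same style. The helper below evaluates the two constraints separately, mirroring the two columns of the table above; the representation and names are illustrative assumptions.

# A minimal sketch that evaluates the maxgap and mingap columns of the table
# above, treating each constraint separately.

from itertools import product

def gap_status(timed_sequence, pattern, mingap, maxgap):
    """Return (passes_maxgap, passes_mingap) for a pattern against one data sequence."""
    slots = [[t for t, element in timed_sequence if set(p) <= set(element)] for p in pattern]
    passes_max = passes_min = False
    for times in product(*slots):
        gaps = [b - a for a, b in zip(times, times[1:])]
        if not all(g > 0 for g in gaps):
            continue                                 # elements must occur in order
        passes_max = passes_max or all(g <= maxgap for g in gaps)
        passes_min = passes_min or all(g > mingap for g in gaps)
    return passes_max, passes_min

d = list(enumerate([(1, 3), (3, 4), (4,), (5,), (6, 7), (8,)], start=1))
for pattern in [[(3,), (6,)], [(6,), (8,)], [(1, 3), (6,)], [(1,), (3,), (8,)]]:
    print(pattern, gap_status(d, pattern, mingap=1, maxgap=3))
# (True, True), (True, False), (False, True), (False, False), matching the table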
A side effect of using the maxgap constraint is that the Apriori principle might be violated. To illustrate this, consider the data set shown in Figure 6.5. Without mingap or maxgap constraints, the supports for ⟨{2}{5}⟩ and ⟨{2}{3}{5}⟩ are both equal to 60%. However, if mingap = 0 and maxgap = 1, then the support for ⟨{2}{5}⟩ reduces to 40%, while the support for ⟨{2}{3}{5}⟩ is still 60%. In other words, support has increased when the number of events in a sequence increases, which contradicts the Apriori principle. The violation occurs because the object D does not support the pattern ⟨{2}{5}⟩ since the time gap between events 2 and 5 is greater than maxgap. This problem can be avoided by using the concept of a contiguous subsequence.
Definition 6.2 (Contiguous Subsequence). A sequence s is a contiguous subsequence of w = ⟨e1 e2 … ek⟩ if any one of the following conditions hold:

1. s is obtained from w after deleting an event from either e1 or ek,
2. s is obtained from w after deleting an event from any element ei ∈ w that contains at least two events, or
3. s is a contiguous subsequence of t and t is a contiguous subsequence of w.

The following examples illustrate the concept of a contiguous subsequence:

Data Sequence, s | Sequential Pattern, t | Is t a contiguous subsequence of s?
⟨{1}{2,3}⟩ | ⟨{1}{2}⟩ | Yes
⟨{1,2}{2}{3}⟩ | ⟨{1}{2}⟩ | Yes
⟨{3,4}{1,2}{2,3}{4}⟩ | ⟨{1}{2}⟩ | Yes
⟨{1}{3}{2}⟩ | ⟨{1}{2}⟩ | No
⟨{1,2}{1}{3}{2}⟩ | ⟨{1}{2}⟩ | No

Using the concept of contiguous subsequences, the Apriori principle can be modified to handle maxgap constraints in the following way.

Definition 6.3 (Modified Apriori Principle). If a k-sequence is frequent, then all of its contiguous (k−1)-subsequences must also be frequent.
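Definition 6.2 can be turned into a simple membership test: repeatedly apply the allowed single-event deletions and check whether the target sequence is reachable. The sketch below is an illustrative implementation of that idea and reproduces the Yes/No answers in the table above.

# A minimal sketch of a contiguous-subsequence test based on Definition 6.2.
# Sequences are tuples of tuples (events sorted within each element).

def allowed_deletions(w):
    """Sequences obtainable from w by one deletion allowed in Definition 6.2."""
    results = set()
    last = len(w) - 1
    for i, element in enumerate(w):
        if i in (0, last) or len(element) >= 2:      # conditions 1 and 2
            for event in element:
                shrunk = tuple(e for e in element if e != event)
                if shrunk:
                    results.add(w[:i] + (shrunk,) + w[i + 1:])
                else:
                    results.add(w[:i] + w[i + 1:])
    return results

def is_contiguous_subsequence(s, w):
    """True if s can be reached from w by one or more allowed deletions (condition 3)."""
    s, w = tuple(s), tuple(w)
    frontier = {w}
    while frontier:
        nxt = set()
        for seq in frontier:
            for child in allowed_deletions(seq):
                if child == s:
                    return True
                if sum(map(len, child)) > sum(map(len, s)):   # still long enough to shrink to s
                    nxt.add(child)
        frontier = nxt
    return False

rows = [((1,), (2, 3)),
        ((1, 2), (2,), (3,)),
        ((3, 4), (1, 2), (2, 3), (4,)),
        ((1,), (3,), (2,)),
        ((1, 2), (1,), (3,), (2,))]
t = ((1,), (2,))
print([is_contiguous_subsequence(t, w) for w in rows])   # [True, True, True, False, False]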
The modified Apriori principle can be applied to the sequential pattern discovery algorithm with minor modifications. During candidate pruning, not all k-sequences need to be verified since some of them may violate the maxgap constraint. For example, if maxgap = 1, it is not necessary to check whether the subsequence ⟨{1}{2,3}{5}⟩ of the candidate ⟨{1}{2,3}{4}{5}⟩ is frequent since the time difference between the elements {2,3} and {5} is greater than one time unit. Instead, only the contiguous subsequences of ⟨{1}{2,3}{4}{5}⟩ need to be examined. These subsequences include ⟨{1}{2,3}{4}⟩, ⟨{2,3}{4}{5}⟩, ⟨{1}{2}{4}{5}⟩, and ⟨{1}{3}{4}{5}⟩.
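Since the contiguous (k−1)-subsequences are exactly those obtained by one allowed deletion (conditions 1 and 2 of Definition 6.2), the restricted pruning check can be sketched as follows; the function name is an illustrative choice. Applied to the candidate ⟨{1}{2,3}{4}{5}⟩, it produces the four subsequences listed above.

# A minimal sketch of pruning under a maxgap constraint: only the contiguous
# (k-1)-subsequences of a candidate need to be checked against the frequent set.

def contiguous_k_minus_1(seq):
    """Contiguous (k-1)-subsequences: delete one event from the first element,
    the last element, or any element with at least two events."""
    out = []
    last = len(seq) - 1
    for i, element in enumerate(seq):
        if i not in (0, last) and len(element) < 2:
            continue
        for event in element:
            shrunk = tuple(e for e in element if e != event)
            out.append(seq[:i] + ((shrunk,) if shrunk else ()) + seq[i + 1:])
    return out

candidate = ((1,), (2, 3), (4,), (5,))
for sub in contiguous_k_minus_1(candidate):
    print(sub)
# prints the four contiguous 3-subsequences of the candidate listed in the text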
The Window Size Constraint

Finally, events within an element sj do not have to occur at the same time. A window size threshold (ws) can be defined to specify the maximum allowed time difference between the latest and earliest occurrences of events in any element of a sequential pattern. A window size of 0 means all events in the same element of a pattern must occur simultaneously.

The following example uses ws = 2 to determine whether a data sequence supports a given sequence (assuming mingap = 0, maxgap = 3, and maxspan = ∞).

Data Sequence, s | Sequential Pattern, t | Does s support t?
⟨{1,3}{3,4}{4}{5}{6,7}{8}⟩ | ⟨{3,4}{5}⟩ | Yes
⟨{1,3}{3,4}{4}{5}{6,7}{8}⟩ | ⟨{4,6}{8}⟩ | Yes
⟨{1,3}{3,4}{4}{5}{6,7}{8}⟩ | ⟨{3,4,6}{8}⟩ | No
⟨{1,3}{3,4}{4}{5}{6,7}{8}⟩ | ⟨{1,3,4}{6,7,8}⟩ | No
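One reasonable formalization of support checking under ws, mingap, maxgap, and maxspan, consistent with the rows of the table above and with the explanation of the last row that follows, is sketched below. Each pattern element is matched to a time window [l, u] with u − l ≤ ws, and the gap conditions compare consecutive windows; the representation and function names are illustrative assumptions, not the book's algorithm.

# A minimal sketch of support checking with a window size (ws) constraint.
# The data sequence is given as (timestamp, element) pairs.

from itertools import product

def element_windows(timed_sequence, element, ws):
    """All [l, u] windows (u - l <= ws) in which every event of `element` occurs."""
    times = {}
    for t, data_element in timed_sequence:
        for e in data_element:
            times.setdefault(e, []).append(t)
    if not all(e in times for e in element):
        return []
    windows = set()
    for choice in product(*(times[e] for e in element)):
        l, u = min(choice), max(choice)
        if u - l <= ws:
            windows.add((l, u))
    return sorted(windows)

def supports(timed_sequence, pattern, ws, mingap, maxgap, maxspan=float("inf")):
    options = [element_windows(timed_sequence, el, ws) for el in pattern]
    for windows in product(*options):
        ok = all(w2[0] - w1[1] > mingap and w2[1] - w1[0] <= maxgap
                 for w1, w2 in zip(windows, windows[1:]))
        if ok and windows[-1][1] - windows[0][0] <= maxspan:
            return True
    return False

d = list(enumerate([(1, 3), (3, 4), (4,), (5,), (6, 7), (8,)], start=1))
for pattern in [[(3, 4), (5,)], [(4, 6), (8,)], [(3, 4, 6), (8,)], [(1, 3, 4), (6, 7, 8)]]:
    print(pattern, supports(d, pattern, ws=2, mingap=0, maxgap=3))
# True, True, False, False, matching the table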
In the last example, although the pattern ⟨{1,3,4}{6,7,8}⟩ satisfies the window size constraint, it violates the maxgap constraint because the maximum time difference between events in the two elements is 5 units. The window size constraint also affects the support counting step of sequential pattern discovery algorithms. If Algorithm 6.1 is applied without imposing the window size constraint, the support counts for some of the candidate patterns might be underestimated, and thus some interesting patterns may be lost.

6.4.4 Alternative Counting Schemes*

There are multiple ways of defining the support of a sequence given a data sequence. For example, if our database involves long sequences of events, we might be interested in finding subsequences that occur multiple times in the same data sequence. Hence, instead of counting the support of a subsequence as the number of data sequences it is contained in, we can also take into account the number of times a subsequence is contained in a data sequence. This viewpoint gives rise to several different formulations for counting the support of a candidate k-sequence from a database of sequences. For illustrative purposes, consider the problem of counting the support for the sequence ⟨{p}{q}⟩, as shown in Figure 6.9. Assume that ws = 0, mingap = 0, maxgap = 2, and maxspan = 2.
Figure 6.9. Comparing different support counting methods.
COBJ: One occurrence per object.

This method looks for at least one occurrence of a given sequence in an object's timeline. In Figure 6.9, even though the sequence ⟨{p}{q}⟩ appears several times in the object's timeline, it is counted only once, with p occurring at t = 1 and q occurring at t = 3.

CWIN: One occurrence per sliding window.

In this approach, a sliding time window of fixed length (maxspan) is moved across an object's timeline, one unit at a time. The support count is incremented each time the sequence is encountered in the sliding window. In Figure 6.9, the sequence ⟨{p}{q}⟩ is observed six times using this method.

CMINWIN: Number of minimal windows of occurrence.

A minimal window of occurrence is the smallest window in which the sequence occurs given the timing constraints. In other words, a minimal window is the time interval such that the sequence occurs in that time interval, but it does not occur in any of the proper subintervals of it. This definition can be considered as a restrictive version of CWIN, because its effect is to shrink and collapse some of the windows that are counted by CWIN. For example, the sequence ⟨{p}{q}⟩ has four minimal window occurrences: (1) the pair (p: t = 2, q: t = 3), (2) the pair (p: t = 3, q: t = 4), (3) the pair (p: t = 5, q: t = 6), and (4) the pair (p: t = 6, q: t = 7). The occurrence of event p at t = 1 and event q at t = 3 is not a minimal window occurrence because it contains a smaller window with (p: t = 2, q: t = 3), which is indeed a minimal window of occurrence.

CDIST_O: Distinct occurrences with possibility of event-timestamp overlap.

A distinct occurrence of a sequence is defined to be the set of event-timestamp pairs such that there has to be at least one new event-timestamp pair that is different from a previously counted occurrence. Counting all such distinct occurrences results in the CDIST_O method. If the occurrence time of events p and q is denoted as a tuple (t(p), t(q)), then this method yields eight distinct occurrences of the sequence ⟨{p}{q}⟩ at times (1,3), (2,3), (2,4), (3,4), (3,5), (5,6), (5,7), and (6,7).

CDIST: Distinct occurrences with no event-timestamp overlap allowed.

In CDIST_O above, two occurrences of a sequence were allowed to have overlapping event-timestamp pairs, e.g., the overlap between (1,3) and (2,3). In the CDIST method, no overlap is allowed. Effectively, when an event-timestamp pair is considered for counting, it is marked as used and is never used again for subsequent counting of the same sequence. As an example, there are five distinct, non-overlapping occurrences of the sequence ⟨{p}{q}⟩ in the diagram shown in Figure 6.9. These occurrences happen at times (1,3), (2,4), (3,5), (5,6), and (6,7). Observe that these occurrences are subsets of the occurrences observed in CDIST_O.

One final point regarding the counting methods is the need to determine the baseline for computing the support measure. For frequent itemset mining, the baseline is given by the total number of transactions. For sequential pattern mining, the baseline depends on the counting method used. For the COBJ method, the total number of objects in the input data can be used as the baseline. For the CWIN and CMINWIN methods, the baseline is given by the sum of the number of time windows possible in all objects. For methods such as CDIST and CDIST_O, the baseline is given by the sum of the number of distinct timestamps present in the input data of each object.
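The differences among these methods can be illustrated with a short Python sketch. Because Figure 6.9 itself is not reproduced here, the timestamps of p and q below are an assumption chosen to be consistent with the occurrences enumerated in the text; CWIN is omitted since it also depends on how the sliding windows are positioned.

# A minimal sketch of four of the counting methods for the pattern <{p}{q}>.
# The p/q timestamps are assumed, chosen to match the occurrences listed above.

p_times = [1, 2, 3, 5, 6]
q_times = [3, 4, 5, 6, 7]
mingap, maxgap, maxspan = 0, 2, 2

# all (t(p), t(q)) pairs that respect the timing constraints, in time order
occurrences = sorted((tp, tq) for tp in p_times for tq in q_times
                     if mingap < tq - tp <= maxgap and tq - tp <= maxspan)

cobj = 1 if occurrences else 0                 # COBJ: at most one count per object

# CMINWIN: occurrences whose time interval contains no other occurrence's interval
cminwin = sum(1 for a, b in occurrences
              if not any((c, d) != (a, b) and a <= c and d <= b for c, d in occurrences))

# CDIST_O: count an occurrence if it brings at least one new event-timestamp pair
seen, cdist_o = set(), 0
for tp, tq in occurrences:
    if ("p", tp) not in seen or ("q", tq) not in seen:
        seen.update({("p", tp), ("q", tq)})
        cdist_o += 1

# CDIST: an already-used event-timestamp pair can never be reused
used, cdist = set(), 0
for tp, tq in occurrences:
    if ("p", tp) not in used and ("q", tq) not in used:
        used.update({("p", tp), ("q", tq)})
        cdist += 1

print(cobj, cminwin, cdist_o, cdist)           # 1 4 8 5, as in the text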
6.5 Subgraph Patterns

This section describes the application of association analysis methods to graphs, which are more complex entities than itemsets and sequences. A number of entities such as chemical compounds, 3-D protein structures, computer networks, and tree-structured XML documents can be modeled using a graph representation, as shown in Table 6.8.
Table 6.8. Graph representation of entities in various application domains.

Application | Graphs | Vertices | Edges
Web mining | Collection of inter-linked Web pages | Web pages | Hyperlink between pages
Computational chemistry | Chemical compounds | Atoms or ions | Bond between atoms or ions
Computer security | Computer networks | Computers and servers | Interconnection between machines
Semantic Web | XML documents | XML elements | Parent-child relationship between elements
Bioinformatics | 3-D protein structures | Amino acids | Contact residue
A useful data mining task to perform on this type of data is to derive a set of frequently occurring substructures in a collection of graphs. Such a task is known as frequent subgraph mining. A potential application of frequent subgraph mining can be seen in the context of computational chemistry. Each year, new chemical compounds are designed for the development of pharmaceutical drugs, pesticides, fertilizers, etc. Although the structure of a compound is known to play a major role in determining its chemical properties, it is difficult to establish their exact relationship. Frequent subgraph mining can aid this undertaking by identifying the substructures commonly associated with certain properties of known compounds. Such information can help scientists to develop new chemical compounds that have certain desired properties.

This section presents a methodology for applying association analysis to graph-based data. The section begins with a review of some of the basic graph-related concepts and definitions. The frequent subgraph mining problem is then introduced, followed by a description of how the traditional Apriori algorithm can be extended to discover such patterns.
6.5.1 Preliminaries
Graphs

A graph is a data structure that can be used to represent relationships among a set of entities. Mathematically, a graph G = (V, E) is composed of a vertex set V and a set of edges E connecting pairs of vertices. Each edge is denoted by a vertex pair (vi, vj), where vi, vj ∈ V. A label l(vi) can be assigned to each vertex vi representing the name of an entity. Similarly, each edge (vi, vj) can also be associated with a label l(vi, vj) describing the relationship between a pair of entities. Table 6.8 shows the ve