
INTRODUCTION TO DATA MINING
SECOND EDITION

PANG-NING TAN, Michigan State University
MICHAEL STEINBACH, University of Minnesota
ANUJ KARPATNE, University of Minnesota
VIPIN KUMAR, University of Minnesota

330 Hudson Street, NY NY 10013

Director, Portfolio Management: Engineering, Computer Science & Global Editions: Julian Partridge
Specialist, Higher Ed Portfolio Management: Matt Goldstein
Portfolio Management Assistant: Meghan Jacoby
Managing Content Producer: Scott Disanno
Content Producer: Carole Snyder
Web Developer: Steve Wright
Rights and Permissions Manager: Ben Ferrini
Manufacturing Buyer, Higher Ed, Lake Side Communications Inc (LSC): Maura Zaldivar-Garcia
Inventory Manager: Ann Lam
Product Marketing Manager: Yvonne Vannatta
Field Marketing Manager: Demetrius Hall
Marketing Assistant: Jon Bryant
Cover Designer: Joyce Wells, jWells Design
Full-Service Project Management: Chandrasekar Subramanian, SPi Global

Copyright © 2019 Pearson Education, Inc. All rights reserved. Manufactured in the United States of America. This publication is protected by Copyright, and permission should be obtained from the publisher prior to any prohibited reproduction, storage in a retrieval system, or transmission in any form or by any means, electronic, mechanical, photocopying, recording, or likewise. For information regarding permissions, request forms and the appropriate contacts within the Pearson Education Global Rights & Permissions department, please visit www.pearsonhighed.com/permissions/.

Many of the designations by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and the publisher was aware of a trademark claim, the designations have been printed in initial caps or all caps.

Library of Congress Cataloging-in-Publication Data on File

Names: Tan, Pang-Ning, author. | Steinbach, Michael, author. | Karpatne, Anuj, author. | Kumar, Vipin, 1956- author.
Title: Introduction to Data Mining / Pang-Ning Tan, Michigan State University, Michael Steinbach, University of Minnesota, Anuj Karpatne, University of Minnesota, Vipin Kumar, University of Minnesota.
Description: Second edition. | New York, NY : Pearson Education, [2019] | Includes bibliographical references and index.
Identifiers: LCCN 2017048641 | ISBN 9780133128901 | ISBN 0133128903
Subjects: LCSH: Data mining.
Classification: LCC QA76.9.D343 T35 2019 | DDC 006.3/12–dc23 LC record available at https://lccn.loc.gov/2017048641


ISBN-10: 0133128903
ISBN-13: 9780133128901

To our families …

Preface to the Second Edition

Since the first edition, roughly 12 years ago, much has changed in the field of data analysis. The volume and variety of data being collected continues to increase, as has the rate (velocity) at which it is being collected and used to make decisions. Indeed, the term Big Data has been used to refer to the massive and diverse data sets now available. In addition, the term data science has been coined to describe an emerging area that applies tools and techniques from various fields, such as data mining, machine learning, statistics, and many others, to extract actionable insights from data, often big data.

The growth in data has created numerous opportunities for all areas of data analysis. The most dramatic developments have been in the area of predictive modeling, across a wide range of application domains. For instance, recent advances in neural networks, known as deep learning, have shown impressive results in a number of challenging areas, such as image classification, speech recognition, and text categorization and understanding. While not as dramatic, other areas, e.g., clustering, association analysis, and anomaly detection, have also continued to advance. This new edition is in response to those advances.

Overview

As with the first edition, the second edition of the book provides a comprehensive introduction to data mining and is designed to be accessible and useful to students, instructors, researchers, and professionals. Areas covered include data preprocessing, predictive modeling, association analysis, cluster analysis, anomaly detection, and avoiding false discoveries. The goal is to present fundamental concepts and algorithms for each topic, thus providing the reader with the necessary background for the application of data mining to real problems. As before, classification, association analysis, and cluster analysis are each covered in a pair of chapters. The introductory chapter covers basic concepts, representative algorithms, and evaluation techniques, while the following chapter discusses advanced concepts and algorithms. As before, our objective is to provide the reader with a sound understanding of the foundations of data mining, while still covering many important advanced topics. Because of this approach, the book is useful both as a learning tool and as a reference.

To help readers better understand the concepts that have been presented, we provide an extensive set of examples, figures, and exercises. The solutions to the original exercises, which are already circulating on the web, will be made public. The exercises are mostly unchanged from the last edition, with the exception of new exercises in the chapter on avoiding false discoveries. New exercises for the other chapters and their solutions will be available to instructors via the web. Bibliographic notes are included at the end of each chapter for readers who are interested in more advanced topics, historically important papers, and recent trends. These have also been significantly updated. The book also contains a comprehensive subject and author index.

What is New in the Second Edition?

Some of the most significant improvements in the text have been in the two chapters on classification. The introductory chapter uses the decision tree classifier for illustration, but the discussion on many topics (those that apply across all classification approaches) has been greatly expanded and clarified, including topics such as overfitting, underfitting, the impact of training size, model complexity, model selection, and common pitfalls in model evaluation. Almost every section of the advanced classification chapter has been significantly updated. The material on Bayesian networks, support vector machines, and artificial neural networks has been significantly expanded. We have added a separate section on deep networks to address the current developments in this area. The discussion of evaluation, which occurs in the section on imbalanced classes, has also been updated and improved.

The changes in association analysis are more localized. We have completely reworked the section on the evaluation of association patterns (introductory chapter), as well as the sections on sequence and graph mining (advanced chapter). Changes to cluster analysis are also localized. The introductory chapter adds a K-means initialization technique and an updated discussion of cluster evaluation. The advanced clustering chapter adds a new section on spectral graph clustering. Anomaly detection has been greatly revised and expanded. Existing approaches (statistical, nearest neighbor/density-based, and clustering-based) have been retained and updated, while new approaches have been added: reconstruction-based, one-class classification, and information-theoretic. The reconstruction-based approach is illustrated using autoencoder networks that are part of the deep learning paradigm. The data chapter has been updated to include discussions of mutual information and kernel-based techniques.

The last chapter, which discusses how to avoid false discoveries and produce valid results, is completely new, and is novel among other contemporary textbooks on data mining. It supplements the discussions in the other chapters with a discussion of the statistical concepts (statistical significance, p-values, false discovery rate, permutation testing, etc.) relevant to avoiding spurious results, and then illustrates these concepts in the context of data mining techniques. This chapter addresses the increasing concern over the validity and reproducibility of results obtained from data analysis. The addition of this last chapter is a recognition of the importance of this topic and an acknowledgment that a deeper understanding of this area is needed for those analyzing data.

The data exploration chapter has been deleted, as have the appendices, from the print edition of the book, but they will remain available on the web. A new appendix provides a brief discussion of scalability in the context of big data.

To the Instructor

As a textbook, this book is suitable for a wide range of students at the advanced undergraduate or graduate level. Since students come to this subject with diverse backgrounds that may not include extensive knowledge of statistics or databases, our book requires minimal prerequisites. No database knowledge is needed, and we assume only a modest background in statistics or mathematics, although such a background will make for easier going in some sections. As before, the book, and more specifically, the chapters covering major data mining topics, are designed to be as self-contained as possible. Thus, the order in which topics can be covered is quite flexible. The core material is covered in Chapters 2 (data), 3 (classification), 5 (association analysis), 7 (clustering), and 9 (anomaly detection). We recommend at least a cursory coverage of Chapter 10 (Avoiding False Discoveries) to instill in students some caution when interpreting the results of their data analysis. Although the introductory data chapter (2) should be covered first, the basic classification (3), association analysis (5), and clustering (7) chapters can be covered in any order. Because of the relationship of anomaly detection (9) to classification (3) and clustering (7), these chapters should precede Chapter 9.

Various topics can be selected from the advanced classification, association analysis, and clustering chapters (4, 6, and 8, respectively) to fit the schedule and interests of the instructor and students. We also advise that the lectures be augmented by projects or practical exercises in data mining. Although they are time consuming, such hands-on assignments greatly enhance the value of the course.

Support Materials

Support materials available to all readers of this book are available at http://www-users.cs.umn.edu/~kumar/dmbook.

- PowerPoint lecture slides
- Suggestions for student projects
- Data mining resources, such as algorithms and data sets
- Online tutorials that give step-by-step examples for selected data mining techniques described in the book using actual data sets and data analysis software

Additional support materials, including solutions to exercises, are available only to instructors adopting this textbook for classroom use. The book's resources will be mirrored at www.pearsonhighered.com/cs-resources. Comments and suggestions, as well as reports of errors, can be sent to the authors through dmbook@cs.umn.edu.

Acknowledgments

Many people contributed to the first and second editions of the book. We begin by acknowledging our families to whom this book is dedicated. Without their patience and support, this project would have been impossible.

We would like to thank the current and former students of our data mining groups at the University of Minnesota and Michigan State for their contributions. Eui-Hong (Sam) Han and Mahesh Joshi helped with the initial data mining classes. Some of the exercises and presentation slides that they created can be found in the book and its accompanying slides. Students in our data mining groups who provided comments on drafts of the book or who contributed in other ways include Shyam Boriah, Haibin Cheng, Varun Chandola, Eric Eilertson, Levent Ertöz, Jing Gao, Rohit Gupta, Sridhar Iyer, Jung-Eun Lee, Benjamin Mayer, Aysel Ozgur, Uygar Oztekin, Gaurav Pandey, Kashif Riaz, Jerry Scripps, Gyorgy Simon, Hui Xiong, Jieping Ye, and Pusheng Zhang. We would also like to thank the students of our data mining classes at the University of Minnesota and Michigan State University who worked with early drafts of the book and provided invaluable feedback. We specifically note the helpful suggestions of Bernardo Craemer, Arifin Ruslim, Jamshid Vayghan, and Yu Wei.

Joydeep Ghosh (University of Texas) and Sanjay Ranka (University of Florida) class tested early versions of the book. We also received many useful suggestions directly from the following UT students: Pankaj Adhikari, Rajiv Bhatia, Frederic Bosche, Arindam Chakraborty, Meghana Deodhar, Chris Everson, David Gardner, Saad Godil, Todd Hay, Clint Jones, Ajay Joshi, Joonsoo Lee, Yue Luo, Anuj Nanavati, Tyler Olsen, Sunyoung Park, Aashish Phansalkar, Geoff Prewett, Michael Ryoo, Daryl Shannon, and Mei Yang.

Ronald Kostoff (ONR) read an early version of the clustering chapter and offered numerous suggestions. George Karypis provided invaluable LaTeX assistance in creating an author index. Irene Moulitsas also provided assistance with LaTeX and reviewed some of the appendices. Musetta Steinbach was very helpful in finding errors in the figures.

We would like to acknowledge our colleagues at the University of Minnesota and Michigan State who have helped create a positive environment for data mining research. They include Arindam Banerjee, Dan Boley, Joyce Chai, Anil Jain, Ravi Janardan, Rong Jin, George Karypis, Claudia Neuhauser, Haesun Park, William F. Punch, György Simon, Shashi Shekhar, and Jaideep Srivastava. The collaborators on our many data mining projects, who also have our gratitude, include Ramesh Agrawal, Maneesh Bhargava, Steve Cannon, Alok Choudhary, Imme Ebert-Uphoff, Auroop Ganguly, Piet C. de Groen, Fran Hill, Yongdae Kim, Steve Klooster, Kerry Long, Nihar Mahapatra, Rama Nemani, Nikunj Oza, Chris Potter, Lisiane Pruinelli, Nagiza Samatova, Jonathan Shapiro, Kevin Silverstein, Brian Van Ness, Bonnie Westra, Nevin Young, and Zhi-Li Zhang.

The departments of Computer Science and Engineering at the University of Minnesota and Michigan State University provided computing resources and a supportive environment for this project. ARDA, ARL, ARO, DOE, NASA, NOAA, and NSF provided research support for Pang-Ning Tan, Michael Steinbach, Anuj Karpatne, and Vipin Kumar. In particular, Kamal Abdali, Mitra Basu, Dick Brackney, Jagdish Chandra, Joe Coughlan, Michael Coyle, Stephen Davis, Frederica Darema, Richard Hirsch, Chandrika Kamath, Tsengdar Lee, Raju Namburu, N. Radhakrishnan, James Sidoran, Sylvia Spengler, Bhavani Thuraisingham, Walt Tiernin, Maria Zemankova, Aidong Zhang, and Xiaodong Zhang have been supportive of our research in data mining and high-performance computing.

It was a pleasure working with the helpful staff at Pearson Education. In particular, we would like to thank Matt Goldstein, Kathy Smith, Carole Snyder, and Joyce Wells. We would also like to thank George Nichols, who helped with the artwork, and Paul Anagnostopoulos, who provided LaTeX support.

We are grateful to the following Pearson reviewers: Leman Akoglu (Carnegie Mellon University), Chien-Chung Chan (University of Akron), Zhengxin Chen (University of Nebraska at Omaha), Chris Clifton (Purdue University), Joydeep Ghosh (University of Texas, Austin), Nazli Goharian (Illinois Institute of Technology), J. Michael Hardin (University of Alabama), Jingrui He (Arizona State University), James Hearne (Western Washington University), Hillol Kargupta (University of Maryland, Baltimore County and Agnik, LLC), Eamonn Keogh (University of California-Riverside), Bing Liu (University of Illinois at Chicago), Mariofanna Milanova (University of Arkansas at Little Rock), Srinivasan Parthasarathy (Ohio State University), Zbigniew W. Ras (University of North Carolina at Charlotte), Xintao Wu (University of North Carolina at Charlotte), and Mohammed J. Zaki (Rensselaer Polytechnic Institute).

Over the years since the first edition, we have also received numerous comments from readers and students who have pointed out typos and various other issues. We are unable to mention these individuals by name, but their input is much appreciated and has been taken into account for the second edition.

Contents

Preface to the Second Edition

1 Introduction
1.1 What Is Data Mining?
1.2 Motivating Challenges
1.3 The Origins of Data Mining
1.4 Data Mining Tasks
1.5 Scope and Organization of the Book
1.6 Bibliographic Notes
1.7 Exercises

2 Data
2.1 Types of Data
2.1.1 Attributes and Measurement
2.1.2 Types of Data Sets
2.2 Data Quality
2.2.1 Measurement and Data Collection Issues
2.2.2 Issues Related to Applications
2.3 Data Preprocessing
2.3.1 Aggregation
2.3.2 Sampling
2.3.3 Dimensionality Reduction
2.3.4 Feature Subset Selection
2.3.5 Feature Creation
2.3.6 Discretization and Binarization
2.3.7 Variable Transformation
2.4 Measures of Similarity and Dissimilarity
2.4.1 Basics
2.4.2 Similarity and Dissimilarity between Simple Attributes
2.4.3 Dissimilarities between Data Objects
2.4.4 Similarities between Data Objects
2.4.5 Examples of Proximity Measures
2.4.6 Mutual Information
2.4.7 Kernel Functions*
2.4.8 Bregman Divergence*
2.4.9 Issues in Proximity Calculation
2.4.10 Selecting the Right Proximity Measure
2.5 Bibliographic Notes
2.6 Exercises

3 Classification: Basic Concepts and Techniques
3.1 Basic Concepts
3.2 General Framework for Classification
3.3 Decision Tree Classifier
3.3.1 A Basic Algorithm to Build a Decision Tree
3.3.2 Methods for Expressing Attribute Test Conditions
3.3.3 Measures for Selecting an Attribute Test Condition
3.3.4 Algorithm for Decision Tree Induction
3.3.5 Example Application: Web Robot Detection
3.3.6 Characteristics of Decision Tree Classifiers
3.4 Model Overfitting
3.4.1 Reasons for Model Overfitting
3.5 Model Selection
3.5.1 Using a Validation Set
3.5.2 Incorporating Model Complexity
3.5.3 Estimating Statistical Bounds
3.5.4 Model Selection for Decision Trees
3.6 Model Evaluation
3.6.1 Holdout Method
3.6.2 Cross-Validation
3.7 Presence of Hyper-parameters
3.7.1 Hyper-parameter Selection
3.7.2 Nested Cross-Validation
3.8 Pitfalls of Model Selection and Evaluation
3.8.1 Overlap between Training and Test Sets
3.8.2 Use of Validation Error as Generalization Error
3.9 Model Comparison
3.9.1 Estimating the Confidence Interval for Accuracy
3.9.2 Comparing the Performance of Two Models
3.10 Bibliographic Notes
3.11 Exercises

4 Classification: Alternative Techniques
4.1 Types of Classifiers
4.2 Rule-Based Classifier
4.2.1 How a Rule-Based Classifier Works
4.2.2 Properties of a Rule Set
4.2.3 Direct Methods for Rule Extraction
4.2.4 Indirect Methods for Rule Extraction
4.2.5 Characteristics of Rule-Based Classifiers
4.3 Nearest Neighbor Classifiers
4.3.1 Algorithm
4.3.2 Characteristics of Nearest Neighbor Classifiers
4.4 Naïve Bayes Classifier
4.4.1 Basics of Probability Theory
4.4.2 Naïve Bayes Assumption
4.5 Bayesian Networks
4.5.1 Graphical Representation
4.5.2 Inference and Learning
4.5.3 Characteristics of Bayesian Networks
4.6 Logistic Regression
4.6.1 Logistic Regression as a Generalized Linear Model
4.6.2 Learning Model Parameters
4.6.3 Characteristics of Logistic Regression
4.7 Artificial Neural Network (ANN)
4.7.1 Perceptron
4.7.2 Multi-layer Neural Network
4.7.3 Characteristics of ANN
4.8 Deep Learning
4.8.1 Using Synergistic Loss Functions
4.8.2 Using Responsive Activation Functions
4.8.3 Regularization
4.8.4 Initialization of Model Parameters
4.8.5 Characteristics of Deep Learning
4.9 Support Vector Machine (SVM)
4.9.1 Margin of a Separating Hyperplane
4.9.2 Linear SVM
4.9.3 Soft-margin SVM
4.9.4 Nonlinear SVM
4.9.5 Characteristics of SVM
4.10 Ensemble Methods
4.10.1 Rationale for Ensemble Method
4.10.2 Methods for Constructing an Ensemble Classifier
4.10.3 Bias-Variance Decomposition
4.10.4 Bagging
4.10.5 Boosting
4.10.6 Random Forests
4.10.7 Empirical Comparison among Ensemble Methods
4.11 Class Imbalance Problem
4.11.1 Building Classifiers with Class Imbalance
4.11.2 Evaluating Performance with Class Imbalance
4.11.3 Finding an Optimal Score Threshold
4.11.4 Aggregate Evaluation of Performance
4.12 Multiclass Problem
4.13 Bibliographic Notes
4.14 Exercises

5 Association Analysis: Basic Concepts and Algorithms
5.1 Preliminaries
5.2 Frequent Itemset Generation
5.2.1 The Apriori Principle
5.2.2 Frequent Itemset Generation in the Apriori Algorithm
5.2.3 Candidate Generation and Pruning
5.2.4 Support Counting
5.2.5 Computational Complexity
5.3 Rule Generation
5.3.1 Confidence-Based Pruning
5.3.2 Rule Generation in Apriori Algorithm
5.3.3 An Example: Congressional Voting Records
5.4 Compact Representation of Frequent Itemsets
5.4.1 Maximal Frequent Itemsets
5.4.2 Closed Itemsets
5.5 Alternative Methods for Generating Frequent Itemsets*
5.6 FP-Growth Algorithm*
5.6.1 FP-Tree Representation
5.6.2 Frequent Itemset Generation in FP-Growth Algorithm
5.7 Evaluation of Association Patterns
5.7.1 Objective Measures of Interestingness
5.7.2 Measures beyond Pairs of Binary Variables
5.7.3 Simpson's Paradox
5.8 Effect of Skewed Support Distribution
5.9 Bibliographic Notes
5.10 Exercises

6 Association Analysis: Advanced Concepts
6.1 Handling Categorical Attributes
6.2 Handling Continuous Attributes
6.2.1 Discretization-Based Methods
6.2.2 Statistics-Based Methods
6.2.3 Non-discretization Methods
6.3 Handling a Concept Hierarchy
6.4 Sequential Patterns
6.4.1 Preliminaries
6.4.2 Sequential Pattern Discovery
6.4.3 Timing Constraints
6.4.4 Alternative Counting Schemes
6.5 Subgraph Patterns
6.5.1 Preliminaries
6.5.2 Frequent Subgraph Mining
6.5.3 Candidate Generation
6.5.4 Candidate Pruning
6.5.5 Support Counting
6.6 Infrequent Patterns
6.6.1 Negative Patterns
6.6.2 Negatively Correlated Patterns
6.6.3 Comparisons among Infrequent Patterns, Negative Patterns, and Negatively Correlated Patterns
6.6.4 Techniques for Mining Interesting Infrequent Patterns
6.6.5 Techniques Based on Mining Negative Patterns
6.6.6 Techniques Based on Support Expectation
6.7 Bibliographic Notes
6.8 Exercises

7 Cluster Analysis: Basic Concepts and Algorithms
7.1 Overview
7.1.1 What Is Cluster Analysis?
7.1.2 Different Types of Clusterings
7.1.3 Different Types of Clusters
7.2 K-means
7.2.1 The Basic K-means Algorithm
7.2.2 K-means: Additional Issues
7.2.3 Bisecting K-means
7.2.4 K-means and Different Types of Clusters
7.2.5 Strengths and Weaknesses
7.2.6 K-means as an Optimization Problem
7.3 Agglomerative Hierarchical Clustering
7.3.1 Basic Agglomerative Hierarchical Clustering Algorithm
7.3.2 Specific Techniques
7.3.3 The Lance-Williams Formula for Cluster Proximity
7.3.4 Key Issues in Hierarchical Clustering
7.3.5 Outliers
7.3.6 Strengths and Weaknesses
7.4 DBSCAN
7.4.1 Traditional Density: Center-Based Approach
7.4.2 The DBSCAN Algorithm
7.4.3 Strengths and Weaknesses
7.5 Cluster Evaluation
7.5.1 Overview
7.5.2 Unsupervised Cluster Evaluation Using Cohesion and Separation
7.5.3 Unsupervised Cluster Evaluation Using the Proximity Matrix
7.5.4 Unsupervised Evaluation of Hierarchical Clustering
7.5.5 Determining the Correct Number of Clusters
7.5.6 Clustering Tendency
7.5.7 Supervised Measures of Cluster Validity
7.5.8 Assessing the Significance of Cluster Validity Measures
7.5.9 Choosing a Cluster Validity Measure
7.6 Bibliographic Notes
7.7 Exercises

8 Cluster Analysis: Additional Issues and Algorithms
8.1 Characteristics of Data, Clusters, and Clustering Algorithms
8.1.1 Example: Comparing K-means and DBSCAN
8.1.2 Data Characteristics
8.1.3 Cluster Characteristics
8.1.4 General Characteristics of Clustering Algorithms
8.2 Prototype-Based Clustering
8.2.1 Fuzzy Clustering
8.2.2 Clustering Using Mixture Models
8.2.3 Self-Organizing Maps (SOM)
8.3 Density-Based Clustering
8.3.1 Grid-Based Clustering
8.3.2 Subspace Clustering
8.3.3 DENCLUE: A Kernel-Based Scheme for Density-Based Clustering
8.4 Graph-Based Clustering
8.4.1 Sparsification
8.4.2 Minimum Spanning Tree (MST) Clustering
8.4.3 OPOSSUM: Optimal Partitioning of Sparse Similarities Using METIS
8.4.4 Chameleon: Hierarchical Clustering with Dynamic Modeling
8.4.5 Spectral Clustering
8.4.6 Shared Nearest Neighbor Similarity
8.4.7 The Jarvis-Patrick Clustering Algorithm
8.4.8 SNN Density
8.4.9 SNN Density-Based Clustering
8.5 Scalable Clustering Algorithms
8.5.1 Scalability: General Issues and Approaches
8.5.2 BIRCH
8.5.3 CURE
8.6 Which Clustering Algorithm?
8.7 Bibliographic Notes
8.8 Exercises

9 Anomaly Detection
9.1 Characteristics of Anomaly Detection Problems
9.1.1 A Definition of an Anomaly
9.1.2 Nature of Data
9.1.3 How Anomaly Detection is Used
9.2 Characteristics of Anomaly Detection Methods
9.3 Statistical Approaches
9.3.1 Using Parametric Models
9.3.2 Using Non-parametric Models
9.3.3 Modeling Normal and Anomalous Classes
9.3.4 Assessing Statistical Significance
9.3.5 Strengths and Weaknesses
9.4 Proximity-based Approaches
9.4.1 Distance-based Anomaly Score
9.4.2 Density-based Anomaly Score
9.4.3 Relative Density-based Anomaly Score
9.4.4 Strengths and Weaknesses
9.5 Clustering-based Approaches
9.5.1 Finding Anomalous Clusters
9.5.2 Finding Anomalous Instances
9.5.3 Strengths and Weaknesses
9.6 Reconstruction-based Approaches
9.6.1 Strengths and Weaknesses
9.7 One-class Classification
9.7.1 Use of Kernels
9.7.2 The Origin Trick
9.7.3 Strengths and Weaknesses
9.8 Information Theoretic Approaches
9.8.1 Strengths and Weaknesses
9.9 Evaluation of Anomaly Detection
9.10 Bibliographic Notes
9.11 Exercises

10 Avoiding False Discoveries
10.1 Preliminaries: Statistical Testing
10.1.1 Significance Testing
10.1.2 Hypothesis Testing
10.1.3 Multiple Hypothesis Testing
10.1.4 Pitfalls in Statistical Testing
10.2 Modeling Null and Alternative Distributions
10.2.1 Generating Synthetic Data Sets
10.2.2 Randomizing Class Labels
10.2.3 Resampling Instances
10.2.4 Modeling the Distribution of the Test Statistic
10.3 Statistical Testing for Classification
10.3.1 Evaluating Classification Performance
10.3.2 Binary Classification as Multiple Hypothesis Testing
10.3.3 Multiple Hypothesis Testing in Model Selection
10.4 Statistical Testing for Association Analysis
10.4.1 Using Statistical Models
10.4.2 Using Randomization Methods
10.5 Statistical Testing for Cluster Analysis
10.5.1 Generating a Null Distribution for Internal Indices
10.5.2 Generating a Null Distribution for External Indices
10.5.3 Enrichment
10.6 Statistical Testing for Anomaly Detection
10.7 Bibliographic Notes
10.8 Exercises

Author Index
Subject Index
Copyright Permissions

1 Introduction

Rapid advances in data collection and storage technology, coupled with the ease with which data can be generated and disseminated, have triggered the explosive growth of data, leading to the current age of big data. Deriving actionable insights from these large data sets is increasingly important in decision making across almost all areas of society, including business and industry; science and engineering; medicine and biotechnology; and government and individuals. However, the amount of data (volume), its complexity (variety), and the rate at which it is being collected and processed (velocity) have simply become too great for humans to analyze unaided. Thus, there is a great need for automated tools for extracting useful information from the big data despite the challenges posed by its enormity and diversity.

Data mining blends traditional data analysis methods with sophisticated algorithms for processing this abundance of data. In this introductory chapter, we present an overview of data mining and outline the key topics to be covered in this book. We start with a description of some applications that require more advanced techniques for data analysis.

Business and Industry

Point-of-sale data collection (bar code scanners, radio frequency identification (RFID), and smart card technology) has allowed retailers to collect up-to-the-minute data about customer purchases at the checkout counters of their stores. Retailers can utilize this information, along with other business-critical data, such as web server logs from e-commerce websites and customer service records from call centers, to help them better understand the needs of their customers and make more informed business decisions.

Data mining techniques can be used to support a wide range of business intelligence applications, such as customer profiling, targeted marketing, workflow management, store layout, fraud detection, and automated buying and selling. An example of the last application is high-speed stock trading, where decisions on buying and selling have to be made in less than a second using data about financial transactions. Data mining can also help retailers answer important business questions, such as "Who are the most profitable customers?" "What products can be cross-sold or up-sold?" and "What is the revenue outlook of the company for next year?" These questions have inspired the development of such data mining techniques as association analysis (Chapters 5 and 6).

As the Internet continues to revolutionize the way we interact and make decisions in our everyday lives, we are generating massive amounts of data about our online experiences, e.g., web browsing, messaging, and posting on social networking websites. This has opened several opportunities for business applications that use web data. For example, in the e-commerce sector, data about our online viewing or shopping preferences can be used to provide personalized recommendations of products. Data mining also plays a prominent role in supporting several other Internet-based services, such as filtering spam messages, answering search queries, and suggesting social updates and connections. The large corpus of text, images, and videos available on the Internet has enabled a number of advancements in data mining methods, including deep learning, which is discussed in Chapter 4. These developments have led to great advances in a number of applications, such as object recognition, natural language translation, and autonomous driving.

Another domain that has undergone a rapid big data transformation is the use of mobile sensors and devices, such as smartphones and wearable computing devices. With better sensor technologies, it has become possible to collect a variety of information about our physical world using low-cost sensors embedded in everyday objects that are connected to each other, termed the Internet of Things (IoT). This deep integration of physical sensors in digital systems is beginning to generate large amounts of diverse and distributed data about our environment, which can be used for designing convenient, safe, and energy-efficient home systems, as well as for urban planning of smart cities.

Medicine, Science, and Engineering

Researchers in medicine, science, and engineering are rapidly accumulating data that is key to significant new discoveries. For example, as an important step toward improving our understanding of the Earth's climate system, NASA has deployed a series of Earth-orbiting satellites that continuously generate global observations of the land surface, oceans, and atmosphere. However, because of the size and spatio-temporal nature of the data, traditional methods are often not suitable for analyzing these data sets. Techniques developed in data mining can aid Earth scientists in answering questions such as the following: "What is the relationship between the frequency and intensity of ecosystem disturbances, such as droughts and hurricanes, and global warming?" "How are land surface precipitation and temperature affected by ocean surface temperature?" and "How well can we predict the beginning and end of the growing season for a region?"

As another example, researchers in molecular biology hope to use the large amounts of genomic data to better understand the structure and function of genes. In the past, traditional methods in molecular biology allowed scientists to study only a few genes at a time in a given experiment. Recent breakthroughs in microarray technology have enabled scientists to compare the behavior of thousands of genes under various situations. Such comparisons can help determine the function of each gene, and perhaps isolate the genes responsible for certain diseases. However, the noisy, high-dimensional nature of the data requires new data analysis methods. In addition to analyzing gene expression data, data mining can also be used to address other important biological challenges such as protein structure prediction, multiple sequence alignment, the modeling of biochemical pathways, and phylogenetics.

Another example is the use of data mining techniques to analyze electronic health record (EHR) data, which has become increasingly available. Not very long ago, studies of patients required manually examining the physical records of individual patients and extracting very specific pieces of information pertinent to the particular question being investigated. EHRs allow for a faster and broader exploration of such data. However, there are significant challenges since the observations on any one patient typically occur during their visits to a doctor or hospital, and only a small number of details about the health of the patient are measured during any particular visit.

Currently, EHR analysis focuses on simple types of data, e.g., a patient's blood pressure or the diagnosis code of a disease. However, large amounts of more complex types of medical data are also being collected, such as electrocardiograms (ECGs) and neuroimages from magnetic resonance imaging (MRI) or functional Magnetic Resonance Imaging (fMRI). Although challenging to analyze, this data also provides vital information about patients. Integrating and analyzing such data, along with traditional EHR and genomic data, is one of the capabilities needed to enable precision medicine, which aims to provide more personalized patient care.

1.1 What Is Data Mining?

Data mining is the process of automatically discovering useful information in large data repositories. Data mining techniques are deployed to scour large data sets in order to find novel and useful patterns that might otherwise remain unknown. They also provide the capability to predict the outcome of a future observation, such as the amount a customer will spend at an online or a brick-and-mortar store.

Not all information discovery tasks are considered to be data mining. Examples include queries, e.g., looking up individual records in a database or finding web pages that contain a particular set of keywords. This is because such tasks can be accomplished through simple interactions with a database management system or an information retrieval system. These systems rely on traditional computer science techniques, which include sophisticated indexing structures and query processing algorithms, for efficiently organizing and retrieving information from large data repositories. Nonetheless, data mining techniques have been used to enhance the performance of such systems by improving the quality of the search results based on their relevance to the input queries.

Data Mining and Knowledge Discovery in Databases

Data mining is an integral part of knowledge discovery in databases (KDD), which is the overall process of converting raw data into useful information, as shown in Figure 1.1. This process consists of a series of steps, from data preprocessing to postprocessing of data mining results.

Figure 1.1. The process of knowledge discovery in databases (KDD).

The input data can be stored in a variety of formats (flat files, spreadsheets, or relational tables) and may reside in a centralized data repository or be distributed across multiple sites. The purpose of preprocessing is to transform the raw input data into an appropriate format for subsequent analysis. The steps involved in data preprocessing include fusing data from multiple sources, cleaning data to remove noise and duplicate observations, and selecting records and features that are relevant to the data mining task at hand. Because of the many ways data can be collected and stored, data preprocessing is perhaps the most laborious and time-consuming step in the overall knowledge discovery process.
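As a small illustration of these preprocessing steps, the Python sketch below fuses two sources, removes duplicate and noisy records, and keeps only the relevant features. It is a minimal sketch under our own assumptions: the file names, column names, and the use of the pandas library are illustrative, not part of the book's discussion.

```python
import pandas as pd

# Fuse data from two hypothetical sources on a shared customer ID.
transactions = pd.read_csv("transactions.csv")   # assumed input file
demographics = pd.read_csv("demographics.csv")   # assumed input file
data = transactions.merge(demographics, on="customer_id", how="inner")

# Clean: drop exact duplicate observations and records with missing amounts.
data = data.drop_duplicates()
data = data.dropna(subset=["amount"])

# Remove obviously noisy records (e.g., negative purchase amounts).
data = data[data["amount"] >= 0]

# Select only the features relevant to the mining task at hand.
data = data[["customer_id", "age", "amount", "store_id"]]
```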

"Closing the loop" is a phrase often used to refer to the process of integrating data mining results into decision support systems. For example, in business applications, the insights offered by data mining results can be integrated with campaign management tools so that effective marketing promotions can be conducted and tested. Such integration requires a postprocessing step to ensure that only valid and useful results are incorporated into the decision support system. An example of postprocessing is visualization, which allows analysts to explore the data and the data mining results from a variety of viewpoints. Hypothesis testing methods can also be applied during postprocessing to eliminate spurious data mining results. (See Chapter 10.)

1.2 Motivating Challenges

As mentioned earlier, traditional data analysis techniques have often encountered practical difficulties in meeting the challenges posed by big data applications. The following are some of the specific challenges that motivated the development of data mining.

Scalability

Because of advances in data generation and collection, data sets with sizes of terabytes, petabytes, or even exabytes are becoming common. If data mining algorithms are to handle these massive data sets, they must be scalable. Many data mining algorithms employ special search strategies to handle exponential search problems. Scalability may also require the implementation of novel data structures to access individual records in an efficient manner. For instance, out-of-core algorithms may be necessary when processing data sets that cannot fit into main memory. Scalability can also be improved by using sampling or developing parallel and distributed algorithms. A general overview of techniques for scaling up data mining algorithms is given in Appendix F.
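As a rough sketch of the out-of-core idea mentioned above, the snippet below streams a file that is too large for main memory in fixed-size chunks and accumulates a running statistic instead of loading everything at once. The file name, column name, and use of pandas are our own assumptions for illustration.

```python
import pandas as pd

# Out-of-core style processing: read a large file in chunks rather than
# loading it into main memory all at once.
total, count = 0.0, 0
for chunk in pd.read_csv("huge_transactions.csv", chunksize=1_000_000):
    total += chunk["amount"].sum()   # accumulate a running statistic per chunk
    count += len(chunk)

print("mean amount:", total / count)
```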

High Dimensionality

It is now common to encounter data sets with hundreds or thousands of attributes instead of the handful common a few decades ago. In bioinformatics, progress in microarray technology has produced gene expression data involving thousands of features. Data sets with temporal or spatial components also tend to have high dimensionality. For example, consider a data set that contains measurements of temperature at various locations. If the temperature measurements are taken repeatedly for an extended period, the number of dimensions (features) increases in proportion to the number of measurements taken. Traditional data analysis techniques that were developed for low-dimensional data often do not work well for such high-dimensional data due to issues such as the curse of dimensionality (to be discussed in Chapter 2). Also, for some data analysis algorithms, the computational complexity increases rapidly as the dimensionality (the number of features) increases.

Heterogeneous and Complex Data

Traditional data analysis methods often deal with data sets containing attributes of the same type, either continuous or categorical. As the role of data mining in business, science, medicine, and other fields has grown, so has the need for techniques that can handle heterogeneous attributes. Recent years have also seen the emergence of more complex data objects. Examples of such non-traditional types of data include web and social media data containing text, hyperlinks, images, audio, and videos; DNA data with sequential and three-dimensional structure; and climate data that consists of measurements (temperature, pressure, etc.) at various times and locations on the Earth's surface. Techniques developed for mining such complex objects should take into consideration relationships in the data, such as temporal and spatial autocorrelation, graph connectivity, and parent-child relationships between the elements in semi-structured text and XML documents.

Data Ownership and Distribution

Sometimes, the data needed for an analysis is not stored in one location or owned by one organization. Instead, the data is geographically distributed among resources belonging to multiple entities. This requires the development of distributed data mining techniques. The key challenges faced by distributed data mining algorithms include the following: (1) how to reduce the amount of communication needed to perform the distributed computation, (2) how to effectively consolidate the data mining results obtained from multiple sources, and (3) how to address data security and privacy issues.

Non-traditional Analysis

The traditional statistical approach is based on a hypothesize-and-test paradigm. In other words, a hypothesis is proposed, an experiment is designed to gather the data, and then the data is analyzed with respect to the hypothesis. Unfortunately, this process is extremely labor-intensive. Current data analysis tasks often require the generation and evaluation of thousands of hypotheses, and consequently, the development of some data mining techniques has been motivated by the desire to automate the process of hypothesis generation and evaluation. Furthermore, the data sets analyzed in data mining are typically not the result of a carefully designed experiment and often represent opportunistic samples of the data, rather than random samples.

1.3 The Origins of Data Mining

While data mining has traditionally been viewed as an intermediate process within the KDD framework, as shown in Figure 1.1, it has emerged over the years as an academic field within computer science, focusing on all aspects of KDD, including data preprocessing, mining, and postprocessing. Its origin can be traced back to the late 1980s, following a series of workshops organized on the topic of knowledge discovery in databases. The workshops brought together researchers from different disciplines to discuss the challenges and opportunities in applying computational techniques to extract actionable knowledge from large databases. The workshops quickly grew into hugely popular conferences that were attended by researchers and practitioners from both academia and industry. The success of these conferences, along with the interest shown by businesses and industry in recruiting new hires with a data mining background, has fueled the tremendous growth of this field.

The field was initially built upon the methodology and algorithms that researchers had previously used. In particular, data mining researchers draw upon ideas such as (1) sampling, estimation, and hypothesis testing from statistics, and (2) search algorithms, modeling techniques, and learning theories from artificial intelligence, pattern recognition, and machine learning. Data mining has also been quick to adopt ideas from other areas, including optimization, evolutionary computing, information theory, signal processing, visualization, and information retrieval, and to extend them to solve the challenges of mining big data.

A number of other areas also play key supporting roles. In particular, database systems are needed to provide support for efficient storage, indexing, and query processing. Techniques from high performance (parallel) computing are often important in addressing the massive size of some data sets. Distributed techniques can also help address the issue of size and are essential when the data cannot be gathered in one location. Figure 1.2 shows the relationship of data mining to other areas.

Figure 1.2. Data mining as a confluence of many disciplines.

Data Science and Data-Driven Discovery

Data science is an interdisciplinary field that studies and applies tools and techniques for deriving useful insights from data. Although data science is regarded as an emerging field with a distinct identity of its own, the tools and techniques often come from many different areas of data analysis, such as data mining, statistics, AI, machine learning, pattern recognition, database technology, and distributed and parallel computing. (See Figure 1.2.)

The emergence of data science as a new field is a recognition that, often, none of the existing areas of data analysis provides a complete set of tools for the data analysis tasks that are often encountered in emerging applications. Instead, a broad range of computational, mathematical, and statistical skills is often required. To illustrate the challenges that arise in analyzing such data, consider the following example. Social media and the Web present new opportunities for social scientists to observe and quantitatively measure human behavior on a large scale. To conduct such a study, social scientists work with analysts who possess skills in areas such as web mining, natural language processing (NLP), network analysis, data mining, and statistics. Compared to more traditional research in social science, which is often based on surveys, this analysis requires a broader range of skills and tools, and involves far larger amounts of data. Thus, data science is, by necessity, a highly interdisciplinary field that builds on the continuing work of many fields.

The data-driven approach of data science emphasizes the direct discovery of patterns and relationships from data, especially in large quantities of data, often without the need for extensive domain knowledge. A notable example of the success of this approach is represented by advances in neural networks, i.e., deep learning, which have been particularly successful in areas that have long proved challenging, e.g., recognizing objects in photos or videos and words in speech, as well as in other application areas. However, note that this is just one example of the success of data-driven approaches, and dramatic improvements have also occurred in many other areas of data analysis. Many of these developments are topics described later in this book.

Some cautions on potential limitations of a purely data-driven approach are given in the Bibliographic Notes.

1.4 Data Mining Tasks

Data mining tasks are generally divided into two major categories:

Predictive tasks. The objective of these tasks is to predict the value of a particular attribute based on the values of other attributes. The attribute to be predicted is commonly known as the target or dependent variable, while the attributes used for making the prediction are known as the explanatory or independent variables.

Descriptive tasks. Here, the objective is to derive patterns (correlations, trends, clusters, trajectories, and anomalies) that summarize the underlying relationships in data. Descriptive data mining tasks are often exploratory in nature and frequently require postprocessing techniques to validate and explain the results.

Figure 1.3 illustrates four of the core data mining tasks that are described in the remainder of this book.

Figure 1.3. Four of the core data mining tasks.

Predictive modeling refers to the task of building a model for the target variable as a function of the explanatory variables. There are two types of predictive modeling tasks: classification, which is used for discrete target variables, and regression, which is used for continuous target variables. For example, predicting whether a web user will make a purchase at an online bookstore is a classification task because the target variable is binary-valued. On the other hand, forecasting the future price of a stock is a regression task because price is a continuous-valued attribute. The goal of both tasks is to learn a model that minimizes the error between the predicted and true values of the target variable. Predictive modeling can be used to identify customers who will respond to a marketing campaign, predict disturbances in the Earth's ecosystem, or judge whether a patient has a particular disease based on the results of medical tests.

Example 1.1 (Predicting the Type of a Flower). Consider the task of predicting a species of flower based on the characteristics of the flower. In particular, consider classifying an Iris flower as one of the following three Iris species: Setosa, Versicolour, or Virginica. To perform this task, we need a data set containing the characteristics of various flowers of these three species. A data set with this type of information is the well-known Iris data set from the UCI Machine Learning Repository at http://www.ics.uci.edu/~mlearn. In addition to the species of a flower, this data set contains four other attributes: sepal width, sepal length, petal length, and petal width. Figure 1.4 shows a plot of petal width versus petal length for the 150 flowers in the Iris data set. Petal width is broken into the categories low, medium, and high, which correspond to the intervals [0, 0.75), [0.75, 1.75), [1.75, ∞), respectively. Also, petal length is broken into the categories low, medium, and high, which correspond to the intervals [0, 2.5), [2.5, 5), [5, ∞), respectively. Based on these categories of petal width and length, the following rules can be derived:

Petal width low and petal length low implies Setosa.
Petal width medium and petal length medium implies Versicolour.
Petal width high and petal length high implies Virginica.

While these rules do not classify all the flowers, they do a good (but not perfect) job of classifying most of the flowers. Note that flowers from the Setosa species are well separated from the Versicolour and Virginica species with respect to petal width and length, but the latter two species overlap somewhat with respect to these attributes.

Figure 1.4. Petal width versus petal length for 150 Iris flowers.
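To make the rules in Example 1.1 concrete, the following minimal Python sketch applies the same three rules; the discretization thresholds come from the example, but the function names and the tiny sample measurements are illustrative only.

```python
def categorize(value, cuts):
    """Map a measurement to 'low', 'medium', or 'high' given two cut points."""
    low_cut, high_cut = cuts
    if value < low_cut:
        return "low"
    elif value < high_cut:
        return "medium"
    return "high"

def classify_iris(petal_width, petal_length):
    """Apply the three rules from Example 1.1; return None when no rule fires."""
    width = categorize(petal_width, (0.75, 1.75))   # intervals from the example
    length = categorize(petal_length, (2.5, 5.0))
    if width == "low" and length == "low":
        return "Setosa"
    if width == "medium" and length == "medium":
        return "Versicolour"
    if width == "high" and length == "high":
        return "Virginica"
    return None  # the rules do not cover every flower

# Illustrative measurements (petal width, petal length) in centimeters.
for pw, pl in [(0.2, 1.4), (1.3, 4.5), (2.1, 5.8), (1.8, 4.9)]:
    print((pw, pl), "->", classify_iris(pw, pl))
```

The last pair illustrates the caveat in the example: a flower whose petal width is high but whose petal length is only medium matches none of the rules.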

Association analysis is used to discover patterns that describe strongly associated features in the data. The discovered patterns are typically represented in the form of implication rules or feature subsets. Because of the exponential size of its search space, the goal of association analysis is to extract the most interesting patterns in an efficient manner. Useful applications of association analysis include finding groups of genes that have related functionality, identifying web pages that are accessed together, or understanding the relationships between different elements of Earth's climate system.

Example 1.2 (Market Basket Analysis). The transactions shown in Table 1.1 illustrate point-of-sale data collected at the checkout counters of a grocery store. Association analysis can be applied to find items that are frequently bought together by customers. For example, we may discover the rule {Diapers} → {Milk}, which suggests that customers who buy diapers also tend to buy milk. This type of rule can be used to identify potential cross-selling opportunities among related items.

Table 1.1. Market basket data.

Transaction ID   Items
1    {Bread, Butter, Diapers, Milk}
2    {Coffee, Sugar, Cookies, Salmon}
3    {Bread, Butter, Coffee, Diapers, Milk, Eggs}
4    {Bread, Butter, Salmon, Chicken}
5    {Eggs, Bread, Butter}
6    {Salmon, Diapers, Milk}
7    {Bread, Tea, Sugar, Eggs}
8    {Coffee, Sugar, Chicken, Eggs}
9    {Bread, Diapers, Milk, Salt}
10   {Tea, Eggs, Cookies, Diapers, Milk}
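As a rough sketch of how such a rule could be scored, the Python snippet below computes the support and confidence of {Diapers} → {Milk} directly from Table 1.1; the helper function is our own, not an algorithm from the book.

```python
# Transactions from Table 1.1, written as Python sets.
transactions = [
    {"Bread", "Butter", "Diapers", "Milk"},
    {"Coffee", "Sugar", "Cookies", "Salmon"},
    {"Bread", "Butter", "Coffee", "Diapers", "Milk", "Eggs"},
    {"Bread", "Butter", "Salmon", "Chicken"},
    {"Eggs", "Bread", "Butter"},
    {"Salmon", "Diapers", "Milk"},
    {"Bread", "Tea", "Sugar", "Eggs"},
    {"Coffee", "Sugar", "Chicken", "Eggs"},
    {"Bread", "Diapers", "Milk", "Salt"},
    {"Tea", "Eggs", "Cookies", "Diapers", "Milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Evaluate the rule {Diapers} -> {Milk}.
antecedent, consequent = {"Diapers"}, {"Milk"}
rule_support = support(antecedent | consequent)    # 5/10 = 0.5
confidence = rule_support / support(antecedent)    # 5/5  = 1.0
print(f"support = {rule_support:.2f}, confidence = {confidence:.2f}")
```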

Cluster analysis seeks to find groups of closely related observations so that observations that belong to the same cluster are more similar to each other than observations that belong to other clusters. Clustering has been used to group sets of related customers, find areas of the ocean that have a significant impact on the Earth's climate, and compress data.

Example 1.3 (Document Clustering). The collection of news articles shown in Table 1.2 can be grouped based on their respective topics. Each article is represented as a set of word-frequency pairs (w : c), where w is a word and c is the number of times the word appears in the article. There are two natural clusters in the data set. The first cluster consists of the first four articles, which correspond to news about the economy, while the second cluster contains the last four articles, which correspond to news about health care. A good clustering algorithm should be able to identify these two clusters based on the similarity between words that appear in the articles.

Table 1.2. Collection of news articles.

Article   Word-frequency pairs
1   dollar: 1, industry: 4, country: 2, loan: 3, deal: 2, government: 2
2   machinery: 2, labor: 3, market: 4, industry: 2, work: 3, country: 1
3   job: 5, inflation: 3, rise: 2, jobless: 2, market: 3, country: 2, index: 3
4   domestic: 3, forecast: 2, gain: 1, market: 2, sale: 3, price: 2
5   patient: 4, symptom: 2, drug: 3, health: 2, clinic: 2, doctor: 2
6   pharmaceutical: 2, company: 3, drug: 2, vaccine: 1, flu: 3
7   death: 2, cancer: 4, drug: 3, public: 4, health: 3, director: 2
8   medical: 2, cost: 3, increase: 2, patient: 2, health: 3, care: 1
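A minimal sketch of this idea, assuming the scikit-learn library is available: the articles from Table 1.2 are turned into word-count vectors and grouped into two clusters with K-means. K-means is a specific algorithm choice on our part; the example itself does not prescribe one.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.cluster import KMeans

# Word-frequency representation of the eight articles in Table 1.2.
articles = [
    {"dollar": 1, "industry": 4, "country": 2, "loan": 3, "deal": 2, "government": 2},
    {"machinery": 2, "labor": 3, "market": 4, "industry": 2, "work": 3, "country": 1},
    {"job": 5, "inflation": 3, "rise": 2, "jobless": 2, "market": 3, "country": 2, "index": 3},
    {"domestic": 3, "forecast": 2, "gain": 1, "market": 2, "sale": 3, "price": 2},
    {"patient": 4, "symptom": 2, "drug": 3, "health": 2, "clinic": 2, "doctor": 2},
    {"pharmaceutical": 2, "company": 3, "drug": 2, "vaccine": 1, "flu": 3},
    {"death": 2, "cancer": 4, "drug": 3, "public": 4, "health": 3, "director": 2},
    {"medical": 2, "cost": 3, "increase": 2, "patient": 2, "health": 3, "care": 1},
]

# Convert the sparse word counts into a document-term matrix.
X = DictVectorizer(sparse=False).fit_transform(articles)

# Group the articles into two clusters.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # articles 1-4 and 5-8 should tend to fall into separate clusters
```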

Anomaly detection is the task of identifying observations whose characteristics are significantly different from the rest of the data. Such observations are known as anomalies or outliers. The goal of an anomaly detection algorithm is to discover the real anomalies and avoid falsely labeling normal objects as anomalous. In other words, a good anomaly detector must have a high detection rate and a low false alarm rate. Applications of anomaly detection include the detection of fraud, network intrusions, unusual patterns of disease, and ecosystem disturbances, such as droughts, floods, fires, hurricanes, etc.

Example 1.4 (Credit Card Fraud Detection). A credit card company records the transactions made by every credit card holder, along with personal information such as credit limit, age, annual income, and address. Since the number of fraudulent cases is relatively small compared to the number of legitimate transactions, anomaly detection techniques can be applied to build a profile of legitimate transactions for the users. When a new transaction arrives, it is compared against the profile of the user. If the characteristics of the transaction are very different from the previously created profile, then the transaction is flagged as potentially fraudulent.
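One very simple way to realize such a profile, sketched below under our own assumptions (a per-user history of transaction amounts and a z-score threshold), is to flag a new amount as suspicious when it lies far from the user's historical mean; the book's chapter on anomaly detection covers much richer approaches.

```python
import statistics

def build_profile(amounts):
    """Summarize a user's legitimate transaction amounts by mean and spread."""
    return statistics.mean(amounts), statistics.stdev(amounts)

def is_suspicious(amount, profile, threshold=3.0):
    """Flag a transaction whose amount is many standard deviations from the profile."""
    mean, stdev = profile
    return abs(amount - mean) / stdev > threshold

history = [24.0, 31.5, 18.2, 42.0, 27.8, 35.1, 22.4]  # illustrative past amounts
profile = build_profile(history)
for new_amount in [29.0, 480.0]:
    print(new_amount, "suspicious?", is_suspicious(new_amount, profile))
```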

1.5 Scope and Organization of the Book

This book introduces the major principles and techniques used in data mining from an algorithmic perspective. A study of these principles and techniques is essential for developing a better understanding of how data mining technology can be applied to various kinds of data. This book also serves as a starting point for readers who are interested in doing research in this field.

We begin the technical discussion of this book with a chapter on data (Chapter 2), which discusses the basic types of data, data quality, preprocessing techniques, and measures of similarity and dissimilarity. Although this material can be covered quickly, it provides an essential foundation for data analysis. Chapters 3 and 4 cover classification. Chapter 3 provides a foundation by discussing decision tree classifiers and several issues that are important to all classification: overfitting, underfitting, model selection, and performance evaluation. Using this foundation, Chapter 4 describes a number of other important classification techniques: rule-based systems, nearest neighbor classifiers, Bayesian classifiers, artificial neural networks, including deep learning, support vector machines, and ensemble classifiers, which are collections of classifiers. The multiclass and imbalanced class problems are also discussed. These topics can be covered independently.

Association analysis is explored in Chapters 5 and 6. Chapter 5 describes the basics of association analysis: frequent itemsets, association rules, and some of the algorithms used to generate them. Specific types of frequent itemsets (maximal, closed, and hyperclique) that are important for data mining are also discussed, and the chapter concludes with a discussion of evaluation measures for association analysis. Chapter 6 considers a variety of more advanced topics, including how association analysis can be applied to categorical and continuous data or to data that has a concept hierarchy. (A concept hierarchy is a hierarchical categorization of objects, e.g., store items → clothing → shoes → sneakers.) This chapter also describes how association analysis can be extended to find sequential patterns (patterns involving order), patterns in graphs, and negative relationships (if one item is present, then the other is not).

Cluster analysis is discussed in Chapters 7 and 8. Chapter 7 first describes the different types of clusters, and then presents three specific clustering techniques: K-means, agglomerative hierarchical clustering, and DBSCAN. This is followed by a discussion of techniques for validating the results of a clustering algorithm. Additional clustering concepts and techniques are explored in Chapter 8, including fuzzy and probabilistic clustering, Self-Organizing Maps (SOM), graph-based clustering, spectral clustering, and density-based clustering. There is also a discussion of scalability issues and factors to consider when selecting a clustering algorithm.

Chapter 9 is on anomaly detection. After some basic definitions, several different types of anomaly detection are considered: statistical, distance-based, density-based, clustering-based, reconstruction-based, one-class classification, and information theoretic. The last chapter, Chapter 10, supplements the discussions in the other chapters with a discussion of the statistical concepts important for avoiding spurious results, and then discusses those concepts in the context of data mining techniques studied in the previous chapters. These techniques include statistical hypothesis testing, p-values, the false discovery rate, and permutation testing. Appendices A through F give a brief review of important topics that are used in portions of the book: linear algebra, dimensionality reduction, statistics, regression, optimization, and scaling up data mining techniques for big data.

The subject of data mining, while relatively young compared to statistics or machine learning, is already too large to cover in a single book. Selected references to topics that are only briefly covered, such as data quality, are provided in the Bibliographic Notes section of the appropriate chapter. References to topics not covered in this book, such as mining streaming data and privacy-preserving data mining, are provided in the Bibliographic Notes of this chapter.

1.6 Bibliographic Notes

The topic of data mining has inspired many textbooks. Introductory textbooks include those by Dunham [16], Han et al. [29], Hand et al. [31], Roiger and Geatz [50], Zaki and Meira [61], and Aggarwal [2]. Data mining books with a stronger emphasis on business applications include the works by Berry and Linoff [5], Pyle [47], and Parr Rud [45]. Books with an emphasis on statistical learning include those by Cherkassky and Mulier [11], and Hastie et al. [32]. Similar books with an emphasis on machine learning or pattern recognition are those by Duda et al. [15], Kantardzic [34], Mitchell [43], Webb [57], and Witten and Frank [58]. There are also some more specialized books: Chakrabarti [9] (web mining), Fayyad et al. [20] (collection of early articles on data mining), Fayyad et al. [18] (visualization), Grossman et al. [25] (science and engineering), Kargupta and Chan [35] (distributed data mining), Wang et al. [56] (bioinformatics), and Zaki and Ho [60] (parallel data mining).

There are several conferences related to data mining. Some of the main conferences dedicated to this field include the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), the IEEE International Conference on Data Mining (ICDM), the SIAM International Conference on Data Mining (SDM), the European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), and the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD). Data mining papers can also be found in other major conferences such as the Conference and Workshop on Neural Information Processing Systems (NIPS), the International Conference on Machine Learning (ICML), the ACM SIGMOD/PODS conference, the International Conference on Very Large Data Bases (VLDB), the Conference on Information and Knowledge Management (CIKM), the International Conference on Data Engineering (ICDE), the National Conference on Artificial Intelligence (AAAI), the IEEE International Conference on Big Data, and the IEEE International Conference on Data Science and Advanced Analytics (DSAA).

Journal publications on data mining include IEEE Transactions on Knowledge and Data Engineering, Data Mining and Knowledge Discovery, Knowledge and Information Systems, ACM Transactions on Knowledge Discovery from Data, Statistical Analysis and Data Mining, and Information Systems. There is a variety of open-source data mining software available, including Weka [27] and Scikit-learn [46]. More recently, data mining software such as Apache Mahout and Apache Spark has been developed for large-scale problems on the distributed computing platform.

There have been a number of general articles on data mining that define the field or its relationship to other fields, particularly statistics. Fayyad et al. [19] describe data mining and how it fits into the total knowledge discovery process. Chen et al. [10] give a database perspective on data mining. Ramakrishnan and Grama [48] provide a general discussion of data mining and present several viewpoints. Hand [30] describes how data mining differs from statistics, as does Friedman [21]. Lambert [40] explores the use of statistics for large data sets and provides some comments on the respective roles of data mining and statistics. Glymour et al. [23] consider the lessons that statistics may have for data mining. Smyth et al. [53] describe how the evolution of data mining is being driven by new types of data and applications, such as those involving streams, graphs, and text. Han et al. [28] consider emerging applications in data mining and Smyth [52] describes some research challenges in data mining. Wu et al. [59] discuss how developments in data mining research can be turned into practical tools. Data mining standards are the subject of a paper by Grossman et al. [24]. Bradley [7] discusses how data mining algorithms can be scaled to large data sets.

The emergence of new data mining applications has produced new challenges that need to be addressed. For instance, concerns about privacy breaches as a result of data mining have escalated in recent years, particularly in application domains such as web commerce and health care. As a result, there is growing interest in developing data mining algorithms that maintain user privacy. Developing techniques for mining encrypted or randomized data is known as privacy-preserving data mining. Some general references in this area include papers by Agrawal and Srikant [3], Clifton et al. [12] and Kargupta et al. [36]. Vassilios et al. [55] provide a survey. Another area of concern is the bias in predictive models that may be used for some applications, e.g., screening job applicants or deciding prison parole [39]. Assessing whether such applications are producing biased results is made more difficult by the fact that the predictive models used for such applications are often black box models, i.e., models that are not interpretable in any straightforward way.

Data science, its constituent fields, and more generally, the new paradigm of knowledge discovery they represent [33], have great potential, some of which has been realized. However, it is important to emphasize that data science works mostly with observational data, i.e., data that was collected by various organizations as part of their normal operation. The consequence of this is that sampling biases are common and the determination of causal factors becomes more problematic. For this and a number of other reasons, it is often hard to interpret the predictive models built from this data [42, 49]. Thus, theory, experimentation and computational simulations will continue to be the methods of choice in many areas, especially those related to science.

More importantly, a purely data-driven approach often ignores the existing knowledge in a particular field. Such models may perform poorly, for example, predicting impossible outcomes or failing to generalize to new situations. However, if the model does work well, e.g., has high predictive accuracy, then this approach may be sufficient for practical purposes in some fields. But in many areas, such as medicine and science, gaining insight into the underlying domain is often the goal. Some recent work attempts to address these issues in order to create theory-guided data science, which takes pre-existing domain knowledge into account [17, 37].

Recent years have witnessed a growing number of applications that rapidly generate continuous streams of data. Examples of stream data include network traffic, multimedia streams, and stock prices. Several issues must be considered when mining data streams, such as the limited amount of memory available, the need for online analysis, and the change of the data over time. Data mining for stream data has become an important area in data mining. Some selected publications are Domingos and Hulten [14] (classification), Giannella et al. [22] (association analysis), Guha et al. [26] (clustering), Kifer et al. [38] (change detection), Papadimitriou et al. [44] (time series), and Law et al. [41] (dimensionality reduction).

Another area of interest is recommender and collaborative filtering systems [1, 6, 8, 13, 54], which suggest movies, television shows, books, products, etc. that a person might like. In many cases, this problem, or at least a component of it, is treated as a prediction problem and thus, data mining techniques can be applied [4, 51].

Bibliography

[1] G. Adomavicius and A. Tuzhilin. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering, 17(6):734–749, 2005.

[2] C. Aggarwal. Data Mining: The Textbook. Springer, 2015.

[3]R.AgrawalandR.Srikant.Privacy-preservingdatamining.InProc.of2000ACMSIGMODIntl.Conf.onManagementofData,pages439–450,Dallas,Texas,2000.ACMPress.

[4]X.AmatriainandJ.M.Pujol.Dataminingmethodsforrecommendersystems.InRecommenderSystemsHandbook,pages227–262.Springer,2015.

[5]M.J.A.BerryandG.Linoff.DataMiningTechniques:ForMarketing,Sales,andCustomerRelationshipManagement.WileyComputerPublishing,2ndedition,2004.

[6]J.Bobadilla,F.Ortega,A.Hernando,andA.Gutiérrez.Recommendersystemssurvey.Knowledge-basedsystems,46:109–132,2013.

[7]P.S.Bradley,J.Gehrke,R.Ramakrishnan,andR.Srikant.Scalingminingalgorithmstolargedatabases.CommunicationsoftheACM,45(8):38–43,2002.

[8]R.Burke.Hybridrecommendersystems:Surveyandexperiments.Usermodelinganduser-adaptedinteraction,12(4):331–370,2002.

[9]S.Chakrabarti.MiningtheWeb:DiscoveringKnowledgefromHypertextData.MorganKaufmann,SanFrancisco,CA,2003.

[10]M.-S.Chen,J.Han,andP.S.Yu.DataMining:AnOverviewfromaDatabasePerspective.IEEETransactionsonKnowledgeandDataEngineering,8(6):866–883,1996.

[11]V.CherkasskyandF.Mulier.LearningfromData:Concepts,Theory,andMethods.Wiley-IEEEPress,2ndedition,1998.

[12]C.Clifton,M.Kantarcioglu,andJ.Vaidya.Definingprivacyfordatamining.InNationalScienceFoundationWorkshoponNextGenerationDataMining,pages126–133,Baltimore,MD,November2002.

[13]C.DesrosiersandG.Karypis.Acomprehensivesurveyofneighborhood-basedrecommendationmethods.Recommendersystemshandbook,pages107–144,2011.

[14] P. Domingos and G. Hulten. Mining high-speed data streams. In Proc. of the 6th Intl. Conf. on Knowledge Discovery and Data Mining, pages 71–80, Boston, Massachusetts, 2000. ACM Press.

[15] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley & Sons, Inc., New York, 2nd edition, 2001.

[16]M.H.Dunham.DataMining:IntroductoryandAdvancedTopics.PrenticeHall,2006.

[17]J.H.Faghmous,A.Banerjee,S.Shekhar,M.Steinbach,V.Kumar,A.R.Ganguly,andN.Samatova.Theory-guideddatascienceforclimatechange.Computer,47(11):74–78,2014.

[18]U.M.Fayyad,G.G.Grinstein,andA.Wierse,editors.InformationVisualizationinDataMiningandKnowledgeDiscovery.MorganKaufmannPublishers,SanFrancisco,CA,September2001.

[19]U.M.Fayyad,G.Piatetsky-Shapiro,andP.Smyth.FromDataMiningtoKnowledgeDiscovery:AnOverview.InAdvancesinKnowledgeDiscoveryandDataMining,pages1–34.AAAIPress,1996.

[20]U.M.Fayyad,G.Piatetsky-Shapiro,P.Smyth,andR.Uthurusamy,editors.AdvancesinKnowledgeDiscoveryandDataMining.AAAI/MITPress,1996.

[21]J.H.Friedman.DataMiningandStatistics:What’stheConnection?Unpublished.www-stat.stanford.edu/~jhf/ftp/dm-stat.ps,1997.

[22]C.Giannella,J.Han,J.Pei,X.Yan,andP.S.Yu.MiningFrequentPatternsinDataStreamsatMultipleTimeGranularities.InH.Kargupta,A.Joshi,K.Sivakumar,andY.Yesha,editors,NextGenerationDataMining,pages191–212.AAAI/MIT,2003.

[23]C.Glymour,D.Madigan,D.Pregibon,andP.Smyth.StatisticalThemesandLessonsforDataMining.DataMiningandKnowledgeDiscovery,1(1):11–28,1997.

[24]R.L.Grossman,M.F.Hornick,andG.Meyer.Dataminingstandardsinitiatives.CommunicationsoftheACM,45(8):59–61,2002.

[25]R.L.Grossman,C.Kamath,P.Kegelmeyer,V.Kumar,andR.Namburu,editors.DataMiningforScientificandEngineeringApplications.KluwerAcademicPublishers,2001.

[26]S.Guha,A.Meyerson,N.Mishra,R.Motwani,andL.O’Callaghan.ClusteringDataStreams:TheoryandPractice.IEEETransactionsonKnowledgeandDataEngineering,15(3):515–528,May/June2003.

[27]M.Hall,E.Frank,G.Holmes,B.Pfahringer,P.Reutemann,andI.H.Witten.TheWEKADataMiningSoftware:AnUpdate.SIGKDDExplorations,11(1),2009.

[28]J.Han,R.B.Altman,V.Kumar,H.Mannila,andD.Pregibon.Emergingscientificapplicationsindatamining.CommunicationsoftheACM,45(8):54–58,2002.

[29]J.Han,M.Kamber,andJ.Pei.DataMining:ConceptsandTechniques.MorganKaufmannPublishers,SanFrancisco,3rdedition,2011.

[30]D.J.Hand.DataMining:StatisticsandMore?TheAmericanStatistician,52(2):112–118,1998.

[31]D.J.Hand,H.Mannila,andP.Smyth.PrinciplesofDataMining.MITPress,2001.

[32]T.Hastie,R.Tibshirani,andJ.H.Friedman.TheElementsofStatisticalLearning:DataMining,Inference,Prediction.Springer,2ndedition,2009.

[33]T.Hey,S.Tansley,K.M.Tolle,etal.Thefourthparadigm:data-intensivescientificdiscovery,volume1.MicrosoftresearchRedmond,WA,2009.

[34]M.Kantardzic.DataMining:Concepts,Models,Methods,andAlgorithms.Wiley-IEEEPress,Piscataway,NJ,2003.

[35]H.KarguptaandP.K.Chan,editors.AdvancesinDistributedandParallelKnowledgeDiscovery.AAAIPress,September2002.

[36]H.Kargupta,S.Datta,Q.Wang,andK.Sivakumar.OnthePrivacyPreservingPropertiesofRandomDataPerturbationTechniques.InProc.ofthe2003IEEEIntl.Conf.onDataMining,pages99–106,Melbourne,Florida,December2003.IEEEComputerSociety.

[37]A.Karpatne,G.Atluri,J.Faghmous,M.Steinbach,A.Banerjee,A.Ganguly,S.Shekhar,N.Samatova,andV.Kumar.Theory-guidedDataScience:ANewParadigmforScientificDiscoveryfromData.IEEETransactionsonKnowledgeandDataEngineering,2017.

[38]D.Kifer,S.Ben-David,andJ.Gehrke.DetectingChangeinDataStreams.InProc.ofthe30thVLDBConf.,pages180–191,Toronto,Canada,2004.MorganKaufmann.

[39]J.Kleinberg,J.Ludwig,andS.Mullainathan.AGuidetoSolvingSocialProblemswithMachineLearning.HarvardBusinessReview,December2016.

[40]D.Lambert.WhatUseisStatisticsforMassiveData?InACMSIGMODWorkshoponResearchIssuesinDataMiningandKnowledgeDiscovery,pages54–62,2000.

[41]M.H.C.Law,N.Zhang,andA.K.Jain.NonlinearManifoldLearningforDataStreams.InProc.oftheSIAMIntl.Conf.onDataMining,LakeBuenaVista,Florida,April2004.SIAM.

[42]Z.C.Lipton.Themythosofmodelinterpretability.arXivpreprintarXiv:1606.03490,2016.

[43]T.Mitchell.MachineLearning.McGraw-Hill,Boston,MA,1997.

[44]S.Papadimitriou,A.Brockwell,andC.Faloutsos.Adaptive,unsupervisedstreammining.VLDBJournal,13(3):222–239,2004.

[45] O. Parr Rud. Data Mining Cookbook: Modeling Data for Marketing, Risk and Customer Relationship Management. John Wiley & Sons, New York, NY, 2001.

[46]F.Pedregosa,G.Varoquaux,A.Gramfort,V.Michel,B.Thirion,O.Grisel,M.Blondel,P.Prettenhofer,R.Weiss,V.Dubourg,J.Vanderplas,A.Passos,D.Cournapeau,M.Brucher,M.Perrot,andE.Duchesnay.Scikit-learn:MachineLearninginPython.JournalofMachineLearningResearch,12:2825–2830,2011.

[47]D.Pyle.BusinessModelingandDataMining.MorganKaufmann,SanFrancisco,CA,2003.

[48]N.RamakrishnanandA.Grama.DataMining:FromSerendipitytoScience—GuestEditors’Introduction.IEEEComputer,32(8):34–37,1999.

[49]M.T.Ribeiro,S.Singh,andC.Guestrin.Whyshoulditrustyou?:Explainingthepredictionsofanyclassifier.InProceedingsofthe22ndACMSIGKDDInternationalConferenceonKnowledgeDiscoveryandDataMining,pages1135–1144.ACM,2016.

[50]R.RoigerandM.Geatz.DataMining:ATutorialBasedPrimer.Addison-Wesley,2002.

[51]J.Schafer.TheApplicationofData-MiningtoRecommenderSystems.Encyclopediaofdatawarehousingandmining,1:44–48,2009.

[52]P.Smyth.BreakingoutoftheBlack-Box:ResearchChallengesinDataMining.InProc.ofthe2001ACMSIGMODWorkshoponResearchIssuesinDataMiningandKnowledgeDiscovery,2001.

[53]P.Smyth,D.Pregibon,andC.Faloutsos.Data-drivenevolutionofdataminingalgorithms.CommunicationsoftheACM,45(8):33–37,2002.

[54]X.SuandT.M.Khoshgoftaar.Asurveyofcollaborativefilteringtechniques.Advancesinartificialintelligence,2009:4,2009.

[55]V.S.Verykios,E.Bertino,I.N.Fovino,L.P.Provenza,Y.Saygin,andY.Theodoridis.State-of-the-artinprivacypreservingdatamining.SIGMODRecord,33(1):50–57,2004.

[56]J.T.L.Wang,M.J.Zaki,H.Toivonen,andD.E.Shasha,editors.DataMininginBioinformatics.Springer,September2004.

[57] A. R. Webb. Statistical Pattern Recognition. John Wiley & Sons, 2nd edition, 2002.

[58]I.H.WittenandE.Frank.DataMining:PracticalMachineLearningToolsandTechniques.MorganKaufmann,3rdedition,2011.

[59]X.Wu,P.S.Yu,andG.Piatetsky-Shapiro.DataMining:HowResearchMeetsPracticalDevelopment?KnowledgeandInformationSystems,5(2):248–261,2003.

[60]M.J.ZakiandC.-T.Ho,editors.Large-ScaleParallelDataMining.Springer,September2002.

[61]M.J.ZakiandW.MeiraJr.DataMiningandAnalysis:FundamentalConceptsandAlgorithms.CambridgeUniversityPress,NewYork,2014.

1.7 Exercises

1. Discuss whether or not each of the following activities is a data mining task.

a. Dividingthecustomersofacompanyaccordingtotheirgender.

b. Dividingthecustomersofacompanyaccordingtotheirprofitability.

c. Computingthetotalsalesofacompany.

d. Sortingastudentdatabasebasedonstudentidentificationnumbers.

e. Predictingtheoutcomesoftossinga(fair)pairofdice.

f. Predictingthefuturestockpriceofacompanyusinghistoricalrecords.

g. Monitoringtheheartrateofapatientforabnormalities.

h. Monitoringseismicwavesforearthquakeactivities.

i. Extractingthefrequenciesofasoundwave.

2.SupposethatyouareemployedasadataminingconsultantforanInternetsearchenginecompany.Describehowdataminingcanhelpthecompanybygivingspecificexamplesofhowtechniques,suchasclustering,classification,associationrulemining,andanomalydetectioncanbeapplied.

3.Foreachofthefollowingdatasets,explainwhetherornotdataprivacyisanimportantissue.

a. Censusdatacollectedfrom1900–1950.

b. IPaddressesandvisittimesofwebuserswhovisityourwebsite.

c. ImagesfromEarth-orbitingsatellites.

d. Namesandaddressesofpeoplefromthetelephonebook.

e. NamesandemailaddressescollectedfromtheWeb.

2 Data

Thischapterdiscussesseveraldata-relatedissuesthatareimportantforsuccessfuldatamining:

TheTypeofDataDatasetsdifferinanumberofways.Forexample,theattributesusedtodescribedataobjectscanbeofdifferenttypes—quantitativeorqualitative—anddatasetsoftenhavespecialcharacteristics;e.g.,somedatasetscontaintimeseriesorobjectswithexplicitrelationshipstooneanother.Notsurprisingly,thetypeofdatadetermineswhichtoolsandtechniquescanbeusedtoanalyzethedata.Indeed,newresearchindataminingisoftendrivenbytheneedtoaccommodatenewapplicationareasandtheirnewtypesofdata.

TheQualityoftheDataDataisoftenfarfromperfect.Whilemostdataminingtechniquescantoleratesomelevelofimperfectioninthedata,afocusonunderstandingandimprovingdataqualitytypicallyimprovesthequalityoftheresultinganalysis.Dataqualityissuesthatoftenneedtobeaddressedincludethepresenceofnoiseandoutliers;missing,inconsistent,orduplicatedata;anddatathatisbiasedor,insomeotherway,unrepresentativeofthephenomenonorpopulationthatthedataissupposedtodescribe.

Preprocessing Steps to Make the Data More Suitable for Data Mining Often, the raw data must be processed in order to make it suitable for analysis. While one objective may be to improve data quality, other goals focus on modifying the data so that it better fits a specified data mining technique or tool. For example, a continuous attribute, e.g., length, sometimes needs to be transformed into an attribute with discrete categories, e.g., short, medium, or long, in order to apply a particular technique; a short sketch of such a discretization follows this list of issues. As another example, the number of attributes in a data set is often reduced because many techniques are more effective when the data has a relatively small number of attributes.

AnalyzingDatainTermsofItsRelationshipsOneapproachtodataanalysisistofindrelationshipsamongthedataobjectsandthenperformtheremaininganalysisusingtheserelationshipsratherthanthedataobjectsthemselves.Forinstance,wecancomputethesimilarityordistancebetweenpairsofobjectsandthenperformtheanalysis—clustering,classification,oranomalydetection—basedonthesesimilaritiesordistances.Therearemanysuchsimilarityordistancemeasures,andtheproperchoicedependsonthetypeofdataandtheparticularapplication.
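As a small illustration of the discretization mentioned in the preprocessing item above, the following Python sketch maps a continuous length attribute to the discrete categories short, medium, and long. The cutoff values are arbitrary choices made for this illustration, not values from the text.

```python
# A minimal sketch of discretizing a continuous attribute (length) into
# the categories short/medium/long. The cutoffs 2.0 and 5.0 are arbitrary
# illustrative thresholds, not values prescribed by the text.

def discretize_length(length):
    """Map a continuous length value to a discrete category."""
    if length < 2.0:
        return "short"
    elif length < 5.0:
        return "medium"
    return "long"

lengths = [0.5, 1.8, 3.2, 4.9, 7.4]
print([discretize_length(x) for x in lengths])
# ['short', 'short', 'medium', 'medium', 'long']
```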

Example2.1(AnIllustrationofData-RelatedIssues).Tofurtherillustratetheimportanceoftheseissues,considerthefollowinghypotheticalsituation.Youreceiveanemailfromamedicalresearcherconcerningaprojectthatyouareeagertoworkon.

Hi,

I've attached the data file that I mentioned in my previous email. Each line contains the information for a single patient and consists of five fields. We want to predict the last field using the other fields. I don't have time to provide any more information about the data since I'm going out of town for a couple of days, but hopefully that won't slow you down too much. And if you don't mind, could we meet when I get back to discuss your preliminary results? I might invite a few other members of my team.

Thanks and see you in a couple of days.

Despitesomemisgivings,youproceedtoanalyzethedata.Thefirstfewrowsofthefileareasfollows:

012 232 33.5 0 10.7

020 121 16.9 2 210.1

027 165 24.0 0 427.6

Abrieflookatthedatarevealsnothingstrange.Youputyourdoubtsasideandstarttheanalysis.Thereareonly1000lines,asmallerdatafilethanyouhadhopedfor,buttwodayslater,youfeelthatyouhavemadesomeprogress.Youarriveforthemeeting,andwhilewaitingforotherstoarrive,youstrikeupaconversationwithastatisticianwhoisworkingontheproject.Whenshelearnsthatyouhavealsobeenanalyzingthedatafromtheproject,sheasksifyouwouldmindgivingherabriefoverviewofyourresults.

Statistician:So,yougotthedataforallthepatients?

DataMiner:Yes.Ihaven’thadmuchtimeforanalysis,butIdohaveafewinterestingresults.

Statistician:Amazing.ThereweresomanydataissueswiththissetofpatientsthatIcouldn’tdomuch.

DataMiner:Oh?Ididn’thearaboutanypossibleproblems.

Statistician:Well,firstthereisfield5,thevariablewewanttopredict.

It’scommonknowledgeamongpeoplewhoanalyzethistypeofdatathatresultsarebetterifyouworkwiththelogofthevalues,butIdidn’tdiscoverthisuntillater.Wasitmentionedtoyou?

DataMiner:No.

Statistician:Butsurelyyouheardaboutwhathappenedtofield4?It’ssupposedtobemeasuredonascalefrom1to10,with0indicatingamissingvalue,butbecauseofadataentryerror,all10’swerechangedinto0’s.Unfortunately,sincesomeofthepatientshavemissingvaluesforthisfield,it’simpossibletosaywhethera0inthisfieldisareal0ora10.Quiteafewoftherecordshavethatproblem.

DataMiner:Interesting.Werethereanyotherproblems?

Statistician:Yes,fields2and3arebasicallythesame,butIassumethatyouprobablynoticedthat.

DataMiner:Yes,butthesefieldswereonlyweakpredictorsoffield5.

Statistician:Anyway,givenallthoseproblems,I’msurprisedyouwereabletoaccomplishanything.

DataMiner:True,butmyresultsarereallyquitegood.Field1isaverystrongpredictoroffield5.I’msurprisedthatthiswasn’tnoticedbefore.

Statistician:What?Field1isjustanidentificationnumber.

DataMiner:Nonetheless,myresultsspeakforthemselves.

Statistician:Oh,no!Ijustremembered.WeassignedIDnumbersafterwesortedtherecordsbasedonfield5.Thereisastrongconnection,butit’smeaningless.Sorry.

Although this scenario represents an extreme situation, it emphasizes the importance of "knowing your data." To that end, this chapter will address each of the four issues mentioned above, outlining some of the basic challenges and standard approaches.

2.1 Types of Data

A data set can often be viewed as a collection of data objects. Other names for a data object are record, point, vector, pattern, event, case, sample, instance, observation, or entity. In turn, data objects are described by a number of attributes that capture the characteristics of an object, such as the mass of a physical object or the time at which an event occurred. Other names for an attribute are variable, characteristic, field, feature, or dimension.

Example2.2(StudentInformation).Often,adatasetisafile,inwhichtheobjectsarerecords(orrows)inthefileandeachfield(orcolumn)correspondstoanattribute.Forexample,Table2.1 showsadatasetthatconsistsofstudentinformation.Eachrowcorrespondstoastudentandeachcolumnisanattributethatdescribessomeaspectofastudent,suchasgradepointaverage(GPA)oridentificationnumber(ID).

Table2.1.Asampledatasetcontainingstudentinformation.

StudentID Year GradePointAverage(GPA) …

1034262 Senior 3.24 …

1052663 Freshman 3.51 …

1082246 Sophomore 3.62 …

Althoughrecord-baseddatasetsarecommon,eitherinflatfilesorrelationaldatabasesystems,thereareotherimportanttypesofdatasetsandsystemsforstoringdata.InSection2.1.2 ,wewilldiscusssomeofthetypesofdatasetsthatarecommonlyencounteredindatamining.However,wefirstconsiderattributes.

2.1.1 Attributes and Measurement

Inthissection,weconsiderthetypesofattributesusedtodescribedataobjects.Wefirstdefineanattribute,thenconsiderwhatwemeanbythetypeofanattribute,andfinallydescribethetypesofattributesthatarecommonlyencountered.

What Is an Attribute? We start with a more detailed definition of an attribute.

Definition2.1.Anattributeisapropertyorcharacteristicofanobjectthatcanvary,eitherfromoneobjecttoanotherorfromonetimetoanother.

For example, eye color varies from person to person, while the temperature of an object varies over time. Note that eye color is a symbolic attribute with a small number of possible values {brown, black, blue, green, hazel, etc.}, while temperature is a numerical attribute with a potentially unlimited number of values.

Atthemostbasiclevel,attributesarenotaboutnumbersorsymbols.However,todiscussandmorepreciselyanalyzethecharacteristicsofobjects,weassignnumbersorsymbolstothem.Todothisinawell-definedway,weneedameasurementscale.

Definition2.2.Ameasurementscaleisarule(function)thatassociatesanumericalorsymbolicvaluewithanattributeofanobject.

Formally,theprocessofmeasurementistheapplicationofameasurementscaletoassociateavaluewithaparticularattributeofaspecificobject.Whilethismayseemabitabstract,weengageintheprocessofmeasurementallthetime.Forinstance,westeponabathroomscaletodetermineourweight,weclassifysomeoneasmaleorfemale,orwecountthenumberofchairsinaroomtoseeiftherewillbeenoughtoseatallthepeoplecomingtoameeting.Inallthesecases,the“physicalvalue”ofanattributeofanobjectismappedtoanumericalorsymbolicvalue.

Withthisbackground,wecandiscussthetypeofanattribute,aconceptthatisimportantindeterminingifaparticulardataanalysistechniqueisconsistentwithaspecifictypeofattribute.

The Type of an Attribute It is common to refer to the type of an attribute as the type of a measurement scale. It should be apparent from the previous discussion that an attribute can be described using different measurement scales and that the properties of an attribute need not be the same as the properties of the values used to measure it. In other words, the values used to represent an attribute can have properties that are not properties of the attribute itself, and vice versa. This is illustrated with two examples.

Example2.3(EmployeeAgeandIDNumber).TwoattributesthatmightbeassociatedwithanemployeeareIDandage(inyears).Bothoftheseattributescanberepresentedasintegers.However,whileitisreasonabletotalkabouttheaverageageofanemployee,itmakesnosensetotalkabouttheaverageemployeeID.Indeed,theonlyaspectofemployeesthatwewanttocapturewiththeIDattributeisthattheyaredistinct.Consequently,theonlyvalidoperationforemployeeIDsistotestwhethertheyareequal.Thereisnohintofthislimitation,however,whenintegersareusedtorepresenttheemployeeIDattribute.Fortheageattribute,thepropertiesoftheintegersusedtorepresentageareverymuchthepropertiesoftheattribute.Evenso,thecorrespondenceisnotcompletebecause,forexample,ageshaveamaximum,whileintegersdonot.

Example 2.4 (Length of Line Segments). Consider Figure 2.1, which shows some objects—line segments—and how the length attribute of these objects can be mapped to numbers in two different ways. Each successive line segment, going from the top to the bottom, is formed by appending the topmost line segment to itself. Thus, the second line segment from the top is formed by appending the topmost line segment to itself twice, the third line segment from the top is formed by appending the topmost line segment to itself three times, and so forth. In a very real (physical) sense, all the line segments are multiples of the first. This fact is captured by the measurements on the right side of the figure, but not by those on the left side. More specifically, the measurement scale on the left side captures only the ordering of the length attribute, while the scale on the right side captures both the ordering and additivity properties. Thus, an attribute can be measured in a way that does not capture all the properties of the attribute.

Figure2.1.Themeasurementofthelengthoflinesegmentsontwodifferentscalesofmeasurement.

Knowing the type of an attribute is important because it tells us which properties of the measured values are consistent with the underlying properties of the attribute, and therefore, it allows us to avoid foolish actions, such as computing the average employee ID.

The Different Types of Attributes A useful (and simple) way to specify the type of an attribute is to identify the properties of numbers that correspond to underlying properties of the attribute. For example, an attribute such as length has many of the properties of numbers. It makes sense to compare and order objects by length, as well as to talk about the differences and ratios of length. The following properties (operations) of numbers are typically used to describe attributes.

1. Distinctness: = and ≠
2. Order: <, ≤, >, and ≥
3. Addition: + and −
4. Multiplication: × and /

Giventheseproperties,wecandefinefourtypesofattributes:nominal,ordinal,interval,andratio.Table2.2 givesthedefinitionsofthesetypes,alongwithinformationaboutthestatisticaloperationsthatarevalidforeachtype.Eachattributetypepossessesallofthepropertiesandoperationsoftheattributetypesaboveit.Consequently,anypropertyoroperationthatisvalidfornominal,ordinal,andintervalattributesisalsovalidforratioattributes.Inotherwords,thedefinitionoftheattributetypesiscumulative.However,thisdoesnotmeanthatthestatisticaloperationsappropriateforoneattributetypeareappropriatefortheattributetypesaboveit.

Table 2.2. Different attribute types.

Nominal (Categorical/Qualitative)
Description: The values of a nominal attribute are just different names; i.e., nominal values provide only enough information to distinguish one object from another (=, ≠).
Examples: zip codes, employee ID numbers, eye color, gender.
Operations: mode, entropy, contingency correlation, χ² test.

Ordinal (Categorical/Qualitative)
Description: The values of an ordinal attribute provide enough information to order objects (<, >).
Examples: hardness of minerals, {good, better, best}, grades, street numbers.
Operations: median, percentiles, rank correlation, run tests, sign tests.

Interval (Numeric/Quantitative)
Description: For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists (+, −).
Examples: calendar dates, temperature in Celsius or Fahrenheit.
Operations: mean, standard deviation, Pearson's correlation, t and F tests.

Ratio (Numeric/Quantitative)
Description: For ratio variables, both differences and ratios are meaningful (×, /).
Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current.
Operations: geometric mean, harmonic mean, percent variation.

Nominal and ordinal attributes are collectively referred to as categorical or qualitative attributes. As the name suggests, qualitative attributes, such as employee ID, lack most of the properties of numbers. Even if they are represented by numbers, i.e., integers, they should be treated more like symbols. The remaining two types of attributes, interval and ratio, are collectively referred to as quantitative or numeric attributes. Quantitative attributes are represented by numbers and have most of the properties of numbers. Note that quantitative attributes can be integer-valued or continuous.

The types of attributes can also be described in terms of transformations that do not change the meaning of an attribute. Indeed, S. S. Stevens, the psychologist who originally defined the types of attributes shown in Table 2.2, defined them in terms of these permissible transformations. For example, the meaning of a length attribute is unchanged if it is measured in meters instead of feet.

The statistical operations that make sense for a particular type of attribute are those that will yield the same results when the attribute is transformed by using a transformation that preserves the attribute's meaning. To illustrate, the average length of a set of objects is different when measured in meters rather than in feet, but both averages represent the same length. Table 2.3 shows the meaning-preserving transformations for the four attribute types of Table 2.2.

Table 2.3. Transformations that define attribute levels.

Nominal (Categorical/Qualitative)
Transformation: Any one-to-one mapping, e.g., a permutation of values.
Comment: If all employee ID numbers are reassigned, it will not make any difference.

Ordinal (Categorical/Qualitative)
Transformation: An order-preserving change of values, i.e., new_value = f(old_value), where f is a monotonic function.
Comment: An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.

Interval (Numeric/Quantitative)
Transformation: new_value = a × old_value + b, where a and b are constants.
Comment: The Fahrenheit and Celsius temperature scales differ in the location of their zero value and the size of a degree (unit).

Ratio (Numeric/Quantitative)
Transformation: new_value = a × old_value.
Comment: Length can be measured in meters or feet.
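To make the point about meaning-preserving transformations concrete, here is a small Python sketch with made-up numbers (not an example from the text). For a ratio attribute such as length, a change of units (new_value = a × old_value) changes the numeric mean but not the length it represents, so the mean remains a meaningful statistic; for an ordinal attribute, an order-preserving recoding keeps the median's interpretation intact, while the mean can change arbitrarily.

```python
# A minimal sketch of meaning-preserving transformations (illustrative numbers only).
import statistics

# Ratio attribute: length. A unit change is a permissible transformation.
lengths_m = [1.0, 2.0, 3.0, 5.0]
lengths_ft = [3.28084 * x for x in lengths_m]           # new_value = a * old_value
mean_m = statistics.mean(lengths_m)                     # 2.75 meters
mean_ft = statistics.mean(lengths_ft)                   # about 9.02 feet
print(mean_m, mean_ft, mean_ft / 3.28084)               # same physical length either way

# Ordinal attribute: good=1, better=2, best=3, recoded as in Table 2.3.
quality = [1, 2, 2, 3]
recoding = {1: 0.5, 2: 1, 3: 10}                        # order-preserving recoding
recoded = [recoding[v] for v in quality]
print(statistics.median(quality), statistics.median(recoded))  # 2.0 and 1.0 -> both correspond to "better"
print(statistics.mean(quality), statistics.mean(recoded))      # 2.0 vs 3.125 -> the mean is not meaningful here
```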

Example 2.5 (Temperature Scales). Temperature provides a good illustration of some of the concepts that have been described. First, temperature can be either an interval or a ratio attribute, depending on its measurement scale. When measured on the Kelvin scale, a temperature of 2° is, in a physically meaningful way, twice that of a temperature of 1°. This is not true when temperature is measured on either the Celsius or Fahrenheit scales, because, physically, a temperature of 1° Fahrenheit (Celsius) is not much different than a temperature of 2° Fahrenheit (Celsius). The problem is that the zero points of the Fahrenheit and Celsius scales are, in a physical sense, arbitrary, and therefore, the ratio of two Celsius or Fahrenheit temperatures is not physically meaningful.

Describing Attributes by the Number of Values An independent way of distinguishing between attributes is by the number of values they can take.

Discrete A discrete attribute has a finite or countably infinite set of values. Such attributes can be categorical, such as zip codes or ID numbers, or numeric, such as counts. Discrete attributes are often represented using integer variables. Binary attributes are a special case of discrete attributes and assume only two values, e.g., true/false, yes/no, male/female, or 0/1. Binary attributes are often represented as Boolean variables, or as integer variables that only take the values 0 or 1.

ContinuousAcontinuousattributeisonewhosevaluesarerealnumbers.Examplesincludeattributessuchastemperature,height,orweight.Continuousattributesaretypicallyrepresentedasfloating-pointvariables.Practically,realvaluescanbemeasuredandrepresentedonlywithlimitedprecision.

Intheory,anyofthemeasurementscaletypes—nominal,ordinal,interval,andratio—couldbecombinedwithanyofthetypesbasedonthenumberofattributevalues—binary,discrete,andcontinuous.However,somecombinationsoccuronlyinfrequentlyordonotmakemuchsense.Forinstance,itisdifficulttothinkofarealisticdatasetthatcontainsacontinuousbinaryattribute.Typically,nominalandordinalattributesarebinaryordiscrete,whileintervalandratioattributesarecontinuous.However,countattributes,whicharediscrete,arealsoratioattributes.

Asymmetric Attributes For asymmetric attributes, only presence—a non-zero attribute value—is regarded as important. Consider a data set in which each object is a student and each attribute records whether a student took a particular course at a university. For a specific student, an attribute has a value of 1 if the student took the course associated with that attribute and a value of 0 otherwise. Because students take only a small fraction of all available courses, most of the values in such a data set would be 0. Therefore, it is more meaningful and more efficient to focus on the non-zero values. To illustrate, if students are compared on the basis of the courses they don't take, then most students would seem very similar, at least if the number of courses is large. Binary attributes where only non-zero values are important are called asymmetric binary attributes. This type of attribute is particularly important for association analysis, which is discussed in Chapter 5. It is also possible to have discrete or continuous asymmetric features. For instance, if the number of credits associated with each course is recorded, then the resulting data set will consist of asymmetric discrete or continuous attributes.
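The following Python sketch shows one common way to handle asymmetric binary data like the student–course example above: store only the non-zero values (the courses actually taken) and compare students with a similarity measure that ignores shared absences. The course names and the use of a Jaccard-style set similarity are illustrative choices, not prescribed by the text.

```python
# A minimal sketch: asymmetric binary data stored sparsely as sets of "present" items.
students = {
    "s1": {"CS101", "CS201", "MATH140"},   # hypothetical course IDs
    "s2": {"CS101", "STAT301"},
    "s3": {"HIST210"},
}

def set_similarity(a, b):
    """Similarity based only on shared presences; 0-0 matches are ignored,
    as is appropriate for asymmetric binary attributes."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

print(set_similarity(students["s1"], students["s2"]))  # 0.25 (one shared course out of four)
print(set_similarity(students["s1"], students["s3"]))  # 0.0 (no courses in common)
```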

General Comments on Levels of Measurement As described in the rest of this chapter, there are many diverse types of data. The previous discussion of measurement scales, while useful, is not complete and has some limitations. We provide the following comments and guidance.

Distinctness, order, and meaningful intervals and ratios are only four properties of data—many others are possible. For instance, some data is inherently cyclical, e.g., position on the surface of the Earth or time. As another example, consider set-valued attributes, where each attribute value is a set of elements, e.g., the set of movies seen in the last year. Define one set of elements (movies) to be greater (larger) than a second set if the second set is a subset of the first. However, such a relationship defines only a partial order that does not match any of the attribute types just defined.

The numbers or symbols used to capture attribute values may not capture all the properties of the attributes or may suggest properties that are not there. An illustration of this for integers was presented in Example 2.3, i.e., averages of IDs and out-of-range ages.

Data is often transformed for the purpose of analysis—see Section 2.3.7. This often changes the distribution of the observed variable to a distribution that is easier to analyze, e.g., a Gaussian (normal) distribution. Often, such transformations only preserve the order of the original values, and other properties are lost. Nonetheless, if the desired outcome is a statistical test of differences or a predictive model, such a transformation is justified.

The final evaluation of any data analysis, including operations on attributes, is whether the results make sense from a domain point of view.

Insummary,itcanbechallengingtodeterminewhichoperationscanbeperformedonaparticularattributeoracollectionofattributeswithoutcompromisingtheintegrityoftheanalysis.Fortunately,establishedpracticeoftenservesasareliableguide.Occasionally,however,standardpracticesareerroneousorhavelimitations.

2.1.2 Types of Data Sets

Therearemanytypesofdatasets,andasthefieldofdataminingdevelopsandmatures,agreatervarietyofdatasetsbecomeavailableforanalysis.Inthissection,wedescribesomeofthemostcommontypes.Forconvenience,wehavegroupedthetypesofdatasetsintothreegroups:recorddata,graph-baseddata,andordereddata.Thesecategoriesdonotcoverallpossibilitiesandothergroupingsarecertainlypossible.

GeneralCharacteristicsofDataSetsBeforeprovidingdetailsofspecifickindsofdatasets,wediscussthreecharacteristicsthatapplytomanydatasetsandhaveasignificantimpactonthedataminingtechniquesthatareused:dimensionality,distribution,andresolution.

Dimensionality

Thedimensionalityofadatasetisthenumberofattributesthattheobjectsinthedatasetpossess.Analyzingdatawithasmallnumberofdimensionstendstobequalitativelydifferentfromanalyzingmoderateorhigh-dimensionaldata.Indeed,thedifficultiesassociatedwiththeanalysisofhigh-dimensionaldataaresometimesreferredtoasthecurseofdimensionality.Becauseofthis,animportantmotivationinpreprocessingthedataisdimensionalityreduction.TheseissuesarediscussedinmoredepthlaterinthischapterandinAppendixB.

Distribution

Thedistributionofadatasetisthefrequencyofoccurrenceofvariousvaluesorsetsofvaluesfortheattributescomprisingdataobjects.Equivalently,thedistributionofadatasetcanbeconsideredasadescriptionoftheconcentrationofobjectsinvariousregionsofthedataspace.Statisticianshaveenumeratedmanytypesofdistributions,e.g.,Gaussian(normal),anddescribedtheirproperties.(SeeAppendixC.)Althoughstatisticalapproachesfordescribingdistributionscanyieldpowerfulanalysistechniques,manydatasetshavedistributionsthatarenotwellcapturedbystandardstatisticaldistributions.

Asaresult,manydataminingalgorithmsdonotassumeaparticularstatisticaldistributionforthedatatheyanalyze.However,somegeneralaspectsofdistributionsoftenhaveastrongimpact.Forexample,supposeacategoricalattributeisusedasaclassvariable,whereoneofthecategoriesoccurs95%ofthetime,whiletheothercategoriestogetheroccuronly5%ofthetime.ThisskewnessinthedistributioncanmakeclassificationdifficultasdiscussedinSection4.11.(Skewnesshasotherimpactsondataanalysisthatarenotdiscussedhere.)
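As a concrete (synthetic) illustration of why such a skewed class distribution is problematic, the short Python sketch below shows that a trivial model that always predicts the majority class already achieves 95% accuracy while never identifying the rare class.

```python
# A small illustration of the 95%/5% skew described above (synthetic labels).
labels = ["common"] * 95 + ["rare"] * 5
predictions = ["common"] * len(labels)      # trivial majority-class "classifier"

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
rare_found = sum(1 for p, y in zip(predictions, labels) if p == "rare" and y == "rare")
print(accuracy)    # 0.95 -- looks good, but...
print(rare_found)  # 0    -- the rare class is never detected
```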

Aspecialcaseofskeweddataissparsity.Forsparsebinary,countorcontinuousdata,mostattributesofanobjecthavevaluesof0.Inmanycases,fewerthan1%ofthevaluesarenon-zero.Inpracticalterms,sparsityisanadvantagebecauseusuallyonlythenon-zerovaluesneedtobestoredandmanipulated.Thisresultsinsignificantsavingswithrespecttocomputationtimeandstorage.Indeed,somedataminingalgorithms,suchastheassociationruleminingalgorithmsdescribedinChapter5 ,workwellonlyforsparsedata.Finally,notethatoftentheattributesinsparsedatasetsareasymmetricattributes.

Resolution

Itisfrequentlypossibletoobtaindataatdifferentlevelsofresolution,andoftenthepropertiesofthedataaredifferentatdifferentresolutions.Forinstance,thesurfaceoftheEarthseemsveryunevenataresolutionofafewmeters,butisrelativelysmoothataresolutionoftensofkilometers.Thepatternsinthedataalsodependonthelevelofresolution.Iftheresolutionistoofine,apatternmaynotbevisibleormaybeburiedinnoise;iftheresolutionistoocoarse,thepatterncandisappear.Forexample,variationsinatmosphericpressureonascaleofhoursreflectthemovementofstormsandotherweathersystems.Onascaleofmonths,suchphenomenaarenotdetectable.

Record Data Much data mining work assumes that the data set is a collection of records (data objects), each of which consists of a fixed set of data fields (attributes). See Figure 2.2(a). For the most basic form of record data, there is no explicit relationship among records or data fields, and every record (object) has the same set of attributes. Record data is usually stored either in flat files or in relational databases. Relational databases are certainly more than a collection of records, but data mining often does not use any of the additional information available in a relational database. Rather, the database serves as a convenient place to find records. Different types of record data are described below and are illustrated in Figure 2.2.

Figure2.2.Differentvariationsofrecorddata.

TransactionorMarketBasketData

Transactiondataisaspecialtypeofrecorddata,whereeachrecord(transaction)involvesasetofitems.Consideragrocerystore.Thesetofproductspurchasedbyacustomerduringoneshoppingtripconstitutesatransaction,whiletheindividualproductsthatwerepurchasedaretheitems.Thistypeofdataiscalledmarketbasketdatabecausetheitemsineachrecordaretheproductsinaperson’s“marketbasket.”Transactiondataisacollectionofsetsofitems,butitcanbeviewedasasetofrecordswhosefieldsareasymmetricattributes.Mostoften,theattributesarebinary,indicatingwhetheranitemwaspurchased,butmoregenerally,theattributescanbediscreteorcontinuous,suchasthenumberofitemspurchasedortheamountspentonthoseitems.Figure2.2(b) showsasampletransactiondataset.Eachrowrepresentsthepurchasesofaparticularcustomerataparticulartime.

TheDataMatrix

Ifallthedataobjectsinacollectionofdatahavethesamefixedsetofnumericattributes,thenthedataobjectscanbethoughtofaspoints(vectors)inamultidimensionalspace,whereeachdimensionrepresentsadistinctattributedescribingtheobject.Asetofsuchdataobjectscanbeinterpretedasanmbynmatrix,wheretherearemrows,oneforeachobject,andncolumns,oneforeachattribute.(Arepresentationthathasdataobjectsascolumnsandattributesasrowsisalsofine.)Thismatrixiscalledadatamatrixorapatternmatrix.Adatamatrixisavariationofrecorddata,butbecauseitconsistsofnumericattributes,standardmatrixoperationcanbeappliedtotransformandmanipulatethedata.Therefore,thedatamatrixisthestandarddataformatformoststatisticaldata.Figure2.2(c) showsasampledatamatrix.

TheSparseDataMatrix

Asparsedatamatrixisaspecialcaseofadatamatrixwheretheattributesareofthesametypeandareasymmetric;i.e.,onlynon-zerovaluesareimportant.Transactiondataisanexampleofasparsedatamatrixthathasonly0–1entries.Anothercommonexampleisdocumentdata.Inparticular,iftheorderoftheterms(words)inadocumentisignored—the“bagofwords”approach—thenadocumentcanberepresentedasatermvector,whereeachtermisacomponent(attribute)ofthevectorandthevalueofeachcomponentisthenumberoftimesthecorrespondingtermoccursinthedocument.Thisrepresentationofacollectionofdocumentsisoftencalledadocument-termmatrix.Figure2.2(d) showsasampledocument-termmatrix.Thedocumentsaretherowsofthismatrix,whilethetermsarethecolumns.Inpractice,onlythenon-zeroentriesofsparsedatamatricesarestored.
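The following Python sketch builds a tiny document-term representation using the bag-of-words approach described above, storing only the non-zero counts as a sparse data matrix would. The two example sentences are invented for illustration.

```python
# A minimal sketch of a sparse document-term representation (bag of words).
from collections import Counter

docs = [
    "data mining finds patterns in data",
    "graph mining is a branch of data mining",
]

# One Counter per document: term -> count. Only non-zero entries are stored.
doc_term = [Counter(doc.split()) for doc in docs]

for i, counts in enumerate(doc_term):
    print(f"document {i}:", dict(counts))
# document 0: {'data': 2, 'mining': 1, 'finds': 1, 'patterns': 1, 'in': 1}
# document 1: {'graph': 1, 'mining': 2, 'is': 1, 'a': 1, 'branch': 1, 'of': 1, 'data': 1}
```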

Graph-BasedDataAgraphcansometimesbeaconvenientandpowerfulrepresentationfordata.Weconsidertwospecificcases:(1)thegraphcapturesrelationshipsamongdataobjectsand(2)thedataobjectsthemselvesarerepresentedasgraphs.

DatawithRelationshipsamongObjects

The relationships among objects frequently convey important information. In such cases, the data is often represented as a graph. In particular, the data objects are mapped to nodes of the graph, while the relationships among objects are captured by the links between objects and link properties, such as direction and weight. Consider web pages on the World Wide Web, which contain both text and links to other pages. In order to process search queries, web search engines collect and process web pages to extract their contents. It is well known, however, that the links to and from each page provide a great deal of information about the relevance of a web page to a query, and thus, must also be taken into consideration. Figure 2.3(a) shows a set of linked web pages. Another important example of such graph data is social networks, where data objects are people and the relationships among them are their interactions via social media.

DatawithObjectsThatAreGraphs

Ifobjectshavestructure,thatis,theobjectscontainsubobjectsthathaverelationships,thensuchobjectsarefrequentlyrepresentedasgraphs.Forexample,thestructureofchemicalcompoundscanberepresentedbyagraph,wherethenodesareatomsandthelinksbetweennodesarechemicalbonds.Figure2.3(b) showsaball-and-stickdiagramofthechemicalcompoundbenzene,whichcontainsatomsofcarbon(black)andhydrogen(gray).Agraphrepresentationmakesitpossibletodeterminewhichsubstructuresoccurfrequentlyinasetofcompoundsandtoascertainwhetherthepresenceofanyofthesesubstructuresisassociatedwiththepresenceorabsenceofcertainchemicalproperties,suchasmeltingpointorheatofformation.Frequentgraphmining,whichisabranchofdataminingthatanalyzessuchdata,isconsideredinSection6.5.

Figure2.3.Differentvariationsofgraphdata.

OrderedDataForsometypesofdata,theattributeshaverelationshipsthatinvolveorderintimeorspace.DifferenttypesofordereddataaredescribednextandareshowninFigure2.4 .

SequentialTransactionData

Sequential transaction data can be thought of as an extension of transaction data, where each transaction has a time associated with it. Consider a retail transaction data set that also stores the time at which the transaction took place. This time information makes it possible to find patterns such as "candy sales peak before Halloween." A time can also be associated with each attribute. For example, each record could be the purchase history of a customer, with a listing of items purchased at different times. Using this information, it is possible to find patterns such as "people who buy DVD players tend to buy DVDs in the period immediately following the purchase."

Figure2.4(a) showsanexampleofsequentialtransactiondata.Therearefivedifferenttimes—t1,t2,t3,t4,andt5;threedifferentcustomers—C1,C2,andC3;andfivedifferentitems—A,B,C,D,andE.Inthetoptable,eachrowcorrespondstotheitemspurchasedataparticulartimebyeachcustomer.Forinstance,attimet3,customerC2purchaseditemsAandD.Inthebottomtable,thesameinformationisdisplayed,buteachrowcorrespondstoaparticularcustomer.Eachrowcontainsinformationabouteachtransactioninvolvingthecustomer,whereatransactionisconsideredtobeasetofitemsandthetimeatwhichthoseitemswerepurchased.Forexample,customerC3boughtitemsAandCattimet2.

TimeSeriesData

Timeseriesdataisaspecialtypeofordereddatawhereeachrecordisatimeseries,i.e.,aseriesofmeasurementstakenovertime.Forexample,afinancialdatasetmightcontainobjectsthataretimeseriesofthedailypricesofvariousstocks.Asanotherexample,considerFigure2.4(c) ,whichshowsatimeseriesoftheaveragemonthlytemperatureforMinneapolisduringtheyears1982to1994.Whenworkingwithtemporaldata,suchastimeseries,itisimportanttoconsidertemporalautocorrelation;i.e.,iftwomeasurementsarecloseintime,thenthevaluesofthosemeasurementsareoftenverysimilar.

Figure2.4.Differentvariationsofordereddata.
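The Python sketch below computes a simple lag-1 autocorrelation for a short synthetic series (not the Minneapolis data) to show what temporal autocorrelation means in practice: values adjacent in time are similar, so the coefficient is clearly positive.

```python
# A minimal sketch of lag-1 temporal autocorrelation on a synthetic, smooth series.
import statistics

series = [12.1, 12.4, 12.9, 13.5, 13.4, 12.8, 12.2, 11.9, 12.0, 12.5]

def lag1_autocorrelation(x):
    mean = statistics.mean(x)
    numerator = sum((x[t] - mean) * (x[t + 1] - mean) for t in range(len(x) - 1))
    denominator = sum((v - mean) ** 2 for v in x)
    return numerator / denominator

print(round(lag1_autocorrelation(series), 2))  # about 0.65: adjacent values are similar
```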

SequenceData

Sequencedataconsistsofadatasetthatisasequenceofindividualentities,suchasasequenceofwordsorletters.Itisquitesimilartosequentialdata,exceptthattherearenotimestamps;instead,therearepositionsinanorderedsequence.Forexample,thegeneticinformationofplantsandanimalscanberepresentedintheformofsequencesofnucleotidesthatareknownasgenes.Manyoftheproblemsassociatedwithgeneticsequencedatainvolvepredictingsimilaritiesinthestructureandfunctionofgenesfromsimilaritiesinnucleotidesequences.Figure2.4(b) showsasectionofthehumangeneticcodeexpressedusingthefournucleotidesfromwhichallDNAisconstructed:A,T,G,andC.

SpatialandSpatio-TemporalData

Someobjectshavespatialattributes,suchaspositionsorareas,inadditiontoothertypesofattributes.Anexampleofspatialdataisweatherdata(precipitation,temperature,pressure)thatiscollectedforavarietyofgeographicallocations.Oftensuchmeasurementsarecollectedovertime,andthus,thedataconsistsoftimeseriesatvariouslocations.Inthatcase,werefertothedataasspatio-temporaldata.Althoughanalysiscanbeconductedseparatelyforeachspecifictimeorlocation,amorecompleteanalysisofspatio-temporaldatarequiresconsiderationofboththespatialandtemporalaspectsofthedata.

Animportantaspectofspatialdataisspatialautocorrelation;i.e.,objectsthatarephysicallyclosetendtobesimilarinotherwaysaswell.Thus,twopointsontheEarththatareclosetoeachotherusuallyhavesimilarvaluesfortemperatureandrainfall.Notethatspatialautocorrelationisanalogoustotemporalautocorrelation.

Important examples of spatial and spatio-temporal data are the science and engineering data sets that are the result of measurements or model output taken at regularly or irregularly distributed points on a two- or three-dimensional grid or mesh. For instance, Earth science data sets record the temperature or pressure measured at points (grid cells) on latitude–longitude spherical grids of various resolutions, e.g., 1° by 1°. See Figure 2.4(d). As another example, in the simulation of the flow of a gas, the speed and direction of flow at various instants in time can be recorded for each grid point in the simulation. A different type of spatio-temporal data arises from tracking the trajectories of objects, e.g., vehicles, in time and space.

HandlingNon-RecordDataMostdataminingalgorithmsaredesignedforrecorddataoritsvariations,suchastransactiondataanddatamatrices.Record-orientedtechniquescanbeappliedtonon-recorddatabyextractingfeaturesfromdataobjectsandusingthesefeaturestocreatearecordcorrespondingtoeachobject.Considerthechemicalstructuredatathatwasdescribedearlier.Givenasetofcommonsubstructures,eachcompoundcanberepresentedasarecordwithbinaryattributesthatindicatewhetheracompoundcontainsaspecificsubstructure.Sucharepresentationisactuallyatransactiondataset,wherethetransactionsarethecompoundsandtheitemsarethesubstructures.

In some cases, it is easy to represent the data in a record format, but this type of representation does not capture all the information in the data. Consider spatio-temporal data consisting of a time series from each point on a spatial grid. This data is often stored in a data matrix, where each row represents a location and each column represents a particular point in time. However, such a representation does not explicitly capture the time relationships that are present among attributes and the spatial relationships that exist among objects. This does not mean that such a representation is inappropriate, but rather that these relationships must be taken into consideration during the analysis. For example, it would not be a good idea to use a data mining technique that ignores the temporal autocorrelation of the attributes or the spatial autocorrelation of the data objects, i.e., the locations on the spatial grid.

2.2 Data Quality

Data mining algorithms are often applied to data that was collected for another purpose, or for future, but unspecified, applications. For that reason, data mining cannot usually take advantage of the significant benefits of "addressing quality issues at the source." In contrast, much of statistics deals with the design of experiments or surveys that achieve a prespecified level of data quality. Because preventing data quality problems is typically not an option, data mining focuses on (1) the detection and correction of data quality problems and (2) the use of algorithms that can tolerate poor data quality. The first step, detection and correction, is often called data cleaning.

Thefollowingsectionsdiscussspecificaspectsofdataquality.Thefocusisonmeasurementanddatacollectionissues,althoughsomeapplication-relatedissuesarealsodiscussed.

2.2.1 Measurement and Data Collection Issues

It is unrealistic to expect that data will be perfect. There may be problems due to human error, limitations of measuring devices, or flaws in the data collection process. Values or even entire data objects can be missing. In other cases, there can be spurious or duplicate objects; i.e., multiple data objects that all correspond to a single "real" object. For example, there might be two different records for a person who has recently lived at two different addresses. Even if all the data is present and "looks fine," there may be inconsistencies—a person has a height of 2 meters, but weighs only 2 kilograms.

Inthenextfewsections,wefocusonaspectsofdataqualitythatarerelatedtodatameasurementandcollection.Webeginwithadefinitionofmeasurementanddatacollectionerrorsandthenconsideravarietyofproblemsthatinvolvemeasurementerror:noise,artifacts,bias,precision,andaccuracy.Weconcludebydiscussingdataqualityissuesthatinvolvebothmeasurementanddatacollectionproblems:outliers,missingandinconsistentvalues,andduplicatedata.

MeasurementandDataCollectionErrorsThetermmeasurementerrorreferstoanyproblemresultingfromthemeasurementprocess.Acommonproblemisthatthevaluerecordeddiffersfromthetruevaluetosomeextent.Forcontinuousattributes,thenumericaldifferenceofthemeasuredandtruevalueiscalledtheerror.Thetermdatacollectionerrorreferstoerrorssuchasomittingdataobjectsorattributevalues,orinappropriatelyincludingadataobject.Forexample,astudyofanimalsofacertainspeciesmightincludeanimalsofarelatedspeciesthataresimilarinappearancetothespeciesofinterest.Bothmeasurementerrorsanddatacollectionerrorscanbeeithersystematicorrandom.

Wewillonlyconsidergeneraltypesoferrors.Withinparticulardomains,certaintypesofdataerrorsarecommonplace,andwell-developedtechniquesoftenexistfordetectingand/orcorrectingtheseerrors.Forexample,keyboarderrorsarecommonwhendataisenteredmanually,andasaresult,manydataentryprogramshavetechniquesfordetectingand,withhumanintervention,correctingsucherrors.

Noise and Artifacts Noise is the random component of a measurement error. It typically involves the distortion of a value or the addition of spurious objects. Figure 2.5 shows a time series before and after it has been disrupted by random noise. If a bit more noise were added to the time series, its shape would be lost. Figure 2.6 shows a set of data points before and after some noise points (indicated by '+'s) have been added. Notice that some of the noise points are intermixed with the non-noise points.

Figure 2.5. Noise in a time series context.

Figure 2.6. Noise in a spatial context.

Thetermnoiseisoftenusedinconnectionwithdatathathasaspatialortemporalcomponent.Insuchcases,techniquesfromsignalorimageprocessingcanfrequentlybeusedtoreducenoiseandthus,helptodiscoverpatterns(signals)thatmightbe“lostinthenoise.”Nonetheless,theeliminationofnoiseisfrequentlydifficult,andmuchworkindataminingfocusesondevisingrobustalgorithmsthatproduceacceptableresultsevenwhennoiseispresent.

Dataerrorscanbetheresultofamoredeterministicphenomenon,suchasastreakinthesameplaceonasetofphotographs.Suchdeterministicdistortionsofthedataareoftenreferredtoasartifacts.

Precision, Bias, and Accuracy In statistics and experimental science, the quality of the measurement process and the resulting data are measured by precision and bias. We provide the standard definitions, followed by a brief discussion. For the following definitions, we assume that we make repeated measurements of the same underlying quantity.

Definition 2.3 (Precision). The closeness of repeated measurements (of the same quantity) to one another.

Definition 2.4 (Bias). A systematic variation of measurements from the quantity being measured.

Precision is often measured by the standard deviation of a set of values, while bias is measured by taking the difference between the mean of the set of values and the known value of the quantity being measured. Bias can be determined only for objects whose measured quantity is known by means external to the current situation. Suppose that we have a standard laboratory weight with a mass of 1 g and want to assess the precision and bias of our new laboratory scale. We weigh the mass five times, and obtain the following five values: {1.015, 0.990, 1.013, 1.001, 0.986}. The mean of these values is 1.001, and hence, the bias is 0.001. The precision, as measured by the standard deviation, is 0.013.
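The calculation in the example above can be reproduced directly; the sketch below assumes the sample standard deviation as the measure of precision, which matches the 0.013 reported in the text.

```python
# Reproducing the precision and bias calculation from the example above.
import statistics

measurements = [1.015, 0.990, 1.013, 1.001, 0.986]
true_value = 1.0                                   # mass of the standard weight in grams

bias = statistics.mean(measurements) - true_value  # mean of measurements minus known value
precision = statistics.stdev(measurements)         # sample standard deviation

print(round(bias, 3))        # 0.001
print(round(precision, 3))   # 0.013
```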

Itiscommontousethemoregeneralterm,accuracy,torefertothedegreeofmeasurementerrorindata.

Definition2.5(Accuracy)Theclosenessofmeasurementstothetruevalueofthequantitybeingmeasured.

Accuracydependsonprecisionandbias,butthereisnospecificformulaforaccuracyintermsofthesetwoquantities.

One important aspect of accuracy is the use of significant digits. The goal is to use only as many digits to represent the result of a measurement or calculation as are justified by the precision of the data. For example, if the length of an object is measured with a meter stick whose smallest markings are millimeters, then we should record the length of the data only to the nearest millimeter. The precision of such a measurement would be ±0.5 mm. We do not review the details of working with significant digits because most readers will have encountered them in previous courses and they are covered in considerable depth in science, engineering, and statistics textbooks.

Issues such as significant digits, precision, bias, and accuracy are sometimes overlooked, but they are important for data mining as well as statistics and science. Many times, data sets do not come with information about the precision of the data, and furthermore, the programs used for analysis return results without any such information. Nonetheless, without some understanding of the accuracy of the data and the results, an analyst runs the risk of committing serious data analysis blunders.

Outliers Outliers are either (1) data objects that, in some sense, have characteristics that are different from most of the other data objects in the data set, or (2) values of an attribute that are unusual with respect to the typical values for that attribute. Alternatively, they can be referred to as anomalous objects or values. There is considerable leeway in the definition of an outlier, and many different definitions have been proposed by the statistics and data mining communities. Furthermore, it is important to distinguish between the notions of noise and outliers. Unlike noise, outliers can be legitimate data objects or values that we are interested in detecting. For instance, in fraud and network intrusion detection, the goal is to find unusual objects or events from among a large number of normal ones. Chapter 9 discusses anomaly detection in more detail.

Missing Values It is not unusual for an object to be missing one or more attribute values. In some cases, the information was not collected; e.g., some people decline to give their age or weight. In other cases, some attributes are not applicable to all objects; e.g., often, forms have conditional parts that are filled out only when a person answers a previous question in a certain way, but for simplicity, all fields are stored. Regardless, missing values should be taken into account during the data analysis.

Thereareseveralstrategies(andvariationsonthesestrategies)fordealingwithmissingdata,eachofwhichisappropriateincertaincircumstances.Thesestrategiesarelistednext,alongwithanindicationoftheiradvantagesanddisadvantages.

EliminateDataObjectsorAttributes

Asimpleandeffectivestrategyistoeliminateobjectswithmissingvalues.However,evenapartiallyspecifieddataobjectcontainssomeinformation,andifmanyobjectshavemissingvalues,thenareliableanalysiscanbedifficultorimpossible.Nonetheless,ifadatasethasonlyafewobjectsthathavemissingvalues,thenitmaybeexpedienttoomitthem.Arelatedstrategyistoeliminateattributesthathavemissingvalues.Thisshouldbedonewithcaution,however,becausetheeliminatedattributesmaybetheonesthatarecriticaltotheanalysis.

EstimateMissingValues

Sometimesmissingdatacanbereliablyestimated.Forexample,consideratimeseriesthatchangesinareasonablysmoothfashion,buthasafew,widelyscatteredmissingvalues.Insuchcases,themissingvaluescanbeestimated(interpolated)byusingtheremainingvalues.Asanotherexample,consideradatasetthathasmanysimilardatapoints.Inthissituation,theattributevaluesofthepointsclosesttothepointwiththemissingvalueareoftenusedtoestimatethemissingvalue.Iftheattributeiscontinuous,thentheaverageattributevalueofthenearestneighborsisused;iftheattributeiscategorical,thenthemostcommonlyoccurringattributevaluecanbetaken.Foraconcreteillustration,considerprecipitationmeasurementsthatarerecordedbygroundstations.Forareasnotcontainingagroundstation,theprecipitationcanbeestimatedusingvaluesobservedatnearbygroundstations.
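The sketch below shows one simple version of the interpolation strategy mentioned above for a smooth time series: each missing value (marked None) is filled by linearly interpolating between its nearest recorded neighbors. It assumes the first and last values are present; the numbers are invented.

```python
# A minimal sketch of estimating missing time-series values by linear interpolation.
# Assumes the series is reasonably smooth and that its first and last values are present.

def interpolate_missing(values):
    filled = list(values)
    for i, v in enumerate(filled):
        if v is None:
            left = next(j for j in range(i - 1, -1, -1) if filled[j] is not None)
            right = next(j for j in range(i + 1, len(filled)) if filled[j] is not None)
            fraction = (i - left) / (right - left)
            filled[i] = filled[left] + fraction * (filled[right] - filled[left])
    return filled

series = [10.0, 10.5, None, 11.4, None, None, 12.6]
print([round(v, 2) for v in interpolate_missing(series)])
# [10.0, 10.5, 10.95, 11.4, 11.8, 12.2, 12.6]
```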

IgnoretheMissingValueduringAnalysis

Manydataminingapproachescanbemodifiedtoignoremissingvalues.Forexample,supposethatobjectsarebeingclusteredandthesimilaritybetweenpairsofdataobjectsneedstobecalculated.Ifoneorbothobjectsofapairhavemissingvaluesforsomeattributes,thenthesimilaritycanbecalculatedbyusingonlytheattributesthatdonothavemissingvalues.Itistruethatthesimilaritywillonlybeapproximate,butunlessthetotalnumberofattributesissmallorthenumberofmissingvaluesishigh,thisdegreeofinaccuracymaynotmattermuch.Likewise,manyclassificationschemescanbemodifiedtoworkwithmissingvalues.
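The sketch below illustrates the "ignore during analysis" strategy just described: a similarity between two objects is computed from only those attributes for which both values are present (None marks a missing value). The inverse-distance similarity used here is just one simple choice made for the illustration.

```python
# A minimal sketch of computing similarity while ignoring missing values (None).
import math

def similarity_ignoring_missing(x, y):
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    if not pairs:
        return 0.0                                    # nothing to compare
    distance = math.sqrt(sum((a - b) ** 2 for a, b in pairs))
    return 1.0 / (1.0 + distance)                     # simple inverse-distance similarity

obj1 = [1.0, None, 3.0, 4.0]
obj2 = [1.5, 2.0, None, 4.5]
print(round(similarity_ignoring_missing(obj1, obj2), 3))  # uses only the first and last attributes
```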

Inconsistent Values Data can contain inconsistent values. Consider an address field, where both a zip code and city are listed, but the specified zip code area is not contained in that city. It is possible that the individual entering this information transposed two digits, or perhaps a digit was misread when the information was scanned from a handwritten form. Regardless of the cause of the inconsistent values, it is important to detect and, if possible, correct such problems.

Sometypesofinconsistencesareeasytodetect.Forinstance,aperson’sheightshouldnotbenegative.Inothercases,itcanbenecessarytoconsultanexternalsourceofinformation.Forexample,whenaninsurancecompanyprocessesclaimsforreimbursement,itchecksthenamesandaddressesonthereimbursementformsagainstadatabaseofitscustomers.

Once an inconsistency has been detected, it is sometimes possible to correct the data. A product code may have "check" digits, or it may be possible to double-check a product code against a list of known product codes, and then correct the code if it is incorrect, but close to a known code. The correction of an inconsistency requires additional or redundant information.
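As a small illustration of using redundant information to detect inconsistencies, the sketch below assumes a made-up product-code scheme in which the digits must sum to a multiple of 10 (a simple check-digit rule); real schemes such as UPC use more elaborate rules.

```python
# A minimal sketch of detecting inconsistent values with a (made-up) check-digit rule:
# a product code is considered consistent only if its digits sum to a multiple of 10.

def is_consistent(code):
    return sum(int(ch) for ch in code) % 10 == 0

print(is_consistent("1234"))   # True: 1+2+3+4 = 10
print(is_consistent("1243"))   # True: same digits, so a transposition slips past this simple rule
print(is_consistent("1235"))   # False: flagged as inconsistent for correction
```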

Example2.6(InconsistentSeaSurfaceTemperature).Thisexampleillustratesaninconsistencyinactualtimeseriesdatathatmeasurestheseasurfacetemperature(SST)atvariouspointsontheocean.SSTdatawasoriginallycollectedusingocean-basedmeasurementsfromshipsorbuoys,butmorerecently,satelliteshavebeenusedtogatherthedata.Tocreatealong-termdataset,bothsourcesofdatamustbeused.However,becausethedatacomesfromdifferentsources,thetwopartsofthedataaresubtlydifferent.ThisdiscrepancyisvisuallydisplayedinFigure2.7 ,whichshowsthecorrelationofSSTvaluesbetweenpairsofyears.Ifapairofyearshasapositivecorrelation,thenthelocationcorrespondingtothepairofyearsiscoloredwhite;otherwiseitiscoloredblack.(Seasonalvariationswereremovedfromthedatasince,otherwise,alltheyearswouldbehighlycorrelated.)Thereisadistinctchangeinbehaviorwherethedatahasbeenputtogetherin1983.Yearswithineachofthetwogroups,1958–1982and1983–1999,tendtohaveapositivecorrelationwithoneanother,butanegativecorrelationwithyearsintheothergroup.Thisdoesnotmeanthatthisdatashouldnotbeused,onlythattheanalystshouldconsiderthepotentialimpactofsuchdiscrepanciesonthedatamininganalysis.

Figure2.7.CorrelationofSSTdatabetweenpairsofyears.Whiteareasindicatepositivecorrelation.Blackareasindicatenegativecorrelation.

Duplicate Data A data set can include data objects that are duplicates, or almost duplicates, of one another. Many people receive duplicate mailings because they appear in a database multiple times under slightly different names. To detect and eliminate such duplicates, two main issues must be addressed. First, if there are two objects that actually represent a single object, then one or more values of corresponding attributes are usually different, and these inconsistent values must be resolved. Second, care needs to be taken to avoid accidentally combining data objects that are similar, but not duplicates, such as two distinct people with identical names. The term deduplication is often used to refer to the process of dealing with these issues.
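The following sketch shows a deliberately simplified deduplication step: records are grouped under a normalized name key and only the first record per key is kept. It ignores the two harder issues mentioned above (resolving conflicting attribute values and distinguishing different people who share a name); the records are invented.

```python
# A minimal, simplified deduplication sketch: merge records whose normalized names match.
records = [
    {"name": "J. Smith", "city": "Minneapolis"},
    {"name": "j smith",  "city": "Minneapolis"},
    {"name": "A. Jones", "city": "Chicago"},
]

def normalized_key(record):
    # Lowercase and strip punctuation/whitespace so "J. Smith" and "j smith" collide.
    return "".join(ch for ch in record["name"].lower() if ch.isalnum())

deduplicated = {}
for rec in records:
    deduplicated.setdefault(normalized_key(rec), rec)   # keep the first record seen per key

print(list(deduplicated.values()))   # two records remain after deduplication
```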

Insomecases,twoormoreobjectsareidenticalwithrespecttotheattributesmeasuredbythedatabase,buttheystillrepresentdifferentobjects.Here,theduplicatesarelegitimate,butcanstillcauseproblemsforsomealgorithmsifthepossibilityofidenticalobjectsisnotspecificallyaccountedforintheirdesign.AnexampleofthisisgiveninExercise13 onpage108.

2.2.2 Issues Related to Applications

Dataqualityissuescanalsobeconsideredfromanapplicationviewpointasexpressedbythestatement“dataisofhighqualityifitissuitableforitsintendeduse.”Thisapproachtodataqualityhasprovenquiteuseful,particularlyinbusinessandindustry.Asimilarviewpointisalsopresentinstatisticsandtheexperimentalsciences,withtheiremphasisonthecarefuldesignofexperimentstocollectthedatarelevanttoaspecifichypothesis.Aswithqualityissuesatthemeasurementanddatacollectionlevel,manyissuesarespecifictoparticularapplicationsandfields.Again,weconsideronlyafewofthegeneralissues.

Timeliness

Somedatastartstoageassoonasithasbeencollected.Inparticular,ifthedataprovidesasnapshotofsomeongoingphenomenonorprocess,suchasthepurchasingbehaviorofcustomersorwebbrowsingpatterns,thenthissnapshotrepresentsrealityforonlyalimitedtime.Ifthedataisoutofdate,thensoarethemodelsandpatternsthatarebasedonit.

Relevance

Theavailabledatamustcontaintheinformationnecessaryfortheapplication.Considerthetaskofbuildingamodelthatpredictstheaccidentratefordrivers.Ifinformationabouttheageandgenderofthedriverisomitted,thenitislikelythatthemodelwillhavelimitedaccuracyunlessthisinformationisindirectlyavailablethroughotherattributes.

Makingsurethattheobjectsinadatasetarerelevantisalsochallenging.Acommonproblemissamplingbias,whichoccurswhenasampledoesnotcontaindifferenttypesofobjectsinproportiontotheiractualoccurrenceinthepopulation.Forexample,surveydatadescribesonlythosewhorespondtothesurvey.(OtheraspectsofsamplingarediscussedfurtherinSection2.3.2 .)Becausetheresultsofadataanalysiscanreflectonlythedatathatispresent,samplingbiaswilltypicallyleadtoerroneousresultswhenappliedtothebroaderpopulation.

KnowledgeabouttheData

Ideally,datasetsareaccompaniedbydocumentationthatdescribesdifferentaspectsofthedata;thequalityofthisdocumentationcaneitheraidorhinderthesubsequentanalysis.Forexample,ifthedocumentationidentifiesseveralattributesasbeingstronglyrelated,theseattributesarelikelytoprovidehighlyredundantinformation,andweusuallydecidetokeepjustone.(Considersalestaxandpurchaseprice.)Ifthedocumentationispoor,however,andfailstotellus,forexample,thatthemissingvaluesforaparticularfieldareindicatedwitha-9999,thenouranalysisofthedatamaybefaulty.Otherimportantcharacteristicsaretheprecisionofthedata,thetypeoffeatures(nominal,ordinal,interval,ratio),thescaleofmeasurement(e.g.,metersorfeetforlength),andtheoriginofthedata.

2.3DataPreprocessingInthissection,weconsiderwhichpreprocessingstepsshouldbeappliedtomakethedatamoresuitablefordatamining.Datapreprocessingisabroadareaandconsistsofanumberofdifferentstrategiesandtechniquesthatareinterrelatedincomplexways.Wewillpresentsomeofthemostimportantideasandapproaches,andtrytopointouttheinterrelationshipsamongthem.Specifically,wewilldiscussthefollowingtopics:

Aggregation
Sampling
Dimensionality reduction
Feature subset selection
Feature creation
Discretization and binarization
Variable transformation

Roughlyspeaking,thesetopicsfallintotwocategories:selectingdataobjectsandattributesfortheanalysisorforcreating/changingtheattributes.Inbothcases,thegoalistoimprovethedatamininganalysiswithrespecttotime,cost,andquality.Detailsareprovidedinthefollowingsections.

Aquicknoteaboutterminology:Inthefollowing,wesometimesusesynonymsforattribute,suchasfeatureorvariable,inordertofollowcommonusage.

2.3.1Aggregation

Sometimes“lessismore,”andthisisthecasewithaggregation,thecombiningoftwoormoreobjectsintoasingleobject.Consideradatasetconsistingoftransactions(dataobjects)recordingthedailysalesofproductsinvariousstorelocations(Minneapolis,Chicago,Paris,…)fordifferentdaysoverthecourseofayear.SeeTable2.4 .Onewaytoaggregatetransactionsforthisdatasetistoreplaceallthetransactionsofasinglestorewithasinglestorewidetransaction.Thisreducesthehundredsorthousandsoftransactionsthatoccurdailyataspecificstoretoasingledailytransaction,andthenumberofdataobjectsperdayisreducedtothenumberofstores.

Table2.4.Datasetcontaininginformationaboutcustomerpurchases.

TransactionID Item StoreLocation Date Price …

⋮ ⋮ ⋮ ⋮ ⋮

101123 Watch Chicago 09/06/04 $25.99 …

101123 Battery Chicago 09/06/04 $5.99 …

101124 Shoes Minneapolis 09/06/04 $75.00 …

Anobviousissueishowanaggregatetransactioniscreated;i.e.,howthevaluesofeachattributearecombinedacrossalltherecordscorrespondingtoaparticularlocationtocreatetheaggregatetransactionthatrepresentsthesalesofasinglestoreordate.Quantitativeattributes,suchasprice,aretypicallyaggregatedbytakingasumoranaverage.Aqualitativeattribute,suchasitem,caneitherbeomittedorsummarizedintermsofahigherlevelcategory,e.g.,televisionsversuselectronics.

ThedatainTable2.4 canalsobeviewedasamultidimensionalarray,whereeachattributeisadimension.Fromthisviewpoint,aggregationistheprocessofeliminatingattributes,suchasthetypeofitem,orreducingthe

numberofvaluesforaparticularattribute;e.g.,reducingthepossiblevaluesfordatefrom365daysto12months.ThistypeofaggregationiscommonlyusedinOnlineAnalyticalProcessing(OLAP).ReferencestoOLAParegiveninthebibliographicNotes.

Thereareseveralmotivationsforaggregation.First,thesmallerdatasetsresultingfromdatareductionrequirelessmemoryandprocessingtime,andhence,aggregationoftenenablestheuseofmoreexpensivedataminingalgorithms.Second,aggregationcanactasachangeofscopeorscalebyprovidingahigh-levelviewofthedatainsteadofalow-levelview.Inthepreviousexample,aggregatingoverstorelocationsandmonthsgivesusamonthly,perstoreviewofthedatainsteadofadaily,peritemview.Finally,thebehaviorofgroupsofobjectsorattributesisoftenmorestablethanthatofindividualobjectsorattributes.Thisstatementreflectsthestatisticalfactthataggregatequantities,suchasaveragesortotals,havelessvariabilitythantheindividualvaluesbeingaggregated.Fortotals,theactualamountofvariationislargerthanthatofindividualobjects(onaverage),butthepercentageofthevariationissmaller,whileformeans,theactualamountofvariationislessthanthatofindividualobjects(onaverage).Adisadvantageofaggregationisthepotentiallossofinterestingdetails.Inthestoreexample,aggregatingovermonthslosesinformationaboutwhichdayoftheweekhasthehighestsales.
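To make the mechanics concrete, here is a minimal sketch of the kind of aggregation described above, using pandas on a few rows modeled after Table 2.4. The DataFrame, column names, and the choice to sum prices and simply drop the item attribute are illustrative assumptions, not part of the text.

```python
# Roll daily, per-item transactions up into one transaction per store and date.
import pandas as pd

transactions = pd.DataFrame({
    "transaction_id": [101123, 101123, 101124],
    "item": ["Watch", "Battery", "Shoes"],
    "store_location": ["Chicago", "Chicago", "Minneapolis"],
    "date": ["09/06/04", "09/06/04", "09/06/04"],
    "price": [25.99, 5.99, 75.00],
})

# Quantitative attributes (price) are aggregated by summing; the qualitative
# attribute (item) is omitted here, as one of the options discussed above.
store_day = (transactions
             .groupby(["store_location", "date"], as_index=False)
             .agg(total_sales=("price", "sum"),
                  num_items=("item", "count")))
print(store_day)
```

The same groupby pattern gives the coarser views mentioned above (e.g., per store and month) simply by grouping on different attributes.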

Example2.7(AustralianPrecipitation). This example is based on precipitation in Australia from the period 1982–1993. Figure 2.8(a) shows a histogram for the standard deviation of average monthly precipitation for 3,030 0.5° by 0.5° grid cells in Australia, while Figure 2.8(b) shows a histogram for the standard deviation of the average yearly precipitation for the same locations. The average yearly precipitation has less variability than the average monthly precipitation. All precipitation measurements (and their standard deviations) are in centimeters.

Figure2.8.HistogramsofstandarddeviationformonthlyandyearlyprecipitationinAustraliafortheperiod1982–1993.

2.3.2Sampling

Samplingisacommonlyusedapproachforselectingasubsetofthedataobjectstobeanalyzed.Instatistics,ithaslongbeenusedforboththepreliminaryinvestigationofthedataandthefinaldataanalysis.Samplingcanalsobeveryusefulindatamining.However,themotivationsforsamplinginstatisticsanddataminingareoftendifferent.Statisticiansusesamplingbecauseobtainingtheentiresetofdataofinterestistooexpensiveortimeconsuming,whiledataminersusuallysamplebecauseitistoocomputationallyexpensiveintermsofthememoryortimerequiredtoprocess

allthedata.Insomecases,usingasamplingalgorithmcanreducethedatasizetothepointwhereabetter,butmorecomputationallyexpensivealgorithmcanbeused.

Thekeyprincipleforeffectivesamplingisthefollowing:Usingasamplewillworkalmostaswellasusingtheentiredatasetifthesampleisrepresentative.Inturn,asampleisrepresentativeifithasapproximatelythesameproperty(ofinterest)astheoriginalsetofdata.Ifthemean(average)ofthedataobjectsisthepropertyofinterest,thenasampleisrepresentativeifithasameanthatisclosetothatoftheoriginaldata.Becausesamplingisastatisticalprocess,therepresentativenessofanyparticularsamplewillvary,andthebestthatwecandoischooseasamplingschemethatguaranteesahighprobabilityofgettingarepresentativesample.Asdiscussednext,thisinvolveschoosingtheappropriatesamplesizeandsamplingtechnique.

SamplingApproachesTherearemanysamplingtechniques,butonlyafewofthemostbasiconesandtheirvariationswillbecoveredhere.Thesimplesttypeofsamplingissimplerandomsampling.Forthistypeofsampling,thereisanequalprobabilityofselectinganyparticularobject.Therearetwovariationsonrandomsampling(andothersamplingtechniquesaswell):(1)samplingwithoutreplacement—aseachobjectisselected,itisremovedfromthesetofallobjectsthattogetherconstitutethepopulation,and(2)samplingwithreplacement—objectsarenotremovedfromthepopulationastheyareselectedforthesample.Insamplingwithreplacement,thesameobjectcanbepickedmorethanonce.Thesamplesproducedbythetwomethodsarenotmuchdifferentwhensamplesarerelativelysmallcomparedtothedatasetsize,butsamplingwithreplacementissimplertoanalyzebecausetheprobabilityofselectinganyobjectremainsconstantduringthesamplingprocess.

Whenthepopulationconsistsofdifferenttypesofobjects,withwidelydifferentnumbersofobjects,simplerandomsamplingcanfailtoadequatelyrepresentthosetypesofobjectsthatarelessfrequent.Thiscancauseproblemswhentheanalysisrequiresproperrepresentationofallobjecttypes.Forexample,whenbuildingclassificationmodelsforrareclasses,itiscriticalthattherareclassesbeadequatelyrepresentedinthesample.Hence,asamplingschemethatcanaccommodatedifferingfrequenciesfortheobjecttypesofinterestisneeded.Stratifiedsampling,whichstartswithprespecifiedgroupsofobjects,issuchanapproach.Inthesimplestversion,equalnumbersofobjectsaredrawnfromeachgroupeventhoughthegroupsareofdifferentsizes.Inanothervariation,thenumberofobjectsdrawnfromeachgroupisproportionaltothesizeofthatgroup.
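The following short sketch (not from the text) illustrates the sampling variations just described with numpy: simple random sampling with and without replacement, and the simplest stratified scheme, which draws an equal number of objects from each group. The population, the labels, and the sample sizes are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
population = np.arange(1000)                              # data object indices
labels = np.where(population < 950, "common", "rare")     # two object types

# Simple random sampling, without and with replacement.
without_repl = rng.choice(population, size=50, replace=False)
with_repl = rng.choice(population, size=50, replace=True)

# Stratified sampling: equal numbers from each group, regardless of group size.
strata = {g: population[labels == g] for g in np.unique(labels)}
stratified = np.concatenate([rng.choice(objs, size=25, replace=False)
                             for objs in strata.values()])

print(np.sum(labels[without_repl] == "rare"))   # rare objects: typically only a few
print(np.sum(labels[stratified] == "rare"))     # rare objects guaranteed: 25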

Example2.8(SamplingandLossofInformation).Onceasamplingtechniquehasbeenselected,itisstillnecessarytochoosethesamplesize.Largersamplesizesincreasetheprobabilitythatasamplewillberepresentative,buttheyalsoeliminatemuchoftheadvantageofsampling.Conversely,withsmallersamplesizes,patternscanbemissedorerroneouspatternscanbedetected.Figure2.9(a)showsadatasetthatcontains8000two-dimensionalpoints,whileFigures2.9(b) and2.9(c) showsamplesfromthisdatasetofsize2000and500,respectively.Althoughmostofthestructureofthisdatasetispresentinthesampleof2000points,muchofthestructureismissinginthesampleof500points.

Figure2.9.Exampleofthelossofstructurewithsampling.

Example2.9(DeterminingtheProperSampleSize).Toillustratethatdeterminingthepropersamplesizerequiresamethodicalapproach,considerthefollowingtask.

Givenasetofdataconsistingofasmallnumberofalmostequalsizedgroups,findatleastone

representativepointforeachofthegroups.Assumethattheobjectsineachgrouparehighly

similartoeachother,butnotverysimilartoobjectsindifferentgroups.Figure2.10(a) shows

anidealizedsetofclusters(groups)fromwhichthesepointsmightbedrawn.

Figure2.10.Findingrepresentativepointsfrom10groups.

Thisproblemcanbeefficientlysolvedusingsampling.Oneapproachistotakeasmallsampleofdatapoints,computethepairwisesimilaritiesbetweenpoints,andthenformgroupsofpointsthatarehighlysimilar.Thedesiredsetofrepresentativepointsisthenobtainedbytakingonepointfromeachofthesegroups.Tofollowthisapproach,however,weneedtodetermineasamplesizethatwouldguarantee,withahighprobability,thedesiredoutcome;thatis,thatatleastonepointwillbeobtainedfromeachcluster.Figure2.10(b) showstheprobabilityofgettingoneobjectfromeachofthe10groupsasthesamplesizerunsfrom10to60.Interestingly,withasamplesizeof20,thereislittlechance(20%)ofgettingasamplethatincludesall10clusters.Evenwithasamplesizeof30,thereisstillamoderatechance(almost40%)ofgettingasamplethatdoesn’tcontainobjectsfromall10clusters.ThisissueisfurtherexploredinthecontextofclusteringbyExercise4 onpage603.
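A curve like the one in Figure 2.10(b) can be approximated with a quick Monte Carlo check. The sketch below is an illustrative assumption about the setup (10 equal groups of 100 points, sampling without replacement); it is not the book's computation, but it shows the same qualitative behavior: small samples frequently miss at least one group.

```python
import numpy as np

rng = np.random.default_rng(0)
n_groups, points_per_group, trials = 10, 100, 5000
group_of_point = np.repeat(np.arange(n_groups), points_per_group)

def prob_all_groups(sample_size):
    """Estimated probability that a random sample covers all 10 groups."""
    hits = 0
    for _ in range(trials):
        sample = rng.choice(group_of_point, size=sample_size, replace=False)
        hits += len(np.unique(sample)) == n_groups
    return hits / trials

for size in (10, 20, 30, 40, 50, 60):
    # Roughly 0.2 at size 20 and 0.6 at size 30, consistent with the discussion above.
    print(size, round(prob_all_groups(size), 2))
```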

ProgressiveSamplingThepropersamplesizecanbedifficulttodetermine,soadaptiveorprogressivesamplingschemesaresometimesused.Theseapproachesstartwithasmallsample,andthenincreasethesamplesizeuntilasampleofsufficientsizehasbeenobtained.Whilethistechniqueeliminatestheneedtodeterminethecorrectsamplesizeinitially,itrequiresthattherebeawaytoevaluatethesampletojudgeifitislargeenough.

Suppose,forinstance,thatprogressivesamplingisusedtolearnapredictivemodel.Althoughtheaccuracyofpredictivemodelsincreasesasthesamplesizeincreases,atsomepointtheincreaseinaccuracylevelsoff.Wewanttostopincreasingthesamplesizeatthisleveling-offpoint.Bykeepingtrackofthechangeinaccuracyofthemodelaswetakeprogressivelylargersamples,andbytakingothersamplesclosetothesizeofthecurrentone,wecangetanestimateofhowclosewearetothisleveling-offpoint,andthus,stopsampling.

2.3.3DimensionalityReduction

Datasetscanhavealargenumberoffeatures.Considerasetofdocuments,whereeachdocumentisrepresentedbyavectorwhosecomponentsarethefrequencieswithwhicheachwordoccursinthedocument.Insuchcases,therearetypicallythousandsortensofthousandsofattributes(components),oneforeachwordinthevocabulary.Asanotherexample,considerasetoftimeseriesconsistingofthedailyclosingpriceofvariousstocksoveraperiodof30years.Inthiscase,theattributes,whicharethepricesonspecificdays,againnumberinthethousands.

Thereareavarietyofbenefitstodimensionalityreduction.Akeybenefitisthatmanydataminingalgorithmsworkbetterifthedimensionality—thenumberofattributesinthedata—islower.Thisispartlybecausedimensionalityreductioncaneliminateirrelevantfeaturesandreducenoiseandpartlybecauseofthecurseofdimensionality,whichisexplainedbelow.Anotherbenefitisthatareductionofdimensionalitycanleadtoamoreunderstandablemodelbecausethemodelusuallyinvolvesfewerattributes.Also,dimensionalityreductionmayallowthedatatobemoreeasilyvisualized.Evenifdimensionalityreductiondoesn’treducethedatatotwoorthreedimensions,dataisoftenvisualizedbylookingatpairsortripletsofattributes,andthenumberofsuchcombinationsisgreatlyreduced.Finally,theamountoftimeandmemoryrequiredbythedataminingalgorithmisreducedwithareductionindimensionality.

Thetermdimensionalityreductionisoftenreservedforthosetechniquesthatreducethedimensionalityofadatasetbycreatingnewattributesthatareacombinationoftheoldattributes.Thereductionofdimensionalitybyselectingattributesthatareasubsetoftheoldisknownasfeaturesubsetselectionorfeatureselection.ItwillbediscussedinSection2.3.4 .

Intheremainderofthissection,webrieflyintroducetwoimportanttopics:thecurseofdimensionalityanddimensionalityreductiontechniquesbasedonlinearalgebraapproachessuchasprincipalcomponentsanalysis(PCA).MoredetailsondimensionalityreductioncanbefoundinAppendixB.

TheCurseofDimensionalityThecurseofdimensionalityreferstothephenomenonthatmanytypesofdataanalysisbecomesignificantlyharderasthedimensionalityofthedataincreases.Specifically,asdimensionalityincreases,thedatabecomesincreasinglysparseinthespacethatitoccupies.Thus,thedataobjectswe

observearequitepossiblynotarepresentativesampleofallpossibleobjects.Forclassification,thiscanmeanthattherearenotenoughdataobjectstoallowthecreationofamodelthatreliablyassignsaclasstoallpossibleobjects.Forclustering,thedifferencesindensityandinthedistancesbetweenpoints,whicharecriticalforclustering,becomelessmeaningful.(ThisisdiscussedfurtherinSections8.1.2,8.4.6,and8.4.8.)Asaresult,manyclusteringandclassificationalgorithms(andotherdataanalysisalgorithms)havetroublewithhigh-dimensionaldataleadingtoreducedclassificationaccuracyandpoorqualityclusters.

LinearAlgebraTechniquesforDimensionalityReductionSomeofthemostcommonapproachesfordimensionalityreduction,particularlyforcontinuousdata,usetechniquesfromlinearalgebratoprojectthedatafromahigh-dimensionalspaceintoalower-dimensionalspace.PrincipalComponentsAnalysis(PCA)isalinearalgebratechniqueforcontinuousattributesthatfindsnewattributes(principalcomponents)that(1)arelinearcombinationsoftheoriginalattributes,(2)areorthogonal(perpendicular)toeachother,and(3)capturethemaximumamountofvariationinthedata.Forexample,thefirsttwoprincipalcomponentscaptureasmuchofthevariationinthedataasispossiblewithtwoorthogonalattributesthatarelinearcombinationsoftheoriginalattributes.SingularValueDecomposition(SVD)isalinearalgebratechniquethatisrelatedtoPCAandisalsocommonlyusedfordimensionalityreduction.Foradditionaldetails,seeAppendicesAandB.
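As a small, self-contained illustration of the idea (not an example from the text), the sketch below applies PCA to synthetic 10-dimensional continuous data built from two latent factors, so that two orthogonal linear combinations capture nearly all of the variation. The data-generation scheme and parameter choices are assumptions made for the demonstration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 200 objects, 10 correlated attributes derived from 2 latent factors plus noise.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 10))

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)        # new attributes: orthogonal linear combinations
print(X_reduced.shape)                  # (200, 2)
print(pca.explained_variance_ratio_)    # fraction of variation captured per component
```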

2.3.4FeatureSubsetSelection

Anotherwaytoreducethedimensionalityistouseonlyasubsetofthefeatures.Whileitmightseemthatsuchanapproachwouldloseinformation,thisisnotthecaseifredundantandirrelevantfeaturesarepresent.Redundantfeaturesduplicatemuchoralloftheinformationcontainedinoneormoreotherattributes.Forexample,thepurchasepriceofaproductandtheamountofsalestaxpaidcontainmuchofthesameinformation.Irrelevantfeaturescontainalmostnousefulinformationforthedataminingtaskathand.Forinstance,students’IDnumbersareirrelevanttothetaskofpredictingstudents’gradepointaverages.Redundantandirrelevantfeaturescanreduceclassificationaccuracyandthequalityoftheclustersthatarefound.

While some irrelevant and redundant attributes can be eliminated immediately by using common sense or domain knowledge, selecting the best subset of features frequently requires a systematic approach. The ideal approach to feature selection is to try all possible subsets of features as input to the data mining algorithm of interest, and then take the subset that produces the best results. This method has the advantage of reflecting the objective and bias of the data mining algorithm that will eventually be used. Unfortunately, since the number of subsets involving n attributes is 2^n, such an approach is impractical

inmostsituationsandalternativestrategiesareneeded.Therearethreestandardapproachestofeatureselection:embedded,filter,andwrapper.

Embeddedapproaches

Featureselectionoccursnaturallyaspartofthedataminingalgorithm.Specifically,duringtheoperationofthedataminingalgorithm,thealgorithmitselfdecideswhichattributestouseandwhichtoignore.Algorithmsforbuildingdecisiontreeclassifiers,whicharediscussedinChapter3 ,oftenoperateinthismanner.


Filterapproaches

Featuresareselectedbeforethedataminingalgorithmisrun,usingsomeapproachthatisindependentofthedataminingtask.Forexample,wemightselectsetsofattributeswhosepairwisecorrelationisaslowaspossiblesothattheattributesarenon-redundant.

Wrapperapproaches

Thesemethodsusethetargetdataminingalgorithmasablackboxtofindthebestsubsetofattributes,inawaysimilartothatoftheidealalgorithmdescribedabove,buttypicallywithoutenumeratingallpossiblesubsets.

Becausetheembeddedapproachesarealgorithm-specific,onlythefilterandwrapperapproacheswillbediscussedfurtherhere.
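One plausible filter-style sketch, in the spirit of the correlation criterion mentioned above, is shown below: drop one attribute of any pair whose absolute correlation exceeds a threshold, before the data mining algorithm is ever run. The data set, the 0.95 threshold, and the keep-the-first rule are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
price = rng.uniform(10, 100, size=200)
df = pd.DataFrame({
    "price": price,
    "sales_tax": 0.07 * price,               # redundant with price
    "weight": rng.uniform(1, 5, size=200),   # unrelated attribute
})

corr = df.corr().abs()
cols = list(df.columns)
to_drop = set()
for i in range(len(cols)):
    for j in range(i + 1, len(cols)):
        if corr.iloc[i, j] > 0.95 and cols[j] not in to_drop:
            to_drop.add(cols[j])             # keep the first attribute of the pair

print(sorted(to_drop))                                   # ['sales_tax']
print(df.drop(columns=sorted(to_drop)).columns.tolist())
```

A wrapper approach would instead score each candidate subset by actually running the target algorithm, as described in the architecture that follows.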

AnArchitectureforFeatureSubsetSelectionItispossibletoencompassboththefilterandwrapperapproacheswithinacommonarchitecture.Thefeatureselectionprocessisviewedasconsistingoffourparts:ameasureforevaluatingasubset,asearchstrategythatcontrolsthegenerationofanewsubsetoffeatures,astoppingcriterion,andavalidationprocedure.Filtermethodsandwrappermethodsdifferonlyinthewayinwhichtheyevaluateasubsetoffeatures.Forawrappermethod,subsetevaluationusesthetargetdataminingalgorithm,whileforafilterapproach,theevaluationtechniqueisdistinctfromthetargetdataminingalgorithm.Thefollowingdiscussionprovidessomedetailsofthisapproach,whichissummarizedinFigure2.11 .

Figure2.11.Flowchartofafeaturesubsetselectionprocess.

Conceptually,featuresubsetselectionisasearchoverallpossiblesubsetsoffeatures.Manydifferenttypesofsearchstrategiescanbeused,butthesearchstrategyshouldbecomputationallyinexpensiveandshouldfindoptimalornearoptimalsetsoffeatures.Itisusuallynotpossibletosatisfybothrequirements,andthus,trade-offsarenecessary.

Anintegralpartofthesearchisanevaluationsteptojudgehowthecurrentsubsetoffeaturescomparestoothersthathavebeenconsidered.Thisrequiresanevaluationmeasurethatattemptstodeterminethegoodnessofasubsetofattributeswithrespecttoaparticulardataminingtask,suchasclassificationorclustering.Forthefilterapproach,suchmeasuresattempttopredicthowwelltheactualdataminingalgorithmwillperformonagivensetofattributes.Forthewrapperapproach,whereevaluationconsistsofactuallyrunningthetargetdataminingalgorithm,thesubsetevaluationfunctionissimplythecriterionnormallyusedtomeasuretheresultofthedatamining.

Becausethenumberofsubsetscanbeenormousanditisimpracticaltoexaminethemall,somesortofstoppingcriterionisnecessary.Thisstrategyisusuallybasedononeormoreconditionsinvolvingthefollowing:thenumberofiterations,whetherthevalueofthesubsetevaluationmeasureisoptimalorexceedsacertainthreshold,whetherasubsetofacertainsizehasbeenobtained,andwhetheranyimprovementcanbeachievedbytheoptionsavailabletothesearchstrategy.

Finally,onceasubsetoffeatureshasbeenselected,theresultsofthetargetdataminingalgorithmontheselectedsubsetshouldbevalidated.Astraightforwardvalidationapproachistorunthealgorithmwiththefullsetoffeaturesandcomparethefullresultstoresultsobtainedusingthesubsetoffeatures.Hopefully,thesubsetoffeatureswillproduceresultsthatarebetterthanoralmostasgoodasthoseproducedwhenusingallfeatures.Anothervalidationapproachistouseanumberofdifferentfeatureselectionalgorithmstoobtainsubsetsoffeaturesandthencomparetheresultsofrunningthedataminingalgorithmoneachsubset.

FeatureWeightingFeatureweightingisanalternativetokeepingoreliminatingfeatures.Moreimportantfeaturesareassignedahigherweight,whilelessimportantfeaturesaregivenalowerweight.Theseweightsaresometimesassignedbasedondomainknowledgeabouttherelativeimportanceoffeatures.Alternatively,theycansometimesbedeterminedautomatically.Forexample,someclassificationschemes,suchassupportvectormachines(Chapter4 ),produceclassificationmodelsinwhicheachfeatureisgivenaweight.Featureswithlargerweightsplayamoreimportantroleinthemodel.Thenormalizationofobjectsthattakesplacewhencomputingthecosinesimilarity(Section2.4.5 )canalsoberegardedasatypeoffeatureweighting.

2.3.5FeatureCreation

Itisfrequentlypossibletocreate,fromtheoriginalattributes,anewsetofattributesthatcapturestheimportantinformationinadatasetmuchmoreeffectively.Furthermore,thenumberofnewattributescanbesmallerthanthenumberoforiginalattributes,allowingustoreapallthepreviouslydescribedbenefitsofdimensionalityreduction.Tworelatedmethodologiesforcreatingnewattributesaredescribednext:featureextractionandmappingthedatatoanewspace.

FeatureExtractionThecreationofanewsetoffeaturesfromtheoriginalrawdataisknownasfeatureextraction.Considerasetofphotographs,whereeachphotographistobeclassifiedaccordingtowhetheritcontainsahumanface.Therawdataisasetofpixels,andassuch,isnotsuitableformanytypesofclassificationalgorithms.However,ifthedataisprocessedtoprovidehigher-levelfeatures,suchasthepresenceorabsenceofcertaintypesofedgesandareasthatarehighlycorrelatedwiththepresenceofhumanfaces,thenamuchbroadersetofclassificationtechniquescanbeappliedtothisproblem.

Unfortunately,inthesenseinwhichitismostcommonlyused,featureextractionishighlydomain-specific.Foraparticularfield,suchasimageprocessing,variousfeaturesandthetechniquestoextractthemhavebeendevelopedoveraperiodoftime,andoftenthesetechniqueshavelimitedapplicabilitytootherfields.Consequently,wheneverdataminingisappliedtoarelativelynewarea,akeytaskisthedevelopmentofnewfeaturesandfeatureextractionmethods.

Althoughfeatureextractionisoftencomplicated,Example2.10 illustratesthatitcanberelativelystraightforward.

Example2.10(Density).Consideradatasetconsistingofinformationabouthistoricalartifacts,which,alongwithotherinformation,containsthevolumeandmassofeachartifact.Forsimplicity,assumethattheseartifactsaremadeofasmallnumberofmaterials(wood,clay,bronze,gold)andthatwewanttoclassifytheartifactswithrespecttothematerialofwhichtheyaremade.Inthiscase,adensityfeatureconstructedfromthemassandvolumefeatures,i.e.,density=mass/volume,wouldmostdirectlyyieldanaccurateclassification.Althoughtherehavebeensomeattemptstoautomaticallyperformsuchsimplefeatureextractionbyexploringbasicmathematicalcombinationsofexistingattributes,themostcommonapproachistoconstructfeaturesusingdomainexpertise.

MappingtheDatatoaNewSpaceAtotallydifferentviewofthedatacanrevealimportantandinterestingfeatures.Consider,forexample,timeseriesdata,whichoftencontainsperiodicpatterns.Ifthereisonlyasingleperiodicpatternandnotmuchnoise,thenthepatterniseasilydetected.If,ontheotherhand,thereareanumberofperiodicpatternsandasignificantamountofnoise,thenthesepatternsarehardtodetect.Suchpatternscan,nonetheless,oftenbedetectedbyapplyingaFouriertransformtothetimeseriesinordertochangetoarepresentationinwhichfrequencyinformationisexplicit.InExample2.11 ,itwillnotbenecessarytoknowthedetailsoftheFouriertransform.Itisenoughtoknowthat,foreachtimeseries,theFouriertransformproducesanewdataobjectwhoseattributesarerelatedtofrequencies.

Example2.11(FourierAnalysis).ThetimeseriespresentedinFigure2.12(b) isthesumofthreeothertimeseries,twoofwhichareshowninFigure2.12(a) andhavefrequenciesof7and17cyclespersecond,respectively.Thethirdtimeseriesisrandomnoise.Figure2.12(c) showsthepowerspectrumthatcanbecomputedafterapplyingaFouriertransformtotheoriginaltimeseries.(Informally,thepowerspectrumisproportionaltothesquareofeachfrequencyattribute.)Inspiteofthenoise,therearetwopeaksthatcorrespondtotheperiodsofthetwooriginal,non-noisytimeseries.Again,themainpointisthatbetterfeaturescanrevealimportantaspectsofthedata.

Figure2.12.ApplicationoftheFouriertransformtoidentifytheunderlyingfrequenciesintimeseriesdata.

Manyothersortsoftransformationsarealsopossible.BesidestheFouriertransform,thewavelettransformhasalsoprovenveryusefulfortimeseriesandothertypesofdata.
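The sketch below reproduces the spirit of Example 2.11 on synthetic data (the sampling rate, amplitudes, and noise level are assumptions, not the book's): two sinusoids at 7 and 17 cycles per second plus random noise are mapped to the frequency domain, where the two underlying frequencies appear as the dominant peaks of the power spectrum.

```python
import numpy as np

rng = np.random.default_rng(0)
sample_rate, duration = 100, 4                      # 100 samples/second for 4 seconds
t = np.arange(0, duration, 1 / sample_rate)
signal = (np.sin(2 * np.pi * 7 * t)
          + np.sin(2 * np.pi * 17 * t)
          + rng.normal(scale=0.5, size=t.size))     # random noise

spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(t.size, d=1 / sample_rate)
power = np.abs(spectrum) ** 2                       # power spectrum

# The two largest peaks (ignoring the DC term) sit at roughly 7 and 17 Hz.
top = freqs[np.argsort(power[1:])[-2:] + 1]
print(np.sort(top))                                 # [ 7. 17.]
```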

2.3.6DiscretizationandBinarization

Somedataminingalgorithms,especiallycertainclassificationalgorithms,requirethatthedatabeintheformofcategoricalattributes.Algorithmsthatfindassociationpatternsrequirethatthedatabeintheformofbinaryattributes.Thus,itisoftennecessarytotransformacontinuousattributeintoacategoricalattribute(discretization),andbothcontinuousanddiscreteattributesmayneedtobetransformedintooneormorebinaryattributes(binarization).Additionally,ifacategoricalattributehasalargenumberofvalues(categories),orsomevaluesoccurinfrequently,thenitcanbebeneficialforcertaindataminingtaskstoreducethenumberofcategoriesbycombiningsomeofthevalues.

Aswithfeatureselection,thebestdiscretizationorbinarizationapproachistheonethat“producesthebestresultforthedataminingalgorithmthatwillbeusedtoanalyzethedata.”Itistypicallynotpracticaltoapplysuchacriteriondirectly.Consequently,discretizationorbinarizationisperformedinawaythatsatisfiesacriterionthatisthoughttohavearelationshiptogoodperformanceforthedataminingtaskbeingconsidered.Ingeneral,thebestdiscretizationdependsonthealgorithmbeingused,aswellastheotherattributesbeingconsidered.Typically,however,thediscretizationofeachattributeisconsideredinisolation.

Binarization

A simple technique to binarize a categorical attribute is the following: If there are m categorical values, then uniquely assign each original value to an integer in the interval [0, m-1]. If the attribute is ordinal, then order must be maintained by the assignment. (Note that even if the attribute is originally represented using integers, this process is necessary if the integers are not in the interval [0, m-1].) Next, convert each of these m integers to a binary number. Since ⌈log2(m)⌉ binary digits are required to represent these integers, represent these binary numbers using n = ⌈log2(m)⌉ binary attributes. To illustrate, a categorical variable with 5 values {awful, poor, OK, good, great} would require three binary variables x1, x2, and x3. The conversion is shown in Table 2.5.

Table 2.5. Conversion of a categorical attribute to three binary attributes.

Categorical Value  Integer Value  x1  x2  x3
awful              0              0   0   0
poor               1              0   0   1
OK                 2              0   1   0
good               3              0   1   1
great              4              1   0   0

Such a transformation can cause complications, such as creating unintended relationships among the transformed attributes. For example, in Table 2.5, attributes x2 and x3 are correlated because information about the good value is encoded using both attributes. Furthermore, association analysis requires asymmetric binary attributes, where only the presence of the attribute (value = 1) is important. For association problems, it is therefore necessary to introduce one asymmetric binary attribute for each categorical value, as shown in Table 2.6. If the number of resulting attributes is too large, then the techniques described in the following sections can be used to reduce the number of categorical values before binarization.

Table 2.6. Conversion of a categorical attribute to five asymmetric binary attributes.

Categorical Value  Integer Value  x1  x2  x3  x4  x5
awful              0              1   0   0   0   0
poor               1              0   1   0   0   0
OK                 2              0   0   1   0   0
good               3              0   0   0   1   0
great              4              0   0   0   0   1
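A short sketch of both binarization schemes for the quality attribute above is given below; the pandas-based encoding is an illustrative assumption, not part of the text.

```python
import pandas as pd

quality = pd.Series(["awful", "poor", "OK", "good", "great"],
                    dtype=pd.CategoricalDtype(
                        ["awful", "poor", "OK", "good", "great"], ordered=True))

# Scheme 1 (Table 2.5): map to integers 0..m-1, then use n = ceil(log2(m)) bits.
codes = quality.cat.codes.to_numpy()
n_bits = 3
bits = pd.DataFrame({f"x{i+1}": (codes >> (n_bits - 1 - i)) & 1
                     for i in range(n_bits)})

# Scheme 2 (Table 2.6): one asymmetric binary attribute per categorical value.
one_hot = pd.get_dummies(quality, prefix="x")

print(bits)
print(one_hot.astype(int))
```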

Likewise,forassociationproblems,itcanbenecessarytoreplaceasinglebinaryattributewithtwoasymmetricbinaryattributes.Considerabinaryattributethatrecordsaperson’sgender,maleorfemale.Fortraditionalassociationrulealgorithms,thisinformationneedstobetransformedintotwoasymmetricbinaryattributes,onethatisa1onlywhenthepersonismaleandonethatisa1onlywhenthepersonisfemale.(Forasymmetricbinaryattributes,theinformationrepresentationissomewhatinefficientinthattwobitsofstoragearerequiredtorepresenteachbitofinformation.)

Discretization of Continuous Attributes

Discretization is typically applied to attributes that are used in classification or association analysis. Transformation of a continuous attribute to a categorical attribute involves two subtasks: deciding how many categories, n, to have and determining how to map the values of the continuous attribute to these categories. In the first step, after the values of the continuous attribute are sorted, they are then divided into n intervals by specifying n - 1 split points. In the second, rather trivial step, all the values in one interval are mapped to the same categorical value. Therefore, the problem of discretization is one of deciding how many split points to choose and where to place them. The result can be represented either as a set of intervals {(x_0, x_1], (x_1, x_2], ..., (x_{n-1}, x_n)}, where x_0 and x_n can be -∞ or +∞, respectively, or equivalently, as a series of inequalities x_0 < x ≤ x_1, ..., x_{n-1} < x < x_n.

Unsupervised Discretization

A basic distinction between discretization methods for classification is whether class information is used (supervised) or not (unsupervised). If class information is not used, then relatively simple approaches are common. For instance, the equal width approach divides the range of the attribute into a user-specified number of intervals each having the same width. Such an approach can be badly affected by outliers, and for that reason, an equal frequency (equal depth) approach, which tries to put the same number of objects into each interval, is often preferred. As another example of unsupervised discretization, a clustering method, such as K-means (see Chapter 7), can also be used. Finally, visually inspecting the data can sometimes be an effective approach.

Example 2.12 (Discretization Techniques). This example demonstrates how these approaches work on an actual data set. Figure 2.13(a) shows data points belonging to four different groups, along with two outliers, the large dots on either end. The techniques of the previous paragraph were applied to discretize the x values of these data points into four categorical values. (Points in the data set have a random y component to make it easy to see how many points are in each group.) Visually inspecting the data works quite well, but is not automatic, and thus, we focus on the other three approaches. The split points produced by the techniques equal width, equal frequency, and K-means are shown in Figures 2.13(b), 2.13(c), and 2.13(d), respectively. The split points are represented as dashed lines.

Figure2.13.Differentdiscretizationtechniques.

Inthisparticularexample,ifwemeasuretheperformanceofadiscretizationtechniquebytheextenttowhichdifferentobjectsthatclumptogetherhavethesamecategoricalvalue,thenK-meansperformsbest,followedbyequalfrequency,andfinally,equalwidth.Moregenerally,thebestdiscretizationwilldependontheapplicationandofteninvolvesdomain-specificdiscretization.Forexample,thediscretizationofpeopleintolowincome,middleincome,andhighincomeisbasedoneconomicfactors.
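The sketch below computes split points for the three unsupervised methods compared in Example 2.12, on synthetic data rather than the book's. The data layout (four groups plus two outliers), the four-bin setting, and the choice to place K-means split points midway between adjacent cluster centers are assumptions made for the illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(loc, 0.5, 50) for loc in (2, 6, 10, 14)]
                   + [np.array([-5.0, 21.0])])           # four groups + two outliers

n_bins = 4
equal_width = np.linspace(x.min(), x.max(), n_bins + 1)[1:-1]
equal_freq = np.quantile(x, np.linspace(0, 1, n_bins + 1))[1:-1]

centers = np.sort(KMeans(n_clusters=n_bins, n_init=10, random_state=0)
                  .fit(x.reshape(-1, 1)).cluster_centers_.ravel())
kmeans_splits = (centers[:-1] + centers[1:]) / 2          # midpoints between centers

print("equal width:    ", equal_width.round(2))           # dragged around by outliers
print("equal frequency:", equal_freq.round(2))
print("K-means:        ", kmeans_splits.round(2))         # tracks the four groups
```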

SupervisedDiscretization

Ifclassificationisourapplicationandclasslabelsareknownforsomedataobjects,thendiscretizationapproachesthatuseclasslabelsoftenproducebetterclassification.Thisshouldnotbesurprising,sinceanintervalconstructedwithnoknowledgeofclasslabelsoftencontainsamixtureofclasslabels.Aconceptuallysimpleapproachistoplacethesplitsinawaythatmaximizesthepurityoftheintervals,i.e.,theextenttowhichanintervalcontainsasingleclasslabel.Inpractice,however,suchanapproachrequirespotentiallyarbitrarydecisionsaboutthepurityofanintervalandtheminimumsizeofaninterval.

Toovercomesuchconcerns,somestatisticallybasedapproachesstartwitheachattributevalueinaseparateintervalandcreatelargerintervalsbymergingadjacentintervalsthataresimilaraccordingtoastatisticaltest.Analternativetothisbottom-upapproachisatop-downapproachthatstartsbybisectingtheinitialvaluessothattheresultingtwointervalsgiveminimumentropy.Thistechniqueonlyneedstoconsidereachvalueasapossiblesplitpoint,becauseitisassumedthatintervalscontainorderedsetsofvalues.Thesplittingprocessisthenrepeatedwithanotherinterval,typicallychoosingtheintervalwiththeworst(highest)entropy,untilauser-specifiednumberofintervalsisreached,orastoppingcriterionissatisfied.

Entropy-based approaches are one of the most promising approaches to discretization, whether bottom-up or top-down. First, it is necessary to define entropy. Let k be the number of different class labels, m_i be the number of values in the i-th interval of a partition, and m_ij be the number of values of class j in interval i. Then the entropy e_i of the i-th interval is given by the equation

e_i = -\sum_{j=1}^{k} p_{ij} \log_2 p_{ij},

where p_{ij} = m_{ij}/m_i is the probability (fraction of values) of class j in the interval. The total entropy, e, of the partition is the weighted average of the individual interval entropies, i.e.,

e = \sum_{i=1}^{n} w_i e_i,

where m is the number of values, w_i = m_i/m is the fraction of values in the i-th interval, and n is the number of intervals. Intuitively, the entropy of an interval is a measure of the purity of an interval. If an interval contains only values of one class (is perfectly pure), then the entropy is 0 and it contributes nothing to the overall entropy. If the classes of values in an interval occur equally often (the interval is as impure as possible), then the entropy is a maximum.

Example 2.13 (Discretization of Two Attributes). The top-down method based on entropy was used to independently discretize both the x and y attributes of the two-dimensional data shown in Figure 2.14. In the first discretization, shown in Figure 2.14(a), the x and y attributes were both split into three intervals. (The dashed lines indicate the split points.) In the second discretization, shown in Figure 2.14(b), the x and y attributes were both split into five intervals.

Figure2.14.Discretizingxandyattributesforfourgroups(classes)ofpoints.

Thissimpleexampleillustratestwoaspectsofdiscretization.First,intwodimensions,theclassesofpointsarewellseparated,butinonedimension,thisisnotso.Ingeneral,discretizingeachattributeseparatelyoftenguaranteessuboptimalresults.Second,fiveintervalsworkbetterthanthree,butsixintervalsdonotimprovethediscretizationmuch,atleastintermsofentropy.(Entropyvaluesandresultsforsixintervalsarenotshown.)Consequently,itisdesirabletohaveastoppingcriterionthatautomaticallyfindstherightnumberofpartitions.
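The entropy equations above translate directly into code. The sketch below (hypothetical class counts, two classes, three intervals) computes the interval entropies e_i and the weighted total entropy e; it is a minimal illustration, not the book's splitting algorithm.

```python
import numpy as np

def interval_entropy(class_counts):
    """Entropy e_i of one interval from its per-class value counts m_ij."""
    m_i = sum(class_counts)
    p = np.array([c / m_i for c in class_counts if c > 0])
    return float(-(p * np.log2(p)).sum())

def total_entropy(partition):
    """Weighted average e of interval entropies; partition is a list of count lists."""
    m = sum(sum(counts) for counts in partition)
    return sum(sum(counts) / m * interval_entropy(counts) for counts in partition)

# A pure interval contributes 0; a maximally mixed two-class interval contributes 1 bit.
partition = [[10, 0], [5, 5], [2, 8]]
print([round(interval_entropy(c), 3) for c in partition])   # [0.0, 1.0, 0.722]
print(round(total_entropy(partition), 3))                   # weighted average
```

A top-down splitter would evaluate total_entropy for every candidate split point and keep the one with the lowest value, repeating on the worst interval until a stopping criterion is met.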

CategoricalAttributeswithTooManyValuesCategoricalattributescansometimeshavetoomanyvalues.Ifthecategoricalattributeisanordinalattribute,thentechniquessimilartothoseforcontinuousattributescanbeusedtoreducethenumberofcategories.Ifthecategoricalattributeisnominal,however,thenotherapproachesareneeded.Considera

universitythathasalargenumberofdepartments.Consequently,adepartmentnameattributemighthavedozensofdifferentvalues.Inthissituation,wecoulduseourknowledgeoftherelationshipsamongdifferentdepartmentstocombinedepartmentsintolargergroups,suchasengineering,socialsciences,orbiologicalsciences.Ifdomainknowledgedoesnotserveasausefulguideorsuchanapproachresultsinpoorclassificationperformance,thenitisnecessarytouseamoreempiricalapproach,suchasgroupingvaluestogetheronlyifsuchagroupingresultsinimprovedclassificationaccuracyorachievessomeotherdataminingobjective.

2.3.7VariableTransformation

Avariabletransformationreferstoatransformationthatisappliedtoallthevaluesofavariable.(Weusethetermvariableinsteadofattributetoadheretocommonusage,althoughwewillalsorefertoattributetransformationonoccasion.)Inotherwords,foreachobject,thetransformationisappliedtothevalueofthevariableforthatobject.Forexample,ifonlythemagnitudeofavariableisimportant,thenthevaluesofthevariablecanbetransformedbytakingtheabsolutevalue.Inthefollowingsection,wediscusstwoimportanttypesofvariabletransformations:simplefunctionaltransformationsandnormalization.

Simple Functions

For this type of variable transformation, a simple mathematical function is applied to each value individually. If x is a variable, then examples of such transformations include x^k, log x, e^x, \sqrt{x}, 1/x, sin x, or |x|. In statistics, variable transformations, especially sqrt, log, and 1/x, are often used to transform data that does not have a Gaussian (normal) distribution into data that does. While this can be important, other reasons often take precedence in data mining. Suppose the variable of interest is the number of data bytes in a session, and the number of bytes ranges from 1 to 1 billion. This is a huge range, and it can be advantageous to compress it by using a log_10 transformation. In this case, sessions that transferred 10^8 and 10^9 bytes would be more similar to each other than sessions that transferred 10 and 1000 bytes (9 - 8 = 1 versus 3 - 1 = 3). For some applications, such as network intrusion detection, this may be what is desired, since the first two sessions most likely represent transfers of large files, while the latter two sessions could be two quite distinct types of sessions.

Variable transformations should be applied with caution because they change the nature of the data. While this is what is desired, there can be problems if the nature of the transformation is not fully appreciated. For instance, the transformation 1/x reduces the magnitude of values that are 1 or larger, but increases the magnitude of values between 0 and 1. To illustrate, the values {1, 2, 3} go to {1, 1/2, 1/3}, but the values {1, 1/2, 1/3} go to {1, 2, 3}. Thus, for all sets of values, the transformation 1/x reverses the order. To help clarify the effect of a transformation, it is important to ask questions such as the following: What is the desired property of the transformed attribute? Does the order need to be maintained? Does the transformation apply to all values, especially negative values and 0? What is the effect of the transformation on the values between 0 and 1? Exercise 17 on page 109 explores other aspects of variable transformation.

Normalization or Standardization

The goal of standardization or normalization is to make an entire set of values have a particular property. A traditional example is that of "standardizing a variable" in statistics. If x̄ is the mean (average) of the attribute values and s_x is their standard deviation, then the transformation x′ = (x - x̄)/s_x creates a new

variablethathasameanof0andastandarddeviationof1.Ifdifferentvariablesaretobeusedtogether,e.g.,forclustering,thensuchatransformationisoftennecessarytoavoidhavingavariablewithlargevaluesdominatetheresultsoftheanalysis.Toillustrate,considercomparingpeoplebasedontwovariables:ageandincome.Foranytwopeople,thedifferenceinincomewilllikelybemuchhigherinabsoluteterms(hundredsorthousandsofdollars)thanthedifferenceinage(lessthan150).Ifthedifferencesintherangeofvaluesofageandincomearenottakenintoaccount,thenthecomparisonbetweenpeoplewillbedominatedbydifferencesinincome.Inparticular,ifthesimilarityordissimilarityoftwopeopleiscalculatedusingthesimilarityordissimilaritymeasuresdefinedlaterinthischapter,theninmanycases,suchasthatofEuclideandistance,theincomevalueswilldominatethecalculation.

The mean and standard deviation are strongly affected by outliers, so the above transformation is often modified. First, the mean is replaced by the median, i.e., the middle value. Second, the standard deviation is replaced by the absolute standard deviation. Specifically, if x is a variable, then the absolute standard deviation of x is given by σ_A = \sum_{i=1}^{m} |x_i - μ|, where x_i is the i-th value of the variable, m is the number of objects, and μ is either the mean or median. Other approaches for computing estimates of the location (center) and spread of a set of values in the presence of outliers are described in statistics books. These more robust measures can also be used to define a standardization transformation.
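The sketch below applies both versions of standardization discussed above to a small, made-up income vector containing one outlier: the classic (x - mean)/std transformation and the more robust variant built from the median and the absolute standard deviation σ_A as defined in the text. The data values are assumptions for illustration only.

```python
import numpy as np

income = np.array([30_000, 35_000, 32_000, 31_000, 36_000, 1_000_000])  # one outlier

# Classic standardization: mean 0, standard deviation 1, but distorted by the outlier.
z_score = (income - income.mean()) / income.std(ddof=1)

# Robust variant: median for location, absolute standard deviation for spread.
mu = np.median(income)
abs_dev = np.sum(np.abs(income - mu))          # sigma_A = sum |x_i - mu|
robust = (income - mu) / abs_dev

print(z_score.round(2))    # ordinary values get squeezed together by the outlier
print(robust.round(4))     # ordinary values remain distinguishable near 0
```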

2.4MeasuresofSimilarityandDissimilaritySimilarityanddissimilarityareimportantbecausetheyareusedbyanumberofdataminingtechniques,suchasclustering,nearestneighborclassification,andanomalydetection.Inmanycases,theinitialdatasetisnotneededoncethesesimilaritiesordissimilaritieshavebeencomputed.Suchapproachescanbeviewedastransformingthedatatoasimilarity(dissimilarity)spaceandthenperformingtheanalysis.Indeed,kernelmethodsareapowerfulrealizationofthisidea.ThesemethodsareintroducedinSection2.4.7 andarediscussedmorefullyinthecontextofclassificationinSection4.9.4.

Webeginwithadiscussionofthebasics:high-leveldefinitionsofsimilarityanddissimilarity,andadiscussionofhowtheyarerelated.Forconvenience,thetermproximityisusedtorefertoeithersimilarityordissimilarity.Sincetheproximitybetweentwoobjectsisafunctionoftheproximitybetweenthecorrespondingattributesofthetwoobjects,wefirstdescribehowtomeasuretheproximitybetweenobjectshavingonlyoneattribute.

Wethenconsiderproximitymeasuresforobjectswithmultipleattributes.ThisincludesmeasuressuchastheJaccardandcosinesimilaritymeasures,whichareusefulforsparsedata,suchasdocuments,aswellascorrelationandEuclideandistance,whichareusefulfornon-sparse(dense)data,suchastimeseriesormulti-dimensionalpoints.Wealsoconsidermutualinformation,whichcanbeappliedtomanytypesofdataandisgoodfordetectingnonlinearrelationships.Inthisdiscussion,werestrictourselvestoobjectswithrelativelyhomogeneousattributetypes,typicallybinaryorcontinuous.

Next,weconsiderseveralimportantissuesconcerningproximitymeasures.Thisincludeshowtocomputeproximitybetweenobjectswhentheyhaveheterogeneoustypesofattributes,andapproachestoaccountfordifferencesofscaleandcorrelationamongvariableswhencomputingdistancebetweennumericalobjects.Thesectionconcludeswithabriefdiscussionofhowtoselecttherightproximitymeasure.

Althoughthissectionfocusesonthecomputationofproximitybetweendataobjects,proximitycanalsobecomputedbetweenattributes.Forexample,forthedocument-termmatrixofFigure2.2(d) ,thecosinemeasurecanbeusedtocomputesimilaritybetweenapairofdocumentsorapairofterms(words).Knowingthattwovariablesarestronglyrelatedcan,forexample,behelpfulforeliminatingredundancy.Inparticular,thecorrelationandmutualinformationmeasuresdiscussedlaterareoftenusedforthatpurpose.

2.4.1Basics

DefinitionsInformally,thesimilaritybetweentwoobjectsisanumericalmeasureofthedegreetowhichthetwoobjectsarealike.Consequently,similaritiesarehigherforpairsofobjectsthataremorealike.Similaritiesareusuallynon-negativeandareoftenbetween0(nosimilarity)and1(completesimilarity).

Thedissimilaritybetweentwoobjectsisanumericalmeasureofthedegreetowhichthetwoobjectsaredifferent.Dissimilaritiesarelowerformoresimilarpairsofobjects.Frequently,thetermdistanceisusedasasynonymfordissimilarity,although,asweshallsee,distanceoftenreferstoaspecialclass

ofdissimilarities.Dissimilaritiessometimesfallintheinterval[0,1],butitisalsocommonforthemtorangefrom0to∞.

TransformationsTransformationsareoftenappliedtoconvertasimilaritytoadissimilarity,orviceversa,ortotransformaproximitymeasuretofallwithinaparticularrange,suchas[0,1].Forinstance,wemayhavesimilaritiesthatrangefrom1to10,buttheparticularalgorithmorsoftwarepackagethatwewanttousemaybedesignedtoworkonlywithdissimilarities,oritmayworkonlywithsimilaritiesintheinterval[0,1].Wediscusstheseissuesherebecausewewillemploysuchtransformationslaterinourdiscussionofproximity.Inaddition,theseissuesarerelativelyindependentofthedetailsofspecificproximitymeasures.

Frequently, proximity measures, especially similarities, are defined or transformed to have values in the interval [0, 1]. Informally, the motivation for this is to use a scale in which a proximity value indicates the fraction of similarity (or dissimilarity) between two objects. Such a transformation is often relatively straightforward. For example, if the similarities between objects range from 1 (not at all similar) to 10 (completely similar), we can make them fall within the range [0, 1] by using the transformation s′ = (s - 1)/9, where s and s′ are the original and new similarity values, respectively. In the more general case, the transformation of similarities to the interval [0, 1] is given by the expression s′ = (s - min_s)/(max_s - min_s), where max_s and min_s are the maximum and minimum similarity values, respectively. Likewise, dissimilarity measures with a finite range can be mapped to the interval [0, 1] by using the formula d′ = (d - min_d)/(max_d - min_d). This is an example of a linear transformation, which preserves the relative distances between points. In other words, if points x1 and x2 are twice as far apart as points x3 and x4, the same will be true after a linear transformation.

However, there can be complications in mapping proximity measures to the interval [0, 1] using a linear transformation. If, for example, the proximity measure originally takes values in the interval [0, ∞], then max_d is not defined and a nonlinear transformation is needed. Values will not have the same relationship to one another on the new scale. Consider the transformation d′ = d/(1 + d) for a dissimilarity measure that ranges from 0 to ∞. The dissimilarities 0, 0.5, 2, 10, 100, and 1000 will be transformed into the new dissimilarities 0, 0.33, 0.67, 0.90, 0.99, and 0.999, respectively. Larger values on the original dissimilarity scale are compressed into the range of values near 1, but whether this is desirable depends on the application.

Note that mapping proximity measures to the interval [0, 1] can also change the meaning of the proximity measure. For example, correlation, which is discussed later, is a measure of similarity that takes values in the interval [-1, 1]. Mapping these values to the interval [0, 1] by taking the absolute value loses information about the sign, which can be important in some applications. See Exercise 22 on page 111.

Transforming similarities to dissimilarities and vice versa is also relatively straightforward, although we again face the issues of preserving meaning and changing a linear scale into a nonlinear scale. If the similarity (or dissimilarity) falls in the interval [0, 1], then the dissimilarity can be defined as d = 1 - s (s = 1 - d). Another simple approach is to define similarity as the negative of the dissimilarity (or vice versa). To illustrate, the dissimilarities 0, 1, 10, and 100 can be transformed into the similarities 0, -1, -10, and -100, respectively.

The similarities resulting from the negation transformation are not restricted to the range [0, 1], but if that is desired, then transformations such as s = 1/(d + 1), s = e^{-d}, or s = 1 - (d - min_d)/(max_d - min_d) can be used. For the transformation s = 1/(d + 1), the dissimilarities 0, 1, 10, 100 are transformed into 1, 0.5, 0.09, 0.01, respectively. For s = e^{-d}, they become 1.00, 0.37, 0.00, 0.00, respectively, while for s = 1 - (d - min_d)/(max_d - min_d), they become 1.00, 0.99, 0.90, 0.00, respectively. In this discussion, we have focused on converting dissimilarities to similarities. Conversion in the opposite direction is considered in Exercise 23 on page 111.

Ingeneral,anymonotonicdecreasingfunctioncanbeusedtoconvertdissimilaritiestosimilarities,orviceversa.Ofcourse,otherfactorsalsomustbeconsideredwhentransformingsimilaritiestodissimilarities,orviceversa,orwhentransformingthevaluesofaproximitymeasuretoanewscale.Wehavementionedissuesrelatedtopreservingmeaning,distortionofscale,andrequirementsofdataanalysistools,butthislistiscertainlynotexhaustive.
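A tiny numerical check (not from the text) of the three dissimilarity-to-similarity transformations above, applied to the dissimilarities 0, 1, 10, and 100:

```python
import numpy as np

d = np.array([0.0, 1.0, 10.0, 100.0])

s_inverse = 1 / (d + 1)                              # 1.00, 0.50, 0.09, 0.01
s_exp = np.exp(-d)                                   # 1.00, 0.37, 0.00, 0.00
s_linear = 1 - (d - d.min()) / (d.max() - d.min())   # 1.00, 0.99, 0.90, 0.00

for name, s in [("1/(d+1)", s_inverse), ("exp(-d)", s_exp), ("linear", s_linear)]:
    print(name, s.round(2))
```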

2.4.2SimilarityandDissimilaritybetweenSimpleAttributes

Theproximityofobjectswithanumberofattributesistypicallydefinedbycombiningtheproximitiesofindividualattributes,andthus,wefirstdiscussproximitybetweenobjectshavingasingleattribute.Considerobjectsdescribedbyonenominalattribute.Whatwoulditmeanfortwosuchobjectstobesimilar?Becausenominalattributesconveyonlyinformationaboutthedistinctnessofobjects,allwecansayisthattwoobjectseitherhavethesamevalueortheydonot.Hence,inthiscasesimilarityistraditionallydefinedas1ifattributevaluesmatch,andas0otherwise.Adissimilaritywouldbedefinedintheoppositeway:0iftheattributevaluesmatch,and1iftheydonot.

Forobjectswithasingleordinalattribute,thesituationismorecomplicatedbecauseinformationaboutordershouldbetakenintoaccount.Consideran

attribute that measures the quality of a product, e.g., a candy bar, on the scale {poor, fair, OK, good, wonderful}. It would seem reasonable that a product, P1, which is rated wonderful, would be closer to a product P2, which is rated good, than it would be to a product P3, which is rated OK. To make this observation quantitative, the values of the ordinal attribute are often mapped to successive integers, beginning at 0 or 1, e.g., {poor = 0, fair = 1, OK = 2, good = 3, wonderful = 4}. Then, d(P1, P2) = 3 - 2 = 1 or, if we want the dissimilarity to fall between 0 and 1, d(P1, P2) = (3 - 2)/4 = 0.25. A similarity for ordinal attributes can then be defined as s = 1 - d.

This definition of similarity (dissimilarity) for an ordinal attribute should make the reader a bit uneasy since this assumes equal intervals between successive values of the attribute, and this is not necessarily so. Otherwise, we would have an interval or ratio attribute. Is the difference between the values fair and good really the same as that between the values OK and wonderful? Probably not, but in practice, our options are limited, and in the absence of more information, this is the standard approach for defining proximity between ordinal attributes.

For interval or ratio attributes, the natural measure of dissimilarity between two objects is the absolute difference of their values. For example, we might compare our current weight and our weight a year ago by saying "I am ten pounds heavier." In cases such as these, the dissimilarities typically range from 0 to ∞, rather than from 0 to 1. The similarity of interval or ratio attributes is typically expressed by transforming a dissimilarity into a similarity, as previously described.

Table 2.7 summarizes this discussion. In this table, x and y are two objects that have one attribute of the indicated type. Also, d(x, y) and s(x, y) are the dissimilarity and similarity between x and y, respectively. Other approaches are possible; these are the most common ones.

Table 2.7. Similarity and dissimilarity for simple attributes.

Attribute Type: Nominal
  Dissimilarity: d = 0 if x = y, d = 1 if x ≠ y
  Similarity: s = 1 if x = y, s = 0 if x ≠ y

Attribute Type: Ordinal (values mapped to integers 0 to n - 1, where n is the number of values)
  Dissimilarity: d = |x - y|/(n - 1)
  Similarity: s = 1 - d

Attribute Type: Interval or Ratio
  Dissimilarity: d = |x - y|
  Similarity: s = -d, s = 1/(1 + d), s = e^{-d}, or s = 1 - (d - min_d)/(max_d - min_d)

The following two sections consider more complicated measures of proximity between objects that involve multiple attributes: (1) dissimilarities between data objects and (2) similarities between data objects. This division allows us to more naturally display the underlying motivations for employing various proximity measures. We emphasize, however, that similarities can be transformed into dissimilarities and vice versa using the approaches described earlier.

2.4.3 Dissimilarities between Data Objects

In this section, we discuss various kinds of dissimilarities. We begin with a discussion of distances, which are dissimilarities with certain properties, and then provide examples of more general kinds of dissimilarities.

Distances

We first present some examples, and then offer a more formal description of distances in terms of the properties common to all distances. The Euclidean distance, d, between two points, x and y, in one-, two-, three-, or higher-dimensional space, is given by the following familiar formula:

d(x, y) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2},    (2.1)

where n is the number of dimensions and x_k and y_k are, respectively, the k-th attributes (components) of x and y. We illustrate this formula with Figure 2.15 and Tables 2.8 and 2.9, which show a set of points, the x and y coordinates of these points, and the distance matrix containing the pairwise distances of these points.

Figure 2.15. Four two-dimensional points.

The Euclidean distance measure given in Equation 2.1 is generalized by the Minkowski distance metric shown in Equation 2.2,

d(x, y) = \left( \sum_{k=1}^{n} |x_k - y_k|^r \right)^{1/r},    (2.2)

where r is a parameter. The following are the three most common examples of Minkowski distances.

r = 1. City block (Manhattan, taxicab, L1 norm) distance. A common example is the Hamming distance, which is the number of bits that is different between two objects that have only binary attributes, i.e., between two binary vectors.

r = 2. Euclidean distance (L2 norm).

r = ∞. Supremum (Lmax or L∞ norm) distance. This is the maximum difference between any attribute of the objects. More formally, the L∞ distance is defined by Equation 2.3

d(x, y) = \lim_{r \to \infty} \left( \sum_{k=1}^{n} |x_k - y_k|^r \right)^{1/r}.    (2.3)

The r parameter should not be confused with the number of dimensions (attributes) n. The Euclidean, Manhattan, and supremum distances are defined for all values of n: 1, 2, 3, ..., and specify different ways of combining the differences in each dimension (attribute) into an overall distance.

Tables 2.10 and 2.11, respectively, give the proximity matrices for the L1 and L∞ distances using data from Table 2.8. Notice that all these distance matrices are symmetric; i.e., the ij-th entry is the same as the ji-th entry. In Table 2.9, for instance, the fourth row of the first column and the fourth column of the first row both contain the value 5.1.

Table 2.8. x and y coordinates of four points.

point  x coordinate  y coordinate
p1     0             2
p2     2             0
p3     3             1
p4     5             1

Table2.9.EuclideandistancematrixforTable2.8 .

p1 p2 p3 p4

p1 0.0 2.8 3.2 5.1

p2 2.8 0.0 1.4 3.2

p3 3.2 1.4 0.0 2.0

p4 5.1 3.2 2.0 0.0

Table 2.10. L1 distance matrix for Table 2.8.

L1  p1   p2   p3   p4
p1  0.0  4.0  4.0  6.0
p2  4.0  0.0  2.0  4.0
p3  4.0  2.0  0.0  2.0
p4  6.0  4.0  2.0  0.0

Table 2.11. L∞ distance matrix for Table 2.8.

L∞  p1   p2   p3   p4
p1  0.0  2.0  3.0  5.0
p2  2.0  0.0  1.0  3.0
p3  3.0  1.0  0.0  2.0
p4  5.0  3.0  2.0  0.0
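The sketch below recomputes the three distance matrices in Tables 2.9 through 2.11 from the four points of Table 2.8 using the Minkowski distance of Equation 2.2; the helper function and its name are illustrative, not from the text.

```python
import numpy as np

points = np.array([[0, 2],   # p1
                   [2, 0],   # p2
                   [3, 1],   # p3
                   [5, 1]])  # p4

def minkowski_matrix(X, r):
    """Pairwise Minkowski distances of degree r between the rows of X."""
    diff = np.abs(X[:, None, :] - X[None, :, :])
    if np.isinf(r):
        return diff.max(axis=-1)                     # supremum (L_inf) distance
    return (diff ** r).sum(axis=-1) ** (1 / r)

print(minkowski_matrix(points, 2).round(1))          # Euclidean (L2), Table 2.9
print(minkowski_matrix(points, 1).round(1))          # city block (L1), Table 2.10
print(minkowski_matrix(points, np.inf).round(1))     # supremum (L_inf), Table 2.11
```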

Distances,suchastheEuclideandistance,havesomewell-knownproperties.Ifd(x,y)isthedistancebetweentwopoints,xandy,thenthefollowingpropertieshold.

1. Positivity
   (a) d(x, y) ≥ 0 for all x and y,
   (b) d(x, y) = 0 only if x = y.
2. Symmetry: d(x, y) = d(y, x) for all x and y.
3. Triangle Inequality: d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z.

Measures that satisfy all three properties are known as metrics. Some people use the term distance only for dissimilarity measures that satisfy these properties, but that practice is often violated. The three properties described here are useful, as well as mathematically pleasing. Also, if the triangle inequality holds, then this property can be used to increase the efficiency of techniques (including clustering) that depend on distances possessing this property. (See Exercise 25.) Nonetheless, many dissimilarities do not satisfy one or more of the metric properties. Example 2.14 illustrates such a measure.

Example 2.14 (Non-metric Dissimilarities: Set Differences). This example is based on the notion of the difference of two sets, as defined in set theory. Given two sets A and B, A - B is the set of elements of A that are not in B. For example, if A = {1, 2, 3, 4} and B = {2, 3, 4}, then A - B = {1} and B - A = ∅, the empty set. We can define the distance d between two sets A and B as d(A, B) = size(A - B), where size is a function returning the number of elements in a set. This distance measure, which is an integer value greater than or equal to 0, does not satisfy the second part of the positivity property, the symmetry property, or the triangle inequality. However, these properties can be made to hold if the dissimilarity measure is modified as follows: d(A, B) = size(A - B) + size(B - A). See Exercise 21 on page 110.

2.4.4SimilaritiesbetweenDataObjects

Forsimilarities,thetriangleinequality(ortheanalogousproperty)typicallydoesnothold,butsymmetryandpositivitytypicallydo.Tobeexplicit,ifs(x,y)isthesimilaritybetweenpointsxandy,thenthetypicalpropertiesofsimilaritiesarethefollowing:

1. s(x, y) = 1 only if x = y. (0 ≤ s ≤ 1)
2. s(x, y) = s(y, x) for all x and y. (Symmetry)

Thereisnogeneralanalogofthetriangleinequalityforsimilaritymeasures.Itissometimespossible,however,toshowthatasimilaritymeasurecaneasilybeconvertedtoametricdistance.ThecosineandJaccardsimilaritymeasures,whicharediscussedshortly,aretwoexamples.Also,forspecificsimilaritymeasures,itispossibletoderivemathematicalboundsonthesimilaritybetweentwoobjectsthataresimilarinspirittothetriangleinequality.

Example 2.15 (A Non-symmetric Similarity Measure). Consider an experiment in which people are asked to classify a small set of characters as they flash on a screen. The confusion matrix for this experiment records how often each character is classified as itself, and how often each is classified as another character. Using the confusion matrix, we can define a similarity measure between a character x and a character y as the number of times that x is misclassified as y, but note that this measure is not symmetric. For example, suppose that "0" appeared 200 times and was classified as a "0" 160 times, but as an "o" 40 times. Likewise, suppose that "o" appeared 200 times and was classified as an "o" 170 times, but as "0" only 30 times. Then, s(0, o) = 40, but s(o, 0) = 30.

In such situations, the similarity measure can be made symmetric by setting s′(x, y) = (s(x, y) + s(y, x))/2, where s′ indicates the new similarity measure.

2.4.5 Examples of Proximity Measures

This section provides specific examples of some similarity and dissimilarity measures.

Similarity Measures for Binary Data

Similarity measures between objects that contain only binary attributes are called similarity coefficients, and typically have values between 0 and 1. A value of 1 indicates that the two objects are completely similar, while a value of 0 indicates that the objects are not at all similar. There are many rationales for why one coefficient is better than another in specific instances.

Let x and y be two objects that consist of n binary attributes. The comparison of two such objects, i.e., two binary vectors, leads to the following four quantities (frequencies):

f00 = the number of attributes where x is 0 and y is 0
f01 = the number of attributes where x is 0 and y is 1
f10 = the number of attributes where x is 1 and y is 0
f11 = the number of attributes where x is 1 and y is 1

Simple Matching Coefficient

One commonly used similarity coefficient is the simple matching coefficient (SMC), which is defined as

SMC = (number of matching attribute values) / (number of attributes) = (f11 + f00) / (f01 + f10 + f11 + f00)    (2.4)

This measure counts both presences and absences equally. Consequently, the SMC could be used to find students who had answered questions similarly on a test that consisted only of true/false questions.

Jaccard Coefficient

Suppose that x and y are data objects that represent two rows (two transactions) of a transaction matrix (see Section 2.1.2). If each asymmetric binary attribute corresponds to an item in a store, then a 1 indicates that the item was purchased, while a 0 indicates that the product was not purchased. Because the number of products not purchased by any customer far outnumbers the number of products that were purchased, a similarity measure such as SMC would say that all transactions are very similar. As a result, the Jaccard coefficient is frequently used to handle objects consisting of asymmetric binary attributes. The Jaccard coefficient, which is often symbolized by J, is given by the following equation:

J = (number of matching presences) / (number of attributes not involved in 00 matches) = f11 / (f01 + f10 + f11)    (2.5)
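A minimal sketch (not from the text) that evaluates both coefficients on the two binary vectors used in Example 2.16 below:

```python
import numpy as np

x = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
y = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])

f11 = np.sum((x == 1) & (y == 1))
f00 = np.sum((x == 0) & (y == 0))
f10 = np.sum((x == 1) & (y == 0))
f01 = np.sum((x == 0) & (y == 1))

smc = (f11 + f00) / (f01 + f10 + f11 + f00)
jaccard = f11 / (f01 + f10 + f11)

print(smc)       # 0.7  -- dominated by the shared absences
print(jaccard)   # 0.0  -- no shared presences
```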

Example 2.16 (The SMC and Jaccard Similarity Coefficients). To illustrate the difference between these two similarity measures, we calculate SMC and J for the following two binary vectors.

x = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
y = (0, 0, 0, 0, 0, 0, 1, 0, 0, 1)

f01 = 2 (the number of attributes where x was 0 and y was 1)
f10 = 1 (the number of attributes where x was 1 and y was 0)
f00 = 7 (the number of attributes where x was 0 and y was 0)
f11 = 0 (the number of attributes where x was 1 and y was 1)

SMC = (f11 + f00) / (f01 + f10 + f11 + f00) = (0 + 7) / (2 + 1 + 0 + 7) = 0.7

J = f11 / (f01 + f10 + f11) = 0 / (2 + 1 + 0) = 0

Cosine Similarity

Documents are often represented as vectors, where each component (attribute) represents the frequency with which a particular term (word) occurs in the document. Even though documents have thousands or tens of thousands of attributes (terms), each document is sparse since it has relatively few nonzero attributes. Thus, as with transaction data, similarity should not depend on the number of shared 0 values because any two documents are likely to "not contain" many of the same words, and therefore, if 0–0 matches are counted, most documents will be highly similar to most other documents. Therefore, a similarity measure for documents needs to ignore 0–0 matches like the Jaccard measure, but also must be able to handle non-binary vectors. The cosine similarity, defined next, is one of the most common measures of document similarity. If x and y are two document vectors, then

cos(x, y) = ⟨x, y⟩ / (‖x‖ ‖y‖) = x′y / (‖x‖ ‖y‖),    (2.6)

where ′ indicates vector or matrix transpose and ⟨x, y⟩ indicates the inner product of the two vectors,

⟨x, y⟩ = \sum_{k=1}^{n} x_k y_k = x′y,    (2.7)

and ‖x‖ is the length of the vector x, ‖x‖ = \sqrt{\sum_{k=1}^{n} x_k^2} = \sqrt{⟨x, x⟩} = \sqrt{x′x}.

The inner product of two vectors works well for asymmetric attributes since it depends only on components that are non-zero in both vectors. Hence, the similarity between two documents depends only upon the words that appear in both of them.

Example 2.17 (Cosine Similarity between Two Document Vectors). This example calculates the cosine similarity for the following two data objects, which might represent document vectors:

x = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
y = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)

⟨x, y⟩ = 3×1 + 2×0 + 0×0 + 5×0 + 0×0 + 0×0 + 0×0 + 2×1 + 0×0 + 0×2 = 5
‖x‖ = \sqrt{3×3 + 2×2 + 0×0 + 5×5 + 0×0 + 0×0 + 0×0 + 2×2 + 0×0 + 0×0} = 6.48
‖y‖ = \sqrt{1×1 + 0×0 + 0×0 + 0×0 + 0×0 + 0×0 + 0×0 + 1×1 + 0×0 + 2×2} = 2.45
cos(x, y) = 5 / (6.48 × 2.45) = 0.31

As indicated by Figure 2.16, cosine similarity really is a measure of the (cosine of the) angle between x and y. Thus, if the cosine similarity is 1, the angle between x and y is 0°, and x and y are the same except for length. If the cosine similarity is 0, then the angle between x and y is 90°, and they do not share any terms (words).

Figure2.16.Geometricillustrationofthecosinemeasure.

Equation2.6 alsocanbewrittenasEquation2.8 .

where and Dividingxandybytheirlengthsnormalizesthemtohavealengthof1.Thismeansthatcosinesimilaritydoesnottakethelengthofthetwodataobjectsintoaccountwhencomputingsimilarity.(Euclideandistancemightbeabetterchoicewhenlengthisimportant.)Forvectorswithalengthof1,thecosinemeasurecanbecalculatedbytakingasimpleinnerproduct.Consequently,whenmanycosinesimilaritiesbetweenobjectsarebeingcomputed,normalizingtheobjectstohaveunitlengthcanreducethetimerequired.

ExtendedJaccardCoefficient(TanimotoCoefficient)TheextendedJaccardcoefficientcanbeusedfordocumentdataandthatreducestotheJaccardcoefficientinthecaseofbinaryattributes.Thiscoefficient,whichweshallrepresentasEJ,isdefinedbythefollowingequation:

cos(x,y)=⟨x∥x∥,y∥y∥⟩=⟨x′,y′⟩, (2.8)

x′=x/∥x∥ y′=y/∥y∥.

EJ(x,y)=⟨x,y⟩ǁxǁ2+ǁyǁ2−⟨x,y⟩=x′yǁxǁ2+ǁyǁ2−x′y. (2.9)
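As a quick illustration, the sketch below (my own code, assuming NumPy is available) computes the cosine similarity of Example 2.17 and the extended Jaccard coefficient of Equation 2.9 for the same document vectors:

import numpy as np

def cosine(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def extended_jaccard(x, y):
    dot = np.dot(x, y)
    return dot / (np.dot(x, x) + np.dot(y, y) - dot)   # Equation 2.9

x = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0])
y = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2])
print(round(cosine(x, y), 2))            # about 0.31, as in Example 2.17
print(round(extended_jaccard(x, y), 3))  # 5 / (42 + 6 - 5), about 0.116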

Correlation Correlation is frequently used to measure the linear relationship between two sets of values that are observed together. Thus, correlation can measure the relationship between two variables (height and weight) or between two objects (a pair of temperature time series). Correlation is used much more frequently to measure the similarity between attributes since the values in two data objects come from different attributes, which can have very different attribute types and scales. There are many types of correlation, and indeed correlation is sometimes used in a general sense to mean the relationship between two sets of values that are observed together. In this discussion, we will focus on a measure appropriate for numerical values.

Specifically, Pearson's correlation between two sets of numerical values, i.e., two vectors, x and y, is defined by the following equation:

corr(x, y) = covariance(x, y) / (standard_deviation(x) × standard_deviation(y)) = s_xy / (s_x s_y)   (2.10)

where we use the following standard statistical notation and definitions:

covariance(x, y) = s_xy = (1/(n−1)) Σ_{k=1}^{n} (x_k − x̄)(y_k − ȳ)   (2.11)

standard_deviation(x) = s_x = sqrt( (1/(n−1)) Σ_{k=1}^{n} (x_k − x̄)² )

standard_deviation(y) = s_y = sqrt( (1/(n−1)) Σ_{k=1}^{n} (y_k − ȳ)² )

x̄ = (1/n) Σ_{k=1}^{n} x_k is the mean of x

ȳ = (1/n) Σ_{k=1}^{n} y_k is the mean of y

Example 2.18 (Perfect Correlation). Correlation is always in the range −1 to 1. A correlation of 1 (−1) means that x and y have a perfect positive (negative) linear relationship; that is, x_k = a y_k + b, where a and b are constants. The following two vectors x and y illustrate cases where the correlation is −1 and +1, respectively. In the first case, the means of x and y were chosen to be 0, for simplicity.

x = (−3, 6, 0, 3, −6), y = (1, −2, 0, −1, 2), corr(x, y) = −1, x_k = −3 y_k
x = (3, 6, 0, 3, 6), y = (1, 2, 0, 1, 2), corr(x, y) = 1, x_k = 3 y_k

Example 2.19 (Nonlinear Relationships). If the correlation is 0, then there is no linear relationship between the two sets of values. However, nonlinear relationships can still exist. In the following example, y_k = x_k², but their correlation is 0.

x = (−3, −2, −1, 0, 1, 2, 3)
y = (9, 4, 1, 0, 1, 4, 9)

Example 2.20 (Visualizing Correlation). It is also easy to judge the correlation between two vectors x and y by plotting pairs of corresponding values of x and y in a scatter plot. Figure 2.17 shows a number of these scatter plots when x and y consist of a set of 30 pairs of values that are randomly generated (with a normal distribution) so that the correlation of x and y ranges from −1 to 1. Each circle in a plot represents one of the 30 pairs of x and y values; its x coordinate is the value of that pair for x, while its y coordinate is the value of the same pair for y.

Figure 2.17. Scatter plots illustrating correlations from −1 to 1.
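A small sketch of Equations 2.10 and 2.11, checked against Examples 2.18 and 2.19 (the helper function and its name are mine):

import numpy as np

def pearson(x, y):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    sxy = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)   # covariance, Equation 2.11
    return sxy / (x.std(ddof=1) * y.std(ddof=1))                   # Equation 2.10

print(pearson([-3, 6, 0, 3, -6], [1, -2, 0, -1, 2]))              # -1.0, perfect negative
print(pearson([3, 6, 0, 3, 6], [1, 2, 0, 1, 2]))                  #  1.0, perfect positive
print(pearson([-3, -2, -1, 0, 1, 2, 3], [9, 4, 1, 0, 1, 4, 9]))   #  0.0, even though y = x**2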

If we transform x and y by subtracting off their means and then normalizing them so that their lengths are 1, then their correlation can be calculated by taking the dot product. Let us refer to these transformed vectors of x and y as x′ and y′, respectively. (Notice that this transformation is not the same as the standardization used in other contexts, where we subtract the means and divide by the standard deviations, as discussed in Section 2.3.7.) This transformation highlights an interesting relationship between the correlation measure and the cosine measure. Specifically, the correlation between x and y is identical to the cosine between x′ and y′. However, the cosine between x and y is not the same as the cosine between x′ and y′, even though they both have the same correlation measure. In general, the correlation between two vectors is equal to the cosine measure only in the special case when the means of the two vectors are 0.

Differences Among Measures For Continuous Attributes In this section, we illustrate the difference among the three proximity measures for continuous attributes that we have just defined: cosine, correlation, and Minkowski distance. Specifically, we consider two types of data transformations that are commonly used, namely, scaling (multiplication) by a constant factor and translation (addition) by a constant value. A proximity measure is considered to be invariant to a data transformation if its value remains unchanged even after performing the transformation. Table 2.12 compares the behavior of cosine, correlation, and Minkowski distance measures regarding their invariance to scaling and translation operations. It can be seen that while correlation is invariant to both scaling and translation, cosine is only invariant to scaling but not to translation. Minkowski distance measures, on the other hand, are sensitive to both scaling and translation and are thus invariant to neither.

Table 2.12. Properties of cosine, correlation, and Minkowski distance measures.

Property  Cosine  Correlation  Minkowski Distance
Invariant to scaling (multiplication)  Yes  Yes  No
Invariant to translation (addition)  No  Yes  No
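The invariance properties in Table 2.12 are easy to verify numerically. The sketch below uses two small vectors of my own choosing (not from the text) and applies a scaling and a translation to one of them:

import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

x = np.array([1.0, 2.0, 4.0, 3.0])
y = np.array([2.0, 3.0, 5.0, 1.0])

for label, v in [("original y", y), ("scaled 2*y", 2 * y), ("translated y+5", y + 5)]:
    print(label,
          round(cosine(x, v), 4),              # changes only under translation
          round(np.corrcoef(x, v)[0, 1], 4),   # unchanged by scaling and translation
          round(np.linalg.norm(x - v), 4))     # changes under both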

Let us consider an example to demonstrate the significance of these differences among different proximity measures.

Example 2.21 (Comparing proximity measures). Consider the following two vectors x and y with seven numeric attributes:

x = (1, 2, 4, 3, 0, 0, 0)
y = (1, 2, 3, 4, 0, 0, 0)

It can be seen that both x and y have 4 non-zero values, and the values in the two vectors are mostly the same, except for the third and the fourth components. The cosine, correlation, and Euclidean distance between the two vectors can be computed as follows:

cos(x, y) = 29 / (√30 × √30) = 0.9667
correlation(x, y) = 2.3571 / (1.5811 × 1.5811) = 0.9429
Euclidean distance ∥x − y∥ = 1.4142

Not surprisingly, x and y have a cosine and correlation measure close to 1, while the Euclidean distance between them is small, indicating that they are quite similar. Now let us consider the vector ys, which is a scaled version of y (multiplied by a constant factor of 2), and the vector yt, which is constructed by translating y by 5 units, as follows:

ys = 2 × y = (2, 4, 6, 8, 0, 0, 0)
yt = y + 5 = (6, 7, 8, 9, 5, 5, 5)

We are interested in finding whether ys and yt show the same proximity with x as shown by the original vector y. Table 2.13 shows the different measures of proximity computed for the pairs (x, y), (x, ys), and (x, yt). It can be seen that the value of correlation between x and y remains unchanged even after replacing y with ys or yt. However, the value of cosine remains equal to 0.9667 when computed for (x, y) and (x, ys), but significantly reduces to 0.7940 when computed for (x, yt). This highlights the fact that cosine is invariant to the scaling operation but not to the translation operation, in contrast with the correlation measure. The Euclidean distance, on the other hand, shows different values for all three pairs of vectors, as it is sensitive to both scaling and translation.

Table 2.13. Similarity between (x, y), (x, ys), and (x, yt).

Measure  (x, y)  (x, ys)  (x, yt)
Cosine  0.9667  0.9667  0.7940
Correlation  0.9429  0.9429  0.9429
Euclidean Distance  1.4142  5.8310  14.2127

We can observe from this example that different proximity measures behave differently when scaling or translation operations are applied on the data. The choice of the right proximity measure thus depends on the desired notion of similarity between data objects that is meaningful for a given application. For example, if x and y represented the frequencies of different words in a document-term matrix, it would be meaningful to use a proximity measure that remains unchanged when y is replaced by ys, because ys is just a scaled version of y with the same distribution of words occurring in the document. However, yt is different from y, since it contains a large number of words with non-zero frequencies that do not occur in y. Because cosine is invariant to scaling but not to translation, it will be an ideal choice of proximity measure for this application.

Consider a different scenario in which x represents a location's temperature measured on the Celsius scale for seven days. Let y, ys, and yt be the temperatures measured on those days at a different location, but using three different measurement scales. Note that different units of temperature have different offsets (e.g., Celsius and Kelvin) and different scaling factors (e.g., Celsius and Fahrenheit). It is thus desirable to use a proximity measure that captures the proximity between temperature values without being affected by the measurement scale. Correlation would then be the ideal choice of proximity measure for this application, as it is invariant to both scaling and translation.

As another example, consider a scenario where x represents the amount of precipitation (in cm) measured at seven locations. Let y, ys, and yt be estimates of the precipitation at these locations, which are predicted using three different models. Ideally, we would like to choose a model that accurately reconstructs the measurements in x without making any error. It is evident that y provides a good approximation of the values in x, whereas ys and yt provide poor estimates of precipitation, even though they do capture the trend in precipitation across locations. Hence, we need to choose a proximity measure that penalizes any difference in the model estimates from the actual observations, and is sensitive to both the scaling and translation operations. The Euclidean distance satisfies this property and thus would be the right choice of proximity measure for this application. Indeed, the Euclidean distance is commonly used in computing the accuracy of models, which will be discussed later in Chapter 3.

2.4.6 Mutual Information

Mutual information is another measure of the similarity between two sets of paired values; like correlation, it is sometimes used as an alternative to correlation, particularly when a nonlinear relationship is suspected between the pairs of values. This measure comes from information theory, which is the

study of how to formally define and quantify information. Indeed, mutual information is a measure of how much information one set of values provides about another, given that the values come in pairs, e.g., height and weight. If the two sets of values are independent, i.e., the value of one tells us nothing about the other, then their mutual information is 0. On the other hand, if the two sets of values are completely dependent, i.e., knowing the value of one tells us the value of the other and vice-versa, then they have maximum mutual information. Mutual information does not have a maximum value, but we will define a normalized version of it that ranges between 0 and 1.

To define mutual information, we consider two sets of values, X and Y, which occur in pairs (X, Y). We need to measure the average information in a single set of values, i.e., either in X or in Y, and in the pairs of their values. This is commonly measured by entropy. More specifically, assume X and Y are discrete, that is, X can take m distinct values, u1, u2, ..., um, and Y can take n distinct values, v1, v2, ..., vn. Then their individual and joint entropy can be defined in terms of the probabilities of each value and pair of values as follows:

H(X) = −Σ_{j=1}^{m} P(X = u_j) log₂ P(X = u_j)   (2.12)

H(Y) = −Σ_{k=1}^{n} P(Y = v_k) log₂ P(Y = v_k)   (2.13)

H(X, Y) = −Σ_{j=1}^{m} Σ_{k=1}^{n} P(X = u_j, Y = v_k) log₂ P(X = u_j, Y = v_k)   (2.14)

where if the probability of a value or combination of values is 0, then 0 log₂(0) is conventionally taken to be 0.

The mutual information of X and Y can now be defined straightforwardly:

I(X, Y) = H(X) + H(Y) − H(X, Y)   (2.15)

Note that H(X, Y) is symmetric, i.e., H(X, Y) = H(Y, X), and thus mutual information is also symmetric, i.e., I(X, Y) = I(Y, X).

Practically, X and Y are either the values in two attributes or two rows of the same data set. In Example 2.22, we will represent those values as two vectors x and y and calculate the probability of each value or pair of values from the frequency with which values or pairs of values occur in x, y, and the pairs (x_i, y_i), where x_i is the ith component of x and y_i is the ith component of y. Let us illustrate using a previous example.

Example 2.22 (Evaluating Nonlinear Relationships with Mutual Information). Recall Example 2.19 where y_k = x_k², but their correlation was 0.

x = (−3, −2, −1, 0, 1, 2, 3)
y = (9, 4, 1, 0, 1, 4, 9)

From Tables 2.14, 2.15, and 2.16, I(x, y) = H(x) + H(y) − H(x, y) = 2.8074 + 1.9502 − 2.8074 = 1.9502. Although a variety of approaches to normalize mutual information are possible—see Bibliographic Notes—for this example, we will apply one that divides the mutual information by log₂(min(m, n)) and produces a result between 0 and 1. This yields a value of 1.9502/log₂(4) = 0.9751. Thus, we can see that x and y are strongly related. They are not perfectly related because, given a value of y, there is, except for y = 0, some ambiguity about the value of x. Notice that for y = −x, the normalized mutual information would be 1.

Figure 2.18. Computation of mutual information.

Table 2.14. Entropy for x

x_j  P(x = x_j)  −P(x = x_j) log₂ P(x = x_j)
−3  1/7  0.4011
−2  1/7  0.4011
−1  1/7  0.4011
0  1/7  0.4011
1  1/7  0.4011
2  1/7  0.4011
3  1/7  0.4011
H(x)  2.8074

Table 2.15. Entropy for y

y_k  P(y = y_k)  −P(y = y_k) log₂ P(y = y_k)
9  2/7  0.5164
4  2/7  0.5164
1  2/7  0.5164
0  1/7  0.4011
H(y)  1.9502

Table 2.16. Joint entropy for x and y

x_j  y_k  P(x = x_j, y = y_k)  −P(x = x_j, y = y_k) log₂ P(x = x_j, y = y_k)
−3  9  1/7  0.4011
−2  4  1/7  0.4011
−1  1  1/7  0.4011
0  0  1/7  0.4011
1  1  1/7  0.4011
2  4  1/7  0.4011
3  9  1/7  0.4011
H(x, y)  2.8074
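A compact sketch (helper names are mine) of Equations 2.12–2.15 and the normalization used in Example 2.22:

import numpy as np
from collections import Counter

def entropy(values):
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * np.log2(c / n) for c in counts.values())

x = [-3, -2, -1, 0, 1, 2, 3]
y = [9, 4, 1, 0, 1, 4, 9]            # y_k = x_k ** 2

hx, hy = entropy(x), entropy(y)
hxy = entropy(list(zip(x, y)))        # joint entropy computed from the pairs
mi = hx + hy - hxy                    # Equation 2.15
nmi = mi / np.log2(min(len(set(x)), len(set(y))))
print(round(mi, 4), round(nmi, 4))    # 1.9502 and 0.9751, as in the example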

2.4.7 Kernel Functions*

It is easy to understand how similarity and distance might be useful in an application such as clustering, which tries to group similar objects together. What is much less obvious is that many other data analysis tasks, including predictive modeling and dimensionality reduction, can be expressed in terms of pairwise "proximities" of data objects. More specifically, many data analysis problems can be mathematically formulated to take as input a kernel matrix, K, which can be considered a type of proximity matrix. Thus, an initial preprocessing step is used to convert the input data into a kernel matrix, which is the input to the data analysis algorithm.

More formally, if a data set has m data objects, then K is an m by m matrix. If x_i and x_j are the ith and jth data objects, respectively, then k_ij, the ijth entry of K, is computed by a kernel function:

k_ij = κ(x_i, x_j)   (2.16)

As we will see in the material that follows, the use of a kernel matrix allows both wider applicability of an algorithm to various kinds of data and an ability to model nonlinear relationships with algorithms that are designed only for detecting linear relationships.

Kernels make an algorithm data independent

If an algorithm uses a kernel matrix, then it can be used with any type of data for which a kernel function can be designed. This is illustrated by Algorithm 2.1. Although only some data analysis algorithms can be modified to use a kernel matrix as input, this approach is extremely powerful because it allows such an algorithm to be used with almost any type of data for which an appropriate kernel function can be defined. Thus, a classification algorithm can be used, for example, with record data, string data, or graph data. If an algorithm can be reformulated to use a kernel matrix, then its applicability to different types of data increases dramatically. As we will see in later chapters, many clustering, classification, and anomaly detection algorithms work only with similarities or distances, and thus, can be easily modified to work with kernels.

Algorithm 2.1 Basic kernel algorithm.
1. Read in the m data objects in the data set.
2. Compute the kernel matrix, K, by applying the kernel function, κ, to each pair of data objects.
3. Run the data analysis algorithm with K as input.
4. Return the analysis result, e.g., predicted class or cluster labels.
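Below is a minimal sketch of step 2 of Algorithm 2.1; the data and the particular kernel used here (a polynomial kernel, defined formally in Equation 2.19 below) are illustrative choices of mine:

import numpy as np

def kernel_matrix(data, kernel):
    m = len(data)
    K = np.empty((m, m))
    for i in range(m):
        for j in range(m):
            K[i, j] = kernel(data[i], data[j])   # k_ij = kappa(x_i, x_j), Equation 2.16
    return K

poly_kernel = lambda x, y, c=1.0, d=2: (np.dot(x, y) + c) ** d

data = np.array([[1.0, 2.0], [0.0, 1.0], [2.0, 0.5]])
K = kernel_matrix(data, poly_kernel)
print(K)   # a 3-by-3 symmetric kernel matrix, ready to hand to a kernel-based algorithm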

Mapping data into a higher dimensional data space can allow modeling of nonlinear relationships

There is yet another, equally important, aspect of kernel-based data algorithms—their ability to model nonlinear relationships with algorithms that model only linear relationships. Typically, this works by first transforming (mapping) the data from a lower dimensional data space to a higher dimensional space.

Example 2.23 (Mapping Data to a Higher Dimensional Space). Consider the relationship between two variables x and y given by the following equation, which defines an ellipse in two dimensions (Figure 2.19(a)):

4x² + 9xy + 7y² = 10   (2.17)

Figure 2.19. Mapping data to a higher dimensional space: two to three dimensions.

We can map our two dimensional data to three dimensions by creating three new variables, u, v, and w, which are defined as follows:

u = x²
v = xy
w = y²

As a result, we can now express Equation 2.17 as a linear one:

4u + 9v + 7w = 10   (2.18)

This equation describes a plane in three dimensions. Points on the ellipse will lie on that plane, while points inside and outside the ellipse will lie on opposite sides of the plane. See Figure 2.19(b). The viewpoint of this 3D plot is along the surface of the separating plane so that the plane appears as a line.

The Kernel Trick

The approach illustrated above shows the value in mapping data to a higher dimensional space, an operation that is integral to kernel-based methods. Conceptually, we first define a function φ that maps data points x and y to data points φ(x) and φ(y) in a higher dimensional space such that the inner product ⟨φ(x), φ(y)⟩ gives the desired measure of the proximity of x and y. It may seem that we have potentially sacrificed a great deal by using such an approach, because we can greatly expand the size of our data, increase the computational complexity of our analysis, and encounter problems with the curse of dimensionality by computing similarity in a high-dimensional space. However, this is not the case, since these problems can be avoided by defining a kernel function κ that can compute the same similarity value, but with the data points in the original space, i.e., κ(x, y) = ⟨φ(x), φ(y)⟩. This is known as the kernel trick. Despite the name, the kernel trick has a very solid

mathematical foundation and is a remarkably powerful approach for data analysis.

Not every function of a pair of data objects satisfies the properties needed for a kernel function, but it has been possible to design many useful kernels for a wide variety of data types. For example, three common kernel functions are the polynomial, Gaussian (radial basis function (RBF)), and sigmoid kernels. If x and y are two data objects, specifically, two data vectors, then these kernel functions can be expressed as follows, respectively:

κ(x, y) = (x′y + c)^d   (2.19)

κ(x, y) = exp(−∥x − y∥² / (2σ²))   (2.20)

κ(x, y) = tanh(αx′y + c)   (2.21)

where α and c ≥ 0 are constants, d is an integer parameter that gives the polynomial degree, ∥x − y∥ is the length of the vector x − y, and σ > 0 is a parameter that governs the "spread" of a Gaussian.

Example 2.24 (The Polynomial Kernel). Note that the kernel functions presented above compute the same similarity value as would be computed if we actually mapped the data to a higher dimensional space and then computed an inner product there. For example, for the polynomial kernel of degree 2, let φ be the function that maps a two-dimensional data vector x = (x1, x2) to the higher dimensional space. Specifically, let

φ(x) = (x1², x2², √2 x1x2, √(2c) x1, √(2c) x2, c).   (2.22)

For the higher dimensional space, let the proximity be defined as the inner product of φ(x) and φ(y), i.e., ⟨φ(x), φ(y)⟩. Then, as previously mentioned, it can be shown that

κ(x, y) = ⟨φ(x), φ(y)⟩   (2.23)

where κ is defined by Equation 2.19 above. Specifically, if x = (x1, x2) and y = (y1, y2), then

κ(x, y) = ⟨φ(x), φ(y)⟩ = x1²y1² + x2²y2² + 2x1x2y1y2 + 2c x1y1 + 2c x2y2 + c².   (2.24)

More generally, the kernel trick depends on defining κ and φ so that Equation 2.23 holds. This has been done for a wide variety of kernels.

This discussion of kernel-based approaches was intended only to provide a brief introduction to this topic and has omitted many details. A fuller discussion of the kernel-based approach is provided in Section 4.9.4, which discusses these issues in the context of nonlinear support vector machines for classification. More general references for kernel-based analysis can be found in the Bibliographic Notes of this chapter.
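The following quick numerical check (my own code, following Example 2.24 with d = 2 and an arbitrary c) confirms that the polynomial kernel and the inner product in the mapped space of Equation 2.22 agree:

import numpy as np

c = 1.0

def kappa(x, y):                        # Equation 2.19 with d = 2
    return (np.dot(x, y) + c) ** 2

def phi(x):                             # Equation 2.22
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * c) * x1, np.sqrt(2 * c) * x2, c])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
print(kappa(x, y), np.dot(phi(x), phi(y)))   # both give the same value (4.0 for these vectors)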

2.4.8 Bregman Divergence*

This section provides a brief description of Bregman divergences, which are a family of proximity functions that share some common properties. As a result, it is possible to construct general data mining algorithms, such as clustering algorithms, that work with any Bregman divergence. A concrete example is the K-means clustering algorithm (Section 7.2). Note that this section requires knowledge of vector calculus.

Bregman divergences are loss or distortion functions. To understand the idea of a loss function, consider the following. Let x and y be two points, where y is regarded as the original point and x is some distortion or approximation of it. For example, x may be a point that was generated by adding random noise to y. The goal is to measure the resulting distortion or loss that results if y is approximated by x. Of course, the more similar x and y are, the smaller the loss or distortion. Thus, Bregman divergences can be used as dissimilarity functions.

More formally, we have the following definition.

Definition 2.6 (Bregman Divergence). Given a strictly convex function ϕ (with a few modest restrictions that are generally satisfied), the Bregman divergence (loss function) D(x, y) generated by that function is given by the following equation:

D(x, y) = ϕ(x) − ϕ(y) − ⟨∇ϕ(y), (x − y)⟩   (2.25)

where ∇ϕ(y) is the gradient of ϕ evaluated at y, x − y is the vector difference between x and y, and ⟨∇ϕ(y), (x − y)⟩ is the inner product between ∇ϕ(y) and (x − y). For points in Euclidean space, the inner product is just the dot product.

D(x, y) can be written as D(x, y) = ϕ(x) − L(x), where L(x) = ϕ(y) + ⟨∇ϕ(y), (x − y)⟩ represents the equation of a plane that is tangent to the function ϕ at y. Using calculus terminology, L(x) is the linearization of ϕ around the point y, and the Bregman divergence is just the difference between a function and a linear approximation to that function. Different Bregman divergences are obtained by using different choices for ϕ.

Example 2.25. We provide a concrete example using squared Euclidean distance, but restrict ourselves to one dimension to simplify the mathematics. Let x and y be real numbers and ϕ(t) be the real-valued function, ϕ(t) = t². In that case, the gradient reduces to the derivative, and the dot product reduces to multiplication. Specifically, Equation 2.25 becomes Equation 2.26:

D(x, y) = x² − y² − 2y(x − y) = (x − y)²   (2.26)

The graph for this example, with y = 1, is shown in Figure 2.20. The Bregman divergence is shown for two values of x: x = 2 and x = 3.

Figure 2.20. Illustration of Bregman divergence.
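A tiny sketch of Equation 2.25 for the one-dimensional case of Example 2.25, where the divergence reduces to squared distance (the function names are mine):

def bregman(x, y, phi, grad_phi):
    return phi(x) - phi(y) - grad_phi(y) * (x - y)   # Equation 2.25 in one dimension

phi = lambda t: t ** 2
grad_phi = lambda t: 2 * t

y = 1.0
for x in (2.0, 3.0):                                  # the two x values shown in Figure 2.20
    print(x, bregman(x, y, phi, grad_phi), (x - y) ** 2)   # both columns agree: 1.0 and 4.0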

2.4.9 Issues in Proximity Calculation

This section discusses several important issues related to proximity measures: (1) how to handle the case in which attributes have different scales and/or are correlated, (2) how to calculate proximity between objects that are composed of different types of attributes, e.g., quantitative and qualitative, and (3) how to handle proximity calculations when attributes have different weights, i.e., when not all attributes contribute equally to the proximity of objects.

Standardization and Correlation for Distance Measures

An important issue with distance measures is how to handle the situation when attributes do not have the same range of values. (This situation is often described by saying that "the variables have different scales.") In a previous example, Euclidean distance was used to measure the distance between people based on two attributes: age and income. Unless these two attributes are standardized, the distance between two people will be dominated by income.

A related issue is how to compute distance when there is correlation between some of the attributes, perhaps in addition to differences in the ranges of values. A generalization of Euclidean distance, the Mahalanobis distance, is useful when attributes are correlated, have different ranges of values (different variances), and the distribution of the data is approximately Gaussian (normal). Correlated variables have a large impact on standard distance measures since a change in any of the correlated variables is reflected in a change in all the correlated variables. Specifically, the Mahalanobis distance between two objects (vectors) x and y is defined as

Mahalanobis(x, y) = (x − y)′ Σ⁻¹ (x − y),   (2.27)

where Σ⁻¹ is the inverse of the covariance matrix of the data. Note that the covariance matrix Σ is the matrix whose ijth entry is the covariance of the ith and jth attributes as defined by Equation 2.11.

Example 2.26. In Figure 2.21, there are 1000 points, whose x and y attributes have a correlation of 0.6. The distance between the two large points at the opposite ends of the long axis of the ellipse is 14.7 in terms of Euclidean distance, but only 6 with respect to Mahalanobis distance. This is because the Mahalanobis distance gives less emphasis to the direction of largest variance. In practice, computing the Mahalanobis distance is expensive, but can be worthwhile for data whose attributes are correlated. If the attributes are relatively uncorrelated, but have different ranges, then standardizing the variables is sufficient.

Figure 2.21. Set of two-dimensional points. The Mahalanobis distance between the two points represented by large dots is 6; their Euclidean distance is 14.7.
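A short sketch of Equation 2.27 on synthetic correlated data, roughly in the spirit of Figure 2.21 (the data generation and helper function are my own assumptions; Σ⁻¹ is estimated from the sample covariance, and, following Equation 2.27, no square root is taken):

import numpy as np

rng = np.random.default_rng(0)
data = rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 1.0]], size=1000)

def mahalanobis(x, y, data):
    inv_cov = np.linalg.inv(np.cov(data, rowvar=False))   # Sigma^{-1} estimated from the data
    d = x - y
    return d @ inv_cov @ d                                 # Equation 2.27

x, y = np.array([2.0, 2.0]), np.array([-2.0, -2.0])        # two points along the long axis
print(np.linalg.norm(x - y) ** 2, mahalanobis(x, y, data)) # the Mahalanobis value is smaller,
                                                           # discounting the direction of largest variance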

CombiningSimilaritiesforHeterogeneousAttributes

Thepreviousdefinitionsofsimilaritywerebasedonapproachesthatassumedalltheattributeswereofthesametype.Ageneralapproachisneededwhentheattributesareofdifferenttypes.OnestraightforwardapproachistocomputethesimilaritybetweeneachattributeseparatelyusingTable2.7 ,andthencombinethesesimilaritiesusingamethodthatresultsinasimilaritybetween0and1.Onepossibleapproachistodefinetheoverallsimilarityastheaverageofalltheindividualattributesimilarities.Unfortunately,thisapproachdoesnotworkwellifsomeoftheattributesareasymmetricattributes.Forexample,ifalltheattributesareasymmetricbinaryattributes,thenthesimilaritymeasuresuggestedpreviouslyreducestothesimplematchingcoefficient,ameasurethatisnotappropriateforasymmetricbinaryattributes.Theeasiestwaytofixthisproblemistoomitasymmetricattributesfromthesimilaritycalculationwhentheirvaluesare0forbothoftheobjectswhosesimilarityisbeingcomputed.Asimilarapproachalsoworkswellforhandlingmissingvalues.

In summary, Algorithm 2.2 is effective for computing an overall similarity between two objects, x and y, with different types of attributes. This procedure can be easily modified to work with dissimilarities.

Algorithm 2.2 Similarities of heterogeneous objects.
1: For the kth attribute, compute a similarity, s_k(x, y), in the range [0, 1].
2: Define an indicator variable, δ_k, for the kth attribute as follows:
   δ_k = 0 if the kth attribute is an asymmetric attribute and both objects have a value of 0, or if one of the objects has a missing value for the kth attribute
   δ_k = 1 otherwise
3: Compute the overall similarity between the two objects using the following formula:

similarity(x, y) = Σ_{k=1}^{n} δ_k s_k(x, y) / Σ_{k=1}^{n} δ_k   (2.28)

Using Weights

In much of the previous discussion, all attributes were treated equally when computing proximity. This is not desirable when some attributes are more important to the definition of proximity than others. To address these situations, the formulas for proximity can be modified by weighting the contribution of each attribute.

With attribute weights, w_k, (2.28) becomes

similarity(x, y) = Σ_{k=1}^{n} w_k δ_k s_k(x, y) / Σ_{k=1}^{n} w_k δ_k.   (2.29)

The definition of the Minkowski distance can also be modified as follows:

d(x, y) = ( Σ_{k=1}^{n} w_k |x_k − y_k|^r )^{1/r}.   (2.30)
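A rough Python sketch of Algorithm 2.2, including the optional weights of Equation 2.29; the per-attribute similarity functions and the asymmetric-attribute flags are assumptions supplied by the caller, and None is used here to stand in for a missing value:

def overall_similarity(x, y, attr_sims, asymmetric, weights=None):
    # attr_sims[k](a, b) -> similarity of the kth attribute values, in [0, 1]
    n = len(x)
    weights = weights or [1.0] * n
    num = den = 0.0
    for k in range(n):
        if (asymmetric[k] and x[k] == 0 and y[k] == 0) or x[k] is None or y[k] is None:
            continue                                     # delta_k = 0: attribute is ignored
        num += weights[k] * attr_sims[k](x[k], y[k])     # delta_k = 1
        den += weights[k]
    return num / den if den > 0 else 0.0

same = lambda a, b: 1.0 if a == b else 0.0
x = [1, 0, "red"]
y = [1, 0, "blue"]
print(overall_similarity(x, y, [same, same, same], asymmetric=[True, True, False]))   # 0.5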

2.4.10 Selecting the Right Proximity Measure

A few general observations may be helpful. First, the type of proximity measure should fit the type of data. For many types of dense, continuous data, metric distance measures such as Euclidean distance are often used. Proximity between continuous attributes is most often expressed in terms of differences, and distance measures provide a well-defined way of combining these differences into an overall proximity measure. Although attributes can have different scales and be of differing importance, these issues can often be dealt with as described earlier, such as normalization and weighting of attributes.

Forsparsedata,whichoftenconsistsofasymmetricattributes,wetypicallyemploysimilaritymeasuresthatignore0–0matches.Conceptually,thisreflectsthefactthat,forapairofcomplexobjects,similaritydependsonthenumberofcharacteristicstheybothshare,ratherthanthenumberofcharacteristicstheybothlack.Thecosine,Jaccard,andextendedJaccardmeasuresareappropriateforsuchdata.

Thereareothercharacteristicsofdatavectorsthatoftenneedtobeconsidered.Invariancetoscaling(multiplication)andtotranslation(addition)werepreviouslydiscussedwithrespecttoEuclideandistanceandthecosineandcorrelationmeasures.Thepracticalimplicationsofsuchconsiderationsarethat,forexample,cosineismoresuitableforsparsedocumentdatawhereonlyscalingisimportant,whilecorrelationworksbetterfortimeseries,wherebothscalingandtranslationareimportant.EuclideandistanceorothertypesofMinkowskidistancearemostappropriatewhentwodatavectorsaretomatchascloselyaspossibleacrossallcomponents(features).

Insomecases,transformationornormalizationofthedataisneededtoobtainapropersimilaritymeasure.Forinstance,timeseriescanhavetrendsorperiodicpatternsthatsignificantlyimpactsimilarity.Also,apropercomputationofsimilarityoftenrequiresthattimelagsbetakenintoaccount.Finally,twotimeseriesmaybesimilaronlyoverspecificperiodsoftime.Forexample,thereisastrongrelationshipbetweentemperatureandtheuseofnaturalgas,butonlyduringtheheatingseason.

Practicalconsiderationcanalsobeimportant.Sometimes,oneormoreproximitymeasuresarealreadyinuseinaparticularfield,andthus,otherswillhaveansweredthequestionofwhichproximitymeasuresshouldbeused.Othertimes,thesoftwarepackageorclusteringalgorithmbeingusedcandrasticallylimitthechoices.Ifefficiencyisaconcern,thenwemaywanttochooseaproximitymeasurethathasaproperty,suchasthetriangleinequality,thatcanbeusedtoreducethenumberofproximitycalculations.(SeeExercise25 .)

However,ifcommonpracticeorpracticalrestrictionsdonotdictateachoice,thentheproperchoiceofaproximitymeasurecanbeatime-consumingtaskthatrequirescarefulconsiderationofbothdomainknowledgeandthepurposeforwhichthemeasureisbeingused.Anumberofdifferentsimilaritymeasuresmayneedtobeevaluatedtoseewhichonesproduceresultsthatmakethemostsense.

2.5BibliographicNotesItisessentialtounderstandthenatureofthedatathatisbeinganalyzed,andatafundamentallevel,thisisthesubjectofmeasurementtheory.Inparticular,oneoftheinitialmotivationsfordefiningtypesofattributeswastobepreciseaboutwhichstatisticaloperationswerevalidforwhatsortsofdata.WehavepresentedtheviewofmeasurementtheorythatwasinitiallydescribedinaclassicpaperbyS.S.Stevens[112].(Tables2.2 and2.3 arederivedfromthosepresentedbyStevens[113].)Whilethisisthemostcommonviewandisreasonablyeasytounderstandandapply,thereis,ofcourse,muchmoretomeasurementtheory.Anauthoritativediscussioncanbefoundinathree-volumeseriesonthefoundationsofmeasurementtheory[88,94,114].Alsoofinterestisawide-rangingarticlebyHand[77],whichdiscussesmeasurementtheoryandstatistics,andisaccompaniedbycommentsfromotherresearchersinthefield.NumerouscritiquesandextensionsoftheapproachofStevenshavebeenmade[66,97,117].Finally,manybooksandarticlesdescribemeasurementissuesforparticularareasofscienceandengineering.

Dataqualityisabroadsubjectthatspanseverydisciplinethatusesdata.Discussionsofprecision,bias,accuracy,andsignificantfigurescanbefoundinmanyintroductoryscience,engineering,andstatisticstextbooks.Theviewofdataqualityas“fitnessforuse”isexplainedinmoredetailinthebookbyRedman[103].ThoseinterestedindataqualitymayalsobeinterestedinMIT’sInformationQuality(MITIQ)Program[95,118].However,theknowledgeneededtodealwithspecificdataqualityissuesinaparticulardomainisoftenbestobtainedbyinvestigatingthedataqualitypracticesofresearchersinthatfield.

Aggregationisalesswell-definedsubjectthanmanyotherpreprocessingtasks.However,aggregationisoneofthemaintechniquesusedbythedatabaseareaofOnlineAnalyticalProcessing(OLAP)[68,76,102].Therehasalsobeenrelevantworkintheareaofsymbolicdataanalysis(BockandDiday[64]).Oneofthegoalsinthisareaistosummarizetraditionalrecorddataintermsofsymbolicdataobjectswhoseattributesaremorecomplexthantraditionalattributes.Specifically,theseattributescanhavevaluesthataresetsofvalues(categories),intervals,orsetsofvalueswithweights(histograms).Anothergoalofsymbolicdataanalysisistobeabletoperformclustering,classification,andotherkindsofdataanalysisondatathatconsistsofsymbolicdataobjects.

Samplingisasubjectthathasbeenwellstudiedinstatisticsandrelatedfields.Manyintroductorystatisticsbooks,suchastheonebyLindgren[90],havesomediscussionaboutsampling,andentirebooksaredevotedtothesubject,suchastheclassictextbyCochran[67].AsurveyofsamplingfordataminingisprovidedbyGuandLiu[74],whileasurveyofsamplingfordatabasesisprovidedbyOlkenandRotem[98].Thereareanumberofotherdatamininganddatabase-relatedsamplingreferencesthatmaybeofinterest,includingpapersbyPalmerandFaloutsos[100],Provostetal.[101],Toivonen[115],andZakietal.[119].

Instatistics,thetraditionaltechniquesthathavebeenusedfordimensionalityreductionaremultidimensionalscaling(MDS)(BorgandGroenen[65],KruskalandUslaner[89])andprincipalcomponentanalysis(PCA)(Jolliffe[80]),whichissimilartosingularvaluedecomposition(SVD)(Demmel[70]).DimensionalityreductionisdiscussedinmoredetailinAppendixB.

Discretizationisatopicthathasbeenextensivelyinvestigatedindatamining.Someclassificationalgorithmsworkonlywithcategoricaldata,andassociationanalysisrequiresbinarydata,andthus,thereisasignificant

motivationtoinvestigatehowtobestbinarizeordiscretizecontinuousattributes.Forassociationanalysis,wereferthereadertoworkbySrikantandAgrawal[111],whilesomeusefulreferencesfordiscretizationintheareaofclassificationincludeworkbyDoughertyetal.[71],ElomaaandRousu[72],FayyadandIrani[73],andHussainetal.[78].

Featureselectionisanothertopicwellinvestigatedindatamining.AbroadcoverageofthistopicisprovidedinasurveybyMolinaetal.[96]andtwobooksbyLiuandMotada[91,92].OtherusefulpapersincludethosebyBlumandLangley[63],KohaviandJohn[87],andLiuetal.[93].

Itisdifficulttoprovidereferencesforthesubjectoffeaturetransformationsbecausepracticesvaryfromonedisciplinetoanother.Manystatisticsbookshaveadiscussionoftransformations,buttypicallythediscussionisrestrictedtoaparticularpurpose,suchasensuringthenormalityofavariableormakingsurethatvariableshaveequalvariance.Weoffertworeferences:Osborne[99]andTukey[116].

Whilewehavecoveredsomeofthemostcommonlyuseddistanceandsimilaritymeasures,therearehundredsofsuchmeasuresandmorearebeingcreatedallthetime.Aswithsomanyothertopicsinthischapter,manyofthesemeasuresarespecifictoparticularfields,e.g.,intheareaoftimeseriesseepapersbyKalpakisetal.[81]andKeoghandPazzani[83].Clusteringbooksprovidethebestgeneraldiscussions.Inparticular,seethebooksbyAnderberg[62],JainandDubes[79],KaufmanandRousseeuw[82],andSneathandSokal[109].

Information-basedmeasuresofsimilarityhavebecomemorepopularlatelydespitethecomputationaldifficultiesandexpenseofcalculatingthem.AgoodintroductiontoinformationtheoryisprovidedbyCoverandThomas[69].Computingthemutualinformationforcontinuousvariablescanbe

straightforwardiftheyfollowawell-knowdistribution,suchasGaussian.However,thisisoftennotthecase,andmanytechniqueshavebeendeveloped.Asoneexample,thearticlebyKhan,etal.[85]comparesvariousmethodsinthecontextofcomparingshorttimeseries.SeealsotheinformationandmutualinformationpackagesforRandMatlab.MutualinformationhasbeenthesubjectofconsiderablerecentattentionduetopaperbyReshef,etal.[104,105]thatintroducedanalternativemeasure,albeitonebasedonmutualinformation,whichwasclaimedtohavesuperiorproperties.Althoughthisapproachhadsomeearlysupport,e.g.,[110],othershavepointedoutvariouslimitations[75,86,108].

Twopopularbooksonthetopicofkernelmethodsare[106]and[107].Thelatteralsohasawebsitewithlinkstokernel-relatedmaterials[84].Inaddition,manycurrentdatamining,machinelearning,andstatisticallearningtextbookshavesomematerialaboutkernelmethods.FurtherreferencesforkernelmethodsinthecontextofsupportvectormachineclassifiersareprovidedinthebibliographicNotesofSection4.9.4.

Bibliography[62]M.R.Anderberg.ClusterAnalysisforApplications.AcademicPress,New

York,December1973.

[63]A.BlumandP.Langley.SelectionofRelevantFeaturesandExamplesinMachineLearning.ArtificialIntelligence,97(1–2):245–271,1997.

[64]H.H.BockandE.Diday.AnalysisofSymbolicData:ExploratoryMethodsforExtractingStatisticalInformationfromComplexData(StudiesinClassification,DataAnalysis,andKnowledgeOrganization).Springer-VerlagTelos,January2000.

[65]I.BorgandP.Groenen.ModernMultidimensionalScaling—TheoryandApplications.Springer-Verlag,February1997.

[66]N.R.Chrisman.Rethinkinglevelsofmeasurementforcartography.CartographyandGeographicInformationSystems,25(4):231–242,1998.

[67]W.G.Cochran.SamplingTechniques.JohnWiley&Sons,3rdedition,July1977.

[68]E.F.Codd,S.B.Codd,andC.T.Smalley.ProvidingOLAP(On-lineAnalyticalProcessing)toUser-Analysts:AnITMandate.WhitePaper,E.F.CoddandAssociates,1993.

[69]T.M.CoverandJ.A.Thomas.Elementsofinformationtheory.JohnWiley&Sons,2012.

[70]J.W.Demmel.AppliedNumericalLinearAlgebra.SocietyforIndustrial&AppliedMathematics,September1997.

[71]J.Dougherty,R.Kohavi,andM.Sahami.SupervisedandUnsupervisedDiscretizationofContinuousFeatures.InProc.ofthe12thIntl.Conf.onMachineLearning,pages194–202,1995.

[72]T.ElomaaandJ.Rousu.GeneralandEfficientMultisplittingofNumericalAttributes.MachineLearning,36(3):201–244,1999.

[73]U.M.FayyadandK.B.Irani.Multi-intervaldiscretizationofcontinuousvaluedattributesforclassificationlearning.InProc.13thInt.JointConf.onArtificialIntelligence,pages1022–1027.MorganKaufman,1993.

[74]F.H.GaohuaGuandH.Liu.SamplingandItsApplicationinDataMining:ASurvey.TechnicalReportTRA6/00,NationalUniversityofSingapore,Singapore,2000.

[75]M.Gorfine,R.Heller,andY.Heller.CommentonDetectingnovelassociationsinlargedatasets.Unpublished(availableathttp://emotion.technion.ac.il/gorfinm/files/science6.pdfon11Nov.2012),2012.

[76]J.Gray,S.Chaudhuri,A.Bosworth,A.Layman,D.Reichart,M.Venkatrao,F.Pellow,andH.Pirahesh.DataCube:ARelationalAggregationOperatorGeneralizingGroup-By,Cross-Tab,andSub-Totals.JournalDataMiningandKnowledgeDiscovery,1(1):29–53,1997.

[77]D.J.Hand.StatisticsandtheTheoryofMeasurement.JournaloftheRoyalStatisticalSociety:SeriesA(StatisticsinSociety),159(3):445–492,1996.

[78]F.Hussain,H.Liu,C.L.Tan,andM.Dash.TRC6/99:Discretization:anenablingtechnique.Technicalreport,NationalUniversityofSingapore,Singapore,1999.

[79]A.K.JainandR.C.Dubes.AlgorithmsforClusteringData.PrenticeHallAdvancedReferenceSeries.PrenticeHall,March1988.

[80]I.T.Jolliffe.PrincipalComponentAnalysis.SpringerVerlag,2ndedition,October2002.

[81]K.Kalpakis,D.Gada,andV.Puttagunta.DistanceMeasuresforEffectiveClusteringofARIMATime-Series.InProc.ofthe2001IEEEIntl.Conf.onDataMining,pages273–280.IEEEComputerSociety,2001.

[82]L.KaufmanandP.J.Rousseeuw.FindingGroupsinData:AnIntroductiontoClusterAnalysis.WileySeriesinProbabilityandStatistics.JohnWileyandSons,NewYork,November1990.

[83]E.J.KeoghandM.J.Pazzani.Scalingupdynamictimewarpingfordataminingapplications.InKDD,pages285–289,2000.

[84]KernelMethodsforPatternAnalysisWebsite.http://www.kernel-methods.net/,2014.

[85]S.Khan,S.Bandyopadhyay,A.R.Ganguly,S.Saigal,D.J.EricksonIII,V.Protopopescu,andG.Ostrouchov.Relativeperformanceofmutualinformationestimationmethodsforquantifyingthedependenceamongshortandnoisydata.PhysicalReviewE,76(2):026209,2007.

[86]J.B.KinneyandG.S.Atwal.Equitability,mutualinformation,andthemaximalinformationcoefficient.ProceedingsoftheNationalAcademyofSciences,2014.

[87]R.KohaviandG.H.John.WrappersforFeatureSubsetSelection.ArtificialIntelligence,97(1–2):273–324,1997.

[88]D.Krantz,R.D.Luce,P.Suppes,andA.Tversky.FoundationsofMeasurements:Volume1:Additiveandpolynomialrepresentations.AcademicPress,NewYork,1971.

[89]J.B.KruskalandE.M.Uslaner.MultidimensionalScaling.SagePublications,August1978.

[90]B.W.Lindgren.StatisticalTheory.CRCPress,January1993.

[91]H.LiuandH.Motoda,editors.FeatureExtraction,ConstructionandSelection:ADataMiningPerspective.KluwerInternationalSeriesinEngineeringandComputerScience,453.KluwerAcademicPublishers,July1998.

[92]H.LiuandH.Motoda.FeatureSelectionforKnowledgeDiscoveryandDataMining.KluwerInternationalSeriesinEngineeringandComputerScience,454.KluwerAcademicPublishers,July1998.

[93]H.Liu,H.Motoda,andL.Yu.FeatureExtraction,Selection,andConstruction.InN.Ye,editor,TheHandbookofDataMining,pages22–41.LawrenceErlbaumAssociates,Inc.,Mahwah,NJ,2003.

[94]R.D.Luce,D.Krantz,P.Suppes,andA.Tversky.FoundationsofMeasurements:Volume3:Representation,Axiomatization,andInvariance.AcademicPress,NewYork,1990.

[95]MITInformationQuality(MITIQ)Program.http://mitiq.mit.edu/,2014.

[96]L.C.Molina,L.Belanche,andA.Nebot.FeatureSelectionAlgorithms:ASurveyandExperimentalEvaluation.InProc.ofthe2002IEEEIntl.Conf.onDataMining,2002.

[97]F.MostellerandJ.W.Tukey.Dataanalysisandregression:asecondcourseinstatistics.Addison-Wesley,1977.

[98]F.OlkenandD.Rotem.RandomSamplingfromDatabases—ASurvey.Statistics&Computing,5(1):25–42,March1995.

[99]J.Osborne.NotesontheUseofDataTransformations.PracticalAssessment,Research&Evaluation,28(6),2002.

[100]C.R.PalmerandC.Faloutsos.Densitybiasedsampling:Animprovedmethodfordataminingandclustering.ACMSIGMODRecord,29(2):82–92,2000.

[101]F.J.Provost,D.Jensen,andT.Oates.EfficientProgressiveSampling.InProc.ofthe5thIntl.Conf.onKnowledgeDiscoveryandDataMining,pages23–32,1999.

[102]R.RamakrishnanandJ.Gehrke.DatabaseManagementSystems.McGraw-Hill,3rdedition,August2002.

[103]T.C.Redman.DataQuality:TheFieldGuide.DigitalPress,January2001.

[104]D.Reshef,Y.Reshef,M.Mitzenmacher,andP.Sabeti.Equitabilityanalysisofthemaximalinformationcoefficient,withcomparisons.arXivpreprintarXiv:1301.6314,2013.

[105]D.N.Reshef,Y.A.Reshef,H.K.Finucane,S.R.Grossman,G.McVean,P.J.Turnbaugh,E.S.Lander,M.Mitzenmacher,andP.C.

Sabeti.Detectingnovelassociationsinlargedatasets.science,334(6062):1518–1524,2011.

[106]B.SchölkopfandA.J.Smola.Learningwithkernels:supportvectormachines,regularization,optimization,andbeyond.MITpress,2002.

[107]J.Shawe-TaylorandN.Cristianini.Kernelmethodsforpatternanalysis.Cambridgeuniversitypress,2004.

[108]N.SimonandR.Tibshirani.Commenton”DetectingNovelAssociationsInLargeDataSets”byReshefEtAl,ScienceDec16,2011.arXivpreprintarXiv:1401.7645,2014.

[109]P.H.A.SneathandR.R.Sokal.NumericalTaxonomy.Freeman,SanFrancisco,1971.

[110]T.Speed.Acorrelationforthe21stcentury.Science,334(6062):1502–1503,2011.

[111]R.SrikantandR.Agrawal.MiningQuantitativeAssociationRulesinLargeRelationalTables.InProc.of1996ACM-SIGMODIntl.Conf.onManagementofData,pages1–12,Montreal,Quebec,Canada,August1996.

[112]S.S.Stevens.OntheTheoryofScalesofMeasurement.Science,103(2684):677–680,June1946.

[113]S.S.Stevens.Measurement.InG.M.Maranell,editor,Scaling:ASourcebookforBehavioralScientists,pages22–41.AldinePublishingCo.,Chicago,1974.

[114]P.Suppes,D.Krantz,R.D.Luce,andA.Tversky.FoundationsofMeasurements:Volume2:Geometrical,Threshold,andProbabilisticRepresentations.AcademicPress,NewYork,1989.

[115]H.Toivonen.SamplingLargeDatabasesforAssociationRules.InVLDB96,pages134–145.MorganKaufman,September1996.

[116]J.W.Tukey.OntheComparativeAnatomyofTransformations.AnnalsofMathematicalStatistics,28(3):602–632,September1957.

[117]P.F.VellemanandL.Wilkinson.Nominal,ordinal,interval,andratiotypologiesaremisleading.TheAmericanStatistician,47(1):65–72,1993.

[118]R.Y.Wang,M.Ziad,Y.W.Lee,andY.R.Wang.DataQuality.TheKluwerInternationalSeriesonAdvancesinDatabaseSystems,Volume23.KluwerAcademicPublishers,January2001.

[119]M.J.Zaki,S.Parthasarathy,W.Li,andM.Ogihara.EvaluationofSamplingforDataMiningofAssociationRules.TechnicalReportTR617,RensselaerPolytechnicInstitute,1996.

2.6Exercises1.IntheinitialexampleofChapter2 ,thestatisticiansays,“Yes,fields2and3arebasicallythesame.”Canyoutellfromthethreelinesofsampledatathatareshownwhyshesaysthat?

2.Classifythefollowingattributesasbinary,discrete,orcontinuous.Alsoclassifythemasqualitative(nominalorordinal)orquantitative(intervalorratio).Somecasesmayhavemorethanoneinterpretation,sobrieflyindicateyourreasoningifyouthinktheremaybesomeambiguity.

Example:Ageinyears.Answer:Discrete,quantitative,ratio

a. TimeintermsofAMorPM.

b. Brightnessasmeasuredbyalightmeter.

c. Brightnessasmeasuredbypeople’sjudgments.

d. Anglesasmeasuredindegreesbetween0and360.

e. Bronze,Silver,andGoldmedalsasawardedattheOlympics.

f. Heightabovesealevel.

g. Numberofpatientsinahospital.

h. ISBNnumbersforbooks.(LookuptheformatontheWeb.)

i. Abilitytopasslightintermsofthefollowingvalues:opaque,translucent,transparent.

j. Militaryrank.

k. Distancefromthecenterofcampus.

l. Densityofasubstanceingramspercubiccentimeter.

m. Coatchecknumber.(Whenyouattendanevent,youcanoftengiveyourcoattosomeonewho,inturn,givesyouanumberthatyoucanusetoclaimyourcoatwhenyouleave.)

3.Youareapproachedbythemarketingdirectorofalocalcompany,whobelievesthathehasdevisedafoolproofwaytomeasurecustomersatisfaction.Heexplainshisschemeasfollows:“It’ssosimplethatIcan’tbelievethatnoonehasthoughtofitbefore.Ijustkeeptrackofthenumberofcustomercomplaintsforeachproduct.Ireadinadataminingbookthatcountsareratioattributes,andso,mymeasureofproductsatisfactionmustbearatioattribute.ButwhenIratedtheproductsbasedonmynewcustomersatisfactionmeasureandshowedthemtomyboss,hetoldmethatIhadoverlookedtheobvious,andthatmymeasurewasworthless.Ithinkthathewasjustmadbecauseourbestsellingproducthadtheworstsatisfactionsinceithadthemostcomplaints.Couldyouhelpmesethimstraight?”

a. Whoisright,themarketingdirectororhisboss?Ifyouanswered,hisboss,whatwouldyoudotofixthemeasureofsatisfaction?

b. Whatcanyousayabouttheattributetypeoftheoriginalproductsatisfactionattribute?

4.Afewmonthslater,youareagainapproachedbythesamemarketingdirectorasinExercise3 .Thistime,hehasdevisedabetterapproachtomeasuretheextenttowhichacustomerprefersoneproductoverothersimilarproducts.Heexplains,“Whenwedevelopnewproducts,wetypicallycreateseveralvariationsandevaluatewhichonecustomersprefer.Ourstandardprocedureistogiveourtestsubjectsalloftheproductvariationsatonetimeandthenaskthemtoranktheproductvariationsinorderofpreference.However,ourtestsubjectsareveryindecisive,especiallywhenthereare

morethantwoproducts.Asaresult,testingtakesforever.Isuggestedthatweperformthecomparisonsinpairsandthenusethesecomparisonstogettherankings.Thus,ifwehavethreeproductvariations,wehavethecustomerscomparevariations1and2,then2and3,andfinally3and1.Ourtestingtimewithmynewprocedureisathirdofwhatitwasfortheoldprocedure,buttheemployeesconductingthetestscomplainthattheycannotcomeupwithaconsistentrankingfromtheresults.Andmybosswantsthelatestproductevaluations,yesterday.Ishouldalsomentionthathewasthepersonwhocameupwiththeoldproductevaluationapproach.Canyouhelpme?”

a. Isthemarketingdirectorintrouble?Willhisapproachworkforgeneratinganordinalrankingoftheproductvariationsintermsofcustomerpreference?Explain.

b. Isthereawaytofixthemarketingdirector’sapproach?Moregenerally,whatcanyousayabouttryingtocreateanordinalmeasurementscalebasedonpairwisecomparisons?

c. Fortheoriginalproductevaluationscheme,theoverallrankingsofeachproductvariationarefoundbycomputingitsaverageoveralltestsubjects.Commentonwhetheryouthinkthatthisisareasonableapproach.Whatotherapproachesmightyoutake?

5.Canyouthinkofasituationinwhichidentificationnumberswouldbeusefulforprediction?

6.Aneducationalpsychologistwantstouseassociationanalysistoanalyzetestresults.Thetestconsistsof100questionswithfourpossibleanswerseach.

a. Howwouldyouconvertthisdataintoaformsuitableforassociationanalysis?

b. Inparticular,whattypeofattributeswouldyouhaveandhowmanyofthemarethere?

7.Whichofthefollowingquantitiesislikelytoshowmoretemporalautocorrelation:dailyrainfallordailytemperature?Why?

8.Discusswhyadocument-termmatrixisanexampleofadatasetthathasasymmetricdiscreteorasymmetriccontinuousfeatures.

9.Manysciencesrelyonobservationinsteadof(orinadditionto)designedexperiments.Comparethedataqualityissuesinvolvedinobservationalsciencewiththoseofexperimentalscienceanddatamining.

10.Discussthedifferencebetweentheprecisionofameasurementandthetermssingleanddoubleprecision,astheyareusedincomputerscience,typicallytorepresentfloating-pointnumbersthatrequire32and64bits,respectively.

11.Giveatleasttwoadvantagestoworkingwithdatastoredintextfilesinsteadofinabinaryformat.

12.Distinguishbetweennoiseandoutliers.Besuretoconsiderthefollowingquestions.

a. Isnoiseeverinterestingordesirable?Outliers?

b. Cannoiseobjectsbeoutliers?

c. Arenoiseobjectsalwaysoutliers?

d. Areoutliersalwaysnoiseobjects?

e. Cannoisemakeatypicalvalueintoanunusualone,orviceversa?

Algorithm 2.3 Algorithm for finding K-nearest neighbors.
1: for i = 1 to number of data objects do
2:   Find the distances of the ith object to all other objects.
3:   Sort these distances in decreasing order. (Keep track of which object is associated with each distance.)
4:   return the objects associated with the first K distances of the sorted list
5: end for

13. Consider the problem of finding the K-nearest neighbors of a data object. A programmer designs Algorithm 2.3 for this task.

a. Describe the potential problems with this algorithm if there are duplicate objects in the data set. Assume the distance function will return a distance of 0 only for objects that are the same.

b. How would you fix this problem?

14. The following attributes are measured for members of a herd of Asian elephants: weight, height, tusk length, trunk length, and ear area. Based on these measurements, what sort of proximity measure from Section 2.4 would you use to compare or group these elephants? Justify your answer and explain any special circumstances.

15. You are given a set of m objects that is divided into k groups, where the ith group is of size m_i. If the goal is to obtain a sample of size n < m, what is the difference between the following two sampling schemes? (Assume sampling with replacement.)

a. We randomly select n × m_i/m elements from each group.

b. We randomly select n elements from the data set, without regard for the group to which an object belongs.

16. Consider a document-term matrix, where tf_ij is the frequency of the ith word (term) in the jth document and m is the number of documents. Consider the variable transformation that is defined by

tf′_ij = tf_ij × log(m / df_i),   (2.31)

where df_i is the number of documents in which the ith term appears, which is known as the document frequency of the term. This transformation is known as the inverse document frequency transformation.

a. What is the effect of this transformation if a term occurs in one document? In every document?

b. What might be the purpose of this transformation?

17. Assume that we apply a square root transformation to a ratio attribute x to obtain the new attribute x*. As part of your analysis, you identify an interval (a, b) in which x* has a linear relationship to another attribute y.

a. What is the corresponding interval (A, B) in terms of x?

b. Give an equation that relates y to x.

vectors.ComputetheHammingdistanceandtheJaccardsimilaritybetweenthefollowingtwobinaryvectors.

b. Whichapproach,JaccardorHammingdistance,ismoresimilartotheSimpleMatchingCoefficient,andwhichapproachismoresimilartothecosinemeasure?Explain.(Note:TheHammingmeasureisadistance,whiletheotherthreemeasuresaresimilarities,butdon’tletthisconfuseyou.)

c. Supposethatyouarecomparinghowsimilartwoorganismsofdifferentspeciesareintermsofthenumberofgenestheyshare.Describewhichmeasure,HammingorJaccard,youthinkwouldbemoreappropriateforcomparingthegeneticmakeupoftwoorganisms.Explain.(Assumethateachanimalisrepresentedasabinaryvector,whereeachattributeis1ifaparticulargeneispresentintheorganismand0otherwise.)

d. Ifyouwantedtocomparethegeneticmakeupoftwoorganismsofthesamespecies,e.g.,twohumanbeings,wouldyouusetheHammingdistance,theJaccardcoefficient,oradifferentmeasureofsimilarityordistance?Explain.(Notethattwohumanbeingsshare ofthesamegenes.)

19.Forthefollowingvectors,xandy,calculatetheindicatedsimilarityordistancemeasures.

a. cosine,correlation,Euclidean

b. cosine,correlation,Euclidean,Jaccard

c. cosine,correlation,Euclidean

d. cosine,correlation,Jaccard

x=0101010001y=0100011000

>99.9%

x=(1,1,1,1),y=(2,2,2,2)

x=(0,1,0,1),y=(1,0,1,0)

x=(0,−1,0,1),y=(1,0,−1,0)

x=(1,1,0,1,0,1),y=(1,1,1,0,0,1)

e. cosine,correlation

20.Here,wefurtherexplorethecosineandcorrelationmeasures.

a. Whatistherangeofvaluespossibleforthecosinemeasure?

b. Iftwoobjectshaveacosinemeasureof1,aretheyidentical?Explain.

c. Whatistherelationshipofthecosinemeasuretocorrelation,ifany?(Hint:Lookatstatisticalmeasuressuchasmeanandstandarddeviationincaseswherecosineandcorrelationarethesameanddifferent.)

d. Figure2.22(a) showstherelationshipofthecosinemeasuretoEuclideandistancefor100,000randomlygeneratedpointsthathavebeennormalizedtohaveanL2lengthof1.WhatgeneralobservationcanyoumakeabouttherelationshipbetweenEuclideandistanceandcosinesimilaritywhenvectorshaveanL2normof1?

Figure2.22.GraphsforExercise20 .

x=(2,−1,0,2,0,−3),y=(−1,1,−1,0,0,−1)

e. Figure2.22(b) showstherelationshipofcorrelationtoEuclideandistancefor100,000randomlygeneratedpointsthathavebeenstandardizedtohaveameanof0andastandarddeviationof1.WhatgeneralobservationcanyoumakeabouttherelationshipbetweenEuclideandistanceandcorrelationwhenthevectorshavebeenstandardizedtohaveameanof0andastandarddeviationof1?

f. DerivethemathematicalrelationshipbetweencosinesimilarityandEuclideandistancewheneachdataobjecthasanL lengthof1.

g. DerivethemathematicalrelationshipbetweencorrelationandEuclideandistancewheneachdatapointhasbeenbeenstandardizedbysubtractingitsmeananddividingbyitsstandarddeviation.

21.Showthatthesetdifferencemetricgivenby

satisfiesthemetricaxiomsgivenonpage77 .AandBaresetsand isthesetdifference.

22.Discusshowyoumightmapcorrelationvaluesfromtheinterval totheinterval[0,1].Notethatthetypeoftransformationthatyouusemightdependontheapplicationthatyouhaveinmind.Thus,considertwoapplications:clusteringtimeseriesandpredictingthebehaviorofonetimeseriesgivenanother.

23.Givenasimilaritymeasurewithvaluesintheinterval[0,1],describetwowaystotransformthissimilarityvalueintoadissimilarityvalueintheinterval

24.Proximityistypicallydefinedbetweenapairofobjects.

2

d(A,B)=size(A−B)+size(B−A) (2.32)

A−B

[−1,1]

[0,∞].

a. Definetwowaysinwhichyoumightdefinetheproximityamongagroupofobjects.

b. HowmightyoudefinethedistancebetweentwosetsofpointsinEuclideanspace?

c. Howmightyoudefinetheproximitybetweentwosetsofdataobjects?(Makenoassumptionaboutthedataobjects,exceptthataproximitymeasureisdefinedbetweenanypairofobjects.)

25. You are given a set of points S in Euclidean space, as well as the distance of each point in S to a point x. (It does not matter if x ∈ S.)

a. If the goal is to find all points within a specified distance ε of point y, y ≠ x, explain how you could use the triangle inequality and the already calculated distances to x to potentially reduce the number of distance calculations necessary. Hint: the triangle inequality, d(x, z) ≤ d(x, y) + d(y, z), can be rewritten as d(x, y) ≥ d(x, z) − d(y, z).

b. In general, how would the distance between x and y affect the number of distance calculations?

c. Suppose that you can find a small subset of points S′, from the original data set, such that every point in the data set is within a specified distance ε of at least one of the points in S′, and that you also have the pairwise distance matrix for S′. Describe a technique that uses this information to compute, with a minimum of distance calculations, the set of all points within a distance of β of a specified point from the data set.

26. Show that 1 minus the Jaccard similarity is a distance measure between two data objects, x and y, that satisfies the metric axioms given on page 77. Specifically, d(x, y) = 1 − J(x, y).

27. Show that the distance measure defined as the angle between two data vectors, x and y, satisfies the metric axioms given on page 77. Specifically, d(x, y) = arccos(cos(x, y)).

28. Explain why computing the proximity between two attributes is often simpler than computing the similarity between two objects.

3 Classification: Basic Concepts and Techniques

Humanshaveaninnateabilitytoclassifythingsintocategories,e.g.,mundanetaskssuchasfilteringspamemailmessagesormorespecializedtaskssuchasrecognizingcelestialobjectsintelescopeimages(seeFigure3.1 ).Whilemanualclassificationoftensufficesforsmallandsimpledatasetswithonlyafewattributes,largerandmorecomplexdatasetsrequireanautomatedsolution.

Figure3.1.ClassificationofgalaxiesfromtelescopeimagestakenfromtheNASAwebsite.

Thischapterintroducesthebasicconceptsofclassificationanddescribessomeofitskeyissuessuchasmodeloverfitting,modelselection,andmodelevaluation.Whilethesetopicsareillustratedusingaclassificationtechniqueknownasdecisiontreeinduction,mostofthediscussioninthischapterisalsoapplicabletootherclassificationtechniques,manyofwhicharecoveredinChapter4 .

3.1 Basic Concepts

Figure 3.2 illustrates the general idea behind classification. The data for a classification task consists of a collection of instances (records). Each such instance is characterized by the tuple (x, y), where x is the set of attribute values that describe the instance and y is the class label of the instance. The attribute set x can contain attributes of any type, while the class label y must be categorical.

Figure3.2.Aschematicillustrationofaclassificationtask.

A classification model is an abstract representation of the relationship between the attribute set and the class label. As will be seen in the next two chapters, the model can be represented in many ways, e.g., as a tree, a probability table, or simply, a vector of real-valued parameters. More formally, we can express it mathematically as a target function f that takes as input the attribute set x and produces an output corresponding to the predicted class label. The model is said to classify an instance (x, y) correctly if f(x) = y.

Table 3.1 shows examples of attribute sets and class labels for various classification tasks. Spam filtering and tumor identification are examples of binary classification problems, in which each data instance can be categorized into one of two classes. If the number of classes is larger than 2, as in the galaxy classification example, then it is called a multiclass classification problem.

Table 3.1. Examples of classification tasks.

Task  Attribute set  Class label
Spam filtering  Features extracted from email message header and content  spam or non-spam
Tumor identification  Features extracted from magnetic resonance imaging (MRI) scans  malignant or benign
Galaxy classification  Features extracted from telescope images  elliptical, spiral, or irregular-shaped

We illustrate the basic concepts of classification in this chapter with the following two examples.

Example 3.1 (Vertebrate Classification). Table 3.2 shows a sample data set for classifying vertebrates into mammals, reptiles, birds, fishes, and amphibians. The attribute set includes characteristics of the vertebrate such as its body temperature, skin cover, and ability to fly. The data set can also be used for a binary classification task such as mammal classification, by grouping the reptiles, birds, fishes, and amphibians into a single category called non-mammals.

Table 3.2. A sample data for the vertebrate classification problem.

Vertebrate Name  Body Temperature  Skin Cover  Gives Birth  Aquatic Creature  Aerial Creature  Has Legs  Hibernates  Class Label
human  warm-blooded  hair  yes  no  no  yes  no  mammal
python  cold-blooded  scales  no  no  no  no  yes  reptile
salmon  cold-blooded  scales  no  yes  no  no  no  fish
whale  warm-blooded  hair  yes  yes  no  no  no  mammal
frog  cold-blooded  none  no  semi  no  yes  yes  amphibian
komodo dragon  cold-blooded  scales  no  no  no  yes  no  reptile
bat  warm-blooded  hair  yes  no  yes  yes  yes  mammal
pigeon  warm-blooded  feathers  no  no  yes  yes  no  bird
cat  warm-blooded  fur  yes  no  no  yes  no  mammal
leopard shark  cold-blooded  scales  yes  yes  no  no  no  fish
turtle  cold-blooded  scales  no  semi  no  yes  no  reptile
penguin  warm-blooded  feathers  no  semi  no  yes  no  bird
porcupine  warm-blooded  quills  yes  no  no  yes  yes  mammal
eel  cold-blooded  scales  no  yes  no  no  no  fish
salamander  cold-blooded  none  no  semi  no  yes  yes  amphibian

Example 3.2 (Loan Borrower Classification). Consider the problem of predicting whether a loan borrower will repay the loan or default on the loan payments. The data set used to build the classification model is shown in Table 3.3. The attribute set includes personal information of the borrower such as marital status and annual income, while the class label indicates whether the borrower had defaulted on the loan payments.

Table 3.3. A sample data for the loan borrower classification problem.

ID   Home Owner   Marital Status   Annual Income   Defaulted?
1    Yes          Single           125,000         No
2    No           Married          100,000         No
3    No           Single           70,000          No
4    Yes          Married          120,000         No
5    No           Divorced         95,000          Yes
6    No           Single           60,000          No
7    Yes          Divorced         220,000         No
8    No           Single           85,000          Yes
9    No           Married          75,000          No
10   No           Single           90,000          Yes

A classification model serves two important roles in data mining. First, it is used as a predictive model to classify previously unlabeled instances. A good classification model must provide accurate predictions with a fast response time. Second, it serves as a descriptive model to identify the characteristics that distinguish instances from different classes. This is particularly useful for critical applications, such as medical diagnosis, where it is insufficient to have a model that makes a prediction without justifying how it reaches such a decision.

For example, a classification model induced from the vertebrate data set shown in Table 3.2 can be used to predict the class label of the following vertebrate:

Vertebrate Name  Body Temperature  Skin Cover  Gives Birth  Aquatic Creature  Aerial Creature  Has Legs  Hibernates  Class Label
gila monster     cold-blooded      scales      no           no                no               yes       yes         ?

In addition, it can be used as a descriptive model to help determine characteristics that define a vertebrate as a mammal, a reptile, a bird, a fish, or an amphibian. For example, the model may identify mammals as warm-blooded vertebrates that give birth to their young.

There are several points worth noting regarding the previous example. First, although all the attributes shown in Table 3.2 are qualitative, there are no restrictions on the type of attributes that can be used as predictor variables. The class label, on the other hand, must be of nominal type. This distinguishes classification from other predictive modeling tasks such as regression, where the predicted value is often quantitative. More information about regression can be found in Appendix D.

Another point worth noting is that not all attributes may be relevant to the classification task. For example, the average length or weight of a vertebrate may not be useful for classifying mammals, as these attributes can show the same value for both mammals and non-mammals. Such an attribute is typically discarded during preprocessing. The remaining attributes might not be able to distinguish the classes by themselves, and thus, must be used in concert with other attributes. For instance, the Body Temperature attribute is insufficient to distinguish mammals from other vertebrates. When it is used together with Gives Birth, the classification of mammals improves significantly. However, when additional attributes, such as Skin Cover, are included, the model becomes overly specific and no longer covers all mammals. Finding the optimal combination of attributes that best discriminates instances from different classes is the key challenge in building classification models.

3.2 General Framework for Classification
Classification is the task of assigning labels to unlabeled data instances, and a classifier is used to perform such a task. A classifier is typically described in terms of a model, as illustrated in the previous section. The model is created using a given set of instances, known as the training set, which contains attribute values as well as class labels for each instance. The systematic approach for learning a classification model given a training set is known as a learning algorithm. The process of using a learning algorithm to build a classification model from the training data is known as induction. This process is also often described as "learning a model" or "building a model." The process of applying a classification model on unseen test instances to predict their class labels is known as deduction. Thus, the process of classification involves two steps: applying a learning algorithm to training data to learn a model, and then applying the model to assign labels to unlabeled instances. Figure 3.3 illustrates the general framework for classification.

Figure3.3.Generalframeworkforbuildingaclassificationmodel.

A classification technique refers to a general approach to classification, e.g., the decision tree technique that we will study in this chapter. This classification technique, like most others, consists of a family of related models and a number of algorithms for learning these models. In Chapter 4, we will study additional classification techniques, including neural networks and support vector machines.

A couple of notes on terminology. First, the terms "classifier" and "model" are often taken to be synonymous. If a classification technique builds a single, global model, then this is fine. However, while every model defines a classifier, not every classifier is defined by a single model. Some classifiers, such as k-nearest neighbor classifiers, do not build an explicit model (Section 4.3), while other classifiers, such as ensemble classifiers, combine the output of a collection of models (Section 4.10). Second, the term "classifier" is often used in a more general sense to refer to a classification technique. Thus, for example, "decision tree classifier" can refer to the decision tree classification technique or a specific classifier built using that technique. Fortunately, the meaning of "classifier" is usually clear from the context.

In the general framework shown in Figure 3.3, the induction and deduction steps should be performed separately. In fact, as will be discussed later in Section 3.6, the training and test sets should be independent of each other to ensure that the induced model can accurately predict the class labels of instances it has never encountered before. Models that deliver such predictive insights are said to have good generalization performance. The performance of a model (classifier) can be evaluated by comparing the predicted labels against the true labels of instances. This information can be summarized in a table called a confusion matrix. Table 3.4 depicts the confusion matrix for a binary classification problem. Each entry f_ij denotes the number of instances from class i predicted to be of class j. For example, f_01 is the number of instances from class 0 incorrectly predicted as class 1. The number of correct predictions made by the model is (f_11 + f_00) and the number of incorrect predictions is (f_10 + f_01).

Table 3.4. Confusion matrix for a binary classification problem.

                               Predicted Class
                               Class = 1    Class = 0
Actual Class    Class = 1      f_11         f_10
                Class = 0      f_01         f_00

Although a confusion matrix provides the information needed to determine how well a classification model performs, summarizing this information into a single number makes it more convenient to compare the relative performance of different models. This can be done using an evaluation metric such as accuracy, which is computed in the following way:

$$\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}. \quad (3.1)$$

For binary classification problems, the accuracy of a model is given by

$$\text{Accuracy} = \frac{f_{11}+f_{00}}{f_{11}+f_{10}+f_{01}+f_{00}}. \quad (3.2)$$

Error rate is another related metric, which is defined as follows for binary classification problems:

$$\text{Error rate} = \frac{\text{Number of wrong predictions}}{\text{Total number of predictions}} = \frac{f_{10}+f_{01}}{f_{11}+f_{10}+f_{01}+f_{00}}. \quad (3.3)$$

The learning algorithms of most classification techniques are designed to learn models that attain the highest accuracy, or equivalently, the lowest error rate when applied to the test set. We will revisit the topic of model evaluation in Section 3.6.
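Equations (3.1)-(3.3) can be evaluated directly from the four entries of Table 3.4. Below is a minimal Python sketch; the counts used in the example call are hypothetical, not taken from any data set in this chapter.

# Accuracy and error rate from a 2x2 confusion matrix (Table 3.4),
# where f_ij = number of class-i instances predicted as class j.

def accuracy(f11, f10, f01, f00):
    """Fraction of correct predictions, Equation (3.2)."""
    return (f11 + f00) / (f11 + f10 + f01 + f00)

def error_rate(f11, f10, f01, f00):
    """Fraction of wrong predictions, Equation (3.3)."""
    return (f10 + f01) / (f11 + f10 + f01 + f00)

# Example with hypothetical counts:
print(accuracy(40, 10, 5, 45))    # 0.85
print(error_rate(40, 10, 5, 45))  # 0.15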

3.3DecisionTreeClassifierThissectionintroducesasimpleclassificationtechniqueknownasthedecisiontreeclassifier.Toillustratehowadecisiontreeworks,considertheclassificationproblemofdistinguishingmammalsfromnon-mammalsusingthevertebratedatasetshowninTable3.2 .Supposeanewspeciesisdiscoveredbyscientists.Howcanwetellwhetheritisamammaloranon-mammal?Oneapproachistoposeaseriesofquestionsaboutthecharacteristicsofthespecies.Thefirstquestionwemayaskiswhetherthespeciesiscold-orwarm-blooded.Ifitiscold-blooded,thenitisdefinitelynotamammal.Otherwise,itiseitherabirdoramammal.Inthelattercase,weneedtoaskafollow-upquestion:Dothefemalesofthespeciesgivebirthtotheiryoung?Thosethatdogivebirtharedefinitelymammals,whilethosethatdonotarelikelytobenon-mammals(withtheexceptionofegg-layingmammalssuchastheplatypusandspinyanteater).

Thepreviousexampleillustrateshowwecansolveaclassificationproblembyaskingaseriesofcarefullycraftedquestionsabouttheattributesofthetestinstance.Eachtimewereceiveananswer,wecouldaskafollow-upquestionuntilwecanconclusivelydecideonitsclasslabel.Theseriesofquestionsandtheirpossibleanswerscanbeorganizedintoahierarchicalstructurecalledadecisiontree.Figure3.4 showsanexampleofthedecisiontreeforthemammalclassificationproblem.Thetreehasthreetypesofnodes:

Arootnode,withnoincominglinksandzeroormoreoutgoinglinks.Internalnodes,eachofwhichhasexactlyoneincominglinkandtwoormoreoutgoinglinks.Leaforterminalnodes,eachofwhichhasexactlyoneincominglinkandnooutgoinglinks.

Everyleafnodeinthedecisiontreeisassociatedwithaclasslabel.Thenon-terminalnodes,whichincludetherootandinternalnodes,containattributetestconditionsthataretypicallydefinedusingasingleattribute.Eachpossibleoutcomeoftheattributetestconditionisassociatedwithexactlyonechildofthisnode.Forexample,therootnodeofthetreeshowninFigure3.4 usestheattribute todefineanattributetestconditionthathastwooutcomes,warmandcold,resultingintwochildnodes.

Figure3.4.Adecisiontreeforthemammalclassificationproblem.

Given a decision tree, classifying a test instance is straightforward. Starting from the root node, we apply its attribute test condition and follow the appropriate branch based on the outcome of the test. This will lead us either to another internal node, for which a new attribute test condition is applied, or to a leaf node. Once a leaf node is reached, we assign the class label associated with the node to the test instance. As an illustration, Figure 3.5 traces the path used to predict the class label of a flamingo. The path terminates at a leaf node labeled as Non-mammals.

Figure 3.5. Classifying an unlabeled vertebrate. The dashed lines represent the outcomes of applying various attribute test conditions on the unlabeled vertebrate. The vertebrate is eventually assigned to the Non-mammals class.

3.3.1ABasicAlgorithmtoBuildaDecisionTree

There are many possible decision trees that can be constructed from a particular data set. While some trees are better than others, finding an optimal one is computationally expensive due to the exponential size of the search space. Efficient algorithms have been developed to induce a reasonably accurate, albeit suboptimal, decision tree in a reasonable amount of time. These algorithms usually employ a greedy strategy to grow the decision tree in a top-down fashion by making a series of locally optimal decisions about which attribute to use when partitioning the training data. One of the earliest methods is Hunt's algorithm, which is the basis for many current implementations of decision tree classifiers, including ID3, C4.5, and CART. This subsection presents Hunt's algorithm and describes some of the design issues that must be considered when building a decision tree.

Hunt'sAlgorithmInHunt'salgorithm,adecisiontreeisgrowninarecursivefashion.Thetreeinitiallycontainsasinglerootnodethatisassociatedwithallthetraininginstances.Ifanodeisassociatedwithinstancesfrommorethanoneclass,itisexpandedusinganattributetestconditionthatisdeterminedusingasplittingcriterion.Achildleafnodeiscreatedforeachoutcomeoftheattributetestconditionandtheinstancesassociatedwiththeparentnodearedistributedtothechildrenbasedonthetestoutcomes.Thisnodeexpansionstepcanthenberecursivelyappliedtoeachchildnode,aslongasithaslabelsofmorethanoneclass.Ifalltheinstancesassociatedwithaleafnodehaveidenticalclasslabels,thenthenodeisnotexpandedanyfurther.Eachleafnodeisassignedaclasslabelthatoccursmostfrequentlyinthetraininginstancesassociatedwiththenode.

To illustrate how the algorithm works, consider the training set shown in Table 3.3 for the loan borrower classification problem. Suppose we apply Hunt's algorithm to fit the training data. The tree initially contains only a single leaf node, as shown in Figure 3.6(a). This node is labeled as Defaulted = No, since the majority of the borrowers did not default on their loan payments. The training error of this tree is 30%, as three out of the ten training instances have the class label Defaulted = Yes. The leaf node can therefore be further expanded because it contains training instances from more than one class.

Figure 3.6. Hunt's algorithm for building decision trees.

Let Home Owner be the attribute chosen to split the training instances. The justification for choosing this attribute as the attribute test condition will be discussed later. The resulting binary split on the Home Owner attribute is shown in Figure 3.6(b). All the training instances for which Home Owner = Yes are propagated to the left child of the root node and the rest are propagated to the right child. Hunt's algorithm is then recursively applied to each child. The left child becomes a leaf node labeled Defaulted = No, since all instances associated with this node have the identical class label Defaulted = No. The right child has instances from each class label. Hence, we split it further. The resulting subtrees after recursively expanding the right child are shown in Figures 3.6(c) and (d).

Hunt'salgorithm,asdescribedabove,makessomesimplifyingassumptionsthatareoftennottrueinpractice.Inthefollowing,wedescribetheseassumptionsandbrieflydiscusssomeofthepossiblewaysforhandlingthem.

1. SomeofthechildnodescreatedinHunt'salgorithmcanbeemptyifnoneofthetraininginstanceshavetheparticularattributevalues.Onewaytohandlethisisbydeclaringeachofthemasaleafnodewithaclasslabelthatoccursmostfrequentlyamongthetraininginstancesassociatedwiththeirparentnodes.

2. Ifalltraininginstancesassociatedwithanodehaveidenticalattributevaluesbutdifferentclasslabels,itisnotpossibletoexpandthisnodeanyfurther.Onewaytohandlethiscaseistodeclareitaleafnodeandassignittheclasslabelthatoccursmostfrequentlyinthetraininginstancesassociatedwiththisnode.

DesignIssuesofDecisionTreeInductionHunt'salgorithmisagenericprocedureforgrowingdecisiontreesinagreedyfashion.Toimplementthealgorithm,therearetwokeydesignissuesthatmustbeaddressed.

1. What is the splitting criterion? At each recursive step, an attribute must be selected to partition the training instances associated with a node into smaller subsets associated with its child nodes. The splitting criterion determines which attribute is chosen as the test condition and how the training instances should be distributed to the child nodes. This will be discussed in Sections 3.3.2 and 3.3.3.

2. Whatisthestoppingcriterion?Thebasicalgorithmstopsexpandinganodeonlywhenallthetraininginstancesassociatedwiththenodehavethesameclasslabelsorhaveidenticalattributevalues.Althoughtheseconditionsaresufficient,therearereasonstostopexpandinganodemuchearliereveniftheleafnodecontainstraininginstancesfrommorethanoneclass.Thisprocessiscalledearlyterminationandtheconditionusedtodeterminewhenanodeshouldbestoppedfromexpandingiscalledastoppingcriterion.TheadvantagesofearlyterminationarediscussedinSection3.4 .

3.3.2MethodsforExpressingAttributeTestConditions

Decisiontreeinductionalgorithmsmustprovideamethodforexpressinganattributetestconditionanditscorrespondingoutcomesfordifferentattributetypes.

BinaryAttributes

Thetestconditionforabinaryattributegeneratestwopotentialoutcomes,asshowninFigure3.7 .

Figure3.7.Attributetestconditionforabinaryattribute.

NominalAttributes

Since a nominal attribute can have many values, its attribute test condition can be expressed in two ways, as a multiway split or a binary split, as shown in Figure 3.8. For a multiway split (Figure 3.8(a)), the number of outcomes depends on the number of distinct values for the corresponding attribute. For example, if an attribute such as marital status has three distinct values (single, married, or divorced), its test condition will produce a three-way split. It is also possible to create a binary split by partitioning all values taken by the nominal attribute into two groups. For example, some decision tree algorithms, such as CART, produce only binary splits by considering all $2^{k-1}-1$ ways of creating a binary partition of k attribute values. Figure 3.8(b) illustrates three different ways of grouping the attribute values for marital status into two subsets.

Figure3.8.Attributetestconditionsfornominalattributes.

OrdinalAttributes

Ordinalattributescanalsoproducebinaryormulti-waysplits.Ordinalattributevaluescanbegroupedaslongasthegroupingdoesnotviolatetheorderpropertyoftheattributevalues.Figure3.9 illustratesvariouswaysofsplittingtrainingrecordsbasedontheShirtSizeattribute.ThegroupingsshowninFigures3.9(a) and(b) preservetheorderamongtheattributevalues,whereasthegroupingshowninFigure3.9(c) violatesthispropertybecauseitcombinestheattributevaluesSmallandLargeintothesamepartitionwhileMediumandExtraLargearecombinedintoanotherpartition.

Figure3.9.Differentwaysofgroupingordinalattributevalues.

ContinuousAttributes

For continuous attributes, the attribute test condition can be expressed as a comparison test (e.g., A < v) producing a binary split, or as a range query of the form $v_i \le A < v_{i+1}$, for $i = 1, \ldots, k$, producing a multiway split. The difference between these approaches is shown in Figure 3.10. For the binary split, any possible value v between the minimum and maximum attribute values in the training data can be used for constructing the comparison test A < v. However, it is sufficient to only consider distinct attribute values in the training set as candidate split positions. For the multiway split, any possible collection of attribute value ranges can be used, as long as they are mutually exclusive and cover the entire range of attribute values between the minimum and maximum values observed in the training set. One approach for constructing multiway splits is to apply the discretization strategies described in Section 2.3.6 on page 63. After discretization, a new ordinal value is assigned to each discretized interval, and the attribute test condition is then defined using this newly constructed ordinal attribute.

Figure3.10.Testconditionforcontinuousattributes.

3.3.3MeasuresforSelectinganAttributeTestCondition

Therearemanymeasuresthatcanbeusedtodeterminethegoodnessofanattributetestcondition.Thesemeasurestrytogivepreferencetoattributetestconditionsthatpartitionthetraininginstancesintopurersubsetsinthechildnodes,whichmostlyhavethesameclasslabels.Havingpurernodesisusefulsinceanodethathasallofitstraininginstancesfromthesameclassdoesnotneedtobeexpandedfurther.Incontrast,animpurenodecontainingtraininginstancesfrommultipleclassesislikelytorequireseverallevelsofnodeexpansions,therebyincreasingthedepthofthetreeconsiderably.Largertreesarelessdesirableastheyaremoresusceptibletomodeloverfitting,aconditionthatmaydegradetheclassificationperformanceonunseeninstances,aswillbediscussedinSection3.4 .Theyarealsodifficulttointerpretandincurmoretrainingandtesttimeascomparedtosmallertrees.

Inthefollowing,wepresentdifferentwaysofmeasuringtheimpurityofanodeandthecollectiveimpurityofitschildnodes,bothofwhichwillbeusedtoidentifythebestattributetestconditionforanode.

Impurity Measure for a Single Node
The impurity of a node measures how dissimilar the class labels are for the data instances belonging to a common node. Following are examples of measures that can be used to evaluate the impurity of a node t:

$$\text{Entropy} = -\sum_{i=0}^{c-1} p_i(t) \log_2 p_i(t), \quad (3.4)$$
$$\text{Gini index} = 1 - \sum_{i=0}^{c-1} p_i(t)^2, \quad (3.5)$$
$$\text{Classification error} = 1 - \max_i [p_i(t)], \quad (3.6)$$

where $p_i(t)$ is the relative frequency of training instances that belong to class i at node t, c is the total number of classes, and $0 \log_2 0 = 0$ in entropy calculations. All three measures give a zero impurity value if a node contains instances from a single class and maximum impurity if the node has an equal proportion of instances from multiple classes.

Figure 3.11 compares the relative magnitude of the impurity measures when applied to binary classification problems. Since there are only two classes, $p_0(t) + p_1(t) = 1$. The horizontal axis p refers to the fraction of instances that belong to one of the two classes. Observe that all three measures attain their maximum value when the class distribution is uniform (i.e., $p_0(t) = p_1(t) = 0.5$) and minimum value when all the instances belong to a single class (i.e., either $p_0(t)$ or $p_1(t)$ equals 1). The following examples illustrate how the values of the impurity measures vary as we alter the class distribution.

Figure 3.11. Comparison among the impurity measures for binary classification problems.

Node N1      Count      Gini = 1 - (0/6)^2 - (6/6)^2 = 0
Class = 0    0          Entropy = -(0/6) log2(0/6) - (6/6) log2(6/6) = 0
Class = 1    6          Error = 1 - max[0/6, 6/6] = 0

Node N2      Count      Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278
Class = 0    1          Entropy = -(1/6) log2(1/6) - (5/6) log2(5/6) = 0.650
Class = 1    5          Error = 1 - max[1/6, 5/6] = 0.167

Node N3      Count      Gini = 1 - (3/6)^2 - (3/6)^2 = 0.5
Class = 0    3          Entropy = -(3/6) log2(3/6) - (3/6) log2(3/6) = 1
Class = 1    3          Error = 1 - max[3/6, 3/6] = 0.5

Based on these calculations, node N1 has the lowest impurity value, followed by N2 and N3. This example, along with Figure 3.11, shows the consistency among the impurity measures, i.e., if a node N1 has lower entropy than node N2, then the Gini index and error rate of N1 will also be lower than that of N2. Despite their agreement, the attribute chosen as splitting criterion by the impurity measures can still be different (see Exercise 6 on page 187).
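The impurity values for nodes N1, N2, and N3 can be verified with a direct transcription of Equations (3.4)-(3.6) into Python:

from math import log2

def entropy(counts):
    """Equation (3.4); skipping zero and full counts enforces 0 log2 0 = 0."""
    n = sum(counts)
    return sum(-(c / n) * log2(c / n) for c in counts if 0 < c < n)

def gini(counts):
    """Equation (3.5)."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def classification_error(counts):
    """Equation (3.6)."""
    return 1 - max(counts) / sum(counts)

for counts in [(0, 6), (1, 5), (3, 3)]:   # class counts of nodes N1, N2, N3
    print(counts, round(gini(counts), 3), round(entropy(counts), 3),
          round(classification_error(counts), 3))
# N1: 0, 0, 0;   N2: 0.278, 0.65, 0.167;   N3: 0.5, 1.0, 0.5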

Collective Impurity of Child Nodes
Consider an attribute test condition that splits a node containing N training instances into k children, $\{v_1, v_2, \ldots, v_k\}$, where every child node represents a partition of the data resulting from one of the k outcomes of the attribute test condition. Let $N(v_j)$ be the number of training instances associated with a child node $v_j$, whose impurity value is $I(v_j)$. Since a training instance in the parent node reaches node $v_j$ for a fraction of $N(v_j)/N$ times, the collective impurity of the child nodes can be computed by taking a weighted sum of the impurities of the child nodes, as follows:

$$I(\text{children}) = \sum_{j=1}^{k} \frac{N(v_j)}{N}\, I(v_j), \quad (3.7)$$

Example 3.3. Weighted Entropy
Consider the candidate attribute test conditions shown in Figures 3.12(a) and (b) for the loan borrower classification problem. Splitting on the Home Owner attribute will generate two child nodes whose weighted entropy can be calculated as follows:

I(Home Owner = yes) = -(0/3) log2(0/3) - (3/3) log2(3/3) = 0
I(Home Owner = no)  = -(3/7) log2(3/7) - (4/7) log2(4/7) = 0.985
I(Home Owner)       = (3/10) x 0 + (7/10) x 0.985 = 0.690

Splitting on Marital Status, on the other hand, leads to three child nodes with a weighted entropy given by

I(Marital Status = Single)   = -(2/5) log2(2/5) - (3/5) log2(3/5) = 0.971
I(Marital Status = Married)  = -(0/3) log2(0/3) - (3/3) log2(3/3) = 0
I(Marital Status = Divorced) = -(1/2) log2(1/2) - (1/2) log2(1/2) = 1.000
I(Marital Status)            = (5/10) x 0.971 + (3/10) x 0 + (2/10) x 1 = 0.686

Thus, Marital Status has a lower weighted entropy than Home Owner.

Figure 3.12. Examples of candidate attribute test conditions.

Identifying the Best Attribute Test Condition
To determine the goodness of an attribute test condition, we need to compare the degree of impurity of the parent node (before splitting) with the weighted degree of impurity of the child nodes (after splitting). The larger their difference, the better the test condition. This difference, $\Delta$, also termed as the gain in purity of an attribute test condition, can be defined as follows:

$$\Delta = I(\text{parent}) - I(\text{children}), \quad (3.8)$$

where I(parent) is the impurity of a node before splitting and I(children) is the weighted impurity measure after splitting. It can be shown that the gain is non-negative, i.e., $I(\text{parent}) \ge I(\text{children})$, for any reasonable measure such as those presented above. The higher the gain, the purer are the classes in the child nodes relative to the parent node. The splitting criterion in the decision tree learning algorithm selects the attribute test condition that shows the maximum gain. Note that maximizing the gain at a given node is equivalent to minimizing the weighted impurity measure of its children, since I(parent) is the same for all candidate attribute test conditions. Finally, when entropy is used as the impurity measure, the difference in entropy is commonly known as information gain, $\Delta_{info}$.

Figure 3.13. Splitting criteria for the loan borrower classification problem using Gini index.

In the following, we present illustrative approaches for identifying the best attribute test condition given qualitative or quantitative attributes.

Splitting of Qualitative Attributes
Consider the first two candidate splits shown in Figure 3.12 involving the qualitative attributes Home Owner and Marital Status. The initial class distribution at the parent node is (0.3, 0.7), since there are 3 instances of class Yes and 7 instances of class No in the training data. Thus,

$$I(\text{parent}) = -\frac{3}{10}\log_2\frac{3}{10} - \frac{7}{10}\log_2\frac{7}{10} = 0.881$$

The information gains for Home Owner and Marital Status are each given by

$$\Delta_{info}(\text{Home Owner}) = 0.881 - 0.690 = 0.191$$
$$\Delta_{info}(\text{Marital Status}) = 0.881 - 0.686 = 0.195$$

The information gain for Marital Status is thus higher due to its lower weighted entropy, and it will therefore be chosen for splitting.
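These information gains are easy to check programmatically. In the short sketch below, the class counts of each child node are read off Table 3.3 and the helper names are ours; small differences in the last decimal place arise because the text rounds intermediate values.

from math import log2

def entropy(counts):
    """Entropy of a class distribution given as counts, Equation (3.4)."""
    n = sum(counts)
    return sum(-(c / n) * log2(c / n) for c in counts if 0 < c < n)

def weighted_entropy(children):
    """Collective impurity of child nodes, Equation (3.7), with entropy as I."""
    total = sum(sum(c) for c in children)
    return sum(sum(c) / total * entropy(c) for c in children)

parent = (3, 7)                       # (Defaulted = Yes, Defaulted = No) in Table 3.3
home_owner = [(0, 3), (3, 4)]         # children: Home Owner = Yes, No
marital = [(2, 3), (0, 3), (1, 1)]    # children: Single, Married, Divorced

print(round(entropy(parent) - weighted_entropy(home_owner), 3))   # 0.192 (0.191 in the text)
print(round(entropy(parent) - weighted_entropy(marital), 3))      # 0.196 (0.195 in the text)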

Binary Splitting of Qualitative Attributes
Consider building a decision tree using only binary splits and the Gini index as the impurity measure. Figure 3.13 shows examples of four candidate splitting criteria for the Home Owner and Marital Status attributes. Since there are 3 borrowers in the training set who defaulted and 7 others who repaid their loan (see the table in Figure 3.13), the Gini index of the parent node before splitting is

$$1 - \left(\frac{3}{10}\right)^2 - \left(\frac{7}{10}\right)^2 = 0.420.$$

If Home Owner is chosen as the splitting attribute, the Gini index for the child nodes N1 and N2 are 0 and 0.490, respectively. The weighted average Gini index for the children is

$$\frac{3}{10} \times 0 + \frac{7}{10} \times 0.490 = 0.343,$$

where the weights represent the proportion of training instances assigned to each child. The gain using Home Owner as splitting attribute is 0.420 - 0.343 = 0.077. Similarly, we can apply a binary split on the Marital Status attribute. However, since Marital Status is a nominal attribute with three outcomes, there are three possible ways to group the attribute values into a binary split. The weighted average Gini index of the children for each candidate binary split is shown in Figure 3.13. Based on these results, Home Owner and the last binary split using Marital Status are clearly the best candidates, since they both produce the lowest weighted average Gini index. Binary splits can also be used for ordinal attributes, if the binary partitioning of the attribute values does not violate the ordering property of the values.

Binary Splitting of Quantitative Attributes
Consider the problem of identifying the best binary split Annual Income ≤ τ for the preceding loan approval classification problem. As discussed previously, even though τ can take any value between the minimum and maximum values of annual income in the training set, it is sufficient to only consider the annual income values observed in the training set as candidate split positions. For each candidate τ, the training set is scanned once to count the number of borrowers with annual income less than or greater than τ, along with their class proportions. We can then compute the Gini index at each candidate split position and choose the τ that produces the lowest value. Computing the Gini index at each candidate split position requires O(N) operations, where N is the number of training instances. Since there are at most N possible candidates, the overall complexity of this brute-force method is $O(N^2)$. It is possible to reduce the complexity of this problem to $O(N \log N)$ by using the method described as follows (see the illustration in Figure 3.14). In this method, we first sort the training instances based on their annual income, a one-time cost that requires $O(N \log N)$ operations. The candidate split positions are given by the midpoints between every two adjacent sorted values: $55,000, $65,000, $72,500, and so on. For the first candidate, since none of the instances has an annual income less than or equal to $55,000, the Gini index for the child node with Annual Income < $55,000 is equal to zero. In contrast, there are 3 training instances of class Yes and 7 instances of class No with annual income greater than $55,000. The Gini index for this node is 0.420. The weighted average Gini index for the first candidate split position, τ = $55,000, is equal to 0 x 0 + 1 x 0.420 = 0.420.

Figure3.14.Splittingcontinuousattributes.

For the next candidate, τ = $65,000, the class distribution of its child nodes can be obtained with a simple update of the distribution for the previous candidate. This is because, as τ increases from $55,000 to $65,000, there is only one training instance affected by the change. By examining the class label of the affected training instance, the new class distribution is obtained. For example, as τ increases to $65,000, there is only one borrower in the training set, with an annual income of $60,000, affected by this change. Since the class label for the borrower is No, the count for class No increases from 0 to 1 (for Annual Income ≤ $65,000) and decreases from 7 to 6 (for Annual Income > $65,000), as shown in Figure 3.14. The distribution for the Yes class remains unaffected. The updated Gini index for this candidate split position is 0.400.

This procedure is repeated until the Gini index for all candidates is found. The best split position corresponds to the one that produces the lowest Gini index, which occurs at τ = $97,500. Since the Gini index at each candidate split position can be computed in O(1) time, the complexity of finding the best split position is O(N) once all the values are kept sorted, a one-time operation that takes $O(N \log N)$ time. The overall complexity of this method is thus $O(N \log N)$, which is much smaller than the $O(N^2)$ time taken by the brute-force method. The amount of computation can be further reduced by considering only candidate split positions located between two adjacent sorted instances with different class labels. For example, we do not need to consider candidate split positions located between $60,000 and $75,000 because all three instances with annual income in this range ($60,000, $70,000, and $75,000) have the same class labels. Choosing a split position within this range only increases the degree of impurity, compared to a split position located outside this range. Therefore, the candidate split positions at τ = $65,000 and τ = $72,500 can be ignored. Similarly, we do not need to consider the candidate split positions at $87,500, $92,500, $110,000, $122,500, and $172,500 because they are located between two adjacent instances with the same labels. This strategy reduces the number of candidate split positions to consider from 9 to 2 (excluding the two boundary cases τ = $55,000 and τ = $230,000).
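The O(N log N) procedure just described can be sketched in a few lines of Python. The snippet below sorts the annual incomes of Table 3.3, updates the class counts incrementally at each midpoint, and recovers the best split at τ = $97,500. Variable names are ours, and the boundary candidates are omitted since a split that keeps all instances on one side cannot beat an interior midpoint.

# Annual income (in dollars) and class label (1 = defaulted) from Table 3.3.
data = [(125000, 0), (100000, 0), (70000, 0), (120000, 0), (95000, 1),
        (60000, 0), (220000, 0), (85000, 1), (75000, 0), (90000, 1)]

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

data.sort()                               # one-time O(N log N) sort
n = len(data)
total_pos = sum(y for _, y in data)

best_tau, best_gini = None, float("inf")
left = [0, 0]                             # class counts for Annual Income <= tau
for i in range(n - 1):
    left[data[i][1]] += 1                 # incremental O(1) update per candidate
    right = [(n - i - 1) - (total_pos - left[1]), total_pos - left[1]]
    tau = (data[i][0] + data[i + 1][0]) / 2
    w = ((i + 1) / n) * gini(left) + ((n - i - 1) / n) * gini(right)
    if w < best_gini:
        best_tau, best_gini = tau, w

print(best_tau, round(best_gini, 3))      # 97500.0 0.3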

Gain Ratio
One potential limitation of impurity measures such as entropy and the Gini index is that they tend to favor qualitative attributes with a large number of distinct values. Figure 3.12 shows three candidate attributes for partitioning the data set given in Table 3.3. As previously mentioned, the attribute Marital Status is a better choice than the attribute Home Owner, because it provides a larger information gain. However, if we compare them against Customer ID, the latter produces the purest partitions with the maximum information gain, since the weighted entropy and Gini index are equal to zero for its children. Yet, Customer ID is not a good attribute for splitting because it has a unique value for each instance. Even though a test condition involving Customer ID will accurately classify every instance in the training data, we cannot use such a test condition on new test instances with Customer ID values that haven't been seen before during training. This example suggests that having a low impurity value alone is insufficient to find a good attribute test condition for a node. As we will see later in Section 3.4, having a larger number of child nodes can make a decision tree more complex and consequently more susceptible to overfitting. Hence, the number of children produced by the splitting attribute should also be taken into consideration while deciding the best attribute test condition.

There are two ways to overcome this problem. One way is to generate only binary decision trees, thus avoiding the difficulty of handling attributes with varying numbers of partitions. This strategy is employed by decision tree classifiers such as CART. Another way is to modify the splitting criterion to take into account the number of partitions produced by the attribute. For example, in the C4.5 decision tree algorithm, a measure known as gain ratio is used to compensate for attributes that produce a large number of child nodes. This measure is computed as follows:

$$\text{Gain ratio} = \frac{\Delta_{info}}{\text{Split Info}} = \frac{\text{Entropy(Parent)} - \sum_{i=1}^{k} \frac{N(v_i)}{N}\text{Entropy}(v_i)}{-\sum_{i=1}^{k} \frac{N(v_i)}{N} \log_2 \frac{N(v_i)}{N}} \quad (3.9)$$

where $N(v_i)$ is the number of instances assigned to node $v_i$ and k is the total number of splits. The split information measures the entropy of splitting a node into its child nodes and evaluates if the split results in a larger number of equally-sized child nodes or not. For example, if every partition has the same number of instances, then $\forall i: N(v_i)/N = 1/k$ and the split information would be equal to $\log_2 k$. Thus, if an attribute produces a large number of splits, its split information is also large, which in turn reduces the gain ratio.

Example 3.4. Gain Ratio
Consider the data set given in Exercise 2 on page 185. We want to select the best attribute test condition among the following three attributes: Gender, Car Type, and Customer ID. The entropy before splitting is

Entropy(parent) = -(10/20) log2(10/20) - (10/20) log2(10/20) = 1.

If Gender is used as the attribute test condition:

Entropy(children) = (10/20)[-(6/10) log2(6/10) - (4/10) log2(4/10)] x 2 = 0.971
Gain Ratio = (1 - 0.971) / (-(10/20) log2(10/20) - (10/20) log2(10/20)) = 0.029/1 = 0.029

If Car Type is used as the attribute test condition:

Entropy(children) = (4/20)[-(1/4) log2(1/4) - (3/4) log2(3/4)] + (8/20) x 0
                    + (8/20)[-(1/8) log2(1/8) - (7/8) log2(7/8)] = 0.380
Gain Ratio = (1 - 0.380) / (-(4/20) log2(4/20) - (8/20) log2(8/20) - (8/20) log2(8/20)) = 0.620/1.52 = 0.41

Finally, if Customer ID is used as the attribute test condition:

Entropy(children) = (1/20)[-(1/1) log2(1/1) - (0/1) log2(0/1)] x 20 = 0
Gain Ratio = (1 - 0) / (-(1/20) log2(1/20) x 20) = 1/4.32 = 0.23

Thus, even though Customer ID has the highest information gain, its gain ratio is lower than that of Car Type since it produces a larger number of splits.
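A small Python sketch of Equation (3.9); the three calls below reproduce the gain ratios computed above, using the class counts implied by the fractions in the example (a balanced parent of 20 instances).

from math import log2

def entropy(counts):
    n = sum(counts)
    return sum(-(c / n) * log2(c / n) for c in counts if 0 < c < n)

def gain_ratio(parent, children):
    """Equation (3.9): information gain divided by split information."""
    n = sum(parent)
    info_gain = entropy(parent) - sum(sum(c) / n * entropy(c) for c in children)
    split_info = -sum(sum(c) / n * log2(sum(c) / n) for c in children)
    return info_gain / split_info

parent = (10, 10)
print(round(gain_ratio(parent, [(6, 4), (4, 6)]), 3))               # about 0.029 (two-way split)
print(round(gain_ratio(parent, [(1, 3), (8, 0), (1, 7)]), 3))       # about 0.41  (three-way split)
print(round(gain_ratio(parent, [(1, 0)] * 10 + [(0, 1)] * 10), 3))  # about 0.23  (twenty-way split)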

3.3.4AlgorithmforDecisionTreeInduction

Algorithm3.1 presentsapseudocodefordecisiontreeinductionalgorithm.TheinputtothisalgorithmisasetoftraininginstancesEalongwiththeattributesetF.Thealgorithmworksbyrecursivelyselectingthebestattributetosplitthedata(Step7)andexpandingthenodesofthetree(Steps11and12)untilthestoppingcriterionismet(Step1).Thedetailsofthisalgorithmareexplainedbelow.

1. The createNode() function extends the decision tree by creating a new node. A node in the decision tree either has a test condition, denoted as node.test_cond, or a class label, denoted as node.label.

2. The find_best_split() function determines the attribute test condition for partitioning the training instances associated with a node. The splitting attribute chosen depends on the impurity measure used. The popular measures include entropy and the Gini index.

3. The Classify() function determines the class label to be assigned to a leaf node. For each leaf node t, let p(i|t) denote the fraction of training instances from class i associated with the node t. The label assigned to the leaf node is typically the one that occurs most frequently in the training instances that are associated with this node:

$$\text{leaf.label} = \operatorname*{argmax}_i \; p(i|t), \quad (3.10)$$

where the argmax operator returns the class i that maximizes p(i|t). Besides providing the information needed to determine the class label of a leaf node, p(i|t) can also be used as a rough estimate of the probability that an instance assigned to the leaf node t belongs to class i. Sections 4.11.2 and 4.11.4 in the next chapter describe how such probability estimates can be used to determine the performance of a decision tree under different cost functions.

Algorithm 3.1 A skeleton decision tree induction algorithm.

4. The stopping_cond() function is used to terminate the tree-growing process by checking whether all the instances have identical class label or attribute values. Since decision tree classifiers employ a top-down, recursive partitioning approach for building a model, the number of training instances associated with a node decreases as the depth of the tree increases. As a result, a leaf node may contain too few training instances to make a statistically significant decision about its class label. This is known as the data fragmentation problem. One way to avoid this problem is to disallow splitting of a node when the number of instances associated with the node falls below a certain threshold. A more systematic way to control the size of a decision tree (number of leaf nodes) will be discussed in Section 3.5.4.
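The following Python version may help make the recursion in Algorithm 3.1 concrete. It is a minimal sketch under simplifying assumptions: multiway splits on qualitative attributes only, entropy as the impurity measure, and the two stopping conditions described above (identical class labels or no usable attributes left). The function and field names echo the pseudocode but are otherwise our own.

from math import log2
from collections import Counter

class Node:
    def __init__(self):
        self.test_attr = None      # attribute test condition (an attribute index)
        self.children = {}         # outcome -> child Node
        self.label = None          # class label (leaf nodes only)

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def find_best_split(X, y, attrs):
    """Return the attribute whose multiway split minimizes weighted entropy (max gain)."""
    def weighted(a):
        n = len(X)
        groups = Counter(x[a] for x in X)
        return sum(cnt / n * entropy([y[i] for i, x in enumerate(X) if x[a] == v])
                   for v, cnt in groups.items())
    return min(attrs, key=weighted)

def stopping_cond(X, y, attrs):
    return len(set(y)) == 1 or not attrs or all(x == X[0] for x in X)

def classify_leaf(y):
    return Counter(y).most_common(1)[0][0]     # majority label, Equation (3.10)

def tree_growth(X, y, attrs):
    node = Node()
    if stopping_cond(X, y, attrs):
        node.label = classify_leaf(y)
        return node
    node.test_attr = find_best_split(X, y, attrs)
    remaining = [a for a in attrs if a != node.test_attr]
    for v in set(x[node.test_attr] for x in X):
        idx = [i for i, x in enumerate(X) if x[node.test_attr] == v]
        node.children[v] = tree_growth([X[i] for i in idx], [y[i] for i in idx], remaining)
    return node

def predict(node, x):
    while node.label is None:
        node = node.children[x[node.test_attr]]   # assumes the outcome was seen in training
    return node.label

# Tiny demo on the qualitative attributes of Table 3.3 (Home Owner, Marital Status):
X = [("Yes", "Single"), ("No", "Married"), ("No", "Single"), ("Yes", "Married"),
     ("No", "Divorced"), ("No", "Single"), ("Yes", "Divorced"), ("No", "Single"),
     ("No", "Married"), ("No", "Single")]
y = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
tree = tree_growth(X, y, attrs=[0, 1])
print(predict(tree, ("No", "Married")))   # "No"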

3.3.5ExampleApplication:WebRobotDetection

Considerthetaskofdistinguishingtheaccesspatternsofwebrobotsfromthosegeneratedbyhumanusers.Awebrobot(alsoknownasawebcrawler)isasoftwareprogramthatautomaticallyretrievesfilesfromoneormorewebsitesbyfollowingthehyperlinksextractedfromaninitialsetofseedURLs.Theseprogramshavebeendeployedforvariouspurposes,fromgatheringwebpagesonbehalfofsearchenginestomoremaliciousactivitiessuchasspammingandcommittingclickfraudsinonlineadvertisements.


Figure3.15.Inputdataforwebrobotdetection.

Thewebrobotdetectionproblemcanbecastasabinaryclassificationtask.Theinputdatafortheclassificationtaskisawebserverlog,asampleofwhichisshowninFigure3.15(a) .Eachlineinthelogfilecorrespondstoarequestmadebyaclient(i.e.,ahumanuserorawebrobot)tothewebserver.Thefieldsrecordedintheweblogincludetheclient'sIPaddress,timestampoftherequest,URLoftherequestedfile,sizeofthefile,anduseragent,whichisafieldthatcontainsidentifyinginformationabouttheclient.

Forhumanusers,theuseragentfieldspecifiesthetypeofwebbrowserormobiledeviceusedtofetchthefiles,whereasforwebrobots,itshouldtechnicallycontainthenameofthecrawlerprogram.However,webrobotsmayconcealtheirtrueidentitiesbydeclaringtheiruseragentfieldstobeidenticaltoknownbrowsers.Therefore,useragentisnotareliablefieldtodetectwebrobots.

Thefirststeptowardbuildingaclassificationmodelistopreciselydefineadatainstanceandassociatedattributes.Asimpleapproachistoconsidereachlogentryasadatainstanceandusetheappropriatefieldsinthelogfileasitsattributeset.Thisapproach,however,isinadequateforseveralreasons.First,manyoftheattributesarenominal-valuedandhaveawiderangeofdomainvalues.Forexample,thenumberofuniqueclientIPaddresses,URLs,andreferrersinalogfilecanbeverylarge.Theseattributesareundesirableforbuildingadecisiontreebecausetheirsplitinformationisextremelyhigh(seeEquation(3.9) ).Inaddition,itmightnotbepossibletoclassifytestinstancescontainingIPaddresses,URLs,orreferrersthatarenotpresentinthetrainingdata.Finally,byconsideringeachlogentryasaseparatedatainstance,wedisregardthesequenceofwebpagesretrievedbytheclient—acriticalpieceofinformationthatcanhelpdistinguishwebrobotaccessesfromthoseofahumanuser.

Abetteralternativeistoconsidereachwebsessionasadatainstance.Awebsessionisasequenceofrequestsmadebyaclientduringagivenvisittothewebsite.Eachwebsessioncanbemodeledasadirectedgraph,inwhichthenodescorrespondtowebpagesandtheedgescorrespondtohyperlinksconnectingonewebpagetoanother.Figure3.15(b) showsagraphicalrepresentationofthefirstwebsessiongiveninthelogfile.Everywebsessioncanbecharacterizedusingsomemeaningfulattributesaboutthegraphthatcontaindiscriminatoryinformation.Figure3.15(c) showssomeoftheattributesextractedfromthegraph,includingthedepthandbreadthofits

correspondingtreerootedattheentrypointtothewebsite.Forexample,thedepthandbreadthofthetreeshowninFigure3.15(b) arebothequaltotwo.

ThederivedattributesshowninFigure3.15(c) aremoreinformativethantheoriginalattributesgiveninthelogfilebecausetheycharacterizethebehavioroftheclientatthewebsite.Usingthisapproach,adatasetcontaining2916instanceswascreated,withequalnumbersofsessionsduetowebrobots(class1)andhumanusers(class0).10%ofthedatawerereservedfortrainingwhiletheremaining90%wereusedfortesting.TheinduceddecisiontreeisshowninFigure3.16 ,whichhasanerrorrateequalto3.8%onthetrainingsetand5.3%onthetestset.Inadditiontoitslowerrorrate,thetreealsorevealssomeinterestingpropertiesthatcanhelpdiscriminatewebrobotsfromhumanusers:

1. Accessesbywebrobotstendtobebroadbutshallow,whereasaccessesbyhumanuserstendtobemorefocused(narrowbutdeep).

2. Webrobotsseldomretrievetheimagepagesassociatedwithawebpage.

3. Sessionsduetowebrobotstendtobelongandcontainalargenumberofrequestedpages.

4. Webrobotsaremorelikelytomakerepeatedrequestsforthesamewebpagethanhumanuserssincethewebpagesretrievedbyhumanusersareoftencachedbythebrowser.

3.3.6CharacteristicsofDecisionTreeClassifiers

Thefollowingisasummaryoftheimportantcharacteristicsofdecisiontreeinductionalgorithms.

1. Applicability:Decisiontreesareanonparametricapproachforbuildingclassificationmodels.Thisapproachdoesnotrequireanypriorassumptionabouttheprobabilitydistributiongoverningtheclassandattributesofthedata,andthus,isapplicabletoawidevarietyofdatasets.Itisalsoapplicabletobothcategoricalandcontinuousdatawithoutrequiringtheattributestobetransformedintoacommonrepresentationviabinarization,normalization,orstandardization.UnlikesomebinaryclassifiersdescribedinChapter4 ,itcanalsodealwithmulticlassproblemswithouttheneedtodecomposethemintomultiplebinaryclassificationtasks.Anotherappealingfeatureofdecisiontreeclassifiersisthattheinducedtrees,especiallytheshorterones,arerelativelyeasytointerpret.Theaccuraciesofthetreesarealsoquitecomparabletootherclassificationtechniquesformanysimpledatasets.

2. Expressiveness: A decision tree provides a universal representation for discrete-valued functions. In other words, it can encode any function of discrete-valued attributes. This is because every discrete-valued function can be represented as an assignment table, where every unique combination of discrete attributes is assigned a class label. Since every combination of attributes can be represented as a leaf in the decision tree, we can always find a decision tree whose label assignments at the leaf nodes match with the assignment table of the original function. Decision trees can also help in providing compact representations of functions when some of the unique combinations of attributes can be represented by the same leaf node. For example, Figure 3.17 shows the assignment table of the Boolean function $(A \wedge B) \vee (C \wedge D)$ involving four binary attributes, resulting in a total of $2^4 = 16$ possible assignments. The tree shown in Figure 3.17 shows a compressed encoding of this assignment table. Instead of requiring a fully-grown tree with 16 leaf nodes, it is possible to encode the function using a simpler tree with only 7 leaf nodes. Nevertheless, not all decision trees for discrete-valued attributes can be simplified. One notable example is the parity function, whose value is 1 when there is an even number of true values among its Boolean attributes, and 0 otherwise. Accurate modeling of such a function requires a full decision tree with $2^d$ nodes, where d is the number of Boolean attributes (see Exercise 1 on page 185).

Figure3.16.Decisiontreemodelforwebrobotdetection.

Figure 3.17. Decision tree for the Boolean function $(A \wedge B) \vee (C \wedge D)$.

3. ComputationalEfficiency:Sincethenumberofpossibledecisiontreescanbeverylarge,manydecisiontreealgorithmsemployaheuristic-basedapproachtoguidetheirsearchinthevasthypothesisspace.Forexample,thealgorithmpresentedinSection3.3.4 usesagreedy,top-down,recursivepartitioningstrategyforgrowingadecisiontree.Formanydatasets,suchtechniquesquicklyconstructareasonablygooddecisiontreeevenwhenthetrainingsetsizeisverylarge.Furthermore,onceadecisiontreehasbeenbuilt,classifyingatestrecordisextremelyfast,withaworst-casecomplexityofO(w),wherewisthemaximumdepthofthetree.

4. HandlingMissingValues:Adecisiontreeclassifiercanhandlemissingattributevaluesinanumberofways,bothinthetrainingandthetestsets.Whentherearemissingvaluesinthetestset,theclassifiermustdecidewhichbranchtofollowifthevalueofasplitting


nodeattributeismissingforagiventestinstance.Oneapproach,knownastheprobabilisticsplitmethod,whichisemployedbytheC4.5decisiontreeclassifier,distributesthedatainstancetoeverychildofthesplittingnodeaccordingtotheprobabilitythatthemissingattributehasaparticularvalue.Incontrast,theCARTalgorithmusesthesurrogatesplitmethod,wheretheinstancewhosesplittingattributevalueismissingisassignedtooneofthechildnodesbasedonthevalueofanothernon-missingsurrogateattributewhosesplitsmostresemblethepartitionsmadebythemissingattribute.Anotherapproach,knownastheseparateclassmethodisusedbytheCHAIDalgorithm,wherethemissingvalueistreatedasaseparatecategoricalvaluedistinctfromothervaluesofthesplittingattribute.Figure3.18showsanexampleofthethreedifferentwaysforhandlingmissingvaluesinadecisiontreeclassifier.Otherstrategiesfordealingwithmissingvaluesarebasedondatapreprocessing,wheretheinstancewithmissingvalueiseitherimputedwiththemode(forcategoricalattribute)ormean(forcontinuousattribute)valueordiscardedbeforetheclassifieristrained.

Figure3.18.Methodsforhandlingmissingattributevaluesindecisiontreeclassifier.

Duringtraining,ifanattributevhasmissingvaluesinsomeofthetraininginstancesassociatedwithanode,weneedawaytomeasurethegaininpurityifvisusedforsplitting.Onesimplewayistoexcludeinstanceswithmissingvaluesofvinthecountingofinstancesassociatedwitheverychildnode,generatedforeverypossibleoutcomeofv.Further,ifvischosenastheattributetestconditionatanode,traininginstanceswithmissingvaluesofvcanbepropagatedtothechildnodesusinganyofthemethodsdescribedaboveforhandlingmissingvaluesintestinstances.

5. Handling Interactions among Attributes: Attributes are considered interacting if they are able to distinguish between classes when used together, but individually they provide little or no information. Due to the greedy nature of the splitting criteria in decision trees, such attributes could be passed over in favor of other attributes that are not as useful. This could result in more complex decision trees than necessary. Hence, decision trees can perform poorly when there are interactions among attributes. To illustrate this point, consider the three-dimensional data shown in Figure 3.19(a), which contains 2000 data points from one of two classes, denoted as + and ∘ in the diagram. Figure 3.19(b) shows the distribution of the two classes in the two-dimensional space involving attributes X and Y, which is a noisy version of the XOR Boolean function. We can see that even though the two classes are well-separated in this two-dimensional space, neither of the two attributes contains sufficient information to distinguish between the two classes when used alone. For example, the entropies of the attribute test conditions X ≤ 10 and Y ≤ 10 are close to 1, indicating that neither X nor Y provides any reduction in the impurity measure when used individually. X and Y thus represent a case of interaction among attributes. The data set also contains a third attribute, Z, in which both classes are distributed uniformly, as shown in Figures 3.19(c) and 3.19(d), and hence, the entropy of any split involving Z is close to 1. As a result, Z is as likely to be chosen for splitting as the interacting but useful attributes, X and Y. For further illustration of this issue, readers are referred to Example 3.7 in Section 3.4.1 and Exercise 7 at the end of this chapter.

Figure3.19.ExampleofaXORdatainvolvingXandY,alongwithanirrelevantattributeZ.

6. HandlingIrrelevantAttributes:Anattributeisirrelevantifitisnotusefulfortheclassificationtask.Sinceirrelevantattributesarepoorlyassociatedwiththetargetclasslabels,theywillprovidelittleornogaininpurityandthuswillbepassedoverbyothermorerelevantfeatures.Hence,thepresenceofasmallnumberofirrelevantattributeswillnotimpactthedecisiontreeconstructionprocess.However,notallattributesthatprovidelittletonogainareirrelevant(seeFigure3.19 ).Hence,iftheclassificationproblemiscomplex(e.g.,involvinginteractionsamongattributes)andtherearealargenumberofirrelevantattributes,thensomeoftheseattributesmaybeaccidentallychosenduringthetree-growingprocess,sincetheymayprovideabettergainthanarelevantattributejustbyrandomchance.Featureselectiontechniquescanhelptoimprovetheaccuracyofdecisiontreesbyeliminatingtheirrelevantattributesduringpreprocessing.WewillinvestigatetheissueoftoomanyirrelevantattributesinSection3.4.1 .

7. HandlingRedundantAttributes:Anattributeisredundantifitisstronglycorrelatedwithanotherattributeinthedata.Sinceredundantattributesshowsimilargainsinpurityiftheyareselectedforsplitting,onlyoneofthemwillbeselectedasanattributetestconditioninthedecisiontreealgorithm.Decisiontreescanthushandlethepresenceofredundantattributes.

8. Using Rectilinear Splits: The test conditions described so far in this chapter involve using only a single attribute at a time. As a consequence, the tree-growing procedure can be viewed as the process of partitioning the attribute space into disjoint regions until each region contains records of the same class. The border between two neighboring regions of different classes is known as a decision boundary. Figure 3.20 shows the decision tree as well as the decision boundary for a binary classification problem. Since the test condition involves only a single attribute, the decision boundaries are rectilinear, i.e., parallel to the coordinate axes. This limits the expressiveness of decision trees in representing decision boundaries of data sets with continuous attributes. Figure 3.21 shows a two-dimensional data set involving binary classes that cannot be perfectly classified by a decision tree whose attribute test conditions are defined based on single attributes. The binary classes in the data set are generated from two skewed Gaussian distributions, centered at (8,8) and (12,12), respectively. The true decision boundary is represented by the diagonal dashed line, whereas the rectilinear decision boundary produced by the decision tree classifier is shown by the thick solid line. In contrast, an oblique decision tree may overcome this limitation by allowing the test condition to be specified using more than one attribute. For example, the binary classification data shown in Figure 3.21 can be easily represented by an oblique decision tree with a single root node with test condition x + y < 20.

Figure 3.20. Example of a decision tree and its decision boundaries for a two-dimensional data set.

Figure3.21.Exampleofdatasetthatcannotbepartitionedoptimallyusingadecisiontreewithsingleattributetestconditions.Thetruedecisionboundaryisshownbythedashedline.

Although an oblique decision tree is more expressive and can produce more compact trees, finding the optimal test condition is computationally more expensive; a sketch of the derived-attribute workaround for this limitation appears after this list.

9. ChoiceofImpurityMeasure:Itshouldbenotedthatthechoiceofimpuritymeasureoftenhaslittleeffectontheperformanceofdecisiontreeclassifierssincemanyoftheimpuritymeasuresarequiteconsistentwitheachother,asshowninFigure3.11 onpage129.Instead,thestrategyusedtoprunethetreehasagreaterimpactonthefinaltreethanthechoiceofimpuritymeasure.
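The derived-attribute idea mentioned in item 8 can be illustrated with a small Python sketch. The data below is only loosely inspired by Figure 3.21 (skewed Gaussians centered at (8,8) and (12,12)); the covariance matrix and sample sizes are our assumptions, and scikit-learn's DecisionTreeClassifier stands in for a generic decision tree learner.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
cov = [[3.0, -2.9], [-2.9, 3.0]]          # skewed (elongated) class distributions; illustrative values
X0 = rng.multivariate_normal([8, 8], cov, size=500)
X1 = rng.multivariate_normal([12, 12], cov, size=500)
X = np.vstack([X0, X1])
y = np.array([0] * 500 + [1] * 500)

# A single axis-parallel test (depth-1 tree on x or y alone) separates the classes poorly,
axis_tree = DecisionTreeClassifier(max_depth=1).fit(X, y)
# while a single test on the derived attribute x + y acts like the oblique test x + y < 20.
oblique_tree = DecisionTreeClassifier(max_depth=1).fit(X.sum(axis=1, keepdims=True), y)

print(axis_tree.score(X, y))      # noticeably below 1.0
print(oblique_tree.score(X, y))   # essentially 1.0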

3.4ModelOverfittingMethodspresentedsofartrytolearnclassificationmodelsthatshowthelowesterroronthetrainingset.However,aswewillshowinthefollowingexample,evenifamodelfitswelloverthetrainingdata,itcanstillshowpoorgeneralizationperformance,aphenomenonknownasmodeloverfitting.

Figure3.22.Examplesoftrainingandtestsetsofatwo-dimensionalclassificationproblem.

Figure3.23.Effectofvaryingtreesize(numberofleafnodes)ontrainingandtesterrors.

Example 3.5. Overfitting and Underfitting of Decision Trees
Consider the two-dimensional data set shown in Figure 3.22(a). The data set contains instances that belong to two separate classes, represented as + and ∘, respectively, where each class has 5400 instances. All instances belonging to the ∘ class were generated from a uniform distribution. For the + class, 5000 instances were generated from a Gaussian distribution centered at (10,10) with unit variance, while the remaining 400 instances were sampled from the same uniform distribution as the ∘ class. We can see from Figure 3.22(a) that the + class can be largely distinguished from the ∘ class by drawing a circle of appropriate size centered at (10,10). To learn a classifier using this two-dimensional data set, we randomly sampled 10% of the data for training and used the remaining 90% for testing. The training set, shown in Figure 3.22(b), looks quite representative of the overall data. We used the Gini index as the impurity measure to construct decision trees of increasing sizes (number of leaf nodes), by recursively expanding a node into child nodes till every leaf node was pure, as described in Section 3.3.4.

Figure3.23(a) showschangesinthetrainingandtesterrorratesasthesizeofthetreevariesfrom1to8.Botherrorratesareinitiallylargewhenthetreehasonlyoneortwoleafnodes.Thissituationisknownasmodelunderfitting.Underfittingoccurswhenthelearneddecisiontreeistoosimplistic,andthus,incapableoffullyrepresentingthetruerelationshipbetweentheattributesandtheclasslabels.Asweincreasethetreesizefrom1to8,wecanobservetwoeffects.First,boththeerrorratesdecreasesincelargertreesareabletorepresentmorecomplexdecisionboundaries.Second,thetrainingandtesterrorratesarequiteclosetoeachother,whichindicatesthattheperformanceonthetrainingsetisfairlyrepresentativeofthegeneralizationperformance.Aswefurtherincreasethesizeofthetreefrom8to150,thetrainingerrorcontinuestosteadilydecreasetilliteventuallyreacheszero,asshowninFigure3.23(b) .However,inastrikingcontrast,thetesterrorrateceasestodecreaseanyfurtherbeyondacertaintreesize,andthenitbeginstoincrease.Thetrainingerrorratethusgrosslyunder-estimatesthetesterrorrateoncethetreebecomestoolarge.Further,thegapbetweenthetrainingandtesterrorrateskeepsonwideningasweincreasethetreesize.Thisbehavior,whichmayseemcounter-intuitiveatfirst,canbeattributedtothephenomenaofmodeloverfitting.
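The experiment described in this example can be reproduced in outline with scikit-learn. The sketch below follows the data description above; the range of the uniform distribution and the particular tree sizes are our assumptions, so the exact error rates will differ from Figure 3.23, but the qualitative behavior should be visible.

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
# '+' class: 5000 Gaussian points around (10, 10) plus 400 uniform points;
# 'o' class: 5400 uniform points over the same square (range 0-20 is an assumption).
plus = np.vstack([rng.normal(10, 1, size=(5000, 2)), rng.uniform(0, 20, size=(400, 2))])
circ = rng.uniform(0, 20, size=(5400, 2))
X = np.vstack([plus, circ])
y = np.array([1] * 5400 + [0] * 5400)

# 10% for training, 90% for testing, as in the example.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.1, random_state=1)

for leaves in [2, 4, 8, 50, 150]:
    tree = DecisionTreeClassifier(max_leaf_nodes=leaves, random_state=1).fit(X_tr, y_tr)
    print(leaves, 1 - tree.score(X_tr, y_tr), 1 - tree.score(X_te, y_te))
# Training error keeps falling as the tree grows; test error typically levels off and then rises.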

3.4.1ReasonsforModelOverfitting

Model overfitting is the phenomenon where, in the pursuit of minimizing the training error rate, an overly complex model is selected that captures specific patterns in the training data but fails to learn the true nature of relationships between attributes and class labels in the overall data. To illustrate this, Figure 3.24 shows decision trees and their corresponding decision boundaries (shaded rectangles represent regions assigned to the + class) for two trees of sizes 5 and 50. We can see that the decision tree of size 5 appears quite simple and its decision boundaries provide a reasonable approximation to the ideal decision boundary, which in this case corresponds to a circle centered around the Gaussian distribution at (10,10). Although its training and test error rates are non-zero, they are very close to each other, which indicates that the patterns learned in the training set should generalize well over the test set. On the other hand, the decision tree of size 50 appears much more complex than the tree of size 5, with complicated decision boundaries. For example, some of its shaded rectangles (assigned the + class) attempt to cover narrow regions in the input space that contain only one or two + training instances. Note that the prevalence of + instances in such regions is highly specific to the training set, as these regions are mostly dominated by ∘ instances in the overall data. Hence, in an attempt to perfectly fit the training data, the decision tree of size 50 starts fine tuning itself to specific patterns in the training data, leading to poor performance on an independently chosen test set.

Figure3.24.Decisiontreeswithdifferentmodelcomplexities.

Figure3.25.Performanceofdecisiontreesusing20%datafortraining(twicetheoriginaltrainingsize).

Thereareanumberoffactorsthatinfluencemodeloverfitting.Inthefollowing,weprovidebriefdescriptionsoftwoofthemajorfactors:limitedtrainingsizeandhighmodelcomplexity.Thoughtheyarenotexhaustive,theinterplaybetweenthemcanhelpexplainmostofthecommonmodeloverfittingphenomenainreal-worldapplications.

LimitedTrainingSizeNotethatatrainingsetconsistingofafinitenumberofinstancescanonlyprovidealimitedrepresentationoftheoveralldata.Hence,itispossiblethatthepatternslearnedfromatrainingsetdonotfullyrepresentthetruepatternsintheoveralldata,leadingtomodeloverfitting.Ingeneral,asweincreasethesizeofatrainingset(numberoftraininginstances),thepatternslearnedfromthetrainingsetstartresemblingthetruepatternsintheoveralldata.Hence,

theeffectofoverfittingcanbereducedbyincreasingthetrainingsize,asillustratedinthefollowingexample.

3.6ExampleEffectofTrainingSizeSupposethatweusetwicethenumberoftraininginstancesthanwhatwehadusedintheexperimentsconductedinExample3.5 .Specifically,weuse20%datafortrainingandusetheremainderfortesting.Figure3.25(b) showsthetrainingandtesterrorratesasthesizeofthetreeisvariedfrom1to150.TherearetwomajordifferencesinthetrendsshowninthisfigureandthoseshowninFigure3.23(b) (usingonly10%ofthedatafortraining).First,eventhoughthetrainingerrorratedecreaseswithincreasingtreesizeinbothfigures,itsrateofdecreaseismuchsmallerwhenweusetwicethetrainingsize.Second,foragiventreesize,thegapbetweenthetrainingandtesterrorratesismuchsmallerwhenweusetwicethetrainingsize.Thesedifferencessuggestthatthepatternslearnedusing20%ofdatafortrainingaremoregeneralizablethanthoselearnedusing10%ofdatafortraining.

Figure 3.25(a) shows the decision boundaries for the tree of size 50, learned using 20% of the data for training. In contrast to the tree of the same size learned using 10% data for training (see Figure 3.24(d)), we can see that the decision tree is not capturing specific patterns of noisy + instances in the training set. Instead, the high model complexity of 50 leaf nodes is being effectively used to learn the boundaries of the + instances centered at (10,10).

High Model Complexity
Generally, a more complex model has a better ability to represent complex patterns in the data. For example, decision trees with a larger number of leaf nodes can represent more complex decision boundaries than decision trees with fewer leaf nodes. However, an overly complex model also has a tendency to learn specific patterns in the training set that do not generalize well over unseen instances. Models with high complexity should thus be judiciously used to avoid overfitting.

Onemeasureofmodelcomplexityisthenumberof“parameters”thatneedtobeinferredfromthetrainingset.Forexample,inthecaseofdecisiontreeinduction,theattributetestconditionsatinternalnodescorrespondtotheparametersofthemodelthatneedtobeinferredfromthetrainingset.Adecisiontreewithlargernumberofattributetestconditions(andconsequentlymoreleafnodes)thusinvolvesmore“parameters”andhenceismorecomplex.

Givenaclassofmodelswithacertainnumberofparameters,alearningalgorithmattemptstoselectthebestcombinationofparametervaluesthatmaximizesanevaluationmetric(e.g.,accuracy)overthetrainingset.Ifthenumberofparametervaluecombinations(andhencethecomplexity)islarge,thelearningalgorithmhastoselectthebestcombinationfromalargenumberofpossibilities,usingalimitedtrainingset.Insuchcases,thereisahighchanceforthelearningalgorithmtopickaspuriouscombinationofparametersthatmaximizestheevaluationmetricjustbyrandomchance.Thisissimilartothemultiplecomparisonsproblem(alsoreferredasmultipletestingproblem)instatistics.

As an illustration of the multiple comparisons problem, consider the task of predicting whether the stock market will rise or fall in the next ten trading days. If a stock analyst simply makes random guesses, the probability that her prediction is correct on any trading day is 0.5. However, the probability that she will predict correctly at least nine out of ten times is

$$\frac{\binom{10}{9} + \binom{10}{10}}{2^{10}} = 0.0107,$$

which is extremely low.

Suppose we are interested in choosing an investment advisor from a pool of 200 stock analysts. Our strategy is to select the analyst who makes the most number of correct predictions in the next ten trading days. The flaw in this strategy is that even if all the analysts make their predictions in a random fashion, the probability that at least one of them makes at least nine correct predictions is

$$1 - (1 - 0.0107)^{200} = 0.8847,$$

which is very high. Although each analyst has a low probability of predicting at least nine times correctly, considered together, we have a high probability of finding at least one analyst who can do so. However, there is no guarantee in the future that such an analyst will continue to make accurate predictions by random guessing.

How does the multiple comparisons problem relate to model overfitting? In the context of learning a classification model, each combination of parameter values corresponds to an analyst, while the number of training instances corresponds to the number of days. Analogous to the task of selecting the best analyst who makes the most accurate predictions on consecutive days, the task of a learning algorithm is to select the best combination of parameters that results in the highest accuracy on the training set. If the number of parameter combinations is large but the training size is small, it is highly likely for the learning algorithm to choose a spurious parameter combination that provides high training accuracy just by random chance. In the following example, we illustrate the phenomenon of overfitting due to multiple comparisons in the context of decision tree induction.
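Both probabilities follow from the binomial distribution and can be checked directly:

from math import comb

# Probability a single analyst guessing at random is right on at least 9 of 10 days.
p_one = (comb(10, 9) + comb(10, 10)) / 2 ** 10
# Probability that at least one of 200 such analysts achieves this.
p_any = 1 - (1 - p_one) ** 200
print(round(p_one, 4), round(p_any, 4))   # 0.0107 0.8847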

Figure3.26.Exampleofatwo-dimensional(X-Y)dataset.

Figure 3.27. Training and test error rates illustrating the effect of the multiple comparisons problem on model overfitting.

Example 3.7. Multiple Comparisons and Overfitting
Consider the two-dimensional data set shown in Figure 3.26 containing 500 + and 500 ∘ instances, which is similar to the data shown in Figure 3.19. In this data set, the distributions of both classes are well-separated in the two-dimensional (X-Y) attribute space, but none of the two attributes (X or Y) are sufficiently informative to be used alone for separating the two classes. Hence, splitting the data set based on any value of an X or Y attribute will provide close to zero reduction in an impurity measure. However, if X and Y attributes are used together in the splitting criterion (e.g., splitting X at 10 and Y at 10), the two classes can be effectively separated.

Figure 3.28. Decision tree with 6 leaf nodes using X and Y as attributes. Splits have been numbered from 1 to 5 in order of their occurrence in the tree.

Figure 3.27(a) shows the training and test error rates for learning decision trees of varying sizes, when 30% of the data is used for training and the remainder of the data for testing. We can see that the two classes can be separated using a small number of leaf nodes. Figure 3.28 shows the decision boundaries for the tree with six leaf nodes, where the splits have been numbered according to their order of appearance in the tree. Note that even though splits 1 and 3 provide trivial gains, their consequent splits (2, 4, and 5) provide large gains, resulting in effective discrimination of the two classes.

Assumeweadd100irrelevantattributestothetwo-dimensionalX-Ydata.Learningadecisiontreefromthisresultantdatawillbechallengingbecausethenumberofcandidateattributestochooseforsplittingateveryinternalnodewillincreasefromtwoto102.Withsuchalargenumberofcandidateattributetestconditionstochoosefrom,itisquitelikelythatspuriousattributetestconditionswillbeselectedatinternalnodesbecauseofthemultiplecomparisonsproblem.Figure3.27(b) showsthetrainingandtesterrorratesafteradding100irrelevantattributestothetrainingset.Wecanseethatthetesterrorrateremainscloseto0.5evenafterusing50leafnodes,whilethetrainingerrorratekeepsondecliningandeventuallybecomes0.

3.5 Model Selection

There are many possible classification models with varying levels of model complexity that can be used to capture patterns in the training data. Among these possibilities, we want to select the model that shows the lowest generalization error rate. The process of selecting a model with the right level of complexity, which is expected to generalize well over unseen test instances, is known as model selection. As described in the previous section, the training error rate cannot be reliably used as the sole criterion for model selection. In the following, we present three generic approaches to estimate the generalization performance of a model that can be used for model selection. We conclude this section by presenting specific strategies for using these approaches in the context of decision tree induction.

3.5.1 Using a Validation Set

Notethatwecanalwaysestimatethegeneralizationerrorrateofamodelbyusing“out-of-sample”estimates,i.e.byevaluatingthemodelonaseparatevalidationsetthatisnotusedfortrainingthemodel.Theerrorrateonthevalidationset,termedasthevalidationerrorrate,isabetterindicatorofgeneralizationperformancethanthetrainingerrorrate,sincethevalidationsethasnotbeenusedfortrainingthemodel.Thevalidationerrorratecanbeusedformodelselectionasfollows.

Given a training set D.train, we can partition D.train into two smaller subsets, D.tr and D.val, such that D.tr is used for training while D.val is used as the validation set. For example, two-thirds of D.train can be reserved as D.tr for training, while the remaining one-third is used as D.val for computing the validation error rate. For any choice of classification model m that is trained on D.tr, we can estimate its validation error rate on D.val, errval(m). The model that shows the lowest value of errval(m) can then be selected as the preferred choice of model.

Theuseofvalidationsetprovidesagenericapproachformodelselection.However,onelimitationofthisapproachisthatitissensitivetothesizesofD.trandD.val,obtainedbypartitioningD.train.IfthesizeofD.tristoosmall,itmayresultinthelearningofapoorclassificationmodelwithsub-standardperformance,sinceasmallertrainingsetwillbelessrepresentativeoftheoveralldata.Ontheotherhand,ifthesizeofD.valistoosmall,thevalidationerrorratemightnotbereliableforselectingmodels,asitwouldbecomputedoverasmallnumberofinstances.
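As a concrete sketch of this procedure, the snippet below (using scikit-learn, with synthetic data and decision trees of varying maximum depth as the candidate models, both chosen purely for illustration) partitions D.train into D.tr and D.val and selects the model with the lowest validation error rate.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# D.train: the labeled data available for model selection (synthetic here).
X_train, y_train = make_classification(n_samples=600, random_state=0)

# Partition D.train into D.tr (2/3, for fitting) and D.val (1/3, for validation).
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=1/3, random_state=0)

best_model, best_err = None, np.inf
for depth in [1, 2, 4, 8, 16]:             # candidate models of increasing complexity
    m = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    err_val = 1.0 - m.score(X_val, y_val)  # validation error rate errval(m)
    if err_val < best_err:
        best_model, best_err = m, err_val

print(best_model.get_depth(), best_err)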

Figure 3.29. Class distribution of validation data for the two decision trees shown in Figure 3.30.

Example 3.8 (Validation Error). In the following example, we illustrate one possible approach for using a validation set in decision tree induction. Figure 3.29 shows the predicted labels at the leaf nodes of the decision trees generated in Figure 3.30. The counts given beneath the leaf nodes represent the proportion of data instances in the validation set that reach each of the nodes. Based on the predicted labels of the nodes, the validation error rate for the left tree is errval(TL) = 6/16 = 0.375, while the validation error rate for the right tree is errval(TR) = 4/16 = 0.25. Based on their validation error rates, the right tree is preferred over the left one.

3.5.2 Incorporating Model Complexity

Since the chance for model overfitting increases as the model becomes more complex, a model selection approach should not only consider the training error rate but also the model complexity. This strategy is inspired by a well-known principle known as Occam's razor or the principle of parsimony, which suggests that given two models with the same errors, the simpler model is preferred over the more complex model. A generic approach to account for model complexity while estimating generalization performance is formally described as follows.

Given a training set D.train, let us consider learning a classification model m that belongs to a certain class of models, M. For example, if M represents the set of all possible decision trees, then m can correspond to a specific decision tree learned from the training set. We are interested in estimating the generalization error rate of m, gen.error(m). As discussed previously, the training error rate of m, train.error(m, D.train), can under-estimate gen.error(m) when the model complexity is high. Hence, we represent gen.error(m) as a function of not just the training error rate but also the model complexity of M, complexity(M), as follows:

gen.error(m) = train.error(m, D.train) + α × complexity(M),   (3.11)

where α is a hyper-parameter that strikes a balance between minimizing training error and reducing model complexity. A higher value of α gives more emphasis to the model complexity in the estimation of generalization performance. To choose the right value of α, we can make use of the validation set in a similar way as described in Section 3.5.1. For example, we can iterate through a range of values of α and, for every possible value, learn a model on a subset of the training set, D.tr, and compute its validation error rate on a separate subset, D.val. We can then select the value of α that provides the lowest validation error rate.

Equation 3.11 provides one possible approach for incorporating model complexity into the estimate of generalization performance. This approach is at the heart of a number of techniques for estimating generalization performance, such as the structural risk minimization principle, the Akaike's Information Criterion (AIC), and the Bayesian Information Criterion (BIC). The structural risk minimization principle serves as the building block for learning support vector machines, which will be discussed later in Chapter 4. For more details on AIC and BIC, see the Bibliographic Notes.

In the following, we present two different approaches for estimating the complexity of a model, complexity(M). While the former is specific to decision trees, the latter is more generic and can be used with any class of models.

Estimating the Complexity of Decision Trees

In the context of decision trees, the complexity of a decision tree can be estimated as the ratio of the number of leaf nodes to the number of training instances. Let k be the number of leaf nodes and Ntrain be the number of training instances. The complexity of a decision tree can then be described as k/Ntrain. This reflects the intuition that for a larger training size, we can learn a decision tree with a larger number of leaf nodes without it becoming overly complex. The generalization error rate of a decision tree T can then be computed using Equation 3.11 as follows:

errgen(T) = err(T) + Ω × k/Ntrain,

where err(T) is the training error of the decision tree and Ω is a hyper-parameter that makes a trade-off between reducing training error and minimizing model complexity, similar to the use of α in Equation 3.11. Ω can be viewed as the relative cost of adding a leaf node relative to incurring a training error. In the literature on decision tree induction, the above approach for estimating the generalization error rate is also termed the pessimistic error estimate. It is called pessimistic as it assumes the generalization error rate to be worse than the training error rate (by adding a penalty term for model complexity). On the other hand, simply using the training error rate as an estimate of the generalization error rate is called the optimistic error estimate or the resubstitution estimate.

Example 3.9 (Generalization Error Estimates). Consider the two binary decision trees, TL and TR, shown in Figure 3.30. Both trees are generated from the same training data and TL is generated by expanding three leaf nodes of TR. The counts shown in the leaf nodes of the trees represent the class distribution of the training instances. If each leaf node is labeled according to the majority class of training instances that reach the node, the training error rate for the left tree will be err(TL) = 4/24 = 0.167, while the training error rate for the right tree will be err(TR) = 6/24 = 0.25. Based on their training error rates alone, TL would be preferred over TR, even though TL is more complex (contains a larger number of leaf nodes) than TR.

Figure 3.30. Example of two decision trees generated from the same training data.

Now, assume that the cost associated with each leaf node is Ω = 0.5. Then, the generalization error estimate for TL will be

errgen(TL) = 4/24 + 0.5 × 7/24 = 7.5/24 = 0.3125

and the generalization error estimate for TR will be

errgen(TR) = 6/24 + 0.5 × 4/24 = 8/24 = 0.3333.

Since TL has a lower generalization error rate, it will still be preferred over TR. Note that Ω = 0.5 implies that a node should always be expanded into its two child nodes if it improves the prediction of at least one training instance, since expanding a node is less costly than misclassifying a training instance. On the other hand, if Ω = 1, then the generalization error rate for TL is errgen(TL) = 11/24 = 0.458 and for TR is errgen(TR) = 10/24 = 0.417. In this case, TR will be preferred over TL because it has a lower generalization error rate. This example illustrates that different choices of Ω can change our preference of decision trees based on their generalization error estimates. However, for a given choice of Ω, the pessimistic error estimate provides an approach for modeling the generalization performance on unseen test instances. The value of Ω can be selected with the help of a validation set.
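The calculations in this example are easy to reproduce. The snippet below is a minimal Python check of the pessimistic error estimates for TL and TR under Ω = 0.5 and Ω = 1.

def err_gen(n_errors, n_leaves, n_train, omega):
    # Pessimistic error estimate: training error rate plus a complexity
    # penalty of omega per leaf node (errgen(T) = err(T) + omega * k/Ntrain).
    return n_errors / n_train + omega * n_leaves / n_train

# T_L commits 4 errors with 7 leaves; T_R commits 6 errors with 4 leaves;
# both are built on 24 training instances.
for omega in (0.5, 1.0):
    print(omega,
          round(err_gen(4, 7, 24, omega), 4),   # T_L: 0.3125, then 0.4583
          round(err_gen(6, 4, 24, omega), 4))   # T_R: 0.3333, then 0.4167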

Minimum Description Length Principle

Another way to incorporate model complexity is based on an information-theoretic approach known as the minimum description length or MDL principle. To illustrate this approach, consider the example shown in Figure 3.31. In this example, both person A and person B are given a set of instances with known attribute values x. Assume person A knows the class label y for every instance, while person B has no such information. A would like to share the class information with B by sending a message containing the labels. The message would contain Θ(N) bits of information, where N is the number of instances.

Figure3.31.Anillustrationoftheminimumdescriptionlengthprinciple.

Alternatively, instead of sending the class labels explicitly, A can build a classification model from the instances and transmit it to B. B can then apply the model to determine the class labels of the instances. If the model is 100% accurate, then the cost for transmission is equal to the number of bits required to encode the model. Otherwise, A must also transmit information about which instances are misclassified by the model so that B can reproduce the same class labels. Thus, the overall transmission cost, which is equal to the total description length of the message, is

Cost(model, data) = Cost(data|model) + α × Cost(model),   (3.12)

where the first term on the right-hand side is the number of bits needed to encode the misclassified instances, while the second term is the number of bits required to encode the model. There is also a hyper-parameter α that trades off the relative costs of the misclassified instances and the model.

Notice the similarity between this equation and the generic equation for the generalization error rate presented in Equation 3.11. A good model must have a total description length less than the number of bits required to encode the entire sequence of class labels. Furthermore, given two competing models, the model with the lower total description length is preferred. An example showing how to compute the total description length of a decision tree is given in Exercise 10 on page 189.

3.5.3 Estimating Statistical Bounds

InsteadofusingEquation3.11 toestimatethegeneralizationerrorrateofamodel,analternativewayistoapplyastatisticalcorrectiontothetrainingerrorrateofthemodelthatisindicativeofitsmodelcomplexity.Thiscanbedoneiftheprobabilitydistributionoftrainingerrorisavailableorcanbeassumed.Forexample,thenumberoferrorscommittedbyaleafnodeinadecisiontreecanbeassumedtofollowabinomialdistribution.Wecanthuscomputeanupperboundlimittotheobservedtrainingerrorratethatcanbeusedformodelselection,asillustratedinthefollowingexample.

Example 3.10 (Statistical Bounds on Training Error). Consider the left-most branch of the binary decision trees shown in Figure 3.30. Observe that the left-most leaf node of TR has been expanded into two child nodes in TL. Before splitting, the training error rate of the node is 2/7 = 0.286. By approximating a binomial distribution with a normal distribution, the following upper bound of the training error rate e can be derived:

eupper(N, e, α) = ( e + zα/2² / (2N) + zα/2 √( e(1 − e)/N + zα/2² / (4N²) ) ) / ( 1 + zα/2² / N ),   (3.13)

where α is the confidence level, zα/2 is the standardized value from a standard normal distribution, and N is the total number of training instances used to compute e. By replacing α = 25%, N = 7, and e = 2/7, the upper bound for the error rate is eupper(7, 2/7, 0.25) = 0.503, which corresponds to 7 × 0.503 = 3.521 errors. If we expand the node into its child nodes as shown in TL, the training error rates for the child nodes are 1/4 = 0.250 and 1/3 = 0.333, respectively. Using Equation (3.13), the upper bounds of these error rates are eupper(4, 1/4, 0.25) = 0.537 and eupper(3, 1/3, 0.25) = 0.650, respectively. The overall training error of the child nodes is 4 × 0.537 + 3 × 0.650 = 4.098, which is larger than the estimated error for the corresponding node in TR, suggesting that it should not be split.
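Equation 3.13 can be evaluated directly. The sketch below (assuming scipy is available for the normal quantile) reproduces the upper bounds 0.503, 0.537, and 0.650 used in this example.

from math import sqrt
from scipy.stats import norm

def e_upper(N, e, alpha):
    # Upper bound (Equation 3.13) on the training error rate e observed
    # over N instances, at confidence level alpha.
    z = norm.ppf(1 - alpha / 2)          # z_{alpha/2} for a two-sided bound
    num = e + z**2 / (2 * N) + z * sqrt(e * (1 - e) / N + z**2 / (4 * N**2))
    return num / (1 + z**2 / N)

print(round(e_upper(7, 2/7, 0.25), 3))   # parent node: ~0.503
print(round(e_upper(4, 1/4, 0.25), 3))   # left child:  ~0.537
print(round(e_upper(3, 1/3, 0.25), 3))   # right child: ~0.650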

3.5.4ModelSelectionforDecisionTrees

Buildingonthegenericapproachespresentedabove,wepresenttwocommonlyusedmodelselectionstrategiesfordecisiontreeinduction.

Prepruning(EarlyStoppingRule)

Inthisapproach,thetree-growingalgorithmishaltedbeforegeneratingafullygrowntreethatperfectlyfitstheentiretrainingdata.Todothis,amorerestrictivestoppingconditionmustbeused;e.g.,stopexpandingaleafnodewhentheobservedgaininthegeneralizationerrorestimatefallsbelowacertainthreshold.Thisestimateofthegeneralizationerrorratecanbe


computedusinganyoftheapproachespresentedintheprecedingthreesubsections,e.g.,byusingpessimisticerrorestimates,byusingvalidationerrorestimates,orbyusingstatisticalbounds.Theadvantageofprepruningisthatitavoidsthecomputationsassociatedwithgeneratingoverlycomplexsubtreesthatoverfitthetrainingdata.However,onemajordrawbackofthismethodisthat,evenifnosignificantgainisobtainedusingoneoftheexistingsplittingcriterion,subsequentsplittingmayresultinbettersubtrees.Suchsubtreeswouldnotbereachedifprepruningisusedbecauseofthegreedynatureofdecisiontreeinduction.

Post-pruning

Inthisapproach,thedecisiontreeisinitiallygrowntoitsmaximumsize.Thisisfollowedbyatree-pruningstep,whichproceedstotrimthefullygrowntreeinabottom-upfashion.Trimmingcanbedonebyreplacingasubtreewith(1)anewleafnodewhoseclasslabelisdeterminedfromthemajorityclassofinstancesaffiliatedwiththesubtree(approachknownassubtreereplacement),or(2)themostfrequentlyusedbranchofthesubtree(approachknownassubtreeraising).Thetree-pruningstepterminateswhennofurtherimprovementinthegeneralizationerrorestimateisobservedbeyondacertainthreshold.Again,theestimatesofgeneralizationerrorratecanbecomputedusinganyoftheapproachespresentedinthepreviousthreesubsections.Post-pruningtendstogivebetterresultsthanprepruningbecauseitmakespruningdecisionsbasedonafullygrowntree,unlikeprepruning,whichcansufferfromprematureterminationofthetree-growingprocess.However,forpost-pruning,theadditionalcomputationsneededtogrowthefulltreemaybewastedwhenthesubtreeispruned.
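Subtree replacement and subtree raising are not directly available in scikit-learn, but a closely related post-pruning strategy, minimal cost-complexity pruning, is. The sketch below (on synthetic data, chosen only for illustration) grows a full tree and then selects a pruned tree using a validation set.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=1/3, random_state=0)

# Grow the full tree, then examine the cost-complexity pruning path
# (the effective alpha values at which subtrees get collapsed).
full_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
alphas = full_tree.cost_complexity_pruning_path(X_tr, y_tr).ccp_alphas

# Refit with each candidate alpha and keep the pruned tree with the
# lowest validation error rate.
best = min(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_tr, y_tr)
     for a in alphas),
    key=lambda t: 1.0 - t.score(X_val, y_val))
print(best.get_n_leaves(), 1.0 - best.score(X_val, y_val))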

Figure 3.32 illustrates the simplified decision tree model for the web robot detection example given in Section 3.3.5. Notice that the subtree rooted at depth = 1 has been replaced by one of its branches corresponding to breadth <= 7, width > 3, and MultiP = 1, using subtree raising. On the other hand, the subtree corresponding to depth > 1 and MultiAgent = 0 has been replaced by a leaf node assigned to class 0, using subtree replacement. The subtree for depth > 1 and MultiAgent = 1 remains intact.

Figure 3.32. Post-pruning of the decision tree for web robot detection.

3.6 Model Evaluation

The previous section discussed several approaches for model selection that can be used to learn a classification model from a training set D.train. Here we discuss methods for estimating its generalization performance, i.e., its performance on unseen instances outside of D.train. This process is known as model evaluation.

NotethatmodelselectionapproachesdiscussedinSection3.5 alsocomputeanestimateofthegeneralizationperformanceusingthetrainingsetD.train.However,theseestimatesarebiasedindicatorsoftheperformanceonunseeninstances,sincetheywereusedtoguidetheselectionofclassificationmodel.Forexample,ifweusethevalidationerrorrateformodelselection(asdescribedinSection3.5.1 ),theresultingmodelwouldbedeliberatelychosentominimizetheerrorsonthevalidationset.Thevalidationerrorratemaythusunder-estimatethetruegeneralizationerrorrate,andhencecannotbereliablyusedformodelevaluation.

A correct approach for model evaluation would be to assess the performance of a learned model on a labeled test set that has not been used at any stage of model selection. This can be achieved by partitioning the entire set of labeled instances D into two disjoint subsets, D.train, which is used for model selection, and D.test, which is used for computing the test error rate, errtest. In the following, we present two different approaches for partitioning D into D.train and D.test, and computing the test error rate, errtest.

3.6.1 Holdout Method

The most basic technique for partitioning a labeled data set is the holdout method, where the labeled set D is randomly partitioned into two disjoint sets, called the training set D.train and the test set D.test. A classification model is then induced from D.train using the model selection approaches presented in Section 3.5, and its error rate on D.test, errtest, is used as an estimate of the generalization error rate. The proportion of data reserved for training and for testing is typically at the discretion of the analysts, e.g., two-thirds for training and one-third for testing.

Similar to the trade-off faced while partitioning D.train into D.tr and D.val in Section 3.5.1, choosing the right fraction of labeled data to be used for training and testing is non-trivial. If the size of D.train is small, the learned classification model may be improperly learned using an insufficient number of training instances, resulting in a biased estimation of generalization performance. On the other hand, if the size of D.test is small, errtest may be less reliable as it would be computed over a small number of test instances. Moreover, errtest can have a high variance as we change the random partitioning of D into D.train and D.test.

The holdout method can be repeated several times to obtain a distribution of the test error rates, an approach known as random subsampling or the repeated holdout method. This method produces a distribution of the error rates that can be used to understand the variance of errtest.
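A minimal sketch of the holdout and repeated holdout (random subsampling) methods is shown below, assuming scikit-learn and a synthetic data set; the ten random splits give a rough sense of the variance of errtest.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=900, random_state=0)

# Repeated holdout: repeat the 2/3 - 1/3 split several times to obtain
# a distribution of test error rates.
err_test = []
for seed in range(10):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=1/3, random_state=seed)
    model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
    err_test.append(1.0 - model.score(X_test, y_test))

print(np.mean(err_test), np.std(err_test))   # mean and spread of the estimates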

3.6.2Cross-Validation

Cross-validation is a widely-used model evaluation method that aims to make effective use of all labeled instances in D for both training and testing. To illustrate this method, suppose that we are given a labeled set that we have randomly partitioned into three equal-sized subsets, S1, S2, and S3, as shown in Figure 3.33. For the first run, we train a model using subsets S2 and S3 (shown as empty blocks) and test the model on subset S1. The test error rate on S1, denoted as err(S1), is thus computed in the first run. Similarly, for the second run, we use S1 and S3 as the training set and S2 as the test set, to compute the test error rate, err(S2), on S2. Finally, we use S1 and S2 for training in the third run, while S3 is used for testing, thus resulting in the test error rate err(S3) for S3. The overall test error rate is obtained by summing up the number of errors committed in each test subset across all runs and dividing it by the total number of instances. This approach is called three-fold cross-validation.

Figure3.33.Exampledemonstratingthetechniqueof3-foldcross-validation.

The k-fold cross-validation method generalizes this approach by segmenting the labeled data D (of size N) into k equal-sized partitions (or folds). During the i-th run, one of the partitions of D is chosen as D.test(i) for testing, while the rest of the partitions are used as D.train(i) for training. A model m(i) is learned using D.train(i) and applied on D.test(i) to obtain the sum of test errors, errsum(i). This procedure is repeated k times. The total test error rate, errtest, is then computed as

errtest = ( Σ_{i=1}^{k} errsum(i) ) / N.   (3.14)

Every instance in the data is thus used for testing exactly once and for training exactly (k − 1) times. Also, every run uses a (k − 1)/k fraction of the data for training and a 1/k fraction for testing.

The right choice of k in k-fold cross-validation depends on a number of characteristics of the problem. A small value of k will result in a smaller training set at every run, which will result in a larger estimate of the generalization error rate than what is expected of a model trained over the entire labeled set. On the other hand, a high value of k results in a larger training set at every run, which reduces the bias in the estimate of the generalization error rate. In the extreme case, when k = N, every run uses exactly one data instance for testing and the remainder of the data for training. This special case of k-fold cross-validation is called the leave-one-out approach. This approach has the advantage of utilizing as much data as possible for training. However, leave-one-out can produce quite misleading results in some special scenarios, as illustrated in Exercise 11. Furthermore, leave-one-out can be computationally expensive for large data sets as the cross-validation procedure needs to be repeated N times. For most practical applications, a choice of k between 5 and 10 provides a reasonable approach for estimating the generalization error rate, because each fold is able to make use of 80% to 90% of the labeled data for training.
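The following sketch computes errtest exactly as in Equation 3.14, summing the errors over the test folds and dividing by N (synthetic data and a fixed-depth decision tree are used only for illustration).

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
N, k = len(y), 10

# k-fold cross-validation: accumulate errsum(i) over the k test folds,
# then divide by the total number of instances (Equation 3.14).
total_errors = 0
for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
    model = DecisionTreeClassifier(max_depth=3, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    total_errors += np.sum(model.predict(X[test_idx]) != y[test_idx])

err_test = total_errors / N
print(err_test)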

The k-fold cross-validation method, as described above, produces a single estimate of the generalization error rate, without providing any information about the variance of the estimate. To obtain this information, we can run k-fold cross-validation for every possible partitioning of the data into k partitions,

andobtainadistributionoftesterrorratescomputedforeverysuchpartitioning.Theaveragetesterrorrateacrossallpossiblepartitioningsservesasamorerobustestimateofgeneralizationerrorrate.Thisapproachofestimatingthegeneralizationerrorrateanditsvarianceisknownasthecompletecross-validationapproach.Eventhoughsuchanestimateisquiterobust,itisusuallytooexpensivetoconsiderallpossiblepartitioningsofalargedatasetintokpartitions.Amorepracticalsolutionistorepeatthecross-validationapproachmultipletimes,usingadifferentrandompartitioningofthedataintokpartitionsateverytime,andusetheaveragetesterrorrateastheestimateofgeneralizationerrorrate.Notethatsincethereisonlyonepossiblepartitioningfortheleave-one-outapproach,itisnotpossibletoestimatethevarianceofgeneralizationerrorrate,whichisanotherlimitationofthismethod.

Thek-foldcross-validationdoesnotguaranteethatthefractionofpositiveandnegativeinstancesineverypartitionofthedataisequaltothefractionobservedintheoveralldata.Asimplesolutiontothisproblemistoperformastratifiedsamplingofthepositiveandnegativeinstancesintokpartitions,anapproachcalledstratifiedcross-validation.
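With scikit-learn, stratification amounts to replacing KFold with StratifiedKFold; the sketch below (on an imbalanced synthetic data set, chosen only for illustration) shows that each test fold preserves the overall class proportions.

from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)

# Stratified k-fold: each fold preserves (approximately) the 90/10 class
# proportions of the overall labeled set, unlike a plain random partition.
for train_idx, test_idx in StratifiedKFold(n_splits=5, shuffle=True,
                                           random_state=0).split(X, y):
    print(y[test_idx].mean())   # fraction of positive instances in each test fold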

In k-fold cross-validation, a different model is learned at every run and the performance of every one of the k models on their respective test folds is then aggregated to compute the overall test error rate, errtest. Hence, errtest does not reflect the generalization error rate of any of the k models. Instead, it reflects the expected generalization error rate of the model selection approach, when applied on a training set of the same size as one of the training folds (N(k − 1)/k). This is different than the errtest computed in the holdout method, which exactly corresponds to the specific model learned over D.train. Hence, although effectively utilizing every data instance in D for training and testing, the errtest computed in the cross-validation method does not represent the performance of a single model learned over a specific D.train.

Nonetheless, in practice, errtest is typically used as an estimate of the generalization error of a model built on D. One motivation for this is that when the size of the training folds is closer to the size of the overall data (when k is large), then errtest resembles the expected performance of a model learned over a data set of the same size as D. For example, when k is 10, every training fold is 90% of the overall data. The errtest then should approach the expected performance of a model learned over 90% of the overall data, which will be close to the expected performance of a model learned over D.

3.7 Presence of Hyper-parameters

Hyper-parameters are parameters of learning algorithms that need to be determined before learning the classification model. For instance, consider the hyper-parameter α that appeared in Equation 3.11, which is repeated here for convenience:

gen.error(m) = train.error(m, D.train) + α × complexity(M)

This equation was used for estimating the generalization error for a model selection approach that used an explicit representation of model complexity. (See Section 3.5.2.)

For other examples of hyper-parameters, see Chapter 4.

Unlike regular model parameters, such as the test conditions in the internal nodes of a decision tree, hyper-parameters such as α do not appear in the final classification model that is used to classify unlabeled instances. However, the values of hyper-parameters need to be determined during model selection, a process known as hyper-parameter selection, and must be taken into account during model evaluation. Fortunately, both tasks can be effectively accomplished via slight modifications of the cross-validation approach described in the previous section.

3.7.1 Hyper-parameter Selection

In Section 3.5.2, a validation set was used to select α, and this approach is generally applicable for hyper-parameter selection. Let p be the hyper-parameter that needs to be selected from a finite range of values, P = {p1, p2, …, pn}. Partition D.train into D.tr and D.val. For every choice of hyper-parameter value pi, we can learn a model mi on D.tr, and apply this model on D.val to obtain the validation error rate errval(pi). Let p* be the hyper-parameter value that provides the lowest validation error rate. We can then use the model m* corresponding to p* as the final choice of classification model.

The above approach, although useful, uses only a subset of the data D.train for training and a subset, D.val, for validation. The framework of cross-validation, presented in Section 3.6.2, addresses both of those issues, albeit in the context of model evaluation. Here we indicate how to use a cross-validation approach for hyper-parameter selection. To illustrate this approach, let us partition D.train into three folds as shown in Figure 3.34. At every run, one of the folds is used as D.val for validation, and the remaining two folds are used as D.tr for learning a model, for every choice of hyper-parameter value pi. The overall validation error rate corresponding to each pi is computed by summing the errors across all the three folds. We then select the hyper-parameter value p* that provides the lowest validation error rate, and use it to learn a model m* on the entire training set D.train.

Figure 3.34. Example demonstrating the 3-fold cross-validation framework for hyper-parameter selection using D.train.

Algorithm 3.2 generalizes the above approach using a k-fold cross-validation framework for hyper-parameter selection. At the i-th run of cross-validation, the data in the i-th fold is used as D.val(i) for validation (Step 4), while the remainder of the data in D.train is used as D.tr(i) for training (Step 5). Then, for every choice of hyper-parameter value pi, a model is learned on D.tr(i) (Step 7), which is applied on D.val(i) to compute its validation error (Step 8). This is used to compute the validation error rate corresponding to models learned using pi over all the folds (Step 11). The hyper-parameter value p* that provides the lowest validation error rate (Step 12) is now used to learn the final model m* on the entire training set D.train (Step 13). Hence, at the end of this algorithm, we obtain the best choice of the hyper-parameter value as well as the final classification model (Step 14), both of which are obtained by making an effective use of every data instance in D.train.

Algorithm 3.2 Procedure model-select(k, P, D.train)
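The pseudocode of Algorithm 3.2 is not reproduced here, but the procedure described above can be sketched in a few lines of Python. The choice of max_depth as the hyper-parameter p, and the synthetic data, are assumptions made only for illustration.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

def model_select(k, P, X_train, y_train):
    # Sketch of the model-select procedure described above: for each candidate
    # hyper-parameter value p in P, accumulate validation errors over k folds,
    # then refit the best value p* on all of D.train to obtain m*.
    fold_errors = {p: 0 for p in P}
    for tr_idx, val_idx in KFold(n_splits=k, shuffle=True,
                                 random_state=0).split(X_train):
        for p in P:
            m = DecisionTreeClassifier(max_depth=p, random_state=0)
            m.fit(X_train[tr_idx], y_train[tr_idx])
            fold_errors[p] += np.sum(m.predict(X_train[val_idx]) != y_train[val_idx])
    p_star = min(fold_errors, key=fold_errors.get)      # lowest validation error rate
    m_star = DecisionTreeClassifier(max_depth=p_star,
                                    random_state=0).fit(X_train, y_train)
    return p_star, m_star

X, y = make_classification(n_samples=300, random_state=0)
p_star, m_star = model_select(3, [1, 2, 4, 8], X, y)
print(p_star)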

3.7.2 Nested Cross-Validation

The approach of the previous section provides a way to effectively use all the instances in D.train to learn a classification model when hyper-parameter selection is required. This approach can be applied over the entire data set D to learn the final classification model. However, applying Algorithm 3.2 on D would only return the final classification model m* but not an estimate of its generalization performance, errtest. Recall that the validation error rates used in Algorithm 3.2 cannot be used as estimates of generalization performance, since they are used to guide the selection of the final model m*. However, to compute errtest, we can again use a cross-validation framework for evaluating the performance on the entire data set D, as described originally in Section 3.6.2. In this approach, D is partitioned into D.train (for training) and D.test (for testing) at every run of cross-validation. When hyper-parameters are involved, we can use Algorithm 3.2 to train a model using D.train at every run, thus "internally" using cross-validation for model selection. This approach is called nested cross-validation or double cross-validation. Algorithm 3.3 describes the complete approach for estimating errtest using nested cross-validation in the presence of hyper-parameters.

As an illustration of this approach, see Figure 3.35, where the labeled set D is partitioned into D.train and D.test, using a 3-fold cross-validation method.

Figure 3.35. Example demonstrating 3-fold nested cross-validation for computing errtest.

At the i-th run of this method, one of the folds is used as the test set, D.test(i), while the remaining two folds are used as the training set, D.train(i). This is represented in Figure 3.35 as the i-th "outer" run. In order to select a model using D.train(i), we again use an "inner" 3-fold cross-validation framework that partitions D.train(i) into D.tr and D.val at every one of the three inner runs (iterations). As described in Section 3.7, we can use the inner cross-validation framework to select the best hyper-parameter value p*(i) as well as its resulting classification model m*(i) learned over D.train(i). We can then apply m*(i) on D.test(i) to obtain the test error at the i-th outer run. By repeating this process for every outer run, we can compute the average test error rate, errtest, over the entire labeled set D. Note that in the above approach, the inner cross-validation framework is being used for model selection while the outer cross-validation framework is being used for model evaluation.

Algorithm 3.3 The nested cross-validation approach for computing errtest.
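A compact way to realize nested cross-validation in Python is to place a grid search (the inner, model selection loop) inside an outer evaluation loop. The sketch below uses scikit-learn's GridSearchCV and cross_val_score for this purpose; the max_depth grid and the synthetic data are again only illustrative.

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

# Inner cross-validation (model selection): pick the best max_depth on each
# outer training fold D.train(i) and refit on that fold.
inner = GridSearchCV(DecisionTreeClassifier(random_state=0),
                     param_grid={"max_depth": [1, 2, 4, 8]},
                     cv=KFold(n_splits=3, shuffle=True, random_state=1))

# Outer cross-validation (model evaluation): each outer test fold D.test(i)
# is never seen by the inner selection; the mean error estimates errtest.
outer_scores = cross_val_score(inner, X, y,
                               cv=KFold(n_splits=3, shuffle=True, random_state=2))
err_test = 1.0 - outer_scores.mean()
print(err_test)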

3.8 Pitfalls of Model Selection and Evaluation

Model selection and evaluation, when used effectively, serve as excellent tools for learning classification models and assessing their generalization performance. However, when using them in practical settings, there are several pitfalls that can result in improper and often misleading conclusions. Some of these pitfalls are simple to understand and easy to avoid, while others are quite subtle in nature and difficult to catch. In the following, we present two of these pitfalls and discuss best practices to avoid them.

3.8.1OverlapbetweenTrainingandTestSets

One of the basic requirements of a clean model selection and evaluation setup is that the data used for model selection (D.train) must be kept separate from the data used for model evaluation (D.test). If there is any overlap between the two, the test error rate errtest computed over D.test cannot be considered representative of the performance on unseen instances. Comparing the effectiveness of classification models using errtest can then be quite misleading, as an overly complex model can show an inaccurately low value of errtest due to model overfitting (see Exercise 12 at the end of this chapter).

ToillustratetheimportanceofensuringnooverlapbetweenD.trainandD.test,consideralabeleddatasetwherealltheattributesareirrelevant,i.e.theyhavenorelationshipwiththeclasslabels.Usingsuchattributes,weshouldexpectnoclassificationmodeltoperformbetterthanrandomguessing.However,ifthetestsetinvolvesevenasmallnumberofdatainstancesthatwereusedfortraining,thereisapossibilityforanoverlycomplexmodeltoshowbetterperformancethanrandom,eventhoughtheattributesarecompletelyirrelevant.AswewillseelaterinChapter10 ,thisscenariocanactuallybeusedasacriteriontodetectoverfittingduetoimpropersetupofexperiment.Ifamodelshowsbetterperformancethanarandomclassifierevenwhentheattributesareirrelevant,itisanindicationofapotentialfeedbackbetweenthetrainingandtestsets.

3.8.2UseofValidationErrorasGeneralizationError

The validation error rate errval serves an important role during model selection, as it provides "out-of-sample" error estimates of models on D.val, which is not used for training the models. Hence, errval serves as a better metric than the training error rate for selecting models and hyper-parameter values, as described in Sections 3.5.1 and 3.7, respectively. However, once the validation set has been used for selecting a classification model m*, errval no longer reflects the performance of m* on unseen instances.

To realize the pitfall in using the validation error rate as an estimate of generalization performance, consider the problem of selecting a hyper-parameter value p from a range of values, P, using a validation set D.val. If the number of possible values in P is quite large and the size of D.val is small, it is possible to select a hyper-parameter value p* that shows favorable performance on D.val just by random chance. Notice the similarity of this problem with the multiple comparisons problem discussed in Section 3.4.1. Even though the classification model m* learned using p* would show a low validation error rate, it would lack generalizability on unseen test instances.

The correct approach for estimating the generalization error rate of a model m* is to use an independently chosen test set D.test that hasn't been used in any way to influence the selection of m*. As a rule of thumb, the test set should never be examined during model selection, to ensure the absence of any form of overfitting. If the insights gained from any portion of a labeled data set help in improving the classification model even in an indirect way, then that portion of data must be discarded during testing.

3.9 Model Comparison

One difficulty when comparing the performance of different classification models is whether the observed difference in their performance is statistically significant. For example, consider a pair of classification models, MA and MB. Suppose MA achieves 85% accuracy when evaluated on a test set containing 30 instances, while MB achieves 75% accuracy on a different test set containing 5000 instances. Based on this information, is MA a better model than MB? This example raises two key questions regarding the statistical significance of a performance metric:

1. Although MA has a higher accuracy than MB, it was tested on a smaller test set. How much confidence do we have that the accuracy for MA is actually 85%?

2. Is it possible to explain the difference in accuracies between MA and MB as a result of variations in the composition of their test sets?

The first question relates to the issue of estimating the confidence interval of model accuracy. The second question relates to the issue of testing the statistical significance of the observed deviation. These issues are investigated in the remainder of this section.

3.9.1 Estimating the Confidence Interval for Accuracy

To determine its confidence interval, we need to establish the probability distribution for sample accuracy. This section describes an approach for deriving the confidence interval by modeling the classification task as a binomial random experiment. The following describes the characteristics of such an experiment:

1. The random experiment consists of N independent trials, where each trial has two possible outcomes: success or failure.

2. The probability of success, p, in each trial is constant.

An example of a binomial experiment is counting the number of heads that turn up when a coin is flipped N times. If X is the number of successes observed in N trials, then the probability that X takes a particular value is given by a binomial distribution with mean Np and variance Np(1 − p):

P(X = v) = (N choose v) p^v (1 − p)^(N−v).

For example, if the coin is fair (p = 0.5) and is flipped fifty times, then the probability that the head shows up 20 times is

P(X = 20) = (50 choose 20) 0.5^20 (1 − 0.5)^30 = 0.0419.

If the experiment is repeated many times, then the average number of heads expected to show up is 50 × 0.5 = 25, while its variance is 50 × 0.5 × 0.5 = 12.5.

The task of predicting the class labels of test instances can also be considered as a binomial experiment. Given a test set that contains N instances, let X be the number of instances correctly predicted by a model and p be the true accuracy of the model. If the prediction task is modeled as a binomial experiment, then X has a binomial distribution with mean Np and variance Np(1 − p). It can be shown that the empirical accuracy, acc = X/N, also has a binomial distribution with mean p and variance p(1 − p)/N (see Exercise 14). The binomial distribution can be approximated by a normal distribution when N is sufficiently large. Based on the normal distribution, the confidence interval for acc can be derived as follows:

P( −Zα/2 ≤ (acc − p) / √(p(1 − p)/N) ≤ Z1−α/2 ) = 1 − α,   (3.15)

where Zα/2 and Z1−α/2 are the upper and lower bounds obtained from a standard normal distribution at confidence level (1 − α). Since a standard normal distribution is symmetric around Z = 0, it follows that Zα/2 = Z1−α/2. Rearranging this inequality leads to the following confidence interval for p:

( 2 × N × acc + Zα/2² ± Zα/2 √( Zα/2² + 4 N acc − 4 N acc² ) ) / ( 2 (N + Zα/2²) ).   (3.16)

The following table shows the values of Zα/2 at different confidence levels:

1 − α:   0.99   0.98   0.95   0.9    0.8    0.7    0.5
Zα/2:    2.58   2.33   1.96   1.65   1.28   1.04   0.67

Example 3.11 (Confidence Interval for Accuracy). Consider a model that has an accuracy of 80% when evaluated on 100 test instances. What is the confidence interval for its true accuracy at a 95% confidence level? The confidence level of 95% corresponds to Zα/2 = 1.96 according to the table given above. Inserting this term into Equation 3.16 yields a confidence interval between 71.1% and 86.7%. The following table shows the confidence interval when the number of instances, N, increases:

N:                    20           50           100          500          1000         5000
Confidence Interval:  0.584–0.919  0.670–0.888  0.711–0.867  0.763–0.833  0.774–0.824  0.789–0.811

Note that the confidence interval becomes tighter when N increases.
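Equation 3.16 is straightforward to implement. The sketch below (assuming scipy for the normal quantile) reproduces the intervals in the table above, e.g., 0.711 to 0.867 for N = 100.

from math import sqrt
from scipy.stats import norm

def accuracy_interval(acc, N, confidence=0.95):
    # Confidence interval for the true accuracy p (Equation 3.16),
    # based on the normal approximation to the binomial distribution.
    z = norm.ppf(1 - (1 - confidence) / 2)          # e.g., ~1.96 at 95% confidence
    center = 2 * N * acc + z**2
    spread = z * sqrt(z**2 + 4 * N * acc - 4 * N * acc**2)
    denom = 2 * (N + z**2)
    return (center - spread) / denom, (center + spread) / denom

for n in (20, 50, 100, 500, 1000, 5000):
    lo, hi = accuracy_interval(0.80, n)
    print(n, round(lo, 3), round(hi, 3))            # e.g., (0.711, 0.867) for N = 100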

3.9.2 Comparing the Performance of Two Models

Consider a pair of models, M1 and M2, which are evaluated on two independent test sets, D1 and D2. Let n1 denote the number of instances in D1 and n2 denote the number of instances in D2. In addition, suppose the error rate for M1 on D1 is e1 and the error rate for M2 on D2 is e2. Our goal is to test whether the observed difference between e1 and e2 is statistically significant.

Assuming that n1 and n2 are sufficiently large, the error rates e1 and e2 can be approximated using normal distributions. If the observed difference in the error rate is denoted as d = e1 − e2, then d is also normally distributed with mean dt, its true difference, and variance σd². The variance of d can be computed as follows:

σd² ≃ σ̂d² = e1(1 − e1)/n1 + e2(1 − e2)/n2,   (3.17)

where e1(1 − e1)/n1 and e2(1 − e2)/n2 are the variances of the error rates. Finally, at the (1 − α)% confidence level, it can be shown that the confidence interval for the true difference dt is given by the following equation:

dt = d ± zα/2 σ̂d.   (3.18)

Example 3.12 (Significance Testing). Consider the problem described at the beginning of this section. Model MA has an error rate of e1 = 0.15 when applied to N1 = 30 test instances, while model MB has an error rate of e2 = 0.25 when applied to N2 = 5000 test instances. The observed difference in their error rates is d = |0.15 − 0.25| = 0.1. In this example, we are performing a two-sided test to check whether dt = 0 or dt ≠ 0. The estimated variance of the observed difference in error rates can be computed as follows:

σ̂d² = 0.15(1 − 0.15)/30 + 0.25(1 − 0.25)/5000 = 0.0043,

or σ̂d = 0.0655. Inserting this value into Equation 3.18, we obtain the following confidence interval for dt at the 95% confidence level:

dt = 0.1 ± 1.96 × 0.0655 = 0.1 ± 0.128.

As the interval spans the value zero, we can conclude that the observed difference is not statistically significant at a 95% confidence level.

At what confidence level can we reject the hypothesis that dt = 0? To do this, we need to determine the value of Zα/2 such that the confidence interval for dt does not span the value zero. We can reverse the preceding computation and look for the value Zα/2 such that d > Zα/2 σ̂d. Replacing the values of d and σ̂d gives Zα/2 < 1.527. This value first occurs when (1 − α) ≲ 0.936 (for a two-sided test). The result suggests that the null hypothesis can be rejected at a confidence level of 93.6% or lower.
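The computations of Equations 3.17 and 3.18 can be packaged into a small helper; the sketch below (assuming scipy for the normal quantile) reproduces the interval 0.1 ± 0.128 obtained in this example.

from math import sqrt
from scipy.stats import norm

def difference_interval(e1, n1, e2, n2, confidence=0.95):
    # Confidence interval (Equations 3.17 and 3.18) for the true difference
    # d_t between the error rates of two models tested on independent sets.
    d = abs(e1 - e2)
    var_d = e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2
    z = norm.ppf(1 - (1 - confidence) / 2)
    return d - z * sqrt(var_d), d + z * sqrt(var_d)

lo, hi = difference_interval(0.15, 30, 0.25, 5000)
print(round(lo, 3), round(hi, 3))      # roughly (-0.028, 0.228): spans zero,
                                       # so the difference is not significant at 95%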

3.10 Bibliographic Notes

Early classification systems were developed to organize various collections of objects, from living organisms to inanimate ones. Examples abound, from Aristotle's cataloguing of species to the Dewey Decimal and Library of Congress classification systems for books. Such a task typically requires considerable human effort, both to identify properties of the objects to be classified and to organize them into well-distinguished categories.

Withthedevelopmentofstatisticsandcomputing,automatedclassificationhasbeenasubjectofintensiveresearch.Thestudyofclassificationinclassicalstatisticsissometimesknownasdiscriminantanalysis,wheretheobjectiveistopredictthegroupmembershipofanobjectbasedonitscorrespondingfeatures.Awell-knownclassicalmethodisFisher'slineardiscriminantanalysis[142],whichseekstofindalinearprojectionofthedatathatproducesthebestseparationbetweenobjectsfromdifferentclasses.

Manypatternrecognitionproblemsalsorequirethediscriminationofobjectsfromdifferentclasses.Examplesincludespeechrecognition,handwrittencharacteridentification,andimageclassification.ReaderswhoareinterestedintheapplicationofclassificationtechniquesforpatternrecognitionmayrefertothesurveyarticlesbyJainetal.[150]andKulkarnietal.[157]orclassicpatternrecognitionbooksbyBishop[125],Dudaetal.[137],andFukunaga[143].Thesubjectofclassificationisalsoamajorresearchtopicinneuralnetworks,statisticallearning,andmachinelearning.Anin-depthtreatmentonthetopicofclassificationfromthestatisticalandmachinelearningperspectivescanbefoundinthebooksbyBishop[126],CherkasskyandMulier[132],Hastieetal.[148],Michieetal.[162],Murphy[167],andMitchell[165].Recentyearshavealsoseenthereleaseofmanypubliclyavailable

softwarepackagesforclassification,whichcanbeembeddedinprogramminglanguagessuchasJava(Weka[147])andPython(scikit-learn[174]).

An overview of decision tree induction algorithms can be found in the survey articles by Buntine [129], Moret [166], Murthy [168], and Safavian et al. [179]. Examples of some well-known decision tree algorithms include CART [127], ID3 [175], C4.5 [177], and CHAID [153]. Both ID3 and C4.5 employ the entropy measure as their splitting function. An in-depth discussion of the C4.5 decision tree algorithm is given by Quinlan [177]. The CART algorithm was developed by Breiman et al. [127] and uses the Gini index as its splitting function. CHAID [153] uses the statistical χ2 test to determine the best split during the tree-growing process.

Thedecisiontreealgorithmpresentedinthischapterassumesthatthesplittingconditionateachinternalnodecontainsonlyoneattribute.Anobliquedecisiontreecanusemultipleattributestoformtheattributetestconditioninasinglenode[149,187].Breimanetal.[127]provideanoptionforusinglinearcombinationsofattributesintheirCARTimplementation.OtherapproachesforinducingobliquedecisiontreeswereproposedbyHeathetal.[149],Murthyetal.[169],Cantú-PazandKamath[130],andUtgoffandBrodley[187].Althoughanobliquedecisiontreehelpstoimprovetheexpressivenessofthemodelrepresentation,thetreeinductionprocessbecomescomputationallychallenging.Anotherwaytoimprovetheexpressivenessofadecisiontreewithoutusingobliquedecisiontreesistoapplyamethodknownasconstructiveinduction[161].Thismethodsimplifiesthetaskoflearningcomplexsplittingfunctionsbycreatingcompoundfeaturesfromtheoriginaldata.

Besidesthetop-downapproach,otherstrategiesforgrowingadecisiontreeincludethebottom-upapproachbyLandeweerdetal.[159]andPattipatiandAlexandridis[173],aswellasthebidirectionalapproachbyKimand


Landgrebe[154].SchuermannandDoster[181]andWangandSuen[193]proposedusingasoftsplittingcriteriontoaddressthedatafragmentationproblem.Inthisapproach,eachinstanceisassignedtodifferentbranchesofthedecisiontreewithdifferentprobabilities.

Modeloverfittingisanimportantissuethatmustbeaddressedtoensurethatadecisiontreeclassifierperformsequallywellonpreviouslyunlabeleddatainstances.ThemodeloverfittingproblemhasbeeninvestigatedbymanyauthorsincludingBreimanetal.[127],Schaffer[180],Mingers[164],andJensenandCohen[151].Whilethepresenceofnoiseisoftenregardedasoneoftheprimaryreasonsforoverfitting[164,170],JensenandCohen[151]viewedoverfittingasanartifactoffailuretocompensateforthemultiplecomparisonsproblem.

Bishop[126]andHastieetal.[148]provideanexcellentdiscussionofmodeloverfitting,relatingittoawell-knownframeworkoftheoreticalanalysis,knownasbias-variancedecomposition[146].Inthisframework,thepredictionofalearningalgorithmisconsideredtobeafunctionofthetrainingset,whichvariesasthetrainingsetischanged.Thegeneralizationerrorofamodelisthendescribedintermsofitsbias(theerroroftheaveragepredictionobtainedusingdifferenttrainingsets),itsvariance(howdifferentarethepredictionsobtainedusingdifferenttrainingsets),andnoise(theirreducibleerrorinherenttotheproblem).Anunderfitmodelisconsideredtohavehighbiasbutlowvariance,whileanoverfitmodelisconsideredtohavelowbiasbuthighvariance.Althoughthebias-variancedecompositionwasoriginallyproposedforregressionproblems(wherethetargetattributeisacontinuousvariable),aunifiedanalysisthatisapplicableforclassificationhasbeenproposedbyDomingos[136].ThebiasvariancedecompositionwillbediscussedinmoredetailwhileintroducingensemblelearningmethodsinChapter4 .

Variouslearningprinciples,suchastheProbablyApproximatelyCorrect(PAC)learningframework[188],havebeendevelopedtoprovideatheoreticalframeworkforexplainingthegeneralizationperformanceoflearningalgorithms.Inthefieldofstatistics,anumberofperformanceestimationmethodshavebeenproposedthatmakeatrade-offbetweenthegoodnessoffitofamodelandthemodelcomplexity.MostnoteworthyamongthemaretheAkaike'sInformationCriterion[120]andtheBayesianInformationCriterion[182].Theybothapplycorrectivetermstothetrainingerrorrateofamodel,soastopenalizemorecomplexmodels.Anotherwidely-usedapproachformeasuringthecomplexityofanygeneralmodelistheVapnikChervonenkis(VC)Dimension[190].TheVCdimensionofaclassoffunctionsCisdefinedasthemaximumnumberofpointsthatcanbeshattered(everypointcanbedistinguishedfromtherest)byfunctionsbelongingtoC,foranypossibleconfigurationofpoints.TheVCdimensionlaysthefoundationofthestructuralriskminimizationprinciple[189],whichisextensivelyusedinmanylearningalgorithms,e.g.,supportvectormachines,whichwillbediscussedindetailinChapter4 .

TheOccam'srazorprincipleisoftenattributedtothephilosopherWilliamofOccam.Domingos[135]cautionedagainstthepitfallofmisinterpretingOccam'srazorascomparingmodelswithsimilartrainingerrors,insteadofgeneralizationerrors.Asurveyondecisiontree-pruningmethodstoavoidoverfittingisgivenbyBreslowandAha[128]andEspositoetal.[141].Someofthetypicalpruningmethodsincludereducederrorpruning[176],pessimisticerrorpruning[176],minimumerrorpruning[171],criticalvaluepruning[163],cost-complexitypruning[127],anderror-basedpruning[177].QuinlanandRivestproposedusingtheminimumdescriptionlengthprinciplefordecisiontreepruningin[178].

Thediscussionsinthischapteronthesignificanceofcross-validationerrorestimatesisinspiredfromChapter7 inHastieetal.[148].Itisalsoan

excellentresourceforunderstanding“therightandwrongwaystodocross-validation”,whichissimilartothediscussiononpitfallsinSection3.8 ofthischapter.Acomprehensivediscussionofsomeofthecommonpitfallsinusingcross-validationformodelselectionandevaluationisprovidedinKrstajicetal.[156].

Theoriginalcross-validationmethodwasproposedindependentlybyAllen[121],Stone[184],andGeisser[145]formodelassessment(evaluation).Eventhoughcross-validationcanbeusedformodelselection[194],itsusageformodelselectionisquitedifferentthanwhenitisusedformodelevaluation,asemphasizedbyStone[184].Overtheyears,thedistinctionbetweenthetwousageshasoftenbeenignored,resultinginincorrectfindings.Oneofthecommonmistakeswhileusingcross-validationistoperformpre-processingoperations(e.g.,hyper-parametertuningorfeatureselection)usingtheentiredatasetandnot“within”thetrainingfoldofeverycross-validationrun.Ambroiseetal.,usinganumberofgeneexpressionstudiesasexamples,[124]provideanextensivediscussionoftheselectionbiasthatariseswhenfeatureselectionisperformedoutsidecross-validation.UsefulguidelinesforevaluatingmodelsonmicroarraydatahavealsobeenprovidedbyAllisonetal.[122].

Theuseofthecross-validationprotocolforhyper-parametertuninghasbeendescribedindetailbyDudoitandvanderLaan[138].Thisapproachhasbeencalled“grid-searchcross-validation.”Thecorrectapproachinusingcross-validationforbothhyper-parameterselectionandmodelevaluation,asdiscussedinSection3.7 ofthischapter,isextensivelycoveredbyVarmaandSimon[191].Thiscombinedapproachhasbeenreferredtoas“nestedcross-validation”or“doublecross-validation”intheexistingliterature.Recently,TibshiraniandTibshirani[185]haveproposedanewapproachforhyper-parameterselectionandmodelevaluation.Tsamardinosetal.[186]comparedthisapproachtonestedcross-validation.Theexperimentsthey

performedfoundthat,onaverage,bothapproachesprovideconservativeestimatesofmodelperformancewiththeTibshiraniandTibshiraniapproachbeingmorecomputationallyefficient.

Kohavi[155]hasperformedanextensiveempiricalstudytocomparetheperformancemetricsobtainedusingdifferentestimationmethodssuchasrandomsubsamplingandk-foldcross-validation.Theirresultssuggestthatthebestestimationmethodisten-fold,stratifiedcross-validation.

An alternative approach for model evaluation is the bootstrap method, which was presented by Efron in 1979 [139]. In this method, training instances are sampled with replacement from the labeled set, i.e., an instance previously selected to be part of the training set is equally likely to be drawn again. If the original data has N instances, it can be shown that, on average, a bootstrap sample of size N contains about 63.2% of the instances in the original data. Instances that are not included in the bootstrap sample become part of the test set. The bootstrap procedure for obtaining training and test sets is repeated b times, resulting in a different error rate on the test set, err(i), at the i-th run. To obtain the overall error rate, errboot, the .632 bootstrap approach combines err(i) with the error rate obtained from a training set containing all the labeled examples, errs, as follows:

errboot = (1/b) Σ_{i=1}^{b} ( 0.632 × err(i) + 0.368 × errs ).   (3.19)

Efron and Tibshirani [140] provided a theoretical and empirical comparison between cross-validation and a bootstrap method known as the .632+ rule.

While the .632 bootstrap method presented above provides a robust estimate of the generalization performance with low variance in its estimate, it may produce misleading results for highly complex models in certain conditions, as demonstrated by Kohavi [155]. This is because the overall error rate is not truly an out-of-sample error estimate, as it depends on the training error rate, errs, which can be quite small if there is overfitting.

CurrenttechniquessuchasC4.5requirethattheentiretrainingdatasetfitintomainmemory.Therehasbeenconsiderableefforttodevelopparallelandscalableversionsofdecisiontreeinductionalgorithms.SomeoftheproposedalgorithmsincludeSLIQbyMehtaetal.[160],SPRINTbyShaferetal.[183],CMPbyWangandZaniolo[192],CLOUDSbyAlsabtietal.[123],RainForestbyGehrkeetal.[144],andScalParCbyJoshietal.[152].Asurveyofparallelalgorithmsforclassificationandotherdataminingtasksisgivenin[158].Morerecently,therehasbeenextensiveresearchtoimplementlarge-scaleclassifiersonthecomputeunifieddevicearchitecture(CUDA)[131,134]andMapReduce[133,172]platforms.


Bibliography[120]H.Akaike.Informationtheoryandanextensionofthemaximum

likelihoodprinciple.InSelectedPapersofHirotuguAkaike,pages199–213.Springer,1998.

[121]D.M.Allen.Therelationshipbetweenvariableselectionanddataagumentationandamethodforprediction.Technometrics,16(1):125–127,1974.

[122]D.B.Allison,X.Cui,G.P.Page,andM.Sabripour.Microarraydataanalysis:fromdisarraytoconsolidationandconsensus.Naturereviewsgenetics,7(1):55–65,2006.

[123]K.Alsabti,S.Ranka,andV.Singh.CLOUDS:ADecisionTreeClassifierforLargeDatasets.InProc.ofthe4thIntl.Conf.onKnowledgeDiscoveryandDataMining,pages2–8,NewYork,NY,August1998.

[124]C.AmbroiseandG.J.McLachlan.Selectionbiasingeneextractiononthebasisofmicroarraygene-expressiondata.Proceedingsofthenationalacademyofsciences,99(10):6562–6566,2002.

[125]C.M.Bishop.NeuralNetworksforPatternRecognition.OxfordUniversityPress,Oxford,U.K.,1995.

[126]C.M.Bishop.PatternRecognitionandMachineLearning.Springer,2006.

[127]L.Breiman,J.H.Friedman,R.Olshen,andC.J.Stone.ClassificationandRegressionTrees.Chapman&Hall,NewYork,1984.

[128]L.A.BreslowandD.W.Aha.SimplifyingDecisionTrees:ASurvey.KnowledgeEngineeringReview,12(1):1–40,1997.

[129]W.Buntine.Learningclassificationtrees.InArtificialIntelligenceFrontiersinStatistics,pages182–201.Chapman&Hall,London,1993.

[130]E.Cantú-PazandC.Kamath.Usingevolutionaryalgorithmstoinduceobliquedecisiontrees.InProc.oftheGeneticandEvolutionaryComputationConf.,pages1053–1060,SanFrancisco,CA,2000.

[131]B.Catanzaro,N.Sundaram,andK.Keutzer.Fastsupportvectormachinetrainingandclassificationongraphicsprocessors.InProceedingsofthe25thInternationalConferenceonMachineLearning,pages104–111,2008.

[132]V.CherkasskyandF.M.Mulier.LearningfromData:Concepts,Theory,andMethods.Wiley,2ndedition,2007.

[133]C.Chu,S.K.Kim,Y.-A.Lin,Y.Yu,G.Bradski,A.Y.Ng,andK.Olukotun.Map-reduceformachinelearningonmulticore.Advancesinneuralinformationprocessingsystems,19:281,2007.

[134]A.Cotter,N.Srebro,andJ.Keshet.AGPU-tailoredApproachforTrainingKernelizedSVMs.InProceedingsofthe17thACMSIGKDDInternationalConferenceonKnowledgeDiscoveryandDataMining,pages805–813,SanDiego,California,USA,2011.

[135]P.Domingos.TheRoleofOccam'sRazorinKnowledgeDiscovery.DataMiningandKnowledgeDiscovery,3(4):409–425,1999.

[136]P.Domingos.Aunifiedbias-variancedecomposition.InProceedingsof17thInternationalConferenceonMachineLearning,pages231–238,2000.

[137]R.O.Duda,P.E.Hart,andD.G.Stork.PatternClassification.JohnWiley&Sons,Inc.,NewYork,2ndedition,2001.

[138]S.DudoitandM.J.vanderLaan.Asymptoticsofcross-validatedriskestimationinestimatorselectionandperformanceassessment.StatisticalMethodology,2(2):131–154,2005.

[139]B.Efron.Bootstrapmethods:anotherlookatthejackknife.InBreakthroughsinStatistics,pages569–593.Springer,1992.

[140]B.EfronandR.Tibshirani.Cross-validationandtheBootstrap:EstimatingtheErrorRateofaPredictionRule.Technicalreport,StanfordUniversity,1995.

[141]F.Esposito,D.Malerba,andG.Semeraro.AComparativeAnalysisofMethodsforPruningDecisionTrees.IEEETrans.PatternAnalysisandMachineIntelligence,19(5):476–491,May1997.

[142]R.A.Fisher.Theuseofmultiplemeasurementsintaxonomicproblems.AnnalsofEugenics,7:179–188,1936.

[143]K.Fukunaga.IntroductiontoStatisticalPatternRecognition.AcademicPress,NewYork,1990.

[144]J.Gehrke,R.Ramakrishnan,andV.Ganti.RainForest—AFrameworkforFastDecisionTreeConstructionofLargeDatasets.DataMiningandKnowledgeDiscovery,4(2/3):127–162,2000.

[145]S.Geisser.Thepredictivesamplereusemethodwithapplications.JournaloftheAmericanStatisticalAssociation,70(350):320–328,1975.

[146]S.Geman,E.Bienenstock,andR.Doursat.Neuralnetworksandthebias/variancedilemma.Neuralcomputation,4(1):1–58,1992.

[147]M.Hall,E.Frank,G.Holmes,B.Pfahringer,P.Reutemann,andI.H.Witten.TheWEKADataMiningSoftware:AnUpdate.SIGKDDExplorations,11(1),2009.

[148]T.Hastie,R.Tibshirani,andJ.Friedman.TheElementsofStatisticalLearning:DataMining,Inference,andPrediction.Springer,2ndedition,2009.

[149]D.Heath,S.Kasif,andS.Salzberg.InductionofObliqueDecisionTrees.InProc.ofthe13thIntl.JointConf.onArtificialIntelligence,pages1002–1007,Chambery,France,August1993.

[150]A.K.Jain,R.P.W.Duin,andJ.Mao.StatisticalPatternRecognition:AReview.IEEETran.Patt.Anal.andMach.Intellig.,22(1):4–37,2000.

[151]D.JensenandP.R.Cohen.MultipleComparisonsinInductionAlgorithms.MachineLearning,38(3):309–338,March2000.

[152]M.V.Joshi,G.Karypis,andV.Kumar.ScalParC:ANewScalableandEfficientParallelClassificationAlgorithmforMiningLargeDatasets.InProc.of12thIntl.ParallelProcessingSymp.(IPPS/SPDP),pages573–579,Orlando,FL,April1998.

[153]G.V.Kass.AnExploratoryTechniqueforInvestigatingLargeQuantitiesofCategoricalData.AppliedStatistics,29:119–127,1980.

[154]B.KimandD.Landgrebe.Hierarchicaldecisionclassifiersinhigh-dimensionalandlargeclassdata.IEEETrans.onGeoscienceandRemoteSensing,29(4):518–528,1991.

[155]R.Kohavi.AStudyonCross-ValidationandBootstrapforAccuracyEstimationandModelSelection.InProc.ofthe15thIntl.JointConf.onArtificialIntelligence,pages1137–1145,Montreal,Canada,August1995.

[156]D.Krstajic,L.J.Buturovic,D.E.Leahy,andS.Thomas.Cross-validationpitfallswhenselectingandassessingregressionandclassificationmodels.Journalofcheminformatics,6(1):1,2014.

[157]S.R.Kulkarni,G.Lugosi,andS.S.Venkatesh.LearningPatternClassification—ASurvey.IEEETran.Inf.Theory,44(6):2178–2206,1998.

[158]V.Kumar,M.V.Joshi,E.-H.Han,P.N.Tan,andM.Steinbach.HighPerformanceDataMining.InHighPerformanceComputingforComputationalScience(VECPAR2002),pages111–125.Springer,2002.

[159]G.Landeweerd,T.Timmers,E.Gersema,M.Bins,andM.Halic.Binarytreeversussingleleveltreeclassificationofwhitebloodcells.PatternRecognition,16:571–577,1983.

[160]M.Mehta,R.Agrawal,andJ.Rissanen.SLIQ:AFastScalableClassifierforDataMining.InProc.ofthe5thIntl.Conf.onExtendingDatabaseTechnology,pages18–32,Avignon,France,March1996.

[161]R.S.Michalski.Atheoryandmethodologyofinductivelearning.ArtificialIntelligence,20:111–116,1983.

[162]D.Michie,D.J.Spiegelhalter,andC.C.Taylor.MachineLearning,NeuralandStatisticalClassification.EllisHorwood,UpperSaddleRiver,NJ,1994.

[163]J.Mingers.ExpertSystems—RuleInductionwithStatisticalData.JOperationalResearchSociety,38:39–47,1987.

[164]J.Mingers.Anempiricalcomparisonofpruningmethodsfordecisiontreeinduction.MachineLearning,4:227–243,1989.

[165]T.Mitchell.MachineLearning.McGraw-Hill,Boston,MA,1997.

[166]B.M.E.Moret.DecisionTreesandDiagrams.ComputingSurveys,14(4):593–623,1982.

[167]K.P.Murphy.MachineLearning:AProbabilisticPerspective.MITPress,2012.

[168]S.K.Murthy.AutomaticConstructionofDecisionTreesfromData:AMulti-DisciplinarySurvey.DataMiningandKnowledgeDiscovery,2(4):345–389,1998.

[169]S.K.Murthy,S.Kasif,andS.Salzberg.Asystemforinductionofobliquedecisiontrees.JofArtificialIntelligenceResearch,2:1–33,1994.

[170]T.Niblett.Constructingdecisiontreesinnoisydomains.InProc.ofthe2ndEuropeanWorkingSessiononLearning,pages67–78,Bled,Yugoslavia,May1987.

[171]T.NiblettandI.Bratko.LearningDecisionRulesinNoisyDomains.InResearchandDevelopmentinExpertSystemsIII,Cambridge,1986.

CambridgeUniversityPress.

[172]I.PalitandC.K.Reddy.Scalableandparallelboostingwithmapreduce.IEEETransactionsonKnowledgeandDataEngineering,24(10):1904–1916,2012.

[173]K.R.PattipatiandM.G.Alexandridis.Applicationofheuristicsearchandinformationtheorytosequentialfaultdiagnosis.IEEETrans.onSystems,Man,andCybernetics,20(4):872–887,1990.

[174]F.Pedregosa,G.Varoquaux,A.Gramfort,V.Michel,B.Thirion,O.Grisel,M.Blondel,P.Prettenhofer,R.Weiss,V.Dubourg,J.Vanderplas,A.Passos,D.Cournapeau,M.Brucher,M.Perrot,andE.Duchesnay.Scikit-learn:MachineLearninginPython.JournalofMachineLearningResearch,12:2825–2830,2011.

[175]J.R.Quinlan.Discoveringrulesbyinductionfromlargecollectionofexamples.InD.Michie,editor,ExpertSystemsintheMicroElectronicAge.EdinburghUniversityPress,Edinburgh,UK,1979.

[176]J.R.Quinlan.SimplifyingDecisionTrees.Intl.J.Man-MachineStudies,27:221–234,1987.

[177]J.R.Quinlan.C4.5:ProgramsforMachineLearning.Morgan-KaufmannPublishers,SanMateo,CA,1993.

[178]J.R.QuinlanandR.L.Rivest.InferringDecisionTreesUsingtheMinimumDescriptionLengthPrinciple.InformationandComputation,80(3):227–248,1989.

[179]S.R.SafavianandD.Landgrebe.ASurveyofDecisionTreeClassifierMethodology.IEEETrans.Systems,ManandCybernetics,22:660–674,May/June1998.

[180]C.Schaffer.Overfittingavoidenceasbias.MachineLearning,10:153–178,1993.

[181]J.SchuermannandW.Doster.Adecision-theoreticapproachinhierarchicalclassifierdesign.PatternRecognition,17:359–369,1984.

[182]G.Schwarzetal.Estimatingthedimensionofamodel.Theannalsofstatistics,6(2):461–464,1978.

[183]J.C.Shafer,R.Agrawal,andM.Mehta.SPRINT:AScalableParallelClassifierforDataMining.InProc.ofthe22ndVLDBConf.,pages544–555,Bombay,India,September1996.

[184]M.Stone.Cross-validatorychoiceandassessmentofstatisticalpredictions.JournaloftheRoyalStatisticalSociety.SeriesB(Methodological),pages111–147,1974.

[185] R. J. Tibshirani and R. Tibshirani. A bias correction for the minimum error rate in cross-validation. The Annals of Applied Statistics, pages 822–829, 2009.

[186]I.Tsamardinos,A.Rakhshani,andV.Lagani.Performance-estimationpropertiesofcross-validation-basedprotocolswithsimultaneoushyper-parameteroptimization.InHellenicConferenceonArtificialIntelligence,pages1–14.Springer,2014.

[187]P.E.UtgoffandC.E.Brodley.Anincrementalmethodforfindingmultivariatesplitsfordecisiontrees.InProc.ofthe7thIntl.Conf.onMachineLearning,pages58–65,Austin,TX,June1990.

[188]L.Valiant.Atheoryofthelearnable.CommunicationsoftheACM,27(11):1134–1142,1984.

[189]V.N.Vapnik.StatisticalLearningTheory.Wiley-Interscience,1998.

[190]V.N.VapnikandA.Y.Chervonenkis.Ontheuniformconvergenceofrelativefrequenciesofeventstotheirprobabilities.InMeasuresofComplexity,pages11–30.Springer,2015.

[191]S.VarmaandR.Simon.Biasinerrorestimationwhenusingcross-validationformodelselection.BMCbioinformatics,7(1):1,2006.

[192]H.WangandC.Zaniolo.CMP:AFastDecisionTreeClassifierUsingMultivariatePredictions.InProc.ofthe16thIntl.Conf.onDataEngineering,pages449–460,SanDiego,CA,March2000.

[193]Q.R.WangandC.Y.Suen.Largetreeclassifierwithheuristicsearchandglobaltraining.IEEETrans.onPatternAnalysisandMachineIntelligence,9(1):91–102,1987.

[194]Y.ZhangandY.Yang.Cross-validationforselectingamodelselectionprocedure.JournalofEconometrics,187(1):95–112,2015.

3.11 Exercises

1. Draw the full decision tree for the parity function of four Boolean attributes, A, B, C, and D. Is it possible to simplify the tree?

2. Consider the training examples shown in Table 3.5 for a binary classification problem.

Table 3.5. Data set for Exercise 2.

CustomerID Gender CarType ShirtSize Class

1 M Family Small C0

2 M Sports Medium C0

3 M Sports Medium C0

4 M Sports Large C0

5 M Sports ExtraLarge C0

6 M Sports ExtraLarge C0

7 F Sports Small C0

8 F Sports Small C0

9 F Sports Medium C0

10 F Luxury Large C0

11 M Family Large C1

12 M Family ExtraLarge C1

13 M Family Medium C1

14 M Luxury ExtraLarge C1

15 F Luxury Small C1

16 F Luxury Small C1

17 F Luxury Medium C1

18 F Luxury Medium C1

19 F Luxury Medium C1

20 F Luxury Large C1

a. Compute the Gini index for the overall collection of training examples.

b. Compute the Gini index for the Customer ID attribute.

c. Compute the Gini index for the Gender attribute.

d. Compute the Gini index for the Car Type attribute using multiway split.

e. Compute the Gini index for the Shirt Size attribute using multiway split.

f. Which attribute is better, Gender, Car Type, or Shirt Size?

g. Explain why Customer ID should not be used as the attribute test condition even though it has the lowest Gini.

3. Consider the training examples shown in Table 3.6 for a binary classification problem.

Table 3.6. Data set for Exercise 3.

Instance  a1  a2  a3   Target Class
1         T   T   1.0  +
2         T   T   6.0  +
3         T   F   5.0  −
4         F   F   4.0  +
5         F   T   7.0  −
6         F   T   3.0  −
7         F   F   8.0  −
8         T   F   7.0  +
9         F   T   5.0  −

a. What is the entropy of this collection of training examples with respect to the class attribute?

b. What are the information gains of a1 and a2 relative to these training examples?

c. For a3, which is a continuous attribute, compute the information gain for every possible split.

d. What is the best split (among a1, a2, and a3) according to the information gain?

e. What is the best split (between a1 and a2) according to the misclassification error rate?

f. What is the best split (between a1 and a2) according to the Gini index?

4. Show that the entropy of a node never increases after splitting it into smaller successor nodes.

5. Consider the following data set for a binary class problem.

A  B  Class Label
T  F  +
T  T  +
T  T  +
T  F  −
T  T  +
F  F  −
F  F  −
F  F  −
T  T  −
T  F  −

a. Calculate the information gain when splitting on A and B. Which attribute would the decision tree induction algorithm choose?

b. Calculate the gain in the Gini index when splitting on A and B. Which attribute would the decision tree induction algorithm choose?

c. Figure 3.11 shows that entropy and the Gini index are both monotonically increasing on the range [0, 0.5] and they are both monotonically decreasing on the range [0.5, 1]. Is it possible that information gain and the gain in the Gini index favor different attributes? Explain.

6. Consider splitting a parent node P into two child nodes, C1 and C2, using some attribute test condition. The composition of labeled training instances at every node is summarized in the table below.

          P   C1   C2
Class 0   7   3    4
Class 1   3   0    3

a. CalculatetheGiniindexandmisclassificationerrorrateoftheparentnodeP.

b. CalculatetheweightedGiniindexofthechildnodes.WouldyouconsiderthisattributetestconditionifGiniisusedastheimpuritymeasure?

c. Calculatetheweightedmisclassificationrateofthechildnodes.Wouldyouconsiderthisattributetestconditionifmisclassificationrateisusedastheimpuritymeasure?

7. Consider the following set of training examples.

X  Y  Z   No. of Class C1 Examples   No. of Class C2 Examples

0 0 0 5 40

0 0 1 0 15

0 1 0 10 5

0 1 1 45 0


1 0 0 10 5

1 0 1 25 0

1 1 0 5 20

1 1 1 0 15

a. Compute a two-level decision tree using the greedy approach described in this chapter. Use the classification error rate as the criterion for splitting. What is the overall error rate of the induced tree?

b. Repeat part (a) using X as the first splitting attribute and then choose the best remaining attribute for splitting at each of the two successor nodes. What is the error rate of the induced tree?

c. Compare the results of parts (a) and (b). Comment on the suitability of the greedy heuristic used for splitting attribute selection.

8. The following table summarizes a data set with three attributes A, B, C and two class labels +, −. Build a two-level decision tree.

A  B  C   Number of + Instances   Number of − Instances
T  T  T   5    0
F  T  T   0    20
T  F  T   20   0
F  F  T   0    5
T  T  F   0    0
F  T  F   25   0
T  F  F   0    0
F  F  F   0    25

a. According to the classification error rate, which attribute would be chosen as the first splitting attribute? For each attribute, show the contingency table and the gains in classification error rate.

b. Repeat for the two children of the root node.

c. How many instances are misclassified by the resulting decision tree?

d. Repeat parts (a), (b), and (c) using C as the splitting attribute.

e. Use the results in parts (c) and (d) to conclude about the greedy nature of the decision tree induction algorithm.

9. Consider the decision tree shown in Figure 3.36.

Figure 3.36. Decision tree and data sets for Exercise 9.

a. Compute the generalization error rate of the tree using the optimistic approach.

b. Compute the generalization error rate of the tree using the pessimistic approach. (For simplicity, use the strategy of adding a factor of 0.5 to each leaf node.)

c. Compute the generalization error rate of the tree using the validation set shown above. This approach is known as reduced error pruning.

10. Consider the decision trees shown in Figure 3.37. Assume they are generated from a data set that contains 16 binary attributes and 3 classes, C1, C2, and C3.

Compute the total description length of each decision tree according to the following formulation of the minimum description length principle.

The total description length of a tree is given by Cost(tree, data) = Cost(tree) + Cost(data|tree).

Each internal node of the tree is encoded by the ID of the splitting attribute. If there are m attributes, the cost of encoding each attribute is log2 m bits.

Figure 3.37. Decision trees for Exercise 10.

Each leaf is encoded using the ID of the class it is associated with. If there are k classes, the cost of encoding a class is log2 k bits.

Cost(tree) is the cost of encoding all the nodes in the tree. To simplify the computation, you can assume that the total cost of the tree is obtained by adding up the costs of encoding each internal node and each leaf node.

Cost(data|tree) is encoded using the classification errors the tree commits on the training set. Each error is encoded by log2 n bits, where n is the total number of training instances.

Which decision tree is better, according to the MDL principle?

11. This exercise, inspired by the discussions in [155], highlights one of the known limitations of the leave-one-out model evaluation procedure. Let us consider a data set containing 50 positive and 50 negative instances, where the attributes are purely random and contain no information about the class labels. Hence, the generalization error rate of any classification model learned over this data is expected to be 0.5. Let us consider a classifier that assigns the majority class label of training instances (ties resolved by using the positive label as the default class) to any test instance, irrespective of its attribute values. We can call this approach the majority inducer classifier. Determine the error rate of this classifier using the following methods.

a. Leave-one-out.

b. 2-fold stratified cross-validation, where the proportion of class labels at every fold is kept the same as that of the overall data.

c. From the results above, which method provides a more reliable evaluation of the classifier's generalization error rate?

12. Consider a labeled data set containing 100 data instances, which is randomly partitioned into two sets A and B, each containing 50 instances. We use A as the training set to learn two decision trees, T10 with 10 leaf nodes and T100 with 100 leaf nodes. The accuracies of the two decision trees on data sets A and B are shown in Table 3.7.

Table 3.7. Comparing the test accuracy of decision trees T10 and T100.

            Accuracy
Data Set    T10     T100
A           0.86    0.97
B           0.84    0.77

a. Based on the accuracies shown in Table 3.7, which classification model would you expect to have better performance on unseen instances?

b. Now, you tested T10 and T100 on the entire data set (A + B) and found that the classification accuracy of T10 on data set (A + B) is 0.85, whereas the classification accuracy of T100 on the data set (A + B) is 0.87. Based on this new information and your observations from Table 3.7, which classification model would you finally choose for classification?

13. Consider the following approach for testing whether a classifier A beats another classifier B. Let N be the size of a given data set, pA be the accuracy of classifier A, pB be the accuracy of classifier B, and p = (pA + pB)/2 be the average accuracy for both classifiers. To test whether classifier A is significantly better than B, the following Z-statistic is used:

Z = (pA − pB) / sqrt( 2 p (1 − p) / N ).

Classifier A is assumed to be better than classifier B if Z > 1.96.
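As a side illustration (not part of the exercise solution), the following short Python sketch evaluates this Z-statistic; the accuracies and data set size plugged in at the bottom are made-up inputs, not values taken from Table 3.8.

    import math

    def z_statistic(p_a, p_b, n):
        """Z-statistic for comparing two classifiers evaluated on the same data set of size n."""
        p = (p_a + p_b) / 2.0                                   # average accuracy
        return (p_a - p_b) / math.sqrt(2.0 * p * (1.0 - p) / n)

    # Hypothetical example: classifier A is 85% accurate, B is 80%, on 1000 instances.
    z = z_statistic(0.85, 0.80, 1000)
    print(round(z, 3), "A significantly better" if z > 1.96 else "no significant difference")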

Table 3.8 compares the accuracies of three different classifiers, decision tree classifiers, naïve Bayes classifiers, and support vector machines, on various data sets. (The latter two classifiers are described in Chapter 4.)

Summarize the performance of the classifiers given in Table 3.8 using the following 3 × 3 table:

win-loss-draw            Decision tree   Naïve Bayes   Support vector machine
Decision tree            0 - 0 - 23
Naïve Bayes                              0 - 0 - 23
Support vector machine                                 0 - 0 - 23

Table 3.8. Comparing the accuracy of various classification methods.

Data Set   Size (N)   Decision Tree (%)   naïve Bayes (%)   Support vector machine (%)

Anneal 898 92.09 79.62 87.19

Australia 690 85.51 76.81 84.78

Auto 205 81.95 58.05 70.73

Breast 699 95.14 95.99 96.42

Cleve 303 76.24 83.50 84.49

Credit 690 85.80 77.54 85.07

Diabetes 768 72.40 75.91 76.82

German 1000 70.90 74.70 74.40

Glass 214 67.29 48.59 59.81

Heart 270 80.00 84.07 83.70

Hepatitis 155 81.94 83.23 87.10

Horse 368 85.33 78.80 82.61

Ionosphere 351 89.17 82.34 88.89

Iris 150 94.67 95.33 96.00

Labor 57 78.95 94.74 92.98

Led7 3200 73.34 73.16 73.56

Lymphography 148 77.03 83.11 86.49

Pima 768 74.35 76.04 76.95

Sonar 208 78.85 69.71 76.92

Tic-tac-toe 958 83.72 70.04 98.33

Vehicle 846 71.04 45.04 74.94

Wine 178 94.38 96.63 98.88

Zoo 101 93.07 93.07 96.04

Each cell in the table contains the number of wins, losses, and draws when comparing the classifier in a given row to the classifier in a given column.

14. Let X be a binomial random variable with mean Np and variance Np(1 − p). Show that the ratio X/N also has a binomial distribution with mean p and variance p(1 − p)/N.

4 Classification: Alternative Techniques

The previous chapter introduced the classification problem and presented a technique known as the decision tree classifier. Issues such as model overfitting and model evaluation were also discussed. This chapter presents alternative techniques for building classification models, from simple techniques such as rule-based and nearest neighbor classifiers to more sophisticated techniques such as artificial neural networks, deep learning, support vector machines, and ensemble methods. Other practical issues such as the class imbalance and multiclass problems are also discussed at the end of the chapter.

4.1 Types of Classifiers

Before presenting specific techniques, we first categorize the different types of classifiers available. One way to distinguish classifiers is by considering the characteristics of their output.

Binary versus Multiclass

Binary classifiers assign each data instance to one of two possible labels, typically denoted as +1 and −1. The positive class usually refers to the category we are more interested in predicting correctly compared to the negative class (e.g., the spam category in email classification problems). If there are more than two possible labels available, then the technique is known as a multiclass classifier. As some classifiers were designed for binary classes only, they must be adapted to deal with multiclass problems. Techniques for transforming binary classifiers into multiclass classifiers are described in Section 4.12.

Deterministic versus Probabilistic

A deterministic classifier produces a discrete-valued label for each data instance it classifies, whereas a probabilistic classifier assigns a continuous score between 0 and 1 to indicate how likely it is that an instance belongs to a particular class, where the probability scores for all the classes sum up to 1. Some examples of probabilistic classifiers include the naïve Bayes classifier, Bayesian networks, and logistic regression. Probabilistic classifiers provide additional information about the confidence in assigning an instance to a class compared to deterministic classifiers. A data instance is typically assigned to the class with the highest probability score, except when the cost of misclassifying the class with lower probability is significantly higher. We will discuss the topic of cost-sensitive classification with probabilistic outputs in Section 4.11.2.

Another way to distinguish the different types of classifiers is based on their technique for discriminating instances from different classes.

Linear versus Nonlinear

A linear classifier uses a linear separating hyperplane to discriminate instances from different classes, whereas a nonlinear classifier enables the construction of more complex, nonlinear decision surfaces. We illustrate an example of a linear classifier (perceptron) and its nonlinear counterpart (multi-layer neural network) in Section 4.7. Although the linearity assumption makes the model less flexible in terms of fitting complex data, it also makes linear classifiers less susceptible to model overfitting compared to nonlinear classifiers. Furthermore, one can transform the original set of attributes, x = (x1, x2, ⋯, xd), into a more complex feature set, e.g., Φ(x) = (x1, x2, x1x2, x1², x2², ⋯), before applying the linear classifier. Such feature transformation allows the linear classifier to fit data sets with nonlinear decision surfaces (see Section 4.9.4).

Global versus Local

A global classifier fits a single model to the entire data set. Unless the model is highly nonlinear, this one-size-fits-all strategy may not be effective when the relationship between the attributes and the class labels varies over the input space. In contrast, a local classifier partitions the input space into smaller regions and fits a distinct model to training instances in each region. The k-nearest neighbor classifier (see Section 4.3) is a classic example of local classifiers. While local classifiers are more flexible in terms of fitting complex decision boundaries, they are also more susceptible to the model overfitting problem, especially when the local regions contain few training examples.

Generative versus Discriminative

Given a data instance x, the primary objective of any classifier is to predict the class label, y, of the data instance. However, apart from predicting the class label, we may also be interested in describing the underlying mechanism that generates the instances belonging to every class label. For example, in the process of classifying spam email messages, it may be useful to understand the typical characteristics of email messages that are labeled as spam, e.g., specific usage of keywords in the subject or the body of the email. Classifiers that learn a generative model of every class in the process of predicting class labels are known as generative classifiers. Some examples of generative classifiers include the naïve Bayes classifier and Bayesian networks. In contrast, discriminative classifiers directly predict the class labels without explicitly describing the distribution of every class label. They solve a simpler problem than generative models since they do not have the onus of deriving insights about the generative mechanism of data instances. They are thus sometimes preferred over generative models, especially when it is not crucial to obtain information about the properties of every class. Some examples of discriminative classifiers include decision trees, rule-based classifiers, nearest neighbor classifiers, artificial neural networks, and support vector machines.

4.2 Rule-Based Classifier

A rule-based classifier uses a collection of "if ... then ..." rules (also known as a rule set) to classify data instances. Table 4.1 shows an example of a rule set generated for the vertebrate classification problem described in the previous chapter. Each classification rule in the rule set can be expressed in the following way:

r_i: (Condition_i) → y_i.     (4.1)

The left-hand side of the rule is called the rule antecedent or precondition. It contains a conjunction of attribute test conditions:

Condition_i = (A_1 op v_1) ∧ (A_2 op v_2) ∧ ⋯ ∧ (A_k op v_k),     (4.2)

where (A_j, v_j) is an attribute-value pair and op is a comparison operator chosen from the set {=, ≠, <, >, ≤, ≥}. Each attribute test (A_j op v_j) is also known as a conjunct. The right-hand side of the rule is called the rule consequent, which contains the predicted class y_i.

A rule r covers a data instance x if the precondition of r matches the attributes of x. r is also said to be fired or triggered whenever it covers a given instance. For an illustration, consider the rule r1 given in Table 4.1 and the following attributes for two vertebrates: hawk and grizzly bear.

Table 4.1. Example of a rule set for the vertebrate classification problem.

r1: (Gives Birth = no) ∧ (Aerial Creature = yes) → Birds
r2: (Gives Birth = no) ∧ (Aquatic Creature = yes) → Fishes
r3: (Gives Birth = yes) ∧ (Body Temperature = warm-blooded) → Mammals
r4: (Gives Birth = no) ∧ (Aerial Creature = no) → Reptiles
r5: (Aquatic Creature = semi) → Amphibians

Name           Body Temperature   Skin Cover   Gives Birth   Aquatic Creature   Aerial Creature   Has Legs   Hibernates
hawk           warm-blooded       feather      no            no                 yes               yes        no
grizzly bear   warm-blooded       fur          yes           no                 no                yes        yes

r1 covers the first vertebrate because its precondition is satisfied by the hawk's attributes. The rule does not cover the second vertebrate because grizzly bears give birth to their young and cannot fly, thus violating the precondition of r1.

The quality of a classification rule can be evaluated using measures such as coverage and accuracy. Given a data set D and a classification rule r: A → y, the coverage of the rule is the fraction of instances in D that trigger the rule r. On the other hand, its accuracy or confidence factor is the fraction of instances triggered by r whose class labels are equal to y. The formal definitions of these measures are

Coverage(r) = |A| / |D|,   Accuracy(r) = |A ∩ y| / |A|,     (4.3)

where |A| is the number of instances that satisfy the rule antecedent, |A ∩ y| is the number of instances that satisfy both the antecedent and consequent, and |D| is the total number of instances.

Example 4.1. Consider the data set shown in Table 4.2. The rule

(Gives Birth = yes) ∧ (Body Temperature = warm-blooded) → Mammals

has a coverage of 33% since five of the fifteen instances support the rule antecedent. The rule accuracy is 100% because all five vertebrates covered by the rule are mammals.
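To make the coverage and accuracy definitions in Equation 4.3 concrete, here is a minimal Python sketch that evaluates them for a single rule; the four records below are an invented toy subset used only for illustration, not the full contents of Table 4.2.

    # Each record: an attribute dictionary plus a class label.
    D = [
        ({"Gives Birth": "yes", "Body Temperature": "warm-blooded"}, "Mammals"),
        ({"Gives Birth": "no",  "Body Temperature": "cold-blooded"}, "Reptiles"),
        ({"Gives Birth": "yes", "Body Temperature": "warm-blooded"}, "Mammals"),
        ({"Gives Birth": "no",  "Body Temperature": "warm-blooded"}, "Birds"),
    ]

    # Rule: (Gives Birth = yes) AND (Body Temperature = warm-blooded) -> Mammals
    antecedent = {"Gives Birth": "yes", "Body Temperature": "warm-blooded"}
    consequent = "Mammals"

    covered = [(x, y) for (x, y) in D if all(x[a] == v for a, v in antecedent.items())]
    coverage = len(covered) / len(D)                                    # |A| / |D|
    accuracy = sum(y == consequent for _, y in covered) / len(covered)  # |A ∩ y| / |A|
    print(coverage, accuracy)   # 0.5 and 1.0 on this toy data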

Table 4.2. The vertebrate data set.

Name            Body Temperature   Skin Cover   Gives Birth   Aquatic Creature   Aerial Creature   Has Legs   Hibernates   Class Label
human           warm-blooded       hair         yes           no                 no                yes        no           Mammals
python          cold-blooded       scales       no            no                 no                no         yes          Reptiles
salmon          cold-blooded       scales       no            yes                no                no         no           Fishes
whale           warm-blooded       hair         yes           yes                no                no         no           Mammals
frog            cold-blooded       none         no            semi               no                yes        yes          Amphibians
komodo dragon   cold-blooded       scales       no            no                 no                yes        no           Reptiles
bat             warm-blooded       hair         yes           no                 yes               yes        yes          Mammals
pigeon          warm-blooded       feathers     no            no                 yes               yes        no           Birds
cat             warm-blooded       fur          yes           no                 no                yes        no           Mammals
guppy           cold-blooded       scales       yes           yes                no                no         no           Fishes
alligator       cold-blooded       scales       no            semi               no                yes        no           Reptiles
penguin         warm-blooded       feathers     no            semi               no                yes        no           Birds
porcupine       warm-blooded       quills       yes           no                 no                yes        yes          Mammals
eel             cold-blooded       scales       no            yes                no                no         no           Fishes
salamander      cold-blooded       none         no            semi               no                yes        yes          Amphibians

4.2.1 How a Rule-Based Classifier Works

A rule-based classifier classifies a test instance based on the rule triggered by the instance. To illustrate how a rule-based classifier works, consider the rule set shown in Table 4.1 and the following vertebrates:

Name            Body Temperature   Skin Cover   Gives Birth   Aquatic Creature   Aerial Creature   Has Legs   Hibernates
lemur           warm-blooded       fur          yes           no                 no                yes        yes
turtle          cold-blooded       scales       no            semi               no                yes        no
dogfish shark   cold-blooded       scales       yes           yes                no                no         no

The first vertebrate, which is a lemur, is warm-blooded and gives birth to its young. It triggers the rule r3, and thus, is classified as a mammal. The second vertebrate, which is a turtle, triggers the rules r4 and r5. Since the classes predicted by the rules are contradictory (reptiles versus amphibians), their conflicting classes must be resolved. None of the rules are applicable to a dogfish shark. In this case, we need to determine what class to assign to such a test instance.

4.2.2 Properties of a Rule Set

The rule set generated by a rule-based classifier can be characterized by the following two properties.

Definition 4.1 (Mutually Exclusive Rule Set). The rules in a rule set R are mutually exclusive if no two rules in R are triggered by the same instance. This property ensures that every instance is covered by at most one rule in R.

Definition 4.2 (Exhaustive Rule Set). A rule set R has exhaustive coverage if there is a rule for each combination of attribute values. This property ensures that every instance is covered by at least one rule in R.

Table 4.3. Example of a mutually exclusive and exhaustive rule set.

r1: (Body Temperature = cold-blooded) → Non-mammals
r2: (Body Temperature = warm-blooded) ∧ (Gives Birth = yes) → Mammals
r3: (Body Temperature = warm-blooded) ∧ (Gives Birth = no) → Non-mammals

Together, these two properties ensure that every instance is covered by exactly one rule. An example of a mutually exclusive and exhaustive rule set is shown in Table 4.3. Unfortunately, many rule-based classifiers, including the one shown in Table 4.1, do not have such properties. If the rule set is not exhaustive, then a default rule, r_d: () → y_d, must be added to cover the remaining cases. A default rule has an empty antecedent and is triggered when all other rules have failed. y_d is known as the default class and is typically assigned to the majority class of training instances not covered by the existing rules. If the rule set is not mutually exclusive, then an instance can be covered by more than one rule, some of which may predict conflicting classes.

Definition 4.3 (Ordered Rule Set). The rules in an ordered rule set R are ranked in decreasing order of their priority. An ordered rule set is also known as a decision list.

The rank of a rule can be defined in many ways, e.g., based on its accuracy or total description length. When a test instance is presented, it will be classified by the highest-ranked rule that covers the instance. This avoids the problem of having conflicting classes predicted by multiple classification rules if the rule set is not mutually exclusive.

Analternativewaytohandleanon-mutuallyexclusiverulesetwithoutorderingtherulesistoconsidertheconsequentofeachruletriggeredbyatestinstanceasavoteforaparticularclass.Thevotesarethentalliedtodeterminetheclasslabelofthetestinstance.Theinstanceisusuallyassignedtotheclassthatreceivesthehighestnumberofvotes.Thevotemayalsobeweightedbytherule'saccuracy.Usingunorderedrulestobuildarule-basedclassifierhasbothadvantagesanddisadvantages.Unorderedrulesarelesssusceptibletoerrorscausedbythewrongrulebeingselectedtoclassifyatestinstanceunlikeclassifiersbasedonorderedrules,whicharesensitivetothechoiceofrule-orderingcriteria.Modelbuildingisalsolessexpensivebecausetherulesdonotneedtobekeptinsortedorder.Nevertheless,classifyingatestinstancecanbequiteexpensivebecausetheattributesofthetestinstancemustbecomparedagainstthepreconditionofeveryruleintheruleset.

Inthenexttwosections,wepresenttechniquesforextractinganorderedrulesetfromdata.Arule-basedclassifiercanbeconstructedusing(1)directmethods,whichextractclassificationrulesdirectlyfromdata,and(2)indirectmethods,whichextractclassificationrulesfrommorecomplexclassificationmodels,suchasdecisiontreesandneuralnetworks.DetaileddiscussionsofthesemethodsarepresentedinSections4.2.3 and4.2.4 ,respectively.

4.2.3DirectMethodsforRuleExtraction

Toillustratethedirectmethod,weconsiderawidely-usedruleinductionalgorithmcalledRIPPER.Thisalgorithmscalesalmostlinearlywiththenumberoftraininginstancesandisparticularlysuitedforbuildingmodelsfrom

datasetswithimbalancedclassdistributions.RIPPERalsoworkswellwithnoisydatabecauseitusesavalidationsettopreventmodeloverfitting.

RIPPER uses the sequential covering algorithm to extract rules directly from data. Rules are grown in a greedy fashion one class at a time. For binary class problems, RIPPER chooses the majority class as its default class and learns the rules to detect instances from the minority class. For multiclass problems, the classes are ordered according to their prevalence in the training set. Let (y1, y2, …, yc) be the ordered list of classes, where y1 is the least prevalent class and yc is the most prevalent class. All training instances that belong to y1 are initially labeled as positive examples, while those that belong to other classes are labeled as negative examples. The sequential covering algorithm learns a set of rules to discriminate the positive from negative examples. Next, all training instances from y2 are labeled as positive, while those from classes y3, y4, ⋯, yc are labeled as negative. The sequential covering algorithm would learn the next set of rules to distinguish y2 from the other remaining classes. This process is repeated until we are left with only one class, yc, which is designated as the default class.

Algorithm 4.1 Sequential covering algorithm.

A summary of the sequential covering algorithm is shown in Algorithm 4.1. The algorithm starts with an empty decision list, R, and extracts rules for each class based on the ordering specified by the class prevalence. It iteratively extracts the rules for a given class y using the Learn-One-Rule function. Once such a rule is found, all the training instances covered by the rule are eliminated. The new rule is added to the bottom of the decision list R. This procedure is repeated until the stopping criterion is met. The algorithm then proceeds to generate rules for the next class.

Figure 4.1 demonstrates how the sequential covering algorithm works for a data set that contains a collection of positive and negative examples. The rule R1, whose coverage is shown in Figure 4.1(b), is extracted first because it covers the largest fraction of positive examples. All the training instances covered by R1 are subsequently removed and the algorithm proceeds to look for the next best rule, which is R2.
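The Python sketch below mimics the control flow just described (grow a rule, remove the instances it covers, repeat). It is a simplified stand-in, not RIPPER itself: the learn_one_rule helper is an assumed toy function that picks a single best attribute test by accuracy, and the data set at the bottom is invented.

    def learn_one_rule(examples, target):
        """Toy stand-in for Learn-One-Rule: pick the single (attribute, value) test
        with the best accuracy on the target class (ties broken by positive support)."""
        best, best_key = None, (-1.0, 0)
        for x, _ in examples:
            for a, v in x.items():
                cov = [(xx, yy) for xx, yy in examples if xx.get(a) == v]
                pos = sum(yy == target for _, yy in cov)
                key = (pos / len(cov), pos)
                if key > best_key:
                    best_key, best = key, {a: v}
        return best

    def sequential_covering(examples, target, min_pos=1):
        rules, remaining = [], list(examples)
        while sum(y == target for _, y in remaining) >= min_pos:
            rule = learn_one_rule(remaining, target)
            if rule is None:
                break
            rules.append((rule, target))
            # Remove every instance covered by the new rule before growing the next one.
            remaining = [(x, y) for x, y in remaining
                         if not all(x.get(a) == v for a, v in rule.items())]
        return rules

    # Tiny made-up data set: predict '+' from two binary attributes.
    data = [({"A": 1, "B": 0}, "+"), ({"A": 1, "B": 1}, "+"),
            ({"A": 0, "B": 1}, "+"), ({"A": 0, "B": 0}, "-"),
            ({"A": 0, "B": 0}, "-")]
    print(sequential_covering(data, "+"))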

Learn-One-Rule Function

Finding an optimal rule is computationally expensive due to the exponential search space to explore. The Learn-One-Rule function addresses this problem by growing the rules in a greedy fashion. It generates an initial rule r: {} → +, where the left-hand side is an empty set and the right-hand side corresponds to the positive class. It then refines the rule until a certain stopping criterion is met. The accuracy of the initial rule may be poor because some of the training instances covered by the rule belong to the negative class. A new conjunct must be added to the rule antecedent to improve its accuracy.

Figure 4.1. An example of the sequential covering algorithm.

RIPPER uses the FOIL's information gain measure to choose the best conjunct to be added into the rule antecedent. The measure takes into consideration both the gain in accuracy and support of a candidate rule, where support is defined as the number of positive examples covered by the rule. For example, suppose the rule r: A → + initially covers p0 positive examples and n0 negative examples. After adding a new conjunct B, the extended rule r′: A ∧ B → + covers p1 positive examples and n1 negative examples. The FOIL's information gain of the extended rule is computed as follows:

FOIL's information gain = p1 × [ log2( p1 / (p1 + n1) ) − log2( p0 / (p0 + n0) ) ].     (4.4)

RIPPER chooses the conjunct with the highest FOIL's information gain to extend the rule, as illustrated in the next example.

Example 4.2. [FOIL's Information Gain] Consider the training set for the vertebrate classification problem shown in Table 4.2. Suppose the target class for the Learn-One-Rule function is mammals. Initially, the antecedent of the rule {} → Mammals covers 5 positive and 10 negative examples. Thus, the accuracy of the rule is only 0.333. Next, consider the following three candidate conjuncts to be added to the left-hand side of the rule: Skin Cover = hair, Body Temperature = warm-blooded, and Has Legs = no. The number of positive (p1) and negative (n1) examples covered by the rule after adding each conjunct, along with their respective accuracy and FOIL's information gain, are shown in the following table.

Candidate rule                                   p1   n1   Accuracy   Info Gain
{Skin Cover = hair} → Mammals                    3    0    1.000      4.755
{Body Temperature = warm-blooded} → Mammals      5    2    0.714      5.498
{Has Legs = no} → Mammals                        1    4    0.200      −0.737

Although {Skin Cover = hair} → Mammals has the highest accuracy among the three candidates, the conjunct Body Temperature = warm-blooded has the highest FOIL's information gain. Thus, it is chosen to extend the rule (see Figure 4.2).

This process continues until adding new conjuncts no longer improves the information gain measure.
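A short Python sketch of Equation 4.4 follows; the counts plugged in below are the ones from the candidate table above (an initial rule covering 5 positives and 10 negatives), so under that assumption it simply reproduces the gains reported there.

    import math

    def foil_gain(p0, n0, p1, n1):
        """FOIL's information gain of extending a rule covering (p0, n0)
        positive/negative examples into one covering (p1, n1)."""
        return p1 * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

    # Candidate conjuncts from Example 4.2 (initial rule: 5 positives, 10 negatives).
    for name, p1, n1 in [("Skin Cover = hair", 3, 0),
                         ("Body Temperature = warm-blooded", 5, 2),
                         ("Has Legs = no", 1, 4)]:
        print(name, round(foil_gain(5, 10, p1, n1), 3))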

Rule Pruning

The rules generated by the Learn-One-Rule function can be pruned to improve their generalization errors. RIPPER prunes the rules based on their performance on the validation set. The following metric is computed to determine whether pruning is needed: (p − n)/(p + n), where p (n) is the number of positive (negative) examples in the validation set covered by the rule. This metric is monotonically related to the rule's accuracy on the validation set. If the metric improves after pruning, then the conjunct is removed. Pruning is done starting from the last conjunct added to the rule. For example, given a rule ABCD → y, RIPPER checks whether D should be pruned first, followed by CD, BCD, etc. While the original rule covers only positive examples, the pruned rule may cover some of the negative examples in the training set.
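As a minimal illustration of this pruning check (not RIPPER's full procedure), the sketch below recomputes the validation-set metric (p − n)/(p + n) after dropping the last conjunct and keeps whichever version scores higher; the validation counts are assumed toy numbers.

    def prune_metric(p, n):
        """Validation-set metric used to decide whether to prune: (p - n) / (p + n)."""
        return (p - n) / (p + n)

    # Assumed toy counts on the validation set.
    full_rule_counts   = (6, 4)   # rule A ^ B ^ C ^ D covers 6 positives, 4 negatives
    pruned_rule_counts = (9, 4)   # rule A ^ B ^ C (conjunct D removed) covers 9 positives, 4 negatives

    if prune_metric(*pruned_rule_counts) > prune_metric(*full_rule_counts):
        print("prune the last conjunct")   # here: 5/13 ≈ 0.385 > 2/10 = 0.2, so D is removed
    else:
        print("keep the full rule")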

Building the Rule Set

After generating a rule, all the positive and negative examples covered by the rule are eliminated. The rule is then added into the rule set as long as it does not violate the stopping condition, which is based on the minimum description length principle. If the new rule increases the total description length of the rule set by at least d bits, then RIPPER stops adding rules into its rule set (by default, d is chosen to be 64 bits). Another stopping condition used by RIPPER is that the error rate of the rule on the validation set must not exceed 50%.

Figure4.2.General-to-specificandspecific-to-generalrule-growingstrategies.

RIPPERalsoperformsadditionaloptimizationstepstodeterminewhethersomeoftheexistingrulesintherulesetcanbereplacedbybetteralternativerules.Readerswhoareinterestedinthedetailsoftheoptimizationmethodmayrefertothereferencecitedattheendofthischapter.

InstanceElimination

Afteraruleisextracted,RIPPEReliminatesthepositiveandnegativeexamplescoveredbytherule.Therationalefordoingthisisillustratedinthenextexample.

Figure4.3 showsthreepossiblerules,R1,R2,andR3,extractedfromatrainingsetthatcontains29positiveexamplesand21negativeexamples.TheaccuraciesofR1,R2,andR3are12/15(80%),7/10(70%),and8/12(66.7%),respectively.R1isgeneratedfirstbecauseithasthehighestaccuracy.AftergeneratingR1,thealgorithmmustremovetheexamplescoveredbytherulesothatthenextrulegeneratedbythealgorithmisdifferentthanR1.Thequestionis,shoulditremovethepositiveexamplesonly,negativeexamplesonly,orboth?Toanswerthis,supposethealgorithmmustchoosebetweengeneratingR2orR3afterR1.EventhoughR2hasahigheraccuracythanR3(70%versus66.7%),observethattheregioncoveredbyR2isdisjointfromR1,whiletheregioncoveredbyR3overlapswithR1.Asaresult,R1andR3togethercover18positiveand5negativeexamples(resultinginanoverallaccuracyof78.3%),whereasR1andR2togethercover19positiveand6negativeexamples(resultinginaloweroverallaccuracyof76%).IfthepositiveexamplescoveredbyR1arenotremoved,thenwemayoverestimatetheeffectiveaccuracyofR3.IfthenegativeexamplescoveredbyR1arenotremoved,thenwemayunderestimatetheaccuracyofR3.Inthelattercase,wemightenduppreferringR2overR3eventhoughhalfofthefalsepositiveerrorscommittedbyR3havealreadybeenaccountedforbytheprecedingrule,R1.ThisexampleshowsthattheeffectiveaccuracyafteraddingR2orR3totherulesetbecomesevidentonlywhenbothpositiveandnegativeexamplescoveredbyR1areremoved.

Figure4.3.Eliminationoftraininginstancesbythesequentialcoveringalgorithm.R1,R2,andR3representregionscoveredbythreedifferentrules.

4.2.4IndirectMethodsforRuleExtraction

Thissectionpresentsamethodforgeneratingarulesetfromadecisiontree.Inprinciple,everypathfromtherootnodetotheleafnodeofadecisiontreecanbeexpressedasaclassificationrule.Thetestconditionsencounteredalongthepathformtheconjunctsoftheruleantecedent,whiletheclasslabelattheleafnodeisassignedtotheruleconsequent.Figure4.4 showsanexampleofarulesetgeneratedfromadecisiontree.Noticethattherulesetisexhaustiveandcontainsmutuallyexclusiverules.However,someoftherulescanbesimplifiedasshowninthenextexample.

Figure4.4.Convertingadecisiontreeintoclassificationrules.

Example 4.3. Consider the following three rules from Figure 4.4:

r2: (P = No) ∧ (Q = Yes) → +
r3: (P = Yes) ∧ (R = No) → +
r5: (P = Yes) ∧ (R = Yes) ∧ (Q = Yes) → +.

Observe that the rule set always predicts a positive class when the value of Q is Yes. Therefore, we may simplify the rules as follows:

r2′: (Q = Yes) → +
r3: (P = Yes) ∧ (R = No) → +.

r3 is retained to cover the remaining instances of the positive class. Although the rules obtained after simplification are no longer mutually exclusive, they are less complex and are easier to interpret.

In the following, we describe an approach used by the C4.5rules algorithm to generate a rule set from a decision tree. Figure 4.5 shows the decision tree

andresultingclassificationrulesobtainedforthedatasetgiveninTable4.2 .

Rule Generation

Classification rules are extracted for every path from the root to one of the leaf nodes in the decision tree. Given a classification rule r: A → y, we consider a simplified rule, r′: A′ → y, where A′ is obtained by removing one of the conjuncts in A. The simplified rule with the lowest pessimistic error rate is retained provided its error rate is less than that of the original rule. The rule-pruning step is repeated until the pessimistic error of the rule cannot be improved further. Because some of the rules may become identical after pruning, the duplicate rules are discarded.

Figure 4.5. Classification rules extracted from a decision tree for the vertebrate classification problem.

Rule Ordering

After generating the rule set, C4.5rules uses the class-based ordering scheme to order the extracted rules. Rules that predict the same class are grouped together into the same subset. The total description length for each subset is computed, and the classes are arranged in increasing order of their total description length. The class that has the smallest description length is given the highest priority because it is expected to contain the best set of rules. The total description length for a class is given by L_exception + g × L_model, where L_exception is the number of bits needed to encode the misclassified examples, L_model is the number of bits needed to encode the model, and g is a tuning parameter whose default value is 0.5. The tuning parameter depends on the number of redundant attributes present in the model. The value of the tuning parameter is small if the model contains many redundant attributes.

4.2.5 Characteristics of Rule-Based Classifiers

1. Rule-based classifiers have very similar characteristics as decision trees. The expressiveness of a rule set is almost equivalent to that of a decision tree because a decision tree can be represented by a set of mutually exclusive and exhaustive rules. Both rule-based and decision tree classifiers create rectilinear partitions of the attribute space and assign a class to each partition. However, a rule-based classifier can allow multiple rules to be triggered for a given instance, thus enabling the learning of more complex models than decision trees.

2. Likedecisiontrees,rule-basedclassifierscanhandlevaryingtypesofcategoricalandcontinuousattributesandcaneasilyworkinmulticlassclassificationscenarios.Rule-basedclassifiersaregenerallyusedtoproducedescriptivemodelsthatareeasiertointerpretbutgivecomparableperformancetothedecisiontreeclassifier.

3. Rule-basedclassifierscaneasilyhandlethepresenceofredundantattributesthatarehighlycorrelatedwithoneother.Thisisbecauseonceanattributehasbeenusedasaconjunctinaruleantecedent,theremainingredundantattributeswouldshowlittletonoFOIL'sinformationgainandwouldthusbeignored.

4. Sinceirrelevantattributesshowpoorinformationgain,rule-basedclassifierscanavoidselectingirrelevantattributesifthereareotherrelevantattributesthatshowbetterinformationgain.However,iftheproblemiscomplexandthereareinteractingattributesthatcancollectivelydistinguishbetweentheclassesbutindividuallyshowpoorinformationgain,itislikelyforanirrelevantattributetobeaccidentallyfavoredoverarelevantattributejustbyrandomchance.Hence,rule-basedclassifierscanshowpoorperformanceinthepresenceofinteractingattributes,whenthenumberofirrelevantattributesislarge.

5. Theclass-basedorderingstrategyadoptedbyRIPPER,whichemphasizesgivinghigherprioritytorareclasses,iswellsuitedforhandlingtrainingdatasetswithimbalancedclassdistributions.

6. Rule-basedclassifiersarenotwell-suitedforhandlingmissingvaluesinthetestset.Thisisbecausethepositionofrulesinarulesetfollowsacertainorderingstrategyandevenifatestinstanceiscoveredbymultiplerules,theycanassigndifferentclasslabelsdependingontheirpositionintheruleset.Hence,ifacertainruleinvolvesanattributethatismissinginatestinstance,itisdifficulttoignoretheruleandproceed

tothesubsequentrulesintheruleset,assuchastrategycanresultinincorrectclassassignments.

4.3 Nearest Neighbor Classifiers

The classification framework shown in Figure 3.3 involves a two-step process:

(1)aninductivestepforconstructingaclassificationmodelfromdata,and

(2)adeductivestepforapplyingthemodeltotestexamples.Decisiontreeandrule-basedclassifiersareexamplesofeagerlearnersbecausetheyaredesignedtolearnamodelthatmapstheinputattributestotheclasslabelassoonasthetrainingdatabecomesavailable.Anoppositestrategywouldbetodelaytheprocessofmodelingthetrainingdatauntilitisneededtoclassifythetestinstances.Techniquesthatemploythisstrategyareknownaslazylearners.AnexampleofalazylearneristheRoteclassifier,whichmemorizestheentiretrainingdataandperformsclassificationonlyiftheattributesofatestinstancematchoneofthetrainingexamplesexactly.Anobviousdrawbackofthisapproachisthatsometestinstancesmaynotbeclassifiedbecausetheydonotmatchanytrainingexample.

Onewaytomakethisapproachmoreflexibleistofindallthetrainingexamplesthatarerelativelysimilartotheattributesofthetestinstances.Theseexamples,whichareknownasnearestneighbors,canbeusedtodeterminetheclasslabelofthetestinstance.Thejustificationforusingnearestneighborsisbestexemplifiedbythefollowingsaying:“Ifitwalkslikeaduck,quackslikeaduck,andlookslikeaduck,thenit'sprobablyaduck.”Anearestneighborclassifierrepresentseachexampleasadatapointinad-dimensionalspace,wheredisthenumberofattributes.Givenatestinstance,wecomputeitsproximitytothetraininginstancesaccordingtooneoftheproximitymeasuresdescribedinSection2.4 onpage71.Thek-nearest

neighborsofagiventestinstancezrefertothektrainingexamplesthatareclosesttoz.

Figure4.6 illustratesthe1-,2-,and3-nearestneighborsofatestinstancelocatedatthecenterofeachcircle.Theinstanceisclassifiedbasedontheclasslabelsofitsneighbors.Inthecasewheretheneighborshavemorethanonelabel,thetestinstanceisassignedtothemajorityclassofitsnearestneighbors.InFigure4.6(a) ,the1-nearestneighboroftheinstanceisanegativeexample.Thereforetheinstanceisassignedtothenegativeclass.Ifthenumberofnearestneighborsisthree,asshowninFigure4.6(c) ,thentheneighborhoodcontainstwopositiveexamplesandonenegativeexample.Usingthemajorityvotingscheme,theinstanceisassignedtothepositiveclass.Inthecasewherethereisatiebetweentheclasses(seeFigure4.6(b) ),wemayrandomlychooseoneofthemtoclassifythedatapoint.

Figure4.6.The1-,2-,and3-nearestneighborsofaninstance.

Theprecedingdiscussionunderscorestheimportanceofchoosingtherightvaluefork.Ifkistoosmall,thenthenearestneighborclassifiermaybesusceptibletooverfittingduetonoise,i.e.,mislabeledexamplesinthetraining

data.Ontheotherhand,ifkistoolarge,thenearestneighborclassifiermaymisclassifythetestinstancebecauseitslistofnearestneighborsincludestrainingexamplesthatarelocatedfarawayfromitsneighborhood(seeFigure4.7 ).

Figure4.7.k-nearestneighborclassificationwithlargek.

4.3.1 Algorithm

A high-level summary of the nearest neighbor classification method is given in Algorithm 4.2. The algorithm computes the distance (or similarity) between each test instance z = (x′, y′) and all the training examples (x, y) ∈ D to determine its nearest neighbor list, Dz. Such computation can be costly if the number of training examples is large. However, efficient indexing techniques are available to reduce the computation needed to find the nearest neighbors of a test instance.

Algorithm 4.2 The k-nearest neighbor classifier.

Once the nearest neighbor list is obtained, the test instance is classified based on the majority class of its nearest neighbors:

Majority Voting:  y′ = argmax_v Σ_{(xi, yi) ∈ Dz} I(v = yi),     (4.5)

where v is a class label, yi is the class label for one of the nearest neighbors, and I(·) is an indicator function that returns the value 1 if its argument is true and 0 otherwise.

In the majority voting approach, every neighbor has the same impact on the classification. This makes the algorithm sensitive to the choice of k, as shown in Figure 4.6. One way to reduce the impact of k is to weight the influence of each nearest neighbor xi according to its distance: wi = 1/d(x′, xi)². As a result, training examples that are located far away from z have a weaker impact on the classification compared to those that are located close to z. Using the distance-weighted voting scheme, the class label can be determined as follows:

Distance-Weighted Voting:  y′ = argmax_v Σ_{(xi, yi) ∈ Dz} wi × I(v = yi).     (4.6)
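The following Python sketch implements Equations 4.5 and 4.6 directly with a brute-force neighbor search (no indexing); the two-dimensional training points at the bottom are made-up values used only for illustration.

    import math
    from collections import defaultdict

    def knn_predict(train, z, k, weighted=False):
        """Classify test point z from its k nearest training examples.
        train is a list of (x, y) pairs, where x is a tuple of numeric attributes."""
        dist = lambda a, b: math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
        # Keep the k training examples closest to z (Euclidean distance).
        neighbors = sorted(train, key=lambda xy: dist(xy[0], z))[:k]

        votes = defaultdict(float)
        for x, y in neighbors:
            d = dist(x, z)
            # Equation 4.6 weights each vote by 1/d^2; Equation 4.5 gives every neighbor weight 1.
            votes[y] += 1.0 / (d ** 2 + 1e-12) if weighted else 1.0
        return max(votes, key=votes.get)

    # Made-up two-dimensional training data with labels '+' and '-'.
    train = [((1.0, 1.0), "+"), ((1.2, 0.8), "+"), ((0.9, 1.1), "+"),
             ((3.0, 3.0), "-"), ((3.2, 2.9), "-"), ((2.8, 3.1), "-")]
    print(knn_predict(train, (1.1, 1.0), k=3))                 # majority voting
    print(knn_predict(train, (2.0, 2.0), k=3, weighted=True))  # distance-weighted voting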

4.3.2CharacteristicsofNearestNeighborClassifiers

1. Nearestneighborclassificationispartofamoregeneraltechniqueknownasinstance-basedlearning,whichdoesnotbuildaglobalmodel,butratherusesthetrainingexamplestomakepredictionsforatestinstance.(Thus,suchclassifiersareoftensaidtobe“modelfree.”)Suchalgorithmsrequireaproximitymeasuretodeterminethesimilarityordistancebetweeninstancesandaclassificationfunctionthatreturnsthepredictedclassofatestinstancebasedonitsproximitytootherinstances.

2. Althoughlazylearners,suchasnearestneighborclassifiers,donotrequiremodelbuilding,classifyingatestinstancecanbequiteexpensivebecauseweneedtocomputetheproximityvaluesindividuallybetweenthetestandtrainingexamples.Incontrast,eagerlearnersoftenspendthebulkoftheircomputingresourcesformodelbuilding.Onceamodelhasbeenbuilt,classifyingatestinstanceisextremelyfast.

3. Nearestneighborclassifiersmaketheirpredictionsbasedonlocalinformation.(Thisisequivalenttobuildingalocalmodelforeachtestinstance.)Bycontrast,decisiontreeandrule-basedclassifiersattempttofindaglobalmodelthatfitstheentireinputspace.Becausetheclassificationdecisionsaremadelocally,nearestneighborclassifiers(withsmallvaluesofk)arequitesusceptibletonoise.

4. Nearestneighborclassifierscanproducedecisionboundariesofarbitraryshape.Suchboundariesprovideamoreflexiblemodelrepresentationcomparedtodecisiontreeandrule-basedclassifiersthatareoftenconstrainedtorectilineardecisionboundaries.Thedecisionboundariesofnearestneighborclassifiersalsohavehigh

variabilitybecausetheydependonthecompositionoftrainingexamplesinthelocalneighborhood.Increasingthenumberofnearestneighborsmayreducesuchvariability.

5. Nearestneighborclassifiershavedifficultyhandlingmissingvaluesinboththetrainingandtestsetssinceproximitycomputationsnormallyrequirethepresenceofallattributes.Although,thesubsetofattributespresentintwoinstancescanbeusedtocomputeaproximity,suchanapproachmaynotproducegoodresultssincetheproximitymeasuresmaybedifferentforeachpairofinstancesandthushardtocompare.

6. Nearest neighbor classifiers can handle the presence of interacting attributes, i.e., attributes that have more predictive power taken in combination than by themselves, by using appropriate proximity measures that can incorporate the effects of multiple attributes together.

7. Thepresenceofirrelevantattributescandistortcommonlyusedproximitymeasures,especiallywhenthenumberofirrelevantattributesislarge.Furthermore,iftherearealargenumberofredundantattributesthatarehighlycorrelatedwitheachother,thentheproximitymeasurecanbeoverlybiasedtowardsuchattributes,resultinginimproperestimatesofdistance.Hence,thepresenceofirrelevantandredundantattributescanadverselyaffecttheperformanceofnearestneighborclassifiers.

8. Nearestneighborclassifierscanproducewrongpredictionsunlesstheappropriateproximitymeasureanddatapreprocessingstepsaretaken.Forexample,supposewewanttoclassifyagroupofpeoplebasedonattributessuchasheight(measuredinmeters)andweight(measuredinpounds).Theheightattributehasalowvariability,rangingfrom1.5mto1.85m,whereastheweightattributemayvaryfrom90lb.to250lb.Ifthescaleoftheattributesarenottakenintoconsideration,theproximitymeasuremaybedominatedbydifferencesintheweightsofaperson.
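A brief sketch of the scaling issue mentioned above: standardizing each attribute before computing distances (here with a z-score, one common choice among several) keeps the large-valued weight attribute from dominating. The height and weight numbers are invented for illustration.

    # Invented (height in meters, weight in pounds) records.
    people = [(1.55, 110.0), (1.80, 190.0), (1.60, 230.0)]

    def zscore_columns(rows):
        """Standardize each column to zero mean and unit standard deviation."""
        cols, stats = list(zip(*rows)), []
        for c in cols:
            mu = sum(c) / len(c)
            sd = (sum((v - mu) ** 2 for v in c) / len(c)) ** 0.5
            stats.append((mu, sd))
        return [tuple((v - mu) / sd for v, (mu, sd) in zip(r, stats)) for r in rows]

    # Before scaling, squared distances are dominated by differences in weight;
    # after scaling, height and weight contribute on comparable scales.
    print(zscore_columns(people))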

4.4NaïveBayesClassifierManyclassificationproblemsinvolveuncertainty.First,theobservedattributesandclasslabelsmaybeunreliableduetoimperfectionsinthemeasurementprocess,e.g.,duetothelimitedprecisenessofsensordevices.Second,thesetofattributeschosenforclassificationmaynotbefullyrepresentativeofthetargetclass,resultinginuncertainpredictions.Toillustratethis,considertheproblemofpredictingaperson'sriskforheartdiseasebasedonamodelthatusestheirdietandworkoutfrequencyasattributes.Althoughmostpeoplewhoeathealthilyandexerciseregularlyhavelesschanceofdevelopingheartdisease,theymaystillbeatriskduetootherlatentfactors,suchasheredity,excessivesmoking,andalcoholabuse,thatarenotcapturedinthemodel.Third,aclassificationmodellearnedoverafinitetrainingsetmaynotbeabletofullycapturethetruerelationshipsintheoveralldata,asdiscussedinthecontextofmodeloverfittinginthepreviouschapter.Finally,uncertaintyinpredictionsmayariseduetotheinherentrandomnatureofreal-worldsystems,suchasthoseencounteredinweatherforecastingproblems.

Inthepresenceofuncertainty,thereisaneedtonotonlymakepredictionsofclasslabelsbutalsoprovideameasureofconfidenceassociatedwitheveryprediction.Probabilitytheoryoffersasystematicwayforquantifyingandmanipulatinguncertaintyindata,andthus,isanappealingframeworkforassessingtheconfidenceofpredictions.Classificationmodelsthatmakeuseofprobabilitytheorytorepresenttherelationshipbetweenattributesandclasslabelsareknownasprobabilisticclassificationmodels.Inthissection,wepresentthenaïveBayesclassifier,whichisoneofthesimplestandmostwidely-usedprobabilisticclassificationmodels.

4.4.1BasicsofProbabilityTheory

BeforewediscusshowthenaïveBayesclassifierworks,wefirstintroducesomebasicsofprobabilitytheorythatwillbeusefulinunderstandingtheprobabilisticclassificationmodelspresentedinthischapter.Thisinvolvesdefiningthenotionofprobabilityandintroducingsomecommonapproachesformanipulatingprobabilityvalues.

Consider a variable X, which can take any discrete value from the set {x1, …, xk}. When we have multiple observations of that variable, such as in a data set where the variable describes some characteristic of data objects, then we can compute the relative frequency with which each value occurs. Specifically, suppose that X has the value xi for ni data objects. The relative frequency with which we observe the event X = xi is then ni/N, where N denotes the total number of occurrences (N = Σ_{i=1}^{k} ni). These relative frequencies characterize the uncertainty that we have with respect to what value X may take for an unseen observation and motivate the notion of probability.

More formally, the probability of an event e, e.g., P(X = xi), measures how likely it is for the event e to occur. The most traditional view of probability is based on the relative frequency of events (frequentist), while the Bayesian viewpoint (described later) takes a more flexible view of probabilities. In either case, a probability is always a number between 0 and 1. Further, the sum of probability values of all possible events, e.g., outcomes of a variable X, is equal to 1. Variables that have probabilities associated with each possible outcome (value) are known as random variables.

Now, let us consider two random variables, X and Y, that can each take k discrete values. Let nij be the number of times we observe X = xi and Y = yj, out of a total number of N occurrences. The joint probability of observing X = xi and Y = yj together can be estimated as

P(X = xi, Y = yj) = nij / N.     (4.7)

(This is an estimate since we typically have only a finite subset of all possible observations.) Joint probabilities can be used to answer questions such as "what is the probability that there will be a surprise quiz today and I will be late for the class." Joint probabilities are symmetric, i.e., P(X = x, Y = y) = P(Y = y, X = x). For joint probabilities, it is useful to consider their sum with respect to one of the random variables, as described in the following equation:

Σ_{j=1}^{k} P(X = xi, Y = yj) = Σ_{j=1}^{k} nij / N = ni / N = P(X = xi),     (4.8)

where ni is the total number of times we observe X = xi irrespective of the value of Y. Notice that ni/N is essentially the probability of observing X = xi. Hence, by summing out the joint probabilities with respect to a random variable Y, we obtain the probability of observing the remaining variable X. This operation is called marginalization and the probability value P(X = xi) obtained by marginalizing out Y is sometimes called the marginal probability of X. As we will see later, joint probability and marginal probability form the basic building blocks of a number of probabilistic classification models discussed in this chapter.

Notice that in the previous discussions, we used P(X = xi) to denote the probability of a particular outcome of a random variable X. This notation can easily become cumbersome when a number of random variables are involved. Hence, in the remainder of this section, we will use P(X) to denote the probability of any generic outcome of the random variable X, while P(xi) will be used to represent the probability of the specific outcome xi.
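To tie Equations 4.7 and 4.8 to something executable, here is a small Python sketch that estimates joint and marginal probabilities from co-occurrence counts; the count table is an assumed toy example.

    # Assumed counts n_ij of jointly observing X = x_i and Y = y_j.
    counts = {("x1", "y1"): 10, ("x1", "y2"): 30,
              ("x2", "y1"): 40, ("x2", "y2"): 20}
    N = sum(counts.values())

    # Joint probabilities, Equation 4.7: P(X = x_i, Y = y_j) = n_ij / N.
    joint = {xy: n / N for xy, n in counts.items()}

    # Marginal probability of X, Equation 4.8: sum the joint over all values of Y.
    marginal_x = {}
    for (x, _), p in joint.items():
        marginal_x[x] = marginal_x.get(x, 0.0) + p

    print(joint)        # e.g., P(x1, y1) = 0.1
    print(marginal_x)   # P(x1) = 0.4, P(x2) = 0.6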

Bayes Theorem

Suppose you have invited two of your friends Alex and Martha to a dinner party. You know that Alex attends 40% of the parties he is invited to. Further, if Alex is going to a party, there is an 80% chance of Martha coming along. On the other hand, if Alex is not going to the party, the chance of Martha coming to the party is reduced to 30%. If Martha has responded that she will be coming to your party, what is the probability that Alex will also be coming?

Bayestheorempresentsthestatisticalprincipleforansweringquestionslikethepreviousone,whereevidencefrommultiplesourceshastobecombinedwithpriorbeliefstoarriveatpredictions.Bayestheoremcanbebrieflydescribedasfollows.

Let P(Y|X) denote the conditional probability of observing the random variable Y whenever the random variable X takes a particular value. P(Y|X) is often read as the probability of observing Y conditioned on the outcome of X. Conditional probabilities can be used for answering questions such as "given that it is going to rain today, what will be the probability that I will go to the class." Conditional probabilities of X and Y are related to their joint probability in the following way:

P(Y|X) = P(X, Y) / P(X),  which implies     (4.9)

P(X, Y) = P(Y|X) × P(X) = P(X|Y) × P(Y).     (4.10)

Rearranging the last two expressions in Equation 4.10 leads to Equation 4.11, which is known as Bayes theorem:

P(Y|X) = P(X|Y) P(Y) / P(X).     (4.11)

Bayes theorem provides a relationship between the conditional probabilities P(Y|X) and P(X|Y). Note that the denominator in Equation 4.11 involves the marginal probability of X, which can also be represented as

P(X) = Σ_{i=1}^{k} P(X, yi) = Σ_{i=1}^{k} P(X|yi) × P(yi).

Using the previous expression for P(X), we can obtain the following equation for P(Y|X) solely in terms of P(X|Y) and P(Y):

P(Y|X) = P(X|Y) P(Y) / ( Σ_{i=1}^{k} P(X|yi) P(yi) ).     (4.12)

Example 4.4. [Bayes Theorem] Bayes theorem can be used to solve a number of inferential questions about random variables. For example, consider the problem stated at the beginning on inferring whether Alex will come to the party. Let P(A = 1) denote the probability of Alex going to a party, while P(A = 0) denotes the probability of him not going to a party. We know that

P(A = 1) = 0.4,  and  P(A = 0) = 1 − P(A = 1) = 0.6.

Further, let P(M = 1|A) denote the conditional probability of Martha going to a party conditioned on whether Alex is going to the party. P(M = 1|A) takes the following values:

P(M = 1|A = 1) = 0.8,  and  P(M = 1|A = 0) = 0.3.

We can use the above values of P(M|A) and P(A) to compute the probability of Alex going to the party given Martha is going to the party, P(A = 1|M = 1), as follows:

P(A = 1|M = 1) = P(M = 1|A = 1) P(A = 1) / ( P(M = 1|A = 0) P(A = 0) + P(M = 1|A = 1) P(A = 1) ) = (0.8 × 0.4) / (0.3 × 0.6 + 0.8 × 0.4) = 0.64.     (4.13)

Notice that even though the prior probability P(A) of Alex going to the party is low, the observation that Martha is going, M = 1, affects the conditional probability P(A = 1|M = 1). This shows the value of Bayes theorem in combining prior assumptions with observed outcomes to make predictions. Since P(A = 1|M = 1) > 0.5, it is more likely for Alex to join if Martha is going to the party.
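The arithmetic of Example 4.4 can be checked with a few lines of Python; the sketch below simply applies Equation 4.12 to the probabilities stated in the example.

    # Probabilities given in Example 4.4.
    p_a1 = 0.4                 # P(A = 1): Alex attends a party
    p_a0 = 1.0 - p_a1          # P(A = 0)
    p_m1_given_a1 = 0.8        # P(M = 1 | A = 1)
    p_m1_given_a0 = 0.3        # P(M = 1 | A = 0)

    # Bayes theorem (Equation 4.12): P(A = 1 | M = 1).
    p_m1 = p_m1_given_a1 * p_a1 + p_m1_given_a0 * p_a0   # marginal P(M = 1)
    p_a1_given_m1 = p_m1_given_a1 * p_a1 / p_m1
    print(p_a1_given_m1)   # 0.64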

Using Bayes Theorem for Classification

For the purpose of classification, we are interested in computing the probability of observing a class label y for a data instance given its set of attribute values x. This can be represented as P(y|x), which is known as the posterior probability of the target class. Using the Bayes theorem, we can represent the posterior probability as

P(y|x) = P(x|y) P(y) / P(x).     (4.14)

Note that the numerator of the previous equation involves two terms, P(x|y) and P(y), both of which contribute to the posterior probability P(y|x). We describe both of these terms in the following.

The first term P(x|y) is known as the class-conditional probability of the attributes given the class label. P(x|y) measures the likelihood of observing x from the distribution of instances belonging to y. If x indeed belongs to class y, then we should expect P(x|y) to be high. From this point of view, the use of class-conditional probabilities attempts to capture the process from which the data instances were generated. Because of this interpretation, probabilistic classification models that involve computing class-conditional probabilities are

knownasgenerativeclassificationmodels.Apartfromtheiruseincomputingposteriorprobabilitiesandmakingpredictions,class-conditionalprobabilitiesalsoprovideinsightsabouttheunderlyingmechanismbehindthegenerationofattributevalues.

The second term in the numerator of Equation 4.14 is the prior probability P(y). The prior probability captures our prior beliefs about the distribution of class labels, independent of the observed attribute values. (This is the Bayesian viewpoint.) For example, we may have a prior belief that the likelihood of any person to suffer from a heart disease is α, irrespective of their diagnosis reports. The prior probability can either be obtained using expert knowledge, or inferred from the historical distribution of class labels.

The denominator in Equation 4.14 involves the probability of evidence, P(x). Note that this term does not depend on the class label and thus can be treated as a normalization constant in the computation of posterior probabilities. Further, the value of P(x) can be calculated as P(x) = Σ_i P(x|yi) P(yi).

Bayes theorem provides a convenient way to combine our prior beliefs with the likelihood of obtaining the observed attribute values. During the training phase, we are required to learn the parameters for P(y) and P(x|y). The prior probability P(y) can be easily estimated from the training set by computing the fraction of training instances that belong to each class. To compute the class-conditional probabilities, one approach is to consider the fraction of training instances of a given class for every possible combination of attribute values. For example, suppose that there are two attributes X1 and X2 that can each take a discrete value from c1 to ck. Let n0 denote the number of training instances belonging to class 0, out of which nij0 training instances have X1 = ci and X2 = cj. The class-conditional probability can then be given as

P(X1 = ci, X2 = cj | Y = 0) = nij0 / n0.

This approach can easily become computationally prohibitive as the number of attributes increases, due to the exponential growth in the number of attribute value combinations. For example, if every attribute can take k discrete values, then the number of attribute value combinations is equal to k^d, where d is the number of attributes. The large number of attribute value combinations can also result in poor estimates of class-conditional probabilities, since every combination will have fewer training instances when the size of the training set is small.

In the following, we present the naïve Bayes classifier, which makes a simplifying assumption about the class-conditional probabilities, known as the naïve Bayes assumption. The use of this assumption significantly helps in obtaining reliable estimates of class-conditional probabilities, even when the number of attributes is large.

4.4.2 Naïve Bayes Assumption

The naïve Bayes classifier assumes that the class-conditional probability of all attributes x can be factored as a product of class-conditional probabilities of every attribute xi, as described in the following equation:

P(x|y) = ∏_{i=1}^{d} P(xi|y),     (4.15)

where every data instance x consists of d attributes, {x1, x2, …, xd}. The basic assumption behind the previous equation is that the attribute values xi are conditionally independent of each other, given the class label y. This means that the attributes are influenced only by the target class and if we

knowtheclasslabel,thenwecanconsidertheattributestobeindependentofeachother.Theconceptofconditionalindependencecanbeformallystatedasfollows.

Conditional Independence

Let X1, X2, and Y denote three sets of random variables. The variables in X1 are said to be conditionally independent of X2, given Y, if the following condition holds:

P(X1|X2, Y) = P(X1|Y).     (4.16)

This means that conditioned on Y, the distribution of X1 is not influenced by the outcomes of X2, and hence X1 is conditionally independent of X2. To illustrate the notion of conditional independence, consider the relationship between a person's arm length (X1) and his or her reading skills (X2). One might observe that people with longer arms tend to have higher levels of reading skills, and thus consider X1 and X2 to be related to each other. However, this relationship can be explained by another factor, which is the age of the person (Y). A young child tends to have short arms and lacks the reading skills of an adult. If the age of a person is fixed, then the observed relationship between arm length and reading skills disappears. Thus, we can conclude that arm length and reading skills are not directly related to each other and are conditionally independent when the age variable is fixed.

Another way of describing conditional independence is to consider the joint conditional probability, P(X1, X2|Y), as follows:

P(X1, X2|Y) = P(X1, X2, Y) / P(Y)
            = [ P(X1, X2, Y) / P(X2, Y) ] × [ P(X2, Y) / P(Y) ]
            = P(X1|X2, Y) × P(X2|Y)
            = P(X1|Y) × P(X2|Y),     (4.17)

where Equation 4.16 was used to obtain the last line of Equation 4.17. The previous description of conditional independence is quite useful from an operational perspective. It states that the joint conditional probability of X1 and X2 given Y can be factored as the product of conditional probabilities of X1 and X2 considered separately. This forms the basis of the naïve Bayes assumption stated in Equation 4.15.
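A tiny numeric sketch of Equation 4.17 in Python: the joint distribution below is constructed, by assumption, so that X1 and X2 are conditionally independent given Y, and the code verifies that P(X1, X2 | Y) equals P(X1 | Y) × P(X2 | Y) for one assignment of values.

    # Construct P(X1, X2, Y) so that X1 and X2 are independent given Y.
    p_y = {0: 0.5, 1: 0.5}
    p_x1_given_y = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
    p_x2_given_y = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.4, 1: 0.6}}

    joint = {(x1, x2, y): p_y[y] * p_x1_given_y[y][x1] * p_x2_given_y[y][x2]
             for x1 in (0, 1) for x2 in (0, 1) for y in (0, 1)}

    # Check Equation 4.17 for one assignment: P(X1=1, X2=1 | Y=0) = P(X1=1|Y=0) * P(X2=1|Y=0).
    lhs = joint[(1, 1, 0)] / p_y[0]
    rhs = p_x1_given_y[0][1] * p_x2_given_y[0][1]
    print(lhs, rhs)   # both 0.03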

How a Naïve Bayes Classifier Works

Using the naïve Bayes assumption, we only need to estimate the conditional probability of each xi given Y separately, instead of computing the class-conditional probability for every combination of attribute values. For example, if ni0 and nj0 denote the number of training instances belonging to class 0 with X1 = ci and X2 = cj, respectively, then the class-conditional probability can be estimated as

P(X1 = ci, X2 = cj | Y = 0) = (ni0 / n0) × (nj0 / n0).

In the previous equation, we only need to count the number of training instances for every one of the k values of an attribute X, irrespective of the values of other attributes. Hence, the number of parameters needed to learn class-conditional probabilities is reduced from k^d to dk. This greatly simplifies the expression for the class-conditional probability and makes it more amenable to learning parameters and making predictions, even in high-dimensional settings.

The naïve Bayes classifier computes the posterior probability for a test instance x by using the following equation:

P(y|x) = P(y) ∏_{i=1}^{d} P(xi|y) / P(x).     (4.18)

SinceP( )isfixedforeveryyandonlyactsasanormalizingconstanttoensurethat ,wecanwrite

Hence,itissufficienttochoosetheclassthatmaximizes .

One of the useful properties of the naïve Bayes classifier is that it can easily work with incomplete information about data instances, when only a subset of attributes is observed at every instance. For example, if we only observe p out of d attributes at a data instance, then we can still compute P(y) ∏_{i=1}^{p} P(x_i|y) using those p attributes and choose the class with the maximum value. The naïve Bayes classifier can thus naturally handle missing values in test instances. In fact, in the extreme case where no attributes are observed, we can still use the prior probability P(y) as an estimate of the posterior probability. As we observe more attributes, we can keep refining the posterior probability to better reflect the likelihood of observing the data instance.

In the next two subsections, we describe several approaches for estimating the conditional probabilities P(x_i|y) for categorical and continuous attributes from the training set.

Estimating Conditional Probabilities for Categorical Attributes
For a categorical attribute X_i, the conditional probability P(X_i = c|y) is estimated according to the fraction of training instances in class y where X_i takes on a particular categorical value c:

P(X_i = c|y) = n_c / n,

where n is the number of training instances belonging to class y, out of which n_c instances have X_i = c. For example, in the training set given in Figure 4.8, seven people have the class label Defaulted Borrower = No, out of which three people have Home Owner = Yes while the remaining four have Home Owner = No. As a result, the conditional probability P(Home Owner = Yes|Defaulted Borrower = No) is equal to 3/7. Similarly, the conditional probability for defaulted borrowers with Marital Status = Single is given by P(Marital Status = Single|Defaulted Borrower = Yes) = 2/3. Note that the sum of conditional probabilities over all possible outcomes of X_i is equal to one, i.e., ∑_c P(X_i = c|y) = 1.

Figure 4.8. Training set for predicting the loan default problem.
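The counting involved in these estimates is simple to implement. The following minimal sketch (in Python) computes P(X_i = c|y) = n_c/n from a handful of records; the records and the function name are illustrative stand-ins of ours, not the actual table of Figure 4.8.

```python
from collections import Counter, defaultdict

# Hypothetical training records in the spirit of Figure 4.8:
# (Home Owner, Marital Status, Defaulted Borrower)
train = [
    ("Yes", "Single",   "No"),  ("No", "Married",  "No"),
    ("No",  "Single",   "No"),  ("Yes", "Married", "No"),
    ("No",  "Divorced", "Yes"), ("No", "Married",  "No"),
    ("Yes", "Divorced", "No"),  ("No", "Single",   "Yes"),
    ("No",  "Married",  "No"),  ("No", "Single",   "Yes"),
]

def categorical_cond_probs(records, attr_index, label_index=-1):
    """Estimate P(X_attr = c | y) = n_c / n for every class y and value c."""
    class_counts = Counter(r[label_index] for r in records)
    value_counts = defaultdict(Counter)            # value_counts[y][c] = n_c
    for r in records:
        value_counts[r[label_index]][r[attr_index]] += 1
    return {y: {c: value_counts[y][c] / n for c in value_counts[y]}
            for y, n in class_counts.items()}

# e.g., P(Home Owner = Yes | Defaulted Borrower = No) = 3/7 for these records
print(categorical_cond_probs(train, attr_index=0))
```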

Estimating Conditional Probabilities for Continuous Attributes
There are two ways to estimate the class-conditional probabilities for continuous attributes:

1. We can discretize each continuous attribute and then replace the continuous values with their corresponding discrete intervals. This approach transforms the continuous attributes into ordinal attributes, and the simple method described previously for computing the conditional probabilities of categorical attributes can be employed. Note that the estimation error of this method depends on the discretization strategy (as described in Section 2.3.6 on page 63), as well as the number of discrete intervals. If the number of intervals is too large, every interval may have an insufficient number of training instances to provide a reliable estimate of P(X_i|Y). On the other hand, if the number of intervals is too small, then the discretization process may lose information about the true distribution of continuous values, and thus result in poor predictions.

2. We can assume a certain form of probability distribution for the continuous variable and estimate the parameters of the distribution using the training data. For example, we can use a Gaussian distribution to represent the conditional probability of continuous attributes. The Gaussian distribution is characterized by two parameters, the mean, μ, and the variance, σ^2. For each class y_j, the class-conditional probability for attribute X_i is

P(X_i = x_i|Y = y_j) = (1/(√(2π) σ_{ij})) exp[−(x_i − μ_{ij})^2 / (2σ_{ij}^2)].    (4.19)

The parameter μ_{ij} can be estimated using the sample mean of X_i (x̄) over all training instances that belong to y_j. Similarly, σ_{ij}^2 can be estimated from the sample variance (s^2) of such training instances. For example, consider the annual income attribute shown in Figure 4.8. The sample mean and variance for this attribute with respect to the class No are

x̄ = (125 + 100 + 70 + … + 75)/7 = 110,
s^2 = [(125 − 110)^2 + (100 − 110)^2 + … + (75 − 110)^2]/6 = 2975,
s = √2975 = 54.54.

Given a test instance with taxable income equal to $120K, we can use the following value as its conditional probability given class No:

P(Income = 120|No) = (1/(√(2π) × 54.54)) exp[−(120 − 110)^2/(2 × 2975)] = 0.0072.
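As a quick check of the arithmetic above, the Gaussian density of Equation 4.19 can be evaluated directly; this short sketch uses the sample statistics just computed (the function name is our own choice, not part of the text).

```python
import math

def gaussian_cond_prob(x, mean, var):
    """P(X_i = x | Y = y_j) under the Gaussian assumption of Equation 4.19."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

# Annual Income given class No, using the sample mean and variance derived above.
mean_no, var_no = 110.0, 2975.0        # income measured in thousands of dollars
print(round(gaussian_cond_prob(120.0, mean_no, var_no), 4))   # ~0.0072
```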

Example 4.5. [Naïve Bayes Classifier] Consider the data set shown in Figure 4.9(a), where the target class is Defaulted Borrower, which can take two values Yes and No. We can compute the class-conditional probability for each categorical attribute and the sample mean and variance for the continuous attribute, as summarized in Figure 4.9(b).

We are interested in predicting the class label of a test instance x = (Home Owner = No, Marital Status = Married, Annual Income = $120K). To do this, we first compute the prior probabilities by counting the number of training instances belonging to every class. We thus obtain P(Yes) = 0.3 and P(No) = 0.7. Next, we can compute the class-conditional probabilities as follows:

P(x|No) = P(Home Owner = No|No) × P(Marital Status = Married|No) × P(Annual Income = $120K|No) = 4/7 × 4/7 × 0.0072 = 0.0024,
P(x|Yes) = P(Home Owner = No|Yes) × P(Marital Status = Married|Yes) × P(Annual Income = $120K|Yes) = 1 × 0 × 1.2 × 10^{−9} = 0.

Figure 4.9. The naïve Bayes classifier for the loan classification problem.

Notice that the class-conditional probability for class Yes has become 0 because there are no instances belonging to class Yes with Marital Status = Married in the training set. Using these class-conditional probabilities, we can estimate the posterior probabilities as

P(No|x) = (0.7 × 0.0024)/P(x) = 0.0016 α,
P(Yes|x) = (0.3 × 0)/P(x) = 0,

where α = 1/P(x) is a normalizing constant. Since P(No|x) > P(Yes|x), the instance is classified as No.

Handling Zero Conditional Probabilities
The preceding example illustrates a potential problem with using the naïve Bayes assumption in estimating class-conditional probabilities. If the conditional probability for any of the attributes is zero, then the entire expression for the class-conditional probability becomes zero. Note that zero conditional probabilities arise when the number of training instances is small and the number of possible values of an attribute is large. In such cases, it may happen that a combination of attribute values and class labels is never observed, resulting in a zero conditional probability.

In a more extreme case, if the training instances do not cover some combinations of attribute values and class labels, then we may not be able to even classify some of the test instances. For example, if P(Marital Status = Divorced|No) is zero instead of 1/7, then a data instance with attribute set x = (Home Owner = Yes, Marital Status = Divorced, Income = $120K) has the following class-conditional probabilities:

P(x|No) = 3/7 × 0 × 0.0072 = 0,
P(x|Yes) = 0 × 1/3 × 1.2 × 10^{−9} = 0.

Since both of the class-conditional probabilities are 0, the naïve Bayes classifier will not be able to classify the instance. To address this problem, it is important to adjust the conditional probability estimates so that they are not as brittle as simply using fractions of training instances. This can be achieved by using the following alternate estimates of conditional probability:

Laplace estimate: P(X_i = c|y) = (n_c + 1)/(n + v),    (4.20)

m-estimate: P(X_i = c|y) = (n_c + m p)/(n + m),    (4.21)

where n is the number of training instances belonging to class y, n_c is the number of training instances with X_i = c and Y = y, v is the total number of attribute values that X_i can take, p is some initial estimate of P(X_i = c|y) that is known a priori, and m is a hyper-parameter that indicates our confidence in using p when the fraction of training instances is too brittle. Note that even if n_c = 0, both the Laplace and m-estimates provide non-zero values of conditional probabilities. Hence, they avoid the problem of vanishing class-conditional probabilities and thus generally provide more robust estimates of posterior probabilities.
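For concreteness, here is a minimal sketch of Equations 4.20 and 4.21; the numbers plugged in (a class with n = 3 instances, an attribute with v = 3 values, a prior guess p = 1/3 with m = 3) are illustrative choices of ours rather than values prescribed by the text.

```python
def laplace_estimate(n_c, n, v):
    """Laplace estimate of P(X_i = c | y), Equation 4.20."""
    return (n_c + 1) / (n + v)

def m_estimate(n_c, n, p, m):
    """m-estimate of P(X_i = c | y), Equation 4.21."""
    return (n_c + m * p) / (n + m)

# Even when a value is never seen with a class (n_c = 0), the estimates stay non-zero.
print(laplace_estimate(n_c=0, n=3, v=3))        # 1/6 instead of 0
print(m_estimate(n_c=0, n=3, p=1/3, m=3))       # 1/6 with prior p = 1/3 and m = 3
```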

Characteristics of Naïve Bayes Classifiers

1. Naïve Bayes classifiers are probabilistic classification models that are able to quantify the uncertainty in predictions by providing posterior probability estimates. They are also generative classification models as they treat the target class as the causative factor for generating the data instances. Hence, apart from computing posterior probabilities, naïve Bayes classifiers also attempt to capture the underlying mechanism behind the generation of data instances belonging to every class. They are thus useful for gaining predictive as well as descriptive insights.

2. By using the naïve Bayes assumption, they can easily compute class-conditional probabilities even in high-dimensional settings, provided that the attributes are conditionally independent of each other given the class labels. This property makes the naïve Bayes classifier a simple and effective classification technique that is commonly used in diverse application problems, such as text classification.

3. Naïve Bayes classifiers are robust to isolated noise points because such points are not able to significantly impact the conditional probability estimates, as they are often averaged out during training.

4. Naïve Bayes classifiers can handle missing values in the training set by ignoring the missing values of every attribute while computing its conditional probability estimates. Further, naïve Bayes classifiers can effectively handle missing values in a test instance, by using only the non-missing attribute values while computing posterior probabilities. However, if the frequency of missing values for a particular attribute value depends on the class label, then this approach will not accurately estimate posterior probabilities.

5. Naïve Bayes classifiers are robust to irrelevant attributes. If X_i is an irrelevant attribute, then P(X_i|Y) becomes almost uniformly distributed for every class y. The class-conditional probabilities for every class thus receive similar contributions of P(X_i|Y), resulting in negligible impact on the posterior probability estimates.

6. Correlated attributes can degrade the performance of naïve Bayes classifiers because the naïve Bayes assumption of conditional independence no longer holds for such attributes. For example, consider the following probabilities:

P(A = 0|Y = 0) = 0.4,  P(A = 1|Y = 0) = 0.6,
P(A = 0|Y = 1) = 0.6,  P(A = 1|Y = 1) = 0.4,

where A is a binary attribute and Y is a binary class variable. Suppose there is another binary attribute B that is perfectly correlated with A when Y = 0, but is independent of A when Y = 1. For simplicity, assume that the conditional probabilities for B are the same as for A. Given an instance with attributes A = 0, B = 0, and assuming conditional independence, we can compute its posterior probabilities as follows:

P(Y = 0|A = 0, B = 0) = P(A = 0|Y = 0) P(B = 0|Y = 0) P(Y = 0)/P(A = 0, B = 0) = 0.16 × P(Y = 0)/P(A = 0, B = 0),
P(Y = 1|A = 0, B = 0) = P(A = 0|Y = 1) P(B = 0|Y = 1) P(Y = 1)/P(A = 0, B = 0) = 0.36 × P(Y = 1)/P(A = 0, B = 0).

If P(Y = 0) = P(Y = 1), then the naïve Bayes classifier would assign the instance to class 1. However, the truth is,

P(A = 0, B = 0|Y = 0) = P(A = 0|Y = 0) = 0.4,

because A and B are perfectly correlated when Y = 0. As a result, the posterior probability for Y = 0 is

P(Y = 0|A = 0, B = 0) = P(A = 0, B = 0|Y = 0) P(Y = 0)/P(A = 0, B = 0) = 0.4 × P(Y = 0)/P(A = 0, B = 0),

which is larger than that for Y = 1. The instance should have been classified as class 0 (a small numeric sketch of this effect follows the list). Hence, the naïve Bayes classifier can produce incorrect results when the attributes are not conditionally independent given the class labels. Naïve Bayes classifiers are thus not well-suited for handling redundant or interacting attributes.
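The following few lines, a sketch of ours assuming equal priors P(Y = 0) = P(Y = 1) = 0.5, reproduce the comparison above: the independence assumption favors class 1 while the true joint probabilities favor class 0.

```python
# Naive (independence) vs. true unnormalized posteriors for the correlated-attribute
# example, assuming equal priors P(Y=0) = P(Y=1) = 0.5.
p_a0 = {0: 0.4, 1: 0.6}          # P(A=0|Y=y); B has the same conditionals as A
prior = {0: 0.5, 1: 0.5}

naive = {y: p_a0[y] * p_a0[y] * prior[y] for y in (0, 1)}     # P(A=0|y) P(B=0|y) P(y)
true  = {0: p_a0[0] * prior[0],                               # B copies A when Y=0
         1: p_a0[1] * p_a0[1] * prior[1]}                     # B independent when Y=1

print(naive)   # {0: 0.08, 1: 0.18}  -> naive Bayes picks class 1
print(true)    # {0: 0.20, 1: 0.18}  -> the correct choice is class 0
```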

4.5 Bayesian Networks
The conditional independence assumption made by naïve Bayes classifiers may seem too rigid, especially for classification problems where the attributes are dependent on each other even after conditioning on the class labels. We thus need an approach to relax the naïve Bayes assumption so that we can capture more generic representations of conditional independence among attributes.

In this section, we present a flexible framework for modeling probabilistic relationships between attributes and class labels, known as Bayesian networks. By building on concepts from probability theory and graph theory, Bayesian networks are able to capture more generic forms of conditional independence using simple schematic representations. They also provide the necessary computational structure to perform inferences over random variables in an efficient way. In the following, we first describe the basic representation of a Bayesian network, and then discuss methods for performing inference and learning model parameters in the context of classification.

4.5.1 Graphical Representation

Bayesian networks belong to a broader family of models for capturing probabilistic relationships among random variables, known as probabilistic graphical models. The basic concept behind these models is to use graphical representations where the nodes of the graph correspond to random variables and the edges between the nodes express probabilistic relationships. Figures 4.10(a) and 4.10(b) show examples of probabilistic graphical models using directed edges (with arrows) and undirected edges (without arrows), respectively. Directed graphical models are also known as Bayesian networks, while undirected graphical models are known as Markov random fields. The two approaches use different semantics for expressing relationships among random variables and are thus useful in different contexts. In the following, we briefly describe Bayesian networks that are useful in the context of classification.

A Bayesian network (also referred to as a belief network) involves directed edges between nodes, where every edge represents a direction of influence among random variables. For example, Figure 4.10(a) shows a Bayesian network where variable C depends upon the values of variables A and B, as indicated by the arrows pointing toward C from A and B. Consequently, the variable C influences the values of variables D and E. Every edge in a Bayesian network thus encodes a dependence relationship between random variables with a particular directionality.

Figure 4.10. Illustrations of two basic types of graphical models.

Bayesian networks are directed acyclic graphs (DAG) because they do not contain any directed cycles such that the influence of a node loops back to the same node. Figure 4.11 shows some examples of Bayesian networks that capture different types of dependence structures among random variables. In a directed acyclic graph, if there is a directed edge from X to Y, then X is called the parent of Y and Y is called the child of X. Note that a node can have multiple parents in a Bayesian network, e.g., node D has two parent nodes, B and C, in Figure 4.11(a). Furthermore, if there is a directed path in the network from X to Z, then X is an ancestor of Z, while Z is a descendant of X. For example, in the diagram shown in Figure 4.11(b), A is a descendant of D and D is an ancestor of B. Note that there can be multiple directed paths between two nodes of a directed acyclic graph, as is the case for nodes A and D in Figure 4.11(a).

Figure 4.11. Examples of Bayesian networks.

Conditional Independence
An important property of a Bayesian network is its ability to represent varying forms of conditional independence among random variables. There are several ways of describing the conditional independence assumptions captured by Bayesian networks. One of the most generic ways of expressing conditional independence is the concept of d-separation, which can be used to determine if any two sets of nodes A and B are conditionally independent given another set of nodes C. Another useful concept is that of the Markov blanket of a node Y, which denotes the minimal set of nodes X that makes Y independent of the other nodes in the graph, when conditioned on X. (See Bibliographic Notes for more details on d-separation and Markov blankets.) However, for the purpose of classification, it is sufficient to describe a simpler expression of conditional independence in Bayesian networks, known as the local Markov property.

Property 1 (Local Markov Property). A node in a Bayesian network is conditionally independent of its non-descendants, if its parents are known.

To illustrate the local Markov property, consider the Bayesian network shown in Figure 4.11(b). We can state that A is conditionally independent of both B and D given C, because C is the parent of A and nodes B and D are non-descendants of A. The local Markov property helps in interpreting parent-child relationships in Bayesian networks as representations of conditional probabilities. Since a node is conditionally independent of its non-descendants given its parents, the conditional independence assumptions imposed by a Bayesian network are often sparse in structure. Nonetheless, Bayesian networks are able to express a richer class of conditional independence statements among attributes and class labels than the naïve Bayes classifier. In fact, the naïve Bayes classifier can be viewed as a special type of Bayesian network, where the target class Y is at the root of a tree and every attribute X_i is connected to the root node by a directed edge, as shown in Figure 4.12(a).

Figure 4.12. Comparing the graphical representation of a naïve Bayes classifier with that of a generic Bayesian network.

Note that in a naïve Bayes classifier, every directed edge points from the target class to the observed attributes, suggesting that the class label is a factor behind the generation of attributes. Inferring the class label can thus be viewed as diagnosing the root cause behind the observed attributes. On the other hand, Bayesian networks provide a more generic structure of probabilistic relationships, since the target class is not required to be at the root of a tree but can appear anywhere in the graph, as shown in Figure 4.12(b). In this diagram, inferring Y not only helps in diagnosing the factors influencing X_3 and X_4, but also helps in predicting the influence of X_1 and X_2.

Joint Probability
The local Markov property can be used to succinctly express the joint probability of the set of random variables involved in a Bayesian network. To realize this, let us first consider a Bayesian network consisting of d nodes, X_1 to X_d, where the nodes have been numbered in such a way that X_i is an ancestor of X_j only if i < j. The joint probability of X = {X_1, …, X_d} can be generically factorized using the chain rule of probability as

P(X) = P(X_1) P(X_2|X_1) P(X_3|X_1, X_2) … P(X_d|X_1, …, X_{d−1}) = ∏_{i=1}^{d} P(X_i|X_1, …, X_{i−1}).    (4.22)

By the way we have constructed the graph, note that the set {X_1, …, X_{i−1}} contains only non-descendants of X_i. Hence, by using the local Markov property, we can write P(X_i|X_1, …, X_{i−1}) as P(X_i|pa(X_i)), where pa(X_i) denotes the parents of X_i. The joint probability can then be represented as

P(X) = ∏_{i=1}^{d} P(X_i|pa(X_i)).    (4.23)

It is thus sufficient to represent the probability of every node X_i in terms of its parent nodes, pa(X_i), for computing P(X). This is achieved with the help of probability tables that associate every node with its parent nodes as follows:

1. The probability table for node X_i contains the conditional probability values P(X_i|pa(X_i)) for every combination of values in X_i and pa(X_i).

2. If X_i has no parents (pa(X_i) = ∅), then the table contains only the prior probability P(X_i).

Example 4.6. [Probability Tables] Figure 4.13 shows an example of a Bayesian network for modeling the relationships between a patient's symptoms and risk factors. The probability tables are shown at the side of every node in the figure. The probability tables associated with the risk factors (Exercise and Diet) contain only the prior probabilities, whereas the tables for heart disease, heartburn, blood pressure, and chest pain contain the conditional probabilities.

Figure 4.13. A Bayesian network for detecting heart disease and heartburn in patients.
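To make Equation 4.23 concrete, the sketch below evaluates the joint probability of a tiny network whose structure loosely mirrors the risk factors of Figure 4.13; all of the probability values are made-up placeholders, since the actual tables live in the figure.

```python
# A minimal sketch of Equation 4.23: P(X) = prod_i P(X_i | pa(X_i)).
# The structure loosely follows Figure 4.13, but the numbers are hypothetical.

p_exercise = {"Yes": 0.7, "No": 0.3}                 # prior table (no parents)
p_diet     = {"Healthy": 0.25, "Unhealthy": 0.75}    # prior table (no parents)

# conditional table P(Heart Disease = Yes | Exercise, Diet), hypothetical values
p_hd = {("Yes", "Healthy"): 0.25, ("Yes", "Unhealthy"): 0.45,
        ("No",  "Healthy"): 0.55, ("No",  "Unhealthy"): 0.75}

def joint(exercise, diet, heart_disease):
    """P(E, D, HD) = P(E) * P(D) * P(HD | E, D)."""
    p_hd_given = p_hd[(exercise, diet)]
    if heart_disease == "No":
        p_hd_given = 1.0 - p_hd_given
    return p_exercise[exercise] * p_diet[diet] * p_hd_given

print(joint("Yes", "Healthy", "Yes"))   # 0.7 * 0.25 * 0.25 = 0.04375
```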

Use of Hidden Variables
A Bayesian network typically involves two types of variables: observed variables that are clamped to specific observed values, and unobserved variables, whose values are not known and need to be inferred from the network. To distinguish between these two types of variables, observed variables are generally represented using shaded nodes while unobserved variables are represented using empty nodes. Figure 4.14 shows an example of a Bayesian network with observed variables (A, B, and E) and unobserved variables (C and D).

Figure 4.14. Observed and unobserved variables are represented using shaded and unshaded circles, respectively.

In the context of classification, the observed variables correspond to the set of attributes X, while the target class is represented using an unobserved variable Y that needs to be inferred during testing. However, note that a generic Bayesian network may contain many other unobserved variables apart from the target class, represented in Figure 4.15 as the set of variables H. These unobserved variables represent hidden or confounding factors that affect the probabilities of attributes and class labels, although they are never directly observed. The use of hidden variables enhances the expressive power of Bayesian networks in representing complex probabilistic relationships between attributes and class labels. This is one of the key distinguishing properties of Bayesian networks as compared to naïve Bayes classifiers.

4.5.2 Inference and Learning

Given the probability tables corresponding to every node in a Bayesian network, the problem of inference corresponds to computing the probabilities of different sets of random variables. In the context of classification, one of the key inference problems is to compute the probability of a target class Y taking on a specific value y, given the set of observed attributes at a data instance, x. This can be represented using the following conditional probability:

P(Y = y|x) = P(y, x)/P(x) = P(y, x)/∑_{y′} P(y′, x).    (4.24)

The previous equation involves marginal probabilities of the form P(y, x). They can be computed by marginalizing out the hidden variables H from the joint probability as follows:

P(y, x) = ∑_H P(y, x, H),    (4.25)

where the joint probability P(y, x, H) can be obtained by using the factorization described in Equation 4.23. To understand the nature of computations involved in estimating P(y, x), consider the example Bayesian network shown in Figure 4.15, which involves a target class, Y, three observed attributes, X_1 to X_3, and four hidden variables, H_1 to H_4. For this network, we can express P(y, x) as

P(y, x) = ∑_{h_1} ∑_{h_2} ∑_{h_3} ∑_{h_4} P(y, x_1, x_2, x_3, h_1, h_2, h_3, h_4)
        = ∑_{h_1} ∑_{h_2} ∑_{h_3} ∑_{h_4} [P(h_1) P(h_2) P(x_2) P(h_4) P(x_1|h_1, h_2) × P(h_3|x_2, h_2) P(y|x_1, h_3) P(x_3|h_3, h_4)]    (4.26)
        = ∑_{h_1} ∑_{h_2} ∑_{h_3} ∑_{h_4} f(h_1, h_2, h_3, h_4),    (4.27)

where f is a factor that depends on the values of h_1 to h_4. In the previous simplistic expression of P(y, x), a different summand is considered for every combination of values, h_1 to h_4, of the hidden variables, H_1 to H_4. If we assume that every variable in the network can take k discrete values, then the summation has to be carried out a total of k^4 times. The computational complexity of this approach is thus O(k^4). Moreover, the number of computations grows exponentially with the number of hidden variables, making it difficult to use this approach with networks that have a large number of hidden variables. In the following, we present different computational techniques for efficiently performing inferences in Bayesian networks.

Figure 4.15. An example of a Bayesian network with four hidden variables, H_1 to H_4, three observed attributes, X_1 to X_3, and one target class Y.

Variable Elimination
To reduce the number of computations involved in estimating P(y, x), let us closely examine the expressions in Equations 4.26 and 4.27. Notice that although f(h_1, h_2, h_3, h_4) depends on the values of all four hidden variables, it can be decomposed as a product of several smaller factors, where every factor involves only a small number of hidden variables. For example, the factor P(h_4) depends only on the value of h_4, and thus acts as a constant multiplicative term when summations are performed over h_1, h_2, or h_3. Hence, if we place P(h_4) outside the summations over h_1 to h_3, we can save some repeated multiplications occurring inside every summand.

In general, we can push every summation as far inside as possible, so that the factors that do not depend on the summing variable are placed outside the summation. This will help reduce the number of wasteful computations by using smaller factors at every summation. To illustrate this process, consider the following sequence of steps for computing P(y, x), obtained by rearranging the order of summations in Equation 4.26:

P(y, x) = P(x_2) ∑_{h_4} P(h_4) ∑_{h_3} P(y|x_1, h_3) P(x_3|h_3, h_4) × ∑_{h_2} P(h_2) P(h_3|x_2, h_2) ∑_{h_1} P(h_1) P(x_1|h_1, h_2)    (4.28)
        = P(x_2) ∑_{h_4} P(h_4) ∑_{h_3} P(y|x_1, h_3) P(x_3|h_3, h_4) × ∑_{h_2} P(h_2) P(h_3|x_2, h_2) f_1(h_2)    (4.29)
        = P(x_2) ∑_{h_4} P(h_4) ∑_{h_3} P(y|x_1, h_3) P(x_3|h_3, h_4) f_2(h_3)    (4.30)
        = P(x_2) ∑_{h_4} P(h_4) f_3(h_4),    (4.31)

where f_i represents the intermediate factor term obtained by summing out h_i. To check if the previous rearrangements provide any improvements in computational efficiency, let us count the number of computations occurring at every step of the process. At the first step (Equation 4.28), we perform a summation over h_1 using factors that depend on h_1 and h_2. This requires considering every pair of values in h_1 and h_2, resulting in O(k^2) computations. Similarly, the second step (Equation 4.29) involves summing out h_2 using factors of h_2 and h_3, leading to O(k^2) computations. The third step (Equation 4.30) again requires O(k^2) computations as it involves summing out h_3 over factors depending on h_3 and h_4. Finally, the fourth step (Equation 4.31) involves summing out h_4 using factors depending on h_4, resulting in O(k) computations.

The overall complexity of the previous approach is thus O(k^2), which is considerably smaller than the O(k^4) complexity of the basic approach. Hence, by merely rearranging summations and using algebraic manipulations, we are able to improve the computational efficiency in computing P(y, x). This procedure is known as variable elimination.

The basic concept that variable elimination exploits to reduce the number of computations is the distributive nature of multiplication over addition. For example, consider the following multiplication and addition operations:

a·(b + c + d) = a·b + a·c + a·d.

Notice that the right-hand side of the previous equation involves three multiplications and two additions, while the left-hand side involves only one multiplication and two additions, thus saving two arithmetic operations. This property is utilized by variable elimination in pushing constant terms outside the summation, such that they are multiplied only once.

Note that the efficiency of variable elimination depends on the order of hidden variables used for performing summations. Hence, we would ideally like to find the optimal order of variables that results in the smallest number of computations. Unfortunately, finding the optimal order of summations for a generic Bayesian network is an NP-hard problem, i.e., there does not exist an efficient algorithm for finding the optimal ordering that can run in polynomial time. However, there exist efficient techniques for handling special types of Bayesian networks, e.g., those involving tree-like graphs, as described in the following.
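The savings from rearranging the summations can be checked numerically. The sketch below uses randomly generated tables standing in for the factors of Figure 4.15 (with the observed values of y, x_1, x_2, x_3 already plugged in, and the constant P(x_2) dropped), and verifies that the brute-force sum of Equation 4.27 and the eliminated form of Equations 4.28 through 4.31 agree.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 3                                # each hidden variable takes k discrete values

p_h1 = rng.random(k)                 # P(h1)
p_h2 = rng.random(k)                 # P(h2)
p_h4 = rng.random(k)                 # P(h4)
p_x1 = rng.random((k, k))            # P(x1 | h1, h2) as a function of (h1, h2)
p_h3 = rng.random((k, k))            # P(h3 | x2, h2) as a function of (h3, h2)
p_y  = rng.random(k)                 # P(y  | x1, h3) as a function of h3
p_x3 = rng.random((k, k))            # P(x3 | h3, h4) as a function of (h3, h4)

# Brute force: O(k^4) summands, as in Equation 4.27.
brute = sum(p_h1[h1] * p_h2[h2] * p_h4[h4] *
            p_x1[h1, h2] * p_h3[h3, h2] * p_y[h3] * p_x3[h3, h4]
            for h1 in range(k) for h2 in range(k)
            for h3 in range(k) for h4 in range(k))

# Variable elimination: push each summation inside, as in Equations 4.28-4.31.
f1 = p_x1.T @ p_h1                   # f1(h2) = sum_h1 P(h1) P(x1|h1,h2)
f2 = p_h3 @ (p_h2 * f1)              # f2(h3) = sum_h2 P(h2) P(h3|x2,h2) f1(h2)
f3 = p_x3.T @ (p_y * f2)             # f3(h4) = sum_h3 P(y|x1,h3) P(x3|h3,h4) f2(h3)
elim = p_h4 @ f3                     # sum_h4 P(h4) f3(h4)

print(np.isclose(brute, elim))       # True: same value, far fewer operations
```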

Sum-Product Algorithm for Trees
Note that in Equations 4.28 and 4.29, whenever a variable h_i is eliminated during marginalization, it results in the creation of a factor f_i that depends on the neighboring nodes of h_i. f_i is then absorbed in the factors of neighboring variables and the process is repeated until all unobserved variables have been marginalized. This phenomenon of variable elimination can be viewed as transmitting a local message from the variable being marginalized to its neighboring nodes. This idea of message passing utilizes the structure of the graph for performing computations, thus making it possible to use graph-theoretic approaches for making effective inferences. The sum-product algorithm builds on the concept of message passing for computing marginal and conditional probabilities on tree-based graphs.

Figure 4.16 shows an example of a tree involving five variables, X_1 to X_5. A key characteristic of a tree is that every node in the tree has exactly one parent, and there is only one path between any two nodes in the tree. For the purpose of illustration, let us consider the problem of estimating the marginal probability of X_2, P(X_2). This can be obtained by marginalizing out every variable in the graph except X_2 and rearranging the summations to obtain the following expression:

P(x_2) = ∑_{x_1} ∑_{x_3} ∑_{x_4} ∑_{x_5} P(x_1) P(x_2|x_1) P(x_3|x_2) P(x_4|x_3) P(x_5|x_3)
       = (∑_{x_1} P(x_1) P(x_2|x_1)) × (∑_{x_3} P(x_3|x_2) (∑_{x_4} P(x_4|x_3)) (∑_{x_5} P(x_5|x_3))),

where the term in the first pair of parentheses is the message m_{12}(x_2), the inner sums over x_4 and x_5 are the messages m_{43}(x_3) and m_{53}(x_3), and the outer sum over x_3 is the message m_{32}(x_2).

Figure 4.16. An example of a Bayesian network with a tree structure.

Here m_{ij}(x_j) has been conveniently chosen to represent the factor of x_j that is obtained by summing out x_i. We can view m_{ij}(x_j) as a local message passed from node x_i to node x_j, as shown using arrows in Figure 4.17(a). These local messages capture the influence of eliminating nodes on the marginal probabilities of neighboring nodes.

Before we formally describe the formula for computing m_{ij}(x_j) and P(x_j), we first define a potential function ψ(·) that is associated with every node and edge of the graph. We can define the potential of a node X_i as

ψ(X_i) = { P(X_i), if X_i is the root node; 1, otherwise. }    (4.32)

Figure 4.17. Illustration of message passing in the sum-product algorithm.

Similarly, we can define the potential of an edge between nodes X_i and X_j (where X_i is the parent of X_j) as

ψ(X_i, X_j) = P(X_j|X_i).

Using ψ(X_i) and ψ(X_i, X_j), we can represent m_{ij}(x_j) using the following equation:

m_{ij}(x_j) = ∑_{x_i} (ψ(x_i) ψ(x_i, x_j) ∏_{k∈N(i)\j} m_{ki}(x_i)),    (4.33)

where N(i) represents the set of neighbors of node X_i. The message m_{ij} that is transmitted from X_i to X_j can thus be recursively computed using the messages incident on X_i from its neighboring nodes excluding X_j. Note that the formula for m_{ij} involves taking a sum over all possible values of x_i, after multiplying the factors obtained from the neighbors of X_i. This approach of message passing is thus called the "sum-product" algorithm. Further, since m_{ij} represents a notion of "belief" propagated from X_i to X_j, this algorithm is also known as belief propagation. The marginal probability of a node X_i is then given as

P(x_i) = ψ(x_i) ∏_{j∈N(i)} m_{ji}(x_i).    (4.34)

A useful property of the sum-product algorithm is that it allows the messages to be reused for computing a different marginal probability in the future. For example, if we had to compute the marginal probability for node X_3, we would require the following messages from its neighboring nodes: m_{23}(x_3), m_{43}(x_3), and m_{53}(x_3). However, note that m_{43}(x_3) and m_{53}(x_3) have already been computed in the process of computing the marginal probability of X_2 and thus can be reused.

Notice that the basic operations of the sum-product algorithm resemble a message passing protocol over the edges of the network. A node sends out a message to a neighboring node only after it has received incoming messages from all its other neighbors. Hence, we can initialize the message passing protocol from the leaf nodes, and transmit messages till we reach the root node. We can then run a second pass of messages from the root node back to the leaf nodes. In this way, we can compute the messages for every edge in both directions, using just O(2|E|) operations, where |E| is the number of edges. Once we have transmitted all possible messages as shown in Figure 4.17(b), we can easily compute the marginal probability of every node in the graph using Equation 4.34.
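As an illustration, the sketch below runs the message computations above on a tree with the same shape as Figure 4.16 (X_1 → X_2 → X_3, with X_4 and X_5 hanging off X_3), using randomly generated conditional tables of our own; it checks the message-passing result for P(x_2) against a brute-force marginalization.

```python
import numpy as np

rng = np.random.default_rng(1)
k = 3

def random_cpt(shape):
    """Random conditional table normalized over its first axis (the child variable)."""
    t = rng.random(shape)
    return t / t.sum(axis=0, keepdims=True)

p_x1 = random_cpt((k,))              # P(x1), root potential
p_x2 = random_cpt((k, k))            # P(x2 | x1), indexed [x2, x1]
p_x3 = random_cpt((k, k))            # P(x3 | x2), indexed [x3, x2]
p_x4 = random_cpt((k, k))            # P(x4 | x3), indexed [x4, x3]
p_x5 = random_cpt((k, k))            # P(x5 | x3), indexed [x5, x3]

# Messages toward X2, mirroring m12, m43, m53, and m32 above.
m12 = p_x2 @ p_x1                    # m12(x2) = sum_x1 P(x1) P(x2|x1)
m43 = p_x4.sum(axis=0)               # m43(x3) = sum_x4 P(x4|x3); all ones here,
m53 = p_x5.sum(axis=0)               # m53(x3) = sum_x5 P(x5|x3); since the CPTs are normalized
m32 = p_x3.T @ (m43 * m53)           # m32(x2) = sum_x3 P(x3|x2) m43(x3) m53(x3)

marginal_x2 = m12 * m32              # Equation 4.34 with psi(x2) = 1 (non-root node)

# Brute-force check: sum the full joint over every variable except X2.
joint = np.einsum('a,ba,cb,dc,ec->b', p_x1, p_x2, p_x3, p_x4, p_x5)
print(np.allclose(marginal_x2, joint))   # True
```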

In the context of classification, the sum-product algorithm can be easily modified for computing the conditional probability of the class label y given the set of observed attributes x̂, i.e., P(y|x̂). This basically amounts to computing P(y, X = x̂) in Equation 4.24, where X is clamped to the observed values x̂. To handle the scenario where some of the random variables are fixed and do not need to be marginalized, we consider the following modification. If X_i is a random variable that is fixed to a specific value x̂_i, then we can simply modify ψ(X_i) and ψ(X_i, X_j) as follows:

ψ(X_i) = { 1, if X_i = x̂_i; 0, otherwise. }    (4.35)

ψ(X_i, X_j) = { P(X_j|x̂_i), if X_i = x̂_i; 0, otherwise. }    (4.36)

We can run the sum-product algorithm using these modified values for every observed variable and thus compute P(y, X = x̂).

Figure 4.18. Example of a poly-tree and its corresponding factor graph.

Generalizations for Non-Tree Graphs
The sum-product algorithm is guaranteed to optimally converge in the case of trees using a single run of message passing in both directions of every edge. This is because any two nodes in a tree have a unique path for the transmission of messages. Furthermore, since every node in a tree has a single parent, the joint probability involves only factors of at most two variables. Hence, it is sufficient to consider potentials over edges and not other generic substructures in the graph.

Both of the previous properties are violated in graphs that are not trees, thus making it difficult to directly apply the sum-product algorithm for making inferences. However, a number of variants of the sum-product algorithm have been devised to perform inferences on a broader family of graphs than trees. Many of these variants transform the original graph into an alternative tree-based representation, and then apply the sum-product algorithm on the transformed tree. In this section, we briefly discuss one such transformation, known as factor graphs.

Factor graphs are useful for making inferences over graphs that violate the condition that every node has a single parent. Nonetheless, they still require the absence of multiple paths between any two nodes, to guarantee convergence. Such graphs are known as poly-trees. An example of a poly-tree is shown in Figure 4.18(a).

A poly-tree can be transformed into a tree-based representation with the help of factor graphs. These graphs consist of two types of nodes, variable nodes (represented using circles) and factor nodes (represented using squares). The factor nodes represent conditional independence relationships among the variables of the poly-tree. In particular, every probability table can be represented as a factor node. The edges in a factor graph are undirected in nature and relate a variable node to a factor node if the variable is involved in the probability table corresponding to the factor node. Figure 4.18(b) presents the factor graph representation of the poly-tree shown in Figure 4.18(a).

Note that the factor graph of a poly-tree always forms a tree-like structure, where there is a unique path of influence between any two nodes in the factor graph. Hence, we can apply a modified form of the sum-product algorithm to transmit messages between variable nodes and factor nodes, which is guaranteed to converge to optimal values.

Learning Model Parameters
In all our previous discussions on Bayesian networks, we had assumed that the topology of the Bayesian network and the values in the probability tables of every node were already known. In this section, we discuss approaches for learning both the topology and the probability table values of a Bayesian network from the training data.

Let us first consider the case where the topology of the network is known and we are only required to compute the probability tables. If there are no unobserved variables in the training data, then we can easily compute the probability table for P(X_i|pa(X_i)) by counting the fraction of training instances for every value of X_i and every combination of values in pa(X_i). However, if there are unobserved variables in X_i or pa(X_i), then computing the fraction of training instances for such variables is non-trivial and requires the use of advanced techniques such as the Expectation-Maximization algorithm (described later in Chapter 8).

Learning the structure of the Bayesian network is a much more challenging task than learning the probability tables. Although there are some scoring approaches that attempt to find a graph structure that maximizes the training likelihood, they are often computationally infeasible when the graph is large. Hence, a common approach for constructing Bayesian networks is to use the subjective knowledge of domain experts.

4.5.3 Characteristics of Bayesian Networks

1. Bayesian networks provide a powerful approach for representing probabilistic relationships between attributes and class labels with the help of graphical models. They are able to capture complex forms of dependencies among variables. Apart from encoding prior beliefs, they are also able to model the presence of latent (unobserved) factors as hidden variables in the graph. Bayesian networks are thus quite expressive and provide predictive as well as descriptive insights about the behavior of attributes and class labels.

2. Bayesian networks can easily handle the presence of correlated or redundant attributes, as opposed to the naïve Bayes classifier. This is because Bayesian networks do not use the naïve Bayes assumption about conditional independence, but instead are able to express richer forms of conditional independence.

3. Similar to the naïve Bayes classifier, Bayesian networks are also quite robust to the presence of noise in the training data. Further, they can handle missing values during training as well as testing. If a test instance contains an attribute X_i with a missing value, then a Bayesian network can perform inference by treating X_i as an unobserved node and marginalizing out its effect on the target class. Hence, Bayesian networks are well-suited for handling incompleteness in the data, and can work with partial information. However, unless the pattern with which missing values occur is completely random, their presence will likely introduce some degree of error and/or bias into the analysis.

4. Bayesian networks are robust to irrelevant attributes that contain no discriminatory information about the class labels. Such attributes have no impact on the conditional probability of the target class, and are thus rightfully ignored.

5. Learning the structure of a Bayesian network is a cumbersome task that often requires assistance from expert knowledge. However, once the structure has been decided, learning the parameters of the network can be quite straightforward, especially if all the variables in the network are observed.

6. Due to their additional ability to represent complex forms of relationships, Bayesian networks are more susceptible to overfitting as compared to the naïve Bayes classifier. Furthermore, Bayesian networks typically require more training instances for effectively learning the probability tables than the naïve Bayes classifier.

7. Although the sum-product algorithm provides computationally efficient techniques for performing inference over tree-like graphs, the complexity of the approach increases significantly when dealing with generic graphs of large sizes. In situations where exact inference is computationally infeasible, it is quite common to use approximate inference techniques.

4.6 Logistic Regression
The naïve Bayes and Bayesian network classifiers described in the previous sections provide different ways of estimating the conditional probability of an instance x given class y, P(x|y). Such models are known as probabilistic generative models. Note that the conditional probability P(x|y) essentially describes the behavior of instances in the attribute space that are generated from class y. However, for the purpose of making predictions, we are finally interested in computing the posterior probability P(y|x). For example, computing the following ratio of posterior probabilities is sufficient for inferring class labels in a binary classification problem:

P(y = 1|x) / P(y = 0|x).

This ratio is known as the odds. If this ratio is greater than 1, then x is classified as y = 1. Otherwise, it is assigned to class y = 0. Hence, one may simply learn a model of the odds based on the attribute values of training instances, without having to compute P(x|y) as an intermediate quantity in the Bayes theorem.

Classification models that directly assign class labels without computing class-conditional probabilities are called discriminative models. In this section, we present a probabilistic discriminative model known as logistic regression, which directly estimates the odds of a data instance x using its attribute values. The basic idea of logistic regression is to use a linear predictor, z = w^T x + b, for representing the odds of x as follows:

P(y = 1|x) / P(y = 0|x) = e^z = e^{w^T x + b},    (4.37)

where w and b are the parameters of the model and a^T denotes the transpose of a vector a. Note that if w^T x + b > 0, then x belongs to class 1 since its odds is greater than 1. Otherwise, x belongs to class 0.

Figure 4.19. Plot of the sigmoid (logistic) function, σ(z).

Since P(y = 0|x) + P(y = 1|x) = 1, we can re-write Equation 4.37 as

P(y = 1|x) / (1 − P(y = 1|x)) = e^z.

This can be further simplified to express P(y = 1|x) as a function of z,

P(y = 1|x) = 1/(1 + e^{−z}) = σ(z),    (4.38)

where the function σ(·) is known as the logistic or sigmoid function. Figure 4.19 shows the behavior of the sigmoid function as we vary z. We can see that σ(z) ≥ 0.5 only when z ≥ 0. We can also derive P(y = 0|x) using σ(z) as follows:

P(y = 0|x) = 1 − σ(z) = 1/(1 + e^{z}).    (4.39)

Hence, if we have learned suitable values of the parameters w and b, we can use Equations 4.38 and 4.39 to estimate the posterior probabilities of any data instance x and determine its class label.
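A minimal sketch of Equations 4.38 and 4.39 in code (the parameter values and the instance below are arbitrary placeholders of ours):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def posterior(x, w, b):
    """P(y=1|x) and P(y=0|x) from the linear predictor z = w^T x + b (Eqs. 4.38-4.39)."""
    z = np.dot(w, x) + b
    p1 = sigmoid(z)
    return p1, 1.0 - p1

# Hypothetical parameters and instance; the predicted class is 1 whenever z >= 0.
w, b = np.array([0.8, -1.5]), 0.2
p1, p0 = posterior(np.array([1.0, 0.5]), w, b)
print(p1, p0, int(p1 >= 0.5))
```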

4.6.1 Logistic Regression as a Generalized Linear Model

Since the posterior probabilities are real-valued, their estimation using the previous equations can be viewed as solving a regression problem. In fact, logistic regression belongs to a broader family of statistical regression models, known as generalized linear models (GLM). In these models, the target variable y is considered to be generated from a probability distribution P(y|x), whose mean μ can be estimated using a link function g(·) as follows:

g(μ) = z = w^T x + b.    (4.40)

For binary classification using logistic regression, y follows a Bernoulli distribution (y can either be 0 or 1) and μ is equal to P(y = 1|x). The link function g(·) of logistic regression, called the logit function, can thus be represented as

g(μ) = log(μ/(1 − μ)).

Depending on the choice of link function g(·) and the form of probability distribution P(y|x), GLMs are able to represent a broad family of regression models, such as linear regression and Poisson regression. They require different approaches for estimating their model parameters, (w, b). In this chapter, we will only discuss approaches for estimating the model parameters of logistic regression, although methods for estimating parameters of other types of GLMs are often similar (and sometimes even simpler). (See Bibliographic Notes for more details on GLMs.)

Note that even though logistic regression has relationships with regression models, it is a classification model since the computed posterior probabilities are eventually used to determine the class label of a data instance.

4.6.2 Learning Model Parameters

The parameters of logistic regression, (w, b), are estimated during training using a statistical approach known as the maximum likelihood estimation (MLE) method. This method involves computing the likelihood of observing the training data given (w, b), and then determining the model parameters (w*, b*) that yield the maximum likelihood.

Let D.train = {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)} denote a set of n training instances, where y_i is a binary variable (0 or 1). For a given training instance (x_i, y_i), we can compute its posterior probabilities using Equations 4.38 and 4.39. We can then express the likelihood of observing y_i given x_i, w, and b as

P(y_i|x_i, w, b) = P(y = 1|x_i)^{y_i} × P(y = 0|x_i)^{1−y_i}
               = (σ(z_i))^{y_i} × (1 − σ(z_i))^{1−y_i}
               = (σ(w^T x_i + b))^{y_i} × (1 − σ(w^T x_i + b))^{1−y_i},    (4.41)

where σ(·) is the sigmoid function as described above. Equation 4.41 basically means that the likelihood P(y_i|x_i, w, b) is equal to P(y = 1|x_i) when y_i = 1, and equal to P(y = 0|x_i) when y_i = 0. The likelihood of all training instances, L(w, b), can then be computed by taking the product of individual likelihoods (assuming independence among training instances) as follows:

L(w, b) = ∏_{i=1}^{n} P(y_i|x_i, w, b) = ∏_{i=1}^{n} P(y = 1|x_i)^{y_i} × P(y = 0|x_i)^{1−y_i}.    (4.42)

The previous equation involves multiplying a large number of probability values, each of which is smaller than or equal to 1. Since this naïve computation can easily become numerically unstable when n is large, a more practical approach is to consider the negative logarithm (to base e) of the likelihood function, also known as the cross entropy function:

−log L(w, b) = −∑_{i=1}^{n} [y_i log(P(y = 1|x_i)) + (1 − y_i) log(P(y = 0|x_i))]
            = −∑_{i=1}^{n} [y_i log(σ(w^T x_i + b)) + (1 − y_i) log(1 − σ(w^T x_i + b))].

The cross entropy is a loss function that measures how unlikely it is for the training data to be generated from the logistic regression model with parameters (w, b). Intuitively, we would like to find the model parameters (w*, b*) that result in the lowest cross entropy, −log L(w*, b*):

(w*, b*) = argmin_{(w, b)} E(w, b) = argmin_{(w, b)} −log L(w, b),    (4.43)

where E(w, b) = −log L(w, b) is the loss function. It is worth emphasizing that E(w, b) is a convex function, i.e., any minimum of E(w, b) will be a global minimum. Hence, we can use any of the standard convex optimization techniques to solve Equation 4.43, which are mentioned in Appendix E. Here, we briefly describe the Newton-Raphson method that is commonly used for estimating the parameters of logistic regression. For ease of representation, we will use a single vector w̃ = (w^T b)^T to describe (w, b), which is of size one greater than w. Similarly, we will consider the concatenated feature vector x̃ = (x^T 1)^T, such that the linear predictor z = w^T x + b can be succinctly written as z = w̃^T x̃. Also, the concatenation of all training labels, y_1 to y_n, will be represented as y, the set consisting of σ(z_1) to σ(z_n) will be represented as σ, and the concatenation of x̃_1 to x̃_n will be represented as X̃.

The Newton-Raphson method is an iterative method for finding w̃* that uses the following equation to update the model parameters at every iteration:

w̃^{(new)} = w̃^{(old)} − H^{−1} ∇E(w̃),    (4.44)

where ∇E(w̃) and H are the first- and second-order derivatives of the loss function E(w̃) with respect to w̃, respectively. The key intuition behind Equation 4.44 is to move the model parameters in the direction of maximum gradient, such that w̃ takes larger steps when ∇E(w̃) is large. When w̃ arrives at a minimum after some number of iterations, ∇E(w̃) becomes equal to 0 and thus results in convergence. Hence, we start with some initial values of w̃ (either randomly assigned or set to 0) and use Equation 4.44 to iteratively update w̃ till there are no significant changes in its value (beyond a certain threshold).

The first-order derivative of E(w̃) is given by

∇E(w̃) = −∑_{i=1}^{n} [y_i x̃_i (1 − σ(w̃^T x̃_i)) − (1 − y_i) x̃_i σ(w̃^T x̃_i)]
       = ∑_{i=1}^{n} (σ(w̃^T x̃_i) − y_i) x̃_i
       = X̃^T (σ − y),    (4.45)

where we have used the fact that dσ(z)/dz = σ(z)(1 − σ(z)). Using ∇E(w̃), we can compute the second-order derivative of E(w̃) as

H = ∇∇E(w̃) = ∑_{i=1}^{n} σ(w̃^T x̃_i)(1 − σ(w̃^T x̃_i)) x̃_i x̃_i^T = X̃^T R X̃,    (4.46)

where R is a diagonal matrix whose i-th diagonal element is R_{ii} = σ_i(1 − σ_i). We can now use the first- and second-order derivatives of E(w̃) in Equation 4.44 to obtain the following update equation at the k-th iteration:

w̃^{(k+1)} = w̃^{(k)} − (X̃^T R_k X̃)^{−1} X̃^T (σ_k − y),    (4.47)

where the subscript k under R and σ refers to using w̃^{(k)} to compute both terms.
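The update of Equation 4.47 is straightforward to implement. Below is a minimal sketch of ours (the function name and the synthetic data are illustrative); it solves the linear system rather than forming H^{−1} explicitly, which is the usual numerical practice.

```python
import numpy as np

def fit_logistic_newton(X, y, num_iter=20, tol=1e-8):
    """Newton-Raphson updates of Equation 4.47 for w~ = (w, b), a minimal sketch."""
    Xt = np.hstack([X, np.ones((X.shape[0], 1))])     # append 1 to every x to get x~
    w = np.zeros(Xt.shape[1])                         # start from w~ = 0
    for _ in range(num_iter):
        s = 1.0 / (1.0 + np.exp(-Xt @ w))             # sigma(w~^T x~_i) for every i
        grad = Xt.T @ (s - y)                         # Equation 4.45
        R = np.diag(s * (1.0 - s))                    # R_ii = sigma_i (1 - sigma_i)
        H = Xt.T @ R @ Xt                             # Equation 4.46
        step = np.linalg.solve(H, grad)               # H^{-1} grad without an explicit inverse
        w -= step                                     # Equations 4.44 / 4.47
        if np.linalg.norm(step) < tol:
            break
    return w

# Tiny synthetic check: one informative feature with some class overlap.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(float)
print(fit_logistic_newton(X, y))     # learned (w, b); w should be clearly positive
```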

4.6.3 Characteristics of Logistic Regression

1. Logistic regression is a discriminative model for classification that directly computes the posterior probabilities without making any assumption about the class-conditional probabilities. Hence, it is quite generic and can be applied in diverse applications. It can also be easily extended to multiclass classification, where it is known as multinomial logistic regression. However, its expressive power is limited to learning only linear decision boundaries.

2. Because there are different weights (parameters) for every attribute, the learned parameters of logistic regression can be analyzed to understand the relationships between attributes and class labels.

3. Because logistic regression does not involve computing densities and distances in the attribute space, it can work more robustly in high-dimensional settings than distance-based methods such as nearest neighbor classifiers. However, the objective function of logistic regression does not involve any term relating to the complexity of the model. Hence, logistic regression does not provide a way to make a trade-off between model complexity and training performance, as compared to other classification models such as support vector machines. Nevertheless, variants of logistic regression can easily be developed to account for model complexity, by including appropriate terms in the objective function along with the cross entropy function.

4. Logistic regression can handle irrelevant attributes by learning weight parameters close to 0 for attributes that do not provide any gain in performance during training. It can also handle interacting attributes since the learning of model parameters is achieved in a joint fashion by considering the effects of all attributes together. Furthermore, if there are redundant attributes that are duplicates of each other, then logistic regression can learn equal weights for every redundant attribute, without degrading classification performance. However, the presence of a large number of irrelevant or redundant attributes in high-dimensional settings can make logistic regression susceptible to model overfitting.

5. Logistic regression cannot handle data instances with missing values, since the posterior probabilities are only computed by taking a weighted sum of all the attributes. If there are missing values in a training instance, it can be discarded from the training set. However, if there are missing values in a test instance, then logistic regression would fail to predict its class label.

4.7 Artificial Neural Network (ANN)
Artificial neural networks (ANN) are powerful classification models that are able to learn highly complex and nonlinear decision boundaries purely from the data. They have gained widespread acceptance in several applications such as vision, speech, and language processing, where they have been repeatedly shown to outperform other classification models (and in some cases even human performance). Historically, the study of artificial neural networks was inspired by attempts to emulate biological neural systems. The human brain consists primarily of nerve cells called neurons, linked together with other neurons via strands of fiber called axons. Whenever a neuron is stimulated (e.g., in response to a stimulus), it transmits nerve activations via axons to other neurons. The receptor neurons collect these nerve activations using structures called dendrites, which are extensions from the cell body of the neuron. The strength of the contact point between a dendrite and an axon, known as a synapse, determines the connectivity between neurons. Neuroscientists have discovered that the human brain learns by changing the strength of the synaptic connection between neurons upon repeated stimulation by the same impulse.

The human brain consists of approximately 100 billion neurons that are inter-connected in complex ways, making it possible for us to learn new tasks and perform regular activities. Note that a single neuron only performs a simple modular function, which is to respond to the nerve activations coming from sender neurons connected at its dendrites, and transmit its activation to receptor neurons via axons. However, it is the composition of these simple functions that together is able to express complex functions. This idea is at the basis of constructing artificial neural networks.

Analogous to the structure of a human brain, an artificial neural network is composed of a number of processing units, called nodes, that are connected with each other via directed links. The nodes correspond to neurons that perform the basic units of computation, while the directed links correspond to connections between neurons, consisting of axons and dendrites. Further, the weight of a directed link between two nodes represents the strength of the synaptic connection between neurons. As in biological neural systems, the primary objective of an ANN is to adapt the weights of the links until they fit the input-output relationships of the underlying data.

The basic motivation behind using an ANN model is to extract useful features from the original attributes that are most relevant for classification. Traditionally, feature extraction has been achieved by using dimensionality reduction techniques such as PCA (introduced in Chapter 2), which show limited success in extracting nonlinear features, or by using hand-crafted features provided by domain experts. By using a complex combination of inter-connected nodes, ANN models are able to extract much richer sets of features, resulting in good classification performance. Moreover, ANN models provide a natural way of representing features at multiple levels of abstraction, where complex features are seen as compositions of simpler features. In many classification problems, modeling such a hierarchy of features turns out to be very useful. For example, in order to detect a human face in an image, we can first identify low-level features such as sharp edges with different gradients and orientations. These features can then be combined to identify facial parts such as eyes, nose, ears, and lips. Finally, an appropriate arrangement of facial parts can be used to correctly identify a human face. ANN models provide a powerful architecture to represent a hierarchical abstraction of features, from lower levels of abstraction (e.g., edges) to higher levels (e.g., facial parts).

Artificial neural networks have had a long history of developments spanning over five decades of research. Although classical models of ANN suffered from several challenges that hindered progress for a long time, they have re-emerged with widespread popularity because of a number of recent developments in the last decade, collectively known as deep learning. In this section, we examine classical approaches for learning ANN models, starting from the simplest model, called perceptrons, to more complex architectures, called multi-layer neural networks. In the next section, we discuss some of the recent advancements in the area of ANN that have made it possible to effectively learn modern ANN models with deep architectures.

4.7.1 Perceptron

A perceptron is a basic type of ANN model that involves two types of nodes: input nodes, which are used to represent the input attributes, and an output node, which is used to represent the model output. Figure 4.20 illustrates the basic architecture of a perceptron that takes three input attributes, x_1, x_2, and x_3, and produces a binary output y. The input node corresponding to an attribute x_i is connected via a weighted link w_i to the output node. The weighted link is used to emulate the strength of a synaptic connection between neurons.

Figure 4.20. Basic architecture of a perceptron.

The output node is a mathematical device that computes a weighted sum of its inputs, adds a bias factor b to the sum, and then examines the sign of the result to produce the output ŷ as follows:

ŷ = { 1, if w^T x + b > 0; −1, otherwise. }    (4.48)

To simplify notation, w and b can be concatenated to form w̃ = (w^T b)^T, while x can be appended with 1 at the end to form x̃ = (x^T 1)^T. The output of the perceptron ŷ can then be written as

ŷ = sign(w̃^T x̃),

where the sign function acts as an activation function by providing an output value of +1 if the argument is positive and −1 if its argument is negative.

Learning the Perceptron
Given a training set, we are interested in learning parameters w̃ such that ŷ closely resembles the true y of training instances. This is achieved by using the perceptron learning algorithm given in Algorithm 4.3. The key computation for this algorithm is the iterative weight update formula given in Step 8 of the algorithm:

w_j^{(k+1)} = w_j^{(k)} + λ(y_i − ŷ_i^{(k)}) x_{ij},    (4.49)

where w_j^{(k)} is the weight parameter associated with the j-th input link after the k-th iteration, λ is a parameter known as the learning rate, and x_{ij} is the value of the j-th attribute of the training example x_i. The justification for Equation 4.49 is rather intuitive. Note that (y_i − ŷ_i) captures the discrepancy between y_i and ŷ_i, such that its value is 0 only when the true label and the predicted output match. Assume x_{ij} is positive. If ŷ = −1 and y = +1, then w_j is increased at the next iteration so that w̃^T x̃_i can become positive. On the other hand, if ŷ = +1 and y = −1, then w_j is decreased so that w̃^T x̃_i can become negative. Hence, the weights are modified at every iteration to reduce the discrepancies between ŷ and y across all training instances. The learning rate λ, a parameter whose value is between 0 and 1, can be used to control the amount of adjustment made in each iteration. The algorithm halts when the average number of discrepancies is smaller than a threshold γ.

Algorithm 4.3 Perceptron learning algorithm.
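A minimal sketch of the learning loop behind Algorithm 4.3, built around the update of Equation 4.49; the function name, the stopping parameters, and the toy data are our own illustrative choices rather than the book's pseudocode.

```python
import numpy as np

def train_perceptron(X, y, lam=0.1, gamma=0.0, max_epochs=100):
    """Perceptron learning loop around Equation 4.49.
    X: (n, d) inputs, y: labels in {+1, -1}, lam: learning rate, gamma: stop threshold."""
    Xt = np.hstack([X, np.ones((X.shape[0], 1))])     # x~ = (x, 1), so w~ = (w, b)
    w = np.zeros(Xt.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(Xt, y):
            y_hat = 1 if w @ xi > 0 else -1           # Equation 4.48
            if y_hat != yi:
                w += lam * (yi - y_hat) * xi          # Equation 4.49
                errors += 1
        if errors / len(y) <= gamma:                  # halt when discrepancies are rare
            break
    return w

# Clearly separable toy data: one cluster near the origin, one near (0.8, 0.8).
rng = np.random.default_rng(0)
X = np.vstack([rng.uniform(0.0, 0.4, size=(25, 2)), rng.uniform(0.6, 1.0, size=(25, 2))])
y = np.array([-1] * 25 + [1] * 25)
print(train_perceptron(X, y))
```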

The perceptron is a simple classification model that is designed to learn linear decision boundaries in the attribute space. Figure 4.21 shows the decision boundary obtained by applying the perceptron learning algorithm to the data set provided on the left of the figure. However, note that there can be multiple decision boundaries that separate the two classes, and the perceptron arbitrarily learns one of these boundaries depending on the random initial values of the parameters. (The selection of the optimal decision boundary is a problem that will be revisited in the context of support vector machines in Section 4.9.) Further, the perceptron learning algorithm is only guaranteed to converge when the classes are linearly separable. If the classes are not linearly separable, the algorithm fails to converge. Figure 4.22 shows an example of nonlinearly separable data given by the XOR function. The perceptron cannot find the right solution for this data because there is no linear decision boundary that can perfectly separate the training instances. Thus, the stopping condition at line 12 of Algorithm 4.3 would never be met and the perceptron learning algorithm would fail to converge. This is a major limitation of perceptrons since real-world classification problems often involve nonlinearly separable classes.

Figure 4.21. Perceptron decision boundary for the data given on the left (+ represents a positively labeled instance while o represents a negatively labeled instance).

Figure 4.22. XOR classification problem. No linear hyperplane can separate the two classes.

4.7.2 Multi-layer Neural Network

A multi-layer neural network generalizes the basic concept of a perceptron to more complex architectures of nodes that are capable of learning nonlinear decision boundaries. A generic architecture of a multi-layer neural network is shown in Figure 4.23, where the nodes are arranged in groups called layers. These layers are commonly organized in the form of a chain such that every layer operates on the outputs of its preceding layer. In this way, the layers represent different levels of abstraction that are applied on the input features in a sequential manner. The composition of these abstractions generates the final output at the last layer, which is used for making predictions. In the following, we briefly describe the three types of layers used in multi-layer neural networks.

Figure 4.23. Example of a multi-layer artificial neural network (ANN).

The first layer of the network, called the input layer, is used for representing inputs from attributes. Every numerical or binary attribute is typically represented using a single node on this layer, while a categorical attribute is either represented using a different node for each categorical value, or by encoding the k-ary attribute using ⌈log_2 k⌉ input nodes. These inputs are fed into intermediary layers known as hidden layers, which are made up of processing units known as hidden nodes. Every hidden node operates on signals received from the input nodes or hidden nodes at the preceding layer, and produces an activation value that is transmitted to the next layer. The final layer is called the output layer and processes the activation values from its preceding layer to produce predictions of output variables. For binary classification, the output layer contains a single node representing the binary class label. In this architecture, since the signals are propagated only in the forward direction from the input layer to the output layer, such networks are also called feedforward neural networks.

A major difference between multi-layer neural networks and perceptrons is the inclusion of hidden layers, which dramatically improves their ability to represent arbitrarily complex decision boundaries. For example, consider the XOR problem described in the previous section. The instances can be classified using two hyperplanes that partition the input space into their respective classes, as shown in Figure 4.24(a). Because a perceptron can create only one hyperplane, it cannot find the optimal solution. However, this problem can be addressed by using a hidden layer consisting of two nodes, as shown in Figure 4.24(b). Intuitively, we can think of each hidden node as a perceptron that tries to construct one of the two hyperplanes, while the output node simply combines the results of the perceptrons to yield the decision boundary shown in Figure 4.24(a).

Figure 4.24. A two-layer neural network for the XOR problem.

The hidden nodes can be viewed as learning latent representations or features that are useful for distinguishing between the classes. While the first hidden layer directly operates on the input attributes and thus captures simpler features, the subsequent hidden layers are able to combine them and construct more complex features. From this perspective, multi-layer neural networks learn a hierarchy of features at different levels of abstraction that are finally combined at the output nodes to make predictions. Further, there are combinatorially many ways we can combine the features learned at the hidden layers of an ANN, making them highly expressive. This property chiefly distinguishes ANN from other classification models such as decision trees, which can learn partitions in the attribute space but are unable to combine them in exponential ways.

Figure 4.25. Schematic illustration of the parameters of an ANN model with (L − 1) hidden layers.

To understand the nature of computations happening at the hidden and output nodes of an ANN, consider the i-th node at the l-th layer of the network (l > 0), where the layers are numbered from 0 (input layer) to L (output layer), as shown in Figure 4.25. The activation value generated at this node, a_i^l, can be represented as a function of the inputs received from nodes at the preceding layer. Let w_{ij}^l represent the weight of the connection from the j-th node at layer (l − 1) to the i-th node at layer l. Similarly, let us denote the bias term at this node as b_i^l. The activation value a_i^l can then be expressed as

a_i^l = f(z_i^l) = f(∑_j w_{ij}^l a_j^{l−1} + b_i^l),

where z is called the linear predictor and f(·) is the activation function that converts z to a. Further, note that, by definition, a_j^0 = x_j at the input layer and a^L = ŷ at the output node.

There are a number of alternative activation functions apart from the sign function that can be used in multi-layer neural networks. Some examples include linear, sigmoid (logistic), and hyperbolic tangent functions, as shown in Figure 4.26. These functions are able to produce real-valued and nonlinear activation values. Among these activation functions, the sigmoid σ(·) has been widely used in many ANN models, although the use of other types of activation functions in the context of deep learning will be discussed in Section 4.8. We can thus represent a_i^l as

a_i^l = σ(z_i^l) = 1/(1 + e^{−z_i^l}).    (4.50)

Figure 4.26. Types of activation functions used in multi-layer neural networks.

Learning Model Parameters
The weights and bias terms (w, b) of the ANN model are learned during training so that the predictions on training instances match the true labels. This is achieved by using a loss function

E(w, b) = ∑_{k=1}^{n} Loss(y_k, ŷ_k),    (4.51)

where isthetruelabelofthekthtraininginstanceand isequalto ,producedbyusing .Atypicalchoiceofthelossfunctionisthesquaredlossfunction:.

NotethatE( ,b)isafunctionofthemodelparameters( ,b)becausetheoutputactivationvalue dependsontheweightsandbiasterms.Weareinterestedinchoosing( ,b)thatminimizesthetraininglossE( ,b).Unfortunately,becauseoftheuseofhiddennodeswithnonlinearactivationfunctions,E( ,b)isnotaconvexfunctionof andb,whichmeansthatE( ,b)canhavelocalminimathatarenotgloballyoptimal.However,wecanstillapplystandardoptimizationtechniquessuchasthegradientdescentmethodtoarriveatalocallyoptimalsolution.Inparticular,theweightparameter andthebiasterm canbeiterativelyupdatedusingthefollowingequations:

where isahyper-parameterknownasthelearningrate.Theintuitionbehindthisequationistomovetheweightsinadirectionthatreducesthetrainingloss.Ifwearriveataminimausingthisprocedure,thegradientofthetraininglosswillbecloseto0,eliminatingthesecondtermandresultingintheconvergenceofweights.TheweightsarecommonlyinitializedwithvaluesdrawnrandomlyfromaGaussianorauniformdistribution.

AnecessarytoolforupdatingweightsinEquation4.53 istocomputethepartialderivativeofEwithrespectto .Thiscomputationisnontrivialespeciallyathiddenlayers ,since doesnotdirectlyaffect (and

yk y^k aLxk

Loss(yk,y^k)=(yk,y^k)2. (4.52)

aL

wijl bil

wijl←wijl−λ∂E∂wijl, (4.53)

bil←bil−λ∂E∂bil, (4.54)

λ

wijl(l<L) wijl y^=aL

hence the training loss), but has complex chains of influences via activation values at subsequent layers. To address this problem, a technique known as backpropagation was developed, which propagates the derivatives backward from the output layer to the hidden layers. This technique can be described as follows.

Recall that the training loss $E$ is simply the sum of individual losses at training instances. Hence the partial derivative of $E$ can be decomposed as a sum of partial derivatives of individual losses:

$$\frac{\partial E}{\partial w_{ij}^l} = \sum_{k=1}^{n} \frac{\partial\, \mathrm{Loss}(y_k, \hat{y}_k)}{\partial w_{ij}^l}.$$

To simplify discussions, we will consider only the derivatives of the loss at the $k$th training instance, which will be generically represented as $\mathrm{Loss}(y, a^L)$. By using the chain rule of differentiation, we can represent the partial derivative of the loss with respect to $w_{ij}^l$ as

$$\frac{\partial\, \mathrm{Loss}}{\partial w_{ij}^l} = \frac{\partial\, \mathrm{Loss}}{\partial a_i^l} \times \frac{\partial a_i^l}{\partial z_i^l} \times \frac{\partial z_i^l}{\partial w_{ij}^l}. \qquad (4.55)$$

The last term of the previous equation can be written as

$$\frac{\partial z_i^l}{\partial w_{ij}^l} = \frac{\partial\big(\sum_j w_{ij}^l a_j^{l-1} + b_i^l\big)}{\partial w_{ij}^l} = a_j^{l-1}.$$

Also, if we use the sigmoid activation function, then

$$\frac{\partial a_i^l}{\partial z_i^l} = \frac{\partial \sigma(z_i^l)}{\partial z_i^l} = a_i^l (1 - a_i^l).$$

Equation 4.55 can thus be simplified as

$$\frac{\partial\, \mathrm{Loss}}{\partial w_{ij}^l} = \delta_i^l \times a_i^l (1 - a_i^l) \times a_j^{l-1}, \quad \text{where } \delta_i^l = \frac{\partial\, \mathrm{Loss}}{\partial a_i^l}. \qquad (4.56)$$

A similar formula for the partial derivative with respect to the bias term $b_i^l$ is given by

$$\frac{\partial\, \mathrm{Loss}}{\partial b_i^l} = \delta_i^l \times a_i^l (1 - a_i^l). \qquad (4.57)$$

Hence, to compute the partial derivatives, we only need to determine $\delta_i^l$. Using a squared loss function, we can easily write $\delta^L$ at the output node as

$$\delta^L = \frac{\partial\, \mathrm{Loss}}{\partial a^L} = \frac{\partial (y - a^L)^2}{\partial a^L} = 2(a^L - y). \qquad (4.58)$$

However, the approach for computing $\delta_j^l$ at hidden nodes $(l < L)$ is more involved. Notice that $a_j^l$ affects the activation values $a_i^{l+1}$ of all nodes at the next layer, which in turn influence the loss. Hence, again using the chain rule of differentiation, $\delta_j^l$ can be represented as

$$\delta_j^l = \frac{\partial\, \mathrm{Loss}}{\partial a_j^l} = \sum_i \Big(\frac{\partial\, \mathrm{Loss}}{\partial a_i^{l+1}} \times \frac{\partial a_i^{l+1}}{\partial z_i^{l+1}} \times \frac{\partial z_i^{l+1}}{\partial a_j^l}\Big) = \sum_i \Big(\delta_i^{l+1} \times a_i^{l+1}(1 - a_i^{l+1}) \times w_{ij}^{l+1}\Big). \qquad (4.59)$$

The previous equation provides a concise representation of the $\delta_j^l$ values at layer $l$ in terms of the $\delta_i^{l+1}$ values computed at layer $l+1$. Hence, proceeding backward from the output layer $L$ to the hidden layers, we can recursively apply Equation 4.59 to compute $\delta_i^l$ at every hidden node. $\delta_i^l$ can then be used in Equations 4.56 and 4.57 to compute the partial derivatives of the loss with respect to $w_{ij}^l$ and $b_i^l$, respectively. Algorithm 4.4 summarizes the complete approach for learning the model parameters of ANN using backpropagation and the gradient descent method.

Algorithm 4.4 Learning ANN using backpropagation and gradient descent.
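To make the preceding derivation concrete, the following is a minimal NumPy sketch of learning a two-layer network with backpropagation and gradient descent on the XOR instances of Figure 4.24, assuming sigmoid activations, the squared loss of Equation 4.52, and per-instance updates. The variable names, learning rate, and epoch count are illustrative choices rather than values from the text, and training may still end in a local minimum, as discussed above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# XOR training instances (Figure 4.24) and their labels.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

rng = np.random.default_rng(0)
# One hidden layer with two nodes (layer 1) and one output node (layer 2),
# with weights initialized from a Gaussian distribution.
W1, b1 = rng.normal(size=(2, 2)), np.zeros(2)
W2, b2 = rng.normal(size=2), 0.0
lam = 0.5  # learning rate (lambda in Equations 4.53-4.54)

for epoch in range(5000):
    for xk, yk in zip(X, y):
        # Forward pass: a^l = sigma(z^l), as in Equation 4.50.
        z1 = W1 @ xk + b1;  a1 = sigmoid(z1)
        z2 = W2 @ a1 + b2;  a2 = sigmoid(z2)
        # Backward pass: delta at the output node (Equation 4.58, squared loss),
        # then deltas at the hidden layer via Equation 4.59.
        d2 = 2 * (a2 - yk)
        d1 = d2 * a2 * (1 - a2) * W2
        # Gradient descent updates using Equations 4.53-4.57.
        W2 -= lam * d2 * a2 * (1 - a2) * a1
        b2 -= lam * d2 * a2 * (1 - a2)
        W1 -= lam * np.outer(d1 * a1 * (1 - a1), xk)
        b1 -= lam * d1 * a1 * (1 - a1)

# Predictions on the four instances; ideally close to [0, 1, 1, 0],
# although some initializations may converge to a poorer local minimum.
print(np.round(sigmoid(W2 @ sigmoid(X @ W1.T + b1).T + b2), 2))
```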

4.7.3 Characteristics of ANN

1. Multi-layer neural networks with at least one hidden layer are universal approximators; i.e., they can be used to approximate any target function. They are thus highly expressive and can be used to learn complex decision boundaries in diverse applications. ANN can also be used for multiclass classification and regression problems, by appropriately modifying the output layer. However, the high model complexity of classical ANN models makes them susceptible to overfitting, which can be overcome to some extent by using the deep learning techniques discussed in Section 4.8.3.

2. ANN provides a natural way to represent a hierarchy of features at multiple levels of abstraction. The outputs at the final hidden layer of the ANN model thus represent features at the highest level of abstraction that are most useful for classification. These features can also be used as inputs in other supervised classification models, e.g., by replacing the output node of the ANN by any generic classifier.

3. ANN represents complex high-level features as compositions of simpler lower-level features that are easier to learn. This provides ANN the ability to gradually increase the complexity of representations, by adding more hidden layers to the architecture. Further, since simpler features can be combined in combinatorial ways, the number of complex features learned by ANN is much larger than in traditional classification models. This is one of the main reasons behind the high expressive power of deep neural networks.

4. ANN can easily handle irrelevant attributes, by using zero weights for attributes that do not help in improving the training loss. Also, redundant attributes receive similar weights and do not degrade the quality of the classifier. However, if the number of irrelevant or redundant attributes is large, the learning of the ANN model may suffer from overfitting, leading to poor generalization performance.

5. Since the learning of the ANN model involves minimizing a non-convex function, the solutions obtained by gradient descent are not guaranteed to be globally optimal. For this reason, ANN has a tendency to get stuck in local minima, a challenge that can be addressed by using the deep learning techniques discussed in Section 4.8.4.

6. Training an ANN is a time-consuming process, especially when the number of hidden nodes is large. Nevertheless, test examples can be classified rapidly.

7. Just like logistic regression, ANN can learn in the presence of interacting variables, since the model parameters are jointly learned over all variables together. However, ANN cannot handle instances with missing values in the training or testing phase.

4.8 Deep Learning

As described above, the use of hidden layers in ANN is based on the general belief that complex high-level features can be constructed by combining simpler lower-level features. Typically, the greater the number of hidden layers, the deeper the hierarchy of features learned by the network. This motivates the learning of ANN models with long chains of hidden layers, known as deep neural networks. In contrast to "shallow" neural networks that involve only a small number of hidden layers, deep neural networks are able to represent features at multiple levels of abstraction and often require far fewer nodes per layer to achieve generalization performance similar to shallow networks.

Despitethehugepotentialinlearningdeepneuralnetworks,ithasremainedchallengingtolearnANNmodelswithalargenumberofhiddenlayersusingclassicalapproaches.Apartfromreasonsrelatedtolimitedcomputationalresourcesandhardwarearchitectures,therehavebeenanumberofalgorithmicchallengesinlearningdeepneuralnetworks.First,learningadeepneuralnetworkwithlowtrainingerrorhasbeenadauntingtaskbecauseofthesaturationofsigmoidactivationfunctions,resultinginslowconvergenceofgradientdescent.Thisproblembecomesevenmoreseriousaswemoveawayfromtheoutputnodetothehiddenlayers,becauseofthecompoundedeffectsofsaturationatmultiplelayers,knownasthevanishinggradientproblem.Becauseofthisreason,classicalANNmodelshavesufferedfromslowandineffectivelearning,leadingtopoortrainingandtestperformance.Second,thelearningofdeepneuralnetworksisquitesensitivetotheinitialvaluesofmodelparameters,chieflybecauseofthenon-convexnatureoftheoptimizationfunctionandtheslowconvergenceofgradientdescent.Third,deepneuralnetworkswithalargenumberofhiddenlayershavehighmodel

complexity,makingthemsusceptibletooverfitting.Hence,evenifadeepneuralnetworkhasbeentrainedtoshowlowtrainingerror,itcanstillsufferfrompoorgeneralizationperformance.

Thesechallengeshavedeterredprogressinbuildingdeepneuralnetworksforseveraldecadesanditisonlyrecentlythatwehavestartedtounlocktheirimmensepotentialwiththehelpofanumberofadvancesbeingmadeintheareaofdeeplearning.Althoughsomeoftheseadvanceshavebeenaroundforsometime,theyhaveonlygainedmainstreamattentioninthelastdecade,withdeepneuralnetworkscontinuallybeatingrecordsinvariouscompetitionsandsolvingproblemsthatweretoodifficultforotherclassificationapproaches.

Therearetwofactorsthathaveplayedamajorroleintheemergenceofdeeplearningtechniques.First,theavailabilityoflargerlabeleddatasets,e.g.,theImageNetdatasetcontainsmorethan10millionlabeledimages,hasmadeitpossibletolearnmorecomplexANNmodelsthaneverbefore,withoutfallingeasilyintothetrapsofmodeloverfitting.Second,advancesincomputationalabilitiesandhardwareinfrastructures,suchastheuseofgraphicalprocessingunits(GPU)fordistributedcomputing,havegreatlyhelpedinexperimentingwithdeepneuralnetworkswithlargerarchitecturesthatwouldnothavebeenfeasiblewithtraditionalresources.

Inadditiontotheprevioustwofactors,therehavebeenanumberofalgorithmicadvancementstoovercomethechallengesfacedbyclassicalmethodsinlearningdeepneuralnetworks.Someexamplesincludetheuseofmoreresponsivecombinationsoflossfunctionsandactivationfunctions,betterinitializationofmodelparameters,novelregularizationtechniques,moreagilearchitecturedesigns,andbettertechniquesformodellearningandhyper-parameterselection.Inthefollowing,wedescribesomeofthedeeplearningadvancesmadetoaddressthechallengesinlearningdeepneural

networks.FurtherdetailsonrecentdevelopmentsindeeplearningcanbeobtainedfromtheBibliographicNotes.

4.8.1 Using Synergistic Loss Functions

One of the major realizations leading to deep learning has been the importance of choosing appropriate combinations of activation and loss functions. Classical ANN models commonly made use of the sigmoid activation function at the output layer, because of its ability to produce real-valued outputs between 0 and 1, which was combined with a squared loss objective to perform gradient descent. It was soon noticed that this particular combination of activation and loss function resulted in the saturation of output activation values, which can be described as follows.

Saturation of Outputs

Although the sigmoid has been widely used as an activation function, it easily saturates at high and low values of inputs that are far away from 0. Observe from Figure 4.27(a) that $\sigma(z)$ shows variance in its values only when $z$ is close to 0. For this reason, $\partial\sigma(z)/\partial z$ is non-zero for only a small range of $z$ around 0, as shown in Figure 4.27(b). Since $\partial\sigma(z)/\partial z$ is one of the components in the gradient of the loss (see Equation 4.55), we get a diminishing gradient value when the activation values are far from 0.

Figure 4.27. Plots of the sigmoid function and its derivative.

To illustrate the effect of saturation on the learning of model parameters at the output node, consider the partial derivative of the loss with respect to the weight $w_j^L$ at the output node. Using the squared loss function, we can write this as

$$\frac{\partial\, \mathrm{Loss}}{\partial w_j^L} = 2(a^L - y) \times \sigma(z^L)\big(1 - \sigma(z^L)\big) \times a_j^{L-1}. \qquad (4.60)$$

In the previous equation, notice that when $z^L$ is highly negative, $\sigma(z^L)$ (and hence the gradient) is close to 0. On the other hand, when $z^L$ is highly positive, $(1 - \sigma(z^L))$ becomes close to 0, nullifying the value of the gradient. Hence, irrespective of whether the prediction $a^L$ matches the true label $y$ or not, the gradient of the loss with respect to the weights is close to 0 whenever $z^L$ is highly positive or negative. This causes an unnecessarily slow convergence of the model parameters of the ANN model, often resulting in poor learning.

Note that it is the combination of the squared loss function and the sigmoid activation function at the output node that together results in diminishing gradients (and thus poor learning) upon saturation of outputs. It is thus important to choose a synergistic combination of loss function and activation function that does not suffer from the saturation of outputs.

Cross entropy loss function

The cross entropy loss function, which was described in the context of logistic regression in Section 4.6.2, can significantly avoid the problem of saturating outputs when used in combination with the sigmoid activation function. The cross entropy loss function of a real-valued prediction $\hat{y} \in (0, 1)$ on a data instance with binary label $y \in \{0, 1\}$ can be defined as

$$\mathrm{Loss}(y, \hat{y}) = -y \log(\hat{y}) - (1 - y)\log(1 - \hat{y}), \qquad (4.61)$$

where $\log$ represents the natural logarithm (to base $e$) and $0\log(0) = 0$ for convenience. The cross entropy function has foundations in information theory and measures the amount of disagreement between $y$ and $\hat{y}$. The partial derivative of this loss function with respect to $\hat{y} = a^L$ can be given as

$$\delta^L = \frac{\partial\, \mathrm{Loss}}{\partial a^L} = -\frac{y}{a^L} + \frac{1 - y}{1 - a^L} = \frac{a^L - y}{a^L(1 - a^L)}. \qquad (4.62)$$

Using this value of $\delta^L$ in Equation 4.56, we can obtain the partial derivative of the loss with respect to the weight $w_j^L$ at the output node as

$$\frac{\partial\, \mathrm{Loss}}{\partial w_j^L} = \frac{a^L - y}{a^L(1 - a^L)} \times a^L(1 - a^L) \times a_j^{L-1} = (a^L - y) \times a_j^{L-1}. \qquad (4.63)$$

Notice the simplicity of the previous formula using the cross entropy loss function. The partial derivatives of the loss with respect to the weights at the output node depend only on the difference between the prediction $a^L$ and the true label $y$. In contrast to Equation 4.60, it does not involve terms such as $\sigma(z^L)(1 - \sigma(z^L))$ that can be impacted by saturation of $z^L$. Hence, the gradients are high whenever $(a^L - y)$ is large, promoting effective learning of the model parameters at the output node. This has been a major breakthrough in the learning of modern ANN models, and it is now a common practice to use the cross entropy loss function with sigmoid activations at the output node.
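The contrast between Equations 4.60 and 4.63 can be checked numerically. The short sketch below, with hypothetical values, evaluates both gradients at a saturated output node that confidently predicts the wrong class; only the cross entropy gradient remains large.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A confidently wrong output node: true label y = 1, but z^L is highly
# negative, so a^L = sigma(z^L) is close to 0 (a saturated prediction).
y, zL, a_prev = 1.0, -8.0, 1.0
aL = sigmoid(zL)

grad_squared = 2 * (aL - y) * aL * (1 - aL) * a_prev   # Equation 4.60
grad_xent    = (aL - y) * a_prev                       # Equation 4.63

print(grad_squared)  # about -6.7e-04: nearly vanishes despite a large error
print(grad_xent)     # about -1.0    : stays large, so learning proceeds
```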

4.8.2 Using Responsive Activation Functions

Even though the cross entropy loss function helps in overcoming the problem of saturating outputs, it still does not solve the problem of saturation at hidden layers, arising due to the use of sigmoid activation functions at hidden nodes. In fact, the effect of saturation on the learning of model parameters is even more aggravated at hidden layers, a problem known as the vanishing gradient problem. In the following, we describe the vanishing gradient problem and the use of a more responsive activation function, called the rectified linear unit (ReLU), to overcome this problem.

Vanishing Gradient Problem

The impact of saturating activation values on the learning of model parameters increases at deeper hidden layers that are farther away from the output node. Even if the activation in the output layer does not saturate, the repeated multiplications performed as we backpropagate the gradients from the output layer to the hidden layers may lead to decreasing gradients in the hidden layers. This is called the vanishing gradient problem, which has been one of the major hindrances in learning deep neural networks.

To illustrate the vanishing gradient problem, consider an ANN model that consists of a single node at every hidden layer of the network, as shown in Figure 4.28. This simplified architecture involves a single chain of hidden nodes, where a single weighted link $w^l$ connects the node at layer $l-1$ to the node at layer $l$. Using Equations 4.56 and 4.59, we can represent the partial derivative of the loss with respect to $w^l$ as

$$\frac{\partial\, \mathrm{Loss}}{\partial w^l} = \delta^l \times a^l(1 - a^l) \times a^{l-1}, \quad \text{where } \delta^l = 2(a^L - y) \times \prod_{r=l}^{L-1}\Big(a^{r+1}(1 - a^{r+1}) \times w^{r+1}\Big). \qquad (4.64)$$

Notice that if any of the linear predictors $z^{r+1}$ saturates at subsequent layers, then the term $a^{r+1}(1 - a^{r+1})$ becomes close to 0, thus diminishing the overall gradient. The saturation of activations thus gets compounded and has multiplicative effects on the gradients at hidden layers, making them highly unstable and thus unsuitable for use with gradient descent. Even though the previous discussion only pertains to the simplified architecture involving a single chain of hidden nodes, a similar argument can be made for any generic ANN architecture involving multiple chains of hidden nodes. Note that the vanishing gradient problem primarily arises because of the use of the sigmoid activation function at hidden nodes, which is known to easily saturate, especially after repeated multiplications.

Figure 4.28. An example of an ANN model with only one node at every hidden layer.

Figure 4.29. Plot of the rectified linear unit (ReLU) activation function.

Rectified Linear Units (ReLU)

To overcome the vanishing gradient problem, it is important to use an activation function $f(z)$ at the hidden nodes that provides a stable and significant value of the gradient whenever a hidden node is active, i.e., $z > 0$. This is achieved by using rectified linear units (ReLU) as activation functions at hidden nodes, which can be defined as

$$a = f(z) = \begin{cases} z, & \text{if } z > 0, \\ 0, & \text{otherwise.} \end{cases} \qquad (4.65)$$

The idea of ReLU has been inspired by biological neurons, which are either in an inactive state $(f(z) = 0)$ or show an activation value proportional to the input. Figure 4.29 shows a plot of the ReLU function. We can see that it is linear with respect to $z$ when $z > 0$. Hence, the gradient of the activation value with respect to $z$ can be written as

$$\frac{\partial a}{\partial z} = \begin{cases} 1, & \text{if } z > 0, \\ 0, & \text{if } z < 0. \end{cases} \qquad (4.66)$$

Although $f(z)$ is not differentiable at 0, it is common practice to use $\partial a/\partial z = 0$ when $z = 0$. Since the gradient of the ReLU activation function is equal to 1 whenever $z > 0$, it avoids the problem of saturation at hidden nodes, even after repeated multiplications. Using ReLU, the partial derivatives of the loss with respect to the weight and bias parameters can be given by

$$\frac{\partial\, \mathrm{Loss}}{\partial w_{ij}^l} = \delta_i^l \times I(z_i^l) \times a_j^{l-1}, \qquad (4.67)$$

$$\frac{\partial\, \mathrm{Loss}}{\partial b_i^l} = \delta_i^l \times I(z_i^l), \quad \text{where } \delta_j^l = \sum_i \Big(\delta_i^{l+1} \times I(z_i^{l+1}) \times w_{ij}^{l+1}\Big) \text{ and } I(z) = \begin{cases} 1, & \text{if } z > 0, \\ 0, & \text{otherwise.} \end{cases} \qquad (4.68)$$

Notice that ReLU shows a linear behavior in the activation values whenever a node is active, as compared to the nonlinear properties of the sigmoid function. This linearity promotes better flows of gradients during backpropagation, and thus simplifies the learning of ANN model parameters. ReLU is also highly responsive at large values of $z$ away from 0, as opposed to the sigmoid activation function, making it more suitable for gradient descent. These differences give ReLU a major advantage over the sigmoid function. Indeed, ReLU is used as the preferred choice of activation function at hidden layers in most modern ANN models.
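A minimal NumPy sketch of the ReLU activation and its gradient (Equations 4.65 and 4.66), treating the gradient at $z = 0$ as 0, is given below.

```python
import numpy as np

def relu(z):
    """Rectified linear unit, Equation 4.65: max(0, z)."""
    return np.maximum(0.0, z)

def relu_grad(z):
    """Gradient of ReLU, Equation 4.66 (taken to be 0 at z = 0)."""
    return (z > 0).astype(float)

z = np.array([-5.0, -0.5, 0.0, 0.5, 5.0])
print(relu(z))       # [0.  0.  0.  0.5 5. ]
print(relu_grad(z))  # [0. 0. 0. 1. 1.]  -- no saturation for z > 0
```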

4.8.3 Regularization

A major challenge in learning deep neural networks is the high model complexity of ANN models, which grows with the addition of hidden layers in the network. This can become a serious concern, especially when the training set is small, due to the phenomenon of model overfitting. To overcome this

challenge, it is important to use techniques that can help in reducing the complexity of the learned model, known as regularization techniques. Classical approaches for learning ANN models did not have an effective way to promote regularization of the learned model parameters. Hence, they had often been sidelined by other classification methods, such as support vector machines (SVM), which have in-built regularization mechanisms. (SVM will be discussed in more detail in Section 4.9.)

One of the major advancements in deep learning has been the development of novel regularization techniques for ANN models that are able to offer significant improvements in generalization performance. In the following, we discuss one of the regularization techniques for ANN, known as the dropout method, that has gained a lot of attention in several applications.

Dropout

The main objective of dropout is to avoid the learning of spurious features at hidden nodes, occurring due to model overfitting. It uses the basic intuition that spurious features often "co-adapt" themselves such that they show good training performance only when used in highly selective combinations. On the other hand, relevant features can be used in a diversity of feature combinations and hence are quite resilient to the removal or modification of other features. The dropout method uses this intuition to break complex "co-adaptations" in the learned features by randomly dropping input and hidden nodes in the network during training.

Dropout belongs to a family of regularization techniques that use the criterion of resilience to random perturbations as a measure of the robustness (and hence, simplicity) of a model. For example, one approach to regularization is to inject noise in the input attributes of the training set and learn a model with the noisy training instances. If a feature learned from the training data is indeed generalizable, it should not be affected by the addition of noise. Dropout can be viewed as a similar regularization approach that perturbs the information content of the training set not only at the level of attributes but also at multiple levels of abstraction, by dropping input and hidden nodes.

The dropout method draws inspiration from the biological process of gene swapping in sexual reproduction, where half of the genes from both parents are combined together to create the genes of the offspring. This favors the selection of parent genes that are not only useful but can also inter-mingle with diverse combinations of genes coming from the other parent. On the other hand, co-adapted genes that function only in highly selective combinations are soon eliminated in the process of evolution. This idea is used in the dropout method for eliminating spurious co-adapted features. A simplified description of the dropout method is provided in the rest of this section.

Figure 4.30. Examples of sub-networks generated in the dropout method using $\gamma = 0.5$.

Let $(\mathbf{w}^k, \mathbf{b}^k)$ represent the model parameters of the ANN model at the $k$th iteration of the gradient descent method. At every iteration, we randomly select a fraction $\gamma$ of input and hidden nodes to be dropped from the network, where $\gamma \in (0, 1)$ is a hyper-parameter that is typically chosen to be 0.5. The weighted links and bias terms involving the dropped nodes are then eliminated, resulting in a "thinned" sub-network of smaller size. The model parameters of the sub-network $(\mathbf{w}_s^k, \mathbf{b}_s^k)$ are then updated by computing activation values and performing backpropagation on this smaller sub-network. These updated values are then added back in the original network to obtain the updated model parameters, $(\mathbf{w}^{k+1}, \mathbf{b}^{k+1})$, to be used in the next iteration.

Figure 4.30 shows some examples of sub-networks that can be generated at different iterations of the dropout method, by randomly dropping input and hidden nodes. Since every sub-network has a different architecture, it is difficult to learn complex co-adaptations in the features that can result in overfitting. Instead, the features at the hidden nodes are learned to be more agile to random modifications in the network structure, thus improving their generalization ability. The model parameters are updated using a different random sub-network at every iteration, till the gradient descent method converges.

Let $(\mathbf{w}^{k_{\max}}, \mathbf{b}^{k_{\max}})$ denote the model parameters at the last iteration of the gradient descent method. These parameters are finally scaled down by a factor of $(1 - \gamma)$ to produce the weights and bias terms of the final ANN model, as follows:

$$(\mathbf{w}^*, \mathbf{b}^*) = \big((1 - \gamma) \times \mathbf{w}^{k_{\max}},\; (1 - \gamma) \times \mathbf{b}^{k_{\max}}\big).$$

We can now use the complete neural network with model parameters $(\mathbf{w}^*, \mathbf{b}^*)$ for testing. The dropout method has been shown to provide significant improvements in the generalization performance of ANN models in a number of applications. It is computationally cheap and can be applied in combination with any of the other deep learning techniques. It also has a number of similarities with a widely-used ensemble learning method known as bagging, which learns multiple models using random subsets of the training set, and then uses the average output of all the models to make predictions. (Bagging will be presented in more detail later in Section 4.10.4.) In a similar vein, it can be shown that the predictions of the final network learned using dropout approximate the average output of all possible $2^n$ sub-networks that can be formed using $n$ nodes. This is one of the reasons behind the superior regularization abilities of dropout.
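The following sketch illustrates one way the per-iteration dropping of nodes and the final scaling by $(1 - \gamma)$ could be implemented in NumPy; the helper function and the example activations are hypothetical, and a full implementation would also restrict backpropagation to the thinned sub-network.

```python
import numpy as np

rng = np.random.default_rng(1)
gamma = 0.5  # fraction of input and hidden nodes dropped at every iteration

def dropout_forward(a, gamma, training):
    """Pass one layer's activations through dropout.

    During training, a random fraction gamma of the nodes is dropped
    (their activations set to 0), yielding a thinned sub-network.
    At test time no mask is applied; instead, the learned weights are
    scaled by (1 - gamma) once after training, as described in the text.
    """
    if not training:
        return a
    keep_mask = (rng.random(a.shape) >= gamma).astype(float)
    return a * keep_mask

a_hidden = np.array([0.7, 0.1, 0.9, 0.4])
print(dropout_forward(a_hidden, gamma, training=True))   # e.g., [0.7 0.  0.9 0. ]
print(dropout_forward(a_hidden, gamma, training=False))  # [0.7 0.1 0.9 0.4]

# After the last gradient descent iteration, scale the learned weights by
# (1 - gamma) before using the complete network for prediction.
W_final = np.array([[1.2, -0.4], [0.3, 0.8]])
W_test = (1 - gamma) * W_final
```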

4.8.4 Initialization of Model Parameters

Because of the non-convex nature of the loss function used by ANN models, it is possible to get stuck in locally optimal but globally inferior solutions. Hence, the initial choice of model parameter values plays a significant role in the learning of ANN by gradient descent. The impact of poor initialization is even more aggravated when the model is complex, the network architecture is deep, or the classification task is difficult. In such cases, it is often advisable to first learn a simpler model for the problem, e.g., using a single hidden layer, and then incrementally increase the complexity of the model, e.g., by adding more hidden layers. An alternate approach is to train the model for a simpler task and then use the learned model parameters as initial parameter choices in the learning of the original task. The process of initializing ANN model parameters before the actual training process is known as pretraining.

Pretraining helps in initializing the model to a suitable region in the parameter space that would otherwise be inaccessible by random initialization. Pretraining also reduces the variance in the model parameters by fixing the starting point of gradient descent, thus reducing the chances of overfitting due to multiple comparisons. The models learned by pretraining are thus more consistent and provide better generalization performance.

Supervised Pretraining

A common approach for pretraining is to incrementally train the ANN model in a layer-wise manner, by adding one hidden layer at a time. This approach, known as supervised pretraining, ensures that the parameters learned at every layer are obtained by solving a simpler problem, rather than learning all model parameters together. These parameter values thus provide a good choice for initializing the ANN model. The approach for supervised pretraining can be briefly described as follows.

We start the supervised pretraining process by considering a reduced ANN model with only a single hidden layer. By applying gradient descent on this simple model, we are able to learn the model parameters of the first hidden layer. At the next run, we add another hidden layer to the model and apply gradient descent to learn the parameters of the newly added hidden layer, while keeping the parameters of the first layer fixed. This procedure is recursively applied such that while learning the parameters of the $l$th hidden layer, we consider a reduced model with only $l$ hidden layers, whose first $(l-1)$ hidden layers are not updated on the $l$th run but are instead fixed using pretrained values from previous runs. In this way, we are able to learn the model parameters of all $(L-1)$ hidden layers. These pretrained values are used to initialize the hidden layers of the final ANN model, which is fine-tuned by applying a final round of gradient descent over all the layers.

Unsupervised Pretraining

Supervised pretraining provides a powerful way to initialize model parameters, by gradually growing the model complexity from shallower to deeper networks. However, supervised pretraining requires a sufficient number of labeled training instances for effective initialization of the ANN model. An alternate pretraining approach is unsupervised pretraining, which initializes model parameters by using unlabeled instances that are often abundantly available. The basic idea of unsupervised pretraining is to initialize the ANN model in such a way that the learned features capture the latent structure in the unlabeled data.

Figure 4.31. The basic architecture of a single-layer autoencoder.

Unsupervised pretraining relies on the assumption that learning the distribution of the input data can indirectly help in learning the classification model. It is most helpful when the number of labeled examples is small and the features for the supervised problem bear resemblance to the factors generating the input data. Unsupervised pretraining can be viewed as a different form of regularization, where the focus is not explicitly toward finding simpler features but instead toward finding features that can best explain the input data. Historically, unsupervised pretraining has played an important role in reviving the area of deep learning, by making it possible to train any generic deep neural network without requiring specialized architectures.

Use of Autoencoders

One simple and commonly used approach for unsupervised pretraining is to use an unsupervised ANN model known as an autoencoder. The basic architecture of an autoencoder is shown in Figure 4.31. An autoencoder attempts to learn a reconstruction of the input data by mapping the attributes $\mathbf{x}$ to latent features, and then re-projecting them back to the original attribute space to create the reconstruction $\hat{\mathbf{x}}$. The latent features are represented using a hidden layer of nodes, while the input and output layers represent the attributes and contain the same number of nodes. During training, the goal is to learn an autoencoder model that provides the lowest reconstruction error, $RE(\mathbf{x}, \hat{\mathbf{x}})$, on all input data instances. A typical choice of the reconstruction error is the squared loss function:

$$RE(\mathbf{x}, \hat{\mathbf{x}}) = \|\mathbf{x} - \hat{\mathbf{x}}\|^2.$$

The model parameters of the autoencoder can be learned by using a similar gradient descent method as the one used for learning supervised ANN models for classification. The key difference is the use of the reconstruction error on all training instances as the training loss. Autoencoders that have multiple hidden layers are known as stacked autoencoders.

Autoencoders are able to capture complex representations of the input data by the use of hidden nodes. However, if the number of hidden nodes is large, it is possible for an autoencoder to learn the identity relationship, where the input $\mathbf{x}$ is just copied and returned as the output $\hat{\mathbf{x}}$, resulting in a trivial solution. For example, if we use as many hidden nodes as the number of attributes, then it is possible for every hidden node to copy an attribute and simply pass it along to an output node, without extracting any useful information. To avoid this problem, it is common practice to keep the number of hidden nodes smaller than the number of input attributes. This forces the autoencoder to learn a compact and useful encoding of the input data, similar to a dimensionality reduction technique. An alternate approach is to corrupt the input instances by adding random noise, and then learn the autoencoder to reconstruct the original instance from the noisy input. This approach is known as the denoising autoencoder, which offers strong regularization capabilities and is often used to learn complex features even in the presence of a large number of hidden nodes.
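As a rough illustration of these ideas, the sketch below trains a single-hidden-layer autoencoder by gradient descent on the squared reconstruction error $RE(\mathbf{x}, \hat{\mathbf{x}})$, assuming a small synthetic data set, two hidden nodes (fewer than the four attributes), a sigmoid encoder, and a linear decoder; all of these choices are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 4))        # 100 unlabeled instances with 4 attributes
n_hidden = 2                    # fewer hidden nodes than attributes

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Encoder (W1, b1) maps x to latent features; decoder (W2, b2) maps them to x_hat.
W1, b1 = rng.normal(scale=0.1, size=(n_hidden, 4)), np.zeros(n_hidden)
W2, b2 = rng.normal(scale=0.1, size=(4, n_hidden)), np.zeros(4)
lam = 0.1

for epoch in range(2000):
    Z = sigmoid(X @ W1.T + b1)   # latent features at the hidden layer
    X_hat = Z @ W2.T + b2        # linear reconstruction of the attributes
    err = X_hat - X              # gradient of RE (up to a constant factor)
    # Backpropagate the reconstruction error through both layers.
    W2 -= lam * err.T @ Z / len(X)
    b2 -= lam * err.mean(axis=0)
    dZ = (err @ W2) * Z * (1 - Z)
    W1 -= lam * dZ.T @ X / len(X)
    b1 -= lam * dZ.mean(axis=0)

print(np.mean(np.sum((X - X_hat) ** 2, axis=1)))  # average reconstruction error
```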

To use an autoencoder for unsupervised pretraining, we can follow a layer-wise approach similar to supervised pretraining. In particular, to pretrain the model parameters of the $l$th hidden layer, we can construct a reduced ANN model with only $l$ hidden layers and an output layer that contains the same number of nodes as the attributes and is used for reconstruction. The parameters of the $l$th hidden layer of this network are then learned using a gradient descent method to minimize the reconstruction error. The use of unlabeled data can be viewed as providing hints to the learning of parameters at every layer that aid in generalization. The final model parameters of the ANN model are then learned by applying gradient descent over all the layers, using the initial values of parameters obtained from pretraining.

Hybrid Pretraining

Unsupervised pretraining can also be combined with supervised pretraining by using two output layers at every run of pretraining, one for reconstruction and the other for supervised classification. The parameters of the $l$th hidden layer are then learned by jointly minimizing the losses on both output layers, usually weighted by a trade-off hyper-parameter $\alpha$. Such a combined approach often shows better generalization performance than either of the approaches, since it provides a way to balance between the competing objectives of representing the input data and improving classification performance.

4.8.5 Characteristics of Deep Learning

ApartfromthebasiccharacteristicsofANNdiscussedinSection4.7.3 ,theuseofdeeplearningtechniquesprovidesthefollowingadditionalcharacteristics:

1. AnANNmodeltrainedforsometaskcanbeeasilyre-usedforadifferenttaskthatinvolvesthesameattributes,byusingpretrainingstrategies.Forexample,wecanusethelearnedparametersoftheoriginaltaskasinitialparameterchoicesforthetargettask.Inthisway,ANNpromotesre-usabilityoflearning,whichcanbequiteusefulwhenthetargetapplicationhasasmallernumberoflabeledtraininginstances.

2. Deeplearningtechniquesforregularization,suchasthedropoutmethod,helpinreducingthemodelcomplexityofANNandthuspromotinggoodgeneralizationperformance.Theuseofregularizationtechniquesisespeciallyusefulinhigh-dimensionalsettings,wherethenumberoftraininglabelsissmallbuttheclassificationproblemisinherentlydifficult.

3. Theuseofanautoencoderforpretrainingcanhelpeliminateirrelevantattributesthatarenotrelatedtootherattributes.Further,itcanhelpreducetheimpactofredundantattributesbyrepresentingthemascopiesofthesameattribute.

4. AlthoughthelearningofanANNmodelcansuccumbtofindinginferiorandlocallyoptimalsolutions,thereareanumberofdeeplearningtechniquesthathavebeenproposedtoensureadequatelearningofanANN.Apartfromthemethodsdiscussedinthissection,someothertechniquesinvolvenovelarchitecturedesignssuchasskipconnectionsbetweentheoutputlayerandlowerlayers,whichaidstheeasyflowofgradientsduringbackpropagation.

5. AnumberofspecializedANNarchitectureshavebeendesignedtohandleavarietyofinputdatasets.Someexamplesincludeconvolutionalneuralnetworks(CNN)fortwo-dimensionalgriddedobjectssuchasimages,andrecurrentneuralnetwork(RNN)forsequences.WhileCNNshavebeenextensivelyusedintheareaofcomputervision,RNNshavefoundapplicationsinprocessingspeechandlanguage.

4.9 Support Vector Machine (SVM)

A support vector machine (SVM) is a discriminative classification model that learns linear or nonlinear decision boundaries in the attribute space to separate the classes. Apart from maximizing the separability of the two classes, SVM offers strong regularization capabilities, i.e., it is able to control the complexity of the model in order to ensure good generalization performance. Due to its unique ability to innately regularize its learning, SVM is able to learn highly expressive models without suffering from overfitting. It has thus received considerable attention in the machine learning community and is commonly used in several practical applications, ranging from handwritten digit recognition to text categorization. SVM has strong roots in statistical learning theory and is based on the principle of structural risk minimization. Another unique aspect of SVM is that it represents the decision boundary using only a subset of the training examples that are most difficult to classify, known as the support vectors. Hence, it is a discriminative model that is impacted only by training instances near the boundary of the two classes, in contrast to learning the generative distribution of every class.

To illustrate the basic idea behind SVM, we first introduce the concept of the margin of a separating hyperplane and the rationale for choosing such a hyperplane with maximum margin. We then describe how a linear SVM can be trained to explicitly look for this type of hyperplane. We conclude by showing how the SVM methodology can be extended to learn nonlinear decision boundaries by using kernel functions.

4.9.1 Margin of a Separating Hyperplane

The generic equation of a separating hyperplane can be written as

$$\mathbf{w}^T\mathbf{x} + b = 0,$$

where $\mathbf{x}$ represents the attributes and $(\mathbf{w}, b)$ represent the parameters of the hyperplane. A data instance $\mathbf{x}_i$ can belong to either side of the hyperplane depending on the sign of $(\mathbf{w}^T\mathbf{x}_i + b)$. For the purpose of binary classification, we are interested in finding a hyperplane that places instances of both classes on opposite sides of the hyperplane, thus resulting in a separation of the two classes. If there exists a hyperplane that can perfectly separate the classes in the data set, we say that the data set is linearly separable. Figure 4.32 shows an example of linearly separable data involving two classes, squares and circles. Note that there can be infinitely many hyperplanes that can separate the classes, two of which are shown in Figure 4.32 as lines $B_1$ and $B_2$. Even though every such hyperplane will have zero training error, they can provide different results on previously unseen instances. Which separating hyperplane should we thus finally choose to obtain the best generalization performance? Ideally, we would like to choose a simple hyperplane that is robust to small perturbations. This can be achieved by using the concept of the margin of a separating hyperplane, which can be briefly described as follows.

Figure 4.32. Margin of a hyperplane in a two-dimensional data set.

For every separating hyperplane $B_i$, let us associate a pair of parallel hyperplanes, $b_{i1}$ and $b_{i2}$, such that they touch the closest instances of both classes, respectively. For example, if we move $B_1$ parallel to its direction, we can touch the first square using $b_{11}$ and the first circle using $b_{12}$. $b_{i1}$ and $b_{i2}$ are known as the margin hyperplanes of $B_i$, and the distance between them is known as the margin of the separating hyperplane $B_i$. From the diagram shown in Figure 4.32, notice that the margin for $B_1$ is considerably larger than that for $B_2$. In this example, $B_1$ turns out to be the separating hyperplane with the maximum margin, known as the maximum margin hyperplane.

Rationale for Maximum Margin

Hyperplaneswithlargemarginstendtohavebettergeneralizationperformancethanthosewithsmallmargins.Intuitively,ifthemarginissmall,thenanyslightperturbationinthehyperplaneorthetraininginstanceslocatedattheboundarycanhavequiteanimpactontheclassificationperformance.Smallmarginhyperplanesarethusmoresusceptibletooverfitting,astheyarebarelyabletoseparatetheclasseswithaverynarrowroomtoallowperturbations.Ontheotherhand,ahyperplanethatisfartherawayfromtraininginstancesofbothclasseshassufficientleewaytoberobusttominormodificationsinthedata,andthusshowssuperiorgeneralizationperformance.

Theideaofchoosingthemaximummarginseparatinghyperplanealsohasstrongfoundationsinstatisticallearningtheory.ItcanbeshownthatthemarginofsuchahyperplaneisinverselyrelatedtotheVC-dimensionoftheclassifier,whichisacommonlyusedmeasureofthecomplexityofamodel.AsdiscussedinSection3.4 ofthelastchapter,asimplermodelshouldbepreferredoveramorecomplexmodeliftheybothshowsimilartrainingperformance.Hence,maximizingthemarginresultsintheselectionofaseparatinghyperplanewiththelowestmodelcomplexity,whichisexpectedtoshowbettergeneralizationperformance.

4.9.2 Linear SVM

A linear SVM is a classifier that searches for a separating hyperplane with the largest margin, which is why it is often known as a maximal margin classifier. The basic idea of SVM can be described as follows.

Consider a binary classification problem consisting of $n$ training instances, where every training instance $\mathbf{x}_i$ is associated with a binary label $y_i \in \{-1, 1\}$. Let $\mathbf{w}^T\mathbf{x} + b = 0$ be the equation of a separating hyperplane that separates the two classes by placing them on opposite sides. This means that

$$\mathbf{w}^T\mathbf{x}_i + b > 0 \text{ if } y_i = 1, \qquad \mathbf{w}^T\mathbf{x}_i + b < 0 \text{ if } y_i = -1.$$

The distance of any point $\mathbf{x}$ from the hyperplane is then given by

$$D(\mathbf{x}) = \frac{|\mathbf{w}^T\mathbf{x} + b|}{\|\mathbf{w}\|},$$

where $|\cdot|$ denotes the absolute value and $\|\cdot\|$ denotes the length of a vector. Let the distance of the closest point from the hyperplane with $y = 1$ be $k_+ > 0$. Similarly, let $k_- > 0$ denote the distance of the closest point from class $-1$. This can be represented using the following constraints:

$$\frac{\mathbf{w}^T\mathbf{x}_i + b}{\|\mathbf{w}\|} \geq k_+ \text{ if } y_i = 1, \qquad \frac{\mathbf{w}^T\mathbf{x}_i + b}{\|\mathbf{w}\|} \leq -k_- \text{ if } y_i = -1. \qquad (4.69)$$

The previous equations can be succinctly represented by using the product of $y_i$ and $(\mathbf{w}^T\mathbf{x}_i + b)$ as

$$y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq M\|\mathbf{w}\|, \qquad (4.70)$$

where $M$ is a parameter related to the margin of the hyperplane, i.e., if $k_+ = k_- = M$, then the margin is $k_+ + k_- = 2M$. In order to find the maximum margin hyperplane that adheres to the previous constraints, we can consider the following optimization problem:

$$\max_{\mathbf{w}, b} \; M \qquad \text{subject to } y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq M\|\mathbf{w}\|. \qquad (4.71)$$

To find the solution to the previous problem, note that if $\mathbf{w}$ and $b$ satisfy the constraints of the previous problem, then any scaled version of $\mathbf{w}$ and $b$ would satisfy them too. Hence, we can conveniently choose $\|\mathbf{w}\| = 1/M$ to simplify the right-hand side of the inequalities. Furthermore, maximizing $M$ amounts to minimizing $\|\mathbf{w}\|^2$. Hence, the optimization problem of SVM is commonly represented in the following form:

$$\min_{\mathbf{w}, b} \; \frac{\|\mathbf{w}\|^2}{2} \qquad \text{subject to } y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1. \qquad (4.72)$$

Learning Model Parameters

Equation 4.72 represents a constrained optimization problem with linear inequalities. Since the objective function is convex and quadratic with respect to $\mathbf{w}$, it is known as a quadratic programming problem (QPP), which can be solved using standard optimization techniques, as described in Appendix E. In the following, we present a brief sketch of the main ideas for learning the model parameters of SVM.

First, we rewrite the objective function in a form that takes into account the constraints imposed on its solutions. The new objective function is known as the Lagrangian primal problem, which can be represented as follows:

$$L_P = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{n} \lambda_i\big(y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1\big), \qquad (4.73)$$

where the parameters $\lambda_i \geq 0$ correspond to the constraints and are called the Lagrange multipliers. Next, to minimize the Lagrangian, we take the derivatives of $L_P$ with respect to $\mathbf{w}$ and $b$ and set them equal to zero:

$$\frac{\partial L_P}{\partial \mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_{i=1}^{n} \lambda_i y_i \mathbf{x}_i, \qquad (4.74)$$

$$\frac{\partial L_P}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{n} \lambda_i y_i = 0. \qquad (4.75)$$

Note that using Equation 4.74, we can represent $\mathbf{w}$ completely in terms of the Lagrange multipliers. There is another relationship between $(\mathbf{w}, b)$ and $\lambda_i$ that is derived from the Karush-Kuhn-Tucker (KKT) conditions, a commonly used technique for solving QPP. This relationship can be described as

$$\lambda_i\big[y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1\big] = 0. \qquad (4.76)$$

Equation 4.76 is known as the complementary slackness condition, which sheds light on a valuable property of SVM. It states that the Lagrange multiplier $\lambda_i$ is strictly greater than 0 only when $\mathbf{x}_i$ satisfies the equation $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) = 1$, which means that $\mathbf{x}_i$ lies exactly on a margin hyperplane. However, if $\mathbf{x}_i$ is farther away from the margin hyperplanes such that $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) > 1$, then $\lambda_i$ is necessarily 0. Hence, $\lambda_i > 0$ for only a small number of instances that are closest to the separating hyperplane, which are known as support vectors. Figure 4.33 shows the support vectors of a hyperplane as filled circles and squares. Further, if we look at Equation 4.74, we will observe that training instances with $\lambda_i = 0$ do not contribute to the weight parameter $\mathbf{w}$. This suggests that $\mathbf{w}$ can be concisely represented only in terms of the support vectors in the training data, which are quite fewer than the overall number of training instances. This ability to represent the decision function only in terms of the support vectors is what gives this classifier the name support vector machine.

Figure 4.33. Support vectors of a hyperplane shown as filled circles and squares.

Using Equations 4.74, 4.75, and 4.76 in Equation 4.73, we obtain the following optimization problem in terms of the Lagrange multipliers $\lambda_i$:

$$\max_{\lambda_i} \; \sum_{i=1}^{n} \lambda_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \lambda_i \lambda_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j \qquad \text{subject to } \sum_{i=1}^{n}\lambda_i y_i = 0, \; \lambda_i \geq 0. \qquad (4.77)$$

The previous optimization problem is called the dual optimization problem. Maximizing the dual problem with respect to $\lambda_i$ is equivalent to minimizing the primal problem with respect to $\mathbf{w}$ and $b$. The key differences between the dual and primal problems are as follows:

1. Solving the dual problem helps us identify the support vectors in the data that have non-zero values of $\lambda_i$. Further, the solution of the dual problem is influenced only by the support vectors that are closest to the decision boundary of SVM. This helps in summarizing the learning of SVM solely in terms of its support vectors, which are easier to manage computationally. Further, it represents a unique ability of SVM to be dependent only on the instances closest to the boundary, which are harder to classify, rather than the distribution of instances farther away from the boundary.

2. The objective of the dual problem involves only terms of the form $\mathbf{x}_i^T\mathbf{x}_j$, which are basically inner products in the attribute space. As we will see later in Section 4.9.4, this property will prove to be quite useful in learning nonlinear decision boundaries using SVM.

Because of these differences, it is useful to solve the dual optimization problem using any of the standard solvers for QPP. Having found an optimal solution for $\lambda_i$, we can use Equation 4.74 to solve for $\mathbf{w}$. We can then use Equation 4.76 on the support vectors to solve for $b$ as follows:

$$b = \frac{1}{n_S}\sum_{i \in S} \frac{1 - y_i\mathbf{w}^T\mathbf{x}_i}{y_i}, \qquad (4.78)$$

where $S$ represents the set of support vectors $(S = \{i \,|\, \lambda_i > 0\})$ and $n_S$ is the number of support vectors. The maximum margin hyperplane can then be expressed as

$$f(\mathbf{x}) = \Big(\sum_{i=1}^{n} \lambda_i y_i \mathbf{x}_i^T\mathbf{x}\Big) + b = 0. \qquad (4.79)$$

Using this separating hyperplane, a test instance $\mathbf{x}$ can be assigned a class label using the sign of $f(\mathbf{x})$.

Example 4.7. Consider the two-dimensional data set shown in Figure 4.34, which contains eight training instances. Using quadratic programming, we can solve the optimization problem stated in Equation 4.77 to obtain the Lagrange multiplier $\lambda_i$ for each training instance. The Lagrange multipliers are depicted in the last column of the table. Notice that only the first two instances have non-zero Lagrange multipliers. These instances correspond to the support vectors for this data set.

Let $\mathbf{w} = (w_1, w_2)$ and $b$ denote the parameters of the decision boundary. Using Equation 4.74, we can solve for $w_1$ and $w_2$ in the following way:

$$w_1 = \sum_i \lambda_i y_i x_{i1} = 65.5261 \times 1 \times 0.3858 + 65.5261 \times (-1) \times 0.4871 = -6.64,$$
$$w_2 = \sum_i \lambda_i y_i x_{i2} = 65.5261 \times 1 \times 0.4687 + 65.5261 \times (-1) \times 0.611 = -9.32.$$

Figure 4.34. Example of a linearly separable data set.

The bias term $b$ can be computed using Equation 4.76 for each support vector:

$$b^{(1)} = 1 - \mathbf{w} \cdot \mathbf{x}_1 = 1 - (-6.64)(0.3858) - (-9.32)(0.4687) = 7.9300,$$
$$b^{(2)} = -1 - \mathbf{w} \cdot \mathbf{x}_2 = -1 - (-6.64)(0.4871) - (-9.32)(0.611) = 7.9289.$$

Averaging these values, we obtain $b = 7.93$. The decision boundary corresponding to these parameters is shown in Figure 4.34.
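The arithmetic of Example 4.7 can be reproduced directly from the two support vectors and their common Lagrange multiplier reported in the text, as in the following sketch.

```python
import numpy as np

# The two support vectors from Example 4.7 (Figure 4.34), their labels,
# and the Lagrange multiplier reported for both of them.
x1, y1 = np.array([0.3858, 0.4687]),  1
x2, y2 = np.array([0.4871, 0.6110]), -1
lam = 65.5261

# Equation 4.74: w = sum_i lambda_i * y_i * x_i
w = lam * y1 * x1 + lam * y2 * x2
print(w)  # approximately [-6.64, -9.32]

# Equation 4.76 on each support vector, then average the two estimates of b.
b1 =  1 - w @ x1
b2 = -1 - w @ x2
print((b1 + b2) / 2)  # approximately 7.93
```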

4.9.3 Soft-margin SVM

Figure 4.35 shows a data set that is similar to Figure 4.32, except it has two new examples, P and Q. Although the decision boundary $B_1$ misclassifies the new examples, while $B_2$ classifies them correctly, this does not mean that $B_2$ is a better decision boundary than $B_1$, because the new examples may correspond to noise in the training data. $B_1$ should still be preferred over $B_2$ because it has a wider margin, and thus is less susceptible to overfitting. However, the SVM formulation presented in the previous section only constructs decision boundaries that are mistake-free.

Figure 4.35. Decision boundary of SVM for the non-separable case.

This section examines how the formulation of SVM can be modified to learn a separating hyperplane that is tolerant to a small number of training errors, using a method known as the soft-margin approach. More importantly, the method presented in this section allows SVM to learn linear hyperplanes even in situations where the classes are not linearly separable. To do this, the learning algorithm in SVM must consider the trade-off between the width of the margin and the number of training errors committed by the linear hyperplane.

To introduce the concept of training errors in the SVM formulation, let us relax the inequality constraints to accommodate some violations on a small number of training instances. This can be done by introducing a slack variable $\xi_i \geq 0$ for every training instance $\mathbf{x}_i$ as follows:

$$y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1 - \xi_i. \qquad (4.80)$$

The variable $\xi_i$ allows for some slack in the inequalities of the SVM, such that every instance $\mathbf{x}_i$ does not need to strictly satisfy $y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1$. Further, $\xi_i$ is non-zero only if the margin hyperplanes are not able to place $\mathbf{x}_i$ on the same side as the rest of the instances belonging to $y_i$. To illustrate this, Figure 4.36 shows a circle P that falls on the opposite side of the separating hyperplane as the rest of the circles, and thus satisfies $\mathbf{w}^T\mathbf{x} + b = -1 + \xi$. The distance between P and the margin hyperplane $\mathbf{w}^T\mathbf{x} + b = -1$ is equal to $\xi/\|\mathbf{w}\|$. Hence, $\xi_i$ provides a measure of the error of SVM in representing $\mathbf{x}_i$ using soft inequality constraints.

Figure 4.36. Slack variables used in soft-margin SVM.

In the presence of slack variables, it is important to learn a separating hyperplane that jointly maximizes the margin (ensuring good generalization performance) and minimizes the values of the slack variables (ensuring low training error). This can be achieved by modifying the optimization problem of SVM as follows:

$$\min_{\mathbf{w}, b, \xi_i} \; \frac{\|\mathbf{w}\|^2}{2} + C\sum_{i=1}^{n}\xi_i \qquad \text{subject to } y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1 - \xi_i, \; \xi_i \geq 0, \qquad (4.81)$$

where $C$ is a hyper-parameter that makes a trade-off between maximizing the margin and minimizing the training error. A large value of $C$ puts more emphasis on minimizing the training error than maximizing the margin. Notice the similarity of the previous equation with the generic formula of generalization error rate introduced in Section 3.4 of the previous chapter. Indeed, SVM provides a natural way to balance between model complexity and training error in order to maximize generalization performance.

To solve Equation 4.81, we apply the Lagrange multiplier method and convert the primal problem to its corresponding dual problem, similar to the approach described in the previous section. The Lagrangian primal problem of Equation 4.81 can be written as follows:

$$L_P = \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{n}\xi_i - \sum_{i=1}^{n}\lambda_i\big(y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 + \xi_i\big) - \sum_{i=1}^{n}\mu_i\xi_i, \qquad (4.82)$$

where $\lambda_i \geq 0$ and $\mu_i \geq 0$ are the Lagrange multipliers corresponding to the inequality constraints of Equation 4.81. Setting the derivatives of $L_P$ with respect to $\mathbf{w}$, $b$, and $\xi_i$ equal to 0, we obtain the following equations:

$$\frac{\partial L_P}{\partial \mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_{i=1}^{n}\lambda_i y_i \mathbf{x}_i, \qquad (4.83)$$

$$\frac{\partial L_P}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{n}\lambda_i y_i = 0, \qquad (4.84)$$

$$\frac{\partial L_P}{\partial \xi_i} = 0 \;\Rightarrow\; \lambda_i + \mu_i = C. \qquad (4.85)$$

We can also obtain the complementary slackness conditions by using the following KKT conditions:

$$\lambda_i\big(y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 + \xi_i\big) = 0, \qquad (4.86)$$

$$\mu_i\xi_i = 0. \qquad (4.87)$$

Equation 4.86 suggests that $\lambda_i$ is zero for all training instances except those that reside on the margin hyperplanes $\mathbf{w}^T\mathbf{x}_i + b = \pm 1$ or have $\xi_i > 0$. These instances with $\lambda_i > 0$ are known as support vectors. On the other hand, $\mu_i$ given in Equation 4.87 is zero for any training instance that is misclassified, i.e., $\xi_i > 0$. Further, $\lambda_i$ and $\mu_i$ are related with each other by Equation 4.85. This results in the following three configurations of $(\lambda_i, \mu_i)$:

1. If $\lambda_i = 0$ and $\mu_i = C$, then $\mathbf{x}_i$ does not reside on the margin hyperplanes and is correctly classified on the same side as other instances belonging to $y_i$.

2. If $\lambda_i = C$ and $\mu_i = 0$, then $\mathbf{x}_i$ is misclassified and has a non-zero slack variable $\xi_i$.

3. If $0 < \lambda_i < C$ and $0 < \mu_i < C$, then $\mathbf{x}_i$ resides on one of the margin hyperplanes.

Substituting Equations 4.83 to 4.87 into Equation 4.82, we obtain the following dual optimization problem:

$$\max_{\lambda_i} \; \sum_{i=1}^{n}\lambda_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\lambda_i\lambda_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j \qquad \text{subject to } \sum_{i=1}^{n}\lambda_i y_i = 0, \; 0 \leq \lambda_i \leq C. \qquad (4.88)$$

Notice that the previous problem looks almost identical to the dual problem of SVM for the linearly separable case (Equation 4.77), except that $\lambda_i$ is required to not only be greater than 0 but also smaller than a constant value $C$. Clearly, when $C$ reaches infinity, the previous optimization problem becomes equivalent to Equation 4.77, where the learned hyperplane perfectly separates the classes (with no training errors). However, by capping the values of $\lambda_i$ to $C$, the learned hyperplane is able to tolerate a few training errors that have $\xi_i > 0$.

Figure 4.37. Hinge loss as a function of $y\hat{y}$.

As before, Equation 4.88 can be solved by using any of the standard solvers for QPP, and the optimal value of $\mathbf{w}$ can be obtained by using Equation 4.83. To solve for $b$, we can use Equation 4.86 on the support vectors that reside on the margin hyperplanes as follows:

$$b = \frac{1}{n_S}\sum_{i \in S}\frac{1 - y_i\mathbf{w}^T\mathbf{x}_i}{y_i}, \qquad (4.89)$$

where $S$ represents the set of support vectors residing on the margin hyperplanes $(S = \{i \,|\, 0 < \lambda_i < C\})$ and $n_S$ is the number of elements in $S$.
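In practice, the dual problem is rarely solved by hand. Assuming a library such as scikit-learn is available, a soft-margin linear SVM can be fit as sketched below, where the synthetic data and the choice $C = 1.0$ are purely illustrative.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two noisy, roughly linearly separable classes (hypothetical data).
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])

# A linear soft-margin SVM; C controls the margin / training-error trade-off.
clf = SVC(kernel="linear", C=1.0).fit(X, y)

print(clf.coef_, clf.intercept_)  # the learned w and b
print(len(clf.support_))          # number of support vectors (instances with lambda_i > 0)
print(clf.predict([[1.5, 1.5]]))  # class label from the sign of w^T x + b
```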

SVM as a Regularizer of Hinge Loss

SVM belongs to a broad class of regularization techniques that use a loss function to represent the training errors and a norm of the model parameters to represent the model complexity. To realize this, notice that the slack variable $\xi$, used for measuring the training errors in SVM, is equivalent to the hinge loss function, which can be defined as follows:

$$\mathrm{Loss}(y, \hat{y}) = \max(0, 1 - y\hat{y}),$$

where $y \in \{+1, -1\}$. In the case of SVM, $\hat{y}$ corresponds to $\mathbf{w}^T\mathbf{x} + b$. Figure 4.37 shows a plot of the hinge loss function as we vary $y\hat{y}$. We can see that the hinge loss is equal to 0 as long as $y$ and $\hat{y}$ have the same sign and $|\hat{y}| \geq 1$. However, the hinge loss grows linearly with $|\hat{y}|$ whenever $y$ and $\hat{y}$ are of the opposite sign or $|\hat{y}| < 1$. This is similar to the notion of the slack variable, which is used to measure the distance of a point from its margin hyperplane. Hence, the optimization problem of SVM can be represented in the following equivalent form:

$$\min_{\mathbf{w}, b} \; \frac{\|\mathbf{w}\|^2}{2} + C\sum_{i=1}^{n}\mathrm{Loss}(y_i, \mathbf{w}^T\mathbf{x}_i + b). \qquad (4.90)$$

Note that using the hinge loss ensures that the optimization problem is convex and can be solved using standard optimization techniques. However, if we use a different loss function, such as the squared loss function that was introduced in Section 4.7 on ANN, it will result in a different optimization problem that may or may not remain convex. Nevertheless, different loss functions can be explored to capture varying notions of training error, depending on the characteristics of the problem.
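A short sketch of the hinge loss and of the objective in Equation 4.90, evaluated on hypothetical values of $\mathbf{w}$, $b$, and $C$, is given below.

```python
import numpy as np

def hinge_loss(y, y_hat):
    """Hinge loss: max(0, 1 - y * y_hat), with y in {+1, -1} and y_hat = w^T x + b."""
    return np.maximum(0.0, 1.0 - y * y_hat)

y_hat = np.array([2.0, 0.5, 0.0, -0.5, -2.0])
print(hinge_loss(+1, y_hat))  # [0.  0.5 1.  1.5 3. ]

# Equation 4.90: ||w||^2 / 2 + C * sum of hinge losses over the training set
# (hypothetical w, b, C, and a two-instance training set).
w, b, C = np.array([1.0, -2.0]), 0.5, 10.0
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([+1, -1])
objective = 0.5 * w @ w + C * hinge_loss(y, X @ w + b).sum()
print(objective)
```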

Another interesting property of SVM that relates it to a broader class of regularization techniques is the concept of a margin. Although minimizing $\|\mathbf{w}\|^2$ has the geometric interpretation of maximizing the margin of a separating hyperplane, it is essentially the squared $L_2$ norm of the model parameters, $\|\mathbf{w}\|_2^2$. In general, the $L_q$ norm of $\mathbf{w}$, $\|\mathbf{w}\|_q$, is equal to the Minkowski distance of order $q$ from $\mathbf{w}$ to the origin, i.e.,

$$\|\mathbf{w}\|_q = \Big(\sum_{i=1}^{p} w_i^q\Big)^{1/q}.$$

Minimizing the $L_q$ norm of $\mathbf{w}$ to achieve lower model complexity is a generic regularization concept that has several interpretations. For example, minimizing the $L_2$ norm amounts to finding a solution on a hypersphere of smallest radius that shows suitable training performance. To visualize this in two dimensions, Figure 4.38(a) shows the plot of a circle with constant radius $r$, where every point has the same $L_2$ norm. On the other hand, using the $L_1$ norm ensures that the solution lies on the surface of a hypercube with smallest size, with vertices along the axes. This is illustrated in Figure 4.38(b) as a square with vertices on the axes at a distance of $r$ from the origin. The $L_1$ norm is commonly used as a regularizer to obtain sparse model parameters with only a small number of non-zero parameter values, such as the use of Lasso in regression problems (see Bibliographic Notes).

Figure 4.38. Plots showing the behavior of two-dimensional solutions with constant $L_2$ and $L_1$ norms.

In general, depending on the characteristics of the problem, different combinations of $L_q$ norms and training loss functions can be used for learning the model parameters, each requiring a different optimization solver. This forms the backbone of a wide range of modeling techniques that attempt to improve the generalization performance by jointly minimizing training error and model complexity. However, in this section, we focus only on the squared $L_2$ norm and the hinge loss function, resulting in the classical formulation of SVM.

4.9.4 Nonlinear SVM

The SVM formulations described in the previous sections construct a linear decision boundary to separate the training examples into their respective classes. This section presents a methodology for applying SVM to data sets that have nonlinear decision boundaries. The basic idea is to transform the data from its original attribute space $\mathbf{x}$ into a new space $\varphi(\mathbf{x})$ so that a linear hyperplane can be used to separate the instances in the transformed space, using the SVM approach. The learned hyperplane can then be projected back to the original attribute space, resulting in a nonlinear decision boundary.

Figure 4.39. Classifying data with a nonlinear decision boundary.

Attribute Transformation

To illustrate how attribute transformation can lead to a linear decision boundary, Figure 4.39(a) shows an example of a two-dimensional data set consisting of squares (classified as $y = 1$) and circles (classified as $y = -1$). The

data set is generated in such a way that all the circles are clustered near the center of the diagram and all the squares are distributed farther away from the center. Instances of the data set can be classified using the following equation:

$$y = \begin{cases} 1 & \text{if } \sqrt{(x_1 - 0.5)^2 + (x_2 - 0.5)^2} > 0.2, \\ -1 & \text{otherwise.} \end{cases} \qquad (4.91)$$

The decision boundary for the data can therefore be written as follows:

$$\sqrt{(x_1 - 0.5)^2 + (x_2 - 0.5)^2} = 0.2,$$

which can be further simplified into the following quadratic equation:

$$x_1^2 - x_1 + x_2^2 - x_2 = -0.46.$$

A nonlinear transformation $\varphi$ is needed to map the data from its original attribute space into a new space such that a linear hyperplane can separate the classes. This can be achieved by using the following simple transformation:

$$\varphi: (x_1, x_2) \rightarrow (x_1^2 - x_1,\; x_2^2 - x_2). \qquad (4.92)$$

Figure 4.39(b) shows the points in the transformed space, where we can see that all the circles are located in the lower left-hand side of the diagram. A linear hyperplane with parameters $\mathbf{w}$ and $b$ can therefore be constructed in the transformed space to separate the instances into their respective classes.

One may think that because the nonlinear transformation possibly increases the dimensionality of the input space, this approach can suffer from the curse of dimensionality that is often associated with high-dimensional data.

However,aswewillseeinthefollowingsection,nonlinearSVMisabletoavoidthisproblembyusingkernelfunctions.

Learning a Nonlinear SVM Model

Using a suitable function $\varphi(\cdot)$, we can transform any data instance $\mathbf{x}$ to $\varphi(\mathbf{x})$. (The details on how to choose $\varphi(\cdot)$ will become clear later.) The linear hyperplane in the transformed space can be expressed as $\mathbf{w}^T\varphi(\mathbf{x}) + b = 0$. To learn the optimal separating hyperplane, we can substitute $\varphi(\mathbf{x})$ for $\mathbf{x}$ in the formulation of SVM to obtain the following optimization problem:

$$\min_{\mathbf{w}, b, \xi_i} \; \frac{\|\mathbf{w}\|^2}{2} + C\sum_{i=1}^{n}\xi_i \qquad \text{subject to } y_i(\mathbf{w}^T\varphi(\mathbf{x}_i) + b) \geq 1 - \xi_i, \; \xi_i \geq 0. \qquad (4.93)$$

Using Lagrange multipliers $\lambda_i$, this can be converted into a dual optimization problem:

$$\max_{\lambda_i} \; \sum_{i=1}^{n}\lambda_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\lambda_i\lambda_j y_i y_j \langle\varphi(\mathbf{x}_i), \varphi(\mathbf{x}_j)\rangle \qquad \text{subject to } \sum_{i=1}^{n}\lambda_i y_i = 0, \; 0 \leq \lambda_i \leq C, \qquad (4.94)$$

where $\langle \mathbf{a}, \mathbf{b}\rangle$ denotes the inner product between vectors $\mathbf{a}$ and $\mathbf{b}$. Also, the equation of the hyperplane in the transformed space can be represented using $\lambda_i$ as follows:

$$\sum_{i=1}^{n}\lambda_i y_i \langle\varphi(\mathbf{x}_i), \varphi(\mathbf{x})\rangle + b = 0. \qquad (4.95)$$

Further, $b$ is given by

$$b = \frac{1}{n_S}\Big(\sum_{i \in S}\frac{1}{y_i} - \sum_{i \in S}\sum_{j=1}^{n}\frac{\lambda_j y_i y_j \langle\varphi(\mathbf{x}_i), \varphi(\mathbf{x}_j)\rangle}{y_i}\Big), \qquad (4.96)$$

where $S = \{i \,|\, 0 < \lambda_i < C\}$ is the set of support vectors residing on the margin hyperplanes and $n_S$ is the number of elements in $S$.

Note that in order to solve the dual optimization problem in Equation 4.94, or to use the learned model parameters to make predictions using Equations 4.95 and 4.96, we need only inner products of $\varphi(\mathbf{x})$. Hence, even though $\varphi(\mathbf{x})$ may be nonlinear and high-dimensional, it suffices to use a function of the inner products of $\varphi(\mathbf{x})$ in the transformed space. This can be achieved by using a kernel trick, which can be described as follows.

The inner product between two vectors is often regarded as a measure of similarity between the vectors. For example, the cosine similarity described in Section 2.4.5 on page 79 can be defined as the dot product between two vectors that are normalized to unit length. Analogously, the inner product $\langle\varphi(\mathbf{x}_i), \varphi(\mathbf{x}_j)\rangle$ can also be regarded as a measure of similarity between two instances, $\mathbf{x}_i$ and $\mathbf{x}_j$, in the transformed space. The kernel trick is a method for computing this similarity as a function of the original attributes. Specifically, the kernel function $K(\mathbf{u}, \mathbf{v})$ between two instances $\mathbf{u}$ and $\mathbf{v}$ can be defined as follows:

$$K(\mathbf{u}, \mathbf{v}) = \langle\varphi(\mathbf{u}), \varphi(\mathbf{v})\rangle = f(\mathbf{u}, \mathbf{v}), \qquad (4.97)$$

where $f(\cdot)$ is a function that follows certain conditions as stated by Mercer's Theorem. Although the details of this theorem are outside the scope of the book, we provide a list of some of the commonly used kernel functions:

$$\text{Polynomial kernel:} \qquad K(\mathbf{u}, \mathbf{v}) = (\mathbf{u}^T\mathbf{v} + 1)^p \qquad (4.98)$$

$$\text{Radial Basis Function kernel:} \qquad K(\mathbf{u}, \mathbf{v}) = e^{-\|\mathbf{u}-\mathbf{v}\|^2/(2\sigma^2)} \qquad (4.99)$$

$$\text{Sigmoid kernel:} \qquad K(\mathbf{u}, \mathbf{v}) = \tanh(k\mathbf{u}^T\mathbf{v} - \delta) \qquad (4.100)$$

By using a kernel function, we can directly work with inner products in the transformed space without dealing with the exact forms of the nonlinear transformation function $\varphi$. Specifically, this allows us to use high-dimensional transformations (sometimes even involving infinitely many dimensions), while performing calculations only in the original attribute space. Computing the inner products using kernel functions is also considerably cheaper than using the transformed attribute set $\varphi(\mathbf{x})$. Hence, the use of kernel functions provides a significant advantage in representing nonlinear decision boundaries, without suffering from the curse of dimensionality. This has been one of the major reasons behind the widespread usage of SVM in highly complex and nonlinear problems.
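The sketch below evaluates the RBF and polynomial kernels of Equations 4.98 and 4.99 directly in the original attribute space and, for the degree-2 polynomial kernel on two attributes, verifies that the kernel value equals an explicit inner product $\langle\varphi(\mathbf{u}), \varphi(\mathbf{v})\rangle$ in a transformed space; the particular $\varphi$ shown is one standard choice for this kernel, not the transformation of Equation 4.92.

```python
import numpy as np

def rbf_kernel(u, v, sigma=1.0):
    """Radial basis function kernel, Equation 4.99."""
    return np.exp(-np.sum((u - v) ** 2) / (2 * sigma ** 2))

def poly_kernel(u, v, p=2):
    """Polynomial kernel, Equation 4.98."""
    return (u @ v + 1) ** p

u = np.array([0.3, 0.7])
v = np.array([0.5, 0.1])

# The kernels work entirely with the original attributes, never forming phi(x).
print(rbf_kernel(u, v, sigma=0.5))
print(poly_kernel(u, v, p=2))

# For p = 2 and two attributes, one explicit transformation is
# phi(x) = (x1^2, x2^2, sqrt(2) x1 x2, sqrt(2) x1, sqrt(2) x2, 1),
# and <phi(u), phi(v)> equals the polynomial kernel value computed above.
def phi(x):
    x1, x2 = x
    return np.array([x1 * x1, x2 * x2, np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2, 1.0])

print(phi(u) @ phi(v))  # matches poly_kernel(u, v, p=2)
```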

Figure 4.40. Decision boundary produced by a nonlinear SVM with polynomial kernel.

Figure 4.40 shows the nonlinear decision boundary obtained by SVM using the polynomial kernel function given in Equation 4.98. We can see that the learned decision boundary is quite close to the true decision boundary shown in Figure 4.39(a). Although the choice of kernel function depends on the characteristics of the input data, a commonly used kernel function is the radial basis function (RBF) kernel, which involves a single hyper-parameter $\sigma$, known as the standard deviation of the RBF kernel.

4.9.5 Characteristics of SVM

1. TheSVMlearningproblemcanbeformulatedasaconvexoptimizationproblem,inwhichefficientalgorithmsareavailabletofindtheglobalminimumoftheobjectivefunction.Otherclassificationmethods,suchasrule-basedclassifiersandartificialneuralnetworks,employagreedystrategytosearchthehypothesisspace.Suchmethodstendtofindonlylocallyoptimumsolutions.

2. SVMprovidesaneffectivewayofregularizingthemodelparametersbymaximizingthemarginofthedecisionboundary.Furthermore,itisabletocreateabalancebetweenmodelcomplexityandtrainingerrorsbyusingahyper-parameterC.Thistrade-offisgenerictoabroaderclassofmodellearningtechniquesthatcapturethemodelcomplexityandthetraininglossusingdifferentformulations.

3. LinearSVMcanhandleirrelevantattributesbylearningzeroweightscorrespondingtosuchattributes.Itcanalsohandleredundantattributesbylearningsimilarweightsfortheduplicateattributes.Furthermore,theabilityofSVMtoregularizeitslearningmakesitmorerobusttothepresenceofalargenumberofirrelevantandredundantattributesthanotherclassifiers,eveninhigh-dimensionalsettings.Forthisreason,nonlinearSVMsarelessimpactedbyirrelevantandredundantattributesthanotherhighlyexpressiveclassifiersthatcanlearnnonlineardecisionboundariessuchasdecisiontrees.


To compare the effect of irrelevant attributes on the performance of nonlinear SVMs and decision trees, consider the two-dimensional data set shown in Figure 4.41(a) containing 500 + instances and 500 o instances, where the two classes can be easily separated using a nonlinear decision boundary. We incrementally add irrelevant attributes to this data set and compare the performance of two classifiers: decision tree and nonlinear SVM (using the radial basis function kernel), using 70% of the data for training and the rest for testing. Figure 4.41(b) shows the test error rates of the two classifiers as we increase the number of irrelevant attributes. We can see that the test error rate of decision trees swiftly reaches 0.5 (same as random guessing) in the presence of even a small number of irrelevant attributes. This can be attributed to the problem of multiple comparisons while choosing splitting attributes at internal nodes, as discussed in Example 3.7 of the previous chapter. On the other hand, nonlinear SVM shows a more robust and steady performance even after adding a moderately large number of irrelevant attributes. Its performance gradually declines, with the test error rate eventually reaching close to 0.5 after adding 125 irrelevant attributes, at which point it becomes difficult to discern the discriminative information in the original two attributes from the noise in the remaining attributes for learning nonlinear decision boundaries.

Figure4.41.ComparingtheeffectofaddingirrelevantattributesontheperformanceofnonlinearSVMsanddecisiontrees.

4. SVM can be applied to categorical data by introducing dummy variables for each categorical attribute value present in the data. For example, if an attribute has three values {Single, Married, Divorced}, we can introduce a binary variable for each of the attribute values.

5. The SVM formulation presented in this chapter is for binary class problems. However, multiclass extensions of SVM have also been proposed.

6. Although the training time of an SVM model can be large, the learned parameters can be succinctly represented with the help of a small number of support vectors, making the classification of test instances quite fast.

4.10 Ensemble Methods

This section presents techniques for improving classification accuracy by aggregating the predictions of multiple classifiers. These techniques are known as ensemble or classifier combination methods. An ensemble method constructs a set of base classifiers from training data and performs classification by taking a vote on the predictions made by each base classifier. This section explains why ensemble methods tend to perform better than any single classifier and presents techniques for constructing the classifier ensemble.

4.10.1 Rationale for Ensemble Method

The following example illustrates how an ensemble method can improve a classifier's performance.

Example 4.8. Consider an ensemble of 25 binary classifiers, each of which has an error rate of ε = 0.35. The ensemble classifier predicts the class label of a test example by taking a majority vote on the predictions made by the base classifiers. If the base classifiers are identical, then all the base classifiers will commit the same mistakes. Thus, the error rate of the ensemble remains 0.35. On the other hand, if the base classifiers are independent—i.e., their errors are uncorrelated—then the ensemble makes a wrong prediction only if more than half of the base classifiers predict incorrectly. In this case, the error rate of the ensemble classifier is

e_{ensemble} = \sum_{i=13}^{25} \binom{25}{i} \epsilon^{i} (1-\epsilon)^{25-i} = 0.06,   (4.101)

which is considerably lower than the error rate of the base classifiers.
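The sum in Equation 4.101 can be evaluated directly. The following minimal Python sketch (the function name ensemble_error is our own illustration, not part of the text) reproduces the calculation for 25 independent base classifiers with ε = 0.35.

import math

def ensemble_error(epsilon, n_classifiers=25):
    # Probability that more than half of the independent base classifiers err.
    k_min = n_classifiers // 2 + 1   # 13 out of 25
    return sum(math.comb(n_classifiers, k)
               * epsilon ** k * (1 - epsilon) ** (n_classifiers - k)
               for k in range(k_min, n_classifiers + 1))

print(round(ensemble_error(0.35), 2))   # 0.06, matching Equation 4.101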

Figure 4.42 shows the error rate of an ensemble of 25 binary classifiers (e_ensemble) for different base classifier error rates (ε). The diagonal line represents the case in which the base classifiers are identical, while the solid line represents the case in which the base classifiers are independent. Observe that the ensemble classifier performs worse than the base classifiers when ε is larger than 0.5.

The preceding example illustrates two necessary conditions for an ensemble classifier to perform better than a single classifier: (1) the base classifiers should be independent of each other, and (2) the base classifiers should do better than a classifier that performs random guessing. In practice, it is difficult to ensure total independence among the base classifiers. Nevertheless, improvements in classification accuracies have been observed in ensemble methods in which the base classifiers are somewhat correlated.

4.10.2 Methods for Constructing an Ensemble Classifier

A logical view of the ensemble method is presented in Figure 4.43. The basic idea is to construct multiple classifiers from the original data and then aggregate their predictions when classifying unknown examples. The ensemble of classifiers can be constructed in many ways:


1. By manipulating the training set. In this approach, multiple training sets are created by resampling the original data according to some sampling distribution and constructing a classifier from each training set. The sampling distribution determines how likely it is that an example will be selected for training, and it may vary from one trial to another. Bagging and boosting are two examples of ensemble methods that manipulate their training sets. These methods are described in further detail in Sections 4.10.4 and 4.10.5.

Figure 4.42. Comparison between errors of base classifiers and errors of the ensemble classifier.

Figure 4.43. A logical view of the ensemble learning method.

2. By manipulating the input features. In this approach, a subset of input features is chosen to form each training set. The subset can be either chosen randomly or based on the recommendation of domain experts. Some studies have shown that this approach works very well with data sets that contain highly redundant features. Random forest, which is described in Section 4.10.6, is an ensemble method that manipulates its input features and uses decision trees as its base classifiers.

3. By manipulating the class labels. This method can be used when the number of classes is sufficiently large. The training data is transformed into a binary class problem by randomly partitioning the class labels into two disjoint subsets, A_0 and A_1. Training examples whose class label belongs to the subset A_0 are assigned to class 0, while those that belong to the subset A_1 are assigned to class 1. The relabeled examples are then used to train a base classifier. By repeating this process multiple times, an ensemble of base classifiers is obtained. When a test example is presented, each base classifier C_i is used to predict its class label. If the test example is predicted as class 0, then all the classes that belong to A_0 will receive a vote. Conversely, if it is predicted to be class 1, then all the classes that belong to A_1 will receive a vote. The votes are tallied and the class that receives the highest vote is assigned to the test example. An example of this approach is the error-correcting output coding method described on page 331.

4. By manipulating the learning algorithm. Many learning algorithms can be manipulated in such a way that applying the algorithm several times on the same training data will result in the construction of different classifiers. For example, an artificial neural network can change its network topology or the initial weights of the links between neurons. Similarly, an ensemble of decision trees can be constructed by injecting randomness into the tree-growing procedure. For example, instead of choosing the best splitting attribute at each node, we can randomly choose one of the top k attributes for splitting.

The first three approaches are generic methods that are applicable to any classifier, whereas the fourth approach depends on the type of classifier used. The base classifiers for most of these approaches can be generated sequentially (one after another) or in parallel (all at once). Once an ensemble of classifiers has been learned, a test example x is classified by combining the predictions made by the base classifiers C_i(x):


C*(x) = f(C_1(x), C_2(x), …, C_k(x)),

where f is the function that combines the ensemble responses. One simple approach for obtaining C*(x) is to take a majority vote of the individual predictions. An alternate approach is to take a weighted majority vote, where the weight of a base classifier denotes its accuracy or relevance.
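As a concrete illustration of the combination function f, the following Python sketch (helper names are our own) implements both a simple majority vote and a weighted majority vote over the base classifier predictions for a single test instance.

from collections import Counter

def majority_vote(predictions):
    # predictions: list of class labels returned by C_1(x), ..., C_k(x)
    return Counter(predictions).most_common(1)[0][0]

def weighted_majority_vote(predictions, weights):
    # weights: accuracy or relevance of each base classifier
    scores = {}
    for label, w in zip(predictions, weights):
        scores[label] = scores.get(label, 0.0) + w
    return max(scores, key=scores.get)

print(majority_vote([+1, -1, +1]))                            # +1
print(weighted_majority_vote([+1, -1, -1], [0.9, 0.4, 0.4]))  # +1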

Ensemble methods show the most improvement when used with unstable classifiers, i.e., base classifiers that are sensitive to minor perturbations in the training set, because of high model complexity. Although unstable classifiers may have a low bias in finding the optimal decision boundary, their predictions have a high variance for minor changes in the training set or model selection. This trade-off between bias and variance is discussed in detail in the next section. By aggregating the responses of multiple unstable classifiers, ensemble learning attempts to minimize their variance without worsening their bias.

4.10.3 Bias-Variance Decomposition

Bias-variance decomposition is a formal method for analyzing the generalization error of a predictive model. Although the analysis is slightly different for classification than regression, we first discuss the basic intuition of this decomposition by using an analogue of a regression problem.

Consider the illustrative task of reaching a target y by firing projectiles from a starting position, as shown in Figure 4.44. The target corresponds to the desired output at a test instance, while the starting position corresponds to its observed attributes. In this analogy, the projectile represents the model used for predicting the target using the observed attributes. Let ŷ denote the point where the projectile hits the ground, which is analogous to the prediction of the model.


Figure 4.44. Bias-variance decomposition.

Ideally, we would like our predictions to be as close to the true target as possible. However, note that different trajectories of projectiles are possible based on differences in the training data or in the approach used for model selection. Hence, we can observe a variance in the predictions ŷ over different runs of the projectile. Further, the target in our example is not fixed but has some freedom to move around, resulting in a noise component in the true target. This can be understood as the non-deterministic nature of the output variable, where the same set of attributes can have different output values. Let ŷ_avg represent the average prediction of the projectile over multiple runs, and y_avg denote the average target value. The difference between ŷ_avg and y_avg is known as the bias of the model.

In the context of classification, it can be shown that the generalization error of a classification model m can be decomposed into terms involving the bias, variance, and noise components of the model in the following way:

gen.error(m) = c_1 \times noise + bias(m) + c_2 \times variance(m),

where c_1 and c_2 are constants that depend on the characteristics of training and test sets. Note that while the noise term is intrinsic to the target class, the bias and variance terms depend on the choice of the classification model. The bias of a model represents how close the average prediction of the model is to the average target. Models that are able to learn complex decision boundaries, e.g., models produced by k-nearest neighbor and multi-layer ANN, generally show low bias. The variance of a model captures the stability of its predictions in response to minor perturbations in the training set or the model selection approach.

We can say that a model shows better generalization performance if it has a lower bias and lower variance. However, if the complexity of a model is high but the training size is small, we generally expect to see a lower bias but higher variance, resulting in the phenomenon of overfitting. This phenomenon is pictorially represented in Figure 4.45(a). On the other hand, an overly simplistic model that suffers from underfitting may show a lower variance but would suffer from a high bias, as shown in Figure 4.45(b). Hence, the trade-off between bias and variance provides a useful way for interpreting the effects of underfitting and overfitting on the generalization performance of a model.

Figure 4.45. Plots showing the behavior of two-dimensional solutions with constant L2 and L1 norms.

The bias-variance trade-off can be used to explain why ensemble learning improves the generalization performance of unstable classifiers. If a base classifier shows low bias but high variance, it can become susceptible to overfitting, as even a small change in the training set will result in different predictions. However, by combining the responses of multiple base classifiers, we can expect to reduce the overall variance. Hence, ensemble learning methods show better performance primarily by lowering the variance in the predictions, although they can even help in reducing the bias. One of the simplest approaches for combining predictions and reducing their variance is to compute their average. This forms the basis of the bagging method, described in the following subsection.

4.10.4 Bagging

Bagging, which is also known as bootstrap aggregating, is a technique that repeatedly samples (with replacement) from a data set according to a uniform probability distribution. Each bootstrap sample has the same size as the original data. Because the sampling is done with replacement, some instances may appear several times in the same training set, while others may be omitted from the training set. On average, a bootstrap sample D_i contains approximately 63% of the original training data because each instance has a probability 1 − (1 − 1/N)^N of being selected in each D_i. If N is sufficiently large, this probability converges to 1 − 1/e ≈ 0.632. The basic procedure for bagging is summarized in Algorithm 4.5. After training the k classifiers, a test instance is assigned to the class that receives the highest number of votes.

To illustrate how bagging works, consider the data set shown in Table 4.4. Let x denote a one-dimensional attribute and y denote the class label. Suppose we use only one-level binary decision trees, with a test condition x ≤ k, where k is a split point chosen to minimize the entropy of the leaf nodes. Such a tree is also known as a decision stump.

Table 4.4. Example of data set used to construct an ensemble of bagging classifiers.

x   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0
y    1     1     1    −1    −1    −1    −1     1     1     1

Without bagging, the best decision stump we can produce splits the instances at either x ≤ 0.35 or x ≤ 0.75. Either way, the accuracy of the tree is at most 70%. Suppose we apply the bagging procedure on the data set using 10 bootstrap samples. The examples chosen for training in each bagging round are shown


in Figure 4.46. On the right-hand side of each table, we also describe the decision stump being used in each round.

We classify the entire data set given in Table 4.4 by taking a majority vote among the predictions made by each base classifier. The results of the predictions are shown in Figure 4.47. Since the class labels are either −1 or +1, taking the majority vote is equivalent to summing up the predicted values of y and examining the sign of the resulting sum (refer to the second to last row in Figure 4.47). Notice that the ensemble classifier perfectly classifies all 10 examples in the original data.

Algorithm 4.5 Bagging algorithm.
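A minimal sketch of the bagging procedure of Algorithm 4.5, using one-level decision stumps as the base classifiers. The stump implementation below is our own simplified illustration: it chooses the split and leaf labels that minimize training error rather than entropy, and labels are assumed to be −1/+1 as in Table 4.4.

import random

def fit_stump(X, y):
    # Try every candidate split point and both leaf orientations;
    # keep the stump with the lowest training error.
    best = None
    for k in sorted(set(X)):
        for left, right in [(+1, -1), (-1, +1)]:
            preds = [left if x <= k else right for x in X]
            err = sum(p != t for p, t in zip(preds, y))
            if best is None or err < best[0]:
                best = (err, k, left, right)
    return best[1:]                      # (split point, left label, right label)

def bagging(X, y, rounds=10):
    n, stumps = len(X), []
    for _ in range(rounds):
        idx = [random.randrange(n) for _ in range(n)]        # bootstrap sample
        stumps.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return stumps

def predict(stumps, x):
    votes = sum(left if x <= k else right for k, left, right in stumps)
    return +1 if votes > 0 else -1       # sign of the summed predictions

X = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
y = [1, 1, 1, -1, -1, -1, -1, 1, 1, 1]
stumps = bagging(X, y)
print([predict(stumps, x) for x in X])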


Figure 4.46. Example of bagging.

The preceding example illustrates another advantage of using ensemble methods in terms of enhancing the representation of the target function. Even though each base classifier is a decision stump, combining the classifiers can lead to a decision boundary that mimics a decision tree of depth 2.

Bagging improves generalization error by reducing the variance of the base classifiers. The performance of bagging depends on the stability of the base classifier. If a base classifier is unstable, bagging helps to reduce the errors associated with random fluctuations in the training data. If a base classifier is stable, i.e., robust to minor perturbations in the training set, then the error of the ensemble is primarily caused by bias in the base classifier. In this situation, bagging may not be able to improve the performance of the base classifiers significantly. It may even degrade the classifier's performance because the effective size of each training set is about 37% smaller than the original data.

Figure 4.47. Example of combining classifiers constructed using the bagging approach.

4.10.5 Boosting

Boosting is an iterative procedure used to adaptively change the distribution of training examples for learning base classifiers so that they increasingly focus on examples that are hard to classify. Unlike bagging, boosting assigns a weight to each training example and may adaptively change the weight at the end of each boosting round. The weights assigned to the training examples can be used in the following ways:

1. They can be used to inform the sampling distribution used to draw a set of bootstrap samples from the original data.

2. They can be used to learn a model that is biased toward examples with higher weight.

This section describes an algorithm that uses weights of examples to determine the sampling distribution of its training set. Initially, the examples are assigned equal weights, 1/N, so that they are equally likely to be chosen for training. A sample is drawn according to the sampling distribution of the training examples to obtain a new training set. Next, a classifier is built from the training set and used to classify all the examples in the original data. The weights of the training examples are updated at the end of each boosting round. Examples that are classified incorrectly will have their weights increased, while those that are classified correctly will have their weights decreased. This forces the classifier to focus on examples that are difficult to classify in subsequent iterations.

The following table shows the examples chosen during each boosting round, when applied to the data shown in Table 4.4.

Boosting (Round 1):  7  3  2  8  7  9  4  10  6  3
Boosting (Round 2):  5  4  9  4  2  5  1   7  4  2
Boosting (Round 3):  4  4  8  10 4  5  4   6  3  4

Initially, all the examples are assigned the same weights. However, some examples may be chosen more than once, e.g., examples 3 and 7, because the sampling is done with replacement. A classifier built from the data is then used to classify all the examples. Suppose example 4 is difficult to classify. The weight for this example will be increased in future iterations as it gets misclassified repeatedly. Meanwhile, examples that were not chosen in the previous round, e.g., examples 1 and 5, also have a better chance of being selected in the next round since their predictions in the previous round were likely to be wrong. As the boosting rounds proceed, examples that are the hardest to classify tend to become even more prevalent. The final ensemble is obtained by aggregating the base classifiers obtained from each boosting round.

Over the years, several implementations of the boosting algorithm have been developed. These algorithms differ in terms of (1) how the weights of the training examples are updated at the end of each boosting round, and (2) how the predictions made by each classifier are combined. An implementation called AdaBoost is explored in the next section.

AdaBoost Let {(x_j, y_j) | j = 1, 2, …, N} denote a set of N training examples. In the AdaBoost algorithm, the importance of a base classifier C_i depends on its error rate, which is defined as

\epsilon_i = \frac{1}{N}\left[\sum_{j=1}^{N} w_j \, I\big(C_i(x_j) \neq y_j\big)\right],   (4.102)

where I(p) = 1 if the predicate p is true, and 0 otherwise. The importance of a classifier C_i is given by the following parameter,

\alpha_i = \frac{1}{2} \ln\left(\frac{1 - \epsilon_i}{\epsilon_i}\right).

Note that α_i has a large positive value if the error rate is close to 0 and a large negative value if the error rate is close to 1, as shown in Figure 4.48.

Figure 4.48. Plot of α as a function of training error ε.

The α_i parameter is also used to update the weight of the training examples. To illustrate, let w_i^{(j)} denote the weight assigned to example (x_i, y_i) during the jth boosting round. The weight update mechanism for AdaBoost is given by the equation:

w_i^{(j+1)} = \frac{w_i^{(j)}}{Z_j} \times \begin{cases} e^{-\alpha_j} & \text{if } C_j(x_i) = y_i, \\ e^{\alpha_j} & \text{if } C_j(x_i) \neq y_i, \end{cases}   (4.103)

where Z_j is the normalization factor used to ensure that \sum_i w_i^{(j+1)} = 1. The weight update formula given in Equation 4.103 increases the weights of incorrectly classified examples and decreases the weights of those classified correctly.

Instead of using a majority voting scheme, the prediction made by each classifier C_j is weighted according to α_j. This approach allows AdaBoost to penalize models that have poor accuracy, e.g., those generated at the earlier boosting rounds. In addition, if any intermediate rounds produce an error rate higher than 50%, the weights are reverted back to their original uniform values, w_i = 1/N, and the resampling procedure is repeated. The AdaBoost algorithm is summarized in Algorithm 4.6.

Algorithm 4.6 AdaBoost algorithm.
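A compact sketch of the AdaBoost training loop summarized in Algorithm 4.6, reusing the fit_stump helper from the bagging sketch above. The resampling-by-weights step and the numerical safeguards are our own simplified illustration; labels are assumed to be −1/+1.

import math, random

def adaboost(X, y, rounds=3):
    n = len(X)
    w = [1.0 / n] * n                                   # equal initial weights
    models = []
    for _ in range(rounds):
        idx = random.choices(range(n), weights=w, k=n)  # sample by weight
        k, left, right = fit_stump([X[i] for i in idx], [y[i] for i in idx])
        preds = [left if x <= k else right for x in X]
        # Weighted training error used to compute alpha (cf. Equation 4.102).
        eps = sum(wi for wi, p, t in zip(w, preds, y) if p != t)
        if eps > 0.5:                                   # revert to uniform weights
            w = [1.0 / n] * n
            continue
        alpha = 0.5 * math.log((1 - eps) / max(eps, 1e-10))
        # Weight update of Equation 4.103, followed by normalization by Z_j.
        w = [wi * math.exp(-alpha if p == t else alpha)
             for wi, p, t in zip(w, preds, y)]
        z = sum(w)
        w = [wi / z for wi in w]
        models.append((alpha, k, left, right))
    return models

def adaboost_predict(models, x):
    score = sum(a * (left if x <= k else right) for a, k, left, right in models)
    return +1 if score > 0 else -1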

Let us examine how the boosting approach works on the data set shown in Table 4.4. Initially, all the examples have identical weights. After three boosting rounds, the examples chosen for training are shown in Figure 4.49(a). The weights for each example are updated at the end of each boosting round using Equation 4.103, as shown in Figure 4.49(b).

Without boosting, the accuracy of the decision stump is, at best, 70%. With AdaBoost, the results of the predictions are given in Figure 4.50(b). The final prediction of the ensemble classifier is obtained by taking a weighted average of the predictions made by each base classifier, which is shown in the last row of Figure 4.50(b). Notice that AdaBoost perfectly classifies all the examples in the training data.

Figure 4.49. Example of boosting.

An important analytical result of boosting shows that the training error of the ensemble is bounded by the following expression:

e_{ensemble} \leq \prod_i \sqrt{\epsilon_i (1 - \epsilon_i)},   (4.104)

where ε_i is the error rate of each base classifier i. If the error rate of the base classifier is less than 50%, we can write ε_i = 0.5 − γ_i, where γ_i measures how much better the classifier is than random guessing. The bound on the training error of the ensemble becomes

e_{ensemble} \leq \prod_i \sqrt{1 - 4\gamma_i^2} \leq \exp\left(-2 \sum_i \gamma_i^2\right).   (4.105)

Hence, the training error of the ensemble decreases exponentially, which leads to the fast convergence of the algorithm. By focusing on examples that are difficult to classify by base classifiers, it is able to reduce the bias of the final predictions along with the variance. AdaBoost has been shown to provide significant improvements in performance over base classifiers on a range of data sets. Nevertheless, because of its tendency to focus on training examples that are wrongly classified, the boosting technique can be susceptible to overfitting, resulting in poor generalization performance in some scenarios.

Figure 4.50. Example of combining classifiers constructed using the AdaBoost approach.

4.10.6 Random Forests

Random forests attempt to improve the generalization performance by constructing an ensemble of decorrelated decision trees. Random forests build on the idea of bagging to use a different bootstrap sample of the training data for learning decision trees. However, a key distinguishing feature of random forests from bagging is that at every internal node of a tree, the best splitting criterion is chosen among a small set of randomly selected attributes. In this way, random forests construct ensembles of decision trees by not only manipulating training instances (by using bootstrap samples similar to bagging), but also the input attributes (by using different subsets of attributes at every internal node).

Given a training set D consisting of n instances and d attributes, the basic procedure of training a random forest classifier can be summarized using the following steps:

1. Construct a bootstrap sample D_i of the training set by randomly sampling n instances (with replacement) from D.

2. Use D_i to learn a decision tree T_i as follows. At every internal node of T_i, randomly sample a set of p attributes and choose an attribute from this subset that shows the maximum reduction in an impurity measure for splitting. Repeat this procedure till every leaf is pure, i.e., contains instances from the same class.

Once an ensemble of decision trees has been constructed, their average prediction (majority vote) on a test instance is used as the final prediction of the random forest. Note that the decision trees involved in a random forest are unpruned trees, as they are allowed to grow to their largest possible size till every leaf is pure. Hence, the base classifiers of a random forest represent unstable classifiers that have low bias but high variance, because of their large size.
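A sketch of the two training steps above. The use of scikit-learn's DecisionTreeClassifier as the base learner, and the helper names, are illustrative assumptions on our part; the max_features parameter plays the role of p, the random subset of attributes considered at every internal node.

import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def train_random_forest(X, y, n_trees=100, p=None):
    n, d = X.shape
    p = p or max(1, int(np.sqrt(d)))        # attributes sampled at each node
    rng = np.random.default_rng(0)
    forest = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)    # step 1: bootstrap sample D_i
        # step 2: unpruned tree, random subset of p attributes per split
        tree = DecisionTreeClassifier(max_features=p)
        tree.fit(X[idx], y[idx])
        forest.append(tree)
    return forest

def forest_predict(forest, X):
    votes = np.stack([t.predict(X) for t in forest])   # (n_trees, n_test)
    # Majority vote across the trees for every test instance.
    return np.array([Counter(votes[:, j]).most_common(1)[0][0]
                     for j in range(votes.shape[1])])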

Another property of the base classifiers learned in random forests is the lack of correlation among their model parameters and test predictions. This can be attributed to the use of an independently sampled data set D_i for learning every decision tree T_i, similar to the bagging approach. However, random forests have the additional advantage of choosing a splitting criterion at every internal node using a different (and randomly selected) subset of attributes. This property significantly helps in breaking the correlation structure, if any, among the decision trees T_i.

To realize this, consider a training set involving a large number of attributes, where only a small subset of attributes are strong predictors of the target class, whereas the other attributes are weak indicators. Given such a training set, even if we consider different bootstrap samples D_i for learning T_i, we would mostly be choosing the same attributes for splitting at internal nodes, because the weak attributes would be largely overlooked when compared with the strong predictors. This can result in a considerable correlation among the trees. However, if we restrict the choice of attributes at every internal node to a random subset of attributes, we can ensure the selection of both strong and weak predictors, thus promoting diversity among the trees. This principle is utilized by random forests for creating decorrelated decision trees.

By aggregating the predictions of an ensemble of strong and decorrelated decision trees, random forests are able to reduce the variance of the trees without negatively impacting their low bias. This makes random forests quite robust to overfitting. Additionally, because of their ability to consider only a small subset of attributes at every internal node, random forests are computationally fast and robust even in high-dimensional settings.


Thenumberofattributestobeselectedateverynode,p,isahyper-parameteroftherandomforestclassifier.Asmallvalueofpcanreducethecorrelationamongtheclassifiersbutmayalsoreducetheirstrength.Alargevaluecanimprovetheirstrengthbutmayresultincorrelatedtreessimilartobagging.Althoughcommonsuggestionsforpintheliteratureinclude and ,asuitablevalueofpforagiventrainingsetcanalwaysbeselectedbytuningitoveravalidationset,asdescribedinthepreviouschapter.However,thereisanalternativewayforselectinghyper-parametersinrandomforests,whichdoesnotrequireusingaseparatevalidationset.Itinvolvescomputingareliableestimateofthegeneralizationerrorratedirectlyduringtraining,knownastheout-of-bag(oob)errorestimate.Theoobestimatecanbecomputedforanygenericensemblelearningmethodthatbuildsindependentbaseclassifiersusingbootstrapsamplesofthetrainingset,e.g.,baggingandrandomforests.Theapproachforcomputingoobestimatecanbedescribedasfollows.

Consideranensemblelearningmethodthatusesanindependentbaseclassifier builtonabootstrapsampleofthetrainingset .Sinceeverytraininginstance willbeusedfortrainingapproximately63%ofbaseclassifiers,wecancall asanout-of-bagsamplefortheremaining27%ofbaseclassifiersthatdidnotuseitfortraining.Ifweusetheseremaining27%classifierstomakepredictionson ,wecanobtaintheooberroron bytakingtheirmajorityvoteandcomparingitwithitsclasslabel.Notethattheooberrorestimatestheerrorof27%classifiersonaninstancethatwasnotusedfortrainingthoseclassifiers.Hence,theooberrorcanbeconsideredasareliableestimateofgeneralizationerror.Bytakingtheaverageofooberrorsofalltraininginstances,wecancomputetheoverallooberrorestimate.Thiscanbeusedasanalternativetothevalidationerrorrateforselectinghyper-parameters.Hence,randomforestsdonotneedtouseaseparatepartitionofthetrainingsetforvalidation,asitcansimultaneouslytrainthebaseclassifiersandcomputegeneralizationerrorestimatesonthesamedataset.
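The oob estimate described above can be computed alongside training by remembering which instances each bootstrap sample left out. A sketch, continuing the illustrative scikit-learn setup from the previous block:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def oob_error(X, y, n_trees=100, p=None):
    n, d = X.shape
    p = p or max(1, int(np.sqrt(d)))
    rng = np.random.default_rng(0)
    votes = [dict() for _ in range(n)]          # per-instance votes from oob trees
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)                    # bootstrap sample D_i
        oob = np.setdiff1d(np.arange(n), idx)               # instances left out of D_i
        if len(oob) == 0:
            continue
        tree = DecisionTreeClassifier(max_features=p).fit(X[idx], y[idx])
        for i, pred in zip(oob, tree.predict(X[oob])):
            votes[i][pred] = votes[i].get(pred, 0) + 1
    # Majority vote of the oob classifiers, compared with the true label.
    errors = [max(v, key=v.get) != y[i] for i, v in enumerate(votes) if v]
    return float(np.mean(errors))               # overall oob error estimate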


Random forests have been empirically found to provide significant improvements in generalization performance that are often comparable, if not superior, to the improvements provided by the AdaBoost algorithm. Random forests are also more robust to overfitting and run much faster than the AdaBoost algorithm.

4.10.7 Empirical Comparison among Ensemble Methods

Table 4.5 shows the empirical results obtained when comparing the performance of a decision tree classifier against bagging, boosting, and random forest. The base classifiers used in each ensemble method consist of 50 decision trees. The classification accuracies reported in this table are obtained from tenfold cross-validation. Notice that the ensemble classifiers generally outperform a single decision tree classifier on many of the data sets.

Table 4.5. Comparing the accuracy of a decision tree classifier against three ensemble methods.

Data Set   (Number of Attributes, Classes, Instances)   Decision Tree (%)   Bagging (%)   Boosting (%)   RF (%)

Anneal (39,6,898) 92.09 94.43 95.43 95.43

Australia (15,2,690) 85.51 87.10 85.22 85.80

Auto (26,7,205) 81.95 85.37 85.37 84.39

Breast (11,2,699) 95.14 96.42 97.28 96.14

Cleve (14,2,303) 76.24 81.52 82.18 82.18

Credit (16,2,690) 85.8 86.23 86.09 85.8

Diabetes (9,2,768) 72.40 76.30 73.18 75.13

German (21,2,1000) 70.90 73.40 73.00 74.5

Glass (10,7,214) 67.29 76.17 77.57 78.04

Heart (14,2,270) 80.00 81.48 80.74 83.33

Hepatitis (20,2,155) 81.94 81.29 83.87 83.23

Horse (23,2,368) 85.33 85.87 81.25 85.33

Ionosphere (35,2,351) 89.17 92.02 93.73 93.45

Iris (5,3,150) 94.67 94.67 94.00 93.33

Labor (17,2,57) 78.95 84.21 89.47 84.21

Led7 (8,10,3200) 73.34 73.66 73.34 73.06

Lymphography (19,4,148) 77.03 79.05 85.14 82.43

Pima (9,2,768) 74.35 76.69 73.44 77.60

Sonar (61,2,208) 78.85 78.85 84.62 85.58

Tic-tac-toe (10,2,958) 83.72 93.84 98.54 95.82

Vehicle (19,4,846) 71.04 74.11 78.25 74.94

Waveform (22,3,5000) 76.44 83.30 83.90 84.04

Wine (14,3,178) 94.38 96.07 97.75 97.75

Zoo (17,7,101) 93.07 93.07 95.05 97.03

4.11 Class Imbalance Problem

In many data sets there are a disproportionate number of instances that belong to different classes, a property known as skew or class imbalance. For example, consider a health-care application where diagnostic reports are used to decide whether a person has a rare disease. Because of the infrequent nature of the disease, we can expect to observe a smaller number of subjects who are positively diagnosed. Similarly, in credit card fraud detection, fraudulent transactions are greatly outnumbered by legitimate transactions.

The degree of imbalance between the classes varies across different applications and even across different data sets from the same application. For example, the risk for a rare disease may vary across different populations of subjects depending on their dietary and lifestyle choices. However, despite their infrequent occurrences, a correct classification of the rare class often has greater value than a correct classification of the majority class. For example, it may be more dangerous to ignore a patient suffering from a disease than to misdiagnose a healthy person.

More generally, class imbalance poses two challenges for classification. First, it can be difficult to find sufficiently many labeled samples of a rare class. Note that many of the classification methods discussed so far work well only when the training set has a balanced representation of both classes. Although some classifiers are more effective at handling imbalance in the training data than others, e.g., rule-based classifiers and k-NN, they are all impacted if the minority class is not well-represented in the training set. In general, a classifier trained over an imbalanced data set shows a bias toward improving its performance over the majority class, which is often not the desired behavior.

As a result, many existing classification models, when trained on an imbalanced data set, may not effectively detect instances of the rare class.

Second, accuracy, which is the traditional measure for evaluating classification performance, is not well-suited for evaluating models in the presence of class imbalance in the test data. For example, if 1% of the credit card transactions are fraudulent, then a trivial model that predicts every transaction as legitimate will have an accuracy of 99% even though it fails to detect any of the fraudulent activities. Thus, there is a need to use alternative evaluation metrics that are sensitive to the skew and can capture different criteria of performance than accuracy.

In this section, we first present some of the generic methods for building classifiers when there is class imbalance in the training set. We then discuss methods for evaluating classification performance and adapting classification decisions in the presence of a skewed test set. In the remainder of this section, we will consider binary classification problems for simplicity, where the minority class is referred to as the positive (+) class while the majority class is referred to as the negative (−) class.

4.11.1 Building Classifiers with Class Imbalance

There are two primary considerations for building classifiers in the presence of class imbalance in the training set. First, we need to ensure that the learning algorithm is trained over a data set that has adequate representation of both the majority as well as the minority classes. Some common approaches for ensuring this include the methodologies of oversampling and undersampling the training set. Second, having learned a classification model, we need a way to adapt its classification decisions (and thus create an appropriately tuned classifier) to best match the requirements of the imbalanced test set. This is typically done by converting the outputs of the classification model to real-valued scores, and then selecting a suitable threshold on the classification score to match the needs of a test set. Both these considerations are discussed in detail in the following.

Oversampling and Undersampling The first step in learning with imbalanced data is to transform the training set to a balanced training set, where both classes have nearly equal representation. The balanced training set can then be used with any of the existing classification techniques (without making any modifications in the learning algorithm) to learn a model that gives equal emphasis to both classes. In the following, we present some of the common techniques for transforming an imbalanced training set to a balanced one.

A basic approach for creating balanced training sets is to generate a sample of training instances where the rare class has adequate representation. There are two types of sampling methods that can be used to enhance the representation of the minority class: (a) undersampling, where the frequency of the majority class is reduced to match the frequency of the minority class, and (b) oversampling, where artificial examples of the minority class are created to make them equal in proportion to the number of negative instances.

To illustrate undersampling, consider a training set that contains 100 positive examples and 1000 negative examples. To overcome the skew among the classes, we can select a random sample of 100 examples from the negative class and use them with the 100 positive examples to create a balanced training set. A classifier built over the resultant balanced set will then be unbiased toward both classes. However, one limitation of undersampling is that some of the useful negative examples (e.g., those closer to the actual decision boundary) may not be chosen for training, thereby resulting in an inferior classification model. Another limitation is that the smaller sample of 100 negative instances may have a higher variance than the larger set of 1000.

Oversampling attempts to create a balanced training set by artificially generating new positive examples. A simple approach for oversampling is to duplicate every positive instance n−/n+ times, where n+ and n− are the numbers of positive and negative training instances, respectively. Figure 4.51 illustrates the effect of oversampling on the learning of a decision boundary using a classifier such as a decision tree. Without oversampling, only the positive examples at the bottom right-hand side of Figure 4.51(a) are classified correctly. The positive example in the middle of the diagram is misclassified because there are not enough examples to justify the creation of a new decision boundary to separate the positive and negative instances. Oversampling provides the additional examples needed to ensure that the decision boundary surrounding the positive example is not pruned, as illustrated in Figure 4.51(b). Note that duplicating a positive instance is analogous to doubling its weight during the training stage. Hence, the effect of oversampling can be alternatively achieved by assigning higher weights to positive instances than negative instances. This method of weighting instances can be used with a number of classifiers such as logistic regression, ANN, and SVM.

Figure 4.51. Illustrating the effect of oversampling of the rare class.

One limitation of the duplication method for oversampling is that the replicated positive examples have an artificially lower variance when compared with their true distribution in the overall data. This can bias the classifier to the specific distribution of training instances, which may not be representative of the overall distribution of test instances, leading to poor generalizability. To overcome this limitation, an alternative approach for oversampling is to generate synthetic positive instances in the neighborhood of existing positive instances. In this approach, called the Synthetic Minority Oversampling Technique (SMOTE), we first determine the k-nearest positive neighbors of every positive instance x, and then generate a synthetic positive instance at some intermediate point along the line segment joining x to one of its randomly chosen k-nearest neighbors, x_k. This process is repeated until the desired number of positive instances is reached. However, one limitation of this approach is that it can only generate new positive instances in the convex hull of the existing positive class. Hence, it does not help improve the representation of the positive class outside the boundary of existing positive instances. Despite their complementary strengths and weaknesses, undersampling and oversampling provide useful directions for generating balanced training sets in the presence of class imbalance.
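A minimal sketch of the SMOTE-style interpolation described above. This is our own simplified illustration (it assumes there are more than k positive instances and uses Euclidean distance); production implementations, such as the one in the imbalanced-learn package, handle many more details.

import numpy as np

def smote(X_pos, n_synthetic, k=5, rng=None):
    # X_pos: array of positive (minority-class) instances, shape (n_pos, d).
    rng = rng or np.random.default_rng(0)
    n_pos = len(X_pos)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(n_pos)
        # k nearest positive neighbors of the chosen instance (excluding itself).
        dist = np.linalg.norm(X_pos - X_pos[i], axis=1)
        neighbors = np.argsort(dist)[1:k + 1]
        j = rng.choice(neighbors)
        # New instance at a random intermediate point on the segment x -- x_k.
        lam = rng.random()
        synthetic.append(X_pos[i] + lam * (X_pos[j] - X_pos[i]))
    return np.array(synthetic)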

Assigning Scores to Test Instances If a classifier returns an ordinal score s(x) for every test instance x such that a higher score denotes a greater likelihood of x belonging to the positive class, then for every possible value of the score threshold, s_T, we can create a new binary classifier where a test instance x is classified as positive only if s(x) > s_T. Thus, every choice of s_T can potentially lead to a different classifier, and we are interested in finding the classifier that is best suited for our needs.

Ideally, we would like the classification score to vary monotonically with the actual posterior probability of the positive class, i.e., if s(x_1) and s(x_2) are the scores of any two instances, x_1 and x_2, then s(x_1) ≥ s(x_2) ⇒ P(y = 1|x_1) ≥ P(y = 1|x_2). However, this is difficult to guarantee in practice as the properties of the classification score depend on several factors such as the complexity of the classification algorithm and the representative power of the training set. In general, we can only expect the classification score of a reasonable algorithm to be weakly related to the actual posterior probability of the positive class, even though the relationship may not be strictly monotonic. Most classifiers can be easily modified to produce such a real-valued score. For example, the signed distance of an instance from the positive margin hyperplane of an SVM can be used as a classification score. As another example, test instances belonging to a leaf in a decision tree can be assigned a score based on the fraction of training instances labeled as positive in the leaf. Also, probabilistic classifiers such as naïve Bayes, Bayesian networks, and logistic regression naturally output estimates of posterior probabilities, P(y = 1|x). Next, we discuss some evaluation measures for assessing the goodness of a classifier in the presence of class imbalance.

Table 4.6. A confusion matrix for a binary classification problem in which the classes are not equally important.

                          Predicted Class
                          +              −
Actual Class    +     f++ (TP)       f+− (FN)
                −     f−+ (FP)       f−− (TN)

4.11.2 Evaluating Performance with Class Imbalance

The most basic approach for representing a classifier's performance on a test set is to use a confusion matrix, as shown in Table 4.6. This table is essentially the same as Table 3.4, which was introduced in the context of evaluating classification performance in Section 3.2. A confusion matrix summarizes the number of instances predicted correctly or incorrectly by a classifier using the following four counts:

True positive (TP) or f++, which corresponds to the number of positive examples correctly predicted by the classifier.

False positive (FP) or f−+ (also known as Type I error), which corresponds to the number of negative examples wrongly predicted as positive by the classifier.


False negative (FN) or f+− (also known as Type II error), which corresponds to the number of positive examples wrongly predicted as negative by the classifier.

True negative (TN) or f−−, which corresponds to the number of negative examples correctly predicted by the classifier.

The confusion matrix provides a concise representation of classification performance on a given test data set. However, it is often difficult to interpret and compare the performance of classifiers using the four-dimensional representations (corresponding to the four counts) provided by their confusion matrices. Hence, the counts in the confusion matrix are often summarized using a number of evaluation measures. Accuracy is an example of one such measure that combines these four counts into a single value, which is used extensively when classes are balanced. However, the accuracy measure is not suitable for handling data sets with imbalanced class distributions as it tends to favor classifiers that correctly classify the majority class. In the following, we describe other possible measures that capture different criteria of performance when working with imbalanced classes.

A basic evaluation measure is the true positive rate (TPR), which is defined as the fraction of positive test instances correctly predicted by the classifier:

TPR = \frac{TP}{TP + FN}.

In the medical community, TPR is also known as sensitivity, while in the information retrieval literature, it is also called recall (r). A classifier with a high TPR has a high chance of correctly identifying the positive instances of the data.

Analogously to TPR, the true negative rate (TNR) (also known as specificity) is defined as the fraction of negative test instances correctly predicted by the classifier, i.e.,

TNR = \frac{TN}{FP + TN}.

A high TNR value signifies that the classifier correctly classifies any randomly chosen negative instance in the test set. A commonly used evaluation measure that is closely related to TNR is the false positive rate (FPR), which is defined as 1 − TNR:

FPR = \frac{FP}{FP + TN}.

Similarly, we can define the false negative rate (FNR) as 1 − TPR:

FNR = \frac{FN}{FN + TP}.

Note that the evaluation measures defined above do not take into account the skew among the classes, which can be formally defined as α = P/(P + N), where P and N denote the number of actual positives and actual negatives, respectively. As a result, changing the relative numbers of P and N will have no effect on TPR, TNR, FPR, or FNR, since they depend only on the fraction of correct classifications for every class, independently of the other class. Furthermore, knowing the values of TPR and TNR (and consequently FNR and FPR) does not by itself help us uniquely determine all four entries of the confusion matrix. However, together with information about the skew factor, α, and the total number of instances, N, we can compute the entire confusion matrix using TPR and TNR, as shown in Table 4.7.

Table 4.7. Entries of the confusion matrix in terms of the TPR, TNR, skew, α, and total number of instances, N.

                 Predicted +                     Predicted −                  Total
Actual +      TPR × α × N                   (1 − TPR) × α × N               α × N
Actual −      (1 − TNR) × (1 − α) × N       TNR × (1 − α) × N               (1 − α) × N
Total                                                                        N
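The rates defined above follow directly from the four confusion-matrix counts. A small sketch (the helper name is our own) that computes TPR, TNR, FPR, FNR, and the skew α:

def basic_rates(tp, fn, fp, tn):
    p, n = tp + fn, fp + tn            # actual positives and actual negatives
    return {
        "TPR (recall, sensitivity)": tp / p,
        "TNR (specificity)": tn / n,
        "FPR": fp / n,                 # equals 1 - TNR
        "FNR": fn / p,                 # equals 1 - TPR
        "skew alpha = P/(P+N)": p / (p + n),
    }

print(basic_rates(tp=100, fn=0, fp=100, tn=900))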

An evaluation measure that is sensitive to the skew is precision, which can be defined as the fraction of correct predictions of the positive class over the total number of positive predictions, i.e.,

Precision, p = \frac{TP}{TP + FP}.

Precision is also referred to as the positive predicted value (PPV). A classifier that has a high precision is likely to have most of its positive predictions correct. Precision is a useful measure for highly skewed test sets where the positive predictions, even though small in numbers, are required to be mostly correct. A measure that is closely related to precision is the false discovery rate (FDR), which can be defined as 1 − p:

FDR = \frac{FP}{TP + FP}.

Although both FDR and FPR focus on FP, they are designed to capture different evaluation objectives and thus can take quite contrasting values, especially in the presence of class imbalance. To illustrate this, consider a classifier with the following confusion matrix.

                       Predicted Class
                       +          −
Actual Class    +     100          0
                −     100        900

Since half of the positive predictions made by the classifier are incorrect, it has an FDR value of 100/(100 + 100) = 0.5. However, its FPR is equal to 100/(100 + 900) = 0.1, which is quite low. This example shows that in the presence of high skew (i.e., a very small value of α), even a small FPR can result in a high FDR. See Section 10.6 for further discussion of this issue.

Note that the evaluation measures defined above provide an incomplete representation of performance, because they either only capture the effect of false positives (e.g., FPR and precision) or the effect of false negatives (e.g., TPR or recall), but not both. Hence, if we optimize only one of these evaluation measures, we may end up with a classifier that shows low FN but high FP, or vice-versa. For example, a classifier that declares every instance to be positive will have a perfect recall, but high FPR and very poor precision. On the other hand, a classifier that is very conservative in classifying an instance as positive (to reduce FP) may end up having high precision but very poor recall. We thus need evaluation measures that account for both types of misclassifications, FP and FN. Some examples of such evaluation measures are summarized by the following definitions.

Positive Likelihood Ratio = \frac{TPR}{FPR}.
F_1 measure = \frac{2rp}{r + p} = \frac{2 \times TP}{2 \times TP + FP + FN}.
G measure = \sqrt{rp} = \frac{TP}{\sqrt{(TP + FP)(TP + FN)}}.

While some of these evaluation measures are invariant to the skew (e.g., the positive likelihood ratio), others (e.g., precision and the F_1 measure) are sensitive to skew. Further, different evaluation measures capture the effects of different types of misclassification errors in various ways. For example, the F_1 measure represents a harmonic mean between recall and precision, i.e.,

F_1 = \frac{2}{\frac{1}{r} + \frac{1}{p}}.

Because the harmonic mean of two numbers tends to be closer to the smaller of the two numbers, a high value of the F_1-measure ensures that both precision and recall are reasonably high. Similarly, the G measure represents the geometric mean between recall and precision. A comparison among harmonic, geometric, and arithmetic means is given in the next example.

Example 4.9. Consider two positive numbers a = 1 and b = 5. Their arithmetic mean is μ_a = (a + b)/2 = 3 and their geometric mean is μ_g = √(ab) = 2.236. Their harmonic mean is μ_h = (2 × 1 × 5)/6 = 1.667, which is closer to the smaller value between a and b than the arithmetic and geometric means.

A generic extension of the F_1 measure is the F_β measure, which can be defined as follows.

F_\beta = \frac{(\beta^2 + 1) r p}{r + \beta^2 p} = \frac{(\beta^2 + 1) \times TP}{(\beta^2 + 1) TP + \beta^2 FN + FP}.   (4.106)

Both precision and recall can be viewed as special cases of F_β by setting β = 0 and β = ∞, respectively. Low values of β make F_β closer to precision, and high values make it closer to recall.

A more general measure that captures F_β as well as accuracy is the weighted accuracy measure, which is defined by the following equation:

Weighted accuracy = \frac{w_1 TP + w_4 TN}{w_1 TP + w_2 FN + w_3 FP + w_4 TN}.   (4.107)

The relationship between weighted accuracy and other performance measures is summarized in the following table:

Measure       w_1       w_2    w_3    w_4
Recall         1         1      0      0
Precision      1         0      1      0
F_β          β² + 1      β²     1      0
Accuracy       1         1      1      1
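The following sketch collects the composite measures defined above into small helpers (our own illustration); it follows the reconstructed forms of Equations 4.106 and 4.107 and the weight table above, and assumes all counts are positive.

import math

def composite_measures(tp, fn, fp, beta=1.0):
    r = tp / (tp + fn)                                        # recall (TPR)
    p = tp / (tp + fp)                                        # precision
    f_beta = (beta ** 2 + 1) * r * p / (r + beta ** 2 * p)    # Equation 4.106
    g = math.sqrt(r * p)                                      # geometric mean of r and p
    return {"recall": r, "precision": p, "F_beta": f_beta, "G": g}

def weighted_accuracy(tp, fn, fp, tn, w1, w2, w3, w4):
    # Equation 4.107: weights (1, 1, 0, 0) recover recall, (1, 0, 1, 0) precision,
    # and (1, 1, 1, 1) ordinary accuracy, matching the table above.
    return (w1 * tp + w4 * tn) / (w1 * tp + w2 * fn + w3 * fp + w4 * tn)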

4.11.3 Finding an Optimal Score Threshold

Given a suitably chosen evaluation measure E and a distribution of classification scores, s(x), on a validation set, we can obtain the optimal score threshold s* on the validation set using the following approach:

1. Sort the scores in increasing order of their values.

2. For every unique value of score, s, consider the classification model that assigns an instance x as positive only if s(x) > s. Let E(s) denote the performance of this model on the validation set.

3. Find s* that maximizes the evaluation measure E(s): s* = argmax_s E(s).

Note that s* can be treated as a hyper-parameter of the classification algorithm that is learned during model selection. Using s*, we can assign a positive label to a future test instance x only if s(x) > s*. If the evaluation measure E is skew invariant (e.g., the positive likelihood ratio), then we can select s* without knowing the skew of the test set, and the resultant classifier formed using s* can be expected to show optimal performance on the test set (with respect to the evaluation measure E). On the other hand, if E is sensitive to the skew (e.g., precision or the F_1-measure), then we need to ensure that the skew of the validation set used for selecting s* is similar to that of the test set, so that the classifier formed using s* shows optimal test performance with respect to E. Alternatively, given an estimate of the skew of the test data, α, we can use it along with the TPR and TNR on the validation set to estimate all entries of the confusion matrix (see Table 4.7), and thus the estimate of any evaluation measure E on the test set. The score threshold s* selected using this estimate of E can then be expected to produce optimal test performance with respect to E. Furthermore, the methodology of selecting s* on the validation set can help in comparing the test performance of different classification algorithms, by using the optimal values of s* for each algorithm.
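A sketch of the three-step threshold-selection procedure above, using the F_1 measure as an illustrative choice of E (helper names are our own; labels are assumed to be 1 for positive and 0 for negative).

def f1_at_threshold(scores, labels, s):
    tp = sum(1 for sc, y in zip(scores, labels) if sc > s and y == 1)
    fp = sum(1 for sc, y in zip(scores, labels) if sc > s and y == 0)
    fn = sum(1 for sc, y in zip(scores, labels) if sc <= s and y == 1)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def best_threshold(scores, labels, measure=f1_at_threshold):
    # Step 1: sort the unique scores; Steps 2-3: evaluate E(s) on the validation
    # set for every candidate threshold and keep the maximizer s*.
    candidates = sorted(set(scores))
    return max(candidates, key=lambda s: measure(scores, labels, s))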

4.11.4 Aggregate Evaluation of Performance

Although the above approach helps in finding a score threshold s* that provides optimal performance with respect to a desired evaluation measure and a certain amount of skew, α, sometimes we are interested in evaluating the performance of a classifier on a number of possible score thresholds, each corresponding to a different choice of evaluation measure and skew value. Assessing the performance of a classifier over a range of score thresholds is called aggregate evaluation of performance. In this style of analysis, the emphasis is not on evaluating the performance of a single classifier corresponding to the optimal score threshold, but to assess the general quality of ranking produced by the classification scores on the test set. In general, this helps in obtaining robust estimates of classification performance that are not sensitive to specific choices of score thresholds.


ROC Curve One of the widely-used tools for aggregate evaluation is the receiver operating characteristic (ROC) curve. An ROC curve is a graphical approach for displaying the trade-off between TPR and FPR of a classifier, over varying score thresholds. In an ROC curve, the TPR is plotted along the y-axis and the FPR is shown on the x-axis. Each point along the curve corresponds to a classification model generated by placing a threshold on the test scores produced by the classifier. The following procedure describes the generic approach for computing an ROC curve:

1. Sort the test instances in increasing order of their scores.

2. Select the lowest ranked test instance (i.e., the instance with the lowest score). Assign the selected instance and those ranked above it to the positive class. This approach is equivalent to classifying all the test instances as the positive class. Because all the positive examples are classified correctly and the negative examples are misclassified, TPR = FPR = 1.

3. Select the next test instance from the sorted list. Classify the selected instance and those ranked above it as positive, while those ranked below it as negative. Update the counts of TP and FP by examining the actual class label of the selected instance. If this instance belongs to the positive class, the TP count is decremented and the FP count remains the same as before. If the instance belongs to the negative class, the FP count is decremented and the TP count remains the same as before.

4. Repeat Step 3 and update the TP and FP counts accordingly until the highest ranked test instance is selected. At this final threshold, TPR = FPR = 0, as all instances are labeled as negative.

5. Plot the TPR against the FPR of the classifier.
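The procedure above can be written compactly by sweeping the sorted scores. The following sketch (our own illustration) returns the (FPR, TPR) points of the ROC curve, assuming binary labels with 1 for positive and 0 for negative.

def roc_points(scores, labels):
    # Start with every instance predicted positive (TPR = FPR = 1), then move the
    # threshold past one instance at a time in increasing order of score.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    P = sum(labels)
    N = len(labels) - P
    tp, fp = P, N
    points = [(1.0, 1.0)]
    for i in order:
        if labels[i] == 1:
            tp -= 1           # a positive instance is now predicted negative
        else:
            fp -= 1           # a negative instance is now predicted negative
        points.append((fp / N, tp / P))
    return points             # ends at (0, 0): everything predicted negative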

Example 4.10. [Generating ROC Curve] Figure 4.52 shows an example of how to compute the TPR and FPR values for every choice of score threshold. There are five positive examples and five negative examples in the test set. The class labels of the test instances are shown in the first row of the table, while the second row corresponds to the sorted score values for each instance. The next six rows contain the counts of TP, FP, TN, and FN, along with their corresponding TPR and FPR. The table is then filled from left to right. Initially, all the instances are predicted to be positive. Thus, TP = FP = 5 and TPR = FPR = 1. Next, we assign the test instance with the lowest score as the negative class. Because the selected instance is actually a positive example, the TP count decreases from 5 to 4 and the FP count is the same as before. The FPR and TPR are updated accordingly. This process is repeated until we reach the end of the list, where TPR = 0 and FPR = 0. The ROC curve for this example is shown in Figure 4.53.

Figure 4.52. Computing the TPR and FPR at every score threshold.

Figure 4.53. ROC curve for the data shown in Figure 4.52.

Note that in an ROC curve, the TPR monotonically increases with FPR, because the inclusion of a test instance in the set of predicted positives can either increase the TPR or the FPR. The ROC curve thus has a staircase pattern. Furthermore, there are several critical points along an ROC curve that have well-known interpretations:

(TPR = 0, FPR = 0): Model predicts every instance to be a negative class.

(TPR = 1, FPR = 1): Model predicts every instance to be a positive class.

(TPR = 1, FPR = 0): The perfect model with zero misclassifications.

A good classification model should be located as close as possible to the upper left corner of the diagram, while a model that makes random guesses should reside along the main diagonal, connecting the points (TPR = 0, FPR = 0) and (TPR = 1, FPR = 1). Random guessing means that an instance is classified as a positive class with a fixed probability p, irrespective of its attribute set.

For example, consider a data set that contains n+ positive instances and n− negative instances. The random classifier is expected to correctly classify pn+ of the positive instances and to misclassify pn− of the negative instances. Therefore, the TPR of the classifier is (pn+)/n+ = p, while its FPR is (pn−)/n− = p. Hence, this random classifier will reside at the point (p, p) in the ROC curve, along the main diagonal.

Figure 4.54. ROC curves for two different classifiers.

Since every point on the ROC curve represents the performance of a classifier generated using a particular score threshold, they can be viewed as different operating points of the classifier. One may choose one of these operating points depending on the requirements of the application. Hence, an ROC curve facilitates the comparison of classifiers over a range of operating points. For example, Figure 4.54 compares the ROC curves of two classifiers, M1 and M2, generated by varying the score thresholds. We can see that M1 is better than M2 when the FPR is less than 0.36, as M1 shows a better TPR than M2 for this range of operating points. On the other hand, M2 is superior when the FPR is greater than 0.36, since the TPR of M2 is higher than that of M1 for this range. Clearly, neither of the two classifiers dominates (is strictly better than) the other, i.e., shows higher values of TPR and lower values of FPR over all operating points.

To summarize the aggregate behavior across all operating points, one of the commonly used measures is the area under the ROC curve (AUC). If the classifier is perfect, then its area under the ROC curve will be equal to 1. If the algorithm simply performs random guessing, then its area under the ROC curve will be equal to 0.5.

Although the AUC provides a useful summary of aggregate performance, there are certain caveats in using the AUC for comparing classifiers. First, even if the AUC of algorithm A is higher than the AUC of another algorithm B, this does not mean that algorithm A is always better than B, i.e., that the ROC curve of A dominates that of B across all operating points. For example, even though M1 shows a slightly lower AUC than M2 in Figure 4.54, we can see that both M1 and M2 are useful over different ranges of operating points and neither of them is strictly better than the other across all possible operating points. Hence, we cannot use the AUC to determine which algorithm is better, unless we know that the ROC curve of one of the algorithms dominates the other.

Second, although the AUC summarizes the aggregate performance over all operating points, we are often interested in only a small range of operating points in most applications. For example, even though M1 shows a slightly lower AUC than M2, it shows higher TPR values than M2 for small FPR values (smaller than 0.36). In the presence of class imbalance, the behavior of


an algorithm over small FPR values (also termed early retrieval) is often more meaningful for comparison than the performance over all FPR values. This is because, in many applications, it is important to assess the TPR achieved by a classifier in the first few instances with the highest scores, without incurring a large FPR. Hence, in Figure 4.54, due to the high TPR values of M1 during early retrieval (FPR < 0.36), we may prefer M1 over M2 for imbalanced test sets, despite the lower AUC of M1. Hence, care must be taken while comparing the AUC values of different classifiers, usually by visualizing their ROC curves rather than just reporting their AUC.

A key characteristic of ROC curves is that they are agnostic to the skew in the test set, because both of the evaluation measures used in constructing ROC curves (TPR and FPR) are invariant to class imbalance. Hence, ROC curves are not suitable for measuring the impact of skew on classification performance. In particular, we will obtain the same ROC curve for two test data sets that have very different skew.


Figure 4.55. PR curves for two different classifiers.

Precision-Recall Curve An alternate tool for aggregate evaluation is the precision-recall curve (PR curve). The PR curve plots the precision and recall values of a classifier on the y- and x-axes, respectively, by varying the threshold on the test scores. Figure 4.55 shows an example of PR curves for two hypothetical classifiers, M1 and M2. The approach for generating a PR curve is similar to the approach described above for generating an ROC curve. However, there are some key distinguishing features in the PR curve:

1. PR curves are sensitive to the skew factor α = P/(P + N), and different PR curves are generated for different values of α.

2. When the score threshold is lowest (every instance is labeled as positive), the precision is equal to α while the recall is 1. As we increase the score threshold, the number of predicted positives can stay the same or decrease. Hence, the recall monotonically declines as the score threshold increases. In general, the precision may increase or decrease for the same value of recall, upon addition of an instance into the set of predicted positives. For example, if the kth ranked instance belongs to the negative class, then including it will result in a drop in the precision without affecting the recall. The precision may improve at the next step, which adds the (k+1)th ranked instance, if this instance belongs to the positive class. Hence, the PR curve is not a smooth, monotonically increasing curve like the ROC curve, and generally has a zigzag pattern. This pattern is more prominent in the left part of the curve, where even a small change in the number of false positives can cause a large change in precision.

3. Also, as we increase the imbalance among the classes (reduce the value of α), the rightmost points of all PR curves will move downwards. At and near the leftmost point on the PR curve (corresponding to larger values of the score threshold), the recall is close to zero, while the precision is equal to the fraction of positives in the top ranked instances of the algorithm. Hence, different classifiers can have different values of precision at the leftmost points of the PR curve. Also, if the classification score of an algorithm monotonically varies with the posterior probability of the positive class, we can expect the PR curve to gradually decrease from a high value of precision at the leftmost point to a constant value of α at the rightmost point, albeit with some ups and downs. This can be observed in the PR curve of algorithm M1 in Figure 4.55, which starts from a higher value of precision on the left that gradually decreases as we move towards the right. On the other hand, the PR curve of algorithm M2 starts from a lower value of precision on the left and shows more drastic ups and downs as we move right, suggesting that the classification score of M2 shows a weaker monotonic relationship with the posterior probability of the positive class.

4. A random classifier that assigns an instance to be positive with a fixed probability p has a precision of α and a recall of p. Hence, a classifier that performs random guessing has a horizontal PR curve with y = α, as shown using a dashed line in Figure 4.55. Note that the random baseline in PR curves depends on the skew in the test set, in contrast to the fixed main diagonal of ROC curves that represents random classifiers.

5. Note that the precision of an algorithm is impacted more strongly by false positives in the top ranked test instances than the FPR of the algorithm. For this reason, the PR curve generally helps to magnify the differences between classifiers in the left portion of the PR curve. Hence, in the presence of class imbalance in the test data, analyzing the PR curves generally provides a deeper insight into the performance of classifiers than the ROC curves, especially in the early retrieval range of operating points.

6. The classifier corresponding to (precision = 1, recall = 1) represents the perfect classifier. Similar to AUC, we can also compute the area under the PR curve of an algorithm, known as AUC-PR. The AUC-PR of a random classifier is equal to α, while that of a perfect algorithm is equal to 1. Note that AUC-PR varies with changing skew in the test set, in contrast to the area under the ROC curve, which is insensitive to the skew. The AUC-PR helps in accentuating the differences between classification algorithms in the early retrieval range of operating points. Hence, it is more suited for evaluating classification performance in the presence of class imbalance than the area under the ROC curve. However, similar to ROC curves, a higher value of AUC-PR does not guarantee the superiority of one classification algorithm over another. This is because the PR curves of two algorithms can easily cross each other, such that they both show better performances in different ranges of operating points. Hence, it is important to visualize the PR curves before comparing their AUC-PR values, in order to ensure a meaningful evaluation.

4.12 Multiclass Problem

Some of the classification techniques described in this chapter are originally designed for binary classification problems. Yet there are many real-world problems, such as character recognition, face identification, and text classification, where the input data is divided into more than two categories. This section presents several approaches for extending the binary classifiers to handle multiclass problems. To illustrate these approaches, let Y = {y_1, y_2, …, y_K} be the set of classes of the input data.

The first approach decomposes the multiclass problem into K binary problems. For each class y_i ∈ Y, a binary problem is created where all instances that belong to y_i are considered positive examples, while the remaining instances are considered negative examples. A binary classifier is then constructed to separate instances of class y_i from the rest of the classes. This is known as the one-against-rest (1-r) approach.

The second approach, which is known as the one-against-one (1-1) approach, constructs K(K − 1)/2 binary classifiers, where each classifier is used to distinguish between a pair of classes, (y_i, y_j). Instances that do not belong to either y_i or y_j are ignored when constructing the binary classifier for (y_i, y_j). In both the 1-r and 1-1 approaches, a test instance is classified by combining the predictions made by the binary classifiers. A voting scheme is typically employed to combine the predictions, where the class that receives the highest number of votes is assigned to the test instance. In the 1-r approach, if an instance is classified as negative, then all classes except for the positive class receive a vote. This approach, however, may lead to ties among the different classes. Another possibility is to transform the outputs of the binary classifiers into probability estimates and then assign the test instance to the class that has the highest probability.

Example 4.11. Consider a multiclass problem where Y = {y_1, y_2, y_3, y_4}. Suppose a test instance is classified as (+, −, −, −) according to the 1-r approach. In other words, it is classified as positive when y_1 is used as the positive class and negative when y_2, y_3, and y_4 are used as the positive class. Using a simple majority vote, notice that y_1 receives the highest number of votes, which is four, while the remaining classes receive only three votes. The test instance is therefore classified as y_1.

Example 4.12. Suppose the test instance is classified using the 1-1 approach as follows:

Binary pair of classes:   +: y_1   +: y_1   +: y_1   +: y_2   +: y_2   +: y_3
                          −: y_2   −: y_3   −: y_4   −: y_3   −: y_4   −: y_4
Classification:             +        +        −        +        −        +

The first two rows in this table correspond to the pair of classes (y_i, y_j) chosen to build the classifier and the last row represents the predicted class for the test instance. After combining the predictions, y_1 and y_4 each receive two votes, while y_2 and y_3 each receive only one vote. The test instance is therefore classified as either y_1 or y_4, depending on the tie-breaking procedure.

Error-Correcting Output Coding


A potential problem with the previous two approaches is that they may be sensitive to binary classification errors. For the 1-r approach given in Example 4.11, if at least one of the binary classifiers makes a mistake in its prediction, then the classifier may end up declaring a tie between classes or making a wrong prediction. For example, suppose the test instance is classified as (+, −, +, −) due to misclassification by the third classifier. In this case, it will be difficult to tell whether the instance should be classified as y_1 or y_3, unless the probability associated with each class prediction is taken into account.

Theerror-correctingoutputcoding(ECOC)methodprovidesamorerobustwayforhandlingmulticlassproblems.Themethodisinspiredbyaninformation-theoreticapproachforsendingmessagesacrossnoisychannels.

Theideabehindthisapproachistoaddredundancyintothetransmittedmessagebymeansofacodeword,sothatthereceivermaydetecterrorsinthereceivedmessageandperhapsrecovertheoriginalmessageifthenumberoferrorsissmall.

Formulticlasslearning,eachclass isrepresentedbyauniquebitstringoflengthnknownasitscodeword.Wethentrainnbinaryclassifierstopredicteachbitofthecodewordstring.ThepredictedclassofatestinstanceisgivenbythecodewordwhoseHammingdistanceisclosesttothecodewordproducedbythebinaryclassifiers.RecallthattheHammingdistancebetweenapairofbitstringsisgivenbythenumberofbitsthatdiffer.

Example4.13.Consideramulticlassproblemwhere .Supposeweencodetheclassesusingthefollowingsevenbitcodewords:

(+,−,+,−)y1

y3

yi

Y={y1,y2,y3,y4}

Class Codeword

1 1 1 1 1 1 1

0 0 0 0 1 1 1

0 0 1 1 0 0 1

0 1 0 1 0 1 0

Eachbitofthecodewordisusedtotrainabinaryclassifier.Ifatestinstanceisclassifiedas(0,1,1,1,1,1,1)bythebinaryclassifiers,thentheHammingdistancebetweenthecodewordand is1,whiletheHammingdistancetotheremainingclassesis3.Thetestinstanceisthereforeclassifiedas .

An interesting property of an error-correcting code is that if the minimum Hamming distance between any pair of codewords is d, then any ⌊(d−1)/2⌋ errors in the output code can be corrected using its nearest codeword. In Example 4.13, because the minimum Hamming distance between any pair of codewords is 4, the classifier may tolerate errors made by one of the seven binary classifiers. If there is more than one classifier that makes a mistake, then the classifier may not be able to compensate for the error.
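The decoding step of ECOC can be written compactly. The sketch below is not from the text; it uses the seven-bit codewords of Example 4.13 and assigns the class whose codeword has the smallest Hamming distance to the predicted bit string.

```python
# A minimal sketch of ECOC decoding by Hamming distance (Example 4.13 codewords).
codewords = {
    "y1": [1, 1, 1, 1, 1, 1, 1],
    "y2": [0, 0, 0, 0, 1, 1, 1],
    "y3": [0, 0, 1, 1, 0, 0, 1],
    "y4": [0, 1, 0, 1, 0, 1, 0],
}

def hamming(a, b):
    """Number of bit positions in which two codewords differ."""
    return sum(x != y for x, y in zip(a, b))

# Outputs of the seven binary classifiers for a test instance (as in Example 4.13).
predicted_bits = [0, 1, 1, 1, 1, 1, 1]

# Assign the class whose codeword is closest in Hamming distance.
best_class = min(codewords, key=lambda c: hamming(codewords[c], predicted_bits))
print(best_class)   # y1 (distance 1, versus distance 3 for the remaining classes)
```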

An important issue is how to design the appropriate set of codewords for different classes. From coding theory, a vast number of algorithms have been developed for generating n-bit codewords with bounded Hamming distance. However, the discussion of these algorithms is beyond the scope of this book. It is worthwhile mentioning that there is a significant difference between the design of error-correcting codes for communication tasks compared to those used for multiclass learning. For communication, the codewords should maximize the Hamming distance between the rows so that error correction

can be performed. Multiclass learning, however, requires that both the row-wise and column-wise distances of the codewords must be well separated. A larger column-wise distance ensures that the binary classifiers are mutually independent, which is an important requirement for ensemble learning methods.

4.13BibliographicNotesMitchell[278]providesexcellentcoverageonmanyclassificationtechniquesfromamachinelearningperspective.ExtensivecoverageonclassificationcanalsobefoundinAggarwal[195],Dudaetal.[229],Webb[307],Fukunaga[237],Bishop[204],Hastieetal.[249],CherkasskyandMulier[215],WittenandFrank[310],Handetal.[247],HanandKamber[244],andDunham[230].

Directmethodsforrule-basedclassifierstypicallyemploythesequentialcoveringschemeforinducingclassificationrules.Holte's1R[255]isthesimplestformofarule-basedclassifierbecauseitsrulesetcontainsonlyasinglerule.Despiteitssimplicity,Holtefoundthatforsomedatasetsthatexhibitastrongone-to-onerelationshipbetweentheattributesandtheclasslabel,1Rperformsjustaswellasotherclassifiers.Otherexamplesofrule-basedclassifiersincludeIREP[234],RIPPER[218],CN2[216,217],AQ[276],RISE[224],andITRULE[296].Table4.8 showsacomparisonofthecharacteristicsoffouroftheseclassifiers.

Table 4.8. Comparison of various rule-based classifiers.

| Characteristic                       | RIPPER                                | CN2 (unordered)     | CN2 (ordered)                 | AQR                                                |
| Rule-growing strategy                | General-to-specific                   | General-to-specific | General-to-specific           | General-to-specific (seeded by a positive example) |
| Evaluation metric                    | FOIL's Info gain                      | Laplace             | Entropy and likelihood ratio  | Number of true positives                           |
| Stopping condition for rule-growing  | All examples belong to the same class | No performance gain | No performance gain           | Rules cover only positive class                    |
| Rule pruning                         | Reduced error pruning                 | None                | None                          | None                                               |
| Instance elimination                 | Positive and negative                 | Positive only       | Positive only                 | Positive and negative                              |
| Stopping condition for adding rules  | Error > 50% or based on MDL           | No performance gain | No performance gain           | All positive examples are covered                  |
| Rule set pruning                     | Replace or modify rules               | Statistical tests   | None                          | None                                               |
| Search strategy                      | Greedy                                | Beam search         | Beam search                   | Beam search                                        |

Forrule-basedclassifiers,theruleantecedentcanbegeneralizedtoincludeanypropositionalorfirst-orderlogicalexpression(e.g.,Hornclauses).Readerswhoareinterestedinfirst-orderlogicrule-basedclassifiersmayrefertoreferencessuchas[278]orthevastliteratureoninductivelogicprogramming[279].Quinlan[287]proposedtheC4.5rulesalgorithmforextractingclassificationrulesfromdecisiontrees.AnindirectmethodforextractingrulesfromartificialneuralnetworkswasgivenbyAndrewsetal.in[198].

CoverandHart[220]presentedanoverviewofthenearestneighborclassificationmethodfromaBayesianperspective.Ahaprovidedboththeoreticalandempiricalevaluationsforinstance-basedmethodsin[196].PEBLS,whichwasdevelopedbyCostandSalzberg[219],isanearestneighborclassifierthatcanhandledatasetscontainingnominalattributes.


EachtrainingexampleinPEBLSisalsoassignedaweightfactorthatdependsonthenumberoftimestheexamplehelpsmakeacorrectprediction.Hanetal.[243]developedaweight-adjustednearestneighboralgorithm,inwhichthefeatureweightsarelearnedusingagreedy,hill-climbingoptimizationalgorithm.Amorerecentsurveyofk-nearestneighborclassificationisgivenbySteinbachandTan[298].

NaïveBayesclassifiershavebeeninvestigatedbymanyauthors,includingLangleyetal.[267],RamoniandSebastiani[288],Lewis[270],andDomingosandPazzani[227].AlthoughtheindependenceassumptionusedinnaïveBayesclassifiersmayseemratherunrealistic,themethodhasworkedsurprisinglywellforapplicationssuchastextclassification.Bayesiannetworksprovideamoreflexibleapproachbyallowingsomeoftheattributestobeinterdependent.AnexcellenttutorialonBayesiannetworksisgivenbyHeckermanin[252]andJensenin[258].Bayesiannetworksbelongtoabroaderclassofmodelsknownasprobabilisticgraphicalmodels.AformalintroductiontotherelationshipsbetweengraphsandprobabilitieswaspresentedinPearl[283].OthergreatresourcesonprobabilisticgraphicalmodelsincludebooksbyBishop[205],andJordan[259].Detaileddiscussionsofconceptssuchasd-separationandMarkovblanketsareprovidedinGeigeretal.[238]andRussellandNorvig[291].

Generalizedlinearmodels(GLM)arearichclassofregressionmodelsthathavebeenextensivelystudiedinthestatisticalliterature.TheywereformulatedbyNelderandWedderburnin1972[280]tounifyanumberofregressionmodelssuchaslinearregression,logisticregression,andPoissonregression,whichsharesomesimilaritiesintheirformulations.AnextensivediscussionofGLMsisprovidedinthebookbyMcCullaghandNelder[274].

Artificialneuralnetworks(ANN)havewitnessedalongandwindinghistoryofdevelopments,involvingmultiplephasesofstagnationandresurgence.The

ideaofamathematicalmodelofaneuralnetworkwasfirstintroducedin1943byMcCullochandPitts[275].Thisledtoaseriesofcomputationalmachinestosimulateaneuralnetworkbasedonthetheoryofneuralplasticity[289].Theperceptron,whichisthesimplestprototypeofmodernANNs,wasdevelopedbyRosenblattin1958[290].Theperceptronusesasinglelayerofprocessingunitsthatcanperformbasicmathematicaloperationssuchasadditionandmultiplication.However,theperceptroncanonlylearnlineardecisionboundariesandisguaranteedtoconvergeonlywhentheclassesarelinearlyseparable.Despitetheinterestinlearningmulti-layernetworkstoovercomethelimitationsofperceptron,progressinthisarearemainhalteduntiltheinventionofthebackpropagationalgorithmbyWerbosin1974[309],whichallowedforthequicktrainingofmulti-layerANNsusingthegradientdescentmethod.Thisledtoanupsurgeofinterestintheartificialintelligence(AI)communitytodevelopmulti-layerANNmodels,atrendthatcontinuedformorethanadecade.Historically,ANNsmarkaparadigmshiftinAIfromapproachesbasedonexpertsystems(whereknowledgeisencodedusingif-thenrules)tomachinelearningapproaches(wheretheknowledgeisencodedintheparametersofacomputationalmodel).However,therewerestillanumberofalgorithmicandcomputationalchallengesinlearninglargeANNmodels,whichremainedunresolvedforalongtime.ThishinderedthedevelopmentofANNmodelstothescalenecessaryforsolvingreal-worldproblems.Slowly,ANNsstartedgettingoutpacedbyotherclassificationmodelssuchassupportvectormachines,whichprovidedbetterperformanceaswellastheoreticalguaranteesofconvergenceandoptimality.Itisonlyrecentlythatthechallengesinlearningdeepneuralnetworkshavebeencircumvented,owingtobettercomputationalresourcesandanumberofalgorithmicimprovementsinANNssince2006.Thisre-emergenceofANNhasbeendubbedas“deeplearning,”whichhasoftenoutperformedexistingclassificationmodelsandgainedwide-spreadpopularity.

Deeplearningisarapidlyevolvingareaofresearchwithanumberofpotentiallyimpactfulcontributionsbeingmadeeveryyear.Someofthelandmarkadvancementsindeeplearningincludetheuseoflarge-scalerestrictedBoltzmannmachinesforlearninggenerativemodelsofdata[201,253],theuseofautoencodersanditsvariants(denoisingautoencoders)forlearningrobustfeaturerepresentations[199,305,306],andsophisticalarchitecturestopromotesharingofparametersacrossnodessuchasconvolutionalneuralnetworksforimages[265,268]andrecurrentneuralnetworksforsequences[241,242,277].OthermajorimprovementsincludetheapproachofunsupervisedpretrainingforinitializingANNmodels[232],thedropouttechniqueforregularization[254,297],batchnormalizationforfastlearningofANNparameters[256],andmaxoutnetworksforeffectiveusageofthedropouttechnique[240].EventhoughthediscussionsinthischapteronlearningANNmodelswerecenteredaroundthegradientdescentmethod,mostofthemodernANNmodelsinvolvingalargenumberofhiddenlayersaretrainedusingthestochasticgradientdescentmethodsinceitishighlyscalable[207].AnextensivesurveyofdeeplearningapproacheshasbeenpresentedinreviewarticlesbyBengio[200],LeCunetal.[269],andSchmidhuber[293].AnexcellentsummaryofdeeplearningapproachescanalsobeobtainedfromrecentbooksbyGoodfellowetal.[239]andNielsen[281].

Vapnik [303, 304] has written two authoritative books on Support Vector Machines (SVM). Other useful resources on SVM and kernel methods include the books by Cristianini and Shawe-Taylor [221] and Schölkopf and Smola [294]. There are several survey articles on SVM, including those written by Burges [212], Bennett et al. [202], Hearst [251], and Mangasarian [272]. SVM can also be viewed as an L2 norm regularizer of the hinge loss function, as described in detail by Hastie et al. [249]. The L1 norm regularizer of the square loss function can be obtained using the least absolute shrinkage and selection operator (Lasso), which was introduced by Tibshirani in 1996 [301].

TheLassohasseveralinterestingpropertiessuchastheabilitytosimultaneouslyperformfeatureselectionaswellasregularization,sothatonlyasubsetoffeaturesareselectedinthefinalmodel.AnexcellentreviewofLassocanbeobtainedfromabookbyHastieetal.[250].

AsurveyofensemblemethodsinmachinelearningwasgivenbyDiet-terich[222].ThebaggingmethodwasproposedbyBreiman[209].FreundandSchapire[236]developedtheAdaBoostalgorithm.Arcing,whichstandsforadaptiveresamplingandcombining,isavariantoftheboostingalgorithmproposedbyBreiman[210].Itusesthenon-uniformweightsassignedtotrainingexamplestoresamplethedataforbuildinganensembleoftrainingsets.UnlikeAdaBoost,thevotesofthebaseclassifiersarenotweightedwhendeterminingtheclasslabeloftestexamples.TherandomforestmethodwasintroducedbyBreimanin[211].Theconceptofbias-variancedecompositionisexplainedindetailbyHastieetal.[249].Whilethebias-variancedecompositionwasinitiallyproposedforregressionproblemswithsquaredlossfunction,aunifiedframeworkforclassificationproblemsinvolving0–1losseswasintroducedbyDomingos[226].

RelatedworkonminingrareandimbalanceddatasetscanbefoundinthesurveypaperswrittenbyChawlaetal.[214]andWeiss[308].Sampling-basedmethodsforminingimbalanceddatasetshavebeeninvestigatedbymanyauthors,suchasKubatandMatwin[266],Japkowitz[257],andDrummondandHolte[228].Joshietal.[261]discussedthelimitationsofboostingalgorithmsforrareclassmodeling.OtheralgorithmsdevelopedforminingrareclassesincludeSMOTE[213],PNrule[260],andCREDOS[262].

Various alternative metrics that are well-suited for class imbalanced problems are available. The precision, recall, and F1-measure are widely-used metrics in information retrieval [302]. ROC analysis was originally used in signal detection theory for performing aggregate evaluation over a range of score

thresholds.AmethodforcomparingclassifierperformanceusingtheconvexhullofROCcurveswassuggestedbyProvostandFawcettin[286].Bradley[208]investigatedtheuseofareaundertheROCcurve(AUC)asaperformancemetricformachinelearningalgorithms.DespitethevastbodyofliteratureonoptimizingtheAUCmeasureinmachinelearningmodels,itiswell-knownthatAUCsuffersfromcertainlimitations.Forexample,theAUCcanbeusedtocomparethequalityoftwoclassifiersonlyiftheROCcurveofoneclassifierstrictlydominatestheother.However,iftheROCcurvesoftwoclassifiersintersectatanypoint,thenitisdifficulttoassesstherelativequalityofclassifiersusingtheAUCmeasure.Anin-depthdiscussionofthepitfallsinusingAUCasaperformancemeasurecanbeobtainedinworksbyHand[245,246],andPowers[284].TheAUChasalsobeenconsideredtobeanincoherentmeasureofperformance,i.e.,itusesdifferentscaleswhilecomparingtheperformanceofdifferentclassifiers,althoughacoherentinterpretationofAUChasbeenprovidedbyFerrietal.[235].BerrarandFlach[203]describesomeofthecommoncaveatsinusingtheROCcurveforclinicalmicroarrayresearch.Analternateapproachformeasuringtheaggregateperformanceofaclassifieristheprecision-recall(PR)curve,whichisespeciallyusefulinthepresenceofclassimbalance[292].

Anexcellenttutorialoncost-sensitivelearningcanbefoundinareviewarticlebyLingandSheng[271].ThepropertiesofacostmatrixhadbeenstudiedbyElkanin[231].MargineantuandDietterich[273]examinedvariousmethodsforincorporatingcostinformationintotheC4.5learningalgorithm,includingwrappermethods,classdistribution-basedmethods,andloss-basedmethods.Othercost-sensitivelearningmethodsthatarealgorithm-independentincludeAdaCost[233],MetaCost[225],andcosting[312].

Extensiveliteratureisalsoavailableonthesubjectofmulticlasslearning.ThisincludestheworksofHastieandTibshirani[248],Allweinetal.[197],KongandDietterich[264],andTaxandDuin[300].Theerror-correctingoutput

coding(ECOC)methodwasproposedbyDietterichandBakiri[223].Theyhadalsoinvestigatedtechniquesfordesigningcodesthataresuitableforsolvingmulticlassproblems.

Apartfromexploringalgorithmsfortraditionalclassificationsettingswhereeveryinstancehasasinglesetoffeatureswithauniquecategoricallabel,therehasbeenalotofrecentinterestinnon-traditionalclassificationparadigms,involvingcomplexformsofinputsandoutputs.Forexample,theparadigmofmulti-labellearningallowsforaninstancetobeassignedmultipleclasslabelsratherthanjustone.Thisisusefulinapplicationssuchasobjectrecognitioninimages,whereaphotoimagemayincludemorethanoneclassificationobject,suchas,grass,sky,trees,andmountains.Asurveyonmulti-labellearningcanbefoundin[313].Asanotherexample,theparadigmofmulti-instancelearningconsiderstheproblemwheretheinstancesareavailableintheformofgroupscalledbags,andtraininglabelsareavailableatthelevelofbagsratherthanindividualinstances.Multi-instancelearningisusefulinapplicationswhereanobjectcanexistasmultipleinstancesindifferentstates(e.g.,thedifferentisomersofachemicalcompound),andevenifasingleinstanceshowsaspecificcharacteristic,theentirebagofinstancesassociatedwiththeobjectneedstobeassignedtherelevantclass.Asurveyonmulti-instancelearningisprovidedin[314].

Inanumberofreal-worldapplications,itisoftenthecasethatthetraininglabelsarescarceinquantity,becauseofthehighcostsassociatedwithobtaininggold-standardsupervision.However,wealmostalwayshaveabundantaccesstounlabeledtestinstances,whichdonothavesupervisedlabelsbutcontainvaluableinformationaboutthestructureordistributionofinstances.Traditionallearningalgorithms,whichonlymakeuseofthelabeledinstancesinthetrainingsetforlearningthedecisionboundary,areunabletoexploittheinformationcontainedinunlabeledinstances.Incontrast,learningalgorithmsthatmakeuseofthestructureintheunlabeleddataforlearningthe

classificationmodelareknownassemi-supervisedlearningalgorithms[315,316].Theuseofunlabeleddataisalsoexploredintheparadigmofmulti-viewlearning[299,311],whereeveryobjectisobservedinmultipleviewsofthedata,involvingdiversesetsoffeatures.Acommonstrategyusedbymulti-viewlearningalgorithmsisco-training[206],whereadifferentmodelislearnedforeveryviewofthedata,butthemodelpredictionsfromeveryviewareconstrainedtobeidenticaltoeachotherontheunlabeledtestinstances.

Anotherlearningparadigmthatiscommonlyexploredinthepaucityoftrainingdataistheframeworkofactivelearning,whichattemptstoseekthesmallestsetoflabelannotationstolearnareasonableclassificationmodel.Activelearningexpectstheannotatortobeinvolvedintheprocessofmodellearning,sothatthelabelsarerequestedincrementallyoverthemostrelevantsetofinstances,givenalimitedbudgetoflabelannotations.Forexample,itmaybeusefultoobtainlabelsoverinstancesclosertothedecisionboundarythatcanplayabiggerroleinfine-tuningtheboundary.Areviewonactivelearningapproachescanbefoundin[285,295].

Insomeapplications,itisimportanttosimultaneouslysolvemultiplelearningtaskstogether,wheresomeofthetasksmaybesimilartooneanother.Forexample,ifweareinterestedintranslatingapassagewritteninEnglishintodifferentlanguages,thetasksinvolvinglexicallysimilarlanguages(suchasSpanishandPortuguese)wouldrequiresimilarlearningofmodels.Theparadigmofmulti-tasklearninghelpsinsimultaneouslylearningacrossalltaskswhilesharingthelearningamongrelatedtasks.Thisisespeciallyusefulwhensomeofthetasksdonotcontainsufficientlymanytrainingsamples,inwhichcaseborrowingthelearningfromotherrelatedtaskshelpsinthelearningofrobustmodels.Aspecialcaseofmulti-tasklearningistransferlearning,wherethelearningfromasourcetask(withsufficientnumberoftrainingsamples)hastobetransferredtoadestinationtask(withpaucityof

trainingdata).AnextensivesurveyoftransferlearningapproachesisprovidedbyPanetal.[282].

Most classifiers assume every data instance must belong to a class, which is not always true for some applications. For example, in malware detection, due to the ease with which new malwares are created, a classifier trained on existing classes may fail to detect new ones even if the features for the new malwares are considerably different than those for existing malwares. Another example is in critical applications such as medical diagnosis, where prediction errors are costly and can have severe consequences. In this situation, it would be better for the classifier to refrain from making any prediction on a data instance if it is unsure of its class. This approach, known as classifier with reject option, does not need to classify every data instance unless it determines the prediction is reliable (e.g., if the class probability exceeds a user-specified threshold). Instances that are unclassified can be presented to domain experts for further determination of their true class labels.
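As a rough illustration, a reject option can be implemented as a simple threshold on the estimated class probabilities. The sketch below is not from the text, and the function name, labels, and probability values are hypothetical.

```python
# A minimal sketch of a classifier with reject option: predict only when the
# most probable class exceeds a user-specified threshold, otherwise abstain.
def predict_with_reject(class_probs, threshold=0.8):
    """class_probs: dict mapping class label -> estimated probability."""
    label, prob = max(class_probs.items(), key=lambda kv: kv[1])
    return label if prob >= threshold else None   # None = reject (defer to an expert)

print(predict_with_reject({"malware": 0.95, "benign": 0.05}))   # 'malware'
print(predict_with_reject({"malware": 0.55, "benign": 0.45}))   # None -> rejected
```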

Classifierscanalsobedistinguishedintermsofhowtheclassificationmodelistrained.Abatchclassifierassumesallthelabeledinstancesareavailableduringtraining.Thisstrategyisapplicablewhenthetrainingsetsizeisnottoolargeandforstationarydata,wheretherelationshipbetweentheattributesandclassesdoesnotvaryovertime.Anonlineclassifier,ontheotherhand,trainsaninitialmodelusingasubsetofthelabeleddata[263].Themodelisthenupdatedincrementallyasmorelabeledinstancesbecomeavailable.Thisstrategyiseffectivewhenthetrainingsetistoolargeorwhenthereisconceptdriftduetochangesinthedistributionofthedataovertime.

Bibliography[195]C.C.Aggarwal.Dataclassification:algorithmsandapplications.CRC

Press,2014.

[196]D.W.Aha.Astudyofinstance-basedalgorithmsforsupervisedlearningtasks:mathematical,empirical,andpsychologicalevaluations.PhDthesis,UniversityofCalifornia,Irvine,1990.

[197]E.L.Allwein,R.E.Schapire,andY.Singer.ReducingMulticlasstoBinary:AUnifyingApproachtoMarginClassifiers.JournalofMachineLearningResearch,1:113–141,2000.

[198]R.Andrews,J.Diederich,andA.Tickle.ASurveyandCritiqueofTechniquesForExtractingRulesFromTrainedArtificialNeuralNetworks.KnowledgeBasedSystems,8(6):373–389,1995.

[199]P.Baldi.Autoencoders,unsupervisedlearning,anddeeparchitectures.ICMLunsupervisedandtransferlearning,27(37-50):1,2012.

[200]Y.Bengio.LearningdeeparchitecturesforAI.FoundationsandtrendsRinMachineLearning,2(1):1–127,2009.

[201]Y.Bengio,A.Courville,andP.Vincent.Representationlearning:Areviewandnewperspectives.IEEEtransactionsonpatternanalysisand

machineintelligence,35(8):1798–1828,2013.

[202]K.BennettandC.Campbell.SupportVectorMachines:HypeorHallelujah.SIGKDDExplorations,2(2):1–13,2000.

[203]D.BerrarandP.Flach.CaveatsandpitfallsofROCanalysisinclinicalmicroarrayresearch(andhowtoavoidthem).Briefingsinbioinformatics,pagebbr008,2011.

[204]C.M.Bishop.NeuralNetworksforPatternRecognition.OxfordUniversityPress,Oxford,U.K.,1995.

[205]C.M.Bishop.PatternRecognitionandMachineLearning.Springer,2006.

[206]A.BlumandT.Mitchell.Combininglabeledandunlabeleddatawithco-training.InProceedingsoftheeleventhannualconferenceonComputationallearningtheory,pages92–100.ACM,1998.

[207]L.Bottou.Large-scalemachinelearningwithstochasticgradientdescent.InProceedingsofCOMPSTAT'2010,pages177–186.Springer,2010.

[208]A.P.Bradley.TheuseoftheareaundertheROCcurveintheEvaluationofMachineLearningAlgorithms.PatternRecognition,30(7):1145–1149,1997.

[209]L.Breiman.BaggingPredictors.MachineLearning,24(2):123–140,1996.

[210]L.Breiman.Bias,Variance,andArcingClassifiers.TechnicalReport486,UniversityofCalifornia,Berkeley,CA,1996.

[211]L.Breiman.RandomForests.MachineLearning,45(1):5–32,2001.

[212]C.J.C.Burges.ATutorialonSupportVectorMachinesforPatternRecognition.DataMiningandKnowledgeDiscovery,2(2):121–167,1998.

[213]N.V.Chawla,K.W.Bowyer,L.O.Hall,andW.P.Kegelmeyer.SMOTE:SyntheticMinorityOver-samplingTechnique.JournalofArtificialIntelligenceResearch,16:321–357,2002.

[214]N.V.Chawla,N.Japkowicz,andA.Kolcz.Editorial:SpecialIssueonLearningfromImbalancedDataSets.SIGKDDExplorations,6(1):1–6,2004.

[215]V.CherkasskyandF.Mulier.LearningfromData:Concepts,Theory,andMethods.WileyInterscience,1998.

[216]P.ClarkandR.Boswell.RuleInductionwithCN2:SomeRecentImprovements.InMachineLearning:Proc.ofthe5thEuropeanConf.(EWSL-91),pages151–163,1991.

[217]P.ClarkandT.Niblett.TheCN2InductionAlgorithm.MachineLearning,3(4):261–283,1989.

[218]W.W.Cohen.FastEffectiveRuleInduction.InProc.ofthe12thIntl.Conf.onMachineLearning,pages115–123,TahoeCity,CA,July1995.

[219]S.CostandS.Salzberg.AWeightedNearestNeighborAlgorithmforLearningwithSymbolicFeatures.MachineLearning,10:57–78,1993.

[220]T.M.CoverandP.E.Hart.NearestNeighborPatternClassification.KnowledgeBasedSystems,8(6):373–389,1995.

[221]N.CristianiniandJ.Shawe-Taylor.AnIntroductiontoSupportVectorMachinesandOtherKernel-basedLearningMethods.CambridgeUniversityPress,2000.

[222]T.G.Dietterich.EnsembleMethodsinMachineLearning.InFirstIntl.WorkshoponMultipleClassifierSystems,Cagliari,Italy,2000.

[223]T.G.DietterichandG.Bakiri.SolvingMulticlassLearningProblemsviaError-CorrectingOutputCodes.JournalofArtificialIntelligenceResearch,2:263–286,1995.

[224]P.Domingos.TheRISEsystem:Conqueringwithoutseparating.InProc.ofthe6thIEEEIntl.Conf.onToolswithArtificialIntelligence,pages704–707,NewOrleans,LA,1994.

[225]P.Domingos.MetaCost:AGeneralMethodforMakingClassifiersCost-Sensitive.InProc.ofthe5thIntl.Conf.onKnowledgeDiscoveryandDataMining,pages155–164,SanDiego,CA,August1999.

[226]P.Domingos.Aunifiedbias-variancedecomposition.InProceedingsof17thInternationalConferenceonMachineLearning,pages231–238,2000.

[227]P.DomingosandM.Pazzani.OntheOptimalityoftheSimpleBayesianClassifierunderZero-OneLoss.MachineLearning,29(2-3):103–130,1997.

[228]C.DrummondandR.C.Holte.C4.5,Classimbalance,andCostsensitivity:Whyunder-samplingbeatsover-sampling.InICML'2004WorkshoponLearningfromImbalancedDataSetsII,Washington,DC,August2003.

[229]R.O.Duda,P.E.Hart,andD.G.Stork.PatternClassification.JohnWiley&Sons,Inc.,NewYork,2ndedition,2001.

[230]M.H.Dunham.DataMining:IntroductoryandAdvancedTopics.PrenticeHall,2006.

[231]C.Elkan.TheFoundationsofCost-SensitiveLearning.InProc.ofthe17thIntl.JointConf.onArtificialIntelligence,pages973–978,Seattle,WA,August2001.

[232]D.Erhan,Y.Bengio,A.Courville,P.-A.Manzagol,P.Vincent,andS.Bengio.Whydoesunsupervisedpre-traininghelpdeeplearning?JournalofMachineLearningResearch,11(Feb):625–660,2010.

[233]W.Fan,S.J.Stolfo,J.Zhang,andP.K.Chan.AdaCost:misclassificationcost-sensitiveboosting.InProc.ofthe16thIntl.Conf.onMachineLearning,pages97–105,Bled,Slovenia,June1999.

[234]J.FürnkranzandG.Widmer.Incrementalreducederrorpruning.InProc.ofthe11thIntl.Conf.onMachineLearning,pages70–77,NewBrunswick,NJ,July1994.

[235]C.Ferri,J.Hernández-Orallo,andP.A.Flach.AcoherentinterpretationofAUCasameasureofaggregatedclassificationperformance.InProceedingsofthe28thInternationalConferenceonMachineLearning(ICML-11),pages657–664,2011.

[236]Y.FreundandR.E.Schapire.Adecision-theoreticgeneralizationofon-linelearningandanapplicationtoboosting.JournalofComputerandSystemSciences,55(1):119–139,1997.

[237]K.Fukunaga.IntroductiontoStatisticalPatternRecognition.AcademicPress,NewYork,1990.

[238]D.Geiger,T.S.Verma,andJ.Pearl.d-separation:Fromtheoremstoalgorithms.arXivpreprintarXiv:1304.1505,2013.

[239]I.Goodfellow,Y.Bengio,andA.Courville.DeepLearning.BookinpreparationforMITPress,2016.

[240]I.J.Goodfellow,D.Warde-Farley,M.Mirza,A.C.Courville,andY.Bengio.Maxoutnetworks.ICML(3),28:1319–1327,2013.

[241]A.Graves,M.Liwicki,S.Fernández,R.Bertolami,H.Bunke,andJ.Schmidhuber.Anovelconnectionistsystemforunconstrainedhandwritingrecognition.IEEEtransactionsonpatternanalysisandmachineintelligence,31(5):855–868,2009.

[242]A.GravesandJ.Schmidhuber.Offlinehandwritingrecognitionwithmultidimensionalrecurrentneuralnetworks.InAdvancesinneuralinformationprocessingsystems,pages545–552,2009.

[243]E.-H.Han,G.Karypis,andV.Kumar.TextCategorizationUsingWeightAdjustedk-NearestNeighborClassification.InProc.ofthe5thPacific-AsiaConf.onKnowledgeDiscoveryandDataMining,Lyon,France,2001.

[244]J.HanandM.Kamber.DataMining:ConceptsandTechniques.MorganKaufmannPublishers,SanFrancisco,2001.

[245]D.J.Hand.Measuringclassifierperformance:acoherentalternativetotheareaundertheROCcurve.Machinelearning,77(1):103–123,2009.

[246]D.J.Hand.Evaluatingdiagnostictests:theareaundertheROCcurveandthebalanceoferrors.Statisticsinmedicine,29(14):1502–1510,2010.

[247]D.J.Hand,H.Mannila,andP.Smyth.PrinciplesofDataMining.MITPress,2001.

[248]T.HastieandR.Tibshirani.Classificationbypairwisecoupling.AnnalsofStatistics,26(2):451–471,1998.

[249]T.Hastie,R.Tibshirani,andJ.Friedman.TheElementsofStatisticalLearning:DataMining,Inference,andPrediction.Springer,2ndedition,2009.

[250]T.Hastie,R.Tibshirani,andM.Wainwright.Statisticallearningwithsparsity:thelassoandgeneralizations.CRCPress,2015.

[251]M.Hearst.Trends&Controversies:SupportVectorMachines.IEEEIntelligentSystems,13(4):18–28,1998.

[252]D.Heckerman.BayesianNetworksforDataMining.DataMiningandKnowledgeDiscovery,1(1):79–119,1997.

[253]G.E.HintonandR.R.Salakhutdinov.Reducingthedimensionalityofdatawithneuralnetworks.Science,313(5786):504–507,2006.

[254]G.E.Hinton,N.Srivastava,A.Krizhevsky,I.Sutskever,andR.R.Salakhutdinov.Improvingneuralnetworksbypreventingco-adaptationoffeaturedetectors.arXivpreprintarXiv:1207.0580,2012.

[255]R.C.Holte.VerySimpleClassificationRulesPerformWellonMostCommonlyUsedDatasets.MachineLearning,11:63–91,1993.

[256]S.IoffeandC.Szegedy.Batchnormalization:Acceleratingdeepnetworktrainingbyreducinginternalcovariateshift.arXivpreprintarXiv:1502.03167,2015.

[257]N.Japkowicz.TheClassImbalanceProblem:SignificanceandStrategies.InProc.ofthe2000Intl.Conf.onArtificialIntelligence:SpecialTrackonInductiveLearning,volume1,pages111–117,LasVegas,NV,June2000.

[258]F.V.Jensen.AnintroductiontoBayesiannetworks,volume210.UCLpressLondon,1996.

[259]M.I.Jordan.Learningingraphicalmodels,volume89.SpringerScience&BusinessMedia,1998.

[260]M.V.Joshi,R.C.Agarwal,andV.Kumar.MiningNeedlesinaHaystack:ClassifyingRareClassesviaTwo-PhaseRuleInduction.InProc.of2001ACM-SIGMODIntl.Conf.onManagementofData,pages91–102,SantaBarbara,CA,June2001.

[261]M.V.Joshi,R.C.Agarwal,andV.Kumar.Predictingrareclasses:canboostingmakeanyweaklearnerstrong?InProc.ofthe8thIntl.Conf.onKnowledgeDiscoveryandDataMining,pages297–306,Edmonton,Canada,July2002.

[262]M.V.JoshiandV.Kumar.CREDOS:ClassificationUsingRippleDownStructure(ACaseforRareClasses).InProc.oftheSIAMIntl.Conf.onDataMining,pages321–332,Orlando,FL,April2004.

[263]J.Kivinen,A.J.Smola,andR.C.Williamson.Onlinelearningwithkernels.IEEEtransactionsonsignalprocessing,52(8):2165–2176,2004.

[264]E.B.KongandT.G.Dietterich.Error-CorrectingOutputCodingCorrectsBiasandVariance.InProc.ofthe12thIntl.Conf.onMachineLearning,pages313–321,TahoeCity,CA,July1995.

[265]A.Krizhevsky,I.Sutskever,andG.E.Hinton.Imagenetclassificationwithdeepconvolutionalneuralnetworks.InAdvancesinneuralinformationprocessingsystems,pages1097–1105,2012.

[266]M.KubatandS.Matwin.AddressingtheCurseofImbalancedTrainingSets:OneSidedSelection.InProc.ofthe14thIntl.Conf.onMachineLearning,pages179–186,Nashville,TN,July1997.

[267]P.Langley,W.Iba,andK.Thompson.AnanalysisofBayesianclassifiers.InProc.ofthe10thNationalConf.onArtificialIntelligence,pages223–228,1992.

[268]Y.LeCunandY.Bengio.Convolutionalnetworksforimages,speech,andtimeseries.Thehandbookofbraintheoryandneuralnetworks,3361(10):1995,1995.

[269]Y.LeCun,Y.Bengio,andG.Hinton.Deeplearning.Nature,521(7553):436–444,2015.

[270]D.D.Lewis.NaiveBayesatForty:TheIndependenceAssumptioninInformationRetrieval.InProc.ofthe10thEuropeanConf.onMachineLearning(ECML1998),pages4–15,1998.

[271]C.X.LingandV.S.Sheng.Cost-sensitivelearning.InEncyclopediaofMachineLearning,pages231–235.Springer,2011.

[272]O.Mangasarian.DataMiningviaSupportVectorMachines.TechnicalReportTechnicalReport01-05,DataMiningInstitute,May2001.

[273]D.D.MargineantuandT.G.Dietterich.LearningDecisionTreesforLossMinimizationinMulti-ClassProblems.TechnicalReport99-30-03,OregonStateUniversity,1999.

[274]P.McCullaghandJ.A.Nelder.Generalizedlinearmodels,volume37.CRCpress,1989.

[275]W.S.McCullochandW.Pitts.Alogicalcalculusoftheideasimmanentinnervousactivity.Thebulletinofmathematicalbiophysics,5(4):115–133,1943.

[276]R.S.Michalski,I.Mozetic,J.Hong,andN.Lavrac.TheMulti-PurposeIncrementalLearningSystemAQ15andItsTestingApplicationtoThree

MedicalDomains.InProc.of5thNationalConf.onArtificialIntelligence,Orlando,August1986.

[277]T.Mikolov,M.Karafiát,L.Burget,J.Cernock`y,andS.Khudanpur.Recurrentneuralnetworkbasedlanguagemodel.InInterspeech,volume2,page3,2010.

[278]T.Mitchell.MachineLearning.McGraw-Hill,Boston,MA,1997.

[279]S.Muggleton.FoundationsofInductiveLogicProgramming.PrenticeHall,EnglewoodCliffs,NJ,1995.

[280]J.A.NelderandR.J.Baker.Generalizedlinearmodels.Encyclopediaofstatisticalsciences,1972.

[281]M.A.Nielsen.Neuralnetworksanddeeplearning.Publishedonline:http://neuralnetworksanddeeplearning.com/.(visited:10.15.2016),2015.

[282]S.J.PanandQ.Yang.Asurveyontransferlearning.IEEETransactionsonknowledgeanddataengineering,22(10):1345–1359,2010.

[283]J.Pearl.Probabilisticreasoninginintelligentsystems:networksofplausibleinference.MorganKaufmann,2014.

[284]D.M.Powers.Theproblemofareaunderthecurve.In2012IEEEInternationalConferenceonInformationScienceandTechnology,pages

567–573.IEEE,2012.

[285]M.Prince.Doesactivelearningwork?Areviewoftheresearch.Journalofengineeringeducation,93(3):223–231,2004.

[286]F.J.ProvostandT.Fawcett.AnalysisandVisualizationofClassifierPerformance:ComparisonunderImpreciseClassandCostDistributions.InProc.ofthe3rdIntl.Conf.onKnowledgeDiscoveryandDataMining,pages43–48,NewportBeach,CA,August1997.

[287]J.R.Quinlan.C4.5:ProgramsforMachineLearning.Morgan-KaufmannPublishers,SanMateo,CA,1993.

[288]M.RamoniandP.Sebastiani.RobustBayesclassifiers.ArtificialIntelligence,125:209–226,2001.

[289]N.Rochester,J.Holland,L.Haibt,andW.Duda.Testsonacellassemblytheoryoftheactionofthebrain,usingalargedigitalcomputer.IRETransactionsoninformationTheory,2(3):80–93,1956.

[290]F.Rosenblatt.Theperceptron:aprobabilisticmodelforinformationstorageandorganizationinthebrain.Psychologicalreview,65(6):386,1958.

[291]S.J.Russell,P.Norvig,J.F.Canny,J.M.Malik,andD.D.Edwards.Artificialintelligence:amodernapproach,volume2.PrenticehallUpperSaddleRiver,2003.

[292]T.SaitoandM.Rehmsmeier.Theprecision-recallplotismoreinformativethantheROCplotwhenevaluatingbinaryclassifiersonimbalanceddatasets.PloSone,10(3):e0118432,2015.

[293]J.Schmidhuber.Deeplearninginneuralnetworks:Anoverview.NeuralNetworks,61:85–117,2015.

[294]B.SchölkopfandA.J.Smola.LearningwithKernels:SupportVectorMachines,Regularization,Optimization,andBeyond.MITPress,2001.

[295]B.Settles.Activelearningliteraturesurvey.UniversityofWisconsin,Madison,52(55-66):11,2010.

[296]P.SmythandR.M.Goodman.AnInformationTheoreticApproachtoRuleInductionfromDatabases.IEEETrans.onKnowledgeandDataEngineering,4(4):301–316,1992.

[297]N.Srivastava,G.E.Hinton,A.Krizhevsky,I.Sutskever,andR.Salakhutdinov.Dropout:asimplewaytopreventneuralnetworksfromoverfitting.JournalofMachineLearningResearch,15(1):1929–1958,2014.

[298]M.SteinbachandP.-N.Tan.kNN:k-NearestNeighbors.InX.WuandV.Kumar,editors,TheTopTenAlgorithmsinDataMining.ChapmanandHall/CRCReference,1stedition,2009.

[299]S.Sun.Asurveyofmulti-viewmachinelearning.NeuralComputingandApplications,23(7-8):2031–2038,2013.

[300]D.M.J.TaxandR.P.W.Duin.UsingTwo-ClassClassifiersforMulticlassClassification.InProc.ofthe16thIntl.Conf.onPatternRecognition(ICPR2002),pages124–127,Quebec,Canada,August2002.

[301]R.Tibshirani.Regressionshrinkageandselectionviathelasso.JournaloftheRoyalStatisticalSociety.SeriesB(Methodological),pages267–288,1996.

[302]C.J.vanRijsbergen.InformationRetrieval.Butterworth-Heinemann,Newton,MA,1978.

[303]V.Vapnik.TheNatureofStatisticalLearningTheory.SpringerVerlag,NewYork,1995.

[304]V.Vapnik.StatisticalLearningTheory.JohnWiley&Sons,NewYork,1998.

[305]P.Vincent,H.Larochelle,Y.Bengio,andP.-A.Manzagol.Extractingandcomposingrobustfeatureswithdenoisingautoencoders.InProceedingsofthe25thinternationalconferenceonMachinelearning,pages1096–1103.ACM,2008.

[306]P.Vincent,H.Larochelle,I.Lajoie,Y.Bengio,andP.-A.Manzagol.Stackeddenoisingautoencoders:Learningusefulrepresentationsina

deepnetworkwithalocaldenoisingcriterion.JournalofMachineLearningResearch,11(Dec):3371–3408,2010.

[307]A.R.Webb.StatisticalPatternRecognition.JohnWiley&Sons,2ndedition,2002.

[308]G.M.Weiss.MiningwithRarity:AUnifyingFramework.SIGKDDExplorations,6(1):7–19,2004.

[309]P.Werbos.Beyondregression:newfoolsforpredictionandanalysisinthebehavioralsciences.PhDthesis,HarvardUniversity,1974.

[310]I.H.WittenandE.Frank.DataMining:PracticalMachineLearningToolsandTechniqueswithJavaImplementations.MorganKaufmann,1999.

[311]C.Xu,D.Tao,andC.Xu.Asurveyonmulti-viewlearning.arXivpreprintarXiv:1304.5634,2013.

[312]B.Zadrozny,J.C.Langford,andN.Abe.Cost-SensitiveLearningbyCost-ProportionateExampleWeighting.InProc.ofthe2003IEEEIntl.Conf.onDataMining,pages435–442,Melbourne,FL,August2003.

[313]M.-L.ZhangandZ.-H.Zhou.Areviewonmulti-labellearningalgorithms.IEEEtransactionsonknowledgeanddataengineering,26(8):1819–1837,2014.

[314]Z.-H.Zhou.Multi-instancelearning:Asurvey.DepartmentofComputerScience&Technology,NanjingUniversity,Tech.Rep,2004.

[315]X.Zhu.Semi-supervisedlearning.InEncyclopediaofmachinelearning,pages892–897.Springer,2011.

[316]X.ZhuandA.B.Goldberg.Introductiontosemi-supervisedlearning.Synthesislecturesonartificialintelligenceandmachinelearning,3(1):1–130,2009.

4.14 Exercises

1. Consider a binary classification problem with the following set of attributes and attribute values:

Air Conditioner = {Working, Broken}
Engine = {Good, Bad}
Mileage = {High, Medium, Low}
Rust = {Yes, No}

Suppose a rule-based classifier produces the following rule set:

Mileage = High → Value = Low
Mileage = Low → Value = High
Air Conditioner = Working, Engine = Good → Value = High
Air Conditioner = Working, Engine = Bad → Value = Low
Air Conditioner = Broken → Value = Low

a. Are the rules mutually exclusive?

b. Is the rule set exhaustive?

c. Is ordering needed for this set of rules?

d. Do you need a default class for the rule set?

2. The RIPPER algorithm (by Cohen [218]) is an extension of an earlier algorithm called IREP (by Fürnkranz and Widmer [234]). Both algorithms apply the reduced-error pruning method to determine whether a rule needs to be pruned. The reduced error pruning method uses a validation set to estimate the generalization error of a classifier. Consider the following pair of rules:

R1: A → C
R2: A ∧ B → C

R2 is obtained by adding a new conjunct, B, to the left-hand side of R1. For this question, you will be asked to determine whether R2 is preferred over R1 from the perspectives of rule-growing and rule-pruning. To determine whether a rule should be pruned, IREP computes the following measure:

vIREP = (p + (N − n)) / (P + N),

where P is the total number of positive examples in the validation set, N is the total number of negative examples in the validation set, p is the number of positive examples in the validation set covered by the rule, and n is the number of negative examples in the validation set covered by the rule. vIREP is actually similar to classification accuracy for the validation set. IREP favors rules that have higher values of vIREP. On the other hand, RIPPER applies the following measure to determine whether a rule should be pruned:

vRIPPER = (p − n) / (p + n).

a. Suppose R1 is covered by 350 positive examples and 150 negative examples, while R2 is covered by 300 positive examples and 50 negative examples. Compute the FOIL's information gain for the rule R2 with respect to R1.

b. Consider a validation set that contains 500 positive examples and 500 negative examples. For R1, suppose the number of positive examples covered by the rule is 200, and the number of negative examples covered by the rule is 50. For R2, suppose the number of positive examples covered by the rule is 100 and the number of negative examples is 5. Compute vIREP for both rules. Which rule does IREP prefer?

c. Compute vRIPPER for the previous problem. Which rule does RIPPER prefer?

3.C4.5rulesisanimplementationofanindirectmethodforgeneratingrulesfromadecisiontree.RIPPERisanimplementationofadirectmethodforgeneratingrulesdirectlyfromdata.

a. Discussthestrengthsandweaknessesofbothmethods.

b. Consideradatasetthathasalargedifferenceintheclasssize(i.e.,someclassesaremuchbiggerthanothers).Whichmethod(betweenC4.5rulesandRIPPER)isbetterintermsoffindinghighaccuracyrulesforthesmallclasses?

4. Consider a training set that contains 100 positive examples and 400 negative examples. For each of the following candidate rules,

R1: A → + (covers 4 positive and 1 negative examples),
R2: B → + (covers 30 positive and 10 negative examples),
R3: C → + (covers 100 positive and 90 negative examples),

determine which is the best and worst candidate rule according to:

a. Rule accuracy.

b. FOIL's information gain.

c. The likelihood ratio statistic.

d. The Laplace measure.

e. The m-estimate measure (with k = 2 and p+ = 0.2).

5. Figure 4.3 illustrates the coverage of the classification rules R1, R2, and R3. Determine which is the best and worst rule according to:

a. The likelihood ratio statistic.

b. The Laplace measure.

c. The m-estimate measure (with k = 2 and p+ = 0.58).

d. TheruleaccuracyafterR1hasbeendiscovered,wherenoneoftheexamplescoveredbyR1arediscarded.

e. TheruleaccuracyafterR1hasbeendiscovered,whereonlythepositiveexamplescoveredbyR1arediscarded.

f. TheruleaccuracyafterR1hasbeendiscovered,wherebothpositiveandnegativeexamplescoveredbyR1arediscarded.

6.

a. Supposethefractionofundergraduatestudentswhosmokeis15%andthefractionofgraduatestudentswhosmokeis23%.Ifone-fifthofthecollegestudentsaregraduatestudentsandtherestareundergraduates,whatistheprobabilitythatastudentwhosmokesisagraduatestudent?

b. Giventheinformationinpart(a),isarandomlychosencollegestudentmorelikelytobeagraduateorundergraduatestudent?

c. Repeatpart(b)assumingthatthestudentisasmoker.

d. Suppose30%ofthegraduatestudentsliveinadormbutonly10%oftheundergraduatestudentsliveinadorm.Ifastudentsmokesandlivesinthedorm,isheorshemorelikelytobeagraduateorundergraduatestudent?Youcanassumeindependencebetweenstudentswholiveinadormandthosewhosmoke.

7.ConsiderthedatasetshowninTable4.9

Table4.9.DatasetforExercise7.

Instance A B C Class

1 0 0 0


+

2 0 0 1

3 0 1 1

4 0 1 1

5 0 0 1

6 1 0 1

7 1 0 1

8 1 0 1

9 1 1 1

10 1 0 1

a. Estimatetheconditionalprobabilitiesfor,and .

b. UsetheestimateofconditionalprobabilitiesgiveninthepreviousquestiontopredicttheclasslabelforatestsampleusingthenaïveBayesapproach.

c. Estimatetheconditionalprobabilitiesusingthem-estimateapproach,withand .

d. Repeatpart(b)usingtheconditionalprobabilitiesgiveninpart(c).

e. Comparethetwomethodsforestimatingprobabilities.Whichmethodisbetterandwhy?

8.ConsiderthedatasetshowninTable4.10 .

Table4.10.DatasetforExercise8.

+

+

+

+

P(A|+),P(B|+),P(C|+),P(A|−),P(B|−) P(C|−)

(A=0,B=1,C=0)

p=1/2 m=4

Instance A B C Class

1 0 0 1

2 1 0 1

3 0 1 0

4 1 0 0

5 1 0 1

6 0 0 1

7 1 1 0

8 0 0 0

9 0 1 0

10 1 1 1 +

a. Estimatetheconditionalprobabilitiesfor,and using

thesameapproachasinthepreviousproblem.

b. Usetheconditionalprobabilitiesinpart(a)topredicttheclasslabelforatestsample usingthenaïveBayesapproach.

c. Compare ,and .StatetherelationshipsbetweenAandB.

d. Repeattheanalysisinpart(c)using ,and .

e. Compare against and.Arethevariablesconditionallyindependentgiventhe

class?

+

+

+

+

P(A=1|+),P(B=1|+),P(C=1|+),P(A=1|−),P(B=1|−) P(C=1|−)

(A=1,B=1,C=1)

P(A=1),P(B=1) P(A=1,B=1)

P(A=1),P(B=0) P(A=1,B=0)

P(A=1,B=1|Class=+) P(A=1|Class=+)P(B=1|Class=+)

9.

a. ExplainhownaïveBayesperformsonthedatasetshowninFigure4.56 .

b. Ifeachclassisfurtherdividedsuchthattherearefourclasses(A1,A2,B1,andB2),willnaïveBayesperformbetter?

c. Howwilladecisiontreeperformonthisdataset(forthetwo-classproblem)?Whatiftherearefourclasses?

10.Figure4.57 illustratestheBayesiannetworkforthedatasetshowninTable4.11 .(Assumethatalltheattributesarebinary).

a. Drawtheprobabilitytableforeachnodeinthenetwork.

b. UsetheBayesiannetworktocompute.

11.GiventheBayesiannetworkshowninFigure4.58 ,computethefollowingprobabilities:

P(Engine=Bad,AirConditioner=Broken)

Figure4.56.DatasetforExercise9.

Figure4.57.Bayesiannetwork.

a. .P(B=good,F=empty,G=empty,S=yes)

b. .

c. Giventhatthebatteryisbad,computetheprobabilitythatthecarwillstart.

12.Considertheone-dimensionaldatasetshowninTable4.12 .

a. Classifythedatapoint accordingtoits1-,3-,5-,and9-nearestneighbors(usingmajorityvote).

b. Repeatthepreviousanalysisusingthedistance-weightedvotingapproachdescribedinSection4.3.1 .

Table4.11.DatasetforExercise10.

Mileage Engine AirConditioner

NumberofInstanceswith

NumberofInstanceswith

Hi Good Working 3 4

Hi Good Broken 1 2

Hi Bad Working 1 5

Hi Bad Broken 0 4

Lo Good Working 9 0

Lo Good Broken 5 1

Lo Bad Working 1 2

Lo Bad Broken 0 2

P(B=bad,F=empty,G=notempty,S=no)

x=5.0

CarValue=Hi CarValue=Lo

Figure4.58.BayesiannetworkforExercise11.

13. The nearest neighbor algorithm described in Section 4.3 can be extended to handle nominal attributes. A variant of the algorithm called PEBLS (Parallel Exemplar-Based Learning System) by Cost and Salzberg [219] measures the distance between two values of a nominal attribute using the modified value difference metric (MVDM). Given a pair of nominal attribute values, V1 and V2, the distance between them is defined as follows:

d(V1, V2) = ∑_{i=1}^{k} | n_{i1}/n_1 − n_{i2}/n_2 |,     (4.108)

where n_{ij} is the number of examples from class i with attribute value V_j and n_j is the number of examples with attribute value V_j.

Table 4.12. Data set for Exercise 12.

| x | 0.5 | 3.0 | 4.5 | 4.6 | 4.9 | 5.2 | 5.3 | 5.5 | 7.0 | 9.5 |
| y | −   | −   | +   | +   | +   | −   | −   | +   | −   | −   |

Consider the training set for the loan classification problem shown in Figure 4.8. Use the MVDM measure to compute the distance between every pair of attribute values for the Home Owner and Marital Status attributes.

14.ForeachoftheBooleanfunctionsgivenbelow,statewhethertheproblemislinearlyseparable.

a. AANDBANDC

b. NOTAANDB

c. (AORB)AND(AORC)

d. (AXORB)AND(AORB)

15.

a. DemonstratehowtheperceptronmodelcanbeusedtorepresenttheANDandORfunctionsbetweenapairofBooleanvariables.

b. Commentonthedisadvantageofusinglinearfunctionsasactivationfunctionsformulti-layerneuralnetworks.

16.Youareaskedtoevaluatetheperformanceoftwoclassificationmodels,and .Thetestsetyouhavechosencontains26binaryattributes,

labeledasAthroughZ.Table4.13 showstheposteriorprobabilitiesobtainedbyapplyingthemodelstothetestset.(Onlytheposteriorprobabilitiesforthepositiveclassareshown).Asthisisatwo-classproblem,

and .Assumethatwearemostlyinterestedindetectinginstancesfromthepositiveclass.

a. PlottheROCcurveforboth and .(Youshouldplotthemonthesamegraph.)Whichmodeldoyouthinkisbetter?Explainyourreasons.


M1 M2

P(−)=1−P(+) P(−|A,…,Z)=1−P(+|A,…,Z)

M1 M2

b. Formodel ,supposeyouchoosethecutoffthresholdtobe .Inotherwords,anytestinstanceswhoseposteriorprobabilityisgreaterthantwillbeclassifiedasapositiveexample.Computetheprecision,recall,andF-measureforthemodelatthisthresholdvalue.

c. Repeattheanalysisforpart(b)usingthesamecutoffthresholdonmodel.ComparetheF-measureresultsforbothmodels.Whichmodelis

better?AretheresultsconsistentwithwhatyouexpectfromtheROCcurve?

d. Repeatpart(b)formodel usingthethreshold .Whichthresholddoyouprefer, or ?AretheresultsconsistentwithwhatyouexpectfromtheROCcurve?

Table4.13.PosteriorprobabilitiesforExercise16.

Instance TrueClass

1 0.73 0.61

2 0.69 0.03

3 0.44 0.68

4 0.55 0.31

5 0.67 0.45

6 0.47 0.09

7 0.08 0.38

8 0.15 0.05

9 0.45 0.01

10 0.35 0.04

M1 t=0.5

M2

M1 t=0.1t=0.5 t=0.1

P(+|A,…,Z,M1) P(+|A,…,Z,M2)

+

+

+

+

+

17.Followingisadatasetthatcontainstwoattributes,XandY,andtwoclasslabels,“ ”and“ ”.Eachattributecantakethreedifferentvalues:0,1,or2.

X Y NumberofInstances

0 0 0 100

1 0 0 0

2 0 0 100

0 1 10 100

1 1 10 0

2 1 10 100

0 2 0 100

1 2 0 0

2 2 0 100

Theconceptforthe“ ”classis andtheconceptforthe“ ”classis.

a. Buildadecisiontreeonthedataset.Doesthetreecapturethe“ ”and“ ”concepts?

b. Whataretheaccuracy,precision,recall,and -measureofthedecisiontree?(Notethatprecision,recall,and -measurearedefinedwith

+ −

+ −

+ Y=1 −X=0∨X=2

+−

F1F1

respecttothe“ ”class.)

c. Buildanewdecisiontreewiththefollowingcostfunction:

(Hint:onlytheleavesoftheolddecisiontreeneedtobechanged.)Doesthedecisiontreecapturethe“ ”concept?

d. Whataretheaccuracy,precision,recall,and -measureofthenewdecisiontree?

18.Considerthetaskofbuildingaclassifierfromrandomdata,wheretheattributevaluesaregeneratedrandomlyirrespectiveoftheclasslabels.Assumethedatasetcontainsinstancesfromtwoclasses,“ ”and“ .”Halfofthedatasetisusedfortrainingwhiletheremaininghalfisusedfortesting.

a. Supposethereareanequalnumberofpositiveandnegativeinstancesinthedataandthedecisiontreeclassifierpredictseverytestinstancetobepositive.Whatistheexpectederrorrateoftheclassifieronthetestdata?

b. Repeatthepreviousanalysisassumingthattheclassifierpredictseachtestinstancetobepositiveclasswithprobability0.8andnegativeclasswithprobability0.2.

c. Supposetwo-thirdsofthedatabelongtothepositiveclassandtheremainingone-thirdbelongtothenegativeclass.Whatistheexpectederrorofaclassifierthatpredictseverytestinstancetobepositive?

d. Repeatthepreviousanalysisassumingthattheclassifierpredictseachtestinstancetobepositiveclasswithprobability2/3andnegativeclasswithprobability1/3.

+

C(i,j)={0,ifi=j;1,ifi=+,j=−;Numberof−instancesNumberof+instancesifi=−,j=+;

+

F1

+ −

19. Derive the dual Lagrangian for the linear SVM with non-separable data where the objective function is

f(w) = ‖w‖²/2 + C (∑_{i=1}^{N} ξ_i)².

20. Consider the XOR problem where there are four training points:

(1, 1, −), (1, 0, +), (0, 1, +), (0, 0, −).

Transform the data into the following feature space:

Φ = (1, √2 x_1, √2 x_2, √2 x_1 x_2, x_1², x_2²).

Find the maximum margin linear decision boundary in the transformed space.

21. Given the data sets shown in Figure 4.59, explain how the decision tree, naïve Bayes, and k-nearest neighbor classifiers would perform on these data sets.

Figure4.59.DatasetforExercise21.

5AssociationAnalysis:BasicConceptsandAlgorithms

Manybusinessenterprisesaccumulatelargequantitiesofdatafromtheirday-to-dayoperations.Forexample,hugeamountsofcustomerpurchasedataarecollecteddailyatthecheckoutcountersofgrocerystores.Table5.1 givesanexampleofsuchdata,commonlyknownasmarketbaskettransactions.Eachrowinthistablecorrespondstoatransaction,whichcontainsauniqueidentifierlabeledTIDandasetofitemsboughtbyagivencustomer.Retailersareinterestedinanalyzingthedatatolearnaboutthepurchasingbehavioroftheircustomers.Suchvaluableinformationcanbeusedtosupportavarietyofbusiness-relatedapplicationssuchasmarketingpromotions,inventorymanagement,andcustomerrelationshipmanagement.

Table5.1.Anexampleofmarketbaskettransactions.

TID Items

1 {Bread,Milk}

2 {Bread,Diapers,Beer,Eggs}

3 {Milk,Diapers,Beer,Cola}

4 {Bread,Milk,Diapers,Beer}

5 {Bread,Milk,Diapers,Cola}

Thischapterpresentsamethodologyknownasassociationanalysis,whichisusefulfordiscoveringinterestingrelationshipshiddeninlargedatasets.Theuncoveredrelationshipscanberepresentedintheformofsetsofitemspresentinmanytransactions,whichareknownasfrequentitemsets,orassociationrules,thatrepresentrelationshipsbetweentwoitemsets.Forexample,thefollowingrulecanbeextractedfromthedatasetshowninTable5.1 :

Therulesuggestsarelationshipbetweenthesaleofdiapersandbeerbecausemanycustomerswhobuydiapersalsobuybeer.Retailerscanusethesetypesofrulestohelpthemidentifynewopportunitiesforcross-sellingtheirproductstothecustomers.

{Diapers}→{Beer}.

Besidesmarketbasketdata,associationanalysisisalsoapplicabletodatafromotherapplicationdomainssuchasbioinformatics,medicaldiagnosis,webmining,andscientificdataanalysis.IntheanalysisofEarthsciencedata,forexample,associationpatternsmayrevealinterestingconnectionsamongtheocean,land,andatmosphericprocesses.SuchinformationmayhelpEarthscientistsdevelopabetterunderstandingofhowthedifferentelementsoftheEarthsysteminteractwitheachother.Eventhoughthetechniquespresentedherearegenerallyapplicabletoawidervarietyofdatasets,forillustrativepurposes,ourdiscussionwillfocusmainlyonmarketbasketdata.

Therearetwokeyissuesthatneedtobeaddressedwhenapplyingassociationanalysistomarketbasketdata.First,discoveringpatternsfromalargetransactiondatasetcanbecomputationallyexpensive.Second,someofthediscoveredpatternsmaybespurious(happensimplybychance)andevenfornon-spuriouspatterns,somearemoreinterestingthanothers.Theremainderofthischapterisorganizedaroundthesetwoissues.Thefirstpartofthechapterisdevotedtoexplainingthebasicconceptsofassociationanalysisandthealgorithmsusedtoefficientlyminesuchpatterns.Thesecondpartofthechapterdealswiththeissueofevaluatingthediscoveredpatternsinordertohelppreventthegenerationofspuriousresultsandtorankthepatternsintermsofsomeinterestingnessmeasure.

5.1PreliminariesThissectionreviewsthebasicterminologyusedinassociationanalysisandpresentsaformaldescriptionofthetask.

BinaryRepresentation

MarketbasketdatacanberepresentedinabinaryformatasshowninTable5.2 ,whereeachrowcorrespondstoatransactionandeachcolumncorrespondstoanitem.Anitemcanbetreatedasabinaryvariablewhosevalueisoneiftheitemispresentinatransactionandzerootherwise.Becausethepresenceofaniteminatransactionisoftenconsideredmoreimportantthanitsabsence,anitemisanasymmetricbinaryvariable.Thisrepresentationisasimplisticviewofrealmarketbasketdatabecauseitignoresimportantaspectsofthedatasuchasthequantityofitemssoldorthepricepaidtopurchasethem.Methodsforhandlingsuchnon-binarydatawillbeexplainedinChapter6 .

Table5.2.Abinary0/1representationofmarketbasketdata.

TID Bread Milk Diapers Beer Eggs Cola

1 1 1 0 0 0 0

2 1 0 1 1 1 0

3 0 1 1 1 0 1

4 1 1 1 1 0 0

5 1 1 1 0 0 1

ItemsetandSupportCount

Let I = {i1, i2, …, id} be the set of all items in a market basket data set and T = {t1, t2, …, tN} be the set of all transactions. Each transaction ti contains a subset of items chosen from I. In association analysis, a collection of zero or more items is termed an itemset. If an itemset contains k items, it is called a k-itemset. For instance, {Beer, Diapers, Milk} is an example of a 3-itemset. The null (or empty) set is an itemset that does not contain any items.

A transaction tj is said to contain an itemset X if X is a subset of tj. For example, the second transaction shown in Table 5.2 contains the itemset {Bread, Diapers} but not {Bread, Milk}. An important property of an itemset is its support count, which refers to the number of transactions that contain a particular itemset. Mathematically, the support count, σ(X), for an itemset X can be stated as follows:

σ(X) = |{ti | X ⊆ ti, ti ∈ T}|,

where the symbol |·| denotes the number of elements in a set. In the data set shown in Table 5.2, the support count for {Beer, Diapers, Milk} is equal to two because there are only two transactions that contain all three items.

Often, the property of interest is the support, which is the fraction of transactions in which an itemset occurs:

s(X) = σ(X)/N.

An itemset X is called frequent if s(X) is greater than some user-defined threshold, minsup.
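The support count and support can be computed directly from the transactions of Table 5.1, as in the following sketch; the code is not from the text, and representing each transaction as a Python set is only an implementation choice.

```python
# A minimal sketch of computing sigma(X) and s(X) for the transactions of Table 5.1.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]

def support_count(itemset, transactions):
    """sigma(X): number of transactions that contain every item of X."""
    return sum(1 for t in transactions if itemset <= t)   # <= tests the subset relation

X = {"Beer", "Diapers", "Milk"}
sigma = support_count(X, transactions)
s = sigma / len(transactions)
print(sigma, s)        # 2 and 0.4, matching the discussion above
```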

AssociationRule

Association Rule

An association rule is an implication expression of the form X → Y, where X and Y are disjoint itemsets, i.e., X ∩ Y = ∅. The strength of an association rule can be measured in terms of its support and confidence. Support determines how often a rule is applicable to a given data set, while confidence determines how frequently items in Y appear in transactions that contain X. The formal definitions of these metrics are

Support, s(X → Y) = σ(X ∪ Y) / N;     (5.1)

Confidence, c(X → Y) = σ(X ∪ Y) / σ(X).     (5.2)

Example 5.1. Consider the rule {Milk, Diapers} → {Beer}. Because the support count for {Milk, Diapers, Beer} is 2 and the total number of transactions is 5, the rule's support is 2/5 = 0.4. The rule's confidence is obtained by dividing the support count for {Milk, Diapers, Beer} by the support count for {Milk, Diapers}. Since there are 3 transactions that contain milk and diapers, the confidence for this rule is 2/3 = 0.67.

Why Use Support and Confidence?

Support is an important measure because a rule that has very low support might occur simply by chance. Also, from a business perspective a low support rule is unlikely to be interesting because it might not be profitable to promote items that customers seldom buy together (with the exception of the situation described in Section 5.8). For these reasons, we are interested in finding rules whose support is greater than some user-defined threshold. As

will be shown in Section 5.2.1, support also has a desirable property that can be exploited for the efficient discovery of association rules.

Confidence, on the other hand, measures the reliability of the inference made by a rule. For a given rule X → Y, the higher the confidence, the more likely it is for Y to be present in transactions that contain X. Confidence also provides an estimate of the conditional probability of Y given X.

Association analysis results should be interpreted with caution. The inference made by an association rule does not necessarily imply causality. Instead, it can sometimes suggest a strong co-occurrence relationship between items in the antecedent and consequent of the rule. Causality, on the other hand, requires knowledge about which attributes in the data capture cause and effect, and typically involves relationships occurring over time (e.g., greenhouse gas emissions lead to global warming). See Section 5.7.1 for additional discussion.
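As a small illustration of Equations 5.1 and 5.2, the sketch below (not from the text) recomputes the support and confidence of the rule from Example 5.1 on the transactions of Table 5.1.

```python
# A minimal sketch computing the support and confidence of {Milk, Diapers} -> {Beer}.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]

def sigma(itemset):
    """Support count: number of transactions containing the itemset."""
    return sum(1 for t in transactions if itemset <= t)

X, Y = {"Milk", "Diapers"}, {"Beer"}
support = sigma(X | Y) / len(transactions)      # Equation 5.1
confidence = sigma(X | Y) / sigma(X)            # Equation 5.2
print(round(support, 2), round(confidence, 2))  # 0.4 and 0.67, as in Example 5.1
```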

Formulation of the Association Rule Mining Problem

The association rule mining problem can be formally stated as follows:

Definition 5.1. (Association Rule Discovery.) Given a set of transactions T, find all the rules having support ≥ minsup and confidence ≥ minconf, where minsup and minconf are the corresponding support and confidence thresholds.

A brute-force approach for mining association rules is to compute the support and confidence for every possible rule. This approach is prohibitively expensive because there are exponentially many rules that can be extracted from a data set. More specifically, assuming that neither the left nor the right-hand side of the rule is an empty set, the total number of possible rules, R, extracted from a data set that contains d items is

R = 3^d − 2^(d+1) + 1.     (5.3)

The proof for this equation is left as an exercise to the readers (see Exercise 5 on page 440). Even for the small data set shown in Table 5.1, this approach requires us to compute the support and confidence for 3^6 − 2^7 + 1 = 602 rules. More than 80% of the rules are discarded after applying minsup = 20% and minconf = 50%, thus wasting most of the computations. To avoid performing needless computations, it would be useful to prune the rules early without having to compute their support and confidence values.

An initial step toward improving the performance of association rule mining algorithms is to decouple the support and confidence requirements. From Equation 5.1, notice that the support of a rule X → Y is the same as the support of its corresponding itemset, X ∪ Y. For example, the following rules have identical support because they involve items from the same itemset, {Beer, Diapers, Milk}:

{Beer, Diapers} → {Milk},    {Beer, Milk} → {Diapers},    {Diapers, Milk} → {Beer},
{Beer} → {Diapers, Milk},    {Milk} → {Beer, Diapers},    {Diapers} → {Beer, Milk}.

Iftheitemsetisinfrequent,thenallsixcandidaterulescanbeprunedimmediatelywithoutourhavingtocomputetheirconfidencevalues.

Therefore,acommonstrategyadoptedbymanyassociationruleminingalgorithmsistodecomposetheproblemintotwomajorsubtasks:

1. FrequentItemsetGeneration,whoseobjectiveistofindalltheitemsetsthatsatisfytheminsupthreshold.

2. RuleGeneration,whoseobjectiveistoextractallthehighconfidencerulesfromthefrequentitemsetsfoundinthepreviousstep.Theserulesarecalledstrongrules.

Thecomputationalrequirementsforfrequentitemsetgenerationaregenerallymoreexpensivethanthoseofrulegeneration.EfficienttechniquesforgeneratingfrequentitemsetsandassociationrulesarediscussedinSections5.2 and5.3 ,respectively.

5.2 Frequent Itemset Generation

A lattice structure can be used to enumerate the list of all possible itemsets. Figure 5.1 shows an itemset lattice for I = {a, b, c, d, e}. In general, a data set that contains k items can potentially generate up to 2^k − 1 frequent itemsets, excluding the null set. Because k can be very large in many practical applications, the search space of itemsets that need to be explored is exponentially large.

Figure 5.1. An itemset lattice.

A brute-force approach for finding frequent itemsets is to determine the support count for every candidate itemset in the lattice structure. To do this, we need to compare each candidate against every transaction, an operation that is shown in Figure 5.2. If the candidate is contained in a transaction, its support count will be incremented. For example, the support for {Bread, Milk} is incremented three times because the itemset is contained in transactions 1, 4, and 5. Such an approach can be very expensive because it requires O(NMw) comparisons, where N is the number of transactions, $M = 2^k - 1$ is the number of candidate itemsets, and w is the maximum transaction width. Transaction width is the number of items present in a transaction.

Figure 5.2. Counting the support of candidate itemsets.
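The brute-force counting scheme of Figure 5.2 can be sketched in a few lines of Python. The transactions below are assumed for illustration and are consistent with the chapter's running market-basket example; they are not the book's code.

```python
from itertools import combinations

# Small transaction set consistent with the running example (assumed here).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]

items = sorted(set().union(*transactions))
# Candidate itemsets of size 2; M grows exponentially with the itemset size.
candidates = [frozenset(c) for c in combinations(items, 2)]

# O(N * M * w): compare every candidate against every transaction.
support_count = {c: 0 for c in candidates}
for t in transactions:
    for c in candidates:
        if c <= t:                       # candidate contained in the transaction
            support_count[c] += 1

print(support_count[frozenset({"Bread", "Milk"})])   # 3, i.e., transactions 1, 4, and 5
```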

There are three main approaches for reducing the computational complexity of frequent itemset generation.

1. Reduce the number of candidate itemsets (M). The Apriori principle, described in the next Section, is an effective way to eliminate some of the candidate itemsets without counting their support values.

2. Reduce the number of comparisons. Instead of matching each candidate itemset against every transaction, we can reduce the number of comparisons by using more advanced data structures, either to store the candidate itemsets or to compress the data set. We will discuss these strategies in Sections 5.2.4 and 5.6, respectively.

3. Reduce the number of transactions (N). As the size of candidate itemsets increases, fewer transactions will be supported by the itemsets. For instance, since the width of the first transaction in Table 5.1 is 2, it would be advantageous to remove this transaction before searching for frequent itemsets of size 3 and larger. Algorithms that employ such a strategy are discussed in the Bibliographic Notes.

5.2.1 The Apriori Principle

This Section describes how the support measure can be used to reduce the number of candidate itemsets explored during frequent itemset generation. The use of support for pruning candidate itemsets is guided by the following principle.

Theorem 5.1 (Apriori Principle). If an itemset is frequent, then all of its subsets must also be frequent.

To illustrate the idea behind the Apriori principle, consider the itemset lattice shown in Figure 5.3. Suppose {c, d, e} is a frequent itemset. Clearly, any transaction that contains {c, d, e} must also contain its subsets, {c, d}, {c, e}, {d, e}, {c}, {d}, and {e}. As a result, if {c, d, e} is frequent, then all subsets of {c, d, e} (i.e., the shaded itemsets in this figure) must also be frequent.

Figure 5.3. An illustration of the Apriori principle. If {c, d, e} is frequent, then all subsets of this itemset are frequent.

Conversely, if an itemset such as {a, b} is infrequent, then all of its supersets must be infrequent too. As illustrated in Figure 5.4, the entire subgraph containing the supersets of {a, b} can be pruned immediately once {a, b} is found to be infrequent. This strategy of trimming the exponential search space based on the support measure is known as support-based pruning. Such a pruning strategy is made possible by a key property of the support measure, namely, that the support for an itemset never exceeds the support for its subsets. This property is also known as the anti-monotone property of the support measure.

Figure 5.4. An illustration of support-based pruning. If {a, b} is infrequent, then all supersets of {a, b} are infrequent.

Definition 5.2. (Anti-monotone Property.) A measure f possesses the anti-monotone property if for every itemset X that is a proper subset of itemset Y, i.e., $X \subset Y$, we have $f(Y) \le f(X)$.

More generally, a large number of measures (see Section 5.7.1) can be applied to itemsets to evaluate various properties of itemsets. As will be shown in the next Section, any measure that has the anti-monotone property can be incorporated directly into an itemset mining algorithm to effectively prune the exponential search space of candidate itemsets.
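To make the anti-monotone property of support concrete, the short check below (ours, on the same assumed toy transactions as before) verifies that the support of an itemset never exceeds the support of any of its proper subsets.

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]

def support(itemset):
    return sum(1 for t in transactions if set(itemset) <= t)

Y = {"Bread", "Diapers", "Milk"}
for k in range(1, len(Y)):
    for X in combinations(Y, k):          # every proper subset X of Y
        assert support(Y) <= support(X)   # support(Y) can never exceed support(X)
print("anti-monotone property holds for all proper subsets of", Y)
```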

5.2.2 Frequent Itemset Generation in the Apriori Algorithm

Apriori is the first association rule mining algorithm that pioneered the use of support-based pruning to systematically control the exponential growth of candidate itemsets. Figure 5.5 provides a high-level illustration of the frequent itemset generation part of the Apriori algorithm for the transactions shown in Table 5.1. We assume that the support threshold is 60%, which is equivalent to a minimum support count equal to 3.

Figure 5.5. Illustration of frequent itemset generation using the Apriori algorithm.

Initially, every item is considered as a candidate 1-itemset. After counting their supports, the candidate itemsets {Cola} and {Eggs} are discarded because they appear in fewer than three transactions. In the next iteration, candidate 2-itemsets are generated using only the frequent 1-itemsets because the Apriori principle ensures that all supersets of the infrequent 1-itemsets must be infrequent. Because there are only four frequent 1-itemsets, the number of candidate 2-itemsets generated by the algorithm is $\binom{4}{2} = 6$. Two of these six candidates, {Beer, Bread} and {Beer, Milk}, are subsequently found to be infrequent after computing their support values. The remaining four candidates are frequent, and thus will be used to generate candidate 3-itemsets. Without support-based pruning, there are $\binom{6}{3} = 20$ candidate 3-itemsets that can be formed using the six items given in this example. With the Apriori principle, we only need to keep candidate 3-itemsets whose subsets are frequent. The only candidate that has this property is {Bread, Diapers, Milk}. However, even though the subsets of {Bread, Diapers, Milk} are frequent, the itemset itself is not.

The effectiveness of the Apriori pruning strategy can be shown by counting the number of candidate itemsets generated. A brute-force strategy of enumerating all itemsets (up to size 3) as candidates will produce

$$\binom{6}{1} + \binom{6}{2} + \binom{6}{3} = 6 + 15 + 20 = 41$$

candidates. With the Apriori principle, this number decreases to

$$\binom{6}{1} + \binom{4}{2} + 1 = 6 + 6 + 1 = 13$$

candidates, which represents a 68% reduction in the number of candidate itemsets even in this simple example.

The pseudocode for the frequent itemset generation part of the Apriori algorithm is shown in Algorithm 5.1. Let $C_k$ denote the set of candidate k-itemsets and $F_k$ denote the set of frequent k-itemsets:

Algorithm 5.1 Frequent itemset generation of the Apriori algorithm.
1: k = 1.
2: $F_k = \{\, i \mid i \in I \wedge \sigma(\{i\}) \ge N \times minsup \,\}$.  {Find all frequent 1-itemsets}
3: repeat
4:   k = k + 1.
5:   $C_k$ = candidate-gen($F_{k-1}$).  {Generate candidate itemsets}
6:   $C_k$ = candidate-prune($C_k$, $F_{k-1}$).  {Prune candidate itemsets}
7:   for each transaction $t \in T$ do
8:     $C_t$ = subset($C_k$, t).  {Identify all candidates that belong to t}
9:     for each candidate itemset $c \in C_t$ do
10:      $\sigma(c) = \sigma(c) + 1$.  {Increment support count}
11:    end for
12:  end for
13:  $F_k = \{\, c \mid c \in C_k \wedge \sigma(c) \ge N \times minsup \,\}$.  {Extract the frequent k-itemsets}
14: until $F_k = \emptyset$
15: Result = $\bigcup_k F_k$.

The algorithm initially makes a single pass over the data set to determine the support of each item. Upon completion of this step, the set of all frequent 1-itemsets, $F_1$, will be known (steps 1 and 2). Next, the algorithm will iteratively generate new candidate k-itemsets and prune unnecessary candidates that are guaranteed to be infrequent given the frequent (k−1)-itemsets found in the previous iteration (steps 5 and 6). Candidate generation and pruning is implemented using the functions candidate-gen and candidate-prune, which are described in Section 5.2.3.

To count the support of the candidates, the algorithm needs to make an additional pass over the data set (steps 7–12). The subset function is used to determine all the candidate itemsets in $C_k$ that are contained in each transaction t. The implementation of this function is described in Section 5.2.4. After counting their supports, the algorithm eliminates all candidate itemsets whose support counts are less than $N \times minsup$ (step 13). The algorithm terminates when there are no new frequent itemsets generated, i.e., $F_k = \emptyset$ (step 14).

The frequent itemset generation part of the Apriori algorithm has two important characteristics. First, it is a level-wise algorithm; i.e., it traverses the itemset lattice one level at a time, from frequent 1-itemsets to the maximum size of frequent itemsets. Second, it employs a generate-and-test strategy for finding frequent itemsets. At each iteration (level), new candidate itemsets are generated from the frequent itemsets found in the previous iteration. The support for each candidate is then counted and tested against the minsup threshold. The total number of iterations needed by the algorithm is $k_{max}+1$, where $k_{max}$ is the maximum size of the frequent itemsets.

5.2.3 Candidate Generation and Pruning

The candidate-gen and candidate-prune functions shown in Steps 5 and 6 of Algorithm 5.1 generate candidate itemsets and prune unnecessary ones by performing the following two operations, respectively:

1. Candidate Generation. This operation generates new candidate k-itemsets based on the frequent (k−1)-itemsets found in the previous iteration.

2. Candidate Pruning. This operation eliminates some of the candidate k-itemsets using support-based pruning, i.e., by removing k-itemsets whose subsets are known to be infrequent in previous iterations. Note that this pruning is done without computing the actual support of these k-itemsets (which could have required comparing them against each transaction).
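Putting these pieces together, the following Python sketch (ours, not the book's implementation) mirrors the level-wise, generate-and-test structure of Algorithm 5.1. For brevity, candidate generation simply unions pairs of frequent (k−1)-itemsets whose union has size k and then prunes candidates with an infrequent (k−1)-subset.

```python
from itertools import combinations

def apriori(transactions, minsup_count):
    """Level-wise frequent itemset generation in the spirit of Algorithm 5.1."""
    items = sorted(set().union(*transactions))
    # Frequent 1-itemsets.
    F = [{frozenset([i]) for i in items
          if sum(1 for t in transactions if i in t) >= minsup_count}]
    k = 1
    while F[-1]:
        k += 1
        prev = F[-1]
        # Candidate generation: union pairs of frequent (k-1)-itemsets of size k.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Candidate pruning: keep candidates whose (k-1)-subsets are all frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # Support counting followed by the minsup filter.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        F.append({c for c, n in counts.items() if n >= minsup_count})
    return set().union(*F)

transactions = [{"Bread", "Milk"}, {"Bread", "Diapers", "Beer", "Eggs"},
                {"Milk", "Diapers", "Beer", "Cola"},
                {"Bread", "Milk", "Diapers", "Beer"},
                {"Bread", "Milk", "Diapers", "Cola"}]
print(sorted(map(sorted, apriori(transactions, minsup_count=3))))
```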

Candidate Generation

In principle, there are many ways to generate candidate itemsets. An effective candidate generation procedure must be complete and non-redundant. A candidate generation procedure is said to be complete if it does not omit any frequent itemsets. To ensure completeness, the set of candidate itemsets must subsume the set of all frequent itemsets, i.e., $\forall k: F_k \subseteq C_k$. A candidate generation procedure is non-redundant if it does not generate the same candidate itemset more than once. For example, the candidate itemset {a, b, c, d} can be generated in many ways, e.g., by merging {a, b, c} with {d}, {b, d} with {a, c}, {c} with {a, b, d}, etc. Generation of duplicate candidates leads to wasted computations and thus should be avoided for efficiency reasons. Also, an effective candidate generation procedure should avoid generating too many unnecessary candidates. A candidate itemset is unnecessary if at least one of its subsets is infrequent, and thus, eliminated in the candidate pruning step.

Next, we will briefly describe several candidate generation procedures, including the one used by the candidate-gen function.

Brute-Force Method

The brute-force method considers every k-itemset as a potential candidate and then applies the candidate pruning step to remove any unnecessary candidates whose subsets are infrequent (see Figure 5.6). The number of candidate itemsets generated at level k is equal to $\binom{d}{k}$, where d is the total number of items. Although candidate generation is rather trivial, candidate pruning becomes extremely expensive because a large number of itemsets must be examined.

Figure 5.6. A brute-force method for generating candidate 3-itemsets.

$F_{k-1} \times F_1$ Method

An alternative method for candidate generation is to extend each frequent (k−1)-itemset with frequent items that are not part of the (k−1)-itemset. Figure 5.7 illustrates how a frequent 2-itemset such as {Beer, Diapers} can be augmented with a frequent item such as Bread to produce a candidate 3-itemset {Beer, Bread, Diapers}.

Figure 5.7. Generating and pruning candidate k-itemsets by merging a frequent (k−1)-itemset with a frequent item. Note that some of the candidates are unnecessary because their subsets are infrequent.

The procedure is complete because every frequent k-itemset is composed of a frequent (k−1)-itemset and a frequent 1-itemset. Therefore, all frequent k-itemsets are part of the candidate k-itemsets generated by this procedure. Figure 5.7 shows that the $F_{k-1} \times F_1$ candidate generation method only produces four candidate 3-itemsets, instead of the $\binom{6}{3} = 20$ itemsets produced by the brute-force method. The $F_{k-1} \times F_1$ method generates a lower number of candidates because every candidate is guaranteed to contain at least one frequent (k−1)-itemset. While this procedure is a substantial improvement over the brute-force method, it can still produce a large number of unnecessary candidates, as the remaining subsets of a candidate itemset can still be infrequent.

Note that the approach discussed above does not prevent the same candidate itemset from being generated more than once. For instance, {Bread, Diapers, Milk} can be generated by merging {Bread, Diapers} with {Milk}, {Bread, Milk} with {Diapers}, or {Diapers, Milk} with {Bread}. One way to avoid generating duplicate candidates is by ensuring that the items in each frequent itemset are kept sorted in their lexicographic order. For example, itemsets such as {Bread, Diapers}, {Bread, Diapers, Milk}, and {Diapers, Milk} follow lexicographic order as the items within every itemset are arranged alphabetically. Each frequent (k−1)-itemset X is then extended with frequent items that are lexicographically larger than the items in X. For example, the itemset {Bread, Diapers} can be augmented with {Milk} because Milk is lexicographically larger than Bread and Diapers. However, we should not augment {Diapers, Milk} with {Bread} nor {Bread, Milk} with {Diapers} because they violate the lexicographic ordering condition. Every candidate k-itemset is thus generated exactly once, by merging the lexicographically largest item with the remaining k−1 items in the itemset. If the $F_{k-1} \times F_1$ method is used in conjunction with lexicographic ordering, then only two candidate 3-itemsets will be produced in the example illustrated in Figure 5.7. {Beer, Bread, Diapers} and {Beer, Bread, Milk} will not be generated because {Beer, Bread} is not a frequent 2-itemset.

$F_{k-1} \times F_{k-1}$ Method

This candidate generation procedure, which is used in the candidate-gen function of the Apriori algorithm, merges a pair of frequent (k−1)-itemsets only if their first k−2 items, arranged in lexicographic order, are identical. Let $A = \{a_1, a_2, \ldots, a_{k-1}\}$ and $B = \{b_1, b_2, \ldots, b_{k-1}\}$ be a pair of frequent (k−1)-itemsets, arranged lexicographically. A and B are merged if they satisfy the following condition:

$$a_i = b_i \quad (\text{for } i = 1, 2, \ldots, k-2).$$

Note that in this case, $a_{k-1} \neq b_{k-1}$ because A and B are two distinct itemsets. The candidate k-itemset generated by merging A and B consists of the first k−2 common items followed by $a_{k-1}$ and $b_{k-1}$ in lexicographic order. This candidate generation procedure is complete, because for every lexicographically ordered frequent k-itemset, there exist two lexicographically ordered frequent (k−1)-itemsets that have identical items in the first k−2 positions.

In Figure 5.8, the frequent itemsets {Bread, Diapers} and {Bread, Milk} are merged to form a candidate 3-itemset {Bread, Diapers, Milk}. The algorithm does not have to merge {Beer, Diapers} with {Diapers, Milk} because the first item in both itemsets is different. Indeed, if {Beer, Diapers, Milk} is a viable candidate, it would have been obtained by merging {Beer, Diapers} with {Beer, Milk} instead. This example illustrates both the completeness of the candidate generation procedure and the advantages of using lexicographic ordering to prevent duplicate candidates. Also, if we order the frequent (k−1)-itemsets according to their lexicographic rank, itemsets with identical first k−2 items would take consecutive ranks. As a result, the $F_{k-1} \times F_{k-1}$ candidate generation method would consider merging a frequent itemset only with ones that occupy the next few ranks in the sorted list, thus saving some computations.

Figure 5.8. Generating and pruning candidate k-itemsets by merging pairs of frequent (k−1)-itemsets.

Figure 5.8 shows that the $F_{k-1} \times F_{k-1}$ candidate generation procedure results in only one candidate 3-itemset. This is a considerable reduction from the four candidate 3-itemsets generated by the $F_{k-1} \times F_1$ method. This is because the $F_{k-1} \times F_{k-1}$ method ensures that every candidate k-itemset contains at least two frequent (k−1)-itemsets, thus greatly reducing the number of candidates that are generated in this step.

Note that there can be multiple ways of merging two frequent (k−1)-itemsets in the $F_{k-1} \times F_{k-1}$ procedure, one of which is merging if their first k−2 items are identical. An alternate approach could be to merge two frequent (k−1)-itemsets A and B if the last k−2 items of A are identical to the first k−2 items of B. For example, {Bread, Diapers} and {Diapers, Milk} could be merged using this approach to generate the candidate 3-itemset {Bread, Diapers, Milk}. As we will see later, this alternate $F_{k-1} \times F_{k-1}$ procedure is useful in generating sequential patterns, which will be discussed in Chapter 6.
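A direct rendering of the $F_{k-1} \times F_{k-1}$ merging condition, assuming frequent itemsets are stored as lexicographically sorted tuples, is sketched below (our code, not the book's):

```python
from itertools import combinations

def candidate_gen(freq_prev):
    """Merge pairs of frequent (k-1)-itemsets whose first k-2 items are identical."""
    freq_prev = sorted(freq_prev)          # lexicographically sorted tuples
    candidates = []
    for i, a in enumerate(freq_prev):
        for b in freq_prev[i + 1:]:
            if a[:-1] == b[:-1]:           # identical first k-2 items
                candidates.append(a + (b[-1],))
            else:                          # sorted order: no later b can share a's prefix
                break
    return candidates

def candidate_prune(candidates, freq_prev):
    """Keep candidates all of whose (k-1)-subsets are frequent."""
    prev = set(freq_prev)
    return [c for c in candidates
            if all(s in prev for s in combinations(c, len(c) - 1))]

F2 = [("Beer", "Diapers"), ("Bread", "Diapers"), ("Bread", "Milk"), ("Diapers", "Milk")]
C3 = candidate_prune(candidate_gen(F2), F2)
print(C3)   # [('Bread', 'Diapers', 'Milk')] -- a single candidate 3-itemset, as in the text
```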

Candidate Pruning

To illustrate the candidate pruning operation for a candidate k-itemset, $X = \{i_1, i_2, \ldots, i_k\}$, consider its k proper subsets, $X - \{i_j\}\ (\forall j = 1, 2, \ldots, k)$. If any of them are infrequent, then X is immediately pruned by using the Apriori principle. Note that we don't need to explicitly ensure that all subsets of X of size less than k−1 are frequent (see Exercise 7). This approach greatly reduces the number of candidate itemsets considered during support counting. For the brute-force candidate generation method, candidate pruning requires checking only k subsets of size k−1 for each candidate k-itemset. However, since the $F_{k-1} \times F_1$ candidate generation strategy ensures that at least one of the (k−1)-size subsets of every candidate k-itemset is frequent, we only need to check for the remaining k−1 subsets. Likewise, the $F_{k-1} \times F_{k-1}$ strategy requires examining only k−2 subsets of every candidate k-itemset, since two of its (k−1)-size subsets are already known to be frequent in the candidate generation step.

5.2.4 Support Counting

Support counting is the process of determining the frequency of occurrence for every candidate itemset that survives the candidate pruning step. Support counting is implemented in steps 7 through 12 of Algorithm 5.1. A brute-force approach for doing this is to compare each transaction against every candidate itemset (see Figure 5.2) and to update the support counts of candidates contained in a transaction. This approach is computationally expensive, especially when the numbers of transactions and candidate itemsets are large.

An alternative approach is to enumerate the itemsets contained in each transaction and use them to update the support counts of their respective candidate itemsets. To illustrate, consider a transaction t that contains five items, {1, 2, 3, 5, 6}. There are $\binom{5}{3} = 10$ itemsets of size 3 contained in this transaction. Some of the itemsets may correspond to the candidate 3-itemsets under investigation, in which case, their support counts are incremented. Other subsets of t that do not correspond to any candidates can be ignored.

Figure 5.9 shows a systematic way for enumerating the 3-itemsets contained in t. Assuming that each itemset keeps its items in increasing lexicographic order, an itemset can be enumerated by specifying the smallest item first, followed by the larger items. For instance, given $t = \{1, 2, 3, 5, 6\}$, all the 3-itemsets contained in t must begin with item 1, 2, or 3. It is not possible to construct a 3-itemset that begins with items 5 or 6 because there are only two items in t whose labels are greater than or equal to 5. The number of ways to specify the first item of a 3-itemset contained in t is illustrated by the Level 1 prefix tree structure depicted in Figure 5.9. For instance, the node labeled 1 represents a 3-itemset that begins with item 1, followed by two more items chosen from the set {2, 3, 5, 6}.

Figure 5.9. Enumerating subsets of three items from a transaction t.

After fixing the first item, the prefix tree structure at Level 2 represents the number of ways to select the second item. For example, the node labeled 12 corresponds to itemsets that begin with the prefix (1 2) and are followed by the items 3, 5, or 6. Finally, the prefix tree structure at Level 3 represents the complete set of 3-itemsets contained in t. For example, the 3-itemsets that begin with prefix {1 2} are {1, 2, 3}, {1, 2, 5}, and {1, 2, 6}, while those that begin with prefix {2 3} are {2, 3, 5} and {2, 3, 6}.

The prefix tree structure shown in Figure 5.9 demonstrates how itemsets contained in a transaction can be systematically enumerated, i.e., by specifying their items one by one, from the leftmost item to the rightmost item. We still have to determine whether each enumerated 3-itemset corresponds to an existing candidate itemset. If it matches one of the candidates, then the support count of the corresponding candidate is incremented. In the next Section, we illustrate how this matching operation can be performed efficiently using a hash tree structure.
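This enumeration amounts to generating the 3-item subsets of a transaction in lexicographic order and looking each one up among the candidates. A minimal sketch (ours; the candidate 3-itemsets below are hypothetical):

```python
from itertools import combinations

t = (1, 2, 3, 5, 6)                       # the transaction from the text
candidates = {frozenset(c): 0 for c in [(1, 2, 5), (3, 5, 6), (2, 3, 6)]}  # hypothetical C3

# itertools.combinations emits the C(5,3) = 10 subsets in lexicographic order,
# mirroring the prefix-tree enumeration of Figure 5.9.
for subset in combinations(t, 3):
    key = frozenset(subset)
    if key in candidates:                 # matches a candidate 3-itemset
        candidates[key] += 1

print(list(combinations(t, 3))[:3])       # (1, 2, 3), (1, 2, 5), (1, 2, 6)
print(candidates)
```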

Support Counting Using a Hash Tree*

In the Apriori algorithm, candidate itemsets are partitioned into different buckets and stored in a hash tree. During support counting, itemsets contained in each transaction are also hashed into their appropriate buckets. That way, instead of comparing each itemset in the transaction with every candidate itemset, it is matched only against candidate itemsets that belong to the same bucket, as shown in Figure 5.10.

Figure 5.10. Counting the support of itemsets using hash structure.

Figure 5.11 shows an example of a hash tree structure. Each internal node of the tree uses the following hash function, $h(p) = (p - 1) \bmod 3$, where mod refers to the modulo (remainder) operator, to determine which branch of the current node should be followed next. For example, items 1, 4, and 7 are hashed to the same branch (i.e., the leftmost branch) because they have the same remainder after dividing the number by 3. All candidate itemsets are stored at the leaf nodes of the hash tree. The hash tree shown in Figure 5.11 contains 15 candidate 3-itemsets, distributed across 9 leaf nodes.

Figure 5.11. Hashing a transaction at the root node of a hash tree.

Consider the transaction, $t = \{1, 2, 3, 4, 5, 6\}$. To update the support counts of the candidate itemsets, the hash tree must be traversed in such a way that all the leaf nodes containing candidate 3-itemsets belonging to t must be visited at least once. Recall that the 3-itemsets contained in t must begin with items 1, 2, or 3, as indicated by the Level 1 prefix tree structure shown in Figure 5.9. Therefore, at the root node of the hash tree, the items 1, 2, and 3 of the transaction are hashed separately. Item 1 is hashed to the left child of the root node, item 2 is hashed to the middle child, and item 3 is hashed to the right child. At the next level of the tree, the transaction is hashed on the second item listed in the Level 2 tree structure shown in Figure 5.9. For example, after hashing on item 1 at the root node, items 2, 3, and 5 of the transaction are hashed. Based on the hash function, items 2 and 5 are hashed to the middle child, while item 3 is hashed to the right child, as shown in Figure 5.12. This process continues until the leaf nodes of the hash tree are reached. The candidate itemsets stored at the visited leaf nodes are compared against the transaction. If a candidate is a subset of the transaction, its support count is incremented. Note that not all the leaf nodes are visited while traversing the hash tree, which helps in reducing the computational cost. In this example, 5 out of the 9 leaf nodes are visited and 9 out of the 15 itemsets are compared against the transaction.

Figure 5.12. Subset operation on the leftmost subtree of the root of a candidate hash tree.

5.2.5 Computational Complexity

The computational complexity of the Apriori algorithm, which includes both its runtime and storage, can be affected by the following factors.

Support Threshold

Lowering the support threshold often results in more itemsets being declared as frequent. This has an adverse effect on the computational complexity of the algorithm because more candidate itemsets must be generated and counted at every level, as shown in Figure 5.13. The maximum size of frequent itemsets also tends to increase with lower support thresholds. This increases the total number of iterations to be performed by the Apriori algorithm, further increasing the computational cost.

Figure 5.13. Effect of support threshold on the number of candidate and frequent itemsets obtained from a benchmark data set.

Number of Items (Dimensionality)

As the number of items increases, more space will be needed to store the support counts of items. If the number of frequent items also grows with the dimensionality of the data, the runtime and storage requirements will increase because of the larger number of candidate itemsets generated by the algorithm.

Number of Transactions

Because the Apriori algorithm makes repeated passes over the transaction data set, its runtime increases with a larger number of transactions.

Average Transaction Width

For dense data sets, the average transaction width can be very large. This affects the complexity of the Apriori algorithm in two ways. First, the maximum size of frequent itemsets tends to increase as the average transaction width increases. As a result, more candidate itemsets must be examined during candidate generation and support counting, as illustrated in Figure 5.14. Second, as the transaction width increases, more itemsets are contained in the transaction. This will increase the number of hash tree traversals performed during support counting.

A detailed analysis of the time complexity for the Apriori algorithm is presented next.

Figure 5.14. Effect of average transaction width on the number of candidate and frequent itemsets obtained from a synthetic data set.

Generation of frequent 1-itemsets

For each transaction, we need to update the support count for every item present in the transaction. Assuming that w is the average transaction width, this operation requires O(Nw) time, where N is the total number of transactions.

Candidate generation

To generate candidate k-itemsets, pairs of frequent (k−1)-itemsets are merged to determine whether they have at least k−2 items in common. Each merging operation requires at most k−2 equality comparisons. Every merging step can produce at most one viable candidate k-itemset, while in the worst case, the algorithm must try to merge every pair of frequent (k−1)-itemsets found in the previous iteration. Therefore, the overall cost of merging frequent itemsets is

$$\sum_{k=2}^{w}(k-2)\,|C_k| \;<\; \text{Cost of merging} \;<\; \sum_{k=2}^{w}(k-2)\,|F_{k-1}|^2,$$

where w is the maximum transaction width. A hash tree is also constructed during candidate generation to store the candidate itemsets. Because the maximum depth of the tree is k, the cost for populating the hash tree with candidate itemsets is $O\!\left(\sum_{k=2}^{w} k\,|C_k|\right)$. During candidate pruning, we need to verify that the k−2 subsets of every candidate k-itemset are frequent. Since the cost for looking up a candidate in a hash tree is O(k), the candidate pruning step requires $O\!\left(\sum_{k=2}^{w} k(k-2)\,|C_k|\right)$ time.

Support counting

Each transaction of width $|t|$ produces $\binom{|t|}{k}$ itemsets of size k. This is also the effective number of hash tree traversals performed for each transaction. The cost for support counting is $O\!\left(N\sum_{k}\binom{w}{k}\alpha_k\right)$, where w is the maximum transaction width and $\alpha_k$ is the cost for updating the support count of a candidate k-itemset in the hash tree.

5.3 Rule Generation

This Section describes how to extract association rules efficiently from a given frequent itemset. Each frequent k-itemset, Y, can produce up to $2^k - 2$ association rules, ignoring rules that have empty antecedents or consequents ($\emptyset \to Y$ or $Y \to \emptyset$). An association rule can be extracted by partitioning the itemset Y into two non-empty subsets, X and $Y-X$, such that $X \to Y-X$ satisfies the confidence threshold. Note that all such rules must have already met the support threshold because they are generated from a frequent itemset.

Example 5.2. Let $X = \{a, b, c\}$ be a frequent itemset. There are six candidate association rules that can be generated from X: $\{a,b\} \to \{c\}$, $\{a,c\} \to \{b\}$, $\{b,c\} \to \{a\}$, $\{a\} \to \{b,c\}$, $\{b\} \to \{a,c\}$, and $\{c\} \to \{a,b\}$. As each of their support is identical to the support for X, all the rules satisfy the support threshold.

Computing the confidence of an association rule does not require additional scans of the transaction data set. Consider the rule $\{1,2\} \to \{3\}$, which is generated from the frequent itemset $X = \{1, 2, 3\}$. The confidence for this rule is $\sigma(\{1,2,3\})/\sigma(\{1,2\})$. Because {1, 2, 3} is frequent, the anti-monotone property of support ensures that {1, 2} must be frequent, too. Since the support counts for both itemsets were already found during frequent itemset generation, there is no need to read the entire data set again.

5.3.1 Confidence-Based Pruning

Confidence does not show the anti-monotone property in the same way as the support measure. For example, the confidence for $X \to Y$ can be larger, smaller, or equal to the confidence for another rule $\tilde{X} \to \tilde{Y}$, where $\tilde{X} \subseteq X$ and $\tilde{Y} \subseteq Y$ (see Exercise 3 on page 439). Nevertheless, if we compare rules generated from the same frequent itemset Y, the following theorem holds for the confidence measure.

Theorem 5.2. Let Y be an itemset and X a subset of Y. If a rule $X \to Y-X$ does not satisfy the confidence threshold, then any rule $\tilde{X} \to Y-\tilde{X}$, where $\tilde{X}$ is a subset of X, must not satisfy the confidence threshold as well.

To prove this theorem, consider the following two rules: $\tilde{X} \to Y-\tilde{X}$ and $X \to Y-X$, where $\tilde{X} \subset X$. The confidence of the rules are $\sigma(Y)/\sigma(\tilde{X})$ and $\sigma(Y)/\sigma(X)$, respectively. Since $\tilde{X}$ is a subset of X, $\sigma(\tilde{X}) \ge \sigma(X)$. Therefore, the former rule cannot have a higher confidence than the latter rule.

5.3.2 Rule Generation in Apriori Algorithm

The Apriori algorithm uses a level-wise approach for generating association rules, where each level corresponds to the number of items that belong to the rule consequent. Initially, all the high confidence rules that have only one item in the rule consequent are extracted. These rules are then used to generate new candidate rules. For example, if $\{acd\} \to \{b\}$ and $\{abd\} \to \{c\}$ are high confidence rules, then the candidate rule $\{ad\} \to \{bc\}$ is generated by merging the consequents of both rules. Figure 5.15 shows a lattice structure for the association rules generated from the frequent itemset {a, b, c, d}. If any node in the lattice has low confidence, then according to Theorem 5.2, the entire subgraph spanned by the node can be pruned immediately. Suppose the confidence for $\{bcd\} \to \{a\}$ is low. All the rules containing item a in its consequent, including $\{cd\} \to \{ab\}$, $\{bd\} \to \{ac\}$, $\{bc\} \to \{ad\}$, and $\{d\} \to \{abc\}$ can be discarded.

Figure 5.15. Pruning of association rules using the confidence measure.

A pseudocode for the rule generation step is shown in Algorithms 5.2 and 5.3. Note the similarity between the ap-genrules procedure given in Algorithm 5.3 and the frequent itemset generation procedure given in Algorithm 5.1. The only difference is that, in rule generation, we do not have to make additional passes over the data set to compute the confidence of the candidate rules. Instead, we determine the confidence of each rule by using the support counts computed during frequent itemset generation.

Algorithm 5.2 Rule generation of the Apriori algorithm.

Algorithm 5.3 Procedure ap-genrules($f_k$, $H_m$).
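The level-wise rule generation described above can be sketched as follows (our code, not Algorithms 5.2–5.3). It assumes that support counts for all frequent itemsets are already available from the itemset generation phase, and for brevity it enumerates all possible consequents rather than applying the Theorem 5.2 pruning.

```python
from itertools import combinations

def generate_rules(freq_support, minconf):
    """Extract rules X -> Y-X from each frequent itemset Y using stored supports."""
    rules = []
    for Y, supp_Y in freq_support.items():
        if len(Y) < 2:
            continue
        for m in range(1, len(Y)):                     # consequents of size 1, 2, ...
            for consequent in combinations(sorted(Y), m):
                antecedent = Y - frozenset(consequent)
                conf = supp_Y / freq_support[antecedent]   # no extra data-set scan needed
                if conf >= minconf:
                    rules.append((set(antecedent), set(consequent), conf))
    return rules

# Hypothetical support counts for illustration only.
freq_support = {
    frozenset({"Bread"}): 4, frozenset({"Milk"}): 4, frozenset({"Diapers"}): 4,
    frozenset({"Bread", "Milk"}): 3, frozenset({"Diapers", "Milk"}): 3,
}
for lhs, rhs, conf in generate_rules(freq_support, minconf=0.7):
    print(lhs, "->", rhs, round(conf, 2))
```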

5.3.3 An Example: Congressional Voting Records

This Section demonstrates the results of applying association analysis to the voting records of members of the United States House of Representatives. The data is obtained from the 1984 Congressional Voting Records Database, which is available at the UCI machine learning data repository. Each transaction contains information about the party affiliation for a representative along with his or her voting record on 16 key issues. There are 435 transactions and 34 items in the data set. The set of items are listed in Table 5.3.

Table 5.3. List of binary attributes from the 1984 United States Congressional Voting Records. Source: The UCI machine learning repository.

1. Republican
2. Democrat
3. handicapped-infants = yes
4. handicapped-infants = no
5. water project cost sharing = yes
6. water project cost sharing = no
7. budget-resolution = yes
8. budget-resolution = no
9. physician fee freeze = yes
10. physician fee freeze = no
11. aid to El Salvador = yes
12. aid to El Salvador = no
13. religious groups in schools = yes
14. religious groups in schools = no
15. anti-satellite test ban = yes
16. anti-satellite test ban = no
17. aid to Nicaragua = yes
18. aid to Nicaragua = no
19. MX-missile = yes
20. MX-missile = no
21. immigration = yes
22. immigration = no
23. synfuel corporation cutback = yes
24. synfuel corporation cutback = no
25. education spending = yes
26. education spending = no
27. right-to-sue = yes
28. right-to-sue = no
29. crime = yes
30. crime = no
31. duty-free-exports = yes
32. duty-free-exports = no
33. export administration act = yes
34. export administration act = no

The Apriori algorithm is then applied to the data set with minsup = 30% and minconf = 90%. Some of the high confidence rules extracted by the algorithm are shown in Table 5.4. The first two rules suggest that most of the members who voted yes for aid to El Salvador and no for budget resolution and MX missile are Republicans; while those who voted no for aid to El Salvador and yes for budget resolution and MX missile are Democrats. These high confidence rules show the key issues that divide members from both political parties.

Table 5.4. Association rules extracted from the 1984 United States Congressional Voting Records.

Association Rule                                                                         Confidence
{budget resolution = no, MX-missile = no, aid to El Salvador = yes} → {Republican}        91.0%
{budget resolution = yes, MX-missile = yes, aid to El Salvador = no} → {Democrat}         97.5%
{crime = yes, right-to-sue = yes, physician fee freeze = yes} → {Republican}              93.5%
{crime = no, right-to-sue = no, physician fee freeze = no} → {Democrat}                   100%

5.4 Compact Representation of Frequent Itemsets

In practice, the number of frequent itemsets produced from a transaction data set can be very large. It is useful to identify a small representative set of frequent itemsets from which all other frequent itemsets can be derived. Two such representations are presented in this Section in the form of maximal and closed frequent itemsets.

5.4.1 Maximal Frequent Itemsets

Definition 5.3. (Maximal Frequent Itemset.) A frequent itemset is maximal if none of its immediate supersets are frequent.

To illustrate this concept, consider the itemset lattice shown in Figure 5.16. The itemsets in the lattice are divided into two groups: those that are frequent and those that are infrequent. A frequent itemset border, which is represented by a dashed line, is also illustrated in the diagram. Every itemset located above the border is frequent, while those located below the border (the shaded nodes) are infrequent. Among the itemsets residing near the border, {a, d}, {a, c, e}, and {b, c, d, e} are maximal frequent itemsets because all of their immediate supersets are infrequent. For example, the itemset {a, d} is maximal frequent because all of its immediate supersets, {a, b, d}, {a, c, d}, and {a, d, e}, are infrequent. In contrast, {a, c} is non-maximal because one of its immediate supersets, {a, c, e}, is frequent.

Figure 5.16. Maximal frequent itemset.

Maximal frequent itemsets effectively provide a compact representation of frequent itemsets. In other words, they form the smallest set of itemsets from which all frequent itemsets can be derived. For example, every frequent itemset in Figure 5.16 is a subset of one of the three maximal frequent itemsets, {a, d}, {a, c, e}, and {b, c, d, e}. If an itemset is not a proper subset of any of the maximal frequent itemsets, then it is either infrequent (e.g., {a, d, e}) or maximal frequent itself (e.g., {b, c, d, e}). Hence, the maximal frequent itemsets {a, c, e}, {a, d}, and {b, c, d, e} provide a compact representation of the frequent itemsets shown in Figure 5.16. Enumerating all the subsets of maximal frequent itemsets generates the complete list of all frequent itemsets.

Maximal frequent itemsets provide a valuable representation for data sets that can produce very long, frequent itemsets, as there are exponentially many frequent itemsets in such data. Nevertheless, this approach is practical only if an efficient algorithm exists to explicitly find the maximal frequent itemsets. We briefly describe one such approach in Section 5.5.

Despite providing a compact representation, maximal frequent itemsets do not contain the support information of their subsets. For example, the support of the maximal frequent itemsets {a, c, e}, {a, d}, and {b, c, d, e} does not provide any information about the support of their subsets except that it meets the support threshold. An additional pass over the data set is therefore needed to determine the support counts of the non-maximal frequent itemsets. In some cases, it is desirable to have a minimal representation of itemsets that preserves the support information. We describe such a representation in the next Section.
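Given the complete collection of frequent itemsets, the maximal ones can be picked out with a direct (if quadratic) check. A small sketch (ours), with a hypothetical frequent collection:

```python
def maximal_frequent(frequent):
    """Maximal frequent itemsets: no other frequent itemset is a proper superset.

    Assumes `frequent` is the complete collection of frequent itemsets, so this
    check is equivalent to requiring every immediate superset to be infrequent.
    """
    frequent = [frozenset(f) for f in frequent]
    return [f for f in frequent
            if not any(f < g for g in frequent)]   # f < g means proper subset

# Hypothetical frequent collection for illustration only.
frequent = [{"a"}, {"c"}, {"d"}, {"e"}, {"a", "d"}, {"c", "e"},
            {"a", "c"}, {"a", "e"}, {"a", "c", "e"}]
print(sorted(map(sorted, maximal_frequent(frequent))))   # [['a', 'c', 'e'], ['a', 'd']]
```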

5.4.2 Closed Itemsets

Closed itemsets provide a minimal representation of all itemsets without losing their support information. A formal definition of a closed itemset is presented below.

Definition 5.4. (Closed Itemset.) An itemset X is closed if none of its immediate supersets has exactly the same support count as X.

Put another way, X is not closed if at least one of its immediate supersets has the same support count as X. Examples of closed itemsets are shown in Figure 5.17. To better illustrate the support count of each itemset, we have associated each node (itemset) in the lattice with a list of its corresponding transaction IDs. For example, since the node {b, c} is associated with transaction IDs 1, 2, and 3, its support count is equal to three. From the transactions given in this diagram, notice that the support for {b} is identical to {b, c}. This is because every transaction that contains b also contains c. Hence, {b} is not a closed itemset. Similarly, since c occurs in every transaction that contains both a and d, the itemset {a, d} is not closed as it has the same support as its superset {a, c, d}. On the other hand, {b, c} is a closed itemset because it does not have the same support count as any of its supersets.

Figure 5.17. An example of the closed frequent itemsets (with minimum support equal to 40%).

An interesting property of closed itemsets is that if we know their support counts, we can derive the support count of every other itemset in the itemset lattice without making additional passes over the data set. For example, consider the 2-itemset {b, e} in Figure 5.17. Since {b, e} is not closed, its support must be equal to the support of one of its immediate supersets, {a, b, e}, {b, c, e}, and {b, d, e}. Further, none of the supersets of {b, e} can have a support greater than the support of {b, e}, due to the anti-monotone nature of the support measure. Hence, the support of {b, e} can be computed by examining the support counts of all of its immediate supersets of size three and taking their maximum value. If an immediate superset is closed (e.g., {b, c, e}), we would know its support count. Otherwise, we can recursively compute its support by examining the supports of its immediate supersets of size four. In general, the support count of any non-closed (k−1)-itemset can be determined as long as we know the support counts of all k-itemsets. Hence, one can devise an iterative algorithm that computes the support counts of itemsets at level k−1 using the support counts of itemsets at level k, starting from the level $k_{max}$, where $k_{max}$ is the size of the largest closed itemset.

Even though closed itemsets provide a compact representation of the support counts of all itemsets, they can still be exponentially large in number. Moreover, for most practical applications, we only need to determine the support count of all frequent itemsets. In this regard, closed frequent itemsets provide a compact representation of the support counts of all frequent itemsets, which can be defined as follows.

Definition 5.5. (Closed Frequent Itemset.) An itemset is a closed frequent itemset if it is closed and its support is greater than or equal to minsup.

In the previous example, assuming that the support threshold is 40%, {b, c} is a closed frequent itemset because its support is 60%. In Figure 5.17, the closed frequent itemsets are indicated by the shaded nodes.

Algorithms are available to explicitly extract closed frequent itemsets from a given data set. Interested readers may refer to the Bibliographic Notes at the

end of this chapter for further discussions of these algorithms. We can use closed frequent itemsets to determine the support counts for all non-closed frequent itemsets. For example, consider the frequent itemset {a, d} shown in Figure 5.17. Because this itemset is not closed, its support count must be equal to the maximum support count of its immediate supersets, {a, b, d}, {a, c, d}, and {a, d, e}. Also, since {a, d} is frequent, we only need to consider the support of its frequent supersets. In general, the support count of every non-closed frequent k-itemset can be obtained by considering the support of all its frequent supersets of size k+1. For example, since the only frequent superset of {a, d} is {a, c, d}, its support is equal to the support of {a, c, d}, which is 2. Using this methodology, an algorithm can be developed to compute the support for every frequent itemset. The pseudocode for this algorithm is shown in Algorithm 5.4. The algorithm proceeds in a specific-to-general fashion, i.e., from the largest to the smallest frequent itemsets. This is because, in order to find the support for a non-closed frequent itemset, the support for all of its supersets must be known. Note that the set of all frequent itemsets can be easily computed by taking the union of all subsets of frequent closed itemsets.

Algorithm 5.4 Support counting using closed frequent itemsets.

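The specific-to-general computation that Algorithm 5.4 describes can be sketched as follows (our code, not the book's pseudocode): the support of a non-closed frequent k-itemset is the maximum support among its frequent supersets of size k+1. The closed itemsets and their counts below are hypothetical.

```python
from itertools import combinations

# Hypothetical closed frequent itemsets with their support counts.
closed = {frozenset("bc"): 3, frozenset("abc"): 2, frozenset("bcd"): 2}

# All frequent itemsets are the non-empty subsets of the closed frequent itemsets.
frequent = {s for c in closed for k in range(1, len(c) + 1)
            for s in map(frozenset, combinations(c, k))}

support = dict(closed)
kmax = max(len(f) for f in frequent)
for k in range(kmax - 1, 0, -1):                       # largest to smallest
    for f in (x for x in frequent if len(x) == k and x not in support):
        # Support of a non-closed itemset = max support of its frequent (k+1)-supersets.
        support[f] = max(support[g] for g in frequent
                         if len(g) == k + 1 and f < g)

print(support[frozenset("b")], support[frozenset("ab")])   # 3 and 2
```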

To illustrate the advantage of using closed frequent itemsets, consider the data set shown in Table 5.5, which contains ten transactions and fifteen items. The items can be divided into three groups: (1) Group A, which contains items $a_1$ through $a_5$; (2) Group B, which contains items $b_1$ through $b_5$; and (3) Group C, which contains items $c_1$ through $c_5$. Assuming that the support threshold is 20%, itemsets involving items from the same group are frequent, but itemsets involving items from different groups are infrequent. The total number of frequent itemsets is thus $3 \times (2^5 - 1) = 93$. However, there are only four closed frequent itemsets in the data: $\{a_3, a_4\}$, $\{a_1, a_2, a_3, a_4, a_5\}$, $\{b_1, b_2, b_3, b_4, b_5\}$, and $\{c_1, c_2, c_3, c_4, c_5\}$. It is often sufficient to present only the closed frequent itemsets to the analysts instead of the entire set of frequent itemsets.

Table 5.5. A transaction data set for mining closed itemsets.

TID  a1 a2 a3 a4 a5  b1 b2 b3 b4 b5  c1 c2 c3 c4 c5
 1    1  1  1  1  1   0  0  0  0  0   0  0  0  0  0
 2    1  1  1  1  1   0  0  0  0  0   0  0  0  0  0
 3    1  1  1  1  1   0  0  0  0  0   0  0  0  0  0
 4    0  0  1  1  0   1  1  1  1  1   0  0  0  0  0
 5    0  0  0  0  0   1  1  1  1  1   0  0  0  0  0
 6    0  0  0  0  0   1  1  1  1  1   0  0  0  0  0
 7    0  0  0  0  0   0  0  0  0  0   1  1  1  1  1
 8    0  0  0  0  0   0  0  0  0  0   1  1  1  1  1
 9    0  0  0  0  0   0  0  0  0  0   1  1  1  1  1
10    0  0  0  0  0   0  0  0  0  0   1  1  1  1  1

Finally, note that all maximal frequent itemsets are closed because none of the maximal frequent itemsets can have the same support count as their immediate supersets. The relationships among frequent, closed, closed frequent, and maximal frequent itemsets are shown in Figure 5.18.

Figure 5.18. Relationships among frequent, closed, closed frequent, and maximal frequent itemsets.

5.5 Alternative Methods for Generating Frequent Itemsets*

Apriori is one of the earliest algorithms to have successfully addressed the combinatorial explosion of frequent itemset generation. It achieves this by applying the Apriori principle to prune the exponential search space. Despite its significant performance improvement, the algorithm still incurs considerable I/O overhead since it requires making several passes over the transaction data set. In addition, as noted in Section 5.2.5, the performance of the Apriori algorithm may degrade significantly for dense data sets because of the increasing width of transactions. Several alternative methods have been developed to overcome these limitations and improve upon the efficiency of the Apriori algorithm. The following is a high-level description of these methods.

Traversal of Itemset Lattice

A search for frequent itemsets can be conceptually viewed as a traversal on the itemset lattice shown in Figure 5.1. The search strategy employed by an algorithm dictates how the lattice structure is traversed during the frequent itemset generation process. Some search strategies are better than others, depending on the configuration of frequent itemsets in the lattice. An overview of these strategies is presented next.

General-to-Specific versus Specific-to-General: The Apriori algorithm uses a general-to-specific search strategy, where pairs of frequent (k−1)-itemsets are merged to obtain candidate k-itemsets. This general-to-specific search strategy is effective, provided the maximum length of a frequent itemset is not too long. The configuration of frequent itemsets that works best with this strategy is shown in Figure 5.19(a), where the darker nodes represent infrequent itemsets. Alternatively, a specific-to-general search strategy looks for more specific frequent itemsets first, before finding the more general frequent itemsets. This strategy is useful to discover maximal frequent itemsets in dense transactions, where the frequent itemset border is located near the bottom of the lattice, as shown in Figure 5.19(b). The Apriori principle can be applied to prune all subsets of maximal frequent itemsets. Specifically, if a candidate k-itemset is maximal frequent, we do not have to examine any of its subsets of size k−1. However, if the candidate k-itemset is infrequent, we need to check all of its k−1 subsets in the next iteration. Another approach is to combine both general-to-specific and specific-to-general search strategies. This bidirectional approach requires more space to store the candidate itemsets, but it can help to rapidly identify the frequent itemset border, given the configuration shown in Figure 5.19(c).

Figure 5.19. General-to-specific, specific-to-general, and bidirectional search.

Equivalence Classes: Another way to envision the traversal is to first partition the lattice into disjoint groups of nodes (or equivalence classes). A frequent itemset generation algorithm searches for frequent itemsets within a particular equivalence class first before moving to another equivalence class. As an example, the level-wise strategy used in the Apriori algorithm can be considered to be partitioning the lattice on the basis of itemset sizes; i.e., the algorithm discovers all frequent 1-itemsets first before proceeding to larger-sized itemsets. Equivalence classes can also be defined according to the prefix or suffix labels of an itemset. In this case, two itemsets belong to the same equivalence class if they share a common prefix or suffix of length k. In the prefix-based approach, the algorithm can search for frequent itemsets starting with the prefix a before looking for those starting with prefixes b, c, and so on. Both prefix-based and suffix-based equivalence classes can be demonstrated using the tree-like structure shown in Figure 5.20.

Figure 5.20. Equivalence classes based on the prefix and suffix labels of itemsets.

Breadth-First versus Depth-First: The Apriori algorithm traverses the lattice in a breadth-first manner, as shown in Figure 5.21(a). It first discovers all the frequent 1-itemsets, followed by the frequent 2-itemsets, and so on, until no new frequent itemsets are generated. The itemset lattice can also be traversed in a depth-first manner, as shown in Figures 5.21(b) and 5.22. The algorithm can start from, say, node a in Figure 5.22, and count its support to determine whether it is frequent. If so, the algorithm progressively expands the next level of nodes, i.e., ab, abc, and so on, until an infrequent node is reached, say, abcd. It then backtracks to another branch, say, abce, and continues the search from there.

Figure 5.21. Breadth-first and depth-first traversals.

Figure 5.22. Generating candidate itemsets using the depth-first approach.

The depth-first approach is often used by algorithms designed to find maximal frequent itemsets. This approach allows the frequent itemset border to be detected more quickly than using a breadth-first approach. Once a maximal frequent itemset is found, substantial pruning can be performed on its subsets. For example, if the node bcde shown in Figure 5.22 is maximal frequent, then the algorithm does not have to visit the subtrees rooted at bd, be, c, d, and e because they will not contain any maximal frequent itemsets. However, if abc is maximal frequent, only the nodes such as ac and bc are not maximal frequent (but the subtrees of ac and bc may still contain maximal frequent itemsets). The depth-first approach also allows a different kind of pruning based on the support of itemsets. For example, suppose the support for {a, b, c} is identical to the support for {a, b}. The subtrees rooted at abd and abe can be skipped because they are guaranteed not to have any maximal frequent itemsets. The proof of this is left as an exercise to the readers.

Representation of Transaction Data Set

There are many ways to represent a transaction data set. The choice of representation can affect the I/O costs incurred when computing the support of candidate itemsets. Figure 5.23 shows two different ways of representing market basket transactions. The representation on the left is called a horizontal data layout, which is adopted by many association rule mining algorithms, including Apriori. Another possibility is to store the list of transaction identifiers (TID-list) associated with each item. Such a representation is known as the vertical data layout. The support for each candidate itemset is obtained by intersecting the TID-lists of its subset items. The length of the TID-lists shrinks as we progress to larger sized itemsets. However, one problem with this approach is that the initial set of TID-lists might be too large to fit into main memory, thus requiring more sophisticated techniques to compress the TID-lists. We describe another effective approach to represent the data in the next Section.

Figure 5.23. Horizontal and vertical data format.

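In the vertical layout, the support of an itemset is simply the size of the intersection of its items' TID-lists. A minimal sketch (ours), with hypothetical TID-lists:

```python
# Hypothetical vertical layout: item -> set of transaction identifiers (TID-list).
tid_lists = {
    "a": {1, 4, 5, 6, 7, 8, 9},
    "b": {1, 2, 5, 7, 8, 10},
    "e": {1, 3, 6, 10},
}

def support_count(itemset, tid_lists):
    """Intersect the TID-lists of the items; the result is the itemset's TID-list."""
    tids = set.intersection(*(tid_lists[i] for i in itemset))
    return len(tids)

print(support_count(["a", "b"], tid_lists))   # 4: transactions 1, 5, 7, and 8
```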

5.6 FP-Growth Algorithm*

This Section presents an alternative algorithm called FP-growth that takes a radically different approach to discovering frequent itemsets. The algorithm does not subscribe to the generate-and-test paradigm of Apriori. Instead, it encodes the data set using a compact data structure called an FP-tree and extracts frequent itemsets directly from this structure. The details of this approach are presented next.

5.6.1 FP-Tree Representation

An FP-tree is a compressed representation of the input data. It is constructed by reading the data set one transaction at a time and mapping each transaction onto a path in the FP-tree. As different transactions can have several items in common, their paths might overlap. The more the paths overlap with one another, the more compression we can achieve using the FP-tree structure. If the size of the FP-tree is small enough to fit into main memory, this will allow us to extract frequent itemsets directly from the structure in memory instead of making repeated passes over the data stored on disk.

Figure 5.24 shows a data set that contains ten transactions and five items. The structures of the FP-tree after reading the first three transactions are also depicted in the diagram. Each node in the tree contains the label of an item along with a counter that shows the number of transactions mapped onto the given path. Initially, the FP-tree contains only the root node represented by the null symbol. The FP-tree is subsequently extended in the following way:

Figure 5.24. Construction of an FP-tree.

1. The data set is scanned once to determine the support count of each item. Infrequent items are discarded, while the frequent items are sorted in decreasing support counts inside every transaction of the data set. For the data set shown in Figure 5.24, a is the most frequent item, followed by b, c, d, and e.

2. The algorithm makes a second pass over the data to construct the FP-tree. After reading the first transaction, {a, b}, the nodes labeled as a and b are created. A path is then formed from null → a → b to encode the transaction. Every node along the path has a frequency count of 1.

3. After reading the second transaction, {b, c, d}, a new set of nodes is created for items b, c, and d. A path is then formed to represent the transaction by connecting the nodes null → b → c → d. Every node along this path also has a frequency count equal to one. Although the first two transactions have an item in common, which is b, their paths are disjoint because the transactions do not share a common prefix.

4. The third transaction, {a, c, d, e}, shares a common prefix item (which is a) with the first transaction. As a result, the path for the third transaction, null → a → c → d → e, overlaps with the path for the first transaction, null → a → b. Because of their overlapping path, the frequency count for node a is incremented to two, while the frequency counts for the newly created nodes, c, d, and e, are equal to one.

5. This process continues until every transaction has been mapped onto one of the paths given in the FP-tree. The resulting FP-tree after reading all the transactions is shown at the bottom of Figure 5.24.

The size of an FP-tree is typically smaller than the size of the uncompressed data because many transactions in market basket data often share a few items in common. In the best-case scenario, where all the transactions have the same set of items, the FP-tree contains only a single branch of nodes. The worst-case scenario happens when every transaction has a unique set of items. As none of the transactions have any items in common, the size of the FP-tree is effectively the same as the size of the original data. However, the physical storage requirement for the FP-tree is higher because it requires additional space to store pointers between nodes and counters for each item.

The size of an FP-tree also depends on how the items are ordered. The notion of ordering items in decreasing order of support counts relies on the possibility that the high support items occur more frequently across all paths and hence must be used as most commonly occurring prefixes. For example, if the ordering scheme in the preceding example is reversed, i.e., from lowest to highest support item, the resulting FP-tree is shown in Figure 5.25. The tree appears to be denser because the branching factor at the root node has increased from 2 to 5 and the number of nodes containing the high support items such as a and b has increased from 3 to 12. Nevertheless, ordering by decreasing support counts does not always lead to the smallest tree, especially when the high support items do not occur frequently together with the other items. For example, suppose we augment the data set given in Figure 5.24 with 100 transactions that contain {e}, 80 transactions that contain {d}, 60 transactions that contain {c}, and 40 transactions that contain {b}. Item e is now most frequent, followed by d, c, b, and a. With the augmented transactions, ordering by decreasing support counts will result in an FP-tree similar to Figure 5.25, while a scheme based on increasing support counts produces a smaller FP-tree similar to Figure 5.24(iv).

Figure 5.25. An FP-tree representation for the data set shown in Figure 5.24 with a different item ordering scheme.

An FP-tree also contains a list of pointers connecting nodes that have the same items. These pointers, represented as dashed lines in Figures 5.24 and 5.25, help to facilitate the rapid access of individual items in the tree. We explain how to use the FP-tree and its corresponding pointers for frequent itemset generation in the next Section.
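A bare-bones FP-tree construction following the two-pass procedure above is sketched below (ours; the first three transactions follow the text's example, the rest are made up, and the node_links dictionary is only a simplified stand-in for the node pointers used later by FP-growth).

```python
from collections import Counter, defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count, self.children = item, parent, 1, {}

def build_fp_tree(transactions, minsup_count):
    # Pass 1: item supports; discard infrequent items.
    freq = Counter(i for t in transactions for i in t)
    freq = {i: n for i, n in freq.items() if n >= minsup_count}
    root = Node(None, None)
    node_links = defaultdict(list)        # item -> its nodes (simplified pointer list)
    # Pass 2: insert each transaction with items sorted by decreasing support count.
    for t in transactions:
        path = sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i))
        node = root
        for item in path:
            if item in node.children:
                node.children[item].count += 1   # shared prefix: just bump the counter
            else:
                node.children[item] = Node(item, node)
                node_links[item].append(node.children[item])
            node = node.children[item]
    return root, node_links

transactions = [{"a", "b"}, {"b", "c", "d"}, {"a", "c", "d", "e"},
                {"a", "d", "e"}, {"a", "b", "c"}]
root, links = build_fp_tree(transactions, minsup_count=1)
print([(n.item, n.count) for n in links["a"]])   # a single node for 'a' with count 4
```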

5.6.2 Frequent Itemset Generation in FP-Growth Algorithm

FP-growth is an algorithm that generates frequent itemsets from an FP-tree by exploring the tree in a bottom-up fashion. Given the example tree shown in Figure 5.24, the algorithm looks for frequent itemsets ending in e first, followed by d, c, b, and finally, a. This bottom-up strategy for finding frequent itemsets ending with a particular item is equivalent to the suffix-based approach described in Section 5.5. Since every transaction is mapped onto a path in the FP-tree, we can derive the frequent itemsets ending with a particular item, say, e, by examining only the paths containing node e. These paths can be accessed rapidly using the pointers associated with node e. The extracted paths are shown in Figure 5.26(a). Similar paths for itemsets ending in d, c, b, and a are shown in Figures 5.26(b), (c), (d), and (e), respectively.

Figure 5.26. Decomposing the frequent itemset generation problem into multiple subproblems, where each subproblem involves finding frequent itemsets ending in e, d, c, b, and a.

FP-growth finds all the frequent itemsets ending with a particular suffix by employing a divide-and-conquer strategy to split the problem into smaller subproblems. For example, suppose we are interested in finding all frequent itemsets ending in e. To do this, we must first check whether the itemset {e} itself is frequent. If it is frequent, we consider the subproblem of finding frequent itemsets ending in de, followed by ce, be, and ae. In turn, each of these subproblems are further decomposed into smaller subproblems. By merging the solutions obtained from the subproblems, all the frequent itemsets ending in e can be found. Finally, the set of all frequent itemsets can be generated by merging the solutions to the subproblems of finding frequent itemsets ending in e, d, c, b, and a. This divide-and-conquer approach is the key strategy employed by the FP-growth algorithm.

For a more concrete example on how to solve the subproblems, consider the task of finding frequent itemsets ending with e.

1. The first step is to gather all the paths containing node e. These initial paths are called prefix paths and are shown in Figure 5.27(a).

Figure 5.27. Example of applying the FP-growth algorithm to find frequent itemsets ending in e.

2. From the prefix paths shown in Figure 5.27(a), the support count for e is obtained by adding the support counts associated with node e. Assuming that the minimum support count is 2, {e} is declared a frequent itemset because its support count is 3.

3. Because {e} is frequent, the algorithm has to solve the subproblems of finding frequent itemsets ending in de, ce, be, and ae. Before solving these subproblems, it must first convert the prefix paths into a conditional FP-tree, which is structurally similar to an FP-tree, except it is used to find frequent itemsets ending with a particular suffix. A conditional FP-tree is obtained in the following way:

a. First, the support counts along the prefix paths must be updated because some of the counts include transactions that do not contain item e. For example, the rightmost path shown in Figure 5.27(a) includes a transaction {b, c} that does not contain item e. The counts along the prefix path must therefore be adjusted to 1 to reflect the actual number of transactions containing {b, c, e}.

b. The prefix paths are truncated by removing the nodes for e. These nodes can be removed because the support counts along the prefix paths have been updated to reflect only transactions that contain e, and the subproblems of finding frequent itemsets ending in de, ce, be, and ae no longer need information about node e.

c. After updating the support counts along the prefix paths, some of the items may no longer be frequent. For example, the node b appears only once and has a support count equal to 1, which means that there is only one transaction that contains both b and e. Item b can be safely ignored from subsequent analysis because all itemsets ending in be must be infrequent.

The conditional FP-tree for e is shown in Figure 5.27(b). The tree looks different than the original prefix paths because the frequency counts have been updated and the nodes b and e have been eliminated.

4. FP-growth uses the conditional FP-tree for e to solve the subproblems of finding frequent itemsets ending in de, ce, and ae. To find the frequent itemsets ending in de, the prefix paths for d are gathered from the conditional FP-tree for e (Figure 5.27(c)). By adding the frequency counts associated with node d, we obtain the support count for {d, e}. Since the support count is equal to 2, {d, e} is declared a frequent itemset. Next, the algorithm constructs the conditional FP-tree for de using the approach described in step 3. After updating the support counts and removing the infrequent item c, the conditional FP-tree for de is shown in Figure 5.27(d). Since the conditional FP-tree contains only one item, a, whose support is equal to minsup, the algorithm extracts the frequent itemset {a, d, e} and moves on to the next subproblem, which is to generate frequent itemsets ending in ce. After processing the prefix paths for c, {c, e} is found to be frequent. However, the conditional FP-tree for c will have no frequent items and thus will be eliminated. The algorithm proceeds to solve the next subproblem and finds {a, e} to be the only frequent itemset remaining.

This example illustrates the divide-and-conquer approach used in the FP-growth algorithm. At each recursive step, a conditional FP-tree is constructed by updating the frequency counts along the prefix paths and removing all infrequent items. Because the subproblems are disjoint, FP-growth will not generate any duplicate itemsets. In addition, the counts associated with the nodes allow the algorithm to perform support counting while generating the common suffix itemsets.

FP-growth is an interesting algorithm because it illustrates how a compact representation of the transaction data set helps to efficiently generate frequent itemsets. In addition, for certain transaction data sets, FP-growth outperforms the standard Apriori algorithm by several orders of magnitude. The run-time performance of FP-growth depends on the compaction factor of the data set. If the resulting conditional FP-trees are very bushy (in the worst case, a full prefix tree), then the performance of the algorithm degrades significantly because it has to generate a large number of subproblems and merge the results returned by each subproblem.

5.7 Evaluation of Association Patterns

Although the Apriori principle significantly reduces the exponential search space of candidate itemsets, association analysis algorithms still have the potential to generate a large number of patterns. For example, although the data set shown in Table 5.1 contains only six items, it can produce hundreds of association rules at particular support and confidence thresholds. As the size and dimensionality of real commercial databases can be very large, we can easily end up with thousands or even millions of patterns, many of which might not be interesting. Identifying the most interesting patterns from the multitude of all possible ones is not a trivial task because "one person's trash might be another person's treasure." It is therefore important to establish a set of well-accepted criteria for evaluating the quality of association patterns.

The first set of criteria can be established through a data-driven approach to define objective interestingness measures. These measures can be used to rank patterns (itemsets or rules) and thus provide a straightforward way of dealing with the enormous number of patterns that are found in a data set. Some of the measures can also provide statistical information, e.g., itemsets that involve a set of unrelated items or cover very few transactions are considered uninteresting because they may capture spurious relationships in the data and should be eliminated. Examples of objective interestingness measures include support, confidence, and correlation.

The second set of criteria can be established through subjective arguments. A pattern is considered subjectively uninteresting unless it reveals unexpected information about the data or provides useful knowledge that can lead to profitable actions. For example, the rule $\{Butter\} \to \{Bread\}$ may not be interesting, despite having high support and confidence values, because the relationship represented by the rule might seem rather obvious. On the other hand, the rule $\{Diapers\} \to \{Beer\}$ is interesting because the relationship is quite unexpected and may suggest a new cross-selling opportunity for retailers. Incorporating subjective knowledge into pattern evaluation is a difficult task because it requires a considerable amount of prior information from domain experts. Readers interested in subjective interestingness measures may refer to resources listed in the bibliography at the end of this chapter.

5.7.1 Objective Measures of Interestingness

An objective measure is a data-driven approach for evaluating the quality of association patterns. It is domain-independent and requires only that the user specifies a threshold for filtering low-quality patterns. An objective measure is usually computed based on the frequency counts tabulated in a contingency table. Table 5.6 shows an example of a contingency table for a pair of binary variables, A and B. We use the notation $\overline{A}$ ($\overline{B}$) to indicate that A (B) is absent from a transaction. Each entry $f_{ij}$ in this $2 \times 2$ table denotes a frequency count. For example, $f_{11}$ is the number of times A and B appear together in the same transaction, while $f_{01}$ is the number of transactions that contain B but not A. The row sum $f_{1+}$ represents the support count for A, while the column sum $f_{+1}$ represents the support count for B. Finally, even though our discussion focuses mainly on asymmetric binary variables, note that contingency tables are also applicable to other attribute types such as symmetric binary, nominal, and ordinal variables.

Table 5.6. A 2-way contingency table for variables A and B.

                     B           $\overline{B}$
A                 $f_{11}$       $f_{10}$          $f_{1+}$
$\overline{A}$    $f_{01}$       $f_{00}$          $f_{0+}$
                  $f_{+1}$       $f_{+0}$          N

Limitations of the Support-Confidence Framework

The classical association rule mining formulation relies on the support and confidence measures to eliminate uninteresting patterns. The drawback of support, which is described more fully in Section 5.8, is that many potentially interesting patterns involving low support items might be eliminated by the support threshold. The drawback of confidence is more subtle and is best demonstrated with the following example.

Example 5.3. Suppose we are interested in analyzing the relationship between people who drink tea and coffee. We may gather information about the beverage preferences among a group of people and summarize their responses into a contingency table such as the one shown in Table 5.7.

Table 5.7. Beverage preferences among a group of 1000 people.

                   Coffee    $\overline{Coffee}$
Tea                  150            50             200
$\overline{Tea}$     650           150             800
                     800           200            1000

Theinformationgiveninthistablecanbeusedtoevaluatetheassociationrule .Atfirstglance,itmayappearthatpeoplewhodrinkteaalsotendtodrinkcoffeebecausetherule'ssupport(15%)andconfidence(75%)valuesarereasonablyhigh.Thisargumentwouldhavebeenacceptableexceptthatthefractionofpeoplewhodrinkcoffee,regardlessofwhethertheydrinktea,is80%,whilethefractionofteadrinkerswhodrinkcoffeeisonly75%.Thusknowingthatapersonisateadrinkeractuallydecreasesherprobabilityofbeingacoffeedrinkerfrom80%to75%!Therule isthereforemisleadingdespiteitshighconfidencevalue.

Now consider a similar problem where we are interested in analyzing the relationship between people who drink tea and people who use honey in their beverage. Table 5.8 summarizes the information gathered over the same group of people about their preferences for drinking tea and using honey. If we evaluate the association rule {Tea} → {Honey} using this information, we will find that the confidence value of this rule is merely 50%, which might be easily rejected using a reasonable threshold on the confidence value, say 70%. One thus might consider that the preference of a person for drinking tea has no influence on her preference for using honey. However, the fraction of people who use honey, regardless of whether they drink tea, is only 12%. Hence, knowing that a person drinks tea significantly increases her probability of using honey from 12% to 50%. Further, the fraction of people who do not drink tea but use honey is only 2.5%! This suggests that there is definitely some information in the preference of a person for using honey given that she drinks tea. The rule {Tea} → {Honey} may therefore be falsely rejected if confidence is used as the evaluation measure.

Table 5.8. Information about people who drink tea and people who use honey in their beverage.

              Honey    Honey¯
   Tea          100       100     200
   Tea¯          20       780     800
                120       880    1000

Note that if we take the support of coffee drinkers into account, we would not be surprised to find that many of the people who drink tea also drink coffee, since the overall number of coffee drinkers is quite large by itself. What is more surprising is that the fraction of tea drinkers who drink coffee is actually less than the overall fraction of people who drink coffee, which points to an inverse relationship between tea drinkers and coffee drinkers. Similarly, if we account for the fact that the support of using honey is inherently small, it is easy to understand that the fraction of tea drinkers who use honey will naturally be small. Instead, what is important to measure is the change in the fraction of honey users, given the information that they drink tea.
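These comparisons are easy to verify directly from the contingency counts. The following Python sketch is ours rather than part of the original text; the function name rule_stats and the variable names are illustrative. It computes the support and confidence of a rule A → B, together with the baseline support of the consequent s(B), for the counts in Tables 5.7 and 5.8.

def rule_stats(f11, f10, f01, f00):
    """Support and confidence of A -> B, plus the baseline support s(B),
    computed from the four counts of a 2x2 contingency table."""
    n = f11 + f10 + f01 + f00
    support = f11 / n                  # s(A, B)
    confidence = f11 / (f11 + f10)     # s(A, B) / s(A)
    baseline = (f11 + f01) / n         # s(B), support of the consequent
    return support, confidence, baseline

# Table 5.7, {Tea} -> {Coffee}: (0.15, 0.75, 0.80), confidence below s(Coffee)
print(rule_stats(150, 50, 650, 150))

# Table 5.8, {Tea} -> {Honey}: (0.10, 0.50, 0.12), confidence far above s(Honey)
print(rule_stats(100, 100, 20, 780))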

The limitations of the confidence measure are well-known and can be understood from a statistical perspective as follows. The support of a variable measures the probability of its occurrence, while the support s(A, B) of a pair of variables A and B measures the probability of the two variables occurring together. Hence, the joint probability P(A, B) can be written as

P(A, B) = s(A, B) = f11/N.

If we assume A and B are statistically independent, i.e., there is no relationship between the occurrences of A and B, then P(A, B) = P(A) × P(B). Hence, under the assumption of statistical independence between A and B, the support sindep(A, B) of A and B can be written as

sindep(A, B) = s(A) × s(B), or equivalently, sindep(A, B) = (f1+/N) × (f+1/N).    (5.4)

If the support between two variables, s(A, B), is equal to sindep(A, B), then A and B can be considered to be unrelated to each other. However, if s(A, B) is widely different from sindep(A, B), then A and B are most likely dependent. Hence, any deviation of s(A, B) from s(A) × s(B) can be seen as an indication of a statistical relationship between A and B. Since the confidence measure only considers the deviance of s(A, B) from s(A) and not from sindep(A, B), it fails to account for the support of the consequent, namely s(B). This results in the detection of spurious patterns (e.g., {Tea} → {Coffee}) and the rejection of truly interesting patterns (e.g., {Tea} → {Honey}), as illustrated in the previous example.

Various objective measures have been used to capture the deviance of s(A, B) from sindep(A, B) that are not susceptible to the limitations of the confidence measure. Below, we provide a brief description of some of these measures and discuss some of their properties.

Interest Factor  The interest factor, which is also called the "lift," can be defined as follows:

I(A, B) = s(A, B) / (s(A) × s(B)) = N f11 / (f1+ f+1).    (5.5)

Notice that s(A) × s(B) = sindep(A, B). Hence, the interest factor measures the ratio of the support of a pattern s(A, B) against its baseline support sindep(A, B) computed under the statistical independence assumption. Using Equations 5.5 and 5.4, we can interpret the measure as follows:

I(A, B)  = 1, if A and B are independent;
         > 1, if A and B are positively related;
         < 1, if A and B are negatively related.    (5.6)

For the tea-coffee example shown in Table 5.7, I = 0.15/(0.2 × 0.8) = 0.9375, thus suggesting a slight negative relationship between tea drinkers and coffee drinkers. Also, for the tea-honey example shown in Table 5.8, I = 0.1/(0.12 × 0.2) = 4.1667, suggesting a strong positive relationship between people who drink tea and people who use honey in their beverage. We can thus see that the interest factor is able to detect meaningful patterns in the tea-coffee and tea-honey examples. Indeed, the interest factor has a number of statistical advantages over the confidence measure that make it a suitable measure for analyzing statistical independence between variables.
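As a quick check, the following sketch (again ours, for illustration only) evaluates Equation 5.5 directly on the counts of Tables 5.7 and 5.8 and reproduces the two interest factor values quoted above.

def interest_factor(f11, f10, f01, f00):
    """Interest factor (lift) of Equation 5.5: I(A,B) = N*f11 / (f1+ * f+1)."""
    n = f11 + f10 + f01 + f00
    return n * f11 / ((f11 + f10) * (f11 + f01))

print(interest_factor(150, 50, 650, 150))   # 0.9375  (tea-coffee, Table 5.7)
print(interest_factor(100, 100, 20, 780))   # ~4.1667 (tea-honey, Table 5.8)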

Piatetsky-Shapiro (PS) Measure  Instead of computing the ratio between s(A, B) and s(A) × s(B), the PS measure considers the difference between s(A, B) and s(A) × s(B), as follows:

PS = s(A, B) − s(A) × s(B) = f11/N − (f1+ f+1)/N².    (5.7)

The PS value is 0 when A and B are mutually independent of each other. Otherwise, PS > 0 when there is a positive relationship between the two variables, and PS < 0 when there is a negative relationship.

Correlation Analysis  Correlation analysis is one of the most popular techniques for analyzing relationships between a pair of variables. For continuous variables, correlation is defined using Pearson's correlation coefficient (see Equation 2.10 on page 83). For binary variables, correlation can be measured using the ϕ-coefficient, which is defined as

ϕ = (f11 f00 − f01 f10) / √(f1+ f+1 f0+ f+0).    (5.8)

If we rearrange the terms in Equation 5.8, we can show that the ϕ-coefficient can be rewritten in terms of the support measures of A, B, and {A, B} as follows:

ϕ = (s(A, B) − s(A) × s(B)) / √(s(A) × (1 − s(A)) × s(B) × (1 − s(B))).    (5.9)

Note that the numerator in the above equation is identical to the PS measure. Hence, the ϕ-coefficient can be understood as a normalized version of the PS measure, where the value of the ϕ-coefficient ranges from −1 to +1. From a statistical viewpoint, the correlation captures the normalized difference between s(A, B) and sindep(A, B). A correlation value of 0 means no relationship, while a value of +1 suggests a perfect positive relationship and a value of −1 suggests a perfect negative relationship. The correlation measure has a statistical meaning and hence is widely used to evaluate the strength of statistical independence among variables. For instance, the correlation between tea and coffee drinkers in Table 5.7 is −0.0625, which is slightly less than 0. On the other hand, the correlation between people who drink tea and people who use honey in Table 5.8 is 0.5847, suggesting a positive relationship.

IS Measure  IS is an alternative measure for capturing the relationship between s(A, B) and s(A) × s(B). The IS measure is defined as follows:

IS(A, B) = √(I(A, B) × s(A, B)) = s(A, B) / √(s(A) s(B)) = f11 / √(f1+ f+1).    (5.10)

Although the definition of IS looks quite similar to the interest factor, there are some interesting differences between them. Since IS is the geometric mean between the interest factor and the support of a pattern, IS is large when both the interest factor and support are large. Hence, if the interest factors of two patterns are identical, IS prefers the pattern with the higher support.

It is also possible to show that IS is mathematically equivalent to the cosine measure for binary variables (see Equation 2.6 on page 81). The value of IS thus varies from 0 to 1, where an IS value of 0 corresponds to no co-occurrence of the two variables, while an IS value of 1 denotes a perfect relationship, since they occur in exactly the same transactions. For the tea-coffee example shown in Table 5.7, the value of IS is equal to 0.375, while the value of IS for the tea-honey example in Table 5.8 is 0.6455. The IS measure thus gives a higher value for the {Tea} → {Honey} rule than the {Tea} → {Coffee} rule, which is consistent with our understanding of the two rules.

Alternative Objective Interestingness Measures  Note that all of the measures defined in the previous section use different techniques to capture the deviance between s(A, B) and sindep(A, B) = s(A) × s(B). Some measures use the ratio between s(A, B) and sindep(A, B), e.g., the interest factor and IS, while some other measures consider the difference between the two, e.g., the PS and the ϕ-coefficient. Some measures are bounded in a particular range, e.g., the IS and the ϕ-coefficient, while others are unbounded and do not have a defined maximum or minimum value, e.g., the interest factor. Because of such differences, these measures behave differently when applied to different types of patterns. Indeed, the measures defined above are not exhaustive and there exist many alternative measures for capturing different properties of relationships between pairs of binary variables. Table 5.9 provides the definitions for some of these measures in terms of the frequency counts of a 2×2 contingency table.

Table 5.9. Examples of objective measures for the itemset {A, B}.

   Measure (Symbol)          Definition
   Correlation (ϕ)           (N f11 − f1+ f+1) / √(f1+ f+1 f0+ f+0)
   Odds ratio (α)            (f11 f00) / (f10 f01)
   Kappa (κ)                 (N f11 + N f00 − f1+ f+1 − f0+ f+0) / (N² − f1+ f+1 − f0+ f+0)
   Interest (I)              (N f11) / (f1+ f+1)
   Cosine (IS)               f11 / √(f1+ f+1)
   Piatetsky-Shapiro (PS)    f11/N − (f1+ f+1)/N²
   Collective strength (S)   (f11 + f00)/(f1+ f+1 + f0+ f+0) × (N − f1+ f+1 − f0+ f+0)/(N − f11 − f00)
   Jaccard (ζ)               f11 / (f1+ + f+1 − f11)
   All-confidence (h)        min[f11/f1+, f11/f+1]
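Most of the measures in Table 5.9 can be computed with a few lines of code. The sketch below is illustrative rather than definitive (the function name measures and the dictionary keys are ours, and collective strength is omitted for brevity); for the tea-coffee table it reproduces the values discussed earlier in this section (ϕ ≈ −0.0625, I = 0.9375, IS = 0.375).

from math import sqrt

def measures(f11, f10, f01, f00):
    """A selection of the objective measures of Table 5.9 for a 2x2 table."""
    n = f11 + f10 + f01 + f00
    f1p, fp1 = f11 + f10, f11 + f01          # row sum f1+ and column sum f+1
    f0p, fp0 = f01 + f00, f10 + f00          # f0+ and f+0
    return {
        "phi":   (n * f11 - f1p * fp1) / sqrt(f1p * fp1 * f0p * fp0),
        "alpha": (f11 * f00) / (f10 * f01),                      # odds ratio
        "kappa": (n * f11 + n * f00 - f1p * fp1 - f0p * fp0)
                 / (n * n - f1p * fp1 - f0p * fp0),
        "I":     n * f11 / (f1p * fp1),                          # interest factor
        "IS":    f11 / sqrt(f1p * fp1),                          # cosine
        "PS":    f11 / n - f1p * fp1 / (n * n),
        "zeta":  f11 / (f1p + fp1 - f11),                        # Jaccard
        "h":     min(f11 / f1p, f11 / fp1),                      # all-confidence
    }

print(measures(150, 50, 650, 150))   # tea-coffee counts from Table 5.7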

Consistency among Objective Measures  Given the wide variety of measures available, it is reasonable to question whether the measures can produce similar ordering results when applied to a set of association patterns. If the measures are consistent, then we can choose any one of them as our evaluation metric. Otherwise, it is important to understand what their differences are in order to determine which measure is more suitable for analyzing certain types of patterns.

Suppose the measures defined in Table 5.9 are applied to rank the ten contingency tables shown in Table 5.10. These contingency tables are chosen to illustrate the differences among the existing measures. The ordering produced by these measures is shown in Table 5.11 (with 1 as the most interesting and 10 as the least interesting table). Although some of the measures appear to be consistent with each other, others produce quite different ordering results. For example, the rankings given by the ϕ-coefficient agree mostly with those provided by κ and collective strength, but are quite different than the rankings produced by the interest factor. Furthermore, a contingency table such as E10 is ranked lowest according to the ϕ-coefficient, but highest according to the interest factor.

Table 5.10. Example of contingency tables.

   Example    f11    f10    f01    f00
   E1        8123     83    424   1370
   E2        8330      2    622   1046
   E3        3954   3080      5   2961
   E4        2886   1363   1320   4431
   E5        1500   2000    500   6000
   E6        4000   2000   1000   3000
   E7        9481    298    127     94
   E8        4000   2000   2000   2000
   E9        7450   2483      4     63
   E10         61   2483      4   7452

Table 5.11. Rankings of contingency tables using the measures given in Table 5.9.

           ϕ    α    κ    I   IS   PS    S    ζ    h
   E1      1    3    1    6    2    2    1    2    2
   E2      2    1    2    7    3    5    2    3    3
   E3      3    2    4    4    5    1    3    6    8
   E4      4    8    3    3    7    3    4    7    5
   E5      5    7    6    2    9    6    6    9    9
   E6      6    9    5    5    6    4    5    5    7
   E7      7    6    7    9    1    8    7    1    1
   E8      8   10    8    8    8    7    8    8    7
   E9      9    4    9   10    4    9    9    4    4
   E10    10    5   10    1   10   10   10   10   10
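The disagreement between the rankings in Table 5.11 can be reproduced by scoring the ten tables with different measures. The short sketch below (ours, and limited to the ϕ-coefficient and the interest factor) sorts the tables of Table 5.10 under each of the two measures and prints the two orderings, which differ noticeably; in particular, E10 moves from the bottom of the ϕ ordering to the top of the interest factor ordering.

from math import sqrt

def phi(f11, f10, f01, f00):
    n = f11 + f10 + f01 + f00
    return (n * f11 - (f11 + f10) * (f11 + f01)) / sqrt(
        (f11 + f10) * (f11 + f01) * (f01 + f00) * (f10 + f00))

def interest(f11, f10, f01, f00):
    n = f11 + f10 + f01 + f00
    return n * f11 / ((f11 + f10) * (f11 + f01))

# Counts (f11, f10, f01, f00) taken from Table 5.10.
tables = {
    "E1": (8123, 83, 424, 1370),   "E2": (8330, 2, 622, 1046),
    "E3": (3954, 3080, 5, 2961),   "E4": (2886, 1363, 1320, 4431),
    "E5": (1500, 2000, 500, 6000), "E6": (4000, 2000, 1000, 3000),
    "E7": (9481, 298, 127, 94),    "E8": (4000, 2000, 2000, 2000),
    "E9": (7450, 2483, 4, 63),     "E10": (61, 2483, 4, 7452),
}
for name, score in (("phi", phi), ("interest", interest)):
    ranking = sorted(tables, key=lambda e: score(*tables[e]), reverse=True)
    print(name, ranking)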

Properties of Objective Measures  The results shown in Table 5.11 suggest that the measures greatly differ from each other and can provide conflicting information about the quality of a pattern. In fact, no measure is universally best for all applications. In the following, we describe some properties of the measures that play an important role in determining if they are suited for a certain application.

Inversion Property

Consider the binary vectors shown in Figure 5.28. The 0/1 value in each column vector indicates whether a transaction (row) contains a particular item (column). For example, the vector A indicates that the item appears in the first and last transactions, whereas the vector B indicates that the item is contained only in the fifth transaction. The vectors A¯ and B¯ are the inverted versions of A and B, i.e., their 1 values have been changed to 0 values (absence to presence) and vice versa. Applying this transformation to a binary vector is called inversion. If a measure is invariant under the inversion operation, then its value for the vector pair {A¯, B¯} should be identical to its value for {A, B}. The inversion property of a measure can be tested as follows.

Figure 5.28. Effect of the inversion operation. The vectors A¯ and B¯ are inversions of vectors A and B, respectively.

Definition 5.6. (Inversion Property.) An objective measure M is invariant under the inversion operation if its value remains the same when exchanging the frequency counts f11 with f00 and f10 with f01.

Measures that are invariant to the inversion property include the correlation (ϕ-coefficient), odds ratio, κ, and collective strength. These measures are especially useful in scenarios where the presence (1's) of a variable is as important as its absence (0's). For example, if we compare two sets of answers to a series of true/false questions where 0's (true) and 1's (false) are equally meaningful, we should use a measure that gives equal importance to occurrences of 0–0's and 1–1's. For the vectors shown in Figure 5.28, the ϕ-coefficient is equal to −0.1667 regardless of whether we consider the pair {A, B} or the pair {A¯, B¯}. Similarly, the odds ratio for both pairs of vectors is equal to a constant value of 0. Note that even though the ϕ-coefficient and the odds ratio are invariant to inversion, they can still show different results, as will be shown later.

Measures that do not remain invariant under the inversion operation include the interest factor and the IS measure. For example, the IS value for the pair {A¯, B¯} in Figure 5.28 is 0.825, which reflects the fact that the 1's in A¯ and B¯ occur frequently together. However, the IS value of its inverted pair {A, B} is equal to 0, since A and B do not have any co-occurrence of 1's. For asymmetric binary variables, e.g., the occurrence of words in documents, this is indeed the desired behavior. A desired similarity measure between asymmetric variables should not be invariant to inversion, since for these variables, it is more meaningful to capture relationships based on the presence of a variable rather than its absence. On the other hand, if we are dealing with symmetric binary variables where the relationships between 0's and 1's are equally meaningful, care should be taken to ensure that the chosen measure is invariant to inversion.

Although the values of the interest factor and IS change with the inversion operation, they can still be inconsistent with each other. To illustrate this, consider Table 5.12, which shows the contingency tables for two pairs of variables, {p, q} and {r, s}. Note that r and s are inverted transformations of p and q, respectively, where the roles of 0's and 1's have just been reversed. The interest factor for {p, q} is 1.02 and for {r, s} is 4.08, which means that the interest factor finds the inverted pair {r, s} more related than the original pair {p, q}. On the contrary, the IS value decreases upon inversion from 0.9346 for {p, q} to 0.286 for {r, s}, suggesting quite an opposite trend to that of the interest factor. Even though these measures conflict with each other for this example, they may be the right choice of measure in different applications.

Table 5.12. Contingency tables for the pairs {p, q} and {r, s}.

               p        p¯
   q          880       50     930
   q¯          50       20      70
              930       70    1000

               r        r¯
   s           20       50      70
   s¯          50      880     930
               70      930    1000
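A small computation makes this asymmetry concrete. The sketch below (ours; it simply restates the interest factor and IS formulas from Equations 5.5 and 5.10) evaluates both measures on the two tables of Table 5.12 and shows that inversion pushes them in opposite directions.

from math import sqrt

def interest(f11, f10, f01, f00):
    n = f11 + f10 + f01 + f00
    return n * f11 / ((f11 + f10) * (f11 + f01))

def cosine_is(f11, f10, f01, f00):
    return f11 / sqrt((f11 + f10) * (f11 + f01))

pq = (880, 50, 50, 20)    # Table 5.12, pair {p, q}
rs = (20, 50, 50, 880)    # pair {r, s}: the same table with 0's and 1's inverted

print(interest(*pq), interest(*rs))      # ~1.02 vs ~4.08: larger after inversion
print(cosine_is(*pq), cosine_is(*rs))    # IS decreases after inversion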

Scaling Property

Table 5.13 shows two contingency tables for gender and the grades achieved by students enrolled in a particular course. These tables can be used to study the relationship between gender and performance in the course. The second contingency table has data from the same population but has twice as many males and three times as many females. The actual number of males or females can depend upon the samples available for study, but the relationship between gender and grade should not change just because of differences in sample sizes. Similarly, if the number of students with high and low grades is changed in a new study, a measure of association between gender and grades should remain unchanged. Hence, we need a measure that is invariant to scaling of rows or columns. The process of multiplying a row or column of a contingency table by a constant value is called a row or column scaling operation. A measure that is invariant to scaling does not change its value after any row or column scaling operation.

Table 5.13. The grade-gender example. (a) Sample data of size 100.

             Male   Female
   High        30       20     50
   Low         40       10     50
               70       30    100

(b) Sample data of size 230.

             Male   Female
   High        60       60    120
   Low         80       30    110
              140       90    230

Definition 5.7. (Scaling Invariance Property.) Let T be a contingency table with frequency counts [f11; f10; f01; f00]. Let T′ be the transformed contingency table with scaled frequency counts [k1k3 f11; k2k3 f10; k1k4 f01; k2k4 f00], where k1, k2, k3, k4 are positive constants used to scale the two rows and the two columns of T. An objective measure M is invariant under the row/column scaling operation if M(T) = M(T′).

Note that the use of the term 'scaling' here should not be confused with the scaling operation for continuous variables introduced in Chapter 2 on page 23, where all the values of a variable were being multiplied by a constant factor, instead of scaling a row or column of a contingency table.

Scaling of rows and columns in contingency tables occurs in multiple ways in different applications. For example, if we are measuring the effect of a particular medical procedure on two sets of subjects, healthy and diseased, the ratio of healthy and diseased subjects can vary widely across different studies involving different groups of participants. Further, the fraction of healthy and diseased subjects chosen for a controlled study can be quite different from the true fraction observed in the complete population. These differences can result in a row or column scaling in the contingency tables for different populations of subjects. In general, the frequencies of items in a contingency table closely depend on the sample of transactions used to generate the table. Any change in the sampling procedure may result in a row or column scaling transformation. A measure that is expected to be invariant to differences in the sampling procedure must not change with row or column scaling.

Of all the measures introduced in Table 5.9, only the odds ratio (α) is invariant to row and column scaling operations. For example, the value of the odds ratio for both the tables in Table 5.13 is equal to 0.375. All other measures, such as the ϕ-coefficient, κ, IS, interest factor, and collective strength (S), change their values when the rows and columns of the contingency table are rescaled. Indeed, the odds ratio is a preferred choice of measure in the medical domain, where it is important to find relationships that do not change with differences in the population sample chosen for a study.
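The invariance of the odds ratio (and the lack of it for the other measures) can be checked numerically. The sketch below (illustrative only) applies the odds ratio and the interest factor to the two grade-gender tables of Table 5.13; the odds ratio stays at 0.375 while the interest factor changes.

def odds_ratio(f11, f10, f01, f00):
    return (f11 * f00) / (f10 * f01)

def interest(f11, f10, f01, f00):
    n = f11 + f10 + f01 + f00
    return n * f11 / ((f11 + f10) * (f11 + f01))

# Table 5.13 with f11 = (High, Male), f10 = (High, Female), f01 = (Low, Male), ...
sample_a = (30, 20, 40, 10)    # sample of size 100
sample_b = (60, 60, 80, 30)    # male column scaled by 2, female column by 3

print(odds_ratio(*sample_a), odds_ratio(*sample_b))   # 0.375 and 0.375
print(interest(*sample_a), interest(*sample_b))       # values differ under scaling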

Null Addition Property

Suppose we are interested in analyzing the relationship between a pair of words in a set of documents. If a collection of articles about ice fishing is added to the data set, should the association between the two words be affected? This process of adding unrelated data (in this case, documents) to a given data set is known as the null addition operation.

Definition 5.8. (Null Addition Property.) An objective measure M is invariant under the null addition operation if it is not affected by increasing f00, while all other frequencies in the contingency table stay the same.

For applications such as document analysis or market basket analysis, we would like to use a measure that remains invariant under the null addition operation. Otherwise, the relationship between words can be made to change simply by adding enough documents that do not contain both words! Examples of measures that satisfy this property include the cosine (IS) and Jaccard (ζ) measures, while those that violate this property include the interest factor, PS, odds ratio, and the ϕ-coefficient.

To demonstrate the effect of null addition, consider the two contingency tables T1 and T2 shown in Table 5.14. Table T2 has been obtained from T1 by adding 1000 extra transactions with both A and B absent. This operation only affects the f00 entry of Table T2, which has increased from 100 to 1100, whereas all the other frequencies in the table (f11, f10, and f01) remain the same. Since IS is invariant to null addition, it gives a constant value of 0.875 to both the tables. However, the addition of 1000 extra transactions with occurrences of 0–0's changes the value of the interest factor from 0.972 for T1 (denoting a slightly negative correlation) to 1.944 for T2 (positive correlation). Similarly, the value of the odds ratio increases from 7 for T1 to 77 for T2. Hence, when the interest factor or odds ratio is used as the association measure, the relationship between variables changes with the addition of null transactions where both the variables are absent. In contrast, the IS measure is invariant to null addition, since it considers two variables to be related only if they frequently occur together. Indeed, the IS measure (cosine measure) is widely used to measure similarity among documents, which is expected to depend only on the joint occurrences (1's) of words in documents, but not their absences (0's).

Table 5.14. An example demonstrating the effect of null addition. (a) Table T1.

               B        B¯
   A          700      100     800
   A¯         100      100     200
              800      200    1000

(b) Table T2.

               B        B¯
   A          700      100     800
   A¯         100     1100    1200
              800     1200    2000
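The effect of null addition is likewise easy to verify from the counts. The following sketch (ours, for illustration) evaluates IS, the odds ratio, and the interest factor on Tables 5.14(a) and (b); IS is unchanged at 0.875, while the other two measures change once the 1000 null transactions are added.

from math import sqrt

def cosine_is(f11, f10, f01, f00):
    return f11 / sqrt((f11 + f10) * (f11 + f01))

def odds_ratio(f11, f10, f01, f00):
    return (f11 * f00) / (f10 * f01)

def interest(f11, f10, f01, f00):
    n = f11 + f10 + f01 + f00
    return n * f11 / ((f11 + f10) * (f11 + f01))

t1 = (700, 100, 100, 100)     # Table 5.14(a)
t2 = (700, 100, 100, 1100)    # Table 5.14(b): 1000 null transactions added to f00

print(cosine_is(*t1), cosine_is(*t2))     # 0.875 and 0.875: invariant
print(odds_ratio(*t1), odds_ratio(*t2))   # 7 and 77: affected by null addition
print(interest(*t1), interest(*t2))       # the interest factor changes as well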

Table 5.15 provides a summary of properties for the measures defined in Table 5.9. Even though this list of properties is not exhaustive, it can serve as a useful guide for selecting the right choice of measure for an application. Ideally, if we know the specific requirements of a certain application, we can ensure that the selected measure shows properties that adhere to those requirements. For example, if we are dealing with asymmetric variables, we would prefer to use a measure that is not invariant to null addition or inversion. On the other hand, if we require the measure to remain invariant to changes in the sample size, we would like to use a measure that does not change with scaling.

Table 5.15. Properties of symmetric measures.

   Symbol   Measure               Inversion   Null Addition   Scaling
   ϕ        ϕ-coefficient         Yes         No              No
   α        odds ratio            Yes         No              Yes
   κ        Cohen's               Yes         No              No
   I        Interest              No          No              No
   IS       Cosine                No          Yes             No
   PS       Piatetsky-Shapiro's   Yes         No              No
   S        Collective strength   Yes         No              No
   ζ        Jaccard               No          Yes             No
   h        All-confidence        No          Yes             No
   s        Support               No          No              No

Asymmetric Interestingness Measures  Note that in the discussion so far, we have only considered measures that do not change their value when the order of the variables is reversed. More specifically, if M is a measure and A and B are two variables, then M(A, B) is equal to M(B, A) if the order of the variables does not matter. Such measures are called symmetric. On the other hand, measures that depend on the order of variables (M(A, B) ≠ M(B, A)) are called asymmetric measures. For example, the interest factor is a symmetric measure because its value is identical for the rules A → B and B → A. In contrast, confidence is an asymmetric measure since the confidence for A → B and B → A may not be the same. Note that the use of the term 'asymmetric' to describe a particular type of measure of relationship, one in which the order of the variables is important, should not be confused with the use of 'asymmetric' to describe a binary variable for which only 1's are important. Asymmetric measures are more suitable for analyzing association rules, since the items in a rule do have a specific order. Even though we only considered symmetric measures to discuss the different properties of association measures, the above discussion is also relevant for the asymmetric measures. See the Bibliographic Notes for more information about different kinds of asymmetric measures and their properties.

5.7.2 Measures beyond Pairs of Binary Variables

The measures shown in Table 5.9 are defined for pairs of binary variables (e.g., 2-itemsets or association rules). However, many of them, such as support and all-confidence, are also applicable to larger-sized itemsets. Other measures, such as the interest factor, IS, PS, and Jaccard coefficient, can be extended to more than two variables using the frequency counts tabulated in a multidimensional contingency table. An example of a three-dimensional contingency table for a, b, and c is shown in Table 5.16. Each entry fijk in this table represents the number of transactions that contain a particular combination of items a, b, and c. For example, f101 is the number of transactions that contain a and c, but not b. On the other hand, a marginal frequency such as f1+1 is the number of transactions that contain a and c, irrespective of whether b is present in the transaction.

Table 5.16. Example of a three-dimensional contingency table.

   c:              b        b¯
       a         f111     f101    f1+1
       a¯        f011     f001    f0+1
                 f+11     f+01    f++1

   c¯:             b        b¯
       a         f110     f100    f1+0
       a¯        f010     f000    f0+0
                 f+10     f+00    f++0

Given a k-itemset {i1, i2, …, ik}, the condition for statistical independence can be stated as follows:

fi1i2…ik = (fi1+…+ × f+i2…+ × … × f++…ik) / N^(k−1).    (5.11)

With this definition, we can extend objective measures such as the interest factor and PS, which are based on deviations from statistical independence, to more than two variables:

I = (N^(k−1) × fi1i2…ik) / (fi1+…+ × f+i2…+ × … × f++…ik)

PS = fi1i2…ik/N − (fi1+…+ × f+i2…+ × … × f++…ik) / N^k

Another approach is to define the objective measure as the maximum, minimum, or average value for the associations between pairs of items in a pattern. For example, given a k-itemset X = {i1, i2, …, ik}, we may define the ϕ-coefficient for X as the average ϕ-coefficient(ip, iq) between every pair of items in X. However, because the measure considers only pairwise associations, it may not capture all the underlying relationships within a pattern. Also, care should be taken in using such alternate measures for more than two variables, since they may not always show the anti-monotone property in the same way as the support measure, making them unsuitable for mining patterns using the Apriori principle.

Analysis of multidimensional contingency tables is more complicated because of the presence of partial associations in the data. For example, some associations may appear or disappear when conditioned upon the value of certain variables. This problem is known as Simpson's paradox and is described in Section 5.7.3. More sophisticated statistical techniques are available to analyze such relationships, e.g., loglinear models, but these techniques are beyond the scope of this book.
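For concreteness, the sketch below (ours; the counts are hypothetical and not taken from the text) evaluates the extended interest factor for a 3-itemset from a 2×2×2 table of frequency counts, using the independence baseline of Equation 5.11.

import numpy as np

def interest_3way(f):
    """Extended interest factor for a 3-itemset from a 2x2x2 count array f,
    where f[a, b, c] is the number of transactions with the given presence
    pattern (index 1 = item present, index 0 = item absent)."""
    n = f.sum()
    joint = f[1, 1, 1]                                        # f111
    baseline = f[1].sum() * f[:, 1].sum() * f[:, :, 1].sum() / n**2
    return joint / baseline            # N^2 * f111 / (f1++ * f+1+ * f++1)

# Hypothetical counts for three items a, b, and c.
counts = np.array([[[30, 5], [10, 5]],     # a absent
                   [[10, 5], [5, 30]]])    # a present
print(interest_3way(counts))               # > 1 indicates positive association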

5.7.3 Simpson's Paradox

It is important to exercise caution when interpreting the association between variables because the observed relationship may be influenced by the presence of other confounding factors, i.e., hidden variables that are not included in the analysis. In some cases, the hidden variables may cause the observed relationship between a pair of variables to disappear or reverse its direction, a phenomenon that is known as Simpson's paradox. We illustrate the nature of this paradox with the following example.

Consider the relationship between the sale of high-definition televisions (HDTV) and exercise machines, as shown in Table 5.17. The rule {HDTV = Yes} → {Exercise machine = Yes} has a confidence of 99/180 = 55% and the rule {HDTV = No} → {Exercise machine = Yes} has a confidence of 54/120 = 45%. Together, these rules suggest that customers who buy high-definition televisions are more likely to buy exercise machines than those who do not buy high-definition televisions.

Table 5.17. A two-way contingency table between the sale of high-definition television and exercise machine.

                     Buy Exercise Machine
   Buy HDTV           Yes       No
   Yes                 99       81     180
   No                  54       66     120
                      153      147     300

However, a deeper analysis reveals that the sales of these items depend on whether the customer is a college student or a working adult. Table 5.18 summarizes the relationship between the sale of HDTVs and exercise machines among college students and working adults. Notice that the support counts given in the table for college students and working adults sum up to the frequencies shown in Table 5.17. Furthermore, there are more working adults than college students who buy these items.

Table 5.18. Example of a three-way contingency table.

                                     Buy Exercise Machine
   Customer Group     Buy HDTV        Yes       No     Total
   College Students   Yes               1        9        10
                      No                4       30        34
   Working Adult      Yes              98       72       170
                      No               50       36        86

For college students:

   c({HDTV = Yes} → {Exercise machine = Yes}) = 1/10 = 10%,
   c({HDTV = No} → {Exercise machine = Yes}) = 4/34 = 11.8%,

while for working adults:

   c({HDTV = Yes} → {Exercise machine = Yes}) = 98/170 = 57.7%,
   c({HDTV = No} → {Exercise machine = Yes}) = 50/86 = 58.1%.

The rules suggest that, for each group, customers who do not buy high-definition televisions are more likely to buy exercise machines, which contradicts the previous conclusion when data from the two customer groups are pooled together. Even if alternative measures such as correlation, odds ratio, or interest are applied, we still find that the sale of HDTV and exercise machine is positively related in the combined data but is negatively related in the stratified data (see Exercise 21 on page 449). The reversal in the direction of association is known as Simpson's paradox.

The paradox can be explained in the following way. First, notice that most customers who buy HDTVs are working adults. This is reflected in the high confidence of the rule {HDTV = Yes} → {Working Adult} (170/180 = 94.4%). Second, the high confidence of the rule {Exercise machine = Yes} → {Working Adult} (148/153 = 96.7%) suggests that most customers who buy exercise machines are also working adults. Since working adults form the largest fraction of customers for both HDTVs and exercise machines, they both look related and the rule {HDTV = Yes} → {Exercise machine = Yes} turns out to be stronger in the combined data than what it would have been if the data is stratified. Hence, customer group acts as a hidden variable that affects both the fraction of customers who buy HDTVs and those who buy exercise machines. If we factor out the effect of the hidden variable by stratifying the data, we see that the relationship between buying HDTVs and buying exercise machines is not direct, but shows up as an indirect consequence of the effect of the hidden variable.

Simpson's paradox can also be illustrated mathematically as follows. Suppose

   a/b < c/d  and  p/q < r/s,

where a/b and p/q may represent the confidence of the rule A → B in two different strata, while c/d and r/s may represent the confidence of the rule A¯ → B in the two strata. When the data is pooled together, the confidence values of the rules in the combined data are (a + p)/(b + q) and (c + r)/(d + s), respectively. Simpson's paradox occurs when

   (a + p)/(b + q) > (c + r)/(d + s),

thus leading to the wrong conclusion about the relationship between the variables. The lesson here is that proper stratification is needed to avoid generating spurious patterns resulting from Simpson's paradox. For example, market basket data from a major supermarket chain should be stratified according to store locations, while medical records from various patients should be stratified according to confounding factors such as age and gender.
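The reversal is easy to reproduce from the counts in Tables 5.17 and 5.18. In the following sketch (ours, for illustration), each stratum gives the HDTV = No group the higher confidence, yet the pooled counts give HDTV = Yes the higher confidence.

def confidence(buys_machine, does_not_buy):
    """Confidence of {group} -> {Exercise machine = Yes}."""
    return buys_machine / (buys_machine + does_not_buy)

# Counts from Table 5.18: (buys exercise machine, does not buy), per HDTV group.
students = {"HDTV=Yes": (1, 9),   "HDTV=No": (4, 30)}
adults   = {"HDTV=Yes": (98, 72), "HDTV=No": (50, 36)}

for label, stratum in (("college students", students), ("working adults", adults)):
    print(label, {k: round(confidence(*v), 3) for k, v in stratum.items()})

# Pooling the two strata recovers Table 5.17 and reverses the direction.
pooled = {k: (students[k][0] + adults[k][0], students[k][1] + adults[k][1])
          for k in students}
print("pooled", {k: round(confidence(*v), 3) for k, v in pooled.items()})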

5.8 Effect of Skewed Support Distribution

The performances of many association analysis algorithms are influenced by properties of their input data. For example, the computational complexity of the Apriori algorithm depends on properties such as the number of items in the data, the average transaction width, and the support threshold used. This section examines another important property that has significant influence on the performance of association analysis algorithms as well as the quality of extracted patterns. More specifically, we focus on data sets with skewed support distributions, where most of the items have relatively low to moderate frequencies, but a small number of them have very high frequencies.

Figure 5.29. A transaction data set containing three items, p, q, and r, where p is a high support item and q and r are low support items.

Figure 5.29 shows an illustrative example of a data set that has a skewed support distribution of its items. While p has a high support of 83.3% in the data, q and r are low-support items with a support of 16.7%. Despite their low support, q and r always occur together in the limited number of transactions in which they appear and hence are strongly related. A pattern mining algorithm therefore should report {q, r} as interesting.

However, note that choosing the right support threshold for mining itemsets such as {q, r} can be quite tricky. If we set the threshold too high (e.g., 20%), then we may miss many interesting patterns involving low-support items such as {q, r}. Conversely, setting the support threshold too low can be detrimental to the pattern mining process for the following reasons. First, the computational and memory requirements of existing association analysis algorithms increase considerably with low support thresholds. Second, the number of extracted patterns also increases substantially with low support thresholds, which makes their analysis and interpretation difficult. In particular, we may extract many spurious patterns that relate a high-frequency item such as p to a low-frequency item such as q. Such patterns, which are called cross-support patterns, are likely to be spurious because the association between p and q is largely influenced by the frequent occurrence of p instead of the joint occurrence of p and q together. Because the support of {p, q} is quite close to the support of {q, r}, we may easily select {p, q} if we set the support threshold low enough to include {q, r}.

An example of a real data set that exhibits a skewed support distribution is shown in Figure 5.30. The data, taken from the PUMS (Public Use Microdata Sample) census data, contains 49,046 records and 2113 asymmetric binary variables. We shall treat the asymmetric binary variables as items and records as transactions. While more than 80% of the items have support less than 1%, a handful of them have support greater than 90%. To understand the effect of a skewed support distribution on frequent itemset mining, we divide the items into three groups, G1, G2, and G3, according to their support levels, as shown in Table 5.19. We can see that more than 82% of the items belong to G1 and have a support less than 1%. In market basket analysis, such low support items may correspond to expensive products (such as jewelry) that are seldom bought by customers, but whose patterns are still interesting to retailers. Patterns involving such low-support items, though meaningful, can easily be rejected by a frequent pattern mining algorithm with a high support threshold. On the other hand, setting a low support threshold may result in the extraction of spurious patterns that relate a high-frequency item in G3 to a low-frequency item in G1. For example, at a support threshold equal to 0.05%, there are 18,847 frequent pairs involving items from G1 and G3. Out of these, 93% of them are cross-support patterns; i.e., the patterns contain items from both G1 and G3.

Figure 5.30. Support distribution of items in the census data set.

Table 5.19. Grouping the items in the census data set based on their support values.

   Group              G1       G2         G3
   Support            <1%      1%–90%     >90%
   Number of Items    1735     358        20

This example shows that a large number of weakly related cross-support patterns can be generated when the support threshold is sufficiently low. Note that finding interesting patterns in data sets with skewed support distributions is not just a challenge for the support measure; similar statements can be made about many other objective measures discussed in the previous sections. Before presenting a methodology for finding interesting patterns and pruning spurious ones, we formally define the concept of cross-support patterns.

Definition 5.9. (Cross-support Pattern.) Let us define the support ratio, r(X), of an itemset X = {i1, i2, …, ik} as

r(X) = min[s(i1), s(i2), …, s(ik)] / max[s(i1), s(i2), …, s(ik)].    (5.12)

Given a user-specified threshold hc, an itemset X is a cross-support pattern if r(X) < hc.

Example 5.4. Suppose the support for milk is 70%, while the support for sugar is 10% and caviar is 0.04%. Given hc = 0.01, the frequent itemset {milk, sugar, caviar} is a cross-support pattern because its support ratio is

r = min[0.7, 0.1, 0.0004] / max[0.7, 0.1, 0.0004] = 0.0004/0.7 = 0.00058 < 0.01.
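The support ratio test is straightforward to apply once the individual item supports are known. The sketch below (ours) implements Equation 5.12 and repeats the check of Example 5.4.

def support_ratio(item_supports):
    """Support ratio r(X) of an itemset, Equation 5.12."""
    return min(item_supports) / max(item_supports)

def is_cross_support(item_supports, hc):
    """An itemset is a cross-support pattern if its support ratio is below hc."""
    return support_ratio(item_supports) < hc

# Example 5.4: milk (70%), sugar (10%), caviar (0.04%), threshold hc = 0.01.
print(support_ratio([0.7, 0.1, 0.0004]))           # about 0.00057
print(is_cross_support([0.7, 0.1, 0.0004], 0.01))  # True: cross-support pattern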

Existing measures such as support and confidence may not be sufficient to eliminate cross-support patterns. For example, if we assume hc = 0.3 for the data set presented in Figure 5.29, the itemsets {p, q}, {p, r}, and {p, q, r} are cross-support patterns because their support ratios, which are equal to 0.2, are less than the threshold hc. However, their supports are comparable to that of {q, r}, making it difficult to eliminate cross-support patterns without losing interesting ones using a support-based pruning strategy. Confidence pruning also does not help because the confidence of the rules extracted from cross-support patterns can be very high. For example, the confidence for {q} → {p} is 80% even though {p, q} is a cross-support pattern. The fact that the cross-support pattern can produce a high confidence rule should not come as a surprise because one of its items (p) appears very frequently in the data. Therefore, p is expected to appear in many of the transactions that contain q. Meanwhile, the rule {q} → {r} also has high confidence even though {q, r} is not a cross-support pattern. This example demonstrates the difficulty of using the confidence measure to distinguish between rules extracted from cross-support patterns and interesting patterns involving strongly connected but low-support items.

Even though the rule {q} → {p} has very high confidence, notice that the rule {p} → {q} has very low confidence because most of the transactions that contain p do not contain q. In contrast, the rule {r} → {q}, which is derived from {q, r}, has very high confidence. This observation suggests that cross-support patterns can be detected by examining the lowest confidence rule that can be extracted from a given itemset. An approach for finding the rule with the lowest confidence given an itemset can be described as follows.

1. Recall the following anti-monotone property of confidence:

   conf({i1, i2} → {i3, i4, …, ik}) ≤ conf({i1, i2, i3} → {i4, i5, …, ik}).

This property suggests that confidence never increases as we shift more items from the left- to the right-hand side of an association rule. Because of this property, the lowest confidence rule extracted from a frequent itemset contains only one item on its left-hand side. We denote the set of all rules with only one item on the left-hand side as R1.

2. Given a frequent itemset {i1, i2, …, ik}, the rule

   {ij} → {i1, i2, …, ij−1, ij+1, …, ik}

has the lowest confidence in R1 if s(ij) = max[s(i1), s(i2), …, s(ik)]. This follows directly from the definition of confidence as the ratio between the rule's support and the support of the rule antecedent. Hence, the confidence of a rule will be lowest when the support of the antecedent is highest.

3. Summarizing the previous points, the lowest confidence attainable from a frequent itemset {i1, i2, …, ik} is

   s({i1, i2, …, ik}) / max[s(i1), s(i2), …, s(ik)].

This expression is also known as the h-confidence or all-confidence measure. Because of the anti-monotone property of support, the numerator of the h-confidence measure is bounded by the minimum support of any item that appears in the frequent itemset. In other words, the h-confidence of an itemset X = {i1, i2, …, ik} must not exceed the following expression:

   h-confidence(X) ≤ min[s(i1), s(i2), …, s(ik)] / max[s(i1), s(i2), …, s(ik)].

Note that the upper bound of h-confidence in the above equation is exactly the same as the support ratio (r) given in Equation 5.12. Because the support ratio for a cross-support pattern is always less than hc, the h-confidence of the pattern is also guaranteed to be less than hc. Therefore, cross-support patterns can be eliminated by ensuring that the h-confidence values for the patterns exceed hc. As a final note, the advantages of using h-confidence go beyond eliminating cross-support patterns. The measure is also anti-monotone, i.e.,

   h-confidence({i1, i2, …, ik}) ≥ h-confidence({i1, i2, …, ik+1}),

and thus can be incorporated directly into the mining algorithm. Furthermore, h-confidence ensures that the items contained in an itemset are strongly associated with each other. For example, suppose the h-confidence of an itemset X is 80%. If one of the items in X is present in a transaction, there is at least an 80% chance that the rest of the items in X also belong to the same transaction. Such strongly associated patterns involving low-support items are called hyperclique patterns.

Definition 5.10. (Hyperclique Pattern.) An itemset X is a hyperclique pattern if h-confidence(X) > hc, where hc is a user-specified threshold.
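A minimal sketch of the h-confidence test follows. It is ours rather than part of the original text; the supports are the ones quoted in the discussion of Figure 5.29 (s(p) = 0.833, s(q) = s(r) = 0.167), and s({p, q}) = 0.133 is derived from the 80% confidence of {q} → {p}.

def h_confidence(itemset_support, item_supports):
    """h-confidence (all-confidence): s(X) divided by the largest item support."""
    return itemset_support / max(item_supports)

hc = 0.3
# {p, q}: a cross-support pattern; its h-confidence falls below the threshold.
print(h_confidence(0.133, [0.833, 0.167]) > hc)    # False -> pruned
# {q, r}: q and r always co-occur, so h-confidence is 1; a hyperclique pattern.
print(h_confidence(0.167, [0.167, 0.167]) > hc)    # True  -> retained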

5.9 Bibliographic Notes

The association rule mining task was first introduced by Agrawal et al. [324, 325] to discover interesting relationships among items in market basket transactions. Since its inception, extensive research has been conducted to address the various issues in association rule mining, from its fundamental concepts to its implementation and applications. Figure 5.31 shows a taxonomy of the various research directions in this area, which is generally known as association analysis. As much of the research focuses on finding patterns that appear significantly often in the data, the area is also known as frequent pattern mining. A detailed review on some of the research topics in this area can be found in [362] and in [319].

Figure 5.31. An overview of the various research directions in association analysis.

Conceptual Issues

Research on the conceptual issues of association analysis has focused on developing a theoretical formulation of association analysis and extending the formulation to new types of patterns and going beyond asymmetric binary attributes.

Following the pioneering work by Agrawal et al. [324, 325], there has been a vast amount of research on developing a theoretical formulation for the association analysis problem. In [357], Gunopulos et al. showed the connection between finding maximal frequent itemsets and the hypergraph transversal problem. An upper bound on the complexity of the association analysis task was also derived. Zaki et al. [454, 456] and Pasquier et al. [407] have applied formal concept analysis to study the frequent itemset generation problem. More importantly, such research has led to the development of a class of patterns known as closed frequent itemsets [456]. Friedman et al. [355] have studied the association analysis problem in the context of bump hunting in multidimensional space. Specifically, they consider frequent itemset generation as the task of finding high density regions in multidimensional space. Formalizing association analysis in a statistical learning framework is another active research direction [414, 435, 444] as it can help address issues related to identifying statistically significant patterns and dealing with uncertain data [320, 333, 343].

Over the years, the association rule mining formulation has been expanded to encompass other rule-based patterns, such as profile association rules [321], cyclic association rules [403], fuzzy association rules [379], exception rules [431], negative association rules [336, 418], weighted association rules [338, 413], dependence rules [422], peculiar rules [462], inter-transaction association rules [353, 440], and partial classification rules [327, 397]. Additionally, the concept of frequent itemset has been extended to other types of patterns including closed itemsets [407, 456], maximal itemsets [330], hyperclique patterns [449], support envelopes [428], emerging patterns [347], contrast sets [329], high-utility itemsets [340, 390], approximate or error-tolerant itemsets [358, 389, 451], and discriminative patterns [352, 401, 430]. Association analysis techniques have also been successfully applied to sequential [326, 426], spatial [371], and graph-based [374, 380, 406, 450, 455] data.

Substantial research has been conducted to extend the original association rule formulation to nominal [425], ordinal [392], interval [395], and ratio [356, 359, 425, 443, 461] attributes. One of the key issues is how to define the support measure for these attributes. A methodology was proposed by Steinbach et al. [429] to extend the traditional notion of support to more general patterns and attribute types.

Implementation Issues

Research activities in this area revolve around (1) integrating the mining capability into existing database technology, (2) developing efficient and scalable mining algorithms, (3) handling user-specified or domain-specific constraints, and (4) post-processing the extracted patterns.

There are several advantages to integrating association analysis into existing database technology. First, it can make use of the indexing and query processing capabilities of the database system. Second, it can also exploit the DBMS support for scalability, check-pointing, and parallelization [415]. The SETM algorithm developed by Houtsma et al. [370] was one of the earliest algorithms to support association rule discovery via SQL queries. Since then, numerous methods have been developed to provide capabilities for mining association rules in database systems. For example, the DMQL [363] and M-SQL [373] query languages extend the basic SQL with new operators for mining association rules. The MineRule operator [394] is an expressive SQL operator that can handle both clustered attributes and item hierarchies. Tsur et al. [439] developed a generate-and-test approach called query flocks for mining association rules. A distributed OLAP-based infrastructure was developed by Chen et al. [341] for mining multilevel association rules.

Despite its popularity, the Apriori algorithm is computationally expensive because it requires making multiple passes over the transaction database. Its runtime and storage complexities were investigated by Dunkel and Soparkar [349]. The FP-growth algorithm was developed by Han et al. in [364]. Other algorithms for mining frequent itemsets include the DHP (dynamic hashing and pruning) algorithm proposed by Park et al. [405] and the Partition algorithm developed by Savasere et al. [417]. A sampling-based frequent itemset generation algorithm was proposed by Toivonen [436]. The algorithm requires only a single pass over the data, but it can produce more candidate itemsets than necessary. The Dynamic Itemset Counting (DIC) algorithm [337] makes only 1.5 passes over the data and generates fewer candidate itemsets than the sampling-based algorithm. Other notable algorithms include the tree-projection algorithm [317] and H-Mine [408]. Survey articles on frequent itemset generation algorithms can be found in [322, 367]. A repository of benchmark data sets and software implementations of association rule mining algorithms is available at the Frequent Itemset Mining Implementations (FIMI) repository (http://fimi.cs.helsinki.fi).

Parallel algorithms have been developed to scale up association rule mining for handling big data [318, 360, 399, 420, 457]. A survey of such algorithms can be found in [453]. Online and incremental association rule mining algorithms have also been proposed by Hidber [365] and Cheung et al. [342]. More recently, new algorithms have been developed to speed up frequent itemset mining by exploiting the processing power of GPUs [459] and the MapReduce/Hadoop distributed computing framework [382, 384, 396]. For example, an implementation of frequent itemset mining for the Hadoop framework is available in the Apache Mahout software.1

1 http://mahout.apache.org

Srikant et al. [427] have considered the problem of mining association rules in the presence of Boolean constraints such as the following:

   (Cookies ∧ Milk) ∨ (descendants(Cookies) ∧ ¬ancestors(Wheat Bread))

Given such a constraint, the algorithm looks for rules that contain both cookies and milk, or rules that contain the descendant items of cookies but not ancestor items of wheat bread. Singh et al. [424] and Ng et al. [400] had also developed alternative techniques for constraint-based association rule mining. Constraints can also be imposed on the support for different itemsets. This problem was investigated by Wang et al. [442], Liu et al. in [387], and Seno et al. [419]. In addition, constraints arising from privacy concerns of mining sensitive data have led to the development of privacy-preserving frequent pattern mining techniques [334, 350, 441, 458].

One potential problem with association analysis is the large number of patterns that can be generated by current algorithms. To overcome this problem, methods to rank, summarize, and filter patterns have been developed. Toivonen et al. [437] proposed the idea of eliminating redundant rules using structural rule covers and grouping the remaining rules using clustering. Liu et al. [388] applied the statistical chi-square test to prune spurious patterns and summarized the remaining patterns using a subset of the patterns called direction setting rules. The use of objective measures to filter patterns has been investigated by many authors, including Brin et al. [336], Bayardo and Agrawal [331], Aggarwal and Yu [323], and DuMouchel and Pregibon [348]. The properties of many of these measures were analyzed by Piatetsky-Shapiro [410], Kamber and Singhal [376], Hilderman and Hamilton [366], and Tan et al. [433]. The grade-gender example used to highlight the importance of the row and column scaling invariance property was heavily influenced by the discussion given in [398] by Mosteller. Meanwhile, the tea-coffee example illustrating the limitation of confidence was motivated by an example given in [336] by Brin et al. Because of the limitation of confidence, Brin et al. [336] had proposed the idea of using the interest factor as a measure of interestingness. The all-confidence measure was proposed by Omiecinski [402]. Xiong et al. [449] introduced the cross-support property and showed that the all-confidence measure can be used to eliminate cross-support patterns. A key difficulty in using alternative objective measures besides support is their lack of a monotonicity property, which makes it difficult to incorporate the measures directly into the mining algorithms. Xiong et al. [447] have proposed an efficient method for mining correlations by introducing an upper bound function to the ϕ-coefficient. Although the measure is non-monotone, it has an upper bound expression that can be exploited for the efficient mining of strongly correlated item pairs.

Fabris and Freitas [351] have proposed a method for discovering interesting associations by detecting the occurrences of Simpson's paradox [423]. Megiddo and Srikant [393] described an approach for validating the extracted patterns using hypothesis testing methods. A resampling-based technique was also developed to avoid generating spurious patterns because of the multiple comparison problem. Bolton et al. [335] have applied the Benjamini-Hochberg [332] and Bonferroni correction methods to adjust the p-values of discovered patterns in market basket data. Alternative methods for handling the multiple comparison problem were suggested by Webb [445], Zhang et al. [460], and Llinares-Lopez et al. [391].

Application of subjective measures to association analysis has been investigated by many authors. Silberschatz and Tuzhilin [421] presented two principles in which a rule can be considered interesting from a subjective point of view. The concept of unexpected condition rules was introduced by Liu et al. in [385]. Cooley et al. [344] analyzed the idea of combining soft belief sets using the Dempster-Shafer theory and applied this approach to identify contradictory and novel association patterns in web data. Alternative approaches include using Bayesian networks [375] and neighborhood-based information [346] to identify subjectively interesting patterns.

Visualization also helps the user to quickly grasp the underlying structure of the discovered patterns. Many commercial data mining tools display the complete set of rules (which satisfy both support and confidence threshold criteria) as a two-dimensional plot, with each axis corresponding to the antecedent or consequent itemsets of the rule. Hofmann et al. [368] proposed using Mosaic plots and Double Decker plots to visualize association rules. This approach can visualize not only a particular rule, but also the overall contingency table between itemsets in the antecedent and consequent parts of the rule. Nevertheless, this technique assumes that the rule consequent consists of only a single attribute.

Application Issues

Association analysis has been applied to a variety of application domains such as web mining [409, 432], document analysis [369], telecommunication alarm diagnosis [377], network intrusion detection [328, 345, 381], and bioinformatics [416, 446]. Applications of association and correlation pattern analysis to Earth Science studies have been investigated in [411, 412, 434]. Trajectory pattern mining [339, 372, 438] is another application of spatio-temporal association analysis to identify frequently traversed paths of moving objects.

Association patterns have also been applied to other learning problems such as classification [383, 386], regression [404], and clustering [361, 448, 452]. A comparison between classification and association rule mining was made by Freitas in his position paper [354]. The use of association patterns for clustering has been studied by many authors including Han et al. [361], Kosters et al. [378], Yang et al. [452], and Xiong et al. [448].

Bibliography

[317] R. C. Agarwal, C. C. Aggarwal, and V. V. V. Prasad. A Tree Projection Algorithm for Generation of Frequent Itemsets. Journal of Parallel and Distributed Computing (Special Issue on High Performance Data Mining), 61(3):350–371, 2001.

[318]R.C.AgarwalandJ.C.Shafer.ParallelMiningofAssociationRules.IEEETransactionsonKnowledgeandDataEngineering,8(6):962–969,March1998.

[319]C.AggarwalandJ.Han.FrequentPatternMining.Springer,2014.

[320]C.C.Aggarwal,Y.Li,J.Wang,andJ.Wang.Frequentpatternminingwithuncertaindata.InProceedingsofthe15thACMSIGKDDInternationalConferenceonKnowledgeDiscoveryandDataMining,pages29–38,Paris,France,2009.

[321]C.C.Aggarwal,Z.Sun,andP.S.Yu.OnlineGenerationofProfileAssociationRules.InProc.ofthe4thIntl.Conf.onKnowledgeDiscoveryandDataMining,pages129—133,NewYork,NY,August1996.

[322]C.C.AggarwalandP.S.Yu.MiningLargeItemsetsforAssociationRules.DataEngineeringBulletin,21(1):23–31,March1998.

[323]C.C.AggarwalandP.S.Yu.MiningAssociationswiththeCollectiveStrengthApproach.IEEETrans.onKnowledgeandDataEngineering,13(6):863–873,January/February2001.

[324]R.Agrawal,T.Imielinski,andA.Swami.Databasemining:Aperformanceperspective.IEEETransactionsonKnowledgeandDataEngineering,5:914–925,1993.

[325]R.Agrawal,T.Imielinski,andA.Swami.Miningassociationrulesbetweensetsofitemsinlargedatabases.InProc.ACMSIGMODIntl.Conf.ManagementofData,pages207–216,Washington,DC,1993.

[326]R.AgrawalandR.Srikant.MiningSequentialPatterns.InProc.ofIntl.Conf.onDataEngineering,pages3–14,Taipei,Taiwan,1995.

[327]K.Ali,S.Manganaris,andR.Srikant.PartialClassificationusingAssociationRules.InProc.ofthe3rdIntl.Conf.onKnowledgeDiscoveryandDataMining,pages115—118,NewportBeach,CA,August1997.

[328]D.Barbará,J.Couto,S.Jajodia,andN.Wu.ADAM:ATestbedforExploringtheUseofDataMininginIntrusionDetection.SIGMODRecord,30(4):15–24,2001.

[329]S.D.BayandM.Pazzani.DetectingGroupDifferences:MiningContrastSets.DataMiningandKnowledgeDiscovery,5(3):213–246,2001.

[330]R.Bayardo.EfficientlyMiningLongPatternsfromDatabases.InProc.of1998ACM-SIGMODIntl.Conf.onManagementofData,pages85–93,Seattle,WA,June1998.

[331]R.BayardoandR.Agrawal.MiningtheMostInterestingRules.InProc.ofthe5thIntl.Conf.onKnowledgeDiscoveryandDataMining,pages145–153,SanDiego,CA,August1999.

[332]Y.BenjaminiandY.Hochberg.ControllingtheFalseDiscoveryRate:APracticalandPowerfulApproachtoMultipleTesting.JournalRoyalStatisticalSocietyB,57(1):289–300,1995.

[333]T.Bernecker,H.Kriegel,M.Renz,F.Verhein,andA.Züle.Probabilisticfrequentitemsetmininginuncertaindatabases.InProceedingsofthe15thACMSIGKDDInternationalConferenceonKnowledgeDiscoveryandDataMining,pages119–128,Paris,France,2009.

[334]R.Bhaskar,S.Laxman,A.D.Smith,andA.Thakurta.Discoveringfrequentpatternsinsensitivedata.InProceedingsofthe16thACMSIGKDDInternationalConferenceonKnowledgeDiscoveryandDataMining,pages503–512,Washington,DC,2010.

[335]R.J.Bolton,D.J.Hand,andN.M.Adams.DeterminingHitRateinPatternSearch.InProc.oftheESFExploratoryWorkshoponPatternDetectionandDiscoveryinDataMining,pages36–48,London,UK,September2002.

[336]S.Brin,R.Motwani,andC.Silverstein.Beyondmarketbaskets:Generalizingassociationrulestocorrelations.InProc.ACMSIGMODIntl.Conf.ManagementofData,pages265–276,Tucson,AZ,1997.

[337]S.Brin,R.Motwani,J.Ullman,andS.Tsur.DynamicItemsetCountingandImplicationRulesformarketbasketdata.InProc.of1997ACM-SIGMODIntl.Conf.onManagementofData,pages255–264,Tucson,AZ,June1997.

[338]C.H.Cai,A.Fu,C.H.Cheng,andW.W.Kwong.MiningAssociationRuleswithWeightedItems.InProc.ofIEEEIntl.DatabaseEngineeringandApplicationsSymp.,pages68–77,Cardiff,Wales,1998.

[339]H.Cao,N.Mamoulis,andD.W.Cheung.MiningFrequentSpatio-TemporalSequentialPatterns.InProceedingsofthe5thIEEEInternationalConferenceonDataMining,pages82–89,Houston,TX,2005.

[340]R.Chan,Q.Yang,andY.Shen.MiningHighUtilityItemsets.InProceedingsofthe3rdIEEEInternationalConferenceonDataMining,pages19–26,Melbourne,FL,2003.

[341]Q.Chen,U.Dayal,andM.Hsu.ADistributedOLAPinfrastructureforE-Commerce.InProc.ofthe4thIFCISIntl.Conf.onCooperativeInformationSystems,pages209—220,Edinburgh,Scotland,1999.

[342]D.C.Cheung,S.D.Lee,andB.Kao.AGeneralIncrementalTechniqueforMaintainingDiscoveredAssociationRules.InProc.ofthe5thIntl.Conf.

onDatabaseSystemsforAdvancedApplications,pages185–194,Melbourne,Australia,1997.

[343]C.K.Chui,B.Kao,andE.Hung.MiningFrequentItemsetsfromUncertainData.InProceedingsofthe11thPacific-AsiaConferenceonKnowledgeDiscoveryandDataMining,pages47–58,Nanjing,China,2007.

[344]R.Cooley,P.N.Tan,andJ.Srivastava.DiscoveryofInterestingUsagePatternsfromWebData.InM.SpiliopoulouandB.Masand,editors,AdvancesinWebUsageAnalysisandUserProfiling,volume1836,pages163–182.LectureNotesinComputerScience,2000.

[345]P.Dokas,L.Ertöz,V.Kumar,A.Lazarevic,J.Srivastava,andP.N.Tan.DataMiningforNetworkIntrusionDetection.InProc.NSFWorkshoponNextGenerationDataMining,Baltimore,MD,2002.

[346]G.DongandJ.Li.Interestingnessofdiscoveredassociationrulesintermsofneighborhood-basedunexpectedness.InProc.ofthe2ndPacific-AsiaConf.onKnowledgeDiscoveryandDataMining,pages72–86,Melbourne,Australia,April1998.

[347]G.DongandJ.Li.EfficientMiningofEmergingPatterns:DiscoveringTrendsandDifferences.InProc.ofthe5thIntl.Conf.onKnowledgeDiscoveryandDataMining,pages43–52,SanDiego,CA,August1999.

[348]W.DuMouchelandD.Pregibon.EmpiricalBayesScreeningforMulti-ItemAssociations.InProc.ofthe7thIntl.Conf.onKnowledgeDiscovery

andDataMining,pages67–76,SanFrancisco,CA,August2001.

[349]B.DunkelandN.Soparkar.DataOrganizationandAccessforEfficientDataMining.InProc.ofthe15thIntl.Conf.onDataEngineering,pages522–529,Sydney,Australia,March1999.

[350]A.V.Evfimievski,R.Srikant,R.Agrawal,andJ.Gehrke.Privacypreservingminingofassociationrules.InProceedingsoftheEighthACMSIGKDDInternationalConferenceonKnowledgeDiscoveryandDataMining,pages217–228,Edmonton,Canada,2002.

[351]C.C.FabrisandA.A.Freitas.DiscoveringsurprisingpatternsbydetectingoccurrencesofSimpson'sparadox.InProc.ofthe19thSGESIntl.Conf.onKnowledge-BasedSystemsandAppliedArtificialIntelligence),pages148–160,Cambridge,UK,December1999.

[352]G.Fang,G.Pandey,W.Wang,M.Gupta,M.Steinbach,andV.Kumar.MiningLow-SupportDiscriminativePatternsfromDenseandHigh-DimensionalData.IEEETrans.Knowl.DataEng.,24(2):279–294,2012.

[353]L.Feng,H.J.Lu,J.X.Yu,andJ.Han.Mininginter-transactionassociationswithtemplates.InProc.ofthe8thIntl.Conf.onInformationandKnowledgeManagement,pages225–233,KansasCity,Missouri,Nov1999.

5.10 Exercises

1. For each of the following questions, provide an example of an association rule from the market basket domain that satisfies the following conditions. Also, describe whether such rules are subjectively interesting.

a. A rule that has high support and high confidence.

b. A rule that has reasonably high support but low confidence.

c. A rule that has low support and low confidence.

d. A rule that has low support and high confidence.

2. Consider the data set shown in Table 5.20.

Table 5.20. Example of market basket transactions.

CustomerID   TransactionID   ItemsBought

1   0001   {a,d,e}

1   0024   {a,b,c,e}

2   0012   {a,b,d,e}

2   0031   {a,c,d,e}

3   0015   {b,c,e}

3   0022   {b,d,e}

4   0029   {c,d}

4   0040   {a,b,c}

5   0033   {a,d,e}

5   0038   {a,b,e}

a. Compute the support for itemsets {e}, {b,d}, and {b,d,e} by treating each transaction ID as a market basket.

b. Use the results in part (a) to compute the confidence for the association rules {b,d} → {e} and {e} → {b,d}. Is confidence a symmetric measure?

c. Repeat part (a) by treating each customer ID as a market basket. Each item should be treated as a binary variable (1 if an item appears in at least one transaction bought by the customer, and 0 otherwise).

d. Use the results in part (c) to compute the confidence for the association rules {b,d} → {e} and {e} → {b,d}.

e. Suppose s1 and c1 are the support and confidence values of an association rule r when treating each transaction ID as a market basket. Also, let s2 and c2 be the support and confidence values of r when treating each customer ID as a market basket. Discuss whether there are any relationships between s1 and s2 or c1 and c2.

3.

a. What is the confidence for the rules ∅ → A and A → ∅?

b. Let c1, c2, and c3 be the confidence values of the rules {p} → {q}, {p} → {q,r}, and {p,r} → {q}, respectively. If we assume that c1, c2, and c3 have different values, what are the possible relationships that may exist among c1, c2, and c3? Which rule has the lowest confidence?

c. Repeat the analysis in part (b) assuming that the rules have identical support. Which rule has the highest confidence?

d. Transitivity: Suppose the confidence of the rules A → B and B → C are larger than some threshold, minconf. Is it possible that A → C has a confidence less than minconf?

4. For each of the following measures, determine whether it is monotone, anti-monotone, or non-monotone (i.e., neither monotone nor anti-monotone).

Example: Support, s = σ(X)/|T|, is anti-monotone because s(X) ≥ s(Y) whenever X ⊂ Y.

a. A characteristic rule is a rule of the form {p} → {q1, q2, …, qn}, where the rule antecedent contains only a single item. An itemset of size k can produce up to k characteristic rules. Let ζ be the minimum confidence of all characteristic rules generated from a given itemset:

ζ({p1, p2, …, pk}) = min[ c({p1} → {p2, p3, …, pk}), …, c({pk} → {p1, p2, …, pk−1}) ].

Is ζ monotone, anti-monotone, or non-monotone?

b. A discriminant rule is a rule of the form {p1, p2, …, pn} → {q}, where the rule consequent contains only a single item. An itemset of size k can produce up to k discriminant rules. Let η be the minimum confidence of all discriminant rules generated from a given itemset:

η({p1, p2, …, pk}) = min[ c({p2, p3, …, pk} → {p1}), …, c({p1, p2, …, pk−1} → {pk}) ].

Is η monotone, anti-monotone, or non-monotone?

c. Repeat the analysis in parts (a) and (b) by replacing the min function with a max function.
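The support and confidence quantities used in Exercises 2-4 can be checked mechanically. The following Python sketch is not part of the original exercise set; the dictionary of baskets simply transcribes Table 5.20, and the rule {b,d} → {e} is used only as an example.

# Transactions from Table 5.20, keyed by transaction ID.
transactions = {
    "0001": {"a", "d", "e"}, "0024": {"a", "b", "c", "e"},
    "0012": {"a", "b", "d", "e"}, "0031": {"a", "c", "d", "e"},
    "0015": {"b", "c", "e"}, "0022": {"b", "d", "e"},
    "0029": {"c", "d"}, "0040": {"a", "b", "c"},
    "0033": {"a", "d", "e"}, "0038": {"a", "b", "e"},
}

def support(itemset, baskets):
    """Fraction of baskets that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(lhs, rhs, baskets):
    """c(lhs -> rhs) = s(lhs union rhs) / s(lhs)."""
    return support(set(lhs) | set(rhs), baskets) / support(lhs, baskets)

baskets = list(transactions.values())
print(support({"b", "d", "e"}, baskets))        # support of {b,d,e}
print(confidence({"b", "d"}, {"e"}, baskets))   # c({b,d} -> {e})
print(confidence({"e"}, {"b", "d"}, baskets))   # c({e} -> {b,d})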

5. Prove Equation 5.3. (Hint: First, count the number of ways to create an itemset that forms the left-hand side of the rule. Next, for each size k itemset selected for the left-hand side, count the number of ways to choose the remaining d−k items to form the right-hand side of the rule.) Assume that neither of the itemsets of a rule are empty.

6. Consider the market basket transactions shown in Table 5.21.

a. What is the maximum number of association rules that can be extracted from this data (including rules that have zero support)?

b. What is the maximum size of frequent itemsets that can be extracted (assuming minsup > 0)?

Table 5.21. Market basket transactions.

TransactionID   ItemsBought

1   {Milk, Beer, Diapers}

2   {Bread, Butter, Milk}

3   {Milk, Diapers, Cookies}

4   {Bread, Butter, Cookies}

5   {Beer, Cookies, Diapers}

6   {Milk, Diapers, Bread, Butter}

7   {Bread, Butter, Diapers}

8   {Beer, Diapers}

9   {Milk, Diapers, Bread, Butter}

10   {Beer, Cookies}

c. Write an expression for the maximum number of size-3 itemsets that can be derived from this data set.

d. Find an itemset (of size 2 or larger) that has the largest support.

e. Find a pair of items, a and b, such that the rules {a} → {b} and {b} → {a} have the same confidence.

7. Show that if a candidate k-itemset X has a subset of size less than k−1 that is infrequent, then at least one of the (k−1)-size subsets of X is necessarily infrequent.

8. Consider the following set of frequent 3-itemsets:

{1,2,3}, {1,2,4}, {1,2,5}, {1,3,4}, {1,3,5}, {2,3,4}, {2,3,5}, {3,4,5}.

Assume that there are only five items in the data set.

a. List all candidate 4-itemsets obtained by a candidate generation procedure using the Fk−1 × F1 merging strategy.

b. List all candidate 4-itemsets obtained by the candidate generation procedure in Apriori.

c. List all candidate 4-itemsets that survive the candidate pruning step of the Apriori algorithm.

9. The Apriori algorithm uses a generate-and-count strategy for deriving frequent itemsets. Candidate itemsets of size k+1 are created by joining a pair of frequent itemsets of size k (this is known as the candidate generation step). A candidate is discarded if any one of its subsets is found to be infrequent during the candidate pruning step. Suppose the Apriori algorithm is applied to the data set shown in Table 5.22 with minsup = 30%, i.e., any itemset occurring in less than 3 transactions is considered to be infrequent. (A small illustrative code sketch of the generation and pruning steps appears after this exercise.)

Table 5.22. Example of market basket transactions.

TransactionID   ItemsBought

1   {a,b,d,e}

2   {b,c,d}

3   {a,b,d,e}

4   {a,c,d,e}

5   {b,c,d,e}

6   {b,d,e}

7   {c,d}

8   {a,b,c}

9   {a,d,e}

10   {b,d}

a. DrawanitemsetlatticerepresentingthedatasetgiveninTable5.22 .Labeleachnodeinthelatticewiththefollowingletter(s):

N:IftheitemsetisnotconsideredtobeacandidateitemsetbytheApriorialgorithm.Therearetworeasonsforanitemsetnottobeconsideredasacandidateitemset:(1)itisnotgeneratedatallduringthecandidategenerationstep,or(2)itisgeneratedduringthecandidategenerationstepbutissubsequentlyremovedduringthecandidatepruningstepbecauseoneofitssubsetsisfoundtobeinfrequent.

F:IfthecandidateitemsetisfoundtobefrequentbytheApriorialgorithm.

I:Ifthecandidateitemsetisfoundtobeinfrequentaftersupportcounting.

b. Whatisthepercentageoffrequentitemsets(withrespecttoallitemsetsinthelattice)?

c. WhatisthepruningratiooftheApriorialgorithmonthisdataset?(Pruningratioisdefinedasthepercentageofitemsetsnotconsideredtobeacandidatebecause(1)theyarenotgeneratedduringcandidategenerationor(2)theyareprunedduringthecandidatepruningstep.)

d. Whatisthefalsealarmrate(i.e.,percentageofcandidateitemsetsthatarefoundtobeinfrequentafterperformingsupportcounting)?
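The generate-and-prune procedure referenced in Exercises 8 and 9 can be sketched as follows in Python. This is an illustration only; the frequent 2-itemsets in the example are made up and are not taken from Table 5.22.

from itertools import combinations

def apriori_gen(Fk):
    """Apriori candidate generation: merge two frequent k-itemsets that
    share their first k-1 items (itemsets are sorted tuples)."""
    Fk = sorted(Fk)
    candidates = []
    for i in range(len(Fk)):
        for j in range(i + 1, len(Fk)):
            if Fk[i][:-1] == Fk[j][:-1]:
                candidates.append(Fk[i] + (Fk[j][-1],))
    return candidates

def prune(candidates, Fk):
    """Candidate pruning: keep a candidate only if all of its k-subsets are frequent."""
    frequent = set(Fk)
    return [c for c in candidates
            if all(s in frequent for s in combinations(c, len(c) - 1))]

F2 = [(1, 2), (1, 3), (1, 4), (2, 3), (3, 4)]   # hypothetical frequent 2-itemsets
C3 = apriori_gen(F2)
print(C3)            # [(1, 2, 3), (1, 2, 4), (1, 3, 4)]
print(prune(C3, F2)) # [(1, 2, 3), (1, 3, 4)]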

10.TheApriorialgorithmusesahashtreedatastructuretoefficientlycountthesupportofcandidateitemsets.Considerthehashtreeforcandidate3-itemsetsshowninFigure5.32 .

Figure5.32.Anexampleofahashtreestructure.

a. Givenatransactionthatcontainsitems{1,3,4,5,8},whichofthehashtreeleafnodeswillbevisitedwhenfindingthecandidatesofthetransaction?

b. Usethevisitedleafnodesinpart(a)todeterminethecandidateitemsetsthatarecontainedinthetransaction{1,3,4,5,8}.

11.Considerthefollowingsetofcandidate3-itemsets:

{1,2,3},{1,2,6},{1,3,4},{2,3,4},{2,4,5},{3,4,6},{4,5,6}

a. Constructahashtreefortheabovecandidate3-itemsets.Assumethetreeusesahashfunctionwhereallodd-numbereditemsarehashedtotheleftchildofanode,whiletheeven-numbereditemsarehashedtotherightchild.Acandidatek-itemsetisinsertedintothetreebyhashingoneachsuccessiveiteminthecandidateandthenfollowingtheappropriatebranchofthetreeaccordingtothehashvalue.Oncealeafnodeisreached,thecandidateisinsertedbasedononeofthefollowingconditions:

Condition1:Ifthedepthoftheleafnodeisequaltok(therootisassumedtobeatdepth0),thenthecandidateisinsertedregardlessofthenumberofitemsetsalreadystoredatthenode.

Condition 2: If the depth of the leaf node is less than k, then the candidate can be inserted as long as the number of itemsets stored at the node is less than maxsize. Assume maxsize = 2 for this question.

Condition 3: If the depth of the leaf node is less than k and the number of itemsets stored at the node is equal to maxsize, then the leaf node is converted into an internal node. New leaf nodes are created as children of the old leaf node. Candidate itemsets previously stored in the old leaf node are distributed to the children based on their hash values. The new candidate is also hashed to its appropriate leaf node.

b. Howmanyleafnodesarethereinthecandidatehashtree?Howmanyinternalnodesarethere?

c. Consideratransactionthatcontainsthefollowingitems:{1,2,3,5,6}.Usingthehashtreeconstructedinpart(a),whichleafnodeswillbecheckedagainstthetransaction?Whatarethecandidate3-itemsetscontainedinthetransaction?

12.GiventhelatticestructureshowninFigure5.33 andthetransactionsgiveninTable5.22 ,labeleachnodewiththefollowingletter(s):

Figure5.33.Anitemsetlattice

Mifthenodeisamaximalfrequentitemset,

Cifitisaclosedfrequentitemset,

Nifitisfrequentbutneithermaximalnorclosed,and

Iifitisinfrequent

Assumethatthesupportthresholdisequalto30%.

13.Theoriginalassociationruleminingformulationusesthesupportandconfidencemeasurestopruneuninterestingrules.

a. DrawacontingencytableforeachofthefollowingrulesusingthetransactionsshowninTable5.23 .

Table5.23.Exampleofmarketbaskettransactions.

TransactionID ItemsBought

1 {a,b,d,e}

2 {b,c,d}

3 {a,b,d,e}

4 {a,c,d,e}

5 {b,c,d,e}

6 {b,d,e}

7 {c,d}

8 {a,b,c}

9 {a,d,e}

10 {b,d}

Rules: {b} → {c}, {a} → {d}, {b} → {d}, {e} → {c}, {c} → {a}.

b. Use the contingency tables in part (a) to compute and rank the rules in decreasing order according to the following measures.

i. Support.

ii. Confidence.

iii. Interest(X → Y) = P(X,Y) / (P(X)P(Y)).

iv. IS(X → Y) = P(X,Y) / √(P(X)P(Y)).

v. Klosgen(X → Y) = √P(X,Y) × max(P(Y|X) − P(Y), P(X|Y) − P(X)), where P(Y|X) = P(X,Y)/P(X).

vi. Odds ratio(X → Y) = P(X,Y)P(X¯,Y¯) / (P(X,Y¯)P(X¯,Y)).

14. Given the rankings you had obtained in Exercise 13, compute the correlation between the rankings of confidence and the other five measures. Which measure is most highly correlated with confidence? Which measure is least correlated with confidence?

15. Answer the following questions using the data sets shown in Figure 5.34. Note that each data set contains 1000 items and 10,000 transactions. Dark cells indicate the presence of items and white cells indicate the absence of items. We will apply the Apriori algorithm to extract frequent itemsets with minsup = 10% (i.e., itemsets must be contained in at least 1000 transactions).

Figure5.34.FiguresforExercise15.

a. Whichdataset(s)willproducethemostnumberoffrequentitemsets?

b. Whichdataset(s)willproducethefewestnumberoffrequentitemsets?

c. Whichdataset(s)willproducethelongestfrequentitemset?

d. Whichdataset(s)willproducefrequentitemsetswithhighestmaximumsupport?

e. Whichdataset(s)willproducefrequentitemsetscontainingitemswithwide-varyingsupportlevels(i.e.,itemswithmixedsupport,rangingfromlessthan20%tomorethan70%)?

16.

a. Prove that the ϕ coefficient is equal to 1 if and only if f11 = f1+ = f+1.

b. Show that if A and B are independent, then P(A,B) × P(A¯,B¯) = P(A,B¯) × P(A¯,B).

c. Show that Yule's Q and Y coefficients

Q = [f11 f00 − f10 f01] / [f11 f00 + f10 f01],

Y = [√(f11 f00) − √(f10 f01)] / [√(f11 f00) + √(f10 f01)]

are normalized versions of the odds ratio.

d. Write a simplified expression for the value of each measure shown in Table 5.9 when the variables are statistically independent.

17. Consider the interestingness measure, M = (P(B|A) − P(B)) / (1 − P(B)), for an association rule A → B.

a. What is the range of this measure? When does the measure attain its maximum and minimum values?

b. How does M behave when P(A,B) is increased while P(A) and P(B) remain unchanged?

c. How does M behave when P(A) is increased while P(A,B) and P(B) remain unchanged?

d. How does M behave when P(B) is increased while P(A,B) and P(A) remain unchanged?

e. Is the measure symmetric under variable permutation?

f. What is the value of the measure when A and B are statistically independent?

g. Is the measure null-invariant?

h. Does the measure remain invariant under row or column scaling operations?

i. How does the measure behave under the inversion operation?
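The contingency-table measures that appear throughout Exercises 13-21 can all be computed from the four counts f11, f10, f01, f00. The following Python sketch is an illustration only (it is not part of the exercises); the example call uses Table I of Exercise 20.

from math import sqrt

def measures(f11, f10, f01, f00):
    """Several association measures for the pattern {A,B} from a 2x2 table."""
    n = f11 + f10 + f01 + f00
    pa, pb, pab = (f11 + f10) / n, (f11 + f01) / n, f11 / n
    return {
        "support": pab,
        "confidence(A->B)": pab / pa,
        "interest": pab / (pa * pb),
        "IS": pab / sqrt(pa * pb),
        "phi": (pab - pa * pb) / sqrt(pa * pb * (1 - pa) * (1 - pb)),
        "odds_ratio": (f11 * f00) / (f10 * f01) if f10 * f01 else float("inf"),
        "M": ((pab / pa) - pb) / (1 - pb),   # the measure from Exercise 17
    }

# Example: Table I of Exercise 20 (f11=9, f10=1, f01=1, f00=89).
print(measures(9, 1, 1, 89))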

18. Suppose we have market basket data consisting of 100 transactions and 20 items. Assume the support for item a is 25%, the support for item b is 90% and the support for itemset {a,b} is 20%. Let the support and confidence thresholds be 10% and 60%, respectively.

a. Compute the confidence of the association rule {a} → {b}. Is the rule interesting according to the confidence measure?

b. Compute the interest measure for the association pattern {a,b}. Describe the nature of the relationship between item a and item b in terms of the interest measure.

c. What conclusions can you draw from the results of parts (a) and (b)?

d. Prove that if the confidence of the rule {a} → {b} is less than the support of {b}, then:

i. c({a¯} → {b}) > c({a} → {b}),

ii. c({a¯} → {b}) > s({b}),

where c(·) denotes the rule confidence and s(·) denotes the support of an itemset.

19. Table 5.24 shows a 2 × 2 × 2 contingency table for the binary variables A and B at different values of the control variable C.

Table 5.24. A Contingency Table.

               A = 1   A = 0

C = 0   B = 1      0      15

        B = 0     15      30

C = 1   B = 1      5       0

        B = 0      0      15

a. Compute the ϕ coefficient for A and B when C = 0, C = 1, and C = 0 or 1. Note that

ϕ({A,B}) = (P(A,B) − P(A)P(B)) / √(P(A)P(B)(1 − P(A))(1 − P(B))).

b. What conclusions can you draw from the above result?

20. Consider the contingency tables shown in Table 5.25.

a. For Table I, compute support, the interest measure, and the ϕ correlation coefficient for the association pattern {A,B}. Also, compute the confidence of rules A → B and B → A.

b. For Table II, compute support, the interest measure, and the ϕ correlation coefficient for the association pattern {A,B}. Also, compute the confidence of rules A → B and B → A.

Table 5.25. Contingency tables for Exercise 20. (a) Table I.

          B = 1   B = 0

A = 1         9       1

A = 0         1      89

(b) Table II.

          B = 1   B = 0

A = 1        89       1

A = 0         1       9

c. What conclusions can you draw from the results of (a) and (b)?

21. Consider the relationship between customers who buy high-definition televisions and exercise machines as shown in Tables 5.17 and 5.18.

a. Compute the odds ratios for both tables.

b. Compute the ϕ-coefficient for both tables.

c. Compute the interest factor for both tables.

For each of the measures given above, describe how the direction of association changes when data is pooled together instead of being stratified.

6AssociationAnalysis:AdvancedConcepts

Theassociationruleminingformulationdescribedinthepreviouschapterassumesthattheinputdataconsistsofbinaryattributescalleditems.Thepresenceofaniteminatransactionisalsoassumedtobemoreimportantthanitsabsence.Asaresult,anitemistreatedasanasymmetricbinaryattributeandonlyfrequentpatternsareconsideredinteresting.

Thischapterextendstheformulationtodatasetswithsymmetricbinary,categorical,andcontinuousattributes.Theformulationwillalsobeextendedtoincorporatemorecomplexentitiessuchassequencesandgraphs.Althoughtheoverallstructureofassociationanalysisalgorithmsremainsunchanged,certainaspectsofthealgorithmsmustbemodifiedtohandlethenon-traditionalentities.

6.1 Handling Categorical Attributes

There are many applications that contain symmetric binary and nominal attributes. The Internet survey data shown in Table 6.1 contains symmetric binary attributes such as Shop Online and Privacy Concerns, as well as nominal attributes such as Level of Education and State. Using association analysis, we may uncover interesting information about the characteristics of Internet users such as

{Shop Online = Yes} → {Privacy Concerns = Yes}.

Table6.1.Internetsurveydatawithcategoricalattributes.

Gender LevelofEducation

State ComputeratHome

ChatOnline

ShopOnline

PrivacyConcerns

Female Graduate Illinois Yes Yes Yes Yes

Male College California No No No No

Male Graduate Michigan Yes Yes Yes Yes

Female College Virginia No No Yes Yes

Female Graduate California Yes No No Yes

Male College Minnesota Yes Yes Yes Yes

Male College Alaska Yes Yes Yes No

Male HighSchool Oregon Yes No No No

Female Graduate Texas No Yes No No

… … … … … … …

ThisrulesuggeststhatmostInternetuserswhoshoponlineareconcernedabouttheirpersonalprivacy.

To extract such patterns, the categorical and symmetric binary attributes are transformed into "items" first, so that existing association rule mining algorithms can be applied. This type of transformation can be performed by creating a new item for each distinct attribute-value pair. For example, the nominal attribute Level of Education can be replaced by three binary items: Education = Graduate, Education = College, and Education = High School. Similarly, symmetric binary attributes such as Gender can be converted into a pair of binary items, Male and Female. Table 6.2 shows the result of binarizing the Internet survey data; a small code sketch of this transformation follows the table.

Table6.2.Internetsurveydataafterbinarizingcategoricalandsymmetricbinaryattributes.

Male Female Education=Graduate

Education=College

… Privacy=Yes

Privacy=No

0 1 1 0 … 1 0

1 0 0 1 … 0 1

1 0 1 0 … 1 0

0 1 0 1 … 1 0

0 1 1 0 … 1 0

1 0 0 1 … 1 0

1 0 0 1 … 0 1

1 0 0 0 … 0 1

0 1 1 0 … 0 1

… … … … … … …
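As a concrete illustration of the attribute-to-item transformation described above, the following Python sketch creates one asymmetric binary item per attribute-value pair. It is a minimal sketch, not the book's implementation; the two records are a hypothetical excerpt in the spirit of Table 6.1.

# Turn categorical records into sets of "items", one per attribute-value pair.
records = [
    {"Gender": "Female", "Education": "Graduate", "Privacy": "Yes"},
    {"Gender": "Male",   "Education": "College",  "Privacy": "No"},
]

def binarize(record):
    """Map a categorical record to a set of attribute=value items."""
    return {f"{attr}={value}" for attr, value in record.items()}

transactions = [binarize(r) for r in records]
print(transactions[0])   # {'Gender=Female', 'Education=Graduate', 'Privacy=Yes'}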

Thereareseveralissuestoconsiderwhenapplyingassociationanalysistothebinarizeddata:

1. Someattributevaluesmaynotbefrequentenoughtobepartofafrequentpattern.Thisproblemismoreevidentfornominalattributesthathavemanypossiblevalues,e.g.,statenames.Loweringthesupportthresholddoesnothelpbecauseitexponentiallyincreasesthenumberoffrequentpatternsfound(manyofwhichmaybespurious)andmakesthecomputationmoreexpensive.Amorepracticalsolutionistogrouprelatedattributevaluesintoasmallnumberofcategories.Forexample,eachstatenamecanbereplacedbyitscorrespondinggeographicalregion,suchas ,and .Anotherpossibilityistoaggregatethelessfrequentattributevaluesintoasinglecategorycalled ,asshowninFigure6.1 .

Figure6.1.Apiechartwithamergedcategorycalled .

2. Someattributevaluesmayhaveconsiderablyhigherfrequenciesthanothers.Forexample,suppose85%ofthesurveyparticipantsownahomecomputer.Bycreatingabinaryitemforeachattributevaluethatappearsfrequentlyinthedata,wemaypotentiallygeneratemanyredundantpatterns,asillustratedbythefollowingexample:

Theruleisredundantbecauseitissubsumedbythemoregeneralrulegivenatthebeginningofthissection.Becausethehigh-frequencyitemscorrespondtothetypicalvaluesofanattribute,theyseldomcarryanynewinformationthatcanhelpustobetterunderstandthepattern.Itmaythereforebeusefultoremovesuchitemsbeforeapplying

standardassociationanalysisalgorithms.AnotherpossibilityistoapplythetechniquespresentedinSection5.8 forhandlingdatasetswithawiderangeofsupportvalues.

3. Althoughthewidthofeverytransactionisthesameasthenumberofattributesintheoriginaldata,thecomputationtimemayincreaseespeciallywhenmanyofthenewlycreateditemsbecomefrequent.Thisisbecausemoretimeisneededtodealwiththeadditionalcandidateitemsetsgeneratedbytheseitems(seeExercise1 onpage510).Onewaytoreducethecomputationtimeistoavoidgeneratingcandidateitemsetsthatcontainmorethanoneitemfromthesameattribute.Forexample,wedonothavetogenerateacandidateitemsetsuchas becausethesupportcountoftheitemsetiszero.

6.2HandlingContinuousAttributesTheInternetsurveydatadescribedintheprevioussectionmayalsocontaincontinuousattributessuchastheonesshowninTable6.3 .Miningthecontinuousattributesmayrevealusefulinsightsaboutthedatasuchas“userswhoseannualincomeismorethan$120Kbelongtothe45–60agegroup”or“userswhohavemorethan3emailaccountsandspendmorethan15hoursonlineperweekareoftenconcernedabouttheirpersonalprivacy.”Associationrulesthatcontaincontinuousattributesarecommonlyknownasquantitativeassociationrules.

Table6.3.Internetsurveydatawithcontinuousattributes.

Gender … Age AnnualIncome

No.ofHoursSpentOnlineperWeek

No.ofEmailAccounts

PrivacyConcern

Female … 26 90K 20 4 Yes

Male … 51 135K 10 2 No

Male … 29 80K 10 3 Yes

Female … 45 120K 15 3 Yes

Female … 31 95K 20 5 Yes

Male … 25 55K 25 5 Yes

Male … 37 100K 10 1 No

Male … 41 65K 8 2 No

Female … 26 85K 12 1 No

… … … … … … …

Thissectiondescribesthevariousmethodologiesforapplyingassociationanalysistocontinuousdata.Wewillspecificallydiscussthreetypesofmethods:(1)discretization-basedmethods,(2)statistics-basedmethods,and(3)nondiscretizationmethods.Thequantitativeassociationrulesderivedusingthesemethodsarequitedifferentinnature.

6.2.1Discretization-BasedMethods

Discretization is the most common approach for handling continuous attributes. This approach groups the adjacent values of a continuous attribute into a finite number of intervals. For example, the Age attribute can be divided into the following intervals: Age ∈ [12,16), Age ∈ [16,20), Age ∈ [20,24), …, Age ∈ [56,60), where [a,b) represents an interval that includes a but not b. Discretization can be performed using any of the techniques described in Section 2.3.6 (equal interval width, equal frequency, entropy-based, or clustering). The discrete intervals are then mapped into asymmetric binary attributes so that existing association analysis algorithms can be applied. Table 6.4 shows the Internet survey data after discretization and binarization.

Table6.4.Internetsurveydataafterbinarizingcategoricalandcontinuousattributes.

Male Female … Age<13

Age∈[13,21)

Age∈[21,30)

… Privacy=Yes

Privacy=No

0 1 … 0 0 1 … 1 0

1 0 … 0 0 0 … 0 1

1 0 … 0 0 1 … 1 0

0 1 … 0 0 0 … 1 0

0 1 … 0 0 0 … 1 0

1 0 … 0 0 1 … 1 0

1 0 … 0 0 0 … 0 1

1 0 … 0 0 0 … 0 1

0 1 … 0 0 1 … 0 1

… … … … … … … … …
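The discretize-then-binarize step can be sketched in a few lines of Python. The bin edges below mirror the [12,16), [16,20), …, [56,60) intervals used in the text; the sample ages are illustrative and are not taken from Table 6.3.

ages = [26, 51, 29, 45, 31]
edges = list(range(12, 64, 4))           # 12, 16, 20, ..., 60

def age_item(age):
    """Return the binary item ('Age in [lo,hi)') whose interval contains age."""
    for lo, hi in zip(edges, edges[1:]):
        if lo <= age < hi:
            return f"Age∈[{lo},{hi})"
    return None                           # age falls outside the discretized range

print([age_item(a) for a in ages])
# ['Age∈[24,28)', 'Age∈[48,52)', 'Age∈[28,32)', 'Age∈[44,48)', 'Age∈[28,32)']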

Akeyparameterinattributediscretizationisthenumberofintervalsusedtopartitioneachattribute.Thisparameteristypicallyprovidedbytheusersandcanbeexpressedintermsoftheintervalwidth(fortheequalintervalwidthapproach),theaveragenumberoftransactionsperinterval(fortheequalfrequencyapproach),orthenumberofdesiredclusters(fortheclustering-basedapproach).ThedifficultyindeterminingtherightnumberofintervalscanbeillustratedusingthedatasetshowninTable6.5 ,whichsummarizestheresponsesof250userswhoparticipatedinthesurvey.Therearetwostrongrulesembeddedinthedata:

Table6.5.AbreakdownofInternetuserswhoparticipatedinonlinechataccordingtotheiragegroup.

AgeGroup ChatOnline=Yes ChatOnline=No

R1:Age∈[16,24)→Chat Online=Yes(s=8.8%, c=81.5%).R2:Age∈[44,60)→Chat

[12,16) 12 13

[16,20) 11 2

[20,24) 11 3

[24,28) 12 13

[28,32) 14 12

[32,36) 15 12

[36,40) 16 14

[40,44) 16 14

[44,48) 4 10

[48,52) 5 11

[52,56) 5 10

[56,60) 4 11

Theserulessuggestthatmostoftheusersfromtheagegroupof16–24oftenparticipateinonlinechatting,whilethosefromtheagegroupof44–60arelesslikelytochatonline.Inthisexample,weconsideraruletobeinterestingonlyifitssupport(s)exceeds5%anditsconfidence(c)exceeds65%.Oneoftheproblemsencounteredwhendiscretizingthe attributeishowtodeterminetheintervalwidth.

1. If the interval is too wide, then we may lose some patterns because of their lack of confidence. For example, when the interval width is 24 years, R1 and R2 are replaced by the following rules:

R′1: Age ∈ [12,36) → Chat Online = Yes (s = 30%, c = 57.7%).
R′2: Age ∈ [36,60) → Chat Online = No (s = 28%, c = 58.3%).

Despite their higher supports, the wider intervals have caused the confidence for both rules to drop below the minimum confidence threshold. As a result, both patterns are lost after discretization.

2. If the interval is too narrow, then we may lose some patterns because of their lack of support. For example, if the interval width is 4 years, then R1 is broken up into the following two subrules:

R11(4): Age ∈ [16,20) → Chat Online = Yes (s = 4.4%, c = 84.6%).
R12(4): Age ∈ [20,24) → Chat Online = Yes (s = 4.4%, c = 78.6%).

Since the supports for the subrules are less than the minimum support threshold, R1 is lost after discretization. Similarly, the rule R2, which is broken up into four subrules, will also be lost because the support of each subrule is less than the minimum support threshold.

3. If the interval width is 8 years, then the rule R2 is broken up into the following two subrules:

R21(8): Age ∈ [44,52) → Chat Online = No (s = 8.4%, c = 70%).
R22(8): Age ∈ [52,60) → Chat Online = No (s = 8.4%, c = 70%).

Since R21(8) and R22(8) have sufficient support and confidence, R2 can be recovered by aggregating both subrules. Meanwhile, R1 is broken up into the following two subrules:

R11(8): Age ∈ [12,20) → Chat Online = Yes (s = 9.2%, c = 60.5%).
R12(8): Age ∈ [20,28) → Chat Online = Yes (s = 9.2%, c = 59.0%).

Unlike R2, we cannot recover the rule R1 by aggregating the subrules because both subrules fail the confidence threshold.

One way to address these issues is to consider every possible grouping of adjacent intervals. For example, we can start with an interval width of 4 years and then merge the adjacent intervals into wider intervals: Age ∈ [12,16), Age ∈ [12,20), …, Age ∈ [12,60), Age ∈ [16,20), Age ∈ [16,24), etc. This

approach enables the detection of both R1 and R2 as strong rules. However, it also leads to the following computational issues:

1. The computation becomes extremely expensive. If the range is initially divided into k intervals, then k(k−1)/2 binary items must be generated to represent all possible intervals. Furthermore, if an item corresponding to the interval [a,b) is frequent, then all other items corresponding to intervals that subsume [a,b) must be frequent too. This approach can therefore generate far too many candidate and frequent itemsets. To address these problems, a maximum support threshold can be applied to prevent the creation of items corresponding to very wide intervals and to reduce the number of itemsets.

2. Many redundant rules are extracted. For example, consider the following pair of rules:

R3: {Age ∈ [16,20), Gender = Male} → {Chat Online = Yes},
R4: {Age ∈ [16,24), Gender = Male} → {Chat Online = Yes}.

R4 is a generalization of R3 (and R3 is a specialization of R4) because R4 has a wider interval for the Age attribute. If the confidence values for both rules are the same, then R4 should be more interesting because it covers more examples, including those for R3. R3 is therefore a redundant rule.

6.2.2 Statistics-Based Methods

Quantitative association rules can be used to infer the statistical properties of a population. For example, suppose we are interested in finding the average age of certain groups of Internet users based on the data provided in Tables

6.1 and 6.3. Using the statistics-based method described in this section, quantitative association rules such as the following can be extracted:

{Annual Income > $100K, Shop Online = Yes} → Age: Mean = 38.

The rule states that the average age of Internet users whose annual income exceeds $100K and who shop online regularly is 38 years old.

Rule Generation

To generate the statistics-based quantitative association rules, the target attribute used to characterize interesting segments of the population must be specified. By withholding the target attribute, the remaining categorical and continuous attributes in the data are binarized using the methods described in the previous section. Existing algorithms such as Apriori or FP-growth are then applied to extract frequent itemsets from the binarized data. Each frequent itemset identifies an interesting segment of the population. The distribution of the target attribute in each segment can be summarized using descriptive statistics such as mean, median, variance, or absolute deviation. For example, the preceding rule is obtained by averaging the age of Internet users who support the frequent itemset {Annual Income > $100K, Shop Online = Yes}.

The number of quantitative association rules discovered using this method is the same as the number of extracted frequent itemsets. Because of the way the quantitative association rules are defined, the notion of confidence is not applicable to such rules. An alternative method for validating the quantitative association rules is presented next.

Rule Validation

A quantitative association rule is interesting only if the statistics computed from transactions covered by the rule are different than those computed from transactions not covered by the rule. For example, the rule given at the beginning of this section is interesting only if the average age of Internet users who do not support the frequent itemset {Annual Income > $100K, Shop Online = Yes} is significantly higher or lower than 38 years old. To determine whether the difference in their average ages is statistically significant, statistical hypothesis testing methods should be applied.

Consider the quantitative association rule, A → t: μ, where A is a frequent itemset, t is the continuous target attribute, and μ is the average value of t among transactions covered by A. Furthermore, let μ′ denote the average value of t among transactions not covered by A. The goal is to test whether the difference between μ and μ′ is greater than some user-specified threshold, Δ. In statistical hypothesis testing, two opposite propositions, known as the null hypothesis and the alternative hypothesis, are given. A hypothesis test is performed to determine which of these two hypotheses should be accepted, based on evidence gathered from the data (see Appendix C).

In this case, assuming that μ < μ′, the null hypothesis is H0: μ′ = μ + Δ, while the alternative hypothesis is H1: μ′ > μ + Δ. To determine which hypothesis should be accepted, the following Z-statistic is computed:

Z = (μ′ − μ − Δ) / √(s1²/n1 + s2²/n2),    (6.1)

where n1 is the number of transactions supporting A, n2 is the number of transactions not supporting A, s1 is the standard deviation for t among transactions that support A, and s2 is the standard deviation for t among transactions that do not support A. Under the null hypothesis, Z has a standard normal distribution with mean 0 and variance 1. The value of Z computed using Equation 6.1 is then compared against a critical value, Zα, which is a threshold that depends on the desired confidence level. If Z > Zα, then the null hypothesis is rejected and we may conclude that the quantitative association rule is interesting. Otherwise, there is not enough evidence in the data to show that the difference in mean is statistically significant.
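The hypothesis test of Equation 6.1 is easy to code. The following Python sketch is illustrative only; the group with the larger mean is passed first, matching the arithmetic used in Example 6.1 below, and 1.64 is the critical value for a one-sided test at the 95% confidence level.

from math import sqrt

def z_statistic(mean1, mean2, sd1, sd2, n1, n2, delta):
    """Z-statistic of Equation 6.1; mean1 is the larger of the two group means."""
    return (mean1 - mean2 - delta) / sqrt(sd1**2 / n1 + sd2**2 / n2)

# Numbers from Example 6.1: 50 supporting users with mean age 38 (sd 3.5),
# 200 non-supporting users with mean age 30 (sd 6.5), threshold of 5 years.
z = z_statistic(mean1=38, mean2=30, sd1=3.5, sd2=6.5, n1=50, n2=200, delta=5)
print(round(z, 4), z > 1.64)   # 4.4414 True -> the rule is deemed interesting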

Example 6.1. Consider the quantitative association rule

{Income > 100K, Shop Online = Yes} → Age: μ = 38.

Suppose there are 50 Internet users who supported the rule antecedent. The standard deviation of their ages is 3.5. On the other hand, the average age of the 200 users who do not support the rule antecedent is 30 and their standard deviation is 6.5. Assume that a quantitative association rule is considered interesting only if the difference between μ and μ′ is more than 5 years. Using Equation 6.1 we obtain

Z = (38 − 30 − 5) / √(3.5²/50 + 6.5²/200) = 4.4414.

For a one-sided hypothesis test at a 95% confidence level, the critical value for rejecting the null hypothesis is 1.64. Since Z > 1.64, the null hypothesis can be rejected. We therefore conclude that the quantitative association rule is interesting because the difference between the average ages of users who support and do not support the rule antecedent is more than 5 years.

6.2.3 Non-discretization Methods

There are certain applications in which analysts are more interested in finding associations among the continuous attributes, rather than associations among discrete intervals of the continuous attributes. For example, consider the problem of finding word associations in text documents. Table 6.6 shows a document-word matrix where every entry represents the number of times a word appears in a given document. Given such a data matrix, we are interested in finding associations between words (e.g., word1 and word2) instead of associations between ranges of word counts (e.g., word1 ∈ [1,4] and word2 ∈ [2,3]). One way to do this is to transform the data into a 0/1 matrix, where the entry is 1 if the count exceeds some threshold t, and 0 otherwise. While this approach allows analysts to apply existing frequent itemset generation algorithms to the binarized data set, finding the right threshold for binarization can be quite tricky. If the threshold is set too high, it is possible to miss some interesting associations. Conversely, if the threshold is set too low, there is a potential for generating a large number of spurious associations.

Table 6.6. Document-word matrix.

Document   word1   word2   word3   word4   word5   word6

d1   3   6   0   0   0   2

d2   1   2   0   0   0   2

d3   4   2   7   0   0   2

d4   2   0   3   0   0   1

d5   0   0   0   1   1   0

This section presents another methodology for finding associations among continuous attributes, known as the min-Apriori approach. Analogous to

traditional association analysis, an itemset is considered to be a collection of continuous attributes, while its support measures the degree of association among the attributes, across multiple rows of the data matrix. For example, a collection of words in Table 6.6 can be referred to as an itemset, whose support is determined by the co-occurrence of words across documents. In min-Apriori, the association among attributes in a given row of the data matrix is obtained by taking the minimum value of the attributes. For example, the association between words word1 and word2 in the document d1 is given by min(3,6) = 3. The support of an itemset is then computed by aggregating its association over all the documents:

s({word1, word2}) = min(3,6) + min(1,2) + min(4,2) + min(2,0) = 6.

The support measure defined in min-Apriori has the following desired properties, which makes it suitable for finding word associations in documents:

1. Support increases monotonically as the number of occurrences of a word increases.

2. Support increases monotonically as the number of documents that contain the word increases.

3. Support has an anti-monotone property. For example, consider a pair of itemsets {A,B} and {A,B,C}. Since min({A,B}) ≥ min({A,B,C}), s({A,B}) ≥ s({A,B,C}). Therefore, support decreases monotonically as the number of words in an itemset increases.

The standard Apriori algorithm can be modified to find associations among words using the new support definition; a small sketch of this support computation appears below.
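A minimal Python sketch of the min-Apriori support computation described above, assuming the raw counts of Table 6.6 are used directly:

# min-Apriori support: for each document take the minimum count over the words
# in the itemset, then sum over all documents (counts transcribed from Table 6.6).
doc_word = {
    "d1": {"word1": 3, "word2": 6, "word3": 0, "word4": 0, "word5": 0, "word6": 2},
    "d2": {"word1": 1, "word2": 2, "word3": 0, "word4": 0, "word5": 0, "word6": 2},
    "d3": {"word1": 4, "word2": 2, "word3": 7, "word4": 0, "word5": 0, "word6": 2},
    "d4": {"word1": 2, "word2": 0, "word3": 3, "word4": 0, "word5": 0, "word6": 1},
    "d5": {"word1": 0, "word2": 0, "word3": 0, "word4": 1, "word5": 1, "word6": 0},
}

def min_apriori_support(itemset, matrix):
    """Sum over documents of the minimum count among the itemset's words."""
    return sum(min(row[w] for w in itemset) for row in matrix.values())

print(min_apriori_support({"word1", "word2"}, doc_word))            # 6
print(min_apriori_support({"word1", "word2", "word3"}, doc_word))   # anti-monotone: <= 6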

6.3HandlingaConceptHierarchyAconcepthierarchyisamultilevelorganizationofthevariousentitiesorconceptsdefinedinaparticulardomain.Forexample,inmarketbasketanalysis,aconcepthierarchyhastheformofanitemtaxonomydescribingthe“is-a”relationshipsamongitemssoldatagrocerystore—e.g.,milkisakindoffoodandDVDisakindofhomeelectronicsequipment(seeFigure6.2 ).Concepthierarchiesareoftendefinedaccordingtodomainknowledgeorbasedonastandardclassificationschemedefinedbycertainorganizations(e.g.,theLibraryofCongressclassificationschemeisusedtoorganizelibrarymaterialsbasedontheirsubjectcategories).

Figure6.2.Exampleofanitemtaxonomy.

A concept hierarchy can be represented using a directed acyclic graph, as shown in Figure 6.2. If there is an edge in the graph from a node p to another node c, we call p the parent of c and c the child of p. For example, milk is the parent of skim milk because there is a directed edge from the node milk to the node skim milk. X̂ is called an ancestor of X (and X a descendent of X̂) if there is a path from node X̂ to node X in the directed acyclic graph. In the diagram shown in Figure 6.2, food is an ancestor of skim milk and AC adaptor is a descendent of electronics.

Themainadvantagesofincorporatingconcepthierarchiesintoassociationanalysisareasfollows:

1. Itemsatthelowerlevelsofahierarchymaynothaveenoughsupporttoappearinanyfrequentitemset.Forexample,althoughthesaleofACadaptorsanddockingstationsmaybelow,thesaleoflaptopaccessories,whichistheirparentnodeintheconcepthierarchy,maybehigh.Also,rulesinvolvinghigh-levelcategoriesmayhavelowerconfidencethantheonesgeneratedusinglow-levelcategories.Unlesstheconcepthierarchyisused,thereisapotentialtomissinterestingpatternsatdifferentlevelsofcategories.

2. Rulesfoundatthelowerlevelsofaconcepthierarchytendtobeoverlyspecificandmaynotbeasinterestingasrulesatthehigherlevels.Forexample,stapleitemssuchasmilkandbreadtendtoproducemanylow-levelrulessuchas

,and .Usingaconcepthierarchy,theycanbesummarizedintoasinglerule, .Consideringonlyitemsresidingatthetopleveloftheirhierarchiesalsomaynotbegoodenough,becausesuchrulesmaynotbeofanypracticaluse.Forexample,althoughtherule maysatisfythesupportandconfidencethresholds,itisnotinformativebecausethe


combinationofelectronicsandfooditemsthatarefrequentlypurchasedbycustomersareunknown.Ifmilkandbatteriesaretheonlyitemssoldtogetherfrequently,thenthepattern{ }mayhaveovergeneralizedthesituation.

Standard association analysis can be extended to incorporate concept hierarchies in the following way. Each transaction t is initially replaced with its extended transaction t′, which contains all the items in t along with their corresponding ancestors. For example, the transaction {DVD, wheat bread} can be extended to {DVD, wheat bread, home electronics, electronics, bread, food}, where home electronics and electronics are the ancestors of DVD, while bread and food are the ancestors of wheat bread.
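The extended-transaction step can be sketched as follows in Python. The toy hierarchy below is an assumption made for illustration; it mirrors the taxonomy described for Figure 6.2 rather than reproducing it exactly.

# Toy "is-a" hierarchy: child -> parent (illustrative, not Figure 6.2 verbatim).
parent = {
    "wheat bread": "bread", "bread": "food",
    "skim milk": "milk", "milk": "food",
    "DVD": "home electronics", "home electronics": "electronics",
}

def ancestors(item):
    """Return all ancestors of an item by walking up the hierarchy."""
    result = set()
    while item in parent:
        item = parent[item]
        result.add(item)
    return result

def extend(transaction):
    """Extended transaction t': the items of t plus all of their ancestors."""
    extended = set(transaction)
    for item in transaction:
        extended |= ancestors(item)
    return extended

print(extend({"DVD", "wheat bread"}))
# {'DVD', 'wheat bread', 'home electronics', 'electronics', 'bread', 'food'}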

Wecanthenapplyexistingalgorithms,suchasApriori,tothedatabaseofextendedtransactions.Althoughsuchanapproachwouldfindrulesthatspandifferentlevelsoftheconcepthierarchy,itwouldsufferfromseveralobviouslimitationsasdescribedbelow:

1. Itemsresidingatthehigherlevelstendtohavehighersupportcountsthanthoseresidingatthelowerlevelsofaconcepthierarchy.Asaresult,ifthesupportthresholdissettoohigh,thenonlypatternsinvolvingthehigh-levelitemsareextracted.Ontheotherhand,ifthethresholdissettoolow,thenthealgorithmgeneratesfartoomanypatterns(mostofwhichmaybespurious)andbecomescomputationallyinefficient.

2. Introductionofaconcepthierarchytendstoincreasethecomputationtimeofassociationanalysisalgorithmsbecauseofthelargernumberofitemsandwidertransactions.Thenumberofcandidatepatternsandfrequentpatternsgeneratedbythesealgorithmsmayalsogrowexponentiallywithwidertransactions.


3. Introductionofaconcepthierarchymayproduceredundantrules.Arule isredundantifthereexistsamoregeneralrule ,where isanancestorofX, isanancestorofY,andbothruleshaveverysimilarconfidence.Forexample,suppose{ }→{ },{ }→{2% },{ }→{2% },{ }→{skim },and{ }→{ }haveverysimilarconfidence.Therulesinvolvingitemsfromthelowerlevelofthehierarchyareconsideredredundantbecausetheycanbesummarizedbyaruleinvolvingtheancestoritems.Anitemsetsuchas{

}isalsoredundantbecause and areancestorsof.Fortunately,itiseasytoeliminatesuchredundantitemsets

duringfrequentitemsetgeneration,giventheknowledgeofthehierarchy.

X→Y X^→Y^X^ Y^

6.4SequentialPatternsMarketbasketdataoftencontainstemporalinformationaboutwhenanitemwaspurchasedbycustomers.Suchinformationcanbeusedtopiecetogetherthesequenceoftransactionsmadebyacustomeroveracertainperiodoftime.Similarly,event-baseddatacollectedfromscientificexperimentsorthemonitoringofphysicalsystems,suchastelecommunicationsnetworks,computernetworks,andwirelesssensornetworks,haveaninherentsequentialnaturetothem.Thismeansthatanordinalrelation,usuallybasedontemporalprecedence,existsamongeventsoccurringinsuchdata.However,theconceptsofassociationpatternsdiscussedsofaremphasizeonly“co-occurrence”relationshipsanddisregardthesequentialinformationofthedata.Thelatterinformationmaybevaluableforidentifyingrecurringfeaturesofadynamicsystemorpredictingfutureoccurrencesofcertainevents.Thissectionpresentsthebasicconceptofsequentialpatternsandthealgorithmsdevelopedtodiscoverthem.

6.4.1Preliminaries

The input to the problem of discovering sequential patterns is a sequence data set, an example of which is shown on the left-hand side of Figure 6.3. Each row records the occurrences of events associated with a particular object at a given time. For example, the first row contains the set of events occurring at timestamp t = 10 for object A. Note that if we only consider the last column of this sequence data set, it would look similar to a market basket data where every row represents a transaction containing a set of events (items). The traditional concept of association patterns in this data would correspond

tocommonco-occurrencesofeventsacrosstransactions.However,asequencedatasetalsocontainsinformationabouttheobjectandthetimestampofatransactionofeventsinthefirsttwocolumns.Thesecolumnsaddcontexttoeverytransaction,whichenablesadifferentstyleofassociationanalysisforsequencedatasets.Theright-handsideofFigure6.3 showsadifferentrepresentationofthesequencedatasetwheretheeventsassociatedwitheveryobjectappeartogether,sortedinincreasingorderoftheirtimestamps.Inasequencedataset,wecanlookforassociationpatternsofeventsthatcommonlyoccurinasequentialorderacrossobjects.Forexample,inthesequencedatashowninFigure6.3 ,event6isfollowedbyevent1inallofthesequences.Notethatsuchapatterncannotbeinferredifwetreatthisasamarketbasketdatabyignoringinformationabouttheobjectandtimestamp.

Figure6.3.Exampleofasequencedatabase.

Beforepresentingamethodologyforfindingsequentialpatterns,weprovideabriefdescriptionofsequencesandsubsequences.

Sequences

Generally speaking, a sequence is an ordered list of elements (transactions). A sequence can be denoted as s = ⟨e1 e2 e3 … en⟩, where each element ej is a collection of one or more events (items), i.e., ej = {i1, i2, …, ik}. The following is a list of examples of sequences:

Sequenceofwebpagesviewedbyawebsitevisitor:

⟨{Homepage}{Electronics}{CamerasandCamcorders}{DigitalCameras}{ShoppingCart}{OrderConfirmation}{ReturntoShopping}⟩SequenceofeventsleadingtothenuclearaccidentatThree-MileIsland:

⟨{cloggedresin}{outletvalveclosure}{lossoffeedwater}{condenserpolisheroutletvalveshut}{boosterpumpstrip}{mainwaterpumptrips}{mainturbinetrips}{reactorpressureincreases}⟩Sequenceofclassestakenbyacomputersciencemajorstudentindifferentsemesters:

⟨{AlgorithmsandDataStructures,IntroductiontoOperatingSystems}{DatabaseSystems,ComputerArchitecture}{ComputerNetworks,SoftwareEngineering}{ComputerGraphics,ParallelProgramming}⟩

A sequence can be characterized by its length and the number of occurring events. The length of a sequence corresponds to the number of elements present in the sequence, while we refer to a sequence that contains k events as a k-sequence. The web sequence in the previous example contains 7 elements and 7 events; the event sequence at Three-Mile Island contains 8 elements and 8 events; and the class sequence contains 4 elements and 8 events.

Figure6.4 providesexamplesofsequences,elements,andeventsdefinedforavarietyofapplicationdomains.Exceptforthelastrow,theordinalattributeassociatedwitheachofthefirstthreedomainscorrespondstocalendartime.Forthelastrow,theordinalattributecorrespondstothelocationofthebases(A,C,G,T)inthegenesequence.Althoughthediscussiononsequentialpatternsisprimarilyfocusedontemporalevents,itcanbeextendedtothecasewheretheeventshavenon-temporalordering,suchasspatialordering.

Figure6.4.Examplesofelementsandeventsinsequencedatasets.

Subsequences

A sequence t is a subsequence of another sequence s if it is possible to derive t from s by simply deleting some events from elements in s or even deleting some elements in s completely. Formally, the sequence t = ⟨t1 t2 … tm⟩ is a subsequence of s = ⟨s1 s2 … sn⟩ if there exist integers 1 ≤ j1 < j2 < ⋯ < jm ≤ n such that t1 ⊆ sj1, t2 ⊆ sj2, …, tm ⊆ sjm. If t is a subsequence of s, then we say that t is contained in s. Table 6.7 gives examples illustrating the idea of subsequences for various sequences.

Table 6.7. Examples illustrating the concept of a subsequence.

Sequence, s                  Sequence, t          Is t a subsequence of s?

⟨{2,4} {3,5,6} {8}⟩          ⟨{2} {3,6} {8}⟩      Yes

⟨{2,4} {3,5,6} {8}⟩          ⟨{2} {8}⟩            Yes

⟨{1,2} {3,4}⟩                ⟨{1} {2}⟩            No

⟨{2,4} {2,4} {2,5}⟩          ⟨{2} {4}⟩            Yes

6.4.2 Sequential Pattern Discovery

Let D be a data set that contains one or more data sequences. The term data sequence refers to an ordered list of elements associated with a single data object. For example, the data set shown in Figure 6.5 contains five data sequences, one for each object A, B, C, D, and E.

Figure6.5.Sequentialpatternsderivedfromadatasetthatcontainsfivedatasequences.

The support of a sequence s is the fraction of all data sequences that contain s. If the support for s is greater than or equal to a user-specified threshold minsup, then s is declared to be a sequential pattern (or frequent sequence).

Definition 6.1 (Sequential Pattern Discovery). Given a sequence data set D and a user-specified minimum support threshold minsup, the task of sequential pattern discovery is to find all sequences with support ≥ minsup.

In Figure 6.5, the support for the sequence ⟨{1}{2}⟩ is equal to 80% because it is contained in four of the five data sequences (every object except for D). Assuming that the minimum support threshold is 50%, any sequence that is contained in at least three data sequences is considered to be a sequential pattern. Examples of sequential patterns extracted from the given data set include ⟨{1}{2}⟩, ⟨{1,2}⟩, ⟨{2,3}⟩, ⟨{1,2}{2,3}⟩, etc.

Sequential pattern discovery is a computationally challenging task because the set of all possible sequences that can be generated from a collection of events is exponentially large and difficult to enumerate. For example, a collection of n events can result in the following examples of 1-sequences, 2-sequences, and 3-sequences:

1-sequences: ⟨i1⟩, ⟨i2⟩, …, ⟨in⟩

2-sequences: ⟨{i1,i2}⟩, ⟨{i1,i3}⟩, …, ⟨{in−1,in}⟩, …, ⟨{i1}{i1}⟩, ⟨{i1}{i2}⟩, …, ⟨{in}{in}⟩

3-sequences: ⟨{i1,i2,i3}⟩, ⟨{i1,i2,i4}⟩, …, ⟨{in−2,in−1,in}⟩, …, ⟨{i1}{i1,i2}⟩, ⟨{i1}{i1,i3}⟩, …, ⟨{in−1}{in−1,in}⟩, …, ⟨{i1,i2}{i2}⟩, ⟨{i1,i2}{i3}⟩, …, ⟨{in−1,in}{in}⟩, …, ⟨{i1}{i1}{i1}⟩, ⟨{i1}{i1}{i2}⟩, …, ⟨{in}{in}{in}⟩

The above enumeration is similar in some ways to the itemset lattice introduced in Chapter 5 for market basket data. However, note that the above enumeration is not exhaustive; it only shows some sequences and omits a large number of remaining ones by the use of ellipses (…). This is because the number of candidate sequences is substantially larger than the number of candidate itemsets, which makes their enumeration difficult. There are three reasons for the additional number of candidate sequences:

1. An item can appear at most once in an itemset, but an event can appear more than once in a sequence, in different elements of the

sequence.Forexample,givenapairofitems, and ,onlyonecandidate2-itemset, ,canbegenerated.Incontrast,therearemanycandidate2-sequencesthatcanbegeneratedusingonlytwoevents: ,and .

2. Ordermattersinsequences,butnotforitemsets.Forexample,and referstothesameitemset,whereas ,and correspondtodifferentsequences,andthusmustbegeneratedseparately.

3. Formarketbasketdata,thenumberofdistinctitemsnputsanupperboundonthenumberofcandidateitemsets ,whereasforsequencedata,eventwoeventsaandbcanleadtoinfinitelymanycandidatesequences(seeFigure6.6 foranillustration).

Figure6.6.Comparingthenumberofitemsetswiththenumberofsequencesgeneratedusingtwoevents(items).Weonlyshow1-sequences,2-sequences,and3-sequencesforillustration.

Becauseoftheabovereasons,itischallengingtocreateasequencelatticethatenumeratesallpossiblesequencesevenwhenthenumberofeventsinthedataissmall.Itisthusdifficulttouseabrute-forceapproachfor

i1 i2{i1,i2}

⟨{i1}{i1}⟩,⟨{i1}{i2}⟩,⟨{i2}{i1}⟩,⟨{i2}{i2}⟩ ⟨{i1,i2}⟩{i1,i2}

{i2,i1} ⟨{i1}{i2}⟩,⟨{i2}{i1}⟩⟨{i1,i2}⟩

(2n−1)

generatingsequentialpatternsthatenumeratesallpossiblesequencesbytraversingthesequencelattice.Despitethesechallenges,theAprioriprinciplestillholdsforsequentialdatabecauseanydatasequencethatcontainsaparticulark-sequencemustalsocontainallofits -subsequences.Aswewillseelater,eventhoughitischallengingtoconstructthesequencelattice,itispossibletogeneratecandidatek-sequencesfromthefrequent -sequencesusingtheAprioriprinciple.ThisallowsustoextractsequentialpatternsfromasequencedatasetusinganApriori-likealgorithm.ThebasicstructureofthisalgorithmisshowninAlgorithm6.1 .

Algorithm 6.1 Apriori-like algorithm for sequential pattern discovery.

1: k = 1.
2: Fk = { i | i ∈ I ∧ σ({i})/N ≥ minsup }. {Find all frequent 1-subsequences.}
3: repeat
4:   k = k + 1.
5:   Ck = candidate-gen(Fk−1). {Generate candidate k-subsequences.}
6:   Ck = candidate-prune(Ck, Fk−1). {Prune candidate k-subsequences.}
7:   for each data sequence t ∈ T do
8:     Ct = subsequence(Ck, t). {Identify all candidates contained in t.}
9:     for each candidate k-subsequence c ∈ Ct do
10:      σ(c) = σ(c) + 1. {Increment the support count.}
11:    end for
12:  end for
13:  Fk = { c | c ∈ Ck ∧ σ(c)/N ≥ minsup }. {Extract the frequent k-subsequences.}
14: until Fk = ∅
15: Answer = ∪ Fk.

Notice that the structure of the algorithm is almost identical to the Apriori algorithm for frequent itemset discovery, presented in the previous chapter. The algorithm would iteratively generate new candidate k-sequences, prune candidates whose (k−1)-sequences are infrequent, and then count the supports of the remaining candidates to identify the sequential patterns. The detailed aspects of these steps are given next.

Candidate Generation

We generate candidate k-sequences by merging a pair of frequent (k−1)-sequences. Although this approach is similar to the Fk−1 × Fk−1 strategy introduced in Chapter 5 for generating candidate itemsets, there are certain differences. First, in the case of generating sequences, we can merge a (k−1)-sequence with itself to produce a k-sequence, since an event can appear more than once in a sequence. For example, we can merge the 1-sequence ⟨a⟩ with itself to produce a candidate 2-sequence, ⟨{a}{a}⟩. Second, recall that in order to avoid generating duplicate candidates, the traditional Apriori algorithm merges a pair of frequent k-itemsets only if their first k−1 items, arranged in lexicographic order, are identical. In the case of generating sequences, we still use the lexicographic order for arranging events within an element. However, the arrangement of elements in a sequence may not follow the lexicographic order. For example, ⟨{b,c}{a}{d}⟩ is a viable representation of a 4-sequence, even though the elements in the sequence are not arranged according to their lexicographic ranks. On the other hand, ⟨{c,b}{a}{d}⟩ is not a viable representation of the same 4-sequence, since the events in the first element violate the lexicographic order.

Given a sequence s = ⟨e1 e2 e3 … en⟩, where the events in every element are arranged lexicographically, we can refer to the first event of e1 as the first event of s and the last event of en as the last event of s. The criteria for merging sequences can then be stated in the form of the following procedure.

Sequence Merging Procedure

A sequence s(1) is merged with another sequence s(2) only if the subsequence obtained by dropping the first event in s(1) is identical to the subsequence obtained by dropping the last event in s(2). The resulting candidate is given by extending the sequence s(1) as follows:

1. If the last element of s(2) has only one event, append the last element of s(2) to the end of s(1) and obtain the merged sequence.

2. If the last element of s(2) has more than one event, append the last event from the last element of s(2) (that is absent in the last element of s(1)) to the last element of s(1) and obtain the merged sequence.

Figure 6.7 illustrates examples of candidate 4-sequences obtained by merging pairs of frequent 3-sequences. The first candidate, ⟨{1}{2}{3}{4}⟩, is obtained by merging ⟨{1}{2}{3}⟩ with ⟨{2}{3}{4}⟩. Since the last element of the second sequence ({4}) contains only one event, it is simply appended to the first sequence to generate the merged sequence. On the other hand, merging ⟨{1}{5}{3}⟩ with ⟨{5}{3,4}⟩ produces the candidate 4-sequence ⟨{1}{5}{3,4}⟩. In this case, the last element of the second sequence ({3,4}) contains two events. Hence, the last event in this element (4) is added to the last element of the first sequence ({3}) to obtain the merged sequence.

Figure 6.7. Example of the candidate generation and pruning steps of a sequential pattern mining algorithm.

It can be shown that the sequence merging procedure is complete, i.e., it generates every frequent k-sequence. This is because every frequent k-sequence s includes a frequent (k−1)-sequence s1 that does not contain the first event of s, and a frequent (k−1)-sequence s2 that does not contain the last event of s. Since s1 and s2 are frequent and follow the criteria for merging sequences, they will be merged to produce every frequent k-sequence s as one of the candidates. Also, the sequence merging procedure ensures that there is a unique way of generating s, only by merging s1 and s2. For example, in Figure 6.7, the sequences ⟨{1}{2}{3}⟩ and ⟨{1}{2,5}⟩ do not have to be merged because removing the first event from the first sequence does not give the same subsequence as removing the last event from the second sequence. Although ⟨{1}{2,5}{3}⟩ is a viable candidate, it is generated by merging a different pair of sequences, ⟨{1}{2,5}⟩ and ⟨{2,5}{3}⟩. This example illustrates that the sequence merging procedure does not generate duplicate candidate sequences.
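
The merging criterion and the two extension rules above can be written down directly. The following Python sketch is our own illustration (sequences are lists of event lists, with events inside an element kept in lexicographic order); it returns the merged candidate, or None when the pair does not satisfy the merging criterion, and the examples reproduce the merges discussed for Figure 6.7.

def drop_first_event(seq):
    """Sequence obtained by deleting the first event of the first element."""
    head, rest = seq[0], seq[1:]
    return rest if len(head) == 1 else [head[1:]] + rest

def drop_last_event(seq):
    """Sequence obtained by deleting the last event of the last element."""
    body, tail = seq[:-1], seq[-1]
    return body if len(tail) == 1 else body + [tail[:-1]]

def merge(s1, s2):
    """Merge two frequent (k-1)-sequences into a candidate k-sequence."""
    if drop_first_event(s1) != drop_last_event(s2):
        return None                                  # merging criterion not met
    if len(s2[-1]) == 1:
        # Rule 1: the last element of s2 has a single event; append it as a new element.
        return s1 + [s2[-1]]
    # Rule 2: the last element of s2 has several events; append its last event
    # to the last element of s1.
    return s1[:-1] + [s1[-1] + [s2[-1][-1]]]

# Examples from Figure 6.7
print(merge([[1], [2], [3]], [[2], [3], [4]]))   # [[1], [2], [3], [4]]
print(merge([[1], [5], [3]], [[5], [3, 4]]))     # [[1], [5], [3, 4]]
print(merge([[1], [2], [3]], [[1], [2, 5]]))     # None (criterion not met)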

Candidate Pruning

A candidate k-sequence is pruned if at least one of its (k−1)-sequences is infrequent. For example, consider the candidate 4-sequence ⟨{1}{2}{3}{4}⟩. We need to check if any of the 3-sequences contained in this 4-sequence is infrequent. Since the sequence ⟨{1}{2}{4}⟩ is contained in this sequence and is infrequent, the candidate ⟨{1}{2}{3}{4}⟩ can be eliminated. Readers should be able to verify that the only candidate 4-sequence that survives the candidate pruning step in Figure 6.7 is ⟨{1}{2,5}{3}⟩.
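
A sketch of the pruning step, again our own illustration and ignoring timing constraints for now: every (k−1)-subsequence of a candidate k-sequence is obtained by deleting exactly one event, so it suffices to enumerate those deletions and look each one up among the frequent (k−1)-sequences.

def delete_one_event(seq):
    """All (k-1)-sequences obtained by deleting exactly one event from seq."""
    results = []
    for i, element in enumerate(seq):
        for ev in element:
            shrunk = [e for e in element if e != ev]
            # drop the element entirely if it becomes empty
            results.append(seq[:i] + ([shrunk] if shrunk else []) + seq[i + 1:])
    return results

def candidate_prune(candidates, frequent_k_minus_1):
    """Keep only candidates whose one-event-deleted subsequences are all frequent."""
    frequent = {tuple(map(tuple, s)) for s in frequent_k_minus_1}
    return [c for c in candidates
            if all(tuple(map(tuple, sub)) in frequent for sub in delete_one_event(c))]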

Support Counting

During support counting, the algorithm identifies all candidate k-sequences belonging to a particular data sequence and increments their support counts. After performing this step for each data sequence, the algorithm identifies the frequent k-sequences and discards all candidate sequences whose support values are less than the minsup threshold.
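
Putting the three steps together, a minimal driver in the spirit of Algorithm 6.1 might look as follows. This is a sketch under the same assumptions as the previous snippets (it reuses is_subsequence, merge, and candidate_prune from above, represents sequences as lists of sorted event lists, and ignores timing constraints); it is not the book's reference implementation.

def apriori_sequential(data_sequences, minsup):
    """Return all sequential patterns with support >= minsup (a fraction)."""
    n = len(data_sequences)
    events = sorted({ev for seq in data_sequences for elem in seq for ev in elem})
    # Line 2: frequent 1-sequences.
    fk = [[[e]] for e in events
          if sum(is_subsequence([[e]], d) for d in data_sequences) / n >= minsup]
    patterns = list(fk)
    while fk:                                            # Lines 3-14
        candidates = []
        for s1 in fk:                                    # Line 5: candidate generation
            for s2 in fk:
                merged = merge(s1, s2)
                if merged is not None and merged not in candidates:
                    candidates.append(merged)
        candidates = candidate_prune(candidates, fk)     # Line 6: candidate pruning
        fk = [c for c in candidates                      # Lines 7-13: support counting
              if sum(is_subsequence(c, d) for d in data_sequences) / n >= minsup]
        patterns.extend(fk)
    return patterns                                      # Line 15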

6.4.3 Timing Constraints*

This section presents a sequential pattern formulation where timing constraints are imposed on the events and elements of a pattern. To motivate the need for timing constraints, consider the following sequence of courses taken by two students who enrolled in a data mining class:

Student A: ⟨{Statistics} {Database Systems} {Data Mining}⟩.
Student B: ⟨{Database Systems} {Statistics} {Data Mining}⟩.

The sequential pattern of interest is ⟨{Statistics, Database Systems} {Data Mining}⟩, which means that students who are enrolled in the data mining class must have previously taken a course in statistics and database systems. Clearly, the pattern is supported by both students even though they do not take statistics and database systems at the same time. In contrast, a student who took a statistics course ten years earlier should not be considered as supporting the pattern because the time gap between the courses is too long. Because the formulation presented in the previous section does not incorporate these timing constraints, a new sequential pattern definition is needed.

Figure 6.8 illustrates some of the timing constraints that can be imposed on a pattern. The definition of these constraints and the impact they have on sequential pattern discovery algorithms will be discussed in the following sections. Note that each element of the sequential pattern is associated with a time window [l, u], where l is the earliest occurrence of an event within the time window and u is the latest occurrence of an event within the time window. Note that in this discussion, we allow events within an element to occur at different times. Hence, the actual timing of the event occurrence may not be the same as the lexicographic ordering.

Figure 6.8. Timing constraints of a sequential pattern.

The maxspan Constraint The maxspan constraint specifies the maximum allowed time difference between the latest and the earliest occurrences of events in the entire sequence. For example, suppose the following data sequences contain elements that occur at consecutive timestamps (1, 2, 3, …), i.e., the ith element in the sequence occurs at the ith timestamp. Assuming that maxspan = 3, the following table contains sequential patterns that are supported and not supported by a given data sequence.

Data Sequence, s                  Sequential Pattern, t    Does s support t?
⟨{1,3} {3,4} {4} {5} {6,7} {8}⟩   ⟨{3} {4}⟩                Yes
⟨{1,3} {3,4} {4} {5} {6,7} {8}⟩   ⟨{3} {6}⟩                Yes
⟨{1,3} {3,4} {4} {5} {6,7} {8}⟩   ⟨{1,3} {6}⟩              No

In general, the longer the maxspan, the more likely it is to detect a pattern in a data sequence. However, a longer maxspan can also capture spurious patterns because it increases the chance for two unrelated events to be temporally related. In addition, the pattern may involve events that are already obsolete.

The maxspan constraint affects the support counting step of sequential pattern discovery algorithms. As shown in the preceding examples, some data sequences no longer support a candidate pattern when the maxspan constraint is imposed. If we simply apply Algorithm 6.1, the support counts for some patterns may be overestimated. To avoid this problem, the algorithm must be modified to ignore cases where the interval between the first and last occurrences of events in a given pattern is greater than maxspan.

The mingap and maxgap Constraints Timing constraints can also be specified to restrict the time difference between two consecutive elements of a sequence. If the maximum time difference (maxgap) is one week, then events in one element must occur within a week's time of the events occurring in the previous element. If the minimum time difference (mingap) is zero, then events in one element must occur after the events occurring in the previous element. (See Figure 6.8.) The following table shows examples of patterns that pass or fail the maxgap and mingap constraints, assuming that maxgap = 3 and mingap = 1. These examples assume each element occurs at consecutive time steps.

Data Sequence, s                  Sequential Pattern, t    maxgap    mingap
⟨{1,3} {3,4} {4} {5} {6,7} {8}⟩   ⟨{3} {6}⟩                Pass      Pass
⟨{1,3} {3,4} {4} {5} {6,7} {8}⟩   ⟨{6} {8}⟩                Pass      Fail
⟨{1,3} {3,4} {4} {5} {6,7} {8}⟩   ⟨{1,3} {6}⟩              Fail      Pass
⟨{1,3} {3,4} {4} {5} {6,7} {8}⟩   ⟨{1} {3} {8}⟩            Fail      Fail

As with maxspan, these constraints will affect the support counting step of sequential pattern discovery algorithms because some data sequences no longer support a candidate pattern when mingap and maxgap constraints are present. These algorithms must be modified to ensure that the timing constraints are not violated when counting the support of a pattern. Otherwise, some infrequent sequences may mistakenly be declared as frequent patterns.

A side effect of using the maxgap constraint is that the Apriori principle might be violated. To illustrate this, consider the data set shown in Figure 6.5. Without mingap or maxgap constraints, the support for ⟨{2} {5}⟩ and ⟨{2} {3} {5}⟩ are both equal to 60%. However, if mingap = 0 and maxgap = 1, then the support for ⟨{2} {5}⟩ reduces to 40%, while the support for ⟨{2} {3} {5}⟩ is still 60%. In other words, support has increased when the number of events in a sequence increases, which contradicts the Apriori principle. The violation occurs because the object D does not support the pattern ⟨{2} {5}⟩ since the time gap between events 2 and 5 is greater than maxgap. This problem can be avoided by using the concept of a contiguous subsequence.

Definition 6.2 (Contiguous Subsequence). A sequence s is a contiguous subsequence of w = ⟨e1 e2 … ek⟩ if any one of the following conditions hold:

1. s is obtained from w after deleting an event from either e1 or ek,

2. s is obtained from w after deleting an event from any element ei ∈ w that contains at least two events, or

3. s is a contiguous subsequence of t and t is a contiguous subsequence of w.

The following examples illustrate the concept of a contiguous subsequence:

Data Sequence, s             Sequential Pattern, t    Is t a contiguous subsequence of s?
⟨{1} {2,3}⟩                  ⟨{1} {2}⟩                Yes
⟨{1,2} {2} {3}⟩              ⟨{1} {2}⟩                Yes
⟨{3,4} {1,2} {2,3} {4}⟩      ⟨{1} {2}⟩                Yes
⟨{1} {3} {2}⟩                ⟨{1} {2}⟩                No
⟨{1,2} {1} {3} {2}⟩          ⟨{1} {2}⟩                No

Using the concept of contiguous subsequences, the Apriori principle can be modified to handle maxgap constraints in the following way.

Definition 6.3 (Modified Apriori Principle). If a k-sequence is frequent, then all of its contiguous (k−1)-subsequences must also be frequent.
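
Definition 6.2 can be checked directly, if inefficiently, by applying the allowed single-event deletions to w until t is reached. The sketch below is our own illustration: sequences are tuples of frozensets so the recursion can be memoized, and we interpret condition 1 as also removing the first or last element when its only event is deleted.

from functools import lru_cache

def _deletions(w):
    """Sequences reachable from w by one deletion allowed by Definition 6.2."""
    out = []
    for i, elem in enumerate(w):
        # condition 1: delete from the first or last element;
        # condition 2: delete from any element that has at least two events
        if i in (0, len(w) - 1) or len(elem) >= 2:
            for ev in elem:
                rest = elem - {ev}
                out.append(w[:i] + ((rest,) if rest else ()) + w[i + 1:])
    return out

@lru_cache(maxsize=None)
def is_contiguous_subsequence(t, w):
    """Condition 3: t is reachable from w by a chain of allowed deletions."""
    if t == w:
        return True
    if sum(map(len, t)) >= sum(map(len, w)):
        return False
    return any(is_contiguous_subsequence(t, v) for v in _deletions(w))

def seq(*elems):
    return tuple(frozenset(e) for e in elems)

# Examples from the table above
print(is_contiguous_subsequence(seq({1}, {2}), seq({1}, {2, 3})))      # True (Yes)
print(is_contiguous_subsequence(seq({1}, {2}), seq({1}, {3}, {2})))    # False (No)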

The modified Apriori principle can be applied to the sequential pattern discovery algorithm with minor modifications. During candidate pruning, not all k-sequences need to be verified since some of them may violate the maxgap constraint. For example, if maxgap = 1, it is not necessary to check whether the subsequence ⟨{1} {2,3} {5}⟩ of the candidate ⟨{1} {2,3} {4} {5}⟩ is frequent since the time difference between elements {2,3} and {5} is greater than one time unit. Instead, only the contiguous subsequences of ⟨{1} {2,3} {4} {5}⟩ need to be examined. These subsequences include ⟨{1} {2,3} {4}⟩, ⟨{2,3} {4} {5}⟩, ⟨{1} {2} {4} {5}⟩, and ⟨{1} {3} {4} {5}⟩.

The Window Size Constraint Finally, events within an element sj do not have to occur at the same time. A window size threshold (ws) can be defined to specify the maximum allowed time difference between the latest and earliest occurrences of events in any element of a sequential pattern. A window size of 0 means all events in the same element of a pattern must occur simultaneously.

The following example uses ws = 2 to determine whether a data sequence supports a given sequence (assuming mingap = 0, maxgap = 3, and maxspan = ∞).

Data Sequence, s                  Sequential Pattern, t     Does s support t?
⟨{1,3} {3,4} {4} {5} {6,7} {8}⟩   ⟨{3,4} {5}⟩               Yes
⟨{1,3} {3,4} {4} {5} {6,7} {8}⟩   ⟨{4,6} {8}⟩               Yes
⟨{1,3} {3,4} {4} {5} {6,7} {8}⟩   ⟨{3,4,6} {8}⟩             No
⟨{1,3} {3,4} {4} {5} {6,7} {8}⟩   ⟨{1,3,4} {6,7,8}⟩         No

In the last example, although the pattern ⟨{1,3,4} {6,7,8}⟩ satisfies the window size constraint, it violates the maxgap constraint because the maximum time difference between events in the two elements is 5 units. The window size constraint also affects the support counting step of sequential pattern discovery algorithms. If Algorithm 6.1 is applied without imposing the window size constraint, the support counts for some of the candidate patterns might be underestimated, and thus some interesting patterns may be lost.
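
As a concrete, brute-force illustration of how the four constraints interact, the following sketch (ours, not the book's algorithm) checks whether a data sequence supports a pattern. The data sequence is given as a timeline mapping each event to the timestamps at which it occurs; mingap is treated as a strict lower bound on the gap between consecutive elements, maxgap as the largest allowed difference between the latest event of an element and the earliest event of the previous element, and maxspan is measured over the whole occurrence. The two calls at the end reproduce rows from the window size table above.

import itertools

def element_windows(element, timeline, ws):
    """All (l, u) time windows in which every event of the element occurs, with u - l <= ws."""
    choices = [timeline.get(ev, []) for ev in element]
    windows = set()
    for combo in itertools.product(*choices):
        if combo and max(combo) - min(combo) <= ws:
            windows.add((min(combo), max(combo)))
    return sorted(windows)

def supports(timeline, pattern, ws=0, mingap=0, maxgap=float("inf"), maxspan=float("inf")):
    """True if some occurrence of the pattern satisfies all timing constraints."""
    per_element = [element_windows(e, timeline, ws) for e in pattern]
    if any(not w for w in per_element):
        return False
    for combo in itertools.product(*per_element):
        gaps_ok = all(combo[i + 1][0] - combo[i][1] > mingap and      # mingap between elements
                      combo[i + 1][1] - combo[i][0] <= maxgap         # maxgap between elements
                      for i in range(len(combo) - 1))
        if gaps_ok and combo[-1][1] - combo[0][0] <= maxspan:         # maxspan over the pattern
            return True
    return False

# Data sequence <{1,3}{3,4}{4}{5}{6,7}{8}> at timestamps 1..6, written as a timeline.
timeline = {1: [1], 3: [1, 2], 4: [2, 3], 5: [4], 6: [5], 7: [5], 8: [6]}
print(supports(timeline, [[4, 6], [8]], ws=2, mingap=0, maxgap=3))      # True  (Yes above)
print(supports(timeline, [[3, 4, 6], [8]], ws=2, mingap=0, maxgap=3))   # False (No above)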

6.4.4 Alternative Counting Schemes*

There are multiple ways of counting the support of a sequence given a data sequence. For example, if our database involves long sequences of events, we might be interested in finding subsequences that occur multiple times in the same data sequence. Hence, instead of counting the support of a subsequence as the number of data sequences it is contained in, we can also take into account the number of times a subsequence is contained in a data sequence. This viewpoint gives rise to several different formulations for counting the support of a candidate k-sequence from a database of sequences. For illustrative purposes, consider the problem of counting the support for the sequence ⟨{p}{q}⟩, as shown in Figure 6.9. Assume that ws = 0, mingap = 0, maxgap = 2, and maxspan = 2.

Figure 6.9. Comparing different support counting methods.

COBJ: One occurrence per object.

This method looks for at least one occurrence of a given sequence in an object's timeline. In Figure 6.9, even though the sequence ⟨{p}{q}⟩ appears several times in the object's timeline, it is counted only once, with p occurring at t=1 and q occurring at t=3.

CWIN: One occurrence per sliding window.

In this approach, a sliding time window of fixed length (maxspan) is moved across an object's timeline, one unit at a time. The support count is incremented each time the sequence is encountered in the sliding window. In Figure 6.9, the sequence ⟨{p}{q}⟩ is observed six times using this method.

CMINWIN: Number of minimal windows of occurrence.

A minimal window of occurrence is the smallest window in which the sequence occurs given the timing constraints. In other words, a minimal window is the time interval such that the sequence occurs in that time interval, but it does not occur in any of the proper subintervals of it. This definition can be considered as a restrictive version of CWIN, because its effect is to shrink and collapse some of the windows that are counted by CWIN. For example, sequence ⟨{p}{q}⟩ has four minimal window occurrences: (1) the pair (p: t=2, q: t=3), (2) the pair (p: t=3, q: t=4), (3) the pair (p: t=5, q: t=6), and (4) the pair (p: t=6, q: t=7). The occurrence of event p at t=1 and event q at t=3 is not a minimal window occurrence because it contains a smaller window with (p: t=2, q: t=3), which is indeed a minimal window of occurrence.

CDIST_O: Distinct occurrences with possibility of event-timestamp overlap.

A distinct occurrence of a sequence is defined to be the set of event-timestamp pairs such that there has to be at least one new event-timestamp pair that is different from a previously counted occurrence. Counting all such distinct occurrences results in the CDIST_O method. If the occurrence time of events p and q is denoted as a tuple (t(p), t(q)), then this method yields eight distinct occurrences of sequence ⟨{p}{q}⟩ at times (1,3), (2,3), (2,4), (3,4), (3,5), (5,6), (5,7), and (6,7).

CDIST: Distinct occurrences with no event-timestamp overlap allowed.

In CDIST_O above, two occurrences of a sequence were allowed to have overlapping event-timestamp pairs, e.g., the overlap between (1,3) and (2,3). In the CDIST method, no overlap is allowed. Effectively, when an event-timestamp pair is considered for counting, it is marked as used and is never used again for subsequent counting of the same sequence. As an example, there are five distinct, non-overlapping occurrences of the sequence ⟨{p}{q}⟩ in the diagram shown in Figure 6.9. These occurrences happen at times (1,3), (2,4), (3,5), (5,6), and (6,7). Observe that these occurrences are subsets of the occurrences observed in CDIST_O.

One final point regarding the counting methods is the need to determine the baseline for computing the support measure. For frequent itemset mining, the baseline is given by the total number of transactions. For sequential pattern mining, the baseline depends on the counting method used. For the COBJ method, the total number of objects in the input data can be used as the baseline. For the CWIN and CMINWIN methods, the baseline is given by the sum of the number of time windows possible in all objects. For methods such as CDIST and CDIST_O, the baseline is given by the sum of the number of distinct timestamps present in the input data of each object.
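
For the two-event pattern ⟨{p}{q}⟩, the distinct-occurrence counts are easy to reproduce. The sketch below is our own illustration: CDIST_O enumerates all (t(p), t(q)) pairs that satisfy the gap constraints (for a two-element pattern of single events, maxspan coincides with maxgap), and CDIST greedily keeps pairs whose timestamps have not been used before. The timeline is hypothetical, chosen only to be consistent with the occurrence times quoted above, since Figure 6.9 itself is not reproduced here.

def cdist_o(p_times, q_times, mingap=0, maxgap=float("inf")):
    """All distinct (t(p), t(q)) occurrences of the 2-sequence <{p}{q}>."""
    return [(tp, tq) for tp in p_times for tq in q_times
            if mingap < tq - tp <= maxgap]

def cdist(p_times, q_times, mingap=0, maxgap=float("inf")):
    """Greedy selection of occurrences that share no event-timestamp pair."""
    used_p, used_q, chosen = set(), set(), []
    for tp, tq in sorted(cdist_o(p_times, q_times, mingap, maxgap)):
        if tp not in used_p and tq not in used_q:
            chosen.append((tp, tq))
            used_p.add(tp)
            used_q.add(tq)
    return chosen

# Hypothetical timeline consistent with the occurrence times quoted above:
# p occurs at t = 1, 2, 3, 5, 6 and q occurs at t = 3, 4, 5, 6, 7.
p_times, q_times = [1, 2, 3, 5, 6], [3, 4, 5, 6, 7]
print(len(cdist_o(p_times, q_times, maxgap=2)))   # 8 occurrences (CDIST_O)
print(cdist(p_times, q_times, maxgap=2))          # [(1, 3), (2, 4), (3, 5), (5, 6), (6, 7)]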

6.5 Subgraph Patterns

This section describes the application of association analysis methods to graphs, which are more complex entities than itemsets and sequences. A number of entities such as chemical compounds, 3-D protein structures, computer networks, and tree structured XML documents can be modeled using a graph representation, as shown in Table 6.8.

Table 6.8. Graph representation of entities in various application domains.

Application                Graphs                                 Vertices                 Edges
Web mining                 Collection of inter-linked Web pages   Web pages                Hyperlink between pages
Computational chemistry    Chemical compounds                     Atoms or ions            Bond between atoms or ions
Computer security          Computer networks                      Computers and servers    Interconnection between machines
Semantic Web               XML documents                          XML elements             Parent-child relationship between elements
Bioinformatics             3-D Protein structures                 Amino acids              Contact residue

A useful data mining task to perform on this type of data is to derive a set of frequently occurring substructures in a collection of graphs. Such a task is known as frequent subgraph mining. A potential application of frequent subgraph mining can be seen in the context of computational chemistry. Each year, new chemical compounds are designed for the development of pharmaceutical drugs, pesticides, fertilizers, etc. Although the structure of a compound is known to play a major role in determining its chemical properties, it is difficult to establish their exact relationship. Frequent subgraph mining can aid this undertaking by identifying the substructures commonly associated with certain properties of known compounds. Such information can help scientists to develop new chemical compounds that have certain desired properties.

This section presents a methodology for applying association analysis to graph-based data. The section begins with a review of some of the basic graph-related concepts and definitions. The frequent subgraph mining problem is then introduced, followed by a description of how the traditional Apriori algorithm can be extended to discover such patterns.

6.5.1 Preliminaries

Graphs A graph is a data structure that can be used to represent relationships among a set of entities. Mathematically, a graph G = (V, E) is composed of a vertex set V and a set of edges E connecting pairs of vertices. Each edge is denoted by a vertex pair (vi, vj), where vi, vj ∈ V. A label can be assigned to each vertex vi representing the name of an entity. Similarly, each edge (vi, vj) can also be associated with a label describing the relationship between a pair of entities. Table 6.8 shows the ve