
INTRODUCTION TO DATA MINING
SECOND EDITION

PANG-NING TAN, Michigan State University
MICHAEL STEINBACH, University of Minnesota
ANUJ KARPATNE, University of Minnesota
VIPIN KUMAR, University of Minnesota

330 Hudson Street, NY NY 10013

Director, Portfolio Management: Engineering, Computer Science & Global Editions: Julian Partridge
Specialist, Higher Ed Portfolio Management: Matt Goldstein
Portfolio Management Assistant: Meghan Jacoby
Managing Content Producer: Scott Disanno
Content Producer: Carole Snyder
Web Developer: Steve Wright
Rights and Permissions Manager: Ben Ferrini
Manufacturing Buyer, Higher Ed, Lake Side Communications Inc (LSC): Maura Zaldivar-Garcia
Inventory Manager: Ann Lam
Product Marketing Manager: Yvonne Vannatta
Field Marketing Manager: Demetrius Hall
Marketing Assistant: Jon Bryant
Cover Designer: Joyce Wells, jWells Design
Full-Service Project Management: Chandrasekar Subramanian, SPi Global

Copyright © 2019 Pearson Education, Inc. All rights reserved. Manufactured in the United States of America. This publication is protected by Copyright, and permission should be obtained from the publisher prior to any prohibited reproduction, storage in a retrieval system, or transmission in any form or by any means, electronic, mechanical, photocopying, recording, or likewise. For information regarding permissions, request forms and the appropriate contacts within the Pearson Education Global Rights & Permissions department, please visit www.pearsonhighed.com/permissions/.

Many of the designations by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and the publisher was aware of a trademark claim, the designations have been printed in initial caps or all caps.

Library of Congress Cataloging-in-Publication Data on File

Names: Tan, Pang-Ning, author. | Steinbach, Michael, author. | Karpatne, Anuj, author. | Kumar, Vipin, 1956- author.
Title: Introduction to Data Mining / Pang-Ning Tan, Michigan State University, Michael Steinbach, University of Minnesota, Anuj Karpatne, University of Minnesota, Vipin Kumar, University of Minnesota.
Description: Second edition. | New York, NY : Pearson Education, [2019] | Includes bibliographical references and index.
Identifiers: LCCN 2017048641 | ISBN 9780133128901 | ISBN 0133128903
Subjects: LCSH: Data mining.
Classification: LCC QA76.9.D343 T35 2019 | DDC 006.3/12–dc23 LC record available at https://lccn.loc.gov/2017048641


ISBN-10: 0133128903
ISBN-13: 9780133128901

To our families …

Preface to the Second Edition

Since the first edition, roughly 12 years ago, much has changed in the field of data analysis. The volume and variety of data being collected continues to increase, as has the rate (velocity) at which it is being collected and used to make decisions. Indeed, the term Big Data has been used to refer to the massive and diverse data sets now available. In addition, the term data science has been coined to describe an emerging area that applies tools and techniques from various fields, such as data mining, machine learning, statistics, and many others, to extract actionable insights from data, often big data.

The growth in data has created numerous opportunities for all areas of data analysis. The most dramatic developments have been in the area of predictive modeling, across a wide range of application domains. For instance, recent advances in neural networks, known as deep learning, have shown impressive results in a number of challenging areas, such as image classification, speech recognition, and text categorization and understanding. While not as dramatic, other areas, e.g., clustering, association analysis, and anomaly detection, have also continued to advance. This new edition is in response to those advances.

Overview

As with the first edition, the second edition of the book provides a comprehensive introduction to data mining and is designed to be accessible and useful to students, instructors, researchers, and professionals. Areas covered include data preprocessing, predictive modeling, association analysis, cluster analysis, anomaly detection, and avoiding false discoveries. The goal is to present fundamental concepts and algorithms for each topic, thus providing the reader with the necessary background for the application of data mining to real problems. As before, classification, association analysis, and cluster analysis are each covered in a pair of chapters. The introductory chapter covers basic concepts, representative algorithms, and evaluation techniques, while the following chapter discusses advanced concepts and algorithms. As before, our objective is to provide the reader with a sound understanding of the foundations of data mining, while still covering many important advanced topics. Because of this approach, the book is useful both as a learning tool and as a reference.

To help readers better understand the concepts that have been presented, we provide an extensive set of examples, figures, and exercises. The solutions to the original exercises, which are already circulating on the web, will be made public. The exercises are mostly unchanged from the last edition, with the exception of new exercises in the chapter on avoiding false discoveries. New exercises for the other chapters and their solutions will be available to instructors via the web. Bibliographic notes are included at the end of each chapter for readers who are interested in more advanced topics, historically important papers, and recent trends. These have also been significantly updated. The book also contains a comprehensive subject and author index.

What is New in the Second Edition?

Some of the most significant improvements in the text have been in the two chapters on classification. The introductory chapter uses the decision tree classifier for illustration, but the discussion on many topics (those that apply across all classification approaches) has been greatly expanded and clarified, including topics such as overfitting, underfitting, the impact of training size, model complexity, model selection, and common pitfalls in model evaluation. Almost every section of the advanced classification chapter has been significantly updated. The material on Bayesian networks, support vector machines, and artificial neural networks has been significantly expanded. We have added a separate section on deep networks to address the current developments in this area. The discussion of evaluation, which occurs in the section on imbalanced classes, has also been updated and improved.

The changes in association analysis are more localized. We have completely reworked the section on the evaluation of association patterns (introductory chapter), as well as the sections on sequence and graph mining (advanced chapter). Changes to cluster analysis are also localized. The introductory chapter adds a K-means initialization technique and an updated discussion of cluster evaluation. The advanced clustering chapter adds a new section on spectral graph clustering. Anomaly detection has been greatly revised and expanded. Existing approaches (statistical, nearest neighbor/density-based, and clustering-based) have been retained and updated, while new approaches have been added: reconstruction-based, one-class classification, and information-theoretic. The reconstruction-based approach is illustrated using autoencoder networks that are part of the deep learning paradigm. The data chapter has been updated to include discussions of mutual information and kernel-based techniques.

The last chapter, which discusses how to avoid false discoveries and produce valid results, is completely new, and is novel among other contemporary textbooks on data mining. It supplements the discussions in the other chapters with a discussion of the statistical concepts (statistical significance, p-values, false discovery rate, permutation testing, etc.) relevant to avoiding spurious results, and then illustrates these concepts in the context of data mining techniques. This chapter addresses the increasing concern over the validity and reproducibility of results obtained from data analysis. The addition of this last chapter is a recognition of the importance of this topic and an acknowledgment that a deeper understanding of this area is needed for those analyzing data.

The data exploration chapter has been deleted, as have the appendices, from the print edition of the book, but they will remain available on the web. A new appendix provides a brief discussion of scalability in the context of big data.

To the Instructor

As a textbook, this book is suitable for a wide range of students at the advanced undergraduate or graduate level. Since students come to this subject with diverse backgrounds that may not include extensive knowledge of statistics or databases, our book requires minimal prerequisites. No database knowledge is needed, and we assume only a modest background in statistics or mathematics, although such a background will make for easier going in some sections. As before, the book, and more specifically, the chapters covering major data mining topics, are designed to be as self-contained as possible. Thus, the order in which topics can be covered is quite flexible. The core material is covered in Chapters 2 (data), 3 (classification), 5 (association analysis), 7 (clustering), and 9 (anomaly detection). We recommend at least a cursory coverage of Chapter 10 (Avoiding False Discoveries) to instill in students some caution when interpreting the results of their data analysis. Although the introductory data chapter (2) should be covered first, the basic classification (3), association analysis (5), and clustering (7) chapters can be covered in any order. Because of the relationship of anomaly detection (9) to classification (3) and clustering (7), these chapters should precede Chapter 9.

Various topics can be selected from the advanced classification, association analysis, and clustering chapters (4, 6, and 8, respectively) to fit the schedule and interests of the instructor and students. We also advise that the lectures be augmented by projects or practical exercises in data mining. Although they are time consuming, such hands-on assignments greatly enhance the value of the course.

Support Materials

Support materials available to all readers of this book are available at http://www-users.cs.umn.edu/~kumar/dmbook.

- PowerPoint lecture slides
- Suggestions for student projects
- Data mining resources, such as algorithms and data sets
- Online tutorials that give step-by-step examples for selected data mining techniques described in the book using actual data sets and data analysis software

Additional support materials, including solutions to exercises, are available only to instructors adopting this textbook for classroom use. The book's resources will be mirrored at www.pearsonhighered.com/cs-resources. Comments and suggestions, as well as reports of errors, can be sent to the authors through dmbook@cs.umn.edu.

Acknowledgments

Many people contributed to the first and second editions of the book. We begin by acknowledging our families to whom this book is dedicated. Without their patience and support, this project would have been impossible.

We would like to thank the current and former students of our data mining groups at the University of Minnesota and Michigan State for their contributions. Eui-Hong (Sam) Han and Mahesh Joshi helped with the initial data mining classes. Some of the exercises and presentation slides that they created can be found in the book and its accompanying slides. Students in our data mining groups who provided comments on drafts of the book or who contributed in other ways include Shyam Boriah, Haibin Cheng, Varun Chandola, Eric Eilertson, Levent Ertöz, Jing Gao, Rohit Gupta, Sridhar Iyer, Jung-Eun Lee, Benjamin Mayer, Aysel Ozgur, Uygar Oztekin, Gaurav Pandey, Kashif Riaz, Jerry Scripps, Gyorgy Simon, Hui Xiong, Jieping Ye, and Pusheng Zhang. We would also like to thank the students of our data mining classes at the University of Minnesota and Michigan State University who worked with early drafts of the book and provided invaluable feedback. We specifically note the helpful suggestions of Bernardo Craemer, Arifin Ruslim, Jamshid Vayghan, and Yu Wei.

Joydeep Ghosh (University of Texas) and Sanjay Ranka (University of Florida) class tested early versions of the book. We also received many useful suggestions directly from the following UT students: Pankaj Adhikari, Rajiv Bhatia, Frederic Bosche, Arindam Chakraborty, Meghana Deodhar, Chris Everson, David Gardner, Saad Godil, Todd Hay, Clint Jones, Ajay Joshi, Joonsoo Lee, Yue Luo, Anuj Nanavati, Tyler Olsen, Sunyoung Park, Aashish Phansalkar, Geoff Prewett, Michael Ryoo, Daryl Shannon, and Mei Yang.

Ronald Kostoff (ONR) read an early version of the clustering chapter and offered numerous suggestions. George Karypis provided invaluable LaTeX assistance in creating an author index. Irene Moulitsas also provided assistance with LaTeX and reviewed some of the appendices. Musetta Steinbach was very helpful in finding errors in the figures.

We would like to acknowledge our colleagues at the University of Minnesota and Michigan State who have helped create a positive environment for data mining research. They include Arindam Banerjee, Dan Boley, Joyce Chai, Anil Jain, Ravi Janardan, Rong Jin, George Karypis, Claudia Neuhauser, Haesun Park, William F. Punch, György Simon, Shashi Shekhar, and Jaideep Srivastava. The collaborators on our many data mining projects, who also have our gratitude, include Ramesh Agrawal, Maneesh Bhargava, Steve Cannon, Alok Choudhary, Imme Ebert-Uphoff, Auroop Ganguly, Piet C. de Groen, Fran Hill, Yongdae Kim, Steve Klooster, Kerry Long, Nihar Mahapatra, Rama Nemani, Nikunj Oza, Chris Potter, Lisiane Pruinelli, Nagiza Samatova, Jonathan Shapiro, Kevin Silverstein, Brian Van Ness, Bonnie Westra, Nevin Young, and Zhi-Li Zhang.

The departments of Computer Science and Engineering at the University of Minnesota and Michigan State University provided computing resources and a supportive environment for this project. ARDA, ARL, ARO, DOE, NASA, NOAA, and NSF provided research support for Pang-Ning Tan, Michael Steinbach, Anuj Karpatne, and Vipin Kumar. In particular, Kamal Abdali, Mitra Basu, Dick Brackney, Jagdish Chandra, Joe Coughlan, Michael Coyle, Stephen Davis, Frederica Darema, Richard Hirsch, Chandrika Kamath, Tsengdar Lee, Raju Namburu, N. Radhakrishnan, James Sidoran, Sylvia Spengler, Bhavani Thuraisingham, Walt Tiernin, Maria Zemankova, Aidong Zhang, and Xiaodong Zhang have been supportive of our research in data mining and high-performance computing.

It was a pleasure working with the helpful staff at Pearson Education. In particular, we would like to thank Matt Goldstein, Kathy Smith, Carole Snyder, and Joyce Wells. We would also like to thank George Nichols, who helped with the artwork, and Paul Anagnostopoulos, who provided LaTeX support.

We are grateful to the following Pearson reviewers: Leman Akoglu (Carnegie Mellon University), Chien-Chung Chan (University of Akron), Zhengxin Chen (University of Nebraska at Omaha), Chris Clifton (Purdue University), Joydeep Ghosh (University of Texas, Austin), Nazli Goharian (Illinois Institute of Technology), J. Michael Hardin (University of Alabama), Jingrui He (Arizona State University), James Hearne (Western Washington University), Hillol Kargupta (University of Maryland, Baltimore County and Agnik, LLC), Eamonn Keogh (University of California-Riverside), Bing Liu (University of Illinois at Chicago), Mariofanna Milanova (University of Arkansas at Little Rock), Srinivasan Parthasarathy (Ohio State University), Zbigniew W. Ras (University of North Carolina at Charlotte), Xintao Wu (University of North Carolina at Charlotte), and Mohammed J. Zaki (Rensselaer Polytechnic Institute).

Over the years since the first edition, we have also received numerous comments from readers and students who have pointed out typos and various other issues. We are unable to mention these individuals by name, but their input is much appreciated and has been taken into account for the second edition.

Contents

Preface to the Second Edition

1 Introduction
1.1 What Is Data Mining?
1.2 Motivating Challenges
1.3 The Origins of Data Mining
1.4 Data Mining Tasks
1.5 Scope and Organization of the Book
1.6 Bibliographic Notes
1.7 Exercises

2 Data
2.1 Types of Data
2.1.1 Attributes and Measurement
2.1.2 Types of Data Sets
2.2 Data Quality
2.2.1 Measurement and Data Collection Issues
2.2.2 Issues Related to Applications
2.3 Data Preprocessing
2.3.1 Aggregation
2.3.2 Sampling
2.3.3 Dimensionality Reduction
2.3.4 Feature Subset Selection
2.3.5 Feature Creation
2.3.6 Discretization and Binarization
2.3.7 Variable Transformation
2.4 Measures of Similarity and Dissimilarity
2.4.1 Basics
2.4.2 Similarity and Dissimilarity between Simple Attributes
2.4.3 Dissimilarities between Data Objects
2.4.4 Similarities between Data Objects
2.4.5 Examples of Proximity Measures
2.4.6 Mutual Information
2.4.7 Kernel Functions*
2.4.8 Bregman Divergence*
2.4.9 Issues in Proximity Calculation
2.4.10 Selecting the Right Proximity Measure
2.5 Bibliographic Notes
2.6 Exercises

3 Classification: Basic Concepts and Techniques
3.1 Basic Concepts
3.2 General Framework for Classification
3.3 Decision Tree Classifier
3.3.1 A Basic Algorithm to Build a Decision Tree
3.3.2 Methods for Expressing Attribute Test Conditions
3.3.3 Measures for Selecting an Attribute Test Condition
3.3.4 Algorithm for Decision Tree Induction
3.3.5 Example Application: Web Robot Detection
3.3.6 Characteristics of Decision Tree Classifiers
3.4 Model Overfitting
3.4.1 Reasons for Model Overfitting
3.5 Model Selection
3.5.1 Using a Validation Set
3.5.2 Incorporating Model Complexity
3.5.3 Estimating Statistical Bounds
3.5.4 Model Selection for Decision Trees
3.6 Model Evaluation
3.6.1 Holdout Method
3.6.2 Cross-Validation
3.7 Presence of Hyper-parameters
3.7.1 Hyper-parameter Selection
3.7.2 Nested Cross-Validation
3.8 Pitfalls of Model Selection and Evaluation
3.8.1 Overlap between Training and Test Sets
3.8.2 Use of Validation Error as Generalization Error
3.9 Model Comparison
3.9.1 Estimating the Confidence Interval for Accuracy
3.9.2 Comparing the Performance of Two Models
3.10 Bibliographic Notes
3.11 Exercises

4 Classification: Alternative Techniques
4.1 Types of Classifiers
4.2 Rule-Based Classifier
4.2.1 How a Rule-Based Classifier Works
4.2.2 Properties of a Rule Set
4.2.3 Direct Methods for Rule Extraction
4.2.4 Indirect Methods for Rule Extraction
4.2.5 Characteristics of Rule-Based Classifiers
4.3 Nearest Neighbor Classifiers
4.3.1 Algorithm
4.3.2 Characteristics of Nearest Neighbor Classifiers
4.4 Naïve Bayes Classifier
4.4.1 Basics of Probability Theory
4.4.2 Naïve Bayes Assumption
4.5 Bayesian Networks
4.5.1 Graphical Representation
4.5.2 Inference and Learning
4.5.3 Characteristics of Bayesian Networks
4.6 Logistic Regression
4.6.1 Logistic Regression as a Generalized Linear Model
4.6.2 Learning Model Parameters
4.6.3 Characteristics of Logistic Regression
4.7 Artificial Neural Network (ANN)
4.7.1 Perceptron
4.7.2 Multi-layer Neural Network
4.7.3 Characteristics of ANN
4.8 Deep Learning
4.8.1 Using Synergistic Loss Functions
4.8.2 Using Responsive Activation Functions
4.8.3 Regularization
4.8.4 Initialization of Model Parameters
4.8.5 Characteristics of Deep Learning
4.9 Support Vector Machine (SVM)
4.9.1 Margin of a Separating Hyperplane
4.9.2 Linear SVM
4.9.3 Soft-margin SVM
4.9.4 Nonlinear SVM
4.9.5 Characteristics of SVM
4.10 Ensemble Methods
4.10.1 Rationale for Ensemble Method
4.10.2 Methods for Constructing an Ensemble Classifier
4.10.3 Bias-Variance Decomposition
4.10.4 Bagging
4.10.5 Boosting
4.10.6 Random Forests
4.10.7 Empirical Comparison among Ensemble Methods
4.11 Class Imbalance Problem
4.11.1 Building Classifiers with Class Imbalance
4.11.2 Evaluating Performance with Class Imbalance
4.11.3 Finding an Optimal Score Threshold
4.11.4 Aggregate Evaluation of Performance
4.12 Multiclass Problem
4.13 Bibliographic Notes
4.14 Exercises

5 Association Analysis: Basic Concepts and Algorithms
5.1 Preliminaries
5.2 Frequent Itemset Generation
5.2.1 The Apriori Principle
5.2.2 Frequent Itemset Generation in the Apriori Algorithm
5.2.3 Candidate Generation and Pruning
5.2.4 Support Counting
5.2.5 Computational Complexity
5.3 Rule Generation
5.3.1 Confidence-Based Pruning
5.3.2 Rule Generation in Apriori Algorithm
5.3.3 An Example: Congressional Voting Records
5.4 Compact Representation of Frequent Itemsets
5.4.1 Maximal Frequent Itemsets
5.4.2 Closed Itemsets
5.5 Alternative Methods for Generating Frequent Itemsets*
5.6 FP-Growth Algorithm*
5.6.1 FP-Tree Representation
5.6.2 Frequent Itemset Generation in FP-Growth Algorithm
5.7 Evaluation of Association Patterns
5.7.1 Objective Measures of Interestingness
5.7.2 Measures beyond Pairs of Binary Variables
5.7.3 Simpson's Paradox
5.8 Effect of Skewed Support Distribution
5.9 Bibliographic Notes
5.10 Exercises

6 Association Analysis: Advanced Concepts
6.1 Handling Categorical Attributes
6.2 Handling Continuous Attributes
6.2.1 Discretization-Based Methods
6.2.2 Statistics-Based Methods
6.2.3 Non-discretization Methods
6.3 Handling a Concept Hierarchy
6.4 Sequential Patterns
6.4.1 Preliminaries
6.4.2 Sequential Pattern Discovery
6.4.3 Timing Constraints
6.4.4 Alternative Counting Schemes
6.5 Subgraph Patterns
6.5.1 Preliminaries
6.5.2 Frequent Subgraph Mining
6.5.3 Candidate Generation
6.5.4 Candidate Pruning
6.5.5 Support Counting
6.6 Infrequent Patterns
6.6.1 Negative Patterns
6.6.2 Negatively Correlated Patterns
6.6.3 Comparisons among Infrequent Patterns, Negative Patterns, and Negatively Correlated Patterns
6.6.4 Techniques for Mining Interesting Infrequent Patterns
6.6.5 Techniques Based on Mining Negative Patterns
6.6.6 Techniques Based on Support Expectation
6.7 Bibliographic Notes
6.8 Exercises

7 Cluster Analysis: Basic Concepts and Algorithms
7.1 Overview
7.1.1 What Is Cluster Analysis?
7.1.2 Different Types of Clusterings
7.1.3 Different Types of Clusters
7.2 K-means
7.2.1 The Basic K-means Algorithm
7.2.2 K-means: Additional Issues
7.2.3 Bisecting K-means
7.2.4 K-means and Different Types of Clusters
7.2.5 Strengths and Weaknesses
7.2.6 K-means as an Optimization Problem
7.3 Agglomerative Hierarchical Clustering
7.3.1 Basic Agglomerative Hierarchical Clustering Algorithm
7.3.2 Specific Techniques
7.3.3 The Lance-Williams Formula for Cluster Proximity
7.3.4 Key Issues in Hierarchical Clustering
7.3.5 Outliers
7.3.6 Strengths and Weaknesses
7.4 DBSCAN
7.4.1 Traditional Density: Center-Based Approach
7.4.2 The DBSCAN Algorithm
7.4.3 Strengths and Weaknesses
7.5 Cluster Evaluation
7.5.1 Overview
7.5.2 Unsupervised Cluster Evaluation Using Cohesion and Separation
7.5.3 Unsupervised Cluster Evaluation Using the Proximity Matrix
7.5.4 Unsupervised Evaluation of Hierarchical Clustering
7.5.5 Determining the Correct Number of Clusters
7.5.6 Clustering Tendency
7.5.7 Supervised Measures of Cluster Validity
7.5.8 Assessing the Significance of Cluster Validity Measures
7.5.9 Choosing a Cluster Validity Measure
7.6 Bibliographic Notes
7.7 Exercises

8 Cluster Analysis: Additional Issues and Algorithms
8.1 Characteristics of Data, Clusters, and Clustering Algorithms
8.1.1 Example: Comparing K-means and DBSCAN
8.1.2 Data Characteristics
8.1.3 Cluster Characteristics
8.1.4 General Characteristics of Clustering Algorithms
8.2 Prototype-Based Clustering
8.2.1 Fuzzy Clustering
8.2.2 Clustering Using Mixture Models
8.2.3 Self-Organizing Maps (SOM)
8.3 Density-Based Clustering
8.3.1 Grid-Based Clustering
8.3.2 Subspace Clustering
8.3.3 DENCLUE: A Kernel-Based Scheme for Density-Based Clustering
8.4 Graph-Based Clustering
8.4.1 Sparsification
8.4.2 Minimum Spanning Tree (MST) Clustering
8.4.3 OPOSSUM: Optimal Partitioning of Sparse Similarities Using METIS
8.4.4 Chameleon: Hierarchical Clustering with Dynamic Modeling
8.4.5 Spectral Clustering
8.4.6 Shared Nearest Neighbor Similarity
8.4.7 The Jarvis-Patrick Clustering Algorithm
8.4.8 SNN Density
8.4.9 SNN Density-Based Clustering
8.5 Scalable Clustering Algorithms
8.5.1 Scalability: General Issues and Approaches
8.5.2 BIRCH
8.5.3 CURE
8.6 Which Clustering Algorithm?
8.7 Bibliographic Notes
8.8 Exercises

9 Anomaly Detection
9.1 Characteristics of Anomaly Detection Problems
9.1.1 A Definition of an Anomaly
9.1.2 Nature of Data
9.1.3 How Anomaly Detection is Used
9.2 Characteristics of Anomaly Detection Methods
9.3 Statistical Approaches
9.3.1 Using Parametric Models
9.3.2 Using Non-parametric Models
9.3.3 Modeling Normal and Anomalous Classes
9.3.4 Assessing Statistical Significance
9.3.5 Strengths and Weaknesses
9.4 Proximity-based Approaches
9.4.1 Distance-based Anomaly Score
9.4.2 Density-based Anomaly Score
9.4.3 Relative Density-based Anomaly Score
9.4.4 Strengths and Weaknesses
9.5 Clustering-based Approaches
9.5.1 Finding Anomalous Clusters
9.5.2 Finding Anomalous Instances
9.5.3 Strengths and Weaknesses
9.6 Reconstruction-based Approaches
9.6.1 Strengths and Weaknesses
9.7 One-class Classification
9.7.1 Use of Kernels
9.7.2 The Origin Trick
9.7.3 Strengths and Weaknesses
9.8 Information Theoretic Approaches
9.8.1 Strengths and Weaknesses
9.9 Evaluation of Anomaly Detection
9.10 Bibliographic Notes
9.11 Exercises

10 Avoiding False Discoveries
10.1 Preliminaries: Statistical Testing
10.1.1 Significance Testing
10.1.2 Hypothesis Testing
10.1.3 Multiple Hypothesis Testing
10.1.4 Pitfalls in Statistical Testing
10.2 Modeling Null and Alternative Distributions
10.2.1 Generating Synthetic Data Sets
10.2.2 Randomizing Class Labels
10.2.3 Resampling Instances
10.2.4 Modeling the Distribution of the Test Statistic
10.3 Statistical Testing for Classification
10.3.1 Evaluating Classification Performance
10.3.2 Binary Classification as Multiple Hypothesis Testing
10.3.3 Multiple Hypothesis Testing in Model Selection
10.4 Statistical Testing for Association Analysis
10.4.1 Using Statistical Models
10.4.2 Using Randomization Methods
10.5 Statistical Testing for Cluster Analysis
10.5.1 Generating a Null Distribution for Internal Indices
10.5.2 Generating a Null Distribution for External Indices
10.5.3 Enrichment
10.6 Statistical Testing for Anomaly Detection
10.7 Bibliographic Notes
10.8 Exercises

Author Index
Subject Index
Copyright Permissions

1 Introduction

Rapid advances in data collection and storage technology, coupled with the ease with which data can be generated and disseminated, have triggered the explosive growth of data, leading to the current age of big data. Deriving actionable insights from these large data sets is increasingly important in decision making across almost all areas of society, including business and industry; science and engineering; medicine and biotechnology; and government and individuals. However, the amount of data (volume), its complexity (variety), and the rate at which it is being collected and processed (velocity) have simply become too great for humans to analyze unaided. Thus, there is a great need for automated tools for extracting useful information from the big data despite the challenges posed by its enormity and diversity.

Data mining blends traditional data analysis methods with sophisticated algorithms for processing this abundance of data. In this introductory chapter, we present an overview of data mining and outline the key topics to be covered in this book. We start with a description of some applications that require more advanced techniques for data analysis.

Business and Industry

Point-of-sale data collection (bar code scanners, radio frequency identification (RFID), and smart card technology) has allowed retailers to collect up-to-the-minute data about customer purchases at the checkout counters of their stores. Retailers can utilize this information, along with other business-critical data, such as web server logs from e-commerce websites and customer service records from call centers, to help them better understand the needs of their customers and make more informed business decisions.

Data mining techniques can be used to support a wide range of business intelligence applications, such as customer profiling, targeted marketing, workflow management, store layout, fraud detection, and automated buying and selling. An example of the last application is high-speed stock trading, where decisions on buying and selling have to be made in less than a second using data about financial transactions. Data mining can also help retailers answer important business questions, such as "Who are the most profitable customers?" "What products can be cross-sold or up-sold?" and "What is the revenue outlook of the company for next year?" These questions have inspired the development of such data mining techniques as association analysis (Chapters 5 and 6).

As the Internet continues to revolutionize the way we interact and make decisions in our everyday lives, we are generating massive amounts of data about our online experiences, e.g., web browsing, messaging, and posting on social networking websites. This has opened several opportunities for business applications that use web data. For example, in the e-commerce sector, data about our online viewing or shopping preferences can be used to provide personalized recommendations of products. Data mining also plays a prominent role in supporting several other Internet-based services, such as filtering spam messages, answering search queries, and suggesting social updates and connections. The large corpus of text, images, and videos available on the Internet has enabled a number of advancements in data mining methods, including deep learning, which is discussed in Chapter 4. These developments have led to great advances in a number of applications, such as object recognition, natural language translation, and autonomous driving.

Another domain that has undergone a rapid big data transformation is the use of mobile sensors and devices, such as smartphones and wearable computing devices. With better sensor technologies, it has become possible to collect a variety of information about our physical world using low-cost sensors embedded in everyday objects that are connected to each other, termed the Internet of Things (IoT). This deep integration of physical sensors in digital systems is beginning to generate large amounts of diverse and distributed data about our environment, which can be used for designing convenient, safe, and energy-efficient home systems, as well as for urban planning of smart cities.

Medicine, Science, and Engineering

Researchers in medicine, science, and engineering are rapidly accumulating data that is key to significant new discoveries. For example, as an important step toward improving our understanding of the Earth's climate system, NASA has deployed a series of Earth-orbiting satellites that continuously generate global observations of the land surface, oceans, and atmosphere. However, because of the size and spatio-temporal nature of the data, traditional methods are often not suitable for analyzing these data sets. Techniques developed in data mining can aid Earth scientists in answering questions such as the following: "What is the relationship between the frequency and intensity of ecosystem disturbances, such as droughts and hurricanes, and global warming?" "How are land surface precipitation and temperature affected by ocean surface temperature?" and "How well can we predict the beginning and end of the growing season for a region?"

As another example, researchers in molecular biology hope to use the large amounts of genomic data to better understand the structure and function of genes. In the past, traditional methods in molecular biology allowed scientists to study only a few genes at a time in a given experiment. Recent breakthroughs in microarray technology have enabled scientists to compare the behavior of thousands of genes under various situations. Such comparisons can help determine the function of each gene, and perhaps isolate the genes responsible for certain diseases. However, the noisy, high-dimensional nature of the data requires new data analysis methods. In addition to analyzing gene expression data, data mining can also be used to address other important biological challenges such as protein structure prediction, multiple sequence alignment, the modeling of biochemical pathways, and phylogenetics.

Another example is the use of data mining techniques to analyze electronic health record (EHR) data, which has become increasingly available. Not very long ago, studies of patients required manually examining the physical records of individual patients and extracting very specific pieces of information pertinent to the particular question being investigated. EHRs allow for a faster and broader exploration of such data. However, there are significant challenges since the observations on any one patient typically occur during their visits to a doctor or hospital, and only a small number of details about the health of the patient are measured during any particular visit.

Currently, EHR analysis focuses on simple types of data, e.g., a patient's blood pressure or the diagnosis code of a disease. However, large amounts of more complex types of medical data are also being collected, such as electrocardiograms (ECGs) and neuroimages from magnetic resonance imaging (MRI) or functional Magnetic Resonance Imaging (fMRI). Although challenging to analyze, this data also provides vital information about patients. Integrating and analyzing such data, along with traditional EHR and genomic data, is one of the capabilities needed to enable precision medicine, which aims to provide more personalized patient care.

1.1 What Is Data Mining?

Data mining is the process of automatically discovering useful information in large data repositories. Data mining techniques are deployed to scour large data sets in order to find novel and useful patterns that might otherwise remain unknown. They also provide the capability to predict the outcome of a future observation, such as the amount a customer will spend at an online or a brick-and-mortar store.

Not all information discovery tasks are considered to be data mining. Examples include queries, e.g., looking up individual records in a database or finding web pages that contain a particular set of keywords. This is because such tasks can be accomplished through simple interactions with a database management system or an information retrieval system. These systems rely on traditional computer science techniques, which include sophisticated indexing structures and query processing algorithms, for efficiently organizing and retrieving information from large data repositories. Nonetheless, data mining techniques have been used to enhance the performance of such systems by improving the quality of the search results based on their relevance to the input queries.

Data Mining and Knowledge Discovery in Databases

Data mining is an integral part of knowledge discovery in databases (KDD), which is the overall process of converting raw data into useful information, as shown in Figure 1.1. This process consists of a series of steps, from data preprocessing to postprocessing of data mining results.

Figure 1.1. The process of knowledge discovery in databases (KDD).

The input data can be stored in a variety of formats (flat files, spreadsheets, or relational tables) and may reside in a centralized data repository or be distributed across multiple sites. The purpose of preprocessing is to transform the raw input data into an appropriate format for subsequent analysis. The steps involved in data preprocessing include fusing data from multiple sources, cleaning data to remove noise and duplicate observations, and selecting records and features that are relevant to the data mining task at hand. Because of the many ways data can be collected and stored, data preprocessing is perhaps the most laborious and time-consuming step in the overall knowledge discovery process.
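As a small illustration of these preprocessing steps, the Python sketch below fuses two sources, removes duplicate and noisy records, and keeps only the relevant features. It is a minimal sketch under our own assumptions: the file names, column names, and the use of the pandas library are illustrative, not part of the book's discussion.

```python
import pandas as pd

# Fuse data from two hypothetical sources on a shared customer ID.
transactions = pd.read_csv("transactions.csv")   # assumed input file
demographics = pd.read_csv("demographics.csv")   # assumed input file
data = transactions.merge(demographics, on="customer_id", how="inner")

# Clean: drop exact duplicate observations and records with missing amounts.
data = data.drop_duplicates()
data = data.dropna(subset=["amount"])

# Remove obviously noisy records (e.g., negative purchase amounts).
data = data[data["amount"] >= 0]

# Select only the features relevant to the mining task at hand.
data = data[["customer_id", "age", "amount", "store_id"]]
```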

"Closing the loop" is a phrase often used to refer to the process of integrating data mining results into decision support systems. For example, in business applications, the insights offered by data mining results can be integrated with campaign management tools so that effective marketing promotions can be conducted and tested. Such integration requires a postprocessing step to ensure that only valid and useful results are incorporated into the decision support system. An example of postprocessing is visualization, which allows analysts to explore the data and the data mining results from a variety of viewpoints. Hypothesis testing methods can also be applied during postprocessing to eliminate spurious data mining results. (See Chapter 10.)

1.2 Motivating Challenges

As mentioned earlier, traditional data analysis techniques have often encountered practical difficulties in meeting the challenges posed by big data applications. The following are some of the specific challenges that motivated the development of data mining.

Scalability

Because of advances in data generation and collection, data sets with sizes of terabytes, petabytes, or even exabytes are becoming common. If data mining algorithms are to handle these massive data sets, they must be scalable. Many data mining algorithms employ special search strategies to handle exponential search problems. Scalability may also require the implementation of novel data structures to access individual records in an efficient manner. For instance, out-of-core algorithms may be necessary when processing data sets that cannot fit into main memory. Scalability can also be improved by using sampling or developing parallel and distributed algorithms. A general overview of techniques for scaling up data mining algorithms is given in Appendix F.
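As a rough sketch of the out-of-core idea mentioned above, the snippet below streams a file that is too large for main memory in fixed-size chunks and accumulates a running statistic instead of loading everything at once. The file name, column name, and use of pandas are our own assumptions for illustration.

```python
import pandas as pd

# Out-of-core style processing: read a large file in chunks rather than
# loading it into main memory all at once.
total, count = 0.0, 0
for chunk in pd.read_csv("huge_transactions.csv", chunksize=1_000_000):
    total += chunk["amount"].sum()   # accumulate a running statistic per chunk
    count += len(chunk)

print("mean amount:", total / count)
```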

High Dimensionality

It is now common to encounter data sets with hundreds or thousands of attributes instead of the handful common a few decades ago. In bioinformatics, progress in microarray technology has produced gene expression data involving thousands of features. Data sets with temporal or spatial components also tend to have high dimensionality. For example, consider a data set that contains measurements of temperature at various locations. If the temperature measurements are taken repeatedly for an extended period, the number of dimensions (features) increases in proportion to the number of measurements taken. Traditional data analysis techniques that were developed for low-dimensional data often do not work well for such high-dimensional data due to issues such as the curse of dimensionality (to be discussed in Chapter 2). Also, for some data analysis algorithms, the computational complexity increases rapidly as the dimensionality (the number of features) increases.

Heterogeneous and Complex Data

Traditional data analysis methods often deal with data sets containing attributes of the same type, either continuous or categorical. As the role of data mining in business, science, medicine, and other fields has grown, so has the need for techniques that can handle heterogeneous attributes. Recent years have also seen the emergence of more complex data objects. Examples of such non-traditional types of data include web and social media data containing text, hyperlinks, images, audio, and videos; DNA data with sequential and three-dimensional structure; and climate data that consists of measurements (temperature, pressure, etc.) at various times and locations on the Earth's surface. Techniques developed for mining such complex objects should take into consideration relationships in the data, such as temporal and spatial autocorrelation, graph connectivity, and parent-child relationships between the elements in semi-structured text and XML documents.

Data Ownership and Distribution

Sometimes, the data needed for an analysis is not stored in one location or owned by one organization. Instead, the data is geographically distributed among resources belonging to multiple entities. This requires the development of distributed data mining techniques. The key challenges faced by distributed data mining algorithms include the following: (1) how to reduce the amount of communication needed to perform the distributed computation, (2) how to effectively consolidate the data mining results obtained from multiple sources, and (3) how to address data security and privacy issues.

Non-traditional Analysis

The traditional statistical approach is based on a hypothesize-and-test paradigm. In other words, a hypothesis is proposed, an experiment is designed to gather the data, and then the data is analyzed with respect to the hypothesis. Unfortunately, this process is extremely labor-intensive. Current data analysis tasks often require the generation and evaluation of thousands of hypotheses, and consequently, the development of some data mining techniques has been motivated by the desire to automate the process of hypothesis generation and evaluation. Furthermore, the data sets analyzed in data mining are typically not the result of a carefully designed experiment and often represent opportunistic samples of the data, rather than random samples.

1.3 The Origins of Data Mining

While data mining has traditionally been viewed as an intermediate process within the KDD framework, as shown in Figure 1.1, it has emerged over the years as an academic field within computer science, focusing on all aspects of KDD, including data preprocessing, mining, and postprocessing. Its origin can be traced back to the late 1980s, following a series of workshops organized on the topic of knowledge discovery in databases. The workshops brought together researchers from different disciplines to discuss the challenges and opportunities in applying computational techniques to extract actionable knowledge from large databases. The workshops quickly grew into hugely popular conferences that were attended by researchers and practitioners from both academia and industry. The success of these conferences, along with the interest shown by businesses and industry in recruiting new hires with a data mining background, has fueled the tremendous growth of this field.

The field was initially built upon the methodology and algorithms that researchers had previously used. In particular, data mining researchers draw upon ideas such as (1) sampling, estimation, and hypothesis testing from statistics, and (2) search algorithms, modeling techniques, and learning theories from artificial intelligence, pattern recognition, and machine learning. Data mining has also been quick to adopt ideas from other areas, including optimization, evolutionary computing, information theory, signal processing, visualization, and information retrieval, and to extend them to solve the challenges of mining big data.

A number of other areas also play key supporting roles. In particular, database systems are needed to provide support for efficient storage, indexing, and query processing. Techniques from high performance (parallel) computing are often important in addressing the massive size of some data sets. Distributed techniques can also help address the issue of size and are essential when the data cannot be gathered in one location. Figure 1.2 shows the relationship of data mining to other areas.

Figure 1.2. Data mining as a confluence of many disciplines.

Data Science and Data-Driven Discovery

Data science is an interdisciplinary field that studies and applies tools and techniques for deriving useful insights from data. Although data science is regarded as an emerging field with a distinct identity of its own, the tools and techniques often come from many different areas of data analysis, such as data mining, statistics, AI, machine learning, pattern recognition, database technology, and distributed and parallel computing. (See Figure 1.2.)

The emergence of data science as a new field is a recognition that, often, none of the existing areas of data analysis provides a complete set of tools for the data analysis tasks that are often encountered in emerging applications. Instead, a broad range of computational, mathematical, and statistical skills is often required. To illustrate the challenges that arise in analyzing such data, consider the following example. Social media and the Web present new opportunities for social scientists to observe and quantitatively measure human behavior on a large scale. To conduct such a study, social scientists work with analysts who possess skills in areas such as web mining, natural language processing (NLP), network analysis, data mining, and statistics. Compared to more traditional research in social science, which is often based on surveys, this analysis requires a broader range of skills and tools, and involves far larger amounts of data. Thus, data science is, by necessity, a highly interdisciplinary field that builds on the continuing work of many fields.

The data-driven approach of data science emphasizes the direct discovery of patterns and relationships from data, especially in large quantities of data, often without the need for extensive domain knowledge. A notable example of the success of this approach is represented by advances in neural networks, i.e., deep learning, which have been particularly successful in areas that have long proved challenging, e.g., recognizing objects in photos or videos and words in speech, as well as in other application areas. However, note that this is just one example of the success of data-driven approaches, and dramatic improvements have also occurred in many other areas of data analysis. Many of these developments are topics described later in this book.

Some cautions on potential limitations of a purely data-driven approach are given in the Bibliographic Notes.

1.4 Data Mining Tasks

Data mining tasks are generally divided into two major categories:

Predictive tasks. The objective of these tasks is to predict the value of a particular attribute based on the values of other attributes. The attribute to be predicted is commonly known as the target or dependent variable, while the attributes used for making the prediction are known as the explanatory or independent variables.

Descriptive tasks. Here, the objective is to derive patterns (correlations, trends, clusters, trajectories, and anomalies) that summarize the underlying relationships in data. Descriptive data mining tasks are often exploratory in nature and frequently require postprocessing techniques to validate and explain the results.

Figure 1.3 illustrates four of the core data mining tasks that are described in the remainder of this book.

Figure 1.3. Four of the core data mining tasks.

Predictive modeling refers to the task of building a model for the target variable as a function of the explanatory variables. There are two types of predictive modeling tasks: classification, which is used for discrete target variables, and regression, which is used for continuous target variables. For example, predicting whether a web user will make a purchase at an online bookstore is a classification task because the target variable is binary-valued. On the other hand, forecasting the future price of a stock is a regression task because price is a continuous-valued attribute. The goal of both tasks is to learn a model that minimizes the error between the predicted and true values of the target variable. Predictive modeling can be used to identify customers who will respond to a marketing campaign, predict disturbances in the Earth's ecosystem, or judge whether a patient has a particular disease based on the results of medical tests.

Example 1.1 (Predicting the Type of a Flower). Consider the task of predicting a species of flower based on the characteristics of the flower. In particular, consider classifying an Iris flower as one of the following three Iris species: Setosa, Versicolour, or Virginica. To perform this task, we need a data set containing the characteristics of various flowers of these three species. A data set with this type of information is the well-known Iris data set from the UCI Machine Learning Repository at http://www.ics.uci.edu/~mlearn. In addition to the species of a flower, this data set contains four other attributes: sepal width, sepal length, petal length, and petal width. Figure 1.4 shows a plot of petal width versus petal length for the 150 flowers in the Iris data set. Petal width is broken into the categories low, medium, and high, which correspond to the intervals [0, 0.75), [0.75, 1.75), [1.75, ∞), respectively. Also, petal length is broken into the categories low, medium, and high, which correspond to the intervals [0, 2.5), [2.5, 5), [5, ∞), respectively. Based on these categories of petal width and length, the following rules can be derived:

Petal width low and petal length low implies Setosa.
Petal width medium and petal length medium implies Versicolour.
Petal width high and petal length high implies Virginica.

While these rules do not classify all the flowers, they do a good (but not perfect) job of classifying most of the flowers. Note that flowers from the Setosa species are well separated from the Versicolour and Virginica species with respect to petal width and length, but the latter two species overlap somewhat with respect to these attributes.

Figure 1.4. Petal width versus petal length for 150 Iris flowers.
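To make the rules in Example 1.1 concrete, the following minimal Python sketch applies the same three rules; the discretization thresholds come from the example, but the function names and the tiny sample measurements are illustrative only.

```python
def categorize(value, cuts):
    """Map a measurement to 'low', 'medium', or 'high' given two cut points."""
    low_cut, high_cut = cuts
    if value < low_cut:
        return "low"
    elif value < high_cut:
        return "medium"
    return "high"

def classify_iris(petal_width, petal_length):
    """Apply the three rules from Example 1.1; return None when no rule fires."""
    width = categorize(petal_width, (0.75, 1.75))   # intervals from the example
    length = categorize(petal_length, (2.5, 5.0))
    if width == "low" and length == "low":
        return "Setosa"
    if width == "medium" and length == "medium":
        return "Versicolour"
    if width == "high" and length == "high":
        return "Virginica"
    return None  # the rules do not cover every flower

# Illustrative measurements (petal width, petal length) in centimeters.
for pw, pl in [(0.2, 1.4), (1.3, 4.5), (2.1, 5.8), (1.8, 4.9)]:
    print((pw, pl), "->", classify_iris(pw, pl))
```

The last pair illustrates the caveat in the example: a flower whose petal width is high but whose petal length is only medium matches none of the rules.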

Association analysis is used to discover patterns that describe strongly associated features in the data. The discovered patterns are typically represented in the form of implication rules or feature subsets. Because of the exponential size of its search space, the goal of association analysis is to extract the most interesting patterns in an efficient manner. Useful applications of association analysis include finding groups of genes that have related functionality, identifying web pages that are accessed together, or understanding the relationships between different elements of Earth's climate system.

Example 1.2 (Market Basket Analysis). The transactions shown in Table 1.1 illustrate point-of-sale data collected at the checkout counters of a grocery store. Association analysis can be applied to find items that are frequently bought together by customers. For example, we may discover the rule {Diapers} → {Milk}, which suggests that customers who buy diapers also tend to buy milk. This type of rule can be used to identify potential cross-selling opportunities among related items.

Table 1.1. Market basket data.

Transaction ID   Items
1    {Bread, Butter, Diapers, Milk}
2    {Coffee, Sugar, Cookies, Salmon}
3    {Bread, Butter, Coffee, Diapers, Milk, Eggs}
4    {Bread, Butter, Salmon, Chicken}
5    {Eggs, Bread, Butter}
6    {Salmon, Diapers, Milk}
7    {Bread, Tea, Sugar, Eggs}
8    {Coffee, Sugar, Chicken, Eggs}
9    {Bread, Diapers, Milk, Salt}
10   {Tea, Eggs, Cookies, Diapers, Milk}
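As a rough sketch of how such a rule could be scored, the Python snippet below computes the support and confidence of {Diapers} → {Milk} directly from Table 1.1; the helper function is our own, not an algorithm from the book.

```python
# Transactions from Table 1.1, written as Python sets.
transactions = [
    {"Bread", "Butter", "Diapers", "Milk"},
    {"Coffee", "Sugar", "Cookies", "Salmon"},
    {"Bread", "Butter", "Coffee", "Diapers", "Milk", "Eggs"},
    {"Bread", "Butter", "Salmon", "Chicken"},
    {"Eggs", "Bread", "Butter"},
    {"Salmon", "Diapers", "Milk"},
    {"Bread", "Tea", "Sugar", "Eggs"},
    {"Coffee", "Sugar", "Chicken", "Eggs"},
    {"Bread", "Diapers", "Milk", "Salt"},
    {"Tea", "Eggs", "Cookies", "Diapers", "Milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Evaluate the rule {Diapers} -> {Milk}.
antecedent, consequent = {"Diapers"}, {"Milk"}
rule_support = support(antecedent | consequent)    # 5/10 = 0.5
confidence = rule_support / support(antecedent)    # 5/5  = 1.0
print(f"support = {rule_support:.2f}, confidence = {confidence:.2f}")
```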

Cluster analysis seeks to find groups of closely related observations so that observations that belong to the same cluster are more similar to each other than observations that belong to other clusters. Clustering has been used to group sets of related customers, find areas of the ocean that have a significant impact on the Earth's climate, and compress data.

Example 1.3 (Document Clustering). The collection of news articles shown in Table 1.2 can be grouped based on their respective topics. Each article is represented as a set of word-frequency pairs (w : c), where w is a word and c is the number of times the word appears in the article. There are two natural clusters in the data set. The first cluster consists of the first four articles, which correspond to news about the economy, while the second cluster contains the last four articles, which correspond to news about health care. A good clustering algorithm should be able to identify these two clusters based on the similarity between words that appear in the articles.

Table 1.2. Collection of news articles.

Article   Word-frequency pairs
1   dollar: 1, industry: 4, country: 2, loan: 3, deal: 2, government: 2
2   machinery: 2, labor: 3, market: 4, industry: 2, work: 3, country: 1
3   job: 5, inflation: 3, rise: 2, jobless: 2, market: 3, country: 2, index: 3
4   domestic: 3, forecast: 2, gain: 1, market: 2, sale: 3, price: 2
5   patient: 4, symptom: 2, drug: 3, health: 2, clinic: 2, doctor: 2
6   pharmaceutical: 2, company: 3, drug: 2, vaccine: 1, flu: 3
7   death: 2, cancer: 4, drug: 3, public: 4, health: 3, director: 2
8   medical: 2, cost: 3, increase: 2, patient: 2, health: 3, care: 1
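A minimal sketch of this idea, assuming the scikit-learn library is available: the articles from Table 1.2 are turned into word-count vectors and grouped into two clusters with K-means. K-means is a specific algorithm choice on our part; the example itself does not prescribe one.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.cluster import KMeans

# Word-frequency representation of the eight articles in Table 1.2.
articles = [
    {"dollar": 1, "industry": 4, "country": 2, "loan": 3, "deal": 2, "government": 2},
    {"machinery": 2, "labor": 3, "market": 4, "industry": 2, "work": 3, "country": 1},
    {"job": 5, "inflation": 3, "rise": 2, "jobless": 2, "market": 3, "country": 2, "index": 3},
    {"domestic": 3, "forecast": 2, "gain": 1, "market": 2, "sale": 3, "price": 2},
    {"patient": 4, "symptom": 2, "drug": 3, "health": 2, "clinic": 2, "doctor": 2},
    {"pharmaceutical": 2, "company": 3, "drug": 2, "vaccine": 1, "flu": 3},
    {"death": 2, "cancer": 4, "drug": 3, "public": 4, "health": 3, "director": 2},
    {"medical": 2, "cost": 3, "increase": 2, "patient": 2, "health": 3, "care": 1},
]

# Convert the sparse word counts into a document-term matrix.
X = DictVectorizer(sparse=False).fit_transform(articles)

# Group the articles into two clusters.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # articles 1-4 and 5-8 should tend to fall into separate clusters
```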

Anomaly detection is the task of identifying observations whose characteristics are significantly different from the rest of the data. Such observations are known as anomalies or outliers. The goal of an anomaly detection algorithm is to discover the real anomalies and avoid falsely labeling normal objects as anomalous. In other words, a good anomaly detector must have a high detection rate and a low false alarm rate. Applications of anomaly detection include the detection of fraud, network intrusions, unusual patterns of disease, and ecosystem disturbances, such as droughts, floods, fires, hurricanes, etc.

Example 1.4 (Credit Card Fraud Detection). A credit card company records the transactions made by every credit card holder, along with personal information such as credit limit, age, annual income, and address. Since the number of fraudulent cases is relatively small compared to the number of legitimate transactions, anomaly detection techniques can be applied to build a profile of legitimate transactions for the users. When a new transaction arrives, it is compared against the profile of the user. If the characteristics of the transaction are very different from the previously created profile, then the transaction is flagged as potentially fraudulent.
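One very simple way to realize such a profile, sketched below under our own assumptions (a per-user history of transaction amounts and a z-score threshold), is to flag a new amount as suspicious when it lies far from the user's historical mean; the book's chapter on anomaly detection covers much richer approaches.

```python
import statistics

def build_profile(amounts):
    """Summarize a user's legitimate transaction amounts by mean and spread."""
    return statistics.mean(amounts), statistics.stdev(amounts)

def is_suspicious(amount, profile, threshold=3.0):
    """Flag a transaction whose amount is many standard deviations from the profile."""
    mean, stdev = profile
    return abs(amount - mean) / stdev > threshold

history = [24.0, 31.5, 18.2, 42.0, 27.8, 35.1, 22.4]  # illustrative past amounts
profile = build_profile(history)
for new_amount in [29.0, 480.0]:
    print(new_amount, "suspicious?", is_suspicious(new_amount, profile))
```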

1.5 Scope and Organization of the Book

This book introduces the major principles and techniques used in data mining from an algorithmic perspective. A study of these principles and techniques is essential for developing a better understanding of how data mining technology can be applied to various kinds of data. This book also serves as a starting point for readers who are interested in doing research in this field.

We begin the technical discussion of this book with a chapter on data (Chapter 2), which discusses the basic types of data, data quality, preprocessing techniques, and measures of similarity and dissimilarity. Although this material can be covered quickly, it provides an essential foundation for data analysis. Chapters 3 and 4 cover classification. Chapter 3 provides a foundation by discussing decision tree classifiers and several issues that are important to all classification: overfitting, underfitting, model selection, and performance evaluation. Using this foundation, Chapter 4 describes a number of other important classification techniques: rule-based systems, nearest neighbor classifiers, Bayesian classifiers, artificial neural networks, including deep learning, support vector machines, and ensemble classifiers, which are collections of classifiers. The multiclass and imbalanced class problems are also discussed. These topics can be covered independently.

Association analysis is explored in Chapters 5 and 6. Chapter 5 describes the basics of association analysis: frequent itemsets, association rules, and some of the algorithms used to generate them. Specific types of frequent itemsets (maximal, closed, and hyperclique) that are important for data mining are also discussed, and the chapter concludes with a discussion of evaluation measures for association analysis. Chapter 6 considers a variety of more advanced topics, including how association analysis can be applied to categorical and continuous data or to data that has a concept hierarchy. (A concept hierarchy is a hierarchical categorization of objects, e.g., store items → clothing → shoes → sneakers.) This chapter also describes how association analysis can be extended to find sequential patterns (patterns involving order), patterns in graphs, and negative relationships (if one item is present, then the other is not).

Cluster analysis is discussed in Chapters 7 and 8. Chapter 7 first describes the different types of clusters, and then presents three specific clustering techniques: K-means, agglomerative hierarchical clustering, and DBSCAN. This is followed by a discussion of techniques for validating the results of a clustering algorithm. Additional clustering concepts and techniques are explored in Chapter 8, including fuzzy and probabilistic clustering, Self-Organizing Maps (SOM), graph-based clustering, spectral clustering, and density-based clustering. There is also a discussion of scalability issues and factors to consider when selecting a clustering algorithm.

Chapter 9 is on anomaly detection. After some basic definitions, several different types of anomaly detection are considered: statistical, distance-based, density-based, clustering-based, reconstruction-based, one-class classification, and information theoretic. The last chapter, Chapter 10, supplements the discussions in the other chapters with a discussion of the statistical concepts important for avoiding spurious results, and then discusses those concepts in the context of data mining techniques studied in the previous chapters. These techniques include statistical hypothesis testing, p-values, the false discovery rate, and permutation testing. Appendices A through F give a brief review of important topics that are used in portions of the book: linear algebra, dimensionality reduction, statistics, regression, optimization, and scaling up data mining techniques for big data.

The subject of data mining, while relatively young compared to statistics or machine learning, is already too large to cover in a single book. Selected references to topics that are only briefly covered, such as data quality, are provided in the Bibliographic Notes section of the appropriate chapter. References to topics not covered in this book, such as mining streaming data and privacy-preserving data mining, are provided in the Bibliographic Notes of this chapter.

1.6 Bibliographic Notes

The topic of data mining has inspired many textbooks. Introductory textbooks include those by Dunham [16], Han et al. [29], Hand et al. [31], Roiger and Geatz [50], Zaki and Meira [61], and Aggarwal [2]. Data mining books with a stronger emphasis on business applications include the works by Berry and Linoff [5], Pyle [47], and Parr Rud [45]. Books with an emphasis on statistical learning include those by Cherkassky and Mulier [11], and Hastie et al. [32]. Similar books with an emphasis on machine learning or pattern recognition are those by Duda et al. [15], Kantardzic [34], Mitchell [43], Webb [57], and Witten and Frank [58]. There are also some more specialized books: Chakrabarti [9] (web mining), Fayyad et al. [20] (collection of early articles on data mining), Fayyad et al. [18] (visualization), Grossman et al. [25] (science and engineering), Kargupta and Chan [35] (distributed data mining), Wang et al. [56] (bioinformatics), and Zaki and Ho [60] (parallel data mining).

There are several conferences related to data mining. Some of the main conferences dedicated to this field include the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), the IEEE International Conference on Data Mining (ICDM), the SIAM International Conference on Data Mining (SDM), the European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), and the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD). Data mining papers can also be found in other major conferences such as the Conference and Workshop on Neural Information Processing Systems (NIPS), the International Conference on Machine Learning (ICML), the ACM SIGMOD/PODS conference, the International Conference on Very Large Data Bases (VLDB), the Conference on Information and Knowledge Management (CIKM), the International Conference on Data Engineering (ICDE), the National Conference on Artificial Intelligence (AAAI), the IEEE International Conference on Big Data, and the IEEE International Conference on Data Science and Advanced Analytics (DSAA).

Journal publications on data mining include IEEE Transactions on Knowledge and Data Engineering, Data Mining and Knowledge Discovery, Knowledge and Information Systems, ACM Transactions on Knowledge Discovery from Data, Statistical Analysis and Data Mining, and Information Systems. There is a variety of open-source data mining software available, including Weka [27] and Scikit-learn [46]. More recently, data mining software such as Apache Mahout and Apache Spark has been developed for large-scale problems on the distributed computing platform.

There have been a number of general articles on data mining that define the field or its relationship to other fields, particularly statistics. Fayyad et al. [19] describe data mining and how it fits into the total knowledge discovery process. Chen et al. [10] give a database perspective on data mining. Ramakrishnan and Grama [48] provide a general discussion of data mining and present several viewpoints. Hand [30] describes how data mining differs from statistics, as does Friedman [21]. Lambert [40] explores the use of statistics for large data sets and provides some comments on the respective roles of data mining and statistics. Glymour et al. [23] consider the lessons that statistics may have for data mining. Smyth et al. [53] describe how the evolution of data mining is being driven by new types of data and applications, such as those involving streams, graphs, and text. Han et al. [28] consider emerging applications in data mining and Smyth [52] describes some research challenges in data mining. Wu et al. [59] discuss how developments in data mining research can be turned into practical tools. Data mining standards are the subject of a paper by Grossman et al. [24]. Bradley [7] discusses how data mining algorithms can be scaled to large data sets.

The emergence of new data mining applications has produced new challenges that need to be addressed. For instance, concerns about privacy breaches as a result of data mining have escalated in recent years, particularly in application domains such as web commerce and health care. As a result, there is growing interest in developing data mining algorithms that maintain user privacy. Developing techniques for mining encrypted or randomized data is known as privacy-preserving data mining. Some general references in this area include papers by Agrawal and Srikant [3], Clifton et al. [12] and Kargupta et al. [36]. Vassilios et al. [55] provide a survey. Another area of concern is the bias in predictive models that may be used for some applications, e.g., screening job applicants or deciding prison parole [39]. Assessing whether such applications are producing biased results is made more difficult by the fact that the predictive models used for such applications are often black box models, i.e., models that are not interpretable in any straightforward way.

Data science, its constituent fields, and more generally, the new paradigm of knowledge discovery they represent [33], have great potential, some of which has been realized. However, it is important to emphasize that data science works mostly with observational data, i.e., data that was collected by various organizations as part of their normal operation. The consequence of this is that sampling biases are common and the determination of causal factors becomes more problematic. For this and a number of other reasons, it is often hard to interpret the predictive models built from this data [42, 49]. Thus, theory, experimentation and computational simulations will continue to be the methods of choice in many areas, especially those related to science.

More importantly, a purely data-driven approach often ignores the existing knowledge in a particular field. Such models may perform poorly, for example, predicting impossible outcomes or failing to generalize to new situations. However, if the model does work well, e.g., has high predictive accuracy, then this approach may be sufficient for practical purposes in some fields. But in many areas, such as medicine and science, gaining insight into the underlying domain is often the goal. Some recent work attempts to address these issues in order to create theory-guided data science, which takes pre-existing domain knowledge into account [17, 37].

Recent years have witnessed a growing number of applications that rapidly generate continuous streams of data. Examples of stream data include network traffic, multimedia streams, and stock prices. Several issues must be considered when mining data streams, such as the limited amount of memory available, the need for online analysis, and the change of the data over time. Data mining for stream data has become an important area in data mining. Some selected publications are Domingos and Hulten [14] (classification), Giannella et al. [22] (association analysis), Guha et al. [26] (clustering), Kifer et al. [38] (change detection), Papadimitriou et al. [44] (time series), and Law et al. [41] (dimensionality reduction).

Another area of interest is recommender and collaborative filtering systems [1, 6, 8, 13, 54], which suggest movies, television shows, books, products, etc. that a person might like. In many cases, this problem, or at least a component of it, is treated as a prediction problem and thus, data mining techniques can be applied [4, 51].

Bibliography

[1] G. Adomavicius and A. Tuzhilin. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering, 17(6):734–749, 2005.

[2] C. Aggarwal. Data Mining: The Textbook. Springer, 2015.

[3]R.AgrawalandR.Srikant.Privacy-preservingdatamining.InProc.of2000ACMSIGMODIntl.Conf.onManagementofData,pages439–450,Dallas,Texas,2000.ACMPress.

[4]X.AmatriainandJ.M.Pujol.Dataminingmethodsforrecommendersystems.InRecommenderSystemsHandbook,pages227–262.Springer,2015.

[5]M.J.A.BerryandG.Linoff.DataMiningTechniques:ForMarketing,Sales,andCustomerRelationshipManagement.WileyComputerPublishing,2ndedition,2004.

[6]J.Bobadilla,F.Ortega,A.Hernando,andA.Gutiérrez.Recommendersystemssurvey.Knowledge-basedsystems,46:109–132,2013.

[7]P.S.Bradley,J.Gehrke,R.Ramakrishnan,andR.Srikant.Scalingminingalgorithmstolargedatabases.CommunicationsoftheACM,45(8):38–43,2002.

[8]R.Burke.Hybridrecommendersystems:Surveyandexperiments.Usermodelinganduser-adaptedinteraction,12(4):331–370,2002.

[9]S.Chakrabarti.MiningtheWeb:DiscoveringKnowledgefromHypertextData.MorganKaufmann,SanFrancisco,CA,2003.

[10]M.-S.Chen,J.Han,andP.S.Yu.DataMining:AnOverviewfromaDatabasePerspective.IEEETransactionsonKnowledgeandDataEngineering,8(6):866–883,1996.

[11]V.CherkasskyandF.Mulier.LearningfromData:Concepts,Theory,andMethods.Wiley-IEEEPress,2ndedition,1998.

[12]C.Clifton,M.Kantarcioglu,andJ.Vaidya.Definingprivacyfordatamining.InNationalScienceFoundationWorkshoponNextGenerationDataMining,pages126–133,Baltimore,MD,November2002.

[13]C.DesrosiersandG.Karypis.Acomprehensivesurveyofneighborhood-basedrecommendationmethods.Recommendersystemshandbook,pages107–144,2011.

[14] P. Domingos and G. Hulten. Mining high-speed data streams. In Proc. of the 6th Intl. Conf. on Knowledge Discovery and Data Mining, pages 71–80, Boston, Massachusetts, 2000. ACM Press.

[15] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley & Sons, Inc., New York, 2nd edition, 2001.

[16]M.H.Dunham.DataMining:IntroductoryandAdvancedTopics.PrenticeHall,2006.

[17]J.H.Faghmous,A.Banerjee,S.Shekhar,M.Steinbach,V.Kumar,A.R.Ganguly,andN.Samatova.Theory-guideddatascienceforclimatechange.Computer,47(11):74–78,2014.

[18]U.M.Fayyad,G.G.Grinstein,andA.Wierse,editors.InformationVisualizationinDataMiningandKnowledgeDiscovery.MorganKaufmannPublishers,SanFrancisco,CA,September2001.

[19]U.M.Fayyad,G.Piatetsky-Shapiro,andP.Smyth.FromDataMiningtoKnowledgeDiscovery:AnOverview.InAdvancesinKnowledgeDiscoveryandDataMining,pages1–34.AAAIPress,1996.

[20]U.M.Fayyad,G.Piatetsky-Shapiro,P.Smyth,andR.Uthurusamy,editors.AdvancesinKnowledgeDiscoveryandDataMining.AAAI/MITPress,1996.

[21]J.H.Friedman.DataMiningandStatistics:What’stheConnection?Unpublished.www-stat.stanford.edu/~jhf/ftp/dm-stat.ps,1997.

[22]C.Giannella,J.Han,J.Pei,X.Yan,andP.S.Yu.MiningFrequentPatternsinDataStreamsatMultipleTimeGranularities.InH.Kargupta,A.Joshi,K.Sivakumar,andY.Yesha,editors,NextGenerationDataMining,pages191–212.AAAI/MIT,2003.

[23]C.Glymour,D.Madigan,D.Pregibon,andP.Smyth.StatisticalThemesandLessonsforDataMining.DataMiningandKnowledgeDiscovery,1(1):11–28,1997.

[24]R.L.Grossman,M.F.Hornick,andG.Meyer.Dataminingstandardsinitiatives.CommunicationsoftheACM,45(8):59–61,2002.

[25]R.L.Grossman,C.Kamath,P.Kegelmeyer,V.Kumar,andR.Namburu,editors.DataMiningforScientificandEngineeringApplications.KluwerAcademicPublishers,2001.

[26]S.Guha,A.Meyerson,N.Mishra,R.Motwani,andL.O’Callaghan.ClusteringDataStreams:TheoryandPractice.IEEETransactionsonKnowledgeandDataEngineering,15(3):515–528,May/June2003.

[27]M.Hall,E.Frank,G.Holmes,B.Pfahringer,P.Reutemann,andI.H.Witten.TheWEKADataMiningSoftware:AnUpdate.SIGKDDExplorations,11(1),2009.

[28]J.Han,R.B.Altman,V.Kumar,H.Mannila,andD.Pregibon.Emergingscientificapplicationsindatamining.CommunicationsoftheACM,45(8):54–58,2002.

[29]J.Han,M.Kamber,andJ.Pei.DataMining:ConceptsandTechniques.MorganKaufmannPublishers,SanFrancisco,3rdedition,2011.

[30]D.J.Hand.DataMining:StatisticsandMore?TheAmericanStatistician,52(2):112–118,1998.

[31]D.J.Hand,H.Mannila,andP.Smyth.PrinciplesofDataMining.MITPress,2001.

[32]T.Hastie,R.Tibshirani,andJ.H.Friedman.TheElementsofStatisticalLearning:DataMining,Inference,Prediction.Springer,2ndedition,2009.

[33]T.Hey,S.Tansley,K.M.Tolle,etal.Thefourthparadigm:data-intensivescientificdiscovery,volume1.MicrosoftresearchRedmond,WA,2009.

[34]M.Kantardzic.DataMining:Concepts,Models,Methods,andAlgorithms.Wiley-IEEEPress,Piscataway,NJ,2003.

[35]H.KarguptaandP.K.Chan,editors.AdvancesinDistributedandParallelKnowledgeDiscovery.AAAIPress,September2002.

[36]H.Kargupta,S.Datta,Q.Wang,andK.Sivakumar.OnthePrivacyPreservingPropertiesofRandomDataPerturbationTechniques.InProc.ofthe2003IEEEIntl.Conf.onDataMining,pages99–106,Melbourne,Florida,December2003.IEEEComputerSociety.

[37]A.Karpatne,G.Atluri,J.Faghmous,M.Steinbach,A.Banerjee,A.Ganguly,S.Shekhar,N.Samatova,andV.Kumar.Theory-guidedDataScience:ANewParadigmforScientificDiscoveryfromData.IEEETransactionsonKnowledgeandDataEngineering,2017.

[38]D.Kifer,S.Ben-David,andJ.Gehrke.DetectingChangeinDataStreams.InProc.ofthe30thVLDBConf.,pages180–191,Toronto,Canada,2004.MorganKaufmann.

[39]J.Kleinberg,J.Ludwig,andS.Mullainathan.AGuidetoSolvingSocialProblemswithMachineLearning.HarvardBusinessReview,December2016.

[40]D.Lambert.WhatUseisStatisticsforMassiveData?InACMSIGMODWorkshoponResearchIssuesinDataMiningandKnowledgeDiscovery,pages54–62,2000.

[41]M.H.C.Law,N.Zhang,andA.K.Jain.NonlinearManifoldLearningforDataStreams.InProc.oftheSIAMIntl.Conf.onDataMining,LakeBuenaVista,Florida,April2004.SIAM.

[42]Z.C.Lipton.Themythosofmodelinterpretability.arXivpreprintarXiv:1606.03490,2016.

[43]T.Mitchell.MachineLearning.McGraw-Hill,Boston,MA,1997.

[44]S.Papadimitriou,A.Brockwell,andC.Faloutsos.Adaptive,unsupervisedstreammining.VLDBJournal,13(3):222–239,2004.

[45] O. Parr Rud. Data Mining Cookbook: Modeling Data for Marketing, Risk and Customer Relationship Management. John Wiley & Sons, New York, NY, 2001.

[46]F.Pedregosa,G.Varoquaux,A.Gramfort,V.Michel,B.Thirion,O.Grisel,M.Blondel,P.Prettenhofer,R.Weiss,V.Dubourg,J.Vanderplas,A.Passos,D.Cournapeau,M.Brucher,M.Perrot,andE.Duchesnay.Scikit-learn:MachineLearninginPython.JournalofMachineLearningResearch,12:2825–2830,2011.

[47]D.Pyle.BusinessModelingandDataMining.MorganKaufmann,SanFrancisco,CA,2003.

[48]N.RamakrishnanandA.Grama.DataMining:FromSerendipitytoScience—GuestEditors’Introduction.IEEEComputer,32(8):34–37,1999.

[49]M.T.Ribeiro,S.Singh,andC.Guestrin.Whyshoulditrustyou?:Explainingthepredictionsofanyclassifier.InProceedingsofthe22ndACMSIGKDDInternationalConferenceonKnowledgeDiscoveryandDataMining,pages1135–1144.ACM,2016.

[50]R.RoigerandM.Geatz.DataMining:ATutorialBasedPrimer.Addison-Wesley,2002.

[51]J.Schafer.TheApplicationofData-MiningtoRecommenderSystems.Encyclopediaofdatawarehousingandmining,1:44–48,2009.

[52]P.Smyth.BreakingoutoftheBlack-Box:ResearchChallengesinDataMining.InProc.ofthe2001ACMSIGMODWorkshoponResearchIssuesinDataMiningandKnowledgeDiscovery,2001.

[53]P.Smyth,D.Pregibon,andC.Faloutsos.Data-drivenevolutionofdataminingalgorithms.CommunicationsoftheACM,45(8):33–37,2002.

[54]X.SuandT.M.Khoshgoftaar.Asurveyofcollaborativefilteringtechniques.Advancesinartificialintelligence,2009:4,2009.

[55]V.S.Verykios,E.Bertino,I.N.Fovino,L.P.Provenza,Y.Saygin,andY.Theodoridis.State-of-the-artinprivacypreservingdatamining.SIGMODRecord,33(1):50–57,2004.

[56]J.T.L.Wang,M.J.Zaki,H.Toivonen,andD.E.Shasha,editors.DataMininginBioinformatics.Springer,September2004.

[57] A. R. Webb. Statistical Pattern Recognition. John Wiley & Sons, 2nd edition, 2002.

[58]I.H.WittenandE.Frank.DataMining:PracticalMachineLearningToolsandTechniques.MorganKaufmann,3rdedition,2011.

[59]X.Wu,P.S.Yu,andG.Piatetsky-Shapiro.DataMining:HowResearchMeetsPracticalDevelopment?KnowledgeandInformationSystems,5(2):248–261,2003.

[60]M.J.ZakiandC.-T.Ho,editors.Large-ScaleParallelDataMining.Springer,September2002.

[61]M.J.ZakiandW.MeiraJr.DataMiningandAnalysis:FundamentalConceptsandAlgorithms.CambridgeUniversityPress,NewYork,2014.

1.7 Exercises

1. Discuss whether or not each of the following activities is a data mining task.

a. Dividingthecustomersofacompanyaccordingtotheirgender.

b. Dividingthecustomersofacompanyaccordingtotheirprofitability.

c. Computingthetotalsalesofacompany.

d. Sortingastudentdatabasebasedonstudentidentificationnumbers.

e. Predictingtheoutcomesoftossinga(fair)pairofdice.

f. Predictingthefuturestockpriceofacompanyusinghistoricalrecords.

g. Monitoringtheheartrateofapatientforabnormalities.

h. Monitoringseismicwavesforearthquakeactivities.

i. Extractingthefrequenciesofasoundwave.

2.SupposethatyouareemployedasadataminingconsultantforanInternetsearchenginecompany.Describehowdataminingcanhelpthecompanybygivingspecificexamplesofhowtechniques,suchasclustering,classification,associationrulemining,andanomalydetectioncanbeapplied.

3.Foreachofthefollowingdatasets,explainwhetherornotdataprivacyisanimportantissue.

a. Censusdatacollectedfrom1900–1950.

b. IPaddressesandvisittimesofwebuserswhovisityourwebsite.

c. ImagesfromEarth-orbitingsatellites.

d. Namesandaddressesofpeoplefromthetelephonebook.

e. NamesandemailaddressescollectedfromtheWeb.

2 Data

Thischapterdiscussesseveraldata-relatedissuesthatareimportantforsuccessfuldatamining:

TheTypeofDataDatasetsdifferinanumberofways.Forexample,theattributesusedtodescribedataobjectscanbeofdifferenttypes—quantitativeorqualitative—anddatasetsoftenhavespecialcharacteristics;e.g.,somedatasetscontaintimeseriesorobjectswithexplicitrelationshipstooneanother.Notsurprisingly,thetypeofdatadetermineswhichtoolsandtechniquescanbeusedtoanalyzethedata.Indeed,newresearchindataminingisoftendrivenbytheneedtoaccommodatenewapplicationareasandtheirnewtypesofdata.

TheQualityoftheDataDataisoftenfarfromperfect.Whilemostdataminingtechniquescantoleratesomelevelofimperfectioninthedata,afocusonunderstandingandimprovingdataqualitytypicallyimprovesthequalityoftheresultinganalysis.Dataqualityissuesthatoftenneedtobeaddressedincludethepresenceofnoiseandoutliers;missing,inconsistent,orduplicatedata;anddatathatisbiasedor,insomeotherway,unrepresentativeofthephenomenonorpopulationthatthedataissupposedtodescribe.

Preprocessing Steps to Make the Data More Suitable for Data Mining Often, the raw data must be processed in order to make it suitable for analysis. While one objective may be to improve data quality, other goals focus on modifying the data so that it better fits a specified data mining technique or tool. For example, a continuous attribute, e.g., length, sometimes needs to be transformed into an attribute with discrete categories, e.g., short, medium, or long, in order to apply a particular technique; a short sketch of such a discretization follows this list of issues. As another example, the number of attributes in a data set is often reduced because many techniques are more effective when the data has a relatively small number of attributes.

AnalyzingDatainTermsofItsRelationshipsOneapproachtodataanalysisistofindrelationshipsamongthedataobjectsandthenperformtheremaininganalysisusingtheserelationshipsratherthanthedataobjectsthemselves.Forinstance,wecancomputethesimilarityordistancebetweenpairsofobjectsandthenperformtheanalysis—clustering,classification,oranomalydetection—basedonthesesimilaritiesordistances.Therearemanysuchsimilarityordistancemeasures,andtheproperchoicedependsonthetypeofdataandtheparticularapplication.
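As a small illustration of the discretization mentioned in the preprocessing item above, the following Python sketch maps a continuous length attribute to the discrete categories short, medium, and long. The cutoff values are arbitrary choices made for this illustration, not values from the text.

```python
# A minimal sketch of discretizing a continuous attribute (length) into
# the categories short/medium/long. The cutoffs 2.0 and 5.0 are arbitrary
# illustrative thresholds, not values prescribed by the text.

def discretize_length(length):
    """Map a continuous length value to a discrete category."""
    if length < 2.0:
        return "short"
    elif length < 5.0:
        return "medium"
    return "long"

lengths = [0.5, 1.8, 3.2, 4.9, 7.4]
print([discretize_length(x) for x in lengths])
# ['short', 'short', 'medium', 'medium', 'long']
```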

Example2.1(AnIllustrationofData-RelatedIssues).Tofurtherillustratetheimportanceoftheseissues,considerthefollowinghypotheticalsituation.Youreceiveanemailfromamedicalresearcherconcerningaprojectthatyouareeagertoworkon.

Hi,

I've attached the data file that I mentioned in my previous email. Each line contains the information for a single patient and consists of five fields. We want to predict the last field using the other fields. I don't have time to provide any more information about the data since I'm going out of town for a couple of days, but hopefully that won't slow you down too much. And if you don't mind, could we meet when I get back to discuss your preliminary results? I might invite a few other members of my team.

Thanks and see you in a couple of days.

Despitesomemisgivings,youproceedtoanalyzethedata.Thefirstfewrowsofthefileareasfollows:

012 232 33.5 0 10.7

020 121 16.9 2 210.1

027 165 24.0 0 427.6

Abrieflookatthedatarevealsnothingstrange.Youputyourdoubtsasideandstarttheanalysis.Thereareonly1000lines,asmallerdatafilethanyouhadhopedfor,buttwodayslater,youfeelthatyouhavemadesomeprogress.Youarriveforthemeeting,andwhilewaitingforotherstoarrive,youstrikeupaconversationwithastatisticianwhoisworkingontheproject.Whenshelearnsthatyouhavealsobeenanalyzingthedatafromtheproject,sheasksifyouwouldmindgivingherabriefoverviewofyourresults.

Statistician:So,yougotthedataforallthepatients?

DataMiner:Yes.Ihaven’thadmuchtimeforanalysis,butIdohaveafewinterestingresults.

Statistician:Amazing.ThereweresomanydataissueswiththissetofpatientsthatIcouldn’tdomuch.

DataMiner:Oh?Ididn’thearaboutanypossibleproblems.

Statistician:Well,firstthereisfield5,thevariablewewanttopredict.

It’scommonknowledgeamongpeoplewhoanalyzethistypeofdatathatresultsarebetterifyouworkwiththelogofthevalues,butIdidn’tdiscoverthisuntillater.Wasitmentionedtoyou?

DataMiner:No.

Statistician:Butsurelyyouheardaboutwhathappenedtofield4?It’ssupposedtobemeasuredonascalefrom1to10,with0indicatingamissingvalue,butbecauseofadataentryerror,all10’swerechangedinto0’s.Unfortunately,sincesomeofthepatientshavemissingvaluesforthisfield,it’simpossibletosaywhethera0inthisfieldisareal0ora10.Quiteafewoftherecordshavethatproblem.

DataMiner:Interesting.Werethereanyotherproblems?

Statistician:Yes,fields2and3arebasicallythesame,butIassumethatyouprobablynoticedthat.

DataMiner:Yes,butthesefieldswereonlyweakpredictorsoffield5.

Statistician:Anyway,givenallthoseproblems,I’msurprisedyouwereabletoaccomplishanything.

DataMiner:True,butmyresultsarereallyquitegood.Field1isaverystrongpredictoroffield5.I’msurprisedthatthiswasn’tnoticedbefore.

Statistician:What?Field1isjustanidentificationnumber.

DataMiner:Nonetheless,myresultsspeakforthemselves.

Statistician:Oh,no!Ijustremembered.WeassignedIDnumbersafterwesortedtherecordsbasedonfield5.Thereisastrongconnection,butit’smeaningless.Sorry.

Although this scenario represents an extreme situation, it emphasizes the importance of "knowing your data." To that end, this chapter will address each of the four issues mentioned above, outlining some of the basic challenges and standard approaches.

2.1 Types of Data

A data set can often be viewed as a collection of data objects. Other names for a data object are record, point, vector, pattern, event, case, sample, instance, observation, or entity. In turn, data objects are described by a number of attributes that capture the characteristics of an object, such as the mass of a physical object or the time at which an event occurred. Other names for an attribute are variable, characteristic, field, feature, or dimension.

Example2.2(StudentInformation).Often,adatasetisafile,inwhichtheobjectsarerecords(orrows)inthefileandeachfield(orcolumn)correspondstoanattribute.Forexample,Table2.1 showsadatasetthatconsistsofstudentinformation.Eachrowcorrespondstoastudentandeachcolumnisanattributethatdescribessomeaspectofastudent,suchasgradepointaverage(GPA)oridentificationnumber(ID).

Table2.1.Asampledatasetcontainingstudentinformation.

StudentID Year GradePointAverage(GPA) …

1034262 Senior 3.24 …

1052663 Freshman 3.51 …

1082246 Sophomore 3.62 …

Althoughrecord-baseddatasetsarecommon,eitherinflatfilesorrelationaldatabasesystems,thereareotherimportanttypesofdatasetsandsystemsforstoringdata.InSection2.1.2 ,wewilldiscusssomeofthetypesofdatasetsthatarecommonlyencounteredindatamining.However,wefirstconsiderattributes.

2.1.1 Attributes and Measurement

Inthissection,weconsiderthetypesofattributesusedtodescribedataobjects.Wefirstdefineanattribute,thenconsiderwhatwemeanbythetypeofanattribute,andfinallydescribethetypesofattributesthatarecommonlyencountered.

What Is an Attribute? We start with a more detailed definition of an attribute.

Definition2.1.Anattributeisapropertyorcharacteristicofanobjectthatcanvary,eitherfromoneobjecttoanotherorfromonetimetoanother.

For example, eye color varies from person to person, while the temperature of an object varies over time. Note that eye color is a symbolic attribute with a small number of possible values {brown, black, blue, green, hazel, etc.}, while temperature is a numerical attribute with a potentially unlimited number of values.

Atthemostbasiclevel,attributesarenotaboutnumbersorsymbols.However,todiscussandmorepreciselyanalyzethecharacteristicsofobjects,weassignnumbersorsymbolstothem.Todothisinawell-definedway,weneedameasurementscale.

Definition2.2.Ameasurementscaleisarule(function)thatassociatesanumericalorsymbolicvaluewithanattributeofanobject.

Formally,theprocessofmeasurementistheapplicationofameasurementscaletoassociateavaluewithaparticularattributeofaspecificobject.Whilethismayseemabitabstract,weengageintheprocessofmeasurementallthetime.Forinstance,westeponabathroomscaletodetermineourweight,weclassifysomeoneasmaleorfemale,orwecountthenumberofchairsinaroomtoseeiftherewillbeenoughtoseatallthepeoplecomingtoameeting.Inallthesecases,the“physicalvalue”ofanattributeofanobjectismappedtoanumericalorsymbolicvalue.

Withthisbackground,wecandiscussthetypeofanattribute,aconceptthatisimportantindeterminingifaparticulardataanalysistechniqueisconsistentwithaspecifictypeofattribute.

The Type of an Attribute It is common to refer to the type of an attribute as the type of a measurement scale. It should be apparent from the previous discussion that an attribute can be described using different measurement scales and that the properties of an attribute need not be the same as the properties of the values used to measure it. In other words, the values used to represent an attribute can have properties that are not properties of the attribute itself, and vice versa. This is illustrated with two examples.

Example2.3(EmployeeAgeandIDNumber).TwoattributesthatmightbeassociatedwithanemployeeareIDandage(inyears).Bothoftheseattributescanberepresentedasintegers.However,whileitisreasonabletotalkabouttheaverageageofanemployee,itmakesnosensetotalkabouttheaverageemployeeID.Indeed,theonlyaspectofemployeesthatwewanttocapturewiththeIDattributeisthattheyaredistinct.Consequently,theonlyvalidoperationforemployeeIDsistotestwhethertheyareequal.Thereisnohintofthislimitation,however,whenintegersareusedtorepresenttheemployeeIDattribute.Fortheageattribute,thepropertiesoftheintegersusedtorepresentageareverymuchthepropertiesoftheattribute.Evenso,thecorrespondenceisnotcompletebecause,forexample,ageshaveamaximum,whileintegersdonot.

Example 2.4 (Length of Line Segments). Consider Figure 2.1, which shows some objects—line segments—and how the length attribute of these objects can be mapped to numbers in two different ways. Each successive line segment, going from the top to the bottom, is formed by appending the topmost line segment to itself. Thus, the second line segment from the top is formed by appending the topmost line segment to itself twice, the third line segment from the top is formed by appending the topmost line segment to itself three times, and so forth. In a very real (physical) sense, all the line segments are multiples of the first. This fact is captured by the measurements on the right side of the figure, but not by those on the left side. More specifically, the measurement scale on the left side captures only the ordering of the length attribute, while the scale on the right side captures both the ordering and additivity properties. Thus, an attribute can be measured in a way that does not capture all the properties of the attribute.

Figure2.1.Themeasurementofthelengthoflinesegmentsontwodifferentscalesofmeasurement.

Knowing the type of an attribute is important because it tells us which properties of the measured values are consistent with the underlying properties of the attribute, and therefore, it allows us to avoid foolish actions, such as computing the average employee ID.

The Different Types of Attributes A useful (and simple) way to specify the type of an attribute is to identify the properties of numbers that correspond to underlying properties of the attribute. For example, an attribute such as length has many of the properties of numbers. It makes sense to compare and order objects by length, as well as to talk about the differences and ratios of length. The following properties (operations) of numbers are typically used to describe attributes.

1. Distinctness: = and ≠
2. Order: <, ≤, >, and ≥
3. Addition: + and −
4. Multiplication: × and /

Giventheseproperties,wecandefinefourtypesofattributes:nominal,ordinal,interval,andratio.Table2.2 givesthedefinitionsofthesetypes,alongwithinformationaboutthestatisticaloperationsthatarevalidforeachtype.Eachattributetypepossessesallofthepropertiesandoperationsoftheattributetypesaboveit.Consequently,anypropertyoroperationthatisvalidfornominal,ordinal,andintervalattributesisalsovalidforratioattributes.Inotherwords,thedefinitionoftheattributetypesiscumulative.However,thisdoesnotmeanthatthestatisticaloperationsappropriateforoneattributetypeareappropriatefortheattributetypesaboveit.

Table 2.2. Different attribute types.

Nominal (Categorical/Qualitative)
Description: The values of a nominal attribute are just different names; i.e., nominal values provide only enough information to distinguish one object from another (=, ≠).
Examples: zip codes, employee ID numbers, eye color, gender.
Operations: mode, entropy, contingency correlation, χ² test.

Ordinal (Categorical/Qualitative)
Description: The values of an ordinal attribute provide enough information to order objects (<, >).
Examples: hardness of minerals, {good, better, best}, grades, street numbers.
Operations: median, percentiles, rank correlation, run tests, sign tests.

Interval (Numeric/Quantitative)
Description: For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists (+, −).
Examples: calendar dates, temperature in Celsius or Fahrenheit.
Operations: mean, standard deviation, Pearson's correlation, t and F tests.

Ratio (Numeric/Quantitative)
Description: For ratio variables, both differences and ratios are meaningful (×, /).
Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current.
Operations: geometric mean, harmonic mean, percent variation.

Nominal and ordinal attributes are collectively referred to as categorical or qualitative attributes. As the name suggests, qualitative attributes, such as employee ID, lack most of the properties of numbers. Even if they are represented by numbers, i.e., integers, they should be treated more like symbols. The remaining two types of attributes, interval and ratio, are collectively referred to as quantitative or numeric attributes. Quantitative attributes are represented by numbers and have most of the properties of numbers. Note that quantitative attributes can be integer-valued or continuous.

The types of attributes can also be described in terms of transformations that do not change the meaning of an attribute. Indeed, S. S. Stevens, the psychologist who originally defined the types of attributes shown in Table 2.2, defined them in terms of these permissible transformations. For example, the meaning of a length attribute is unchanged if it is measured in meters instead of feet.

The statistical operations that make sense for a particular type of attribute are those that will yield the same results when the attribute is transformed by using a transformation that preserves the attribute's meaning. To illustrate, the average length of a set of objects is different when measured in meters rather than in feet, but both averages represent the same length. Table 2.3 shows the meaning-preserving transformations for the four attribute types of Table 2.2.

Table 2.3. Transformations that define attribute levels.

Nominal (Categorical/Qualitative)
Transformation: Any one-to-one mapping, e.g., a permutation of values.
Comment: If all employee ID numbers are reassigned, it will not make any difference.

Ordinal (Categorical/Qualitative)
Transformation: An order-preserving change of values, i.e., new_value = f(old_value), where f is a monotonic function.
Comment: An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.

Interval (Numeric/Quantitative)
Transformation: new_value = a × old_value + b, where a and b are constants.
Comment: The Fahrenheit and Celsius temperature scales differ in the location of their zero value and the size of a degree (unit).

Ratio (Numeric/Quantitative)
Transformation: new_value = a × old_value.
Comment: Length can be measured in meters or feet.
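To make the point about meaning-preserving transformations concrete, here is a small Python sketch with made-up numbers (not an example from the text). For a ratio attribute such as length, a change of units (new_value = a × old_value) changes the numeric mean but not the length it represents, so the mean remains a meaningful statistic; for an ordinal attribute, an order-preserving recoding keeps the median's interpretation intact, while the mean can change arbitrarily.

```python
# A minimal sketch of meaning-preserving transformations (illustrative numbers only).
import statistics

# Ratio attribute: length. A unit change is a permissible transformation.
lengths_m = [1.0, 2.0, 3.0, 5.0]
lengths_ft = [3.28084 * x for x in lengths_m]           # new_value = a * old_value
mean_m = statistics.mean(lengths_m)                     # 2.75 meters
mean_ft = statistics.mean(lengths_ft)                   # about 9.02 feet
print(mean_m, mean_ft, mean_ft / 3.28084)               # same physical length either way

# Ordinal attribute: good=1, better=2, best=3, recoded as in Table 2.3.
quality = [1, 2, 2, 3]
recoding = {1: 0.5, 2: 1, 3: 10}                        # order-preserving recoding
recoded = [recoding[v] for v in quality]
print(statistics.median(quality), statistics.median(recoded))  # 2.0 and 1.0 -> both correspond to "better"
print(statistics.mean(quality), statistics.mean(recoded))      # 2.0 vs 3.125 -> the mean is not meaningful here
```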

Example 2.5 (Temperature Scales). Temperature provides a good illustration of some of the concepts that have been described. First, temperature can be either an interval or a ratio attribute, depending on its measurement scale. When measured on the Kelvin scale, a temperature of 2° is, in a physically meaningful way, twice that of a temperature of 1°. This is not true when temperature is measured on either the Celsius or Fahrenheit scales, because, physically, a temperature of 1° Fahrenheit (Celsius) is not much different than a temperature of 2° Fahrenheit (Celsius). The problem is that the zero points of the Fahrenheit and Celsius scales are, in a physical sense, arbitrary, and therefore, the ratio of two Celsius or Fahrenheit temperatures is not physically meaningful.

Describing Attributes by the Number of Values An independent way of distinguishing between attributes is by the number of values they can take.

Discrete A discrete attribute has a finite or countably infinite set of values. Such attributes can be categorical, such as zip codes or ID numbers, or numeric, such as counts. Discrete attributes are often represented using integer variables. Binary attributes are a special case of discrete attributes and assume only two values, e.g., true/false, yes/no, male/female, or 0/1. Binary attributes are often represented as Boolean variables, or as integer variables that only take the values 0 or 1.

ContinuousAcontinuousattributeisonewhosevaluesarerealnumbers.Examplesincludeattributessuchastemperature,height,orweight.Continuousattributesaretypicallyrepresentedasfloating-pointvariables.Practically,realvaluescanbemeasuredandrepresentedonlywithlimitedprecision.

Intheory,anyofthemeasurementscaletypes—nominal,ordinal,interval,andratio—couldbecombinedwithanyofthetypesbasedonthenumberofattributevalues—binary,discrete,andcontinuous.However,somecombinationsoccuronlyinfrequentlyordonotmakemuchsense.Forinstance,itisdifficulttothinkofarealisticdatasetthatcontainsacontinuousbinaryattribute.Typically,nominalandordinalattributesarebinaryordiscrete,whileintervalandratioattributesarecontinuous.However,countattributes,whicharediscrete,arealsoratioattributes.

Asymmetric Attributes For asymmetric attributes, only presence—a non-zero attribute value—is regarded as important. Consider a data set in which each object is a student and each attribute records whether a student took a particular course at a university. For a specific student, an attribute has a value of 1 if the student took the course associated with that attribute and a value of 0 otherwise. Because students take only a small fraction of all available courses, most of the values in such a data set would be 0. Therefore, it is more meaningful and more efficient to focus on the non-zero values. To illustrate, if students are compared on the basis of the courses they don't take, then most students would seem very similar, at least if the number of courses is large. Binary attributes where only non-zero values are important are called asymmetric binary attributes. This type of attribute is particularly important for association analysis, which is discussed in Chapter 5. It is also possible to have discrete or continuous asymmetric features. For instance, if the number of credits associated with each course is recorded, then the resulting data set will consist of asymmetric discrete or continuous attributes.
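The following Python sketch shows one common way to handle asymmetric binary data like the student–course example above: store only the non-zero values (the courses actually taken) and compare students with a similarity measure that ignores shared absences. The course names and the use of a Jaccard-style set similarity are illustrative choices, not prescribed by the text.

```python
# A minimal sketch: asymmetric binary data stored sparsely as sets of "present" items.
students = {
    "s1": {"CS101", "CS201", "MATH140"},   # hypothetical course IDs
    "s2": {"CS101", "STAT301"},
    "s3": {"HIST210"},
}

def set_similarity(a, b):
    """Similarity based only on shared presences; 0-0 matches are ignored,
    as is appropriate for asymmetric binary attributes."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

print(set_similarity(students["s1"], students["s2"]))  # 0.25 (one shared course out of four)
print(set_similarity(students["s1"], students["s3"]))  # 0.0 (no courses in common)
```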

General Comments on Levels of Measurement As described in the rest of this chapter, there are many diverse types of data. The previous discussion of measurement scales, while useful, is not complete and has some limitations. We provide the following comments and guidance.

Distinctness, order, and meaningful intervals and ratios are only four properties of data—many others are possible. For instance, some data is inherently cyclical, e.g., position on the surface of the Earth or time. As another example, consider set-valued attributes, where each attribute value is a set of elements, e.g., the set of movies seen in the last year. Define one set of elements (movies) to be greater (larger) than a second set if the second set is a subset of the first. However, such a relationship defines only a partial order that does not match any of the attribute types just defined.

The numbers or symbols used to capture attribute values may not capture all the properties of the attributes or may suggest properties that are not there. An illustration of this for integers was presented in Example 2.3, i.e., averages of IDs and out-of-range ages.

Data is often transformed for the purpose of analysis—see Section 2.3.7. This often changes the distribution of the observed variable to a distribution that is easier to analyze, e.g., a Gaussian (normal) distribution. Often, such transformations only preserve the order of the original values, and other properties are lost. Nonetheless, if the desired outcome is a statistical test of differences or a predictive model, such a transformation is justified.

The final evaluation of any data analysis, including operations on attributes, is whether the results make sense from a domain point of view.

Insummary,itcanbechallengingtodeterminewhichoperationscanbeperformedonaparticularattributeoracollectionofattributeswithoutcompromisingtheintegrityoftheanalysis.Fortunately,establishedpracticeoftenservesasareliableguide.Occasionally,however,standardpracticesareerroneousorhavelimitations.

2.1.2 Types of Data Sets

Therearemanytypesofdatasets,andasthefieldofdataminingdevelopsandmatures,agreatervarietyofdatasetsbecomeavailableforanalysis.Inthissection,wedescribesomeofthemostcommontypes.Forconvenience,wehavegroupedthetypesofdatasetsintothreegroups:recorddata,graph-baseddata,andordereddata.Thesecategoriesdonotcoverallpossibilitiesandothergroupingsarecertainlypossible.

GeneralCharacteristicsofDataSetsBeforeprovidingdetailsofspecifickindsofdatasets,wediscussthreecharacteristicsthatapplytomanydatasetsandhaveasignificantimpactonthedataminingtechniquesthatareused:dimensionality,distribution,andresolution.

Dimensionality

Thedimensionalityofadatasetisthenumberofattributesthattheobjectsinthedatasetpossess.Analyzingdatawithasmallnumberofdimensionstendstobequalitativelydifferentfromanalyzingmoderateorhigh-dimensionaldata.Indeed,thedifficultiesassociatedwiththeanalysisofhigh-dimensionaldataaresometimesreferredtoasthecurseofdimensionality.Becauseofthis,animportantmotivationinpreprocessingthedataisdimensionalityreduction.TheseissuesarediscussedinmoredepthlaterinthischapterandinAppendixB.

Distribution

Thedistributionofadatasetisthefrequencyofoccurrenceofvariousvaluesorsetsofvaluesfortheattributescomprisingdataobjects.Equivalently,thedistributionofadatasetcanbeconsideredasadescriptionoftheconcentrationofobjectsinvariousregionsofthedataspace.Statisticianshaveenumeratedmanytypesofdistributions,e.g.,Gaussian(normal),anddescribedtheirproperties.(SeeAppendixC.)Althoughstatisticalapproachesfordescribingdistributionscanyieldpowerfulanalysistechniques,manydatasetshavedistributionsthatarenotwellcapturedbystandardstatisticaldistributions.

Asaresult,manydataminingalgorithmsdonotassumeaparticularstatisticaldistributionforthedatatheyanalyze.However,somegeneralaspectsofdistributionsoftenhaveastrongimpact.Forexample,supposeacategoricalattributeisusedasaclassvariable,whereoneofthecategoriesoccurs95%ofthetime,whiletheothercategoriestogetheroccuronly5%ofthetime.ThisskewnessinthedistributioncanmakeclassificationdifficultasdiscussedinSection4.11.(Skewnesshasotherimpactsondataanalysisthatarenotdiscussedhere.)
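As a concrete (synthetic) illustration of why such a skewed class distribution is problematic, the short Python sketch below shows that a trivial model that always predicts the majority class already achieves 95% accuracy while never identifying the rare class.

```python
# A small illustration of the 95%/5% skew described above (synthetic labels).
labels = ["common"] * 95 + ["rare"] * 5
predictions = ["common"] * len(labels)      # trivial majority-class "classifier"

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
rare_found = sum(1 for p, y in zip(predictions, labels) if p == "rare" and y == "rare")
print(accuracy)    # 0.95 -- looks good, but...
print(rare_found)  # 0    -- the rare class is never detected
```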

Aspecialcaseofskeweddataissparsity.Forsparsebinary,countorcontinuousdata,mostattributesofanobjecthavevaluesof0.Inmanycases,fewerthan1%ofthevaluesarenon-zero.Inpracticalterms,sparsityisanadvantagebecauseusuallyonlythenon-zerovaluesneedtobestoredandmanipulated.Thisresultsinsignificantsavingswithrespecttocomputationtimeandstorage.Indeed,somedataminingalgorithms,suchastheassociationruleminingalgorithmsdescribedinChapter5 ,workwellonlyforsparsedata.Finally,notethatoftentheattributesinsparsedatasetsareasymmetricattributes.

Resolution

Itisfrequentlypossibletoobtaindataatdifferentlevelsofresolution,andoftenthepropertiesofthedataaredifferentatdifferentresolutions.Forinstance,thesurfaceoftheEarthseemsveryunevenataresolutionofafewmeters,butisrelativelysmoothataresolutionoftensofkilometers.Thepatternsinthedataalsodependonthelevelofresolution.Iftheresolutionistoofine,apatternmaynotbevisibleormaybeburiedinnoise;iftheresolutionistoocoarse,thepatterncandisappear.Forexample,variationsinatmosphericpressureonascaleofhoursreflectthemovementofstormsandotherweathersystems.Onascaleofmonths,suchphenomenaarenotdetectable.

Record Data Much data mining work assumes that the data set is a collection of records (data objects), each of which consists of a fixed set of data fields (attributes). See Figure 2.2(a). For the most basic form of record data, there is no explicit relationship among records or data fields, and every record (object) has the same set of attributes. Record data is usually stored either in flat files or in relational databases. Relational databases are certainly more than a collection of records, but data mining often does not use any of the additional information available in a relational database. Rather, the database serves as a convenient place to find records. Different types of record data are described below and are illustrated in Figure 2.2.

Figure2.2.Differentvariationsofrecorddata.

TransactionorMarketBasketData

Transactiondataisaspecialtypeofrecorddata,whereeachrecord(transaction)involvesasetofitems.Consideragrocerystore.Thesetofproductspurchasedbyacustomerduringoneshoppingtripconstitutesatransaction,whiletheindividualproductsthatwerepurchasedaretheitems.Thistypeofdataiscalledmarketbasketdatabecausetheitemsineachrecordaretheproductsinaperson’s“marketbasket.”Transactiondataisacollectionofsetsofitems,butitcanbeviewedasasetofrecordswhosefieldsareasymmetricattributes.Mostoften,theattributesarebinary,indicatingwhetheranitemwaspurchased,butmoregenerally,theattributescanbediscreteorcontinuous,suchasthenumberofitemspurchasedortheamountspentonthoseitems.Figure2.2(b) showsasampletransactiondataset.Eachrowrepresentsthepurchasesofaparticularcustomerataparticulartime.

TheDataMatrix

Ifallthedataobjectsinacollectionofdatahavethesamefixedsetofnumericattributes,thenthedataobjectscanbethoughtofaspoints(vectors)inamultidimensionalspace,whereeachdimensionrepresentsadistinctattributedescribingtheobject.Asetofsuchdataobjectscanbeinterpretedasanmbynmatrix,wheretherearemrows,oneforeachobject,andncolumns,oneforeachattribute.(Arepresentationthathasdataobjectsascolumnsandattributesasrowsisalsofine.)Thismatrixiscalledadatamatrixorapatternmatrix.Adatamatrixisavariationofrecorddata,butbecauseitconsistsofnumericattributes,standardmatrixoperationcanbeappliedtotransformandmanipulatethedata.Therefore,thedatamatrixisthestandarddataformatformoststatisticaldata.Figure2.2(c) showsasampledatamatrix.

TheSparseDataMatrix

Asparsedatamatrixisaspecialcaseofadatamatrixwheretheattributesareofthesametypeandareasymmetric;i.e.,onlynon-zerovaluesareimportant.Transactiondataisanexampleofasparsedatamatrixthathasonly0–1entries.Anothercommonexampleisdocumentdata.Inparticular,iftheorderoftheterms(words)inadocumentisignored—the“bagofwords”approach—thenadocumentcanberepresentedasatermvector,whereeachtermisacomponent(attribute)ofthevectorandthevalueofeachcomponentisthenumberoftimesthecorrespondingtermoccursinthedocument.Thisrepresentationofacollectionofdocumentsisoftencalledadocument-termmatrix.Figure2.2(d) showsasampledocument-termmatrix.Thedocumentsaretherowsofthismatrix,whilethetermsarethecolumns.Inpractice,onlythenon-zeroentriesofsparsedatamatricesarestored.
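The following Python sketch builds a tiny document-term representation using the bag-of-words approach described above, storing only the non-zero counts as a sparse data matrix would. The two example sentences are invented for illustration.

```python
# A minimal sketch of a sparse document-term representation (bag of words).
from collections import Counter

docs = [
    "data mining finds patterns in data",
    "graph mining is a branch of data mining",
]

# One Counter per document: term -> count. Only non-zero entries are stored.
doc_term = [Counter(doc.split()) for doc in docs]

for i, counts in enumerate(doc_term):
    print(f"document {i}:", dict(counts))
# document 0: {'data': 2, 'mining': 1, 'finds': 1, 'patterns': 1, 'in': 1}
# document 1: {'graph': 1, 'mining': 2, 'is': 1, 'a': 1, 'branch': 1, 'of': 1, 'data': 1}
```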

Graph-BasedDataAgraphcansometimesbeaconvenientandpowerfulrepresentationfordata.Weconsidertwospecificcases:(1)thegraphcapturesrelationshipsamongdataobjectsand(2)thedataobjectsthemselvesarerepresentedasgraphs.

DatawithRelationshipsamongObjects

The relationships among objects frequently convey important information. In such cases, the data is often represented as a graph. In particular, the data objects are mapped to nodes of the graph, while the relationships among objects are captured by the links between objects and link properties, such as direction and weight. Consider web pages on the World Wide Web, which contain both text and links to other pages. In order to process search queries, web search engines collect and process web pages to extract their contents. It is well known, however, that the links to and from each page provide a great deal of information about the relevance of a web page to a query, and thus, must also be taken into consideration. Figure 2.3(a) shows a set of linked web pages. Another important example of such graph data is social networks, where data objects are people and the relationships among them are their interactions via social media.

DatawithObjectsThatAreGraphs

Ifobjectshavestructure,thatis,theobjectscontainsubobjectsthathaverelationships,thensuchobjectsarefrequentlyrepresentedasgraphs.Forexample,thestructureofchemicalcompoundscanberepresentedbyagraph,wherethenodesareatomsandthelinksbetweennodesarechemicalbonds.Figure2.3(b) showsaball-and-stickdiagramofthechemicalcompoundbenzene,whichcontainsatomsofcarbon(black)andhydrogen(gray).Agraphrepresentationmakesitpossibletodeterminewhichsubstructuresoccurfrequentlyinasetofcompoundsandtoascertainwhetherthepresenceofanyofthesesubstructuresisassociatedwiththepresenceorabsenceofcertainchemicalproperties,suchasmeltingpointorheatofformation.Frequentgraphmining,whichisabranchofdataminingthatanalyzessuchdata,isconsideredinSection6.5.

Figure2.3.Differentvariationsofgraphdata.

OrderedDataForsometypesofdata,theattributeshaverelationshipsthatinvolveorderintimeorspace.DifferenttypesofordereddataaredescribednextandareshowninFigure2.4 .

SequentialTransactionData

Sequential transaction data can be thought of as an extension of transaction data, where each transaction has a time associated with it. Consider a retail transaction data set that also stores the time at which the transaction took place. This time information makes it possible to find patterns such as "candy sales peak before Halloween." A time can also be associated with each attribute. For example, each record could be the purchase history of a customer, with a listing of items purchased at different times. Using this information, it is possible to find patterns such as "people who buy DVD players tend to buy DVDs in the period immediately following the purchase."

Figure2.4(a) showsanexampleofsequentialtransactiondata.Therearefivedifferenttimes—t1,t2,t3,t4,andt5;threedifferentcustomers—C1,C2,andC3;andfivedifferentitems—A,B,C,D,andE.Inthetoptable,eachrowcorrespondstotheitemspurchasedataparticulartimebyeachcustomer.Forinstance,attimet3,customerC2purchaseditemsAandD.Inthebottomtable,thesameinformationisdisplayed,buteachrowcorrespondstoaparticularcustomer.Eachrowcontainsinformationabouteachtransactioninvolvingthecustomer,whereatransactionisconsideredtobeasetofitemsandthetimeatwhichthoseitemswerepurchased.Forexample,customerC3boughtitemsAandCattimet2.

TimeSeriesData

Timeseriesdataisaspecialtypeofordereddatawhereeachrecordisatimeseries,i.e.,aseriesofmeasurementstakenovertime.Forexample,afinancialdatasetmightcontainobjectsthataretimeseriesofthedailypricesofvariousstocks.Asanotherexample,considerFigure2.4(c) ,whichshowsatimeseriesoftheaveragemonthlytemperatureforMinneapolisduringtheyears1982to1994.Whenworkingwithtemporaldata,suchastimeseries,itisimportanttoconsidertemporalautocorrelation;i.e.,iftwomeasurementsarecloseintime,thenthevaluesofthosemeasurementsareoftenverysimilar.

Figure2.4.Differentvariationsofordereddata.
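The Python sketch below computes a simple lag-1 autocorrelation for a short synthetic series (not the Minneapolis data) to show what temporal autocorrelation means in practice: values adjacent in time are similar, so the coefficient is clearly positive.

```python
# A minimal sketch of lag-1 temporal autocorrelation on a synthetic, smooth series.
import statistics

series = [12.1, 12.4, 12.9, 13.5, 13.4, 12.8, 12.2, 11.9, 12.0, 12.5]

def lag1_autocorrelation(x):
    mean = statistics.mean(x)
    numerator = sum((x[t] - mean) * (x[t + 1] - mean) for t in range(len(x) - 1))
    denominator = sum((v - mean) ** 2 for v in x)
    return numerator / denominator

print(round(lag1_autocorrelation(series), 2))  # about 0.65: adjacent values are similar
```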

SequenceData

Sequencedataconsistsofadatasetthatisasequenceofindividualentities,suchasasequenceofwordsorletters.Itisquitesimilartosequentialdata,exceptthattherearenotimestamps;instead,therearepositionsinanorderedsequence.Forexample,thegeneticinformationofplantsandanimalscanberepresentedintheformofsequencesofnucleotidesthatareknownasgenes.Manyoftheproblemsassociatedwithgeneticsequencedatainvolvepredictingsimilaritiesinthestructureandfunctionofgenesfromsimilaritiesinnucleotidesequences.Figure2.4(b) showsasectionofthehumangeneticcodeexpressedusingthefournucleotidesfromwhichallDNAisconstructed:A,T,G,andC.

SpatialandSpatio-TemporalData

Someobjectshavespatialattributes,suchaspositionsorareas,inadditiontoothertypesofattributes.Anexampleofspatialdataisweatherdata(precipitation,temperature,pressure)thatiscollectedforavarietyofgeographicallocations.Oftensuchmeasurementsarecollectedovertime,andthus,thedataconsistsoftimeseriesatvariouslocations.Inthatcase,werefertothedataasspatio-temporaldata.Althoughanalysiscanbeconductedseparatelyforeachspecifictimeorlocation,amorecompleteanalysisofspatio-temporaldatarequiresconsiderationofboththespatialandtemporalaspectsofthedata.

Animportantaspectofspatialdataisspatialautocorrelation;i.e.,objectsthatarephysicallyclosetendtobesimilarinotherwaysaswell.Thus,twopointsontheEarththatareclosetoeachotherusuallyhavesimilarvaluesfortemperatureandrainfall.Notethatspatialautocorrelationisanalogoustotemporalautocorrelation.

Important examples of spatial and spatio-temporal data are the science and engineering data sets that are the result of measurements or model output taken at regularly or irregularly distributed points on a two- or three-dimensional grid or mesh. For instance, Earth science data sets record the temperature or pressure measured at points (grid cells) on latitude–longitude spherical grids of various resolutions, e.g., 1° by 1°. See Figure 2.4(d). As another example, in the simulation of the flow of a gas, the speed and direction of flow at various instants in time can be recorded for each grid point in the simulation. A different type of spatio-temporal data arises from tracking the trajectories of objects, e.g., vehicles, in time and space.

HandlingNon-RecordDataMostdataminingalgorithmsaredesignedforrecorddataoritsvariations,suchastransactiondataanddatamatrices.Record-orientedtechniquescanbeappliedtonon-recorddatabyextractingfeaturesfromdataobjectsandusingthesefeaturestocreatearecordcorrespondingtoeachobject.Considerthechemicalstructuredatathatwasdescribedearlier.Givenasetofcommonsubstructures,eachcompoundcanberepresentedasarecordwithbinaryattributesthatindicatewhetheracompoundcontainsaspecificsubstructure.Sucharepresentationisactuallyatransactiondataset,wherethetransactionsarethecompoundsandtheitemsarethesubstructures.

In some cases, it is easy to represent the data in a record format, but this type of representation does not capture all the information in the data. Consider spatio-temporal data consisting of a time series from each point on a spatial grid. This data is often stored in a data matrix, where each row represents a location and each column represents a particular point in time. However, such a representation does not explicitly capture the time relationships that are present among attributes and the spatial relationships that exist among objects. This does not mean that such a representation is inappropriate, but rather that these relationships must be taken into consideration during the analysis. For example, it would not be a good idea to use a data mining technique that ignores the temporal autocorrelation of the attributes or the spatial autocorrelation of the data objects, i.e., the locations on the spatial grid.

2.2 Data Quality

Data mining algorithms are often applied to data that was collected for another purpose, or for future, but unspecified, applications. For that reason, data mining cannot usually take advantage of the significant benefits of "addressing quality issues at the source." In contrast, much of statistics deals with the design of experiments or surveys that achieve a prespecified level of data quality. Because preventing data quality problems is typically not an option, data mining focuses on (1) the detection and correction of data quality problems and (2) the use of algorithms that can tolerate poor data quality. The first step, detection and correction, is often called data cleaning.

Thefollowingsectionsdiscussspecificaspectsofdataquality.Thefocusisonmeasurementanddatacollectionissues,althoughsomeapplication-relatedissuesarealsodiscussed.

2.2.1 Measurement and Data Collection Issues

It is unrealistic to expect that data will be perfect. There may be problems due to human error, limitations of measuring devices, or flaws in the data collection process. Values or even entire data objects can be missing. In other cases, there can be spurious or duplicate objects; i.e., multiple data objects that all correspond to a single "real" object. For example, there might be two different records for a person who has recently lived at two different addresses. Even if all the data is present and "looks fine," there may be inconsistencies—a person has a height of 2 meters, but weighs only 2 kilograms.

Inthenextfewsections,wefocusonaspectsofdataqualitythatarerelatedtodatameasurementandcollection.Webeginwithadefinitionofmeasurementanddatacollectionerrorsandthenconsideravarietyofproblemsthatinvolvemeasurementerror:noise,artifacts,bias,precision,andaccuracy.Weconcludebydiscussingdataqualityissuesthatinvolvebothmeasurementanddatacollectionproblems:outliers,missingandinconsistentvalues,andduplicatedata.

MeasurementandDataCollectionErrorsThetermmeasurementerrorreferstoanyproblemresultingfromthemeasurementprocess.Acommonproblemisthatthevaluerecordeddiffersfromthetruevaluetosomeextent.Forcontinuousattributes,thenumericaldifferenceofthemeasuredandtruevalueiscalledtheerror.Thetermdatacollectionerrorreferstoerrorssuchasomittingdataobjectsorattributevalues,orinappropriatelyincludingadataobject.Forexample,astudyofanimalsofacertainspeciesmightincludeanimalsofarelatedspeciesthataresimilarinappearancetothespeciesofinterest.Bothmeasurementerrorsanddatacollectionerrorscanbeeithersystematicorrandom.

Wewillonlyconsidergeneraltypesoferrors.Withinparticulardomains,certaintypesofdataerrorsarecommonplace,andwell-developedtechniquesoftenexistfordetectingand/orcorrectingtheseerrors.Forexample,keyboarderrorsarecommonwhendataisenteredmanually,andasaresult,manydataentryprogramshavetechniquesfordetectingand,withhumanintervention,correctingsucherrors.

Noise and Artifacts Noise is the random component of a measurement error. It typically involves the distortion of a value or the addition of spurious objects. Figure 2.5 shows a time series before and after it has been disrupted by random noise. If a bit more noise were added to the time series, its shape would be lost. Figure 2.6 shows a set of data points before and after some noise points (indicated by '+'s) have been added. Notice that some of the noise points are intermixed with the non-noise points.

Figure 2.5. Noise in a time series context.

Figure 2.6. Noise in a spatial context.

Thetermnoiseisoftenusedinconnectionwithdatathathasaspatialortemporalcomponent.Insuchcases,techniquesfromsignalorimageprocessingcanfrequentlybeusedtoreducenoiseandthus,helptodiscoverpatterns(signals)thatmightbe“lostinthenoise.”Nonetheless,theeliminationofnoiseisfrequentlydifficult,andmuchworkindataminingfocusesondevisingrobustalgorithmsthatproduceacceptableresultsevenwhennoiseispresent.

Dataerrorscanbetheresultofamoredeterministicphenomenon,suchasastreakinthesameplaceonasetofphotographs.Suchdeterministicdistortionsofthedataareoftenreferredtoasartifacts.

Precision, Bias, and Accuracy In statistics and experimental science, the quality of the measurement process and the resulting data are measured by precision and bias. We provide the standard definitions, followed by a brief discussion. For the following definitions, we assume that we make repeated measurements of the same underlying quantity.

Definition 2.3 (Precision). The closeness of repeated measurements (of the same quantity) to one another.

Definition 2.4 (Bias). A systematic variation of measurements from the quantity being measured.

Precision is often measured by the standard deviation of a set of values, while bias is measured by taking the difference between the mean of the set of values and the known value of the quantity being measured. Bias can be determined only for objects whose measured quantity is known by means external to the current situation. Suppose that we have a standard laboratory weight with a mass of 1 g and want to assess the precision and bias of our new laboratory scale. We weigh the mass five times, and obtain the following five values: {1.015, 0.990, 1.013, 1.001, 0.986}. The mean of these values is 1.001, and hence, the bias is 0.001. The precision, as measured by the standard deviation, is 0.013.
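The calculation in the example above can be reproduced directly; the sketch below assumes the sample standard deviation as the measure of precision, which matches the 0.013 reported in the text.

```python
# Reproducing the precision and bias calculation from the example above.
import statistics

measurements = [1.015, 0.990, 1.013, 1.001, 0.986]
true_value = 1.0                                   # mass of the standard weight in grams

bias = statistics.mean(measurements) - true_value  # mean of measurements minus known value
precision = statistics.stdev(measurements)         # sample standard deviation

print(round(bias, 3))        # 0.001
print(round(precision, 3))   # 0.013
```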

Itiscommontousethemoregeneralterm,accuracy,torefertothedegreeofmeasurementerrorindata.

Definition2.5(Accuracy)Theclosenessofmeasurementstothetruevalueofthequantitybeingmeasured.

Accuracydependsonprecisionandbias,butthereisnospecificformulaforaccuracyintermsofthesetwoquantities.

One important aspect of accuracy is the use of significant digits. The goal is to use only as many digits to represent the result of a measurement or calculation as are justified by the precision of the data. For example, if the length of an object is measured with a meter stick whose smallest markings are millimeters, then we should record the length of the data only to the nearest millimeter. The precision of such a measurement would be ±0.5 mm. We do not review the details of working with significant digits because most readers will have encountered them in previous courses and they are covered in considerable depth in science, engineering, and statistics textbooks.

Issues such as significant digits, precision, bias, and accuracy are sometimes overlooked, but they are important for data mining as well as statistics and science. Many times, data sets do not come with information about the precision of the data, and furthermore, the programs used for analysis return results without any such information. Nonetheless, without some understanding of the accuracy of the data and the results, an analyst runs the risk of committing serious data analysis blunders.

Outliers Outliers are either (1) data objects that, in some sense, have characteristics that are different from most of the other data objects in the data set, or (2) values of an attribute that are unusual with respect to the typical values for that attribute. Alternatively, they can be referred to as anomalous objects or values. There is considerable leeway in the definition of an outlier, and many different definitions have been proposed by the statistics and data mining communities. Furthermore, it is important to distinguish between the notions of noise and outliers. Unlike noise, outliers can be legitimate data objects or values that we are interested in detecting. For instance, in fraud and network intrusion detection, the goal is to find unusual objects or events from among a large number of normal ones. Chapter 9 discusses anomaly detection in more detail.

Missing Values It is not unusual for an object to be missing one or more attribute values. In some cases, the information was not collected; e.g., some people decline to give their age or weight. In other cases, some attributes are not applicable to all objects; e.g., often, forms have conditional parts that are filled out only when a person answers a previous question in a certain way, but for simplicity, all fields are stored. Regardless, missing values should be taken into account during the data analysis.

Thereareseveralstrategies(andvariationsonthesestrategies)fordealingwithmissingdata,eachofwhichisappropriateincertaincircumstances.Thesestrategiesarelistednext,alongwithanindicationoftheiradvantagesanddisadvantages.

EliminateDataObjectsorAttributes

Asimpleandeffectivestrategyistoeliminateobjectswithmissingvalues.However,evenapartiallyspecifieddataobjectcontainssomeinformation,andifmanyobjectshavemissingvalues,thenareliableanalysiscanbedifficultorimpossible.Nonetheless,ifadatasethasonlyafewobjectsthathavemissingvalues,thenitmaybeexpedienttoomitthem.Arelatedstrategyistoeliminateattributesthathavemissingvalues.Thisshouldbedonewithcaution,however,becausetheeliminatedattributesmaybetheonesthatarecriticaltotheanalysis.

EstimateMissingValues

Sometimesmissingdatacanbereliablyestimated.Forexample,consideratimeseriesthatchangesinareasonablysmoothfashion,buthasafew,widelyscatteredmissingvalues.Insuchcases,themissingvaluescanbeestimated(interpolated)byusingtheremainingvalues.Asanotherexample,consideradatasetthathasmanysimilardatapoints.Inthissituation,theattributevaluesofthepointsclosesttothepointwiththemissingvalueareoftenusedtoestimatethemissingvalue.Iftheattributeiscontinuous,thentheaverageattributevalueofthenearestneighborsisused;iftheattributeiscategorical,thenthemostcommonlyoccurringattributevaluecanbetaken.Foraconcreteillustration,considerprecipitationmeasurementsthatarerecordedbygroundstations.Forareasnotcontainingagroundstation,theprecipitationcanbeestimatedusingvaluesobservedatnearbygroundstations.
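The sketch below shows one simple version of the interpolation strategy mentioned above for a smooth time series: each missing value (marked None) is filled by linearly interpolating between its nearest recorded neighbors. It assumes the first and last values are present; the numbers are invented.

```python
# A minimal sketch of estimating missing time-series values by linear interpolation.
# Assumes the series is reasonably smooth and that its first and last values are present.

def interpolate_missing(values):
    filled = list(values)
    for i, v in enumerate(filled):
        if v is None:
            left = next(j for j in range(i - 1, -1, -1) if filled[j] is not None)
            right = next(j for j in range(i + 1, len(filled)) if filled[j] is not None)
            fraction = (i - left) / (right - left)
            filled[i] = filled[left] + fraction * (filled[right] - filled[left])
    return filled

series = [10.0, 10.5, None, 11.4, None, None, 12.6]
print([round(v, 2) for v in interpolate_missing(series)])
# [10.0, 10.5, 10.95, 11.4, 11.8, 12.2, 12.6]
```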

IgnoretheMissingValueduringAnalysis

Manydataminingapproachescanbemodifiedtoignoremissingvalues.Forexample,supposethatobjectsarebeingclusteredandthesimilaritybetweenpairsofdataobjectsneedstobecalculated.Ifoneorbothobjectsofapairhavemissingvaluesforsomeattributes,thenthesimilaritycanbecalculatedbyusingonlytheattributesthatdonothavemissingvalues.Itistruethatthesimilaritywillonlybeapproximate,butunlessthetotalnumberofattributesissmallorthenumberofmissingvaluesishigh,thisdegreeofinaccuracymaynotmattermuch.Likewise,manyclassificationschemescanbemodifiedtoworkwithmissingvalues.
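The sketch below illustrates the "ignore during analysis" strategy just described: a similarity between two objects is computed from only those attributes for which both values are present (None marks a missing value). The inverse-distance similarity used here is just one simple choice made for the illustration.

```python
# A minimal sketch of computing similarity while ignoring missing values (None).
import math

def similarity_ignoring_missing(x, y):
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    if not pairs:
        return 0.0                                    # nothing to compare
    distance = math.sqrt(sum((a - b) ** 2 for a, b in pairs))
    return 1.0 / (1.0 + distance)                     # simple inverse-distance similarity

obj1 = [1.0, None, 3.0, 4.0]
obj2 = [1.5, 2.0, None, 4.5]
print(round(similarity_ignoring_missing(obj1, obj2), 3))  # uses only the first and last attributes
```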

Inconsistent Values Data can contain inconsistent values. Consider an address field, where both a zip code and city are listed, but the specified zip code area is not contained in that city. It is possible that the individual entering this information transposed two digits, or perhaps a digit was misread when the information was scanned from a handwritten form. Regardless of the cause of the inconsistent values, it is important to detect and, if possible, correct such problems.

Sometypesofinconsistencesareeasytodetect.Forinstance,aperson’sheightshouldnotbenegative.Inothercases,itcanbenecessarytoconsultanexternalsourceofinformation.Forexample,whenaninsurancecompanyprocessesclaimsforreimbursement,itchecksthenamesandaddressesonthereimbursementformsagainstadatabaseofitscustomers.

Once an inconsistency has been detected, it is sometimes possible to correct the data. A product code may have "check" digits, or it may be possible to double-check a product code against a list of known product codes, and then correct the code if it is incorrect, but close to a known code. The correction of an inconsistency requires additional or redundant information.
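As a small illustration of using redundant information to detect inconsistencies, the sketch below assumes a made-up product-code scheme in which the digits must sum to a multiple of 10 (a simple check-digit rule); real schemes such as UPC use more elaborate rules.

```python
# A minimal sketch of detecting inconsistent values with a (made-up) check-digit rule:
# a product code is considered consistent only if its digits sum to a multiple of 10.

def is_consistent(code):
    return sum(int(ch) for ch in code) % 10 == 0

print(is_consistent("1234"))   # True: 1+2+3+4 = 10
print(is_consistent("1243"))   # True: same digits, so a transposition slips past this simple rule
print(is_consistent("1235"))   # False: flagged as inconsistent for correction
```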

Example2.6(InconsistentSeaSurfaceTemperature).Thisexampleillustratesaninconsistencyinactualtimeseriesdatathatmeasurestheseasurfacetemperature(SST)atvariouspointsontheocean.SSTdatawasoriginallycollectedusingocean-basedmeasurementsfromshipsorbuoys,butmorerecently,satelliteshavebeenusedtogatherthedata.Tocreatealong-termdataset,bothsourcesofdatamustbeused.However,becausethedatacomesfromdifferentsources,thetwopartsofthedataaresubtlydifferent.ThisdiscrepancyisvisuallydisplayedinFigure2.7 ,whichshowsthecorrelationofSSTvaluesbetweenpairsofyears.Ifapairofyearshasapositivecorrelation,thenthelocationcorrespondingtothepairofyearsiscoloredwhite;otherwiseitiscoloredblack.(Seasonalvariationswereremovedfromthedatasince,otherwise,alltheyearswouldbehighlycorrelated.)Thereisadistinctchangeinbehaviorwherethedatahasbeenputtogetherin1983.Yearswithineachofthetwogroups,1958–1982and1983–1999,tendtohaveapositivecorrelationwithoneanother,butanegativecorrelationwithyearsintheothergroup.Thisdoesnotmeanthatthisdatashouldnotbeused,onlythattheanalystshouldconsiderthepotentialimpactofsuchdiscrepanciesonthedatamininganalysis.

Figure2.7.CorrelationofSSTdatabetweenpairsofyears.Whiteareasindicatepositivecorrelation.Blackareasindicatenegativecorrelation.

Duplicate Data A data set can include data objects that are duplicates, or almost duplicates, of one another. Many people receive duplicate mailings because they appear in a database multiple times under slightly different names. To detect and eliminate such duplicates, two main issues must be addressed. First, if there are two objects that actually represent a single object, then one or more values of corresponding attributes are usually different, and these inconsistent values must be resolved. Second, care needs to be taken to avoid accidentally combining data objects that are similar, but not duplicates, such as two distinct people with identical names. The term deduplication is often used to refer to the process of dealing with these issues.
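The following sketch shows a deliberately simplified deduplication step: records are grouped under a normalized name key and only the first record per key is kept. It ignores the two harder issues mentioned above (resolving conflicting attribute values and distinguishing different people who share a name); the records are invented.

```python
# A minimal, simplified deduplication sketch: merge records whose normalized names match.
records = [
    {"name": "J. Smith", "city": "Minneapolis"},
    {"name": "j smith",  "city": "Minneapolis"},
    {"name": "A. Jones", "city": "Chicago"},
]

def normalized_key(record):
    # Lowercase and strip punctuation/whitespace so "J. Smith" and "j smith" collide.
    return "".join(ch for ch in record["name"].lower() if ch.isalnum())

deduplicated = {}
for rec in records:
    deduplicated.setdefault(normalized_key(rec), rec)   # keep the first record seen per key

print(list(deduplicated.values()))   # two records remain after deduplication
```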

Insomecases,twoormoreobjectsareidenticalwithrespecttotheattributesmeasuredbythedatabase,buttheystillrepresentdifferentobjects.Here,theduplicatesarelegitimate,butcanstillcauseproblemsforsomealgorithmsifthepossibilityofidenticalobjectsisnotspecificallyaccountedforintheirdesign.AnexampleofthisisgiveninExercise13 onpage108.

2.2.2 Issues Related to Applications

Dataqualityissuescanalsobeconsideredfromanapplicationviewpointasexpressedbythestatement“dataisofhighqualityifitissuitableforitsintendeduse.”Thisapproachtodataqualityhasprovenquiteuseful,particularlyinbusinessandindustry.Asimilarviewpointisalsopresentinstatisticsandtheexperimentalsciences,withtheiremphasisonthecarefuldesignofexperimentstocollectthedatarelevanttoaspecifichypothesis.Aswithqualityissuesatthemeasurementanddatacollectionlevel,manyissuesarespecifictoparticularapplicationsandfields.Again,weconsideronlyafewofthegeneralissues.

Timeliness

Somedatastartstoageassoonasithasbeencollected.Inparticular,ifthedataprovidesasnapshotofsomeongoingphenomenonorprocess,suchasthepurchasingbehaviorofcustomersorwebbrowsingpatterns,thenthissnapshotrepresentsrealityforonlyalimitedtime.Ifthedataisoutofdate,thensoarethemodelsandpatternsthatarebasedonit.

Relevance

Theavailabledatamustcontaintheinformationnecessaryfortheapplication.Considerthetaskofbuildingamodelthatpredictstheaccidentratefordrivers.Ifinformationabouttheageandgenderofthedriverisomitted,thenitislikelythatthemodelwillhavelimitedaccuracyunlessthisinformationisindirectlyavailablethroughotherattributes.

Makingsurethattheobjectsinadatasetarerelevantisalsochallenging.Acommonproblemissamplingbias,whichoccurswhenasampledoesnotcontaindifferenttypesofobjectsinproportiontotheiractualoccurrenceinthepopulation.Forexample,surveydatadescribesonlythosewhorespondtothesurvey.(OtheraspectsofsamplingarediscussedfurtherinSection2.3.2 .)Becausetheresultsofadataanalysiscanreflectonlythedatathatispresent,samplingbiaswilltypicallyleadtoerroneousresultswhenappliedtothebroaderpopulation.

KnowledgeabouttheData

Ideally,datasetsareaccompaniedbydocumentationthatdescribesdifferentaspectsofthedata;thequalityofthisdocumentationcaneitheraidorhinderthesubsequentanalysis.Forexample,ifthedocumentationidentifiesseveralattributesasbeingstronglyrelated,theseattributesarelikelytoprovidehighlyredundantinformation,andweusuallydecidetokeepjustone.(Considersalestaxandpurchaseprice.)Ifthedocumentationispoor,however,andfailstotellus,forexample,thatthemissingvaluesforaparticularfieldareindicatedwitha-9999,thenouranalysisofthedatamaybefaulty.Otherimportantcharacteristicsaretheprecisionofthedata,thetypeoffeatures(nominal,ordinal,interval,ratio),thescaleofmeasurement(e.g.,metersorfeetforlength),andtheoriginofthedata.

2.3DataPreprocessingInthissection,weconsiderwhichpreprocessingstepsshouldbeappliedtomakethedatamoresuitablefordatamining.Datapreprocessingisabroadareaandconsistsofanumberofdifferentstrategiesandtechniquesthatareinterrelatedincomplexways.Wewillpresentsomeofthemostimportantideasandapproaches,andtrytopointouttheinterrelationshipsamongthem.Specifically,wewilldiscussthefollowingtopics:

Aggregation
Sampling
Dimensionality reduction
Feature subset selection
Feature creation
Discretization and binarization
Variable transformation

Roughlyspeaking,thesetopicsfallintotwocategories:selectingdataobjectsandattributesfortheanalysisorforcreating/changingtheattributes.Inbothcases,thegoalistoimprovethedatamininganalysiswithrespecttotime,cost,andquality.Detailsareprovidedinthefollowingsections.

Aquicknoteaboutterminology:Inthefollowing,wesometimesusesynonymsforattribute,suchasfeatureorvariable,inordertofollowcommonusage.

2.3.1Aggregation

Sometimes“lessismore,”andthisisthecasewithaggregation,thecombiningoftwoormoreobjectsintoasingleobject.Consideradatasetconsistingoftransactions(dataobjects)recordingthedailysalesofproductsinvariousstorelocations(Minneapolis,Chicago,Paris,…)fordifferentdaysoverthecourseofayear.SeeTable2.4 .Onewaytoaggregatetransactionsforthisdatasetistoreplaceallthetransactionsofasinglestorewithasinglestorewidetransaction.Thisreducesthehundredsorthousandsoftransactionsthatoccurdailyataspecificstoretoasingledailytransaction,andthenumberofdataobjectsperdayisreducedtothenumberofstores.

Table2.4.Datasetcontaininginformationaboutcustomerpurchases.

TransactionID Item StoreLocation Date Price …

⋮ ⋮ ⋮ ⋮ ⋮

101123 Watch Chicago 09/06/04 $25.99 …

101123 Battery Chicago 09/06/04 $5.99 …

101124 Shoes Minneapolis 09/06/04 $75.00 …

Anobviousissueishowanaggregatetransactioniscreated;i.e.,howthevaluesofeachattributearecombinedacrossalltherecordscorrespondingtoaparticularlocationtocreatetheaggregatetransactionthatrepresentsthesalesofasinglestoreordate.Quantitativeattributes,suchasprice,aretypicallyaggregatedbytakingasumoranaverage.Aqualitativeattribute,suchasitem,caneitherbeomittedorsummarizedintermsofahigherlevelcategory,e.g.,televisionsversuselectronics.

ThedatainTable2.4 canalsobeviewedasamultidimensionalarray,whereeachattributeisadimension.Fromthisviewpoint,aggregationistheprocessofeliminatingattributes,suchasthetypeofitem,orreducingthe

numberofvaluesforaparticularattribute;e.g.,reducingthepossiblevaluesfordatefrom365daysto12months.ThistypeofaggregationiscommonlyusedinOnlineAnalyticalProcessing(OLAP).ReferencestoOLAParegiveninthebibliographicNotes.

Thereareseveralmotivationsforaggregation.First,thesmallerdatasetsresultingfromdatareductionrequirelessmemoryandprocessingtime,andhence,aggregationoftenenablestheuseofmoreexpensivedataminingalgorithms.Second,aggregationcanactasachangeofscopeorscalebyprovidingahigh-levelviewofthedatainsteadofalow-levelview.Inthepreviousexample,aggregatingoverstorelocationsandmonthsgivesusamonthly,perstoreviewofthedatainsteadofadaily,peritemview.Finally,thebehaviorofgroupsofobjectsorattributesisoftenmorestablethanthatofindividualobjectsorattributes.Thisstatementreflectsthestatisticalfactthataggregatequantities,suchasaveragesortotals,havelessvariabilitythantheindividualvaluesbeingaggregated.Fortotals,theactualamountofvariationislargerthanthatofindividualobjects(onaverage),butthepercentageofthevariationissmaller,whileformeans,theactualamountofvariationislessthanthatofindividualobjects(onaverage).Adisadvantageofaggregationisthepotentiallossofinterestingdetails.Inthestoreexample,aggregatingovermonthslosesinformationaboutwhichdayoftheweekhasthehighestsales.
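To make the mechanics concrete, here is a minimal sketch of the kind of aggregation described above, using pandas on a few rows modeled after Table 2.4. The DataFrame, column names, and the choice to sum prices and simply drop the item attribute are illustrative assumptions, not part of the text.

```python
# Roll daily, per-item transactions up into one transaction per store and date.
import pandas as pd

transactions = pd.DataFrame({
    "transaction_id": [101123, 101123, 101124],
    "item": ["Watch", "Battery", "Shoes"],
    "store_location": ["Chicago", "Chicago", "Minneapolis"],
    "date": ["09/06/04", "09/06/04", "09/06/04"],
    "price": [25.99, 5.99, 75.00],
})

# Quantitative attributes (price) are aggregated by summing; the qualitative
# attribute (item) is omitted here, as one of the options discussed above.
store_day = (transactions
             .groupby(["store_location", "date"], as_index=False)
             .agg(total_sales=("price", "sum"),
                  num_items=("item", "count")))
print(store_day)
```

The same groupby pattern gives the coarser views mentioned above (e.g., per store and month) simply by grouping on different attributes.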

Example2.7(AustralianPrecipitation). This example is based on precipitation in Australia from the period 1982–1993. Figure 2.8(a) shows a histogram for the standard deviation of average monthly precipitation for 3,030 0.5° by 0.5° grid cells in Australia, while Figure 2.8(b) shows a histogram for the standard deviation of the average yearly precipitation for the same locations. The average yearly precipitation has less variability than the average monthly precipitation. All precipitation measurements (and their standard deviations) are in centimeters.

Figure2.8.HistogramsofstandarddeviationformonthlyandyearlyprecipitationinAustraliafortheperiod1982–1993.

2.3.2Sampling

Samplingisacommonlyusedapproachforselectingasubsetofthedataobjectstobeanalyzed.Instatistics,ithaslongbeenusedforboththepreliminaryinvestigationofthedataandthefinaldataanalysis.Samplingcanalsobeveryusefulindatamining.However,themotivationsforsamplinginstatisticsanddataminingareoftendifferent.Statisticiansusesamplingbecauseobtainingtheentiresetofdataofinterestistooexpensiveortimeconsuming,whiledataminersusuallysamplebecauseitistoocomputationallyexpensiveintermsofthememoryortimerequiredtoprocess

allthedata.Insomecases,usingasamplingalgorithmcanreducethedatasizetothepointwhereabetter,butmorecomputationallyexpensivealgorithmcanbeused.

Thekeyprincipleforeffectivesamplingisthefollowing:Usingasamplewillworkalmostaswellasusingtheentiredatasetifthesampleisrepresentative.Inturn,asampleisrepresentativeifithasapproximatelythesameproperty(ofinterest)astheoriginalsetofdata.Ifthemean(average)ofthedataobjectsisthepropertyofinterest,thenasampleisrepresentativeifithasameanthatisclosetothatoftheoriginaldata.Becausesamplingisastatisticalprocess,therepresentativenessofanyparticularsamplewillvary,andthebestthatwecandoischooseasamplingschemethatguaranteesahighprobabilityofgettingarepresentativesample.Asdiscussednext,thisinvolveschoosingtheappropriatesamplesizeandsamplingtechnique.

SamplingApproachesTherearemanysamplingtechniques,butonlyafewofthemostbasiconesandtheirvariationswillbecoveredhere.Thesimplesttypeofsamplingissimplerandomsampling.Forthistypeofsampling,thereisanequalprobabilityofselectinganyparticularobject.Therearetwovariationsonrandomsampling(andothersamplingtechniquesaswell):(1)samplingwithoutreplacement—aseachobjectisselected,itisremovedfromthesetofallobjectsthattogetherconstitutethepopulation,and(2)samplingwithreplacement—objectsarenotremovedfromthepopulationastheyareselectedforthesample.Insamplingwithreplacement,thesameobjectcanbepickedmorethanonce.Thesamplesproducedbythetwomethodsarenotmuchdifferentwhensamplesarerelativelysmallcomparedtothedatasetsize,butsamplingwithreplacementissimplertoanalyzebecausetheprobabilityofselectinganyobjectremainsconstantduringthesamplingprocess.

Whenthepopulationconsistsofdifferenttypesofobjects,withwidelydifferentnumbersofobjects,simplerandomsamplingcanfailtoadequatelyrepresentthosetypesofobjectsthatarelessfrequent.Thiscancauseproblemswhentheanalysisrequiresproperrepresentationofallobjecttypes.Forexample,whenbuildingclassificationmodelsforrareclasses,itiscriticalthattherareclassesbeadequatelyrepresentedinthesample.Hence,asamplingschemethatcanaccommodatedifferingfrequenciesfortheobjecttypesofinterestisneeded.Stratifiedsampling,whichstartswithprespecifiedgroupsofobjects,issuchanapproach.Inthesimplestversion,equalnumbersofobjectsaredrawnfromeachgroupeventhoughthegroupsareofdifferentsizes.Inanothervariation,thenumberofobjectsdrawnfromeachgroupisproportionaltothesizeofthatgroup.
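The following short sketch (not from the text) illustrates the sampling variations just described with numpy: simple random sampling with and without replacement, and the simplest stratified scheme, which draws an equal number of objects from each group. The population, the labels, and the sample sizes are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
population = np.arange(1000)                              # data object indices
labels = np.where(population < 950, "common", "rare")     # two object types

# Simple random sampling, without and with replacement.
without_repl = rng.choice(population, size=50, replace=False)
with_repl = rng.choice(population, size=50, replace=True)

# Stratified sampling: equal numbers from each group, regardless of group size.
strata = {g: population[labels == g] for g in np.unique(labels)}
stratified = np.concatenate([rng.choice(objs, size=25, replace=False)
                             for objs in strata.values()])

print(np.sum(labels[without_repl] == "rare"))   # rare objects: typically only a few
print(np.sum(labels[stratified] == "rare"))     # rare objects guaranteed: 25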

Example2.8(SamplingandLossofInformation).Onceasamplingtechniquehasbeenselected,itisstillnecessarytochoosethesamplesize.Largersamplesizesincreasetheprobabilitythatasamplewillberepresentative,buttheyalsoeliminatemuchoftheadvantageofsampling.Conversely,withsmallersamplesizes,patternscanbemissedorerroneouspatternscanbedetected.Figure2.9(a)showsadatasetthatcontains8000two-dimensionalpoints,whileFigures2.9(b) and2.9(c) showsamplesfromthisdatasetofsize2000and500,respectively.Althoughmostofthestructureofthisdatasetispresentinthesampleof2000points,muchofthestructureismissinginthesampleof500points.

Figure2.9.Exampleofthelossofstructurewithsampling.

Example2.9(DeterminingtheProperSampleSize).Toillustratethatdeterminingthepropersamplesizerequiresamethodicalapproach,considerthefollowingtask.

Givenasetofdataconsistingofasmallnumberofalmostequalsizedgroups,findatleastone

representativepointforeachofthegroups.Assumethattheobjectsineachgrouparehighly

similartoeachother,butnotverysimilartoobjectsindifferentgroups.Figure2.10(a) shows

anidealizedsetofclusters(groups)fromwhichthesepointsmightbedrawn.

Figure2.10.Findingrepresentativepointsfrom10groups.

Thisproblemcanbeefficientlysolvedusingsampling.Oneapproachistotakeasmallsampleofdatapoints,computethepairwisesimilaritiesbetweenpoints,andthenformgroupsofpointsthatarehighlysimilar.Thedesiredsetofrepresentativepointsisthenobtainedbytakingonepointfromeachofthesegroups.Tofollowthisapproach,however,weneedtodetermineasamplesizethatwouldguarantee,withahighprobability,thedesiredoutcome;thatis,thatatleastonepointwillbeobtainedfromeachcluster.Figure2.10(b) showstheprobabilityofgettingoneobjectfromeachofthe10groupsasthesamplesizerunsfrom10to60.Interestingly,withasamplesizeof20,thereislittlechance(20%)ofgettingasamplethatincludesall10clusters.Evenwithasamplesizeof30,thereisstillamoderatechance(almost40%)ofgettingasamplethatdoesn’tcontainobjectsfromall10clusters.ThisissueisfurtherexploredinthecontextofclusteringbyExercise4 onpage603.
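A curve like the one in Figure 2.10(b) can be approximated with a quick Monte Carlo check. The sketch below is an illustrative assumption about the setup (10 equal groups of 100 points, sampling without replacement); it is not the book's computation, but it shows the same qualitative behavior: small samples frequently miss at least one group.

```python
import numpy as np

rng = np.random.default_rng(0)
n_groups, points_per_group, trials = 10, 100, 5000
group_of_point = np.repeat(np.arange(n_groups), points_per_group)

def prob_all_groups(sample_size):
    """Estimated probability that a random sample covers all 10 groups."""
    hits = 0
    for _ in range(trials):
        sample = rng.choice(group_of_point, size=sample_size, replace=False)
        hits += len(np.unique(sample)) == n_groups
    return hits / trials

for size in (10, 20, 30, 40, 50, 60):
    # Roughly 0.2 at size 20 and 0.6 at size 30, consistent with the discussion above.
    print(size, round(prob_all_groups(size), 2))
```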

ProgressiveSamplingThepropersamplesizecanbedifficulttodetermine,soadaptiveorprogressivesamplingschemesaresometimesused.Theseapproachesstartwithasmallsample,andthenincreasethesamplesizeuntilasampleofsufficientsizehasbeenobtained.Whilethistechniqueeliminatestheneedtodeterminethecorrectsamplesizeinitially,itrequiresthattherebeawaytoevaluatethesampletojudgeifitislargeenough.

Suppose,forinstance,thatprogressivesamplingisusedtolearnapredictivemodel.Althoughtheaccuracyofpredictivemodelsincreasesasthesamplesizeincreases,atsomepointtheincreaseinaccuracylevelsoff.Wewanttostopincreasingthesamplesizeatthisleveling-offpoint.Bykeepingtrackofthechangeinaccuracyofthemodelaswetakeprogressivelylargersamples,andbytakingothersamplesclosetothesizeofthecurrentone,wecangetanestimateofhowclosewearetothisleveling-offpoint,andthus,stopsampling.

2.3.3DimensionalityReduction

Datasetscanhavealargenumberoffeatures.Considerasetofdocuments,whereeachdocumentisrepresentedbyavectorwhosecomponentsarethefrequencieswithwhicheachwordoccursinthedocument.Insuchcases,therearetypicallythousandsortensofthousandsofattributes(components),oneforeachwordinthevocabulary.Asanotherexample,considerasetoftimeseriesconsistingofthedailyclosingpriceofvariousstocksoveraperiodof30years.Inthiscase,theattributes,whicharethepricesonspecificdays,againnumberinthethousands.

Thereareavarietyofbenefitstodimensionalityreduction.Akeybenefitisthatmanydataminingalgorithmsworkbetterifthedimensionality—thenumberofattributesinthedata—islower.Thisispartlybecausedimensionalityreductioncaneliminateirrelevantfeaturesandreducenoiseandpartlybecauseofthecurseofdimensionality,whichisexplainedbelow.Anotherbenefitisthatareductionofdimensionalitycanleadtoamoreunderstandablemodelbecausethemodelusuallyinvolvesfewerattributes.Also,dimensionalityreductionmayallowthedatatobemoreeasilyvisualized.Evenifdimensionalityreductiondoesn’treducethedatatotwoorthreedimensions,dataisoftenvisualizedbylookingatpairsortripletsofattributes,andthenumberofsuchcombinationsisgreatlyreduced.Finally,theamountoftimeandmemoryrequiredbythedataminingalgorithmisreducedwithareductionindimensionality.

Thetermdimensionalityreductionisoftenreservedforthosetechniquesthatreducethedimensionalityofadatasetbycreatingnewattributesthatareacombinationoftheoldattributes.Thereductionofdimensionalitybyselectingattributesthatareasubsetoftheoldisknownasfeaturesubsetselectionorfeatureselection.ItwillbediscussedinSection2.3.4 .

Intheremainderofthissection,webrieflyintroducetwoimportanttopics:thecurseofdimensionalityanddimensionalityreductiontechniquesbasedonlinearalgebraapproachessuchasprincipalcomponentsanalysis(PCA).MoredetailsondimensionalityreductioncanbefoundinAppendixB.

TheCurseofDimensionalityThecurseofdimensionalityreferstothephenomenonthatmanytypesofdataanalysisbecomesignificantlyharderasthedimensionalityofthedataincreases.Specifically,asdimensionalityincreases,thedatabecomesincreasinglysparseinthespacethatitoccupies.Thus,thedataobjectswe

observearequitepossiblynotarepresentativesampleofallpossibleobjects.Forclassification,thiscanmeanthattherearenotenoughdataobjectstoallowthecreationofamodelthatreliablyassignsaclasstoallpossibleobjects.Forclustering,thedifferencesindensityandinthedistancesbetweenpoints,whicharecriticalforclustering,becomelessmeaningful.(ThisisdiscussedfurtherinSections8.1.2,8.4.6,and8.4.8.)Asaresult,manyclusteringandclassificationalgorithms(andotherdataanalysisalgorithms)havetroublewithhigh-dimensionaldataleadingtoreducedclassificationaccuracyandpoorqualityclusters.

LinearAlgebraTechniquesforDimensionalityReductionSomeofthemostcommonapproachesfordimensionalityreduction,particularlyforcontinuousdata,usetechniquesfromlinearalgebratoprojectthedatafromahigh-dimensionalspaceintoalower-dimensionalspace.PrincipalComponentsAnalysis(PCA)isalinearalgebratechniqueforcontinuousattributesthatfindsnewattributes(principalcomponents)that(1)arelinearcombinationsoftheoriginalattributes,(2)areorthogonal(perpendicular)toeachother,and(3)capturethemaximumamountofvariationinthedata.Forexample,thefirsttwoprincipalcomponentscaptureasmuchofthevariationinthedataasispossiblewithtwoorthogonalattributesthatarelinearcombinationsoftheoriginalattributes.SingularValueDecomposition(SVD)isalinearalgebratechniquethatisrelatedtoPCAandisalsocommonlyusedfordimensionalityreduction.Foradditionaldetails,seeAppendicesAandB.
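As a small, self-contained illustration of the idea (not an example from the text), the sketch below applies PCA to synthetic 10-dimensional continuous data built from two latent factors, so that two orthogonal linear combinations capture nearly all of the variation. The data-generation scheme and parameter choices are assumptions made for the demonstration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 200 objects, 10 correlated attributes derived from 2 latent factors plus noise.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 10))

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)        # new attributes: orthogonal linear combinations
print(X_reduced.shape)                  # (200, 2)
print(pca.explained_variance_ratio_)    # fraction of variation captured per component
```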

2.3.4FeatureSubsetSelection

Anotherwaytoreducethedimensionalityistouseonlyasubsetofthefeatures.Whileitmightseemthatsuchanapproachwouldloseinformation,thisisnotthecaseifredundantandirrelevantfeaturesarepresent.Redundantfeaturesduplicatemuchoralloftheinformationcontainedinoneormoreotherattributes.Forexample,thepurchasepriceofaproductandtheamountofsalestaxpaidcontainmuchofthesameinformation.Irrelevantfeaturescontainalmostnousefulinformationforthedataminingtaskathand.Forinstance,students’IDnumbersareirrelevanttothetaskofpredictingstudents’gradepointaverages.Redundantandirrelevantfeaturescanreduceclassificationaccuracyandthequalityoftheclustersthatarefound.

While some irrelevant and redundant attributes can be eliminated immediately by using common sense or domain knowledge, selecting the best subset of features frequently requires a systematic approach. The ideal approach to feature selection is to try all possible subsets of features as input to the data mining algorithm of interest, and then take the subset that produces the best results. This method has the advantage of reflecting the objective and bias of the data mining algorithm that will eventually be used. Unfortunately, since the number of subsets involving n attributes is 2^n, such an approach is impractical

inmostsituationsandalternativestrategiesareneeded.Therearethreestandardapproachestofeatureselection:embedded,filter,andwrapper.

Embeddedapproaches

Featureselectionoccursnaturallyaspartofthedataminingalgorithm.Specifically,duringtheoperationofthedataminingalgorithm,thealgorithmitselfdecideswhichattributestouseandwhichtoignore.Algorithmsforbuildingdecisiontreeclassifiers,whicharediscussedinChapter3 ,oftenoperateinthismanner.


Filterapproaches

Featuresareselectedbeforethedataminingalgorithmisrun,usingsomeapproachthatisindependentofthedataminingtask.Forexample,wemightselectsetsofattributeswhosepairwisecorrelationisaslowaspossiblesothattheattributesarenon-redundant.

Wrapperapproaches

Thesemethodsusethetargetdataminingalgorithmasablackboxtofindthebestsubsetofattributes,inawaysimilartothatoftheidealalgorithmdescribedabove,buttypicallywithoutenumeratingallpossiblesubsets.

Becausetheembeddedapproachesarealgorithm-specific,onlythefilterandwrapperapproacheswillbediscussedfurtherhere.
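One plausible filter-style sketch, in the spirit of the correlation criterion mentioned above, is shown below: drop one attribute of any pair whose absolute correlation exceeds a threshold, before the data mining algorithm is ever run. The data set, the 0.95 threshold, and the keep-the-first rule are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
price = rng.uniform(10, 100, size=200)
df = pd.DataFrame({
    "price": price,
    "sales_tax": 0.07 * price,               # redundant with price
    "weight": rng.uniform(1, 5, size=200),   # unrelated attribute
})

corr = df.corr().abs()
cols = list(df.columns)
to_drop = set()
for i in range(len(cols)):
    for j in range(i + 1, len(cols)):
        if corr.iloc[i, j] > 0.95 and cols[j] not in to_drop:
            to_drop.add(cols[j])             # keep the first attribute of the pair

print(sorted(to_drop))                                   # ['sales_tax']
print(df.drop(columns=sorted(to_drop)).columns.tolist())
```

A wrapper approach would instead score each candidate subset by actually running the target algorithm, as described in the architecture that follows.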

AnArchitectureforFeatureSubsetSelectionItispossibletoencompassboththefilterandwrapperapproacheswithinacommonarchitecture.Thefeatureselectionprocessisviewedasconsistingoffourparts:ameasureforevaluatingasubset,asearchstrategythatcontrolsthegenerationofanewsubsetoffeatures,astoppingcriterion,andavalidationprocedure.Filtermethodsandwrappermethodsdifferonlyinthewayinwhichtheyevaluateasubsetoffeatures.Forawrappermethod,subsetevaluationusesthetargetdataminingalgorithm,whileforafilterapproach,theevaluationtechniqueisdistinctfromthetargetdataminingalgorithm.Thefollowingdiscussionprovidessomedetailsofthisapproach,whichissummarizedinFigure2.11 .

Figure2.11.Flowchartofafeaturesubsetselectionprocess.

Conceptually,featuresubsetselectionisasearchoverallpossiblesubsetsoffeatures.Manydifferenttypesofsearchstrategiescanbeused,butthesearchstrategyshouldbecomputationallyinexpensiveandshouldfindoptimalornearoptimalsetsoffeatures.Itisusuallynotpossibletosatisfybothrequirements,andthus,trade-offsarenecessary.

Anintegralpartofthesearchisanevaluationsteptojudgehowthecurrentsubsetoffeaturescomparestoothersthathavebeenconsidered.Thisrequiresanevaluationmeasurethatattemptstodeterminethegoodnessofasubsetofattributeswithrespecttoaparticulardataminingtask,suchasclassificationorclustering.Forthefilterapproach,suchmeasuresattempttopredicthowwelltheactualdataminingalgorithmwillperformonagivensetofattributes.Forthewrapperapproach,whereevaluationconsistsofactuallyrunningthetargetdataminingalgorithm,thesubsetevaluationfunctionissimplythecriterionnormallyusedtomeasuretheresultofthedatamining.

Becausethenumberofsubsetscanbeenormousanditisimpracticaltoexaminethemall,somesortofstoppingcriterionisnecessary.Thisstrategyisusuallybasedononeormoreconditionsinvolvingthefollowing:thenumberofiterations,whetherthevalueofthesubsetevaluationmeasureisoptimalorexceedsacertainthreshold,whetherasubsetofacertainsizehasbeenobtained,andwhetheranyimprovementcanbeachievedbytheoptionsavailabletothesearchstrategy.

Finally,onceasubsetoffeatureshasbeenselected,theresultsofthetargetdataminingalgorithmontheselectedsubsetshouldbevalidated.Astraightforwardvalidationapproachistorunthealgorithmwiththefullsetoffeaturesandcomparethefullresultstoresultsobtainedusingthesubsetoffeatures.Hopefully,thesubsetoffeatureswillproduceresultsthatarebetterthanoralmostasgoodasthoseproducedwhenusingallfeatures.Anothervalidationapproachistouseanumberofdifferentfeatureselectionalgorithmstoobtainsubsetsoffeaturesandthencomparetheresultsofrunningthedataminingalgorithmoneachsubset.

FeatureWeightingFeatureweightingisanalternativetokeepingoreliminatingfeatures.Moreimportantfeaturesareassignedahigherweight,whilelessimportantfeaturesaregivenalowerweight.Theseweightsaresometimesassignedbasedondomainknowledgeabouttherelativeimportanceoffeatures.Alternatively,theycansometimesbedeterminedautomatically.Forexample,someclassificationschemes,suchassupportvectormachines(Chapter4 ),produceclassificationmodelsinwhicheachfeatureisgivenaweight.Featureswithlargerweightsplayamoreimportantroleinthemodel.Thenormalizationofobjectsthattakesplacewhencomputingthecosinesimilarity(Section2.4.5 )canalsoberegardedasatypeoffeatureweighting.

2.3.5FeatureCreation

Itisfrequentlypossibletocreate,fromtheoriginalattributes,anewsetofattributesthatcapturestheimportantinformationinadatasetmuchmoreeffectively.Furthermore,thenumberofnewattributescanbesmallerthanthenumberoforiginalattributes,allowingustoreapallthepreviouslydescribedbenefitsofdimensionalityreduction.Tworelatedmethodologiesforcreatingnewattributesaredescribednext:featureextractionandmappingthedatatoanewspace.

FeatureExtractionThecreationofanewsetoffeaturesfromtheoriginalrawdataisknownasfeatureextraction.Considerasetofphotographs,whereeachphotographistobeclassifiedaccordingtowhetheritcontainsahumanface.Therawdataisasetofpixels,andassuch,isnotsuitableformanytypesofclassificationalgorithms.However,ifthedataisprocessedtoprovidehigher-levelfeatures,suchasthepresenceorabsenceofcertaintypesofedgesandareasthatarehighlycorrelatedwiththepresenceofhumanfaces,thenamuchbroadersetofclassificationtechniquescanbeappliedtothisproblem.

Unfortunately,inthesenseinwhichitismostcommonlyused,featureextractionishighlydomain-specific.Foraparticularfield,suchasimageprocessing,variousfeaturesandthetechniquestoextractthemhavebeendevelopedoveraperiodoftime,andoftenthesetechniqueshavelimitedapplicabilitytootherfields.Consequently,wheneverdataminingisappliedtoarelativelynewarea,akeytaskisthedevelopmentofnewfeaturesandfeatureextractionmethods.

Althoughfeatureextractionisoftencomplicated,Example2.10 illustratesthatitcanberelativelystraightforward.

Example2.10(Density).Consideradatasetconsistingofinformationabouthistoricalartifacts,which,alongwithotherinformation,containsthevolumeandmassofeachartifact.Forsimplicity,assumethattheseartifactsaremadeofasmallnumberofmaterials(wood,clay,bronze,gold)andthatwewanttoclassifytheartifactswithrespecttothematerialofwhichtheyaremade.Inthiscase,adensityfeatureconstructedfromthemassandvolumefeatures,i.e.,density=mass/volume,wouldmostdirectlyyieldanaccurateclassification.Althoughtherehavebeensomeattemptstoautomaticallyperformsuchsimplefeatureextractionbyexploringbasicmathematicalcombinationsofexistingattributes,themostcommonapproachistoconstructfeaturesusingdomainexpertise.

MappingtheDatatoaNewSpaceAtotallydifferentviewofthedatacanrevealimportantandinterestingfeatures.Consider,forexample,timeseriesdata,whichoftencontainsperiodicpatterns.Ifthereisonlyasingleperiodicpatternandnotmuchnoise,thenthepatterniseasilydetected.If,ontheotherhand,thereareanumberofperiodicpatternsandasignificantamountofnoise,thenthesepatternsarehardtodetect.Suchpatternscan,nonetheless,oftenbedetectedbyapplyingaFouriertransformtothetimeseriesinordertochangetoarepresentationinwhichfrequencyinformationisexplicit.InExample2.11 ,itwillnotbenecessarytoknowthedetailsoftheFouriertransform.Itisenoughtoknowthat,foreachtimeseries,theFouriertransformproducesanewdataobjectwhoseattributesarerelatedtofrequencies.

Example2.11(FourierAnalysis).ThetimeseriespresentedinFigure2.12(b) isthesumofthreeothertimeseries,twoofwhichareshowninFigure2.12(a) andhavefrequenciesof7and17cyclespersecond,respectively.Thethirdtimeseriesisrandomnoise.Figure2.12(c) showsthepowerspectrumthatcanbecomputedafterapplyingaFouriertransformtotheoriginaltimeseries.(Informally,thepowerspectrumisproportionaltothesquareofeachfrequencyattribute.)Inspiteofthenoise,therearetwopeaksthatcorrespondtotheperiodsofthetwooriginal,non-noisytimeseries.Again,themainpointisthatbetterfeaturescanrevealimportantaspectsofthedata.

Figure2.12.ApplicationoftheFouriertransformtoidentifytheunderlyingfrequenciesintimeseriesdata.

Manyothersortsoftransformationsarealsopossible.BesidestheFouriertransform,thewavelettransformhasalsoprovenveryusefulfortimeseriesandothertypesofdata.
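The sketch below reproduces the spirit of Example 2.11 on synthetic data (the sampling rate, amplitudes, and noise level are assumptions, not the book's): two sinusoids at 7 and 17 cycles per second plus random noise are mapped to the frequency domain, where the two underlying frequencies appear as the dominant peaks of the power spectrum.

```python
import numpy as np

rng = np.random.default_rng(0)
sample_rate, duration = 100, 4                      # 100 samples/second for 4 seconds
t = np.arange(0, duration, 1 / sample_rate)
signal = (np.sin(2 * np.pi * 7 * t)
          + np.sin(2 * np.pi * 17 * t)
          + rng.normal(scale=0.5, size=t.size))     # random noise

spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(t.size, d=1 / sample_rate)
power = np.abs(spectrum) ** 2                       # power spectrum

# The two largest peaks (ignoring the DC term) sit at roughly 7 and 17 Hz.
top = freqs[np.argsort(power[1:])[-2:] + 1]
print(np.sort(top))                                 # [ 7. 17.]
```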

2.3.6DiscretizationandBinarization

Somedataminingalgorithms,especiallycertainclassificationalgorithms,requirethatthedatabeintheformofcategoricalattributes.Algorithmsthatfindassociationpatternsrequirethatthedatabeintheformofbinaryattributes.Thus,itisoftennecessarytotransformacontinuousattributeintoacategoricalattribute(discretization),andbothcontinuousanddiscreteattributesmayneedtobetransformedintooneormorebinaryattributes(binarization).Additionally,ifacategoricalattributehasalargenumberofvalues(categories),orsomevaluesoccurinfrequently,thenitcanbebeneficialforcertaindataminingtaskstoreducethenumberofcategoriesbycombiningsomeofthevalues.

Aswithfeatureselection,thebestdiscretizationorbinarizationapproachistheonethat“producesthebestresultforthedataminingalgorithmthatwillbeusedtoanalyzethedata.”Itistypicallynotpracticaltoapplysuchacriteriondirectly.Consequently,discretizationorbinarizationisperformedinawaythatsatisfiesacriterionthatisthoughttohavearelationshiptogoodperformanceforthedataminingtaskbeingconsidered.Ingeneral,thebestdiscretizationdependsonthealgorithmbeingused,aswellastheotherattributesbeingconsidered.Typically,however,thediscretizationofeachattributeisconsideredinisolation.

Binarization

A simple technique to binarize a categorical attribute is the following: If there are m categorical values, then uniquely assign each original value to an integer in the interval [0, m-1]. If the attribute is ordinal, then order must be maintained by the assignment. (Note that even if the attribute is originally represented using integers, this process is necessary if the integers are not in the interval [0, m-1].) Next, convert each of these m integers to a binary number. Since ⌈log2(m)⌉ binary digits are required to represent these integers, represent these binary numbers using n = ⌈log2(m)⌉ binary attributes. To illustrate, a categorical variable with 5 values {awful, poor, OK, good, great} would require three binary variables x1, x2, and x3. The conversion is shown in Table 2.5.

Table 2.5. Conversion of a categorical attribute to three binary attributes.

Categorical Value  Integer Value  x1  x2  x3
awful              0              0   0   0
poor               1              0   0   1
OK                 2              0   1   0
good               3              0   1   1
great              4              1   0   0

Such a transformation can cause complications, such as creating unintended relationships among the transformed attributes. For example, in Table 2.5, attributes x2 and x3 are correlated because information about the good value is encoded using both attributes. Furthermore, association analysis requires asymmetric binary attributes, where only the presence of the attribute (value = 1) is important. For association problems, it is therefore necessary to introduce one asymmetric binary attribute for each categorical value, as shown in Table 2.6. If the number of resulting attributes is too large, then the techniques described in the following sections can be used to reduce the number of categorical values before binarization.

Table 2.6. Conversion of a categorical attribute to five asymmetric binary attributes.

Categorical Value  Integer Value  x1  x2  x3  x4  x5
awful              0              1   0   0   0   0
poor               1              0   1   0   0   0
OK                 2              0   0   1   0   0
good               3              0   0   0   1   0
great              4              0   0   0   0   1
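A short sketch of both binarization schemes for the quality attribute above is given below; the pandas-based encoding is an illustrative assumption, not part of the text.

```python
import pandas as pd

quality = pd.Series(["awful", "poor", "OK", "good", "great"],
                    dtype=pd.CategoricalDtype(
                        ["awful", "poor", "OK", "good", "great"], ordered=True))

# Scheme 1 (Table 2.5): map to integers 0..m-1, then use n = ceil(log2(m)) bits.
codes = quality.cat.codes.to_numpy()
n_bits = 3
bits = pd.DataFrame({f"x{i+1}": (codes >> (n_bits - 1 - i)) & 1
                     for i in range(n_bits)})

# Scheme 2 (Table 2.6): one asymmetric binary attribute per categorical value.
one_hot = pd.get_dummies(quality, prefix="x")

print(bits)
print(one_hot.astype(int))
```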

Likewise,forassociationproblems,itcanbenecessarytoreplaceasinglebinaryattributewithtwoasymmetricbinaryattributes.Considerabinaryattributethatrecordsaperson’sgender,maleorfemale.Fortraditionalassociationrulealgorithms,thisinformationneedstobetransformedintotwoasymmetricbinaryattributes,onethatisa1onlywhenthepersonismaleandonethatisa1onlywhenthepersonisfemale.(Forasymmetricbinaryattributes,theinformationrepresentationissomewhatinefficientinthattwobitsofstoragearerequiredtorepresenteachbitofinformation.)

Discretization of Continuous Attributes

Discretization is typically applied to attributes that are used in classification or association analysis. Transformation of a continuous attribute to a categorical attribute involves two subtasks: deciding how many categories, n, to have and determining how to map the values of the continuous attribute to these categories. In the first step, after the values of the continuous attribute are sorted, they are then divided into n intervals by specifying n - 1 split points. In the second, rather trivial step, all the values in one interval are mapped to the same categorical value. Therefore, the problem of discretization is one of deciding how many split points to choose and where to place them. The result can be represented either as a set of intervals {(x_0, x_1], (x_1, x_2], ..., (x_{n-1}, x_n)}, where x_0 and x_n can be -∞ or +∞, respectively, or equivalently, as a series of inequalities x_0 < x ≤ x_1, ..., x_{n-1} < x < x_n.

Unsupervised Discretization

A basic distinction between discretization methods for classification is whether class information is used (supervised) or not (unsupervised). If class information is not used, then relatively simple approaches are common. For instance, the equal width approach divides the range of the attribute into a user-specified number of intervals each having the same width. Such an approach can be badly affected by outliers, and for that reason, an equal frequency (equal depth) approach, which tries to put the same number of objects into each interval, is often preferred. As another example of unsupervised discretization, a clustering method, such as K-means (see Chapter 7), can also be used. Finally, visually inspecting the data can sometimes be an effective approach.

Example 2.12 (Discretization Techniques). This example demonstrates how these approaches work on an actual data set. Figure 2.13(a) shows data points belonging to four different groups, along with two outliers, the large dots on either end. The techniques of the previous paragraph were applied to discretize the x values of these data points into four categorical values. (Points in the data set have a random y component to make it easy to see how many points are in each group.) Visually inspecting the data works quite well, but is not automatic, and thus, we focus on the other three approaches. The split points produced by the techniques equal width, equal frequency, and K-means are shown in Figures 2.13(b), 2.13(c), and 2.13(d), respectively. The split points are represented as dashed lines.

Figure2.13.Differentdiscretizationtechniques.

Inthisparticularexample,ifwemeasuretheperformanceofadiscretizationtechniquebytheextenttowhichdifferentobjectsthatclumptogetherhavethesamecategoricalvalue,thenK-meansperformsbest,followedbyequalfrequency,andfinally,equalwidth.Moregenerally,thebestdiscretizationwilldependontheapplicationandofteninvolvesdomain-specificdiscretization.Forexample,thediscretizationofpeopleintolowincome,middleincome,andhighincomeisbasedoneconomicfactors.
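The sketch below computes split points for the three unsupervised methods compared in Example 2.12, on synthetic data rather than the book's. The data layout (four groups plus two outliers), the four-bin setting, and the choice to place K-means split points midway between adjacent cluster centers are assumptions made for the illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(loc, 0.5, 50) for loc in (2, 6, 10, 14)]
                   + [np.array([-5.0, 21.0])])           # four groups + two outliers

n_bins = 4
equal_width = np.linspace(x.min(), x.max(), n_bins + 1)[1:-1]
equal_freq = np.quantile(x, np.linspace(0, 1, n_bins + 1))[1:-1]

centers = np.sort(KMeans(n_clusters=n_bins, n_init=10, random_state=0)
                  .fit(x.reshape(-1, 1)).cluster_centers_.ravel())
kmeans_splits = (centers[:-1] + centers[1:]) / 2          # midpoints between centers

print("equal width:    ", equal_width.round(2))           # dragged around by outliers
print("equal frequency:", equal_freq.round(2))
print("K-means:        ", kmeans_splits.round(2))         # tracks the four groups
```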

SupervisedDiscretization

Ifclassificationisourapplicationandclasslabelsareknownforsomedataobjects,thendiscretizationapproachesthatuseclasslabelsoftenproducebetterclassification.Thisshouldnotbesurprising,sinceanintervalconstructedwithnoknowledgeofclasslabelsoftencontainsamixtureofclasslabels.Aconceptuallysimpleapproachistoplacethesplitsinawaythatmaximizesthepurityoftheintervals,i.e.,theextenttowhichanintervalcontainsasingleclasslabel.Inpractice,however,suchanapproachrequirespotentiallyarbitrarydecisionsaboutthepurityofanintervalandtheminimumsizeofaninterval.

Toovercomesuchconcerns,somestatisticallybasedapproachesstartwitheachattributevalueinaseparateintervalandcreatelargerintervalsbymergingadjacentintervalsthataresimilaraccordingtoastatisticaltest.Analternativetothisbottom-upapproachisatop-downapproachthatstartsbybisectingtheinitialvaluessothattheresultingtwointervalsgiveminimumentropy.Thistechniqueonlyneedstoconsidereachvalueasapossiblesplitpoint,becauseitisassumedthatintervalscontainorderedsetsofvalues.Thesplittingprocessisthenrepeatedwithanotherinterval,typicallychoosingtheintervalwiththeworst(highest)entropy,untilauser-specifiednumberofintervalsisreached,orastoppingcriterionissatisfied.

Entropy-based approaches are one of the most promising approaches to discretization, whether bottom-up or top-down. First, it is necessary to define entropy. Let k be the number of different class labels, m_i be the number of values in the i-th interval of a partition, and m_ij be the number of values of class j in interval i. Then the entropy e_i of the i-th interval is given by the equation

e_i = -\sum_{j=1}^{k} p_{ij} \log_2 p_{ij},

where p_{ij} = m_{ij}/m_i is the probability (fraction of values) of class j in the interval. The total entropy, e, of the partition is the weighted average of the individual interval entropies, i.e.,

e = \sum_{i=1}^{n} w_i e_i,

where m is the number of values, w_i = m_i/m is the fraction of values in the i-th interval, and n is the number of intervals. Intuitively, the entropy of an interval is a measure of the purity of an interval. If an interval contains only values of one class (is perfectly pure), then the entropy is 0 and it contributes nothing to the overall entropy. If the classes of values in an interval occur equally often (the interval is as impure as possible), then the entropy is a maximum.

Example 2.13 (Discretization of Two Attributes). The top-down method based on entropy was used to independently discretize both the x and y attributes of the two-dimensional data shown in Figure 2.14. In the first discretization, shown in Figure 2.14(a), the x and y attributes were both split into three intervals. (The dashed lines indicate the split points.) In the second discretization, shown in Figure 2.14(b), the x and y attributes were both split into five intervals.

Figure2.14.Discretizingxandyattributesforfourgroups(classes)ofpoints.

Thissimpleexampleillustratestwoaspectsofdiscretization.First,intwodimensions,theclassesofpointsarewellseparated,butinonedimension,thisisnotso.Ingeneral,discretizingeachattributeseparatelyoftenguaranteessuboptimalresults.Second,fiveintervalsworkbetterthanthree,butsixintervalsdonotimprovethediscretizationmuch,atleastintermsofentropy.(Entropyvaluesandresultsforsixintervalsarenotshown.)Consequently,itisdesirabletohaveastoppingcriterionthatautomaticallyfindstherightnumberofpartitions.
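The entropy equations above translate directly into code. The sketch below (hypothetical class counts, two classes, three intervals) computes the interval entropies e_i and the weighted total entropy e; it is a minimal illustration, not the book's splitting algorithm.

```python
import numpy as np

def interval_entropy(class_counts):
    """Entropy e_i of one interval from its per-class value counts m_ij."""
    m_i = sum(class_counts)
    p = np.array([c / m_i for c in class_counts if c > 0])
    return float(-(p * np.log2(p)).sum())

def total_entropy(partition):
    """Weighted average e of interval entropies; partition is a list of count lists."""
    m = sum(sum(counts) for counts in partition)
    return sum(sum(counts) / m * interval_entropy(counts) for counts in partition)

# A pure interval contributes 0; a maximally mixed two-class interval contributes 1 bit.
partition = [[10, 0], [5, 5], [2, 8]]
print([round(interval_entropy(c), 3) for c in partition])   # [0.0, 1.0, 0.722]
print(round(total_entropy(partition), 3))                   # weighted average
```

A top-down splitter would evaluate total_entropy for every candidate split point and keep the one with the lowest value, repeating on the worst interval until a stopping criterion is met.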

CategoricalAttributeswithTooManyValuesCategoricalattributescansometimeshavetoomanyvalues.Ifthecategoricalattributeisanordinalattribute,thentechniquessimilartothoseforcontinuousattributescanbeusedtoreducethenumberofcategories.Ifthecategoricalattributeisnominal,however,thenotherapproachesareneeded.Considera

universitythathasalargenumberofdepartments.Consequently,adepartmentnameattributemighthavedozensofdifferentvalues.Inthissituation,wecoulduseourknowledgeoftherelationshipsamongdifferentdepartmentstocombinedepartmentsintolargergroups,suchasengineering,socialsciences,orbiologicalsciences.Ifdomainknowledgedoesnotserveasausefulguideorsuchanapproachresultsinpoorclassificationperformance,thenitisnecessarytouseamoreempiricalapproach,suchasgroupingvaluestogetheronlyifsuchagroupingresultsinimprovedclassificationaccuracyorachievessomeotherdataminingobjective.

2.3.7VariableTransformation

Avariabletransformationreferstoatransformationthatisappliedtoallthevaluesofavariable.(Weusethetermvariableinsteadofattributetoadheretocommonusage,althoughwewillalsorefertoattributetransformationonoccasion.)Inotherwords,foreachobject,thetransformationisappliedtothevalueofthevariableforthatobject.Forexample,ifonlythemagnitudeofavariableisimportant,thenthevaluesofthevariablecanbetransformedbytakingtheabsolutevalue.Inthefollowingsection,wediscusstwoimportanttypesofvariabletransformations:simplefunctionaltransformationsandnormalization.

Simple Functions

For this type of variable transformation, a simple mathematical function is applied to each value individually. If x is a variable, then examples of such transformations include x^k, log x, e^x, \sqrt{x}, 1/x, sin x, or |x|. In statistics, variable transformations, especially sqrt, log, and 1/x, are often used to transform data that does not have a Gaussian (normal) distribution into data that does. While this can be important, other reasons often take precedence in data mining. Suppose the variable of interest is the number of data bytes in a session, and the number of bytes ranges from 1 to 1 billion. This is a huge range, and it can be advantageous to compress it by using a log_10 transformation. In this case, sessions that transferred 10^8 and 10^9 bytes would be more similar to each other than sessions that transferred 10 and 1000 bytes (9 - 8 = 1 versus 3 - 1 = 3). For some applications, such as network intrusion detection, this may be what is desired, since the first two sessions most likely represent transfers of large files, while the latter two sessions could be two quite distinct types of sessions.

Variable transformations should be applied with caution because they change the nature of the data. While this is what is desired, there can be problems if the nature of the transformation is not fully appreciated. For instance, the transformation 1/x reduces the magnitude of values that are 1 or larger, but increases the magnitude of values between 0 and 1. To illustrate, the values {1, 2, 3} go to {1, 1/2, 1/3}, but the values {1, 1/2, 1/3} go to {1, 2, 3}. Thus, for all sets of values, the transformation 1/x reverses the order. To help clarify the effect of a transformation, it is important to ask questions such as the following: What is the desired property of the transformed attribute? Does the order need to be maintained? Does the transformation apply to all values, especially negative values and 0? What is the effect of the transformation on the values between 0 and 1? Exercise 17 on page 109 explores other aspects of variable transformation.

Normalization or Standardization

The goal of standardization or normalization is to make an entire set of values have a particular property. A traditional example is that of "standardizing a variable" in statistics. If x̄ is the mean (average) of the attribute values and s_x is their standard deviation, then the transformation x′ = (x - x̄)/s_x creates a new

variablethathasameanof0andastandarddeviationof1.Ifdifferentvariablesaretobeusedtogether,e.g.,forclustering,thensuchatransformationisoftennecessarytoavoidhavingavariablewithlargevaluesdominatetheresultsoftheanalysis.Toillustrate,considercomparingpeoplebasedontwovariables:ageandincome.Foranytwopeople,thedifferenceinincomewilllikelybemuchhigherinabsoluteterms(hundredsorthousandsofdollars)thanthedifferenceinage(lessthan150).Ifthedifferencesintherangeofvaluesofageandincomearenottakenintoaccount,thenthecomparisonbetweenpeoplewillbedominatedbydifferencesinincome.Inparticular,ifthesimilarityordissimilarityoftwopeopleiscalculatedusingthesimilarityordissimilaritymeasuresdefinedlaterinthischapter,theninmanycases,suchasthatofEuclideandistance,theincomevalueswilldominatethecalculation.

The mean and standard deviation are strongly affected by outliers, so the above transformation is often modified. First, the mean is replaced by the median, i.e., the middle value. Second, the standard deviation is replaced by the absolute standard deviation. Specifically, if x is a variable, then the absolute standard deviation of x is given by σ_A = \sum_{i=1}^{m} |x_i - μ|, where x_i is the i-th value of the variable, m is the number of objects, and μ is either the mean or median. Other approaches for computing estimates of the location (center) and spread of a set of values in the presence of outliers are described in statistics books. These more robust measures can also be used to define a standardization transformation.
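The sketch below applies both versions of standardization discussed above to a small, made-up income vector containing one outlier: the classic (x - mean)/std transformation and the more robust variant built from the median and the absolute standard deviation σ_A as defined in the text. The data values are assumptions for illustration only.

```python
import numpy as np

income = np.array([30_000, 35_000, 32_000, 31_000, 36_000, 1_000_000])  # one outlier

# Classic standardization: mean 0, standard deviation 1, but distorted by the outlier.
z_score = (income - income.mean()) / income.std(ddof=1)

# Robust variant: median for location, absolute standard deviation for spread.
mu = np.median(income)
abs_dev = np.sum(np.abs(income - mu))          # sigma_A = sum |x_i - mu|
robust = (income - mu) / abs_dev

print(z_score.round(2))    # ordinary values get squeezed together by the outlier
print(robust.round(4))     # ordinary values remain distinguishable near 0
```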

2.4MeasuresofSimilarityandDissimilaritySimilarityanddissimilarityareimportantbecausetheyareusedbyanumberofdataminingtechniques,suchasclustering,nearestneighborclassification,andanomalydetection.Inmanycases,theinitialdatasetisnotneededoncethesesimilaritiesordissimilaritieshavebeencomputed.Suchapproachescanbeviewedastransformingthedatatoasimilarity(dissimilarity)spaceandthenperformingtheanalysis.Indeed,kernelmethodsareapowerfulrealizationofthisidea.ThesemethodsareintroducedinSection2.4.7 andarediscussedmorefullyinthecontextofclassificationinSection4.9.4.

Webeginwithadiscussionofthebasics:high-leveldefinitionsofsimilarityanddissimilarity,andadiscussionofhowtheyarerelated.Forconvenience,thetermproximityisusedtorefertoeithersimilarityordissimilarity.Sincetheproximitybetweentwoobjectsisafunctionoftheproximitybetweenthecorrespondingattributesofthetwoobjects,wefirstdescribehowtomeasuretheproximitybetweenobjectshavingonlyoneattribute.

Wethenconsiderproximitymeasuresforobjectswithmultipleattributes.ThisincludesmeasuressuchastheJaccardandcosinesimilaritymeasures,whichareusefulforsparsedata,suchasdocuments,aswellascorrelationandEuclideandistance,whichareusefulfornon-sparse(dense)data,suchastimeseriesormulti-dimensionalpoints.Wealsoconsidermutualinformation,whichcanbeappliedtomanytypesofdataandisgoodfordetectingnonlinearrelationships.Inthisdiscussion,werestrictourselvestoobjectswithrelativelyhomogeneousattributetypes,typicallybinaryorcontinuous.

Next,weconsiderseveralimportantissuesconcerningproximitymeasures.Thisincludeshowtocomputeproximitybetweenobjectswhentheyhaveheterogeneoustypesofattributes,andapproachestoaccountfordifferencesofscaleandcorrelationamongvariableswhencomputingdistancebetweennumericalobjects.Thesectionconcludeswithabriefdiscussionofhowtoselecttherightproximitymeasure.

Althoughthissectionfocusesonthecomputationofproximitybetweendataobjects,proximitycanalsobecomputedbetweenattributes.Forexample,forthedocument-termmatrixofFigure2.2(d) ,thecosinemeasurecanbeusedtocomputesimilaritybetweenapairofdocumentsorapairofterms(words).Knowingthattwovariablesarestronglyrelatedcan,forexample,behelpfulforeliminatingredundancy.Inparticular,thecorrelationandmutualinformationmeasuresdiscussedlaterareoftenusedforthatpurpose.

2.4.1Basics

DefinitionsInformally,thesimilaritybetweentwoobjectsisanumericalmeasureofthedegreetowhichthetwoobjectsarealike.Consequently,similaritiesarehigherforpairsofobjectsthataremorealike.Similaritiesareusuallynon-negativeandareoftenbetween0(nosimilarity)and1(completesimilarity).

Thedissimilaritybetweentwoobjectsisanumericalmeasureofthedegreetowhichthetwoobjectsaredifferent.Dissimilaritiesarelowerformoresimilarpairsofobjects.Frequently,thetermdistanceisusedasasynonymfordissimilarity,although,asweshallsee,distanceoftenreferstoaspecialclass

ofdissimilarities.Dissimilaritiessometimesfallintheinterval[0,1],butitisalsocommonforthemtorangefrom0to∞.

TransformationsTransformationsareoftenappliedtoconvertasimilaritytoadissimilarity,orviceversa,ortotransformaproximitymeasuretofallwithinaparticularrange,suchas[0,1].Forinstance,wemayhavesimilaritiesthatrangefrom1to10,buttheparticularalgorithmorsoftwarepackagethatwewanttousemaybedesignedtoworkonlywithdissimilarities,oritmayworkonlywithsimilaritiesintheinterval[0,1].Wediscusstheseissuesherebecausewewillemploysuchtransformationslaterinourdiscussionofproximity.Inaddition,theseissuesarerelativelyindependentofthedetailsofspecificproximitymeasures.

Frequently, proximity measures, especially similarities, are defined or transformed to have values in the interval [0, 1]. Informally, the motivation for this is to use a scale in which a proximity value indicates the fraction of similarity (or dissimilarity) between two objects. Such a transformation is often relatively straightforward. For example, if the similarities between objects range from 1 (not at all similar) to 10 (completely similar), we can make them fall within the range [0, 1] by using the transformation s′ = (s - 1)/9, where s and s′ are the original and new similarity values, respectively. In the more general case, the transformation of similarities to the interval [0, 1] is given by the expression s′ = (s - min_s)/(max_s - min_s), where max_s and min_s are the maximum and minimum similarity values, respectively. Likewise, dissimilarity measures with a finite range can be mapped to the interval [0, 1] by using the formula d′ = (d - min_d)/(max_d - min_d). This is an example of a linear transformation, which preserves the relative distances between points. In other words, if points x1 and x2 are twice as far apart as points x3 and x4, the same will be true after a linear transformation.

However, there can be complications in mapping proximity measures to the interval [0, 1] using a linear transformation. If, for example, the proximity measure originally takes values in the interval [0, ∞], then max_d is not defined and a nonlinear transformation is needed. Values will not have the same relationship to one another on the new scale. Consider the transformation d′ = d/(1 + d) for a dissimilarity measure that ranges from 0 to ∞. The dissimilarities 0, 0.5, 2, 10, 100, and 1000 will be transformed into the new dissimilarities 0, 0.33, 0.67, 0.90, 0.99, and 0.999, respectively. Larger values on the original dissimilarity scale are compressed into the range of values near 1, but whether this is desirable depends on the application.

Note that mapping proximity measures to the interval [0, 1] can also change the meaning of the proximity measure. For example, correlation, which is discussed later, is a measure of similarity that takes values in the interval [-1, 1]. Mapping these values to the interval [0, 1] by taking the absolute value loses information about the sign, which can be important in some applications. See Exercise 22 on page 111.

Transforming similarities to dissimilarities and vice versa is also relatively straightforward, although we again face the issues of preserving meaning and changing a linear scale into a nonlinear scale. If the similarity (or dissimilarity) falls in the interval [0, 1], then the dissimilarity can be defined as d = 1 - s (s = 1 - d). Another simple approach is to define similarity as the negative of the dissimilarity (or vice versa). To illustrate, the dissimilarities 0, 1, 10, and 100 can be transformed into the similarities 0, -1, -10, and -100, respectively.

The similarities resulting from the negation transformation are not restricted to the range [0, 1], but if that is desired, then transformations such as s = 1/(d + 1), s = e^{-d}, or s = 1 - (d - min_d)/(max_d - min_d) can be used. For the transformation s = 1/(d + 1), the dissimilarities 0, 1, 10, 100 are transformed into 1, 0.5, 0.09, 0.01, respectively. For s = e^{-d}, they become 1.00, 0.37, 0.00, 0.00, respectively, while for s = 1 - (d - min_d)/(max_d - min_d), they become 1.00, 0.99, 0.90, 0.00, respectively. In this discussion, we have focused on converting dissimilarities to similarities. Conversion in the opposite direction is considered in Exercise 23 on page 111.

Ingeneral,anymonotonicdecreasingfunctioncanbeusedtoconvertdissimilaritiestosimilarities,orviceversa.Ofcourse,otherfactorsalsomustbeconsideredwhentransformingsimilaritiestodissimilarities,orviceversa,orwhentransformingthevaluesofaproximitymeasuretoanewscale.Wehavementionedissuesrelatedtopreservingmeaning,distortionofscale,andrequirementsofdataanalysistools,butthislistiscertainlynotexhaustive.
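A tiny numerical check (not from the text) of the three dissimilarity-to-similarity transformations above, applied to the dissimilarities 0, 1, 10, and 100:

```python
import numpy as np

d = np.array([0.0, 1.0, 10.0, 100.0])

s_inverse = 1 / (d + 1)                              # 1.00, 0.50, 0.09, 0.01
s_exp = np.exp(-d)                                   # 1.00, 0.37, 0.00, 0.00
s_linear = 1 - (d - d.min()) / (d.max() - d.min())   # 1.00, 0.99, 0.90, 0.00

for name, s in [("1/(d+1)", s_inverse), ("exp(-d)", s_exp), ("linear", s_linear)]:
    print(name, s.round(2))
```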

2.4.2SimilarityandDissimilaritybetweenSimpleAttributes

Theproximityofobjectswithanumberofattributesistypicallydefinedbycombiningtheproximitiesofindividualattributes,andthus,wefirstdiscussproximitybetweenobjectshavingasingleattribute.Considerobjectsdescribedbyonenominalattribute.Whatwoulditmeanfortwosuchobjectstobesimilar?Becausenominalattributesconveyonlyinformationaboutthedistinctnessofobjects,allwecansayisthattwoobjectseitherhavethesamevalueortheydonot.Hence,inthiscasesimilarityistraditionallydefinedas1ifattributevaluesmatch,andas0otherwise.Adissimilaritywouldbedefinedintheoppositeway:0iftheattributevaluesmatch,and1iftheydonot.

Forobjectswithasingleordinalattribute,thesituationismorecomplicatedbecauseinformationaboutordershouldbetakenintoaccount.Consideran

attribute that measures the quality of a product, e.g., a candy bar, on the scale {poor, fair, OK, good, wonderful}. It would seem reasonable that a product, P1, which is rated wonderful, would be closer to a product P2, which is rated good, than it would be to a product P3, which is rated OK. To make this observation quantitative, the values of the ordinal attribute are often mapped to successive integers, beginning at 0 or 1, e.g., {poor = 0, fair = 1, OK = 2, good = 3, wonderful = 4}. Then, d(P1, P2) = 3 - 2 = 1 or, if we want the dissimilarity to fall between 0 and 1, d(P1, P2) = (3 - 2)/4 = 0.25. A similarity for ordinal attributes can then be defined as s = 1 - d.

This definition of similarity (dissimilarity) for an ordinal attribute should make the reader a bit uneasy since this assumes equal intervals between successive values of the attribute, and this is not necessarily so. Otherwise, we would have an interval or ratio attribute. Is the difference between the values fair and good really the same as that between the values OK and wonderful? Probably not, but in practice, our options are limited, and in the absence of more information, this is the standard approach for defining proximity between ordinal attributes.

For interval or ratio attributes, the natural measure of dissimilarity between two objects is the absolute difference of their values. For example, we might compare our current weight and our weight a year ago by saying "I am ten pounds heavier." In cases such as these, the dissimilarities typically range from 0 to ∞, rather than from 0 to 1. The similarity of interval or ratio attributes is typically expressed by transforming a dissimilarity into a similarity, as previously described.

Table 2.7 summarizes this discussion. In this table, x and y are two objects that have one attribute of the indicated type. Also, d(x, y) and s(x, y) are the dissimilarity and similarity between x and y, respectively. Other approaches are possible; these are the most common ones.

Table 2.7. Similarity and dissimilarity for simple attributes.

Attribute Type: Nominal
  Dissimilarity: d = 0 if x = y, d = 1 if x ≠ y
  Similarity: s = 1 if x = y, s = 0 if x ≠ y

Attribute Type: Ordinal (values mapped to integers 0 to n - 1, where n is the number of values)
  Dissimilarity: d = |x - y|/(n - 1)
  Similarity: s = 1 - d

Attribute Type: Interval or Ratio
  Dissimilarity: d = |x - y|
  Similarity: s = -d, s = 1/(1 + d), s = e^{-d}, or s = 1 - (d - min_d)/(max_d - min_d)

The following two sections consider more complicated measures of proximity between objects that involve multiple attributes: (1) dissimilarities between data objects and (2) similarities between data objects. This division allows us to more naturally display the underlying motivations for employing various proximity measures. We emphasize, however, that similarities can be transformed into dissimilarities and vice versa using the approaches described earlier.

2.4.3 Dissimilarities between Data Objects

In this section, we discuss various kinds of dissimilarities. We begin with a discussion of distances, which are dissimilarities with certain properties, and then provide examples of more general kinds of dissimilarities.

Distances

We first present some examples, and then offer a more formal description of distances in terms of the properties common to all distances. The Euclidean distance, d, between two points, x and y, in one-, two-, three-, or higher-dimensional space, is given by the following familiar formula:

d(x, y) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2},    (2.1)

where n is the number of dimensions and x_k and y_k are, respectively, the k-th attributes (components) of x and y. We illustrate this formula with Figure 2.15 and Tables 2.8 and 2.9, which show a set of points, the x and y coordinates of these points, and the distance matrix containing the pairwise distances of these points.

Figure 2.15. Four two-dimensional points.

The Euclidean distance measure given in Equation 2.1 is generalized by the Minkowski distance metric shown in Equation 2.2,

d(x, y) = \left( \sum_{k=1}^{n} |x_k - y_k|^r \right)^{1/r},    (2.2)

where r is a parameter. The following are the three most common examples of Minkowski distances.

r = 1. City block (Manhattan, taxicab, L1 norm) distance. A common example is the Hamming distance, which is the number of bits that is different between two objects that have only binary attributes, i.e., between two binary vectors.

r = 2. Euclidean distance (L2 norm).

r = ∞. Supremum (Lmax or L∞ norm) distance. This is the maximum difference between any attribute of the objects. More formally, the L∞ distance is defined by Equation 2.3

d(x, y) = \lim_{r \to \infty} \left( \sum_{k=1}^{n} |x_k - y_k|^r \right)^{1/r}.    (2.3)

The r parameter should not be confused with the number of dimensions (attributes) n. The Euclidean, Manhattan, and supremum distances are defined for all values of n: 1, 2, 3, ..., and specify different ways of combining the differences in each dimension (attribute) into an overall distance.

Tables 2.10 and 2.11, respectively, give the proximity matrices for the L1 and L∞ distances using data from Table 2.8. Notice that all these distance matrices are symmetric; i.e., the ij-th entry is the same as the ji-th entry. In Table 2.9, for instance, the fourth row of the first column and the fourth column of the first row both contain the value 5.1.

Table 2.8. x and y coordinates of four points.

point  x coordinate  y coordinate
p1     0             2
p2     2             0
p3     3             1
p4     5             1

Table2.9.EuclideandistancematrixforTable2.8 .

p1 p2 p3 p4

p1 0.0 2.8 3.2 5.1

p2 2.8 0.0 1.4 3.2

p3 3.2 1.4 0.0 2.0

p4 5.1 3.2 2.0 0.0

Table 2.10. L1 distance matrix for Table 2.8.

L1  p1   p2   p3   p4
p1  0.0  4.0  4.0  6.0
p2  4.0  0.0  2.0  4.0
p3  4.0  2.0  0.0  2.0
p4  6.0  4.0  2.0  0.0

Table 2.11. L∞ distance matrix for Table 2.8.

L∞  p1   p2   p3   p4
p1  0.0  2.0  3.0  5.0
p2  2.0  0.0  1.0  3.0
p3  3.0  1.0  0.0  2.0
p4  5.0  3.0  2.0  0.0
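The sketch below recomputes the three distance matrices in Tables 2.9 through 2.11 from the four points of Table 2.8 using the Minkowski distance of Equation 2.2; the helper function and its name are illustrative, not from the text.

```python
import numpy as np

points = np.array([[0, 2],   # p1
                   [2, 0],   # p2
                   [3, 1],   # p3
                   [5, 1]])  # p4

def minkowski_matrix(X, r):
    """Pairwise Minkowski distances of degree r between the rows of X."""
    diff = np.abs(X[:, None, :] - X[None, :, :])
    if np.isinf(r):
        return diff.max(axis=-1)                     # supremum (L_inf) distance
    return (diff ** r).sum(axis=-1) ** (1 / r)

print(minkowski_matrix(points, 2).round(1))          # Euclidean (L2), Table 2.9
print(minkowski_matrix(points, 1).round(1))          # city block (L1), Table 2.10
print(minkowski_matrix(points, np.inf).round(1))     # supremum (L_inf), Table 2.11
```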

Distances,suchastheEuclideandistance,havesomewell-knownproperties.Ifd(x,y)isthedistancebetweentwopoints,xandy,thenthefollowingpropertieshold.

1. Positivity
   (a) d(x, y) ≥ 0 for all x and y,
   (b) d(x, y) = 0 only if x = y.
2. Symmetry: d(x, y) = d(y, x) for all x and y.
3. Triangle Inequality: d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z.

Measures that satisfy all three properties are known as metrics. Some people use the term distance only for dissimilarity measures that satisfy these properties, but that practice is often violated. The three properties described here are useful, as well as mathematically pleasing. Also, if the triangle inequality holds, then this property can be used to increase the efficiency of techniques (including clustering) that depend on distances possessing this property. (See Exercise 25.) Nonetheless, many dissimilarities do not satisfy one or more of the metric properties. Example 2.14 illustrates such a measure.

Example 2.14 (Non-metric Dissimilarities: Set Differences). This example is based on the notion of the difference of two sets, as defined in set theory. Given two sets A and B, A - B is the set of elements of A that are not in B. For example, if A = {1, 2, 3, 4} and B = {2, 3, 4}, then A - B = {1} and B - A = ∅, the empty set. We can define the distance d between two sets A and B as d(A, B) = size(A - B), where size is a function returning the number of elements in a set. This distance measure, which is an integer value greater than or equal to 0, does not satisfy the second part of the positivity property, the symmetry property, or the triangle inequality. However, these properties can be made to hold if the dissimilarity measure is modified as follows: d(A, B) = size(A - B) + size(B - A). See Exercise 21 on page 110.

2.4.4SimilaritiesbetweenDataObjects

Forsimilarities,thetriangleinequality(ortheanalogousproperty)typicallydoesnothold,butsymmetryandpositivitytypicallydo.Tobeexplicit,ifs(x,y)isthesimilaritybetweenpointsxandy,thenthetypicalpropertiesofsimilaritiesarethefollowing:

1. s(x, y) = 1 only if x = y. (0 ≤ s ≤ 1)
2. s(x, y) = s(y, x) for all x and y. (Symmetry)

Thereisnogeneralanalogofthetriangleinequalityforsimilaritymeasures.Itissometimespossible,however,toshowthatasimilaritymeasurecaneasilybeconvertedtoametricdistance.ThecosineandJaccardsimilaritymeasures,whicharediscussedshortly,aretwoexamples.Also,forspecificsimilaritymeasures,itispossibletoderivemathematicalboundsonthesimilaritybetweentwoobjectsthataresimilarinspirittothetriangleinequality.

Example 2.15 (A Non-symmetric Similarity Measure). Consider an experiment in which people are asked to classify a small set of characters as they flash on a screen. The confusion matrix for this experiment records how often each character is classified as itself, and how often each is classified as another character. Using the confusion matrix, we can define a similarity measure between a character x and a character y as the number of times that x is misclassified as y, but note that this measure is not symmetric. For example, suppose that "0" appeared 200 times and was classified as a "0" 160 times, but as an "o" 40 times. Likewise, suppose that "o" appeared 200 times and was classified as an "o" 170 times, but as "0" only 30 times. Then, s(0, o) = 40, but s(o, 0) = 30.

In such situations, the similarity measure can be made symmetric by setting s′(x, y) = (s(x, y) + s(y, x))/2, where s′ indicates the new similarity measure.

2.4.5 Examples of Proximity Measures

This section provides specific examples of some similarity and dissimilarity measures.

Similarity Measures for Binary Data

Similarity measures between objects that contain only binary attributes are called similarity coefficients, and typically have values between 0 and 1. A value of 1 indicates that the two objects are completely similar, while a value of 0 indicates that the objects are not at all similar. There are many rationales for why one coefficient is better than another in specific instances.

Let x and y be two objects that consist of n binary attributes. The comparison of two such objects, i.e., two binary vectors, leads to the following four quantities (frequencies):

f00 = the number of attributes where x is 0 and y is 0
f01 = the number of attributes where x is 0 and y is 1
f10 = the number of attributes where x is 1 and y is 0
f11 = the number of attributes where x is 1 and y is 1

Simple Matching Coefficient

One commonly used similarity coefficient is the simple matching coefficient (SMC), which is defined as

SMC = (number of matching attribute values) / (number of attributes) = (f11 + f00) / (f01 + f10 + f11 + f00)    (2.4)

This measure counts both presences and absences equally. Consequently, the SMC could be used to find students who had answered questions similarly on a test that consisted only of true/false questions.

Jaccard Coefficient

Suppose that x and y are data objects that represent two rows (two transactions) of a transaction matrix (see Section 2.1.2). If each asymmetric binary attribute corresponds to an item in a store, then a 1 indicates that the item was purchased, while a 0 indicates that the product was not purchased. Because the number of products not purchased by any customer far outnumbers the number of products that were purchased, a similarity measure such as SMC would say that all transactions are very similar. As a result, the Jaccard coefficient is frequently used to handle objects consisting of asymmetric binary attributes. The Jaccard coefficient, which is often symbolized by J, is given by the following equation:

J = (number of matching presences) / (number of attributes not involved in 00 matches) = f11 / (f01 + f10 + f11)    (2.5)
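A minimal sketch (not from the text) that evaluates both coefficients on the two binary vectors used in Example 2.16 below:

```python
import numpy as np

x = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
y = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])

f11 = np.sum((x == 1) & (y == 1))
f00 = np.sum((x == 0) & (y == 0))
f10 = np.sum((x == 1) & (y == 0))
f01 = np.sum((x == 0) & (y == 1))

smc = (f11 + f00) / (f01 + f10 + f11 + f00)
jaccard = f11 / (f01 + f10 + f11)

print(smc)       # 0.7  -- dominated by the shared absences
print(jaccard)   # 0.0  -- no shared presences
```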

Example 2.16 (The SMC and Jaccard Similarity Coefficients). To illustrate the difference between these two similarity measures, we calculate SMC and J for the following two binary vectors.

x = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
y = (0, 0, 0, 0, 0, 0, 1, 0, 0, 1)

f01 = 2 (the number of attributes where x was 0 and y was 1)
f10 = 1 (the number of attributes where x was 1 and y was 0)
f00 = 7 (the number of attributes where x was 0 and y was 0)
f11 = 0 (the number of attributes where x was 1 and y was 1)

SMC = (f11 + f00) / (f01 + f10 + f11 + f00) = (0 + 7) / (2 + 1 + 0 + 7) = 0.7

J = f11 / (f01 + f10 + f11) = 0 / (2 + 1 + 0) = 0

Cosine Similarity

Documents are often represented as vectors, where each component (attribute) represents the frequency with which a particular term (word) occurs in the document. Even though documents have thousands or tens of thousands of attributes (terms), each document is sparse since it has relatively few nonzero attributes. Thus, as with transaction data, similarity should not depend on the number of shared 0 values because any two documents are likely to "not contain" many of the same words, and therefore, if 0–0 matches are counted, most documents will be highly similar to most other documents. Therefore, a similarity measure for documents needs to ignore 0–0 matches like the Jaccard measure, but also must be able to handle non-binary vectors. The cosine similarity, defined next, is one of the most common measures of document similarity. If x and y are two document vectors, then

cos(x, y) = ⟨x, y⟩ / (‖x‖ ‖y‖) = x′y / (‖x‖ ‖y‖),    (2.6)

where ′ indicates vector or matrix transpose and ⟨x, y⟩ indicates the inner product of the two vectors,

⟨x, y⟩ = \sum_{k=1}^{n} x_k y_k = x′y,    (2.7)

and ‖x‖ is the length of the vector x, ‖x‖ = \sqrt{\sum_{k=1}^{n} x_k^2} = \sqrt{⟨x, x⟩} = \sqrt{x′x}.

The inner product of two vectors works well for asymmetric attributes since it depends only on components that are non-zero in both vectors. Hence, the similarity between two documents depends only upon the words that appear in both of them.

Example 2.17 (Cosine Similarity between Two Document Vectors). This example calculates the cosine similarity for the following two data objects, which might represent document vectors:

x = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
y = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)

⟨x, y⟩ = 3×1 + 2×0 + 0×0 + 5×0 + 0×0 + 0×0 + 0×0 + 2×1 + 0×0 + 0×2 = 5
‖x‖ = \sqrt{3×3 + 2×2 + 0×0 + 5×5 + 0×0 + 0×0 + 0×0 + 2×2 + 0×0 + 0×0} = 6.48
‖y‖ = \sqrt{1×1 + 0×0 + 0×0 + 0×0 + 0×0 + 0×0 + 0×0 + 1×1 + 0×0 + 2×2} = 2.45
cos(x, y) = 5 / (6.48 × 2.45) = 0.31

As indicated by Figure 2.16, cosine similarity really is a measure of the (cosine of the) angle between x and y. Thus, if the cosine similarity is 1, the angle between x and y is 0°, and x and y are the same except for length. If the cosine similarity is 0, then the angle between x and y is 90°, and they do not share any terms (words).

Figure2.16.Geometricillustrationofthecosinemeasure.

Equation2.6 alsocanbewrittenasEquation2.8 .

where and Dividingxandybytheirlengthsnormalizesthemtohavealengthof1.Thismeansthatcosinesimilaritydoesnottakethelengthofthetwodataobjectsintoaccountwhencomputingsimilarity.(Euclideandistancemightbeabetterchoicewhenlengthisimportant.)Forvectorswithalengthof1,thecosinemeasurecanbecalculatedbytakingasimpleinnerproduct.Consequently,whenmanycosinesimilaritiesbetweenobjectsarebeingcomputed,normalizingtheobjectstohaveunitlengthcanreducethetimerequired.

ExtendedJaccardCoefficient(TanimotoCoefficient)TheextendedJaccardcoefficientcanbeusedfordocumentdataandthatreducestotheJaccardcoefficientinthecaseofbinaryattributes.Thiscoefficient,whichweshallrepresentasEJ,isdefinedbythefollowingequation:

cos(x,y)=⟨x∥x∥,y∥y∥⟩=⟨x′,y′⟩, (2.8)

x′=x/∥x∥ y′=y/∥y∥.

EJ(x,y)=⟨x,y⟩ǁxǁ2+ǁyǁ2−⟨x,y⟩=x′yǁxǁ2+ǁyǁ2−x′y. (2.9)
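As a quick illustration, the sketch below (my own code, assuming NumPy is available) computes the cosine similarity of Example 2.17 and the extended Jaccard coefficient of Equation 2.9 for the same document vectors:

import numpy as np

def cosine(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def extended_jaccard(x, y):
    dot = np.dot(x, y)
    return dot / (np.dot(x, x) + np.dot(y, y) - dot)   # Equation 2.9

x = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0])
y = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2])
print(round(cosine(x, y), 2))            # about 0.31, as in Example 2.17
print(round(extended_jaccard(x, y), 3))  # 5 / (42 + 6 - 5), about 0.116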

Correlation Correlation is frequently used to measure the linear relationship between two sets of values that are observed together. Thus, correlation can measure the relationship between two variables (height and weight) or between two objects (a pair of temperature time series). Correlation is used much more frequently to measure the similarity between attributes since the values in two data objects come from different attributes, which can have very different attribute types and scales. There are many types of correlation, and indeed correlation is sometimes used in a general sense to mean the relationship between two sets of values that are observed together. In this discussion, we will focus on a measure appropriate for numerical values.

Specifically, Pearson's correlation between two sets of numerical values, i.e., two vectors, x and y, is defined by the following equation:

corr(x, y) = covariance(x, y) / (standard_deviation(x) × standard_deviation(y)) = s_xy / (s_x s_y)   (2.10)

where we use the following standard statistical notation and definitions:

covariance(x, y) = s_xy = (1/(n−1)) Σ_{k=1}^{n} (x_k − x̄)(y_k − ȳ)   (2.11)

standard_deviation(x) = s_x = sqrt( (1/(n−1)) Σ_{k=1}^{n} (x_k − x̄)² )

standard_deviation(y) = s_y = sqrt( (1/(n−1)) Σ_{k=1}^{n} (y_k − ȳ)² )

x̄ = (1/n) Σ_{k=1}^{n} x_k is the mean of x

ȳ = (1/n) Σ_{k=1}^{n} y_k is the mean of y

Example 2.18 (Perfect Correlation). Correlation is always in the range −1 to 1. A correlation of 1 (−1) means that x and y have a perfect positive (negative) linear relationship; that is, x_k = a y_k + b, where a and b are constants. The following two vectors x and y illustrate cases where the correlation is −1 and +1, respectively. In the first case, the means of x and y were chosen to be 0, for simplicity.

x = (−3, 6, 0, 3, −6), y = (1, −2, 0, −1, 2), corr(x, y) = −1, x_k = −3 y_k
x = (3, 6, 0, 3, 6), y = (1, 2, 0, 1, 2), corr(x, y) = 1, x_k = 3 y_k

Example 2.19 (Nonlinear Relationships). If the correlation is 0, then there is no linear relationship between the two sets of values. However, nonlinear relationships can still exist. In the following example, y_k = x_k², but their correlation is 0.

x = (−3, −2, −1, 0, 1, 2, 3)
y = (9, 4, 1, 0, 1, 4, 9)

Example 2.20 (Visualizing Correlation). It is also easy to judge the correlation between two vectors x and y by plotting pairs of corresponding values of x and y in a scatter plot. Figure 2.17 shows a number of these scatter plots when x and y consist of a set of 30 pairs of values that are randomly generated (with a normal distribution) so that the correlation of x and y ranges from −1 to 1. Each circle in a plot represents one of the 30 pairs of x and y values; its x coordinate is the value of that pair for x, while its y coordinate is the value of the same pair for y.

Figure 2.17. Scatter plots illustrating correlations from −1 to 1.
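A small sketch of Equations 2.10 and 2.11, checked against Examples 2.18 and 2.19 (the helper function and its name are mine):

import numpy as np

def pearson(x, y):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    sxy = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)   # covariance, Equation 2.11
    return sxy / (x.std(ddof=1) * y.std(ddof=1))                   # Equation 2.10

print(pearson([-3, 6, 0, 3, -6], [1, -2, 0, -1, 2]))              # -1.0, perfect negative
print(pearson([3, 6, 0, 3, 6], [1, 2, 0, 1, 2]))                  #  1.0, perfect positive
print(pearson([-3, -2, -1, 0, 1, 2, 3], [9, 4, 1, 0, 1, 4, 9]))   #  0.0, even though y = x**2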

If we transform x and y by subtracting off their means and then normalizing them so that their lengths are 1, then their correlation can be calculated by taking the dot product. Let us refer to these transformed vectors of x and y as x′ and y′, respectively. (Notice that this transformation is not the same as the standardization used in other contexts, where we subtract the means and divide by the standard deviations, as discussed in Section 2.3.7.) This transformation highlights an interesting relationship between the correlation measure and the cosine measure. Specifically, the correlation between x and y is identical to the cosine between x′ and y′. However, the cosine between x and y is not the same as the cosine between x′ and y′, even though they both have the same correlation measure. In general, the correlation between two vectors is equal to the cosine measure only in the special case when the means of the two vectors are 0.

Differences Among Measures For Continuous Attributes In this section, we illustrate the difference among the three proximity measures for continuous attributes that we have just defined: cosine, correlation, and Minkowski distance. Specifically, we consider two types of data transformations that are commonly used, namely, scaling (multiplication) by a constant factor and translation (addition) by a constant value. A proximity measure is considered to be invariant to a data transformation if its value remains unchanged even after performing the transformation. Table 2.12 compares the behavior of cosine, correlation, and Minkowski distance measures regarding their invariance to scaling and translation operations. It can be seen that while correlation is invariant to both scaling and translation, cosine is only invariant to scaling but not to translation. Minkowski distance measures, on the other hand, are sensitive to both scaling and translation and are thus invariant to neither.

Table 2.12. Properties of cosine, correlation, and Minkowski distance measures.

Property  Cosine  Correlation  Minkowski Distance
Invariant to scaling (multiplication)  Yes  Yes  No
Invariant to translation (addition)  No  Yes  No
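The invariance properties in Table 2.12 are easy to verify numerically. The sketch below uses two small vectors of my own choosing (not from the text) and applies a scaling and a translation to one of them:

import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

x = np.array([1.0, 2.0, 4.0, 3.0])
y = np.array([2.0, 3.0, 5.0, 1.0])

for label, v in [("original y", y), ("scaled 2*y", 2 * y), ("translated y+5", y + 5)]:
    print(label,
          round(cosine(x, v), 4),              # changes only under translation
          round(np.corrcoef(x, v)[0, 1], 4),   # unchanged by scaling and translation
          round(np.linalg.norm(x - v), 4))     # changes under both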

Let us consider an example to demonstrate the significance of these differences among different proximity measures.

Example 2.21 (Comparing proximity measures). Consider the following two vectors x and y with seven numeric attributes:

x = (1, 2, 4, 3, 0, 0, 0)
y = (1, 2, 3, 4, 0, 0, 0)

It can be seen that both x and y have 4 non-zero values, and the values in the two vectors are mostly the same, except for the third and the fourth components. The cosine, correlation, and Euclidean distance between the two vectors can be computed as follows:

cos(x, y) = 29 / (√30 × √30) = 0.9667
correlation(x, y) = 2.3571 / (1.5811 × 1.5811) = 0.9429
Euclidean distance ∥x − y∥ = 1.4142

Not surprisingly, x and y have a cosine and correlation measure close to 1, while the Euclidean distance between them is small, indicating that they are quite similar. Now let us consider the vector ys, which is a scaled version of y (multiplied by a constant factor of 2), and the vector yt, which is constructed by translating y by 5 units, as follows:

ys = 2 × y = (2, 4, 6, 8, 0, 0, 0)
yt = y + 5 = (6, 7, 8, 9, 5, 5, 5)

We are interested in finding whether ys and yt show the same proximity with x as shown by the original vector y. Table 2.13 shows the different measures of proximity computed for the pairs (x, y), (x, ys), and (x, yt). It can be seen that the value of correlation between x and y remains unchanged even after replacing y with ys or yt. However, the value of cosine remains equal to 0.9667 when computed for (x, y) and (x, ys), but significantly reduces to 0.7940 when computed for (x, yt). This highlights the fact that cosine is invariant to the scaling operation but not to the translation operation, in contrast with the correlation measure. The Euclidean distance, on the other hand, shows different values for all three pairs of vectors, as it is sensitive to both scaling and translation.

Table 2.13. Similarity between (x, y), (x, ys), and (x, yt).

Measure  (x, y)  (x, ys)  (x, yt)
Cosine  0.9667  0.9667  0.7940
Correlation  0.9429  0.9429  0.9429
Euclidean Distance  1.4142  5.8310  14.2127

We can observe from this example that different proximity measures behave differently when scaling or translation operations are applied on the data. The choice of the right proximity measure thus depends on the desired notion of similarity between data objects that is meaningful for a given application. For example, if x and y represented the frequencies of different words in a document-term matrix, it would be meaningful to use a proximity measure that remains unchanged when y is replaced by ys, because ys is just a scaled version of y with the same distribution of words occurring in the document. However, yt is different from y, since it contains a large number of words with non-zero frequencies that do not occur in y. Because cosine is invariant to scaling but not to translation, it will be an ideal choice of proximity measure for this application.

Consider a different scenario in which x represents a location's temperature measured on the Celsius scale for seven days. Let y, ys, and yt be the temperatures measured on those days at a different location, but using three different measurement scales. Note that different units of temperature have different offsets (e.g., Celsius and Kelvin) and different scaling factors (e.g., Celsius and Fahrenheit). It is thus desirable to use a proximity measure that captures the proximity between temperature values without being affected by the measurement scale. Correlation would then be the ideal choice of proximity measure for this application, as it is invariant to both scaling and translation.

As another example, consider a scenario where x represents the amount of precipitation (in cm) measured at seven locations. Let y, ys, and yt be estimates of the precipitation at these locations, which are predicted using three different models. Ideally, we would like to choose a model that accurately reconstructs the measurements in x without making any error. It is evident that y provides a good approximation of the values in x, whereas ys and yt provide poor estimates of precipitation, even though they do capture the trend in precipitation across locations. Hence, we need to choose a proximity measure that penalizes any difference in the model estimates from the actual observations, and is sensitive to both the scaling and translation operations. The Euclidean distance satisfies this property and thus would be the right choice of proximity measure for this application. Indeed, the Euclidean distance is commonly used in computing the accuracy of models, which will be discussed later in Chapter 3.

2.4.6 Mutual Information

Mutual information is another measure of the similarity between two sets of paired values; like correlation, it is sometimes used as an alternative to correlation, particularly when a nonlinear relationship is suspected between the pairs of values. This measure comes from information theory, which is the

study of how to formally define and quantify information. Indeed, mutual information is a measure of how much information one set of values provides about another, given that the values come in pairs, e.g., height and weight. If the two sets of values are independent, i.e., the value of one tells us nothing about the other, then their mutual information is 0. On the other hand, if the two sets of values are completely dependent, i.e., knowing the value of one tells us the value of the other and vice-versa, then they have maximum mutual information. Mutual information does not have a maximum value, but we will define a normalized version of it that ranges between 0 and 1.

To define mutual information, we consider two sets of values, X and Y, which occur in pairs (X, Y). We need to measure the average information in a single set of values, i.e., either in X or in Y, and in the pairs of their values. This is commonly measured by entropy. More specifically, assume X and Y are discrete, that is, X can take m distinct values, u1, u2, ..., um, and Y can take n distinct values, v1, v2, ..., vn. Then their individual and joint entropy can be defined in terms of the probabilities of each value and pair of values as follows:

H(X) = −Σ_{j=1}^{m} P(X = u_j) log₂ P(X = u_j)   (2.12)

H(Y) = −Σ_{k=1}^{n} P(Y = v_k) log₂ P(Y = v_k)   (2.13)

H(X, Y) = −Σ_{j=1}^{m} Σ_{k=1}^{n} P(X = u_j, Y = v_k) log₂ P(X = u_j, Y = v_k)   (2.14)

where if the probability of a value or combination of values is 0, then 0 log₂(0) is conventionally taken to be 0.

The mutual information of X and Y can now be defined straightforwardly:

I(X, Y) = H(X) + H(Y) − H(X, Y)   (2.15)

Note that H(X, Y) is symmetric, i.e., H(X, Y) = H(Y, X), and thus mutual information is also symmetric, i.e., I(X, Y) = I(Y, X).

Practically, X and Y are either the values in two attributes or two rows of the same data set. In Example 2.22, we will represent those values as two vectors x and y and calculate the probability of each value or pair of values from the frequency with which values or pairs of values occur in x, y, and the pairs (x_i, y_i), where x_i is the ith component of x and y_i is the ith component of y. Let us illustrate using a previous example.

Example 2.22 (Evaluating Nonlinear Relationships with Mutual Information). Recall Example 2.19 where y_k = x_k², but their correlation was 0.

x = (−3, −2, −1, 0, 1, 2, 3)
y = (9, 4, 1, 0, 1, 4, 9)

From Tables 2.14, 2.15, and 2.16, I(x, y) = H(x) + H(y) − H(x, y) = 2.8074 + 1.9502 − 2.8074 = 1.9502. Although a variety of approaches to normalize mutual information are possible—see Bibliographic Notes—for this example, we will apply one that divides the mutual information by log₂(min(m, n)) and produces a result between 0 and 1. This yields a value of 1.9502/log₂(4) = 0.9751. Thus, we can see that x and y are strongly related. They are not perfectly related because, given a value of y, there is, except for y = 0, some ambiguity about the value of x. Notice that for y = −x, the normalized mutual information would be 1.

Figure 2.18. Computation of mutual information.

Table 2.14. Entropy for x

x_j  P(x = x_j)  −P(x = x_j) log₂ P(x = x_j)
−3  1/7  0.4011
−2  1/7  0.4011
−1  1/7  0.4011
0  1/7  0.4011
1  1/7  0.4011
2  1/7  0.4011
3  1/7  0.4011
H(x)  2.8074

Table 2.15. Entropy for y

y_k  P(y = y_k)  −P(y = y_k) log₂ P(y = y_k)
9  2/7  0.5164
4  2/7  0.5164
1  2/7  0.5164
0  1/7  0.4011
H(y)  1.9502

Table 2.16. Joint entropy for x and y

x_j  y_k  P(x = x_j, y = y_k)  −P(x = x_j, y = y_k) log₂ P(x = x_j, y = y_k)
−3  9  1/7  0.4011
−2  4  1/7  0.4011
−1  1  1/7  0.4011
0  0  1/7  0.4011
1  1  1/7  0.4011
2  4  1/7  0.4011
3  9  1/7  0.4011
H(x, y)  2.8074
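A compact sketch (helper names are mine) of Equations 2.12–2.15 and the normalization used in Example 2.22:

import numpy as np
from collections import Counter

def entropy(values):
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * np.log2(c / n) for c in counts.values())

x = [-3, -2, -1, 0, 1, 2, 3]
y = [9, 4, 1, 0, 1, 4, 9]            # y_k = x_k ** 2

hx, hy = entropy(x), entropy(y)
hxy = entropy(list(zip(x, y)))        # joint entropy computed from the pairs
mi = hx + hy - hxy                    # Equation 2.15
nmi = mi / np.log2(min(len(set(x)), len(set(y))))
print(round(mi, 4), round(nmi, 4))    # 1.9502 and 0.9751, as in the example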

2.4.7 Kernel Functions*

It is easy to understand how similarity and distance might be useful in an application such as clustering, which tries to group similar objects together. What is much less obvious is that many other data analysis tasks, including predictive modeling and dimensionality reduction, can be expressed in terms of pairwise "proximities" of data objects. More specifically, many data analysis problems can be mathematically formulated to take as input a kernel matrix, K, which can be considered a type of proximity matrix. Thus, an initial preprocessing step is used to convert the input data into a kernel matrix, which is the input to the data analysis algorithm.

More formally, if a data set has m data objects, then K is an m by m matrix. If x_i and x_j are the ith and jth data objects, respectively, then k_ij, the ijth entry of K, is computed by a kernel function:

k_ij = κ(x_i, x_j)   (2.16)

As we will see in the material that follows, the use of a kernel matrix allows both wider applicability of an algorithm to various kinds of data and an ability to model nonlinear relationships with algorithms that are designed only for detecting linear relationships.

Kernels make an algorithm data independent

If an algorithm uses a kernel matrix, then it can be used with any type of data for which a kernel function can be designed. This is illustrated by Algorithm 2.1. Although only some data analysis algorithms can be modified to use a kernel matrix as input, this approach is extremely powerful because it allows such an algorithm to be used with almost any type of data for which an appropriate kernel function can be defined. Thus, a classification algorithm can be used, for example, with record data, string data, or graph data. If an algorithm can be reformulated to use a kernel matrix, then its applicability to different types of data increases dramatically. As we will see in later chapters, many clustering, classification, and anomaly detection algorithms work only with similarities or distances, and thus, can be easily modified to work with kernels.

Algorithm 2.1 Basic kernel algorithm.
1. Read in the m data objects in the data set.
2. Compute the kernel matrix, K, by applying the kernel function, κ, to each pair of data objects.
3. Run the data analysis algorithm with K as input.
4. Return the analysis result, e.g., predicted class or cluster labels.
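Below is a minimal sketch of step 2 of Algorithm 2.1; the data and the particular kernel used here (a polynomial kernel, defined formally in Equation 2.19 below) are illustrative choices of mine:

import numpy as np

def kernel_matrix(data, kernel):
    m = len(data)
    K = np.empty((m, m))
    for i in range(m):
        for j in range(m):
            K[i, j] = kernel(data[i], data[j])   # k_ij = kappa(x_i, x_j), Equation 2.16
    return K

poly_kernel = lambda x, y, c=1.0, d=2: (np.dot(x, y) + c) ** d

data = np.array([[1.0, 2.0], [0.0, 1.0], [2.0, 0.5]])
K = kernel_matrix(data, poly_kernel)
print(K)   # a 3-by-3 symmetric kernel matrix, ready to hand to a kernel-based algorithm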

Mapping data into a higher dimensional data space can allow modeling of nonlinear relationships

There is yet another, equally important, aspect of kernel-based data algorithms—their ability to model nonlinear relationships with algorithms that model only linear relationships. Typically, this works by first transforming (mapping) the data from a lower dimensional data space to a higher dimensional space.

Example 2.23 (Mapping Data to a Higher Dimensional Space). Consider the relationship between two variables x and y given by the following equation, which defines an ellipse in two dimensions (Figure 2.19(a)):

4x² + 9xy + 7y² = 10   (2.17)

Figure 2.19. Mapping data to a higher dimensional space: two to three dimensions.

We can map our two dimensional data to three dimensions by creating three new variables, u, v, and w, which are defined as follows:

u = x²
v = xy
w = y²

As a result, we can now express Equation 2.17 as a linear one:

4u + 9v + 7w = 10   (2.18)

This equation describes a plane in three dimensions. Points on the ellipse will lie on that plane, while points inside and outside the ellipse will lie on opposite sides of the plane. See Figure 2.19(b). The viewpoint of this 3D plot is along the surface of the separating plane so that the plane appears as a line.

The Kernel Trick

The approach illustrated above shows the value in mapping data to a higher dimensional space, an operation that is integral to kernel-based methods. Conceptually, we first define a function φ that maps data points x and y to data points φ(x) and φ(y) in a higher dimensional space such that the inner product ⟨φ(x), φ(y)⟩ gives the desired measure of the proximity of x and y. It may seem that we have potentially sacrificed a great deal by using such an approach, because we can greatly expand the size of our data, increase the computational complexity of our analysis, and encounter problems with the curse of dimensionality by computing similarity in a high-dimensional space. However, this is not the case, since these problems can be avoided by defining a kernel function κ that can compute the same similarity value, but with the data points in the original space, i.e., κ(x, y) = ⟨φ(x), φ(y)⟩. This is known as the kernel trick. Despite the name, the kernel trick has a very solid

mathematical foundation and is a remarkably powerful approach for data analysis.

Not every function of a pair of data objects satisfies the properties needed for a kernel function, but it has been possible to design many useful kernels for a wide variety of data types. For example, three common kernel functions are the polynomial, Gaussian (radial basis function (RBF)), and sigmoid kernels. If x and y are two data objects, specifically, two data vectors, then these kernel functions can be expressed as follows, respectively:

κ(x, y) = (x′y + c)^d   (2.19)

κ(x, y) = exp(−∥x − y∥² / (2σ²))   (2.20)

κ(x, y) = tanh(αx′y + c)   (2.21)

where α and c ≥ 0 are constants, d is an integer parameter that gives the polynomial degree, ∥x − y∥ is the length of the vector x − y, and σ > 0 is a parameter that governs the "spread" of a Gaussian.

Example 2.24 (The Polynomial Kernel). Note that the kernel functions presented above compute the same similarity value as would be computed if we actually mapped the data to a higher dimensional space and then computed an inner product there. For example, for the polynomial kernel of degree 2, let φ be the function that maps a two-dimensional data vector x = (x1, x2) to the higher dimensional space. Specifically, let

φ(x) = (x1², x2², √2 x1x2, √(2c) x1, √(2c) x2, c).   (2.22)

For the higher dimensional space, let the proximity be defined as the inner product of φ(x) and φ(y), i.e., ⟨φ(x), φ(y)⟩. Then, as previously mentioned, it can be shown that

κ(x, y) = ⟨φ(x), φ(y)⟩   (2.23)

where κ is defined by Equation 2.19 above. Specifically, if x = (x1, x2) and y = (y1, y2), then

κ(x, y) = ⟨φ(x), φ(y)⟩ = x1²y1² + x2²y2² + 2x1x2y1y2 + 2c x1y1 + 2c x2y2 + c².   (2.24)

More generally, the kernel trick depends on defining κ and φ so that Equation 2.23 holds. This has been done for a wide variety of kernels.

This discussion of kernel-based approaches was intended only to provide a brief introduction to this topic and has omitted many details. A fuller discussion of the kernel-based approach is provided in Section 4.9.4, which discusses these issues in the context of nonlinear support vector machines for classification. More general references for kernel-based analysis can be found in the Bibliographic Notes of this chapter.
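The following quick numerical check (my own code, following Example 2.24 with d = 2 and an arbitrary c) confirms that the polynomial kernel and the inner product in the mapped space of Equation 2.22 agree:

import numpy as np

c = 1.0

def kappa(x, y):                        # Equation 2.19 with d = 2
    return (np.dot(x, y) + c) ** 2

def phi(x):                             # Equation 2.22
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * c) * x1, np.sqrt(2 * c) * x2, c])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
print(kappa(x, y), np.dot(phi(x), phi(y)))   # both give the same value (4.0 for these vectors)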

2.4.8 Bregman Divergence*

This section provides a brief description of Bregman divergences, which are a family of proximity functions that share some common properties. As a result, it is possible to construct general data mining algorithms, such as clustering algorithms, that work with any Bregman divergence. A concrete example is the K-means clustering algorithm (Section 7.2). Note that this section requires knowledge of vector calculus.

Bregman divergences are loss or distortion functions. To understand the idea of a loss function, consider the following. Let x and y be two points, where y is regarded as the original point and x is some distortion or approximation of it. For example, x may be a point that was generated by adding random noise to y. The goal is to measure the resulting distortion or loss that results if y is approximated by x. Of course, the more similar x and y are, the smaller the loss or distortion. Thus, Bregman divergences can be used as dissimilarity functions.

More formally, we have the following definition.

Definition 2.6 (Bregman Divergence). Given a strictly convex function ϕ (with a few modest restrictions that are generally satisfied), the Bregman divergence (loss function) D(x, y) generated by that function is given by the following equation:

D(x, y) = ϕ(x) − ϕ(y) − ⟨∇ϕ(y), (x − y)⟩   (2.25)

where ∇ϕ(y) is the gradient of ϕ evaluated at y, x − y is the vector difference between x and y, and ⟨∇ϕ(y), (x − y)⟩ is the inner product between ∇ϕ(y) and (x − y). For points in Euclidean space, the inner product is just the dot product.

D(x, y) can be written as D(x, y) = ϕ(x) − L(x), where L(x) = ϕ(y) + ⟨∇ϕ(y), (x − y)⟩ represents the equation of a plane that is tangent to the function ϕ at y. Using calculus terminology, L(x) is the linearization of ϕ around the point y, and the Bregman divergence is just the difference between a function and a linear approximation to that function. Different Bregman divergences are obtained by using different choices for ϕ.

Example 2.25. We provide a concrete example using squared Euclidean distance, but restrict ourselves to one dimension to simplify the mathematics. Let x and y be real numbers and ϕ(t) be the real-valued function, ϕ(t) = t². In that case, the gradient reduces to the derivative, and the dot product reduces to multiplication. Specifically, Equation 2.25 becomes Equation 2.26:

D(x, y) = x² − y² − 2y(x − y) = (x − y)²   (2.26)

The graph for this example, with y = 1, is shown in Figure 2.20. The Bregman divergence is shown for two values of x: x = 2 and x = 3.

Figure 2.20. Illustration of Bregman divergence.
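A tiny sketch of Equation 2.25 for the one-dimensional case of Example 2.25, where the divergence reduces to squared distance (the function names are mine):

def bregman(x, y, phi, grad_phi):
    return phi(x) - phi(y) - grad_phi(y) * (x - y)   # Equation 2.25 in one dimension

phi = lambda t: t ** 2
grad_phi = lambda t: 2 * t

y = 1.0
for x in (2.0, 3.0):                                  # the two x values shown in Figure 2.20
    print(x, bregman(x, y, phi, grad_phi), (x - y) ** 2)   # both columns agree: 1.0 and 4.0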

2.4.9 Issues in Proximity Calculation

This section discusses several important issues related to proximity measures: (1) how to handle the case in which attributes have different scales and/or are correlated, (2) how to calculate proximity between objects that are composed of different types of attributes, e.g., quantitative and qualitative, and (3) how to handle proximity calculations when attributes have different weights, i.e., when not all attributes contribute equally to the proximity of objects.

Standardization and Correlation for Distance Measures

An important issue with distance measures is how to handle the situation when attributes do not have the same range of values. (This situation is often described by saying that "the variables have different scales.") In a previous example, Euclidean distance was used to measure the distance between people based on two attributes: age and income. Unless these two attributes are standardized, the distance between two people will be dominated by income.

A related issue is how to compute distance when there is correlation between some of the attributes, perhaps in addition to differences in the ranges of values. A generalization of Euclidean distance, the Mahalanobis distance, is useful when attributes are correlated, have different ranges of values (different variances), and the distribution of the data is approximately Gaussian (normal). Correlated variables have a large impact on standard distance measures since a change in any of the correlated variables is reflected in a change in all the correlated variables. Specifically, the Mahalanobis distance between two objects (vectors) x and y is defined as

Mahalanobis(x, y) = (x − y)′ Σ⁻¹ (x − y),   (2.27)

where Σ⁻¹ is the inverse of the covariance matrix of the data. Note that the covariance matrix Σ is the matrix whose ijth entry is the covariance of the ith and jth attributes as defined by Equation 2.11.

Example 2.26. In Figure 2.21, there are 1000 points, whose x and y attributes have a correlation of 0.6. The distance between the two large points at the opposite ends of the long axis of the ellipse is 14.7 in terms of Euclidean distance, but only 6 with respect to Mahalanobis distance. This is because the Mahalanobis distance gives less emphasis to the direction of largest variance. In practice, computing the Mahalanobis distance is expensive, but can be worthwhile for data whose attributes are correlated. If the attributes are relatively uncorrelated, but have different ranges, then standardizing the variables is sufficient.

Figure 2.21. Set of two-dimensional points. The Mahalanobis distance between the two points represented by large dots is 6; their Euclidean distance is 14.7.
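A short sketch of Equation 2.27 on synthetic correlated data, roughly in the spirit of Figure 2.21 (the data generation and helper function are my own assumptions; Σ⁻¹ is estimated from the sample covariance, and, following Equation 2.27, no square root is taken):

import numpy as np

rng = np.random.default_rng(0)
data = rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 1.0]], size=1000)

def mahalanobis(x, y, data):
    inv_cov = np.linalg.inv(np.cov(data, rowvar=False))   # Sigma^{-1} estimated from the data
    d = x - y
    return d @ inv_cov @ d                                 # Equation 2.27

x, y = np.array([2.0, 2.0]), np.array([-2.0, -2.0])        # two points along the long axis
print(np.linalg.norm(x - y) ** 2, mahalanobis(x, y, data)) # the Mahalanobis value is smaller,
                                                           # discounting the direction of largest variance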

CombiningSimilaritiesforHeterogeneousAttributes

Thepreviousdefinitionsofsimilaritywerebasedonapproachesthatassumedalltheattributeswereofthesametype.Ageneralapproachisneededwhentheattributesareofdifferenttypes.OnestraightforwardapproachistocomputethesimilaritybetweeneachattributeseparatelyusingTable2.7 ,andthencombinethesesimilaritiesusingamethodthatresultsinasimilaritybetween0and1.Onepossibleapproachistodefinetheoverallsimilarityastheaverageofalltheindividualattributesimilarities.Unfortunately,thisapproachdoesnotworkwellifsomeoftheattributesareasymmetricattributes.Forexample,ifalltheattributesareasymmetricbinaryattributes,thenthesimilaritymeasuresuggestedpreviouslyreducestothesimplematchingcoefficient,ameasurethatisnotappropriateforasymmetricbinaryattributes.Theeasiestwaytofixthisproblemistoomitasymmetricattributesfromthesimilaritycalculationwhentheirvaluesare0forbothoftheobjectswhosesimilarityisbeingcomputed.Asimilarapproachalsoworkswellforhandlingmissingvalues.

In summary, Algorithm 2.2 is effective for computing an overall similarity between two objects, x and y, with different types of attributes. This procedure can be easily modified to work with dissimilarities.

Algorithm 2.2 Similarities of heterogeneous objects.
1: For the kth attribute, compute a similarity, s_k(x, y), in the range [0, 1].
2: Define an indicator variable, δ_k, for the kth attribute as follows:
   δ_k = 0 if the kth attribute is an asymmetric attribute and both objects have a value of 0, or if one of the objects has a missing value for the kth attribute
   δ_k = 1 otherwise
3: Compute the overall similarity between the two objects using the following formula:

similarity(x, y) = Σ_{k=1}^{n} δ_k s_k(x, y) / Σ_{k=1}^{n} δ_k   (2.28)

Using Weights

In much of the previous discussion, all attributes were treated equally when computing proximity. This is not desirable when some attributes are more important to the definition of proximity than others. To address these situations, the formulas for proximity can be modified by weighting the contribution of each attribute.

With attribute weights, w_k, (2.28) becomes

similarity(x, y) = Σ_{k=1}^{n} w_k δ_k s_k(x, y) / Σ_{k=1}^{n} w_k δ_k.   (2.29)

The definition of the Minkowski distance can also be modified as follows:

d(x, y) = ( Σ_{k=1}^{n} w_k |x_k − y_k|^r )^{1/r}.   (2.30)
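A rough Python sketch of Algorithm 2.2, including the optional weights of Equation 2.29; the per-attribute similarity functions and the asymmetric-attribute flags are assumptions supplied by the caller, and None is used here to stand in for a missing value:

def overall_similarity(x, y, attr_sims, asymmetric, weights=None):
    # attr_sims[k](a, b) -> similarity of the kth attribute values, in [0, 1]
    n = len(x)
    weights = weights or [1.0] * n
    num = den = 0.0
    for k in range(n):
        if (asymmetric[k] and x[k] == 0 and y[k] == 0) or x[k] is None or y[k] is None:
            continue                                     # delta_k = 0: attribute is ignored
        num += weights[k] * attr_sims[k](x[k], y[k])     # delta_k = 1
        den += weights[k]
    return num / den if den > 0 else 0.0

same = lambda a, b: 1.0 if a == b else 0.0
x = [1, 0, "red"]
y = [1, 0, "blue"]
print(overall_similarity(x, y, [same, same, same], asymmetric=[True, True, False]))   # 0.5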

2.4.10 Selecting the Right Proximity Measure

A few general observations may be helpful. First, the type of proximity measure should fit the type of data. For many types of dense, continuous data, metric distance measures such as Euclidean distance are often used. Proximity between continuous attributes is most often expressed in terms of differences, and distance measures provide a well-defined way of combining these differences into an overall proximity measure. Although attributes can have different scales and be of differing importance, these issues can often be dealt with as described earlier, such as normalization and weighting of attributes.

Forsparsedata,whichoftenconsistsofasymmetricattributes,wetypicallyemploysimilaritymeasuresthatignore0–0matches.Conceptually,thisreflectsthefactthat,forapairofcomplexobjects,similaritydependsonthenumberofcharacteristicstheybothshare,ratherthanthenumberofcharacteristicstheybothlack.Thecosine,Jaccard,andextendedJaccardmeasuresareappropriateforsuchdata.

Thereareothercharacteristicsofdatavectorsthatoftenneedtobeconsidered.Invariancetoscaling(multiplication)andtotranslation(addition)werepreviouslydiscussedwithrespecttoEuclideandistanceandthecosineandcorrelationmeasures.Thepracticalimplicationsofsuchconsiderationsarethat,forexample,cosineismoresuitableforsparsedocumentdatawhereonlyscalingisimportant,whilecorrelationworksbetterfortimeseries,wherebothscalingandtranslationareimportant.EuclideandistanceorothertypesofMinkowskidistancearemostappropriatewhentwodatavectorsaretomatchascloselyaspossibleacrossallcomponents(features).

Insomecases,transformationornormalizationofthedataisneededtoobtainapropersimilaritymeasure.Forinstance,timeseriescanhavetrendsorperiodicpatternsthatsignificantlyimpactsimilarity.Also,apropercomputationofsimilarityoftenrequiresthattimelagsbetakenintoaccount.Finally,twotimeseriesmaybesimilaronlyoverspecificperiodsoftime.Forexample,thereisastrongrelationshipbetweentemperatureandtheuseofnaturalgas,butonlyduringtheheatingseason.

Practicalconsiderationcanalsobeimportant.Sometimes,oneormoreproximitymeasuresarealreadyinuseinaparticularfield,andthus,otherswillhaveansweredthequestionofwhichproximitymeasuresshouldbeused.Othertimes,thesoftwarepackageorclusteringalgorithmbeingusedcandrasticallylimitthechoices.Ifefficiencyisaconcern,thenwemaywanttochooseaproximitymeasurethathasaproperty,suchasthetriangleinequality,thatcanbeusedtoreducethenumberofproximitycalculations.(SeeExercise25 .)

However,ifcommonpracticeorpracticalrestrictionsdonotdictateachoice,thentheproperchoiceofaproximitymeasurecanbeatime-consumingtaskthatrequirescarefulconsiderationofbothdomainknowledgeandthepurposeforwhichthemeasureisbeingused.Anumberofdifferentsimilaritymeasuresmayneedtobeevaluatedtoseewhichonesproduceresultsthatmakethemostsense.

2.5BibliographicNotesItisessentialtounderstandthenatureofthedatathatisbeinganalyzed,andatafundamentallevel,thisisthesubjectofmeasurementtheory.Inparticular,oneoftheinitialmotivationsfordefiningtypesofattributeswastobepreciseaboutwhichstatisticaloperationswerevalidforwhatsortsofdata.WehavepresentedtheviewofmeasurementtheorythatwasinitiallydescribedinaclassicpaperbyS.S.Stevens[112].(Tables2.2 and2.3 arederivedfromthosepresentedbyStevens[113].)Whilethisisthemostcommonviewandisreasonablyeasytounderstandandapply,thereis,ofcourse,muchmoretomeasurementtheory.Anauthoritativediscussioncanbefoundinathree-volumeseriesonthefoundationsofmeasurementtheory[88,94,114].Alsoofinterestisawide-rangingarticlebyHand[77],whichdiscussesmeasurementtheoryandstatistics,andisaccompaniedbycommentsfromotherresearchersinthefield.NumerouscritiquesandextensionsoftheapproachofStevenshavebeenmade[66,97,117].Finally,manybooksandarticlesdescribemeasurementissuesforparticularareasofscienceandengineering.

Dataqualityisabroadsubjectthatspanseverydisciplinethatusesdata.Discussionsofprecision,bias,accuracy,andsignificantfigurescanbefoundinmanyintroductoryscience,engineering,andstatisticstextbooks.Theviewofdataqualityas“fitnessforuse”isexplainedinmoredetailinthebookbyRedman[103].ThoseinterestedindataqualitymayalsobeinterestedinMIT’sInformationQuality(MITIQ)Program[95,118].However,theknowledgeneededtodealwithspecificdataqualityissuesinaparticulardomainisoftenbestobtainedbyinvestigatingthedataqualitypracticesofresearchersinthatfield.

Aggregationisalesswell-definedsubjectthanmanyotherpreprocessingtasks.However,aggregationisoneofthemaintechniquesusedbythedatabaseareaofOnlineAnalyticalProcessing(OLAP)[68,76,102].Therehasalsobeenrelevantworkintheareaofsymbolicdataanalysis(BockandDiday[64]).Oneofthegoalsinthisareaistosummarizetraditionalrecorddataintermsofsymbolicdataobjectswhoseattributesaremorecomplexthantraditionalattributes.Specifically,theseattributescanhavevaluesthataresetsofvalues(categories),intervals,orsetsofvalueswithweights(histograms).Anothergoalofsymbolicdataanalysisistobeabletoperformclustering,classification,andotherkindsofdataanalysisondatathatconsistsofsymbolicdataobjects.

Samplingisasubjectthathasbeenwellstudiedinstatisticsandrelatedfields.Manyintroductorystatisticsbooks,suchastheonebyLindgren[90],havesomediscussionaboutsampling,andentirebooksaredevotedtothesubject,suchastheclassictextbyCochran[67].AsurveyofsamplingfordataminingisprovidedbyGuandLiu[74],whileasurveyofsamplingfordatabasesisprovidedbyOlkenandRotem[98].Thereareanumberofotherdatamininganddatabase-relatedsamplingreferencesthatmaybeofinterest,includingpapersbyPalmerandFaloutsos[100],Provostetal.[101],Toivonen[115],andZakietal.[119].

Instatistics,thetraditionaltechniquesthathavebeenusedfordimensionalityreductionaremultidimensionalscaling(MDS)(BorgandGroenen[65],KruskalandUslaner[89])andprincipalcomponentanalysis(PCA)(Jolliffe[80]),whichissimilartosingularvaluedecomposition(SVD)(Demmel[70]).DimensionalityreductionisdiscussedinmoredetailinAppendixB.

Discretizationisatopicthathasbeenextensivelyinvestigatedindatamining.Someclassificationalgorithmsworkonlywithcategoricaldata,andassociationanalysisrequiresbinarydata,andthus,thereisasignificant

motivationtoinvestigatehowtobestbinarizeordiscretizecontinuousattributes.Forassociationanalysis,wereferthereadertoworkbySrikantandAgrawal[111],whilesomeusefulreferencesfordiscretizationintheareaofclassificationincludeworkbyDoughertyetal.[71],ElomaaandRousu[72],FayyadandIrani[73],andHussainetal.[78].

Featureselectionisanothertopicwellinvestigatedindatamining.AbroadcoverageofthistopicisprovidedinasurveybyMolinaetal.[96]andtwobooksbyLiuandMotada[91,92].OtherusefulpapersincludethosebyBlumandLangley[63],KohaviandJohn[87],andLiuetal.[93].

Itisdifficulttoprovidereferencesforthesubjectoffeaturetransformationsbecausepracticesvaryfromonedisciplinetoanother.Manystatisticsbookshaveadiscussionoftransformations,buttypicallythediscussionisrestrictedtoaparticularpurpose,suchasensuringthenormalityofavariableormakingsurethatvariableshaveequalvariance.Weoffertworeferences:Osborne[99]andTukey[116].

Whilewehavecoveredsomeofthemostcommonlyuseddistanceandsimilaritymeasures,therearehundredsofsuchmeasuresandmorearebeingcreatedallthetime.Aswithsomanyothertopicsinthischapter,manyofthesemeasuresarespecifictoparticularfields,e.g.,intheareaoftimeseriesseepapersbyKalpakisetal.[81]andKeoghandPazzani[83].Clusteringbooksprovidethebestgeneraldiscussions.Inparticular,seethebooksbyAnderberg[62],JainandDubes[79],KaufmanandRousseeuw[82],andSneathandSokal[109].

Information-basedmeasuresofsimilarityhavebecomemorepopularlatelydespitethecomputationaldifficultiesandexpenseofcalculatingthem.AgoodintroductiontoinformationtheoryisprovidedbyCoverandThomas[69].Computingthemutualinformationforcontinuousvariablescanbe

straightforwardiftheyfollowawell-knowdistribution,suchasGaussian.However,thisisoftennotthecase,andmanytechniqueshavebeendeveloped.Asoneexample,thearticlebyKhan,etal.[85]comparesvariousmethodsinthecontextofcomparingshorttimeseries.SeealsotheinformationandmutualinformationpackagesforRandMatlab.MutualinformationhasbeenthesubjectofconsiderablerecentattentionduetopaperbyReshef,etal.[104,105]thatintroducedanalternativemeasure,albeitonebasedonmutualinformation,whichwasclaimedtohavesuperiorproperties.Althoughthisapproachhadsomeearlysupport,e.g.,[110],othershavepointedoutvariouslimitations[75,86,108].

Twopopularbooksonthetopicofkernelmethodsare[106]and[107].Thelatteralsohasawebsitewithlinkstokernel-relatedmaterials[84].Inaddition,manycurrentdatamining,machinelearning,andstatisticallearningtextbookshavesomematerialaboutkernelmethods.FurtherreferencesforkernelmethodsinthecontextofsupportvectormachineclassifiersareprovidedinthebibliographicNotesofSection4.9.4.

Bibliography[62]M.R.Anderberg.ClusterAnalysisforApplications.AcademicPress,New

York,December1973.

[63]A.BlumandP.Langley.SelectionofRelevantFeaturesandExamplesinMachineLearning.ArtificialIntelligence,97(1–2):245–271,1997.

[64]H.H.BockandE.Diday.AnalysisofSymbolicData:ExploratoryMethodsforExtractingStatisticalInformationfromComplexData(StudiesinClassification,DataAnalysis,andKnowledgeOrganization).Springer-VerlagTelos,January2000.

[65]I.BorgandP.Groenen.ModernMultidimensionalScaling—TheoryandApplications.Springer-Verlag,February1997.

[66]N.R.Chrisman.Rethinkinglevelsofmeasurementforcartography.CartographyandGeographicInformationSystems,25(4):231–242,1998.

[67]W.G.Cochran.SamplingTechniques.JohnWiley&Sons,3rdedition,July1977.

[68]E.F.Codd,S.B.Codd,andC.T.Smalley.ProvidingOLAP(On-lineAnalyticalProcessing)toUser-Analysts:AnITMandate.WhitePaper,E.F.CoddandAssociates,1993.

[69]T.M.CoverandJ.A.Thomas.Elementsofinformationtheory.JohnWiley&Sons,2012.

[70]J.W.Demmel.AppliedNumericalLinearAlgebra.SocietyforIndustrial&AppliedMathematics,September1997.

[71]J.Dougherty,R.Kohavi,andM.Sahami.SupervisedandUnsupervisedDiscretizationofContinuousFeatures.InProc.ofthe12thIntl.Conf.onMachineLearning,pages194–202,1995.

[72]T.ElomaaandJ.Rousu.GeneralandEfficientMultisplittingofNumericalAttributes.MachineLearning,36(3):201–244,1999.

[73]U.M.FayyadandK.B.Irani.Multi-intervaldiscretizationofcontinuousvaluedattributesforclassificationlearning.InProc.13thInt.JointConf.onArtificialIntelligence,pages1022–1027.MorganKaufman,1993.

[74]F.H.GaohuaGuandH.Liu.SamplingandItsApplicationinDataMining:ASurvey.TechnicalReportTRA6/00,NationalUniversityofSingapore,Singapore,2000.

[75]M.Gorfine,R.Heller,andY.Heller.CommentonDetectingnovelassociationsinlargedatasets.Unpublished(availableathttp://emotion.technion.ac.il/gorfinm/files/science6.pdfon11Nov.2012),2012.

[76]J.Gray,S.Chaudhuri,A.Bosworth,A.Layman,D.Reichart,M.Venkatrao,F.Pellow,andH.Pirahesh.DataCube:ARelationalAggregationOperatorGeneralizingGroup-By,Cross-Tab,andSub-Totals.JournalDataMiningandKnowledgeDiscovery,1(1):29–53,1997.

[77]D.J.Hand.StatisticsandtheTheoryofMeasurement.JournaloftheRoyalStatisticalSociety:SeriesA(StatisticsinSociety),159(3):445–492,1996.

[78]F.Hussain,H.Liu,C.L.Tan,andM.Dash.TRC6/99:Discretization:anenablingtechnique.Technicalreport,NationalUniversityofSingapore,Singapore,1999.

[79]A.K.JainandR.C.Dubes.AlgorithmsforClusteringData.PrenticeHallAdvancedReferenceSeries.PrenticeHall,March1988.

[80]I.T.Jolliffe.PrincipalComponentAnalysis.SpringerVerlag,2ndedition,October2002.

[81]K.Kalpakis,D.Gada,andV.Puttagunta.DistanceMeasuresforEffectiveClusteringofARIMATime-Series.InProc.ofthe2001IEEEIntl.Conf.onDataMining,pages273–280.IEEEComputerSociety,2001.

[82]L.KaufmanandP.J.Rousseeuw.FindingGroupsinData:AnIntroductiontoClusterAnalysis.WileySeriesinProbabilityandStatistics.JohnWileyandSons,NewYork,November1990.

[83]E.J.KeoghandM.J.Pazzani.Scalingupdynamictimewarpingfordataminingapplications.InKDD,pages285–289,2000.

[84]KernelMethodsforPatternAnalysisWebsite.http://www.kernel-methods.net/,2014.

[85]S.Khan,S.Bandyopadhyay,A.R.Ganguly,S.Saigal,D.J.EricksonIII,V.Protopopescu,andG.Ostrouchov.Relativeperformanceofmutualinformationestimationmethodsforquantifyingthedependenceamongshortandnoisydata.PhysicalReviewE,76(2):026209,2007.

[86]J.B.KinneyandG.S.Atwal.Equitability,mutualinformation,andthemaximalinformationcoefficient.ProceedingsoftheNationalAcademyofSciences,2014.

[87]R.KohaviandG.H.John.WrappersforFeatureSubsetSelection.ArtificialIntelligence,97(1–2):273–324,1997.

[88]D.Krantz,R.D.Luce,P.Suppes,andA.Tversky.FoundationsofMeasurements:Volume1:Additiveandpolynomialrepresentations.AcademicPress,NewYork,1971.

[89]J.B.KruskalandE.M.Uslaner.MultidimensionalScaling.SagePublications,August1978.

[90]B.W.Lindgren.StatisticalTheory.CRCPress,January1993.

[91]H.LiuandH.Motoda,editors.FeatureExtraction,ConstructionandSelection:ADataMiningPerspective.KluwerInternationalSeriesinEngineeringandComputerScience,453.KluwerAcademicPublishers,July1998.

[92]H.LiuandH.Motoda.FeatureSelectionforKnowledgeDiscoveryandDataMining.KluwerInternationalSeriesinEngineeringandComputerScience,454.KluwerAcademicPublishers,July1998.

[93]H.Liu,H.Motoda,andL.Yu.FeatureExtraction,Selection,andConstruction.InN.Ye,editor,TheHandbookofDataMining,pages22–41.LawrenceErlbaumAssociates,Inc.,Mahwah,NJ,2003.

[94]R.D.Luce,D.Krantz,P.Suppes,andA.Tversky.FoundationsofMeasurements:Volume3:Representation,Axiomatization,andInvariance.AcademicPress,NewYork,1990.

[95]MITInformationQuality(MITIQ)Program.http://mitiq.mit.edu/,2014.

[96]L.C.Molina,L.Belanche,andA.Nebot.FeatureSelectionAlgorithms:ASurveyandExperimentalEvaluation.InProc.ofthe2002IEEEIntl.Conf.onDataMining,2002.

[97]F.MostellerandJ.W.Tukey.Dataanalysisandregression:asecondcourseinstatistics.Addison-Wesley,1977.

[98]F.OlkenandD.Rotem.RandomSamplingfromDatabases—ASurvey.Statistics&Computing,5(1):25–42,March1995.

[99]J.Osborne.NotesontheUseofDataTransformations.PracticalAssessment,Research&Evaluation,28(6),2002.

[100]C.R.PalmerandC.Faloutsos.Densitybiasedsampling:Animprovedmethodfordataminingandclustering.ACMSIGMODRecord,29(2):82–92,2000.

[101]F.J.Provost,D.Jensen,andT.Oates.EfficientProgressiveSampling.InProc.ofthe5thIntl.Conf.onKnowledgeDiscoveryandDataMining,pages23–32,1999.

[102]R.RamakrishnanandJ.Gehrke.DatabaseManagementSystems.McGraw-Hill,3rdedition,August2002.

[103]T.C.Redman.DataQuality:TheFieldGuide.DigitalPress,January2001.

[104]D.Reshef,Y.Reshef,M.Mitzenmacher,andP.Sabeti.Equitabilityanalysisofthemaximalinformationcoefficient,withcomparisons.arXivpreprintarXiv:1301.6314,2013.

[105]D.N.Reshef,Y.A.Reshef,H.K.Finucane,S.R.Grossman,G.McVean,P.J.Turnbaugh,E.S.Lander,M.Mitzenmacher,andP.C.

Sabeti.Detectingnovelassociationsinlargedatasets.science,334(6062):1518–1524,2011.

[106]B.SchölkopfandA.J.Smola.Learningwithkernels:supportvectormachines,regularization,optimization,andbeyond.MITpress,2002.

[107]J.Shawe-TaylorandN.Cristianini.Kernelmethodsforpatternanalysis.Cambridgeuniversitypress,2004.

[108]N.SimonandR.Tibshirani.Commenton”DetectingNovelAssociationsInLargeDataSets”byReshefEtAl,ScienceDec16,2011.arXivpreprintarXiv:1401.7645,2014.

[109]P.H.A.SneathandR.R.Sokal.NumericalTaxonomy.Freeman,SanFrancisco,1971.

[110]T.Speed.Acorrelationforthe21stcentury.Science,334(6062):1502–1503,2011.

[111]R.SrikantandR.Agrawal.MiningQuantitativeAssociationRulesinLargeRelationalTables.InProc.of1996ACM-SIGMODIntl.Conf.onManagementofData,pages1–12,Montreal,Quebec,Canada,August1996.

[112]S.S.Stevens.OntheTheoryofScalesofMeasurement.Science,103(2684):677–680,June1946.

[113]S.S.Stevens.Measurement.InG.M.Maranell,editor,Scaling:ASourcebookforBehavioralScientists,pages22–41.AldinePublishingCo.,Chicago,1974.

[114]P.Suppes,D.Krantz,R.D.Luce,andA.Tversky.FoundationsofMeasurements:Volume2:Geometrical,Threshold,andProbabilisticRepresentations.AcademicPress,NewYork,1989.

[115]H.Toivonen.SamplingLargeDatabasesforAssociationRules.InVLDB96,pages134–145.MorganKaufman,September1996.

[116]J.W.Tukey.OntheComparativeAnatomyofTransformations.AnnalsofMathematicalStatistics,28(3):602–632,September1957.

[117]P.F.VellemanandL.Wilkinson.Nominal,ordinal,interval,andratiotypologiesaremisleading.TheAmericanStatistician,47(1):65–72,1993.

[118]R.Y.Wang,M.Ziad,Y.W.Lee,andY.R.Wang.DataQuality.TheKluwerInternationalSeriesonAdvancesinDatabaseSystems,Volume23.KluwerAcademicPublishers,January2001.

[119]M.J.Zaki,S.Parthasarathy,W.Li,andM.Ogihara.EvaluationofSamplingforDataMiningofAssociationRules.TechnicalReportTR617,RensselaerPolytechnicInstitute,1996.

2.6Exercises1.IntheinitialexampleofChapter2 ,thestatisticiansays,“Yes,fields2and3arebasicallythesame.”Canyoutellfromthethreelinesofsampledatathatareshownwhyshesaysthat?

2.Classifythefollowingattributesasbinary,discrete,orcontinuous.Alsoclassifythemasqualitative(nominalorordinal)orquantitative(intervalorratio).Somecasesmayhavemorethanoneinterpretation,sobrieflyindicateyourreasoningifyouthinktheremaybesomeambiguity.

Example:Ageinyears.Answer:Discrete,quantitative,ratio

a. TimeintermsofAMorPM.

b. Brightnessasmeasuredbyalightmeter.

c. Brightnessasmeasuredbypeople’sjudgments.

d. Anglesasmeasuredindegreesbetween0and360.

e. Bronze,Silver,andGoldmedalsasawardedattheOlympics.

f. Heightabovesealevel.

g. Numberofpatientsinahospital.

h. ISBNnumbersforbooks.(LookuptheformatontheWeb.)

i. Abilitytopasslightintermsofthefollowingvalues:opaque,translucent,transparent.

j. Militaryrank.

k. Distancefromthecenterofcampus.

l. Densityofasubstanceingramspercubiccentimeter.

m. Coatchecknumber.(Whenyouattendanevent,youcanoftengiveyourcoattosomeonewho,inturn,givesyouanumberthatyoucanusetoclaimyourcoatwhenyouleave.)

3.Youareapproachedbythemarketingdirectorofalocalcompany,whobelievesthathehasdevisedafoolproofwaytomeasurecustomersatisfaction.Heexplainshisschemeasfollows:“It’ssosimplethatIcan’tbelievethatnoonehasthoughtofitbefore.Ijustkeeptrackofthenumberofcustomercomplaintsforeachproduct.Ireadinadataminingbookthatcountsareratioattributes,andso,mymeasureofproductsatisfactionmustbearatioattribute.ButwhenIratedtheproductsbasedonmynewcustomersatisfactionmeasureandshowedthemtomyboss,hetoldmethatIhadoverlookedtheobvious,andthatmymeasurewasworthless.Ithinkthathewasjustmadbecauseourbestsellingproducthadtheworstsatisfactionsinceithadthemostcomplaints.Couldyouhelpmesethimstraight?”

a. Whoisright,themarketingdirectororhisboss?Ifyouanswered,hisboss,whatwouldyoudotofixthemeasureofsatisfaction?

b. Whatcanyousayabouttheattributetypeoftheoriginalproductsatisfactionattribute?

4.Afewmonthslater,youareagainapproachedbythesamemarketingdirectorasinExercise3 .Thistime,hehasdevisedabetterapproachtomeasuretheextenttowhichacustomerprefersoneproductoverothersimilarproducts.Heexplains,“Whenwedevelopnewproducts,wetypicallycreateseveralvariationsandevaluatewhichonecustomersprefer.Ourstandardprocedureistogiveourtestsubjectsalloftheproductvariationsatonetimeandthenaskthemtoranktheproductvariationsinorderofpreference.However,ourtestsubjectsareveryindecisive,especiallywhenthereare

morethantwoproducts.Asaresult,testingtakesforever.Isuggestedthatweperformthecomparisonsinpairsandthenusethesecomparisonstogettherankings.Thus,ifwehavethreeproductvariations,wehavethecustomerscomparevariations1and2,then2and3,andfinally3and1.Ourtestingtimewithmynewprocedureisathirdofwhatitwasfortheoldprocedure,buttheemployeesconductingthetestscomplainthattheycannotcomeupwithaconsistentrankingfromtheresults.Andmybosswantsthelatestproductevaluations,yesterday.Ishouldalsomentionthathewasthepersonwhocameupwiththeoldproductevaluationapproach.Canyouhelpme?”

a. Isthemarketingdirectorintrouble?Willhisapproachworkforgeneratinganordinalrankingoftheproductvariationsintermsofcustomerpreference?Explain.

b. Isthereawaytofixthemarketingdirector’sapproach?Moregenerally,whatcanyousayabouttryingtocreateanordinalmeasurementscalebasedonpairwisecomparisons?

c. Fortheoriginalproductevaluationscheme,theoverallrankingsofeachproductvariationarefoundbycomputingitsaverageoveralltestsubjects.Commentonwhetheryouthinkthatthisisareasonableapproach.Whatotherapproachesmightyoutake?

5.Canyouthinkofasituationinwhichidentificationnumberswouldbeusefulforprediction?

6.Aneducationalpsychologistwantstouseassociationanalysistoanalyzetestresults.Thetestconsistsof100questionswithfourpossibleanswerseach.

a. Howwouldyouconvertthisdataintoaformsuitableforassociationanalysis?

b. Inparticular,whattypeofattributeswouldyouhaveandhowmanyofthemarethere?

7.Whichofthefollowingquantitiesislikelytoshowmoretemporalautocorrelation:dailyrainfallordailytemperature?Why?

8.Discusswhyadocument-termmatrixisanexampleofadatasetthathasasymmetricdiscreteorasymmetriccontinuousfeatures.

9.Manysciencesrelyonobservationinsteadof(orinadditionto)designedexperiments.Comparethedataqualityissuesinvolvedinobservationalsciencewiththoseofexperimentalscienceanddatamining.

10.Discussthedifferencebetweentheprecisionofameasurementandthetermssingleanddoubleprecision,astheyareusedincomputerscience,typicallytorepresentfloating-pointnumbersthatrequire32and64bits,respectively.

11.Giveatleasttwoadvantagestoworkingwithdatastoredintextfilesinsteadofinabinaryformat.

12.Distinguishbetweennoiseandoutliers.Besuretoconsiderthefollowingquestions.

a. Isnoiseeverinterestingordesirable?Outliers?

b. Cannoiseobjectsbeoutliers?

c. Arenoiseobjectsalwaysoutliers?

d. Areoutliersalwaysnoiseobjects?

e. Cannoisemakeatypicalvalueintoanunusualone,orviceversa?

Algorithm 2.3 Algorithm for finding K-nearest neighbors.
1: for i = 1 to number of data objects do
2:   Find the distances of the ith object to all other objects.
3:   Sort these distances in decreasing order. (Keep track of which object is associated with each distance.)
4:   return the objects associated with the first K distances of the sorted list
5: end for

13. Consider the problem of finding the K-nearest neighbors of a data object. A programmer designs Algorithm 2.3 for this task.

a. Describe the potential problems with this algorithm if there are duplicate objects in the data set. Assume the distance function will return a distance of 0 only for objects that are the same.

b. How would you fix this problem?

14. The following attributes are measured for members of a herd of Asian elephants: weight, height, tusk length, trunk length, and ear area. Based on these measurements, what sort of proximity measure from Section 2.4 would you use to compare or group these elephants? Justify your answer and explain any special circumstances.

15. You are given a set of m objects that is divided into k groups, where the ith group is of size m_i. If the goal is to obtain a sample of size n < m, what is the difference between the following two sampling schemes? (Assume sampling with replacement.)

a. We randomly select n × m_i/m elements from each group.

b. We randomly select n elements from the data set, without regard for the group to which an object belongs.

16. Consider a document-term matrix, where tf_ij is the frequency of the ith word (term) in the jth document and m is the number of documents. Consider the variable transformation that is defined by

tf′_ij = tf_ij × log(m / df_i),   (2.31)

where df_i is the number of documents in which the ith term appears, which is known as the document frequency of the term. This transformation is known as the inverse document frequency transformation.

a. What is the effect of this transformation if a term occurs in one document? In every document?

b. What might be the purpose of this transformation?

17. Assume that we apply a square root transformation to a ratio attribute x to obtain the new attribute x*. As part of your analysis, you identify an interval (a, b) in which x* has a linear relationship to another attribute y.

a. What is the corresponding interval (A, B) in terms of x?

b. Give an equation that relates y to x.

vectors.ComputetheHammingdistanceandtheJaccardsimilaritybetweenthefollowingtwobinaryvectors.

b. Whichapproach,JaccardorHammingdistance,ismoresimilartotheSimpleMatchingCoefficient,andwhichapproachismoresimilartothecosinemeasure?Explain.(Note:TheHammingmeasureisadistance,whiletheotherthreemeasuresaresimilarities,butdon’tletthisconfuseyou.)

c. Supposethatyouarecomparinghowsimilartwoorganismsofdifferentspeciesareintermsofthenumberofgenestheyshare.Describewhichmeasure,HammingorJaccard,youthinkwouldbemoreappropriateforcomparingthegeneticmakeupoftwoorganisms.Explain.(Assumethateachanimalisrepresentedasabinaryvector,whereeachattributeis1ifaparticulargeneispresentintheorganismand0otherwise.)

d. Ifyouwantedtocomparethegeneticmakeupoftwoorganismsofthesamespecies,e.g.,twohumanbeings,wouldyouusetheHammingdistance,theJaccardcoefficient,oradifferentmeasureofsimilarityordistance?Explain.(Notethattwohumanbeingsshare ofthesamegenes.)

19.Forthefollowingvectors,xandy,calculatetheindicatedsimilarityordistancemeasures.

a. cosine,correlation,Euclidean

b. cosine,correlation,Euclidean,Jaccard

c. cosine,correlation,Euclidean

d. cosine,correlation,Jaccard

x=0101010001y=0100011000

>99.9%

x=(1,1,1,1),y=(2,2,2,2)

x=(0,1,0,1),y=(1,0,1,0)

x=(0,−1,0,1),y=(1,0,−1,0)

x=(1,1,0,1,0,1),y=(1,1,1,0,0,1)

e. cosine,correlation

20.Here,wefurtherexplorethecosineandcorrelationmeasures.

a. Whatistherangeofvaluespossibleforthecosinemeasure?

b. Iftwoobjectshaveacosinemeasureof1,aretheyidentical?Explain.

c. Whatistherelationshipofthecosinemeasuretocorrelation,ifany?(Hint:Lookatstatisticalmeasuressuchasmeanandstandarddeviationincaseswherecosineandcorrelationarethesameanddifferent.)

d. Figure2.22(a) showstherelationshipofthecosinemeasuretoEuclideandistancefor100,000randomlygeneratedpointsthathavebeennormalizedtohaveanL2lengthof1.WhatgeneralobservationcanyoumakeabouttherelationshipbetweenEuclideandistanceandcosinesimilaritywhenvectorshaveanL2normof1?

Figure2.22.GraphsforExercise20 .

x=(2,−1,0,2,0,−3),y=(−1,1,−1,0,0,−1)

e. Figure2.22(b) showstherelationshipofcorrelationtoEuclideandistancefor100,000randomlygeneratedpointsthathavebeenstandardizedtohaveameanof0andastandarddeviationof1.WhatgeneralobservationcanyoumakeabouttherelationshipbetweenEuclideandistanceandcorrelationwhenthevectorshavebeenstandardizedtohaveameanof0andastandarddeviationof1?

f. DerivethemathematicalrelationshipbetweencosinesimilarityandEuclideandistancewheneachdataobjecthasanL lengthof1.

g. DerivethemathematicalrelationshipbetweencorrelationandEuclideandistancewheneachdatapointhasbeenbeenstandardizedbysubtractingitsmeananddividingbyitsstandarddeviation.

21.Showthatthesetdifferencemetricgivenby

satisfiesthemetricaxiomsgivenonpage77 .AandBaresetsand isthesetdifference.

22.Discusshowyoumightmapcorrelationvaluesfromtheinterval totheinterval[0,1].Notethatthetypeoftransformationthatyouusemightdependontheapplicationthatyouhaveinmind.Thus,considertwoapplications:clusteringtimeseriesandpredictingthebehaviorofonetimeseriesgivenanother.

23.Givenasimilaritymeasurewithvaluesintheinterval[0,1],describetwowaystotransformthissimilarityvalueintoadissimilarityvalueintheinterval

24.Proximityistypicallydefinedbetweenapairofobjects.

2

d(A,B)=size(A−B)+size(B−A) (2.32)

A−B

[−1,1]

[0,∞].

a. Definetwowaysinwhichyoumightdefinetheproximityamongagroupofobjects.

b. HowmightyoudefinethedistancebetweentwosetsofpointsinEuclideanspace?

c. Howmightyoudefinetheproximitybetweentwosetsofdataobjects?(Makenoassumptionaboutthedataobjects,exceptthataproximitymeasureisdefinedbetweenanypairofobjects.)

25. You are given a set of points S in Euclidean space, as well as the distance of each point in S to a point x. (It does not matter if x ∈ S.)

a. If the goal is to find all points within a specified distance ε of point y, y ≠ x, explain how you could use the triangle inequality and the already calculated distances to x to potentially reduce the number of distance calculations necessary. Hint: the triangle inequality, d(x, z) ≤ d(x, y) + d(y, z), can be rewritten as d(x, y) ≥ d(x, z) − d(y, z).

b. In general, how would the distance between x and y affect the number of distance calculations?

c. Suppose that you can find a small subset of points S′, from the original data set, such that every point in the data set is within a specified distance ε of at least one of the points in S′, and that you also have the pairwise distance matrix for S′. Describe a technique that uses this information to compute, with a minimum of distance calculations, the set of all points within a distance of β of a specified point from the data set.

26. Show that 1 minus the Jaccard similarity is a distance measure between two data objects, x and y, that satisfies the metric axioms given on page 77. Specifically, d(x, y) = 1 − J(x, y).

27. Show that the distance measure defined as the angle between two data vectors, x and y, satisfies the metric axioms given on page 77. Specifically, d(x, y) = arccos(cos(x, y)).

28. Explain why computing the proximity between two attributes is often simpler than computing the similarity between two objects.

3 Classification: Basic Concepts and Techniques

Humanshaveaninnateabilitytoclassifythingsintocategories,e.g.,mundanetaskssuchasfilteringspamemailmessagesormorespecializedtaskssuchasrecognizingcelestialobjectsintelescopeimages(seeFigure3.1 ).Whilemanualclassificationoftensufficesforsmallandsimpledatasetswithonlyafewattributes,largerandmorecomplexdatasetsrequireanautomatedsolution.

Figure3.1.ClassificationofgalaxiesfromtelescopeimagestakenfromtheNASAwebsite.

Thischapterintroducesthebasicconceptsofclassificationanddescribessomeofitskeyissuessuchasmodeloverfitting,modelselection,andmodelevaluation.Whilethesetopicsareillustratedusingaclassificationtechniqueknownasdecisiontreeinduction,mostofthediscussioninthischapterisalsoapplicabletootherclassificationtechniques,manyofwhicharecoveredinChapter4 .

3.1 Basic Concepts

Figure 3.2 illustrates the general idea behind classification. The data for a classification task consists of a collection of instances (records). Each such instance is characterized by the tuple (x, y), where x is the set of attribute values that describe the instance and y is the class label of the instance. The attribute set x can contain attributes of any type, while the class label y must be categorical.

Figure3.2.Aschematicillustrationofaclassificationtask.

A classification model is an abstract representation of the relationship between the attribute set and the class label. As will be seen in the next two chapters, the model can be represented in many ways, e.g., as a tree, a probability table, or simply, a vector of real-valued parameters. More formally, we can express it mathematically as a target function f that takes as input the attribute set x and produces an output corresponding to the predicted class label. The model is said to classify an instance (x, y) correctly if f(x) = y.

Table 3.1 shows examples of attribute sets and class labels for various classification tasks. Spam filtering and tumor identification are examples of binary classification problems, in which each data instance can be categorized into one of two classes. If the number of classes is larger than 2, as in the galaxy classification example, then it is called a multiclass classification problem.

Table 3.1. Examples of classification tasks.

Task  Attribute set  Class label
Spam filtering  Features extracted from email message header and content  spam or non-spam
Tumor identification  Features extracted from magnetic resonance imaging (MRI) scans  malignant or benign
Galaxy classification  Features extracted from telescope images  elliptical, spiral, or irregular-shaped

We illustrate the basic concepts of classification in this chapter with the following two examples.

Example 3.1 (Vertebrate Classification). Table 3.2 shows a sample data set for classifying vertebrates into mammals, reptiles, birds, fishes, and amphibians. The attribute set includes characteristics of the vertebrate such as its body temperature, skin cover, and ability to fly. The data set can also be used for a binary classification task such as mammal classification, by grouping the reptiles, birds, fishes, and amphibians into a single category called non-mammals.

Table 3.2. A sample data for the vertebrate classification problem.

Vertebrate Name  Body Temperature  Skin Cover  Gives Birth  Aquatic Creature  Aerial Creature  Has Legs  Hibernates  Class Label
human  warm-blooded  hair  yes  no  no  yes  no  mammal
python  cold-blooded  scales  no  no  no  no  yes  reptile
salmon  cold-blooded  scales  no  yes  no  no  no  fish
whale  warm-blooded  hair  yes  yes  no  no  no  mammal
frog  cold-blooded  none  no  semi  no  yes  yes  amphibian
komodo dragon  cold-blooded  scales  no  no  no  yes  no  reptile
bat  warm-blooded  hair  yes  no  yes  yes  yes  mammal
pigeon  warm-blooded  feathers  no  no  yes  yes  no  bird
cat  warm-blooded  fur  yes  no  no  yes  no  mammal
leopard shark  cold-blooded  scales  yes  yes  no  no  no  fish
turtle  cold-blooded  scales  no  semi  no  yes  no  reptile
penguin  warm-blooded  feathers  no  semi  no  yes  no  bird
porcupine  warm-blooded  quills  yes  no  no  yes  yes  mammal
eel  cold-blooded  scales  no  yes  no  no  no  fish
salamander  cold-blooded  none  no  semi  no  yes  yes  amphibian

Example 3.2 (Loan Borrower Classification). Consider the problem of predicting whether a loan borrower will repay the loan or default on the loan payments. The data set used to build the classification model is shown in Table 3.3. The attribute set includes personal information of the borrower such as marital status and annual income, while the class label indicates whether the borrower had defaulted on the loan payments.

Table 3.3. A sample data for the loan borrower classification problem.

ID   Home Owner   Marital Status   Annual Income   Defaulted?
1    Yes          Single           125,000         No
2    No           Married          100,000         No
3    No           Single           70,000          No
4    Yes          Married          120,000         No
5    No           Divorced         95,000          Yes
6    No           Single           60,000          No
7    Yes          Divorced         220,000         No
8    No           Single           85,000          Yes
9    No           Married          75,000          No
10   No           Single           90,000          Yes

A classification model serves two important roles in data mining. First, it is used as a predictive model to classify previously unlabeled instances. A good classification model must provide accurate predictions with a fast response time. Second, it serves as a descriptive model to identify the characteristics that distinguish instances from different classes. This is particularly useful for critical applications, such as medical diagnosis, where it is insufficient to have a model that makes a prediction without justifying how it reaches such a decision.

For example, a classification model induced from the vertebrate data set shown in Table 3.2 can be used to predict the class label of the following vertebrate:

Vertebrate Name  Body Temperature  Skin Cover  Gives Birth  Aquatic Creature  Aerial Creature  Has Legs  Hibernates  Class Label
gila monster     cold-blooded      scales      no           no                no               yes       yes         ?

In addition, it can be used as a descriptive model to help determine characteristics that define a vertebrate as a mammal, a reptile, a bird, a fish, or an amphibian. For example, the model may identify mammals as warm-blooded vertebrates that give birth to their young.

There are several points worth noting regarding the previous example. First, although all the attributes shown in Table 3.2 are qualitative, there are no restrictions on the type of attributes that can be used as predictor variables. The class label, on the other hand, must be of nominal type. This distinguishes classification from other predictive modeling tasks such as regression, where the predicted value is often quantitative. More information about regression can be found in Appendix D.

Another point worth noting is that not all attributes may be relevant to the classification task. For example, the average length or weight of a vertebrate may not be useful for classifying mammals, as these attributes can show the same value for both mammals and non-mammals. Such an attribute is typically discarded during preprocessing. The remaining attributes might not be able to distinguish the classes by themselves, and thus, must be used in concert with other attributes. For instance, the Body Temperature attribute is insufficient to distinguish mammals from other vertebrates. When it is used together with Gives Birth, the classification of mammals improves significantly. However, when additional attributes, such as Skin Cover, are included, the model becomes overly specific and no longer covers all mammals. Finding the optimal combination of attributes that best discriminates instances from different classes is the key challenge in building classification models.

3.2 General Framework for Classification
Classification is the task of assigning labels to unlabeled data instances, and a classifier is used to perform such a task. A classifier is typically described in terms of a model, as illustrated in the previous section. The model is created using a given set of instances, known as the training set, which contains attribute values as well as class labels for each instance. The systematic approach for learning a classification model given a training set is known as a learning algorithm. The process of using a learning algorithm to build a classification model from the training data is known as induction. This process is also often described as "learning a model" or "building a model." The process of applying a classification model on unseen test instances to predict their class labels is known as deduction. Thus, the process of classification involves two steps: applying a learning algorithm to training data to learn a model, and then applying the model to assign labels to unlabeled instances. Figure 3.3 illustrates the general framework for classification.

Figure3.3.Generalframeworkforbuildingaclassificationmodel.

A classification technique refers to a general approach to classification, e.g., the decision tree technique that we will study in this chapter. This classification technique, like most others, consists of a family of related models and a number of algorithms for learning these models. In Chapter 4, we will study additional classification techniques, including neural networks and support vector machines.

A couple of notes on terminology. First, the terms "classifier" and "model" are often taken to be synonymous. If a classification technique builds a single, global model, then this is fine. However, while every model defines a classifier, not every classifier is defined by a single model. Some classifiers, such as k-nearest neighbor classifiers, do not build an explicit model (Section 4.3), while other classifiers, such as ensemble classifiers, combine the output of a collection of models (Section 4.10). Second, the term "classifier" is often used in a more general sense to refer to a classification technique. Thus, for example, "decision tree classifier" can refer to the decision tree classification technique or a specific classifier built using that technique. Fortunately, the meaning of "classifier" is usually clear from the context.

In the general framework shown in Figure 3.3, the induction and deduction steps should be performed separately. In fact, as will be discussed later in Section 3.6, the training and test sets should be independent of each other to ensure that the induced model can accurately predict the class labels of instances it has never encountered before. Models that deliver such predictive insights are said to have good generalization performance. The performance of a model (classifier) can be evaluated by comparing the predicted labels against the true labels of instances. This information can be summarized in a table called a confusion matrix. Table 3.4 depicts the confusion matrix for a binary classification problem. Each entry f_ij denotes the number of instances from class i predicted to be of class j. For example, f_01 is the number of instances from class 0 incorrectly predicted as class 1. The number of correct predictions made by the model is (f_11 + f_00) and the number of incorrect predictions is (f_10 + f_01).

Table 3.4. Confusion matrix for a binary classification problem.

                               Predicted Class
                               Class = 1    Class = 0
Actual Class    Class = 1      f_11         f_10
                Class = 0      f_01         f_00

Although a confusion matrix provides the information needed to determine how well a classification model performs, summarizing this information into a single number makes it more convenient to compare the relative performance of different models. This can be done using an evaluation metric such as accuracy, which is computed in the following way:

$$\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}. \quad (3.1)$$

For binary classification problems, the accuracy of a model is given by

$$\text{Accuracy} = \frac{f_{11}+f_{00}}{f_{11}+f_{10}+f_{01}+f_{00}}. \quad (3.2)$$

Error rate is another related metric, which is defined as follows for binary classification problems:

$$\text{Error rate} = \frac{\text{Number of wrong predictions}}{\text{Total number of predictions}} = \frac{f_{10}+f_{01}}{f_{11}+f_{10}+f_{01}+f_{00}}. \quad (3.3)$$

The learning algorithms of most classification techniques are designed to learn models that attain the highest accuracy, or equivalently, the lowest error rate when applied to the test set. We will revisit the topic of model evaluation in Section 3.6.
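Equations (3.1)-(3.3) can be evaluated directly from the four entries of Table 3.4. Below is a minimal Python sketch; the counts used in the example call are hypothetical, not taken from any data set in this chapter.

# Accuracy and error rate from a 2x2 confusion matrix (Table 3.4),
# where f_ij = number of class-i instances predicted as class j.

def accuracy(f11, f10, f01, f00):
    """Fraction of correct predictions, Equation (3.2)."""
    return (f11 + f00) / (f11 + f10 + f01 + f00)

def error_rate(f11, f10, f01, f00):
    """Fraction of wrong predictions, Equation (3.3)."""
    return (f10 + f01) / (f11 + f10 + f01 + f00)

# Example with hypothetical counts:
print(accuracy(40, 10, 5, 45))    # 0.85
print(error_rate(40, 10, 5, 45))  # 0.15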

3.3DecisionTreeClassifierThissectionintroducesasimpleclassificationtechniqueknownasthedecisiontreeclassifier.Toillustratehowadecisiontreeworks,considertheclassificationproblemofdistinguishingmammalsfromnon-mammalsusingthevertebratedatasetshowninTable3.2 .Supposeanewspeciesisdiscoveredbyscientists.Howcanwetellwhetheritisamammaloranon-mammal?Oneapproachistoposeaseriesofquestionsaboutthecharacteristicsofthespecies.Thefirstquestionwemayaskiswhetherthespeciesiscold-orwarm-blooded.Ifitiscold-blooded,thenitisdefinitelynotamammal.Otherwise,itiseitherabirdoramammal.Inthelattercase,weneedtoaskafollow-upquestion:Dothefemalesofthespeciesgivebirthtotheiryoung?Thosethatdogivebirtharedefinitelymammals,whilethosethatdonotarelikelytobenon-mammals(withtheexceptionofegg-layingmammalssuchastheplatypusandspinyanteater).

Thepreviousexampleillustrateshowwecansolveaclassificationproblembyaskingaseriesofcarefullycraftedquestionsabouttheattributesofthetestinstance.Eachtimewereceiveananswer,wecouldaskafollow-upquestionuntilwecanconclusivelydecideonitsclasslabel.Theseriesofquestionsandtheirpossibleanswerscanbeorganizedintoahierarchicalstructurecalledadecisiontree.Figure3.4 showsanexampleofthedecisiontreeforthemammalclassificationproblem.Thetreehasthreetypesofnodes:

Arootnode,withnoincominglinksandzeroormoreoutgoinglinks.Internalnodes,eachofwhichhasexactlyoneincominglinkandtwoormoreoutgoinglinks.Leaforterminalnodes,eachofwhichhasexactlyoneincominglinkandnooutgoinglinks.

Everyleafnodeinthedecisiontreeisassociatedwithaclasslabel.Thenon-terminalnodes,whichincludetherootandinternalnodes,containattributetestconditionsthataretypicallydefinedusingasingleattribute.Eachpossibleoutcomeoftheattributetestconditionisassociatedwithexactlyonechildofthisnode.Forexample,therootnodeofthetreeshowninFigure3.4 usestheattribute todefineanattributetestconditionthathastwooutcomes,warmandcold,resultingintwochildnodes.

Figure3.4.Adecisiontreeforthemammalclassificationproblem.

Given a decision tree, classifying a test instance is straightforward. Starting from the root node, we apply its attribute test condition and follow the appropriate branch based on the outcome of the test. This will lead us either to another internal node, for which a new attribute test condition is applied, or to a leaf node. Once a leaf node is reached, we assign the class label associated with the node to the test instance. As an illustration, Figure 3.5 traces the path used to predict the class label of a flamingo. The path terminates at a leaf node labeled as Non-mammals.

Figure 3.5. Classifying an unlabeled vertebrate. The dashed lines represent the outcomes of applying various attribute test conditions on the unlabeled vertebrate. The vertebrate is eventually assigned to the Non-mammals class.

3.3.1ABasicAlgorithmtoBuildaDecisionTree

There are many possible decision trees that can be constructed from a particular data set. While some trees are better than others, finding an optimal one is computationally expensive due to the exponential size of the search space. Efficient algorithms have been developed to induce a reasonably accurate, albeit suboptimal, decision tree in a reasonable amount of time. These algorithms usually employ a greedy strategy to grow the decision tree in a top-down fashion by making a series of locally optimal decisions about which attribute to use when partitioning the training data. One of the earliest methods is Hunt's algorithm, which is the basis for many current implementations of decision tree classifiers, including ID3, C4.5, and CART. This subsection presents Hunt's algorithm and describes some of the design issues that must be considered when building a decision tree.

Hunt'sAlgorithmInHunt'salgorithm,adecisiontreeisgrowninarecursivefashion.Thetreeinitiallycontainsasinglerootnodethatisassociatedwithallthetraininginstances.Ifanodeisassociatedwithinstancesfrommorethanoneclass,itisexpandedusinganattributetestconditionthatisdeterminedusingasplittingcriterion.Achildleafnodeiscreatedforeachoutcomeoftheattributetestconditionandtheinstancesassociatedwiththeparentnodearedistributedtothechildrenbasedonthetestoutcomes.Thisnodeexpansionstepcanthenberecursivelyappliedtoeachchildnode,aslongasithaslabelsofmorethanoneclass.Ifalltheinstancesassociatedwithaleafnodehaveidenticalclasslabels,thenthenodeisnotexpandedanyfurther.Eachleafnodeisassignedaclasslabelthatoccursmostfrequentlyinthetraininginstancesassociatedwiththenode.

To illustrate how the algorithm works, consider the training set shown in Table 3.3 for the loan borrower classification problem. Suppose we apply Hunt's algorithm to fit the training data. The tree initially contains only a single leaf node, as shown in Figure 3.6(a). This node is labeled as Defaulted = No, since the majority of the borrowers did not default on their loan payments. The training error of this tree is 30%, as three out of the ten training instances have the class label Defaulted = Yes. The leaf node can therefore be further expanded because it contains training instances from more than one class.

Figure 3.6. Hunt's algorithm for building decision trees.

Let Home Owner be the attribute chosen to split the training instances. The justification for choosing this attribute as the attribute test condition will be discussed later. The resulting binary split on the Home Owner attribute is shown in Figure 3.6(b). All the training instances for which Home Owner = Yes are propagated to the left child of the root node and the rest are propagated to the right child. Hunt's algorithm is then recursively applied to each child. The left child becomes a leaf node labeled Defaulted = No, since all instances associated with this node have the identical class label Defaulted = No. The right child has instances from each class label. Hence, we split it further. The resulting subtrees after recursively expanding the right child are shown in Figures 3.6(c) and (d).

Hunt'salgorithm,asdescribedabove,makessomesimplifyingassumptionsthatareoftennottrueinpractice.Inthefollowing,wedescribetheseassumptionsandbrieflydiscusssomeofthepossiblewaysforhandlingthem.

1. SomeofthechildnodescreatedinHunt'salgorithmcanbeemptyifnoneofthetraininginstanceshavetheparticularattributevalues.Onewaytohandlethisisbydeclaringeachofthemasaleafnodewithaclasslabelthatoccursmostfrequentlyamongthetraininginstancesassociatedwiththeirparentnodes.

2. Ifalltraininginstancesassociatedwithanodehaveidenticalattributevaluesbutdifferentclasslabels,itisnotpossibletoexpandthisnodeanyfurther.Onewaytohandlethiscaseistodeclareitaleafnodeandassignittheclasslabelthatoccursmostfrequentlyinthetraininginstancesassociatedwiththisnode.

DesignIssuesofDecisionTreeInductionHunt'salgorithmisagenericprocedureforgrowingdecisiontreesinagreedyfashion.Toimplementthealgorithm,therearetwokeydesignissuesthatmustbeaddressed.

1. What is the splitting criterion? At each recursive step, an attribute must be selected to partition the training instances associated with a node into smaller subsets associated with its child nodes. The splitting criterion determines which attribute is chosen as the test condition and how the training instances should be distributed to the child nodes. This will be discussed in Sections 3.3.2 and 3.3.3.

2. Whatisthestoppingcriterion?Thebasicalgorithmstopsexpandinganodeonlywhenallthetraininginstancesassociatedwiththenodehavethesameclasslabelsorhaveidenticalattributevalues.Althoughtheseconditionsaresufficient,therearereasonstostopexpandinganodemuchearliereveniftheleafnodecontainstraininginstancesfrommorethanoneclass.Thisprocessiscalledearlyterminationandtheconditionusedtodeterminewhenanodeshouldbestoppedfromexpandingiscalledastoppingcriterion.TheadvantagesofearlyterminationarediscussedinSection3.4 .

3.3.2MethodsforExpressingAttributeTestConditions

Decisiontreeinductionalgorithmsmustprovideamethodforexpressinganattributetestconditionanditscorrespondingoutcomesfordifferentattributetypes.

BinaryAttributes

Thetestconditionforabinaryattributegeneratestwopotentialoutcomes,asshowninFigure3.7 .

Figure3.7.Attributetestconditionforabinaryattribute.

NominalAttributes

Since a nominal attribute can have many values, its attribute test condition can be expressed in two ways, as a multiway split or a binary split, as shown in Figure 3.8. For a multiway split (Figure 3.8(a)), the number of outcomes depends on the number of distinct values for the corresponding attribute. For example, if an attribute such as marital status has three distinct values (single, married, or divorced), its test condition will produce a three-way split. It is also possible to create a binary split by partitioning all values taken by the nominal attribute into two groups. For example, some decision tree algorithms, such as CART, produce only binary splits by considering all $2^{k-1}-1$ ways of creating a binary partition of k attribute values. Figure 3.8(b) illustrates three different ways of grouping the attribute values for marital status into two subsets.

Figure3.8.Attributetestconditionsfornominalattributes.

OrdinalAttributes

Ordinalattributescanalsoproducebinaryormulti-waysplits.Ordinalattributevaluescanbegroupedaslongasthegroupingdoesnotviolatetheorderpropertyoftheattributevalues.Figure3.9 illustratesvariouswaysofsplittingtrainingrecordsbasedontheShirtSizeattribute.ThegroupingsshowninFigures3.9(a) and(b) preservetheorderamongtheattributevalues,whereasthegroupingshowninFigure3.9(c) violatesthispropertybecauseitcombinestheattributevaluesSmallandLargeintothesamepartitionwhileMediumandExtraLargearecombinedintoanotherpartition.

Figure3.9.Differentwaysofgroupingordinalattributevalues.

ContinuousAttributes

For continuous attributes, the attribute test condition can be expressed as a comparison test (e.g., A < v) producing a binary split, or as a range query of the form $v_i \le A < v_{i+1}$, for $i = 1, \ldots, k$, producing a multiway split. The difference between these approaches is shown in Figure 3.10. For the binary split, any possible value v between the minimum and maximum attribute values in the training data can be used for constructing the comparison test A < v. However, it is sufficient to only consider distinct attribute values in the training set as candidate split positions. For the multiway split, any possible collection of attribute value ranges can be used, as long as they are mutually exclusive and cover the entire range of attribute values between the minimum and maximum values observed in the training set. One approach for constructing multiway splits is to apply the discretization strategies described in Section 2.3.6 on page 63. After discretization, a new ordinal value is assigned to each discretized interval, and the attribute test condition is then defined using this newly constructed ordinal attribute.

Figure3.10.Testconditionforcontinuousattributes.

3.3.3MeasuresforSelectinganAttributeTestCondition

Therearemanymeasuresthatcanbeusedtodeterminethegoodnessofanattributetestcondition.Thesemeasurestrytogivepreferencetoattributetestconditionsthatpartitionthetraininginstancesintopurersubsetsinthechildnodes,whichmostlyhavethesameclasslabels.Havingpurernodesisusefulsinceanodethathasallofitstraininginstancesfromthesameclassdoesnotneedtobeexpandedfurther.Incontrast,animpurenodecontainingtraininginstancesfrommultipleclassesislikelytorequireseverallevelsofnodeexpansions,therebyincreasingthedepthofthetreeconsiderably.Largertreesarelessdesirableastheyaremoresusceptibletomodeloverfitting,aconditionthatmaydegradetheclassificationperformanceonunseeninstances,aswillbediscussedinSection3.4 .Theyarealsodifficulttointerpretandincurmoretrainingandtesttimeascomparedtosmallertrees.

Inthefollowing,wepresentdifferentwaysofmeasuringtheimpurityofanodeandthecollectiveimpurityofitschildnodes,bothofwhichwillbeusedtoidentifythebestattributetestconditionforanode.

Impurity Measure for a Single Node
The impurity of a node measures how dissimilar the class labels are for the data instances belonging to a common node. Following are examples of measures that can be used to evaluate the impurity of a node t:

$$\text{Entropy} = -\sum_{i=0}^{c-1} p_i(t) \log_2 p_i(t), \quad (3.4)$$
$$\text{Gini index} = 1 - \sum_{i=0}^{c-1} p_i(t)^2, \quad (3.5)$$
$$\text{Classification error} = 1 - \max_i [p_i(t)], \quad (3.6)$$

where $p_i(t)$ is the relative frequency of training instances that belong to class i at node t, c is the total number of classes, and $0 \log_2 0 = 0$ in entropy calculations. All three measures give a zero impurity value if a node contains instances from a single class and maximum impurity if the node has an equal proportion of instances from multiple classes.

Figure 3.11 compares the relative magnitude of the impurity measures when applied to binary classification problems. Since there are only two classes, $p_0(t) + p_1(t) = 1$. The horizontal axis p refers to the fraction of instances that belong to one of the two classes. Observe that all three measures attain their maximum value when the class distribution is uniform (i.e., $p_0(t) = p_1(t) = 0.5$) and minimum value when all the instances belong to a single class (i.e., either $p_0(t)$ or $p_1(t)$ equals 1). The following examples illustrate how the values of the impurity measures vary as we alter the class distribution.

Figure 3.11. Comparison among the impurity measures for binary classification problems.

Node N1      Count      Gini = 1 - (0/6)^2 - (6/6)^2 = 0
Class = 0    0          Entropy = -(0/6) log2(0/6) - (6/6) log2(6/6) = 0
Class = 1    6          Error = 1 - max[0/6, 6/6] = 0

Node N2      Count      Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278
Class = 0    1          Entropy = -(1/6) log2(1/6) - (5/6) log2(5/6) = 0.650
Class = 1    5          Error = 1 - max[1/6, 5/6] = 0.167

Node N3      Count      Gini = 1 - (3/6)^2 - (3/6)^2 = 0.5
Class = 0    3          Entropy = -(3/6) log2(3/6) - (3/6) log2(3/6) = 1
Class = 1    3          Error = 1 - max[3/6, 3/6] = 0.5

Based on these calculations, node N1 has the lowest impurity value, followed by N2 and N3. This example, along with Figure 3.11, shows the consistency among the impurity measures, i.e., if a node N1 has lower entropy than node N2, then the Gini index and error rate of N1 will also be lower than that of N2. Despite their agreement, the attribute chosen as splitting criterion by the impurity measures can still be different (see Exercise 6 on page 187).
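The impurity values for nodes N1, N2, and N3 can be verified with a direct transcription of Equations (3.4)-(3.6) into Python:

from math import log2

def entropy(counts):
    """Equation (3.4); skipping zero and full counts enforces 0 log2 0 = 0."""
    n = sum(counts)
    return sum(-(c / n) * log2(c / n) for c in counts if 0 < c < n)

def gini(counts):
    """Equation (3.5)."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def classification_error(counts):
    """Equation (3.6)."""
    return 1 - max(counts) / sum(counts)

for counts in [(0, 6), (1, 5), (3, 3)]:   # class counts of nodes N1, N2, N3
    print(counts, round(gini(counts), 3), round(entropy(counts), 3),
          round(classification_error(counts), 3))
# N1: 0, 0, 0;   N2: 0.278, 0.65, 0.167;   N3: 0.5, 1.0, 0.5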

Collective Impurity of Child Nodes
Consider an attribute test condition that splits a node containing N training instances into k children, $\{v_1, v_2, \ldots, v_k\}$, where every child node represents a partition of the data resulting from one of the k outcomes of the attribute test condition. Let $N(v_j)$ be the number of training instances associated with a child node $v_j$, whose impurity value is $I(v_j)$. Since a training instance in the parent node reaches node $v_j$ for a fraction of $N(v_j)/N$ times, the collective impurity of the child nodes can be computed by taking a weighted sum of the impurities of the child nodes, as follows:

$$I(\text{children}) = \sum_{j=1}^{k} \frac{N(v_j)}{N}\, I(v_j), \quad (3.7)$$

Example 3.3. Weighted Entropy
Consider the candidate attribute test conditions shown in Figures 3.12(a) and (b) for the loan borrower classification problem. Splitting on the Home Owner attribute will generate two child nodes whose weighted entropy can be calculated as follows:

I(Home Owner = yes) = -(0/3) log2(0/3) - (3/3) log2(3/3) = 0
I(Home Owner = no)  = -(3/7) log2(3/7) - (4/7) log2(4/7) = 0.985
I(Home Owner)       = (3/10) x 0 + (7/10) x 0.985 = 0.690

Splitting on Marital Status, on the other hand, leads to three child nodes with a weighted entropy given by

I(Marital Status = Single)   = -(2/5) log2(2/5) - (3/5) log2(3/5) = 0.971
I(Marital Status = Married)  = -(0/3) log2(0/3) - (3/3) log2(3/3) = 0
I(Marital Status = Divorced) = -(1/2) log2(1/2) - (1/2) log2(1/2) = 1.000
I(Marital Status)            = (5/10) x 0.971 + (3/10) x 0 + (2/10) x 1 = 0.686

Thus, Marital Status has a lower weighted entropy than Home Owner.

Figure 3.12. Examples of candidate attribute test conditions.

Identifying the Best Attribute Test Condition
To determine the goodness of an attribute test condition, we need to compare the degree of impurity of the parent node (before splitting) with the weighted degree of impurity of the child nodes (after splitting). The larger their difference, the better the test condition. This difference, $\Delta$, also termed as the gain in purity of an attribute test condition, can be defined as follows:

$$\Delta = I(\text{parent}) - I(\text{children}), \quad (3.8)$$

where I(parent) is the impurity of a node before splitting and I(children) is the weighted impurity measure after splitting. It can be shown that the gain is non-negative, i.e., $I(\text{parent}) \ge I(\text{children})$, for any reasonable measure such as those presented above. The higher the gain, the purer are the classes in the child nodes relative to the parent node. The splitting criterion in the decision tree learning algorithm selects the attribute test condition that shows the maximum gain. Note that maximizing the gain at a given node is equivalent to minimizing the weighted impurity measure of its children, since I(parent) is the same for all candidate attribute test conditions. Finally, when entropy is used as the impurity measure, the difference in entropy is commonly known as information gain, $\Delta_{info}$.

Figure 3.13. Splitting criteria for the loan borrower classification problem using Gini index.

In the following, we present illustrative approaches for identifying the best attribute test condition given qualitative or quantitative attributes.

Splitting of Qualitative Attributes
Consider the first two candidate splits shown in Figure 3.12 involving the qualitative attributes Home Owner and Marital Status. The initial class distribution at the parent node is (0.3, 0.7), since there are 3 instances of class Yes and 7 instances of class No in the training data. Thus,

$$I(\text{parent}) = -\frac{3}{10}\log_2\frac{3}{10} - \frac{7}{10}\log_2\frac{7}{10} = 0.881$$

The information gains for Home Owner and Marital Status are each given by

$$\Delta_{info}(\text{Home Owner}) = 0.881 - 0.690 = 0.191$$
$$\Delta_{info}(\text{Marital Status}) = 0.881 - 0.686 = 0.195$$

The information gain for Marital Status is thus higher due to its lower weighted entropy, and it will therefore be chosen for splitting.
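These information gains are easy to check programmatically. In the short sketch below, the class counts of each child node are read off Table 3.3 and the helper names are ours; small differences in the last decimal place arise because the text rounds intermediate values.

from math import log2

def entropy(counts):
    """Entropy of a class distribution given as counts, Equation (3.4)."""
    n = sum(counts)
    return sum(-(c / n) * log2(c / n) for c in counts if 0 < c < n)

def weighted_entropy(children):
    """Collective impurity of child nodes, Equation (3.7), with entropy as I."""
    total = sum(sum(c) for c in children)
    return sum(sum(c) / total * entropy(c) for c in children)

parent = (3, 7)                       # (Defaulted = Yes, Defaulted = No) in Table 3.3
home_owner = [(0, 3), (3, 4)]         # children: Home Owner = Yes, No
marital = [(2, 3), (0, 3), (1, 1)]    # children: Single, Married, Divorced

print(round(entropy(parent) - weighted_entropy(home_owner), 3))   # 0.192 (0.191 in the text)
print(round(entropy(parent) - weighted_entropy(marital), 3))      # 0.196 (0.195 in the text)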

Binary Splitting of Qualitative Attributes
Consider building a decision tree using only binary splits and the Gini index as the impurity measure. Figure 3.13 shows examples of four candidate splitting criteria for the Home Owner and Marital Status attributes. Since there are 3 borrowers in the training set who defaulted and 7 others who repaid their loan (see the table in Figure 3.13), the Gini index of the parent node before splitting is

$$1 - \left(\frac{3}{10}\right)^2 - \left(\frac{7}{10}\right)^2 = 0.420.$$

If Home Owner is chosen as the splitting attribute, the Gini index for the child nodes N1 and N2 are 0 and 0.490, respectively. The weighted average Gini index for the children is

$$\frac{3}{10} \times 0 + \frac{7}{10} \times 0.490 = 0.343,$$

where the weights represent the proportion of training instances assigned to each child. The gain using Home Owner as splitting attribute is 0.420 - 0.343 = 0.077. Similarly, we can apply a binary split on the Marital Status attribute. However, since Marital Status is a nominal attribute with three outcomes, there are three possible ways to group the attribute values into a binary split. The weighted average Gini index of the children for each candidate binary split is shown in Figure 3.13. Based on these results, Home Owner and the last binary split using Marital Status are clearly the best candidates, since they both produce the lowest weighted average Gini index. Binary splits can also be used for ordinal attributes, if the binary partitioning of the attribute values does not violate the ordering property of the values.

Binary Splitting of Quantitative Attributes
Consider the problem of identifying the best binary split Annual Income ≤ τ for the preceding loan approval classification problem. As discussed previously, even though τ can take any value between the minimum and maximum values of annual income in the training set, it is sufficient to only consider the annual income values observed in the training set as candidate split positions. For each candidate τ, the training set is scanned once to count the number of borrowers with annual income less than or greater than τ, along with their class proportions. We can then compute the Gini index at each candidate split position and choose the τ that produces the lowest value. Computing the Gini index at each candidate split position requires O(N) operations, where N is the number of training instances. Since there are at most N possible candidates, the overall complexity of this brute-force method is $O(N^2)$. It is possible to reduce the complexity of this problem to $O(N \log N)$ by using the method described as follows (see the illustration in Figure 3.14). In this method, we first sort the training instances based on their annual income, a one-time cost that requires $O(N \log N)$ operations. The candidate split positions are given by the midpoints between every two adjacent sorted values: $55,000, $65,000, $72,500, and so on. For the first candidate, since none of the instances has an annual income less than or equal to $55,000, the Gini index for the child node with Annual Income < $55,000 is equal to zero. In contrast, there are 3 training instances of class Yes and 7 instances of class No with annual income greater than $55,000. The Gini index for this node is 0.420. The weighted average Gini index for the first candidate split position, τ = $55,000, is equal to 0 x 0 + 1 x 0.420 = 0.420.

Figure3.14.Splittingcontinuousattributes.

For the next candidate, τ = $65,000, the class distribution of its child nodes can be obtained with a simple update of the distribution for the previous candidate. This is because, as τ increases from $55,000 to $65,000, there is only one training instance affected by the change. By examining the class label of the affected training instance, the new class distribution is obtained. For example, as τ increases to $65,000, there is only one borrower in the training set, with an annual income of $60,000, affected by this change. Since the class label for the borrower is No, the count for class No increases from 0 to 1 (for Annual Income ≤ $65,000) and decreases from 7 to 6 (for Annual Income > $65,000), as shown in Figure 3.14. The distribution for the Yes class remains unaffected. The updated Gini index for this candidate split position is 0.400.

This procedure is repeated until the Gini index for all candidates is found. The best split position corresponds to the one that produces the lowest Gini index, which occurs at τ = $97,500. Since the Gini index at each candidate split position can be computed in O(1) time, the complexity of finding the best split position is O(N) once all the values are kept sorted, a one-time operation that takes $O(N \log N)$ time. The overall complexity of this method is thus $O(N \log N)$, which is much smaller than the $O(N^2)$ time taken by the brute-force method. The amount of computation can be further reduced by considering only candidate split positions located between two adjacent sorted instances with different class labels. For example, we do not need to consider candidate split positions located between $60,000 and $75,000 because all three instances with annual income in this range ($60,000, $70,000, and $75,000) have the same class labels. Choosing a split position within this range only increases the degree of impurity, compared to a split position located outside this range. Therefore, the candidate split positions at τ = $65,000 and τ = $72,500 can be ignored. Similarly, we do not need to consider the candidate split positions at $87,500, $92,500, $110,000, $122,500, and $172,500 because they are located between two adjacent instances with the same labels. This strategy reduces the number of candidate split positions to consider from 9 to 2 (excluding the two boundary cases τ = $55,000 and τ = $230,000).
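The O(N log N) procedure just described can be sketched in a few lines of Python. The snippet below sorts the annual incomes of Table 3.3, updates the class counts incrementally at each midpoint, and recovers the best split at τ = $97,500. Variable names are ours, and the boundary candidates are omitted since a split that keeps all instances on one side cannot beat an interior midpoint.

# Annual income (in dollars) and class label (1 = defaulted) from Table 3.3.
data = [(125000, 0), (100000, 0), (70000, 0), (120000, 0), (95000, 1),
        (60000, 0), (220000, 0), (85000, 1), (75000, 0), (90000, 1)]

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

data.sort()                               # one-time O(N log N) sort
n = len(data)
total_pos = sum(y for _, y in data)

best_tau, best_gini = None, float("inf")
left = [0, 0]                             # class counts for Annual Income <= tau
for i in range(n - 1):
    left[data[i][1]] += 1                 # incremental O(1) update per candidate
    right = [(n - i - 1) - (total_pos - left[1]), total_pos - left[1]]
    tau = (data[i][0] + data[i + 1][0]) / 2
    w = ((i + 1) / n) * gini(left) + ((n - i - 1) / n) * gini(right)
    if w < best_gini:
        best_tau, best_gini = tau, w

print(best_tau, round(best_gini, 3))      # 97500.0 0.3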

Gain Ratio
One potential limitation of impurity measures such as entropy and the Gini index is that they tend to favor qualitative attributes with a large number of distinct values. Figure 3.12 shows three candidate attributes for partitioning the data set given in Table 3.3. As previously mentioned, the attribute Marital Status is a better choice than the attribute Home Owner, because it provides a larger information gain. However, if we compare them against Customer ID, the latter produces the purest partitions with the maximum information gain, since the weighted entropy and Gini index are equal to zero for its children. Yet, Customer ID is not a good attribute for splitting because it has a unique value for each instance. Even though a test condition involving Customer ID will accurately classify every instance in the training data, we cannot use such a test condition on new test instances with Customer ID values that haven't been seen before during training. This example suggests that having a low impurity value alone is insufficient to find a good attribute test condition for a node. As we will see later in Section 3.4, having a larger number of child nodes can make a decision tree more complex and consequently more susceptible to overfitting. Hence, the number of children produced by the splitting attribute should also be taken into consideration while deciding the best attribute test condition.

There are two ways to overcome this problem. One way is to generate only binary decision trees, thus avoiding the difficulty of handling attributes with varying numbers of partitions. This strategy is employed by decision tree classifiers such as CART. Another way is to modify the splitting criterion to take into account the number of partitions produced by the attribute. For example, in the C4.5 decision tree algorithm, a measure known as gain ratio is used to compensate for attributes that produce a large number of child nodes. This measure is computed as follows:

$$\text{Gain ratio} = \frac{\Delta_{info}}{\text{Split Info}} = \frac{\text{Entropy(Parent)} - \sum_{i=1}^{k} \frac{N(v_i)}{N}\text{Entropy}(v_i)}{-\sum_{i=1}^{k} \frac{N(v_i)}{N} \log_2 \frac{N(v_i)}{N}} \quad (3.9)$$

where $N(v_i)$ is the number of instances assigned to node $v_i$ and k is the total number of splits. The split information measures the entropy of splitting a node into its child nodes and evaluates if the split results in a larger number of equally-sized child nodes or not. For example, if every partition has the same number of instances, then $\forall i: N(v_i)/N = 1/k$ and the split information would be equal to $\log_2 k$. Thus, if an attribute produces a large number of splits, its split information is also large, which in turn reduces the gain ratio.

Example 3.4. Gain Ratio
Consider the data set given in Exercise 2 on page 185. We want to select the best attribute test condition among the following three attributes: Gender, Car Type, and Customer ID. The entropy before splitting is

Entropy(parent) = -(10/20) log2(10/20) - (10/20) log2(10/20) = 1.

If Gender is used as the attribute test condition:

Entropy(children) = (10/20)[-(6/10) log2(6/10) - (4/10) log2(4/10)] x 2 = 0.971
Gain Ratio = (1 - 0.971) / (-(10/20) log2(10/20) - (10/20) log2(10/20)) = 0.029/1 = 0.029

If Car Type is used as the attribute test condition:

Entropy(children) = (4/20)[-(1/4) log2(1/4) - (3/4) log2(3/4)] + (8/20) x 0
                    + (8/20)[-(1/8) log2(1/8) - (7/8) log2(7/8)] = 0.380
Gain Ratio = (1 - 0.380) / (-(4/20) log2(4/20) - (8/20) log2(8/20) - (8/20) log2(8/20)) = 0.620/1.52 = 0.41

Finally, if Customer ID is used as the attribute test condition:

Entropy(children) = (1/20)[-(1/1) log2(1/1) - (0/1) log2(0/1)] x 20 = 0
Gain Ratio = (1 - 0) / (-(1/20) log2(1/20) x 20) = 1/4.32 = 0.23

Thus, even though Customer ID has the highest information gain, its gain ratio is lower than that of Car Type since it produces a larger number of splits.
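A small Python sketch of Equation (3.9); the three calls below reproduce the gain ratios computed above, using the class counts implied by the fractions in the example (a balanced parent of 20 instances).

from math import log2

def entropy(counts):
    n = sum(counts)
    return sum(-(c / n) * log2(c / n) for c in counts if 0 < c < n)

def gain_ratio(parent, children):
    """Equation (3.9): information gain divided by split information."""
    n = sum(parent)
    info_gain = entropy(parent) - sum(sum(c) / n * entropy(c) for c in children)
    split_info = -sum(sum(c) / n * log2(sum(c) / n) for c in children)
    return info_gain / split_info

parent = (10, 10)
print(round(gain_ratio(parent, [(6, 4), (4, 6)]), 3))               # about 0.029 (two-way split)
print(round(gain_ratio(parent, [(1, 3), (8, 0), (1, 7)]), 3))       # about 0.41  (three-way split)
print(round(gain_ratio(parent, [(1, 0)] * 10 + [(0, 1)] * 10), 3))  # about 0.23  (twenty-way split)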

3.3.4AlgorithmforDecisionTreeInduction

Algorithm3.1 presentsapseudocodefordecisiontreeinductionalgorithm.TheinputtothisalgorithmisasetoftraininginstancesEalongwiththeattributesetF.Thealgorithmworksbyrecursivelyselectingthebestattributetosplitthedata(Step7)andexpandingthenodesofthetree(Steps11and12)untilthestoppingcriterionismet(Step1).Thedetailsofthisalgorithmareexplainedbelow.

1. The createNode() function extends the decision tree by creating a new node. A node in the decision tree either has a test condition, denoted as node.test_cond, or a class label, denoted as node.label.

2. The find_best_split() function determines the attribute test condition for partitioning the training instances associated with a node. The splitting attribute chosen depends on the impurity measure used. The popular measures include entropy and the Gini index.

3. The Classify() function determines the class label to be assigned to a leaf node. For each leaf node t, let p(i|t) denote the fraction of training instances from class i associated with the node t. The label assigned to the leaf node is typically the one that occurs most frequently in the training instances that are associated with this node:

$$\text{leaf.label} = \operatorname*{argmax}_i \; p(i|t), \quad (3.10)$$

where the argmax operator returns the class i that maximizes p(i|t). Besides providing the information needed to determine the class label of a leaf node, p(i|t) can also be used as a rough estimate of the probability that an instance assigned to the leaf node t belongs to class i. Sections 4.11.2 and 4.11.4 in the next chapter describe how such probability estimates can be used to determine the performance of a decision tree under different cost functions.

Algorithm 3.1 A skeleton decision tree induction algorithm.

4. The stopping_cond() function is used to terminate the tree-growing process by checking whether all the instances have identical class label or attribute values. Since decision tree classifiers employ a top-down, recursive partitioning approach for building a model, the number of training instances associated with a node decreases as the depth of the tree increases. As a result, a leaf node may contain too few training instances to make a statistically significant decision about its class label. This is known as the data fragmentation problem. One way to avoid this problem is to disallow splitting of a node when the number of instances associated with the node falls below a certain threshold. A more systematic way to control the size of a decision tree (number of leaf nodes) will be discussed in Section 3.5.4.
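The following Python version may help make the recursion in Algorithm 3.1 concrete. It is a minimal sketch under simplifying assumptions: multiway splits on qualitative attributes only, entropy as the impurity measure, and the two stopping conditions described above (identical class labels or no usable attributes left). The function and field names echo the pseudocode but are otherwise our own.

from math import log2
from collections import Counter

class Node:
    def __init__(self):
        self.test_attr = None      # attribute test condition (an attribute index)
        self.children = {}         # outcome -> child Node
        self.label = None          # class label (leaf nodes only)

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def find_best_split(X, y, attrs):
    """Return the attribute whose multiway split minimizes weighted entropy (max gain)."""
    def weighted(a):
        n = len(X)
        groups = Counter(x[a] for x in X)
        return sum(cnt / n * entropy([y[i] for i, x in enumerate(X) if x[a] == v])
                   for v, cnt in groups.items())
    return min(attrs, key=weighted)

def stopping_cond(X, y, attrs):
    return len(set(y)) == 1 or not attrs or all(x == X[0] for x in X)

def classify_leaf(y):
    return Counter(y).most_common(1)[0][0]     # majority label, Equation (3.10)

def tree_growth(X, y, attrs):
    node = Node()
    if stopping_cond(X, y, attrs):
        node.label = classify_leaf(y)
        return node
    node.test_attr = find_best_split(X, y, attrs)
    remaining = [a for a in attrs if a != node.test_attr]
    for v in set(x[node.test_attr] for x in X):
        idx = [i for i, x in enumerate(X) if x[node.test_attr] == v]
        node.children[v] = tree_growth([X[i] for i in idx], [y[i] for i in idx], remaining)
    return node

def predict(node, x):
    while node.label is None:
        node = node.children[x[node.test_attr]]   # assumes the outcome was seen in training
    return node.label

# Tiny demo on the qualitative attributes of Table 3.3 (Home Owner, Marital Status):
X = [("Yes", "Single"), ("No", "Married"), ("No", "Single"), ("Yes", "Married"),
     ("No", "Divorced"), ("No", "Single"), ("Yes", "Divorced"), ("No", "Single"),
     ("No", "Married"), ("No", "Single")]
y = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
tree = tree_growth(X, y, attrs=[0, 1])
print(predict(tree, ("No", "Married")))   # "No"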

3.3.5ExampleApplication:WebRobotDetection

Considerthetaskofdistinguishingtheaccesspatternsofwebrobotsfromthosegeneratedbyhumanusers.Awebrobot(alsoknownasawebcrawler)isasoftwareprogramthatautomaticallyretrievesfilesfromoneormorewebsitesbyfollowingthehyperlinksextractedfromaninitialsetofseedURLs.Theseprogramshavebeendeployedforvariouspurposes,fromgatheringwebpagesonbehalfofsearchenginestomoremaliciousactivitiessuchasspammingandcommittingclickfraudsinonlineadvertisements.


Figure3.15.Inputdataforwebrobotdetection.

Thewebrobotdetectionproblemcanbecastasabinaryclassificationtask.Theinputdatafortheclassificationtaskisawebserverlog,asampleofwhichisshowninFigure3.15(a) .Eachlineinthelogfilecorrespondstoarequestmadebyaclient(i.e.,ahumanuserorawebrobot)tothewebserver.Thefieldsrecordedintheweblogincludetheclient'sIPaddress,timestampoftherequest,URLoftherequestedfile,sizeofthefile,anduseragent,whichisafieldthatcontainsidentifyinginformationabouttheclient.

Forhumanusers,theuseragentfieldspecifiesthetypeofwebbrowserormobiledeviceusedtofetchthefiles,whereasforwebrobots,itshouldtechnicallycontainthenameofthecrawlerprogram.However,webrobotsmayconcealtheirtrueidentitiesbydeclaringtheiruseragentfieldstobeidenticaltoknownbrowsers.Therefore,useragentisnotareliablefieldtodetectwebrobots.

Thefirststeptowardbuildingaclassificationmodelistopreciselydefineadatainstanceandassociatedattributes.Asimpleapproachistoconsidereachlogentryasadatainstanceandusetheappropriatefieldsinthelogfileasitsattributeset.Thisapproach,however,isinadequateforseveralreasons.First,manyoftheattributesarenominal-valuedandhaveawiderangeofdomainvalues.Forexample,thenumberofuniqueclientIPaddresses,URLs,andreferrersinalogfilecanbeverylarge.Theseattributesareundesirableforbuildingadecisiontreebecausetheirsplitinformationisextremelyhigh(seeEquation(3.9) ).Inaddition,itmightnotbepossibletoclassifytestinstancescontainingIPaddresses,URLs,orreferrersthatarenotpresentinthetrainingdata.Finally,byconsideringeachlogentryasaseparatedatainstance,wedisregardthesequenceofwebpagesretrievedbytheclient—acriticalpieceofinformationthatcanhelpdistinguishwebrobotaccessesfromthoseofahumanuser.

Abetteralternativeistoconsidereachwebsessionasadatainstance.Awebsessionisasequenceofrequestsmadebyaclientduringagivenvisittothewebsite.Eachwebsessioncanbemodeledasadirectedgraph,inwhichthenodescorrespondtowebpagesandtheedgescorrespondtohyperlinksconnectingonewebpagetoanother.Figure3.15(b) showsagraphicalrepresentationofthefirstwebsessiongiveninthelogfile.Everywebsessioncanbecharacterizedusingsomemeaningfulattributesaboutthegraphthatcontaindiscriminatoryinformation.Figure3.15(c) showssomeoftheattributesextractedfromthegraph,includingthedepthandbreadthofits

correspondingtreerootedattheentrypointtothewebsite.Forexample,thedepthandbreadthofthetreeshowninFigure3.15(b) arebothequaltotwo.

ThederivedattributesshowninFigure3.15(c) aremoreinformativethantheoriginalattributesgiveninthelogfilebecausetheycharacterizethebehavioroftheclientatthewebsite.Usingthisapproach,adatasetcontaining2916instanceswascreated,withequalnumbersofsessionsduetowebrobots(class1)andhumanusers(class0).10%ofthedatawerereservedfortrainingwhiletheremaining90%wereusedfortesting.TheinduceddecisiontreeisshowninFigure3.16 ,whichhasanerrorrateequalto3.8%onthetrainingsetand5.3%onthetestset.Inadditiontoitslowerrorrate,thetreealsorevealssomeinterestingpropertiesthatcanhelpdiscriminatewebrobotsfromhumanusers:

1. Accessesbywebrobotstendtobebroadbutshallow,whereasaccessesbyhumanuserstendtobemorefocused(narrowbutdeep).

2. Webrobotsseldomretrievetheimagepagesassociatedwithawebpage.

3. Sessionsduetowebrobotstendtobelongandcontainalargenumberofrequestedpages.

4. Webrobotsaremorelikelytomakerepeatedrequestsforthesamewebpagethanhumanuserssincethewebpagesretrievedbyhumanusersareoftencachedbythebrowser.

3.3.6CharacteristicsofDecisionTreeClassifiers

Thefollowingisasummaryoftheimportantcharacteristicsofdecisiontreeinductionalgorithms.

1. Applicability:Decisiontreesareanonparametricapproachforbuildingclassificationmodels.Thisapproachdoesnotrequireanypriorassumptionabouttheprobabilitydistributiongoverningtheclassandattributesofthedata,andthus,isapplicabletoawidevarietyofdatasets.Itisalsoapplicabletobothcategoricalandcontinuousdatawithoutrequiringtheattributestobetransformedintoacommonrepresentationviabinarization,normalization,orstandardization.UnlikesomebinaryclassifiersdescribedinChapter4 ,itcanalsodealwithmulticlassproblemswithouttheneedtodecomposethemintomultiplebinaryclassificationtasks.Anotherappealingfeatureofdecisiontreeclassifiersisthattheinducedtrees,especiallytheshorterones,arerelativelyeasytointerpret.Theaccuraciesofthetreesarealsoquitecomparabletootherclassificationtechniquesformanysimpledatasets.

2. Expressiveness: A decision tree provides a universal representation for discrete-valued functions. In other words, it can encode any function of discrete-valued attributes. This is because every discrete-valued function can be represented as an assignment table, where every unique combination of discrete attributes is assigned a class label. Since every combination of attributes can be represented as a leaf in the decision tree, we can always find a decision tree whose label assignments at the leaf nodes match with the assignment table of the original function. Decision trees can also help in providing compact representations of functions when some of the unique combinations of attributes can be represented by the same leaf node. For example, Figure 3.17 shows the assignment table of the Boolean function $(A \wedge B) \vee (C \wedge D)$ involving four binary attributes, resulting in a total of $2^4 = 16$ possible assignments. The tree shown in Figure 3.17 shows a compressed encoding of this assignment table. Instead of requiring a fully-grown tree with 16 leaf nodes, it is possible to encode the function using a simpler tree with only 7 leaf nodes. Nevertheless, not all decision trees for discrete-valued attributes can be simplified. One notable example is the parity function, whose value is 1 when there is an even number of true values among its Boolean attributes, and 0 otherwise. Accurate modeling of such a function requires a full decision tree with $2^d$ nodes, where d is the number of Boolean attributes (see Exercise 1 on page 185).

Figure3.16.Decisiontreemodelforwebrobotdetection.

Figure 3.17. Decision tree for the Boolean function $(A \wedge B) \vee (C \wedge D)$.

3. ComputationalEfficiency:Sincethenumberofpossibledecisiontreescanbeverylarge,manydecisiontreealgorithmsemployaheuristic-basedapproachtoguidetheirsearchinthevasthypothesisspace.Forexample,thealgorithmpresentedinSection3.3.4 usesagreedy,top-down,recursivepartitioningstrategyforgrowingadecisiontree.Formanydatasets,suchtechniquesquicklyconstructareasonablygooddecisiontreeevenwhenthetrainingsetsizeisverylarge.Furthermore,onceadecisiontreehasbeenbuilt,classifyingatestrecordisextremelyfast,withaworst-casecomplexityofO(w),wherewisthemaximumdepthofthetree.

4. HandlingMissingValues:Adecisiontreeclassifiercanhandlemissingattributevaluesinanumberofways,bothinthetrainingandthetestsets.Whentherearemissingvaluesinthetestset,theclassifiermustdecidewhichbranchtofollowifthevalueofasplitting


nodeattributeismissingforagiventestinstance.Oneapproach,knownastheprobabilisticsplitmethod,whichisemployedbytheC4.5decisiontreeclassifier,distributesthedatainstancetoeverychildofthesplittingnodeaccordingtotheprobabilitythatthemissingattributehasaparticularvalue.Incontrast,theCARTalgorithmusesthesurrogatesplitmethod,wheretheinstancewhosesplittingattributevalueismissingisassignedtooneofthechildnodesbasedonthevalueofanothernon-missingsurrogateattributewhosesplitsmostresemblethepartitionsmadebythemissingattribute.Anotherapproach,knownastheseparateclassmethodisusedbytheCHAIDalgorithm,wherethemissingvalueistreatedasaseparatecategoricalvaluedistinctfromothervaluesofthesplittingattribute.Figure3.18showsanexampleofthethreedifferentwaysforhandlingmissingvaluesinadecisiontreeclassifier.Otherstrategiesfordealingwithmissingvaluesarebasedondatapreprocessing,wheretheinstancewithmissingvalueiseitherimputedwiththemode(forcategoricalattribute)ormean(forcontinuousattribute)valueordiscardedbeforetheclassifieristrained.

Figure3.18.Methodsforhandlingmissingattributevaluesindecisiontreeclassifier.

Duringtraining,ifanattributevhasmissingvaluesinsomeofthetraininginstancesassociatedwithanode,weneedawaytomeasurethegaininpurityifvisusedforsplitting.Onesimplewayistoexcludeinstanceswithmissingvaluesofvinthecountingofinstancesassociatedwitheverychildnode,generatedforeverypossibleoutcomeofv.Further,ifvischosenastheattributetestconditionatanode,traininginstanceswithmissingvaluesofvcanbepropagatedtothechildnodesusinganyofthemethodsdescribedaboveforhandlingmissingvaluesintestinstances.

5. Handling Interactions among Attributes: Attributes are considered interacting if they are able to distinguish between classes when used together, but individually they provide little or no information. Due to the greedy nature of the splitting criteria in decision trees, such attributes could be passed over in favor of other attributes that are not as useful. This could result in more complex decision trees than necessary. Hence, decision trees can perform poorly when there are interactions among attributes. To illustrate this point, consider the three-dimensional data shown in Figure 3.19(a), which contains 2000 data points from one of two classes, denoted as + and ∘ in the diagram. Figure 3.19(b) shows the distribution of the two classes in the two-dimensional space involving attributes X and Y, which is a noisy version of the XOR Boolean function. We can see that even though the two classes are well-separated in this two-dimensional space, neither of the two attributes contains sufficient information to distinguish between the two classes when used alone. For example, the entropies of the attribute test conditions X ≤ 10 and Y ≤ 10 are close to 1, indicating that neither X nor Y provides any reduction in the impurity measure when used individually. X and Y thus represent a case of interaction among attributes. The data set also contains a third attribute, Z, in which both classes are distributed uniformly, as shown in Figures 3.19(c) and 3.19(d), and hence, the entropy of any split involving Z is close to 1. As a result, Z is as likely to be chosen for splitting as the interacting but useful attributes, X and Y. For further illustration of this issue, readers are referred to Example 3.7 in Section 3.4.1 and Exercise 7 at the end of this chapter.

Figure3.19.ExampleofaXORdatainvolvingXandY,alongwithanirrelevantattributeZ.

6. HandlingIrrelevantAttributes:Anattributeisirrelevantifitisnotusefulfortheclassificationtask.Sinceirrelevantattributesarepoorlyassociatedwiththetargetclasslabels,theywillprovidelittleornogaininpurityandthuswillbepassedoverbyothermorerelevantfeatures.Hence,thepresenceofasmallnumberofirrelevantattributeswillnotimpactthedecisiontreeconstructionprocess.However,notallattributesthatprovidelittletonogainareirrelevant(seeFigure3.19 ).Hence,iftheclassificationproblemiscomplex(e.g.,involvinginteractionsamongattributes)andtherearealargenumberofirrelevantattributes,thensomeoftheseattributesmaybeaccidentallychosenduringthetree-growingprocess,sincetheymayprovideabettergainthanarelevantattributejustbyrandomchance.Featureselectiontechniquescanhelptoimprovetheaccuracyofdecisiontreesbyeliminatingtheirrelevantattributesduringpreprocessing.WewillinvestigatetheissueoftoomanyirrelevantattributesinSection3.4.1 .

7. HandlingRedundantAttributes:Anattributeisredundantifitisstronglycorrelatedwithanotherattributeinthedata.Sinceredundantattributesshowsimilargainsinpurityiftheyareselectedforsplitting,onlyoneofthemwillbeselectedasanattributetestconditioninthedecisiontreealgorithm.Decisiontreescanthushandlethepresenceofredundantattributes.

8. Using Rectilinear Splits: The test conditions described so far in this chapter involve using only a single attribute at a time. As a consequence, the tree-growing procedure can be viewed as the process of partitioning the attribute space into disjoint regions until each region contains records of the same class. The border between two neighboring regions of different classes is known as a decision boundary. Figure 3.20 shows the decision tree as well as the decision boundary for a binary classification problem. Since the test condition involves only a single attribute, the decision boundaries are rectilinear, i.e., parallel to the coordinate axes. This limits the expressiveness of decision trees in representing decision boundaries of data sets with continuous attributes. Figure 3.21 shows a two-dimensional data set involving binary classes that cannot be perfectly classified by a decision tree whose attribute test conditions are defined based on single attributes. The binary classes in the data set are generated from two skewed Gaussian distributions, centered at (8,8) and (12,12), respectively. The true decision boundary is represented by the diagonal dashed line, whereas the rectilinear decision boundary produced by the decision tree classifier is shown by the thick solid line. In contrast, an oblique decision tree may overcome this limitation by allowing the test condition to be specified using more than one attribute. For example, the binary classification data shown in Figure 3.21 can be easily represented by an oblique decision tree with a single root node with test condition x + y < 20.

Figure 3.20. Example of a decision tree and its decision boundaries for a two-dimensional data set.

Figure3.21.Exampleofdatasetthatcannotbepartitionedoptimallyusingadecisiontreewithsingleattributetestconditions.Thetruedecisionboundaryisshownbythedashedline.

Although an oblique decision tree is more expressive and can produce more compact trees, finding the optimal test condition is computationally more expensive; a sketch of the derived-attribute workaround for this limitation appears after this list.

9. ChoiceofImpurityMeasure:Itshouldbenotedthatthechoiceofimpuritymeasureoftenhaslittleeffectontheperformanceofdecisiontreeclassifierssincemanyoftheimpuritymeasuresarequiteconsistentwitheachother,asshowninFigure3.11 onpage129.Instead,thestrategyusedtoprunethetreehasagreaterimpactonthefinaltreethanthechoiceofimpuritymeasure.
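The derived-attribute idea mentioned in item 8 can be illustrated with a small Python sketch. The data below is only loosely inspired by Figure 3.21 (skewed Gaussians centered at (8,8) and (12,12)); the covariance matrix and sample sizes are our assumptions, and scikit-learn's DecisionTreeClassifier stands in for a generic decision tree learner.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
cov = [[3.0, -2.9], [-2.9, 3.0]]          # skewed (elongated) class distributions; illustrative values
X0 = rng.multivariate_normal([8, 8], cov, size=500)
X1 = rng.multivariate_normal([12, 12], cov, size=500)
X = np.vstack([X0, X1])
y = np.array([0] * 500 + [1] * 500)

# A single axis-parallel test (depth-1 tree on x or y alone) separates the classes poorly,
axis_tree = DecisionTreeClassifier(max_depth=1).fit(X, y)
# while a single test on the derived attribute x + y acts like the oblique test x + y < 20.
oblique_tree = DecisionTreeClassifier(max_depth=1).fit(X.sum(axis=1, keepdims=True), y)

print(axis_tree.score(X, y))      # noticeably below 1.0
print(oblique_tree.score(X, y))   # essentially 1.0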

3.4ModelOverfittingMethodspresentedsofartrytolearnclassificationmodelsthatshowthelowesterroronthetrainingset.However,aswewillshowinthefollowingexample,evenifamodelfitswelloverthetrainingdata,itcanstillshowpoorgeneralizationperformance,aphenomenonknownasmodeloverfitting.

Figure3.22.Examplesoftrainingandtestsetsofatwo-dimensionalclassificationproblem.

Figure3.23.Effectofvaryingtreesize(numberofleafnodes)ontrainingandtesterrors.

Example 3.5. Overfitting and Underfitting of Decision Trees
Consider the two-dimensional data set shown in Figure 3.22(a). The data set contains instances that belong to two separate classes, represented as + and ∘, respectively, where each class has 5400 instances. All instances belonging to the ∘ class were generated from a uniform distribution. For the + class, 5000 instances were generated from a Gaussian distribution centered at (10,10) with unit variance, while the remaining 400 instances were sampled from the same uniform distribution as the ∘ class. We can see from Figure 3.22(a) that the + class can be largely distinguished from the ∘ class by drawing a circle of appropriate size centered at (10,10). To learn a classifier using this two-dimensional data set, we randomly sampled 10% of the data for training and used the remaining 90% for testing. The training set, shown in Figure 3.22(b), looks quite representative of the overall data. We used the Gini index as the impurity measure to construct decision trees of increasing sizes (number of leaf nodes), by recursively expanding a node into child nodes till every leaf node was pure, as described in Section 3.3.4.

Figure3.23(a) showschangesinthetrainingandtesterrorratesasthesizeofthetreevariesfrom1to8.Botherrorratesareinitiallylargewhenthetreehasonlyoneortwoleafnodes.Thissituationisknownasmodelunderfitting.Underfittingoccurswhenthelearneddecisiontreeistoosimplistic,andthus,incapableoffullyrepresentingthetruerelationshipbetweentheattributesandtheclasslabels.Asweincreasethetreesizefrom1to8,wecanobservetwoeffects.First,boththeerrorratesdecreasesincelargertreesareabletorepresentmorecomplexdecisionboundaries.Second,thetrainingandtesterrorratesarequiteclosetoeachother,whichindicatesthattheperformanceonthetrainingsetisfairlyrepresentativeofthegeneralizationperformance.Aswefurtherincreasethesizeofthetreefrom8to150,thetrainingerrorcontinuestosteadilydecreasetilliteventuallyreacheszero,asshowninFigure3.23(b) .However,inastrikingcontrast,thetesterrorrateceasestodecreaseanyfurtherbeyondacertaintreesize,andthenitbeginstoincrease.Thetrainingerrorratethusgrosslyunder-estimatesthetesterrorrateoncethetreebecomestoolarge.Further,thegapbetweenthetrainingandtesterrorrateskeepsonwideningasweincreasethetreesize.Thisbehavior,whichmayseemcounter-intuitiveatfirst,canbeattributedtothephenomenaofmodeloverfitting.
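The experiment described in this example can be reproduced in outline with scikit-learn. The sketch below follows the data description above; the range of the uniform distribution and the particular tree sizes are our assumptions, so the exact error rates will differ from Figure 3.23, but the qualitative behavior should be visible.

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
# '+' class: 5000 Gaussian points around (10, 10) plus 400 uniform points;
# 'o' class: 5400 uniform points over the same square (range 0-20 is an assumption).
plus = np.vstack([rng.normal(10, 1, size=(5000, 2)), rng.uniform(0, 20, size=(400, 2))])
circ = rng.uniform(0, 20, size=(5400, 2))
X = np.vstack([plus, circ])
y = np.array([1] * 5400 + [0] * 5400)

# 10% for training, 90% for testing, as in the example.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.1, random_state=1)

for leaves in [2, 4, 8, 50, 150]:
    tree = DecisionTreeClassifier(max_leaf_nodes=leaves, random_state=1).fit(X_tr, y_tr)
    print(leaves, 1 - tree.score(X_tr, y_tr), 1 - tree.score(X_te, y_te))
# Training error keeps falling as the tree grows; test error typically levels off and then rises.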

3.4.1ReasonsforModelOverfitting

Model overfitting is the phenomenon where, in the pursuit of minimizing the training error rate, an overly complex model is selected that captures specific patterns in the training data but fails to learn the true nature of relationships between attributes and class labels in the overall data. To illustrate this, Figure 3.24 shows decision trees and their corresponding decision boundaries (shaded rectangles represent regions assigned to the + class) for two trees of sizes 5 and 50. We can see that the decision tree of size 5 appears quite simple and its decision boundaries provide a reasonable approximation to the ideal decision boundary, which in this case corresponds to a circle centered around the Gaussian distribution at (10,10). Although its training and test error rates are non-zero, they are very close to each other, which indicates that the patterns learned in the training set should generalize well over the test set. On the other hand, the decision tree of size 50 appears much more complex than the tree of size 5, with complicated decision boundaries. For example, some of its shaded rectangles (assigned the + class) attempt to cover narrow regions in the input space that contain only one or two + training instances. Note that the prevalence of + instances in such regions is highly specific to the training set, as these regions are mostly dominated by ∘ instances in the overall data. Hence, in an attempt to perfectly fit the training data, the decision tree of size 50 starts fine tuning itself to specific patterns in the training data, leading to poor performance on an independently chosen test set.

Figure3.24.Decisiontreeswithdifferentmodelcomplexities.

Figure3.25.Performanceofdecisiontreesusing20%datafortraining(twicetheoriginaltrainingsize).

Thereareanumberoffactorsthatinfluencemodeloverfitting.Inthefollowing,weprovidebriefdescriptionsoftwoofthemajorfactors:limitedtrainingsizeandhighmodelcomplexity.Thoughtheyarenotexhaustive,theinterplaybetweenthemcanhelpexplainmostofthecommonmodeloverfittingphenomenainreal-worldapplications.

LimitedTrainingSizeNotethatatrainingsetconsistingofafinitenumberofinstancescanonlyprovidealimitedrepresentationoftheoveralldata.Hence,itispossiblethatthepatternslearnedfromatrainingsetdonotfullyrepresentthetruepatternsintheoveralldata,leadingtomodeloverfitting.Ingeneral,asweincreasethesizeofatrainingset(numberoftraininginstances),thepatternslearnedfromthetrainingsetstartresemblingthetruepatternsintheoveralldata.Hence,

theeffectofoverfittingcanbereducedbyincreasingthetrainingsize,asillustratedinthefollowingexample.

3.6ExampleEffectofTrainingSizeSupposethatweusetwicethenumberoftraininginstancesthanwhatwehadusedintheexperimentsconductedinExample3.5 .Specifically,weuse20%datafortrainingandusetheremainderfortesting.Figure3.25(b) showsthetrainingandtesterrorratesasthesizeofthetreeisvariedfrom1to150.TherearetwomajordifferencesinthetrendsshowninthisfigureandthoseshowninFigure3.23(b) (usingonly10%ofthedatafortraining).First,eventhoughthetrainingerrorratedecreaseswithincreasingtreesizeinbothfigures,itsrateofdecreaseismuchsmallerwhenweusetwicethetrainingsize.Second,foragiventreesize,thegapbetweenthetrainingandtesterrorratesismuchsmallerwhenweusetwicethetrainingsize.Thesedifferencessuggestthatthepatternslearnedusing20%ofdatafortrainingaremoregeneralizablethanthoselearnedusing10%ofdatafortraining.

Figure 3.25(a) shows the decision boundaries for the tree of size 50, learned using 20% of the data for training. In contrast to the tree of the same size learned using 10% data for training (see Figure 3.24(d)), we can see that the decision tree is not capturing specific patterns of noisy + instances in the training set. Instead, the high model complexity of 50 leaf nodes is being effectively used to learn the boundaries of the + instances centered at (10,10).

High Model Complexity
Generally, a more complex model has a better ability to represent complex patterns in the data. For example, decision trees with a larger number of leaf nodes can represent more complex decision boundaries than decision trees with fewer leaf nodes. However, an overly complex model also has a tendency to learn specific patterns in the training set that do not generalize well over unseen instances. Models with high complexity should thus be judiciously used to avoid overfitting.

Onemeasureofmodelcomplexityisthenumberof“parameters”thatneedtobeinferredfromthetrainingset.Forexample,inthecaseofdecisiontreeinduction,theattributetestconditionsatinternalnodescorrespondtotheparametersofthemodelthatneedtobeinferredfromthetrainingset.Adecisiontreewithlargernumberofattributetestconditions(andconsequentlymoreleafnodes)thusinvolvesmore“parameters”andhenceismorecomplex.

Givenaclassofmodelswithacertainnumberofparameters,alearningalgorithmattemptstoselectthebestcombinationofparametervaluesthatmaximizesanevaluationmetric(e.g.,accuracy)overthetrainingset.Ifthenumberofparametervaluecombinations(andhencethecomplexity)islarge,thelearningalgorithmhastoselectthebestcombinationfromalargenumberofpossibilities,usingalimitedtrainingset.Insuchcases,thereisahighchanceforthelearningalgorithmtopickaspuriouscombinationofparametersthatmaximizestheevaluationmetricjustbyrandomchance.Thisissimilartothemultiplecomparisonsproblem(alsoreferredasmultipletestingproblem)instatistics.

As an illustration of the multiple comparisons problem, consider the task of predicting whether the stock market will rise or fall in the next ten trading days. If a stock analyst simply makes random guesses, the probability that her prediction is correct on any trading day is 0.5. However, the probability that she will predict correctly at least nine out of ten times is

$$\frac{\binom{10}{9} + \binom{10}{10}}{2^{10}} = 0.0107,$$

which is extremely low.

Suppose we are interested in choosing an investment advisor from a pool of 200 stock analysts. Our strategy is to select the analyst who makes the most number of correct predictions in the next ten trading days. The flaw in this strategy is that even if all the analysts make their predictions in a random fashion, the probability that at least one of them makes at least nine correct predictions is

$$1 - (1 - 0.0107)^{200} = 0.8847,$$

which is very high. Although each analyst has a low probability of predicting at least nine times correctly, considered together, we have a high probability of finding at least one analyst who can do so. However, there is no guarantee in the future that such an analyst will continue to make accurate predictions by random guessing.

How does the multiple comparisons problem relate to model overfitting? In the context of learning a classification model, each combination of parameter values corresponds to an analyst, while the number of training instances corresponds to the number of days. Analogous to the task of selecting the best analyst who makes the most accurate predictions on consecutive days, the task of a learning algorithm is to select the best combination of parameters that results in the highest accuracy on the training set. If the number of parameter combinations is large but the training size is small, it is highly likely for the learning algorithm to choose a spurious parameter combination that provides high training accuracy just by random chance. In the following example, we illustrate the phenomenon of overfitting due to multiple comparisons in the context of decision tree induction.
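Both probabilities follow from the binomial distribution and can be checked directly:

from math import comb

# Probability a single analyst guessing at random is right on at least 9 of 10 days.
p_one = (comb(10, 9) + comb(10, 10)) / 2 ** 10
# Probability that at least one of 200 such analysts achieves this.
p_any = 1 - (1 - p_one) ** 200
print(round(p_one, 4), round(p_any, 4))   # 0.0107 0.8847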

Figure3.26.Exampleofatwo-dimensional(X-Y)dataset.

Figure 3.27. Training and test error rates illustrating the effect of the multiple comparisons problem on model overfitting.

Example 3.7. Multiple Comparisons and Overfitting
Consider the two-dimensional data set shown in Figure 3.26 containing 500 + and 500 ∘ instances, which is similar to the data shown in Figure 3.19. In this data set, the distributions of both classes are well-separated in the two-dimensional (X-Y) attribute space, but none of the two attributes (X or Y) are sufficiently informative to be used alone for separating the two classes. Hence, splitting the data set based on any value of an X or Y attribute will provide close to zero reduction in an impurity measure. However, if X and Y attributes are used together in the splitting criterion (e.g., splitting X at 10 and Y at 10), the two classes can be effectively separated.

Figure 3.28. Decision tree with 6 leaf nodes using X and Y as attributes. Splits have been numbered from 1 to 5 in order of their occurrence in the tree.

Figure 3.27(a) shows the training and test error rates for learning decision trees of varying sizes, when 30% of the data is used for training and the remainder of the data for testing. We can see that the two classes can be separated using a small number of leaf nodes. Figure 3.28 shows the decision boundaries for the tree with six leaf nodes, where the splits have been numbered according to their order of appearance in the tree. Note that even though splits 1 and 3 provide trivial gains, their consequent splits (2, 4, and 5) provide large gains, resulting in effective discrimination of the two classes.

Assumeweadd100irrelevantattributestothetwo-dimensionalX-Ydata.Learningadecisiontreefromthisresultantdatawillbechallengingbecausethenumberofcandidateattributestochooseforsplittingateveryinternalnodewillincreasefromtwoto102.Withsuchalargenumberofcandidateattributetestconditionstochoosefrom,itisquitelikelythatspuriousattributetestconditionswillbeselectedatinternalnodesbecauseofthemultiplecomparisonsproblem.Figure3.27(b) showsthetrainingandtesterrorratesafteradding100irrelevantattributestothetrainingset.Wecanseethatthetesterrorrateremainscloseto0.5evenafterusing50leafnodes,whilethetrainingerrorratekeepsondecliningandeventuallybecomes0.

3.5 Model Selection

There are many possible classification models with varying levels of model complexity that can be used to capture patterns in the training data. Among these possibilities, we want to select the model that shows the lowest generalization error rate. The process of selecting a model with the right level of complexity, which is expected to generalize well over unseen test instances, is known as model selection. As described in the previous section, the training error rate cannot be reliably used as the sole criterion for model selection. In the following, we present three generic approaches to estimate the generalization performance of a model that can be used for model selection. We conclude this section by presenting specific strategies for using these approaches in the context of decision tree induction.

3.5.1 Using a Validation Set

Notethatwecanalwaysestimatethegeneralizationerrorrateofamodelbyusing“out-of-sample”estimates,i.e.byevaluatingthemodelonaseparatevalidationsetthatisnotusedfortrainingthemodel.Theerrorrateonthevalidationset,termedasthevalidationerrorrate,isabetterindicatorofgeneralizationperformancethanthetrainingerrorrate,sincethevalidationsethasnotbeenusedfortrainingthemodel.Thevalidationerrorratecanbeusedformodelselectionasfollows.

Given a training set D.train, we can partition D.train into two smaller subsets, D.tr and D.val, such that D.tr is used for training while D.val is used as the validation set. For example, two-thirds of D.train can be reserved as D.tr for training, while the remaining one-third is used as D.val for computing the validation error rate. For any choice of classification model m that is trained on D.tr, we can estimate its validation error rate on D.val, errval(m). The model that shows the lowest value of errval(m) can then be selected as the preferred choice of model.

Theuseofvalidationsetprovidesagenericapproachformodelselection.However,onelimitationofthisapproachisthatitissensitivetothesizesofD.trandD.val,obtainedbypartitioningD.train.IfthesizeofD.tristoosmall,itmayresultinthelearningofapoorclassificationmodelwithsub-standardperformance,sinceasmallertrainingsetwillbelessrepresentativeoftheoveralldata.Ontheotherhand,ifthesizeofD.valistoosmall,thevalidationerrorratemightnotbereliableforselectingmodels,asitwouldbecomputedoverasmallnumberofinstances.
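As a concrete sketch of this procedure, the snippet below (using scikit-learn, with synthetic data and decision trees of varying maximum depth as the candidate models, both chosen purely for illustration) partitions D.train into D.tr and D.val and selects the model with the lowest validation error rate.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# D.train: the labeled data available for model selection (synthetic here).
X_train, y_train = make_classification(n_samples=600, random_state=0)

# Partition D.train into D.tr (2/3, for fitting) and D.val (1/3, for validation).
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=1/3, random_state=0)

best_model, best_err = None, np.inf
for depth in [1, 2, 4, 8, 16]:             # candidate models of increasing complexity
    m = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    err_val = 1.0 - m.score(X_val, y_val)  # validation error rate errval(m)
    if err_val < best_err:
        best_model, best_err = m, err_val

print(best_model.get_depth(), best_err)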

Figure 3.29. Class distribution of validation data for the two decision trees shown in Figure 3.30.

Example 3.8 (Validation Error). In the following example, we illustrate one possible approach for using a validation set in decision tree induction. Figure 3.29 shows the predicted labels at the leaf nodes of the decision trees generated in Figure 3.30. The counts given beneath the leaf nodes represent the proportion of data instances in the validation set that reach each of the nodes. Based on the predicted labels of the nodes, the validation error rate for the left tree is errval(TL) = 6/16 = 0.375, while the validation error rate for the right tree is errval(TR) = 4/16 = 0.25. Based on their validation error rates, the right tree is preferred over the left one.

3.5.2 Incorporating Model Complexity

Since the chance for model overfitting increases as the model becomes more complex, a model selection approach should not only consider the training error rate but also the model complexity. This strategy is inspired by a well-known principle known as Occam's razor or the principle of parsimony, which suggests that given two models with the same errors, the simpler model is preferred over the more complex model. A generic approach to account for model complexity while estimating generalization performance is formally described as follows.

Given a training set D.train, let us consider learning a classification model m that belongs to a certain class of models, M. For example, if M represents the set of all possible decision trees, then m can correspond to a specific decision tree learned from the training set. We are interested in estimating the generalization error rate of m, gen.error(m). As discussed previously, the training error rate of m, train.error(m, D.train), can under-estimate gen.error(m) when the model complexity is high. Hence, we represent gen.error(m) as a function of not just the training error rate but also the model complexity of M, complexity(M), as follows:

gen.error(m) = train.error(m, D.train) + α × complexity(M),   (3.11)

where α is a hyper-parameter that strikes a balance between minimizing training error and reducing model complexity. A higher value of α gives more emphasis to the model complexity in the estimation of generalization performance. To choose the right value of α, we can make use of the validation set in a similar way as described in Section 3.5.1. For example, we can iterate through a range of values of α and, for every possible value, learn a model on a subset of the training set, D.tr, and compute its validation error rate on a separate subset, D.val. We can then select the value of α that provides the lowest validation error rate.

Equation 3.11 provides one possible approach for incorporating model complexity into the estimate of generalization performance. This approach is at the heart of a number of techniques for estimating generalization performance, such as the structural risk minimization principle, the Akaike's Information Criterion (AIC), and the Bayesian Information Criterion (BIC). The structural risk minimization principle serves as the building block for learning support vector machines, which will be discussed later in Chapter 4. For more details on AIC and BIC, see the Bibliographic Notes.

In the following, we present two different approaches for estimating the complexity of a model, complexity(M). While the former is specific to decision trees, the latter is more generic and can be used with any class of models.

Estimating the Complexity of Decision Trees

In the context of decision trees, the complexity of a decision tree can be estimated as the ratio of the number of leaf nodes to the number of training instances. Let k be the number of leaf nodes and Ntrain be the number of training instances. The complexity of a decision tree can then be described as k/Ntrain. This reflects the intuition that for a larger training size, we can learn a decision tree with a larger number of leaf nodes without it becoming overly complex. The generalization error rate of a decision tree T can then be computed using Equation 3.11 as follows:

errgen(T) = err(T) + Ω × k/Ntrain,

where err(T) is the training error of the decision tree and Ω is a hyper-parameter that makes a trade-off between reducing training error and minimizing model complexity, similar to the use of α in Equation 3.11. Ω can be viewed as the relative cost of adding a leaf node relative to incurring a training error. In the literature on decision tree induction, the above approach for estimating the generalization error rate is also termed the pessimistic error estimate. It is called pessimistic as it assumes the generalization error rate to be worse than the training error rate (by adding a penalty term for model complexity). On the other hand, simply using the training error rate as an estimate of the generalization error rate is called the optimistic error estimate or the resubstitution estimate.

Example 3.9 (Generalization Error Estimates). Consider the two binary decision trees, TL and TR, shown in Figure 3.30. Both trees are generated from the same training data and TL is generated by expanding three leaf nodes of TR. The counts shown in the leaf nodes of the trees represent the class distribution of the training instances. If each leaf node is labeled according to the majority class of training instances that reach the node, the training error rate for the left tree will be err(TL) = 4/24 = 0.167, while the training error rate for the right tree will be err(TR) = 6/24 = 0.25. Based on their training error rates alone, TL would be preferred over TR, even though TL is more complex (contains a larger number of leaf nodes) than TR.

Figure 3.30. Example of two decision trees generated from the same training data.

Now, assume that the cost associated with each leaf node is Ω = 0.5. Then, the generalization error estimate for TL will be

errgen(TL) = 4/24 + 0.5 × 7/24 = 7.5/24 = 0.3125

and the generalization error estimate for TR will be

errgen(TR) = 6/24 + 0.5 × 4/24 = 8/24 = 0.3333.

Since TL has a lower generalization error rate, it will still be preferred over TR. Note that Ω = 0.5 implies that a node should always be expanded into its two child nodes if it improves the prediction of at least one training instance, since expanding a node is less costly than misclassifying a training instance. On the other hand, if Ω = 1, then the generalization error rate for TL is errgen(TL) = 11/24 = 0.458 and for TR is errgen(TR) = 10/24 = 0.417. In this case, TR will be preferred over TL because it has a lower generalization error rate. This example illustrates that different choices of Ω can change our preference of decision trees based on their generalization error estimates. However, for a given choice of Ω, the pessimistic error estimate provides an approach for modeling the generalization performance on unseen test instances. The value of Ω can be selected with the help of a validation set.
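The calculations in this example are easy to reproduce. The snippet below is a minimal Python check of the pessimistic error estimates for TL and TR under Ω = 0.5 and Ω = 1.

def err_gen(n_errors, n_leaves, n_train, omega):
    # Pessimistic error estimate: training error rate plus a complexity
    # penalty of omega per leaf node (errgen(T) = err(T) + omega * k/Ntrain).
    return n_errors / n_train + omega * n_leaves / n_train

# T_L commits 4 errors with 7 leaves; T_R commits 6 errors with 4 leaves;
# both are built on 24 training instances.
for omega in (0.5, 1.0):
    print(omega,
          round(err_gen(4, 7, 24, omega), 4),   # T_L: 0.3125, then 0.4583
          round(err_gen(6, 4, 24, omega), 4))   # T_R: 0.3333, then 0.4167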

Minimum Description Length Principle

Another way to incorporate model complexity is based on an information-theoretic approach known as the minimum description length or MDL principle. To illustrate this approach, consider the example shown in Figure 3.31. In this example, both person A and person B are given a set of instances with known attribute values x. Assume person A knows the class label y for every instance, while person B has no such information. A would like to share the class information with B by sending a message containing the labels. The message would contain Θ(N) bits of information, where N is the number of instances.

Figure3.31.Anillustrationoftheminimumdescriptionlengthprinciple.

Alternatively, instead of sending the class labels explicitly, A can build a classification model from the instances and transmit it to B. B can then apply the model to determine the class labels of the instances. If the model is 100% accurate, then the cost for transmission is equal to the number of bits required to encode the model. Otherwise, A must also transmit information about which instances are misclassified by the model so that B can reproduce the same class labels. Thus, the overall transmission cost, which is equal to the total description length of the message, is

Cost(model, data) = Cost(data|model) + α × Cost(model),   (3.12)

where the first term on the right-hand side is the number of bits needed to encode the misclassified instances, while the second term is the number of bits required to encode the model. There is also a hyper-parameter α that trades off the relative costs of the misclassified instances and the model.

Notice the similarity between this equation and the generic equation for the generalization error rate presented in Equation 3.11. A good model must have a total description length less than the number of bits required to encode the entire sequence of class labels. Furthermore, given two competing models, the model with the lower total description length is preferred. An example showing how to compute the total description length of a decision tree is given in Exercise 10 on page 189.

3.5.3 Estimating Statistical Bounds

InsteadofusingEquation3.11 toestimatethegeneralizationerrorrateofamodel,analternativewayistoapplyastatisticalcorrectiontothetrainingerrorrateofthemodelthatisindicativeofitsmodelcomplexity.Thiscanbedoneiftheprobabilitydistributionoftrainingerrorisavailableorcanbeassumed.Forexample,thenumberoferrorscommittedbyaleafnodeinadecisiontreecanbeassumedtofollowabinomialdistribution.Wecanthuscomputeanupperboundlimittotheobservedtrainingerrorratethatcanbeusedformodelselection,asillustratedinthefollowingexample.

Example 3.10 (Statistical Bounds on Training Error). Consider the left-most branch of the binary decision trees shown in Figure 3.30. Observe that the left-most leaf node of TR has been expanded into two child nodes in TL. Before splitting, the training error rate of the node is 2/7 = 0.286. By approximating a binomial distribution with a normal distribution, the following upper bound of the training error rate e can be derived:

eupper(N, e, α) = ( e + zα/2² / (2N) + zα/2 √( e(1 − e)/N + zα/2² / (4N²) ) ) / ( 1 + zα/2² / N ),   (3.13)

where α is the confidence level, zα/2 is the standardized value from a standard normal distribution, and N is the total number of training instances used to compute e. By replacing α = 25%, N = 7, and e = 2/7, the upper bound for the error rate is eupper(7, 2/7, 0.25) = 0.503, which corresponds to 7 × 0.503 = 3.521 errors. If we expand the node into its child nodes as shown in TL, the training error rates for the child nodes are 1/4 = 0.250 and 1/3 = 0.333, respectively. Using Equation (3.13), the upper bounds of these error rates are eupper(4, 1/4, 0.25) = 0.537 and eupper(3, 1/3, 0.25) = 0.650, respectively. The overall training error of the child nodes is 4 × 0.537 + 3 × 0.650 = 4.098, which is larger than the estimated error for the corresponding node in TR, suggesting that it should not be split.
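Equation 3.13 can be evaluated directly. The sketch below (assuming scipy is available for the normal quantile) reproduces the upper bounds 0.503, 0.537, and 0.650 used in this example.

from math import sqrt
from scipy.stats import norm

def e_upper(N, e, alpha):
    # Upper bound (Equation 3.13) on the training error rate e observed
    # over N instances, at confidence level alpha.
    z = norm.ppf(1 - alpha / 2)          # z_{alpha/2} for a two-sided bound
    num = e + z**2 / (2 * N) + z * sqrt(e * (1 - e) / N + z**2 / (4 * N**2))
    return num / (1 + z**2 / N)

print(round(e_upper(7, 2/7, 0.25), 3))   # parent node: ~0.503
print(round(e_upper(4, 1/4, 0.25), 3))   # left child:  ~0.537
print(round(e_upper(3, 1/3, 0.25), 3))   # right child: ~0.650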

3.5.4ModelSelectionforDecisionTrees

Buildingonthegenericapproachespresentedabove,wepresenttwocommonlyusedmodelselectionstrategiesfordecisiontreeinduction.

Prepruning(EarlyStoppingRule)

Inthisapproach,thetree-growingalgorithmishaltedbeforegeneratingafullygrowntreethatperfectlyfitstheentiretrainingdata.Todothis,amorerestrictivestoppingconditionmustbeused;e.g.,stopexpandingaleafnodewhentheobservedgaininthegeneralizationerrorestimatefallsbelowacertainthreshold.Thisestimateofthegeneralizationerrorratecanbe


computedusinganyoftheapproachespresentedintheprecedingthreesubsections,e.g.,byusingpessimisticerrorestimates,byusingvalidationerrorestimates,orbyusingstatisticalbounds.Theadvantageofprepruningisthatitavoidsthecomputationsassociatedwithgeneratingoverlycomplexsubtreesthatoverfitthetrainingdata.However,onemajordrawbackofthismethodisthat,evenifnosignificantgainisobtainedusingoneoftheexistingsplittingcriterion,subsequentsplittingmayresultinbettersubtrees.Suchsubtreeswouldnotbereachedifprepruningisusedbecauseofthegreedynatureofdecisiontreeinduction.

Post-pruning

Inthisapproach,thedecisiontreeisinitiallygrowntoitsmaximumsize.Thisisfollowedbyatree-pruningstep,whichproceedstotrimthefullygrowntreeinabottom-upfashion.Trimmingcanbedonebyreplacingasubtreewith(1)anewleafnodewhoseclasslabelisdeterminedfromthemajorityclassofinstancesaffiliatedwiththesubtree(approachknownassubtreereplacement),or(2)themostfrequentlyusedbranchofthesubtree(approachknownassubtreeraising).Thetree-pruningstepterminateswhennofurtherimprovementinthegeneralizationerrorestimateisobservedbeyondacertainthreshold.Again,theestimatesofgeneralizationerrorratecanbecomputedusinganyoftheapproachespresentedinthepreviousthreesubsections.Post-pruningtendstogivebetterresultsthanprepruningbecauseitmakespruningdecisionsbasedonafullygrowntree,unlikeprepruning,whichcansufferfromprematureterminationofthetree-growingprocess.However,forpost-pruning,theadditionalcomputationsneededtogrowthefulltreemaybewastedwhenthesubtreeispruned.
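Subtree replacement and subtree raising are not directly available in scikit-learn, but a closely related post-pruning strategy, minimal cost-complexity pruning, is. The sketch below (on synthetic data, chosen only for illustration) grows a full tree and then selects a pruned tree using a validation set.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=1/3, random_state=0)

# Grow the full tree, then examine the cost-complexity pruning path
# (the effective alpha values at which subtrees get collapsed).
full_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
alphas = full_tree.cost_complexity_pruning_path(X_tr, y_tr).ccp_alphas

# Refit with each candidate alpha and keep the pruned tree with the
# lowest validation error rate.
best = min(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_tr, y_tr)
     for a in alphas),
    key=lambda t: 1.0 - t.score(X_val, y_val))
print(best.get_n_leaves(), 1.0 - best.score(X_val, y_val))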

Figure 3.32 illustrates the simplified decision tree model for the web robot detection example given in Section 3.3.5. Notice that the subtree rooted at depth = 1 has been replaced by one of its branches corresponding to breadth <= 7, width > 3, and MultiP = 1, using subtree raising. On the other hand, the subtree corresponding to depth > 1 and MultiAgent = 0 has been replaced by a leaf node assigned to class 0, using subtree replacement. The subtree for depth > 1 and MultiAgent = 1 remains intact.

Figure 3.32. Post-pruning of the decision tree for web robot detection.

3.6 Model Evaluation

The previous section discussed several approaches for model selection that can be used to learn a classification model from a training set D.train. Here we discuss methods for estimating its generalization performance, i.e., its performance on unseen instances outside of D.train. This process is known as model evaluation.

NotethatmodelselectionapproachesdiscussedinSection3.5 alsocomputeanestimateofthegeneralizationperformanceusingthetrainingsetD.train.However,theseestimatesarebiasedindicatorsoftheperformanceonunseeninstances,sincetheywereusedtoguidetheselectionofclassificationmodel.Forexample,ifweusethevalidationerrorrateformodelselection(asdescribedinSection3.5.1 ),theresultingmodelwouldbedeliberatelychosentominimizetheerrorsonthevalidationset.Thevalidationerrorratemaythusunder-estimatethetruegeneralizationerrorrate,andhencecannotbereliablyusedformodelevaluation.

A correct approach for model evaluation would be to assess the performance of a learned model on a labeled test set that has not been used at any stage of model selection. This can be achieved by partitioning the entire set of labeled instances D into two disjoint subsets, D.train, which is used for model selection, and D.test, which is used for computing the test error rate, errtest. In the following, we present two different approaches for partitioning D into D.train and D.test, and computing the test error rate, errtest.

3.6.1 Holdout Method

The most basic technique for partitioning a labeled data set is the holdout method, where the labeled set D is randomly partitioned into two disjoint sets, called the training set D.train and the test set D.test. A classification model is then induced from D.train using the model selection approaches presented in Section 3.5, and its error rate on D.test, errtest, is used as an estimate of the generalization error rate. The proportion of data reserved for training and for testing is typically at the discretion of the analysts, e.g., two-thirds for training and one-third for testing.

Similar to the trade-off faced while partitioning D.train into D.tr and D.val in Section 3.5.1, choosing the right fraction of labeled data to be used for training and testing is non-trivial. If the size of D.train is small, the learned classification model may be improperly learned using an insufficient number of training instances, resulting in a biased estimation of generalization performance. On the other hand, if the size of D.test is small, errtest may be less reliable as it would be computed over a small number of test instances. Moreover, errtest can have a high variance as we change the random partitioning of D into D.train and D.test.

The holdout method can be repeated several times to obtain a distribution of the test error rates, an approach known as random subsampling or the repeated holdout method. This method produces a distribution of the error rates that can be used to understand the variance of errtest.
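A minimal sketch of the holdout and repeated holdout (random subsampling) methods is shown below, assuming scikit-learn and a synthetic data set; the ten random splits give a rough sense of the variance of errtest.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=900, random_state=0)

# Repeated holdout: repeat the 2/3 - 1/3 split several times to obtain
# a distribution of test error rates.
err_test = []
for seed in range(10):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=1/3, random_state=seed)
    model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
    err_test.append(1.0 - model.score(X_test, y_test))

print(np.mean(err_test), np.std(err_test))   # mean and spread of the estimates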

3.6.2Cross-Validation

Cross-validation is a widely-used model evaluation method that aims to make effective use of all labeled instances in D for both training and testing. To illustrate this method, suppose that we are given a labeled set that we have randomly partitioned into three equal-sized subsets, S1, S2, and S3, as shown in Figure 3.33. For the first run, we train a model using subsets S2 and S3 (shown as empty blocks) and test the model on subset S1. The test error rate on S1, denoted as err(S1), is thus computed in the first run. Similarly, for the second run, we use S1 and S3 as the training set and S2 as the test set, to compute the test error rate, err(S2), on S2. Finally, we use S1 and S2 for training in the third run, while S3 is used for testing, thus resulting in the test error rate err(S3) for S3. The overall test error rate is obtained by summing up the number of errors committed in each test subset across all runs and dividing it by the total number of instances. This approach is called three-fold cross-validation.

Figure3.33.Exampledemonstratingthetechniqueof3-foldcross-validation.

The k-fold cross-validation method generalizes this approach by segmenting the labeled data D (of size N) into k equal-sized partitions (or folds). During the i-th run, one of the partitions of D is chosen as D.test(i) for testing, while the rest of the partitions are used as D.train(i) for training. A model m(i) is learned using D.train(i) and applied on D.test(i) to obtain the sum of test errors, errsum(i). This procedure is repeated k times. The total test error rate, errtest, is then computed as

errtest = ( Σ_{i=1}^{k} errsum(i) ) / N.   (3.14)

Every instance in the data is thus used for testing exactly once and for training exactly (k − 1) times. Also, every run uses a (k − 1)/k fraction of the data for training and a 1/k fraction for testing.

The right choice of k in k-fold cross-validation depends on a number of characteristics of the problem. A small value of k will result in a smaller training set at every run, which will result in a larger estimate of the generalization error rate than what is expected of a model trained over the entire labeled set. On the other hand, a high value of k results in a larger training set at every run, which reduces the bias in the estimate of the generalization error rate. In the extreme case, when k = N, every run uses exactly one data instance for testing and the remainder of the data for training. This special case of k-fold cross-validation is called the leave-one-out approach. This approach has the advantage of utilizing as much data as possible for training. However, leave-one-out can produce quite misleading results in some special scenarios, as illustrated in Exercise 11. Furthermore, leave-one-out can be computationally expensive for large data sets as the cross-validation procedure needs to be repeated N times. For most practical applications, a choice of k between 5 and 10 provides a reasonable approach for estimating the generalization error rate, because each fold is able to make use of 80% to 90% of the labeled data for training.
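The following sketch computes errtest exactly as in Equation 3.14, summing the errors over the test folds and dividing by N (synthetic data and a fixed-depth decision tree are used only for illustration).

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
N, k = len(y), 10

# k-fold cross-validation: accumulate errsum(i) over the k test folds,
# then divide by the total number of instances (Equation 3.14).
total_errors = 0
for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
    model = DecisionTreeClassifier(max_depth=3, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    total_errors += np.sum(model.predict(X[test_idx]) != y[test_idx])

err_test = total_errors / N
print(err_test)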

The k-fold cross-validation method, as described above, produces a single estimate of the generalization error rate, without providing any information about the variance of the estimate. To obtain this information, we can run k-fold cross-validation for every possible partitioning of the data into k partitions,

andobtainadistributionoftesterrorratescomputedforeverysuchpartitioning.Theaveragetesterrorrateacrossallpossiblepartitioningsservesasamorerobustestimateofgeneralizationerrorrate.Thisapproachofestimatingthegeneralizationerrorrateanditsvarianceisknownasthecompletecross-validationapproach.Eventhoughsuchanestimateisquiterobust,itisusuallytooexpensivetoconsiderallpossiblepartitioningsofalargedatasetintokpartitions.Amorepracticalsolutionistorepeatthecross-validationapproachmultipletimes,usingadifferentrandompartitioningofthedataintokpartitionsateverytime,andusetheaveragetesterrorrateastheestimateofgeneralizationerrorrate.Notethatsincethereisonlyonepossiblepartitioningfortheleave-one-outapproach,itisnotpossibletoestimatethevarianceofgeneralizationerrorrate,whichisanotherlimitationofthismethod.

Thek-foldcross-validationdoesnotguaranteethatthefractionofpositiveandnegativeinstancesineverypartitionofthedataisequaltothefractionobservedintheoveralldata.Asimplesolutiontothisproblemistoperformastratifiedsamplingofthepositiveandnegativeinstancesintokpartitions,anapproachcalledstratifiedcross-validation.
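With scikit-learn, stratification amounts to replacing KFold with StratifiedKFold; the sketch below (on an imbalanced synthetic data set, chosen only for illustration) shows that each test fold preserves the overall class proportions.

from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)

# Stratified k-fold: each fold preserves (approximately) the 90/10 class
# proportions of the overall labeled set, unlike a plain random partition.
for train_idx, test_idx in StratifiedKFold(n_splits=5, shuffle=True,
                                           random_state=0).split(X, y):
    print(y[test_idx].mean())   # fraction of positive instances in each test fold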

In k-fold cross-validation, a different model is learned at every run and the performance of every one of the k models on their respective test folds is then aggregated to compute the overall test error rate, errtest. Hence, errtest does not reflect the generalization error rate of any of the k models. Instead, it reflects the expected generalization error rate of the model selection approach, when applied on a training set of the same size as one of the training folds (N(k − 1)/k). This is different than the errtest computed in the holdout method, which exactly corresponds to the specific model learned over D.train. Hence, although effectively utilizing every data instance in D for training and testing, the errtest computed in the cross-validation method does not represent the performance of a single model learned over a specific D.train.

Nonetheless, in practice, errtest is typically used as an estimate of the generalization error of a model built on D. One motivation for this is that when the size of the training folds is closer to the size of the overall data (when k is large), then errtest resembles the expected performance of a model learned over a data set of the same size as D. For example, when k is 10, every training fold is 90% of the overall data. The errtest then should approach the expected performance of a model learned over 90% of the overall data, which will be close to the expected performance of a model learned over D.

3.7 Presence of Hyper-parameters

Hyper-parameters are parameters of learning algorithms that need to be determined before learning the classification model. For instance, consider the hyper-parameter α that appeared in Equation 3.11, which is repeated here for convenience:

gen.error(m) = train.error(m, D.train) + α × complexity(M)

This equation was used for estimating the generalization error for a model selection approach that used an explicit representation of model complexity. (See Section 3.5.2.)

For other examples of hyper-parameters, see Chapter 4.

Unlike regular model parameters, such as the test conditions in the internal nodes of a decision tree, hyper-parameters such as α do not appear in the final classification model that is used to classify unlabeled instances. However, the values of hyper-parameters need to be determined during model selection, a process known as hyper-parameter selection, and must be taken into account during model evaluation. Fortunately, both tasks can be effectively accomplished via slight modifications of the cross-validation approach described in the previous section.

3.7.1 Hyper-parameter Selection

In Section 3.5.2, a validation set was used to select α, and this approach is generally applicable for hyper-parameter selection. Let p be the hyper-parameter that needs to be selected from a finite range of values, P = {p1, p2, …, pn}. Partition D.train into D.tr and D.val. For every choice of hyper-parameter value pi, we can learn a model mi on D.tr, and apply this model on D.val to obtain the validation error rate errval(pi). Let p* be the hyper-parameter value that provides the lowest validation error rate. We can then use the model m* corresponding to p* as the final choice of classification model.

The above approach, although useful, uses only a subset of the data D.train for training and a subset, D.val, for validation. The framework of cross-validation, presented in Section 3.6.2, addresses both of those issues, albeit in the context of model evaluation. Here we indicate how to use a cross-validation approach for hyper-parameter selection. To illustrate this approach, let us partition D.train into three folds as shown in Figure 3.34. At every run, one of the folds is used as D.val for validation, and the remaining two folds are used as D.tr for learning a model, for every choice of hyper-parameter value pi. The overall validation error rate corresponding to each pi is computed by summing the errors across all the three folds. We then select the hyper-parameter value p* that provides the lowest validation error rate, and use it to learn a model m* on the entire training set D.train.

Figure 3.34. Example demonstrating the 3-fold cross-validation framework for hyper-parameter selection using D.train.

Algorithm 3.2 generalizes the above approach using a k-fold cross-validation framework for hyper-parameter selection. At the i-th run of cross-validation, the data in the i-th fold is used as D.val(i) for validation (Step 4), while the remainder of the data in D.train is used as D.tr(i) for training (Step 5). Then, for every choice of hyper-parameter value pi, a model is learned on D.tr(i) (Step 7), which is applied on D.val(i) to compute its validation error (Step 8). This is used to compute the validation error rate corresponding to models learned using pi over all the folds (Step 11). The hyper-parameter value p* that provides the lowest validation error rate (Step 12) is now used to learn the final model m* on the entire training set D.train (Step 13). Hence, at the end of this algorithm, we obtain the best choice of the hyper-parameter value as well as the final classification model (Step 14), both of which are obtained by making an effective use of every data instance in D.train.

Algorithm 3.2 Procedure model-select(k, P, D.train)
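The pseudocode of Algorithm 3.2 is not reproduced here, but the procedure described above can be sketched in a few lines of Python. The choice of max_depth as the hyper-parameter p, and the synthetic data, are assumptions made only for illustration.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

def model_select(k, P, X_train, y_train):
    # Sketch of the model-select procedure described above: for each candidate
    # hyper-parameter value p in P, accumulate validation errors over k folds,
    # then refit the best value p* on all of D.train to obtain m*.
    fold_errors = {p: 0 for p in P}
    for tr_idx, val_idx in KFold(n_splits=k, shuffle=True,
                                 random_state=0).split(X_train):
        for p in P:
            m = DecisionTreeClassifier(max_depth=p, random_state=0)
            m.fit(X_train[tr_idx], y_train[tr_idx])
            fold_errors[p] += np.sum(m.predict(X_train[val_idx]) != y_train[val_idx])
    p_star = min(fold_errors, key=fold_errors.get)      # lowest validation error rate
    m_star = DecisionTreeClassifier(max_depth=p_star,
                                    random_state=0).fit(X_train, y_train)
    return p_star, m_star

X, y = make_classification(n_samples=300, random_state=0)
p_star, m_star = model_select(3, [1, 2, 4, 8], X, y)
print(p_star)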

3.7.2 Nested Cross-Validation

The approach of the previous section provides a way to effectively use all the instances in D.train to learn a classification model when hyper-parameter selection is required. This approach can be applied over the entire data set D to learn the final classification model. However, applying Algorithm 3.2 on D would only return the final classification model m* but not an estimate of its generalization performance, errtest. Recall that the validation error rates used in Algorithm 3.2 cannot be used as estimates of generalization performance, since they are used to guide the selection of the final model m*. However, to compute errtest, we can again use a cross-validation framework for evaluating the performance on the entire data set D, as described originally in Section 3.6.2. In this approach, D is partitioned into D.train (for training) and D.test (for testing) at every run of cross-validation. When hyper-parameters are involved, we can use Algorithm 3.2 to train a model using D.train at every run, thus "internally" using cross-validation for model selection. This approach is called nested cross-validation or double cross-validation. Algorithm 3.3 describes the complete approach for estimating errtest using nested cross-validation in the presence of hyper-parameters.

As an illustration of this approach, see Figure 3.35, where the labeled set D is partitioned into D.train and D.test, using a 3-fold cross-validation method.

Figure 3.35. Example demonstrating 3-fold nested cross-validation for computing errtest.

At the i-th run of this method, one of the folds is used as the test set, D.test(i), while the remaining two folds are used as the training set, D.train(i). This is represented in Figure 3.35 as the i-th "outer" run. In order to select a model using D.train(i), we again use an "inner" 3-fold cross-validation framework that partitions D.train(i) into D.tr and D.val at every one of the three inner runs (iterations). As described in Section 3.7, we can use the inner cross-validation framework to select the best hyper-parameter value p*(i) as well as its resulting classification model m*(i) learned over D.train(i). We can then apply m*(i) on D.test(i) to obtain the test error at the i-th outer run. By repeating this process for every outer run, we can compute the average test error rate, errtest, over the entire labeled set D. Note that in the above approach, the inner cross-validation framework is being used for model selection while the outer cross-validation framework is being used for model evaluation.

Algorithm 3.3 The nested cross-validation approach for computing errtest.
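A compact way to realize nested cross-validation in Python is to place a grid search (the inner, model selection loop) inside an outer evaluation loop. The sketch below uses scikit-learn's GridSearchCV and cross_val_score for this purpose; the max_depth grid and the synthetic data are again only illustrative.

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

# Inner cross-validation (model selection): pick the best max_depth on each
# outer training fold D.train(i) and refit on that fold.
inner = GridSearchCV(DecisionTreeClassifier(random_state=0),
                     param_grid={"max_depth": [1, 2, 4, 8]},
                     cv=KFold(n_splits=3, shuffle=True, random_state=1))

# Outer cross-validation (model evaluation): each outer test fold D.test(i)
# is never seen by the inner selection; the mean error estimates errtest.
outer_scores = cross_val_score(inner, X, y,
                               cv=KFold(n_splits=3, shuffle=True, random_state=2))
err_test = 1.0 - outer_scores.mean()
print(err_test)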

3.8 Pitfalls of Model Selection and Evaluation

Model selection and evaluation, when used effectively, serve as excellent tools for learning classification models and assessing their generalization performance. However, when using them in practical settings, there are several pitfalls that can result in improper and often misleading conclusions. Some of these pitfalls are simple to understand and easy to avoid, while others are quite subtle in nature and difficult to catch. In the following, we present two of these pitfalls and discuss best practices to avoid them.

3.8.1OverlapbetweenTrainingandTestSets

One of the basic requirements of a clean model selection and evaluation setup is that the data used for model selection (D.train) must be kept separate from the data used for model evaluation (D.test). If there is any overlap between the two, the test error rate errtest computed over D.test cannot be considered representative of the performance on unseen instances. Comparing the effectiveness of classification models using errtest can then be quite misleading, as an overly complex model can show an inaccurately low value of errtest due to model overfitting (see Exercise 12 at the end of this chapter).

ToillustratetheimportanceofensuringnooverlapbetweenD.trainandD.test,consideralabeleddatasetwherealltheattributesareirrelevant,i.e.theyhavenorelationshipwiththeclasslabels.Usingsuchattributes,weshouldexpectnoclassificationmodeltoperformbetterthanrandomguessing.However,ifthetestsetinvolvesevenasmallnumberofdatainstancesthatwereusedfortraining,thereisapossibilityforanoverlycomplexmodeltoshowbetterperformancethanrandom,eventhoughtheattributesarecompletelyirrelevant.AswewillseelaterinChapter10 ,thisscenariocanactuallybeusedasacriteriontodetectoverfittingduetoimpropersetupofexperiment.Ifamodelshowsbetterperformancethanarandomclassifierevenwhentheattributesareirrelevant,itisanindicationofapotentialfeedbackbetweenthetrainingandtestsets.

3.8.2UseofValidationErrorasGeneralizationError

The validation error rate errval serves an important role during model selection, as it provides "out-of-sample" error estimates of models on D.val, which is not used for training the models. Hence, errval serves as a better metric than the training error rate for selecting models and hyper-parameter values, as described in Sections 3.5.1 and 3.7, respectively. However, once the validation set has been used for selecting a classification model m*, errval no longer reflects the performance of m* on unseen instances.

To realize the pitfall in using the validation error rate as an estimate of generalization performance, consider the problem of selecting a hyper-parameter value p from a range of values, P, using a validation set D.val. If the number of possible values in P is quite large and the size of D.val is small, it is possible to select a hyper-parameter value p* that shows favorable performance on D.val just by random chance. Notice the similarity of this problem with the multiple comparisons problem discussed in Section 3.4.1. Even though the classification model m* learned using p* would show a low validation error rate, it would lack generalizability on unseen test instances.

The correct approach for estimating the generalization error rate of a model m* is to use an independently chosen test set D.test that hasn't been used in any way to influence the selection of m*. As a rule of thumb, the test set should never be examined during model selection, to ensure the absence of any form of overfitting. If the insights gained from any portion of a labeled data set help in improving the classification model even in an indirect way, then that portion of data must be discarded during testing.

3.9 Model Comparison

One difficulty when comparing the performance of different classification models is whether the observed difference in their performance is statistically significant. For example, consider a pair of classification models, MA and MB. Suppose MA achieves 85% accuracy when evaluated on a test set containing 30 instances, while MB achieves 75% accuracy on a different test set containing 5000 instances. Based on this information, is MA a better model than MB? This example raises two key questions regarding the statistical significance of a performance metric:

1. Although MA has a higher accuracy than MB, it was tested on a smaller test set. How much confidence do we have that the accuracy for MA is actually 85%?

2. Is it possible to explain the difference in accuracies between MA and MB as a result of variations in the composition of their test sets?

The first question relates to the issue of estimating the confidence interval of model accuracy. The second question relates to the issue of testing the statistical significance of the observed deviation. These issues are investigated in the remainder of this section.

3.9.1 Estimating the Confidence Interval for Accuracy

To determine its confidence interval, we need to establish the probability distribution for sample accuracy. This section describes an approach for deriving the confidence interval by modeling the classification task as a binomial random experiment. The following describes the characteristics of such an experiment:

1. The random experiment consists of N independent trials, where each trial has two possible outcomes: success or failure.

2. The probability of success, p, in each trial is constant.

An example of a binomial experiment is counting the number of heads that turn up when a coin is flipped N times. If X is the number of successes observed in N trials, then the probability that X takes a particular value is given by a binomial distribution with mean Np and variance Np(1 − p):

P(X = v) = (N choose v) p^v (1 − p)^(N−v).

For example, if the coin is fair (p = 0.5) and is flipped fifty times, then the probability that the head shows up 20 times is

P(X = 20) = (50 choose 20) 0.5^20 (1 − 0.5)^30 = 0.0419.

If the experiment is repeated many times, then the average number of heads expected to show up is 50 × 0.5 = 25, while its variance is 50 × 0.5 × 0.5 = 12.5.

The task of predicting the class labels of test instances can also be considered as a binomial experiment. Given a test set that contains N instances, let X be the number of instances correctly predicted by a model and p be the true accuracy of the model. If the prediction task is modeled as a binomial experiment, then X has a binomial distribution with mean Np and variance Np(1 − p). It can be shown that the empirical accuracy, acc = X/N, also has a binomial distribution with mean p and variance p(1 − p)/N (see Exercise 14). The binomial distribution can be approximated by a normal distribution when N is sufficiently large. Based on the normal distribution, the confidence interval for acc can be derived as follows:

P( −Zα/2 ≤ (acc − p) / √(p(1 − p)/N) ≤ Z1−α/2 ) = 1 − α,   (3.15)

where Zα/2 and Z1−α/2 are the upper and lower bounds obtained from a standard normal distribution at confidence level (1 − α). Since a standard normal distribution is symmetric around Z = 0, it follows that Zα/2 = Z1−α/2. Rearranging this inequality leads to the following confidence interval for p:

( 2 × N × acc + Zα/2² ± Zα/2 √( Zα/2² + 4 N acc − 4 N acc² ) ) / ( 2 (N + Zα/2²) ).   (3.16)

The following table shows the values of Zα/2 at different confidence levels:

1 − α:   0.99   0.98   0.95   0.9    0.8    0.7    0.5
Zα/2:    2.58   2.33   1.96   1.65   1.28   1.04   0.67

Example 3.11 (Confidence Interval for Accuracy). Consider a model that has an accuracy of 80% when evaluated on 100 test instances. What is the confidence interval for its true accuracy at a 95% confidence level? The confidence level of 95% corresponds to Zα/2 = 1.96 according to the table given above. Inserting this term into Equation 3.16 yields a confidence interval between 71.1% and 86.7%. The following table shows the confidence interval when the number of instances, N, increases:

N:                    20           50           100          500          1000         5000
Confidence Interval:  0.584–0.919  0.670–0.888  0.711–0.867  0.763–0.833  0.774–0.824  0.789–0.811

Note that the confidence interval becomes tighter when N increases.
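Equation 3.16 is straightforward to implement. The sketch below (assuming scipy for the normal quantile) reproduces the intervals in the table above, e.g., 0.711 to 0.867 for N = 100.

from math import sqrt
from scipy.stats import norm

def accuracy_interval(acc, N, confidence=0.95):
    # Confidence interval for the true accuracy p (Equation 3.16),
    # based on the normal approximation to the binomial distribution.
    z = norm.ppf(1 - (1 - confidence) / 2)          # e.g., ~1.96 at 95% confidence
    center = 2 * N * acc + z**2
    spread = z * sqrt(z**2 + 4 * N * acc - 4 * N * acc**2)
    denom = 2 * (N + z**2)
    return (center - spread) / denom, (center + spread) / denom

for n in (20, 50, 100, 500, 1000, 5000):
    lo, hi = accuracy_interval(0.80, n)
    print(n, round(lo, 3), round(hi, 3))            # e.g., (0.711, 0.867) for N = 100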

3.9.2 Comparing the Performance of Two Models

Consider a pair of models, M1 and M2, which are evaluated on two independent test sets, D1 and D2. Let n1 denote the number of instances in D1 and n2 denote the number of instances in D2. In addition, suppose the error rate for M1 on D1 is e1 and the error rate for M2 on D2 is e2. Our goal is to test whether the observed difference between e1 and e2 is statistically significant.

Assuming that n1 and n2 are sufficiently large, the error rates e1 and e2 can be approximated using normal distributions. If the observed difference in the error rate is denoted as d = e1 − e2, then d is also normally distributed with mean dt, its true difference, and variance σd². The variance of d can be computed as follows:

σd² ≃ σ̂d² = e1(1 − e1)/n1 + e2(1 − e2)/n2,   (3.17)

where e1(1 − e1)/n1 and e2(1 − e2)/n2 are the variances of the error rates. Finally, at the (1 − α)% confidence level, it can be shown that the confidence interval for the true difference dt is given by the following equation:

dt = d ± zα/2 σ̂d.   (3.18)

Example 3.12 (Significance Testing). Consider the problem described at the beginning of this section. Model MA has an error rate of e1 = 0.15 when applied to N1 = 30 test instances, while model MB has an error rate of e2 = 0.25 when applied to N2 = 5000 test instances. The observed difference in their error rates is d = |0.15 − 0.25| = 0.1. In this example, we are performing a two-sided test to check whether dt = 0 or dt ≠ 0. The estimated variance of the observed difference in error rates can be computed as follows:

σ̂d² = 0.15(1 − 0.15)/30 + 0.25(1 − 0.25)/5000 = 0.0043,

or σ̂d = 0.0655. Inserting this value into Equation 3.18, we obtain the following confidence interval for dt at the 95% confidence level:

dt = 0.1 ± 1.96 × 0.0655 = 0.1 ± 0.128.

As the interval spans the value zero, we can conclude that the observed difference is not statistically significant at a 95% confidence level.

At what confidence level can we reject the hypothesis that dt = 0? To do this, we need to determine the value of Zα/2 such that the confidence interval for dt does not span the value zero. We can reverse the preceding computation and look for the value Zα/2 such that d > Zα/2 σ̂d. Replacing the values of d and σ̂d gives Zα/2 < 1.527. This value first occurs when (1 − α) ≲ 0.936 (for a two-sided test). The result suggests that the null hypothesis can be rejected at a confidence level of 93.6% or lower.
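The computations of Equations 3.17 and 3.18 can be packaged into a small helper; the sketch below (assuming scipy for the normal quantile) reproduces the interval 0.1 ± 0.128 obtained in this example.

from math import sqrt
from scipy.stats import norm

def difference_interval(e1, n1, e2, n2, confidence=0.95):
    # Confidence interval (Equations 3.17 and 3.18) for the true difference
    # d_t between the error rates of two models tested on independent sets.
    d = abs(e1 - e2)
    var_d = e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2
    z = norm.ppf(1 - (1 - confidence) / 2)
    return d - z * sqrt(var_d), d + z * sqrt(var_d)

lo, hi = difference_interval(0.15, 30, 0.25, 5000)
print(round(lo, 3), round(hi, 3))      # roughly (-0.028, 0.228): spans zero,
                                       # so the difference is not significant at 95%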

3.10 Bibliographic Notes

Early classification systems were developed to organize various collections of objects, from living organisms to inanimate ones. Examples abound, from Aristotle's cataloguing of species to the Dewey Decimal and Library of Congress classification systems for books. Such a task typically requires considerable human effort, both to identify properties of the objects to be classified and to organize them into well-distinguished categories.

Withthedevelopmentofstatisticsandcomputing,automatedclassificationhasbeenasubjectofintensiveresearch.Thestudyofclassificationinclassicalstatisticsissometimesknownasdiscriminantanalysis,wheretheobjectiveistopredictthegroupmembershipofanobjectbasedonitscorrespondingfeatures.Awell-knownclassicalmethodisFisher'slineardiscriminantanalysis[142],whichseekstofindalinearprojectionofthedatathatproducesthebestseparationbetweenobjectsfromdifferentclasses.

Manypatternrecognitionproblemsalsorequirethediscriminationofobjectsfromdifferentclasses.Examplesincludespeechrecognition,handwrittencharacteridentification,andimageclassification.ReaderswhoareinterestedintheapplicationofclassificationtechniquesforpatternrecognitionmayrefertothesurveyarticlesbyJainetal.[150]andKulkarnietal.[157]orclassicpatternrecognitionbooksbyBishop[125],Dudaetal.[137],andFukunaga[143].Thesubjectofclassificationisalsoamajorresearchtopicinneuralnetworks,statisticallearning,andmachinelearning.Anin-depthtreatmentonthetopicofclassificationfromthestatisticalandmachinelearningperspectivescanbefoundinthebooksbyBishop[126],CherkasskyandMulier[132],Hastieetal.[148],Michieetal.[162],Murphy[167],andMitchell[165].Recentyearshavealsoseenthereleaseofmanypubliclyavailable

softwarepackagesforclassification,whichcanbeembeddedinprogramminglanguagessuchasJava(Weka[147])andPython(scikit-learn[174]).

An overview of decision tree induction algorithms can be found in the survey articles by Buntine [129], Moret [166], Murthy [168], and Safavian et al. [179]. Examples of some well-known decision tree algorithms include CART [127], ID3 [175], C4.5 [177], and CHAID [153]. Both ID3 and C4.5 employ the entropy measure as their splitting function. An in-depth discussion of the C4.5 decision tree algorithm is given by Quinlan [177]. The CART algorithm was developed by Breiman et al. [127] and uses the Gini index as its splitting function. CHAID [153] uses the statistical χ2 test to determine the best split during the tree-growing process.

Thedecisiontreealgorithmpresentedinthischapterassumesthatthesplittingconditionateachinternalnodecontainsonlyoneattribute.Anobliquedecisiontreecanusemultipleattributestoformtheattributetestconditioninasinglenode[149,187].Breimanetal.[127]provideanoptionforusinglinearcombinationsofattributesintheirCARTimplementation.OtherapproachesforinducingobliquedecisiontreeswereproposedbyHeathetal.[149],Murthyetal.[169],Cantú-PazandKamath[130],andUtgoffandBrodley[187].Althoughanobliquedecisiontreehelpstoimprovetheexpressivenessofthemodelrepresentation,thetreeinductionprocessbecomescomputationallychallenging.Anotherwaytoimprovetheexpressivenessofadecisiontreewithoutusingobliquedecisiontreesistoapplyamethodknownasconstructiveinduction[161].Thismethodsimplifiesthetaskoflearningcomplexsplittingfunctionsbycreatingcompoundfeaturesfromtheoriginaldata.

Besidesthetop-downapproach,otherstrategiesforgrowingadecisiontreeincludethebottom-upapproachbyLandeweerdetal.[159]andPattipatiandAlexandridis[173],aswellasthebidirectionalapproachbyKimand


Landgrebe[154].SchuermannandDoster[181]andWangandSuen[193]proposedusingasoftsplittingcriteriontoaddressthedatafragmentationproblem.Inthisapproach,eachinstanceisassignedtodifferentbranchesofthedecisiontreewithdifferentprobabilities.

Modeloverfittingisanimportantissuethatmustbeaddressedtoensurethatadecisiontreeclassifierperformsequallywellonpreviouslyunlabeleddatainstances.ThemodeloverfittingproblemhasbeeninvestigatedbymanyauthorsincludingBreimanetal.[127],Schaffer[180],Mingers[164],andJensenandCohen[151].Whilethepresenceofnoiseisoftenregardedasoneoftheprimaryreasonsforoverfitting[164,170],JensenandCohen[151]viewedoverfittingasanartifactoffailuretocompensateforthemultiplecomparisonsproblem.

Bishop[126]andHastieetal.[148]provideanexcellentdiscussionofmodeloverfitting,relatingittoawell-knownframeworkoftheoreticalanalysis,knownasbias-variancedecomposition[146].Inthisframework,thepredictionofalearningalgorithmisconsideredtobeafunctionofthetrainingset,whichvariesasthetrainingsetischanged.Thegeneralizationerrorofamodelisthendescribedintermsofitsbias(theerroroftheaveragepredictionobtainedusingdifferenttrainingsets),itsvariance(howdifferentarethepredictionsobtainedusingdifferenttrainingsets),andnoise(theirreducibleerrorinherenttotheproblem).Anunderfitmodelisconsideredtohavehighbiasbutlowvariance,whileanoverfitmodelisconsideredtohavelowbiasbuthighvariance.Althoughthebias-variancedecompositionwasoriginallyproposedforregressionproblems(wherethetargetattributeisacontinuousvariable),aunifiedanalysisthatisapplicableforclassificationhasbeenproposedbyDomingos[136].ThebiasvariancedecompositionwillbediscussedinmoredetailwhileintroducingensemblelearningmethodsinChapter4 .

Variouslearningprinciples,suchastheProbablyApproximatelyCorrect(PAC)learningframework[188],havebeendevelopedtoprovideatheoreticalframeworkforexplainingthegeneralizationperformanceoflearningalgorithms.Inthefieldofstatistics,anumberofperformanceestimationmethodshavebeenproposedthatmakeatrade-offbetweenthegoodnessoffitofamodelandthemodelcomplexity.MostnoteworthyamongthemaretheAkaike'sInformationCriterion[120]andtheBayesianInformationCriterion[182].Theybothapplycorrectivetermstothetrainingerrorrateofamodel,soastopenalizemorecomplexmodels.Anotherwidely-usedapproachformeasuringthecomplexityofanygeneralmodelistheVapnikChervonenkis(VC)Dimension[190].TheVCdimensionofaclassoffunctionsCisdefinedasthemaximumnumberofpointsthatcanbeshattered(everypointcanbedistinguishedfromtherest)byfunctionsbelongingtoC,foranypossibleconfigurationofpoints.TheVCdimensionlaysthefoundationofthestructuralriskminimizationprinciple[189],whichisextensivelyusedinmanylearningalgorithms,e.g.,supportvectormachines,whichwillbediscussedindetailinChapter4 .

TheOccam'srazorprincipleisoftenattributedtothephilosopherWilliamofOccam.Domingos[135]cautionedagainstthepitfallofmisinterpretingOccam'srazorascomparingmodelswithsimilartrainingerrors,insteadofgeneralizationerrors.Asurveyondecisiontree-pruningmethodstoavoidoverfittingisgivenbyBreslowandAha[128]andEspositoetal.[141].Someofthetypicalpruningmethodsincludereducederrorpruning[176],pessimisticerrorpruning[176],minimumerrorpruning[171],criticalvaluepruning[163],cost-complexitypruning[127],anderror-basedpruning[177].QuinlanandRivestproposedusingtheminimumdescriptionlengthprinciplefordecisiontreepruningin[178].

Thediscussionsinthischapteronthesignificanceofcross-validationerrorestimatesisinspiredfromChapter7 inHastieetal.[148].Itisalsoan

excellentresourceforunderstanding“therightandwrongwaystodocross-validation”,whichissimilartothediscussiononpitfallsinSection3.8 ofthischapter.Acomprehensivediscussionofsomeofthecommonpitfallsinusingcross-validationformodelselectionandevaluationisprovidedinKrstajicetal.[156].

Theoriginalcross-validationmethodwasproposedindependentlybyAllen[121],Stone[184],andGeisser[145]formodelassessment(evaluation).Eventhoughcross-validationcanbeusedformodelselection[194],itsusageformodelselectionisquitedifferentthanwhenitisusedformodelevaluation,asemphasizedbyStone[184].Overtheyears,thedistinctionbetweenthetwousageshasoftenbeenignored,resultinginincorrectfindings.Oneofthecommonmistakeswhileusingcross-validationistoperformpre-processingoperations(e.g.,hyper-parametertuningorfeatureselection)usingtheentiredatasetandnot“within”thetrainingfoldofeverycross-validationrun.Ambroiseetal.,usinganumberofgeneexpressionstudiesasexamples,[124]provideanextensivediscussionoftheselectionbiasthatariseswhenfeatureselectionisperformedoutsidecross-validation.UsefulguidelinesforevaluatingmodelsonmicroarraydatahavealsobeenprovidedbyAllisonetal.[122].

Theuseofthecross-validationprotocolforhyper-parametertuninghasbeendescribedindetailbyDudoitandvanderLaan[138].Thisapproachhasbeencalled“grid-searchcross-validation.”Thecorrectapproachinusingcross-validationforbothhyper-parameterselectionandmodelevaluation,asdiscussedinSection3.7 ofthischapter,isextensivelycoveredbyVarmaandSimon[191].Thiscombinedapproachhasbeenreferredtoas“nestedcross-validation”or“doublecross-validation”intheexistingliterature.Recently,TibshiraniandTibshirani[185]haveproposedanewapproachforhyper-parameterselectionandmodelevaluation.Tsamardinosetal.[186]comparedthisapproachtonestedcross-validation.Theexperimentsthey

performedfoundthat,onaverage,bothapproachesprovideconservativeestimatesofmodelperformancewiththeTibshiraniandTibshiraniapproachbeingmorecomputationallyefficient.

Kohavi[155]hasperformedanextensiveempiricalstudytocomparetheperformancemetricsobtainedusingdifferentestimationmethodssuchasrandomsubsamplingandk-foldcross-validation.Theirresultssuggestthatthebestestimationmethodisten-fold,stratifiedcross-validation.

An alternative approach for model evaluation is the bootstrap method, which was presented by Efron in 1979 [139]. In this method, training instances are sampled with replacement from the labeled set, i.e., an instance previously selected to be part of the training set is equally likely to be drawn again. If the original data has N instances, it can be shown that, on average, a bootstrap sample of size N contains about 63.2% of the instances in the original data. Instances that are not included in the bootstrap sample become part of the test set. The bootstrap procedure for obtaining training and test sets is repeated b times, resulting in a different error rate on the test set, err(i), at the i-th run. To obtain the overall error rate, errboot, the .632 bootstrap approach combines err(i) with the error rate obtained from a training set containing all the labeled examples, errs, as follows:

errboot = (1/b) Σ_{i=1}^{b} ( 0.632 × err(i) + 0.368 × errs ).   (3.19)

Efron and Tibshirani [140] provided a theoretical and empirical comparison between cross-validation and a bootstrap method known as the .632+ rule.

While the .632 bootstrap method presented above provides a robust estimate of the generalization performance with low variance in its estimate, it may produce misleading results for highly complex models in certain conditions, as demonstrated by Kohavi [155]. This is because the overall error rate is not truly an out-of-sample error estimate, as it depends on the training error rate, errs, which can be quite small if there is overfitting.

CurrenttechniquessuchasC4.5requirethattheentiretrainingdatasetfitintomainmemory.Therehasbeenconsiderableefforttodevelopparallelandscalableversionsofdecisiontreeinductionalgorithms.SomeoftheproposedalgorithmsincludeSLIQbyMehtaetal.[160],SPRINTbyShaferetal.[183],CMPbyWangandZaniolo[192],CLOUDSbyAlsabtietal.[123],RainForestbyGehrkeetal.[144],andScalParCbyJoshietal.[152].Asurveyofparallelalgorithmsforclassificationandotherdataminingtasksisgivenin[158].Morerecently,therehasbeenextensiveresearchtoimplementlarge-scaleclassifiersonthecomputeunifieddevicearchitecture(CUDA)[131,134]andMapReduce[133,172]platforms.


Bibliography[120]H.Akaike.Informationtheoryandanextensionofthemaximum

likelihoodprinciple.InSelectedPapersofHirotuguAkaike,pages199–213.Springer,1998.

[121]D.M.Allen.Therelationshipbetweenvariableselectionanddataagumentationandamethodforprediction.Technometrics,16(1):125–127,1974.

[122]D.B.Allison,X.Cui,G.P.Page,andM.Sabripour.Microarraydataanalysis:fromdisarraytoconsolidationandconsensus.Naturereviewsgenetics,7(1):55–65,2006.

[123]K.Alsabti,S.Ranka,andV.Singh.CLOUDS:ADecisionTreeClassifierforLargeDatasets.InProc.ofthe4thIntl.Conf.onKnowledgeDiscoveryandDataMining,pages2–8,NewYork,NY,August1998.

[124]C.AmbroiseandG.J.McLachlan.Selectionbiasingeneextractiononthebasisofmicroarraygene-expressiondata.Proceedingsofthenationalacademyofsciences,99(10):6562–6566,2002.

[125]C.M.Bishop.NeuralNetworksforPatternRecognition.OxfordUniversityPress,Oxford,U.K.,1995.

[126]C.M.Bishop.PatternRecognitionandMachineLearning.Springer,2006.

[127]L.Breiman,J.H.Friedman,R.Olshen,andC.J.Stone.ClassificationandRegressionTrees.Chapman&Hall,NewYork,1984.

[128]L.A.BreslowandD.W.Aha.SimplifyingDecisionTrees:ASurvey.KnowledgeEngineeringReview,12(1):1–40,1997.

[129]W.Buntine.Learningclassificationtrees.InArtificialIntelligenceFrontiersinStatistics,pages182–201.Chapman&Hall,London,1993.

[130]E.Cantú-PazandC.Kamath.Usingevolutionaryalgorithmstoinduceobliquedecisiontrees.InProc.oftheGeneticandEvolutionaryComputationConf.,pages1053–1060,SanFrancisco,CA,2000.

[131]B.Catanzaro,N.Sundaram,andK.Keutzer.Fastsupportvectormachinetrainingandclassificationongraphicsprocessors.InProceedingsofthe25thInternationalConferenceonMachineLearning,pages104–111,2008.

[132]V.CherkasskyandF.M.Mulier.LearningfromData:Concepts,Theory,andMethods.Wiley,2ndedition,2007.

[133]C.Chu,S.K.Kim,Y.-A.Lin,Y.Yu,G.Bradski,A.Y.Ng,andK.Olukotun.Map-reduceformachinelearningonmulticore.Advancesinneuralinformationprocessingsystems,19:281,2007.

[134]A.Cotter,N.Srebro,andJ.Keshet.AGPU-tailoredApproachforTrainingKernelizedSVMs.InProceedingsofthe17thACMSIGKDDInternationalConferenceonKnowledgeDiscoveryandDataMining,pages805–813,SanDiego,California,USA,2011.

[135]P.Domingos.TheRoleofOccam'sRazorinKnowledgeDiscovery.DataMiningandKnowledgeDiscovery,3(4):409–425,1999.

[136]P.Domingos.Aunifiedbias-variancedecomposition.InProceedingsof17thInternationalConferenceonMachineLearning,pages231–238,2000.

[137]R.O.Duda,P.E.Hart,andD.G.Stork.PatternClassification.JohnWiley&Sons,Inc.,NewYork,2ndedition,2001.

[138]S.DudoitandM.J.vanderLaan.Asymptoticsofcross-validatedriskestimationinestimatorselectionandperformanceassessment.StatisticalMethodology,2(2):131–154,2005.

[139]B.Efron.Bootstrapmethods:anotherlookatthejackknife.InBreakthroughsinStatistics,pages569–593.Springer,1992.

[140]B.EfronandR.Tibshirani.Cross-validationandtheBootstrap:EstimatingtheErrorRateofaPredictionRule.Technicalreport,StanfordUniversity,1995.

[141]F.Esposito,D.Malerba,andG.Semeraro.AComparativeAnalysisofMethodsforPruningDecisionTrees.IEEETrans.PatternAnalysisandMachineIntelligence,19(5):476–491,May1997.

[142]R.A.Fisher.Theuseofmultiplemeasurementsintaxonomicproblems.AnnalsofEugenics,7:179–188,1936.

[143]K.Fukunaga.IntroductiontoStatisticalPatternRecognition.AcademicPress,NewYork,1990.

[144]J.Gehrke,R.Ramakrishnan,andV.Ganti.RainForest—AFrameworkforFastDecisionTreeConstructionofLargeDatasets.DataMiningandKnowledgeDiscovery,4(2/3):127–162,2000.

[145]S.Geisser.Thepredictivesamplereusemethodwithapplications.JournaloftheAmericanStatisticalAssociation,70(350):320–328,1975.

[146]S.Geman,E.Bienenstock,andR.Doursat.Neuralnetworksandthebias/variancedilemma.Neuralcomputation,4(1):1–58,1992.

[147]M.Hall,E.Frank,G.Holmes,B.Pfahringer,P.Reutemann,andI.H.Witten.TheWEKADataMiningSoftware:AnUpdate.SIGKDDExplorations,11(1),2009.

[148]T.Hastie,R.Tibshirani,andJ.Friedman.TheElementsofStatisticalLearning:DataMining,Inference,andPrediction.Springer,2ndedition,2009.

[149]D.Heath,S.Kasif,andS.Salzberg.InductionofObliqueDecisionTrees.InProc.ofthe13thIntl.JointConf.onArtificialIntelligence,pages1002–1007,Chambery,France,August1993.

[150]A.K.Jain,R.P.W.Duin,andJ.Mao.StatisticalPatternRecognition:AReview.IEEETran.Patt.Anal.andMach.Intellig.,22(1):4–37,2000.

[151]D.JensenandP.R.Cohen.MultipleComparisonsinInductionAlgorithms.MachineLearning,38(3):309–338,March2000.

[152]M.V.Joshi,G.Karypis,andV.Kumar.ScalParC:ANewScalableandEfficientParallelClassificationAlgorithmforMiningLargeDatasets.InProc.of12thIntl.ParallelProcessingSymp.(IPPS/SPDP),pages573–579,Orlando,FL,April1998.

[153]G.V.Kass.AnExploratoryTechniqueforInvestigatingLargeQuantitiesofCategoricalData.AppliedStatistics,29:119–127,1980.

[154]B.KimandD.Landgrebe.Hierarchicaldecisionclassifiersinhigh-dimensionalandlargeclassdata.IEEETrans.onGeoscienceandRemoteSensing,29(4):518–528,1991.

[155]R.Kohavi.AStudyonCross-ValidationandBootstrapforAccuracyEstimationandModelSelection.InProc.ofthe15thIntl.JointConf.onArtificialIntelligence,pages1137–1145,Montreal,Canada,August1995.

[156]D.Krstajic,L.J.Buturovic,D.E.Leahy,andS.Thomas.Cross-validationpitfallswhenselectingandassessingregressionandclassificationmodels.Journalofcheminformatics,6(1):1,2014.

[157]S.R.Kulkarni,G.Lugosi,andS.S.Venkatesh.LearningPatternClassification—ASurvey.IEEETran.Inf.Theory,44(6):2178–2206,1998.

[158]V.Kumar,M.V.Joshi,E.-H.Han,P.N.Tan,andM.Steinbach.HighPerformanceDataMining.InHighPerformanceComputingforComputationalScience(VECPAR2002),pages111–125.Springer,2002.

[159]G.Landeweerd,T.Timmers,E.Gersema,M.Bins,andM.Halic.Binarytreeversussingleleveltreeclassificationofwhitebloodcells.PatternRecognition,16:571–577,1983.

[160]M.Mehta,R.Agrawal,andJ.Rissanen.SLIQ:AFastScalableClassifierforDataMining.InProc.ofthe5thIntl.Conf.onExtendingDatabaseTechnology,pages18–32,Avignon,France,March1996.

[161]R.S.Michalski.Atheoryandmethodologyofinductivelearning.ArtificialIntelligence,20:111–116,1983.

[162]D.Michie,D.J.Spiegelhalter,andC.C.Taylor.MachineLearning,NeuralandStatisticalClassification.EllisHorwood,UpperSaddleRiver,NJ,1994.

[163]J.Mingers.ExpertSystems—RuleInductionwithStatisticalData.JOperationalResearchSociety,38:39–47,1987.

[164]J.Mingers.Anempiricalcomparisonofpruningmethodsfordecisiontreeinduction.MachineLearning,4:227–243,1989.

[165]T.Mitchell.MachineLearning.McGraw-Hill,Boston,MA,1997.

[166]B.M.E.Moret.DecisionTreesandDiagrams.ComputingSurveys,14(4):593–623,1982.

[167]K.P.Murphy.MachineLearning:AProbabilisticPerspective.MITPress,2012.

[168]S.K.Murthy.AutomaticConstructionofDecisionTreesfromData:AMulti-DisciplinarySurvey.DataMiningandKnowledgeDiscovery,2(4):345–389,1998.

[169]S.K.Murthy,S.Kasif,andS.Salzberg.Asystemforinductionofobliquedecisiontrees.JofArtificialIntelligenceResearch,2:1–33,1994.

[170]T.Niblett.Constructingdecisiontreesinnoisydomains.InProc.ofthe2ndEuropeanWorkingSessiononLearning,pages67–78,Bled,Yugoslavia,May1987.

[171]T.NiblettandI.Bratko.LearningDecisionRulesinNoisyDomains.InResearchandDevelopmentinExpertSystemsIII,Cambridge,1986.

CambridgeUniversityPress.

[172]I.PalitandC.K.Reddy.Scalableandparallelboostingwithmapreduce.IEEETransactionsonKnowledgeandDataEngineering,24(10):1904–1916,2012.

[173]K.R.PattipatiandM.G.Alexandridis.Applicationofheuristicsearchandinformationtheorytosequentialfaultdiagnosis.IEEETrans.onSystems,Man,andCybernetics,20(4):872–887,1990.

[174]F.Pedregosa,G.Varoquaux,A.Gramfort,V.Michel,B.Thirion,O.Grisel,M.Blondel,P.Prettenhofer,R.Weiss,V.Dubourg,J.Vanderplas,A.Passos,D.Cournapeau,M.Brucher,M.Perrot,andE.Duchesnay.Scikit-learn:MachineLearninginPython.JournalofMachineLearningResearch,12:2825–2830,2011.

[175]J.R.Quinlan.Discoveringrulesbyinductionfromlargecollectionofexamples.InD.Michie,editor,ExpertSystemsintheMicroElectronicAge.EdinburghUniversityPress,Edinburgh,UK,1979.

[176]J.R.Quinlan.SimplifyingDecisionTrees.Intl.J.Man-MachineStudies,27:221–234,1987.

[177]J.R.Quinlan.C4.5:ProgramsforMachineLearning.Morgan-KaufmannPublishers,SanMateo,CA,1993.

[178]J.R.QuinlanandR.L.Rivest.InferringDecisionTreesUsingtheMinimumDescriptionLengthPrinciple.InformationandComputation,80(3):227–248,1989.

[179]S.R.SafavianandD.Landgrebe.ASurveyofDecisionTreeClassifierMethodology.IEEETrans.Systems,ManandCybernetics,22:660–674,May/June1998.

[180]C.Schaffer.Overfittingavoidenceasbias.MachineLearning,10:153–178,1993.

[181]J.SchuermannandW.Doster.Adecision-theoreticapproachinhierarchicalclassifierdesign.PatternRecognition,17:359–369,1984.

[182]G.Schwarzetal.Estimatingthedimensionofamodel.Theannalsofstatistics,6(2):461–464,1978.

[183]J.C.Shafer,R.Agrawal,andM.Mehta.SPRINT:AScalableParallelClassifierforDataMining.InProc.ofthe22ndVLDBConf.,pages544–555,Bombay,India,September1996.

[184]M.Stone.Cross-validatorychoiceandassessmentofstatisticalpredictions.JournaloftheRoyalStatisticalSociety.SeriesB(Methodological),pages111–147,1974.

[185] R. J. Tibshirani and R. Tibshirani. A bias correction for the minimum error rate in cross-validation. The Annals of Applied Statistics, pages 822–829, 2009.

[186]I.Tsamardinos,A.Rakhshani,andV.Lagani.Performance-estimationpropertiesofcross-validation-basedprotocolswithsimultaneoushyper-parameteroptimization.InHellenicConferenceonArtificialIntelligence,pages1–14.Springer,2014.

[187]P.E.UtgoffandC.E.Brodley.Anincrementalmethodforfindingmultivariatesplitsfordecisiontrees.InProc.ofthe7thIntl.Conf.onMachineLearning,pages58–65,Austin,TX,June1990.

[188]L.Valiant.Atheoryofthelearnable.CommunicationsoftheACM,27(11):1134–1142,1984.

[189]V.N.Vapnik.StatisticalLearningTheory.Wiley-Interscience,1998.

[190]V.N.VapnikandA.Y.Chervonenkis.Ontheuniformconvergenceofrelativefrequenciesofeventstotheirprobabilities.InMeasuresofComplexity,pages11–30.Springer,2015.

[191]S.VarmaandR.Simon.Biasinerrorestimationwhenusingcross-validationformodelselection.BMCbioinformatics,7(1):1,2006.

[192]H.WangandC.Zaniolo.CMP:AFastDecisionTreeClassifierUsingMultivariatePredictions.InProc.ofthe16thIntl.Conf.onDataEngineering,pages449–460,SanDiego,CA,March2000.

[193]Q.R.WangandC.Y.Suen.Largetreeclassifierwithheuristicsearchandglobaltraining.IEEETrans.onPatternAnalysisandMachineIntelligence,9(1):91–102,1987.

[194]Y.ZhangandY.Yang.Cross-validationforselectingamodelselectionprocedure.JournalofEconometrics,187(1):95–112,2015.

3.11 Exercises

1. Draw the full decision tree for the parity function of four Boolean attributes, A, B, C, and D. Is it possible to simplify the tree?

2. Consider the training examples shown in Table 3.5 for a binary classification problem.

Table 3.5. Data set for Exercise 2.

CustomerID Gender CarType ShirtSize Class

1 M Family Small C0

2 M Sports Medium C0

3 M Sports Medium C0

4 M Sports Large C0

5 M Sports ExtraLarge C0

6 M Sports ExtraLarge C0

7 F Sports Small C0

8 F Sports Small C0

9 F Sports Medium C0

10 F Luxury Large C0

11 M Family Large C1

12 M Family ExtraLarge C1

13 M Family Medium C1

14 M Luxury ExtraLarge C1

15 F Luxury Small C1

16 F Luxury Small C1

17 F Luxury Medium C1

18 F Luxury Medium C1

19 F Luxury Medium C1

20 F Luxury Large C1

a. Compute the Gini index for the overall collection of training examples.

b. Compute the Gini index for the Customer ID attribute.

c. Compute the Gini index for the Gender attribute.

d. Compute the Gini index for the Car Type attribute using multiway split.

e. Compute the Gini index for the Shirt Size attribute using multiway split.

f. Which attribute is better, Gender, Car Type, or Shirt Size?

g. Explain why Customer ID should not be used as the attribute test condition even though it has the lowest Gini.

3. Consider the training examples shown in Table 3.6 for a binary classification problem.

Table 3.6. Data set for Exercise 3.

Instance  a1  a2  a3   Target Class
1         T   T   1.0  +
2         T   T   6.0  +
3         T   F   5.0  −
4         F   F   4.0  +
5         F   T   7.0  −
6         F   T   3.0  −
7         F   F   8.0  −
8         T   F   7.0  +
9         F   T   5.0  −

a. What is the entropy of this collection of training examples with respect to the class attribute?

b. What are the information gains of a1 and a2 relative to these training examples?

c. For a3, which is a continuous attribute, compute the information gain for every possible split.

d. What is the best split (among a1, a2, and a3) according to the information gain?

e. What is the best split (between a1 and a2) according to the misclassification error rate?

f. What is the best split (between a1 and a2) according to the Gini index?

4. Show that the entropy of a node never increases after splitting it into smaller successor nodes.

5. Consider the following data set for a binary class problem.

A  B  Class Label
T  F  +
T  T  +
T  T  +
T  F  −
T  T  +
F  F  −
F  F  −
F  F  −
T  T  −
T  F  −

a. Calculate the information gain when splitting on A and B. Which attribute would the decision tree induction algorithm choose?

b. Calculate the gain in the Gini index when splitting on A and B. Which attribute would the decision tree induction algorithm choose?

c. Figure 3.11 shows that entropy and the Gini index are both monotonically increasing on the range [0, 0.5] and they are both monotonically decreasing on the range [0.5, 1]. Is it possible that information gain and the gain in the Gini index favor different attributes? Explain.

6. Consider splitting a parent node P into two child nodes, C1 and C2, using some attribute test condition. The composition of labeled training instances at every node is summarized in the table below.

          P   C1   C2
Class 0   7   3    4
Class 1   3   0    3

a. CalculatetheGiniindexandmisclassificationerrorrateoftheparentnodeP.

b. CalculatetheweightedGiniindexofthechildnodes.WouldyouconsiderthisattributetestconditionifGiniisusedastheimpuritymeasure?

c. Calculatetheweightedmisclassificationrateofthechildnodes.Wouldyouconsiderthisattributetestconditionifmisclassificationrateisusedastheimpuritymeasure?

7. Consider the following set of training examples.

X  Y  Z   No. of Class C1 Examples   No. of Class C2 Examples

0 0 0 5 40

0 0 1 0 15

0 1 0 10 5

0 1 1 45 0


1 0 0 10 5

1 0 1 25 0

1 1 0 5 20

1 1 1 0 15

a. Compute a two-level decision tree using the greedy approach described in this chapter. Use the classification error rate as the criterion for splitting. What is the overall error rate of the induced tree?

b. Repeat part (a) using X as the first splitting attribute and then choose the best remaining attribute for splitting at each of the two successor nodes. What is the error rate of the induced tree?

c. Compare the results of parts (a) and (b). Comment on the suitability of the greedy heuristic used for splitting attribute selection.

8. The following table summarizes a data set with three attributes A, B, C and two class labels +, −. Build a two-level decision tree.

A  B  C   Number of + Instances   Number of − Instances
T  T  T   5    0
F  T  T   0    20
T  F  T   20   0
F  F  T   0    5
T  T  F   0    0
F  T  F   25   0
T  F  F   0    0
F  F  F   0    25

a. According to the classification error rate, which attribute would be chosen as the first splitting attribute? For each attribute, show the contingency table and the gains in classification error rate.

b. Repeat for the two children of the root node.

c. How many instances are misclassified by the resulting decision tree?

d. Repeat parts (a), (b), and (c) using C as the splitting attribute.

e. Use the results in parts (c) and (d) to conclude about the greedy nature of the decision tree induction algorithm.

9. Consider the decision tree shown in Figure 3.36.

Figure 3.36. Decision tree and data sets for Exercise 9.

a. Compute the generalization error rate of the tree using the optimistic approach.

b. Compute the generalization error rate of the tree using the pessimistic approach. (For simplicity, use the strategy of adding a factor of 0.5 to each leaf node.)

c. Compute the generalization error rate of the tree using the validation set shown above. This approach is known as reduced error pruning.

10. Consider the decision trees shown in Figure 3.37. Assume they are generated from a data set that contains 16 binary attributes and 3 classes, C1, C2, and C3.

Compute the total description length of each decision tree according to the following formulation of the minimum description length principle.

The total description length of a tree is given by Cost(tree, data) = Cost(tree) + Cost(data|tree).

Each internal node of the tree is encoded by the ID of the splitting attribute. If there are m attributes, the cost of encoding each attribute is log2 m bits.

Figure 3.37. Decision trees for Exercise 10.

Each leaf is encoded using the ID of the class it is associated with. If there are k classes, the cost of encoding a class is log2 k bits.

Cost(tree) is the cost of encoding all the nodes in the tree. To simplify the computation, you can assume that the total cost of the tree is obtained by adding up the costs of encoding each internal node and each leaf node.

Cost(data|tree) is encoded using the classification errors the tree commits on the training set. Each error is encoded by log2 n bits, where n is the total number of training instances.

Which decision tree is better, according to the MDL principle?

11. This exercise, inspired by the discussions in [155], highlights one of the known limitations of the leave-one-out model evaluation procedure. Let us consider a data set containing 50 positive and 50 negative instances, where the attributes are purely random and contain no information about the class labels. Hence, the generalization error rate of any classification model learned over this data is expected to be 0.5. Let us consider a classifier that assigns the majority class label of training instances (ties resolved by using the positive label as the default class) to any test instance, irrespective of its attribute values. We can call this approach the majority inducer classifier. Determine the error rate of this classifier using the following methods.

a. Leave-one-out.

b. 2-fold stratified cross-validation, where the proportion of class labels at every fold is kept the same as that of the overall data.

c. From the results above, which method provides a more reliable evaluation of the classifier's generalization error rate?

12. Consider a labeled data set containing 100 data instances, which is randomly partitioned into two sets A and B, each containing 50 instances. We use A as the training set to learn two decision trees, T10 with 10 leaf nodes and T100 with 100 leaf nodes. The accuracies of the two decision trees on data sets A and B are shown in Table 3.7.

Table 3.7. Comparing the test accuracy of decision trees T10 and T100.

            Accuracy
Data Set    T10     T100
A           0.86    0.97
B           0.84    0.77

a. Based on the accuracies shown in Table 3.7, which classification model would you expect to have better performance on unseen instances?

b. Now, you tested T10 and T100 on the entire data set (A + B) and found that the classification accuracy of T10 on data set (A + B) is 0.85, whereas the classification accuracy of T100 on the data set (A + B) is 0.87. Based on this new information and your observations from Table 3.7, which classification model would you finally choose for classification?

13. Consider the following approach for testing whether a classifier A beats another classifier B. Let N be the size of a given data set, pA be the accuracy of classifier A, pB be the accuracy of classifier B, and p = (pA + pB)/2 be the average accuracy for both classifiers. To test whether classifier A is significantly better than B, the following Z-statistic is used:

Z = (pA − pB) / sqrt( 2 p (1 − p) / N ).

Classifier A is assumed to be better than classifier B if Z > 1.96.
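As a side illustration (not part of the exercise solution), the following short Python sketch evaluates this Z-statistic; the accuracies and data set size plugged in at the bottom are made-up inputs, not values taken from Table 3.8.

    import math

    def z_statistic(p_a, p_b, n):
        """Z-statistic for comparing two classifiers evaluated on the same data set of size n."""
        p = (p_a + p_b) / 2.0                                   # average accuracy
        return (p_a - p_b) / math.sqrt(2.0 * p * (1.0 - p) / n)

    # Hypothetical example: classifier A is 85% accurate, B is 80%, on 1000 instances.
    z = z_statistic(0.85, 0.80, 1000)
    print(round(z, 3), "A significantly better" if z > 1.96 else "no significant difference")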

Table 3.8 compares the accuracies of three different classifiers, decision tree classifiers, naïve Bayes classifiers, and support vector machines, on various data sets. (The latter two classifiers are described in Chapter 4.)

Summarize the performance of the classifiers given in Table 3.8 using the following 3 × 3 table:

win-loss-draw            Decision tree   Naïve Bayes   Support vector machine
Decision tree            0 - 0 - 23
Naïve Bayes                              0 - 0 - 23
Support vector machine                                 0 - 0 - 23

Table 3.8. Comparing the accuracy of various classification methods.

Data Set   Size (N)   Decision Tree (%)   naïve Bayes (%)   Support vector machine (%)

Anneal 898 92.09 79.62 87.19

Australia 690 85.51 76.81 84.78

Auto 205 81.95 58.05 70.73

Breast 699 95.14 95.99 96.42

Cleve 303 76.24 83.50 84.49

Credit 690 85.80 77.54 85.07

Diabetes 768 72.40 75.91 76.82

German 1000 70.90 74.70 74.40

Glass 214 67.29 48.59 59.81

Heart 270 80.00 84.07 83.70

Hepatitis 155 81.94 83.23 87.10

Horse 368 85.33 78.80 82.61

Ionosphere 351 89.17 82.34 88.89

Iris 150 94.67 95.33 96.00

Labor 57 78.95 94.74 92.98

Led7 3200 73.34 73.16 73.56

Lymphography 148 77.03 83.11 86.49

Pima 768 74.35 76.04 76.95

Sonar 208 78.85 69.71 76.92

Tic-tac-toe 958 83.72 70.04 98.33

Vehicle 846 71.04 45.04 74.94

Wine 178 94.38 96.63 98.88

Zoo 101 93.07 93.07 96.04

Each cell in the table contains the number of wins, losses, and draws when comparing the classifier in a given row to the classifier in a given column.

14. Let X be a binomial random variable with mean Np and variance Np(1 − p). Show that the ratio X/N also has a binomial distribution with mean p and variance p(1 − p)/N.

4 Classification: Alternative Techniques

The previous chapter introduced the classification problem and presented a technique known as the decision tree classifier. Issues such as model overfitting and model evaluation were also discussed. This chapter presents alternative techniques for building classification models, from simple techniques such as rule-based and nearest neighbor classifiers to more sophisticated techniques such as artificial neural networks, deep learning, support vector machines, and ensemble methods. Other practical issues such as the class imbalance and multiclass problems are also discussed at the end of the chapter.

4.1 Types of Classifiers

Before presenting specific techniques, we first categorize the different types of classifiers available. One way to distinguish classifiers is by considering the characteristics of their output.

Binary versus Multiclass

Binary classifiers assign each data instance to one of two possible labels, typically denoted as +1 and −1. The positive class usually refers to the category we are more interested in predicting correctly compared to the negative class (e.g., the spam category in email classification problems). If there are more than two possible labels available, then the technique is known as a multiclass classifier. As some classifiers were designed for binary classes only, they must be adapted to deal with multiclass problems. Techniques for transforming binary classifiers into multiclass classifiers are described in Section 4.12.

Deterministic versus Probabilistic

A deterministic classifier produces a discrete-valued label for each data instance it classifies, whereas a probabilistic classifier assigns a continuous score between 0 and 1 to indicate how likely it is that an instance belongs to a particular class, where the probability scores for all the classes sum up to 1. Some examples of probabilistic classifiers include the naïve Bayes classifier, Bayesian networks, and logistic regression. Probabilistic classifiers provide additional information about the confidence in assigning an instance to a class compared to deterministic classifiers. A data instance is typically assigned to the class with the highest probability score, except when the cost of misclassifying the class with lower probability is significantly higher. We will discuss the topic of cost-sensitive classification with probabilistic outputs in Section 4.11.2.

Another way to distinguish the different types of classifiers is based on their technique for discriminating instances from different classes.

Linear versus Nonlinear

A linear classifier uses a linear separating hyperplane to discriminate instances from different classes, whereas a nonlinear classifier enables the construction of more complex, nonlinear decision surfaces. We illustrate an example of a linear classifier (perceptron) and its nonlinear counterpart (multi-layer neural network) in Section 4.7. Although the linearity assumption makes the model less flexible in terms of fitting complex data, it also makes linear classifiers less susceptible to model overfitting compared to nonlinear classifiers. Furthermore, one can transform the original set of attributes, x = (x1, x2, ⋯, xd), into a more complex feature set, e.g., Φ(x) = (x1, x2, x1x2, x1², x2², ⋯), before applying the linear classifier. Such feature transformation allows the linear classifier to fit data sets with nonlinear decision surfaces (see Section 4.9.4).

Global versus Local

A global classifier fits a single model to the entire data set. Unless the model is highly nonlinear, this one-size-fits-all strategy may not be effective when the relationship between the attributes and the class labels varies over the input space. In contrast, a local classifier partitions the input space into smaller regions and fits a distinct model to training instances in each region. The k-nearest neighbor classifier (see Section 4.3) is a classic example of local classifiers. While local classifiers are more flexible in terms of fitting complex decision boundaries, they are also more susceptible to the model overfitting problem, especially when the local regions contain few training examples.

Generative versus Discriminative

Given a data instance x, the primary objective of any classifier is to predict the class label, y, of the data instance. However, apart from predicting the class label, we may also be interested in describing the underlying mechanism that generates the instances belonging to every class label. For example, in the process of classifying spam email messages, it may be useful to understand the typical characteristics of email messages that are labeled as spam, e.g., specific usage of keywords in the subject or the body of the email. Classifiers that learn a generative model of every class in the process of predicting class labels are known as generative classifiers. Some examples of generative classifiers include the naïve Bayes classifier and Bayesian networks. In contrast, discriminative classifiers directly predict the class labels without explicitly describing the distribution of every class label. They solve a simpler problem than generative models since they do not have the onus of deriving insights about the generative mechanism of data instances. They are thus sometimes preferred over generative models, especially when it is not crucial to obtain information about the properties of every class. Some examples of discriminative classifiers include decision trees, rule-based classifiers, nearest neighbor classifiers, artificial neural networks, and support vector machines.

4.2 Rule-Based Classifier

A rule-based classifier uses a collection of "if ... then ..." rules (also known as a rule set) to classify data instances. Table 4.1 shows an example of a rule set generated for the vertebrate classification problem described in the previous chapter. Each classification rule in the rule set can be expressed in the following way:

r_i: (Condition_i) → y_i.     (4.1)

The left-hand side of the rule is called the rule antecedent or precondition. It contains a conjunction of attribute test conditions:

Condition_i = (A_1 op v_1) ∧ (A_2 op v_2) ∧ ⋯ ∧ (A_k op v_k),     (4.2)

where (A_j, v_j) is an attribute-value pair and op is a comparison operator chosen from the set {=, ≠, <, >, ≤, ≥}. Each attribute test (A_j op v_j) is also known as a conjunct. The right-hand side of the rule is called the rule consequent, which contains the predicted class y_i.

A rule r covers a data instance x if the precondition of r matches the attributes of x. r is also said to be fired or triggered whenever it covers a given instance. For an illustration, consider the rule r1 given in Table 4.1 and the following attributes for two vertebrates: hawk and grizzly bear.

Table 4.1. Example of a rule set for the vertebrate classification problem.

r1: (Gives Birth = no) ∧ (Aerial Creature = yes) → Birds
r2: (Gives Birth = no) ∧ (Aquatic Creature = yes) → Fishes
r3: (Gives Birth = yes) ∧ (Body Temperature = warm-blooded) → Mammals
r4: (Gives Birth = no) ∧ (Aerial Creature = no) → Reptiles
r5: (Aquatic Creature = semi) → Amphibians

Name           Body Temperature   Skin Cover   Gives Birth   Aquatic Creature   Aerial Creature   Has Legs   Hibernates
hawk           warm-blooded       feather      no            no                 yes               yes        no
grizzly bear   warm-blooded       fur          yes           no                 no                yes        yes

r1 covers the first vertebrate because its precondition is satisfied by the hawk's attributes. The rule does not cover the second vertebrate because grizzly bears give birth to their young and cannot fly, thus violating the precondition of r1.

The quality of a classification rule can be evaluated using measures such as coverage and accuracy. Given a data set D and a classification rule r: A → y, the coverage of the rule is the fraction of instances in D that trigger the rule r. On the other hand, its accuracy or confidence factor is the fraction of instances triggered by r whose class labels are equal to y. The formal definitions of these measures are

Coverage(r) = |A| / |D|,   Accuracy(r) = |A ∩ y| / |A|,     (4.3)

where |A| is the number of instances that satisfy the rule antecedent, |A ∩ y| is the number of instances that satisfy both the antecedent and consequent, and |D| is the total number of instances.

Example 4.1. Consider the data set shown in Table 4.2. The rule

(Gives Birth = yes) ∧ (Body Temperature = warm-blooded) → Mammals

has a coverage of 33% since five of the fifteen instances support the rule antecedent. The rule accuracy is 100% because all five vertebrates covered by the rule are mammals.
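To make the coverage and accuracy definitions in Equation 4.3 concrete, here is a minimal Python sketch that evaluates them for a single rule; the four records below are an invented toy subset used only for illustration, not the full contents of Table 4.2.

    # Each record: an attribute dictionary plus a class label.
    D = [
        ({"Gives Birth": "yes", "Body Temperature": "warm-blooded"}, "Mammals"),
        ({"Gives Birth": "no",  "Body Temperature": "cold-blooded"}, "Reptiles"),
        ({"Gives Birth": "yes", "Body Temperature": "warm-blooded"}, "Mammals"),
        ({"Gives Birth": "no",  "Body Temperature": "warm-blooded"}, "Birds"),
    ]

    # Rule: (Gives Birth = yes) AND (Body Temperature = warm-blooded) -> Mammals
    antecedent = {"Gives Birth": "yes", "Body Temperature": "warm-blooded"}
    consequent = "Mammals"

    covered = [(x, y) for (x, y) in D if all(x[a] == v for a, v in antecedent.items())]
    coverage = len(covered) / len(D)                                    # |A| / |D|
    accuracy = sum(y == consequent for _, y in covered) / len(covered)  # |A ∩ y| / |A|
    print(coverage, accuracy)   # 0.5 and 1.0 on this toy data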

Table 4.2. The vertebrate data set.

Name            Body Temperature   Skin Cover   Gives Birth   Aquatic Creature   Aerial Creature   Has Legs   Hibernates   Class Label
human           warm-blooded       hair         yes           no                 no                yes        no           Mammals
python          cold-blooded       scales       no            no                 no                no         yes          Reptiles
salmon          cold-blooded       scales       no            yes                no                no         no           Fishes
whale           warm-blooded       hair         yes           yes                no                no         no           Mammals
frog            cold-blooded       none         no            semi               no                yes        yes          Amphibians
komodo dragon   cold-blooded       scales       no            no                 no                yes        no           Reptiles
bat             warm-blooded       hair         yes           no                 yes               yes        yes          Mammals
pigeon          warm-blooded       feathers     no            no                 yes               yes        no           Birds
cat             warm-blooded       fur          yes           no                 no                yes        no           Mammals
guppy           cold-blooded       scales       yes           yes                no                no         no           Fishes
alligator       cold-blooded       scales       no            semi               no                yes        no           Reptiles
penguin         warm-blooded       feathers     no            semi               no                yes        no           Birds
porcupine       warm-blooded       quills       yes           no                 no                yes        yes          Mammals
eel             cold-blooded       scales       no            yes                no                no         no           Fishes
salamander      cold-blooded       none         no            semi               no                yes        yes          Amphibians

4.2.1 How a Rule-Based Classifier Works

A rule-based classifier classifies a test instance based on the rule triggered by the instance. To illustrate how a rule-based classifier works, consider the rule set shown in Table 4.1 and the following vertebrates:

Name            Body Temperature   Skin Cover   Gives Birth   Aquatic Creature   Aerial Creature   Has Legs   Hibernates
lemur           warm-blooded       fur          yes           no                 no                yes        yes
turtle          cold-blooded       scales       no            semi               no                yes        no
dogfish shark   cold-blooded       scales       yes           yes                no                no         no

The first vertebrate, which is a lemur, is warm-blooded and gives birth to its young. It triggers the rule r3, and thus, is classified as a mammal. The second vertebrate, which is a turtle, triggers the rules r4 and r5. Since the classes predicted by the rules are contradictory (reptiles versus amphibians), their conflicting classes must be resolved. None of the rules are applicable to a dogfish shark. In this case, we need to determine what class to assign to such a test instance.

4.2.2 Properties of a Rule Set

The rule set generated by a rule-based classifier can be characterized by the following two properties.

Definition 4.1 (Mutually Exclusive Rule Set). The rules in a rule set R are mutually exclusive if no two rules in R are triggered by the same instance. This property ensures that every instance is covered by at most one rule in R.

Definition 4.2 (Exhaustive Rule Set). A rule set R has exhaustive coverage if there is a rule for each combination of attribute values. This property ensures that every instance is covered by at least one rule in R.

Table 4.3. Example of a mutually exclusive and exhaustive rule set.

r1: (Body Temperature = cold-blooded) → Non-mammals
r2: (Body Temperature = warm-blooded) ∧ (Gives Birth = yes) → Mammals
r3: (Body Temperature = warm-blooded) ∧ (Gives Birth = no) → Non-mammals

Together, these two properties ensure that every instance is covered by exactly one rule. An example of a mutually exclusive and exhaustive rule set is shown in Table 4.3. Unfortunately, many rule-based classifiers, including the one shown in Table 4.1, do not have such properties. If the rule set is not exhaustive, then a default rule, r_d: () → y_d, must be added to cover the remaining cases. A default rule has an empty antecedent and is triggered when all other rules have failed. y_d is known as the default class and is typically assigned to the majority class of training instances not covered by the existing rules. If the rule set is not mutually exclusive, then an instance can be covered by more than one rule, some of which may predict conflicting classes.

Definition 4.3 (Ordered Rule Set). The rules in an ordered rule set R are ranked in decreasing order of their priority. An ordered rule set is also known as a decision list.

The rank of a rule can be defined in many ways, e.g., based on its accuracy or total description length. When a test instance is presented, it will be classified by the highest-ranked rule that covers the instance. This avoids the problem of having conflicting classes predicted by multiple classification rules if the rule set is not mutually exclusive.

Analternativewaytohandleanon-mutuallyexclusiverulesetwithoutorderingtherulesistoconsidertheconsequentofeachruletriggeredbyatestinstanceasavoteforaparticularclass.Thevotesarethentalliedtodeterminetheclasslabelofthetestinstance.Theinstanceisusuallyassignedtotheclassthatreceivesthehighestnumberofvotes.Thevotemayalsobeweightedbytherule'saccuracy.Usingunorderedrulestobuildarule-basedclassifierhasbothadvantagesanddisadvantages.Unorderedrulesarelesssusceptibletoerrorscausedbythewrongrulebeingselectedtoclassifyatestinstanceunlikeclassifiersbasedonorderedrules,whicharesensitivetothechoiceofrule-orderingcriteria.Modelbuildingisalsolessexpensivebecausetherulesdonotneedtobekeptinsortedorder.Nevertheless,classifyingatestinstancecanbequiteexpensivebecausetheattributesofthetestinstancemustbecomparedagainstthepreconditionofeveryruleintheruleset.

Inthenexttwosections,wepresenttechniquesforextractinganorderedrulesetfromdata.Arule-basedclassifiercanbeconstructedusing(1)directmethods,whichextractclassificationrulesdirectlyfromdata,and(2)indirectmethods,whichextractclassificationrulesfrommorecomplexclassificationmodels,suchasdecisiontreesandneuralnetworks.DetaileddiscussionsofthesemethodsarepresentedinSections4.2.3 and4.2.4 ,respectively.

4.2.3DirectMethodsforRuleExtraction

Toillustratethedirectmethod,weconsiderawidely-usedruleinductionalgorithmcalledRIPPER.Thisalgorithmscalesalmostlinearlywiththenumberoftraininginstancesandisparticularlysuitedforbuildingmodelsfrom

datasetswithimbalancedclassdistributions.RIPPERalsoworkswellwithnoisydatabecauseitusesavalidationsettopreventmodeloverfitting.

RIPPER uses the sequential covering algorithm to extract rules directly from data. Rules are grown in a greedy fashion one class at a time. For binary class problems, RIPPER chooses the majority class as its default class and learns the rules to detect instances from the minority class. For multiclass problems, the classes are ordered according to their prevalence in the training set. Let (y1, y2, …, yc) be the ordered list of classes, where y1 is the least prevalent class and yc is the most prevalent class. All training instances that belong to y1 are initially labeled as positive examples, while those that belong to other classes are labeled as negative examples. The sequential covering algorithm learns a set of rules to discriminate the positive from negative examples. Next, all training instances from y2 are labeled as positive, while those from classes y3, y4, ⋯, yc are labeled as negative. The sequential covering algorithm would learn the next set of rules to distinguish y2 from the other remaining classes. This process is repeated until we are left with only one class, yc, which is designated as the default class.

Algorithm 4.1 Sequential covering algorithm.

A summary of the sequential covering algorithm is shown in Algorithm 4.1. The algorithm starts with an empty decision list, R, and extracts rules for each class based on the ordering specified by the class prevalence. It iteratively extracts the rules for a given class y using the Learn-One-Rule function. Once such a rule is found, all the training instances covered by the rule are eliminated. The new rule is added to the bottom of the decision list R. This procedure is repeated until the stopping criterion is met. The algorithm then proceeds to generate rules for the next class.

Figure 4.1 demonstrates how the sequential covering algorithm works for a data set that contains a collection of positive and negative examples. The rule R1, whose coverage is shown in Figure 4.1(b), is extracted first because it covers the largest fraction of positive examples. All the training instances covered by R1 are subsequently removed and the algorithm proceeds to look for the next best rule, which is R2.
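The Python sketch below mimics the control flow just described (grow a rule, remove the instances it covers, repeat). It is a simplified stand-in, not RIPPER itself: the learn_one_rule helper is an assumed toy function that picks a single best attribute test by accuracy, and the data set at the bottom is invented.

    def learn_one_rule(examples, target):
        """Toy stand-in for Learn-One-Rule: pick the single (attribute, value) test
        with the best accuracy on the target class (ties broken by positive support)."""
        best, best_key = None, (-1.0, 0)
        for x, _ in examples:
            for a, v in x.items():
                cov = [(xx, yy) for xx, yy in examples if xx.get(a) == v]
                pos = sum(yy == target for _, yy in cov)
                key = (pos / len(cov), pos)
                if key > best_key:
                    best_key, best = key, {a: v}
        return best

    def sequential_covering(examples, target, min_pos=1):
        rules, remaining = [], list(examples)
        while sum(y == target for _, y in remaining) >= min_pos:
            rule = learn_one_rule(remaining, target)
            if rule is None:
                break
            rules.append((rule, target))
            # Remove every instance covered by the new rule before growing the next one.
            remaining = [(x, y) for x, y in remaining
                         if not all(x.get(a) == v for a, v in rule.items())]
        return rules

    # Tiny made-up data set: predict '+' from two binary attributes.
    data = [({"A": 1, "B": 0}, "+"), ({"A": 1, "B": 1}, "+"),
            ({"A": 0, "B": 1}, "+"), ({"A": 0, "B": 0}, "-"),
            ({"A": 0, "B": 0}, "-")]
    print(sequential_covering(data, "+"))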

Learn-One-Rule Function

Finding an optimal rule is computationally expensive due to the exponential search space to explore. The Learn-One-Rule function addresses this problem by growing the rules in a greedy fashion. It generates an initial rule r: {} → +, where the left-hand side is an empty set and the right-hand side corresponds to the positive class. It then refines the rule until a certain stopping criterion is met. The accuracy of the initial rule may be poor because some of the training instances covered by the rule belong to the negative class. A new conjunct must be added to the rule antecedent to improve its accuracy.

Figure 4.1. An example of the sequential covering algorithm.

RIPPER uses the FOIL's information gain measure to choose the best conjunct to be added into the rule antecedent. The measure takes into consideration both the gain in accuracy and support of a candidate rule, where support is defined as the number of positive examples covered by the rule. For example, suppose the rule r: A → + initially covers p0 positive examples and n0 negative examples. After adding a new conjunct B, the extended rule r′: A ∧ B → + covers p1 positive examples and n1 negative examples. The FOIL's information gain of the extended rule is computed as follows:

FOIL's information gain = p1 × [ log2( p1 / (p1 + n1) ) − log2( p0 / (p0 + n0) ) ].     (4.4)

RIPPER chooses the conjunct with the highest FOIL's information gain to extend the rule, as illustrated in the next example.

Example 4.2. [FOIL's Information Gain] Consider the training set for the vertebrate classification problem shown in Table 4.2. Suppose the target class for the Learn-One-Rule function is mammals. Initially, the antecedent of the rule {} → Mammals covers 5 positive and 10 negative examples. Thus, the accuracy of the rule is only 0.333. Next, consider the following three candidate conjuncts to be added to the left-hand side of the rule: Skin Cover = hair, Body Temperature = warm-blooded, and Has Legs = no. The number of positive (p1) and negative (n1) examples covered by the rule after adding each conjunct, along with their respective accuracy and FOIL's information gain, are shown in the following table.

Candidate rule                                   p1   n1   Accuracy   Info Gain
{Skin Cover = hair} → Mammals                    3    0    1.000      4.755
{Body Temperature = warm-blooded} → Mammals      5    2    0.714      5.498
{Has Legs = no} → Mammals                        1    4    0.200      −0.737

Although {Skin Cover = hair} → Mammals has the highest accuracy among the three candidates, the conjunct Body Temperature = warm-blooded has the highest FOIL's information gain. Thus, it is chosen to extend the rule (see Figure 4.2).

This process continues until adding new conjuncts no longer improves the information gain measure.
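A short Python sketch of Equation 4.4 follows; the counts plugged in below are the ones from the candidate table above (an initial rule covering 5 positives and 10 negatives), so under that assumption it simply reproduces the gains reported there.

    import math

    def foil_gain(p0, n0, p1, n1):
        """FOIL's information gain of extending a rule covering (p0, n0)
        positive/negative examples into one covering (p1, n1)."""
        return p1 * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

    # Candidate conjuncts from Example 4.2 (initial rule: 5 positives, 10 negatives).
    for name, p1, n1 in [("Skin Cover = hair", 3, 0),
                         ("Body Temperature = warm-blooded", 5, 2),
                         ("Has Legs = no", 1, 4)]:
        print(name, round(foil_gain(5, 10, p1, n1), 3))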

Rule Pruning

The rules generated by the Learn-One-Rule function can be pruned to improve their generalization errors. RIPPER prunes the rules based on their performance on the validation set. The following metric is computed to determine whether pruning is needed: (p − n)/(p + n), where p (n) is the number of positive (negative) examples in the validation set covered by the rule. This metric is monotonically related to the rule's accuracy on the validation set. If the metric improves after pruning, then the conjunct is removed. Pruning is done starting from the last conjunct added to the rule. For example, given a rule ABCD → y, RIPPER checks whether D should be pruned first, followed by CD, BCD, etc. While the original rule covers only positive examples, the pruned rule may cover some of the negative examples in the training set.
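As a minimal illustration of this pruning check (not RIPPER's full procedure), the sketch below recomputes the validation-set metric (p − n)/(p + n) after dropping the last conjunct and keeps whichever version scores higher; the validation counts are assumed toy numbers.

    def prune_metric(p, n):
        """Validation-set metric used to decide whether to prune: (p - n) / (p + n)."""
        return (p - n) / (p + n)

    # Assumed toy counts on the validation set.
    full_rule_counts   = (6, 4)   # rule A ^ B ^ C ^ D covers 6 positives, 4 negatives
    pruned_rule_counts = (9, 4)   # rule A ^ B ^ C (conjunct D removed) covers 9 positives, 4 negatives

    if prune_metric(*pruned_rule_counts) > prune_metric(*full_rule_counts):
        print("prune the last conjunct")   # here: 5/13 ≈ 0.385 > 2/10 = 0.2, so D is removed
    else:
        print("keep the full rule")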

Building the Rule Set

After generating a rule, all the positive and negative examples covered by the rule are eliminated. The rule is then added into the rule set as long as it does not violate the stopping condition, which is based on the minimum description length principle. If the new rule increases the total description length of the rule set by at least d bits, then RIPPER stops adding rules into its rule set (by default, d is chosen to be 64 bits). Another stopping condition used by RIPPER is that the error rate of the rule on the validation set must not exceed 50%.

Figure4.2.General-to-specificandspecific-to-generalrule-growingstrategies.

RIPPERalsoperformsadditionaloptimizationstepstodeterminewhethersomeoftheexistingrulesintherulesetcanbereplacedbybetteralternativerules.Readerswhoareinterestedinthedetailsoftheoptimizationmethodmayrefertothereferencecitedattheendofthischapter.

InstanceElimination

Afteraruleisextracted,RIPPEReliminatesthepositiveandnegativeexamplescoveredbytherule.Therationalefordoingthisisillustratedinthenextexample.

Figure4.3 showsthreepossiblerules,R1,R2,andR3,extractedfromatrainingsetthatcontains29positiveexamplesand21negativeexamples.TheaccuraciesofR1,R2,andR3are12/15(80%),7/10(70%),and8/12(66.7%),respectively.R1isgeneratedfirstbecauseithasthehighestaccuracy.AftergeneratingR1,thealgorithmmustremovetheexamplescoveredbytherulesothatthenextrulegeneratedbythealgorithmisdifferentthanR1.Thequestionis,shoulditremovethepositiveexamplesonly,negativeexamplesonly,orboth?Toanswerthis,supposethealgorithmmustchoosebetweengeneratingR2orR3afterR1.EventhoughR2hasahigheraccuracythanR3(70%versus66.7%),observethattheregioncoveredbyR2isdisjointfromR1,whiletheregioncoveredbyR3overlapswithR1.Asaresult,R1andR3togethercover18positiveand5negativeexamples(resultinginanoverallaccuracyof78.3%),whereasR1andR2togethercover19positiveand6negativeexamples(resultinginaloweroverallaccuracyof76%).IfthepositiveexamplescoveredbyR1arenotremoved,thenwemayoverestimatetheeffectiveaccuracyofR3.IfthenegativeexamplescoveredbyR1arenotremoved,thenwemayunderestimatetheaccuracyofR3.Inthelattercase,wemightenduppreferringR2overR3eventhoughhalfofthefalsepositiveerrorscommittedbyR3havealreadybeenaccountedforbytheprecedingrule,R1.ThisexampleshowsthattheeffectiveaccuracyafteraddingR2orR3totherulesetbecomesevidentonlywhenbothpositiveandnegativeexamplescoveredbyR1areremoved.

Figure4.3.Eliminationoftraininginstancesbythesequentialcoveringalgorithm.R1,R2,andR3representregionscoveredbythreedifferentrules.

4.2.4IndirectMethodsforRuleExtraction

Thissectionpresentsamethodforgeneratingarulesetfromadecisiontree.Inprinciple,everypathfromtherootnodetotheleafnodeofadecisiontreecanbeexpressedasaclassificationrule.Thetestconditionsencounteredalongthepathformtheconjunctsoftheruleantecedent,whiletheclasslabelattheleafnodeisassignedtotheruleconsequent.Figure4.4 showsanexampleofarulesetgeneratedfromadecisiontree.Noticethattherulesetisexhaustiveandcontainsmutuallyexclusiverules.However,someoftherulescanbesimplifiedasshowninthenextexample.

Figure4.4.Convertingadecisiontreeintoclassificationrules.

Example 4.3. Consider the following three rules from Figure 4.4:

r2: (P = No) ∧ (Q = Yes) → +
r3: (P = Yes) ∧ (R = No) → +
r5: (P = Yes) ∧ (R = Yes) ∧ (Q = Yes) → +.

Observe that the rule set always predicts a positive class when the value of Q is Yes. Therefore, we may simplify the rules as follows:

r2′: (Q = Yes) → +
r3: (P = Yes) ∧ (R = No) → +.

r3 is retained to cover the remaining instances of the positive class. Although the rules obtained after simplification are no longer mutually exclusive, they are less complex and are easier to interpret.

In the following, we describe an approach used by the C4.5rules algorithm to generate a rule set from a decision tree. Figure 4.5 shows the decision tree

andresultingclassificationrulesobtainedforthedatasetgiveninTable4.2 .

Rule Generation

Classification rules are extracted for every path from the root to one of the leaf nodes in the decision tree. Given a classification rule r: A → y, we consider a simplified rule, r′: A′ → y, where A′ is obtained by removing one of the conjuncts in A. The simplified rule with the lowest pessimistic error rate is retained provided its error rate is less than that of the original rule. The rule-pruning step is repeated until the pessimistic error of the rule cannot be improved further. Because some of the rules may become identical after pruning, the duplicate rules are discarded.

Figure 4.5. Classification rules extracted from a decision tree for the vertebrate classification problem.

Rule Ordering

After generating the rule set, C4.5rules uses the class-based ordering scheme to order the extracted rules. Rules that predict the same class are grouped together into the same subset. The total description length for each subset is computed, and the classes are arranged in increasing order of their total description length. The class that has the smallest description length is given the highest priority because it is expected to contain the best set of rules. The total description length for a class is given by L_exception + g × L_model, where L_exception is the number of bits needed to encode the misclassified examples, L_model is the number of bits needed to encode the model, and g is a tuning parameter whose default value is 0.5. The tuning parameter depends on the number of redundant attributes present in the model. The value of the tuning parameter is small if the model contains many redundant attributes.

4.2.5 Characteristics of Rule-Based Classifiers

1. Rule-based classifiers have very similar characteristics as decision trees. The expressiveness of a rule set is almost equivalent to that of a decision tree because a decision tree can be represented by a set of mutually exclusive and exhaustive rules. Both rule-based and decision tree classifiers create rectilinear partitions of the attribute space and assign a class to each partition. However, a rule-based classifier can allow multiple rules to be triggered for a given instance, thus enabling the learning of more complex models than decision trees.

2. Likedecisiontrees,rule-basedclassifierscanhandlevaryingtypesofcategoricalandcontinuousattributesandcaneasilyworkinmulticlassclassificationscenarios.Rule-basedclassifiersaregenerallyusedtoproducedescriptivemodelsthatareeasiertointerpretbutgivecomparableperformancetothedecisiontreeclassifier.

3. Rule-basedclassifierscaneasilyhandlethepresenceofredundantattributesthatarehighlycorrelatedwithoneother.Thisisbecauseonceanattributehasbeenusedasaconjunctinaruleantecedent,theremainingredundantattributeswouldshowlittletonoFOIL'sinformationgainandwouldthusbeignored.

4. Sinceirrelevantattributesshowpoorinformationgain,rule-basedclassifierscanavoidselectingirrelevantattributesifthereareotherrelevantattributesthatshowbetterinformationgain.However,iftheproblemiscomplexandthereareinteractingattributesthatcancollectivelydistinguishbetweentheclassesbutindividuallyshowpoorinformationgain,itislikelyforanirrelevantattributetobeaccidentallyfavoredoverarelevantattributejustbyrandomchance.Hence,rule-basedclassifierscanshowpoorperformanceinthepresenceofinteractingattributes,whenthenumberofirrelevantattributesislarge.

5. Theclass-basedorderingstrategyadoptedbyRIPPER,whichemphasizesgivinghigherprioritytorareclasses,iswellsuitedforhandlingtrainingdatasetswithimbalancedclassdistributions.

6. Rule-basedclassifiersarenotwell-suitedforhandlingmissingvaluesinthetestset.Thisisbecausethepositionofrulesinarulesetfollowsacertainorderingstrategyandevenifatestinstanceiscoveredbymultiplerules,theycanassigndifferentclasslabelsdependingontheirpositionintheruleset.Hence,ifacertainruleinvolvesanattributethatismissinginatestinstance,itisdifficulttoignoretheruleandproceed

tothesubsequentrulesintheruleset,assuchastrategycanresultinincorrectclassassignments.

4.3 Nearest Neighbor Classifiers

The classification framework shown in Figure 3.3 involves a two-step process:

(1)aninductivestepforconstructingaclassificationmodelfromdata,and

(2)adeductivestepforapplyingthemodeltotestexamples.Decisiontreeandrule-basedclassifiersareexamplesofeagerlearnersbecausetheyaredesignedtolearnamodelthatmapstheinputattributestotheclasslabelassoonasthetrainingdatabecomesavailable.Anoppositestrategywouldbetodelaytheprocessofmodelingthetrainingdatauntilitisneededtoclassifythetestinstances.Techniquesthatemploythisstrategyareknownaslazylearners.AnexampleofalazylearneristheRoteclassifier,whichmemorizestheentiretrainingdataandperformsclassificationonlyiftheattributesofatestinstancematchoneofthetrainingexamplesexactly.Anobviousdrawbackofthisapproachisthatsometestinstancesmaynotbeclassifiedbecausetheydonotmatchanytrainingexample.

Onewaytomakethisapproachmoreflexibleistofindallthetrainingexamplesthatarerelativelysimilartotheattributesofthetestinstances.Theseexamples,whichareknownasnearestneighbors,canbeusedtodeterminetheclasslabelofthetestinstance.Thejustificationforusingnearestneighborsisbestexemplifiedbythefollowingsaying:“Ifitwalkslikeaduck,quackslikeaduck,andlookslikeaduck,thenit'sprobablyaduck.”Anearestneighborclassifierrepresentseachexampleasadatapointinad-dimensionalspace,wheredisthenumberofattributes.Givenatestinstance,wecomputeitsproximitytothetraininginstancesaccordingtooneoftheproximitymeasuresdescribedinSection2.4 onpage71.Thek-nearest

neighborsofagiventestinstancezrefertothektrainingexamplesthatareclosesttoz.

Figure4.6 illustratesthe1-,2-,and3-nearestneighborsofatestinstancelocatedatthecenterofeachcircle.Theinstanceisclassifiedbasedontheclasslabelsofitsneighbors.Inthecasewheretheneighborshavemorethanonelabel,thetestinstanceisassignedtothemajorityclassofitsnearestneighbors.InFigure4.6(a) ,the1-nearestneighboroftheinstanceisanegativeexample.Thereforetheinstanceisassignedtothenegativeclass.Ifthenumberofnearestneighborsisthree,asshowninFigure4.6(c) ,thentheneighborhoodcontainstwopositiveexamplesandonenegativeexample.Usingthemajorityvotingscheme,theinstanceisassignedtothepositiveclass.Inthecasewherethereisatiebetweentheclasses(seeFigure4.6(b) ),wemayrandomlychooseoneofthemtoclassifythedatapoint.

Figure4.6.The1-,2-,and3-nearestneighborsofaninstance.

Theprecedingdiscussionunderscorestheimportanceofchoosingtherightvaluefork.Ifkistoosmall,thenthenearestneighborclassifiermaybesusceptibletooverfittingduetonoise,i.e.,mislabeledexamplesinthetraining

data.Ontheotherhand,ifkistoolarge,thenearestneighborclassifiermaymisclassifythetestinstancebecauseitslistofnearestneighborsincludestrainingexamplesthatarelocatedfarawayfromitsneighborhood(seeFigure4.7 ).

Figure4.7.k-nearestneighborclassificationwithlargek.

4.3.1 Algorithm

A high-level summary of the nearest neighbor classification method is given in Algorithm 4.2. The algorithm computes the distance (or similarity) between each test instance z = (x′, y′) and all the training examples (x, y) ∈ D to determine its nearest neighbor list, Dz. Such computation can be costly if the number of training examples is large. However, efficient indexing techniques are available to reduce the computation needed to find the nearest neighbors of a test instance.

Algorithm 4.2 The k-nearest neighbor classifier.

Once the nearest neighbor list is obtained, the test instance is classified based on the majority class of its nearest neighbors:

Majority Voting:  y′ = argmax_v Σ_{(xi, yi) ∈ Dz} I(v = yi),     (4.5)

where v is a class label, yi is the class label for one of the nearest neighbors, and I(·) is an indicator function that returns the value 1 if its argument is true and 0 otherwise.

In the majority voting approach, every neighbor has the same impact on the classification. This makes the algorithm sensitive to the choice of k, as shown in Figure 4.6. One way to reduce the impact of k is to weight the influence of each nearest neighbor xi according to its distance: wi = 1/d(x′, xi)². As a result, training examples that are located far away from z have a weaker impact on the classification compared to those that are located close to z. Using the distance-weighted voting scheme, the class label can be determined as follows:

Distance-Weighted Voting:  y′ = argmax_v Σ_{(xi, yi) ∈ Dz} wi × I(v = yi).     (4.6)
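The following Python sketch implements Equations 4.5 and 4.6 directly with a brute-force neighbor search (no indexing); the two-dimensional training points at the bottom are made-up values used only for illustration.

    import math
    from collections import defaultdict

    def knn_predict(train, z, k, weighted=False):
        """Classify test point z from its k nearest training examples.
        train is a list of (x, y) pairs, where x is a tuple of numeric attributes."""
        dist = lambda a, b: math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
        # Keep the k training examples closest to z (Euclidean distance).
        neighbors = sorted(train, key=lambda xy: dist(xy[0], z))[:k]

        votes = defaultdict(float)
        for x, y in neighbors:
            d = dist(x, z)
            # Equation 4.6 weights each vote by 1/d^2; Equation 4.5 gives every neighbor weight 1.
            votes[y] += 1.0 / (d ** 2 + 1e-12) if weighted else 1.0
        return max(votes, key=votes.get)

    # Made-up two-dimensional training data with labels '+' and '-'.
    train = [((1.0, 1.0), "+"), ((1.2, 0.8), "+"), ((0.9, 1.1), "+"),
             ((3.0, 3.0), "-"), ((3.2, 2.9), "-"), ((2.8, 3.1), "-")]
    print(knn_predict(train, (1.1, 1.0), k=3))                 # majority voting
    print(knn_predict(train, (2.0, 2.0), k=3, weighted=True))  # distance-weighted voting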

4.3.2CharacteristicsofNearestNeighborClassifiers

1. Nearestneighborclassificationispartofamoregeneraltechniqueknownasinstance-basedlearning,whichdoesnotbuildaglobalmodel,butratherusesthetrainingexamplestomakepredictionsforatestinstance.(Thus,suchclassifiersareoftensaidtobe“modelfree.”)Suchalgorithmsrequireaproximitymeasuretodeterminethesimilarityordistancebetweeninstancesandaclassificationfunctionthatreturnsthepredictedclassofatestinstancebasedonitsproximitytootherinstances.

2. Althoughlazylearners,suchasnearestneighborclassifiers,donotrequiremodelbuilding,classifyingatestinstancecanbequiteexpensivebecauseweneedtocomputetheproximityvaluesindividuallybetweenthetestandtrainingexamples.Incontrast,eagerlearnersoftenspendthebulkoftheircomputingresourcesformodelbuilding.Onceamodelhasbeenbuilt,classifyingatestinstanceisextremelyfast.

3. Nearestneighborclassifiersmaketheirpredictionsbasedonlocalinformation.(Thisisequivalenttobuildingalocalmodelforeachtestinstance.)Bycontrast,decisiontreeandrule-basedclassifiersattempttofindaglobalmodelthatfitstheentireinputspace.Becausetheclassificationdecisionsaremadelocally,nearestneighborclassifiers(withsmallvaluesofk)arequitesusceptibletonoise.

4. Nearestneighborclassifierscanproducedecisionboundariesofarbitraryshape.Suchboundariesprovideamoreflexiblemodelrepresentationcomparedtodecisiontreeandrule-basedclassifiersthatareoftenconstrainedtorectilineardecisionboundaries.Thedecisionboundariesofnearestneighborclassifiersalsohavehigh

variabilitybecausetheydependonthecompositionoftrainingexamplesinthelocalneighborhood.Increasingthenumberofnearestneighborsmayreducesuchvariability.

5. Nearestneighborclassifiershavedifficultyhandlingmissingvaluesinboththetrainingandtestsetssinceproximitycomputationsnormallyrequirethepresenceofallattributes.Although,thesubsetofattributespresentintwoinstancescanbeusedtocomputeaproximity,suchanapproachmaynotproducegoodresultssincetheproximitymeasuresmaybedifferentforeachpairofinstancesandthushardtocompare.

6. Nearest neighbor classifiers can handle the presence of interacting attributes, i.e., attributes that have more predictive power taken in combination than by themselves, by using appropriate proximity measures that can incorporate the effects of multiple attributes together.

7. Thepresenceofirrelevantattributescandistortcommonlyusedproximitymeasures,especiallywhenthenumberofirrelevantattributesislarge.Furthermore,iftherearealargenumberofredundantattributesthatarehighlycorrelatedwitheachother,thentheproximitymeasurecanbeoverlybiasedtowardsuchattributes,resultinginimproperestimatesofdistance.Hence,thepresenceofirrelevantandredundantattributescanadverselyaffecttheperformanceofnearestneighborclassifiers.

8. Nearestneighborclassifierscanproducewrongpredictionsunlesstheappropriateproximitymeasureanddatapreprocessingstepsaretaken.Forexample,supposewewanttoclassifyagroupofpeoplebasedonattributessuchasheight(measuredinmeters)andweight(measuredinpounds).Theheightattributehasalowvariability,rangingfrom1.5mto1.85m,whereastheweightattributemayvaryfrom90lb.to250lb.Ifthescaleoftheattributesarenottakenintoconsideration,theproximitymeasuremaybedominatedbydifferencesintheweightsofaperson.
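A brief sketch of the scaling issue mentioned above: standardizing each attribute before computing distances (here with a z-score, one common choice among several) keeps the large-valued weight attribute from dominating. The height and weight numbers are invented for illustration.

    # Invented (height in meters, weight in pounds) records.
    people = [(1.55, 110.0), (1.80, 190.0), (1.60, 230.0)]

    def zscore_columns(rows):
        """Standardize each column to zero mean and unit standard deviation."""
        cols, stats = list(zip(*rows)), []
        for c in cols:
            mu = sum(c) / len(c)
            sd = (sum((v - mu) ** 2 for v in c) / len(c)) ** 0.5
            stats.append((mu, sd))
        return [tuple((v - mu) / sd for v, (mu, sd) in zip(r, stats)) for r in rows]

    # Before scaling, squared distances are dominated by differences in weight;
    # after scaling, height and weight contribute on comparable scales.
    print(zscore_columns(people))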

4.4NaïveBayesClassifierManyclassificationproblemsinvolveuncertainty.First,theobservedattributesandclasslabelsmaybeunreliableduetoimperfectionsinthemeasurementprocess,e.g.,duetothelimitedprecisenessofsensordevices.Second,thesetofattributeschosenforclassificationmaynotbefullyrepresentativeofthetargetclass,resultinginuncertainpredictions.Toillustratethis,considertheproblemofpredictingaperson'sriskforheartdiseasebasedonamodelthatusestheirdietandworkoutfrequencyasattributes.Althoughmostpeoplewhoeathealthilyandexerciseregularlyhavelesschanceofdevelopingheartdisease,theymaystillbeatriskduetootherlatentfactors,suchasheredity,excessivesmoking,andalcoholabuse,thatarenotcapturedinthemodel.Third,aclassificationmodellearnedoverafinitetrainingsetmaynotbeabletofullycapturethetruerelationshipsintheoveralldata,asdiscussedinthecontextofmodeloverfittinginthepreviouschapter.Finally,uncertaintyinpredictionsmayariseduetotheinherentrandomnatureofreal-worldsystems,suchasthoseencounteredinweatherforecastingproblems.

Inthepresenceofuncertainty,thereisaneedtonotonlymakepredictionsofclasslabelsbutalsoprovideameasureofconfidenceassociatedwitheveryprediction.Probabilitytheoryoffersasystematicwayforquantifyingandmanipulatinguncertaintyindata,andthus,isanappealingframeworkforassessingtheconfidenceofpredictions.Classificationmodelsthatmakeuseofprobabilitytheorytorepresenttherelationshipbetweenattributesandclasslabelsareknownasprobabilisticclassificationmodels.Inthissection,wepresentthenaïveBayesclassifier,whichisoneofthesimplestandmostwidely-usedprobabilisticclassificationmodels.

4.4.1BasicsofProbabilityTheory

BeforewediscusshowthenaïveBayesclassifierworks,wefirstintroducesomebasicsofprobabilitytheorythatwillbeusefulinunderstandingtheprobabilisticclassificationmodelspresentedinthischapter.Thisinvolvesdefiningthenotionofprobabilityandintroducingsomecommonapproachesformanipulatingprobabilityvalues.

Consider a variable X, which can take any discrete value from the set {x1, …, xk}. When we have multiple observations of that variable, such as in a data set where the variable describes some characteristic of data objects, then we can compute the relative frequency with which each value occurs. Specifically, suppose that X has the value xi for ni data objects. The relative frequency with which we observe the event X = xi is then ni/N, where N denotes the total number of occurrences (N = Σ_{i=1}^{k} ni). These relative frequencies characterize the uncertainty that we have with respect to what value X may take for an unseen observation and motivate the notion of probability.

More formally, the probability of an event e, e.g., P(X = xi), measures how likely it is for the event e to occur. The most traditional view of probability is based on the relative frequency of events (frequentist), while the Bayesian viewpoint (described later) takes a more flexible view of probabilities. In either case, a probability is always a number between 0 and 1. Further, the sum of probability values of all possible events, e.g., outcomes of a variable X, is equal to 1. Variables that have probabilities associated with each possible outcome (value) are known as random variables.

Now, let us consider two random variables, X and Y, that can each take k discrete values. Let nij be the number of times we observe X = xi and Y = yj, out of a total number of N occurrences. The joint probability of observing X = xi and Y = yj together can be estimated as

P(X = xi, Y = yj) = nij / N.     (4.7)

(This is an estimate since we typically have only a finite subset of all possible observations.) Joint probabilities can be used to answer questions such as "what is the probability that there will be a surprise quiz today and I will be late for the class." Joint probabilities are symmetric, i.e., P(X = x, Y = y) = P(Y = y, X = x). For joint probabilities, it is useful to consider their sum with respect to one of the random variables, as described in the following equation:

Σ_{j=1}^{k} P(X = xi, Y = yj) = Σ_{j=1}^{k} nij / N = ni / N = P(X = xi),     (4.8)

where ni is the total number of times we observe X = xi irrespective of the value of Y. Notice that ni/N is essentially the probability of observing X = xi. Hence, by summing out the joint probabilities with respect to a random variable Y, we obtain the probability of observing the remaining variable X. This operation is called marginalization and the probability value P(X = xi) obtained by marginalizing out Y is sometimes called the marginal probability of X. As we will see later, joint probability and marginal probability form the basic building blocks of a number of probabilistic classification models discussed in this chapter.

Notice that in the previous discussions, we used P(X = xi) to denote the probability of a particular outcome of a random variable X. This notation can easily become cumbersome when a number of random variables are involved. Hence, in the remainder of this section, we will use P(X) to denote the probability of any generic outcome of the random variable X, while P(xi) will be used to represent the probability of the specific outcome xi.
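To tie Equations 4.7 and 4.8 to something executable, here is a small Python sketch that estimates joint and marginal probabilities from co-occurrence counts; the count table is an assumed toy example.

    # Assumed counts n_ij of jointly observing X = x_i and Y = y_j.
    counts = {("x1", "y1"): 10, ("x1", "y2"): 30,
              ("x2", "y1"): 40, ("x2", "y2"): 20}
    N = sum(counts.values())

    # Joint probabilities, Equation 4.7: P(X = x_i, Y = y_j) = n_ij / N.
    joint = {xy: n / N for xy, n in counts.items()}

    # Marginal probability of X, Equation 4.8: sum the joint over all values of Y.
    marginal_x = {}
    for (x, _), p in joint.items():
        marginal_x[x] = marginal_x.get(x, 0.0) + p

    print(joint)        # e.g., P(x1, y1) = 0.1
    print(marginal_x)   # P(x1) = 0.4, P(x2) = 0.6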

Bayes Theorem

Suppose you have invited two of your friends Alex and Martha to a dinner party. You know that Alex attends 40% of the parties he is invited to. Further, if Alex is going to a party, there is an 80% chance of Martha coming along. On the other hand, if Alex is not going to the party, the chance of Martha coming to the party is reduced to 30%. If Martha has responded that she will be coming to your party, what is the probability that Alex will also be coming?

Bayestheorempresentsthestatisticalprincipleforansweringquestionslikethepreviousone,whereevidencefrommultiplesourceshastobecombinedwithpriorbeliefstoarriveatpredictions.Bayestheoremcanbebrieflydescribedasfollows.

Let P(Y|X) denote the conditional probability of observing the random variable Y whenever the random variable X takes a particular value. P(Y|X) is often read as the probability of observing Y conditioned on the outcome of X. Conditional probabilities can be used for answering questions such as "given that it is going to rain today, what will be the probability that I will go to the class." Conditional probabilities of X and Y are related to their joint probability in the following way:

P(Y|X) = P(X, Y) / P(X),  which implies     (4.9)

P(X, Y) = P(Y|X) × P(X) = P(X|Y) × P(Y).     (4.10)

Rearranging the last two expressions in Equation 4.10 leads to Equation 4.11, which is known as Bayes theorem:

P(Y|X) = P(X|Y) P(Y) / P(X).     (4.11)

Bayes theorem provides a relationship between the conditional probabilities P(Y|X) and P(X|Y). Note that the denominator in Equation 4.11 involves the marginal probability of X, which can also be represented as

P(X) = Σ_{i=1}^{k} P(X, yi) = Σ_{i=1}^{k} P(X|yi) × P(yi).

Using the previous expression for P(X), we can obtain the following equation for P(Y|X) solely in terms of P(X|Y) and P(Y):

P(Y|X) = P(X|Y) P(Y) / ( Σ_{i=1}^{k} P(X|yi) P(yi) ).     (4.12)

Example 4.4. [Bayes Theorem] Bayes theorem can be used to solve a number of inferential questions about random variables. For example, consider the problem stated at the beginning on inferring whether Alex will come to the party. Let P(A = 1) denote the probability of Alex going to a party, while P(A = 0) denotes the probability of him not going to a party. We know that

P(A = 1) = 0.4,  and  P(A = 0) = 1 − P(A = 1) = 0.6.

Further, let P(M = 1|A) denote the conditional probability of Martha going to a party conditioned on whether Alex is going to the party. P(M = 1|A) takes the following values:

P(M = 1|A = 1) = 0.8,  and  P(M = 1|A = 0) = 0.3.

We can use the above values of P(M|A) and P(A) to compute the probability of Alex going to the party given Martha is going to the party, P(A = 1|M = 1), as follows:

P(A = 1|M = 1) = P(M = 1|A = 1) P(A = 1) / ( P(M = 1|A = 0) P(A = 0) + P(M = 1|A = 1) P(A = 1) ) = (0.8 × 0.4) / (0.3 × 0.6 + 0.8 × 0.4) = 0.64.     (4.13)

Notice that even though the prior probability P(A) of Alex going to the party is low, the observation that Martha is going, M = 1, affects the conditional probability P(A = 1|M = 1). This shows the value of Bayes theorem in combining prior assumptions with observed outcomes to make predictions. Since P(A = 1|M = 1) > 0.5, it is more likely for Alex to join if Martha is going to the party.
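The arithmetic of Example 4.4 can be checked with a few lines of Python; the sketch below simply applies Equation 4.12 to the probabilities stated in the example.

    # Probabilities given in Example 4.4.
    p_a1 = 0.4                 # P(A = 1): Alex attends a party
    p_a0 = 1.0 - p_a1          # P(A = 0)
    p_m1_given_a1 = 0.8        # P(M = 1 | A = 1)
    p_m1_given_a0 = 0.3        # P(M = 1 | A = 0)

    # Bayes theorem (Equation 4.12): P(A = 1 | M = 1).
    p_m1 = p_m1_given_a1 * p_a1 + p_m1_given_a0 * p_a0   # marginal P(M = 1)
    p_a1_given_m1 = p_m1_given_a1 * p_a1 / p_m1
    print(p_a1_given_m1)   # 0.64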

Using Bayes Theorem for Classification

For the purpose of classification, we are interested in computing the probability of observing a class label y for a data instance given its set of attribute values x. This can be represented as P(y|x), which is known as the posterior probability of the target class. Using the Bayes theorem, we can represent the posterior probability as

P(y|x) = P(x|y) P(y) / P(x).     (4.14)

Note that the numerator of the previous equation involves two terms, P(x|y) and P(y), both of which contribute to the posterior probability P(y|x). We describe both of these terms in the following.

The first term P(x|y) is known as the class-conditional probability of the attributes given the class label. P(x|y) measures the likelihood of observing x from the distribution of instances belonging to y. If x indeed belongs to class y, then we should expect P(x|y) to be high. From this point of view, the use of class-conditional probabilities attempts to capture the process from which the data instances were generated. Because of this interpretation, probabilistic classification models that involve computing class-conditional probabilities are

knownasgenerativeclassificationmodels.Apartfromtheiruseincomputingposteriorprobabilitiesandmakingpredictions,class-conditionalprobabilitiesalsoprovideinsightsabouttheunderlyingmechanismbehindthegenerationofattributevalues.

The second term in the numerator of Equation 4.14 is the prior probability P(y). The prior probability captures our prior beliefs about the distribution of class labels, independent of the observed attribute values. (This is the Bayesian viewpoint.) For example, we may have a prior belief that the likelihood of any person to suffer from a heart disease is α, irrespective of their diagnosis reports. The prior probability can either be obtained using expert knowledge, or inferred from the historical distribution of class labels.

The denominator in Equation 4.14 involves the probability of evidence, P(x). Note that this term does not depend on the class label and thus can be treated as a normalization constant in the computation of posterior probabilities. Further, the value of P(x) can be calculated as P(x) = Σ_i P(x|yi) P(yi).

Bayes theorem provides a convenient way to combine our prior beliefs with the likelihood of obtaining the observed attribute values. During the training phase, we are required to learn the parameters for P(y) and P(x|y). The prior probability P(y) can be easily estimated from the training set by computing the fraction of training instances that belong to each class. To compute the class-conditional probabilities, one approach is to consider the fraction of training instances of a given class for every possible combination of attribute values. For example, suppose that there are two attributes X1 and X2 that can each take a discrete value from c1 to ck. Let n0 denote the number of training instances belonging to class 0, out of which nij0 training instances have X1 = ci and X2 = cj. The class-conditional probability can then be given as

P(X1 = ci, X2 = cj | Y = 0) = nij0 / n0.

This approach can easily become computationally prohibitive as the number of attributes increases, due to the exponential growth in the number of attribute value combinations. For example, if every attribute can take k discrete values, then the number of attribute value combinations is equal to k^d, where d is the number of attributes. The large number of attribute value combinations can also result in poor estimates of class-conditional probabilities, since every combination will have fewer training instances when the size of the training set is small.

In the following, we present the naïve Bayes classifier, which makes a simplifying assumption about the class-conditional probabilities, known as the naïve Bayes assumption. The use of this assumption significantly helps in obtaining reliable estimates of class-conditional probabilities, even when the number of attributes is large.

4.4.2 Naïve Bayes Assumption

The naïve Bayes classifier assumes that the class-conditional probability of all attributes x can be factored as a product of class-conditional probabilities of every attribute xi, as described in the following equation:

P(x|y) = ∏_{i=1}^{d} P(xi|y),     (4.15)

where every data instance x consists of d attributes, {x1, x2, …, xd}. The basic assumption behind the previous equation is that the attribute values xi are conditionally independent of each other, given the class label y. This means that the attributes are influenced only by the target class and if we

knowtheclasslabel,thenwecanconsidertheattributestobeindependentofeachother.Theconceptofconditionalindependencecanbeformallystatedasfollows.

Conditional Independence

Let X1, X2, and Y denote three sets of random variables. The variables in X1 are said to be conditionally independent of X2, given Y, if the following condition holds:

P(X1|X2, Y) = P(X1|Y).     (4.16)

This means that conditioned on Y, the distribution of X1 is not influenced by the outcomes of X2, and hence X1 is conditionally independent of X2. To illustrate the notion of conditional independence, consider the relationship between a person's arm length (X1) and his or her reading skills (X2). One might observe that people with longer arms tend to have higher levels of reading skills, and thus consider X1 and X2 to be related to each other. However, this relationship can be explained by another factor, which is the age of the person (Y). A young child tends to have short arms and lacks the reading skills of an adult. If the age of a person is fixed, then the observed relationship between arm length and reading skills disappears. Thus, we can conclude that arm length and reading skills are not directly related to each other and are conditionally independent when the age variable is fixed.

Another way of describing conditional independence is to consider the joint conditional probability, P(X1, X2|Y), as follows:

P(X1, X2|Y) = P(X1, X2, Y) / P(Y)
            = [ P(X1, X2, Y) / P(X2, Y) ] × [ P(X2, Y) / P(Y) ]
            = P(X1|X2, Y) × P(X2|Y)
            = P(X1|Y) × P(X2|Y),     (4.17)

where Equation 4.16 was used to obtain the last line of Equation 4.17. The previous description of conditional independence is quite useful from an operational perspective. It states that the joint conditional probability of X1 and X2 given Y can be factored as the product of conditional probabilities of X1 and X2 considered separately. This forms the basis of the naïve Bayes assumption stated in Equation 4.15.
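A tiny numeric sketch of Equation 4.17 in Python: the joint distribution below is constructed, by assumption, so that X1 and X2 are conditionally independent given Y, and the code verifies that P(X1, X2 | Y) equals P(X1 | Y) × P(X2 | Y) for one assignment of values.

    # Construct P(X1, X2, Y) so that X1 and X2 are independent given Y.
    p_y = {0: 0.5, 1: 0.5}
    p_x1_given_y = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
    p_x2_given_y = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.4, 1: 0.6}}

    joint = {(x1, x2, y): p_y[y] * p_x1_given_y[y][x1] * p_x2_given_y[y][x2]
             for x1 in (0, 1) for x2 in (0, 1) for y in (0, 1)}

    # Check Equation 4.17 for one assignment: P(X1=1, X2=1 | Y=0) = P(X1=1|Y=0) * P(X2=1|Y=0).
    lhs = joint[(1, 1, 0)] / p_y[0]
    rhs = p_x1_given_y[0][1] * p_x2_given_y[0][1]
    print(lhs, rhs)   # both 0.03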

How a Naïve Bayes Classifier Works

Using the naïve Bayes assumption, we only need to estimate the conditional probability of each xi given Y separately, instead of computing the class-conditional probability for every combination of attribute values. For example, if ni0 and nj0 denote the number of training instances belonging to class 0 with X1 = ci and X2 = cj, respectively, then the class-conditional probability can be estimated as

P(X1 = ci, X2 = cj | Y = 0) = (ni0 / n0) × (nj0 / n0).

In the previous equation, we only need to count the number of training instances for every one of the k values of an attribute X, irrespective of the values of other attributes. Hence, the number of parameters needed to learn class-conditional probabilities is reduced from k^d to dk. This greatly simplifies the expression for the class-conditional probability and makes it more amenable to learning parameters and making predictions, even in high-dimensional settings.

The naïve Bayes classifier computes the posterior probability for a test instance x by using the following equation:

P(y|x) = P(y) ∏_{i=1}^{d} P(xi|y) / P(x).     (4.18)

SinceP( )isfixedforeveryyandonlyactsasanormalizingconstanttoensurethat ,wecanwrite

Hence,itissufficienttochoosetheclassthatmaximizes .

One of the useful properties of the naïve Bayes classifier is that it can easily work with incomplete information about data instances, when only a subset of attributes is observed at every instance. For example, if we only observe p out of d attributes at a data instance, then we can still compute P(y) ∏_{i=1}^{p} P(x_i|y) using those p attributes and choose the class with the maximum value. The naïve Bayes classifier can thus naturally handle missing values in test instances. In fact, in the extreme case where no attributes are observed, we can still use the prior probability P(y) as an estimate of the posterior probability. As we observe more attributes, we can keep refining the posterior probability to better reflect the likelihood of observing the data instance.

In the next two subsections, we describe several approaches for estimating the conditional probabilities P(x_i|y) for categorical and continuous attributes from the training set.

Estimating Conditional Probabilities for Categorical Attributes
For a categorical attribute X_i, the conditional probability P(X_i = c|y) is estimated according to the fraction of training instances in class y where X_i takes on a particular categorical value c:

P(X_i = c|y) = n_c / n,

where n is the number of training instances belonging to class y, out of which n_c instances have X_i = c. For example, in the training set given in Figure 4.8, seven people have the class label Defaulted Borrower = No, out of which three people have Home Owner = Yes while the remaining four have Home Owner = No. As a result, the conditional probability P(Home Owner = Yes|Defaulted Borrower = No) is equal to 3/7. Similarly, the conditional probability for defaulted borrowers with Marital Status = Single is given by P(Marital Status = Single|Defaulted Borrower = Yes) = 2/3. Note that the sum of conditional probabilities over all possible outcomes of X_i is equal to one, i.e., ∑_c P(X_i = c|y) = 1.

Figure 4.8. Training set for predicting the loan default problem.
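The counting involved in these estimates is simple to implement. The following minimal sketch (in Python) computes P(X_i = c|y) = n_c/n from a handful of records; the records and the function name are illustrative stand-ins of ours, not the actual table of Figure 4.8.

```python
from collections import Counter, defaultdict

# Hypothetical training records in the spirit of Figure 4.8:
# (Home Owner, Marital Status, Defaulted Borrower)
train = [
    ("Yes", "Single",   "No"),  ("No", "Married",  "No"),
    ("No",  "Single",   "No"),  ("Yes", "Married", "No"),
    ("No",  "Divorced", "Yes"), ("No", "Married",  "No"),
    ("Yes", "Divorced", "No"),  ("No", "Single",   "Yes"),
    ("No",  "Married",  "No"),  ("No", "Single",   "Yes"),
]

def categorical_cond_probs(records, attr_index, label_index=-1):
    """Estimate P(X_attr = c | y) = n_c / n for every class y and value c."""
    class_counts = Counter(r[label_index] for r in records)
    value_counts = defaultdict(Counter)            # value_counts[y][c] = n_c
    for r in records:
        value_counts[r[label_index]][r[attr_index]] += 1
    return {y: {c: value_counts[y][c] / n for c in value_counts[y]}
            for y, n in class_counts.items()}

# e.g., P(Home Owner = Yes | Defaulted Borrower = No) = 3/7 for these records
print(categorical_cond_probs(train, attr_index=0))
```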

Estimating Conditional Probabilities for Continuous Attributes
There are two ways to estimate the class-conditional probabilities for continuous attributes:

1. We can discretize each continuous attribute and then replace the continuous values with their corresponding discrete intervals. This approach transforms the continuous attributes into ordinal attributes, and the simple method described previously for computing the conditional probabilities of categorical attributes can be employed. Note that the estimation error of this method depends on the discretization strategy (as described in Section 2.3.6 on page 63), as well as the number of discrete intervals. If the number of intervals is too large, every interval may have an insufficient number of training instances to provide a reliable estimate of P(X_i|Y). On the other hand, if the number of intervals is too small, then the discretization process may lose information about the true distribution of continuous values, and thus result in poor predictions.

2. We can assume a certain form of probability distribution for the continuous variable and estimate the parameters of the distribution using the training data. For example, we can use a Gaussian distribution to represent the conditional probability of continuous attributes. The Gaussian distribution is characterized by two parameters, the mean, μ, and the variance, σ^2. For each class y_j, the class-conditional probability for attribute X_i is

P(X_i = x_i|Y = y_j) = (1/(√(2π) σ_{ij})) exp[−(x_i − μ_{ij})^2 / (2σ_{ij}^2)].    (4.19)

The parameter μ_{ij} can be estimated using the sample mean of X_i (x̄) over all training instances that belong to y_j. Similarly, σ_{ij}^2 can be estimated from the sample variance (s^2) of such training instances. For example, consider the annual income attribute shown in Figure 4.8. The sample mean and variance for this attribute with respect to the class No are

x̄ = (125 + 100 + 70 + … + 75)/7 = 110,
s^2 = [(125 − 110)^2 + (100 − 110)^2 + … + (75 − 110)^2]/6 = 2975,
s = √2975 = 54.54.

Given a test instance with taxable income equal to $120K, we can use the following value as its conditional probability given class No:

P(Income = 120|No) = (1/(√(2π) × 54.54)) exp[−(120 − 110)^2/(2 × 2975)] = 0.0072.
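As a quick check of the arithmetic above, the Gaussian density of Equation 4.19 can be evaluated directly; this short sketch uses the sample statistics just computed (the function name is our own choice, not part of the text).

```python
import math

def gaussian_cond_prob(x, mean, var):
    """P(X_i = x | Y = y_j) under the Gaussian assumption of Equation 4.19."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

# Annual Income given class No, using the sample mean and variance derived above.
mean_no, var_no = 110.0, 2975.0        # income measured in thousands of dollars
print(round(gaussian_cond_prob(120.0, mean_no, var_no), 4))   # ~0.0072
```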

Example 4.5. [Naïve Bayes Classifier] Consider the data set shown in Figure 4.9(a), where the target class is Defaulted Borrower, which can take two values Yes and No. We can compute the class-conditional probability for each categorical attribute and the sample mean and variance for the continuous attribute, as summarized in Figure 4.9(b).

We are interested in predicting the class label of a test instance x = (Home Owner = No, Marital Status = Married, Annual Income = $120K). To do this, we first compute the prior probabilities by counting the number of training instances belonging to every class. We thus obtain P(Yes) = 0.3 and P(No) = 0.7. Next, we can compute the class-conditional probabilities as follows:

P(x|No) = P(Home Owner = No|No) × P(Marital Status = Married|No) × P(Annual Income = $120K|No) = 4/7 × 4/7 × 0.0072 = 0.0024,
P(x|Yes) = P(Home Owner = No|Yes) × P(Marital Status = Married|Yes) × P(Annual Income = $120K|Yes) = 1 × 0 × 1.2 × 10^{−9} = 0.

Figure 4.9. The naïve Bayes classifier for the loan classification problem.

Notice that the class-conditional probability for class Yes has become 0 because there are no instances belonging to class Yes with Marital Status = Married in the training set. Using these class-conditional probabilities, we can estimate the posterior probabilities as

P(No|x) = (0.7 × 0.0024)/P(x) = 0.0016 α,
P(Yes|x) = (0.3 × 0)/P(x) = 0,

where α = 1/P(x) is a normalizing constant. Since P(No|x) > P(Yes|x), the instance is classified as No.

Handling Zero Conditional Probabilities
The preceding example illustrates a potential problem with using the naïve Bayes assumption in estimating class-conditional probabilities. If the conditional probability for any of the attributes is zero, then the entire expression for the class-conditional probability becomes zero. Note that zero conditional probabilities arise when the number of training instances is small and the number of possible values of an attribute is large. In such cases, it may happen that a combination of attribute values and class labels is never observed, resulting in a zero conditional probability.

In a more extreme case, if the training instances do not cover some combinations of attribute values and class labels, then we may not be able to even classify some of the test instances. For example, if P(Marital Status = Divorced|No) is zero instead of 1/7, then a data instance with attribute set x = (Home Owner = Yes, Marital Status = Divorced, Income = $120K) has the following class-conditional probabilities:

P(x|No) = 3/7 × 0 × 0.0072 = 0,
P(x|Yes) = 0 × 1/3 × 1.2 × 10^{−9} = 0.

Since both of the class-conditional probabilities are 0, the naïve Bayes classifier will not be able to classify the instance. To address this problem, it is important to adjust the conditional probability estimates so that they are not as brittle as simply using fractions of training instances. This can be achieved by using the following alternate estimates of conditional probability:

Laplace estimate: P(X_i = c|y) = (n_c + 1)/(n + v),    (4.20)

m-estimate: P(X_i = c|y) = (n_c + m p)/(n + m),    (4.21)

where n is the number of training instances belonging to class y, n_c is the number of training instances with X_i = c and Y = y, v is the total number of attribute values that X_i can take, p is some initial estimate of P(X_i = c|y) that is known a priori, and m is a hyper-parameter that indicates our confidence in using p when the fraction of training instances is too brittle. Note that even if n_c = 0, both the Laplace and m-estimates provide non-zero values of conditional probabilities. Hence, they avoid the problem of vanishing class-conditional probabilities and thus generally provide more robust estimates of posterior probabilities.
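For concreteness, here is a minimal sketch of Equations 4.20 and 4.21; the numbers plugged in (a class with n = 3 instances, an attribute with v = 3 values, a prior guess p = 1/3 with m = 3) are illustrative choices of ours rather than values prescribed by the text.

```python
def laplace_estimate(n_c, n, v):
    """Laplace estimate of P(X_i = c | y), Equation 4.20."""
    return (n_c + 1) / (n + v)

def m_estimate(n_c, n, p, m):
    """m-estimate of P(X_i = c | y), Equation 4.21."""
    return (n_c + m * p) / (n + m)

# Even when a value is never seen with a class (n_c = 0), the estimates stay non-zero.
print(laplace_estimate(n_c=0, n=3, v=3))        # 1/6 instead of 0
print(m_estimate(n_c=0, n=3, p=1/3, m=3))       # 1/6 with prior p = 1/3 and m = 3
```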

Characteristics of Naïve Bayes Classifiers

1. Naïve Bayes classifiers are probabilistic classification models that are able to quantify the uncertainty in predictions by providing posterior probability estimates. They are also generative classification models as they treat the target class as the causative factor for generating the data instances. Hence, apart from computing posterior probabilities, naïve Bayes classifiers also attempt to capture the underlying mechanism behind the generation of data instances belonging to every class. They are thus useful for gaining predictive as well as descriptive insights.

2. By using the naïve Bayes assumption, they can easily compute class-conditional probabilities even in high-dimensional settings, provided that the attributes are conditionally independent of each other given the class labels. This property makes the naïve Bayes classifier a simple and effective classification technique that is commonly used in diverse application problems, such as text classification.

3. Naïve Bayes classifiers are robust to isolated noise points because such points are not able to significantly impact the conditional probability estimates, as they are often averaged out during training.

4. Naïve Bayes classifiers can handle missing values in the training set by ignoring the missing values of every attribute while computing its conditional probability estimates. Further, naïve Bayes classifiers can effectively handle missing values in a test instance, by using only the non-missing attribute values while computing posterior probabilities. However, if the frequency of missing values for a particular attribute value depends on the class label, then this approach will not accurately estimate posterior probabilities.

5. Naïve Bayes classifiers are robust to irrelevant attributes. If X_i is an irrelevant attribute, then P(X_i|Y) becomes almost uniformly distributed for every class y. The class-conditional probabilities for every class thus receive similar contributions of P(X_i|Y), resulting in negligible impact on the posterior probability estimates.

6. Correlated attributes can degrade the performance of naïve Bayes classifiers because the naïve Bayes assumption of conditional independence no longer holds for such attributes. For example, consider the following probabilities:

P(A = 0|Y = 0) = 0.4,  P(A = 1|Y = 0) = 0.6,
P(A = 0|Y = 1) = 0.6,  P(A = 1|Y = 1) = 0.4,

where A is a binary attribute and Y is a binary class variable. Suppose there is another binary attribute B that is perfectly correlated with A when Y = 0, but is independent of A when Y = 1. For simplicity, assume that the conditional probabilities for B are the same as for A. Given an instance with attributes A = 0, B = 0, and assuming conditional independence, we can compute its posterior probabilities as follows:

P(Y = 0|A = 0, B = 0) = P(A = 0|Y = 0) P(B = 0|Y = 0) P(Y = 0)/P(A = 0, B = 0) = 0.16 × P(Y = 0)/P(A = 0, B = 0),
P(Y = 1|A = 0, B = 0) = P(A = 0|Y = 1) P(B = 0|Y = 1) P(Y = 1)/P(A = 0, B = 0) = 0.36 × P(Y = 1)/P(A = 0, B = 0).

If P(Y = 0) = P(Y = 1), then the naïve Bayes classifier would assign the instance to class 1. However, the truth is,

P(A = 0, B = 0|Y = 0) = P(A = 0|Y = 0) = 0.4,

because A and B are perfectly correlated when Y = 0. As a result, the posterior probability for Y = 0 is

P(Y = 0|A = 0, B = 0) = P(A = 0, B = 0|Y = 0) P(Y = 0)/P(A = 0, B = 0) = 0.4 × P(Y = 0)/P(A = 0, B = 0),

which is larger than that for Y = 1. The instance should have been classified as class 0 (a small numeric sketch of this effect follows the list). Hence, the naïve Bayes classifier can produce incorrect results when the attributes are not conditionally independent given the class labels. Naïve Bayes classifiers are thus not well-suited for handling redundant or interacting attributes.
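The following few lines, a sketch of ours assuming equal priors P(Y = 0) = P(Y = 1) = 0.5, reproduce the comparison above: the independence assumption favors class 1 while the true joint probabilities favor class 0.

```python
# Naive (independence) vs. true unnormalized posteriors for the correlated-attribute
# example, assuming equal priors P(Y=0) = P(Y=1) = 0.5.
p_a0 = {0: 0.4, 1: 0.6}          # P(A=0|Y=y); B has the same conditionals as A
prior = {0: 0.5, 1: 0.5}

naive = {y: p_a0[y] * p_a0[y] * prior[y] for y in (0, 1)}     # P(A=0|y) P(B=0|y) P(y)
true  = {0: p_a0[0] * prior[0],                               # B copies A when Y=0
         1: p_a0[1] * p_a0[1] * prior[1]}                     # B independent when Y=1

print(naive)   # {0: 0.08, 1: 0.18}  -> naive Bayes picks class 1
print(true)    # {0: 0.20, 1: 0.18}  -> the correct choice is class 0
```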

4.5 Bayesian Networks
The conditional independence assumption made by naïve Bayes classifiers may seem too rigid, especially for classification problems where the attributes are dependent on each other even after conditioning on the class labels. We thus need an approach to relax the naïve Bayes assumption so that we can capture more generic representations of conditional independence among attributes.

In this section, we present a flexible framework for modeling probabilistic relationships between attributes and class labels, known as Bayesian networks. By building on concepts from probability theory and graph theory, Bayesian networks are able to capture more generic forms of conditional independence using simple schematic representations. They also provide the necessary computational structure to perform inferences over random variables in an efficient way. In the following, we first describe the basic representation of a Bayesian network, and then discuss methods for performing inference and learning model parameters in the context of classification.

4.5.1 Graphical Representation

Bayesian networks belong to a broader family of models for capturing probabilistic relationships among random variables, known as probabilistic graphical models. The basic concept behind these models is to use graphical representations where the nodes of the graph correspond to random variables and the edges between the nodes express probabilistic relationships. Figures 4.10(a) and 4.10(b) show examples of probabilistic graphical models using directed edges (with arrows) and undirected edges (without arrows), respectively. Directed graphical models are also known as Bayesian networks, while undirected graphical models are known as Markov random fields. The two approaches use different semantics for expressing relationships among random variables and are thus useful in different contexts. In the following, we briefly describe Bayesian networks that are useful in the context of classification.

A Bayesian network (also referred to as a belief network) involves directed edges between nodes, where every edge represents a direction of influence among random variables. For example, Figure 4.10(a) shows a Bayesian network where variable C depends upon the values of variables A and B, as indicated by the arrows pointing toward C from A and B. Consequently, the variable C influences the values of variables D and E. Every edge in a Bayesian network thus encodes a dependence relationship between random variables with a particular directionality.

Figure 4.10. Illustrations of two basic types of graphical models.

Bayesian networks are directed acyclic graphs (DAG) because they do not contain any directed cycles such that the influence of a node loops back to the same node. Figure 4.11 shows some examples of Bayesian networks that capture different types of dependence structures among random variables. In a directed acyclic graph, if there is a directed edge from X to Y, then X is called the parent of Y and Y is called the child of X. Note that a node can have multiple parents in a Bayesian network, e.g., node D has two parent nodes, B and C, in Figure 4.11(a). Furthermore, if there is a directed path in the network from X to Z, then X is an ancestor of Z, while Z is a descendant of X. For example, in the diagram shown in Figure 4.11(b), A is a descendant of D and D is an ancestor of B. Note that there can be multiple directed paths between two nodes of a directed acyclic graph, as is the case for nodes A and D in Figure 4.11(a).

Figure 4.11. Examples of Bayesian networks.

Conditional Independence
An important property of a Bayesian network is its ability to represent varying forms of conditional independence among random variables. There are several ways of describing the conditional independence assumptions captured by Bayesian networks. One of the most generic ways of expressing conditional independence is the concept of d-separation, which can be used to determine if any two sets of nodes A and B are conditionally independent given another set of nodes C. Another useful concept is that of the Markov blanket of a node Y, which denotes the minimal set of nodes X that makes Y independent of the other nodes in the graph, when conditioned on X. (See Bibliographic Notes for more details on d-separation and Markov blankets.) However, for the purpose of classification, it is sufficient to describe a simpler expression of conditional independence in Bayesian networks, known as the local Markov property.

Property 1 (Local Markov Property). A node in a Bayesian network is conditionally independent of its non-descendants, if its parents are known.

To illustrate the local Markov property, consider the Bayesian network shown in Figure 4.11(b). We can state that A is conditionally independent of both B and D given C, because C is the parent of A and nodes B and D are non-descendants of A. The local Markov property helps in interpreting parent-child relationships in Bayesian networks as representations of conditional probabilities. Since a node is conditionally independent of its non-descendants given its parents, the conditional independence assumptions imposed by a Bayesian network are often sparse in structure. Nonetheless, Bayesian networks are able to express a richer class of conditional independence statements among attributes and class labels than the naïve Bayes classifier. In fact, the naïve Bayes classifier can be viewed as a special type of Bayesian network, where the target class Y is at the root of a tree and every attribute X_i is connected to the root node by a directed edge, as shown in Figure 4.12(a).

Figure 4.12. Comparing the graphical representation of a naïve Bayes classifier with that of a generic Bayesian network.

Note that in a naïve Bayes classifier, every directed edge points from the target class to the observed attributes, suggesting that the class label is a factor behind the generation of attributes. Inferring the class label can thus be viewed as diagnosing the root cause behind the observed attributes. On the other hand, Bayesian networks provide a more generic structure of probabilistic relationships, since the target class is not required to be at the root of a tree but can appear anywhere in the graph, as shown in Figure 4.12(b). In this diagram, inferring Y not only helps in diagnosing the factors influencing X_3 and X_4, but also helps in predicting the influence of X_1 and X_2.

Joint Probability
The local Markov property can be used to succinctly express the joint probability of the set of random variables involved in a Bayesian network. To realize this, let us first consider a Bayesian network consisting of d nodes, X_1 to X_d, where the nodes have been numbered in such a way that X_i is an ancestor of X_j only if i < j. The joint probability of X = {X_1, …, X_d} can be generically factorized using the chain rule of probability as

P(X) = P(X_1) P(X_2|X_1) P(X_3|X_1, X_2) … P(X_d|X_1, …, X_{d−1}) = ∏_{i=1}^{d} P(X_i|X_1, …, X_{i−1}).    (4.22)

By the way we have constructed the graph, note that the set {X_1, …, X_{i−1}} contains only non-descendants of X_i. Hence, by using the local Markov property, we can write P(X_i|X_1, …, X_{i−1}) as P(X_i|pa(X_i)), where pa(X_i) denotes the parents of X_i. The joint probability can then be represented as

P(X) = ∏_{i=1}^{d} P(X_i|pa(X_i)).    (4.23)

It is thus sufficient to represent the probability of every node X_i in terms of its parent nodes, pa(X_i), for computing P(X). This is achieved with the help of probability tables that associate every node with its parent nodes as follows:

1. The probability table for node X_i contains the conditional probability values P(X_i|pa(X_i)) for every combination of values in X_i and pa(X_i).

2. If X_i has no parents (pa(X_i) = ∅), then the table contains only the prior probability P(X_i).

Example 4.6. [Probability Tables] Figure 4.13 shows an example of a Bayesian network for modeling the relationships between a patient's symptoms and risk factors. The probability tables are shown at the side of every node in the figure. The probability tables associated with the risk factors (Exercise and Diet) contain only the prior probabilities, whereas the tables for heart disease, heartburn, blood pressure, and chest pain contain the conditional probabilities.

Figure 4.13. A Bayesian network for detecting heart disease and heartburn in patients.
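To make Equation 4.23 concrete, the sketch below evaluates the joint probability of a tiny network whose structure loosely mirrors the risk factors of Figure 4.13; all of the probability values are made-up placeholders, since the actual tables live in the figure.

```python
# A minimal sketch of Equation 4.23: P(X) = prod_i P(X_i | pa(X_i)).
# The structure loosely follows Figure 4.13, but the numbers are hypothetical.

p_exercise = {"Yes": 0.7, "No": 0.3}                 # prior table (no parents)
p_diet     = {"Healthy": 0.25, "Unhealthy": 0.75}    # prior table (no parents)

# conditional table P(Heart Disease = Yes | Exercise, Diet), hypothetical values
p_hd = {("Yes", "Healthy"): 0.25, ("Yes", "Unhealthy"): 0.45,
        ("No",  "Healthy"): 0.55, ("No",  "Unhealthy"): 0.75}

def joint(exercise, diet, heart_disease):
    """P(E, D, HD) = P(E) * P(D) * P(HD | E, D)."""
    p_hd_given = p_hd[(exercise, diet)]
    if heart_disease == "No":
        p_hd_given = 1.0 - p_hd_given
    return p_exercise[exercise] * p_diet[diet] * p_hd_given

print(joint("Yes", "Healthy", "Yes"))   # 0.7 * 0.25 * 0.25 = 0.04375
```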

Use of Hidden Variables
A Bayesian network typically involves two types of variables: observed variables that are clamped to specific observed values, and unobserved variables, whose values are not known and need to be inferred from the network. To distinguish between these two types of variables, observed variables are generally represented using shaded nodes while unobserved variables are represented using empty nodes. Figure 4.14 shows an example of a Bayesian network with observed variables (A, B, and E) and unobserved variables (C and D).

Figure 4.14. Observed and unobserved variables are represented using shaded and unshaded circles, respectively.

In the context of classification, the observed variables correspond to the set of attributes X, while the target class is represented using an unobserved variable Y that needs to be inferred during testing. However, note that a generic Bayesian network may contain many other unobserved variables apart from the target class, represented in Figure 4.15 as the set of variables H. These unobserved variables represent hidden or confounding factors that affect the probabilities of attributes and class labels, although they are never directly observed. The use of hidden variables enhances the expressive power of Bayesian networks in representing complex probabilistic relationships between attributes and class labels. This is one of the key distinguishing properties of Bayesian networks as compared to naïve Bayes classifiers.

4.5.2 Inference and Learning

Given the probability tables corresponding to every node in a Bayesian network, the problem of inference corresponds to computing the probabilities of different sets of random variables. In the context of classification, one of the key inference problems is to compute the probability of a target class Y taking on a specific value y, given the set of observed attributes at a data instance, x. This can be represented using the following conditional probability:

P(Y = y|x) = P(y, x)/P(x) = P(y, x)/∑_{y′} P(y′, x).    (4.24)

The previous equation involves marginal probabilities of the form P(y, x). They can be computed by marginalizing out the hidden variables H from the joint probability as follows:

P(y, x) = ∑_H P(y, x, H),    (4.25)

where the joint probability P(y, x, H) can be obtained by using the factorization described in Equation 4.23. To understand the nature of computations involved in estimating P(y, x), consider the example Bayesian network shown in Figure 4.15, which involves a target class, Y, three observed attributes, X_1 to X_3, and four hidden variables, H_1 to H_4. For this network, we can express P(y, x) as

P(y, x) = ∑_{h_1} ∑_{h_2} ∑_{h_3} ∑_{h_4} P(y, x_1, x_2, x_3, h_1, h_2, h_3, h_4)
        = ∑_{h_1} ∑_{h_2} ∑_{h_3} ∑_{h_4} [P(h_1) P(h_2) P(x_2) P(h_4) P(x_1|h_1, h_2) × P(h_3|x_2, h_2) P(y|x_1, h_3) P(x_3|h_3, h_4)]    (4.26)
        = ∑_{h_1} ∑_{h_2} ∑_{h_3} ∑_{h_4} f(h_1, h_2, h_3, h_4),    (4.27)

where f is a factor that depends on the values of h_1 to h_4. In the previous simplistic expression of P(y, x), a different summand is considered for every combination of values, h_1 to h_4, of the hidden variables, H_1 to H_4. If we assume that every variable in the network can take k discrete values, then the summation has to be carried out a total of k^4 times. The computational complexity of this approach is thus O(k^4). Moreover, the number of computations grows exponentially with the number of hidden variables, making it difficult to use this approach with networks that have a large number of hidden variables. In the following, we present different computational techniques for efficiently performing inferences in Bayesian networks.

Figure 4.15. An example of a Bayesian network with four hidden variables, H_1 to H_4, three observed attributes, X_1 to X_3, and one target class Y.

Variable Elimination
To reduce the number of computations involved in estimating P(y, x), let us closely examine the expressions in Equations 4.26 and 4.27. Notice that although f(h_1, h_2, h_3, h_4) depends on the values of all four hidden variables, it can be decomposed as a product of several smaller factors, where every factor involves only a small number of hidden variables. For example, the factor P(h_4) depends only on the value of h_4, and thus acts as a constant multiplicative term when summations are performed over h_1, h_2, or h_3. Hence, if we place P(h_4) outside the summations over h_1 to h_3, we can save some repeated multiplications occurring inside every summand.

In general, we can push every summation as far inside as possible, so that the factors that do not depend on the summing variable are placed outside the summation. This will help reduce the number of wasteful computations by using smaller factors at every summation. To illustrate this process, consider the following sequence of steps for computing P(y, x), obtained by rearranging the order of summations in Equation 4.26:

P(y, x) = P(x_2) ∑_{h_4} P(h_4) ∑_{h_3} P(y|x_1, h_3) P(x_3|h_3, h_4) × ∑_{h_2} P(h_2) P(h_3|x_2, h_2) ∑_{h_1} P(h_1) P(x_1|h_1, h_2)    (4.28)
        = P(x_2) ∑_{h_4} P(h_4) ∑_{h_3} P(y|x_1, h_3) P(x_3|h_3, h_4) × ∑_{h_2} P(h_2) P(h_3|x_2, h_2) f_1(h_2)    (4.29)
        = P(x_2) ∑_{h_4} P(h_4) ∑_{h_3} P(y|x_1, h_3) P(x_3|h_3, h_4) f_2(h_3)    (4.30)
        = P(x_2) ∑_{h_4} P(h_4) f_3(h_4),    (4.31)

where f_i represents the intermediate factor term obtained by summing out h_i. To check if the previous rearrangements provide any improvements in computational efficiency, let us count the number of computations occurring at every step of the process. At the first step (Equation 4.28), we perform a summation over h_1 using factors that depend on h_1 and h_2. This requires considering every pair of values in h_1 and h_2, resulting in O(k^2) computations. Similarly, the second step (Equation 4.29) involves summing out h_2 using factors of h_2 and h_3, leading to O(k^2) computations. The third step (Equation 4.30) again requires O(k^2) computations as it involves summing out h_3 over factors depending on h_3 and h_4. Finally, the fourth step (Equation 4.31) involves summing out h_4 using factors depending on h_4, resulting in O(k) computations.

The overall complexity of the previous approach is thus O(k^2), which is considerably smaller than the O(k^4) complexity of the basic approach. Hence, by merely rearranging summations and using algebraic manipulations, we are able to improve the computational efficiency in computing P(y, x). This procedure is known as variable elimination.

The basic concept that variable elimination exploits to reduce the number of computations is the distributive nature of multiplication over addition. For example, consider the following multiplication and addition operations:

a·(b + c + d) = a·b + a·c + a·d.

Notice that the right-hand side of the previous equation involves three multiplications and two additions, while the left-hand side involves only one multiplication and two additions, thus saving two arithmetic operations. This property is utilized by variable elimination in pushing constant terms outside the summation, such that they are multiplied only once.

Note that the efficiency of variable elimination depends on the order of hidden variables used for performing summations. Hence, we would ideally like to find the optimal order of variables that results in the smallest number of computations. Unfortunately, finding the optimal order of summations for a generic Bayesian network is an NP-hard problem, i.e., there does not exist an efficient algorithm for finding the optimal ordering that can run in polynomial time. However, there exist efficient techniques for handling special types of Bayesian networks, e.g., those involving tree-like graphs, as described in the following.
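The savings from rearranging the summations can be checked numerically. The sketch below uses randomly generated tables standing in for the factors of Figure 4.15 (with the observed values of y, x_1, x_2, x_3 already plugged in, and the constant P(x_2) dropped), and verifies that the brute-force sum of Equation 4.27 and the eliminated form of Equations 4.28 through 4.31 agree.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 3                                # each hidden variable takes k discrete values

p_h1 = rng.random(k)                 # P(h1)
p_h2 = rng.random(k)                 # P(h2)
p_h4 = rng.random(k)                 # P(h4)
p_x1 = rng.random((k, k))            # P(x1 | h1, h2) as a function of (h1, h2)
p_h3 = rng.random((k, k))            # P(h3 | x2, h2) as a function of (h3, h2)
p_y  = rng.random(k)                 # P(y  | x1, h3) as a function of h3
p_x3 = rng.random((k, k))            # P(x3 | h3, h4) as a function of (h3, h4)

# Brute force: O(k^4) summands, as in Equation 4.27.
brute = sum(p_h1[h1] * p_h2[h2] * p_h4[h4] *
            p_x1[h1, h2] * p_h3[h3, h2] * p_y[h3] * p_x3[h3, h4]
            for h1 in range(k) for h2 in range(k)
            for h3 in range(k) for h4 in range(k))

# Variable elimination: push each summation inside, as in Equations 4.28-4.31.
f1 = p_x1.T @ p_h1                   # f1(h2) = sum_h1 P(h1) P(x1|h1,h2)
f2 = p_h3 @ (p_h2 * f1)              # f2(h3) = sum_h2 P(h2) P(h3|x2,h2) f1(h2)
f3 = p_x3.T @ (p_y * f2)             # f3(h4) = sum_h3 P(y|x1,h3) P(x3|h3,h4) f2(h3)
elim = p_h4 @ f3                     # sum_h4 P(h4) f3(h4)

print(np.isclose(brute, elim))       # True: same value, far fewer operations
```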

Sum-Product Algorithm for Trees
Note that in Equations 4.28 and 4.29, whenever a variable h_i is eliminated during marginalization, it results in the creation of a factor f_i that depends on the neighboring nodes of h_i. f_i is then absorbed in the factors of neighboring variables and the process is repeated until all unobserved variables have been marginalized. This phenomenon of variable elimination can be viewed as transmitting a local message from the variable being marginalized to its neighboring nodes. This idea of message passing utilizes the structure of the graph for performing computations, thus making it possible to use graph-theoretic approaches for making effective inferences. The sum-product algorithm builds on the concept of message passing for computing marginal and conditional probabilities on tree-based graphs.

Figure 4.16 shows an example of a tree involving five variables, X_1 to X_5. A key characteristic of a tree is that every node in the tree has exactly one parent, and there is only one path between any two nodes in the tree. For the purpose of illustration, let us consider the problem of estimating the marginal probability of X_2, P(X_2). This can be obtained by marginalizing out every variable in the graph except X_2 and rearranging the summations to obtain the following expression:

P(x_2) = ∑_{x_1} ∑_{x_3} ∑_{x_4} ∑_{x_5} P(x_1) P(x_2|x_1) P(x_3|x_2) P(x_4|x_3) P(x_5|x_3)
       = (∑_{x_1} P(x_1) P(x_2|x_1)) × (∑_{x_3} P(x_3|x_2) (∑_{x_4} P(x_4|x_3)) (∑_{x_5} P(x_5|x_3))),

where the term in the first pair of parentheses is the message m_{12}(x_2), the inner sums over x_4 and x_5 are the messages m_{43}(x_3) and m_{53}(x_3), and the outer sum over x_3 is the message m_{32}(x_2).

Figure 4.16. An example of a Bayesian network with a tree structure.

Here m_{ij}(x_j) has been conveniently chosen to represent the factor of x_j that is obtained by summing out x_i. We can view m_{ij}(x_j) as a local message passed from node x_i to node x_j, as shown using arrows in Figure 4.17(a). These local messages capture the influence of eliminating nodes on the marginal probabilities of neighboring nodes.

Before we formally describe the formula for computing m_{ij}(x_j) and P(x_j), we first define a potential function ψ(·) that is associated with every node and edge of the graph. We can define the potential of a node X_i as

ψ(X_i) = { P(X_i), if X_i is the root node; 1, otherwise. }    (4.32)

Figure 4.17. Illustration of message passing in the sum-product algorithm.

Similarly, we can define the potential of an edge between nodes X_i and X_j (where X_i is the parent of X_j) as

ψ(X_i, X_j) = P(X_j|X_i).

Using ψ(X_i) and ψ(X_i, X_j), we can represent m_{ij}(x_j) using the following equation:

m_{ij}(x_j) = ∑_{x_i} (ψ(x_i) ψ(x_i, x_j) ∏_{k∈N(i)\j} m_{ki}(x_i)),    (4.33)

where N(i) represents the set of neighbors of node X_i. The message m_{ij} that is transmitted from X_i to X_j can thus be recursively computed using the messages incident on X_i from its neighboring nodes excluding X_j. Note that the formula for m_{ij} involves taking a sum over all possible values of x_i, after multiplying the factors obtained from the neighbors of X_i. This approach of message passing is thus called the "sum-product" algorithm. Further, since m_{ij} represents a notion of "belief" propagated from X_i to X_j, this algorithm is also known as belief propagation. The marginal probability of a node X_i is then given as

P(x_i) = ψ(x_i) ∏_{j∈N(i)} m_{ji}(x_i).    (4.34)

A useful property of the sum-product algorithm is that it allows the messages to be reused for computing a different marginal probability in the future. For example, if we had to compute the marginal probability for node X_3, we would require the following messages from its neighboring nodes: m_{23}(x_3), m_{43}(x_3), and m_{53}(x_3). However, note that m_{43}(x_3) and m_{53}(x_3) have already been computed in the process of computing the marginal probability of X_2 and thus can be reused.

Notice that the basic operations of the sum-product algorithm resemble a message passing protocol over the edges of the network. A node sends out a message to a neighboring node only after it has received incoming messages from all its other neighbors. Hence, we can initialize the message passing protocol from the leaf nodes, and transmit messages till we reach the root node. We can then run a second pass of messages from the root node back to the leaf nodes. In this way, we can compute the messages for every edge in both directions, using just O(2|E|) operations, where |E| is the number of edges. Once we have transmitted all possible messages as shown in Figure 4.17(b), we can easily compute the marginal probability of every node in the graph using Equation 4.34.
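As an illustration, the sketch below runs the message computations above on a tree with the same shape as Figure 4.16 (X_1 → X_2 → X_3, with X_4 and X_5 hanging off X_3), using randomly generated conditional tables of our own; it checks the message-passing result for P(x_2) against a brute-force marginalization.

```python
import numpy as np

rng = np.random.default_rng(1)
k = 3

def random_cpt(shape):
    """Random conditional table normalized over its first axis (the child variable)."""
    t = rng.random(shape)
    return t / t.sum(axis=0, keepdims=True)

p_x1 = random_cpt((k,))              # P(x1), root potential
p_x2 = random_cpt((k, k))            # P(x2 | x1), indexed [x2, x1]
p_x3 = random_cpt((k, k))            # P(x3 | x2), indexed [x3, x2]
p_x4 = random_cpt((k, k))            # P(x4 | x3), indexed [x4, x3]
p_x5 = random_cpt((k, k))            # P(x5 | x3), indexed [x5, x3]

# Messages toward X2, mirroring m12, m43, m53, and m32 above.
m12 = p_x2 @ p_x1                    # m12(x2) = sum_x1 P(x1) P(x2|x1)
m43 = p_x4.sum(axis=0)               # m43(x3) = sum_x4 P(x4|x3); all ones here,
m53 = p_x5.sum(axis=0)               # m53(x3) = sum_x5 P(x5|x3); since the CPTs are normalized
m32 = p_x3.T @ (m43 * m53)           # m32(x2) = sum_x3 P(x3|x2) m43(x3) m53(x3)

marginal_x2 = m12 * m32              # Equation 4.34 with psi(x2) = 1 (non-root node)

# Brute-force check: sum the full joint over every variable except X2.
joint = np.einsum('a,ba,cb,dc,ec->b', p_x1, p_x2, p_x3, p_x4, p_x5)
print(np.allclose(marginal_x2, joint))   # True
```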

In the context of classification, the sum-product algorithm can be easily modified for computing the conditional probability of the class label y given the set of observed attributes x̂, i.e., P(y|x̂). This basically amounts to computing P(y, X = x̂) in Equation 4.24, where X is clamped to the observed values x̂. To handle the scenario where some of the random variables are fixed and do not need to be marginalized, we consider the following modification. If X_i is a random variable that is fixed to a specific value x̂_i, then we can simply modify ψ(X_i) and ψ(X_i, X_j) as follows:

ψ(X_i) = { 1, if X_i = x̂_i; 0, otherwise. }    (4.35)

ψ(X_i, X_j) = { P(X_j|x̂_i), if X_i = x̂_i; 0, otherwise. }    (4.36)

We can run the sum-product algorithm using these modified values for every observed variable and thus compute P(y, X = x̂).

Figure 4.18. Example of a poly-tree and its corresponding factor graph.

Generalizations for Non-Tree Graphs
The sum-product algorithm is guaranteed to optimally converge in the case of trees using a single run of message passing in both directions of every edge. This is because any two nodes in a tree have a unique path for the transmission of messages. Furthermore, since every node in a tree has a single parent, the joint probability involves only factors of at most two variables. Hence, it is sufficient to consider potentials over edges and not other generic substructures in the graph.

Both of the previous properties are violated in graphs that are not trees, thus making it difficult to directly apply the sum-product algorithm for making inferences. However, a number of variants of the sum-product algorithm have been devised to perform inferences on a broader family of graphs than trees. Many of these variants transform the original graph into an alternative tree-based representation, and then apply the sum-product algorithm on the transformed tree. In this section, we briefly discuss one such transformation, known as factor graphs.

Factor graphs are useful for making inferences over graphs that violate the condition that every node has a single parent. Nonetheless, they still require the absence of multiple paths between any two nodes, to guarantee convergence. Such graphs are known as poly-trees. An example of a poly-tree is shown in Figure 4.18(a).

A poly-tree can be transformed into a tree-based representation with the help of factor graphs. These graphs consist of two types of nodes, variable nodes (represented using circles) and factor nodes (represented using squares). The factor nodes represent conditional independence relationships among the variables of the poly-tree. In particular, every probability table can be represented as a factor node. The edges in a factor graph are undirected in nature and relate a variable node to a factor node if the variable is involved in the probability table corresponding to the factor node. Figure 4.18(b) presents the factor graph representation of the poly-tree shown in Figure 4.18(a).

Note that the factor graph of a poly-tree always forms a tree-like structure, where there is a unique path of influence between any two nodes in the factor graph. Hence, we can apply a modified form of the sum-product algorithm to transmit messages between variable nodes and factor nodes, which is guaranteed to converge to optimal values.

Learning Model Parameters
In all our previous discussions on Bayesian networks, we had assumed that the topology of the Bayesian network and the values in the probability tables of every node were already known. In this section, we discuss approaches for learning both the topology and the probability table values of a Bayesian network from the training data.

Let us first consider the case where the topology of the network is known and we are only required to compute the probability tables. If there are no unobserved variables in the training data, then we can easily compute the probability table for P(X_i|pa(X_i)) by counting the fraction of training instances for every value of X_i and every combination of values in pa(X_i). However, if there are unobserved variables in X_i or pa(X_i), then computing the fraction of training instances for such variables is non-trivial and requires the use of advanced techniques such as the Expectation-Maximization algorithm (described later in Chapter 8).

Learning the structure of the Bayesian network is a much more challenging task than learning the probability tables. Although there are some scoring approaches that attempt to find a graph structure that maximizes the training likelihood, they are often computationally infeasible when the graph is large. Hence, a common approach for constructing Bayesian networks is to use the subjective knowledge of domain experts.

4.5.3 Characteristics of Bayesian Networks

1. Bayesian networks provide a powerful approach for representing probabilistic relationships between attributes and class labels with the help of graphical models. They are able to capture complex forms of dependencies among variables. Apart from encoding prior beliefs, they are also able to model the presence of latent (unobserved) factors as hidden variables in the graph. Bayesian networks are thus quite expressive and provide predictive as well as descriptive insights about the behavior of attributes and class labels.

2. Bayesian networks can easily handle the presence of correlated or redundant attributes, as opposed to the naïve Bayes classifier. This is because Bayesian networks do not use the naïve Bayes assumption about conditional independence, but instead are able to express richer forms of conditional independence.

3. Similar to the naïve Bayes classifier, Bayesian networks are also quite robust to the presence of noise in the training data. Further, they can handle missing values during training as well as testing. If a test instance contains an attribute X_i with a missing value, then a Bayesian network can perform inference by treating X_i as an unobserved node and marginalizing out its effect on the target class. Hence, Bayesian networks are well-suited for handling incompleteness in the data, and can work with partial information. However, unless the pattern with which missing values occur is completely random, their presence will likely introduce some degree of error and/or bias into the analysis.

4. Bayesian networks are robust to irrelevant attributes that contain no discriminatory information about the class labels. Such attributes have no impact on the conditional probability of the target class, and are thus rightfully ignored.

5. Learning the structure of a Bayesian network is a cumbersome task that often requires assistance from expert knowledge. However, once the structure has been decided, learning the parameters of the network can be quite straightforward, especially if all the variables in the network are observed.

6. Due to their additional ability to represent complex forms of relationships, Bayesian networks are more susceptible to overfitting as compared to the naïve Bayes classifier. Furthermore, Bayesian networks typically require more training instances for effectively learning the probability tables than the naïve Bayes classifier.

7. Although the sum-product algorithm provides computationally efficient techniques for performing inference over tree-like graphs, the complexity of the approach increases significantly when dealing with generic graphs of large sizes. In situations where exact inference is computationally infeasible, it is quite common to use approximate inference techniques.

4.6 Logistic Regression
The naïve Bayes and Bayesian network classifiers described in the previous sections provide different ways of estimating the conditional probability of an instance x given class y, P(x|y). Such models are known as probabilistic generative models. Note that the conditional probability P(x|y) essentially describes the behavior of instances in the attribute space that are generated from class y. However, for the purpose of making predictions, we are finally interested in computing the posterior probability P(y|x). For example, computing the following ratio of posterior probabilities is sufficient for inferring class labels in a binary classification problem:

P(y = 1|x) / P(y = 0|x).

This ratio is known as the odds. If this ratio is greater than 1, then x is classified as y = 1. Otherwise, it is assigned to class y = 0. Hence, one may simply learn a model of the odds based on the attribute values of training instances, without having to compute P(x|y) as an intermediate quantity in the Bayes theorem.

Classification models that directly assign class labels without computing class-conditional probabilities are called discriminative models. In this section, we present a probabilistic discriminative model known as logistic regression, which directly estimates the odds of a data instance x using its attribute values. The basic idea of logistic regression is to use a linear predictor, z = w^T x + b, for representing the odds of x as follows:

P(y = 1|x) / P(y = 0|x) = e^z = e^{w^T x + b},    (4.37)

where w and b are the parameters of the model and a^T denotes the transpose of a vector a. Note that if w^T x + b > 0, then x belongs to class 1 since its odds is greater than 1. Otherwise, x belongs to class 0.

Figure 4.19. Plot of the sigmoid (logistic) function, σ(z).

Since P(y = 0|x) + P(y = 1|x) = 1, we can re-write Equation 4.37 as

P(y = 1|x) / (1 − P(y = 1|x)) = e^z.

This can be further simplified to express P(y = 1|x) as a function of z,

P(y = 1|x) = 1/(1 + e^{−z}) = σ(z),    (4.38)

where the function σ(·) is known as the logistic or sigmoid function. Figure 4.19 shows the behavior of the sigmoid function as we vary z. We can see that σ(z) ≥ 0.5 only when z ≥ 0. We can also derive P(y = 0|x) using σ(z) as follows:

P(y = 0|x) = 1 − σ(z) = 1/(1 + e^{z}).    (4.39)

Hence, if we have learned suitable values of the parameters w and b, we can use Equations 4.38 and 4.39 to estimate the posterior probabilities of any data instance x and determine its class label.
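A minimal sketch of Equations 4.38 and 4.39 in code (the parameter values and the instance below are arbitrary placeholders of ours):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def posterior(x, w, b):
    """P(y=1|x) and P(y=0|x) from the linear predictor z = w^T x + b (Eqs. 4.38-4.39)."""
    z = np.dot(w, x) + b
    p1 = sigmoid(z)
    return p1, 1.0 - p1

# Hypothetical parameters and instance; the predicted class is 1 whenever z >= 0.
w, b = np.array([0.8, -1.5]), 0.2
p1, p0 = posterior(np.array([1.0, 0.5]), w, b)
print(p1, p0, int(p1 >= 0.5))
```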

4.6.1 Logistic Regression as a Generalized Linear Model

Since the posterior probabilities are real-valued, their estimation using the previous equations can be viewed as solving a regression problem. In fact, logistic regression belongs to a broader family of statistical regression models, known as generalized linear models (GLM). In these models, the target variable y is considered to be generated from a probability distribution P(y|x), whose mean μ can be estimated using a link function g(·) as follows:

g(μ) = z = w^T x + b.    (4.40)

For binary classification using logistic regression, y follows a Bernoulli distribution (y can either be 0 or 1) and μ is equal to P(y = 1|x). The link function g(·) of logistic regression, called the logit function, can thus be represented as

g(μ) = log(μ/(1 − μ)).

Depending on the choice of link function g(·) and the form of probability distribution P(y|x), GLMs are able to represent a broad family of regression models, such as linear regression and Poisson regression. They require different approaches for estimating their model parameters, (w, b). In this chapter, we will only discuss approaches for estimating the model parameters of logistic regression, although methods for estimating parameters of other types of GLMs are often similar (and sometimes even simpler). (See Bibliographic Notes for more details on GLMs.)

Note that even though logistic regression has relationships with regression models, it is a classification model since the computed posterior probabilities are eventually used to determine the class label of a data instance.

4.6.2 Learning Model Parameters

The parameters of logistic regression, (w, b), are estimated during training using a statistical approach known as the maximum likelihood estimation (MLE) method. This method involves computing the likelihood of observing the training data given (w, b), and then determining the model parameters (w*, b*) that yield the maximum likelihood.

Let D.train = {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)} denote a set of n training instances, where y_i is a binary variable (0 or 1). For a given training instance (x_i, y_i), we can compute its posterior probabilities using Equations 4.38 and 4.39. We can then express the likelihood of observing y_i given x_i, w, and b as

P(y_i|x_i, w, b) = P(y = 1|x_i)^{y_i} × P(y = 0|x_i)^{1−y_i}
               = (σ(z_i))^{y_i} × (1 − σ(z_i))^{1−y_i}
               = (σ(w^T x_i + b))^{y_i} × (1 − σ(w^T x_i + b))^{1−y_i},    (4.41)

where σ(·) is the sigmoid function as described above. Equation 4.41 basically means that the likelihood P(y_i|x_i, w, b) is equal to P(y = 1|x_i) when y_i = 1, and equal to P(y = 0|x_i) when y_i = 0. The likelihood of all training instances, L(w, b), can then be computed by taking the product of individual likelihoods (assuming independence among training instances) as follows:

L(w, b) = ∏_{i=1}^{n} P(y_i|x_i, w, b) = ∏_{i=1}^{n} P(y = 1|x_i)^{y_i} × P(y = 0|x_i)^{1−y_i}.    (4.42)

The previous equation involves multiplying a large number of probability values, each of which is smaller than or equal to 1. Since this naïve computation can easily become numerically unstable when n is large, a more practical approach is to consider the negative logarithm (to base e) of the likelihood function, also known as the cross entropy function:

−log L(w, b) = −∑_{i=1}^{n} [y_i log(P(y = 1|x_i)) + (1 − y_i) log(P(y = 0|x_i))]
            = −∑_{i=1}^{n} [y_i log(σ(w^T x_i + b)) + (1 − y_i) log(1 − σ(w^T x_i + b))].

The cross entropy is a loss function that measures how unlikely it is for the training data to be generated from the logistic regression model with parameters (w, b). Intuitively, we would like to find the model parameters (w*, b*) that result in the lowest cross entropy, −log L(w*, b*):

(w*, b*) = argmin_{(w, b)} E(w, b) = argmin_{(w, b)} −log L(w, b),    (4.43)

where E(w, b) = −log L(w, b) is the loss function. It is worth emphasizing that E(w, b) is a convex function, i.e., any minimum of E(w, b) will be a global minimum. Hence, we can use any of the standard convex optimization techniques to solve Equation 4.43, which are mentioned in Appendix E. Here, we briefly describe the Newton-Raphson method that is commonly used for estimating the parameters of logistic regression. For ease of representation, we will use a single vector w̃ = (w^T b)^T to describe (w, b), which is of size one greater than w. Similarly, we will consider the concatenated feature vector x̃ = (x^T 1)^T, such that the linear predictor z = w^T x + b can be succinctly written as z = w̃^T x̃. Also, the concatenation of all training labels, y_1 to y_n, will be represented as y, the set consisting of σ(z_1) to σ(z_n) will be represented as σ, and the concatenation of x̃_1 to x̃_n will be represented as X̃.

The Newton-Raphson method is an iterative method for finding w̃* that uses the following equation to update the model parameters at every iteration:

w̃^{(new)} = w̃^{(old)} − H^{−1} ∇E(w̃),    (4.44)

where ∇E(w̃) and H are the first- and second-order derivatives of the loss function E(w̃) with respect to w̃, respectively. The key intuition behind Equation 4.44 is to move the model parameters in the direction of maximum gradient, such that w̃ takes larger steps when ∇E(w̃) is large. When w̃ arrives at a minimum after some number of iterations, ∇E(w̃) becomes equal to 0 and thus results in convergence. Hence, we start with some initial values of w̃ (either randomly assigned or set to 0) and use Equation 4.44 to iteratively update w̃ till there are no significant changes in its value (beyond a certain threshold).

The first-order derivative of E(w̃) is given by

∇E(w̃) = −∑_{i=1}^{n} [y_i x̃_i (1 − σ(w̃^T x̃_i)) − (1 − y_i) x̃_i σ(w̃^T x̃_i)]
       = ∑_{i=1}^{n} (σ(w̃^T x̃_i) − y_i) x̃_i
       = X̃^T (σ − y),    (4.45)

where we have used the fact that dσ(z)/dz = σ(z)(1 − σ(z)). Using ∇E(w̃), we can compute the second-order derivative of E(w̃) as

H = ∇∇E(w̃) = ∑_{i=1}^{n} σ(w̃^T x̃_i)(1 − σ(w̃^T x̃_i)) x̃_i x̃_i^T = X̃^T R X̃,    (4.46)

where R is a diagonal matrix whose i-th diagonal element is R_{ii} = σ_i(1 − σ_i). We can now use the first- and second-order derivatives of E(w̃) in Equation 4.44 to obtain the following update equation at the k-th iteration:

w̃^{(k+1)} = w̃^{(k)} − (X̃^T R_k X̃)^{−1} X̃^T (σ_k − y),    (4.47)

where the subscript k under R and σ refers to using w̃^{(k)} to compute both terms.
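The update of Equation 4.47 is straightforward to implement. Below is a minimal sketch of ours (the function name and the synthetic data are illustrative); it solves the linear system rather than forming H^{−1} explicitly, which is the usual numerical practice.

```python
import numpy as np

def fit_logistic_newton(X, y, num_iter=20, tol=1e-8):
    """Newton-Raphson updates of Equation 4.47 for w~ = (w, b), a minimal sketch."""
    Xt = np.hstack([X, np.ones((X.shape[0], 1))])     # append 1 to every x to get x~
    w = np.zeros(Xt.shape[1])                         # start from w~ = 0
    for _ in range(num_iter):
        s = 1.0 / (1.0 + np.exp(-Xt @ w))             # sigma(w~^T x~_i) for every i
        grad = Xt.T @ (s - y)                         # Equation 4.45
        R = np.diag(s * (1.0 - s))                    # R_ii = sigma_i (1 - sigma_i)
        H = Xt.T @ R @ Xt                             # Equation 4.46
        step = np.linalg.solve(H, grad)               # H^{-1} grad without an explicit inverse
        w -= step                                     # Equations 4.44 / 4.47
        if np.linalg.norm(step) < tol:
            break
    return w

# Tiny synthetic check: one informative feature with some class overlap.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(float)
print(fit_logistic_newton(X, y))     # learned (w, b); w should be clearly positive
```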

4.6.3 Characteristics of Logistic Regression

1. Logistic regression is a discriminative model for classification that directly computes the posterior probabilities without making any assumption about the class-conditional probabilities. Hence, it is quite generic and can be applied in diverse applications. It can also be easily extended to multiclass classification, where it is known as multinomial logistic regression. However, its expressive power is limited to learning only linear decision boundaries.

2. Because there are different weights (parameters) for every attribute, the learned parameters of logistic regression can be analyzed to understand the relationships between attributes and class labels.

3. Because logistic regression does not involve computing densities and distances in the attribute space, it can work more robustly in high-dimensional settings than distance-based methods such as nearest neighbor classifiers. However, the objective function of logistic regression does not involve any term relating to the complexity of the model. Hence, logistic regression does not provide a way to make a trade-off between model complexity and training performance, as compared to other classification models such as support vector machines. Nevertheless, variants of logistic regression can easily be developed to account for model complexity, by including appropriate terms in the objective function along with the cross entropy function.

4. Logistic regression can handle irrelevant attributes by learning weight parameters close to 0 for attributes that do not provide any gain in performance during training. It can also handle interacting attributes since the learning of model parameters is achieved in a joint fashion by considering the effects of all attributes together. Furthermore, if there are redundant attributes that are duplicates of each other, then logistic regression can learn equal weights for every redundant attribute, without degrading classification performance. However, the presence of a large number of irrelevant or redundant attributes in high-dimensional settings can make logistic regression susceptible to model overfitting.

5. Logistic regression cannot handle data instances with missing values, since the posterior probabilities are only computed by taking a weighted sum of all the attributes. If there are missing values in a training instance, it can be discarded from the training set. However, if there are missing values in a test instance, then logistic regression would fail to predict its class label.

4.7 Artificial Neural Network (ANN)
Artificial neural networks (ANN) are powerful classification models that are able to learn highly complex and nonlinear decision boundaries purely from the data. They have gained widespread acceptance in several applications such as vision, speech, and language processing, where they have been repeatedly shown to outperform other classification models (and in some cases even human performance). Historically, the study of artificial neural networks was inspired by attempts to emulate biological neural systems. The human brain consists primarily of nerve cells called neurons, linked together with other neurons via strands of fiber called axons. Whenever a neuron is stimulated (e.g., in response to a stimulus), it transmits nerve activations via axons to other neurons. The receptor neurons collect these nerve activations using structures called dendrites, which are extensions from the cell body of the neuron. The strength of the contact point between a dendrite and an axon, known as a synapse, determines the connectivity between neurons. Neuroscientists have discovered that the human brain learns by changing the strength of the synaptic connection between neurons upon repeated stimulation by the same impulse.

The human brain consists of approximately 100 billion neurons that are inter-connected in complex ways, making it possible for us to learn new tasks and perform regular activities. Note that a single neuron only performs a simple modular function, which is to respond to the nerve activations coming from sender neurons connected at its dendrites, and transmit its activation to receptor neurons via axons. However, it is the composition of these simple functions that together is able to express complex functions. This idea is at the basis of constructing artificial neural networks.

Analogous to the structure of a human brain, an artificial neural network is composed of a number of processing units, called nodes, that are connected with each other via directed links. The nodes correspond to neurons that perform the basic units of computation, while the directed links correspond to connections between neurons, consisting of axons and dendrites. Further, the weight of a directed link between two nodes represents the strength of the synaptic connection between neurons. As in biological neural systems, the primary objective of an ANN is to adapt the weights of the links until they fit the input-output relationships of the underlying data.

The basic motivation behind using an ANN model is to extract useful features from the original attributes that are most relevant for classification. Traditionally, feature extraction has been achieved by using dimensionality reduction techniques such as PCA (introduced in Chapter 2), which show limited success in extracting nonlinear features, or by using hand-crafted features provided by domain experts. By using a complex combination of inter-connected nodes, ANN models are able to extract much richer sets of features, resulting in good classification performance. Moreover, ANN models provide a natural way of representing features at multiple levels of abstraction, where complex features are seen as compositions of simpler features. In many classification problems, modeling such a hierarchy of features turns out to be very useful. For example, in order to detect a human face in an image, we can first identify low-level features such as sharp edges with different gradients and orientations. These features can then be combined to identify facial parts such as eyes, nose, ears, and lips. Finally, an appropriate arrangement of facial parts can be used to correctly identify a human face. ANN models provide a powerful architecture to represent a hierarchical abstraction of features, from lower levels of abstraction (e.g., edges) to higher levels (e.g., facial parts).

Artificial neural networks have had a long history of developments spanning over five decades of research. Although classical models of ANN suffered from several challenges that hindered progress for a long time, they have re-emerged with widespread popularity because of a number of recent developments in the last decade, collectively known as deep learning. In this section, we examine classical approaches for learning ANN models, starting from the simplest model, called perceptrons, to more complex architectures, called multi-layer neural networks. In the next section, we discuss some of the recent advancements in the area of ANN that have made it possible to effectively learn modern ANN models with deep architectures.

4.7.1 Perceptron

A perceptron is a basic type of ANN model that involves two types of nodes: input nodes, which are used to represent the input attributes, and an output node, which is used to represent the model output. Figure 4.20 illustrates the basic architecture of a perceptron that takes three input attributes, x_1, x_2, and x_3, and produces a binary output y. The input node corresponding to an attribute x_i is connected via a weighted link w_i to the output node. The weighted link is used to emulate the strength of a synaptic connection between neurons.

Figure 4.20. Basic architecture of a perceptron.

The output node is a mathematical device that computes a weighted sum of its inputs, adds a bias factor b to the sum, and then examines the sign of the result to produce the output ŷ as follows:

ŷ = { 1, if w^T x + b > 0; −1, otherwise. }    (4.48)

To simplify notation, w and b can be concatenated to form w̃ = (w^T b)^T, while x can be appended with 1 at the end to form x̃ = (x^T 1)^T. The output of the perceptron ŷ can then be written as

ŷ = sign(w̃^T x̃),

where the sign function acts as an activation function by providing an output value of +1 if the argument is positive and −1 if its argument is negative.

Learning the Perceptron
Given a training set, we are interested in learning parameters w̃ such that ŷ closely resembles the true y of training instances. This is achieved by using the perceptron learning algorithm given in Algorithm 4.3. The key computation for this algorithm is the iterative weight update formula given in Step 8 of the algorithm:

w_j^{(k+1)} = w_j^{(k)} + λ(y_i − ŷ_i^{(k)}) x_{ij},    (4.49)

where w_j^{(k)} is the weight parameter associated with the j-th input link after the k-th iteration, λ is a parameter known as the learning rate, and x_{ij} is the value of the j-th attribute of the training example x_i. The justification for Equation 4.49 is rather intuitive. Note that (y_i − ŷ_i) captures the discrepancy between y_i and ŷ_i, such that its value is 0 only when the true label and the predicted output match. Assume x_{ij} is positive. If ŷ = −1 and y = +1, then w_j is increased at the next iteration so that w̃^T x̃_i can become positive. On the other hand, if ŷ = +1 and y = −1, then w_j is decreased so that w̃^T x̃_i can become negative. Hence, the weights are modified at every iteration to reduce the discrepancies between ŷ and y across all training instances. The learning rate λ, a parameter whose value is between 0 and 1, can be used to control the amount of adjustment made in each iteration. The algorithm halts when the average number of discrepancies is smaller than a threshold γ.

Algorithm 4.3 Perceptron learning algorithm.
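A minimal sketch of the learning loop behind Algorithm 4.3, built around the update of Equation 4.49; the function name, the stopping parameters, and the toy data are our own illustrative choices rather than the book's pseudocode.

```python
import numpy as np

def train_perceptron(X, y, lam=0.1, gamma=0.0, max_epochs=100):
    """Perceptron learning loop around Equation 4.49.
    X: (n, d) inputs, y: labels in {+1, -1}, lam: learning rate, gamma: stop threshold."""
    Xt = np.hstack([X, np.ones((X.shape[0], 1))])     # x~ = (x, 1), so w~ = (w, b)
    w = np.zeros(Xt.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(Xt, y):
            y_hat = 1 if w @ xi > 0 else -1           # Equation 4.48
            if y_hat != yi:
                w += lam * (yi - y_hat) * xi          # Equation 4.49
                errors += 1
        if errors / len(y) <= gamma:                  # halt when discrepancies are rare
            break
    return w

# Clearly separable toy data: one cluster near the origin, one near (0.8, 0.8).
rng = np.random.default_rng(0)
X = np.vstack([rng.uniform(0.0, 0.4, size=(25, 2)), rng.uniform(0.6, 1.0, size=(25, 2))])
y = np.array([-1] * 25 + [1] * 25)
print(train_perceptron(X, y))
```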

The perceptron is a simple classification model that is designed to learn linear decision boundaries in the attribute space. Figure 4.21 shows the decision boundary obtained by applying the perceptron learning algorithm to the data set provided on the left of the figure. However, note that there can be multiple decision boundaries that separate the two classes, and the perceptron arbitrarily learns one of these boundaries depending on the random initial values of the parameters. (The selection of the optimal decision boundary is a problem that will be revisited in the context of support vector machines in Section 4.9.) Further, the perceptron learning algorithm is only guaranteed to converge when the classes are linearly separable. If the classes are not linearly separable, the algorithm fails to converge. Figure 4.22 shows an example of nonlinearly separable data given by the XOR function. The perceptron cannot find the right solution for this data because there is no linear decision boundary that can perfectly separate the training instances. Thus, the stopping condition at line 12 of Algorithm 4.3 would never be met and the perceptron learning algorithm would fail to converge. This is a major limitation of perceptrons since real-world classification problems often involve nonlinearly separable classes.

Figure 4.21. Perceptron decision boundary for the data given on the left (+ represents a positively labeled instance while o represents a negatively labeled instance).

Figure 4.22. XOR classification problem. No linear hyperplane can separate the two classes.

4.7.2 Multi-layer Neural Network

A multi-layer neural network generalizes the basic concept of a perceptron to more complex architectures of nodes that are capable of learning nonlinear decision boundaries. A generic architecture of a multi-layer neural network is shown in Figure 4.23, where the nodes are arranged in groups called layers. These layers are commonly organized in the form of a chain such that every layer operates on the outputs of its preceding layer. In this way, the layers represent different levels of abstraction that are applied on the input features in a sequential manner. The composition of these abstractions generates the final output at the last layer, which is used for making predictions. In the following, we briefly describe the three types of layers used in multi-layer neural networks.

Figure 4.23. Example of a multi-layer artificial neural network (ANN).

The first layer of the network, called the input layer, is used for representing inputs from attributes. Every numerical or binary attribute is typically represented using a single node on this layer, while a categorical attribute is either represented using a different node for each categorical value, or by encoding the k-ary attribute using ⌈log_2 k⌉ input nodes. These inputs are fed into intermediary layers known as hidden layers, which are made up of processing units known as hidden nodes. Every hidden node operates on signals received from the input nodes or hidden nodes at the preceding layer, and produces an activation value that is transmitted to the next layer. The final layer is called the output layer and processes the activation values from its preceding layer to produce predictions of output variables. For binary classification, the output layer contains a single node representing the binary class label. In this architecture, since the signals are propagated only in the forward direction from the input layer to the output layer, such networks are also called feedforward neural networks.

A major difference between multi-layer neural networks and perceptrons is the inclusion of hidden layers, which dramatically improves their ability to represent arbitrarily complex decision boundaries. For example, consider the XOR problem described in the previous section. The instances can be classified using two hyperplanes that partition the input space into their respective classes, as shown in Figure 4.24(a). Because a perceptron can create only one hyperplane, it cannot find the optimal solution. However, this problem can be addressed by using a hidden layer consisting of two nodes, as shown in Figure 4.24(b). Intuitively, we can think of each hidden node as a perceptron that tries to construct one of the two hyperplanes, while the output node simply combines the results of the perceptrons to yield the decision boundary shown in Figure 4.24(a).

Figure 4.24. A two-layer neural network for the XOR problem.

The hidden nodes can be viewed as learning latent representations or features that are useful for distinguishing between the classes. While the first hidden layer directly operates on the input attributes and thus captures simpler features, the subsequent hidden layers are able to combine them and construct more complex features. From this perspective, multi-layer neural networks learn a hierarchy of features at different levels of abstraction that are finally combined at the output nodes to make predictions. Further, there are combinatorially many ways we can combine the features learned at the hidden layers of an ANN, making them highly expressive. This property chiefly distinguishes ANN from other classification models such as decision trees, which can learn partitions in the attribute space but are unable to combine them in exponential ways.

Figure 4.25. Schematic illustration of the parameters of an ANN model with (L − 1) hidden layers.

To understand the nature of computations happening at the hidden and output nodes of an ANN, consider the i-th node at the l-th layer of the network (l > 0), where the layers are numbered from 0 (input layer) to L (output layer), as shown in Figure 4.25. The activation value generated at this node, a_i^l, can be represented as a function of the inputs received from nodes at the preceding layer. Let w_{ij}^l represent the weight of the connection from the j-th node at layer (l − 1) to the i-th node at layer l. Similarly, let us denote the bias term at this node as b_i^l. The activation value a_i^l can then be expressed as

a_i^l = f(z_i^l) = f(∑_j w_{ij}^l a_j^{l−1} + b_i^l),

where z is called the linear predictor and f(·) is the activation function that converts z to a. Further, note that, by definition, a_j^0 = x_j at the input layer and a^L = ŷ at the output node.

There are a number of alternative activation functions apart from the sign function that can be used in multi-layer neural networks. Some examples include linear, sigmoid (logistic), and hyperbolic tangent functions, as shown in Figure 4.26. These functions are able to produce real-valued and nonlinear activation values. Among these activation functions, the sigmoid σ(·) has been widely used in many ANN models, although the use of other types of activation functions in the context of deep learning will be discussed in Section 4.8. We can thus represent a_i^l as

a_i^l = σ(z_i^l) = 1/(1 + e^{−z_i^l}).    (4.50)

Figure 4.26. Types of activation functions used in multi-layer neural networks.

Learning Model Parameters
The weights and bias terms (w, b) of the ANN model are learned during training so that the predictions on training instances match the true labels. This is achieved by using a loss function

E(w, b) = ∑_{k=1}^{n} Loss(y_k, ŷ_k),    (4.51)

where isthetruelabelofthekthtraininginstanceand isequalto ,producedbyusing .Atypicalchoiceofthelossfunctionisthesquaredlossfunction:.

NotethatE( ,b)isafunctionofthemodelparameters( ,b)becausetheoutputactivationvalue dependsontheweightsandbiasterms.Weareinterestedinchoosing( ,b)thatminimizesthetraininglossE( ,b).Unfortunately,becauseoftheuseofhiddennodeswithnonlinearactivationfunctions,E( ,b)isnotaconvexfunctionof andb,whichmeansthatE( ,b)canhavelocalminimathatarenotgloballyoptimal.However,wecanstillapplystandardoptimizationtechniquessuchasthegradientdescentmethodtoarriveatalocallyoptimalsolution.Inparticular,theweightparameter andthebiasterm canbeiterativelyupdatedusingthefollowingequations:

where isahyper-parameterknownasthelearningrate.Theintuitionbehindthisequationistomovetheweightsinadirectionthatreducesthetrainingloss.Ifwearriveataminimausingthisprocedure,thegradientofthetraininglosswillbecloseto0,eliminatingthesecondtermandresultingintheconvergenceofweights.TheweightsarecommonlyinitializedwithvaluesdrawnrandomlyfromaGaussianorauniformdistribution.

AnecessarytoolforupdatingweightsinEquation4.53 istocomputethepartialderivativeofEwithrespectto .Thiscomputationisnontrivialespeciallyathiddenlayers ,since doesnotdirectlyaffect (and

yk y^k aLxk

Loss(yk,y^k)=(yk,y^k)2. (4.52)

aL

wijl bil

wijl←wijl−λ∂E∂wijl, (4.53)

bil←bil−λ∂E∂bil, (4.54)

λ

wijl(l<L) wijl y^=aL

hence the training loss), but has complex chains of influences via activation values at subsequent layers. To address this problem, a technique known as backpropagation was developed, which propagates the derivatives backward from the output layer to the hidden layers. This technique can be described as follows.

Recall that the training loss $E$ is simply the sum of individual losses at training instances. Hence the partial derivative of $E$ can be decomposed as a sum of partial derivatives of individual losses:

$$\frac{\partial E}{\partial w_{ij}^l} = \sum_{k=1}^{n} \frac{\partial\, \mathrm{Loss}(y_k, \hat{y}_k)}{\partial w_{ij}^l}.$$

To simplify discussions, we will consider only the derivatives of the loss at the $k$th training instance, which will be generically represented as $\mathrm{Loss}(y, a^L)$. By using the chain rule of differentiation, we can represent the partial derivative of the loss with respect to $w_{ij}^l$ as

$$\frac{\partial\, \mathrm{Loss}}{\partial w_{ij}^l} = \frac{\partial\, \mathrm{Loss}}{\partial a_i^l} \times \frac{\partial a_i^l}{\partial z_i^l} \times \frac{\partial z_i^l}{\partial w_{ij}^l}. \qquad (4.55)$$

The last term of the previous equation can be written as

$$\frac{\partial z_i^l}{\partial w_{ij}^l} = \frac{\partial\big(\sum_j w_{ij}^l a_j^{l-1} + b_i^l\big)}{\partial w_{ij}^l} = a_j^{l-1}.$$

Also, if we use the sigmoid activation function, then

$$\frac{\partial a_i^l}{\partial z_i^l} = \frac{\partial \sigma(z_i^l)}{\partial z_i^l} = a_i^l (1 - a_i^l).$$

Equation 4.55 can thus be simplified as

$$\frac{\partial\, \mathrm{Loss}}{\partial w_{ij}^l} = \delta_i^l \times a_i^l (1 - a_i^l) \times a_j^{l-1}, \quad \text{where } \delta_i^l = \frac{\partial\, \mathrm{Loss}}{\partial a_i^l}. \qquad (4.56)$$

A similar formula for the partial derivative with respect to the bias term $b_i^l$ is given by

$$\frac{\partial\, \mathrm{Loss}}{\partial b_i^l} = \delta_i^l \times a_i^l (1 - a_i^l). \qquad (4.57)$$

Hence, to compute the partial derivatives, we only need to determine $\delta_i^l$. Using a squared loss function, we can easily write $\delta^L$ at the output node as

$$\delta^L = \frac{\partial\, \mathrm{Loss}}{\partial a^L} = \frac{\partial (y - a^L)^2}{\partial a^L} = 2(a^L - y). \qquad (4.58)$$

However, the approach for computing $\delta_j^l$ at hidden nodes $(l < L)$ is more involved. Notice that $a_j^l$ affects the activation values $a_i^{l+1}$ of all nodes at the next layer, which in turn influence the loss. Hence, again using the chain rule of differentiation, $\delta_j^l$ can be represented as

$$\delta_j^l = \frac{\partial\, \mathrm{Loss}}{\partial a_j^l} = \sum_i \Big(\frac{\partial\, \mathrm{Loss}}{\partial a_i^{l+1}} \times \frac{\partial a_i^{l+1}}{\partial z_i^{l+1}} \times \frac{\partial z_i^{l+1}}{\partial a_j^l}\Big) = \sum_i \Big(\delta_i^{l+1} \times a_i^{l+1}(1 - a_i^{l+1}) \times w_{ij}^{l+1}\Big). \qquad (4.59)$$

The previous equation provides a concise representation of the $\delta_j^l$ values at layer $l$ in terms of the $\delta_i^{l+1}$ values computed at layer $l+1$. Hence, proceeding backward from the output layer $L$ to the hidden layers, we can recursively apply Equation 4.59 to compute $\delta_i^l$ at every hidden node. $\delta_i^l$ can then be used in Equations 4.56 and 4.57 to compute the partial derivatives of the loss with respect to $w_{ij}^l$ and $b_i^l$, respectively. Algorithm 4.4 summarizes the complete approach for learning the model parameters of ANN using backpropagation and the gradient descent method.

Algorithm 4.4 Learning ANN using backpropagation and gradient descent.
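To make the preceding derivation concrete, the following is a minimal NumPy sketch of learning a two-layer network with backpropagation and gradient descent on the XOR instances of Figure 4.24, assuming sigmoid activations, the squared loss of Equation 4.52, and per-instance updates. The variable names, learning rate, and epoch count are illustrative choices rather than values from the text, and training may still end in a local minimum, as discussed above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# XOR training instances (Figure 4.24) and their labels.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

rng = np.random.default_rng(0)
# One hidden layer with two nodes (layer 1) and one output node (layer 2),
# with weights initialized from a Gaussian distribution.
W1, b1 = rng.normal(size=(2, 2)), np.zeros(2)
W2, b2 = rng.normal(size=2), 0.0
lam = 0.5  # learning rate (lambda in Equations 4.53-4.54)

for epoch in range(5000):
    for xk, yk in zip(X, y):
        # Forward pass: a^l = sigma(z^l), as in Equation 4.50.
        z1 = W1 @ xk + b1;  a1 = sigmoid(z1)
        z2 = W2 @ a1 + b2;  a2 = sigmoid(z2)
        # Backward pass: delta at the output node (Equation 4.58, squared loss),
        # then deltas at the hidden layer via Equation 4.59.
        d2 = 2 * (a2 - yk)
        d1 = d2 * a2 * (1 - a2) * W2
        # Gradient descent updates using Equations 4.53-4.57.
        W2 -= lam * d2 * a2 * (1 - a2) * a1
        b2 -= lam * d2 * a2 * (1 - a2)
        W1 -= lam * np.outer(d1 * a1 * (1 - a1), xk)
        b1 -= lam * d1 * a1 * (1 - a1)

# Predictions on the four instances; ideally close to [0, 1, 1, 0],
# although some initializations may converge to a poorer local minimum.
print(np.round(sigmoid(W2 @ sigmoid(X @ W1.T + b1).T + b2), 2))
```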

4.7.3 Characteristics of ANN

1. Multi-layer neural networks with at least one hidden layer are universal approximators; i.e., they can be used to approximate any target function. They are thus highly expressive and can be used to learn complex decision boundaries in diverse applications. ANN can also be used for multiclass classification and regression problems, by appropriately modifying the output layer. However, the high model complexity of classical ANN models makes them susceptible to overfitting, which can be overcome to some extent by using the deep learning techniques discussed in Section 4.8.3.

2. ANN provides a natural way to represent a hierarchy of features at multiple levels of abstraction. The outputs at the final hidden layer of the ANN model thus represent features at the highest level of abstraction that are most useful for classification. These features can also be used as inputs in other supervised classification models, e.g., by replacing the output node of the ANN by any generic classifier.

3. ANN represents complex high-level features as compositions of simpler lower-level features that are easier to learn. This provides ANN the ability to gradually increase the complexity of representations, by adding more hidden layers to the architecture. Further, since simpler features can be combined in combinatorial ways, the number of complex features learned by ANN is much larger than in traditional classification models. This is one of the main reasons behind the high expressive power of deep neural networks.

4. ANN can easily handle irrelevant attributes, by using zero weights for attributes that do not help in improving the training loss. Also, redundant attributes receive similar weights and do not degrade the quality of the classifier. However, if the number of irrelevant or redundant attributes is large, the learning of the ANN model may suffer from overfitting, leading to poor generalization performance.

5. Since the learning of the ANN model involves minimizing a non-convex function, the solutions obtained by gradient descent are not guaranteed to be globally optimal. For this reason, ANN has a tendency to get stuck in local minima, a challenge that can be addressed by using the deep learning techniques discussed in Section 4.8.4.

6. Training an ANN is a time-consuming process, especially when the number of hidden nodes is large. Nevertheless, test examples can be classified rapidly.

7. Just like logistic regression, ANN can learn in the presence of interacting variables, since the model parameters are jointly learned over all variables together. However, ANN cannot handle instances with missing values in the training or testing phase.

4.8 Deep Learning

As described above, the use of hidden layers in ANN is based on the general belief that complex high-level features can be constructed by combining simpler lower-level features. Typically, the greater the number of hidden layers, the deeper the hierarchy of features learned by the network. This motivates the learning of ANN models with long chains of hidden layers, known as deep neural networks. In contrast to "shallow" neural networks that involve only a small number of hidden layers, deep neural networks are able to represent features at multiple levels of abstraction and often require far fewer nodes per layer to achieve generalization performance similar to shallow networks.

Despitethehugepotentialinlearningdeepneuralnetworks,ithasremainedchallengingtolearnANNmodelswithalargenumberofhiddenlayersusingclassicalapproaches.Apartfromreasonsrelatedtolimitedcomputationalresourcesandhardwarearchitectures,therehavebeenanumberofalgorithmicchallengesinlearningdeepneuralnetworks.First,learningadeepneuralnetworkwithlowtrainingerrorhasbeenadauntingtaskbecauseofthesaturationofsigmoidactivationfunctions,resultinginslowconvergenceofgradientdescent.Thisproblembecomesevenmoreseriousaswemoveawayfromtheoutputnodetothehiddenlayers,becauseofthecompoundedeffectsofsaturationatmultiplelayers,knownasthevanishinggradientproblem.Becauseofthisreason,classicalANNmodelshavesufferedfromslowandineffectivelearning,leadingtopoortrainingandtestperformance.Second,thelearningofdeepneuralnetworksisquitesensitivetotheinitialvaluesofmodelparameters,chieflybecauseofthenon-convexnatureoftheoptimizationfunctionandtheslowconvergenceofgradientdescent.Third,deepneuralnetworkswithalargenumberofhiddenlayershavehighmodel

complexity,makingthemsusceptibletooverfitting.Hence,evenifadeepneuralnetworkhasbeentrainedtoshowlowtrainingerror,itcanstillsufferfrompoorgeneralizationperformance.

Thesechallengeshavedeterredprogressinbuildingdeepneuralnetworksforseveraldecadesanditisonlyrecentlythatwehavestartedtounlocktheirimmensepotentialwiththehelpofanumberofadvancesbeingmadeintheareaofdeeplearning.Althoughsomeoftheseadvanceshavebeenaroundforsometime,theyhaveonlygainedmainstreamattentioninthelastdecade,withdeepneuralnetworkscontinuallybeatingrecordsinvariouscompetitionsandsolvingproblemsthatweretoodifficultforotherclassificationapproaches.

Therearetwofactorsthathaveplayedamajorroleintheemergenceofdeeplearningtechniques.First,theavailabilityoflargerlabeleddatasets,e.g.,theImageNetdatasetcontainsmorethan10millionlabeledimages,hasmadeitpossibletolearnmorecomplexANNmodelsthaneverbefore,withoutfallingeasilyintothetrapsofmodeloverfitting.Second,advancesincomputationalabilitiesandhardwareinfrastructures,suchastheuseofgraphicalprocessingunits(GPU)fordistributedcomputing,havegreatlyhelpedinexperimentingwithdeepneuralnetworkswithlargerarchitecturesthatwouldnothavebeenfeasiblewithtraditionalresources.

Inadditiontotheprevioustwofactors,therehavebeenanumberofalgorithmicadvancementstoovercomethechallengesfacedbyclassicalmethodsinlearningdeepneuralnetworks.Someexamplesincludetheuseofmoreresponsivecombinationsoflossfunctionsandactivationfunctions,betterinitializationofmodelparameters,novelregularizationtechniques,moreagilearchitecturedesigns,andbettertechniquesformodellearningandhyper-parameterselection.Inthefollowing,wedescribesomeofthedeeplearningadvancesmadetoaddressthechallengesinlearningdeepneural

networks.FurtherdetailsonrecentdevelopmentsindeeplearningcanbeobtainedfromtheBibliographicNotes.

4.8.1 Using Synergistic Loss Functions

One of the major realizations leading to deep learning has been the importance of choosing appropriate combinations of activation and loss functions. Classical ANN models commonly made use of the sigmoid activation function at the output layer, because of its ability to produce real-valued outputs between 0 and 1, which was combined with a squared loss objective to perform gradient descent. It was soon noticed that this particular combination of activation and loss function resulted in the saturation of output activation values, which can be described as follows.

Saturation of Outputs

Although the sigmoid has been widely used as an activation function, it easily saturates at high and low values of inputs that are far away from 0. Observe from Figure 4.27(a) that $\sigma(z)$ shows variance in its values only when $z$ is close to 0. For this reason, $\partial\sigma(z)/\partial z$ is non-zero for only a small range of $z$ around 0, as shown in Figure 4.27(b). Since $\partial\sigma(z)/\partial z$ is one of the components in the gradient of the loss (see Equation 4.55), we get a diminishing gradient value when the activation values are far from 0.

Figure 4.27. Plots of the sigmoid function and its derivative.

To illustrate the effect of saturation on the learning of model parameters at the output node, consider the partial derivative of the loss with respect to the weight $w_j^L$ at the output node. Using the squared loss function, we can write this as

$$\frac{\partial\, \mathrm{Loss}}{\partial w_j^L} = 2(a^L - y) \times \sigma(z^L)\big(1 - \sigma(z^L)\big) \times a_j^{L-1}. \qquad (4.60)$$

In the previous equation, notice that when $z^L$ is highly negative, $\sigma(z^L)$ (and hence the gradient) is close to 0. On the other hand, when $z^L$ is highly positive, $(1 - \sigma(z^L))$ becomes close to 0, nullifying the value of the gradient. Hence, irrespective of whether the prediction $a^L$ matches the true label $y$ or not, the gradient of the loss with respect to the weights is close to 0 whenever $z^L$ is highly positive or negative. This causes an unnecessarily slow convergence of the model parameters of the ANN model, often resulting in poor learning.

Note that it is the combination of the squared loss function and the sigmoid activation function at the output node that together results in diminishing gradients (and thus poor learning) upon saturation of outputs. It is thus important to choose a synergistic combination of loss function and activation function that does not suffer from the saturation of outputs.

Cross entropy loss function

The cross entropy loss function, which was described in the context of logistic regression in Section 4.6.2, can significantly avoid the problem of saturating outputs when used in combination with the sigmoid activation function. The cross entropy loss function of a real-valued prediction $\hat{y} \in (0, 1)$ on a data instance with binary label $y \in \{0, 1\}$ can be defined as

$$\mathrm{Loss}(y, \hat{y}) = -y \log(\hat{y}) - (1 - y)\log(1 - \hat{y}), \qquad (4.61)$$

where $\log$ represents the natural logarithm (to base $e$) and $0\log(0) = 0$ for convenience. The cross entropy function has foundations in information theory and measures the amount of disagreement between $y$ and $\hat{y}$. The partial derivative of this loss function with respect to $\hat{y} = a^L$ can be given as

$$\delta^L = \frac{\partial\, \mathrm{Loss}}{\partial a^L} = -\frac{y}{a^L} + \frac{1 - y}{1 - a^L} = \frac{a^L - y}{a^L(1 - a^L)}. \qquad (4.62)$$

Using this value of $\delta^L$ in Equation 4.56, we can obtain the partial derivative of the loss with respect to the weight $w_j^L$ at the output node as

$$\frac{\partial\, \mathrm{Loss}}{\partial w_j^L} = \frac{a^L - y}{a^L(1 - a^L)} \times a^L(1 - a^L) \times a_j^{L-1} = (a^L - y) \times a_j^{L-1}. \qquad (4.63)$$

Notice the simplicity of the previous formula using the cross entropy loss function. The partial derivatives of the loss with respect to the weights at the output node depend only on the difference between the prediction $a^L$ and the true label $y$. In contrast to Equation 4.60, it does not involve terms such as $\sigma(z^L)(1 - \sigma(z^L))$ that can be impacted by saturation of $z^L$. Hence, the gradients are high whenever $(a^L - y)$ is large, promoting effective learning of the model parameters at the output node. This has been a major breakthrough in the learning of modern ANN models, and it is now a common practice to use the cross entropy loss function with sigmoid activations at the output node.
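The contrast between Equations 4.60 and 4.63 can be checked numerically. The short sketch below, with hypothetical values, evaluates both gradients at a saturated output node that confidently predicts the wrong class; only the cross entropy gradient remains large.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A confidently wrong output node: true label y = 1, but z^L is highly
# negative, so a^L = sigma(z^L) is close to 0 (a saturated prediction).
y, zL, a_prev = 1.0, -8.0, 1.0
aL = sigmoid(zL)

grad_squared = 2 * (aL - y) * aL * (1 - aL) * a_prev   # Equation 4.60
grad_xent    = (aL - y) * a_prev                       # Equation 4.63

print(grad_squared)  # about -6.7e-04: nearly vanishes despite a large error
print(grad_xent)     # about -1.0    : stays large, so learning proceeds
```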

4.8.2 Using Responsive Activation Functions

Even though the cross entropy loss function helps in overcoming the problem of saturating outputs, it still does not solve the problem of saturation at hidden layers, arising due to the use of sigmoid activation functions at hidden nodes. In fact, the effect of saturation on the learning of model parameters is even more aggravated at hidden layers, a problem known as the vanishing gradient problem. In the following, we describe the vanishing gradient problem and the use of a more responsive activation function, called the rectified linear unit (ReLU), to overcome this problem.

Vanishing Gradient Problem

The impact of saturating activation values on the learning of model parameters increases at deeper hidden layers that are farther away from the output node. Even if the activation in the output layer does not saturate, the repeated multiplications performed as we backpropagate the gradients from the output layer to the hidden layers may lead to decreasing gradients in the hidden layers. This is called the vanishing gradient problem, which has been one of the major hindrances in learning deep neural networks.

To illustrate the vanishing gradient problem, consider an ANN model that consists of a single node at every hidden layer of the network, as shown in Figure 4.28. This simplified architecture involves a single chain of hidden nodes, where a single weighted link $w^l$ connects the node at layer $l-1$ to the node at layer $l$. Using Equations 4.56 and 4.59, we can represent the partial derivative of the loss with respect to $w^l$ as

$$\frac{\partial\, \mathrm{Loss}}{\partial w^l} = \delta^l \times a^l(1 - a^l) \times a^{l-1}, \quad \text{where } \delta^l = 2(a^L - y) \times \prod_{r=l}^{L-1}\Big(a^{r+1}(1 - a^{r+1}) \times w^{r+1}\Big). \qquad (4.64)$$

Notice that if any of the linear predictors $z^{r+1}$ saturates at subsequent layers, then the term $a^{r+1}(1 - a^{r+1})$ becomes close to 0, thus diminishing the overall gradient. The saturation of activations thus gets compounded and has multiplicative effects on the gradients at hidden layers, making them highly unstable and thus unsuitable for use with gradient descent. Even though the previous discussion only pertains to the simplified architecture involving a single chain of hidden nodes, a similar argument can be made for any generic ANN architecture involving multiple chains of hidden nodes. Note that the vanishing gradient problem primarily arises because of the use of the sigmoid activation function at hidden nodes, which is known to easily saturate, especially after repeated multiplications.

Figure 4.28. An example of an ANN model with only one node at every hidden layer.

Figure 4.29. Plot of the rectified linear unit (ReLU) activation function.

Rectified Linear Units (ReLU)

To overcome the vanishing gradient problem, it is important to use an activation function $f(z)$ at the hidden nodes that provides a stable and significant value of the gradient whenever a hidden node is active, i.e., $z > 0$. This is achieved by using rectified linear units (ReLU) as activation functions at hidden nodes, which can be defined as

$$a = f(z) = \begin{cases} z, & \text{if } z > 0, \\ 0, & \text{otherwise.} \end{cases} \qquad (4.65)$$

The idea of ReLU has been inspired by biological neurons, which are either in an inactive state $(f(z) = 0)$ or show an activation value proportional to the input. Figure 4.29 shows a plot of the ReLU function. We can see that it is linear with respect to $z$ when $z > 0$. Hence, the gradient of the activation value with respect to $z$ can be written as

$$\frac{\partial a}{\partial z} = \begin{cases} 1, & \text{if } z > 0, \\ 0, & \text{if } z < 0. \end{cases} \qquad (4.66)$$

Although $f(z)$ is not differentiable at 0, it is common practice to use $\partial a/\partial z = 0$ when $z = 0$. Since the gradient of the ReLU activation function is equal to 1 whenever $z > 0$, it avoids the problem of saturation at hidden nodes, even after repeated multiplications. Using ReLU, the partial derivatives of the loss with respect to the weight and bias parameters can be given by

$$\frac{\partial\, \mathrm{Loss}}{\partial w_{ij}^l} = \delta_i^l \times I(z_i^l) \times a_j^{l-1}, \qquad (4.67)$$

$$\frac{\partial\, \mathrm{Loss}}{\partial b_i^l} = \delta_i^l \times I(z_i^l), \quad \text{where } \delta_j^l = \sum_i \Big(\delta_i^{l+1} \times I(z_i^{l+1}) \times w_{ij}^{l+1}\Big) \text{ and } I(z) = \begin{cases} 1, & \text{if } z > 0, \\ 0, & \text{otherwise.} \end{cases} \qquad (4.68)$$

Notice that ReLU shows a linear behavior in the activation values whenever a node is active, as compared to the nonlinear properties of the sigmoid function. This linearity promotes better flows of gradients during backpropagation, and thus simplifies the learning of ANN model parameters. ReLU is also highly responsive at large values of $z$ away from 0, as opposed to the sigmoid activation function, making it more suitable for gradient descent. These differences give ReLU a major advantage over the sigmoid function. Indeed, ReLU is used as the preferred choice of activation function at hidden layers in most modern ANN models.
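A minimal NumPy sketch of the ReLU activation and its gradient (Equations 4.65 and 4.66), treating the gradient at $z = 0$ as 0, is given below.

```python
import numpy as np

def relu(z):
    """Rectified linear unit, Equation 4.65: max(0, z)."""
    return np.maximum(0.0, z)

def relu_grad(z):
    """Gradient of ReLU, Equation 4.66 (taken to be 0 at z = 0)."""
    return (z > 0).astype(float)

z = np.array([-5.0, -0.5, 0.0, 0.5, 5.0])
print(relu(z))       # [0.  0.  0.  0.5 5. ]
print(relu_grad(z))  # [0. 0. 0. 1. 1.]  -- no saturation for z > 0
```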

4.8.3 Regularization

A major challenge in learning deep neural networks is the high model complexity of ANN models, which grows with the addition of hidden layers in the network. This can become a serious concern, especially when the training set is small, due to the phenomenon of model overfitting. To overcome this

challenge, it is important to use techniques that can help in reducing the complexity of the learned model, known as regularization techniques. Classical approaches for learning ANN models did not have an effective way to promote regularization of the learned model parameters. Hence, they had often been sidelined by other classification methods, such as support vector machines (SVM), which have in-built regularization mechanisms. (SVM will be discussed in more detail in Section 4.9.)

One of the major advancements in deep learning has been the development of novel regularization techniques for ANN models that are able to offer significant improvements in generalization performance. In the following, we discuss one of the regularization techniques for ANN, known as the dropout method, that has gained a lot of attention in several applications.

Dropout

The main objective of dropout is to avoid the learning of spurious features at hidden nodes, occurring due to model overfitting. It uses the basic intuition that spurious features often "co-adapt" themselves such that they show good training performance only when used in highly selective combinations. On the other hand, relevant features can be used in a diversity of feature combinations and hence are quite resilient to the removal or modification of other features. The dropout method uses this intuition to break complex "co-adaptations" in the learned features by randomly dropping input and hidden nodes in the network during training.

Dropout belongs to a family of regularization techniques that use the criterion of resilience to random perturbations as a measure of the robustness (and hence, simplicity) of a model. For example, one approach to regularization is to inject noise in the input attributes of the training set and learn a model with the noisy training instances. If a feature learned from the training data is indeed generalizable, it should not be affected by the addition of noise. Dropout can be viewed as a similar regularization approach that perturbs the information content of the training set not only at the level of attributes but also at multiple levels of abstraction, by dropping input and hidden nodes.

The dropout method draws inspiration from the biological process of gene swapping in sexual reproduction, where half of the genes from both parents are combined together to create the genes of the offspring. This favors the selection of parent genes that are not only useful but can also inter-mingle with diverse combinations of genes coming from the other parent. On the other hand, co-adapted genes that function only in highly selective combinations are soon eliminated in the process of evolution. This idea is used in the dropout method for eliminating spurious co-adapted features. A simplified description of the dropout method is provided in the rest of this section.

Figure 4.30. Examples of sub-networks generated in the dropout method using $\gamma = 0.5$.

Let $(\mathbf{w}^k, \mathbf{b}^k)$ represent the model parameters of the ANN model at the $k$th iteration of the gradient descent method. At every iteration, we randomly select a fraction $\gamma$ of input and hidden nodes to be dropped from the network, where $\gamma \in (0, 1)$ is a hyper-parameter that is typically chosen to be 0.5. The weighted links and bias terms involving the dropped nodes are then eliminated, resulting in a "thinned" sub-network of smaller size. The model parameters of the sub-network $(\mathbf{w}_s^k, \mathbf{b}_s^k)$ are then updated by computing activation values and performing backpropagation on this smaller sub-network. These updated values are then added back in the original network to obtain the updated model parameters, $(\mathbf{w}^{k+1}, \mathbf{b}^{k+1})$, to be used in the next iteration.

Figure 4.30 shows some examples of sub-networks that can be generated at different iterations of the dropout method, by randomly dropping input and hidden nodes. Since every sub-network has a different architecture, it is difficult to learn complex co-adaptations in the features that can result in overfitting. Instead, the features at the hidden nodes are learned to be more agile to random modifications in the network structure, thus improving their generalization ability. The model parameters are updated using a different random sub-network at every iteration, till the gradient descent method converges.

Let $(\mathbf{w}^{k_{\max}}, \mathbf{b}^{k_{\max}})$ denote the model parameters at the last iteration of the gradient descent method. These parameters are finally scaled down by a factor of $(1 - \gamma)$ to produce the weights and bias terms of the final ANN model, as follows:

$$(\mathbf{w}^*, \mathbf{b}^*) = \big((1 - \gamma) \times \mathbf{w}^{k_{\max}},\; (1 - \gamma) \times \mathbf{b}^{k_{\max}}\big).$$

We can now use the complete neural network with model parameters $(\mathbf{w}^*, \mathbf{b}^*)$ for testing. The dropout method has been shown to provide significant improvements in the generalization performance of ANN models in a number of applications. It is computationally cheap and can be applied in combination with any of the other deep learning techniques. It also has a number of similarities with a widely-used ensemble learning method known as bagging, which learns multiple models using random subsets of the training set, and then uses the average output of all the models to make predictions. (Bagging will be presented in more detail later in Section 4.10.4.) In a similar vein, it can be shown that the predictions of the final network learned using dropout approximate the average output of all possible $2^n$ sub-networks that can be formed using $n$ nodes. This is one of the reasons behind the superior regularization abilities of dropout.
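The following sketch illustrates one way the per-iteration dropping of nodes and the final scaling by $(1 - \gamma)$ could be implemented in NumPy; the helper function and the example activations are hypothetical, and a full implementation would also restrict backpropagation to the thinned sub-network.

```python
import numpy as np

rng = np.random.default_rng(1)
gamma = 0.5  # fraction of input and hidden nodes dropped at every iteration

def dropout_forward(a, gamma, training):
    """Pass one layer's activations through dropout.

    During training, a random fraction gamma of the nodes is dropped
    (their activations set to 0), yielding a thinned sub-network.
    At test time no mask is applied; instead, the learned weights are
    scaled by (1 - gamma) once after training, as described in the text.
    """
    if not training:
        return a
    keep_mask = (rng.random(a.shape) >= gamma).astype(float)
    return a * keep_mask

a_hidden = np.array([0.7, 0.1, 0.9, 0.4])
print(dropout_forward(a_hidden, gamma, training=True))   # e.g., [0.7 0.  0.9 0. ]
print(dropout_forward(a_hidden, gamma, training=False))  # [0.7 0.1 0.9 0.4]

# After the last gradient descent iteration, scale the learned weights by
# (1 - gamma) before using the complete network for prediction.
W_final = np.array([[1.2, -0.4], [0.3, 0.8]])
W_test = (1 - gamma) * W_final
```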

4.8.4 Initialization of Model Parameters

Because of the non-convex nature of the loss function used by ANN models, it is possible to get stuck in locally optimal but globally inferior solutions. Hence, the initial choice of model parameter values plays a significant role in the learning of ANN by gradient descent. The impact of poor initialization is even more aggravated when the model is complex, the network architecture is deep, or the classification task is difficult. In such cases, it is often advisable to first learn a simpler model for the problem, e.g., using a single hidden layer, and then incrementally increase the complexity of the model, e.g., by adding more hidden layers. An alternate approach is to train the model for a simpler task and then use the learned model parameters as initial parameter choices in the learning of the original task. The process of initializing ANN model parameters before the actual training process is known as pretraining.

Pretraining helps in initializing the model to a suitable region in the parameter space that would otherwise be inaccessible by random initialization. Pretraining also reduces the variance in the model parameters by fixing the starting point of gradient descent, thus reducing the chances of overfitting due to multiple comparisons. The models learned by pretraining are thus more consistent and provide better generalization performance.

Supervised Pretraining

A common approach for pretraining is to incrementally train the ANN model in a layer-wise manner, by adding one hidden layer at a time. This approach, known as supervised pretraining, ensures that the parameters learned at every layer are obtained by solving a simpler problem, rather than learning all model parameters together. These parameter values thus provide a good choice for initializing the ANN model. The approach for supervised pretraining can be briefly described as follows.

We start the supervised pretraining process by considering a reduced ANN model with only a single hidden layer. By applying gradient descent on this simple model, we are able to learn the model parameters of the first hidden layer. At the next run, we add another hidden layer to the model and apply gradient descent to learn the parameters of the newly added hidden layer, while keeping the parameters of the first layer fixed. This procedure is recursively applied such that while learning the parameters of the $l$th hidden layer, we consider a reduced model with only $l$ hidden layers, whose first $(l-1)$ hidden layers are not updated on the $l$th run but are instead fixed using pretrained values from previous runs. In this way, we are able to learn the model parameters of all $(L-1)$ hidden layers. These pretrained values are used to initialize the hidden layers of the final ANN model, which is fine-tuned by applying a final round of gradient descent over all the layers.

Unsupervised Pretraining

Supervised pretraining provides a powerful way to initialize model parameters, by gradually growing the model complexity from shallower to deeper networks. However, supervised pretraining requires a sufficient number of labeled training instances for effective initialization of the ANN model. An alternate pretraining approach is unsupervised pretraining, which initializes model parameters by using unlabeled instances that are often abundantly available. The basic idea of unsupervised pretraining is to initialize the ANN model in such a way that the learned features capture the latent structure in the unlabeled data.

Figure 4.31. The basic architecture of a single-layer autoencoder.

Unsupervised pretraining relies on the assumption that learning the distribution of the input data can indirectly help in learning the classification model. It is most helpful when the number of labeled examples is small and the features for the supervised problem bear resemblance to the factors generating the input data. Unsupervised pretraining can be viewed as a different form of regularization, where the focus is not explicitly toward finding simpler features but instead toward finding features that can best explain the input data. Historically, unsupervised pretraining has played an important role in reviving the area of deep learning, by making it possible to train any generic deep neural network without requiring specialized architectures.

Use of Autoencoders

One simple and commonly used approach for unsupervised pretraining is to use an unsupervised ANN model known as an autoencoder. The basic architecture of an autoencoder is shown in Figure 4.31. An autoencoder attempts to learn a reconstruction of the input data by mapping the attributes $\mathbf{x}$ to latent features, and then re-projecting them back to the original attribute space to create the reconstruction $\hat{\mathbf{x}}$. The latent features are represented using a hidden layer of nodes, while the input and output layers represent the attributes and contain the same number of nodes. During training, the goal is to learn an autoencoder model that provides the lowest reconstruction error, $RE(\mathbf{x}, \hat{\mathbf{x}})$, on all input data instances. A typical choice of the reconstruction error is the squared loss function:

$$RE(\mathbf{x}, \hat{\mathbf{x}}) = \|\mathbf{x} - \hat{\mathbf{x}}\|^2.$$

The model parameters of the autoencoder can be learned by using a similar gradient descent method as the one used for learning supervised ANN models for classification. The key difference is the use of the reconstruction error on all training instances as the training loss. Autoencoders that have multiple hidden layers are known as stacked autoencoders.

Autoencoders are able to capture complex representations of the input data by the use of hidden nodes. However, if the number of hidden nodes is large, it is possible for an autoencoder to learn the identity relationship, where the input $\mathbf{x}$ is just copied and returned as the output $\hat{\mathbf{x}}$, resulting in a trivial solution. For example, if we use as many hidden nodes as the number of attributes, then it is possible for every hidden node to copy an attribute and simply pass it along to an output node, without extracting any useful information. To avoid this problem, it is common practice to keep the number of hidden nodes smaller than the number of input attributes. This forces the autoencoder to learn a compact and useful encoding of the input data, similar to a dimensionality reduction technique. An alternate approach is to corrupt the input instances by adding random noise, and then learn the autoencoder to reconstruct the original instance from the noisy input. This approach is known as the denoising autoencoder, which offers strong regularization capabilities and is often used to learn complex features even in the presence of a large number of hidden nodes.
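As a rough illustration of these ideas, the sketch below trains a single-hidden-layer autoencoder by gradient descent on the squared reconstruction error $RE(\mathbf{x}, \hat{\mathbf{x}})$, assuming a small synthetic data set, two hidden nodes (fewer than the four attributes), a sigmoid encoder, and a linear decoder; all of these choices are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 4))        # 100 unlabeled instances with 4 attributes
n_hidden = 2                    # fewer hidden nodes than attributes

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Encoder (W1, b1) maps x to latent features; decoder (W2, b2) maps them to x_hat.
W1, b1 = rng.normal(scale=0.1, size=(n_hidden, 4)), np.zeros(n_hidden)
W2, b2 = rng.normal(scale=0.1, size=(4, n_hidden)), np.zeros(4)
lam = 0.1

for epoch in range(2000):
    Z = sigmoid(X @ W1.T + b1)   # latent features at the hidden layer
    X_hat = Z @ W2.T + b2        # linear reconstruction of the attributes
    err = X_hat - X              # gradient of RE (up to a constant factor)
    # Backpropagate the reconstruction error through both layers.
    W2 -= lam * err.T @ Z / len(X)
    b2 -= lam * err.mean(axis=0)
    dZ = (err @ W2) * Z * (1 - Z)
    W1 -= lam * dZ.T @ X / len(X)
    b1 -= lam * dZ.mean(axis=0)

print(np.mean(np.sum((X - X_hat) ** 2, axis=1)))  # average reconstruction error
```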

To use an autoencoder for unsupervised pretraining, we can follow a layer-wise approach similar to supervised pretraining. In particular, to pretrain the model parameters of the $l$th hidden layer, we can construct a reduced ANN model with only $l$ hidden layers and an output layer that contains the same number of nodes as the attributes and is used for reconstruction. The parameters of the $l$th hidden layer of this network are then learned using a gradient descent method to minimize the reconstruction error. The use of unlabeled data can be viewed as providing hints to the learning of parameters at every layer that aid in generalization. The final model parameters of the ANN model are then learned by applying gradient descent over all the layers, using the initial values of parameters obtained from pretraining.

Hybrid Pretraining

Unsupervised pretraining can also be combined with supervised pretraining by using two output layers at every run of pretraining, one for reconstruction and the other for supervised classification. The parameters of the $l$th hidden layer are then learned by jointly minimizing the losses on both output layers, usually weighted by a trade-off hyper-parameter $\alpha$. Such a combined approach often shows better generalization performance than either of the approaches, since it provides a way to balance between the competing objectives of representing the input data and improving classification performance.

4.8.5 Characteristics of Deep Learning

ApartfromthebasiccharacteristicsofANNdiscussedinSection4.7.3 ,theuseofdeeplearningtechniquesprovidesthefollowingadditionalcharacteristics:

1. AnANNmodeltrainedforsometaskcanbeeasilyre-usedforadifferenttaskthatinvolvesthesameattributes,byusingpretrainingstrategies.Forexample,wecanusethelearnedparametersoftheoriginaltaskasinitialparameterchoicesforthetargettask.Inthisway,ANNpromotesre-usabilityoflearning,whichcanbequiteusefulwhenthetargetapplicationhasasmallernumberoflabeledtraininginstances.

2. Deeplearningtechniquesforregularization,suchasthedropoutmethod,helpinreducingthemodelcomplexityofANNandthuspromotinggoodgeneralizationperformance.Theuseofregularizationtechniquesisespeciallyusefulinhigh-dimensionalsettings,wherethenumberoftraininglabelsissmallbuttheclassificationproblemisinherentlydifficult.

3. Theuseofanautoencoderforpretrainingcanhelpeliminateirrelevantattributesthatarenotrelatedtootherattributes.Further,itcanhelpreducetheimpactofredundantattributesbyrepresentingthemascopiesofthesameattribute.

4. AlthoughthelearningofanANNmodelcansuccumbtofindinginferiorandlocallyoptimalsolutions,thereareanumberofdeeplearningtechniquesthathavebeenproposedtoensureadequatelearningofanANN.Apartfromthemethodsdiscussedinthissection,someothertechniquesinvolvenovelarchitecturedesignssuchasskipconnectionsbetweentheoutputlayerandlowerlayers,whichaidstheeasyflowofgradientsduringbackpropagation.

5. AnumberofspecializedANNarchitectureshavebeendesignedtohandleavarietyofinputdatasets.Someexamplesincludeconvolutionalneuralnetworks(CNN)fortwo-dimensionalgriddedobjectssuchasimages,andrecurrentneuralnetwork(RNN)forsequences.WhileCNNshavebeenextensivelyusedintheareaofcomputervision,RNNshavefoundapplicationsinprocessingspeechandlanguage.

4.9 Support Vector Machine (SVM)

A support vector machine (SVM) is a discriminative classification model that learns linear or nonlinear decision boundaries in the attribute space to separate the classes. Apart from maximizing the separability of the two classes, SVM offers strong regularization capabilities, i.e., it is able to control the complexity of the model in order to ensure good generalization performance. Due to its unique ability to innately regularize its learning, SVM is able to learn highly expressive models without suffering from overfitting. It has thus received considerable attention in the machine learning community and is commonly used in several practical applications, ranging from handwritten digit recognition to text categorization. SVM has strong roots in statistical learning theory and is based on the principle of structural risk minimization. Another unique aspect of SVM is that it represents the decision boundary using only a subset of the training examples that are most difficult to classify, known as the support vectors. Hence, it is a discriminative model that is impacted only by training instances near the boundary of the two classes, in contrast to learning the generative distribution of every class.

To illustrate the basic idea behind SVM, we first introduce the concept of the margin of a separating hyperplane and the rationale for choosing such a hyperplane with maximum margin. We then describe how a linear SVM can be trained to explicitly look for this type of hyperplane. We conclude by showing how the SVM methodology can be extended to learn nonlinear decision boundaries by using kernel functions.

4.9.1 Margin of a Separating Hyperplane

The generic equation of a separating hyperplane can be written as

$$\mathbf{w}^T\mathbf{x} + b = 0,$$

where $\mathbf{x}$ represents the attributes and $(\mathbf{w}, b)$ represent the parameters of the hyperplane. A data instance $\mathbf{x}_i$ can belong to either side of the hyperplane depending on the sign of $(\mathbf{w}^T\mathbf{x}_i + b)$. For the purpose of binary classification, we are interested in finding a hyperplane that places instances of both classes on opposite sides of the hyperplane, thus resulting in a separation of the two classes. If there exists a hyperplane that can perfectly separate the classes in the data set, we say that the data set is linearly separable. Figure 4.32 shows an example of linearly separable data involving two classes, squares and circles. Note that there can be infinitely many hyperplanes that can separate the classes, two of which are shown in Figure 4.32 as lines $B_1$ and $B_2$. Even though every such hyperplane will have zero training error, they can provide different results on previously unseen instances. Which separating hyperplane should we thus finally choose to obtain the best generalization performance? Ideally, we would like to choose a simple hyperplane that is robust to small perturbations. This can be achieved by using the concept of the margin of a separating hyperplane, which can be briefly described as follows.

Figure 4.32. Margin of a hyperplane in a two-dimensional data set.

For every separating hyperplane $B_i$, let us associate a pair of parallel hyperplanes, $b_{i1}$ and $b_{i2}$, such that they touch the closest instances of both classes, respectively. For example, if we move $B_1$ parallel to its direction, we can touch the first square using $b_{11}$ and the first circle using $b_{12}$. $b_{i1}$ and $b_{i2}$ are known as the margin hyperplanes of $B_i$, and the distance between them is known as the margin of the separating hyperplane $B_i$. From the diagram shown in Figure 4.32, notice that the margin for $B_1$ is considerably larger than that for $B_2$. In this example, $B_1$ turns out to be the separating hyperplane with the maximum margin, known as the maximum margin hyperplane.

Rationale for Maximum Margin

Hyperplaneswithlargemarginstendtohavebettergeneralizationperformancethanthosewithsmallmargins.Intuitively,ifthemarginissmall,thenanyslightperturbationinthehyperplaneorthetraininginstanceslocatedattheboundarycanhavequiteanimpactontheclassificationperformance.Smallmarginhyperplanesarethusmoresusceptibletooverfitting,astheyarebarelyabletoseparatetheclasseswithaverynarrowroomtoallowperturbations.Ontheotherhand,ahyperplanethatisfartherawayfromtraininginstancesofbothclasseshassufficientleewaytoberobusttominormodificationsinthedata,andthusshowssuperiorgeneralizationperformance.

Theideaofchoosingthemaximummarginseparatinghyperplanealsohasstrongfoundationsinstatisticallearningtheory.ItcanbeshownthatthemarginofsuchahyperplaneisinverselyrelatedtotheVC-dimensionoftheclassifier,whichisacommonlyusedmeasureofthecomplexityofamodel.AsdiscussedinSection3.4 ofthelastchapter,asimplermodelshouldbepreferredoveramorecomplexmodeliftheybothshowsimilartrainingperformance.Hence,maximizingthemarginresultsintheselectionofaseparatinghyperplanewiththelowestmodelcomplexity,whichisexpectedtoshowbettergeneralizationperformance.

4.9.2 Linear SVM

A linear SVM is a classifier that searches for a separating hyperplane with the largest margin, which is why it is often known as a maximal margin classifier. The basic idea of SVM can be described as follows.

Consider a binary classification problem consisting of $n$ training instances, where every training instance $\mathbf{x}_i$ is associated with a binary label $y_i \in \{-1, 1\}$. Let $\mathbf{w}^T\mathbf{x} + b = 0$ be the equation of a separating hyperplane that separates the two classes by placing them on opposite sides. This means that

$$\mathbf{w}^T\mathbf{x}_i + b > 0 \text{ if } y_i = 1, \qquad \mathbf{w}^T\mathbf{x}_i + b < 0 \text{ if } y_i = -1.$$

The distance of any point $\mathbf{x}$ from the hyperplane is then given by

$$D(\mathbf{x}) = \frac{|\mathbf{w}^T\mathbf{x} + b|}{\|\mathbf{w}\|},$$

where $|\cdot|$ denotes the absolute value and $\|\cdot\|$ denotes the length of a vector. Let the distance of the closest point from the hyperplane with $y = 1$ be $k_+ > 0$. Similarly, let $k_- > 0$ denote the distance of the closest point from class $-1$. This can be represented using the following constraints:

$$\frac{\mathbf{w}^T\mathbf{x}_i + b}{\|\mathbf{w}\|} \geq k_+ \text{ if } y_i = 1, \qquad \frac{\mathbf{w}^T\mathbf{x}_i + b}{\|\mathbf{w}\|} \leq -k_- \text{ if } y_i = -1. \qquad (4.69)$$

The previous equations can be succinctly represented by using the product of $y_i$ and $(\mathbf{w}^T\mathbf{x}_i + b)$ as

$$y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq M\|\mathbf{w}\|, \qquad (4.70)$$

where $M$ is a parameter related to the margin of the hyperplane, i.e., if $k_+ = k_- = M$, then the margin is $k_+ + k_- = 2M$. In order to find the maximum margin hyperplane that adheres to the previous constraints, we can consider the following optimization problem:

$$\max_{\mathbf{w}, b} \; M \qquad \text{subject to } y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq M\|\mathbf{w}\|. \qquad (4.71)$$

To find the solution to the previous problem, note that if $\mathbf{w}$ and $b$ satisfy the constraints of the previous problem, then any scaled version of $\mathbf{w}$ and $b$ would satisfy them too. Hence, we can conveniently choose $\|\mathbf{w}\| = 1/M$ to simplify the right-hand side of the inequalities. Furthermore, maximizing $M$ amounts to minimizing $\|\mathbf{w}\|^2$. Hence, the optimization problem of SVM is commonly represented in the following form:

$$\min_{\mathbf{w}, b} \; \frac{\|\mathbf{w}\|^2}{2} \qquad \text{subject to } y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1. \qquad (4.72)$$

Learning Model Parameters

Equation 4.72 represents a constrained optimization problem with linear inequalities. Since the objective function is convex and quadratic with respect to $\mathbf{w}$, it is known as a quadratic programming problem (QPP), which can be solved using standard optimization techniques, as described in Appendix E. In the following, we present a brief sketch of the main ideas for learning the model parameters of SVM.

First, we rewrite the objective function in a form that takes into account the constraints imposed on its solutions. The new objective function is known as the Lagrangian primal problem, which can be represented as follows:

$$L_P = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{n} \lambda_i\big(y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1\big), \qquad (4.73)$$

where the parameters $\lambda_i \geq 0$ correspond to the constraints and are called the Lagrange multipliers. Next, to minimize the Lagrangian, we take the derivatives of $L_P$ with respect to $\mathbf{w}$ and $b$ and set them equal to zero:

$$\frac{\partial L_P}{\partial \mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_{i=1}^{n} \lambda_i y_i \mathbf{x}_i, \qquad (4.74)$$

$$\frac{\partial L_P}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{n} \lambda_i y_i = 0. \qquad (4.75)$$

Note that using Equation 4.74, we can represent $\mathbf{w}$ completely in terms of the Lagrange multipliers. There is another relationship between $(\mathbf{w}, b)$ and $\lambda_i$ that is derived from the Karush-Kuhn-Tucker (KKT) conditions, a commonly used technique for solving QPP. This relationship can be described as

$$\lambda_i\big[y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1\big] = 0. \qquad (4.76)$$

Equation 4.76 is known as the complementary slackness condition, which sheds light on a valuable property of SVM. It states that the Lagrange multiplier $\lambda_i$ is strictly greater than 0 only when $\mathbf{x}_i$ satisfies the equation $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) = 1$, which means that $\mathbf{x}_i$ lies exactly on a margin hyperplane. However, if $\mathbf{x}_i$ is farther away from the margin hyperplanes such that $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) > 1$, then $\lambda_i$ is necessarily 0. Hence, $\lambda_i > 0$ for only a small number of instances that are closest to the separating hyperplane, which are known as support vectors. Figure 4.33 shows the support vectors of a hyperplane as filled circles and squares. Further, if we look at Equation 4.74, we will observe that training instances with $\lambda_i = 0$ do not contribute to the weight parameter $\mathbf{w}$. This suggests that $\mathbf{w}$ can be concisely represented only in terms of the support vectors in the training data, which are quite fewer than the overall number of training instances. This ability to represent the decision function only in terms of the support vectors is what gives this classifier the name support vector machine.

Figure 4.33. Support vectors of a hyperplane shown as filled circles and squares.

Using Equations 4.74, 4.75, and 4.76 in Equation 4.73, we obtain the following optimization problem in terms of the Lagrange multipliers $\lambda_i$:

$$\max_{\lambda_i} \; \sum_{i=1}^{n} \lambda_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \lambda_i \lambda_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j \qquad \text{subject to } \sum_{i=1}^{n}\lambda_i y_i = 0, \; \lambda_i \geq 0. \qquad (4.77)$$

The previous optimization problem is called the dual optimization problem. Maximizing the dual problem with respect to $\lambda_i$ is equivalent to minimizing the primal problem with respect to $\mathbf{w}$ and $b$. The key differences between the dual and primal problems are as follows:

1. Solving the dual problem helps us identify the support vectors in the data that have non-zero values of $\lambda_i$. Further, the solution of the dual problem is influenced only by the support vectors that are closest to the decision boundary of SVM. This helps in summarizing the learning of SVM solely in terms of its support vectors, which are easier to manage computationally. Further, it represents a unique ability of SVM to be dependent only on the instances closest to the boundary, which are harder to classify, rather than the distribution of instances farther away from the boundary.

2. The objective of the dual problem involves only terms of the form $\mathbf{x}_i^T\mathbf{x}_j$, which are basically inner products in the attribute space. As we will see later in Section 4.9.4, this property will prove to be quite useful in learning nonlinear decision boundaries using SVM.

Because of these differences, it is useful to solve the dual optimization problem using any of the standard solvers for QPP. Having found an optimal solution for $\lambda_i$, we can use Equation 4.74 to solve for $\mathbf{w}$. We can then use Equation 4.76 on the support vectors to solve for $b$ as follows:

$$b = \frac{1}{n_S}\sum_{i \in S} \frac{1 - y_i\mathbf{w}^T\mathbf{x}_i}{y_i}, \qquad (4.78)$$

where $S$ represents the set of support vectors $(S = \{i \,|\, \lambda_i > 0\})$ and $n_S$ is the number of support vectors. The maximum margin hyperplane can then be expressed as

$$f(\mathbf{x}) = \Big(\sum_{i=1}^{n} \lambda_i y_i \mathbf{x}_i^T\mathbf{x}\Big) + b = 0. \qquad (4.79)$$

Using this separating hyperplane, a test instance $\mathbf{x}$ can be assigned a class label using the sign of $f(\mathbf{x})$.

Example 4.7. Consider the two-dimensional data set shown in Figure 4.34, which contains eight training instances. Using quadratic programming, we can solve the optimization problem stated in Equation 4.77 to obtain the Lagrange multiplier $\lambda_i$ for each training instance. The Lagrange multipliers are depicted in the last column of the table. Notice that only the first two instances have non-zero Lagrange multipliers. These instances correspond to the support vectors for this data set.

Let $\mathbf{w} = (w_1, w_2)$ and $b$ denote the parameters of the decision boundary. Using Equation 4.74, we can solve for $w_1$ and $w_2$ in the following way:

$$w_1 = \sum_i \lambda_i y_i x_{i1} = 65.5261 \times 1 \times 0.3858 + 65.5261 \times (-1) \times 0.4871 = -6.64,$$
$$w_2 = \sum_i \lambda_i y_i x_{i2} = 65.5261 \times 1 \times 0.4687 + 65.5261 \times (-1) \times 0.611 = -9.32.$$

Figure 4.34. Example of a linearly separable data set.

The bias term $b$ can be computed using Equation 4.76 for each support vector:

$$b^{(1)} = 1 - \mathbf{w} \cdot \mathbf{x}_1 = 1 - (-6.64)(0.3858) - (-9.32)(0.4687) = 7.9300,$$
$$b^{(2)} = -1 - \mathbf{w} \cdot \mathbf{x}_2 = -1 - (-6.64)(0.4871) - (-9.32)(0.611) = 7.9289.$$

Averaging these values, we obtain $b = 7.93$. The decision boundary corresponding to these parameters is shown in Figure 4.34.
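The arithmetic of Example 4.7 can be reproduced directly from the two support vectors and their common Lagrange multiplier reported in the text, as in the following sketch.

```python
import numpy as np

# The two support vectors from Example 4.7 (Figure 4.34), their labels,
# and the Lagrange multiplier reported for both of them.
x1, y1 = np.array([0.3858, 0.4687]),  1
x2, y2 = np.array([0.4871, 0.6110]), -1
lam = 65.5261

# Equation 4.74: w = sum_i lambda_i * y_i * x_i
w = lam * y1 * x1 + lam * y2 * x2
print(w)  # approximately [-6.64, -9.32]

# Equation 4.76 on each support vector, then average the two estimates of b.
b1 =  1 - w @ x1
b2 = -1 - w @ x2
print((b1 + b2) / 2)  # approximately 7.93
```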

4.9.3 Soft-margin SVM

Figure 4.35 shows a data set that is similar to Figure 4.32, except it has two new examples, P and Q. Although the decision boundary $B_1$ misclassifies the new examples, while $B_2$ classifies them correctly, this does not mean that $B_2$ is a better decision boundary than $B_1$, because the new examples may correspond to noise in the training data. $B_1$ should still be preferred over $B_2$ because it has a wider margin, and thus is less susceptible to overfitting. However, the SVM formulation presented in the previous section only constructs decision boundaries that are mistake-free.

Figure 4.35. Decision boundary of SVM for the non-separable case.

This section examines how the formulation of SVM can be modified to learn a separating hyperplane that is tolerant to a small number of training errors, using a method known as the soft-margin approach. More importantly, the method presented in this section allows SVM to learn linear hyperplanes even in situations where the classes are not linearly separable. To do this, the learning algorithm in SVM must consider the trade-off between the width of the margin and the number of training errors committed by the linear hyperplane.

To introduce the concept of training errors in the SVM formulation, let us relax the inequality constraints to accommodate some violations on a small number of training instances. This can be done by introducing a slack variable $\xi_i \geq 0$ for every training instance $\mathbf{x}_i$ as follows:

$$y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1 - \xi_i. \qquad (4.80)$$

The variable $\xi_i$ allows for some slack in the inequalities of the SVM, such that every instance $\mathbf{x}_i$ does not need to strictly satisfy $y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1$. Further, $\xi_i$ is non-zero only if the margin hyperplanes are not able to place $\mathbf{x}_i$ on the same side as the rest of the instances belonging to $y_i$. To illustrate this, Figure 4.36 shows a circle P that falls on the opposite side of the separating hyperplane as the rest of the circles, and thus satisfies $\mathbf{w}^T\mathbf{x} + b = -1 + \xi$. The distance between P and the margin hyperplane $\mathbf{w}^T\mathbf{x} + b = -1$ is equal to $\xi/\|\mathbf{w}\|$. Hence, $\xi_i$ provides a measure of the error of SVM in representing $\mathbf{x}_i$ using soft inequality constraints.

Figure 4.36. Slack variables used in soft-margin SVM.

In the presence of slack variables, it is important to learn a separating hyperplane that jointly maximizes the margin (ensuring good generalization performance) and minimizes the values of the slack variables (ensuring low training error). This can be achieved by modifying the optimization problem of SVM as follows:

$$\min_{\mathbf{w}, b, \xi_i} \; \frac{\|\mathbf{w}\|^2}{2} + C\sum_{i=1}^{n}\xi_i \qquad \text{subject to } y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1 - \xi_i, \; \xi_i \geq 0, \qquad (4.81)$$

where $C$ is a hyper-parameter that makes a trade-off between maximizing the margin and minimizing the training error. A large value of $C$ puts more emphasis on minimizing the training error than maximizing the margin. Notice the similarity of the previous equation with the generic formula of generalization error rate introduced in Section 3.4 of the previous chapter. Indeed, SVM provides a natural way to balance between model complexity and training error in order to maximize generalization performance.

To solve Equation 4.81, we apply the Lagrange multiplier method and convert the primal problem to its corresponding dual problem, similar to the approach described in the previous section. The Lagrangian primal problem of Equation 4.81 can be written as follows:

$$L_P = \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{n}\xi_i - \sum_{i=1}^{n}\lambda_i\big(y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 + \xi_i\big) - \sum_{i=1}^{n}\mu_i\xi_i, \qquad (4.82)$$

where $\lambda_i \geq 0$ and $\mu_i \geq 0$ are the Lagrange multipliers corresponding to the inequality constraints of Equation 4.81. Setting the derivatives of $L_P$ with respect to $\mathbf{w}$, $b$, and $\xi_i$ equal to 0, we obtain the following equations:

$$\frac{\partial L_P}{\partial \mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_{i=1}^{n}\lambda_i y_i \mathbf{x}_i, \qquad (4.83)$$

$$\frac{\partial L_P}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{n}\lambda_i y_i = 0, \qquad (4.84)$$

$$\frac{\partial L_P}{\partial \xi_i} = 0 \;\Rightarrow\; \lambda_i + \mu_i = C. \qquad (4.85)$$

We can also obtain the complementary slackness conditions by using the following KKT conditions:

$$\lambda_i\big(y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 + \xi_i\big) = 0, \qquad (4.86)$$

$$\mu_i\xi_i = 0. \qquad (4.87)$$

Equation 4.86 suggests that $\lambda_i$ is zero for all training instances except those that reside on the margin hyperplanes $\mathbf{w}^T\mathbf{x}_i + b = \pm 1$ or have $\xi_i > 0$. These instances with $\lambda_i > 0$ are known as support vectors. On the other hand, $\mu_i$ given in Equation 4.87 is zero for any training instance that is misclassified, i.e., $\xi_i > 0$. Further, $\lambda_i$ and $\mu_i$ are related with each other by Equation 4.85. This results in the following three configurations of $(\lambda_i, \mu_i)$:

1. If $\lambda_i = 0$ and $\mu_i = C$, then $\mathbf{x}_i$ does not reside on the margin hyperplanes and is correctly classified on the same side as other instances belonging to $y_i$.

2. If $\lambda_i = C$ and $\mu_i = 0$, then $\mathbf{x}_i$ is misclassified and has a non-zero slack variable $\xi_i$.

3. If $0 < \lambda_i < C$ and $0 < \mu_i < C$, then $\mathbf{x}_i$ resides on one of the margin hyperplanes.

Substituting Equations 4.83 to 4.87 into Equation 4.82, we obtain the following dual optimization problem:

$$\max_{\lambda_i} \; \sum_{i=1}^{n}\lambda_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\lambda_i\lambda_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j \qquad \text{subject to } \sum_{i=1}^{n}\lambda_i y_i = 0, \; 0 \leq \lambda_i \leq C. \qquad (4.88)$$

Notice that the previous problem looks almost identical to the dual problem of SVM for the linearly separable case (Equation 4.77), except that $\lambda_i$ is required to not only be greater than 0 but also smaller than a constant value $C$. Clearly, when $C$ reaches infinity, the previous optimization problem becomes equivalent to Equation 4.77, where the learned hyperplane perfectly separates the classes (with no training errors). However, by capping the values of $\lambda_i$ to $C$, the learned hyperplane is able to tolerate a few training errors that have $\xi_i > 0$.

Figure 4.37. Hinge loss as a function of $y\hat{y}$.

As before, Equation 4.88 can be solved by using any of the standard solvers for QPP, and the optimal value of $\mathbf{w}$ can be obtained by using Equation 4.83. To solve for $b$, we can use Equation 4.86 on the support vectors that reside on the margin hyperplanes as follows:

$$b = \frac{1}{n_S}\sum_{i \in S}\frac{1 - y_i\mathbf{w}^T\mathbf{x}_i}{y_i}, \qquad (4.89)$$

where $S$ represents the set of support vectors residing on the margin hyperplanes $(S = \{i \,|\, 0 < \lambda_i < C\})$ and $n_S$ is the number of elements in $S$.
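In practice, the dual problem is rarely solved by hand. Assuming a library such as scikit-learn is available, a soft-margin linear SVM can be fit as sketched below, where the synthetic data and the choice $C = 1.0$ are purely illustrative.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two noisy, roughly linearly separable classes (hypothetical data).
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])

# A linear soft-margin SVM; C controls the margin / training-error trade-off.
clf = SVC(kernel="linear", C=1.0).fit(X, y)

print(clf.coef_, clf.intercept_)  # the learned w and b
print(len(clf.support_))          # number of support vectors (instances with lambda_i > 0)
print(clf.predict([[1.5, 1.5]]))  # class label from the sign of w^T x + b
```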

SVM as a Regularizer of Hinge Loss

SVM belongs to a broad class of regularization techniques that use a loss function to represent the training errors and a norm of the model parameters to represent the model complexity. To realize this, notice that the slack variable $\xi$, used for measuring the training errors in SVM, is equivalent to the hinge loss function, which can be defined as follows:

$$\mathrm{Loss}(y, \hat{y}) = \max(0, 1 - y\hat{y}),$$

where $y \in \{+1, -1\}$. In the case of SVM, $\hat{y}$ corresponds to $\mathbf{w}^T\mathbf{x} + b$. Figure 4.37 shows a plot of the hinge loss function as we vary $y\hat{y}$. We can see that the hinge loss is equal to 0 as long as $y$ and $\hat{y}$ have the same sign and $|\hat{y}| \geq 1$. However, the hinge loss grows linearly with $|\hat{y}|$ whenever $y$ and $\hat{y}$ are of the opposite sign or $|\hat{y}| < 1$. This is similar to the notion of the slack variable, which is used to measure the distance of a point from its margin hyperplane. Hence, the optimization problem of SVM can be represented in the following equivalent form:

$$\min_{\mathbf{w}, b} \; \frac{\|\mathbf{w}\|^2}{2} + C\sum_{i=1}^{n}\mathrm{Loss}(y_i, \mathbf{w}^T\mathbf{x}_i + b). \qquad (4.90)$$

Note that using the hinge loss ensures that the optimization problem is convex and can be solved using standard optimization techniques. However, if we use a different loss function, such as the squared loss function that was introduced in Section 4.7 on ANN, it will result in a different optimization problem that may or may not remain convex. Nevertheless, different loss functions can be explored to capture varying notions of training error, depending on the characteristics of the problem.
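A short sketch of the hinge loss and of the objective in Equation 4.90, evaluated on hypothetical values of $\mathbf{w}$, $b$, and $C$, is given below.

```python
import numpy as np

def hinge_loss(y, y_hat):
    """Hinge loss: max(0, 1 - y * y_hat), with y in {+1, -1} and y_hat = w^T x + b."""
    return np.maximum(0.0, 1.0 - y * y_hat)

y_hat = np.array([2.0, 0.5, 0.0, -0.5, -2.0])
print(hinge_loss(+1, y_hat))  # [0.  0.5 1.  1.5 3. ]

# Equation 4.90: ||w||^2 / 2 + C * sum of hinge losses over the training set
# (hypothetical w, b, C, and a two-instance training set).
w, b, C = np.array([1.0, -2.0]), 0.5, 10.0
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([+1, -1])
objective = 0.5 * w @ w + C * hinge_loss(y, X @ w + b).sum()
print(objective)
```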

Another interesting property of SVM that relates it to a broader class of regularization techniques is the concept of a margin. Although minimizing $\|\mathbf{w}\|^2$ has the geometric interpretation of maximizing the margin of a separating hyperplane, it is essentially the squared $L_2$ norm of the model parameters, $\|\mathbf{w}\|_2^2$. In general, the $L_q$ norm of $\mathbf{w}$, $\|\mathbf{w}\|_q$, is equal to the Minkowski distance of order $q$ from $\mathbf{w}$ to the origin, i.e.,

$$\|\mathbf{w}\|_q = \Big(\sum_{i=1}^{p} w_i^q\Big)^{1/q}.$$

Minimizing the $L_q$ norm of $\mathbf{w}$ to achieve lower model complexity is a generic regularization concept that has several interpretations. For example, minimizing the $L_2$ norm amounts to finding a solution on a hypersphere of smallest radius that shows suitable training performance. To visualize this in two dimensions, Figure 4.38(a) shows the plot of a circle with constant radius $r$, where every point has the same $L_2$ norm. On the other hand, using the $L_1$ norm ensures that the solution lies on the surface of a hypercube with smallest size, with vertices along the axes. This is illustrated in Figure 4.38(b) as a square with vertices on the axes at a distance of $r$ from the origin. The $L_1$ norm is commonly used as a regularizer to obtain sparse model parameters with only a small number of non-zero parameter values, such as the use of Lasso in regression problems (see Bibliographic Notes).

Figure 4.38. Plots showing the behavior of two-dimensional solutions with constant $L_2$ and $L_1$ norms.

In general, depending on the characteristics of the problem, different combinations of $L_q$ norms and training loss functions can be used for learning the model parameters, each requiring a different optimization solver. This forms the backbone of a wide range of modeling techniques that attempt to improve the generalization performance by jointly minimizing training error and model complexity. However, in this section, we focus only on the squared $L_2$ norm and the hinge loss function, resulting in the classical formulation of SVM.

4.9.4 Nonlinear SVM

The SVM formulations described in the previous sections construct a linear decision boundary to separate the training examples into their respective classes. This section presents a methodology for applying SVM to data sets that have nonlinear decision boundaries. The basic idea is to transform the data from its original attribute space $\mathbf{x}$ into a new space $\varphi(\mathbf{x})$ so that a linear hyperplane can be used to separate the instances in the transformed space, using the SVM approach. The learned hyperplane can then be projected back to the original attribute space, resulting in a nonlinear decision boundary.

Figure 4.39. Classifying data with a nonlinear decision boundary.

Attribute Transformation

To illustrate how attribute transformation can lead to a linear decision boundary, Figure 4.39(a) shows an example of a two-dimensional data set consisting of squares (classified as $y = 1$) and circles (classified as $y = -1$). The

data set is generated in such a way that all the circles are clustered near the center of the diagram and all the squares are distributed farther away from the center. Instances of the data set can be classified using the following equation:

$$y = \begin{cases} 1 & \text{if } \sqrt{(x_1 - 0.5)^2 + (x_2 - 0.5)^2} > 0.2, \\ -1 & \text{otherwise.} \end{cases} \qquad (4.91)$$

The decision boundary for the data can therefore be written as follows:

$$\sqrt{(x_1 - 0.5)^2 + (x_2 - 0.5)^2} = 0.2,$$

which can be further simplified into the following quadratic equation:

$$x_1^2 - x_1 + x_2^2 - x_2 = -0.46.$$

A nonlinear transformation $\varphi$ is needed to map the data from its original attribute space into a new space such that a linear hyperplane can separate the classes. This can be achieved by using the following simple transformation:

$$\varphi: (x_1, x_2) \rightarrow (x_1^2 - x_1,\; x_2^2 - x_2). \qquad (4.92)$$

Figure 4.39(b) shows the points in the transformed space, where we can see that all the circles are located in the lower left-hand side of the diagram. A linear hyperplane with parameters $\mathbf{w}$ and $b$ can therefore be constructed in the transformed space to separate the instances into their respective classes.

One may think that because the nonlinear transformation possibly increases the dimensionality of the input space, this approach can suffer from the curse of dimensionality that is often associated with high-dimensional data.

However,aswewillseeinthefollowingsection,nonlinearSVMisabletoavoidthisproblembyusingkernelfunctions.

Learning a Nonlinear SVM Model

Using a suitable function $\varphi(\cdot)$, we can transform any data instance $\mathbf{x}$ to $\varphi(\mathbf{x})$. (The details on how to choose $\varphi(\cdot)$ will become clear later.) The linear hyperplane in the transformed space can be expressed as $\mathbf{w}^T\varphi(\mathbf{x}) + b = 0$. To learn the optimal separating hyperplane, we can substitute $\varphi(\mathbf{x})$ for $\mathbf{x}$ in the formulation of SVM to obtain the following optimization problem:

$$\min_{\mathbf{w}, b, \xi_i} \; \frac{\|\mathbf{w}\|^2}{2} + C\sum_{i=1}^{n}\xi_i \qquad \text{subject to } y_i(\mathbf{w}^T\varphi(\mathbf{x}_i) + b) \geq 1 - \xi_i, \; \xi_i \geq 0. \qquad (4.93)$$

Using Lagrange multipliers $\lambda_i$, this can be converted into a dual optimization problem:

$$\max_{\lambda_i} \; \sum_{i=1}^{n}\lambda_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\lambda_i\lambda_j y_i y_j \langle\varphi(\mathbf{x}_i), \varphi(\mathbf{x}_j)\rangle \qquad \text{subject to } \sum_{i=1}^{n}\lambda_i y_i = 0, \; 0 \leq \lambda_i \leq C, \qquad (4.94)$$

where $\langle \mathbf{a}, \mathbf{b}\rangle$ denotes the inner product between vectors $\mathbf{a}$ and $\mathbf{b}$. Also, the equation of the hyperplane in the transformed space can be represented using $\lambda_i$ as follows:

$$\sum_{i=1}^{n}\lambda_i y_i \langle\varphi(\mathbf{x}_i), \varphi(\mathbf{x})\rangle + b = 0. \qquad (4.95)$$

Further, $b$ is given by

$$b = \frac{1}{n_S}\Big(\sum_{i \in S}\frac{1}{y_i} - \sum_{i \in S}\sum_{j=1}^{n}\frac{\lambda_j y_i y_j \langle\varphi(\mathbf{x}_i), \varphi(\mathbf{x}_j)\rangle}{y_i}\Big), \qquad (4.96)$$

where $S = \{i \,|\, 0 < \lambda_i < C\}$ is the set of support vectors residing on the margin hyperplanes and $n_S$ is the number of elements in $S$.

Note that in order to solve the dual optimization problem in Equation 4.94, or to use the learned model parameters to make predictions using Equations 4.95 and 4.96, we need only inner products of $\varphi(\mathbf{x})$. Hence, even though $\varphi(\mathbf{x})$ may be nonlinear and high-dimensional, it suffices to use a function of the inner products of $\varphi(\mathbf{x})$ in the transformed space. This can be achieved by using a kernel trick, which can be described as follows.

The inner product between two vectors is often regarded as a measure of similarity between the vectors. For example, the cosine similarity described in Section 2.4.5 on page 79 can be defined as the dot product between two vectors that are normalized to unit length. Analogously, the inner product $\langle\varphi(\mathbf{x}_i), \varphi(\mathbf{x}_j)\rangle$ can also be regarded as a measure of similarity between two instances, $\mathbf{x}_i$ and $\mathbf{x}_j$, in the transformed space. The kernel trick is a method for computing this similarity as a function of the original attributes. Specifically, the kernel function $K(\mathbf{u}, \mathbf{v})$ between two instances $\mathbf{u}$ and $\mathbf{v}$ can be defined as follows:

$$K(\mathbf{u}, \mathbf{v}) = \langle\varphi(\mathbf{u}), \varphi(\mathbf{v})\rangle = f(\mathbf{u}, \mathbf{v}), \qquad (4.97)$$

where $f(\cdot)$ is a function that follows certain conditions as stated by Mercer's Theorem. Although the details of this theorem are outside the scope of the book, we provide a list of some of the commonly used kernel functions:

$$\text{Polynomial kernel:} \qquad K(\mathbf{u}, \mathbf{v}) = (\mathbf{u}^T\mathbf{v} + 1)^p \qquad (4.98)$$

$$\text{Radial Basis Function kernel:} \qquad K(\mathbf{u}, \mathbf{v}) = e^{-\|\mathbf{u}-\mathbf{v}\|^2/(2\sigma^2)} \qquad (4.99)$$

$$\text{Sigmoid kernel:} \qquad K(\mathbf{u}, \mathbf{v}) = \tanh(k\mathbf{u}^T\mathbf{v} - \delta) \qquad (4.100)$$

By using a kernel function, we can directly work with inner products in the transformed space without dealing with the exact forms of the nonlinear transformation function $\varphi$. Specifically, this allows us to use high-dimensional transformations (sometimes even involving infinitely many dimensions), while performing calculations only in the original attribute space. Computing the inner products using kernel functions is also considerably cheaper than using the transformed attribute set $\varphi(\mathbf{x})$. Hence, the use of kernel functions provides a significant advantage in representing nonlinear decision boundaries, without suffering from the curse of dimensionality. This has been one of the major reasons behind the widespread usage of SVM in highly complex and nonlinear problems.
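The sketch below evaluates the RBF and polynomial kernels of Equations 4.98 and 4.99 directly in the original attribute space and, for the degree-2 polynomial kernel on two attributes, verifies that the kernel value equals an explicit inner product $\langle\varphi(\mathbf{u}), \varphi(\mathbf{v})\rangle$ in a transformed space; the particular $\varphi$ shown is one standard choice for this kernel, not the transformation of Equation 4.92.

```python
import numpy as np

def rbf_kernel(u, v, sigma=1.0):
    """Radial basis function kernel, Equation 4.99."""
    return np.exp(-np.sum((u - v) ** 2) / (2 * sigma ** 2))

def poly_kernel(u, v, p=2):
    """Polynomial kernel, Equation 4.98."""
    return (u @ v + 1) ** p

u = np.array([0.3, 0.7])
v = np.array([0.5, 0.1])

# The kernels work entirely with the original attributes, never forming phi(x).
print(rbf_kernel(u, v, sigma=0.5))
print(poly_kernel(u, v, p=2))

# For p = 2 and two attributes, one explicit transformation is
# phi(x) = (x1^2, x2^2, sqrt(2) x1 x2, sqrt(2) x1, sqrt(2) x2, 1),
# and <phi(u), phi(v)> equals the polynomial kernel value computed above.
def phi(x):
    x1, x2 = x
    return np.array([x1 * x1, x2 * x2, np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2, 1.0])

print(phi(u) @ phi(v))  # matches poly_kernel(u, v, p=2)
```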

Figure 4.40. Decision boundary produced by a nonlinear SVM with polynomial kernel.

Figure 4.40 shows the nonlinear decision boundary obtained by SVM using the polynomial kernel function given in Equation 4.98. We can see that the learned decision boundary is quite close to the true decision boundary shown in Figure 4.39(a). Although the choice of kernel function depends on the characteristics of the input data, a commonly used kernel function is the radial basis function (RBF) kernel, which involves a single hyper-parameter $\sigma$, known as the standard deviation of the RBF kernel.

4.9.5 Characteristics of SVM

1. TheSVMlearningproblemcanbeformulatedasaconvexoptimizationproblem,inwhichefficientalgorithmsareavailabletofindtheglobalminimumoftheobjectivefunction.Otherclassificationmethods,suchasrule-basedclassifiersandartificialneuralnetworks,employagreedystrategytosearchthehypothesisspace.Suchmethodstendtofindonlylocallyoptimumsolutions.

2. SVMprovidesaneffectivewayofregularizingthemodelparametersbymaximizingthemarginofthedecisionboundary.Furthermore,itisabletocreateabalancebetweenmodelcomplexityandtrainingerrorsbyusingahyper-parameterC.Thistrade-offisgenerictoabroaderclassofmodellearningtechniquesthatcapturethemodelcomplexityandthetraininglossusingdifferentformulations.

3. LinearSVMcanhandleirrelevantattributesbylearningzeroweightscorrespondingtosuchattributes.Itcanalsohandleredundantattributesbylearningsimilarweightsfortheduplicateattributes.Furthermore,theabilityofSVMtoregularizeitslearningmakesitmorerobusttothepresenceofalargenumberofirrelevantandredundantattributesthanotherclassifiers,eveninhigh-dimensionalsettings.Forthisreason,nonlinearSVMsarelessimpactedbyirrelevantandredundantattributesthanotherhighlyexpressiveclassifiersthatcanlearnnonlineardecisionboundariessuchasdecisiontrees.


To compare the effect of irrelevant attributes on the performance of nonlinear SVMs and decision trees, consider the two-dimensional data set shown in Figure 4.41(a) containing 500 + instances and 500 o instances, where the two classes can be easily separated using a nonlinear decision boundary. We incrementally add irrelevant attributes to this data set and compare the performance of two classifiers: decision tree and nonlinear SVM (using the radial basis function kernel), using 70% of the data for training and the rest for testing. Figure 4.41(b) shows the test error rates of the two classifiers as we increase the number of irrelevant attributes. We can see that the test error rate of decision trees swiftly reaches 0.5 (same as random guessing) in the presence of even a small number of irrelevant attributes. This can be attributed to the problem of multiple comparisons while choosing splitting attributes at internal nodes, as discussed in Example 3.7 of the previous chapter. On the other hand, nonlinear SVM shows a more robust and steady performance even after adding a moderately large number of irrelevant attributes. Its performance gradually declines, with the test error rate eventually reaching close to 0.5 after adding 125 irrelevant attributes, at which point it becomes difficult to discern the discriminative information in the original two attributes from the noise in the remaining attributes for learning nonlinear decision boundaries.

Figure4.41.ComparingtheeffectofaddingirrelevantattributesontheperformanceofnonlinearSVMsanddecisiontrees.

4. SVM can be applied to categorical data by introducing dummy variables for each categorical attribute value present in the data. For example, if an attribute has three values {Single, Married, Divorced}, we can introduce a binary variable for each of the attribute values.

5. The SVM formulation presented in this chapter is for binary class problems. However, multiclass extensions of SVM have also been proposed.

6. Although the training time of an SVM model can be large, the learned parameters can be succinctly represented with the help of a small number of support vectors, making the classification of test instances quite fast.

4.10 Ensemble Methods

This section presents techniques for improving classification accuracy by aggregating the predictions of multiple classifiers. These techniques are known as ensemble or classifier combination methods. An ensemble method constructs a set of base classifiers from training data and performs classification by taking a vote on the predictions made by each base classifier. This section explains why ensemble methods tend to perform better than any single classifier and presents techniques for constructing the classifier ensemble.

4.10.1 Rationale for Ensemble Method

The following example illustrates how an ensemble method can improve a classifier's performance.

Example 4.8. Consider an ensemble of 25 binary classifiers, each of which has an error rate of ε = 0.35. The ensemble classifier predicts the class label of a test example by taking a majority vote on the predictions made by the base classifiers. If the base classifiers are identical, then all the base classifiers will commit the same mistakes. Thus, the error rate of the ensemble remains 0.35. On the other hand, if the base classifiers are independent—i.e., their errors are uncorrelated—then the ensemble makes a wrong prediction only if more than half of the base classifiers predict incorrectly. In this case, the error rate of the ensemble classifier is

e_{ensemble} = \sum_{i=13}^{25} \binom{25}{i} \epsilon^{i} (1-\epsilon)^{25-i} = 0.06,   (4.101)

which is considerably lower than the error rate of the base classifiers.
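The sum in Equation 4.101 can be evaluated directly. The following minimal Python sketch (the function name ensemble_error is our own illustration, not part of the text) reproduces the calculation for 25 independent base classifiers with ε = 0.35.

import math

def ensemble_error(epsilon, n_classifiers=25):
    # Probability that more than half of the independent base classifiers err.
    k_min = n_classifiers // 2 + 1   # 13 out of 25
    return sum(math.comb(n_classifiers, k)
               * epsilon ** k * (1 - epsilon) ** (n_classifiers - k)
               for k in range(k_min, n_classifiers + 1))

print(round(ensemble_error(0.35), 2))   # 0.06, matching Equation 4.101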

Figure 4.42 shows the error rate of an ensemble of 25 binary classifiers (e_ensemble) for different base classifier error rates (ε). The diagonal line represents the case in which the base classifiers are identical, while the solid line represents the case in which the base classifiers are independent. Observe that the ensemble classifier performs worse than the base classifiers when ε is larger than 0.5.

The preceding example illustrates two necessary conditions for an ensemble classifier to perform better than a single classifier: (1) the base classifiers should be independent of each other, and (2) the base classifiers should do better than a classifier that performs random guessing. In practice, it is difficult to ensure total independence among the base classifiers. Nevertheless, improvements in classification accuracies have been observed in ensemble methods in which the base classifiers are somewhat correlated.

4.10.2 Methods for Constructing an Ensemble Classifier

A logical view of the ensemble method is presented in Figure 4.43. The basic idea is to construct multiple classifiers from the original data and then aggregate their predictions when classifying unknown examples. The ensemble of classifiers can be constructed in many ways:


1. By manipulating the training set. In this approach, multiple training sets are created by resampling the original data according to some sampling distribution and constructing a classifier from each training set. The sampling distribution determines how likely it is that an example will be selected for training, and it may vary from one trial to another. Bagging and boosting are two examples of ensemble methods that manipulate their training sets. These methods are described in further detail in Sections 4.10.4 and 4.10.5.

Figure 4.42. Comparison between errors of base classifiers and errors of the ensemble classifier.

Figure 4.43. A logical view of the ensemble learning method.

2. By manipulating the input features. In this approach, a subset of input features is chosen to form each training set. The subset can be either chosen randomly or based on the recommendation of domain experts. Some studies have shown that this approach works very well with data sets that contain highly redundant features. Random forest, which is described in Section 4.10.6, is an ensemble method that manipulates its input features and uses decision trees as its base classifiers.

3. By manipulating the class labels. This method can be used when the number of classes is sufficiently large. The training data is transformed into a binary class problem by randomly partitioning the class labels into two disjoint subsets, A_0 and A_1. Training examples whose class label belongs to the subset A_0 are assigned to class 0, while those that belong to the subset A_1 are assigned to class 1. The relabeled examples are then used to train a base classifier. By repeating this process multiple times, an ensemble of base classifiers is obtained. When a test example is presented, each base classifier C_i is used to predict its class label. If the test example is predicted as class 0, then all the classes that belong to A_0 will receive a vote. Conversely, if it is predicted to be class 1, then all the classes that belong to A_1 will receive a vote. The votes are tallied and the class that receives the highest vote is assigned to the test example. An example of this approach is the error-correcting output coding method described on page 331.

4. By manipulating the learning algorithm. Many learning algorithms can be manipulated in such a way that applying the algorithm several times on the same training data will result in the construction of different classifiers. For example, an artificial neural network can change its network topology or the initial weights of the links between neurons. Similarly, an ensemble of decision trees can be constructed by injecting randomness into the tree-growing procedure. For example, instead of choosing the best splitting attribute at each node, we can randomly choose one of the top k attributes for splitting.

The first three approaches are generic methods that are applicable to any classifier, whereas the fourth approach depends on the type of classifier used. The base classifiers for most of these approaches can be generated sequentially (one after another) or in parallel (all at once). Once an ensemble of classifiers has been learned, a test example x is classified by combining the predictions made by the base classifiers C_i(x):


C*(x) = f(C_1(x), C_2(x), …, C_k(x)),

where f is the function that combines the ensemble responses. One simple approach for obtaining C*(x) is to take a majority vote of the individual predictions. An alternate approach is to take a weighted majority vote, where the weight of a base classifier denotes its accuracy or relevance.
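As a concrete illustration of the combination function f, the following Python sketch (helper names are our own) implements both a simple majority vote and a weighted majority vote over the base classifier predictions for a single test instance.

from collections import Counter

def majority_vote(predictions):
    # predictions: list of class labels returned by C_1(x), ..., C_k(x)
    return Counter(predictions).most_common(1)[0][0]

def weighted_majority_vote(predictions, weights):
    # weights: accuracy or relevance of each base classifier
    scores = {}
    for label, w in zip(predictions, weights):
        scores[label] = scores.get(label, 0.0) + w
    return max(scores, key=scores.get)

print(majority_vote([+1, -1, +1]))                            # +1
print(weighted_majority_vote([+1, -1, -1], [0.9, 0.4, 0.4]))  # +1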

Ensemble methods show the most improvement when used with unstable classifiers, i.e., base classifiers that are sensitive to minor perturbations in the training set, because of high model complexity. Although unstable classifiers may have a low bias in finding the optimal decision boundary, their predictions have a high variance for minor changes in the training set or model selection. This trade-off between bias and variance is discussed in detail in the next section. By aggregating the responses of multiple unstable classifiers, ensemble learning attempts to minimize their variance without worsening their bias.

4.10.3 Bias-Variance Decomposition

Bias-variance decomposition is a formal method for analyzing the generalization error of a predictive model. Although the analysis is slightly different for classification than regression, we first discuss the basic intuition of this decomposition by using an analogue of a regression problem.

Consider the illustrative task of reaching a target y by firing projectiles from a starting position, as shown in Figure 4.44. The target corresponds to the desired output at a test instance, while the starting position corresponds to its observed attributes. In this analogy, the projectile represents the model used for predicting the target using the observed attributes. Let ŷ denote the point where the projectile hits the ground, which is analogous to the prediction of the model.


Figure 4.44. Bias-variance decomposition.

Ideally, we would like our predictions to be as close to the true target as possible. However, note that different trajectories of projectiles are possible based on differences in the training data or in the approach used for model selection. Hence, we can observe a variance in the predictions ŷ over different runs of the projectile. Further, the target in our example is not fixed but has some freedom to move around, resulting in a noise component in the true target. This can be understood as the non-deterministic nature of the output variable, where the same set of attributes can have different output values. Let ŷ_avg represent the average prediction of the projectile over multiple runs, and y_avg denote the average target value. The difference between ŷ_avg and y_avg is known as the bias of the model.

In the context of classification, it can be shown that the generalization error of a classification model m can be decomposed into terms involving the bias, variance, and noise components of the model in the following way:

gen.error(m) = c_1 \times noise + bias(m) + c_2 \times variance(m),

where c_1 and c_2 are constants that depend on the characteristics of training and test sets. Note that while the noise term is intrinsic to the target class, the bias and variance terms depend on the choice of the classification model. The bias of a model represents how close the average prediction of the model is to the average target. Models that are able to learn complex decision boundaries, e.g., models produced by k-nearest neighbor and multi-layer ANN, generally show low bias. The variance of a model captures the stability of its predictions in response to minor perturbations in the training set or the model selection approach.

We can say that a model shows better generalization performance if it has a lower bias and lower variance. However, if the complexity of a model is high but the training size is small, we generally expect to see a lower bias but higher variance, resulting in the phenomenon of overfitting. This phenomenon is pictorially represented in Figure 4.45(a). On the other hand, an overly simplistic model that suffers from underfitting may show a lower variance but would suffer from a high bias, as shown in Figure 4.45(b). Hence, the trade-off between bias and variance provides a useful way for interpreting the effects of underfitting and overfitting on the generalization performance of a model.

Figure 4.45. Plots showing the behavior of two-dimensional solutions with constant L2 and L1 norms.

The bias-variance trade-off can be used to explain why ensemble learning improves the generalization performance of unstable classifiers. If a base classifier shows low bias but high variance, it can become susceptible to overfitting, as even a small change in the training set will result in different predictions. However, by combining the responses of multiple base classifiers, we can expect to reduce the overall variance. Hence, ensemble learning methods show better performance primarily by lowering the variance in the predictions, although they can even help in reducing the bias. One of the simplest approaches for combining predictions and reducing their variance is to compute their average. This forms the basis of the bagging method, described in the following subsection.

4.10.4 Bagging

Bagging, which is also known as bootstrap aggregating, is a technique that repeatedly samples (with replacement) from a data set according to a uniform probability distribution. Each bootstrap sample has the same size as the original data. Because the sampling is done with replacement, some instances may appear several times in the same training set, while others may be omitted from the training set. On average, a bootstrap sample D_i contains approximately 63% of the original training data because each instance has a probability 1 − (1 − 1/N)^N of being selected in each D_i. If N is sufficiently large, this probability converges to 1 − 1/e ≈ 0.632. The basic procedure for bagging is summarized in Algorithm 4.5. After training the k classifiers, a test instance is assigned to the class that receives the highest number of votes.

To illustrate how bagging works, consider the data set shown in Table 4.4. Let x denote a one-dimensional attribute and y denote the class label. Suppose we use only one-level binary decision trees, with a test condition x ≤ k, where k is a split point chosen to minimize the entropy of the leaf nodes. Such a tree is also known as a decision stump.

Table 4.4. Example of data set used to construct an ensemble of bagging classifiers.

x   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0
y    1     1     1    −1    −1    −1    −1     1     1     1

Without bagging, the best decision stump we can produce splits the instances at either x ≤ 0.35 or x ≤ 0.75. Either way, the accuracy of the tree is at most 70%. Suppose we apply the bagging procedure on the data set using 10 bootstrap samples. The examples chosen for training in each bagging round are shown


in Figure 4.46. On the right-hand side of each table, we also describe the decision stump being used in each round.

We classify the entire data set given in Table 4.4 by taking a majority vote among the predictions made by each base classifier. The results of the predictions are shown in Figure 4.47. Since the class labels are either −1 or +1, taking the majority vote is equivalent to summing up the predicted values of y and examining the sign of the resulting sum (refer to the second to last row in Figure 4.47). Notice that the ensemble classifier perfectly classifies all 10 examples in the original data.

Algorithm 4.5 Bagging algorithm.
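A minimal sketch of the bagging procedure of Algorithm 4.5, using one-level decision stumps as the base classifiers. The stump implementation below is our own simplified illustration: it chooses the split and leaf labels that minimize training error rather than entropy, and labels are assumed to be −1/+1 as in Table 4.4.

import random

def fit_stump(X, y):
    # Try every candidate split point and both leaf orientations;
    # keep the stump with the lowest training error.
    best = None
    for k in sorted(set(X)):
        for left, right in [(+1, -1), (-1, +1)]:
            preds = [left if x <= k else right for x in X]
            err = sum(p != t for p, t in zip(preds, y))
            if best is None or err < best[0]:
                best = (err, k, left, right)
    return best[1:]                      # (split point, left label, right label)

def bagging(X, y, rounds=10):
    n, stumps = len(X), []
    for _ in range(rounds):
        idx = [random.randrange(n) for _ in range(n)]        # bootstrap sample
        stumps.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return stumps

def predict(stumps, x):
    votes = sum(left if x <= k else right for k, left, right in stumps)
    return +1 if votes > 0 else -1       # sign of the summed predictions

X = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
y = [1, 1, 1, -1, -1, -1, -1, 1, 1, 1]
stumps = bagging(X, y)
print([predict(stumps, x) for x in X])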


Figure 4.46. Example of bagging.

The preceding example illustrates another advantage of using ensemble methods in terms of enhancing the representation of the target function. Even though each base classifier is a decision stump, combining the classifiers can lead to a decision boundary that mimics a decision tree of depth 2.

Bagging improves generalization error by reducing the variance of the base classifiers. The performance of bagging depends on the stability of the base classifier. If a base classifier is unstable, bagging helps to reduce the errors associated with random fluctuations in the training data. If a base classifier is stable, i.e., robust to minor perturbations in the training set, then the error of the ensemble is primarily caused by bias in the base classifier. In this situation, bagging may not be able to improve the performance of the base classifiers significantly. It may even degrade the classifier's performance because the effective size of each training set is about 37% smaller than the original data.

Figure 4.47. Example of combining classifiers constructed using the bagging approach.

4.10.5 Boosting

Boosting is an iterative procedure used to adaptively change the distribution of training examples for learning base classifiers so that they increasingly focus on examples that are hard to classify. Unlike bagging, boosting assigns a weight to each training example and may adaptively change the weight at the end of each boosting round. The weights assigned to the training examples can be used in the following ways:

1. They can be used to inform the sampling distribution used to draw a set of bootstrap samples from the original data.

2. They can be used to learn a model that is biased toward examples with higher weight.

This section describes an algorithm that uses weights of examples to determine the sampling distribution of its training set. Initially, the examples are assigned equal weights, 1/N, so that they are equally likely to be chosen for training. A sample is drawn according to the sampling distribution of the training examples to obtain a new training set. Next, a classifier is built from the training set and used to classify all the examples in the original data. The weights of the training examples are updated at the end of each boosting round. Examples that are classified incorrectly will have their weights increased, while those that are classified correctly will have their weights decreased. This forces the classifier to focus on examples that are difficult to classify in subsequent iterations.

The following table shows the examples chosen during each boosting round, when applied to the data shown in Table 4.4.

Boosting (Round 1):  7  3  2  8  7  9  4  10  6  3
Boosting (Round 2):  5  4  9  4  2  5  1   7  4  2
Boosting (Round 3):  4  4  8  10 4  5  4   6  3  4

Initially, all the examples are assigned the same weights. However, some examples may be chosen more than once, e.g., examples 3 and 7, because the sampling is done with replacement. A classifier built from the data is then used to classify all the examples. Suppose example 4 is difficult to classify. The weight for this example will be increased in future iterations as it gets misclassified repeatedly. Meanwhile, examples that were not chosen in the previous round, e.g., examples 1 and 5, also have a better chance of being selected in the next round since their predictions in the previous round were likely to be wrong. As the boosting rounds proceed, examples that are the hardest to classify tend to become even more prevalent. The final ensemble is obtained by aggregating the base classifiers obtained from each boosting round.

Over the years, several implementations of the boosting algorithm have been developed. These algorithms differ in terms of (1) how the weights of the training examples are updated at the end of each boosting round, and (2) how the predictions made by each classifier are combined. An implementation called AdaBoost is explored in the next section.

AdaBoost Let {(x_j, y_j) | j = 1, 2, …, N} denote a set of N training examples. In the AdaBoost algorithm, the importance of a base classifier C_i depends on its error rate, which is defined as

\epsilon_i = \frac{1}{N}\left[\sum_{j=1}^{N} w_j \, I\big(C_i(x_j) \neq y_j\big)\right],   (4.102)

where I(p) = 1 if the predicate p is true, and 0 otherwise. The importance of a classifier C_i is given by the following parameter,

\alpha_i = \frac{1}{2} \ln\left(\frac{1 - \epsilon_i}{\epsilon_i}\right).

Note that α_i has a large positive value if the error rate is close to 0 and a large negative value if the error rate is close to 1, as shown in Figure 4.48.

Figure 4.48. Plot of α as a function of training error ε.

The α_i parameter is also used to update the weight of the training examples. To illustrate, let w_i^{(j)} denote the weight assigned to example (x_i, y_i) during the jth boosting round. The weight update mechanism for AdaBoost is given by the equation:

w_i^{(j+1)} = \frac{w_i^{(j)}}{Z_j} \times \begin{cases} e^{-\alpha_j} & \text{if } C_j(x_i) = y_i, \\ e^{\alpha_j} & \text{if } C_j(x_i) \neq y_i, \end{cases}   (4.103)

where Z_j is the normalization factor used to ensure that \sum_i w_i^{(j+1)} = 1. The weight update formula given in Equation 4.103 increases the weights of incorrectly classified examples and decreases the weights of those classified correctly.

Instead of using a majority voting scheme, the prediction made by each classifier C_j is weighted according to α_j. This approach allows AdaBoost to penalize models that have poor accuracy, e.g., those generated at the earlier boosting rounds. In addition, if any intermediate rounds produce an error rate higher than 50%, the weights are reverted back to their original uniform values, w_i = 1/N, and the resampling procedure is repeated. The AdaBoost algorithm is summarized in Algorithm 4.6.

Algorithm 4.6 AdaBoost algorithm.
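A compact sketch of the AdaBoost training loop summarized in Algorithm 4.6, reusing the fit_stump helper from the bagging sketch above. The resampling-by-weights step and the numerical safeguards are our own simplified illustration; labels are assumed to be −1/+1.

import math, random

def adaboost(X, y, rounds=3):
    n = len(X)
    w = [1.0 / n] * n                                   # equal initial weights
    models = []
    for _ in range(rounds):
        idx = random.choices(range(n), weights=w, k=n)  # sample by weight
        k, left, right = fit_stump([X[i] for i in idx], [y[i] for i in idx])
        preds = [left if x <= k else right for x in X]
        # Weighted training error used to compute alpha (cf. Equation 4.102).
        eps = sum(wi for wi, p, t in zip(w, preds, y) if p != t)
        if eps > 0.5:                                   # revert to uniform weights
            w = [1.0 / n] * n
            continue
        alpha = 0.5 * math.log((1 - eps) / max(eps, 1e-10))
        # Weight update of Equation 4.103, followed by normalization by Z_j.
        w = [wi * math.exp(-alpha if p == t else alpha)
             for wi, p, t in zip(w, preds, y)]
        z = sum(w)
        w = [wi / z for wi in w]
        models.append((alpha, k, left, right))
    return models

def adaboost_predict(models, x):
    score = sum(a * (left if x <= k else right) for a, k, left, right in models)
    return +1 if score > 0 else -1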

Let us examine how the boosting approach works on the data set shown in Table 4.4. Initially, all the examples have identical weights. After three boosting rounds, the examples chosen for training are shown in Figure 4.49(a). The weights for each example are updated at the end of each boosting round using Equation 4.103, as shown in Figure 4.49(b).

Without boosting, the accuracy of the decision stump is, at best, 70%. With AdaBoost, the results of the predictions are given in Figure 4.50(b). The final prediction of the ensemble classifier is obtained by taking a weighted average of the predictions made by each base classifier, which is shown in the last row of Figure 4.50(b). Notice that AdaBoost perfectly classifies all the examples in the training data.

Figure 4.49. Example of boosting.

An important analytical result of boosting shows that the training error of the ensemble is bounded by the following expression:

e_{ensemble} \leq \prod_i \sqrt{\epsilon_i (1 - \epsilon_i)},   (4.104)

where ε_i is the error rate of each base classifier i. If the error rate of the base classifier is less than 50%, we can write ε_i = 0.5 − γ_i, where γ_i measures how much better the classifier is than random guessing. The bound on the training error of the ensemble becomes

e_{ensemble} \leq \prod_i \sqrt{1 - 4\gamma_i^2} \leq \exp\left(-2 \sum_i \gamma_i^2\right).   (4.105)

Hence, the training error of the ensemble decreases exponentially, which leads to the fast convergence of the algorithm. By focusing on examples that are difficult to classify by base classifiers, it is able to reduce the bias of the final predictions along with the variance. AdaBoost has been shown to provide significant improvements in performance over base classifiers on a range of data sets. Nevertheless, because of its tendency to focus on training examples that are wrongly classified, the boosting technique can be susceptible to overfitting, resulting in poor generalization performance in some scenarios.

Figure 4.50. Example of combining classifiers constructed using the AdaBoost approach.

4.10.6 Random Forests

Random forests attempt to improve the generalization performance by constructing an ensemble of decorrelated decision trees. Random forests build on the idea of bagging to use a different bootstrap sample of the training data for learning decision trees. However, a key distinguishing feature of random forests from bagging is that at every internal node of a tree, the best splitting criterion is chosen among a small set of randomly selected attributes. In this way, random forests construct ensembles of decision trees by not only manipulating training instances (by using bootstrap samples similar to bagging), but also the input attributes (by using different subsets of attributes at every internal node).

Given a training set D consisting of n instances and d attributes, the basic procedure of training a random forest classifier can be summarized using the following steps:

1. Construct a bootstrap sample D_i of the training set by randomly sampling n instances (with replacement) from D.

2. Use D_i to learn a decision tree T_i as follows. At every internal node of T_i, randomly sample a set of p attributes and choose an attribute from this subset that shows the maximum reduction in an impurity measure for splitting. Repeat this procedure till every leaf is pure, i.e., contains instances from the same class.

Once an ensemble of decision trees has been constructed, their average prediction (majority vote) on a test instance is used as the final prediction of the random forest. Note that the decision trees involved in a random forest are unpruned trees, as they are allowed to grow to their largest possible size till every leaf is pure. Hence, the base classifiers of a random forest represent unstable classifiers that have low bias but high variance, because of their large size.
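A sketch of the two training steps above. The use of scikit-learn's DecisionTreeClassifier as the base learner, and the helper names, are illustrative assumptions on our part; the max_features parameter plays the role of p, the random subset of attributes considered at every internal node.

import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def train_random_forest(X, y, n_trees=100, p=None):
    n, d = X.shape
    p = p or max(1, int(np.sqrt(d)))        # attributes sampled at each node
    rng = np.random.default_rng(0)
    forest = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)    # step 1: bootstrap sample D_i
        # step 2: unpruned tree, random subset of p attributes per split
        tree = DecisionTreeClassifier(max_features=p)
        tree.fit(X[idx], y[idx])
        forest.append(tree)
    return forest

def forest_predict(forest, X):
    votes = np.stack([t.predict(X) for t in forest])   # (n_trees, n_test)
    # Majority vote across the trees for every test instance.
    return np.array([Counter(votes[:, j]).most_common(1)[0][0]
                     for j in range(votes.shape[1])])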

Another property of the base classifiers learned in random forests is the lack of correlation among their model parameters and test predictions. This can be attributed to the use of an independently sampled data set D_i for learning every decision tree T_i, similar to the bagging approach. However, random forests have the additional advantage of choosing a splitting criterion at every internal node using a different (and randomly selected) subset of attributes. This property significantly helps in breaking the correlation structure, if any, among the decision trees T_i.

To realize this, consider a training set involving a large number of attributes, where only a small subset of attributes are strong predictors of the target class, whereas the other attributes are weak indicators. Given such a training set, even if we consider different bootstrap samples D_i for learning T_i, we would mostly be choosing the same attributes for splitting at internal nodes, because the weak attributes would be largely overlooked when compared with the strong predictors. This can result in a considerable correlation among the trees. However, if we restrict the choice of attributes at every internal node to a random subset of attributes, we can ensure the selection of both strong and weak predictors, thus promoting diversity among the trees. This principle is utilized by random forests for creating decorrelated decision trees.

By aggregating the predictions of an ensemble of strong and decorrelated decision trees, random forests are able to reduce the variance of the trees without negatively impacting their low bias. This makes random forests quite robust to overfitting. Additionally, because of their ability to consider only a small subset of attributes at every internal node, random forests are computationally fast and robust even in high-dimensional settings.


Thenumberofattributestobeselectedateverynode,p,isahyper-parameteroftherandomforestclassifier.Asmallvalueofpcanreducethecorrelationamongtheclassifiersbutmayalsoreducetheirstrength.Alargevaluecanimprovetheirstrengthbutmayresultincorrelatedtreessimilartobagging.Althoughcommonsuggestionsforpintheliteratureinclude and ,asuitablevalueofpforagiventrainingsetcanalwaysbeselectedbytuningitoveravalidationset,asdescribedinthepreviouschapter.However,thereisanalternativewayforselectinghyper-parametersinrandomforests,whichdoesnotrequireusingaseparatevalidationset.Itinvolvescomputingareliableestimateofthegeneralizationerrorratedirectlyduringtraining,knownastheout-of-bag(oob)errorestimate.Theoobestimatecanbecomputedforanygenericensemblelearningmethodthatbuildsindependentbaseclassifiersusingbootstrapsamplesofthetrainingset,e.g.,baggingandrandomforests.Theapproachforcomputingoobestimatecanbedescribedasfollows.

Consideranensemblelearningmethodthatusesanindependentbaseclassifier builtonabootstrapsampleofthetrainingset .Sinceeverytraininginstance willbeusedfortrainingapproximately63%ofbaseclassifiers,wecancall asanout-of-bagsamplefortheremaining27%ofbaseclassifiersthatdidnotuseitfortraining.Ifweusetheseremaining27%classifierstomakepredictionson ,wecanobtaintheooberroron bytakingtheirmajorityvoteandcomparingitwithitsclasslabel.Notethattheooberrorestimatestheerrorof27%classifiersonaninstancethatwasnotusedfortrainingthoseclassifiers.Hence,theooberrorcanbeconsideredasareliableestimateofgeneralizationerror.Bytakingtheaverageofooberrorsofalltraininginstances,wecancomputetheoverallooberrorestimate.Thiscanbeusedasanalternativetothevalidationerrorrateforselectinghyper-parameters.Hence,randomforestsdonotneedtouseaseparatepartitionofthetrainingsetforvalidation,asitcansimultaneouslytrainthebaseclassifiersandcomputegeneralizationerrorestimatesonthesamedataset.
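The oob estimate described above can be computed alongside training by remembering which instances each bootstrap sample left out. A sketch, continuing the illustrative scikit-learn setup from the previous block:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def oob_error(X, y, n_trees=100, p=None):
    n, d = X.shape
    p = p or max(1, int(np.sqrt(d)))
    rng = np.random.default_rng(0)
    votes = [dict() for _ in range(n)]          # per-instance votes from oob trees
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)                    # bootstrap sample D_i
        oob = np.setdiff1d(np.arange(n), idx)               # instances left out of D_i
        if len(oob) == 0:
            continue
        tree = DecisionTreeClassifier(max_features=p).fit(X[idx], y[idx])
        for i, pred in zip(oob, tree.predict(X[oob])):
            votes[i][pred] = votes[i].get(pred, 0) + 1
    # Majority vote of the oob classifiers, compared with the true label.
    errors = [max(v, key=v.get) != y[i] for i, v in enumerate(votes) if v]
    return float(np.mean(errors))               # overall oob error estimate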


Random forests have been empirically found to provide significant improvements in generalization performance that are often comparable, if not superior, to the improvements provided by the AdaBoost algorithm. Random forests are also more robust to overfitting and run much faster than the AdaBoost algorithm.

4.10.7 Empirical Comparison among Ensemble Methods

Table 4.5 shows the empirical results obtained when comparing the performance of a decision tree classifier against bagging, boosting, and random forest. The base classifiers used in each ensemble method consist of 50 decision trees. The classification accuracies reported in this table are obtained from tenfold cross-validation. Notice that the ensemble classifiers generally outperform a single decision tree classifier on many of the data sets.

Table 4.5. Comparing the accuracy of a decision tree classifier against three ensemble methods.

Data Set   (Number of Attributes, Classes, Instances)   Decision Tree (%)   Bagging (%)   Boosting (%)   RF (%)

Anneal (39,6,898) 92.09 94.43 95.43 95.43

Australia (15,2,690) 85.51 87.10 85.22 85.80

Auto (26,7,205) 81.95 85.37 85.37 84.39

Breast (11,2,699) 95.14 96.42 97.28 96.14

Cleve (14,2,303) 76.24 81.52 82.18 82.18

Credit (16,2,690) 85.8 86.23 86.09 85.8

Diabetes (9,2,768) 72.40 76.30 73.18 75.13

German (21,2,1000) 70.90 73.40 73.00 74.5

Glass (10,7,214) 67.29 76.17 77.57 78.04

Heart (14,2,270) 80.00 81.48 80.74 83.33

Hepatitis (20,2,155) 81.94 81.29 83.87 83.23

Horse (23,2,368) 85.33 85.87 81.25 85.33

Ionosphere (35,2,351) 89.17 92.02 93.73 93.45

Iris (5,3,150) 94.67 94.67 94.00 93.33

Labor (17,2,57) 78.95 84.21 89.47 84.21

Led7 (8,10,3200) 73.34 73.66 73.34 73.06

Lymphography (19,4,148) 77.03 79.05 85.14 82.43

Pima (9,2,768) 74.35 76.69 73.44 77.60

Sonar (61,2,208) 78.85 78.85 84.62 85.58

Tic-tac-toe (10,2,958) 83.72 93.84 98.54 95.82

Vehicle (19,4,846) 71.04 74.11 78.25 74.94

Waveform (22,3,5000) 76.44 83.30 83.90 84.04

Wine (14,3,178) 94.38 96.07 97.75 97.75

Zoo (17,7,101) 93.07 93.07 95.05 97.03

4.11 Class Imbalance Problem

In many data sets there are a disproportionate number of instances that belong to different classes, a property known as skew or class imbalance. For example, consider a health-care application where diagnostic reports are used to decide whether a person has a rare disease. Because of the infrequent nature of the disease, we can expect to observe a smaller number of subjects who are positively diagnosed. Similarly, in credit card fraud detection, fraudulent transactions are greatly outnumbered by legitimate transactions.

The degree of imbalance between the classes varies across different applications and even across different data sets from the same application. For example, the risk for a rare disease may vary across different populations of subjects depending on their dietary and lifestyle choices. However, despite their infrequent occurrences, a correct classification of the rare class often has greater value than a correct classification of the majority class. For example, it may be more dangerous to ignore a patient suffering from a disease than to misdiagnose a healthy person.

More generally, class imbalance poses two challenges for classification. First, it can be difficult to find sufficiently many labeled samples of a rare class. Note that many of the classification methods discussed so far work well only when the training set has a balanced representation of both classes. Although some classifiers are more effective at handling imbalance in the training data than others, e.g., rule-based classifiers and k-NN, they are all impacted if the minority class is not well-represented in the training set. In general, a classifier trained over an imbalanced data set shows a bias toward improving its performance over the majority class, which is often not the desired behavior.

As a result, many existing classification models, when trained on an imbalanced data set, may not effectively detect instances of the rare class.

Second, accuracy, which is the traditional measure for evaluating classification performance, is not well-suited for evaluating models in the presence of class imbalance in the test data. For example, if 1% of the credit card transactions are fraudulent, then a trivial model that predicts every transaction as legitimate will have an accuracy of 99% even though it fails to detect any of the fraudulent activities. Thus, there is a need to use alternative evaluation metrics that are sensitive to the skew and can capture different criteria of performance than accuracy.

In this section, we first present some of the generic methods for building classifiers when there is class imbalance in the training set. We then discuss methods for evaluating classification performance and adapting classification decisions in the presence of a skewed test set. In the remainder of this section, we will consider binary classification problems for simplicity, where the minority class is referred to as the positive (+) class while the majority class is referred to as the negative (−) class.

4.11.1 Building Classifiers with Class Imbalance

There are two primary considerations for building classifiers in the presence of class imbalance in the training set. First, we need to ensure that the learning algorithm is trained over a data set that has adequate representation of both the majority as well as the minority classes. Some common approaches for ensuring this include the methodologies of oversampling and undersampling the training set. Second, having learned a classification model, we need a way to adapt its classification decisions (and thus create an appropriately tuned classifier) to best match the requirements of the imbalanced test set. This is typically done by converting the outputs of the classification model to real-valued scores, and then selecting a suitable threshold on the classification score to match the needs of a test set. Both these considerations are discussed in detail in the following.

Oversampling and Undersampling The first step in learning with imbalanced data is to transform the training set to a balanced training set, where both classes have nearly equal representation. The balanced training set can then be used with any of the existing classification techniques (without making any modifications in the learning algorithm) to learn a model that gives equal emphasis to both classes. In the following, we present some of the common techniques for transforming an imbalanced training set to a balanced one.

A basic approach for creating balanced training sets is to generate a sample of training instances where the rare class has adequate representation. There are two types of sampling methods that can be used to enhance the representation of the minority class: (a) undersampling, where the frequency of the majority class is reduced to match the frequency of the minority class, and (b) oversampling, where artificial examples of the minority class are created to make them equal in proportion to the number of negative instances.

To illustrate undersampling, consider a training set that contains 100 positive examples and 1000 negative examples. To overcome the skew among the classes, we can select a random sample of 100 examples from the negative class and use them with the 100 positive examples to create a balanced training set. A classifier built over the resultant balanced set will then be unbiased toward both classes. However, one limitation of undersampling is that some of the useful negative examples (e.g., those closer to the actual decision boundary) may not be chosen for training, thereby resulting in an inferior classification model. Another limitation is that the smaller sample of 100 negative instances may have a higher variance than the larger set of 1000.

Oversampling attempts to create a balanced training set by artificially generating new positive examples. A simple approach for oversampling is to duplicate every positive instance n−/n+ times, where n+ and n− are the numbers of positive and negative training instances, respectively. Figure 4.51 illustrates the effect of oversampling on the learning of a decision boundary using a classifier such as a decision tree. Without oversampling, only the positive examples at the bottom right-hand side of Figure 4.51(a) are classified correctly. The positive example in the middle of the diagram is misclassified because there are not enough examples to justify the creation of a new decision boundary to separate the positive and negative instances. Oversampling provides the additional examples needed to ensure that the decision boundary surrounding the positive example is not pruned, as illustrated in Figure 4.51(b). Note that duplicating a positive instance is analogous to doubling its weight during the training stage. Hence, the effect of oversampling can be alternatively achieved by assigning higher weights to positive instances than negative instances. This method of weighting instances can be used with a number of classifiers such as logistic regression, ANN, and SVM.

Figure 4.51. Illustrating the effect of oversampling of the rare class.

One limitation of the duplication method for oversampling is that the replicated positive examples have an artificially lower variance when compared with their true distribution in the overall data. This can bias the classifier to the specific distribution of training instances, which may not be representative of the overall distribution of test instances, leading to poor generalizability. To overcome this limitation, an alternative approach for oversampling is to generate synthetic positive instances in the neighborhood of existing positive instances. In this approach, called the Synthetic Minority Oversampling Technique (SMOTE), we first determine the k-nearest positive neighbors of every positive instance x, and then generate a synthetic positive instance at some intermediate point along the line segment joining x to one of its randomly chosen k-nearest neighbors, x_k. This process is repeated until the desired number of positive instances is reached. However, one limitation of this approach is that it can only generate new positive instances in the convex hull of the existing positive class. Hence, it does not help improve the representation of the positive class outside the boundary of existing positive instances. Despite their complementary strengths and weaknesses, undersampling and oversampling provide useful directions for generating balanced training sets in the presence of class imbalance.
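A minimal sketch of the SMOTE-style interpolation described above. This is our own simplified illustration (it assumes there are more than k positive instances and uses Euclidean distance); production implementations, such as the one in the imbalanced-learn package, handle many more details.

import numpy as np

def smote(X_pos, n_synthetic, k=5, rng=None):
    # X_pos: array of positive (minority-class) instances, shape (n_pos, d).
    rng = rng or np.random.default_rng(0)
    n_pos = len(X_pos)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(n_pos)
        # k nearest positive neighbors of the chosen instance (excluding itself).
        dist = np.linalg.norm(X_pos - X_pos[i], axis=1)
        neighbors = np.argsort(dist)[1:k + 1]
        j = rng.choice(neighbors)
        # New instance at a random intermediate point on the segment x -- x_k.
        lam = rng.random()
        synthetic.append(X_pos[i] + lam * (X_pos[j] - X_pos[i]))
    return np.array(synthetic)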

Assigning Scores to Test Instances If a classifier returns an ordinal score s(x) for every test instance x such that a higher score denotes a greater likelihood of x belonging to the positive class, then for every possible value of the score threshold, s_T, we can create a new binary classifier where a test instance x is classified as positive only if s(x) > s_T. Thus, every choice of s_T can potentially lead to a different classifier, and we are interested in finding the classifier that is best suited for our needs.

Ideally, we would like the classification score to vary monotonically with the actual posterior probability of the positive class, i.e., if s(x_1) and s(x_2) are the scores of any two instances, x_1 and x_2, then s(x_1) ≥ s(x_2) ⇒ P(y = 1|x_1) ≥ P(y = 1|x_2). However, this is difficult to guarantee in practice as the properties of the classification score depend on several factors such as the complexity of the classification algorithm and the representative power of the training set. In general, we can only expect the classification score of a reasonable algorithm to be weakly related to the actual posterior probability of the positive class, even though the relationship may not be strictly monotonic. Most classifiers can be easily modified to produce such a real-valued score. For example, the signed distance of an instance from the positive margin hyperplane of an SVM can be used as a classification score. As another example, test instances belonging to a leaf in a decision tree can be assigned a score based on the fraction of training instances labeled as positive in the leaf. Also, probabilistic classifiers such as naïve Bayes, Bayesian networks, and logistic regression naturally output estimates of posterior probabilities, P(y = 1|x). Next, we discuss some evaluation measures for assessing the goodness of a classifier in the presence of class imbalance.

Table 4.6. A confusion matrix for a binary classification problem in which the classes are not equally important.

                          Predicted Class
                          +              −
Actual Class    +     f++ (TP)       f+− (FN)
                −     f−+ (FP)       f−− (TN)

4.11.2 Evaluating Performance with Class Imbalance

The most basic approach for representing a classifier's performance on a test set is to use a confusion matrix, as shown in Table 4.6. This table is essentially the same as Table 3.4, which was introduced in the context of evaluating classification performance in Section 3.2. A confusion matrix summarizes the number of instances predicted correctly or incorrectly by a classifier using the following four counts:

True positive (TP) or f++, which corresponds to the number of positive examples correctly predicted by the classifier.

False positive (FP) or f−+ (also known as Type I error), which corresponds to the number of negative examples wrongly predicted as positive by the classifier.


False negative (FN) or f+− (also known as Type II error), which corresponds to the number of positive examples wrongly predicted as negative by the classifier.

True negative (TN) or f−−, which corresponds to the number of negative examples correctly predicted by the classifier.

The confusion matrix provides a concise representation of classification performance on a given test data set. However, it is often difficult to interpret and compare the performance of classifiers using the four-dimensional representations (corresponding to the four counts) provided by their confusion matrices. Hence, the counts in the confusion matrix are often summarized using a number of evaluation measures. Accuracy is an example of one such measure that combines these four counts into a single value, which is used extensively when classes are balanced. However, the accuracy measure is not suitable for handling data sets with imbalanced class distributions as it tends to favor classifiers that correctly classify the majority class. In the following, we describe other possible measures that capture different criteria of performance when working with imbalanced classes.

A basic evaluation measure is the true positive rate (TPR), which is defined as the fraction of positive test instances correctly predicted by the classifier:

TPR = \frac{TP}{TP + FN}.

In the medical community, TPR is also known as sensitivity, while in the information retrieval literature, it is also called recall (r). A classifier with a high TPR has a high chance of correctly identifying the positive instances of the data.

Analogously to TPR, the true negative rate (TNR) (also known as specificity) is defined as the fraction of negative test instances correctly predicted by the classifier, i.e.,

TNR = \frac{TN}{FP + TN}.

A high TNR value signifies that the classifier correctly classifies any randomly chosen negative instance in the test set. A commonly used evaluation measure that is closely related to TNR is the false positive rate (FPR), which is defined as 1 − TNR:

FPR = \frac{FP}{FP + TN}.

Similarly, we can define the false negative rate (FNR) as 1 − TPR:

FNR = \frac{FN}{FN + TP}.

Note that the evaluation measures defined above do not take into account the skew among the classes, which can be formally defined as α = P/(P + N), where P and N denote the number of actual positives and actual negatives, respectively. As a result, changing the relative numbers of P and N will have no effect on TPR, TNR, FPR, or FNR, since they depend only on the fraction of correct classifications for every class, independently of the other class. Furthermore, knowing the values of TPR and TNR (and consequently FNR and FPR) does not by itself help us uniquely determine all four entries of the confusion matrix. However, together with information about the skew factor, α, and the total number of instances, N, we can compute the entire confusion matrix using TPR and TNR, as shown in Table 4.7.

Table 4.7. Entries of the confusion matrix in terms of the TPR, TNR, skew, α, and total number of instances, N.

                 Predicted +                     Predicted −                  Total
Actual +      TPR × α × N                   (1 − TPR) × α × N               α × N
Actual −      (1 − TNR) × (1 − α) × N       TNR × (1 − α) × N               (1 − α) × N
Total                                                                        N
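The rates defined above follow directly from the four confusion-matrix counts. A small sketch (the helper name is our own) that computes TPR, TNR, FPR, FNR, and the skew α:

def basic_rates(tp, fn, fp, tn):
    p, n = tp + fn, fp + tn            # actual positives and actual negatives
    return {
        "TPR (recall, sensitivity)": tp / p,
        "TNR (specificity)": tn / n,
        "FPR": fp / n,                 # equals 1 - TNR
        "FNR": fn / p,                 # equals 1 - TPR
        "skew alpha = P/(P+N)": p / (p + n),
    }

print(basic_rates(tp=100, fn=0, fp=100, tn=900))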

An evaluation measure that is sensitive to the skew is precision, which can be defined as the fraction of correct predictions of the positive class over the total number of positive predictions, i.e.,

Precision, p = \frac{TP}{TP + FP}.

Precision is also referred to as the positive predicted value (PPV). A classifier that has a high precision is likely to have most of its positive predictions correct. Precision is a useful measure for highly skewed test sets where the positive predictions, even though small in numbers, are required to be mostly correct. A measure that is closely related to precision is the false discovery rate (FDR), which can be defined as 1 − p:

FDR = \frac{FP}{TP + FP}.

Although both FDR and FPR focus on FP, they are designed to capture different evaluation objectives and thus can take quite contrasting values, especially in the presence of class imbalance. To illustrate this, consider a classifier with the following confusion matrix.

                       Predicted Class
                       +          −
Actual Class    +     100          0
                −     100        900

Since half of the positive predictions made by the classifier are incorrect, it has an FDR value of 100/(100 + 100) = 0.5. However, its FPR is equal to 100/(100 + 900) = 0.1, which is quite low. This example shows that in the presence of high skew (i.e., a very small value of α), even a small FPR can result in a high FDR. See Section 10.6 for further discussion of this issue.

Note that the evaluation measures defined above provide an incomplete representation of performance, because they either only capture the effect of false positives (e.g., FPR and precision) or the effect of false negatives (e.g., TPR or recall), but not both. Hence, if we optimize only one of these evaluation measures, we may end up with a classifier that shows low FN but high FP, or vice-versa. For example, a classifier that declares every instance to be positive will have a perfect recall, but high FPR and very poor precision. On the other hand, a classifier that is very conservative in classifying an instance as positive (to reduce FP) may end up having high precision but very poor recall. We thus need evaluation measures that account for both types of misclassifications, FP and FN. Some examples of such evaluation measures are summarized by the following definitions.

Positive Likelihood Ratio = \frac{TPR}{FPR}.
F_1 measure = \frac{2rp}{r + p} = \frac{2 \times TP}{2 \times TP + FP + FN}.
G measure = \sqrt{rp} = \frac{TP}{\sqrt{(TP + FP)(TP + FN)}}.

While some of these evaluation measures are invariant to the skew (e.g., the positive likelihood ratio), others (e.g., precision and the F_1 measure) are sensitive to skew. Further, different evaluation measures capture the effects of different types of misclassification errors in various ways. For example, the F_1 measure represents a harmonic mean between recall and precision, i.e.,

F_1 = \frac{2}{\frac{1}{r} + \frac{1}{p}}.

Because the harmonic mean of two numbers tends to be closer to the smaller of the two numbers, a high value of the F_1-measure ensures that both precision and recall are reasonably high. Similarly, the G measure represents the geometric mean between recall and precision. A comparison among harmonic, geometric, and arithmetic means is given in the next example.

Example 4.9. Consider two positive numbers a = 1 and b = 5. Their arithmetic mean is μ_a = (a + b)/2 = 3 and their geometric mean is μ_g = √(ab) = 2.236. Their harmonic mean is μ_h = (2 × 1 × 5)/6 = 1.667, which is closer to the smaller value between a and b than the arithmetic and geometric means.

A generic extension of the F_1 measure is the F_β measure, which can be defined as follows.

F_\beta = \frac{(\beta^2 + 1) r p}{r + \beta^2 p} = \frac{(\beta^2 + 1) \times TP}{(\beta^2 + 1) TP + \beta^2 FN + FP}.   (4.106)

Both precision and recall can be viewed as special cases of F_β by setting β = 0 and β = ∞, respectively. Low values of β make F_β closer to precision, and high values make it closer to recall.

A more general measure that captures F_β as well as accuracy is the weighted accuracy measure, which is defined by the following equation:

Weighted accuracy = \frac{w_1 TP + w_4 TN}{w_1 TP + w_2 FN + w_3 FP + w_4 TN}.   (4.107)

The relationship between weighted accuracy and other performance measures is summarized in the following table:

Measure       w_1       w_2    w_3    w_4
Recall         1         1      0      0
Precision      1         0      1      0
F_β          β² + 1      β²     1      0
Accuracy       1         1      1      1
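The following sketch collects the composite measures defined above into small helpers (our own illustration); it follows the reconstructed forms of Equations 4.106 and 4.107 and the weight table above, and assumes all counts are positive.

import math

def composite_measures(tp, fn, fp, beta=1.0):
    r = tp / (tp + fn)                                        # recall (TPR)
    p = tp / (tp + fp)                                        # precision
    f_beta = (beta ** 2 + 1) * r * p / (r + beta ** 2 * p)    # Equation 4.106
    g = math.sqrt(r * p)                                      # geometric mean of r and p
    return {"recall": r, "precision": p, "F_beta": f_beta, "G": g}

def weighted_accuracy(tp, fn, fp, tn, w1, w2, w3, w4):
    # Equation 4.107: weights (1, 1, 0, 0) recover recall, (1, 0, 1, 0) precision,
    # and (1, 1, 1, 1) ordinary accuracy, matching the table above.
    return (w1 * tp + w4 * tn) / (w1 * tp + w2 * fn + w3 * fp + w4 * tn)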

4.11.3 Finding an Optimal Score Threshold

Given a suitably chosen evaluation measure E and a distribution of classification scores, s(x), on a validation set, we can obtain the optimal score threshold s* on the validation set using the following approach:

1. Sort the scores in increasing order of their values.

2. For every unique value of score, s, consider the classification model that assigns an instance x as positive only if s(x) > s. Let E(s) denote the performance of this model on the validation set.

3. Find s* that maximizes the evaluation measure E(s): s* = argmax_s E(s).

Note that s* can be treated as a hyper-parameter of the classification algorithm that is learned during model selection. Using s*, we can assign a positive label to a future test instance x only if s(x) > s*. If the evaluation measure E is skew invariant (e.g., the positive likelihood ratio), then we can select s* without knowing the skew of the test set, and the resultant classifier formed using s* can be expected to show optimal performance on the test set (with respect to the evaluation measure E). On the other hand, if E is sensitive to the skew (e.g., precision or the F_1-measure), then we need to ensure that the skew of the validation set used for selecting s* is similar to that of the test set, so that the classifier formed using s* shows optimal test performance with respect to E. Alternatively, given an estimate of the skew of the test data, α, we can use it along with the TPR and TNR on the validation set to estimate all entries of the confusion matrix (see Table 4.7), and thus the estimate of any evaluation measure E on the test set. The score threshold s* selected using this estimate of E can then be expected to produce optimal test performance with respect to E. Furthermore, the methodology of selecting s* on the validation set can help in comparing the test performance of different classification algorithms, by using the optimal values of s* for each algorithm.
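A sketch of the three-step threshold-selection procedure above, using the F_1 measure as an illustrative choice of E (helper names are our own; labels are assumed to be 1 for positive and 0 for negative).

def f1_at_threshold(scores, labels, s):
    tp = sum(1 for sc, y in zip(scores, labels) if sc > s and y == 1)
    fp = sum(1 for sc, y in zip(scores, labels) if sc > s and y == 0)
    fn = sum(1 for sc, y in zip(scores, labels) if sc <= s and y == 1)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def best_threshold(scores, labels, measure=f1_at_threshold):
    # Step 1: sort the unique scores; Steps 2-3: evaluate E(s) on the validation
    # set for every candidate threshold and keep the maximizer s*.
    candidates = sorted(set(scores))
    return max(candidates, key=lambda s: measure(scores, labels, s))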

4.11.4 Aggregate Evaluation of Performance

Although the above approach helps in finding a score threshold s* that provides optimal performance with respect to a desired evaluation measure and a certain amount of skew, α, sometimes we are interested in evaluating the performance of a classifier on a number of possible score thresholds, each corresponding to a different choice of evaluation measure and skew value. Assessing the performance of a classifier over a range of score thresholds is called aggregate evaluation of performance. In this style of analysis, the emphasis is not on evaluating the performance of a single classifier corresponding to the optimal score threshold, but to assess the general quality of ranking produced by the classification scores on the test set. In general, this helps in obtaining robust estimates of classification performance that are not sensitive to specific choices of score thresholds.


ROC Curve One of the widely-used tools for aggregate evaluation is the receiver operating characteristic (ROC) curve. An ROC curve is a graphical approach for displaying the trade-off between TPR and FPR of a classifier, over varying score thresholds. In an ROC curve, the TPR is plotted along the y-axis and the FPR is shown on the x-axis. Each point along the curve corresponds to a classification model generated by placing a threshold on the test scores produced by the classifier. The following procedure describes the generic approach for computing an ROC curve:

1. Sort the test instances in increasing order of their scores.

2. Select the lowest ranked test instance (i.e., the instance with the lowest score). Assign the selected instance and those ranked above it to the positive class. This approach is equivalent to classifying all the test instances as the positive class. Because all the positive examples are classified correctly and the negative examples are misclassified, TPR = FPR = 1.

3. Select the next test instance from the sorted list. Classify the selected instance and those ranked above it as positive, while those ranked below it as negative. Update the counts of TP and FP by examining the actual class label of the selected instance. If this instance belongs to the positive class, the TP count is decremented and the FP count remains the same as before. If the instance belongs to the negative class, the FP count is decremented and the TP count remains the same as before.

4. Repeat Step 3 and update the TP and FP counts accordingly until the highest ranked test instance is selected. At this final threshold, TPR = FPR = 0, as all instances are labeled as negative.

5. Plot the TPR against the FPR of the classifier.
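The procedure above can be written compactly by sweeping the sorted scores. The following sketch (our own illustration) returns the (FPR, TPR) points of the ROC curve, assuming binary labels with 1 for positive and 0 for negative.

def roc_points(scores, labels):
    # Start with every instance predicted positive (TPR = FPR = 1), then move the
    # threshold past one instance at a time in increasing order of score.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    P = sum(labels)
    N = len(labels) - P
    tp, fp = P, N
    points = [(1.0, 1.0)]
    for i in order:
        if labels[i] == 1:
            tp -= 1           # a positive instance is now predicted negative
        else:
            fp -= 1           # a negative instance is now predicted negative
        points.append((fp / N, tp / P))
    return points             # ends at (0, 0): everything predicted negative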

Example 4.10. [Generating ROC Curve] Figure 4.52 shows an example of how to compute the TPR and FPR values for every choice of score threshold. There are five positive examples and five negative examples in the test set. The class labels of the test instances are shown in the first row of the table, while the second row corresponds to the sorted score values for each instance. The next six rows contain the counts of TP, FP, TN, and FN, along with their corresponding TPR and FPR. The table is then filled from left to right. Initially, all the instances are predicted to be positive. Thus, TP = FP = 5 and TPR = FPR = 1. Next, we assign the test instance with the lowest score as the negative class. Because the selected instance is actually a positive example, the TP count decreases from 5 to 4 and the FP count is the same as before. The FPR and TPR are updated accordingly. This process is repeated until we reach the end of the list, where TPR = 0 and FPR = 0. The ROC curve for this example is shown in Figure 4.53.

Figure 4.52. Computing the TPR and FPR at every score threshold.

Figure 4.53. ROC curve for the data shown in Figure 4.52.

Note that in an ROC curve, the TPR monotonically increases with FPR, because the inclusion of a test instance in the set of predicted positives can either increase the TPR or the FPR. The ROC curve thus has a staircase pattern. Furthermore, there are several critical points along an ROC curve that have well-known interpretations:

(TPR = 0, FPR = 0): Model predicts every instance to be a negative class.

(TPR = 1, FPR = 1): Model predicts every instance to be a positive class.

(TPR = 1, FPR = 0): The perfect model with zero misclassifications.

A good classification model should be located as close as possible to the upper left corner of the diagram, while a model that makes random guesses should reside along the main diagonal, connecting the points (TPR = 0, FPR = 0) and (TPR = 1, FPR = 1). Random guessing means that an instance is classified as a positive class with a fixed probability p, irrespective of its attribute set.

For example, consider a data set that contains n+ positive instances and n− negative instances. The random classifier is expected to correctly classify pn+ of the positive instances and to misclassify pn− of the negative instances. Therefore, the TPR of the classifier is (pn+)/n+ = p, while its FPR is (pn−)/n− = p. Hence, this random classifier will reside at the point (p, p) in the ROC curve, along the main diagonal.

Figure 4.54. ROC curves for two different classifiers.

Since every point on the ROC curve represents the performance of a classifier generated using a particular score threshold, they can be viewed as different operating points of the classifier. One may choose one of these operating points depending on the requirements of the application. Hence, an ROC curve facilitates the comparison of classifiers over a range of operating points. For example, Figure 4.54 compares the ROC curves of two classifiers, M1 and M2, generated by varying the score thresholds. We can see that M1 is better than M2 when the FPR is less than 0.36, as M1 shows a better TPR than M2 for this range of operating points. On the other hand, M2 is superior when the FPR is greater than 0.36, since the TPR of M2 is higher than that of M1 for this range. Clearly, neither of the two classifiers dominates (is strictly better than) the other, i.e., shows higher values of TPR and lower values of FPR over all operating points.

To summarize the aggregate behavior across all operating points, one of the commonly used measures is the area under the ROC curve (AUC). If the classifier is perfect, then its area under the ROC curve will be equal to 1. If the algorithm simply performs random guessing, then its area under the ROC curve will be equal to 0.5.

Although the AUC provides a useful summary of aggregate performance, there are certain caveats in using the AUC for comparing classifiers. First, even if the AUC of algorithm A is higher than the AUC of another algorithm B, this does not mean that algorithm A is always better than B, i.e., that the ROC curve of A dominates that of B across all operating points. For example, even though M1 shows a slightly lower AUC than M2 in Figure 4.54, we can see that both M1 and M2 are useful over different ranges of operating points and neither of them is strictly better than the other across all possible operating points. Hence, we cannot use the AUC to determine which algorithm is better, unless we know that the ROC curve of one of the algorithms dominates the other.

Second, although the AUC summarizes the aggregate performance over all operating points, we are often interested in only a small range of operating points in most applications. For example, even though M1 shows a slightly lower AUC than M2, it shows higher TPR values than M2 for small FPR values (smaller than 0.36). In the presence of class imbalance, the behavior of


an algorithm over small FPR values (also termed early retrieval) is often more meaningful for comparison than the performance over all FPR values. This is because, in many applications, it is important to assess the TPR achieved by a classifier in the first few instances with the highest scores, without incurring a large FPR. Hence, in Figure 4.54, due to the high TPR values of M1 during early retrieval (FPR < 0.36), we may prefer M1 over M2 for imbalanced test sets, despite the lower AUC of M1. Hence, care must be taken while comparing the AUC values of different classifiers, usually by visualizing their ROC curves rather than just reporting their AUC.

A key characteristic of ROC curves is that they are agnostic to the skew in the test set, because both of the evaluation measures used in constructing ROC curves (TPR and FPR) are invariant to class imbalance. Hence, ROC curves are not suitable for measuring the impact of skew on classification performance. In particular, we will obtain the same ROC curve for two test data sets that have very different skew.


Figure 4.55. PR curves for two different classifiers.

Precision-Recall Curve An alternate tool for aggregate evaluation is the precision-recall curve (PR curve). The PR curve plots the precision and recall values of a classifier on the y- and x-axes, respectively, by varying the threshold on the test scores. Figure 4.55 shows an example of PR curves for two hypothetical classifiers, M1 and M2. The approach for generating a PR curve is similar to the approach described above for generating an ROC curve. However, there are some key distinguishing features in the PR curve:

1. PR curves are sensitive to the skew factor α = P/(P + N), and different PR curves are generated for different values of α.

2. When the score threshold is lowest (every instance is labeled as positive), the precision is equal to α while the recall is 1. As we increase the score threshold, the number of predicted positives can stay the same or decrease. Hence, the recall monotonically declines as the score threshold increases. In general, the precision may increase or decrease for the same value of recall, upon addition of an instance into the set of predicted positives. For example, if the kth ranked instance belongs to the negative class, then including it will result in a drop in the precision without affecting the recall. The precision may improve at the next step, which adds the (k+1)th ranked instance, if this instance belongs to the positive class. Hence, the PR curve is not a smooth, monotonically increasing curve like the ROC curve, and generally has a zigzag pattern. This pattern is more prominent in the left part of the curve, where even a small change in the number of false positives can cause a large change in precision.

3. Also, as we increase the imbalance among the classes (reduce the value of α), the rightmost points of all PR curves will move downwards. At and near the leftmost point on the PR curve (corresponding to larger values of the score threshold), the recall is close to zero, while the precision is equal to the fraction of positives in the top ranked instances of the algorithm. Hence, different classifiers can have different values of precision at the leftmost points of the PR curve. Also, if the classification score of an algorithm monotonically varies with the posterior probability of the positive class, we can expect the PR curve to gradually decrease from a high value of precision at the leftmost point to a constant value of α at the rightmost point, albeit with some ups and downs. This can be observed in the PR curve of algorithm M1 in Figure 4.55, which starts from a higher value of precision on the left that gradually decreases as we move towards the right. On the other hand, the PR curve of algorithm M2 starts from a lower value of precision on the left and shows more drastic ups and downs as we move right, suggesting that the classification score of M2 shows a weaker monotonic relationship with the posterior probability of the positive class.

4. A random classifier that assigns an instance to be positive with a fixed probability p has a precision of α and a recall of p. Hence, a classifier that performs random guessing has a horizontal PR curve with y = α, as shown using a dashed line in Figure 4.55. Note that the random baseline in PR curves depends on the skew in the test set, in contrast to the fixed main diagonal of ROC curves that represents random classifiers.

5. Note that the precision of an algorithm is impacted more strongly by false positives in the top ranked test instances than the FPR of the algorithm. For this reason, the PR curve generally helps to magnify the differences between classifiers in the left portion of the PR curve. Hence, in the presence of class imbalance in the test data, analyzing the PR curves generally provides a deeper insight into the performance of classifiers than the ROC curves, especially in the early retrieval range of operating points.

6. The classifier corresponding to (precision = 1, recall = 1) represents the perfect classifier. Similar to AUC, we can also compute the area under the PR curve of an algorithm, known as AUC-PR. The AUC-PR of a random classifier is equal to α, while that of a perfect algorithm is equal to 1. Note that AUC-PR varies with changing skew in the test set, in contrast to the area under the ROC curve, which is insensitive to the skew. The AUC-PR helps in accentuating the differences between classification algorithms in the early retrieval range of operating points. Hence, it is more suited for evaluating classification performance in the presence of class imbalance than the area under the ROC curve. However, similar to ROC curves, a higher value of AUC-PR does not guarantee the superiority of one classification algorithm over another. This is because the PR curves of two algorithms can easily cross each other, such that they both show better performances in different ranges of operating points. Hence, it is important to visualize the PR curves before comparing their AUC-PR values, in order to ensure a meaningful evaluation.

4.12 Multiclass Problem

Some of the classification techniques described in this chapter are originally designed for binary classification problems. Yet there are many real-world problems, such as character recognition, face identification, and text classification, where the input data is divided into more than two categories. This section presents several approaches for extending the binary classifiers to handle multiclass problems. To illustrate these approaches, let Y = {y_1, y_2, …, y_K} be the set of classes of the input data.

The first approach decomposes the multiclass problem into K binary problems. For each class y_i ∈ Y, a binary problem is created where all instances that belong to y_i are considered positive examples, while the remaining instances are considered negative examples. A binary classifier is then constructed to separate instances of class y_i from the rest of the classes. This is known as the one-against-rest (1-r) approach.

The second approach, which is known as the one-against-one (1-1) approach, constructs K(K − 1)/2 binary classifiers, where each classifier is used to distinguish between a pair of classes, (y_i, y_j). Instances that do not belong to either y_i or y_j are ignored when constructing the binary classifier for (y_i, y_j). In both the 1-r and 1-1 approaches, a test instance is classified by combining the predictions made by the binary classifiers. A voting scheme is typically employed to combine the predictions, where the class that receives the highest number of votes is assigned to the test instance. In the 1-r approach, if an instance is classified as negative, then all classes except for the positive class receive a vote. This approach, however, may lead to ties among the different classes. Another possibility is to transform the outputs of the binary classifiers into probability estimates and then assign the test instance to the class that has the highest probability.

Example 4.11. Consider a multiclass problem where Y = {y_1, y_2, y_3, y_4}. Suppose a test instance is classified as (+, −, −, −) according to the 1-r approach. In other words, it is classified as positive when y_1 is used as the positive class and negative when y_2, y_3, and y_4 are used as the positive class. Using a simple majority vote, notice that y_1 receives the highest number of votes, which is four, while the remaining classes receive only three votes. The test instance is therefore classified as y_1.

Example 4.12. Suppose the test instance is classified using the 1-1 approach as follows:

Binary pair of classes:   +: y_1   +: y_1   +: y_1   +: y_2   +: y_2   +: y_3
                          −: y_2   −: y_3   −: y_4   −: y_3   −: y_4   −: y_4
Classification:             +        +        −        +        −        +

The first two rows in this table correspond to the pair of classes (y_i, y_j) chosen to build the classifier and the last row represents the predicted class for the test instance. After combining the predictions, y_1 and y_4 each receive two votes, while y_2 and y_3 each receive only one vote. The test instance is therefore classified as either y_1 or y_4, depending on the tie-breaking procedure.

Error-Correcting Output Coding


A potential problem with the previous two approaches is that they may be sensitive to binary classification errors. For the 1-r approach given in Example 4.11, if at least one of the binary classifiers makes a mistake in its prediction, then the classifier may end up declaring a tie between classes or making a wrong prediction. For example, suppose the test instance is classified as (+, −, +, −) due to misclassification by the third classifier. In this case, it will be difficult to tell whether the instance should be classified as y_1 or y_3, unless the probability associated with each class prediction is taken into account.

Theerror-correctingoutputcoding(ECOC)methodprovidesamorerobustwayforhandlingmulticlassproblems.Themethodisinspiredbyaninformation-theoreticapproachforsendingmessagesacrossnoisychannels.

Theideabehindthisapproachistoaddredundancyintothetransmittedmessagebymeansofacodeword,sothatthereceivermaydetecterrorsinthereceivedmessageandperhapsrecovertheoriginalmessageifthenumberoferrorsissmall.

Formulticlasslearning,eachclass isrepresentedbyauniquebitstringoflengthnknownasitscodeword.Wethentrainnbinaryclassifierstopredicteachbitofthecodewordstring.ThepredictedclassofatestinstanceisgivenbythecodewordwhoseHammingdistanceisclosesttothecodewordproducedbythebinaryclassifiers.RecallthattheHammingdistancebetweenapairofbitstringsisgivenbythenumberofbitsthatdiffer.

Example4.13.Consideramulticlassproblemwhere .Supposeweencodetheclassesusingthefollowingsevenbitcodewords:

(+,−,+,−)y1

y3

yi

Y={y1,y2,y3,y4}

Class Codeword

1 1 1 1 1 1 1

0 0 0 0 1 1 1

0 0 1 1 0 0 1

0 1 0 1 0 1 0

Eachbitofthecodewordisusedtotrainabinaryclassifier.Ifatestinstanceisclassifiedas(0,1,1,1,1,1,1)bythebinaryclassifiers,thentheHammingdistancebetweenthecodewordand is1,whiletheHammingdistancetotheremainingclassesis3.Thetestinstanceisthereforeclassifiedas .

An interesting property of an error-correcting code is that if the minimum Hamming distance between any pair of codewords is d, then any ⌊(d−1)/2⌋ errors in the output code can be corrected using its nearest codeword. In Example 4.13, because the minimum Hamming distance between any pair of codewords is 4, the classifier may tolerate errors made by one of the seven binary classifiers. If there is more than one classifier that makes a mistake, then the classifier may not be able to compensate for the error.
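The decoding step of ECOC can be written compactly. The sketch below is not from the text; it uses the seven-bit codewords of Example 4.13 and assigns the class whose codeword has the smallest Hamming distance to the predicted bit string.

```python
# A minimal sketch of ECOC decoding by Hamming distance (Example 4.13 codewords).
codewords = {
    "y1": [1, 1, 1, 1, 1, 1, 1],
    "y2": [0, 0, 0, 0, 1, 1, 1],
    "y3": [0, 0, 1, 1, 0, 0, 1],
    "y4": [0, 1, 0, 1, 0, 1, 0],
}

def hamming(a, b):
    """Number of bit positions in which two codewords differ."""
    return sum(x != y for x, y in zip(a, b))

# Outputs of the seven binary classifiers for a test instance (as in Example 4.13).
predicted_bits = [0, 1, 1, 1, 1, 1, 1]

# Assign the class whose codeword is closest in Hamming distance.
best_class = min(codewords, key=lambda c: hamming(codewords[c], predicted_bits))
print(best_class)   # y1 (distance 1, versus distance 3 for the remaining classes)
```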

An important issue is how to design the appropriate set of codewords for different classes. From coding theory, a vast number of algorithms have been developed for generating n-bit codewords with bounded Hamming distance. However, the discussion of these algorithms is beyond the scope of this book. It is worthwhile mentioning that there is a significant difference between the design of error-correcting codes for communication tasks compared to those used for multiclass learning. For communication, the codewords should maximize the Hamming distance between the rows so that error correction

can be performed. Multiclass learning, however, requires that both the row-wise and column-wise distances of the codewords must be well separated. A larger column-wise distance ensures that the binary classifiers are mutually independent, which is an important requirement for ensemble learning methods.

4.13BibliographicNotesMitchell[278]providesexcellentcoverageonmanyclassificationtechniquesfromamachinelearningperspective.ExtensivecoverageonclassificationcanalsobefoundinAggarwal[195],Dudaetal.[229],Webb[307],Fukunaga[237],Bishop[204],Hastieetal.[249],CherkasskyandMulier[215],WittenandFrank[310],Handetal.[247],HanandKamber[244],andDunham[230].

Directmethodsforrule-basedclassifierstypicallyemploythesequentialcoveringschemeforinducingclassificationrules.Holte's1R[255]isthesimplestformofarule-basedclassifierbecauseitsrulesetcontainsonlyasinglerule.Despiteitssimplicity,Holtefoundthatforsomedatasetsthatexhibitastrongone-to-onerelationshipbetweentheattributesandtheclasslabel,1Rperformsjustaswellasotherclassifiers.Otherexamplesofrule-basedclassifiersincludeIREP[234],RIPPER[218],CN2[216,217],AQ[276],RISE[224],andITRULE[296].Table4.8 showsacomparisonofthecharacteristicsoffouroftheseclassifiers.

Table 4.8. Comparison of various rule-based classifiers.

| Characteristic                       | RIPPER                                | CN2 (unordered)     | CN2 (ordered)                 | AQR                                                |
| Rule-growing strategy                | General-to-specific                   | General-to-specific | General-to-specific           | General-to-specific (seeded by a positive example) |
| Evaluation metric                    | FOIL's Info gain                      | Laplace             | Entropy and likelihood ratio  | Number of true positives                           |
| Stopping condition for rule-growing  | All examples belong to the same class | No performance gain | No performance gain           | Rules cover only positive class                    |
| Rule pruning                         | Reduced error pruning                 | None                | None                          | None                                               |
| Instance elimination                 | Positive and negative                 | Positive only       | Positive only                 | Positive and negative                              |
| Stopping condition for adding rules  | Error > 50% or based on MDL           | No performance gain | No performance gain           | All positive examples are covered                  |
| Rule set pruning                     | Replace or modify rules               | Statistical tests   | None                          | None                                               |
| Search strategy                      | Greedy                                | Beam search         | Beam search                   | Beam search                                        |

Forrule-basedclassifiers,theruleantecedentcanbegeneralizedtoincludeanypropositionalorfirst-orderlogicalexpression(e.g.,Hornclauses).Readerswhoareinterestedinfirst-orderlogicrule-basedclassifiersmayrefertoreferencessuchas[278]orthevastliteratureoninductivelogicprogramming[279].Quinlan[287]proposedtheC4.5rulesalgorithmforextractingclassificationrulesfromdecisiontrees.AnindirectmethodforextractingrulesfromartificialneuralnetworkswasgivenbyAndrewsetal.in[198].

CoverandHart[220]presentedanoverviewofthenearestneighborclassificationmethodfromaBayesianperspective.Ahaprovidedboththeoreticalandempiricalevaluationsforinstance-basedmethodsin[196].PEBLS,whichwasdevelopedbyCostandSalzberg[219],isanearestneighborclassifierthatcanhandledatasetscontainingnominalattributes.


EachtrainingexampleinPEBLSisalsoassignedaweightfactorthatdependsonthenumberoftimestheexamplehelpsmakeacorrectprediction.Hanetal.[243]developedaweight-adjustednearestneighboralgorithm,inwhichthefeatureweightsarelearnedusingagreedy,hill-climbingoptimizationalgorithm.Amorerecentsurveyofk-nearestneighborclassificationisgivenbySteinbachandTan[298].

NaïveBayesclassifiershavebeeninvestigatedbymanyauthors,includingLangleyetal.[267],RamoniandSebastiani[288],Lewis[270],andDomingosandPazzani[227].AlthoughtheindependenceassumptionusedinnaïveBayesclassifiersmayseemratherunrealistic,themethodhasworkedsurprisinglywellforapplicationssuchastextclassification.Bayesiannetworksprovideamoreflexibleapproachbyallowingsomeoftheattributestobeinterdependent.AnexcellenttutorialonBayesiannetworksisgivenbyHeckermanin[252]andJensenin[258].Bayesiannetworksbelongtoabroaderclassofmodelsknownasprobabilisticgraphicalmodels.AformalintroductiontotherelationshipsbetweengraphsandprobabilitieswaspresentedinPearl[283].OthergreatresourcesonprobabilisticgraphicalmodelsincludebooksbyBishop[205],andJordan[259].Detaileddiscussionsofconceptssuchasd-separationandMarkovblanketsareprovidedinGeigeretal.[238]andRussellandNorvig[291].

Generalizedlinearmodels(GLM)arearichclassofregressionmodelsthathavebeenextensivelystudiedinthestatisticalliterature.TheywereformulatedbyNelderandWedderburnin1972[280]tounifyanumberofregressionmodelssuchaslinearregression,logisticregression,andPoissonregression,whichsharesomesimilaritiesintheirformulations.AnextensivediscussionofGLMsisprovidedinthebookbyMcCullaghandNelder[274].

Artificialneuralnetworks(ANN)havewitnessedalongandwindinghistoryofdevelopments,involvingmultiplephasesofstagnationandresurgence.The

ideaofamathematicalmodelofaneuralnetworkwasfirstintroducedin1943byMcCullochandPitts[275].Thisledtoaseriesofcomputationalmachinestosimulateaneuralnetworkbasedonthetheoryofneuralplasticity[289].Theperceptron,whichisthesimplestprototypeofmodernANNs,wasdevelopedbyRosenblattin1958[290].Theperceptronusesasinglelayerofprocessingunitsthatcanperformbasicmathematicaloperationssuchasadditionandmultiplication.However,theperceptroncanonlylearnlineardecisionboundariesandisguaranteedtoconvergeonlywhentheclassesarelinearlyseparable.Despitetheinterestinlearningmulti-layernetworkstoovercomethelimitationsofperceptron,progressinthisarearemainhalteduntiltheinventionofthebackpropagationalgorithmbyWerbosin1974[309],whichallowedforthequicktrainingofmulti-layerANNsusingthegradientdescentmethod.Thisledtoanupsurgeofinterestintheartificialintelligence(AI)communitytodevelopmulti-layerANNmodels,atrendthatcontinuedformorethanadecade.Historically,ANNsmarkaparadigmshiftinAIfromapproachesbasedonexpertsystems(whereknowledgeisencodedusingif-thenrules)tomachinelearningapproaches(wheretheknowledgeisencodedintheparametersofacomputationalmodel).However,therewerestillanumberofalgorithmicandcomputationalchallengesinlearninglargeANNmodels,whichremainedunresolvedforalongtime.ThishinderedthedevelopmentofANNmodelstothescalenecessaryforsolvingreal-worldproblems.Slowly,ANNsstartedgettingoutpacedbyotherclassificationmodelssuchassupportvectormachines,whichprovidedbetterperformanceaswellastheoreticalguaranteesofconvergenceandoptimality.Itisonlyrecentlythatthechallengesinlearningdeepneuralnetworkshavebeencircumvented,owingtobettercomputationalresourcesandanumberofalgorithmicimprovementsinANNssince2006.Thisre-emergenceofANNhasbeendubbedas“deeplearning,”whichhasoftenoutperformedexistingclassificationmodelsandgainedwide-spreadpopularity.

Deeplearningisarapidlyevolvingareaofresearchwithanumberofpotentiallyimpactfulcontributionsbeingmadeeveryyear.Someofthelandmarkadvancementsindeeplearningincludetheuseoflarge-scalerestrictedBoltzmannmachinesforlearninggenerativemodelsofdata[201,253],theuseofautoencodersanditsvariants(denoisingautoencoders)forlearningrobustfeaturerepresentations[199,305,306],andsophisticalarchitecturestopromotesharingofparametersacrossnodessuchasconvolutionalneuralnetworksforimages[265,268]andrecurrentneuralnetworksforsequences[241,242,277].OthermajorimprovementsincludetheapproachofunsupervisedpretrainingforinitializingANNmodels[232],thedropouttechniqueforregularization[254,297],batchnormalizationforfastlearningofANNparameters[256],andmaxoutnetworksforeffectiveusageofthedropouttechnique[240].EventhoughthediscussionsinthischapteronlearningANNmodelswerecenteredaroundthegradientdescentmethod,mostofthemodernANNmodelsinvolvingalargenumberofhiddenlayersaretrainedusingthestochasticgradientdescentmethodsinceitishighlyscalable[207].AnextensivesurveyofdeeplearningapproacheshasbeenpresentedinreviewarticlesbyBengio[200],LeCunetal.[269],andSchmidhuber[293].AnexcellentsummaryofdeeplearningapproachescanalsobeobtainedfromrecentbooksbyGoodfellowetal.[239]andNielsen[281].

Vapnik [303, 304] has written two authoritative books on Support Vector Machines (SVM). Other useful resources on SVM and kernel methods include the books by Cristianini and Shawe-Taylor [221] and Schölkopf and Smola [294]. There are several survey articles on SVM, including those written by Burges [212], Bennett et al. [202], Hearst [251], and Mangasarian [272]. SVM can also be viewed as an L2 norm regularizer of the hinge loss function, as described in detail by Hastie et al. [249]. The L1 norm regularizer of the square loss function can be obtained using the least absolute shrinkage and selection operator (Lasso), which was introduced by Tibshirani in 1996 [301].

TheLassohasseveralinterestingpropertiessuchastheabilitytosimultaneouslyperformfeatureselectionaswellasregularization,sothatonlyasubsetoffeaturesareselectedinthefinalmodel.AnexcellentreviewofLassocanbeobtainedfromabookbyHastieetal.[250].

AsurveyofensemblemethodsinmachinelearningwasgivenbyDiet-terich[222].ThebaggingmethodwasproposedbyBreiman[209].FreundandSchapire[236]developedtheAdaBoostalgorithm.Arcing,whichstandsforadaptiveresamplingandcombining,isavariantoftheboostingalgorithmproposedbyBreiman[210].Itusesthenon-uniformweightsassignedtotrainingexamplestoresamplethedataforbuildinganensembleoftrainingsets.UnlikeAdaBoost,thevotesofthebaseclassifiersarenotweightedwhendeterminingtheclasslabeloftestexamples.TherandomforestmethodwasintroducedbyBreimanin[211].Theconceptofbias-variancedecompositionisexplainedindetailbyHastieetal.[249].Whilethebias-variancedecompositionwasinitiallyproposedforregressionproblemswithsquaredlossfunction,aunifiedframeworkforclassificationproblemsinvolving0–1losseswasintroducedbyDomingos[226].

RelatedworkonminingrareandimbalanceddatasetscanbefoundinthesurveypaperswrittenbyChawlaetal.[214]andWeiss[308].Sampling-basedmethodsforminingimbalanceddatasetshavebeeninvestigatedbymanyauthors,suchasKubatandMatwin[266],Japkowitz[257],andDrummondandHolte[228].Joshietal.[261]discussedthelimitationsofboostingalgorithmsforrareclassmodeling.OtheralgorithmsdevelopedforminingrareclassesincludeSMOTE[213],PNrule[260],andCREDOS[262].

Various alternative metrics that are well-suited for class imbalanced problems are available. The precision, recall, and F1-measure are widely-used metrics in information retrieval [302]. ROC analysis was originally used in signal detection theory for performing aggregate evaluation over a range of score

thresholds.AmethodforcomparingclassifierperformanceusingtheconvexhullofROCcurveswassuggestedbyProvostandFawcettin[286].Bradley[208]investigatedtheuseofareaundertheROCcurve(AUC)asaperformancemetricformachinelearningalgorithms.DespitethevastbodyofliteratureonoptimizingtheAUCmeasureinmachinelearningmodels,itiswell-knownthatAUCsuffersfromcertainlimitations.Forexample,theAUCcanbeusedtocomparethequalityoftwoclassifiersonlyiftheROCcurveofoneclassifierstrictlydominatestheother.However,iftheROCcurvesoftwoclassifiersintersectatanypoint,thenitisdifficulttoassesstherelativequalityofclassifiersusingtheAUCmeasure.Anin-depthdiscussionofthepitfallsinusingAUCasaperformancemeasurecanbeobtainedinworksbyHand[245,246],andPowers[284].TheAUChasalsobeenconsideredtobeanincoherentmeasureofperformance,i.e.,itusesdifferentscaleswhilecomparingtheperformanceofdifferentclassifiers,althoughacoherentinterpretationofAUChasbeenprovidedbyFerrietal.[235].BerrarandFlach[203]describesomeofthecommoncaveatsinusingtheROCcurveforclinicalmicroarrayresearch.Analternateapproachformeasuringtheaggregateperformanceofaclassifieristheprecision-recall(PR)curve,whichisespeciallyusefulinthepresenceofclassimbalance[292].

Anexcellenttutorialoncost-sensitivelearningcanbefoundinareviewarticlebyLingandSheng[271].ThepropertiesofacostmatrixhadbeenstudiedbyElkanin[231].MargineantuandDietterich[273]examinedvariousmethodsforincorporatingcostinformationintotheC4.5learningalgorithm,includingwrappermethods,classdistribution-basedmethods,andloss-basedmethods.Othercost-sensitivelearningmethodsthatarealgorithm-independentincludeAdaCost[233],MetaCost[225],andcosting[312].

Extensiveliteratureisalsoavailableonthesubjectofmulticlasslearning.ThisincludestheworksofHastieandTibshirani[248],Allweinetal.[197],KongandDietterich[264],andTaxandDuin[300].Theerror-correctingoutput

coding(ECOC)methodwasproposedbyDietterichandBakiri[223].Theyhadalsoinvestigatedtechniquesfordesigningcodesthataresuitableforsolvingmulticlassproblems.

Apartfromexploringalgorithmsfortraditionalclassificationsettingswhereeveryinstancehasasinglesetoffeatureswithauniquecategoricallabel,therehasbeenalotofrecentinterestinnon-traditionalclassificationparadigms,involvingcomplexformsofinputsandoutputs.Forexample,theparadigmofmulti-labellearningallowsforaninstancetobeassignedmultipleclasslabelsratherthanjustone.Thisisusefulinapplicationssuchasobjectrecognitioninimages,whereaphotoimagemayincludemorethanoneclassificationobject,suchas,grass,sky,trees,andmountains.Asurveyonmulti-labellearningcanbefoundin[313].Asanotherexample,theparadigmofmulti-instancelearningconsiderstheproblemwheretheinstancesareavailableintheformofgroupscalledbags,andtraininglabelsareavailableatthelevelofbagsratherthanindividualinstances.Multi-instancelearningisusefulinapplicationswhereanobjectcanexistasmultipleinstancesindifferentstates(e.g.,thedifferentisomersofachemicalcompound),andevenifasingleinstanceshowsaspecificcharacteristic,theentirebagofinstancesassociatedwiththeobjectneedstobeassignedtherelevantclass.Asurveyonmulti-instancelearningisprovidedin[314].

Inanumberofreal-worldapplications,itisoftenthecasethatthetraininglabelsarescarceinquantity,becauseofthehighcostsassociatedwithobtaininggold-standardsupervision.However,wealmostalwayshaveabundantaccesstounlabeledtestinstances,whichdonothavesupervisedlabelsbutcontainvaluableinformationaboutthestructureordistributionofinstances.Traditionallearningalgorithms,whichonlymakeuseofthelabeledinstancesinthetrainingsetforlearningthedecisionboundary,areunabletoexploittheinformationcontainedinunlabeledinstances.Incontrast,learningalgorithmsthatmakeuseofthestructureintheunlabeleddataforlearningthe

classificationmodelareknownassemi-supervisedlearningalgorithms[315,316].Theuseofunlabeleddataisalsoexploredintheparadigmofmulti-viewlearning[299,311],whereeveryobjectisobservedinmultipleviewsofthedata,involvingdiversesetsoffeatures.Acommonstrategyusedbymulti-viewlearningalgorithmsisco-training[206],whereadifferentmodelislearnedforeveryviewofthedata,butthemodelpredictionsfromeveryviewareconstrainedtobeidenticaltoeachotherontheunlabeledtestinstances.

Anotherlearningparadigmthatiscommonlyexploredinthepaucityoftrainingdataistheframeworkofactivelearning,whichattemptstoseekthesmallestsetoflabelannotationstolearnareasonableclassificationmodel.Activelearningexpectstheannotatortobeinvolvedintheprocessofmodellearning,sothatthelabelsarerequestedincrementallyoverthemostrelevantsetofinstances,givenalimitedbudgetoflabelannotations.Forexample,itmaybeusefultoobtainlabelsoverinstancesclosertothedecisionboundarythatcanplayabiggerroleinfine-tuningtheboundary.Areviewonactivelearningapproachescanbefoundin[285,295].

Insomeapplications,itisimportanttosimultaneouslysolvemultiplelearningtaskstogether,wheresomeofthetasksmaybesimilartooneanother.Forexample,ifweareinterestedintranslatingapassagewritteninEnglishintodifferentlanguages,thetasksinvolvinglexicallysimilarlanguages(suchasSpanishandPortuguese)wouldrequiresimilarlearningofmodels.Theparadigmofmulti-tasklearninghelpsinsimultaneouslylearningacrossalltaskswhilesharingthelearningamongrelatedtasks.Thisisespeciallyusefulwhensomeofthetasksdonotcontainsufficientlymanytrainingsamples,inwhichcaseborrowingthelearningfromotherrelatedtaskshelpsinthelearningofrobustmodels.Aspecialcaseofmulti-tasklearningistransferlearning,wherethelearningfromasourcetask(withsufficientnumberoftrainingsamples)hastobetransferredtoadestinationtask(withpaucityof

trainingdata).AnextensivesurveyoftransferlearningapproachesisprovidedbyPanetal.[282].

Most classifiers assume every data instance must belong to a class, which is not always true for some applications. For example, in malware detection, due to the ease with which new malwares are created, a classifier trained on existing classes may fail to detect new ones even if the features for the new malwares are considerably different than those for existing malwares. Another example is in critical applications such as medical diagnosis, where prediction errors are costly and can have severe consequences. In this situation, it would be better for the classifier to refrain from making any prediction on a data instance if it is unsure of its class. This approach, known as classifier with reject option, does not need to classify every data instance unless it determines the prediction is reliable (e.g., if the class probability exceeds a user-specified threshold). Instances that are unclassified can be presented to domain experts for further determination of their true class labels.
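As a rough illustration, a reject option can be implemented as a simple threshold on the estimated class probabilities. The sketch below is not from the text, and the function name, labels, and probability values are hypothetical.

```python
# A minimal sketch of a classifier with reject option: predict only when the
# most probable class exceeds a user-specified threshold, otherwise abstain.
def predict_with_reject(class_probs, threshold=0.8):
    """class_probs: dict mapping class label -> estimated probability."""
    label, prob = max(class_probs.items(), key=lambda kv: kv[1])
    return label if prob >= threshold else None   # None = reject (defer to an expert)

print(predict_with_reject({"malware": 0.95, "benign": 0.05}))   # 'malware'
print(predict_with_reject({"malware": 0.55, "benign": 0.45}))   # None -> rejected
```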

Classifierscanalsobedistinguishedintermsofhowtheclassificationmodelistrained.Abatchclassifierassumesallthelabeledinstancesareavailableduringtraining.Thisstrategyisapplicablewhenthetrainingsetsizeisnottoolargeandforstationarydata,wheretherelationshipbetweentheattributesandclassesdoesnotvaryovertime.Anonlineclassifier,ontheotherhand,trainsaninitialmodelusingasubsetofthelabeleddata[263].Themodelisthenupdatedincrementallyasmorelabeledinstancesbecomeavailable.Thisstrategyiseffectivewhenthetrainingsetistoolargeorwhenthereisconceptdriftduetochangesinthedistributionofthedataovertime.

Bibliography[195]C.C.Aggarwal.Dataclassification:algorithmsandapplications.CRC

Press,2014.

[196]D.W.Aha.Astudyofinstance-basedalgorithmsforsupervisedlearningtasks:mathematical,empirical,andpsychologicalevaluations.PhDthesis,UniversityofCalifornia,Irvine,1990.

[197]E.L.Allwein,R.E.Schapire,andY.Singer.ReducingMulticlasstoBinary:AUnifyingApproachtoMarginClassifiers.JournalofMachineLearningResearch,1:113–141,2000.

[198]R.Andrews,J.Diederich,andA.Tickle.ASurveyandCritiqueofTechniquesForExtractingRulesFromTrainedArtificialNeuralNetworks.KnowledgeBasedSystems,8(6):373–389,1995.

[199]P.Baldi.Autoencoders,unsupervisedlearning,anddeeparchitectures.ICMLunsupervisedandtransferlearning,27(37-50):1,2012.

[200]Y.Bengio.LearningdeeparchitecturesforAI.FoundationsandtrendsRinMachineLearning,2(1):1–127,2009.

[201]Y.Bengio,A.Courville,andP.Vincent.Representationlearning:Areviewandnewperspectives.IEEEtransactionsonpatternanalysisand

machineintelligence,35(8):1798–1828,2013.

[202]K.BennettandC.Campbell.SupportVectorMachines:HypeorHallelujah.SIGKDDExplorations,2(2):1–13,2000.

[203]D.BerrarandP.Flach.CaveatsandpitfallsofROCanalysisinclinicalmicroarrayresearch(andhowtoavoidthem).Briefingsinbioinformatics,pagebbr008,2011.

[204]C.M.Bishop.NeuralNetworksforPatternRecognition.OxfordUniversityPress,Oxford,U.K.,1995.

[205]C.M.Bishop.PatternRecognitionandMachineLearning.Springer,2006.

[206]A.BlumandT.Mitchell.Combininglabeledandunlabeleddatawithco-training.InProceedingsoftheeleventhannualconferenceonComputationallearningtheory,pages92–100.ACM,1998.

[207]L.Bottou.Large-scalemachinelearningwithstochasticgradientdescent.InProceedingsofCOMPSTAT'2010,pages177–186.Springer,2010.

[208]A.P.Bradley.TheuseoftheareaundertheROCcurveintheEvaluationofMachineLearningAlgorithms.PatternRecognition,30(7):1145–1149,1997.

[209]L.Breiman.BaggingPredictors.MachineLearning,24(2):123–140,1996.

[210]L.Breiman.Bias,Variance,andArcingClassifiers.TechnicalReport486,UniversityofCalifornia,Berkeley,CA,1996.

[211]L.Breiman.RandomForests.MachineLearning,45(1):5–32,2001.

[212]C.J.C.Burges.ATutorialonSupportVectorMachinesforPatternRecognition.DataMiningandKnowledgeDiscovery,2(2):121–167,1998.

[213]N.V.Chawla,K.W.Bowyer,L.O.Hall,andW.P.Kegelmeyer.SMOTE:SyntheticMinorityOver-samplingTechnique.JournalofArtificialIntelligenceResearch,16:321–357,2002.

[214]N.V.Chawla,N.Japkowicz,andA.Kolcz.Editorial:SpecialIssueonLearningfromImbalancedDataSets.SIGKDDExplorations,6(1):1–6,2004.

[215]V.CherkasskyandF.Mulier.LearningfromData:Concepts,Theory,andMethods.WileyInterscience,1998.

[216]P.ClarkandR.Boswell.RuleInductionwithCN2:SomeRecentImprovements.InMachineLearning:Proc.ofthe5thEuropeanConf.(EWSL-91),pages151–163,1991.

[217]P.ClarkandT.Niblett.TheCN2InductionAlgorithm.MachineLearning,3(4):261–283,1989.

[218]W.W.Cohen.FastEffectiveRuleInduction.InProc.ofthe12thIntl.Conf.onMachineLearning,pages115–123,TahoeCity,CA,July1995.

[219]S.CostandS.Salzberg.AWeightedNearestNeighborAlgorithmforLearningwithSymbolicFeatures.MachineLearning,10:57–78,1993.

[220]T.M.CoverandP.E.Hart.NearestNeighborPatternClassification.KnowledgeBasedSystems,8(6):373–389,1995.

[221]N.CristianiniandJ.Shawe-Taylor.AnIntroductiontoSupportVectorMachinesandOtherKernel-basedLearningMethods.CambridgeUniversityPress,2000.

[222]T.G.Dietterich.EnsembleMethodsinMachineLearning.InFirstIntl.WorkshoponMultipleClassifierSystems,Cagliari,Italy,2000.

[223]T.G.DietterichandG.Bakiri.SolvingMulticlassLearningProblemsviaError-CorrectingOutputCodes.JournalofArtificialIntelligenceResearch,2:263–286,1995.

[224]P.Domingos.TheRISEsystem:Conqueringwithoutseparating.InProc.ofthe6thIEEEIntl.Conf.onToolswithArtificialIntelligence,pages704–707,NewOrleans,LA,1994.

[225]P.Domingos.MetaCost:AGeneralMethodforMakingClassifiersCost-Sensitive.InProc.ofthe5thIntl.Conf.onKnowledgeDiscoveryandDataMining,pages155–164,SanDiego,CA,August1999.

[226]P.Domingos.Aunifiedbias-variancedecomposition.InProceedingsof17thInternationalConferenceonMachineLearning,pages231–238,2000.

[227]P.DomingosandM.Pazzani.OntheOptimalityoftheSimpleBayesianClassifierunderZero-OneLoss.MachineLearning,29(2-3):103–130,1997.

[228]C.DrummondandR.C.Holte.C4.5,Classimbalance,andCostsensitivity:Whyunder-samplingbeatsover-sampling.InICML'2004WorkshoponLearningfromImbalancedDataSetsII,Washington,DC,August2003.

[229]R.O.Duda,P.E.Hart,andD.G.Stork.PatternClassification.JohnWiley&Sons,Inc.,NewYork,2ndedition,2001.

[230]M.H.Dunham.DataMining:IntroductoryandAdvancedTopics.PrenticeHall,2006.

[231]C.Elkan.TheFoundationsofCost-SensitiveLearning.InProc.ofthe17thIntl.JointConf.onArtificialIntelligence,pages973–978,Seattle,WA,August2001.

[232]D.Erhan,Y.Bengio,A.Courville,P.-A.Manzagol,P.Vincent,andS.Bengio.Whydoesunsupervisedpre-traininghelpdeeplearning?JournalofMachineLearningResearch,11(Feb):625–660,2010.

[233]W.Fan,S.J.Stolfo,J.Zhang,andP.K.Chan.AdaCost:misclassificationcost-sensitiveboosting.InProc.ofthe16thIntl.Conf.onMachineLearning,pages97–105,Bled,Slovenia,June1999.

[234]J.FürnkranzandG.Widmer.Incrementalreducederrorpruning.InProc.ofthe11thIntl.Conf.onMachineLearning,pages70–77,NewBrunswick,NJ,July1994.

[235]C.Ferri,J.Hernández-Orallo,andP.A.Flach.AcoherentinterpretationofAUCasameasureofaggregatedclassificationperformance.InProceedingsofthe28thInternationalConferenceonMachineLearning(ICML-11),pages657–664,2011.

[236]Y.FreundandR.E.Schapire.Adecision-theoreticgeneralizationofon-linelearningandanapplicationtoboosting.JournalofComputerandSystemSciences,55(1):119–139,1997.

[237]K.Fukunaga.IntroductiontoStatisticalPatternRecognition.AcademicPress,NewYork,1990.

[238]D.Geiger,T.S.Verma,andJ.Pearl.d-separation:Fromtheoremstoalgorithms.arXivpreprintarXiv:1304.1505,2013.

[239]I.Goodfellow,Y.Bengio,andA.Courville.DeepLearning.BookinpreparationforMITPress,2016.

[240]I.J.Goodfellow,D.Warde-Farley,M.Mirza,A.C.Courville,andY.Bengio.Maxoutnetworks.ICML(3),28:1319–1327,2013.

[241]A.Graves,M.Liwicki,S.Fernández,R.Bertolami,H.Bunke,andJ.Schmidhuber.Anovelconnectionistsystemforunconstrainedhandwritingrecognition.IEEEtransactionsonpatternanalysisandmachineintelligence,31(5):855–868,2009.

[242]A.GravesandJ.Schmidhuber.Offlinehandwritingrecognitionwithmultidimensionalrecurrentneuralnetworks.InAdvancesinneuralinformationprocessingsystems,pages545–552,2009.

[243]E.-H.Han,G.Karypis,andV.Kumar.TextCategorizationUsingWeightAdjustedk-NearestNeighborClassification.InProc.ofthe5thPacific-AsiaConf.onKnowledgeDiscoveryandDataMining,Lyon,France,2001.

[244]J.HanandM.Kamber.DataMining:ConceptsandTechniques.MorganKaufmannPublishers,SanFrancisco,2001.

[245]D.J.Hand.Measuringclassifierperformance:acoherentalternativetotheareaundertheROCcurve.Machinelearning,77(1):103–123,2009.

[246]D.J.Hand.Evaluatingdiagnostictests:theareaundertheROCcurveandthebalanceoferrors.Statisticsinmedicine,29(14):1502–1510,2010.

[247]D.J.Hand,H.Mannila,andP.Smyth.PrinciplesofDataMining.MITPress,2001.

[248]T.HastieandR.Tibshirani.Classificationbypairwisecoupling.AnnalsofStatistics,26(2):451–471,1998.

[249]T.Hastie,R.Tibshirani,andJ.Friedman.TheElementsofStatisticalLearning:DataMining,Inference,andPrediction.Springer,2ndedition,2009.

[250]T.Hastie,R.Tibshirani,andM.Wainwright.Statisticallearningwithsparsity:thelassoandgeneralizations.CRCPress,2015.

[251]M.Hearst.Trends&Controversies:SupportVectorMachines.IEEEIntelligentSystems,13(4):18–28,1998.

[252]D.Heckerman.BayesianNetworksforDataMining.DataMiningandKnowledgeDiscovery,1(1):79–119,1997.

[253]G.E.HintonandR.R.Salakhutdinov.Reducingthedimensionalityofdatawithneuralnetworks.Science,313(5786):504–507,2006.

[254]G.E.Hinton,N.Srivastava,A.Krizhevsky,I.Sutskever,andR.R.Salakhutdinov.Improvingneuralnetworksbypreventingco-adaptationoffeaturedetectors.arXivpreprintarXiv:1207.0580,2012.

[255]R.C.Holte.VerySimpleClassificationRulesPerformWellonMostCommonlyUsedDatasets.MachineLearning,11:63–91,1993.

[256]S.IoffeandC.Szegedy.Batchnormalization:Acceleratingdeepnetworktrainingbyreducinginternalcovariateshift.arXivpreprintarXiv:1502.03167,2015.

[257]N.Japkowicz.TheClassImbalanceProblem:SignificanceandStrategies.InProc.ofthe2000Intl.Conf.onArtificialIntelligence:SpecialTrackonInductiveLearning,volume1,pages111–117,LasVegas,NV,June2000.

[258]F.V.Jensen.AnintroductiontoBayesiannetworks,volume210.UCLpressLondon,1996.

[259]M.I.Jordan.Learningingraphicalmodels,volume89.SpringerScience&BusinessMedia,1998.

[260]M.V.Joshi,R.C.Agarwal,andV.Kumar.MiningNeedlesinaHaystack:ClassifyingRareClassesviaTwo-PhaseRuleInduction.InProc.of2001ACM-SIGMODIntl.Conf.onManagementofData,pages91–102,SantaBarbara,CA,June2001.

[261]M.V.Joshi,R.C.Agarwal,andV.Kumar.Predictingrareclasses:canboostingmakeanyweaklearnerstrong?InProc.ofthe8thIntl.Conf.onKnowledgeDiscoveryandDataMining,pages297–306,Edmonton,Canada,July2002.

[262]M.V.JoshiandV.Kumar.CREDOS:ClassificationUsingRippleDownStructure(ACaseforRareClasses).InProc.oftheSIAMIntl.Conf.onDataMining,pages321–332,Orlando,FL,April2004.

[263]J.Kivinen,A.J.Smola,andR.C.Williamson.Onlinelearningwithkernels.IEEEtransactionsonsignalprocessing,52(8):2165–2176,2004.

[264]E.B.KongandT.G.Dietterich.Error-CorrectingOutputCodingCorrectsBiasandVariance.InProc.ofthe12thIntl.Conf.onMachineLearning,pages313–321,TahoeCity,CA,July1995.

[265]A.Krizhevsky,I.Sutskever,andG.E.Hinton.Imagenetclassificationwithdeepconvolutionalneuralnetworks.InAdvancesinneuralinformationprocessingsystems,pages1097–1105,2012.

[266]M.KubatandS.Matwin.AddressingtheCurseofImbalancedTrainingSets:OneSidedSelection.InProc.ofthe14thIntl.Conf.onMachineLearning,pages179–186,Nashville,TN,July1997.

[267]P.Langley,W.Iba,andK.Thompson.AnanalysisofBayesianclassifiers.InProc.ofthe10thNationalConf.onArtificialIntelligence,pages223–228,1992.

[268]Y.LeCunandY.Bengio.Convolutionalnetworksforimages,speech,andtimeseries.Thehandbookofbraintheoryandneuralnetworks,3361(10):1995,1995.

[269]Y.LeCun,Y.Bengio,andG.Hinton.Deeplearning.Nature,521(7553):436–444,2015.

[270]D.D.Lewis.NaiveBayesatForty:TheIndependenceAssumptioninInformationRetrieval.InProc.ofthe10thEuropeanConf.onMachineLearning(ECML1998),pages4–15,1998.

[271]C.X.LingandV.S.Sheng.Cost-sensitivelearning.InEncyclopediaofMachineLearning,pages231–235.Springer,2011.

[272]O.Mangasarian.DataMiningviaSupportVectorMachines.TechnicalReportTechnicalReport01-05,DataMiningInstitute,May2001.

[273]D.D.MargineantuandT.G.Dietterich.LearningDecisionTreesforLossMinimizationinMulti-ClassProblems.TechnicalReport99-30-03,OregonStateUniversity,1999.

[274]P.McCullaghandJ.A.Nelder.Generalizedlinearmodels,volume37.CRCpress,1989.

[275]W.S.McCullochandW.Pitts.Alogicalcalculusoftheideasimmanentinnervousactivity.Thebulletinofmathematicalbiophysics,5(4):115–133,1943.

[276]R.S.Michalski,I.Mozetic,J.Hong,andN.Lavrac.TheMulti-PurposeIncrementalLearningSystemAQ15andItsTestingApplicationtoThree

MedicalDomains.InProc.of5thNationalConf.onArtificialIntelligence,Orlando,August1986.

[277]T.Mikolov,M.Karafiát,L.Burget,J.Cernock`y,andS.Khudanpur.Recurrentneuralnetworkbasedlanguagemodel.InInterspeech,volume2,page3,2010.

[278]T.Mitchell.MachineLearning.McGraw-Hill,Boston,MA,1997.

[279]S.Muggleton.FoundationsofInductiveLogicProgramming.PrenticeHall,EnglewoodCliffs,NJ,1995.

[280]J.A.NelderandR.J.Baker.Generalizedlinearmodels.Encyclopediaofstatisticalsciences,1972.

[281]M.A.Nielsen.Neuralnetworksanddeeplearning.Publishedonline:http://neuralnetworksanddeeplearning.com/.(visited:10.15.2016),2015.

[282]S.J.PanandQ.Yang.Asurveyontransferlearning.IEEETransactionsonknowledgeanddataengineering,22(10):1345–1359,2010.

[283]J.Pearl.Probabilisticreasoninginintelligentsystems:networksofplausibleinference.MorganKaufmann,2014.

[284]D.M.Powers.Theproblemofareaunderthecurve.In2012IEEEInternationalConferenceonInformationScienceandTechnology,pages

567–573.IEEE,2012.

[285]M.Prince.Doesactivelearningwork?Areviewoftheresearch.Journalofengineeringeducation,93(3):223–231,2004.

[286]F.J.ProvostandT.Fawcett.AnalysisandVisualizationofClassifierPerformance:ComparisonunderImpreciseClassandCostDistributions.InProc.ofthe3rdIntl.Conf.onKnowledgeDiscoveryandDataMining,pages43–48,NewportBeach,CA,August1997.

[287]J.R.Quinlan.C4.5:ProgramsforMachineLearning.Morgan-KaufmannPublishers,SanMateo,CA,1993.

[288]M.RamoniandP.Sebastiani.RobustBayesclassifiers.ArtificialIntelligence,125:209–226,2001.

[289]N.Rochester,J.Holland,L.Haibt,andW.Duda.Testsonacellassemblytheoryoftheactionofthebrain,usingalargedigitalcomputer.IRETransactionsoninformationTheory,2(3):80–93,1956.

[290]F.Rosenblatt.Theperceptron:aprobabilisticmodelforinformationstorageandorganizationinthebrain.Psychologicalreview,65(6):386,1958.

[291]S.J.Russell,P.Norvig,J.F.Canny,J.M.Malik,andD.D.Edwards.Artificialintelligence:amodernapproach,volume2.PrenticehallUpperSaddleRiver,2003.

[292]T.SaitoandM.Rehmsmeier.Theprecision-recallplotismoreinformativethantheROCplotwhenevaluatingbinaryclassifiersonimbalanceddatasets.PloSone,10(3):e0118432,2015.

[293]J.Schmidhuber.Deeplearninginneuralnetworks:Anoverview.NeuralNetworks,61:85–117,2015.

[294]B.SchölkopfandA.J.Smola.LearningwithKernels:SupportVectorMachines,Regularization,Optimization,andBeyond.MITPress,2001.

[295]B.Settles.Activelearningliteraturesurvey.UniversityofWisconsin,Madison,52(55-66):11,2010.

[296]P.SmythandR.M.Goodman.AnInformationTheoreticApproachtoRuleInductionfromDatabases.IEEETrans.onKnowledgeandDataEngineering,4(4):301–316,1992.

[297]N.Srivastava,G.E.Hinton,A.Krizhevsky,I.Sutskever,andR.Salakhutdinov.Dropout:asimplewaytopreventneuralnetworksfromoverfitting.JournalofMachineLearningResearch,15(1):1929–1958,2014.

[298]M.SteinbachandP.-N.Tan.kNN:k-NearestNeighbors.InX.WuandV.Kumar,editors,TheTopTenAlgorithmsinDataMining.ChapmanandHall/CRCReference,1stedition,2009.

[299]S.Sun.Asurveyofmulti-viewmachinelearning.NeuralComputingandApplications,23(7-8):2031–2038,2013.

[300]D.M.J.TaxandR.P.W.Duin.UsingTwo-ClassClassifiersforMulticlassClassification.InProc.ofthe16thIntl.Conf.onPatternRecognition(ICPR2002),pages124–127,Quebec,Canada,August2002.

[301]R.Tibshirani.Regressionshrinkageandselectionviathelasso.JournaloftheRoyalStatisticalSociety.SeriesB(Methodological),pages267–288,1996.

[302]C.J.vanRijsbergen.InformationRetrieval.Butterworth-Heinemann,Newton,MA,1978.

[303]V.Vapnik.TheNatureofStatisticalLearningTheory.SpringerVerlag,NewYork,1995.

[304]V.Vapnik.StatisticalLearningTheory.JohnWiley&Sons,NewYork,1998.

[305]P.Vincent,H.Larochelle,Y.Bengio,andP.-A.Manzagol.Extractingandcomposingrobustfeatureswithdenoisingautoencoders.InProceedingsofthe25thinternationalconferenceonMachinelearning,pages1096–1103.ACM,2008.

[306]P.Vincent,H.Larochelle,I.Lajoie,Y.Bengio,andP.-A.Manzagol.Stackeddenoisingautoencoders:Learningusefulrepresentationsina

deepnetworkwithalocaldenoisingcriterion.JournalofMachineLearningResearch,11(Dec):3371–3408,2010.

[307]A.R.Webb.StatisticalPatternRecognition.JohnWiley&Sons,2ndedition,2002.

[308]G.M.Weiss.MiningwithRarity:AUnifyingFramework.SIGKDDExplorations,6(1):7–19,2004.

[309]P.Werbos.Beyondregression:newfoolsforpredictionandanalysisinthebehavioralsciences.PhDthesis,HarvardUniversity,1974.

[310]I.H.WittenandE.Frank.DataMining:PracticalMachineLearningToolsandTechniqueswithJavaImplementations.MorganKaufmann,1999.

[311]C.Xu,D.Tao,andC.Xu.Asurveyonmulti-viewlearning.arXivpreprintarXiv:1304.5634,2013.

[312]B.Zadrozny,J.C.Langford,andN.Abe.Cost-SensitiveLearningbyCost-ProportionateExampleWeighting.InProc.ofthe2003IEEEIntl.Conf.onDataMining,pages435–442,Melbourne,FL,August2003.

[313]M.-L.ZhangandZ.-H.Zhou.Areviewonmulti-labellearningalgorithms.IEEEtransactionsonknowledgeanddataengineering,26(8):1819–1837,2014.

[314]Z.-H.Zhou.Multi-instancelearning:Asurvey.DepartmentofComputerScience&Technology,NanjingUniversity,Tech.Rep,2004.

[315]X.Zhu.Semi-supervisedlearning.InEncyclopediaofmachinelearning,pages892–897.Springer,2011.

[316]X.ZhuandA.B.Goldberg.Introductiontosemi-supervisedlearning.Synthesislecturesonartificialintelligenceandmachinelearning,3(1):1–130,2009.

4.14 Exercises

1. Consider a binary classification problem with the following set of attributes and attribute values:

Air Conditioner = {Working, Broken}
Engine = {Good, Bad}
Mileage = {High, Medium, Low}
Rust = {Yes, No}

Suppose a rule-based classifier produces the following rule set:

Mileage = High → Value = Low
Mileage = Low → Value = High
Air Conditioner = Working, Engine = Good → Value = High
Air Conditioner = Working, Engine = Bad → Value = Low
Air Conditioner = Broken → Value = Low

a. Are the rules mutually exclusive?

b. Is the rule set exhaustive?

c. Is ordering needed for this set of rules?

d. Do you need a default class for the rule set?

2. The RIPPER algorithm (by Cohen [218]) is an extension of an earlier algorithm called IREP (by Fürnkranz and Widmer [234]). Both algorithms apply the reduced-error pruning method to determine whether a rule needs to be pruned. The reduced error pruning method uses a validation set to estimate the generalization error of a classifier. Consider the following pair of rules:

R1: A → C
R2: A ∧ B → C

R2 is obtained by adding a new conjunct, B, to the left-hand side of R1. For this question, you will be asked to determine whether R2 is preferred over R1 from the perspectives of rule-growing and rule-pruning. To determine whether a rule should be pruned, IREP computes the following measure:

vIREP = (p + (N − n)) / (P + N),

where P is the total number of positive examples in the validation set, N is the total number of negative examples in the validation set, p is the number of positive examples in the validation set covered by the rule, and n is the number of negative examples in the validation set covered by the rule. vIREP is actually similar to classification accuracy for the validation set. IREP favors rules that have higher values of vIREP. On the other hand, RIPPER applies the following measure to determine whether a rule should be pruned:

vRIPPER = (p − n) / (p + n).

a. Suppose R1 is covered by 350 positive examples and 150 negative examples, while R2 is covered by 300 positive examples and 50 negative examples. Compute the FOIL's information gain for the rule R2 with respect to R1.

b. Consider a validation set that contains 500 positive examples and 500 negative examples. For R1, suppose the number of positive examples covered by the rule is 200, and the number of negative examples covered by the rule is 50. For R2, suppose the number of positive examples covered by the rule is 100 and the number of negative examples is 5. Compute vIREP for both rules. Which rule does IREP prefer?

c. Compute vRIPPER for the previous problem. Which rule does RIPPER prefer?

3.C4.5rulesisanimplementationofanindirectmethodforgeneratingrulesfromadecisiontree.RIPPERisanimplementationofadirectmethodforgeneratingrulesdirectlyfromdata.

a. Discussthestrengthsandweaknessesofbothmethods.

b. Consideradatasetthathasalargedifferenceintheclasssize(i.e.,someclassesaremuchbiggerthanothers).Whichmethod(betweenC4.5rulesandRIPPER)isbetterintermsoffindinghighaccuracyrulesforthesmallclasses?

4. Consider a training set that contains 100 positive examples and 400 negative examples. For each of the following candidate rules,

R1: A → + (covers 4 positive and 1 negative examples),
R2: B → + (covers 30 positive and 10 negative examples),
R3: C → + (covers 100 positive and 90 negative examples),

determine which is the best and worst candidate rule according to:

a. Rule accuracy.

b. FOIL's information gain.

c. The likelihood ratio statistic.

d. The Laplace measure.

e. The m-estimate measure (with k = 2 and p+ = 0.2).

5. Figure 4.3 illustrates the coverage of the classification rules R1, R2, and R3. Determine which is the best and worst rule according to:

a. The likelihood ratio statistic.

b. The Laplace measure.

c. The m-estimate measure (with k = 2 and p+ = 0.58).

d. TheruleaccuracyafterR1hasbeendiscovered,wherenoneoftheexamplescoveredbyR1arediscarded.

e. TheruleaccuracyafterR1hasbeendiscovered,whereonlythepositiveexamplescoveredbyR1arediscarded.

f. TheruleaccuracyafterR1hasbeendiscovered,wherebothpositiveandnegativeexamplescoveredbyR1arediscarded.

6.

a. Supposethefractionofundergraduatestudentswhosmokeis15%andthefractionofgraduatestudentswhosmokeis23%.Ifone-fifthofthecollegestudentsaregraduatestudentsandtherestareundergraduates,whatistheprobabilitythatastudentwhosmokesisagraduatestudent?

b. Giventheinformationinpart(a),isarandomlychosencollegestudentmorelikelytobeagraduateorundergraduatestudent?

c. Repeatpart(b)assumingthatthestudentisasmoker.

d. Suppose30%ofthegraduatestudentsliveinadormbutonly10%oftheundergraduatestudentsliveinadorm.Ifastudentsmokesandlivesinthedorm,isheorshemorelikelytobeagraduateorundergraduatestudent?Youcanassumeindependencebetweenstudentswholiveinadormandthosewhosmoke.

7.ConsiderthedatasetshowninTable4.9

Table4.9.DatasetforExercise7.

Instance A B C Class

1 0 0 0


+

2 0 0 1

3 0 1 1

4 0 1 1

5 0 0 1

6 1 0 1

7 1 0 1

8 1 0 1

9 1 1 1

10 1 0 1

a. Estimatetheconditionalprobabilitiesfor,and .

b. UsetheestimateofconditionalprobabilitiesgiveninthepreviousquestiontopredicttheclasslabelforatestsampleusingthenaïveBayesapproach.

c. Estimatetheconditionalprobabilitiesusingthem-estimateapproach,withand .

d. Repeatpart(b)usingtheconditionalprobabilitiesgiveninpart(c).

e. Comparethetwomethodsforestimatingprobabilities.Whichmethodisbetterandwhy?

8.ConsiderthedatasetshowninTable4.10 .

Table4.10.DatasetforExercise8.

+

+

+

+

P(A|+),P(B|+),P(C|+),P(A|−),P(B|−) P(C|−)

(A=0,B=1,C=0)

p=1/2 m=4

Instance A B C Class

1 0 0 1

2 1 0 1

3 0 1 0

4 1 0 0

5 1 0 1

6 0 0 1

7 1 1 0

8 0 0 0

9 0 1 0

10 1 1 1 +

a. Estimatetheconditionalprobabilitiesfor,and using

thesameapproachasinthepreviousproblem.

b. Usetheconditionalprobabilitiesinpart(a)topredicttheclasslabelforatestsample usingthenaïveBayesapproach.

c. Compare ,and .StatetherelationshipsbetweenAandB.

d. Repeattheanalysisinpart(c)using ,and .

e. Compare against and.Arethevariablesconditionallyindependentgiventhe

class?

+

+

+

+

P(A=1|+),P(B=1|+),P(C=1|+),P(A=1|−),P(B=1|−) P(C=1|−)

(A=1,B=1,C=1)

P(A=1),P(B=1) P(A=1,B=1)

P(A=1),P(B=0) P(A=1,B=0)

P(A=1,B=1|Class=+) P(A=1|Class=+)P(B=1|Class=+)

9.

a. ExplainhownaïveBayesperformsonthedatasetshowninFigure4.56 .

b. Ifeachclassisfurtherdividedsuchthattherearefourclasses(A1,A2,B1,andB2),willnaïveBayesperformbetter?

c. Howwilladecisiontreeperformonthisdataset(forthetwo-classproblem)?Whatiftherearefourclasses?

10.Figure4.57 illustratestheBayesiannetworkforthedatasetshowninTable4.11 .(Assumethatalltheattributesarebinary).

a. Drawtheprobabilitytableforeachnodeinthenetwork.

b. UsetheBayesiannetworktocompute.

11.GiventheBayesiannetworkshowninFigure4.58 ,computethefollowingprobabilities:

P(Engine=Bad,AirConditioner=Broken)

Figure4.56.DatasetforExercise9.

Figure4.57.Bayesiannetwork.

a. .P(B=good,F=empty,G=empty,S=yes)

b. .

c. Giventhatthebatteryisbad,computetheprobabilitythatthecarwillstart.

12.Considertheone-dimensionaldatasetshowninTable4.12 .

a. Classifythedatapoint accordingtoits1-,3-,5-,and9-nearestneighbors(usingmajorityvote).

b. Repeatthepreviousanalysisusingthedistance-weightedvotingapproachdescribedinSection4.3.1 .

Table4.11.DatasetforExercise10.

Mileage Engine AirConditioner

NumberofInstanceswith

NumberofInstanceswith

Hi Good Working 3 4

Hi Good Broken 1 2

Hi Bad Working 1 5

Hi Bad Broken 0 4

Lo Good Working 9 0

Lo Good Broken 5 1

Lo Bad Working 1 2

Lo Bad Broken 0 2

P(B=bad,F=empty,G=notempty,S=no)

x=5.0

CarValue=Hi CarValue=Lo

Figure4.58.BayesiannetworkforExercise11.

13. The nearest neighbor algorithm described in Section 4.3 can be extended to handle nominal attributes. A variant of the algorithm called PEBLS (Parallel Exemplar-Based Learning System) by Cost and Salzberg [219] measures the distance between two values of a nominal attribute using the modified value difference metric (MVDM). Given a pair of nominal attribute values, V1 and V2, the distance between them is defined as follows:

d(V1, V2) = ∑_{i=1}^{k} | n_{i1}/n_1 − n_{i2}/n_2 |,     (4.108)

where n_{ij} is the number of examples from class i with attribute value V_j and n_j is the number of examples with attribute value V_j.

Table 4.12. Data set for Exercise 12.

| x | 0.5 | 3.0 | 4.5 | 4.6 | 4.9 | 5.2 | 5.3 | 5.5 | 7.0 | 9.5 |
| y | −   | −   | +   | +   | +   | −   | −   | +   | −   | −   |

Consider the training set for the loan classification problem shown in Figure 4.8. Use the MVDM measure to compute the distance between every pair of attribute values for the Home Owner and Marital Status attributes.

14.ForeachoftheBooleanfunctionsgivenbelow,statewhethertheproblemislinearlyseparable.

a. AANDBANDC

b. NOTAANDB

c. (AORB)AND(AORC)

d. (AXORB)AND(AORB)

15.

a. DemonstratehowtheperceptronmodelcanbeusedtorepresenttheANDandORfunctionsbetweenapairofBooleanvariables.

b. Commentonthedisadvantageofusinglinearfunctionsasactivationfunctionsformulti-layerneuralnetworks.

16.Youareaskedtoevaluatetheperformanceoftwoclassificationmodels,and .Thetestsetyouhavechosencontains26binaryattributes,

labeledasAthroughZ.Table4.13 showstheposteriorprobabilitiesobtainedbyapplyingthemodelstothetestset.(Onlytheposteriorprobabilitiesforthepositiveclassareshown).Asthisisatwo-classproblem,

and .Assumethatwearemostlyinterestedindetectinginstancesfromthepositiveclass.

a. PlottheROCcurveforboth and .(Youshouldplotthemonthesamegraph.)Whichmodeldoyouthinkisbetter?Explainyourreasons.


M1 M2

P(−)=1−P(+) P(−|A,…,Z)=1−P(+|A,…,Z)

M1 M2

b. Formodel ,supposeyouchoosethecutoffthresholdtobe .Inotherwords,anytestinstanceswhoseposteriorprobabilityisgreaterthantwillbeclassifiedasapositiveexample.Computetheprecision,recall,andF-measureforthemodelatthisthresholdvalue.

c. Repeattheanalysisforpart(b)usingthesamecutoffthresholdonmodel.ComparetheF-measureresultsforbothmodels.Whichmodelis

better?AretheresultsconsistentwithwhatyouexpectfromtheROCcurve?

d. Repeatpart(b)formodel usingthethreshold .Whichthresholddoyouprefer, or ?AretheresultsconsistentwithwhatyouexpectfromtheROCcurve?

Table4.13.PosteriorprobabilitiesforExercise16.

Instance TrueClass

1 0.73 0.61

2 0.69 0.03

3 0.44 0.68

4 0.55 0.31

5 0.67 0.45

6 0.47 0.09

7 0.08 0.38

8 0.15 0.05

9 0.45 0.01

10 0.35 0.04

M1 t=0.5

M2

M1 t=0.1t=0.5 t=0.1

P(+|A,…,Z,M1) P(+|A,…,Z,M2)

+

+

+

+

+

17.Followingisadatasetthatcontainstwoattributes,XandY,andtwoclasslabels,“ ”and“ ”.Eachattributecantakethreedifferentvalues:0,1,or2.

X Y NumberofInstances

0 0 0 100

1 0 0 0

2 0 0 100

0 1 10 100

1 1 10 0

2 1 10 100

0 2 0 100

1 2 0 0

2 2 0 100

Theconceptforthe“ ”classis andtheconceptforthe“ ”classis.

a. Buildadecisiontreeonthedataset.Doesthetreecapturethe“ ”and“ ”concepts?

b. Whataretheaccuracy,precision,recall,and -measureofthedecisiontree?(Notethatprecision,recall,and -measurearedefinedwith

+ −

+ −

+ Y=1 −X=0∨X=2

+−

F1F1

respecttothe“ ”class.)

c. Buildanewdecisiontreewiththefollowingcostfunction:

(Hint:onlytheleavesoftheolddecisiontreeneedtobechanged.)Doesthedecisiontreecapturethe“ ”concept?

d. Whataretheaccuracy,precision,recall,and -measureofthenewdecisiontree?

18.Considerthetaskofbuildingaclassifierfromrandomdata,wheretheattributevaluesaregeneratedrandomlyirrespectiveoftheclasslabels.Assumethedatasetcontainsinstancesfromtwoclasses,“ ”and“ .”Halfofthedatasetisusedfortrainingwhiletheremaininghalfisusedfortesting.

a. Supposethereareanequalnumberofpositiveandnegativeinstancesinthedataandthedecisiontreeclassifierpredictseverytestinstancetobepositive.Whatistheexpectederrorrateoftheclassifieronthetestdata?

b. Repeatthepreviousanalysisassumingthattheclassifierpredictseachtestinstancetobepositiveclasswithprobability0.8andnegativeclasswithprobability0.2.

c. Supposetwo-thirdsofthedatabelongtothepositiveclassandtheremainingone-thirdbelongtothenegativeclass.Whatistheexpectederrorofaclassifierthatpredictseverytestinstancetobepositive?

d. Repeatthepreviousanalysisassumingthattheclassifierpredictseachtestinstancetobepositiveclasswithprobability2/3andnegativeclasswithprobability1/3.

+

C(i,j)={0,ifi=j;1,ifi=+,j=−;Numberof−instancesNumberof+instancesifi=−,j=+;

+

F1

+ −

19. Derive the dual Lagrangian for the linear SVM with non-separable data where the objective function is

f(w) = ‖w‖²/2 + C (∑_{i=1}^{N} ξ_i)².

20. Consider the XOR problem where there are four training points:

(1, 1, −), (1, 0, +), (0, 1, +), (0, 0, −).

Transform the data into the following feature space:

Φ = (1, √2 x_1, √2 x_2, √2 x_1 x_2, x_1², x_2²).

Find the maximum margin linear decision boundary in the transformed space.

21. Given the data sets shown in Figure 4.59, explain how the decision tree, naïve Bayes, and k-nearest neighbor classifiers would perform on these data sets.

Figure4.59.DatasetforExercise21.

5AssociationAnalysis:BasicConceptsandAlgorithms

Manybusinessenterprisesaccumulatelargequantitiesofdatafromtheirday-to-dayoperations.Forexample,hugeamountsofcustomerpurchasedataarecollecteddailyatthecheckoutcountersofgrocerystores.Table5.1 givesanexampleofsuchdata,commonlyknownasmarketbaskettransactions.Eachrowinthistablecorrespondstoatransaction,whichcontainsauniqueidentifierlabeledTIDandasetofitemsboughtbyagivencustomer.Retailersareinterestedinanalyzingthedatatolearnaboutthepurchasingbehavioroftheircustomers.Suchvaluableinformationcanbeusedtosupportavarietyofbusiness-relatedapplicationssuchasmarketingpromotions,inventorymanagement,andcustomerrelationshipmanagement.

Table5.1.Anexampleofmarketbaskettransactions.

TID Items

1 {Bread,Milk}

2 {Bread,Diapers,Beer,Eggs}

3 {Milk,Diapers,Beer,Cola}

4 {Bread,Milk,Diapers,Beer}

5 {Bread,Milk,Diapers,Cola}

Thischapterpresentsamethodologyknownasassociationanalysis,whichisusefulfordiscoveringinterestingrelationshipshiddeninlargedatasets.Theuncoveredrelationshipscanberepresentedintheformofsetsofitemspresentinmanytransactions,whichareknownasfrequentitemsets,orassociationrules,thatrepresentrelationshipsbetweentwoitemsets.Forexample,thefollowingrulecanbeextractedfromthedatasetshowninTable5.1 :

Therulesuggestsarelationshipbetweenthesaleofdiapersandbeerbecausemanycustomerswhobuydiapersalsobuybeer.Retailerscanusethesetypesofrulestohelpthemidentifynewopportunitiesforcross-sellingtheirproductstothecustomers.

{Diapers}→{Beer}.

Besidesmarketbasketdata,associationanalysisisalsoapplicabletodatafromotherapplicationdomainssuchasbioinformatics,medicaldiagnosis,webmining,andscientificdataanalysis.IntheanalysisofEarthsciencedata,forexample,associationpatternsmayrevealinterestingconnectionsamongtheocean,land,andatmosphericprocesses.SuchinformationmayhelpEarthscientistsdevelopabetterunderstandingofhowthedifferentelementsoftheEarthsysteminteractwitheachother.Eventhoughthetechniquespresentedherearegenerallyapplicabletoawidervarietyofdatasets,forillustrativepurposes,ourdiscussionwillfocusmainlyonmarketbasketdata.

Therearetwokeyissuesthatneedtobeaddressedwhenapplyingassociationanalysistomarketbasketdata.First,discoveringpatternsfromalargetransactiondatasetcanbecomputationallyexpensive.Second,someofthediscoveredpatternsmaybespurious(happensimplybychance)andevenfornon-spuriouspatterns,somearemoreinterestingthanothers.Theremainderofthischapterisorganizedaroundthesetwoissues.Thefirstpartofthechapterisdevotedtoexplainingthebasicconceptsofassociationanalysisandthealgorithmsusedtoefficientlyminesuchpatterns.Thesecondpartofthechapterdealswiththeissueofevaluatingthediscoveredpatternsinordertohelppreventthegenerationofspuriousresultsandtorankthepatternsintermsofsomeinterestingnessmeasure.

5.1PreliminariesThissectionreviewsthebasicterminologyusedinassociationanalysisandpresentsaformaldescriptionofthetask.

BinaryRepresentation

MarketbasketdatacanberepresentedinabinaryformatasshowninTable5.2 ,whereeachrowcorrespondstoatransactionandeachcolumncorrespondstoanitem.Anitemcanbetreatedasabinaryvariablewhosevalueisoneiftheitemispresentinatransactionandzerootherwise.Becausethepresenceofaniteminatransactionisoftenconsideredmoreimportantthanitsabsence,anitemisanasymmetricbinaryvariable.Thisrepresentationisasimplisticviewofrealmarketbasketdatabecauseitignoresimportantaspectsofthedatasuchasthequantityofitemssoldorthepricepaidtopurchasethem.Methodsforhandlingsuchnon-binarydatawillbeexplainedinChapter6 .

Table5.2.Abinary0/1representationofmarketbasketdata.

TID Bread Milk Diapers Beer Eggs Cola

1 1 1 0 0 0 0

2 1 0 1 1 1 0

3 0 1 1 1 0 1

4 1 1 1 1 0 0

5 1 1 1 0 0 1

ItemsetandSupportCount

Let I = {i1, i2, …, id} be the set of all items in a market basket data set and T = {t1, t2, …, tN} be the set of all transactions. Each transaction ti contains a subset of items chosen from I. In association analysis, a collection of zero or more items is termed an itemset. If an itemset contains k items, it is called a k-itemset. For instance, {Beer, Diapers, Milk} is an example of a 3-itemset. The null (or empty) set is an itemset that does not contain any items.

A transaction tj is said to contain an itemset X if X is a subset of tj. For example, the second transaction shown in Table 5.2 contains the itemset {Bread, Diapers} but not {Bread, Milk}. An important property of an itemset is its support count, which refers to the number of transactions that contain a particular itemset. Mathematically, the support count, σ(X), for an itemset X can be stated as follows:

σ(X) = |{ti | X ⊆ ti, ti ∈ T}|,

where the symbol |·| denotes the number of elements in a set. In the data set shown in Table 5.2, the support count for {Beer, Diapers, Milk} is equal to two because there are only two transactions that contain all three items.

Often, the property of interest is the support, which is the fraction of transactions in which an itemset occurs:

s(X) = σ(X)/N.

An itemset X is called frequent if s(X) is greater than some user-defined threshold, minsup.
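The support count and support can be computed directly from the transactions of Table 5.1, as in the following sketch; the code is not from the text, and representing each transaction as a Python set is only an implementation choice.

```python
# A minimal sketch of computing sigma(X) and s(X) for the transactions of Table 5.1.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]

def support_count(itemset, transactions):
    """sigma(X): number of transactions that contain every item of X."""
    return sum(1 for t in transactions if itemset <= t)   # <= tests the subset relation

X = {"Beer", "Diapers", "Milk"}
sigma = support_count(X, transactions)
s = sigma / len(transactions)
print(sigma, s)        # 2 and 0.4, matching the discussion above
```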

AssociationRule

Association Rule

An association rule is an implication expression of the form X → Y, where X and Y are disjoint itemsets, i.e., X ∩ Y = ∅. The strength of an association rule can be measured in terms of its support and confidence. Support determines how often a rule is applicable to a given data set, while confidence determines how frequently items in Y appear in transactions that contain X. The formal definitions of these metrics are

Support, s(X → Y) = σ(X ∪ Y) / N;     (5.1)

Confidence, c(X → Y) = σ(X ∪ Y) / σ(X).     (5.2)

Example 5.1. Consider the rule {Milk, Diapers} → {Beer}. Because the support count for {Milk, Diapers, Beer} is 2 and the total number of transactions is 5, the rule's support is 2/5 = 0.4. The rule's confidence is obtained by dividing the support count for {Milk, Diapers, Beer} by the support count for {Milk, Diapers}. Since there are 3 transactions that contain milk and diapers, the confidence for this rule is 2/3 = 0.67.

Why Use Support and Confidence?

Support is an important measure because a rule that has very low support might occur simply by chance. Also, from a business perspective a low support rule is unlikely to be interesting because it might not be profitable to promote items that customers seldom buy together (with the exception of the situation described in Section 5.8). For these reasons, we are interested in finding rules whose support is greater than some user-defined threshold. As

will be shown in Section 5.2.1, support also has a desirable property that can be exploited for the efficient discovery of association rules.

Confidence, on the other hand, measures the reliability of the inference made by a rule. For a given rule X → Y, the higher the confidence, the more likely it is for Y to be present in transactions that contain X. Confidence also provides an estimate of the conditional probability of Y given X.

Association analysis results should be interpreted with caution. The inference made by an association rule does not necessarily imply causality. Instead, it can sometimes suggest a strong co-occurrence relationship between items in the antecedent and consequent of the rule. Causality, on the other hand, requires knowledge about which attributes in the data capture cause and effect, and typically involves relationships occurring over time (e.g., greenhouse gas emissions lead to global warming). See Section 5.7.1 for additional discussion.
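As a small illustration of Equations 5.1 and 5.2, the sketch below (not from the text) recomputes the support and confidence of the rule from Example 5.1 on the transactions of Table 5.1.

```python
# A minimal sketch computing the support and confidence of {Milk, Diapers} -> {Beer}.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]

def sigma(itemset):
    """Support count: number of transactions containing the itemset."""
    return sum(1 for t in transactions if itemset <= t)

X, Y = {"Milk", "Diapers"}, {"Beer"}
support = sigma(X | Y) / len(transactions)      # Equation 5.1
confidence = sigma(X | Y) / sigma(X)            # Equation 5.2
print(round(support, 2), round(confidence, 2))  # 0.4 and 0.67, as in Example 5.1
```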

Formulation of the Association Rule Mining Problem

The association rule mining problem can be formally stated as follows:

Definition 5.1. (Association Rule Discovery.) Given a set of transactions T, find all the rules having support ≥ minsup and confidence ≥ minconf, where minsup and minconf are the corresponding support and confidence thresholds.

A brute-force approach for mining association rules is to compute the support and confidence for every possible rule. This approach is prohibitively expensive because there are exponentially many rules that can be extracted from a data set. More specifically, assuming that neither the left nor the right-hand side of the rule is an empty set, the total number of possible rules, R, extracted from a data set that contains d items is

R = 3^d − 2^(d+1) + 1.     (5.3)

The proof for this equation is left as an exercise to the readers (see Exercise 5 on page 440). Even for the small data set shown in Table 5.1, this approach requires us to compute the support and confidence for 3^6 − 2^7 + 1 = 602 rules. More than 80% of the rules are discarded after applying minsup = 20% and minconf = 50%, thus wasting most of the computations. To avoid performing needless computations, it would be useful to prune the rules early without having to compute their support and confidence values.

An initial step toward improving the performance of association rule mining algorithms is to decouple the support and confidence requirements. From Equation 5.1, notice that the support of a rule X → Y is the same as the support of its corresponding itemset, X ∪ Y. For example, the following rules have identical support because they involve items from the same itemset, {Beer, Diapers, Milk}:

{Beer, Diapers} → {Milk},    {Beer, Milk} → {Diapers},    {Diapers, Milk} → {Beer},
{Beer} → {Diapers, Milk},    {Milk} → {Beer, Diapers},    {Diapers} → {Beer, Milk}.

Iftheitemsetisinfrequent,thenallsixcandidaterulescanbeprunedimmediatelywithoutourhavingtocomputetheirconfidencevalues.

Therefore,acommonstrategyadoptedbymanyassociationruleminingalgorithmsistodecomposetheproblemintotwomajorsubtasks:

1. FrequentItemsetGeneration,whoseobjectiveistofindalltheitemsetsthatsatisfytheminsupthreshold.

2. RuleGeneration,whoseobjectiveistoextractallthehighconfidencerulesfromthefrequentitemsetsfoundinthepreviousstep.Theserulesarecalledstrongrules.

Thecomputationalrequirementsforfrequentitemsetgenerationaregenerallymoreexpensivethanthoseofrulegeneration.EfficienttechniquesforgeneratingfrequentitemsetsandassociationrulesarediscussedinSections5.2 and5.3 ,respectively.

5.2 Frequent Itemset Generation

A lattice structure can be used to enumerate the list of all possible itemsets. Figure 5.1 shows an itemset lattice for I = {a, b, c, d, e}. In general, a data set that contains k items can potentially generate up to 2^k − 1 frequent itemsets, excluding the null set. Because k can be very large in many practical applications, the search space of itemsets that need to be explored is exponentially large.

Figure 5.1. An itemset lattice.

A brute-force approach for finding frequent itemsets is to determine the support count for every candidate itemset in the lattice structure. To do this, we need to compare each candidate against every transaction, an operation that is shown in Figure 5.2. If the candidate is contained in a transaction, its support count will be incremented. For example, the support for {Bread, Milk} is incremented three times because the itemset is contained in transactions 1, 4, and 5. Such an approach can be very expensive because it requires O(NMw) comparisons, where N is the number of transactions, $M = 2^k - 1$ is the number of candidate itemsets, and w is the maximum transaction width. Transaction width is the number of items present in a transaction.

Figure 5.2. Counting the support of candidate itemsets.
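The brute-force counting scheme of Figure 5.2 can be sketched in a few lines of Python. The transactions below are assumed for illustration and are consistent with the chapter's running market-basket example; they are not the book's code.

```python
from itertools import combinations

# Small transaction set consistent with the running example (assumed here).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]

items = sorted(set().union(*transactions))
# Candidate itemsets of size 2; M grows exponentially with the itemset size.
candidates = [frozenset(c) for c in combinations(items, 2)]

# O(N * M * w): compare every candidate against every transaction.
support_count = {c: 0 for c in candidates}
for t in transactions:
    for c in candidates:
        if c <= t:                       # candidate contained in the transaction
            support_count[c] += 1

print(support_count[frozenset({"Bread", "Milk"})])   # 3, i.e., transactions 1, 4, and 5
```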

There are three main approaches for reducing the computational complexity of frequent itemset generation.

1. Reduce the number of candidate itemsets (M). The Apriori principle, described in the next Section, is an effective way to eliminate some of the candidate itemsets without counting their support values.

2. Reduce the number of comparisons. Instead of matching each candidate itemset against every transaction, we can reduce the number of comparisons by using more advanced data structures, either to store the candidate itemsets or to compress the data set. We will discuss these strategies in Sections 5.2.4 and 5.6, respectively.

3. Reduce the number of transactions (N). As the size of candidate itemsets increases, fewer transactions will be supported by the itemsets. For instance, since the width of the first transaction in Table 5.1 is 2, it would be advantageous to remove this transaction before searching for frequent itemsets of size 3 and larger. Algorithms that employ such a strategy are discussed in the Bibliographic Notes.

5.2.1 The Apriori Principle

This Section describes how the support measure can be used to reduce the number of candidate itemsets explored during frequent itemset generation. The use of support for pruning candidate itemsets is guided by the following principle.

Theorem 5.1 (Apriori Principle). If an itemset is frequent, then all of its subsets must also be frequent.

To illustrate the idea behind the Apriori principle, consider the itemset lattice shown in Figure 5.3. Suppose {c, d, e} is a frequent itemset. Clearly, any transaction that contains {c, d, e} must also contain its subsets, {c, d}, {c, e}, {d, e}, {c}, {d}, and {e}. As a result, if {c, d, e} is frequent, then all subsets of {c, d, e} (i.e., the shaded itemsets in this figure) must also be frequent.

Figure 5.3. An illustration of the Apriori principle. If {c, d, e} is frequent, then all subsets of this itemset are frequent.

Conversely, if an itemset such as {a, b} is infrequent, then all of its supersets must be infrequent too. As illustrated in Figure 5.4, the entire subgraph containing the supersets of {a, b} can be pruned immediately once {a, b} is found to be infrequent. This strategy of trimming the exponential search space based on the support measure is known as support-based pruning. Such a pruning strategy is made possible by a key property of the support measure, namely, that the support for an itemset never exceeds the support for its subsets. This property is also known as the anti-monotone property of the support measure.

Figure 5.4. An illustration of support-based pruning. If {a, b} is infrequent, then all supersets of {a, b} are infrequent.

Definition 5.2. (Anti-monotone Property.) A measure f possesses the anti-monotone property if for every itemset X that is a proper subset of itemset Y, i.e., $X \subset Y$, we have $f(Y) \le f(X)$.

More generally, a large number of measures (see Section 5.7.1) can be applied to itemsets to evaluate various properties of itemsets. As will be shown in the next Section, any measure that has the anti-monotone property can be incorporated directly into an itemset mining algorithm to effectively prune the exponential search space of candidate itemsets.
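To make the anti-monotone property of support concrete, the short check below (ours, on the same assumed toy transactions as before) verifies that the support of an itemset never exceeds the support of any of its proper subsets.

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]

def support(itemset):
    return sum(1 for t in transactions if set(itemset) <= t)

Y = {"Bread", "Diapers", "Milk"}
for k in range(1, len(Y)):
    for X in combinations(Y, k):          # every proper subset X of Y
        assert support(Y) <= support(X)   # support(Y) can never exceed support(X)
print("anti-monotone property holds for all proper subsets of", Y)
```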

5.2.2 Frequent Itemset Generation in the Apriori Algorithm

Apriori is the first association rule mining algorithm that pioneered the use of support-based pruning to systematically control the exponential growth of candidate itemsets. Figure 5.5 provides a high-level illustration of the frequent itemset generation part of the Apriori algorithm for the transactions shown in Table 5.1. We assume that the support threshold is 60%, which is equivalent to a minimum support count equal to 3.

Figure 5.5. Illustration of frequent itemset generation using the Apriori algorithm.

Initially, every item is considered as a candidate 1-itemset. After counting their supports, the candidate itemsets {Cola} and {Eggs} are discarded because they appear in fewer than three transactions. In the next iteration, candidate 2-itemsets are generated using only the frequent 1-itemsets because the Apriori principle ensures that all supersets of the infrequent 1-itemsets must be infrequent. Because there are only four frequent 1-itemsets, the number of candidate 2-itemsets generated by the algorithm is $\binom{4}{2} = 6$. Two of these six candidates, {Beer, Bread} and {Beer, Milk}, are subsequently found to be infrequent after computing their support values. The remaining four candidates are frequent, and thus will be used to generate candidate 3-itemsets. Without support-based pruning, there are $\binom{6}{3} = 20$ candidate 3-itemsets that can be formed using the six items given in this example. With the Apriori principle, we only need to keep candidate 3-itemsets whose subsets are frequent. The only candidate that has this property is {Bread, Diapers, Milk}. However, even though the subsets of {Bread, Diapers, Milk} are frequent, the itemset itself is not.

The effectiveness of the Apriori pruning strategy can be shown by counting the number of candidate itemsets generated. A brute-force strategy of enumerating all itemsets (up to size 3) as candidates will produce

$$\binom{6}{1} + \binom{6}{2} + \binom{6}{3} = 6 + 15 + 20 = 41$$

candidates. With the Apriori principle, this number decreases to

$$\binom{6}{1} + \binom{4}{2} + 1 = 6 + 6 + 1 = 13$$

candidates, which represents a 68% reduction in the number of candidate itemsets even in this simple example.

The pseudocode for the frequent itemset generation part of the Apriori algorithm is shown in Algorithm 5.1. Let $C_k$ denote the set of candidate k-itemsets and $F_k$ denote the set of frequent k-itemsets:

Algorithm 5.1 Frequent itemset generation of the Apriori algorithm.
1: k = 1.
2: $F_k = \{\, i \mid i \in I \wedge \sigma(\{i\}) \ge N \times minsup \,\}$.  {Find all frequent 1-itemsets}
3: repeat
4:   k = k + 1.
5:   $C_k$ = candidate-gen($F_{k-1}$).  {Generate candidate itemsets}
6:   $C_k$ = candidate-prune($C_k$, $F_{k-1}$).  {Prune candidate itemsets}
7:   for each transaction $t \in T$ do
8:     $C_t$ = subset($C_k$, t).  {Identify all candidates that belong to t}
9:     for each candidate itemset $c \in C_t$ do
10:      $\sigma(c) = \sigma(c) + 1$.  {Increment support count}
11:    end for
12:  end for
13:  $F_k = \{\, c \mid c \in C_k \wedge \sigma(c) \ge N \times minsup \,\}$.  {Extract the frequent k-itemsets}
14: until $F_k = \emptyset$
15: Result = $\bigcup_k F_k$.

The algorithm initially makes a single pass over the data set to determine the support of each item. Upon completion of this step, the set of all frequent 1-itemsets, $F_1$, will be known (steps 1 and 2). Next, the algorithm will iteratively generate new candidate k-itemsets and prune unnecessary candidates that are guaranteed to be infrequent given the frequent (k−1)-itemsets found in the previous iteration (steps 5 and 6). Candidate generation and pruning is implemented using the functions candidate-gen and candidate-prune, which are described in Section 5.2.3.

To count the support of the candidates, the algorithm needs to make an additional pass over the data set (steps 7–12). The subset function is used to determine all the candidate itemsets in $C_k$ that are contained in each transaction t. The implementation of this function is described in Section 5.2.4. After counting their supports, the algorithm eliminates all candidate itemsets whose support counts are less than $N \times minsup$ (step 13). The algorithm terminates when there are no new frequent itemsets generated, i.e., $F_k = \emptyset$ (step 14).

The frequent itemset generation part of the Apriori algorithm has two important characteristics. First, it is a level-wise algorithm; i.e., it traverses the itemset lattice one level at a time, from frequent 1-itemsets to the maximum size of frequent itemsets. Second, it employs a generate-and-test strategy for finding frequent itemsets. At each iteration (level), new candidate itemsets are generated from the frequent itemsets found in the previous iteration. The support for each candidate is then counted and tested against the minsup threshold. The total number of iterations needed by the algorithm is $k_{max}+1$, where $k_{max}$ is the maximum size of the frequent itemsets.

5.2.3 Candidate Generation and Pruning

The candidate-gen and candidate-prune functions shown in Steps 5 and 6 of Algorithm 5.1 generate candidate itemsets and prune unnecessary ones by performing the following two operations, respectively:

1. Candidate Generation. This operation generates new candidate k-itemsets based on the frequent (k−1)-itemsets found in the previous iteration.

2. Candidate Pruning. This operation eliminates some of the candidate k-itemsets using support-based pruning, i.e., by removing k-itemsets whose subsets are known to be infrequent in previous iterations. Note that this pruning is done without computing the actual support of these k-itemsets (which could have required comparing them against each transaction).
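Putting these pieces together, the following Python sketch (ours, not the book's implementation) mirrors the level-wise, generate-and-test structure of Algorithm 5.1. For brevity, candidate generation simply unions pairs of frequent (k−1)-itemsets whose union has size k and then prunes candidates with an infrequent (k−1)-subset.

```python
from itertools import combinations

def apriori(transactions, minsup_count):
    """Level-wise frequent itemset generation in the spirit of Algorithm 5.1."""
    items = sorted(set().union(*transactions))
    # Frequent 1-itemsets.
    F = [{frozenset([i]) for i in items
          if sum(1 for t in transactions if i in t) >= minsup_count}]
    k = 1
    while F[-1]:
        k += 1
        prev = F[-1]
        # Candidate generation: union pairs of frequent (k-1)-itemsets of size k.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Candidate pruning: keep candidates whose (k-1)-subsets are all frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # Support counting followed by the minsup filter.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        F.append({c for c, n in counts.items() if n >= minsup_count})
    return set().union(*F)

transactions = [{"Bread", "Milk"}, {"Bread", "Diapers", "Beer", "Eggs"},
                {"Milk", "Diapers", "Beer", "Cola"},
                {"Bread", "Milk", "Diapers", "Beer"},
                {"Bread", "Milk", "Diapers", "Cola"}]
print(sorted(map(sorted, apriori(transactions, minsup_count=3))))
```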

Candidate Generation

In principle, there are many ways to generate candidate itemsets. An effective candidate generation procedure must be complete and non-redundant. A candidate generation procedure is said to be complete if it does not omit any frequent itemsets. To ensure completeness, the set of candidate itemsets must subsume the set of all frequent itemsets, i.e., $\forall k: F_k \subseteq C_k$. A candidate generation procedure is non-redundant if it does not generate the same candidate itemset more than once. For example, the candidate itemset {a, b, c, d} can be generated in many ways, e.g., by merging {a, b, c} with {d}, {b, d} with {a, c}, {c} with {a, b, d}, etc. Generation of duplicate candidates leads to wasted computations and thus should be avoided for efficiency reasons. Also, an effective candidate generation procedure should avoid generating too many unnecessary candidates. A candidate itemset is unnecessary if at least one of its subsets is infrequent, and thus, eliminated in the candidate pruning step.

Next, we will briefly describe several candidate generation procedures, including the one used by the candidate-gen function.

Brute-Force Method

The brute-force method considers every k-itemset as a potential candidate and then applies the candidate pruning step to remove any unnecessary candidates whose subsets are infrequent (see Figure 5.6). The number of candidate itemsets generated at level k is equal to $\binom{d}{k}$, where d is the total number of items. Although candidate generation is rather trivial, candidate pruning becomes extremely expensive because a large number of itemsets must be examined.

Figure 5.6. A brute-force method for generating candidate 3-itemsets.

$F_{k-1} \times F_1$ Method

An alternative method for candidate generation is to extend each frequent (k−1)-itemset with frequent items that are not part of the (k−1)-itemset. Figure 5.7 illustrates how a frequent 2-itemset such as {Beer, Diapers} can be augmented with a frequent item such as Bread to produce a candidate 3-itemset {Beer, Bread, Diapers}.

Figure 5.7. Generating and pruning candidate k-itemsets by merging a frequent (k−1)-itemset with a frequent item. Note that some of the candidates are unnecessary because their subsets are infrequent.

The procedure is complete because every frequent k-itemset is composed of a frequent (k−1)-itemset and a frequent 1-itemset. Therefore, all frequent k-itemsets are part of the candidate k-itemsets generated by this procedure. Figure 5.7 shows that the $F_{k-1} \times F_1$ candidate generation method only produces four candidate 3-itemsets, instead of the $\binom{6}{3} = 20$ itemsets produced by the brute-force method. The $F_{k-1} \times F_1$ method generates a lower number of candidates because every candidate is guaranteed to contain at least one frequent (k−1)-itemset. While this procedure is a substantial improvement over the brute-force method, it can still produce a large number of unnecessary candidates, as the remaining subsets of a candidate itemset can still be infrequent.

Note that the approach discussed above does not prevent the same candidate itemset from being generated more than once. For instance, {Bread, Diapers, Milk} can be generated by merging {Bread, Diapers} with {Milk}, {Bread, Milk} with {Diapers}, or {Diapers, Milk} with {Bread}. One way to avoid generating duplicate candidates is by ensuring that the items in each frequent itemset are kept sorted in their lexicographic order. For example, itemsets such as {Bread, Diapers}, {Bread, Diapers, Milk}, and {Diapers, Milk} follow lexicographic order as the items within every itemset are arranged alphabetically. Each frequent (k−1)-itemset X is then extended with frequent items that are lexicographically larger than the items in X. For example, the itemset {Bread, Diapers} can be augmented with {Milk} because Milk is lexicographically larger than Bread and Diapers. However, we should not augment {Diapers, Milk} with {Bread} nor {Bread, Milk} with {Diapers} because they violate the lexicographic ordering condition. Every candidate k-itemset is thus generated exactly once, by merging the lexicographically largest item with the remaining k−1 items in the itemset. If the $F_{k-1} \times F_1$ method is used in conjunction with lexicographic ordering, then only two candidate 3-itemsets will be produced in the example illustrated in Figure 5.7. {Beer, Bread, Diapers} and {Beer, Bread, Milk} will not be generated because {Beer, Bread} is not a frequent 2-itemset.

$F_{k-1} \times F_{k-1}$ Method

This candidate generation procedure, which is used in the candidate-gen function of the Apriori algorithm, merges a pair of frequent (k−1)-itemsets only if their first k−2 items, arranged in lexicographic order, are identical. Let $A = \{a_1, a_2, \ldots, a_{k-1}\}$ and $B = \{b_1, b_2, \ldots, b_{k-1}\}$ be a pair of frequent (k−1)-itemsets, arranged lexicographically. A and B are merged if they satisfy the following condition:

$$a_i = b_i \quad (\text{for } i = 1, 2, \ldots, k-2).$$

Note that in this case, $a_{k-1} \neq b_{k-1}$ because A and B are two distinct itemsets. The candidate k-itemset generated by merging A and B consists of the first k−2 common items followed by $a_{k-1}$ and $b_{k-1}$ in lexicographic order. This candidate generation procedure is complete, because for every lexicographically ordered frequent k-itemset, there exist two lexicographically ordered frequent (k−1)-itemsets that have identical items in the first k−2 positions.

In Figure 5.8, the frequent itemsets {Bread, Diapers} and {Bread, Milk} are merged to form a candidate 3-itemset {Bread, Diapers, Milk}. The algorithm does not have to merge {Beer, Diapers} with {Diapers, Milk} because the first item in both itemsets is different. Indeed, if {Beer, Diapers, Milk} is a viable candidate, it would have been obtained by merging {Beer, Diapers} with {Beer, Milk} instead. This example illustrates both the completeness of the candidate generation procedure and the advantages of using lexicographic ordering to prevent duplicate candidates. Also, if we order the frequent (k−1)-itemsets according to their lexicographic rank, itemsets with identical first k−2 items would take consecutive ranks. As a result, the $F_{k-1} \times F_{k-1}$ candidate generation method would consider merging a frequent itemset only with ones that occupy the next few ranks in the sorted list, thus saving some computations.

Figure 5.8. Generating and pruning candidate k-itemsets by merging pairs of frequent (k−1)-itemsets.

Figure 5.8 shows that the $F_{k-1} \times F_{k-1}$ candidate generation procedure results in only one candidate 3-itemset. This is a considerable reduction from the four candidate 3-itemsets generated by the $F_{k-1} \times F_1$ method. This is because the $F_{k-1} \times F_{k-1}$ method ensures that every candidate k-itemset contains at least two frequent (k−1)-itemsets, thus greatly reducing the number of candidates that are generated in this step.

Note that there can be multiple ways of merging two frequent (k−1)-itemsets in the $F_{k-1} \times F_{k-1}$ procedure, one of which is merging if their first k−2 items are identical. An alternate approach could be to merge two frequent (k−1)-itemsets A and B if the last k−2 items of A are identical to the first k−2 items of B. For example, {Bread, Diapers} and {Diapers, Milk} could be merged using this approach to generate the candidate 3-itemset {Bread, Diapers, Milk}. As we will see later, this alternate $F_{k-1} \times F_{k-1}$ procedure is useful in generating sequential patterns, which will be discussed in Chapter 6.
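A direct rendering of the $F_{k-1} \times F_{k-1}$ merging condition, assuming frequent itemsets are stored as lexicographically sorted tuples, is sketched below (our code, not the book's):

```python
from itertools import combinations

def candidate_gen(freq_prev):
    """Merge pairs of frequent (k-1)-itemsets whose first k-2 items are identical."""
    freq_prev = sorted(freq_prev)          # lexicographically sorted tuples
    candidates = []
    for i, a in enumerate(freq_prev):
        for b in freq_prev[i + 1:]:
            if a[:-1] == b[:-1]:           # identical first k-2 items
                candidates.append(a + (b[-1],))
            else:                          # sorted order: no later b can share a's prefix
                break
    return candidates

def candidate_prune(candidates, freq_prev):
    """Keep candidates all of whose (k-1)-subsets are frequent."""
    prev = set(freq_prev)
    return [c for c in candidates
            if all(s in prev for s in combinations(c, len(c) - 1))]

F2 = [("Beer", "Diapers"), ("Bread", "Diapers"), ("Bread", "Milk"), ("Diapers", "Milk")]
C3 = candidate_prune(candidate_gen(F2), F2)
print(C3)   # [('Bread', 'Diapers', 'Milk')] -- a single candidate 3-itemset, as in the text
```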

Candidate Pruning

To illustrate the candidate pruning operation for a candidate k-itemset, $X = \{i_1, i_2, \ldots, i_k\}$, consider its k proper subsets, $X - \{i_j\}\ (\forall j = 1, 2, \ldots, k)$. If any of them are infrequent, then X is immediately pruned by using the Apriori principle. Note that we don't need to explicitly ensure that all subsets of X of size less than k−1 are frequent (see Exercise 7). This approach greatly reduces the number of candidate itemsets considered during support counting. For the brute-force candidate generation method, candidate pruning requires checking only k subsets of size k−1 for each candidate k-itemset. However, since the $F_{k-1} \times F_1$ candidate generation strategy ensures that at least one of the (k−1)-size subsets of every candidate k-itemset is frequent, we only need to check for the remaining k−1 subsets. Likewise, the $F_{k-1} \times F_{k-1}$ strategy requires examining only k−2 subsets of every candidate k-itemset, since two of its (k−1)-size subsets are already known to be frequent in the candidate generation step.

5.2.4 Support Counting

Support counting is the process of determining the frequency of occurrence for every candidate itemset that survives the candidate pruning step. Support counting is implemented in steps 7 through 12 of Algorithm 5.1. A brute-force approach for doing this is to compare each transaction against every candidate itemset (see Figure 5.2) and to update the support counts of candidates contained in a transaction. This approach is computationally expensive, especially when the numbers of transactions and candidate itemsets are large.

An alternative approach is to enumerate the itemsets contained in each transaction and use them to update the support counts of their respective candidate itemsets. To illustrate, consider a transaction t that contains five items, {1, 2, 3, 5, 6}. There are $\binom{5}{3} = 10$ itemsets of size 3 contained in this transaction. Some of the itemsets may correspond to the candidate 3-itemsets under investigation, in which case, their support counts are incremented. Other subsets of t that do not correspond to any candidates can be ignored.

Figure 5.9 shows a systematic way for enumerating the 3-itemsets contained in t. Assuming that each itemset keeps its items in increasing lexicographic order, an itemset can be enumerated by specifying the smallest item first, followed by the larger items. For instance, given $t = \{1, 2, 3, 5, 6\}$, all the 3-itemsets contained in t must begin with item 1, 2, or 3. It is not possible to construct a 3-itemset that begins with items 5 or 6 because there are only two items in t whose labels are greater than or equal to 5. The number of ways to specify the first item of a 3-itemset contained in t is illustrated by the Level 1 prefix tree structure depicted in Figure 5.9. For instance, the node labeled 1 represents a 3-itemset that begins with item 1, followed by two more items chosen from the set {2, 3, 5, 6}.

Figure 5.9. Enumerating subsets of three items from a transaction t.

After fixing the first item, the prefix tree structure at Level 2 represents the number of ways to select the second item. For example, the node labeled 12 corresponds to itemsets that begin with the prefix (1 2) and are followed by the items 3, 5, or 6. Finally, the prefix tree structure at Level 3 represents the complete set of 3-itemsets contained in t. For example, the 3-itemsets that begin with prefix {1 2} are {1, 2, 3}, {1, 2, 5}, and {1, 2, 6}, while those that begin with prefix {2 3} are {2, 3, 5} and {2, 3, 6}.

The prefix tree structure shown in Figure 5.9 demonstrates how itemsets contained in a transaction can be systematically enumerated, i.e., by specifying their items one by one, from the leftmost item to the rightmost item. We still have to determine whether each enumerated 3-itemset corresponds to an existing candidate itemset. If it matches one of the candidates, then the support count of the corresponding candidate is incremented. In the next Section, we illustrate how this matching operation can be performed efficiently using a hash tree structure.
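This enumeration amounts to generating the 3-item subsets of a transaction in lexicographic order and looking each one up among the candidates. A minimal sketch (ours; the candidate 3-itemsets below are hypothetical):

```python
from itertools import combinations

t = (1, 2, 3, 5, 6)                       # the transaction from the text
candidates = {frozenset(c): 0 for c in [(1, 2, 5), (3, 5, 6), (2, 3, 6)]}  # hypothetical C3

# itertools.combinations emits the C(5,3) = 10 subsets in lexicographic order,
# mirroring the prefix-tree enumeration of Figure 5.9.
for subset in combinations(t, 3):
    key = frozenset(subset)
    if key in candidates:                 # matches a candidate 3-itemset
        candidates[key] += 1

print(list(combinations(t, 3))[:3])       # (1, 2, 3), (1, 2, 5), (1, 2, 6)
print(candidates)
```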

Support Counting Using a Hash Tree*

In the Apriori algorithm, candidate itemsets are partitioned into different buckets and stored in a hash tree. During support counting, itemsets contained in each transaction are also hashed into their appropriate buckets. That way, instead of comparing each itemset in the transaction with every candidate itemset, it is matched only against candidate itemsets that belong to the same bucket, as shown in Figure 5.10.

Figure 5.10. Counting the support of itemsets using hash structure.

Figure 5.11 shows an example of a hash tree structure. Each internal node of the tree uses the following hash function, $h(p) = (p - 1) \bmod 3$, where mod refers to the modulo (remainder) operator, to determine which branch of the current node should be followed next. For example, items 1, 4, and 7 are hashed to the same branch (i.e., the leftmost branch) because they have the same remainder after dividing the number by 3. All candidate itemsets are stored at the leaf nodes of the hash tree. The hash tree shown in Figure 5.11 contains 15 candidate 3-itemsets, distributed across 9 leaf nodes.

Figure 5.11. Hashing a transaction at the root node of a hash tree.

Consider the transaction, $t = \{1, 2, 3, 4, 5, 6\}$. To update the support counts of the candidate itemsets, the hash tree must be traversed in such a way that all the leaf nodes containing candidate 3-itemsets belonging to t must be visited at least once. Recall that the 3-itemsets contained in t must begin with items 1, 2, or 3, as indicated by the Level 1 prefix tree structure shown in Figure 5.9. Therefore, at the root node of the hash tree, the items 1, 2, and 3 of the transaction are hashed separately. Item 1 is hashed to the left child of the root node, item 2 is hashed to the middle child, and item 3 is hashed to the right child. At the next level of the tree, the transaction is hashed on the second item listed in the Level 2 tree structure shown in Figure 5.9. For example, after hashing on item 1 at the root node, items 2, 3, and 5 of the transaction are hashed. Based on the hash function, items 2 and 5 are hashed to the middle child, while item 3 is hashed to the right child, as shown in Figure 5.12. This process continues until the leaf nodes of the hash tree are reached. The candidate itemsets stored at the visited leaf nodes are compared against the transaction. If a candidate is a subset of the transaction, its support count is incremented. Note that not all the leaf nodes are visited while traversing the hash tree, which helps in reducing the computational cost. In this example, 5 out of the 9 leaf nodes are visited and 9 out of the 15 itemsets are compared against the transaction.

Figure 5.12. Subset operation on the leftmost subtree of the root of a candidate hash tree.

5.2.5 Computational Complexity

The computational complexity of the Apriori algorithm, which includes both its runtime and storage, can be affected by the following factors.

Support Threshold

Lowering the support threshold often results in more itemsets being declared as frequent. This has an adverse effect on the computational complexity of the algorithm because more candidate itemsets must be generated and counted at every level, as shown in Figure 5.13. The maximum size of frequent itemsets also tends to increase with lower support thresholds. This increases the total number of iterations to be performed by the Apriori algorithm, further increasing the computational cost.

Figure 5.13. Effect of support threshold on the number of candidate and frequent itemsets obtained from a benchmark data set.

Number of Items (Dimensionality)

As the number of items increases, more space will be needed to store the support counts of items. If the number of frequent items also grows with the dimensionality of the data, the runtime and storage requirements will increase because of the larger number of candidate itemsets generated by the algorithm.

Number of Transactions

Because the Apriori algorithm makes repeated passes over the transaction data set, its runtime increases with a larger number of transactions.

Average Transaction Width

For dense data sets, the average transaction width can be very large. This affects the complexity of the Apriori algorithm in two ways. First, the maximum size of frequent itemsets tends to increase as the average transaction width increases. As a result, more candidate itemsets must be examined during candidate generation and support counting, as illustrated in Figure 5.14. Second, as the transaction width increases, more itemsets are contained in the transaction. This will increase the number of hash tree traversals performed during support counting.

A detailed analysis of the time complexity for the Apriori algorithm is presented next.

Figure 5.14. Effect of average transaction width on the number of candidate and frequent itemsets obtained from a synthetic data set.

Generation of frequent 1-itemsets

For each transaction, we need to update the support count for every item present in the transaction. Assuming that w is the average transaction width, this operation requires O(Nw) time, where N is the total number of transactions.

Candidate generation

To generate candidate k-itemsets, pairs of frequent (k−1)-itemsets are merged to determine whether they have at least k−2 items in common. Each merging operation requires at most k−2 equality comparisons. Every merging step can produce at most one viable candidate k-itemset, while in the worst case, the algorithm must try to merge every pair of frequent (k−1)-itemsets found in the previous iteration. Therefore, the overall cost of merging frequent itemsets is

$$\sum_{k=2}^{w}(k-2)\,|C_k| \;<\; \text{Cost of merging} \;<\; \sum_{k=2}^{w}(k-2)\,|F_{k-1}|^2,$$

where w is the maximum transaction width. A hash tree is also constructed during candidate generation to store the candidate itemsets. Because the maximum depth of the tree is k, the cost for populating the hash tree with candidate itemsets is $O\!\left(\sum_{k=2}^{w} k\,|C_k|\right)$. During candidate pruning, we need to verify that the k−2 subsets of every candidate k-itemset are frequent. Since the cost for looking up a candidate in a hash tree is O(k), the candidate pruning step requires $O\!\left(\sum_{k=2}^{w} k(k-2)\,|C_k|\right)$ time.

Support counting

Each transaction of width $|t|$ produces $\binom{|t|}{k}$ itemsets of size k. This is also the effective number of hash tree traversals performed for each transaction. The cost for support counting is $O\!\left(N\sum_{k}\binom{w}{k}\alpha_k\right)$, where w is the maximum transaction width and $\alpha_k$ is the cost for updating the support count of a candidate k-itemset in the hash tree.

5.3 Rule Generation

This Section describes how to extract association rules efficiently from a given frequent itemset. Each frequent k-itemset, Y, can produce up to $2^k - 2$ association rules, ignoring rules that have empty antecedents or consequents ($\emptyset \to Y$ or $Y \to \emptyset$). An association rule can be extracted by partitioning the itemset Y into two non-empty subsets, X and $Y-X$, such that $X \to Y-X$ satisfies the confidence threshold. Note that all such rules must have already met the support threshold because they are generated from a frequent itemset.

Example 5.2. Let $X = \{a, b, c\}$ be a frequent itemset. There are six candidate association rules that can be generated from X: $\{a,b\} \to \{c\}$, $\{a,c\} \to \{b\}$, $\{b,c\} \to \{a\}$, $\{a\} \to \{b,c\}$, $\{b\} \to \{a,c\}$, and $\{c\} \to \{a,b\}$. As each of their support is identical to the support for X, all the rules satisfy the support threshold.

Computing the confidence of an association rule does not require additional scans of the transaction data set. Consider the rule $\{1,2\} \to \{3\}$, which is generated from the frequent itemset $X = \{1, 2, 3\}$. The confidence for this rule is $\sigma(\{1,2,3\})/\sigma(\{1,2\})$. Because {1, 2, 3} is frequent, the anti-monotone property of support ensures that {1, 2} must be frequent, too. Since the support counts for both itemsets were already found during frequent itemset generation, there is no need to read the entire data set again.

5.3.1 Confidence-Based Pruning

Confidence does not show the anti-monotone property in the same way as the support measure. For example, the confidence for $X \to Y$ can be larger, smaller, or equal to the confidence for another rule $\tilde{X} \to \tilde{Y}$, where $\tilde{X} \subseteq X$ and $\tilde{Y} \subseteq Y$ (see Exercise 3 on page 439). Nevertheless, if we compare rules generated from the same frequent itemset Y, the following theorem holds for the confidence measure.

Theorem 5.2. Let Y be an itemset and X a subset of Y. If a rule $X \to Y-X$ does not satisfy the confidence threshold, then any rule $\tilde{X} \to Y-\tilde{X}$, where $\tilde{X}$ is a subset of X, must not satisfy the confidence threshold as well.

To prove this theorem, consider the following two rules: $\tilde{X} \to Y-\tilde{X}$ and $X \to Y-X$, where $\tilde{X} \subset X$. The confidence of the rules are $\sigma(Y)/\sigma(\tilde{X})$ and $\sigma(Y)/\sigma(X)$, respectively. Since $\tilde{X}$ is a subset of X, $\sigma(\tilde{X}) \ge \sigma(X)$. Therefore, the former rule cannot have a higher confidence than the latter rule.

5.3.2 Rule Generation in Apriori Algorithm

The Apriori algorithm uses a level-wise approach for generating association rules, where each level corresponds to the number of items that belong to the rule consequent. Initially, all the high confidence rules that have only one item in the rule consequent are extracted. These rules are then used to generate new candidate rules. For example, if $\{acd\} \to \{b\}$ and $\{abd\} \to \{c\}$ are high confidence rules, then the candidate rule $\{ad\} \to \{bc\}$ is generated by merging the consequents of both rules. Figure 5.15 shows a lattice structure for the association rules generated from the frequent itemset {a, b, c, d}. If any node in the lattice has low confidence, then according to Theorem 5.2, the entire subgraph spanned by the node can be pruned immediately. Suppose the confidence for $\{bcd\} \to \{a\}$ is low. All the rules containing item a in its consequent, including $\{cd\} \to \{ab\}$, $\{bd\} \to \{ac\}$, $\{bc\} \to \{ad\}$, and $\{d\} \to \{abc\}$ can be discarded.

Figure 5.15. Pruning of association rules using the confidence measure.

A pseudocode for the rule generation step is shown in Algorithms 5.2 and 5.3. Note the similarity between the ap-genrules procedure given in Algorithm 5.3 and the frequent itemset generation procedure given in Algorithm 5.1. The only difference is that, in rule generation, we do not have to make additional passes over the data set to compute the confidence of the candidate rules. Instead, we determine the confidence of each rule by using the support counts computed during frequent itemset generation.

Algorithm 5.2 Rule generation of the Apriori algorithm.

Algorithm 5.3 Procedure ap-genrules($f_k$, $H_m$).
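The level-wise rule generation described above can be sketched as follows (our code, not Algorithms 5.2–5.3). It assumes that support counts for all frequent itemsets are already available from the itemset generation phase, and for brevity it enumerates all possible consequents rather than applying the Theorem 5.2 pruning.

```python
from itertools import combinations

def generate_rules(freq_support, minconf):
    """Extract rules X -> Y-X from each frequent itemset Y using stored supports."""
    rules = []
    for Y, supp_Y in freq_support.items():
        if len(Y) < 2:
            continue
        for m in range(1, len(Y)):                     # consequents of size 1, 2, ...
            for consequent in combinations(sorted(Y), m):
                antecedent = Y - frozenset(consequent)
                conf = supp_Y / freq_support[antecedent]   # no extra data-set scan needed
                if conf >= minconf:
                    rules.append((set(antecedent), set(consequent), conf))
    return rules

# Hypothetical support counts for illustration only.
freq_support = {
    frozenset({"Bread"}): 4, frozenset({"Milk"}): 4, frozenset({"Diapers"}): 4,
    frozenset({"Bread", "Milk"}): 3, frozenset({"Diapers", "Milk"}): 3,
}
for lhs, rhs, conf in generate_rules(freq_support, minconf=0.7):
    print(lhs, "->", rhs, round(conf, 2))
```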

5.3.3 An Example: Congressional Voting Records

This Section demonstrates the results of applying association analysis to the voting records of members of the United States House of Representatives. The data is obtained from the 1984 Congressional Voting Records Database, which is available at the UCI machine learning data repository. Each transaction contains information about the party affiliation for a representative along with his or her voting record on 16 key issues. There are 435 transactions and 34 items in the data set. The set of items are listed in Table 5.3.

Table 5.3. List of binary attributes from the 1984 United States Congressional Voting Records. Source: The UCI machine learning repository.

1. Republican
2. Democrat
3. handicapped-infants = yes
4. handicapped-infants = no
5. water project cost sharing = yes
6. water project cost sharing = no
7. budget-resolution = yes
8. budget-resolution = no
9. physician fee freeze = yes
10. physician fee freeze = no
11. aid to El Salvador = yes
12. aid to El Salvador = no
13. religious groups in schools = yes
14. religious groups in schools = no
15. anti-satellite test ban = yes
16. anti-satellite test ban = no
17. aid to Nicaragua = yes
18. aid to Nicaragua = no
19. MX-missile = yes
20. MX-missile = no
21. immigration = yes
22. immigration = no
23. synfuel corporation cutback = yes
24. synfuel corporation cutback = no
25. education spending = yes
26. education spending = no
27. right-to-sue = yes
28. right-to-sue = no
29. crime = yes
30. crime = no
31. duty-free-exports = yes
32. duty-free-exports = no
33. export administration act = yes
34. export administration act = no

The Apriori algorithm is then applied to the data set with minsup = 30% and minconf = 90%. Some of the high confidence rules extracted by the algorithm are shown in Table 5.4. The first two rules suggest that most of the members who voted yes for aid to El Salvador and no for budget resolution and MX missile are Republicans; while those who voted no for aid to El Salvador and yes for budget resolution and MX missile are Democrats. These high confidence rules show the key issues that divide members from both political parties.

Table 5.4. Association rules extracted from the 1984 United States Congressional Voting Records.

Association Rule                                                                         Confidence
{budget resolution = no, MX-missile = no, aid to El Salvador = yes} → {Republican}        91.0%
{budget resolution = yes, MX-missile = yes, aid to El Salvador = no} → {Democrat}         97.5%
{crime = yes, right-to-sue = yes, physician fee freeze = yes} → {Republican}              93.5%
{crime = no, right-to-sue = no, physician fee freeze = no} → {Democrat}                   100%

5.4 Compact Representation of Frequent Itemsets

In practice, the number of frequent itemsets produced from a transaction data set can be very large. It is useful to identify a small representative set of frequent itemsets from which all other frequent itemsets can be derived. Two such representations are presented in this Section in the form of maximal and closed frequent itemsets.

5.4.1 Maximal Frequent Itemsets

Definition 5.3. (Maximal Frequent Itemset.) A frequent itemset is maximal if none of its immediate supersets are frequent.

To illustrate this concept, consider the itemset lattice shown in Figure 5.16. The itemsets in the lattice are divided into two groups: those that are frequent and those that are infrequent. A frequent itemset border, which is represented by a dashed line, is also illustrated in the diagram. Every itemset located above the border is frequent, while those located below the border (the shaded nodes) are infrequent. Among the itemsets residing near the border, {a, d}, {a, c, e}, and {b, c, d, e} are maximal frequent itemsets because all of their immediate supersets are infrequent. For example, the itemset {a, d} is maximal frequent because all of its immediate supersets, {a, b, d}, {a, c, d}, and {a, d, e}, are infrequent. In contrast, {a, c} is non-maximal because one of its immediate supersets, {a, c, e}, is frequent.

Figure 5.16. Maximal frequent itemset.

Maximal frequent itemsets effectively provide a compact representation of frequent itemsets. In other words, they form the smallest set of itemsets from which all frequent itemsets can be derived. For example, every frequent itemset in Figure 5.16 is a subset of one of the three maximal frequent itemsets, {a, d}, {a, c, e}, and {b, c, d, e}. If an itemset is not a proper subset of any of the maximal frequent itemsets, then it is either infrequent (e.g., {a, d, e}) or maximal frequent itself (e.g., {b, c, d, e}). Hence, the maximal frequent itemsets {a, c, e}, {a, d}, and {b, c, d, e} provide a compact representation of the frequent itemsets shown in Figure 5.16. Enumerating all the subsets of maximal frequent itemsets generates the complete list of all frequent itemsets.

Maximal frequent itemsets provide a valuable representation for data sets that can produce very long, frequent itemsets, as there are exponentially many frequent itemsets in such data. Nevertheless, this approach is practical only if an efficient algorithm exists to explicitly find the maximal frequent itemsets. We briefly describe one such approach in Section 5.5.

Despite providing a compact representation, maximal frequent itemsets do not contain the support information of their subsets. For example, the support of the maximal frequent itemsets {a, c, e}, {a, d}, and {b, c, d, e} does not provide any information about the support of their subsets except that it meets the support threshold. An additional pass over the data set is therefore needed to determine the support counts of the non-maximal frequent itemsets. In some cases, it is desirable to have a minimal representation of itemsets that preserves the support information. We describe such a representation in the next Section.
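Given the complete collection of frequent itemsets, the maximal ones can be picked out with a direct (if quadratic) check. A small sketch (ours), with a hypothetical frequent collection:

```python
def maximal_frequent(frequent):
    """Maximal frequent itemsets: no other frequent itemset is a proper superset.

    Assumes `frequent` is the complete collection of frequent itemsets, so this
    check is equivalent to requiring every immediate superset to be infrequent.
    """
    frequent = [frozenset(f) for f in frequent]
    return [f for f in frequent
            if not any(f < g for g in frequent)]   # f < g means proper subset

# Hypothetical frequent collection for illustration only.
frequent = [{"a"}, {"c"}, {"d"}, {"e"}, {"a", "d"}, {"c", "e"},
            {"a", "c"}, {"a", "e"}, {"a", "c", "e"}]
print(sorted(map(sorted, maximal_frequent(frequent))))   # [['a', 'c', 'e'], ['a', 'd']]
```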

5.4.2 Closed Itemsets

Closed itemsets provide a minimal representation of all itemsets without losing their support information. A formal definition of a closed itemset is presented below.

Definition 5.4. (Closed Itemset.) An itemset X is closed if none of its immediate supersets has exactly the same support count as X.

Put another way, X is not closed if at least one of its immediate supersets has the same support count as X. Examples of closed itemsets are shown in Figure 5.17. To better illustrate the support count of each itemset, we have associated each node (itemset) in the lattice with a list of its corresponding transaction IDs. For example, since the node {b, c} is associated with transaction IDs 1, 2, and 3, its support count is equal to three. From the transactions given in this diagram, notice that the support for {b} is identical to {b, c}. This is because every transaction that contains b also contains c. Hence, {b} is not a closed itemset. Similarly, since c occurs in every transaction that contains both a and d, the itemset {a, d} is not closed as it has the same support as its superset {a, c, d}. On the other hand, {b, c} is a closed itemset because it does not have the same support count as any of its supersets.

Figure 5.17. An example of the closed frequent itemsets (with minimum support equal to 40%).

An interesting property of closed itemsets is that if we know their support counts, we can derive the support count of every other itemset in the itemset lattice without making additional passes over the data set. For example, consider the 2-itemset {b, e} in Figure 5.17. Since {b, e} is not closed, its support must be equal to the support of one of its immediate supersets, {a, b, e}, {b, c, e}, and {b, d, e}. Further, none of the supersets of {b, e} can have a support greater than the support of {b, e}, due to the anti-monotone nature of the support measure. Hence, the support of {b, e} can be computed by examining the support counts of all of its immediate supersets of size three and taking their maximum value. If an immediate superset is closed (e.g., {b, c, e}), we would know its support count. Otherwise, we can recursively compute its support by examining the supports of its immediate supersets of size four. In general, the support count of any non-closed (k−1)-itemset can be determined as long as we know the support counts of all k-itemsets. Hence, one can devise an iterative algorithm that computes the support counts of itemsets at level k−1 using the support counts of itemsets at level k, starting from the level $k_{max}$, where $k_{max}$ is the size of the largest closed itemset.

Even though closed itemsets provide a compact representation of the support counts of all itemsets, they can still be exponentially large in number. Moreover, for most practical applications, we only need to determine the support count of all frequent itemsets. In this regard, closed frequent itemsets provide a compact representation of the support counts of all frequent itemsets, which can be defined as follows.

Definition 5.5. (Closed Frequent Itemset.) An itemset is a closed frequent itemset if it is closed and its support is greater than or equal to minsup.

In the previous example, assuming that the support threshold is 40%, {b, c} is a closed frequent itemset because its support is 60%. In Figure 5.17, the closed frequent itemsets are indicated by the shaded nodes.

Algorithms are available to explicitly extract closed frequent itemsets from a given data set. Interested readers may refer to the Bibliographic Notes at the

end of this chapter for further discussions of these algorithms. We can use closed frequent itemsets to determine the support counts for all non-closed frequent itemsets. For example, consider the frequent itemset {a, d} shown in Figure 5.17. Because this itemset is not closed, its support count must be equal to the maximum support count of its immediate supersets, {a, b, d}, {a, c, d}, and {a, d, e}. Also, since {a, d} is frequent, we only need to consider the support of its frequent supersets. In general, the support count of every non-closed frequent k-itemset can be obtained by considering the support of all its frequent supersets of size k+1. For example, since the only frequent superset of {a, d} is {a, c, d}, its support is equal to the support of {a, c, d}, which is 2. Using this methodology, an algorithm can be developed to compute the support for every frequent itemset. The pseudocode for this algorithm is shown in Algorithm 5.4. The algorithm proceeds in a specific-to-general fashion, i.e., from the largest to the smallest frequent itemsets. This is because, in order to find the support for a non-closed frequent itemset, the support for all of its supersets must be known. Note that the set of all frequent itemsets can be easily computed by taking the union of all subsets of frequent closed itemsets.

Algorithm 5.4 Support counting using closed frequent itemsets.

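The specific-to-general computation that Algorithm 5.4 describes can be sketched as follows (our code, not the book's pseudocode): the support of a non-closed frequent k-itemset is the maximum support among its frequent supersets of size k+1. The closed itemsets and their counts below are hypothetical.

```python
from itertools import combinations

# Hypothetical closed frequent itemsets with their support counts.
closed = {frozenset("bc"): 3, frozenset("abc"): 2, frozenset("bcd"): 2}

# All frequent itemsets are the non-empty subsets of the closed frequent itemsets.
frequent = {s for c in closed for k in range(1, len(c) + 1)
            for s in map(frozenset, combinations(c, k))}

support = dict(closed)
kmax = max(len(f) for f in frequent)
for k in range(kmax - 1, 0, -1):                       # largest to smallest
    for f in (x for x in frequent if len(x) == k and x not in support):
        # Support of a non-closed itemset = max support of its frequent (k+1)-supersets.
        support[f] = max(support[g] for g in frequent
                         if len(g) == k + 1 and f < g)

print(support[frozenset("b")], support[frozenset("ab")])   # 3 and 2
```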

To illustrate the advantage of using closed frequent itemsets, consider the data set shown in Table 5.5, which contains ten transactions and fifteen items. The items can be divided into three groups: (1) Group A, which contains items $a_1$ through $a_5$; (2) Group B, which contains items $b_1$ through $b_5$; and (3) Group C, which contains items $c_1$ through $c_5$. Assuming that the support threshold is 20%, itemsets involving items from the same group are frequent, but itemsets involving items from different groups are infrequent. The total number of frequent itemsets is thus $3 \times (2^5 - 1) = 93$. However, there are only four closed frequent itemsets in the data: $\{a_3, a_4\}$, $\{a_1, a_2, a_3, a_4, a_5\}$, $\{b_1, b_2, b_3, b_4, b_5\}$, and $\{c_1, c_2, c_3, c_4, c_5\}$. It is often sufficient to present only the closed frequent itemsets to the analysts instead of the entire set of frequent itemsets.

Table 5.5. A transaction data set for mining closed itemsets.

TID  a1 a2 a3 a4 a5  b1 b2 b3 b4 b5  c1 c2 c3 c4 c5
 1    1  1  1  1  1   0  0  0  0  0   0  0  0  0  0
 2    1  1  1  1  1   0  0  0  0  0   0  0  0  0  0
 3    1  1  1  1  1   0  0  0  0  0   0  0  0  0  0
 4    0  0  1  1  0   1  1  1  1  1   0  0  0  0  0
 5    0  0  0  0  0   1  1  1  1  1   0  0  0  0  0
 6    0  0  0  0  0   1  1  1  1  1   0  0  0  0  0
 7    0  0  0  0  0   0  0  0  0  0   1  1  1  1  1
 8    0  0  0  0  0   0  0  0  0  0   1  1  1  1  1
 9    0  0  0  0  0   0  0  0  0  0   1  1  1  1  1
10    0  0  0  0  0   0  0  0  0  0   1  1  1  1  1

Finally, note that all maximal frequent itemsets are closed because none of the maximal frequent itemsets can have the same support count as their immediate supersets. The relationships among frequent, closed, closed frequent, and maximal frequent itemsets are shown in Figure 5.18.

Figure 5.18. Relationships among frequent, closed, closed frequent, and maximal frequent itemsets.

5.5 Alternative Methods for Generating Frequent Itemsets*

Apriori is one of the earliest algorithms to have successfully addressed the combinatorial explosion of frequent itemset generation. It achieves this by applying the Apriori principle to prune the exponential search space. Despite its significant performance improvement, the algorithm still incurs considerable I/O overhead since it requires making several passes over the transaction data set. In addition, as noted in Section 5.2.5, the performance of the Apriori algorithm may degrade significantly for dense data sets because of the increasing width of transactions. Several alternative methods have been developed to overcome these limitations and improve upon the efficiency of the Apriori algorithm. The following is a high-level description of these methods.

Traversal of Itemset Lattice

A search for frequent itemsets can be conceptually viewed as a traversal on the itemset lattice shown in Figure 5.1. The search strategy employed by an algorithm dictates how the lattice structure is traversed during the frequent itemset generation process. Some search strategies are better than others, depending on the configuration of frequent itemsets in the lattice. An overview of these strategies is presented next.

General-to-Specific versus Specific-to-General: The Apriori algorithm uses a general-to-specific search strategy, where pairs of frequent (k−1)-itemsets are merged to obtain candidate k-itemsets. This general-to-specific search strategy is effective, provided the maximum length of a frequent itemset is not too long. The configuration of frequent itemsets that works best with this strategy is shown in Figure 5.19(a), where the darker nodes represent infrequent itemsets. Alternatively, a specific-to-general search strategy looks for more specific frequent itemsets first, before finding the more general frequent itemsets. This strategy is useful to discover maximal frequent itemsets in dense transactions, where the frequent itemset border is located near the bottom of the lattice, as shown in Figure 5.19(b). The Apriori principle can be applied to prune all subsets of maximal frequent itemsets. Specifically, if a candidate k-itemset is maximal frequent, we do not have to examine any of its subsets of size k−1. However, if the candidate k-itemset is infrequent, we need to check all of its k−1 subsets in the next iteration. Another approach is to combine both general-to-specific and specific-to-general search strategies. This bidirectional approach requires more space to store the candidate itemsets, but it can help to rapidly identify the frequent itemset border, given the configuration shown in Figure 5.19(c).

Figure 5.19. General-to-specific, specific-to-general, and bidirectional search.

Equivalence Classes: Another way to envision the traversal is to first partition the lattice into disjoint groups of nodes (or equivalence classes). A frequent itemset generation algorithm searches for frequent itemsets within a particular equivalence class first before moving to another equivalence class. As an example, the level-wise strategy used in the Apriori algorithm can be considered to be partitioning the lattice on the basis of itemset sizes; i.e., the algorithm discovers all frequent 1-itemsets first before proceeding to larger-sized itemsets. Equivalence classes can also be defined according to the prefix or suffix labels of an itemset. In this case, two itemsets belong to the same equivalence class if they share a common prefix or suffix of length k. In the prefix-based approach, the algorithm can search for frequent itemsets starting with the prefix a before looking for those starting with prefixes b, c, and so on. Both prefix-based and suffix-based equivalence classes can be demonstrated using the tree-like structure shown in Figure 5.20.

Figure 5.20. Equivalence classes based on the prefix and suffix labels of itemsets.

Breadth-First versus Depth-First: The Apriori algorithm traverses the lattice in a breadth-first manner, as shown in Figure 5.21(a). It first discovers all the frequent 1-itemsets, followed by the frequent 2-itemsets, and so on, until no new frequent itemsets are generated. The itemset lattice can also be traversed in a depth-first manner, as shown in Figures 5.21(b) and 5.22. The algorithm can start from, say, node a in Figure 5.22, and count its support to determine whether it is frequent. If so, the algorithm progressively expands the next level of nodes, i.e., ab, abc, and so on, until an infrequent node is reached, say, abcd. It then backtracks to another branch, say, abce, and continues the search from there.

Figure 5.21. Breadth-first and depth-first traversals.

Figure 5.22. Generating candidate itemsets using the depth-first approach.

The depth-first approach is often used by algorithms designed to find maximal frequent itemsets. This approach allows the frequent itemset border to be detected more quickly than using a breadth-first approach. Once a maximal frequent itemset is found, substantial pruning can be performed on its subsets. For example, if the node bcde shown in Figure 5.22 is maximal frequent, then the algorithm does not have to visit the subtrees rooted at bd, be, c, d, and e because they will not contain any maximal frequent itemsets. However, if abc is maximal frequent, only the nodes such as ac and bc are not maximal frequent (but the subtrees of ac and bc may still contain maximal frequent itemsets). The depth-first approach also allows a different kind of pruning based on the support of itemsets. For example, suppose the support for {a, b, c} is identical to the support for {a, b}. The subtrees rooted at abd and abe can be skipped because they are guaranteed not to have any maximal frequent itemsets. The proof of this is left as an exercise to the readers.

Representation of Transaction Data Set

There are many ways to represent a transaction data set. The choice of representation can affect the I/O costs incurred when computing the support of candidate itemsets. Figure 5.23 shows two different ways of representing market basket transactions. The representation on the left is called a horizontal data layout, which is adopted by many association rule mining algorithms, including Apriori. Another possibility is to store the list of transaction identifiers (TID-list) associated with each item. Such a representation is known as the vertical data layout. The support for each candidate itemset is obtained by intersecting the TID-lists of its subset items. The length of the TID-lists shrinks as we progress to larger sized itemsets. However, one problem with this approach is that the initial set of TID-lists might be too large to fit into main memory, thus requiring more sophisticated techniques to compress the TID-lists. We describe another effective approach to represent the data in the next Section.

Figure 5.23. Horizontal and vertical data format.

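In the vertical layout, the support of an itemset is simply the size of the intersection of its items' TID-lists. A minimal sketch (ours), with hypothetical TID-lists:

```python
# Hypothetical vertical layout: item -> set of transaction identifiers (TID-list).
tid_lists = {
    "a": {1, 4, 5, 6, 7, 8, 9},
    "b": {1, 2, 5, 7, 8, 10},
    "e": {1, 3, 6, 10},
}

def support_count(itemset, tid_lists):
    """Intersect the TID-lists of the items; the result is the itemset's TID-list."""
    tids = set.intersection(*(tid_lists[i] for i in itemset))
    return len(tids)

print(support_count(["a", "b"], tid_lists))   # 4: transactions 1, 5, 7, and 8
```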

5.6 FP-Growth Algorithm*

This Section presents an alternative algorithm called FP-growth that takes a radically different approach to discovering frequent itemsets. The algorithm does not subscribe to the generate-and-test paradigm of Apriori. Instead, it encodes the data set using a compact data structure called an FP-tree and extracts frequent itemsets directly from this structure. The details of this approach are presented next.

5.6.1 FP-Tree Representation

An FP-tree is a compressed representation of the input data. It is constructed by reading the data set one transaction at a time and mapping each transaction onto a path in the FP-tree. As different transactions can have several items in common, their paths might overlap. The more the paths overlap with one another, the more compression we can achieve using the FP-tree structure. If the size of the FP-tree is small enough to fit into main memory, this will allow us to extract frequent itemsets directly from the structure in memory instead of making repeated passes over the data stored on disk.

Figure 5.24 shows a data set that contains ten transactions and five items. The structures of the FP-tree after reading the first three transactions are also depicted in the diagram. Each node in the tree contains the label of an item along with a counter that shows the number of transactions mapped onto the given path. Initially, the FP-tree contains only the root node represented by the null symbol. The FP-tree is subsequently extended in the following way:

Figure 5.24. Construction of an FP-tree.

1. The data set is scanned once to determine the support count of each item. Infrequent items are discarded, while the frequent items are sorted in decreasing support counts inside every transaction of the data set. For the data set shown in Figure 5.24, a is the most frequent item, followed by b, c, d, and e.

2. The algorithm makes a second pass over the data to construct the FP-tree. After reading the first transaction, {a, b}, the nodes labeled as a and b are created. A path is then formed from null → a → b to encode the transaction. Every node along the path has a frequency count of 1.

3. After reading the second transaction, {b, c, d}, a new set of nodes is created for items b, c, and d. A path is then formed to represent the transaction by connecting the nodes null → b → c → d. Every node along this path also has a frequency count equal to one. Although the first two transactions have an item in common, which is b, their paths are disjoint because the transactions do not share a common prefix.

4. The third transaction, {a, c, d, e}, shares a common prefix item (which is a) with the first transaction. As a result, the path for the third transaction, null → a → c → d → e, overlaps with the path for the first transaction, null → a → b. Because of their overlapping path, the frequency count for node a is incremented to two, while the frequency counts for the newly created nodes, c, d, and e, are equal to one.

5. This process continues until every transaction has been mapped onto one of the paths given in the FP-tree. The resulting FP-tree after reading all the transactions is shown at the bottom of Figure 5.24.

The size of an FP-tree is typically smaller than the size of the uncompressed data because many transactions in market basket data often share a few items in common. In the best-case scenario, where all the transactions have the same set of items, the FP-tree contains only a single branch of nodes. The worst-case scenario happens when every transaction has a unique set of items. As none of the transactions have any items in common, the size of the FP-tree is effectively the same as the size of the original data. However, the physical storage requirement for the FP-tree is higher because it requires additional space to store pointers between nodes and counters for each item.

The size of an FP-tree also depends on how the items are ordered. The notion of ordering items in decreasing order of support counts relies on the possibility that the high support items occur more frequently across all paths and hence must be used as most commonly occurring prefixes. For example, if the ordering scheme in the preceding example is reversed, i.e., from lowest to highest support item, the resulting FP-tree is shown in Figure 5.25. The tree appears to be denser because the branching factor at the root node has increased from 2 to 5 and the number of nodes containing the high support items such as a and b has increased from 3 to 12. Nevertheless, ordering by decreasing support counts does not always lead to the smallest tree, especially when the high support items do not occur frequently together with the other items. For example, suppose we augment the data set given in Figure 5.24 with 100 transactions that contain {e}, 80 transactions that contain {d}, 60 transactions that contain {c}, and 40 transactions that contain {b}. Item e is now most frequent, followed by d, c, b, and a. With the augmented transactions, ordering by decreasing support counts will result in an FP-tree similar to Figure 5.25, while a scheme based on increasing support counts produces a smaller FP-tree similar to Figure 5.24(iv).

Figure 5.25. An FP-tree representation for the data set shown in Figure 5.24 with a different item ordering scheme.

An FP-tree also contains a list of pointers connecting nodes that have the same items. These pointers, represented as dashed lines in Figures 5.24 and 5.25, help to facilitate the rapid access of individual items in the tree. We explain how to use the FP-tree and its corresponding pointers for frequent itemset generation in the next Section.
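A bare-bones FP-tree construction following the two-pass procedure above is sketched below (ours; the first three transactions follow the text's example, the rest are made up, and the node_links dictionary is only a simplified stand-in for the node pointers used later by FP-growth).

```python
from collections import Counter, defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count, self.children = item, parent, 1, {}

def build_fp_tree(transactions, minsup_count):
    # Pass 1: item supports; discard infrequent items.
    freq = Counter(i for t in transactions for i in t)
    freq = {i: n for i, n in freq.items() if n >= minsup_count}
    root = Node(None, None)
    node_links = defaultdict(list)        # item -> its nodes (simplified pointer list)
    # Pass 2: insert each transaction with items sorted by decreasing support count.
    for t in transactions:
        path = sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i))
        node = root
        for item in path:
            if item in node.children:
                node.children[item].count += 1   # shared prefix: just bump the counter
            else:
                node.children[item] = Node(item, node)
                node_links[item].append(node.children[item])
            node = node.children[item]
    return root, node_links

transactions = [{"a", "b"}, {"b", "c", "d"}, {"a", "c", "d", "e"},
                {"a", "d", "e"}, {"a", "b", "c"}]
root, links = build_fp_tree(transactions, minsup_count=1)
print([(n.item, n.count) for n in links["a"]])   # a single node for 'a' with count 4
```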

5.6.2 Frequent Itemset Generation in FP-Growth Algorithm

FP-growth is an algorithm that generates frequent itemsets from an FP-tree by exploring the tree in a bottom-up fashion. Given the example tree shown in Figure 5.24, the algorithm looks for frequent itemsets ending in e first, followed by d, c, b, and finally, a. This bottom-up strategy for finding frequent itemsets ending with a particular item is equivalent to the suffix-based approach described in Section 5.5. Since every transaction is mapped onto a path in the FP-tree, we can derive the frequent itemsets ending with a particular item, say, e, by examining only the paths containing node e. These paths can be accessed rapidly using the pointers associated with node e. The extracted paths are shown in Figure 5.26(a). Similar paths for itemsets ending in d, c, b, and a are shown in Figures 5.26(b), (c), (d), and (e), respectively.

Figure 5.26. Decomposing the frequent itemset generation problem into multiple subproblems, where each subproblem involves finding frequent itemsets ending in e, d, c, b, and a.

FP-growth finds all the frequent itemsets ending with a particular suffix by employing a divide-and-conquer strategy to split the problem into smaller subproblems. For example, suppose we are interested in finding all frequent itemsets ending in e. To do this, we must first check whether the itemset {e} itself is frequent. If it is frequent, we consider the subproblem of finding frequent itemsets ending in de, followed by ce, be, and ae. In turn, each of these subproblems are further decomposed into smaller subproblems. By merging the solutions obtained from the subproblems, all the frequent itemsets ending in e can be found. Finally, the set of all frequent itemsets can be generated by merging the solutions to the subproblems of finding frequent itemsets ending in e, d, c, b, and a. This divide-and-conquer approach is the key strategy employed by the FP-growth algorithm.

For a more concrete example on how to solve the subproblems, consider the task of finding frequent itemsets ending with e.

1. The first step is to gather all the paths containing node e. These initial paths are called prefix paths and are shown in Figure 5.27(a).

Figure 5.27. Example of applying the FP-growth algorithm to find frequent itemsets ending in e.

2. From the prefix paths shown in Figure 5.27(a), the support count for e is obtained by adding the support counts associated with node e. Assuming that the minimum support count is 2, {e} is declared a frequent itemset because its support count is 3.

3. Because {e} is frequent, the algorithm has to solve the subproblems of finding frequent itemsets ending in de, ce, be, and ae. Before solving these subproblems, it must first convert the prefix paths into a conditional FP-tree, which is structurally similar to an FP-tree, except it is used to find frequent itemsets ending with a particular suffix. A conditional FP-tree is obtained in the following way:

a. First, the support counts along the prefix paths must be updated because some of the counts include transactions that do not contain item e. For example, the rightmost path shown in Figure 5.27(a) includes a transaction {b, c} that does not contain item e. The counts along the prefix path must therefore be adjusted to 1 to reflect the actual number of transactions containing {b, c, e}.

b. The prefix paths are truncated by removing the nodes for e. These nodes can be removed because the support counts along the prefix paths have been updated to reflect only transactions that contain e, and the subproblems of finding frequent itemsets ending in de, ce, be, and ae no longer need information about node e.

c. After updating the support counts along the prefix paths, some of the items may no longer be frequent. For example, the node b appears only once and has a support count equal to 1, which means that there is only one transaction that contains both b and e. Item b can be safely ignored from subsequent analysis because all itemsets ending in be must be infrequent.

The conditional FP-tree for e is shown in Figure 5.27(b). The tree looks different than the original prefix paths because the frequency counts have been updated and the nodes b and e have been eliminated.

4. FP-growth uses the conditional FP-tree for e to solve the subproblems of finding frequent itemsets ending in de, ce, and ae. To find the frequent itemsets ending in de, the prefix paths for d are gathered from the conditional FP-tree for e (Figure 5.27(c)). By adding the frequency counts associated with node d, we obtain the support count for {d, e}. Since the support count is equal to 2, {d, e} is declared a frequent itemset. Next, the algorithm constructs the conditional FP-tree for de using the approach described in step 3. After updating the support counts and removing the infrequent item c, the conditional FP-tree for de is shown in Figure 5.27(d). Since the conditional FP-tree contains only one item, a, whose support is equal to minsup, the algorithm extracts the frequent itemset {a, d, e} and moves on to the next subproblem, which is to generate frequent itemsets ending in ce. After processing the prefix paths for c, {c, e} is found to be frequent. However, the conditional FP-tree for c will have no frequent items and thus will be eliminated. The algorithm proceeds to solve the next subproblem and finds {a, e} to be the only frequent itemset remaining.

This example illustrates the divide-and-conquer approach used in the FP-growth algorithm. At each recursive step, a conditional FP-tree is constructed by updating the frequency counts along the prefix paths and removing all infrequent items. Because the subproblems are disjoint, FP-growth will not generate any duplicate itemsets. In addition, the counts associated with the nodes allow the algorithm to perform support counting while generating the common suffix itemsets.

FP-growth is an interesting algorithm because it illustrates how a compact representation of the transaction data set helps to efficiently generate frequent itemsets. In addition, for certain transaction data sets, FP-growth outperforms the standard Apriori algorithm by several orders of magnitude. The run-time performance of FP-growth depends on the compaction factor of the data set. If the resulting conditional FP-trees are very bushy (in the worst case, a full prefix tree), then the performance of the algorithm degrades significantly because it has to generate a large number of subproblems and merge the results returned by each subproblem.

5.7 Evaluation of Association Patterns

Although the Apriori principle significantly reduces the exponential search space of candidate itemsets, association analysis algorithms still have the potential to generate a large number of patterns. For example, although the data set shown in Table 5.1 contains only six items, it can produce hundreds of association rules at particular support and confidence thresholds. As the size and dimensionality of real commercial databases can be very large, we can easily end up with thousands or even millions of patterns, many of which might not be interesting. Identifying the most interesting patterns from the multitude of all possible ones is not a trivial task because "one person's trash might be another person's treasure." It is therefore important to establish a set of well-accepted criteria for evaluating the quality of association patterns.

The first set of criteria can be established through a data-driven approach to define objective interestingness measures. These measures can be used to rank patterns (itemsets or rules) and thus provide a straightforward way of dealing with the enormous number of patterns that are found in a data set. Some of the measures can also provide statistical information, e.g., itemsets that involve a set of unrelated items or cover very few transactions are considered uninteresting because they may capture spurious relationships in the data and should be eliminated. Examples of objective interestingness measures include support, confidence, and correlation.

The second set of criteria can be established through subjective arguments. A pattern is considered subjectively uninteresting unless it reveals unexpected information about the data or provides useful knowledge that can lead to profitable actions. For example, the rule $\{Butter\} \to \{Bread\}$ may not be interesting, despite having high support and confidence values, because the relationship represented by the rule might seem rather obvious. On the other hand, the rule $\{Diapers\} \to \{Beer\}$ is interesting because the relationship is quite unexpected and may suggest a new cross-selling opportunity for retailers. Incorporating subjective knowledge into pattern evaluation is a difficult task because it requires a considerable amount of prior information from domain experts. Readers interested in subjective interestingness measures may refer to resources listed in the bibliography at the end of this chapter.

5.7.1 Objective Measures of Interestingness

An objective measure is a data-driven approach for evaluating the quality of association patterns. It is domain-independent and requires only that the user specifies a threshold for filtering low-quality patterns. An objective measure is usually computed based on the frequency counts tabulated in a contingency table. Table 5.6 shows an example of a contingency table for a pair of binary variables, A and B. We use the notation $\overline{A}$ ($\overline{B}$) to indicate that A (B) is absent from a transaction. Each entry $f_{ij}$ in this $2 \times 2$ table denotes a frequency count. For example, $f_{11}$ is the number of times A and B appear together in the same transaction, while $f_{01}$ is the number of transactions that contain B but not A. The row sum $f_{1+}$ represents the support count for A, while the column sum $f_{+1}$ represents the support count for B. Finally, even though our discussion focuses mainly on asymmetric binary variables, note that contingency tables are also applicable to other attribute types such as symmetric binary, nominal, and ordinal variables.

Table 5.6. A 2-way contingency table for variables A and B.

                     B           $\overline{B}$
A                 $f_{11}$       $f_{10}$          $f_{1+}$
$\overline{A}$    $f_{01}$       $f_{00}$          $f_{0+}$
                  $f_{+1}$       $f_{+0}$          N

Limitations of the Support-Confidence Framework

The classical association rule mining formulation relies on the support and confidence measures to eliminate uninteresting patterns. The drawback of support, which is described more fully in Section 5.8, is that many potentially interesting patterns involving low support items might be eliminated by the support threshold. The drawback of confidence is more subtle and is best demonstrated with the following example.

Example 5.3. Suppose we are interested in analyzing the relationship between people who drink tea and coffee. We may gather information about the beverage preferences among a group of people and summarize their responses into a contingency table such as the one shown in Table 5.7.

Table 5.7. Beverage preferences among a group of 1000 people.

                   Coffee    $\overline{Coffee}$
Tea                  150            50             200
$\overline{Tea}$     650           150             800
                     800           200            1000

Theinformationgiveninthistablecanbeusedtoevaluatetheassociationrule .Atfirstglance,itmayappearthatpeoplewhodrinkteaalsotendtodrinkcoffeebecausetherule'ssupport(15%)andconfidence(75%)valuesarereasonablyhigh.Thisargumentwouldhavebeenacceptableexceptthatthefractionofpeoplewhodrinkcoffee,regardlessofwhethertheydrinktea,is80%,whilethefractionofteadrinkerswhodrinkcoffeeisonly75%.Thusknowingthatapersonisateadrinkeractuallydecreasesherprobabilityofbeingacoffeedrinkerfrom80%to75%!Therule isthereforemisleadingdespiteitshighconfidencevalue.

Now consider a similar problem where we are interested in analyzing the relationship between people who drink tea and people who use honey in their beverage. Table 5.8 summarizes the information gathered over the same group of people about their preferences for drinking tea and using honey. If we evaluate the association rule {Tea} → {Honey} using this information, we will find that the confidence value of this rule is merely 50%, which might be easily rejected using a reasonable threshold on the confidence value, say 70%. One thus might consider that the preference of a person for drinking tea has no influence on her preference for using honey. However, the fraction of people who use honey, regardless of whether they drink tea, is only 12%. Hence, knowing that a person drinks tea significantly increases her probability of using honey from 12% to 50%. Further, the fraction of people who do not drink tea but use honey is only 2.5%! This suggests that there is definitely some information in the preference of a person for using honey given that she drinks tea. The rule {Tea} → {Honey} may therefore be falsely rejected if confidence is used as the evaluation measure.

Table 5.8. Information about people who drink tea and people who use honey in their beverage.

              Honey    Honey¯
   Tea          100       100     200
   Tea¯          20       780     800
                120       880    1000

Note that if we take the support of coffee drinkers into account, we would not be surprised to find that many of the people who drink tea also drink coffee, since the overall number of coffee drinkers is quite large by itself. What is more surprising is that the fraction of tea drinkers who drink coffee is actually less than the overall fraction of people who drink coffee, which points to an inverse relationship between tea drinkers and coffee drinkers. Similarly, if we account for the fact that the support of using honey is inherently small, it is easy to understand that the fraction of tea drinkers who use honey will naturally be small. Instead, what is important to measure is the change in the fraction of honey users, given the information that they drink tea.
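These comparisons are easy to verify directly from the contingency counts. The following Python sketch is ours rather than part of the original text; the function name rule_stats and the variable names are illustrative. It computes the support and confidence of a rule A → B, together with the baseline support of the consequent s(B), for the counts in Tables 5.7 and 5.8.

def rule_stats(f11, f10, f01, f00):
    """Support and confidence of A -> B, plus the baseline support s(B),
    computed from the four counts of a 2x2 contingency table."""
    n = f11 + f10 + f01 + f00
    support = f11 / n                  # s(A, B)
    confidence = f11 / (f11 + f10)     # s(A, B) / s(A)
    baseline = (f11 + f01) / n         # s(B), support of the consequent
    return support, confidence, baseline

# Table 5.7, {Tea} -> {Coffee}: (0.15, 0.75, 0.80), confidence below s(Coffee)
print(rule_stats(150, 50, 650, 150))

# Table 5.8, {Tea} -> {Honey}: (0.10, 0.50, 0.12), confidence far above s(Honey)
print(rule_stats(100, 100, 20, 780))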

The limitations of the confidence measure are well-known and can be understood from a statistical perspective as follows. The support of a variable measures the probability of its occurrence, while the support s(A, B) of a pair of variables A and B measures the probability of the two variables occurring together. Hence, the joint probability P(A, B) can be written as

P(A, B) = s(A, B) = f11/N.

If we assume A and B are statistically independent, i.e., there is no relationship between the occurrences of A and B, then P(A, B) = P(A) × P(B). Hence, under the assumption of statistical independence between A and B, the support sindep(A, B) of A and B can be written as

sindep(A, B) = s(A) × s(B), or equivalently, sindep(A, B) = (f1+/N) × (f+1/N).    (5.4)

If the support between two variables, s(A, B), is equal to sindep(A, B), then A and B can be considered to be unrelated to each other. However, if s(A, B) is widely different from sindep(A, B), then A and B are most likely dependent. Hence, any deviation of s(A, B) from s(A) × s(B) can be seen as an indication of a statistical relationship between A and B. Since the confidence measure only considers the deviance of s(A, B) from s(A) and not from sindep(A, B), it fails to account for the support of the consequent, namely s(B). This results in the detection of spurious patterns (e.g., {Tea} → {Coffee}) and the rejection of truly interesting patterns (e.g., {Tea} → {Honey}), as illustrated in the previous example.

Various objective measures have been used to capture the deviance of s(A, B) from sindep(A, B) that are not susceptible to the limitations of the confidence measure. Below, we provide a brief description of some of these measures and discuss some of their properties.

Interest Factor  The interest factor, which is also called the "lift," can be defined as follows:

I(A, B) = s(A, B) / (s(A) × s(B)) = N f11 / (f1+ f+1).    (5.5)

Notice that s(A) × s(B) = sindep(A, B). Hence, the interest factor measures the ratio of the support of a pattern s(A, B) against its baseline support sindep(A, B) computed under the statistical independence assumption. Using Equations 5.5 and 5.4, we can interpret the measure as follows:

I(A, B)  = 1, if A and B are independent;
         > 1, if A and B are positively related;
         < 1, if A and B are negatively related.    (5.6)

For the tea-coffee example shown in Table 5.7, I = 0.15/(0.2 × 0.8) = 0.9375, thus suggesting a slight negative relationship between tea drinkers and coffee drinkers. Also, for the tea-honey example shown in Table 5.8, I = 0.1/(0.12 × 0.2) = 4.1667, suggesting a strong positive relationship between people who drink tea and people who use honey in their beverage. We can thus see that the interest factor is able to detect meaningful patterns in the tea-coffee and tea-honey examples. Indeed, the interest factor has a number of statistical advantages over the confidence measure that make it a suitable measure for analyzing statistical independence between variables.
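As a quick check, the following sketch (again ours, for illustration only) evaluates Equation 5.5 directly on the counts of Tables 5.7 and 5.8 and reproduces the two interest factor values quoted above.

def interest_factor(f11, f10, f01, f00):
    """Interest factor (lift) of Equation 5.5: I(A,B) = N*f11 / (f1+ * f+1)."""
    n = f11 + f10 + f01 + f00
    return n * f11 / ((f11 + f10) * (f11 + f01))

print(interest_factor(150, 50, 650, 150))   # 0.9375  (tea-coffee, Table 5.7)
print(interest_factor(100, 100, 20, 780))   # ~4.1667 (tea-honey, Table 5.8)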

Piatetsky-Shapiro (PS) Measure  Instead of computing the ratio between s(A, B) and s(A) × s(B), the PS measure considers the difference between s(A, B) and s(A) × s(B), as follows:

PS = s(A, B) − s(A) × s(B) = f11/N − (f1+ f+1)/N².    (5.7)

The PS value is 0 when A and B are mutually independent of each other. Otherwise, PS > 0 when there is a positive relationship between the two variables, and PS < 0 when there is a negative relationship.

Correlation Analysis  Correlation analysis is one of the most popular techniques for analyzing relationships between a pair of variables. For continuous variables, correlation is defined using Pearson's correlation coefficient (see Equation 2.10 on page 83). For binary variables, correlation can be measured using the ϕ-coefficient, which is defined as

ϕ = (f11 f00 − f01 f10) / √(f1+ f+1 f0+ f+0).    (5.8)

If we rearrange the terms in Equation 5.8, we can show that the ϕ-coefficient can be rewritten in terms of the support measures of A, B, and {A, B} as follows:

ϕ = (s(A, B) − s(A) × s(B)) / √(s(A) × (1 − s(A)) × s(B) × (1 − s(B))).    (5.9)

Note that the numerator in the above equation is identical to the PS measure. Hence, the ϕ-coefficient can be understood as a normalized version of the PS measure, where the value of the ϕ-coefficient ranges from −1 to +1. From a statistical viewpoint, the correlation captures the normalized difference between s(A, B) and sindep(A, B). A correlation value of 0 means no relationship, while a value of +1 suggests a perfect positive relationship and a value of −1 suggests a perfect negative relationship. The correlation measure has a statistical meaning and hence is widely used to evaluate the strength of statistical independence among variables. For instance, the correlation between tea and coffee drinkers in Table 5.7 is −0.0625, which is slightly less than 0. On the other hand, the correlation between people who drink tea and people who use honey in Table 5.8 is 0.5847, suggesting a positive relationship.

IS Measure  IS is an alternative measure for capturing the relationship between s(A, B) and s(A) × s(B). The IS measure is defined as follows:

IS(A, B) = √(I(A, B) × s(A, B)) = s(A, B) / √(s(A) s(B)) = f11 / √(f1+ f+1).    (5.10)

Although the definition of IS looks quite similar to the interest factor, there are some interesting differences between them. Since IS is the geometric mean between the interest factor and the support of a pattern, IS is large when both the interest factor and support are large. Hence, if the interest factors of two patterns are identical, IS prefers the pattern with the higher support.

It is also possible to show that IS is mathematically equivalent to the cosine measure for binary variables (see Equation 2.6 on page 81). The value of IS thus varies from 0 to 1, where an IS value of 0 corresponds to no co-occurrence of the two variables, while an IS value of 1 denotes a perfect relationship, since they occur in exactly the same transactions. For the tea-coffee example shown in Table 5.7, the value of IS is equal to 0.375, while the value of IS for the tea-honey example in Table 5.8 is 0.6455. The IS measure thus gives a higher value for the {Tea} → {Honey} rule than the {Tea} → {Coffee} rule, which is consistent with our understanding of the two rules.

Alternative Objective Interestingness Measures  Note that all of the measures defined in the previous section use different techniques to capture the deviance between s(A, B) and sindep(A, B) = s(A) × s(B). Some measures use the ratio between s(A, B) and sindep(A, B), e.g., the interest factor and IS, while some other measures consider the difference between the two, e.g., the PS and the ϕ-coefficient. Some measures are bounded in a particular range, e.g., the IS and the ϕ-coefficient, while others are unbounded and do not have a defined maximum or minimum value, e.g., the interest factor. Because of such differences, these measures behave differently when applied to different types of patterns. Indeed, the measures defined above are not exhaustive and there exist many alternative measures for capturing different properties of relationships between pairs of binary variables. Table 5.9 provides the definitions for some of these measures in terms of the frequency counts of a 2×2 contingency table.

Table 5.9. Examples of objective measures for the itemset {A, B}.

   Measure (Symbol)          Definition
   Correlation (ϕ)           (N f11 − f1+ f+1) / √(f1+ f+1 f0+ f+0)
   Odds ratio (α)            (f11 f00) / (f10 f01)
   Kappa (κ)                 (N f11 + N f00 − f1+ f+1 − f0+ f+0) / (N² − f1+ f+1 − f0+ f+0)
   Interest (I)              (N f11) / (f1+ f+1)
   Cosine (IS)               f11 / √(f1+ f+1)
   Piatetsky-Shapiro (PS)    f11/N − (f1+ f+1)/N²
   Collective strength (S)   (f11 + f00)/(f1+ f+1 + f0+ f+0) × (N − f1+ f+1 − f0+ f+0)/(N − f11 − f00)
   Jaccard (ζ)               f11 / (f1+ + f+1 − f11)
   All-confidence (h)        min[f11/f1+, f11/f+1]
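Most of the measures in Table 5.9 can be computed with a few lines of code. The sketch below is illustrative rather than definitive (the function name measures and the dictionary keys are ours, and collective strength is omitted for brevity); for the tea-coffee table it reproduces the values discussed earlier in this section (ϕ ≈ −0.0625, I = 0.9375, IS = 0.375).

from math import sqrt

def measures(f11, f10, f01, f00):
    """A selection of the objective measures of Table 5.9 for a 2x2 table."""
    n = f11 + f10 + f01 + f00
    f1p, fp1 = f11 + f10, f11 + f01          # row sum f1+ and column sum f+1
    f0p, fp0 = f01 + f00, f10 + f00          # f0+ and f+0
    return {
        "phi":   (n * f11 - f1p * fp1) / sqrt(f1p * fp1 * f0p * fp0),
        "alpha": (f11 * f00) / (f10 * f01),                      # odds ratio
        "kappa": (n * f11 + n * f00 - f1p * fp1 - f0p * fp0)
                 / (n * n - f1p * fp1 - f0p * fp0),
        "I":     n * f11 / (f1p * fp1),                          # interest factor
        "IS":    f11 / sqrt(f1p * fp1),                          # cosine
        "PS":    f11 / n - f1p * fp1 / (n * n),
        "zeta":  f11 / (f1p + fp1 - f11),                        # Jaccard
        "h":     min(f11 / f1p, f11 / fp1),                      # all-confidence
    }

print(measures(150, 50, 650, 150))   # tea-coffee counts from Table 5.7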

Consistency among Objective Measures  Given the wide variety of measures available, it is reasonable to question whether the measures can produce similar ordering results when applied to a set of association patterns. If the measures are consistent, then we can choose any one of them as our evaluation metric. Otherwise, it is important to understand what their differences are in order to determine which measure is more suitable for analyzing certain types of patterns.

Suppose the measures defined in Table 5.9 are applied to rank the ten contingency tables shown in Table 5.10. These contingency tables are chosen to illustrate the differences among the existing measures. The ordering produced by these measures is shown in Table 5.11 (with 1 as the most interesting and 10 as the least interesting table). Although some of the measures appear to be consistent with each other, others produce quite different ordering results. For example, the rankings given by the ϕ-coefficient agree mostly with those provided by κ and collective strength, but are quite different than the rankings produced by the interest factor. Furthermore, a contingency table such as E10 is ranked lowest according to the ϕ-coefficient, but highest according to the interest factor.

Table 5.10. Example of contingency tables.

   Example    f11    f10    f01    f00
   E1        8123     83    424   1370
   E2        8330      2    622   1046
   E3        3954   3080      5   2961
   E4        2886   1363   1320   4431
   E5        1500   2000    500   6000
   E6        4000   2000   1000   3000
   E7        9481    298    127     94
   E8        4000   2000   2000   2000
   E9        7450   2483      4     63
   E10         61   2483      4   7452

Table 5.11. Rankings of contingency tables using the measures given in Table 5.9.

           ϕ    α    κ    I   IS   PS    S    ζ    h
   E1      1    3    1    6    2    2    1    2    2
   E2      2    1    2    7    3    5    2    3    3
   E3      3    2    4    4    5    1    3    6    8
   E4      4    8    3    3    7    3    4    7    5
   E5      5    7    6    2    9    6    6    9    9
   E6      6    9    5    5    6    4    5    5    7
   E7      7    6    7    9    1    8    7    1    1
   E8      8   10    8    8    8    7    8    8    7
   E9      9    4    9   10    4    9    9    4    4
   E10    10    5   10    1   10   10   10   10   10
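The disagreement between the rankings in Table 5.11 can be reproduced by scoring the ten tables with different measures. The short sketch below (ours, and limited to the ϕ-coefficient and the interest factor) sorts the tables of Table 5.10 under each of the two measures and prints the two orderings, which differ noticeably; in particular, E10 moves from the bottom of the ϕ ordering to the top of the interest factor ordering.

from math import sqrt

def phi(f11, f10, f01, f00):
    n = f11 + f10 + f01 + f00
    return (n * f11 - (f11 + f10) * (f11 + f01)) / sqrt(
        (f11 + f10) * (f11 + f01) * (f01 + f00) * (f10 + f00))

def interest(f11, f10, f01, f00):
    n = f11 + f10 + f01 + f00
    return n * f11 / ((f11 + f10) * (f11 + f01))

# Counts (f11, f10, f01, f00) taken from Table 5.10.
tables = {
    "E1": (8123, 83, 424, 1370),   "E2": (8330, 2, 622, 1046),
    "E3": (3954, 3080, 5, 2961),   "E4": (2886, 1363, 1320, 4431),
    "E5": (1500, 2000, 500, 6000), "E6": (4000, 2000, 1000, 3000),
    "E7": (9481, 298, 127, 94),    "E8": (4000, 2000, 2000, 2000),
    "E9": (7450, 2483, 4, 63),     "E10": (61, 2483, 4, 7452),
}
for name, score in (("phi", phi), ("interest", interest)):
    ranking = sorted(tables, key=lambda e: score(*tables[e]), reverse=True)
    print(name, ranking)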

Properties of Objective Measures  The results shown in Table 5.11 suggest that the measures greatly differ from each other and can provide conflicting information about the quality of a pattern. In fact, no measure is universally best for all applications. In the following, we describe some properties of the measures that play an important role in determining if they are suited for a certain application.

Inversion Property

Consider the binary vectors shown in Figure 5.28. The 0/1 value in each column vector indicates whether a transaction (row) contains a particular item (column). For example, the vector A indicates that the item appears in the first and last transactions, whereas the vector B indicates that the item is contained only in the fifth transaction. The vectors A¯ and B¯ are the inverted versions of A and B, i.e., their 1 values have been changed to 0 values (absence to presence) and vice versa. Applying this transformation to a binary vector is called inversion. If a measure is invariant under the inversion operation, then its value for the vector pair {A¯, B¯} should be identical to its value for {A, B}. The inversion property of a measure can be tested as follows.

Figure 5.28. Effect of the inversion operation. The vectors A¯ and B¯ are inversions of vectors A and B, respectively.

Definition 5.6. (Inversion Property.) An objective measure M is invariant under the inversion operation if its value remains the same when exchanging the frequency counts f11 with f00 and f10 with f01.

Measures that are invariant to the inversion property include the correlation (ϕ-coefficient), odds ratio, κ, and collective strength. These measures are especially useful in scenarios where the presence (1's) of a variable is as important as its absence (0's). For example, if we compare two sets of answers to a series of true/false questions where 0's (true) and 1's (false) are equally meaningful, we should use a measure that gives equal importance to occurrences of 0–0's and 1–1's. For the vectors shown in Figure 5.28, the ϕ-coefficient is equal to −0.1667 regardless of whether we consider the pair {A, B} or the pair {A¯, B¯}. Similarly, the odds ratio for both pairs of vectors is equal to a constant value of 0. Note that even though the ϕ-coefficient and the odds ratio are invariant to inversion, they can still show different results, as will be shown later.

Measures that do not remain invariant under the inversion operation include the interest factor and the IS measure. For example, the IS value for the pair {A¯, B¯} in Figure 5.28 is 0.825, which reflects the fact that the 1's in A¯ and B¯ occur frequently together. However, the IS value of its inverted pair {A, B} is equal to 0, since A and B do not have any co-occurrence of 1's. For asymmetric binary variables, e.g., the occurrence of words in documents, this is indeed the desired behavior. A desired similarity measure between asymmetric variables should not be invariant to inversion, since for these variables, it is more meaningful to capture relationships based on the presence of a variable rather than its absence. On the other hand, if we are dealing with symmetric binary variables where the relationships between 0's and 1's are equally meaningful, care should be taken to ensure that the chosen measure is invariant to inversion.

Although the values of the interest factor and IS change with the inversion operation, they can still be inconsistent with each other. To illustrate this, consider Table 5.12, which shows the contingency tables for two pairs of variables, {p, q} and {r, s}. Note that r and s are inverted transformations of p and q, respectively, where the roles of 0's and 1's have just been reversed. The interest factor for {p, q} is 1.02 and for {r, s} is 4.08, which means that the interest factor finds the inverted pair {r, s} more related than the original pair {p, q}. On the contrary, the IS value decreases upon inversion from 0.9346 for {p, q} to 0.286 for {r, s}, suggesting quite an opposite trend to that of the interest factor. Even though these measures conflict with each other for this example, they may be the right choice of measure in different applications.

Table 5.12. Contingency tables for the pairs {p, q} and {r, s}.

               p        p¯
   q          880       50     930
   q¯          50       20      70
              930       70    1000

               r        r¯
   s           20       50      70
   s¯          50      880     930
               70      930    1000
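A small computation makes this asymmetry concrete. The sketch below (ours; it simply restates the interest factor and IS formulas from Equations 5.5 and 5.10) evaluates both measures on the two tables of Table 5.12 and shows that inversion pushes them in opposite directions.

from math import sqrt

def interest(f11, f10, f01, f00):
    n = f11 + f10 + f01 + f00
    return n * f11 / ((f11 + f10) * (f11 + f01))

def cosine_is(f11, f10, f01, f00):
    return f11 / sqrt((f11 + f10) * (f11 + f01))

pq = (880, 50, 50, 20)    # Table 5.12, pair {p, q}
rs = (20, 50, 50, 880)    # pair {r, s}: the same table with 0's and 1's inverted

print(interest(*pq), interest(*rs))      # ~1.02 vs ~4.08: larger after inversion
print(cosine_is(*pq), cosine_is(*rs))    # IS decreases after inversion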

Scaling Property

Table 5.13 shows two contingency tables for gender and the grades achieved by students enrolled in a particular course. These tables can be used to study the relationship between gender and performance in the course. The second contingency table has data from the same population but has twice as many males and three times as many females. The actual number of males or females can depend upon the samples available for study, but the relationship between gender and grade should not change just because of differences in sample sizes. Similarly, if the number of students with high and low grades is changed in a new study, a measure of association between gender and grades should remain unchanged. Hence, we need a measure that is invariant to scaling of rows or columns. The process of multiplying a row or column of a contingency table by a constant value is called a row or column scaling operation. A measure that is invariant to scaling does not change its value after any row or column scaling operation.

Table 5.13. The grade-gender example. (a) Sample data of size 100.

             Male   Female
   High        30       20     50
   Low         40       10     50
               70       30    100

(b) Sample data of size 230.

             Male   Female
   High        60       60    120
   Low         80       30    110
              140       90    230

Definition 5.7. (Scaling Invariance Property.) Let T be a contingency table with frequency counts [f11; f10; f01; f00]. Let T′ be the transformed contingency table with scaled frequency counts [k1k3 f11; k2k3 f10; k1k4 f01; k2k4 f00], where k1, k2, k3, k4 are positive constants used to scale the two rows and the two columns of T. An objective measure M is invariant under the row/column scaling operation if M(T) = M(T′).

Note that the use of the term 'scaling' here should not be confused with the scaling operation for continuous variables introduced in Chapter 2 on page 23, where all the values of a variable were being multiplied by a constant factor, instead of scaling a row or column of a contingency table.

Scaling of rows and columns in contingency tables occurs in multiple ways in different applications. For example, if we are measuring the effect of a particular medical procedure on two sets of subjects, healthy and diseased, the ratio of healthy and diseased subjects can vary widely across different studies involving different groups of participants. Further, the fraction of healthy and diseased subjects chosen for a controlled study can be quite different from the true fraction observed in the complete population. These differences can result in a row or column scaling in the contingency tables for different populations of subjects. In general, the frequencies of items in a contingency table closely depend on the sample of transactions used to generate the table. Any change in the sampling procedure may result in a row or column scaling transformation. A measure that is expected to be invariant to differences in the sampling procedure must not change with row or column scaling.

Of all the measures introduced in Table 5.9, only the odds ratio (α) is invariant to row and column scaling operations. For example, the value of the odds ratio for both the tables in Table 5.13 is equal to 0.375. All other measures, such as the ϕ-coefficient, κ, IS, interest factor, and collective strength (S), change their values when the rows and columns of the contingency table are rescaled. Indeed, the odds ratio is a preferred choice of measure in the medical domain, where it is important to find relationships that do not change with differences in the population sample chosen for a study.
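The invariance of the odds ratio (and the lack of it for the other measures) can be checked numerically. The sketch below (illustrative only) applies the odds ratio and the interest factor to the two grade-gender tables of Table 5.13; the odds ratio stays at 0.375 while the interest factor changes.

def odds_ratio(f11, f10, f01, f00):
    return (f11 * f00) / (f10 * f01)

def interest(f11, f10, f01, f00):
    n = f11 + f10 + f01 + f00
    return n * f11 / ((f11 + f10) * (f11 + f01))

# Table 5.13 with f11 = (High, Male), f10 = (High, Female), f01 = (Low, Male), ...
sample_a = (30, 20, 40, 10)    # sample of size 100
sample_b = (60, 60, 80, 30)    # male column scaled by 2, female column by 3

print(odds_ratio(*sample_a), odds_ratio(*sample_b))   # 0.375 and 0.375
print(interest(*sample_a), interest(*sample_b))       # values differ under scaling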

Null Addition Property

Suppose we are interested in analyzing the relationship between a pair of words in a set of documents. If a collection of articles about ice fishing is added to the data set, should the association between the two words be affected? This process of adding unrelated data (in this case, documents) to a given data set is known as the null addition operation.

Definition 5.8. (Null Addition Property.) An objective measure M is invariant under the null addition operation if it is not affected by increasing f00, while all other frequencies in the contingency table stay the same.

For applications such as document analysis or market basket analysis, we would like to use a measure that remains invariant under the null addition operation. Otherwise, the relationship between words can be made to change simply by adding enough documents that do not contain both words! Examples of measures that satisfy this property include the cosine (IS) and Jaccard (ζ) measures, while those that violate this property include the interest factor, PS, odds ratio, and the ϕ-coefficient.

To demonstrate the effect of null addition, consider the two contingency tables T1 and T2 shown in Table 5.14. Table T2 has been obtained from T1 by adding 1000 extra transactions with both A and B absent. This operation only affects the f00 entry of Table T2, which has increased from 100 to 1100, whereas all the other frequencies in the table (f11, f10, and f01) remain the same. Since IS is invariant to null addition, it gives a constant value of 0.875 to both the tables. However, the addition of 1000 extra transactions with occurrences of 0–0's changes the value of the interest factor from 0.972 for T1 (denoting a slightly negative correlation) to 1.944 for T2 (positive correlation). Similarly, the value of the odds ratio increases from 7 for T1 to 77 for T2. Hence, when the interest factor or odds ratio is used as the association measure, the relationship between variables changes with the addition of null transactions where both the variables are absent. In contrast, the IS measure is invariant to null addition, since it considers two variables to be related only if they frequently occur together. Indeed, the IS measure (cosine measure) is widely used to measure similarity among documents, which is expected to depend only on the joint occurrences (1's) of words in documents, but not their absences (0's).

Table 5.14. An example demonstrating the effect of null addition. (a) Table T1.

               B        B¯
   A          700      100     800
   A¯         100      100     200
              800      200    1000

(b) Table T2.

               B        B¯
   A          700      100     800
   A¯         100     1100    1200
              800     1200    2000
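The effect of null addition is likewise easy to verify from the counts. The following sketch (ours, for illustration) evaluates IS, the odds ratio, and the interest factor on Tables 5.14(a) and (b); IS is unchanged at 0.875, while the other two measures change once the 1000 null transactions are added.

from math import sqrt

def cosine_is(f11, f10, f01, f00):
    return f11 / sqrt((f11 + f10) * (f11 + f01))

def odds_ratio(f11, f10, f01, f00):
    return (f11 * f00) / (f10 * f01)

def interest(f11, f10, f01, f00):
    n = f11 + f10 + f01 + f00
    return n * f11 / ((f11 + f10) * (f11 + f01))

t1 = (700, 100, 100, 100)     # Table 5.14(a)
t2 = (700, 100, 100, 1100)    # Table 5.14(b): 1000 null transactions added to f00

print(cosine_is(*t1), cosine_is(*t2))     # 0.875 and 0.875: invariant
print(odds_ratio(*t1), odds_ratio(*t2))   # 7 and 77: affected by null addition
print(interest(*t1), interest(*t2))       # the interest factor changes as well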

Table 5.15 provides a summary of properties for the measures defined in Table 5.9. Even though this list of properties is not exhaustive, it can serve as a useful guide for selecting the right choice of measure for an application. Ideally, if we know the specific requirements of a certain application, we can ensure that the selected measure shows properties that adhere to those requirements. For example, if we are dealing with asymmetric variables, we would prefer to use a measure that is not invariant to null addition or inversion. On the other hand, if we require the measure to remain invariant to changes in the sample size, we would like to use a measure that does not change with scaling.

Table 5.15. Properties of symmetric measures.

   Symbol   Measure               Inversion   Null Addition   Scaling
   ϕ        ϕ-coefficient         Yes         No              No
   α        odds ratio            Yes         No              Yes
   κ        Cohen's               Yes         No              No
   I        Interest              No          No              No
   IS       Cosine                No          Yes             No
   PS       Piatetsky-Shapiro's   Yes         No              No
   S        Collective strength   Yes         No              No
   ζ        Jaccard               No          Yes             No
   h        All-confidence        No          Yes             No
   s        Support               No          No              No

Asymmetric Interestingness Measures  Note that in the discussion so far, we have only considered measures that do not change their value when the order of the variables is reversed. More specifically, if M is a measure and A and B are two variables, then M(A, B) is equal to M(B, A) if the order of the variables does not matter. Such measures are called symmetric. On the other hand, measures that depend on the order of variables (M(A, B) ≠ M(B, A)) are called asymmetric measures. For example, the interest factor is a symmetric measure because its value is identical for the rules A → B and B → A. In contrast, confidence is an asymmetric measure since the confidence for A → B and B → A may not be the same. Note that the use of the term 'asymmetric' to describe a particular type of measure of relationship, one in which the order of the variables is important, should not be confused with the use of 'asymmetric' to describe a binary variable for which only 1's are important. Asymmetric measures are more suitable for analyzing association rules, since the items in a rule do have a specific order. Even though we only considered symmetric measures to discuss the different properties of association measures, the above discussion is also relevant for the asymmetric measures. See the Bibliographic Notes for more information about different kinds of asymmetric measures and their properties.

5.7.2 Measures beyond Pairs of Binary Variables

The measures shown in Table 5.9 are defined for pairs of binary variables (e.g., 2-itemsets or association rules). However, many of them, such as support and all-confidence, are also applicable to larger-sized itemsets. Other measures, such as the interest factor, IS, PS, and Jaccard coefficient, can be extended to more than two variables using the frequency counts tabulated in a multidimensional contingency table. An example of a three-dimensional contingency table for a, b, and c is shown in Table 5.16. Each entry fijk in this table represents the number of transactions that contain a particular combination of items a, b, and c. For example, f101 is the number of transactions that contain a and c, but not b. On the other hand, a marginal frequency such as f1+1 is the number of transactions that contain a and c, irrespective of whether b is present in the transaction.

Table 5.16. Example of a three-dimensional contingency table.

   c:              b        b¯
       a         f111     f101    f1+1
       a¯        f011     f001    f0+1
                 f+11     f+01    f++1

   c¯:             b        b¯
       a         f110     f100    f1+0
       a¯        f010     f000    f0+0
                 f+10     f+00    f++0

Given a k-itemset {i1, i2, …, ik}, the condition for statistical independence can be stated as follows:

fi1i2…ik = (fi1+…+ × f+i2…+ × … × f++…ik) / N^(k−1).    (5.11)

With this definition, we can extend objective measures such as the interest factor and PS, which are based on deviations from statistical independence, to more than two variables:

I = (N^(k−1) × fi1i2…ik) / (fi1+…+ × f+i2…+ × … × f++…ik)

PS = fi1i2…ik/N − (fi1+…+ × f+i2…+ × … × f++…ik) / N^k

Another approach is to define the objective measure as the maximum, minimum, or average value for the associations between pairs of items in a pattern. For example, given a k-itemset X = {i1, i2, …, ik}, we may define the ϕ-coefficient for X as the average ϕ-coefficient(ip, iq) between every pair of items in X. However, because the measure considers only pairwise associations, it may not capture all the underlying relationships within a pattern. Also, care should be taken in using such alternate measures for more than two variables, since they may not always show the anti-monotone property in the same way as the support measure, making them unsuitable for mining patterns using the Apriori principle.

Analysis of multidimensional contingency tables is more complicated because of the presence of partial associations in the data. For example, some associations may appear or disappear when conditioned upon the value of certain variables. This problem is known as Simpson's paradox and is described in Section 5.7.3. More sophisticated statistical techniques are available to analyze such relationships, e.g., loglinear models, but these techniques are beyond the scope of this book.
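For concreteness, the sketch below (ours; the counts are hypothetical and not taken from the text) evaluates the extended interest factor for a 3-itemset from a 2×2×2 table of frequency counts, using the independence baseline of Equation 5.11.

import numpy as np

def interest_3way(f):
    """Extended interest factor for a 3-itemset from a 2x2x2 count array f,
    where f[a, b, c] is the number of transactions with the given presence
    pattern (index 1 = item present, index 0 = item absent)."""
    n = f.sum()
    joint = f[1, 1, 1]                                        # f111
    baseline = f[1].sum() * f[:, 1].sum() * f[:, :, 1].sum() / n**2
    return joint / baseline            # N^2 * f111 / (f1++ * f+1+ * f++1)

# Hypothetical counts for three items a, b, and c.
counts = np.array([[[30, 5], [10, 5]],     # a absent
                   [[10, 5], [5, 30]]])    # a present
print(interest_3way(counts))               # > 1 indicates positive association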

5.7.3 Simpson's Paradox

It is important to exercise caution when interpreting the association between variables because the observed relationship may be influenced by the presence of other confounding factors, i.e., hidden variables that are not included in the analysis. In some cases, the hidden variables may cause the observed relationship between a pair of variables to disappear or reverse its direction, a phenomenon that is known as Simpson's paradox. We illustrate the nature of this paradox with the following example.

Consider the relationship between the sale of high-definition televisions (HDTV) and exercise machines, as shown in Table 5.17. The rule {HDTV = Yes} → {Exercise machine = Yes} has a confidence of 99/180 = 55% and the rule {HDTV = No} → {Exercise machine = Yes} has a confidence of 54/120 = 45%. Together, these rules suggest that customers who buy high-definition televisions are more likely to buy exercise machines than those who do not buy high-definition televisions.

Table 5.17. A two-way contingency table between the sale of high-definition television and exercise machine.

                     Buy Exercise Machine
   Buy HDTV           Yes       No
   Yes                 99       81     180
   No                  54       66     120
                      153      147     300

However, a deeper analysis reveals that the sales of these items depend on whether the customer is a college student or a working adult. Table 5.18 summarizes the relationship between the sale of HDTVs and exercise machines among college students and working adults. Notice that the support counts given in the table for college students and working adults sum up to the frequencies shown in Table 5.17. Furthermore, there are more working adults than college students who buy these items.

Table 5.18. Example of a three-way contingency table.

                                     Buy Exercise Machine
   Customer Group     Buy HDTV        Yes       No     Total
   College Students   Yes               1        9        10
                      No                4       30        34
   Working Adult      Yes              98       72       170
                      No               50       36        86

For college students:

   c({HDTV = Yes} → {Exercise machine = Yes}) = 1/10 = 10%,
   c({HDTV = No} → {Exercise machine = Yes}) = 4/34 = 11.8%,

while for working adults:

   c({HDTV = Yes} → {Exercise machine = Yes}) = 98/170 = 57.7%,
   c({HDTV = No} → {Exercise machine = Yes}) = 50/86 = 58.1%.

The rules suggest that, for each group, customers who do not buy high-definition televisions are more likely to buy exercise machines, which contradicts the previous conclusion when data from the two customer groups are pooled together. Even if alternative measures such as correlation, odds ratio, or interest are applied, we still find that the sale of HDTV and exercise machine is positively related in the combined data but is negatively related in the stratified data (see Exercise 21 on page 449). The reversal in the direction of association is known as Simpson's paradox.

The paradox can be explained in the following way. First, notice that most customers who buy HDTVs are working adults. This is reflected in the high confidence of the rule {HDTV = Yes} → {Working Adult} (170/180 = 94.4%). Second, the high confidence of the rule {Exercise machine = Yes} → {Working Adult} (148/153 = 96.7%) suggests that most customers who buy exercise machines are also working adults. Since working adults form the largest fraction of customers for both HDTVs and exercise machines, they both look related and the rule {HDTV = Yes} → {Exercise machine = Yes} turns out to be stronger in the combined data than what it would have been if the data is stratified. Hence, customer group acts as a hidden variable that affects both the fraction of customers who buy HDTVs and those who buy exercise machines. If we factor out the effect of the hidden variable by stratifying the data, we see that the relationship between buying HDTVs and buying exercise machines is not direct, but shows up as an indirect consequence of the effect of the hidden variable.

Simpson's paradox can also be illustrated mathematically as follows. Suppose

   a/b < c/d  and  p/q < r/s,

where a/b and p/q may represent the confidence of the rule A → B in two different strata, while c/d and r/s may represent the confidence of the rule A¯ → B in the two strata. When the data is pooled together, the confidence values of the rules in the combined data are (a + p)/(b + q) and (c + r)/(d + s), respectively. Simpson's paradox occurs when

   (a + p)/(b + q) > (c + r)/(d + s),

thus leading to the wrong conclusion about the relationship between the variables. The lesson here is that proper stratification is needed to avoid generating spurious patterns resulting from Simpson's paradox. For example, market basket data from a major supermarket chain should be stratified according to store locations, while medical records from various patients should be stratified according to confounding factors such as age and gender.
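The reversal is easy to reproduce from the counts in Tables 5.17 and 5.18. In the following sketch (ours, for illustration), each stratum gives the HDTV = No group the higher confidence, yet the pooled counts give HDTV = Yes the higher confidence.

def confidence(buys_machine, does_not_buy):
    """Confidence of {group} -> {Exercise machine = Yes}."""
    return buys_machine / (buys_machine + does_not_buy)

# Counts from Table 5.18: (buys exercise machine, does not buy), per HDTV group.
students = {"HDTV=Yes": (1, 9),   "HDTV=No": (4, 30)}
adults   = {"HDTV=Yes": (98, 72), "HDTV=No": (50, 36)}

for label, stratum in (("college students", students), ("working adults", adults)):
    print(label, {k: round(confidence(*v), 3) for k, v in stratum.items()})

# Pooling the two strata recovers Table 5.17 and reverses the direction.
pooled = {k: (students[k][0] + adults[k][0], students[k][1] + adults[k][1])
          for k in students}
print("pooled", {k: round(confidence(*v), 3) for k, v in pooled.items()})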

5.8 Effect of Skewed Support Distribution

The performances of many association analysis algorithms are influenced by properties of their input data. For example, the computational complexity of the Apriori algorithm depends on properties such as the number of items in the data, the average transaction width, and the support threshold used. This section examines another important property that has significant influence on the performance of association analysis algorithms as well as the quality of extracted patterns. More specifically, we focus on data sets with skewed support distributions, where most of the items have relatively low to moderate frequencies, but a small number of them have very high frequencies.

Figure 5.29. A transaction data set containing three items, p, q, and r, where p is a high support item and q and r are low support items.

Figure 5.29 shows an illustrative example of a data set that has a skewed support distribution of its items. While p has a high support of 83.3% in the data, q and r are low-support items with a support of 16.7%. Despite their low support, q and r always occur together in the limited number of transactions in which they appear and hence are strongly related. A pattern mining algorithm therefore should report {q, r} as interesting.

However, note that choosing the right support threshold for mining itemsets such as {q, r} can be quite tricky. If we set the threshold too high (e.g., 20%), then we may miss many interesting patterns involving low-support items such as {q, r}. Conversely, setting the support threshold too low can be detrimental to the pattern mining process for the following reasons. First, the computational and memory requirements of existing association analysis algorithms increase considerably with low support thresholds. Second, the number of extracted patterns also increases substantially with low support thresholds, which makes their analysis and interpretation difficult. In particular, we may extract many spurious patterns that relate a high-frequency item such as p to a low-frequency item such as q. Such patterns, which are called cross-support patterns, are likely to be spurious because the association between p and q is largely influenced by the frequent occurrence of p instead of the joint occurrence of p and q together. Because the support of {p, q} is quite close to the support of {q, r}, we may easily select {p, q} if we set the support threshold low enough to include {q, r}.

An example of a real data set that exhibits a skewed support distribution is shown in Figure 5.30. The data, taken from the PUMS (Public Use Microdata Sample) census data, contains 49,046 records and 2113 asymmetric binary variables. We shall treat the asymmetric binary variables as items and records as transactions. While more than 80% of the items have support less than 1%, a handful of them have support greater than 90%. To understand the effect of a skewed support distribution on frequent itemset mining, we divide the items into three groups, G1, G2, and G3, according to their support levels, as shown in Table 5.19. We can see that more than 82% of the items belong to G1 and have a support less than 1%. In market basket analysis, such low support items may correspond to expensive products (such as jewelry) that are seldom bought by customers, but whose patterns are still interesting to retailers. Patterns involving such low-support items, though meaningful, can easily be rejected by a frequent pattern mining algorithm with a high support threshold. On the other hand, setting a low support threshold may result in the extraction of spurious patterns that relate a high-frequency item in G3 to a low-frequency item in G1. For example, at a support threshold equal to 0.05%, there are 18,847 frequent pairs involving items from G1 and G3. Out of these, 93% of them are cross-support patterns; i.e., the patterns contain items from both G1 and G3.

Figure 5.30. Support distribution of items in the census data set.

Table 5.19. Grouping the items in the census data set based on their support values.

   Group              G1       G2         G3
   Support            <1%      1%–90%     >90%
   Number of Items    1735     358        20

This example shows that a large number of weakly related cross-support patterns can be generated when the support threshold is sufficiently low. Note that finding interesting patterns in data sets with skewed support distributions is not just a challenge for the support measure; similar statements can be made about many other objective measures discussed in the previous sections. Before presenting a methodology for finding interesting patterns and pruning spurious ones, we formally define the concept of cross-support patterns.

Definition 5.9. (Cross-support Pattern.) Let us define the support ratio, r(X), of an itemset X = {i1, i2, …, ik} as

r(X) = min[s(i1), s(i2), …, s(ik)] / max[s(i1), s(i2), …, s(ik)].    (5.12)

Given a user-specified threshold hc, an itemset X is a cross-support pattern if r(X) < hc.

Example 5.4. Suppose the support for milk is 70%, while the support for sugar is 10% and caviar is 0.04%. Given hc = 0.01, the frequent itemset {milk, sugar, caviar} is a cross-support pattern because its support ratio is

r = min[0.7, 0.1, 0.0004] / max[0.7, 0.1, 0.0004] = 0.0004/0.7 = 0.00058 < 0.01.
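The support ratio test is straightforward to apply once the individual item supports are known. The sketch below (ours) implements Equation 5.12 and repeats the check of Example 5.4.

def support_ratio(item_supports):
    """Support ratio r(X) of an itemset, Equation 5.12."""
    return min(item_supports) / max(item_supports)

def is_cross_support(item_supports, hc):
    """An itemset is a cross-support pattern if its support ratio is below hc."""
    return support_ratio(item_supports) < hc

# Example 5.4: milk (70%), sugar (10%), caviar (0.04%), threshold hc = 0.01.
print(support_ratio([0.7, 0.1, 0.0004]))           # about 0.00057
print(is_cross_support([0.7, 0.1, 0.0004], 0.01))  # True: cross-support pattern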

Existing measures such as support and confidence may not be sufficient to eliminate cross-support patterns. For example, if we assume hc = 0.3 for the data set presented in Figure 5.29, the itemsets {p, q}, {p, r}, and {p, q, r} are cross-support patterns because their support ratios, which are equal to 0.2, are less than the threshold hc. However, their supports are comparable to that of {q, r}, making it difficult to eliminate cross-support patterns without losing interesting ones using a support-based pruning strategy. Confidence pruning also does not help because the confidence of the rules extracted from cross-support patterns can be very high. For example, the confidence for {q} → {p} is 80% even though {p, q} is a cross-support pattern. The fact that the cross-support pattern can produce a high confidence rule should not come as a surprise because one of its items (p) appears very frequently in the data. Therefore, p is expected to appear in many of the transactions that contain q. Meanwhile, the rule {q} → {r} also has high confidence even though {q, r} is not a cross-support pattern. This example demonstrates the difficulty of using the confidence measure to distinguish between rules extracted from cross-support patterns and interesting patterns involving strongly connected but low-support items.

Even though the rule {q} → {p} has very high confidence, notice that the rule {p} → {q} has very low confidence because most of the transactions that contain p do not contain q. In contrast, the rule {r} → {q}, which is derived from {q, r}, has very high confidence. This observation suggests that cross-support patterns can be detected by examining the lowest confidence rule that can be extracted from a given itemset. An approach for finding the rule with the lowest confidence given an itemset can be described as follows.

1. Recall the following anti-monotone property of confidence:

   conf({i1, i2} → {i3, i4, …, ik}) ≤ conf({i1, i2, i3} → {i4, i5, …, ik}).

This property suggests that confidence never increases as we shift more items from the left- to the right-hand side of an association rule. Because of this property, the lowest confidence rule extracted from a frequent itemset contains only one item on its left-hand side. We denote the set of all rules with only one item on the left-hand side as R1.

2. Given a frequent itemset {i1, i2, …, ik}, the rule

   {ij} → {i1, i2, …, ij−1, ij+1, …, ik}

has the lowest confidence in R1 if s(ij) = max[s(i1), s(i2), …, s(ik)]. This follows directly from the definition of confidence as the ratio between the rule's support and the support of the rule antecedent. Hence, the confidence of a rule will be lowest when the support of the antecedent is highest.

3. Summarizing the previous points, the lowest confidence attainable from a frequent itemset {i1, i2, …, ik} is

   s({i1, i2, …, ik}) / max[s(i1), s(i2), …, s(ik)].

This expression is also known as the h-confidence or all-confidence measure. Because of the anti-monotone property of support, the numerator of the h-confidence measure is bounded by the minimum support of any item that appears in the frequent itemset. In other words, the h-confidence of an itemset X = {i1, i2, …, ik} must not exceed the following expression:

   h-confidence(X) ≤ min[s(i1), s(i2), …, s(ik)] / max[s(i1), s(i2), …, s(ik)].

Note that the upper bound of h-confidence in the above equation is exactly the same as the support ratio (r) given in Equation 5.12. Because the support ratio for a cross-support pattern is always less than hc, the h-confidence of the pattern is also guaranteed to be less than hc. Therefore, cross-support patterns can be eliminated by ensuring that the h-confidence values for the patterns exceed hc. As a final note, the advantages of using h-confidence go beyond eliminating cross-support patterns. The measure is also anti-monotone, i.e.,

   h-confidence({i1, i2, …, ik}) ≥ h-confidence({i1, i2, …, ik+1}),

and thus can be incorporated directly into the mining algorithm. Furthermore, h-confidence ensures that the items contained in an itemset are strongly associated with each other. For example, suppose the h-confidence of an itemset X is 80%. If one of the items in X is present in a transaction, there is at least an 80% chance that the rest of the items in X also belong to the same transaction. Such strongly associated patterns involving low-support items are called hyperclique patterns.

Definition 5.10. (Hyperclique Pattern.) An itemset X is a hyperclique pattern if h-confidence(X) > hc, where hc is a user-specified threshold.
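A minimal sketch of the h-confidence test follows. It is ours rather than part of the original text; the supports are the ones quoted in the discussion of Figure 5.29 (s(p) = 0.833, s(q) = s(r) = 0.167), and s({p, q}) = 0.133 is derived from the 80% confidence of {q} → {p}.

def h_confidence(itemset_support, item_supports):
    """h-confidence (all-confidence): s(X) divided by the largest item support."""
    return itemset_support / max(item_supports)

hc = 0.3
# {p, q}: a cross-support pattern; its h-confidence falls below the threshold.
print(h_confidence(0.133, [0.833, 0.167]) > hc)    # False -> pruned
# {q, r}: q and r always co-occur, so h-confidence is 1; a hyperclique pattern.
print(h_confidence(0.167, [0.167, 0.167]) > hc)    # True  -> retained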

5.9 Bibliographic Notes

The association rule mining task was first introduced by Agrawal et al. [324, 325] to discover interesting relationships among items in market basket transactions. Since its inception, extensive research has been conducted to address the various issues in association rule mining, from its fundamental concepts to its implementation and applications. Figure 5.31 shows a taxonomy of the various research directions in this area, which is generally known as association analysis. As much of the research focuses on finding patterns that appear significantly often in the data, the area is also known as frequent pattern mining. A detailed review on some of the research topics in this area can be found in [362] and in [319].

Figure 5.31. An overview of the various research directions in association analysis.

Conceptual Issues

Research on the conceptual issues of association analysis has focused on developing a theoretical formulation of association analysis and extending the formulation to new types of patterns and going beyond asymmetric binary attributes.

Following the pioneering work by Agrawal et al. [324, 325], there has been a vast amount of research on developing a theoretical formulation for the association analysis problem. In [357], Gunopulos et al. showed the connection between finding maximal frequent itemsets and the hypergraph transversal problem. An upper bound on the complexity of the association analysis task was also derived. Zaki et al. [454, 456] and Pasquier et al. [407] have applied formal concept analysis to study the frequent itemset generation problem. More importantly, such research has led to the development of a class of patterns known as closed frequent itemsets [456]. Friedman et al. [355] have studied the association analysis problem in the context of bump hunting in multidimensional space. Specifically, they consider frequent itemset generation as the task of finding high density regions in multidimensional space. Formalizing association analysis in a statistical learning framework is another active research direction [414, 435, 444] as it can help address issues related to identifying statistically significant patterns and dealing with uncertain data [320, 333, 343].

Over the years, the association rule mining formulation has been expanded to encompass other rule-based patterns, such as profile association rules [321], cyclic association rules [403], fuzzy association rules [379], exception rules [431], negative association rules [336, 418], weighted association rules [338, 413], dependence rules [422], peculiar rules [462], inter-transaction association rules [353, 440], and partial classification rules [327, 397]. Additionally, the concept of frequent itemset has been extended to other types of patterns including closed itemsets [407, 456], maximal itemsets [330], hyperclique patterns [449], support envelopes [428], emerging patterns [347], contrast sets [329], high-utility itemsets [340, 390], approximate or error-tolerant itemsets [358, 389, 451], and discriminative patterns [352, 401, 430]. Association analysis techniques have also been successfully applied to sequential [326, 426], spatial [371], and graph-based [374, 380, 406, 450, 455] data.

Substantial research has been conducted to extend the original association rule formulation to nominal [425], ordinal [392], interval [395], and ratio [356, 359, 425, 443, 461] attributes. One of the key issues is how to define the support measure for these attributes. A methodology was proposed by Steinbach et al. [429] to extend the traditional notion of support to more general patterns and attribute types.

Implementation Issues

Research activities in this area revolve around (1) integrating the mining capability into existing database technology, (2) developing efficient and scalable mining algorithms, (3) handling user-specified or domain-specific constraints, and (4) post-processing the extracted patterns.

There are several advantages to integrating association analysis into existing database technology. First, it can make use of the indexing and query processing capabilities of the database system. Second, it can also exploit the DBMS support for scalability, check-pointing, and parallelization [415]. The SETM algorithm developed by Houtsma et al. [370] was one of the earliest algorithms to support association rule discovery via SQL queries. Since then, numerous methods have been developed to provide capabilities for mining association rules in database systems. For example, the DMQL [363] and M-SQL [373] query languages extend the basic SQL with new operators for mining association rules. The MineRule operator [394] is an expressive SQL operator that can handle both clustered attributes and item hierarchies. Tsur et al. [439] developed a generate-and-test approach called query flocks for mining association rules. A distributed OLAP-based infrastructure was developed by Chen et al. [341] for mining multilevel association rules.

Despite its popularity, the Apriori algorithm is computationally expensive because it requires making multiple passes over the transaction database. Its runtime and storage complexities were investigated by Dunkel and Soparkar [349]. The FP-growth algorithm was developed by Han et al. in [364]. Other algorithms for mining frequent itemsets include the DHP (dynamic hashing and pruning) algorithm proposed by Park et al. [405] and the Partition algorithm developed by Savasere et al. [417]. A sampling-based frequent itemset generation algorithm was proposed by Toivonen [436]. The algorithm requires only a single pass over the data, but it can produce more candidate itemsets than necessary. The Dynamic Itemset Counting (DIC) algorithm [337] makes only 1.5 passes over the data and generates fewer candidate itemsets than the sampling-based algorithm. Other notable algorithms include the tree-projection algorithm [317] and H-Mine [408]. Survey articles on frequent itemset generation algorithms can be found in [322, 367]. A repository of benchmark data sets and software implementations of association rule mining algorithms is available at the Frequent Itemset Mining Implementations (FIMI) repository (http://fimi.cs.helsinki.fi).

Parallel algorithms have been developed to scale up association rule mining for handling big data [318, 360, 399, 420, 457]. A survey of such algorithms can be found in [453]. Online and incremental association rule mining algorithms have also been proposed by Hidber [365] and Cheung et al. [342]. More recently, new algorithms have been developed to speed up frequent itemset mining by exploiting the processing power of GPUs [459] and the MapReduce/Hadoop distributed computing framework [382, 384, 396]. For example, an implementation of frequent itemset mining for the Hadoop framework is available in the Apache Mahout software.1

1 http://mahout.apache.org

Srikant et al. [427] have considered the problem of mining association rules in the presence of Boolean constraints such as the following:

   (Cookies ∧ Milk) ∨ (descendants(Cookies) ∧ ¬ancestors(Wheat Bread))

Given such a constraint, the algorithm looks for rules that contain both cookies and milk, or rules that contain the descendant items of cookies but not ancestor items of wheat bread. Singh et al. [424] and Ng et al. [400] had also developed alternative techniques for constraint-based association rule mining. Constraints can also be imposed on the support for different itemsets. This problem was investigated by Wang et al. [442], Liu et al. in [387], and Seno et al. [419]. In addition, constraints arising from privacy concerns of mining sensitive data have led to the development of privacy-preserving frequent pattern mining techniques [334, 350, 441, 458].

One potential problem with association analysis is the large number of patterns that can be generated by current algorithms. To overcome this problem, methods to rank, summarize, and filter patterns have been developed. Toivonen et al. [437] proposed the idea of eliminating redundant rules using structural rule covers and grouping the remaining rules using clustering. Liu et al. [388] applied the statistical chi-square test to prune spurious patterns and summarized the remaining patterns using a subset of the patterns called direction setting rules. The use of objective measures to filter patterns has been investigated by many authors, including Brin et al. [336], Bayardo and Agrawal [331], Aggarwal and Yu [323], and DuMouchel and Pregibon [348]. The properties of many of these measures were analyzed by Piatetsky-Shapiro [410], Kamber and Singhal [376], Hilderman and Hamilton [366], and Tan et al. [433]. The grade-gender example used to highlight the importance of the row and column scaling invariance property was heavily influenced by the discussion given in [398] by Mosteller. Meanwhile, the tea-coffee example illustrating the limitation of confidence was motivated by an example given in [336] by Brin et al. Because of the limitation of confidence, Brin et al. [336] had proposed the idea of using the interest factor as a measure of interestingness. The all-confidence measure was proposed by Omiecinski [402]. Xiong et al. [449] introduced the cross-support property and showed that the all-confidence measure can be used to eliminate cross-support patterns. A key difficulty in using alternative objective measures besides support is their lack of a monotonicity property, which makes it difficult to incorporate the measures directly into the mining algorithms. Xiong et al. [447] have proposed an efficient method for mining correlations by introducing an upper bound function to the ϕ-coefficient. Although the measure is non-monotone, it has an upper bound expression that can be exploited for the efficient mining of strongly correlated item pairs.

Fabris and Freitas [351] have proposed a method for discovering interesting associations by detecting the occurrences of Simpson's paradox [423]. Megiddo and Srikant [393] described an approach for validating the extracted patterns using hypothesis testing methods. A resampling-based technique was also developed to avoid generating spurious patterns because of the multiple comparison problem. Bolton et al. [335] have applied the Benjamini-Hochberg [332] and Bonferroni correction methods to adjust the p-values of discovered patterns in market basket data. Alternative methods for handling the multiple comparison problem were suggested by Webb [445], Zhang et al. [460], and Llinares-Lopez et al. [391].

Application of subjective measures to association analysis has been investigated by many authors. Silberschatz and Tuzhilin [421] presented two principles in which a rule can be considered interesting from a subjective point of view. The concept of unexpected condition rules was introduced by Liu et al. in [385]. Cooley et al. [344] analyzed the idea of combining soft belief sets using the Dempster-Shafer theory and applied this approach to identify contradictory and novel association patterns in web data. Alternative approaches include using Bayesian networks [375] and neighborhood-based information [346] to identify subjectively interesting patterns.

Visualization also helps the user to quickly grasp the underlying structure of the discovered patterns. Many commercial data mining tools display the complete set of rules (which satisfy both support and confidence threshold criteria) as a two-dimensional plot, with each axis corresponding to the antecedent or consequent itemsets of the rule. Hofmann et al. [368] proposed using Mosaic plots and Double Decker plots to visualize association rules. This approach can visualize not only a particular rule, but also the overall contingency table between itemsets in the antecedent and consequent parts of the rule. Nevertheless, this technique assumes that the rule consequent consists of only a single attribute.

Application Issues

Association analysis has been applied to a variety of application domains such as web mining [409, 432], document analysis [369], telecommunication alarm diagnosis [377], network intrusion detection [328, 345, 381], and bioinformatics [416, 446]. Applications of association and correlation pattern analysis to Earth Science studies have been investigated in [411, 412, 434]. Trajectory pattern mining [339, 372, 438] is another application of spatio-temporal association analysis to identify frequently traversed paths of moving objects.

Association patterns have also been applied to other learning problems such as classification [383, 386], regression [404], and clustering [361, 448, 452]. A comparison between classification and association rule mining was made by Freitas in his position paper [354]. The use of association patterns for clustering has been studied by many authors including Han et al. [361], Kosters et al. [378], Yang et al. [452], and Xiong et al. [448].

Bibliography

[317] R. C. Agarwal, C. C. Aggarwal, and V. V. V. Prasad. A Tree Projection Algorithm for Generation of Frequent Itemsets. Journal of Parallel and Distributed Computing (Special Issue on High Performance Data Mining), 61(3):350–371, 2001.

[318]R.C.AgarwalandJ.C.Shafer.ParallelMiningofAssociationRules.IEEETransactionsonKnowledgeandDataEngineering,8(6):962–969,March1998.

[319]C.AggarwalandJ.Han.FrequentPatternMining.Springer,2014.

[320]C.C.Aggarwal,Y.Li,J.Wang,andJ.Wang.Frequentpatternminingwithuncertaindata.InProceedingsofthe15thACMSIGKDDInternationalConferenceonKnowledgeDiscoveryandDataMining,pages29–38,Paris,France,2009.

[321]C.C.Aggarwal,Z.Sun,andP.S.Yu.OnlineGenerationofProfileAssociationRules.InProc.ofthe4thIntl.Conf.onKnowledgeDiscoveryandDataMining,pages129—133,NewYork,NY,August1996.

[322]C.C.AggarwalandP.S.Yu.MiningLargeItemsetsforAssociationRules.DataEngineeringBulletin,21(1):23–31,March1998.

[323]C.C.AggarwalandP.S.Yu.MiningAssociationswiththeCollectiveStrengthApproach.IEEETrans.onKnowledgeandDataEngineering,13(6):863–873,January/February2001.

[324]R.Agrawal,T.Imielinski,andA.Swami.Databasemining:Aperformanceperspective.IEEETransactionsonKnowledgeandDataEngineering,5:914–925,1993.

[325]R.Agrawal,T.Imielinski,andA.Swami.Miningassociationrulesbetweensetsofitemsinlargedatabases.InProc.ACMSIGMODIntl.Conf.ManagementofData,pages207–216,Washington,DC,1993.

[326]R.AgrawalandR.Srikant.MiningSequentialPatterns.InProc.ofIntl.Conf.onDataEngineering,pages3–14,Taipei,Taiwan,1995.

[327]K.Ali,S.Manganaris,andR.Srikant.PartialClassificationusingAssociationRules.InProc.ofthe3rdIntl.Conf.onKnowledgeDiscoveryandDataMining,pages115—118,NewportBeach,CA,August1997.

[328]D.Barbará,J.Couto,S.Jajodia,andN.Wu.ADAM:ATestbedforExploringtheUseofDataMininginIntrusionDetection.SIGMODRecord,30(4):15–24,2001.

[329]S.D.BayandM.Pazzani.DetectingGroupDifferences:MiningContrastSets.DataMiningandKnowledgeDiscovery,5(3):213–246,2001.

[330]R.Bayardo.EfficientlyMiningLongPatternsfromDatabases.InProc.of1998ACM-SIGMODIntl.Conf.onManagementofData,pages85–93,Seattle,WA,June1998.

[331]R.BayardoandR.Agrawal.MiningtheMostInterestingRules.InProc.ofthe5thIntl.Conf.onKnowledgeDiscoveryandDataMining,pages145–153,SanDiego,CA,August1999.

[332]Y.BenjaminiandY.Hochberg.ControllingtheFalseDiscoveryRate:APracticalandPowerfulApproachtoMultipleTesting.JournalRoyalStatisticalSocietyB,57(1):289–300,1995.

[333]T.Bernecker,H.Kriegel,M.Renz,F.Verhein,andA.Züle.Probabilisticfrequentitemsetmininginuncertaindatabases.InProceedingsofthe15thACMSIGKDDInternationalConferenceonKnowledgeDiscoveryandDataMining,pages119–128,Paris,France,2009.

[334]R.Bhaskar,S.Laxman,A.D.Smith,andA.Thakurta.Discoveringfrequentpatternsinsensitivedata.InProceedingsofthe16thACMSIGKDDInternationalConferenceonKnowledgeDiscoveryandDataMining,pages503–512,Washington,DC,2010.

[335]R.J.Bolton,D.J.Hand,andN.M.Adams.DeterminingHitRateinPatternSearch.InProc.oftheESFExploratoryWorkshoponPatternDetectionandDiscoveryinDataMining,pages36–48,London,UK,September2002.

[336]S.Brin,R.Motwani,andC.Silverstein.Beyondmarketbaskets:Generalizingassociationrulestocorrelations.InProc.ACMSIGMODIntl.Conf.ManagementofData,pages265–276,Tucson,AZ,1997.

[337]S.Brin,R.Motwani,J.Ullman,andS.Tsur.DynamicItemsetCountingandImplicationRulesformarketbasketdata.InProc.of1997ACM-SIGMODIntl.Conf.onManagementofData,pages255–264,Tucson,AZ,June1997.

[338]C.H.Cai,A.Fu,C.H.Cheng,andW.W.Kwong.MiningAssociationRuleswithWeightedItems.InProc.ofIEEEIntl.DatabaseEngineeringandApplicationsSymp.,pages68–77,Cardiff,Wales,1998.

[339]H.Cao,N.Mamoulis,andD.W.Cheung.MiningFrequentSpatio-TemporalSequentialPatterns.InProceedingsofthe5thIEEEInternationalConferenceonDataMining,pages82–89,Houston,TX,2005.

[340]R.Chan,Q.Yang,andY.Shen.MiningHighUtilityItemsets.InProceedingsofthe3rdIEEEInternationalConferenceonDataMining,pages19–26,Melbourne,FL,2003.

[341]Q.Chen,U.Dayal,andM.Hsu.ADistributedOLAPinfrastructureforE-Commerce.InProc.ofthe4thIFCISIntl.Conf.onCooperativeInformationSystems,pages209—220,Edinburgh,Scotland,1999.

[342]D.C.Cheung,S.D.Lee,andB.Kao.AGeneralIncrementalTechniqueforMaintainingDiscoveredAssociationRules.InProc.ofthe5thIntl.Conf.

onDatabaseSystemsforAdvancedApplications,pages185–194,Melbourne,Australia,1997.

[343]C.K.Chui,B.Kao,andE.Hung.MiningFrequentItemsetsfromUncertainData.InProceedingsofthe11thPacific-AsiaConferenceonKnowledgeDiscoveryandDataMining,pages47–58,Nanjing,China,2007.

[344]R.Cooley,P.N.Tan,andJ.Srivastava.DiscoveryofInterestingUsagePatternsfromWebData.InM.SpiliopoulouandB.Masand,editors,AdvancesinWebUsageAnalysisandUserProfiling,volume1836,pages163–182.LectureNotesinComputerScience,2000.

[345]P.Dokas,L.Ertöz,V.Kumar,A.Lazarevic,J.Srivastava,andP.N.Tan.DataMiningforNetworkIntrusionDetection.InProc.NSFWorkshoponNextGenerationDataMining,Baltimore,MD,2002.

[346]G.DongandJ.Li.Interestingnessofdiscoveredassociationrulesintermsofneighborhood-basedunexpectedness.InProc.ofthe2ndPacific-AsiaConf.onKnowledgeDiscoveryandDataMining,pages72–86,Melbourne,Australia,April1998.

[347]G.DongandJ.Li.EfficientMiningofEmergingPatterns:DiscoveringTrendsandDifferences.InProc.ofthe5thIntl.Conf.onKnowledgeDiscoveryandDataMining,pages43–52,SanDiego,CA,August1999.

[348]W.DuMouchelandD.Pregibon.EmpiricalBayesScreeningforMulti-ItemAssociations.InProc.ofthe7thIntl.Conf.onKnowledgeDiscovery

andDataMining,pages67–76,SanFrancisco,CA,August2001.

[349]B.DunkelandN.Soparkar.DataOrganizationandAccessforEfficientDataMining.InProc.ofthe15thIntl.Conf.onDataEngineering,pages522–529,Sydney,Australia,March1999.

[350]A.V.Evfimievski,R.Srikant,R.Agrawal,andJ.Gehrke.Privacypreservingminingofassociationrules.InProceedingsoftheEighthACMSIGKDDInternationalConferenceonKnowledgeDiscoveryandDataMining,pages217–228,Edmonton,Canada,2002.

[351]C.C.FabrisandA.A.Freitas.DiscoveringsurprisingpatternsbydetectingoccurrencesofSimpson'sparadox.InProc.ofthe19thSGESIntl.Conf.onKnowledge-BasedSystemsandAppliedArtificialIntelligence),pages148–160,Cambridge,UK,December1999.

[352]G.Fang,G.Pandey,W.Wang,M.Gupta,M.Steinbach,andV.Kumar.MiningLow-SupportDiscriminativePatternsfromDenseandHigh-DimensionalData.IEEETrans.Knowl.DataEng.,24(2):279–294,2012.

[353]L.Feng,H.J.Lu,J.X.Yu,andJ.Han.Mininginter-transactionassociationswithtemplates.InProc.ofthe8thIntl.Conf.onInformationandKnowledgeManagement,pages225–233,KansasCity,Missouri,Nov1999.

5.10 Exercises

1. For each of the following questions, provide an example of an association rule from the market basket domain that satisfies the following conditions. Also, describe whether such rules are subjectively interesting.

a. A rule that has high support and high confidence.

b. A rule that has reasonably high support but low confidence.

c. A rule that has low support and low confidence.

d. A rule that has low support and high confidence.

2. Consider the data set shown in Table 5.20.

Table 5.20. Example of market basket transactions.

CustomerID   TransactionID   ItemsBought

1   0001   {a,d,e}

1   0024   {a,b,c,e}

2   0012   {a,b,d,e}

2   0031   {a,c,d,e}

3   0015   {b,c,e}

3   0022   {b,d,e}

4   0029   {c,d}

4   0040   {a,b,c}

5   0033   {a,d,e}

5   0038   {a,b,e}

a. Compute the support for itemsets {e}, {b,d}, and {b,d,e} by treating each transaction ID as a market basket.

b. Use the results in part (a) to compute the confidence for the association rules {b,d} → {e} and {e} → {b,d}. Is confidence a symmetric measure?

c. Repeat part (a) by treating each customer ID as a market basket. Each item should be treated as a binary variable (1 if an item appears in at least one transaction bought by the customer, and 0 otherwise).

d. Use the results in part (c) to compute the confidence for the association rules {b,d} → {e} and {e} → {b,d}.

e. Suppose s1 and c1 are the support and confidence values of an association rule r when treating each transaction ID as a market basket. Also, let s2 and c2 be the support and confidence values of r when treating each customer ID as a market basket. Discuss whether there are any relationships between s1 and s2 or c1 and c2.

3.

a. What is the confidence for the rules ∅ → A and A → ∅?

b. Let c1, c2, and c3 be the confidence values of the rules {p} → {q}, {p} → {q,r}, and {p,r} → {q}, respectively. If we assume that c1, c2, and c3 have different values, what are the possible relationships that may exist among c1, c2, and c3? Which rule has the lowest confidence?

c. Repeat the analysis in part (b) assuming that the rules have identical support. Which rule has the highest confidence?

d. Transitivity: Suppose the confidence of the rules A → B and B → C are larger than some threshold, minconf. Is it possible that A → C has a confidence less than minconf?

4. For each of the following measures, determine whether it is monotone, anti-monotone, or non-monotone (i.e., neither monotone nor anti-monotone).

Example: Support, s = σ(X)/|T|, is anti-monotone because s(X) ≥ s(Y) whenever X ⊂ Y.

a. A characteristic rule is a rule of the form {p} → {q1, q2, …, qn}, where the rule antecedent contains only a single item. An itemset of size k can produce up to k characteristic rules. Let ζ be the minimum confidence of all characteristic rules generated from a given itemset:

ζ({p1, p2, …, pk}) = min[ c({p1} → {p2, p3, …, pk}), …, c({pk} → {p1, p2, …, pk−1}) ].

Is ζ monotone, anti-monotone, or non-monotone?

b. A discriminant rule is a rule of the form {p1, p2, …, pn} → {q}, where the rule consequent contains only a single item. An itemset of size k can produce up to k discriminant rules. Let η be the minimum confidence of all discriminant rules generated from a given itemset:

η({p1, p2, …, pk}) = min[ c({p2, p3, …, pk} → {p1}), …, c({p1, p2, …, pk−1} → {pk}) ].

Is η monotone, anti-monotone, or non-monotone?

c. Repeat the analysis in parts (a) and (b) by replacing the min function with a max function.
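The support and confidence quantities used in Exercises 2-4 can be checked mechanically. The following Python sketch is not part of the original exercise set; the dictionary of baskets simply transcribes Table 5.20, and the rule {b,d} → {e} is used only as an example.

# Transactions from Table 5.20, keyed by transaction ID.
transactions = {
    "0001": {"a", "d", "e"}, "0024": {"a", "b", "c", "e"},
    "0012": {"a", "b", "d", "e"}, "0031": {"a", "c", "d", "e"},
    "0015": {"b", "c", "e"}, "0022": {"b", "d", "e"},
    "0029": {"c", "d"}, "0040": {"a", "b", "c"},
    "0033": {"a", "d", "e"}, "0038": {"a", "b", "e"},
}

def support(itemset, baskets):
    """Fraction of baskets that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(lhs, rhs, baskets):
    """c(lhs -> rhs) = s(lhs union rhs) / s(lhs)."""
    return support(set(lhs) | set(rhs), baskets) / support(lhs, baskets)

baskets = list(transactions.values())
print(support({"b", "d", "e"}, baskets))        # support of {b,d,e}
print(confidence({"b", "d"}, {"e"}, baskets))   # c({b,d} -> {e})
print(confidence({"e"}, {"b", "d"}, baskets))   # c({e} -> {b,d})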

5. Prove Equation 5.3. (Hint: First, count the number of ways to create an itemset that forms the left-hand side of the rule. Next, for each size k itemset selected for the left-hand side, count the number of ways to choose the remaining d−k items to form the right-hand side of the rule.) Assume that neither of the itemsets of a rule are empty.

6. Consider the market basket transactions shown in Table 5.21.

a. What is the maximum number of association rules that can be extracted from this data (including rules that have zero support)?

b. What is the maximum size of frequent itemsets that can be extracted (assuming minsup > 0)?

Table 5.21. Market basket transactions.

TransactionID   ItemsBought

1   {Milk, Beer, Diapers}

2   {Bread, Butter, Milk}

3   {Milk, Diapers, Cookies}

4   {Bread, Butter, Cookies}

5   {Beer, Cookies, Diapers}

6   {Milk, Diapers, Bread, Butter}

7   {Bread, Butter, Diapers}

8   {Beer, Diapers}

9   {Milk, Diapers, Bread, Butter}

10   {Beer, Cookies}

c. Write an expression for the maximum number of size-3 itemsets that can be derived from this data set.

d. Find an itemset (of size 2 or larger) that has the largest support.

e. Find a pair of items, a and b, such that the rules {a} → {b} and {b} → {a} have the same confidence.

7. Show that if a candidate k-itemset X has a subset of size less than k−1 that is infrequent, then at least one of the (k−1)-size subsets of X is necessarily infrequent.

8. Consider the following set of frequent 3-itemsets:

{1,2,3}, {1,2,4}, {1,2,5}, {1,3,4}, {1,3,5}, {2,3,4}, {2,3,5}, {3,4,5}.

Assume that there are only five items in the data set.

a. List all candidate 4-itemsets obtained by a candidate generation procedure using the Fk−1 × F1 merging strategy.

b. List all candidate 4-itemsets obtained by the candidate generation procedure in Apriori.

c. List all candidate 4-itemsets that survive the candidate pruning step of the Apriori algorithm.

9. The Apriori algorithm uses a generate-and-count strategy for deriving frequent itemsets. Candidate itemsets of size k+1 are created by joining a pair of frequent itemsets of size k (this is known as the candidate generation step). A candidate is discarded if any one of its subsets is found to be infrequent during the candidate pruning step. Suppose the Apriori algorithm is applied to the data set shown in Table 5.22 with minsup = 30%, i.e., any itemset occurring in less than 3 transactions is considered to be infrequent. (A small illustrative code sketch of the generation and pruning steps appears after this exercise.)

Table 5.22. Example of market basket transactions.

TransactionID   ItemsBought

1   {a,b,d,e}

2   {b,c,d}

3   {a,b,d,e}

4   {a,c,d,e}

5   {b,c,d,e}

6   {b,d,e}

7   {c,d}

8   {a,b,c}

9   {a,d,e}

10   {b,d}

a. DrawanitemsetlatticerepresentingthedatasetgiveninTable5.22 .Labeleachnodeinthelatticewiththefollowingletter(s):

N:IftheitemsetisnotconsideredtobeacandidateitemsetbytheApriorialgorithm.Therearetworeasonsforanitemsetnottobeconsideredasacandidateitemset:(1)itisnotgeneratedatallduringthecandidategenerationstep,or(2)itisgeneratedduringthecandidategenerationstepbutissubsequentlyremovedduringthecandidatepruningstepbecauseoneofitssubsetsisfoundtobeinfrequent.

F:IfthecandidateitemsetisfoundtobefrequentbytheApriorialgorithm.

I:Ifthecandidateitemsetisfoundtobeinfrequentaftersupportcounting.

b. Whatisthepercentageoffrequentitemsets(withrespecttoallitemsetsinthelattice)?

c. WhatisthepruningratiooftheApriorialgorithmonthisdataset?(Pruningratioisdefinedasthepercentageofitemsetsnotconsideredtobeacandidatebecause(1)theyarenotgeneratedduringcandidategenerationor(2)theyareprunedduringthecandidatepruningstep.)

d. Whatisthefalsealarmrate(i.e.,percentageofcandidateitemsetsthatarefoundtobeinfrequentafterperformingsupportcounting)?
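The generate-and-prune procedure referenced in Exercises 8 and 9 can be sketched as follows in Python. This is an illustration only; the frequent 2-itemsets in the example are made up and are not taken from Table 5.22.

from itertools import combinations

def apriori_gen(Fk):
    """Apriori candidate generation: merge two frequent k-itemsets that
    share their first k-1 items (itemsets are sorted tuples)."""
    Fk = sorted(Fk)
    candidates = []
    for i in range(len(Fk)):
        for j in range(i + 1, len(Fk)):
            if Fk[i][:-1] == Fk[j][:-1]:
                candidates.append(Fk[i] + (Fk[j][-1],))
    return candidates

def prune(candidates, Fk):
    """Candidate pruning: keep a candidate only if all of its k-subsets are frequent."""
    frequent = set(Fk)
    return [c for c in candidates
            if all(s in frequent for s in combinations(c, len(c) - 1))]

F2 = [(1, 2), (1, 3), (1, 4), (2, 3), (3, 4)]   # hypothetical frequent 2-itemsets
C3 = apriori_gen(F2)
print(C3)            # [(1, 2, 3), (1, 2, 4), (1, 3, 4)]
print(prune(C3, F2)) # [(1, 2, 3), (1, 3, 4)]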

10.TheApriorialgorithmusesahashtreedatastructuretoefficientlycountthesupportofcandidateitemsets.Considerthehashtreeforcandidate3-itemsetsshowninFigure5.32 .

Figure5.32.Anexampleofahashtreestructure.

a. Givenatransactionthatcontainsitems{1,3,4,5,8},whichofthehashtreeleafnodeswillbevisitedwhenfindingthecandidatesofthetransaction?

b. Usethevisitedleafnodesinpart(a)todeterminethecandidateitemsetsthatarecontainedinthetransaction{1,3,4,5,8}.

11.Considerthefollowingsetofcandidate3-itemsets:

{1,2,3},{1,2,6},{1,3,4},{2,3,4},{2,4,5},{3,4,6},{4,5,6}

a. Constructahashtreefortheabovecandidate3-itemsets.Assumethetreeusesahashfunctionwhereallodd-numbereditemsarehashedtotheleftchildofanode,whiletheeven-numbereditemsarehashedtotherightchild.Acandidatek-itemsetisinsertedintothetreebyhashingoneachsuccessiveiteminthecandidateandthenfollowingtheappropriatebranchofthetreeaccordingtothehashvalue.Oncealeafnodeisreached,thecandidateisinsertedbasedononeofthefollowingconditions:

Condition1:Ifthedepthoftheleafnodeisequaltok(therootisassumedtobeatdepth0),thenthecandidateisinsertedregardlessofthenumberofitemsetsalreadystoredatthenode.

Condition 2: If the depth of the leaf node is less than k, then the candidate can be inserted as long as the number of itemsets stored at the node is less than maxsize. Assume maxsize = 2 for this question.

Condition 3: If the depth of the leaf node is less than k and the number of itemsets stored at the node is equal to maxsize, then the leaf node is converted into an internal node. New leaf nodes are created as children of the old leaf node. Candidate itemsets previously stored in the old leaf node are distributed to the children based on their hash values. The new candidate is also hashed to its appropriate leaf node.

b. Howmanyleafnodesarethereinthecandidatehashtree?Howmanyinternalnodesarethere?

c. Consideratransactionthatcontainsthefollowingitems:{1,2,3,5,6}.Usingthehashtreeconstructedinpart(a),whichleafnodeswillbecheckedagainstthetransaction?Whatarethecandidate3-itemsetscontainedinthetransaction?

12.GiventhelatticestructureshowninFigure5.33 andthetransactionsgiveninTable5.22 ,labeleachnodewiththefollowingletter(s):

Figure5.33.Anitemsetlattice

Mifthenodeisamaximalfrequentitemset,

Cifitisaclosedfrequentitemset,

Nifitisfrequentbutneithermaximalnorclosed,and

Iifitisinfrequent

Assumethatthesupportthresholdisequalto30%.

13.Theoriginalassociationruleminingformulationusesthesupportandconfidencemeasurestopruneuninterestingrules.

a. DrawacontingencytableforeachofthefollowingrulesusingthetransactionsshowninTable5.23 .

Table5.23.Exampleofmarketbaskettransactions.

TransactionID ItemsBought

1 {a,b,d,e}

2 {b,c,d}

3 {a,b,d,e}

4 {a,c,d,e}

5 {b,c,d,e}

6 {b,d,e}

7 {c,d}

8 {a,b,c}

9 {a,d,e}

10 {b,d}

Rules: {b} → {c}, {a} → {d}, {b} → {d}, {e} → {c}, {c} → {a}.

b. Use the contingency tables in part (a) to compute and rank the rules in decreasing order according to the following measures.

i. Support.

ii. Confidence.

iii. Interest(X → Y) = P(X,Y) / (P(X)P(Y)).

iv. IS(X → Y) = P(X,Y) / √(P(X)P(Y)).

v. Klosgen(X → Y) = √P(X,Y) × max(P(Y|X) − P(Y), P(X|Y) − P(X)), where P(Y|X) = P(X,Y)/P(X).

vi. Odds ratio(X → Y) = P(X,Y)P(X¯,Y¯) / (P(X,Y¯)P(X¯,Y)).

14. Given the rankings you had obtained in Exercise 13, compute the correlation between the rankings of confidence and the other five measures. Which measure is most highly correlated with confidence? Which measure is least correlated with confidence?

15. Answer the following questions using the data sets shown in Figure 5.34. Note that each data set contains 1000 items and 10,000 transactions. Dark cells indicate the presence of items and white cells indicate the absence of items. We will apply the Apriori algorithm to extract frequent itemsets with minsup = 10% (i.e., itemsets must be contained in at least 1000 transactions).

Figure5.34.FiguresforExercise15.

a. Whichdataset(s)willproducethemostnumberoffrequentitemsets?

b. Whichdataset(s)willproducethefewestnumberoffrequentitemsets?

c. Whichdataset(s)willproducethelongestfrequentitemset?

d. Whichdataset(s)willproducefrequentitemsetswithhighestmaximumsupport?

e. Whichdataset(s)willproducefrequentitemsetscontainingitemswithwide-varyingsupportlevels(i.e.,itemswithmixedsupport,rangingfromlessthan20%tomorethan70%)?

16.

a. Prove that the ϕ coefficient is equal to 1 if and only if f11 = f1+ = f+1.

b. Show that if A and B are independent, then P(A,B) × P(A¯,B¯) = P(A,B¯) × P(A¯,B).

c. Show that Yule's Q and Y coefficients

Q = [f11 f00 − f10 f01] / [f11 f00 + f10 f01],

Y = [√(f11 f00) − √(f10 f01)] / [√(f11 f00) + √(f10 f01)]

are normalized versions of the odds ratio.

d. Write a simplified expression for the value of each measure shown in Table 5.9 when the variables are statistically independent.

17. Consider the interestingness measure, M = (P(B|A) − P(B)) / (1 − P(B)), for an association rule A → B.

a. What is the range of this measure? When does the measure attain its maximum and minimum values?

b. How does M behave when P(A,B) is increased while P(A) and P(B) remain unchanged?

c. How does M behave when P(A) is increased while P(A,B) and P(B) remain unchanged?

d. How does M behave when P(B) is increased while P(A,B) and P(A) remain unchanged?

e. Is the measure symmetric under variable permutation?

f. What is the value of the measure when A and B are statistically independent?

g. Is the measure null-invariant?

h. Does the measure remain invariant under row or column scaling operations?

i. How does the measure behave under the inversion operation?
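The contingency-table measures that appear throughout Exercises 13-21 can all be computed from the four counts f11, f10, f01, f00. The following Python sketch is an illustration only (it is not part of the exercises); the example call uses Table I of Exercise 20.

from math import sqrt

def measures(f11, f10, f01, f00):
    """Several association measures for the pattern {A,B} from a 2x2 table."""
    n = f11 + f10 + f01 + f00
    pa, pb, pab = (f11 + f10) / n, (f11 + f01) / n, f11 / n
    return {
        "support": pab,
        "confidence(A->B)": pab / pa,
        "interest": pab / (pa * pb),
        "IS": pab / sqrt(pa * pb),
        "phi": (pab - pa * pb) / sqrt(pa * pb * (1 - pa) * (1 - pb)),
        "odds_ratio": (f11 * f00) / (f10 * f01) if f10 * f01 else float("inf"),
        "M": ((pab / pa) - pb) / (1 - pb),   # the measure from Exercise 17
    }

# Example: Table I of Exercise 20 (f11=9, f10=1, f01=1, f00=89).
print(measures(9, 1, 1, 89))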

18. Suppose we have market basket data consisting of 100 transactions and 20 items. Assume the support for item a is 25%, the support for item b is 90% and the support for itemset {a,b} is 20%. Let the support and confidence thresholds be 10% and 60%, respectively.

a. Compute the confidence of the association rule {a} → {b}. Is the rule interesting according to the confidence measure?

b. Compute the interest measure for the association pattern {a,b}. Describe the nature of the relationship between item a and item b in terms of the interest measure.

c. What conclusions can you draw from the results of parts (a) and (b)?

d. Prove that if the confidence of the rule {a} → {b} is less than the support of {b}, then:

i. c({a¯} → {b}) > c({a} → {b}),

ii. c({a¯} → {b}) > s({b}),

where c(·) denotes the rule confidence and s(·) denotes the support of an itemset.

19. Table 5.24 shows a 2 × 2 × 2 contingency table for the binary variables A and B at different values of the control variable C.

Table 5.24. A Contingency Table.

               A = 1   A = 0

C = 0   B = 1      0      15

        B = 0     15      30

C = 1   B = 1      5       0

        B = 0      0      15

a. Compute the ϕ coefficient for A and B when C = 0, C = 1, and C = 0 or 1. Note that

ϕ({A,B}) = (P(A,B) − P(A)P(B)) / √(P(A)P(B)(1 − P(A))(1 − P(B))).

b. What conclusions can you draw from the above result?

20. Consider the contingency tables shown in Table 5.25.

a. For Table I, compute support, the interest measure, and the ϕ correlation coefficient for the association pattern {A,B}. Also, compute the confidence of rules A → B and B → A.

b. For Table II, compute support, the interest measure, and the ϕ correlation coefficient for the association pattern {A,B}. Also, compute the confidence of rules A → B and B → A.

Table 5.25. Contingency tables for Exercise 20. (a) Table I.

          B = 1   B = 0

A = 1         9       1

A = 0         1      89

(b) Table II.

          B = 1   B = 0

A = 1        89       1

A = 0         1       9

c. What conclusions can you draw from the results of (a) and (b)?

21. Consider the relationship between customers who buy high-definition televisions and exercise machines as shown in Tables 5.17 and 5.18.

a. Compute the odds ratios for both tables.

b. Compute the ϕ-coefficient for both tables.

c. Compute the interest factor for both tables.

For each of the measures given above, describe how the direction of association changes when data is pooled together instead of being stratified.

6AssociationAnalysis:AdvancedConcepts

Theassociationruleminingformulationdescribedinthepreviouschapterassumesthattheinputdataconsistsofbinaryattributescalleditems.Thepresenceofaniteminatransactionisalsoassumedtobemoreimportantthanitsabsence.Asaresult,anitemistreatedasanasymmetricbinaryattributeandonlyfrequentpatternsareconsideredinteresting.

Thischapterextendstheformulationtodatasetswithsymmetricbinary,categorical,andcontinuousattributes.Theformulationwillalsobeextendedtoincorporatemorecomplexentitiessuchassequencesandgraphs.Althoughtheoverallstructureofassociationanalysisalgorithmsremainsunchanged,certainaspectsofthealgorithmsmustbemodifiedtohandlethenon-traditionalentities.

6.1 Handling Categorical Attributes

There are many applications that contain symmetric binary and nominal attributes. The Internet survey data shown in Table 6.1 contains symmetric binary attributes such as Shop Online and Privacy Concerns, as well as nominal attributes such as Level of Education and State. Using association analysis, we may uncover interesting information about the characteristics of Internet users such as

{Shop Online = Yes} → {Privacy Concerns = Yes}.

Table6.1.Internetsurveydatawithcategoricalattributes.

Gender LevelofEducation

State ComputeratHome

ChatOnline

ShopOnline

PrivacyConcerns

Female Graduate Illinois Yes Yes Yes Yes

Male College California No No No No

Male Graduate Michigan Yes Yes Yes Yes

Female College Virginia No No Yes Yes

Female Graduate California Yes No No Yes

Male College Minnesota Yes Yes Yes Yes

Male College Alaska Yes Yes Yes No

Male HighSchool Oregon Yes No No No

Female Graduate Texas No Yes No No

… … … … … … …

ThisrulesuggeststhatmostInternetuserswhoshoponlineareconcernedabouttheirpersonalprivacy.

To extract such patterns, the categorical and symmetric binary attributes are transformed into "items" first, so that existing association rule mining algorithms can be applied. This type of transformation can be performed by creating a new item for each distinct attribute-value pair. For example, the nominal attribute Level of Education can be replaced by three binary items: Education = Graduate, Education = College, and Education = High School. Similarly, symmetric binary attributes such as Gender can be converted into a pair of binary items, Male and Female. Table 6.2 shows the result of binarizing the Internet survey data; a small code sketch of this transformation follows the table.

Table6.2.Internetsurveydataafterbinarizingcategoricalandsymmetricbinaryattributes.

Male Female Education=Graduate

Education=College

… Privacy=Yes

Privacy=No

0 1 1 0 … 1 0

1 0 0 1 … 0 1

1 0 1 0 … 1 0

0 1 0 1 … 1 0

0 1 1 0 … 1 0

1 0 0 1 … 1 0

1 0 0 1 … 0 1

1 0 0 0 … 0 1

0 1 1 0 … 0 1

… … … … … … …
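As a concrete illustration of the attribute-to-item transformation described above, the following Python sketch creates one asymmetric binary item per attribute-value pair. It is a minimal sketch, not the book's implementation; the two records are a hypothetical excerpt in the spirit of Table 6.1.

# Turn categorical records into sets of "items", one per attribute-value pair.
records = [
    {"Gender": "Female", "Education": "Graduate", "Privacy": "Yes"},
    {"Gender": "Male",   "Education": "College",  "Privacy": "No"},
]

def binarize(record):
    """Map a categorical record to a set of attribute=value items."""
    return {f"{attr}={value}" for attr, value in record.items()}

transactions = [binarize(r) for r in records]
print(transactions[0])   # {'Gender=Female', 'Education=Graduate', 'Privacy=Yes'}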

Thereareseveralissuestoconsiderwhenapplyingassociationanalysistothebinarizeddata:

1. Someattributevaluesmaynotbefrequentenoughtobepartofafrequentpattern.Thisproblemismoreevidentfornominalattributesthathavemanypossiblevalues,e.g.,statenames.Loweringthesupportthresholddoesnothelpbecauseitexponentiallyincreasesthenumberoffrequentpatternsfound(manyofwhichmaybespurious)andmakesthecomputationmoreexpensive.Amorepracticalsolutionistogrouprelatedattributevaluesintoasmallnumberofcategories.Forexample,eachstatenamecanbereplacedbyitscorrespondinggeographicalregion,suchas ,and .Anotherpossibilityistoaggregatethelessfrequentattributevaluesintoasinglecategorycalled ,asshowninFigure6.1 .

Figure6.1.Apiechartwithamergedcategorycalled .

2. Someattributevaluesmayhaveconsiderablyhigherfrequenciesthanothers.Forexample,suppose85%ofthesurveyparticipantsownahomecomputer.Bycreatingabinaryitemforeachattributevaluethatappearsfrequentlyinthedata,wemaypotentiallygeneratemanyredundantpatterns,asillustratedbythefollowingexample:

Theruleisredundantbecauseitissubsumedbythemoregeneralrulegivenatthebeginningofthissection.Becausethehigh-frequencyitemscorrespondtothetypicalvaluesofanattribute,theyseldomcarryanynewinformationthatcanhelpustobetterunderstandthepattern.Itmaythereforebeusefultoremovesuchitemsbeforeapplying

standardassociationanalysisalgorithms.AnotherpossibilityistoapplythetechniquespresentedinSection5.8 forhandlingdatasetswithawiderangeofsupportvalues.

3. Althoughthewidthofeverytransactionisthesameasthenumberofattributesintheoriginaldata,thecomputationtimemayincreaseespeciallywhenmanyofthenewlycreateditemsbecomefrequent.Thisisbecausemoretimeisneededtodealwiththeadditionalcandidateitemsetsgeneratedbytheseitems(seeExercise1 onpage510).Onewaytoreducethecomputationtimeistoavoidgeneratingcandidateitemsetsthatcontainmorethanoneitemfromthesameattribute.Forexample,wedonothavetogenerateacandidateitemsetsuchas becausethesupportcountoftheitemsetiszero.

6.2HandlingContinuousAttributesTheInternetsurveydatadescribedintheprevioussectionmayalsocontaincontinuousattributessuchastheonesshowninTable6.3 .Miningthecontinuousattributesmayrevealusefulinsightsaboutthedatasuchas“userswhoseannualincomeismorethan$120Kbelongtothe45–60agegroup”or“userswhohavemorethan3emailaccountsandspendmorethan15hoursonlineperweekareoftenconcernedabouttheirpersonalprivacy.”Associationrulesthatcontaincontinuousattributesarecommonlyknownasquantitativeassociationrules.

Table6.3.Internetsurveydatawithcontinuousattributes.

Gender … Age AnnualIncome

No.ofHoursSpentOnlineperWeek

No.ofEmailAccounts

PrivacyConcern

Female … 26 90K 20 4 Yes

Male … 51 135K 10 2 No

Male … 29 80K 10 3 Yes

Female … 45 120K 15 3 Yes

Female … 31 95K 20 5 Yes

Male … 25 55K 25 5 Yes

Male … 37 100K 10 1 No

Male … 41 65K 8 2 No

Female … 26 85K 12 1 No

… … … … … … …

Thissectiondescribesthevariousmethodologiesforapplyingassociationanalysistocontinuousdata.Wewillspecificallydiscussthreetypesofmethods:(1)discretization-basedmethods,(2)statistics-basedmethods,and(3)nondiscretizationmethods.Thequantitativeassociationrulesderivedusingthesemethodsarequitedifferentinnature.

6.2.1Discretization-BasedMethods

Discretization is the most common approach for handling continuous attributes. This approach groups the adjacent values of a continuous attribute into a finite number of intervals. For example, the Age attribute can be divided into the following intervals: Age ∈ [12,16), Age ∈ [16,20), Age ∈ [20,24), …, Age ∈ [56,60), where [a,b) represents an interval that includes a but not b. Discretization can be performed using any of the techniques described in Section 2.3.6 (equal interval width, equal frequency, entropy-based, or clustering). The discrete intervals are then mapped into asymmetric binary attributes so that existing association analysis algorithms can be applied. Table 6.4 shows the Internet survey data after discretization and binarization.

Table6.4.Internetsurveydataafterbinarizingcategoricalandcontinuousattributes.

Male Female … Age<13

Age∈[13,21)

Age∈[21,30)

… Privacy=Yes

Privacy=No

0 1 … 0 0 1 … 1 0

1 0 … 0 0 0 … 0 1

1 0 … 0 0 1 … 1 0

0 1 … 0 0 0 … 1 0

0 1 … 0 0 0 … 1 0

1 0 … 0 0 1 … 1 0

1 0 … 0 0 0 … 0 1

1 0 … 0 0 0 … 0 1

0 1 … 0 0 1 … 0 1

… … … … … … … … …
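The discretize-then-binarize step can be sketched in a few lines of Python. The bin edges below mirror the [12,16), [16,20), …, [56,60) intervals used in the text; the sample ages are illustrative and are not taken from Table 6.3.

ages = [26, 51, 29, 45, 31]
edges = list(range(12, 64, 4))           # 12, 16, 20, ..., 60

def age_item(age):
    """Return the binary item ('Age in [lo,hi)') whose interval contains age."""
    for lo, hi in zip(edges, edges[1:]):
        if lo <= age < hi:
            return f"Age∈[{lo},{hi})"
    return None                           # age falls outside the discretized range

print([age_item(a) for a in ages])
# ['Age∈[24,28)', 'Age∈[48,52)', 'Age∈[28,32)', 'Age∈[44,48)', 'Age∈[28,32)']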

Akeyparameterinattributediscretizationisthenumberofintervalsusedtopartitioneachattribute.Thisparameteristypicallyprovidedbytheusersandcanbeexpressedintermsoftheintervalwidth(fortheequalintervalwidthapproach),theaveragenumberoftransactionsperinterval(fortheequalfrequencyapproach),orthenumberofdesiredclusters(fortheclustering-basedapproach).ThedifficultyindeterminingtherightnumberofintervalscanbeillustratedusingthedatasetshowninTable6.5 ,whichsummarizestheresponsesof250userswhoparticipatedinthesurvey.Therearetwostrongrulesembeddedinthedata:

Table6.5.AbreakdownofInternetuserswhoparticipatedinonlinechataccordingtotheiragegroup.

AgeGroup ChatOnline=Yes ChatOnline=No

R1:Age∈[16,24)→Chat Online=Yes(s=8.8%, c=81.5%).R2:Age∈[44,60)→Chat

[12,16) 12 13

[16,20) 11 2

[20,24) 11 3

[24,28) 12 13

[28,32) 14 12

[32,36) 15 12

[36,40) 16 14

[40,44) 16 14

[44,48) 4 10

[48,52) 5 11

[52,56) 5 10

[56,60) 4 11

Theserulessuggestthatmostoftheusersfromtheagegroupof16–24oftenparticipateinonlinechatting,whilethosefromtheagegroupof44–60arelesslikelytochatonline.Inthisexample,weconsideraruletobeinterestingonlyifitssupport(s)exceeds5%anditsconfidence(c)exceeds65%.Oneoftheproblemsencounteredwhendiscretizingthe attributeishowtodeterminetheintervalwidth.

1. If the interval is too wide, then we may lose some patterns because of their lack of confidence. For example, when the interval width is 24 years, R1 and R2 are replaced by the following rules:

R′1: Age ∈ [12,36) → Chat Online = Yes (s = 30%, c = 57.7%).
R′2: Age ∈ [36,60) → Chat Online = No (s = 28%, c = 58.3%).

Despite their higher supports, the wider intervals have caused the confidence for both rules to drop below the minimum confidence threshold. As a result, both patterns are lost after discretization.

2. If the interval is too narrow, then we may lose some patterns because of their lack of support. For example, if the interval width is 4 years, then R1 is broken up into the following two subrules:

R11(4): Age ∈ [16,20) → Chat Online = Yes (s = 4.4%, c = 84.6%).
R12(4): Age ∈ [20,24) → Chat Online = Yes (s = 4.4%, c = 78.6%).

Since the supports for the subrules are less than the minimum support threshold, R1 is lost after discretization. Similarly, the rule R2, which is broken up into four subrules, will also be lost because the support of each subrule is less than the minimum support threshold.

3. If the interval width is 8 years, then the rule R2 is broken up into the following two subrules:

R21(8): Age ∈ [44,52) → Chat Online = No (s = 8.4%, c = 70%).
R22(8): Age ∈ [52,60) → Chat Online = No (s = 8.4%, c = 70%).

Since R21(8) and R22(8) have sufficient support and confidence, R2 can be recovered by aggregating both subrules. Meanwhile, R1 is broken up into the following two subrules:

R11(8): Age ∈ [12,20) → Chat Online = Yes (s = 9.2%, c = 60.5%).
R12(8): Age ∈ [20,28) → Chat Online = Yes (s = 9.2%, c = 59.0%).

Unlike R2, we cannot recover the rule R1 by aggregating the subrules because both subrules fail the confidence threshold.

One way to address these issues is to consider every possible grouping of adjacent intervals. For example, we can start with an interval width of 4 years and then merge the adjacent intervals into wider intervals: Age ∈ [12,16), Age ∈ [12,20), …, Age ∈ [12,60), Age ∈ [16,20), Age ∈ [16,24), etc. This

approach enables the detection of both R1 and R2 as strong rules. However, it also leads to the following computational issues:

1. The computation becomes extremely expensive. If the range is initially divided into k intervals, then k(k−1)/2 binary items must be generated to represent all possible intervals. Furthermore, if an item corresponding to the interval [a,b) is frequent, then all other items corresponding to intervals that subsume [a,b) must be frequent too. This approach can therefore generate far too many candidate and frequent itemsets. To address these problems, a maximum support threshold can be applied to prevent the creation of items corresponding to very wide intervals and to reduce the number of itemsets.

2. Many redundant rules are extracted. For example, consider the following pair of rules:

R3: {Age ∈ [16,20), Gender = Male} → {Chat Online = Yes},
R4: {Age ∈ [16,24), Gender = Male} → {Chat Online = Yes}.

R4 is a generalization of R3 (and R3 is a specialization of R4) because R4 has a wider interval for the Age attribute. If the confidence values for both rules are the same, then R4 should be more interesting because it covers more examples, including those for R3. R3 is therefore a redundant rule.

6.2.2 Statistics-Based Methods

Quantitative association rules can be used to infer the statistical properties of a population. For example, suppose we are interested in finding the average age of certain groups of Internet users based on the data provided in Tables

6.1 and 6.3. Using the statistics-based method described in this section, quantitative association rules such as the following can be extracted:

{Annual Income > $100K, Shop Online = Yes} → Age: Mean = 38.

The rule states that the average age of Internet users whose annual income exceeds $100K and who shop online regularly is 38 years old.

Rule Generation

To generate the statistics-based quantitative association rules, the target attribute used to characterize interesting segments of the population must be specified. By withholding the target attribute, the remaining categorical and continuous attributes in the data are binarized using the methods described in the previous section. Existing algorithms such as Apriori or FP-growth are then applied to extract frequent itemsets from the binarized data. Each frequent itemset identifies an interesting segment of the population. The distribution of the target attribute in each segment can be summarized using descriptive statistics such as mean, median, variance, or absolute deviation. For example, the preceding rule is obtained by averaging the age of Internet users who support the frequent itemset {Annual Income > $100K, Shop Online = Yes}.

The number of quantitative association rules discovered using this method is the same as the number of extracted frequent itemsets. Because of the way the quantitative association rules are defined, the notion of confidence is not applicable to such rules. An alternative method for validating the quantitative association rules is presented next.

Rule Validation

A quantitative association rule is interesting only if the statistics computed from transactions covered by the rule are different than those computed from transactions not covered by the rule. For example, the rule given at the beginning of this section is interesting only if the average age of Internet users who do not support the frequent itemset {Annual Income > $100K, Shop Online = Yes} is significantly higher or lower than 38 years old. To determine whether the difference in their average ages is statistically significant, statistical hypothesis testing methods should be applied.

Consider the quantitative association rule, A → t: μ, where A is a frequent itemset, t is the continuous target attribute, and μ is the average value of t among transactions covered by A. Furthermore, let μ′ denote the average value of t among transactions not covered by A. The goal is to test whether the difference between μ and μ′ is greater than some user-specified threshold, Δ. In statistical hypothesis testing, two opposite propositions, known as the null hypothesis and the alternative hypothesis, are given. A hypothesis test is performed to determine which of these two hypotheses should be accepted, based on evidence gathered from the data (see Appendix C).

In this case, assuming that μ < μ′, the null hypothesis is H0: μ′ = μ + Δ, while the alternative hypothesis is H1: μ′ > μ + Δ. To determine which hypothesis should be accepted, the following Z-statistic is computed:

Z = (μ′ − μ − Δ) / √(s1²/n1 + s2²/n2),    (6.1)

where n1 is the number of transactions supporting A, n2 is the number of transactions not supporting A, s1 is the standard deviation for t among transactions that support A, and s2 is the standard deviation for t among transactions that do not support A. Under the null hypothesis, Z has a standard normal distribution with mean 0 and variance 1. The value of Z computed using Equation 6.1 is then compared against a critical value, Zα, which is a threshold that depends on the desired confidence level. If Z > Zα, then the null hypothesis is rejected and we may conclude that the quantitative association rule is interesting. Otherwise, there is not enough evidence in the data to show that the difference in mean is statistically significant.
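The hypothesis test of Equation 6.1 is easy to code. The following Python sketch is illustrative only; the group with the larger mean is passed first, matching the arithmetic used in Example 6.1 below, and 1.64 is the critical value for a one-sided test at the 95% confidence level.

from math import sqrt

def z_statistic(mean1, mean2, sd1, sd2, n1, n2, delta):
    """Z-statistic of Equation 6.1; mean1 is the larger of the two group means."""
    return (mean1 - mean2 - delta) / sqrt(sd1**2 / n1 + sd2**2 / n2)

# Numbers from Example 6.1: 50 supporting users with mean age 38 (sd 3.5),
# 200 non-supporting users with mean age 30 (sd 6.5), threshold of 5 years.
z = z_statistic(mean1=38, mean2=30, sd1=3.5, sd2=6.5, n1=50, n2=200, delta=5)
print(round(z, 4), z > 1.64)   # 4.4414 True -> the rule is deemed interesting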

Example 6.1. Consider the quantitative association rule

{Income > 100K, Shop Online = Yes} → Age: μ = 38.

Suppose there are 50 Internet users who supported the rule antecedent. The standard deviation of their ages is 3.5. On the other hand, the average age of the 200 users who do not support the rule antecedent is 30 and their standard deviation is 6.5. Assume that a quantitative association rule is considered interesting only if the difference between μ and μ′ is more than 5 years. Using Equation 6.1 we obtain

Z = (38 − 30 − 5) / √(3.5²/50 + 6.5²/200) = 4.4414.

For a one-sided hypothesis test at a 95% confidence level, the critical value for rejecting the null hypothesis is 1.64. Since Z > 1.64, the null hypothesis can be rejected. We therefore conclude that the quantitative association rule is interesting because the difference between the average ages of users who support and do not support the rule antecedent is more than 5 years.

6.2.3 Non-discretization Methods

There are certain applications in which analysts are more interested in finding associations among the continuous attributes, rather than associations among discrete intervals of the continuous attributes. For example, consider the problem of finding word associations in text documents. Table 6.6 shows a document-word matrix where every entry represents the number of times a word appears in a given document. Given such a data matrix, we are interested in finding associations between words (e.g., word1 and word2) instead of associations between ranges of word counts (e.g., word1 ∈ [1,4] and word2 ∈ [2,3]). One way to do this is to transform the data into a 0/1 matrix, where the entry is 1 if the count exceeds some threshold t, and 0 otherwise. While this approach allows analysts to apply existing frequent itemset generation algorithms to the binarized data set, finding the right threshold for binarization can be quite tricky. If the threshold is set too high, it is possible to miss some interesting associations. Conversely, if the threshold is set too low, there is a potential for generating a large number of spurious associations.

Table 6.6. Document-word matrix.

Document   word1   word2   word3   word4   word5   word6

d1   3   6   0   0   0   2

d2   1   2   0   0   0   2

d3   4   2   7   0   0   2

d4   2   0   3   0   0   1

d5   0   0   0   1   1   0

This section presents another methodology for finding associations among continuous attributes, known as the min-Apriori approach. Analogous to

traditional association analysis, an itemset is considered to be a collection of continuous attributes, while its support measures the degree of association among the attributes, across multiple rows of the data matrix. For example, a collection of words in Table 6.6 can be referred to as an itemset, whose support is determined by the co-occurrence of words across documents. In min-Apriori, the association among attributes in a given row of the data matrix is obtained by taking the minimum value of the attributes. For example, the association between words word1 and word2 in the document d1 is given by min(3,6) = 3. The support of an itemset is then computed by aggregating its association over all the documents:

s({word1, word2}) = min(3,6) + min(1,2) + min(4,2) + min(2,0) = 6.

The support measure defined in min-Apriori has the following desired properties, which makes it suitable for finding word associations in documents:

1. Support increases monotonically as the number of occurrences of a word increases.

2. Support increases monotonically as the number of documents that contain the word increases.

3. Support has an anti-monotone property. For example, consider a pair of itemsets {A,B} and {A,B,C}. Since min({A,B}) ≥ min({A,B,C}), s({A,B}) ≥ s({A,B,C}). Therefore, support decreases monotonically as the number of words in an itemset increases.

The standard Apriori algorithm can be modified to find associations among words using the new support definition; a small sketch of this support computation appears below.
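A minimal Python sketch of the min-Apriori support computation described above, assuming the raw counts of Table 6.6 are used directly:

# min-Apriori support: for each document take the minimum count over the words
# in the itemset, then sum over all documents (counts transcribed from Table 6.6).
doc_word = {
    "d1": {"word1": 3, "word2": 6, "word3": 0, "word4": 0, "word5": 0, "word6": 2},
    "d2": {"word1": 1, "word2": 2, "word3": 0, "word4": 0, "word5": 0, "word6": 2},
    "d3": {"word1": 4, "word2": 2, "word3": 7, "word4": 0, "word5": 0, "word6": 2},
    "d4": {"word1": 2, "word2": 0, "word3": 3, "word4": 0, "word5": 0, "word6": 1},
    "d5": {"word1": 0, "word2": 0, "word3": 0, "word4": 1, "word5": 1, "word6": 0},
}

def min_apriori_support(itemset, matrix):
    """Sum over documents of the minimum count among the itemset's words."""
    return sum(min(row[w] for w in itemset) for row in matrix.values())

print(min_apriori_support({"word1", "word2"}, doc_word))            # 6
print(min_apriori_support({"word1", "word2", "word3"}, doc_word))   # anti-monotone: <= 6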

6.3HandlingaConceptHierarchyAconcepthierarchyisamultilevelorganizationofthevariousentitiesorconceptsdefinedinaparticulardomain.Forexample,inmarketbasketanalysis,aconcepthierarchyhastheformofanitemtaxonomydescribingthe“is-a”relationshipsamongitemssoldatagrocerystore—e.g.,milkisakindoffoodandDVDisakindofhomeelectronicsequipment(seeFigure6.2 ).Concepthierarchiesareoftendefinedaccordingtodomainknowledgeorbasedonastandardclassificationschemedefinedbycertainorganizations(e.g.,theLibraryofCongressclassificationschemeisusedtoorganizelibrarymaterialsbasedontheirsubjectcategories).

Figure6.2.Exampleofanitemtaxonomy.

A concept hierarchy can be represented using a directed acyclic graph, as shown in Figure 6.2. If there is an edge in the graph from a node p to another node c, we call p the parent of c and c the child of p. For example, milk is the parent of skim milk because there is a directed edge from the node milk to the node skim milk. X̂ is called an ancestor of X (and X a descendent of X̂) if there is a path from node X̂ to node X in the directed acyclic graph. In the diagram shown in Figure 6.2, food is an ancestor of skim milk and AC adaptor is a descendent of electronics.

Themainadvantagesofincorporatingconcepthierarchiesintoassociationanalysisareasfollows:

1. Itemsatthelowerlevelsofahierarchymaynothaveenoughsupporttoappearinanyfrequentitemset.Forexample,althoughthesaleofACadaptorsanddockingstationsmaybelow,thesaleoflaptopaccessories,whichistheirparentnodeintheconcepthierarchy,maybehigh.Also,rulesinvolvinghigh-levelcategoriesmayhavelowerconfidencethantheonesgeneratedusinglow-levelcategories.Unlesstheconcepthierarchyisused,thereisapotentialtomissinterestingpatternsatdifferentlevelsofcategories.

2. Rulesfoundatthelowerlevelsofaconcepthierarchytendtobeoverlyspecificandmaynotbeasinterestingasrulesatthehigherlevels.Forexample,stapleitemssuchasmilkandbreadtendtoproducemanylow-levelrulessuchas

,and .Usingaconcepthierarchy,theycanbesummarizedintoasinglerule, .Consideringonlyitemsresidingatthetopleveloftheirhierarchiesalsomaynotbegoodenough,becausesuchrulesmaynotbeofanypracticaluse.Forexample,althoughtherule maysatisfythesupportandconfidencethresholds,itisnotinformativebecausethe


combinationofelectronicsandfooditemsthatarefrequentlypurchasedbycustomersareunknown.Ifmilkandbatteriesaretheonlyitemssoldtogetherfrequently,thenthepattern{ }mayhaveovergeneralizedthesituation.

Standard association analysis can be extended to incorporate concept hierarchies in the following way. Each transaction t is initially replaced with its extended transaction t′, which contains all the items in t along with their corresponding ancestors. For example, the transaction {DVD, wheat bread} can be extended to {DVD, wheat bread, home electronics, electronics, bread, food}, where home electronics and electronics are the ancestors of DVD, while bread and food are the ancestors of wheat bread.
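The extended-transaction step can be sketched as follows in Python. The toy hierarchy below is an assumption made for illustration; it mirrors the taxonomy described for Figure 6.2 rather than reproducing it exactly.

# Toy "is-a" hierarchy: child -> parent (illustrative, not Figure 6.2 verbatim).
parent = {
    "wheat bread": "bread", "bread": "food",
    "skim milk": "milk", "milk": "food",
    "DVD": "home electronics", "home electronics": "electronics",
}

def ancestors(item):
    """Return all ancestors of an item by walking up the hierarchy."""
    result = set()
    while item in parent:
        item = parent[item]
        result.add(item)
    return result

def extend(transaction):
    """Extended transaction t': the items of t plus all of their ancestors."""
    extended = set(transaction)
    for item in transaction:
        extended |= ancestors(item)
    return extended

print(extend({"DVD", "wheat bread"}))
# {'DVD', 'wheat bread', 'home electronics', 'electronics', 'bread', 'food'}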

Wecanthenapplyexistingalgorithms,suchasApriori,tothedatabaseofextendedtransactions.Althoughsuchanapproachwouldfindrulesthatspandifferentlevelsoftheconcepthierarchy,itwouldsufferfromseveralobviouslimitationsasdescribedbelow:

1. Itemsresidingatthehigherlevelstendtohavehighersupportcountsthanthoseresidingatthelowerlevelsofaconcepthierarchy.Asaresult,ifthesupportthresholdissettoohigh,thenonlypatternsinvolvingthehigh-levelitemsareextracted.Ontheotherhand,ifthethresholdissettoolow,thenthealgorithmgeneratesfartoomanypatterns(mostofwhichmaybespurious)andbecomescomputationallyinefficient.

2. Introductionofaconcepthierarchytendstoincreasethecomputationtimeofassociationanalysisalgorithmsbecauseofthelargernumberofitemsandwidertransactions.Thenumberofcandidatepatternsandfrequentpatternsgeneratedbythesealgorithmsmayalsogrowexponentiallywithwidertransactions.


3. Introductionofaconcepthierarchymayproduceredundantrules.Arule isredundantifthereexistsamoregeneralrule ,where isanancestorofX, isanancestorofY,andbothruleshaveverysimilarconfidence.Forexample,suppose{ }→{ },{ }→{2% },{ }→{2% },{ }→{skim },and{ }→{ }haveverysimilarconfidence.Therulesinvolvingitemsfromthelowerlevelofthehierarchyareconsideredredundantbecausetheycanbesummarizedbyaruleinvolvingtheancestoritems.Anitemsetsuchas{

}isalsoredundantbecause and areancestorsof.Fortunately,itiseasytoeliminatesuchredundantitemsets

duringfrequentitemsetgeneration,giventheknowledgeofthehierarchy.

X→Y X^→Y^X^ Y^

6.4SequentialPatternsMarketbasketdataoftencontainstemporalinformationaboutwhenanitemwaspurchasedbycustomers.Suchinformationcanbeusedtopiecetogetherthesequenceoftransactionsmadebyacustomeroveracertainperiodoftime.Similarly,event-baseddatacollectedfromscientificexperimentsorthemonitoringofphysicalsystems,suchastelecommunicationsnetworks,computernetworks,andwirelesssensornetworks,haveaninherentsequentialnaturetothem.Thismeansthatanordinalrelation,usuallybasedontemporalprecedence,existsamongeventsoccurringinsuchdata.However,theconceptsofassociationpatternsdiscussedsofaremphasizeonly“co-occurrence”relationshipsanddisregardthesequentialinformationofthedata.Thelatterinformationmaybevaluableforidentifyingrecurringfeaturesofadynamicsystemorpredictingfutureoccurrencesofcertainevents.Thissectionpresentsthebasicconceptofsequentialpatternsandthealgorithmsdevelopedtodiscoverthem.

6.4.1Preliminaries

The input to the problem of discovering sequential patterns is a sequence data set, an example of which is shown on the left-hand side of Figure 6.3. Each row records the occurrences of events associated with a particular object at a given time. For example, the first row contains the set of events occurring at timestamp t = 10 for object A. Note that if we only consider the last column of this sequence data set, it would look similar to a market basket data where every row represents a transaction containing a set of events (items). The traditional concept of association patterns in this data would correspond

tocommonco-occurrencesofeventsacrosstransactions.However,asequencedatasetalsocontainsinformationabouttheobjectandthetimestampofatransactionofeventsinthefirsttwocolumns.Thesecolumnsaddcontexttoeverytransaction,whichenablesadifferentstyleofassociationanalysisforsequencedatasets.Theright-handsideofFigure6.3 showsadifferentrepresentationofthesequencedatasetwheretheeventsassociatedwitheveryobjectappeartogether,sortedinincreasingorderoftheirtimestamps.Inasequencedataset,wecanlookforassociationpatternsofeventsthatcommonlyoccurinasequentialorderacrossobjects.Forexample,inthesequencedatashowninFigure6.3 ,event6isfollowedbyevent1inallofthesequences.Notethatsuchapatterncannotbeinferredifwetreatthisasamarketbasketdatabyignoringinformationabouttheobjectandtimestamp.

Figure6.3.Exampleofasequencedatabase.

Beforepresentingamethodologyforfindingsequentialpatterns,weprovideabriefdescriptionofsequencesandsubsequences.

Sequences

Generally speaking, a sequence is an ordered list of elements (transactions). A sequence can be denoted as s = ⟨e1 e2 e3 … en⟩, where each element ej is a collection of one or more events (items), i.e., ej = {i1, i2, …, ik}. The following is a list of examples of sequences:

Sequenceofwebpagesviewedbyawebsitevisitor:

⟨{Homepage}{Electronics}{CamerasandCamcorders}{DigitalCameras}{ShoppingCart}{OrderConfirmation}{ReturntoShopping}⟩SequenceofeventsleadingtothenuclearaccidentatThree-MileIsland:

⟨{cloggedresin}{outletvalveclosure}{lossoffeedwater}{condenserpolisheroutletvalveshut}{boosterpumpstrip}{mainwaterpumptrips}{mainturbinetrips}{reactorpressureincreases}⟩Sequenceofclassestakenbyacomputersciencemajorstudentindifferentsemesters:

⟨{AlgorithmsandDataStructures,IntroductiontoOperatingSystems}{DatabaseSystems,ComputerArchitecture}{ComputerNetworks,SoftwareEngineering}{ComputerGraphics,ParallelProgramming}⟩

A sequence can be characterized by its length and the number of occurring events. The length of a sequence corresponds to the number of elements present in the sequence, while we refer to a sequence that contains k events as a k-sequence. The web sequence in the previous example contains 7 elements and 7 events; the event sequence at Three-Mile Island contains 8 elements and 8 events; and the class sequence contains 4 elements and 8 events.

Figure6.4 providesexamplesofsequences,elements,andeventsdefinedforavarietyofapplicationdomains.Exceptforthelastrow,theordinalattributeassociatedwitheachofthefirstthreedomainscorrespondstocalendartime.Forthelastrow,theordinalattributecorrespondstothelocationofthebases(A,C,G,T)inthegenesequence.Althoughthediscussiononsequentialpatternsisprimarilyfocusedontemporalevents,itcanbeextendedtothecasewheretheeventshavenon-temporalordering,suchasspatialordering.

Figure6.4.Examplesofelementsandeventsinsequencedatasets.

Subsequences

A sequence t is a subsequence of another sequence s if it is possible to derive t from s by simply deleting some events from elements in s or even deleting some elements in s completely. Formally, the sequence t = ⟨t1 t2 … tm⟩ is a subsequence of s = ⟨s1 s2 … sn⟩ if there exist integers 1 ≤ j1 < j2 < ⋯ < jm ≤ n such that t1 ⊆ sj1, t2 ⊆ sj2, …, tm ⊆ sjm. If t is a subsequence of s, then we say that t is contained in s. Table 6.7 gives examples illustrating the idea of subsequences for various sequences.

Table 6.7. Examples illustrating the concept of a subsequence.

Sequence, s                  Sequence, t          Is t a subsequence of s?

⟨{2,4} {3,5,6} {8}⟩          ⟨{2} {3,6} {8}⟩      Yes

⟨{2,4} {3,5,6} {8}⟩          ⟨{2} {8}⟩            Yes

⟨{1,2} {3,4}⟩                ⟨{1} {2}⟩            No

⟨{2,4} {2,4} {2,5}⟩          ⟨{2} {4}⟩            Yes

6.4.2 Sequential Pattern Discovery

Let D be a data set that contains one or more data sequences. The term data sequence refers to an ordered list of elements associated with a single data object. For example, the data set shown in Figure 6.5 contains five data sequences, one for each object A, B, C, D, and E.

Figure6.5.Sequentialpatternsderivedfromadatasetthatcontainsfivedatasequences.

The support of a sequence s is the fraction of all data sequences that contain s. If the support for s is greater than or equal to a user-specified threshold minsup, then s is declared to be a sequential pattern (or frequent sequence).

Definition 6.1 (Sequential Pattern Discovery). Given a sequence data set D and a user-specified minimum support threshold minsup, the task of sequential pattern discovery is to find all sequences with support ≥ minsup.

In Figure 6.5, the support for the sequence ⟨{1}{2}⟩ is equal to 80% because it is contained in four of the five data sequences (every object except for D). Assuming that the minimum support threshold is 50%, any sequence that is contained in at least three data sequences is considered to be a sequential pattern. Examples of sequential patterns extracted from the given data set include ⟨{1}{2}⟩, ⟨{1,2}⟩, ⟨{2,3}⟩, ⟨{1,2}{2,3}⟩, etc.

Sequential pattern discovery is a computationally challenging task because the set of all possible sequences that can be generated from a collection of events is exponentially large and difficult to enumerate. For example, a collection of n events can result in the following examples of 1-sequences, 2-sequences, and 3-sequences:

1-sequences: ⟨i1⟩, ⟨i2⟩, …, ⟨in⟩

2-sequences: ⟨{i1,i2}⟩, ⟨{i1,i3}⟩, …, ⟨{in−1,in}⟩, …, ⟨{i1}{i1}⟩, ⟨{i1}{i2}⟩, …, ⟨{in}{in}⟩

3-sequences: ⟨{i1,i2,i3}⟩, ⟨{i1,i2,i4}⟩, …, ⟨{in−2,in−1,in}⟩, …, ⟨{i1}{i1,i2}⟩, ⟨{i1}{i1,i3}⟩, …, ⟨{in−1}{in−1,in}⟩, …, ⟨{i1,i2}{i2}⟩, ⟨{i1,i2}{i3}⟩, …, ⟨{in−1,in}{in}⟩, …, ⟨{i1}{i1}{i1}⟩, ⟨{i1}{i1}{i2}⟩, …, ⟨{in}{in}{in}⟩

The above enumeration is similar in some ways to the itemset lattice introduced in Chapter 5 for market basket data. However, note that the above enumeration is not exhaustive; it only shows some sequences and omits a large number of remaining ones by the use of ellipses (…). This is because the number of candidate sequences is substantially larger than the number of candidate itemsets, which makes their enumeration difficult. There are three reasons for the additional number of candidate sequences:

1. An item can appear at most once in an itemset, but an event can appear more than once in a sequence, in different elements of the

sequence.Forexample,givenapairofitems, and ,onlyonecandidate2-itemset, ,canbegenerated.Incontrast,therearemanycandidate2-sequencesthatcanbegeneratedusingonlytwoevents: ,and .

2. Ordermattersinsequences,butnotforitemsets.Forexample,and referstothesameitemset,whereas ,and correspondtodifferentsequences,andthusmustbegeneratedseparately.

3. Formarketbasketdata,thenumberofdistinctitemsnputsanupperboundonthenumberofcandidateitemsets ,whereasforsequencedata,eventwoeventsaandbcanleadtoinfinitelymanycandidatesequences(seeFigure6.6 foranillustration).

Figure6.6.Comparingthenumberofitemsetswiththenumberofsequencesgeneratedusingtwoevents(items).Weonlyshow1-sequences,2-sequences,and3-sequencesforillustration.

Becauseoftheabovereasons,itischallengingtocreateasequencelatticethatenumeratesallpossiblesequencesevenwhenthenumberofeventsinthedataissmall.Itisthusdifficulttouseabrute-forceapproachfor

i1 i2{i1,i2}

⟨{i1}{i1}⟩,⟨{i1}{i2}⟩,⟨{i2}{i1}⟩,⟨{i2}{i2}⟩ ⟨{i1,i2}⟩{i1,i2}

{i2,i1} ⟨{i1}{i2}⟩,⟨{i2}{i1}⟩⟨{i1,i2}⟩

(2n−1)

generatingsequentialpatternsthatenumeratesallpossiblesequencesbytraversingthesequencelattice.Despitethesechallenges,theAprioriprinciplestillholdsforsequentialdatabecauseanydatasequencethatcontainsaparticulark-sequencemustalsocontainallofits -subsequences.Aswewillseelater,eventhoughitischallengingtoconstructthesequencelattice,itispossibletogeneratecandidatek-sequencesfromthefrequent -sequencesusingtheAprioriprinciple.ThisallowsustoextractsequentialpatternsfromasequencedatasetusinganApriori-likealgorithm.ThebasicstructureofthisalgorithmisshowninAlgorithm6.1 .

Algorithm 6.1 Apriori-like algorithm for sequential pattern discovery.

1: k = 1.
2: Fk = { i | i ∈ I ∧ σ({i})/N ≥ minsup }. {Find all frequent 1-subsequences.}
3: repeat
4:   k = k + 1.
5:   Ck = candidate-gen(Fk−1). {Generate candidate k-subsequences.}
6:   Ck = candidate-prune(Ck, Fk−1). {Prune candidate k-subsequences.}
7:   for each data sequence t ∈ T do
8:     Ct = subsequence(Ck, t). {Identify all candidates contained in t.}
9:     for each candidate k-subsequence c ∈ Ct do
10:      σ(c) = σ(c) + 1. {Increment the support count.}
11:    end for
12:  end for
13:  Fk = { c | c ∈ Ck ∧ σ(c)/N ≥ minsup }. {Extract the frequent k-subsequences.}
14: until Fk = ∅
15: Answer = ∪ Fk.

Notice that the structure of the algorithm is almost identical to the Apriori algorithm for frequent itemset discovery, presented in the previous chapter. The algorithm would iteratively generate new candidate k-sequences, prune candidates whose (k−1)-sequences are infrequent, and then count the supports of the remaining candidates to identify the sequential patterns. The detailed aspects of these steps are given next.

Candidate Generation

We generate candidate k-sequences by merging a pair of frequent (k−1)-sequences. Although this approach is similar to the Fk−1 × Fk−1 strategy introduced in Chapter 5 for generating candidate itemsets, there are certain differences. First, in the case of generating sequences, we can merge a (k−1)-sequence with itself to produce a k-sequence, since an event can appear more than once in a sequence. For example, we can merge the 1-sequence ⟨a⟩ with itself to produce a candidate 2-sequence, ⟨{a}{a}⟩. Second, recall that in order to avoid generating duplicate candidates, the traditional Apriori algorithm merges a pair of frequent k-itemsets only if their first k−1 items, arranged in lexicographic order, are identical. In the case of generating sequences, we still use the lexicographic order for arranging events within an element. However, the arrangement of elements in a sequence may not follow the lexicographic order. For example, ⟨{b,c}{a}{d}⟩ is a viable representation of a 4-sequence, even though the elements in the sequence are not arranged according to their lexicographic ranks. On the other hand, ⟨{c,b}{a}{d}⟩ is not a viable representation of the same 4-sequence, since the events in the first element violate the lexicographic order.

Given a sequence s = ⟨e1 e2 e3 … en⟩, where the events in every element are arranged lexicographically, we can refer to the first event of e1 as the first event of s and the last event of en as the last event of s. The criteria for merging sequences can then be stated in the form of the following procedure.

Sequence Merging Procedure

A sequence s(1) is merged with another sequence s(2) only if the subsequence obtained by dropping the first event in s(1) is identical to the subsequence obtained by dropping the last event in s(2). The resulting candidate is given by extending the sequence s(1) as follows:

1. If the last element of s(2) has only one event, append the last element of s(2) to the end of s(1) and obtain the merged sequence.

2. If the last element of s(2) has more than one event, append the last event from the last element of s(2) (that is absent in the last element of s(1)) to the last element of s(1) and obtain the merged sequence.

Figure 6.7 illustrates examples of candidate 4-sequences obtained by merging pairs of frequent 3-sequences. The first candidate, ⟨{1}{2}{3}{4}⟩, is obtained by merging ⟨{1}{2}{3}⟩ with ⟨{2}{3}{4}⟩. Since the last element of the second sequence ({4}) contains only one event, it is simply appended to the first sequence to generate the merged sequence. On the other hand, merging ⟨{1}{5}{3}⟩ with ⟨{5}{3,4}⟩ produces the candidate 4-sequence ⟨{1}{5}{3,4}⟩. In this case, the last element of the second sequence ({3,4}) contains two events. Hence, the last event in this element (4) is added to the last element of the first sequence ({3}) to obtain the merged sequence.

Figure 6.7. Example of the candidate generation and pruning steps of a sequential pattern mining algorithm.

It can be shown that the sequence merging procedure is complete, i.e., it generates every frequent k-sequence. This is because every frequent k-sequence s includes a frequent (k−1)-sequence s1 that does not contain the first event of s, and a frequent (k−1)-sequence s2 that does not contain the last event of s. Since s1 and s2 are frequent and follow the criteria for merging sequences, they will be merged to produce every frequent k-sequence s as one of the candidates. Also, the sequence merging procedure ensures that there is a unique way of generating s, only by merging s1 and s2. For example, in Figure 6.7, the sequences ⟨{1}{2}{3}⟩ and ⟨{1}{2,5}⟩ do not have to be merged because removing the first event from the first sequence does not give the same subsequence as removing the last event from the second sequence. Although ⟨{1}{2,5}{3}⟩ is a viable candidate, it is generated by merging a different pair of sequences, ⟨{1}{2,5}⟩ and ⟨{2,5}{3}⟩. This example illustrates that the sequence merging procedure does not generate duplicate candidate sequences.
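
The merging criterion and the two extension rules above can be written down directly. The following Python sketch is our own illustration (sequences are lists of event lists, with events inside an element kept in lexicographic order); it returns the merged candidate, or None when the pair does not satisfy the merging criterion, and the examples reproduce the merges discussed for Figure 6.7.

def drop_first_event(seq):
    """Sequence obtained by deleting the first event of the first element."""
    head, rest = seq[0], seq[1:]
    return rest if len(head) == 1 else [head[1:]] + rest

def drop_last_event(seq):
    """Sequence obtained by deleting the last event of the last element."""
    body, tail = seq[:-1], seq[-1]
    return body if len(tail) == 1 else body + [tail[:-1]]

def merge(s1, s2):
    """Merge two frequent (k-1)-sequences into a candidate k-sequence."""
    if drop_first_event(s1) != drop_last_event(s2):
        return None                                  # merging criterion not met
    if len(s2[-1]) == 1:
        # Rule 1: the last element of s2 has a single event; append it as a new element.
        return s1 + [s2[-1]]
    # Rule 2: the last element of s2 has several events; append its last event
    # to the last element of s1.
    return s1[:-1] + [s1[-1] + [s2[-1][-1]]]

# Examples from Figure 6.7
print(merge([[1], [2], [3]], [[2], [3], [4]]))   # [[1], [2], [3], [4]]
print(merge([[1], [5], [3]], [[5], [3, 4]]))     # [[1], [5], [3, 4]]
print(merge([[1], [2], [3]], [[1], [2, 5]]))     # None (criterion not met)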

Candidate Pruning

A candidate k-sequence is pruned if at least one of its (k−1)-sequences is infrequent. For example, consider the candidate 4-sequence ⟨{1}{2}{3}{4}⟩. We need to check if any of the 3-sequences contained in this 4-sequence is infrequent. Since the sequence ⟨{1}{2}{4}⟩ is contained in this sequence and is infrequent, the candidate ⟨{1}{2}{3}{4}⟩ can be eliminated. Readers should be able to verify that the only candidate 4-sequence that survives the candidate pruning step in Figure 6.7 is ⟨{1}{2,5}{3}⟩.
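
A sketch of the pruning step, again our own illustration and ignoring timing constraints for now: every (k−1)-subsequence of a candidate k-sequence is obtained by deleting exactly one event, so it suffices to enumerate those deletions and look each one up among the frequent (k−1)-sequences.

def delete_one_event(seq):
    """All (k-1)-sequences obtained by deleting exactly one event from seq."""
    results = []
    for i, element in enumerate(seq):
        for ev in element:
            shrunk = [e for e in element if e != ev]
            # drop the element entirely if it becomes empty
            results.append(seq[:i] + ([shrunk] if shrunk else []) + seq[i + 1:])
    return results

def candidate_prune(candidates, frequent_k_minus_1):
    """Keep only candidates whose one-event-deleted subsequences are all frequent."""
    frequent = {tuple(map(tuple, s)) for s in frequent_k_minus_1}
    return [c for c in candidates
            if all(tuple(map(tuple, sub)) in frequent for sub in delete_one_event(c))]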

Support Counting

During support counting, the algorithm identifies all candidate k-sequences belonging to a particular data sequence and increments their support counts. After performing this step for each data sequence, the algorithm identifies the frequent k-sequences and discards all candidate sequences whose support values are less than the minsup threshold.
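
Putting the three steps together, a minimal driver in the spirit of Algorithm 6.1 might look as follows. This is a sketch under the same assumptions as the previous snippets (it reuses is_subsequence, merge, and candidate_prune from above, represents sequences as lists of sorted event lists, and ignores timing constraints); it is not the book's reference implementation.

def apriori_sequential(data_sequences, minsup):
    """Return all sequential patterns with support >= minsup (a fraction)."""
    n = len(data_sequences)
    events = sorted({ev for seq in data_sequences for elem in seq for ev in elem})
    # Line 2: frequent 1-sequences.
    fk = [[[e]] for e in events
          if sum(is_subsequence([[e]], d) for d in data_sequences) / n >= minsup]
    patterns = list(fk)
    while fk:                                            # Lines 3-14
        candidates = []
        for s1 in fk:                                    # Line 5: candidate generation
            for s2 in fk:
                merged = merge(s1, s2)
                if merged is not None and merged not in candidates:
                    candidates.append(merged)
        candidates = candidate_prune(candidates, fk)     # Line 6: candidate pruning
        fk = [c for c in candidates                      # Lines 7-13: support counting
              if sum(is_subsequence(c, d) for d in data_sequences) / n >= minsup]
        patterns.extend(fk)
    return patterns                                      # Line 15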

6.4.3 Timing Constraints*

This section presents a sequential pattern formulation where timing constraints are imposed on the events and elements of a pattern. To motivate the need for timing constraints, consider the following sequence of courses taken by two students who enrolled in a data mining class:

Student A: ⟨{Statistics} {Database Systems} {Data Mining}⟩.
Student B: ⟨{Database Systems} {Statistics} {Data Mining}⟩.

The sequential pattern of interest is ⟨{Statistics, Database Systems} {Data Mining}⟩, which means that students who are enrolled in the data mining class must have previously taken a course in statistics and database systems. Clearly, the pattern is supported by both students even though they do not take statistics and database systems at the same time. In contrast, a student who took a statistics course ten years earlier should not be considered as supporting the pattern because the time gap between the courses is too long. Because the formulation presented in the previous section does not incorporate these timing constraints, a new sequential pattern definition is needed.

Figure 6.8 illustrates some of the timing constraints that can be imposed on a pattern. The definition of these constraints and the impact they have on sequential pattern discovery algorithms will be discussed in the following sections. Note that each element of the sequential pattern is associated with a time window [l, u], where l is the earliest occurrence of an event within the time window and u is the latest occurrence of an event within the time window. Note that in this discussion, we allow events within an element to occur at different times. Hence, the actual timing of the event occurrence may not be the same as the lexicographic ordering.

Figure 6.8. Timing constraints of a sequential pattern.

The maxspan Constraint The maxspan constraint specifies the maximum allowed time difference between the latest and the earliest occurrences of events in the entire sequence. For example, suppose the following data sequences contain elements that occur at consecutive timestamps (1, 2, 3, …), i.e., the ith element in the sequence occurs at the ith timestamp. Assuming that maxspan = 3, the following table contains sequential patterns that are supported and not supported by a given data sequence.

Data Sequence, s                  Sequential Pattern, t    Does s support t?
⟨{1,3} {3,4} {4} {5} {6,7} {8}⟩   ⟨{3} {4}⟩                Yes
⟨{1,3} {3,4} {4} {5} {6,7} {8}⟩   ⟨{3} {6}⟩                Yes
⟨{1,3} {3,4} {4} {5} {6,7} {8}⟩   ⟨{1,3} {6}⟩              No

In general, the longer the maxspan, the more likely it is to detect a pattern in a data sequence. However, a longer maxspan can also capture spurious patterns because it increases the chance for two unrelated events to be temporally related. In addition, the pattern may involve events that are already obsolete.

The maxspan constraint affects the support counting step of sequential pattern discovery algorithms. As shown in the preceding examples, some data sequences no longer support a candidate pattern when the maxspan constraint is imposed. If we simply apply Algorithm 6.1, the support counts for some patterns may be overestimated. To avoid this problem, the algorithm must be modified to ignore cases where the interval between the first and last occurrences of events in a given pattern is greater than maxspan.

The mingap and maxgap Constraints Timing constraints can also be specified to restrict the time difference between two consecutive elements of a sequence. If the maximum time difference (maxgap) is one week, then events in one element must occur within a week's time of the events occurring in the previous element. If the minimum time difference (mingap) is zero, then events in one element must occur after the events occurring in the previous element. (See Figure 6.8.) The following table shows examples of patterns that pass or fail the maxgap and mingap constraints, assuming that maxgap = 3 and mingap = 1. These examples assume each element occurs at consecutive time steps.

Data Sequence, s                  Sequential Pattern, t    maxgap    mingap
⟨{1,3} {3,4} {4} {5} {6,7} {8}⟩   ⟨{3} {6}⟩                Pass      Pass
⟨{1,3} {3,4} {4} {5} {6,7} {8}⟩   ⟨{6} {8}⟩                Pass      Fail
⟨{1,3} {3,4} {4} {5} {6,7} {8}⟩   ⟨{1,3} {6}⟩              Fail      Pass
⟨{1,3} {3,4} {4} {5} {6,7} {8}⟩   ⟨{1} {3} {8}⟩            Fail      Fail

As with maxspan, these constraints will affect the support counting step of sequential pattern discovery algorithms because some data sequences no longer support a candidate pattern when mingap and maxgap constraints are present. These algorithms must be modified to ensure that the timing constraints are not violated when counting the support of a pattern. Otherwise, some infrequent sequences may mistakenly be declared as frequent patterns.

A side effect of using the maxgap constraint is that the Apriori principle might be violated. To illustrate this, consider the data set shown in Figure 6.5. Without mingap or maxgap constraints, the support for ⟨{2} {5}⟩ and ⟨{2} {3} {5}⟩ are both equal to 60%. However, if mingap = 0 and maxgap = 1, then the support for ⟨{2} {5}⟩ reduces to 40%, while the support for ⟨{2} {3} {5}⟩ is still 60%. In other words, support has increased when the number of events in a sequence increases, which contradicts the Apriori principle. The violation occurs because the object D does not support the pattern ⟨{2} {5}⟩ since the time gap between events 2 and 5 is greater than maxgap. This problem can be avoided by using the concept of a contiguous subsequence.

Definition 6.2 (Contiguous Subsequence). A sequence s is a contiguous subsequence of w = ⟨e1 e2 … ek⟩ if any one of the following conditions hold:

1. s is obtained from w after deleting an event from either e1 or ek,

2. s is obtained from w after deleting an event from any element ei ∈ w that contains at least two events, or

3. s is a contiguous subsequence of t and t is a contiguous subsequence of w.

The following examples illustrate the concept of a contiguous subsequence:

Data Sequence, s             Sequential Pattern, t    Is t a contiguous subsequence of s?
⟨{1} {2,3}⟩                  ⟨{1} {2}⟩                Yes
⟨{1,2} {2} {3}⟩              ⟨{1} {2}⟩                Yes
⟨{3,4} {1,2} {2,3} {4}⟩      ⟨{1} {2}⟩                Yes
⟨{1} {3} {2}⟩                ⟨{1} {2}⟩                No
⟨{1,2} {1} {3} {2}⟩          ⟨{1} {2}⟩                No

Using the concept of contiguous subsequences, the Apriori principle can be modified to handle maxgap constraints in the following way.

Definition 6.3 (Modified Apriori Principle). If a k-sequence is frequent, then all of its contiguous (k−1)-subsequences must also be frequent.
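
Definition 6.2 can be checked directly, if inefficiently, by applying the allowed single-event deletions to w until t is reached. The sketch below is our own illustration: sequences are tuples of frozensets so the recursion can be memoized, and we interpret condition 1 as also removing the first or last element when its only event is deleted.

from functools import lru_cache

def _deletions(w):
    """Sequences reachable from w by one deletion allowed by Definition 6.2."""
    out = []
    for i, elem in enumerate(w):
        # condition 1: delete from the first or last element;
        # condition 2: delete from any element that has at least two events
        if i in (0, len(w) - 1) or len(elem) >= 2:
            for ev in elem:
                rest = elem - {ev}
                out.append(w[:i] + ((rest,) if rest else ()) + w[i + 1:])
    return out

@lru_cache(maxsize=None)
def is_contiguous_subsequence(t, w):
    """Condition 3: t is reachable from w by a chain of allowed deletions."""
    if t == w:
        return True
    if sum(map(len, t)) >= sum(map(len, w)):
        return False
    return any(is_contiguous_subsequence(t, v) for v in _deletions(w))

def seq(*elems):
    return tuple(frozenset(e) for e in elems)

# Examples from the table above
print(is_contiguous_subsequence(seq({1}, {2}), seq({1}, {2, 3})))      # True (Yes)
print(is_contiguous_subsequence(seq({1}, {2}), seq({1}, {3}, {2})))    # False (No)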

The modified Apriori principle can be applied to the sequential pattern discovery algorithm with minor modifications. During candidate pruning, not all k-sequences need to be verified since some of them may violate the maxgap constraint. For example, if maxgap = 1, it is not necessary to check whether the subsequence ⟨{1} {2,3} {5}⟩ of the candidate ⟨{1} {2,3} {4} {5}⟩ is frequent since the time difference between elements {2,3} and {5} is greater than one time unit. Instead, only the contiguous subsequences of ⟨{1} {2,3} {4} {5}⟩ need to be examined. These subsequences include ⟨{1} {2,3} {4}⟩, ⟨{2,3} {4} {5}⟩, ⟨{1} {2} {4} {5}⟩, and ⟨{1} {3} {4} {5}⟩.

The Window Size Constraint Finally, events within an element sj do not have to occur at the same time. A window size threshold (ws) can be defined to specify the maximum allowed time difference between the latest and earliest occurrences of events in any element of a sequential pattern. A window size of 0 means all events in the same element of a pattern must occur simultaneously.

The following example uses ws = 2 to determine whether a data sequence supports a given sequence (assuming mingap = 0, maxgap = 3, and maxspan = ∞).

Data Sequence, s                  Sequential Pattern, t     Does s support t?
⟨{1,3} {3,4} {4} {5} {6,7} {8}⟩   ⟨{3,4} {5}⟩               Yes
⟨{1,3} {3,4} {4} {5} {6,7} {8}⟩   ⟨{4,6} {8}⟩               Yes
⟨{1,3} {3,4} {4} {5} {6,7} {8}⟩   ⟨{3,4,6} {8}⟩             No
⟨{1,3} {3,4} {4} {5} {6,7} {8}⟩   ⟨{1,3,4} {6,7,8}⟩         No

In the last example, although the pattern ⟨{1,3,4} {6,7,8}⟩ satisfies the window size constraint, it violates the maxgap constraint because the maximum time difference between events in the two elements is 5 units. The window size constraint also affects the support counting step of sequential pattern discovery algorithms. If Algorithm 6.1 is applied without imposing the window size constraint, the support counts for some of the candidate patterns might be underestimated, and thus some interesting patterns may be lost.
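
As a concrete, brute-force illustration of how the four constraints interact, the following sketch (ours, not the book's algorithm) checks whether a data sequence supports a pattern. The data sequence is given as a timeline mapping each event to the timestamps at which it occurs; mingap is treated as a strict lower bound on the gap between consecutive elements, maxgap as the largest allowed difference between the latest event of an element and the earliest event of the previous element, and maxspan is measured over the whole occurrence. The two calls at the end reproduce rows from the window size table above.

import itertools

def element_windows(element, timeline, ws):
    """All (l, u) time windows in which every event of the element occurs, with u - l <= ws."""
    choices = [timeline.get(ev, []) for ev in element]
    windows = set()
    for combo in itertools.product(*choices):
        if combo and max(combo) - min(combo) <= ws:
            windows.add((min(combo), max(combo)))
    return sorted(windows)

def supports(timeline, pattern, ws=0, mingap=0, maxgap=float("inf"), maxspan=float("inf")):
    """True if some occurrence of the pattern satisfies all timing constraints."""
    per_element = [element_windows(e, timeline, ws) for e in pattern]
    if any(not w for w in per_element):
        return False
    for combo in itertools.product(*per_element):
        gaps_ok = all(combo[i + 1][0] - combo[i][1] > mingap and      # mingap between elements
                      combo[i + 1][1] - combo[i][0] <= maxgap         # maxgap between elements
                      for i in range(len(combo) - 1))
        if gaps_ok and combo[-1][1] - combo[0][0] <= maxspan:         # maxspan over the pattern
            return True
    return False

# Data sequence <{1,3}{3,4}{4}{5}{6,7}{8}> at timestamps 1..6, written as a timeline.
timeline = {1: [1], 3: [1, 2], 4: [2, 3], 5: [4], 6: [5], 7: [5], 8: [6]}
print(supports(timeline, [[4, 6], [8]], ws=2, mingap=0, maxgap=3))      # True  (Yes above)
print(supports(timeline, [[3, 4, 6], [8]], ws=2, mingap=0, maxgap=3))   # False (No above)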

6.4.4 Alternative Counting Schemes*

There are multiple ways of counting the support of a sequence given a data sequence. For example, if our database involves long sequences of events, we might be interested in finding subsequences that occur multiple times in the same data sequence. Hence, instead of counting the support of a subsequence as the number of data sequences it is contained in, we can also take into account the number of times a subsequence is contained in a data sequence. This viewpoint gives rise to several different formulations for counting the support of a candidate k-sequence from a database of sequences. For illustrative purposes, consider the problem of counting the support for the sequence ⟨{p}{q}⟩, as shown in Figure 6.9. Assume that ws = 0, mingap = 0, maxgap = 2, and maxspan = 2.

Figure 6.9. Comparing different support counting methods.

COBJ: One occurrence per object.

This method looks for at least one occurrence of a given sequence in an object's timeline. In Figure 6.9, even though the sequence ⟨{p}{q}⟩ appears several times in the object's timeline, it is counted only once, with p occurring at t=1 and q occurring at t=3.

CWIN: One occurrence per sliding window.

In this approach, a sliding time window of fixed length (maxspan) is moved across an object's timeline, one unit at a time. The support count is incremented each time the sequence is encountered in the sliding window. In Figure 6.9, the sequence ⟨{p}{q}⟩ is observed six times using this method.

CMINWIN: Number of minimal windows of occurrence.

A minimal window of occurrence is the smallest window in which the sequence occurs given the timing constraints. In other words, a minimal window is the time interval such that the sequence occurs in that time interval, but it does not occur in any of the proper subintervals of it. This definition can be considered as a restrictive version of CWIN, because its effect is to shrink and collapse some of the windows that are counted by CWIN. For example, sequence ⟨{p}{q}⟩ has four minimal window occurrences: (1) the pair (p: t=2, q: t=3), (2) the pair (p: t=3, q: t=4), (3) the pair (p: t=5, q: t=6), and (4) the pair (p: t=6, q: t=7). The occurrence of event p at t=1 and event q at t=3 is not a minimal window occurrence because it contains a smaller window with (p: t=2, q: t=3), which is indeed a minimal window of occurrence.

CDIST_O: Distinct occurrences with possibility of event-timestamp overlap.

A distinct occurrence of a sequence is defined to be the set of event-timestamp pairs such that there has to be at least one new event-timestamp pair that is different from a previously counted occurrence. Counting all such distinct occurrences results in the CDIST_O method. If the occurrence time of events p and q is denoted as a tuple (t(p), t(q)), then this method yields eight distinct occurrences of sequence ⟨{p}{q}⟩ at times (1,3), (2,3), (2,4), (3,4), (3,5), (5,6), (5,7), and (6,7).

CDIST: Distinct occurrences with no event-timestamp overlap allowed.

In CDIST_O above, two occurrences of a sequence were allowed to have overlapping event-timestamp pairs, e.g., the overlap between (1,3) and (2,3). In the CDIST method, no overlap is allowed. Effectively, when an event-timestamp pair is considered for counting, it is marked as used and is never used again for subsequent counting of the same sequence. As an example, there are five distinct, non-overlapping occurrences of the sequence ⟨{p}{q}⟩ in the diagram shown in Figure 6.9. These occurrences happen at times (1,3), (2,4), (3,5), (5,6), and (6,7). Observe that these occurrences are subsets of the occurrences observed in CDIST_O.

One final point regarding the counting methods is the need to determine the baseline for computing the support measure. For frequent itemset mining, the baseline is given by the total number of transactions. For sequential pattern mining, the baseline depends on the counting method used. For the COBJ method, the total number of objects in the input data can be used as the baseline. For the CWIN and CMINWIN methods, the baseline is given by the sum of the number of time windows possible in all objects. For methods such as CDIST and CDIST_O, the baseline is given by the sum of the number of distinct timestamps present in the input data of each object.
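
For the two-event pattern ⟨{p}{q}⟩, the distinct-occurrence counts are easy to reproduce. The sketch below is our own illustration: CDIST_O enumerates all (t(p), t(q)) pairs that satisfy the gap constraints (for a two-element pattern of single events, maxspan coincides with maxgap), and CDIST greedily keeps pairs whose timestamps have not been used before. The timeline is hypothetical, chosen only to be consistent with the occurrence times quoted above, since Figure 6.9 itself is not reproduced here.

def cdist_o(p_times, q_times, mingap=0, maxgap=float("inf")):
    """All distinct (t(p), t(q)) occurrences of the 2-sequence <{p}{q}>."""
    return [(tp, tq) for tp in p_times for tq in q_times
            if mingap < tq - tp <= maxgap]

def cdist(p_times, q_times, mingap=0, maxgap=float("inf")):
    """Greedy selection of occurrences that share no event-timestamp pair."""
    used_p, used_q, chosen = set(), set(), []
    for tp, tq in sorted(cdist_o(p_times, q_times, mingap, maxgap)):
        if tp not in used_p and tq not in used_q:
            chosen.append((tp, tq))
            used_p.add(tp)
            used_q.add(tq)
    return chosen

# Hypothetical timeline consistent with the occurrence times quoted above:
# p occurs at t = 1, 2, 3, 5, 6 and q occurs at t = 3, 4, 5, 6, 7.
p_times, q_times = [1, 2, 3, 5, 6], [3, 4, 5, 6, 7]
print(len(cdist_o(p_times, q_times, maxgap=2)))   # 8 occurrences (CDIST_O)
print(cdist(p_times, q_times, maxgap=2))          # [(1, 3), (2, 4), (3, 5), (5, 6), (6, 7)]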

6.5 Subgraph Patterns

This section describes the application of association analysis methods to graphs, which are more complex entities than itemsets and sequences. A number of entities such as chemical compounds, 3-D protein structures, computer networks, and tree structured XML documents can be modeled using a graph representation, as shown in Table 6.8.

Table 6.8. Graph representation of entities in various application domains.

Application                Graphs                                 Vertices                 Edges
Web mining                 Collection of inter-linked Web pages   Web pages                Hyperlink between pages
Computational chemistry    Chemical compounds                     Atoms or ions            Bond between atoms or ions
Computer security          Computer networks                      Computers and servers    Interconnection between machines
Semantic Web               XML documents                          XML elements             Parent-child relationship between elements
Bioinformatics             3-D Protein structures                 Amino acids              Contact residue

A useful data mining task to perform on this type of data is to derive a set of frequently occurring substructures in a collection of graphs. Such a task is known as frequent subgraph mining. A potential application of frequent subgraph mining can be seen in the context of computational chemistry. Each year, new chemical compounds are designed for the development of pharmaceutical drugs, pesticides, fertilizers, etc. Although the structure of a compound is known to play a major role in determining its chemical properties, it is difficult to establish their exact relationship. Frequent subgraph mining can aid this undertaking by identifying the substructures commonly associated with certain properties of known compounds. Such information can help scientists to develop new chemical compounds that have certain desired properties.

This section presents a methodology for applying association analysis to graph-based data. The section begins with a review of some of the basic graph-related concepts and definitions. The frequent subgraph mining problem is then introduced, followed by a description of how the traditional Apriori algorithm can be extended to discover such patterns.

6.5.1 Preliminaries

Graphs A graph is a data structure that can be used to represent relationships among a set of entities. Mathematically, a graph G = (V, E) is composed of a vertex set V and a set of edges E connecting pairs of vertices. Each edge is denoted by a vertex pair (vi, vj), where vi, vj ∈ V. A label can be assigned to each vertex vi representing the name of an entity. Similarly, each edge (vi, vj) can also be associated with a label describing the relationship between a pair of entities. Table 6.8 shows the ve